My thoughts on this week’s debate

By Matt Cutts

Earlier this week I was on a search panel with Harry Shum of Bing and Rich Skrenta of Blekko, moderated by Vivek Wadhwa, and the video is now live. It’s forty minutes long, but it covers a lot of ground.

One big point of discussion is whether Bing copies Google’s search results. I’m going to try to address this earnestly; if snarky is what you want, Stephen Colbert will oblige you.

First off, let me say that I respect all the people at Bing. From engineers to evangelists, everyone that I’ve met from Microsoft has been thoughtful and sincere, and I truly believe they want to make a great search engine too. I know that they work really hard, and the last thing I would want to do is imply that Bing is purely piggybacking Google. I don’t believe that.

That said, I didn’t expect that Microsoft would deny the claims so strongly. Yusuf Mehdi’s post says “We do not copy results from any of our competitors. Period. Full stop.”

Given the strength of the “We do not copy Google’s results” statements, I think it’s fair to line up screenshots of the results on Google that later showed up on Bing:

[Seven pairs of screenshots, each showing a Google results page compared with the corresponding Bing results page.]

I think if you asked a regular person about these screenshots, Microsoft’s “We do not copy Google’s results” statement wouldn’t ring completely true.

Something I’ve heard smart people say is that this could be due to generalized clickstream processing rather than code that targets Google specifically. I’d love it if Microsoft would clarify that, but at least one example has surfaced in which Microsoft was targeting Google’s URLs specifically. The paper is titled Learning Phrase-Based Spelling Error Models from Clickthrough Data and here are some of the relevant parts:

“The clickthrough data of the second type consists of a set of query reformulation sessions extracted from 3 months of log files from a commercial Web browser [I assume this is Internet Explorer. --Matt] …. In our experiments, we “reverse-engineer” the parameters from the URLs of these [query reformulation] sessions, and deduce how each search engine encodes both a query and the fact that a user arrived at a URL by clicking on the spelling suggestion of the query – an important indication that the spelling suggestion is desired. From these three months of query reformulation sessions, we extracted about 3 million query-correction pairs.”

This paper very much sounds like Microsoft reverse-engineered which specific URL parameters on Google corresponded to a spelling correction. Figure 1 of that paper looks like Microsoft used specific Google URL parameters such as “spell=1” to extract spell corrections from Google. Targeting Google deliberately is quite different from using lots of clicks from different places. This is at least one concrete example of Microsoft taking browser data and using it to mine data deliberately and specifically from Google (in this case, the efforts of Google’s spell correction team).
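To make that concrete, here is a minimal sketch of what this kind of extraction could look like. It is an illustration only, not code from the paper: it assumes the query lives in a `q` parameter and that `spell=1` marks a page reached by clicking a spelling suggestion, and the function name, host, and sample URLs are hypothetical.

```python
from urllib.parse import urlparse, parse_qs


def extract_correction_pairs(session_urls):
    """Return (original_query, corrected_query) pairs from one browsing session.

    Assumptions (illustrative, not taken from the paper):
      - session_urls is an ordered list of URLs visited in a single session
      - the search query is carried in the `q` URL parameter
      - `spell=1` means the user clicked a spelling suggestion
    """
    pairs = []
    previous_query = None
    for url in session_urls:
        params = parse_qs(urlparse(url).query)
        query = params.get("q", [None])[0]
        if query is None:
            # Not a search URL under our assumptions; skip it.
            continue
        clicked_suggestion = params.get("spell") == ["1"]
        if clicked_suggestion and previous_query and previous_query != query:
            # The earlier query was reformulated via a spelling suggestion.
            pairs.append((previous_query, query))
        previous_query = query
    return pairs


# Hypothetical session: a misspelled query, then a click on the suggestion.
session = [
    "http://search.example.com/search?q=harry+poter",
    "http://search.example.com/search?q=harry+potter&spell=1",
]
print(extract_correction_pairs(session))
```

The point of the sketch is that nothing in it is general clickstream processing: it only works if you know how one particular search engine encodes a spelling-suggestion click in its URLs.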

That brings me to an issue that I raised with Bing during the search panel and afterwards with Harry Shum: disclosure. A while ago, my copy of Windows XP was auto-updated to IE8. Here’s one of the dialog boxes:

IE8 suggested sites

I don’t think an average consumer realizes that by saying “yes, show me suggested sites” they’re granting Microsoft permission to send their queries and clicks on Google to Microsoft, which will then be used in Bing’s ranking. I think my Mom would be confused that saying “Yes” to that dialog will send what she searches for on Google and what she clicks on to Microsoft. I don’t think that IE8’s disclosure is clear and conspicuous enough that a reasonable consumer could make an informed choice and know that IE8 will send their Google queries/clicks to Microsoft.

One comment that I’ve heard is that “it’s whiny for Google to complain about this.” I agree that’s a risk, but at the same time I think it’s important to go on the record about this.

Another comment that I’ve heard is that this affects only long-tail queries. As we said in our blog post, the whole reason we ran this test was that we thought this practice was happening for lots and lots of different queries, not simply rare queries. Rare queries were just the easiest way to verify our hypothesis. To me, what the experiment proved was that clicks on Google are being incorporated in Bing’s rankings. Microsoft is the company best able to say how much clicks on Google figure into Bing’s rankings, and I hope they clarify how much of an impact those clicks have.

Unfortunately, most of the reply has been along the lines of “this is only one of 1000 signals.” Nate Silver does a good job of tackling this, so I’ll quote him:

Microsoft’s defense boils down to this: Google results are just one of the many ingredients that we use. For two reasons, this argument is not necessarily convincing.

First, not all of the inputs are necessarily equal. It could be, for instance, that the Google results are weighted so heavily that they are as important as the other 999 inputs combined.

And it may also be that an even larger fraction of what creates value for Bing users are Google’s results. Bing might consider hundreds of other variables, but these might produce little overall improvement in the quality of its search, or might actually detract from it. (Microsoft might or might not recognize this, since measuring relevance is tricky: it could be that features that they think are improving the relevance of their results actually aren’t helping very much.)

Second, it is problematic for Microsoft to describe Google results as just one of many “signals and features”. Google results are not any ordinary kind of input; instead, they are more of a finished (albeit ever-evolving) product

Let’s take that thought to its conclusion. If clicks on Google really account for only 1/1000th (or some other trivial fraction) of Microsoft’s relevancy, why not just stop using those clicks and reduce the negative coverage and perception of this? And if Microsoft is unwilling to stop incorporating Google’s clicks in Bing’s rankings, doesn’t that argue that Google’s clicks account for much more than 1/1000th of Bing’s rankings?

I really did try to be calm and constructive in this post, so I apologize if some frustration came through despite that. My feelings on the search panel were definitely not feigned. Since people at Microsoft might not like this post, I want to reiterate that I know the people (especially the engineers) at Bing work incredibly hard to compete with Google, and I have huge respect for that. It’s because of how hard those engineers work that I think Microsoft should stop using clicks on Google in Bing’s rankings. If Bing does better on a search query than Google does, that’s fantastic. But an asterisk that says “we don’t know how much of this win came from Google” does a disservice to everyone. I think Bing’s engineers deserve to know that when they beat Google on a query, it’s due entirely to their hard work. Unless Microsoft changes its practices, there will always be a question mark.

If you want to dive into this topic even deeper, you can watch the full forty-minute video above.