Tag Archives: Matt Cutts

A rel=canonical corner case

by Matt Cutts

I answered an interesting rel=canonical question over email today and thought I’d blog about it. If you’re not familiar with rel=canonical, read these pages first. Then watch this video about rel=canonical vs. 301s, especially the second half:

[Video: rel=canonical vs. 301s]

Okay, I sometimes get a question about whether Google will always use the url from rel=canonical as the preferred url. The answer is that we take rel=canonical urls as a strong hint, but in some cases we won’t use them:
– For example, if we think you’re shooting yourself in the foot by accident (pointing a rel=canonical toward a non-existent/404 page), we’d reserve the right not to use the destination url you specify with rel=canonical.
– Another example where we might not go with your rel=canonical preference: if we think your website has been hacked and the hacker added a malicious rel=canonical. I recently tweeted about that case. On the “bright” side, if a hacker can control your website enough to insert a rel=canonical tag, they usually do far more malicious things like insert malware, hidden or malicious links/text, etc.

I wanted to talk today about another case in which we won’t use rel=canonical. First off, here’s a thought exercise: should Google trust rel=canonical if we see it in the body of the HTML? The answer is no, because some websites let people edit content or HTML on pages of the site. If Google trusted rel=canonical in the HTML body, we’d see far more attacks where people would drop a rel=canonical on part of a web page to try to hijack it.

Okay, so now we come to another corner case where we probably won’t trust a rel=canonical: if we see weird stuff in your HEAD section. For example, if you start to insert regular text or other tags that we normally only see in the BODY of HTML into the HEAD of a document, we may assume that someone just forgot to close the HEAD section. We don’t allow rel=canonical in the BODY (because as I mentioned, people would spam that), so we might not trust rel=canonical in those cases, especially if it comes after the regular text or tags that we normally only see in the BODY of a page.

But in general, as long as your HEAD looks fairly normal, things should be fine. If you really want to be safe, you can make sure that the rel=canonical is the first or one of the first things in the HEAD section. Again, things should be fine either way, but if you want an easy rule of thumb: put the rel=canonical toward the top of the HEAD.
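
If it helps to see this concretely, here’s a minimal sketch of a HEAD section with the rel=canonical link near the top (example.com and the page details are just placeholders, not anything specific to recommend):

<head>
<link rel="canonical" href="http://www.example.com/green-widgets">
<title>Green widgets</title>
<meta name="description" content="All about green widgets.">
</head>

The important part is that the link element sits inside a clean HEAD, before any stray text or BODY-style tags that might make us think the HEAD was closed early.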

Google launches two-factor authentication

by Matt Cutts

Google just launched two-factor authentication, and I believe everyone with a Google account should enable it.

Two-factor authentication (also known as 2-step verification) relies on something you know (like a password) and something you have (like a cell phone). Crackers have a harder time getting into your account, because even if they figure out your password, they still only have half of what they need. I wrote about two-factor authentication when Google rolled it out for Google Apps users back in September, and I’m a huge fan.

Account hijacking is no joke. Remember the Gawker password incident? If you used the same password on Gawker properties and Gmail, two-factor authentication would provide you with more protection. I’ve had two relatives get their Gmail accounts hijacked when someone guessed their passwords. I’ve also seen plenty of incidents like this where two-factor authentication would have kept hackers out. If someone hacked your Gmail account, think of all the other passwords they could get access to, including your domain name or webhost accounts.

Is it a little bit of extra work? Yes. But two-step verification instantly provides you with a much higher level of protection. I use it on my personal Gmail account, and you should too. Please, protect yourself now and enable two-factor authentication.

How to strip JPEG metadata in Ubuntu

by Matt Cutts

If you want to post some JPEG pictures but you’re worried that they might have metadata like location embedded in them, here’s how to strip that data out.

First, install exiftool using this command:

sudo apt-get install libimage-exiftool-perl

Then, go into the directory with the JPEG files. If you want to remove metadata from every file in the directory, use

exiftool -all= *.jpg

exiftool makes backup copies rather than destroying data: if you had a file called image.jpg, when you’re done you’ll have image.jpg with all the metadata stripped, plus a file called image.jpg_original that still contains the original metadata.
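
If you don’t want those backup files lying around, exiftool also supports an -overwrite_original option that strips the metadata in place (so keep copies somewhere else if you might want the original metadata back):

exiftool -all= -overwrite_original *.jpg

To double-check a file afterwards, you can run exiftool on it with no options and confirm that only basic file information like the image dimensions remains:

exiftool image.jpg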

My thoughts on this week’s debate

by Matt Cutts

Earlier this week I was on a search panel with Harry Shum of Bing and Rich Skrenta of Blekko (moderated by Vivek Wadhwa), and the video is now live. It’s forty minutes long, but it covers a lot of ground.

One big point of discussion is whether Bing copies Google’s search results. I’m going to try to address this earnestly; if snarky is what you want, Stephen Colbert will oblige you.

First off, let me say that I respect all the people at Bing. From engineers to evangelists, everyone that I’ve met from Microsoft has been thoughtful and sincere, and I truly believe they want to make a great search engine too. I know that they work really hard, and the last thing I would want to do is imply that Bing is purely piggybacking Google. I don’t believe that.

That said, I didn’t expect that Microsoft would deny the claims so strongly. Yusuf Mehdi’s post says “We do not copy results from any of our competitors. Period. Full stop.”

Given the strength of the “We do not copy Google’s results” statements, I think it’s fair to line up screenshots of the results on Google that later showed up on Bing:

[Seven pairs of screenshots: each shows a Google results page alongside the corresponding Bing results page.]

I think if you asked a regular person about these screenshots, Microsoft’s “We do not copy Google’s results” statement wouldn’t ring completely true.

Something I’ve heard smart people say is that this could be due to generalized clickstream processing rather than code that targets Google specifically. I’d love it if Microsoft would clarify that, but at least one example has surfaced in which Microsoft was targeting Google’s urls specifically. The paper is titled Learning Phrase-Based Spelling Error Models from Clickthrough Data, and here are some of the relevant parts:

The clickthrough data of the second type consists of a set of query reformulation sessions extracted from 3 months of log files from a commercial Web browser [I assume this is Internet Explorer. –Matt] …. In our experiments, we “reverse-engineer” the parameters from the URLs of these [query formulation] sessions, and deduce how each search engine encodes both a query and the fact that a user arrived at a URL by clicking on the spelling suggestion of the query – an important indication that the spelling suggestion is desired. From these three months of query reformulation sessions, we extracted about 3 million query-correction pairs.

This paper very much sounds like Microsoft reverse engineered which specific url parameters on Google corresponded to a spelling correction. Figure 1 of that paper looks like Microsoft used specific Google url parameters such as “spell=1” to extract spell corrections from Google. Targeting Google deliberately is quite different than using lots of clicks from different places. This is at least one concrete example of Microsoft taking browser data and using it to mine data deliberately and specifically from Google (in this case, the efforts of Google’s spell correction team).

That brings me to an issue that I raised with Bing during the search panel and afterwards with Harry Shum: disclosure. A while ago, my copy of Windows XP was auto-updated to IE8. Here’s one of the dialog boxes:

[Screenshot: the IE8 “Suggested Sites” dialog box]

I don’t think an average consumer realizes that if they say “yes, show me suggested sites,” they’re granting Microsoft permission to send their queries and clicks on Google to Microsoft, which will then be used in Bing’s ranking. I think my Mom would be confused that saying “Yes” to that dialog will send what she searches for on Google and what she clicks on to Microsoft. I don’t think that IE8’s disclosure is clear and conspicuous enough that a reasonable consumer could make an informed choice and know that IE8 will send their Google queries/clicks to Microsoft.

One comment that I’ve heard is that “it’s whiny for Google to complain about this.” I agree that’s a risk, but at the same time I think it’s important to go on the record about this.

Another comment that I’ve heard is that this affects only long-tail queries. As we said in our blog post, the whole reason we ran this test was that we thought this practice was happening for lots and lots of different queries, not simply rare queries; rare queries were just the easiest way to verify our hypothesis. To me, what the experiment proved was that clicks on Google are being incorporated in Bing’s rankings. Microsoft is the company best able to answer the degree to which clicks on Google figure into Bing’s rankings, and I hope they clarify how much of an impact those clicks have on Bing’s rankings.

Unfortunately, most of the responses have been along the lines of “this is only one of 1000 signals.” Nate Silver does a good job of tackling this, so I’ll quote him:

Microsoft’s defense boils down to this: Google results are just one of the many ingredients that we use. For two reasons, this argument is not necessarily convincing.

First, not all of the inputs are necessarily equal. It could be, for instance, that the Google results are weighted so heavily that they are as important as the other 999 inputs combined.

And it may also be that an even larger fraction of what creates value for Bing users are Google’s results. Bing might consider hundreds of other variables, but these might produce little overall improvement in the quality of its search, or might actually detract from it. (Microsoft might or might not recognize this, since measuring relevance is tricky: it could be that features that they think are improving the relevance of their results actually aren’t helping very much.)

Second, it is problematic for Microsoft to describe Google results as just one of many “signals and features”. Google results are not any ordinary kind of input; instead, they are more of a finished (albeit ever-evolving) product.

Let’s take that thought to its conclusion. If clicks on Google really account for only 1/1000th (or some other trivial fraction) of Microsoft’s relevancy, why not just stop using those clicks and reduce the negative coverage and perception of this? And if Microsoft is unwilling to stop incorporating Google’s clicks in Bing’s rankings, doesn’t that argue that Google’s clicks account for much more than 1/1000th of Bing’s rankings?

I really did try to be calm and constructive in this post, so I apologize if some frustration came through despite that–my feelings on the search panel were definitely not feigned. Since people at Microsoft might not like this post, I want to reiterate that I know the people (especially the engineers) at Bing work incredibly hard to compete with Google, and I have huge respect for that. It’s because of how hard those engineers work that I think Microsoft should stop using clicks on Google in Bing’s rankings. If Bing does better on a search query than Google does, that’s fantastic. But an asterisk that says “we don’t know how much of this win came from Google” does a disservice to everyone. I think Bing’s engineers deserve to know that when they beat Google on a query, it’s due entirely to their hard work. Unless Microsoft changes its practices, there will always be a question mark.

If you want to dive even deeper into this topic, you can watch the full forty-minute video above.

Algorithm change launched

by Matt Cutts

I just wanted to give a quick update on one thing I mentioned in my search engine spam post.

My post mentioned that “we’re evaluating multiple changes that should help drive spam levels even lower, including one change that primarily affects sites that copy others’ content and sites with low levels of original content.” That change was approved at our weekly quality launch meeting last Thursday and launched earlier this week.

This was a pretty targeted launch: slightly over 2% of queries change in some way, but less than half a percent of search results change enough that someone might really notice. The net effect is that searchers are more likely to see the sites that wrote the original content rather than a site that scraped or copied the original site’s content.

Thanks to Jeff Atwood and the team at Stack Overflow for providing feedback to Google about this issue. I mentioned the update over on Hacker News too, because folks on that site had been discussing specific queries.

An interesting essay on search neutrality

by Matt Cutts

(Just as a reminder: while I am a Google employee, the following post is my personal opinion.)

Recently I read a fascinating essay that I wanted to comment on. I found it via Ars Technica and it discusses “search neutrality” (PDF link, but I promise it’s worth it). It’s written by James Grimmelmann, an associate professor at New York Law School. The New York Times called Grimmelmann “one of the most vocal critics” of the proposed Google Books agreement, so I was curious to read what he had to say about search neutrality.

What I discovered was a clear, cogent essay that calmly dissects the idea of “search neutrality” that was proposed in a New York Times editorial. If you’re at all interested in search policies, how search engines should work, or what “search neutrality” means when people ask search engines for information, advice, and answers–I highly recommend it. Grimmelmann considers eight potential meanings for search neutrality throughout the article. As Grimmelmann says midway through the essay, “Search engines compete to give users relevant results; they exist at all only because they do. Telling a search engine to be more relevant is like telling a boxer to punch harder.” (emphasis mine)

On the notion of building a completely transparent search engine, Grimmelmann says

A fully public algorithm is one that the search engine’s competitors can copy wholesale. Worse, it is one that websites can use to create highly optimized search-engine spam. Writing in 2000, long before the full extent of search-engine spam was as clear as it is today, Introna and Nissenbaum thought that the “impact of these unethical practices would be severely dampened if both seekers and those wishing to be found were aware of the particular biases inherent in any given search engine.” That underestimates the scale of the problem. Imagine instead your inbox without a spam filter. You would doubtless be “aware of the particular biases” of the people trying to sell you fancy watches and penis pills–but that will do you little good if your inbox contains a thousand pieces of spam for every email you want to read. That is what will happen to search results if search algorithms are fully public; the spammers will win.

And Grimmelmann independently hits on the reason that Google is willing to take manual action on webspam:

Search-engine-optimization is an endless game of loopholing. …. Prohibiting local manipulation altogether would keep the search engine from closing loopholes quickly and punishing the loopholers–giving them a substantial leg up in the SEO wars. Search results pages would fill up with spam, and users would be the real losers.

I don’t believe all search engine optimization (SEO) is spam. Plenty of SEOs do a great job making their clients’ websites more accessible, relevant, useful, and fast. Of course, there are some bad apples in the SEO industry too.

Grimmelmann concludes

The web is a place where site owners compete fiercely, sometimes viciously, for viewers, and users turn to intermediaries to defend them from the sometimes-abusive tactics of information providers. Taking the search engine out of the equation leaves users vulnerable to precisely the sorts of manipulation search neutrality aims to protect them from.

Really though, you owe it to yourself to read the entire essay. The title is “Some Skepticism About Search Neutrality.”

Which charities do you donate to?

Each year I like to ask what charities people are donating to. There are still a couple of days left in 2010, so I wanted to ask readers about their charity or non-profit giving.

I’ll mention the main organizations on my giving list this year:

  • charity: water brings clean, safe drinking water to people in developing nations.
  • The Poynter Institute is a school that trains journalists and would-be journalists, both in person and online.
  • The Committee to Protect Journalists defends press freedom and the rights of journalists to report the news world-wide without fear of harm.
  • MAPLight.org provides tools and data to investigate the influence of money on politics.
  • The Sunlight Foundation focuses on using technology to make government more transparent and accountable.
  • I don’t think I’ve mentioned my Mom’s charity on my blog before, but I did donate money this year to it, so it seems appropriate to mention it. Blessing Hands provides scholarships and other help to students in China. Side-note: in the same way that I don’t accept gifts or free things, if you ever decide to donate any money to Blessing Hands, please don’t tell me; I wouldn’t want a donation to create the appearance of any conflict of interest with my job.
  • The Electronic Frontier Foundation (EFF) defends everyone’s digital and online rights. The EFF has stopped more bad ideas online than I can even count.

Those were the organizations that I ended up giving some money to. Now it’s your turn. What charities would you like to mention, support, or call out?

By the way, I’d still like to find 501(c)(3) organizations with low overhead costs that support open-source software. And I’d still like to find an organization that teaches the basics of journalism online for free. The training could cover the history of journalism, research and fact checking, ethics, legal principles, rights, how to investigate, libel and slander, off the record vs. on background, and so on. Sort of like The Khan Academy, but teaching journalism. If anyone knows of such organizations or non-profits, please leave a comment!