Monday, December 5, 2011

Why Canonicalization Matters From A Linking Perspective

Search engine optimization (SEO) can be like any other technical field of study. It is filled with specialized jargon that, to a newbie, can be more than intimidating. I recall that feeling was especially strong when I first encountered the term canonicalization.

It is a 14-letter, seven-syllable monster of a term. I first heard it spoken, and had to ask the person who said it to repeat it. It didn’t help. (It had been a long day!)

The truth of the matter is that canonicalization is not all that complicated to understand if the explanation is lucid. So let’s try to explain what it means, why it’s important, and what it has to do with linking.

What Is Canonicalization?
In mathematics, when the same data can be represented in multiple ways, it is best to standardize that representation by establishing the data’s canonical form, the one primary form in which it will be used. In the computer science field, the act of defining the canonical form of data is called canonicalization.

Simply put, canonicalization defines the one primary way you’ll use to write data, such as a URL string. As webmaster, you can choose which canonical form to use for a given URL on your site, but once selected, the chosen form should always be the way that URL is written.

Why Canonicalization Is Important
Fundamentally, you need to know that search engines do not index pages by their content. They index URLs. The content associated with the indexed URLs is brought into the search engine database, but it is the URLs that carry the ranking.

What complicates matters in search (and why canonicalization is important) is that the same content page can have multiple URLs associated with it.

I’m not talking about when Web spammers scrape your content and publish it on their own website. I’m talking about variations of URLs on your website all pointing to the same page.

For example, the following hypothetical URLs would likely all point to the same page (in this case, the home page of a site):

example.com
www.example.com
www.example.com/
www.example.com/index.html
www.example.com/index.html?var1=105
www.example.com/index.html?var1=105&var2=abc

As you can see, a valid URL may either include or omit the subdomain prefix “www.”, a trailing slash after the top-level domain, the default webpage name for a folder, and/or one or more URL parameter suffixes (there are even more, but these are the most common). They can also be used in various combinations. The possible permutations of the above examples can quickly add up to a large number of URLs all pointing to the same content page.
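To make the idea concrete, here is a minimal Python sketch of collapsing those home page variations into one canonical form. It rests on several assumptions that are not from any search engine or CMS: the site prefers the “www.” host, index.html is the default document, and every URL parameter can safely be dropped. The names canonicalize, BARE_HOST, CANONICAL_HOST, and DEFAULT_DOCUMENT are hypothetical.

```python
from urllib.parse import urlsplit, urlunsplit

# Assumptions for this sketch (adjust to your own site's choices).
BARE_HOST = "example.com"
CANONICAL_HOST = "www.example.com"
DEFAULT_DOCUMENT = "index.html"


def canonicalize(url):
    """Collapse common URL variations into one canonical form."""
    # Bare forms such as "example.com" have no scheme; add one so they parse.
    if "://" not in url:
        url = "http://" + url
    parts = urlsplit(url)

    # Pick one host form: always use the "www." prefix.
    host = parts.netloc.lower()
    if host == BARE_HOST:
        host = CANONICAL_HOST

    # An empty path and "/" are the same page; prefer the trailing slash.
    path = parts.path or "/"

    # Drop the default document name so "/index.html" becomes "/".
    if path.endswith("/" + DEFAULT_DOCUMENT):
        path = path[: -len(DEFAULT_DOCUMENT)]

    # Assumption: parameters never change the content, so discard them.
    return urlunsplit(("http", host, path, "", ""))


variants = [
    "example.com",
    "www.example.com",
    "www.example.com/",
    "www.example.com/index.html",
    "www.example.com/index.html?var1=105",
    "www.example.com/index.html?var1=105&var2=abc",
]
print({canonicalize(u) for u in variants})
```

If a parameter does select different content on your site (a product ID, for instance), it belongs in the canonical form and should not be stripped the way this sketch does.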

And this is not only a problem for home pages. Deep link pages can have similar problems, as in the following hypothetical examples:

www.example.com/folder1/
www.example.com/folder1/index.html
www.example.com/folder1/index.html?product=49
www.example.com/folder1/?userID=tinytim

When search engine crawlers encounter multiple URLs successfully pointing to the same content page, the overall potential PageRank for that content page is split among the URLs crawled. After all, even though the content is exactly the same, each crawled URL will have its own number of backlinks, so the PageRank for a given piece of content will differ among the URLs crawled.

Metaphorically speaking, imagine a full pitcher of water (the total potential page rank) and several empty cups of various sizes (your non-canonicalized URLs).

When you split up the water from the pitcher among the cups, you are technically still working with the same amount of water, but each cup only has a percentage of the total. None of the cups contains as much water as the pitcher could.

When it comes to PageRank, if your site’s pages are not canonicalized, you’re not using your full potential for page ranking. Not only are your URLs competing against those of your rivals from other websites, you are also competing against URL variations within your own website!

Wouldn’t it be better if you could consolidate your page rank in one URL as you might pour all of those cups of water back into one pitcher? That’s why we need to canonicalize our sites.
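To put rough numbers on the pitcher metaphor, here is a small Python sketch. The backlink counts are invented for illustration, and a raw backlink count is only a crude stand-in for PageRank, which search engines calculate in far more complex ways. The name link_shares is hypothetical.

```python
# Hypothetical backlink counts for four URL variants of one home page.
# The numbers are made up purely to illustrate the split.
backlinks_by_variant = {
    "example.com": 12,
    "www.example.com": 40,
    "www.example.com/": 25,
    "www.example.com/index.html": 8,
}


def link_shares(counts):
    """Return each variant's fraction of the page's total backlinks."""
    total = sum(counts.values())
    return {url: n / total for url, n in counts.items()}


shares = link_shares(backlinks_by_variant)
total_links = sum(backlinks_by_variant.values())
best_url = max(shares, key=shares.get)

# Split: even the strongest variant holds under half of the links.
print(best_url, round(shares[best_url], 2))
# Consolidated: one canonical URL would be credited with all of them.
print("canonical total:", total_links)
```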

Canonicalization’s Connection To Linking
“Yeah, yeah, this is all well and good. But where’s the connection to linking?” you ask. Well, as a webmaster, you do have a degree of control over how at least some pages link to you.

After all, your intrasite links, not to mention your site navigation scheme links (and for that matter, the links in your XML-based Sitemap file) are all controlled by you.

This means you need to comb through your site (or your content management system, aka CMS) and see how the link to each page is referenced. You need to ensure each link to a given page always uses the exact same URL form.
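As a rough illustration of that audit, the following Python sketch scans a page’s HTML for intrasite links that are not written in canonical form. It uses only the standard library. The set of site hosts, the sample page, and the names LinkCollector and non_canonical_links are assumptions made for the example; a real audit would crawl every page and draw its canonical list from your own site.

```python
from html.parser import HTMLParser
from urllib.parse import urlsplit

# Assumption: these are the host names that belong to your own site.
SITE_HOSTS = {"example.com", "www.example.com"}


class LinkCollector(HTMLParser):
    """Collect the href value of every <a> tag in a document."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def non_canonical_links(html, canonical_urls):
    """Return intrasite hrefs that are not written in canonical form."""
    collector = LinkCollector()
    collector.feed(html)
    flagged = []
    for href in collector.hrefs:
        host = urlsplit(href).netloc.lower()
        # Relative links (no host) and links to our own hosts are intrasite.
        is_internal = host == "" or host in SITE_HOSTS
        if is_internal and href not in canonical_urls:
            flagged.append(href)
    return flagged


page = (
    '<a href="http://www.example.com/folder1/">good</a> '
    '<a href="http://example.com/folder1/index.html">bad</a> '
    '<a href="http://other.org/">external</a>'
)
print(non_canonical_links(page, {"http://www.example.com/folder1/"}))
```

Note that this sketch also flags relative links, since it treats only the full, absolute canonical URL as correct.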

I personally advocate using absolute (aka full) URLs in links, if only because of the plague of content scrapers. As those people are too lazy to create their own content, they are also usually too lazy to examine and change stolen content source code.

If your content is scraped, readers of that content will be brought back to your site when they click the inline links you created (you do create inline links when relevant opportunities appear, right?).

Admittedly, there are times when your site architecture requires that you use URL parameters. In that case, you can also create rel=canonical tags in the <head> section of your pages. The href attribute of this tag defines the canonical URL for the page, so even if the page is normally requested with URL parameters, its canonical URL is still defined.
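For a page template generated in code, the tag could be built along these lines. This is a minimal Python sketch, and the function name canonical_link_tag is hypothetical; the output is the standard link element with rel="canonical" that belongs inside the page’s head element.

```python
from html import escape


def canonical_link_tag(canonical_url):
    """Build the rel=canonical tag for a page's <head> section."""
    # Escape the URL so characters such as & or " cannot break the attribute.
    return '<link rel="canonical" href="{}">'.format(
        escape(canonical_url, quote=True)
    )


# A page requested as /folder1/index.html?product=49 can still declare
# its parameter-free canonical form (hypothetical URL from the examples).
print(canonical_link_tag("http://www.example.com/folder1/"))
```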

Note that search engines have stated they will look at rel=canonical as a hint, not as a mandate. As such, this is not the magic canonicalization bullet for your site. You still need to be consistent with your canonical intrasite linking.

Also, for URL parameter users, be sure to check out both the Google and Bing Webmaster Tools. Both have added options enabling webmasters to define specific URL parameters to be ignored during crawls.

Google also allows you to select whether or not you want to use the subdomain prefix “www.” in your preferred URL. I’d guess that option will eventually come to Bing as well.

Lastly, for links you don’t control, such as inbound links from other sites, you can set up 301 permanent redirects for all non-canonical URL forms to the canonical URL for each page.

Just be sure you use a 301 permanent redirect. As the 301 is a permanent redirect, search engines interpret this to mean they can safely transfer all of the page rank value from the original (non-canonical) URL to the new (canonical) one.

Note that while 302 temporary redirects will redirect users to a canonical URL, search engines will not transfer any acquired page rank! (I have written in more detail about using 301 redirects here.)
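The redirect decision itself can be sketched in a few lines of Python. The redirect table, the canonical origin, and the function name redirect_response are assumptions made for illustration; in practice this logic usually lives in your Web server or CMS configuration rather than in application code.

```python
from http import HTTPStatus

# Assumptions: the canonical host and a table of known non-canonical paths.
CANONICAL_HOST = "www.example.com"
CANONICAL_ORIGIN = "http://" + CANONICAL_HOST
PATH_REDIRECTS = {
    "/index.html": "/",
    "/folder1/index.html": "/folder1/",
}


def redirect_response(host, path):
    """Return (status, location) for a non-canonical request, else None."""
    target_path = PATH_REDIRECTS.get(path, path)
    if host.lower() != CANONICAL_HOST or target_path != path:
        # 301 (permanent), not 302 (temporary).
        return HTTPStatus.MOVED_PERMANENTLY.value, CANONICAL_ORIGIN + target_path
    # The request already uses the canonical form: serve the page normally.
    return None


print(redirect_response("example.com", "/index.html"))
print(redirect_response("www.example.com", "/"))
```

Sending non-canonical requests straight to the final canonical URL in one hop, rather than chaining redirects, keeps the added page load delay to a minimum.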

If you’re really detail-oriented, you could even look at backlink tools, such as the aforementioned search engines’ webmaster tools or a third-party tool such as Open Site Explorer, to see who is linking to you and work with the errant webmasters who are not using your canonical URL in their outbound links.

After all, as good as a 301 redirect is for canonicalization, a redirect also introduces a potential page load speed delay (although that’s not likely as detrimental to your page rank as non-canonicalized URLs).

The bottom line is this: you have the ability to consolidate the PageRank for your content pages into canonical URLs.

Given how competitive (not to mention how valuable) a top ranking can be for a given query, and depending on how badly your multiple URLs are dividing up your PageRank today, why wouldn’t you take the steps needed to consolidate the page rank of each content page into one canonical URL?

Canonicalization may be a seven-syllable monster, but it’s not that complicated, and doing something about it could improve your position in the SERPs.
