Website:Walter Gregg


A Canonical URL JavaScript

Archive from 2008 (Rev. 2017). Here's a script you shouldn't use. It won't work for everyone, search engines may not like it, and it may degrade your website statistics. It is important to redirect visitors to a single canonical address. But if your web hosting provider doesn't let you do this, you should move your website to a hosting plan that does. If there is some reason you can't, as a last resort, you could use a version of the JavaScript posted here, but only when there is no alternative.

Summary

A website or web page may be reachable through several different addresses. For reasons that include branding and search engine ranking, it is important that visitors see and use only one of these addresses whenever possible. The address selected for this purpose is known as the canonical address, hostname, or URL. The correct way to ensure that visitors are presented with the canonical name is to use the web server's RewriteEngine or equivalent, so that a visitor who arrives via a non-canonical alias is redirected to the canonical form. But server administrators often deny their customers the ability to use the rewrite capability. When that is the case, it is still possible to use JavaScript such as that presented here to redirect most visitors to the correct address.

Definitions

Canonical hostname
The canonical hostname of a website is the name a website publisher prefers and uses on business cards, in email signatures, and in advertising. It is the name the publisher wants people to use, particularly when they link to pages on the site. In essence, it is the listed, published name such as example.com as opposed to an alias name such as www.example.com. Many websites have other variants. See kotarski.co.uk: The canonical URL (archive.org 2008: kotarski.co.uk/ articles/ the-canonical url.html). Another example is GCI, an internet service provider (ISP) in Alaska. They let their customers' pages be reached via home.gci.net, matnet.com, www.matnet.com, alaskalife.net, www.alaskalife.net, and possibly others. This is because they have purchased many smaller ISPs over the years and didn't want to inconvenience customers with a hostname change (email from GCI support to me, May 25, 2007). Which of the names a website publisher considers canonical is largely a marketing decision. See no-www.org (https://web.archive.org/web/20080702032830/http://no-www.org/).
Canonical URL (uniform resource locator)
The canonical URL of a webpage is the name the website publisher prefers and uses for that page. It is the address using the canonical hostname, but more than that, it is the address using the most appropriate and generally the shortest pathname. For example, the canonical name of an index page is simply a trailing slash, without the filename. Search engine robots may even rewrite known index page variants, such as index.htm?, home.htm?, default.htm?, and index.php, to the trailing slash form. See code.google.com: Administering Crawl: Troubleshooting crawl (https://web.archive.org/web/20080731132041/http://code.google.com/apis/searchappliance/documentation/46/admin_crawl/Troubleshooting.html). The short form allows you to change the type of index page without breaking already indexed pages. For the same reason, many web servers can be configured to allow extensionless URLs. Using this form allows you to change document types without breaking incoming links. It's best for canonical URLs to exclude all unnecessary detail. See Tim Berners-Lee, Cool URIs Don't Change (1998) (w3.org/Provider/Style/URI).
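
As an illustration of the index-page rule above, here is a minimal sketch of a helper that collapses known index filenames to the trailing-slash form. The function name canonicalPath and the exact filename list are assumptions made for this example; they are not part of the original script, and the list should be adjusted to whatever index names your server actually serves.

```javascript
// Hypothetical helper (not from the original page): map known index-page
// filenames to the trailing-slash form of the path.
// The pattern covers index.htm(l), home.htm(l), default.htm(l), and index.php.
var INDEX_PAGE = /\/(index\.html?|home\.html?|default\.html?|index\.php)$/i;

function canonicalPath(pathname) {
  // Only a filename directly after the last slash is replaced,
  // so "/docs/myindex.html" is left untouched.
  return pathname.replace(INDEX_PAGE, '/');
}
```

For example, canonicalPath('/docs/index.html') yields '/docs/', while a path such as '/docs/page.html' comes back unchanged.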

Importance of Canonical Naming

The chief problem with alias names, where several versions of a hostname or page address actually reach the same page, is that to search engines, they are separate pages. See Matt Cutts, URL Canonicalization (2006) (mattcutts.com/blog/seo-advice-url-canonicalization). If some incoming links use one URL and some use other URLs, the search engine may not know they refer to the same page and may index the URLs separately. This penalizes the page, because none of its URLs receives the PageRank benefit of all the incoming links. In fact, when different URLs lead to the same page, pages can be further penalized or even banned because it can look like an attempt to spam the search engine index. See Google, Webmaster Guidelines (google.com/support/webmasters/bin/answer.py?answer=35769), particularly under Quality Guidelines about duplicate content.

Server Side Redirects are Preferred

The preferred way to enforce the use of a canonical name is to use the Apache server's mod_rewrite (RewriteEngine) in the .htaccess file to permanently redirect addresses to the preferred canonical address. But mod_rewrite is too complex for typical customers. See Apache.org, mod_rewrite (httpd.apache.org/docs/1.3/mod/mod_rewrite.html). Hosting companies often disable it rather than incur the customer support calls it creates. See jdMorgan, Reply to bcrbcr re: Host Supports .htaccess but not Mod Rewrite (Apr. 16, 2007) (webmasterworld.com/apache/3312276.htm). For microsites and personal home pages, upgrading to a hosting package that includes mod_rewrite support may not be economical. That leaves a lot of websites with unresolved canonical name problems that penalize them in search engine indexes.

Client Side JavaScript is a Workaround

JavaScript offers a partial solution, but there may be undesirable side effects. A simple script can check the URL used to access a page and, if it is not the preferred URL, replace the page with the one at the canonical URL. If JavaScript is unavailable, disabled, or too old to have the getElementById function, the redirect will fail and the non-canonical version of the page will load. This is generally the case for search engine robots and old browsers. But for better than 80% of human visitors, the redirect to the canonical page will work. That means that when they print, bookmark, or copy the URL, they will have the canonical version. I hypothesize that this should greatly reduce the use of non-canonical aliases, and that keeping people from seeing the wrong URL will make it less likely for search engine indexes to be contaminated with non-canonical URLs.

This workaround may have unanticipated side effects. Known examples, and the remedy this version applies to each, include:

  1. In some implementations, a redirected visitor has the Hotel California problem: when they try to use the back button, they can't leave your page because they are immediately redirected back again. This version uses location.replace to avoid that result.
  2. In some implementations, in-page links and query features are broken, because the redirection removes everything after the canonical URL. This version appends the hash and search strings to avoid that result.
  3. In some implementations, copies saved to disk, archived at the Wayback Machine, or cached at search engines are inaccessible, because the redirection kicks in for any non-canonical URL. This version only redirects when the hostname matches our own canonical hostname or one of its known aliases.

The use of JavaScript redirects might harm search engine rankings or even result in being banned from indexes. JavaScript redirects are extensively used by spammers to present different content to search engines than to human visitors. Search engine algorithms are known to attempt to detect this practice and downrate or ban pages using it. See K. Chellapilla and A. Maykov, A Taxonomy of JavaScript Redirection Spam (401 kilobyte PDF) (archive.org/web/20080920225551/http://airweb.cse.lehigh.edu/2007/papers/paper_115.pdf). However, in this case, we're presenting identical content to search engine robots and human visitors. Thus, the use is not inappropriate and should not result in penalties. Nevertheless, most search engine algorithms are not publicly known. This means that there can be no assurance that search engines will not penalize you or ban you for using this JavaScript. You use it at your peril.

Code: Example JavaScript
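
The following is a minimal sketch of a redirect along the lines described above, not a reproduction of the original script. The hostnames example.com and www.example.com and the names CANONICAL_HOST, ALIAS_HOSTS, and canonicalTarget are placeholders assumed for this example. It uses location.replace so the back button is not trapped, keeps the search and hash strings, and redirects only from hostnames listed as known aliases.

```javascript
// Sketch only. Replace these placeholder hostnames with your own.
var CANONICAL_HOST = 'example.com';
var ALIAS_HOSTS = ['www.example.com'];

// Return the canonical URL to redirect to, or null when no redirect is needed.
// `loc` is any object shaped like window.location.
function canonicalTarget(loc, canonicalHost, aliasHosts) {
  var host = String(loc.hostname).toLowerCase();
  // Redirect only from known aliases, so copies saved to disk, archived,
  // or cached on other hosts are left alone.
  if (host === canonicalHost || aliasHosts.indexOf(host) === -1) {
    return null;
  }
  // Keep the query string and fragment so in-page links and queries survive.
  return loc.protocol + '//' + canonicalHost +
    loc.pathname + loc.search + loc.hash;
}

// Run only where a browser location object exists.
if (typeof window !== 'undefined' && window.location) {
  var target = canonicalTarget(window.location, CANONICAL_HOST, ALIAS_HOSTS);
  if (target) {
    // replace() keeps the alias out of the session history,
    // so the back button does not bounce the visitor forward again.
    window.location.replace(target);
  }
}
```

Because the sketch builds the target from the hostname alone, any non-default port on the alias is dropped; if your aliases use one, that would need handling. Keeping the decision in a separate function makes it possible to check the logic without a browser, for instance by passing a plain object with protocol, hostname, pathname, search, and hash properties.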

Extensions Are Best Omitted

There are advantages to omitting the file type extension from the canonical URL used for linking to a page. This way, you can freely change the document type, for example from HTML to PHP, without breaking incoming links. By default, most servers require the extension, but many servers have a means to make it optional. On Apache under Linux, if your hosting company supports this, you merely need a /.htaccess file that includes 'Options +MultiViews' (without the quotes). This very often works even with hosts that don't allow customers to use the RewriteEngine.

2008 (Rev. 2017). (Walt.Gregg.Juneau.AK.US/6/javascript-make-canonical-url; CreativeCommons.org/licenses/by-nc-nd/4.0.)
