Google May Be Crawling AJAX: Take Advantage Of It!
In October 2009, Google proposed a new standard for implementing AJAX on web sites that would help search engines extract the content. Now, there’s evidence this proposal is either live or is about to be. Read on for more details on the proposal, how it works, and why it might be past the proposal stage.
The Trouble With AJAX
Historically, search engines have had trouble accessing AJAX-based content and this proposal would enable Google (and presumably other search engines that adopted the standard) to index more of the web. The standard SEO advice for AJAX implementations has traditionally been to follow accessibility best practices. If you build the site with progressive enhancement or graceful degradation techniques so that screen readers can render the content, chances are that search engines can access the content as well. Last May, I outlined some of the crawlability issues with AJAX and options for search-friendly implementations.
Google has offered advice on AJAX as well, including:
- Google Webmaster Central Blog: A Spider’s View of Web 2.0
- Google Webmaster Central YouTube Channel: Crawling and Indexing
- Google Webmaster Central Help Center: AJAX
One of the primary search engine problems with AJAX is that it generates URLs that contain a hash mark (#). Since hash marks are also used for named anchors within a page, search engines typically ignore everything in a URL beginning with one (called the URL fragment). So, for instance, Google would see the following two URLs as identical:
- http://www.buffy.com/seasons.php
- http://www.buffy.com/seasons.php#best=2
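To make that concrete, here is a minimal sketch of what "ignoring the fragment" amounts to. The function name withoutFragment is my own, and this only illustrates the idea; it is not a description of how any search engine actually normalizes URLs.

```javascript
// Sketch: drop the fragment, the way a crawler that ignores it effectively would.
// withoutFragment is a hypothetical name used only for this illustration.
function withoutFragment(url) {
  const parsed = new URL(url);
  parsed.hash = ""; // discard "#" and everything after it
  return parsed.toString();
}

const plain = withoutFragment("http://www.buffy.com/seasons.php");
const withHash = withoutFragment("http://www.buffy.com/seasons.php#best=2");

// Both addresses reduce to the same string, so they look like one page.
console.log(plain === withHash); // true
```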
Google’s AJAX Proposal
With Google’s proposal, an AJAX-generated URL that contains a hash mark (#) would also be made available as a URL that uses #! in place of #. So, the second URL above would become http://www.buffy.com/seasons.php#!best=2. When Googlebot encounters the exclamation point after the hash mark, it would then request the URL from the server using a syntax that replaces the #! with ?_escaped_fragment_=.
Still with me? All this means is that when Googlebot encounters:
http://www.buffy.com/seasons.php#!best=2
it will request the following URL from the server:
http://www.buffy.com/seasons.php?_escaped_fragment_=best=2
Why, you ask? Well, because ?_escaped_fragment_= in the URL tells the server to route the request to a headless browser, which executes the AJAX code and renders a static page.
But, you might protest, I don’t want my URLs in the search results to look like that! Not to worry. Google requests the URL using that syntax, but then translates the ?_escaped_fragment_= back into #! when displaying it to searchers.
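The translation in both directions can be sketched in a few lines. This is a simplified illustration, not Google's implementation: the function names are hypothetical, appending with & when the URL already has a query string is an assumption on my part, and any escaping of special characters in the fragment is left out.

```javascript
// Simplified sketch of the "#!" <-> "?_escaped_fragment_=" mapping.
// toCrawlerUrl and toDisplayUrl are hypothetical names for illustration.

// The form a crawler would request from the server.
function toCrawlerUrl(prettyUrl) {
  const i = prettyUrl.indexOf("#!");
  if (i === -1) return prettyUrl; // no "#!" fragment: nothing to translate
  const base = prettyUrl.slice(0, i);
  const fragment = prettyUrl.slice(i + 2);
  // Assumption: use "&" if the URL already carries a query string.
  const separator = base.includes("?") ? "&" : "?";
  return base + separator + "_escaped_fragment_=" + fragment;
}

// The form shown to searchers.
function toDisplayUrl(crawlerUrl) {
  const match = crawlerUrl.match(/[?&]_escaped_fragment_=/);
  if (!match) return crawlerUrl;
  const base = crawlerUrl.slice(0, match.index);
  const fragment = crawlerUrl.slice(match.index + match[0].length);
  return base + "#!" + fragment;
}

console.log(toCrawlerUrl("http://www.buffy.com/seasons.php#!best=2"));
console.log(toDisplayUrl("http://www.buffy.com/seasons.php?_escaped_fragment_=best=2"));
```

Run against the Buffy example, the first call produces the ?_escaped_fragment_= request URL and the second turns it back into the #! version.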
How Do I Implement This?
This implementation basically requires that you:
- Modify your AJAX implementation so that URLs that contain hash marks (#) are also available via the hash mark/exclamation point (#!) combination (or, as I recommend below, that you replace the # versions entirely with the #! ones).
- Configure a headless browser on your web server that processes the ?_escaped_fragment_= versions of the URLs, executes the JavaScript on the page and returns a static page.
Oh, you still have questions? I have answers! Well, and some questions of my own.
What about all those links? Is Google going to consolidate all links to the # version of the URL and attribute them to the #! version? It appears that the answer is no. Currently, all links to URLs that contain a hash mark are attributed to the URL before the fragment, and that will continue to be the case. And the canonical tag won’t work in this case, since Google doesn’t process the # version of the URL. So returning to our earlier example, all links to http://www.buffy.com/seasons.php#best=2 are attributed to http://www.buffy.com/seasons.php.
Wait, do we need to start using #! instead of #? You likely don’t want to implement this in such a way that the # and #! URLs co-exist. Instead, you’ll want to replace # URLs with #! URLs. You can’t redirect search engine bots, of course (same reason bots can’t crawl and index the AJAX URLs as is). This means that as noted above, the pages won’t get credit for past links to the # version of the URLs. You should ensure that the #! version of the URLs is what displays in a visitor’s browser though, so that any new links are to the (now indexable) #! versions. What about visitors coming from existing links to the # versions of the URLs? You’ll want to add code that transforms the # version of the URLs to the #! version (see below for more on that).
How do I create #! URLs in place of # URLs? That’s pretty straightforward. Just (I know, there is no “just”) modify the AJAX code that creates URLs to output #! URLs instead of # URLs.
As noted above, for any existing AJAX pages that use #, you’ll want to redirect visitors to the new URLs that use #!. This won’t cause Google to transfer links from the # versions to the #! versions, but it will ensure that visitors see only the #! version. Any new links will therefore point to that version, which will cause Google to start accruing PageRank for those pages. Obviously, you’ll want to get any new links to the versions of the pages that Google will index so those pages have a better chance at ranking well.
My colleague Todd Nemet has a few suggestions for redirecting visitors from the # versions to the #! versions of the URLs.
- JavaScript – You can use document.location, such as: <script type="text/javascript"> document.location = "http://www.buffy.com/seasons.php#!best=2"; </script>
- PHP – You can write a short PHP script, such as: <?php header("HTTP/1.1 301 Moved Permanently"); header("Location: http://www.buffy.com/seasons.php#!best=2"); exit; ?>
- .htaccess – For Apache servers, you can use the NE flag in a rewrite rule, as shown below (although this really only works if you’re moving to the #! structure from a non-# URL):
    RewriteCond %{QUERY_STRING} ^best=(.*)$
    RewriteRule ^seasons\.php$ /seasons.php#!best=%1? [R=301,NE]
- Meta refresh – Generally, a meta refresh isn’t recommended for redirects, as search engines do a better job of following 301s, but in this case, you’re only redirecting visitors. You can add code similar to the following to the <head> section of the original page: <meta http-equiv="refresh" content="0; url=http://www.buffy.com/seasons.php#!best=2">
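The JavaScript option above hard-codes a single URL. If you have many # URLs, a more general version might look something like the sketch below. The helper name toHashBang is hypothetical, and the sketch assumes every fragment on the page represents AJAX state; if you also use # for ordinary named anchors, you’d need to exclude those.

```javascript
// Hypothetical helper: turn a legacy "#state" fragment into "#!state".
// Assumes every fragment on this page is AJAX state, not a named anchor.
function toHashBang(hash) {
  if (!hash || hash === "#") return hash;  // no fragment state to carry over
  if (hash.startsWith("#!")) return hash;  // already in the new form
  if (hash.startsWith("#")) return "#!" + hash.slice(1);
  return hash;
}

// Browser-only wiring; this part is skipped when run outside a browser.
if (typeof window !== "undefined") {
  const upgraded = toHashBang(window.location.hash);
  if (upgraded !== window.location.hash) {
    // replace() keeps the old "#" URL out of the back-button history.
    window.location.replace(
      window.location.pathname + window.location.search + upgraded
    );
  }
}
```

With this in place, a visitor arriving at seasons.php#best=2 would end up seeing seasons.php#!best=2 in the address bar, which is the version you want them to link to.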
What’s this about a “headless browser”? The headless browser runs on your web server and processes requests for the ?_escaped_fragment_= versions of URLs. In Google’s original blog post, they suggested checking out HtmlUnit, an open source headless browser. The headless browser executes the JavaScript and renders a static page, then returns it to the requestor. I can hear your next question already — what does that rendered static page look like? Well, it should probably expose all of the content on the page. The two important things here are that Google will be able to get to the content and index it and that Google will have distinct URLs for indexing that content.
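Here is a rough sketch of just the routing decision, to show where the headless browser fits. Everything named here is an assumption for illustration: routeRequest is a made-up function, and renderSnapshot stands in for whatever headless browser setup you choose (HtmlUnit or otherwise); it is not a real API.

```javascript
// Sketch of the routing decision only. renderSnapshot is a hypothetical
// stand-in for your headless browser integration, passed in by the caller.
function routeRequest(requestUrl, renderSnapshot) {
  // The base is only there so relative request paths can be parsed.
  const parsed = new URL(requestUrl, "http://localhost");
  const fragment = parsed.searchParams.get("_escaped_fragment_");
  if (fragment === null) {
    return { kind: "app" }; // ordinary visitor: serve the normal AJAX page
  }
  // Rebuild the "#!" URL the crawler discovered and render that state.
  parsed.searchParams.delete("_escaped_fragment_");
  const prettyUrl = parsed.pathname + parsed.search + "#!" + fragment;
  return { kind: "snapshot", body: renderSnapshot(prettyUrl) };
}

// Usage with a fake renderer that just labels the page it was asked for.
const fakeRenderer = (url) =>
  "<html><!-- static snapshot of " + url + " --></html>";
console.log(routeRequest("/seasons.php?_escaped_fragment_=best=2", fakeRenderer));
console.log(routeRequest("/seasons.php", fakeRenderer));
```

The point of the design is that regular visitors never touch the headless browser; only requests carrying ?_escaped_fragment_= are sent through it.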
What does this mean for accessibility? This question came up when the Google engineers spoke at the Jane and Robot Search Developer Summit I put on just after SMX East, where this proposal was announced. This implementation doesn’t help the content render correctly on mobile devices that don’t support JavaScript or on screen readers. So when considering whether to implement this vs. another technique, think about your accessibility needs.
You’ll also want to make sure that the AJAX URLs aren’t simply popups, since you don’t want search engines to index a popup without the surrounding page content. Ensure that the headless browser creates a static page that includes all content from the page.
Any other problems with this idea? Beyond the accessibility issues (which I think shouldn’t be overlooked), the biggest consideration is probably that this method doesn’t work for search engines other than Google. So if you care about getting this content indexed by Bing and Yahoo!, you’ll want to explore other methods. Also, as you’ll see below, it seems like it may be live on Google, but a bit buggy. So, if you plan to implement it, you’ll have to rely on Google working out the kinks. You should also fully plan out the implementation. Did you previously add workarounds for AJAX issues in other ways that will now conflict with this method?
Also, even if you don’t use AJAX and don’t implement this technique, the potential for problems exists. It’s always been a good practice not to configure your server to resolve any URL request. One reason is that, from a crawl efficiency perspective, you can send the search engine bots into an infinite crawl space if your server responds with an HTTP response code of 200 for any URL. But note what has happened with the site below. The “real” URL is iankellysmusic.com/About/. But it seems a link exists on the web to iankellysmusic.com/About/#!. Google has followed that link and is interpreting it as this new AJAX technique.