How Does Google Index Your Website?

Do you have your coffee? Because we’re going to talk about Caffeine!

In Google’s most recent Search Off the Record podcast episode, Gary Illyes, Google’s Webmaster Trends Analyst, shared how Google uses Caffeine to index webpages in search.

What is Google Caffeine?

Google Caffeine is Google’s indexing system. It acts as the bridge between Google’s crawler (Googlebot) and Google’s search index. While it has a multitude of functions, its main purpose is to read what Googlebot fetched from your website and turn it into a uniform HTML format, which it then indexes.

Think of it like the translators at the United Nations. Say the delegate from Denmark was speaking on stage to an audience from Thailand. The translator (Caffeine) would be turning the Danish language into Thai so that communication between the two parties can happen.

How does Google Caffeine work?

The first step is for Googlebot to pick up the information on your website and produce a protocol buffer.

What is a protocol buffer?

A protocol buffer, developed by Google, is a method of serializing structured data into a normalized structure. Here, it packages whatever Googlebot fetched, regardless of the original format, into a single consistent record that Caffeine can read. This is done to streamline indexing. A protocol buffer doesn’t make any changes to a website. It simply records what was fetched and passes it along. Google describes protocol buffers as “language-neutral” and “platform-neutral”.

After Googlebot produces the protocol buffer, Caffeine picks it up and starts to process the HTML. “Processing” here means that Caffeine reads through the HTML, which is why it’s important to have clean, functioning HTML on your website.

As it reads through the HTML, it will begin working through the structure that you’ve worked into your website – namely your header tags.

Header tags create structure. As Illyes notes: “We try to understand the styling that was applied on the h tags, so we can determine the relative importance of the h tags compared to each other.” That is why it’s important to use them appropriately. If you build a page made entirely of H4 tags, Google, and therefore Caffeine, will read everything on that page as equally important.
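To see the header structure a page presents, you can list its h tags in order. The following is a minimal sketch using Python’s standard library. The names HeadingOutline, heading_outline, and skipped_levels are made up for illustration, and this is not how Caffeine itself works (Illyes says Google also looks at styling, which this ignores).

```python
from html.parser import HTMLParser


class HeadingOutline(HTMLParser):
    """Collect (level, text) pairs for every h1-h6 element in a page."""

    HEADING_TAGS = {"h1", "h2", "h3", "h4", "h5", "h6"}

    def __init__(self):
        super().__init__()
        self.outline = []    # finished headings as (level, text)
        self._level = None   # level of the heading currently open, if any
        self._parts = []     # text fragments of the open heading

    def handle_starttag(self, tag, attrs):
        if tag in self.HEADING_TAGS:
            self._level = int(tag[1])
            self._parts = []

    def handle_data(self, data):
        if self._level is not None:
            self._parts.append(data)

    def handle_endtag(self, tag):
        if tag in self.HEADING_TAGS and self._level is not None:
            # Collapse runs of whitespace inside the heading text.
            text = " ".join("".join(self._parts).split())
            self.outline.append((self._level, text))
            self._level = None


def heading_outline(html):
    """Return the page's headings in document order."""
    parser = HeadingOutline()
    parser.feed(html)
    parser.close()
    return parser.outline


def skipped_levels(outline):
    """Return headings that sit more than one level below the previous heading."""
    problems = []
    previous = 0
    for level, text in outline:
        if level > previous + 1:
            problems.append((level, text))
        previous = level
    return problems
```

Running heading_outline on a page and passing the result to skipped_levels flags spots where, for example, an H4 directly follows an H2. A page built entirely of H4 tags shows up as a flat list with a single level.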

Can Google index PDFs?

Yes! Illyes revealed that Google can index a variety of formats including PDFs, spreadsheets, Word documents, and more. Caffeine translates these file types into HTML.

To streamline the process and improve your chances of being indexed, it may make sense to publish content in an on-page, text-based format as well as offering a downloadable PDF version for users. Illyes acknowledged that PDF, as a binary format, is not easy to process.

Do robots.txt files matter to Google Caffeine?

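Yes! Illyes remarked that robots directives are something that “we deeply care about.” Keep in mind that a robots.txt file controls what Googlebot may crawl, while “noindex” is set in a robots meta tag or HTTP header on the page itself. If Caffeine finds a “noindex” code, it will automatically stop reading the page and won’t index it.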
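Both kinds of robots rules can be checked before Google gets to them. The sketch below uses only Python’s standard library; the function names has_noindex and crawl_allowed are invented for this example, and it only looks at meta tags, not at the X-Robots-Tag HTTP header.

```python
from html.parser import HTMLParser
from urllib.robotparser import RobotFileParser


class RobotsMetaParser(HTMLParser):
    """Collect directives from <meta name="robots"> and <meta name="googlebot">."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = {key.lower(): (value or "") for key, value in attrs}
        if attr.get("name", "").lower() in {"robots", "googlebot"}:
            for directive in attr.get("content", "").split(","):
                if directive.strip():
                    self.directives.add(directive.strip().lower())


def has_noindex(html):
    """True if the page asks not to be indexed ("none" implies noindex)."""
    parser = RobotsMetaParser()
    parser.feed(html)
    parser.close()
    return "noindex" in parser.directives or "none" in parser.directives


def crawl_allowed(robots_txt, url, user_agent="Googlebot"):
    """True if the given robots.txt text lets user_agent fetch url."""
    rules = RobotFileParser()
    rules.parse(robots_txt.splitlines())
    return rules.can_fetch(user_agent, url)
```

A page needs to be crawlable for its “noindex” to be seen at all, so it is worth running both checks on any URL you want kept out of search.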

Does HTML in the head tag affect indexing?

Yes again! Illyes revealed that when the HTML reader runs into tags that don’t belong in the head, it will “close the head, right before those tags, and starts the body from there on.” Meta tags placed after that point may not be read as part of the head. Using the appropriate HTML and head structure increases indexability.
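The head-closing behavior in Illyes’s quote can be approximated with a small check. This is a simplified sketch, not Google’s parser: HEAD_ALLOWED lists the elements the HTML standard permits in the head, check_head is a made-up helper name, and the contents of template elements are not treated specially.

```python
from html.parser import HTMLParser

# Elements the HTML standard allows inside <head>.
HEAD_ALLOWED = {"base", "link", "meta", "noscript", "script", "style", "template", "title"}


class HeadChecker(HTMLParser):
    """Find the first tag in <head> that would make a parser end the head early."""

    def __init__(self):
        super().__init__()
        self.in_head = False
        self.first_invalid = None  # first tag that does not belong in <head>
        self.after_invalid = []    # metadata tags written after it, still inside <head>

    def handle_starttag(self, tag, attrs):
        if tag == "head":
            self.in_head = True
        elif self.in_head:
            if self.first_invalid is None:
                if tag not in HEAD_ALLOWED:
                    self.first_invalid = tag
            elif tag in HEAD_ALLOWED:
                self.after_invalid.append(tag)

    def handle_endtag(self, tag):
        if tag == "head":
            self.in_head = False


def check_head(html):
    """Return (first_invalid_tag, metadata_tags_after_it) for the page's head."""
    checker = HeadChecker()
    checker.feed(html)
    checker.close()
    return checker.first_invalid, checker.after_invalid
```

If check_head reports a div followed by a meta tag, that meta tag sits after the point where the head would be closed, so it is a good candidate to move above the offending element.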

How can you make Google crawl your site faster?

Google is relatively transparent about what works and what doesn’t. This latest podcast highlights the need for using header tags and HTML appropriately, favoring on-page content over PDFs, not misusing “noindex” codes, and being mindful of creating helpful content.

How do you know if Google crawled your site?

This information is available via Google Search Console using the URL inspection tool. This nifty tool will also let you know if your page was indexed. If it wasn’t, it’ll also tell you why.

Other takeaways from Gary Illyes:

Do meta keywords matter?

Nope! Google does not care about meta keywords! As Illyes says: “We don’t care about the meta keywords at all. At all.”

What are meta keywords?

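Meta keywords are keywords listed in a meta tag in the head of a page’s HTML, for example <meta name="keywords" content="coffee, espresso">. Visitors don’t see them on the page.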

Do Out-of-Stock pages affect SEO?

Yes, Illyes revealed that they can impact indexability. Instead of leaving a bare out-of-stock notice, either remove the page from your website or edit it to include a “subscribe for updates” option. If you have too many on-page “error” notifications, you run the risk of Google reading it as a soft 404 page. Illyes hinted that staying helpful to users, and not being misleading, is what’s important.

What is a soft 404 page?

A soft 404 page is a page that returns a 2xx status code but that Google thinks should be an error page. Illyes noted that this can be an error on Google’s side. He shared, for example, that if “you are writing an article about error pages in general, and you can’t… get it indexed… That’s sometimes because our error page handling systems misdetect your article, based on the keywords that you use.”
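To get a feel for why keyword-based detection can misfire, here is a deliberately crude sketch of a soft 404 check. Everything in it is an assumption made for illustration: the ERROR_PHRASES list, the min_words threshold, and the looks_like_soft_404 name are invented, and Google’s actual systems are not public.

```python
import re

# Hypothetical phrases that often appear on error or empty pages.
ERROR_PHRASES = (
    "page not found",
    "no longer available",
    "out of stock",
    "does not exist",
)


def looks_like_soft_404(status_code, body_text, min_words=50):
    """Guess whether a page with a success status reads like an error page.

    A page is flagged only if it returned a 2xx status, contains one of the
    error phrases, and has fewer than min_words words of text.
    """
    if not 200 <= status_code < 300:
        return False  # a real error status is not a "soft" 404
    text = body_text.lower()
    word_count = len(re.findall(r"\w+", text))
    has_error_phrase = any(phrase in text for phrase in ERROR_PHRASES)
    return has_error_phrase and word_count < min_words
```

A thin page that says “out of stock” and little else gets flagged, while a long article that merely mentions error pages does not. Remove the word-count condition and that article gets flagged too, which is the kind of misdetection Illyes describes.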

Ready to keep reading? Learn how Google uses backlinks to determine rankings in search results.