Duplicate Content and Canonicalization – Oh my!

What is duplicate content and why does it matter? And what the heck is canonicalization?

In the latest episode of Search Off the Record, John Mueller (Search Relations Lead), Martin Splitt (Developer Advocate), and Gary Illyes (Search Advocate) discuss canonicalization and duplicate content.

Here are the key takeaways:

How does Google find duplicate content?

Google finds duplicate content by reducing each page’s content to a “checksum” and comparing the results. A checksum, as described by Martin Splitt, is a “fingerprint” of the content. If you think of content as a string of 1s and 0s, Google compares these strings to find duplicates.
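To make the checksum idea concrete, here is a minimal Python sketch. It is not Google's actual algorithm, which the episode does not spell out. The choice of SHA-256, the whitespace and case normalization, and the function name content_checksum are all assumptions made for illustration.

```python
import hashlib


def content_checksum(text: str) -> str:
    """Return a hex fingerprint of the text (hypothetical helper).

    Assumption: lowercasing and collapsing whitespace is enough
    normalization for this toy example.
    """
    normalized = " ".join(text.lower().split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


page_a = "Canonical URLs help search engines pick one version of a page."
page_b = "canonical urls   help search engines pick one version of a page."
page_c = "Redirects send visitors from one URL to another."

# Identical content (after normalization) yields identical fingerprints.
print(content_checksum(page_a) == content_checksum(page_b))  # True
# Different content yields different fingerprints.
print(content_checksum(page_a) == content_checksum(page_c))  # False
```

An exact hash like this only catches pages that are identical after normalization. Catching near-duplicates would need a different kind of fingerprint.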

Things like footers or terms of service don’t show up as duplicate content because Google’s algorithms are advanced enough to recognize them and remove them from the equation.
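One simple way to picture taking footers "out of the equation" is to drop any line that appears on every page before fingerprinting what is left. The Python sketch below is a toy heuristic under that assumption, not a description of how Google does it. The helper names and sample pages are hypothetical.

```python
import hashlib
from collections import Counter


def strip_shared_lines(pages):
    """Drop lines that appear on every page (e.g. a shared footer)."""
    line_counts = Counter()
    for text in pages.values():
        # Count each distinct line once per page.
        line_counts.update(set(text.splitlines()))
    shared = {line for line, n in line_counts.items() if n == len(pages)}
    return {
        url: "\n".join(l for l in text.splitlines() if l not in shared)
        for url, text in pages.items()
    }


def checksum(text):
    """Fingerprint the remaining text (SHA-256 is an assumption)."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


footer = "(c) Example Co. Terms of service"
pages = {
    "/cereal": "Best cereal guide\nOats are great.\n" + footer,
    "/cereal-print": "Best cereal guide\nOats are great.\n" + footer,
    "/contact": "Contact us\nEmail the team.\n" + footer,
}

core = strip_shared_lines(pages)
print(footer in core["/contact"])  # False
print(checksum(core["/cereal"]) == checksum(core["/cereal-print"]))  # True
print(checksum(core["/cereal"]) == checksum(core["/contact"]))  # False
```

This heuristic breaks down if every page in the set is a duplicate, since all lines would then count as shared.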

Why is duplicate content bad?

Duplicate content is bad because users don’t want to read the same thing multiple times. Think of it this way: If you went to the grocery store and there was only one type of cereal in the cereal aisle, you’d get bored of it fast and want something new. The same thing goes for content creation.

Not only that, but Google only has so much space for content. Because user experience is so important to Google, having fresh, high-quality content that satisfies users’ queries and unstated intent is important.

What is canonicalization?

Canonicalization refers to the act of setting a lead page for similar content. Since duplicate pages can impact ranking, this distinction makes it clear to Google which page should take the lead; otherwise, the pages will be seen as equally important.

By using a canonical URL, you help Google know that it should focus its crawl energy on that URL instead of on the pages with less important, similar content.

What is a canonical URL?

A canonical URL is the page that Google picks as the best page among your duplicate content. You can check your URLs with Google’s URL Inspection Tool in Google Search Console.

Can you pick which of your pages is canonical?

Sorta. Illyes shared that there are over 20 factors that go into determining duplicate and canonical content. If you know which page you want to take the lead among your duplicates, make sure it’s on an HTTPS URL, is included in the sitemap, or has a rel=canonical attribute.
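If you want to check which URL a page declares as canonical, a short script can read the rel=canonical link from its HTML. The sketch below uses only Python's standard library. The class name CanonicalFinder, the sample markup, and the example.com URL are made up for illustration.

```python
from html.parser import HTMLParser


class CanonicalFinder(HTMLParser):
    """Collect the href of the first <link rel="canonical"> tag."""

    def __init__(self):
        super().__init__()
        self.canonical = None

    def handle_starttag(self, tag, attrs):
        attr_map = dict(attrs)
        # rel may hold several space-separated values, or be missing.
        rel_values = (attr_map.get("rel") or "").lower().split()
        if tag == "link" and "canonical" in rel_values and self.canonical is None:
            self.canonical = attr_map.get("href")


# Hypothetical markup for a print-friendly duplicate of a guide page.
html = """
<html><head>
  <title>Cereal guide (print version)</title>
  <link rel="canonical" href="https://www.example.com/cereal-guide">
</head><body>...</body></html>
"""

finder = CanonicalFinder()
finder.feed(html)
print(finder.canonical)  # https://www.example.com/cereal-guide
```

Keep in mind that a declared canonical is a hint to Google. As noted above, it is one of many factors, so the URL Google actually selects can differ.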

Are redirects considered duplicate content?

No, Google understands and indexes redirect pages appropriately.

What’s the difference between duplicate content and canonicalization?

Duplicate content is the overall term for all of the repeated content. Canonicalization is what happens within a group of related duplicates: one page is chosen as the lead page for that group.

Interested in learning how Google indexes your website? Read on…