The Massive Google Documents Leak: What It Reveals About Google’s Search Algorithm

Posted By Gaurav | 30-May-2024 | Google Updates
The news of the huge Google document leak is all around the internet. This is probably the biggest story in the history of Google Search and SEO. Let us dive deeper into what this document leak is and what we get to know from it.

In the first week of May, some documents were anonymously shared with Rand Fishkin, the co-founder of SparkToro. According to the source (since identified as Erfan Azimi), these documents come from Google’s internal Content API Warehouse. Their authenticity is not yet confirmed, but ex-Google employees believe they may be real. The documents reveal a lot about the features Google considers when ranking websites in the search results.

What Is the Google Documents Leak?

An automated bot published thousands of documents on GitHub in March 2024, and these documents were anonymously shared with Fishkin in May. The source claims that they come from Google’s internal Content API Warehouse. Google has not commented on this, but ex-Google employees have said that the documents appear authentic.

These documents contain information about how Google’s search algorithms work and which ranking factors Google considers. They say nothing about how much weight each element carries, but they do give us an idea of which factors matter. Many claims in these documents contradict statements that Googlers have been making for several years.

What’s Inside These Documents?

The leaked API documentation covers 2,596 modules and 14,014 attributes. As per the documents, the information is accurate as of March 2024. Here’s what we know about these leaked internal Google documents:

1. Content Demotions: Google can demote the content of a web page for various reasons, such as:

  • SERP signals indicating user dissatisfaction
  • Links that do not match the target site
  • Location or product reviews
  • Exact match domains
  • Sexually explicit content

2. Importance of Links: The relevance and diversity of links still matter. As per these documents, PageRank is among Google’s ranking factors, and the PageRank of a website’s home page is still considered for every document. Google had earlier claimed that links are not among the top three ranking factors, and that may be true. The documents do not tell us how much weight links hold, only that links matter.

3. Change History of Web Pages: The documents also reveal that Google keeps a copy of every change ever made to a web page it has indexed. However, when analyzing a link, Google considers only the last 20 changes made to the page.

4. Clicks Matter: With Google’s continuous algorithm updates, we already know that creating high-quality content and ensuring the best user experience is important. These documents confirm the importance of user experience. Google uses various click measurements, including good clicks, bad clicks, unsquashed clicks, and last longest clicks. The documents also indicate that very long documents may get truncated, while short ones are scored on their originality.

5. Entities Matter: It seems that Google still stores author information associated with a web page’s content and tries to determine whether an entity is actually the author of that document.

6. Site Authority: In 2011, Google stated that low-quality content in any part of a website can impact the entire website’s rankings, which suggested that it used something like a site authority score. Since then, Google has denied using any such score. As per the leaked documents, however, Google still uses something like ‘siteAuthority.’

7. Chrome Data Matters: A module called ‘ChromeInTotal’ from these documents reveals that Google uses Chrome data in order to rank web pages in search results.

8. Small Sites: The documents also mention a feature called ‘smallPersonalSite.’ This suggests that Google can boost or demote small personal sites or blogs via a Twiddler. Here, a Twiddler is a re-ranking function that can change the ranking of a page.

9. Whitelists: Some modules in these API documents reveal that Google whitelists specific domains related to COVID and elections. This is not entirely new, as Google is already known to keep ‘exception lists’ for cases where specific algorithms mistakenly impact websites.

10. Freshness of Content Matters: Google considers the dates when ranking websites. It looks for the dates in the byline, URL, and on-page content, as per the documents.

11. Page Titles are Important: Google uses a feature called ‘titlematchScore’ to determine how well a page’s title matches the user query and ranks pages accordingly.

12. Relevance of Topics: Google uses specific features called siteRadius and siteFocusScore to determine whether a page’s content is relevant to the website or not. 
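
To make two of the ideas above more concrete, here is a toy Python sketch of a title-to-query match score and a Twiddler-style re-ranking step. This is only an illustration under our own assumptions, not Google’s implementation: the leaked documents do not describe how ‘titlematchScore’ is computed or how a Twiddler weighs anything. The names `Result`, `toy_title_match`, `toy_twiddler`, `base_score`, and `small_site_factor` are all hypothetical.

```python
import re
from dataclasses import dataclass


@dataclass
class Result:
    url: str
    title: str
    base_score: float                     # hypothetical score from an earlier ranking stage
    is_small_personal_site: bool = False  # hypothetical flag, loosely inspired by 'smallPersonalSite'


def tokenize(text):
    """Lowercase the text and return its alphanumeric words as a set."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def toy_title_match(title, query):
    """Fraction of query words that also appear in the title (0.0 to 1.0).

    Assumption: simple word overlap; the real signal is not documented.
    """
    query_words = tokenize(query)
    if not query_words:
        return 0.0
    return len(query_words & tokenize(title)) / len(query_words)


def toy_twiddler(results, query, small_site_factor=1.0):
    """Re-rank results: adjust each base score, then sort best-first.

    small_site_factor > 1 boosts small personal sites, < 1 demotes them.
    """
    def adjusted(result):
        score = result.base_score * (1.0 + toy_title_match(result.title, query))
        if result.is_small_personal_site:
            score *= small_site_factor
        return score

    return sorted(results, key=adjusted, reverse=True)


if __name__ == "__main__":
    results = [
        Result("https://example.com/a", "Best hiking boots reviewed", 0.6),
        Result("https://example.org/b", "My hiking diary", 0.7, is_small_personal_site=True),
    ]
    for r in toy_twiddler(results, "best hiking boots", small_site_factor=1.5):
        print(r.url)
```

The point of the sketch is the shape of the idea: a re-ranking function takes results that already have a score and reorders them using extra signals, so the same page can move up or down without its underlying score changing.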

Summing Up

It’s still not clear whether the documents were leaked or discovered. Some experts believe that the internal documents were accidentally pushed live from Google’s internal code base and were then discovered. Also, there is no confirmation that the information contained in these API documents is authentic. However, Fishkin has spoken with ex-Googlers, and based on their responses, we think the documents may well be real.

While some modules and ranking factors revealed by these documents are consistent with what we already know, some of them are real shockers. Numerous statements contradict what Google professionals have been claiming for the last several years.

Stay tuned as we will keep you updated on the latest information on this.
 

Gaurav Yadav
SEO expert

Gaurav Yadav is a skilled SEO expert with over 8 years of experience in digital marketing. He specializes in technical SEO, content strategy, and link building, and has a proven track record of driving organic traffic growth for a diverse range of clients. With his expertise in various verticals, he can execute industry-specific SEO strategies for SAAS, BFSI, healthcare, lifestyle, and education.
