Many website publishers, especially recent ones, have had this bitter experience: these days, Google fails to index, or even simply crawl, many of their pages. Is this a temporary error, or a deeper problem born of the desire to fight spam and low-quality content? Here are some answers…
As we know, Google has been experiencing very serious problems indexing web pages for many months now:
- Especially on recent sites (but not only).
- On many French sites (especially in .fr), but not only.
Thus, on many sites, pages are either crawled but not indexed, or simply not crawled at all. In Search Console, such URLs show up as “Excluded” in the Coverage report, with the status “Discovered, currently not indexed” when they are awaiting crawl, or “Crawled, currently not indexed” when they have been crawled and are awaiting indexing. The phenomenon has become widespread enough to make clear that this is not an isolated error affecting a single website. It is a very strong trend in Google’s crawling and indexing right now.
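To diagnose these statuses at scale rather than page by page in Search Console, the public URL Inspection API can be queried programmatically. The sketch below only builds the request body (the endpoint and field names follow Google’s published API documentation; the URLs are placeholders, and an actual call would also require an OAuth 2.0 token for a verified property).

```python
import json

# Public endpoint of the Search Console URL Inspection API.
INSPECT_ENDPOINT = "https://searchconsole.googleapis.com/v1/urlInspection/index:inspect"

def build_inspection_request(page_url: str, site_url: str) -> dict:
    """Return the JSON body expected by urlInspection.index.inspect.

    page_url: the exact URL to inspect.
    site_url: the Search Console property that owns it.
    """
    return {"inspectionUrl": page_url, "siteUrl": site_url}

# Placeholder values for illustration only.
body = build_inspection_request("https://example.com/new-article",
                                "https://example.com/")
print(json.dumps(body))
# In a real call, the response's inspectionResult.indexStatusResult.coverageState
# field carries statuses such as "Crawled, currently not indexed".
```

Sending the request itself would be a simple authenticated POST of that body to the endpoint; the interesting part for bulk diagnosis is looping this over a sitemap’s URLs.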
Page quality as the first criterion
Many webmasters have asked me about this in recent weeks, sending me the address of a site that the engine’s robots would not crawl or index. I must say that among these, many were of very low quality:
- The content is too short.
- Articles from sites designed with a 100% SEO mindset.
- Articles written solely to link to a page on another site (whether from a link selling platform or otherwise).
For these types of pages, it is quite normal that Google has implemented an algorithm to sort the wheat from the chaff. That is something we have to accept. But the problem also persists for other pages (and especially other sites) that are perfectly valid and high-quality.
The tools are trying to remedy the situation, but…
Many tools have since appeared, most of them based on the Google Indexing API. I encourage you to read Daniel Roch’s article published this week on Réacteur, in which he tests them and tells us what he thinks of them.
Using them, and although the situation remains far from ideal, we do manage to improve things a little. And the very fact that these forced-indexing tools work (at least somewhat) exposes Google’s failure at this level. Indeed:
- Either the engine considers the content in question to be of low quality, in which case it should refuse to index it regardless of how it is submitted. If a page is denied indexing through natural channels (bot crawl, XML Sitemap, etc.) but accepted through the API, that is simply nonsense!
- Or it indexes the content via the API, which means the quality of that content is not in doubt, but which also demonstrates the engine’s current inability to crawl the web naturally and efficiently.
It could therefore be either a bug in the engine and its robots, or a flaw in its crawling system that prevents it from crawling websites, especially recent ones, cleanly and efficiently. You must admit that this is serious for an engine that aims to be the world leader in the field!
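For context, here is roughly what the forced-indexing tools mentioned above do under the hood: a POST to the Indexing API’s publish endpoint. The endpoint and payload shape follow Google’s public API documentation; authentication (a service-account OAuth token) is omitted here, and the sketch only builds the request instead of sending it.

```python
import json

# Public endpoint of the Google Indexing API.
PUBLISH_ENDPOINT = "https://indexing.googleapis.com/v3/urlNotifications:publish"

def build_publish_notification(url: str, updated: bool = True) -> dict:
    """Body for urlNotifications:publish.

    "URL_UPDATED" asks Google to (re)crawl and index the URL;
    "URL_DELETED" asks for its removal from the index.
    """
    return {"url": url, "type": "URL_UPDATED" if updated else "URL_DELETED"}

# Placeholder URL for illustration only.
payload = build_publish_notification("https://example.com/new-article")
print(json.dumps(payload))
# → {"url": "https://example.com/new-article", "type": "URL_UPDATED"}
```

Note that Google officially restricts this API to job postings and livestream structured data; the tools discussed above use it well beyond that scope, which may explain the later de-indexations mentioned below.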
(Note, however, that some URLs accepted via these API submission tools are sometimes subsequently de-indexed by the engine.)
We have been watching crawling capabilities deteriorate for several years now: first the numerous indexing bugs that marked recent months, and now this inability to crawl and index recent content. You could even say that Bing currently indexes the web much better than its historical competitor. Who would have dared say that a few years ago? Bing is even more innovative at this level, notably with the IndexNow protocol, which has been available for several months now.
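IndexNow is deliberately simple: notifying a participating engine of a changed URL is a single GET request carrying the URL and a site-owned key (the key must also be hosted as a .txt file at the site root so the engine can verify ownership). A minimal sketch, with placeholder values:

```python
from urllib.parse import urlencode

def build_indexnow_url(page_url: str, key: str,
                       endpoint: str = "https://www.bing.com/indexnow") -> str:
    """Build the single-URL IndexNow submission address.

    A plain GET to the returned URL tells the engine that page_url
    was added or changed. The key is a site-specific token, also
    published at https://<your-site>/<key>.txt for verification.
    """
    return f"{endpoint}?{urlencode({'url': page_url, 'key': key})}"

# Placeholder page and key for illustration only.
print(build_indexnow_url("https://example.com/new-article",
                         "aaaa1111bbbb2222"))
```

The protocol also defines a JSON POST variant for submitting URL lists in bulk, and a notified engine shares the ping with other participating engines.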
So what about today? After analyzing many sites that were struggling to get crawled or indexed, and running my own internal tests, I have come to the following conclusions:
- The current problems are so widespread that it is unbelievable Google would be unaware of them. There must therefore be a logical explanation.
- Google may currently be fine-tuning a filtering system designed to index only quality content. But it is an understatement to say that it is not yet mature, especially for recent content that has not yet sent the engine positive signals about the quality of the page, and above all of the site publishing it.
- If one of the quality-filtering criteria is, of course, based on analysis of the texts published on the web, it seems necessary to quickly obtain backlinks from a site “trusted” by Google (a site in which the engine has a certain amount of trust: old, never spammed, with great authority and legitimacy in its field, etc.). Every time we placed a link from a trusted site to a page that had indexing problems, that page was miraculously indexed within a day. However, this had no impact on the indexing of the target site’s other pages: indexing one page does not trigger the indexing of the others.
- Google is certainly trying to build safeguards against the potential flood of spam content written automatically by algorithms such as GPT-3. If the engine is, a priori, able today to distinguish automated content from texts written by humans, what will happen in a few months or a few years? It is therefore quite possible that Google is developing the appropriate algorithms, and that these work first on web pages with enough history to provide analyzable signals. Could Google’s incredible current situation mean that recently published content is being held in a queue, waiting to be processed by an algorithm that can do its job correctly on that type of page? We can imagine it, without being sure, of course.
In any case, we can hope the situation changes quickly, as it clearly does not project a positive image of Mountain View’s ability to control its search engine and keep up with the web’s current growth. Admittedly, a few years ago this was not the case. But the web was different, and so was the level of spam the engine had to handle (remember that Google detects 40 billion spam pages every day! The current evolution of SEO practices has something to do with this).
Is the web’s exponential evolution (the number of pages and the amount of information available online, and therefore the spam the engine is bombarded with) overtaking it? Or is this ultimately just a temporary incident, a situation that Google’s technical teams will quickly fix? The near future will surely tell us more. One thing is certain: the current situation must change completely if Google is to maintain its current hegemony.
A typical example of a recent (small) site with many pages that, although of good quality and showing no sign of spam, are neither crawled nor indexed. Source: Abondance