Google may expand the list of unsupported robots.txt rules in its documentation based on analysis of real-world robots.txt data collected through HTTP Archive.
Gary Illyes and Martin Splitt described the project on the latest episode of Search Off the Record. The work started after a community member submitted a pull request to Google’s robots.txt repository proposing that two new tags be added to the unsupported list.
Illyes explained why the team broadened the scope beyond the two tags in the PR:
“We tried to not do things arbitrarily, but rather collect data.”
Rather than add only the two tags proposed, the team decided to look at the top 10 or 15 most-used unsupported rules. Illyes said the goal was “a decent starting point, a decent baseline” for documenting the most common unsupported tags in the wild.
How The Research Worked
The team used HTTP Archive to study which rules websites use in their robots.txt files. HTTP Archive runs monthly crawls across millions of URLs using WebPageTest and stores the results in Google BigQuery.
The first attempt hit a wall. The team “quickly learned that no one is actually requesting robots.txt files” during the default crawl, meaning the HTTP Archive datasets don’t typically include robots.txt content.
After consulting with Barry Pollard and the HTTP Archive community, the team wrote a custom JavaScript parser that extracts robots.txt rules line by line. The custom metric was merged before the February crawl, and the resulting data is now available in the custom_metrics dataset in BigQuery.
What The Data Shows
The parser extracted every line that matched a field-colon-value pattern. Illyes described the resulting distribution:
“After allow and disallow and user agent, the drop is extremely drastic.”
Beyond those three fields, rule usage falls into a long tail of less common directives, plus junk data from broken files that return HTML instead of plain text.
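To make the field-colon-value idea concrete, here is a minimal Python sketch of that kind of line-by-line extraction. It is not the team’s parser, which was written in JavaScript and was not shown in the episode. The regular expression, the `count_fields` function, and the sample file are all assumptions made for illustration.

```python
import re
from collections import Counter

# Assumed pattern for this sketch: a field name, a colon, then a value.
FIELD_LINE = re.compile(r"^\s*([A-Za-z][A-Za-z0-9_-]*)\s*:\s*(.*)$")


def count_fields(robots_txt: str) -> Counter:
    """Count how often each field name appears in one robots.txt body."""
    counts = Counter()
    for raw_line in robots_txt.splitlines():
        # Drop trailing comments, which start with '#'.
        line = raw_line.split("#", 1)[0].strip()
        match = FIELD_LINE.match(line)
        if match:
            # Field names are compared case-insensitively.
            counts[match.group(1).lower()] += 1
    return counts


# Hypothetical file: four supported fields, one extra field, one stray HTML line.
sample = """\
User-agent: *
Disallow: /private/
Allow: /private/public/
Crawl-delay: 10  # not one of the four supported fields
<html><body>Not Found</body></html>
Sitemap: https://example.com/sitemap.xml
"""

for field, n in count_fields(sample).most_common():
    print(field, n)
```

The stray HTML line is skipped here because it does not start with a field name, but an HTML error page can still contain lines that happen to look like field-colon-value, which is one way junk of the kind described above could end up in the counts.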
Google currently supports four fields in robots.txt. Those fields are user-agent, allow, disallow, and sitemap. The documentation says other fields “aren’t supported” without listing which unsupported fields are most common in the wild.
Google has clarified that unsupported fields are ignored. The current project extends that work by identifying specific rules Google plans to document.
The top 10 to 15 most-used rules beyond the four supported fields are expected to be added to Google’s unsupported rules list. Illyes did not name specific rules that would be included.
Typo Tolerance May Expand
Illyes said the analysis also surfaced common misspellings of the disallow rule:
“I’m probably going to expand the typos that we accept.”
His phrasing implies the parser already accepts some misspellings. Illyes did not commit to a timeline or name specific typos.
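Site owners who would rather not rely on typo tolerance can check their own files for near-miss field names. The Python sketch below uses the standard library’s `difflib.SequenceMatcher` to flag a field that closely resembles a supported one. It does not reflect how Google’s parser decides which typos to accept. The `likely_typo_of` helper and its 0.8 threshold are assumptions chosen for illustration.

```python
from difflib import SequenceMatcher

SUPPORTED = ("user-agent", "allow", "disallow", "sitemap")


def likely_typo_of(field: str, threshold: float = 0.8):
    """Return the supported field that `field` most resembles, or None.

    The 0.8 threshold is an arbitrary choice for this sketch.
    """
    name = field.strip().lower()
    if name in SUPPORTED:
        return None  # spelled correctly, nothing to flag
    best, best_score = None, 0.0
    for candidate in SUPPORTED:
        score = SequenceMatcher(None, name, candidate).ratio()
        if score > best_score:
            best, best_score = candidate, score
    return best if best_score >= threshold else None


# Made-up inputs: two misspellings and one unrelated field name.
for field in ("disalow", "dissallow", "crawl-delay"):
    print(field, "->", likely_typo_of(field))
```

A similarity score is a blunt instrument, so anything it flags is worth a manual look rather than an automatic rewrite.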
Why This Matters
Search Console already surfaces some unrecognized robots.txt tags. If Google documents more unsupported directives, its public documentation could more closely reflect the unrecognized tags people already see in Search Console.
Looking Ahead
The planned update would affect Google’s public documentation and how disallow typos are handled. Anyone maintaining a robots.txt file with rules beyond user-agent, allow, disallow, and sitemap should audit for directives that have never worked for Google.
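An audit like that can start with a short script. The Python sketch below lists every line whose field is not one of the four that Google documents as supported. The `find_unsupported` function name and the sample file are hypothetical, and the extra fields in the sample are illustrative only, since the episode did not name specific rules.

```python
SUPPORTED_FIELDS = {"user-agent", "allow", "disallow", "sitemap"}


def find_unsupported(robots_txt: str):
    """Return (line_number, field, line) for fields outside the supported four."""
    findings = []
    for number, raw_line in enumerate(robots_txt.splitlines(), start=1):
        # Ignore trailing comments, which start with '#'.
        line = raw_line.split("#", 1)[0].strip()
        if ":" not in line:
            continue  # blank lines, comments, and anything without a field
        field = line.split(":", 1)[0].strip().lower()
        if field not in SUPPORTED_FIELDS:
            findings.append((number, field, raw_line.strip()))
    return findings


# Hypothetical robots.txt content used only to exercise the function.
example = """\
User-agent: *
Disallow: /tmp/
Crawl-delay: 5
Noindex: /drafts/
Sitemap: https://example.com/sitemap.xml
"""

for number, field, line in find_unsupported(example):
    print(f"line {number}: '{field}' is not a supported field -> {line}")
```

Other crawlers may honor fields that Google ignores, so a flagged line is a prompt to review the rule, not necessarily to delete it.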
The HTTP Archive data is publicly queryable on BigQuery for anyone who wants to examine the distribution directly.
Featured Image: Screenshot from YouTube.com/GoogleSearchCentral, April 2026.