Why the internet needs crawl neutrality

Today, one company, Google, controls nearly all of the world’s access to information on the internet. Its monopoly in search means that for billions of people, the gateway to knowledge, to products, and to the wider web is in the hands of a single company. Most agree that this lack of competition in search is bad for individuals, communities, and democracy.

Unbeknownst to many, one of the biggest obstacles to competing in search is a lack of crawl neutrality. The only way to build an independent search engine, and to have a chance of competing fairly against Big Tech, is to first crawl the Internet efficiently and effectively. However, the web is an actively hostile environment for upstart search engine crawlers: most websites allow only Google’s crawler and discriminate against other search engine crawlers like Neeva’s.

This critically important, yet often overlooked, issue does a great deal to prevent upstart search engines like Neeva from providing users with real alternatives, further reducing competition in search. Just as we needed net neutrality, today we need crawl neutrality. Without a change in policy and behavior, competitors in search like us will keep fighting with one hand tied behind our backs.

Let’s start from the beginning. Building a comprehensive index of the web is a prerequisite for competing in search. In other words, the first step in building the Neeva search engine is “downloading the Internet” via Neeva’s crawler, called Neevabot.

Here is where the trouble begins. For the most part, websites give only Google’s and Bing’s crawlers unfettered access while discriminating against other crawlers like Neeva’s. These sites either disallow everything else in their robots.txt files or (more commonly) say nothing in robots.txt but return errors instead of content to other crawlers. The intent may be to filter out malicious actors, but the consequence is throwing the baby out with the bathwater. And you can’t serve up search results if you can’t crawl the web.

This forces startups to spend inordinate amounts of time and resources coming up with workarounds. For example, Neeva implements a policy of “crawling a site as long as the robots.txt allows GoogleBot and does not specifically disallow Neevabot.” Even after a workaround like this, portions of the web that contain useful search results remain inaccessible to many search engines.
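The policy quoted above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not Neeva’s actual implementation: the helper names `names_agent` and `may_crawl` are hypothetical, the user-agent tokens "Googlebot" and "Neevabot" are taken from the text, and “specifically disallow” is interpreted here as “robots.txt contains a group that names Neevabot and that group disallows the URL.” It uses the standard library’s `urllib.robotparser`.

```python
from urllib.robotparser import RobotFileParser


def names_agent(robots_txt: str, agent: str) -> bool:
    """Return True if any User-agent line names `agent` specifically.

    Hypothetical helper: a simple line scan, since RobotFileParser does not
    expose whether a rule came from a named group or from the "*" group.
    """
    for line in robots_txt.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments
        field, _, value = line.partition(":")
        if field.strip().lower() == "user-agent" and agent.lower() in value.lower():
            return True
    return False


def may_crawl(robots_txt: str, url: str) -> bool:
    """Crawl if Googlebot is allowed and Neevabot is not specifically disallowed."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())

    # Condition 1: the site must let Googlebot fetch this URL.
    if not parser.can_fetch("Googlebot", url):
        return False

    # Condition 2: only a group that names Neevabot can veto the crawl;
    # a blanket "User-agent: *" disallow is deliberately ignored.
    if names_agent(robots_txt, "Neevabot"):
        return parser.can_fetch("Neevabot", url)
    return True


# Example robots.txt (made up for illustration).
GOOGLE_ONLY = """\
User-agent: Googlebot
Allow: /

User-agent: *
Disallow: /
"""

print(may_crawl(GOOGLE_ONLY, "https://example.com/article"))
```

Note that the sketch still honors any rule addressed to Neevabot by name; it sets aside only the catch-all group, which is the case the paragraph above describes.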

As a second example, many websites allow a non-Google crawler via robots.txt but block it in other ways, either by throwing various kinds of errors (503s, 429s, …) or by rate throttling. To crawl these sites, one has to deploy workarounds like “obfuscate by crawling using a bank of proxy IPs that rotate periodically.” Legitimate search engines like Neeva are loath to deploy adversarial workarounds like this.

These roadblocks are often aimed at malicious bots, but they have the effect of stifling legitimate search competition. At Neeva, we put a lot of effort into building a well-behaved crawler that respects rate limits and crawls at the minimum rate needed to build a great search engine. Meanwhile, Google has carte blanche. It crawls the web at a rate of 50B pages per day. It visits every page on the web once every three days, and taxes network bandwidth on all websites. This is the monopolist’s tax on the Internet.
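As a rough illustration of what “respecting rate limits” can mean in practice, the sketch below picks the wait before the next request to a host based on the last HTTP status. It is an assumption-laden example, not a description of Neevabot: the function name `next_delay`, the default delays, and the doubling strategy are all hypothetical. It honors a `Retry-After` header only when the header gives a number of seconds (the header may also carry an HTTP date, which this sketch does not parse).

```python
from typing import Optional


def next_delay(
    status: int,
    retry_after: Optional[str],
    previous_delay: float,
    base_delay: float = 1.0,     # assumed default gap between requests, in seconds
    max_delay: float = 3600.0,   # assumed upper bound on any single wait
) -> float:
    """Return how many seconds to wait before the next request to the same host."""
    if status in (429, 503):
        # The server asked us to slow down. Prefer its own hint when it is
        # expressed in seconds.
        if retry_after is not None and retry_after.strip().isdigit():
            return min(float(retry_after.strip()), max_delay)
        # Otherwise back off exponentially from the previous delay.
        return min(max(previous_delay, base_delay) * 2, max_delay)
    # Any other response: return to the base crawl rate.
    return base_delay


print(next_delay(429, "120", 1.0))  # server-provided hint, in seconds
print(next_delay(503, None, 4.0))   # no hint, so double the previous delay
```

A crawler following this pattern slows down precisely when a site signals overload, which is the opposite of the proxy-rotation workaround described earlier.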

For the lucky crawlers among us, a set of well-wishers, webmasters, and well-meaning publishers can help get your bot whitelisted. Thanks to them, Neeva’s crawl now runs at hundreds of millions of pages a day, on track to hit billions of pages a day soon. Even so, this still requires identifying the right individuals at these companies to talk to, emailing and cold calling, and hoping for goodwill from webmasters on webmaster aliases that are typically ignored. It is a temporary fix that does not scale.

Gaining permission to crawl shouldn’t be about who you know. There should be a level playing field for anyone who competes and follows the rules. Because Google is a monopoly in search, websites and webmasters face an impossible choice: either let Google crawl them, or don’t show up prominently in Google results. As a result, Google’s search monopoly causes the Internet at large to reinforce the monopoly by giving Googlebot preferential access.

The internet should not be allowed to discriminate between search engine crawlers based on who they are. Neeva’s crawler is capable of crawling the web at the speed and depth that Google’s does. There are no technical limitations, just anti-competitive market forces making it harder to compete fairly. And if it’s too much additional work for webmasters to distinguish bad bots that slow down their websites from legitimate search engines, then those with free rein, like Googlebot, should be required to share their data with responsible actors.

Regulators and policymakers need to step in if they care about competition in search. The market needs crawl neutrality, just as it needs net neutrality.

Vivek Raghunathan is the cofounder of Neeva, an ad-free, private search engine. Asim Shankar is the Chief Technology Officer of Neeva.
