OpenAI Web Crawler: Everything You Need to Know

Recently, OpenAI, the company that made artificial intelligence a household name (hi, ChatGPT), revealed information about its web crawler. The news sparked several discussions and got the internet talking about AI.

This blog will cover everything you need to know about OpenAI’s web crawler.

What is a Web Crawler?

Before talking about the OpenAI web crawler, it’s important to understand what a web crawler is. A web crawler, also called a search engine bot or spider, downloads and indexes content from all over the Internet. The bot’s goal is to learn what (nearly) every web page contains so that the information can be retrieved when needed. These bots are called “web crawlers” because crawling is the technical term for automatically accessing a website and retrieving data with software.

Crawlers visit websites systematically to discover the content of each page so that it can be indexed, updated, and retrieved in response to a user’s search query. These bots are usually operated by search engines.
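To make that concrete, here is a minimal sketch of a crawler in Python. The crawl and LinkExtractor names and the ten-page cap are just for illustration; real crawlers also honour robots.txt, rate-limit their requests, and deduplicate far more carefully.

    # A minimal, illustrative crawler: fetch a page, store its content,
    # then queue the links it finds. Real crawlers also honour robots.txt,
    # throttle their requests, and handle errors much more carefully.
    from html.parser import HTMLParser
    from urllib.parse import urljoin
    from urllib.request import urlopen

    class LinkExtractor(HTMLParser):
        def __init__(self):
            super().__init__()
            self.links = []

        def handle_starttag(self, tag, attrs):
            # Collect the target of every anchor tag on the page.
            if tag == "a":
                self.links += [v for k, v in attrs if k == "href" and v]

    def crawl(seed_url, max_pages=10):
        queue, seen, index = [seed_url], set(), {}
        while queue and len(index) < max_pages:
            url = queue.pop(0)
            if url in seen:
                continue
            seen.add(url)
            try:
                html = urlopen(url, timeout=5).read().decode("utf-8", "replace")
            except Exception:
                continue  # skip unreachable or non-text pages
            index[url] = html  # "indexing" here is just storing the text
            parser = LinkExtractor()
            parser.feed(html)
            # Resolve relative links and queue them for later visits.
            queue.extend(urljoin(url, link) for link in parser.links)
        return index

Calling crawl("https://example.com") would fetch up to ten pages reachable from that seed and return a URL-to-content map, which is the raw material a search engine’s indexer would then process.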

By applying a search algorithm to the data gathered by web crawlers, search engines can produce the list of web pages that appears after a user types a query into Google, Bing, or another search engine. This allows search engines to serve relevant links in response to user search queries.

A website must be crawled and indexed before a search engine can rank it. If a site is never crawled and indexed, searchers cannot find it organically.

Here are a few examples of web crawlers used for search engine indexing:

  • Googlebot: the crawler for Google’s search engine
  • Bingbot: Microsoft’s crawler for Bing
  • Amazonbot: Amazon’s web crawler

OpenAI Web Crawler

Like the search engines mentioned above, OpenAI has launched its own web crawler, GPTBot, to collect AI training data. GPT-4, the model that powers ChatGPT, is already incredibly capable, and it is suspected that GPT-5, the next big release, will be trained on the data collected by the OpenAI web crawler.

The AI giant further asserted that GPTBot’s crawls are filtered to remove paywalled sources, personally identifiable information, and text that violates its policies.

When it published the GPTBot help page, OpenAI also documented a way to prevent the bot from scraping your website: a small tweak to the site’s robots.txt file stops its content from being shared with OpenAI.
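According to OpenAI’s documentation, adding the following two lines to a site’s robots.txt blocks GPTBot from the entire site:

    User-agent: GPTBot
    Disallow: /

The same file can also use Allow and Disallow rules for individual directories, so a site can open some sections to GPTBot while keeping others off-limits.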

The data the crawler collects should only help AI models gather more knowledge and become more accurate.

Sounds great.

However, when the news broke, there was considerable backlash, primarily because OpenAI had made no official announcement about the crawler.

Lots of website owners and content creators are choosing to block GPTBot. But why?

GPTBot: The Discussion

Of course, web crawlers are nothing new and are essential to the functioning of the modern internet. Websites are generally encouraged to give crawlers from Google and other search engines access in order to increase their web traffic.

However, in the case of GPTBot, people are opting out. Websites like The Verge have already added the robots.txt rule to stop the OpenAI crawler from gathering content for its LLMs. Neil Clarke, editor of the science fiction magazine Clarkesworld, declared on X (formerly known as Twitter) that the publication would block GPTBot as well.

There’s a reason why websites and content creators would choose to specifically keep OpenAI’s bot out of their digital content. AI is the future, and the data collected by GPTBot will only help make it more accurate. But along with that accuracy, it also makes GPT a fierce competitor to the very people producing that web content.

Unlike Google, which sends traffic to a website after crawling it, ChatGPT just summarises content from around the web without providing any citations, making it hard to identify the information’s original source.

Producers of free online content have good reason to believe that by letting OpenAI scrape their material to train future LLMs, they are only training a future competitor that will take users away from their websites.

As with most things in the AI debate, the choice to opt in or out cuts both ways. Websites and independent online content creators, especially those who publish for free, can face competition from these models later down the line.

Generative AI And a Question of Ethics

While generative AI models are extremely helpful and fun to work with, an important and recurring discussion concerns the ethics of training them. The data that trains AI LLMs is created by humans, so the line between inspiration and plagiarism becomes blurry.

For instance, a recent lawsuit against OpenAI contends that training its chatbot without permission on everyone’s writing, everything from books to articles available online, amounts to stealing. And this isn’t the only lawsuit OpenAI faces.

As AI advances and grows, so do the discussions around consent, collection, and copyright.