AI crawlers explained — how Perplexity, ChatGPT, and Claude read your site

AI crawlers share some DNA with Googlebot but serve fundamentally different purposes. Understanding what they do — and do not do — helps you prepare your site effectively.

What are AI crawlers?

AI crawlers are automated bots that traverse the web to gather content for AI companies' systems. Like traditional search crawlers, they follow links, request pages, and process the HTML they receive. Unlike traditional search crawlers, their primary goal is not to build a ranking index. It is to gather training data, build retrieval indexes for AI search products, or provide up-to-date context for AI-generated responses.

The distinction matters because it shapes what they collect and how they use it. A Googlebot visit contributes to your site's presence in Google Search. An AI crawler visit may contribute to your site's representation in an AI model's training data, a retrieval index for an AI search engine, or the context window of an AI assistant responding to a specific query.

Known AI crawlers

The major AI companies have published documentation for their primary crawlers. The most commonly encountered are:

  • GPTBot (OpenAI): Used to collect content that may be used to train OpenAI's models. User-agent token: GPTBot. OpenAI publishes its IP ranges publicly. Can be blocked in robots.txt with User-agent: GPTBot / Disallow: /.
  • OAI-SearchBot (OpenAI): A separate bot used specifically for ChatGPT Search (retrieval at inference time, not training). Distinct from GPTBot.
  • ClaudeBot (Anthropic): Anthropic's web crawler. Anthropic has published documentation on how to block it. User-agent token: ClaudeBot.
  • PerplexityBot (Perplexity): Perplexity's crawler for building and refreshing its AI search index. It is the retrieval component behind Perplexity's real-time web search.
  • Applebot-Extended (Apple): A robots.txt control token rather than a separate crawler. It does not fetch pages itself; it lets site owners opt out of having content crawled by Applebot used to train Apple's foundation models, including those behind Apple Intelligence. The base Applebot continues to crawl for Spotlight and Siri.

Additional crawlers exist from other AI companies and research organizations. The landscape is growing as more products incorporate web retrieval capabilities.
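
To see which of these bots actually visit a site, one option is to match the tokens above against the User-Agent header in server logs. The sketch below is a minimal illustration, not an official detection method: the function name, the lookup table, and the sample header strings are hypothetical, and real User-Agent strings vary by vendor and version. Applebot-Extended is left out because Apple documents it as a robots.txt token that does not crawl pages, so it should not appear in request logs.

```python
# Hypothetical lookup table built from the crawler tokens listed above.
KNOWN_AI_CRAWLERS = {
    "GPTBot": "OpenAI",
    "OAI-SearchBot": "OpenAI",
    "ClaudeBot": "Anthropic",
    "PerplexityBot": "Perplexity",
}


def identify_ai_crawler(user_agent):
    """Return (token, company) if a known crawler token appears in the
    User-Agent header, otherwise None. Matching is case-insensitive."""
    lowered = user_agent.lower()
    for token, company in KNOWN_AI_CRAWLERS.items():
        if token.lower() in lowered:
            return token, company
    return None


if __name__ == "__main__":
    # Illustrative header values only; they are not copied from vendor docs.
    samples = [
        "Mozilla/5.0 (compatible; GPTBot/1.0)",
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) Chrome/120.0",
    ]
    for ua in samples:
        print(ua, "->", identify_ai_crawler(ua))
```

A User-Agent header is trivial to spoof, so a match here is only a hint. Where a vendor publishes IP ranges, as the list above notes for OpenAI, checking the source address against them is a stronger signal.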

How they differ from Googlebot

Traditional search crawlers like Googlebot have a well-understood function: crawl pages, index their content, and use that index to rank pages in response to keyword queries. The output is a searchable index and a ranked list of results.

AI crawlers serve a broader set of functions, which leads to meaningful behavioral differences:

Attribute | Googlebot | AI crawlers (GPTBot, ClaudeBot, PerplexityBot)
Primary goal | Build a ranking index | Gather training data, build retrieval index, or provide context
Output | Position in Google Search results | Training corpus, retrieval index, or inference-time context
Respects robots.txt? | Yes | Yes (for major crawlers)
Frequency | Regular, ongoing crawling | Varies; some crawl continuously, some batch-crawl
What they read | Primarily HTML content | HTML, plain text, Markdown, structured data, llms.txt
Effect on site | Search ranking | AI training data, AI search citation, AI assistant accuracy

What they fetch

AI crawlers typically fetch the same types of content as any other web crawler: HTML pages, plain text files, and files linked from pages they have already fetched. In practice, this means they encounter:

  • HTML pages — the primary content of your site. They parse the text content and may extract structured data (schema.org JSON-LD, OpenGraph tags).
  • robots.txt — fetched first, before crawling any pages, to determine access rules.
  • llms.txt — if present at the domain root and not blocked by robots.txt, it will be fetched as a regular URL. Whether it is treated as a priority signal for further crawling depends on the crawler's implementation.
  • Sitemaps — if listed in robots.txt or discoverable from the site root, sitemaps help AI crawlers discover URLs efficiently.
  • Plain text and Markdown files — if linked from pages or sitemaps, these may be fetched. Documentation sites with raw Markdown outputs benefit here.
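
The fetch order described above (robots.txt first, then sitemaps and allowed URLs) can be sketched with Python's standard urllib.robotparser module. This is an illustration of how a compliant crawler might plan its requests, not a description of any vendor's implementation. The sample robots.txt, the example.com URLs, and the plan_fetches helper are all hypothetical.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt for illustration.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/

Sitemap: https://example.com/sitemap.xml
"""


def plan_fetches(robots_txt, user_agent, candidate_urls):
    """Parse robots.txt, then report which candidate URLs the given
    user agent may fetch and which sitemaps the file advertises."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    allowed = [url for url in candidate_urls if parser.can_fetch(user_agent, url)]
    # site_maps() returns None when no Sitemap lines are present.
    return {"allowed": allowed, "sitemaps": parser.site_maps() or []}


if __name__ == "__main__":
    plan = plan_fetches(
        SAMPLE_ROBOTS_TXT,
        "PerplexityBot",
        [
            "https://example.com/llms.txt",
            "https://example.com/docs/intro",
            "https://example.com/private/notes",
        ],
    )
    print(plan)
```

In this sample, llms.txt is treated like any other URL: it is fetchable only because no rule blocks it, which matches the point above that it must not be disallowed in robots.txt.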

How they use crawled data

The way crawled data is used varies significantly depending on the crawler's purpose:

  • Training data (GPTBot, ClaudeBot for training): Content is processed into a training corpus. The model learns patterns, facts, and language from this data. The influence on model behavior is diffuse — you cannot predict exactly how a specific page will affect model outputs. The quality and authority of your content matters more than any specific structural optimization.
  • Retrieval index (PerplexityBot, OAI-SearchBot): Content is indexed for real-time retrieval. When a user asks a question, the system searches this index for relevant passages, retrieves them, and feeds them to the language model as context before generating a response. This is retrieval-augmented generation (RAG) in practice. Your pages can be cited when passages from them rank among the most relevant results retrieved for the query.
  • Direct context loading (agent frameworks): Tools like Cursor read your llms.txt, then fetch the pages it links to, and load that content into the model's context window for a specific session. This is the most direct and predictable form of AI content consumption. When it happens, the pages you curated in llms.txt are the ones that influence the AI's answers.
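
The retrieval-index flow described above can be illustrated with a deliberately simplified sketch. It ranks passages by word overlap with the query and assembles the top results into a context block. Production systems generally use embeddings or a full search index instead, so treat the scoring here as a stand-in. All names, URLs, and passages are hypothetical.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def retrieve(query, passages, k=2):
    """Return up to k URLs whose passages share the most tokens with the
    query. Toy scoring: counts overlapping tokens, ties broken by URL."""
    query_counts = Counter(tokenize(query))
    scored = []
    for url, text in passages.items():
        overlap = sum((query_counts & Counter(tokenize(text))).values())
        if overlap:
            scored.append((overlap, url))
    scored.sort(key=lambda item: (-item[0], item[1]))
    return [url for _, url in scored[:k]]


def build_context(query, passages, k=2):
    """Assemble retrieved passages, labeled by source URL, followed by the
    question. This is the text a model would receive before answering."""
    blocks = [f"[{url}]\n{passages[url]}" for url in retrieve(query, passages, k)]
    return "\n\n".join(blocks) + f"\n\nQuestion: {query}"


if __name__ == "__main__":
    index = {
        "https://example.com/pricing": "Pricing starts at ten dollars per month for the basic plan.",
        "https://example.com/install": "Install the command line tool with one command.",
    }
    print(build_context("How much is the basic plan per month?", index, k=1))
```

Because each passage carries its source URL into the context, the model can attribute an answer to the page it came from, which is the mechanism behind citations in AI search products.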

How llms.txt helps

Given how AI crawlers work, llms.txt helps in a specific way: it gives crawlers and agent frameworks a curated shortcut to your most important content.

Without llms.txt, an AI system that wants to understand your site must either:

  • Crawl your entire sitemap and rank pages by some internal heuristic, or
  • Use general training data that may be outdated or inaccurate.

With a well-curated llms.txt, a crawler or agent can read one file and immediately know which 10 to 15 pages provide the most authoritative context. Instead of discovering your content through approximation, it gets a direct, owner-provided recommendation.

This matters most for:

  • Retrieval systems that want to index your most important pages first: llms.txt can serve as a priority hint for crawlers that choose to read it.
  • Agent frameworks (Cursor, Windsurf) that fetch llms.txt before building session context: the file directly shapes what context the AI has when answering questions about your product.
  • RAG pipelines that allow operators to specify source documents: llms.txt can serve as the source list.
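
As a rough sketch of how an agent or RAG pipeline might turn llms.txt into a source list, the example below extracts the links from a sample file. It assumes the Markdown layout described in the llms.txt proposal: an H1 title, a blockquote summary, and H2 sections containing link lists. The sample content, the URLs, and the parse_llms_txt helper are hypothetical, and real consumers may parse the file differently.

```python
import re

# Hypothetical llms.txt content for illustration.
SAMPLE_LLMS_TXT = """\
# Example Project

> Short summary of the project.

## Docs

- [Quick start](https://example.com/docs/quickstart.md): Install and first run
- [API reference](https://example.com/docs/api.md)

## Optional

- [Changelog](https://example.com/changelog.md): Release history
"""

# Matches list items of the form "- [title](url)" with an optional ": note".
LINK_RE = re.compile(r"^\s*[-*]\s*\[([^\]]+)\]\(([^)\s]+)\)(?::\s*(.*))?$")


def parse_llms_txt(text):
    """Return one dict per linked page, tagged with its H2 section."""
    section = None
    entries = []
    for line in text.splitlines():
        if line.startswith("## "):
            section = line[3:].strip()
            continue
        match = LINK_RE.match(line)
        if match:
            title, url, note = match.groups()
            entries.append(
                {"section": section, "title": title, "url": url, "note": note or ""}
            )
    return entries


if __name__ == "__main__":
    for entry in parse_llms_txt(SAMPLE_LLMS_TXT):
        print(entry["section"], "|", entry["title"], "|", entry["url"])
```

The resulting list of URLs is what a pipeline would then fetch and load, so the order and selection of links in the file is the part the site owner controls.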

Managing AI crawler access

You have several options for managing how AI crawlers interact with your site:

  1. Allow all AI crawlers (default): If your robots.txt does not explicitly block AI crawler user-agents, they will crawl your publicly accessible pages. This is the default behavior for most sites.
  2. Block training crawlers, allow retrieval crawlers: Some site owners want to prevent their content from being used in AI model training while allowing it to appear in AI search results. This requires blocking training-specific user-agents (e.g., GPTBot) while allowing retrieval-specific ones (e.g., OAI-SearchBot, PerplexityBot). Distinguishing between the two requires understanding each company's published user-agent documentation.
  3. Block all AI crawlers: Add Disallow rules for each AI crawler user-agent in robots.txt. Note that this also keeps your content out of AI search results and citations.
  4. Publish llms.txt to guide access: Whether or not you restrict AI crawlers, publishing a curated llms.txt tells compliant crawlers and agent frameworks which pages represent your site most authoritatively.
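
Option 2 above (block training crawlers, allow retrieval crawlers) might look like the robots.txt embedded in the sketch below, which uses Python's standard urllib.robotparser to check the rules before deployment. The policy text and the access_report helper are illustrative assumptions. User-agent tokens should be confirmed against each company's current documentation, and robots.txt is advisory: it only affects crawlers that choose to honor it.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical policy: block the training crawler, allow retrieval crawlers.
POLICY_ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: OAI-SearchBot
Allow: /

User-agent: PerplexityBot
Allow: /

User-agent: *
Allow: /
"""


def access_report(robots_txt, agents, url):
    """Map each user-agent token to whether robots.txt lets it fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {agent: parser.can_fetch(agent, url) for agent in agents}


if __name__ == "__main__":
    report = access_report(
        POLICY_ROBOTS_TXT,
        ["GPTBot", "OAI-SearchBot", "PerplexityBot", "ClaudeBot"],
        "https://example.com/docs/intro",
    )
    for agent, allowed in report.items():
        print(f"{agent}: {'allowed' if allowed else 'blocked'}")
```

In this sample ClaudeBot falls through to the wildcard group and is allowed. Blocking it as well would take its own User-agent group with Disallow: /, which is how option 3 extends the same file.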
