AI crawlers are quietly reshaping how websites and products show up in search, chatbots, and AI shopping experiences. If you run a store, a content site, or any kind of online business, these bots are already visiting you, even if you've never heard of them.
At Product Registry, we track these crawlers so merchants can see who is hitting their product data, why, and how often. This post is the home for a constantly updated list of AI crawlers and their user‑agent strings.
What are AI crawlers, and why do they exist?
A crawler (or "bot") is a program that visits web pages automatically. It sends a small piece of text called a user‑agent string with every request. That string is how you see "who" is visiting in your logs.
Traditional search engines (like Google or Bing) have done this for years to build search indexes. Now a new class of crawlers does something extra:
- Train AI models - Bots like GPTBot (OpenAI) and ClaudeBot (Anthropic) copy public pages so LLMs can learn from them.
- Power AI search & chat answers - Bots like OAI‑SearchBot (OpenAI), Claude‑SearchBot (Anthropic) and PerplexityBot index content so AI answers can quote or link to your pages.
- Fetch pages on demand for users - Agents like ChatGPT‑User and Claude‑User only load your page when a human asks the AI a question that needs your content.
On top of that, you still have generic crawlers (Googlebot, Bingbot, Applebot) that index the web and then reuse that data inside AI features like Gemini, Bing Copilot, Apple Intelligence, and more.
Most reputable AI crawlers say they respect robots.txt, the simple text file at yourdomain.com/robots.txt that tells bots what they may fetch. But not every crawler is well‑behaved, and some are now accused of stealth crawling (more on that below).
The different types of AI crawlers
Think about AI crawlers in four buckets. This makes it easier to decide which ones you want to allow, limit, or block.
1. AI‑specific crawlers for background tasks (indexing & training)
These bots run in the background, hitting millions of pages a day, and collecting data to train foundation models. Common examples include:
- GPTBot (OpenAI) for training models like ChatGPT.
- ClaudeBot (Anthropic) for training Claude.
One interesting example is CCBot, the Common Crawl bot. Although it isn't branded as "AI," the public dataset it builds by crawling the open web is widely used in AI training pipelines.
These bots usually don't care about "real‑time" freshness. They care about coverage: getting a copy of as much of the public web as possible.
2. AI‑specific crawlers triggered by user actions
These are "on‑demand" bots. They only hit your site when a human uses an AI product:
- ChatGPT‑User - When someone asks ChatGPT something that needs your site, this agent fetches it on their behalf.
- Claude‑User - Same idea for Anthropic's Claude.
- Perplexity (assistant UA) - Used when Perplexity's AI assistant browses specific pages in response to a query.
- DuckAssistBot - Activated by DuckDuckGo's DuckAssist feature to grab content from trusted sources.
You can think of these as "AI browser tabs" opened by the user, not background scrapers. They still respect robots.txt (for the major vendors), but blocking them can mean your content won't be referenced when users ask questions.
3. Generic crawlers that also feed AI
These crawlers aren't "AI‑only," but their data increasingly ends up in AI products:
- Googlebot - Primary crawler for Google Search; its index powers Gemini / AI Overviews.
- Bingbot - Primary crawler for Bing Search and Microsoft Copilot.
- Applebot - Apple's crawler for Spotlight, Siri suggestions, and now part of Apple Intelligence.
Most sites keep these allowed, because blocking them means disappearing from traditional search as well.
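The first three buckets lend themselves to a simple token-to-category lookup. This is a sketch using the user-agent tokens from this post; the grouping is our editorial call, not an official taxonomy:

```python
# Assumed mapping of user-agent tokens to the buckets described above.
BUCKETS = {
    "training": {"GPTBot", "ClaudeBot", "CCBot", "Bytespider"},
    "on_demand": {"ChatGPT-User", "Claude-User", "DuckAssistBot"},
    "ai_search_index": {"OAI-SearchBot", "Claude-SearchBot", "PerplexityBot"},
    "generic_search": {"Googlebot", "bingbot", "Applebot"},
}

def classify(user_agent: str) -> str:
    """Classify a raw user-agent string into one of the buckets, or 'unknown'."""
    for bucket, tokens in BUCKETS.items():
        if any(token in user_agent for token in tokens):
            return bucket
    return "unknown"  # a human browser -- or possibly a stealth crawler

print(classify("Mozilla/5.0; compatible; GPTBot/1.0"))  # training
print(classify("ChatGPT-User/1.0"))                     # on_demand
```

The fourth bucket, by definition, never shows up in a lookup like this, which is why it needs different tooling.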
4. Stealth crawlers (undeclared or misleading)
Here's where things get messy.
Some traffic doesn't clearly identify itself as a bot, or it switches its user‑agent when blocked. Cloudflare and others have published research accusing Perplexity's systems of:
- Crawling blocked sites using generic Chrome browser user‑agents
- Rotating IP addresses and network identifiers
- Ignoring robots.txt and even Web Application Firewall rules
Perplexity disputes parts of those reports, but the takeaway for site owners is simple: some AI traffic may not self‑identify at all. You'll see it as "normal browsers" in your logs.
Bytespider has also been reported as very aggressive, sometimes overwhelming smaller sites, even when they try common blocking methods.
For this class of traffic, robots.txt alone is often not enough; people turn to firewalls, rate limiting, and AI‑specific bot filters.
AI crawler & user‑agent cheat sheet (December 2025)
Below is a starter table of the main AI‑linked crawlers we see most often in logs across the web and in Product Registry data. We'll keep this updated over time.
Important:
- "Respects robots.txt?" is based on official docs + public reporting as of December 2025.
- "Yes" = vendor explicitly documents robots.txt support and no strong evidence to the contrary.
- "Mixed/Contested" = reports of bad behavior or stealth crawling exist.
- Always verify against your own logs and risk tolerance.
| Name | Vendor | User‑Agent (token you'll see) | Respects robots.txt?* | Comments |
|---|---|---|---|---|
| GPTBot | OpenAI | GPTBot (e.g. ... GPTBot/1.0; +https://openai.com/gptbot) | Yes - documented, opt‑out via User-agent: GPTBot | Background crawler for training OpenAI models like ChatGPT. Can be fully or partially blocked via robots.txt. |
| ChatGPT‑User | OpenAI | ChatGPT-User | Yes (shares policy with GPTBot) | Runs on demand when ChatGPT needs to load a page for a user (plugins/browsing). Not a bulk crawler. |
| OAI‑SearchBot | OpenAI | OAI-SearchBot (sometimes inside a Chrome‑like UA) | Yes | Used for SearchGPT / ChatGPT search indexing (showing your site in search‑style results, not for training). Controlled by User-agent: OAI-SearchBot. |
| ClaudeBot | Anthropic | ClaudeBot | Yes (per docs) | Background crawler for Claude training data. Anthropic shows how to opt out via robots.txt. |
| Claude‑SearchBot | Anthropic | Claude-SearchBot | Generally yes; some confusion reported | Used to build an index for Claude's "search the web" feature. Some site owners report delays between changes in robots.txt and actual blocking. |
| Claude‑User | Anthropic | Claude-User | Yes | On‑demand access when a user asks Claude something that needs your page. Similar role to ChatGPT‑User. |
| PerplexityBot | Perplexity AI | PerplexityBot (often inside a Mozilla UA) | Officially yes, practically mixed | Perplexity says it respects robots.txt, but Cloudflare and others have accused Perplexity of bypassing blocks in some cases. Treat with care if your content is sensitive. |
| Perplexity‑User | Perplexity AI | Perplexity/1.0 | Unclear / mixed | Used for user‑driven browsing by Perplexity's assistant. Reports suggest this can also be involved in "stealth"‑style access depending on configuration. |
| Google‑Extended | Google | Google-Extended | Yes | Lets you allow Google Search while opting out of AI training for Gemini / Vertex. Blocking Google-Extended does not remove you from regular search results. |
| Applebot‑Extended | Apple | Applebot-Extended | Yes | Controls whether your content can be used for training Apple Intelligence foundation models, separate from normal Applebot indexing. |
| CCBot | Common Crawl | CCBot/2.0 (or similar) | Yes | Non‑profit crawler that builds open web datasets widely used in AI training and research. You can block it via User-agent: CCBot. |
| Bytespider | ByteDance (TikTok) | Bytespider | Mixed / often problematic | Used for TikTok and ByteDance AI models. Many reports describe very heavy crawling and difficulty blocking it with robots.txt alone; many sites resort to WAF rules or full user‑agent blocks. |
| DuckAssistBot | DuckDuckGo | DuckAssistBot | Yes | Real‑time crawler powering DuckDuckGo's DuckAssist AI‑assisted answers, which prominently cite sources. Opting out via robots.txt does not affect your normal DuckDuckGo search ranking. |
| Googlebot | Google | Googlebot (various desktop/mobile strings) | Yes | Standard Google search crawler. Its index is reused across many AI features (Gemini, AI Overviews, etc.). Most sites keep this allowed. |
| Bingbot | Microsoft | bingbot/2.0 etc. | Yes | Core crawler for Bing Search and Microsoft Copilot. Blocking it reduces visibility in both classic search and AI chat. |
| Applebot | Apple | Applebot | Yes | Indexes content for Siri, Spotlight, and Safari search, and feeds Apple Intelligence features alongside Applebot‑Extended. |
| Amazonbot | Amazon | Amazonbot | Officially yes (docs; some reports of issues) | Amazon's crawler for Alexa and other Amazon services. Documented as respecting robots.txt allow/disallow rules, but it doesn't honor crawl-delay, and admins sometimes see aggressive or spoofed "Amazonbot" traffic, so it's worth validating via IP or reverse DNS. |
| Google‑CloudVertexBot | Google | Google-CloudVertexBot (often in a Googlebot‑style UA) | Yes | Crawls sites when a site owner connects content into Vertex AI / Vertex AI Search agents. Robots rules for Google-CloudVertexBot (and Googlebot) control these fetches and don't affect normal Google Search indexing. |
| GoogleOther | Google | GoogleOther | Yes | Generic Google crawler used for internal products and one‑off research crawls rather than main search indexing. Blocking it doesn't change your Google Search rankings. |
| PetalBot | Huawei | PetalBot | Generally yes | Official crawler for Huawei's Petal Search and Huawei Assistant / AI Search. Indexes PC and mobile sites and is widely described as a legitimate search bot that generally respects robots.txt directives. |
| TikTokSpider | ByteDance (TikTok) | TikTokSpider | Unclear / likely yes; verify | TikTok's crawler that fetches pages when users share external URLs, mainly to build link previews and metadata. Tools suggest you can govern it via robots.txt, but TikTok doesn't provide strong public docs, so monitor request rates and behavior yourself. |
| MistralAI‑User | Mistral | MistralAI-User | Mixed / contested | On‑demand browsing agent for Mistral's Le Chat, used when a user asks it to open a web page. Some directories mark it as compliant with robots.txt, while others say it behaves more like a browser session that ignores robots; treat it as assistant‑style traffic and enforce policy via UA / WAF rules if you need strict blocking. |
| GoogleAgent‑Mariner | Google | GoogleAgent-Mariner | Unclear (no official robots policy yet) | Experimental Google AI agent that can drive a browser, click buttons, and complete multi‑step tasks on behalf of a user. It self‑identifies via GoogleAgent-Mariner, but Google hasn't yet published clear robots‑compliance rules, so handle it like an automated browser with UA‑based or IP‑based controls. |
| YouBot | You.com | YouBot | Mixed / contested | Core crawler for You.com's AI‑first search and assistant. Some descriptions say it follows standard crawling protocols, but security tools flag it as not respecting robots.txt, so treat compliance as "maybe" and check behavior against your own logs and firewall rules. |
| Timpibot | Timpi | Timpibot | No (often reported ignoring robots) | Decentralized crawler for Timpi's search index and LLM training. Runs from many independent nodes, is frequently reported as aggressive, and is widely treated as not respecting robots.txt, so most people block it via firewall or web‑server rules rather than relying on robots alone. |
| omgili / omgilibot | Webz.io (Omgili) | omgili/0.5 +http://omgili.com and omgilibot/0.4 +http://omgili.com | No (commonly flagged) | Specialized crawler for forums and user‑generated content whose data is resold (including for AI/sentiment analysis). Often listed as not respecting robots.txt; treat it as a data‑scraping bot and block it if you don't want your community content in third‑party datasets. |
| Diffbot | Diffbot | Diffbot | Mixed (uses robots, but can override) | Commercial AI scraping service that turns pages into structured data and a knowledge graph for clients and AI training. It can honor User-agent: Diffbot, but its crawl APIs may ignore or override robots rules under certain whitelisted agreements, so don't assume full compliance unless you've explicitly set terms; block at the server level if you want to keep it out. |
*This is not legal advice; it's a snapshot of current documentation and public reports. Always test your own robots.txt and firewall rules.
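Several vendors in the table (Google, Microsoft, Amazon) recommend verifying their crawlers via forward-confirmed reverse DNS, since user agents are trivially spoofed. Here's a sketch; the hostname suffixes below are assumptions based on vendor documentation, so double-check them against each vendor's published guidance before relying on this:

```python
import socket

# Assumed official rDNS suffixes per crawler; verify against vendor docs.
OFFICIAL_SUFFIXES = {
    "Googlebot": (".googlebot.com", ".google.com"),
    "Bingbot": (".search.msn.com",),
}

def hostname_matches(hostname: str, suffixes: tuple[str, ...]) -> bool:
    """Pure check: does the reverse-DNS hostname end in an official suffix?"""
    return hostname.endswith(suffixes)

def verify_crawler(ip: str, bot: str) -> bool:
    """Forward-confirmed reverse DNS: rDNS lookup, suffix check, forward confirm."""
    suffixes = OFFICIAL_SUFFIXES.get(bot)
    if not suffixes:
        return False
    try:
        hostname, _, _ = socket.gethostbyaddr(ip)          # reverse lookup
        if not hostname_matches(hostname, suffixes):
            return False
        # Forward-confirm: the hostname must resolve back to the same IP,
        # otherwise an attacker controlling their own rDNS could fake it.
        return ip in socket.gethostbyname_ex(hostname)[2]
    except socket.error:
        return False
```

The forward-confirmation step matters: reverse DNS alone is controlled by whoever owns the IP block, so only the round trip (IP to hostname to the same IP) proves the claim.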
How this ties back to Product Registry
One of the core jobs of Product Registry is to detect and log AI crawler activity, including GPTBot, PerplexityBot, Anthropic's bots, Google's AI crawlers, and others, so merchants can see which products are being fetched, how often, and by which bots.
Because we publish clean JSON‑LD endpoints and llms.txt/sitemaps, "good" AI crawlers have a clear path to your structured product data, while you stay in control through robots.txt and other controls.
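For context on what a "clean JSON-LD endpoint" looks like, here's a minimal sketch of a product record using schema.org/Product vocabulary. The field values are made up, and this illustrates the general format rather than any specific Product Registry API:

```python
import json

# Hypothetical product record; field names follow schema.org/Product.
product = {
    "@context": "https://schema.org",
    "@type": "Product",
    "name": "Example Widget",
    "sku": "WID-001",
    "offers": {
        "@type": "Offer",
        "price": "19.99",
        "priceCurrency": "USD",
        "availability": "https://schema.org/InStock",
    },
}

# Served as application/ld+json, crawlers can parse this without scraping HTML.
print(json.dumps(product, indent=2))
```

Structured data like this is exactly what well-behaved AI crawlers prefer to consume, which is why exposing it deliberately, alongside robots.txt controls, beats letting bots scrape rendered pages.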
Quick robots.txt example
Here's a simple example of how some sites are starting to treat AI crawlers differently:
# Allow normal search engines
User-agent: Googlebot
Allow: /
User-agent: bingbot
Allow: /
# Allow AI search visibility, but block training
User-agent: OAI-SearchBot
Allow: /
User-agent: GPTBot
Disallow: /
User-agent: Google-Extended
Disallow: /
User-agent: Applebot
Allow: /
User-agent: Applebot-Extended
Disallow: /
# Block aggressive / unwanted crawlers
User-agent: Bytespider
Disallow: /
User-agent: PerplexityBot
Disallow: /
This is just an example; your ideal setup depends on your business model, your tolerance for scraping, and whether AI visibility is a growth channel or a cost center for you.
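If you adopt a split policy like this, it's worth verifying that it actually behaves the way you intend. Python's standard-library robots.txt parser makes that easy; this sketch embeds a subset of the example above as a string and checks each agent (example.com is a placeholder domain):

```python
from urllib.robotparser import RobotFileParser

# A subset of the example policy above, embedded as a string for testing.
ROBOTS_TXT = """\
User-agent: OAI-SearchBot
Allow: /

User-agent: GPTBot
Disallow: /

User-agent: Bytespider
Disallow: /
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

for agent in ("OAI-SearchBot", "GPTBot", "Bytespider", "SomeOtherBot"):
    print(agent, parser.can_fetch(agent, "https://example.com/products/42"))
# OAI-SearchBot True, GPTBot False, Bytespider False, SomeOtherBot True
```

Note the last line: agents with no matching record are allowed by default, so if you want unknown bots blocked you need an explicit `User-agent: *` record, which the example above omits.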
What we'll keep updating
Over time we'll keep this page in sync with:
- New AI crawlers and user‑agent strings (especially from smaller AI search engines)
- Changes in behavior (for example, if a bot starts or stops respecting robots.txt)
- Real‑world patterns we see in Product Registry's crawler analytics