AI crawlers are quietly reshaping how websites and products show up in search, chatbots, and AI shopping experiences. If you run a store, a content site, or any kind of online business, these bots are already visiting you, even if you've never heard of them.
At Product Registry, we track these crawlers so merchants can see who is hitting their product data, why, and how often. This post is the home for a constantly updated list of AI crawlers and their user‑agent strings.
What are AI crawlers, and why do they exist?
A crawler (or "bot") is a program that visits web pages automatically. It sends a small piece of text called a user‑agent string with every request. That string is how you see "who" is visiting in your logs.
Traditional search engines (like Google or Bing) have done this for years to build search indexes. Now a new class of crawlers does something extra:
- Train AI models - Bots like GPTBot (OpenAI) and ClaudeBot (Anthropic) copy public pages so LLMs can learn from them.
- Power AI search & chat answers - Bots like OAI‑SearchBot (OpenAI), Claude‑SearchBot (Anthropic) and PerplexityBot index content so AI answers can quote or link to your pages.
- Fetch pages on demand for users - Agents like ChatGPT‑User and Claude‑User only load your page when a human asks the AI a question that needs your content.
On top of that, you still have generic crawlers (Googlebot, Bingbot, Applebot) that index the web and then reuse that data inside AI features like Gemini, Bing Copilot, Apple Intelligence, and more.
Most reputable AI crawlers say they respect robots.txt, the simple text file at yourdomain.com/robots.txt that tells bots what they may fetch. But not every crawler is well‑behaved, and some are now accused of stealth crawling (more on that below).
The different types of AI crawlers
Think about AI crawlers in four buckets. This makes it easier to decide which ones you want to allow, limit, or block.
1. AI‑specific crawlers for background tasks (indexing & training)
These bots run in the background, hitting millions of pages a day, and collecting data to train foundation models. Common examples include:
- GPTBot (OpenAI) for training models like ChatGPT.
- ClaudeBot (Anthropic) for training Claude.
One interesting example is CCBot, the Common Crawl bot. Although it isn't branded as "AI," the public dataset it builds by crawling the open web is widely used in AI training pipelines.
These bots usually don't care about "real‑time" freshness. They care about coverage: getting a copy of as much of the public web as possible.
2. AI‑specific crawlers triggered by user actions
These are "on‑demand" bots. They only hit your site when a human uses an AI product:
- ChatGPT‑User - When someone asks ChatGPT something that needs your site, this agent fetches it on their behalf.
- Claude‑User - Same idea for Anthropic's Claude.
- Perplexity (assistant UA) - Used when Perplexity's AI assistant browses specific pages in response to a query.
- DuckAssistBot - Activated by DuckDuckGo's DuckAssist feature to grab content from trusted sources.
You can think of these as "AI browser tabs" opened by the user, not background scrapers. They still respect robots.txt (for the major vendors), but blocking them can mean your content won't be referenced when users ask questions.
3. Generic crawlers that also feed AI
These crawlers aren't "AI‑only," but their data increasingly ends up in AI products:
- Googlebot - Primary crawler for Google Search; its index powers Gemini / AI Overviews.
- Bingbot - Primary crawler for Bing Search and Microsoft Copilot.
- Applebot - Apple's crawler for Spotlight, Siri suggestions, and now part of Apple Intelligence.
Most sites keep these allowed, because blocking them means disappearing from traditional search as well.
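The first three buckets lend themselves to a simple token-to-category lookup. This is a sketch using the user-agent tokens from this post; the grouping is our editorial call, not an official taxonomy:

```python
# Assumed mapping of user-agent tokens to the buckets described above.
BUCKETS = {
    "training": {"GPTBot", "ClaudeBot", "CCBot", "Bytespider"},
    "on_demand": {"ChatGPT-User", "Claude-User", "DuckAssistBot"},
    "ai_search_index": {"OAI-SearchBot", "Claude-SearchBot", "PerplexityBot"},
    "generic_search": {"Googlebot", "bingbot", "Applebot"},
}

def classify(user_agent: str) -> str:
    """Classify a raw user-agent string into one of the buckets, or 'unknown'."""
    for bucket, tokens in BUCKETS.items():
        if any(token in user_agent for token in tokens):
            return bucket
    return "unknown"  # a human browser -- or possibly a stealth crawler

print(classify("Mozilla/5.0; compatible; GPTBot/1.0"))  # training
print(classify("ChatGPT-User/1.0"))                     # on_demand
```

The fourth bucket, by definition, never shows up in a lookup like this, which is why it needs different tooling.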
4. Stealth crawlers (undeclared or misleading)
Here's where things get messy.
Some traffic doesn't clearly identify itself as a bot, or it switches its user‑agent when blocked. Cloudflare and others have published research accusing Perplexity's systems of:
- Crawling blocked sites using generic Chrome browser user‑agents
- Rotating IP addresses and network identifiers
- Ignoring robots.txt and even Web Application Firewall rules
Perplexity disputes parts of those reports, but the takeaway for site owners is simple: some AI traffic may not self‑identify at all. You'll see it as "normal browsers" in your logs.
Bytespider has also been reported as very aggressive, sometimes overwhelming smaller sites, even when they try common blocking methods.
For this class of traffic, robots.txt alone is often not enough; people turn to firewalls, rate limiting, and AI‑specific bot filters.
AI crawler & user‑agent cheat sheet (December 2025)
Below is a starter table of the main AI‑linked crawlers we see most often in logs across the web and in Product Registry data. We'll keep this updated over time.
Important:
- "Respects robots.txt?" is based on official docs + public reporting as of December 2025.
- "Yes" = vendor explicitly documents robots.txt support and no strong evidence to the contrary.
- "Mixed/Contested" = reports of bad behavior or stealth crawling exist.
- Always verify against your own logs and risk tolerance.
| Name | Vendor | User‑Agent (token you'll see) | Respects robots.txt?* | Comments |
|---|---|---|---|---|
| GPTBot | OpenAI | GPTBot (e.g. ... GPTBot/1.0; +https://openai.com/gptbot) | Yes - documented, opt‑out via User-agent: GPTBot | Background crawler for training OpenAI models like ChatGPT. Can be fully or partially blocked via robots.txt. |
| ChatGPT‑User | OpenAI | ChatGPT-User | Yes (shares policy with GPTBot) | Runs on demand when ChatGPT needs to load a page for a user (plugins/browsing). Not a bulk crawler. |
| OAI‑SearchBot | OpenAI | OAI-SearchBot (sometimes inside a Chrome‑like UA) | Yes | Used for SearchGPT / ChatGPT search indexing (showing your site in search‑style results, not for training). Controlled by User-agent: OAI-SearchBot. |
| ClaudeBot | Anthropic | ClaudeBot | Yes (per docs) | Background crawler for Claude training data. Anthropic shows how to opt out via robots.txt. |
| Claude‑SearchBot | Anthropic | Claude-SearchBot | Generally yes; some confusion reported | Used to build an index for Claude's "search the web" feature. Some site owners report delays between changes in robots.txt and actual blocking. |
| Claude‑User | Anthropic | Claude-User | Yes | On‑demand access when a user asks Claude something that needs your page. Similar role to ChatGPT‑User. |
| PerplexityBot | Perplexity AI | PerplexityBot (often inside a Mozilla UA) | Officially yes, practically mixed | Perplexity says it respects robots.txt, but Cloudflare and others have accused Perplexity of bypassing blocks in some cases. Treat with care if your content is sensitive. |
| Perplexity‑User | Perplexity AI | Perplexity/1.0 | Unclear / mixed | Used for user‑driven browsing by Perplexity's assistant. Reports suggest this can also be involved in "stealth"‑style access depending on configuration. |
| Google‑Extended | Google | Google-Extended | Yes | Lets you allow Google Search while opting out of AI training for Gemini / Vertex. Blocking Google-Extended does not remove you from regular search results. |
| Applebot‑Extended | Apple | Applebot-Extended | Yes | Controls whether your content can be used for training Apple Intelligence foundation models, separate from normal Applebot indexing. |
| CCBot | Common Crawl | CCBot/2.0 (or similar) | Yes | Non‑profit crawler that builds open web datasets widely used in AI training and research. You can block it via User-agent: CCBot. |
| Bytespider | ByteDance (TikTok) | Bytespider | Mixed / often problematic | Used for TikTok and ByteDance AI models. Many reports describe very heavy crawling and difficulty blocking it with robots.txt alone; many sites resort to WAF rules or full user‑agent blocks. |
| DuckAssistBot | DuckDuckGo | DuckAssistBot | Yes | Real‑time crawler powering DuckDuckGo's DuckAssist AI‑assisted answers, which prominently cite sources. Opting out via robots.txt does not affect your normal DuckDuckGo search ranking. |
| Googlebot | Google | Googlebot (various desktop/mobile strings) | Yes | Standard Google search crawler. Its index is reused across many AI features (Gemini, AI Overviews, etc.). Most sites keep this allowed. |
| Bingbot | Microsoft | bingbot/2.0 etc. | Yes | Core crawler for Bing Search and Microsoft Copilot. Blocking it reduces visibility in both classic search and AI chat. |
| Applebot | Apple | Applebot | Yes | Indexes content for Siri, Spotlight, and Safari search, and feeds Apple Intelligence features alongside Applebot‑Extended. |
| Amazonbot | Amazon | Amazonbot | Officially yes (docs; some reports of issues) | Amazon's crawler for Alexa and other Amazon services. Documented as respecting robots.txt allow/disallow rules, but it doesn't honor crawl-delay, and admins sometimes see aggressive or spoofed "Amazonbot" traffic, so it's worth validating via IP or reverse DNS. |
| Google‑CloudVertexBot | Google | Google-CloudVertexBot (often in a Googlebot‑style UA) | Yes | Crawls sites when a site owner connects content into Vertex AI / Vertex AI Search agents. Robots rules for Google-CloudVertexBot (and Googlebot) control these fetches and don't affect normal Google Search indexing. |
| GoogleOther | Google | GoogleOther | Yes | Generic Google crawler used for internal products and one‑off research crawls rather than main search indexing. Blocking it doesn't change your Google Search rankings. |
| PetalBot | Huawei | PetalBot | Generally yes | Official crawler for Huawei's Petal Search and Huawei Assistant / AI Search. Indexes PC and mobile sites and is widely described as a legitimate search bot that generally respects robots.txt directives. |
| TikTokSpider | ByteDance (TikTok) | TikTokSpider | Unclear / likely yes; verify | TikTok's crawler that fetches pages when users share external URLs, mainly to build link previews and metadata. Tools suggest you can govern it via robots.txt, but TikTok doesn't provide strong public docs, so monitor request rates and behavior yourself. |
| MistralAI‑User | Mistral | MistralAI-User | Mixed / contested | On‑demand browsing agent for Mistral's Le Chat, used when a user asks it to open a web page. Some directories mark it as compliant with robots.txt, while others say it behaves more like a browser session that ignores robots; treat it as assistant‑style traffic and enforce policy via UA / WAF rules if you need strict blocking. |
| GoogleAgent‑Mariner | Google | GoogleAgent-Mariner | Unclear (no official robots policy yet) | Experimental Google AI agent that can drive a browser, click buttons, and complete multi‑step tasks on behalf of a user. It self‑identifies via GoogleAgent-Mariner, but Google hasn't yet published clear robots‑compliance rules, so handle it like an automated browser with UA‑based or IP‑based controls. |
| YouBot | You.com | YouBot | Mixed / contested | Core crawler for You.com's AI‑first search and assistant. Some descriptions say it follows standard crawling protocols, but security tools flag it as not respecting robots.txt, so treat compliance as "maybe" and check behavior against your own logs and firewall rules. |
| Timpibot | Timpi | Timpibot | No (often reported ignoring robots) | Decentralized crawler for Timpi's search index and LLM training. Runs from many independent nodes, is frequently reported as aggressive, and is widely treated as not respecting robots.txt, so most people block it via firewall or web‑server rules rather than relying on robots alone. |
| omgili / omgilibot | Webz.io (Omgili) | omgili/0.5 +http://omgili.com and omgilibot/0.4 +http://omgili.com | No (commonly flagged) | Specialized crawler for forums and user‑generated content whose data is resold (including for AI/sentiment analysis). Often listed as not respecting robots.txt; treat it as a data‑scraping bot and block it if you don't want your community content in third‑party datasets. |
| Diffbot | Diffbot | Diffbot | Mixed (uses robots, but can override) | Commercial AI scraping service that turns pages into structured data and a knowledge graph for clients and AI training. It can honor User-agent: Diffbot, but its crawl APIs may ignore or override robots rules under certain whitelisted agreements, so don't assume full compliance unless you've explicitly set terms; block at the server level if you want to keep it out. |
*This is not legal advice; it's a snapshot of current documentation and public reports. Always test your own robots.txt and firewall rules.
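Several vendors in the table (Google, Microsoft, Amazon) recommend verifying their crawlers via forward-confirmed reverse DNS, since user agents are trivially spoofed. Here's a sketch; the hostname suffixes below are assumptions based on vendor documentation, so double-check them against each vendor's published guidance before relying on this:

```python
import socket

# Assumed official rDNS suffixes per crawler; verify against vendor docs.
OFFICIAL_SUFFIXES = {
    "Googlebot": (".googlebot.com", ".google.com"),
    "Bingbot": (".search.msn.com",),
}

def hostname_matches(hostname: str, suffixes: tuple[str, ...]) -> bool:
    """Pure check: does the reverse-DNS hostname end in an official suffix?"""
    return hostname.endswith(suffixes)

def verify_crawler(ip: str, bot: str) -> bool:
    """Forward-confirmed reverse DNS: rDNS lookup, suffix check, forward confirm."""
    suffixes = OFFICIAL_SUFFIXES.get(bot)
    if not suffixes:
        return False
    try:
        hostname, _, _ = socket.gethostbyaddr(ip)          # reverse lookup
        if not hostname_matches(hostname, suffixes):
            return False
        # Forward-confirm: the hostname must resolve back to the same IP,
        # otherwise an attacker controlling their own rDNS could fake it.
        return ip in socket.gethostbyname_ex(hostname)[2]
    except socket.error:
        return False
```

The forward-confirmation step matters: reverse DNS alone is controlled by whoever owns the IP block, so only the round trip (IP to hostname to the same IP) proves the claim.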
How this ties back to Product Registry
One of the core jobs of Product Registry is to detect and log AI crawler activity, including GPTBot, PerplexityBot, Anthropic's bots, Google's AI crawlers, and others, so merchants can see which products are being fetched, how often, and by which bots.
Because we publish clean JSON‑LD endpoints and llms.txt/sitemaps, "good" AI crawlers have a clear path to your structured product data, while you stay in control through robots.txt and other controls.
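For context on what a "clean JSON-LD endpoint" looks like, here's a minimal sketch of a product record using schema.org/Product vocabulary. The field values are made up, and this illustrates the general format rather than any specific Product Registry API:

```python
import json

# Hypothetical product record; field names follow schema.org/Product.
product = {
    "@context": "https://schema.org",
    "@type": "Product",
    "name": "Example Widget",
    "sku": "WID-001",
    "offers": {
        "@type": "Offer",
        "price": "19.99",
        "priceCurrency": "USD",
        "availability": "https://schema.org/InStock",
    },
}

# Served as application/ld+json, crawlers can parse this without scraping HTML.
print(json.dumps(product, indent=2))
```

Structured data like this is exactly what well-behaved AI crawlers prefer to consume, which is why exposing it deliberately, alongside robots.txt controls, beats letting bots scrape rendered pages.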
Quick robots.txt example
Here's a simple example of how some sites are starting to treat AI crawlers differently:
# Allow normal search engines
User-agent: Googlebot
Allow: /
User-agent: bingbot
Allow: /
# Allow AI search visibility, but block training
User-agent: OAI-SearchBot
Allow: /
User-agent: GPTBot
Disallow: /
User-agent: Google-Extended
Disallow: /
User-agent: Applebot
Allow: /
User-agent: Applebot-Extended
Disallow: /
# Block aggressive / unwanted crawlers
User-agent: Bytespider
Disallow: /
User-agent: PerplexityBot
Disallow: /
This is just an example; your ideal setup depends on your business model, your tolerance for scraping, and whether AI visibility is a growth channel or a cost center for you.
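If you adopt a split policy like this, it's worth verifying that it actually behaves the way you intend. Python's standard-library robots.txt parser makes that easy; this sketch embeds a subset of the example above as a string and checks each agent (example.com is a placeholder domain):

```python
from urllib.robotparser import RobotFileParser

# A subset of the example policy above, embedded as a string for testing.
ROBOTS_TXT = """\
User-agent: OAI-SearchBot
Allow: /

User-agent: GPTBot
Disallow: /

User-agent: Bytespider
Disallow: /
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

for agent in ("OAI-SearchBot", "GPTBot", "Bytespider", "SomeOtherBot"):
    print(agent, parser.can_fetch(agent, "https://example.com/products/42"))
# OAI-SearchBot True, GPTBot False, Bytespider False, SomeOtherBot True
```

Note the last line: agents with no matching record are allowed by default, so if you want unknown bots blocked you need an explicit `User-agent: *` record, which the example above omits.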
What we'll keep updating
Over time we'll keep this page in sync with:
- New AI crawlers and user‑agent strings (especially from smaller AI search engines)
- Changes in behavior (for example, if a bot starts or stops respecting robots.txt)
- Real‑world patterns we see in Product Registry's crawler analytics