The Always‑Updating Guide to AI Crawlers & User Agents (December 2025)

A constantly updated guide to every major AI crawler and user-agent hitting the web. What they do, why they matter, and how to stay in control of your product data in the AI era.

AI crawlers are quietly reshaping how websites and products show up in search, chatbots, and AI shopping experiences. If you run a store, a content site, or any kind of online business, these bots are already visiting you, even if you've never heard of them.

At Product Registry, we track these crawlers so merchants can see who is hitting their product data, why, and how often. This post is the home for a constantly updated list of AI crawlers and their user‑agent strings.


What are AI crawlers, and why do they exist?

A crawler (or "bot") is a program that visits web pages automatically. It sends a small piece of text called a user‑agent string with every request. That string is how you see "who" is visiting in your logs.
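For example, a hit from OpenAI's GPTBot might leave a log line like the one below, and a simple substring check on the user‑agent is enough to spot it. This is a minimal sketch; the IP address, path, and timestamp are invented for illustration:

```python
# A hypothetical access-log line left by OpenAI's GPTBot
# (IP, path, and timestamp are made up for illustration).
log_line = (
    '203.0.113.7 - - [01/Dec/2025:10:15:32 +0000] '
    '"GET /products/widget HTTP/1.1" 200 5123 "-" '
    '"Mozilla/5.0 AppleWebKit/537.36; compatible; GPTBot/1.0; +https://openai.com/gptbot"'
)

# A small sample of tokens from the cheat sheet later in this post.
AI_TOKENS = ["GPTBot", "ClaudeBot", "PerplexityBot", "CCBot", "Bytespider"]

def detect_ai_crawler(line):
    """Return the first known AI-crawler token found in a log line, or None."""
    for token in AI_TOKENS:
        if token.lower() in line.lower():
            return token
    return None

print(detect_ai_crawler(log_line))  # GPTBot
```

Real crawler detection usually also checks IP ranges, since user‑agent strings are trivial to spoof.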

Traditional search engines (like Google or Bing) have done this for years to build search indexes. Now a new class of crawlers does something extra:

  • Training crawlers collect huge volumes of pages to train foundation models (GPTBot, ClaudeBot, CCBot).
  • AI search crawlers build indexes that power AI answers and citations (OAI‑SearchBot, Claude‑SearchBot, PerplexityBot).
  • On‑demand fetchers load a page the moment a user asks an AI assistant about it (ChatGPT‑User, Claude‑User).

On top of that, you still have generic crawlers (Googlebot, Bingbot, Applebot) that index the web and then reuse that data inside AI features like Gemini, Bing Copilot, Apple Intelligence, and more.

Most reputable AI crawlers say they respect robots.txt, the simple text file at yourdomain.com/robots.txt that tells bots what they may fetch. But not every crawler is well‑behaved, and some are now accused of stealth crawling (more on that below).


The different types of AI crawlers

It helps to think about AI crawlers in four buckets. That makes it easier to decide which ones you want to allow, limit, or block.

1. AI‑specific crawlers for background tasks (indexing & training)

These bots run in the background, hitting millions of pages a day and collecting data to train foundation models. Common examples include:

  • GPTBot (OpenAI)
  • ClaudeBot (Anthropic)
  • Bytespider (ByteDance)
  • CCBot (Common Crawl)

One interesting example is the Common Crawl bot, CCBot. Although not branded as "AI," the public dataset it builds by crawling the web is widely used in AI training pipelines.

These bots usually don't care about "real‑time" freshness. They care about coverage: getting a copy of as much of the public web as possible.

2. AI‑specific crawlers triggered by user actions

These are "on‑demand" bots. They only hit your site when a human uses an AI product:

  • ChatGPT‑User (OpenAI)
  • Claude‑User (Anthropic)
  • Perplexity‑User (Perplexity AI)
  • MistralAI‑User (Mistral)

You can think of these as "AI browser tabs" opened by the user, not background scrapers. They still respect robots.txt (for the major vendors), but blocking them can mean your content won't be referenced when users ask questions.

3. Generic crawlers that also feed AI

These crawlers aren't "AI‑only," but their data increasingly ends up in AI products:

  • Googlebot (Google Search, Gemini, AI Overviews)
  • Bingbot (Bing Search, Microsoft Copilot)
  • Applebot (Siri, Spotlight, Apple Intelligence)

Most sites keep these allowed, because blocking them means disappearing from traditional search as well.

4. Stealth crawlers (undeclared or misleading)

Here's where things get messy.

Some traffic doesn't clearly identify itself as a bot, or it switches its user‑agent when blocked. Cloudflare and others have published research accusing Perplexity's systems of:

  • crawling pages on sites that explicitly block PerplexityBot in robots.txt;
  • switching to generic, browser‑like user‑agents when the declared bot is blocked;
  • rotating through IP addresses outside Perplexity's published ranges.

Perplexity disputes parts of those reports, but the takeaway for site owners is simple: some AI traffic may not self‑identify at all. You'll see it as "normal browsers" in your logs.

Bytespider has also been reported as very aggressive, sometimes overwhelming smaller sites, even when they try common blocking methods.

For this class of traffic, robots.txt alone is often not enough; site owners turn to firewalls, rate limiting, and AI‑specific bot filters.
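To make the rate‑limiting idea concrete, here is a minimal token‑bucket sketch in Python. The class name and thresholds are invented for illustration; production setups usually enforce this at the CDN, WAF, or web‑server layer rather than in application code:

```python
import time

class RateLimiter:
    """Naive per-client token bucket: each IP may make `rate` requests
    per second on average, with bursts up to `burst`. Illustrative only."""

    def __init__(self, rate=2.0, burst=10):
        self.rate = rate
        self.burst = burst
        self.tokens = {}  # client_ip -> remaining tokens
        self.last = {}    # client_ip -> timestamp of last request

    def allow(self, client_ip, now=None):
        """Return True if this request should be served, False if throttled."""
        now = time.monotonic() if now is None else now
        last = self.last.get(client_ip, now)
        tokens = self.tokens.get(client_ip, float(self.burst))
        # Refill tokens for the time elapsed since the last request.
        tokens = min(float(self.burst), tokens + (now - last) * self.rate)
        self.last[client_ip] = now
        allowed = tokens >= 1.0
        if allowed:
            tokens -= 1.0
        self.tokens[client_ip] = tokens
        return allowed
```

A bot hammering your site burns through its bucket quickly and gets throttled, while ordinary visitors rarely notice the limit.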


AI crawler & user‑agent cheat sheet (December 2025)

Below is a starter table of the main AI‑linked crawlers we see most often in logs across the web and in Product Registry data. We'll keep this updated over time.

Important:

  • "Respects robots.txt?" is based on official docs + public reporting as of December 2025.
  • "Yes" = vendor explicitly documents robots.txt support and no strong evidence to the contrary.
  • "Mixed/Contested" = reports of bad behavior or stealth crawling exist.
  • Always verify against your own logs and risk tolerance.
| Name | Vendor | User‑agent token | Respects robots.txt?* | Comments |
|---|---|---|---|---|
| GPTBot | OpenAI | `GPTBot` (e.g. `... GPTBot/1.0; +https://openai.com/gptbot`) | Yes (documented; opt out via `User-agent: GPTBot`) | Background crawler for training OpenAI models like ChatGPT. Can be fully or partially blocked via robots.txt. |
| ChatGPT‑User | OpenAI | `ChatGPT-User` | Yes (shares policy with GPTBot) | Runs on demand when ChatGPT needs to load a page for a user (browsing). Not a bulk crawler. |
| OAI‑SearchBot | OpenAI | `OAI-SearchBot` (sometimes inside a Chrome‑like UA) | Yes | Indexes your site for ChatGPT search (search‑style results, not training). Controlled via `User-agent: OAI-SearchBot`. |
| ClaudeBot | Anthropic | `ClaudeBot` | Yes (per docs) | Background crawler for Claude training data. Anthropic documents the robots.txt opt‑out. |
| Claude‑SearchBot | Anthropic | `Claude-SearchBot` | Generally yes; some confusion reported | Builds an index for Claude's web‑search feature. Some site owners report delays between robots.txt changes and actual blocking. |
| Claude‑User | Anthropic | `Claude-User` | Yes | On‑demand access when a user asks Claude something that needs your page. Similar role to ChatGPT‑User. |
| PerplexityBot | Perplexity AI | `PerplexityBot` (often inside a Mozilla UA) | Officially yes, practically mixed | Perplexity says it respects robots.txt, but Cloudflare and others have accused it of bypassing blocks in some cases. Treat with care if your content is sensitive. |
| Perplexity‑User | Perplexity AI | `Perplexity/1.0` | Unclear / mixed | User‑driven browsing by Perplexity's assistant. Reports suggest it can also be involved in "stealth"‑style access depending on configuration. |
| Google‑Extended | Google | `Google-Extended` | Yes | Lets you stay in Google Search while opting out of AI training for Gemini / Vertex. Blocking it does not remove you from regular search results. |
| Applebot‑Extended | Apple | `Applebot-Extended` | Yes | Controls whether your content can train Apple Intelligence foundation models, separate from normal Applebot indexing. |
| CCBot | Common Crawl | `CCBot/2.0` (or similar) | Yes | Non‑profit crawler building open web datasets widely used in AI training and research. Block via `User-agent: CCBot`. |
| Bytespider | ByteDance (TikTok) | `Bytespider` | Mixed / often problematic | Feeds TikTok and ByteDance AI models. Many reports describe very heavy crawling and difficulty blocking via robots.txt alone; many sites resort to WAF rules or full user‑agent blocks. |
| DuckAssistBot | DuckDuckGo | `DuckAssistBot` | Yes | Real‑time crawler for DuckDuckGo's AI‑assisted DuckAssist answers, which prominently cite sources. Opting out via robots.txt does not affect your normal DuckDuckGo search ranking. |
| Googlebot | Google | `Googlebot` (various desktop/mobile strings) | Yes | Standard Google Search crawler. Its index is reused across many AI features (Gemini, AI Overviews, etc.). Most sites keep it allowed. |
| Bingbot | Microsoft | `bingbot/2.0` etc. | Yes | Core crawler for Bing Search and Microsoft Copilot. Blocking it reduces visibility in both classic search and AI chat. |
| Applebot | Apple | `Applebot` | Yes | Indexes content for Siri, Spotlight, and Safari search; feeds Apple Intelligence features alongside Applebot‑Extended. |
| Amazonbot | Amazon | `Amazonbot` | Officially yes (some reported issues) | Amazon's crawler for Alexa and other services. Documented as honoring allow/disallow rules but not `crawl-delay`; admins sometimes see aggressive or spoofed "Amazonbot" traffic, so validate via IP or reverse DNS. |
| Google‑CloudVertexBot | Google | `Google-CloudVertexBot` (often in a Googlebot‑style UA) | Yes | Crawls sites when an owner connects content to Vertex AI / Vertex AI Search agents. Robots rules for this token control these fetches and don't affect normal Google Search indexing. |
| GoogleOther | Google | `GoogleOther` | Yes | Generic Google crawler for internal products and one‑off research crawls, not main search indexing. Blocking it doesn't change your Google Search rankings. |
| PetalBot | Huawei | `PetalBot` | Generally yes | Official crawler for Petal Search and Huawei Assistant / AI Search. Indexes PC and mobile sites; widely described as a legitimate search bot that generally respects robots.txt. |
| TikTokSpider | ByteDance (TikTok) | `TikTokSpider` | Unclear; likely yes, verify | Fetches pages when users share external URLs, mainly for link previews and metadata. Third‑party tools suggest robots.txt works, but ByteDance's public docs are thin, so monitor behavior yourself. |
| MistralAI‑User | Mistral | `MistralAI-User` | Mixed / contested | On‑demand browsing agent for Mistral's Le Chat. Some directories mark it robots‑compliant; others say it behaves like a browser session that ignores robots. Enforce via UA / WAF rules if you need strict blocking. |
| GoogleAgent‑Mariner | Google | `GoogleAgent-Mariner` | Unclear (no official robots policy yet) | Experimental Google agent that can drive a browser, click buttons, and complete multi‑step tasks for a user. Self‑identifies via its token, but Google hasn't published robots‑compliance rules; treat it like an automated browser with UA‑ or IP‑based controls. |
| YouBot | You.com | `YouBot` | Mixed / contested | Core crawler for You.com's AI‑first search and assistant. Some descriptions say it follows standard crawling protocols; security tools flag it as ignoring robots.txt. Check against your own logs and firewall rules. |
| Timpibot | Timpi | `Timpibot` | No (often reported ignoring robots) | Decentralized crawler for Timpi's search index and LLM training. Runs from many independent nodes and is frequently reported as aggressive; most people block it via firewall or web‑server rules rather than robots.txt. |
| omgili / omgilibot | Webz.io (Omgili) | `omgili/0.5 +http://omgili.com`, `omgilibot/0.4 +http://omgili.com` | No (commonly flagged) | Specialized crawler for forums and user‑generated content whose data is resold, including for AI and sentiment analysis. Often listed as not respecting robots.txt; block it if you don't want community content in third‑party datasets. |
| Diffbot | Diffbot | `Diffbot` | Mixed (uses robots, but can override) | Commercial AI scraping service that turns pages into structured data and a knowledge graph. Can honor `User-agent: Diffbot`, but its crawl APIs may override robots rules under whitelisted agreements; block at the server level if you want to keep it out. |

*This is not legal advice; it's a snapshot of current documentation and public reports. Always test your own robots.txt and firewall rules.
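Several rows in the table recommend validating a claimed crawler identity via IP or reverse DNS (Google and Amazon both document this kind of check). Below is a minimal Python sketch of forward‑confirmed reverse DNS; the suffix list is illustrative and should be verified against each vendor's current documentation:

```python
import socket

# Reverse-DNS suffixes for crawlers that support this kind of verification.
# Treat these as illustrative; confirm against each vendor's own docs.
VERIFIED_SUFFIXES = {
    "Googlebot": (".googlebot.com", ".google.com"),
    "Amazonbot": (".crawl.amazonbot.amazon",),
}

def verify_crawler(claimed_name, ip):
    """Forward-confirmed reverse DNS: resolve the IP to a hostname, check
    its suffix, then resolve that hostname back and confirm the same IP."""
    suffixes = VERIFIED_SUFFIXES.get(claimed_name)
    if not suffixes:
        return False
    try:
        host = socket.gethostbyaddr(ip)[0]          # reverse lookup
        if not host.endswith(suffixes):
            return False
        return ip in socket.gethostbyname_ex(host)[2]  # forward confirmation
    except OSError:
        return False
```

The two‑step check matters: anyone can set a reverse‑DNS record claiming to be `googlebot.com`, but only the real operator controls the forward resolution back to the requesting IP.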


How this ties back to Product Registry

One of the core jobs of Product Registry is to detect and log AI crawler activity, including GPTBot, PerplexityBot, Anthropic's bots, Google's AI crawlers, and others, so merchants can see which products are being fetched, how often, and by which bots.

Because we publish clean JSON‑LD endpoints and llms.txt/sitemaps, "good" AI crawlers have a clear path to your structured product data, while you stay in control through robots.txt and other controls.
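The kind of crawler accounting described above can be approximated with a few lines of Python. This is a simplified sketch; the token list is a sample from the cheat sheet and the log lines are invented:

```python
from collections import Counter

# A sample of user-agent tokens from the cheat sheet above.
AI_TOKENS = ["GPTBot", "ClaudeBot", "PerplexityBot", "Bytespider"]

def tally_ai_hits(log_lines):
    """Count hits per AI crawler across raw access-log lines
    (simple substring match on the user-agent portion)."""
    counts = Counter()
    for line in log_lines:
        for token in AI_TOKENS:
            if token in line:
                counts[token] += 1
                break
    return counts

# Invented log lines for illustration.
sample = [
    '... "GET /products/1 HTTP/1.1" 200 ... "GPTBot/1.0; +https://openai.com/gptbot"',
    '... "GET /products/2 HTTP/1.1" 200 ... "ClaudeBot; +claudebot@anthropic.com"',
    '... "GET /products/1 HTTP/1.1" 200 ... "GPTBot/1.0; +https://openai.com/gptbot"',
]
print(tally_ai_hits(sample))  # Counter({'GPTBot': 2, 'ClaudeBot': 1})
```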


Quick robots.txt example

Here's a simple example of how some sites are starting to treat AI crawlers differently:

```
# Allow normal search engines
User-agent: Googlebot
Allow: /

User-agent: bingbot
Allow: /

# Allow AI search visibility, but block training
User-agent: OAI-SearchBot
Allow: /

User-agent: GPTBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: Applebot
Allow: /

User-agent: Applebot-Extended
Disallow: /

# Block aggressive / unwanted crawlers
User-agent: Bytespider
Disallow: /

User-agent: PerplexityBot
Disallow: /
```

This is just an example; your ideal setup depends on your business model, your tolerance for scraping, and whether AI visibility is a growth channel or a cost center for you.
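Whatever policy you choose, it's worth sanity‑checking it with Python's built‑in robots.txt parser before deploying. The snippet below pastes a subset of the example policy into a string and queries it (the URLs are illustrative):

```python
from urllib import robotparser

# A subset of the example robots.txt policy, as a string for testing.
ROBOTS_TXT = """\
User-agent: Googlebot
Allow: /

User-agent: OAI-SearchBot
Allow: /

User-agent: GPTBot
Disallow: /

User-agent: Bytespider
Disallow: /
"""

rp = robotparser.RobotFileParser()
rp.parse(ROBOTS_TXT.splitlines())

print(rp.can_fetch("GPTBot", "https://example.com/products/1"))     # False
print(rp.can_fetch("Googlebot", "https://example.com/products/1"))  # True
```

Note that robots.txt is advisory: this tells you what a compliant bot will do, not what a stealth crawler will actually do.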


What we'll keep updating

Over time we'll keep this page in sync with:

  • new AI crawlers and user‑agent strings as they appear;
  • changes in documented robots.txt behavior;
  • new reports of stealth or aggressive crawling;
  • what we observe in Product Registry crawl logs.
