# Product Registry: LLMs & Crawler Guidance
- **Location:** /llms.txt
- **Version:** 1.1
- **Last-Updated:** 2025-10-15
- **Base-URL:** https://productregistry.org/
---
## Identification & Contact
- **Agent-Policy:** welcome
- **Contact:** support@productregistry.org
- **Documentation:** https://productregistry.org/
- **Robots-Txt:** https://productregistry.org/robots.txt
- **License:** CC BY 4.0 (facts and structured data)
- **Data-Usage:** factual data (non-PII) may be cached and reused with attribution
---
## Discovery
- **Sitemap-Index:** https://productregistry.org/sitemap.xml
- **Sitemap:** https://productregistry.org/sitemaps/products.xml
- **Sitemap:** https://productregistry.org/sitemaps/products-jsonld.xml
- **Sitemap:** https://productregistry.org/sitemaps/merchants.xml
- **Machine Feeds:**
  - `https://productregistry.org/product/{id}.jsonld`: JSON-LD endpoint for each product
  - `https://productregistry.org/sitemaps/products-jsonld.xml`: nightly feed of machine endpoints
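
As a rough illustration of consuming the sitemaps listed above, the following Python sketch pulls every `<loc>` entry out of a sitemap document using only the standard library. The helper name `extract_locs`, the sample XML, and the two product IDs are made up for the example. The sketch assumes the feeds follow the standard sitemaps.org 0.9 schema, which this file does not confirm.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org 0.9 protocol (assumed to apply here).
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def extract_locs(sitemap_xml: str) -> list:
    """Return every <loc> value from a sitemap or sitemap index document."""
    root = ET.fromstring(sitemap_xml)
    return [
        loc.text.strip()
        for loc in root.iterfind(".//sm:loc", SITEMAP_NS)
        if loc.text
    ]


# Hypothetical payload; the real feed contents are not shown in this file.
SAMPLE = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://productregistry.org/product/123.jsonld</loc></url>
  <url><loc>https://productregistry.org/product/456.jsonld</loc></url>
</urlset>"""

jsonld_urls = extract_locs(SAMPLE)
```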
---
## Canonical URL Patterns
| Pattern | Type | Canonical |
| ----------------------------- | --------------- | --------- |
| `/product/{product_id}` | ProductPage | yes |
| `/merchant/{merchant_domain}` | MerchantListing | yes |
| `/` | Index | yes |
---
## Access Rules
- **Allow:** `/`
- **Allow:** `/product/`
- **Allow:** `/merchant/`
- **Allow:** `/sitemaps/`
- **Allow:** `/.well-known/`
- **Disallow:** `/api/`
- **Prefetch:** disallowed (avoid aggressive speculative fetching)
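
One way a crawler might apply these rules is longest-prefix matching, in the manner of robots.txt parsers. This file does not state how overlapping rules are resolved, so letting the most specific prefix win is an assumption, and `is_allowed` is a hypothetical helper rather than anything the registry provides.

```python
# Rules copied from the Allow/Disallow list above.
RULES = [
    ("allow", "/"),
    ("allow", "/product/"),
    ("allow", "/merchant/"),
    ("allow", "/sitemaps/"),
    ("allow", "/.well-known/"),
    ("disallow", "/api/"),
]


def is_allowed(path: str) -> bool:
    """Decide by the longest matching prefix (assumed precedence rule)."""
    matches = [rule for rule in RULES if path.startswith(rule[1])]
    if not matches:
        # No rule applies; this sketch treats the path as allowed.
        return True
    verdict, _prefix = max(matches, key=lambda rule: len(rule[1]))
    return verdict == "allow"
```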
---
## Crawl & Refresh Guidance
| Path | Recommended Frequency |
| ------------- | --------------------- |
| `/product/*` | daily |
| `/merchant/*` | weekly |
| `/` | monthly |
- Use standard freshness headers (`ETag`, `Last-Modified`, `Cache-Control`).
- If these headers are missing, assume a **default cache TTL of 86400 s (24 h)**.
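
The freshness guidance can be sketched as two small helpers: one that picks a TTL from `Cache-Control` and falls back to the 24 h default, and one that builds conditional request headers from validators saved on an earlier fetch. Both function names are hypothetical, and the sketch deliberately ignores directives such as `no-store` or `no-cache`, which a real crawler would also need to honor.

```python
import re
from typing import Mapping, Optional

DEFAULT_TTL_SECONDS = 86400  # 24 h fallback stated in this file


def cache_ttl(headers: Mapping[str, str]) -> int:
    """Return max-age from Cache-Control, or the documented default."""
    # HTTP header names are case-insensitive, so normalize before lookup.
    lowered = {name.lower(): value for name, value in headers.items()}
    match = re.search(
        r"\bmax-age=(\d+)", lowered.get("cache-control", ""), re.IGNORECASE
    )
    return int(match.group(1)) if match else DEFAULT_TTL_SECONDS


def conditional_headers(etag: Optional[str], last_modified: Optional[str]) -> dict:
    """Build revalidation headers from validators stored on a previous fetch."""
    headers = {}
    if etag:
        headers["If-None-Match"] = etag
    if last_modified:
        headers["If-Modified-Since"] = last_modified
    return headers
```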
---
## Structured Data
All product pages embed:
```html
```
- Identifiers: GTIN, UPC, EAN, SKU supported
- Offer, Brand, AggregateRating, Policy (return/warranty) included
- Canonical identity:
  - Product key: `product_id`
  - Merchant key: `merchant_domain`
---
## Pagination Rules
**Merchant List:**
- Method: keyset
- Param: `after` (exclusive cursor for merchant_domain)
- Typical page size: 50

**Merchant Products:**
- Method: page/per_page
- Params: `page`, `per_page`, `sort` (values: title, price, updated_at, -updated_at)
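
To make the two pagination styles concrete, here is a minimal sketch that builds request URLs for each. It assumes the merchant list lives at `/merchant` and that a merchant's products are paged at `/merchant/{merchant_domain}`; the second path is an inference from the canonical URL patterns, not something this file states. The function names are hypothetical.

```python
from typing import Optional
from urllib.parse import quote, urlencode

BASE_URL = "https://productregistry.org"
SORT_VALUES = {"title", "price", "updated_at", "-updated_at"}


def merchant_list_url(after: Optional[str] = None) -> str:
    """Keyset pagination: pass the last merchant_domain seen as `after`."""
    url = f"{BASE_URL}/merchant"
    return f"{url}?{urlencode({'after': after})}" if after else url


def merchant_products_url(merchant_domain: str, page: int, per_page: int, sort: str) -> str:
    """Page/per_page pagination for one merchant (path is an assumption)."""
    if sort not in SORT_VALUES:
        raise ValueError(f"unsupported sort value: {sort!r}")
    query = urlencode({"page": page, "per_page": per_page, "sort": sort})
    return f"{BASE_URL}/merchant/{quote(merchant_domain, safe='')}?{query}"
```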
---
## Canonicality & Duplicates
- **Canonical Header:**
`Link: <{canonical_url}>; rel="canonical"`
- **Duplicate Handling:**
  - Prefer canonical URL over UTM variants
  - Ignore params: `utm_*`, `rid`, `fbclid`
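
The duplicate-handling rule can be sketched as a URL normalizer that drops the ignored tracking parameters before two URLs are compared. The function name `normalize_url` is hypothetical, and matching parameter names case-sensitively is an assumption.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

IGNORED_PARAMS = {"rid", "fbclid"}  # exact names listed above
IGNORED_PREFIX = "utm_"             # covers the utm_* family


def normalize_url(url: str) -> str:
    """Strip ignored query parameters, keeping all others in their original order."""
    parts = urlsplit(url)
    kept = [
        (name, value)
        for name, value in parse_qsl(parts.query, keep_blank_values=True)
        if name not in IGNORED_PARAMS and not name.startswith(IGNORED_PREFIX)
    ]
    return urlunsplit(parts._replace(query=urlencode(kept)))
```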
---
## Crawler Etiquette
- **Accept-Language:** en
- **Accept-Encoding:** gzip, br
- **User-Agent:** org+crawler-name (with contact URL)
- **Respect-Delays:** true
- **Max-Requests-Per-Minute:** 60
- **Max-Parallel-Requests:** 4
- **Retry-After-429:** 60 s (use exponential backoff)
- **Timeout:** ≤ 20 s per request
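
A client could encode these limits roughly as follows. The `Pacer` class is a hypothetical helper that spaces request starts evenly (one per second at 60 requests per minute), and the `User-Agent` string is a made-up example of the org+crawler-name shape. A real crawler would additionally cap concurrency at 4, for instance with a semaphore, and would need Brotli support to advertise `br`.

```python
import time

MAX_REQUESTS_PER_MINUTE = 60
MAX_PARALLEL_REQUESTS = 4
TIMEOUT_SECONDS = 20

# Hypothetical crawler identity; substitute your own name and contact URL.
REQUEST_HEADERS = {
    "User-Agent": "ExampleOrg-ExampleBot/1.0 (+https://example.org/bot)",
    "Accept-Language": "en",
    "Accept-Encoding": "gzip, br",
}


class Pacer:
    """Spaces request starts so the per-minute budget is never exceeded."""

    def __init__(self, per_minute=MAX_REQUESTS_PER_MINUTE,
                 clock=time.monotonic, sleep=time.sleep):
        self.min_interval = 60.0 / per_minute
        self._clock = clock
        self._sleep = sleep
        self._next_allowed = 0.0

    def wait(self) -> float:
        """Block until the next request may start; return the delay applied."""
        now = self._clock()
        delay = max(0.0, self._next_allowed - now)
        if delay:
            self._sleep(delay)
        # Reserve the next slot one interval after this request's start time.
        self._next_allowed = max(now, self._next_allowed) + self.min_interval
        return delay
```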
---
## Usage & Attribution
- **Attribution:** optional but appreciated
- **Cache & Reuse:** permitted for factual data
- **Prohibited:**
  - Training on personal data
  - Account enumeration
  - Rate abuse or scraping of merchant private info
---
## Error Handling
| HTTP Code | Recommendation |
| --------- | ------------------------------------ |
| 429 | exponential backoff |
| 5xx | retry with jitter |
| 404 | stop retrying |
| 410 | permanent removal (respect deletion) |
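
The table above might translate into a retry policy along these lines. The 60 s base for 429 comes from the etiquette section. The one-hour cap, the 1 s base for 5xx, and the choice of "full jitter" (a uniformly random fraction of the backoff) are assumptions made for illustration, and `retry_delay` is a hypothetical helper.

```python
import random
from typing import Callable, Optional


def retry_delay(status: int, attempt: int, cap: float = 3600.0,
                rng: Callable[[], float] = random.random) -> Optional[float]:
    """Return seconds to wait before retry number `attempt` (0-based), or None to stop."""
    if status in (404, 410):
        # 404: stop retrying. 410: permanent removal, so also drop the record.
        return None
    if status == 429:
        # Exponential backoff starting from the documented 60 s.
        return min(cap, 60.0 * 2 ** attempt)
    if 500 <= status <= 599:
        # Exponential backoff with full jitter (assumed 1 s base).
        return min(cap, 2.0 ** attempt) * rng()
    # Any other status is not covered by the table; this sketch does not retry.
    return None
```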
---
## Examples
1. **Discover all products:**
`GET https://productregistry.org/sitemaps/products.xml`
2. **Fetch one product's structured data:**
`GET https://productregistry.org/product/{product_id}`
3. **Iterate merchants:**
`GET https://productregistry.org/merchant`
Follow: `Link: <.../merchant?after={cursor}>; rel=next`
4. **Normalize URLs:**
Remove `utm_*`, `rid`, and `fbclid` params before deduplication.
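
For the merchant iteration in example 3, a crawler has to pull the `rel=next` target out of the `Link` response header. The sketch below is a simplified parser: it handles quoted and unquoted `rel` values and multiple comma-separated links, but not quoted parameter values that themselves contain commas or semicolons. The function name `next_link` and the cursor values in the usage lines are hypothetical.

```python
import re
from typing import Optional

# One link-value: <target> followed by zero or more ;param=value pairs.
_LINK_VALUE = re.compile(r"<([^>]*)>\s*((?:;\s*[^;,]+)*)")


def next_link(link_header: str) -> Optional[str]:
    """Return the target of the first link whose rel includes 'next', if any."""
    for match in _LINK_VALUE.finditer(link_header):
        target, params = match.group(1), match.group(2)
        for param in params.split(";"):
            name, _, value = param.strip().partition("=")
            rels = value.strip().strip('"').lower().split()
            if name.strip().lower() == "rel" and "next" in rels:
                return target
    return None


# Usage with made-up cursor values:
single = next_link("<https://productregistry.org/merchant?after=example.com>; rel=next")
multi = next_link('<https://productregistry.org/merchant>; rel="first", '
                  '<https://productregistry.org/merchant?after=shop.example>; rel="next"')
```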
---
© 2026 Product Registry. Neutral AI-ready product index.
Use of this data implies agreement with the terms and license listed above.