# Product Registry β€” LLMs & Crawler Guidance **Location:** /llms.txt **Version:** 1.1 **Last-Updated:** 2025-10-15 **Base-URL:** https://productregistry.org/ --- ## 🧭 Identification & Contact - **Agent-Policy:** welcome - **Contact:** support@productregistry.org - **Documentation:** https://productregistry.org/ - **Robots-Txt:** https://productregistry.org/robots.txt - **License:** CC BY 4.0 (facts and structured data) - **Data-Usage:** factual data (non-PII) may be cached and reused with attribution --- ## πŸ” Discovery - **Sitemap-Index:** https://productregistry.org/sitemap.xml - **Sitemap:** https://productregistry.org/sitemaps/products.xml - **Sitemap:** https://productregistry.org/sitemaps/products-jsonld.xml - **Sitemap:** https://productregistry.org/sitemaps/merchants.xml - **Machine Feeds:** - `https://productregistry.org/product/{id}.jsonld` β€” JSON-LD endpoint for each product - `https://productregistry.org/sitemaps/products-jsonld.xml` β€” nightly feed of machine endpoints --- ## πŸ“¦ Canonical URL Patterns | Pattern | Type | Canonical | | ----------------------------- | --------------- | --------- | | `/product/{product_id}` | ProductPage | βœ… | | `/merchant/{merchant_domain}` | MerchantListing | βœ… | | `/` | Index | βœ… | --- ## βš™οΈ Access Rules - **Allow:** `/` - **Allow:** `/product/` - **Allow:** `/merchant/` - **Allow:** `/sitemaps/` - **Allow:** `/.well-known/` - **Disallow:** `/api/` - **Prefetch:** disallowed (avoid aggressive speculative fetching) --- ## πŸ•’ Crawl & Refresh Guidance | Path | Recommended Frequency | | ------------- | --------------------- | | `/product/*` | daily | | `/merchant/*` | weekly | | `/` | monthly | - Use standard freshness headers (`ETag`, `Last-Modified`, `Cache-Control`). - If missing, assume **default cache TTL = 86400 s (24 h)**. --- ## 🧩 Structured Data All product pages embed: ```html ``` - Identifiers: GTIN, UPC, EAN, SKU supported - Offer, Brand, AggregateRating, Policy (return/warranty) included - Canonical identity: - Product key β†’ `product_id` - Merchant key β†’ `merchant_domain` --- ## πŸ”’ Pagination Rules **Merchant List:** - Method: keyset - Param: `after` (exclusive cursor for merchant_domain) - Typical page size: 50 **Merchant Products:** - Method: page/per_page - Params: `page`, `per_page`, `sort` (values: title, price, updated_at, -updated_at) --- ## πŸ”— Canonicality & Duplicates - **Canonical Header:** `Link: ; rel="canonical"` - **Duplicate Handling:** - Prefer canonical URL over UTM variants - Ignore params: `utm_*`, `rid`, `fbclid` --- ## πŸ€– Crawler Etiquette - **Accept-Language:** en - **Accept-Encoding:** gzip, br - **User-Agent:** org+crawler-name (with contact URL) - **Respect-Delays:** true - **Max-Requests-Per-Minute:** 60 - **Max-Parallel-Requests:** 4 - **Retry-After-429:** 60 s (use exponential backoff) - **Timeout:** ≀ 20 s per request --- ## βš–οΈ Usage & Attribution - **Attribution:** optional but appreciated - **Cache & Reuse:** permitted for factual data - **Prohibited:** - Training on personal data - Account enumeration - Rate abuse or scraping of merchant private info --- ## 🚨 Error Handling | HTTP Code | Recommendation | | --------- | ------------------------------------ | | 429 | exponential backoff | | 5xx | retry with jitter | | 404 | stop retrying | | 410 | permanent removal (respect deletion) | --- ## πŸ§ͺ Examples 1. **Discover all products:** `GET https://productregistry.org/sitemaps/products.xml` 2. **Fetch one product’s structured data:** `GET https://productregistry.org/product/{product_id}` 3. **Iterate merchants:** `GET https://productregistry.org/merchant` Follow: `Link: <.../merchant?after={cursor}>; rel=next` 4. **Normalize URLs:** Remove `utm_*` and `rid` params before deduplication. --- Β© 2026 Product Registry β€” Neutral AI-ready product index. Use of this data implies agreement with the terms and license listed above.