Is Your Ecommerce Site Crawlable by LLMs? How to Test and Fix It
Jiri Stepanek
If AI crawlers cannot access your product pages, your products will not appear in AI-generated shopping answers. This guide covers robots.txt for AI bots, JavaScript rendering issues, sitemap hygiene, canonicalization, and content accessibility for LLMs.

LLM crawlability for ecommerce: the new technical baseline
LLM crawlability determines whether AI shopping assistants can find and understand your product pages. If GPTBot cannot crawl your site, your products will not appear in ChatGPT shopping answers. If your product specs are rendered only with JavaScript, most AI crawlers see an empty page.
This is not a theoretical concern. GPTBot's share of web crawling has surged from 5% to 30% in the past year. AI shopping features in Google AI Mode, ChatGPT, Perplexity, and others are all pulling from crawled web content to recommend products. The ecommerce sites that show up in these answers are the ones that are technically accessible.
Traditional SEO crawlability (Googlebot access, sitemap coverage, canonical tags) still matters. But LLM crawlability adds new requirements that many ecommerce sites have not addressed. This guide covers the five areas you need to check and fix.
For how product data quality affects what AI engines do with your content once they access it, see our guide on product data enrichment.
Robots.txt: who to allow and who to block
Your robots.txt file is the first gate. If an AI crawler is blocked, nothing else matters — your content is invisible to that engine.
Major AI crawlers to know
| Bot | Operator | Purpose | Recommendation |
|---|---|---|---|
| GPTBot | OpenAI | ChatGPT training + shopping features | Allow |
| OAI-SearchBot | OpenAI | Real-time web search in ChatGPT | Allow |
| ClaudeBot | Anthropic | Claude training + retrieval | Allow |
| Googlebot | Google | Search + AI Mode + AI Overviews | Allow (usually already allowed) |
| Google-Extended | Google | Gemini training | Consider allowing |
| PerplexityBot | Perplexity | AI search + shopping | Allow |
| Applebot-Extended | Apple | Apple Intelligence features | Consider allowing |
| AmazonBot | Amazon | Alexa + Amazon search | Allow for ecommerce |
| Bytespider | ByteDance | TikTok search features | Evaluate |
Recommended robots.txt approach for ecommerce
Rather than blocking AI crawlers by default, explicitly allow the ones that drive shopping visibility and block only those you have a clear reason to exclude:
```
User-agent: GPTBot
Allow: /

User-agent: ClaudeBot
Allow: /

User-agent: PerplexityBot
Allow: /
```
If you need to block specific sections (admin pages, checkout flows, internal search results), use targeted Disallow rules rather than blanket blocks.
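As an illustration, a robots.txt combining explicit AI-crawler allowance with targeted exclusions might look like this (the disallowed paths and sitemap URL are placeholders — use your platform's actual URL structure):

```
User-agent: GPTBot
Allow: /
Disallow: /checkout/
Disallow: /admin/
Disallow: /search

User-agent: *
Disallow: /checkout/
Disallow: /admin/

Sitemap: https://yourstore.com/sitemap.xml
```

Keeping the Disallow rules scoped to transactional and internal paths means product and category pages stay fully crawlable.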
Common mistake: many ecommerce sites added blanket AI crawler blocks in 2024 when the debate about training data was heated. If your robots.txt still blocks GPTBot or ClaudeBot, you are now invisible to their shopping features. Review and update.
JavaScript rendering: the silent killer
Most AI crawlers — including GPTBot, ClaudeBot, and PerplexityBot — do not execute JavaScript. They only see what is in the initial HTML response. This is fundamentally different from Googlebot, which can render JavaScript (though with delays).
What this means for ecommerce sites
If your product pages use client-side JavaScript to load:
- Product titles and descriptions
- Spec tables and attribute lists
- Pricing and availability
- Review data and ratings
- Image galleries
...then AI crawlers see none of it. Your page appears empty or incomplete.
How to test
The simplest test: use curl to fetch your product page and examine the raw HTML:
```shell
curl -s https://yourstore.com/product/example | grep -i "product-title"
```
If your product title, price, or specs are not in the raw HTML, AI crawlers cannot see them.
You can also use browser developer tools: disable JavaScript and reload a product page. If critical content disappears, you have a rendering problem.
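The same check can be automated by scanning the unrendered HTML for the strings a crawler must be able to see. A minimal Python sketch (the sample page and marker strings are placeholders; in practice you would fetch the body with a plain HTTP client that does not execute JavaScript and pass it in):

```python
def missing_from_raw_html(html: str, markers: list[str]) -> list[str]:
    """Return the markers (product title, price, SKU, ...) that do NOT
    appear in the raw HTML. Anything listed here is invisible to
    crawlers that skip JavaScript rendering."""
    return [m for m in markers if m.lower() not in html.lower()]

# A typical client-side-rendered page: the crawler sees only an empty shell.
raw = '<html><body><div id="app"></div><script src="/bundle.js"></script></body></html>'
print(missing_from_raw_html(raw, ["Red Sneakers", "49.99"]))
# → ['Red Sneakers', '49.99'] — all critical content is JS-rendered
```

Run it against your top product pages; an empty list means the critical data survives without JavaScript.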
How to fix
- Server-side rendering (SSR) — render product content on the server so the initial HTML includes all critical data. This is the most reliable fix.
- Static site generation (SSG) — pre-render product pages at build time. Works well for catalogs that do not change every minute.
- Hybrid rendering — server-render the critical content (title, price, specs, schema markup) and client-render non-essential elements (related products, dynamic recommendations).
If a full architecture change is not feasible short-term, ensure that at minimum your Product schema markup (JSON-LD) is in the initial HTML. AI crawlers can extract structured data even if the visual content requires JavaScript.
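For example, the JSON-LD payload can be generated server-side and injected into the initial HTML regardless of how the rest of the page renders. A sketch, assuming placeholder product values (field names follow schema.org's Product and Offer types):

```python
import json

def product_jsonld(name, sku, price, currency, in_stock=True):
    """Build a minimal schema.org Product JSON-LD payload for the
    initial HTML. Values here are illustrative placeholders."""
    return {
        "@context": "https://schema.org",
        "@type": "Product",
        "name": name,
        "sku": sku,
        "offers": {
            "@type": "Offer",
            "price": str(price),
            "priceCurrency": currency,
            "availability": "https://schema.org/InStock"
                if in_stock else "https://schema.org/OutOfStock",
        },
    }

# Embed the serialized payload in the server-rendered <head> or <body>.
script_tag = (
    '<script type="application/ld+json">'
    + json.dumps(product_jsonld("Red Sneakers", "SKU-123", 49.99, "EUR"))
    + "</script>"
)
print(script_tag)
```

Because this tag is plain text in the initial response, even non-rendering crawlers can parse the product name, price, and availability.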
Sitemap hygiene and content structure
A clean sitemap helps AI crawlers (and Googlebot) discover your product pages efficiently.
Sitemap best practices for LLM crawlability
- Include all active product pages — every in-stock product with a canonical URL should be in your sitemap
- Exclude non-canonical URLs — filtered views, search result pages, and paginated listings should not be in the sitemap
- Use `lastmod` dates — AI crawlers prioritize recently updated content. Accurate `lastmod` values help them find your newest products.
- Segment by type — use separate sitemaps for products, categories, and blog content. This makes it easier for crawlers to prioritize.
- Keep it under 50,000 URLs per sitemap file — use sitemap index files for larger catalogs
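The practices above can be sketched in a small generator. A minimal example using Python's standard library (URLs and dates are placeholders; for catalogs over 50,000 URLs you would emit multiple files plus a sitemap index):

```python
import xml.etree.ElementTree as ET
from datetime import date

NS = "http://www.sitemaps.org/schemas/sitemap/0.9"

def build_sitemap(urls):
    """urls: iterable of (loc, lastmod_date) pairs.
    Returns the XML string for one <urlset> file."""
    urlset = ET.Element("urlset", xmlns=NS)
    for loc, lastmod in urls:
        url = ET.SubElement(urlset, "url")
        ET.SubElement(url, "loc").text = loc
        # ISO 8601 dates, as the sitemap protocol expects.
        ET.SubElement(url, "lastmod").text = lastmod.isoformat()
    return ET.tostring(urlset, encoding="unicode")

sitemap_xml = build_sitemap(
    [("https://yourstore.com/product/red-sneakers", date(2025, 1, 15))]
)
print(sitemap_xml)
```

Feed it only canonical, in-stock product URLs so the sitemap never advertises filtered or duplicate pages.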
The llms.txt emerging standard
A new standard called llms.txt is gaining traction. Placed at your domain root (like robots.txt), it provides AI systems with a structured summary of your site — what it is about, what content to prioritize, and how to navigate it.
For ecommerce sites, an llms.txt might include:
- Site description and product categories
- Links to your most important category pages
- Links to buying guides and comparison content
- API endpoints for product data (if available)
This is still early-stage, but forward-thinking ecommerce sites are already adopting it.
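A minimal llms.txt for a store might look like this (markdown format per the llms.txt proposal: an H1 title, a blockquote summary, and H2 sections of links; the store name and URLs are placeholders):

```
# Example Store
> Online retailer of running shoes and outdoor gear, shipping EU-wide.

## Categories
- [Running shoes](https://yourstore.com/running-shoes): full catalog with specs
- [Outdoor gear](https://yourstore.com/outdoor-gear): tents, packs, accessories

## Guides
- [Shoe sizing guide](https://yourstore.com/guides/shoe-sizing)
- [Trail vs road running shoes](https://yourstore.com/guides/trail-vs-road)
```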
Canonicalization and duplicate content
AI crawlers, like Googlebot, can get confused by duplicate content. Ecommerce sites are especially prone to this because of:
- URL parameters — filters, sorting, and tracking parameters create thousands of duplicate URLs
- Variant URLs — the same product accessible at `/shoes/red-sneakers` and `/sneakers/red`
- HTTP vs HTTPS and www vs non-www — still common on older ecommerce platforms
- Pagination — category pages with `?page=2`, `?page=3` creating thin duplicate content
How to fix
- Canonical tags — every product page should have a self-referencing `<link rel="canonical">` pointing to the clean, preferred URL
- Parameter handling — use `robots.txt` Disallow rules or canonical tags to prevent crawlers from indexing filtered/sorted variations
- Redirect chains — clean up redirect chains so crawlers reach the canonical URL in one hop
- Hreflang for multilingual sites — if you sell in multiple languages, implement hreflang tags so AI crawlers understand which version serves which market
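A quick way to audit the canonical-tag rule at scale is to extract the declared canonical from each page and compare it to the page's own clean URL. A simple regex sketch (it assumes `rel` appears before `href` in the tag; a real audit would use an HTML parser):

```python
import re

def canonical_is_self_referencing(html: str, page_url: str) -> bool:
    """True if the page declares a canonical URL that points back to
    the page itself (ignoring a trailing slash)."""
    m = re.search(r'<link[^>]+rel="canonical"[^>]+href="([^"]+)"', html)
    if not m:
        return False  # no canonical declared at all
    return m.group(1).rstrip("/") == page_url.rstrip("/")

clean = "https://yourstore.com/shoes/red-sneakers"
html = f'<head><link rel="canonical" href="{clean}"></head>'
print(canonical_is_self_referencing(html, clean))  # → True
```

The same function also catches parameterized duplicates: a filtered URL whose canonical points at the clean URL will correctly return False for itself.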
For broader product page optimization that benefits both humans and AI, see our guide on PDP optimization.
Content accessibility: making pages machine-readable
Beyond rendering and crawl access, the structure of your content affects how well AI systems can extract and use it.
Use semantic HTML
- Use `<h1>` for the product title, `<h2>` for section headings
- Wrap spec tables in `<table>` with proper `<th>` and `<td>` elements
- Use `<ul>` or `<ol>` for feature lists
- Include meaningful `alt` text on all product images
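Putting those rules together, a machine-readable product section might look like this (the product details are illustrative):

```html
<h1>Red Sneakers</h1>
<table>
  <tr><th>Material</th><td>Mesh upper, rubber sole</td></tr>
  <tr><th>Weight</th><td>240 g</td></tr>
  <tr><th>Sizes</th><td>36–46 EU</td></tr>
</table>
<ul>
  <li>Breathable mesh upper</li>
  <li>Non-slip outsole</li>
</ul>
<img src="/img/red-sneakers-side.jpg" alt="Red mesh running sneaker, side view">
```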
Avoid content in images
Critical product information — specs, ingredients, sizing charts — should be in HTML text, not embedded in images. AI crawlers cannot read text from images.
Minimize noise
AI systems extract content more reliably from pages with a high signal-to-noise ratio. Reduce boilerplate, excessive cross-sell blocks, and cookie banners that push product content below the fold.
Include structured data
Even if your visual content is accessible, JSON-LD structured data provides a parallel, machine-optimized layer that AI systems can parse reliably. Tools like Lasso can help ensure your product data is complete enough to generate meaningful schema markup.
Testing your LLM crawlability: a checklist
Run through this checklist for your top product pages:
- `robots.txt` allows GPTBot, ClaudeBot, PerplexityBot, and Googlebot
- Product pages return complete content in initial HTML (no JavaScript dependency for critical data)
- Product schema markup (JSON-LD) is present in the HTML source
- XML sitemap includes all active canonical product URLs with accurate `lastmod`
- Canonical tags are present and self-referencing on product pages
- No redirect chains or loops on product URLs
- Critical specs and attributes are in HTML text, not embedded in images
- Image `alt` text is descriptive and includes product attributes
If you find gaps, prioritize fixing JavaScript rendering and robots.txt access first — these are the most common blockers. Then work on sitemap hygiene and content structure.
For teams managing large catalogs where data completeness is the bottleneck, explore Lasso's enrichment capabilities or reach out for a demo.