Is Your Ecommerce Site Crawlable by LLMs? How to Test and Fix It

Jiri Stepanek

If AI crawlers cannot access your product pages, your products will not appear in AI-generated shopping answers. This guide covers robots.txt for AI bots, JavaScript rendering issues, sitemap hygiene, canonicalization, and content accessibility for LLMs.

LLM crawlability for ecommerce: the new technical baseline

LLM crawlability determines whether AI shopping assistants can find and understand your product pages. If GPTBot cannot crawl your site, your products will not appear in ChatGPT shopping answers. If your product specs are rendered only with JavaScript, most AI crawlers see an empty page.

This is not a theoretical concern. GPTBot's share of web crawling has surged from 5% to 30% in the past year. AI shopping features in Google AI Mode, ChatGPT, Perplexity, and others are all pulling from crawled web content to recommend products. The ecommerce sites that show up in these answers are the ones that are technically accessible.

Traditional SEO crawlability (Googlebot access, sitemap coverage, canonical tags) still matters. But LLM crawlability adds new requirements that many ecommerce sites have not addressed. This guide covers the five areas you need to check and fix.

For how product data quality affects what AI engines do with your content once they access it, see our guide on product data enrichment.

Robots.txt: who to allow and who to block

Your robots.txt file is the first gate. If an AI crawler is blocked, nothing else matters — your content is invisible to that engine.

Major AI crawlers to know

| Bot | Operator | Purpose | Recommendation |
| --- | --- | --- | --- |
| GPTBot | OpenAI | ChatGPT training + shopping features | Allow |
| OAI-SearchBot | OpenAI | Real-time web search in ChatGPT | Allow |
| ClaudeBot | Anthropic | Claude training + retrieval | Allow |
| Googlebot | Google | Search + AI Mode + AI Overviews | Allow (usually already allowed) |
| Google-Extended | Google | Gemini training | Consider allowing |
| PerplexityBot | Perplexity | AI search + shopping | Allow |
| Applebot-Extended | Apple | Apple Intelligence features | Consider allowing |
| AmazonBot | Amazon | Alexa + Amazon search | Allow for ecommerce |
| Bytespider | ByteDance | TikTok search features | Evaluate |

Rather than blocking AI crawlers by default, explicitly allow the ones that drive shopping visibility and block only those you have a clear reason to exclude:

User-agent: GPTBot
Allow: /

User-agent: ClaudeBot
Allow: /

User-agent: PerplexityBot
Allow: /

If you need to block specific sections (admin pages, checkout flows, internal search results), use targeted Disallow rules rather than blanket blocks.
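A targeted rule set might look like this (the paths are illustrative; substitute the actual admin, checkout, and internal search paths your platform uses):

```
User-agent: GPTBot
Allow: /
Disallow: /admin/
Disallow: /checkout/
Disallow: /search

User-agent: ClaudeBot
Allow: /
Disallow: /admin/
Disallow: /checkout/
Disallow: /search
```

This keeps product and category pages open while excluding pages that have no value in AI answers.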

Common mistake: many ecommerce sites added blanket AI crawler blocks in 2024 when the debate about training data was heated. If your robots.txt still blocks GPTBot or ClaudeBot, you are now invisible to their shopping features. Review and update.

JavaScript rendering: the silent killer

Most AI crawlers — including GPTBot, ClaudeBot, and PerplexityBot — do not execute JavaScript. They only see what is in the initial HTML response. This is fundamentally different from Googlebot, which can render JavaScript (though with delays).

What this means for ecommerce sites

If your product pages use client-side JavaScript to load:

  • Product titles and descriptions
  • Spec tables and attribute lists
  • Pricing and availability
  • Review data and ratings
  • Image galleries

...then AI crawlers see none of it. Your page appears empty or incomplete.

How to test

The simplest test: use curl to fetch your product page and check the raw HTML for a distinctive piece of content, such as the product name itself (grepping for a CSS class name can give a false positive, since an empty HTML shell often includes the class even when JavaScript fills in the content):

curl -s https://yourstore.com/product/example | grep -i "Example Product Name"

If your product title, price, or specs are not in the raw HTML, AI crawlers cannot see them.

You can also use browser developer tools: disable JavaScript and reload a product page. If critical content disappears, you have a rendering problem.
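The same check can be scripted across many pages. A minimal sketch using only the standard library (the URL and marker strings are placeholders to adapt to your catalog): it fetches a page the way a non-JavaScript crawler would and reports which critical markers appear in the raw HTML.

```python
import urllib.request

def check_raw_html(html, markers):
    """Return a dict mapping each marker string to whether it appears
    in the server-rendered HTML (what a non-JS crawler would see)."""
    return {m: (m in html) for m in markers}

def fetch_raw(url):
    """Fetch a page with a single GET and no JavaScript execution."""
    req = urllib.request.Request(url, headers={"User-Agent": "crawl-check/0.1"})
    with urllib.request.urlopen(req, timeout=10) as resp:
        return resp.read().decode("utf-8", errors="replace")

if __name__ == "__main__":
    # Placeholder URL and markers; use a real product name and price.
    html = fetch_raw("https://yourstore.com/product/example")
    report = check_raw_html(html, ["Example Product", "$49.99", "application/ld+json"])
    for marker, found in report.items():
        print(f"{'OK     ' if found else 'MISSING'} {marker}")
```

Run it against a handful of top product pages; any MISSING line for a title, price, or JSON-LD marker points to a rendering gap.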

How to fix

  • Server-side rendering (SSR) — render product content on the server so the initial HTML includes all critical data. This is the most reliable fix.
  • Static site generation (SSG) — pre-render product pages at build time. Works well for catalogs that do not change every minute.
  • Hybrid rendering — server-render the critical content (title, price, specs, schema markup) and client-render non-essential elements (related products, dynamic recommendations).

If a full architecture change is not feasible short-term, ensure that at minimum your Product schema markup (JSON-LD) is in the initial HTML. AI crawlers can extract structured data even if the visual content requires JavaScript.
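You can verify that the structured data really is in the initial HTML by parsing it out directly. A stdlib-only sketch (the regex assumes conventionally formatted script tags):

```python
import json
import re

def extract_json_ld(html):
    """Pull all JSON-LD blocks out of raw HTML and return the parsed
    objects. Blocks with invalid JSON are skipped."""
    pattern = re.compile(
        r'<script[^>]*type=["\']application/ld\+json["\'][^>]*>(.*?)</script>',
        re.DOTALL | re.IGNORECASE,
    )
    objects = []
    for block in pattern.findall(html):
        try:
            objects.append(json.loads(block))
        except json.JSONDecodeError:
            continue
    return objects

def has_product_schema(html):
    """True if any JSON-LD block on the page declares @type Product."""
    return any(obj.get("@type") == "Product"
               for obj in extract_json_ld(html) if isinstance(obj, dict))
```

Feed it the raw HTML from a curl fetch; if has_product_schema returns False, the schema markup is being injected client-side and most AI crawlers will never see it.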

Sitemap hygiene and content structure

A clean sitemap helps AI crawlers (and Googlebot) discover your product pages efficiently.

Sitemap best practices for LLM crawlability

  • Include all active product pages — every in-stock product with a canonical URL should be in your sitemap
  • Exclude non-canonical URLs — filtered views, search result pages, and paginated listings should not be in the sitemap
  • Use lastmod dates — AI crawlers prioritize recently updated content. Accurate lastmod values help them find your newest products.
  • Segment by type — use separate sitemaps for products, categories, and blog content. This makes it easier for crawlers to prioritize.
  • Keep it under 50,000 URLs per sitemap file — use sitemap index files for larger catalogs
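For a large catalog, the segmentation and size rules above combine into a sitemap index. A sketch (file names and dates are illustrative):

```xml
<?xml version="1.0" encoding="UTF-8"?>
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <sitemap>
    <loc>https://yourstore.com/sitemap-products.xml</loc>
    <lastmod>2025-01-15</lastmod>
  </sitemap>
  <sitemap>
    <loc>https://yourstore.com/sitemap-categories.xml</loc>
    <lastmod>2025-01-10</lastmod>
  </sitemap>
</sitemapindex>
```

Each child sitemap stays under the 50,000-URL limit, and the per-file lastmod values let crawlers skip segments that have not changed.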

The llms.txt emerging standard

A new standard called llms.txt is gaining traction. Placed at your domain root (like robots.txt), it provides AI systems with a structured summary of your site — what it is about, what content to prioritize, and how to navigate it.

For ecommerce sites, an llms.txt might include:

  • Site description and product categories
  • Links to your most important category pages
  • Links to buying guides and comparison content
  • API endpoints for product data (if available)

This is still early-stage, but forward-thinking ecommerce sites are already adopting it.
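Following the structure of the llms.txt proposal (a markdown file with a title, a short summary, and sections of annotated links), a minimal example might look like this (store name and URLs are hypothetical):

```
# YourStore
> Online retailer of running shoes and outdoor gear.

## Categories
- [Running Shoes](https://yourstore.com/running-shoes): road and trail models
- [Outdoor Gear](https://yourstore.com/outdoor-gear): packs, tents, apparel

## Guides
- [Running Shoe Buying Guide](https://yourstore.com/guides/running-shoes)
```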

Canonicalization and duplicate content

AI crawlers, like Googlebot, can get confused by duplicate content. Ecommerce sites are especially prone to this because of:

  • URL parameters — filters, sorting, and tracking parameters create thousands of duplicate URLs
  • Variant URLs — the same product accessible at /shoes/red-sneakers and /sneakers/red
  • HTTP vs HTTPS and www vs non-www — still common on older ecommerce platforms
  • Pagination — category pages with ?page=2, ?page=3 creating thin duplicate content

How to fix

  • Canonical tags — every product page should have a self-referencing <link rel="canonical"> pointing to the clean, preferred URL
  • Parameter handling — use robots.txt Disallow rules or canonical tags to prevent crawlers from indexing filtered/sorted variations
  • Redirect chains — clean up redirect chains so crawlers reach the canonical URL in one hop
  • Hreflang for multilingual sites — if you sell in multiple languages, implement hreflang tags so AI crawlers understand which version serves which market
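The first fix is easy to audit in bulk: compare each product URL with the canonical it declares. A minimal regex-based sketch (it assumes a conventionally formatted link tag with rel before href; a full HTML parser would be more robust):

```python
import re

def extract_canonical(html):
    """Return the href of the first rel="canonical" link tag, or None."""
    match = re.search(
        r'<link[^>]*rel=["\']canonical["\'][^>]*href=["\']([^"\']+)["\']',
        html, re.IGNORECASE,
    )
    return match.group(1) if match else None

def is_self_canonical(page_url, html):
    """True if the page declares itself as the canonical URL
    (ignoring a trailing slash)."""
    canonical = extract_canonical(html)
    return canonical is not None and canonical.rstrip("/") == page_url.rstrip("/")
```

Run it against both the clean URL and a parameterized variant of the same page: the clean URL should be self-canonical, and the variant should canonicalize back to it.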

For broader product page optimization that benefits both humans and AI, see our guide on PDP optimization.

Content accessibility: making pages machine-readable

Beyond rendering and crawl access, the structure of your content affects how well AI systems can extract and use it.

Use semantic HTML

  • Use <h1> for the product title, <h2> for section headings
  • Wrap spec tables in <table> with proper <th> and <td> elements
  • Use <ul> or <ol> for feature lists
  • Include meaningful alt text on all product images
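Put together, a product page skeleton following these rules might look like this (the product and values are illustrative):

```html
<h1>Trail Runner 2 Running Shoe</h1>

<h2>Specifications</h2>
<table>
  <tr><th>Weight</th><td>280 g</td></tr>
  <tr><th>Heel-to-toe drop</th><td>8 mm</td></tr>
</table>

<h2>Features</h2>
<ul>
  <li>Breathable mesh upper</li>
  <li>Lugged rubber outsole for trail grip</li>
</ul>

<img src="trail-runner-2.jpg"
     alt="Trail Runner 2 trail running shoe in blue, side view" />
```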

Avoid content in images

Critical product information — specs, ingredients, sizing charts — should be in HTML text, not embedded in images. Most AI crawlers do not extract text from images, so anything locked inside a JPEG of a sizing chart is effectively invisible.

Minimize noise

AI systems extract better content from pages with a clear signal-to-noise ratio. Reduce boilerplate, excessive cross-sell blocks, and cookie banners that push product content below the fold.

Include structured data

Even if your visual content is accessible, JSON-LD structured data provides a parallel, machine-optimized layer that AI systems can parse reliably. Tools like Lasso can help ensure your product data is complete enough to generate meaningful schema markup.
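As a sketch of what that parallel layer looks like, here is a minimal generator that maps a product record to schema.org Product JSON-LD (the field names on the input dict are assumptions for illustration, not a reference to any specific tool's API):

```python
import json

def product_json_ld(p):
    """Map a product record (hypothetical field names) to a schema.org
    Product JSON-LD string, ready to embed in a script tag."""
    schema = {
        "@context": "https://schema.org",
        "@type": "Product",
        "name": p["name"],
        "description": p["description"],
        "sku": p["sku"],
        "image": p["images"],
        "offers": {
            "@type": "Offer",
            "price": str(p["price"]),
            "priceCurrency": p["currency"],
            "availability": "https://schema.org/InStock"
            if p["in_stock"] else "https://schema.org/OutOfStock",
        },
    }
    return json.dumps(schema, indent=2)
```

The point of "complete enough" is visible in the mapping: every field the schema needs (name, description, SKU, images, price, currency, availability) has to exist in your product data before markup like this can be generated at all.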

Testing your LLM crawlability: a checklist

Run through this checklist for your top product pages:

  • robots.txt allows GPTBot, ClaudeBot, PerplexityBot, and Googlebot
  • Product pages return complete content in initial HTML (no JavaScript dependency for critical data)
  • Product schema markup (JSON-LD) is present in the HTML source
  • XML sitemap includes all active canonical product URLs with accurate lastmod
  • Canonical tags are present and self-referencing on product pages
  • No redirect chains or loops on product URLs
  • Critical specs and attributes are in HTML text, not embedded in images
  • Image alt text is descriptive and includes product attributes

If you find gaps, prioritize fixing JavaScript rendering and robots.txt access first — these are the most common blockers. Then work on sitemap hygiene and content structure.

For teams managing large catalogs where data completeness is the bottleneck, explore Lasso's enrichment capabilities or reach out for a demo.
