Is Your Ecommerce Site Crawlable by LLMs? How to Test and Fix It

Jiri Stepanek

If AI crawlers cannot access your product pages, your products will not appear in AI-generated shopping answers. This guide covers robots.txt for AI bots, JavaScript rendering issues, sitemap hygiene, canonicalization, and content accessibility for LLMs.

LLM crawlability for ecommerce: the new technical baseline

LLM crawlability determines whether AI shopping assistants can find and understand your product pages. If GPTBot cannot crawl your site, your products will not appear in ChatGPT shopping answers. If your product specs are rendered only with JavaScript, most AI crawlers see an empty page.

This is not a theoretical concern. GPTBot's share of web crawling has surged from 5% to 30% in the past year. AI shopping features in Google AI Mode, ChatGPT, Perplexity, and others are all pulling from crawled web content to recommend products. The ecommerce sites that show up in these answers are the ones that are technically accessible.

Traditional SEO crawlability (Googlebot access, sitemap coverage, canonical tags) still matters. But LLM crawlability adds new requirements that many ecommerce sites have not addressed. This guide covers the five areas you need to check and fix.

For how product data quality affects what AI engines do with your content once they access it, see our guide on product data enrichment.

Robots.txt: who to allow and who to block

Your robots.txt file is the first gate. If an AI crawler is blocked, nothing else matters — your content is invisible to that engine.

Major AI crawlers to know

| Bot | Operator | Purpose | Recommendation |
| --- | --- | --- | --- |
| GPTBot | OpenAI | ChatGPT training + shopping features | Allow |
| OAI-SearchBot | OpenAI | Real-time web search in ChatGPT | Allow |
| ClaudeBot | Anthropic | Claude training + retrieval | Allow |
| Googlebot | Google | Search + AI Mode + AI Overviews | Allow (usually already allowed) |
| Google-Extended | Google | Gemini training | Consider allowing |
| PerplexityBot | Perplexity | AI search + shopping | Allow |
| Applebot-Extended | Apple | Apple Intelligence features | Consider allowing |
| AmazonBot | Amazon | Alexa + Amazon search | Allow for ecommerce |
| Bytespider | ByteDance | TikTok search features | Evaluate |

Rather than blocking AI crawlers by default, explicitly allow the ones that drive shopping visibility and block only those you have a clear reason to exclude:

User-agent: GPTBot
Allow: /

User-agent: ClaudeBot
Allow: /

User-agent: PerplexityBot
Allow: /

If you need to block specific sections (admin pages, checkout flows, internal search results), use targeted Disallow rules rather than blanket blocks.
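A targeted rule set might look like this (the paths are illustrative; substitute the actual admin, checkout, and internal search paths your platform uses):

```
User-agent: GPTBot
Allow: /
Disallow: /admin/
Disallow: /checkout/
Disallow: /search

User-agent: ClaudeBot
Allow: /
Disallow: /admin/
Disallow: /checkout/
Disallow: /search
```

This keeps product and category pages open while excluding pages that have no value in AI answers.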

Common mistake: many ecommerce sites added blanket AI crawler blocks in 2024 when the debate about training data was heated. If your robots.txt still blocks GPTBot or ClaudeBot, you are now invisible to their shopping features. Review and update.

JavaScript rendering: the silent killer

Most AI crawlers — including GPTBot, ClaudeBot, and PerplexityBot — do not execute JavaScript. They only see what is in the initial HTML response. This is fundamentally different from Googlebot, which can render JavaScript (though with delays).

What this means for ecommerce sites

If your product pages use client-side JavaScript to load:

  • Product titles and descriptions
  • Spec tables and attribute lists
  • Pricing and availability
  • Review data and ratings
  • Image galleries

...then AI crawlers see none of it. Your page appears empty or incomplete.

How to test

The simplest test: use curl to fetch your product page and check the raw HTML for a distinctive piece of content, such as the product name itself (grepping for a CSS class name can give a false positive, since an empty HTML shell often includes the class even when JavaScript fills in the content):

curl -s https://yourstore.com/product/example | grep -i "Example Product Name"

If your product title, price, or specs are not in the raw HTML, AI crawlers cannot see them.

You can also use browser developer tools: disable JavaScript and reload a product page. If critical content disappears, you have a rendering problem.
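The same check can be scripted across many pages. A minimal sketch using only the standard library (the URL and marker strings are placeholders to adapt to your catalog): it fetches a page the way a non-JavaScript crawler would and reports which critical markers appear in the raw HTML.

```python
import urllib.request

def check_raw_html(html, markers):
    """Return a dict mapping each marker string to whether it appears
    in the server-rendered HTML (what a non-JS crawler would see)."""
    return {m: (m in html) for m in markers}

def fetch_raw(url):
    """Fetch a page with a single GET and no JavaScript execution."""
    req = urllib.request.Request(url, headers={"User-Agent": "crawl-check/0.1"})
    with urllib.request.urlopen(req, timeout=10) as resp:
        return resp.read().decode("utf-8", errors="replace")

if __name__ == "__main__":
    # Placeholder URL and markers; use a real product name and price.
    html = fetch_raw("https://yourstore.com/product/example")
    report = check_raw_html(html, ["Example Product", "$49.99", "application/ld+json"])
    for marker, found in report.items():
        print(f"{'OK     ' if found else 'MISSING'} {marker}")
```

Run it against a handful of top product pages; any MISSING line for a title, price, or JSON-LD marker points to a rendering gap.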

How to fix

  • Server-side rendering (SSR) — render product content on the server so the initial HTML includes all critical data. This is the most reliable fix.
  • Static site generation (SSG) — pre-render product pages at build time. Works well for catalogs that do not change every minute.
  • Hybrid rendering — server-render the critical content (title, price, specs, schema markup) and client-render non-essential elements (related products, dynamic recommendations).

If a full architecture change is not feasible short-term, ensure that at minimum your Product schema markup (JSON-LD) is in the initial HTML. AI crawlers can extract structured data even if the visual content requires JavaScript.
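You can verify that the structured data really is in the initial HTML by parsing it out directly. A stdlib-only sketch (the regex assumes conventionally formatted script tags):

```python
import json
import re

def extract_json_ld(html):
    """Pull all JSON-LD blocks out of raw HTML and return the parsed
    objects. Blocks with invalid JSON are skipped."""
    pattern = re.compile(
        r'<script[^>]*type=["\']application/ld\+json["\'][^>]*>(.*?)</script>',
        re.DOTALL | re.IGNORECASE,
    )
    objects = []
    for block in pattern.findall(html):
        try:
            objects.append(json.loads(block))
        except json.JSONDecodeError:
            continue
    return objects

def has_product_schema(html):
    """True if any JSON-LD block on the page declares @type Product."""
    return any(obj.get("@type") == "Product"
               for obj in extract_json_ld(html) if isinstance(obj, dict))
```

Feed it the raw HTML from a curl fetch; if has_product_schema returns False, the schema markup is being injected client-side and most AI crawlers will never see it.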

Sitemap hygiene and content structure

A clean sitemap helps AI crawlers (and Googlebot) discover your product pages efficiently.

Sitemap best practices for LLM crawlability

  • Include all active product pages — every in-stock product with a canonical URL should be in your sitemap
  • Exclude non-canonical URLs — filtered views, search result pages, and paginated listings should not be in the sitemap
  • Use lastmod dates — AI crawlers prioritize recently updated content. Accurate lastmod values help them find your newest products.
  • Segment by type — use separate sitemaps for products, categories, and blog content. This makes it easier for crawlers to prioritize.
  • Keep it under 50,000 URLs per sitemap file — use sitemap index files for larger catalogs
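For a large catalog, the segmentation and size rules above combine into a sitemap index. A sketch (file names and dates are illustrative):

```xml
<?xml version="1.0" encoding="UTF-8"?>
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <sitemap>
    <loc>https://yourstore.com/sitemap-products.xml</loc>
    <lastmod>2025-01-15</lastmod>
  </sitemap>
  <sitemap>
    <loc>https://yourstore.com/sitemap-categories.xml</loc>
    <lastmod>2025-01-10</lastmod>
  </sitemap>
</sitemapindex>
```

Each child sitemap stays under the 50,000-URL limit, and the per-file lastmod values let crawlers skip segments that have not changed.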

The llms.txt emerging standard

A new standard called llms.txt is gaining traction. Placed at your domain root (like robots.txt), it provides AI systems with a structured summary of your site — what it is about, what content to prioritize, and how to navigate it.

For ecommerce sites, an llms.txt might include:

  • Site description and product categories
  • Links to your most important category pages
  • Links to buying guides and comparison content
  • API endpoints for product data (if available)

This is still early-stage, but forward-thinking ecommerce sites are already adopting it.
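Following the structure of the llms.txt proposal (a markdown file with a title, a short summary, and sections of annotated links), a minimal example might look like this (store name and URLs are hypothetical):

```
# YourStore
> Online retailer of running shoes and outdoor gear.

## Categories
- [Running Shoes](https://yourstore.com/running-shoes): road and trail models
- [Outdoor Gear](https://yourstore.com/outdoor-gear): packs, tents, apparel

## Guides
- [Running Shoe Buying Guide](https://yourstore.com/guides/running-shoes)
```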

Canonicalization and duplicate content

AI crawlers, like Googlebot, can get confused by duplicate content. Ecommerce sites are especially prone to this because of:

  • URL parameters — filters, sorting, and tracking parameters create thousands of duplicate URLs
  • Variant URLs — the same product accessible at /shoes/red-sneakers and /sneakers/red
  • HTTP vs HTTPS and www vs non-www — still common on older ecommerce platforms
  • Pagination — category pages with ?page=2, ?page=3 creating thin duplicate content

How to fix

  • Canonical tags — every product page should have a self-referencing <link rel="canonical"> pointing to the clean, preferred URL
  • Parameter handling — use robots.txt Disallow rules or canonical tags to prevent crawlers from indexing filtered/sorted variations
  • Redirect chains — clean up redirect chains so crawlers reach the canonical URL in one hop
  • Hreflang for multilingual sites — if you sell in multiple languages, implement hreflang tags so AI crawlers understand which version serves which market
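The first fix is easy to audit in bulk: compare each product URL with the canonical it declares. A minimal regex-based sketch (it assumes a conventionally formatted link tag with rel before href; a full HTML parser would be more robust):

```python
import re

def extract_canonical(html):
    """Return the href of the first rel="canonical" link tag, or None."""
    match = re.search(
        r'<link[^>]*rel=["\']canonical["\'][^>]*href=["\']([^"\']+)["\']',
        html, re.IGNORECASE,
    )
    return match.group(1) if match else None

def is_self_canonical(page_url, html):
    """True if the page declares itself as the canonical URL
    (ignoring a trailing slash)."""
    canonical = extract_canonical(html)
    return canonical is not None and canonical.rstrip("/") == page_url.rstrip("/")
```

Run it against both the clean URL and a parameterized variant of the same page: the clean URL should be self-canonical, and the variant should canonicalize back to it.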

For broader product page optimization that benefits both humans and AI, see our guide on PDP optimization.

Content accessibility: making pages machine-readable

Beyond rendering and crawl access, the structure of your content affects how well AI systems can extract and use it.

Use semantic HTML

  • Use <h1> for the product title, <h2> for section headings
  • Wrap spec tables in <table> with proper <th> and <td> elements
  • Use <ul> or <ol> for feature lists
  • Include meaningful alt text on all product images
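Put together, a product page skeleton following these rules might look like this (the product and values are illustrative):

```html
<h1>Trail Runner 2 Running Shoe</h1>

<h2>Specifications</h2>
<table>
  <tr><th>Weight</th><td>280 g</td></tr>
  <tr><th>Heel-to-toe drop</th><td>8 mm</td></tr>
</table>

<h2>Features</h2>
<ul>
  <li>Breathable mesh upper</li>
  <li>Lugged rubber outsole for trail grip</li>
</ul>

<img src="trail-runner-2.jpg"
     alt="Trail Runner 2 trail running shoe in blue, side view" />
```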

Avoid content in images

Critical product information — specs, ingredients, sizing charts — should be in HTML text, not embedded in images. Most AI crawlers do not extract text from images, so anything locked inside a JPEG of a sizing chart is effectively invisible.

Minimize noise

AI systems extract better content from pages with a clear signal-to-noise ratio. Reduce boilerplate, excessive cross-sell blocks, and cookie banners that push product content below the fold.

Include structured data

Even if your visual content is accessible, JSON-LD structured data provides a parallel, machine-optimized layer that AI systems can parse reliably. Tools like Lasso can help ensure your product data is complete enough to generate meaningful schema markup.
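As a sketch of what that parallel layer looks like, here is a minimal generator that maps a product record to schema.org Product JSON-LD (the field names on the input dict are assumptions for illustration, not a reference to any specific tool's API):

```python
import json

def product_json_ld(p):
    """Map a product record (hypothetical field names) to a schema.org
    Product JSON-LD string, ready to embed in a script tag."""
    schema = {
        "@context": "https://schema.org",
        "@type": "Product",
        "name": p["name"],
        "description": p["description"],
        "sku": p["sku"],
        "image": p["images"],
        "offers": {
            "@type": "Offer",
            "price": str(p["price"]),
            "priceCurrency": p["currency"],
            "availability": "https://schema.org/InStock"
            if p["in_stock"] else "https://schema.org/OutOfStock",
        },
    }
    return json.dumps(schema, indent=2)
```

The point of "complete enough" is visible in the mapping: every field the schema needs (name, description, SKU, images, price, currency, availability) has to exist in your product data before markup like this can be generated at all.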

Testing your LLM crawlability: a checklist

Run through this checklist for your top product pages:

  • robots.txt allows GPTBot, ClaudeBot, PerplexityBot, and Googlebot
  • Product pages return complete content in initial HTML (no JavaScript dependency for critical data)
  • Product schema markup (JSON-LD) is present in the HTML source
  • XML sitemap includes all active canonical product URLs with accurate lastmod
  • Canonical tags are present and self-referencing on product pages
  • No redirect chains or loops on product URLs
  • Critical specs and attributes are in HTML text, not embedded in images
  • Image alt text is descriptive and includes product attributes

If you find gaps, prioritize fixing JavaScript rendering and robots.txt access first — these are the most common blockers. Then work on sitemap hygiene and content structure.

For teams managing large catalogs where data completeness is the bottleneck, explore Lasso's enrichment capabilities or reach out for a demo.
