
llms.txt, robots.txt and Schema.org: The Technical Checklist for AI Citability

Before an AI assistant can recommend your business, it has to be able to read your business. A surprising number of companies fail at this first hurdle — not because their writing is weak, but because their site is technically invisible to the crawlers that feed AI answers. This is a practical checklist you can work through, mapped directly to the checks a CitePulse audit runs.

1. Can AI crawlers reach you? (robots.txt)

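The first gate is robots.txt, the file at the root of your domain that tells automated agents what they may crawl. AI systems identify themselves with named user-agents. Under the Robots Exclusion Protocol, an agent with no matching rule is allowed by default, so the real risk is a rule that blocks these agents, whether by name or through a blanket "User-agent: *" disallow. Block them and you remove yourself from the pool of sources AI answers are built from. Allowing them explicitly makes your intent unambiguous. The agents that matter most today: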

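  • GPTBot: OpenAI's crawler, which gathers content for the models behind ChatGPT.
  • PerplexityBot: Perplexity's crawler.
  • Google-Extended: Google's robots.txt control token for Gemini and AI training. It is a token rather than a separate crawler, and it is independent of the rules you set for ordinary Googlebot.
  • ClaudeBot: Anthropic's crawler for Claude.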

A common and costly mistake is blocking these agents by default — sometimes inherited from a security template — while assuming you are "open." If you want AI citations, your robots.txt should allow these agents and reference your sitemap. What CitePulse checks: we fetch your robots.txt and confirm whether each major AI bot is allowed or blocked, and whether a sitemap is declared.
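A check along these lines can be sketched with Python's standard library. This is a minimal illustration, not a description of how CitePulse implements its audit: the robots.txt body, the example.com URLs and the audit_robots helper are all hypothetical, and the agent list uses Google-Extended, the token Google documents.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt body; example.com and its paths are placeholders.
SAMPLE_ROBOTS_TXT = """\
User-agent: GPTBot
Allow: /

User-agent: PerplexityBot
Allow: /

User-agent: ClaudeBot
Disallow: /

User-agent: *
Disallow: /admin/

Sitemap: https://example.com/sitemap.xml
"""

AI_AGENTS = ["GPTBot", "PerplexityBot", "Google-Extended", "ClaudeBot"]


def audit_robots(robots_txt, url="https://example.com/"):
    """Report, per AI agent, whether `url` may be fetched, plus declared sitemaps."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    allowed = {agent: parser.can_fetch(agent, url) for agent in AI_AGENTS}
    # site_maps() returns None when no Sitemap line is present (Python 3.8+).
    return {"allowed": allowed, "sitemaps": parser.site_maps() or []}


report = audit_robots(SAMPLE_ROBOTS_TXT)
for agent, ok in report["allowed"].items():
    print(f"{agent:16} {'allowed' if ok else 'BLOCKED'}")
print("sitemaps:", report["sitemaps"])
```

In this sample, ClaudeBot is blocked by name, while Google-Extended has no rule of its own and falls through to the "User-agent: *" group. Python's parser applies the first matching rule in file order, which can differ from the longest-match behavior some crawlers use, so treat the result as a first pass.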

2. Can AI render your content? (JavaScript dependency)

This is the single highest-impact factor and the one most often missed. Many AI crawlers do not execute client-side JavaScript the way a browser does. If your headline, services, pricing and proof points only appear after a framework hydrates the page, the crawler sees an empty shell. Your content might be beautiful in a browser and blank to GPTBot.

The fix is to ensure meaningful content exists in the initial HTML response — through server-side rendering, static generation, or pre-rendering. A quick informal test: disable JavaScript in your browser and reload your most important page. If it goes blank, AI sees blank too. What CitePulse checks: we compare the content available without JavaScript against the rendered page and flag pages whose substance depends on scripts.
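The comparison can be approximated offline. The sketch below assumes you already have two HTML strings, the raw server response and the browser-rendered DOM; fetching and rendering are out of scope. The class and function names and both sample pages are hypothetical, and the word-overlap ratio is one simple heuristic, not CitePulse's actual metric.

```python
from html.parser import HTMLParser


class VisibleTextExtractor(HTMLParser):
    """Collect text a non-JavaScript crawler could read from raw HTML."""

    SKIPPED_TAGS = {"script", "style", "template"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only text that sits outside script/style/template elements.
        if self._skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())


def visible_text(html):
    extractor = VisibleTextExtractor()
    extractor.feed(html)
    extractor.close()
    return " ".join(extractor.chunks)


def js_dependency_ratio(initial_html, rendered_html):
    """Share of rendered words missing from the initial response (0.0 = none)."""
    initial_words = set(visible_text(initial_html).lower().split())
    rendered_words = visible_text(rendered_html).lower().split()
    if not rendered_words:
        return 0.0
    missing = [word for word in rendered_words if word not in initial_words]
    return len(missing) / len(rendered_words)


# Hypothetical pages: a client-rendered shell and the same page after hydration.
SHELL = '<html><body><div id="root"></div><script>renderApp()</script></body></html>'
RENDERED = (
    '<html><body><div id="root"><h1>Acme Legal</h1>'
    "<p>Chicago law firm</p></div></body></html>"
)

print(js_dependency_ratio(SHELL, RENDERED))     # 1.0: everything depends on scripts
print(js_dependency_ratio(RENDERED, RENDERED))  # 0.0: nothing depends on scripts
```

A ratio near 1.0 means almost none of the rendered text exists in the initial response, which is the "empty shell" case described above.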

3. Do machines understand your content? (Schema.org)

Schema.org structured data is a vocabulary that labels what your content is, so a machine does not have to guess. Instead of inferring from prose that "Acme Legal" is a law firm in Chicago, an assistant can read an Organization object that states it plainly. The types most relevant to AEO:

  • Organization — your legal name, URL, logo, contact and services.
  • Product or Service — what you sell, with descriptions and pricing where applicable.
  • FAQPage — question-and-answer pairs that map cleanly onto how people query assistants.
  • Article — for editorial and blog content, with author and date.

Valid, accurate structured data makes your facts easy to extract and, crucially, easy to corroborate against other sources. What CitePulse checks: we detect the presence and type of schema.org / JSON-LD markup on your pages and flag its absence.
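To make this concrete, the sketch below builds a minimal Organization object for the Acme Legal example and then detects which top-level types a page declares. It is an illustration under stated assumptions: the acmelegal.example URL and all function and class names are hypothetical, only top-level "@type" values are read (an "@graph" wrapper is not unpacked), and a production audit would also validate required properties.

```python
import json
from html.parser import HTMLParser


def organization_jsonld(name, url, city):
    """Serialize a minimal schema.org Organization object as JSON-LD."""
    data = {
        "@context": "https://schema.org",
        "@type": "Organization",
        "name": name,
        "url": url,
        "address": {"@type": "PostalAddress", "addressLocality": city},
    }
    return json.dumps(data, indent=2)


class JsonLdCollector(HTMLParser):
    """Collect the bodies of <script type="application/ld+json"> elements."""

    def __init__(self):
        super().__init__()
        self._in_jsonld = False
        self.blocks = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._in_jsonld = True
            self.blocks.append("")

    def handle_endtag(self, tag):
        if tag == "script":
            self._in_jsonld = False

    def handle_data(self, data):
        if self._in_jsonld:
            self.blocks[-1] += data


def schema_types(html):
    """Return the sorted top-level @type values found in a page's JSON-LD."""
    collector = JsonLdCollector()
    collector.feed(html)
    collector.close()
    found = set()
    for block in collector.blocks:
        try:
            parsed = json.loads(block)
        except json.JSONDecodeError:
            continue  # skipped here; a real audit would flag invalid markup
        items = parsed if isinstance(parsed, list) else [parsed]
        for item in items:
            if isinstance(item, dict):
                types = item.get("@type", [])
                found.update([types] if isinstance(types, str) else types)
    return sorted(found)


# Hypothetical page embedding the Organization object in its <head>.
markup = organization_jsonld("Acme Legal", "https://acmelegal.example", "Chicago")
page = f'<html><head><script type="application/ld+json">{markup}</script></head></html>'
print(schema_types(page))
```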

4. Have you mapped your content for AI? (llms.txt)

The llms.txt convention is newer than the others. It is a Markdown file at /llms.txt that gives AI systems a curated, plain-language map of your site: a short description of who you are, your key facts, and links to your most important pages. It does not grant or deny access — that is robots.txt's job — but it helps AI tools find and summarize the content you most want represented.

llms.txt is an emerging convention, not a ratified standard. But it is cheap to add, hard to get wrong, and signals that your site is built with AI readers in mind.

A good llms.txt opens with a one-paragraph summary of the business, lists a handful of bullet-point facts, and links to your main pages with short descriptions. What CitePulse checks: we look for a valid llms.txt at your domain root and note whether it references your key pages.
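That shape can be generated with a few lines of Python. The sketch below follows the layout the community proposal is generally understood to suggest (a title, a quoted summary, then Markdown links with short notes), but since llms.txt is not a ratified standard, treat the exact layout as an assumption. The build_llms_txt helper, the "Key pages" heading, and every fact and URL in the example are hypothetical.

```python
def build_llms_txt(name, summary, facts, pages):
    """Render an llms.txt body: title, quoted summary, facts, linked pages."""
    lines = [f"# {name}", "", f"> {summary}", ""]
    lines += [f"- {fact}" for fact in facts]
    lines += ["", "## Key pages", ""]
    lines += [f"- [{title}]({url}): {note}" for title, url, note in pages]
    return "\n".join(lines) + "\n"


# Hypothetical business details, used only to show the output shape.
llms_txt = build_llms_txt(
    name="Acme Legal",
    summary="Acme Legal is a law firm based in Chicago.",
    facts=["Headquartered in Chicago", "Consultations by appointment"],
    pages=[
        ("Services", "https://acmelegal.example/services", "Practice areas"),
        ("Contact", "https://acmelegal.example/contact", "Office and phone"),
    ],
)
print(llms_txt)
```

The result is plain Markdown that can be saved as /llms.txt at the domain root and regenerated whenever the key pages change.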

5. Is your sitemap clean and complete? (sitemap.xml & HTTPS)

A valid sitemap.xml helps every crawler — search and AI alike — discover your pages efficiently and understand which ones matter. Pair it with HTTPS across the whole site; an insecure or partially secure site is a trust and crawl liability. These are table stakes, but audits routinely find broken, stale or missing sitemaps. What CitePulse checks: we verify HTTPS and the presence and validity of your sitemap.xml.
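A basic version of this check needs only the standard library. The sketch below parses a sitemap string, confirms it is well-formed XML in the sitemaps.org namespace, and lists any URLs not served over HTTPS. The sample sitemap, the example.com URLs and the audit_sitemap helper are hypothetical, and the check does not fetch the listed pages or follow sitemap index files.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

# Hypothetical sitemap with one secure and one insecure URL.
SAMPLE_SITEMAP = """<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/</loc><lastmod>2024-01-15</lastmod></url>
  <url><loc>http://example.com/pricing</loc></url>
</urlset>
"""


def audit_sitemap(xml_text):
    """Parse a sitemap and report its URLs and any that are not HTTPS."""
    try:
        root = ET.fromstring(xml_text.encode("utf-8"))
    except ET.ParseError as exc:
        return {"valid": False, "error": str(exc), "urls": [], "insecure": []}
    urls = [
        loc.text.strip()
        for loc in root.findall("sm:url/sm:loc", SITEMAP_NS)
        if loc.text
    ]
    insecure = [url for url in urls if not url.startswith("https://")]
    # A sitemap with no <url><loc> entries is treated as invalid for this audit.
    return {"valid": bool(urls), "error": None, "urls": urls, "insecure": insecure}


result = audit_sitemap(SAMPLE_SITEMAP)
print("valid:", result["valid"])
print("insecure URLs:", result["insecure"])
```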

The checklist, in order of impact

  1. Serve meaningful content without JavaScript.
  2. Allow GPTBot, PerplexityBot, Google-Extended and ClaudeBot in robots.txt.
  3. Add valid schema.org markup (Organization, Product/Service, FAQPage, Article).
  4. Publish a clean sitemap.xml and enforce HTTPS sitewide.
  5. Add an llms.txt that summarizes your business and links your key pages.

Work top to bottom. The first item alone resolves the majority of "AI cannot see us" cases. Together, these checks make up the LLM Layer Readiness portion of the CitePulse score — the technical foundation that determines whether AI can cite you, before we even measure whether it does.

Technical readiness is necessary, not sufficient

Passing every check above does not guarantee citations — it removes the reasons you would be excluded. Citability also depends on what AI actually says when asked, which sources corroborate you, and which competitors are named instead. That is the behavioral half of the picture, covered in our companion article on how AI assistants decide what to recommend. The practical move is to fix the technical foundation, then measure the answers.

Run all ten technical checks automatically

CitePulse audits your robots.txt, JavaScript dependency, schema.org, sitemap, HTTPS and llms.txt — then tests the real AI answers. Free, in about 30 seconds.


Notes

  1. robots.txt user-agents referenced: GPTBot (OpenAI), PerplexityBot (Perplexity), Google-Extended (Google), ClaudeBot (Anthropic).
  2. llms.txt is an emerging community convention for AI-facing content maps; it is not an official web standard and does not control crawler access.
  3. Structured data vocabulary per schema.org. Crawler JavaScript behavior varies by provider and changes over time; verify against current provider documentation.