Every time an AI agent reads your website — whether it’s ChatGPT browsing the web, an automation pulling your content, or an LLM-powered tool researching your brand — it consumes tokens. And if your site is built like most, it’s wasting a significant chunk of those tokens on things that have nothing to do with your actual content.

Want to see how many tokens your content actually uses? Run it through OpenAI’s tokenizer — paste your homepage HTML and see what comes back. The number is usually surprising. This article breaks down why that happens, what it costs you, and how to fix it without rebuilding your entire site.


What is AI scraping and why it matters now

This isn’t the old-school bot crawling your site for SEO. We’re talking about a new class of tools — LLM-powered agents that read your website the same way a human would, but at scale, and with a price attached to every word they process.

ChatGPT’s browsing feature does this. Perplexity does this. Automated research agents built on GPT-4 or Claude do this. When they visit your site, they pull the raw HTML, parse it, and send all of it into a language model as context. Every character counts. Every unnecessary element gets charged — whether it adds value or not.

Two years ago this was a minor concern. Today, with AI agents becoming a core part of how businesses research vendors, compare services, and gather information, how efficiently your site communicates with AI is no longer optional. It’s a real competitive advantage — and most businesses haven’t figured this out yet.


Why most websites are “noisy” for AI agents

Think about what a typical webpage actually contains when an AI reads it. Beyond your article or service description, there’s a navigation menu with 12 links. A cookie consent banner. A sticky header. A footer with legal notices, social icons, and a duplicate sitemap. Maybe a broken image placeholder or two. A newsletter popup that rendered as raw HTML text.

None of that is useful to an AI agent trying to understand what your business does. But the agent reads all of it anyway — because it has no way to know what matters and what doesn’t unless you tell it. And right now, most sites don’t tell it anything.

The most common sources of noise in a standard website:

  • Navigation menus and header elements
  • Cookie banners and GDPR notices
  • Sidebar widgets and related post blocks
  • Footer links, legal text, and social media icons
  • Broken or empty HTML placeholders
  • JavaScript-rendered components that appear as empty tags
  • Repetitive boilerplate present on every single page

In some audits we’ve run, over 60% of the token-consuming content on a page had nothing to do with the page’s actual purpose. You’re basically paying for the AI to read your footer. Repeatedly.


The real problem: tokens mean cost and quality

Tokens are the unit of measurement that language models use to process text — roughly one token per four characters of English, so a 40,000-character HTML page comes out at around 10,000 tokens. Every token an AI agent sends into a model to understand your site has two consequences: it costs money, and it competes for attention.

When an AI summarises your services page but 60% of the context it received was footer links and cookie text, the quality of that summary drops. The model has less room for the content that actually matters. This is called context dilution — and it’s one of the main reasons AI-generated descriptions of websites are often vague, incomplete, or just… off.

For businesses that depend on AI agents to surface their content — in search, in automated research tools, in competitive analysis pipelines — this has real commercial consequences. You’re not just wasting tokens. You’re making it harder for AI to represent your business accurately. And that’s a problem that compounds over time.

(Not sure how token-heavy your site actually is? Paste your homepage content into OpenAI’s tokenizer and see for yourself.)


4 practical solutions you can implement today

1. Use semantic HTML attributes to signal what to ignore

HTML already has semantic roles that tell browsers — and increasingly, AI agents — what each section of a page is for. Start using them properly. The role="navigation" attribute on your nav, role="banner" on your header, and role="contentinfo" on your footer are signals that a well-designed AI agent can use to deprioritise or skip those sections entirely. (The native <nav>, <header>, and <footer> elements carry these roles implicitly, so using them correctly gets you most of the way there.)

Beyond native roles, you can add custom data attributes like data-ai-exclude="true" to elements you explicitly want scrapers to skip. There’s no universal standard for this yet, but some open-source scraping and agent frameworks support configurable exclusion rules, and a consistent, machine-readable marker gives them a reliable hook.
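A minimal sketch of the pattern in markup (the placeholder content is illustrative, and data-ai-exclude is a convention rather than a web standard):

  <header role="banner" data-ai-exclude="true">
    <nav role="navigation">site menu, 12 links</nav>
  </header>

  <main>
    <article>
      <!-- the content you actually want agents to read -->
    </article>
  </main>

  <footer role="contentinfo" data-ai-exclude="true">
    legal text, social icons, duplicate sitemap
  </footer>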

2. Create a clean JSON endpoint for your key content

This is probably the highest-impact change you can make. Instead of forcing AI agents to parse your entire HTML page, give them a dedicated endpoint — something like yoursite.com/api/content.json — that returns only the structured information that matters: your services, your value proposition, your team, your pricing model.

A clean JSON response with clearly labelled fields uses a fraction of the tokens a full HTML page requires, and it gives the agent exactly what it needs with zero noise. Think of it as a press kit for AI. You control the narrative, and you reduce the cost of every interaction. Makes sense, right?
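A sketch of what such an endpoint might return. Every field name here is illustrative rather than a standard; shape it around whatever actually matters for your business:

  {
    "company": "Your Company",
    "value_proposition": "One clear sentence on what you do and for whom.",
    "services": [
      { "name": "Service A", "description": "What it is and who it's for." },
      { "name": "Service B", "description": "What it is and who it's for." }
    ],
    "pricing_model": "Project-based, retainer, or subscription.",
    "contact": "hello@yoursite.com"
  }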

3. Optimise your robots.txt for AI agents

Your robots.txt file is where you tell crawlers what they can and can’t access. Most businesses set it up once for Google and never touch it again. But AI agents have their own user-agent strings — GPTBot, ClaudeBot, PerplexityBot — and you can give them specific instructions.

Block them from low-value pages (login pages, thank-you pages, internal search results) and explicitly allow them on your most important content. This alone can cut a lot of the token waste from irrelevant pages being indexed and processed.
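For example, a robots.txt along these lines (the paths are placeholders; check each vendor’s documentation for its current user-agent string):

  User-agent: GPTBot
  User-agent: ClaudeBot
  User-agent: PerplexityBot
  Disallow: /login/
  Disallow: /thank-you/
  Disallow: /search/
  Allow: /

Crawlers that honour robots.txt will skip the disallowed paths and remain free to fetch everything else.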

4. Structure your content with hierarchy and clarity

Language models extract meaning from well-structured content significantly better than from walls of text. Use proper heading hierarchies (H1, H2, H3). Write clear, declarative topic sentences that summarise what each section covers. Don’t bury key information in the middle of long paragraphs with no structural signposting.
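As a rough illustration, the kind of outline that lets a model (or a skimming human) grasp a page at a glance:

  <h1>One page, one H1: your core offer</h1>
  <h2>A major section, e.g. Services</h2>
  <h3>A specific service within it</h3>
  <h2>Another major section, e.g. Pricing</h2>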

This improves AI readability, human readability, and SEO all at once — one of those rare cases where optimising for one channel makes you better across all of them.

If you want to go deeper on how AI automation can help you build these kinds of systems at scale — endpoints, structured data pipelines, agent-ready content — that’s exactly what we work on with our clients.


AEO: why this is becoming the next standard

SEO taught us that how you structure information affects who finds it and how. AI Engine Optimization (AEO) is the same principle applied to a new layer of the web — the layer where AI agents, not humans, are doing the reading.

Over the next two to three years, AI agents will become a primary way that businesses research vendors, customers discover services, and content gets surfaced across platforms. The websites already structured for efficient AI consumption will have a compounding advantage over those that aren’t.

This isn’t about chasing an algorithm. It’s about understanding that your website now has two audiences: humans and machines. And right now, most sites aren’t optimised well for either.

The problem isn’t your website. It’s that AI agents are reading your full HTML instead of a clean endpoint built for them. The fix isn’t just cleaning the HTML; it’s giving them a better door to walk through.

The businesses that build that door now won’t have to rebuild later. They’ll already be inside.


Optimise my website for AI

30 minutes. We audit your current site structure and show you exactly where tokens are being wasted.


Frequently Asked Questions

What is AI Engine Optimization (AEO) and how is it different from SEO?

SEO focuses on making your content readable and rankable for search engine crawlers. AEO focuses on making your content readable and processable for AI agents — tools like ChatGPT, Perplexity, and LLM-powered research pipelines. While SEO optimises for ranking signals, AEO optimises for token efficiency and semantic clarity. Both matter, and increasingly they overlap.

How many tokens does a typical webpage consume when an AI reads it?

It varies by site, but a standard blog post or service page with navigation, footer, and typical WordPress boilerplate can consume between 3,000 and 8,000 tokens per page. A clean JSON endpoint delivering the same core content often needs fewer than 800 tokens. The difference directly affects both cost and response quality. You can test your own pages with OpenAI’s tokenizer.

Does creating a JSON endpoint require a developer?

For most WordPress sites, a basic JSON endpoint can be created with a simple plugin or a few lines of custom code in your theme’s functions.php file. A developer can typically build a clean content endpoint in a few hours. If you’re on a headless CMS or a modern stack, you likely already have API capabilities built in — you just need to structure the output properly.
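As a sketch of what those few lines might look like, using WordPress’s built-in REST API (the route name and fields are illustrative, not a convention):

  // In your theme's functions.php: registers GET /wp-json/ai/v1/content
  add_action( 'rest_api_init', function () {
      register_rest_route( 'ai/v1', '/content', array(
          'methods'             => 'GET',
          'permission_callback' => '__return_true', // public, read-only endpoint
          'callback'            => function () {
              return array(
                  'company'  => get_bloginfo( 'name' ),
                  'tagline'  => get_bloginfo( 'description' ),
                  'services' => array(), // replace with your real structured content
              );
          },
      ) );
  } );

WordPress serialises the returned array to JSON automatically, so the endpoint responds at /wp-json/ai/v1/content with no extra plumbing.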

Can I block specific AI agents from crawling my site?

Yes. Most major AI crawlers respect robots.txt directives. GPTBot (OpenAI), ClaudeBot (Anthropic), and PerplexityBot all have documented user-agent strings you can target. You can allow them on high-value pages and block them from internal or low-value URLs — giving you granular control over what AI agents consume from your site.

Will optimising for AI agents hurt my regular SEO?

No — in most cases it improves it. Clean semantic HTML, clear heading structure, and well-organised content are principles that both search engines and AI agents reward. The main additions (JSON endpoints, updated robots.txt, data attributes) are invisible to human visitors and don’t interfere with standard SEO signals.

Is this something small businesses actually need to worry about?

If your customers, competitors, or partners are using AI tools to research the market — and they almost certainly are — then yes. The businesses that appear clearly and accurately in AI-generated answers will have a growing edge over those that don’t. For SMBs competing with larger players, acting early here creates a disproportionate advantage. It’s one of those things that’s still cheap to do now and will be expensive to fix later.