Something weird started showing up in my server logs last year. At first I thought it was a misconfiguration. Then I thought it was a bot attack. It took me three weeks to realise I was watching the future of the web unfold in real-time.

Website owners noticed unusual crawler behaviour. Not Googlebot. Not Bingbot. Something different entirely. Dig into the data and the scale is staggering: GPTBot's crawl volume grew 305% between May 2024 and May 2025, and its share of AI crawler traffic jumped from 5% to 30%. ChatGPT-User requests surged by 2,825%. According to Cloudflare's 2025 analysis, automated bots now account for approximately 30% of all website traffic, with AI crawlers the fastest-growing subset.

Traffic numbers tell only part of the story. What's happening fundamentally changes how information moves across the web, and if you're running a modern JavaScript site, you've probably already felt the pain.

The Silent Revolution Behind Your Server Logs

While I was obsessing over traditional SEO metrics (bounce rates, time on page, the usual stuff), a quiet revolution was reshaping web traffic patterns right under my nose. AI systems from OpenAI, Anthropic, Google, and emerging players like Perplexity are crawling the web at unprecedented scales, but they're doing it completely differently from traditional search engines.

Cloudflare's research reveals the scale: OpenAI's crawlers generated 569 million requests across major networks in just one month (December 2024). Anthropic's ClaudeBot had the highest crawl-to-referral ratio of any AI platform, 38,000:1 as of July 2025, crawling 38,000 pages for every user it actually sent to a website. PerplexityBot? A staggering 157,490% increase in raw requests, making it the fastest-growing AI crawler.

Traditional search crawlers index your content to drive traffic back to you. AI crawlers extract knowledge to provide answers without sending visitors your way.

AI systems are fundamentally changing the value exchange that's powered the web for decades. You create content, they scrape it, users get answers, you get nothing.

The JavaScript Problem That's Breaking the Web

Most AI crawlers have a critical blind spot, and I learned about it the hard way.

Google has sophisticated JavaScript rendering capabilities. Most AI crawlers? Not so much. Anyone running a React app has felt this pain. (I spent a week wondering why my site wasn't showing up in AI-generated answers before I figured this out.) The data is stark: ChatGPT crawlers fetch JavaScript files in only 11.50% of their requests. Claude crawlers perform better at 23.84%, yet they're still missing most client-side rendered content.

Millions of websites built with React, Vue, or Angular face a fundamental compatibility crisis.

Product descriptions, user reviews, pricing information, or key content sections loaded dynamically through JavaScript remain invisible to most AI systems. They're crawling a hollow version of your site, missing the very content that makes it valuable. Your beautifully crafted React components? Invisible. Your dynamic pricing calculator? Doesn't exist to them. The testimonials that boost conversions? Never crawled.

Website developers face a major architectural decision. Continue with modern JavaScript frameworks and accept that AI systems will miss crucial content? Or revert to server-side rendering to ensure AI compatibility?

We saw a similar forced migration with mobile-first indexing. As AI search traffic continues its rapid growth, the window for making major architectural changes is narrowing.

When Crawlers Become Traffic Tsunamis

The infrastructure impact goes far beyond simple bandwidth concerns.

Real case studies reveal dramatic operational challenges. Game UI Database creator Edd Coates documented how persistent OpenAI bot crawling led to skyrocketing operational costs and severe server slowdowns. The crawlers weren't just visiting occasionally. They sent thousands of consecutive requests, overwhelming server infrastructure designed for human browsing patterns. "This was essentially a two-week-long DDoS attack," Coates reported, with the homepage being reloaded 200 times per second.

Developer Xe Iaso experienced similar issues with their Git repository service, reporting repeated instability and downtime from Amazon's aggressive AI crawler traffic. A small forum running Calcudoku puzzles was forced offline when Chinese IP crawler traffic caused server loads to spike to 300+ (normal high load hovers around 3).

Isolated incidents? I wish. Across the web, AI crawlers consume bandwidth and server resources at rates that traditional infrastructure wasn't designed to handle. Small-to-medium websites running on shared hosting or VPS solutions face costs that can quickly become unmanageable. You're essentially subsidising AI training with your server bills. (I've had clients whose hosting costs doubled before we figured out what was happening.)
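If a crawler is hammering your origin, the bluntest fix is to throttle it before it reaches your expensive routes. Here's a rough sketch of what that looks like at the application layer, assuming an Express app and the express-rate-limit package; the user-agent list and limits are placeholders, and in practice a CDN or nginx rule in front of the app is usually the better place to do this.

```typescript
// throttle-crawlers.ts - rate-limit known AI crawlers at the app layer (illustrative sketch)
import express from "express";
import rateLimit from "express-rate-limit";

// User agents to throttle; extend as new crawlers show up in your logs.
const AI_CRAWLER_UA = /GPTBot|ChatGPT-User|ClaudeBot|Claude-Web|PerplexityBot|CCBot|Bytespider/i;

const app = express();

// 30 requests per minute per IP, applied only when the user agent matches a
// known AI crawler; human visitors and search engine bots are unaffected.
const crawlerLimiter = rateLimit({
  windowMs: 60_000,
  max: 30,
  skip: (req) => !AI_CRAWLER_UA.test(req.get("user-agent") ?? ""),
});

app.use(crawlerLimiter);

app.get("/", (_req, res) => {
  res.send("Hello, humans and well-behaved bots.");
});

app.listen(3000, () => console.log("Listening on :3000"));
```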

The Search Results Revolution You Can't Ignore

While AI crawlers are transforming your server logs, AI is simultaneously changing how search results appear to users.

Google's AI Overviews now appear in 13.14% of all queries as of March 2025, more than double the 6.49% recorded in January. AI-generated summaries consume 42% of screen real estate on desktop and 48% on mobile devices. That's nearly half your potential click space gone.

58.5% of U.S. Google searches resulted in zero clicks in 2024. Users got their answers without visiting any website.

Nearly six out of ten searches don't generate a single click. Let that sink in. Users get their answers directly from AI-generated summaries without clicking through to the original sources. We're witnessing a fundamental shift in the web's traffic patterns and economic model. All that investment in content marketing and SEO? It's being used to train AI systems that keep users from ever visiting your site. (I've spent 15 years building content strategies. This one stings.)

Businesses that invested heavily in content marketing and SEO face a difficult dilemma. AI systems summarise your expertly crafted content and hand users the information they need without them ever visiting your website, upending the traditional model of earning visitors through search rankings. You do the work, AI gets the credit, users never click through.

Real-World Responses: How Industry Leaders Are Adapting

Major infrastructure providers and content creators are responding with force.

Cloudflare, which handles approximately 20% of global web traffic, became the first major infrastructure provider to block AI crawlers by default, requiring AI companies to obtain explicit permission before accessing content. The decision followed Cloudflare's detection of billions of AI crawler requests across its network, representing an 18% increase in crawler traffic from May 2024 to May 2025. That's a massive policy shift from a company operating at that scale.

Multiple open source developers have reported AI crawlers dominating their traffic logs, forcing them to implement geographic blocks on entire countries to manage server loads. Infrastructure costs became prohibitive for independent developers who couldn't absorb the bandwidth and processing overhead. People building valuable open source tools for the community, now forced to restrict access because AI companies can't be bothered to implement reasonable rate limiting.

News and media organisations are implementing AI crawler restrictions at accelerating rates. Publishers worry that AI systems will provide readers with article summaries without generating the referral traffic that sustains their business model. High-quality content sites have been implementing crawler restrictions since mid-2023, with the trend accelerating sharply.

Technical Implementation: Your Strategic Response Plan

Managing AI crawlers demands a comprehensive technical approach. Below is an implementation roadmap based on current crawler behaviour and industry best practices.

Immediate robots.txt Configuration

A properly configured robots.txt file is your first line of defence.

Current essential AI crawlers to consider:

```text
User-agent: GPTBot
Disallow: /

User-agent: ChatGPT-User
Disallow: /

User-agent: anthropic-ai
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: Claude-Web
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: PerplexityBot
Disallow: /

User-agent: Meta-ExternalAgent
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: Bytespider
Disallow: /
```

Critical implementation details: add a directive after every User-agent line (minimum one Allow/Disallow), use blank lines between blocks to prevent merge errors, and re-test after major AI system releases as new versions can ignore older rules.
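Re-testing is easy to automate. Below is a minimal TypeScript sketch (Node 18+, no dependencies) that fetches your live robots.txt and checks whether each AI user agent you intend to block is covered by a Disallow: / rule. The parsing is deliberately simplified (no wildcards or Allow precedence) and the domain is a placeholder, so treat it as a sanity check rather than a full robots.txt validator.

```typescript
// check-robots.ts - confirm AI crawler blocks in a live robots.txt (illustrative sketch)

const AI_CRAWLERS = [
  "GPTBot", "ChatGPT-User", "anthropic-ai", "ClaudeBot", "Claude-Web",
  "Google-Extended", "PerplexityBot", "Meta-ExternalAgent", "CCBot", "Bytespider",
];

// Tiny parser: collects the user agents in each group and records those whose
// group contains a "Disallow: /" directive. Not a full robots.txt implementation.
function blockedAgents(robotsTxt: string): Set<string> {
  const blocked = new Set<string>();
  let groupAgents: string[] = [];
  let inDirectives = false;

  for (const rawLine of robotsTxt.split("\n")) {
    const line = rawLine.split("#")[0].trim();
    const sep = line.indexOf(":");
    if (!line || sep === -1) continue;

    const field = line.slice(0, sep).trim().toLowerCase();
    const value = line.slice(sep + 1).trim();

    if (field === "user-agent") {
      if (inDirectives) {
        groupAgents = []; // a directive ended the previous group; start a new one
        inDirectives = false;
      }
      groupAgents.push(value.toLowerCase());
    } else {
      inDirectives = true;
      if (field === "disallow" && value === "/") {
        for (const agent of groupAgents) blocked.add(agent);
      }
    }
  }
  return blocked;
}

async function main(): Promise<void> {
  const res = await fetch("https://example.com/robots.txt"); // replace with your domain
  const blocked = blockedAgents(await res.text());
  for (const crawler of AI_CRAWLERS) {
    const covered = blocked.has(crawler.toLowerCase()) || blocked.has("*");
    console.log(`${covered ? "BLOCKED    " : "NOT BLOCKED"} ${crawler}`);
  }
}

main().catch(console.error);
```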

As of mid-2025, approximately 14% of the top 10,000 websites have implemented AI-specific crawler directives, though the majority of sites still allow unrestricted AI crawler access. Many organisations haven't recognised the infrastructure and bandwidth implications of unchecked AI crawler traffic.

Server-Side Rendering Requirements

Want AI systems to properly understand your content? Server-side rendering is no longer optional.

Most AI crawlers cannot execute JavaScript, rendering client-side content invisible to them. Product catalogues with dynamic pricing, user-generated content like reviews and comments, interactive features and form data, real-time information updates, and personalised content sections all fall victim to crawler blindness.

Websites must choose between AI compatibility and modern JavaScript frameworks. As AI search traffic grows, the decision becomes more critical. You're essentially choosing between modern development practices and AI visibility. It's a rubbish choice, but it's the choice we face.
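For React sites, the move looks something like the sketch below: an Express server using renderToString from react-dom/server, so every crawler receives complete HTML without executing a line of JavaScript. It's an illustrative minimal setup, not a recommendation to hand-roll SSR; frameworks like Next.js or Remix give you the same result with far less plumbing, and the component, route, and data here are made up.

```tsx
// server.tsx - minimal server-side rendering sketch (Express + React)
import express from "express";
import React from "react";
import { renderToString } from "react-dom/server";

// Hypothetical product page: name and price are rendered on the server,
// so they exist in the HTML a crawler downloads, JavaScript or not.
function ProductPage({ name, price }: { name: string; price: string }) {
  return (
    <main>
      <h1>{name}</h1>
      <p>Price: {price}</p>
    </main>
  );
}

const app = express();

app.get("/products/:id", (_req, res) => {
  // In a real app this data comes from your database or API.
  const body = renderToString(<ProductPage name="Example Widget" price="£49" />);
  res.send(`<!doctype html>
<html>
  <head><title>Example Widget</title></head>
  <body>
    <div id="root">${body}</div>
    <!-- Hydration is optional for crawlers; humans still get full interactivity. -->
    <script src="/client.js" defer></script>
  </body>
</html>`);
});

app.listen(3000, () => console.log("SSR server listening on :3000"));
```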

Structured Data Implementation Strategy

AI crawlers miss JavaScript-injected structured data entirely.

Schema markup added after page loads is invisible to most AI systems. Your structured data strategy must include server-side rendering of all schema markup, direct HTML implementation rather than JavaScript injection, and prerendered pages with fully executed JavaScript for crawler access.

E-commerce sites, local businesses, and content publishers who rely on structured data for search visibility face particular challenges. If you've been injecting schema.org markup via JavaScript, congratulations, you've been wasting your time.
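To make that concrete, here's a small sketch of building schema.org Product markup as JSON-LD on the server and dropping it straight into the HTML head, rather than injecting it after page load; the product fields are hypothetical.

```typescript
// jsonld.ts - server-rendered schema.org markup instead of client-side injection (sketch)

interface Product {
  name: string;
  price: string;
  currency: string;
}

// Returns a <script type="application/ld+json"> tag to embed in the
// server-rendered <head>, so the markup is present in the raw HTML that
// every crawler downloads, with or without JavaScript.
export function productJsonLd(product: Product): string {
  const data = {
    "@context": "https://schema.org",
    "@type": "Product",
    name: product.name,
    offers: {
      "@type": "Offer",
      price: product.price,
      priceCurrency: product.currency,
    },
  };
  return `<script type="application/ld+json">${JSON.stringify(data)}</script>`;
}

// Example: drop the returned string into your server-side template's <head>.
console.log(productJsonLd({ name: "Example Widget", price: "49.00", currency: "GBP" }));
```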

The Emerging LLMs.txt Standard

Following the model of robots.txt and sitemaps, a new standard called LLMs.txt is gaining adoption.

LLMs.txt, hosted on your domain, provides AI crawlers with specific directives including permissions, restrictions, licensing information, and preferred attribution methods. Website owners can specify which site areas are optimised for AI training, which sections require attribution, and how they prefer their content to be referenced in AI-generated responses.

Still emerging, the standard represents the industry's attempt to create more nuanced control over AI access than the binary allow/disallow model of robots.txt. Whether AI companies will actually respect it remains to be seen, given their track record. (I'm not holding my breath, but I've implemented it on a few sites anyway. Belt and braces.)

Business Strategy: Navigating the New Ecosystem

Technical implementation alone won't cut it. Strategic thinking that addresses the fundamental shift in web traffic patterns is essential.

AI systems that provide answers without generating clicks are challenging traditional revenue models based on search traffic and advertising impressions. Some organisations are exploring permission-based access models, where AI companies must negotiate licensing agreements to access content. Startups like TollBit and ScalePost offer tools to detect, block, and charge for AI traffic, creating new revenue streams for content creators.

Web ecosystem fragmentation poses a real risk.

Exclusive licensing deals between AI companies and major publishers could create a system where only large organisations can afford access to critical web data, suppressing competition and eroding the open nature of the web. Reddit's data licensing deal with Google and News Corp's multi-year agreement with OpenAI signal the trend. If this becomes the norm, the open web becomes the expensive web.

That thought keeps me up at night, honestly. The internet I grew up building feels like it's slipping away.

Compliance Patterns and Enforcement Realities

AI crawler control effectiveness varies significantly between companies.

Well-established organisations like Google and OpenAI generally adhere to robots.txt protocols, but enforcement isn't universal. Anthropic faced criticism in 2024 for aggressive crawling behaviour, while Perplexity has been documented bypassing robots.txt rules entirely.

Among the top 10,000 domains with robots.txt files, approximately 14% added AI-specific crawler directives as of 2025, with a steep trend toward "Fully Disallowed" permissions since January 2025. Website owners are prioritising server performance and content protection over AI search visibility. When the choice is between functioning infrastructure and AI visibility, most people choose functioning infrastructure.

Future Implications: Preparing for Continued Evolution

AI crawler behaviour evolves at a pace that demands ongoing monitoring and constant adaptation.

GPTBot's 305% growth in just one year demonstrates how quickly the landscape can change. New AI systems emerge regularly, each with different crawling patterns and technical requirements. Website owners must develop monitoring systems to track crawler behaviour, implement flexible technical architectures that can adapt to changing requirements, and maintain updated knowledge of AI system capabilities and limitations.

AI compatibility requires infrastructure investment: server-side rendering, enhanced bandwidth capacity, and monitoring systems. Organisations that haven't adapted face significant technical debt.

The uncomfortable truth? Nothing about this gets simpler. It's getting more complex, more expensive, and more demanding. I don't have all the answers here. Nobody does yet.

Strategic Recommendations for 2025

Here's what I've figured out so far, based on current trends and watching this unfold across dozens of client sites. Some of this will probably be wrong in six months. That's the nature of the beast.

Implement comprehensive AI crawler monitoring to understand the actual impact on your infrastructure and user experience. You can't manage what you don't measure, and most sites have no idea what AI crawlers are doing to their infrastructure.
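A starting point doesn't have to be sophisticated. The sketch below (TypeScript on Node, no dependencies) tallies requests per AI crawler from a standard access log; the log path and user-agent list are assumptions you'd adapt to your own stack, and a proper setup would feed the same numbers into whatever dashboarding you already use.

```typescript
// crawler-report.ts - count AI crawler requests in an access log (illustrative sketch)
import { createReadStream } from "node:fs";
import { createInterface } from "node:readline";

// User-agent substrings to look for; extend as new crawlers appear in your logs.
const AI_CRAWLERS = [
  "GPTBot", "ChatGPT-User", "ClaudeBot", "Claude-Web", "anthropic-ai",
  "PerplexityBot", "Google-Extended", "Meta-ExternalAgent", "CCBot", "Bytespider",
];

async function report(logPath: string): Promise<void> {
  const counts = new Map<string, number>(AI_CRAWLERS.map((c) => [c, 0]));
  let total = 0;

  const lines = createInterface({ input: createReadStream(logPath) });
  for await (const line of lines) {
    total++;
    for (const crawler of AI_CRAWLERS) {
      if (line.includes(crawler)) {
        counts.set(crawler, (counts.get(crawler) ?? 0) + 1);
        break; // count each request once even if several patterns match
      }
    }
  }

  console.log(`Total requests: ${total}`);
  for (const [crawler, count] of counts) {
    const share = total ? ((count / total) * 100).toFixed(2) : "0.00";
    console.log(`${crawler.padEnd(20)} ${count} (${share}%)`);
  }
}

report("/var/log/nginx/access.log").catch(console.error); // adjust the path for your server
```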

Evaluate your JavaScript architecture against AI crawler compatibility requirements. Running a modern React or Vue app with client-side rendering? You need to understand exactly what AI crawlers can and can't see.
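One quick way to see your site the way a non-JavaScript crawler does is to fetch the raw HTML and check whether the content you care about is actually in it. A minimal sketch (Node 18+ for built-in fetch); the URL and the phrases are placeholders for your own pages and key content:

```typescript
// crawler-view.ts - does key content survive without JavaScript? (illustrative sketch)

// Phrases that should be visible to a crawler that never runs your JS bundle.
const MUST_APPEAR = ["Example Widget", "£49", "Customer reviews"];

async function checkRawHtml(url: string): Promise<void> {
  // Fetch the page the way most AI crawlers do: HTML only, no script execution.
  const res = await fetch(url, { headers: { "user-agent": "raw-html-check/1.0" } });
  const html = await res.text();

  for (const phrase of MUST_APPEAR) {
    console.log(`${html.includes(phrase) ? "FOUND   " : "MISSING "} ${phrase}`);
  }
}

checkRawHtml("https://example.com/products/widget").catch(console.error);
```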

Develop an AI access policy that balances content visibility with infrastructure protection. Not all AI crawlers are created equal. You can allow some while blocking others.

Invest in server-side rendering capabilities if AI search visibility matters to your business model. Want AI systems to understand your content? Server-side rendering isn't optional.

Monitor industry developments around LLMs.txt and other emerging standards. Standards evolve rapidly. Best practice today can be obsolete in six months.

The transition happening in 2025 isn't just technical. It's fundamental to how information flows across the web.

I'll be honest: I'm still figuring out what this means for the sites I manage. The old rules (create great content, optimise for search, watch the traffic roll in) are breaking down. The new rules haven't been written yet.

What I do know is this: the websites that thrive in 2026 will be the ones that understand AI crawlers as well as they once understood Google. And right now, most of us are playing catch-up. I'm including myself in that.

If you're not adapting, you're already behind. Welcome to the club.

This analysis is based on verified data from Cloudflare network research, official AI company documentation, and established industry case studies. All statistics have been verified against authoritative sources with dates noted where applicable.