One website blocked AI crawlers and saved AU$1,500 per month in bandwidth costs. Their traffic dropped 75%, from 800GB to 200GB daily, and not a single human visitor noticed (Read the Docs, 2025).
That's the hidden cost of AI crawlers in 2025. While you're focused on serving actual customers, OpenAI's GPTBot, Anthropic's ClaudeBot, and Perplexity's PerplexityBot are crawling your site thousands of times per day, consuming bandwidth you're paying for, and returning exactly zero traffic to your business.
But here's where it gets interesting. Not all AI crawlers behave the same way. Some respect your robots.txt file and crawl politely. Others ignore your instructions entirely and disguise themselves as regular Chrome browsers to bypass your blocks. One crawler is scraping the web 25 times faster than OpenAI's GPTBot and 3,000 times faster than Anthropic's ClaudeBot (Fortune, 2024).
This isn't theoretical. From May 2024 to May 2025, AI bot traffic grew 18% overall, with some individual crawlers showing explosive growth rates exceeding 300% (Cloudflare, 2025). If you're running a website in Australia, you're paying for this traffic spike whether you know it or not.
Let's break down exactly what each major AI crawler is doing, how they compare technically, and what it means for your website's performance and costs.
The AI Crawler Explosion: What's Actually Happening
AI crawlers now account for nearly 80% of all AI bot activity, and training drives the overwhelming majority of that traffic (Startup Hub, 2025). This isn't AI assistants helping users find your content. It's AI companies scraping your site to train their models, and the scale is staggering.
Meta's AI crawlers alone generate 52% of all AI crawler traffic, more than double the combined traffic from Google (23%) and OpenAI (20%) (SDxCentral, 2025). Together, Meta, Google, and OpenAI accounted for 95% of AI crawler request volume between April and July 2025.
Here's what that looks like in practice. OpenAI's GPTBot generated 569 million requests across Vercel's network in a single month. Anthropic's ClaudeBot followed with 370 million (Vercel, 2025). And these aren't spread evenly across all websites. Media and publishing sites are seven times more likely to see AI bot traffic than the average website (Arc XP, 2025).
If you're thinking "that won't affect my small business site," consider this. One website reported a fetcher bot hitting them with 39,000 requests per minute at peak load (InMotion Hosting, 2025). That's not a DDoS attack. That's just an AI crawler being aggressive.
The cost isn't just bandwidth. It's server performance, user experience, and in some cases, your hosting bill when you exceed your plan's limits. Let's look at how each major crawler behaves.
GPTBot (OpenAI): The Most Blocked Crawler
User Agent: Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; GPTBot/1.0; +https://openai.com/gptbot)
OpenAI launched GPTBot in August 2023, and it's become the most blocked AI crawler on the web. As of 2025, 312 domains have disallowed it, with 250 blocking it completely and 62 granting partial access (The Decoder, 2025).
Interestingly, it's also the most explicitly allowed crawler, with 61 domains granting it access. That polarisation tells you something about how website owners view AI training data collection.
How GPTBot Works
When GPTBot selects your site, it doesn't just grab the HTML. It extracts text, processes media (images, audio, video), and fetches linked resources such as scripts (Search Engine Journal, 2023). It applies optical character recognition to images containing text and converts audio and video to transcripts.
OpenAI states that GPTBot filters out paywalled content, illegal material, and personally identifiable information before using the data for training. All crawling originates from documented IP address ranges on OpenAI's website, providing transparency for web administrators.
Content Preferences
GPTBot prioritises HTML content at 57.70% of fetches. It also fetches JavaScript files (11.50% of requests), but here's the catch: it doesn't execute them (JetOctopus, 2025). If your site relies on client-side rendering, GPTBot can't see that content.
Images are a lower priority for GPTBot compared to text-based content. That's a stark contrast to ClaudeBot, as we'll see.
Growth & Traffic Volume
GPTBot's share of AI crawling traffic more than doubled from 4.7% to 11.7% between 2024 and 2025. That's a 305% rise in requests year-over-year (Cloudflare, 2025).
Infrastructure maintainer Dennis Schubert notes that, unlike traditional search engine crawlers that might visit a page once and move on, AI crawlers "don't just crawl a page once and then move on, they come back every 6 hours" (The Register, 2025). That's why the bandwidth consumption multiplies so quickly.
Blocking GPTBot
OpenAI respects robots.txt directives. Add this to your robots.txt file and changes propagate within 24 hours:
```
User-agent: GPTBot
Disallow: /
```

For partial access, you can allow specific directories:

```
User-agent: GPTBot
Allow: /blog/
Allow: /public-docs/
Disallow: /
```

The fact that OpenAI actually honours these directives puts it ahead of some competitors.
ClaudeBot (Anthropic): The Image-Focused Crawler
User Agent: Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; ClaudeBot/1.0; +claudebot@anthropic.com)
Anthropic takes a different approach to crawler transparency. Instead of one user agent, they use multiple agents for different purposes: ClaudeBot (primary crawler), anthropic-ai (bulk model training), Claude-Web (web-focused crawling), and Claude-User (on-demand retrieval) (Anthropic Support, 2025).
This multi-agent approach gives website owners more granular control over what gets crawled and why.
Stated Principles vs Reality
Anthropic publicly commits to respecting "do not crawl" signals via robots.txt, refusing to bypass CAPTCHAs, supporting the non-standard Crawl-delay extension, and aiming for minimal disruption through thoughtful crawl rates (DataDome, 2025).
In practice, ClaudeBot generated 370 million requests across Vercel's network in one month, with its share of AI crawling traffic rising from 6% to around 10% during 2025 (Vercel, 2025).
The Crawl-to-Referral Gap
Here's where it gets interesting. In July 2025, ClaudeBot's crawl-to-referral ratio was 38,000 to 1. That means ClaudeBot crawled 38,000 pages for every one visitor it sent back to the site (Cloudflare, 2025). That's actually an improvement from January's ratio of 286,000 to 1, but it's still a massive imbalance.
You're providing the content and paying for the bandwidth, and ClaudeBot returns almost nothing in traffic value.
Content Type Preferences: Images First
This is ClaudeBot's distinctive characteristic. While GPTBot prioritises HTML, ClaudeBot focuses heavily on images at 35.17% of total fetches. JavaScript files account for 23.84% of requests, though like GPTBot, ClaudeBot fetches them but doesn't execute them (JetOctopus, 2025).
If you're running an e-commerce site with extensive product photography, or a portfolio site showcasing visual work, ClaudeBot is likely consuming more of your bandwidth than GPTBot.
March 2025 Update: Live Web Search
In March 2025, Anthropic introduced live web search to Claude, enabling the AI to fetch fresh information, cite sources in real-time, and provide more timely answers. This likely contributed to the increased crawling activity throughout the year.
Blocking ClaudeBot
Anthropic respects robots.txt. To block all Anthropic crawlers:
```
User-agent: ClaudeBot
Disallow: /

User-agent: anthropic-ai
Disallow: /

User-agent: Claude-Web
Disallow: /
```

PerplexityBot: The Controversial Crawler That Doesn't Follow Rules
User Agent: PerplexityBot (platform crawler) and Perplexity-User (user-triggered agent)
PerplexityBot is where this story gets controversial. In 2025, Cloudflare published a detailed report accusing Perplexity AI of positioning itself as a Google alternative while systematically ignoring website restrictions and disguising its scraping activities (Cloudflare, 2025).
The Stealth Crawling Scandal
According to Cloudflare's investigation, Perplexity continued accessing content from websites that had explicitly disallowed PerplexityBot in their robots.txt files. The company allegedly used an undeclared crawler with multiple IP addresses not listed in Perplexity's official IP ranges (Bank Info Security, 2025).
This wasn't just sloppy engineering. The undeclared crawler used a user agent designed to impersonate Google Chrome on macOS, making it look like regular user traffic. When website owners blocked the declared PerplexityBot, the stealth crawler kicked in.
Cloudflare observed this behaviour across tens of thousands of domains, with millions of content requests per day bypassing robots.txt directives (Malwarebytes, 2025).
The Industry Impact
The PerplexityBot controversy highlighted a broader problem. TollBit's State of the Bots Q1 2025 report showed an 87% increase in scraping compared to the previous quarter, with the share of bots ignoring robots.txt directives jumping from 3.3% to 12.9%. In March 2025 alone, TollBit recorded 26 million scrapes that bypassed robots.txt directives (Williams Media, 2025).
Official Documentation vs Observed Behaviour
Perplexity's official documentation states that PerplexityBot complies with robots.txt limits, honours robots.txt with changes propagating in approximately 24 hours, and follows industry-standard norms for request rates (Perplexity, 2025).
Researchers have documented otherwise. The gap between stated policy and observed behaviour is significant.
Technical Characteristics
PerplexityBot does not render JavaScript, so content relying on client-side rendering remains invisible to it. Its crawl frequency varies based on content freshness, site authority, and perceived importance, making it more sporadic and burst-driven compared to traditional search engines (Daydream, 2025).
Blocking PerplexityBot
You can add robots.txt directives:
```
User-agent: PerplexityBot
Disallow: /
```

But given the documented stealth crawling behaviour, robots.txt alone may not be sufficient. You'll need additional defences like IP range blocking, rate limiting, and behavioural detection.
ByteSpider: The Most Aggressive Crawler on the Internet
User Agent: Bytespider
If you think GPTBot is aggressive, you haven't met ByteSpider. Operated by ByteDance (TikTok's parent company), ByteSpider launched in April 2024 and quickly became what industry observers call "the most aggressive scraper on the internet" (Net Influencer, 2024).
The Speed Comparison
ByteSpider scrapes data at approximately 25 times the rate of GPTBot. Let that sink in. GPTBot is already consuming massive bandwidth, and ByteSpider operates at 25 times that speed (Fortune, 2024).
Compared to ClaudeBot, ByteSpider is roughly 3,000 times faster. It's scraping at "many multiples" of Google, Meta, Amazon, OpenAI, and Anthropic combined.
Traffic Share
In some measurements, ByteSpider accounts for nearly 90% of AI crawler traffic: HAProxy's analysis showed close to 90% of the AI crawler traffic on its network came from ByteSpider alone (HAProxy, 2024).
The crawler has shown huge spikes in scraping activity over six-week periods throughout 2024 and 2025, with increasingly aggressive behaviour patterns.
Compliance Issues
ByteSpider does not respect robots.txt directives. It doesn't identify itself transparently. Multiple sources report that it tries to pretend to be real users and ignores instructions in robots.txt files (Dark Visitors, 2024).
Behaviour patterns include high request rates, excessive bandwidth usage, and documented instances of not adhering to bot access management directives.
Why So Aggressive?
Industry analysis suggests ByteDance is "desperately trying to catch up" in the generative AI race. The company previously used OpenAI's technology to help build its own large language models, but aggressive data collection now appears to be its strategy for training competitive AI models independently (Qz, 2024).
Blocking ByteSpider
You can add robots.txt directives, but ByteSpider is documented to ignore them:
```
User-agent: Bytespider
Disallow: /
```

For ByteSpider, you'll need more aggressive defences including IP range blocking, rate limiting, and potentially firewall rules at the infrastructure level.
The Real Cost for Australian Businesses
Let's translate these technical specifications into business impact, particularly for Australian website owners.
Bandwidth Consumption
AI crawlers are consuming between 250GB to 500GB of data per day for many websites (Search Engine Journal, 2025). One user reported GPTBot alone consumed 30TB of bandwidth in a single month (InMotion Hosting, 2025).
For Australian businesses on limited hosting plans, this translates directly into overage fees. If you're on a shared hosting plan with Panthur, VentraIP, or SiteHost, you might be paying for hosting upgrades solely because of AI crawler activity you're not even benefiting from.
The Read the Docs example we opened with is instructive. They decreased traffic by 75% (from 800GB to 200GB daily) by blocking AI crawlers, saving approximately AU$1,500 per month in bandwidth costs (TrevNet Media, 2025).
Server Performance Impact
AI crawlers place significant load on servers, leading to slower website performance for actual human visitors. This degrades user experience and can increase costs due to higher server resource consumption (Computerworld, 2025).
Unlike search engine crawlers that typically follow predictable patterns and respect crawl delays, AI crawlers exhibit more aggressive behaviours. They return frequently (every six hours for some sites), multiply resource consumption, and show less predictable patterns.
Fastly, a major CDN provider, warns that AI crawlers are causing "performance degradation, service disruption, and increased operational costs" (SiliconANGLE, 2025).
The Traffic Value Gap
Here's the business problem. Traditional search engine crawlers like Googlebot create value for website owners. They index your site, and when people search, they send traffic back to you. The relationship is reciprocal.
AI crawlers break that relationship. They scrape your content, use it to train their models, and then answer user questions directly without sending users to your site. Stanford research found click-through rates from AI chatbots at just 0.33% and AI search engines at 0.74%, compared to 8.6% for Google Search (Arc XP, 2025).
You're providing the content and paying for the bandwidth, and getting almost nothing in return.
Invalid Traffic & Analytics Distortion
In Q4 2024, monthly GIVT (General Invalid Traffic) volumes reached over 2 billion ad requests for the first time in history, with AI crawlers contributing to an 86% increase (DoubleVerify, 2024).
If you're using Google Analytics or similar tools, aggressive AI crawler activity can skew your data, making it harder to understand actual user behaviour and make informed business decisions.
The Ethical & Legal Battleground
While we're discussing technical specifications and costs, there's a massive legal battle playing out over whether AI companies have the right to scrape and train on your content at all.
The New York Times Lawsuit
On 27 December 2023, The New York Times sued OpenAI and Microsoft, alleging that ChatGPT was trained on copyrighted works without authorisation or payment (NPR, 2025).
In March 2025, a federal judge rejected OpenAI's motion to dismiss and allowed the case's main copyright infringement claims to proceed to trial. Three publishers' lawsuits (The New York Times, The New York Daily News, and the Center for Investigative Reporting) have been merged into one case (NPR, 2025).
The publishers argue that millions of copyrighted works were used without consent or payment. According to the complaint, OpenAI should be on the hook for billions of dollars in damages, and the lawsuit calls for the destruction of ChatGPT's dataset, which could completely upend the company.
OpenAI's Defence: Fair Use
OpenAI leaders argue that mass data scraping is protected under the "fair use" legal doctrine. They claim The Times "intentionally manipulated" prompts to make it appear as if ChatGPT generates near word-for-word excerpts of articles, and that such verbatim regurgitation is a "rare bug" (TechCrunch, 2024).
The central legal question is whether using copyrighted material in AI training qualifies as fair use under copyright law. In 2025, there's still no clear legal guidance, and the outcome of these lawsuits will shape the entire industry.
Publisher Responses: Two Strategies
Publishers have split into two camps. Some are pursuing legal action, seeking damages and dataset destruction. Others are signing licensing deals.
The Associated Press signed a two-year deal allowing OpenAI to use select news content dating back to 1985. Axel Springer hammered out a deal allowing OpenAI to use its data for three years in exchange for "tens of millions" of euros. News Corp and Vox Media have also reached content-sharing agreements (Axios, 2025).
The split reflects different strategic bets about the future of AI and media.
The Australian Copyright Position
In October 2025, the Albanese Government made Australia's position clear. The government is consulting on possible updates to copyright laws, but will not include a Text and Data Mining Exception, which some in the technology sector had called for (Attorney-General, 2025).
Under a Text and Data Mining Exception, AI developers would be able to use the works of Australian creators for free and without permission to train AI systems. Australia rejected this approach, turning the country into what one analysis called "an experiment with one possible future for AI copyright deals" (Semafor, 2025).
The government has flagged three priority areas to explore:
- Encouraging fair, legal avenues for using copyright material in AI through examination of licensing arrangements
- Improving certainty on the application of copyright law to material generated through AI
- Exploring avenues for less costly enforcement, including through a potential small claims forum
At present, Australian law does not extend copyright protection to works created entirely by AI. Under the Copyright Act 1968, the term "author" refers to a human author, as AI wasn't a consideration when the Act was written (Sprintlaw, 2025).
Detection Methods & The Spoofing Problem
Even if you decide to block AI crawlers, there's a technical challenge: how do you know you're actually blocking them?
The Spoof Ratio
In 2025, AI crawler spoofing is widespread. The spoof ratio is 1:17, meaning 1 in every 18 requests using an AI crawler user agent is fake, representing 5.7% of all AI crawler traffic (Human Security, 2025).
Malicious bots often pretend to be Googlebot, Bingbot, or legitimate AI crawlers to bypass defences. Simply checking the user agent string isn't enough.
Advanced Spoofing Techniques
High-end spoofing campaigns go beyond simple user agent forgery. They mimic legitimate network details, often originating from the "right" ASN (Autonomous System Number) and using IP addresses adjacent to official ranges. Some campaigns spray low-volume traffic across thousands of serverless "Worker" IPs to avoid detection (Human Security, 2025).
The PerplexityBot stealth crawling example we discussed earlier demonstrates these techniques in action: rotating IPs and ASNs, spoofing real-browser user agents, and bypassing robots.txt.
IP Range Verification
Effective defence means checking every request against the crawler's published ASN and IP ranges, not just the user agent string. Most major AI companies publish their IP ranges:
- OpenAI: Documented IP ranges on their website
- Anthropic: Dedicated IP ranges for ClaudeBot
- Amazon: Verification via reverse DNS to crawl.amazonbot.amazon
- Common Crawl: IP ranges provided as JSON at https://index.commoncrawl.org/ccbot.json
Verifying against these official ranges helps filter out fraudulent logs from bad actors posing as legitimate services.
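As a minimal sketch of what that verification involves, assuming you maintain a local copy of each provider's published ranges, the Python snippet below checks a request's source IP against configured CIDR blocks and, for crawlers like Amazonbot, confirms the reverse DNS hostname ends in the expected suffix. The CIDR values shown are placeholders, not the providers' real ranges.

```python
import ipaddress
import socket

# Placeholder CIDR ranges -- substitute the ranges each provider actually publishes.
PUBLISHED_RANGES = {
    "GPTBot": ["192.0.2.0/24"],        # example range only
    "ClaudeBot": ["198.51.100.0/24"],  # example range only
}

def ip_matches_published_ranges(crawler: str, ip: str) -> bool:
    """Return True if the IP falls inside one of the crawler's published CIDR blocks."""
    addr = ipaddress.ip_address(ip)
    return any(addr in ipaddress.ip_network(cidr) for cidr in PUBLISHED_RANGES.get(crawler, []))

def reverse_dns_matches(ip: str, expected_suffix: str) -> bool:
    """Check that reverse DNS resolves under the expected domain, e.g. crawl.amazonbot.amazon."""
    try:
        hostname, _, _ = socket.gethostbyaddr(ip)
    except (socket.herror, socket.gaierror):
        return False
    return hostname.endswith(expected_suffix)

# Example: a request claiming to be GPTBot from an unlisted IP is treated as spoofed.
print(ip_matches_published_ranges("GPTBot", "192.0.2.17"))   # True (inside placeholder range)
print(ip_matches_published_ranges("GPTBot", "203.0.113.5"))  # False -> likely spoofed
print(reverse_dns_matches("203.0.113.5", "crawl.amazonbot.amazon"))
```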
Log Analysis Patterns
LLM crawlers often show distinctive patterns in server logs:
- Unusually high request rates from a single IP address
- Patterns of sequential URL access
- Repetitive requests over extended periods
- High-frequency requests that spike suddenly
Advanced detection uses machine learning for automated pattern recognition, predictive analytics, anomaly detection for unusual activity, and real-time processing (Structured Labs, 2025).
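As a rough illustration of the simpler patterns above, the hedged sketch below parses a combined-format access log and flags any IP whose request count within a single minute exceeds a threshold. The log path, format, and threshold are assumptions; most sites would do this at the CDN or with a dedicated bot-management tool rather than by hand.

```python
import re
from collections import Counter

# Assumed Apache/nginx combined log format; adjust the pattern to your own logs.
LOG_LINE = re.compile(r'^(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] "(?P<request>[^"]*)"')

REQUESTS_PER_MINUTE_THRESHOLD = 300  # assumption -- tune for your traffic profile

def flag_high_rate_ips(log_path: str) -> dict:
    """Count requests per (ip, minute) bucket and return buckets above the threshold."""
    buckets = Counter()
    with open(log_path) as handle:
        for line in handle:
            match = LOG_LINE.match(line)
            if not match:
                continue
            # Timestamp looks like "10/Oct/2025:13:55:36 +1000"; truncate to the minute.
            minute = match.group("time")[:17]
            buckets[(match.group("ip"), minute)] += 1
    return {key: count for key, count in buckets.items()
            if count > REQUESTS_PER_MINUTE_THRESHOLD}

if __name__ == "__main__":
    for (ip, minute), count in sorted(flag_high_rate_ips("access.log").items()):
        print(f"{ip} made {count} requests in the minute starting {minute}")
```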
Emerging Cryptographic Verification
The industry is moving toward cryptographic verification using HTTP Message Signatures (RFC 9421). This enables websites to cryptographically verify agent authenticity rather than relying on easily spoofed identifiers like user agents or IP addresses (Human Security, 2025).
This is still emerging, but it represents the future of bot authentication.
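To make the idea concrete, here is a deliberately simplified sketch of the verification step, assuming the site already knows the crawler's public key and has reconstructed the RFC 9421 signature base (the canonical string of covered request components). Real RFC 9421 verification involves stricter component canonicalisation and key discovery than shown here.

```python
import base64

from cryptography.exceptions import InvalidSignature
from cryptography.hazmat.primitives.asymmetric.ed25519 import Ed25519PublicKey

def verify_agent_signature(public_key_b64: str, signature_b64: str, signature_base: str) -> bool:
    """Verify an Ed25519 signature over a pre-built RFC 9421-style signature base."""
    public_key = Ed25519PublicKey.from_public_bytes(base64.b64decode(public_key_b64))
    try:
        public_key.verify(base64.b64decode(signature_b64), signature_base.encode("utf-8"))
        return True
    except InvalidSignature:
        return False

# The signature base is a canonical string of the covered request components, e.g. (simplified):
signature_base = (
    '"@method": GET\n'
    '"@authority": example.com.au\n'
    '"user-agent": GPTBot/1.0\n'
    '"@signature-params": ("@method" "@authority" "user-agent")'
)
# verify_agent_signature(crawler_public_key, request_signature, signature_base)
```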
Practical Defence Strategies for Australian Websites
Let's get practical. If you're running a website in Australia and want to manage AI crawler costs and performance impact, here's what works.
Layer 1: Robots.txt (The Polite Request)
Start with robots.txt directives. This works for ethical AI companies that respect the Robots Exclusion Protocol.
To block all major AI crawlers:
```
# OpenAI
User-agent: GPTBot
Disallow: /

# Anthropic
User-agent: ClaudeBot
Disallow: /

User-agent: anthropic-ai
Disallow: /

# Google AI Training (separate from Search)
User-agent: Google-Extended
Disallow: /

# Common Crawl
User-agent: CCBot
Disallow: /

# ByteDance/TikTok
User-agent: Bytespider
Disallow: /

# Amazon
User-agent: Amazonbot
Disallow: /

# Perplexity
User-agent: PerplexityBot
Disallow: /

# Meta/Facebook
User-agent: FacebookBot
Disallow: /
```

Google-Extended is particularly useful because it allows you to block AI training data collection while still allowing regular Googlebot to crawl for search purposes. This separation means you can opt out of AI training without harming your SEO (ThatWare, 2025).
Changes to robots.txt typically propagate within 24 hours for compliant crawlers.
Layer 2: Rate Limiting (The Technical Barrier)
Many AI bots ignore polite signals, so you need real enforcement. Rate limiting restricts the number of requests from specific IP ranges, user agents, or ASNs within a given timeframe (Cloudflare, 2025).
Most CDNs, hosting providers, and CMS platforms now include AI-scraper rate limiting options. If you're using:
- Cloudflare: Built-in bot management with AI crawler blocking
- Netlify: Guide for blocking AI bots and controlling crawlers
- WordPress: Dark Visitors plugin for automated blocking
- Shopify: Dark Visitors tag available
For custom implementations, configure your web server (nginx, Apache) to limit requests per minute from known AI crawler IP ranges.
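As a sketch of the underlying idea (in practice you'd usually configure this in nginx, Apache, or your CDN rather than in application code), the snippet below applies a fixed-window request limit keyed by client IP, with a much tighter limit for requests whose user agent matches a known AI crawler. The window size and limits are assumptions to be tuned for your traffic.

```python
import time
from collections import defaultdict

AI_CRAWLER_AGENTS = ("GPTBot", "ClaudeBot", "PerplexityBot", "Bytespider", "CCBot")
WINDOW_SECONDS = 60
DEFAULT_LIMIT = 600      # assumption: ordinary clients
AI_CRAWLER_LIMIT = 30    # assumption: much tighter budget for AI crawlers

_request_counts: dict = defaultdict(lambda: [0.0, 0])  # ip -> [window_start, count]

def allow_request(client_ip: str, user_agent: str, now: float | None = None) -> bool:
    """Return True if the request is within its per-IP fixed-window budget."""
    now = time.time() if now is None else now
    window_start, count = _request_counts[client_ip]
    if now - window_start >= WINDOW_SECONDS:
        window_start, count = now, 0  # start a new window
    limit = AI_CRAWLER_LIMIT if any(bot in user_agent for bot in AI_CRAWLER_AGENTS) else DEFAULT_LIMIT
    count += 1
    _request_counts[client_ip] = [window_start, count]
    return count <= limit

# Example: the 31st GPTBot request inside one minute would be rejected (e.g. with a 429).
for _ in range(31):
    allowed = allow_request("198.51.100.7", "Mozilla/5.0 (compatible; GPTBot/1.0)")
print("last request allowed?", allowed)
```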
Layer 3: Behavioural Detection (The Smart Defence)
Monitor how visitors interact with your site to identify non-human traffic patterns early. Legitimate users navigate between pages, spend variable time on content, and interact with page elements. Bots often show sequential URL access patterns, consistent request timing, and no interaction with dynamic elements.
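One hedged sketch of such a heuristic: given per-IP request timestamps from your logs or middleware, highly uniform inter-request intervals suggest automation rather than a person reading pages. The threshold below is an assumption, and this signal should be combined with others before blocking anything.

```python
import statistics

UNIFORMITY_THRESHOLD = 0.15  # assumption: coefficient of variation below this looks machine-like

def looks_automated(request_times: list[float]) -> bool:
    """Flag an IP whose inter-request intervals are suspiciously regular."""
    if len(request_times) < 10:
        return False  # too few requests to judge
    intervals = [b - a for a, b in zip(request_times, request_times[1:])]
    mean = statistics.mean(intervals)
    if mean == 0:
        return True
    variation = statistics.pstdev(intervals) / mean
    return variation < UNIFORMITY_THRESHOLD

# A crawler requesting a page every 2 seconds, like clockwork:
print(looks_automated([float(t) for t in range(0, 40, 2)]))  # True
# A human browsing with irregular gaps:
print(looks_automated([0, 3, 4, 19, 25, 26, 58, 90, 140, 200, 260]))  # False
```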
Layer 4: Infrastructure Solutions (The Heavy Artillery)
For sites experiencing severe crawler impact:
- Firewall Rules: Block entire IP ranges at the infrastructure level using ufw, iptables, or cloud firewall services
- CDN-Level Blocking: Implement bot management at the CDN edge before traffic reaches your origin server
- Geographic Restrictions: If your business serves Australian customers only, consider geographic IP filtering
The Strategic Question: Should You Block?
Not every website should block all AI crawlers. Consider:
Reasons to Allow:
- You're explicitly trying to get your content into AI training sets
- You've signed licensing deals with AI companies
- You want your content available via AI chat interfaces
- Bandwidth costs are negligible for your infrastructure
Reasons to Block:
- Bandwidth costs are significant (media sites, image-heavy e-commerce)
- Server performance is degrading
- You're concerned about copyright and unauthorised training
- Your content has commercial value you want to protect
- You're exceeding hosting plan limits due to bot traffic
One Reddit user reported that GPTBot alone consumed 30TB of bandwidth in just one month, without any clear business benefit (InMotion Hosting, 2025). For that site, blocking was clearly the right choice.
Monitoring Tools: Dark Visitors & Agent Analytics
If you want visibility into which AI crawlers are hitting your site and how much traffic they're consuming, specialised monitoring tools have emerged.
Dark Visitors (darkvisitors.com) provides real-time insight into crawler, scraper, and AI agent activity. It tracks hidden activity from all known AI agents, measures human conversions from AI platforms like ChatGPT, Perplexity, and Gemini, and generates automated robots.txt files that update as new crawlers emerge (Dark Visitors, 2025).
The service offers:
- Real-time traffic monitoring
- Surge and overload notifications
- Automated robots.txt generation (no manual updates required)
- Blocks for agents breaking rules
- Available as WordPress plugin, Node.js package, Shopify tag, or API
Profound's Agent Analytics provides similar functionality with a focus on understanding which AI agents are accessing your content and how they're using it (Profound, 2025).
These tools solve the detection problem we discussed earlier. Instead of manually analysing server logs, they automatically identify legitimate and spoofed AI crawler traffic.
Future Trends: What's Coming in 2026
AI crawler activity isn't slowing down. Here's what the data suggests for 2026.
Continued Traffic Growth
Arc XP's CDN observed a 300% year-over-year jump in AI-driven bot traffic, and media and publishing sites are seven times more likely to see AI bot traffic than the average website (Arc XP, 2025). Industry observers report that "new bots are appearing almost weekly", with no sign of the growth abating.
From May 2024 to May 2025, AI bot traffic grew 18% overall. If that trend continues, expect another 15-20% growth in 2026.
Agentic AI & Enterprise Adoption
Gartner predicts that 40% of enterprise applications will leverage task-specific AI agents by 2026, compared to less than 5% in 2025 (Deloitte, 2025). These agents will handle tasks independently, from managing schedules to optimising supply chains.
More AI agents means more AI crawlers. Each new AI application that needs current information will likely deploy its own crawler or rely on existing crawlers to gather training data.
Emerging Standards: llms.txt
A new protocol called llms.txt is emerging. Unlike robots.txt (which controls crawlers) or sitemaps (which list URLs), llms.txt highlights the pages you want AI systems to reference during inference (USAII, 2026).
While it isn't yet officially supported by every AI provider, and Google denies using it for AI Overviews, site logs show that OpenAI's GPTBot is already fetching llms.txt files regularly.
This could evolve into a way for website owners to guide AI systems toward preferred content while blocking training data collection.
Regulation & Governance
The European Union's AI Act is already setting standards and will be fully effective by 2026. Expect similar frameworks in the US and Asia (Zero Gravity, 2025).
Industry groups and major publishers are lobbying for government regulation to create clearer rules for AI data usage and compensation frameworks for content creators. The New York Times lawsuit and Australia's rejection of the Text and Data Mining Exception are early indicators of how this regulatory landscape might develop.
Gartner predicts that 40% of AI data breaches will stem from cross-border generative AI misuse by 2026, which will likely accelerate regulatory action (Hyperight, 2025).
The Search Traffic Shift
Stanford research found click-through rates from AI chatbots at just 0.33% and AI search engines at 0.74%, compared to 8.6% for traditional Google Search. If AI-powered search continues growing, website owners will need to rethink their entire traffic acquisition strategy.
The business model of the internet has been "create content, get found in search, convert visitors to customers." AI crawlers take the first and second steps but eliminate the third. That's an existential challenge for content-driven businesses.
The Australian Business Perspective
For Australian businesses, AI crawlers present several specific considerations.
Cost Structures
Australian hosting and bandwidth costs tend to be higher than US equivalents due to geographic distance and smaller market scale. A 75% reduction in bandwidth usage (like the Read the Docs example) could translate to AU$1,000-2,000 monthly savings for a medium-sized content site.
If you're using Australian hosting providers like VentraIP, Crazy Domains, or Panthur on tiered plans, AI crawler traffic might be pushing you into higher pricing tiers unnecessarily.
Copyright Protection
Australia's rejection of the Text and Data Mining Exception means you have stronger copyright protection than some other jurisdictions. If you're an Australian content creator or publisher, you retain the right to control how your works are used for AI training.
This creates both a defensive position (you can legitimately block AI crawlers) and potentially a commercial opportunity (licensing your content to AI companies on your terms).
E-commerce & Product Data
If you're running an Australian e-commerce site, ClaudeBot's focus on images means it's likely consuming significant bandwidth crawling your product photography. With Australian e-commerce growing and sites typically hosting hundreds or thousands of product images, this isn't trivial.
Consider implementing rate limiting specifically for image-focused crawlers, or blocking them entirely if the bandwidth costs outweigh any potential benefit from appearing in AI-generated shopping recommendations.
Government & Compliance
If you're operating in regulated industries (financial services, healthcare, legal services), be aware that AI crawlers may be accessing and potentially training on sensitive information displayed on your website. Even if you're not displaying truly confidential data, there may be compliance implications worth reviewing with your legal team.
The Technical Comparison Table
Here's a quick reference comparing the major AI crawlers discussed:
| Crawler | Owner | Volume / Speed | Respects robots.txt | Primary Content | JavaScript Execution | Crawl-to-Referral Ratio |
|---|---|---|---|---|---|---|
| GPTBot | OpenAI | 569M req/month | Yes (24h) | HTML (57.70%) | No | Not published |
| ClaudeBot | Anthropic | 370M req/month | Yes (supports Crawl-delay) | Images (35.17%) | No | 38,000:1 |
| PerplexityBot | Perplexity AI | Undisclosed | Claims yes, documented violations | HTML | No | Not published |
| ByteSpider | ByteDance | 25× faster than GPTBot | No | All content types | No | Not published |
| Google-Extended | Google | Part of Googlebot | Yes (24h) | All content types | Yes | Not published |
| Amazonbot | Amazon | Undisclosed | Yes (NOT Crawl-delay) | All content types | No | Not published |
| CCBot | Common Crawl | 2.4-2.6B pages/month | Yes | All content types | No | N/A (public dataset) |
What This Means for Your Website
We've covered a lot of technical ground. Let's bring it back to practical business impact.
If you're running an Australian website in 2025, AI crawlers are consuming your bandwidth, potentially degrading your server performance, and providing almost no traffic value in return. The major crawlers differ significantly in behaviour, compliance, and aggression.
GPTBot and ClaudeBot generally respect robots.txt and provide transparency through published IP ranges. They're aggressive but predictable. PerplexityBot has documented violations of robots.txt directives and deploys stealth techniques. ByteSpider is the most aggressive crawler on the internet, operating at speeds 25 times faster than GPTBot and ignoring robots.txt entirely.
The cost isn't theoretical. Real websites are saving AU$1,000-2,000 monthly by blocking AI crawlers, with no negative impact on human visitors or legitimate search engine traffic.
Your decision should be based on your specific situation. If bandwidth costs are significant, if you're exceeding hosting limits, or if server performance is degrading, implementing AI crawler controls is likely worthwhile. Start with robots.txt for compliant crawlers, add rate limiting for aggressive ones, and monitor the impact.
If you want your content accessible to AI systems, consider selective allowing. Google-Extended lets you block AI training while maintaining search visibility. You can allow specific directories while blocking others, giving you granular control.
The legal landscape remains uncertain, but Australia's rejection of the Text and Data Mining Exception gives you stronger copyright protection than some jurisdictions. You have the right to control how your content is used for AI training.
As we move into 2026, expect AI crawler traffic to continue growing, new crawlers to emerge, and the regulatory landscape to evolve. The websites that proactively manage AI crawler access now will be better positioned to control costs, maintain performance, and protect their content as this technology continues developing.
The era of unlimited free access to web content for AI training may be ending. The question is whether individual website owners will take control proactively, or whether regulation will impose solutions. For Australian businesses, the tools and legal framework to make that choice already exist.
Sources
- Search Engine Journal - AI Crawlers Draining Site Resources
- Fortune - ByteSpider Gobbling World's Data 25× Faster
- Cloudflare - From Googlebot to GPTBot: Who's Crawling Your Site in 2025
- Startup Hub - AI Bot Traffic 80% Dominance
- SDxCentral - Meta AI Crawlers Dominate 52%
- Vercel - The Rise of the AI Crawler
- Arc XP - AI Bot Traffic Trends
- InMotion Hosting - Why AI Crawlers Are Slowing Down Your Site
- The Decoder - GPTBot Gets Blocked the Most
- Search Engine Journal - OpenAI Launches GPTBot
- JetOctopus - AI Bots: What They Crawl and Why
- The Register - AI Crawlers Destroying Websites
- Anthropic Support - Does Anthropic Crawl Data from the Web
- DataDome - What is ClaudeBot
- Cloudflare - The Crawl-to-Click Gap
- Cloudflare - Perplexity Using Stealth Crawlers
- Bank Info Security - Perplexity's Bots Ignore No-Crawl Rules
- Malwarebytes - Perplexity AI Ignores No-Crawling Rules
- Williams Media - How to Block PerplexityBot
- Perplexity - Official Crawlers Documentation
- Daydream - How Perplexity Crawls and Indexes Your Website
- Net Influencer - ByteDance's Web Scraper Data Grab
- HAProxy - 90% of AI Crawler Traffic from ByteSpider
- Dark Visitors - Bytespider Documentation
- Qz - TikTok's Owner Scraping 25× Faster than OpenAI
- TrevNet Media - AI Web Crawlers Impact
- Computerworld - Rise of AI Crawlers Causing Havoc
- SiliconANGLE - Fastly Report on AI Bots
- DoubleVerify - AI Crawlers Increase Invalid Traffic
- NPR - New York Times Takes OpenAI to Court
- NPR - Judge Allows NYT Copyright Case to Go Forward
- TechCrunch - OpenAI Claims NYT Lawsuit Without Merit
- Axios - NYT Case Against OpenAI Can Advance
- Attorney-General - Albanese Government AI Copyright Statement
- Semafor - Australia Makes Pivotal AI Copyright Decision
- Sprintlaw - Copyright in the Age of AI
- Human Security - AI Crawler Spoofing
- Human Security - Understanding AI Traffic
- Human Security - Crawlers List Known Bots
- Structured Labs - Detecting LLM Crawlers Through CDN Logs
- Cloudflare - Prevent AI Crawlers from Scraping Sites
- ThatWare - Google-Extended Crawler Update
- Dark Visitors - Homepage
- Profound - Agent Analytics Features
- Deloitte - New AI Breakthroughs Shaping 2026
- USAII - Top 10 AI Trends to Watch in 2026
- Zero Gravity - SEO Predictions for 2026
- Hyperight - Bold AI Data Predictions for 2025/2026
