Austin, Texas.
Late on any weeknight, while most of the internet is quiet, thousands of automated scripts spread across the web with one goal: to collect as much proprietary content as they can before anyone notices. These aren’t the simple bots of the past. They switch between residential IP addresses, mimic human browsing habits, and use computer vision to get past CAPTCHA challenges. The main barrier stopping them from reaching your original content at scale is Cloudflare Bot Management, which just received a major upgrade.
Cloudflare Bot Management Confronts a New Breed of Predator
The threat has grown much faster than most security leaders expected. From July 2024 to July 2025, requests from GPTBot, which gathers training data for ChatGPT, increased by 147 percent. In the same period, requests from Meta-ExternalAgent, used to train Meta’s AI models, jumped by 843 percent. These aren’t small, unknown groups. They are large, well-funded tech companies systematically taking value from publishers, media organizations, and independent creators without paying for it.
The business model is clear. As of June 2025, OpenAI’s crawl-to-referral ratio is 1,700 to one. Anthropic’s is 73,000 to one. In other words, OpenAI’s crawlers fetch roughly 1,700 pages for every visitor they send back to a publisher, and Anthropic’s fetch roughly 73,000. The old relationship between search engines and publishers, in which indexing brought traffic, has basically ended.
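To make those ratios concrete, here is a small arithmetic sketch. The function names are illustrative, and it assumes the ratio means pages crawled divided by visitors referred back.

```python
def crawl_to_referral_ratio(crawls: int, referrals: int) -> float:
    """Pages crawled per visitor referred back to the publisher."""
    if referrals <= 0:
        raise ValueError("referrals must be positive to form a ratio")
    return crawls / referrals


def referrals_per_million_crawls(ratio: float) -> float:
    """Invert the ratio: expected referred visitors for 1,000,000 crawled pages."""
    if ratio <= 0:
        raise ValueError("ratio must be positive")
    return 1_000_000 / ratio


if __name__ == "__main__":
    for name, ratio in [("OpenAI", 1_700), ("Anthropic", 73_000)]:
        visitors = referrals_per_million_crawls(ratio)
        print(f"{name}: about {visitors:.0f} referrals per million crawls")
```

At 1,700 to one, a million crawled pages return about 588 visitors. At 73,000 to one, they return about 14.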
On July 1, 2025, Cloudflare became the first major internet infrastructure company to block AI scraping by default. Now, AI companies must get clear permission from any website before they can crawl it. This policy change was important, but it only affects bots that admit they are bots. The bigger challenge is stopping those who hide their identities.
The Ghost Scrapers Nobody Sees Coming
Modern scraping tools now use AI themselves. They rely on large language models to understand page content, use computer vision to solve visual puzzles, and apply reinforcement learning to navigate complex websites they have never seen before. Traditional firewall rules, such as blocking an IP address or flagging a user agent, are no longer effective against such adaptive bots.
This is precisely the gap that Cloudflare’s newest behavioral analysis module targets. The generative scraper defense update to Cloudflare Bot Management moves decisively away from static signature matching and toward per-customer anomaly detection. For each customer zone, behavioral detections ingest traffic data to build a continuously updated baseline of normal activity for that specific website. The system understands seasonality, recognizes traffic spikes from authentic marketing campaigns, and maps the typical pathways real users take through a site. Once that baseline is established, deviations become visible in a way they never were before.
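The idea of a per-zone baseline can be sketched in a few lines. This is not Cloudflare’s implementation, which the source does not describe. It is a minimal illustration that assumes weekly seasonality can be captured by bucketing request counts by hour of the week. The class name, the z-score test, and the thresholds are all assumptions.

```python
from collections import defaultdict
from statistics import mean, pstdev
from typing import Dict, List, Optional


class ZoneBaseline:
    """Toy per-zone baseline: request counts bucketed by hour of week (0-167)."""

    def __init__(self, min_samples: int = 4, threshold: float = 3.0) -> None:
        self.min_samples = min_samples  # history needed before judging a bucket
        self.threshold = threshold      # |z| above this counts as anomalous
        self._history: Dict[int, List[int]] = defaultdict(list)

    def observe(self, hour_of_week: int, requests: int) -> None:
        """Record one observed request count for the given hour of the week."""
        self._history[hour_of_week % 168].append(requests)

    def zscore(self, hour_of_week: int, requests: int) -> Optional[float]:
        """Deviation from this bucket's own history, or None if too little data."""
        samples = self._history.get(hour_of_week % 168, [])
        if len(samples) < self.min_samples:
            return None
        mu = mean(samples)
        sigma = pstdev(samples) or 1.0  # avoid dividing by zero on flat history
        return (requests - mu) / sigma

    def is_anomalous(self, hour_of_week: int, requests: int) -> bool:
        z = self.zscore(hour_of_week, requests)
        return z is not None and abs(z) > self.threshold


if __name__ == "__main__":
    baseline = ZoneBaseline()
    for count in [100, 110, 90, 105]:  # four earlier weeks, same hour
        baseline.observe(3, count)
    print(baseline.is_anomalous(3, 108))   # within the usual range
    print(baseline.is_anomalous(3, 5000))  # far outside it
```

Because each zone is compared only with its own history, a volume that is routine for a large news site can still stand out on a small documentation portal. A production system would also need to recognize legitimate spikes such as marketing campaigns, which this sketch does not attempt.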
The scraping detection system looks at much more than just request headers. It tracks session paths, the order of requests, how users interact with dynamic page elements, and subtle client fingerprints, including JA4 fingerprints, all within each customer’s normal traffic patterns. Importantly, these models don’t need to read the actual page content. They focus on access patterns, not the substance, making them faster and easier to scale across Cloudflare’s millions of domains.
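One way to picture scoring access patterns without reading page content is a first-order transition model over URL paths. The sketch below is an assumption-laden illustration, not Cloudflare’s model. It learns how often real sessions move from one path to another and then measures how surprising a new session’s sequence is. The class name and the smoothing constant are hypothetical, and fingerprint signals such as JA4 are left out.

```python
import math
from collections import Counter, defaultdict
from typing import DefaultDict, Iterable, Sequence, Set


class PathModel:
    """Toy first-order Markov model over the order of requested paths."""

    def __init__(self, alpha: float = 1.0) -> None:
        self.alpha = alpha  # Laplace smoothing so unseen moves get nonzero probability
        self._transitions: DefaultDict[str, Counter] = defaultdict(Counter)
        self._pages: Set[str] = set()

    def fit(self, sessions: Iterable[Sequence[str]]) -> None:
        """Count page-to-page transitions seen in baseline sessions."""
        for session in sessions:
            self._pages.update(session)
            for src, dst in zip(session, session[1:]):
                self._transitions[src][dst] += 1

    def surprise(self, session: Sequence[str]) -> float:
        """Mean negative log-probability per transition; higher means more unusual."""
        steps = list(zip(session, session[1:]))
        if not steps:
            return 0.0
        vocab = len(self._pages) + 1  # one extra slot for never-seen pages
        total = 0.0
        for src, dst in steps:
            counts = self._transitions.get(src, Counter())
            prob = (counts[dst] + self.alpha) / (
                sum(counts.values()) + self.alpha * vocab
            )
            total -= math.log(prob)
        return total / len(steps)


if __name__ == "__main__":
    model = PathModel()
    model.fit(
        [["/", "/pricing", "/signup"]] * 20
        + [["/", "/blog", "/blog/post-1"]] * 20
    )
    human = ["/", "/pricing", "/signup"]
    scraper = ["/blog/post-1", "/signup", "/", "/blog/post-1"]
    print(model.surprise(human) < model.surprise(scraper))
```

The model only ever sees path strings and their order, which mirrors the point above: the signal comes from how a site is traversed, not from what the pages say.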
Generative Scraper Defenses and the Evasion Arms Race
Generative Scraper Defenses need to be advanced because attackers are always adapting. AI tools help both cybercriminals and some AI companies build bots that evade controls such as location or IP blocking by changing their signatures or attack methods. Some bots now mimic human behavior well enough to bypass CAPTCHA challenges entirely.
Take the example of Perplexity AI, which was publicly accused of impersonating real website visitors to scrape content from publishers like Wired. The value of large amounts of original content is higher than ever, and some AI companies are not open about their scraping. If a company worth billions is willing to hide its data collection, the financial incentive for less honest operators is even greater.
Cloudflare’s answer is a feature called the “link maze.” This tool traps automated scripts in an infinite loop of fake links, wasting their computing power and helping Cloudflare spot their behavior for future blocking. The crawler protection rule can be configured to punish AI scrapers using the link maze, and it works alongside other controls such as automatic model updates and lightweight JavaScript detection.
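The mechanics of such a trap can be sketched briefly. The source does not describe how Cloudflare generates its maze, so everything below is an assumption, including the `/maze/` prefix, the hash-derived link names, and the simulated crawler. The point is only that every decoy page can deterministically link to more decoy pages, so a crawler that follows links never runs out of work.

```python
import hashlib
from collections import deque
from typing import List, Tuple

MAZE_PREFIX = "/maze/"  # hypothetical URL prefix reserved for decoy pages


def maze_links(path: str, fanout: int = 3) -> List[str]:
    """Derive decoy links from the current path, so no state must be stored."""
    links = []
    for i in range(fanout):
        digest = hashlib.sha256(f"{path}#{i}".encode("utf-8")).hexdigest()[:12]
        links.append(f"{MAZE_PREFIX}{digest}")
    return links


def simulate_crawler(start: str, budget: int) -> Tuple[int, int]:
    """Breadth-first crawler that stops only when its fetch budget runs out.

    Returns (pages fetched, links still waiting in the queue).
    """
    seen = {start}
    queue = deque([start])
    fetched = 0
    while queue and fetched < budget:
        page = queue.popleft()
        fetched += 1
        for link in maze_links(page):
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return fetched, len(queue)


if __name__ == "__main__":
    fetched, pending = simulate_crawler(MAZE_PREFIX + "start", budget=50)
    print(f"fetched {fetched} decoy pages, {pending} more still queued")
```

Human visitors would never be shown these links, so in a design like this any client that travels several levels into the maze has identified itself as automated, which is the kind of behavioral signal the article says is used for future blocking.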
AI Traffic Safeguards as Infrastructure, Not an Add-On
AI Traffic Safeguards are interesting because of where they work. Cloudflare operates at the network layer, so its protections start before any request reaches a website’s server. CEO Matthew Prince said the company blocked over 416 billion AI bot requests in the six months after the July 2025 default-block policy. This number isn’t about rare cases—it shows the scale of regular, large-scale data extraction that used to go unnoticed.
Cloudflare also started a private beta for “Pay Per Crawl,” a marketplace where publishers can set their own prices and charge AI companies each time a page is crawled. This gives publishers a third choice beyond just allowing or blocking access. The system starts to address what many content leaders see as the main business problem of the AI era: value is created, but not always captured.
For leaders at content-heavy companies—such as media, legal publishers, financial data providers, and SaaS documentation platforms—the impact goes beyond security. Cloudflare Bot Management is now a tool for protecting revenue. Every scrape that isn’t blocked could help train a competitor’s model, using your resources.
A Standard That the Industry Did Not Know It Needed
The wider significance of the generative scraper defense update to Cloudflare Bot Management may be less about any single technical feature and more about the normalization of a new expectation: that content owners have enforceable rights over automated access to their work.
Cloudflare’s security researchers are always working to spot and classify AI-related crawlers and scrapers across their network. They use both customer reports of bad bots and analysis from watching huge amounts of traffic. This crowd-sourced feedback, which helps update machine learning models automatically, is what makes their defense system active and responsive, not just a set of fixed rules.
The next wave of ghost scrapers is already being built to get around today’s defenses. The real test for any security system isn’t whether it can stop last year’s bots, but whether it can spot new ones that haven’t been created yet. Cloudflare’s approach, which combines per-customer behavioral baselines with AI Traffic Safeguards at network scale, is the strongest solution the industry has seen so far. The big question is whether content owners will use it before the next wave arrives.
Source: The Cloudflare Blog













