I turned this on and it adjusts the robots.txt automatically; not sure what else it is doing.
# NOTICE: The collection of content and other data on this
# site through automated means, including any device, tool,
# or process designed to data mine or scrape content, is
# prohibited except (1) for the purpose of search engine indexing or
# artificial intelligence retrieval augmented generation or (2) with express
# written permission from this site’s operator.

# To request permission to license our intellectual
# property and/or other materials, please contact this
# site’s operator directly.

# BEGIN Cloudflare Managed content

User-agent: Amazonbot
Disallow: /

User-agent: Applebot-Extended
Disallow: /

User-agent: Bytespider
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: GPTBot
Disallow: /

User-agent: meta-externalagent
Disallow: /

# END Cloudflare Managed Content

User-agent: *
Disallow: /*
Allow: /$
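For anyone who wants to check what the managed entries do, here is a minimal sketch using Python's standard `urllib.robotparser`. It covers only a few of the Cloudflare-managed groups quoted above; the trailing `User-agent: *` group is left out on purpose, because the stdlib parser does plain prefix matching and does not interpret the `*` and `$` wildcards. The URL and the browser user-agent string are placeholders, not anything from the real site.

```python
from urllib.robotparser import RobotFileParser

# A subset of the managed block quoted above (same rule for every listed bot).
MANAGED_BLOCK = """\
# BEGIN Cloudflare Managed content
User-agent: Amazonbot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: GPTBot
Disallow: /
# END Cloudflare Managed Content
"""


def build_parser(text):
    """Parse robots.txt text without doing any network request."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser


parser = build_parser(MANAGED_BLOCK)
URL = "https://example.com/some-post"  # placeholder URL
BROWSER_UA = "Mozilla/5.0 (X11; Linux x86_64)"  # placeholder browser UA

for agent in ("GPTBot", "ClaudeBot", BROWSER_UA):
    verdict = "allowed" if parser.can_fetch(agent, URL) else "blocked"
    print(f"{agent} -> {verdict}")
```

This only tells you what a crawler that honors robots.txt would conclude; it says nothing about what a crawler that ignores the file will do.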
The headline is somewhat misleading: sites using Cloudflare now have an opt-in option to quickly block all AI bots, but it won't be turned on by default for sites using Cloudflare.
The idea that Cloudflare could do the latter at the sole discretion of its leadership, though, is indicative of the level of power Cloudflare holds.
My data served by Cloudflare has increased to 100 GB/month compared to <20 GB like 2 years ago, and they're all fairly static hobby sites. Actual people traffic is down by like half in the same time frame, so I imagine a lot of this is probably cost savings for Cloudflare to reduce resource usage.
> If A.I. companies freely use data from various websites without permission or payment, people will be discouraged from creating new digital content
I don't see a way out of this happening. AI fundamentally discourages other forms of digital interaction as it grows.
Its mechanism of growing is killing other kinds of digital content. It will eventually kill the web, which is, ironically, its main source of food.
I've heard lots of people on HN complaining about bot traffic bogging down their websites, and as a website operator myself I'm honestly puzzled. If you're already using Cloudflare, some basic cache configuration should guarantee that most bot traffic hits the cache and doesn't bog down your servers. And even if you don't want to do that, bandwidth and CPU are so cheap these days that it shouldn't make a difference. Why is everyone so upset?
> When you enable this feature via a pre-configured managed rule, Cloudflare can detect and block verified AI bots that comply with robots.txt and respect crawl rates, and do not hide their behavior from your website. The rule has also been expanded to include more signatures of AI bots that do not follow the rules.
We already know companies like Perplexity are masking their traffic. I'm sure there's more than meets the eye, but taking this at face value, doesn't punishing respectful and transparent bots only incentivize obfuscation?
edit: This link[0], posted in a comment elsewhere, addresses this question. tldr, obfuscation doesn't work.
> We leverage Cloudflare global signals to calculate our Bot Score, which for AI bots like the one above, reflects that we correctly identify and score them as a “likely bot.”
> When bad actors attempt to crawl websites at scale, they generally use tools and frameworks that we are able to fingerprint. For every fingerprint we see, we use Cloudflare’s network, which sees over 57 million requests per second on average, to understand how much we should trust this fingerprint. To power our models, we compute global aggregates across many signals. Based on these signals, our models were able to appropriately flag traffic from evasive AI bots, like the example mentioned above, as bots.
[0] https://blog.cloudflare.com/declaring-your-aindependence-blo...
Do the major AI companies actually honor robots.txt? Even if some of their publicly known crawlers might do it, surely they have surreptitious campaigns where they do some hidden crawling, just like how they illegally pirate books, images and user data to train on.
The list of bots is pretty short right now:
https://developers.cloudflare.com/bots/concepts/bot/#ai-bots
Archive link: https://archive.ph/ARnyu
Sounds very basic, sadly.
Anybody know why these web crawling/bot standards are not evolving? I believe robots.txt was invented in 1994 (thx chatgpt). People have tried with sitemaps, RSS and IndexNow, but it's like huge $$ organizations are depending on HelloWorld.bas tech to control their entire platform.
I want to spin up endpoints/mcp/etc. and let intelligent bots communicate with my services. Let them ask for access, ask for content, pay for content, etc. I want to offer solutions for bots to consume my content, instead of having to choose between full or no access.
I am all for AI, but please try to do better. Right now the internet is about to be eaten up by stupid bot farms and served into chat screens. They don't want to refer back to their source, and when they do it's with insane error rates.
Did they ever fix the auto-blocking of RSS feeds?
Discussed yesterday (270+ comments)[0]
This is interesting. I'm a fan of Cloudflare, and appreciate all the free tiers they put out there for many.
Today I see this article about Cloudflare blocking scrapers. There are useful and legitimate cases where I ask Claude to go research something for me. I'm not sure whether Cloudflare distinguishes legitimate search/research traffic by an AI client from scraping. The sites that are blocked by default will include content by small creators (unless they're on major platforms with a deal?), while the big guys who have something to sell, like Amazon, will likely be able to afford a deal to show up more in the results.
A few days ago there was also the news that Cloudflare is looking to charge AI companies to scrape the content, which is cached copies of other people's content. I'm guessing it will involve paying the owners of the data at some point as well. Being able to exclude it from this purpose (sell/license content, or scrape) would be a useful lever.
Putting those two stories together:
- Is this a new form of AISEO (search everywhere optimization), where you either show up in an AI's corpus or web search, or pay licensing fees instead of advertising fees? These could be interesting new business models, but I'm trying to see where these steps may lead and what to think about today.
- With training data being the most valuable thing for AI companies, and this being another avenue of revenue for Cloudflare, this can look like content licensing as a service.
I'd like to see where abstracting this out further ends up going.
Maybe I'm missing something, is anyone else seeing it this way, or another way that's illuminating to them? Is anyone thinking about rolling their own service for whatever parts of Cloudflare they're using?
I assume they will "protect original content online" by blocking LLM clients from ingesting data as context?
I'm not optimistic that you can effectively block your original content from ending up in training sets by simply blocking the bots. For now I just assume that anything I put online will end up being used to train some LLM
> Cloudflare can detect and block verified AI bots that comply with robots.txt and respect crawl rates, and do not hide their behavior from your website
It's the bots that do hide their behavior -- via residential proxy services -- that are causing most of the burden, for my site anyway. Not these large commercial AI vendors.
Every evolution of the web, from Web 2 giving us walled gardens to Web 3 giving us, well, nothing, to what we have now is taking us further from a network of communities and personal repositories of knowledge.
Sure, fidelity has gotten better but so much has been lost.
Isn’t this only useful for blogs, news sites, or forums? Why would I want an AI to know less about my product? I want it to understand it, talk about it, and ideally recommend it. Should be default off.
I’ve been using this for a while on my mastodon server and after a few tweaks to make sure it wasn’t blocking legit traffic it’s been working really well. Microsoft and Meta between them were hitting my services more than all other traffic combined, which says a lot if you know how noisy mastodon can be. Server load went down dramatically.
It also completely put a stop to perplexity as far as I can tell.
And the robots file meant nothing, they’d still request it hundreds of thousands of times instead of caching it. Every request they’d hit it first then hit their intended url.
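For contrast, a rough sketch of what a polite crawler could do instead: fetch robots.txt once per host and reuse the parsed result until a TTL expires. `CachedRobots`, the one-hour TTL, the `ExampleBot` name and the `fake_fetch` stand-in are all made up for illustration; a real crawler would plug in an HTTP client and probably respect cache headers too.

```python
import time
from urllib.robotparser import RobotFileParser


class CachedRobots:
    """Fetch robots.txt at most once per TTL per host (hypothetical helper)."""

    def __init__(self, fetch, ttl_seconds=3600.0, clock=time.monotonic):
        self._fetch = fetch      # callable: host -> robots.txt text
        self._ttl = ttl_seconds  # assumed TTL, not taken from any standard
        self._clock = clock
        self._cache = {}         # host -> (expires_at, parser)

    def allowed(self, agent, host, path):
        now = self._clock()
        entry = self._cache.get(host)
        if entry is None or entry[0] <= now:
            # Cache miss or expired entry: this is the only place we fetch.
            parser = RobotFileParser()
            parser.parse(self._fetch(host).splitlines())
            entry = (now + self._ttl, parser)
            self._cache[host] = entry
        return entry[1].can_fetch(agent, f"https://{host}{path}")


calls = []


def fake_fetch(host):
    """Stand-in for an HTTP GET of https://<host>/robots.txt."""
    calls.append(host)
    return "User-agent: ExampleBot\nDisallow: /private\n"


robots = CachedRobots(fake_fetch)
paths = ("/", "/private/x", "/posts/1")
results = [robots.allowed("ExampleBot", "example.org", p) for p in paths]
print(results, "robots.txt fetches:", len(calls))
```

Three page checks cost one robots.txt request here, rather than one per page.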
From an open source project's perspective we’d want to disable this on our docs sites. We actually want those to be very discoverable by LLMs, during training or online usage.
Yay, looking forward to more CAPTCHAs as a regular user.
Think this is the future, as the AI Web takes over the human web.
At Coinbase, we've been building tools to make the blockchain the ideal payment rails for use cases like this with our x402 protocol:
Ping if you're interested in joining our open source community.
This idea that you can publish data for people to download and read but not for people to download and store, or print, or think about, or train on is a doomed one.
If you don’t want people reading your data, don’t put it on the web.
The concept that copyright extends to “human eyeballs only” is a silly one.
I'm still not sure this is going to be very effective, as so many of the worst offenders don't identify themselves as bots, and often change their user agent. Has Cloudflare said anything about identifying the bad actors?
How would you do the opposite of this? Optimize your content to be more likely crawled by AI bots? I know traditional Google-focused SEO is not enough because these AI bots often use other web search/indexing APIs.
I don't want this by default. I want my website to end up in AI chatbots, for SEO.
This is great. But my concerns about Cloudflare's power remain. Today it's blocking AI crawlers, tomorrow will it be blocking all browsers that fail hardware-attestation checks?
But how is this effective against Gemini and even OpenAI, who can instead rely on the Google and Bing crawlers, respectively, to crawl the content?
As usual, this is the wrong approach.
The open web is akin to the commons, public domain and public land. So this is like putting a spy cam on a freeway billboard, detecting autonomous vehicles, and shining a spotlight at their camera to block them from seeing the ad. To what end?
Eventually these questions will need to be decided in court:
1) Do netizens have the right to anonymity? If not, then we'll have to disclose whether we're humans or artificial beings. Spying on us and blocking us on a whim because our behavior doesn't match social norms will amount to an invasion of privacy (eventually devolving into papers please).
2) Is blocking access to certain users discrimination? If not, then a state-sanctioned market of civil rights abuse will grow around toll roads (think whites-only drinking fountains).
3) Is downloading copyrighted material for learning purposes by AI or humans the same as pirating it and selling it for profit? If so, then we will repeat the everyone-is-a-criminal torrenting era of the 2000s and 2010s when "making available" was treated the same as profiting from piracy, and take abuses by HBO, the RIAA/MPAA and other organizations who shut off users' internet connections through threat of legal actions like suing for violating the DMCA (which should not have been made law in the first place).
I'm sure there are more. If we want to live in a free society, then we must be resolute in our opposition of draconian censorship practices by private industry. Gatekeeping by large, monopolistic companies like Cloudflare simply cannot be tolerated.
I hope that everyone who reads this finds alternatives to Cloudflare and tells their friends. If they insist on pursuing this attack on our civil rights for profit, then I hope we build a countermovement by organizing with the EFF and our elected officials to eventually bring Cloudflare up on antitrust charges.
Cloudflare has shown that they lack the judgement to know better. Which casts doubt on their technical merits and overall vision for how the internet operates. By pursuing this course of action, they have lost face like Google did when it removed its "don't be evil" slogan from its code of conduct so it could implement censorship and operate in China (among other ensh@ttification-related goals).
Edit: just wanted to add that I realize this may be an opt-in feature. But that's not the point - what I'm saying is that this starts a bad precedent and an unnecessary arms race, when we should be questioning whether spidering and training AI on copyrighted materials are threats in the first place.
AI will endlessly crawl my website, quickly exhausting the egress quota of my Supabase free plan, but Cloudflare can stop all of this.
I saw yesterday that they were going to allow websites to charge per scrape.
Looks like cloudflare just invented the new App Store.
This is a bit silly. Slowing down, yes, but blocking? People who *really* want that content will find a way, and this will instead hit everyone else, who will have to do silly riddles before following every link or run crypto mining for them before being shown the content.
I recently went to a big local auction site on which I buy frequently and I got one of these "we detected unusual traffic from your network" messages. And "prove you're human". Which was followed by "you completed the captcha in 0.4s, your IP is banned". Really? Am I supposed to slow down my browsing now? I tried a different browser, a different OS, logging on, clearing cookies, etc. Same result when I tried a search. It took 4h after contacting their customer service to unblock it. And the explanation was "you're clicking too fast".
At some point it just becomes a farce and the hassle is not worth the content. Also, while my story doesn't involve any bots perhaps a time will come when local LLMs will be good enough that I'll be able to tell one "reorder my cat food" and it will go and do it. Why are they so determined to "stop it" (spoiler, they can't).
For anyone who says LLMs are already capable of ordering cat food I say not so fast. First the cat food has to be on sale/offer (sometimes combined with extras). Second it is supposed to be healthy (no grains) and third the taste needs to be to my cat's liking. So far I'm not going to trust an LLM with this.
Why is every second article about this claiming that it’s automatic? It needs to be turned on or at least there was no mention of automatic in the original blog post.
I really hope that we can continue training AI the same way we train humans – basically for free.
s/A.I. Data Scrapers/non-sanctioned browsers running on non-sanctioned platforms/
They've been trying to do this for years. Now "AI" gives a convenient excuse.
I fail to see how this won’t just result in UA string or other obfuscation.
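A toy illustration of the concern, assuming a filter that only looks at the User-Agent header. This is not how Cloudflare describes its detection (the quoted blog text talks about fingerprints and a bot score); the token list is the one from the robots.txt posted earlier in the thread, and both user-agent strings are invented for the example.

```python
# Tokens taken from the robots.txt quoted earlier in the thread.
DECLARED_AI_AGENTS = (
    "Amazonbot",
    "Applebot-Extended",
    "Bytespider",
    "CCBot",
    "ClaudeBot",
    "Google-Extended",
    "GPTBot",
    "meta-externalagent",
)


def is_declared_ai_bot(user_agent):
    """Naive check: does the UA string contain a known AI crawler token?"""
    ua = user_agent.lower()
    return any(token.lower() in ua for token in DECLARED_AI_AGENTS)


# Both strings are made up for illustration only.
honest = "Mozilla/5.0 (compatible; GPTBot/1.0)"
spoofed = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) Chrome/120.0 Safari/537.36"

print("honest crawler flagged:", is_declared_ai_bot(honest))
print("spoofed crawler flagged:", is_declared_ai_bot(spoofed))
```

The crawler that announces itself gets flagged and the one that lies does not, which is why header matching alone would reward obfuscation.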
No one else can really do this except Cloudflare.
so TLDR it adjusts your robots.txt and relies on Cloudflare to catch bot behavior, and it doesn't actually do any sophisticated residential proxy filtering or handle the common bypass methods that work on Cloudflare Turnstile. Do I have this correct?
this just pushes AI agents "underground" to adopt the behavior of a full blown stealth focused scraper which makes it harder to detect.
account wall :-(
Poor ChatGPT-User, nobody understands you. Blocking a real user because of the, admittedly odd, browser they're using misses the point.
The destruction of the Web and IP theft need to be addressed legally. The opinion of a single judge notwithstanding, "AI" scraping already violates copyright. This needs to be made explicit in law and scrapers must get the same treatment as Western governments gave to thousands of individuals who were bankrupted or jailed for copyright infringement.
We are in the Napster phase of Web content stealing.
Unfortunately I think this is pissing into the wind. Information websites are all but dead. AI contains all published human information. If you have positioned your website as an answer to a question, it won't survive that way.
"Information" is dead but content is not. Stories, empathy, community, connection, products, services. Content of this variety is exploding.
The big challenge is discoverability. Before, information arbitrage was one pathway to get your content discovered, or to skim a profit. This is over with AI. New means of discovery are necessary, largely network and community based. AI will throw you a few bones, but it will be 10% of what SEO did.
Few people realise that virtually everything we do online has, until this point, been free training to make OpenAI, Anthropic, etc. richer while cutting humans--the ones who produced the value--out of the loop.
It might be too little, too late, at this juncture, and this particular solution doesn't seem too innovative. However, it is directionally 100% correct, and let's hope for massively more innovation in defending against AI parasitism.