Since 2024, written content producers have become increasingly concerned about the way AI systems (including OpenAI’s ChatGPT and Google’s Gemini) trawl through their text and use it to train large language models, with the result that content may be reproduced (in adapted form) with no attribution or payment to the original creator.
This potential breach of copyright in content is something that is difficult to address, given that existing copyright laws did not anticipate the way that AI large language models would make use of content.
Some content providers have locked down their pages with directives to AI bots not to index or use their pages. Some AI systems will obey these directives, but some won’t, leading to a ‘wild west’ environment and the need for a solution that compensates copyright owners, while not preventing AI systems (that have become exponentially more popular since 2023) from learning and improving.
Cloudflare has come up with a practical proposal to address this important area: a system that rewards copyright owners while allowing AI systems to continue to advance and develop.

Who/what is Cloudflare?
Cloudflare is an American company that offers a range of internet services. This includes content delivery network services, cybersecurity services, DDoS (distributed denial-of-service) mitigation and domain name services. Cloudflare is a global network that helps to make websites faster and safer, increasing performance and security.
Recently the company announced that it will block AI crawlers by default from any new customer site that uses its services. Since last year, it had already offered customers the ability to block AI bots from crawling their sites completely, or to pick and choose which would be allowed and which would not. The new default-blocking policy means that, in future, website owners will need to opt out of these controls to allow AI bots access, rather than being opted in automatically.

What is Cloudflare pay-per-crawl?
Cloudflare pay-per-crawl will be a new option that content publishers can enable to stop AI crawlers freely scraping their sites for information. With pay-per-crawl enabled, a crawler attempting to access the site will be presented with pricing information telling it that a charge is required to access the content.
Cloudflare explains it like this:
‘Pay per crawl integrates with existing web infrastructure, leveraging HTTP status codes and established authentication mechanisms to create a framework for paid content access.
Each time an AI crawler requests content, they either present payment intent via request headers for successful access (HTTP response code 200), or receive a 402 Payment Required response with pricing. Cloudflare acts as the Merchant of Record for pay per crawl and also provides the underlying technical infrastructure.’

What does Cloudflare pay-per-crawl mean for businesses?
This option from Cloudflare will give control of content back to publishers, allowing them to vet who can and cannot access their content, and who should have to pay to do so. It restores some of the control publishers lost when AI systems could access published content freely, and should help to protect the financial viability of content creators.
The system is now in beta, with a small number of businesses trialling the scheme. Cloudflare has said that in future it will give content publishers more options in terms of how their content is viewed. By partnering with AI companies, they will be able to determine what AI bots are crawling. Through this verification process, content publishers then have the option to either enforce a charge on all AI bots, or just charge those crawling for content generation. AI bots crawling for training, learning and general search purposes may still be allowed access free of charge.
This option from Cloudflare will likely be welcomed by content publishers as a way of creating some protection for their work, and compensation when it is being used.
There is currently an open case between The New York Times, on the one hand, and both OpenAI and Microsoft, on the other. The case, which was first filed in December 2023 and has been ongoing since then, includes claims from The New York Times that OpenAI and Microsoft used numerous articles written and owned by the publication to train their AI models.
It is claimed that this led models to replicate almost exactly stories published by the newspaper. Copyright infringement is the publication’s main argument, in that the AI models are copying content and also allowing users access to it freely (rather than accessing the publication via a paywall).
OpenAI argues that its actions constitute fair use in training the models on public data, and has compared this to how web pages are indexed by search engines. But there is a clear difference between the two models: a search engine acts as a gateway, ultimately connecting users with the indexed content at its source, whereas AI systems tend to provide an ‘answer’ and often neither require nor suggest a visit to the source website(s).
The arguments central to this case highlight the specific legal issues that come into play around AI development and copyright law. The ruling on the New York Times case could impact on the future development of AI and how and what content it is legally allowed to train on, as well as what protections content publications have over their own copyrighted material in relation to AI.

What does Cloudflare pay-per-crawl mean for AI crawlers?
With around 20% of the internet using Cloudflare (including many larger websites), the impact on AI crawlers that use the web for training could be significant. If a large number of companies enabled the pay-per-crawl option and the AI companies were unwilling to pay, their crawlers could lose access to a substantial share of the information they presently rely upon for both training and learning.
Whilst there are already some methods in place to block crawlers, such as robots.txt rules, these are not always followed by crawlers programmed to extract information from the web. The robots.txt method works by placing a text file at the root of your website listing the web crawlers you want to allow or disallow. This lets crawlers see what they should and should not access on that specific site – but whether they obey these directives depends entirely on how they have been programmed.
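To illustrate the allow/disallow mechanism, here is a minimal sketch using Python’s standard-library robots.txt parser. The bot name (‘ExampleAIBot’) and site are made up for the example:

```python
import urllib.robotparser

# A robots.txt file as it might appear at https://example.com/robots.txt,
# disallowing a hypothetical AI crawler while allowing all other bots.
ROBOTS_TXT = """\
User-agent: ExampleAIBot
Disallow: /

User-agent: *
Allow: /
"""

parser = urllib.robotparser.RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# A well-behaved crawler checks can_fetch() before requesting a page --
# but nothing in the protocol forces it to.
print(parser.can_fetch("ExampleAIBot", "https://example.com/articles/"))
print(parser.can_fetch("GenericBot", "https://example.com/articles/"))
```

The key limitation is visible in the last comment: robots.txt is purely advisory, which is exactly the gap that default blocking and pay-per-crawl aim to close.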
Conclusion
This move by Cloudflare towards blocking AI crawlers by default can be seen as a step in the right direction in a commercial landscape where the interests of AI have been prioritised. It remains to be seen to what degree there is the will to make the proposed pay-per-crawl system work. Websites using Cloudflare at least have the option to decide whether or not they want to allow AI crawlers continued access to their sites from now on, while considering the implications of that access for future levels of visits to their content.
The pay-per-crawl option, designed to compensate the original content publishers for sharing their work with AI systems through paid access, might be a compromise that some content creators find acceptable. They would continue to allow AI programs to use their content, but at least they would stand to gain a little revenue in the process. Whether this is considered a satisfactory trade-off or not may depend on the amount of payment involved and the value the individual website owner places on their content remaining unique and free from direct competition by large language model systems.
In the brave new world of AI searches and prompts, web traffic with informational search intent has already dramatically dropped, as the AI chat systems have effectively colonised territory that was previously the preserve of human writers on smaller private websites.
Human web content creators (including copywriters) may feel as though they are fighting for the survival of their business models in the face of the competition from corporate AI tech giants. While it remains to be seen how much individual writers and small-scale content producers will benefit, limiting the ability of tech giants to pirate the work of others for financial gain appears, if nothing else, to be a small step towards redressing a growing injustice, by recognising the rights of the original creators.
If you would like more information on our services, or have any thoughts on this article you would like to share, please send us a message.