The "Pay Per Crawl" Business Model


Should the big data aggregators that scrape the web to train their LLMs compensate the websites they scrape?

I have to credit Matt Gowie for introducing me to the concept of "Pay Per Crawl", where anyone crawling a website and aggregating its data, whether for training, RAG, or anything else, pays that website per crawl.

It's an interesting tug-of-war, with so many people favoring an open, free internet over a closed, pay-gated model.

I've addressed this several times before, and my view is this: you can argue about whether data should be free, but computing power and CPU cycles are not free. Even if you host the content yourself, you have to pay for the electricity that runs the server. One way or another, the people hosting the data need to be compensated.

This leads to all sorts of interesting models, though. How do you protect yourself from people crawling your website without permission? Do you force everybody to log in? Do you put the content on a network that gates who can and can't access it and does its best to keep bots out? Do you add bot-limiting software like AWS WAF?

It looks like Cloudflare has taken some steps to launch this Pay Per Crawl model using the long-dormant HTTP 402 Payment Required status code, and has even gone as far as creating a pay-per-crawl sign-up page, which is still in beta at the time of this writing.
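To make the 402 idea concrete, here is a minimal sketch of how a site could answer crawlers with Payment Required. It is not Cloudflare's implementation. The header names, the token check, the price, and the user-agent matching are all assumptions I made up for illustration, and it uses only Python's standard library.

```python
from http import HTTPStatus
from http.server import BaseHTTPRequestHandler, HTTPServer
from typing import Optional

# Everything below is hypothetical and for illustration only.
CRAWLER_MARKERS = ("bot", "crawler", "spider")  # naive user-agent hints
PAYMENT_HEADER = "X-Crawl-Payment-Token"        # made-up request header
PRICE_HEADER = "X-Crawl-Price"                  # made-up response header
PRICE_PER_CRAWL = "USD 0.01"                    # made-up price
VALID_TOKENS = {"demo-token"}                   # stand-in for a real billing check


def looks_like_crawler(user_agent: str) -> bool:
    """Guess whether a request comes from a crawler, based on its user agent."""
    ua = user_agent.lower()
    return any(marker in ua for marker in CRAWLER_MARKERS)


def decide_status(user_agent: str, payment_token: Optional[str]) -> HTTPStatus:
    """Return 200 for regular visitors and paying crawlers, 402 otherwise."""
    if not looks_like_crawler(user_agent):
        return HTTPStatus.OK
    if payment_token in VALID_TOKENS:
        return HTTPStatus.OK
    return HTTPStatus.PAYMENT_REQUIRED


class PayPerCrawlHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        status = decide_status(
            self.headers.get("User-Agent", ""),
            self.headers.get(PAYMENT_HEADER),
        )
        if status is HTTPStatus.PAYMENT_REQUIRED:
            body = b"Payment required to crawl this page.\n"
        else:
            body = b"Hello, reader.\n"
        self.send_response(status)
        if status is HTTPStatus.PAYMENT_REQUIRED:
            # Tell the crawler what a crawl would cost.
            self.send_header(PRICE_HEADER, PRICE_PER_CRAWL)
        self.send_header("Content-Type", "text/plain; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


if __name__ == "__main__":
    HTTPServer(("127.0.0.1", 8000), PayPerCrawlHandler).serve_forever()
```

A real deployment would need far stronger bot identification than user-agent matching, since a user agent is trivial to spoof, which is presumably part of why this kind of thing ends up living at the network edge.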

I'd be curious to know how this turns out. I'll be keeping an eye on it. What are your thoughts on the pay-per-crawl model? Let me know!