The "Pay Per Crawl" Business Model


Should the big data aggregators that scrape the web to train their LLMs compensate the websites they scrape?

I have to credit Matt Gowie for introducing me to the concept of "Pay Per Crawl", where anybody crawling these websites and aggregating the data for training, for RAG, or for anything else would pay the websites per crawl.

This is an interesting tug-of-war, as there are so many people who are for an open, free internet in contrast with the closed, pay-gated model.

I've addressed this several times before, and my thoughts are: you could argue about whether data should be free or not, but compute power and CPU cycles are not free. Even if you host it yourself, you've got to pay for electricity for whatever server is serving the content. So one way or the other, the people hosting the data need to be compensated.

This leads to all sorts of interesting models though - how do you protect yourself from people crawling your website without permission? Do you force everybody to get a login? Do you put the content on a network that gates who can access it and who can't, and does its best to prevent bots? Do you add bot-limiting software like AWS WAF?

It looks like Cloudflare has taken some steps to launch this Pay Per Crawl model using the old HTTP 402 Payment Required status code, and has even gone as far as creating a Pay Per Crawl sign-up page, which at the time of this writing is still in beta.
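
To make the 402 idea concrete, here's a rough sketch of how a site could answer a crawler with Payment Required until it pays. This is not Cloudflare's actual protocol. The header names `X-Crawl-Price` and `X-Crawl-Payment`, the price, and the `respond` function are all made up for illustration, and matching on the User-Agent string is just the simplest possible stand-in for real bot detection.

```python
from http import HTTPStatus

# Substrings of some crawler User-Agent strings (illustrative, not exhaustive).
KNOWN_CRAWLER_TOKENS = ("GPTBot", "CCBot", "ClaudeBot")

# Everything below is a hypothetical assumption for this sketch.
PRICE_PER_CRAWL_USD = "0.01"
PRICE_HEADER = "X-Crawl-Price"      # hypothetical: price the site asks per crawl
PAYMENT_HEADER = "X-Crawl-Payment"  # hypothetical: amount the crawler agrees to pay


def respond(headers: dict[str, str]) -> tuple[int, dict[str, str], str]:
    """Return (status, response_headers, body) for a single request.

    Header lookup is case-sensitive here to keep the sketch short.
    """
    user_agent = headers.get("User-Agent", "").lower()
    is_crawler = any(token.lower() in user_agent for token in KNOWN_CRAWLER_TOKENS)

    # Regular visitors get the content for free.
    if not is_crawler:
        return int(HTTPStatus.OK), {}, "full article"

    # A crawler that agrees to the asking price gets the content.
    if headers.get(PAYMENT_HEADER) == PRICE_PER_CRAWL_USD:
        return int(HTTPStatus.OK), {PRICE_HEADER: PRICE_PER_CRAWL_USD}, "full article"

    # Otherwise: 402 Payment Required, along with the asking price.
    return (
        int(HTTPStatus.PAYMENT_REQUIRED),
        {PRICE_HEADER: PRICE_PER_CRAWL_USD},
        "payment required",
    )
```

A real setup would also need to verify who the crawler actually is and settle the payment somewhere, which is the hard part that a service sitting in front of your site would handle.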

I'd be curious to know how this turns out. I'll be keeping an eye on it. What are your thoughts on the Pay Per Crawl model? Let me know!