CommonCrawl.org
Looking to crawl the majority of the web, but don’t want to spend the time and money?
Then you should check out Common Crawl.
They have spent almost two decades documenting over 300 billion pages.
Unfortunately, as I write this, they seem to be having issues with some of their services (I should probably reach out and offer my services…).
There is a wide variety of data sources in there, but as you can imagine, Reddit appears to make up a decent chunk of what is indexed.
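If you want to see what Common Crawl has indexed for a given domain, their per-crawl index exposes a CDX-style query API. A minimal sketch of building such a query, assuming the crawl ID shown here (pick a current one from index.commoncrawl.org):

```python
from urllib.parse import urlencode

# Example crawl ID — an assumption; check index.commoncrawl.org
# for the list of available crawls.
CRAWL_ID = "CC-MAIN-2024-10"

def index_query_url(domain: str, crawl_id: str = CRAWL_ID) -> str:
    """Build a query URL asking which captures exist for a domain."""
    params = urlencode({"url": f"{domain}/*", "output": "json"})
    return f"https://index.commoncrawl.org/{crawl_id}-index?{params}"

print(index_query_url("example.com"))
```

Fetching the resulting URL returns one JSON record per capture (timestamp, original URL, and the WARC file offset you would use to retrieve the page itself).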
What can you do with this data?
The most common use case would likely be training general LLMs from scratch, but I am sure you could extract a lot of different data-driven insights from this.
I could imagine a use case where you tried to train a forecasting model for stock prices based on the amount and sentiment of posts and news articles historically.
Are you curious if Common Crawl has been crawling you?
The good news is that they are pretty transparent about their crawling efforts. Just keep an eye out for CCBot/2.0 (https://commoncrawl.org/faq/) in your server logs (I love it when they put the link to the docs in the user agent).
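Spotting those visits is a one-liner over your access logs. A minimal sketch, assuming the common Apache "combined" log format (the sample lines below are made up; adjust the matching to your server's actual layout):

```python
# Hypothetical sample lines in Apache "combined" log format.
SAMPLE_LOG = """\
203.0.113.5 - - [01/Jan/2025:10:00:00 +0000] "GET / HTTP/1.1" 200 1234 "-" "Mozilla/5.0"
198.51.100.7 - - [01/Jan/2025:10:01:00 +0000] "GET /about HTTP/1.1" 200 2345 "-" "CCBot/2.0 (https://commoncrawl.org/faq/)"
"""

def ccbot_hits(log_text: str) -> list[str]:
    """Return the log lines whose user-agent field mentions CCBot."""
    return [line for line in log_text.splitlines() if "CCBot" in line]

hits = ccbot_hits(SAMPLE_LOG)
print(len(hits))  # number of requests made by Common Crawl's bot
```

In practice you would point this at your real log file instead of a string, but the substring check is the whole trick.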
Interested in putting this data to use?
I would be happy to help you with that. Let's set up a time to chat.