AWS S3 Tables Use Case And Cost Breakdown
You don’t have to be. One of the new techs I have my eye on is AWS S3 Tables.
Before we go too far: I haven’t had a chance to do a crazy deep dive on this yet, but I have at least two clients that could really use a simple data lake solution. So hopefully we can do a proof of concept and give S3 Tables a hands-on test run in a real-world production environment at some point.
How it works:
From my understanding, AWS Services like Athena/Glue can already crawl data stored in S3, for example, Parquet files piped in from Kinesis events.
S3 Tables takes that to another level by storing tabular data (data in a table-like format with rows and columns) as a first-class resource, managed under the hood as Apache Iceberg tables.
The data is stored cold, and you get charged for storage and I/O instead of paying a flat hourly fee to reserve compute power like more traditional RDS setups (they have some cool serverless stuff that came out recently that works more like this).
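To make that a little more concrete, here’s a minimal sketch of kicking off an Athena query against Parquet data that Glue has already cataloged, using boto3. The database, table, and results bucket names are made up for illustration:

```python
import boto3

# Hypothetical names -- swap in your own Glue database, table, and results bucket.
DATABASE = "analytics_lake"
RESULTS = "s3://my-athena-results-bucket/queries/"

athena = boto3.client("athena")

# Kick off a query against Parquet files that Glue has already cataloged from Kinesis.
response = athena.start_query_execution(
    QueryString=(
        "SELECT event_type, COUNT(*) AS events "
        "FROM kinesis_events GROUP BY event_type"
    ),
    QueryExecutionContext={"Database": DATABASE},
    ResultConfiguration={"OutputLocation": RESULTS},
)
print("Query started:", response["QueryExecutionId"])
```

You only pay for the data scanned by the query and the storage underneath, which is the whole appeal compared to keeping a database instance running around the clock.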
Data Lakes vs Source Of Truth:
A common pattern I observe when kicking off a project with a new client is that they keep absolutely every piece of data that has ever been in the system in the primary source of truth DB.
This includes purchasing patterns from 10 years ago just in case someone in sales wants to run a report on it.
I don’t recommend this. Anything that is not pushing a customer to give you money today should be moved off the primary DB to some other hardware. In the sales-report use case, that would be a data lake, and that is a perfect use case for S3.
Though I will say that S3 Tables might fall somewhere between 10-year-old event data in cold storage and a real-time source-of-truth DB.
Pricing:
Their pricing (click into the “Tables” tab) is similar to standard S3 pricing, with a cost per GB and a throughput cost. It differs in a few places, though.
They have a larger focus on the number of objects. At $0.025 per 1,000 objects, I could see that adding up, but according to their docs, each row is not a single object. Rows get combined by a process called “compaction”:
By default, compaction periodically combines smaller objects into fewer, larger objects to improve query performance. When compaction is enabled, you are charged for the number of objects and the bytes processed during compaction.
This compaction also comes at a cost of $0.004 per 1,000 objects processed and $0.05 per GB processed.
I can’t wait to get some hands-on data, but if their pricing example is anywhere in the ballpark (they usually lowball these estimations), then 1 TB can be utilized for roughly $35.19 per month.
Assuming you are using this for internal reports and not letting the whole world spam you with queries, I could see this being close.
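To sanity check that number, here’s a rough back-of-the-napkin calculation using the rates quoted above. The per-GB storage rate, object count, and compaction volume are assumptions I made up for illustration, so treat this as a sketch and verify against the current pricing page:

```python
# Rough monthly cost sketch for ~1 TB in S3 Tables.
# Rates are either quoted in the pricing section above or marked as assumptions;
# AWS pricing changes, so check the pricing page before trusting any of this.

TB_IN_GB = 1_000

storage_per_gb = 0.0265            # ASSUMPTION: rough S3 Tables storage rate per GB-month
monitoring_per_1k_objects = 0.025  # from the pricing section above
compaction_per_1k_objects = 0.004  # from the pricing section above
compaction_per_gb = 0.05           # from the pricing section above

objects = 50_000    # ASSUMPTION: object count stays low thanks to compaction
gb_compacted = 100  # ASSUMPTION: GB rewritten by compaction this month

storage = TB_IN_GB * storage_per_gb
monitoring = (objects / 1_000) * monitoring_per_1k_objects
compaction = (objects / 1_000) * compaction_per_1k_objects + gb_compacted * compaction_per_gb

total = storage + monitoring + compaction
print(f"Storage:    ${storage:.2f}")
print(f"Monitoring: ${monitoring:.2f}")
print(f"Compaction: ${compaction:.2f}")
print(f"Total:      ${total:.2f}  (AWS's own 1 TB example lands around $35/month)")
```

Query costs (data scanned per request) come on top of that, which is why the tips below matter.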
A Few Random Tips For Data Warehouses:
Don’t Poll:
These types of setups get expensive if you constantly hit them with requests. Having a cron job running every hour to get the latest trends is a waste of money. Use event-driven architecture (EDA) to pipe the latest trends into your analytics dashboard, and save the data lake for long-term reports run as needed.
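Here’s a loose sketch of what that event-driven approach could look like: a Lambda-style handler that bumps a running counter in a (hypothetical) DynamoDB metrics table as events arrive, so the dashboard reads pre-aggregated numbers instead of polling the lake. The event shape and table name are assumptions:

```python
import json
import boto3

# Hypothetical DynamoDB table that holds pre-aggregated numbers for the dashboard.
dynamodb = boto3.resource("dynamodb")
metrics = dynamodb.Table("dashboard_metrics")

def handler(event, context):
    """Update running totals as events arrive, instead of a cron job
    re-querying the data lake every hour."""
    for record in event.get("Records", []):
        body = json.loads(record["body"])  # assumes an SQS-shaped record
        metrics.update_item(
            Key={"metric": f"orders_{body['region']}"},
            UpdateExpression="ADD event_count :inc",
            ExpressionAttributeValues={":inc": 1},
        )
```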
Caching Is Your Friend:
Again, if you run a report, cache the results so you don’t have to rerun it all the time. I go into more detail on this in the EDA videos on my YouTube channel.
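In the same spirit, here’s a minimal caching sketch (the cache bucket and key scheme are made up): hash the query, check S3 for a stored result, and only run the expensive report on a miss:

```python
import hashlib
import boto3
from botocore.exceptions import ClientError

s3 = boto3.client("s3")
CACHE_BUCKET = "my-report-cache"  # hypothetical bucket for cached report output

def cached_report(query: str, run_query):
    """Return the cached result for `query` if one exists; otherwise run it and cache it."""
    key = f"reports/{hashlib.sha256(query.encode()).hexdigest()}.json"
    try:
        obj = s3.get_object(Bucket=CACHE_BUCKET, Key=key)
        return obj["Body"].read()  # cache hit: skip the expensive query
    except ClientError as err:
        if err.response["Error"]["Code"] != "NoSuchKey":
            raise
    result = run_query(query)  # cache miss: run the report (e.g., via Athena)
    s3.put_object(Bucket=CACHE_BUCKET, Key=key, Body=result)
    return result
```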
Is this the silver bullet?
There is no such thing as a silver bullet, but I am confident there is a use case for this, and I am excited to dig in further. Let me know if you want a deeper dive or a full-length video on this one.
Question For You:
Are you using a data lake or data warehouse solution? If so, what does your setup look like?