I have a project where one of the requirements is that a small cross section of the users needs to be able to run large historical queries on their data, but those queries cannot slow down the site for any of the regular queries. These big queries run sporadically, not all the time, so I decided to go with a serverless solution where we are only charged for the data stored and the compute resources used during the queries.
Note that Glue crawlers are billed by the DPU-hour with a 10-minute minimum duration for each crawl, so do NOT kick off a crawl for every little query.
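To get a feel for what that minimum means for your bill, here is a rough sketch of the arithmetic. The rate, the DPU count, and the billing increment are all assumptions I am passing in as parameters, so plug in the numbers from the current Glue pricing page for your region.

```python
import math


def crawl_cost(duration_s, dpus, rate_per_dpu_hour, minimum_s=600, increment_s=1):
    """Estimate the cost of one crawl.

    The duration is rounded up to the billing increment and then raised
    to the minimum billable duration (10 minutes by default).
    All pricing inputs are assumptions supplied by the caller.
    """
    rounded_s = math.ceil(duration_s / increment_s) * increment_s
    billed_s = max(minimum_s, rounded_s)
    return billed_s / 3600 * dpus * rate_per_dpu_hour


# A 1-minute crawl is still billed as 10 minutes.
short_crawl = crawl_cost(duration_s=60, dpus=2, rate_per_dpu_hour=0.44)
# A 30-minute crawl is billed for the time it actually ran.
long_crawl = crawl_cost(duration_s=1800, dpus=2, rate_per_dpu_hour=0.44)
```

The takeaway is that a crawl that finishes in one minute costs the same as one that takes ten, so it pays to crawl on a schedule instead of before every query.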
We start our journey with the application layer. It takes in traffic from the internet, and whenever a database record is created or updated it publishes an event to Kinesis. The event includes the new state of the record and, in the case of an update, the previous values of the updated fields.
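Here is a minimal sketch of what that publish step could look like. The field names (`event_type`, `emitted_at`, `record`, `previous`) and the assumption that every record has an `id` are mine for illustration, not the exact payload my application sends. `put_record` is the standard boto3 Kinesis client call.

```python
import json
from datetime import datetime, timezone


def build_change_event(record, previous=None):
    """Build the event for a created or updated record.

    `previous` is the full prior state for an update. Only the fields
    whose values actually changed are kept in the event.
    """
    event = {
        "event_type": "update" if previous is not None else "create",
        "emitted_at": datetime.now(timezone.utc).isoformat(),
        "record": record,
    }
    if previous is not None:
        event["previous"] = {
            key: old for key, old in previous.items() if record.get(key) != old
        }
    return event


def publish_change_event(kinesis_client, stream_name, event):
    """Send the event to Kinesis, partitioned by the record id."""
    return kinesis_client.put_record(
        StreamName=stream_name,
        Data=json.dumps(event).encode("utf-8"),
        PartitionKey=str(event["record"]["id"]),
    )
```

In a real application you would pass in `boto3.client("kinesis")` and your own stream name. Taking the client as an argument also makes it easy to swap in a fake one for tests.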
Kinesis records the event and makes it available to any consumers for a fixed retention period. From there Kinesis Firehose is set up to push the event records into S3 as Apache Parquet files.
From there we can crawl the files stored in S3 using AWS Glue to generate table schemas that we can then query. There are a few options for how to query the data. If you prefer SQL you can use a tool like Athena. Otherwise there are plenty of SDKs and libraries, like Pandas, that will help you query your new data lake.
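If you go the Athena route from code, a rough sketch with boto3 looks something like this. The database name, table name, and S3 output location in the usage note are placeholders I made up, and for brevity this only reads the first page of results.

```python
import time


def run_athena_query(athena, sql, database, output_location, poll_seconds=1.0):
    """Run a query in Athena and return the first page of rows as dicts."""
    started = athena.start_query_execution(
        QueryString=sql,
        QueryExecutionContext={"Database": database},
        ResultConfiguration={"OutputLocation": output_location},
    )
    query_id = started["QueryExecutionId"]

    # Athena runs queries asynchronously, so poll until it reaches a final state.
    while True:
        execution = athena.get_query_execution(QueryExecutionId=query_id)
        status = execution["QueryExecution"]["Status"]
        if status["State"] in ("SUCCEEDED", "FAILED", "CANCELLED"):
            break
        time.sleep(poll_seconds)

    if status["State"] != "SUCCEEDED":
        raise RuntimeError(f"Athena query {query_id} ended as {status['State']}")

    rows = athena.get_query_results(QueryExecutionId=query_id)["ResultSet"]["Rows"]
    # For SELECT queries the first row holds the column names.
    header = [col.get("VarCharValue") for col in rows[0]["Data"]]
    return [
        dict(zip(header, [col.get("VarCharValue") for col in row["Data"]]))
        for row in rows[1:]
    ]
```

You would call it with something like `run_athena_query(boto3.client("athena"), "SELECT * FROM my_events LIMIT 10", "my_database", "s3://my-bucket/athena-results/")`. Note that every value comes back as a string, so cast as needed.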
That is pretty much it. One caveat: since every change to every record is recorded, you will need to code your reports to take that into account and use the latest state. The good news is you can query and account for every change to every record if you want to. If you don’t need all of that you can probably use a Lambda to transform the records so that only the latest state is stored, but that would be a bit more work.
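To make the "use the latest state" part concrete, here is a small sketch of collapsing a pile of change events down to one row per record. The `record`, `id`, and `emitted_at` field names are hypothetical, and it assumes the timestamps are ISO 8601 strings in the same time zone so they sort correctly as plain strings.

```python
def latest_state(events):
    """Return {record id: most recent record state} from a list of change events."""
    latest = {}
    for event in events:
        record_id = event["record"]["id"]
        current = latest.get(record_id)
        # Keep the event with the newest timestamp; on a tie the later one in the list wins.
        if current is None or event["emitted_at"] >= current["emitted_at"]:
            latest[record_id] = event
    return {record_id: event["record"] for record_id, event in latest.items()}


events = [
    {"emitted_at": "2023-05-01T10:00:00+00:00", "record": {"id": 1, "name": "first"}},
    {"emitted_at": "2023-05-03T10:00:00+00:00", "record": {"id": 1, "name": "third"}},
    {"emitted_at": "2023-05-02T10:00:00+00:00", "record": {"id": 1, "name": "second"}},
    {"emitted_at": "2023-05-01T12:00:00+00:00", "record": {"id": 2, "name": "only"}},
]
current = latest_state(events)
```

The same idea works in SQL by ranking the rows for each record id by timestamp and keeping only the newest one, which is probably what you want for the really big queries.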
As with most things I do, here is the open source Terraform script:
https://github.com/schematical/sc-terraform/blob/feat/glue/tf/projects/chaospixel/env/athena.tf
Understand that right this minute it is NOT a standalone module... yet. I just got it working in my infrastructure, but a module is coming soon.
Also check out the Network Diagram I did to help you wrap your head around this whole process.
A video should be coming soon as well! If you want a helping hand implementing this workflow in your infrastructure, check out my Group Coaching Program or 1-on-1 Consulting at schematical.com. Also check out my FREE eBook, 20 Things You Can Do To Save Money On Your Amazon Web Services Bill Today.
~Cheers