Data Warehouses


Don't confuse the database that serves up data to your end users with your data warehouse. Both have very specific use cases and deliver value in different ways.

When To Use A Data Warehouse: Users who just want to see pictures of their friends' cats on the internet have significantly different compute requirements than an internal data science team that wants to crawl every single event from the last five years, looking for emerging business trends in your data.

Furthermore, you would never want a computationally expensive query run by that data science team for a report to slow down or otherwise degrade the general user experience. That is where a data warehouse comes into play. It is great at storing the mountain of extra data that is beyond the scope of what your average user needs to see on a daily basis, and at letting your internal teams query it for big-picture analytics.

Say, for example, you store every page view a user makes on every product. That data has no bearing on your end user's browsing experience, so keep it out of your primary DB. On the other hand, that data is extremely useful to your internal team, who can use it to track patterns in user behavior and predict future purchases, but it should live on separate hardware from the primary DB. It belongs in a data warehouse, which is optimized for a small group of people making giant queries against a massive dataset, as opposed to your regular database, which is optimized for a massive number of people making small, efficient queries against a much smaller dataset.
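For illustration, here is a rough sketch of what one such page-view event might look like; the field names are hypothetical placeholders, not a prescribed schema. Each record is tiny, but millions of them per day belong in the warehouse, not the primary DB.

```python
# A single page-view event destined for the warehouse (hypothetical fields)
page_view_event = {
    "user_id": 42,
    "product_id": "sku-123",
    "event": "page_view",
    "referrer": "search",
    "ts": "2024-01-15T10:32:00Z",
}
```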

One caveat: you might still want to keep your user's recent browsing history in the primary DB, for example to power a recommendation engine.

Option 1 - Always On Redshift: AWS Redshift is a great example of a Data Warehouse. It has all the standard Data Warehouse tooling, but the big drawback is cost: with provisioned Redshift you pay for the cluster whenever it is running, and you are likely over-provisioned most of the time.

Autoscaling: AWS does offer some automatic scaling for provisioned Redshift, so you can scale your cluster up and down without manual intervention. It is pretty intuitive, so I won't go into much detail here.
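If you do want to script it yourself, here is a minimal sketch of scaling a provisioned cluster with an elastic resize through boto3. The cluster identifier and node counts are placeholders, and in practice you would likely trigger this from a scheduler or alarm rather than by hand.

```python
import boto3

redshift = boto3.client("redshift", region_name="us-east-1")

def resize_cluster(cluster_id: str, node_count: int) -> None:
    """Elastic-resize a provisioned Redshift cluster to node_count nodes."""
    current = redshift.describe_clusters(ClusterIdentifier=cluster_id)["Clusters"][0]
    if current["NumberOfNodes"] == node_count:
        return  # already at the desired size
    redshift.resize_cluster(
        ClusterIdentifier=cluster_id,
        ClusterType="multi-node",
        NodeType=current["NodeType"],   # keep the same node type
        NumberOfNodes=node_count,
        Classic=False,                  # False = elastic resize (faster than classic)
    )

# e.g. scale up before the nightly reporting run, back down afterwards
resize_cluster("analytics-cluster", node_count=4)
```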

Option 2 - Redshift Serverless:

AWS also offers a serverless version of Redshift, so you only pay for compute while queries are actually running. Like the other serverless data stores, though, it still incurs a charge just for storing the data, which makes sense.
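Querying it looks much like provisioned Redshift. Here is a minimal sketch using the Redshift Data API through boto3, assuming a hypothetical workgroup, database, and table name.

```python
import time
import boto3

client = boto3.client("redshift-data", region_name="us-east-1")

# Submit a query to a Redshift Serverless workgroup (names are placeholders)
resp = client.execute_statement(
    WorkgroupName="analytics-workgroup",
    Database="analytics",
    Sql="SELECT product_id, COUNT(*) AS views FROM page_views GROUP BY product_id ORDER BY views DESC LIMIT 10;",
)
statement_id = resp["Id"]

# Poll until the statement finishes, then fetch the rows
while True:
    status = client.describe_statement(Id=statement_id)
    if status["Status"] in ("FINISHED", "FAILED", "ABORTED"):
        break
    time.sleep(1)

if status["Status"] == "FINISHED":
    rows = client.get_statement_result(Id=statement_id)["Records"]
    print(rows)
```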

Option 3 - AWS Glue: This is an interesting tool that lets you save Parquet files to S3, have AWS Glue crawl the files to catalog them, and then run your big queries against them. It's not as fast as an always-on DB, but the great news is that you only get charged for Glue while it is crawling and processing your query. Once it is done, it shuts down until you're ready to run your next query. There is one additional charge, though: the storage cost on S3. But that takes the place of the storage cost from the other database/data warehouse solutions, and you have the option of choosing whichever S3 storage tier fits your needs.
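Here is a minimal sketch of that flow, assuming the bucket, IAM role, and database names are placeholders (pandas needs pyarrow and s3fs installed to write Parquet straight to S3).

```python
import boto3
import pandas as pd

# 1. Dump a batch of events to S3 as Parquet (bucket/prefix are placeholders)
events = pd.DataFrame(
    [{"user_id": 42, "product_id": "sku-123", "event": "page_view", "ts": "2024-01-15T10:32:00Z"}]
)
events.to_parquet("s3://my-analytics-bucket/events/2024/01/15/events.parquet")

# 2. Point a Glue crawler at the prefix so the files show up in the Data Catalog
glue = boto3.client("glue", region_name="us-east-1")
glue.create_crawler(
    Name="events-crawler",
    Role="arn:aws:iam::123456789012:role/GlueCrawlerRole",  # placeholder role
    DatabaseName="analytics",
    Targets={"S3Targets": [{"Path": "s3://my-analytics-bucket/events/"}]},
)
glue.start_crawler(Name="events-crawler")
# Once the crawler finishes, Athena / Redshift Spectrum can query the cataloged table.
```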

AWS Athena also deserves an honorable mention here: it lets you write standard SQL queries that run against those files in S3 in a similar pay-as-you-go fashion to AWS Glue.
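A minimal sketch of kicking off an Athena query with boto3, again with placeholder database, table, and bucket names:

```python
import time
import boto3

athena = boto3.client("athena", region_name="us-east-1")

# Start the query; results land in the S3 location you choose (placeholder bucket)
resp = athena.start_query_execution(
    QueryString="SELECT product_id, COUNT(*) AS views FROM events GROUP BY product_id;",
    QueryExecutionContext={"Database": "analytics"},
    ResultConfiguration={"OutputLocation": "s3://my-analytics-bucket/athena-results/"},
)
query_id = resp["QueryExecutionId"]

# Poll for completion, then read the result set
while True:
    state = athena.get_query_execution(QueryExecutionId=query_id)["QueryExecution"]["Status"]["State"]
    if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
        break
    time.sleep(1)

if state == "SUCCEEDED":
    for row in athena.get_query_results(QueryExecutionId=query_id)["ResultSet"]["Rows"]:
        print([col.get("VarCharValue") for col in row["Data"]])
```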

Option 4 - Redshift Spectrum: I only found this tool while researching this document, but Redshift Spectrum looks like the best of both worlds: you get Redshift's query engine while using the AWS Glue model of querying files sitting in S3. It is worth looking into.
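A rough sketch of how Spectrum is typically wired up: you map the Glue Data Catalog into Redshift as an external schema, then query the S3-backed tables with ordinary SQL. The schema, role, workgroup, and table names here are all placeholders.

```python
import boto3

client = boto3.client("redshift-data", region_name="us-east-1")

# Map the Glue Data Catalog database into Redshift as an external schema (run once)
create_schema = """
CREATE EXTERNAL SCHEMA IF NOT EXISTS events_ext
FROM DATA CATALOG
DATABASE 'analytics'
IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftSpectrumRole';
"""

# Then ordinary Redshift SQL can read the S3-backed tables (and join them with local ones)
query = """
SELECT e.product_id, COUNT(*) AS views
FROM events_ext.events e
GROUP BY e.product_id
ORDER BY views DESC
LIMIT 10;
"""

for sql in (create_schema, query):
    client.execute_statement(
        WorkgroupName="analytics-workgroup",  # or ClusterIdentifier=... for provisioned
        Database="analytics",
        Sql=sql,
    )
```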

Bonus Points 1 - Cold Storage: Imagine you have an e-commerce platform that sells widgets, and the price changes almost daily, or at least weekly. What does the customer need to know to make a purchase today? Today's price, for sure; perhaps you want to give them a few more data points, like last week's price fluctuations. But after a certain point, keeping those old data points around is just filling up the DB and wasting precious CPU/memory resources.

You don't want to lose those data points completely, since they are valuable to your internal team for making business decisions. That is where the concept of "Cold Storage" comes into play: a process that removes records the end user no longer needs from your active database, keeping speed high and cost low, while your internal team can still query that information from your data warehouse.
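A minimal sketch of such an archival job, assuming a hypothetical Postgres "prices" table and an S3 bucket as the cold tier; the cutoff, table, and bucket names are all placeholders.

```python
from datetime import datetime, timedelta, timezone

import boto3
import pandas as pd
import psycopg2

cutoff = datetime.now(timezone.utc) - timedelta(days=365)

conn = psycopg2.connect("dbname=shop user=app")  # placeholder connection string
s3 = boto3.client("s3")

with conn, conn.cursor() as cur:
    # 1. Pull the stale rows out of the active table
    cur.execute(
        "SELECT product_id, price, recorded_at FROM prices WHERE recorded_at < %s",
        (cutoff,),
    )
    rows = cur.fetchall()

    if rows:
        # 2. Write them to the cold tier as Parquet (the warehouse can query this)
        df = pd.DataFrame(rows, columns=["product_id", "price", "recorded_at"])
        df.to_parquet("/tmp/prices_archive.parquet")
        s3.upload_file(
            "/tmp/prices_archive.parquet",
            "my-cold-storage-bucket",
            "prices/prices_archive.parquet",
        )

        # 3. Only then delete them from the active DB
        cur.execute("DELETE FROM prices WHERE recorded_at < %s", (cutoff,))
```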

Additionally, I have seen services that still give the end user the ability to query older records, but those records are actually stored on a completely different set of hardware from the active ones. An example would be keeping track of users' orders over time: orders placed in the last 12 months are kept in the active DB, but if the user queries orders older than 12 months, your application routes those queries to the hardware that stores the older records. That archived data source would be a great candidate for a serverless "on demand" data store.
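A minimal sketch of that routing logic, with hypothetical active_db and archive_db clients standing in for whatever drivers you actually use:

```python
from datetime import datetime, timedelta, timezone

ACTIVE_WINDOW = timedelta(days=365)

def fetch_orders(user_id: int, since: datetime, active_db, archive_db) -> list:
    """Route order lookups to the active DB or the archive based on the date range."""
    cutoff = datetime.now(timezone.utc) - ACTIVE_WINDOW
    orders = []
    if since < cutoff:
        # Older slice lives on separate, cheaper hardware (e.g. a serverless store)
        orders += archive_db.query_orders(user_id=user_id, start=since, end=cutoff)
    # Recent slice always comes from the fast, active DB
    orders += active_db.query_orders(user_id=user_id, start=max(since, cutoff), end=None)
    return orders
```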

Bonus Points 2 - Event-Driven Architecture: How do events get from the active database that serves your end users' content into your Data Warehouse? One of the best ways is via event-driven architecture. This is kind of like queues on steroids: every event, including Create, Update, and Delete operations, gets piped into your data warehouse.

Now, one amazing tool for this is Amazon Kinesis. Kinesis Data Streams works as a queue of sorts, but unlike a traditional queue it lets multiple consumers read the same stream of events, which is a miracle for microservice architecture, and Kinesis Data Firehose then takes care of delivering those events downstream. This is one of my favorite tools on the whole platform.

Additionally, by design, Firehose can pipe that data more or less directly into your Data Warehouse: it can deliver straight to Redshift, or to S3 where Glue and Athena can query it.
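A minimal sketch of the producer side, assuming a hypothetical delivery stream that has already been configured to deliver to S3 or Redshift:

```python
import json
import boto3

firehose = boto3.client("firehose", region_name="us-east-1")

def publish_event(event: dict) -> None:
    """Push one event onto the delivery stream; Firehose batches and delivers it downstream."""
    firehose.put_record(
        DeliveryStreamName="events-to-warehouse",  # placeholder stream name
        Record={"Data": (json.dumps(event) + "\n").encode("utf-8")},
    )

# e.g. called from the request path of your application
publish_event({"user_id": 42, "event": "page_view", "product_id": "sku-123", "ts": "2024-01-15T10:32:00Z"})
```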