Why caching sucks!


Previously, I talked about pre-populating your search results, and specifically about using the searches that did not find pre-populated results to determine what you should be pre-populating in the future.

One way of doing this is to fall back to a wildcard search: query your dataset with the string the user typed in, then store those results in a cache so that the next time that search is run, the cached results are served up.
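To make that concrete, here is a minimal sketch of that cache-aside fallback in Python. The in-memory dict stands in for whatever cache you actually use (Redis, Memcached, etc.), and run_wildcard_search is a hypothetical placeholder for the slow query against your source of truth:

```python
import hashlib

# In-memory stand-in for a real cache (Redis, Memcached, etc.).
cache: dict[str, list[dict]] = {}

def run_wildcard_search(query: str) -> list[dict]:
    # Hypothetical placeholder for the expensive LIKE '%query%' style
    # query against the source-of-truth DB.
    return [{"id": 1, "matched": query}]

def cache_key(query: str) -> str:
    # Normalize so "Foo " and "foo" map to the same cache entry.
    return hashlib.sha256(query.strip().lower().encode()).hexdigest()

def search(query: str) -> list[dict]:
    key = cache_key(query)
    if key in cache:
        return cache[key]                 # hit: serve the cached results
    results = run_wildcard_search(query)  # miss: fall back to the slow path
    cache[key] = results                  # populate for the next searcher
    return results
```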

This can work great, but depending on your requirements, it could have some major drawbacks.

The first decision you have to make is whether to run this wildcard search right when the user clicks the search button, or later, in a batch job.

If you wait until you can batch it, then you will have to respond to the user with a “No results found” page of some type.
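If you go the batch route, the usual shape is to record the miss and answer with nothing. Here is a rough sketch, building on the cache above (miss_queue and batch_worker are illustrative names, not a real library API):

```python
import queue

# Misses recorded at request time, drained later by a background job.
miss_queue: queue.Queue = queue.Queue()  # holds query strings

def search_or_defer(query: str) -> list[dict]:
    key = cache_key(query)
    if key in cache:
        return cache[key]
    miss_queue.put(query)  # let the batch job warm this entry later
    return []              # the user sees "No results found" for now

def batch_worker() -> None:
    # Run on a schedule (cron, Celery beat, etc.) to warm the cache.
    while not miss_queue.empty():
        query = miss_queue.get()
        cache[cache_key(query)] = run_wildcard_search(query)
```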

If you choose to run the wildcard search right then and there, then depending on the dataset and the complexity of your query, the user might be sitting there for a while. I have seen queries like this take 15-30 seconds, which is a lifetime for a user waiting on a page to load.

The next consideration: if you cache it, how do you keep it up to date? Your results will change over time, right? New records get added, and you want them to be searchable.

Here, you have a few options, the most widely used being to expire cached results based on a TTL (Time To Live): any cached entry expires after a predefined duration, like one week.
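Continuing the sketch above, a TTL just means storing an expires-at timestamp alongside the results and treating anything stale as a miss. (With Redis, you would typically get this behavior from the EX option on SET rather than rolling your own.)

```python
import time

ONE_WEEK = 7 * 24 * 60 * 60  # seconds; matches the one-week example above

# Each entry stores (expires_at, results).
ttl_cache: dict[str, tuple[float, list[dict]]] = {}

def search_with_ttl(query: str) -> list[dict]:
    key = cache_key(query)
    entry = ttl_cache.get(key)
    if entry is not None:
        expires_at, results = entry
        if time.time() < expires_at:
            return results                    # still fresh: serve from cache
    results = run_wildcard_search(query)      # missing or expired: repopulate
    ttl_cache[key] = (time.time() + ONE_WEEK, results)
    return results
```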

This is great, but once a week you have to repopulate the cache for that search, which is going to take some time.

The combination of a TTL with running the wildcard search against the DB at the moment the user searches can leave you open to DDoS, too.

I have seen attackers map out a bunch of fairly obscure searches they knew would not be cached, and once they had that list, they would hit the website with all of those queries at once, causing every one of them to fall back to a wildcard search against the source-of-truth DB.

This resulted in some latency and a bigger AWS bill, because we chose to auto-scale rather than degrade search for the legitimate users.

One way around this is to stagger/randomize your caching TTLs so it is tougher for the bad guys to figure out when your window of vulnerability opens, but if the malicious party is patient enough, they will still be able to pull off an attack.
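One way to do that, continuing the sketch above, is to add jitter at write time; the one-day spread here is an illustrative choice, not a recommendation:

```python
import random

ONE_DAY = 24 * 60 * 60  # seconds

def jittered_ttl() -> float:
    # A uniformly random TTL between six and eight days instead of exactly
    # seven, so expiry times are harder to predict from the outside.
    return ONE_WEEK + random.uniform(-ONE_DAY, ONE_DAY)

def store_with_jitter(query: str, results: list[dict]) -> None:
    ttl_cache[cache_key(query)] = (time.time() + jittered_ttl(), results)
```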

If not cached by TTL, then what?

I will cover that later in this series on Search. For now, if you are interested in getting early access to my new e-book, which I am calling “Mastering Search At Scale”, just leave a comment or send me an email.