Search Pipeline Stages


How to architect scalable, cost-effective search engines.

Customers often bring me in to fix slow, costly search engines, and a common pattern I find is that they use wildcard/regex searches as their primary way of querying data.

Wildcard searches should be a last resort (if used at all), as crawling over millions of records is slow and computationally costly. That raises the question: what should come first?

Stop thinking of search as a single action; instead, think of it as a series of stages in a pipeline.

This means that when a search is entered, there is a series of steps along the way where the search could be answered before it hits the wildcard fallback.
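To make the idea concrete, here is a minimal sketch of such a pipeline in Python. Everything in it is an assumption for illustration: the sample records, the choice of an exact-match stage and a token-lookup stage, and the names `run_pipeline` and `STAGES` are hypothetical, and a real system would back each stage with its own index or service rather than in-memory dictionaries.

```python
import fnmatch
from typing import Callable, Dict, List, Optional, Tuple

# Hypothetical sample data standing in for millions of records.
RECORDS = ["red apple", "green apple", "red bicycle", "blue bicycle"]

# Cheap lookups built once, ahead of time.
EXACT_INDEX: Dict[str, List[str]] = {record: [record] for record in RECORDS}
TOKEN_INDEX: Dict[str, List[str]] = {}
for record in RECORDS:
    for token in record.split():
        TOKEN_INDEX.setdefault(token, []).append(record)


def exact_match_stage(query: str) -> Optional[List[str]]:
    """Stage 1 (assumed): a single dictionary lookup on the full query."""
    return EXACT_INDEX.get(query)


def token_stage(query: str) -> Optional[List[str]]:
    """Stage 2 (assumed): a single dictionary lookup on a whole word."""
    return TOKEN_INDEX.get(query)


def wildcard_stage(query: str) -> Optional[List[str]]:
    """Fallback: scans every record, which is the expensive part."""
    return fnmatch.filter(RECORDS, f"*{query}*")


# Ordered from cheapest to most expensive.
STAGES: List[Callable[[str], Optional[List[str]]]] = [
    exact_match_stage,
    token_stage,
    wildcard_stage,
]


def run_pipeline(query: str, stages=STAGES) -> Tuple[str, List[str]]:
    """Return (name of the stage that answered, results)."""
    for stage in stages:
        results = stage(query)
        if results:  # a stage answers only if it found something
            return stage.__name__, results
    return "none", []


print(run_pipeline("red apple"))  # answered by exact_match_stage
print(run_pipeline("apple"))      # answered by token_stage
print(run_pipeline("bicy"))       # falls through to wildcard_stage
```

The ordering is the design choice that matters: each stage either answers the search or passes it along, so the full scan only runs for the searches nothing cheaper could handle.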

Let’s say the infrastructure needed to run wildcard search as your only search method costs $10,000 a month. That may seem like a lot to some and a pittance to others, depending on where your business is.

What if we could put a search pipeline stage in front of that wildcard search that returns accurate results for just 50% of the searches coming in, but costs only $1,000 a month to run?

Theoretically, you could then run your wildcard search on 50% of the infrastructure (at least the CPUs), cutting your wildcard bill to roughly $5,000 a month.

Add the two together and you get $6,000 a month, which is far better than the original $10,000.

Now imagine you add another search pipeline stage, at $500 per month, between the primary stage and the wildcard, and it handles another 30% of searches before they reach the wildcard.

This means that, theoretically, you could cut your wildcard search infrastructure by 80%, from its original $10,000 down to $2,000 per month. Add in the $1,000 for the primary search stage and $500 for the secondary, and you get a total of $3,500.
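The arithmetic above can be written as a small Python function. This is only a sketch of the simplified model used in this article: it assumes the wildcard cost scales linearly with the share of searches that still reach it, and the function name `pipeline_cost` and its parameters are hypothetical.

```python
from typing import List, Tuple


def pipeline_cost(wildcard_only_cost: float,
                  stages: List[Tuple[int, float]]) -> float:
    """Estimate the monthly cost of a staged search pipeline.

    stages: (percent_of_searches_handled, monthly_cost) for each stage
            placed in front of the wildcard fallback.
    Assumes wildcard cost shrinks linearly with the traffic it receives.
    """
    handled_percent = sum(percent for percent, _ in stages)
    wildcard_cost = wildcard_only_cost * (100 - handled_percent) / 100
    stage_costs = sum(cost for _, cost in stages)
    return wildcard_cost + stage_costs


# One stage handling 50% of searches for $1,000 a month.
print(pipeline_cost(10_000, [(50, 1_000)]))             # 6000.0

# Add a second stage handling another 30% for $500 a month.
print(pipeline_cost(10_000, [(50, 1_000), (30, 500)]))  # 3500.0
```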

It's not as linear or as cut and dried as this, and there are nuances, but hopefully you get the basic idea.

Don’t think you are stuck with one method of search, especially wildcard search. Find ways to add fast, cost-effective stages that handle your searches before they ever hit the search methods that require big, expensive hardware.