The science of queues – MergeReduce

March 14, 2009

This week a problem with a novel solution presented itself.

At my job we aggregate real estate data from around the world (thank you, Canada) for major real estate brands (think balloons and golden labs).

On the front end we have very pretty UIs, and on the front of the back end some pretty mashup dashboards [add link later].

On the back end we have the gradual accumulation of substantive technical debt [love that word, come back to it later].

There are things that are messy in this dark place.

One messy aspect is a queue system that processes jobs for a database. A majority of the jobs consist of three files (triplets, we call them) and are submitted once per day. Two of the files are complete containers of their data, and the third is an incremental, holding only the changes for that day. Each triplet corresponds to a specific meta-container, so every day there is one triplet per container.

So recently the queue got to be more than 24 hours delayed, which meant a container could have two triplets sitting in the queue. Since the older full files are irrelevant as long as the newer ones are processed, there is no point in processing all six files (redundant work). A tricky three will do.

MergeReduce was born. It finds duplicate container triplets that are not in jeopardy of processing anytime soon (we use a 1 hour threshold), merges the old and new incremental files, and deletes the older triplet. It also logs, emails a status report, and marks in the database that the triplet has been processed.
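For the curious, here is a minimal Python sketch of the idea, not the code we actually run. The `Triplet` fields, the `eta` estimate of when the queue will reach a job, and the choice to simply append the old incremental lines ahead of the new ones are all assumptions made for illustration. The logging, email and database bookkeeping are left out.

```python
from collections import defaultdict
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Triplet:
    """One queued job. All field names here are hypothetical."""
    container_id: str
    submitted: datetime   # when the triplet entered the queue
    eta: datetime         # assumed estimate of when the queue will reach it
    full_a: str           # first full file (a path, for example)
    full_b: str           # second full file
    incremental: list     # lines of the incremental file


def merge_reduce(queue, now, threshold=timedelta(hours=1)):
    """Collapse duplicate triplets per container.

    Only triplets more than `threshold` away from processing are touched.
    Returns (kept, removed), with `kept` in the original queue order.
    """
    # Group the triplets that are safe to modify by container.
    safe = defaultdict(list)
    for t in queue:
        if t.eta - now > threshold:
            safe[t.container_id].append(t)

    removed = []
    for group in safe.values():
        if len(group) < 2:
            continue  # no duplicate for this container
        group.sort(key=lambda t: t.submitted)
        newest = group[-1]
        # Assumption: merging means older changes first, newer changes last.
        merged = []
        for t in group:
            merged.extend(t.incremental)
        newest.incremental = merged
        removed.extend(group[:-1])  # the older full files are redundant

    removed_ids = {id(t) for t in removed}
    kept = [t for t in queue if id(t) not in removed_ids]
    return kept, removed
```

Calling `merge_reduce(queue, datetime.now())` hands back the slimmed-down queue plus the triplets that were dropped, which is the list you would log and mark as processed. A triplet inside the one hour window is left alone, even if a newer one for the same container is waiting behind it.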

In a hardscrabble 48 hour period with not much sleep, a version of MergeReduce was cobbled together. Run overnight, it was able to eliminate over 100 triplets consisting of several million lines of data.

Queue lag was dramatically reduced thanks to the hard work of a few conscientious employees!

Lessons learned coming soon, like why you should flush your buffers, why string operations can be expensive, and the power of a thank you … time permitting.

