Friday, 18 October 2013

Using database table as a queue

There is lot of information out there regarding this particular topic.
I recently had the opportunity to evaluate this approach.

Requirement

The requirement was to scale out a windows service across multiple boxes and enable each windows service instance to process incoming data independently. Job of the windows service is to read incoming data and process it one by one.

Approaches

There are several approaches to achieve this. Simplest one is a file-based approach where each box processes incoming data which has been divided into multiple files and copied to a location accessible to it. This location can be a shared directory on remote machine or a location on the individual boxes.
The advantages with such an approach are:
  1. Simple to understand and implement
  2. Easy to debug/maintain/troubleshoot
  3. Easy to scale by adding new boxes
  4. Easy to load balance by putting the boxes behind a vip
But, there are some disadvantages in this approach:
  1. Need for a layer to partition and copy incoming data such that the partitioned data is available to the boxes
  2. If the boxes need to access the partitioned files from a shared location then there are possibilities of latency/staleness due to inherent delays in file sharing protocols.
  3. If it is required to maintain state of the file processing then additional implementation needs to be done to maintain state. E.g. if the contents of the file are Orders in case of an ecommerce site, then depending on the use case it would be required to know whether the Order processed was successful or failure or needs retries etc.
The requirement mentioned above fits in a queue processing pattern. There are at least 2 popular and scalable ways to address queue processing.
  1. Using database table as a queue
  2. Using a message queue/service bus queue/cloud based queue kind of infrastructure (a la Azure Storage Queue)
The mechanism by introducing a new piece of architechture such as database or storage queue would be to load items in the queue from producer end and drain/remove items in the queue from consumer's end.
Let us see advantages of using a database:
  1. The items to be processed are available in a single place - DB table.
  2. We can introduce a column to track status to indicate whether the item has been processed or not.
  3. Improved load-sharing/reliability - the multiple boxes accessing the table can simply keep processing unprocessed items, if some of the boxes go down others will keep working and complete all items eventually.
  4. Out of box support for simultaneous access. Mutual exclusion can be a problem if file based approach is used.
The disadvantages however are:
  1. Allocation of DB class machines/hardware to install DB - this is a capex
  2. Overhead of maintaining a DB which will add to opex.
In case of a storage queue, the advantages are similar to that of DB table. Storage queues offer clear implementation and operations for the Queue abstract data type. Scalability also comes out of the box. Though cost wise storage queues in cloud might be cheaper, but there would be some substantial initial cost that would have to be paid in terms of newness of system, learning curve, implementation, maintenance and production operability.
For my requirement, a file based approach was used ultimately based on agreement with the larger internal forum. Unless it is essential for your requirement to pay DB or storage queue costs, file based approach would most of the times be the desired approach.

If you are looking for more information visit below links:

No comments:

Post a Comment