When a suggestion is made to do a piece of work in a software system as part of a batch, I always try to find a better alternative. Batches are a common solution to many large-scale processing problems, and their typical benefits are:

  • It shifts load away from the software system to a time when it is used less frequently
  • It can provide finer-grained control over the execution of expensive tasks
  • It can be a simple solution to prevent locking on crucial resources
  • It provides a better amortized cost for server usage

Some of the problems I find with batch processing are:

  • Feedback on failures typically only arrives after the batch has run, and these failures require investigation at a time when the context of how the defect could have been introduced has passed
  • Batching is often used as a simple alternative to optimizing a process
  • Once one batch process is introduced, it doesn't take long for a plethora of batched tasks to appear, as batching becomes the go-to solution for all large-scale processing problems
  • Invalid data operated on during a batch is only discovered after the run; this results in a person having to analyze the batch log on a daily basis and then follow up with those responsible for the invalid data
  • Batch processes rarely scale sub-linearly; they typically scale linearly with the data size at best, and often worse
  • Users of the system have to wait for batch runs to occur, which can slow down the business as jobs have to be run overnight or over the weekend; this usually adds an unmeasured operating cost to the business as time is needlessly wasted

The conclusion that many of the points above point towards is that if a batch is being considered as the solution to a problem, there is probably a better way to resolve it. The most common alternative to a batch solution is delta processing. Delta processing is essentially processing only the data that has changed or been added since the process was last run. That is, processing is done on the [delta](http://en.wikipedia.org/wiki/Delta_(letter)) (change) of the information.
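As a rough illustration, here is a minimal sketch of delta processing in Python. It assumes the source can be queried for rows changed since a given time and that a "watermark" timestamp of the last run is persisted between runs; the function names (`fetch_changed_rows`, `handle_row`, `load_watermark`) are my own placeholders, not from any particular library.

```python
from datetime import datetime, timezone

def load_watermark(path="last_run.txt"):
    """Read the timestamp of the last successful run (epoch start if none)."""
    try:
        with open(path) as f:
            return datetime.fromisoformat(f.read().strip())
    except FileNotFoundError:
        return datetime(1970, 1, 1, tzinfo=timezone.utc)

def save_watermark(ts, path="last_run.txt"):
    """Persist the timestamp so the next run only sees newer changes."""
    with open(path, "w") as f:
        f.write(ts.isoformat())

def process_delta(fetch_changed_rows, handle_row):
    """fetch_changed_rows(since) should yield rows changed after `since`."""
    since = load_watermark()
    now = datetime.now(timezone.utc)
    for row in fetch_changed_rows(since):
        handle_row(row)   # deal with each change as soon as it is seen
    save_watermark(now)   # the next run starts from here
```

Because each run only touches the changes, failures and invalid data surface close to when they were introduced, while the context is still fresh.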

The other alternative to batches is to optimize the process so that it becomes inexpensive enough to be done continuously or on demand. Typically batches are used to perform operations over significant amounts of data for reports, updates or other processing. If this is the case then perhaps a different data store should be considered that has a lower cost to access, retrieve and update the data. Alternatively, the common scaling solutions of distribution and replication could be considered.
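For example, rather than recomputing a report over the whole data set in a nightly batch, an aggregate can often be maintained incrementally as each record arrives. A minimal Python sketch under that assumption (the `RunningReport` class and the sales example are purely illustrative):

```python
from collections import defaultdict

class RunningReport:
    """Maintain report totals incrementally instead of recomputing them in a batch."""

    def __init__(self):
        self.totals = defaultdict(float)
        self.counts = defaultdict(int)

    def record_sale(self, region, amount):
        # O(1) work per event, independent of the total data size
        self.totals[region] += amount
        self.counts[region] += 1

    def summary(self):
        return {r: {"total": self.totals[r], "count": self.counts[r]}
                for r in self.totals}

report = RunningReport()
report.record_sale("EU", 120.0)
report.record_sale("US", 80.0)
print(report.summary())  # figures are always current; no overnight run needed
```

The cost of keeping the figures up to date is paid a little at a time on each write, so users never have to wait for an overnight or weekend job.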

This article was first noted down on the 3rd of August, 2012.