Editorials

Pre-processing

Pre-processing is a practice that has been utilized as long as I can remember in computer systems. It continues today, but has different flavors. In the old days we used things like Batch Programs, REX, JCL, Shell Scripts, Tcl, Perl, Pipes and Filters, and much more depending on the operating system. We had many utility programs to do searching, filtering and sorting, or even re-formatting data. Without writing a complicated procedural program we used shell type languages to hang together different utilities that would find, filter, sort and shape our data, preparing it for the procedural programs that finally consumed it.

This kind of programming works well in environments where we don’t require real time access to data. Processes can be run overnight and reports provided the following day. All activities are performed based on the last set of reports that become less relevant as the day progresses. There is no big insight in this little history. Most of us know it, even if we didn’t experience it.

In the modern era, some of us have lost sight of the power of pre-processing, with some notable exceptions. Google has made a fortune on pre-processing. The speed their search engine provides is still the gold standard, resulting from modern descendants of the pre-processing mindset. Google has taken the pattern of Map/Reduced and converted it into a profit stream yet to be matched.

This same concept can be applied to your database systems. For example, SSIS is a great tool for doing the same kind of work we used to do with the early scripting languages. With databases there are some pre-processing techniques that can speed up your batch activities that may be required.

  • Use staging tables to reshape data into forms for faster processing
    • De-normalize data
    • Filter data
    • Summarize data
  • Drop or Create indexes, especially on staging tables
  • Create Covering indexes to support specialized queries
  • Shard Data for parallel processing
  • Bulk Import/Export Data for distributed processing

These are all simple tasks. However, I find it amazing the difference one or more of these theoretical pre-processing steps can make when you start digging into the details of what you are trying to accomplish in your batch processing.

Cheers,

Ben