
Statistically Relevant

I can’t count how many data warehouse applications I have built over the years. One common theme has always been: “How do we keep it up to date?” A data warehouse can quickly become a data tenement (not a term I coined) if it contains useless data or data that is no longer statistically current.

The definition of “statistically current” changes based on the purpose of the data. The data only needs to be as current as is necessary to answer strategic questions.

One system I worked on was a just-in-time pull system. Little inventory was kept on site for a manufacturing process. The pull system would be fired by an inventory acquisition event and would automatically forward purchase orders to the appropriate vendor. For parts with a longer lead time, potential orders were also taken into consideration.
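The trigger logic of a pull system like that one can be sketched in a few lines. This is only an illustration, not the actual system: every name, field, and threshold here (the `Part` record, the 14-day cutoff for "longer lead time," the reorder-point test) is a hypothetical stand-in.

```python
from dataclasses import dataclass
from typing import Optional

LONG_LEAD_DAYS = 14  # assumed cutoff for "longer lead time" parts

@dataclass
class Part:
    # All fields are hypothetical illustrations.
    sku: str
    on_hand: int
    reorder_point: int
    order_qty: int
    lead_time_days: int

def on_inventory_event(part: Part, forecast_demand: int) -> Optional[dict]:
    """Fire on each inventory acquisition event; return a purchase order
    when on-hand stock (less forecast demand, for long-lead parts)
    falls below the reorder point."""
    projected = part.on_hand
    if part.lead_time_days >= LONG_LEAD_DAYS:
        # Long-lead parts also count potential orders (forecast demand).
        projected -= forecast_demand
    if projected < part.reorder_point:
        return {"sku": part.sku, "qty": part.order_qty}
    return None
```

The point of the sketch is the asymmetry: short-lead parts react only to actual stock, while long-lead parts fold in forecast demand so the order goes out early enough.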

Even though the system needed to stay on top of the ordering process, it wasn’t statistically necessary to maintain real-time updates from the production and sales systems. Data up to 24 hours old was statistically relevant enough. So warehouse updates were performed overnight, rather than fighting for resources during peak utilization.
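In practice, a nightly refresh like that often amounts to nothing more than a scheduled job run in the off-peak window. The path, script name, and time of day below are hypothetical:

```shell
# Hypothetical crontab entry: run the warehouse ETL at 2:00 AM daily,
# when the production and sales systems are quiet.
0 2 * * * /opt/etl/refresh_warehouse.sh >> /var/log/etl/refresh.log 2>&1
```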

I’m curious how a company such as Amazon establishes its processes. Its business is certainly different in scale. But is its data any more statistically relevant, such that it needs more frequent updates to its warehouse information?

The point isn’t that you can’t build a system with near-real-time awareness. The question is: what would you do with it? Too often, the decision is that you need real-time awareness, but the business can’t act on the contemporary information you spent a fortune to collect.

I’ll bet many of you have war stories about data warehouses gone wrong. Why not share some of your experience here or by email to