Statistics Can Say Whatever You Like…

It’s amazing how often you see that quote, or one very similar, when people are rationalizing the information they need in order to make their case.  “Statistics,” I think, is too specific; it should be “Data Can Be Made to Say Whatever You Like…”

I can’t tell you how many times I’ve presented sample results and reports only to have them sent back with “yeah, that’s really great – and we’re getting closer to what we need.  Can you tweak it to say something more like this…” and then the approach, the data included, and so on are shifted around to present a different picture.

This is a huge ethical dilemma for data folks.  There are so many facets in play, from staying employed to keeping an eye on the ethical goal posts.  It’s a very fine line to walk.

In comments to the public data set post, Tim Derrick mentions something very much along these same lines:

“Data quality is also very important – we have seen people using correlations to fill in gaps in the input data, and then use that filled-in data to drive correlation-based learning, (basically they were just building their assumptions into the data), and then the machine learning gave results that matched their assumptions, so it all looked good statistically, but the results were not representative of reality. It is not as simple as it seems.”  (Emphasis mine)
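
To make the circularity Tim describes concrete, here is a minimal sketch on synthetic data. Everything in it is an assumption made for illustration, not anything from his comment: the two variables are generated independently, 60% of one is treated as missing, and the gaps are filled with an assumed relationship (y equals x). The helper name `pearson` is hypothetical as well.

```python
import random

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(xs, ys))
    var_x = sum((a - mean_x) ** 2 for a in xs)
    var_y = sum((b - mean_y) ** 2 for b in ys)
    return cov / (var_x * var_y) ** 0.5

rng = random.Random(0)  # fixed seed so the run is repeatable
n = 1000

# Assumption: in "reality" x and y are unrelated (independent noise).
x = [rng.gauss(0, 1) for _ in range(n)]
y_true = [rng.gauss(0, 1) for _ in range(n)]

# Assumption: about 60% of the y values are missing from the input data.
missing = [rng.random() < 0.6 for _ in range(n)]

# Fill each gap with the analyst's assumed relationship (y = x).
y_filled = [xi if gap else yi for xi, yi, gap in zip(x, y_true, missing)]

r_true = pearson(x, y_true)      # correlation in the real data
r_filled = pearson(x, y_filled)  # correlation after gap filling

print(f"correlation in the real data:   {r_true:.2f}")
print(f"correlation in the filled data: {r_filled:.2f}")
```

The real data show essentially no relationship, while the filled-in data show a strong one. Any model trained on the filled data would "discover" the relationship that was used to fill the gaps, which is exactly the assumption being built into the data.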

I think it’s safe to say that we’ll see this more and more.  And, if it’s not about changing the data or excluding less tasty items to make a specific point, it could well be about including additional complementary data sets, things that add further meat to the hypothesis we do want to prove or support.  I’d love to hear some examples of this (no specifics, and change names, blah, blah, blah).  Perhaps more importantly, what do you say or do in response to this type of request?

Perhaps this should be considered for addition to the Code of Ethics?  Something like: data is provided and presented in the truest light and most transparent way possible.  The caped crusader in me thinks that’s a good idea.  The realist thinks that what’s “right” in data, and presenting it that way as an ethical goal, is extremely tough to define and measure.

What types of things are you seeing?  What have you seen work, or go awry?