Editorials, Ethics

Statistics Can Say Whatever You Like…

It’s amazing how many times you see that quote – or one very similar – as people rationalize the information they need access to in order to make their case. “Statistics,” I think, is too specific – it should really be “Data Can Be Made to Say Whatever You Like…”

I can’t tell you how many times I’ve presented sample results and reporting, only to be sent back because “yeah, that’s really great – and we’re getting closer to what we need. Can you tweak it to say something more like this…” – and then approaches, included data, and so on are shifted around to present a different picture.

This is a huge ethical dilemma for data folks. There are so many different facets in play on something like this – from remaining employed to honoring ethical goal posts. It’s a very fine line to walk.

In comments to the public data set post, Tim Derrick mentions something very much along these same lines:

“Data quality is also very important – we have seen people using correlations to fill in gaps in the input data, and then use that filled-in data to drive correlation-based learning (basically they were just building their assumptions into the data), and then the machine learning gave results that matched their assumptions, so it all looked good statistically, but the results were not representative of reality. It is not as simple as it seems.” (Emphasis mine)
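The trap Tim describes – imputing missing values from a correlation, then “discovering” that same correlation in the filled-in data – is easy to reproduce. Here’s a minimal sketch with made-up toy numbers (the data and variable names are purely illustrative, not from any real data set):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: x and y are only weakly related in reality.
n = 1000
x = rng.normal(size=n)
y = 0.2 * x + rng.normal(size=n)

# Suppose half of y is missing, and we fill the gaps using the
# x-y relationship fitted on the rows we do have.
missing = rng.random(n) < 0.5
coef = np.polyfit(x[~missing], y[~missing], 1)     # slope + intercept
y_filled = np.where(missing, np.polyval(coef, x), y)

true_r = np.corrcoef(x, y)[0, 1]
filled_r = np.corrcoef(x, y_filled)[0, 1]
print(f"correlation in the real data:      {true_r:.2f}")
print(f"correlation after filling gaps:    {filled_r:.2f}")
```

Every imputed row sits exactly on the fitted line, so the correlation measured on the filled-in data comes out stronger than the one actually present in the real data. The assumption was baked in and came back out looking like signal – which is exactly why it “all looked good statistically.”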

I think it’s safe to say that we’ll see this more and more. And if it’s not about changing the data or excluding the less tasty items to make a specific point, it could well be about including additional complementary data sets – things that add further meat to the hypothesis we want to prove or support. I’d love to hear some examples of this (no specifics, change the names, blah, blah, blah) and, perhaps more importantly, what do you say or do in response to this type of request?

Perhaps this should be considered for addition to the Code of Ethics: data is provided and presented in the truest light and most transparent way possible. The caped crusader in me thinks that’s a good idea. The realist knows that what’s “right” in data, as an ethical goal, is extremely tough to define and measure.

What types of things are you seeing?  What have you seen work, or go awry?

  • NameThatTune

I remember years ago Mike Ditka was being asked questions by the local media about some decisions that were made during a game.

The reporter said something like, “Statistically, wouldn’t it have made sense to…?”

    Ditka’s response was “statistics are for idiots”.

I used to think that was such a stupid thing to say, but in retrospect there is a lot of wisdom in that thought. I know it was in the context of football, where there are so many intangibles at play, but your comments are consistent with how humans use statistics: they are used to convey whatever it is you wish to convey.

Statistics can produce alternative facts. When you can’t agree on the facts, you have no basis for a rational decision. And when you make a decision based on alternative facts, you can tweak the statistics to tell whatever story you want, since there are so many possibilities.

    AI is fraught with peril since it depends on the data it consumes. You are what you eat…

  • Tom Gueth

    There is an old saying:

    Figures never lie, but liars always figure.

And I love the Ditka quote. Statistics are not “facts”; they are the “possibility” of something happening. While statistics can lead to facts, they are not facts in themselves.

Basically, I agree that this is an issue of ethics, but having some kind of code of ethics doesn’t solve the problem. The problem is that most people, once the “results” are displayed, assume the person presenting the data is ethical. Sadly, this is often not true. I have experienced it so many times: people lying with data, whether real or made up.

In science, a “result” must be repeatable – if it’s not repeatable, it’s not correct. Today, most people (especially the media) simply take results as facts. The only solution is to ask for the data and its sources and have it analyzed by others before allowing results to be presented as facts. But today, the attitude is that it’s better to be first than correct. I listen to the idiocy presented today as “science” and “facts” and know that much of it is incorrect. A good example: in the 1980s, eggs were bad for us; now they are good for us.

    Sadly, I don’t see this getting better.

    Tom

    • NameThatTune

      I read something recently that suggested a person is less likely to believe climate science if they identify as Conservative and have a higher education.

      The opposite is true if the person is a Liberal with a higher education.

      I don’t mean to inject politics as that is not my intent.

      I just find it interesting that there can be such a disconnect with how people interpret facts based on their ideological leanings.

      It’s further evidence that as humans, we are not capable of objectively interpreting facts. It’s a fatal flaw in our DNA.

  • John Shadows

When you add the concepts and variables of sample size, sample bias, and outlier exclusion, you have a recipe for lying and fraud.
I’m a fan of “drill down to the raw data and show me the roll-up” reporting. The statistics around many of our hot-button political and social topics don’t seem to offer raw data that folks can drill down into to see things like exclusion criteria, so you have no idea what the report – much less the A.I. algorithms – left out.