Wednesday, July 27, 2011

The Abuse of Statistics

I've been thinking about the abuse of statistics. I'd like to make a point about it, which will start off seeming contrived, but will hopefully end up being somewhat informative, so here goes. I'll avoid doing any complicated maths, and my numbers are approximate.

I've walked across a train crossing near my house around 500 times, so I must be lucky to be alive.

A train weighs a couple of hundred tonnes, so if I was hit by one, I would surely die.

They each take about 20 seconds to pass by my local crossing and a couple of hundred go through there each day. That means that for over an hour per day, or about 1 in every 20 seconds, there's a train passing through that crossing.
With 500 crossings, each taking several seconds, the odds are almost 100% that I would have been hit by now. And I walked through there today without even looking up. Am I brave, crazy or what?
The answers seem obvious with this example:
  1. Trains make a lot of noise, which tells me when to stop.
  2. There are gates that physically stop me walking across when it's dangerous. 
  3. The risk is mitigated by the train driver's ability to slow the train.
  4. Walking into the side of a train would be unlikely to kill me, and it's usually the side of the train that's passing by.
  5. Those 1 in 20 seconds occur in clusters, rather than randomly. It's not as if one second of train turns up in every 20-second window.
The list goes on.
This example seems ridiculous, but there are traces of it wherever you see statistics used by the media.
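
For what it's worth, here's a rough sketch of the naive sum behind my "almost 100%" claim. It uses the approximate numbers from above and deliberately makes the same mistake: it treats every crossing as an independent, instantaneous event at a random moment of the day. The names are just ones I've made up for illustration.

```python
# Naive (and wrong) model: each crossing is an independent draw at a random
# moment of the day, with no gates, no noise and no clustering of trains.
TRAINS_PER_DAY = 200       # "a couple of hundred" per day
SECONDS_PER_TRAIN = 20     # time for one train to pass the crossing
CROSSINGS = 500            # times I've walked across
SECONDS_PER_DAY = 24 * 60 * 60


def fraction_of_day_occupied(trains_per_day, seconds_per_train):
    """Share of the day during which a train is on the crossing."""
    return trains_per_day * seconds_per_train / SECONDS_PER_DAY


def naive_hit_probability(p_per_crossing, crossings):
    """Chance of at least one 'hit' if every crossing were independent."""
    return 1 - (1 - p_per_crossing) ** crossings


p = fraction_of_day_occupied(TRAINS_PER_DAY, SECONDS_PER_TRAIN)
naive = naive_hit_probability(p, CROSSINGS)

print(f"Train on the crossing for {p:.1%} of the day")
print(f"Naive chance of being hit at least once: {naive:.6f}")
```

The arithmetic is fine. It's the independence assumption that's rubbish, which is exactly what the list above is getting at.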

Examples of stats abuse

A more illuminating example:

A man is charged with murder primarily because his DNA is found with the corpse. The DNA is matched so that only 1 in a million people would have the same markers. Damning evidence, no?
No. If the murder were committed in Melbourne, then there would be about 3 other people within driving distance who would also match the evidence, which is far more than enough for reasonable doubt.
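
The sum behind that is short enough to write down. This is only a rough sketch: the population figure of about 4 million for Melbourne is my assumption, the function names are made up for illustration, and it pretends the DNA match is the only evidence and that everyone in the city is an equally likely suspect.

```python
MATCH_RATE = 1 / 1_000_000   # "1 in a million people" share the markers
POPULATION = 4_000_000       # assumed rough population of Melbourne


def expected_matches(population, match_rate):
    """How many people in the population we'd expect to match the sample."""
    return population * match_rate


matches = expected_matches(POPULATION, MATCH_RATE)
others = matches - 1              # matching people besides the accused
share_for_accused = 1 / matches   # if the match were the ONLY evidence

print(f"Expected matching people: {matches:.0f}")
print(f"Others besides the accused: {others:.0f}")
print(f"Chance it was the accused, on DNA alone: {share_for_accused:.0%}")
```

On those assumptions the DNA by itself puts the accused at roughly 1 in 4, not 999,999 in a million.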
And an illustration from the current media:

There are hundreds of millions of kids with internet access, and they would each visit, let's say 1,000 pages per year. That would make it seem quite likely that some kids might stumble upon the 10,000 or so urls that the government desperately wants blocked, from among the trillions on the net.
The abuse of stats here is that surfing is not undirected. Most kids around the world would never venture outside of maybe a million pages in total, accounting for no more than maybe a billion urls, between them all. The overall likelihood of any kid stumbling across potentially-damaging material is almost nil.
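
Here's a rough sketch of the two ways of doing that sum. The specific figures are placeholders I've picked to stand in for "hundreds of millions" and "trillions", and the zero overlap between the blocked list and the pages kids actually visit is an assumption, not a measurement.

```python
KIDS_ONLINE = 200_000_000          # placeholder for "hundreds of millions"
PAGES_PER_KID_PER_YEAR = 1_000     # "let's say 1,000 pages per year"
BLOCKED_URLS = 10_000              # "10,000 or so urls"
TOTAL_URLS = 1_000_000_000_000     # placeholder for "trillions"
VISITED_POOL = 1_000_000_000       # "maybe a billion urls, between them all"


def expected_hits(kids, pages_each, blocked_in_pool, pool_size):
    """Expected yearly visits to blocked urls if kids picked pages
    uniformly at random from a pool of pool_size urls."""
    return kids * pages_each * blocked_in_pool / pool_size


# Scare version: every page view is a random draw from the whole net.
naive = expected_hits(KIDS_ONLINE, PAGES_PER_KID_PER_YEAR,
                      BLOCKED_URLS, TOTAL_URLS)

# Directed version: kids only draw from the pool they actually visit,
# and (assumption) none of the blocked urls are in that pool.
directed = expected_hits(KIDS_ONLINE, PAGES_PER_KID_PER_YEAR,
                         0, VISITED_POOL)

print(f"Undirected surfing: about {naive:.0f} accidental visits a year")
print(f"Directed surfing:   about {directed:.0f} accidental visits a year")
```

The number that really matters is how many blocked urls sit inside the pool kids actually visit, and the scare version never asks that question.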
And something we've probably all seen personally:

Geez, that guy's a bad driver. I bet he's Asian.
Never mind the fact that you're in an area that is predominantly Asian, so all the good drivers are also Asian. This kind of personal, non-recorded, observation-based prejudice is also affected by the fact that people only tend to remember the times that they are correct and discount the times they are incorrect. This also leads into a discussion of correlation v causation.

Correlation v Causation

In most examples of stats abuse, there's usually an element of people confusing correlation and causation:
Causation means that one thing is the actual reason, perhaps indirect, for another thing. For example, I'm pretty certain that these letters have appeared in this blog post because I pressed the right sequence of buttons on my keyboard. And that the buttons on my keyboard were pressed because my fingers were moving on them. With causation, you can always look for a deeper level of causation, which is the premise of the toddlers' recursive game "Why?" which could be reworded as "What causes that to happen?"
Correlation, however, merely shows that two things go together more or less often than would be expected if the two things were random. For example, liquids in bottles tend to be fizzier than liquids elsewhere. It would be a mistake to say that the liquid being in the bottle causes the liquid to be fizzy.
Correlations can happen because the relationship is causative, and causation will definitely involve a correlation, but most correlations do not involve causation.
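
The bottle example can be turned into a little simulation. This is a toy sketch with made-up probabilities: the hidden cause is whether the liquid is a soft drink, and that decides both how likely it is to be bottled and how likely it is to be fizzy. The bottle itself never affects the fizz.

```python
import random


def simulate(n=10_000, seed=0):
    """Return (soft_drink, bottled, fizzy) rows. All probabilities are
    invented for illustration; only soft_drink influences fizziness."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        soft_drink = rng.random() < 0.5
        bottled = rng.random() < (0.9 if soft_drink else 0.2)
        fizzy = rng.random() < (0.95 if soft_drink else 0.05)
        rows.append((soft_drink, bottled, fizzy))
    return rows


def fizzy_rate(rows):
    """Share of rows that are fizzy."""
    return sum(fizzy for _, _, fizzy in rows) / len(rows)


rows = simulate()
in_bottles = [r for r in rows if r[1]]
elsewhere = [r for r in rows if not r[1]]
print(f"Fizzy in bottles: {fizzy_rate(in_bottles):.2f}")
print(f"Fizzy elsewhere:  {fizzy_rate(elsewhere):.2f}")

# Hold the real cause fixed and the bottle effect disappears.
soft_bottled = [r for r in in_bottles if r[0]]
soft_elsewhere = [r for r in elsewhere if r[0]]
print(f"Soft drinks in bottles: {fizzy_rate(soft_bottled):.2f}")
print(f"Soft drinks elsewhere:  {fizzy_rate(soft_elsewhere):.2f}")
```

Bottled liquids come out much fizzier overall, but once you compare soft drinks with soft drinks the bottle makes no real difference.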

Pirates v Global Warming

Pastafarians have highlighted this distinction by pointing to a negative correlation between piracy of the rape-and-pillage kind, and global warming, satirically claiming that piracy was what was keeping global warming in check. Well, there's been quite a lot of that kind of piracy lately, so it's evident that the correlation is not causative.
It's in this realm that most of the current argument about global warming has been happening:
  1. Man pollutes.
  2. Lots of other things happen.
  3. The Earth warms.
Has mankind caused the warming?

I haven't personally looked at the data, and I don't personally know any of the scientists, so I can't answer that question strongly, though the experts who take the measurements seem to be agreed, which has weight. The hard part of their work is that there is an enormous number of variables to try to discount. Is it solar flares that cause global warming? Or krill populations? Or volcanoes? These are the questions that have been worked on, and the answers give climate scientists a high level of confidence that it is in fact man that has caused at least some of the warming that has taken place over the last century or so.

Pirates v Creative Folk

By 'pirates' here, I'm obviously talking about the data-devouring kind.
The movie and music companies of the world see one act, the copying of information, and they come to the following conclusion:
  1. Movies and music cost money in stores
  2. That money goes "to the artists", i.e. to the labels and publishers
  3. People who copy don't pay
  4. Therefore, money is being lost by the artists.
It seems like a logical, causative sequence, until you examine it in detail. If someone copies something:
  1. It doesn't mean they would have ever paid for it.
  2. It doesn't mean they won't pay for it.
  3. They might encourage someone else to pay for it.
  4. They might buy ancillary goods and/or services that support the artist.
  5. Taking any action against customers or potential customers can result in even fewer purchases.
etc.
Even the former head of EMI recently recognised that pirates are actually their best customers.

Leading questions

Another area where stats are abused is in the conclusions drawn from polls, where either the questions are ambiguous or leading or the options are incomplete or leading.

An example of an ambiguous question with leading options comes from the Herald Sun, which asks "Have beauty pageants gone too far?" The only options are yes or no, and it's on an article about a 6-year-old girl being in pageants. With no options like "It depends on the circumstances", who is going to choose anything but yes?

Or from The Age: Should the government adopt all the Ombudsman's recommendations into the chaplaincy program? Someone who thinks that 7 out of 8 recommendations should be adopted will be put in the same category as someone who would reject all recommendations. That makes it a pretty useless question. It's like playing Guess Who? and asking "Is it a character from Guess Who?" You can get a correct answer, but it won't be enlightening.

Or from the Daily Mail: "Do we really need another range of clothes designed by a celebrity?" where the options are "Yes, I'd buy it" and "No, it's colourful tat". Where are the "Yes, but I'd never buy it", "Whatever the market will accept" or "No, it's nice, but I can't imagine many people finding value in this over the existing brands" options?

Pretty much any media poll is defective, with the results being useless for accurate reporting of opinions. And even if the question and answers are well-crafted, all you can say is that "X% of people who read article Y believe Z, though we don't know who they are or why they voted, or whether there was any external attempt to herd people to vote a particular way."

Not very enlightening.

That's about all I can be bothered writing about for now.

What are your most hated abuses of statistics or misattributions of correlation as causation?

One of mine would have to be that colds are caused by cold weather.
