How Trump screwed the (data) model…

Written by Imraan Bacus

The day the UK voted for Brexit, I tweeted that Trump would win the US elections. I didn’t have a fancy model or data at the time – just a sense that the world was going mad and that Trump winning fitted nicely into that playbook. A week before the elections, CNN Money’s headline read as follows (in reference to a Moody’s Analytics prediction):

“A model that has correctly predicted the winner of every U.S presidential race since Ronald Reagan in 1980 is forecasting a big victory for Hillary Clinton.”

Then things got real and the models bombed. Some of my colleagues even changed their LinkedIn job titles from “Data Scientist” to “Fortune-telling Gypsy” (to salvage some self-respect). So much for big data and fancy analytics. Or maybe I’m being unnecessarily harsh. As Anthony Goldbloom (Kaggle.com) says, “Statistics and data science gets more credit than it deserves when it’s correct – and more blame than it deserves when it’s incorrect.”

So what went wrong? How did so many clever people, churning through tons of poll and survey data, arrive at a conclusion that just didn’t play out in the real world? And equally importantly, what did Allan Lichtman, the LA Times poll and Investor’s Business Daily do right to correctly predict that Orange would be the new Black?

1. All models are not created equal

Models don’t often have widespread public application, but when they do (like in elections) people tend to view them as ‘black or white’ predictors of an outcome, while the truth is closer to ‘50 Shades of Grey’. In its simplest form, a model predicts the probability of an outcome – and if it gives that outcome a 70% chance, it’s also telling us that there’s a 30% chance of it not happening.
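To make the 70/30 point concrete, here is a minimal sketch that replays the same "70% favourite" forecast many times and counts how often the favourite still loses. The function name, the number of simulated elections and the seed are my own illustrative choices, not anything taken from the models mentioned in this post.

```python
import random

def simulate_upsets(p_win=0.7, n_elections=10_000, seed=42):
    """Return the share of simulated elections the favourite loses.

    p_win is the probability the (hypothetical) model gives the favourite.
    """
    rng = random.Random(seed)
    # The favourite loses whenever the random draw falls outside p_win.
    upsets = sum(1 for _ in range(n_elections) if rng.random() >= p_win)
    return upsets / n_elections

upset_rate = simulate_upsets()
print(f"Favourite lost in {upset_rate:.1%} of simulated elections")
```

Run it and the favourite loses roughly three times in ten. A model that says 70% and then sees the underdog win isn't necessarily broken; that result was always on the menu.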

We’ve all heard the saying, “what you put in is what you get out”, and a model is only as good as the data it was derived from. “Since 1980” in CNN Money’s headline might fool you into thinking the model is built on 36 years of data, but when you consider that elections only occur every four years, a model built on 9 data points suddenly starts to sound pretty dodgy. Allan Lichtman’s model, on the other hand, is built using every single election outcome since 1860. That’s a lot more data, and ostensibly a better ability to predict. Starting to get my drift…?

2. Be careful how you collect your data

Pre-election polls were a key ingredient in these models, and voting day surfaced some really massive deviations between prediction and observation. Of the 93 polls conducted in the 30 days preceding the elections, only 10 gave Trump the lead. Nine of these 10 were from the LA Times.

So what did the LA Times do differently, you ask? The answer – online polling. Traditional polling relies on phone calls, and mostly landline calls at that. The LA Times ventured that online polling would produce much more honest responses, with less chance of bias. Investor’s Business Daily, who also called the result correctly, used traditional phone polling, but they ensured there was a spread of cellphones and landlines, and that the cellphone samples were representative across the spectrum, thereby also reducing bias.

This was further corroborated by post-election analysis showing that the higher the proportion of lower-educated white people in a particular state, the larger the polling gap in favour of Trump, indicating that this group was under-represented in the samples. One only has to look at how the 18-25 year olds voted to understand the importance of representative sampling – lean too much into one sub-group and you get a completely skewed perspective.

Geographic distribution of voting results for voters aged 18 – 25 years
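Here is a rough sketch of how an unrepresentative sample can flip a poll, and how reweighting each group back to its true share of the population (often called post-stratification) corrects it. Every number and group name below is made up purely for illustration; none of it is real polling data, and this is not a description of how any of the pollsters above actually worked.

```python
# Hypothetical figures, for illustration only (not real polling data).
population_share = {"group_a": 0.40, "group_b": 0.60}  # true share of voters
sample_share = {"group_a": 0.20, "group_b": 0.80}      # share of poll respondents
support = {"group_a": 0.70, "group_b": 0.40}           # candidate support per group

def raw_estimate(sample_share, support):
    """Support as the poll reports it, with group_a under-sampled."""
    return sum(sample_share[g] * support[g] for g in support)

def weighted_estimate(population_share, sample_share, support):
    """Reweight each group so it counts in proportion to the population."""
    weights = {g: population_share[g] / sample_share[g] for g in support}
    return sum(sample_share[g] * weights[g] * support[g] for g in support)

raw = raw_estimate(sample_share, support)
weighted = weighted_estimate(population_share, sample_share, support)
print(f"Raw poll: {raw:.0%}, reweighted: {weighted:.0%}")
```

With these invented numbers the raw poll has the candidate losing, while the reweighted one has them winning. Same respondents, same answers; the only thing that changed is how much each group counts.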

The bottom line here is that if your sample sounds like “50% of the mice liked cheddar cheese, and the other mouse ran away”, you’re heading for trouble…
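The two-mouse problem can be put into numbers with the textbook margin-of-error formula for a proportion, z * sqrt(p * (1 - p) / n). The sketch below assumes a simple random sample and a 95% confidence level (z = 1.96); the normal approximation behind the formula is really not meant for samples as tiny as two, so treat the small-n figure as a warning sign rather than a precise value.

```python
import math

def margin_of_error(p, n, z=1.96):
    """Approximate 95% margin of error for a sample proportion p with n responses."""
    return z * math.sqrt(p * (1 - p) / n)

# Same 50% result, very different levels of trust depending on sample size.
for n in (2, 100, 1000):
    moe = margin_of_error(0.5, n)
    print(f"n = {n:>4}: 50% plus or minus {moe:.1%}")
```

With two mice, the plausible range covers almost everything. It takes around a thousand respondents before "50%" starts to mean something close to 50%.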

And so there you have it. Data science and analytics are powerful, but as with all things powerful, there is responsibility. In this case, the responsibility to build your models using accurate, unbiased and representative data. Do this, and I’m pretty sure the odds will forever be in your favour…