Big Data and Polling: Our First-Ever Attempt at Large-Scale Political Forecasting

(Editor’s Note: If you want to skip right to our forecasts, they’re at the end. But we highly recommend that you read the preamble for background and context.)

We started CivicScience five years ago to develop new ways to measure public opinion at a time when traditional methods of polling were becoming more and more difficult to sustain. Landline phone ownership continues to decline, and fewer people have the time or inclination to respond to lengthy surveys, which means there is little remaining randomness in who participates. Meanwhile, pervasive advertising, biased 24-hour news cycles, and group-think social media can cause public sentiment to shift, sometimes daily. The longstanding model of calling people on the telephone is broken and not likely to get any better.

We were far from the first people to see this coming. The advent of Robo-calling aimed to reduce the cost of polling so that more people could be called more often. Larger firms began augmenting their data with cell-phone respondents to put expensive duct tape on a romantic but fleeting obsession with random probability. Pioneers like Doug Rivers at YouGov introduced new models derived from captive “panelists,” recruited to answer online surveys in return for rewards. But the costs associated with panelist incentives, combined with biases among the people who have the time and inclination to join panels (Do you belong to one?), put a potential ceiling on this creative method.

Now, we are seeing a new frontier in opinion research, spearheaded by rock stars like Nate Silver and websites like RealClearPolitics.com. These innovators surmise that inherent flaws in traditional polling can be normalized by combining all the results published by reputable firms and producing an average of some sort. This is the first time we see the “Law of Big Data,” which suggests that more data can outperform clever algorithms, applied to opinion research. The problem with poll average models, however, is that we are merely aggregating disparate, small samples of non-representative responses, combined with precarious reweighting techniques. Just because these polls are all mashed together does not mean that Silver and others are solving this underlying problem: As phone-based polling becomes less reliable, so too will the resulting averages and dependent forecasts.

The CivicScience approach, while in some ways radically different, is the next stage in an evolution that moved from Gallup to PPP to YouGov to Nate Silver. Like Gallup, we believe in the fundamental premises of science and we work to achieve as much engagement, randomness, and representativeness as we can muster. Like PPP, arguably the leading Robo-calling firm in the country, we believe in speed, near-constant measurement, and reducing fixed costs for research. Like YouGov and Doug Rivers, we believe that the web represents a better way to engage more people than by calling them on the telephone. And, like Nate Silver, we believe that more data is better and that by applying advanced techniques in data mining, we can find signals and correlations that might otherwise be overlooked by the naked, politically-contaminated eye.

But we take everything a step further. By polling millions of people every week, meticulously organizing the data we collect, and automating the way those data are analyzed, we aspire to be the first true “Big Data” polling firm. Consider some of these numbers for context:

– In the past two years, we have collected over 191,000,000 poll responses from over 13,800,000 unique respondents, segmented by demographics, geography, and consumer behavior.

– Since January of 2012, we have collected over 16,300,000 responses to a total of 337 different poll questions related to campaigns, politics, policy, and ideology, including:

2,240,018 observations on voters’ reaction to specific negative campaign claims
619,539 observations on voters’ exposure to specific campaign ads
581,287 observations on President Obama’s approval rating
516,333 observations on who voters predicted would win the Presidential election in their state
311,765 observations on voters’ intended choice in the Presidential election
287,611 observations on intended choice in the Republican Presidential Primary between February 8th and April 30th
202,362 observations on who won each of the three Presidential debates and the one VP debate, all collected within 24 hours of each debate
Over 100,000 observations each on sentiment toward policy issues like energy, consumer privacy, Voter ID regulations, charter schools, health care, government spending, public education, illegal immigration, and dozens more.
Over 58,000 observations each on media behaviors, including how much people like Jon Stewart, Glenn Beck, and Donald Trump, and which TV networks and newspapers they prefer
Over 10,000 observations on key statewide races ranging from the US Senate elections in Missouri and Virginia to the Auditor General’s race in Pennsylvania.
Yesterday, we asked 85,798 people how likely they were to vote today. As of 6pm today, we asked 49,245 people if they have voted yet.

The real advantage of our “Big Data” approach is our ability to cross-tabulate any question we ask against thousands of other things we know about the various people we have polled over time. This kind of data mining yields interesting insights like the most predictive profile traits of an Obama voter, a Romney voter, or an Undecided. For clients, we have uncovered powerful correlations in the media habits of Coal Energy “persuadables” and the public transportation habits of Sugary Drink Ban supporters in New York City, for example.
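For readers who want to see what a cross-tabulation looks like in practice, here is a minimal sketch. It is not our production pipeline: the respondent records, the field names (vote_intent, watches_jon_stewart), and the helper functions are all hypothetical, invented purely to illustrate the idea of counting one answer against another.

```python
from collections import Counter, defaultdict

# Hypothetical respondent records; field names and values are invented for illustration.
respondents = [
    {"vote_intent": "Obama", "watches_jon_stewart": "yes"},
    {"vote_intent": "Obama", "watches_jon_stewart": "yes"},
    {"vote_intent": "Obama", "watches_jon_stewart": "no"},
    {"vote_intent": "Romney", "watches_jon_stewart": "no"},
    {"vote_intent": "Romney", "watches_jon_stewart": "no"},
    {"vote_intent": "Romney", "watches_jon_stewart": "yes"},
    {"vote_intent": "Undecided", "watches_jon_stewart": "no"},
    {"vote_intent": "Undecided", "watches_jon_stewart": "yes"},
]


def crosstab(rows, row_key, col_key):
    """Count respondents for every (row_key value, col_key value) pair."""
    table = defaultdict(Counter)
    for record in rows:
        table[record[row_key]][record[col_key]] += 1
    return table


def row_shares(table):
    """Convert raw counts into within-row proportions."""
    shares = {}
    for row, cols in table.items():
        total = sum(cols.values())
        shares[row] = {col: count / total for col, count in cols.items()}
    return shares


counts = crosstab(respondents, "vote_intent", "watches_jon_stewart")
shares = row_shares(counts)
for intent, cols in shares.items():
    print(intent, {answer: round(share, 2) for answer, share in cols.items()})
```

With real data, the same counting would be repeated across thousands of profile traits, and the traits whose row shares differ most between groups are the candidates for "most predictive."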

Tomorrow, we will begin the laborious process of analyzing those 16+ million political poll responses and 190+ million consumer observations against the real results we see across the country tonight. We will ask 100,000 people we have previously polled whether they voted and who they voted for. By crossing that with their past intentions, we can study people who said they planned to vote for one candidate (or not vote at all) and then did something else. Hopefully, we can analyze this “reporting bias” to better predict in the future who will actually vote and why. We will look at the states and districts where the conventional polling wisdom (and ours) was wrong and see what nuances can be found in our data that may have predicted those errors. We will report back from time to time.
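To make the "crossing" step concrete, here is a rough sketch of one way it could be done. It assumes each follow-up record pairs a stated intention with a reported behavior; the data, the "No vote" label, and the switch_rates helper are hypothetical, and this is an illustration rather than the analysis we will actually run.

```python
from collections import Counter

# Hypothetical follow-up records: (stated intention before the election,
# reported behavior after it). "No vote" is an assumed label used on both sides.
followups = [
    ("Obama", "Obama"),
    ("Obama", "No vote"),
    ("Romney", "Romney"),
    ("Romney", "Romney"),
    ("Undecided", "Obama"),
    ("No vote", "Romney"),
]


def switch_rates(pairs):
    """For each stated intention, return the share of respondents
    whose reported behavior differed from that intention."""
    totals = Counter()
    switched = Counter()
    for intended, actual in pairs:
        totals[intended] += 1
        if intended != actual:
            switched[intended] += 1
    return {intent: switched[intent] / totals[intent] for intent in totals}


rates = switch_rates(followups)
for intent, rate in rates.items():
    print(f"{intent}: {rate:.0%} did something other than what they said")
```

Groups with high switch rates are where stated intentions are least trustworthy, which is the kind of signal we hope to feed back into future turnout predictions.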

With all that said, we would be cowards if we didn’t share our latest forecasts before the results begin rolling in tonight. We do this with a few caveats: 1) It’s the first time we have ever attempted any election forecasting at this scale, so our models are based on theory, not practice and precedent. 2) Our raw sample is not without its own biases (though they’re better than most). Any website can choose to embed our polling application in its content, which means we cannot always control the raw composition of our respondents from one day to the next. Fortunately, given our large numbers, we can use quotas and reliable weighting without much error. 3) There are a couple of states where our numbers are possibly too small to make a confident prediction, but we try anyway. To our knowledge, nobody has done statewide polling in all 50 states in the past month, so give us a break.

Our data models would suggest an Electoral College haul of 290 for Barack Obama to 248 for Mitt Romney. And here are the state-by-state predictions for the Presidential race and US Senate races:

We don’t have the time or energy right now to delve into our weighting models but we will share all of that in the next few days when we can evaluate where things went right and wrong.

Have fun tonight.
