When the voting booths closed in France last Sunday, perhaps the most surprising result was that the opinion polls had for once been right. Much maligned since Brexit and Trump, the pollsters this time correctly predicted the winners, Emmanuel Macron and Marine Le Pen.


Alternative predictions using big data fared less well in France (article in English). Of the ones we know of, none predicted both winners correctly. The only big data forecast to correctly predict that Macron would finish first was the one provided by Echobox, which is based on over two billion data points from our popular French Election Tracker.

Weeks before the picture in the polls changed, we also correctly predicted that Jean-Luc Mélenchon was headed for a stronger-than-expected result. We further predicted, correctly as it turned out, that Marine Le Pen’s support was weakening after the TV debate, leaving her with little chance of beating Macron. When all the votes were counted, Le Pen’s share was significantly lower than the polls had indicated a week before the election, and she trailed Macron by almost as much as her father, Jean-Marie Le Pen, trailed Jacques Chirac in 2002. We therefore remain confident in our prediction that Marine Le Pen has only the slimmest chance of becoming Madame la Présidente.

When we published our predictions, we were open about their limitations. In particular, we discussed at length the factors that might lead us to underestimate Marine Le Pen’s and Jean-Luc Mélenchon’s chances while overestimating François Fillon’s. We tried to compensate for this in our prediction, but still overestimated François Fillon’s chances. In the end, despite Marine Le Pen’s decline from her early April polling highs, Fillon did not quite advance to the final round, in part because of a strong showing for right-wing candidate Nicolas Dupont-Aignan.

This does not change our belief that big data is the future of opinion research. But to improve our methods, we must identify and be transparent about the challenges we face, both in data quality and in the methods we use to interpret the data. Big data may eventually render polls obsolete, but today it is as imperfect as sample-based surveys.

In this article, we look back at our first-round predictions to inform and improve our approach to predicting the outcome of the final round.

Macron or Le Pen: Who will be next in the Élysée Palace?

Lesson 1: Transparency is key

Both on the French Election Tracker website and in our predictions, we have been clear about the differences between our tracker and traditional polls. We pointed out that our tracker is still experimental, unlike opinion surveys, which have been conducted and refined for almost a century. We even included an entire section entitled ‘We could be wrong’ when we set out our predictions, explaining the limitations of our data.

This reflects a conscious decision made before we first published our tracker. Good scientists are open and honest about the limitations their research faces, and at Echobox we aspire to live up to standards as high as those of the cutting-edge academic research that powers our algorithms.

For this reason, we were transparent about our methodology, unlike other big data providers that opted for what Jeremie Mani called a “black box” approach and kept their methodology secret. We continue to believe that transparency is a must for the type of analysis we offer on Echobox Resources.

Lesson 2: Beware of negative attention

We specifically warned that a high score in the French Election Tracker (FET) can reflect both “positive and negative attention,” making predictions inherently difficult. This was a challenge when predicting François Fillon’s vote share, because so much of the media interest in him was driven by scandals. As we pointed out:

On the surface, it appears that Fillon attracts much more interest than Macron […]. However, the FET shows clearly that Fillon’s performance was driven by extreme spikes in attention on days when major scandals broke.

We tried to correct for this by filtering out these negative attention spikes, which led to the prediction that Fillon would come in narrowly ahead of Le Pen but clearly behind Macron. We warned that we attached a higher degree of uncertainty to this prediction. Unfortunately, it turned out that we did not manage to fully account for the distortion caused by Fillon’s scandal-driven coverage.
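To make the idea of filtering negative attention spikes concrete, here is a minimal sketch in Python. It is not the method used for the tracker, which this article does not specify. It assumes a list of daily attention scores and a hand-labelled set of days on which a scandal broke; the function name, the median-based cutoff and the threshold of 3 are all hypothetical choices made for illustration.

```python
from statistics import median


def dampen_negative_spikes(daily_attention, negative_days, threshold=3.0):
    """Replace scandal-driven spikes with the typical (median) daily level.

    A day is treated as a spike when its score exceeds
    median + threshold * MAD (median absolute deviation).
    Only days listed in `negative_days` are dampened, so spikes that
    were not labelled as negative are left untouched.
    The rule and the default threshold are illustrative assumptions.
    """
    med = median(daily_attention)
    mad = median(abs(score - med) for score in daily_attention)
    cutoff = med + threshold * mad
    return [
        med if (day in negative_days and score > cutoff) else score
        for day, score in enumerate(daily_attention)
    ]


# Made-up scores: day 3 stands in for a scandal, day 6 for an unlabelled spike.
scores = [10, 12, 11, 95, 13, 12, 80, 11]
filtered = dampen_negative_spikes(scores, negative_days={3})
```

In this made-up series, only the labelled day is pulled back to the typical level, while the unlabelled spike on day 6 survives. A purely statistical rule cannot tell a scandal from good news, which is why the sketch relies on someone labelling the negative days first.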

We will therefore be mindful of spikes driven by unambiguously negative interest in either Macron or Le Pen when predicting the final outcome. By this we mean spikes driven by stories about corruption or gross incompetence, rather than merely controversial stories such as demands for radical policies that polarise the electorate.

Lesson 3: Context matters

One key limitation we identified in our data was that Le Pen and Mélenchon supporters might have eschewed the traditional media outlets covered by our data, instead obtaining their information from social media. This pattern can be observed among supporters of populist parties on both the right and the left in many countries.

To compensate for this, we included social media data to contextualise our finding that Le Pen was seeing declining interest after the TV debate. We furthermore put our prediction of a weak Le Pen performance in a broader context of a weakening of populist forces across Western Europe (e.g. Geert Wilders’s Freedom Party in the Netherlands, the UK Independence Party in Britain and the AfD in Germany).

When we make our final-round prediction next week, we will again include social media trends and the wider context.

Who will be the next Président(e)?

Our predictions for the first round of the French election were more accurate than those of other big data providers. More importantly, we were more careful and transparent in making them. This reflects our commitment to transparency in all of the research we publish on Echobox Resources and our commitment to scientific principles as we build the world’s first AI that understands the meaning of content.

We remain convinced that the cumulative level of attention over an entire election campaign is a relevant predictor of performance that can spot trends before they manifest themselves in conventional polls and identify the winner with a high level of confidence.

Since the first-round results, Emmanuel Macron has received more attention online than Marine Le Pen more than 65% of the time. We will be following how this cumulative score evolves and will issue our final-round prediction 48 hours before the results come out.
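For readers who want to see what these two measures amount to, the sketch below computes a cumulative attention score and the share of time one candidate is ahead of the other. It is an illustration under stated assumptions, not the tracker's actual calculation: the function names are hypothetical, the numbers are made up, and we assume attention is measured once per equal-length period.

```python
from itertools import accumulate


def cumulative_scores(attention):
    """Running total of attention over the campaign, one value per period."""
    return list(accumulate(attention))


def share_of_time_ahead(a, b):
    """Fraction of periods in which candidate a's attention strictly exceeds b's."""
    if len(a) != len(b) or not a:
        raise ValueError("both series must be non-empty and of equal length")
    periods_ahead = sum(1 for x, y in zip(a, b) if x > y)
    return periods_ahead / len(a)


# Made-up per-period attention scores for two candidates.
macron = [5, 7, 6, 9, 4, 8]
le_pen = [6, 5, 4, 7, 6, 5]

macron_total = cumulative_scores(macron)[-1]
le_pen_total = cumulative_scores(le_pen)[-1]
macron_share = share_of_time_ahead(macron, le_pen)
```

The two measures answer different questions. The share of time ahead ignores by how much a candidate leads, whereas the cumulative score adds up the size of every lead and deficit, so a single very large spike can move it substantially.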