Correlation is Not Causation... (So how can we believe stats can make accurate forecasts?)

One of the deviously fun parts of developing predictive models is the feeling of defying what they said about stats in school: Correlation is not causation. With predictive modeling, we turn that idea on its head and bend the rules, so that statistics is not only descriptive of the past… it is also indicative of the future.

It’s actually easy to understand why this should work. If you must predict which of two teams will score higher, your best bet seems obvious: pick the team that has scored higher in the past.

Is it perfect?  No, it’s terrible!  But it works on average. 
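
For concreteness, here is a minimal sketch of that "bet on past performance" heuristic in Python. The team names and scores are invented for illustration:

```python
# A minimal sketch of the "bet on past performance" heuristic.
# The team names and scores below are made up for illustration.

past_scores = {
    "Team A": [88, 92, 79, 95, 101],
    "Team B": [72, 85, 90, 78, 83],
}

def predict_higher_scorer(history):
    """Pick the team with the higher historical average score."""
    averages = {team: sum(scores) / len(scores) for team, scores in history.items()}
    return max(averages, key=averages.get)

print(predict_higher_scorer(past_scores))  # -> "Team A" (average 91 vs 81.6)
```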

Predictive models work this way. You essentially seek to learn which things in the past correlate with things in the future, and, most importantly, you weed out all the misleading trends that do not consistently prove valid. It’s this bit about “consistent validity” that makes model development so laborious. (With experience, it becomes obvious that this is the most essential part of scientifically testing and ensuring robust predictions.)
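
To make "consistent validity" concrete, here is one hedged sketch of the idea: keep a candidate predictor only if its correlation with the outcome holds up, with a stable sign, in every disjoint chunk of the data. The synthetic features and the 0.2 threshold below are illustrative choices, not a prescription:

```python
# A sketch of "consistent validity": keep only predictors whose
# correlation with the target holds up across independent sample splits.
# The data here is synthetic; in practice you'd use your own features.
import numpy as np

rng = np.random.default_rng(0)
n = 600
signal = rng.normal(size=n)                  # genuinely predictive feature
noise = rng.normal(size=n)                   # misleading feature
target = 0.7 * signal + rng.normal(size=n)   # the future outcome

def consistent(feature, target, n_splits=5, threshold=0.2):
    """True if the feature-target correlation keeps a stable sign and
    stays above the threshold in every disjoint chunk of the data."""
    chunks = np.array_split(np.arange(len(target)), n_splits)
    corrs = [np.corrcoef(feature[idx], target[idx])[0, 1] for idx in chunks]
    signs = {np.sign(c) for c in corrs}
    return len(signs) == 1 and all(abs(c) >= threshold for c in corrs)

print(consistent(signal, target))  # likely True: the trend holds everywhere
print(consistent(noise, target))   # likely False: it fluctuates around zero
```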

Compared to mechanistic models (like a ball rolling down a smooth incline), the tradeoff for statistical models is uncertainty: the results usually deviate from the prediction by a wide range (whereas an ordinary ball does not roll erratically, rocketing upwards or plunging into the ground).
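
A toy comparison may help; all numbers here are invented. The mechanistic formula returns a single exact value, while the statistical model can only offer a center plus a spread estimated from past outcomes:

```python
# Illustrating the tradeoff: a mechanistic model predicts a point,
# a statistical model predicts a center plus a spread. Numbers invented.
import numpy as np

# Mechanistic: distance of a ball on a frictionless incline, d = 0.5 * a * t^2.
a, t = 2.0, 3.0
print(f"mechanistic prediction: {0.5 * a * t**2:.1f} m (no error bar)")

# Statistical: past outcomes scatter around whatever the model predicts.
rng = np.random.default_rng(1)
outcomes = 91 + rng.normal(scale=8, size=200)   # e.g. past team scores
center, spread = outcomes.mean(), outcomes.std()
print(f"statistical prediction: {center:.1f} +/- {spread:.1f}")
```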

Putting this all together: statistical models cannot work like precise clockwork. But through careful development (separating past from future, identifying key predictors, and validating across sample sets) you can achieve a predictive model that is more often right than wrong. Perhaps it can even be right more often than any other way of doing it!
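
As a closing sketch, here is what "separating past from future" might look like in code: fit on earlier records only, then judge the model solely on records it never saw. The data and the cutoff are, again, made up:

```python
# A minimal sketch of "separating past from future": fit on earlier
# records only, then score on the later ones. All values are synthetic.
import numpy as np

rng = np.random.default_rng(2)
x = rng.normal(size=300)                 # predictor observed over time
y = 0.5 * x + rng.normal(size=300)       # outcome observed over time

cutoff = 240                             # everything before this is "the past"
x_past, y_past = x[:cutoff], y[:cutoff]
x_future, y_future = x[cutoff:], y[cutoff:]

# Fit a one-variable linear model on the past alone.
slope, intercept = np.polyfit(x_past, y_past, 1)

# Judge it only on the future it never saw.
predictions = slope * x_future + intercept
rmse = np.sqrt(np.mean((y_future - predictions) ** 2))
print(f"out-of-sample RMSE: {rmse:.2f}")
```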