Some of you asked me to describe what "overfit" is. So here's ...really probably... more than you really wanted to know! Anyway, I hope like 3 of you out in the world will get something from this discussion. Overfit is an issue I take seriously, in order to deliver the best possible recommendations. I strongly believe my efforts to address overfit make all the difference.
What is overfit? In my own words, overfit is a feature in a model that causes wrong predictions-- despite the fact that the same feature describes past trends well.
Are there different kinds of overfit? Yes, I'm glad you asked. I'm not dealing with the classic example of polynomial overfit (when you try to fit a trend to y = a + bx + cx^2 + ... + zx^25 when y = mx + b would have sufficed). The type of overfit I'm weeding out is of the "multivariable" kind. When there are more variables than just "x", issues of overfit get tricky.
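(For anyone who hasn't seen that classic case, it's easy to demonstrate in a few lines of Python. The data here is synthetic, and I use degree 9 rather than 25 purely to keep the arithmetic numerically stable-- the point is the same: the high-degree fit "explains" the past almost perfectly while tracking noise rather than the trend.)

```python
import numpy as np

# Synthetic data: the truth is a simple line, y = 2x + 1, plus noise.
rng = np.random.default_rng(3)
x = np.linspace(0, 1, 10)
y = 2 * x + 1 + rng.normal(0, 1, 10)

line = np.polyfit(x, y, 1)   # y = mx + b
poly = np.polyfit(x, y, 9)   # one coefficient per data point

# The high-degree polynomial matches the past points almost exactly...
err_line = np.mean(np.abs(y - np.polyval(line, x)))
err_poly = np.mean(np.abs(y - np.polyval(poly, x)))

# ...but at new x-values it tends to swing away from the true line,
# because it has fit the noise, not the trend.
x_new = np.linspace(0.05, 0.95, 10)   # points between the training points
err_new = np.mean(np.abs(np.polyval(poly, x_new) - (2 * x_new + 1)))
```

The degree-9 fit's error on the training points is essentially zero, yet its error on fresh points is not-- the signature of fitting the past rather than the future.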
Example? Here's an easy example, and also the easiest kind to eliminate.
Suppose you're modeling kicker fantasy points, and you test the variable "game O/U". It works! Kicker-score = 7.6 + 0.12*(O/U), with a great p-value of 10^-14. Then suppose you also check whether-- instead of using O/U-- the kicker's own implied team score works. Answer... yes! It works even better: Kicker-score = 2.8 + 0.21*(teamscore), with an even lower p-value of 10^-15.
Now suppose you get a clever idea, realizing you have 2 variables that each seem statistically significant: why not use both? Voila: you get Kicker-score = 4.2 + 0.15*(teamscore) + 0.05*(O/U), and this "model" has better correlation than using just 1 variable (which is always true). Good, right? No, not necessarily. The p-values are now 0.001 for team score and 0.097 for O/U. The O/U no longer passes the significance test, and you need to remove it from the equation. If you leave in both variables, you most likely get worse future predictions than with the 1-variable formula. [Results not shown.]
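To make that concrete, here's a small Python sketch. All the data is synthetic and hypothetical-- the variable names and coefficients just mirror the example above-- but it reproduces the effect: two correlated predictors each look highly significant alone, while one of them loses significance once both are in the model.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n = 500

# Hypothetical synthetic data: the game O/U and the kicker's implied team
# score are strongly correlated, and fantasy points are driven mainly by
# the implied team score.
ou = rng.normal(45, 5, n)
teamscore = ou / 2 + rng.normal(0, 1.5, n)
points = 2.8 + 0.21 * teamscore + rng.normal(0, 1.5, n)

# Alone, each variable looks highly significant.
p_ou = stats.linregress(ou, points).pvalue
p_ts = stats.linregress(teamscore, points).pvalue

def ols_pvalues(X, y):
    """Fit OLS with an intercept; return two-sided p-values per coefficient."""
    X = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    dof = len(y) - X.shape[1]
    se = np.sqrt(resid @ resid / dof * np.diag(np.linalg.inv(X.T @ X)))
    return 2 * stats.t.sf(np.abs(beta / se), dof)

# Together, O/U adds almost nothing once team score is in the model:
# its p-value jumps by many orders of magnitude.
p_joint = ols_pvalues(np.column_stack([teamscore, ou]), points)
```

The mechanism is the correlation between the two predictors: each one's standalone significance mostly came from the same underlying signal, so the "second" variable has little left to explain.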
What does that example show? Many people would expect that using both variables (each significant when used independently) should yield at least some small advantage. Often wrong. In reality, the "extra" variable frequently makes the model worse, undercutting accuracy rather than maximizing it.
Well, that's easy to catch. Is there a harder example? Yes. Here's what happens more frequently than you might think: I add 1 extra variable to a regression, and yay! It appears to pass a normal test for significance, but... including the variable still decreases predictive accuracy. Very frustrating, but not uncommon.
How do you get rid of these "imposters"? You can often use cross-validation to catch such sneaky, misleading variables.
What's cross-validation? It's a huge element of my work. Example (using "years"): (1) Perform a regression using data that excludes 1 certain year, e.g. 2020. (2) Use the resulting equations to simulate the excluded year (2020). (3) Ask: "Would steps 1 & 2 produce more accuracy if I include a certain variable, or exclude it?" (4) Repeat this test for 2019, 2018, etc. This process helps weed out variables that pass the significance test but do not actually have predictive value.
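The steps above can be sketched in Python like this. (A minimal sketch, not my exact pipeline: the variable names are hypothetical, the demo data is synthetic, and mean absolute error is just one reasonable choice of accuracy metric.)

```python
import numpy as np

def loyo_error(years, X, y, use_cols):
    """Leave-one-year-out cross-validation: fit a linear model on every
    year except one, predict the held-out year, repeat for each year,
    and return the average out-of-year absolute error."""
    errors = []
    for hold in np.unique(years):
        train, test = years != hold, years == hold
        # Step (1): regress on all data excluding the held-out year.
        X_train = np.column_stack([np.ones(train.sum()), X[train][:, use_cols]])
        beta, *_ = np.linalg.lstsq(X_train, y[train], rcond=None)
        # Step (2): simulate the excluded year with the fitted equation.
        X_test = np.column_stack([np.ones(test.sum()), X[test][:, use_cols]])
        errors.append(np.mean(np.abs(y[test] - X_test @ beta)))
    # Steps (3)-(4): the loop repeats for every year; average the errors.
    return float(np.mean(errors))

# Synthetic demo: column 0 genuinely predicts y; column 1 is pure noise.
rng = np.random.default_rng(1)
years = np.repeat(np.arange(2018, 2024), 100)
X = rng.normal(size=(600, 2))
y = 3.0 * X[:, 0] + rng.normal(size=600)

err_without = loyo_error(years, X, y, [])   # intercept only
err_with = loyo_error(years, X, y, [0])     # include the real predictor
# Keep a candidate variable only if it lowers the out-of-year error.
```

The decision rule is the last comment: a variable earns its place by lowering error on years it never saw, not by improving the in-sample fit.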
So you just do something like that then... Isn't that enough? No, it's not always enough-- which is a shame, because I spend a lot of time on this! The reasons could fill a whole chapter, but let's keep it brief. It is kind of interesting, but this is the part that sucks hours and hours from life... In my experience, there are a few types of "imposter" variables: (1) co-dependent pairs/triplets (additive/subtractive); (2) variables that only marginally improve past correlation; (3) "era-specific" trends; (4) double-counting of factors already included in weekly adjustments (for injury etc.).
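For the first category, one common first-pass screen (an illustration, not my exact procedure-- the 0.85 cutoff is an arbitrary choice) is simply to flag candidate pairs whose correlation is very high, since one member of such a pair is likely redundant:

```python
import numpy as np

def flag_codependent(names, X, threshold=0.85):
    """Flag pairs of candidate variables whose sample correlation exceeds
    a threshold -- prime suspects for the co-dependent 'imposter' type."""
    corr = np.corrcoef(X, rowvar=False)
    flagged = []
    for i in range(len(names)):
        for j in range(i + 1, len(names)):
            if abs(corr[i, j]) > threshold:
                flagged.append((names[i], names[j], round(float(corr[i, j]), 2)))
    return flagged

# Synthetic demo: O/U and implied team score move together; a third
# (hypothetical) variable, temperature, is unrelated to both.
rng = np.random.default_rng(2)
ou = rng.normal(45, 5, 300)
teamscore = ou / 2 + rng.normal(0, 1.0, 300)
temp = rng.normal(60, 10, 300)
pairs = flag_codependent(["O/U", "teamscore", "temp"],
                         np.column_stack([ou, teamscore, temp]))
```

A flag doesn't say which member of the pair to drop-- that still takes the cross-validation test above, applied to each candidate separately.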
When does overfit come into Subvertadown models? Each time I introduce (for screening) a wider array of inputs (as part of the search for "better" data), there's a potential need to weed out new overfit. Very few of the 100 new variables I last tested actually got added-- and only after passing tests for significance and cross-validation. Even so, some of these were misleading. It feels a bit sad to cut a variable with a p-value of 0.001. But out they go.
So how do I know when there's overfit? In principle, most overfit can be removed during the model development process. But since that's not 100% foolproof, I refer to my in-season accuracy measurements to detect whether I need to address it. Once I establish the need, I can sniff it out with some analysis, or at least lean towards "underfit" to protect a minimum level of accuracy.
Tagged under Background, Modeling