[If you want the results without any preamble, jump past the first sub-headline.]
Several years ago hockey analysts began using a statistic called Corsi to analyse team and player quality. While people have often called Corsi an “advanced statistic”, it’s really just the same idea as +/- but with a larger sample size: you add up all the shot attempts for and against each team, usually at 5v5 only, and then turn them into a ratio. Over the years a number of improvements have been made to how we use Corsi, the most important of which is score-adjustment. Two things made Corsi so valuable: it was highly repeatable (i.e. if you know a team’s Corsi in one part of the season, you can predict its Corsi in its remaining games very well) and it was better than goals at predicting future goals (i.e. if you know a team’s Corsi and its goal differential, you can get a more accurate prediction of the team’s future goal differential by using Corsi).
The obvious questions that most people ask are, “What about shot quality? Shouldn’t we be looking at scoring chances rather than all shot attempts?” Many NHL coaches seem to prefer scoring chances as well. A number of years ago, fans of several NHL teams decided to track scoring chances by hand to see whether they were in fact better than Corsi. In the end, scoring chances correlated very strongly with Corsi, and the trackers decided to stop because the work didn’t seem like a good investment of time given how close the results were to Corsi. That isn’t to say that Corsi was better, just that it was easier to acquire, and everyone involved seemed to think they’d learn more by dedicating their manual tracking efforts to something else, which eventually turned out to be zone entries and exits.
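As a rough illustration of the ratio described above, here is a minimal sketch that turns shot attempts for and against into a percentage. The function name and the sample totals are made up for the example, and it ignores score and venue adjustment entirely.

```python
def corsi_for_pct(attempts_for: int, attempts_against: int) -> float:
    """Share of all shot attempts that belonged to the team, as a percentage."""
    total = attempts_for + attempts_against
    if total == 0:
        raise ValueError("no shot attempts recorded")
    return 100.0 * attempts_for / total


# Hypothetical 5v5 totals for one team, invented for illustration.
print(corsi_for_pct(55, 45))  # 55.0
```

A value above 50 means the team took more shot attempts than it allowed.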
In recent years a number of analysts have made attempts to measure some component of shot quality and adjust Corsi results accordingly. This is typically known as expected goals or xG. Two of the better-known examples are from Emmanuel Perry at Corsica and Dawson Sprigings & Asmae Toumi at Hockey Graphs. Both showed some improved predictivity over Corsi, although not nearly as big an improvement as is seen using Corsi instead of goal differential.
At the moment the web site Natural Stat Trick makes available a pair of simpler statistics, one called “scoring chances” (SCF%) and one called “high-danger chances” (HDCF%). Definitions for both of those statistics can be found on the old War On Ice blog. Unlike xG, which typically adjusts each individual shot for a number of factors like its location, angle, and shot type, SCF% and HDCF% are each simply a ratio of all shot attempts within the specified region of the offensive zone.
I’ve seen an increasingly large number of people citing scoring chances or high-danger chances when discussing teams and players. Normally this is done when a team like New Jersey or Washington is winning more games than you might guess based on their standard Corsi, and fans are searching for an explanation as to why. I decided to check whether those statistics actually provide any improved predictivity over Corsi. My hypothesis was that they wouldn’t, but there’s no way to know if the hypothesis is right until you test it, which is what I decided to do.
All data in this post was collected from Natural Stat Trick. All of the statistics are for 5v5 play only and are score and venue (home vs road) adjusted, except for goals, which use the regular, unadjusted count.
ARE SCORING CHANCES MORE PREDICTIVE THAN CORSI?
In short, yes. Here’s the method I used to determine that:
I gathered four statistics (Corsi, scoring chances, high-danger chances, and goals) for all 30 teams that were in the NHL in each season from 2013-14 to 2016-17. Those are the four full 82-game seasons that have been played since the last lockout, which works out to 120 team seasons; I think that’s a pretty good sample size. I then split the data into two halves: the first 41 games of each season and the last 41 games of each season. The goal was to see how well you can predict goal scoring in the second half of the season with a team’s results from the first half.
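The split described above can be sketched as follows. This is only an illustration under stated assumptions: it supposes game-by-game totals are available as (events for, events against) pairs in chronological order, and the names `split_halves` and `for_pct` are hypothetical. The season data is invented.

```python
from typing import List, Tuple

# One entry per game, in chronological order: (events for, events against).
Game = Tuple[float, float]


def split_halves(games: List[Game], half: int = 41) -> Tuple[List[Game], List[Game]]:
    """Return the first `half` games and the `half` games that follow them."""
    if len(games) < 2 * half:
        raise ValueError("season is shorter than two halves")
    return games[:half], games[half:2 * half]


def for_pct(games: List[Game]) -> float:
    """Percentage of events that went the team's way over a set of games."""
    events_for = sum(g[0] for g in games)
    events_against = sum(g[1] for g in games)
    return 100.0 * events_for / (events_for + events_against)


# Made-up season: 41 games at 50 for / 50 against, then 41 games at 60 / 40.
season = [(50.0, 50.0)] * 41 + [(60.0, 40.0)] * 41
first, second = split_halves(season)
print(for_pct(first), for_pct(second))
```

Doing this for every team in every season gives one first-half value and one second-half value per team season, which is the shape of data the two tests below work on.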
The two most important things when determining how well a statistic describes the quality of a team’s play are whether the statistic is repeatable (is it a real skill or just random?) and whether it predicts future goal scoring. Ultimately what we want to do is separate which teams are doing well because of real underlying talent from the teams that are just getting lucky.
The first of those two tests is sometimes called autocorrelation, and it’s a measurement of how well a number correlates to a measurement of the same thing in the future. For these statistics the results fall between 0 and 1 (a correlation can in principle be negative, but that doesn’t happen here), with 1 meaning the two halves are identical and lower numbers meaning less similarity. Corsi is well-known to have high autocorrelation. How well do the other two statistics stack up?
Statistic | Autocorrelation |
CF% | 0.856 |
SCF% | 0.754 |
HDCF% | 0.676 |
Neither scoring chances nor high-danger chances are as repeatable as Corsi is, but they’re both much better than random, which tells us that there’s a pretty strong element of real skill involved. HDCF% is the most subject to variance of the statistics tested here, which means we should be less confident in its results. That’s a trend that’s going to continue.
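For anyone who wants to run the same kind of repeatability check, here is a minimal sketch. It assumes the autocorrelation is a plain Pearson correlation between each team's first-half and second-half values, which the post does not state explicitly, and the sample numbers are invented.

```python
import math
from typing import Sequence


def pearson_r(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation between two equally long sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length (at least 2)")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


# Invented CF% values for four teams: first 41 games, then last 41 games.
first_half_cf = [48.0, 50.0, 52.0, 54.0]
second_half_cf = [49.0, 49.5, 52.5, 53.0]
print(round(pearson_r(first_half_cf, second_half_cf), 3))
```

With real data each list would hold 120 values, one per team season.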
The second test is how well each of these statistics does at predicting future goal scoring, which we’ll do using r squared. R squared is a measurement of how well one statistic explains the variance in a second statistic. The values are again between 0 and 1, with 1 meaning that the first statistic entirely explains the variance in the second, and lower values meaning that a smaller amount of the variance is explained. Let’s see how well half a season of our statistics does at predicting goal ratio in the second half of the season:
Predictor | Next 41 GP GF% (r^2) |
First 41 GP CF% | 0.231 |
First 41 GP SCF% | 0.274 |
First 41 GP HDCF% | 0.197 |
Two things are clear here: Corsi is better than high-danger chances at predicting future goal scoring, and scoring chances are better than either of them. And the results for scoring chances here are pretty close to the predictivity offered by various xG models. You can see that the magnitude of difference between CF% and xG% in Dawson and Asmae’s model is comparable to what I’ve found here for scoring chances:
And the results are also similar to Manny’s xG model:
On this basis I think it’s fair to say both that the scoring chance model from Natural Stat Trick is an improvement on Corsi and that it’s pretty comparable to existing expected goal models.
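The r squared comparison in this section can be sketched like this. It assumes a simple one-predictor linear regression (the post does not say exactly how r squared was computed), and both the function names and the sample values are hypothetical.

```python
from typing import Sequence, Tuple


def fit_line(xs: Sequence[float], ys: Sequence[float]) -> Tuple[float, float]:
    """Least-squares slope and intercept for predicting ys from xs."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    var_x = sum((x - mean_x) ** 2 for x in xs)
    if var_x == 0:
        raise ValueError("predictor is constant")
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = cov / var_x
    return slope, mean_y - slope * mean_x


def r_squared(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Fraction of the variance in ys explained by a straight-line fit on xs."""
    slope, intercept = fit_line(xs, ys)
    mean_y = sum(ys) / len(ys)
    ss_res = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    if ss_tot == 0:
        raise ValueError("target is constant")
    return 1.0 - ss_res / ss_tot


# Invented values: first-half SCF% and second-half GF% for five teams.
first_half_scf = [47.0, 49.0, 50.0, 52.0, 55.0]
second_half_gf = [46.0, 51.0, 48.0, 53.0, 52.0]
print(round(r_squared(first_half_scf, second_half_gf), 3))
```

Running the same function once per predictor (CF%, SCF%, HDCF%) against the same second-half GF% values gives numbers that can be compared directly, as in the table above.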
SEPARATING OFFENCE AND DEFENCE
There’s something about this result that’s bothering me though, and it’s something that’s bothered me about expected goals models too. There’s pretty good evidence that on-ice SV% is not a repeatable talent for players. I know that’s a result that a lot of people really dislike, but it’s a result that’s been reproduced by multiple analysts. If that’s true, though, then shouldn’t expected goals models and scoring chances lag Corsi in terms of predictivity?
To try to get more information about that, I decided to do something that I haven’t seen any other hockey stats writers try: I looked at the predictivity of events for a team and events against a team separately (apologies if anyone has done this and I just haven’t seen it). The first question is which of Corsi for, scoring chances for, and high-danger chances for does the best job of predicting future goal scoring:
Predictor | Next 41 GP GF/60 (r^2) |
First 41 GP CF/60 | 0.111 |
First 41 GP SCF/60 | 0.161 |
First 41 GP HDCF/60 | 0.119 |
Scoring chances have the edge here, with Corsi and high-danger chances producing pretty similar results.
But what if we look at events against a team? The results are reversed:
Predictor | Next 41 GP GA/60 (r^2) |
First 41 GP CA/60 | 0.269 |
First 41 GP SCA/60 | 0.192 |
First 41 GP HDCA/60 | 0.101 |
Corsi is in fact much better at predicting goals against than scoring chances are, and drastically better than high-danger chances are.
This makes sense given what we know from other statistical analyses. Players have a pretty strong ability to influence their own shot quality (as demonstrated in the individual expected goals model from the Hockey Graphs post linked above). That’s why scoring chances outperform Corsi at predicting offence. But players have a much more limited ability to influence shot quality against, which is why raw shot attempt rates out-predict scoring chances in terms of goals against.
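To make the for/against split concrete, here is a minimal sketch of how the per-60 rates could be built before running the r squared test on each side. The `HalfSeason` record, its field names, and the totals are all assumptions for illustration; real inputs would be 5v5 totals and time on ice for each half.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass
class HalfSeason:
    toi_minutes: float       # 5v5 time on ice, in minutes
    attempts_for: float      # shot attempts (or chances) for
    attempts_against: float  # shot attempts (or chances) against
    goals_for: float
    goals_against: float


def per_60(events: float, toi_minutes: float) -> float:
    """Convert an event count into a rate per 60 minutes of ice time."""
    if toi_minutes <= 0:
        raise ValueError("time on ice must be positive")
    return 60.0 * events / toi_minutes


def offence_pair(first: HalfSeason, second: HalfSeason) -> Tuple[float, float]:
    """(first-half events for per 60, second-half goals for per 60)."""
    return (per_60(first.attempts_for, first.toi_minutes),
            per_60(second.goals_for, second.toi_minutes))


def defence_pair(first: HalfSeason, second: HalfSeason) -> Tuple[float, float]:
    """(first-half events against per 60, second-half goals against per 60)."""
    return (per_60(first.attempts_against, first.toi_minutes),
            per_60(second.goals_against, second.toi_minutes))


# Invented totals for one team season.
first = HalfSeason(2000.0, 1900.0, 1800.0, 80.0, 70.0)
second = HalfSeason(2000.0, 1850.0, 1750.0, 90.0, 75.0)
print(offence_pair(first, second))
print(defence_pair(first, second))
```

Collecting one offence pair and one defence pair per team season gives two separate data sets, so offence and defence can each get their own r squared.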
CONCLUSION
I think the evidence for preferring scoring chances to Corsi is pretty strong. While SCF% does show a bit less autocorrelation, it’s still a pretty stable statistic, and over the past four seasons it has shown a better ability than Corsi to predict future goal scoring. It also has another benefit, which is that it’s much more intuitive. Everyone who watches hockey has an understanding of what a “scoring chance” is, even if there will inevitably be disagreements about which specific shots to count. It should be easier to have conversations that bridge an analytical approach and a more traditional one if we use a statistic that takes less time to explain.
One thing that I have not attempted to do but that I think would be worthwhile would be to create a statistic that accounts for the fact that shot quality matters more for goals for than for goals against; a statistic that has the offensive advantages of scoring chances with the defensive advantages of Corsi. Given the statistics we’ve got right now, SCF% does outperform CF% overall, but it’s important to note that Corsi is still better at predicting future goals against.
There is no reason to use high-danger chances. It’s not a good stat.
Further, scoring chances show pretty similar predictivity to expected goal models. xG may slightly out-predict SCF% at the moment (and future xG models may improve further), but given how close they are and how much simpler scoring chances are to explain and understand, I think SCF% is a preferable statistic to xG, though that likely depends on what you intend to use it for (casual discussion, building a betting model, etc.).
Nice!
The sunk costs involved in HDCF analysis do sadden me, however. Some people’s time may have been wasted, and that is always a somewhat tragic consideration.
Given, you know, the finite amount of time we get on this earth.
Very cool!
Some really random thoughts:
1 – is there any particular reason to use the first half of the season to predict the second half? I get that there’s a logical reason, but if the goal is prediction (as opposed to explanation), you may dramatically increase your sample size for training without truly leaking data.
2 – your finding wrt inverted predictive value for stats for/against implies that there is true signal difference between teams; this makes sense: I imagine that some players (and hence teams) can dramatically outperform xGF (McDavid, Matthews, etc.).
3 – all of the above point to the possibility of much better prediction of future team performance if using multiple stats as features and black box algorithms that infer interactions between the stats.
4 – again if the goal is prediction, you could build ensemble models that shifted weights according to the number of games in the training set for a given team; this may actually be kind of cool.
5 – if the goal is explanation, you may want to try to suss out which stats are truly driving the bus by subjecting them to bivariate (how correlated are Corsi-for and xGF to each other) and multivariate analysis.
Again, awesome contribution.
1: I’d hazard a guess that it’s because roster turnover between halves is minimal, as are coaching changes.
1 – If you used the 2nd half games to predict the 2nd half (to verify the model is consistent/predictive), you wouldn’t be proving anything.
You’re right, not very useful for explanatory analysis, but potentially useful for predictive analysis; you may dramatically improve out-of-sample prediction accuracy as a result of doubling the size of your training data set.
If you wanted to predict future games then yes you’d use the full data available.
I think this is a no-brainer, isn’t it?
The reason we use Corsi in the first place is because it’s already tracked by the NHL and it’s pretty unambiguous/objective. And it’s really just a proxy for puck possession.