The Case for Dennis Rodman, Part 3/4(b)—Rodman’s X-Factor

The sports analytical community has long used Margin of Victory or similar metrics as their core component for predicting future outcomes. In situations with relatively small samples, it generally slightly outperforms win percentages, even when predicting win percentages.

There are several different methods for converting MOV into expected win-rates. For this series, I took the 55,000+ regular-season team games played since 1986 and compared their outcomes to the team’s Margin of Victory over the other 81 games of the season. I then ran this data through a logistic regression (a method for predicting things that come in percentages) with MOV as the predictor variable. Here is the resulting formula:

$\large{PredictedWin\% = \dfrac{1}{1+e^{-(.127mv-.004}}}$

^{Note: e is euler’s number, or ~2.72. mv is the variable for margin of victory.}

This will return the probability between 0 and 1, corresponding to the odds of winning the predicted game. If you want to try it out for yourself, the excel formula is:

1 / (1 + EXP(-(-0.0039+0.1272*[MOV])))

So, for example, if a team’s point differential (MOV) over 81 games is 3.78 points per game, their odds of winning their 82nd game would be 61.7%.

Of course, we can use this same formula to predict a player’s win% differential based on his MOV differential. If, based on his MOV contribution alone, a player’s team would be expected to win 61.7% of the time, then his predicted win% differential is what his contribution would be above average, in this case 11.7% (this is one reason why, for comparison purposes, I prefer to use adjusted win differentials, as discussed in Part 3(a)).

As discussed in the part 2(b) of this series (“With or Without Worm”), Dennis Rodman’s MOV differential was 3.78 points, which was tops among players with at least a season’s worth of qualifying data, corresponding to the aforementioned win differential of 11.7%. Yet this under-predicts his actual win percentage differential by 9.9%. This could be the result of a miscalibrated prediction formula, but as you can see in the following histogram, the mean for win differential minus predicted win differential for our 470 qualifying player dataset is actually slightly below zero at –0.7%:

Rodman has the 2nd highest overall, which is even more crazy considering that he had one of the highest MOV’s (and the highest of anyone with anywhere close to his sample size) to begin with. Note how much of an outlier he is in this scatterplot (red dot is Rodman):

I call this difference the “X-Factor.” For my purposes, “X” stands for “unknown”: That is, it is the amount of a player’s win differential that isn’t explained by the most common method for predicting win percentages. For any particular player, it may represent an actual skill for winning above and beyond a player’s ability to contribute to his team’s margin of victory (in section (c), I will go about proving that such a skill exists), or it may simply be a result of normal variance. But considering that Rodman’s sample size is significantly larger than the average in our dataset, the chances of it being “error” should be much smaller. Consider the following:

Again, Rodman is a significant outlier: no one with more than 2500 qualifying minutes breaks 7.5%. Rodman’s combination of large sample with large Margin of Victory differential with large X-Factor is remarkable. To visualize this, I’ve put together a 3-D scatter plot of all 3 variables:

It can be hard to see where a point stands in space in a 2-D image, but I’ve added a surface grid to try to help guide you: the red point on top of the red mountain is Dennis Rodman.

To get a useful measure of how extreme this is, we can approximate a sample-size adjustment by comparing the number of qualifying minutes for each player to the average for the dataset, and then adjusting the standard deviation for that player accordingly (proportional to the square root of the ratio, a method which I’ll discuss in more detail in section (d)). After doing this, I can re-make the same histogram as above with the sample-adjusted numbers:

No man is an island. Except, apparently, for Dennis Rodman. Note that he is about 4 standard deviations above the mean (and observe how the normal distribution line has actually blended with the axis below his data point).

Naturally, of course, this raises the question:

Where does Rodman’s X-Factor come from?

Strictly speaking, what I’m calling “X-Factor” is just the prediction error of this model with respect to players. Some of that error is random and some of it is systematic. In section (c), I will prove that it’s not entirely random, though where it comes from for any individual player, I can only speculate.

Margin of Victory treats all contributions to a game’s point spread equally, whether they came at the tail end of a blowout, or in the final seconds of squeaker. One thing that could contribute to a high X-factor is “clutch”ness. A “clutch” shooter (like a Robert Horry), for example, might be an average or even slightly below-average player for most of the time he is on the floor, but an extremely valuable one near the end of games that could go either way. The net effect from the non-close games would be small for both metrics, but the effect of winning close games would be much higher on Win% than MOV. Of course, “clutch”ness doesn’t have to be limited to shooters: e.g., if one of a particular player’s skill advantages over the competition is that he makes better tactical decisions near the end of close games (like knowing when to intentionally foul, etc.), that would reflect much more strongly in his W% than in his MOV.

Also, a player who contributes significantly whenever they are on the floor but is frequently taken out of non-close games as a precaution again fatigue or injury may have a Win % that accurately reflects his impact, but a significantly understated MOV. E.g., in the Boston Celtics “Big 3” championship season, Kevin Garnett was rested constantly—a fact that probably killed his chances of being that season’s MVP—yet the Celtics won by far the most games in the league. In this case, the player is “clutch” just by virtue of being on the floor more in clutch spots.

The converse possibility also exists: A player could be “reverse clutch,” meaning that he plays worse when the game is NOT on the line. This would ultimately have the same statistical effect as if he played better in crunch time. And indeed, based on completely non-rigorous and anecdotal speculation, I think this is a possible factor in Rodman’s case. During his time in Chicago, I definitely recall him doing a number of silly things in the 4th quarter of blowout games (like launching up ridiculous 3-pointers) when it didn’t matter—and in a game of small margins, these things add up.

Finally, though it cuts a small amount against the absurdity of Rodman’s rebounding statistics, I would be derelict as an analyst not to mention the possibility that Rodman may have played sub-optimally in non-close games in order to pad his rebounding numbers. The net effect, of course, would be that his rebounding statistics could be slightly overstated, while his value (which is already quite prodigious) could be substantially understated. To be completely honest, with his rebounding percentages and his X-Factor both being such extreme outliers, I have to think that at least some relationship existing between the two is likely.

If you’re emotionally attached to the freak-alien-rebounder hypothesis, this might seem to be a bad result for you. But if you’re interested in Rodman’s true value to the teams he played for, you should understand that, if this theory is accurate, it could put Rodman’s true impact on winning into the stratosphere. That is, this possibility gives no fuel to Rodman’s potential critics: the worst cases on either side of the spectrum are that Rodman was the sickest rebounder with a great impact on his teams, or that he was a great rebounder with the sickest impact.

In the next section, I will be examining the relative reliability and importance of Margin of Victory vs. Win % generally, across the entire league. In my “endgame” analysis, this is the balance of factors that I will use. But the league patterns do not necessarily apply in all situations: In some cases, a player’s X-factor may be all luck, in some cases it may be all skill, and in most it is probably a mixture of both. So, for example, if my speculation about Rodman’s X-Factor were true, my final analysis of Rodman’s value could be greatly understated.

UPDATE: 2010 NFL Season Neural Network Projections—In Review

Before the start of the 2010 NFL season, I used a very simple neural network to create this set of last-second projections:

Clearly, some of the predictions worked out better than others. E.g., Kansas City did manage to win their division (which I never would have guessed), but Dallas and San Francisco continued to make mockeries of their past selves. We did come dangerously close to a Jets/Packers Super Bowl, but in the end, SkyNet turned out to be more John Edwards than Nostradamus.

From a prediction-tracking standpoint, the real story of this season was the stunning about-face performance of Football Outsiders, who dominated the regular season basically from start to finish:

^{Note: Average and Median Errors reflect the difference between projected and actual wins. Correlations are between projected and actual win percentages.}

Not only did they almost completely flip the results from 2009, but their stellar 2010 results (combined with the below-average outing of my model) actually pushed their last 3 seasons slightly ahead of the neural network overall. This improvement also puts Koko (.25 * previous season’s wins + 6) far in FO’s rearview, providing further evidence that Koko’s 2009 success was a blip.

If we use each method’s win projections to project the post-season as well, however, things turn out a bit differently. Football Outsiders starts out in a strong position, having correctly picked 4 of 8 division champions and 9 of 12 playoff teams overall (against 2 and 8 for the NN respectively), but their performance worsens as the playoffs unfold:

The neural network correctly placed Green Bay in the Super Bowl and the Jets into the AFC championship game, while FO’s final 4 were Atlanta over Green Bay and Baltimore over Indianapolis.

Moreover, if we use these preseason projections to pick the overall results of the playoffs as they were actually set, the neural network outperforms its rivals by a wide margin:

^{Note: The error measured in this table is between predicted finish and actual finish. The Super Bowl winner finishes in 1st place, the loser in 2nd place, conference championship losers each tie for 3.5th place (average of 3rd and 4th), divisional losers tie for 6.5th (average of 5th, 6th, 7th, and 8th), and wild card round losers tie for 10.5th (average of 9th, 10th, 11th, and 12th).}

This minor victory will give me some satisfaction when I retool the model for next season—after all, this model is still essentially based on a small fraction of the variables used by its competitor, and neural networks generally get better and better with more data. On balance, though, the season clearly goes to Football Outsiders. So credit where it’s due, and congratulations to humankind for putting the computers in their place, at least for one more year.

The Case for Dennis Rodman, Part 2/4 (a)(ii)—Player Valuation and Unconventional Wisdom

In my last post in this series, I outlined and criticized the dominance of gross points (specifically, points per game) in the conventional wisdom about player value. Of course, serious observers have recognized this issue for ages, responding in a number of ways—the most widespread still being ad hoc (case by case) analysis. Not satisfied with this approach, many basketball statisticians have developed advanced “All in One” player valuation metrics that can be applied broadly.

In general, Dennis Rodman has not benefitted much from the wave of advanced “One Size Fits All” basketball statistics. Perhaps the most notorious example of this type of metric—easily the most widely disseminated advanced player valuation stat out there—is John Hollinger’s Player Efficiency Rating:

In addition to ranking Rodman as the 7th best player on the 1995-96 Bulls championship team, PER is weighted to make the league average exactly 15—meaning that, according to this stat, Rodman (career PER: 14.6) was actually a below average player. While Rodman does significantly better in a few predictive stats (such as David Berri’s Wages of Wins) that value offensive rebounding very highly, I think that, generally, those who subscribe to the Unconventional Wisdom typically accept one or both of the following: 1) that despite Rodman’s incredible rebounding prowess, he was still just a very good a role-player, and likely provided less utility than those who were more well-rounded, or 2) that, even if Rodman was valuable, a large part of his contribution must have come from qualities that are not typically measurable with available data, such as defensive ability.

My next two posts in this series will put the lie to both of those propositions. In section (b) of Part 2, I will demonstrate Rodman’s overall per-game contributions—not only their extent and where he fits in the NBA’s historical hierarchy, but exactly where they come from. Specifically, contrary to both conventional and unconventional wisdom, I will show that his value doesn’t stem from quasi-mystical unmeasurables, but from exactly where we would expect: extra possessions stemming from extra rebounds. In part 3, I will demonstrate (and put into perspective) the empirical value of those contributions to the bottom line: winning. These two posts are at the heart of The Case for Dennis Rodman, qua “case for Dennis Rodman.”

But first, in line with my broader agenda, I would like to examine where and why so many advanced statistics get this case wrong, particularly Hollinger’s Player Efficiency Rating. I will show how, rather than being a simple outlier, the Rodman data point is emblematic of major errors that are common in conventional unconventional sports analysis – both as a product of designs that disguise rather than replace the problems they were meant to address, and as a product of uncritically defending and promoting an approach that desperately needs reworking.

Player Efficiency Ratings

John Hollinger deserves much respect for bringing advanced basketball analysis to the masses. His Player Efficiency Ratings are available on ESPN.com under Hollinger Player Statistics, where he uses them as the basis for his Value Added (VA) and Expected Wins Added (EWA) stats, and regularly features them in his writing (such as in this article projecting the Miami Heat’s 2010-11 record), as do other ESPN analysts. Basketball Reference includes PER in their “Advanced” statistical tables (present on every player and team page), and also use it to compute player Value Above Average and Value Above Replacement (definitions here).

The formula for PER is extremely complicated, but its core idea is simple: combine everything in a player’s stat-line by rewarding everything good (points, rebounds, assists, blocks, and steals), and punishing everything bad (missed shots, turnovers). The value of particular items are weighted by various league averages—as well as by Hollinger’s intuitions—then the overall result is calculated on a per-minute basis, adjusted for league and team pace, and normalized on a scale averaging 15.

Undoubtedly, PER is deeply flawed. But sometimes apparent “flaws” aren’t really “flaws,” but merely design limitations. For example: PER doesn’t account for defense or “intangibles,” it is calculated without resort to play-by-play data that didn’t exist prior to the last few seasons, and it compares players equally, regardless of position or role. For the most part, I will refrain from criticizing these constraints, instead focusing on a few important ways that it fails or even undermines its own objectives.

Predictivity (and: Introducing Win Differential Analysis)

Though Hollinger uses PER in his “wins added” analysis, its complete lack of any empirical component suggests that it should not be taken seriously as a predictive measure. And indeed, empirical investigation reveals that it is simply not very good at predicting a player’s actual impact:

This bubble-graph is a product of a broader study I’ve been working on that correlates various player statistics to the difference in their team’s per-game performance with them in and out of the line-up. The study’s dataset includes all NBA games back to 1986, and this particular graph is based on the 1300ish seasons in which a player who averaged 20+ minutes per game both missed and played at least 20 games. Win% differential is the difference in the player’s team’s winning percentage with and without him (for the correlation, each data-point is weighted by the smaller of games missed or played. I will have much more to write about nitty-gritty of this technique in separate posts).

So PER appears to do poorly, but how does it compare to other valuation metrics?

SecFor (or “Secret Formula”) is the current iteration of an empirically-based “All in One” metric that I’m developing—but there is no shame in a speculative purely a priori metric losing (even badly) as a predictor to the empirical cutting-edge.

However, as I admitted in the introduction to this series, my statistical interest in Dennis Rodman goes way back. One of the first spreadsheets I ever created was in the early 1990’s, when Rodman still played for San Antonio. I knew Rodman was a sick rebounder, but rarely scored—so naturally I thought: “If only there were a formula that combined all of a player’s statistics into one number that would reflect his total contribution.” So I came up with this crude, speculative, purely a priori equation:

Points + Rebounds + 2*Assists + 1.5*Blocks + 2*Steals – 2*Turnovers.

Unfortunately, this metric (which I called “PRABS”) failed to shed much light on the Rodman problem, so I shelved it. PER shares the same intention and core technique, albeit with many additional layers of complexity. For all of this refinement, however, Hollinger has somehow managed to make a bad metric even worse, getting beaten by my OG PRABS by nearly as much as he is able to beat points per game—the Flat Earth of basketball valuation metrics. So how did this happen?

Minutes

The trend in much of basketball analysis is to rate players by their per-minute or per-possession contributions. This approach does produce interesting and useful information, and they may be especially useful to a coach who is deciding who to give more minutes to, or to a GM who is trying to evaluate which bench player to sign in free agency.

But a player’s contribution to winning is necessarily going to be a function of how much extra “win” he is able to get you per minute and the number of minutes you are able to get from him. Let’s turn again to win differential:

For this graph, I set up a regression using each of the major rate stats, plus minutes played (TS%=true shooting percentage, or one half of average points per shot, including free throws and 3 pointers). If you don’t know what a “normalized coefficient” is, just think of it as a stat for comparing the relative importance of regression elements that come in different shapes and sizes. The sample is the same as above: it only includes players who average 20+ minutes per game.

Unsurprisingly, “minutes per game” is more predictive than any individual rate statistic, including true shooting. Simply multiplying PER by minutes played significantly improves its predictive power, managing to pull it into a dead-heat with PRABS (which obviously wasn’t minute-adjusted to begin with).

I’m hesitant to be too critical of the “per minute” design decision, since it is clearly an intentional element that allows PER to be used for bench or rotational player valuation, but ultimately I think this comes down to telos: So long as PER pretends to be an arbiter of player value—which Hollinger himself relies on for making actual predictions about team performance—then minutes are simply too important to ignore. If you want a way to evaluate part-time players and how they might contribute IF they could take on larger roles, then it is easy enough to create a second metric tailored to that end.

Here’s a similar example from baseball that confounds me: Rate stats are fine for evaluating position players, because nearly all of them are able to get you an entire game if you want—but when it comes to pitching, how often someone can play and the number of innings they can give you is of paramount importance. E.g., at least for starting pitchers, it seems to me that ERA is backwards: rather than calculate runs allowed per inning, why don’t they focus on runs denied per game? Using a benchmark of 4.5, it would be extremely easy to calculate: Innings Pitched/2 – Earned Runs. So, if a pitcher gets you 7 innings and allows 2 runs, their “Earned Runs Denied” (ERD) for the game would be 1.5. I have no pretensions of being a sabermetrician, and I’m sure this kind of stat (and much better) is common in that community, but I see no reason why this kind of statistic isn’t mainstream.

More broadly, I think this minutes SNAFU is reflective of an otherwise reasonable trend in the sports analytical community—to evaluate everything in terms of rates and quality instead of quantity—that is often taken too far. In reality, both may be useful, and the optimal balance in a particular situation is an empirical question that deserves investigation in its own right.

PER Rewards Shooting (and Punishes Not Shooting)

As described by David Berri, PER is well-known to reward inefficient shooting:

“Hollinger argues that each two point field goal made is worth about 1.65 points. A three point field goal made is worth 2.65 points. A missed field goal, though, costs a team 0.72 points. Given these values, with a bit of math we can show that a player will break even on his two point field goal attempts if he hits on 30.4% of these shots. On three pointers the break-even point is 21.4%. If a player exceeds these thresholds, and virtually every NBA player does so with respect to two-point shots, the more he shoots the higher his value in PERs. So a player can be an inefficient scorer and simply inflate his value by taking a large number of shots.”

The consequences of this should be properly understood: Since this feature of PER applies to every shot taken, it is not only the inefficient players who inflate their stats. PER gives a boost to everyone for every shot: Bad players who take bad shots can look merely mediocre, mediocre players who take mediocre shots can look like good players, and good players who take good shots can look like stars. For Dennis Rodman’s case—as someone who took very few shots, good or bad— the necessary converse of this is even more significant: since PER is a comparative statistic (even directly adjusted by league averages), players who don’t take a lot of shots are punished.
Structurally, PER favors shooting—but to what extent? To get a sense of it, let’s plot PER against usage rate:

^{Note: Data includes all player seasons since 1986. Usage % is the percentage of team possessions that end with a shot, free throw attempt, or turnover by the player in question. For most practical purposes, it measures how frequently the player shoots the ball.}

That R-squared value corresponds to a correlation of .628, which might seem high for a component that should be in the denominator. Of course, correlations are tricky, and there are a number of reasons why this relationship could be so strong. For example, the most efficient shooters might take the most shots. Let’s see:

Actually, that trend-line doesn’t quite do it justice: that R-squared value corresponds to a correlation of .11 (even weaker than I would have guessed).

I should note one caveat: The mostly flat relationship between usage and shooting may be skewed, in part, by the fact that better shooters are often required to take worse shots, not just more shots—particularly if they are the shooter of last resort. A player that manages to make a mediocre shot out of a bad situation can increase his team’s chances of winning, just as a player that takes a marginally good shot when a slam dunk is available may be hurting his team’s chances. Presently, no well-known shooting metrics account for this (though I am working on it), but to be perfectly clear for the purposes of this post: neither does PER. The strong correlation between usage rate and PER is unrelated. There is nothing in its structure to suggest this is an intended factor, and there is nothing in its (poor) empirical performance that would suggest it is even unintentionally addressed. In other words, it doesn’t account for complex shooting dynamics either in theory or in practice.

Duplicability and Linearity

PER strongly rewards broad mediocrity, and thus punishes lack of the same. In reality, not every point that a player scores means their team will score one more point, just as not every rebound grabbed means that their team will get one more possession. Conversely—and especially pertinent to Dennis Rodman—not every point that a player doesn’t score actually costs his team a point. What a player gets credit for in his stat line doesn’t necessarily correspond with his actual contribution, because there is always a chance that the good things he played a part in would have happened anyway. This leads to a whole set of issues that I typically file under the term “duplicability.”

A related (but sometimes confused) effect that has been studied extensively by very good basketball analysts is the problem of “diminishing returns” – which can be easily illustrated like this: if you put a team together with 5 players that normally score 25 points each, it doesn’t mean that your team will suddenly start scoring 125 points a game. Conversely—and again pertinent to Rodman—say your team has 5 players that normally score 20 points each, and you replace one of them with somebody that normally only scores 10, that does not mean that your team will suddenly start scoring only 90. Only one player can take a shot at a time, and what matters is whether the player’s lack of scoring hurts his team’s offense or not. The extent of this effect can be measured individually for different basketball statistics, and, indeed, studies have showed wide disparities.

As I will discuss at length in Part 2(c), despite hardly ever scoring, differential stats show that Rodman didn’t hurt his teams offenses at all: even after accounting for extra possessions that Rodman’s teams gained from offensive rebounds, his effect on offensive efficiency was statistically insignificant. In this case (as with Randy Moss), we are fortunate that Rodman had such a tumultuous career: as a result, he missed a significant number of games in a season several times with several different teams—this makes for good indirect data. But, for this post’s purposes, the burning question is: Is there any direct way to tell how likely a player’s statistical contributions were to have actually converted into team results?

This is an extremely difficult and intricate problem (though I am working on it), but it is easy enough to prove at least one way that a metric like PER gets it wrong: it treats all of the different components of player contribution linearly. In other words, one more point is worth one more point, whether it is the 15th point that a player scores or the 25th, and one more rebound is worth one more rebound, whether it is the 8th or the 18th. While this equivalency makes designing an all-in one equation much easier (at least for now, my Secret Formula metric is also linear), it is ultimately just another empirically testable assumption.

I have theorized that one reason Rodman’s PER stats are so low compared to his differential stats is that PER punishes his lack of mediocre scoring, while failing to reward the extremeness of his rebounding. This is based on the hypothesis that certain extreme statistics would be less “duplicable” than mediocre ones. As a result, the difference between a player getting 18 rebounds per game vs. getting 16 per game could be much greater than the difference between them getting 8 vs. getting 6. Or, in other words, the marginal value of rebounds would (hypothetically) be increasing.

Using win percentage differentials, this is a testable theory. Just as we can correlate an individual player’s statistics to the win differentials of his team, we can also correlate hypothetical statistics the same way. So say we want to test a metric like rebounds, except one that has increasing marginal value built in: a simple way to approximate that effect is to make our metric increase exponentially, such as using rebounds squared. If we need even more increasing marginal value, we can try rebounds cubed, etc. And if our metric has several different components (like PER), we can do the same for the individual parts: the beauty is that, at the end of the day, we can test—empirically—which metrics work and which don’t.

For those who don’t immediately grasp the math involved, I’ll go into a little detail: A linear relationship is really just an exponential relationship with an exponent of 1. So let’s consider a toy metric, “PR,” which is calculated as follows: Points + Rebounds. This is a linear equation (exponent = 1) that could be rewritten as follows: (Points)^1 + (Rebounds)^1. However, if, as above, we thought that both points and rebounds should have increasing marginal values, we might want to try a metric (call it “PRsq”) that combined points and rebounds squared, as follows: (Points)^2 + (Rebounds)^2. And so on. Here’s an example table demonstrating the increase in marginal value:

The fact that each different metric leads to vastly different magnitudes of value is irrelevant: for predictive purposes, the total value for each component will be normalized — the relative value is what matters (just as “number of pennies” and “number of quarters” are equally predictive of how much money you have in your pocket). So applying this concept to an even wider range of exponents for several relevant individual player statistics, we can empirically examine just how “exponential” each statistic really is:

For this graph, I looked at each of the major rate metrics (plus points per game) individually. So, for each player-season in my (1986-) sample, I calculated the number of points, points squared, points^3rd. . . points^10th power, and then correlated all of these to that player’s win percentage differential. From those calculations, we can find roughly how much the marginal value for each metric increases, based on what exponent produces the best correlation: The smaller the number at the peak of the curve, the more linear the metric is—the higher the number, the more exponential (i.e., extreme values are that much more important). When I ran this computation, the relative shape of each curve fit my intuitions, but the magnitudes surprised me: That is, many of the metrics turned out to be even more exponential than I would have guessed.

As I know this may be confusing to many of my readers, I need to be absolutely clear: the shape of each curve has nothing to do with the actual importance of each metric. It only tells us how much that particular metric is sensitive to very large values. E.g., the fact that Blocks and Assists peak on the left and sharply decline doesn’t make them more or less important than any of the others, it simply means that having 1 block in your scoreline instead of 0 is relatively just as valuable as having 5 blocks instead of 4. On the other extreme, turnovers peak somewhere off the chart, suggesting that turnover rates matter most when they are extremely high.

For now, I’m not trying to draw a conclusive picture about exactly what exponents would make for an ideal all-in-one equation (polynomial regressions are very very tricky, though I may wade into those difficulties more in future blog posts). But as a minimum outcome, I think the data strongly supports my hypothesis: that many stats—especially rebounds—are exponential predictors. Thus, I mean this less as a criticism of PER than as an explanation of why it undervalues players like Dennis Rodman.

Gross, and Points

In subsection (i), I concluded that “gross points” as a metric for player valuation had two main flaws: gross, and points. Superficially, PER responds to both of these flaws directly: it attempts to correct the “gross” problem both by punishing bad shots, and by adjusting for pace and minutes. It attacks the “points” problem by adding rebounds, assists, blocks, steals, and turnovers. The problem is, these “solutions” don’t match up particularly well with the problems “gross” and “points” present.
The problem with the “grossness” of points certainly wasn’t minutes (note: for historical comparisons, pace adjustments are probably necessary, but the jury is still out on the wisdom of doing the same on a team-by-team basis within a season). The main problem with “gross” was shooting efficiency: If someone takes a bunch of shots, they will eventually score a lot of points. But scoring points is just another thing that players do that may or may not help their teams win. PER attempted to account for this by punishing missed shots, but didn’t go far enough. The original problem with “gross” persists: As discussed above, taking shots helps your rating, whether they are good shots or not.

As for “points”: in addition to any problems created by having arbitrary (non-empirical) and linear coefficients, the strong bias towards shooting causes PER to undermine its key innovation—the incorporation of non-point components. This “bias” can be represented visually:

^{Note: This data comes from a regression to PER including each of the rate stats corresponding to the various components of PER.}

This pie chart is based on a linear regression including rate stats for each of PER’s components. Strictly, what it tells us is the relative value of each factor to predicting PER if each of the other factors were known. Thus, the “usage” section of this pie represents the advantage gained by taking more shots—even if all your other rate stats were fixed. Or, in other words, pure bias (note that the number of shots a player takes is almost as predictive as his shooting ability).

For fun, let’s compare that pie to the exact same regression run on Points Per Game rather than PER:

^{Note: These would not be the best variables to select if you were actually trying to predict a player’s Points Per Game. Note also that “Usage” in these charts is NOT like “Other”—while other variables may affect PPG, and/or may affect the items in this regression, they are not represented in these charts.}

Interestingly, Points Per Game was already somewhat predictable by shooting ability, turnovers, defensive rebounding, and assists. While I hesitate to draw conclusions from the aesthetic comparison, we can guess why perhaps PER doesn’t beat PPG as significantly as we might expect: it appears to share much of the same DNA. (My more wild and ambitious thoughts suspect that these similarities reflect the strength of our broader pro-points bias: even when designing an All-in-One statistic, even Hollinger’s linear, non-empirical, a priori coefficients still mostly reflect the conventional wisdom about the importance of many of the factors, as reflected in the way that they relate directly to points per game).

I could make a similar pie-chart for Win% differential, but I think it might give the wrong impression: these aren’t even close to the best set of variables to use for that purpose. Suffice it to say that it would look very, very different (for an imperfect picture of how much so, you can compare to the values in the Relative Importance chart above).

Conclusions

The deeper irony with PER is not just that it could theoretically be better, but that it adds many levels of complexity to the problem it purports to address, ultimately failing in strikingly similar ways. It has been dressed up around the edges with various adjustments for team and league pace, incorporation of league averages to weight rebounds and value of possession, etc. This is, to coin a phrase, like putting lipstick on a pig. The energy that Hollinger has spent on dressing up his model could have been better spent rethinking the core of it.

In my estimation, this pattern persists among many extremely smart people who generate innovative models and ideas: once created, they spend most of their time—entire careers even—in order: 1) defending it, 2) applying it to new situations, and 3) tweaking it. This happens in just about every field: hard and soft sciences, economics, history, philosophy, even literature. Give me an academic who creates an interesting and meaningful model, and then immediately devotes their best efforts to tearing it apart! In all my education, I have had perhaps two professors who embraced this approach, and I would rank both among my very favorites.

This post and the last were admittedly relatively light on Rodman-specific analysis, but that will change with a vengeance in the next two. Stay tuned.

Update (5/13/11): Commenter “Yariv” correctly points out that an “exponential” curve is technically one in the form y^x (such as 2^x, 3^x, etc), where the increasing marginal value I’m referring to in the “Linearity” section above is about terms in the form x^y (e.g., x^2, x^3, etc), or monomial terms with an exponent not equal to 1. I apologize for any confusion, and I’ll rewrite the section when I have time.

Yes ESPN, Professional Kickers are Big Fat Chokers

A couple of days ago, ESPN’s Peter Keating blogged about “icing the kicker” (i.e., calling timeouts before important kicks, sometimes mere instants before the ball is snapped). He argues that the practice appears to work, at least in overtime. Ultimately, however, he concludes that his sample is too small to be “statistically significant.” This may be one of the few times in history where I actually think a sports analyst underestimates the probative value of a small sample: as I will show, kickers are generally worse in overtime than they are in regulation, and practically all of the difference can be attributed to iced kickers. More importantly, even with the minuscule sample Keating uses, their performance is so bad that it actually is “significant” beyond the 95% level.

In Keating’s 10 year data-set, kickers in overtime only made 58.1% of their 35+ yard kicks following an opponent’s timeout, as opposed to 72.7% when no timeout was called. The total sample size is only 75 kicks, 31 of which were iced. But the key to the analysis is buried in the spreadsheet Keating links to: the average length of attempted field goals by iced kickers in OT was only 41.87 yards, vs. 43.84 yards for kickers at room temperature. Keating mentions this fact in passing, mainly to address the potential objection that perhaps the iced kickers just had harder kicks — but the difference is actually much more significant.
To evaluate this question properly, we first need to look at made field goal percentages broken down by yard-line. I assume many people have done this before, but in 2 minutes of googling I couldn’t find anything useful, so I used play-by-play data from 2000-2009 to create the following graph:

The blue dots indicate the overall field-goal percentage from each yard-line for every field goal attempt in the period (around 7500 attempts total – though I’ve excluded the one 76 yard attempt, for purely aesthetic reasons). The red dots are the predicted values of a logistic regression (basically a statistical tool for predicting things that come in percentages) on the entire sample. Note this is NOT a simple trend-line — it takes every data point into account, not just the averages. If you’re curious, the corresponding equation (for predicted field goal percentage based on yard line x) is as follows:

$\large{1 - \dfrac{e^{-5.5938+0.1066x}} {1+e^{-5.5938+0.1066x}}}$

The first thing you might notice about the graph is that the predictions appear to be somewhat (perhaps unrealistically) optimistic about very long kicks. There are a number of possible explanations for this, chiefly that there are comparatively few really long kicks in the sample, and beyond a certain distance the angle of the kick relative to the offensive and defensive linemen becomes a big factor that is not adequately reflected by the rest of the data (fortunately, this is not important for where we are headed). The next step is to look at a similar graph for overtime only — since the sample is so much smaller, this time I’ll use a bubble-chart to give a better idea of how many attempts there were at each distance:

For this graph, the sample is about 1/100th the size of the one above, and the regression line is generated from the OT data only. As a matter of basic spatial reasoning — even if you’re not a math whiz — you may sense that this line is less trustworthy. Nevertheless, let’s look at a comparison of the overall and OT-based predictions for the 35+ yard attempts only:

^{Note: These two lines are slightly different from their counterparts above. To avoid bias created by smaller or larger values, and to match Keating’s sample, I re-ran the regressions using only 35+ yard distances that had been attempted in overtime (they turned out virtually the same anyway).}

Comparing the two models, we can create a predicted “Choke Factor,” which is the percentage of the original conversion rate that you should knock off for a kicker in an overtime situation:

A weighted average (by the number of OT attempts at each distance) gives us a typical Choke Factor of just over 6%. But take this graph with a grain of salt: the fact that it slopes upward so steeply is a result of the differing coefficients in the respective regression equations, and could certainly be a statistical artifact. For my purposes however, this entire digression into overtime performance drop-offs is merely for illustration: The main calculation relevant to Keating’s iced kick discussion is a simple binomial probability: Given an average kick length of 41.87 yards, which carries a predicted conversion rate of 75.6%, what are the odds of converting only 18 or fewer out of 31 attempts? OK, this may be a mildly tricky problem if you’re doing it longhand, but fortunately for us, Excel has a BINOM.DIST() function that makes it easy:

^{Note : for people who might not pick: Yes, the predicted conversion rate for the average length is not going to be exactly the same as the average predicted value for the length of each kick. But it is very close, and close enough.}

As you can see, the OT kickers who were not iced actually did very slightly better than average, which means that all of the negative bias observed in OT kicking stems from the poor performance seen in just 31 iced kick attempts. The probability of this result occurring by chance — assuming the expected conversion rate for OT iced kicks were equal to the expected conversion rate for kicks overall — would be only 2.4%. Of course, “probability of occurring by chance” is the definition of statistical significance, and since 95% against (i.e., less than 5% chance of happening) is the typical threshold for people to make bold assertions, I think Keating’s statement that this “doesn’t reach the level of improbability we need to call it statistically significant” is unnecessarily humble. Moreover, when I stated that the key to this analysis was the 2 yard difference that Keating glossed over, that wasn’t for rhetorical flourish: if the length of the average OT iced kick had been the same as the length of the average OT regular kick, the 58.1% would correspond to a “by chance” probability of 7.6%, obviously not making it under the magic number.

Applied Epistemology in Politics and the Playoffs

Two nights ago, as I was watching cable news and reading various online articles and blog posts about Christine O’Donnell’s upset win over Michael Castle in Delaware’s Republican Senate primary, the hasty, almost ferocious emergence of consensus among the punditocracy – to wit, that the GOP now has virtually zero chance of picking up that seat in November – reminded me of an issue that I’ve wanted to blog about since long before I began blogging in earnest: NFL playoff prediction models.

Specifically, I have been critical of those models that project the likelihood of each surviving team winning the Super Bowl by applying a logistic regression model (i.e., “odds of winning based on past performance”) to each remaining game. In January, I posted a number of comments to this article on Advanced NFL Stats, in which I found it absurd that, with 8 teams left, his model predicted that the Dallas Cowboys had about the same chance of winning the Super Bowl as the Jets, Ravens, Vikings, and Cardinals combined. In the brief discussion, I gave two reasons (in addition to my intuition): first, that these predictions were wildly out of whack with contract prices in sports-betting markets, and second, that I didn’t believe the model sufficiently accounted for “variance in the underlying statistics.” Burke suggested that the first point is explained by a massive epidemic of conjunction-fallacyitis among sports bettors. On its face, I think this is a ridiculous explanation: i.e., does he really believe that the market-movers in sports betting — people who put up hundreds of thousands (if not millions) of dollars of their own money — have never considered multiplying the odds of several games together? Regardless, in this post I will put forth a much better explanation for this disparity than either of us proffered at the time, hopefully mooting that discussion. On my second point, he was more dismissive, though I was being rather opaque (and somehow misspelled “beat” in one reply), so I don’t blame him. However, I do think Burke’s intellectual hubris regarding his model (aka “model hubris”) is notable – not because I have any reason to think Burke is a particularly hubristic individual, but because I think it is indicative of a massive epidemic of model-hubrisitis among sports bloggers.

In Section 1 of this post, I will discuss what I personally mean by “applied epistemology” (with apologies to any actual applied epistemologists out there) and what I think some of its more-important implications are. In Section 2, I will try to apply these concepts by taking a more detailed look at my problems with the above-mentioned playoff prediction models.

Section 1: Applied Epistemology Explained, Sort Of

For those who might not know, “epistemology” is essentially a fancy word for the “philosophical study of knowledge,” which mostly involves philosophers trying to define the word “knowledge” and/or trying to figure out what we know (if anything), and/or how we came to know it (if we do). For important background, read my Complete History of Epistemology (abridged), which can be found here: In Plato’s Theaetetus, Socrates suggests that knowledge is something like “justified true belief.” Agreement ensues. In 1963, Edmund Gettier suggests that a person could be justified in believing something, but it could be true for the wrong reasons. Debate ensues. The End.

A “hot” topic in the field recently has been dealing with the implications of elaborate thought experiments similar to the following:

*begin experiment*
Imagine yourself in the following scenario: From childhood, you have one burning desire: to know the answer to Question X. This desire is so powerful that you dedicate your entire life to its pursuit. You work hard in school, where you excel greatly, and you master every relevant academic discipline, becoming a tenured professor at some random elite University, earning multiple doctorates in the process. You relentlessly refine and hone your (obviously considerable) reasoning skills using every method you can think of, and you gather and analyze every single piece of empirical data relevant to Question X available to man. Finally, after decades of exhaustive research and study, you have a rapid series of breakthroughs that lead you to conclude – not arbitrarily, but completely based on the proof you developed through incredible amounts of hard work and ingenuity — that the answer to Question X is definitely, 100%, without a doubt: 42. Congratulations! To celebrate the conclusion of this momentous undertaking, you decide to finally get out of the lab/house/library and go celebrate, so you head to a popular off-campus bar. You are so overjoyed about your accomplishment that you decide to buy everyone a round of drinks, only to find that some random guy — let’s call him Neb – just bought everyone a round of drinks himself. What a joyous occasion: two middle-aged individuals out on the town, with reason to celebrate (and you can probably see where this is going, but I’ll go there anyway)! As you quickly learn, it turns out that Neb is around your same age, and is also a professor at a similarly elite University in the region. In fact, it’s amazing how much you two have in common: you have relatively similar demographic histories, identical IQ, SAT, and GRE scores, you both won multiple academic awards at every level, you have both achieved similar levels of prominence in your academic community, and you have both been repeatedly published in journals of comparable prestige. In fact, as it turns out, you have both been spent your entire lives studying the same question! You have both read all the same books, you have both met, talked or worked with many comparably intelligent — or even identical — people: It is amazing that you have never met! Neb, of course, is feeling so celebratory because finally, after decades of exhaustive research and study, he has just had a rapid series of breakthroughs that lead him to finally conclude – not arbitrarily, but completely based on the proof he developed through incredible amounts of hard work and ingenuity — that the answer to Question X is definitely, 100%, without a doubt: 54.

You spend the next several hours drinking and arguing about Question X: while Neb seemed intelligent enough at first, everything he says about X seems completely off base, and even though you make several excellent points, he never seems to understand them. He argues from the wrong premises in some areas, and draws the wrong conclusions in others. He massively overvalues many factors that you are certain are not very important, and is dismissive of many factors that you are certain are crucial. His arguments, though often similar in structure to your own, are extremely unpersuasive and don’t seem to make any sense, and though you try to explain yourself to him, he stubbornly refuses to comprehend your superior reasoning. The next day, you stumble into class, where your students — who had been buzzing about your breakthrough all morning — begin pestering you with questions about Question X and 42. In your last class, you had estimated that the chances of 42 being “the answer” were around 90%, and obviously they want to know if you have finally proved 42 for certain, and if not, how likely you believe it is now. What do you tell them?

All of the research and analysis you conducted since your previous class had, indeed, led you to believe that 42 is a mortal lock. In the course of your research, everything you have thought about or observed or uncovered, as well as all of the empirical evidence you have examined or thought experiments you have considered, all lead you to believe that 42 is the answer. As you hesitate, your students wonder why, even going so far as to ask, “Have you heard any remotely persuasive arguments against 42 that we should be considering?” Can you, in good conscience, say that you know the answer to Question X? For that matter, can you even say that the odds of 42 are significantly greater than 50%? You may be inclined, as many have been, to “damn the torpedoes” and act as if Neb’s existence is irrelevant. But that view is quickly rebutted: Say one of your most enterprising students brings a special device to class: when she presses the red button marked “detonate,” if the answer to Question X is actually 42, the machine will immediately dispense $20 bills for everyone in the room; but if the answer is not actually 42, it will turn your city into rubble. And then it will search the rubble, gather any surviving puppies or kittens, and blend them.

So assuming you’re on board that your chance encounter with Professor Neb implies that, um, you might be wrong about 42, what comes next? There’s a whole interesting line of inquiry about what the new likelihood of 42 is and whether anything higher than 50% is supportable, but that’s not especially relevant to this discussion. But how about this: Say the scenario proceeds as above, you dedicate your life, yadda yadda, come to be 100% convinced of 42, but instead of going out to a bar, you decide to relax with a bubble bath and a glass of Pinot, while Neb drinks alone. You walk into class the next day, and proudly announce that the new odds of 42 are 100%. Mary Kate pulls out her special money-dispensing device, and you say sure, it’s a lock, press the button. Yay, it’s raining Andrew Jacksons in your classroom! And then: **Boom** **Meow** **Woof** **Whirrrrrrrrrrrrrr**. Apparently Mary Kate had a twin sister — she was in Neb’s class.

*end experiment*

In reality, the fact that you might be wrong, even when you’re so sure you’re right, is more than a philosophical curiosity, it is a mathematical certainty. The processes that lead you to form beliefs, even extremely strong ones, are imperfect. And when you are 100% certain that a belief-generating process is reliable, the process that led you to that belief is likely imperfect. This line of thinking is sometimes referred to as skepticism — which would be fine if it weren’t usually meant as a pejorative.

When push comes to shove, people will usually admit that there is at least some chance they are wrong, yet they massively underestimate just what those chances are. In political debates, for example, people may admit that there is some miniscule possibility that their position is ill-informed or empirically unsound, but they will almost never say that they are more likely to be wrong than to be right. Yet, when two populations hold diametrically opposed views, either one population is wrong or both are – all else being equal, the correct assessment in such scenarios is that no-one is likely to have it right.

When dealing with beliefs about probabilities, the complications get even trickier: Obviously many people believe some things are close to 100% likely to be true, when the real probability may be some-much if not much-much lower. But in addition to the extremes, people hold a whole range of poorly-calibrated probabilistic beliefs, like believing something is 60% likely when it is actually 50% or 70%. (Note: Some Philosophically trained readers may balk at this idea, suggesting that determinism entails everything having either a 0 or 100% probability of being true. While this argument may be sound in classroom discussions, it is highly unpragmatic: If I believe that I will win a coin flip 60% of the time, it may be theoretically true that the universe has already determined whether the coin will turn up heads or tails, but for all intents and purposes, I am only wrong by 10%).

But knowing that we are wrong so much of the time doesn’t tell us much by itself: it’s very hard to be right, and we do the best we can. We develop heuristics that tend towards the right answers, or — more importantly for my purposes — that allow the consequences of being wrong in both directions even out over time. You may reasonably believe that the probability of something is 30%, when, in reality, the probability is either 20% or 40%. If the two possibilities are equally likely, then your 30% belief may be functionally equivalent under many circumstances, but they are not the same, as I will demonstrate in Section 2 (note to the philosophers: you may have noticed that this is a bit like the Gettier examples: you might be “right,” but for the wrong reasons).

There is a science to being wrong, and it doesn’t mean you have to mope in your study, or act in bad faith when you’re out of it. “Applied Epistemology” (at least as this armchair philosopher defines it) is the study of the processes that lead to knowledge and beliefs, and of the practical implications of their limitations.

Part 2: NFL Playoff Prediction Models

Now, let’s finally return to the Advanced NFL Stats playoff prediction model. Burke’s methodology is simple: using a logistic regression based on various statistical indicators, the model estimates a probability for each team to win their first round matchup. It then repeats the process for all possible second round matchups, weighting each by its likelihood of occurring (as determined by the first round projections) and so on through the championship. With those results in hand, a team’s chances of winning the tournament is simply the product of their chances of winning in each round. With 8 teams remaining in the divisional stage, the model’s predictions looked like this:

Burke states that the individual game prediction model has a “history of accuracy” and is well “calibrated,” meaning that, historically, of the teams it has predicted to win 30% of the time, close to 30% of them have won, and so on. For a number of reasons, I remain somewhat skeptical of this claim, especially when it comes to “extreme value” games where the model predicts very heavy favorites or underdogs. (E.g’s: What validation safeguards do they deploy to avoid over-fitting? How did they account for the thinness of data available for extreme values in their calibration method?) But for now, let’s assume this claim is correct, and that the model is calibrated perfectly: The fact that teams predicted to win 30% of the time actually won 30% of the time does NOT mean that each team actually had a 30% chance of winning.

That 30% number is just an average. If you believe that the model perfectly nails the actual expectation for every team, you are crazy. Since there is a large and reasonably measurable amount of variance in the very small sample of underlying statistics that the predictive model relies on, it necessarily follows that many teams will have significantly under or over-performed statistically relative to their true strength, which will be reflected in the model’s predictions. The “perfect calibration” of the model only means that the error is well-hidden.

This doesn’t mean that it’s a bad model: like any heuristic, the model may be completely adequate for its intended context. For example, if you’re going to bet on an individual game, barring any other information, the average of a team’s potential chances should be functionally equivalent to their actual chances. But if you’re planning to bet on the end-result of a series of games — such as in the divisional round of the NFL playoffs — failing to understand the distribution of error could be very costly.

For example, let’s look at what happens to Minnesota and Arizona’s Super Bowl chances if we assume that the error in their winrates is uniformly distributed in the neighborhood of their predicted winrate:

For Minnesota, I created a pool of 11 possible expectations that includes the actual prediction plus teams that were 5% to 25% better or worse. I did the same for Arizona, but with half the deviation. The average win prediction for each game remains constant, but the overall chances of winning the Super Bowl change dramatically. To some of you, the difference between 2% and 1% may not seem like much, but if you could find a casino that would regularly offer you 100-1 on something that is actually a 50-1 shot, you could become very rich very quickly. Of course, this uniform distribution is a crude one of many conceivable ways that the “hidden error” could be distributed, and I have no particular reason to think it is more accurate than any other. But one thing should be abundantly clear: the winrate model on which this whole system rests tells us nothing about this distribution either.

The exact structure of this particular error distribution is mostly an empirical matter that can and should invite further study. But for the purposes of this essay, speculation may suffice. For example, here is an ad hoc distribution that I thought seemed a little more plausible than a uniform distribution:

This table shows the chances of winning the Super Bowl for a generic divisional round playoff team with an average predicted winrate of 35% for each game. In this scenario, there is a 30% chance (3/10) that the prediction gets it right on the money, a 40% chance that the team is around half as good as predicted (the bottom 4 values), a 10% chance that the team is slightly better, a 10% chance that it is significantly better, and a 10% chance that the model’s prediction is completely off its rocker. These possibilities still produce a 35% average winrate, yet, as above, the overall chances of winning the Super Bowl increase significantly (this time by almost double). Of course, 2 random hypothetical distributions don’t yet indicate a trend, so let’s look at a family of distributions to see if we can find any patterns:

This chart compares the chances of a team with a given predicted winrate to win the Super Bowl based on uniform error distributions of various sizes. So the percentages in column 1 are the odds of the team winning the Super Bowl if the predicted winrate is exactly equal to their actual winrate. Then each subsequent column is the chances of them winning the Superbowl if you increase the “pool” of potential actual winrates by one on each side. Thus, the second number after 35% is the odds of winning the Super Bowl if the team is equally likely to be have a 30%, 35%, or 40% chance in reality, etc. The maximum possible change in Super Bowl winning chances for each starting prediction is contained in the light yellow box at the end of each row. I should note that I chose this family of distributions for its ease of cross-comparison, not its precision. I also experimented with many other models that produced a variety of interesting results, yet in every even remotely plausible one of them, two trends – both highly germane to my initial criticism of Burke’s model – endured:
1. Lower predicted game odds lead to greater disparity between predicted and actual chances.
To further illustrate this, here’s a vertical slice of the data, containing the net change for each possible prediction, given a discreet uniform error distribution of size 7:

2. Greater error ranges in the underlying distribution lead to greater disparity between predicted and actual chances.

To further illustrate this, here’s a horizontal slice of the data, containing the net change for each possible error range, given an initial winrate prediction of 35%:

Of course these underlying error distributions can and should be examined further, but even at this early stage of inquiry, we “know” enough (at least with a high degree of probability) to begin drawing conclusions. I.e., We know there is considerable variance in the statistics that Burke’s model relies on, which strongly suggests that there is a considerable amount of “hidden error” in its predictions. We know greater “hidden error” leads to greater disparity in predicted Super Bowl winning chances, and that this disparity is greatest for underdogs. Therefore, it is highly likely that this model significantly under-represents the chances of underdog teams at the divisional stage of the playoffs going on to win the Superbowl. Q.E.D.

This doesn’t mean that these problems aren’t fixable: the nature of the error distribution of the individual game-predicting model could be investigated and modeled itself, and the results could be used to adjust Burke’s playoff predictions accordingly. Alternatively, if you want to avoid the sticky business of characterizing all that hidden error, a Super-Bowl prediction model could be built that deals with that problem heuristically: say, by running a logistical regression that uses the available data to predict each team’s chances of winning the Super Bowl directly.

Finally, I believe this evidence both directly and indirectly supports my intuition that the large disparity between Burke’s predictions and the corresponding contract prices was more likely to be the result of model error than market error. The direct support should be obvious, but the indirect support is also interesting: Though markets can get it wrong just as much or more than any other process, I think that people who “put their money where their mouth is” (especially those with the most influence on the markets) tend to be more reliably skeptical and less dogmatic about making their investments than bloggers, analysts or even academics are about publishing their opinions. Moreover, by its nature, the market takes a much more pluralistic approach to addressing controversies than do most individuals. While this may leave it susceptible to being marginally outperformed (on balance) by more directly focused individual models or persons, I think it will also be more likely to avoid pitfalls like the one above.

Conclusions, and My Broader Agenda

The general purpose of post is to demonstrate both the importance and difficulty of understanding and characterizing the ways in which our beliefs – and the processes we use to form them — can get it wrong. This is, at its heart, a delicate but extremely pragmatic endeavor. It involves being appropriately skeptical of various conclusions — even when they seem right to you – and recognizing the implications of the multitude of ways that such error can manifest.

I have a whole slew of ideas about how to apply these principles when evaluating the various pronouncements made by the political commentariat, but the blogosphere already has a Nate Silver (and Mr. Silver is smarter than me anyway), so I’ll leave that for you to consider as you see fit.

Easy NFL Predictions, the SkyNet Way

In this post I briefly discussed regression to the mean in the NFL, as well as the difficulty one can face trying to beat a simple prediction model based on even a single highly probative variable. Indeed, for all the extensive research and cutting-edge analysis they conduct at Football Outsiders, they are seemingly unable to beat “Koko,” which is just about the simplest regression model known to primates.

Of course, since there’s no way I could out-analyze F.O. myself — especially if I wanted to get any predictions out before tonight’s NFL opener – I decided to let my computer do the work for me: this is what neural networks are all about. In case you’re not familiar, a neural network is a learning algorithm that can be used as a tool to process large quantities of data with many different variables — even if you don’t know which variables are the most important, or how they interact with each other.

The graphic to the right is the end result of several whole minutes of diligent configuration (after a lot of tedious data collection, of course). It uses 60 variables (which are listed under the fold below), though I should note that I didn’t choose them because of their incredible probative value – many are extremely collinear, if not pointless — I mostly just took what was available on the team and league summary pages on Pro Football Reference, and then calculated a few (non-advanced) rate stats and such in Excel.

Now, I don’t want to get too technical, but there are a few things about my methodology that I need to explain. First, predictive models of all types have two main areas of concern: under-fitting and over fitting. Football Outsiders, for example, creates models that “under fit” their predictions. That is to say, however interesting the individual components may be, they’re not very good at predicting what they’re supposed to. Honestly, I’m not sure if F.O. even checks their models against the data, but this is a common problem in sports analytics: the analyst gets so caught up designing their model a priori that they forget to check whether it actually fits the empirical data. On the other hand, to the diligent empirically-driven model-maker, overfitting — which is what happens when your model tries too hard to explain the data — can be just as pernicious. When you complicate your equations or add more and more variables, it gives your model more opportunity to find an “answer” that fits even relatively large data-sets, but which may not be nearly as accurate when applied elsewhere.

For example, to create my model, I used data from the introduction of the Salary Cap in 1994 on. When excluding seasons where a team had no previous or next season to compare to, this left me with a sample of 464 seasons. Even with a sample this large, if you include enough variables you should get good-looking results: a linear regression will appear to make “predictions” that would make any gambler salivate, and a Neural Network will make “predictions” that would make Nostradamus salivate. But when you take those models and try to apply them to new situations, the gambler and Nostradamus may be in for a big disappointment. This is because there’s a good chance your model is “overfit”, meaning it is tailored specifically to explain your dataset rather than to identifying the outside factors that the data-set reveals. Obviously it can be problematic if we simply use the present data to explain the present data. “Model validation” is a process (woefully ignored in typical sports analysis), by which you make sure that your model is capable of predicting data as well as explaining it. One of the simplest such methods is called “split validation.” This involves randomly splitting your sample in half, creating a “practice set” and a “test set,” and then deriving your model from the practice set while applying it to the test set. If “deriving” a model is confusing to you, think of it like this: you are using half of your data to find an explanation for what’s going on and then checking the other half to see if that explanation seems to work. The upside to this is that if your method of model-creating can pass this test reliably, your models should be just as accurate on new data as they are on the data you already have. The downside is that you have to cut your sample size in half, which leads to bigger swings in your results, meaning you have to repeat the process multiple times to be sure that your methodology didn’t just get lucky on one round.

For this model, the main method I am going to use to evaluate predictions is a simple correlation between predicted outcomes and actual outcomes. The dependent variable (or variable I am trying to predict), is the next season’s wins. As a baseline, I created a linear correlation against SRS, or “Simple Rating System,” which is PFR’s term for margin of victory adjusted for strength of schedule. This is the single most probative common statistic when it comes to predicting the next season’s wins, and as I’ve said repeatedly, beating a regression of one highly probative variable can be a lot of work for not much gain. To earn any bragging rights as a model-maker, I think you should be able to beat the linear SRS predictions by at least 5%, since that’s approximately the edge you would need to win money gambling against it in a casino. For further comparison, I also created a “Massive Linear” model, which uses the majority of the variables that go into the neural network (excluding collinear variables and variables that have almost no predictive value). For the ultimate test, I’ve created one model that is a linear regression using only the most probative variables, AND I allowed it to use the whole sample space (that is, I allowed it to cheat and use the same data that it is predicting to build its predictions). For my “simple” neural network, of course, I didn’t do any variable-weighting or analysis myself, and it required very little configuration: I used a very slow ‘learning rate’ (.025 if that means anything to you) with a very high number of learning cycles (5000), with decay on. For the validated models, I repeated this process about 20 times and averaged the outcomes. I have also included the results from running the data through the “Koko” model, and added results from the last 2 years of Football Outsiders predictions. As you will see, the neural network was able to beat the other models fairly handily:

^{Football Outsider numbers are obviously not since 1994. Note that Koko actually performs on par with F.O. overall, though both are pretty weak compared to the SRS regression or the cheat regression. “Koko” performed very well last season, posting a .560 correlation, though apparently last season was highly “predictable,” as all of the models based on previous patterns performed extremely well. Note also that the Massive Linear model performs poorly: this is as a result of overfitting, as explained above.}

Now here is where it gets interesting. When I first envisioned this post, I was planning to title it “Why I Don’t Make Predictions; And: Predictions!” — on the theory that, given the extreme variance in the sport, any highly-accurate model would probably produce incredibly boring results. That is, most teams would end up relatively close to the mean, and the “better” teams would normally just be the better teams from the year before. But when applied the neural network to the data for this season, I was extremely surprised by its apparent boldness:

^{I should note that the numbers will not add up perfectly as far as divisions and conferences go. In fact, I slightly adjusted them proportionally to make them fit the correct number of games for the league as a whole (which should have little or positive effect on its predictive power). SkyNet does not know the rules of football or the structure of the league, and its main goal is to make the most accurate predictions on a team by team basis, and then destroy humanity.}

Wait, what? New Orleans struggling to make the playoffs? Oakland with a better record than San Diego? The Jets as the league’s best team? New England is out?!? These are not the predictions of a milquetoast forecaster, so I am pleased to see that my simple creation has gonads. Of course there is obviously a huge amount of variance in this process, and a .43 correlation still leaves a lot to chance. But just to be completely clear, this is exactly the same model that soundly beat Koko, Football Outsiders, and several reasonable linear regressions — some of which were allowed to cheat – over the past 15 years. In my limited experience, neural networks are often capable of beating conventional models even when they produce some bizarre outcomes: For example, one of my early NBA playoff wins-predicting neural networks was able to beat most linear regressions by a similar (though slightly smaller) margin, even though it predicted negative wins for several teams. Anyway, I look forward to seeing how the model does this season. Though, in my heart of hearts, if the Jets win the Super Bowl, I may fear for the future of mankind.

A list of all the input variables, after the jump:

Read the rest of this entry »

Graph of the Day 2: NFL Regression—Descent Into Chaos

I guess it’s funky graph day here at SSA:
This one corresponds to the bubble-graphs in this post about regression to the mean before and after the introduction of the salary cap. Each colored ball represents one of the 32 teams, with wins in year n on the x axis and wins in year n+1 on the y axis. In case you don’t find the visual interesting enough in its own right, you’re supposed to notice that it gets crazier right around 1993.

The Case for Dennis Rodman, Part 1/4 (b)—Defying the Laws of Nature

In this post I will be continuing my analysis of just how dominant Dennis Rodman’s rebounding was. Subsequently, section (c) will cover my analysis of Wilt Chamberlain and Bill Russell, and Part 2 of the series will begin the process of evaluating Rodman’s worth overall.

For today’s analysis, I will be examining a particularly remarkable aspect of Rodman’s rebounding: his ability to dominate the boards on both ends of the court. I believe this at least partially gets at a common anti-Rodman argument: that his rebounding statistics should be discounted because he concentrated on rebounding to the exclusion of all else. This position was publicly articulated by Charles Barkley back when they were both still playing, with Charles claiming that he could also get 18+ rebounds every night if he wanted to. Now that may be true, and it’s possible that Rodman would have been an even better player if he had been more well-rounded, but one thing I am fairly certain of is that Barkley could not have gotten as many rebounds as Rodman the same way that Rodman did.

The key point here is that, normally, you can be a great offensive rebounder, or you can be a great defensive rebounder, but it’s very hard to be both. Unless you’re Dennis Rodman:

To prepare the data for this graph, I took the top 1000 rebounding seasons by total rebounding percentage (the gold-standard of rebounding statistics, as discussed in section (a)), and ranked them 1-1000 for both offensive (ORB%) and defensive (DRB%) rates. I then scored each season by the higher (larger number) ranking of the two. E.g., if a particular season scored a 25, that would mean that it ranks in the top 25 all-time for offensive rebounding percentage and in the top 25 all-time for defensive rebounding percentage (I should note that many players who didn’t make the top 1000 seasons overall would still make the top 1000 for one of the two components, so to be specific, these are the top 1000 ORB% and DRB% seasons of the top 1000 TRB% seasons).

This score doesn’t necessarily tell us who the best rebounder was, or even who was the most balanced, but it should tell us who was the strongest in the weakest half of their game (just as you might rank the off-hand of boxers or arm wrestlers). Fortunately, however, Rodman doesn’t leave much room for doubt: his 1994-1995 season is #1 all-time on both sides. He has 5 seasons that are dual top-15, while no other NBA player has even a single season that ranks dual top-30. The graph thus shows how far down you have to go to find any player with n number of seasons at or below that ranking: Rodman has 6 seasons register on the (jokingly titled) “Ambicourtedness” scale before any other player has 1, and 8 seasons before any player has 2 (for the record, Charles Barkley’s best rating is 215).

This outcome is fairly impressive alone, and it tells us that Rodman was amazingly good at both ORB and DRB – and that this is rare — but it doesn’t tell us anything about the relationship between the two. For example, if Rodman just got twice as many rebounds as any normal player, we would expect him to lead lists like this regardless of how he did it. Thus, if you believe the hypothesis that Rodman could have dramatically increased his rebounding performance just by focusing intently on rebounds, this result might not be unexpected to you.

The problem, though, is that there are both competitive and physical limitations to how much someone can really excel at both simultaneously. Not the least of which is that offensive and defensive rebounds literally take place on opposite sides of the floor, and not everyone gets up and set for every possession. Thus, if someone wanted to cheat toward getting more rebounds on the offensive end, it would likely come, at least in some small part, at the expense of rebounds on the defensive end. Similarly, if someone’s playing style favors one, it probably (at least slightly), disfavors the other. Whether or not that particular factor is in play, at the very least you should expect a fairly strong regression to the mean: thus, if a player is excellent at one or the other, you should expect them to be not as good at the other, just as a result of the two not being perfectly correlated. To examine this empirically, I’ve put all 1000 top TRB% seasons on a scatterplot comparing offensive and defensive rebound rates:

Clearly there is a small negative correlation, as evidenced by the negative coefficient in the regression line. Note that technically, this shouldn’t be a linear relationship overall – if we graphed every pair in history from 0,0 to D,R, my graph’s trendline would be parallel to the tangent of that curve as it approaches Dennis Rodman. But what’s even more stunning is the following:

Rodman is in fact not only an outlier, he is such a ridiculously absurd alien-invader outlier that when you take him out of the equation, the equation changes drastically: The negative slope of the regression line nearly doubles in Rodman’s absence. In case you’ve forgotten, let me remind you that Rodman only accounts for 12 data points in this 1000 point sample: If that doesn’t make your jaw drop, I don’t know what will! For whatever reason, Rodman seems to be supernaturally impervious to the trade-off between offensive and defensive rebounding. Indeed, if we look at the same graph with only Rodman’s data points, we see that, for him, there is actually an extremely steep, upward sloping relationship between the two variables:

In layman’s terms, what this means is that Rodman comes in varieties of Good, Better, and Best — which is how we would expect this type of chart to look if there were no trade-off at all. Yet clearly the chart above proves that such a tradeoff exists! Dennis Rodman almost literally defies the laws of nature (or at least the laws of probability).

The ultimate point contra Barkley, et al, is that if Rodman “cheated” toward getting more rebounds all the time, we might expect that his chart would be higher than everyone else’s, but we wouldn’t have any particular reason to expect it to slope in the opposite direction. Now, this is slightly more plausible if he was “cheating” on the offensive side on the floor while maintaining a more balanced game on the defensive side, and there are any number of other logical speculations to be made about how he did it. But to some extent this transcends the normal “shift in degree” v. “shift in kind” paradigm: what we have here is a major shift in degree of a shift in kind, and we don’t have to understand it perfectly to know that it is otherworldly. At the very least, I feel confident in saying that if Charles Barkley or anyone else really believes they could replicate Rodman’s results simply by changing their playing styles, they are extremely naive.

Addendum (4/20/11):

Commenter AudacityOfHoops asks:

I don’t know if this is covered in later post (working my way through the series – excellent so far), or whether you’ll even find the comment since it’s 8 months late, but … did you create that same last chart, but for other players? Intuitively, it seems like individual players could each come in Good/Better/Best models, with positive slopes, but that when combined together the whole data set could have a negative slope.

I actually addressed this in an update post (not in the Rodman series) a while back:

A friend privately asked me what other NBA stars’ Offensive v. Defensive rebound % graphs looked like, suggesting that, while there may be a tradeoff overall, that doesn’t necessarily mean that the particular lack of tradeoff that Rodman shows is rare. This is a very good question, so I looked at similar graphs for virtually every player who had 5 or more seasons in the “Ambicourtedness Top 1000.” There are other players who have positively sloping trend-lines, though none that come close to Rodman’s. I put together a quick graph to compare Rodman to a number of other big name players who were either great rebounders (e.g., Moses Malone), perceived-great rebounders (e.g., Karl Malone, Dwight Howard), or Charles Barkley:

By my accounting, Moses Malone is almost certainly the 2nd-best rebounder of all time, and he does show a healthy dose of “ambicourtedness.” Yet note that the slope of his trendline is .717, meaning the difference between him and Rodman’s 2.346 is almost exactly twice the difference between him and the -.102 league average (1.629 v .819).

The 1-15 Rams and the Salary Cap—Watch Me Crush My Own Hypothesis

It is a quirky little fact that 1-15 teams have tended to bounce back fairly well. Since expanding to 16 games in 1978, 9 teams have hit the ignoble mark, including last year’s St. Louis Rams. Of the 8 that did it prior to 2009, all but the 1980 Saints made it back to the playoffs within 5 years, and 4 of the 8 eventually went on to win Super Bowls, combining for 8 total. The median number of wins for a 1-15 team in their next season is 7:

My grand hypothesis about this was that the implementation of the salary cap after the 1993-94 season, combined with some of the advantages I discuss below (especially 2 and 3), has been a driving force behind this small-but-sexy phenomenon: note that at least for these 8 data points, there seems to be an upward trend for wins and downward trend for years until next playoff appearance. Obviously, this sample is way too tiny to generate any conclusions, but before looking at harder data, I’d like to speculate a bit about various factors that could be at play. In addition to normally-expected regression to the mean, the chain of consequences resulting from being horrendously bad is somewhat favorable:

The primary advantages are explicitly structural: Your team picks at the top of each round in the NFL draft. According to ESPN’s “standard” draft-pick value chart, the #1 spot in the draft is worth over twice as much as the 16th pick [side note: I don’t actually buy this chart for a second. It massively overvalue 1st round picks and undervalues 2nd round picks, particularly when it comes to value added (see a good discussion here)]:
The other primary benefit, at least for one year, comes from the way the NFL sets team schedules: 14 games are played in-division and against common divisional opponents, but the last two games are set between teams that finished in equal positions the previous year (this has obviously changed many times, but there have always been similar advantages). Thus, a bottom-feeder should get a slightly easier schedule, as evidenced by the Rams having the 2nd-easiest schedule for this coming season.
There are also reliable secondary benefits to being terrible, some of which get greater the worse you are. A huge one is that, because NFL statistics are incredibly entangled (i.e., practically every player on the team has an effect on every other player’s statistics), having a bad team tends to drag everyone’s numbers down. Since the sports market – and the NFL’s in particular – is stats-based on practically every level, this means you can pay your players less than what they’re worth going forward. Under the salary cap, this leaves you more room to sign and retain key players, or go for quick fixes in free agency (which is generally unwise, but may boost your performance for a season or two).
A major tertiary effect – one that especially applies to 1-15 teams, is that embarrassed clubs tend to “clean house,” meaning, they fire coaches, get rid of old and over-priced veterans, make tough decisions about star players that they might not normally be able to make, etc. Typically they “go young,” which is advantageous not just for long-term team-building purposes, but because young players are typically the best value in the short term as well.
An undervalued quaternary effect is that new personnel and new coaching staff, in addition to hopefully being better at their jobs than their predecessors, also make your team harder to prepare for, just by virtue of being new (much like the “backup quarterback effect,” but for your whole team).
A super-important quinary effect is that. . . Ok, sorry, I can’t do it.

Of course, most of these effects are relevant to more than just 1-15 teams, so perhaps it would be better to expand the inquiry a tiny bit. For this purpose, I’ve compiled the records of every team since the merger, so beginning in 1970, and compared them to their record the following season (though it only affects one data point, I’ve treated the first Ravens season as a Browns season, and treated the new Browns as an expansion team). I counted ties as .5 wins, and normalized each season to 16 games (and rounded). I then grouped the data by wins in the initial season and plotted it on a “3D Bubble Chart.” This is basically a scatter-plot where the size of each data-point is determined by the number of examples (e.g., only 2 teams have gone undefeated, so the top-right bubble is very small). The 3D is not just for looks: the size of each sphere is determined by using the weights for volume, which makes it much less “blobby” than 2D, and it allows you to see the overlapping data points instead of just one big ink-blot:

^{*Note: again, the x-axis on this graph is wins in year n, and the y axis is wins in year n+1. Also, note that while there are only 16 “bubbles,” they represent well over a thousand data points, so this is a fairly healthy sample.}

The first thing I can see is that there’s a reasonably big and fat outlier there for 1-15 teams (the 2nd bubble from the left)! But that’s hardly a surprise considering we started this inquiry knowing that group had been doing well, and there are other issues at play: First, we can see that the graph is strikingly linear. The equation at the bottom means that to predict a team’s wins for one year, you should multiply their previous season’s win total by ~.43 and add ~4.7 (e.g.’s: an 8-win team should average about 8 wins the next year, a 4-win team should average around 6.5, and a 12-win team should average around 10). The number highlighted in blue tells you how important the previous season’s win’s are as a predictor: the higher the number, the more predictive.

So naturally the next thing to see is a breakdown of these numbers between the pre- and post-salary cap eras:

Again, these are not small sample-sets, and they both visually and numerically confirm that the salary-cap era has greatly increased parity: while there are still plenty of excellent and terrible teams overall, the better teams regress and the worse teams get better, faster. The equations after the split lead to the following predictions for 4, 8, and 12 win teams (rounded to the nearest .25):

W	Pre-SC	Post-SC
4	6.25	7
8	8.25	8
12	10.5	9.25

Yes, the difference in expected wins between a 4-win team and a 12-win team in the post-cap era is only just over 2 wins, down from over 4.

While this finding may be mildly interesting in its own right, sadly this entire endeavor was a complete and utter failure, as the graphs failed to support my hypothesis that the salary cap has made the difference for 1-15 teams specifically. As this is an uncapped season, however, I guess what’s bad news for me is good news for the Rams.

Hidden Sources of Error—A Back-Handed Defense of Football Outsiders

So I was catching up on some old blog-reading and came across this excellent post by Brian Burke, Pre-Season Predictions Are Still Worthless, showing that the Football Outsiders pre-season predictions are about as accurate as picking 8-8 for every team would be, and that a simple regression based on one variable — 6 wins plus 1/4 of the previous season’s wins — is significantly more accurate

While Brian’s anecdote about Billy Madison humorously skewers Football Outsiders, it’s not entirely fair, and I think these numbers don’t prove as much as they may appear to at first glance. Sure, a number of conventional or unconventional conclusions people have reached are probably false, but the vast majority of sports wisdom is based on valid causal inferences with at least a grain of truth. The problem is that people have a tendency to over-rely on the various causes and effects that they observe directly, conversely underestimating the causes they cannot see.

So far, so obvious. But these “hidden” causes can be broken down further, starting with two main categories, which I’ll call “random causes” and “counter-causes”:

“Random causes” are not necessarily truly random, but do not bias your conclusions in any particular direction. It is the truly random combined with the may-as-well-be-random, and generates the inherent variance of the system.

“Counter causes” are those which you may not see, but which relate to your variables in ways that counteract your inferences. The salary cap in the NFL is one of the most ubiquitous offenders: E.g. an analyst sees a very good quarterback, and for various reasons believes that QB with a particular skill-set is worth an extra 2 wins per season. That QB is obtained by an 8-8 team in free agency, so the analyst predicts that team will win 10 games. But in reality, the team that signed that quarterback had to pay handsomely for that +2 addition, and may have had to cut 2 wins worth of players to do it. If you imagine this process repeating itself over time, you will see that the correlation between QB’s with those skills and their team’s actual winrate may be small or non-existent (in reality, of course, the best quarterbacks are probably underpaid relative to their value, so this is not a problem). In closed systems like sports, these sorts of scenarios crop up all the time, and thus it is not uncommon for a perfectly valid and logical-seeming inference to be, systematically, dead wrong (by which I mean that it not only leads to an erroneous conclusion in a particular situation, but will lead to bad predictions routinely).

So how does this relate to Football Outsiders, and how does it amount to a defense of their predictions? First, I think the suggestion that FO may have created “negative knowledge” is demonstrably false: The key here is not to be fooled by the stat that they could barely beat the “coma patient” prediction of 8-8 across the board. 8 wins is the most likely outcome for any team ex ante, and every win above or below that number is less and less likely. E.g., if every outcome were the result of a flip of a coin, your best strategy would be to pick 8-8 for every team, and picking *any* team to go 10-6 or 12-4 would be terrible. Yet Football Outsiders (and others) — based on their expertise — pick many teams to have very good and very bad records. The fact that they break even against the coma patient shows that their expertise is worth something.

Second, I think there’s no shame in being unable to beat a simple regression based on one extremely probative variable: I’ve worked on a lot of predictive models, from linear regressions to neural networks, and beating a simple regression can be a lot of work for marginal gain (which, combined with the rake, is the main reason that sports-betting markets can be so tough).

Yet, getting beaten so badly by a simple regression is a definite indicator of systematic error — particularly since there is nothing preventing Football Outsiders from using a simple regression to help them make their predictions. Now, I suspect that FO is underestimating football variance, especially the extent of regression to the mean. But this is a blanket assumption that I would happily apply to just about any sports analyst — quantitative or not — and is not really of interest. However, per the distinction I made above, I believe FO is likely underestimating the “counter causes” that may temper the robustness of their inferences without necessarily invalidating them entirely. A relatively minor bias in this regard could easily lead to a significant drop in overall predictive performance, for the same reason as above: the best and worst records are by far the least likely to occur. Thus, *ever* predicting them, and expecting to gain accuracy in the process, requires an enormous amount of confidence. If Football Outsiders has that degree of confidence, I would wager that it is misplaced.