A History of Hall of Fame QB-Coach Entanglement

Last week on PTI, Dan LeBatard mentioned an interesting stat that I had never heard before: that 13 of 14 Hall of Fame coaches had Hall of Fame QB’s play for them.  LeBatard’s point was that he thought great quarterbacks make their coaches look like geniuses, and he was none-too-subtle about the implication that coaches get too much credit.  My first thought was, of course: Entanglement, anyone? That is to say, why should he conclude that the QB’s are making their coaches look better than they are instead of the other way around?  Good QB’s help their teams win, for sure, but winning teams also make their QB’s look good.  Thus – at best – LeBatard’s stat doesn’t really imply that HoF Coaches piggyback off of their QB’s success, it implies that the Coach and QB’s successes are highly entangled.  By itself, this analysis might be enough material for a tweet, but when I went to look up these 13/14 HoF coach/QB pairs, I found the history to be a little more interesting than I expected.

First, I’m still not sure exactly which 14 HoF coaches LeBatard was talking about.  According to the official website, there are 21 people in the HoF as coaches.  From what I can tell, 6 of these (Curly Lambeau, Ray Flaherty, Earle Neale, Jimmy Conzelman, Guy Chamberlain and Steve Owen) coached before the passing era, so that leaves 15 to work with.  A good deal of George Halas’s coaching career was pre-pass as well, but he didn’t quit until 1967 – 5 years later than Paul Brown – and he coached a Hall of Fame QB anyway (Sid Luckman).  Of the 15, 14 did indeed coach HoF QB’s, at least technically.

To break the list down a little, I applied two threshold tests:  1) Did the coach win any Super Bowls (or league championships before the SB era) without their HoF QB?  And 2) In the course of his career, did the coach have more than one HoF QB?  A ‘yes’ answer to either of these questions I think precludes the stereotype of a coach piggybacking off his star player (of course, having coached 2 or more Hall of Famer’s might just mean that coach got extra lucky, but subjectively I think the proxy is fairly accurate).  Here is the list of coaches eliminated by these questions:

[Table: HoF coaches eliminated by the threshold tests (Gibbs, Shula, Allen, Gillman, Ewbank, Halas)]
Joe Gibbs wins the outlier prize by a mile: not only did he win 3 championships “on his own,” he did it with 3 different non-HoF QB’s.  Don Shula had 3 separate eras of greatness, and I think would have been a lock for the hall even with the Griese era excluded.  George Allen never won a championship, but he never really had a HoF QB either: Jurgensen (HoF) served as Billy Kilmer (non-HoF)’s backup for the 4 years he played under Allen.  Sid Gillman had a long career, his sole AFL championship coming with the Chargers in 1963 – with Tobin Rote (non-HoF) under center.  Weeb Ewbank won 2 NFL championships in Baltimore with Johnny Unitas, and of course won the Super Bowl against Baltimore and Unitas with Joe Namath.  Finally, George Halas won championships with Pard Pearce (5’5”, non-HoF), Carl Brumbaugh (career passer rating: 34.9, non-HoF), Sid Luckman (HoF) and Billy Wade (non-HoF).  Plus, you know, he’s George Halas.
[Table: borderline cases (Noll, Walsh, Lombardi)]
Though Chuck Noll won all of his championships with Terry Bradshaw (HoF), those Steel Curtain teams weren’t exactly carried by the QB position (e.g., in the 1974 championship season, Bradshaw averaged less than 100 passing yards per game).  Bill Walsh is a bit more borderline: not only did all of his championships come with Joe Montana, but Montana also won a Super Bowl without him.  However, considering Walsh’s reputation as an innovator, and especially considering his incredible coaching tree (which has won nearly half of all the Super Bowls since Walsh retired in 1989), I’m willing to give him credit for his own notoriety.  Finally, Vince Lombardi, well, you know, he’s Vince Lombardi.

Which brings us to the list of the truly entangled:
[Table: the truly entangled HoF coach/QB pairs]
I waffled a little on Paul Brown, as he is generally considered an architect of the modern league (and, you know, a team is named after him), but unlike Lombardi, Walsh and Noll, Brown’s non-Otto-Graham-entangled accomplishments are mostly unrelated to coaching.  I’m sure various arguments could be made about individual names (like, “You crazy, Tom Landry is awesome”), but the point of this list isn’t to denigrate these individuals, it’s simply to say that these are the HoF coaches whose coaching successes are the most difficult to isolate from their quarterback’s.

I don’t really want to speculate about any broader implications, both because the sample is too small to make generalizations, and because my intuition is that coaches probably do get too much credit for their good fortune (whether QB-related or not).  But regardless, I think it’s clear that LeBatard’s 13/14 number is highly misleading.

Applied Epistemology in Politics and the Playoffs

Two nights ago, as I was watching cable news and reading various online articles and blog posts about Christine O’Donnell’s upset win over Michael Castle in Delaware’s Republican Senate primary, the hasty, almost ferocious emergence of consensus among the punditocracy – to wit, that the GOP now has virtually zero chance of picking up that seat in November – reminded me of an issue that I’ve wanted to blog about since long before I began blogging in earnest: NFL playoff prediction models.

Specifically, I have been critical of those models that project the likelihood of each surviving team winning the Super Bowl by applying a logistic regression model (i.e., “odds of winning based on past performance”) to each remaining game.  In January, I posted a number of comments to this article on Advanced NFL Stats, in which I found it absurd that, with 8 teams left, Brian Burke’s model predicted that the Dallas Cowboys had about the same chance of winning the Super Bowl as the Jets, Ravens, Vikings, and Cardinals combined. In the brief discussion, I gave two reasons (in addition to my intuition): first, that these predictions were wildly out of whack with contract prices in sports-betting markets, and second, that I didn’t believe the model sufficiently accounted for “variance in the underlying statistics.”  Burke suggested that the first point is explained by a massive epidemic of conjunction-fallacyitis among sports bettors.  On its face, I think this is a ridiculous explanation: i.e., does he really believe that the market-movers in sports betting — people who put up hundreds of thousands (if not millions) of dollars of their own money — have never considered multiplying the odds of several games together?  Regardless, in this post I will put forth a much better explanation for this disparity than either of us proffered at the time, hopefully mooting that discussion.  On my second point, he was more dismissive, though I was being rather opaque (and somehow misspelled “beat” in one reply), so I don’t blame him.  However, I do think Burke’s intellectual hubris regarding his model (aka “model hubris”) is notable – not because I have any reason to think Burke is a particularly hubristic individual, but because I think it is indicative of a massive epidemic of model-hubrisitis among sports bloggers.

In Section 1 of this post, I will discuss what I personally mean by “applied epistemology” (with apologies to any actual applied epistemologists out there) and what I think some of its more-important implications are.  In Section 2, I will try to apply these concepts by taking a more detailed look at my problems with the above-mentioned playoff prediction models.

Section 1: Applied Epistemology Explained, Sort Of

For those who might not know, “epistemology” is essentially a fancy word for the “philosophical study of knowledge,” which mostly involves philosophers trying to define the word “knowledge” and/or trying to figure out what we know (if anything), and/or how we came to know it (if we do).  For important background, read my Complete History of Epistemology (abridged), which can be found here: In Plato’s Theaetetus, Socrates suggests that knowledge is something like “justified true belief.”  Agreement ensues.  In 1963, Edmund Gettier suggests that a person could be justified in believing something, but it could be true for the wrong reasons.  Debate ensues.  The End.

A “hot” topic in the field recently has been dealing with the implications of elaborate thought experiments similar to the following:

*begin experiment*
Imagine yourself in the following scenario:  From childhood, you have one burning desire: to know the answer to Question X.  This desire is so powerful that you dedicate your entire life to its pursuit.  You work hard in school, where you excel greatly, and you master every relevant academic discipline, becoming a tenured professor at some random elite University, earning multiple doctorates in the process.  You relentlessly refine and hone your (obviously considerable) reasoning skills using every method you can think of, and you gather and analyze every single piece of empirical data relevant to Question X available to man.  Finally, after decades of exhaustive research and study, you have a rapid series of breakthroughs that lead you to conclude – not arbitrarily, but completely based on the proof you developed through incredible amounts of hard work and ingenuity — that the answer to Question X is definitely, 100%, without a doubt: 42.  Congratulations!  To celebrate the conclusion of this momentous undertaking, you decide to finally get out of the lab/house/library and go celebrate, so you head to a popular off-campus bar.  You are so overjoyed about your accomplishment that you decide to buy everyone a round of drinks, only to find that some random guy — let’s call him Neb – just bought everyone a round of drinks himself.  What a joyous occasion: two middle-aged individuals out on the town, with reason to celebrate (and you can probably see where this is going, but I’ll go there anyway)!  As you quickly learn, it turns out that Neb is around your same age, and is also a professor at a similarly elite University in the region.  In fact, it’s amazing how much you two have in common:  you have relatively similar demographic histories, identical IQ, SAT, and GRE scores, you both won multiple academic awards at every level, you have both achieved similar levels of prominence in your academic community, and you have both been repeatedly published in journals of comparable prestige.  In fact, as it turns out, you have both spent your entire lives studying the same question!  You have both read all the same books, you have both met, talked or worked with many comparably intelligent — or even identical — people:  It is amazing that you have never met!  Neb, of course, is feeling so celebratory because finally, after decades of exhaustive research and study, he has just had a rapid series of breakthroughs that lead him to finally conclude – not arbitrarily, but completely based on the proof he developed through incredible amounts of hard work and ingenuity — that the answer to Question X is definitely, 100%, without a doubt: 54.

You spend the next several hours drinking and arguing about Question X: while Neb seemed intelligent enough at first, everything he says about X seems completely off base, and even though you make several excellent points, he never seems to understand them.  He argues from the wrong premises in some areas, and draws the wrong conclusions in others.  He massively overvalues many factors that you are certain are not very important, and is dismissive of many factors that you are certain are crucial.  His arguments, though often similar in structure to your own, are extremely unpersuasive and don’t seem to make any sense, and though you try to explain yourself to him, he stubbornly refuses to comprehend your superior reasoning.  The next day, you stumble into class, where your students — who had been buzzing about your breakthrough all morning — begin pestering you with questions about Question X and 42.  In your last class, you had estimated that the chances of 42 being “the answer” were around 90%, and obviously they want to know if you have finally proved 42 for certain, and if not, how likely you believe it is now.  What do you tell them?

All of the research and analysis you conducted since your previous class had, indeed, led you to believe that 42 is a mortal lock.  In the course of your research, everything you have thought about or observed or uncovered, as well as all of the empirical evidence you have examined or thought experiments you have considered, all lead you to believe that 42 is the answer.  As you hesitate, your students wonder why, even going so far as to ask, “Have you heard any remotely persuasive arguments against 42 that we should be considering?”  Can you, in good conscience, say that you know the answer to Question X?  For that matter, can you even say that the odds of 42 are significantly greater than 50%?  You may be inclined, as many have been, to “damn the torpedoes” and act as if Neb’s existence is irrelevant.  But that view is quickly rebutted:  Say one of your most enterprising students brings a special device to class:  when she presses the red button marked “detonate,” if the answer to Question X is actually 42, the machine will immediately dispense $20 bills for everyone in the room; but if the answer is not actually 42, it will turn your city into rubble.  And then it will search the rubble, gather any surviving puppies or kittens, and blend them.

So assuming you’re on board that your chance encounter with Professor Neb implies that, um, you might be wrong about 42, what comes next?  There’s a whole interesting line of inquiry about what the new likelihood of 42 is and whether anything higher than 50% is supportable, but that’s not especially relevant to this discussion.  But how about this:  Say the scenario proceeds as above, you dedicate your life, yadda yadda, come to be 100% convinced of 42, but instead of going out to a bar, you decide to relax with a bubble bath and a glass of Pinot, while Neb drinks alone.  You walk into class the next day, and proudly announce that the new odds of 42 are 100%.  Mary Kate pulls out her special money-dispensing device, and you say sure, it’s a lock, press the button.  Yay, it’s raining Andrew Jacksons in your classroom!  And then: **Boom** **Meow** **Woof** **Whirrrrrrrrrrrrrr**.  Apparently Mary Kate had a twin sister — she was in Neb’s class.

*end experiment*

In reality, the fact that you might be wrong, even when you’re so sure you’re right, is more than a philosophical curiosity, it is a mathematical certainty.  The processes that lead you to form beliefs, even extremely strong ones, are imperfect.  And even when you are 100% certain that a belief-generating process is reliable, the process that led you to that certainty is itself likely imperfect.  This line of thinking is sometimes referred to as skepticism — which would be fine if it weren’t usually meant as a pejorative.

When push comes to shove, people will usually admit that there is at least some chance they are wrong, yet they massively underestimate just what those chances are.  In political debates, for example, people may admit that there is some minuscule possibility that their position is ill-informed or empirically unsound, but they will almost never say that they are more likely to be wrong than to be right.  Yet, when two populations hold diametrically opposed views, either one population is wrong or both are – all else being equal, the correct assessment in such scenarios is that no one is likely to have it right.

When dealing with beliefs about probabilities, the complications get even trickier:  Obviously many people believe some things are close to 100% likely to be true, when the real probability may be some-much if not much-much lower.  But in addition to the extremes, people hold a whole range of poorly-calibrated probabilistic beliefs, like believing something is 60% likely when it is actually 50% or 70%.  (Note: some philosophically trained readers may balk at this idea, suggesting that determinism entails everything having either a 0 or 100% probability of being true.  While this argument may be sound in classroom discussions, it is highly unpragmatic: if I believe that I will win a coin flip 60% of the time, it may be theoretically true that the universe has already determined whether the coin will turn up heads or tails, but for all intents and purposes, I am only wrong by 10%.)

But knowing that we are wrong so much of the time doesn’t tell us much by itself: it’s very hard to be right, and we do the best we can.  We develop heuristics that tend towards the right answers, or — more importantly for my purposes — that allow the consequences of being wrong in both directions to even out over time.  You may reasonably believe that the probability of something is 30%, when, in reality, the probability is either 20% or 40%.  If the two possibilities are equally likely, then your 30% belief may be functionally equivalent to the truth under many circumstances, but they are not the same, as I will demonstrate in Section 2 (note to the philosophers: you may have noticed that this is a bit like the Gettier examples: you might be “right,” but for the wrong reasons).

There is a science to being wrong, and it doesn’t mean you have to mope in your study, or act in bad faith when you’re out of it.  “Applied Epistemology” (at least as this armchair philosopher defines it) is the study of the processes that lead to knowledge and beliefs, and of the practical implications of their limitations.

Section 2: NFL Playoff Prediction Models

Now, let’s finally return to the Advanced NFL Stats playoff prediction model.  Burke’s methodology is simple: using a logistic regression based on various statistical indicators, the model estimates a probability for each team to win their first round matchup.  It then repeats the process for all possible second round matchups, weighting each by its likelihood of occurring (as determined by the first round projections) and so on through the championship.  With those results in hand, a team’s chances of winning the tournament are simply the product of their chances of winning in each round.  With 8 teams remaining in the divisional stage, the model’s predictions looked like this:

[Table: the model’s Super Bowl probabilities for the eight remaining teams]
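For concreteness, here’s a minimal sketch of that bracket arithmetic.  The team ratings below, and the logistic link from ratings to single-game odds, are hypothetical placeholders rather than Burke’s actual numbers; only the opponent-weighting scheme follows the description above:

```python
import math

def tournament_odds(bracket, beat):
    """Single-elimination odds: bracket lists teams in seed order (adjacent
    pairs meet first); beat[(a, b)] is the chance a beats b in one game.
    Returns each team's probability of winning the whole tournament."""
    reach = {t: 1.0 for t in bracket}          # P(team alive entering round)
    block = 1
    while block < len(bracket):
        nxt = {}
        for idx, t in enumerate(bracket):
            opp_grp = (idx // block) ^ 1       # the adjacent bracket block
            opps = bracket[opp_grp * block:(opp_grp + 1) * block]
            # Weight each potential opponent by its chance of being there.
            p_win = sum(reach[o] * beat[t, o] for o in opps)
            nxt[t] = reach[t] * p_win
        reach, block = nxt, block * 2
    return reach

# Hypothetical strength ratings standing in for the regression's output:
ratings = {"NO": 1.0, "ARI": 0.2, "MIN": 0.8, "DAL": 0.7,
           "IND": 1.1, "BAL": 0.5, "SD": 0.9, "NYJ": 0.3}
beat = {(a, b): 1 / (1 + math.exp(rb - ra))
        for a, ra in ratings.items() for b, rb in ratings.items()}

print({t: round(p, 3) for t, p in tournament_odds(list(ratings), beat).items()})
```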

Burke states that the individual game prediction model has a “history of accuracy” and is well “calibrated,” meaning that, historically, of the teams it has predicted to win 30% of the time, close to 30% of them have won, and so on.  For a number of reasons, I remain somewhat skeptical of this claim, especially when it comes to “extreme value” games where the model predicts very heavy favorites or underdogs.  (E.g.: What validation safeguards do they deploy to avoid over-fitting?  How did they account for the thinness of data available for extreme values in their calibration method?)  But for now, let’s assume this claim is correct, and that the model is calibrated perfectly:  The fact that teams predicted to win 30% of the time actually won 30% of the time does NOT mean that each team actually had a 30% chance of winning.
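Incidentally, the calibration claim itself is straightforward to test if you have the game-level history.  Here’s a minimal sketch, with predicted and won as stand-ins for that history; the per-bin counts also make the thinness of the data at the extremes easy to spot:

```python
import numpy as np

def calibration_table(predicted, won, bins=10):
    """Bucket games by predicted win probability and compare each bucket's
    average prediction to the share of its games actually won."""
    predicted, won = np.asarray(predicted), np.asarray(won, dtype=float)
    edges = np.linspace(0, 1, bins + 1)
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_bin = (predicted >= lo) & (predicted < hi)
        if in_bin.any():
            print(f"{lo:.1f}-{hi:.1f}: predicted {predicted[in_bin].mean():.3f}, "
                  f"actual {won[in_bin].mean():.3f}, n={in_bin.sum()}")

# Stand-in data, perfectly calibrated by construction:
rng = np.random.default_rng(0)
p = rng.uniform(0.05, 0.95, 2000)
calibration_table(p, rng.random(2000) < p)
```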

That 30% number is just an average.  If you believe that the model perfectly nails the actual expectation for every team, you are crazy.  Since there is a large and reasonably measurable amount of variance in the very small sample of underlying statistics that the predictive model relies on, it necessarily follows that many teams will have significantly under- or over-performed statistically relative to their true strength, which will be reflected in the model’s predictions.  The “perfect calibration” of the model only means that the error is well-hidden.

This doesn’t mean that it’s a bad model: like any heuristic, the model may be completely adequate for its intended context.  For example, if you’re going to bet on an individual game, barring any other information, the average of a team’s potential chances should be functionally equivalent to their actual chances.  But if you’re planning to bet on the end-result of a series of games — such as in the divisional round of the NFL playoffs — failing to understand the distribution of error could be very costly.

For example, let’s look at what happens to Minnesota and Arizona’s Super Bowl chances if we assume that the error in their winrates is uniformly distributed in the neighborhood of their predicted winrate:

[Table: Minnesota’s and Arizona’s Super Bowl chances with uniformly distributed “hidden error”]

For Minnesota, I created a pool of 11 possible expectations that includes the actual prediction plus teams that were 5% to 25% better or worse.  I did the same for Arizona, but with half the deviation.  The average win prediction for each game remains constant, but the overall chances of winning the Super Bowl change dramatically.  To some of you, the difference between 2% and 1% may not seem like much, but if you could find a casino that would regularly offer you 100-1 on something that is actually a 50-1 shot, you could become very rich very quickly.  Of course, this uniform distribution is only one crude stand-in for the many conceivable ways that the “hidden error” could be distributed, and I have no particular reason to think it is more accurate than any other.  But one thing should be abundantly clear: the winrate model on which this whole system rests tells us nothing about this distribution either.
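Here’s a minimal sketch of the computation behind that table, assuming a team has to win three straight games (divisional, conference, Super Bowl) at a single unknown true per-game winrate.  The 30% base prediction is illustrative rather than Burke’s actual number, but the pools mirror the setup just described:

```python
def sb_chance(pool, rounds=3):
    """Average P(win every round) over a pool of equally likely true winrates."""
    return sum(p ** rounds for p in pool) / len(pool)

minnesota = [0.30 + d / 100 for d in range(-25, 30, 5)]   # 11 values, 5-25% off
arizona   = [0.30 + d / 200 for d in range(-25, 30, 5)]   # half the deviation

for name, pool in (("MIN", minnesota), ("ARI", arizona)):
    avg = sum(pool) / len(pool)
    print(f"{name}: avg game winrate {avg:.2f}, naive SB odds {avg ** 3:.3f}, "
          f"with hidden error {sb_chance(pool):.3f}")
```

The inflation is just Jensen’s inequality at work: p³ is convex, so averaging p³ over an uncertain p always yields more than cubing the average p.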

The exact structure of this particular error distribution is mostly an empirical matter that can and should invite further study.  But for the purposes of this essay, speculation may suffice.  For example, here is an ad hoc distribution that I thought seemed a little more plausible than a uniform distribution:

[Table: Super Bowl chances for a team with a 35% average predicted winrate, under an ad hoc error distribution]

This table shows the chances of winning the Super Bowl for a generic divisional round playoff team with an average predicted winrate of 35% for each game.  In this scenario, there is a 30% chance (3/10) that the prediction gets it right on the money, a 40% chance that the team is around half as good as predicted (the bottom 4 values), a 10% chance that the team is slightly better, a 10% chance that it is significantly better, and a 10% chance that the model’s prediction is completely off its rocker.  These possibilities still produce a 35% average winrate, yet, as above, the overall chances of winning the Super Bowl increase significantly (this time by almost double; the sketch below reproduces this).  Of course, 2 random hypothetical distributions don’t yet indicate a trend, so after the sketch, let’s look at a family of distributions to see if we can find any patterns.
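A hypothetical ten-value pool shaped like that description (the exact values are my reconstruction, not the originals) shows the near-doubling:

```python
# Hypothetical pool averaging 35%: three values on the money, four around
# half as good, one slightly better, one significantly better, and one
# where the model is completely off its rocker.
pool = [0.35, 0.35, 0.35, 0.18, 0.18, 0.18, 0.18, 0.40, 0.55, 0.78]

avg = sum(pool) / len(pool)
print(avg)                                     # 0.35 on the nose
print(avg ** 3)                                # ~0.043: naive SB odds
print(sum(p ** 3 for p in pool) / len(pool))   # ~0.086: almost double
```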

[Table: Super Bowl chances by predicted winrate, for uniform error distributions of increasing size]

This chart compares the chances of a team with a given predicted winrate to win the Super Bowl based on uniform error distributions of various sizes.  So the percentages in column 1 are the odds of the team winning the Super Bowl if the predicted winrate is exactly equal to their actual winrate.  Then each subsequent column is the chances of them winning the Super Bowl if you increase the “pool” of potential actual winrates by one on each side.  Thus, the second number after 35% is the odds of winning the Super Bowl if the team is equally likely to have a 30%, 35%, or 40% chance in reality, etc.  The maximum possible change in Super Bowl winning chances for each starting prediction is contained in the light yellow box at the end of each row.  I should note that I chose this family of distributions for its ease of cross-comparison, not its precision.  I also experimented with many other models that produced a variety of interesting results, yet in every even remotely plausible one of them, two trends – both highly germane to my initial criticism of Burke’s model – endured:
1.  Lower predicted game odds lead to greater disparity between predicted and actual chances.
To further illustrate this, here’s a vertical slice of the data, containing the net change for each possible prediction, given a discrete uniform error distribution of size 7:

[Table: net change in Super Bowl chances at each predicted winrate, for a size-7 uniform error distribution]

2.  Greater error ranges in the underlying distribution lead to greater disparity between predicted and actual chances.

To further illustrate this, here’s a horizontal slice of the data, containing the net change for each possible error range, given an initial winrate prediction of 35%:

[Table: net change in Super Bowl chances for each error range, at a 35% predicted winrate]

Of course these underlying error distributions can and should be examined further, but even at this early stage of inquiry, we “know” enough (at least with a high degree of probability) to begin drawing conclusions.  That is, we know there is considerable variance in the statistics that Burke’s model relies on, which strongly suggests that there is a considerable amount of “hidden error” in its predictions.  We know greater “hidden error” leads to greater disparity in predicted Super Bowl winning chances, and that this disparity is greatest for underdogs.  Therefore, it is highly likely that this model significantly under-represents the chances of underdog teams at the divisional stage of the playoffs going on to win the Super Bowl.  Q.E.D.

This doesn’t mean that these problems aren’t fixable: the nature of the error distribution of the individual game-predicting model could be investigated and modeled itself, and the results could be used to adjust Burke’s playoff predictions accordingly.  Alternatively, if you want to avoid the sticky business of characterizing all that hidden error, a Super-Bowl prediction model could be built that deals with that problem heuristically: say, by running a logistic regression that uses the available data to predict each team’s chances of winning the Super Bowl directly.
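Here’s a rough sketch of what that direct approach could look like with scikit-learn; everything below is randomly generated stand-in data, and the three feature columns are placeholders for whatever regular-season indicators you trust:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
# Stand-in history: one row per playoff team-season (e.g., SRS, offensive
# and defensive efficiency), with a 0/1 flag for winning the title.
X = rng.normal(size=(464, 3))
y = (X[:, 0] + rng.normal(size=464) > 2).astype(int)

model = LogisticRegression().fit(X, y)

X_now = rng.normal(size=(8, 3))            # this year's eight teams
p = model.predict_proba(X_now)[:, 1]
print(p / p.sum())                         # renormalized to sum to one
```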

Finally, I believe this evidence both directly and indirectly supports my intuition that the large disparity between Burke’s predictions and the corresponding contract prices was more likely to be the result of model error than market error.  The direct support should be obvious, but the indirect support is also interesting:  Though markets can get it wrong just as much or more than any other process, I think that people who “put their money where their mouth is” (especially those with the most influence on the markets) tend to be more reliably skeptical and less dogmatic about making their investments than bloggers, analysts or even academics are about publishing their opinions.  Moreover, by its nature, the market takes a much more pluralistic approach to addressing controversies than do most individuals.  While this may leave it susceptible to being marginally outperformed (on balance) by more directly focused individual models or persons, I think it will also be more likely to avoid pitfalls like the one above.

Conclusions, and My Broader Agenda

The general purpose of this post is to demonstrate both the importance and difficulty of understanding and characterizing the ways in which our beliefs – and the processes we use to form them — can get it wrong.  This is, at its heart, a delicate but extremely pragmatic endeavor.  It involves being appropriately skeptical of various conclusions — even when they seem right to you – and recognizing the implications of the multitude of ways that such error can manifest.

I have a whole slew of ideas about how to apply these principles when evaluating the various pronouncements made by the political commentariat, but the blogosphere already has a Nate Silver (and Mr. Silver is smarter than me anyway), so I’ll leave that for you to consider as you see fit.

Calculator: NFL/NCAA QB Ratings

Recently, I have been working very hard on some exciting behind-the-scenes upgrades for the blog. For example, I’ve been designing a number of web-mining processes to beef up my football and basketball databases, which should lead to more robust content in the future. I’ve also been working on a much easier way to make interactive posts (without having to hard-code them or use plug-ins). My thinking is, if I lower the difficulty of creating interactive calculators and graph generators enough, then a collection of fun/interesting/useful resources should practically build itself.

To that end, I believe I have found the right tools to make moderately complex interactive charts and data out of spreadsheets, and I have been getting better and better at the process. As a test-run, however, let’s start with something simpler: A calculator for the much-maligned and nigh-impenetrable QB Ratings systems of the NFL and NCAA:
[Interactive calculator: NFL and NCAA QB ratings]
If all is working properly, the rating should re-calculate automatically whenever you change the data (i.e., no need to push a button), provided you have valid numbers in all 5 boxes. Please let me know if you have any difficulty viewing or using it.
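For reference, both rating formulas are public and simple enough to express in a few lines: the NFL version sums four components, each clamped to a 2.375 maximum (hence its famous 158.3 ceiling), while the NCAA version is a single unclamped linear combination, which is part of why the two flavors can diverge so sharply.  A minimal sketch of what the calculator computes:

```python
def nfl_rating(att, comp, yds, td, ints):
    """NFL passer rating: four components, each clamped to [0, 2.375]."""
    clamp = lambda x: max(0.0, min(x, 2.375))
    a = clamp((comp / att - 0.3) * 5)        # completion percentage
    b = clamp((yds / att - 3) * 0.25)        # yards per attempt
    c = clamp(td / att * 20)                 # touchdown rate
    d = clamp(2.375 - ints / att * 25)       # interception rate
    return (a + b + c + d) / 6 * 100         # maximum possible: 158.3

def ncaa_rating(att, comp, yds, td, ints):
    """NCAA pass efficiency: one unclamped linear combination."""
    return (8.4 * yds + 330 * td + 100 * comp - 200 * ints) / att

print(nfl_rating(400, 260, 3200, 25, 10))    # 100.0
print(ncaa_rating(400, 260, 3200, 25, 10))   # ~147.8
```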

From an analytical standpoint, there’s obviously not much to see here — though for certain values I am mildly surprised by the extreme disparity between the NFL and NCAA flavors.

Easy NFL Predictions, the SkyNet Way

In this post I briefly discussed regression to the mean in the NFL, as well as the difficulty one can face trying to beat a simple prediction model based on even a single highly probative variable.  Indeed, for all the extensive research and cutting-edge analysis they conduct at Football Outsiders, they are seemingly unable to beat “Koko,” which is just about the simplest regression model known to primates.

[Image: the neural network’s prediction output]

Of course, since there’s no way I could out-analyze F.O. myself — especially if I wanted to get any predictions out before tonight’s NFL opener – I decided to let my computer do the work for me: this is what neural networks are all about.  In case you’re not familiar, a neural network is a learning algorithm that can be used as a tool to process large quantities of data with many different variables — even if you don’t know which variables are the most important, or how they interact with each other.

The graphic to the right is the end result of several whole minutes of diligent configuration (after a lot of tedious data collection, of course).  It uses 60 variables (which are listed under the fold below), though I should note that I didn’t choose them because of their incredible probative value – many are extremely collinear, if not pointless — I mostly just took what was available on the team and league summary pages on Pro Football Reference, and then calculated a few (non-advanced) rate stats and such in Excel.

Now, I don’t want to get too technical, but there are a few things about my methodology that I need to explain. First, predictive models of all types have two main areas of concern: under-fitting and over-fitting.  Football Outsiders, for example, creates models that “under-fit” their predictions.  That is to say, however interesting the individual components may be, they’re not very good at predicting what they’re supposed to.  Honestly, I’m not sure if F.O. even checks their models against the data, but this is a common problem in sports analytics: the analyst gets so caught up designing their model a priori that they forget to check whether it actually fits the empirical data.  On the other hand, to the diligent empirically-driven model-maker, overfitting — which is what happens when your model tries too hard to explain the data — can be just as pernicious.  When you complicate your equations or add more and more variables, it gives your model more opportunity to find an “answer” that fits even relatively large data-sets, but which may not be nearly as accurate when applied elsewhere.

For example, to create my model, I used data from the introduction of the Salary Cap in 1994 on.  When excluding seasons where a team had no previous or next season to compare to, this left me with a sample of 464 seasons.  Even with a sample this large, if you include enough variables you should get good-looking results: a linear regression will appear to make “predictions” that would make any gambler salivate, and a Neural Network will make “predictions” that would make Nostradamus salivate.  But when you take those models and try to apply them to new situations, the gambler and Nostradamus may be in for a big disappointment.  This is because there’s a good chance your model is “overfit”, meaning it is tailored specifically to explain your dataset rather than to identify the outside factors that the data-set reveals.  Obviously it can be problematic if we simply use the present data to explain the present data.  “Model validation” is a process (woefully ignored in typical sports analysis) by which you make sure that your model is capable of predicting data as well as explaining it.  One of the simplest such methods is called “split validation.”  This involves randomly splitting your sample in half, creating a “practice set” and a “test set,” and then deriving your model from the practice set while applying it to the test set.  If “deriving” a model is confusing to you, think of it like this: you are using half of your data to find an explanation for what’s going on and then checking the other half to see if that explanation seems to work.  The upside to this is that if your method of model-creating can pass this test reliably, your models should be just as accurate on new data as they are on the data you already have.  The downside is that you have to cut your sample size in half, which leads to bigger swings in your results, meaning you have to repeat the process multiple times to be sure that your methodology didn’t just get lucky on one round.
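Here’s a minimal sketch of split validation, with fit and predict as stand-ins for whatever model-building routine is being tested (the example at the bottom uses a one-variable linear regression on fake SRS-like data):

```python
import numpy as np

def split_validate(X, y, fit, predict, trials=20, seed=0):
    """Repeatedly split the sample in half, derive the model on the practice
    half, and score its predictions on the untouched test half."""
    rng = np.random.default_rng(seed)
    scores = []
    for _ in range(trials):
        idx = rng.permutation(len(y))
        practice, test = idx[:len(y) // 2], idx[len(y) // 2:]
        model = fit(X[practice], y[practice])
        scores.append(np.corrcoef(predict(model, X[test]), y[test])[0, 1])
    return float(np.mean(scores))

# Stand-in data: one probative variable, linearly related to next-season wins.
rng = np.random.default_rng(1)
srs = rng.normal(size=(464, 1))
wins = 8 + 2 * srs[:, 0] + rng.normal(size=464) * 2
print(split_validate(srs, wins,
                     fit=lambda X, y: np.polyfit(X[:, 0], y, 1),
                     predict=lambda m, X: np.polyval(m, X[:, 0])))
```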

For this model, the main method I am going to use to evaluate predictions is a simple correlation between predicted outcomes and actual outcomes.  The dependent variable (or variable I am trying to predict) is the next season’s wins.  As a baseline, I created a linear correlation against SRS, or “Simple Rating System,” which is PFR’s term for margin of victory adjusted for strength of schedule.  This is the single most probative common statistic when it comes to predicting the next season’s wins, and as I’ve said repeatedly, beating a regression of one highly probative variable can be a lot of work for not much gain.  To earn any bragging rights as a model-maker, I think you should be able to beat the linear SRS predictions by at least 5%, since that’s approximately the edge you would need to win money gambling against it in a casino.  For further comparison, I also created a “Massive Linear” model, which uses the majority of the variables that go into the neural network (excluding collinear variables and variables that have almost no predictive value).  For the ultimate test, I’ve created one model that is a linear regression using only the most probative variables, AND I allowed it to use the whole sample space (that is, I allowed it to cheat and use the same data that it is predicting to build its predictions).  For my “simple” neural network, of course, I didn’t do any variable-weighting or analysis myself, and it required very little configuration:  I used a very slow ‘learning rate’ (.025 if that means anything to you) with a very high number of learning cycles (5000), with decay on.  For the validated models, I repeated this process about 20 times and averaged the outcomes.  I have also included the results from running the data through the “Koko” model, and added results from the last 2 years of Football Outsiders predictions.  As you will see, the neural network was able to beat the other models fairly handily:
[Table: correlation between each model’s predictions and actual next-season wins]
Football Outsiders’ numbers obviously do not go back to 1994.  Note that Koko actually performs on par with F.O. overall, though both are pretty weak compared to the SRS regression or the cheat regression.  “Koko” performed very well last season, posting a .560 correlation, though apparently last season was highly “predictable,” as all of the models based on previous patterns performed extremely well.  Note also that the Massive Linear model performs poorly: this is a result of overfitting, as explained above.
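For the curious, a rough scikit-learn analogue of the configuration described above might look like the following; the original used different software, the data here is randomly generated stand-in, and “decay on” is approximated with L2 regularization via alpha:

```python
import numpy as np
from sklearn.neural_network import MLPRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(2)
X = rng.normal(size=(464, 60))               # 60 team-season input variables
y = 8 + 2 * X[:, 0] + rng.normal(size=464)   # stand-in for next season's wins

net = make_pipeline(
    StandardScaler(),
    MLPRegressor(learning_rate_init=0.025, max_iter=5000, alpha=1e-3),
)
net.fit(X, y)
print(np.corrcoef(net.predict(X), y)[0, 1])  # in-sample; split-validate as above
```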

Now here is where it gets interesting.  When I first envisioned this post, I was planning to title it “Why I Don’t Make Predictions; And: Predictions!” — on the theory that, given the extreme variance in the sport, any highly-accurate model would probably produce incredibly boring results.  That is, most teams would end up relatively close to the mean, and the “better” teams would normally just be the better teams from the year before.  But when I applied the neural network to the data for this season, I was extremely surprised by its apparent boldness:

[Table: the neural network’s projected records for all 32 teams this season]
I should note that the numbers will not add up perfectly as far as divisions and conferences go.  In fact, I slightly adjusted them proportionally to make them fit the correct number of games for the league as a whole (which should have little effect on its predictive power, and possibly a positive one). SkyNet does not know the rules of football or the structure of the league, and its main goal is to make the most accurate predictions on a team-by-team basis, and then destroy humanity.

Wait, what?  New Orleans struggling to make the playoffs?  Oakland with a better record than San Diego?  The Jets as the league’s best team?  New England is out?!?  These are not the predictions of a milquetoast forecaster, so I am pleased to see that my simple creation has gonads.  Of course there is obviously a huge amount of variance in this process, and a .43 correlation still leaves a lot to chance. But just to be completely clear, this is exactly the same model that soundly beat Koko, Football Outsiders, and several reasonable linear regressions — some of which were allowed to cheat – over the past 15 years.  In my limited experience, neural networks are often capable of beating conventional models even when they produce some bizarre outcomes:  For example, one of my early NBA playoff wins-predicting neural networks was able to beat most linear regressions by a similar (though slightly smaller) margin, even though it predicted negative wins for several teams.  Anyway, I look forward to seeing how the model does this season.  Though, in my heart of hearts, if the Jets win the Super Bowl, I may fear for the future of mankind.

A list of all the input variables, after the jump:


Quantum Randy Moss—An Introduction to Entanglement

[Update: This post from 2010 has been getting some renewed attention in response to Randy Moss’s mildly notorious statement in New Orleans. I’ve posted a follow-up with more recent data here: “Is Randy Moss the Greatest?” For discussion of the broader idea, however, you’re in the right place.]

As we all know, even the best-intentioned single-player statistical metrics will always be imperfect indicators of a player’s skill.  They will always be impacted by external factors such as variance, strength of opponents, team dynamics, and coaching decisions.  For example, a player’s shooting % in basketball is a function of many variables – such as where he takes his shots, when he takes his shots, how often he is double-teamed, whether the team has perimeter shooters or big space-occupying centers, how often his team plays Oklahoma, etc – only one of which is that player’s actual shooting ability.  Some external factors will tend to even out in the long-run (like opponent strength in baseball).  Others persist if left unaccounted for, but are relatively easy to model (such as the extra value of made 3 pointers, which has long been incorporated into “true shooting percentage”).  Some can be extremely difficult to work with, but should at least be possible to model in theory (such as adjusting a running back’s yards per carry based on the run-blocking skill of their offensive line).  But some factors can be impossible (or at least practically impossible) to isolate, thus creating systematic bias that cannot be accurately measured.  One of these near-impossible external factors is what I call “entanglement,” a phenomenon that occurs when more than one player’s statistics determine and depend on each other.  Thus, when it comes to evaluating one of the players involved, you run into an information black hole when it comes to the entangled statistic, because it can be literally impossible to determine which player was responsible for the relevant outcomes.

While this problem exists to varying degrees in all team sports, it is most pernicious in football.  As a result, I am extremely skeptical of all statistical player evaluations for that sport, from the most basic to the most advanced.  For a prime example, no matter how detailed or comprehensive your model is, you will not be able to detangle a quarterback’s statistics from those of his other offensive skill position players, particularly his wide receivers.  You may be able to measure the degree of entanglement, for example by examining how much various statistics vary when players change teams.  You may even be able to make reasonable inferences about how likely it is that one player or another should get more credit, for example by comparing the careers of Joe Montana with Kansas City and Jerry Rice with Steve Young (and later Oakland), and using that information to guess who was more responsible for their success together.  But even the best statistics-based guess in that kind of scenario is ultimately only going to give you a probability (rather than an answer), and will be based on a minuscule sample.

Of course, though stats may never be the ultimate arbiter we might want them to be, they can still tell us a lot in particular situations.  For example, if only one element (e.g., a new player) in a system changes, corresponding with a significant change in results, it may be highly likely that that player deserves the credit (note: this may be true whether or not it is reflected directly in his stats).  The same may be true if a player changes teams or situations repeatedly with similar outcomes each time.  With that in mind, let’s turn to one of the great entanglement case-studies in NFL history: Randy Moss.

I’ve often quipped to my friends or other sports enthusiasts that I can prove that Randy Moss is probably the best receiver of all time in 13 words or less.  The proof goes like this:

Chad Pennington, Randall Cunningham, Jeff George, Daunte Culpepper, Tom Brady, and Matt Cassell.

The entanglement between QB and WR is so strong that I don’t think I am overstating the case at all by saying that, while a receiver needs a good quarterback to throw to him, ultimately his skill-level may have more impact on his quarterback’s statistics than on his own.  This is especially true when coaches or defenses key on him, which may open up the field substantially despite having a negative impact on his stat-line.  Conversely, a beneficial implication of such high entanglement is that a quarterback’s numbers may actually provide more insight into a wide receiver’s abilities than the receiver’s own – especially if you have had many quarterbacks throwing to the same receiver with comparable success, as Randy Moss has.

Before crunching the data, I would like to throw some bullet points out there:

  • There have been 6 quarterbacks who have started 9 or more games in a season with Randy Moss as one of their receivers (for obvious reasons, I have replaced Chad Pennington with Kerry Collins for this analysis).
  • Only two of them had starting jobs in the seasons immediately prior to those with Moss (Kerry Collins, Tom Brady).
  • Only one of them had a starting job in the season immediately following those with Moss (Matt Cassell).
  • Pro Bowl appearances of quarterbacks throwing to Moss: 6.  Pro-Bowl appearances of quarterbacks after throwing to Moss: 0.
  • Daunte Culpepper made the Pro Bowl 3 times in his 5 seasons throwing to Moss.  He has won a combined 5 games as a starting quarterback in 5 seasons since.

With the exception of Kerry Collins, all of the QB’s who have thrown to Moss have had “career” years with him (Collins improved, but not by as much as the others).  To illustrate this point, I’ve compiled a number of popular statistics for each quarterback for their Moss years and their other years, in order to figure out the average effect Moss has had.  To qualify as a “Moss year,” they had to have been his quarterback for at least 9 games.  I have excluded all seasons where the quarterback was primarily a reserve, or was only the starting quarterback for a few games.  The “other” seasons include all of that QB’s data in seasons without Moss on his team.  This is not meant to bias the statistics: the reason I exclude partial seasons in one case and not the other is that I don’t believe occasional sub work or participation in a QB controversy accurately reflects the benefit of throwing to Moss, but those things reflect the cost of not having Moss just fine.  In any case, to be as fair as possible, I’ve included the two Daunte Culpepper seasons where he was seemingly hampered by injury, and the Kerry Collins season where Oakland seemed to be in turmoil, all three of which could arguably not be very representative.
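Mechanically, the split amounts to something like this sketch, assuming a tidy table with one row per qualifying QB season; the column names and values here are illustrative, not the full dataset:

```python
import pandas as pd

seasons = pd.DataFrame({
    "qb":        ["Culpepper", "Culpepper", "Brady", "Brady"],
    "with_moss": [True, False, True, False],
    "rating":    [110.9, 69.9, 117.2, 92.9],   # illustrative values
    "any_a":     [8.6, 5.1, 9.4, 7.2],         # adjusted net yards/attempt
})

split = seasons.groupby(["qb", "with_moss"]).mean(numeric_only=True)
boost = split.xs(True, level="with_moss") - split.xs(False, level="with_moss")
print(boost)   # average per-stat improvement in the Moss years
```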

As you can see in the table below, the quarterbacks throwing to Moss posted significantly better numbers across the board:

[Table: each quarterback’s passing statistics in his Moss years vs. his other years]

[Edit to note: in this table’s sparklines and in the charts below, the 2nd and 3rd positions are actually transposed from their chronological order.  Jeff George was Moss’s 2nd quarterback and Culpepper was his 3rd, rather than vice versa.  This happened because I initially sorted the seasons by year and team, forgetting that George and Culpepper both came to Minnesota at the same time.]

Note: Adjusted Net Yards Per Attempt incorporates yardage lost due to sacks, plus gives bonuses for TD’s and penalties for interceptions.  Approximate Value is an advanced stat from Pro Football Reference that attempts to summarize all seasons for comparison across positions.  Details here.

Out of 60 metrics, only 3 times did one of these quarterbacks fail to post better numbers throwing to Moss than in the rest of his career:  Kerry Collins had a slightly lower completion percentage and slightly higher sack percentage, and Jeff George had a slightly higher interception percentage for his 10-game campaign in 1999 (though that was still the highest-rated season of his career).  For many of these stats, the difference is practically mind-boggling:  QB Rating may be an imperfect statistic overall, but it is a fairly accurate composite of the passing statistics that the broader football audience cares the most about, and 19.8 points is about the difference in career rating between Peyton Manning and J.P. Losman.

Though obviously Randy Moss is a great player, I still maintain that we can never truly measure exactly how much of this success was a direct result of Moss’s contribution and how much was a result of other factors.  But I think it is very important to remember that, as far as highly entangled statistics like this go, independent variables are rare, and this is just about the most robust data you’ll ever get.  Thus, while I can’t say for certain that Randy Moss is the greatest receiver in NFL History, I think it is unquestionably true that there is more statistical evidence of Randy Moss’s greatness than there is for any other receiver.

Full graphs for all 10 stats after the jump:


Graph of the Day 2: NFL Regression—Descent Into Chaos

I guess it’s funky graph day here at SSA:
This one corresponds to the bubble-graphs in this post about regression to the mean before and after the introduction of the salary cap.  Each colored ball represents one of the 32 teams, with wins in year n on the x axis and wins in year n+1 on the y axis.  In case you don’t find the visual interesting enough in its own right, you’re supposed to notice that it gets crazier right around 1993.

Hey, Do You Think Brett Favre is Maybe Like Hamlet?

On a lighter note:  Earlier I was thinking about how tired I am of hearing various ESPN commentators complain about Brett Favre’s “Hamlet impression” – though I was just using the term “Hamlet impression” for the rant in my head, no one was actually saying it (at least this time).  I quickly realized how completely unoriginal my internal dialogue was being, and after scolding myself for a few moments, I resolved to find the identity of the first person to ever make the Favre/Hamlet comparison.

Lo and behold, the earliest such reference in the history of the internet – that is, according to Google – was none other than Gregg Easterbrook, in this TMQ column from August 27th, 2003:

TMQ loves Brett Favre. This guy could wake up from a knee operation and fire a touchdown pass before yanking out the IV line. It’s going to be a sad day when he cuts the tape off his ankles for the final time. And it’s wonderful that Favre has played his entire (meaningful) career in the same place, honoring sports lore and appeasing the football gods, never demanding a trade to a more glamorous media market.

But even as someone who loves Favre, TMQ thinks his Hamlet act on retirement has worn thin. Favre keeps planting, and then denying, rumors that he is about to hang it up. He calls sportswriters saying he might quit, causing them to write stories about how everyone wants him to stay; then he calls more sportswriters denying that he will quit, causing them to write stories repeating how everyone wants him to stay. Maybe Favre needs to join a publicity-addiction recovery group. The retire/unretire stuff got pretty old with Frank Sinatra and Michael Jordan; it’s getting old with Favre.

Ha!

The 1-15 Rams and the Salary Cap—Watch Me Crush My Own Hypothesis

It is a quirky little fact that 1-15 teams have tended to bounce back fairly well.  Since expanding to 16 games in 1978, 9 teams have hit the ignoble mark, including last year’s St. Louis Rams.  Of the 8 that did it prior to 2009, all but the 1980 Saints made it back to the playoffs within 5 years, and 4 of the 8 eventually went on to win Super Bowls, combining for 8 total.  The median number of wins for a 1-15 team in their next season is 7:

[Chart: wins in the following season for each 1-15 team]

[Chart: years until next playoff appearance for each 1-15 team]

My grand hypothesis about this was that the implementation of the salary cap after the 1993-94 season, combined with some of the advantages I discuss below (especially 2 and 3), has been a driving force behind this small-but-sexy phenomenon: note that at least for these 8 data points, there seems to be an upward trend for wins and downward trend for years until next playoff appearance.  Obviously, this sample is way too tiny to generate any conclusions, but before looking at harder data, I’d like to speculate a bit about various factors that could be at play.  In addition to normally-expected regression to the mean, the chain of consequences resulting from being horrendously bad is somewhat favorable:

  1. The primary advantages are explicitly structural:  Your team picks at the top of each round in the NFL draft.  According to ESPN’s “standard” draft-pick value chart, the #1 spot in the draft is worth over twice as much as the 16th pick [side note: I don’t actually buy this chart for a second.  It massively overvalues 1st round picks and undervalues 2nd round picks, particularly when it comes to value added (see a good discussion here)].
  2. The other primary benefit, at least for one year, comes from the way the NFL sets team schedules: 14 games are played in-division and against common divisional opponents, but the last two games are set between teams that finished in equal positions the previous year (this has obviously changed many times, but there have always been similar advantages).  Thus, a bottom-feeder should get a slightly easier schedule, as evidenced by the Rams having the 2nd-easiest schedule for this coming season.
  3. There are also reliable secondary benefits to being terrible, some of which get greater the worse you are.  A huge one is that, because NFL statistics are incredibly entangled (i.e., practically every player on the team has an effect on every other player’s statistics), having a bad team tends to drag everyone’s numbers down.  Since the sports market – and the NFL’s in particular – is stats-based on practically every level, this means you can pay your players less than what they’re worth going forward.  Under the salary cap, this leaves you more room to sign and retain key players, or go for quick fixes in free agency (which is generally unwise, but may boost your performance for a season or two).
  4. A major tertiary effect – one that especially applies to 1-15 teams, is that embarrassed clubs tend to “clean house,” meaning, they fire coaches, get rid of old and over-priced veterans, make tough decisions about star players that they might not normally be able to make, etc.  Typically they “go young,” which is advantageous not just for long-term team-building purposes, but because young players are typically the best value in the short term as well.
  5. An undervalued quaternary effect is that new personnel and new coaching staff, in addition to hopefully being better at their jobs than their predecessors, also make your team harder to prepare for, just by virtue of being new (much like the “backup quarterback effect,” but for your whole team).
  6. A super-important quinary effect is that. . .  Ok, sorry, I can’t do it.

Of course, most of these effects are relevant to more than just 1-15 teams, so perhaps it would be better to expand the inquiry a tiny bit.  For this purpose, I’ve compiled the records of every team since the merger, so beginning in 1970, and compared them to their record the following season (though it only affects one data point, I’ve treated the first Ravens season as a Browns season, and treated the new Browns as an expansion team).  I counted ties as .5 wins, and normalized each season to 16 games (and rounded).  I then grouped the data by wins in the initial season and plotted it on a “3D Bubble Chart.”  This is basically a scatter-plot where the size of each data-point is determined by the number of examples (e.g., only 2 teams have gone undefeated, so the top-right bubble is very small).  The 3D is not just for looks: the size of each sphere is determined by using the weights for volume, which makes it much less “blobby” than 2D, and it allows you to see the overlapping data points instead of just one big ink-blot:

[Bubble chart: wins in year n vs. wins in year n+1, all teams since 1970]

*Note: again, the x-axis on this graph is wins in year n, and the y axis is wins in year n+1. Also, note that while there are only 16 “bubbles,” they represent well over a thousand data points, so this is a fairly healthy sample.
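For what it’s worth, the year-over-year pairing and the fitted line can be reproduced with a sketch like this one; the win totals below are random stand-ins, but on the real data the fit comes out to roughly the equation discussed next:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(3)
df = pd.DataFrame({
    "team": list("ABCD") * 10,                    # stand-in four-team league
    "year": np.repeat(np.arange(1970, 1980), 4),
    "wins": rng.integers(0, 17, size=40),         # already normalized to 16
})

# Join each team-season to the same team's following season.
pairs = df.merge(df.assign(year=df.year - 1),
                 on=["team", "year"], suffixes=("_n", "_n1"))
slope, intercept = np.polyfit(pairs.wins_n, pairs.wins_n1, 1)
for w in (4, 8, 12):
    print(w, round(slope * w + intercept, 2))     # predicted next-season wins
```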

The first thing I can see is that there’s a reasonably big and fat outlier there for 1-15 teams (the 2nd bubble from the left)!  But that’s hardly a surprise considering we started this inquiry knowing that group had been doing well, and there are other issues at play: First, we can see that the graph is strikingly linear.  The equation at the bottom means that to predict a team’s wins for one year, you should multiply their previous season’s win total by ~.43 and add ~4.7 (e.g.: an 8-win team should average about 8 wins the next year, a 4-win team should average around 6.5, and a 12-win team should average around 10).  The number highlighted in blue tells you how important the previous season’s wins are as a predictor: the higher the number, the more predictive.

So naturally the next thing to see is a breakdown of these numbers between the pre- and post-salary cap eras:

[Bubble chart: year-over-year wins, pre-salary cap]

[Bubble chart: year-over-year wins, post-salary cap]

Again, these are not small sample-sets, and they both visually and numerically confirm that the salary-cap era has greatly increased parity: while there are still plenty of excellent and terrible teams overall, the better teams regress and the worse teams get better, faster.  The equations after the split lead to the following predictions for 4, 8, and 12 win teams (rounded to the nearest .25):

Wins in year n    Predicted wins (pre-cap)    Predicted wins (post-cap)
4                 6.25                        7.00
8                 8.25                        8.00
12                10.50                       9.25
Yes, the difference in expected wins between a 4-win team and a 12-win team in the post-cap era is only just over 2 wins, down from over 4.

While this finding may be mildly interesting in its own right, sadly this entire endeavor was a complete and utter failure, as the graphs failed to support my hypothesis that the salary cap has made the difference for 1-15 teams specifically.  As this is an uncapped season, however, I guess what’s bad news for me is good news for the Rams.

Hidden Sources of Error—A Back-Handed Defense of Football Outsiders

So I was catching up on some old blog-reading and came across this excellent post by Brian Burke, Pre-Season Predictions Are Still Worthless, showing that the Football Outsiders pre-season predictions are about as accurate as picking 8-8 for every team would be, and that a simple regression based on one variable — 6 wins plus 1/4 of the previous season’s wins — is significantly more accurate.
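(For reference, that one-variable regression really is as simple as it sounds; in code:)

```python
def koko(prev_wins):
    """The baseline: 6 wins plus a quarter of last season's wins, i.e. keep
    only 25% of a team's deviation from the 8-win mean."""
    return 6 + prev_wins / 4

print([koko(w) for w in (1, 8, 16)])   # [6.25, 8.0, 10.0]
```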

While Brian’s anecdote about Billy Madison humorously skewers Football Outsiders, it’s not entirely fair, and I think these numbers don’t prove as much as they may appear to at first glance.  Sure, a number of conventional and unconventional conclusions people have reached are probably false, but the vast majority of sports wisdom is based on valid causal inferences with at least a grain of truth.  The problem is that people tend to over-rely on the various causes and effects they observe directly, while underestimating the causes they cannot see.

So far, so obvious.  But these “hidden” causes can be broken down further, starting with two main categories, which I’ll call “random causes” and “counter-causes”:

“Random causes” are not necessarily truly random, but they do not bias your conclusions in any particular direction.  This category is the truly random combined with the may-as-well-be-random, and it generates the inherent variance of the system.

“Counter-causes” are those which you may not see, but which relate to your variables in ways that counteract your inferences.  The salary cap in the NFL is one of the most ubiquitous offenders.  E.g., an analyst sees a very good quarterback, and for various reasons believes that a QB with that particular skill-set is worth an extra 2 wins per season.  That QB is obtained by an 8-8 team in free agency, so the analyst predicts that team will win 10 games.  But in reality, the team that signed the quarterback had to pay handsomely for that +2 addition, and may have had to cut 2 wins’ worth of players to do it.  If you imagine this process repeating itself over time, you will see that the correlation between QB’s with those skills and their teams’ actual win rates may be small or non-existent (in reality, of course, the best quarterbacks are probably underpaid relative to their value, so this is not a problem).  In closed systems like sports, these sorts of scenarios crop up all the time, and thus it is not uncommon for a perfectly valid and logical-seeming inference to be, systematically, dead wrong (by which I mean that it not only leads to an erroneous conclusion in a particular situation, but will lead to bad predictions routinely).
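
To see how thoroughly a counter-cause can hide a real effect, here is a toy Monte Carlo sketch of the scenario above; every number in it is invented for illustration:

```python
import random

random.seed(0)
teams = []
for _ in range(10_000):
    base = random.gauss(8, 2)            # team quality before the QB signing
    elite_qb = random.random() < 0.25    # a quarter of teams land the +2-win QB...
    boost = 2.0 if elite_qb else 0.0
    cap_cost = 2.0 if elite_qb else 0.0  # ...but cut ~2 wins of talent to afford him
    wins = base + boost - cap_cost + random.gauss(0, 1.5)
    teams.append((elite_qb, wins))

with_qb = [w for e, w in teams if e]
without = [w for e, w in teams if not e]
print(sum(with_qb) / len(with_qb), sum(without) / len(without))
# Both averages come out near 8 wins: the QB's +2 effect is real,
# but the cap-induced counter-cause makes it invisible in the data.
```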

So how does this relate to Football Outsiders, and how does it amount to a defense of their predictions?  First, I think the suggestion that FO may have created “negative knowledge” is demonstrably false:  The key here is not to be fooled by the stat that they could barely beat the “coma patient” prediction of 8-8 across the board.  8 wins is the most likely outcome for any team ex ante, and every win above or below that number is less and less likely.  E.g., if every outcome were the result of a flip of a coin, your best strategy would be to pick 8-8 for every team, and picking *any* team to go 10-6 or 12-4 would be terrible.  Yet Football Outsiders (and others) — based on their expertise — pick many teams to have very good and very bad records.  The fact that they break even against the coma patient shows that their expertise is worth something.
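
You can verify the coin-flip intuition directly; under pure coin flips, the expected miss grows rapidly the further a blanket prediction strays from 8-8:

```python
from math import comb

# Probability of each win total in a 16-game season of coin flips,
# and the expected absolute error of various blanket predictions.
p = [comb(16, k) / 2**16 for k in range(17)]
for guess in (8, 10, 12):
    expected_miss = sum(p[k] * abs(k - guess) for k in range(17))
    print(guess, round(expected_miss, 2))  # 8 -> 1.57, 10 -> 2.31, 12 -> 4.03
```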

Second, I think there’s no shame in being unable to beat a simple regression based on one extremely probative variable:  I’ve worked on a lot of predictive models, from linear regressions to neural networks, and beating a simple regression can be a lot of work for marginal gain (which, combined with the rake, is the main reason that sports-betting markets can be so tough).

Yet getting beaten so badly by a simple regression is a definite indicator of systematic error, particularly since there is nothing preventing Football Outsiders from using a simple regression to help them make their predictions.  Now, I suspect that FO is underestimating football variance, especially the extent of regression to the mean.  But this is a blanket assumption that I would happily apply to just about any sports analyst, quantitative or not, and is not really of interest.  However, per the distinction I made above, I believe FO is likely underestimating the “counter-causes” that may temper the robustness of their inferences without necessarily invalidating them entirely.  A relatively minor bias in this regard could easily lead to a significant drop in overall predictive performance, for the same reason as above: the best and worst records are by far the least likely to occur.  Thus, *ever* predicting them, and expecting to gain accuracy in the process, requires an enormous amount of confidence.  If Football Outsiders has that degree of confidence, I would wager that it is misplaced.

Favre’s Not-So-Bad Interception

This post on Advanced NFL Stats (which is generally my favorite NFL blog), quantifying the badness of Brett Favre’s interception near the end of regulation, is somewhat revealing of a subtle problem I’ve noticed with simple win-share analysis of football plays.  To be sure, Favre’s interception “cost” the Vikings a chance to win the game in regulation, and after a decent return, even left a small chance of the Saints winning before overtime.  So in an absolute sense, it was a “bad” play, which is reflected by Brian’s conclusion that it cost the Vikings .38 wins.  But I think there are a couple of issues with that figure that are worth noting:

First, while it may have cost .38 wins versus the start of that play, a more important question might be how bad it was on the spectrum of possible outcomes.  For example, an incomplete pass still would not have left the Vikings in a great position: they were outside of field goal range, with enough time on the clock to run probably only one more play before attempting a field goal.  Likewise, if they had run the ball instead (with the Saints seemingly keyed up for the run), it is unlikely that they would have picked up the yards necessary to end the game there either.  It is important to keep in mind that many other negative outcomes, like a sack or a run for negative yards, would have been nearly as disastrous as the interception.  In fact, by the nature of the position the Vikings were in, most “bad” outcomes would be hugely bad (in terms of win-shares), and most “good” outcomes would be hugely good.

The formal point here is that while Favre’s play was bad in absolute terms, it wasn’t much worse than a large percentage of the other possible outcomes.  For an extreme comparison, imagine a team with 4th and goal at the 1, 1 second left in the game, needing a touchdown to win, and the quarterback throws an incomplete pass.  The win-shares system would grade this as a terrible mistake!  I would suggest that a better way to quantify this type of result might be to ask: how many standard deviations worse than the mean was the outcome?  In the 4th-down case, it’s hard to make either a terrible mistake or an incredible play, because practically every outcome falls within the normal range.  Similarly, in the Favre case, while the interception was a highly unfavorable outcome, it wasn’t nearly as drastic as the basic win-shares analysis might make it seem.
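
Here is a quick sketch of what that might look like for the Favre play.  Only the -.38 figure comes from Brian’s post; the other win-share values are invented for illustration, and a more careful version would also weight each outcome by its probability:

```python
from statistics import mean, stdev

# Hypothetical win-share swings for the plausible outcomes of the play:
outcomes = {
    "completion":   +0.30,
    "incompletion": -0.08,
    "sack":         -0.25,
    "interception": -0.38,  # the actual result, per Advanced NFL Stats
}
mu, sd = mean(outcomes.values()), stdev(outcomes.values())
for name, value in outcomes.items():
    print(f"{name}: {(value - mu) / sd:+.2f} sd from the mean")
# With these (invented) numbers, the pick grades out less than one
# standard deviation below the mean outcome: bad, but not drastic.
```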

Second, to rate this play based on the actual result is, shall we say, a little results-oriented.  A completion of that length would have been an almost sure victory for the Vikings, so it’s not even clear that Favre’s throw was a bad decision.  Considering they were out of field goal range at the start of the play, if the ex ante distribution of outcomes were 40% completions, 40% incompletions, and 20% interceptions, the pass would easily have been a win-maximizing gamble.  Regardless of the exact distribution, the -.38-win outcome is way on the low end of the possible results, especially considering that it reflects a longer-than-average return on the pick.  Of course, many interceptions are the product of good quarterbacking decisions (I may write separately at some point on the topic “Show me a quarterback that doesn’t throw interceptions, and I’ll show you a sucky quarterback”), and it is not clear to me which type this one was.
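
A back-of-the-envelope expected-value check illustrates the point; the 40/40/20 split is from the hypothetical above, while the win probabilities attached to each outcome are entirely invented placeholders:

```python
# Hypothetical win probability after each outcome of the pass attempt:
win_prob = {"completion": 0.95, "incompletion": 0.45, "interception": 0.12}
chance   = {"completion": 0.40, "incompletion": 0.40, "interception": 0.20}

ev_pass = sum(chance[o] * win_prob[o] for o in win_prob)
ev_run  = 0.50  # assume a safe run essentially just forces overtime
print(round(ev_pass, 3), "vs", ev_run)  # 0.584 vs 0.5: the throw can be correct
```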

This should not be taken as a criticism of Advanced NFL Stats’ methodology; I’m certain Brian understands the difference between the win-shares a play produces and the question of whether that result was the product of a poor decision.  When it comes to 4th downs, for example, everyone with even an inkling of analytical skill understands that Belichick’s infamous decision to go for it against the Colts was definitely the win-maximizing play, even though it had a terrible result.  It doesn’t take a very big leap from there to realize that the same reasoning applies equally to players’ decisions.

The broader agenda these issues relate to (which I hope to expand on significantly in the future) is this: while I believe win-share analysis is the best, and in some sense the only, way to evaluate football decisions, I am also concerned about the many complications that arise when attempting to expand its purview to player evaluation.