## The Aesthetic Case Against 18 Games

By most accounts, the NFL’s plan to expand the regular season from 16 to 18 games is a done deal. Indulge me for a moment as I take off my Bill-James-Wannabe cap and put on my dusty old Aristotle-Wannabe kausia: in addition to various practical drawbacks, moving to 18 games risks disturbing the aesthetic harmony, grounded in powerful mathematics, that is inherent in the 16-game season.

Analytically, it is easy to appreciate the convenience of having the season break down cleanly into 8-game halves and 4-game quarters. Powers of 2 like this are useful and aesthetically attractive: after all, we are symmetrical creatures who appreciate divisibility. But we have a possibly even more powerful aesthetic attachment to certain types of asymmetrical relationships: Mozart’s piano concertos aren’t divided into equally sized beginnings, middles and ends. Rather, they are broken into exposition, development, and recapitulation, each progressively shorter than the last.

Similarly, the 16-game season can fairly cleanly be broken into 3 or 4 progressively shorter but more important sections. Using roughly the same proportions that Mozart would, the first 10 games (“exposition”) set the stage and reveal who we should be paying attention to; the next 3-4 games (“development”) are where the race for playoff positioning begins in earnest; and the final 2-3 weeks (“recapitulation”) are where hopes are realized and hearts are broken, including the final weekend when post-season fates are settled. Now, let’s represent the season as a rectangle with sides 16 (length of the season) and 10 (length of the “exposition”), broken down into consecutively smaller squares representing each section:

Note: The “last” game gets the leftover space, though if the season were longer we could obviously keep going.

At this point many of you probably know where this is going: the ratio of each square to all of the smaller pieces that follow it is roughly the same at every step, corresponding to the “divine proportion,” which is practically ubiquitous in classical music, as well as in everything from book and movie plots to art and architecture to fractal geometry to unifying theories of “all animate and inanimate systems.” Here it is again (incredibly clumsily sketched) in the more recognizable spiral form:

The golden ratio is represented in mathematics by the irrational constant phi, which is:

1.6180339887…

Dividing 1 by phi gets you:

.6180339887…

Beautiful, right? So the roughly 10/4/1/1 breakdown above is really just 16 multiplied by 1/phi, then the remainder multiplied by 1/phi, and so on (9.9, 3.8, 1.4, and a leftover of 0.9), rounded to the nearest game. Whether this corresponds to your thinking about the relative significance of each portion of the season is admittedly subjective. But this is an inescapably powerful force in aesthetics (along with symmetry and symbols of virility and fertility), and can be found in places most people would never suspect, including in professional sports. Let’s consider some anecdotal supporting evidence:
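For anyone who wants to check the arithmetic, here is a minimal sketch of that breakdown. The function name `phi_sections` is my own invention for illustration; it simply takes 1/phi of whatever remains at each step and hands the leftover to the final section.

```python
# Golden-ratio breakdown of a season: repeatedly take 1/phi of what remains.
PHI = (1 + 5 ** 0.5) / 2  # 1.6180339887...


def phi_sections(total_games, n_sections):
    """Split total_games into n_sections pieces. Each piece is 1/phi of the
    remainder, and the final piece is whatever is left over."""
    sections = []
    remaining = float(total_games)
    for _ in range(n_sections - 1):
        piece = remaining / PHI
        sections.append(piece)
        remaining -= piece
    sections.append(remaining)  # the "last" section gets the leftover space
    return sections


sections = phi_sections(16, 4)
print([round(s, 1) for s in sections])  # [9.9, 3.8, 1.4, 0.9]
```

Passing a larger `n_sections` keeps subdividing the leftover, which is the "we could obviously keep going" case for a longer season.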

• The length of a Major League Baseball season is 162 games.  Not 160, but 162.  That should look familiar.
• Both NBA basketball and NHL hockey have 82-game seasons, or roughly half-phi (50 × phi ≈ 80.9). Note that 81 games would be impractical, because of the need for an equal number of home and road games (but bonus points if you’ve ever felt like the NBA season was exactly 1 game too long).
• The “exposition” portion of a half-phi season would be 50 games.  The NHL and NBA All-Star breaks both take place right around game 50, or a little later, each year.
• Though still solidly in between 1/2 and 2/3 of the way through the season, MLB’s “Summer Classic” usually takes place slightly earlier, around game 90 (though I might submit that the postseason crunch doesn’t really start until after teams build a post-All Star record for people to talk about).
• The NFL bye weeks typically end after week 10.
• Fans and even professional sports analysts are typically inclined to value “clutch” players—i.e., those who make their bones in the “Last” quadrant above—way more than a non-aesthetic analytical approach would warrant.

Etc.

So fine, say you accept this argument about how people observe sports. Your next question may be: well, what’s wrong with 18 games? Any number of games can be divided into phi-sized quadrants, right? Well, the answer is basically yes, it can, but it’s not pretty:

The numbers 162, 82, and 16 all share a couple of nice qualities: first they are all roughly divisible by 4, so you have nice clean quarter-seasons.  Second, they each have aesthetically pleasing “exposition” periods: 100 games in MLB, 50 in the NBA and NHL, and 10 in the NFL.  The “exposition” period in an 18-game season would be 11 games.  Yuck!  These season-lengths balance our competing aesthetic desires for the harmony of symmetry and excitement of asymmetry.  We like our numbers round, but not too round.  We want them dynamic, but workable.
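As a rough check on those “exposition” lengths, the sketch below divides each season length by phi. The dictionary labels and the helper name `exposition_length` are mine, and the results are only approximately the round numbers quoted above.

```python
PHI = (1 + 5 ** 0.5) / 2

# Season lengths discussed above, plus the proposed 18-game NFL season.
season_lengths = {"MLB": 162, "NBA/NHL": 82, "NFL (16)": 16, "NFL (18)": 18}


def exposition_length(total_games):
    """Length of the first golden-ratio section: total_games / phi."""
    return total_games / PHI


for league, games in season_lengths.items():
    length = exposition_length(games)
    print(f"{league}: {games} games -> exposition of about {length:.1f}")
```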

Finally, as to why the NFL should care about vague aesthetic concerns that it takes a mathematician to identify, I can only say: I don’t think these patterns would be so pervasive in science, art, and in broader culture if they weren’t really important to us, whether we know it or not.  Human beings are symmetrical down the middle, but as some guy in Italy noticed, golden rectangles are not only woven into our design, but into the design of the things we love.  Please, NFL, don’t take that away from us.

## Applied Epistemology in Politics and the Playoffs

Two nights ago, as I was watching cable news and reading various online articles and blog posts about Christine O’Donnell’s upset win over Michael Castle in Delaware’s Republican Senate primary, the hasty, almost ferocious emergence of consensus among the punditocracy – to wit, that the GOP now has virtually zero chance of picking up that seat in November – reminded me of an issue that I’ve wanted to blog about since long before I began blogging in earnest: NFL playoff prediction models.

Specifically, I have been critical of those models that project the likelihood of each surviving team winning the Super Bowl by applying a logistic regression model (i.e., “odds of winning based on past performance”) to each remaining game. In January, I posted a number of comments to this article on Advanced NFL Stats, in which I found it absurd that, with 8 teams left, Brian Burke’s model predicted that the Dallas Cowboys had about the same chance of winning the Super Bowl as the Jets, Ravens, Vikings, and Cardinals combined. In the brief discussion, I gave two reasons (in addition to my intuition): first, that these predictions were wildly out of whack with contract prices in sports-betting markets, and second, that I didn’t believe the model sufficiently accounted for “variance in the underlying statistics.”

Burke suggested that the first point is explained by a massive epidemic of conjunction-fallacyitis among sports bettors. On its face, I think this is a ridiculous explanation: does he really believe that the market-movers in sports betting (people who put up hundreds of thousands, if not millions, of dollars of their own money) have never considered multiplying the odds of several games together? Regardless, in this post I will put forth a much better explanation for this disparity than either of us proffered at the time, hopefully mooting that discussion. On my second point, he was more dismissive, though I was being rather opaque (and somehow misspelled “beat” in one reply), so I don’t blame him. However, I do think Burke’s intellectual hubris regarding his model (aka “model hubris”) is notable, not because I have any reason to think Burke is a particularly hubristic individual, but because I think it is indicative of a massive epidemic of model-hubrisitis among sports bloggers.

In Section 1 of this post, I will discuss what I personally mean by “applied epistemology” (with apologies to any actual applied epistemologists out there) and what I think some of its more-important implications are.  In Section 2, I will try to apply these concepts by taking a more detailed look at my problems with the above-mentioned playoff prediction models.

## Section 1: Applied Epistemology Explained, Sort Of

For those who might not know, “epistemology” is essentially a fancy word for the “philosophical study of knowledge,” which mostly involves philosophers trying to define the word “knowledge” and/or trying to figure out what we know (if anything), and/or how we came to know it (if we do).  For important background, read my Complete History of Epistemology (abridged), which can be found here: In Plato’s Theaetetus, Socrates suggests that knowledge is something like “justified true belief.”  Agreement ensues.  In 1963, Edmund Gettier suggests that a person could be justified in believing something, but it could be true for the wrong reasons.  Debate ensues.  The End.

A “hot” topic in the field recently has been dealing with the implications of elaborate thought experiments similar to the following:

*begin experiment*

*end experiment*

In reality, the fact that you might be wrong, even when you’re so sure you’re right, is more than a philosophical curiosity: it is a mathematical certainty. The processes that lead you to form beliefs, even extremely strong ones, are imperfect. And when you are 100% certain that a belief-generating process is reliable, the process that led you to that belief is likely imperfect. This line of thinking is sometimes referred to as skepticism, which would be fine if it weren’t usually meant as a pejorative.

When push comes to shove, people will usually admit that there is at least some chance they are wrong, yet they massively underestimate just what those chances are. In political debates, for example, people may admit that there is some minuscule possibility that their position is ill-informed or empirically unsound, but they will almost never say that they are more likely to be wrong than to be right. Yet, when two populations hold diametrically opposed views, either one population is wrong or both are. All else being equal, the correct assessment in such scenarios is that no one is likely to have it right.

When dealing with beliefs about probabilities, things get even trickier: obviously many people believe some things are close to 100% likely to be true, when the real probability may be somewhat, if not much, lower. But in addition to the extremes, people hold a whole range of poorly calibrated probabilistic beliefs, like believing something is 60% likely when it is actually 50% or 70%. (Note: Some philosophically trained readers may balk at this idea, suggesting that determinism entails everything having either a 0 or 100% probability of being true. While this argument may be sound in classroom discussions, it is highly unpragmatic: if I believe that I will win a coin flip 60% of the time, it may be theoretically true that the universe has already determined whether the coin will turn up heads or tails, but for all intents and purposes, I am only wrong by 10 percentage points.)

But knowing that we are wrong so much of the time doesn’t tell us much by itself: it’s very hard to be right, and we do the best we can. We develop heuristics that tend towards the right answers, or (more importantly for my purposes) that allow the consequences of being wrong in both directions to even out over time. You may reasonably believe that the probability of something is 30%, when, in reality, the probability is either 20% or 40%. If the two possibilities are equally likely, then your 30% belief may be functionally equivalent to the truth under many circumstances, but the two are not the same, as I will demonstrate in Section 2 (note to the philosophers: you may have noticed that this is a bit like the Gettier examples: you might be “right,” but for the wrong reasons).

There is a science to being wrong, and it doesn’t mean you have to mope in your study, or act in bad faith when you’re out of it.  “Applied Epistemology” (at least as this armchair philosopher defines it) is the study of the processes that lead to knowledge and beliefs, and of the practical implications of their limitations.

## Section 2: NFL Playoff Prediction Models

Now, let’s finally return to the Advanced NFL Stats playoff prediction model. Burke’s methodology is simple: using a logistic regression based on various statistical indicators, the model estimates a probability for each team to win its first-round matchup. It then repeats the process for all possible second-round matchups, weighting each by its likelihood of occurring (as determined by the first-round projections), and so on through the championship. With those results in hand, a team’s chance of winning the tournament is simply the product of its chances of winning in each round. With 8 teams remaining in the divisional stage, the model’s predictions looked like this:
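To make the procedure concrete, here is a minimal sketch of that kind of round-by-round propagation. It is not Burke’s code: `win_prob` is a hypothetical stand-in (a plain logistic function of a made-up rating difference), the ratings are invented, and the bracket is assumed to be fixed with adjacent teams meeting first.

```python
import math


def win_prob(rating_a, rating_b):
    """Hypothetical stand-in for the game model: a logistic function of the
    rating difference. The real model uses its own regression inputs."""
    return 1.0 / (1.0 + math.exp(-(rating_a - rating_b)))


def title_odds(ratings):
    """Return each team's probability of winning a single-elimination bracket.

    ratings is listed in bracket order and its length must be a power of two.
    Adjacent pairs meet first, then adjacent groups of four, and so on.
    """
    n = len(ratings)
    alive = [1.0] * n  # alive[i] = chance team i has won every game so far
    block = 1          # size of each sub-bracket already resolved
    while block < n:
        updated = [0.0] * n
        for i in range(n):
            # Opponents this round come from the neighbouring sub-bracket.
            start = (i // (2 * block)) * (2 * block)
            if i < start + block:
                opponents = range(start + block, start + 2 * block)
            else:
                opponents = range(start, start + block)
            # Weight each possible matchup by the chance the opponent got here.
            beat_field = sum(
                alive[j] * win_prob(ratings[i], ratings[j]) for j in opponents
            )
            updated[i] = alive[i] * beat_field
        alive = updated
        block *= 2
    return alive


# Invented ratings for 8 divisional-round teams, in bracket order.
ratings = [0.8, -0.2, 0.1, -0.4, 0.3, 0.0, -0.1, -0.5]
odds = title_odds(ratings)
print([f"{p:.1%}" for p in odds])
```

Note that every number this produces treats `win_prob` as exactly right for every matchup, which is the assumption the rest of this section questions.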

Burke states that the individual game prediction model has a “history of accuracy” and is well “calibrated,” meaning that, historically, of the teams it has predicted to win 30% of the time, close to 30% of them have won, and so on. For a number of reasons, I remain somewhat skeptical of this claim, especially when it comes to “extreme value” games where the model predicts very heavy favorites or underdogs. (E.g.: What validation safeguards do they deploy to avoid over-fitting? How did they account for the thinness of data available for extreme values in their calibration method?) But for now, let’s assume this claim is correct, and that the model is calibrated perfectly: the fact that teams predicted to win 30% of the time actually won 30% of the time does NOT mean that each team actually had a 30% chance of winning.

That 30% number is just an average.  If you believe that the model perfectly nails the actual expectation for every team, you are crazy.  Since there is a large and reasonably measurable amount of variance in the very small sample of underlying statistics that the predictive model relies on, it necessarily follows that many teams will have significantly under or over-performed statistically relative to their true strength, which will be reflected in the model’s predictions.  The “perfect calibration” of the model only means that the error is well-hidden.
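A toy simulation shows how a model can look perfectly calibrated while being wrong about every single team. The setup is hypothetical: every team is labeled 30% by the model, but each one is really either a 20% or a 40% team, with the two equally likely.

```python
import random


def simulate_calibration(true_rates, n_games=100_000, seed=0):
    """Fraction of games won by teams whose true win chances are drawn evenly
    from true_rates, all of which the model labels with the same prediction."""
    rng = random.Random(seed)
    wins = 0
    for _ in range(n_games):
        true_p = rng.choice(true_rates)  # the team's actual strength
        wins += rng.random() < true_p    # one game played at that true chance
    return wins / n_games


# Hypothetical: every team is predicted at 30%, but none actually is 30%.
observed = simulate_calibration([0.20, 0.40])
print(f"predicted 0.300, observed {observed:.3f}")
```

The observed win rate lands near 30%, so a calibration check passes even though no team in the pool had a true 30% chance.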

This doesn’t mean that it’s a bad model: like any heuristic, the model may be completely adequate for its intended context.  For example, if you’re going to bet on an individual game, barring any other information, the average of a team’s potential chances should be functionally equivalent to their actual chances.  But if you’re planning to bet on the end-result of a series of games — such as in the divisional round of the NFL playoffs — failing to understand the distribution of error could be very costly.

For example, let’s look at what happens to Minnesota and Arizona’s Super Bowl chances if we assume that the error in their winrates is uniformly distributed in the neighborhood of their predicted winrate:

For Minnesota, I created a pool of 11 possible expectations that includes the actual prediction plus teams that were 5% to 25% better or worse. I did the same for Arizona, but with half the deviation. The average win prediction for each game remains constant, but the overall chances of winning the Super Bowl change dramatically. To some of you, the difference between 2% and 1% may not seem like much, but if you could find a casino that would regularly offer you 100-1 on something that is actually a 50-1 shot, you could become very rich very quickly. Of course, this uniform distribution is just one crude choice among many conceivable ways that the “hidden error” could be distributed, and I have no particular reason to think it is more accurate than any other. But one thing should be abundantly clear: the winrate model on which this whole system rests tells us nothing about this distribution either.
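Here is a minimal sketch of that kind of calculation. The assumptions are mine: the team must win three games, its true winrate is the same in each, the 0.35 prediction is invented rather than taken from the model, and “5% to 25% better or worse” is read as steps of 5 percentage points. The helper names `title_chance` and `uniform_pool` are also mine.

```python
def title_chance(win_rates, games=3):
    """Average chance of winning `games` straight, over equally likely win rates."""
    return sum(q ** games for q in win_rates) / len(win_rates)


def uniform_pool(predicted, half_width, step=0.05):
    """Equally likely true win rates centred on the prediction, e.g.
    half_width=5 gives 11 values from predicted - 0.25 to predicted + 0.25."""
    return [predicted + i * step for i in range(-half_width, half_width + 1)]


predicted = 0.35                   # hypothetical per-game prediction
naive = title_chance([predicted])  # no hidden error: 0.35 ** 3
pooled = title_chance(uniform_pool(predicted, 5))
print(f"naive {naive:.4f}, pooled {pooled:.4f}, ratio {pooled / naive:.2f}")
# naive 0.0429, pooled 0.0691, ratio 1.61
```

The pool’s average per-game winrate is still 0.35, yet the chance of winning three straight rises by roughly 60% under these assumptions.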

The exact structure of this particular error distribution is mostly an empirical matter that can and should invite further study.  But for the purposes of this essay, speculation may suffice.  For example, here is an ad hoc distribution that I thought seemed a little more plausible than a uniform distribution:

This table shows the chances of winning the Super Bowl for a generic divisional round playoff team with an average predicted winrate of 35% for each game.  In this scenario, there is a 30% chance (3/10) that the prediction gets it right on the money, a 40% chance that the team is around half as good as predicted (the bottom 4 values), a 10% chance that the team is slightly better, a 10% chance that it is significantly better, and a 10% chance that the model’s prediction is completely off its rocker.  These possibilities still produce a 35% average winrate, yet, as above, the overall chances of winning the Super Bowl increase significantly (this time by almost double).  Of course, 2 random hypothetical distributions don’t yet indicate a trend, so let’s look at a family of distributions to see if we can find any patterns:

This chart compares the chances of a team with a given predicted winrate to win the Super Bowl based on uniform error distributions of various sizes. The percentages in column 1 are the odds of the team winning the Super Bowl if the predicted winrate is exactly equal to its actual winrate. Each subsequent column gives the chances of winning the Super Bowl if you increase the “pool” of potential actual winrates by one on each side. Thus, the second number after 35% is the odds of winning the Super Bowl if the team is equally likely to have a 30%, 35%, or 40% chance in reality, and so on. The maximum possible change in Super Bowl winning chances for each starting prediction is contained in the light yellow box at the end of each row. I should note that I chose this family of distributions for its ease of cross-comparison, not its precision. I also experimented with many other models that produced a variety of interesting results, yet in every even remotely plausible one of them, two trends (both highly germane to my initial criticism of Burke’s model) endured:
1.  Lower predicted game odds lead to greater disparity between predicted and actual chances.

To further illustrate this, here’s a vertical slice of the data, containing the net change for each possible prediction, given a discrete uniform error distribution of size 7:

2.  Greater error ranges in the underlying distribution lead to greater disparity between predicted and actual chances.

To further illustrate this, here’s a horizontal slice of the data, containing the net change for each possible error range, given an initial winrate prediction of 35%:

Of course these underlying error distributions can and should be examined further, but even at this early stage of inquiry, we “know” enough (at least with a high degree of probability) to begin drawing conclusions. That is, we know there is considerable variance in the statistics that Burke’s model relies on, which strongly suggests that there is a considerable amount of “hidden error” in its predictions. We know greater “hidden error” leads to greater disparity in predicted Super Bowl winning chances, and that this disparity is greatest for underdogs. Therefore, it is highly likely that this model significantly under-represents the chances of underdog teams at the divisional stage of the playoffs going on to win the Super Bowl. Q.E.D.
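One way to see why both trends hold is a short derivation, offered as a sketch under assumptions that are mine rather than the model’s. Suppose the team must win three games, its true per-game winrate $q$ is the same in each, and the hidden error $\varepsilon = q - p$ is symmetric around the prediction $p$ with variance $\sigma^2$. Then:

$$
\begin{aligned}
E[q^3] &= E\big[(p + \varepsilon)^3\big] \\
       &= p^3 + 3p^2\,E[\varepsilon] + 3p\,E[\varepsilon^2] + E[\varepsilon^3] \\
       &= p^3 + 3p\,\sigma^2 \qquad \text{since } E[\varepsilon] = E[\varepsilon^3] = 0 \\
\frac{E[q^3]}{p^3} &= 1 + \frac{3\sigma^2}{p^2}
\end{aligned}
$$

In relative terms, the boost grows with the error variance $\sigma^2$ and shrinks as the predicted winrate $p$ rises, so it is largest for underdogs. For instance, with $p = 0.35$ and 11 equally likely winrates spaced 5 points apart, $\sigma^2 = 0.025$ and the ratio is $1 + 0.075/0.1225 \approx 1.61$.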

This doesn’t mean that these problems aren’t fixable: the nature of the error distribution of the individual game-predicting model could be investigated and modeled itself, and the results could be used to adjust Burke’s playoff predictions accordingly. Alternatively, if you want to avoid the sticky business of characterizing all that hidden error, a Super Bowl prediction model could be built that deals with that problem heuristically: say, by running a logistic regression that uses the available data to predict each team’s chances of winning the Super Bowl directly.

Finally, I believe this evidence both directly and indirectly supports my intuition that the large disparity between Burke’s predictions and the corresponding contract prices was more likely to be the result of model error than market error.  The direct support should be obvious, but the indirect support is also interesting:  Though markets can get it wrong just as much or more than any other process, I think that people who “put their money where their mouth is” (especially those with the most influence on the markets) tend to be more reliably skeptical and less dogmatic about making their investments than bloggers, analysts or even academics are about publishing their opinions.  Moreover, by its nature, the market takes a much more pluralistic approach to addressing controversies than do most individuals.  While this may leave it susceptible to being marginally outperformed (on balance) by more directly focused individual models or persons, I think it will also be more likely to avoid pitfalls like the one above.

## Conclusions, and My Broader Agenda

The general purpose of this post is to demonstrate both the importance and difficulty of understanding and characterizing the ways in which our beliefs, and the processes we use to form them, can get it wrong. This is, at its heart, a delicate but extremely pragmatic endeavor. It involves being appropriately skeptical of various conclusions, even when they seem right to you, and recognizing the implications of the multitude of ways that such error can manifest.

I have a whole slew of ideas about how to apply these principles when evaluating the various pronouncements made by the political commentariat, but the blogosphere already has a Nate Silver (and Mr. Silver is smarter than me anyway), so I’ll leave that for you to consider as you see fit.