Normalization (6/2/11)

By Paul Bessire

Thursday, June 2 at 3:15 PM ET
In our preseason and postseason preview articles, we typically include a section on "How This Works." While we discuss that we simulate each event 50,000 times and briefly mention which data is used, we never really get into how each simulation is conducted. As teased last week and because Googling the term brings up more results related to me and PredictionMachine, today I will finally be giving away all of our secrets about how this really works... kind of. Technology, experience, access to data and ability to manipulate data help to give us an advantage (some of that we will touch on as well), yet the most critical piece to sports simulation the way that we conduct it - log5 normalization - is pretty simple (it's also what I would classify as a nonsensical term because it is not logarithmic and doesn't really have to do with the number five).

Popularized (and named) by Bill James regarding expected batting averages in baseball, log5 normalization helps to put everything in context, which is an absolute necessity for any model designed to determine the likelihood of an event given interacting variables. In the case of baseball averages, there is a pitcher and a hitter. Suppose Joey Votto, who is one of the better hitters in baseball and is currently hitting .338, faces one of the worst pitchers in the game, like Kyle Davies (sorry Kyle, the numbers just worked), who allows an opponent average of around .338. Even if we believe that these values represent their "true" abilities, the expected likelihood that a hit occurs is not 33.8%. It's not really even close to that. When a good hitter faces a bad pitcher, his expected batting average should be much better than his normal average against the rest of the league (and vice versa).

Enter log5 normalization (as James first introduced it):

H/AB = ((AVG * OAV) / LgAVG) / ((AVG * OAV) / LgAVG + ((1 - AVG) * (1 - OAV)) / (1 - LgAVG))

Where: AVG is the batter's true batting average,
OAV is the pitcher's true batting average allowed,
and LgAVG is (pitcher's league average + batter's league average) / 2

In this formula, for the example above, the result is that Joey Votto should be expected to hit .437 in at-bats against Kyle Davies (assuming current league averages of .251 in the AL and .252 in the NL, and a ballpark that is neutral with respect to park factor and crowd). Against a pitcher like Josh Beckett, who currently has a .189 OAV, Votto would be expected to hit .262.
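For readers who want to check the arithmetic, here is a minimal Python sketch of the formula exactly as written above. The function name log5 and the constant LG_AVG are just labels chosen for this example, and this is the textbook James form, not the modified version that runs in the Predictalator.

```python
def log5(avg: float, oav: float, lg_avg: float) -> float:
    """Expected batting average for one batter/pitcher matchup."""
    hit = (avg * oav) / lg_avg                    # "hit" term, scaled by the league rate
    out = ((1 - avg) * (1 - oav)) / (1 - lg_avg)  # matching "no hit" term
    return hit / (hit + out)


# League average: mean of the AL (.251) and NL (.252) figures quoted above.
LG_AVG = (0.251 + 0.252) / 2

votto_vs_davies = log5(0.338, 0.338, LG_AVG)    # about .437
votto_vs_beckett = log5(0.338, 0.189, LG_AVG)   # about .262

print(f"Votto vs. Davies:  {votto_vs_davies:.3f}")
print(f"Votto vs. Beckett: {votto_vs_beckett:.3f}")
```

One handy sanity check: if the pitcher's OAV equals the league average, the formula hands back the batter's own average unchanged, which is what "putting it in context" should mean.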

The formula itself is designed more with cross-era normalization in mind, meaning that it is great at helping to answer debates about how Babe Ruth would do against Pedro Martinez. The formula above also makes the invalid assumption that the pitcher and hitter factor equally in the at-bat event (that's not true; research that I have conducted, which is published here, notes that the batter plays more of a role in every step of the plate appearance event, including HBP and BB, than the pitcher). So in some ways - typically looking at players from within the same league - the actual version in the Predictalator is even simpler than above; while, in other ways - separate pitcher/hitter modifiers for each decision in the plate appearance event - it is a little more complicated.

I've often mentioned how each play is simulated within each game. It's even more detailed than that. We need to know how likely any occurrence is within each play. We start with the possibility of the most all-encompassing outcome and work our way towards a final result of the play. When we have calculated the odds (a probability between 0 and 1) of a specific result occurring, we draw a random number (between 0 and 1). If the random number falls within the likelihood of the event, the event occurs and we move on to the next play (or step in the play). If not, we move on to another possible outcome of the event.

Using baseball as an example, we first determine if a walk occurs by analyzing each player's walks per plate appearance and feeding those values through the modified log5 normalization formula. If the resulting value is .08 and we draw any random number greater than that, we then consider if a hit by pitch has occurred by looking at HBP per plate appearance minus walks. If not HBP, then a hit (using batting average, strikeouts and ballpark). If a hit, then what type of hit (using single, double, triple and home run per hit, ballpark factor and defense if it takes away a hit or accounts for more or fewer bases). If not a hit, then if the ball is put in play (using strikeouts and outs). And, if an out, then who makes the play and does his defense turn an out into a hit.
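The draw-and-fall-through sequence described above can be sketched in a few lines of Python. Treat this as an illustration only: every probability except the .08 walk rate is a made-up placeholder (in practice each would come out of the modified log5 formula for the specific batter and pitcher), the names are hypothetical, and it assumes a fresh random number is drawn at each step. The fielding step (who makes the play, and whether defense turns an out into a hit) is left out.

```python
import random

# Placeholder probabilities for one matchup. Only P_WALK (.08) comes from
# the text; the rest are invented for illustration.
P_WALK = 0.08        # walk, per plate appearance
P_HBP = 0.01         # hit by pitch, given no walk
P_HIT = 0.27         # hit, given no walk and no HBP
P_STRIKEOUT = 0.25   # strikeout, given no hit
HIT_TYPES = [("single", 0.66), ("double", 0.20), ("triple", 0.02), ("home_run", 0.12)]


def pick_hit_type(rng):
    """One draw against cumulative per-hit shares decides the kind of hit."""
    draw = rng.random()
    cumulative = 0.0
    for hit_type, share in HIT_TYPES:
        cumulative += share
        if draw < cumulative:
            return hit_type
    return HIT_TYPES[-1][0]  # guards against floating-point rounding


def simulate_plate_appearance(rng):
    """Start with the broadest outcome and work toward a final result."""
    if rng.random() < P_WALK:
        return "walk"
    if rng.random() < P_HBP:
        return "hit_by_pitch"
    if rng.random() < P_HIT:
        return pick_hit_type(rng)
    if rng.random() < P_STRIKEOUT:
        return "strikeout"
    return "out_in_play"


rng = random.Random(2011)  # seeded only so the example is repeatable
print([simulate_plate_appearance(rng) for _ in range(5)])
```

Because each step is only reached when the earlier ones did not happen, each probability here is conditional on the steps before it, which is why the text describes HBP as "per plate appearance minus walks."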

There are also special circumstances (stealing bases, hit and run, bunting, sacrifice flies, pinch hitting, pitcher substitutions, etc.) as well as some homefield advantage (research suggests that the actual likelihood of getting on base at home is around 1.7% greater than the true log5 normalization would expect - and similarly less on the road), but that is everything to a baseball simulation engine. Of course, it's garbage in, garbage out, so the data has to be as pure and relevant to each player as possible, stripped of the biases of previous levels of competition, ballparks, health, umpires and age/experience, and the lineups and pitching staffs have to be right. Consequently, we spend the vast majority of our time on making sure the data is not garbage as opposed to worrying too much about the engine itself.

The beauty of this kind of simple normalization, though, is that it can be applied to everything. Because there are interacting variables in just about every decision we have to make and because each decision must be associated with a probability of occurrence for each possible outcome, weighted normalization allows us to add the appropriate context to every step of every engine we have.

The running back, offensive line, offensive coaching strategies, homefield advantage and situation all combine to help us determine the offense's role in the likelihood of any distance of a running play in football. Each defensive player and the defense's scheme combine to help us determine the defense's role in the same event. All bias is removed from the data, the resulting information is appropriately weighted and normalized to give us the range of probabilities for each occurrence and a random number is drawn to find the result (of course we had to figure out how to get to this point, who got the ball and if there is a turnover or penalty too). Each time a play is completed, a new set of circumstances dictates the probabilities in the next play. That's why 50,000 simulations produce 50,000 different results (assuming the random number generator truly is random).
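To show what running the same matchup 50,000 times buys you, here is a deliberately tiny Python sketch. It is a toy, not our engine: each side simply gets 20 chances to score with a fixed probability, the team with more scores wins, and ties go to a coin flip. The function names, the 20-possession game and the .52/.48 scoring rates are all assumptions invented for the example.

```python
import random


def play_game(rng, p_home, p_away, possessions=20):
    """Toy game: True if the home team wins."""
    home = sum(rng.random() < p_home for _ in range(possessions))
    away = sum(rng.random() < p_away for _ in range(possessions))
    if home == away:
        return rng.random() < 0.5  # toy tiebreak: coin flip
    return home > away


def home_win_rate(p_home, p_away, n_sims=50_000, seed=2011):
    """Play the same matchup n_sims times and report how often home wins."""
    rng = random.Random(seed)  # one generator shared by every simulated game
    wins = sum(play_game(rng, p_home, p_away) for _ in range(n_sims))
    return wins / n_sims


print(f"home win rate: {home_win_rate(0.52, 0.48):.3f}")
```

Every individual game plays out differently because every draw is different, but the share of games won settles down as the count grows. Fixing the seed here just makes the example repeatable.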

In basketball, we determine which player ends the possession with the ball and then must determine if he shoots, is fouled or turns the ball over before addressing the outcome of any of those possibilities. All ten players on the court, the coaches, fans and referees play a role in those odds. Their inputs are combined with our weighted normalization.
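As a rough Python sketch of that basketball sequence (who ends the possession, how it ends, then what comes of it), with the caveat that every name and number below is hypothetical and that the real odds would come from the weighted normalization of all the inputs mentioned above:

```python
import random

# Hypothetical inputs: none of these names or numbers come from real data.
USAGE = {"guard": 0.30, "wing": 0.25, "forward": 0.20, "big": 0.15, "bench": 0.10}
ENDINGS = [("shot", 0.78), ("foul", 0.09), ("turnover", 0.13)]
P_MAKE = 0.46  # chance a shot goes in, once we know a shot was taken


def pick_from(rng, table):
    """One draw against cumulative probabilities; table is [(label, p), ...]."""
    draw = rng.random()
    cumulative = 0.0
    for label, p in table:
        cumulative += p
        if draw < cumulative:
            return label
    return table[-1][0]  # guards against floating-point rounding


def simulate_possession(rng):
    # Step 1: which player ends the possession with the ball.
    player = rng.choices(list(USAGE), weights=list(USAGE.values()), k=1)[0]
    # Step 2: does he shoot, get fouled or turn it over.
    ending = pick_from(rng, ENDINGS)
    # Step 3: resolve the outcome of that ending.
    if ending == "shot":
        result = "made_shot" if rng.random() < P_MAKE else "missed_shot"
    elif ending == "foul":
        result = "free_throws"
    else:
        result = "turnover"
    return player, result


rng = random.Random(2011)  # seeded only so the example is repeatable
print([simulate_possession(rng) for _ in range(3)])
```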

And there you have it. Simulation 101. The one algorithm that is in every simulation engine we have is about as straightforward as it gets (tell any quantitatively savvy person to quickly derive the simplest measure of normalization and they'd probably come up with something similar, if not identical, to this), but it's a fun and very valuable tool nonetheless (that can just as easily be used outside of the sports world when calculating expected outcomes of events where anything interacts).

Baseball and Basketball Performance
Next week's blog will feature a more thorough baseball and NBA Playoffs recap as its main subject (as alluded to previously, I've been meaning to discuss normalization for a while). The Pirates blowing a 7-0 lead today notwithstanding (we had them on the run-line and with the Over), June has already started better than May ended.

Despite the noted struggles during Interleague play and, for some reason, on Sundays in May, there are a few notes of interest that we have not focused on before, yet that may help... With "upset watch" picks - when we are picking the underdog to win outright - our record so far is 37-24 (61%, +$406). The average payout in those games is just +107, but that's better than laying anything. Interestingly, money-line underdogs are good to us in baseball in general. We are also 19-14 (58%, +$450) when playing +165 ML dogs or better. And when taking the "upset watch" pick on the run-line, we are 34-10 (77%, +$141)... Our performance has not been as profitable with big favorites on the money or run lines (though we have been picking fewer than three -180 money-line teams a week, so it could be a sample size issue since we'd have to win more than 65% of those games to be profitable - we did win the only -180 or stronger normal+ pick we have had)... The more I review the data and the engine in baseball, the happier I am with how it looks and how it should be performing. Hopefully, we'll have a luckier June... When I mention luck, I am particularly referring to weather and extra-inning games. We have had a propensity for having strong opinions on games that are ultimately rained out and, unfortunately, there have already been more rainouts this year than all of last season. Unseasonably cold and rainy weather (even more extreme than most forecasts from the last two months) has had a profound impact on the game in general. It seems like summer is rolling in "normally" now, which should only help... Our record on the ML, O/U and RL in extra-inning games is abysmal. There may be some reasons for that (taking road teams and unders more often than not in likely close games), but the results have gone against us far more often than they should have. Compound that notion with the fact that teams have already played 1.3 more innings a game than last season (ERAs may be down, but runs per game are way up) and 27% more extra-inning games than expected to this point in the season, and the extra-inning issue seems more fluke than otherwise.

With the NBA, I'll do a full summary next week, but was pleasantly surprised with the results when I finally put the data back together after losing it all a few weeks ago. Playing every playable pick has not produced tremendous results, but normal+, upset watch, top plays of the day and even 56%+ (a segment we discussed right off the bat) picks have all been profitable. Did you know we are on a 10-2 O/U run in the playoffs? Oddly, we had an almost identical trend in the last two rounds of last season.

As usual, if you have any of your own suggestions about how to improve the site, please do not hesitate to contact us at any time. We respond to every support contact as quickly as we can (usually within a few hours) and are very amenable to suggestions. I firmly believe that open communication with our customers and user feedback are the best ways for us to grow and provide the types of products that will maximize the experience for all. Thank you in advance for your suggestions, comments and questions.