January 16, 2015
There are two important aspects of prediction. The first concerns the accuracy of the prediction: how close a prediction is to the actual, observed result. The second is uncertainty, which is how sure a forecaster is about his or her projection. These concepts are fundamental to forecasting, and apply equally to predictions of the weather, the stock market, or the outcome of tomorrow’s ballgame. At present, only one of these facets gets much attention in the world of baseball projections, and that is accuracy. Accuracy is measured by the absolute error, which captures how far, on average, a forecast falls from the actual, observed result. Projectionists strive primarily to minimize this number.
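To make the accuracy measure concrete, here is a minimal sketch that computes the mean absolute error alongside the root mean squared error for the same set of forecasts. The four players and their OPS figures are invented for illustration; the point is only that a single large miss inflates RMSE far more than it inflates the absolute error.

```python
import math

def mean_absolute_error(projected, observed):
    """Average of |projection - actual| across players."""
    return sum(abs(p - o) for p, o in zip(projected, observed)) / len(projected)

def root_mean_squared_error(projected, observed):
    """Square root of the average squared miss; large misses count extra."""
    squared = sum((p - o) ** 2 for p, o in zip(projected, observed))
    return math.sqrt(squared / len(projected))

# Hypothetical OPS projections and outcomes (made-up numbers, not study data).
projected = [0.750, 0.800, 0.700, 0.720]
observed = [0.760, 0.790, 0.710, 0.520]  # the last player is a big miss

mae = mean_absolute_error(projected, observed)
rmse = root_mean_squared_error(projected, observed)
print(f"MAE = {mae:.4f}, RMSE = {rmse:.4f}")
```

With three small misses and one large one, the RMSE comes out well above the mean absolute error, because squaring gives the large miss extra weight.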
The under-examined facet of prediction that we address in this article is uncertainty. We know that predictions tend to be accurate to within a hundred or so points of OPS, but we would also like to know whether we are more or less likely to be wrong on certain players. Uncertainty is often treated as a second-order concern because it is usually more difficult to estimate. However, as we show, it is possible to predict ahead of time which players’ forecasts are more uncertain than others. This matters because teams may differ in their appetite for risk: a team with high win expectations (90+ wins) might prefer to reduce risk, whereas a middle-of-the-road team (80-85 wins) would presumably seek risk in order to “get lucky” and reach the postseason.
For this study, we investigated the three major slash statistics: batting average, on-base percentage, and slugging percentage. We chose these three because combined they do well at describing a hitter’s abilities. In order to predict uncertainty on a player-by-player basis, we looked for possible correlates in a player’s statistics.
The first place we looked was PECOTA. PECOTA possesses an underappreciated feature: In addition to a weighted mean projection for each player, PECOTA produces a range of predictions (percentiles), along with the probability of each outcome (a 90th percentile projection means that 90 percent of the time, a player’s performance is expected to fall below this line). Some players have wide-ranging percentile predictions, while others have narrower ones; these percentiles constitute a direct prediction of PECOTA’s own uncertainty. Steamer possesses a similar feature, so we also included its percentile forecasts. The other measures of uncertainty we considered were Tango’s reliability score, the total number and standard deviation of forecasts submitted to the Baseball Projection Project, a player’s experience in terms of the number of seasons he had played, and a player’s career variability in terms of the standard deviation of his performance in each statistic.
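As a rough illustration of how a set of percentile projections can be collapsed into a single uncertainty number, the sketch below takes the standard deviation across a player's percentile forecasts. The two players, the percentile levels, and the batting averages are all hypothetical; neither PECOTA's nor Steamer's actual output format is assumed here.

```python
from statistics import stdev

# Hypothetical batting-average projections keyed by percentile (made-up values).
steady_veteran = {10: 0.262, 25: 0.270, 50: 0.278, 75: 0.286, 90: 0.294}
volatile_rookie = {10: 0.215, 25: 0.240, 50: 0.265, 75: 0.290, 90: 0.318}

def percentile_spread(percentiles):
    """Sample standard deviation across a player's percentile projections."""
    return stdev(list(percentiles.values()))

veteran_spread = percentile_spread(steady_veteran)
rookie_spread = percentile_spread(volatile_rookie)
print(f"veteran: {veteran_spread:.4f}, rookie: {rookie_spread:.4f}")
```

A wider spread marks a forecast that the system itself treats as less certain, so the rookie here would be flagged as the riskier projection.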
The results show that we can make inferences about our own uncertainty for each of the three slash statistics. Linear models using all of the above variables were able to explain significant amounts of the absolute prediction error[3] in AVG, OBP, and SLG. The following tables show the results of our regressions for 413 players in 2014. First, here is the combined regression of all predictors:
And here are the R² values for each variable individually correlated against the absolute error.
The major contributors to predicting uncertainty in the combined regression were Steamer’s percentile projections and the career number of PAs. Both of these predictors make sense: The longer a player has played, the better our information becomes on that player. Steamer’s percentiles (more precisely, the standard deviation in Steamer’s percentiles) are also able to tell us something about how likely a player is to deviate from his forecasts.
Reassuringly, when we excluded Steamer’s percentile projections, the standard deviation in PECOTA’s percentile projections foretold uncertainty in all three statistics in similar ways. When the PECOTA or Steamer percentiles spanned a wide range for a statistic, we were likely to be less accurate for that player (that is, there was likely to be a greater absolute error). However, when combined in the same regression, Steamer’s percentiles rendered PECOTA’s information redundant, due to the high (r ≈ .5) correlation between them.
Other possible predictors proved less useful, or useful only for some statistics. Tom Tango’s reliability score did little to foretell prediction error, except when examined on its own. The career standard deviation did not improve the fit to any significant degree, and the standard deviation between projections was only useful for AVG, not OBP or SLG. Overall, we were best able to predict absolute error in AVG (R² = .2161), followed by OBP (R² = .2085) and SLG (R² = .1707).
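The single-predictor version of this analysis can be sketched as follows: compute each player's absolute error, then ask how much of its variation one candidate predictor explains. This is a simplified illustration rather than the regression specification used in the study; the six players and all of their numbers are invented, and the combined model with several predictors would need a multiple-regression routine instead.

```python
def r_squared(x, y):
    """R^2 of a one-predictor least-squares fit (the squared Pearson correlation)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy ** 2 / (sxx * syy)

# Made-up data for six hypothetical players (not the study's data).
percentile_sd = [0.010, 0.015, 0.020, 0.025, 0.030, 0.035]
projected_avg = [0.280, 0.265, 0.250, 0.270, 0.240, 0.255]
observed_avg = [0.292, 0.257, 0.275, 0.255, 0.280, 0.225]

# The quantity being explained is the size of each miss, not its direction.
abs_error = [abs(p - o) for p, o in zip(projected_avg, observed_avg)]
fit = r_squared(percentile_sd, abs_error)
print(f"R^2 = {fit:.3f}")
```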
In this first case, we used all players who possessed the necessary data, applying no plate appearance cutoff. This choice is the most inclusive, but neglects the fact that variability in each statistic is increased in smaller samples. Accordingly, a player’s projection error is likely to be more extreme in a sample of 100 PAs than it is in 400. By the time one gets to a full season’s worth of plate appearances, some statistics even become reasonably stable.
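A back-of-the-envelope sketch shows why sample size matters so much here. It assumes, as a simplification, that each plate appearance is an independent trial with a fixed chance of reaching base, so the observed OBP has a binomial standard error. The .320 true OBP is a hypothetical value chosen only for illustration.

```python
import math

def obp_standard_error(true_obp, plate_appearances):
    """Binomial standard error of an observed OBP (independent-PA assumption)."""
    return math.sqrt(true_obp * (1 - true_obp) / plate_appearances)

TRUE_OBP = 0.320  # hypothetical underlying talent level

se_100 = obp_standard_error(TRUE_OBP, 100)
se_400 = obp_standard_error(TRUE_OBP, 400)
print(f"SE at 100 PA: {se_100:.3f}, SE at 400 PA: {se_400:.3f}")
```

Under these assumptions, quadrupling the plate appearances halves the random noise in the observed rate, so a part-time player can miss his projection by a wide margin through chance alone.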
To account for this variability problem, we repeated the same analyses, this time applying a threshold of 400 PAs to the players we modeled. This choice encompasses most of the regular, everyday players in the league. The results for this more restricted subset were considerably less promising: Most predictor variables dropped out of statistical significance, and the total R² scores fell substantially (R² ≈ .04-.08).
However, there is a hidden problem that necessitates a more nuanced interpretation. As we reduced the sample size of our dataset substantially by applying the PA threshold, we also reduced our ability to detect significant predictors of uncertainty (recall that p-values are functions of both effect size and sample size). As a result, we expect that even if some predictors of uncertainty work to the same degree in the restricted PA sample, they might not be called significant.
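The dependence of significance on sample size can be illustrated with the standard t-statistic for a correlation coefficient, t = r·sqrt(n − 2) / sqrt(1 − r²). In the sketch below, the correlation of .2 and the restricted sample of 80 players are hypothetical values; only the 413-player figure comes from the analysis above.

```python
import math

def correlation_t_statistic(r, n):
    """t-statistic for testing whether a Pearson correlation differs from zero."""
    return r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)

R = 0.2              # hypothetical effect size, held fixed
FULL_SAMPLE = 413    # players in the unrestricted regressions
RESTRICTED = 80      # hypothetical count after a PA cutoff

t_full = correlation_t_statistic(R, FULL_SAMPLE)
t_restricted = correlation_t_statistic(R, RESTRICTED)
print(f"t (n={FULL_SAMPLE}) = {t_full:.2f}, t (n={RESTRICTED}) = {t_restricted:.2f}")
```

The same underlying relationship clears the conventional threshold of roughly 2 comfortably in the larger sample but falls short of it in the smaller one.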
Re-examining the 400 PA uncertainty regressions, we found that percentiles (Steamer and PECOTA alike) remained weakly associated with prediction error for OBP and AVG. This result may suggest that percentiles still have some utility for predicting uncertainty, even for players who achieve high numbers of plate appearances. Career standard deviation proved more useful for the everyday players as well. (We hope to confirm this by looking at multiple years in follow-up work.) Even so, as noted, the R² values were much diminished, showing that uncertainty is harder to estimate for everyday players. In the long run, rather than applying plate appearance cutoffs, a more potent strategy might be to regress all observed statistics according to the number of PAs.
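One common way to regress an observed rate according to playing time is to blend it with the league average, weighting the observation by its plate appearances. The sketch below shows that idea only; it is not a method specified in this article, and the league OBP of .320 and the regression constant of 300 PAs are hypothetical placeholders.

```python
def regress_to_mean(observed_rate, plate_appearances, league_rate, k):
    """Shrink an observed rate toward the league rate; k acts like phantom league-average PAs."""
    weighted = observed_rate * plate_appearances + league_rate * k
    return weighted / (plate_appearances + k)

LEAGUE_OBP = 0.320  # hypothetical league average
K = 300             # hypothetical regression constant, in plate appearances

part_timer = regress_to_mean(0.400, 100, LEAGUE_OBP, K)
regular = regress_to_mean(0.400, 600, LEAGUE_OBP, K)
print(f"100 PA: {part_timer:.3f}, 600 PA: {regular:.3f}")
```

The same .400 OBP is pulled much closer to league average when it comes from 100 plate appearances than when it comes from 600, which lets every player stay in the dataset without letting small samples dominate the error.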
Because we have shown that uncertainty is not the same for all players, and that we can estimate that uncertainty in advance, we think these results have significant implications for understanding projections. Some players, particularly those with less experience, are more likely than others to deviate strongly from their projections. Forecasting systems that produce distributions of possible outcomes, like Steamer and PECOTA, are also able to gauge their own uncertainty to some degree.
Whether variable or unpredictable performance is beneficial to a team depends to a large extent on its level of competitiveness. Good, playoff-contending teams should want to buy players with low uncertainties, as they are most likely to deliver dependable value. Teams on the other end of the win curve, whether rebuilding or poor contenders for the playoffs, should want to buy more uncertain players, for two reasons. First, the more uncertain players are likely to be available at a discount, which allows these lower-tier contenders to get better value for them. Second, if the players strongly overachieve their projections, they can be sold to contending teams for future value (i.e., young players or prospects). In practice, some baseball analysts already think in these terms to some degree, but our analysis provides rigorous, quantitative support for doing so.
We can see abundant examples of this pattern of acquisition in the real business of baseball. The Padres, formerly thought to be poor contenders, have bought low on a series of high-risk acquisitions, like Matt Kemp and Wil Myers, both of whom have shown extremely variable performances in the past and carry outsized PECOTA percentile projections. If these players overachieve this year, the Padres might be surprise playoff contenders. If, as seems more likely, the Padres fall out of contention by midseason, one or two of the risky players who overachieve can be traded, accelerating the team’s rebuilding schedule.
Speaking of Matt Kemp and Wil Myers, one way this analysis could be improved is by incorporating injury information. Both players, and many others who have variable performances, have sustained debilitating injuries, which can severely reduce playing time or effectiveness. Injuries are a great wild card in projections, and a factor front offices undoubtedly appraise in pricing players (both in free agency and trades).
We have shown that uncertainty varies between players and can be predicted ahead of time. Percentile projections are a powerful tool in forecasting uncertainty: The greater the standard deviation in percentiles, the more likely a player’s performance is to deviate from his projection. Other useful factors include the number of PAs in the league and, to a lesser extent, the deviations between forecasting systems. Predicting uncertainty became much harder after applying a PA threshold, but percentiles and career variance still proved somewhat useful. Understanding not only the error in projections, but also how that error is distributed among players, could open the door to a more nuanced appreciation of player valuations and front office strategy.
Will Larson is a Ph.D. economist who moonlights as an amateur baseball statistician. He runs the Baseball Projection Project at www.bbprojectionproject.com. You can tweet to him @larsonwd or visit his personal website at www.williamlarson.com.
[1] More formally, we use “accuracy” to refer to the forecast error and “uncertainty” to refer to the forecast error variance.
[2] The other popular loss function is the root mean squared error (RMSE). We prefer the absolute forecast error because it downweights extreme “misses” relative to RMSE calculations.
[3] Contrasting here the actual, observed result against the median PECOTA prediction. Results were similar when using the average of all prediction algorithms instead of just PECOTA.