July 22, 2016
DRA 2016: Challenging the Citadel of DIPS
As many of you know, we updated the formulation of Deserved Run Average (DRA) once again for the 2016 baseball season. We gave you the overview of the changes here, discussed the innards here, and talked about the new run-scaling mechanism here.
This last article deals with arguably the most important question of all: What, exactly, is DRA trying to tell you? And what does it mean?
Last year, DRA was focused on being a “better” RA9. After running one overall mixed model to create a value per plate appearance for each pitcher, we ran a second regression, using multivariate adaptive regression splines (MARS), to model the last three years of relationships between all pitcher value rates and park-adjusted pitcher linear weights allowed. The predictions from this second regression took each season’s mixed model results, forced them back into a runs-allowed framework, and then converted PAs to IPs to get DRA.
This approach did succeed in putting DRA onto an RA9 scale, but in some ways it was less than ideal.
First, having moved one step forward with a mixed model, we arguably were taking a half step back by reintroducing the noisy statistics—raw linear weights and, effectively, RA9—that we were trying to get away from in the first place. The results were generally fine: Good pitchers did well, bad pitchers did poorly, and there were defensible reasons why DRA favored certain pitchers over others when it disagreed with other metrics. But, the fact that something works reasonably well is not, by itself, sufficient to continue doing it.
Second, this approach forced us to make DRA an entirely descriptive metric with limited predictive value, since its yardstick metric, RA9, is itself a descriptive metric with limited predictive value. This did allow DRA to “explain” about 70 percent of same-season run-scoring (in an r-squared sense), which was significantly more than FIP and other metrics, but also required that we refer readers instead to cFIP to measure pitcher skill and anticipated future production.
Moreover, none of this resolved the underlying tension of what DRA was trying to do and what relationship, exactly, it should be seeking with run expectancy and run-scoring. In fact, under our previous approach, it was difficult to define what the “ideal” relationship was supposed to be. As Rob Arthur noted more than once, “if you want great correlation with RA9, then just use RA9 and be done with it.” Our working theory was that we wanted close enough correlation to RA9 to be reasonable, yet not so close that it was suspicious or essentially just reproducing RA9. That made sense, but it was also more of a rule of thumb than a coherent framework.
By the time of spring training this year, we were inclined to go in a different direction. Many of the things we were trying to measure—park effect, quality of opposition, catcher framing—were never going to be specifically reflected by RA9. In fact, RA9’s inability to quantify these factors is why DRA seemed necessary to begin with.
So we’ve decided to stop forcing DRA to be a better RA9. Although DRA remains on an RA9-style scale, we’re instead letting it do what it does best: give us deserved runs allowed, as measured by the skills the models find make pitchers effective. The hallmark of a true “skills” estimator is its reliability: the extent to which it grades the same player, with presumably similar skills, in a similar manner at different times. Put another way, the skills that drive a player’s rating one year should earn him a similar rating the following year. If not, you are more likely measuring noise, not ability.
This turns out to be the right choice, and the choice we should have made all along. In fact, DRA turns out to be really good at measuring pitcher skills.
One way to test reliability is to see how similarly a metric rates the same pitcher from one year to the next. We’ll do this and compare DRA to both Fielding Independent Pitching (FIP, the sabermetric standard) and cFIP, our contextual variant of FIP. Because run-scoring is not normally distributed, we should and will use the more robust Spearman correlation to compare metrics. The Spearman compares metrics by how they rank pitchers relative to each other, instead of relying on the raw values. We also weight performances by the number of innings each pitcher pitched.
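(For the weighted correlations themselves we use the wCorr R package, cited below. As a rough illustration of the idea, not our actual code: a weighted Spearman correlation is just a weighted Pearson correlation computed on the ranks, with innings pitched supplying the weights. A minimal Python sketch, assuming no tied values:)

```python
import numpy as np

def weighted_spearman(x, y, w):
    """Weighted Spearman correlation: a weighted Pearson correlation
    on the ranks of x and y.

    x, y: a metric's values for the same pitchers in year N and year N+1.
    w:    weights (e.g., innings pitched).
    Assumes no ties, so argsort-of-argsort ranking is sufficient.
    """
    x, y, w = map(np.asarray, (x, y, w))
    rx = np.argsort(np.argsort(x)).astype(float)  # ranks of x
    ry = np.argsort(np.argsort(y)).astype(float)  # ranks of y
    # Weighted Pearson correlation of the two rank vectors.
    mx = np.average(rx, weights=w)
    my = np.average(ry, weights=w)
    cov = np.average((rx - mx) * (ry - my), weights=w)
    sx = np.sqrt(np.average((rx - mx) ** 2, weights=w))
    sy = np.sqrt(np.average((ry - my) ** 2, weights=w))
    return cov / (sx * sy)

# A perfectly monotone year-to-year relationship yields 1.0,
# regardless of the raw values involved.
print(weighted_spearman([3.1, 4.2, 2.8], [3.0, 4.5, 2.5], [200, 180, 150]))
```

Because only the ranks enter the calculation, a metric is not rewarded for matching raw run totals, only for ordering pitchers consistently.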
Here are the year-to-year reliability comparisons for the same pitchers in consecutive years between 2010 and 2015:
Table 1: Pitcher Ability, Year to Year Reliability (higher is better)
These results are remarkable. Certainly, it’s no secret that FIP reflects core pitcher skills better than RA9. I’ve also previously shown that cFIP, with its mixed-model framework, is much more reliable than FIP, a fact confirmed by the chart above. The real story, though, is that DRA is now substantially out-performing FIP on year-to-year reliability, which is an astonishing feat.
Why is that? Well, remember that FIP and its variants are outgrowths of DIPS theory. Defense Independent Pitching Statistics (DIPS), insightfully proposed by Voros McCracken, postulated that pitchers have little, if any, control over balls that are put into play but do not leave the park. The most commonly cited measurement of these events is Batting Average on Balls in Play (BABIP). Subsequent analysis argued that (at least certain) pitchers have more control than McCracken believed, but the core philosophy of DIPS remains a prevalent aspect of everyday sabermetric thinking. In accordance with DIPS theory, FIP (and cFIP) limit themselves to strikeouts, walks, hit batsmen (in some variants), and home runs.
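(That restriction is visible in the FIP formula itself: only home runs, walks, hit batsmen, and strikeouts enter. A sketch, with the usual sabermetric weights; the league constant varies by season, so the 3.10 used here is purely illustrative:)

```python
def fip(hr, bb, hbp, k, ip, league_constant=3.10):
    """Fielding Independent Pitching. Only DIPS events appear:
    home runs, walks, hit batsmen, and strikeouts. Everything that
    happens on a ball in play (short of a home run) is ignored.

    league_constant is recalculated each season in practice;
    3.10 is an illustrative placeholder.
    """
    return (13 * hr + 3 * (bb + hbp) - 2 * k) / ip + league_constant

# A strikeout-heavy, homer-stingy 200-inning season grades well:
print(round(fip(hr=15, bb=40, hbp=5, k=220, ip=200.0), 2))
```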
DRA does not observe the same limits. In fact, DRA now includes several models that specifically rely upon the results of balls in play. If pitchers, on average, truly have little control over such results, then DRA’s consideration of these events should result in less accuracy, not more. And yet, the opposite is happening. To see why, look at what happens when DRA gets ahold of two statistics that have proved troublesome in their raw form:
Table 2: Component Comparison, Year to Year Reliability (higher is better)
Home runs are easily the most controversial part of the FIP formula, because they can be influenced by so many other factors. In fact, raw home run rates are so volatile that Dave Studeman proposed xFIP, which replaces the home run component with a league-average home run rate applied to each pitcher’s fly balls. Inside DRA, though, home run rate is a strength, rather than a weakness. DRA not only models home runs effectively; its home run model correlates year to year at a level commonly associated with strikeouts and walks. Since home runs are the most valuable event in baseball, this is a significant achievement. It’s notable that just setting everyone to the mean produces a Spearman correlation of 0, so this result does not seem attributable merely to compressing the distribution of home runs.
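(Studeman’s fix amounts to a one-term change: swap the pitcher’s actual home runs for the league HR-per-fly-ball rate times his fly balls. A sketch; the 10.5 percent HR/FB rate and 3.10 constant are illustrative stand-ins, as both vary by season:)

```python
def xfip(fb, bb, hbp, k, ip, lg_hr_per_fb=0.105, league_constant=3.10):
    """xFIP: the FIP formula with the HR term replaced by *expected*
    home runs, i.e., the league-average HR/FB rate applied to the
    pitcher's fly balls. The pitcher's actual home run total never
    appears; 0.105 and 3.10 are illustrative, season-varying values.
    """
    expected_hr = lg_hr_per_fb * fb
    return (13 * expected_hr + 3 * (bb + hbp) - 2 * k) / ip + league_constant

# 150 fly balls imply ~15.75 expected home runs, no matter how many
# actually cleared the fence that year.
print(round(xfip(fb=150, bb=40, hbp=5, k=220, ip=200.0), 2))
```

The design choice is the point: because a pitcher’s fly-ball rate is far more stable year to year than his HR/FB rate, xFIP trades descriptive accuracy for reliability. DRA instead keeps actual home runs and models their context.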
But vis-à-vis DIPS, the more significant finding may be that DRA appears to have substantially solved BABIP. Viewed in its raw form, with a year-to-year correlation of a mere .15, BABIP looks unsolvable. But DRA takes a pitcher’s circumstances into account, including his parks, opponents, and other contextual factors. This allows DRA to find patterns that simple algebra does not. In variance-explained terms, DRA’s year-to-year reliability on balls in play is almost 15 times that of raw BABIP, strongly suggesting that it is diagnosing substantial true talent in BABIP suppression (or the lack thereof).
How do you keep track of all these different components? For ease of use, we compile the results of all the DRA models into three general categories, each measured in runs above or below average. The first, “NIP Runs,” shows the pitcher’s deserved runs arising from events involving balls Not put In Play (NIP): walks, strikeouts, and hit batsmen. The second category is “Hit Runs,” which are the pitcher’s runs saved from limiting damage on contact. The final category is “Out Runs,” which measures the pitcher’s ability to generate outs on balls in play, controlling, as always, for context. The DRA Runs leaderboard allows you to see how each component has contributed to (or inhibited) each pitcher’s success. (Negative is always better.) It’s noteworthy that DRA derives all of its findings from the same play-by-play data that MLB Advanced Media (MLBAM) (and, by extension, Retrosheet) have made available for years: no exit velocity or Statcast measurements are required. This allows us to take DRA back to 1951 without issue.
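(The bookkeeping is simple addition: the three categories sum to a pitcher’s total deserved runs above or below average. A sketch of that accounting, using hypothetical values rather than any real leaderboard entry:)

```python
def dra_runs(nip_runs, hit_runs, out_runs):
    """Total deserved runs above/below average as the sum of the three
    leaderboard categories. Negative is always better for the pitcher,
    since fewer deserved runs allowed means more runs saved."""
    return nip_runs + hit_runs + out_runs

# Hypothetical pitcher: excellent at missing bats (-12 NIP Runs),
# average at limiting damage on contact (0 Hit Runs), slightly below
# average at converting balls in play into outs (+3 Out Runs).
print(dra_runs(-12.0, 0.0, 3.0))  # nets out to -9 deserved runs
```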
DRA’s new-found reliability has important implications for pitcher valuation. DRA is the basis for pitcher Wins Above Replacement player (PWARP) here at Baseball Prospectus, and the reliability data above suggest that DRA should be a superior basis for evaluating pitcher “wins” contributed to their respective teams, in terms of runs saved or allowed. This, in turn, has implications for career value, particularly for Hall of Fame evaluation. For example, PWARP is very bullish on the careers of Kevin Brown and Frank Tanana. PWARP is skeptical of the attributed accomplishments of Tom Glavine. DRA also thinks very highly of Catfish Hunter, who is in the Hall of Fame but seems to be viewed as a poor choice by others. If nothing else, to the extent runs estimators are useful at all in career evaluation, DRA and PWARP offer a meaningful second opinion on some very interesting pitchers.
What about predicting future runs allowed? This is also an important question, because games are scored in actual runs, not deserved runs. Let’s look at the fit of these metrics to raw RA9 (or ERA, as appropriate), both in the current (within-year) season and the following (year-plus-one) season, again comparing the back-to-back performances of pitchers from 2010 through 2015.
Table 3: Current Season Fit to RA9 / ERA (higher is better)
Table 4: Next Season Fit to RA9 / ERA (higher is better)
Table 4 shows that, in addition to doing a good job of detecting actual pitcher skills, DRA is actually better than other estimators at predicting future runs allowed: even better than cFIP, which has been our gold standard for metric reliability.
This, too, is astonishing. While all the estimators are reasonably close, DRA is at the top of the heap, and unlike FIP and other popular estimators, DRA includes the results of balls in play in its prediction. This, for the first time, provides confidence that we are evaluating a pitcher’s overall skill rather than just a few selected components, with the rest being some sort of crapshoot. To be thorough, I ran the same comparison on various stats published by our friends at FanGraphs, including xFIP, SIERA, FIP-, and xFIP-. None of them predicted future runs better than DRA.
FIP does maintain one clear advantage: the ability to “explain” past run-scoring, as indicated in Table 3. In other words, with DRA’s de-emphasis of that attribute, FIP is now the best remaining descriptive metric, short of ERA / RA9 themselves. But, for the reasons mentioned above, the usefulness of this trait is questionable. FIP’s superior ability to explain past runs stems from its incorporation of home runs without contextual adjustment. As noted above, raw home runs are known to be influenced by many factors besides pitcher skill. Furthermore, because home runs are the most valuable events in baseball, it is not surprising to learn that pitcher runs allowed are often closely tied to the number of home runs given up by that pitcher.
Likewise, cFIP maintains an advantage over DRA in year-to-year reliability, smoking all competitors, including xFIP and SIERA, in that regard. This arguably means that cFIP remains useful, at least for that purpose, but it is noteworthy that this increased reliability does not come with added ability to predict future pitcher runs allowed.
As you can tell, we are extremely pleased with how DRA has developed. Your feedback has been an important part of DRA’s development, and we hope that both the BP and the larger sabermetric communities will continue to help us make DRA a more useful metric for everyone.
In the meantime, DRA appears to have finally met the challenge first made by DIPS theory over 15 years ago. While most have long agreed that a pitcher has at least some control over balls in play, reliably measuring that ability had, at least until now, remained elusive.
We think it safe to say that has now changed, and that baseball analytics will be better for it.
Ahmad Emad & Paul Bailey (2016). wCorr: Weighted Correlations. R package version 1.8.0. https://CRAN.R-project.org/package=wCorr.
R Core Team (2016). R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. URL https://www.R-project.org/.
 For weights, we take the harmonic mean of the current year and the following year when doing a “year + 1” comparison, and use the innings pitched in the current year only as the weight when making same-year comparisons.
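(The harmonic-mean weight described above can be computed directly; a sketch, where ip1 and ip2 stand for the pitcher’s innings in the two seasons being compared:)

```python
def harmonic_mean_weight(ip1, ip2):
    """Weight for a year-to-year comparison: the harmonic mean of the
    workloads (innings pitched, or batters faced) in the two seasons.
    It pulls toward the smaller season, so a pitcher must sustain a
    real workload in BOTH years to carry a large weight."""
    return 2 * ip1 * ip2 / (ip1 + ip2)

# 200 IP followed by 50 IP yields a weight of 80.0, far closer to the
# 50-inning season than the arithmetic mean (125) would be.
print(harmonic_mean_weight(200.0, 50.0))
```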
 Math: DRA out-model reliability squared over raw BABIP reliability squared: (.59)² / (.15)² ≈ 15.5.
 Incidentally, BP offers a metric called “RA+,” which was designed to be the equivalent of ERA+, except on the RA9 scale. The performance of these metrics does not change when you correlate to that instead, and DRA actually gains a tick on FIP.
 These metrics were tested in their ability to predict the following year’s ERA, weighted by the harmonic mean of total batters faced over both seasons.