April 29, 2015
Introducing Deserved Run Average (DRA), and All Its Friends
But ERA has a problem: it essentially blames (or credits) the pitcher for everything, simply because he threw the pitch that started the play. Sometimes, that is fair. If a pitcher throws a wild pitch, he can’t blame the right fielder for that. And if a pitcher grooves one down the middle of the plate, chances are that’s on him too. Not too many catchers request those.
However, most plays in baseball don’t involve wild pitches or gopher balls. Moreover, things often happen that are not the pitcher’s fault at all. Sometimes the pitcher throws strikes that the umpire incorrectly calls balls. Other times he induces grounders that his infielders aren’t adept enough to grab. And still other times, a routine fly ball leaves the park on a hot night at a batter-friendly stadium.
ERA doesn’t account for any of that. It just tells us, in summary fashion, how many runs were “charged” to the pitcher “of record.” And so, a starting pitcher who departs with a runner on first gets charged with that run even if the reliever walks the next three batters. The same starter would get charged if the reliever makes a good pitch, but the shortstop can’t turn a double play. And none of these runs count at all if they are “unearned”— an exclusion by which the home team’s scorer decides whether a fielder demonstrated “ordinary effort.”
The list of problems goes on. Pitchers who load the bases but escape are treated the same as pitchers who strike out the side. Pitchers with great catchers get borderline calls. Guys who can’t catch a break for months show immense “improvement.” Guys who are average one year wash out the next. ERA, in short, can be a bit of a mess, particularly when we have only a few months of data to consider.
The problem is this: We know which runs came across the plate, but we can’t tell, just from ERA, which runs were actually the pitcher’s fault. What we need is a reliable way to determine which runs the pitcher deserves to be charged with. That is the challenge we took on in creating Deserved Run Average (DRA).
The Search for an Alternative
A few years ago, our former colleague Colin Wyers (now employed by the Houston Astros) thought he had a better solution. Labeled Fair Run Average (abbreviated “FAIR RA” or “FRA”), Colin’s approach tried to adjust for, among other things, what he considered to be a “fair” number of actual innings pitched, and assigned a “fair” number of runs allowed for each pitcher as a result.
Unfortunately, Fair Run Average has not succeeded. While some of its assigned values make sense, others do not. Many researchers have noted what appears to be a bias in Fair Run Average against pitchers who generate a lot of groundballs—a skill generally thought to be desirable. Fair RA just has not caught on, and, more importantly, our understanding of the tools for measuring baseball performance has advanced since the time Fair RA was conceived.
Today, we are transitioning to a new metric for evaluating the pitcher’s responsibility for runs that crossed the plate. We call it Deserved Run Average, or DRA. Leveraging recent applications of “mixed models” to baseball statistics, DRA controls for the context in which each event of a game occurred, thereby allowing a more accurate estimate of pitcher responsibility, particularly in smaller samples. DRA goes well beyond strikeouts, walks, hit batsmen, and home runs, and considers all available batting events. DRA does not explain everything by any means, but its estimates appear to be more accurate and reliable than the alternatives. As such, DRA allows us to declare how many runs a pitcher truly deserved to give up, and to say so with more confidence than ever before.
Deserved Run Average
So, as an overview, here is what DRA does, step by step:
Step 1: Compile the individual value of all baseball batting events in a season.
When a batter steps into the box, a number of different events can ultimately occur. These range from a strikeout to a single to a double play to a home run. Over the course of a season, those events each, as a category, tend to result in an average number of additional (or fewer) runs. For example, a home run on average results in about 1.4 runs, because sometimes there are runners on base and sometimes there are not. By the same token, a double play tends to cost a team about three-quarters of a run. Although a double play can sometimes allow a run to score (such as when there happens to be a runner on third with no outs), it far more often ends the inning or empties the bases with no runs scored.
In the world of baseball statistics, the average seasonal value of these events is known as a “linear weight.” To understand the ultimate effect of the batting events, we first must assign the typical value of those events. So, DRA begins by collecting every single baseball batting event in a given season and assigning the average linear weight for the outcome of that play.
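As a rough illustration of what a linear weight is, the sketch below averages the run value of each event type across a handful of plays. The function name `linear_weights`, the event labels, and every number in `sample_plays` are invented for illustration; this is not Baseball Prospectus code or data.

```python
from collections import defaultdict


def linear_weights(plays):
    """Return the average run value for each event type.

    `plays` is an iterable of (event, run_value) pairs, where run_value is
    the change in run expectancy plus any runs scored on the play.
    """
    totals = defaultdict(float)
    counts = defaultdict(int)
    for event, run_value in plays:
        totals[event] += run_value
        counts[event] += 1
    return {event: totals[event] / counts[event] for event in totals}


# Hypothetical sample: these values are made up for illustration only.
sample_plays = [
    ("home_run", 1.0),      # solo home run
    ("home_run", 1.8),      # home run with traffic on the bases
    ("double_play", -0.9),
    ("double_play", -0.6),
    ("strikeout", -0.25),
]

weights = linear_weights(sample_plays)
```

With a full season of plays instead of five, the same averaging yields one weight per event type, which is the starting value assigned to each play before any adjustment.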
Step 2: Adjust each batting event for its context.
Once we have the average value of each play in a season, we start making our adjustments. Home runs depend, among other things, on stadium, temperature, and the quality of the opposing batter. Ball and strike calls tend to favor the home team. The likelihood of a hit depends on the quality of the opposing defense. The pitcher’s success depends on how far he is ahead in the count, and both a catcher’s framing ability and the size of the umpire’s strike zone help get him there.
So, DRA begins by adjusting for the average effect of these factors beyond the pitcher’s control in each plate appearance, using what is known as a linear mixed model. These environmental factors include:
There are two other aspects that affect how DRA scores pitchers.
First, rather than grade pitchers purely on the number of outs, like ERA does, DRA grades them on the basis of each plate appearance. Thus, pitchers who escape a bases-loaded jam are no longer treated the same as pitchers who retire all three batters they faced, simply because they both got three outs.
Second, DRA judges pitchers on the run expectancy of each play, rather than the runs that happen to cross the plate. If, for example, our hypothetical starter from earlier put a man on first and then was replaced, he would not be penalized the entire run if the reliever subsequently allowed that player to score. Rather, he would be penalized only by the likelihood that said player would have scored from first base on average, with the reliever getting charged the difference between that average likelihood and the full value of the run if it scores. Likewise, when a starter loads the bases, but the reliever gets the team out of it, the reliever doesn’t simply get credit for an out or two. Rather, he gets a bonus for all of the runs that were expected to score from a bases-loaded situation in an average situation, but didn’t. In this regard, true “stopper” relievers get more fairly recognized for their accomplishments, and we more accurately forecast their “deserved” runs allowed.
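The inherited-runner example can be worked through with a few lines of arithmetic. The sketch below is a simplification that tracks a single runner rather than the full base-out state, and the 40 percent scoring probability is an assumed figure, not a published run-expectancy value. The name `split_inherited_run` is hypothetical.

```python
def split_inherited_run(p_score_when_inherited, runner_scored):
    """Split responsibility for one inherited runner.

    The departing pitcher is charged the average chance that the runner
    scores; the reliever is charged (or credited) the difference between
    that average and what actually happened.
    Returns (starter_charge, reliever_charge) in runs.
    """
    if not 0.0 <= p_score_when_inherited <= 1.0:
        raise ValueError("probability must be between 0 and 1")
    starter_charge = p_score_when_inherited
    outcome = 1.0 if runner_scored else 0.0
    reliever_charge = outcome - starter_charge
    return starter_charge, reliever_charge


# Assumed for illustration: the runner scores 40% of the time on average.
scored = split_inherited_run(0.40, runner_scored=True)
stranded = split_inherited_run(0.40, runner_scored=False)
```

In both cases the two charges add up to the runs that actually scored, so no run is counted twice or lost. A negative reliever charge is the "bonus" for stranding a runner who was expected to score some of the time.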
The DRA component that emerges from all these adjustments is value/PA: the average value of each plate appearance which the pitcher completed during the season.
Step 3: Account for base-stealing activity.
Understanding the average weight of a batting event is essential, but run-scoring also depends on who happens to be on base at the time. Billy Hamilton is much more likely to score when on base than Billy Butler, all other things being equal. Certain pitchers also hold runners better than others. A runner who is afraid of being picked off will have fewer steal attempts. Runners who stay closer to the base should have a harder time scoring. And runners who are thrown out trying to steal are erased from the basepaths entirely.
To account for these situations, and provide some insight into the effect of baserunning on each event, we created two additional statistics: one looking at base-stealing success and one looking at the frequency with which baserunners attempt to steal bases. They are both (potentially) part of DRA, but are also useful in and of themselves.
We’ve also made an effort to make these statistics more approachable. Because we are looking at how pitchers compare to other pitchers in controlling baserunners, we are describing these stats as Swipe Rate Above Average (SRAA) and Takeoff Rate Above Average (TRAA).
Swipe Rate, as its name implies, judges each participant in a base-stealing attempt for his likely effect upon its success. Using a generalized linear mixed model, we simultaneously weight all participants involved in attempted steals against each other, and then determine the likelihood of the base ending up as stolen, as compared to the involvement of a league-average pitcher, catcher, or lead runner, respectively.
Stated another way, Swipe Rate allows us to evaluate how good Yadier Molina’s arm is while controlling for the inherent ability of his pitchers to hold runners and the quality of the runners he is facing on base. Likewise, we evaluate the ability of individual pitchers to hold runners while controlling for the possibility that they may be throwing to a catcher with a subpar arm. And for baserunners in particular, we now have something much more accurate to evaluate their base-stealing ability than base-stealing percentage.
Remember that base-stealing percentage, by itself, is not very useful: using straight percentages, an elite base-stealer who swipes 90 percent of his attempts and tries to steal 40 times a year ranks lower than a catcher who had one lucky steal all year (and therefore has a 100 percent base-stealing percentage). In the same way that Controlled Strikes Above Average (CSAA) controls for the effect of other factors on catcher framing, Swipe Rate Above Average regresses baserunners’ steal-success rates against both themselves and others to provide a more accurate assessment of each participant’s effect on the likelihood of a stolen base.
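To see why regressing a raw percentage helps, consider the sketch below. It is not the generalized linear mixed model behind Swipe Rate; it is a much simpler stand-in that pulls each observed success rate toward the league rate, with a stronger pull on small samples. The league rate of 72 percent and the prior weight of 20 attempts are assumptions chosen for illustration, and `shrunken_success_rate` is a hypothetical name.

```python
def shrunken_success_rate(successes, attempts, league_rate, prior_attempts):
    """Pull an observed steal-success rate toward the league rate.

    `prior_attempts` behaves like that many phantom league-average
    attempts added to the runner's record, so small samples move the
    estimate very little and large samples dominate it.
    """
    if attempts < 0 or prior_attempts <= 0:
        raise ValueError("attempts must be >= 0 and prior_attempts > 0")
    return (successes + league_rate * prior_attempts) / (attempts + prior_attempts)


LEAGUE_RATE = 0.72     # assumed league-average success rate
PRIOR_ATTEMPTS = 20    # assumed strength of the pull toward average

# Elite base-stealer: 36 steals in 40 attempts (90 percent raw).
elite = shrunken_success_rate(36, 40, LEAGUE_RATE, PRIOR_ATTEMPTS)
# Catcher with one lucky steal: 1 for 1 (100 percent raw).
lucky = shrunken_success_rate(1, 1, LEAGUE_RATE, PRIOR_ATTEMPTS)
```

The raw percentages rank the catcher first, but after shrinkage the elite base-stealer comes out ahead, which is the ordering a sensible rating should produce.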
The factors considered by the Swipe Rate are:
Because the statistic rates pitchers above or below average in preventing stolen bases, average is zero, and pitchers generate either positive (bad) or negative (good) numbers. In 2014, here were the pitchers who were hardest to steal a base on:
And here were the pitchers baserunners exploited the most last year:
The model for TRAA (Takeoff Rate Above Average) is similar to SRAA, but more complicated. With Takeoff Rate, we don’t care whether the baserunner actually succeeds in stealing the base; what we care about is that he made an attempt. Our hypothesis is that base-stealing attempts are connected with the pitcher’s ability to hold runners. When baserunners are not afraid of a pitcher, they will take more steps off the bag. Baserunners who are further off the bag are more likely to beat a force out, more likely to break up a double play if they can’t beat a force out, and more likely to take the extra base if the batter gets a hit.
Takeoff Rate stats consider the following factors:
Takeoff Rate Above Average is also scaled to zero, and negative numbers are once again better for the pitcher than positive numbers. By TRAA, here were the pitchers who worried baserunners the most in 2014.
And here were the pitchers who emboldened baserunners in 2014:
Current 2015 ratings for Takeoff Rate Above Average are on our leaderboards. We don’t have enough data yet to release Swipe Rate Above Average, but we expect to have enough to work with in another month or so.
Step 4: Account for Passed Balls / Wild Pitches.
Under baseball’s scoring rules, a wild pitch is assigned when a pitcher throws a pitch that is deemed too difficult for a catcher to control with ordinary effort, thereby allowing a baserunner (including a batter, on a third strike) to advance a base. A passed ball is assigned when a pitcher throws a pitch that a catcher ought to have controlled with ordinary effort, but which nonetheless gets away, also allowing a baserunner to move up a base. The difference between a wild pitch and a passed ball, like that of the “earned” run, is at the discretion of the official scorer. Because there can be inconsistency in applying these categories, we prefer to consider them together.
Last year, Dan Brooks and Harry Pavlidis introduced a regressed probabilistic model that combined Harry’s pitch classifications from PitchInfo with a With or Without You (WOWY) approach. RPM-WOWY measured pitchers and catchers on the number and quality of passed balls or wild pitches (PBWP) experienced while they were involved in the game.
Not surprisingly, we have updated this approach to a mixed model as well. Unfortunately, Passed Balls or Wild Pitches Above Average would be quite a mouthful. Again, we’re trying out a new term to see if it is easier to communicate these concepts. We’re going to call these events Errant Pitches. The statistic that compares pitchers and catchers in these events is called Errant Pitches Above Average, or EPAA.
Unfortunately, the mixed model only works for us from 2008 forward, which is when PITCHf/x data became available. For earlier seasons, we will rely solely on WOWY to measure PBWP, going back to 1988, which is when pitch counts were first tracked officially. For the time being, we won’t calculate EPAA before 1988 at all, and it will not play a role in calculating pitcher DRA for those seasons.
But, from 2008 through 2014, and going forward, here are the factors that EPAA considers:
Errant Pitches, as you can see, has a much smaller list of relevant factors than our other statistics.
In 2014, the pitchers with the best (most negative) EPAA scores were:
And the pitchers our model said were most likely to generate a troublesome pitch were:
Step 5: Calculate DRA (Deserved Run Average).
We’ve now got our components, so it is time to calculate each pitcher’s DRA. Here are the steps we follow:
First, we put all of our identified components—value/PA, Swipe Rate Above Average, Takeoff Rate Above Average, and Errant Pitches Above Average—together into a new regression, this time looking for their combined effect on run expectancy. We added two more variables that struck us as relevant: the percentage of each pitcher’s plate appearances that came as a starter versus as a reliever (we call this Starter Pitcher Percentage, or SPP) and the total number of batters faced. That gives us a total of six potential predictors for each pitcher to come up with their DRA for a season. We regress these using a method known as “MARS.” If the detail interests you, we invite you to enjoy the In Depth article, which discusses it further.
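For readers unfamiliar with MARS (multivariate adaptive regression splines), its models are built from "hinge" functions, which let the effect of a predictor change slope at a knot. The sketch below shows only that building block. The knots and coefficients are made up, the choice of two predictors is arbitrary, and `toy_mars_prediction` is a hypothetical name; none of this reflects the fitted DRA model.

```python
def hinge(x, knot):
    """MARS-style basis function: zero below the knot, linear above it."""
    return max(0.0, x - knot)


def toy_mars_prediction(value_per_pa, starter_pct):
    """Toy piecewise-linear model; every coefficient here is invented."""
    return (
        0.02
        + 1.10 * hinge(value_per_pa, 0.0)   # slope used when value/PA > 0
        - 0.90 * hinge(0.0, value_per_pa)   # mirrored hinge: value/PA < 0
        + 0.01 * hinge(starter_pct, 0.50)   # only matters above 50% starts
    )


baseline = toy_mars_prediction(0.0, 0.0)
starter = toy_mars_prediction(0.01, 1.0)
```

A real MARS fit chooses the knots and coefficients from the data and prunes terms that do not help, which is why some of the six candidate predictors are only "potentially" part of the final model.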
Second, to smooth out season-to-season variation, and to tease out the most accurate connection between these variables and runs allowed, we actually train our model on the previous three seasons. From this we derive the most accurate connection between our potential predictors and actual runs allowed by pitchers in the current run environment.
Finally, we take the connections determined by our model and use them to calculate each pitcher’s DRA for the current season: his Deserved Run Average per nine innings. DRA does not distinguish between earned and unearned runs, because that distinction can be arbitrary and over the course of a season it tends to obscure rather than reveal differences between pitchers. We therefore adjust DRA so it is on the scale of Runs Allowed per nine innings (RA/9) rather than Earned Run Average (ERA). We understand that ERA is what many of you are used to, but once you get over that, you’ll be much happier.
We do ensure that, in converting runs per plate appearance to runs per nine innings, we use each pitcher’s individual ratio of batters-faced to innings pitched, rather than just a league average. This allows us to credit the pitchers who are most efficient, and avoid over-crediting pitchers who are putting baserunners on and getting lucky with the outcome. Pitchers in the latter category do not “deserve” the lower runs-allowed numbers they might (temporarily) be putting up.
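The conversion described here is simple arithmetic, sketched below under stated assumptions: innings pitched are given as a true decimal (200.0, not box-score notation such as 200.1), and the runs-per-plate-appearance figure of 0.10 is invented. The function name `runs_per_nine` is hypothetical.

```python
def runs_per_nine(runs_per_pa, batters_faced, innings_pitched):
    """Scale runs per plate appearance to runs per nine innings.

    Uses the pitcher's own batters-faced-per-inning ratio rather than a
    league-average ratio.
    """
    if innings_pitched <= 0:
        raise ValueError("innings_pitched must be positive")
    pa_per_inning = batters_faced / innings_pitched
    return runs_per_pa * pa_per_inning * 9


# Two hypothetical pitchers with the same deserved runs per PA and the
# same innings, but different numbers of batters faced.
efficient = runs_per_nine(0.10, batters_faced=800, innings_pitched=200.0)
traffic = runs_per_nine(0.10, batters_faced=880, innings_pitched=200.0)
```

The pitcher who needs more batters to get through the same innings ends up with the higher figure (3.96 versus 3.60 here), which is the intended effect of using individual ratios.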
What It Means
As for the first issue, past DRA is available on our leaderboards right now. In-season DRA during 2015 will be calculated each night after the previous day’s games have concluded. You will be able to use DRA not only to put past pitching performances in context but also to monitor the value of pitchers as we progress through the 2015 season, and beyond. As with our other statistics, DRA will be available for you to download and use for your own comparisons and work.
As for the second issue, rest assured that your time spent reading this article was not in vain. DRA does a very good job of measuring a pitcher’s actual responsibility for the runs that scored while he was on the mound—certainly better than any metric we are aware of in the public domain. And only DRA gives you the assurance that a pitcher’s performance is actually being considered in the context of the batter, catcher, runners on base, as well as the stadium and stadium environment in which the baseball game occurred.
The detailed explanation of DRA’s effectiveness is saved for the accompanying In Depth article. But since you’ve made it this far, we’ll give you the Reader’s Digest version. There are two measures of accuracy that we pay particular attention to in evaluating the accuracy of a new metric.
First, we look at how close, mathematically, the metric’s prediction is to the actual number of runs allowed with the pitcher on the mound. If the pitcher actually allowed four runs per nine innings, we test our alternative metric by how close it comes to that same number. The most commonly used calculation that does this is called the Root Mean Square Error or RMSE.
The second test looks at how accurately the metric ranks the various pitchers relative to each other. Why do we care about rank? Because we know that all pitcher run estimates are a bit off from their actual runs allowed, and more so early in the season. So as a check, we test whether it is at least ranking the pitchers correctly relative to each other. In other words, if the metric can’t estimate runs allowed down to the exact decimal point, the least it can do is tell the difference between Max Scherzer and Ricky Nolasco. This second approach is called the Spearman Correlation.
To judge DRA’s accuracy, we’ll compare it to the leading brand: FIP. We know FIP does a reasonable job of predicting a pitcher’s actual runs allowed in a season. Does DRA do a better job than FIP? It does.
We compared how well FIP and DRA predicted each pitcher’s RA/9 in each of the past four major-league seasons. We looked at their performance with all pitchers, and then two subsets: pitchers who faced at least 170 batters (about the workload of an established major-league reliever, or 40 IP), and pitchers who faced at least 660 batters (about 162 innings, which is a qualified major-league starter).
We then averaged the results over four seasons (2011–2014) to get a consistent (and recent) picture of each metric’s performance. Here is how it ended up:
DRA is consistently superior to FIP at all sample sizes. By accounting for the context in which the pitcher is throwing, DRA allows us to determine which runs are most fairly blamed on the pitcher. DRA is particularly effective with smaller samples. Even for pitchers with only a few batters faced, DRA is already separating the good pitchers from the bad with superior accuracy.
In the end, of course, we are not satisfied simply to have brought you DRA. In addition to being useful in and of itself, DRA has become the new foundation of Pitcher Wins Above Replacement Player (PWARP) here at Baseball Prospectus. By integrating DRA into WARP, we can do a better job than ever of evaluating how much value individual pitchers delivered to their teams, both during the current season and as compared to past pitchers in other seasons and eras. The new PWARP figures featuring DRA are also available on the leaderboards, under the column “DRA_PWARP.”
Just for fun, here are the 25 best qualified starters by DRA over the past 25 years. You’ll note that in some cases their DRA basically matches their RA/9; in others, it does not. Our position, of course, would be that when DRA and RA/9 disagree, you should go with DRA, as it tells you how well the pitcher really pitched. Without further ado:
One caution: DRA is not (presently) adjusted for run-scoring across different eras. Rather, it is adjusted to the average runs-allowed by the league for that season. So, please don’t directly compare Pedro’s DRA of 1.03 in 2000 to somebody else’s DRA in 1985 or some other season. A DRA metric that compares players across eras will be coming soon.
A second caution: DRA corrects for what is known as survival bias: the tendency of better pitchers to pitch more innings in a season. Applying the full DRA model early on can result in some extreme values. To avoid that, we will keep the model simple at first during the season, and model only value/PA to RE24. As we get further along, we’ll allow the full model to operate and achieve the best explanation of each pitcher’s performance.
We are excited about DRA, as well as the other statistics we have introduced: Swipe Rate Above Average (to measure base-stealing success), Takeoff Rate Above Average (to measure base-stealing attempts), and Errant Pitches Above Average (to measure passed balls and wild pitches).
Three final things to remember.
First, while DRA accounts for a great many things, DRA doesn’t need to be complicated for fans. DRA is on our leaderboards. Just look up the pitcher(s) that interest you, and you’ll have the best estimate of how good they’ve been in a particular season. If you want to leave the details to us, feel free.
Second, remember that DRA was created to evaluate past performance. If you want to project future performance of a pitcher, use PECOTA. And if you want to evaluate how talented the pitcher is regardless of his performance to date, use cFIP, which is also on our leaderboards. In fact, cFIP is in the same table as DRA so you can compare recent results with the likelihood of future improvement (or decline).
Finally, DRA is now the foundation for Pitcher Wins Above Replacement (PWARP) here at Baseball Prospectus. For the time being, if you want to see how many wins a pitcher has been worth in a particular season, check (you got it) our leaderboards and the column DRA_PWARP. (WARPs that appear on pitchers' player pages remain, for now, the old, FRA-based WARPs.) We’ll change the description to plain old WARP once people have gotten used to the new idea.
We welcome your comments, and hope you find DRA as useful as we do.
Special thanks to Rob McQuown for research assistance; to Rob Arthur, Rob McQuown, and Greg Matthews for their collaboration; to Stephen Milborrow for modeling advice; and to Tom Tango and Brian Mills for their review and insights.
 Please note that FRAA (Fielding Runs Above Average) is different than FRA (Fair Run Average), which we are replacing for everyday purposes with DRA.
 We use RE24/PA: the average effect of the pitcher on run expectancy per batter faced over the course of a season.
 We have populated DRA back to 1953.
 In fact, don’t try to suggest Pedro’s 2000 season is comparable to what anyone else has ever done at any time. It’s probably very unfair to the other player.
Jonathan Judge is an author of Baseball Prospectus. Follow @bachlaw