August 6, 2010
Weight of the World
I’ve laid out the foundations for a new defensive metric (here and here). One thing that I’ve done is lay out an explicit margin of error for our estimates of defensive performance over a period of time.
In measuring offense, there are certain amounts of uncertainty as well, so we’re going to look at how to measure that. But first, we have to come to some agreement—or, barring that, a cordial disagreement—over what it is we’re measuring.
I find that one of the biggest causes of arguments about baseball (particularly when sabermetrics is involved—either between sabermetric types and less sabermetric types, or among saberists) is when two sides think they’re talking about the same thing, when really they aren’t.
So, when I talk about measuring offense, I’m using that as a shorthand for all sorts of assumptions—about how teams score runs, how players in an offense interact, what we mean by value. I will lay out those assumptions here. I don’t expect these assumptions will line up with everyone else’s assumptions—or even that they should.
The first thing I assume is that we care about measuring what happened, not making value judgments over what “should” have happened. A broken-bat floater that drops in for a single counts—when simply recording what happened, I don’t care if a player got “lucky” with his singles or not. By the same token, a player who reaches on an error didn’t make an out; he got on base. So in recording what occurred, I’m not going to credit him with an out, I’m going to credit him with a time reaching base. If it makes some of you feel better, there is a persistent skill to reaching on an error. It shouldn’t surprise anyone that the sort of players who tend to reach on an error are speedier contact hitters.
That decision informs how we adjust for park as well. We know that it is easier to hit a home run in Coors Field (or score runs in general) than in Petco Park. The thing is, if it’s easier for a batter to hit home runs, it’s easier for the opposing side to do it as well. Even though the batter is producing more runs by hitting in a homer-friendly park, he isn’t necessarily producing more wins for his team.
Now, we know that parks affect hitters differently. Juan Pierre, in his years with the Rockies, didn’t see a boost in his home-run production. That’s because Coors Field wasn’t going to turn a Juan Pierre grounder into a home run. But when we adjust Pierre’s numbers for park, we don’t account for that. Juan Pierre’s style of play may have been better suited to another park, but in a simple recording of what occurred, we don’t care about should, but did. So we adjust Pierre’s production based on the park’s run-inflating tendencies toward the average hitter, not toward a slap hitter like Pierre.
And when we evaluate what a player does, I feel strongly that we should (as much as we can) evaluate him independently of his teammates. Now, baseball is firmly a team sport, which makes this difficult at times. But I think it’s worth trying to do, even if we must admit that sometimes we can’t do it as well as we like.
Now, for instance, we know that a bases-loaded home run scores a lot more runs than the average home run. Others would look at that and consider the hitter should get more credit for being “clutch.” To my way of thinking, the players who deserve credit are the players who got on base for the hitter, thus increasing the value of the home run. So I tend to look at batting events independently of what happened before and what will happen after, and simply consider what the player did, isolated from his teammates.
Now, we can readily break down the run-scoring process. There are three ways a hitter can help his team score runs:
One invaluable (and very flexible) tool for measuring these is run expectancy—in other words, the average number of runs a team goes on to score in an inning from a given situation. Let’s look at a sample run expectancy chart for 2009, broken down by the position of the baserunners and the number of outs in the inning:
[This, you’ll find, doesn’t match up exactly with what’s currently available in the stat reports. These values are a bit more finely tuned, especially for situations that come up rarely (like a runner on third with two outs). And as we start rolling out the new stat reports, these are the values you’ll be seeing instead, barring any further improvements to be made.]
What we’re interested in is the change in run expectancy over time. For instance, if a batter makes an out at the start of an inning, he lowers his team’s run expectancy by 0.234 runs. If instead he hits a triple, he raises his team’s run expectancy by 0.83 runs. If he hits a home run, obviously the team scores a run, and the run expectancy resets. But what we’re interested in is the aggregate change, including run scoring. Let’s suppose now that it’s bases loaded, no outs. The total change following a home run is:
0.516 - 2.318 + 4 = 2.198
Or, to generalize:
Final RE – Starting RE + Runs Scored
So that home run was worth 2.198 runs. Now, the average change in run expectancy for an event is its linear weights value.
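As a quick sketch of that calculation (the function name is mine; the two run-expectancy values are the 2009 figures quoted above):

```python
# Sketch of the change-in-run-expectancy calculation described above.
# The state values (2.318 for bases loaded, no outs; 0.516 for bases
# empty, no outs) are the 2009 figures quoted in the text.

def re_change(starting_re, final_re, runs_scored):
    """Final RE - Starting RE + Runs Scored."""
    return final_re - starting_re + runs_scored

# A grand slam resets the inning to bases empty with the same number
# of outs, and four runs score:
value = re_change(2.318, 0.516, 4)
print(round(value, 3))  # 2.198
```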
The values I will be presenting here will not correspond exactly to the typical box-score stats. I have gone ahead and separated infield singles from all other singles, for instance. I have combined infield singles with a batter reaching on a fielding error by an infielder (including the battery), due to concerns about scorer bias in distinguishing the two. There are two other ways to reach on error—a throwing error by an infielder (or a missed catch error by the first baseman), and a dropped fly ball by an outfielder. Those are considered separately.
Now, as mentioned above, I try as much as possible to separate a batting event from the situation. For some kinds of events, that’s impossible—the sacrifice fly, for instance, can only occur in certain situations. So I have, for the time being, considered all in-play outs (including double plays and sacrifices) together. Don’t worry—we will evaluate those, just separately from our situation-neutral linear weights values.
I am also, for the time being, excluding intentional walks. Teams choose to issue these in unusual circumstances, and so we can’t treat them as situation-neutral events. Again, we will come back for them.
Also, I am excluding baserunning events altogether. The values you see for a strikeout, for instance, do not reflect the possibility of a runner on first being thrown out in a strike-‘em-out, throw-‘em-out double play. This isn’t to minimize the impact of baserunning, only to hold it for consideration separate from hitting.
So, with that in mind, our linear weights values for 2009:
These are baselined so that the average plate appearance is worth zero runs—in other words, an average player will be “0” over a dozen trips to the plate or 1,200. (At least, an average player with robot-like consistency.)
I went out to three decimal places almost entirely so that people could see that there is, in fact, a difference between strikeouts and outs in play—but it’s a lot less than might be thought, due to the risk of a double play or fielder’s choice. (You also see a slight but real chance of an out occurring on a single, for instance—typically that’s a runner being thrown out stretching a single into a double, or a runner thrown out trying to advance on the single.)
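To make the averaging concrete, here is a minimal sketch; the home-run transitions below are hypothetical stand-ins for what would really be every home run in the 2009 play-by-play data:

```python
# A linear weights value is the average change in run expectancy across
# every instance of an event. The transitions below (start RE, end RE,
# runs scored) are illustrative only, not actual 2009 table values.

def re_change(starting_re, final_re, runs_scored):
    return final_re - starting_re + runs_scored

hr_transitions = [
    (0.516, 0.516, 1),  # solo shot, bases empty, no outs
    (1.433, 0.516, 2),  # runner on first, no outs (illustrative RE)
    (2.318, 0.516, 4),  # grand slam, bases loaded, no outs
]

lwts_hr = sum(re_change(*t) for t in hr_transitions) / len(hr_transitions)
print(round(lwts_hr, 3))  # 1.427 with these made-up transitions
```

In the real calculation, you would then subtract the average RE change of all plate appearances so that the average PA comes out to zero.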
Now, as we’ve done for fielding, let’s go and figure the margin of error for these weights on a per-play basis:
Some batting outcomes have very little variance in the change in run expectancy, relative to others. Strikeouts and walks are the most stable in terms of results, home runs the least stable.
Combining Offense and Defense
Let’s go back to our presentation of defense for a moment. In it, I said that the margin of error for a shortstop in 2009 was 0.290, per ball in play. For a hitter, the margin of error per plate appearance is generally 0.254 (this obviously varies a bit with hitting profile, but for now it will serve).
So in 2009, the average qualified starter at shortstop had 3,680 balls in play and 615 plate appearances. Generally speaking, the margin of error for fielding for a qualified starter would be 17.6 plays made, or 14.3 runs; for hitting, the margin of error would be 6.3 runs.
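The arithmetic behind those season-level figures is just the per-play error scaled by the square root of opportunities; a sketch (the function name is mine):

```python
import math

# Margin of error grows with the square root of opportunities, not
# linearly. Per-play errors are the figures quoted above.

def season_error(per_play_error, opportunities):
    return per_play_error * math.sqrt(opportunities)

fielding = season_error(0.290, 3680)  # balls in play at shortstop
hitting = season_error(0.254, 615)    # plate appearances

print(round(fielding, 1))  # 17.6 (plays)
print(round(hitting, 1))   # 6.3 (runs)
```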
Imagine two players, both shortstops, both with 3,680 BIP and 615 PA. One player is +30 runs hitting, average at fielding. The other is -10 runs hitting, +40 runs fielding. Of course, “is” was probably a bit strong—that’s what our metrics say, but we know we’re only estimating things with a certain level of accuracy. Given what we know, which do we think was the better player that season?
Obviously, 30+0 is equal to 40-10, and so we are tempted to say “they’re the same.” But they aren’t—we’re much more certain that the first player was +30 at hitting than the second player was +40 at fielding.
As I’ve noted in the past, our margin of error isn’t symmetrical around our estimates of performance—a player who looks +40 is far more likely to be +20 than he is +60. So let’s go ahead and scale everything so that we expect the margin of error to be symmetrical. We can do that using something that looks a lot like regression to the mean.
First, we have to figure out the spread of runs around the average. So I found the standard deviation for fielding runs and offensive runs, and then subtracted out the estimated uncertainty (in quadrature) to get the “true” spread. Now, the “average player” gets a lot less playing time than your typical qualified starter—roughly 1,200 fielding chances and 197 plate appearances.
(And note that we’re regressing defense to the average of all shortstops, but we’re regressing offense to the average of all hitters. Since we’re comparing two shortstops in this example, we’re okay with this. But it’s something to address down the road.)
So we regress offense and defense, and then combine them, like so:
(30/6.3^2 + 0/9.8^2)/(1/6.3^2 + 1/9.8^2) = 21.2
(0/17.6^2 + 0/5.9^2)/(1/17.6^2 + 1/5.9^2) = 0
21.2 + 0 = 21.2
Basically what we’re doing is taking a weighted average of the estimated performance and the average, weighting each by the inverse square of its uncertainty: the smaller the uncertainty, the more weight that value gets. That 30 runs of offense, once regressed, falls just a bit; but defense, being average, doesn’t regress at all. We can do the same for the second player:
(-10/6.3^2 + 0/9.8^2)/(1/6.3^2 + 1/9.8^2) = -7.1
(40/17.6^2 + 0/5.9^2)/(1/17.6^2 + 1/5.9^2) = 4.0
-7.1 + 4.0 = -3.1
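The weighted average being applied here, in a minimal sketch (the function and parameter names are mine):

```python
# Regress each estimate toward the average (zero here), weighting the
# estimate and the average by their inverse variances: meas_err is the
# margin of error, spread is the "true" spread of performance.

def regress(estimate, meas_err, spread, average=0.0):
    w_est = 1 / meas_err ** 2
    w_avg = 1 / spread ** 2
    return (estimate * w_est + average * w_avg) / (w_est + w_avg)

print(round(regress(30, 6.3, 9.8), 1))   # 21.2 (player one, hitting)
print(round(regress(-10, 6.3, 9.8), 1))  # -7.1 (player two, hitting)
print(round(regress(40, 17.6, 5.9), 1))  # 4.0 (player two, fielding)
```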
So while the two players look equal by raw totals, after regressing the values we can see that we should be more confident the first player was more valuable.
We can also find new estimates of the margin of error for our regressed offensive and defensive values, and combine them to get total margin of error, like so:
SQRT(1/(1/6.3^2 + 1/9.8^2)) = 5.3
SQRT(1/(1/17.6^2 + 1/5.9^2)) = 5.6
SQRT(5.3^2 + 5.6^2) = 7.7
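Those three steps, sketched out (the function name is mine):

```python
import math

# The regressed estimate's margin of error is the inverse-variance
# combination of the measurement error and the "true" spread; the
# offensive and defensive errors then combine in quadrature.

def regressed_error(meas_err, spread):
    return math.sqrt(1 / (1 / meas_err ** 2 + 1 / spread ** 2))

offense = regressed_error(6.3, 9.8)
defense = regressed_error(17.6, 5.9)
total = math.sqrt(offense ** 2 + defense ** 2)

print(round(offense, 1))  # 5.3
print(round(defense, 1))  # 5.6
print(round(total, 1))    # 7.7
```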
I do want to point out—this is a lot less regression than we would be doing if we were interested in trying to determine a player’s “true talent” level—in other words, our estimate of how skilled the player is. In this case, we’re simply trying to account for very different levels of uncertainty in measuring offense versus defense. So now the question comes: How do we avoid underrating defense relative to offense? For a single season, about the only thing we can do is come up with ways to trim the margin of error. But at a career level, things get much better—remember, the error doesn’t scale linearly. Here’s a comparison of how the margin of error would grow for these two players if they put up 10 years at that rate of playing time, compared to how the margin of error would grow if it scaled linearly:
Our confidence in how we are measuring defense gets much closer (at least, in a relative sense) to our confidence in our measurement of offense over time.
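The square-root scaling behind that claim, in a quick sketch (ten seasons at the single-season fielding workload above):

```python
import math

# Over n seasons, the cumulative margin of error grows with sqrt(n)
# rather than n, so the per-season error shrinks as careers lengthen.
# 17.6 plays is the single-season fielding error from above.

seasons = 10
one_year_error = 17.6

career_error = one_year_error * math.sqrt(seasons)  # sqrt scaling
if_linear = one_year_error * seasons                # hypothetical linear scaling

print(round(career_error, 1))            # 55.7 over ten seasons
print(round(if_linear))                  # 176 if it scaled linearly
print(round(career_error / seasons, 1))  # 5.6 per season, down from 17.6
```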
I would love to trim the margin of error on defensive analysis some more. And I think I have a handful of tricks left up my sleeve there.
On offense, I promised situational evaluations of double plays, sacrifices, and intentional walks. That will come as well. And, of course, I’ve only offered ways to compare players with the same playing time at the same position. So I have some work to do there as well.