May 14, 2013
Baseball Therapy: How Reliable Are Our Fielding Metrics?
A little more than a week ago, Jon Heyman of CBS sent out a tweet wondering why it was that Starling Marte and Bryce Harper had the same WAR. Heyman was quoting Baseball-Reference's version of WAR, which at that moment in time showed Marte and Harper tied at 1.7 wins. Harper had clearly been the superior hitter, but drilling down, it turned out that the fielding metric used by Baseball-Reference loved Marte's defense enough (and thought Harper's was average enough) to call them equals.

The problem with any sort of number this early in the season is that on many measurements, we're still at a time when players haven't logged enough playing time for the measure to be considered reliable. But of course, some measures are more reliable than others. The more reliable a measure, the sooner we can be more confident that it actually reflects what the player's talent level was during that time. The less reliable it is, the more likely it is that there will be fluky spikes and valleys over short (and sometimes long) periods of time.

Fielding metrics are an estimate of how many outs a player saved from Opening Day onward, and what that was worth. However, in the same way that a player who went 3-for-4 on Opening Day is technically a .750 hitter for the moment, it’s not real. A fielding metric might need some time to stabilize as well before we get a good read on what’s going on.

There's been research on how quickly various batting and pitching statistics stabilize, but in general, few people have asked the question of how reliable our fielding metrics are. One reason is that several of the most commonly cited fielding metrics (UZR,

Warning! Gory Mathematical Details Ahead!

I started by looking at ground balls for infielders. First, I calculated what zones "belonged" to an infielder. For each zone, I looked at which infielder(s) made the play at least 25 percent of the time (when the ball did not scoot through) over all seven years in the data set.
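The zone-ownership rule just described can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `assign_zones`, the `(zone, fielder)` input layout, and the toy data are hypothetical, and the author's actual database and code are not shown in the article.

```python
from collections import Counter, defaultdict


def assign_zones(plays, min_share=0.25):
    """Map each zone to the fielders who made at least `min_share` of its plays.

    `plays` is an iterable of (zone, fielder) pairs, one pair per ball that
    some fielder actually handled. Balls that got through are left out, which
    mirrors the "when the ball did not scoot through" condition in the text.
    The input layout is an assumption made for this sketch.
    """
    counts = defaultdict(Counter)
    for zone, fielder in plays:
        counts[zone][fielder] += 1

    owners = {}
    for zone, by_fielder in counts.items():
        total = sum(by_fielder.values())
        owners[zone] = sorted(
            fielder for fielder, n in by_fielder.items() if n / total >= min_share
        )
    return owners


# Toy data (invented): zone "56" is shared, zone "6" belongs to the shortstop.
toy_plays = (
    [("56", "SS")] * 6 + [("56", "3B")] * 4 + [("6", "SS")] * 9 + [("6", "2B")] * 1
)

if __name__ == "__main__":
    print(assign_zones(toy_plays))
```

With a threshold of 25 percent a zone can have more than one owner, which is what makes the shared-zone handling in the next paragraph necessary.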
When a zone had more than one fielder assigned to it (for example, a ball in the 56 zone, between short and third, might belong to either the shortstop or the third baseman), I did not penalize the third baseman for not fielding the ball if the shortstop got there first. It simply went as a "no play" for the third baseman. Conversely, I did not reward the shortstop for somehow making a play in short right field. (What the heck was he doing out there anyway?)

My criterion for success was whether or not an out was recorded on the ground ball (either by force out, or just good ol’ throwing the ball to first). I also played around with two alternative criteria: whether or not he got to the ball (regardless of whether he finished the play), and whether he fielded and threw cleanly. (If the first baseman dropped the throw, whose fault is that?) It didn't change the results all that much. All events were coded 0/1 (not out/out).

This is a simpler model than is actually used in the major defensive metrics. What I've created here is a basic "outs per ball in zone" metric. The more developed measures control for more factors and adjust for the difficulty of each play, and they are better off for it. But then again, all defensive metrics boil down to "How many balls was he near and how many did he turn into outs?" I'm happy to concede that I'm dealing with a rough approximation and that your mileage may vary if your model is fueled by more granular data. But this ought to give us some order of magnitude to work with.

I used Kuder-Richardson formula 21 (KR-21) to look at reliability. KR-21 is specifically set up to look at reliability in binary outcomes. I considered the stat stable when KR-21 crossed .70. I looked at sample sizes of up to 600 balls per fielder, meaning that I can see stability numbers to sampling frames of 300 in real life.
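For readers who want to see the KR-21 step concretely, here is a minimal Python sketch. It uses the textbook form of the formula, r = (k / (k - 1)) * (1 - M(k - M) / (k * s^2)), where k is the number of 0/1 "items" (chances), M is the mean number of outs per fielder over those k chances, and s^2 is the variance of those totals across fielders. The function names, the toy data, and the choice of population variance are assumptions made for this illustration, not details taken from the article.

```python
from statistics import mean, pvariance


def first_k_totals(outcomes_by_fielder, k):
    """Sum each fielder's first k chances (coded 0/1, in chronological order).

    Fielders with fewer than k chances are dropped. The dict-of-lists layout
    is hypothetical and chosen only for this sketch.
    """
    return [sum(seq[:k]) for seq in outcomes_by_fielder.values() if len(seq) >= k]


def kr21(total_scores, k):
    """Kuder-Richardson formula 21 for k binary items of assumed equal difficulty.

    Uses the population variance of the totals; some texts use the sample
    variance instead, which changes the result slightly.
    """
    if k < 2:
        raise ValueError("KR-21 needs at least two items")
    m = mean(total_scores)
    var = pvariance(total_scores)
    if var == 0:
        raise ValueError("KR-21 is undefined when every total is identical")
    return (k / (k - 1)) * (1 - m * (k - m) / (k * var))


if __name__ == "__main__":
    # Invented totals: outs recorded by five fielders on their first 10 chances.
    print(round(kr21([2, 4, 6, 8, 10], k=10), 3))
```

Running `kr21` at increasing values of k and noting where the result first crosses .70 gives the kind of stabilization point reported in the results.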
If a measure failed to reach .70 within the frame available, I used the Spearman-Brown prophecy formula to estimate the point at which it would reach the reliability line in the sand.

The results for ground balls to the infielders:

First basemen: We need 290 GB at or near the first baseman before our crude measure of fielding stabilizes.

Next, I looked at fly balls and pop-ups for all seven non-battery positions. I used the same basic logic, except that I assigned each zone to the fielder who made more than 50 percent of the plays in that zone. I excluded fly balls that left the park. Also, this does not include line drives, since catching those is largely a matter of luck. I coded each fly ball 0/1 based on whether or not the fielder caught the ball. For infielders, I was only able to go out to a sampling frame of 200 pop-ups (so my top resolution was 100 pop-ups). For outfielders (who get more fly balls), I was able to go to 500 (so my estimates run to 250 fly balls).

First basemen: 48,000 pop-ups.* Really.

*Those corner infielder numbers are mostly the result of the fact that reliability numbers barely budged from zero in the tested sample. We'll talk a bit more about what this means in a minute.

To give some context around those numbers, the average team in 2012 had to take care of 6.3 ground balls and 4.7 fly balls/pop-ups per game. (Surprised?) That means that even if Starling Marte had played every inning of every game for the Pirates in left field and every single fly ball that the other team hit was hit his way, after 40 games, we would only expect him to have 188 fly balls hit his way, and that's only halfway to getting a reliable measure of his outfield range. However, after 40 games, there are certain parts of Starling Marte's batting line that can be considered reliable.

(Careful readers will note that I didn't address throwing arm stats, and that was more a matter of sample size than anything.
In the past I've found that performances in throwing runners out on the bases aren't very stable year to year, primarily because there just aren't a lot of chances that a player gets to show off his arm.)

What it Means

With defensive numbers, that point of reliability just doesn't happen that fast. It takes longer for a player's true colors to shine through on defense. When a guy like Starling Marte has a big number on his defense, it might reflect what he's done to date, but we can't be completely confident that it captures what he did during that time, and there’s even more uncertainty about who he was deep down. And even if the metric isn’t overstating his performance, we can’t be sure whether he’s really the best fielder in the league or just enjoyed a convenient spike of luck. Either way, we need to be careful to drill down a bit to see what is driving a high (or low) value and to frame our understanding accordingly.
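The Spearman-Brown extrapolation and the 40-game arithmetic from the methods section can be sketched as follows. The standard prophecy formula predicts the reliability of a sample lengthened by a factor n as n * r / (1 + (n - 1) * r), and solving for n gives the lengthening needed to hit a target. The function names and the example reliabilities below are hypothetical; the article does not report the underlying reliability values.

```python
import math


def spearman_brown(r, n):
    """Predicted reliability when the sample is lengthened by a factor of n."""
    return n * r / (1 + (n - 1) * r)


def lengthening_factor(r, target=0.70):
    """Factor by which the sample must grow for reliability r to reach target."""
    if not 0 < r < 1 or not 0 < target < 1:
        raise ValueError("r and target must both lie strictly between 0 and 1")
    return target * (1 - r) / (r * (1 - target))


def chances_needed(r, observed_chances, target=0.70):
    """Chances required to reach target, given reliability r at observed_chances."""
    return math.ceil(observed_chances * lengthening_factor(r, target))


def max_chances(games, per_game):
    """Upper bound on chances if every ball of that type went to one fielder."""
    return games * per_game


if __name__ == "__main__":
    # Invented example: reliability of 0.50 measured at 100 chances.
    print(chances_needed(0.50, 100))
    # The article's context figure: 4.7 fly balls/pop-ups per team game.
    print(max_chances(40, 4.7))
```

The formula also shows why the corner-infield pop-up estimates balloon. With a purely illustrative reliability of 0.005 at 100 pop-ups, `chances_needed` lands in the mid-40,000s, the same scale as the 48,000 quoted for first basemen, so a reliability that barely moves off zero pushes the projected sample to absurd sizes.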
Russell A. Carleton is an author of Baseball Prospectus. Follow @pizzacutter4
17 comments have been left for this article.
I rapidly get out of my depth when the Kuder-Richardson formalism comes up, so the following question may not make sense, but I'm going to ask it anyway. :-) If I understand it correctly, one of the underpinnings of the KR21 formula, at least as applied to test construction for exams in classes, etc., is the assumption that the "test questions" are of broadly equal difficulty. That clearly doesn't apply in a baseball setting. Unless a fielder is just incomprehensibly bad, he'll make all of the "easy" plays. It'll be the "hard" plays that separate the good fielders from the bad ones, and the "incredibly hard" plays that separate the great fielders from the good ones. Why, then, is KR21 an appropriate formalism for this subject? Aren't you asking it to do an analysis that it's not really well suited for?
I am taking some small liberties with KR-21. I'm assuming that a grounder is a grounder is a grounder (and that all are of equal difficulty), mostly because in the data set I'm using, I can't tell the difference as to which grounders were soft two-bouncers right at the fielders and which were screamers headed through the middle. The way that I have the database structured, I lined up the "test questions" in chronological order. So for first basemen, "question" #1 was the first ground ball that he saw from 1993 onward that was hit in his general area. For some guys, that was an easy one, for others a near impossible ball to get. What I'm counting on is that the noise all cancels out in the wash.
Thanks for the followup, and I understand your methodology a little better now, notably the fact that you almost must use KR21 because of the limitations of the data set. However, that's kinda my concern in a nutshell.
Two points. First, the assumption that all grounders or pop flies are of equal difficulty is obviously wrong (nor do you claim it to be otherwise, for sure), and it leads to the inclusion of lots of plays in your data base that really don't contribute much in terms of discriminatory power. Any ground ball hit within 5 feet of a fielder is going to turn into an "accepted chance" for that fielder, to use 60-year-old terminology, unless the guy is immobile on the scale of a late-career Frank Howard. Those chances may shed light on the inadequacy of guys with real hands of stone, or a terminal case of the throwing yips (think Steve Sax or Chuck Knoblauch), but otherwise they don't contribute much except added statistical clutter.
Second, the contention that all that clutter "cancels out in the wash" is dubious, because not all fielders have the same proportion of non-trivial plays attributed to them. There are a number of reasons for that, ranging from the fielders' own reputations to the reputations of teammates to the surfaces they played on to the pitchers they played behind, and so on. In essence, they aren't all taking the same fielding exam -- which again is one of the key points about KR21.
Yes, I understand now that with the limitations of the data set, you probably can't do better. But with the "right" data set, that is, a reduced set that looks only at balls in play that really do have discriminating power, I'd be pretty confident that the numbers required to achieve some degree of stability would be much reduced, although you'd have to use a more powerful algorithm to test that claim.
Oops: when I said "reputations of teammates," I really meant "objective capabilities of teammates." I wish we could edit these comments to fix things like that. Anyway...