April 23, 2003
Lies, Damned Lies
Estimating Pitch Counts
Silicone. Margarine. O'Doul's. Why fool around with watered-down imitations when you've got the real thing ready and available?
Rightly or wrongly, a lot of attention has been focused on pitch counts in the past several years. That's partly because of the efforts of people like Rob Neyer, Keith Woolner, and Will Carroll, not to mention those coaches, executives and agents who understand the importance of protecting their golden-armed investments. Pitch counts have become easy to take for granted because pitch count data is more readily available now than it ever was in the past. These days, just about any self-respecting box score lists pitch counts alongside the rest of a pitcher's line, a far cry from the dirty newsprint days of yore, when pitch count references were about as common as mentions of Reality TV or the Information Superhighway.
But what about when you don't have pitch count information available? Like, say, you're at a ballgame, wondering whether Dusty Baker should send Kerry Wood out for another inning? Or you're perusing minor league stats? Or you're looking at old boxes on Retrosheet, which, wonderful as they might be (this, folks, was the first game I ever attended), don't contain any information on pitch counts?
Well, it turns out that it's not that difficult to make a reasonable guess at pitch counts based on other information that's much easier to come by. Looking at a complete set of data from the 2001 and 2002 seasons as provided by Keith Woolner, I ran a simple linear regression of pitches thrown against various other characteristics of a pitcher's stat line. Here was the formula that I came up with:
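A minimal sketch of that formula in Python. The three coefficients are an assumption, recovered from the figures elsewhere in this piece rather than quoted: they reproduce, to the nearest pitch, every IPC value in the table of Sunday starters below and the Cy Young total near the end. The function and constant names are hypothetical.

```python
# Assumed coefficients, back-fitted to the tables in this article.
PITCHES_PER_BATTER = 3.17    # charged for every batter faced
EXTRA_PER_WALK = 3.44        # added on top for each walk
EXTRA_PER_STRIKEOUT = 1.53   # added on top for each strikeout


def implied_pitch_count(bfp, bb, so):
    """Estimate pitches thrown from batters faced, walks, and strikeouts."""
    return (PITCHES_PER_BATTER * bfp
            + EXTRA_PER_WALK * bb
            + EXTRA_PER_STRIKEOUT * so)


# Ainsworth's Sunday line: 22 batters faced, 3 walks, 3 strikeouts.
print(round(implied_pitch_count(22, 3, 3)))  # 85, matching the IPC column
```

Under these assumed weights, a walk costs more than twice as many extra pitches as a strikeout, which is why the two are kept as separate terms rather than lumped together.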
Before we go any further, let me disclaim that I'm not the first person to try and skin this cat. In particular, Boyd Nation has done some wonderful work in estimating pitch counts for college pitchers based on a similar approach. I'm publishing my results anyway because:
Yep, that's right--just this once I'm going for simplicity instead of nuance. Must be springtime.
While both Boyd and I are working from large enough datasets that any number of things turn out to be statistically significant, the degree to which they improve the accuracy of the estimate is marginal. The three components listed here--batters faced, strikeouts, walks--are substantially more important than any other. The next most important predictor of pitch counts, interestingly, is groundball-to-flyball ratio. Groundball pitchers, all else being equal, generally throw somewhat fewer pitches per hitter than flyballers. Theories among BP authors ranged from fewer foul balls allowed by groundballers, to flyballers needing to work the corners more to avoid the long ball. In either event, a reduced pitch count is a subtle advantage that many groundball pitchers possess.
Still, groundball-flyball data is tough to come by, so we'll stick with the three essentials. Although the IPC (implied pitch count) formula was derived from full-season statistics rather than pitching lines for individual games, it still does a decent job of estimating the latter. Here are the 30 pitchers who started this past Sunday, along with their IPCs and their actual pitch counts.
Pitcher     BFP  BB  SO  IPC  Pitches  Error
Ainsworth    22   3   3   85       79     7%
Armas        28   1   6  101       97     4%
Asencio      27   3   6  105      105     0%
Batista      16   0   1   52       48     8%
Beckett      24   3   8   99      107     8%
Bierbrodt     9   0   1   30       37    23%
Buehrle      24   2   2   86       77    10%
Chacon       26   2   6   98      105     7%
Daal         22   1   0   73       84    15%
Davis        27   1   1   91       86     5%
Duckworth    13   2   0   48       46     4%
Estes        18   3   1   69       69     0%
Fogg          5   0   0   16       15     5%
Fossum       28   5   3  111       99    10%
Franklin     27   1   2   92       85     8%
Glavine      30   2   1  104      103     0%
Graves       24   1   3   84       79     6%
Halladay     30   0   6  104      105     1%
Lewis        22   4   3   88       89     1%
Lilly        27   1   4   95       96     1%
Lohse        22   1   2   76       70     8%
Maroth       29   0   3   97      116    20%
Mussina      28   3   8  111      104     7%
Nomo         29   2   7  110      103     6%
Peavy        27   4   2  102      104     2%
Reynolds     27   2   1   94      110    17%
Robertson    25   2   4   92       77    17%
Sheets       25   0   2   82       89     8%
Tomko        36   4   6  137      111    19%
Washburn     31   3   2  112      109     2%
There are a few misses--Brett Tomko was more economical with his pitches than we would have expected, Mike Maroth less so--but for the most part, the simple formula acquits itself well, with an average error of about 7.7%. The error is considerably smaller when the formula is applied to seasonal data, as some of the flukish performances even one another out.
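For anyone who wants to check the arithmetic, here is a rough Python sketch that recomputes the IPC and Error columns for Sunday's starters. Two things in it are assumptions: the coefficients (3.17, 3.44, 1.53), which were chosen because they reproduce the IPC column, and the definition of error as the miss divided by the estimate rather than by the actual count, which is what appears to match the Error column. The helper names are hypothetical.

```python
# (pitcher, batters faced, walks, strikeouts, IPC as printed, actual pitches)
SUNDAY_STARTS = [
    ("Ainsworth", 22, 3, 3, 85, 79), ("Armas", 28, 1, 6, 101, 97),
    ("Asencio", 27, 3, 6, 105, 105), ("Batista", 16, 0, 1, 52, 48),
    ("Beckett", 24, 3, 8, 99, 107), ("Bierbrodt", 9, 0, 1, 30, 37),
    ("Buehrle", 24, 2, 2, 86, 77), ("Chacon", 26, 2, 6, 98, 105),
    ("Daal", 22, 1, 0, 73, 84), ("Davis", 27, 1, 1, 91, 86),
    ("Duckworth", 13, 2, 0, 48, 46), ("Estes", 18, 3, 1, 69, 69),
    ("Fogg", 5, 0, 0, 16, 15), ("Fossum", 28, 5, 3, 111, 99),
    ("Franklin", 27, 1, 2, 92, 85), ("Glavine", 30, 2, 1, 104, 103),
    ("Graves", 24, 1, 3, 84, 79), ("Halladay", 30, 0, 6, 104, 105),
    ("Lewis", 22, 4, 3, 88, 89), ("Lilly", 27, 1, 4, 95, 96),
    ("Lohse", 22, 1, 2, 76, 70), ("Maroth", 29, 0, 3, 97, 116),
    ("Mussina", 28, 3, 8, 111, 104), ("Nomo", 29, 2, 7, 110, 103),
    ("Peavy", 27, 4, 2, 102, 104), ("Reynolds", 27, 2, 1, 94, 110),
    ("Robertson", 25, 2, 4, 92, 77), ("Sheets", 25, 0, 2, 82, 89),
    ("Tomko", 36, 4, 6, 137, 111), ("Washburn", 31, 3, 2, 112, 109),
]


def implied_pitch_count(bfp, bb, so):
    # Assumed coefficients, back-fitted to the IPC column above.
    return 3.17 * bfp + 3.44 * bb + 1.53 * so


def relative_error(estimate, actual):
    # Assumption: the miss is measured against the estimate, not the actual.
    return abs(estimate - actual) / estimate


errors = []
mismatches = []  # pitchers whose recomputed IPC differs from the printed one
for name, bfp, bb, so, printed_ipc, pitches in SUNDAY_STARTS:
    estimate = implied_pitch_count(bfp, bb, so)
    if round(estimate) != printed_ipc:
        mismatches.append(name)
    errors.append(relative_error(estimate, pitches))

mean_error = sum(errors) / len(errors)
print(f"rows that disagree with the printed IPC: {mismatches}")
print(f"average error: {mean_error:.1%}")
```

Run against the table, the average should land close to the 7.7% quoted above, with Bierbrodt, Maroth, and Tomko doing most of the damage.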
Not that this is rocket science--the point is that pitch counts aren't any great mystery. The more batters you face, the more pitches you throw; the heavier you are on strikeouts and (especially) walks, the more pitches you throw per batter.
Now, there are pitchers whose pitch counts are routinely missed by IPC, and by fancier versions that include GB:FB ratio and the like. But even in these cases, the data can give us some interesting insight into their approach on the mound. These were the five pitchers whose pitch counts were most overestimated by IPC in 2002:
Pitcher    actual  expected  error  strike%
Maddux       2677      2934   -258      67%
Mulder       2995      3168   -173      65%
Sturtze      3560      3717   -157      66%
Lawrence     3084      3240   -157      62%
Oswalt       3423      3565   -143      69%
Once again, Greg Maddux is truly in a class by himself. Not only does he keep his pitch counts down by avoiding walks, but he's also unusually efficient with his pitches when recording other kinds of outs. Nor were the 2002 numbers a one-year fluke--the formula missed Maddux's actual pitch count by an even greater amount (-319) in 2001.
The data also bode well for Roy Oswalt and Mark Mulder, who have tossed a lot of innings at a young age, but throw to relatively few batters per inning, and throw relatively few pitches per batter. Brian Lawrence makes the list in spite of a strike rate that was almost exactly at the league average of 62%. If he starts hitting the corners, then a Saberhagen-like season or two is possible.
As for Tanyon Sturtze? Well, maybe he was chucking the ball in there a little bit, but what would you do if you had to pitch 225 innings for the Devil Rays?
The reverse list is also pretty interesting:
Pitcher    actual  expected  error  strike%
Rueter       3262      2983   +278      59%
Rusch        3614      3373   +241      61%
Washburn     3358      3116   +241      63%
Appier       3179      2945   +233      61%
Milton       2698      2529   +168      67%
We've never given Kirk Rueter very much credit (though, strangely, Huckabay has him on his Scoresheet team) but is this the secret to his success? Rueter also placed near the top of the 2001 list, as the formula underestimated him by 181 pitches. While it's nice in the abstract to keep one's pitch count down, for some pitchers, working at all different speeds and all throughout the strike zone is a survival skill, even if it results in a lot of deep counts. Other pitchers of the same 'family', like Tom Glavine and Mark Buehrle, also threw substantially more pitches than IPC predicted in both 2001 and 2002.
The pattern is a little more confounding for Eric Milton, who does throw pretty hard, and also throws a lot of strikes (67%). In fact, Milton is the only pitcher among the first 20 names on the 'most underestimated' list with a strike rate substantially better than league average. What the pattern means, I'm not quite certain. Perhaps Milton has microcommand problems--moving well from inside to outside but not quite hitting his spots--or perhaps his pitch selection is poor. Milton has consistently failed to leverage his skill set into the success we've long predicted of him, and if any Twins fans out there think this has something to do with it, I'm all ears.
One other brief but fascinating aside: There's a slight but statistically significant relationship between pitches per batter and hit rate on balls in play. Pitchers who require more pitches per hitter allow slightly fewer hits. The correlation is small, just -.10 based on the dataset I'm working with, but it suggests that pitchers who are doing their darnedest to make sure that hitters put the right pitch into play are rewarded with a few more outs.
The other application I've teased at here--using implied pitch counts to look at historical data--really needs to be approached carefully, since it doesn't necessarily follow that we can extrapolate the data backwards to a time at which pitching strategy was quite different. Still, it's interesting to note that just about everything we associate with higher pitch counts has increased as time has passed. Strikeouts have increased, flyball percentage has increased; walk rates haven't increased quite as much, but they're still higher than they were before WWII.
Just for fun, here are Cy Young's career numbers:
IP      BFP     BB    K
7356    28,517  1217  2803
Put those totals into the IPC formula, and it comes out to just under 99,000 pitches over the course of Young's career. That sounds like a lot--it is a lot! But it also works out to just under 121 pitches per nine innings pitched. Young, if he was pitching at all effectively, could complete a game without really reaching far into the danger zone. And keep in mind that I suspect the formula well overestimates Young's pitch counts, if for no other reason than he pitched in an era when most everything was hit on the ground.
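The Cy Young arithmetic can be sketched in a few lines of Python. As before, the coefficients are an assumption (they are the values that reproduce the IPC figures printed earlier in this piece), and the variable names are only for illustration.

```python
def implied_pitch_count(bfp, bb, so):
    # Assumed coefficients, back-fitted to the IPC figures in this article.
    return 3.17 * bfp + 3.44 * bb + 1.53 * so


# Cy Young's career totals as listed above.
innings = 7356
batters_faced = 28517
walks = 1217
strikeouts = 2803

career_pitches = implied_pitch_count(batters_faced, walks, strikeouts)
pitches_per_nine = career_pitches / innings * 9

print(f"{career_pitches:,.0f} career pitches")
print(f"{pitches_per_nine:.2f} pitches per nine innings")
```

Both results should fall just short of the round numbers quoted above: a bit under 99,000 for the career and a bit under 121 per nine innings.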
There's a price to be paid for all the strikeouts and walks in the modern game. And I don't mean the tofu dogs.