April 8, 2013
Rethinking Randomness: Pitchers and Their BABIPs
I think that we've really misunderstood pitcher BABIP over the years.
One of the main tenets of what's become known as DIPS Theory is that there are three "true" outcomes of a plate appearance from a pitcher's perspective, and that what happens when the ball is in play is mostly luck. It's one of those assumptions that's been around so long that it's baked into a lot of what sabermetricians hold dear. We have component ERAs that assume that a pitcher should have a league-average BABIP. We confidently state that a pitcher will regress to the league mean as if it were a matter of course. We predict doom for pitchers who have a .260 BABIP and salvation delayed for pitchers who have an "unlucky" .350 mark. "Danger Will Robinson! (Insert name of pitcher) has been running on luck and will collapse any moment now!" makes for an easy article. I know, I've written plenty of them. In fact, as recently as last week, I predicted that the Orioles would relapse into mediocrity because four of their main relievers from last year had BABIPs in the .260 range and thus, their success was a vast mirage. Because once the ball leaves the bat, it's all random chance, right?
At this point, there's a pretty good consensus that the real answer to the question is "Yeah, but... hang on a minute, there's more to it." There are a bunch of logical factors that can influence BABIP.
Still, most of the common ERA estimators (and a fair number of writers) continue to assume that BABIP is something that is out of the pitcher's control, and that, over time, it will return to league average (or at least a small window around that average).
Maybe we've been wrong all along. What if BABIP isn't a random event? What if we've just massively misunderstood the concept?
Warning! Gory Mathematical Details Ahead!
Proof no. 1: If BABIP is random, then why can I find a nice easy predictor of what's coming on the next ball in play?
Let's start with a fairly obvious question. In addition to groundball/flyball/pull/opposite field tendencies, wouldn't BABIP vary by how well the pitcher in question was throwing on that day? It's well known that pitchers vary in how much stuff they have from start to start. Any given pitcher might also have some minor injury that he pitches through (don't we all?) that still affects him over two or three starts. So, I decided to look at whether recent BABIP performance might predict the outcome of a single plate appearance.
To do this, I pulled a new trick out of the bag. For the years 1993-2012, I isolated all balls in play and coded whether they fell for a hit or not. I found the league average for that year. The way that BABIP is currently conceptualized, this should be the only number that we need. I converted the league BABIP into natural log of the odds ratio. In addition to the league number, I calculated what had happened to the 10 previous balls in play for this pitcher within this season. I did this as a moving average, so each ball in play had as a predictor the average fate of the 10 balls immediately before it. Again, I converted the BABIP to a logged odds ratio.
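The trailing-window predictor described above can be sketched roughly as follows. This is a minimal illustration, not the code used for the study: the function names and the sample sequence of outcomes are hypothetical, and treating an all-hits window as undefined (alongside the all-outs window) is an assumption made here because the log odds cannot be computed in either case.

```python
import math
from typing import List, Optional


def log_odds(p: float) -> float:
    """Natural log of the odds p / (1 - p); undefined at p = 0 or p = 1."""
    return math.log(p / (1.0 - p))


def trailing_babip_log_odds(outcomes: List[int], window: int = 10) -> List[Optional[float]]:
    """For each ball in play, return the log odds of the hit rate over the
    `window` balls in play immediately before it.

    `outcomes` is one pitcher's balls in play within one season, in order
    (1 = hit, 0 = out). The entry is None when fewer than `window` earlier
    balls in play exist, or when the trailing rate is exactly 0 or 1.
    """
    result: List[Optional[float]] = []
    for i in range(len(outcomes)):
        if i < window:
            result.append(None)
            continue
        rate = sum(outcomes[i - window:i]) / window
        result.append(log_odds(rate) if 0.0 < rate < 1.0 else None)
    return result


# Hypothetical sequence of 12 balls in play for one pitcher.
sample = [1, 0, 0, 1, 0, 0, 0, 1, 0, 0, 1, 0]
predictors = trailing_babip_log_odds(sample, window=10)
```

Each non-None value would then serve as the predictor for the ball in play at the same position, alongside the league figure converted the same way.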
At first, I ran a logistic regression using only the previous 10 BIP as a predictor, controlling for the league BABIP for that year. And I got...nothing. There was no significant association between recent performance and what happened on the next ball in play. It looked like each ball in play, once it left the bat, was as likely to fall in as any other. Or at least like recent performance wasn't going to help me.
But then I changed to the previous 20 BIP as my sampling frame, and a funny thing happened. Significance. Pulling in more data from the pitcher's recent past made the predictor better. I went to 30 BIP and got significance again, and somewhat stronger significance at that. I went to 40, and then 50, and it kept getting better. There's a way to tell whether a predictor in a binary logistic regression is better or worse than another. It's a model fit statistic called -2 log likelihood. All you need is a consistent set of cases. Run a series of predictors on the same set of cases and the one that gives you the greatest amount of change in the -2 log likelihood is your best bet. You can also compare the -2 log likelihood contributions of different variables in the model. I isolated cases where I could calculate a running mean from 10 BIP to 250 BIP (in 10-BIP increments) within a season (thus, the pitcher needed to have at least 251 BIP for that season and only plate appearances from the 251st ball in play onward were used). To allow for streaks where a pitcher had an 0-for-10 groove going (you can't take a logarithm of zero), I excluded those cases from all analyses.
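For readers who have not met the statistic, here is a small sketch of how -2 log likelihood is computed and compared across two sets of predictions on the same cases. It shows only the fit statistic, not the logistic regression fitting itself, and the outcomes and predicted probabilities below are made-up numbers for illustration.

```python
import math
from typing import Sequence


def neg2_log_likelihood(outcomes: Sequence[int], probs: Sequence[float]) -> float:
    """-2 times the log likelihood of binary outcomes (1 = hit, 0 = out)
    under the predicted hit probabilities. Lower means better fit."""
    total = 0.0
    for y, p in zip(outcomes, probs):
        total += math.log(p) if y == 1 else math.log(1.0 - p)
    return -2.0 * total


# Hypothetical outcomes for five balls in play.
outcomes = [1, 0, 0, 1, 0]

# Model A: every ball in play gets the same league-wide probability.
league_only = [0.300] * len(outcomes)

# Model B: probabilities that also reflect some other predictor (made up here).
with_predictor = [0.5, 0.2, 0.2, 0.5, 0.2]

# The drop in -2 log likelihood on the SAME cases is the quantity compared.
improvement = (neg2_log_likelihood(outcomes, league_only)
               - neg2_log_likelihood(outcomes, with_predictor))
```

The comparison is only meaningful when both models are scored on an identical set of cases, which is why the analysis restricts itself to balls in play from the 251st onward.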
Looking at the comparisons of how the moving averages fared against the league-average BABIP was a revelation. At 10 BIP, the league BABIP had a 4-to-1 edge in predictive power, consistent with what we've been taught about BABIP all these years. But as the sampling frame crept up, the pitcher's recent results on balls in play started to become a relatively stronger predictor. By the 100-BIP sampling frame, a pitcher's recent performance was the stronger of the two predictors. Around 150 BIP, it was about a 60/40 split in favor of the pitcher's recent results, and it stayed around that ratio up to 250 BIP.
It's hard to argue that a pitcher's recent performance is unrelated to some sort of underlying skill that he has, and the sampling frame needed to show that is much shorter than we would have imagined. (We'll talk about that "skill" in more detail in a minute.) If BABIP is simply a matter of luck and pitchers are tethered to the league average, why is this skill-related predictor doing a better job than league average of predicting the results of the next ball in play?
Proof no. 2: It's not defense...
One obvious critique of the above is that I may simply be picking up on the effects of the defense behind a pitcher. A groundball pitcher with four vacuum-cleaner infielders behind him will look amazing when it comes to BABIP. We need a way to separate what the pitcher is doing from how much his defense picks him up. Another less-obvious critique is that a pitcher's BABIP might depend more on the quality of the batter whom he faces.
Fortunately, from 1993-1999, Retrosheet data contain an indicator of what sort of ball the batter hit (ground ball? line drive? fly ball?) and where on the field the ball was hit based on a grid system. Now, these data have to be treated with some caution. Stringers classifying batted balls have biases. A line drive vs. a fly ball is something of a judgment call. And so is location. There is likely a tendency to place a ball that gets through the infield in the '56' zone (between the 3B and SS), but to place the same ball, when another shortstop manages to get to it, in the '6' zone (right at the SS). Some of these data points are also 20 years old, and we have no data on how hard the ball was hit. It's not perfect, but it will do for now.
For each ground ball (excluding bunts), I looked at what zone the ball was recorded as entering. For each zone, I calculated the league-wide expected BABIP for a ball hit to that area. By doing this, I was able to get both the pitcher's and batter's overall expected BABIP on grounders, based solely on the location of where the balls were hit. If the pitcher was steering grounders to areas where his fielders should have gotten them, and the fielders were simply subpar or he was facing batters who were good at "hitting it where they ain't," this method should account for that.
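The location-based expected BABIP can be sketched like this. It is an illustration under assumptions: the record layout, the function names, and the sample rows are hypothetical, and real Retrosheet zone codes are more varied than the two used here. The batter's version would be the same calculation keyed on a batter ID instead of a pitcher ID.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# Hypothetical record layout: (pitcher_id, zone_code, is_hit)
GroundBall = Tuple[str, str, int]


def zone_babip(balls: List[GroundBall]) -> Dict[str, float]:
    """League-wide BABIP for each zone: hits / balls hit to that zone."""
    hits: Dict[str, int] = defaultdict(int)
    total: Dict[str, int] = defaultdict(int)
    for _, zone, is_hit in balls:
        hits[zone] += is_hit
        total[zone] += 1
    return {zone: hits[zone] / total[zone] for zone in total}


def expected_babip_by_pitcher(balls: List[GroundBall]) -> Dict[str, float]:
    """Average of the league zone BABIPs over each pitcher's grounders.

    The pitcher's own results are ignored at this step; only WHERE his
    grounders went determines his expected figure.
    """
    rates = zone_babip(balls)
    sums: Dict[str, float] = defaultdict(float)
    counts: Dict[str, int] = defaultdict(int)
    for pitcher, zone, _ in balls:
        sums[pitcher] += rates[zone]
        counts[pitcher] += 1
    return {p: sums[p] / counts[p] for p in sums}


# Made-up sample: pitcher A mostly finds the '6' zone, pitcher B the '56' hole.
sample = [
    ("A", "6", 0), ("A", "6", 0), ("A", "56", 1),
    ("B", "56", 1), ("B", "56", 0), ("B", "6", 0),
]
expected = expected_babip_by_pitcher(sample)
```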
I also calculated the BABIP for the pitcher's team on ground balls over the course of the season in question, excluding those that happened with the current pitcher on the mound. This will give us a rough estimate of the team's defensive quality overall. Finally, I calculated the league BABIP on grounders. I converted all of the above to logged odds ratios again. I created a logistic regression for all ground balls in the data set coded for whether they went for hits or not. I entered each of the four indicators above as predictors, including only plate appearances where both the batter and pitcher had 100 grounders or more during the year.
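The team-defense proxy, a team's BABIP on grounders with the current pitcher's own balls in play removed, might look something like the sketch below. The record layout, the function name, and the sample rows are assumptions for illustration only.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# Hypothetical record layout: (team_id, pitcher_id, is_hit)
BallInPlay = Tuple[str, str, int]


def team_babip_excluding_pitcher(balls: List[BallInPlay]) -> Dict[Tuple[str, str], float]:
    """For each (team, pitcher), the team's BABIP on all balls in play
    that were NOT allowed by that pitcher."""
    team_hits: Dict[str, int] = defaultdict(int)
    team_n: Dict[str, int] = defaultdict(int)
    p_hits: Dict[Tuple[str, str], int] = defaultdict(int)
    p_n: Dict[Tuple[str, str], int] = defaultdict(int)
    for team, pitcher, is_hit in balls:
        team_hits[team] += is_hit
        team_n[team] += 1
        p_hits[(team, pitcher)] += is_hit
        p_n[(team, pitcher)] += 1

    result: Dict[Tuple[str, str], float] = {}
    for (team, pitcher), n in p_n.items():
        rest_n = team_n[team] - n
        if rest_n > 0:  # skip a pitcher who threw every ball in play for his team
            rest_hits = team_hits[team] - p_hits[(team, pitcher)]
            result[(team, pitcher)] = rest_hits / rest_n
    return result


# Made-up sample: one team, three pitchers, two grounders each.
sample = [
    ("X", "A", 1), ("X", "A", 0),
    ("X", "B", 0), ("X", "B", 0),
    ("X", "C", 1), ("X", "C", 0),
]
defense_proxy = team_babip_excluding_pitcher(sample)
```

Leaving the pitcher's own results out keeps the defensive estimate from simply echoing the outcome it is supposed to help predict.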
After that was done, I went back and did the same for line drives, and then for flyballs/pop ups. (For line drives, I dropped the inclusion criteria to 50 or more.)
I found the -2 log likelihood contributions for each of the predictors, similar to how I apportioned blame/credit in this article. Below is a table showing how well each of the predictors performed relative to each other for each type of batted ball.
We see that the batter's tendency to hit the ball where they generally ain't holds the greatest amount of sway over whether the ball will go for a hit. This squares with what we know about batter BABIP being a much more stable stat than pitcher BABIP. But the pitcher's tendencies to direct ground balls and fly balls to where the defense can generally get to them checks in as more important than the defense's general ability to turn batted balls into outs (the spread is closer for fly balls). And the league mean is present, but not a very strong predictor.
Far from being tethered to the league average, pitcher BABIP has a perfectly rational set of factors that influence it, and a good chunk of it belongs to the pitcher. Sure, the pitcher doesn't have full control over about 70 percent of the equation, but his contribution is generally twice as strong as that of the league average being used as a predictor.
Proof no. 3: An outcome and a skill are not the same thing.
Let's start this one with the language that surrounds the idea of DIPS and BABIP. (Note: Always study the language that someone uses. Always. Language always betrays hidden assumptions.) In Voros McCracken's original BABIP study, there were four types of outcomes of a plate appearance: a strikeout, a walk, a home run, or a ball in play. Everything was kept in its own separate box, as if these were completely separate things, but within each box the assumption was that they were completely unified skill sets.
The three true/one false outcomes model of a plate appearance assumes that we should classify events based on whether they are discrete outcomes on the scoreboard, rather than whether they reflect some underlying skill of the pitcher. Because we equated outcomes with skills, we saw that while strikeouts, walks, and home runs (somewhat less so) were repeatable from year to year, BABIP wasn't. The consensus on BABIP was "no skill involved." Maybe it should have been "poorly designed construct." Maybe the problem with BABIP isn't that it's all luck, but that getting outs on balls in play encompasses different skills in different situations, some of which are more influenced than others by factors outside the pitcher's control (whether luck or defense or the batter). Maybe getting outs on grounders is a different skill than getting outs on fly balls that don't leave the park.
Statistically, it's hard to create a meaningful single number that represents the sum of a wide range of only mildly related (both in terms of covariance and conceptually) components. Those who are familiar with the statistical technique of factor analysis will be familiar with this idea. For those who aren't, a quick example: Suppose that I wanted to create an index of how sad and depressed someone is. I might ask questions like how often the person feels hopeless about the future or how often the person has uncontrollable crying spells or how often the person feels that even things that used to be fun just aren't anymore. As the answer to one of these questions goes up, the answers to the others will probably go up as well. (For the initiated, they will have high factor loadings.)
Now, let's say that I tried to add in a question about how often the person had intrusive and obsessive thoughts. Obsessive thoughts are certainly a problem and may happen along with depression, but one can have depression and no obsessive thoughts or have obsessive thoughts but no depression. If I tried to shove this extra question into my measure, it would make the measure less stable.
Maybe we've been trying to put too many unrelated skills under the umbrella of BABIP. And for some reason, we've been surprised when it doesn't work. I'd argue that instead of a component ERA, maybe the first step is a component BABIP (like an xBABIP, which BP's Derek Carty has shown to be a good indicator of future performance).
Enough of this theoretical musing. The gory math awaits!
For the year pairs 2003-2004 to 2011-2012, I found all pitchers who had at least 250 balls in play in each year. Among these pairs, the year-to-year BABIP correlation was .193, which is the sort of lowly correlation that got this whole DIPS thing started. (Note: yes, I know I'm violating assumptions about the independence of data points. For just 2011-2012, it's .205. Happy?)
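The year-to-year correlation described above can be sketched as follows: pair each qualifying pitcher-season with the same pitcher's next season, then take the Pearson correlation across the pairs. The data layout, the function names, and the sample seasons are hypothetical, and the numbers are invented purely to show the mechanics.

```python
import math
from typing import Dict, List, Sequence, Tuple

# Hypothetical layout: (pitcher_id, year) -> (hits_on_bip, balls_in_play)
Seasons = Dict[Tuple[str, int], Tuple[int, int]]


def year_pairs(seasons: Seasons, min_bip: int = 250) -> List[Tuple[float, float]]:
    """(BABIP in year N, BABIP in year N+1) for every pitcher with at
    least `min_bip` balls in play in BOTH seasons."""
    pairs: List[Tuple[float, float]] = []
    for pitcher, year in sorted(seasons):
        nxt = seasons.get((pitcher, year + 1))
        if nxt is None:
            continue
        hits, bip = seasons[(pitcher, year)]
        if bip >= min_bip and nxt[1] >= min_bip:
            pairs.append((hits / bip, nxt[0] / nxt[1]))
    return pairs


def pearson_r(x: Sequence[float], y: Sequence[float]) -> float:
    """Plain Pearson correlation coefficient."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(vx * vy)


# Made-up seasons; pitcher B misses the 250-BIP cutoff in the second year.
sample: Seasons = {
    ("A", 2011): (75, 250), ("A", 2012): (80, 260),
    ("B", 2011): (90, 300), ("B", 2012): (20, 100),
    ("C", 2011): (70, 250), ("C", 2012): (78, 260),
}
pairs = year_pairs(sample)
```

Note that a pitcher who qualifies in three straight years contributes two overlapping pairs, which is the independence problem acknowledged in the note above.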
I ran a regression predicting the following year's BABIP using outcomes from the previous year that everyone assumes are "true": strikeouts per PA (year-to-year correlation of .77), walks per PA (.66), HR per PA (.30), GB% (.81), and FB% (.79), as well as BABIP.
The following equation produces a prediction that correlates with the next year's BABIP at a multiple-R of .305. That's not huge, but it's a) better than .193 and b) essentially the same number as the year-to-year correlation for home run rate.
The equation: .291 + .143 * BABIP * GB_rate - .057 * K_per_PA - .630 * BB_per_PA + 1.765 * BABIP * BB_per_PA.
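Written as a function, the equation above looks like this. The coefficients are taken directly from the equation; the function name, the argument names, and the sample inputs are hypothetical, and the code assumes every rate is entered as a proportion (e.g., .450 for a 45 percent ground-ball rate), which the equation does not state explicitly.

```python
def predict_next_babip(babip: float, gb_rate: float,
                       k_per_pa: float, bb_per_pa: float) -> float:
    """Next-year BABIP prediction from the previous year's components.

    All inputs are assumed to be proportions between 0 and 1.
    """
    return (0.291
            + 0.143 * babip * gb_rate      # BABIP x ground-ball rate interaction
            - 0.057 * k_per_pa             # strikeouts per plate appearance
            - 0.630 * bb_per_pa            # walks per plate appearance
            + 1.765 * babip * bb_per_pa)   # BABIP x walk rate interaction


# Hypothetical pitcher: .260 BABIP, 45% grounders, 20% strikeouts, 8% walks.
prediction = predict_next_babip(babip=0.260, gb_rate=0.45,
                                k_per_pa=0.20, bb_per_pa=0.08)
```

Notice that last year's BABIP never enters on its own, only through its interactions with ground-ball rate and walk rate.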
When we try a very simple component-level prediction for next year's BABIP, our predictive power goes up. Suddenly, this doesn't all look so cut-and-dried. The point is that when you take a more component-based view of BABIP, the skills—plural—and the interactions between those skills tend to come out. Maybe there is no difference between major-league pitchers in their ability to prevent hits on balls in play. But there certainly are differences in the abilities that go into preventing hits.
Well then... why does BABIP always seem to regress to .300?
But the cynic will point out that despite all this, while BABIP may not be a unitary skill, it is an outcome that makes a large amount of difference in what happens on the scoreboard. And it does not correlate well from year to year. And yes, most guys who have a .260 BABIP one season follow it up with a .300 season the next year and show a resulting decline in their headline stats.
I still hold to the idea that BABIP is (multi-)skill-based, and have no trouble reconciling these two facts in my head. I offer the following three thoughts:
1) There will always be random variation in any measurement from year to year, and the smaller the sample size, the more likely that random variation creeps in. There probably are seasons where a pitcher had a good BABIP that really was just good luck, and we'll expect him to revert back to form in the following year. But if we took a more component-based look at BABIP, we'd probably be able to tell which inputs are more or less given to randomness. If a pitcher got lucky on an indicator that we know really is luck-based, we might predict regression. But if it was on a skill that we know to be stable, we might predict that the magic will continue. Being able to discern who got lucky vs. who might sustain that performance would be a massively interesting talent, now wouldn't it? I think a component-based view of BABIP gets us closer to that.
2) If BABIP really does consist of several skills acting in concert, a "lucky" season is likely to be the result of a pitcher who has put it together on several different skills over the course of a year. The problem might be that a loss of one of those skills might be enough to tilt him back toward the mean, and while maintaining good form on one skill is hard enough, what if it's four or five different skills? That's four or five things on which the pitcher might mess up, and the result is that he would become simply ordinary again.
3) I think there's one other measurement error that we tend to make in sabermetrics. We assume that a player is his yearly average throughout the course of the season. This makes about as much sense as noting that the average high temperature in the city of Chicago is around 50 degrees, and packing for crisp, autumn weather—in January. Sure, the overall average is 50, but as seasons change, the climate changes too, and you have to adjust your expectations. We wouldn't make that mistake in packing for a trip, yet we do it all the time in sabermetrics.
In proof no. 1, we saw that a moving-average approach to predicting BABIP was quite effective in predicting what happened next, and at that, we needed to look back at only 100 BIP before it overtook the league average as a good predictor. This leaves open the possibility that whatever the skill or skills are that are involved in BABIP, it or they may fluctuate over time. These fluctuations may not represent random variation around a mean, as is often assumed. They might be real changes in true-talent level.
There's probably a natural floor (and ceiling) to how good a pitcher can be in preventing hits on balls in play. Major-league hitters will eventually square up on even the toughest pitcher. But maybe the untapped concept that differentiates the regresser from the maintainer is the ability to hold on to a good true-talent level over a long period of time. Maybe that's a talent unto itself. Maybe studying those variations from month to month and seeing who is steady across time vs. who fluctuates wildly from week to week will shed some light on the subject.
More than anything, I hope that what we've learned is that saying "He got lucky!" isn't enough anymore. I worry that for too long, we didn't question the DIPS hypothesis strongly enough. I believe that the preponderance of evidence points to there being real differences between pitchers in their abilities to prevent hits on balls in play and that the assumption that the league-average BABIP is the best baseline going forward is false. Balls in play are not completely within the pitcher's control, but the pitcher's contribution is not trivial. We should build our assessments of pitcher quality with that knowledge in mind going forward.