
July 16, 2012

Baseball Therapy

It's a Small Sample Size After All

by Russell A. Carleton

Who said sabermetrics hasn't gone mainstream? We've now reached the point where even mainstream analysts are yelling "small sample size!" at one another. There's always been some understanding that a player who goes 4-for-5 in a game is not really an .800 hitter, but now, people are being more explicit in talking about sample size. I consider that a victory. Hooray for sabermetrics!

How big does a sample size need to be before it stops being... small? As I understand it, the most commonly cited study on the topic was written by a man code-named "Pizza Cutter" almost five years ago at a blog that no longer exists.

Mr. Cutter's idea was to look at something called split-half reliability. (BP's Derek Carty did something similar a while ago, picking plate appearances at random.) Mr. Cutter took two equally sized samples of PA for a bunch of players and checked how well they correlated with one another. The idea is that as the sample grows, a statistic becomes more and more "stable," meaning that it becomes a better indicator of a player's true talent level over that time frame.

After reading the original Pizza Cutter article, I am amazed that anyone pays attention to this study given its many methodological flaws. Among them:

  • It is written by a man who named himself after an auxiliary kitchen utensil
  • By his own account, Mr. Cutter, who at the time was working with data from 2001-2006, used consecutive pairs of years (2001-2002, 2003-2004, 2005-2006) for each player. This means that Barry Bonds, and anyone else who played in all six years, would have appeared in his data set three times. Sloppy.
  • He used an evens-and-odds method to split his sample. In this case, he lined up everyone's PAs in chronological order, numbered them from one to whatever, and then split them into even- and odd-numbered PAs and calculated a correlation between those two buckets (a minimal sketch of that split appears after this list). This is a man who needs more methodological sophistication. It may not be likely, but what if his findings were an artifact of the evens-and-odds split itself?
  • When looking at batted-ball type rates, he used PA as the denominator rather than balls in play.
  • He used a case-wise deletion strategy. So, his sample for 100 PA reliability is different from his 200 PA reliability sample.
  • He used 50 PA intervals. I'm going to use 10. Better resolution.
  • He left pitchers-as-batters in the sample. They should really be taken out. At higher levels of PA, they will naturally be selected out, but at low levels of PA, they might be muddying up the sample.
  • Why would a man obscure his real name like that on such an important study? Was he afraid that people would find out who he is?
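
For the record, here is a minimal sketch of what that evens-and-odds split amounts to. The array and function names are mine, purely for illustration: assume a batter-by-PA matrix of 1/0 outcomes (say, strikeout or not), with each batter's PAs already in chronological order.

```python
import numpy as np

def even_odd_split_half(outcomes):
    """Evens-and-odds split-half correlation (the method criticized above).

    outcomes: 2-D array, one row per batter, one column per PA,
              each cell coded 1/0 (e.g., strikeout / no strikeout),
              with PAs in chronological order.
    """
    odd_half = outcomes[:, 0::2]    # PAs 1, 3, 5, ...
    even_half = outcomes[:, 1::2]   # PAs 2, 4, 6, ...
    # Each batter's rate in each half, correlated across batters.
    return np.corrcoef(odd_half.mean(axis=1), even_half.mean(axis=1))[0, 1]
```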

Let's see if we can make this better. I propose to replicate Mr. Cutter's study with much better methodology. As always, if the numbers scare you, you can close your eyes for the next part and skip ahead to "The Results."

Warning! Gory Mathematical Details Ahead!
I missed doing that.

The data were Retrosheet play-by-play logs from 2003-2011. Pitchers batting were eliminated, as were all intentional walks (I counted them as never happening). Only batters who had at least 2000 PA in that time frame were selected. There were 311 such batters. All batter PAs were lined up chronologically and numbered in order, and I took the first 2000 PAs for each batter. This means that I was able to get reliability coefficients on samples up to 1000 PA.
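
Just to make the selection step concrete, here is a rough sketch of how that filtering might look in pandas. The file name and column names are hypothetical (not actual Retrosheet field names); this illustrates the logic rather than reproducing my actual code.

```python
import pandas as pd

# Hypothetical file and columns; Retrosheet's actual event fields differ.
events = pd.read_csv("retrosheet_pa_2003_2011.csv")

# Drop pitchers batting and intentional walks (treated as never happening).
pa = events[~events["is_pitcher_batting"] & ~events["is_intentional_walk"]].copy()

# Line up each batter's PAs chronologically and number them in order.
pa = pa.sort_values(["batter_id", "date"])
pa["pa_number"] = pa.groupby("batter_id").cumcount() + 1

# Keep batters with at least 2000 qualifying PAs, then the first 2000 for each.
counts = pa.groupby("batter_id")["pa_number"].max()
qualified = counts[counts >= 2000].index
sample = pa[pa["batter_id"].isin(qualified) & (pa["pa_number"] <= 2000)]
```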

For stats that had other denominators, such as batting average (ABs) or grounders (balls in play), I note the inclusion criteria in the chart below.

This time, instead of splitting up the sample into evens-and-odds as Mr. Cutter did, I used a much better methodology: the Kuder-Richardson reliability formula. (For the initiated, I used KR-21, a derivative of Cronbach's alpha. In a couple of cases where the outcome was not binary—SLG, ISO—I used Cronbach's alpha directly.) The baseball statistics in which we are most interested are built from binary outcomes (strikeout rate is a series of yes/no questions of whether the batter struck out in each PA), and Kuder-Richardson specifically assesses measurement reliability for binary outcomes.

The formula is available elsewhere online, but the basic idea is this: Imagine that you had a sample of six PAs for a bunch of hitters. Now imagine that, instead of splitting them 1-3-5 vs. 2-4-6 (i.e., odds and evens), you could split them into two halves in every single possible way and correlate the halves. So, you could see what the correlation between 1-2-3 and 4-5-6 would be, or the correlation between 1-2-4 and 3-5-6. Then, let's say that you could take the average of all of those correlations. Mathematically, that's what Kuder-Richardson (and Cronbach) does.
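
For those who want to see it spelled out, here is a minimal sketch of KR-21 as described above. The function name and inputs are mine, for illustration only; the formula itself is the standard one.

```python
import numpy as np

def kr21(totals, k):
    """Kuder-Richardson 21 reliability for k binary 'items' (here, PAs).

    totals: one value per batter -- the count of 'yes' outcomes
            (e.g., strikeouts) over the same k plate appearances.
    k:      number of PAs each batter contributes.
    """
    totals = np.asarray(totals, dtype=float)
    m = totals.mean()          # mean total score across batters
    s2 = totals.var(ddof=1)    # variance of total scores across batters
    return (k / (k - 1.0)) * (1.0 - (m * (k - m)) / (k * s2))
```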

So, if I have a sample of 500 PAs for a list of batters, this method will tell me what happens when you split that into a pair of 250 PA samples in every possible way. The result will be a much better estimate of how reliable an indicator of a player's true talent level a statistic is over 250 PA. Of course, we know some stats reach higher levels of reliability at lower levels of PA, but it's interesting to note which ones are which and what that says about player evaluation as the season goes along.

I looked for the point where reliability passed .70, which is about the only thing that Mr. Pizza Cutter got right. At a reliability of roughly .70, signal crosses the halfway point relative to noise (.707 * .707 ≈ 50 percent of the variance). Of course, with any sort of bright line, there's always the objection that it's a black/white contrast where 50 shades of grey are called for. I don't know what else to say other than "Yeah, I know."
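
Putting the pieces together, the search itself is just a sweep in 10-PA increments until the coefficient crosses .70. Here is a sketch, reusing the kr21 function sketched above and the same hypothetical batter-by-PA outcome matrix, following the split-half framing described earlier (the coefficient on a 2N-PA sample read as the stability of the stat over N PA).

```python
def stabilization_point(outcomes, step=10, threshold=0.70):
    """First PA count N at which reliability crosses the threshold.

    outcomes: 2-D array, one row per batter, one column per PA (1/0 coded).
    The coefficient computed on the first 2*N PAs is read as the
    stability of the statistic over N PA.
    """
    n_batters, n_pa = outcomes.shape
    for n in range(step, n_pa // 2 + 1, step):
        totals = outcomes[:, : 2 * n].sum(axis=1)
        if kr21(totals, 2 * n) >= threshold:
            return n
    return None  # never crossed the threshold; see the Spearman-Brown note below
```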

The Results

Statistic | Definition | Stabilized at | Notes
Strikeout rate | K / PA | 60 PA |
Walk rate | BB / PA | 120 PA | IBBs not included
HBP rate | HBP / PA | 240 PA |
Single rate | 1B / PA | 290 PA |
XBH rate | (2B + 3B) / PA | 1610 PA | Estimate*
HR rate | HR / PA | 170 PA |
AVG | H / AB | 910 AB | Min 2000 ABs
OBP | (H + HBP + BB) / PA | 460 PA |
SLG | (1B + 2 * 2B + 3 * 3B + 4 * HR) / AB | 320 AB | Min 2000 ABs, Cronbach's alpha used, Estimate*
ISO | (2B + 2 * 3B + 3 * HR) / AB | 160 AB | Min 2000 ABs, Cronbach's alpha used
GB rate | GB / balls in play | 80 BIP | Min 1000 BIP, Retrosheet classifications used
FB rate | (FB + PU) / balls in play | 80 BIP | Min 1000 BIP including HR
LD rate | LD / balls in play | 600 BIP | Min 1000 BIP including HR, Estimate*
HR per FB | HR / FB | 50 FBs | Min 500 FB
BABIP | Hits / BIP | 820 BIP | Min 1000 BIP, HR not included

Hopefully, Colin Wyers won't kill me for using Retrosheet batted ball classifications.

* - In some cases, the magic .70 mark was not reached within the constraints of the data set, so I used the Spearman-Brown prophecy formula to estimate at what point .70 was most likely to occur.
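
For anyone curious, the Spearman-Brown prophecy formula behind those estimates is the standard one; here is a minimal sketch (the function name and arguments are mine).

```python
def spearman_brown_projection(r_observed, pa_observed, r_target=0.70):
    """Project the PA count at which reliability would reach r_target,
    given an observed reliability of r_observed at pa_observed PAs."""
    n = (r_target * (1.0 - r_observed)) / (r_observed * (1.0 - r_target))
    return n * pa_observed
```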

What it means
Take a look at the basic outcomes of a PA. The "three true outcomes" (TTO) have been a staple of sabermetric thinking for a while. The notion that the holy triad of strikeout, walk, and home run is "true" came from DIPS theory and applied mostly to pitchers. While those are the three rates that stabilize most quickly for hitters, it's actually a gentle progression upward to HBP rate and then singles rate.

Perhaps we need to talk about the five factual outcomes for hitters? I realize that TTO is meant to describe a hitter like Adam Dunn or Jack Cust whose style of play emphasizes those three outcomes. However, between 2007 and 2010, when the two of them were duking it out for the title of TTO king, Cust began to see his rate of singles rise (while his HR rate fell), while Dunn hit comparatively fewer singles and kept his HRs (freakishly) consistent.

The rate of doubles and triples was an odd duck. There's been a certain sabermetric fascination (should I use the word fetish here?) over the past few years with guys who have high doubles numbers but whom the market overlooks because they don't have sexy HR totals. Those doubles numbers may be illusions. The home run numbers are more likely to be real. Caveat emptor. Or amator.

Ground balls and fly balls stabilize at roughly the same time (and quickly!). Skill in producing line drives is subject to much more noise. Again, Colin Wyers has written over and over that it's hard to trust a line-drive classification because it's a subjective judgment. But even granting that Retrosheet is 100 percent correct, a player's line drive rate will likely vary a lot, while his GB/FB ratio will be quick to stabilize. Some players are GB hitters, some are FB hitters, but line drives occasionally happen and it's hard to know why.

Overall, these numbers aren't vastly different from those in Pizza Cutter's original article, but the methodological improvements that I've made take away some of the concerns that could be raised about the originals. The techniques are a little more obscure, but after five years, it's time for an update. If I see some other older works that might benefit from some methodological sprucing up, especially from this Pizza Cutter guy, I might look into doing just that.

(If there's a stat that you wish I had done, leave it in the comments, and I will do my best to get around to it. Let's stick to hitters for now.)

Next time, we'll talk about how these numbers are often misused and what they can and can't be used to show.

Russell A. Carleton is an author of Baseball Prospectus. 
