April 21, 2005

Crooked Numbers

April Fools

by James Click

April stats are meaningless. OK, that's not entirely fair. March stats are meaningless; April stats are just misleading. As Joe Sheehan pointed out yesterday, most everyone knows this and understands it, but when you love talking about baseball, you don't want to say "let's wait until July." Instead, we qualify all our statements before launching into discussions of Brian Roberts' home run chase, Tim Hudson's hard luck, and Edgardo Alfonzo chasing .400.

As an exercise in restraint, here are the best and worst hitters through April 30, 2004, as ranked by MLVr (minimum 50 PA in April and 300 PA on the year):

Batter            Year   AVG   OBP    SLG    MLVr

Barry Bonds       2004  .472  .696  1.132   1.481
Charles Johnson   2004  .333  .458   .875    .848
Lew Ford          2004  .419  .471   .710    .784
Adam Dunn         2004  .328  .538   .750    .767
Sean Casey        2004  .414  .458   .667    .698
Jim Thome         2004  .364  .456   .714    .682
Moises Alou       2004  .361  .400   .735    .645
Manny Ramirez     2004  .388  .448   .647    .617
Laynce Nix        2004  .365  .397   .714    .617
Ron Belliard      2004  .417  .500   .548    .582
------------
Neifi Perez       2004  .220  .260   .275   -.371
Gabe Kapler       2004  .233  .270   .250   -.380
A.J. Pierzynski   2004  .236  .267   .250   -.385
Luis Rivas        2004  .190  .227   .317   -.391
Tike Redman       2004  .226  .229   .301   -.391
Jimmy Rollins     2004  .183  .263   .268   -.392
Ty Wigginton      2004  .188  .216   .333   -.394
Alex Gonzalez     2004  .182  .222   .312   -.413
Jason Phillips    2004  .162  .275   .221   -.435
Derek Jeter       2004  .168  .255   .232   -.460

While Barry Bonds had already established his dominance, there are quite a few names (Charles Johnson, Laynce Nix, Derek Jeter, Jimmy Rollins) who did not finish the year anywhere near where they began. Similarly, on the morning of May 1 last year, the Red Sox were 15-6, the Orioles 12-9, and the Yanks 12-11. Texas was leading the AL West, and the Cardinals were 12-11, a game and a half behind the Astros and Cubs, who were tied for the division lead at 13-9.

Though there are always a few outliers every April, simply dismissing the first month of the season is obviously not the way to go. Games in April count as much as games in September; it's just that the ones in September seem to carry greater implications because, by then, the likelihood of the various outcomes is vastly different. Much like leverage as it pertains to relievers, games later in the season have an apparently larger bearing on the standings. But a slow April, much like a starter who gets shelled in the early innings, can make those late games meaningless.

Similarly, with individual player statistics, we can estimate just how meaningful that first month is. There are a couple of different ways to do this. The first is to use something called confidence intervals for population proportions (referred to here as "p-hat" because the symbol for the observed proportion is a "p" with a "^" over it). P-hat allows us to determine how accurate our data is with varying degrees of confidence and ranges. Essentially, based on the sample size, the normal distribution curve, and the value in question, p-hat gives us a quick formula for the range within which the "true" value lies.

The best way to think of p-hat is like a coin. We "know" the coin will land on heads 50% of the time if we flipped it forever, but if we only flip it five times, obviously it's not going to come up at 50%. As the number of flips increases, the more information we have about the coin and the closer the total proportion of heads flips tends to be to 50%. There's a normal curve of outcome distributions with 50% being the most likely (in the middle of the curve) and higher and lower proportions of heads less likely (the tails). Selecting a certain percentage of the area under the curve gives us that much confidence that the "true" likelihood of a heads flip lies within the selected range. Using p-hat, we can estimate the minimum and maximum values we need in order to cover that area. The more times we flip the coin, the tighter the curve gets, and thus the closer the minimum and maximum values get to the mean for a particular confidence level.
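To put rough numbers on the coin example, here is a minimal sketch. It assumes the usual normal approximation for a proportion, in which the half-width of the range is z * sqrt(p * (1 - p) / n); the function name `half_width` and the flip counts are illustrative choices, and the approximation is admittedly crude for something as small as five flips.

```python
from math import sqrt
from statistics import NormalDist


def half_width(p, n, confidence=0.95):
    """Half-width of the normal-approximation range around a proportion p
    observed over n trials (an assumed, textbook formula)."""
    # z-score that leaves (1 - confidence) / 2 in each tail of the curve
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    return z * sqrt(p * (1 - p) / n)


# A fair coin: the range around 50% shrinks as the flips pile up.
for flips in (5, 50, 500, 5000):
    points = 100 * half_width(0.5, flips)
    print(f"{flips:>5} flips: 50% +/- {points:.1f} percentage points")
```

Because the width shrinks with the square root of the sample size, a hundred times as many flips buys a range only ten times tighter.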

Getting back to ballplayers, in 2004, Bonds had an OBP of .696 over his first 92 PAs of the season. Using p-hat, we can say that there is a 95% chance that Bonds' "true" OBP is between .602 and .790. If we want to scale back to an 80% confidence interval, the boundaries are .635 and .757. While Bonds finished the 2004 season with a .609 OBP--within the 95% range but outside 80%--over the larger set of all ballplayers, p-hat is very accurate.
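The Bonds ranges above can be checked with a short sketch. It assumes the standard normal-approximation interval for a proportion, p-hat +/- z * sqrt(p-hat * (1 - p-hat) / n), which may not be exactly the calculation used for this column; the function name `proportion_interval` is made up for illustration.

```python
from math import sqrt
from statistics import NormalDist


def proportion_interval(p_hat, n, confidence):
    """Normal-approximation confidence interval for a proportion."""
    # z-score that leaves (1 - confidence) / 2 in each tail of the curve
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    margin = z * sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - margin, p_hat + margin


# Bonds through April 2004: .696 OBP over 92 PA (figures from the text).
for confidence in (0.95, 0.80):
    low, high = proportion_interval(0.696, 92, confidence)
    print(f"{confidence:.0%}: {low:.3f} to {high:.3f}")
```

Under that assumption the output matches the boundaries quoted above: .602 to .790 at 95%, and .635 to .757 at 80%.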

Unfortunately, there are two problems with applying p-hat to the data above. The first is that p-hat is used with binomial variables, so something like OBP or AVG works well since it's dealing with a simple question of yes/no: hit/no hit; on-base/not on-base. SLG and MLVr, however, are not simple binomials, and thus we can't use p-hat for them.

Secondly, even after the season is over, the confidence intervals using p-hat are very large. This is because a 162-game season isn't nearly long enough to confidently determine a player's "true" ability. Keith Woolner discussed this with regard to teams a few weeks ago, but the same goes for players. A total of 600-700 plate appearances is a lot, but based on confidence intervals, even with a sample size that large, the 95% confidence range is typically between 90 and 100 points of OBP. Looking at everyone who had an OBP of .350 in 2004, that means that one out of every 20 of them had a "true" OBP of over .390 or under .310. Given a larger sample size--over a career--they'll likely regress towards their "true" OBP. Thus, comparing confidence ranges based on April stats to confidence ranges based on full-season stats gets us into large areas of overlap as well as some rather complicated confidence measurements of the results.

Instead, in an effort to keep things a little simpler, let's see how the actual April stats compare to the full-season results. Looking again at the list above, some of those names are right where we'd expect them. Bonds is on top, joined by Adam Dunn, Jim Thome, and Manny Ramirez. Aside from Jeter, most of the players on the bottom are among the lightest hitters in baseball: Neifi Perez, Luis Rivas, and occasional #3 hitter Tike Redman. Far from being worthless, stats in April are more often than not a good indicator of the season to come.

Getting back to the sample group of all players registering at least 50 PA in April and 300 PA on the season, here's how well correlated the stats are; in essence, how well April numbers predicted the rest of the season from 2000 to 2004:

Using April stats from that season, the coefficient of determination (r-squared) is .346, meaning that the April stats explain about 34.6% of the variance in MLVr. Given that they comprise about 16.7% of the season total, that's not very impressive. Contrast that with the previous season's MLVr:

That's not that much better, but notice both the change in scale (as April MLVr has a much wider range) and slope, indicating that there isn't nearly as much regression to the mean from season to season as from April to the end of the year. Looking at the previous year's MLVr reveals that we're not dealing with a case where April stats pale in comparison to other simple predictive measures. Running the two together as a multivariable regression, r-squared rises to .5595 with the previous season's MLVr about twice as valuable as April's MLVr.
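For readers who want to run this kind of comparison themselves, here is a minimal sketch of how an r-squared between two sets of MLVr values can be computed for the single-variable case. The function name `r_squared` is an illustrative choice, and the numbers fed to it are made-up placeholders rather than real player data, so the printed value says nothing about actual hitters.

```python
def r_squared(xs, ys):
    """Square of the Pearson correlation between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of cross-products and squared deviations from the means
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov * cov / (var_x * var_y)


# Made-up placeholder values, NOT real player data: one April MLVr and one
# full-season MLVr per hypothetical hitter.
april_mlvr = [0.60, 0.25, -0.10, 0.05, -0.35]
season_mlvr = [0.30, 0.20, 0.00, -0.05, -0.15]
print(f"r-squared: {r_squared(april_mlvr, season_mlvr):.3f}")
```

The two-variable figure of .5595 would need a multiple regression, which this sketch does not attempt.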

So where does all this leave us? For starters, April stats are not meaningless, but rather there are a few outliers every season that draw a lot of attention. On the other hand, those outliers cannot simply be written off as a hot streak or cold streak. Instead, when combined with the previous season's stats or other projections, they can give an early indication about the expected performance of players this season. Averaging two helpings of last year's MLVr with one helping of April's MLVr will get you most of the way to an estimate of how a player is going to perform over the course of the season.
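As a back-of-the-envelope version of that rule of thumb, the sketch below weights last season twice as heavily as April. The 2:1 weights come from the "about twice as valuable" finding above; the actual regression coefficients and intercept are not given here, so treat this as a rough blend rather than the fitted model. The function name `blended_mlvr` and the sample hitter are hypothetical.

```python
def blended_mlvr(previous_season, april):
    """Two parts last season's MLVr to one part April's MLVr."""
    return (2 * previous_season + april) / 3


# Hypothetical hitter: .150 MLVr last season, .600 MLVr this April.
print(f"{blended_mlvr(0.150, 0.600):.3f}")
```

For that made-up hitter, a blistering .600 April only pulls the estimate up to .300.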

The other lesson is that only 95% of players will fall into those large confidence areas mentioned above. While it's difficult to generate them for MLVr, of the top 20 players in AVG or OBP at the end of April, it's a good bet that one of them will finish more than 90 points of OBP or 80 points of batting average above or below their current pace. Will it be Roberts? Clint Barmes? Jacque Jones? We'll have to watch to find out.
