Researchers have recently begun focusing more on hitters’ Batting Average on Balls in Play (BABIP) and on how to predict it properly. Major League hitters strike out in 20% of at-bats and hit homeruns in 3% of at-bats, so to understand a hitter’s true talent level (and therefore project his performance), figuring out the other 77% of at-bats is very important. It has been established and then debated that pitchers have very little “control” over BABIP, but hitters clearly do have some effect. However, they do not seem to “control” it all that much. I am putting the word “control” in quotes because I am uncomfortable assuming that randomness is what drives the lack of year-to-year correlation in BABIP; rather, I think pitchers/defenses and hitters adjust to each other, which would explain the low autocorrelations of some statistics. Take Ryan Howard, for example. He hits the ball very hard, and he does so often. In his first year, he had a very high BABIP, but then defenses adjusted and put three infielders in shallow right, and his BABIP dropped. So what I’m trying to do is figure out what will help me predict future BABIP for hitters, rather than necessarily establish what it is they personally control.
Now let’s look at what I mean about the lack of year to year correlation. Consider the following table.
STATISTIC Correlation of 2007/2008
(I should note that this table represents my data set of the 224 Major League hitters who got over 100 at-bats in each year from 2005 to 2008. This is not a random sample—these are some of the very best baseball players in all of professional baseball, with a bias towards healthier and older players. In fact, the average player in my sample was about 28-31 during 2005-2008. Many major league hitters do not get 100 at-bats four years in a row, and I mention this to confess the limits of this study. I am not sure how to properly adjust for missing data, and I do not believe I have access to minor league equivalencies. Plus, I do not really believe that minor league equivalencies are remotely accurate enough to properly figure out this data.)
Anyway, what the table above means is that previous BABIP numbers explain far less of BABIP than previous numbers explain of other plate appearance outcomes. In fact, the best prediction of 2008 BABIP that could be gotten by regressing on BABIPs from 2005-2007 has only a .3787 correlation with actual 2008 BABIP. However, we can probably do better. There is a lot of other data out there. Chris Dutton and Peter Bendix recently posted a study at The Hardball Times explaining some major correlates of BABIP, including BB/K ratio, Line Drive%, Groundball/Flyball ratio, Speed score, Strikeout rate, Spray (tendency to hit the ball more towards one side of the diamond than the other), and pitches/plate appearance. They can explain 35% of the variance in BABIP in a given year based on these statistics and some indicator variables for handedness and home ballpark. I am curious how much these statistics can help predict future BABIP, however. I do not have access to all of this data, but there are a lot of new statistics being posted at fangraphs.com and baseball-reference.com that can help us understand BABIP a little better.
The earliest work by Dave Studeman said that a good way to approximate BABIP was to simply take line drive rate and add 12%. However, line drive rate itself only correlates year to year by about 20-30% so that is probably not the best way to figure it out. In fact, you can explain less of BABIP with line-drive rate than you can with previous BABIP. Consider the following table (again using the same sample of 224 hitters):
Correlation of BABIP with the previous year's BABIP and LD%:
BABIP year    Prev. BABIP    Prev. LD%
2008          .2823          .1590
2007          .3539          .0928
2006          .3053          .1847
So, actually that was a step back in projection but a step forward in understanding the roots of BABIP at the same time. Recent studies have actually tabulated BABIP on three general categories of hits—line drives, flyballs, and groundballs. Last year, these were as follows:
BABIP on hit type    Cumulative MLB    My data set
Line drives          .718              .713
Groundballs          .237              .230
Flyballs             .142              .140
However, this is not the same across all hitters. The correlation on BABIP for different hit types is pretty strong:
Hit type \ Correlation    ’05 & ’06    ’06 & ’07    ’07 & ’08
Line Drives’ BABIP        .1471        .0868        .1022
Groundballs’ BABIP        .1005        .3071        .1863
Flyballs’ BABIP           .3747        .2880        .1660
If you run a regression of BABIP in 2008 on BABIPs from 2005-2007, you get an R-squared statistic of .1417. If you run a regression of BABIP in 2008 on LD%, LD-BABIP, GB%, GB-BABIP, FB%, FB-BABIP from 2005-2007, you get an R-squared of .1719. Given the extra variables in the latter regression, that is not really all that different, but this can be broken down a little bit further.
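As a rough sketch of how a year-to-year correlation like the ones above is computed, here is a small Python example. The five BABIP pairs are made up for illustration and are not taken from the 224-hitter sample; `pearson_r` is a hypothetical helper name. For a regression on a single previous season, the R-squared is simply the square of this correlation; regressions on several seasons at once need a full least-squares fit, which is not shown here.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation between two equal-length lists of numbers."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length samples with at least 2 points")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Hypothetical BABIPs for the same five hitters in consecutive seasons
# (illustrative values only, not from the article's data set).
babip_prev = [0.336, 0.368, 0.290, 0.310, 0.275]
babip_next = [0.285, 0.301, 0.300, 0.315, 0.280]

r = pearson_r(babip_prev, babip_next)
# With one regressor, R-squared equals the squared correlation.
print(f"r = {r:.4f}, one-variable R-squared = {r * r:.4f}")
```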
I’ve run a few correlations and found strong correlations with each of these sub-statistics of BABIP.
Line Drives’ BABIP (LDBABIP):
LDBABIP is somewhat correlated itself year to year, but not by much. In fact, running a regression of LDBABIP in 2008 on itself in 2005-2007 only yields an R-squared of .034. However, LDBABIP is positively correlated with BB/PA, K/AB, HR/AB, Zone% (Percent of pitches in strike zone thrown to batter), F-Strike% (Percent of first pitches thrown to a hitter for strikes), Contact% (Percent of swings where contact is made). I ran a series of regressions for LDBABIP and it seems that most of what it comes down to is that hitters with higher LDBABIP hit the ball hard—and that is best captured with homeruns. In fact, consider the following table:
2008 LDBABIP with 2007: .1471 .2158 .2340
2007 LDBABIP with 2006: .0868 .2558 .2791
2006 LDBABIP with 2005: .1022 .1577 .2513
In fact, excluding previous LDBABIPs from the regression altogether didn’t change much for the predictability of LDBABIP so I did not even include it. I tried specifying HR/AB a few different ways. I used the natural log of HR/AB because the correlation was clearly higher. The following result was the best prediction of LDBABIP in regression form:
LN(HR/AB) 2005 .0108
LN(HR/AB) 2006 .0045
LN(HR/AB) 2007 .0080
This has an R-squared of .0824, meaning that the vast majority of the variance in LDBABIP is still unexplained. I do not have a good sense of why the 2006 coefficient is so low; it’s probably small sample size, so maybe gathering data from other years would have given a more precise estimate. However, I am guessing that most of this is luck. The only ways that I think hitters could really vary in their LDBABIP capabilities are hitting the ball further, hitting the ball harder, and hitting the ball in more similar locations (strong pull hitters, for example), all of which are probably correlated with homerun rate anyway. It seems like the 28% of line drives that are caught is mostly bad luck.
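To make the form of this regression concrete, here is a minimal Python sketch that applies the three LN(HR/AB) coefficients listed above. The intercept is not reported in the text, so the function takes it as an argument, and the 0.80 used in the example is only a placeholder, not a fitted value. The function name and the sample homerun rates are likewise hypothetical.

```python
from math import log

# Coefficients on LN(HR/AB) for each season, as listed in the text.
LN_HR_COEFS = {2005: 0.0108, 2006: 0.0045, 2007: 0.0080}


def predict_ldbabip(hr_per_ab, intercept):
    """Predicted LDBABIP from three seasons of HR/AB.

    hr_per_ab maps season -> HR/AB rate (must be positive, since the
    regression uses the natural log). The intercept is NOT given in the
    article and has to be supplied by the caller.
    """
    total = intercept
    for year, coef in LN_HR_COEFS.items():
        rate = hr_per_ab[year]
        if rate <= 0:
            raise ValueError(f"HR/AB for {year} must be positive to take a log")
        total += coef * log(rate)
    return total


# Illustrative hitter with made-up homerun rates; 0.80 is a placeholder
# intercept chosen only so the example runs.
example = predict_ldbabip({2005: 0.03, 2006: 0.04, 2007: 0.05}, intercept=0.80)
print(f"predicted LDBABIP (placeholder intercept): {example:.3f}")
```

Because the log is undefined at zero, a hitter with no homeruns in a season would need separate handling, which this sketch simply rejects.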
Flyballs’ BABIP (FBBABIP)
A hitter’s FBBABIP is correlated with the following statistics: low flyball percentage, low infield hit rates, lower bunt hit rates, low contact rates, and low infield flyball percentages. Most of those are correlated with each other anyway, so the best regression that I was able to get just used the average infield flyball percentage over 2005-2007. There were odd correlations between infield fly rates in 2005 and 2008 that are probably due to small sample size rather than to any real tendency to repeat infield fly rates at three-year intervals. So I used the average of infield fly rates to come up with a decent regression to approximate things:
Infield Fly 2005-2007 average -.4795
This has an R-squared of .1705. Regressing directly on FBBABIPs had an R-squared of .0868. While it seems like FBBABIP on outfield flyballs is clearly correlated year to year as well (.0845 for 2007-2008), it seems like that is mostly correlated with infield fly rate. What I’m guessing is happening is that those hitters who consistently hit deeper flyballs are harder to defend against, and so they have fewer infield flies and fewer flyballs caught in general.
Groundballs’ BABIP (GBBABIP)
Hitters with high GBBABIPs tend to have low BB%, low K%, high GB%, higher infield hit rates, higher bunt hit rates, and higher contact rates. It’s pretty clear that these are fast runners with high contact rates. Things like lower BB% and lower K% are just correlated with high contact rates and speed. It seems best to think of two kinds of hits on groundballs: balls that go between fielders and balls that runners beat out. The total GBBABIP seems to be around .240, but that seems to be split up into about 18% of groundballs finding their way to the outfield and 6% of groundballs being beaten out. Both of these tend to be things that players “control”. Consider the year-to-year correlations between hitters’ success rates of each type:
OUTFIELD HITS ON GROUNDBALLS:
INFIELD HITS ON GROUNDBALLS:
Clearly, getting infield hits is more of a consistent skill, but there does seem to be a tendency for certain hitters to hit the ball in the hole more. The threshold for an individual correlation to be significant at the 95% level is about .133, so getting two of three numbers above that point, with the other still positive, clearly shows that the tendency is there. I cannot say whether those are park effects, division or league defense effects, or a tendency to hit the ball into the ground harder than other people, but the correlation is there nonetheless. The other correlates of higher GBBABIP did not add much to the results, so it seems that GBBABIP can be approximated by regressing on previous GBBABIPs and previous infield hit rates. In my data, there were some bizarrely high correlations between 2008 and 2006 success that were not reflected in the other two-year interval that I had available (2005 & 2007), so my best guess is that this is small sample size bias. I therefore simply averaged the GBBABIPs and the infield hit rates on groundballs over 2005-2007 to approximate a decent result. I used age, since that significantly improved the regression.
Average GBBABIP 2005-2007 .3223
Average Infield Hit/Groundball 2005-07 .3157
Average Contact Rate 2005-07 .1527
I think that with more data, I could have come up with individual coefficients for each of these regressors, but doing so seemed to indicate weird things like a negative coefficient on 2007 infield hit rate, which did not make sense. (Also, I regressed 2008 GBBABIP on 2006 and 2007 infield hit rates only and then 2007 GBBABIP on 2005 and 2006 hit rates and the coefficients for the previous two years came out very different. There’s not enough data there, but clearly infield hit rate is informative and so I included the average infield hit rate over the previous three years.) Each of these coefficients was very significantly different from zero, and so I think this is a good way to approximate this regression. The R-squared was .1526.
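The 95% threshold of about .133 quoted above for a single correlation can be checked with a short calculation. The sketch below shows two common approximations for a sample of 224 hitters: the quick 2/sqrt(n) rule of thumb and a normal-approximation version of the usual t-test for a correlation. Which one was actually used for the figure in the text is not stated, so both are shown as assumptions; the function names are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def rule_of_thumb_threshold(n):
    """Quick approximation: |r| above 2/sqrt(n) is roughly significant at 95%."""
    return 2.0 / sqrt(n)


def normal_approx_threshold(n, confidence=0.95):
    """Critical |r| from r = z / sqrt(n - 2 + z^2).

    This uses the normal quantile z in place of the Student t quantile,
    which is a close approximation when n is in the hundreds.
    """
    z = NormalDist().inv_cdf(0.5 + confidence / 2.0)
    return z / sqrt(n - 2 + z * z)


n_hitters = 224  # sample size given in the article
print(f"2/sqrt(n) rule:       {rule_of_thumb_threshold(n_hitters):.4f}")
print(f"normal approximation: {normal_approx_threshold(n_hitters):.4f}")
```

Both versions land close to .13, so the conclusion that two of the three correlations clear the bar does not depend on which approximation is used.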
Groundball, Flyball, and Line drive percentages
Coming up with expected groundball, flyball, and line drive rates simply involved regressing on the three previous years. Including age did not seem to do anything (though there does seem to be a slight tendency to hit more groundballs as you get older, that was mostly captured by the previous three years of data). I got the following regression results:
GB% 2005 .1614
GB% 2006 .2316
GB% 2007 .4988
FB% 2005 .1469
FB% 2006 .2463
FB% 2007 .4760
Average LD% 2005-2007 .6179
Running the line drive percentage regression with respect to line drive percentages for each of the last three years seemed to spit out an insignificant (but still positive) coefficient for line drive% in 2007 and the R-squared was not much different than just using 2005-2007 averaged, so I did that. This way, the R-squared was .1490. Note that the R-squared for groundball percentage regression was .6586, and for flyball percentage regression was .6251. Clearly, there are much stronger correlations year to year for groundball and flyball percentage than for line drive percentage.
Run a simple regression for homeruns per flyball to figure out what percent of balls in play are actually of each hit type, and you can decompose BABIP in order to predict it.
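One way to sketch that decomposition in Python is shown below. It rests on assumptions that the text does not spell out: that every homerun is counted as a flyball, that FBBABIP is measured on flyballs that stay in the park, and that the three batted-ball rates cover all batted balls. The function name and the sample rates are hypothetical; only the three league BABIPs by hit type come from the table earlier in the post.

```python
def decompose_babip(ld_rate, gb_rate, fb_rate, hr_per_fb,
                    ld_babip, gb_babip, fb_babip):
    """Overall BABIP as a weighted average of BABIP on each hit type.

    Homeruns are not balls in play, so the flyball share is shrunk by
    (1 - HR/FB) before weighting. Assumes all homeruns are flyballs.
    """
    fb_in_play = fb_rate * (1.0 - hr_per_fb)
    balls_in_play = ld_rate + gb_rate + fb_in_play
    hits = ld_rate * ld_babip + gb_rate * gb_babip + fb_in_play * fb_babip
    return hits / balls_in_play


# League-wide BABIPs by hit type from the table above, combined with
# made-up batted-ball rates and HR/FB (illustrative only).
league_like = decompose_babip(
    ld_rate=0.20, gb_rate=0.44, fb_rate=0.36, hr_per_fb=0.10,
    ld_babip=0.718, gb_babip=0.237, fb_babip=0.142,
)
print(f"decomposed BABIP: {league_like:.3f}")
```

Plugging in a hitter's projected rates and projected BABIP on each hit type in place of the league figures gives a projected overall BABIP.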
All this makes you wonder what to expect from the Phillies. I’ll talk about what these numbers mean for a few Phillies and helping to figure out their BABIP.
Howard put up an insanely high BABIP of .363 in 2006, followed it with .336 in 2007, and then plummeted to .285 in 2008. Simply regressing his 2008 BABIP on his 2005-2007 BABIPs would give a result of .311. In fact, Dutton and Bendix’s results apparently put his xBABIP right there for 2008. But what should be expected for 2009? Well, his BABIP on line drives was .793, .750, and .722 in each of the last three years. We know from the above analysis that Howard should probably be above the league average of around .730 due to his high homerun rates, and the regression says his LDBABIP should rebound to around .743. His batting average on groundballs plummeted from .250 to .187 to .163 in 2006-2008. He almost never gets infield hits and makes poor contact, so the regression spits out an expected GBBABIP of around .188 for him. Given that this is inflated by the .250 GBBABIP in 2006, before teams regularly shifted their defense against him, I would guess that number might be a little lower. His FBBABIP went from .170 to .121 to .169 in the last three years. Given his very low infield fly rates, the regression predicts he will have a FBBABIP of around .175 next year. This leaves him around .317 for his total BABIP if he even gets a modest 25.3% HR/FB. This would be higher for a higher GBBABIP, but given that some of the 2006 BABIP numbers are inflated because they predate the shift, I would guess he might land somewhere between .310 and .320 for his BABIP next year.
Utley put up BABIPs in 2006-2008 of .346, .368, and .301 respectively. Interestingly, he put up line drive rates of 19.5, 19.6, and 24.3% respectively, so his line drive rate went up just as his BABIP plummeted back to average. His BABIP on each hit type was lower this year. His LDBABIP dropped from an impossible .860 in 2006, to .792 in 2007, down to .736 this year. His GBBABIP went from .280 to .294 to .222; his FBBABIP went from .139 to .175 to .128. His decent but unspectacular homerun rate indicates that his LDBABIP will probably stay down, but his GBBABIP should rebound, as he has a high contact rate and has put up better infield hit rates and hit more balls in the hole in the past. The regression predicts it will be .244 this year. He suddenly had a spike in infield fly rate this year, and even his outfield fly BABIP went down. That seems like it could be related to his hip injury, as he may not have been able to drive the ball as well in the latter half of the year, but the regression spits out a FBBABIP of .137. That would give him a BABIP overall of .307. I tend to think it will be a bit higher than that, since I think playing with an injury really hurt his BABIP and maybe that biases the results downwards compared to the average player.
He did not have many at-bats in 2006, so those numbers are somewhat biased. But it’s notable that his .652 LDBABIP is way lower than could be expected, and only .091 on flyballs and .183 on groundballs seem unlikely too. Adding things up, the model predicts a BABIP of around .292, which would significantly improve his average overall as he rarely strikes out or hits homeruns.
I am working on a few other hitters, but these are three of the most interesting ones right now. I’m going to use these with some logical adjustments to update my projections again soon.
All in all, I think that this is only the very beginning of significant improvements in projecting BABIP. What may seem like luck is actually real skill that is correlated with new statistics; things like the percent of pitches where contact was made were not available before, but they clearly improve projection. Other new statistics seem to be highly correlated with various important outcomes, but not in a way that provides more information. Just to get an idea of how these things matter, running a regression of BABIP in 2008 on the variables included in the individual hit types’ regressions above (the average line drive% over the past three years, the past three flyball percentages, the natural log of homerun rates, the average of infield hits per groundball, the average of GBBABIP, the average of infield fly rate, homeruns per flyball, and age) yields an R-squared of .2961. In other words, the predicted BABIP using this model will have a correlation of .54 with the actual BABIP. Simply regressing on the previous three BABIPs yields an R-squared of just .1464; the predicted BABIP using that model only has a correlation of .38 with actual BABIP. Even taking into account the extra variables used, this is a significant improvement and a helpful step for BABIP projection. Undoubtedly, using some of the proprietary data used by other researchers will only help.
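The step from R-squared to correlation, and the caveat about extra variables, can be checked with a short sketch. For a least-squares regression with an intercept, the correlation between predicted and actual values is the square root of R-squared. The adjusted R-squared below is one standard way to penalize extra regressors; it is not something the text itself reports. The count of 12 regressors is my reading of the variable list above (treating homeruns per flyball and age as one variable each) and should be taken as an assumption.

```python
from math import sqrt


def multiple_correlation(r_squared):
    """Correlation between fitted and actual values in an OLS fit with intercept."""
    return sqrt(r_squared)


def adjusted_r_squared(r_squared, n_obs, n_regressors):
    """R-squared penalized for the number of regressors used."""
    return 1.0 - (1.0 - r_squared) * (n_obs - 1) / (n_obs - n_regressors - 1)


N_HITTERS = 224  # sample size given in the article

# (label, reported R-squared, number of regressors); the 12 is an assumed count.
models = [
    ("hit-type model", 0.2961, 12),
    ("previous three BABIPs", 0.1464, 3),
]

for label, r2, k in models:
    corr = multiple_correlation(r2)
    adj = adjusted_r_squared(r2, N_HITTERS, k)
    print(f"{label}: correlation {corr:.2f}, adjusted R-squared {adj:.4f}")
```

Under these assumptions the larger model keeps a clear edge after the adjustment, which is consistent with the conclusion that the improvement is not just an artifact of adding variables.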