
Improving BABIP Estimation

Few sabermetric findings have caused more debate than Voros McCracken’s declaration that pitchers have no control over Batting Average on Balls in Play (BABIP).  Recent studies have focused on BABIP for hitters, who undoubtedly do control it to some degree.  As I pointed out in my previous article on this topic, walks, strikeouts, and homeruns have far stronger year-to-year correlations than BABIP for hitters.  This indicates that hitters vary far less in their ability to control batting average on balls in play than they do in their abilities to make contact, draw walks, and hit homeruns.


In this article, I will describe a new regression formula that predicts a hitter’s BABIP from three years of historical statistics and has a .63 correlation with the hitter’s actual BABIP, along with a version that uses only one year of statistics and still achieves a .53 correlation.




Dave Studeman launched much of the research on BABIP for hitters when he introduced the formula of (.120 + LD%) to retroactively predict what a hitter’s BABIP would have been if luck had not been a factor.  Since line drives have relatively low year-to-year correlation, this formula is not meant to be predictive of future BABIP, and it is important to note the difference.  Noting that a hitter had a high line drive rate but was unlucky when it came to converting those line drives into hits is not the same as saying that he will be lucky in subsequent years, since a very high line drive rate is not particularly repeatable.  Back in 2005, Studeman also created an estimator using the percentages of line drives and flyballs (and thereby indirectly the percentage of groundballs), as well as strikeouts.  He found that hitters who struck out more did better on balls in play.  This is because keeping one’s job on a major league baseball team requires doing well enough on balls in play to make up for how often one strikes out.


In December, Peter Bendix and Chris Dutton published an important article in this area of research at The Hardball Times discussing a regression formula for BABIP using many other statistics.  This article was meant to approximate BABIP in the same year as the data was recorded.  While useful as an attempt to describe who was lucky and who was not, there is added value to being able to predict future BABIP from historical variables.  This type of research can be used by general managers and fantasy players alike.


In an attempt to actually get at the root of success on balls in play, I wrote an article a few weeks ago discussing the ability to predict line drive percentage, groundball percentage, and flyball percentage, as well as the respective BABIP on each of these types of hits.  I found the following correlates of each hit type’s BABIP:


Groundballs’ BABIP (GBBABIP):

--GBBABIP (positive)

--Infield hit rate (a more repeatable skill within GBBABIP, positive)

--Contact rate (defined as the percent of a hitter’s swings on which he makes contact, positive)


Flyballs’ BABIP (FBBABIP):

--Infield fly rate (negative)


Line drives’ BABIP (LDBABIP):

--Ln(HR/AB) (positive)


I believe that the correlation between contact rate and GBBABIP is probably because hitters who frequently miss the ball entirely when swinging also frequently come close to missing it, just barely chopping the ball weakly into the ground and recording lots of outs on groundballs.  This was significant, even though there is a clear selection bias the other way: hitters who make less contact would tend to keep their jobs only if they had higher BABIP.


LDBABIP’s correlation with log homerun rate was due to the fact that line drives are hits more frequently when they are hit further and harder, and this ability is likely correlated with the ability to hit homeruns.  Due to the low year-to-year correlation for LDBABIP, it was difficult to establish much autocorrelation of LDBABIP-- the homerun rates were stable enough that they were able to predict it more successfully.


In fact, the same two variables that were explained in my article a few weeks ago (infield flies and homeruns) were two of the three primary revisions to Dutton’s new model when he adjusted it to also be predictive of future performance for a piece last week by Derek Carty at The Hardball Times.  In Carty’s article, he found that Dutton’s predicted "xBABIP" values, after the adjustment of the Bendix-Dutton model introduced back in December, had a correlation of .50 with real BABIP.  This slightly beat a reverse-engineered approximation of the BABIP predicted by Tom Tango’s Marcel, which had a correlation of .46.  Of course, this is not a particularly huge difference, and in reality Dutton’s regression was likely fit using some observations of the 2008 realizations of BABIP, which let it predict them more accurately.  Tango needed to figure out how 2004-2006 would predict 2007 BABIP and then assume that 2005-2007 would predict 2008 in the same way.  Regression is a bit easier when you know the outcome of the dependent variable and simply need to maximize the correlation with a known variable.  At the same time, Dutton’s model only used one year of data to approximate this and Tango used three years.


That does not mean that Dutton’s model is necessarily more useful than Tango’s Marcel.  It depends on what you are going for; keep in mind that Tango actually predicted batting averages.  If his method underestimates hits on balls in play for the same hitters for which it overestimates homeruns per at-bat, it will properly estimate batting average but will not be as successful at predicting success on balls in play.  In this article, I’ll establish a regression that uses three years of data, as well as regressions that use one year of data.




Initially, I worked on a way to predict BABIP using one year of data.  The variables I used included:

--Line Drive%

--Groundball%

--Natural Log of HR/AB

--Infield fly%

--Outfield flyball BABIP

--Natural Log of Contact rate (as defined above: the percent of swings on which the hitter makes contact)

--Spray (as defined by Bendix and Dutton)

--Dummy variables for switch hitters and left-handed hitters


I should note that I used the log of HR/AB to correct for a problem of heteroskedasticity.  Simply using HR/AB (or HR/FB, which yielded similar results) biases the regression, since the differences between expected and actual BABIP were larger for those hitters who had high HR/AB rates.  Econometrics has several useful resolutions to this problem, one of which is simply using the logarithm of the independent variable in question as a regressor instead of the variable itself.  From evaluating line drives’ BABIP in my last article on this topic, I know that it is correlated more strongly with the log of HR/AB than with HR/AB itself, which explains the logic in using this term.  This highlights the importance of working in detail with a dataset before fitting a larger regression model such as the ones in this article.


I should also note that I will be adjusting the BABIP formula to include sacrifice flies as outs, using: (H-HR)/(AB-SO-HR+SF).  I do this since sacrifice flies reflect a hitter’s tendency to fly out, and thus, in an attempt to measure skill, they should be treated as such.  I will not include sacrifice bunts, since bunting is clearly more intentional than hitting a sacrifice fly.
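As a minimal sketch of the adjusted formula above, here it is in Python.  The function name and the sample stat line are illustrative assumptions, not data from this article.

```python
def babip_with_sf(h, hr, ab, so, sf):
    """BABIP with sacrifice flies counted as outs: (H - HR) / (AB - SO - HR + SF)."""
    balls_in_play = ab - so - hr + sf
    if balls_in_play <= 0:
        raise ValueError("no balls in play")
    return (h - hr) / balls_in_play


# Hypothetical stat line, for illustration only.
example = babip_with_sf(h=150, hr=20, ab=550, so=100, sf=5)
print(round(example, 3))
```

Homeruns come out of both the numerator and the denominator, strikeouts come out of the denominator only, and sacrifice flies are added back to the denominator as outs on balls in play.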


Each of these variables was significant at the 95% level except for handedness, contact rate, and spray (which was significant at the 90% level).  Since contact rate was statistically significant in so many other regressions, I think that this is just noise and it probably is significant, so I left it in the regression.  Handedness was sometimes significant as well, so I left that in too.  There are other variables that will come up in superior regressions below, but this was the best one at predicting 2008 using 2007 statistics only.  The R-squared was .28, meaning that the correlation between the predicted BABIP values using this linear regression and the actual BABIP was the square root of that: about .53.

Here is the summary of the regression output.  Note that I only used hitters who had over 300 PA in 2007 and 2008, following the convention of Derek Carty in his article.


Source   |         SS    df          MS        Number of obs =     149
---------+-------------------------------      F(9, 139)     =    6.08
Model    | .037681847     9  .004186872        Prob > F      =  0.0000
Residual | .095787195   139  .000689117        R-squared     =  0.2823
---------+-------------------------------      Adj R-squared =  0.2359
Total    | .133469042   148  .000901818        Root MSE      =  .02625

babip08     |      Coef.   Std. Err.       t    P>|t|    [95% Conf. Interval]
------------+-----------------------------------------------------------------
ldp07       |   .2239294    .1082472    2.07    0.040     .0099054   .4379534
gbp07       |    .129778     .052851    2.46    0.015     .0252821   .2342739
lnhra07     |   .0105141      .00442    2.38    0.019     .0017749   .0192533
iffbp07     |  -.1279507    .0638752   -2.00    0.047    -.2542433  -.0016581
offbbabip07 |   .1305535    .0583186    2.24    0.027     .0152472   .2458598
lncontact07 |   .0533636     .039344    1.36    0.177    -.0244264   .1311536
spray07     |   -.055403    .0327964   -1.69    0.093    -.1202474   .0094414
shb         |   .0082968    .0065372    1.27    0.206    -.0046283    .021222
lhb         |  -.0024456    .0052222   -0.47    0.640    -.0127709   .0078797
_cons       |   .2490696    .0364046    6.84    0.000     .1770912   .3210481
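For readers who want to check how the summary figures in the one-year output above fit together, here is a small Python sketch.  It only recomputes the reported statistics from the sums of squares and degrees of freedom printed in that output; the variable names are my own.

```python
import math

# Sums of squares and degrees of freedom copied from the one-year output above.
ss_model, df_model = 0.037681847, 9
ss_resid, df_resid = 0.095787195, 139
ss_total = ss_model + ss_resid
df_total = df_model + df_resid

r_squared = ss_model / ss_total
adj_r_squared = 1 - (ss_resid / df_resid) / (ss_total / df_total)
f_stat = (ss_model / df_model) / (ss_resid / df_resid)

# For least squares with a constant term, the correlation between the
# fitted values and the actual values is the square root of R-squared.
correlation = math.sqrt(r_squared)

print(f"R-squared     = {r_squared:.4f}")
print(f"Adj R-squared = {adj_r_squared:.4f}")
print(f"F             = {f_stat:.2f}")
print(f"Correlation   = {correlation:.2f}")
```

The gap between R-squared and adjusted R-squared is the penalty for using nine regressors on 149 hitters, which is one reason to be cautious about adding more variables to a sample this small.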





That’s certainly a starting point.  After all, the Dutton model presented in Carty’s article had a correlation of .50, so reaching .53 is certainly an improvement but not necessarily a significant difference.  As I pointed out earlier, this does not mean that plugging these coefficients into a linear formula using the 2008 values of these variables will give predictions for 2009 that have a .53 correlation with 2009 BABIP.  The coefficients do vary from year to year.


I also ran a regression for 2007 BABIP using 2006 variables.  The details are summarized in this accompanying fanpost; I leave them out here in the interest of space.  Using the 2006/2007 regression’s coefficients to predict 2008 BABIP from 2007 variables gave a correlation of .47 with actual 2008 BABIP, slightly lower than the regression using 2007/2008 numbers.  This was partially due to the peculiar lack of correlation between 2007 variables and those in other years.  I have some suspicions about this, which are discussed in the accompanying fanpost.  The 2006 variables themselves gave a predicted 2007 BABIP that correlated with true BABIP by only .37.  The 2005 variables, however, predicted 2006 BABIP with a correlation of .54.




Next I generated an average across 2005-2007 of several of these variables, plus some other ones that were significant in previous work, and I looked at hitters who had over 300 PA in each season from 2005 to 2008.  Obviously, this is not a very large group (only 121 hitters), but I was able to get some very significant results.  I would have liked to use variables for each year specifically, but you cannot really use 121 observations in a regression with too many variables and get a reliable result, so I limited myself to the averages.  As time goes on, this could be refined far more.  However, this result blows the other ones out of the water, achieving an R-squared of .40, and hence a .63 correlation between the predicted values and the actual values of BABIP in 2008.  The variables in this regression were the averages for 2005-2007 of:


--Groundball rate (line drive rate was insignificant in this regression as the other statistics proved to be more reliable)

--Natural Log of HR/AB

--Groundball BABIP

--Infield fly rate

--Outfield flyball BABIP

--Natural Log of Contact rate (again, the percent of swings on which the hitter makes contact, and not 1-K-rate)


These six variables were enough to get this result.  Here is the output, which is my best suggestion for predicting hitters’ 2009 BABIP.  Simply compute the three-year averages of each of these six statistics, multiply them by their respective coefficients, and add the constant, and you will have an excellent estimate of 2009 BABIP for hitters.


Source   |         SS    df          MS        Number of obs =     121
---------+-------------------------------      F(6, 114)     =   12.58
Model    |  .04538298     6   .00756383        Prob > F      =  0.0000
Residual | .068549109   114  .000601308        R-squared     =  0.3983
---------+-------------------------------      Adj R-squared =  0.3667
Total    | .113932089   120  .000949434        Root MSE      =  .02452

babip08      |      Coef.   Std. Err.       t    P>|t|    [95% Conf. Interval]
-------------+-----------------------------------------------------------------
gbpavg       |   .1314046    .0532558    2.47    0.015     .0259052    .236904
lnhraavg     |    .013903    .0050879    2.73    0.007      .003824    .023982
gbbabipavg   |    .222142    .0893362    2.49    0.014     .0451677   .3991163
iffbpavg     |  -.3540322    .0721422   -4.91    0.000    -.4969452  -.2111191
offbbabipavg |   .1961453    .0875846    2.24    0.027     .0226409   .3696497
lncontact~g  |   .1525666    .0456725    3.34    0.001     .0620897   .2430435
_cons        |   .2718173    .0356918    7.62    0.000     .2011122   .3425224
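To make the recipe concrete, here is a rough Python sketch that applies the three-year coefficients from the output above.  Several things in it are assumptions on my part rather than something the output states: that every rate is entered as a fraction (0.45, not 45), that "lncontact~g" is the truncated name of the three-year average log contact rate, and that the logs are taken of the three-year average rates.  The function name and the sample hitter are hypothetical.

```python
import math

# Coefficients copied from the three-year regression output above.
COEFFICIENTS = {
    "gbpavg": 0.1314046,        # groundball rate
    "lnhraavg": 0.013903,       # natural log of HR/AB
    "gbbabipavg": 0.222142,     # groundball BABIP
    "iffbpavg": -0.3540322,     # infield fly rate
    "offbbabipavg": 0.1961453,  # outfield flyball BABIP
    "lncontactavg": 0.1525666,  # natural log of contact rate (assumed full name)
}
CONSTANT = 0.2718173


def predict_babip(gb_rate, hr_per_ab, gb_babip, iffb_rate, offb_babip, contact_rate):
    """Linear prediction from three-year averages, all entered as fractions (assumed)."""
    if hr_per_ab <= 0 or contact_rate <= 0:
        raise ValueError("log terms need positive HR/AB and contact rate")
    values = {
        "gbpavg": gb_rate,
        "lnhraavg": math.log(hr_per_ab),
        "gbbabipavg": gb_babip,
        "iffbpavg": iffb_rate,
        "offbbabipavg": offb_babip,
        "lncontactavg": math.log(contact_rate),
    }
    return CONSTANT + sum(COEFFICIENTS[name] * values[name] for name in COEFFICIENTS)


# Hypothetical hitter, for illustration only (not a real player's numbers).
hypothetical = predict_babip(
    gb_rate=0.45, hr_per_ab=0.04, gb_babip=0.24,
    iffb_rate=0.10, offb_babip=0.17, contact_rate=0.80,
)
print(round(hypothetical, 3))
```

A hitter who has never homered in the three-year window has no defined log of HR/AB, so the sketch refuses that input rather than guessing at a substitute.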




I also developed a regression model for data for each of these years together to get a larger sample.  I developed new statistics in that one, including the percentage of groundballs that stay in the infield for which one reaches base either by an infield hit or an error.  As it turns out with more data, reaching on errors is very highly correlated with infield hit rate in general as the difference between an infield hit and an error is frequently very small and up to the discretion of the official scorer.




For the WFC, here are my guesses for 2009 BABIP using the three-year formula (which obviously only works for hitters with more than 300 PA in each of 2006-2008), with the batting average implied by my approximations of their K and HR rates in parentheses:


Ryan Howard: .292 (.252)

Chase Utley: .305 (.283)

Jimmy Rollins: .304 (.287)

Pedro Feliz: .272 (.249)

Raul Ibanez: .300 (.274)

Shane Victorino: .333 (.307)


Approximating this for Carlos Ruiz and Jayson Werth using the one-year model, we get .333 and .348, respectively, which imply that they will hit .297 and .302 this year.  I doubt that Ruiz puts up a .333 BABIP, but his high contact rate (over 90% this year) pushes his estimate up.  As always, use common sense when doing this kind of analysis.
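The conversion from a BABIP estimate to an implied batting average is not spelled out here, but one way to do it is to invert the sacrifice-fly-adjusted BABIP formula given earlier.  The Python sketch below does that; the function name and the sample rates are hypothetical and are not the K and HR rates used for the players above.

```python
def implied_average(babip, so_per_ab, hr_per_ab, sf_per_ab=0.0):
    """Invert BABIP = (H - HR) / (AB - SO - HR + SF) to recover H / AB.

    All rates are per at-bat, so H/AB = BABIP * (1 - SO/AB - HR/AB + SF/AB) + HR/AB.
    """
    in_play_share = 1.0 - so_per_ab - hr_per_ab + sf_per_ab
    return babip * in_play_share + hr_per_ab


# Hypothetical rates, for illustration only.
print(round(implied_average(babip=0.300, so_per_ab=0.20, hr_per_ab=0.04, sf_per_ab=0.01), 3))
```

Two hitters with the same BABIP can have quite different implied averages, because strikeouts shrink the share of at-bats that BABIP applies to while homeruns are hits that BABIP never sees.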



The best way to approximate BABIP is probably the 3-year average method, but some of the one-year regressions are useful too.  A correlation of .63 between the predicted BABIPs and the actual BABIPs is better than the .45-.55 range that any of my one-year regressions, Dutton’s new regression, or Tom Tango’s Marcel will get you.  However, we need to figure out exactly how far along we are in predicting BABIP.  For one thing, no regression will ever get you an R-squared of 1.  No matter how much data you have, luck plays a role in BABIP.  To determine how much further we can go, I conducted the following analysis.


First, the standard deviation of estimated BABIP using my three-year average formula is .019 for those hitters with over 300 plate appearances in 2005-2008.  The actual standard deviation of BABIP was .031 for hitters with over 300 PA.  Since the hitters who had over 300 PA in 2008 averaged 372 balls in play, the standard deviation that luck alone would produce if we knew their true skill level, Sqrt(BABIP(1-BABIP)/Number(Balls in Play)), would be about .024.  Variances add, so given the .031 actual standard deviation, the true skill level probably varies by about Sqrt(.031^2 - .024^2), or about .020.  Given that my formula’s estimates had about that standard deviation, we are probably very close in terms of figuring out a hitter’s BABIP.  I’m sure that there are significant tweaks that could be made as I get more data, but I think that you can get a very, very good approximation of a hitter’s true skill level using this data, and the true model would probably have an R-squared somewhere around .40-.45 due to the limited sample size in one year of data.
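The arithmetic in this paragraph can be sketched in a few lines of Python.  The .031 and 372 come from the text; the league BABIP of .300 is my assumption for illustration, since the value plugged into the binomial formula is not stated, and the variable names are my own.

```python
import math

observed_sd = 0.031    # standard deviation of actual BABIP (from the text)
balls_in_play = 372    # average balls in play per hitter (from the text)
league_babip = 0.300   # assumption: a typical BABIP, not given in the article

# Standard deviation from luck alone, treating each ball in play as a binomial trial.
luck_sd = math.sqrt(league_babip * (1 - league_babip) / balls_in_play)

# Observed variance = skill variance + luck variance, so subtract to isolate skill.
skill_sd = math.sqrt(observed_sd**2 - luck_sd**2)

# Share of observed variance that even a perfect skill estimate could explain.
max_r_squared = skill_sd**2 / observed_sd**2

print(f"luck SD  = {luck_sd:.3f}")
print(f"skill SD = {skill_sd:.3f}")
print(f"ceiling on R-squared = {max_r_squared:.2f}")
```

Under these assumptions the ceiling lands in the .40-.45 range mentioned above, which is why an R-squared near .40 leaves little room for further improvement with one year of outcomes.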


If I were to regress BABIP in 2008 on the previous three years of BABIP, the standard deviation of the predicted outcomes is about .015.  Certainly, we have moved much closer to the .020 standard deviation in true BABIP skill that probably does exist.


Certainly, there could be improvements to this formula.  With more data, we could differentiate between the effects of statistics from one, two, and three years ago, rather than using a simple average of a three-year period.  However, given that a regression of BABIP in 2008 on BABIPs in 2005-2007 yields quite similar coefficients for each year, I suspect that most of the skill that goes into BABIP is best approximated using more data.  It seems like walks, homeruns, and strikeouts are the variables where the most recent information is far more valuable.  BABIP itself seems best computed at this point using something like this three-year average formula, but I welcome suggestions and encourage further research using this methodology.


I would like to especially thank Eric Seidman, of,, and, for supplying me with additional data to develop my model.  The rest of my data came from and  Thank you to ‘Pizza Cutter’, Peter Bendix, and Chris Dutton for helpful comments on my previous article.