On the main page of The Good Phight, I published an article on BABIP estimation. To save space, I left some of my regression analysis out of that article, so I am including the rest of my results here for thoroughness.

Consider the following regression for 2007 BABIP using 2006 numbers, with the exact same variables as the 2008 BABIP estimation using 2007 numbers.

| Source | SS | df | MS |
|---|---|---|---|
| Model | .025327181 | 9 | .002814131 |
| Residual | .153375881 | 152 | .001009052 |
| Total | .178703062 | 161 | .001109957 |

Number of obs = 162, F(9, 152) = 2.79, Prob > F = 0.0047, R-squared = 0.1417, Adj R-squared = 0.0909, Root MSE = .03177

| babip07 | Coef. | Std. Err. | t | P>\|t\| | 95% CI lower | 95% CI upper |
|---|---|---|---|---|---|---|
| ldp06 | .0217125 | .1211751 | 0.18 | 0.858 | -.2176925 | .2611175 |
| gbp06 | .0740955 | .0583577 | 1.27 | 0.206 | -.0412015 | .1893925 |
| lnhra06 | .009393 | .0049076 | 1.91 | 0.058 | -.0003028 | .0190888 |
| iffbp06 | -.0885388 | .0699087 | -1.27 | 0.207 | -.2266571 | .0495794 |
| offbbabip06 | .1467601 | .0668164 | 2.20 | 0.030 | .0147513 | .2787689 |
| lncontact06 | .0343613 | .043438 | 0.79 | 0.430 | -.0514588 | .1201815 |
| spray06 | -.0881482 | .0441983 | -1.99 | 0.048 | -.1754705 | -.0008259 |
| shb | -.0023005 | .0073786 | -0.31 | 0.756 | -.0168785 | .0122774 |
| lhb | -.0015946 | .0059234 | -0.27 | 0.788 | -.0132974 | .0101082 |
| _cons | .3042865 | .0402564 | 7.56 | 0.000 | .2247521 | .3838208 |
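For readers less used to Stata output, here is a minimal Python sketch of how the summary figures in the 2006-to-2007 regression follow from its sums of squares. The numbers are copied from the output above; the variable names are my own and are not part of the Stata run.

```python
# Sums of squares and degrees of freedom copied from the 2006 -> 2007 output.
model_ss, model_df = 0.025327181, 9
resid_ss, resid_df = 0.153375881, 152

total_ss = model_ss + resid_ss   # total variation in babip07
total_df = model_df + resid_df   # n - 1 = 161

# Share of the variation in babip07 explained by the model.
r_squared = model_ss / total_ss

# Adjusted R-squared penalizes the nine predictors.
adj_r_squared = 1 - (resid_ss / resid_df) / (total_ss / total_df)

# F statistic: explained variance per predictor over residual variance.
f_stat = (model_ss / model_df) / (resid_ss / resid_df)

# Root MSE: typical size of a prediction miss, in BABIP points.
root_mse = (resid_ss / resid_df) ** 0.5

print(f"R-squared     = {r_squared:.4f}")
print(f"Adj R-squared = {adj_r_squared:.4f}")
print(f"F             = {f_stat:.2f}")
print(f"Root MSE      = {root_mse:.5f}")
```

The same arithmetic applies to the other regressions in this post if you swap in their sums of squares and degrees of freedom.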

Interestingly, this regression had an R-squared of only .1417, meaning that the predicted values had only a .38 correlation with the actual values, much lower than expected given the .53 figure for 2007/2008. I think 2006/2007 was an unusual pair of years, since the R-squared for 2005/2006 (below) is even better than that of the 2007/2008 regression, with predicted values correlating with actual 2006 BABIP at .54. Fangraphs.com only has data on variables like contact rate, swing rate, and percentage of pitches in the strike zone for 2005-2008, and I wonder whether the new availability of that and other data across 2005-2006 actually affected infield positioning and added a lot of noise to the results. However, this is merely speculation on my part.
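The correlations quoted here are just the square roots of the R-squared values, since in a least-squares regression with a constant term the correlation between predicted and actual values equals the square root of R-squared. A small sketch of that arithmetic, with a helper name (`fitted_correlation`) that is my own:

```python
import math


def fitted_correlation(r_squared):
    """Correlation between fitted and actual values in an OLS regression
    that includes a constant term."""
    return math.sqrt(r_squared)


# R-squared values copied from the two single-year regressions in this post.
r_squared_by_years = {"2005/2006": 0.2946, "2006/2007": 0.1417}

for years, r2 in r_squared_by_years.items():
    print(f"{years}: R-squared {r2:.4f} -> correlation {fitted_correlation(r2):.2f}")
```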

Regardless, the formula fitted on 2006 numbers to approximate 2007 BABIP did much better when used to predict 2008 BABIP: its predictions correlated with the actual 2008 values at .47. That is not quite the .53 from the regression fitted on the same years, but it is a pretty decent result. For thoroughness, the results of the regression for 2006 BABIP using 2005 variables are below. It also yielded

| Source | SS | df | MS |
|---|---|---|---|
| Model | .042400469 | 9 | .004711163 |
| Residual | .101534074 | 151 | .000672411 |
| Total | .143934543 | 160 | .000899591 |

Number of obs = 161, F(9, 151) = 7.01, Prob > F = 0.0000, R-squared = 0.2946, Adj R-squared = 0.2525, Root MSE = .02593

| babip06 | Coef. | Std. Err. | t | P>\|t\| | 95% CI lower | 95% CI upper |
|---|---|---|---|---|---|---|
| ldp05 | .4117508 | .0927912 | 4.44 | 0.000 | .228414 | .5950876 |
| gbp05 | .2063473 | .0423228 | 4.88 | 0.000 | .122726 | .2899686 |
| lnhra05 | .0045501 | .0037699 | 1.21 | 0.229 | -.0028985 | .0119986 |
| iffbp05 | -.0719867 | .0600355 | -1.20 | 0.232 | -.1906047 | .0466313 |
| offbbabip05 | .0254006 | .05237 | 0.49 | 0.628 | -.078072 | .1288733 |
| lncontact05 | -.0688435 | .0371655 | -1.85 | 0.066 | -.1422751 | .0045881 |
| spray05 | -.0291223 | .034492 | -0.84 | 0.400 | -.0972716 | .039027 |
| shb | -.0046124 | .0060702 | -0.76 | 0.449 | -.0166058 | .007381 |
| lhb | -.001334 | .004768 | -0.28 | 0.780 | -.0107545 | .0080865 |
| _cons | .1409896 | .0348412 | 4.05 | 0.000 | .0721503 | .2098289 |

Fewer variables were significant in this regression, because line drive rate was extremely predictive from 2005 to 2006.
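To make "significant" concrete: the t column is simply the coefficient divided by its standard error, and a variable clears the usual 5% bar when the absolute value of t exceeds roughly 1.98 for a regression with 151 residual degrees of freedom. That cutoff is my approximation from standard t tables, not something Stata printed. A rough sketch using four rows copied from the 2005-to-2006 output:

```python
# (coefficient, standard error) pairs copied from the 2005 -> 2006 output.
rows = {
    "ldp05": (0.4117508, 0.0927912),
    "gbp05": (0.2063473, 0.0423228),
    "lnhra05": (0.0045501, 0.0037699),
    "lncontact05": (-0.0688435, 0.0371655),
}

# Approximate two-sided 5% critical value for 151 degrees of freedom
# (an assumption taken from t tables, not from the Stata output).
T_CRIT = 1.98


def t_stat(coef, std_err):
    """t statistic for the null hypothesis that the coefficient is zero."""
    return coef / std_err


# Variables whose |t| clears the approximate 5% cutoff.
significant = [name for name, (b, se) in rows.items() if abs(t_stat(b, se)) > T_CRIT]

for name, (b, se) in rows.items():
    print(f"{name:12s} t = {t_stat(b, se):6.2f}")
print("significant at about the 5% level:", significant)
```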

*MULTIPLE YEARS REGRESSION*

I decided to mix and match to try to get a more serious way to approximate BABIP from only the previous year of data, even though a higher correlation can obviously be found if three years of data are available. So I ran another regression on all hitters who had two consecutive years with over 300 PA in 2005-2006, 2006-2007, or 2007-2008, adjusting the variables slightly. Eric Seidman of Fangraphs.com (and also of BaseballProspectus.com and Statspeak.net) was kind enough to provide me with additional data to run this regression. Once I am able to get BABIP for line drives, groundballs, and flyballs for all of the hitters with more than 300 plate appearances in consecutive years during the time frame he gave me (2001-2008), I will be able to run a more sophisticated regression. Given that I know the correlates from my previous model, though, I was able to do a pretty good job approximating the results using 2005-2008.

(I attempted to use dummy variables for years, handedness, and teams in this model, but the results changed so little that I did not feel the need to include them in this output. They were probably already accounted for in historical data on GBBABIP, FBBABIP, etc.

I *will* note, however, that unlike the other 29 dummy variables for baseball teams, which simply used their Baseball-Reference three-letter codes, my Stata program included a special variable name for my beloved hometown team. In homage to the great Chase Utley, the variable **WFC** was equal to 1 if the player played for the Phillies during the year in question. *Dummy* variable indeed.)

Anyway, this regression uses the following variables:

--LDP1: Line drive% in the first of the two consecutive years for this player

--GBP1: Groundball% in the first of the two consecutive years for this player

--LNHRFB1: Natural Log(HR/Flyball) in the first of the two consecutive years for this player

--IFREACHPIFGB1: The percentage of time that a hitter reached base on a groundball that stayed in the infield* in the first of the two consecutive years for this player

--IFFBP1: Percentage of flyballs that stayed in the infield in the first of the two consecutive years for this player

--FBBABIP1: Flyballs’ BABIP in the first of the two consecutive years for this player

--OPPOP1: Percentage of balls hit to the opposite field in the first of the two consecutive years for this player

--RPOB1: Percentage of times a runner scored after reaching base without hitting a home run** in the first of the two consecutive years for this player

--XBHP1: Percentage of at-bats for extra base hits in the first of the two consecutive years for this player

--Age1: Hitter’s age on July 1 in the first of the two consecutive years for this player

*This included data on how often hitters reached on errors or infield hits, which I correctly predicted would be more reliable than infield hit rates alone. After all, infield hits and reaching on errors are frequently judgment calls, and predicting a hitter’s infield hit rate for the next year can best be done by knowing how many times the hitter reached safely on an infield groundball in either case. This eliminates scorer error, and gets a better idea of speed. It is computed as (IFH/GB + ROE/GB) / (1 – GBBABIP + IFH/GB).

**This is part of Bill James’ speed score, but I did not have this data. However, of the components of speed score, this seems to be the most accurate. It is probably highly correlated with batting earlier in the lineup (when one is more likely to be knocked in), and hence can capture some of the unobservable ability to hit for a high BABIP that a manager knows about. This is the same logic by which Nate Silver uses a prospect’s draft position to predict their PECOTA variables. Obviously, drafting a player higher does not cause him to be better, but given that real people are making the decision to draft him higher, there are some intangible traits that are not directly observable to the statistician. Here, I am assuming that the manager batted these hitters earlier in the lineup the year before, on average, because he believed they would have a good batting average and, presumably, a good BABIP.
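The infield-reach formula in the first footnote can be sketched in Python as follows. The function name and the raw-count inputs (infield hits, times reached on error, groundballs, and hits on groundballs) are my own assumptions about how the data might be laid out; only the formula itself comes from the footnote. My reading of the denominator is that 1 – GBBABIP + IFH/GB approximates the share of groundballs that stayed in the infield, treating every groundball that left the infield as a hit.

```python
def infield_reach_rate(ifh, roe, gb, gb_hits):
    """(IFH/GB + ROE/GB) / (1 - GBBABIP + IFH/GB), from the footnote.

    ifh     -- infield hits on groundballs (hypothetical input)
    roe     -- times reached on error on groundballs (hypothetical input)
    gb      -- total groundballs
    gb_hits -- hits of any kind on groundballs, so GBBABIP = gb_hits / gb
    """
    if gb <= 0:
        raise ValueError("need at least one groundball")
    ifh_rate = ifh / gb
    roe_rate = roe / gb
    gbbabip = gb_hits / gb
    # Share of groundballs assumed to have stayed in the infield.
    infield_share = 1 - gbbabip + ifh_rate
    return (ifh_rate + roe_rate) / infield_share


# Made-up hitter: 200 groundballs, 50 hits on them (20 of those infield hits),
# plus 5 times reaching on an error.
print(round(infield_reach_rate(ifh=20, roe=5, gb=200, gb_hits=50), 3))
```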

Here are the results of this regression:

| Source | SS | df | MS |
|---|---|---|---|
| Model | .116082132 | 10 | .011608213 |
| Residual | .346605774 | 461 | .000751856 |
| Total | .462687906 | 471 | .000982352 |

Number of obs = 472, F(10, 461) = 15.44, Prob > F = 0.0000, R-squared = 0.2509, Adj R-squared = 0.2346, Root MSE = .02742

| babip2 | Coef. | Std. Err. | t | P>\|t\| | 95% CI lower | 95% CI upper |
|---|---|---|---|---|---|---|
| ldp1 | .1582288 | .0532707 | 2.97 | 0.003 | .0535453 | .2629123 |
| gbp1 | .1177305 | .0279886 | 4.21 | 0.000 | .0627293 | .1727316 |
| lnhrfb1 | .0058997 | .0032932 | 1.79 | 0.074 | -.0005718 | .0123712 |
| ifreachpifgb1 | .0651476 | .0420148 | 1.55 | 0.122 | -.0174166 | .1477119 |
| iffbp1 | -.0998108 | .0358487 | -2.78 | 0.006 | -.170258 | -.0293636 |
| fbbabip1 | .0781451 | .0377313 | 2.07 | 0.039 | .0039985 | .1522918 |
| oppop1 | .1339932 | .0405953 | 3.30 | 0.001 | .0542185 | .213768 |
| rpob1 | .0758064 | .0260499 | 2.91 | 0.004 | .0246151 | .1269976 |
| xbhp1 | .1485864 | .1009396 | 1.47 | 0.142 | -.0497724 | .3469451 |
| age1 | -.0005428 | .0003448 | -1.57 | 0.116 | -.0012205 | .0001348 |
| _cons | .1835684 | .0287621 | 6.38 | 0.000 | .1270474 | .2400895 |

As the R-squared is .25, predicted BABIP’s correlation with real BABIP is about .50. Not all of these variables are significant, but they are all reasonably close to significant, and given that they are significant in other regressions, I think it is more accurate to include them in the regression. The lower significance is likely a result of using three different years with different characteristics, league tendencies, etc.
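For anyone who wants to plug a hitter into this formula, here is a minimal Python sketch. The coefficients are copied from the multiple-years output above. The sample hitter is entirely made up, and I am assuming that the rate variables enter as fractions (a 20% line drive rate is 0.20), which is not spelled out in the output but gives predictions in a believable BABIP range.

```python
import math

# Coefficients copied from the multiple-years regression output.
INTERCEPT = 0.1835684
COEFS = {
    "ldp1": 0.1582288,
    "gbp1": 0.1177305,
    "lnhrfb1": 0.0058997,
    "ifreachpifgb1": 0.0651476,
    "iffbp1": -0.0998108,
    "fbbabip1": 0.0781451,
    "oppop1": 0.1339932,
    "rpob1": 0.0758064,
    "xbhp1": 0.1485864,
    "age1": -0.0005428,
}


def predict_babip(player):
    """Predicted next-year BABIP from one year of inputs.

    `player` holds the prior-year values keyed like COEFS, except that it
    carries the raw HR/flyball rate as "hrfb1"; the natural log is taken here.
    """
    features = dict(player)
    features["lnhrfb1"] = math.log(features.pop("hrfb1"))
    return INTERCEPT + sum(COEFS[name] * features[name] for name in COEFS)


# A made-up hitter for illustration only; rates are fractions (an assumption).
example_player = {
    "ldp1": 0.20, "gbp1": 0.45, "hrfb1": 0.10, "ifreachpifgb1": 0.12,
    "iffbp1": 0.10, "fbbabip1": 0.14, "oppop1": 0.25, "rpob1": 0.30,
    "xbhp1": 0.08, "age1": 28,
}

print(f"predicted BABIP: {predict_babip(example_player):.3f}")
```

Given the Root MSE of about .027, any single prediction from this formula should be read as a rough central estimate rather than a precise forecast.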
