Improving BABIP Estimation: Additional Regression Results
On the main page of The Good Phight, I published an article on BABIP estimation. For the sake of space, I did not include all of my regression analysis, but I thought it was best to include the rest of my results for the sake of thoroughness.
Consider the following regression for 2007 BABIP using 2006 numbers, with the exact same variables as the 2008 BABIP estimation using 2007 numbers.
       Source |          SS    df          MS     Number of obs =     162
_____________________________________________     F( 9, 152)    =    2.79
        Model |  .025327181     9  .002814131     Prob > F      =  0.0047
     Residual |  .153375881   152  .001009052     R-squared     =  0.1417
_____________________________________________     Adj R-squared =  0.0909
        Total |  .178703062   161  .001109957     Root MSE      =  .03177
________________________________________________________________________________
      babip07 |      Coef.  Std. Err.        t    P>|t|     [95% Conf. Interval]
________________________________________________________________________________
        ldp06 |   .0217125   .1211751     0.18    0.858    -.2176925    .2611175
        gbp06 |   .0740955   .0583577     1.27    0.206    -.0412015    .1893925
      lnhra06 |    .009393   .0049076     1.91    0.058    -.0003028    .0190888
      iffbp06 |  -.0885388   .0699087    -1.27    0.207    -.2266571    .0495794
  offbbabip06 |   .1467601   .0668164     2.20    0.030     .0147513    .2787689
  lncontact06 |   .0343613    .043438     0.79    0.430    -.0514588    .1201815
      spray06 |  -.0881482   .0441983    -1.99    0.048    -.1754705   -.0008259
          shb |  -.0023005   .0073786    -0.31    0.756    -.0168785    .0122774
          lhb |  -.0015946   .0059234    -0.27    0.788    -.0132974    .0101082
        _cons |   .3042865   .0402564     7.56    0.000     .2247521    .3838208
________________________________________________________________________________
Interestingly, this regression has an R-squared of only .1417, meaning that the predicted values have just a .38 correlation with the actual values, much lower than I expected given the .53 figure for 2007/2008. I think that this was an unusual year, since the R-squared for 2005/2006 (below) is even better than that of the 2007/2008 regression, with predictions correlating with actual 2006 BABIP at .54. Fangraphs.com only has data on variables like contact rate, swing rate, and percentage of pitches in the strike zone for 2005-2008, and I wonder if the new availability of that and other data across 2005-2006 actually affected infield positioning and perhaps added a lot of noise to the results. However, this is merely speculation on my part.
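The step from R-squared to correlation can be checked by hand: for an ordinary least squares regression with an intercept, the correlation between the fitted and actual values is the square root of R-squared. Here is a minimal sketch using the two R-squared values reported in this post (the function name is just an illustrative choice):

```python
import math

# R-squared values copied from the Stata outputs in this post.
r_squared = {
    "2006 -> 2007": 0.1417,
    "2005 -> 2006": 0.2946,
}


def fitted_correlation(r2):
    """Correlation between in-sample fitted and actual values.

    Holds for OLS with an intercept, where R-squared equals the
    squared correlation between predictions and outcomes.
    """
    return math.sqrt(r2)


for label, r2 in r_squared.items():
    corr = fitted_correlation(r2)
    print(f"{label}: R-squared = {r2:.4f}, correlation = {corr:.2f}")
```

Note that this identity only applies in-sample. A correlation obtained by applying one year's formula to a different year's data has to be computed directly from the predictions.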
Regardless, when I applied the regression formula fit on 2006 numbers (to approximate 2007 BABIP) to predict 2008 BABIP, it did much better: the predicted values correlated with actual 2008 BABIP at .47. That is not quite the .53 from the regression fit on the same year, but it is a pretty decent result. For thoroughness, the results of the regression for 2006 BABIP using 2005 variables are below. It also yielded
       Source |          SS    df          MS     Number of obs =     161
_____________________________________________     F( 9, 151)    =    7.01
        Model |  .042400469     9  .004711163     Prob > F      =  0.0000
     Residual |  .101534074   151  .000672411     R-squared     =  0.2946
_____________________________________________     Adj R-squared =  0.2525
        Total |  .143934543   160  .000899591     Root MSE      =  .02593
________________________________________________________________________________
      babip06 |      Coef.  Std. Err.        t    P>|t|     [95% Conf. Interval]
________________________________________________________________________________
        ldp05 |   .4117508   .0927912     4.44    0.000      .228414    .5950876
        gbp05 |   .2063473   .0423228     4.88    0.000      .122726    .2899686
      lnhra05 |   .0045501   .0037699     1.21    0.229    -.0028985    .0119986
      iffbp05 |  -.0719867   .0600355    -1.20    0.232    -.1906047    .0466313
  offbbabip05 |   .0254006     .05237     0.49    0.628     -.078072    .1288733
  lncontact05 |  -.0688435   .0371655    -1.85    0.066    -.1422751    .0045881
      spray05 |  -.0291223    .034492    -0.84    0.400    -.0972716     .039027
          shb |  -.0046124   .0060702    -0.76    0.449    -.0166058     .007381
          lhb |   -.001334    .004768    -0.28    0.780    -.0107545    .0080865
        _cons |   .1409896   .0348412     4.05    0.000     .0721503    .2098289
________________________________________________________________________________
Fewer variables were significant in this regression because line drive rate from 2005 to 2006 was extremely predictive.
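For readers who want to see where the t column comes from, each t statistic is simply the coefficient divided by its standard error. The sketch below recomputes them from the 2005/2006 output. The 1.96 cutoff is an assumption on my part (the large-sample normal approximation to the 5% two-sided threshold), not something Stata printed.

```python
# (coefficient, standard error) pairs copied from the 2005 -> 2006 output.
estimates = {
    "ldp05": (0.4117508, 0.0927912),
    "gbp05": (0.2063473, 0.0423228),
    "lnhra05": (0.0045501, 0.0037699),
    "iffbp05": (-0.0719867, 0.0600355),
    "offbbabip05": (0.0254006, 0.05237),
    "lncontact05": (-0.0688435, 0.0371655),
    "spray05": (-0.0291223, 0.034492),
    "shb": (-0.0046124, 0.0060702),
    "lhb": (-0.001334, 0.004768),
}

# Assumption: normal approximation to the 5% two-sided critical value.
CRITICAL_T = 1.96


def t_statistic(coef, std_err):
    """t statistic for the null hypothesis that the coefficient is zero."""
    return coef / std_err


t_stats = {name: t_statistic(c, se) for name, (c, se) in estimates.items()}
significant = sorted(n for n, t in t_stats.items() if abs(t) > CRITICAL_T)

for name, t in t_stats.items():
    flag = "significant" if name in significant else "not significant"
    print(f"{name:>12}: t = {t:6.2f} ({flag})")
```

Under that cutoff only line drive rate and groundball rate clear the bar, which matches the P>|t| column above.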
MULTIPLE YEARS REGRESSION
I decided to mix and match to find a more robust way to approximate BABIP from only the previous year of data, even though a higher correlation can obviously be found if three years of data are available. So I ran another regression on all hitters who had two consecutive years with over 300 PA in 2005-2006, 2006-2007, or 2007-2008, adjusting the variables slightly. Eric Seidman of Fangraphs.com (and also of BaseballProspectus.com and Statspeak.net) was kind enough to provide me with additional data to run this regression. Once I am able to get BABIP for line drives, groundballs, and flyballs for all of the hitters with more than 300 plate appearances in consecutive years during the time frame he gave me (2001-2008), I will be able to run a more sophisticated regression. Given that I know the correlates from my previous model, though, I was able to do a pretty good job approximating the results using 2005-2008.
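To make the sample construction concrete, here is a rough sketch of how the pooled player-year pairs could be assembled. The rows are made-up placeholders, not real players, and the function and variable names are hypothetical. The only things taken from this post are the 300 PA cutoff and the three year pairs.

```python
MIN_PA = 300  # cutoff from the post: "over 300 PA" in each year

# Hypothetical rows of (player, season, plate appearances). Not real data.
seasons = [
    ("Player A", 2005, 610),
    ("Player A", 2006, 580),
    ("Player A", 2007, 250),
    ("Player B", 2006, 450),
    ("Player B", 2007, 505),
    ("Player B", 2008, 320),
    ("Player C", 2005, 400),
    ("Player C", 2007, 420),
]


def consecutive_pairs(rows, min_pa=MIN_PA, first_years=(2005, 2006, 2007)):
    """Return (player, year1, year2) for back-to-back qualifying seasons."""
    qualified = {(player, year) for player, year, pa in rows if pa > min_pa}
    pairs = [
        (player, year, year + 1)
        for (player, year) in qualified
        if year in first_years and (player, year + 1) in qualified
    ]
    return sorted(pairs)


pairs = consecutive_pairs(seasons)
for player, year1, year2 in pairs:
    print(f"{player}: predictors from {year1}, BABIP from {year2}")
```

A hitter who qualifies in three straight seasons contributes two rows, one for each pair of consecutive years.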
(I attempted to use dummy variables for years, handedness, and teams in this model, but the results changed so little that I did not feel the need to include them in this output. They were probably already accounted for in historical data on GBBABIP, FBBABIP, etc.
I will note, however, that unlike the other 29 dummy variables for baseball teams which simply used their Baseball-Reference three letter code, my Stata program included a special variable name for my beloved hometown team—an homage to the great Chase Utley, the variable WFC was equal to 1 if the player played for the Phillies during the year in question. Dummy variable indeed.)
Anyway, this regression uses the following variables:
--LDP1: Line drive% in the first of the two consecutive years for this player
--GBP1: Groundball% in the first of the two consecutive years for this player
--LNHRFB1: Natural Log(HR/Flyball) in the first of the two consecutive years for this player
--IFREACHPIFGB1: The percentage of time that a hitter reached base on a groundball that stayed in the infield* in the first of the two consecutive years for this player
--IFFBP1: Percentage of flyballs that stayed in the infield in the first of the two consecutive years for this player
--FBBABIP1: Flyballs’ BABIP in the first of the two consecutive years for this player
--OPPOP1: Percentage of balls hit to the opposite field in the first of the two consecutive years for this player
--RPOB1: Percent of times a runner scored when reaching base but not hitting a homerun** in the first of the two consecutive years for this player
--XBHP1: Percentage of at-bats for extra base hits in the first of the two consecutive years for this player
--Age1: Hitter’s age on July 1 in the first of the two consecutive years for this player
*This included data on how often hitters reached on errors or infield hits, which I correctly predicted would be more reliable than infield hit rates alone. After all, infield hits and reaching on errors are frequently judgment calls, and predicting a hitter’s infield hit rate for the next year can best be done by knowing how many times the hitter reached safely on an infield groundball in either case. This eliminates scorer error, and gets a better idea of speed. It is computed as (IFH/GB + ROE/GB) / (1 – GBBABIP + IFH/GB).
**This is part of Bill James’ speed score, but I did not have this data. However, of the components of speed score, this seems to be the most accurate. It is probably highly correlated with batting earlier in the lineup (when one is more likely to be knocked in), and hence can capture some of the unobservable ability to hit for a high BABIP that a manager knows. This is the same type of logic as why Nate Silver uses a prospect’s draft position to predict their PECOTA variables. Obviously, drafting a player higher does not cause him to be better, but given that real people are making the decision to draft him higher, there are some intangible traits that are not directly observable to the statistician. Here, I am assuming that the manager is hitting the hitters earlier in the lineup the year beforehand on average because he believes they will have a good batting average, and presumably, a good BABIP.
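The formula in the first footnote, (IFH/GB + ROE/GB) / (1 – GBBABIP + IFH/GB), can be sketched as a small function. The argument names and the sample counts below are hypothetical and only there to show the arithmetic. They are not taken from any real hitter.

```python
def infield_reach_rate(ifh, roe, gb, gbbabip):
    """Compute (IFH/GB + ROE/GB) / (1 - GBBABIP + IFH/GB).

    ifh:     infield hits
    roe:     times reached on error
    gb:      total groundballs
    gbbabip: BABIP on groundballs, as a fraction (e.g. 0.240)
    """
    if gb <= 0:
        raise ValueError("gb must be positive")
    ifh_rate = ifh / gb
    roe_rate = roe / gb
    return (ifh_rate + roe_rate) / (1 - gbbabip + ifh_rate)


# Hypothetical season: 200 groundballs, 14 infield hits, 4 times reached
# on error, and a .240 BABIP on groundballs.
example = infield_reach_rate(ifh=14, roe=4, gb=200, gbbabip=0.240)
print(f"Infield reach rate: {example:.3f}")
```

Because infield hits and errors are added together in the numerator, a scorer's choice between the two does not change the result much, which is the point of the footnote.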
Here are the results of this regression:
________________________________________________________________________________
       Source |          SS    df          MS     Number of obs =     472
_____________________________________________     F( 10, 461)   =   15.44
        Model |  .116082132    10  .011608213     Prob > F      =  0.0000
     Residual |  .346605774   461  .000751856     R-squared     =  0.2509
_____________________________________________     Adj R-squared =  0.2346
        Total |  .462687906   471  .000982352     Root MSE      =  .02742
________________________________________________________________________________
       babip2 |      Coef.  Std. Err.        t    P>|t|     [95% Conf. Interval]
________________________________________________________________________________
         ldp1 |   .1582288   .0532707     2.97    0.003     .0535453    .2629123
         gbp1 |   .1177305   .0279886     4.21    0.000     .0627293    .1727316
      lnhrfb1 |   .0058997   .0032932     1.79    0.074    -.0005718    .0123712
ifreachpifgb1 |   .0651476   .0420148     1.55    0.122    -.0174166    .1477119
       iffbp1 |  -.0998108   .0358487    -2.78    0.006     -.170258   -.0293636
     fbbabip1 |   .0781451   .0377313     2.07    0.039     .0039985    .1522918
       oppop1 |   .1339932   .0405953     3.30    0.001     .0542185     .213768
        rpob1 |   .0758064   .0260499     2.91    0.004     .0246151    .1269976
        xbhp1 |   .1485864   .1009396     1.47    0.142    -.0497724    .3469451
         age1 |  -.0005428   .0003448    -1.57    0.116    -.0012205    .0001348
        _cons |   .1835684   .0287621     6.38    0.000     .1270474    .2400895
________________________________________________________________________________
As the R-squared is .25, predicted BABIP’s correlation with real BABIP is about .50. Not all of these variables are significant, but they are all reasonably close to significant, and given that they are significant in other regressions, I think it is more accurate to include them in the regression. The lower significance is likely a result of using three different years with different characteristics, league tendencies, etc.
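To turn the output into a prediction, multiply each coefficient by the hitter's first-year value and add the constant. The sketch below does that with the coefficients from the multiple-years regression. The hitter's inputs are invented for illustration, and I am assuming the rate variables are entered as fractions (for example, .20 for a 20% line drive rate) rather than on a 0-100 scale.

```python
import math

# Coefficients copied from the multiple-years regression output.
INTERCEPT = 0.1835684
COEFFICIENTS = {
    "ldp1": 0.1582288,
    "gbp1": 0.1177305,
    "lnhrfb1": 0.0058997,
    "ifreachpifgb1": 0.0651476,
    "iffbp1": -0.0998108,
    "fbbabip1": 0.0781451,
    "oppop1": 0.1339932,
    "rpob1": 0.0758064,
    "xbhp1": 0.1485864,
    "age1": -0.0005428,
}


def predict_babip(features):
    """Linear prediction: intercept plus sum of coefficient * value."""
    missing = set(COEFFICIENTS) - set(features)
    if missing:
        raise KeyError(f"missing predictors: {sorted(missing)}")
    return INTERCEPT + sum(
        coef * features[name] for name, coef in COEFFICIENTS.items()
    )


# Hypothetical hitter (not a real player). Rates are assumed to be fractions.
hitter = {
    "ldp1": 0.20,
    "gbp1": 0.44,
    "lnhrfb1": math.log(0.10),  # natural log of a 10% HR/flyball rate
    "ifreachpifgb1": 0.10,
    "iffbp1": 0.10,
    "fbbabip1": 0.14,
    "oppop1": 0.25,
    "rpob1": 0.30,
    "xbhp1": 0.09,
    "age1": 28,
}

predicted = predict_babip(hitter)
print(f"Predicted next-year BABIP: {predicted:.3f}")
```

If the original data were stored on a different scale, the inputs would need to be converted before the coefficients are applied.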