Statistics in Baseball: September 2009

I've sometimes heard announcers say that while a pitcher loves watching his team rally, he gets tight sitting on the bench for so long, and his performance in the next half inning may suffer. This mildly relates to my previous post about a long 7th inning stretch not affecting the pitcher. But a long rally could mean 15, 20, even 25 minutes on the bench for a pitcher, so maybe that's enough time for his performance in the next half inning to suffer?

My idea to answer this question was to fit a multinomial logistic regression with the levels of the response being the necessary counts to estimate FIP: HR, BB+HBP, K, other outs; and the predictors being team at bat, pitcher, stadium, and pitches in the previous inning. The data from Retrosheet doesn't have the time of each half inning, but I figure pitches thrown is going to be highly correlated with actual time, and will suffice. Once the parameters in the multinomial logistic regression are estimated, one could easily estimate FIP=(13HR+3(BB+HBP)-2K)/IP for different values of pitches in the previous inning (for the average pitcher and average team at bat) and use the multivariate delta method to find the variance of each FIP estimate.
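The FIP-plus-delta-method step can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the fit described above: the per-PA probabilities and the PA count are made-up placeholders, the covariance matrix is the plain multinomial one for sample proportions (a stand-in for the covariance a fitted regression would supply), and the function names are hypothetical. It uses the fact that innings per PA equal (K + other outs)/3, so FIP can be written in terms of per-PA probabilities.

```python
from math import sqrt

def fip_core(p):
    """FIP (without any league constant) from per-PA probabilities.

    p = [p_hr, p_bb, p_k, p_out], where p_bb covers BB+HBP and p_out is
    outs other than strikeouts. Innings per PA = (p_k + p_out) / 3.
    """
    p_hr, p_bb, p_k, p_out = p
    return 3.0 * (13.0 * p_hr + 3.0 * p_bb - 2.0 * p_k) / (p_k + p_out)

def fip_gradient(p):
    """Analytic partial derivatives of fip_core with respect to each probability."""
    p_hr, p_bb, p_k, p_out = p
    num = 13.0 * p_hr + 3.0 * p_bb - 2.0 * p_k
    den = p_k + p_out
    return [
        39.0 / den,                          # d/d p_hr
        9.0 / den,                           # d/d p_bb
        (-6.0 * den - 3.0 * num) / den**2,   # d/d p_k (in numerator and denominator)
        -3.0 * num / den**2,                 # d/d p_out
    ]

def multinomial_cov(p, n):
    """Covariance of sample proportions from n multinomial trials (assumed stand-in)."""
    k = len(p)
    return [[((p[i] if i == j else 0.0) - p[i] * p[j]) / n for j in range(k)]
            for i in range(k)]

def delta_variance(grad, cov):
    """First-order delta method: grad' * cov * grad."""
    k = len(grad)
    return sum(grad[i] * cov[i][j] * grad[j] for i in range(k) for j in range(k))

# Hypothetical inputs, chosen only to make the sketch runnable.
p_hat = [0.03, 0.09, 0.18, 0.50]   # HR, BB+HBP, K, other outs (per PA)
n_pa = 5000                        # number of plate appearances

grad = fip_gradient(p_hat)
cov = multinomial_cov(p_hat, n_pa)
var_full = delta_variance(grad, cov)
# Same calculation with the covariances between outcomes thrown away.
var_diag = sum(grad[i] ** 2 * cov[i][i] for i in range(len(grad)))

print("FIP estimate:", fip_core(p_hat))
print("std. error with covariances:", sqrt(var_full))
print("std. error ignoring covariances:", sqrt(var_diag))
```

In the regression setting, the probabilities would be the fitted values at a chosen number of previous-inning pitches, and the covariance matrix would come from the fitted model, with the chain rule carrying it from the coefficient scale to the probability scale.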

I didn't want to use linear regression with FIP as the response because, computed inning by inning, FIP takes only a handful of values and would basically be categorical. Moreover, sometimes IP=0 (the pitcher records no outs), so FIP is undefined and I couldn't model it directly anyway; modeling some transformation of it instead would limit the interpretability of my results for those who want to see FIP itself.

I think the multinomial regression is necessary rather than a few binomial regressions because to estimate the variance of the FIP estimates, we need to take into account the covariance in the parameter estimates (any glm fit in R will spit out a correlation matrix of parameter estimates upon request). If using only binomial regressions, we can't estimate the correlation between the coefficient for pitches in the previous inning in the HR/PA model and the corresponding coefficient in the BB/PA model, for example.
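To see where the covariances enter, the first-order delta method approximation can be written out. The symbols are notation introduced for this illustration only: g is the FIP expression viewed as a function of the stacked coefficient estimates, and Σ is their covariance matrix.

```latex
\operatorname{Var}\!\big(g(\hat\theta)\big)
  \approx \nabla g^{\top} \Sigma \, \nabla g
  = \sum_{i} \left(\frac{\partial g}{\partial \theta_i}\right)^{2} \operatorname{Var}(\hat\theta_i)
  + 2 \sum_{i<j} \frac{\partial g}{\partial \theta_i}\,
      \frac{\partial g}{\partial \theta_j}\,
      \operatorname{Cov}(\hat\theta_i, \hat\theta_j)
```

Separate binomial fits supply the first sum and the covariances within each model, but not the cross-model covariances that the second sum also needs.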

Unfortunately, the multinomial regression functions available in R don't seem up to the task. (Let me know if I'm missing something.) vglm in the VGAM package wants the data in binary form, which makes the already large data set (with one row for each inning) even larger, since it requires a separate row for each PA. R then runs out of memory when finding the root - at least when I'm using all the PA from 2007-2008. The other function I found, multinom in the nnet library, doesn't seem to compute the covariance matrix.

But to answer today's question, it turns out one doesn't really need to look at the FIP. After some painful character string manipulation to parse the data, I fit three separate binomial logistic regressions - for HR, BB, K - and in none of them is the coefficient for previous pitches even close to significant, nor are any of the estimates practically different from zero. Below are the coefficients and the corresponding p-values:

Outcome	Coefficient	p-value
HR	-0.0026	0.14
BB	0.0012	0.26
K	-0.0003	0.70
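One way to gauge the practical size of these coefficients is to convert them to odds ratios. The sketch below assumes each coefficient is on the log-odds scale per additional pitch in the previous inning (the usual reading of a logistic regression coefficient when the predictor is not rescaled), and the 20-pitch difference is an arbitrary illustrative choice.

```python
from math import exp

# Coefficients for pitches in the previous inning, taken from the table above.
coef_per_pitch = {"HR": -0.0026, "BB": 0.0012, "K": -0.0003}

def odds_ratio(coef, extra_pitches):
    """Multiplicative change in the odds for `extra_pitches` more pitches."""
    return exp(coef * extra_pitches)

extra = 20  # hypothetical difference between a short and a long half inning
for outcome, coef in coef_per_pitch.items():
    print(f"{outcome}: odds ratio for +{extra} pitches = {odds_ratio(coef, extra):.3f}")
```

An odds ratio of 1 means no change, so values close to 1 correspond to coefficients close to zero.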

So it looks like the length of the previous inning has no detectable effect on the pitcher, and he should be wholeheartedly cheering for his team to score runs - even if he's mostly interested in his own sabermetric stats.

Next time, I'll use similar code to analyze the relationship between the score of the game and HR, BB, K allowed. It looks like there are some significant differences there.

The information used here was obtained free of charge from and is copyrighted by Retrosheet. Interested parties may contact Retrosheet at "www.retrosheet.org".

Wednesday, September 2, 2009

Does one long half inning lead to another?