Other things you can do with EBP

Elsewhere we’ve discussed using EBP to study the relationship between OBP and run production, and for estimating runs.  Here are a few other things you can do with it.

Predicting runs for college teams, little league teams, minor league teams, etc.

Elsewhere I’ve compared EBP to many of the better-known run estimators, both by comparing correlations to actual run production when EBP is used as a run estimator, and by comparing how they vary with OBP when a p-dependence is introduced to the run estimators. The creators of the run estimators discussed on those pages (with the possible exception of the simplest and oldest of these, Runs Created) all made use of actual major league gameplay data in their development. No such data was referenced in the development of EBPt (and only average rates of taking an extra base were referenced in the development of EBPf). As such, the formulas can be applied without modification to baseball played in any league, including little leagues. How well they predict in those contexts will probably vary considerably depending on rates of errors, wild pitches, stolen bases, etc. In major league baseball, EBP’s omission of outs on the basepaths and base advances does not appear to have a great effect on its overall accuracy. This may not be the case in other contexts, and it would be interesting to find out how it fares in other leagues. The possibility that it fares worse there is one reason I’d like to study how base advances and outs on the basepaths might be added to the model.

How changing the number of outs per inning would affect run production

Ever wonder how scoring in baseball would change if we had 4 outs per inning instead of 3? No? Well, if you had, Expected Binomial Production can provide an answer for you. All you have to do is replace the component formulas for 3-out innings:

$EBP(L,p) = \dfrac{p^{L+1}}{1-p} [3 + 2L(1-p) + \tfrac{1}{2}L(L+1)(1-p)^2]$

… with the component formulas for 4-out innings:

$EBP(L,p) = \dfrac{p^{L+1}}{1-p} [4 + 3L(1-p) + L(L+1)(1-p)^2 + \tfrac{1}{6}L(L+1)(L+2)(1-p)^3]$

You would then linearly combine the four L=0,1,2,3 formulas using exactly the same coefficients as before, to get the expected runs per 4-out inning of a real team. Divide that by the team’s EBP in 3-out innings to get the factor by which run production would go up. Examination will show that The Homers will have their production go up by a factor of 4/3, an increase of 1/3. Any real team will have it go up by even more: inside the brackets, the first term increases by 1/3; the second term (the one that’s linear in 1-p) goes up by 50%; the third term (quadratic in 1-p) goes up by 100%; and you get an additional fourth term. The largest values of L get the biggest increases.
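
As a rough illustration of that comparison, here is a minimal Python sketch that evaluates the 3-out and 4-out component formulas exactly as written above and prints their ratio for each L. The function names `ebp3` and `ebp4` and the value p = 0.320 are illustrative choices, not part of the original development of EBP.

```python
def ebp3(L, p):
    """Component formula for 3-out innings, as written above."""
    q = 1.0 - p
    return p ** (L + 1) / q * (3 + 2 * L * q + 0.5 * L * (L + 1) * q ** 2)


def ebp4(L, p):
    """Component formula for 4-out innings, as written above."""
    q = 1.0 - p
    return p ** (L + 1) / q * (
        4
        + 3 * L * q
        + L * (L + 1) * q ** 2
        + L * (L + 1) * (L + 2) * q ** 3 / 6
    )


p = 0.320  # illustrative OBP only, not taken from any real team

# Ratio of 4-out to 3-out production for each pure-L component.
ratios = [ebp4(L, p) / ebp3(L, p) for L in range(4)]
for L, ratio in enumerate(ratios):
    print(f"L={L}: 4-out / 3-out = {ratio:.4f}")
```

At this value of p the ratios come out to roughly 1.33, 1.51, 1.70 and 1.91 for L = 0 through 3. Note that the factor for a real team is the ratio of the two linear combinations, not a combination of these per-L ratios.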

If instead of the ratio of these two numbers you are seeking the difference between them, their formulas line up nicely. You can just take the differences of the four L-component formulas, and then linearly combine those differences. Subtracting the 3-out versions from the 4-out versions, the differences of the component formulas are:

$Difference = \dfrac{p^{L+1}}{1-p} [1 + L(1-p) + \tfrac{1}{2}L(L+1)(1-p)^2 + \tfrac{1}{6}L(L+1)(L+2)(1-p)^3]$

And interestingly, this can be rewritten as

$Difference = \dfrac{p^{L+1}}{1-p} [1 + L(1-p)(1 + \tfrac{1}{2}(L+1)(1-p)(1 + \tfrac{1}{3}(L+2)(1-p)))]$
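
As a quick check of that rewriting, take $L = 3$ and write $q = 1-p$ (the symbol $q$ is shorthand introduced only for this check). The expanded bracket becomes

$1 + 3q + \tfrac{1}{2}(3)(4)q^2 + \tfrac{1}{6}(3)(4)(5)q^3 = 1 + 3q + 6q^2 + 10q^3$

and the nested bracket multiplies out to the same polynomial:

$1 + 3q(1 + 2q(1 + \tfrac{5}{3}q)) = 1 + 3q + 6q^2 + 10q^3$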

In general, if OPI = the number of outs per inning, EBP says that on average, the runs per inning is expected to be

$EBP(L,p) = \dfrac{p^{L+1}}{1-p} \displaystyle\sum_{i=0}^{OPI-1} [(OPI-i) \binom{L-1+i}{L-1} (1-p)^i]$

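For L=0, read the binomial coefficient as 1 when i=0 and as 0 otherwise, so that the sum reduces to OPI and the formula agrees with the L=0 cases of the 3-out and 4-out formulas above. You would then linearly combine the four L=0,1,2,3 formulas using exactly the same coefficients as before, to model real teams.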
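
The general formula can be sketched in a few lines of Python. This is only a sketch under stated assumptions: the names `multiset_coeff`, `ebp` and `team_ebp` are invented here, and the weights are placeholders standing in for the real team coefficients, which are developed elsewhere and not reproduced on this page. The binomial coefficient is computed as the product L(L+1)...(L+i-1) divided by i!, which equals the binomial in the formula for L ≥ 1 and yields the required values at L=0.

```python
from math import factorial


def multiset_coeff(L, i):
    """binom(L-1+i, L-1), computed as L(L+1)...(L+i-1) / i!.

    The product form returns 1 when i == 0 and 0 when L == 0 and i > 0,
    which is what the L = 0 closed forms require.
    """
    numerator = 1
    for j in range(i):
        numerator *= L + j
    return numerator // factorial(i)


def ebp(L, p, opi=3):
    """Component EBP(L, p) for an inning with `opi` outs."""
    q = 1.0 - p
    total = sum((opi - i) * multiset_coeff(L, i) * q ** i for i in range(opi))
    return p ** (L + 1) / q * total


def team_ebp(weights, p, opi=3):
    """Linear combination of the L = 0, 1, 2, 3 components."""
    return sum(w * ebp(L, p, opi) for L, w in enumerate(weights))


# HYPOTHETICAL weights and OBP, for illustration only.
weights = [0.10, 0.20, 0.30, 0.40]
p = 0.320

for opi in (2, 3, 4, 5):
    print(f"{opi} outs per inning: {team_ebp(weights, p, opi):.4f} runs")
```

With `opi=3` and `opi=4` the function `ebp` reproduces the two closed-form component formulas given earlier, so the same code covers any number of outs per inning.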

Strategy and lineup construction

Obviously, you could also use it for the purpose for which it was designed, which is to see how run production changes in especially low or high OBP situations, such as when a very low OBP-against pitcher faces a team with a collectively low OBP, or when the lowest-OBP portion of a team’s lineup is coming up. This could help inform a team whether it makes sense to use sacrifices and other techniques for manufacturing a single run, or to aim for a multiple-run inning.

You could also use it in trying to decide whether it is better to add a particular power-oriented hitter versus a particular on-base type of hitter to a given team’s lineup. One team may see more improvement by adding the power hitter, whereas another team may benefit more from adding the on-base hitter.

Predicting percentages of innings with a particular number of runs scored

Deciding which of two possible offenses is “better” isn’t necessarily just a matter of figuring out which lineup ought to produce more runs per inning. A lineup that more consistently scores something may win more games than a lineup that scores the same amount overall, but has more frequent “big innings” as well as more frequent innings in which it does not score. My theory is that the former is what you’ll get with a high-power team, and the latter is what you’ll get with a high-OBP team. Add to that the notion that big innings are likely to produce a lot of extraneous runs, that is to say, runs beyond what is necessary to win the game, and we arrive at the conclusion that a high-power team might be expected to win more games than a high-OBP team that scores the same average number of runs per inning.

It’s a hard idea to test using game data, because there are many other factors involved in arriving at a win. However, there is a variation we can make to the way the EBP formulas are derived that may be able to help decide this. It produces the likely fraction of innings in which a given team will score 0 runs, 1 run, 2 runs, etc. For example, here is what it predicts for 2016 MLB teams, alongside the actual fractions from Baseball-Reference.com.

Fractions of innings with particular run totals as predicted by EBP – 2016
Team 0 R 1 R 2 R 3 R 4 R 5+ R
LAA 74.65% 12.57% 6.78% 3.32% 1.53% 1.15%
HOU 73.59% 13.03% 7.08% 3.49% 1.60% 1.21%
OAK 75.86% 12.41% 6.44% 3.03% 1.33% 0.93%
TOR 72.67% 13.19% 7.34% 3.70% 1.74% 1.36%
ATL 76.13% 12.01% 6.36% 3.07% 1.40% 1.03%
MIL 74.50% 12.64% 6.84% 3.35% 1.53% 1.14%
STL 71.97% 13.58% 7.54% 3.78% 1.77% 1.37%
CHC 71.84% 13.21% 7.54% 3.92% 1.91% 1.58%
ARI 73.27% 13.23% 7.17% 3.51% 1.61% 1.20%
LAD 74.29% 12.76% 6.89% 3.37% 1.54% 1.15%
SFG 74.79% 12.38% 6.73% 3.34% 1.56% 1.20%
CLE 72.74% 13.20% 7.31% 3.67% 1.73% 1.35%
SEA 72.78% 13.24% 7.31% 3.65% 1.71% 1.31%
MIA/FLA 75.34% 12.29% 6.58% 3.21% 1.47% 1.11%
NYM 74.13% 12.93% 6.94% 3.37% 1.52% 1.11%
WSN 73.40% 13.05% 7.14% 3.54% 1.64% 1.24%
BAL 72.28% 13.67% 7.46% 3.67% 1.68% 1.24%
SDP 76.63% 12.22% 6.23% 2.87% 1.23% 0.82%
PHI 76.89% 12.05% 6.15% 2.84% 1.23% 0.83%
PIT 73.91% 12.61% 6.97% 3.52% 1.67% 1.32%
TEX 72.71% 13.36% 7.33% 3.64% 1.68% 1.28%
TBR 74.00% 13.22% 6.97% 3.31% 1.47% 1.03%
BOS 69.51% 13.97% 8.16% 4.33% 2.16% 1.86%
CIN 74.98% 12.61% 6.69% 3.22% 1.45% 1.05%
COL 71.10% 13.79% 7.76% 3.96% 1.89% 1.51%
KCR 75.71% 12.38% 6.48% 3.08% 1.37% 0.98%
DET 72.23% 13.39% 7.45% 3.76% 1.77% 1.39%
MIN 73.98% 13.02% 6.98% 3.38% 1.53% 1.12%
CHW 74.71% 12.69% 6.76% 3.27% 1.48% 1.09%
NYY 74.87% 12.60% 6.73% 3.25% 1.47% 1.07%

Fractions of innings with particular run totals – 2016 actual
Team 0 R 1 R 2 R 3 R 4 R 5+ R
LAA 73.41% 14.31% 6.56% 2.79% 1.61% 1.33%
HOU 71.54% 15.76% 7.74% 3.26% 1.02% 0.68%
OAK 74.86% 14.02% 5.52% 3.38% 1.38% 0.83%
TOR 72.30% 14.36% 7.35% 3.30% 1.37% 1.31%
ATL 73.22% 15.64% 7.04% 2.46% 1.23% 0.41%
MIL 73.40% 14.58% 7.15% 2.92% 1.25% 0.69%
STL 71.73% 14.79% 6.43% 3.59% 2.21% 1.24%
CHC 69.69% 15.67% 8.18% 3.40% 1.80% 1.25%
ARI 70.95% 16.80% 6.87% 2.86% 1.16% 1.36%
LAD 72.86% 14.57% 6.77% 3.04% 1.59% 1.17%
SFG 73.29% 13.90% 7.81% 2.53% 1.23% 1.23%
CLE 70.58% 15.51% 7.16% 4.03% 1.81% 0.90%
SEA 72.29% 14.81% 6.24% 3.57% 1.78% 1.30%
MIA/FLA 73.76% 14.96% 6.12% 3.34% 1.25% 0.56%
NYM 73.91% 13.77% 7.61% 2.84% 1.25% 0.62%
WSN 70.74% 16.63% 6.35% 3.38% 1.86% 1.04%
BAL 71.54% 15.38% 7.13% 3.36% 1.40% 1.19%
SDP 73.49% 15.04% 6.11% 3.09% 1.37% 0.89%
PHI 74.53% 15.60% 5.31% 3.11% 0.90% 0.55%
PIT 72.73% 14.42% 7.01% 3.37% 1.44% 1.03%
TEX 70.41% 16.40% 7.12% 3.42% 1.33% 1.33%
TBR 72.29% 16.39% 6.74% 2.50% 1.25% 0.83%
BOS 68.14% 15.76% 8.61% 3.92% 1.82% 1.75%
CIN 73.47% 14.39% 6.37% 3.29% 1.10% 1.37%
COL 71.03% 14.69% 6.06% 4.67% 1.67% 1.88%
KCR 74.43% 13.96% 6.50% 2.63% 1.31% 1.17%
DET 71.29% 15.34% 7.00% 4.20% 1.33% 0.84%
MIN 70.77% 17.81% 6.46% 2.86% 1.16% 0.95%
CHW 71.47% 17.23% 6.13% 3.65% 0.90% 0.62%
NYY 72.45% 16.22% 5.94% 3.29% 1.33% 0.77%

At first glance, these numbers are pretty close. But look at this table of how much EBP overestimates each fraction (predicted minus actual, so a negative entry is an underestimate):

EBP’s overestimates of fractions of innings with particular run totals – 2016
Team 0 R 1 R 2 R 3 R 4 R 5+ R
LAA 1.24% -1.73% 0.22% 0.53% -0.08% -0.18%
HOU 2.05% -2.73% -0.66% 0.23% 0.59% 0.53%
OAK 1.00% -1.61% 0.92% -0.35% -0.05% 0.10%
TOR 0.37% -1.18% -0.01% 0.40% 0.37% 0.05%
ATL 2.91% -3.64% -0.68% 0.62% 0.17% 0.62%
MIL 1.10% -1.94% -0.32% 0.43% 0.28% 0.45%
STL 0.23% -1.21% 1.11% 0.19% -0.44% 0.12%
CHC 2.14% -2.46% -0.64% 0.52% 0.10% 0.34%
ARI 2.32% -3.57% 0.29% 0.66% 0.46% -0.16%
LAD 1.43% -1.81% 0.12% 0.33% -0.05% -0.03%
SFG 1.50% -1.52% -1.08% 0.80% 0.32% -0.03%
CLE 2.16% -2.30% 0.14% -0.36% -0.08% 0.44%
SEA 0.49% -1.57% 1.07% 0.09% -0.08% 0.01%
MIA/FLA 1.57% -2.67% 0.45% -0.13% 0.22% 0.55%
NYM 0.22% -0.84% -0.67% 0.53% 0.28% 0.49%
WSN 2.66% -3.58% 0.79% 0.15% -0.23% 0.21%
BAL 0.74% -1.71% 0.33% 0.31% 0.28% 0.05%
SDP 3.14% -2.82% 0.12% -0.22% -0.14% -0.07%
PHI 2.36% -3.55% 0.84% -0.26% 0.33% 0.28%
PIT 1.18% -1.82% -0.03% 0.15% 0.23% 0.29%
TEX 2.30% -3.03% 0.21% 0.22% 0.36% -0.05%
TBR 1.71% -3.17% 0.24% 0.81% 0.22% 0.19%
BOS 1.37% -1.78% -0.45% 0.41% 0.34% 0.11%
CIN 1.51% -1.79% 0.31% -0.07% 0.35% -0.32%
COL 0.07% -0.90% 1.70% -0.71% 0.22% -0.37%
KCR 1.28% -1.58% -0.02% 0.45% 0.06% -0.20%
DET 0.94% -1.94% 0.45% -0.44% 0.44% 0.55%
MIN 3.21% -4.79% 0.52% 0.52% 0.37% 0.17%
CHW 3.24% -4.54% 0.63% -0.38% 0.59% 0.47%
NYY 2.42% -3.62% 0.78% -0.03% 0.14% 0.31%

It consistently underestimates the number of 1-run innings, while consistently overestimating the number of 0-run innings, and overestimating the number of multiple-run innings about two-thirds of the time. By considering the simplifying assumptions made in the derivation of EBP, we can speculate why these consistent biases occur. Zero-run innings occur less frequently than EBP predicts because, I speculate, it does not account for run-manufacturing activities, such as stolen bases, bunts, hitting behind the runner, and sacrifices. It also doesn’t account for wild pitches and balks, and some errors. On the other hand, multiple-run innings occur less frequently than EBP predicts because, I speculate, it does not account for the extra outs made on the bases, such as in double plays, caught stealings, and getting thrown out taking an extra base. Such outs comprise between 5% and 7% of all outs made. They bring innings to an end more quickly, and thus cause fewer runs to be scored than EBP would predict. But these outs can’t reduce the number of runs scored in innings in which nobody would have scored anyway, and can have at most a one-run impact on innings in which one run would have scored, so this omission has a selectively greater impact on what EBP says would have been multiple-run innings. I’m hoping to do some future research on how to compensate for these simplifying assumptions and arrive at truer predictions.
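
The overestimate table is simply the predicted fraction minus the actual fraction, column by column. A minimal Python sketch of that arithmetic, using three rows copied from the two tables above (LAA, BOS, COL), is shown here. The names `overestimates` and `mean_bias` are illustrative, and because the inputs are already rounded to two decimals, a recomputed entry can differ from the published table by 0.01.

```python
COLUMNS = ["0 R", "1 R", "2 R", "3 R", "4 R", "5+ R"]

# Rows copied from the 2016 predicted and actual tables above (percent).
predicted = {
    "LAA": [74.65, 12.57, 6.78, 3.32, 1.53, 1.15],
    "BOS": [69.51, 13.97, 8.16, 4.33, 2.16, 1.86],
    "COL": [71.10, 13.79, 7.76, 3.96, 1.89, 1.51],
}
actual = {
    "LAA": [73.41, 14.31, 6.56, 2.79, 1.61, 1.33],
    "BOS": [68.14, 15.76, 8.61, 3.92, 1.82, 1.75],
    "COL": [71.03, 14.69, 6.06, 4.67, 1.67, 1.88],
}


def overestimates(pred, act):
    """Predicted minus actual for every team and run-total column."""
    return {
        team: [round(p - a, 2) for p, a in zip(pred[team], act[team])]
        for team in pred
    }


diffs = overestimates(predicted, actual)

# Average overestimate per column across the teams included above.
mean_bias = [
    round(sum(row[k] for row in diffs.values()) / len(diffs), 2)
    for k in range(len(COLUMNS))
]

for team, row in diffs.items():
    print(team, row)
print("mean", mean_bias)
```

Even with only three teams, the sign pattern described above shows up: a positive average in the 0 R column and a negative average in the 1 R column.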

The formula for this variation is presented and explained over here.

Predicting left on base rates

Another variation will produce predictions of numbers of runners left on base per inning. I’m not sure of the value of that, but it comes out pretty naturally as part of the derivation. Perhaps it could be used in a measure of the timeliness of a team’s hitting. I have not as yet bothered to calculate these predictions, but I am curious what the average year-by-year correlation coefficient would be against actual team seasonal left-on-base (LOB) data from the years 1955 through 2016. (Visit why I use correlation coefficient to evaluate accuracy to learn why I prefer this over other commonly used measures of accuracy like RMSE.)
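
For readers who want to run that comparison themselves, an average year-by-year correlation can be sketched as below. This assumes the ordinary Pearson correlation is computed across teams within each season and the resulting coefficients are then averaged; the function names and the tiny data set are placeholders, not real LOB figures or EBP predictions.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


def average_yearly_r(seasons):
    """Average of the per-season correlations.

    `seasons` maps a season label to a list of (predicted, actual)
    pairs, one pair per team.
    """
    per_season = [
        pearson_r([pred for pred, _ in pairs], [act for _, act in pairs])
        for pairs in seasons.values()
    ]
    return sum(per_season) / len(per_season)


# PLACEHOLDER numbers only, to show the call shape.
toy_seasons = {
    "season_a": [(1.0, 1.1), (2.0, 1.9), (3.0, 3.2)],
    "season_b": [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)],
}
print(round(average_yearly_r(toy_seasons), 4))
```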

Separating what happens at the plate from what happens on the basepaths

The oversimplifying assumptions of Expected Binomial Production may actually be of some use. You can perhaps look at the differences between EBP’s predictions and actual run production to more clearly quantify the effect of baserunning and outs on the bases on run production. The separation from the context of those events is cleaner with EBPf used as a run estimator than with other run estimators, and it shows. For example, FanGraphs has a statistic it calls BsR that estimates how many runs above or below average a team’s baserunning earned it in comparison to an average team’s baserunning. Intuitively, it would seem that the run estimator that is most devoid of information about the effects of baserunning on scoring would gain the most from the addition of BsR to its predictions. I did these calculations, and they showed EBPf definitely gaining more from the addition of BsR than did any of the run estimators to which I compared it. Those that already had stolen base and caught stealing information as part of their formula actually got worse. Those that did not got better, but none by as much as EBPf did. My impression from this result is that EBPf, used as a run estimator, is the estimator that is most devoid of the effects of base advances and outs on the bases. I suspect it may be useful in some circumstances to have that separation.