The Pythagorean model and its first-order approximation apply cross-sectionally across the teams in a league over regular-season play. In this section we shift our attention to an exact model for the win percentage of an individual team over the course of a season. For any particular team, consider a randomly selected game, and let the random variable *X* denote the *spread*, that is, the difference between runs (or more generally points) scored for and against that team in the game. We can estimate the mean spread per game at the end of a season by

$$E\left(X\right)\approx RS-RA$$(6)

where as before we interpret *RS* and *RA* as the average runs (points) per game for and against the team in question.

In a game selected at random, the team of interest wins the game if and only if *X* > 0 (the team outscores its opponents), and conversely the team loses if *X* < 0. Our single assumption is that ties are not possible, that is, *X* ≠ 0 (note that this assumption is also invoked in the Pythagorean model). Consequently, the probability that a team wins the game is given by

$$\mathrm{Pr}\left\{\text{Win}\right\}=\mathrm{Pr}\left\{X>0\right\},$$(7)

and this win probability can be estimated by the team's seasonal win percentage, that is,

$$\mathrm{Pr}\left\{\text{Win}\right\}\approx WP\text{.}$$(8)

Now, define a team’s expected *margin of victory* by

$$MOV=E\left(X|X>0\right),$$(9)

and similarly define a team’s expected *margin of defeat* by

$$MOD=-E\left(X|X<0\right).$$(10)

These definitions tell us, on average, by how much a team wins when it wins, and by how much it loses when it loses. Each can be estimated directly from seasonal data. To estimate *MOV* for a given team, tally total runs scored minus runs against in the games that the team wins, and divide by the number of wins. To estimate *MOD*, tally total runs against minus runs scored in the games that the team loses, and divide by the number of losses.
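The tallying procedure can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' code: the function name `margin_estimates` and the sample spreads are hypothetical, and the input is assumed to be a list of per-game spreads (runs scored minus runs against) with no ties.

```python
from statistics import mean


def margin_estimates(spreads):
    """Estimate (MOV, MOD) from per-game spreads (runs for minus runs against)."""
    if any(x == 0 for x in spreads):
        raise ValueError("ties (spread of zero) are excluded by assumption")
    wins = [x for x in spreads if x > 0]
    losses = [x for x in spreads if x < 0]
    if not wins or not losses:
        raise ValueError("need at least one win and one loss")
    mov = mean(wins)      # average winning margin
    mod = -mean(losses)   # average losing margin, reported as a positive number
    return mov, mod


# Hypothetical spreads for an eight-game stretch (illustrative only, not real data).
sample_spreads = [3, -1, 2, -4, 1, 5, -2, 1]
mov, mod = margin_estimates(sample_spreads)
print(f"MOV = {mov:.3f}, MOD = {mod:.3f}")
```

In the sample list, the five wins total 12 runs and the three losses total 7 runs, so the estimates are *MOV* = 2.4 and *MOD* ≈ 2.333.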

With these definitions, we invoke the law of total expectation to write

$$\begin{array}{ccccc}E\left(X\right)\hfill & =\hfill & E\left(X|X>0\right)\mathrm{Pr}\left\{X>0\right\}+E\left(X|X<0\right)\mathrm{Pr}\left\{X<0\right\}\hfill & & \\ & =\hfill & MOV\times \mathrm{Pr}\left\{\text{Win}\right\}-MOD\times \left(1-\mathrm{Pr}\left\{\text{Win}\right\}\right)\hfill & & \\ & =\hfill & \left(MOD+MOV\right)\times \mathrm{Pr}\left\{\text{Win}\right\}-MOD\hfill & & \end{array}$$(11)

and after dividing by (*MOD* + *MOV*) and rearranging terms, we arrive at the desired result:

$$\mathrm{Pr}\left\{\text{Win}\right\}=\frac{MOD}{MOD+MOV}+\frac{1}{MOD+MOV}\times E\left(X\right).$$(12)
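For readers who want the intermediate algebra spelled out, the last line of equation (11) can be solved for the win probability as follows. This only restates the step described above and adds no new assumptions:

$$E\left(X\right)+MOD=\left(MOD+MOV\right)\times \mathrm{Pr}\left\{\text{Win}\right\}\quad \Longrightarrow \quad \mathrm{Pr}\left\{\text{Win}\right\}=\frac{MOD+E\left(X\right)}{MOD+MOV},$$

and splitting the fraction into its two terms gives equation (12). The division is valid because *MOV* and *MOD* are both positive by definition, so their sum is nonzero.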

This linear equation exactly relates the probability of a team winning to its expected point differential. Note that the derivation is completely general, and in particular does not require assuming particular probability distributions for the number of points scored for or against a team, or that such scoring be independent. The only assumption is that games do not end in a tie (that is, $\mathrm{Pr}\left\{X=0\right\}=0$).
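The exactness of equation (12) can be checked numerically when every quantity is replaced by its sample estimate. The sketch below is an illustration under stated assumptions: the spreads are made-up values, not real game data, and the helper name `win_probability` is hypothetical.

```python
import math


def win_probability(mean_spread, mov, mod):
    """Right-hand side of equation (12)."""
    return mod / (mod + mov) + mean_spread / (mod + mov)


# Hypothetical per-game spreads (illustrative only); no game ends in a tie.
spreads = [3, -1, 2, -4, 1, 5, -2, 1, -6, 2]

wins = [x for x in spreads if x > 0]
losses = [x for x in spreads if x < 0]

win_pct = len(wins) / len(spreads)          # empirical Pr{Win}
mean_spread = sum(spreads) / len(spreads)   # empirical E(X)
mov = sum(wins) / len(wins)                 # empirical MOV
mod = -sum(losses) / len(losses)            # empirical MOD

predicted = win_probability(mean_spread, mov, mod)

# The identity holds up to floating-point rounding, whatever the spreads are.
assert math.isclose(predicted, win_pct)
print(win_pct, predicted)
```

Here six of the ten games are wins, so the empirical win percentage is 0.6, and the right-hand side of equation (12) reproduces it. Replacing the list with any other tie-free spreads that include at least one win and one loss leaves the assertion intact, since no distributional assumption is used.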

Equation (12) is also exact for each of the *n* teams in the league after substituting team-specific seasonal estimates for the various parameters, that is

$$W{P}_{i}=\frac{mo{d}_{i}}{mo{d}_{i}+mo{v}_{i}}+\frac{1}{mo{d}_{i}+mo{v}_{i}}\times \left(R{S}_{i}-R{A}_{i}\right)$$(13)

where *WP*_{i}, *RS*_{i} and *RA*_{i} are the observed win percentage and the average runs scored and runs against per game, while *mod*_{i} and *mov*_{i} are the observed average margins of defeat and victory, respectively, for the *i*th team, $i=1,2,\mathrm{\dots},n$. Equation (13) is illustrated in Figure 1 for the 2016 MLB season (data from http://baseball-reference.com). In the figure, there are *n* = 30 straight lines, one for each team. The intercept *a*_{i} and slope *b*_{i} for the *i*th team are given by

$${a}_{i}=\frac{mo{d}_{i}}{mo{d}_{i}+mo{v}_{i}}$$(14)

and

$${b}_{i}=\frac{1}{mo{d}_{i}+mo{v}_{i}}\text{.}$$(15)
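As a small illustration of equations (14) and (15), the sketch below computes the intercept and slope for two hypothetical teams. The team names and the (*mov*, *mod*) values are invented for the example and are not taken from the 2016 data; `line_coefficients` is likewise a hypothetical helper.

```python
def line_coefficients(mov, mod):
    """Intercept a_i and slope b_i from equations (14) and (15)."""
    total = mod + mov
    return mod / total, 1.0 / total


# Hypothetical (mov, mod) pairs in runs per game; not taken from the 2016 data.
teams = {"Team A": (3.6, 3.4), "Team B": (3.3, 3.7)}

for name, (mov, mod) in teams.items():
    a, b = line_coefficients(mov, mod)
    print(f"{name}: intercept = {a:.3f}, slope = {b:.3f}")
```

Both hypothetical teams have *mod* + *mov* = 7, so their lines share the slope 1/7 ≈ 0.143 and differ only in intercept (about 0.486 and 0.529).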

There are also 30 points, one on each line, that represent the observed win percentage and average run differential per game for each team. Note that while the intercepts differ across teams, the average of these intercepts is clearly close to 1/2. Also note that the slopes are numerically quite close, as the nearly parallel lines in Figure 1 show, which implies that *mod*_{i} + *mov*_{i} is roughly constant across teams.

Figure 1: Exact win percentage for the 2016 major league baseball season.

It is tempting to compare equation (13) to the first-order Taylor expansion of the Pythagorean model of equation (3); doing so suggests that

$$\begin{array}{ccccc}& \frac{1}{2}+\frac{\gamma}{4\times {R}_{\text{total}}}\times \left(R{S}_{i}-R{A}_{i}\right)\hfill & & & \\ & \approx \frac{mo{d}_{i}}{mo{d}_{i}+mo{v}_{i}}+\frac{1}{mo{d}_{i}+mo{v}_{i}}\times \left(R{S}_{i}-R{A}_{i}\right)\hfill & & & \end{array}$$(16)

which in turn suggests that

$$\frac{mo{d}_{i}}{mo{d}_{i}+mo{v}_{i}}\approx \frac{1}{2}$$(17)

and

$$\frac{\gamma}{4\times {R}_{\text{total}}}\approx \frac{1}{mo{d}_{i}+mo{v}_{i}}$$(18)

for *i* = 1, 2, … , *n*. However, this is not correct: while equation (13) is exact on a team-by-team basis, equation (3) applies cross-sectionally across the teams. Indeed, as argued earlier, equation (3) can be thought of as the regression line through the 30 individual points in Figure 1; Figure 2 superimposes this regression line on Figure 1. As is clear, while the intercept of this line equals 0.5, as indeed it must, the slope is attenuated relative to the team-specific values. Still, equation (18) suggests that the Pythagorean parameter *γ* depends upon scoring margins in addition to total scoring. The question is how to move from the within-team exact model to the Pythagorean model, which is cross-sectional across teams. We address this in the next section.

Figure 2: First-order Pythagorean and exact win percentage models.
