October 2016
FiveThirtyEight does baseball predictions where, for each team, they give the probabilities of advancing to the playoffs, advancing to the next round, etc. Here were the probabilities as of October 15:
Team | Division | Chance of making World Series | Chance of winning World Series |
---|---|---|---|
Cubs | NL Central | 64% | 40% |
Dodgers | NL West | 36% | 19% |
Blue Jays | AL East | 55% | 24% |
Indians | AL Central | 45% | 17% |
(Note that the Cubs and Dodgers were playing each other in the NLCS, and the Blue Jays and Indians were playing each other in the ALCS.)
Looking at this, it's clear that the chance to win the World Series is conditional on who your opponent is going to be. Since the model thinks that the Blue Jays are better than the Indians, for example, the Cubs would have a better chance of winning the World Series if the Indians advance than if the Blue Jays do.
So clearly, there are some underlying probabilities here: let \(w_{cb}\) be the probability that the Cubs would beat the Blue Jays in the World Series, assuming they both advance. Similarly, let \(w_{ci}\) be the probability that the Cubs would beat the Indians, and so on with \(w_{db}\) and \(w_{di}\).
The question I have is: can we derive the various \(w\)'s from the probabilities given above?
Let's define the given probabilities from the table with some constants:
Team | Division | Chance of making World Series | Chance of winning World Series |
---|---|---|---|
Cubs | NL Central | \(c_1\) | \(c_2\) |
Dodgers | NL West | \(d_1\) | \(d_2\) |
Blue Jays | AL East | \(b_1\) | \(b_2\) |
Indians | AL Central | \(i_1\) | \(i_2\) |
One way to look at this is that we have a bunch of numbers we know and are trying to solve for some variables. Generally speaking, if you have \(n\) independent coefficients you can solve for \(n\) variables - if you have more variables than that it's underdetermined (so there will probably be infinitely many answers), and if you have fewer variables it's overdetermined (so there will probably not be any answers).
In this case, we have 8 coefficients, but there are some redundancies because we know that $$c_1+d_1=1$$ $$b_1+i_1=1$$ $$c_2+d_2+b_2+i_2=1$$ so we really only have 5 independent coefficients. We're solving for the four \(w\)'s, but the coefficients we have also determine \(c_1\) and \(b_1\), so we're effectively solving for 6 variables. So it looks like it may be underdetermined. However:
So let's go ahead and try to actually solve this and see what happens!
We start by writing a formula for \(c_2\) in terms of other values we know. Since \(c_2\) is the probability that the Cubs win the World Series, first they have to advance to the World Series (by beating the Dodgers), then beat the Indians if the Indians advanced, or beat the Blue Jays if the Blue Jays advanced. So, this works out to $$c_2 = c_1\cdot(b_1w_{cb}+i_1w_{ci})$$Remembering that the \(w\)'s are the variables here and everything else is a constant, we can simplify this to $$\frac{c_2}{c_1}=b_1w_{cb}+i_1w_{ci}$$Similarly for the other teams, we have $$\frac{d_2}{d_1}=b_1w_{db}+i_1w_{di}$$ $$\frac{b_2}{b_1}=c_1(1-w_{cb})+d_1(1-w_{db})$$ $$\frac{i_2}{i_1}=c_1(1-w_{ci})+d_1(1-w_{di})$$
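To make the four equations concrete, here is a minimal Python sketch that runs them "forwards": given the chances of reaching the World Series and some values for the \(w\)'s, it computes each team's chance of winning it all. The function name `title_odds` and the particular \(w\) values are made up for illustration; only \(c_1\) and \(b_1\) come from the table.

```python
from fractions import Fraction as F

def title_odds(c1, b1, w_cb, w_ci, w_db, w_di):
    """Return (c2, d2, b2, i2) implied by the pennant odds and the four w's."""
    d1, i1 = 1 - c1, 1 - b1
    c2 = c1 * (b1 * w_cb + i1 * w_ci)
    d2 = d1 * (b1 * w_db + i1 * w_di)
    b2 = b1 * (c1 * (1 - w_cb) + d1 * (1 - w_db))
    i2 = i1 * (c1 * (1 - w_ci) + d1 * (1 - w_di))
    return c2, d2, b2, i2

# c1 and b1 come from the table; the four w's are made-up illustrative values.
odds = title_odds(F(64, 100), F(55, 100), F(1, 2), F(3, 5), F(2, 5), F(1, 2))
print([float(p) for p in odds], "total:", sum(odds))
```

Whatever \(w\)'s you feed in, the four results add up to exactly 1, which is the redundancy \(c_2+d_2+b_2+i_2=1\) showing up again.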
We can write this in matrix form after a little more simplification to get
$$\left( \begin{array}{cccc}
b_1 & i_1 & 0 & 0 \\
0 & 0 & b_1 & i_1 \\
-c_1 & 0 & -d_1 & 0 \\
0 & -c_1 & 0 & -d_1 \end{array} \right)
\left( \begin{array}{cccc}
w_{cb} \\
w_{ci} \\
w_{db} \\
w_{di} \end{array} \right)
=
\left( \begin{array}{cccc}
\frac{c_2}{c_1} \\
\frac{d_2}{d_1} \\
\frac{b_2}{b_1} - c_1 - d_1 \\
\frac{i_2}{i_1} - c_1 - d_1 \end{array} \right)$$
Now we can use Gaussian elimination to reduce this to row echelon form and then solve for the \(w\)'s. At this point you can probably plug it into Mathematica or Maple, but I was working on paper so I did it by hand. So here goes:
Add \(\frac{c_1}{b_1}\) times row 1 to row 3:
$$\left( \begin{array}{cccc}
b_1 & i_1 & 0 & 0 \\
0 & 0 & b_1 & i_1 \\
0 & \frac{c_1i_1}{b_1} & -d_1 & 0 \\
0 & -c_1 & 0 & -d_1 \end{array} \right)
\left( \begin{array}{cccc}
w_{cb} \\
w_{ci} \\
w_{db} \\
w_{di} \end{array} \right)
=
\left( \begin{array}{cccc}
\frac{c_2}{c_1} \\
\frac{d_2}{d_1} \\
\frac{b_2}{b_1} - c_1 - d_1 + \frac{c_2}{b_1} \\
\frac{i_2}{i_1} - c_1 - d_1 \end{array} \right)$$
Add \(\frac{i_1}{c_1}\) times row 4 to row 1:
$$\left( \begin{array}{cccc}
b_1 & 0 & 0 & -\frac{d_1i_1}{c_1} \\
0 & 0 & b_1 & i_1 \\
0 & \frac{c_1i_1}{b_1} & -d_1 & 0 \\
0 & -c_1 & 0 & -d_1 \end{array} \right)
\left( \begin{array}{cccc}
w_{cb} \\
w_{ci} \\
w_{db} \\
w_{di} \end{array} \right)
=
\left( \begin{array}{cccc}
\frac{c_2 + i_2 - i_1c_1 - d_1i_1}{c_1} \\
\frac{d_2}{d_1} \\
\frac{b_2}{b_1} - c_1 - d_1 + \frac{c_2}{b_1} \\
\frac{i_2}{i_1} - c_1 - d_1 \end{array} \right)$$
Add \(\frac{i_1}{b_1}\) times row 4 to row 3:
$$\left( \begin{array}{cccc}
b_1 & 0 & 0 & -\frac{d_1i_1}{c_1} \\
0 & 0 & b_1 & i_1 \\
0 & 0 & -d_1 & -\frac{d_1i_1}{b_1} \\
0 & -c_1 & 0 & -d_1 \end{array} \right)
\left( \begin{array}{cccc}
w_{cb} \\
w_{ci} \\
w_{db} \\
w_{di} \end{array} \right)
=
\left( \begin{array}{cccc}
\frac{c_2 + i_2 - i_1c_1 - d_1i_1}{c_1} \\
\frac{d_2}{d_1} \\
\frac{b_2}{b_1} - c_1 - d_1 + \frac{c_2}{b_1} + \frac{i_2}{b_1} - \frac{c_1i_1}{b_1} - \frac{d_1i_1}{b_1} \\
\frac{i_2}{i_1} - c_1 - d_1 \end{array} \right)$$
Add \(\frac{b_1}{d_1}\) times row 3 to row 2:
$$\left( \begin{array}{cccc}
b_1 & 0 & 0 & -\frac{d_1i_1}{c_1} \\
0 & 0 & 0 & 0 \\
0 & 0 & -d_1 & -\frac{d_1i_1}{b_1} \\
0 & -c_1 & 0 & -d_1 \end{array} \right)
\left( \begin{array}{cccc}
w_{cb} \\
w_{ci} \\
w_{db} \\
w_{di} \end{array} \right)
=
\left( \begin{array}{cccc}
\frac{c_2 + i_2 - i_1c_1 - d_1i_1}{c_1} \\
0 \\
\frac{b_2}{b_1} - c_1 - d_1 + \frac{c_2}{b_1} + \frac{i_2}{b_1} - \frac{c_1i_1}{b_1} - \frac{d_1i_1}{b_1} \\
\frac{i_2}{i_1} - c_1 - d_1 \end{array} \right)$$
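Notice that row 2 is now all 0's, which means that the system of equations is underdetermined! (The right-hand side of row 2 vanishes as well: it works out to \(\frac{c_2+d_2+b_2+i_2-(c_1+d_1)(b_1+i_1)}{d_1}\), which is \(0\) because of the three redundancies noted earlier. So the row reads \(0=0\) rather than signalling a contradiction.)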
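If you'd rather not trust hand elimination, the same conclusion can be checked mechanically. The sketch below is my own illustration, not part of the derivation above: it plugs the table's numbers into the matrix and computes the rank with exact fractions, using a small helper I've called `rank`. A rank of 3 for both the coefficient matrix and the augmented matrix means one free parameter and no contradiction.

```python
from fractions import Fraction as F

def rank(rows):
    """Rank of a matrix (list of rows) by exact Gaussian elimination."""
    m = [list(r) for r in rows]
    r = 0  # index of the next pivot row
    for col in range(len(m[0])):
        pivot = next((k for k in range(r, len(m)) if m[k][col] != 0), None)
        if pivot is None:
            continue  # no pivot in this column
        m[r], m[pivot] = m[pivot], m[r]
        for k in range(r + 1, len(m)):
            factor = m[k][col] / m[r][col]
            m[k] = [a - factor * b for a, b in zip(m[k], m[r])]
        r += 1
        if r == len(m):
            break
    return r

# Probabilities from the table, as exact fractions.
c1, c2 = F(64, 100), F(40, 100)
d1, d2 = F(36, 100), F(19, 100)
b1, b2 = F(55, 100), F(24, 100)
i1, i2 = F(45, 100), F(17, 100)
Z = F(0)

# Unknowns are ordered (w_cb, w_ci, w_db, w_di), as in the matrix above.
A = [[b1, i1, Z, Z],
     [Z, Z, b1, i1],
     [-c1, Z, -d1, Z],
     [Z, -c1, Z, -d1]]
rhs = [c2 / c1, d2 / d1, b2 / b1 - c1 - d1, i2 / i1 - c1 - d1]
augmented = [row + [value] for row, value in zip(A, rhs)]

print("rank of A:", rank(A))
print("rank of [A | rhs]:", rank(augmented))
```

Exact fractions are used instead of floats so that the zero row comes out as a true zero and not as a tiny rounding residue.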
So as we surmised at the beginning, we actually can't figure out the various \(w\)'s. One easy way to confirm this is to find multiple solutions for given constants. Let's go back and use the numbers we started with:
Team | Division | Chance of making World Series | Chance of winning World Series |
---|---|---|---|
Cubs | NL Central | 64% | 40% |
Dodgers | NL West | 36% | 19% |
Blue Jays | AL East | 55% | 24% |
Indians | AL Central | 45% | 17% |
If we arbitrarily pick \(w_{cb}=0.7\), then we can solve for the others and get \(w_{ci}\approx 0.53, w_{db}\approx 0.32, w_{di}\approx 0.78\). But we could also pick \(w_{cb}=0.8\) and get \(w_{ci}\approx 0.41, w_{db}\approx 0.14, w_{di}\approx 1.00\) (just under 1). Note that this makes some amount of sense - as \(w_{ci}\) gets higher, \(w_{cb}\) has to decrease since the Cubs' probability of winning the World Series stays constant.
Another point is that just because the system is underdetermined doesn't mean that we can pick anything for \(w_{cb}\): all four \(w\)'s have to end up between 0 and 1. With these numbers, \(w_{cb}\) can't be lower than about \(0.34\) (or \(w_{di}\) would go negative) or higher than about \(0.80\) (or \(w_{di}\) would exceed 1).
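To experiment with other choices of \(w_{cb}\), here is a minimal Python sketch that solves the original four equations for the remaining \(w\)'s and then scans for the values of \(w_{cb}\) that keep every \(w\) between 0 and 1. The function name `solve_given_w_cb` and the 0.001 scan grid are my own choices for illustration, and it uses plain floats with the table's rounded percentages.

```python
def solve_given_w_cb(w_cb, c1, c2, d2, b1, b2):
    """Solve the four equations for (w_ci, w_db, w_di) once w_cb is fixed."""
    d1, i1 = 1 - c1, 1 - b1
    w_ci = (c2 / c1 - b1 * w_cb) / i1             # from the Cubs' equation
    w_db = 1 - (b2 / b1 - c1 * (1 - w_cb)) / d1   # from the Blue Jays' equation
    w_di = (d2 / d1 - b1 * w_db) / i1             # from the Dodgers' equation
    return w_ci, w_db, w_di

# Probabilities from the table (i2 is implied by the others).
table = dict(c1=0.64, c2=0.40, d2=0.19, b1=0.55, b2=0.24)

for w_cb in (0.7, 0.8):
    print(w_cb, [round(w, 3) for w in solve_given_w_cb(w_cb, **table)])

# Scan w_cb on a 0.001 grid, keeping values for which every w is a probability.
feasible = [k / 1000 for k in range(1001)
            if all(0 <= w <= 1 for w in solve_given_w_cb(k / 1000, **table))]
print("feasible w_cb range: about", feasible[0], "to", feasible[-1])
```

The Indians' equation isn't used here because it is the redundant one: once the other three hold, it is satisfied automatically.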
It's a little disappointing we can't solve for the various \(w\)'s (especially because if we know whether the Cubs or Dodgers win but the Indians and Blue Jays are still playing, then we can solve for both of the \(w\)'s that still matter!), but now we know for sure!
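As a quick sketch of that last point, suppose (hypothetically) that the Cubs have already won the NLCS, so \(c_1=1\) and \(d_1=0\), and that \(b_1, i_1, b_2, i_2\) are the updated numbers from the table at that point. The Blue Jays' and Indians' equations from before then each contain a single unknown: $$\frac{b_2}{b_1}=1-w_{cb} \quad\Rightarrow\quad w_{cb}=1-\frac{b_2}{b_1}$$ $$\frac{i_2}{i_1}=1-w_{ci} \quad\Rightarrow\quad w_{ci}=1-\frac{i_2}{i_1}$$ The Cubs' equation \(c_2=b_1w_{cb}+i_1w_{ci}\) adds nothing new, since \(c_2+b_2+i_2=1\), and \(w_{db}\) and \(w_{di}\) no longer matter.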