The Nürnberg Ice Tigers are my favourite hockey team. They have never won the championship, were in the final twice, and have in recent years always played well in the regular season but frequently lost the first round (quarter final) of the play-offs.

Tomorrow, they will play game five of this year's quarter final against the Cologne Sharks. Both teams have won both of their away games so far, and the Ice Tigers are trying to win their first home game tomorrow.

This notebook looks into some statistics related to tomorrow's event. The ideas came from Allan Downey.

I am following the wonderful reporting of Sebastian Böhm / Nürnberger Nachrichten und Frank Strube.

For my students, this is in English.

tl;dr:¶

The statistics related to both teams, in this fairly averaging and not deeply distinguishing analysis, seems to suggest that the series is very close. Guess what, this is also what I saw when I watched it. Despite this similarity, the numbers favour Cologne slightly but consistently.

In [1]:

import json
import numpy as np
import scipy.stats as sst
import matplotlib.pylab as plt
%matplotlib inline
import thinkplot

import del_bayes as delba

Some Data¶

Let's look at some basic data

First, of the current play-off quarter final results

Games	Home	Away
Ice Tigers - Sharks	1	4
Sharks - Ice Tigers	2	3	(OT)
Ice Tigers - Sharks	2	4
Sharks - Ice Tigers	2	3

In [2]:

NIT_PO_scores = [1, 3, 2, 3]
KEC_PO_scores = [4, 2, 4, 2]

n_goals_NIT_PO18 = 9
n_goals_KEC_PO18 = 12

fname_scores = './data/DEL Saison 2017_2018.txt'

In [3]:

home_scores_NIT, away_scores_NIT = delba.get_goals_per_season(fname_scores, team_id=16)
home_scores_KEC, away_scores_KEC = delba.get_goals_per_season(fname_scores, team_id=15)

lets look at the priors, i.e., the distribution of goals scored by both teams in the season 2017/2018

In [4]:

bins = np.linspace(-0.5, 10.5, 12)

things_to_plot = [home_scores_NIT, away_scores_NIT, home_scores_KEC, away_scores_KEC]
labels = ['NIT home', 'NIT away', 'KEC home', 'KEC away']
colors=['cyan', 'blue', 'pink', 'red']
size = [3, 2.5, 2, 1.5]

for i, item in enumerate(things_to_plot):
    
    hi, bi = np.histogram(item, bins)
    plt.plot(bi[:-1]+0.5, 
             hi,
             'o',
             markersize=size[i]*3,
             color=colors[i],
             label=labels[i])
plt.legend(loc='best')
plt.xlabel('number of goals in game')
plt.ylabel('absolute number of occurrence \nof given number of goals in season 17/18')
plt.xticks(np.linspace(0,10,11))
plt.ylim(0.0, None)

Out[4]:

(0.0, 8.4)

In [5]:

avg_goals_per_game_NIT = np.hstack((home_scores_NIT, away_scores_NIT)).mean()
std_goals_per_game_NIT = np.sqrt(np.hstack((home_scores_NIT, away_scores_NIT)).var())
avg_goals_per_game_KEC = np.hstack((home_scores_KEC, away_scores_KEC)).mean()
std_goals_per_game_KEC = np.sqrt(np.hstack((home_scores_KEC, away_scores_KEC)).var())

print("avg., std. of number of goals per games")
print("NIT={:3.2f}, {:3.2f}".format(avg_goals_per_game_NIT, std_goals_per_game_NIT))
print("KEC={:3.2f}, {:3.2f}".format(avg_goals_per_game_KEC, std_goals_per_game_KEC))

avg., std. of number of goals per games
NIT=2.92, 1.69
KEC=2.85, 1.74

on average, Nürnberg scores slightly more goals than KEC, but the mean and the variance are fairly similar

Warm-Up Thoughs¶

Thought 1¶

Suppose that goal scoring in hockey is well modeled by a Poisson process, and that the long-run goal-scoring rate of the Nürnberg Ice Tigers against the Cologne Sharks is 2.9 goals per game. In their next game, what is the probability that the Ice Tigers score exactly 3 goals? Plot the PMF of k, the number of goals they score in a game.

In [6]:

sst.poisson.pmf(3, avg_goals_per_game_NIT)

Out[6]:

0.22381712081635285

In [7]:

possible_goals = np.linspace(0,10,11)
p_goals = sst.poisson.pmf(possible_goals, avg_goals_per_game_NIT)

In [8]:

plt.bar(possible_goals, p_goals)
plt.xlabel('number of goals')
plt.ylabel('P of scoring x goals')

Out[8]:

Text(0,0.5,'P of scoring x goals')

Thought 2¶

Assuming the same goal scoring rate, what is the probability of scoring a total of n_goals_NIT_PO18 (=9) goals in four games (this is what they have done in the recent past, in the last four games)

In [9]:

possible_goals = np.linspace(0,30,31)
p_goals = sst.poisson.pmf(possible_goals, avg_goals_per_game_NIT)

In [10]:

temp = delba.AddPmf(p_goals, p_goals)
temp2 = delba.AddPmf(temp, p_goals)
final = delba.AddPmf(temp2, p_goals)

In [11]:

xs = np.linspace(0, final.shape[0], final.shape[0])
plt.bar(xs, final)
plt.xlim(0., 30.)
plt.xticks(np.linspace(0.,30.,21)[::2])
plt.xlabel("number of goals in 4 games")
plt.ylabel("P of scoring x goals in 4 games")

Out[11]:

Text(0,0.5,'P of scoring x goals in 4 games')

In [12]:

final[9]

Out[12]:

0.09406636812823332

In [13]:

sst.poisson.pmf(9, 4*avg_goals_per_game_NIT)

Out[13]:

0.09406636812823359

The probability of scoring 9 goals in four games is ~0.094 (given that we model using Poisson and that the long term average scoring during the season is representative for an average scoring in the post season)

The "good news" is, that it would have been more likely if they had scored 10, 11, 12, or 13 goals (the sharks did score 13 goals).

So either Cologne's defense and/or goalie was much improved compared to the regular season in the first four games, or the NIT offense was weaker, or both.

Statistically interesting, the result of the convolution (final[9]) is identical to the analytical result from the poisson distribution (sst.poisson.pmf...).

Also interesting, we're seeing here already some effect of the central limit theorem.

Thought 3¶

Suppose that the long-run goal-scoring rate of the Sharks against the Ice Tigers is 2.6 goals per game. Plot the distribution of t, the time until the Sharks score their first goal.

In their next game, what is the probability that the Sharks score during the first period (that is, the first third of the game)?

In [14]:

time_ax = np.linspace(0., 2.5, 21)

In [15]:

p_t_scoring = sst.expon.cdf(time_ax, scale = 1./2.6)

In [16]:

plt.plot(time_ax, p_t_scoring)
plt.xlim(0., 2.5)
plt.ylim(0.,None)
plt.xlabel('time between goals (1=60minutes)')

Out[16]:

Text(0.5,0,'time between goals (1=60minutes)')

In [17]:

sst.expon.cdf(1/3, scale=1/2.6)

Out[17]:

0.5796496154913181

given the average scoring rate, which is assumed to be constant over all games and within each game (no difference between any of the periods... we kind of know that in practice this assumption is violated...), there is a slightly larger chance than 50:50 that the Sharks score within the first period.

Thought 4:¶

Assuming again that the goal scoring rate is 2.6, what is the probability that the Ice Tigers get shut out (that is, don't score for an entire game)?

Answer this question two ways, using the CDF of the exponential distribution and the PMF of the Poisson distribution. Again, this is a statistically interesting thing!

In [18]:

1.0 - sst.expon.cdf(1., scale=1./2.6)

Out[18]:

0.07427357821433389

In [19]:

sst.poisson.pmf(0., 2.6)

Out[19]:

0.07427357821433388

Adventures into Bayes¶

First let's look at the prior, given the data of the season 2017/2018. We will start with the same prior distribution for both hockey clubs. Both are Gaussian with a mean of 2.9 (that means both teams start with the same prior; we could use a more informative prior, however in the limited time I had for this, I was not successful implementing it)

In [20]:

import importlib
importlib.reload(delba)
suite1 = delba.Hockey('NIT')
suite2 = delba.Hockey('KEC')
thinkplot.PrePlot(num=2)
thinkplot.Pdf(suite1)
thinkplot.Pdf(suite2)
thinkplot.Config(xlabel='Goals per game',
                ylabel='Probability')

And we can update each suite with the scores from the first 4 games.

In [21]:

suite1.UpdateSet(NIT_PO_scores)

Out[21]:

0.0019393946773635746

In [22]:

suite2.UpdateSet(KEC_PO_scores)

Out[22]:

0.001143959398114395

In [23]:

thinkplot.PrePlot(num=2)
thinkplot.Pdf(suite1)
thinkplot.Pdf(suite2)
thinkplot.Config(xlabel='Goals per game',
                ylabel='Probability')

suite1.Mean(), suite2.Mean()

Out[23]:

(2.5931581264473484, 2.7979813677731227)

using the play-off games goal scoring to date to update the priors, shows that the teams are really close: The most likely number of goals for NIT is 2.59, for KEC 2.80

To predict the number of goals scored in the next game we can compute, for each hypothetical value of $\lambda$, a Poisson distribution of goals scored, then make a weighted mixture of Poissons:

In [24]:

goal_dist1 = delba.MakeGoalPmf(suite1)
goal_dist2 = delba.MakeGoalPmf(suite2)

thinkplot.PrePlot(num=2)
thinkplot.Pmf(goal_dist1)
thinkplot.Pmf(goal_dist2)
thinkplot.Config(xlabel='Goals',
                ylabel='Probability',
                xlim=[-0.7, 11.5])

goal_dist1.Mean(), goal_dist2.Mean()

Out[24]:

(2.59171397162217, 2.7955988586976046)

Now we can compute the probability that the NIT win, lose, or tie in regulation time.

In [25]:

diff = goal_dist1 - goal_dist2
p_win = diff.ProbGreater(0)
p_loss = diff.ProbLess(0)
p_tie = diff.Prob(0)

print('P(win)= {:3.2f}, \nP(loss)={:3.2f}, \nP(tie)= {:3.2f}'.format(p_win, p_loss, p_tie))

P(win)= 0.38, 
P(loss)=0.45, 
P(tie)= 0.17

If the game goes into overtime, we have to compute the distribution of t, the time until the first goal, for each team. For each hypothetical value of $\lambda$, the distribution of t is exponential, so the predictive distribution is a mixture of exponentials.

Here's what the predictive distributions for t look like.

In [26]:

time_dist1 = delba.MakeGoalTimePmf(suite1)    
time_dist2 = delba.MakeGoalTimePmf(suite2)
 
thinkplot.PrePlot(num=2)
thinkplot.Pmf(time_dist1)
thinkplot.Pmf(time_dist2)    
thinkplot.Config(xlabel='Games until goal',
                   ylabel='Probability')

time_dist1.Mean(), time_dist2.Mean()

Out[26]:

(0.3890322826246101, 0.3608324093454575)

In overtime the first team to score wins, so the probability of winning is the probability of generating a smaller value of t:

In [27]:

p_win_in_overtime = time_dist1.ProbLess(time_dist2)
p_adjust = time_dist1.ProbEqual(time_dist2)
p_win_in_overtime += p_adjust / 2
print('p_win_in_overtime', p_win_in_overtime)

p_win_in_overtime 0.4810635311241746

Finally, we can compute the overall chance that the Ice Tigers win, either in regulation or overtime.

In [28]:

p_win_overall = p_win + p_tie * p_win_in_overtime
print('p_win_overall', p_win_overall)

p_win_overall 0.46317227261461386

Exercise: To make the model of overtime more correct, we could update both suites with 0 goals in one game, before computing the predictive distribution of t. Make this change and see what effect it has on the results.

In [29]:

suite1.Update(0)
suite2.Update(0)
time_dist1 = delba.MakeGoalTimePmf(suite1)    
time_dist2 = delba.MakeGoalTimePmf(suite2)
p_win_in_overtime = time_dist1.ProbLess(time_dist2)
p_adjust = time_dist1.ProbEqual(time_dist2)
p_win_in_overtime += p_adjust / 2
print('p_win_in_overtime', p_win_in_overtime)
p_win_overall = p_win + p_tie * p_win_in_overtime
print('p_win_overall', p_win_overall)

p_win_in_overtime 0.4793124781132104
p_win_overall 0.46287182347893885

Some Hockey Stats (DEL Quarterfinal 2018 Game 5 Nürnberg Ice Tigers vs Cologne Sharks)