
The Bayesian denominator of too many coin flips

The problem

Today we look at problem 6.1 Too many coin flips, from Ben Lambert’s book A Student’s Guide to Bayesian Statistics.

This time around we have 2 coins whose nature (their probability of heads) we are unsure about. However, we know that each coin $i$ is either fair ($\theta_i = 0.5$) or biased towards heads ($\theta_i = 0.9$).

A uniform start

Suppose that:

  • we flip each coin twice: $n = 2$
  • we place a discrete uniform prior on each coin’s identity $\theta_i$

Q: What is the joint distribution of the data and the coins’ identities?

First things first, let $X_i$ be the number of heads in $n$ throws of coin $i$. Then the joint distribution of interest is:

$$\Pr(X_1, X_2, \theta_1, \theta_2) = \Pr(X_{1:2}, \theta_{1:2})$$

If we assume that the two coin throws are independent (i.e. the outcome of throwing coin 1 does not affect the outcome of throwing coin 2), then:

$$\Pr(X_{1:2}, \theta_{1:2}) = \prod_{i=1}^{2} \Pr(X_i, \theta_i) = \prod_{i=1}^{2} \Pr(X_i \mid \theta_i) \times \Pr(\theta_i)$$

Note that, for each coin, we can write the joint $\Pr(X_i, \theta_i)$ as the product of the conditional distribution of the data given the parameter, $\Pr(X_i \mid \theta_i)$, and the distribution of the coin’s identity, $\Pr(\theta_i)$. Therefore, if we know how to calculate these two terms, then we can easily compute the whole thing (a worked single-cell example follows the list below):

  • given the coin’s identity, the binomial distribution can be used to model the outcome of $n$ independent coin flips (we’ve seen this in a previous post):

$$\Pr(x_i \mid \theta_i) = \binom{n}{x_i} \theta_i^{x_i} (1 - \theta_i)^{n - x_i}$$

  • the uniform assumption over the coin’s identity tells us how to calculate $\Pr(\theta_i)$:

$$\Pr(\theta_i = 0.5) = \Pr(\theta_i = 0.9) = 1/2$$
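To make these two building blocks concrete, here is one cell of the joint worked out directly with R’s dbinom (just a sketch; the values $x_1 = 2$ and $\theta_1 = 0.9$ are picked purely for illustration):

# Pr(X1 = 2, theta1 = 0.9) = Pr(X1 = 2 | theta1 = 0.9) * Pr(theta1 = 0.9)
dbinom(2, size = 2, prob = 0.9) * 1/2
## [1] 0.405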

The joint defines a probability distribution over all possible combinations of $X_1$, $X_2$, $\theta_1$ and $\theta_2$. Because the random variables are discrete, we can enumerate all the possible combinations and write a bit of code to do the computation for us. We can expect a total of $3 \times 3 \times 2 \times 2 = 36$ different probabilities:

library(tidyverse)  # tibble, dplyr and tidyr (expand)

joint <- 
  tibble(
    x1     = c(0, 1, 2),        # possible head counts for coin 1
    x2     = c(0, 1, 2),        # possible head counts for coin 2
    theta1 = c(0.5, 0.9, 0.5),  # the two possible identities (repeated to match column length)
    theta2 = c(0.5, 0.9, 0.5)
  ) %>%
  expand(x1, x2, theta1, theta2) %>%  # all 36 distinct combinations
  mutate(
    n      = 2,
    prior1 = 1/2,
    prior2 = 1/2
  ) %>%
  mutate(
    lik1 = dbinom(x1, n, theta1),  # Pr(X1 = x1 | theta1)
    lik2 = dbinom(x2, n, theta2),  # Pr(X2 = x2 | theta2)
    num1 = lik1 * prior1,
    num2 = lik2 * prior2
  ) %>%
  mutate(p = num1 * num2)  # joint probability Pr(x1, x2, theta1, theta2)
θ1,θ2 \ X1,X2   0,0       0,1      0,2       1,0      1,1     1,2      2,0       2,1      2,2
0.5,0.5         0.015625  0.03125  0.015625  0.03125  0.0625  0.03125  0.015625  0.03125  0.015625
0.5,0.9         0.000625  0.01125  0.050625  0.00125  0.0225  0.10125  0.000625  0.01125  0.050625
0.9,0.5         0.000625  0.00125  0.000625  0.01125  0.0225  0.01125  0.050625  0.10125  0.050625
0.9,0.9         0.000025  0.00045  0.002025  0.00045  0.0081  0.03645  0.002025  0.03645  0.164025
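As a quick sanity check on the bookkeeping, the joint table should indeed contain $3 \times 3 \times 2 \times 2 = 36$ rows (a one-liner against the joint tibble built above):

nrow(joint)
## [1] 36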

We can verify that the above distribution is a valid probability distribution:

joint %>%
  summarise(sum(p))
## # A tibble: 1 x 1
##   `sum(p)`
##      <dbl>
## 1    1.000

One and only likelihood

We now assume that we’ve observed $X_1 = 2$ and $X_2 = 1$. How likely is this data, given our probability model of coin flips (binomial)?

We can answer this question via the likelihood:

$$L(\theta_{1:2} \mid x_{1:2}) = \prod_{i=1}^{2} L(\theta_i \mid x_i) = \prod_{i=1}^{2} \Pr(X_i = x_i \mid \theta_i) = \Pr(X_1 = 2 \mid \theta_1) \times \Pr(X_2 = 1 \mid \theta_2)$$

We bring out the likelihood when we’re presented with a single dataset: we fix the data $X_1 = x_1$, $X_2 = x_2$ and vary the parameters $\theta_1$ and $\theta_2$. Note that we used the same expression, times the prior $\Pr(\theta_i)$, to compute the joint; there, however, we did it for all possible datasets, i.e. all combinations of $X_1$ and $X_2$!
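We could compute the four likelihood values directly from the binomial pmf, without going through the joint table (a small sketch; thetas is a hypothetical helper vector holding the two possible identities):

thetas <- c(0.5, 0.9)
crossing(theta1 = thetas, theta2 = thetas) %>%   # the four parameter combinations
  mutate(lik = dbinom(2, 2, theta1) * dbinom(1, 2, theta2))

This reproduces the lik column in the table below, which we instead obtain by filtering the joint: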

likelihood <- 
  joint %>%
  filter(x1 == 2 & x2 == 1) %>%                      # keep only the observed dataset
  select(-c(n, prior1, prior2, num1, num2, p)) %>%   # drop everything except the likelihood pieces
  mutate(lik = lik1 * lik2)                          # L(theta1, theta2 | x1, x2)

library(kableExtra)  # kable_styling, row_spec, column_spec

likelihood %>%
  knitr::kable(escape = FALSE) %>%
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = T, position = "float_left") %>%
  row_spec(0, bold = T, background = theme_color, color = "white", font_size = 15, align = "c") %>%  # theme_color: the blog's accent colour, defined elsewhere
  row_spec(c(1, 3), bold = F, background = "#fafafa", font_size = 14) %>%
  row_spec(c(2, 4), bold = F, background = "transparent", font_size = 14) %>%
  row_spec(which.max(likelihood$lik), bold = T, color = "white", background = "red") %>%  # highlight the maximum-likelihood row
  column_spec(1:length(likelihood), width_min = "50pt")
x1  x2  theta1  theta2  lik1  lik2     lik
 2   1     0.5     0.5  0.25  0.50  0.1250
 2   1     0.5     0.9  0.25  0.18  0.0450
 2   1     0.9     0.5  0.81  0.50  0.4050
 2   1     0.9     0.9  0.81  0.18  0.1458

Note that the likelihood is not a valid probability distribution:

likelihood %>%
  summarise(sum(lik))
## # A tibble: 1 x 1
##   `sum(lik)`
##        <dbl>
## 1      0.721

Now, if we were to select point estimates (single values) for $\theta_1, \theta_2$, we would probably choose the ones that maximise the likelihood. The maximum likelihood estimates can simply be read off the table above: $\theta_1 = 0.9$, $\theta_2 = 0.5$.
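If we want R to pick the estimates out for us, a small sketch using the likelihood tibble built above is to filter for the largest likelihood value:

likelihood %>%
  filter(lik == max(lik)) %>%     # row with the highest likelihood
  select(theta1, theta2, lik)     # theta1 = 0.9, theta2 = 0.5, lik = 0.405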

The result makes sense intuitively: the model prefers coin 1 to be biased towards heads because it flipped heads twice and coin 2 to be fair as it flipped once heads and once tails. If we were approaching this problem from a frequentist point of view, our job would be done here. However, from a Bayesian perspective, we’re just getting started.

The devil is in the denominator

In Bayesian Statistics, it is often said that the devil is in the denominator. What is the denominator, or marginal likelihood, exactly?

We can obtain the probability of the data $\Pr(x_1, x_2)$ by summing over all possible values of the parameters:

$$\Pr(x_{1:2}) = \sum_{\theta_1} \sum_{\theta_2} \Pr(x_{1:2}, \theta_{1:2}) = \sum_{\theta_1} \sum_{\theta_2} \Pr(x_{1:2} \mid \theta_{1:2}) \Pr(\theta_{1:2}) = \sum_{\theta_1} \Pr(x_1 \mid \theta_1) \Pr(\theta_1) \sum_{\theta_2} \Pr(x_2 \mid \theta_2) \Pr(\theta_2)$$

This is equivalent to selecting the column of the joint table where $X_1 = 2$ and $X_2 = 1$ and summing the probabilities over all values of $\theta_1$ and $\theta_2$.

numerator <- 
  joint %>%
  filter(x1 == 2 & x2 == 1)
x1  x2  theta1  theta2  n  prior1  prior2  lik1  lik2   num1  num2        p
 2   1     0.5     0.5  2     0.5     0.5  0.25  0.50  0.125  0.25  0.03125
 2   1     0.5     0.9  2     0.5     0.5  0.25  0.18  0.125  0.09  0.01125
 2   1     0.9     0.5  2     0.5     0.5  0.81  0.50  0.405  0.25  0.10125
 2   1     0.9     0.9  2     0.5     0.5  0.81  0.18  0.405  0.09  0.03645
denominator <- 
  numerator %>%
  summarise(value = sum(p)) %>%
  pull(value)

denominator
## [1] 0.1802
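Because both the prior and the likelihood factorise over the two coins, we can double-check this number using the last form of the equation above, i.e. by computing each coin’s marginal separately and multiplying them (a quick sketch; thetas and prior are hypothetical helper vectors):

thetas <- c(0.5, 0.9)
prior  <- c(1/2, 1/2)
sum(dbinom(2, 2, thetas) * prior) * sum(dbinom(1, 2, thetas) * prior)  # Pr(X1 = 2) * Pr(X2 = 1)
## [1] 0.1802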

As we’ll see below, the job of the denominator is to ensure that the posterior distribution is a valid probability distribution.

The posterior

The posterior captures our updated belief about the coins’ identities, after we’ve observed the two flips of each coin. Computing it is now trivial, since we’ve worked out all of its components: the likelihood, the prior and the denominator:

$$\Pr(\theta_{1:2} \mid x_{1:2}) = \frac{L(\theta_{1:2} \mid x_{1:2}) \times \Pr(\theta_{1:2})}{\Pr(x_{1:2})}$$

posterior <- 
  joint %>%
  filter(x1 == 2 & x2 == 1) %>%
  mutate(denom = denominator) %>%
  mutate(post = p/denom)

posterior %>%
  select(-c(n, prior1, prior2, num1, num2, lik1, lik2, p, denom)) %>%
  mutate(post = format(post, digits = 3)) %>%
  knitr::kable(escape = FALSE) %>%
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = T, position = "float_left") %>%
  row_spec(0, bold = T, background = theme_color, color = "white", font_size = 15, align = "c") %>%
  row_spec(c(1, 3), bold = F, background = "#fafafa", font_size = 14) %>%
  row_spec(c(2, 4), bold = F, background = "transparent", font_size = 14) %>%
  column_spec(1:(length(posterior) - 9), width_min = "40pt")
x1  x2  theta1  theta2    post
 2   1     0.5     0.5  0.1734
 2   1     0.5     0.9  0.0624
 2   1     0.9     0.5  0.5619
 2   1     0.9     0.9  0.2023

We can verify that the posterior is a valid probability distribution:

posterior %>%
  summarise(sum(post))
## # A tibble: 1 x 1
##   `sum(post)`
##         <dbl>
## 1           1

As before, we may now want to compute point estimates for our parameters. Suppose we’re interested in the posterior mean of $\theta_1$. In the univariate case, this is simply a weighted average over the posterior: $E(\theta \mid x) = \sum_{\theta} \theta \Pr(\theta \mid x)$. Here, we first need to sum the posterior over $\theta_2$:

$$E(\theta_1 \mid x_{1:2}) = \sum_{\theta_1} \theta_1 \sum_{\theta_2} \Pr(\theta_1, \theta_2 \mid x_{1:2})$$

posterior_mean <- 
  posterior %>%
  group_by(theta1) %>%
  summarise(partial = sum(post)) %>%   # marginal posterior Pr(theta1 | x1, x2)
  summarise(sum(theta1 * partial))     # weighted average of theta1

posterior_mean
## # A tibble: 1 x 1
##   `sum(theta1 * partial)`
##                     <dbl>
## 1                   0.806

And despite the result not falling on either of the two possible values for the coin’s identity, the mean is definitely pointing towards coin 1 being biased in favour of heads.
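For completeness, the same recipe applied to $\theta_2$ (a quick sketch reusing the posterior tibble above) gives a mean much closer to the fair value, matching the intuition that coin 2, with one head in two flips, looks fair:

posterior %>%
  group_by(theta2) %>%
  summarise(partial = sum(post)) %>%   # marginal posterior Pr(theta2 | x1, x2)
  summarise(sum(theta2 * partial))     # roughly 0.61, leaning towards fair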