
The Bayesian denominator of too many coin flips

The problem

Today we look at problem 6.1 Too many coin flips, from Ben Lambert’s book A Student’s Guide to Bayesian Statistics.

This time around we have 2 coins whose nature (their probability of heads) we are unsure about. However, we know that each coin $i$ is either fair ($\theta_i = 0.5$) or biased towards heads ($\theta_i = 0.9$).

A uniform start

Suppose that:

  • we flip each coin twice: $n = 2$
  • we place a discrete uniform prior on each coin’s identity $\theta_i$

Q: What is the joint distribution of the data and the coins’ identities?

First things first, let $X_i$ be the number of heads in $n$ throws of coin $i$. Then the joint distribution of interest is:

$$\Pr(X_1, X_2, \theta_1, \theta_2) = \Pr(X_{1:2}, \theta_{1:2})$$

If we assume that the two coin throws are independent (i.e. the outcome of throwing coin 1 does not affect the outcome of throwing coin 2), then:

$$\Pr(X_{1:2}, \theta_{1:2}) = \prod_{i=1}^{2} \Pr(X_i, \theta_i) = \prod_{i=1}^{2} \Pr(X_i \mid \theta_i) \times \Pr(\theta_i)$$

Note that, for each coin, we can write the joint $\Pr(X_i, \theta_i)$ as the product of the conditional distribution of the data given the parameter, $\Pr(X_i \mid \theta_i)$, and the distribution of the coin’s identity, $\Pr(\theta_i)$. Therefore, if we know how to calculate these two terms, then we can easily compute the whole thing (a worked single-cell example follows the list below):

  • given the coin’s identity, the binomial distribution can be used to model the outcome of $n$ independent coin flips (we’ve seen this in a previous post):

$$\Pr(x_i \mid \theta_i) = \binom{n}{x_i} \theta_i^{x_i} (1 - \theta_i)^{n - x_i}$$

  • the uniform assumption over the coin’s identity tells us how to calculate $\Pr(\theta_i)$:

$$\Pr(\theta_i = 0.5) = \Pr(\theta_i = 0.9) = 1/2$$
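To make these two building blocks concrete, here is one cell of the joint worked out directly with R’s dbinom (just a sketch; the values $x_1 = 2$ and $\theta_1 = 0.9$ are picked purely for illustration):

# Pr(X1 = 2, theta1 = 0.9) = Pr(X1 = 2 | theta1 = 0.9) * Pr(theta1 = 0.9)
dbinom(2, size = 2, prob = 0.9) * 1/2
## [1] 0.405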

The joint defines a probability distribution over all possible combinations of $X_1$, $X_2$, $\theta_1$ and $\theta_2$. Because the random variables are discrete, we can enumerate all the possible combinations and write a bit of code to do the computation for us. We can expect a total of $3 \times 3 \times 2 \times 2 = 36$ different probabilities:

library(tidyverse)  # tibble, dplyr and tidyr (expand)

joint <- 
  tibble(
    x1     = c(0, 1, 2),        # possible head counts for coin 1
    x2     = c(0, 1, 2),        # possible head counts for coin 2
    theta1 = c(0.5, 0.9, 0.5),  # the two possible identities (repeated to match column length)
    theta2 = c(0.5, 0.9, 0.5)
  ) %>%
  expand(x1, x2, theta1, theta2) %>%  # all 36 distinct combinations
  mutate(
    n      = 2,
    prior1 = 1/2,
    prior2 = 1/2
  ) %>%
  mutate(
    lik1 = dbinom(x1, n, theta1),  # Pr(X1 = x1 | theta1)
    lik2 = dbinom(x2, n, theta2),  # Pr(X2 = x2 | theta2)
    num1 = lik1 * prior1,
    num2 = lik2 * prior2
  ) %>%
  mutate(p = num1 * num2)  # joint probability Pr(x1, x2, theta1, theta2)
θ1,θ2 \ X1,X2   0,0       0,1      0,2       1,0      1,1     1,2      2,0       2,1      2,2
0.5,0.5         0.015625  0.03125  0.015625  0.03125  0.0625  0.03125  0.015625  0.03125  0.015625
0.5,0.9         0.000625  0.01125  0.050625  0.00125  0.0225  0.10125  0.000625  0.01125  0.050625
0.9,0.5         0.000625  0.00125  0.000625  0.01125  0.0225  0.01125  0.050625  0.10125  0.050625
0.9,0.9         0.000025  0.00045  0.002025  0.00045  0.0081  0.03645  0.002025  0.03645  0.164025
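As a quick sanity check on the bookkeeping, the joint table should indeed contain $3 \times 3 \times 2 \times 2 = 36$ rows (a one-liner against the joint tibble built above):

nrow(joint)
## [1] 36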

We can verify that the above distribution is a valid probability distribution:

joint %>%
  summarise(sum(p))
## # A tibble: 1 x 1
##   `sum(p)`
##      <dbl>
## 1    1.000

One and only likelihood

We now assume that we’ve observed $X_1 = 2$ and $X_2 = 1$. How likely is this data, given our probability model of coin flips (binomial)?

We can answer this question via the likelihood:

$$L(\theta_{1:2} \mid x_{1:2}) = \prod_{i=1}^{2} L(\theta_i \mid x_i) = \prod_{i=1}^{2} \Pr(X_i = x_i \mid \theta_i) = \Pr(X_1 = 2 \mid \theta_1) \times \Pr(X_2 = 1 \mid \theta_2)$$

We bring out the likelihood when we’re presented with a single dataset: we fix the data $X_1 = x_1$, $X_2 = x_2$ and vary the parameters $\theta_1$ and $\theta_2$. Note that we used the same expression, times the prior $\Pr(\theta_i)$, to compute the joint; there, however, we did it for all possible datasets, i.e. all combinations of $X_1$ and $X_2$!
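We could compute the four likelihood values directly from the binomial pmf, without going through the joint table (a small sketch; thetas is a hypothetical helper vector holding the two possible identities):

thetas <- c(0.5, 0.9)
crossing(theta1 = thetas, theta2 = thetas) %>%   # the four parameter combinations
  mutate(lik = dbinom(2, 2, theta1) * dbinom(1, 2, theta2))

This reproduces the lik column in the table below, which we instead obtain by filtering the joint: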

likelihood <- 
  joint %>%
  filter(x1 == 2 & x2 == 1) %>%                      # keep only the observed dataset
  select(-c(n, prior1, prior2, num1, num2, p)) %>%   # drop everything except the likelihood pieces
  mutate(lik = lik1 * lik2)                          # L(theta1, theta2 | x1, x2)

library(kableExtra)  # kable_styling, row_spec, column_spec

likelihood %>%
  knitr::kable(escape = FALSE) %>%
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = T, position = "float_left") %>%
  row_spec(0, bold = T, background = theme_color, color = "white", font_size = 15, align = "c") %>%  # theme_color: the blog's accent colour, defined elsewhere
  row_spec(c(1, 3), bold = F, background = "#fafafa", font_size = 14) %>%
  row_spec(c(2, 4), bold = F, background = "transparent", font_size = 14) %>%
  row_spec(which.max(likelihood$lik), bold = T, color = "white", background = "red") %>%  # highlight the maximum-likelihood row
  column_spec(1:length(likelihood), width_min = "50pt")
x1  x2  theta1  theta2  lik1  lik2     lik
 2   1     0.5     0.5  0.25  0.50  0.1250
 2   1     0.5     0.9  0.25  0.18  0.0450
 2   1     0.9     0.5  0.81  0.50  0.4050
 2   1     0.9     0.9  0.81  0.18  0.1458

Note that the likelihood is not a valid probability distribution:

likelihood %>%
  summarise(sum(lik))
## # A tibble: 1 x 1
##   `sum(lik)`
##        <dbl>
## 1      0.721

Now, if we were to select point estimates (single values) for $\theta_1, \theta_2$, we would probably choose the ones that maximise the likelihood. The maximum likelihood estimates can simply be read off the table above: $\theta_1 = 0.9$, $\theta_2 = 0.5$.
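If we want R to pick the estimates out for us, a small sketch using the likelihood tibble built above is to filter for the largest likelihood value:

likelihood %>%
  filter(lik == max(lik)) %>%     # row with the highest likelihood
  select(theta1, theta2, lik)     # theta1 = 0.9, theta2 = 0.5, lik = 0.405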

The result makes sense intuitively: the model prefers coin 1 to be biased towards heads because it flipped heads twice and coin 2 to be fair as it flipped once heads and once tails. If we were approaching this problem from a frequentist point of view, our job would be done here. However, from a Bayesian perspective, we’re just getting started.

The devil is in the denominator

In Bayesian Statistics, it is often said that the devil is in the denominator. What is the denominator, or marginal likelihood, exactly?

We can obtain the probability of the data $\Pr(x_1, x_2)$ by summing over all possible values of the parameters:

$$\Pr(x_{1:2}) = \sum_{\theta_1} \sum_{\theta_2} \Pr(x_{1:2}, \theta_{1:2}) = \sum_{\theta_1} \sum_{\theta_2} \Pr(x_{1:2} \mid \theta_{1:2}) \Pr(\theta_{1:2}) = \sum_{\theta_1} \Pr(x_1 \mid \theta_1) \Pr(\theta_1) \sum_{\theta_2} \Pr(x_2 \mid \theta_2) \Pr(\theta_2)$$

This is equivalent to selecting the column of the joint table where $X_1 = 2$ and $X_2 = 1$ and summing the probabilities over all values of $\theta_1$ and $\theta_2$.

numerator <- 
  joint %>%
  filter(x1 == 2 & x2 == 1)
x1  x2  theta1  theta2  n  prior1  prior2  lik1  lik2   num1  num2        p
 2   1     0.5     0.5  2     0.5     0.5  0.25  0.50  0.125  0.25  0.03125
 2   1     0.5     0.9  2     0.5     0.5  0.25  0.18  0.125  0.09  0.01125
 2   1     0.9     0.5  2     0.5     0.5  0.81  0.50  0.405  0.25  0.10125
 2   1     0.9     0.9  2     0.5     0.5  0.81  0.18  0.405  0.09  0.03645
denominator <- 
  numerator %>%
  summarise(value = sum(p)) %>%
  pull(value)

denominator
## [1] 0.1802
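Because both the prior and the likelihood factorise over the two coins, we can double-check this number using the last form of the equation above, i.e. by computing each coin’s marginal separately and multiplying them (a quick sketch; thetas and prior are hypothetical helper vectors):

thetas <- c(0.5, 0.9)
prior  <- c(1/2, 1/2)
sum(dbinom(2, 2, thetas) * prior) * sum(dbinom(1, 2, thetas) * prior)  # Pr(X1 = 2) * Pr(X2 = 1)
## [1] 0.1802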

As we’ll see below, the job of the denominator is to ensure that the posterior distribution is a valid probability distribution.

The posterior

The posterior captures our updated belief about the coins’ identities, after we’ve observed the two flips of each coin. Computing it is now trivial, since we’ve worked out all of its components: the likelihood, the prior and the denominator:

$$\Pr(\theta_{1:2} \mid x_{1:2}) = \frac{L(\theta_{1:2} \mid x_{1:2}) \times \Pr(\theta_{1:2})}{\Pr(x_{1:2})}$$

posterior <- 
  joint %>%
  filter(x1 == 2 & x2 == 1) %>%
  mutate(denom = denominator) %>%
  mutate(post = p/denom)

posterior %>%
  select(-c(n, prior1, prior2, num1, num2, lik1, lik2, p, denom)) %>%
  mutate(post = format(post, digits = 3)) %>%
  knitr::kable(escape = FALSE) %>%
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = T, position = "float_left") %>%
  row_spec(0, bold = T, background = theme_color, color = "white", font_size = 15, align = "c") %>%
  row_spec(c(1, 3), bold = F, background = "#fafafa", font_size = 14) %>%
  row_spec(c(2, 4), bold = F, background = "transparent", font_size = 14) %>%
  column_spec(1:(length(posterior) - 9), width_min = "40pt")
x1  x2  theta1  theta2    post
 2   1     0.5     0.5  0.1734
 2   1     0.5     0.9  0.0624
 2   1     0.9     0.5  0.5619
 2   1     0.9     0.9  0.2023

We can verify that the posterior is a valid probability distribution:

posterior %>%
  summarise(sum(post))
## # A tibble: 1 x 1
##   `sum(post)`
##         <dbl>
## 1           1

As before, we may now want to compute point estimates for our parameters. Suppose we’re interested in the posterior mean of $\theta_1$. In the univariate case, this is simply a weighted average over the posterior: $E(\theta \mid x) = \sum_{\theta} \theta \Pr(\theta \mid x)$. Here, we first need to sum the posterior over $\theta_2$:

$$E(\theta_1 \mid x_{1:2}) = \sum_{\theta_1} \theta_1 \sum_{\theta_2} \Pr(\theta_1, \theta_2 \mid x_{1:2})$$

posterior_mean <- 
  posterior %>%
  group_by(theta1) %>%
  summarise(partial = sum(post)) %>%   # marginal posterior Pr(theta1 | x1, x2)
  summarise(sum(theta1 * partial))     # weighted average of theta1

posterior_mean
## # A tibble: 1 x 1
##   `sum(theta1 * partial)`
##                     <dbl>
## 1                   0.806

And despite the result not falling on either of the two possible values for the coin’s identity, the mean is definitely pointing towards coin 1 being biased in favour of heads.
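For completeness, the same recipe applied to $\theta_2$ (a quick sketch reusing the posterior tibble above) gives a mean much closer to the fair value, matching the intuition that coin 2, with one head in two flips, looks fair:

posterior %>%
  group_by(theta2) %>%
  summarise(partial = sum(post)) %>%   # marginal posterior Pr(theta2 | x1, x2)
  summarise(sum(theta2 * partial))     # roughly 0.61, leaning towards fair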