---
jupytext:
  formats: ipynb,md:myst,py:light
  text_representation:
    extension: .md
    format_name: myst
    format_version: 0.13
    jupytext_version: 1.16.4
kernelspec:
  display_name: Python 3
  language: python
  name: python3
---

```{code-cell}
:tags: [hide-cell]

import mmf_setup; mmf_setup.nbinit()
import os
from pathlib import Path
FIG_DIR = Path(mmf_setup.ROOT) / '../Docs/_build/figures/'
os.makedirs(FIG_DIR, exist_ok=True)
import logging; logging.getLogger("matplotlib").setLevel(logging.CRITICAL)
%matplotlib inline
import numpy as np, matplotlib.pyplot as plt
try: from myst_nb import glue
except: glue = None
```

(sec:BayesianAnalysis)=
# Bayesian Analysis and Bayes Factors

Here we present some simple models to demonstrate and test Bayesian analysis.  I am
following some of the notations and discussions in {cite}`Turkman:2019` as well as
discussions in {cite}`Loredo:1992` and {cite}`Toussaint:2011`.  For a recent discussion,
see also {cite}`Llorente:2022`.

## Formulation

:::{margin}
Some notes: 1) As a physicist, I include the possibility of discrete parameters with
finite probability by allowing the [PDF][] to contain [Dirac delta function][]s. 2)
Often one will include additional conditional dependencies $I$ in both the prior
$h(θ|I)$ and posterior $h(θ|x,I)$ -- after all, the choice of prior depends on something
(e.g. previous measurements).  We suppress this to simplify notation, but it should be
kept in mind.  3) The [marginal likelihood][] is also known as Bayesian evidence, model
evidence, or sometimes simply evidence. 3) You will often see the [likelihood function][]
expressed as $L(θ|x)$ (or $\mathcal{L}(θ|x)$) which I think comes from the idea of
[maximum likelihood estimation][] (MLE) in statistics where one maximizes over the
parameters $θ$ for fixed data $x$.  Note that $L(θ|x)$ is not a [PDF][] over $θ$, but
$f(x|θ) = L(θ|x)$ is a [PDF][] over $x$.  I thus, use the latter notation.
:::
The basic idea is to express a model in terms of parameters $θ$ drawn from some
parameter space $Θ$.  Knowledge about the parameter values is expressed through
various probability density functions ([PDF][]s): specifically, **prior** knowledge
$h(θ)$, and **posterior** knowledge $h(θ|x)$ incorporating some observations $x$.
Bayes's theorem tells us how to 
use observations $x$ to update the state of knowledge, obtaining the posterior
distribution $h(θ|x)$:
\begin{gather*}
  h(θ | x) = \frac{f(x|θ)h(θ)}{h(x)} \propto f(x|θ)h(θ), \qquad
  h(x) = \int f(x|θ)h(θ)\d{θ},
\end{gather*}
The model must also express the [likelihood function][] $f(x|θ)$, which is the
probability of obtaining the observations $x$ if the model parameters are $θ$.  In this sense,
the observations $x$ are a random variable $X(θ)$ with an associated [PDF][] $f(x|θ)$.

The denominator $h(x)$ is sometimes called the [marginal likelihood][].  It is
usually simply computed as a normalization factor to ensure that the posterior is a
normalized probability distribution, and is omitted from discussions.

## Warmup 1: The Proverbial Coin

Consider flipping a coin, the outcomes of which are heads ($x=1$) with probability $θ$,
and tails ($x=0$) with probability $1-θ$.  What can you learn about $θ$ by flipping the
coin?  Suppose you get $N$ heads in a row.  What is the 90% CI for $θ$?

The Bayesian approach first requires we specify a prior.  Here we might assume we know
nothing about $θ$, and so take $h(θ) \sim U(0, 1)$ to be uniform.  The likelihood is
straight-forward: assuming that the flips are independent, the likelihood of heads is
simply $θ$, so the likelihood of $N$ heads in a row is $f(x|θ) = θ^{N}$.  Applying
Bayes' theorem, we have
\begin{gather*}
  h(θ|x) \propto f(x|θ)h(θ) = θ^{N}.
\end{gather*}
Normalizing, we have
\begin{gather*}
  h(x) = \int_0^1 θ^{N}\d{θ} = \left.\frac{θ^{N+1}}{N+1}\right|_{0}^{1}
       = \frac{1}{N+1},\qquad
  h(θ|x) = (N+1)θ^{N}.
\end{gather*}

:::{margin}
Notice that as soon as we get one head, we exclude $θ=0$ as expected, and that
subsequent heads sharpen the distribution.
:::

```{code-cell}
ths = np.linspace(0, 1)
fig, ax = plt.subplots()
for N in range(5):
    ax.plot(ths, (N+1)*ths**N, label=f"{N=}")
ax.legend()
ax.set(xlabel="$θ$", ylabel="$h(θ|x)$", 
       title="Bayesian posterior after observing $N$ heads in a row.");
```

Generalizing, if out of $N$ flips, one finds $n$ heads, the posterior (with uniform
prior) is the [binomial distribution][]
\begin{gather*}
  h(θ|x) = \binom{N}{n} θ^n (1-θ)^{N-n}.
\end{gather*}
The likelihood simply accumulates factors of $θ$ and $(1-θ)$, independent of the order
in which the data is taken.

:::{admonition} To think about and discuss.

How does this treatment deal with the difference between the following observations:
* Out of four flips, obtaining 2 heads and 2 tails. *(There are $\binom{4}{2} = 6$ ways to
  do this.)*
* Out of four flips, obtaining exactly the sequence HHTT.  *(There is exactly one way for
  this to occur.)*
:::

(sec:RecognizableSubclasses)=
## Warmup 2: Recognizable Subclasses

{cite}`Jaynes:1976` has an interesting example that he presented at a convention of
reliability and quality control statisticians working in the computer and aerospace
industries:

> at this point the meeting was thrown into an uproar, about a dozen people trying to
> shout me down at once. They told me, "This is complete nonsense. A method as firmly
> established and thoroughly worked over as confidence intervals couldn't possibly do
> such a thing. You are maligning a very great man; Neyman would never have advocated a
> method that breaks down on such a simple problem. If you can't do your arithmetic
> right, you have no business running around giving talks like this"... 
> In the end they had to concede that my result was correct after all.
>
> To make a long story short, my talk was extended to four hours (all afternoon), and
> their reaction finally changed to: "My God- why didn't somebody tell me about these
> things before? My professors and textbooks never said anything about this. Now I have
> to go back home and recheck everything I've done for years".

```{code-cell}
:tags: [margin, hide-input]
t0 = 1
tau = 1
t = np.linspace(-1, 4, 1000)
f = np.where(t<t0, 0, np.exp(-(t-t0)/tau)/tau)
fig, ax = plt.subplots(figsize=(3,2))
ax.plot(t, f)
ax.set(xlabel="$t$", ylabel=r"$f(t|\tau, t_0)$")
ax.set(xticks=[0, t0, t0+tau]);
ax.set(xticklabels=[0, "$t_0$", r"$t_0+\tau$"]);
```

Jaynes's example consists of a model with a [PDF][] of
\begin{gather*}
  f(t|\tau, t_0) = p(t) = \frac{\Theta(t-t_0)}{\tau}e^{-(t-t_0)/\tau}.
\end{gather*}
Physical interpretations include: random failure of a device after it runs out of a
chemical inhibitor at time $t_0$ {cite}`Jaynes:1976`, or the observation of some
astrophysical event whose first photons arrive at earth at time $t_0$
{cite}`Loredo:1992`.  Prior to the event $t < t_0$ there will be no detection, and
after, there will be an exponentially decaying likelihood of detection.

:::{admonition} Do it!
Assume that the decay constant $\tau$ is known, and express all times in units of this
(i.e. set $\tau = 1$).  Determine a 90% CI for $t_0$ given that the detector detects 3
photons at times $t_i/\tau \in \{12, 14, 16\}$.
:::

Although it is straightforward to simulate, this problem permits some analytic
analysis.  The population mean for the distribution is $ \braket{t} = t_0 + \tau$, thus,
the shifted mean $\hat{t}_0 = \braket{t - \tau}$ provides an unbiased estimator of $t_0$:
\begin{gather*}
  \braket{t} = \int t p(t)\d{t} = t_0 + \tau, \qquad
  \hat{t}_0(\vect{t}) = \frac{1}{N}\sum_{i=1}^{N}(t_i - \tau).
\end{gather*}
One can compute the distribution of the estimator if many samples of $N$ detections are
drawn from $p(t)$, obtaining
\begin{gather*}
  p(\hat{t}_0|t_0) = \frac{1}{\tau}\frac{N^N}{\Gamma(N) }\Theta(y)y^{N-1}e^{-Ny}, \qquad
  y = \frac{\hat{t}_0 - t_0}{\tau} + 1
\end{gather*}
For the case in the problem of $N=3$ one can also compute the [CDF][]:
\begin{gather*}
  F(\hat{t}_0|t_0) = \int_{-\infty}^{\hat{t}_0}p(\hat{t}_0|t_0)\d{\hat{t}_0}
                 = 1 - e^{-3y}\left(1 + 3y + \tfrac{9}{2}y^2\right).
\end{gather*}
Since this is uni-modal, the 90% highest probability density CI is defined by values
$\hat{t}_{\pm}$ such that $F_{N=3}(\hat{t}_+|t_0) - F_{N=3}(\hat{t}_-|t_0) = 0.9$ and
$p(\hat{t}_+|t_0) = p(\hat{t}_-|t_0)$.  In terms of $y_{\pm}$, this gives
\begin{gather*}
  y_-^2e^{-3y_-} = y_+^2e^{-3y_+}, \qquad
  - e^{-3y_{+}}(1 + 3y_{+})
   + e^{-3y_{-}}(1 + 3y_{-}) = 0.9,
\end{gather*}
which can be solved to find $y \in [0.15, 1.83]$, or
\begin{gather*}
  t_0 \in [\braket{\hat{t}_0} + (0.15 - 1)\tau, \braket{\hat{t}_0} + (1.83 - 1)\tau].
\end{gather*}
For our example $t_i/\tau \in \{12, 14, 16\}$, we have $\braket{\hat{t}_0} = 13\tau$, and
so our 90% CI is
\begin{gather*}
  \frac{t_0}{\tau} \in [12.15, 13.83].
\end{gather*}
As Jaynes and Loredo point out, this makes no sense since we know that $t_0/\tau \leq
12$: The entire 90% CI lies in an impossible region!

The Bayesian approach is straightforward.  In units with $\tau = 1$, the likelihood is
\begin{gather*}
  f(\vect{t}|t_0) \propto \prod_{i=1}^{N}p(t_i|\tau, t_0) = Θ(12-t_0)e^{3(t_0-14)}.
\end{gather*}
Given a uniform prior $h(t_0) \propto 1$, we find the Bayesian posterior given the data
to be (in units where $\tau=1$):
\begin{gather*}
  h(t_0|\vect{t}) = Θ(12-t_0)3e^{6}e^{3(t_0-14)}.
\end{gather*}
This properly excludes $t_0 > 12$.  Computing the 90% highest probability density CI, we
find:
\begin{gather*}
  \frac{t_0}{\tau} \in [11.23, 12].
\end{gather*}

:::{admonition} Details.
The cumulative distribution is
\begin{gather*}
  C(t_0|\vect{t}) = \int_{-\infty}^{t_0}\d{t}\; h(t|\vect{t}) = 
  \begin{cases}
    e^{3(t_0 - 12)} & t_0 < 12\,,\\
    1 & \text{otherwise}.
  \end{cases}
\end{gather*}
The mode is at $t_0 = t_+ = 12$ and the 90% confidence interval can be found by finding where
$C(t_{-}|\vect{t}) = 10\%$:
\begin{gather*}
  0.1 = e^{3(t_0 - 12)}
  t_{-} = 12 + \frac{\ln 0.1}{3} = 11.96\dots.
\end{gather*}
:::


```{code-cell}
:tags: [hide-cell]

from scipy.optimize import root
def f(ys):
    ym, yp = ys
    return [np.exp(-3*ym) * ym**2 - np.exp(-3*yp) * yp**2, 
            np.exp(-3*ym) * (1+3*ym) - np.exp(-3*yp) * (1+3*yp) - 0.9]

res = root(f, [0, 2], method='lm')
assert res.success
CI_90_y = res.x
print(CI_90_y)
```

```{code-cell}
:tags: [hide-input]

import scipy.stats
sp = scipy

N = 3     # Number of observations
tau = 1
t0 = 13.0 * tau

# First define some functions from above
def p(t):
    """Return the PDF for arrival times."""
    global t0, tau
    y = (t-t0)/tau
    return np.where(y>0, np.exp(-y)/tau, 0)

def pe(te):
    """Return the PDF for the estimator for N events."""
    global t0, tau, N
    y = (te - t0)/tau + 1
    return np.where(y>0, N**N/sp.special.gamma(N) * (y)**(N-1)*np.exp(-N*y) / tau, 0) 
    
def ce3(te):
    """Return the CDF for the estimator for 3 events.
    
    Note: only valid for te > t0 - tau.
    """
    global t0, tau, N
    y = (te - t0)/tau + 1
    return 1 - np.exp(-3*y)*(1+3*y + 9/2*y**2)

def bayesian_posterior(t0):
    global tau
    return np.where(t0 < 12*tau, 3*np.exp(6 + 3*(t0/tau - 14))/tau, 0)

rng = np.random.default_rng(seed=2)

Ns = 100000  # Samples
dist = sp.stats.expon(loc=t0, scale=tau)
ts = dist.rvs(size=(Ns, N))

# Compute unbiased estimators
tes = (ts - tau).mean(axis=-1)

fig, ax = plt.subplots(figsize=(4,3))
ax.hist(tes/tau, bins=100, density=True, histtype='step', label=f"{Ns=} samples")
te_ = np.linspace(10, 16, 200)
ax.plot(te_/tau, tau*pe(te_), c="C1", label=r"$\propto y^{N-1}e^{-Ny}$")
ax.plot(te_/tau, tau * bayesian_posterior(te_), c='C2', label='Bayes')

# Statistical 90% CI
te_ = tau * np.linspace(*(12 + CI_90_y))
ax.fill_between(te_/tau, tau*pe(te_), fc="C1")

# Bayesian 90% CI
Bayes_90_CI = [12 + np.log(0.1)/3, 12]
te_ = tau * np.linspace(*Bayes_90_CI)
ax.fill_between(te_/tau, tau * bayesian_posterior(te_), fc='C2')

ax.legend()
ax.set(xlabel=r"$\hat{t}_0/\tau$", ylabel=r"PDF $\times \tau$",
       xlim=(10, 16),
       title=r"90% CI about $t_0 = 13$");
```

The problem here is that the particular observation at hand is actually extremely
unlikely, thus the typical statistical analysis -- which is based on typical outcomes --
gives wildly inaccurate results.



## Simple Problem

We start with a simple problem.  Consider measuring a value $θ$ where the measurement
process introduces an error $\eta$ with PDF $f(\eta)$:
\begin{gather*}
  x = θ + \eta, \qquad \eta \sim f(\eta).
\end{gather*}
Consider two models: one where $θ \in [0, 1]$ is a real number, and another where
$θ\in \{0, 1/2, 1\}$ is discrete.  Can we learn about which model is better by
analyzing some data?

Let's start with the standard approach, assuming a flat prior $θ \sim U(0, 1)$:
\begin{align*}
  h(θ) &= \begin{cases}
    1 & 0 \leq θ \leq 1\\
    0 & \text{otherwise},
  \end{cases}, &
  h(θ) &= \frac{1}{3}\Bigl(\delta(a) + \delta(a-0.5) + \delta(a-1)\Bigr).
\end{align*}
Given a value $a$, the likelihood of measuring $x$ is:
\begin{gather*}
  L(x|a) = f(x-a)
\end{gather*}
from our model.  Thus, our posterior after $N$ independent measurements $\vect{x}$ is
\begin{gather*}
  p_{\vect{x}}(a) \propto \underbrace{\prod_{n}p_{\eta}(x_n - a)}_{\mathcal{L}(\vect{x}|a)}
  p_0(a)
\end{gather*}



## Polynomial Fitting

We will consider data sampled from three different functional models:

\begin{gather*}
  f_1(x) = ax^2\\
  f_2(x) = ax^2 + bx^3\\
  f_3(x) = ax^2 + bx^3 + cx^4\\
  g(x) = 1 - A\cos(Bx + C)
\end{gather*}

We will generate data from model $f_2$, and would like to used Bayesian techniques to
compare the models to develop an intuition about when data can exclude models $f_1$ and
$g$, etc.

## A First Approach

Let's start by generating some data.  We will take the following values:

\begin{gather*}
  a = 1, \qquad b = 0.1, \qquad c = 0.
\end{gather*}

```{code-cell}
%matplotlib inline
import numpy as np, matplotlib.pyplot as plt
#plt.rcParams['figure.figsize'] = (4, 3)

a, b, c = 1, 0.1, 0
dy = 0.1

def f(x, a=1, b=0.1, c=0):
    """Return f(x), the model function."""
    return a*x**2 + b*x**3 + c*x**4

def get_data(x, f, dy=0.1, seed=2):
    """Return some data with gaussian errors."""
    # Use a seed so results are reproducible
    rng = np.random.default_rng(seed=seed)
    y = f(x) + rng.normal(scale=dy, size=x.shape)
    return y

x = np.linspace(-1, 1, 10)
x_ = np.linspace(-1, 1, 100)

y = get_data(x, f, dy=dy)
fig, ax = plt.subplots()

ax.plot(x_, f(x_), label="$f$")
ax.plot(x_, f(x_, a=a, b=0, c=0), label="$f_1$")
ax.errorbar(x, y, yerr=dy, fmt='.');
ax.legend()
ax.set(xlabel='x', ylabel='y');
```

Since everything is linear and Gaussian, the standard least-squares regression should work.  See [`scipy.optimize.curve_fit`][].

```{code-cell}
from scipy.optimize import curve_fit
import scipy
p_opt, p_cov = curve_fit(f, x, y, sigma=dy, 
                         absolute_sigma=True)
print(p_opt), print(np.sqrt(np.diag(p_cov)))
chi_r = np.sqrt(sum(((f(x, *p_opt) - y)/dy)**2) / (len(y) - len(p_opt)))
chi_r
```

## Odds and Ends

### Credible Intervals (Confidence Intervals)

A proper characterization of a Bayesian analysis is by describing the posterior
distribution, but one often desires a simpler characterization.  For a single parameter
$θ$ with posterior $h(θ | x)$, one might report a "95\% credible interval
$[\underline{θ}, \overline{θ}]$" meaning
\begin{gather*}
  \int_{\underline{θ}}^{\overline{θ}}h(θ | x)\d{θ} = 0.95 = 1-\alpha.
\end{gather*}
However, the choice of $\underline{θ}$ and $\overline{θ}$ satisfying this is
not unique.  One possibility is a symmetric interval
\begin{gather*}
  \int_{-\infty}^{\underline{θ}}h(θ | x)\d{θ}
  =
  \int_{\overline{θ}}^{\infty}h(θ | x)\d{θ}
  = \frac{1-0.95}{2} = \frac{\alpha}{2}.
\end{gather*}
This does not generalize to higher dimensions.

The preferred way is to define the highest posterior density (HPD) credible set
$A = \{θ \mid h(θ | x) \geq k\}$ where $k$ is the largest real number such
that $P(A) \geq 0.95 = 1-\alpha$.  This generalizes, but is sensitive to the choice of
parameter $θ$.  Choosing a different parameterization will usually change the HPD
credible set.


### Bayes Factor

Bayes theorem is
\begin{gather*}
  h(θ | x) \propto L(x | θ) h(θ).
\end{gather*}

For a Bayesian, two competing "models" or hypotheses might consist of disjoint parameter
regimes $Θ_0 \cap Θ_1 = \emptyset$ and $Θ_0 \cup Θ_1$ is the
full parameter space.  The probability of these competing hypotheses is
\begin{gather*}
  P(Θ_{i}) = \int_{Θ_{i}} h(θ)\d{θ}, \qquad
  P(Θ_{i}|x) = \int_{Θ_{i}} h(θ|x)\d{θ}.
\end{gather*}
The Bayes factor is
\begin{gather*}
  B(x) = \frac{P(Θ_0|x) / P(Θ_0)}
              {P(Θ_1|x) / P(Θ_1)}.
\end{gather*}
The larger $B(x)$, the more the data $x$ supports hypothesis $Θ_0$.  Part of the
goal of this exercise is to develop some intuition about what this means.

:::::{admonition} Questions
**Q:** What happens if the hypotheses or models overlap?  How does the polynomial
model-selection fit into this formalism? 
**A:** Apparently one should make sure that all models are part of the same space to use
Bayes factors.
:::::

:::::{admonition} Problem 1.4

Given a coin which lands with probability $θ$ as heads.  The problem asks us to
test three hypotheses $Θ_<$, $Θ_=$, and $Θ_>$, that $θ < 0.5$, $θ =
0.5$, and $θ > 0.5$ respectively *(slightly changed notation)*.  The problem asks
us to assume prior probabilities to the hypotheses $P(Θ_<) = P(Θ_>) = 0.25$ and
$P(Θ_=) = 0.5$ with corresponding uniform priors $h(θ \mid Θ_<) = U(0, 0.5)$,
$h(θ \mid H_{>}) = U(0.5, 1)$.

After flipping the coin once and finding heads $x_0 = h$, we are asked to compute the
Bayes factors for the various hypotheses.

First some notations: From the discussion above, we can express the hypotheses as sets $Θ_{<}
= [0, 0.5)$, $Θ_{>} = (0.5, 1]$, and $Θ_{=} = \{0.5\}$ with prior
\begin{gather*}
  h(θ) = \frac{1}{2} + \frac{\delta(θ - 0.5)}{2}.
\end{gather*}
Note that the prior probabilities are just the corresponding "areas":
\begin{gather*}
  1 = \int_{0}^{1}h(θ)\d{θ}
    = \underbrace{\int_{0}^{0.5-0^+}h(θ)\d{θ}}_{P(Θ_<)=1/4}
    + \underbrace{\int_{0.5-0^+}^{0.5+0^+}h(θ)\d{θ}}_{P(Θ_=)=1/2}
    + \underbrace{\int_{0.5+0^+}^{1}h(θ)\d{θ}}_{P(Θ_>)=1/4}\,.
\end{gather*}
The likelihood of getting heads is just $θ$
\begin{gather*}
  L(h | θ) = θ\,.
\end{gather*}
Thus, the normalized posterior distribution is
\begin{gather*}
  h(θ | h) = θ\Bigl(1 + \delta(θ - 0.5)\Bigr) \propto L(h | θ)h(θ).
\end{gather*}
*For further flips, the posterior after $N$ flips with $n$ heads is*
\begin{gather*}
  h(θ | \vect{x}) \propto θ^{n}(1-θ)^{N-n}\Bigl(1 + \delta(θ - 0.5)\Bigr).
\end{gather*}

We can now compute the various probabilities of the competing hypotheses after the flip:
\begin{gather*}
  1 = \int_{0}^{1}h(θ)\d{θ}
    = \overbrace{\int_{Θ_<} h(θ)\d{θ}}^{P(Θ_{<})=1/4}
    + \overbrace{\int_{Θ_=} h(θ)\d{θ}}^{P(Θ_{=})=1/2}
    + \overbrace{\int_{Θ_>} h(θ)\d{θ}}^{P(Θ_{>})=1/4}\,,\\
  1 = \int_{0}^{1}h(θ | h)\d{θ}
    = \underbrace{\int_{Θ_<} h(θ|h)\d{θ}}_{P(Θ_{<}|h)=1/8}
    + \underbrace{\int_{Θ_=} h(θ|h)\d{θ}}_{P(Θ_{=}|h)=1/2}
    + \underbrace{\int_{Θ_>} h(θ|h)\d{θ}}_{P(Θ_{>}|h)=3/8}\,.
\end{gather*}
The relative Bayes factors are thus
\begin{gather*}
  B_{< \text{ vs } =} = \frac{(1/8) / (1/2)}{(1/4) / (1/2)} 
                      = \frac{1}{2} = 0.5,\\ 
  B_{> \text{ vs } =} = \frac{(3/8) / (1/2)}{(1/4) / (1/2)}
                      = \frac{3}{2} = 1.5,\\
  B_{> \text{ vs } <} = \frac{(3/8) / (1/8)}{(1/4) / (1/4)}
                      = 3.
\end{gather*}


  












:::::





[`SCIPY.OPTIMIZE.CURVE_FIT`]: https://docs.scipy.org/doc/scipy/reference/generated/scipy.optimize.curve_fit.html#scipy.optimize.curve_fit

[PDF]: <https://en.wikipedia.org/wiki/Probability_density_function>
[CDF]: <https://en.wikipedia.org/wiki/Cumulative_distribution_function>
[Dirac delta function]: <https://en.wikipedia.org/wiki/Dirac_delta_function>
[marginal likelihood]: <https://en.wikipedia.org/wiki/Marginal_likelihood>
[likelihood function]: <https://en.wikipedia.org/wiki/Likelihood_function>
[maximum likelihood estimation]: <https://en.wikipedia.org/wiki/Maximum_likelihood_estimation>
[binomial distribution]: <https://en.wikipedia.org/wiki/Binomial_distribution>
