**Likelihood** is the hypothetical probability that an event which has already occurred would yield a particular outcome.
The concept differs from probability in that a probability refers to the occurrence of future events,
while a likelihood refers to past events with known outcomes.

In statistical inference, where a random sample has been realized with certain sample values, likelihood refers to the hypothetical probability that the random sample yields such observations, under some probability model of the population.

A **likelihood function** is the probability (density) of a realized sample, viewed as a function of the parameter
of an assumed probability density. For an i.i.d. sample:
$$L(\theta) = P \{\mathbf{X}=\mathbf{x}|X \sim f(x;\theta) \} = \prod_{i=1}^n f(x_i;\theta)$$
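
To make the definition concrete, here is a minimal Python sketch that evaluates the likelihood and the log-likelihood of an i.i.d. sample. The normal model with known standard deviation, the sample values, and the names `normal_pdf`, `likelihood`, and `log_likelihood` are illustrative assumptions, not part of the definition.

```python
import math

def normal_pdf(x, theta, sigma=1.0):
    """Density of N(theta, sigma^2) at x (assumed model for illustration)."""
    z = (x - theta) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

def likelihood(theta, sample, sigma=1.0):
    """L(theta) = product of f(x_i; theta) over the sample."""
    return math.prod(normal_pdf(x, theta, sigma) for x in sample)

def log_likelihood(theta, sample, sigma=1.0):
    """log L(theta) = sum of log f(x_i; theta); numerically safer for large n."""
    return sum(math.log(normal_pdf(x, theta, sigma)) for x in sample)

sample = [1.2, 0.7, 1.9, 1.1]   # hypothetical observations
for theta in (0.0, 1.0, 2.0):
    print(theta, likelihood(theta, sample), log_likelihood(theta, sample))
```

The sample is held fixed and only `theta` varies, which is what distinguishes a likelihood from a probability over outcomes.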

**Maximum likelihood** is the procedure of searching through the parameter space for a probability model
that maximizes the likelihood of current observations.

**Maximum likelihood estimator** (MLE) of a model parameter is the maximizer of the likelihood function:
$$\hat{\boldsymbol{\theta}}(\mathbf{X}) = \arg \sup_{\boldsymbol{\theta}} L( \boldsymbol{\theta} | \mathbf{X} )$$

MLE is an optimization estimator: its objective function is the likelihood of the current observations as a function of the model parameters, and its feasible domain is the parameter space. From another perspective, the feasible domain is a space of parametric probability models, and the objective is to search that space for the model that best explains the current observations.
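
The optimization view can be sketched numerically. The example below assumes a Bernoulli model and maximizes its log-likelihood over the parameter space $(0,1)$ with a golden-section search, then compares the result with the closed-form MLE (the sample mean). The data and the helper names `bernoulli_log_likelihood` and `golden_section_max` are hypothetical; the search is valid here only because this log-likelihood is unimodal.

```python
import math

def bernoulli_log_likelihood(p, sample):
    """Log-likelihood of i.i.d. Bernoulli(p) observations (0/1 values)."""
    successes = sum(sample)
    failures = len(sample) - successes
    return successes * math.log(p) + failures * math.log(1.0 - p)

def golden_section_max(f, lo, hi, tol=1e-10):
    """Maximize a unimodal function f on [lo, hi] by golden-section search."""
    inv_phi = (math.sqrt(5.0) - 1.0) / 2.0
    a, b = lo, hi
    c = b - inv_phi * (b - a)
    d = a + inv_phi * (b - a)
    while b - a > tol:
        if f(c) > f(d):
            b = d          # maximum lies in [a, d]
        else:
            a = c          # maximum lies in [c, b]
        c = b - inv_phi * (b - a)
        d = a + inv_phi * (b - a)
    return (a + b) / 2.0

sample = [1, 0, 0, 1, 1, 0, 1, 1]          # hypothetical data
p_hat = golden_section_max(
    lambda p: bernoulli_log_likelihood(p, sample), 1e-9, 1.0 - 1e-9
)
print(p_hat, sum(sample) / len(sample))    # numerical vs closed-form MLE
```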

Difficulties of optimization:

- finding and verifying the global maximum;
- numerical sensitivity of the maximum.

**Induced likelihood function** of a parameter dependent on the model parameter
is the maximum likelihood of observations within a family of models indexed by the induced parameter.
Symbolically, given induced parameter $\eta = g(\boldsymbol{\theta})$,
the induced likelihood function of $\eta$ is:
$$L^{*}(\eta|\mathbf{x}) = \sup_{ \boldsymbol{\theta}: g(\boldsymbol{\theta}) = \eta } L(\boldsymbol{\theta}|\mathbf{x})$$

The **MLE of induced parameter** is the induced parameter that maximizes the induced likelihood function:
$$\hat{\eta}(\mathbf{X}) = \arg \sup_{\eta} L^{*}( \eta | \mathbf{X} )$$

Theorem: (**Invariance property of MLEs**)
The MLE of any induced parameter equals the induced value of the MLE of the model parameter.
$$\widehat{g(\boldsymbol{\theta})} = g(\hat{\boldsymbol{\theta}}), \forall g$$
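
The invariance property can be checked numerically on a small case. The sketch below assumes a Bernoulli model with the odds $\eta = p/(1-p)$ as the induced parameter, maximizes the induced likelihood over a grid of $\eta$ values, and compares the result with $g(\hat{p})$. The data, the grid, and the function names are hypothetical choices for illustration.

```python
import math

def log_lik_p(p, successes, n):
    """Bernoulli log-likelihood in the original parameter p."""
    return successes * math.log(p) + (n - successes) * math.log(1.0 - p)

def odds(p):
    """Induced parameter eta = g(p) = p / (1 - p)."""
    return p / (1.0 - p)

def induced_log_lik(eta, successes, n):
    """g is one-to-one, so the sup over {p : g(p) = eta} is at p = eta/(1+eta)."""
    return log_lik_p(eta / (1.0 + eta), successes, n)

successes, n = 3, 10                       # hypothetical data
p_hat = successes / n                      # closed-form MLE of p

# Maximize the induced likelihood directly over a fine grid of eta values.
grid = [k / 10000.0 for k in range(1, 50001)]      # eta in (0, 5]
eta_hat = max(grid, key=lambda e: induced_log_lik(e, successes, n))

print(eta_hat, odds(p_hat))                # should agree up to grid resolution
```

Because $g$ is one-to-one here, the supremum in the induced likelihood is taken over a single point; for a many-to-one $g$ the inner supremum would need its own search.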

MLEs are consistent under mild regularity conditions.

Theorem: The MLE of an induced parameter is consistent, pointwise in the parameter space, if the induced parameter is a continuous function of the parameter and the following assumptions hold:

- The parameter is identifiable.
- The parametric model has common support and is differentiable in parameter space.
- The true parameter is an interior point in parameter space.

Symbolically, if $g(\boldsymbol{\theta}) \in C^0(\boldsymbol{\Theta},\mathbb{R})$, and

- $\boldsymbol{\theta} \ne \boldsymbol{\theta}' \implies f(x|\boldsymbol{\theta}) \ne f(x|\boldsymbol{\theta}')$
- $\forall x \in \Omega, \nabla_{\boldsymbol{\theta}} f(x;\boldsymbol{\theta})$ exists.
- $\exists \varepsilon > 0: B_{\varepsilon}(\boldsymbol{\theta}_0) \subseteq \boldsymbol{\Theta}$

Then, $$g(\hat{\boldsymbol{\theta}}) \overset{p}{\to} g(\boldsymbol{\theta}), \forall \boldsymbol{\theta} \in \boldsymbol{\Theta}$$
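
A small simulation can illustrate what the theorem asserts. The sketch below assumes a Bernoulli model with a hypothetical true parameter $p = 0.3$ and the continuous induced parameter $g(p) = p/(1-p)$, and reports the estimation error of $g(\hat{p})$ for growing sample sizes. The seed and sample sizes are arbitrary; a single simulated path only suggests, and does not prove, convergence in probability.

```python
import random

def bernoulli_mle(n, p, rng):
    """MLE of p from n simulated Bernoulli(p) draws: the sample mean."""
    return sum(1 for _ in range(n) if rng.random() < p) / n

def odds(p):
    """A continuous induced parameter g(p) = p / (1 - p) on (0, 1)."""
    return p / (1.0 - p)

rng = random.Random(0)          # fixed seed, chosen arbitrarily
p_true = 0.3                    # hypothetical true parameter
errors = {}
for n in (100, 1000, 10000, 100000):
    p_hat = bernoulli_mle(n, p_true, rng)
    errors[n] = abs(odds(p_hat) - odds(p_true))
    print(n, p_hat, errors[n])
```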

MLEs are asymptotically efficient under additional regularity conditions.

Theorem: The MLE of an induced parameter is asymptotically efficient, pointwise in the parameter space, if, in addition to all the conditions for consistency, the following assumptions hold:

- $\forall x \in \Omega, f(x|\boldsymbol{\theta}) \in C^3 ( \boldsymbol{\Theta}, \mathbb{R} )$, and $\int f(x|\boldsymbol{\theta}) \mathrm{d} x$ is three times differentiable under the integral sign.
- $\forall \boldsymbol{\theta}_0 \in \boldsymbol{\Theta}, \exists c>0, M(x): \lVert \nabla^3 \log f(x|\boldsymbol{\theta}) \rVert \leq M(x), \forall x \in \Omega, \boldsymbol{\theta} \in B_c(\boldsymbol{\theta}_0)$ and $\mathbb{E}_{\boldsymbol{\theta}_0} M(X) < \infty$
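
Asymptotic efficiency means that the variance of the MLE, scaled by $n$, approaches the Cramér-Rao lower bound. The simulation below is a rough check under an assumed exponential model with rate $\lambda$, where the MLE is $1/\bar{X}$ and the Fisher information per observation is $1/\lambda^2$. The true rate, sample size, number of replications, and seed are all hypothetical choices, and the agreement is only approximate at finite $n$.

```python
import random
import statistics

def exp_rate_mle(n, rate, rng):
    """MLE of the exponential rate: reciprocal of the sample mean."""
    draws = [rng.expovariate(rate) for _ in range(n)]
    return n / sum(draws)

rng = random.Random(1)              # fixed seed, chosen arbitrarily
rate_true = 2.0                     # hypothetical true parameter
n, replications = 200, 2000

estimates = [exp_rate_mle(n, rate_true, rng) for _ in range(replications)]
scaled_variance = n * statistics.variance(estimates)

# Fisher information per observation is 1 / rate^2, so the bound is rate^2.
cramer_rao_bound = rate_true ** 2
print(scaled_variance, cramer_rao_bound)
```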

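When the MLE is unique, it is a function of every sufficient statistic: by the factorization theorem, $L(\boldsymbol{\theta}|\mathbf{x}) = h(\mathbf{x}) \, k(T(\mathbf{x});\boldsymbol{\theta})$, so the maximizer depends on $\mathbf{x}$ only through $T(\mathbf{x})$.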

Methods for finding MLEs:

- differentiation
- direct maximization (a unique, attainable global upper bound)
- log-likelihood (the MLE also solves the score equation, i.e. the gradient of the log-likelihood function set to zero)
- successive maximizations
- the EM algorithm
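
As one illustration of solving the score equation when no closed form is available, the sketch below applies Newton-Raphson to a logistic location model with unit scale. For this model the score contribution of each observation is $2F(x_i-\theta)-1$, with $F$ the standard logistic CDF, and the log-likelihood is strictly concave, so the root is the MLE. The data, the median starting point, and the function names are assumptions made for the example.

```python
import math
import statistics

def score(theta, sample):
    """Gradient of the logistic(theta, 1) log-likelihood with respect to theta."""
    return sum(2.0 / (1.0 + math.exp(-(x - theta))) - 1.0 for x in sample)

def score_derivative(theta, sample):
    """Second derivative of the log-likelihood; negative for every theta."""
    total = 0.0
    for x in sample:
        F = 1.0 / (1.0 + math.exp(-(x - theta)))
        total -= 2.0 * F * (1.0 - F)
    return total

def newton_mle(sample, tol=1e-12, max_iter=50):
    """Solve score(theta) = 0 by Newton-Raphson, starting from the sample median."""
    theta = statistics.median(sample)
    for _ in range(max_iter):
        step = score(theta, sample) / score_derivative(theta, sample)
        theta -= step
        if abs(step) < tol:
            break
    return theta

sample = [0.3, 1.8, -0.6, 2.4, 1.1, 0.9, 3.0]   # hypothetical data
theta_hat = newton_mle(sample)
print(theta_hat, score(theta_hat, sample))       # score should be close to zero
```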

Using the second-derivative condition to verify a maximum requires showing that the Hessian matrix is negative definite at the candidate point, which can be laborious when there are several parameters.

It is important to analyze the likelihood function as far as possible, to find the number and nature of its local maxima, before resorting to numerical maximization.
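
A coarse grid scan is one simple way to look for multiple local maxima before handing the problem to a local optimizer. The sketch below assumes a Cauchy location model with unit scale and two deliberately separated, hypothetical observations; for such data the log-likelihood has one local maximum near each observation. The grid range, step, and function names are arbitrary choices.

```python
import math

def cauchy_log_lik(theta, sample):
    """Log-likelihood of a Cauchy location model with unit scale (constants dropped)."""
    return -sum(math.log(1.0 + (x - theta) ** 2) for x in sample)

def local_maxima_on_grid(f, lo, hi, step):
    """Return grid points that are strictly higher than both neighbours."""
    n = int(round((hi - lo) / step))
    grid = [lo + i * step for i in range(n + 1)]
    values = [f(t) for t in grid]
    return [grid[i] for i in range(1, n)
            if values[i] > values[i - 1] and values[i] > values[i + 1]]

sample = [-10.0, 10.0]          # hypothetical, deliberately separated observations
maxima = local_maxima_on_grid(lambda t: cauchy_log_lik(t, sample), -15.0, 15.0, 0.01)
print(maxima)                   # candidate starting points for a local optimizer
```

A scan like this only finds maxima wider than the grid step, so it complements rather than replaces an analytical study of the likelihood.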

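The EM (expectation-maximization) algorithm is suited to finding MLEs in missing-data problems. It constructs a sequence of parameter estimates along which the likelihood never decreases; under regularity conditions the sequence converges to a stationary point of the likelihood, which is the MLE when the likelihood is unimodal.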
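
A minimal EM sketch, assuming a two-component normal mixture with known unit variances in which the component labels play the role of the missing data. The simulated data, starting values, iteration count, and function names are hypothetical; the point is only to show the E-step and M-step and to record the log-likelihood at each iteration.

```python
import math
import random

def normal_pdf(x, mu):
    """Density of N(mu, 1); unit variance is assumed known."""
    return math.exp(-0.5 * (x - mu) ** 2) / math.sqrt(2.0 * math.pi)

def log_lik(sample, w, mu1, mu2):
    """Observed-data log-likelihood of the two-component mixture."""
    return sum(math.log(w * normal_pdf(x, mu1) + (1.0 - w) * normal_pdf(x, mu2))
               for x in sample)

def em_step(sample, w, mu1, mu2):
    """One EM iteration: E-step responsibilities, then closed-form M-step."""
    resp = []
    for x in sample:
        a = w * normal_pdf(x, mu1)
        b = (1.0 - w) * normal_pdf(x, mu2)
        resp.append(a / (a + b))            # P(component 1 | x, current parameters)
    total = sum(resp)
    n = len(sample)
    new_w = total / n
    new_mu1 = sum(r * x for r, x in zip(resp, sample)) / total
    new_mu2 = sum((1.0 - r) * x for r, x in zip(resp, sample)) / (n - total)
    return new_w, new_mu1, new_mu2

rng = random.Random(0)                      # fixed seed, chosen arbitrarily
# Hypothetical data: the component labels are the "missing" part.
sample = ([rng.gauss(-2.0, 1.0) for _ in range(150)]
          + [rng.gauss(3.0, 1.0) for _ in range(100)])

w, mu1, mu2 = 0.5, -1.0, 1.0                # arbitrary starting values
history = [log_lik(sample, w, mu1, mu2)]
for _ in range(100):
    w, mu1, mu2 = em_step(sample, w, mu1, mu2)
    history.append(log_lik(sample, w, mu1, mu2))

print(w, mu1, mu2, history[-1])
```

The recorded `history` should be non-decreasing, which is the monotonicity property of EM; a different starting point may lead to a different stationary point.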

| Parametric model | MLE |
|---|---|
| $N(\theta,b^2)$ | $\hat{\theta} = \bar{X}$ |
| $N(a,\sigma^2)$ | $\widehat{\sigma^2} = \frac{1}{n} \sum_{i=1}^n (X_i-a)^2$ |
| $N(\theta,\sigma^2)$ | $\left( \hat{\theta},\widehat{\sigma^2} \right) = \left( \bar{X},\frac{1}{n} \sum_{i=1}^n (X_i- \bar{X})^2 \right)$ |
| $\mathrm{Bernoulli}(p)$ | $\hat{p}=\bar{X}$ |
| $U(0,\theta)$ | $\hat{\theta} = X_{(n)}$ |
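
The closed-form estimators in the table translate directly into code. The helper names and the data below are hypothetical; the functions simply evaluate the formulas above.

```python
def mle_normal_mean(sample):
    """N(theta, b^2) with known variance: MLE of theta is the sample mean."""
    return sum(sample) / len(sample)

def mle_normal_variance(sample, mean=None):
    """MLE of sigma^2; pass `mean` when it is known, otherwise the sample mean is used.

    Note the divisor n, not n - 1.
    """
    centre = mle_normal_mean(sample) if mean is None else mean
    return sum((x - centre) ** 2 for x in sample) / len(sample)

def mle_bernoulli(sample):
    """Bernoulli(p): MLE of p is the proportion of ones."""
    return sum(sample) / len(sample)

def mle_uniform_upper(sample):
    """U(0, theta): MLE of theta is the sample maximum X_(n)."""
    return max(sample)

data = [1.0, 2.0, 3.0, 4.0]                 # hypothetical observations
print(mle_normal_mean(data))                # 2.5
print(mle_normal_variance(data))            # 1.25
print(mle_normal_variance(data, mean=2.0))  # 1.5
print(mle_uniform_upper(data))              # 4.0
print(mle_bernoulli([1, 0, 1, 1]))          # 0.75
```

The $U(0,\theta)$ case is one where differentiation does not apply: the likelihood $\theta^{-n}$ is decreasing on $\theta \ge X_{(n)}$ and zero below it, so the maximum is found by direct inspection.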