
[Optimization] MLE & MAP

· 3 min read

One of the most important tasks in ML is to optimize parameters in a model. There are various approaches and I'll explain the most fundamental methods (MLE & MAP) in this post.

Context

These concepts work on a [[Probabilistic Model]]. We assume that the data distribution is parametrized by $\theta$ and that the likelihood of i.i.d. observations $\mathbf{x}$ is defined as below:

$$P(\mathbf{x}\vert\theta) = \prod_i p(x_i\vert \theta)$$
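As a concrete illustration (a toy sketch of my own, assuming a unit-variance Gaussian model with mean $\theta$), the likelihood of i.i.d. data is simply the product of the per-point densities:

```python
import math

def likelihood(xs, theta):
    """Likelihood of i.i.d. data under a Gaussian with mean `theta`, unit variance."""
    def pdf(x):
        return math.exp(-(x - theta) ** 2 / 2) / math.sqrt(2 * math.pi)
    prod = 1.0
    for x in xs:
        prod *= pdf(x)  # independence: multiply per-point densities
    return prod

xs = [0.9, 1.1, 1.0]
# The likelihood is larger for theta near the sample mean than far from it.
print(likelihood(xs, 1.0) > likelihood(xs, 3.0))  # True
```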

Optimization Methods

MLE

Maximum Likelihood Estimation (MLE) maximizes the likelihood $P(\mathbf{x}\vert\theta)$ with respect to the model parameter $\theta$. The result of MLE is a point estimate of the model parameter.

$$\theta_{MLE}=\arg\max_\theta P(\mathbf{x}\vert\theta)=\arg\max_\theta\prod_i p(x_i\vert\theta)$$

This fits the model to the observed data. However, MLE is susceptible to overfitting when the data is insufficient. For example, some rare events might not appear in the dataset, and MLE assigns zero probability to those unobserved events.
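A minimal sketch of this failure mode, assuming a categorical model fit by empirical frequencies (which is the MLE for categorical data):

```python
from collections import Counter

def mle_categorical(data):
    """MLE for a categorical distribution: the empirical frequencies."""
    counts = Counter(data)
    n = len(data)
    return {k: c / n for k, c in counts.items()}

rolls = [1, 2, 2, 3, 3, 3]   # die rolls; faces 4-6 were never observed
p = mle_categorical(rolls)
print(p.get(6, 0.0))  # 0.0 -- MLE assigns zero probability to the unseen face 6
```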

MAP

Unlike MLE, Maximum A Posteriori (MAP) estimation utilizes prior knowledge about the data distribution. The prior knowledge is represented as a distribution $P(\theta)$ over the parameter $\theta$. Given the observations $\mathbf{x}$, we update our knowledge of $\theta$ by applying Bayes' theorem.

$$P(\theta\vert \mathbf{x})=\frac{P(\mathbf{x}\vert\theta)P(\theta)}{P(\mathbf{x})} \propto P(\mathbf{x}\vert\theta)P(\theta)$$

With the updated distribution over $\theta$, the most plausible value is $\arg\max_\theta P(\theta\vert \mathbf{x})$. This can be rewritten in terms of the likelihood:

$$\theta_{MAP}=\arg\max_\theta P(\theta\vert \mathbf{x})=\arg\max_\theta \left(\prod_i p(x_i\vert\theta)\right) p(\theta)$$

Relationship between MLE and MAP

If the prior distribution over $\theta$ is uniform, MAP reduces to MLE: the prior term $p(\theta)$ is constant and does not affect the maximization.
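A quick sketch of both points using the closed-form Bernoulli/Beta case (the conjugate-prior formulas are standard; the specific numbers are illustrative):

```python
def mle_heads(heads, tails):
    """MLE for the Bernoulli parameter: the observed frequency."""
    return heads / (heads + tails)

def map_heads(heads, tails, a, b):
    """MAP for the Bernoulli parameter with a Beta(a, b) prior (closed form:
    the posterior mode (heads + a - 1) / (n + a + b - 2))."""
    return (heads + a - 1) / (heads + tails + a + b - 2)

# 3 heads out of 3 flips: MLE overfits to 1.0; a Beta(2, 2) prior pulls it back.
print(mle_heads(3, 0))          # 1.0
print(map_heads(3, 0, 2, 2))    # 0.8
print(map_heads(3, 0, 1, 1))    # 1.0 -- a uniform Beta(1, 1) prior recovers MLE
```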

MLE with a Latent Variable Model

In a simple model without latent variables, MLE finds the optimal $\theta$ that maximizes the likelihood $p_\theta(\mathbf{x})$. Now suppose we introduce latent variables $\mathbf{z}$ into the model. We can still apply MLE by maximizing the likelihood, provided it is tractable:

$$\begin{aligned} \text{log likelihood} &= \log p_\theta (\mathbf{x})\\ &= \log \int p_\theta(\mathbf{x},\mathbf{z})\, d\mathbf{z} \end{aligned}$$

However, MLE is inapplicable in most latent variable models, for two main reasons:

  1. Even for a fixed $\theta$, the integral over the latent variable $\mathbf{z}$ is often intractable. (Evaluating the likelihood is intractable.)
  2. Even if we can compute the log likelihood for every value of $\theta$, maximizing it as a function of $\theta$ can be a difficult non-convex optimization problem. (We cannot guarantee finding the global optimum.)
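When the latent variable is discrete, the integral becomes a finite sum and the log likelihood is tractable. The sketch below (a toy two-component Gaussian mixture of my own choosing) evaluates that sum and maximizes it by brute-force grid search, which sidesteps, but does not solve, the non-convexity issue:

```python
import math

def log_marginal(x, theta):
    """log p(x) for a two-component Gaussian mixture with means +theta and -theta.
    The latent z picks the component; since z is discrete, the integral over z
    reduces to a tractable sum of two terms."""
    def log_gauss(x, mu):
        return -0.5 * (x - mu) ** 2 - 0.5 * math.log(2 * math.pi)
    # p(x) = 0.5 * N(x; +theta, 1) + 0.5 * N(x; -theta, 1), via log-sum-exp
    terms = [math.log(0.5) + log_gauss(x, theta),
             math.log(0.5) + log_gauss(x, -theta)]
    m = max(terms)
    return m + math.log(sum(math.exp(t - m) for t in terms))

data = [2.1, -1.9, 2.0, -2.0]
# Crude MLE by grid search over theta; the objective is non-convex in theta.
best = max((sum(log_marginal(x, t) for x in data), t)
           for t in (i / 10 for i in range(0, 41)))
print(round(best[1], 1))  # 2.0 -- the component means sit near +/-2
```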

Even if we somehow find an optimal model parameter $\theta_{max}$, we cannot use it directly for prediction. To construct the posterior predictive distribution $p_{\theta_{max}}(x_{new}\vert \mathbf{x})$, we need the posterior over the latent variables, $p_{\theta_{max}}(\mathbf{z} \vert \mathbf{x})$. However, the denominator of this posterior is the likelihood $p_{\theta_{max}}(\mathbf{x})=\int p_{\theta_{max}}(\mathbf{x},\mathbf{z})\, d\mathbf{z}$, which is intractable in general.

$$\begin{aligned} p_{\theta_{max}}(\mathbf{z}\vert \mathbf{x}) &= \frac{p_{\theta_{max}}(\mathbf{x},\mathbf{z})}{\int_{\mathbf{z}} p_{\theta_{max}}(\mathbf{x},\mathbf{z})\, d\mathbf{z}}\\ p_{\theta_{max}}(x_{new} \vert \mathbf{x})&= \int_{\mathbf{z}}p_{\theta_{max}}(x_{new} \vert \mathbf{z})\, p_{\theta_{max}}(\mathbf{z} \vert \mathbf{x})\, d\mathbf{z} \end{aligned}$$
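A toy numeric sketch of these two formulas (an illustrative discrete-latent model and numbers of my own choosing), where both integrals reduce to finite sums:

```python
import math

def gauss(x, mu):
    """Density of N(mu, 1) at x."""
    return math.exp(-0.5 * (x - mu) ** 2) / math.sqrt(2 * math.pi)

# Two-component model with one observation x and a discrete latent z in {0, 1}:
# p(z) = 0.5 each, p(x|z) = N(x; mu_z, 1) with component means mu = (-2, 2).
mu = (-2.0, 2.0)
x = 1.5

# Posterior p(z|x) = p(x, z) / sum_z p(x, z): the integral is a finite sum here.
joint = [0.5 * gauss(x, m) for m in mu]
posterior = [j / sum(joint) for j in joint]

# Posterior predictive p(x_new|x) = sum_z p(x_new|z) p(z|x).
def predictive(x_new):
    return sum(gauss(x_new, m) * w for m, w in zip(mu, posterior))

print(round(posterior[1], 3))  # near 1: x = 1.5 almost surely came from mu = 2
```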