One of the most important tasks in ML is to optimize the parameters of a model. There are various approaches, and in this post I'll explain the most fundamental methods: maximum likelihood estimation (MLE) and maximum a posteriori (MAP) estimation.
Context
These concepts operate on a [[Probabilistic Model]]. We assume that the data distribution is parametrized by $\theta$, so for a dataset of i.i.d. observations $\mathcal{D} = \{x_1, \dots, x_N\}$ the likelihood is defined as below:

$$p(\mathcal{D} \mid \theta) = \prod_{i=1}^{N} p(x_i \mid \theta)$$
Optimization Methods
MLE
Maximum Likelihood Estimation (MLE) maximizes the likelihood by optimizing the model parameter $\theta$. The result of MLE is a point estimate of the model parameter.
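In symbols, the MLE point estimate is

$$\hat{\theta}_{\text{MLE}} = \arg\max_{\theta} p(\mathcal{D} \mid \theta) = \arg\max_{\theta} \sum_{i=1}^{N} \log p(x_i \mid \theta)$$

where the log is taken for numerical convenience; it does not change the maximizer.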
This fits the model to the observed data. However, MLE is susceptible to overfitting when the data is insufficient: rare events might not appear in the dataset at all, and MLE then assigns them exactly zero probability.
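As a minimal sketch of this failure mode (the die data below is hypothetical), the MLE of a categorical distribution is just the empirical frequency, so a face that never shows up in the sample gets probability exactly zero:

```python
import numpy as np

# Hypothetical rolls of a six-sided die; face 6 never appears in the sample.
data = np.array([1, 2, 2, 3, 1, 4, 5, 3, 2, 1])

# For a categorical distribution, the MLE has a closed form:
# theta_k = count(k) / N, i.e. the empirical frequency of each face.
counts = np.bincount(data, minlength=7)[1:]   # counts for faces 1..6
theta_mle = counts / counts.sum()

print(theta_mle)     # [0.3 0.3 0.2 0.1 0.1 0. ]
print(theta_mle[5])  # 0.0 -- the unobserved face 6 gets zero probability
```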
MAP
Unlike MLE, MAP utilizes prior knowledge about the data distribution. The prior knowledge is represented as a distribution over the parameter $\theta$, written $p(\theta)$. Given the observations $\mathcal{D}$, we can update our knowledge of $\theta$ by applying Bayes' theorem:

$$p(\theta \mid \mathcal{D}) = \frac{p(\mathcal{D} \mid \theta)\, p(\theta)}{p(\mathcal{D})}$$
With the updated distribution on $\theta$, the most plausible value is $\hat{\theta}_{\text{MAP}} = \arg\max_{\theta} p(\theta \mid \mathcal{D})$. Since the evidence $p(\mathcal{D})$ does not depend on $\theta$, this formula can be rewritten in terms of the likelihood:

$$\hat{\theta}_{\text{MAP}} = \arg\max_{\theta} p(\mathcal{D} \mid \theta)\, p(\theta)$$
If the prior distribution on $\theta$ is a uniform distribution, MAP reduces to MLE.
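To see both estimates side by side, here is a minimal sketch with a hypothetical coin-flip dataset and an assumed Beta prior (the conjugate prior for a Bernoulli likelihood, which gives both estimates a closed form):

```python
import numpy as np

# Hypothetical coin flips: 9 heads, 1 tail.
flips = np.array([1, 1, 1, 1, 1, 1, 1, 1, 1, 0])
n_heads, n = int(flips.sum()), len(flips)

# MLE: the maximizer of the Bernoulli likelihood is heads / N.
theta_mle = n_heads / n

# MAP with a Beta(a, b) prior: the posterior is Beta(a + heads, b + tails),
# whose mode gives the closed-form MAP estimate below.
a, b = 2.0, 2.0  # assumed prior pseudo-counts, chosen for illustration
theta_map = (n_heads + a - 1) / (n + a + b - 2)

print(theta_mle)  # 0.9
print(theta_map)  # 10/12 = 0.833... -- pulled toward the prior mean 0.5

# With a uniform prior Beta(1, 1), MAP reduces to MLE:
a, b = 1.0, 1.0
print((n_heads + a - 1) / (n + a + b - 2))  # 0.9, identical to theta_mle
```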
MLE with Latent Variable Models
In a simple model without latent variables, MLE finds the optimal $\theta$ that maximizes the likelihood $p(x \mid \theta)$. Now, we introduce latent variables $z$ to the model. We can still apply MLE by maximizing the marginal likelihood, if it is tractable.
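Concretely, the objective becomes the marginal likelihood, which requires integrating out $z$:

$$\hat{\theta} = \arg\max_{\theta} \log p(x \mid \theta) = \arg\max_{\theta} \log \int p(x, z \mid \theta)\, dz$$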
However, MLE is inapplicable in most latent variable models, for two main reasons:
- Even for a fixed $\theta$, the integral over the latent variable $z$ is often intractable. (Evaluating the likelihood itself is intractable.)
- Even if we can calculate the log-likelihood for all values of $\theta$, maximizing it as a function of $\theta$ can be a difficult non-convex optimization problem. (We cannot guarantee finding the global optimum; see the sketch below.)
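When $z$ is discrete the integral reduces to a finite sum, so the likelihood is tractable, but the second problem remains. A minimal sketch with a two-component Gaussian mixture (assumed means, mixing weight, and unit variance) shows the non-convexity:

```python
import numpy as np
from scipy.stats import norm

# Hypothetical data: two clusters of 50 points each.
rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(-2.0, 1.0, 50), rng.normal(2.0, 1.0, 50)])

def log_likelihood(mu1, mu2, x, pi=0.5, sigma=1.0):
    """Marginal log-likelihood of a two-component Gaussian mixture.

    The latent cluster assignment z is discrete, so integrating it out
    is just a two-term sum per data point:
    p(x | theta) = pi * N(x; mu1, sigma) + (1 - pi) * N(x; mu2, sigma).
    """
    p = pi * norm.pdf(x, mu1, sigma) + (1 - pi) * norm.pdf(x, mu2, sigma)
    return np.log(p).sum()

# Swapping the two means yields exactly the same value, so the objective
# has (at least) two symmetric maxima -- it is non-convex in theta.
print(log_likelihood(-2.0, 2.0, x))
print(log_likelihood(2.0, -2.0, x))  # identical by symmetry
```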
Even if we somehow find an optimal model parameter $\hat{\theta}$, we cannot readily use it for prediction. To construct the posterior predictive distribution $p(x^{\ast} \mid x, \hat{\theta})$, we need the posterior over the latent variable, $p(z \mid x, \hat{\theta})$. However, the denominator of this posterior is the likelihood $p(x \mid \hat{\theta})$, which is intractable in general.
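Spelling this out, both quantities hinge on the same intractable normalizer:

$$p(z \mid x, \theta) = \frac{p(x, z \mid \theta)}{p(x \mid \theta)}, \qquad p(x \mid \theta) = \int p(x, z \mid \theta)\, dz$$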