Variational inference is a method to approximate the posterior $p(z \mid x)$. It is a key technique behind the Variational AutoEncoder (VAE), one of the most famous generative models.
Introduction
When we deal with posteriors, they are often intractable because the normalizing constant requires an integration over the latent space. Historically, people used Monte Carlo (MC) integration to approximate it. However, MC approximation is costly.
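To make the problem concrete, here is the quantity in question, written with $z$ as the latent variable and $x$ as the observation (a notational convention assumed throughout this note):

$$p(z \mid x) = \frac{p(x \mid z)\, p(z)}{p(x)}, \qquad p(x) = \int p(x \mid z)\, p(z)\, dz$$

The integral in the denominator, $p(x)$, is the normalizing constant that is usually intractable.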
Variational Inference (VI) was devised to replace this costly approximation with an optimization problem. The main benefit of VI is that we do not need to know the normalizing constant, which is exactly the quantity that makes the posterior hard to compute. Although VI is mainly used to approximate posteriors, it can be used to approximate any distribution whose normalizing constant is unknown. VI defines a family of distributions, called the variational family, and optimizes a measure of distributional difference, the KL-divergence.
- We can transform the approximation task into an optimization task by parametrizing the variational distributions.
- We don't need the normalizing constant because we use the KL-divergence.
In short, VI is used when we have a distribution that we only know up to its normalizing constant. We define a parametrized distribution and optimize it to minimize the KL divergence to the target.
Key Terminologies
Before we delve into Variational Inference (VI), it is helpful to clarify some keywords: inference, evidence, evaluation, prediction, variational, and learning.
What is Statistical Inference?
Statistical inference is the process of using a sample to infer the properties of a population. In a [[Probabilistic Model#Latent Variable Model|latent variable model]], we believe that there are latent variables that control the observed events. Therefore, inferring the underlying properties can be interpreted as updating the latent variables ($z$) based on the observations ($x$). Bayes' rule tells us how to update the information on the latent variables.
To assess the results of modeling and inference, we would like to know how well a model fits the observed data $x$. We can quantify the fit between a model and the observed data via the marginal likelihood, or evidence, $p(x)$.
Once the model parameter is determined, we can make predictions for new data with the posterior predictive distribution, which is obtained by replacing the prior on the latent variable with the posterior.
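Written out, with $x^{*}$ denoting a new data point (a symbol introduced here just for illustration):

$$p(x^{*} \mid x) = \int p(x^{*} \mid z)\, p(z \mid x)\, dz$$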
Why Variational?
In functional analysis, a functional is a mapping from a space into a single (real or complex) value. The domain can itself be a space of functions; in that case, the functional is a function that takes another function as input and outputs a single value.
Many problems involve finding an optimal input function that maximizes/minimizes the functional. The mathematical techniques developed to solve this type of problem are collectively known as the calculus of variations.
VI defines a KL divergence, which is a functional: it takes two functions (probability distributions) as input and returns a single value (the difference between them). VI then finds the optimal input function (the variational distribution) that minimizes this KL-divergence. This process is a kind of calculus of variations, which is why we call it Variational Inference.
Learning and Parametrized Model
We introduced the probabilistic model for ML problems, and [[Probabilistic Model#Parametrized Data Distribution|parametrized the model]] to learn from the data. With the parametrized model, learning can be defined as the process of finding an optimal model parameter $\theta$ from the observed data $x$. This means optimizing $\theta$ to maximize the evidence $p_\theta(x)$. For ease of computation, we maximize the log evidence.
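In symbols, using $\theta$ for the model parameter as defined below:

$$\theta^{*} = \arg\max_{\theta}\, \log p_\theta(x) = \arg\max_{\theta}\, \log \int p_\theta(x, z)\, dz$$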
VI is a general approach to calculate an approximate posterior, and it can be applied whether the probabilistic model is parametrized or not. However, it is convenient to keep the parametrized notation for the analyses below.
- $\theta$ is a model parameter, introduced to define a probabilistic model $p_\theta(x, z)$.
- $\phi$ is a variational parameter, introduced to define a variational family $q_\phi(z)$ for VI.
Variational Inference
Variational inference is a method to calculate an approximate distribution $q_\phi(z)$ of the posterior distribution $p_\theta(z \mid x)$ with fixed $\theta$. Since we want a good approximation, we minimize the difference (divergence) between $q_\phi(z)$ and $p_\theta(z \mid x)$ by optimizing $\phi$. The KL-divergence is a good measure of the difference between two probability distributions.
Objective: minimize the KL divergence between the surrogate distribution $q_\phi(z)$ and the posterior distribution $p_\theta(z \mid x)$.
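In symbols, with the standard definition of the KL-divergence (expectations taken under $q_\phi$):

$$\phi^{*} = \arg\min_{\phi}\, \mathrm{KL}\big(q_\phi(z)\,\|\,p_\theta(z \mid x)\big), \qquad \mathrm{KL}(q\,\|\,p) = \mathbb{E}_{q(z)}\!\left[\log \frac{q(z)}{p(z)}\right]$$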
Evidence Lower BOund (ELBO)
We cannot compute this objective directly because calculating the KL divergence requires $p_\theta(z \mid x)$, which is unknown. Let's rewrite the KL divergence.
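Expanding the definition and substituting $p_\theta(z \mid x) = p_\theta(x, z)\,/\,p_\theta(x)$ gives the standard identity:

$$\mathrm{KL}\big(q_\phi(z)\,\|\,p_\theta(z \mid x)\big) = \mathbb{E}_{q_\phi(z)}\big[\log q_\phi(z) - \log p_\theta(x, z)\big] + \log p_\theta(x) = -\,\mathrm{ELBO}(\theta, \phi) + \log p_\theta(x)$$

where $\mathrm{ELBO}(\theta, \phi) = \mathbb{E}_{q_\phi(z)}\big[\log p_\theta(x, z) - \log q_\phi(z)\big]$.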
We can decompose the KL divergence into the (negative) ELBO and the log marginal likelihood $\log p_\theta(x)$. Let's inspect each of the three terms.
- $\mathrm{KL}\big(q_\phi(z)\,\|\,p_\theta(z \mid x)\big)$: the term we want to minimize.
    - Intractable, because we don't know $p_\theta(z \mid x)$.
    - Non-negative, by definition of the KL-divergence.
- $\mathrm{ELBO}(\theta, \phi)$: the Evidence Lower BOund.
    - This term is tractable.
    - In latent variable models, we assume that the complete data likelihood $p_\theta(x, z)$ is tractable.
    - By construction, $q_\phi(z)$ is tractable.
- $\log p_\theta(x)$: the log marginal likelihood, called the log evidence.
    - This term does not depend on $\phi$.
For a fixed model parameter $\theta$, changing the variational parameter $\phi$ does not change the evidence $\log p_\theta(x)$. Therefore, maximizing the ELBO by optimizing $\phi$ is equivalent to minimizing $\mathrm{KL}\big(q_\phi(z)\,\|\,p_\theta(z \mid x)\big)$. This means that $q_\phi(z)$ approximates $p_\theta(z \mid x)$ well.
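As a concrete illustration, here is a minimal sketch of ELBO maximization in PyTorch. The toy model (standard normal prior, unit-variance Gaussian likelihood) and the Gaussian variational family are assumptions made only for this sketch; they are the simplest case where the true posterior is known in closed form.

```python
import torch
from torch.distributions import Normal

# Toy model (an assumption for this sketch):
#   prior       p(z)     = Normal(0, 1)
#   likelihood  p(x | z) = Normal(z, 1)
# For an observation x, the true posterior is Normal(x / 2, sqrt(1/2)).
x = torch.tensor(2.0)

# Variational family q_phi(z) = Normal(mu, sigma); phi = (mu, log_sigma).
mu = torch.zeros(1, requires_grad=True)
log_sigma = torch.zeros(1, requires_grad=True)
optimizer = torch.optim.Adam([mu, log_sigma], lr=0.05)

for step in range(2000):
    optimizer.zero_grad()
    q = Normal(mu, log_sigma.exp())
    z = q.rsample((64,))  # reparameterized samples keep gradients w.r.t. phi
    log_joint = Normal(0.0, 1.0).log_prob(z) + Normal(z, 1.0).log_prob(x)
    elbo = (log_joint - q.log_prob(z)).mean()  # Monte Carlo estimate of the ELBO
    (-elbo).backward()  # maximizing the ELBO = minimizing its negative
    optimizer.step()

# mu -> ~1.0 and sigma -> ~0.707, i.e. q_phi(z) approaches the true posterior.
print(mu.item(), log_sigma.exp().item())
```

Note that optimizing $\phi$ never requires the normalizing constant $p_\theta(x)$: only the joint $p_\theta(x, z)$ and $q_\phi(z)$ appear in the loop.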
ELBO Interpretation
If we rewrite the ELBO, it consists of two terms (written out below):
- the expected log likelihood under the approximate posterior, $\mathbb{E}_{q_\phi(z)}\big[\log p_\theta(x \mid z)\big]$;
- the KL divergence between the approximate posterior and the prior, $\mathrm{KL}\big(q_\phi(z)\,\|\,p(z)\big)$.
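Written out, this rewriting (standard, and consistent with the notation above) is:

$$\mathrm{ELBO}(\theta, \phi) = \mathbb{E}_{q_\phi(z)}\big[\log p_\theta(x \mid z)\big] - \mathrm{KL}\big(q_\phi(z)\,\|\,p(z)\big)$$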
Therefore, maximizing the ELBO balances
- maximizing the expected log likelihood, and
- minimizing the distance to the prior.
In plain words, we want to push the ELBO up toward its upper bound (the log evidence) while keeping the surrogate distribution close to the prior distribution.
Wrapup
- VI is an approach to approximate a distribution whose normalizing constant is intractable.
- When we use a latent variable model, we assume that the complete data likelihood $p_\theta(x, z)$ is tractable. The posterior is intractable because its denominator (the normalizing constant) is intractable. Therefore, VI is appropriate for calculating the posterior.
- To apply VI, we define a set of variational distributions called the variational family. This is a parametrized distribution $q_\phi(z)$.
- By maximizing the ELBO, $q_\phi(z)$ becomes a good approximation of $p_\theta(z \mid x)$.
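Since the main resources below are the Pyro tutorials, here is a minimal sketch of the same toy example using Pyro's SVI machinery; the model, guide, and parameter names are assumptions made for illustration, not something prescribed by this note.

```python
import torch
import pyro
import pyro.distributions as dist
from pyro.distributions import constraints
from pyro.infer import SVI, Trace_ELBO
from pyro.optim import Adam

x_obs = torch.tensor(2.0)  # a single observation

def model():
    # prior p(z) and likelihood p(x | z); together they give the tractable joint
    z = pyro.sample("z", dist.Normal(0.0, 1.0))
    pyro.sample("x", dist.Normal(z, 1.0), obs=x_obs)

def guide():
    # variational family q_phi(z) = Normal(mu, sigma), phi = (mu, sigma)
    mu = pyro.param("mu", torch.tensor(0.0))
    sigma = pyro.param("sigma", torch.tensor(1.0), constraint=constraints.positive)
    pyro.sample("z", dist.Normal(mu, sigma))

svi = SVI(model, guide, Adam({"lr": 0.05}), loss=Trace_ELBO())
for _ in range(2000):
    svi.step()  # one gradient step on the (negative) ELBO

print(pyro.param("mu").item(), pyro.param("sigma").item())  # ~1.0, ~0.707
```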
Resources
- Main resources: pyro_tutorial1, pyro_tutorial2
- Princeton COS597C lecture notes