Notes on the Bayesian Framework
Some tools:
- Stochastic variational inference
- Variance reduction
- Normalizing flows
- Gaussian processes
- Scalable MCMC algorithms
- Semi-implicit variational inference
Bayesian framework
- Bayes theorem: $p(\theta \mid x) = \dfrac{p(x \mid \theta)\, p(\theta)}{p(x)}$, i.e. $\text{posterior} = \dfrac{\text{likelihood} \times \text{prior}}{\text{evidence}}$.
It defines the rule for updating our uncertainty about $\theta$ when new information $x$ arrives (a short derivation from the product and sum rules is given below).
- Product rule: any joint distribution can be expressed through conditional distributions, $p(x, y) = p(x \mid y)\, p(y)$.
- Sum rule: any marginal distribution can be obtained from the joint distribution by integrating out the other variables, $p(x) = \int p(x, y)\, dy$.
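A quick derivation: applying the product rule to $p(x, \theta)$ in both orders, and the sum rule to the denominator, recovers Bayes theorem:
$$p(\theta \mid x)\, p(x) = p(x, \theta) = p(x \mid \theta)\, p(\theta)
\;\;\Rightarrow\;\;
p(\theta \mid x) = \frac{p(x \mid \theta)\, p(\theta)}{p(x)},
\qquad
p(x) = \int p(x \mid \theta)\, p(\theta)\, d\theta .$$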
Statistical inference
Problem: given i.i.d. data $X = \{x_i\}_{i=1}^{n}$ from a distribution $p(x \mid \theta)$, estimate $\theta$.
- Frequentist framework: use maximum likelihood estimation (MLE), $\theta_{\mathrm{ML}} = \arg\max_\theta p(X \mid \theta) = \arg\max_\theta \sum_{i=1}^{n} \log p(x_i \mid \theta)$.
Applicability: when the number of data points is much larger than the number of parameters, $n \gg d$.
- Bayesian framework: encode uncertainty about $\theta$ in a prior $p(\theta)$ and apply Bayesian inference, $p(\theta \mid X) = \dfrac{p(X \mid \theta)\, p(\theta)}{\int p(X \mid \theta)\, p(\theta)\, d\theta}$.
Applicability: for any amount of data $n$.
Advantages:
- we can encode prior knowledge/desired properties into the prior distribution
- the prior is a form of regularization
- in addition to a point estimate of $\theta$, the posterior contains information about the uncertainty of the estimate
- the frequentist case is a limit case of the Bayesian one (as the amount of data grows, the posterior concentrates around $\theta_{\mathrm{ML}}$)
A small numerical example contrasting the two frameworks is sketched below.
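To make the contrast concrete, here is a minimal sketch with made-up coin-flip data: the MLE versus a grid approximation of the posterior under a $\mathrm{Beta}(2, 2)$ prior (the data and prior choice are purely illustrative):

```python
import numpy as np

# Toy data: 5 coin flips (1 = heads); the parameter theta is unknown.
x = np.array([1, 1, 1, 1, 0])
n, k = len(x), int(x.sum())

# Frequentist answer: maximum likelihood estimate.
theta_ml = k / n  # 0.8

# Bayesian answer: posterior over theta on a grid (no conjugacy needed).
theta = np.linspace(1e-3, 1 - 1e-3, 1000)
dtheta = theta[1] - theta[0]
prior = theta * (1 - theta)                    # unnormalized Beta(2, 2) prior
likelihood = theta**k * (1 - theta)**(n - k)   # Bernoulli likelihood of the data
posterior = prior * likelihood
posterior /= posterior.sum() * dtheta          # normalize by the evidence

post_mean = (theta * posterior).sum() * dtheta
post_std = np.sqrt(((theta - post_mean) ** 2 * posterior).sum() * dtheta)

print(f"MLE:            {theta_ml:.3f}")
print(f"Posterior mean: {post_mean:.3f} +/- {post_std:.3f}")
# The prior pulls the estimate from 0.8 toward 0.5 (regularization), and the
# posterior spread quantifies how uncertain the estimate still is.
```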
Bayesian ML models
In ML, we have observed variables $x$ (objects/features), hidden variables $y$ (targets), and model parameters $\theta$.
- Discriminative approach, models $p(y, \theta \mid x)$:
  - Cannot generate new objects, since it needs $x$ as an input and assumes that the prior over $\theta$ does not depend on $x$: $p(y, \theta \mid x) = p(y \mid x, \theta)\, p(\theta)$.
  - Examples: 1) classification/regression (hidden space is small); 2) machine translation (complex hidden space).
- Generative approach, models $p(x, y, \theta) = p(x, y \mid \theta)\, p(\theta)$:
  - It can generate objects (pairs $(x, y)$), but it can be hard to train since the observed space is most often more complicated.
  - Examples: generation of text, speech, images, etc.
Training
Given training data $(X_{tr}, Y_{tr})$ and a discriminative model $p(y, \theta \mid x)$, use the Bayesian framework:
$$p(\theta \mid X_{tr}, Y_{tr}) = \frac{p(Y_{tr} \mid X_{tr}, \theta)\, p(\theta)}{\int p(Y_{tr} \mid X_{tr}, \theta)\, p(\theta)\, d\theta}.$$
This results in an ensemble of algorithms (a distribution over $\theta$) rather than a single one. Ensembles usually perform better than a single model. In addition, the posterior captures all dependencies from the training data and can later be used as a new prior.
Testing
We have the posterior $p(\theta \mid X_{tr}, Y_{tr})$ and a new data point $x$. We can use the predictive distribution over its hidden value $y$:
$$p(y \mid x, X_{tr}, Y_{tr}) = \int p(y \mid x, \theta)\, p(\theta \mid X_{tr}, Y_{tr})\, d\theta.$$
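In practice this integral is often approximated by Monte Carlo over posterior samples. A minimal sketch for a hypothetical Bayesian logistic-regression model, where `w_samples` stands in for samples that some inference procedure has already produced (here they are just drawn from a made-up Gaussian for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for posterior samples of the weights p(w | X_tr, Y_tr);
# in reality these would come from MCMC or variational inference.
w_samples = rng.normal(loc=[1.5, -0.7], scale=0.3, size=(1000, 2))

def likelihood(y, x, w):
    """p(y | x, w) for a logistic-regression model, y in {0, 1}."""
    p1 = 1.0 / (1.0 + np.exp(-x @ w))
    return p1 if y == 1 else 1.0 - p1

# Predictive distribution for a new point: average the likelihood
# over posterior samples instead of plugging in a single estimate.
x_new = np.array([0.4, 1.2])
p_y1 = np.mean([likelihood(1, x_new, w) for w in w_samples])
print(f"p(y=1 | x, X_tr, Y_tr) ~= {p_y1:.3f}")
```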
Full Bayesian inference
During training, the evidence $\int p(Y_{tr} \mid X_{tr}, \theta)\, p(\theta)\, d\theta$ and, during testing, the predictive integral have to be computed exactly. This is only possible in special cases, e.g. when the prior and the likelihood are conjugate.
Conjugacy
Conjugate distributions
A distribution $p(\theta)$ is conjugate to the likelihood $p(x \mid \theta)$ if the posterior $p(\theta \mid x)$ lies in the same family as the prior $p(\theta)$. In that case the posterior is available in closed form (see the sketch below).
- If there is no conjugacy, we can perform MAP estimation to approximate the posterior with $\delta(\theta - \theta_{\mathrm{MP}})$, where $\theta_{\mathrm{MP}} = \arg\max_\theta p(\theta \mid X) = \arg\max_\theta p(X \mid \theta)\, p(\theta)$, since we don't need to calculate the normalization constant; but we cannot compute the true posterior.
During testing: $p(y \mid x, X_{tr}, Y_{tr}) \approx p(y \mid x, \theta_{\mathrm{MP}})$.
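Continuing the coin-flip illustration above: the Beta prior is conjugate to the Bernoulli likelihood, so the posterior is again a Beta distribution and is obtained by a simple count update (a sketch, using the same toy numbers):

```python
from scipy import stats

# Beta(a0, b0) prior over the heads probability theta.
a0, b0 = 2.0, 2.0

# Observed coin flips: k heads out of n.
n, k = 5, 4

# Conjugacy: Bernoulli likelihood * Beta prior -> Beta posterior,
# so "inference" is just a count update, no integral required.
a_post, b_post = a0 + k, b0 + (n - k)
posterior = stats.beta(a_post, b_post)

print(f"Posterior: Beta({a_post:.0f}, {b_post:.0f})")
print(f"Posterior mean: {posterior.mean():.3f}")   # matches the grid result above
print(f"MAP estimate:   {(a_post - 1) / (a_post + b_post - 2):.3f}")
```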
Conditional conjugacy
Given the model: $p(x, \theta) = p(x \mid \theta_1, \ldots, \theta_m) \prod_{j=1}^{m} p(\theta_j)$.
Conditional conjugacy means conjugacy of the likelihood and the prior on each $\theta_j$ separately, with all other $\theta_i$ held fixed.
Check conditional conjugacy in practice. For each $\theta_j$:
- Fix all other $\theta_i$, $i \neq j$ (look at them as constants).
- Check whether $p(x \mid \theta_1, \ldots, \theta_m)$ and $p(\theta_j)$ are conjugate w.r.t. $\theta_j$.
Variational Inference
Given the model $p(X, \theta) = p(X \mid \theta)\, p(\theta)$, approximate the intractable posterior $p(\theta \mid X)$ with a distribution $q(\theta)$ from a tractable family $\mathcal{Q}$ by minimizing $\mathrm{KL}(q(\theta) \,\|\, p(\theta \mid X))$.
KL divergence, $\mathrm{KL}(q \,\|\, p) = \int q(\theta) \log \frac{q(\theta)}{p(\theta)}\, d\theta$, is a good mismatch measure between two distributions over the same domain (see figure). It has the following properties:
- $\mathrm{KL}(q \,\|\, p) \geq 0$;
- $\mathrm{KL}(q \,\|\, p) = 0$ iff $q = p$ (almost everywhere);
- it is not symmetric: $\mathrm{KL}(q \,\|\, p) \neq \mathrm{KL}(p \,\|\, q)$ in general.
Evidence Lower Bound (ELBO) derivation
- Posterior: $p(\theta \mid X) = \dfrac{p(X \mid \theta)\, p(\theta)}{p(X)}$.
- Evidence: $p(X) = \int p(X \mid \theta)\, p(\theta)\, d\theta$, the total probability of observing the data.
- Lower bound: $\log p(X) = \mathcal{L}(q) + \mathrm{KL}(q(\theta) \,\|\, p(\theta \mid X)) \geq \mathcal{L}(q)$, where $\mathcal{L}(q) = \int q(\theta) \log \dfrac{p(X, \theta)}{q(\theta)}\, d\theta$ is the ELBO (see the derivation below).
Note: $\log p(X)$ does not depend on $q$, while $\mathcal{L}(q)$ and the KL term do depend on $q$; hence minimizing $\mathrm{KL}(q(\theta) \,\|\, p(\theta \mid X))$ is the same as maximizing $\mathcal{L}(q)$.
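Writing the decomposition out explicitly (using $p(X, \theta) = p(\theta \mid X)\, p(X)$ and $\int q(\theta)\, d\theta = 1$):
$$
\mathcal{L}(q) + \mathrm{KL}\big(q(\theta)\,\|\,p(\theta \mid X)\big)
= \int q(\theta) \log \frac{p(X, \theta)}{q(\theta)}\, d\theta
+ \int q(\theta) \log \frac{q(\theta)}{p(\theta \mid X)}\, d\theta
= \int q(\theta) \log \frac{p(X, \theta)}{p(\theta \mid X)}\, d\theta
= \log p(X).
$$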
Optimizing ELBO
Goal: $\max_{q \in \mathcal{Q}} \mathcal{L}(q)$, where the ELBO splits into two terms, $\mathcal{L}(q) = \mathbb{E}_{q(\theta)} \log p(X \mid \theta) - \mathrm{KL}(q(\theta) \,\|\, p(\theta))$:
- Data term: $\mathbb{E}_{q(\theta)} \log p(X \mid \theta)$.
- Regularizer: $-\mathrm{KL}(q(\theta) \,\|\, p(\theta))$, keeps $q$ close to the prior.
It is necessary to perform the optimization w.r.t. a distribution $q$, so the family $\mathcal{Q}$ has to be restricted:
- Mean field approximation: factorized family, $q(\theta) = \prod_{j=1}^{m} q_j(\theta_j)$.
- Parametric approximation: parametric family, $q(\theta) = q(\theta \mid \lambda)$.
Mean Field Approximation
Mean field assumes that $q(\theta) = \prod_{j=1}^{m} q_j(\theta_j)$.
- Apply the product rule to the joint distribution: $p(X, \theta) = p(X \mid \theta)\, p(\theta)$.
- Apply the i.i.d. assumption: $p(X \mid \theta) = \prod_{i=1}^{n} p(x_i \mid \theta)$.
The optimization problem becomes: $\max_{q_1, \ldots, q_m} \mathcal{L}(q_1 \cdots q_m)$.
This can be solved with block coordinate ascent as follows: at each step fix all factors $q_i$, $i \neq j$, and maximize $\mathcal{L}$ w.r.t. $q_j$.
Derivation
So, the optimization problem for step $j$ is
$$\max_{q_j} \mathcal{L}(q) = \max_{q_j} \Big[ \mathbb{E}_{q} \log p(X, \theta) - \sum_{k=1}^{m} \mathbb{E}_{q_k} \log q_k(\theta_k) \Big] = \max_{q_j} \Big[ \mathbb{E}_{q_j} \big[ \mathbb{E}_{q_{i \neq j}} \log p(X, \theta) \big] - \mathbb{E}_{q_j} \log q_j(\theta_j) \Big] + \mathrm{const},$$
which is, up to a constant, $-\mathrm{KL}\big(q_j(\theta_j) \,\|\, r_j(\theta_j)\big)$ with $r_j(\theta_j) \propto \exp\!\big(\mathbb{E}_{q_{i \neq j}} \log p(X, \theta)\big)$. The maximum is therefore attained when:
$$q_j(\theta_j) \propto \exp\!\big(\mathbb{E}_{q_{i \neq j}} \log p(X, \theta)\big), \qquad \text{i.e.} \quad \log q_j(\theta_j) = \mathbb{E}_{q_{i \neq j}} \log p(X, \theta) + \mathrm{const}.$$
Block coordinate ascent can be described in two steps: 1) initialize; 2) iterate.
- Initialize: $q(\theta) = \prod_{j=1}^{m} q_j(\theta_j)$ with some initial factors.
- Iterate (repeat until the ELBO converges):
  - Update each factor $q_j$: $q_j(\theta_j) \propto \exp\!\big(\mathbb{E}_{q_{i \neq j}} \log p(X, \theta)\big)$.
  - Compute the ELBO.
Note: Mean-field can be applied when we can compute $\mathbb{E}_{q_{i \neq j}} \log p(X, \theta)$ analytically (see the sketch below for a classic example).
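As an illustration, here is a minimal sketch of the coordinate updates for a classic textbook model (the model and numbers are illustrative, not taken from these notes): data $x_i \sim \mathcal{N}(\mu, \tau^{-1})$ with priors $\mu \mid \tau \sim \mathcal{N}(\mu_0, (\lambda_0 \tau)^{-1})$ and $\tau \sim \mathrm{Gamma}(a_0, b_0)$, approximated by $q(\mu, \tau) = q(\mu)\, q(\tau)$; both factor updates are available in closed form.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(loc=2.0, scale=1.5, size=200)   # toy data
n, xbar = len(x), x.mean()

# Hyperparameters of the Normal-Gamma prior.
mu0, lam0, a0, b0 = 0.0, 1.0, 1.0, 1.0

# Initialize the factors: q(mu) = N(m, 1/lam), q(tau) = Gamma(a, b).
m, lam = 0.0, 1.0
a, b = a0, b0

for _ in range(50):  # iterate the coordinate updates until convergence
    # Update q(mu) given E_q[tau] = a / b.
    e_tau = a / b
    m = (lam0 * mu0 + n * xbar) / (lam0 + n)
    lam = (lam0 + n) * e_tau

    # Update q(tau) given E_q[mu] = m, Var_q[mu] = 1/lam.
    a = a0 + (n + 1) / 2
    e_sq = np.sum((x - m) ** 2) + n / lam          # E_q[sum_i (x_i - mu)^2]
    e_prior = lam0 * ((m - mu0) ** 2 + 1 / lam)    # E_q[lam0 * (mu - mu0)^2]
    b = b0 + 0.5 * (e_sq + e_prior)

print(f"q(mu):  mean {m:.3f}, std {np.sqrt(1 / lam):.3f}")
print(f"q(tau): mean {a / b:.3f}  (true precision 1/1.5^2 = {1 / 1.5**2:.3f})")
```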
Parametric Approximation
Select a parametric family of variational distributions, $q(\theta \mid \lambda)$, and optimize the ELBO w.r.t. its parameters $\lambda$.
The restriction is that we need to select a family of some fixed form, and as a result:
- it might be too simple and insufficient to model the data
- if it is complex enough then there is no guarantee we can train it well to fit the data
The ELBO is: $\mathcal{L}(\lambda) = \int q(\theta \mid \lambda) \log \dfrac{p(X, \theta)}{q(\theta \mid \lambda)}\, d\theta \to \max_{\lambda}$.
If we are able to calculate derivatives of the ELBO w.r.t. $\lambda$, the problem reduces to standard (stochastic) gradient-based optimization. A sketch of one common way to obtain such gradients is given below.
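One common way to obtain those derivatives (one option among several; the choice here is an assumption, not prescribed by these notes) is the reparameterization trick: write $\theta = m + s\,\varepsilon$, $\varepsilon \sim \mathcal{N}(0, 1)$, so Monte Carlo gradients can be pushed through the samples. A minimal sketch for a toy conjugate model (Gaussian likelihood with known noise, Gaussian prior), where the exact posterior is known and can be used to check the fitted $q(\theta \mid \lambda) = \mathcal{N}(m, s^2)$:

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy model: x_i ~ N(theta, sigma^2), prior theta ~ N(0, tau^2).
sigma, tau = 1.0, 2.0
x = rng.normal(loc=1.0, scale=sigma, size=20)

def grad_log_joint(theta):
    """d/d(theta) of log p(X, theta) = sum_i log N(x_i|theta,sigma^2) + log N(theta|0,tau^2)."""
    return np.sum(x - theta) / sigma**2 - theta / tau**2

# Variational family q(theta | lambda) = N(m, s^2), with lambda = (m, log s).
m, log_s = 0.0, 0.0
lr, n_samples = 0.01, 64

for _ in range(2000):
    s = np.exp(log_s)
    eps = rng.normal(size=n_samples)
    theta = m + s * eps                       # reparameterized samples from q
    g = np.array([grad_log_joint(t) for t in theta])
    grad_m = g.mean()                         # d ELBO / d m
    grad_s = (g * eps).mean() + 1.0 / s       # d ELBO / d s (entropy term gives +1/s)
    m += lr * grad_m
    log_s += lr * grad_s * s                  # chain rule: d ELBO / d log s = s * d ELBO / d s

# Exact posterior of this conjugate model, for comparison.
post_var = 1.0 / (len(x) / sigma**2 + 1.0 / tau**2)
post_mean = post_var * x.sum() / sigma**2
print(f"VI:    m = {m:.3f}, s = {np.exp(log_s):.3f}")
print(f"Exact: m = {post_mean:.3f}, s = {np.sqrt(post_var):.3f}")
```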
Inference methods
So we have:
- Full Bayesian inference: exact posterior $p(\theta \mid X)$ (requires conjugacy).
- MAP inference: $p(\theta \mid X) \approx \delta(\theta - \theta_{\mathrm{MP}})$.
- Mean field variational inference: $p(\theta \mid X) \approx \prod_{j=1}^{m} q_j(\theta_j)$.
- Parametric variational inference: $p(\theta \mid X) \approx q(\theta \mid \lambda)$.
Latent variable model
Mixture of Gaussians
Establish a latent variable $z_i \in \{1, \ldots, K\}$ that indicates which mixture component generated $x_i$.
Model: $p(X, Z \mid \theta) = \prod_{i=1}^{n} p(x_i \mid z_i, \theta)\, p(z_i \mid \theta)$,
where $p(x_i \mid z_i = k, \theta) = \mathcal{N}(x_i \mid \mu_k, \Sigma_k)$ and $p(z_i = k \mid \theta) = \pi_k$, so $p(x_i \mid \theta) = \sum_{k=1}^{K} \pi_k\, \mathcal{N}(x_i \mid \mu_k, \Sigma_k)$.
Note: If $Z$ were observed, the maximum likelihood estimates of $\theta = \{\pi_k, \mu_k, \Sigma_k\}$ would be available in closed form.
- Since $Z$ is a latent variable, we need to maximize the log of the incomplete likelihood w.r.t. $\theta$: $\log p(X \mid \theta) = \sum_{i=1}^{n} \log \sum_{k=1}^{K} \pi_k\, \mathcal{N}(x_i \mid \mu_k, \Sigma_k)$, which has no closed-form solution.
- Instead of optimizing $\log p(X \mid \theta)$ directly, we optimize a variational lower bound $\mathcal{L}(q, \theta)$ w.r.t. both $q$ and $\theta$.
- This can be solved by a block-coordinate algorithm, a.k.a. the EM algorithm.
Variational Lower Bound: $\mathcal{L}(q, \theta)$ is a variational lower bound for $\log p(X \mid \theta)$ iff:
- For all $q$ and for all $\theta$: $\mathcal{L}(q, \theta) \leq \log p(X \mid \theta)$.
- For any $\theta_0$ there exists $q_0$ such that $\mathcal{L}(q_0, \theta_0) = \log p(X \mid \theta_0)$.
If we find such a variational lower bound, then instead of solving $\max_\theta \log p(X \mid \theta)$ we can iteratively perform block coordinate updates of $\mathcal{L}(q, \theta)$ (see the bound written out below).
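The standard choice of such a bound is the same ELBO as before, now with the latent variables $Z$ and an explicit dependence on $\theta$:
$$
\log p(X \mid \theta) = \mathcal{L}(q, \theta) + \mathrm{KL}\big(q(Z)\,\|\,p(Z \mid X, \theta)\big),
\qquad
\mathcal{L}(q, \theta) = \mathbb{E}_{q(Z)} \log \frac{p(X, Z \mid \theta)}{q(Z)},
$$
so the bound is tight, $\mathcal{L}(q, \theta) = \log p(X \mid \theta)$, exactly when $q(Z) = p(Z \mid X, \theta)$.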
Expectation Maximization algorithm
We want to solve: $\max_\theta \log p(X \mid \theta)$, which is equivalent to $\max_{\theta} \max_{q} \mathcal{L}(q, \theta)$.
Algorithm:
Set an initial point $\theta^{0}$, $t = 0$.
Repeat steps 1 and 2 iteratively until convergence:
1. E-step, find: $q^{t+1} = \arg\max_q \mathcal{L}(q, \theta^{t}) = p(Z \mid X, \theta^{t})$.
2. M-step, solve: $\theta^{t+1} = \arg\max_\theta \mathcal{L}(q^{t+1}, \theta) = \arg\max_\theta \mathbb{E}_{q^{t+1}(Z)} \log p(X, Z \mid \theta)$.
3. Set $t = t + 1$ and go to 1.
EM monotonically increases the lower bound and converges to a stationary point of $\log p(X \mid \theta)$.
Benefits of EM
- In some cases the E-step and M-step can be solved in closed form.
- Allows building more complicated models.
- If the true posterior $p(Z \mid X, \theta)$ is intractable, we may search for the closest distribution among tractable ones by solving the optimization problem $\min_{q \in \mathcal{Q}} \mathrm{KL}(q(Z) \,\|\, p(Z \mid X, \theta))$ (variational EM).
- Allows processing missing data by treating it as latent variables.
- It can deal with both discrete and continuous latent variables.
Categorical latent variables
Since $z_i$ takes a finite number of values $\{1, \ldots, K\}$:
- The E-step is closed-form: $q(z_i = k) = p(z_i = k \mid x_i, \theta) = \dfrac{p(x_i \mid z_i = k, \theta)\, p(z_i = k \mid \theta)}{\sum_{l=1}^{K} p(x_i \mid z_i = l, \theta)\, p(z_i = l \mid \theta)}$.
- The M-step is a sum of finitely many terms: $\mathbb{E}_{q(Z)} \log p(X, Z \mid \theta) = \sum_{i=1}^{n} \sum_{k=1}^{K} q(z_i = k) \log p(x_i, z_i = k \mid \theta)$.
Both steps are spelled out for the Gaussian mixture in the sketch below.
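A minimal sketch of EM for a one-dimensional mixture of two Gaussians (toy data and initialization chosen purely for illustration); the E-step computes the responsibilities $q(z_i = k)$ above and the M-step re-estimates $\pi_k, \mu_k, \sigma_k$ in closed form:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(3)

# Toy data from two Gaussian components.
x = np.concatenate([rng.normal(-2.0, 0.7, 300), rng.normal(1.5, 1.0, 700)])
n, K = len(x), 2

# Initialize theta = (pi, mu, sigma).
pi = np.full(K, 1.0 / K)
mu = rng.choice(x, size=K, replace=False)
sigma = np.full(K, x.std())

for _ in range(100):
    # E-step: responsibilities q(z_i = k) = p(z_i = k | x_i, theta).
    lik = np.stack([pi[k] * norm.pdf(x, mu[k], sigma[k]) for k in range(K)], axis=1)
    q = lik / lik.sum(axis=1, keepdims=True)            # shape (n, K)

    # M-step: maximize E_q[log p(X, Z | theta)] in closed form.
    nk = q.sum(axis=0)
    pi = nk / n
    mu = (q * x[:, None]).sum(axis=0) / nk
    sigma = np.sqrt((q * (x[:, None] - mu) ** 2).sum(axis=0) / nk)

    # Monitor the incomplete log-likelihood (increases monotonically over iterations).
    loglik = np.log(lik.sum(axis=1)).sum()

print(f"pi = {np.round(pi, 2)}, mu = {np.round(mu, 2)}, sigma = {np.round(sigma, 2)}")
print(f"final log-likelihood: {loglik:.1f}")
```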
Continuous latent variables
A mixture of continuous distributions: $p(x \mid \theta) = \int p(x \mid z, \theta)\, p(z)\, dz$.
- E-step: only available in closed form when the distributions are conjugate; otherwise the true posterior $p(z \mid x, \theta)$ is intractable.
Typically continuous latent variables are used for dimensionality reduction, a.k.a. representation learning.
Log-derivative trick
For example, we commonly find expressions of the form
$$\nabla_\phi\, \mathbb{E}_{q(z \mid \phi)}\big[f(z, \phi)\big] = \mathbb{E}_{q(z \mid \phi)}\big[\nabla_\phi f(z, \phi)\big] + \int f(z, \phi)\, \nabla_\phi\, q(z \mid \phi)\, dz.$$
Now, the first term can be replaced with a Monte Carlo estimate of the expectation. Using the log-derivative trick, $\nabla_\phi\, q(z \mid \phi) = q(z \mid \phi)\, \nabla_\phi \log q(z \mid \phi)$, the second term also becomes an expectation, $\mathbb{E}_{q(z \mid \phi)}\big[f(z, \phi)\, \nabla_\phi \log q(z \mid \phi)\big]$, and can likewise be estimated via Monte Carlo (see the sketch below).
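A minimal sketch (a toy example, not from the notes) of the second term's estimator for $q(z \mid \phi) = \mathcal{N}(z \mid \phi, 1)$ and $f(z) = z^2$, where the exact gradient $\nabla_\phi\, \mathbb{E}_q[z^2] = \nabla_\phi(\phi^2 + 1) = 2\phi$ is known and serves as a check:

```python
import numpy as np

rng = np.random.default_rng(4)

phi = 1.3
n_samples = 200_000

# q(z | phi) = N(z | phi, 1);  f(z) = z^2 (does not depend on phi here).
z = rng.normal(loc=phi, scale=1.0, size=n_samples)
f = z**2
score = z - phi           # grad_phi log N(z | phi, 1) = (z - phi)

# Score-function (log-derivative) Monte Carlo estimator of grad_phi E_q[f(z)].
grad_est = np.mean(f * score)

# Variance-reduced version: subtract a baseline b; unbiased because the score has zero mean.
b = f.mean()
grad_est_baseline = np.mean((f - b) * score)

print(f"exact gradient:    {2 * phi:.3f}")
print(f"score-function MC: {grad_est:.3f}")
print(f"with baseline:     {grad_est_baseline:.3f}")
```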
Score function
The score function is the gradient of the log-likelihood with respect to the parameter vector, $\nabla_\phi \log q(z \mid \phi)$. Since it has zero mean, any baseline $b$ that does not depend on $z$ can be subtracted from $f(z)$ in the estimator above without introducing bias, which is the basis of common variance-reduction techniques.
Proof that it has zero mean:
$$\mathbb{E}_{q(z \mid \phi)}\big[\nabla_\phi \log q(z \mid \phi)\big] = \int q(z \mid \phi)\, \frac{\nabla_\phi\, q(z \mid \phi)}{q(z \mid \phi)}\, dz = \nabla_\phi \int q(z \mid \phi)\, dz = \nabla_\phi 1 = 0.$$