Variational Inference
The posterior in Bayesian models with latent variables is hard to compute unless we choose a conjugate prior for the likelihood function. However, we need some approximate version of the posterior in order to make inferences later on. One approach to approximating the posterior is to find a distribution $q(z) \in \mathcal{Q}$ (where $\mathcal{Q}$ is a family of distributions) that minimizes the KL divergence

$$q^*(z) = \operatorname*{arg\,min}_{q \in \mathcal{Q}} \; \mathrm{KL}\big(q(z)\,\|\,p(z \mid x)\big),$$

where $z$ is the latent variable and $x$ is the data.
We do not need to worry about the normalization constant $p(x)$,
since the KL divergence can be written as $\mathbb{E}_{q}[\log q(z)] - \mathbb{E}_{q}[\log p(z, x)] + \log p(x)$, and $\log p(x)$ is a constant with respect to $q$. We only need to work with the likelihood and the prior in this case.
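Expanding the KL divergence with $p(z \mid x) = p(z, x)/p(x)$ makes this explicit (a standard identity; the quantity $\mathcal{L}(q)$ below is usually called the evidence lower bound, or ELBO):

$$
\begin{aligned}
\mathrm{KL}\big(q(z)\,\|\,p(z \mid x)\big)
  &= \mathbb{E}_{q}\!\left[\log \frac{q(z)\,p(x)}{p(z, x)}\right] \\
  &= \underbrace{\mathbb{E}_{q}[\log q(z)] - \mathbb{E}_{q}[\log p(z, x)]}_{-\,\mathcal{L}(q)} + \log p(x),
\end{aligned}
$$

so minimizing the KL divergence over $q$ is the same as maximizing the ELBO $\mathcal{L}(q)$, and the intractable $\log p(x)$ never needs to be evaluated.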
E-step in EM
The E-step of the EM algorithm can utilize the variational inference technique. The E-step requires calculating the posterior over the latent variables, which can be hard to compute; VI helps us approximate that distribution, assuming we restrict ourselves to a family of distributions $\mathcal{Q}$.
This is called Variational EM.
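With the model parameters written explicitly as $\theta$ (notation assumed here, since the text does not introduce it), variational EM can be sketched as alternating maximizations of the same lower bound:

$$\mathcal{L}(q, \theta) = \mathbb{E}_{q}\big[\log p(x, z \mid \theta)\big] - \mathbb{E}_{q}[\log q(z)],$$

$$
\begin{aligned}
\text{E-step: } & q^{(t+1)} = \operatorname*{arg\,max}_{q \in \mathcal{Q}} \; \mathcal{L}\big(q, \theta^{(t)}\big), \\
\text{M-step: } & \theta^{(t+1)} = \operatorname*{arg\,max}_{\theta} \; \mathcal{L}\big(q^{(t+1)}, \theta\big),
\end{aligned}
$$

where the E-step is exactly the VI problem above, restricted to $\mathcal{Q}$.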
Mean Field Approximations
This is a variational inference method where we assume the distribution $q$ factorizes over the latent variables across all dimensions $i$, i.e.,

$$q(z) = \prod_{i=1}^{m} q_i(z_i),$$

and we minimize the KL divergence using coordinate descent: first find the minimizer for $q_1$ keeping everything else fixed, then $q_2$, and so on. We repeat this loop until convergence.
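As a concrete sketch of this coordinate-wise loop, consider approximating a correlated bivariate Gaussian "posterior" with a fully factorized $q(z_1)q(z_2)$. For this target the optimal coordinate updates have a known closed form (each factor is Gaussian); the model and all numbers below are illustrative, not from the text:

```python
import numpy as np

# Toy target "posterior": bivariate Gaussian with mean mu and
# precision matrix Lam (illustrative numbers).
mu = np.array([1.0, -1.0])
Lam = np.array([[2.0, 0.8],
                [0.8, 1.5]])

# Under the mean-field assumption q(z) = q1(z1) q2(z2), each optimal
# factor is Gaussian with variance 1 / Lam[i, i] and mean
#   m_i = mu_i - Lam[i, j] / Lam[i, i] * (m_j - mu_j),
# so the coordinate loop just alternates these two scalar updates.
m = np.array([0.0, 0.0])          # initial guesses for E[z1], E[z2]
for _ in range(50):               # repeat until convergence
    m[0] = mu[0] - Lam[0, 1] / Lam[0, 0] * (m[1] - mu[1])
    m[1] = mu[1] - Lam[1, 0] / Lam[1, 1] * (m[0] - mu[0])

print(m)  # the factor means converge to the true mean mu
```

Note that the factorized $q$ recovers the true marginal means but not the correlation between $z_1$ and $z_2$; that is the price of the mean-field assumption.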
Since we are minimizing over one component at a time, the functional form of any $q_j$ can be derived as follows:

$$q_j^* = \operatorname*{arg\,min}_{q_j} \; \mathbb{E}_{q}[\log q(z)] - \mathbb{E}_{q}[\log p(z, x)] = \operatorname*{arg\,min}_{q_j} \; \sum_{i} \mathbb{E}_{q_i}[\log q_i(z_i)] - \mathbb{E}_{q}[\log p(z, x)].$$

We now make three observations:
- Adding a constant to the minimization objective will still give the same result, so the terms $\mathbb{E}_{q_i}[\log q_i(z_i)]$ with $i \neq j$ can be dropped, as they do not depend on $q_j$.
- $\mathbb{E}_{q}[\log q_j(z_j)] = \mathbb{E}_{q_j}[\log q_j(z_j)]$, because each $q_i$ with $i \neq j$ is a probability distribution and integrates out to one.
- $\mathbb{E}_{q}[\log p(z, x)] = \mathbb{E}_{q_j}\big[\mathbb{E}_{-j}[\log p(z, x)]\big]$, since $\prod_{i \neq j} q_i(z_i)$ is a valid probability distribution. Notice that the inner expectation integrates over all $z_i$ except $z_j$ and thus is a function of just $z_j$. The subscript $-j$ in the expectation just means that we are considering the product $\prod_{i \neq j} q_i(z_i)$ as the probability distribution. We convert this expectation to a positive value by taking the exponent, and then get a valid probability distribution as

$$\tilde{p}(z_j) = \frac{\exp\big(\mathbb{E}_{-j}[\log p(z, x)]\big)}{\int \exp\big(\mathbb{E}_{-j}[\log p(z, x)]\big)\, dz_j}.$$
The denominator of this expression is a constant with respect to $q_j$, which we can introduce into the objective as is without any alteration.
Rewriting the minimization equation so far,

$$q_j^* = \operatorname*{arg\,min}_{q_j} \; \mathbb{E}_{q_j}[\log q_j(z_j)] - \mathbb{E}_{q_j}[\log \tilde{p}(z_j)] = \operatorname*{arg\,min}_{q_j} \; \mathrm{KL}\big(q_j(z_j)\,\|\,\tilde{p}(z_j)\big),$$

and we know that the KL divergence is minimized when the two distributions coincide, i.e.,

$$q_j^*(z_j) = \tilde{p}(z_j) \propto \exp\big(\mathbb{E}_{-j}[\log p(z, x)]\big),$$

which is the exponentiated expectation of $\log p(z, x)$ (equivalently the log posterior, up to an additive constant) taken over all components except the current one being minimized.
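This update can be checked numerically. The sketch below uses the same kind of toy bivariate Gaussian joint as before (all numbers illustrative), fixes a Gaussian factor $q_2$, evaluates $\exp(\mathbb{E}_{-j}[\log p(z, x)])$ on a grid, and compares it with the closed-form Gaussian factor it should equal:

```python
import numpy as np

# Toy joint log p(z1, z2): bivariate Gaussian with mean (mu1, mu2)
# and precision entries L11, L12 (illustrative numbers).
mu1, mu2 = 1.0, -1.0
L11, L12 = 2.0, 0.8

# Fix q2(z2) = N(m2, 1/L22). Taking E_{q2}[log p(z1, z2)] drops the
# z2-only terms (constants in z1) and replaces z2 by its mean m2:
m2 = 0.3
z1 = np.linspace(-5.0, 5.0, 2001)
log_q = -0.5 * L11 * (z1 - mu1) ** 2 - L12 * (z1 - mu1) * (m2 - mu2)

# Exponentiate and normalize on the grid -> candidate q1*(z1).
q = np.exp(log_q - log_q.max())
dz = z1[1] - z1[0]
q /= q.sum() * dz

# Closed form from completing the square: Gaussian with precision
# L11 and mean mu1 - (L12 / L11) * (m2 - mu2).
m1 = mu1 - L12 / L11 * (m2 - mu2)
closed = np.sqrt(L11 / (2.0 * np.pi)) * np.exp(-0.5 * L11 * (z1 - m1) ** 2)
print(np.max(np.abs(q - closed)))  # near zero, up to grid error
```

Running one such update for each $j$ in turn is exactly one sweep of the coordinate descent loop described above.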