Variational Inference with Normalizing Flows in 100 lines of code — reverse KL divergence
If you are working in science, chances are that you have encountered a density that you can only evaluate up to a constant factor. If you want to sample from such a distribution, well-studied methods exist, such as Markov Chain Monte Carlo or rejection sampling. You may also use importance sampling to estimate properties of the target distribution, such as its expectation.
In this post we will use normalizing flows (which I described in a previous post) to fit the target density. Normalizing flows are particularly powerful because, once trained, they allow sampling from the learned density and/or evaluating the density of new data points.
In particular, we will implement the paper Variational Inference with Normalizing Flows in about 100 lines of code.
We will focus on the section of the paper where unnormalized densities are fitted. To that end, we will use planar flows, which are defined as

$$f(z) = z + u\,h(w^\top z + b),$$

where h is a smooth element-wise nonlinearity such as tanh, and for which the Jacobian determinant can be computed in O(d), where d is the dimension of the target density:

$$\left|\det \frac{\partial f}{\partial z}\right| = \left|1 + u^\top \psi(z)\right|, \qquad \psi(z) = h'(w^\top z + b)\,w.$$
The variable z is some noise that should have the same dimension as the target density and whose base density p(z) should be easy to evaluate. The function f therefore defines a bijective mapping between the noise and the target data. Once the tunable parameters u, w and b have been learned, the transformation allows sampling new data points from the target density as y = f(z), z ∼ p(z), as well as evaluating the density of the sampled points with the change of variable theorem:
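$$\log p(y) = \log p(z) - \log\left|\det \frac{\partial f}{\partial z}\right|, \qquad y = f(z).$$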
The parameters u, w and b are trained by minimizing the reverse KL divergence between the density of the normalizing flow p(y) and the target density p*(y):
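$$\mathrm{KL}\big(p \,\|\, p^*\big) = \mathbb{E}_{z \sim p(z)}\Big[\log p(z) - \log\left|\det \frac{\partial f}{\partial z}\right| - \log p^*\big(f(z)\big)\Big].$$

Since the expectation is taken over samples from the flow, p*(y) only appears inside a logarithm, so knowing it up to a constant factor merely shifts the objective by a constant and does not affect the gradients.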
To sum up, we can efficiently train a normalizing flow (which by definition produces a normalized density) by minimizing the KL divergence with an unnormalized target density, and after training we can efficiently sample new data points from the learned density.
There is one important point to mention before starting the implementation. If we go back to the definition of f, the transformation is not always bijective. The tunable parameters u need to be constrained in order to ensure that the transformation is invertible. Fortunately, there is a way to do that efficiently, and it is explained in the paper.
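Concretely, u is replaced by a corrected vector û before applying the transformation, which enforces $w^\top \hat{u} \geq -1$, a sufficient condition for invertibility when h is tanh (this is the construction given in the paper's appendix):

$$\hat{u} = u + \big[m(w^\top u) - w^\top u\big]\,\frac{w}{\lVert w \rVert^2}, \qquad m(x) = -1 + \log(1 + e^x).$$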
Finally, one pitfall of planar flows is that f cannot be inverted analytically. That means that if we observe a new data point y, we cannot efficiently evaluate its density using the change of variable theorem:
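$$p(y) = p\big(f^{-1}(y)\big)\,\left|\det \frac{\partial f^{-1}(y)}{\partial y}\right|.$$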
Unlike planar flows, more recent normalizing flow architectures allow computing both f and its inverse efficiently, which is a powerful asset. This means that once trained, you can both sample from the learned density and evaluate the density of any new observed data point.
Now, let us dive into the implementation.
Modelling the layer f is a straightforward implementation of the equations above; the only subtlety is to constrain u so that the transformation remains bijective. We will implement a forward pass that takes some noise as input and outputs the target data y as well as the log Jacobian determinant of the transformation.
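As a rough sketch (in PyTorch, which is an assumption on my part; the class name PlanarFlow and the helper constrained_u are illustrative, not the repository's exact code), the layer could look like this:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class PlanarFlow(nn.Module):
    """One planar layer f(z) = z + u * tanh(w^T z + b)."""

    def __init__(self, dim):
        super().__init__()
        self.w = nn.Parameter(torch.randn(1, dim) * 0.01)
        self.u = nn.Parameter(torch.randn(1, dim) * 0.01)
        self.b = nn.Parameter(torch.zeros(1))

    def constrained_u(self):
        # Enforce w^T u >= -1 so that the transformation stays invertible:
        # u_hat = u + [m(w^T u) - w^T u] * w / ||w||^2, with m(x) = -1 + softplus(x).
        wu = (self.w * self.u).sum()
        m = -1.0 + F.softplus(wu)
        return self.u + (m - wu) * self.w / (self.w.pow(2).sum() + 1e-8)

    def forward(self, z):
        u = self.constrained_u()
        lin = z @ self.w.t() + self.b                 # (batch, 1)
        y = z + u * torch.tanh(lin)                   # f(z), shape (batch, dim)
        # psi(z) = h'(w^T z + b) * w and |det df/dz| = |1 + u^T psi(z)|
        psi = (1.0 - torch.tanh(lin) ** 2) * self.w   # (batch, dim)
        log_det = torch.log(torch.abs(1.0 + psi @ u.t()) + 1e-8).squeeze(-1)
        return y, log_det
```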
As with most neural network architectures, using a single layer is not flexible enough. Therefore, we will stack multiple layers together to represent our final transformation. The main task is to keep track of the Jacobian determinant of each layer and return the log Jacobian determinant of the whole transformation.
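Building on the sketch above, stacking K layers could look like this:

```python
class PlanarFlowStack(nn.Module):
    """Stack of K planar layers; accumulates the log Jacobian determinants."""

    def __init__(self, dim, K):
        super().__init__()
        self.layers = nn.ModuleList([PlanarFlow(dim) for _ in range(K)])

    def forward(self, z):
        sum_log_det = torch.zeros(z.shape[0], device=z.device)
        for layer in self.layers:
            z, log_det = layer(z)
            sum_log_det = sum_log_det + log_det
        return z, sum_log_det
```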
Now that we have our model ready, we can start implementing the optimization function. The implementation is straightforward: the parameters of the flow are updated with stochastic minibatch gradient descent by minimizing the reverse KL divergence.
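A possible training loop, where log_p_target is an assumed callable returning the unnormalized log density of the target (again a sketch, not the repository's exact code):

```python
def train(flow, log_p_target, dim, n_steps=10000, batch_size=256, lr=1e-3):
    """Minimize KL(p_flow || p_target) with Monte Carlo estimates over flow samples."""
    base = torch.distributions.MultivariateNormal(torch.zeros(dim), torch.eye(dim))
    optimizer = torch.optim.Adam(flow.parameters(), lr=lr)
    for step in range(n_steps):
        z = base.sample((batch_size,))
        y, sum_log_det = flow(z)
        # log p_flow(y) = log p_base(z) - sum_k log|det J_k|
        log_q = base.log_prob(z) - sum_log_det
        # Reverse KL, up to the unknown normalizing constant of the target
        loss = (log_q - log_p_target(y)).mean()
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    return flow
```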
Then, we can easily create and train the planar flow in a few lines of code.
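For instance, with a hypothetical bimodal target (a two-component Gaussian mixture standing in for the paper's test energy functions):

```python
def log_p_target(y):
    # Unnormalized log density of a mixture of two unit Gaussians at (-2, 0) and (2, 0)
    centers = torch.tensor([[-2.0, 0.0], [2.0, 0.0]])
    sq_dist = ((y.unsqueeze(1) - centers) ** 2).sum(-1)
    return torch.logsumexp(-0.5 * sq_dist, dim=1)


flow = PlanarFlowStack(dim=2, K=16)
flow = train(flow, log_p_target, dim=2)

# After training, sampling from the learned density is a single forward pass.
with torch.no_grad():
    samples, _ = flow(torch.randn(1000, 2))
```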
This should produce densities that are close to the targets, given enough planar layers K.
The full code is available in the following GitHub repository.
I hope the story was useful to you. If you enjoyed it, please leave a clap to support my work; it will help similar stories reach more readers. In my next post, I will train another class of normalizing flow using the forward KL divergence in order to learn a generative model of images. In the meantime, you may be interested in my posts about GANs and Variational Autoencoders.