Variational Inference with Normalizing Flows in 100 lines of code — forward KL divergence

Generative modeling with Normalizing Flows — NICE: Non-linear Independent Components Estimation

Papers in 100 Lines of Code
5 min read · Aug 22, 2021

In my previous posts, I introduced normalizing flows and trained one to fit an unnormalized target density using the reverse KL divergence. In this post, we will train a generative model on a dataset of grayscale images (multi-channel images will be covered in upcoming posts) using the forward KL divergence.

In my previous post, we implemented the planar flow architecture. Here, we will focus on Non-linear Independent Components Estimation (NICE), another class of normalizing flow.

A layer f is described as:

y1 = x1,
y2 = x2 + m(x1),
where x1 and x2 are partitions of the input variable x, y1 and y2 are partitions of the output variable y, and m is an arbitrarily complex function (a neural network in practice). In the paper, the partitioning separates the odd (x1, y1) and even (x2, y2) components of the input and output variables. Since the first partition of the input is left unchanged by a layer, the roles of the two partitions are exchanged between consecutive layers so that the full transformation is expressive enough. This means that for the second layer, x1 corresponds to the even components and x2 to the odd ones.
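As an illustration, here is how this odd/even partitioning and the swap between consecutive layers could look in PyTorch (a minimal sketch; the batch of flattened images x and the variable names are illustrative assumptions):

```python
import torch

x = torch.randn(64, 784)        # a batch of flattened 28x28 grayscale images

# Partition the features: odd components (1st, 3rd, ...) and even components (2nd, 4th, ...)
x1, x2 = x[:, ::2], x[:, 1::2]

# Between two consecutive coupling layers, the roles of the partitions are exchanged
x1, x2 = x2, x1
```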

As opposed to planar flows, this transformation has a simple analytic inverse:

x1 = y1,
x2 = y2 - m(y1).
This means that once trained, the learned model can be used both for data generation (sampling) and for density estimation, as represented in Figure 1.

Figure 1: Representation of normalizing flows. Note: the notation (such as f and its inverse) does not match the one used in the text.

Given data points, the normalizing flow (the function f) can be trained by minimizing the forward KL divergence, which amounts to maximizing the log-likelihood of the data:

log pX(x) = log pH(f(x)) + log |det(∂f(x)/∂x)|,
where pH is the base distribution (a standard logistic distribution in the paper). If you compute the Jacobian determinant of the function f described at the beginning, you will realize that each additive coupling layer has a unit Jacobian determinant: it is volume-preserving. To compensate for this, a diagonal scaling matrix is introduced after the last layer.

“We include a diagonal scaling matrix S as the top layer, which multiplies the i-th output value by Sii: (xi)i≤D → (Sii xi)i≤D. This allows the learner to give more weight (i.e. model more variation) on some dimensions and less in others.” [1]

More practically, the output h of the final coupling layer is scaled elementwise by a positive diagonal matrix:

zi = exp(si) · hi,

where the exponential ensures that the scaling is positive and makes the log determinant of the Jacobian easy to compute (it is simply the sum of the si).
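A minimal sketch of this scaling step in PyTorch could look as follows, where the learnable parameter s (the naming is my assumption) stores the logarithm of the diagonal entries:

```python
import torch

s = torch.zeros(784, requires_grad=True)   # log of the diagonal scaling factors

h = torch.randn(64, 784)                   # output of the last coupling layer
z = h * torch.exp(s)                       # elementwise positive scaling
log_det_jacobian = s.sum()                 # log|det(diag(exp(s)))| = sum_i s_i
```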

Now, let us switch to the implementation. In the paper, four coupling layers are used for the MNIST dataset, and the functions m are MLPs with five hidden layers of 1000 units each.
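Below is a possible sketch of this architecture in PyTorch. The class name NICE and the attributes coupling_nets and s are my own naming choices, not necessarily those of the original code; the hyperparameters follow the paper (four additive coupling layers, each with an MLP of five hidden layers of 1000 units).

```python
import torch
import torch.nn as nn

class NICE(nn.Module):
    def __init__(self, data_dim=784, hidden_dim=1000, n_coupling_layers=4, n_hidden_layers=5):
        super().__init__()
        half_dim = data_dim // 2

        def make_mlp():
            # MLP m: maps one half of the variables to an additive shift for the other half
            layers = [nn.Linear(half_dim, hidden_dim), nn.ReLU()]
            for _ in range(n_hidden_layers - 1):
                layers += [nn.Linear(hidden_dim, hidden_dim), nn.ReLU()]
            layers.append(nn.Linear(hidden_dim, half_dim))
            return nn.Sequential(*layers)

        # One function m per additive coupling layer
        self.coupling_nets = nn.ModuleList([make_mlp() for _ in range(n_coupling_layers)])
        # Log of the diagonal scaling factors of the top layer
        self.s = nn.Parameter(torch.zeros(data_dim))
```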

Two main functions need to be implemented. The first one is the forward function, which maps the data points y (our images) to the latent variables z (see Figure 1) and also returns the log determinant of the Jacobian of the transformation; it is the function used for density estimation. The second one, the inverse function, maps latent variables back to the data space and is used for sampling.

In the forward function, the input is first partitioned into its odd and even components. Each coupling layer then applies the equations of f, and the two partitions are swapped after every layer. Finally, the diagonal scaling is applied and the log determinant of the Jacobian is computed.
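Under the assumptions of the sketch above, the forward method could be written as follows (to be placed inside the hypothetical NICE class):

```python
import torch

def forward(self, x):
    # Method of the hypothetical NICE class sketched earlier.
    # Partition the input into odd- and even-indexed components
    x1, x2 = x[:, ::2], x[:, 1::2]

    for m in self.coupling_nets:
        # Additive coupling: y1 = x1, y2 = x2 + m(x1)
        y1, y2 = x1, x2 + m(x1)
        # Swap the partitions so that the next layer modifies the other half
        x1, x2 = y2, y1

    # Reassemble the output of the last coupling layer (undoing the last swap)
    h = torch.cat([x2, x1], dim=1)

    # Diagonal positive scaling; its log Jacobian determinant is simply sum(s)
    z = h * torch.exp(self.s)
    log_det_jacobian = self.s.sum()
    return z, log_det_jacobian
```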

The inverse function is very similar to the forward function, except that the operations are applied in reverse order (and the additive coupling becomes a subtraction).
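A matching inverse method, consistent with the forward sketch above, could look like this:

```python
import torch

def inverse(self, z):
    # Method of the hypothetical NICE class sketched earlier.
    # Undo the diagonal scaling first
    h = z * torch.exp(-self.s)

    # Split into the two halves produced by the forward pass
    half = h.shape[1] // 2
    x2, x1 = h[:, :half], h[:, half:]

    # Invert the coupling layers in reverse order (the addition becomes a subtraction)
    for m in reversed(self.coupling_nets):
        x1, x2 = x2, x1 - m(x2)

    # Re-interleave the odd and even components to recover the data layout
    x = torch.empty_like(z)
    x[:, ::2] = x1
    x[:, 1::2] = x2
    return x
```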

Now that the normalizing flow is implemented, we can create a helper function to train it. It takes as input the model, an associated optimizer, a dataloader, and a base distribution. It then trains the model by maximizing the total log-likelihood of the dataset in the dataloader (equivalently, minimizing the forward KL divergence).
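A possible sketch of such a helper is given below (the function signature is an assumption); it maximizes log pH(f(x)) + log|det J| over the dataset by minimizing its negative:

```python
import torch

def train(model, optimizer, dataloader, base_distribution, epochs=500, device="cpu"):
    model.to(device).train()
    for epoch in range(epochs):
        total_log_likelihood = 0.0
        for x, _ in dataloader:                    # the labels are ignored
            x = x.to(device)
            z, log_det_jacobian = model(x)

            # log pX(x) = log pH(f(x)) + log|det(df/dx)|
            log_likelihood = base_distribution.log_prob(z).sum(dim=1) + log_det_jacobian
            loss = -log_likelihood.mean()          # maximize the log-likelihood

            optimizer.zero_grad()
            loss.backward()
            optimizer.step()

            total_log_likelihood += log_likelihood.sum().item()
        print(f"epoch {epoch}: total log-likelihood = {total_log_likelihood:.1f}")
```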

As mentioned previously, the base distribution used in the paper for the MNIST dataset is the standard logistic distribution. It is not directly available in PyTorch, but it can be built by composing existing distributions and transforms.
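One way to do so, following the logistic-distribution example from the PyTorch documentation of TransformedDistribution, is to push a uniform distribution through the inverse of a sigmoid (a logit):

```python
import torch
from torch.distributions import Uniform, TransformedDistribution
from torch.distributions.transforms import SigmoidTransform

# If U ~ Uniform(0, 1), then logit(U) follows a standard logistic distribution
base_distribution = TransformedDistribution(
    Uniform(torch.tensor(0.0), torch.tensor(1.0)),
    [SigmoidTransform().inv],
)

samples = base_distribution.sample((5,))       # a few scalar samples
log_p = base_distribution.log_prob(samples)    # logistic log-density
```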

Now that all the pieces are ready, we can put them together and train the model.
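As an illustration, a minimal training script under the assumptions of the previous sketches (the hypothetical NICE class, train helper, and base_distribution defined above; MNIST images flattened to 784-dimensional vectors) could look like:

```python
import torch
from torch.utils.data import DataLoader
from torchvision import datasets, transforms

# Flatten each 28x28 grayscale image into a 784-dimensional vector
transform = transforms.Compose([
    transforms.ToTensor(),
    transforms.Lambda(lambda t: t.view(-1)),
])
dataset = datasets.MNIST("data", train=True, download=True, transform=transform)
dataloader = DataLoader(dataset, batch_size=256, shuffle=True)

model = NICE(data_dim=784)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

train(model, optimizer, dataloader, base_distribution, epochs=500)

# Sampling: draw latent variables from the base distribution and invert the flow
with torch.no_grad():
    z = base_distribution.sample((16, 784))
    generated_images = model.inverse(z).view(-1, 1, 28, 28)
```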

After 500 epochs, here are the samples your model should produce. While many of them look quite reasonable, not all of them are perfect. In the next post, we will implement a more powerful architecture.

Figure 2: Unbiased samples from a trained NICE model.

The full code is available in the following GitHub repository.

Note: while normalizing flows are powerful and not very sensitive to their hyperparameters, their bijectivity property is quite restrictive. For example, standard convolution layers cannot be used. In the next post, we will implement a more powerful class of normalizing flow, and in the one after that, we will train a flow to generate multi-channel images. In the meantime, you may be interested in my posts about GANs and variational autoencoders.

I hope this story was useful to you. If you enjoyed it, please leave a clap to support my work; it helps similar stories reach more readers.

If you want to dive into deep generative modelling, here are two great books I strongly recommend:

  • Generative Deep Learning: Teaching Machines to Paint, Write, Compose, and Play: https://amzn.to/3xLG9Zh
  • Generative AI with Python and TensorFlow 2: Create images, text, and music with VAEs, GANs, LSTMs, Transformer models: https://amzn.to/3g4Y9Ia

Disclosure: I only recommend products I use myself, and all opinions expressed here are my own. This post may contain affiliate links; at no additional cost to you, I may earn a small commission.

References

  • [1] Laurent Dinh, David Krueger, Yoshua Bengio, “NICE: Non-linear Independent Components Estimation”, 2015.
