Normalizing Flows

Introduction to Normalizing Flows

Papers in 100 Lines of Code
4 min read · Jan 31, 2021

One of the best-known approaches to learning a generative model of data is certainly the adversarial one. It involves pitting two networks against each other so that they jointly improve until one of them (the generator) produces data that closely matches the training distribution. While generative adversarial networks have shown impressive results, they require optimizing a two-player minimax game, which is unstable and not easily tunable (see this post for a PyTorch implementation).

Another well-known approach is the variational one. The idea is to introduce a variational posterior distribution q(z|y) and optimize a lower bound on the log marginal likelihood:

log p(y) ≥ E_{q(z|y)}[ log p(y|z) ] − KL( q(z|y) || p(z) )

While more stable, this approach also requires training two neural networks and only optimizes a lower bound on the log marginal likelihood (see this post for a PyTorch implementation).

Normalizing flows are powerful statistical models well suited to generative modeling, among other tasks. They allow the exact evaluation of p(y), and therefore their weights can be learned directly by maximizing the total log marginal likelihood of the training dataset.

The ability to evaluate p(y) exactly is a powerful asset, and if this is the first time you are reading about normalizing flows, it may all seem magical. Let us dive into the details.
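
As a teaser, here is a minimal sketch of what such direct maximum-likelihood training could look like in PyTorch. The flow object and its log_prob method are assumptions at this point (the rest of the post explains where this method comes from); minimizing the average negative log-likelihood is equivalent to maximizing the total log-likelihood.

```python
import torch

def training_step(flow, optimizer, y_batch):
    # Maximizing the total log-likelihood is the same as minimizing the
    # average negative log-likelihood of the batch.
    loss = -flow.log_prob(y_batch).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Typical usage (flow is any model exposing log_prob, e.g. the sketches below):
# optimizer = torch.optim.Adam(flow.parameters(), lr=1e-3)
# for y_batch in dataloader:
#     training_step(flow, optimizer, y_batch)
```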

The idea is to construct a bijective mapping f such that y = f(z), where z is a variable with a (usually simple) base density p(z).

As f is invertible, z can be recovered from y:

z = f⁻¹(y)

In this setting, given y, one can evaluate its density by simply inverting f and keeping track of the Jacobian of the transformation (change of variables theorem):

p(y) = p(z) / |det ∂f(z)/∂z|,  with z = f⁻¹(y)

Beyond evaluating densities, one can use the trained flow to sample from the target density: draw z ~ p(z) and compute y = f(z).

To sum up, being invertible, normalizing flows allow both density evaluation and sampling. This is represented in Figure 1.

Figure 1: Representation of normalizing flows [1].
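
To make this concrete, here is a minimal sketch of a single flow built from an element-wise affine bijection with a standard normal base density. The class and its interface are illustrative choices rather than a reference implementation; the point is that log_prob follows directly from the change of variables formula, while sample simply pushes base samples through f.

```python
import torch
from torch import nn

class AffineFlow(nn.Module):
    """Element-wise affine bijection y = f(z) = z * exp(log_scale) + shift,
    with a standard normal base density p(z)."""

    def __init__(self, dim):
        super().__init__()
        self.log_scale = nn.Parameter(torch.zeros(dim))
        self.shift = nn.Parameter(torch.zeros(dim))
        self.base = torch.distributions.Normal(torch.zeros(dim), torch.ones(dim))

    def forward(self, z):
        # z -> y, used for sampling.
        return z * self.log_scale.exp() + self.shift

    def inverse(self, y):
        # y -> z, used for density evaluation.
        return (y - self.shift) * (-self.log_scale).exp()

    def log_prob(self, y):
        # Change of variables: log p(y) = log p(z) - sum(log_scale), z = f^{-1}(y).
        z = self.inverse(y)
        return self.base.log_prob(z).sum(-1) - self.log_scale.sum()

    def sample(self, n):
        # Sampling: draw z ~ p(z), then push it through f.
        with torch.no_grad():
            z = self.base.sample((n,))
            return self.forward(z)

flow = AffineFlow(dim=2)
y = flow.sample(4)         # draw 4 samples from the model density
print(flow.log_prob(y))    # exact (log) density of those samples
```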

Since a composition of invertible functions is itself invertible, researchers have worked on developing invertible layers, many of which can be stacked to form highly flexible transformations. It is worth mentioning that a naive choice of f would imply a cubic complexity in the dimension of the data y to compute the Jacobian determinant. Therefore, researchers have developed constrained functions for which the computation of the Jacobian determinant is efficient.
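
A stack of such layers could be composed as in the following sketch, which assumes that each layer exposes forward(z) -> (y, log_det) and inverse(y) -> (z, log_det), with log_det the log absolute Jacobian determinant of the map being applied (this interface is an assumption made for the sake of illustration).

```python
import torch
from torch import nn

class ComposedFlow(nn.Module):
    """Stack of invertible layers; each layer is assumed to expose
    forward(z) -> (y, log_det) and inverse(y) -> (z, log_det), where
    log_det is the log |det Jacobian| of the map being applied."""

    def __init__(self, layers, base):
        super().__init__()
        self.layers = nn.ModuleList(layers)
        self.base = base  # simple base density p(z), e.g. a standard normal

    def log_prob(self, y):
        # Invert the layers in reverse order, accumulating the log-determinants.
        log_det_total = 0.0
        for layer in reversed(self.layers):
            y, log_det = layer.inverse(y)
            log_det_total = log_det_total + log_det
        return self.base.log_prob(y).sum(-1) + log_det_total

    def sample(self, n):
        # Push base samples through the layers in the forward order.
        z = self.base.sample((n,))
        for layer in self.layers:
            z, _ = layer.forward(z)
        return z
```

The per-layer log-determinants simply add up, since the Jacobian of a composition is the product of the Jacobians of the individual layers.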

One common way to make the computation of the Jacobian determinant linear in the dimension is to make the Jacobian of the transformation f lower triangular. For example, this can be done by making the transformation autoregressive, such that f can be rewritten as a vector of d scalar functions:

y_i = f_i(z_1, …, z_i),  i = 1, …, d

The Jacobian of f is therefore lower triangular, and making each scalar function f_i bijective with respect to z_i is sufficient to make f bijective. Moreover, the determinant of a triangular matrix is simply the product of its diagonal entries, which is what makes it cheap to compute.
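
The following sketch illustrates an affine autoregressive transform in this spirit, loosely inspired by architectures such as MAF and IAF. The per-dimension conditioner networks are an illustrative simplification; in practice, masked networks (e.g. MADE) produce all the parameters in a single pass.

```python
import torch
from torch import nn

class AffineAutoregressiveFlow(nn.Module):
    """y_i = z_i * exp(s_i(z_{<i})) + t_i(z_{<i}): the Jacobian dy/dz is
    lower triangular, so log |det J| = sum_i s_i(z_{<i})."""

    def __init__(self, dim, hidden=32):
        super().__init__()
        self.dim = dim
        # One small conditioner per dimension (illustrative choice only).
        self.conditioners = nn.ModuleList([
            nn.Sequential(nn.Linear(max(i, 1), hidden), nn.Tanh(),
                          nn.Linear(hidden, 2))
            for i in range(dim)
        ])

    def _params(self, prefix, i):
        # For i == 0 there is no prefix; feed a constant zero instead.
        if i == 0:
            prefix = torch.zeros(prefix.shape[0], 1)
        s, t = self.conditioners[i](prefix).chunk(2, dim=-1)
        return s.squeeze(-1), t.squeeze(-1)

    def forward(self, z):
        # z -> y: every y_i only depends on z_1..z_i.
        ys, log_det = [], torch.zeros(z.shape[0])
        for i in range(self.dim):
            s, t = self._params(z[:, :i], i)
            ys.append(z[:, i] * s.exp() + t)
            log_det = log_det + s
        return torch.stack(ys, dim=-1), log_det

    def inverse(self, y):
        # y -> z has to be computed sequentially, one dimension at a time.
        zs, log_det = [], torch.zeros(y.shape[0])
        for i in range(self.dim):
            prefix = torch.stack(zs, dim=-1) if zs else y[:, :0]
            s, t = self._params(prefix, i)
            zs.append((y[:, i] - t) * (-s).exp())
            log_det = log_det + s
        return torch.stack(zs, dim=-1), log_det

flow = AffineAutoregressiveFlow(dim=3)
z = torch.randn(5, 3)
y, log_det = flow(z)                         # one pass given all of z
z_rec, _ = flow.inverse(y)                   # sequential reconstruction
print(torch.allclose(z, z_rec, atol=1e-5))   # True: the transform is invertible
```

Note that with this parameterization y can be computed from z given all of z, whereas recovering z from y is inherently sequential; flipping the direction of the conditioning (as in MAF) trades fast sampling for fast density evaluation.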

Normalizing flows can also be conditioned on additional variables to allow sampling and evaluation of conditional densities (see Figure 2).

Figure 2: Representation of conditional normalizing flows [1].
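
As an illustration, a single conditional affine layer could look like the sketch below, where a small conditioner network (an illustrative choice) maps the conditioning variable x to the parameters of the bijection, giving access to both the evaluation of p(y | x) and conditional sampling.

```python
import torch
from torch import nn

class ConditionalAffineFlow(nn.Module):
    """Affine bijection in y whose parameters are functions of a
    conditioning variable x, giving access to p(y | x)."""

    def __init__(self, dim, context_dim, hidden=32):
        super().__init__()
        # The conditioner maps x to a log-scale and a shift for each dimension.
        self.conditioner = nn.Sequential(
            nn.Linear(context_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 2 * dim),
        )
        self.base = torch.distributions.Normal(torch.zeros(dim), torch.ones(dim))

    def log_prob(self, y, x):
        # Conditional change of variables: invert the affine map given x.
        log_scale, shift = self.conditioner(x).chunk(2, dim=-1)
        z = (y - shift) * (-log_scale).exp()
        return self.base.log_prob(z).sum(-1) - log_scale.sum(-1)

    def sample(self, x):
        # Conditional sampling: z ~ p(z), then apply the x-dependent bijection.
        log_scale, shift = self.conditioner(x).chunk(2, dim=-1)
        z = self.base.sample((x.shape[0],))
        return z * log_scale.exp() + shift
```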

While normalizing flows are extremely powerful and efficient, they also have drawbacks. Their main pitfall is that the architecture is heavily constrained in order to remain bijective, which makes the introduction of inductive biases more difficult.

In my next posts, I will implement normalizing flows using i) the reverse KL divergence, in order to fit a density that can only be evaluated up to a constant factor, and ii) the forward KL divergence, in order to learn a generative model of data from a training set of images.

I hope the story was useful to you. If you enjoyed it, please leave a clap to support my work; it will propel other similar stories.

References

  • [1] Maxime Vandegar, “Differentiable Surrogate Models to Solve Nonlinear Inverse Problems”, 2020.
