Auto-Encoder models

What is an Autoencoder?

An autoencoder is a type of artificial neural network made of two parts: an encoder and a decoder. The encoder learns how to compress the data, and the decoder learns how to reconstruct it. Between these two parts, the network produces a compact summary of the data, called a latent representation. Autoencoders therefore learn the most important features of the data without needing labels or supervision.
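To make this concrete, here is a minimal sketch in PyTorch (an assumed framework; the fully connected architecture, layer sizes and training details below are illustrative choices, not the actual architecture used in the project):

```python
import torch
import torch.nn as nn

# A minimal autoencoder: the encoder compresses the input into a small
# latent vector, the decoder tries to reconstruct the input from it.
class Autoencoder(nn.Module):
    def __init__(self, input_dim=40, latent_dim=2):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(input_dim, 16), nn.ReLU(),
            nn.Linear(16, latent_dim),           # latent representation
        )
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 16), nn.ReLU(),
            nn.Linear(16, input_dim),            # reconstruction
        )

    def forward(self, x):
        z = self.encoder(x)
        return self.decoder(z)

# Training minimises the reconstruction error -- no labels are needed.
model = Autoencoder()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
x = torch.randn(64, 40)          # dummy batch standing in for real data
optimizer.zero_grad()
loss = nn.functional.mse_loss(model(x), x)
loss.backward()
optimizer.step()
```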

Denoising Autoencoders

A denoising autoencoder takes this idea further: it learns to reconstruct clean data from noisy input. This can be useful for removing noise from data or recovering missing information.
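The only change with respect to the sketch above is the training objective: the model receives a corrupted input but is compared against the clean data. A short illustration, reusing the `model` from the previous sketch (the noise level and the additive Gaussian noise model are illustrative assumptions):

```python
import torch
import torch.nn.functional as F

# Denoising variant: corrupt the input, but compare the reconstruction
# to the *clean* data.
x_clean = torch.randn(64, 40)                        # stands in for real data
x_noisy = x_clean + 0.1 * torch.randn_like(x_clean)  # corrupted input
loss = F.mse_loss(model(x_noisy), x_clean)           # target is the clean data
```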

Variational Autoencoders (VAEs)

A variational autoencoder adds one important property: instead of just compressing the data into fixed numbers (the latent variables), it learns to describe the data as a probability distribution. This forces the latent representation to be smoothly distributed in the latent space, and allows using VAEs as generative models: by sampling points in the latent space and applying the decoder, one can generate entirely new samples that are, in principle, statistically similar to the training samples.
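In practice, the encoder outputs the parameters of a Gaussian distribution (here assumed diagonal) rather than a single latent point, a point is sampled from it via the reparameterisation trick, and the loss combines the reconstruction error with a KL-divergence term that pulls the latent distribution towards a standard normal prior. A minimal sketch, again in PyTorch with illustrative layer sizes:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class VAE(nn.Module):
    def __init__(self, input_dim=40, latent_dim=2):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(input_dim, 16), nn.ReLU())
        self.fc_mu = nn.Linear(16, latent_dim)       # mean of q(z|x)
        self.fc_logvar = nn.Linear(16, latent_dim)   # log-variance of q(z|x)
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 16), nn.ReLU(),
            nn.Linear(16, input_dim),
        )

    def forward(self, x):
        h = self.encoder(x)
        mu, logvar = self.fc_mu(h), self.fc_logvar(h)
        # Reparameterisation trick: z = mu + sigma * eps, with eps ~ N(0, 1)
        z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)
        return self.decoder(z), mu, logvar

def vae_loss(x, x_rec, mu, logvar):
    # Reconstruction term + KL divergence to the standard normal prior
    rec = F.mse_loss(x_rec, x, reduction='sum')
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    return rec + kl
```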

In this unpublished project, we used variational autoencoders as a first step towards building a generative model capable of producing synthetic planetary systems, similar to those computed by numerically solving differential equations. In order to easily visualise the latent space, the original data (which lives in a space of dimension 40) is compressed into a latent space of dimension 2. This very strong compression prevents our models from reconstructing the data with high fidelity, but it makes the structure of the latent space easy to understand (see the following figure).
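Once such a model is trained, generating new synthetic systems amounts to sampling points in the 2-dimensional latent space and decoding them. A hypothetical usage sketch, assuming `model` is the VAE defined above:

```python
import torch

model.eval()
with torch.no_grad():
    z = torch.randn(1000, 2)       # latent points drawn from the prior
    synthetic = model.decoder(z)   # decoded 40-dimensional samples
```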

Structure of the latent space after training of our Variational Auto-Encoder. For each sample (25000 in total), the encoder learns the mode and covariance matrix of the distribution (in the latent space) from which a latent point is sampled. The decoder then learns to reconstruct the sample from this randomly chosen latent point. In the image, the orange points are the modes of the Gaussian distributions in the latent space, whereas the blue points are the randomly sampled points used for the reconstruction. Note that the axes do not have any simple physical meaning.

Planetary systems used for training and their reconstructions. Columns 1 and 3 show planetary systems: blue represents the original properties of the planets, red the 'noised' points (random noise added to the blue points). Columns 2 and 4 show the reconstructed samples (in green); the blue points are the same as in columns 1 and 3. Although the VAE captures the general shape of the point distributions, it fails to reconstruct the fine details of planetary system architecture. The horizontal axis shows the distance between the central star and the planet (the semi-major axis) in astronomical units (the distance between the Sun and the Earth), on a logarithmic scale. The vertical axis shows the mass of the planets in Earth masses, also on a logarithmic scale. The Earth would be located at the point (0, 0) on this diagram.

The notebook (experimental and not fully commented) used for this project is available here. The input file ("J20_25000.csv") containing the training set is available on request.