Image Auto Encoder using deconvolution and unpooling
Summary
The example Image\GettingStarted\07_Deconvolution_PY.py shows how to use Deconvolution and Unpooling to generate a simple image auto encoder (07_Deconvolution_BS.cntk is the corresponding BrainScript version). It uses the MNIST dataset, which has a resolution of 28x28x1, encodes it into a 7x7x1 representation using convolution and pooling, and decodes it back to the original resolution. The training criterion is root-mean-square error (RMSE). The figure above shows visualizations of the original image, the encoded image, and the decoded image for the first five images of the MNIST test set.
Setup
To run the example you need the MNIST data set. You can get the data by running the following command from the Examples\Image\DataSets\MNIST folder:
python install_mnist.py
Run the example
The example is located in the Examples\Image\GettingStarted folder. To run this example, use the following command to run the Python version (from a Python CNTK environment):
python 07_Deconvolution_PY.py
or this command for the BrainScript version:
cntk configFile=07_Deconvolution_BS.cntk
The RMSE values for training and testing are 0.225 and 0.223, respectively. To visualize the encoded and decoded images, run the following command:
python 07_Deconvolution_Visualizer.py
Set use_brain_script_model=True for the BrainScript model and False for the Python model.
The visualizations will be stored in the Output folder under Examples\Image\GettingStarted, together with a text representation of the encoder and the decoder output.
Technical details
Below is the model definition for the simple image auto encoder in BrainScript (for the full config file see Image\GettingStarted\07_Deconvolution_BS.cntk):
cMap = 1
model = inputFeatures => {
    conv1   = ConvolutionalLayer {cMap, (5:5), pad = true, activation=ReLU}(inputFeatures)
    pool1   = MaxPoolingLayer {(4:4), stride=(4:4)}(conv1)
    unpool1 = MaxUnpoolingLayer {(4:4), stride=(4:4)}(pool1, conv1)
    deconv1 = DeconvLayer {1, (5:5), cMap, lowerPad=(2:2:0), upperPad=(2:2:0), bias=false}(unpool1)
}.deconv1
The corresponding model definition in 07_Deconvolution_PY.py is:
cMap = 1
conv1   = cntk.layers.Convolution((5,5), cMap, pad=True, activation=cntk.ops.relu)(scaled_input)
pool1   = cntk.layers.MaxPooling((4,4), (4,4))(conv1)
unpool1 = cntk.layers.MaxUnpooling((4,4), (4,4))(pool1, conv1)
z       = cntk.layers.Deconvolution((5,5), num_channels, cMap, lower_pad=(0,2,2), upper_pad=(0,2,2), bias=False, init=cntk.glorot_uniform(0.001))(unpool1)
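The summary above states that the training criterion is RMSE. As a hedged sketch of how such a criterion can be formed (my own construction from standard cntk.ops, using the names z and scaled_input from the snippet above; not necessarily the exact code in the script):

# Hypothetical RMSE criterion built from standard cntk.ops:
err  = cntk.ops.minus(z, scaled_input)  # per-pixel reconstruction error
rmse = cntk.ops.sqrt(cntk.ops.reduce_mean(cntk.ops.square(err), axis=cntk.Axis.all_static_axes()))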
We describe the BrainScript version here; the Python version is analogous. The model first applies a ConvolutionalLayer with a depth of cMap=1 to the input features, followed by a ReLU activation, and then uses a MaxPoolingLayer with a filter shape and stride of (4:4). This results in an encoded tensor of size 7x7x1. It then uses a MaxUnpoolingLayer and a DeconvLayer with the corresponding filter shapes to decode it back to the original resolution.
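To see where the 7x7x1 encoding comes from, here is a minimal sketch (assuming CNTK 2.2+ and an MNIST-shaped input variable; the variable names are mine) that prints the intermediate shapes:

import cntk

features = cntk.input_variable((1, 28, 28))  # depth x height x width for MNIST
conv1 = cntk.layers.Convolution((5,5), 1, pad=True, activation=cntk.ops.relu)(features)
pool1 = cntk.layers.MaxPooling((4,4), (4,4))(conv1)
print(conv1.shape)  # expected (1, 28, 28): padding preserves the spatial size
print(pool1.shape)  # expected (1, 7, 7): the encoded representation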
The encoder part compresses the original 784 (28x28) numbers to 49 (7x7), a factor of 16. Using a depth of only 1 for the ConvolutionalLayer has the advantage that the encoder results can be visualized in a meaningful way (see figure at the top of this page). One can increase the number of convolutional filters, e.g. to cMap=3, to have less compression and, hopefully, better decoding results. In this example the RMSE for both training and testing is reduced to 0.196. Another way to have less compression is to use a smaller filter shape and stride for the pooling layer. Using (2:2) for both pooling and unpooling yields an encoded tensor of size 14x14x1 and reduces the RMSE in this example to 0.136 for training and 0.131 for testing. The figure below shows visualizations of the original image and the decoded image for the first five images of the MNIST test set for the three discussed settings.
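As a hedged sketch of the (2:2) variant just described, continuing the shape-check snippet above (same assumptions; only the pooling window and stride change):

pool1   = cntk.layers.MaxPooling((2,2), (2,2))(conv1)           # 28x28 -> 14x14
unpool1 = cntk.layers.MaxUnpooling((2,2), (2,2))(pool1, conv1)  # 14x14 -> 28x28
print(pool1.shape)  # expected (1, 14, 14): 196 numbers, a compression factor of 4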
Deconvolution and Unpooling
Let's look a bit closer at the MaxUnpoolingLayer and the DeconvLayer.
MaxUnpoolingLayer {(4:4), stride=(4:4)}(pool1, conv1)
The MaxUnpoolingLayer requires two inputs: the output of the corresponding pooling layer (pool1 in this case) and the input of the corresponding pooling layer (conv1 in this case). conv1 is required in CNTK to determine the target of the unpooling operation, since CNTK does not store so-called switch variables (see here for details).
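To make the role of the second input concrete, here is a small numeric sketch (assuming CNTK 2.2+; the 4x4 toy input is mine): unpooling writes each pooled maximum back to the position it originally came from in the pooling input and fills the rest with zeros.

import numpy as np
import cntk

x = cntk.input_variable((1, 4, 4))
pooled   = cntk.layers.MaxPooling((2,2), (2,2))(x)
unpooled = cntk.layers.MaxUnpooling((2,2), (2,2))(pooled, x)

data = np.arange(16, dtype=np.float32).reshape(1, 1, 4, 4)
print(unpooled.eval({x: data})[0, 0])
# Only the four window maxima (5, 7, 13, 15) survive, each at its original
# position; all other entries are 0.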
DeconvLayer {1, (5:5), cMap, lowerPad=(2:2:0), upperPad=(2:2:0)}
The first parameter of the DeconvLayer is the depth of the output volume, the second is the kernel shape (width:height), and the third is the depth of the input volume. The padding parameters have to be set in accordance with the kernel shape to achieve the desired width and height of the output tensor (28x28 in this case). For more details on the DeconvLayer see the Layer Reference page.
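As a hedged aside (this is the standard transposed-convolution size relation, stated here for intuition rather than quoted from the CNTK reference), the padding choice can be sanity-checked per spatial dimension: with a 5x5 kernel, stride 1, and padding of 2 on each side, a 28-wide input stays 28 wide.

# out = stride * (in - 1) + kernel - lower_pad - upper_pad  (per spatial dim)
def deconv_output_size(in_size, kernel, stride, lower_pad, upper_pad):
    return stride * (in_size - 1) + kernel - lower_pad - upper_pad

print(deconv_output_size(28, 5, 1, 2, 2))  # 28: the original MNIST resolution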
Multi layer auto encoder
You can stack more layers of Conv/Deconv and Pool/Unpool for a more complex auto encoder. The following is an example with two layers of each type that you can use in 07_Deconvolution_BS.cntk (simply replace the model in the file):
inputDim = 1
cMap1 = 5
cMap2 = 1
model = inputFeatures => {
    conv_A   = ConvolutionalLayer {cMap1, (5:5), pad = true, activation=ReLU}(inputFeatures)
    pool_A   = MaxPoolingLayer {(2:2), stride=(2:2)}(conv_A)
    conv_B   = ConvolutionalLayer {cMap2, (5:5), pad = true, activation=ReLU}(pool_A)
    pool_B   = MaxPoolingLayer {(2:2), stride=(2:2)}(conv_B)
    unpool_B = MaxUnpoolingLayer {(2:2), stride=(2:2)}(pool_B, conv_B)
    deconv_B = DeconvLayer {cMap1, (5:5), cMap2, lowerPad=(2:2:0), upperPad=(2:2:0)}(unpool_B)
    unpool_A = MaxUnpoolingLayer {(2:2), stride=(2:2)}(deconv_B, conv_A)
    deconv_A = DeconvLayer {inputDim, (5:5), cMap1, lowerPad=(2:2:0), upperPad=(2:2:0)}(unpool_A)
}.deconv_A
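For reference, here is a hedged Python sketch of the same two-layer auto encoder (assuming the same cntk.layers API as the single-layer Python model above; the variable names are mine and mirror the BrainScript version):

import cntk

num_channels, cMap1, cMap2 = 1, 5, 1
features = cntk.input_variable((num_channels, 28, 28))

conv_A   = cntk.layers.Convolution((5,5), cMap1, pad=True, activation=cntk.ops.relu)(features)
pool_A   = cntk.layers.MaxPooling((2,2), (2,2))(conv_A)
conv_B   = cntk.layers.Convolution((5,5), cMap2, pad=True, activation=cntk.ops.relu)(pool_A)
pool_B   = cntk.layers.MaxPooling((2,2), (2,2))(conv_B)  # encoder output: 7x7x1
unpool_B = cntk.layers.MaxUnpooling((2,2), (2,2))(pool_B, conv_B)
deconv_B = cntk.layers.Deconvolution((5,5), cMap1, cMap2, lower_pad=(0,2,2), upper_pad=(0,2,2))(unpool_B)
unpool_A = cntk.layers.MaxUnpooling((2,2), (2,2))(deconv_B, conv_A)
z        = cntk.layers.Deconvolution((5,5), num_channels, cMap1, lower_pad=(0,2,2), upper_pad=(0,2,2))(unpool_A)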
To visualize the results you need to replace z.pool1 by z.pool_B in 07_Deconvolution_Visualizer.py before running it, so that the correct node name is used for the encoder output. To investigate all node names in the model, simply uncomment print_all_node_names(model_file) in the Python script.