Image-to-Image Translation using Pix2Pix

Last Updated : 23 Jun, 2022

Pix2pix GANs were proposed by researchers at UC Berkeley in 2017. It uses a conditional Generative Adversarial Network to perform the image-to-image translation task (i.e. converting one image to another, such as facades to buildings and Google Maps to Google Earth, etc.

👁 Image

Architecture:

The pix2pix uses conditional generative adversarial networks (conditional-GAN) in its architecture. The reason for this is even if we train a model with a simple L1/L2 loss function for a particular image-to-image translation task, this might not understand the nuances of the images.

Generator:

👁 Image

U-Net architecture

The architecture used in the generator was U-Net architecture. It is similar to Encoder-Decoder architecture except for the use of skip-connections in the encoder-decoder architecture. Skip connections are used because when the encoder downsamples the image, the output of the encoder contains more information about features and classification of class but lost the low-level features like spatial arrangement of the object of that class in the image, so skip connections between encoder and decoder layers prevent this problem of losing low-level features.

Encoder Architecture: The Encoder network of the Generator network has seven convolutional blocks. Each convolutional block has a convolutional layer, followed by a LeakyRelu activation function (with a slope of 0.2 in the paper). Each convolutional block also has a batch normalization layer except the first convolutional layer.
Decoder Architecture: The Decoder network of the Generator network has seven Transpose convolutional blocks. Each upsampling convolutional block (Dconv) has an upsampling layer, followed by a convolutional layer, a batch normalization layer, and a ReLU activation function.
The generator architecture contains skip connections between each layer i and layer n − i, where n is the total number of layers. Each skip connection simply concatenates all channels at layer i with those at layer n − i.

Discriminator:

👁 Image

Patch GAN discriminator

The discriminator uses Patch GAN architecture, which also uses Style GAN architecture. This PatchGAN architecture contains a number of Transpose convolutional blocks. This PatchGAN architecture takes an NxN part of the image and tries to find whether it is real and fake. This discriminator is applied convolutionally across the whole image, averaging it to generate the result of the discriminator D.

Each block of the discriminator contains a convolution layer, batch norm layer, and LeakyReLU. This discriminator receives two inputs:

The input image and Target Image (which discriminator should classify as real)
The input image and Generated Image (which they should classify as fake).

The PatchGAN is used because the author argues that it will be able to preserve high-frequency details in the image, with low-frequency details that can be focused by L1-loss.

Generator Loss: The generator loss used in the paper is the linear combination of L1- loss between generated image, target image, and GAN loss as we define above.

Our generated loss will be:

Therefore, our total loss for generator

Discriminator Loss: The discriminator loss takes two inputs real image and generated image:

real_loss is a sigmoid cross-entropy loss of the real images and an array of ones(since these are the real images).
generated_loss is a sigmoid cross-entropy loss of the generated images and an array of zeros(since these are the fake images)
The total loss is the sum of the real_loss and generated_loss.

Implementation:

First, we download and preprocess the image dataset. We will use the CMP Facade dataset that was provided Czech Technical University and processed by the authors of the pix2pix paper. We will preprocess the dataset before training.

Code:

Now, we load train, and test data using the function we defined above.

Code:

After performing data processing, Now, we write the code for generator architecture. This generator block contains 2 parts encoder block and decoder block. The encoder block contains a downsampling convolution block and the decoder block contains an upsampling transpose convolution block.

👁 Image

Generator architecture

Now we define our architecture for the discriminator. The discriminator architecture uses a PatchGAN model. For this architecture, we can use the above downsampling convolution block we defined. The loss of the discriminator is the sum of real loss (sigmoid cross-entropy b/w real image and array of 1s) and generated loss (sigmoid cross-entropy b/w generated image and an array of 0s).

Code:

👁 Image

Discriminator Architecture

In this step, we define optimizers and checkpoints. We will use Adam optimizer in both generator discriminator.

Code:

Now, we define the training procedure. The training procedure consists of the following steps:

For each example input, we passed the image as input to the generator to get the generated image.
The discriminator receives the input_image and the generated image as the first input. The second input is the input_image and the target_image.
Next, we calculate the generator and the discriminator loss.
Then, we calculate the gradients of loss with respect to both the generator and the discriminator variables(inputs) and apply those to the optimizer.

Code:

Now, we use the generator of the trained model on test data to generate the images.

Code:

👁 Image

Results

Comment

Article Tags:

Machine Learning

AI-ML-DS

python