Inception V2 and V3 - Inception Network Versions

Last Updated : 14 Oct, 2022

Inception V1 (or GoogLeNet) was the state-of-the-art architecture at ILSVRC 2014. It achieved a record low error on the ImageNet classification dataset, but several aspects of its design could still be improved to raise accuracy and reduce the complexity of the model.
Problems of Inception V1 architecture:
Inception V1 sometimes uses large convolutions such as 5x5, which cause the input dimensions to shrink by a large margin. This costs the network some accuracy, because a neural network is susceptible to information loss when the input dimension decreases too drastically.
Furthermore, complexity decreases when bigger convolutions such as 5x5 are replaced by smaller ones such as 3x3. We can go further with factorization: a 3x3 convolution can be split into an asymmetric 1x3 convolution followed by a 3x1 convolution. This is equivalent to sliding a two-layer network with the same receptive field as a 3x3 convolution, but it is 33% cheaper. This factorization does not work well in early layers, where the input dimensions are large. It works well only on medium-sized feature maps of size mxm, where m is between 12 and 20.
According to the Inception V1 architecture, the auxiliary classifier improves the convergence of the network. Its authors argue that it reduces the effect of the vanishing gradient problem in deep networks by pushing useful gradients to the earlier layers (to reduce the loss). However, the authors of the Inception V2 and V3 paper found that this classifier did not improve convergence very much early in training.
Architectural Changes in Inception V2:
In the Inception V2 architecture, the 5x5 convolution is replaced by two 3x3 convolutions. This reduces computation time and thus increases speed, because a 5x5 convolution is 2.78 times more expensive than a 3x3 convolution (25 weights versus 9 for the same number of filters). Using two 3x3 layers instead of one 5x5 layer therefore improves the performance of the architecture.

Figure 1
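
The cost figures quoted above can be checked with a little arithmetic. The following is a minimal sketch, assuming stride-1 convolutions, no bias terms, and a hypothetical channel count of 64 for both input and output. The helper names are made up for this illustration.

```python
def conv_weights(kernel_h, kernel_w, channels_in, channels_out):
    """Number of weights in one convolution layer (bias ignored)."""
    return kernel_h * kernel_w * channels_in * channels_out


def stacked_receptive_field(kernel_sizes):
    """Receptive field of stride-1 convolutions applied one after another."""
    field = 1
    for k in kernel_sizes:
        field += k - 1
    return field


channels = 64  # hypothetical channel count, same for input and output

single_5x5 = conv_weights(5, 5, channels, channels)
single_3x3 = conv_weights(3, 3, channels, channels)
two_3x3 = 2 * single_3x3

ratio_5x5_to_3x3 = single_5x5 / single_3x3  # 25 / 9
saving_two_3x3 = 1 - two_3x3 / single_5x5   # 1 - 18 / 25

print(f"5x5 costs {ratio_5x5_to_3x3:.2f}x one 3x3")
print(f"two 3x3 layers use {saving_two_3x3:.0%} fewer weights than one 5x5")
print("receptive field of two 3x3 layers:", stacked_receptive_field([3, 3]))
```

The channel count cancels out of both ratios, so the comparison holds for any layer width as long as the input and output widths match.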

This architecture also factorizes nxn convolutions into a 1xn convolution followed by an nx1 convolution. As discussed above, a 3x3 convolution can be replaced by a 1x3 convolution followed by a 3x1 convolution, which is 33% cheaper in computational complexity than the 3x3 convolution.

Figure 2
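
To make the asymmetric factorization concrete, here is a small pure-Python sketch. It assumes a hand-picked separable kernel (the outer product of a 3x1 and a 1x3 kernel) and no nonlinearity between the two steps. In the real network the weights are learned and an activation sits between the layers, so the exact equality below is only an illustration of the shared receptive field. The function and variable names are hypothetical.

```python
def correlate2d_valid(image, kernel):
    """Slide kernel over image with stride 1 and no padding."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(image[i + a][j + b] * kernel[a][b]
                for a in range(kh) for b in range(kw))
            for j in range(out_w)
        ]
        for i in range(out_h)
    ]


row_kernel = [[1, 2, 1]]        # 1x3 kernel
col_kernel = [[1], [0], [-1]]   # 3x1 kernel
# 3x3 kernel built as the outer product of the two kernels above
full_kernel = [[c[0] * r for r in row_kernel[0]] for c in col_kernel]

# Small 5x5 test image with arbitrary integer values
image = [[(3 * i + 5 * j) % 7 for j in range(5)] for i in range(5)]

# 1x3 followed by 3x1, compared with a single 3x3 pass
factorized = correlate2d_valid(correlate2d_valid(image, row_kernel), col_kernel)
direct = correlate2d_valid(image, full_kernel)

weights_full = 3 * 3
weights_factorized = 1 * 3 + 3 * 1
saving = 1 - weights_factorized / weights_full

print("same output:", factorized == direct)
print(f"weights saved: {saving:.0%}")
```

For general n the factorized pair needs 2n weights instead of n squared, so the saving grows with the kernel size.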

To deal with the problem of the representational bottleneck, the filter banks of the module were expanded (made wider) instead of making the module deeper. This prevents the loss of information that occurs when the module is made deeper.

Figure 3

Architectural Changes in Inception V3:
Inception V3 is similar to Inception V2 and contains all of its features, with the following changes and additions:

Use of RMSprop optimizer.
Batch Normalization in the fully connected layer of the auxiliary classifier.
Use of factorized 7x7 convolutions.
Label Smoothing Regularization: a method to regularize the classifier by estimating the effect of label-dropout during training. It prevents the classifier from predicting a class too confidently. Adding label smoothing improves the error rate by about 0.2%.
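
Label smoothing, the last item in the list above, can be sketched in a few lines. This is a minimal illustration, assuming a uniform prior over classes and a smoothing weight epsilon of 0.1, both chosen here for illustration. The function names are hypothetical.

```python
import math


def smooth_labels(true_class, num_classes, epsilon=0.1):
    """Mix a one-hot target with a uniform distribution over all classes."""
    uniform = epsilon / num_classes
    return [
        (1 - epsilon) + uniform if k == true_class else uniform
        for k in range(num_classes)
    ]


def cross_entropy(target, predicted):
    """Cross-entropy between a target distribution and predicted probabilities."""
    return -sum(t * math.log(p) for t, p in zip(target, predicted) if t > 0)


targets = smooth_labels(true_class=2, num_classes=4)

# Two hypothetical predictions: one extremely confident, one matching the targets
overconfident = [0.001, 0.001, 0.997, 0.001]
calibrated = [0.025, 0.025, 0.925, 0.025]

loss_overconfident = cross_entropy(targets, overconfident)
loss_calibrated = cross_entropy(targets, calibrated)

print("smoothed targets:", targets)
print("loss, overconfident prediction:", round(loss_overconfident, 4))
print("loss, calibrated prediction:", round(loss_calibrated, 4))
```

With smoothed targets the overconfident prediction receives the higher loss, which is how the technique discourages the classifier from pushing one probability towards 1.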

Architecture:
Below are the layer-by-layer details of Inception V2:

Inception V2 architecture

The above architecture takes an input image of size (299, 299, 3). Note that figures 5, 6 and 7 mentioned in the architecture table correspond to Figures 1, 2 and 3 in this article.
Implementation:
In this section we look at an implementation of Inception V3. We will use the Keras applications API to load the model, and the Cats vs Dogs dataset for training.
Code: Importing the required module.

Code: Creating directories in order to prepare for the dataset
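
A minimal sketch of this step using only the standard library. The directory names, the training/validation split, and the temporary root folder are assumptions made for illustration, not the layout of any particular dataset download.

```python
import tempfile
from pathlib import Path


def make_dataset_dirs(root, splits=("training", "validation"),
                      classes=("cats", "dogs")):
    """Create one folder per split and class, e.g. root/training/cats."""
    created = []
    for split in splits:
        for cls in classes:
            directory = Path(root) / split / cls
            directory.mkdir(parents=True, exist_ok=True)
            created.append(directory)
    return created


# Hypothetical root folder inside a fresh temporary directory
base_dir = Path(tempfile.mkdtemp()) / "cats-v-dogs"
dirs = make_dataset_dirs(base_dir)

for directory in dirs:
    print(directory)
```

Keeping one sub-folder per class inside each split is the layout that directory-based Keras data loaders expect.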

Code: Storing the dataset in the directories created above and plot some sample images.
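
A possible sketch of the storing step, again with the standard library only. The 90/10 split, the fixed shuffle seed, the skipping of empty files, and the placeholder files used in the demo are all assumptions. Plotting sample images is left out here.

```python
import random
import shutil
import tempfile
from pathlib import Path


def split_files(source_dir, train_dir, val_dir, train_fraction=0.9, seed=0):
    """Copy non-empty files from source_dir into train_dir and val_dir."""
    files = sorted(
        p for p in Path(source_dir).iterdir()
        if p.is_file() and p.stat().st_size > 0
    )
    random.Random(seed).shuffle(files)
    cut = int(len(files) * train_fraction)
    for dest, subset in ((Path(train_dir), files[:cut]),
                         (Path(val_dir), files[cut:])):
        dest.mkdir(parents=True, exist_ok=True)
        for path in subset:
            shutil.copy(path, dest / path.name)
    return cut, len(files) - cut


# Demo with ten placeholder files standing in for cat images
root = Path(tempfile.mkdtemp())
source = root / "source" / "cats"
source.mkdir(parents=True)
for i in range(10):
    (source / f"cat.{i}.jpg").write_bytes(b"placeholder")

n_train, n_val = split_files(
    source, root / "training" / "cats", root / "validation" / "cats"
)
print(n_train, "training files,", n_val, "validation files")
```

The same function would be called once per class, with the real image folder as source_dir.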

Code: Data augmentation to increase the data samples in dataset.

Code: Define the base model using the Inception API imported above, and a callback function for training the model.

In this step, we train our model. Before training, we need to replace the last layer so that the network predicts a single output, and choose an optimizer. Here we use RMSprop with a learning rate of 0.0001. We also add dropout with a rate of 0.2 after the last fully connected layer. After that, we train the model for up to 100 epochs.
Code:
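
A minimal sketch of the step described above, assuming TensorFlow 2.x with its bundled Keras. The 1024-unit hidden layer, the frozen base, the sigmoid output with binary cross-entropy, and the function name build_classifier are assumptions for illustration. The sketch uses weights=None so that nothing is downloaded; passing weights="imagenet" would start from pretrained weights instead.

```python
from tensorflow.keras import Model, layers
from tensorflow.keras.applications.inception_v3 import InceptionV3
from tensorflow.keras.optimizers import RMSprop


def build_classifier(weights=None, hidden_units=1024):
    """Inception V3 base with a single-output head for cats vs dogs."""
    # Base network without the original 1000-class ImageNet head
    base = InceptionV3(input_shape=(299, 299, 3),
                       include_top=False,
                       weights=weights)
    base.trainable = False  # assumption: train only the new head

    x = layers.GlobalAveragePooling2D()(base.output)
    x = layers.Dense(hidden_units, activation="relu")(x)
    x = layers.Dropout(0.2)(x)  # dropout after the last fully connected layer
    output = layers.Dense(1, activation="sigmoid")(x)  # one output: cat or dog

    model = Model(inputs=base.input, outputs=output)
    model.compile(optimizer=RMSprop(learning_rate=1e-4),
                  loss="binary_crossentropy",
                  metrics=["accuracy"])
    return model


model = build_classifier()
print(model.output_shape)
```

Training would then be a call such as model.fit(train_data, validation_data=val_data, epochs=100), where train_data and val_data are hypothetical names for the augmented training and validation generators.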

Code: Plot the training and validation accuracy along with training and validation loss.

Results:
The best-performing Inception V3 architecture reported a top-5 error of just 5.6% and a top-1 error of 21.2% for a single crop on the ILSVRC 2012 classification challenge, which was the new state of the art at the time. With multiple crops (144 crops), it reported top-5 and top-1 error rates of 4.2% and 18.77% on the ILSVRC 2012 classification benchmark.

Inception V3 Performance

An ensemble of Inception V3 models reported a top-5 error rate of 3.46% on the ILSVRC 2012 validation set (3.58% on the ILSVRC 2012 test set).

Ensemble Results of Inception V3
