Deep Feed Forward Neural Networks and the Advantage of ReLU Activation Function

How to build a Deep Feed Forward (DFF) Neural Network in Python using Tensorflow Keras API and how to choose between different activation…

Saul Dobilas

Jan 2, 2022

15 min read

Deep Feed Forward (DFF) Neural Networks. Image by author.

Intro

We, as Data Scientists, always get excited about any performance gains we can achieve in our models. Chasing these improvements drives the community to keep experimenting, often leading to breakthroughs.

Many of the earlier experiments focused on the depth of the networks and how we could make them more efficient and get them to produce more accurate results.

In this article, I will take you through the structure of Deep Feed Forward (DFF) Neural Networks. I will also take a closer look at the ReLU (Rectified Linear Unit) activation function and show you how to build a DFF Neural Network in Python using the Tensorflow and Keras libraries.

This article is a continuation of my previous one on Feed Forward Neural Networks. If you are not familiar with the Neural Network basics, you may want to read this first:

Feed Forward Neural Networks – How To Successfully Build Them in Python

Deep Feed Forward Neural Network’s place within the universe of Machine Learning
The difference between Feed Forward (FF) and Deep Feed Forward (DFF) Neural Networks
What is the purpose of activation functions, and why has ReLU become a default for Deep Neural Networks?
Python example of how to build and train your own DFF Neural Network

DFF Neural Networks within the universe of Machine Learning

While Neural Networks are most frequently used in a supervised manner with labeled training data, I felt that their unique approach to Machine Learning deserves a separate category.

Within the Neural Network branch, DFF comes under the subcategory of Feed Forward Neural Networks, which are also often called Multilayer Perceptrons (MLPs).

The difference between Feed Forward (FF) and Deep Feed Forward (DFF) Neural Networks

Structure

The structure of a DFF is very similar to that of an FF. The major difference between them is the number of hidden layers.

Currently, people refer to a Neural Network with one hidden layer as a "shallow" network or simply a Feed-Forward network. Meanwhile, a Neural Network with multiple hidden layers (2+) is called a Deep network.

If we were to draw a diagram of structural comparison between FF and DFF, it would look like this:

Feed Forward (FF) vs. Deep Feed Forward (DFF) Neural Network structure. Image by author.
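
To make the structural comparison concrete, here is a minimal pure-Python sketch of a forward pass. The helper names (`dense`, `init_layers`, `forward`), the tiny layer sizes, and the random weights are all assumptions made for illustration; the only thing that changes between the FF and DFF versions is the number of hidden layers.

```python
import random

def relu(values):
    """Apply ReLU element-wise to a list of numbers."""
    return [max(v, 0.0) for v in values]

def dense(inputs, weights, biases):
    """One fully connected layer: a weighted sum of the inputs plus a bias per neuron."""
    return [sum(w * x for w, x in zip(neuron_weights, inputs)) + b
            for neuron_weights, b in zip(weights, biases)]

def init_layers(layer_sizes, seed=0):
    """Create random weights and zero biases for each pair of consecutive layer sizes."""
    rng = random.Random(seed)
    layers = []
    for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:]):
        weights = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_out)]
        layers.append((weights, [0.0] * n_out))
    return layers

def forward(inputs, layers):
    """Pass inputs through every layer: ReLU on hidden layers, raw values at the output."""
    values = inputs
    for i, (weights, biases) in enumerate(layers):
        values = dense(values, weights, biases)
        if i < len(layers) - 1:  # the last layer is the output layer
            values = relu(values)
    return values

x = [0.2, 0.7, 0.1, 0.9]              # four made-up input features
shallow = init_layers([4, 5, 3])      # FF: one hidden layer
deep = init_layers([4, 5, 5, 5, 3])   # DFF: three hidden layers

print("hidden layers:", len(shallow) - 1, "outputs:", forward(x, shallow))
print("hidden layers:", len(deep) - 1, "outputs:", forward(x, deep))
```

Both networks take the same inputs and produce the same number of outputs; the deep one simply applies more transformations in between.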

What is the point of depth?

Typically you will find that deep NNs perform better than shallow ones. However, it is not always necessary to use a deep network. The choice will largely depend on the task you have at hand.

If you are working with many inputs, such as image data, then using a Deep Feed Forward (DFF) or a Convolutional Neural Network (CNN) would likely yield better results than a simple Feed Forward network.

However, suppose your task is to do some basic classification with a limited number of inputs. In that case, you may be better off using a simple FF network or even a tree-based algorithm such as XGBoost, Random Forest, or a single Decision tree.

So, going back to the point of depth, the simple answer is that deeper networks tend to deliver better performance on more complex tasks. There are multiple hypotheses on why deep networks perform better, ranging from efficiency to improved ability to learn more abstract representations.

Here are some hypotheses described in this detailed answer on Stackexchange:

A shallow network may require more neurons than the deep one.

A shallow network may be more difficult to train with our current algorithms (e.g. it may have a more nasty local minima, or the convergence rate might be slower)

Perhaps a shallow architecture does not fit to the kind of problems we are usually trying to solve (e.g. object recognition is a quintessential "deep", hierarchical process)?

Maybe there is a different reason not mentioned above.

Do not hesitate to share if you come across definitive proof of why deeper is better.😃

What is the purpose of activation functions, and why did ReLU become a go-to for Deep Neural Networks?

Challenges presented by deep networks

A Neural Network architecture contains activation functions inside the hidden nodes and output nodes. In short, the activation function takes the input value entering the node, performs a transformation, and then passes the result onwards to the next set of neurons.

Here is a simple illustration of a Feed Forward Neural Network that shows what activation functions look like (in this case, softplus and sigmoid) and how they transform data within the Neural Network.

A simple Feed Forward Neural Network with Softplus and Sigmoid activation functions. Image by author.

Traditionally, the two most widely used nonlinear activation functions were the sigmoid and the hyperbolic tangent (tanh). However, using these activation functions in Deep Neural Networks led to the vanishing gradient problem.

The error is backpropagated through the network during the training process and is used to update the weights. Unfortunately, sigmoid and tanh activation functions tend to saturate.

This means that sigmoid squashes large negative inputs toward 0 and large positive inputs toward 1, while tanh squashes them toward -1 and 1. Saturation often happens regardless of whether the inputs provided to the node contain useful information or not.

As the functions saturate, the derivative becomes close to zero. Because backpropagation multiplies these derivatives layer by layer, there is essentially no gradient left to propagate back through a deep network, making it challenging for the learning algorithm to continue adapting the weights.
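
A quick numerical check can illustrate the saturation argument. The sketch below uses only Python's standard library; the chosen input values and the 10-layer chain are arbitrary assumptions, and the chain ignores the weights, which also multiply into the real backpropagated gradient.

```python
import math

def sigmoid(x):
    """Logistic sigmoid."""
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_grad(x):
    """Derivative of the sigmoid: s(x) * (1 - s(x))."""
    s = sigmoid(x)
    return s * (1.0 - s)

def tanh_grad(x):
    """Derivative of tanh: 1 - tanh(x)^2."""
    return 1.0 - math.tanh(x) ** 2

# The derivatives shrink quickly as the input moves away from zero
for x in (0.0, 2.0, 5.0, 10.0):
    print(f"x={x:>4}: sigmoid'={sigmoid_grad(x):.6f}  tanh'={tanh_grad(x):.6f}")

# Even at its maximum (x = 0) the sigmoid derivative is only 0.25,
# so chaining 10 sigmoid layers scales the gradient by at most 0.25 ** 10
best_case = sigmoid_grad(0.0) ** 10
print("best-case factor across 10 sigmoid layers:", best_case)
```

The deeper the network, the more of these small factors are multiplied together, which is why the problem shows up mainly in deep architectures.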

Here is an illustration of the three activation functions:

Activation functions. Image by author.

Rectified Linear Unit (ReLU) activation function

ReLU was adopted as a remedy for the vanishing gradient problem and quickly became the default option for most Deep Feed Forward (DFF) and Convolutional Neural Networks (CNN).

It is a very simple function that sets all negative values to 0 while returning positive inputs unchanged.

ReLU(x) = max(x,0)

ReLU is, of course, a nonlinear function. However, it is very close to being linear, enabling it to preserve many of the properties that make linear models easy to optimize with gradient-based methods. At the same time, it tends to generalize well too.

The Tensorflow implementation of ReLU allows you to set a few parameters to tune ReLU to your liking (the parameter names below are those of the Tensorflow 2.7 version used in this article). E.g.:

You can set the saturation threshold by specifying max_value;
You can turn ReLU into a Leaky ReLU by setting an alpha parameter, where alpha governs the slope for values lower than 0 (or another chosen threshold).

Here is an illustration of ReLU with max_value set to 3 and Leaky ReLU with alpha at 0.01.

Variations of ReLU. Image by author.
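
The variations pictured above can be sketched in a few lines of plain Python. This is a stand-in written for illustration, not the Tensorflow function itself; the parameter names `alpha`, `max_value`, and `threshold` simply mirror the ones discussed above.

```python
def relu(x, alpha=0.0, max_value=None, threshold=0.0):
    """Illustrative scalar ReLU with an optional cap, leak slope, and threshold."""
    if max_value is not None and x >= max_value:
        return float(max_value)           # saturate at the cap
    if x >= threshold:
        return float(x)                   # pass the value through unchanged
    # Below the threshold: leak with slope alpha (return a clean 0.0 when alpha is 0)
    return alpha * (x - threshold) if alpha else 0.0

inputs = [-2.0, -0.5, 0.0, 1.5, 5.0]
print("standard :", [relu(x) for x in inputs])
print("capped   :", [relu(x, max_value=3.0) for x in inputs])
print("leaky    :", [relu(x, alpha=0.01) for x in inputs])
```

With these settings, the capped version flattens out at 3 and the leaky version keeps a small negative slope instead of a hard zero.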

You may choose to use ReLU with a max cap if you run into a problem of exploding gradient while training your model. As you may have guessed, the exploding gradient is the opposite problem to the vanishing gradient, and it results from having large weights.

That said, you can typically avoid the exploding gradient issue by using He weight initialization with a standard ReLU. He initialization (see: HeNormal, HeUniform) scales the initial weights according to the number of inputs feeding each layer, keeping them small enough to minimize the risk of exploding gradients.
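
As a rough illustration of what He initialization does, the sketch below draws weights with a standard deviation of sqrt(2 / fan_in). The helper names are hypothetical, and the sketch samples from a plain Gaussian for simplicity, whereas Keras' HeNormal uses a truncated normal distribution.

```python
import math
import random
import statistics

def he_normal_std(fan_in):
    """Target standard deviation for He-normal weights: sqrt(2 / fan_in)."""
    return math.sqrt(2.0 / fan_in)

def he_normal_weights(fan_in, fan_out, seed=0):
    """Draw a fan_out x fan_in weight matrix from N(0, 2 / fan_in).

    Simplification: uses an untruncated Gaussian, unlike Keras' HeNormal.
    """
    rng = random.Random(seed)
    std = he_normal_std(fan_in)
    return [[rng.gauss(0.0, std) for _ in range(fan_in)] for _ in range(fan_out)]

# The more inputs a layer has, the smaller its initial weights
for fan_in in (784, 128, 64, 32):
    print(f"fan_in={fan_in:>3}: target std={he_normal_std(fan_in):.4f}")

weights = he_normal_weights(fan_in=784, fan_out=128)
flat = [w for row in weights for w in row]
print("empirical std:", round(statistics.pstdev(flat), 4))
```

Scaling by the number of inputs keeps the variance of each layer's output roughly stable, so signals neither blow up nor fade as they pass through many ReLU layers.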

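Meanwhile, a Leaky ReLU can be beneficial when you want to avoid having "dead" neurons. A standard ReLU outputs 0 for every negative input, and its gradient there is also 0, so a neuron that ends up receiving only negative inputs stops updating its weights.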

Python example of how to build and train your own DFF Neural Network

It’s now time to have some fun and develop our own Deep Neural Network capable of recognizing MNIST digits.

Setup

We’ll need the following data and libraries:

MNIST handwritten digit data (copyright held by Yann LeCun and Corinna Cortes under the Creative Commons Attribution-Share Alike 3.0 license; the original source of the data: The MNIST Database)
Pandas and Numpy for data manipulation
Matplotlib for displaying handwritten digits
Tensorflow/Keras for Neural Networks
Scikit-learn library for some basic model evaluation

Let’s import all the libraries:

# Tensorflow / Keras
import tensorflow as tf # used to access argmax function
from tensorflow import keras # for building Neural Networks
print('Tensorflow/Keras: %s' % keras.__version__) # print version
from tensorflow.keras.models import Sequential # for creating a linear stack of layers for our Neural Network
from tensorflow.keras import Input # for instantiating a keras tensor
from tensorflow.keras.layers import Dense # for creating regular densely-connected NN layer.

# Data manipulation
import pandas as pd # for data manipulation
print('pandas: %s' % pd.__version__) # print version
import numpy as np # for data manipulation
print('numpy: %s' % np.__version__) # print version

# Sklearn
import sklearn # for model evaluation
print('sklearn: %s' % sklearn.__version__) # print version
from sklearn.metrics import classification_report # for model evaluation metrics

# Visualization
import matplotlib 
import matplotlib.pyplot as plt # for showing handwritten digits
print('matplotlib: %s' % matplotlib.__version__) # print version

The above code prints package versions used in this example:

Tensorflow/Keras: 2.7.0
pandas: 1.3.4
numpy: 1.21.4
sklearn: 1.0.1
matplotlib: 3.5.1

Next, we ingest MNIST handwritten digits data and display the first ten digits with their true labels above the images.

# Load digits data 
(X_train, y_train), (X_test, y_test) = keras.datasets.mnist.load_data()

# Print shapes
print("Shape of X_train: ", X_train.shape)
print("Shape of y_train: ", y_train.shape)
print("Shape of X_test: ", X_test.shape)
print("Shape of y_test: ", y_test.shape)

# Display images of the first 10 digits in the training set and their true labels
fig, axs = plt.subplots(2, 5, sharey=False, tight_layout=True, figsize=(12,6), facecolor='white')
n=0
for i in range(0,2):
    for j in range(0,5):
        axs[i,j].matshow(X_train[n])
        axs[i,j].set(title=y_train[n])
        n=n+1
plt.show()

This is what we get running the above code:

The first ten digits of the MNIST dataset. Image by author.

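As you can see, we have 60,000 images in the training set and 10,000 in the test set. Note that their dimensions are 28 x 28 pixels. A DFF Neural Network expects each sample as a flat vector, so before training we need to reshape every image into a row of 784 values (28 x 28) and scale the pixel values from the 0–255 range to 0–1.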

# Reshape and normalize (divide by 255) input data
X_train = X_train.reshape(60000, 784).astype("float32") / 255
X_test = X_test.reshape(10000, 784).astype("float32") / 255

# Print shapes
print("New shape of X_train: ", X_train.shape)
print("New shape of X_test: ", X_test.shape)

The new shapes:

New shape of X_train: (60000, 784)
New shape of X_test: (10000, 784)
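
To see what the reshape-and-normalize step does to a single image, here is a tiny pure-Python sketch. The 3 x 3 "image" and the helper name `flatten_and_scale` are made up for illustration; the real data uses 28 x 28 images and NumPy's reshape.

```python
def flatten_and_scale(image, max_pixel=255):
    """Flatten a 2D grid of pixels row by row and scale each value to the 0-1 range."""
    return [pixel / max_pixel for row in image for pixel in row]

# A made-up 3 x 3 grayscale "image" with pixel intensities between 0 and 255
tiny_image = [
    [0, 128, 255],
    [255, 64, 0],
    [32, 0, 16],
]

flat = flatten_and_scale(tiny_image)
print(len(flat))   # 9 values, one per pixel (784 for a 28 x 28 image)
print(flat)
```

The rows are laid end to end, so the network sees each pixel as a separate input feature and no longer has any explicit notion of which pixels were neighbors.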

Training and Evaluating a Deep Feed Forward Neural Network

I have provided extensive commentary within the code, so I will not repeat the same in the body of the article. However, there are a few things that I would like to highlight:

Softmax activation in the output layer – this function takes values from all ten nodes and converts them to values between 0 and 1, where the sum of those ten values equals 1. Hence, you can imagine the output as a sort of "probability." Since each output node represents a different digit between 0–9, the one with the highest value ("probability") is what the Neural Net believes to be the correct answer.
Use of a "SparseCategoricalCrossentropy" for loss function – when your target data is binary, you should use "BinaryCrossentropy." When you have two or more label classes, and your data is OneHot encoded, you should use "CategoricalCrossentropy." In our scenario, we had ten classes to predict, but our target data was not OneHot encoded. Hence, we needed to use "SparseCategoricalCrossentropy."
Predicting labels – since the prediction gives us ten values (one for each output node), we need to pass the result through the argmax function that takes it as an input and returns the digit with the highest value ("probability").
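
The three points above can be sketched in plain Python. The ten raw output values and the true label below are made up, and the helper functions are simplified single-sample versions written for illustration, not the Keras implementations.

```python
import math

def softmax(logits):
    """Convert raw output values into positive numbers that sum to 1."""
    peak = max(logits)                               # subtract the max for numerical stability
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def argmax(values):
    """Index of the largest value, i.e. the predicted digit."""
    return max(range(len(values)), key=values.__getitem__)

def sparse_categorical_crossentropy(probs, label):
    """Loss for an integer label: negative log of the probability of the true class."""
    return -math.log(probs[label])

def categorical_crossentropy(probs, one_hot):
    """Loss for a one-hot encoded label."""
    return -sum(t * math.log(p) for t, p in zip(one_hot, probs))

# Made-up raw outputs of the ten output nodes for one image
logits = [0.1, 1.2, -0.3, 0.0, 2.5, 0.4, -1.0, 5.0, 0.9, 0.2]
true_label = 7                                       # integer label, not one-hot encoded

probs = softmax(logits)
one_hot = [1.0 if i == true_label else 0.0 for i in range(10)]

print("sum of probabilities:", round(sum(probs), 6))
print("predicted digit:", argmax(probs))
print("sparse loss:", sparse_categorical_crossentropy(probs, true_label))
print("one-hot loss:", categorical_crossentropy(probs, one_hot))
```

Both loss functions give the same value for the same sample; the "sparse" version just saves you from one-hot encoding the labels first.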

##### Step 1 - Specify the structure of a Neural Network
model = Sequential(name="DFF-Model") # Model
model.add(Input(shape=(784,), name='Input-Layer')) # Input Layer - need to specify the shape of inputs
model.add(Dense(128, activation='relu', name='Hidden-Layer-1', kernel_initializer='HeNormal')) # Hidden Layer, relu(x) = max(x, 0)
model.add(Dense(64, activation='relu', name='Hidden-Layer-2', kernel_initializer='HeNormal')) # Hidden Layer, relu(x) = max(x, 0)
model.add(Dense(32, activation='relu', name='Hidden-Layer-3', kernel_initializer='HeNormal')) # Hidden Layer, relu(x) = max(x, 0)
model.add(Dense(10, activation='softmax', name='Output-Layer')) # Output Layer, softmax(x) = exp(x) / tf.reduce_sum(exp(x))

##### Step 2 - Compile keras model
model.compile(optimizer='adam', # default='rmsprop', an algorithm to be used in backpropagation
 loss='SparseCategoricalCrossentropy', # Loss function to be optimized. A string (name of loss function), or a tf.keras.losses.Loss instance.
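 metrics=['accuracy'], # List of metrics to be evaluated by the model during training and testing. Each of these can be a string (name of a built-in function), function or a tf.keras.metrics.Metric instance. Lowercase 'accuracy' lets Keras pick the accuracy variant that matches the loss, whereas 'Accuracy' selects the exact-match Accuracy metric class, which is not meaningful for softmax outputs.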
 loss_weights=None, # default=None, Optional list or dictionary specifying scalar coefficients (Python floats) to weight the loss contributions of different model outputs.
 weighted_metrics=None, # default=None, List of metrics to be evaluated and weighted by sample_weight or class_weight during training and testing.
 run_eagerly=None, # Defaults to False. If True, this Model's logic will not be wrapped in a tf.function. Recommended to leave this as None unless your Model cannot be run inside a tf.function.
 steps_per_execution=None # Defaults to 1. The number of batches to run during each tf.function call. Running multiple batches inside a single tf.function call can greatly improve performance on TPUs or small models with a large Python overhead.
 )

##### Step 3 - Fit keras model on the dataset
model.fit(X_train, # input data
 y_train, # target data
 batch_size=10, # Number of samples per gradient update. If unspecified, batch_size will default to 32.
 epochs=5, # default=1, Number of epochs to train the model. An epoch is an iteration over the entire x and y data provided
 verbose='auto', # default='auto', ('auto', 0, 1, or 2). Verbosity mode. 0 = silent, 1 = progress bar, 2 = one line per epoch. 'auto' defaults to 1 for most cases, but 2 when used with ParameterServerStrategy.
 callbacks=None, # default=None, list of callbacks to apply during training. See tf.keras.callbacks
 validation_split=0.2, # default=0.0, Fraction of the training data to be used as validation data. The model will set apart this fraction of the training data, will not train on it, and will evaluate the loss and any model metrics on this data at the end of each epoch. 
 #validation_data=(X_test, y_test), # default=None, Data on which to evaluate the loss and any model metrics at the end of each epoch. 
 shuffle=True, # default=True, Boolean (whether to shuffle the training data before each epoch) or str (for 'batch').
 class_weight=None, # default=None, Optional dictionary mapping class indices (integers) to a weight (float) value, used for weighting the loss function (during training only). This can be useful to tell the model to "pay more attention" to samples from an under-represented class.
 sample_weight=None, # default=None, Optional Numpy array of weights for the training samples, used for weighting the loss function (during training only).
 initial_epoch=0, # Integer, default=0, Epoch at which to start training (useful for resuming a previous training run).
 steps_per_epoch=None, # Integer or None, default=None, Total number of steps (batches of samples) before declaring one epoch finished and starting the next epoch. When training with input tensors such as TensorFlow data tensors, the default None is equal to the number of samples in your dataset divided by the batch size, or 1 if that cannot be determined. 
 validation_steps=None, # Only relevant if validation_data is provided and is a tf.data dataset. Total number of steps (batches of samples) to draw before stopping when performing validation at the end of every epoch.
 validation_batch_size=None, # Integer or None, default=None, Number of samples per validation batch. If unspecified, will default to batch_size.
 validation_freq=5, # default=1, Only relevant if validation data is provided. If an integer, specifies how many training epochs to run before a new validation run is performed, e.g. validation_freq=2 runs validation every 2 epochs.
 max_queue_size=10, # default=10, Used for generator or keras.utils.Sequence input only. Maximum size for the generator queue. If unspecified, max_queue_size will default to 10.
 workers=1, # default=1, Used for generator or keras.utils.Sequence input only. Maximum number of processes to spin up when using process-based threading. If unspecified, workers will default to 1.
 use_multiprocessing=False, # default=False, Used for generator or keras.utils.Sequence input only. If True, use process-based threading. If unspecified, use_multiprocessing will default to False. 
 )

##### Step 4 - Use model to make predictions
# Note, we need to pass model outputs through argmax to convert from probability to label
# Also, we convert output from tensor to numpy array
# Predict class labels on training data
pred_labels_tr = np.array(tf.math.argmax(model.predict(X_train),axis=1))
# Predict class labels on test data
pred_labels_te = np.array(tf.math.argmax(model.predict(X_test),axis=1))

##### Step 5 - Model Performance Summary
print("")
print('-------------------- Model Summary --------------------')
model.summary() # print model summary
print("")

# I am not printing the parameters since my Deep Feed Forward Neural Network contains more than 100K of them
#print('-------------------- Weights and Biases --------------------')
#for layer in model.layers:
#    print("Layer: ", layer.name) # print layer name
#    print(" --Kernels (Weights): ", layer.get_weights()[0]) # kernels (weights)
#    print(" --Biases: ", layer.get_weights()[1]) # biases

print("")
print('---------- Evaluation on Training Data ----------')
print(classification_report(y_train, pred_labels_tr))
print("")

print('---------- Evaluation on Test Data ----------')
print(classification_report(y_test, pred_labels_te))
print("")

Here is a summary and evaluation metrics for our Deep Feed Forward Neural Network, which we trained over five epochs.

Deep Feed Forward (DFF) Neural Network performance. Image by author.
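
If you want to sanity-check the parameter count reported in the model summary, it can be reproduced by hand: each Dense layer has one weight per input-output pair plus one bias per output node. The helper name `count_dense_params` below is hypothetical; the layer sizes are the ones used in the model above.

```python
def count_dense_params(layer_sizes):
    """Total weights and biases in a stack of fully connected layers."""
    return sum(n_in * n_out + n_out
               for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:]))

# 784 inputs -> 128 -> 64 -> 32 hidden nodes -> 10 outputs
layer_sizes = [784, 128, 64, 32, 10]

for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:]):
    print(f"{n_in:>3} -> {n_out:>3}: {n_in * n_out + n_out} parameters")

total = count_dense_params(layer_sizes)
print("total:", total)  # 111146
```

Most of the parameters sit in the first hidden layer, because it connects all 784 pixels to 128 nodes.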

While 97% accuracy on test data is pretty decent, you could improve the results by training over more epochs or experimenting with the network structure. So give it a go and let me know if you managed to beat my score!

Conclusions

Congratulations! You can now successfully create Deep Feed Forward Neural Networks, experiment with the network structure, choose an activation function and use your network to make predictions.

Feel free to use the code provided in this article for your own projects. You can also download the entire Jupyter notebook from my GitHub repository.

As I try to make my articles more useful for readers, I would appreciate it if you could let me know what has driven you to read this piece and whether it has given you the answers you were looking for. If not, what was missing?

Cheers! 👏 Saul Dobilas

URL: https://towardsdatascience.com/deep-feed-forward-neural-networks-and-the-advantage-of-relu-activation-function-ff881e58a635/