The Convolutional Neural Network (CNN) is a multi-layered neural network known for its ability to detect patterns and complex features. It has proven useful in face detection, self-driving cars, and many other complex tasks. In this article, I will give you a high-level idea of how a Convolutional Neural Network works.

This article will cover:

- How a convolution layer works in the forward pass.
- How a pooling layer works.
- A complete model structure of a convolutional neural network for an Image Classification project.
- Analysis of the model summary.
- Training the model and displaying the result.

# How Does a CNN Work?

CNNs can be used in many different areas, but in this article we will talk about image classification. Image data can be expressed as numeric pixel values, and these values are passed into the CNN for processing. A regular fully connected neural network can also classify images, but a CNN is usually much more efficient in terms of both accuracy and speed.

## Convolution Layer

Convolution layers are what make a CNN a convolutional neural network. In this layer, a filter or kernel is used to detect important features. The purpose is to make the data smaller and send only the important features to the next layer. This saves a lot of calculation in the dense layers and also helps accuracy. Let’s have a look at a picture illustration.

The picture above shows the input data of depth 3, a kernel of the same depth and bias terms.

## How does this kernel filter the input data?

The next few pictures will show that step by step.

Here is how the calculation works:

Let’s fill in the remaining three outputs. Here is how to move the filter or kernel to calculate y12.

I am not showing the calculation this time. It is the same element-wise multiplication followed by summing up, as shown before. The following picture shows the kernel placement and bias for y21:

Lastly, the kernel and bias for the y22 calculation:

In the illustration above, only one kernel was used. In a real model, several kernels can be used, and each one produces its own output of the same size. **The type of padding I used here is called “valid”. That means I did not use any padding at all. There are two other major types called “full” and “same”.** I am not going to discuss those in detail in this article, and I will use ‘valid’ in the exercise section. At a high level, ‘same’ padding means adding zeros around the input data before applying the kernel, so that (with a stride of 1) the output keeps the same height and width as the input.
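To make the sliding-window calculation concrete, here is a minimal NumPy sketch of the forward pass for one kernel with ‘valid’ padding and a stride of 1. It assumes a 3×3 input of depth 3 and a 2×2 kernel of the same depth. The function name `conv2d_valid` and all the numbers are made up for illustration. They are not the values from the pictures, and this is not how TensorFlow implements the layer internally.

```python
import numpy as np

def conv2d_valid(x, kernel, bias=0.0):
    """Slide one kernel over x with 'valid' padding and a stride of 1.

    x:      input of shape (height, width, depth)
    kernel: filter of shape (k_height, k_width, depth), same depth as x
    """
    h, w, _ = x.shape
    kh, kw, _ = kernel.shape
    out = np.zeros((h - kh + 1, w - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            window = x[i:i + kh, j:j + kw, :]
            # Element-wise multiplication, sum over the window, then add the bias
            out[i, j] = np.sum(window * kernel) + bias
    return out

# Hypothetical values chosen only for illustration
x = np.arange(27, dtype=float).reshape(3, 3, 3)  # 3x3 input, depth 3
kernel = np.ones((2, 2, 3))                      # 2x2 kernel, depth 3
y = conv2d_valid(x, kernel, bias=1.0)

print(y.shape)  # (2, 2), one value each for y11, y12, y21, y22
print(y)
```

With several kernels, the same function would simply be called once per kernel, giving one 2×2 output per kernel.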

## Pooling Layer

The pooling layer reduces the dimensionality of the data and also makes the detected features less sensitive to their exact location in the image. Here is an example of how MaxPooling2D works.

The picture above shows how MaxPooling works. The maximum value in the purple box is 15, so only 15 is kept. The maximum of the green box is 19, so only 19 remains. The same goes for the two other boxes as well. There are other types of pooling, such as average pooling and min pooling, and the names indicate how they work. In average pooling, we would take the average of the values in each box, and in min pooling, we would take the minimum value from each box.
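The pooling idea can be sketched in a few lines of NumPy. The helper `pool2d` below is a hypothetical function written for this article, not a TensorFlow API, and the 4×4 input is made up. Only the two maxima, 15 and 19, echo the boxes described above.

```python
import numpy as np

def pool2d(x, size=2, stride=2, reduce=np.max):
    """Pool a 2-D array with a size x size window moved `stride` cells at a time."""
    out_h = (x.shape[0] - size) // stride + 1
    out_w = (x.shape[1] - size) // stride + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            rows = slice(i * stride, i * stride + size)
            cols = slice(j * stride, j * stride + size)
            out[i, j] = reduce(x[rows, cols])
    return out

# Hypothetical 4x4 input
x = np.array([[3, 15, 7, 19],
              [8, 2, 4, 11],
              [6, 1, 12, 5],
              [9, 10, 13, 14]], dtype=float)

print(pool2d(x))                  # max pooling: [[15, 19], [10, 14]]
print(pool2d(x, reduce=np.mean))  # average pooling
print(pool2d(x, reduce=np.min))   # min pooling
```

Passing the reducing function as an argument keeps the window logic in one place, so the three pooling types differ only in the `reduce` argument.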

These are the main ideas you need to understand the exercise in this article.

# Convolutional Neural Network Exercise

For this exercise, I will use the CIFAR-10 dataset, which is free and can be loaded through the TensorFlow library itself. The dataset contains the pixel values of images of objects, and the labels are numbers: each class of object is represented by a number. We will train the network first and then check the accuracy using the test dataset. The dataset is already split into a training set and a test set. Here I am loading the data:

```
import tensorflow as tf
(X_train, y_train), (X_test, y_test) = tf.keras.datasets.cifar10.load_data()
```

The dataset contains the following classes:

`'airplane', 'automobile', 'bird', 'cat', 'deer', 'dog', 'frog', 'horse', 'ship', 'truck'`

Each of the classes is represented by a number.
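A small sketch of how a numeric label can be turned back into a class name, assuming the list above is in label order (index 0 is 'airplane' and index 9 is 'truck'). The `sample_labels` values are hypothetical stand-ins written in the same nested layout as `y_train`, with one number per row.

```python
class_names = ['airplane', 'automobile', 'bird', 'cat', 'deer',
               'dog', 'frog', 'horse', 'ship', 'truck']

# Hypothetical labels, one single-element row per image
sample_labels = [[6], [9], [9], [4], [1]]

# Unpack each one-element row and look up the matching name
names = [class_names[label] for (label,) in sample_labels]
for (label,), name in zip(sample_labels, names):
    print(label, name)
```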

If you print y_train data, it looks like this:

Checking one image from the training set:

```
import matplotlib.pyplot as plt
image=X_train[3]
plt.imshow(image)
plt.show()
```

Output:

It is always good to scale the input data. As pixel values range from 0 to 255, I divide them by 255 so that they fall between 0 and 1.

```
X_train = X_train/255
X_test = X_test/255
```
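Here is a small sketch of what this division does, using a made-up 2×2 RGB array instead of the real dataset. It assumes the raw pixels are stored as 8-bit unsigned integers. The `float32` variant at the end is an optional alternative, not something the rest of this article relies on.

```python
import numpy as np

# Hypothetical 2x2 RGB "image" stored as 8-bit integers
pixels = np.array([[[0, 128, 255], [64, 64, 64]],
                   [[255, 255, 255], [10, 20, 30]]], dtype=np.uint8)

scaled = pixels / 255  # true division converts the integers to floats
print(scaled.dtype, scaled.min(), scaled.max())

# Optional: float32 uses half the memory of float64
scaled32 = pixels.astype("float32") / 255.0
print(scaled32.dtype)
```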

Let’s check the shape of the training input:

`X_train.shape`

Output:

`(50000, 32, 32, 3)`

**What do we know from this shape?**

We have 50,000 training images. Each image is 32×32 with a depth of 3. That means the images are color images with RGB values.

## CNN Structure

For this project, I am going to use a kernel size of 3×3 and 32 kernels in the first convolution layer, which gives 32 output windows. Here is how it will look:

In the earlier demonstration, I used only one kernel for simplicity. But you can use as many kernels as you need. I will use 32 kernels in the first layer for this exercise.

**For clarification, the picture above shows input data of size 3×3 and depth 3. Our data also has a depth of three, as you can see from the X_train shape. But the size is 32×32, not 3×3 as shown in this picture.**

In the picture, all the kernels are 2×2, but I will use 3×3 kernels. You can try any other size. In fact, kernels do not have to be square. They can be 4×2 or any other rectangular shape as well.

But kernels cannot be bigger than the input. In this example, the input is 32×32, so the kernels cannot be bigger than that.

Also, when we used one kernel, we had one output window. As I use 32 kernels here, I will have 32 output windows.

After the convolution layer, there will be a MaxPooling layer with a 2×2 filter. A stride of 2 means that the filter moves 2 pixels at a time. You can try different strides.

I will have two more pairs of convolution and MaxPooling layers. Then there will be a ‘flatten’ layer. It does what it sounds like: it flattens the three-dimensional data into a one-dimensional vector, because the dense layers that follow take one-dimensional data. I am assuming you are familiar with regular neural networks. For this project, there will be three dense layers, followed by the output layer.

The output layer will use the ‘softmax’ activation. All the other layers will use the ‘relu’ activation function.
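For readers who have not seen these two activations written out, here is a plain-Python sketch of what they compute. The input scores are hypothetical, and the functions are simplified illustrations, not the TensorFlow implementations.

```python
import math

def relu(values):
    """Replace every negative value with 0."""
    return [max(0.0, v) for v in values]

def softmax(values):
    """Turn raw scores into probabilities that sum to 1."""
    largest = max(values)
    # Subtracting the largest score keeps exp() from overflowing
    exps = [math.exp(v - largest) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

scores = [-2.0, 0.5, 3.0]  # hypothetical raw outputs

print(relu(scores))  # [0.0, 0.5, 3.0]
probs = softmax(scores)
print(probs, sum(probs))
```

Softmax suits the output layer because the 10 outputs can then be read as class probabilities, with the largest one taken as the prediction.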

Here is the model:

```
model = tf.keras.Sequential([
    tf.keras.layers.Conv2D(32, (3, 3), padding="valid",
                           activation="relu", input_shape=(32, 32, 3)),
    tf.keras.layers.MaxPooling2D((2, 2), strides=2),
    tf.keras.layers.Conv2D(48, (3, 3), padding="valid", activation="relu"),
    tf.keras.layers.MaxPooling2D((2, 2), strides=2),
    tf.keras.layers.Conv2D(48, (3, 3), padding="valid", activation="relu"),
    tf.keras.layers.MaxPooling2D((2, 2), strides=2),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(100, activation="relu"),
    tf.keras.layers.Dense(100, activation="relu"),
    tf.keras.layers.Dense(100, activation="relu"),
    tf.keras.layers.Dense(10, activation="softmax"),
])
```

Here is the summary of the model:

`model.summary()`

Output:

```
Model: "sequential_25"
_________________________________________________________________
Layer (type)                 Output Shape              Param #
=================================================================
conv2d_81 (Conv2D)           (None, 30, 30, 32)        896
_________________________________________________________________
max_pooling2d_79 (MaxPooling (None, 15, 15, 32)        0
_________________________________________________________________
conv2d_82 (Conv2D)           (None, 13, 13, 48)        13872
_________________________________________________________________
max_pooling2d_80 (MaxPooling (None, 6, 6, 48)          0
_________________________________________________________________
conv2d_83 (Conv2D)           (None, 4, 4, 48)          20784
_________________________________________________________________
max_pooling2d_81 (MaxPooling (None, 2, 2, 48)          0
_________________________________________________________________
flatten_27 (Flatten)         (None, 192)               0
_________________________________________________________________
dense_98 (Dense)             (None, 100)               19300
_________________________________________________________________
dense_99 (Dense)             (None, 100)               10100
_________________________________________________________________
dense_100 (Dense)            (None, 100)               10100
_________________________________________________________________
dense_101 (Dense)            (None, 10)                1010
=================================================================
Total params: 76,062
Trainable params: 76,062
Non-trainable params: 0
_________________________________________________________________
```
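The Param # column in this summary can be reproduced by hand. The sketch below assumes that every kernel has one weight per input value it covers plus one bias, and that every dense unit has one weight per input plus one bias. The helper names `conv_params` and `dense_params` are made up for this illustration.

```python
def conv_params(kernel_h, kernel_w, in_depth, n_kernels):
    # Each kernel has kernel_h * kernel_w * in_depth weights plus one bias
    return (kernel_h * kernel_w * in_depth + 1) * n_kernels

def dense_params(n_inputs, n_units):
    # Each unit has one weight per input plus one bias
    return (n_inputs + 1) * n_units

counts = [
    conv_params(3, 3, 3, 32),    # first Conv2D, input depth 3 (RGB)
    conv_params(3, 3, 32, 48),   # second Conv2D, input depth 32
    conv_params(3, 3, 48, 48),   # third Conv2D, input depth 48
    dense_params(192, 100),      # first Dense, fed by the flatten layer
    dense_params(100, 100),
    dense_params(100, 100),
    dense_params(100, 10),       # output layer
]

print(counts)       # [896, 13872, 20784, 19300, 10100, 10100, 1010]
print(sum(counts))  # 76062
```

The pooling and flatten layers show 0 parameters because they have nothing to learn.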

Let’s try to understand this summary. I will discuss one convolution layer and one MaxPooling layer. After the first convolution layer, the output shape is (None, 30, 30, 32). The None stands for the batch size, which is not fixed.

**Let’s understand this 30, 30, and 32.** The last element here is 32. That is easy to understand: because we used 32 kernels, 32 output windows are expected.

**What is this 30, 30?** Because we used ‘valid’ padding, the output size should be:

input size - kernel size + 1

Here the input size is 32 and the kernel size is 3, so,

32 - 3 + 1 = 30

This formula is for ‘valid’ padding with a stride of 1 only. If you use ‘same’ or ‘full’ padding, the formula is different.

The next element is a MaxPooling layer. The output shape from the first MaxPooling layer is (None, 15, 15, 32). As mentioned before, 32 comes from the 32 kernels. As we used a 2×2 filter with a stride of 2 in the MaxPooling layer, the data becomes half the size in both height and width. So, the 30, 30 of the convolution layer becomes 15, 15.
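The same arithmetic can be followed through all three convolution and MaxPooling pairs. The sketch below uses two hypothetical helper functions and assumes a stride of 1 for the convolutions and a 2×2 window with a stride of 2 for the pooling, as in the model above.

```python
def conv_valid(size, kernel):
    # 'valid' padding, stride 1: input size - kernel size + 1
    return size - kernel + 1

def pool(size, window=2, stride=2):
    # A window that does not fit entirely inside the input is dropped
    return (size - window) // stride + 1

size = 32
trace = [size]
for _ in range(3):  # three Conv2D + MaxPooling2D pairs
    size = conv_valid(size, 3)
    trace.append(size)
    size = pool(size)
    trace.append(size)

print(trace)  # [32, 30, 15, 13, 6, 4, 2]

flattened = size * size * 48  # 48 kernels in the last convolution layer
print(flattened)  # 192
```

Note how 13 becomes 6 after pooling: the last row and column do not fill a complete 2×2 window, so they are dropped. The final 192 matches the output of the flatten layer in the summary.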

Before I move on to training the model, I want to set an EarlyStopping condition.

## What is EarlyStopping?

Assume I set my model training for 100 epochs, but my model does not need 100 epochs. Maybe it converges after 50 epochs. In that case, if I leave it running for 100 epochs, it may overfit. We can set an EarlyStopping condition with a patience value of our choice. I will use a patience value of 5 here. That means that if the monitored loss (the validation loss, by default) does not improve for 5 consecutive epochs, the model will stop training, even if it has only run for 30 or 50 epochs.

```
from tensorflow.keras.callbacks import EarlyStopping
callbacks=[EarlyStopping(patience=5)]
```
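To show what patience means in practice, here is a simplified plain-Python sketch of the stopping rule. It is only an illustration of the idea, not the Keras implementation. The function name `stopping_epoch` and the loss values are hypothetical.

```python
def stopping_epoch(val_losses, patience=5):
    """Return the 1-based epoch at which training would stop, or None."""
    best = float("inf")
    wait = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:
            best = loss  # improvement: remember it and reset the counter
            wait = 0
        else:
            wait += 1    # no improvement in this epoch
            if wait >= patience:
                return epoch
    return None

# Hypothetical validation losses, not results from this model
losses = [1.50, 1.20, 1.05, 1.00, 1.02, 1.01, 1.03, 1.04, 1.06, 1.10]
print(stopping_epoch(losses))  # 9
```

In this made-up run, the best loss is reached at epoch 4, and training stops at epoch 9 after five epochs without improvement.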

## Training the model

First, we need to compile and then start training:

```
model.compile(optimizer="adam",
loss=tf.keras.losses.SparseCategoricalCrossentropy(),
metrics=['accuracy'])
history = model.fit(X_train, y_train, epochs = 50,
validation_data=(X_test, y_test), callbacks=callbacks)
```
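The loss used here works directly with integer labels like the ones in `y_train`. As a rough sketch of the idea, assuming the model outputs softmax probabilities, the loss for one image is the negative log of the probability given to the true class, averaged over the images. The function below is a simplified illustration with made-up numbers. It skips the numerical safeguards a library would add.

```python
import math

def sparse_categorical_crossentropy(labels, probabilities):
    """Average of -log(probability assigned to the true class)."""
    losses = [-math.log(probs[label])
              for label, probs in zip(labels, probabilities)]
    return sum(losses) / len(losses)

# Hypothetical softmax outputs for two images and three classes
probabilities = [[0.7, 0.2, 0.1],
                 [0.1, 0.1, 0.8]]
labels = [0, 2]  # integer labels, no one-hot encoding needed

loss = sparse_categorical_crossentropy(labels, probabilities)
print(loss)
```

The more probability the model puts on the correct class, the closer the loss gets to 0.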

I set the model to train for 50 epochs, but it stopped after 17 epochs because of the EarlyStopping condition, which saves a lot of time.

Here is the summary of the results:

```
import pandas as pd

met_df1 = pd.DataFrame(history.history)
met_df1
```

Output:

Here is the plot of training accuracy and validation accuracy per epoch:

```
met_df1[["accuracy", "val_accuracy"]].plot()
plt.xlabel("Epochs")
plt.ylabel("Accuracy")
plt.title("Accuracies per Epoch")
plt.show()
```

As you can see from the plot above, training accuracy kept going up, but validation accuracy almost settled after a few epochs.

## Model Improvement

There are many things you can try within the range of the ideas explained in this article. If you want to experiment, here are some ideas for you:

- Change the kernel shape. You can try with 2×2, 4×4, 2×4, 3×2, or any other shape of your choice.
- Instead of ‘valid’, try ‘same’ as the padding value. Keras `Conv2D` accepts only ‘valid’ and ‘same’, so ‘full’ is not available there.
- Change the number of kernels and use different numbers such as 48, 64, 56, or any other number instead of 32, 48, and 48.
- Add or remove convolution layers.
- Instead of max pooling, try average pooling.
- Add or remove the dense layers and change the number of neurons.
- Try other activation functions like tanh, elu, or LeakyReLU.
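If the activation functions in the last idea are new to you, here is a plain-Python sketch of what they compute for a single number. The `alpha` and `slope` defaults are values I picked for illustration, not necessarily the defaults of any library.

```python
import math

def tanh(x):
    # Squashes any input into the range (-1, 1)
    return math.tanh(x)

def elu(x, alpha=1.0):
    # Identity for positive inputs, smooth exponential curve for negative ones
    return x if x > 0 else alpha * (math.exp(x) - 1.0)

def leaky_relu(x, slope=0.01):
    # Like relu, but negative inputs keep a small slope instead of becoming 0
    return x if x > 0 else slope * x

for x in [-2.0, 0.0, 2.0]:  # hypothetical inputs
    print(x, tanh(x), elu(x), leaky_relu(x))
```

Unlike relu, all three return a nonzero value for negative inputs.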

I am sure that if you experiment enough, you can get a much better validation accuracy than the result I displayed here.

# Conclusion

I tried to explain the idea of the convolutional neural network and how it works behind the scenes. If you have to implement it from scratch, there is a lot more mathematics involved, especially for the parameter updates. But luckily we have TensorFlow, which updates the parameters for us so that we do not have to do the partial differentiation for all the elements. Please feel free to try some different model architectures as I suggested above and share your findings if you find them interesting!
