CNN Basics Explained: Beginner’s Guide

A simple introduction to convolutional neural networks

Ransaka Ravihara
6 min read · Feb 13, 2024
Photo by Tim Navis on Unsplash

Introduction

When learning about convolutional neural networks, there are plenty of buzzwords to untangle. In this article, I will simplify them as much as I can. After reading, you will know what convolution is, the different types of convolution operations, and how to apply convolution in real use cases. Finally, you will build a convolutional autoencoder as well.

Prerequisites I — Understanding dot product

In convolutional neural networks, aka CNNs, the basic operations are mostly dot products between two matrices. Hence, understanding the dot product is necessary to comprehend the convolution operations we will learn in later sections. As an example, assume we have the two matrices below.

The element-wise dot product of these two matrices can be calculated in two steps: multiply the matrices element by element, then sum the results.

Element-wise dot-product
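As a small illustration with NumPy (the matrix values here are arbitrary), the two steps are a multiply-then-sum:

```python
import numpy as np

# Two example 2 x 2 matrices (arbitrary values, for illustration only)
a = np.array([[1, 2],
              [3, 4]])
b = np.array([[5, 6],
              [7, 8]])

# Step 1: element-wise multiplication
products = a * b          # [[ 5, 12], [21, 32]]

# Step 2: sum all the products into a single scalar
dot = products.sum()      # 5 + 12 + 21 + 32 = 70
print(dot)                # 70
```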

Understanding convolution operation

In simple words, the convolution operation is just another series of element-wise dot product operations. The difference here is that one of the matrices acts as a special matrix called a “kernel.” In a CNN, these kernels are learnable matrices and are often relatively small, for example 3 x 3 or 4 x 4.

Forget about images and CNNs for a moment; let’s use a simple example to understand how the convolution operation works. First, define the kernel and the source matrix.

As a starting point, we overlap the kernel with the source matrix, beginning at the top-left corner.

The convolution operation is split into intermediate steps.

After performing the convolution operation above, we obtain a matrix with dimensions 3 x 3. Notice that the result has lost a one-pixel-wide stripe around the border relative to the source matrix. Each populated value in the result matrix corresponds to the center of the 3 x 3 patch used in the intermediate convolution step.
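The sliding-window procedure above can be sketched in a few lines of NumPy (the source values are arbitrary, and the kernel of ones simply sums each 3 x 3 window):

```python
import numpy as np

def convolve2d(source, kernel):
    """Valid (no-padding), stride-1 convolution via element-wise dot products."""
    m, k = source.shape[0], kernel.shape[0]
    n = m - k + 1                        # result size shrinks: N = M - K + 1
    result = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            # element-wise dot product of the kernel and the overlapped patch
            result[i, j] = (source[i:i+k, j:j+k] * kernel).sum()
    return result

source = np.arange(25).reshape(5, 5)     # 5 x 5 source matrix (values 0..24)
kernel = np.ones((3, 3))                 # 3 x 3 kernel of ones
out = convolve2d(source, kernel)
print(out.shape)                         # (3, 3) -- one-pixel stripe lost on each side
```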

Furthermore, if you look at the size of the result, you will notice that we can easily manipulate it by tweaking a few settings. For example, changing the kernel size changes the result matrix size. Similarly, shifting the kernel by two columns at a time instead of one (these shifts are called the stride) also produces a different result size. There is another helpful setting we haven’t discussed yet: padding. It plays a critical role in convolution operations in most real-world applications. Now it’s time to dive into these two concepts, stride and padding. By the end of this section, we will establish the relationship between stride, padding, kernel size, source matrix, and result matrix.

Padding

When we applied a 3 x 3 convolution to the 5 x 5 source matrix above, the result matrix was 3 x 3. What if we need the convolution result to have the same width and height as the original matrix? That is exactly what padding is used for. For example, if we need the result matrix to be 5 x 5, we can add a one-pixel stripe around the source as follows and then redo the convolution operation.

Usually, these padded lines are filled with zeros (zero-padding). Now you can perform the convolution operation as we did in the previous section, and you will get a result matrix with the same shape as the source matrix.
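A quick sketch of zero-padding in NumPy (the values are arbitrary; `convolve2d` is the same naive loop as before):

```python
import numpy as np

def convolve2d(source, kernel):
    """Valid, stride-1 convolution via element-wise dot products."""
    m, k = source.shape[0], kernel.shape[0]
    n = m - k + 1
    result = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            result[i, j] = (source[i:i+k, j:j+k] * kernel).sum()
    return result

source = np.arange(25).reshape(5, 5)
kernel = np.ones((3, 3))

# Add a one-pixel stripe of zeros around the source (zero-padding, P = 1)
padded = np.pad(source, pad_width=1, mode="constant", constant_values=0)
print(padded.shape)                      # (7, 7)

out = convolve2d(padded, kernel)
print(out.shape)                         # (5, 5) -- same shape as the source
```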

Stride

During the previous convolution calculations, you noticed that we shifted the kernel across the source matrix from left to right, one position at a time. The size of that shift is called the stride, and the stride parameter is used in CNNs to control the size of the resulting matrix.

The relationship between the result matrix size, source matrix size, kernel size, stride, and padding can be expressed by the equation below.

N = floor((M + 2P − K) / S) + 1

N = Resulting height (or width); M = Source height (or width); P = Padding value; K = Kernel size; S = Stride value

Let’s apply this formula to the padded example above to get an idea.

M = 5 (considering only height, as the source matrix is square)
P = 1
K = 3
S = 1

N = floor((5 + 2 × 1 − 3) / 1) + 1 = 5, so the result matrix is 5 x 5, matching the source.
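The formula is easy to encode and check against the examples above (a minimal helper; the function name is my own):

```python
def conv_output_size(m, k, s=1, p=0):
    """N = floor((M + 2P - K) / S) + 1 for one spatial dimension."""
    return (m + 2 * p - k) // s + 1

print(conv_output_size(5, 3))            # 3 -- the unpadded 5 x 5 example
print(conv_output_size(5, 3, p=1))       # 5 -- the zero-padded example
print(conv_output_size(5, 3, s=2, p=1))  # 3 -- a stride of 2 shrinks the result
```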

We now have all the necessary components to discuss a more sophisticated example. Assume you are working with three-channel (RGB) images in a CNN model. You have three matrices to convolve instead of one at a time. Therefore, our kernel should also have three channels, as we need the dot product across all three color channels. The resulting matrix, however, is again a 2D matrix: the per-channel products are summed into a single value at each position.

The convolution operation of 3 color channels returns a 2D matrix.

What if we have multiple kernels in convolution operations? Each convolution kernel generates a new layer in the output matrix in that scenario, as illustrated in the diagram below.

The convolution operation of 3 color channels with two kernels returns two layers in the result matrix.
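In PyTorch this corresponds to nn.Conv2d, where in_channels is the number of color channels and out_channels is the number of kernels; each kernel produces one layer of the output. A small shape check (sizes are arbitrary):

```python
import torch
import torch.nn as nn

# Two 3-channel kernels of size 3 x 3: each kernel produces one output layer
conv = nn.Conv2d(in_channels=3, out_channels=2, kernel_size=3)

rgb = torch.randn(1, 3, 5, 5)            # a batch of one 3-channel 5 x 5 "image"
out = conv(rgb)
print(out.shape)                         # torch.Size([1, 2, 3, 3])
```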

Transpose Convolution

Transpose convolution is often used in autoencoder architectures: by their nature, we need to downsample the image and then upsample it back to the original dimensions. Transpose convolution does the opposite of regular convolution; it increases the size of the resulting matrix relative to the original matrix. To achieve this, each input element scalar-multiplies the kernel, and the scaled kernels are placed into the output at stride offsets and summed where they overlap. See the example below.

The first three intermediate steps of transpose convolution for a given input matrix
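Those intermediate steps can be sketched in NumPy: each input element scales a copy of the kernel, and the copies are accumulated into the output (values are arbitrary):

```python
import numpy as np

def transpose_convolve2d(source, kernel, stride=1):
    """Transpose convolution without padding: output size N = (M - 1) * S + K."""
    m, k = source.shape[0], kernel.shape[0]
    n = (m - 1) * stride + k
    out = np.zeros((n, n))
    for i in range(m):
        for j in range(m):
            # scalar-multiply the kernel by one input element and accumulate it
            r, c = i * stride, j * stride
            out[r:r+k, c:c+k] += source[i, j] * kernel
    return out

source = np.array([[1, 2],
                   [3, 4]])
kernel = np.ones((3, 3))
out = transpose_convolve2d(source, kernel)
print(out.shape)                         # (4, 4) -- larger than the 2 x 2 input
```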

Similar to the convolution result, we can calculate the transpose convolution result shape using the equation below.

N = (M − 1) × S − 2P + K

N = Resulting height (or width); M = Source height (or width); P = Padding value; K = Kernel size; S = Stride value
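As before, this formula is easy to encode and check (the helper name is my own; it matches PyTorch’s ConvTranspose2d when output_padding is 0):

```python
def transpose_output_size(m, k, s=1, p=0):
    """N = (M - 1) * S - 2P + K for one spatial dimension."""
    return (m - 1) * s - 2 * p + k

print(transpose_output_size(2, 3))       # 4 -- the 2 x 2 input grows to 4 x 4
print(transpose_output_size(3, 2, s=2))  # 6 -- a stride of 2 roughly doubles the size
```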

Perfect; now we have enough knowledge to get our hands dirty with some coding. We are going to build a convolutional autoencoder model. It’s a good exercise that uses almost everything discussed here: convolution, transpose convolution, stride, and padding.

Assume we have three-channel color images of shape (480, 360), and we need to build an autoencoder for them. When we start the design, we only know the number of color channels and the image height and width. There are three parameters we can pick based on our preferences (or, more specifically, based on the problem): stride, padding, and kernel size. Let’s start building the encoder part of the model.
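Here is a minimal encoder sketch in PyTorch under my own parameter choices: two 3 x 3 convolutions with stride 2 and padding 1, which take 480 x 360 down to 120 x 90. The class and attribute names (out_channels, net, latent_dim) are assumptions for illustration:

```python
import torch
import torch.nn as nn

class Encoder(nn.Module):
    def __init__(self, out_channels=4, latent_dim=64):
        super().__init__()
        self.out_channels = out_channels
        # Two stride-2 convolutions: 480 x 360 -> 240 x 180 -> 120 x 90
        self.net = nn.Sequential(
            nn.Conv2d(3, out_channels, kernel_size=3, stride=2, padding=1),
            nn.ReLU(),
            nn.Conv2d(out_channels, 4 * out_channels, kernel_size=3, stride=2, padding=1),
            nn.ReLU(),
            nn.Flatten(),                # flatten the feature maps into a vector
            nn.Linear(4 * out_channels * 120 * 90, latent_dim),
        )

    def forward(self, x):
        return self.net(x)

encoder = Encoder()
latent = encoder(torch.randn(1, 3, 480, 360))
print(latent.shape)                      # torch.Size([1, 64])
```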

Using the formula we discussed in the previous section, you can cross-check the resulting image size of this network. Carefully check how we can use stride to manipulate image size accordingly.

Similarly, we can build a decoder part as well.
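A matching decoder sketch, again under my own parameter choices; output_padding=1 is needed so each transpose convolution exactly doubles the spatial size (120 x 90 back to 480 x 360):

```python
import torch
import torch.nn as nn

class Decoder(nn.Module):
    def __init__(self, out_channels=4, latent_dim=64):
        super().__init__()
        self.out_channels = out_channels
        self.fc = nn.Linear(latent_dim, 4 * out_channels * 120 * 90)
        # Two stride-2 transpose convolutions: 120 x 90 -> 240 x 180 -> 480 x 360
        self.net = nn.Sequential(
            nn.ConvTranspose2d(4 * out_channels, out_channels, kernel_size=3,
                               stride=2, padding=1, output_padding=1),
            nn.ReLU(),
            nn.ConvTranspose2d(out_channels, 3, kernel_size=3,
                               stride=2, padding=1, output_padding=1),
            nn.Sigmoid(),                # pixel values in [0, 1]
        )

    def forward(self, latent):
        output = self.fc(latent)
        # reshape the flat vector back into a stack of 2D feature maps
        output = output.view(-1, 4 * self.out_channels, 120, 90)
        return self.net(output)

decoder = Decoder()
image = decoder(torch.randn(1, 64))
print(image.shape)                       # torch.Size([1, 3, 480, 360])
```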

When building the encoder, we flattened the feature maps to obtain a latent representation. Therefore, in the decoder we have to convert the latent vector back into 2D feature maps. Note how we achieve that using the code snippet below.

        output = output.view(-1, 4 * self.out_channels, 120, 90)

Feel free to play around with different stride values, padding, and kernels and try to calculate the result matrix shape.

Another valuable and exciting thing to do is to visualize the learned kernels. Once training concludes, they capture tiny details of your dataset, such as edges. Here, I am plotting one random kernel from a half-trained encoder to give the reader some context.

Let’s plot the first conv block’s kernels. We can access the learned weights via encoder.net[0].weight and then use matplotlib for plotting.
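A sketch of that plotting code, using a freshly initialized conv layer as a stand-in for encoder.net[0] since trained weights are not available in this snippet (note the PyTorch attribute is weight, not weights):

```python
import torch.nn as nn
import matplotlib
matplotlib.use("Agg")                    # render off-screen (no display needed)
import matplotlib.pyplot as plt

# Stand-in for encoder.net[0]: a conv layer with 4 kernels of shape (3, 3, 3)
first_conv = nn.Conv2d(3, 4, kernel_size=3, stride=2, padding=1)
weights = first_conv.weight.detach()     # shape: (4, 3, 3, 3)

fig, axes = plt.subplots(1, weights.shape[0], figsize=(8, 2))
for i, ax in enumerate(axes):
    # show the first channel of each kernel as a tiny heatmap
    ax.imshow(weights[i, 0], cmap="viridis")
    ax.set_title(f"kernel {i}")
    ax.axis("off")
fig.savefig("kernels.png")
```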


Thanks for reading! Connect with me on LinkedIn.

In this article, all images, unless otherwise noted, are by the author.
