Before convolutional neural networks became popular, neural networks often struggled with tasks on 2D data, such as image classification and image segmentation. In this article, we’ll dive into a bit of history to uncover the motivation behind convolutional neural networks, then build up the intuition for how and why they work.

Motivating Background

The core of all problems can be described as a mapping from an input space to an output space, and our goal is to solve this mapping (or approximate it as best as we can in practical settings), whether it be via machine learning or other methods.

For example, image classification involves mapping an image to a class label, while image segmentation involves mapping an image to a pixel-wise mask (that is, an image to another image).

Before convolutional neural networks (CNNs) became mainstream, neural networks often consisted of fully-connected (FC) layers. The building block of an FC layer can be described as a linear layer followed by an activation function.

You are probably familiar with this scalar linear equation:

\[z = mx + c\]

When \({x}\) and \({z}\) are vectors, \({m}\) becomes a matrix and \({c}\) becomes a vector, which we denote as \({W}\) and \({b}\) respectively.

\[z = Wx + b\]

Then, we apply a non-linear activation function, e.g. sigmoid.

\[\hat{y} = \sigma(z) = \sigma(Wx + b)\]

This is an FC layer that maps a 1-channel input to a 1-channel output.
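To make this concrete, here is a minimal NumPy sketch of one such FC layer; the sizes and random values are illustrative assumptions, not weights from a trained network.

```python
import numpy as np

def sigmoid(z):
    # Element-wise logistic function.
    return 1.0 / (1.0 + np.exp(-z))

def fc_layer(x, W, b):
    # One fully-connected layer: a linear map followed by a non-linearity.
    return sigmoid(W @ x + b)

rng = np.random.default_rng(0)
x = rng.normal(size=4)        # input vector (illustrative size)
W = rng.normal(size=(3, 4))   # weights mapping 4 inputs to 3 outputs
b = rng.normal(size=3)        # bias vector
y_hat = fc_layer(x, W, b)     # output vector of size 3, values in (0, 1)
```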

A network of linear layers with non-linear activation functions is a universal approximator (given enough neurons). This means that, for any input, the weights \({W}\) and biases \({b}\) can be chosen such that the output \({\hat{y}}\) is very close to the true output \({y}\).

Let’s consider a binary image segmentation task: given an image, identify the cat. A naive approach would be to take each pixel colour value and concatenate them into a 1D vector. Then, apply a linear layer with sigmoid activation to output another 1D vector of the same shape, which can be unfolded to represent the segmented output (a quick sketch of this pipeline follows the list below). However, there are two major problems:

  1. Neighbouring pixels are related to each other. By concatenating the image, we are losing spatial information.
  2. The network only learns a global approximation to the current input image. It does not generalise well to new images.
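Here is what that naive pipeline might look like in NumPy. This is only a sketch, assuming a 6x6 single-channel image and untrained random weights; it just illustrates the shapes involved.

```python
import numpy as np

# Hypothetical naive segmentation: flatten the image, apply one FC layer
# with a sigmoid, and fold the output back into a mask.
H, W_img = 6, 6                              # illustrative image size
image = np.random.rand(H, W_img)             # stand-in for pixel values

x = image.reshape(-1)                        # (36,) concatenated pixel vector
W = np.random.randn(H * W_img, H * W_img)    # 36x36 weight matrix
b = np.random.randn(H * W_img)               # 36 biases

z = W @ x + b
mask = (1.0 / (1.0 + np.exp(-z))).reshape(H, W_img)  # per-pixel probabilities
```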

The 2nd problem is known as high bias, and it is demonstrated below. Say our network performs well on the training image.

train-image train-segmented

However, our 1-layer network is very rigid. Due to the bias-variance tradeoff, it has very low variance and generates a similar mask for the test image, incorrectly segmenting it.

test-image test-segmented

In this case, even though we are using a universal approximator, it is not good enough - we’d have to retrain the network every time we encounter a new image. More generalisable models involve deeper layers to capture local patterns, but stacking FC layers on top of each other results in a huge number of parameters, which is infeasible for imaging tasks (see the rough count below). Plus, spatial information is still not being leveraged.
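To get a feel for the parameter blow-up, here is a rough count for a single FC layer mapping an image to a same-sized output; the 224x224x3 image size is an arbitrary choice for illustration.

```python
# Rough parameter count for one FC layer mapping an image to a same-sized output.
H, W, C = 224, 224, 3              # illustrative image dimensions
n_inputs = H * W * C               # 150,528 input values

fc_params = n_inputs * n_inputs + n_inputs   # weight matrix + biases
print(f"{fc_params:,}")            # roughly 22.7 billion parameters for one layer
```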

This motivates the need for CNNs. In this post, we will focus on CNNs for 2D data, like images.

Convolutional Neural Networks

In essence, what we’ve done so far is look at the entire image at once with a single FC layer. The idea behind CNNs is to shrink this window down and look at patches of the image at a time.

Since each neuron now only has access to its corresponding patch, it is forced to learn local patterns. As we add more layers, the receptive field grows, so neurons in deeper layers can learn global patterns while spatial information is retained.

Let’s dive into the details.

Convolution

Previously, we saw the linear layer \({z = Wx + b}\), where \({x}\) holds the input image pixel values and \({W}\) the weights. For example, let our image have dimensions 6x6.

\[\begin{bmatrix} 0 & 0 & 0 & 255 & 255 & 0 \\ 0 & 0 & 0 & 255 & 255 & 0 \\ 127 & 127 & 127 & 255 & 255 & 127 \\ 127 & 127 & 127 & 255 & 255 & 127 \\ 0 & 0 & 0 & 255 & 255 & 0 \\ 0 & 0 & 0 & 127 & 127 & 0 \end{bmatrix}\]

If \({z}\) is a scalar, then for our concatenated vector \({x}\) of size 36, \({W}\) would also have size 36, and we essentially calculate the dot product of \({W}\) and \({x}\).

The same concept applies to convolutions! However, instead of \({W}\) matching the size of \({x}\), we use a smaller filter, for example, a 3x3 filter. We refer to this as the kernel.

\[\begin{bmatrix} -1 & 0 & 2 \\ -2 & 0 & 4 \\ -1 & 0 & 2 \end{bmatrix}\]

Of course, now we can’t perform the dot product over the entire image \({x}\), but we can perform it over a patch. We start at the top left.

Note: I’m abusing notation a bit here. If you read these as dot products, treat each 3x3 matrix as a vector of size 9; the convolution itself operates on them as matrices.

\[\begin{bmatrix} 0 & 0 & 0 \\ 0 & 0 & 0 \\ 127 & 127 & 127 \end{bmatrix} \cdot \begin{bmatrix} -1 & 0 & 2 \\ -2 & 0 & 4 \\ -1 & 0 & 2 \end{bmatrix} = 127 \times -1 + 127 \times 2 = 127\]

Now, we slide the filter to the right by 1 pixel.

\[\begin{bmatrix} 0 & 0 & 255 \\ 0 & 0 & 255 \\ 127 & 127 & 255 \end{bmatrix} \cdot \begin{bmatrix} -1 & 0 & 2 \\ -2 & 0 & 4 \\ -1 & 0 & 2 \end{bmatrix} = 1913\]

We repeat this for each patch to build up an output. We have performed a convolution, an operation denoted by \({\ast}\).

\[\begin{bmatrix} 0 & 0 & 0 & 255 & 255 & 0 \\ 0 & 0 & 0 & 255 & 255 & 0 \\ 127 & 127 & 127 & 255 & 255 & 127 \\ 127 & 127 & 127 & 255 & 255 & 127 \\ 0 & 0 & 0 & 255 & 255 & 0 \\ 0 & 0 & 0 & 127 & 127 & 0 \end{bmatrix} \ast \begin{bmatrix} -1 & 0 & 2 \\ -2 & 0 & 4 \\ -1 & 0 & 2 \\ \end{bmatrix} = \begin{bmatrix} 127 & 1913 & 1913 & -766 \\ 381 & 1659 & 1659 & -258 \\ 381 & 1659 & 1659 & -258 \\ 127 & 1657 & 1657 & -638 \end{bmatrix}\]
The above calculations are technically cross-correlation. Traditionally, for convolution we'd flip the kernel (on both axes) before performing the dot product. In practice, the terms are used interchangeably, since the network learns the kernel weights either way.
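Here is a short NumPy sketch that reproduces the calculation above. The function name conv2d is just an illustrative choice; as noted, what it computes is really cross-correlation, with no padding and a stride of 1.

```python
import numpy as np

def conv2d(image, kernel):
    # "Valid" cross-correlation: slide the kernel over the image and take the
    # element-wise product-and-sum at each position (no padding, stride 1).
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

image = np.array([
    [  0,   0,   0, 255, 255,   0],
    [  0,   0,   0, 255, 255,   0],
    [127, 127, 127, 255, 255, 127],
    [127, 127, 127, 255, 255, 127],
    [  0,   0,   0, 255, 255,   0],
    [  0,   0,   0, 127, 127,   0],
])
kernel = np.array([
    [-1, 0, 2],
    [-2, 0, 4],
    [-1, 0, 2],
])
feature_map = conv2d(image, kernel)
print(feature_map)   # first row: [127, 1913, 1913, -766]
```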

This is what it looks like in action.

convolution-gif

A key observation is that we are reusing the kernel for each patch, so the number of weights in our model remains small and manageable.

Now, we have a 4x4 output. This is known as a feature map. Each cell in the feature map is a neuron with a receptive field of 3x3. That is, its value is dependent on a 3x3 patch of the original input image.

Typically, we'd apply an activation function to the convolution output to introduce non-linearity; I've omitted this for simplicity. I've also omitted the bias term, which is sometimes dropped in practice anyway, especially if the data has been normalised.

Here, the ReLU activation is commonly used, as it can be interpreted as a gate: if the value is negative, the output is zero and no gradient flows through during backprop (the stage where the model weights get updated and the network learns), so the neuron gets switched off. This is useful for controlling which areas of the image the deeper layers focus on - a bit like attention.
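Continuing the sketch above, applying ReLU is a one-liner. This is only to illustrate the gating effect; the worked example below keeps the raw feature map values.

```python
# ReLU zeroes out negative activations, effectively switching those neurons off.
activated = np.maximum(feature_map, 0)   # e.g. -766 becomes 0, 1913 stays 1913
```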

Let’s perform another convolution on the feature map with a different kernel.

\[\begin{bmatrix} 127 & 1913 & 1913 & -766 \\ 381 & 1659 & 1659 & -258 \\ 381 & 1659 & 1659 & -258 \\ 127 & 1657 & 1657 & -638 \end{bmatrix} \ast \begin{bmatrix} 1 & 1 & 1 \\ 1 & -8 & 1 \\ 1 & 1 & 1 \end{bmatrix} = \begin{bmatrix} -3580 & -5751 \\ -4092 & -6135 \end{bmatrix}\]

Now, we have another feature map. Each neuron has a receptive field of 5x5, so its vision is almost the entire 6x6 image! This allows the neurons in this 2nd layer to learn global patterns that build upon the local patterns learned in the previous layer.

Towards the end of the convolutional network, we would sometimes flatten the feature map into a 1D vector and pass it through an FC layer with an activation function (like we did in our initial approach).

\[\sigma \left( \begin{bmatrix} -3580 & -5751 & -4092 & -6135 \end{bmatrix} \cdot \begin{bmatrix} 1 & 1 & -1 & -1 \end{bmatrix} \right) = \sigma(896) \approx 1\]
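Continuing the earlier NumPy sketch, the second convolution and this final FC step can be reproduced as follows; the weight vector [1, 1, -1, -1] is just the illustrative one used above.

```python
# Second convolution, then flatten and apply a tiny FC "head" with a sigmoid.
kernel2 = np.array([
    [1,  1, 1],
    [1, -8, 1],
    [1,  1, 1],
])
feature_map2 = conv2d(feature_map, kernel2)  # 2x2, as in the worked example

flat = feature_map2.reshape(-1)              # [-3580, -5751, -4092, -6135]
w_fc = np.array([1, 1, -1, -1])
logit = flat @ w_fc                          # 896
y_hat = 1.0 / (1.0 + np.exp(-logit))         # sigmoid(896) is effectively 1.0
```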

Designing ConvNet Architecture in Practice

In practice, for each layer, we’d have multiple kernels instead of just 1. For example, if we have 32 kernels in the first layer (applying each kernel like how we did above), we’d have 32 feature maps. The weights for each kernel are different, so each feature map captures different patterns.

Then, instead of flattening a single 2x2 feature map like we did, we would flatten the full 2x2x32 tensor (the 32 stacked feature maps) to pass into the FC layer.
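As a sketch of what such an architecture might look like in PyTorch, assuming a 6x6 single-channel input, two 3x3 convolution layers with 32 kernels each, and a single output unit (all of these sizes are illustrative choices, not a recommended design):

```python
import torch
import torch.nn as nn

# A tiny network mirroring the toy example's shapes.
model = nn.Sequential(
    nn.Conv2d(in_channels=1, out_channels=32, kernel_size=3),   # 6x6 -> 4x4x32
    nn.ReLU(),
    nn.Conv2d(in_channels=32, out_channels=32, kernel_size=3),  # 4x4 -> 2x2x32
    nn.ReLU(),
    nn.Flatten(),                                               # 2 * 2 * 32 = 128
    nn.Linear(128, 1),
    nn.Sigmoid(),
)

x = torch.rand(1, 1, 6, 6)   # (batch, channels, height, width)
y_hat = model(x)             # shape (1, 1), a value in (0, 1)
```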

To interpret this, think of our cat segmentation task. Here it is again.

train-image

To identify a cat, it’s easier to break this down into subtasks: one channel is responsible for detecting the pointy ears, another for detecting the tail, and so on. Each channel’s responsibility can also be more abstract: edge detection, gradient contours, etc.

Finally, we aggregate all of the information on the subtasks (i.e. flatten all the feature maps) to predict the output. For segmentation in particular, we would instead use transposed convolutions to upscale the feature maps back to the original image resolution. This is not covered in this post, but the intuition is the same.

Further Reading

This article covered the motivation behind CNNs, the intuition behind the convolution operator, and how CNNs work.

In practice, the architecture is delicate and many other techniques are applied to ensure the network trains well and generalises to unseen data. These include:

  • Pooling, to accelerate the growth of receptive fields and shrink the feature map size.
  • Batch Normalisation - note how in our example calculations, the values get very large very quickly. Normalising the values after each layer helps with model stability.
  • ResNet, an influential architecture that introduces skip connections to enable gradient flow. This allows very deep networks (hundreds of layers) to train well and tackle complex tasks like classifying an image into 1000 classes.
  • Data augmentation, dropout and other techniques to prevent overfitting.