The Math Behind Convolutional Neural Networks (CNNs)#

Overview#

In this lesson, we will explore the basic mathematical concepts that form the foundation of Convolutional Neural Networks (CNNs), one of the most popular models used in Computer Vision. We will discuss concepts such as convolutions, filters, stride, and padding, and how these operations help in extracting meaningful features from images for tasks like object detection, classification, and segmentation.

Learning Objectives#

By the end of this section, you will:

  • Understand how images are represented as numerical data and why grayscale images are often used in CNNs.

  • Learn how to apply a single-layer convolution to detect edges in an image.

  • Explore how deeper CNNs are structured and how they use multiple layers to extract complex features.


What is a CNN?#

A Convolutional Neural Network (CNN) is a type of deep learning model specifically designed to process grid-like data, such as images. Unlike traditional neural networks, CNNs use a special operation called convolution to automatically detect and learn important features in images, such as edges, shapes, and textures. This ability makes CNNs highly effective at recognizing objects in images without requiring manual feature extraction.

Why are CNNs Ubiquitous in Marine Imagery?#

In marine science, CNNs have become ubiquitous due to their ability to handle complex imagery data from sources like underwater cameras, aerial drones, and satellite imagery. Marine imagery often involves detecting, classifying, and tracking various objects, such as fish, corals, and other marine organisms, within diverse and noisy environments.

CNNs are powerful in these tasks because they can:

  • Automatically identify features like shapes and patterns, even in challenging conditions such as low light or murky water.

  • Scale to large datasets, such as vast amounts of underwater footage or satellite imagery.

  • Handle variations in object size, orientation, and viewpoint, which are common in dynamic marine environments.

Annotations in CNNs: Bounding Boxes#

To train CNNs, annotated datasets are crucial. In the case of marine imagery, annotations are often in the form of bounding boxes. Bounding boxes are rectangular frames drawn around objects of interest within an image. For example, when detecting and classifying fish in an underwater video, bounding boxes are used to mark the locations of each fish.

These annotations help the CNN learn which parts of an image correspond to specific objects. The model is trained to detect and predict bounding boxes for new, unseen images, allowing it to automatically identify and localize objects in marine imagery.

Mathematical Foundations of CNNs#

Convolutional Neural Networks rely on mathematical operations that process image data in layers. The key operation is convolution, which is used to detect features in the input images.

Images as Data and Why Grayscale is Used#

Before diving into the math behind CNNs, it’s important to understand how images are represented as data in a way that machines can process. While humans can look at an image and instinctively understand it, machines treat images as numerical matrices.

An image is essentially a grid of pixel values, where each pixel represents some form of intensity or color information. For a grayscale image, this is simple: each pixel contains a single intensity value, typically ranging from 0 (black) to 255 (white). In the case of RGB images (red, green, blue), each pixel holds three values, one for each color channel. This makes RGB images a bit more complex, as the pixel data consists of three matrices (one for each channel). This is why for this lessons activity, the image will be translated to grayscale before processing

Why Convert to Grayscale?#

Grayscale images are often used in CNNs because they simplify the data representation. Instead of working with three separate color channels (red, green, and blue), you only need to deal with one channel—the intensity of light. This reduces the computational complexity, allowing the model to focus on extracting important features like edges, shapes, and textures, which are key for object recognition.

Grayscale images also help avoid color-related biases. For many computer vision tasks, the essential features of an image are independent of color. For instance, detecting edges or shapes is more important than detecting the exact color. Converting to grayscale strips away the additional information that may not be crucial for a particular task, ensuring that the network focuses purely on spatial structures.

How Machines See Images#

Humans can abstract images and recognize objects or patterns even when presented with complex scenes. However, machines don’t “see” images the same way we do. For a machine, an image is nothing more than a matrix of numbers—each number representing the intensity of light at a specific location.

When a machine processes an image, it uses mathematical operations like convolutions to extract numerical patterns from the image. These patterns help the machine recognize edges, textures, or specific features, like the shape of a fish or the outline of a crab leg. Unlike humans, who can effortlessly understand the meaning behind an image, a machine must systematically learn to detect patterns through data, which is why CNNs play a critical role in helping machines interpret and classify images.

Convolutions#

A convolution operation is essentially a way of applying a filter (kernel) to an image. The kernel slides over the input image, multiplying and summing the values to produce an output.

Convolution Formula#

Let’s represent the convolution operation mathematically:
Given an image \(I\) and a filter \(F\) of size \(m \times n\), the convolution at a position \((x, y)\) can be expressed as:

\[ S(x, y) = \sum_{i=0}^{m-1} \sum_{j=0}^{n-1} I(x+i, y+j) \cdot F(i,j) \]

Where:

  • \(I(x, y)\) is the pixel value at position \((x, y)\) in the input image.

  • \(F(i, j)\) is the value of the filter at position \((i, j)\), where \(F\) is of size \(m \times n\).

  • \(S(x, y)\) is the output (or feature map) after applying the filter.

Filters / Kernels#

Filters (also called kernels) are small matrices of weights that are applied to input images to extract features such as edges, textures, and more. For example, a 3x3 filter applied to an image can help detect vertical edges.

import numpy as np

# Define a simple vertical edge detection filter
edge_filter = np.array([[1, 0, -1],
                        [1, 0, -1],
                        [1, 0, -1]])

# Applying this filter in a CNN will detect vertical edges in images.

Stride and Padding#

Two other important concepts in CNNs are stride and padding.

  • Stride: Stride defines how the filter moves across the image. A stride of 1 means the filter moves one pixel at a time, whereas a stride of 2 skips every other pixel.

Mathematically, if the stride is \(s\), the output size after convolution is:

\[ \text{Output Size} = \frac{(N - F)}{s} + 1 \]

Where \(N\) is the input size and \(F\) is the filter size.

  • Padding: Padding refers to adding extra pixels (usually zeros) around the borders of the image. Padding helps preserve the spatial dimensions of the image after convolution, especially when filters reduce the size of the image.

Pooling#

Pooling is a downsampling operation used to reduce the spatial size of the feature maps. The most common type of pooling is max pooling, which takes the maximum value from a portion of the image.

For example, in 2x2 max pooling:

\[ P(x, y) = \max \{I(x, y), I(x+1, y), I(x, y+1), I(x+1, y+1)\} \]

Building Hierarchical Features#

CNNs learn hierarchical features through multiple convolutional layers. The first layers detect simple edges or colors, while deeper layers capture more complex structures like shapes or patterns.

Example Code in PyTorch#

Below is a simple example of how a convolution operation is implemented in PyTorch.

import torch
import torch.nn as nn

# Define a convolutional layer
conv_layer = nn.Conv2d(in_channels=1, out_channels=1, kernel_size=3, stride=1, padding=1)

# Example input (single-channel image)
input_image = torch.randn(1, 1, 5, 5)

# Apply the convolution
output = conv_layer(input_image)

print("Input Image:\n", input_image)
print("Convolved Output:\n", output)

In this example, we defined a convolutional layer with a 3x3 filter, a stride of 1, and padding to preserve the image size.

Let’s Move onto a More Complex Example#

This example can look daunting at first, but don’t worry, we will walk through it step by step. The purpose of this activity is to give you a practical example of how to apply the convolution techniques you’ve learned on an actual image. By the end of this exercise, you will understand how to upload an image, apply a convolution (edge detection), and display both the original and processed images side by side.

Breakdown of the Code Blocks#

In this section, we will dissect each part of the code to understand its purpose.

Block 1: Importing Libraries#

import torch
import torch.nn as nn
import matplotlib.pyplot as plt
import numpy as np
from PIL import Image
from google.colab import files

This block is essential for setting up the environment. Here’s a breakdown of the imports:

  • torch and torch.nn are part of the PyTorch library, which we will use to define and apply the convolutional filter.

  • matplotlib.pyplot will help us visualize the images.

  • numpy is used to handle arrays, which is the format that images take when processed.

  • PIL.Image allows us to load and manipulate images.

  • google.colab.files is a utility that allows us to upload files directly into Google Colab.

Block 2: Uploading the Image#

uploaded = files.upload()

This block prompts you to upload an image. It uses Colab’s files.upload() function to allow users to select and upload a file. For this exercise, we are working with a 180x180 image of a crab. Once the image is uploaded, the file is stored in memory for further processing.

Block 3: Opening and Preprocessing the Image#

image_path = next(iter(uploaded))
image = Image.open(image_path).convert('L')  # Convert to grayscale
image = image.resize((180, 180))

Here, we open the uploaded image and convert it to grayscale using .convert('L'), where 'L' means luminance or grayscale. The image is resized to 180x180 pixels to ensure it’s the correct size for our convolution operation.

  • Recakk why we convert to grayscale: CNNs typically process grayscale or single-channel images more efficiently for basic operations like edge detection, as it simplifies the data (reducing from 3 RGB channels to 1).

  • Resizing ensures that the image fits within the expected dimensions of the neural network input.

Block 4: Converting the Image to a PyTorch Tensor#

image_np = np.array(image)
image_tensor = torch.tensor(image_np, dtype=torch.float32).unsqueeze(0).unsqueeze(0)

This block first converts the image from a PIL format to a NumPy array using np.array(image). Then, it transforms the NumPy array into a PyTorch tensor, which is the format used by PyTorch for processing data.

  • The .unsqueeze(0) is applied twice to add two dimensions: one for the batch size and one for the channel (in this case, just one channel for grayscale images).

  • Why tensors? Tensors are the data structure that PyTorch uses for computation, similar to arrays in NumPy but optimized for GPU operations.

Block 5: Defining the Edge Detection Filter#

edge_filter = nn.Conv2d(in_channels=1, out_channels=1, kernel_size=3, stride=1, padding=1, bias=False)
edge_filter.weight = torch.nn.Parameter(torch.tensor([[[[1, 0, -1], [1, 0, -1], [1, 0, -1]]]], dtype=torch.float32))  

This block defines a convolutional layer (Conv2d) with a 3x3 kernel (filter) that will be used for detecting vertical edges.

  • in_channels=1 and out_channels=1: This means we have a single input (grayscale image) and a single output.

  • kernel_size=3: Specifies that we are using a 3x3 filter.

  • stride=1: This means the filter moves one pixel at a time.

  • padding=1: Adds zero-padding around the image to ensure the output size matches the input size.

The filter defined here is a vertical edge detection filter, where the kernel looks like this:

\[\begin{split} \begin{bmatrix} 1 & 0 & -1 \\ 1 & 0 & -1 \\ 1 & 0 & -1 \end{bmatrix} \end{split}\]

This filter emphasizes vertical edges by detecting differences in pixel intensities between adjacent vertical pixels.

Block 6: Applying the Filter#

with torch.no_grad():
    filtered_image_tensor = edge_filter(image_tensor)

In this block, we apply the convolution operation using the edge detection filter on the input image.

  • torch.no_grad(): This disables gradient calculation, which is unnecessary here since we are only applying the filter and not training a model.

  • Why convolution? Convolution applies the filter to the image, highlighting specific features (in this case, vertical edges). It scans through the image and performs element-wise multiplication with the filter, followed by summing the results to produce the output.

Block 7: Converting the Filtered Image for Display#

filtered_image_np = filtered_image_tensor.squeeze().numpy()

Block 8: Displaying the Images#

fig, ax = plt.subplots(1, 2, figsize=(10, 5))

# Original Image
ax[0].imshow(image_np, cmap='gray')
ax[0].set_title("Original Image")

# Filtered (Edge-detected) Image
ax[1].imshow(filtered_image_np, cmap='gray')
ax[1].set_title("Filtered (Edge-detected) Image")

plt.show()

This final block uses Matplotlib to display the images side by side.

  • plt.subplots(1, 2, figsize=(10, 5)): Creates two subplots, one for the original image and one for the edge-detected (filtered) image.

  • The original image is shown in the first subplot, while the filtered image (after applying the edge detection filter) is displayed in the second.

This visual comparison allows you to see how the convolutional filter modifies the image by highlighting vertical edges, reinforcing the math behind convolutions you’ve learned.

Now lets put it all together, open the whole code snippet below in google colab and test out working in that environment.

import torch
import torch.nn as nn
import matplotlib.pyplot as plt
import numpy as np
from PIL import Image
from google.colab import files

# Upload the image (make sure it's the provided 180x180 image of a crab for this exercise)
uploaded = files.upload()

# Open the uploaded image
image_path = next(iter(uploaded))
image = Image.open(image_path).convert('L')  # Convert to grayscale
image = image.resize((180, 180))

# Convert image to numpy array and then to PyTorch tensor
image_np = np.array(image)
image_tensor = torch.tensor(image_np, dtype=torch.float32).unsqueeze(0).unsqueeze(0)

# Define a simple edge detection filter
edge_filter = nn.Conv2d(in_channels=1, out_channels=1, kernel_size=3, stride=1, padding=1, bias=False)
edge_filter.weight = torch.nn.Parameter(torch.tensor([[[[1, 0, -1], [1, 0, -1], [1, 0, -1]]]], dtype=torch.float32))  

# Apply the filter (convolution)
with torch.no_grad():
    filtered_image_tensor = edge_filter(image_tensor)

# Convert the filtered image back to numpy for display
filtered_image_np = filtered_image_tensor.squeeze().numpy()

# Plot the original and filtered images
fig, ax = plt.subplots(1, 2, figsize=(10, 5))

ax[0].imshow(image_np, cmap='gray')
ax[0].set_title("Original Image")

# Filtered (Edge-detected) Image
ax[1].imshow(filtered_image_np, cmap='gray')
ax[1].set_title("Filtered (Edge-detected) Image")

plt.show()

Video Example#

Animation 1: A 3x3 edge detection convolution filter slides over a 9x9 pixel subset of an example crab image, detecting edges and transforming pixel values in real-time. Credit: A. Carter UW

CNNs and Multiple Layers#

The edge detector example demonstrates how a simple convolutional filter can detect vertical edges in an image. However, Convolutional Neural Networks (CNNs) go far beyond using just one filter or one layer. CNNs are composed of many layers, each designed to detect increasingly complex features of an image. Let’s explore how these layers function together to make CNNs so powerful.

How CNNs Use Multiple Layers#

CNNs are made up of a sequence of layers, each of which transforms the input image in a different way. The first few layers typically detect simple features like edges, lines, and corners, similar to the edge detection filter we just used. As the network progresses deeper into subsequent layers, it begins to detect more complex patterns such as shapes, textures, and entire objects.

  • Early Layers: Focus on basic features like edges and corners. These are the low-level features that are present in almost every image.

  • Middle Layers: Detect more complex structures, like patterns, textures, or object parts (e.g., fins of a fish or the body of a coral).

  • Deeper Layers: Recognize entire objects and relationships between objects. In marine imagery, these layers might detect full fish species, coral formations, or even patterns of animal behavior.

This hierarchical approach allows CNNs to build a progressively more sophisticated understanding of the input data.

Preview of next section : how Weights Change During Training#

The power of CNNs comes from their ability to learn and adapt their filters (or weights) during training. Each filter in a CNN has associated weights, which are numbers that control how the filter interacts with the input image. When a CNN is trained on a dataset, these weights are updated over many epochs (iterations over the dataset).

  1. Initial Weights: At the beginning of training, the weights in each layer are typically initialized randomly. These random weights might detect patterns, but they don’t yet represent useful features for the task.

  2. Training and Backpropagation: During training, the CNN processes the input image and makes predictions. The difference between the prediction and the true label (known as the loss) is calculated. Using an optimization process called backpropagation, the CNN updates the weights to minimize this loss.

  3. Epochs and Weight Updates: Training is repeated over many epochs. An epoch refers to a full cycle through the training dataset. After each epoch, the weights are slightly adjusted based on the feedback from the loss function. Over time, the CNN learns to modify its weights to better detect the relevant features in the image.

    • Early Epochs: The weights are still far from optimal. The CNN is learning basic patterns, and the filters may be detecting random noise or unrelated features.

    • Later Epochs: As training progresses, the weights in each layer become specialized. Filters in the early layers may focus on detecting edges, while filters in deeper layers become tuned to recognizing specific objects, like marine species or coral structures.

  4. Convergence: After sufficient epochs, the weights reach a point where the CNN can accurately classify images in the dataset. At this point, the network has effectively learned how to transform raw pixel data into meaningful representations of objects.