7. Creating and Augmenting Datasets#

7.1. Overview#

In this lesson, we will focus on building and preparing datasets for deep learning models, discussing the rationale behind dataset creation and augmentation. We will learn how to batch process images with multiple augmentations, how to keep associated annotation files intact during dataset splitting, and how to choose the right augmentation techniques for different use cases. By the end of this lesson, you’ll have the skills to create a robust dataset pipeline.

7.1.1. Learning Objectives#

By the end of this section, you will:

  • Understand the rationale behind building datasets for deep learning, including the importance of augmentations and data splits.

  • Learn how to split datasets into training, validation, and test sets while ensuring associated annotation files (e.g., XML, JSON) remain intact.

  • Apply batch processing techniques to augment large datasets efficiently.

  • Customize augmentations based on the specific task, such as object detection, segmentation, or classification.

  • Explore and implement augmentations using Albumentations, a fast and flexible image augmentation library.


7.2. Rationale Behind Dataset Creation and Augmentation#

When training deep learning models, the quality and diversity of your dataset are critical. Here are a few reasons why augmenting and preparing datasets properly is important:

  • Dataset Variety: Real-world data is often limited, so augmentations help create variations in the data (rotation, flipping, brightness, etc.) to make models more robust.

  • Data Splitting: Properly splitting the dataset into training, validation, and test sets is important for ensuring that the model generalizes well. Validation sets help tune hyperparameters, while test sets evaluate final performance.

  • Task-Specific Requirements: Depending on the type of task (e.g., image classification vs. object detection), dataset augmentation strategies might differ. Object detection, for example, requires that the bounding box annotations remain consistent with the augmented images.

7.3. Introduction to Albumentations#

In LA 2.5, we explored how to manually apply image augmentations using PIL and OpenCV. While these foundational skills are critical for understanding how image transformations work, we now introduce Albumentations, a high-level image augmentation library designed for speed, simplicity, and flexibility.

7.3.1. Why Knowing the Foundations is Still Important#

Understanding how image augmentations work at a lower level provides key benefits:

  1. Detailed Control: Manual augmentation techniques using PIL or OpenCV give you full control over how transformations are applied. This is especially important when handling complex or custom tasks that might not be covered by high-level libraries.

  2. Customization: There are situations where highly specific augmentations are needed, such as when working with custom image formats or non-standard data. Knowing the underlying operations enables you to extend or modify augmentations beyond the capabilities of higher-level tools.

  3. Better Debugging: When a model’s performance suffers from specific augmentations, it’s important to know how they work under the hood. Foundational skills help you troubleshoot issues when high-level libraries behave unexpectedly.

While detailed control is important, for large-scale datasets like those used in marine science (e.g., images from ROVs, underwater drones, or satellite imagery), Albumentations provides a faster, more efficient way to perform bulk augmentations. Let’s explore the syntax and key arguments of some Albumentations augmentations that are particularly useful for marine datasets.


7.3.2. Useful Albumentations Augmentations for Marine Data#

Below is a list of key augmentations that can be used for marine datasets. Each transformation includes its syntax, arguments, and use cases.

Note

Albumentations is conventionally imported as A to avoid the long (and annoying to say) name:

import albumentations as A

7.3.3. Horizontal and Vertical Flips#

A.HorizontalFlip(p=0.5)
A.VerticalFlip(p=0.5)

Here, ‘p’ is the probability of applying the flip.

Use Case: Flipping helps simulate different orientations of marine animals or objects. For example, mobile animals might be seen from different angles due to movement, and stationary organisms and geologic features can benefit from this augmentation because of variations in camera position. Flipping almost always increases variability, making it a very common augmentation.
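All of the transforms in this section share the same call pattern: the transform object is callable, takes the image as a keyword argument, and returns a dictionary. A minimal sketch (the image path is a placeholder):

import albumentations as A
import cv2

# Load an image with OpenCV (path is a placeholder)
image = cv2.imread('example.png')

# Transforms are callable: pass the image as a keyword argument
# and read the result out of the returned dict
flip = A.HorizontalFlip(p=1.0)  # p=1.0 guarantees the flip fires
flipped = flip(image=image)['image']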

7.3.4. Random Rotations#

A.Rotate(limit=45, p=0.7)

Here, ‘limit’ is the maximum rotation angle (in degrees) in both directions, and ‘p’ is the probability of applying the rotation.

Use Case: When marine cameras tilt or rotate due to currents or vehicle movement, applying random rotations can make models more robust to different camera orientations.

7.3.5. Brightness and Contrast Adjustments#

A.RandomBrightnessContrast(p=0.5)

Here, ‘p’ is the probability of applying a random brightness and contrast adjustment.

Use Case: Underwater lighting conditions vary dramatically, especially in still-camera imagery with a brightly illuminated foreground and a much darker background.

7.3.6. Gaussian Blur#

A.GaussianBlur(blur_limit=3, p=0.3)

Here, ‘blur_limit’ is the maximum kernel size for the blur, and ‘p’ is the probability of applying it.

Use Case: Blurring can simulate the effect of water turbidity, where visibility is reduced by suspended particles. This is particularly useful for deep-sea environments or locations with high sediment loads. It can also help with aerial survey data of marine mammals, where bodies partially above and partially below the surface take on a blur-like appearance. Similarly, diffuse flow in hydrothermal imagery can show up as an intense blur, so adding this augmentation helps the model catch classes that are partially obscured by it.

7.3.7. Gaussian Noise#

A.GaussNoise(var_limit=(10.0, 50.0), p=0.5)

Here, ‘var_limit’ is the range of variance for the noise, and ‘p’ is the probability of applying it.

Note

There is no strict upper or lower limit imposed by the function itself; the range is whatever you define. However, extremely large variance values can produce very noisy, potentially unusable images, so it’s good to stick to something like 10-50.

Use Case: Similar to Gaussian blur, adding Gaussian noise simulates image degradation in murky waters or low-light environments. This is important for creating a more realistic training dataset in challenging underwater conditions.

7.3.8. Random Crop#

A.RandomCrop(width=128, height=128, p=0.5)

Here, ‘width’ and ‘height’ are the crop dimensions in pixels, and ‘p’ is the probability of applying the crop.

Use Case: Random cropping can simulate the loss of image data, helping the model learn to focus on partial objects or areas. This is especially useful when the camera cannot capture the entire object due to occlusion or framing issues.

7.3.9. Shift, Scale, Rotate#

A.ShiftScaleRotate(shift_limit=0.1, scale_limit=0.2, rotate_limit=20, p=0.7)

Here, ‘shift_limit’ controls the maximum shift as a fraction of image size, ‘scale_limit’ is the maximum scaling factor, ‘rotate_limit’ is the maximum rotation angle in degrees, and ‘p’ is the probability of applying the transformation.

Use Case: This augmentation is useful for simulating camera movement underwater, where slight shifts and rotations occur due to currents or vehicle navigation. Scaling can help the model handle different sizes of objects, and shifting ensures robustness against changes in object placement.

7.3.10. Random Shadow#

A.RandomShadow(shadow_roi=(0, 0.5, 1, 1), num_shadows_lower=1, num_shadows_upper=2, shadow_dimension=5, p=0.5)

Here, ‘shadow_roi’ is the region of interest for placing shadows (coordinates as a fraction of the image size), ‘num_shadows_lower’ and ‘num_shadows_upper’ control the range for the number of shadows, ‘shadow_dimension’ determines the size of the shadow, and ‘p’ is the probability of applying the shadow.

Use Case: This augmentation can simulate shadows caused by marine structures, plants, or large marine animals. Shadows can introduce varying light conditions, making the model more resilient to different lighting situations in real-world environments.

7.3.11. CLAHE (Contrast Limited Adaptive Histogram Equalization)#

A.CLAHE(clip_limit=4.0, tile_grid_size=(8, 8), p=0.3)

Here, ‘clip_limit’ is the threshold for contrast limiting, ‘tile_grid_size’ is the size of the grid for histogram equalization, and ‘p’ is the probability of applying the augmentation.

Use Case: CLAHE is useful for improving image contrast in underwater environments where lighting can be uneven or dim. It enhances details that may otherwise be missed, particularly in images with low contrast, such as deep-sea or low-light environments.

7.3.12. Resize#

A.Resize(height=180, width=180, p=1.0)

Here, ‘height’ and ‘width’ specify the target dimensions of the image, and ‘p’ is the probability of applying the resize (usually set to 1.0 to ensure resizing is applied to every image).

Use Case: This ensures that all images are resized to a standard size, such as 180x180, making them consistent for training in deep learning models. Resizing is often necessary when working with images of varying resolutions, particularly in datasets with mixed sources like drone imagery or satellite images.
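As a rough sketch, several of the transforms above can be chained with A.Compose (covered in more detail in the batch-processing section below); each transform fires independently according to its own ‘p’:

import albumentations as A

# A sketch of a marine-flavored pipeline built from the transforms above
marine_transform = A.Compose([
    A.HorizontalFlip(p=0.5),
    A.Rotate(limit=45, p=0.7),
    A.RandomBrightnessContrast(p=0.5),
    A.GaussNoise(var_limit=(10.0, 50.0), p=0.3),
    A.Resize(height=180, width=180, p=1.0),  # resize last so the output size is fixed
])

# augmented = marine_transform(image=image)['image']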


7.4. Dataset Splitting and Ensuring Annotation Integrity#

When splitting a dataset, it’s important to ensure that the associated annotation files (such as bounding boxes for object detection or segmentation masks) remain aligned with the correct images after augmentation and splitting. Common dataset splits include:

  • Training set: Typically 70-80% of the data.

  • Validation set: Typically 10-15% of the data for tuning the model.

  • Test set: The final 10-15% for evaluating model performance.

7.4.1. Example: Splitting a Dataset with Associated Annotations#

import os
import shutil
from sklearn.model_selection import train_test_split

# Directory paths
image_dir = '/path/to/images'
annotation_dir = '/path/to/annotations'
train_dir = '/path/to/train'
val_dir = '/path/to/val'
test_dir = '/path/to/test'

# Get image files
images = [f for f in os.listdir(image_dir) if f.endswith('.png')]  # Change extension as needed

# Split dataset: 70% train, then split the remaining 30% evenly into 15% validation and 15% test
train_images, val_test_images = train_test_split(images, test_size=0.3, random_state=42)
val_images, test_images = train_test_split(val_test_images, test_size=0.5, random_state=42)

# Move images and their annotations to respective folders
def move_files(image_list, target_dir):
    # Create the target subfolders if they don't already exist
    os.makedirs(os.path.join(target_dir, 'images'), exist_ok=True)
    os.makedirs(os.path.join(target_dir, 'annotations'), exist_ok=True)

    for image in image_list:
        # Move image
        shutil.move(os.path.join(image_dir, image), os.path.join(target_dir, 'images', image))

        # Move associated annotation (assumes annotation has the same name but different extension)
        annotation_file = image.replace('.png', '.xml')  # Adjust extension based on annotation type
        shutil.move(os.path.join(annotation_dir, annotation_file), os.path.join(target_dir, 'annotations', annotation_file))

# Move the split files
move_files(train_images, train_dir)
move_files(val_images, val_dir)
move_files(test_images, test_dir)
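Before calling move_files, it is worth a quick sanity check that every image actually has a matching annotation. A minimal sketch, assuming the same hypothetical paths and naming convention as above:

# Run before move_files: warn about images with no matching annotation file
missing = [img for img in images
           if not os.path.exists(os.path.join(annotation_dir, img.replace('.png', '.xml')))]
if missing:
    print(f'Warning: {len(missing)} images lack annotations, e.g. {missing[:5]}')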

7.5. Batch Processing and Custom Augmentations#

Once the dataset is split, you can apply batch augmentations to increase the diversity of the dataset. Depending on the task (classification, object detection, segmentation), certain augmentations may be more appropriate than others.

7.5.1. Example: Batch Augmentations with Albumentations (OpenCV backend)#

import albumentations as A
import cv2
import os

# Define augmentation pipeline
# Note: A.Normalize and ToTensorV2 belong in a pipeline that feeds a model
# directly; they produce standardized float tensors that cv2.imwrite cannot
# save as a normal image, so they are left out of this save-to-disk pipeline.
transform = A.Compose([
    A.HorizontalFlip(p=0.5),
    A.Rotate(limit=45, p=0.7),
    A.RandomBrightnessContrast(p=0.3),
    A.ShiftScaleRotate(shift_limit=0.1, scale_limit=0.2, rotate_limit=25, p=0.5),
    A.RandomCrop(width=128, height=128),
])

# Batch process images in a folder
input_dir = '/path/to/train/images'
output_dir = '/path/to/augmented/images'

def augment_images(input_dir, output_dir, transform):
    os.makedirs(output_dir, exist_ok=True)

    for image_file in os.listdir(input_dir):
        image_path = os.path.join(input_dir, image_file)
        image = cv2.imread(image_path)
        if image is None:  # Skip files OpenCV cannot read
            continue

        # Apply augmentation
        augmented = transform(image=image)['image']

        # Save augmented image (still a uint8 NumPy array, so it can be written directly)
        output_path = os.path.join(output_dir, image_file)
        cv2.imwrite(output_path, augmented)

augment_images(input_dir, output_dir, transform)

In this example, we use Albumentations, a fast and flexible image augmentation library, to apply various transformations to batches of images. The augmentation pipeline includes horizontal flipping, random rotations, brightness and contrast adjustments, and random cropping.


7.6. Customizing Augmentations Based on Task#

Different computer vision tasks require specific augmentations. Here are some augmentation strategies for common tasks:

7.6.1. Object Detection#

When working on object detection tasks, it’s important to ensure that the bounding boxes are adjusted appropriately with the image augmentations.

7.6.1.1. Example: Augmenting Object Detection Data with Bounding Boxes#

transform = A.Compose([
    A.HorizontalFlip(p=0.5),
    A.RandomBrightnessContrast(p=0.3),
    A.Rotate(limit=45, p=0.7),
], bbox_params=A.BboxParams(format='pascal_voc', label_fields=['class_labels']))

# Example of augmenting an image with a bounding box
image = cv2.imread('/path/to/image.png')
bboxes = [[100, 150, 200, 250]]  # Example bounding box in PASCAL VOC format
class_labels = ['crab']

# Apply the augmentations
augmented = transform(image=image, bboxes=bboxes, class_labels=class_labels)
aug_image = augmented['image']
aug_bboxes = augmented['bboxes']

Here, Albumentations ensures that bounding boxes are modified along with the image, keeping the spatial relationships intact. The bbox_params argument specifies that we are using Pascal VOC format for bounding boxes.
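For reference, here is the same hypothetical 100x100-pixel box, in a 400x300 (width x height) image, expressed in the three most common bounding box formats Albumentations accepts:

# The same hypothetical box in three common formats (image is 400x300)
pascal_voc = [100, 150, 200, 250]        # [x_min, y_min, x_max, y_max] in pixels
coco = [100, 150, 100, 100]              # [x_min, y_min, width, height] in pixels
yolo = [0.375, 0.6667, 0.25, 0.3333]     # [x_center, y_center, width, height], normalized to 0-1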

7.6.2. Image Segmentation#

For segmentation tasks, it’s crucial that segmentation masks undergo the same augmentations as the corresponding images.

7.6.2.1. Example: Augmenting Segmentation Data#

transform = A.Compose([
    A.HorizontalFlip(p=0.5),
    A.Rotate(limit=45, p=0.7),
    A.RandomBrightnessContrast(p=0.3),
])

# Augment both image and mask
image = cv2.imread('/path/to/image.png')
mask = cv2.imread('/path/to/mask.png', 0)  # Load mask as grayscale

# Apply the augmentations to both image and mask
augmented = transform(image=image, mask=mask)
aug_image = augmented['image']
aug_mask = augmented['mask']

In this case, both the image and its corresponding segmentation mask are augmented together to ensure that the mask still matches the transformed image.
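A quick way to convince yourself the pair stayed aligned is to write both outputs to disk and inspect them side by side; a minimal sketch (output paths are placeholders):

# Save the augmented pair for visual inspection (paths are placeholders)
cv2.imwrite('/path/to/aug_image.png', aug_image)
cv2.imwrite('/path/to/aug_mask.png', aug_mask)

# The spatial dimensions should still match exactly
assert aug_image.shape[:2] == aug_mask.shape[:2]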

7.6.3. Image Classification#

For classification tasks, standard augmentations like random cropping, flipping, and brightness/contrast adjustments are useful to improve model generalization.

7.6.3.1. Example: Augmenting Image Classification Data#

transform = A.Compose([
    A.RandomCrop(width=128, height=128),
    A.HorizontalFlip(p=0.5),
    A.RandomBrightnessContrast(p=0.3),
    A.Rotate(limit=45, p=0.7),
])

# Apply augmentations to the image
image = cv2.imread('/path/to/image.png')
aug_image = transform(image=image)['image']

# Save augmented image
cv2.imwrite('/path/to/augmented_image.png', aug_image)

For classification tasks, augmentations focus on changing the appearance and orientation of the image to help the model learn diverse features.


7.7. Interactive Activity: Augmenting a Dataset of Crabs and Fish#

In this activity, you will create an augmented dataset using randomcrab.zip and randomfish.zip. Each ZIP file contains 35 images (180x180 pixels) of crabs and fish, respectively. Your task is to use image augmentations to expand the dataset to 400 images of crabs and 400 images of fish, ensuring the output images maintain the same dimensions (180x180) and keep the original file names.

You’ll be provided with starter code that loads the images and applies basic augmentations. Your job is to customize the augmentation pipeline and generate the augmented images.


7.8. Instructions#

  1. Download and extract the ZIP files (randomcrab.zip and randomfish.zip).

  2. Augment each dataset: Your goal is to generate augmented images for each class (crabs and fish) by applying various transformations (rotation, brightness, flips, etc.).

  3. Ensure consistency: Each output image must:

    • Retain its original file name; as a bonus, try appending a suffix to each file name to indicate the image has been augmented.

    • Be the same size as the original (180x180).

  4. Save the augmented images in a directory called augmented_data/crabs for crabs and augmented_data/fish for fish.


7.9. Starter Code#

7.9.1. Extracting and Loading Images#

import zipfile
import os
import cv2
import albumentations as A

# Paths to the ZIP files
crab_zip = 'randomcrab.zip'
fish_zip = 'randomfish.zip'

# Extract ZIP files
def extract_zip(file, extract_path):
    with zipfile.ZipFile(file, 'r') as zip_ref:
        zip_ref.extractall(extract_path)

# Extract the crab and fish images
extract_zip(crab_zip, 'crabs/')
extract_zip(fish_zip, 'fish/')

# Load the images
crab_images = [os.path.join('crabs/', f) for f in os.listdir('crabs/') if f.endswith('.png')]
fish_images = [os.path.join('fish/', f) for f in os.listdir('fish/') if f.endswith('.png')]

7.9.2. Defining the Augmentation Pipeline#

Now, define an augmentation pipeline to apply transformations. Choose the augmentations that you think make the most sense for this dataset.

# Define the augmentation pipeline
augmentations = A.Compose([
#    A.HorizontalFlip(p=0.5),
#    A.Rotate(limit=30, p=0.7),
    # Add the augmentations you chose here
])

# Create the output directories
os.makedirs('augmented_data/crabs', exist_ok=True)
os.makedirs('augmented_data/fish', exist_ok=True)


# Function to apply augmentations and save images
def augment_and_save(image_path, output_dir):
    # Read the image
    image = cv2.imread(image_path)
    
    # Apply augmentations
    augmented = augmentations(image=image)['image']
    
    # Get the base filename (e.g., '1.png')
    base_filename = os.path.basename(image_path)
    
    # Save the augmented image to the output directory
    output_path = os.path.join(output_dir, base_filename)
    cv2.imwrite(output_path, augmented)

# Apply augmentations to crab images
for crab_image in crab_images:
    augment_and_save(crab_image, 'augmented_data/crabs')

# Apply augmentations to fish images
for fish_image in fish_images:
    augment_and_save(fish_image, 'augmented_data/fish')
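Since each class has only 35 originals, reaching 400 images means generating multiple augmented copies per source image (about 12 each, since 35 x 12 = 420). One hedged sketch of how you might extend the starter code to do this, appending a suffix so the original file names stay traceable (the helper name and suffix scheme are suggestions, not part of the starter code):

# Hypothetical helper: generate several augmented copies per source image
def augment_to_target(image_paths, output_dir, target_count):
    copies_per_image = -(-target_count // len(image_paths))  # ceiling division (~12 for 35 -> 400)
    saved = 0
    for image_path in image_paths:
        image = cv2.imread(image_path)
        name, ext = os.path.splitext(os.path.basename(image_path))
        for i in range(copies_per_image):
            if saved >= target_count:
                return
            augmented = augmentations(image=image)['image']
            # Suffix marks the file as augmented (the bonus requirement)
            cv2.imwrite(os.path.join(output_dir, f'{name}_aug{i}{ext}'), augmented)
            saved += 1

# augment_to_target(crab_images, 'augmented_data/crabs', 400)
# augment_to_target(fish_images, 'augmented_data/fish', 400)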