Computer Vision — ZYXXYZ's Whymzykal Wunderland

Image Fundamentals

Pixel & Raster Images ▶

A pixel (picture element) is the smallest addressable unit of a digital image — a single square of colour stored as numeric values. A raster image is a grid of pixels arranged in rows and columns; its quality is fixed by resolution. In contrast, vector images are mathematical descriptions of shapes and scale infinitely. Almost all CV algorithms operate on raster data.

Color Spaces — RGB, HSV, LAB ▶

RGB stores colour as additive Red, Green, Blue intensities (0–255 each at 8 bits per channel). It mirrors how screens emit light but doesn't match how humans perceive colour. HSV (Hue, Saturation, Value) separates colour identity from brightness, which is useful for colour-based segmentation. CIE LAB is an approximately perceptually uniform space in which equal numeric distances correspond roughly to equal perceived colour differences, making it well suited to colour comparison algorithms.
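As a small illustration of how HSV separates colour identity from brightness, the sketch below converts 8-bit RGB values with Python's standard `colorsys` module. The helper name `rgb8_to_hsv` and the choice to report hue in degrees are assumptions made for this example.

```python
import colorsys

def rgb8_to_hsv(r, g, b):
    """Convert 8-bit RGB (0-255) to HSV: hue in degrees, S and V in [0, 1]."""
    # colorsys expects floats in [0, 1] and returns hue as a fraction of a turn.
    h, s, v = colorsys.rgb_to_hsv(r / 255.0, g / 255.0, b / 255.0)
    return h * 360.0, s, v

bright_red = rgb8_to_hsv(255, 0, 0)
dark_red = rgb8_to_hsv(128, 0, 0)
pure_green = rgb8_to_hsv(0, 255, 0)

# Both reds share the same hue and saturation; only V (brightness) differs.
print(bright_red, dark_red, pure_green)
```

Because the two reds differ only in V, a segmentation rule written on hue alone would treat them as the same colour, which is harder to express directly in RGB.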

Bit Depth & Dynamic Range ▶

Bit depth is the number of bits used per channel per pixel. Standard images use 8 bits per channel (0–255, 256 values). 16-bit images (0–65535) are used in medical and scientific imaging. Dynamic range is the ratio between the brightest and darkest representable values. HDR (High Dynamic Range) imaging extends this using floating-point values, enabling rendering of sunlit outdoor and dim indoor scenes without clipping.

Sampling, Resolution & Aliasing ▶

Spatial resolution describes how many pixels cover a given area. The Nyquist–Shannon sampling theorem states that to represent a signal accurately, you must sample at more than twice the highest spatial frequency in the scene. Sampling below this rate causes aliasing: high frequencies masquerade as lower ones, producing moiré patterns on fine textures and the familiar staircase artefact on diagonal lines. Anti-aliasing applies a low-pass filter (a slight blur) before downsampling to remove the frequencies that would otherwise alias.
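The effect of sampling below the Nyquist rate can be shown in one dimension. In this sketch (function name and numbers are illustrative assumptions), a pattern with 9 cycles per unit length is sampled at 10 samples per unit, well under the required rate of more than 18, and the samples turn out to be identical to those of a 1-cycle pattern.

```python
import math

def sample_cosine(freq, sample_rate, n_samples):
    """Sample cos(2*pi*freq*t) at t = k / sample_rate for k = 0..n_samples-1."""
    return [math.cos(2 * math.pi * freq * k / sample_rate)
            for k in range(n_samples)]

sample_rate = 10.0                            # Nyquist limit is sample_rate / 2 = 5
high = sample_cosine(9.0, sample_rate, 20)    # above the limit: will alias
low = sample_cosine(1.0, sample_rate, 20)     # below the limit: represented faithfully

# The two sample sequences match, so the 9-cycle pattern is
# indistinguishable from a 1-cycle pattern after sampling.
max_difference = max(abs(a - b) for a, b in zip(high, low))
print(max_difference)
```

Once the samples are taken, no later processing can tell the two patterns apart, which is why the blur has to happen before downsampling rather than after.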

Grayscale Conversion ▶

Converting a colour image to grayscale collapses three channels (R, G, B) into one luminance value. A naive average of the three channels misjudges perceived brightness because the human eye is far more sensitive to green than to red or blue. The standard luminosity formula (the ITU-R BT.601 luma weights) is: Y = 0.299R + 0.587G + 0.114B. This weighted sum approximates the perceived brightness of the colour. Most edge and feature detection algorithms operate on grayscale images.
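A minimal sketch comparing the weighted formula with a naive average, using pure green and pure blue as test colours. The function names are chosen for this example only.

```python
def luminance(r, g, b):
    """Weighted grayscale value using Y = 0.299R + 0.587G + 0.114B."""
    return 0.299 * r + 0.587 * g + 0.114 * b

def naive_average(r, g, b):
    """Unweighted mean of the three channels."""
    return (r + g + b) / 3.0

green = (0, 255, 0)
blue = (0, 0, 255)

# The naive average maps both colours to the same gray level,
# while the weighted formula keeps green much brighter than blue.
print(naive_average(*green), naive_average(*blue))
print(luminance(*green), luminance(*blue))
```

The weights sum to 1, so a pure white pixel (255, 255, 255) still maps to 255.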

Image Processing

Convolution & Kernels ▶

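A convolution slides a small matrix called a kernel (or filter) across an image. At each position, it multiplies each kernel weight by the underlying pixel value and sums the results, writing that sum to the corresponding centre position in the output image (the input is left unmodified, so later positions still read the original values). Strictly, convolution flips the kernel before sliding it; many libraries and CNN frameworks skip the flip and compute cross-correlation, which gives the same result for symmetric kernels. The kernel encodes the operation: a uniform kernel blurs, a difference-of-neighbours kernel detects edges. Convolution is the fundamental operation in both classical image processing and deep CNNs.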
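The sliding multiply-and-sum can be sketched in plain Python on a list-of-lists image. This version is an illustration under stated assumptions: it computes only the "valid" region (no padding), flips the kernel so that it is a true convolution rather than a cross-correlation, and makes no attempt at speed.

```python
def convolve2d(image, kernel):
    """'Valid' 2D convolution of a list-of-lists image with a small kernel."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    # Flip the kernel in both axes; without this step the loop below
    # would compute cross-correlation instead of convolution.
    flipped = [row[::-1] for row in kernel[::-1]]
    out = [[0.0] * out_w for _ in range(out_h)]
    for y in range(out_h):
        for x in range(out_w):
            acc = 0.0
            for j in range(kh):
                for i in range(kw):
                    acc += flipped[j][i] * image[y + j][x + i]
            out[y][x] = acc   # written to a separate output image
    return out

# A uniform 3x3 kernel (box blur) applied to a bright 2x2 square.
box = [[1 / 9] * 3 for _ in range(3)]
image = [
    [0, 0, 0, 0],
    [0, 9, 9, 0],
    [0, 9, 9, 0],
    [0, 0, 0, 0],
]
blurred = convolve2d(image, box)
print(blurred)
```

Every 3×3 window of this image contains all four bright pixels, so each output value is their sum divided by nine.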

Gaussian Blur & Noise Reduction ▶

A Gaussian blur convolves the image with a Gaussian kernel — weights that follow the bell-curve distribution centred at the kernel's origin. Pixels near the centre contribute more than those at the edges, producing a smooth, natural-looking blur. The sigma (σ) parameter controls how wide the bell is and therefore how blurry the result. Blurring is used to suppress high-frequency noise before edge detection, so algorithms don't respond to every pixel-level fluctuation.
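To make the role of σ concrete, here is a sketch that builds a normalised 1D Gaussian kernel. Truncating the kernel at a radius of about 3σ is a common convention assumed here, not something the text above prescribes; a 2D blur can then be obtained by applying the 1D kernel along rows and then along columns, since the Gaussian is separable.

```python
import math

def gaussian_kernel_1d(sigma, radius=None):
    """Return normalised Gaussian weights for offsets -radius..radius."""
    if radius is None:
        # Assumed convention: cut the bell off at roughly three sigma.
        radius = max(1, int(math.ceil(3 * sigma)))
    weights = [math.exp(-(x * x) / (2 * sigma * sigma))
               for x in range(-radius, radius + 1)]
    total = sum(weights)
    # Normalise so the blur does not change overall image brightness.
    return [w / total for w in weights]

kernel = gaussian_kernel_1d(sigma=2.0)
print(len(kernel), kernel[len(kernel) // 2])
```

A larger σ spreads the weight further from the centre, so the centre weight shrinks and the result is blurrier.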

Edge Detection — Sobel & Canny ▶

Edges occur where pixel intensity changes rapidly. The Sobel operator applies two kernels to estimate the horizontal (Gx) and vertical (Gy) image gradients; edge strength is √(Gx²+Gy²). The Canny algorithm adds three more steps: Gaussian smoothing, non-maximum suppression (thinning edges to 1 pixel wide), and hysteresis thresholding (using high and low thresholds to follow weak edges connected to strong ones). Canny produces clean, thin, connected edges.
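The gradient-magnitude step can be sketched as follows, using the usual 3×3 Sobel kernels. This is only the Sobel part, not Canny: there is no smoothing, thinning, or thresholding. The kernels are applied without flipping (cross-correlation), which affects only the sign of Gx and Gy, and border pixels are simply left at zero; both are simplifying assumptions.

```python
import math

SOBEL_X = [[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]]   # responds to horizontal change
SOBEL_Y = [[-1, -2, -1], [0, 0, 0], [1, 2, 1]]   # responds to vertical change

def sobel_magnitude(image):
    """Edge strength sqrt(Gx^2 + Gy^2) for every interior pixel."""
    h, w = len(image), len(image[0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            gx = gy = 0.0
            for j in range(3):
                for i in range(3):
                    p = image[y + j - 1][x + i - 1]
                    gx += SOBEL_X[j][i] * p
                    gy += SOBEL_Y[j][i] * p
            out[y][x] = math.hypot(gx, gy)
    return out

# A vertical step edge: dark on the left, bright on the right.
image = [[0, 0, 255, 255] for _ in range(4)]
edges = sobel_magnitude(image)
print(edges[1])
```

For this image the intensity changes only from left to right, so Gy is zero and the whole response comes from Gx.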

Thresholding & Binarisation ▶

Thresholding converts a grayscale image to binary (black/white) by classifying each pixel as foreground or background. A global threshold T sets: pixel = 255 if intensity > T, else 0. Otsu's method automatically selects T by finding the value that minimises intra-class variance. Adaptive thresholding computes a different threshold for each pixel based on its local neighbourhood — essential for uneven lighting conditions such as document scanning.
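Otsu's selection of T can be sketched with a histogram sweep. The version below maximises between-class variance, which is equivalent to minimising intra-class variance; it assumes 8-bit integer pixel values supplied as a flat list, and the function names are illustrative.

```python
def otsu_threshold(pixels):
    """Return T such that the classes {p <= T} and {p > T} are best separated."""
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1
    total = len(pixels)
    sum_all = sum(i * count for i, count in enumerate(hist))

    best_t, best_between = 0, -1.0
    w_bg, sum_bg = 0, 0
    for t in range(256):
        w_bg += hist[t]
        sum_bg += t * hist[t]
        w_fg = total - w_bg
        if w_bg == 0:
            continue          # no background pixels yet
        if w_fg == 0:
            break             # everything is background from here on
        mean_bg = sum_bg / w_bg
        mean_fg = (sum_all - sum_bg) / w_fg
        between = w_bg * w_fg * (mean_bg - mean_fg) ** 2
        if between > best_between:
            best_between, best_t = between, t
    return best_t

def binarise(pixels, t):
    """Global threshold: 255 if intensity > T, else 0."""
    return [255 if p > t else 0 for p in pixels]

pixels = [10, 12, 11, 13, 200, 205, 198, 202]   # dark cluster and bright cluster
t = otsu_threshold(pixels)
binary = binarise(pixels, t)
print(t, binary)
```

Any T between the two clusters separates them equally well; this sketch keeps the first such value it encounters.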

Morphological Operations ▶

Morphological operations process binary or grayscale images using a structuring element (a small shape like a 3×3 square or disk). Erosion shrinks foreground regions by keeping only pixels where the structuring element fits entirely within the foreground. Dilation expands them. Opening (erosion then dilation) removes small noise blobs. Closing (dilation then erosion) fills small holes. These are used in pre/post-processing pipelines for object segmentation.
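The four operations can be sketched on a binary image stored as lists of 0s and 1s. The assumptions here are a square structuring element of side 2r+1 and that anything outside the image counts as background; helper names are made up for this example.

```python
def _window(image, y, x, r):
    """Values under the square element centred at (y, x); outside counts as 0."""
    h, w = len(image), len(image[0])
    return [image[j][i] if 0 <= j < h and 0 <= i < w else 0
            for j in range(y - r, y + r + 1)
            for i in range(x - r, x + r + 1)]

def erode(image, r=1):
    """Keep a pixel only if the element fits entirely inside the foreground."""
    return [[int(all(_window(image, y, x, r))) for x in range(len(image[0]))]
            for y in range(len(image))]

def dilate(image, r=1):
    """Set a pixel if the element touches any foreground pixel."""
    return [[int(any(_window(image, y, x, r))) for x in range(len(image[0]))]
            for y in range(len(image))]

def opening(image, r=1):
    return dilate(erode(image, r), r)

def closing(image, r=1):
    return erode(dilate(image, r), r)

def blank(n):
    return [[0] * n for _ in range(n)]

# A 3x3 foreground block plus one isolated noise pixel.
noisy = blank(7)
for y in range(2, 5):
    for x in range(2, 5):
        noisy[y][x] = 1
noisy[0][6] = 1
opened = opening(noisy)

# The same block with a one-pixel hole in the middle.
holed = blank(7)
for y in range(2, 5):
    for x in range(2, 5):
        holed[y][x] = 1
holed[3][3] = 0
closed = closing(holed)
```

In this example opening removes the stray pixel while restoring the block, and closing fills the hole without growing the block.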

Feature Detection

Corners & Interest Points — Harris, FAST ▶

An interest point (keypoint) is a location that can be reliably detected across images of the same scene under different conditions. Corner detection looks for regions with large intensity variation in multiple directions. The Harris detector computes a second-moment matrix at each pixel; its eigenvalues reveal whether the point is flat, an edge, or a corner. FAST (Features from Accelerated Segment Test) detects corners by comparing a circle of 16 pixels to the centre — much faster than Harris, enabling real-time use.

Scale Space & Blob Detection — LoG, DoG ▶

Features must be detected at the right scale. The Laplacian of Gaussian (LoG) finds blobs (roughly round regions) by applying the second-derivative Laplacian after Gaussian smoothing; its scale-normalised response peaks when σ matches the blob size (σ ≈ r/√2 for a blob of radius r). The Difference of Gaussians (DoG) approximates LoG more efficiently by subtracting two Gaussian-blurred images at adjacent scales. SIFT uses DoG to build a scale-space pyramid and find scale-invariant keypoints.

SIFT — Scale-Invariant Feature Transform ▶

SIFT detects keypoints that are invariant to scale and rotation. After finding scale-space extrema via DoG, each keypoint is assigned a dominant gradient orientation to achieve rotation invariance. A 128-dimensional descriptor is computed from a 4×4 grid of 8-bin orientation histograms around the keypoint. Because the descriptor is based on gradient orientations rather than raw intensities, it is also robust to lighting changes. SIFT descriptors are matched using nearest-neighbour search in descriptor space.

HOG — Histogram of Oriented Gradients ▶

HOG describes local shape by building histograms of gradient orientations across cells (small spatial regions, typically 8×8 pixels). Each cell produces an orientation histogram with typically 9 bins (0°–180°). Cells are grouped into blocks (2×2 cells), and the histograms within a block are normalised together for contrast invariance. The concatenation of all block descriptors forms the HOG feature vector. Combined with a linear SVM, HOG+SVM was the state-of-the-art for pedestrian detection before deep learning.

Feature Matching & Homography ▶

After computing descriptors, feature matching finds corresponding keypoints across images using nearest-neighbour search. Lowe's ratio test rejects ambiguous matches: a match is kept only if the nearest-neighbour distance is less than a fixed fraction (commonly 0.75) of the second-nearest-neighbour distance. From the surviving matches, RANSAC (Random Sample Consensus) estimates a geometric model (such as a homography or an essential matrix) while robustly rejecting outliers caused by false correspondences.
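The ratio test can be sketched with a brute-force nearest-neighbour search. The descriptors below are tiny 2D tuples invented for illustration (real SIFT descriptors have 128 dimensions), Euclidean distance is assumed, and the RANSAC stage is not shown.

```python
import math

def ratio_test_matches(desc_a, desc_b, ratio=0.75):
    """Return (index_a, index_b) pairs that pass the nearest/second-nearest test."""
    matches = []
    for i, da in enumerate(desc_a):
        # Distances from this descriptor to every candidate, nearest first.
        dists = sorted((math.dist(da, db), j) for j, db in enumerate(desc_b))
        if len(dists) < 2:
            continue   # the test needs a second-nearest neighbour
        (d1, j1), (d2, _) = dists[0], dists[1]
        if d1 < ratio * d2:
            matches.append((i, j1))
    return matches

desc_a = [(0.0, 0.0), (5.0, 5.0)]
desc_b = [(0.1, 0.0), (4.0, 5.0), (6.0, 5.0)]

# desc_a[0] has one clearly closest candidate and is kept.
# desc_a[1] sits exactly between two candidates, so it is rejected as ambiguous.
matches = ratio_test_matches(desc_a, desc_b)
print(matches)
```

Raising `ratio` towards 1.0 keeps more matches at the cost of more false correspondences for the later geometric estimation to reject.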

Object Detection

Sliding Window & Image Pyramids ▶

The earliest object detectors swept a fixed-size window across every position in the image and ran a classifier at each location. To handle objects at different sizes, the image was downscaled repeatedly to form an image pyramid, and the window was run at each scale. Despite its conceptual simplicity, the sliding window approach was computationally expensive (millions of windows per image) and has largely been superseded by CNN-based detectors that use region proposals or single-pass prediction, though it remains important for understanding modern detection pipelines.

R-CNN Family — Region Proposal Networks ▶

R-CNN (2014) used Selective Search to generate ~2000 region proposals, warped each to a fixed size, ran a CNN feature extractor, and classified with SVMs — too slow for real-time use. Fast R-CNN ran the CNN once over the whole image and used ROI pooling to extract features for each proposal from the feature map. Faster R-CNN replaced Selective Search with a Region Proposal Network sharing the CNN backbone, making the full pipeline end-to-end trainable and near-real-time.

YOLO — You Only Look Once ▶

YOLO reframes detection as a single regression problem. The image is divided into an S×S grid; each cell predicts B bounding boxes with their confidence scores, along with C class probabilities, all in one forward pass. This gives YOLO its signature speed (real-time at 45+ FPS in early versions). Later versions added anchor boxes tuned to the dataset (from YOLOv2), multi-scale prediction heads (from YOLOv3), and CSP-based backbones (from YOLOv4), progressively closing the accuracy gap with two-stage detectors while maintaining speed. Some recent releases, such as YOLOv8, drop anchors again in favour of anchor-free heads.

Anchor Boxes & Non-Maximum Suppression ▶

Anchor boxes are pre-defined bounding boxes of various aspect ratios and sizes tiled across the feature map. The network predicts offsets from these anchors rather than absolute coordinates, making optimisation easier. Multiple detections often overlap the same object; NMS resolves this by sorting all detections by confidence, keeping the highest-confidence box, and suppressing any remaining box whose IoU with the kept box exceeds a threshold (typically 0.5). The process repeats until all boxes are either kept or suppressed.
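The greedy NMS loop described above can be sketched as follows. Boxes are assumed to be `(x1, y1, x2, y2)` tuples with `x2 > x1` and `y2 > y1`; the `iou` helper computes intersection area divided by union area, and all names are illustrative.

```python
def iou(a, b):
    """Overlap ratio of two (x1, y1, x2, y2) boxes: intersection / union."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def nms(boxes, scores, iou_threshold=0.5):
    """Return indices of the boxes kept by greedy non-maximum suppression."""
    # Sort candidate indices by confidence, highest first.
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)
        keep.append(best)
        # Suppress every remaining box that overlaps the kept box too much.
        order = [i for i in order
                 if iou(boxes[best], boxes[i]) <= iou_threshold]
    return keep

boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
scores = [0.9, 0.8, 0.7]
kept = nms(boxes, scores)
print(kept)
```

Here the second box overlaps the first with an IoU of 81/119 (about 0.68), so it is suppressed, while the distant third box survives. In practice NMS is usually run separately for each class.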

IoU & mAP — Evaluation Metrics ▶

Intersection over Union (IoU) measures overlap between a predicted bounding box and the ground truth: IoU = area(A∩B) / area(A∪B). A detection with IoU ≥ 0.5 is typically counted as a true positive. Average Precision (AP) summarises the precision-recall curve for a single class. mAP (mean AP) averages AP over all classes and is the standard benchmark metric for detection datasets like COCO and Pascal VOC. COCO's primary mAP metric additionally averages over ten IoU thresholds from 0.5 to 0.95 in steps of 0.05.
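The step from per-class AP to mAP can be sketched as below. This uses the simple non-interpolated form of AP (the mean of the precision values at each true positive, divided by the number of ground-truth objects); Pascal VOC and COCO use interpolated variants, so treat this as an approximation of the idea rather than either benchmark's exact protocol. The class names and detection flags are invented for the example.

```python
def average_precision(is_true_positive, num_ground_truth):
    """Non-interpolated AP for one class.

    is_true_positive: one flag per detection, sorted by descending confidence.
    num_ground_truth: number of ground-truth objects of this class.
    """
    if num_ground_truth == 0:
        return 0.0
    tp_so_far = 0
    precision_sum = 0.0
    for rank, hit in enumerate(is_true_positive, start=1):
        if hit:
            tp_so_far += 1
            precision_sum += tp_so_far / rank   # precision at this recall step
    return precision_sum / num_ground_truth

def mean_average_precision(ap_per_class):
    """mAP: the unweighted mean of the per-class AP values."""
    return sum(ap_per_class.values()) / len(ap_per_class)

ap = {
    "cat": average_precision([True, False, True], num_ground_truth=2),
    "dog": average_precision([True, True], num_ground_truth=2),
}
map_score = mean_average_precision(ap)
print(ap, map_score)
```

The false positive ranked between the two correct cat detections lowers the precision at the second hit to 2/3, which is what pulls the cat AP below 1.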

Deep Learning for CV

Convolutional Neural Network (CNN) ▶

CNNs learn image representations by stacking convolutional layers. Each layer applies learned kernels to produce feature maps that detect progressively complex patterns: early layers detect edges and textures; deeper layers detect parts and objects. The key insight is weight sharing — the same kernel is applied across the entire image, vastly reducing parameters compared to fully-connected layers. The spatial hierarchy of learned features is what makes CNNs so effective at visual tasks.

Pooling Layers & Downsampling ▶

Max pooling takes the maximum value within a sliding window (typically 2×2 with stride 2), halving the spatial dimensions. It achieves two goals: downsampling the feature map to increase the receptive field of subsequent layers, and providing a degree of translation invariance (a feature detected one pixel to the left still fires). Global average pooling collapses the entire spatial map to a single value per channel and is used before the final classification layer in modern architectures.
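Both pooling operations can be sketched on a single-channel feature map stored as a list of lists. The function names and the 4×4 example values are assumptions for illustration.

```python
def max_pool2d(fmap, size=2, stride=2):
    """Maximum over each size x size window, moving by `stride`."""
    h, w = len(fmap), len(fmap[0])
    return [[max(fmap[y + j][x + i] for j in range(size) for i in range(size))
             for x in range(0, w - size + 1, stride)]
            for y in range(0, h - size + 1, stride)]

def global_average_pool(fmap):
    """Collapse the whole spatial map to a single value."""
    return sum(sum(row) for row in fmap) / (len(fmap) * len(fmap[0]))

fmap = [
    [1, 3, 2, 0],
    [4, 2, 1, 1],
    [0, 0, 5, 6],
    [1, 2, 7, 8],
]
pooled = max_pool2d(fmap)            # 4x4 input becomes 2x2
summary = global_average_pool(fmap)
print(pooled, summary)
```

Moving the 4 in the top-left window to any other position inside that window leaves the pooled output unchanged, which is the small translation invariance described above.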

ResNet & Skip Connections ▶

As networks got deeper (>20 layers), training became difficult: the gradient signal can shrink as it back-propagates through many layers, and simply stacking more layers was observed to raise training error rather than lower it (the degradation problem). ResNet (He et al., 2015) addressed this with residual connections: the output of a block is F(x) + x rather than just F(x). This identity shortcut lets gradients flow directly to earlier layers and makes it easy for a block to fall back to the identity mapping. ResNets enabled training networks with 50, 101, even 152 layers, achieving unprecedented accuracy. Skip connections are now ubiquitous in modern architectures.
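The F(x) + x structure can be sketched without any deep learning framework. Here `f` stands in for the block's learned layers and is just an ordinary Python function; `zero_branch` and `relu_half` are hypothetical stand-ins invented for the example.

```python
def residual_block(x, f):
    """Return F(x) + x element-wise; f must preserve the length of x."""
    fx = f(x)
    if len(fx) != len(x):
        raise ValueError("F(x) must have the same shape as x for the shortcut")
    return [a + b for a, b in zip(fx, x)]

def zero_branch(x):
    """A branch that has learned nothing yet: outputs all zeros."""
    return [0.0 for _ in x]

def relu_half(x):
    """A toy branch: scale by 0.5, then apply ReLU."""
    return [max(0.0, 0.5 * v) for v in x]

x = [1.0, -2.0, 3.0]
identity_out = residual_block(x, zero_branch)   # the block passes x through unchanged
mixed_out = residual_block(x, relu_half)        # the branch adds a correction to x
print(identity_out, mixed_out)
```

When the branch outputs zeros the block is exactly the identity, so adding such a block cannot make the representation worse; the branch only needs to learn a correction on top of x.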

Transfer Learning & Fine-Tuning ▶

Training a deep CNN from scratch requires millions of labelled examples and days of GPU compute. Transfer learning instead starts from a model pretrained on a large dataset (typically ImageNet with 1.2M images and 1000 classes) and adapts it to a new task. Fine-tuning means unfreezing some or all pretrained layers and training them at a low learning rate on the new data. The pretrained layers have already learned general visual features (edges, textures, shapes) that transfer well across many visual domains.

Data Augmentation & Regularisation ▶

Data augmentation artificially expands the training set by applying random transformations — horizontal flips, rotation, scaling, colour jitter, random crops — so the model sees more variety without collecting new data. Dropout randomly zeros activations during training, forcing the network to learn redundant representations and preventing co-adaptation. Batch normalisation normalises activations across the mini-batch to have zero mean and unit variance, stabilising training and allowing higher learning rates.
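The normalisation step of batch normalisation can be sketched for a single feature across a mini-batch. The learnable scale `gamma` and shift `beta`, and the small `eps` added for numerical stability, are standard parts of the technique that the paragraph above does not spell out; running statistics for inference are omitted.

```python
import math

def batch_norm(values, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalise one feature over a mini-batch, then scale and shift."""
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return [gamma * (v - mean) / math.sqrt(var + eps) + beta for v in values]

batch = [2.0, 4.0, 6.0, 8.0]          # one activation across four examples
normalised = batch_norm(batch)

out_mean = sum(normalised) / len(normalised)
out_var = sum((v - out_mean) ** 2 for v in normalised) / len(normalised)
print(out_mean, out_var)
```

With the default `gamma` and `beta` the output has mean close to zero and variance close to one; during training the network is free to learn other values for the two parameters.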

Advanced Topics

Semantic Segmentation ▶

Semantic segmentation assigns a class label to every pixel in the image — not just bounding boxes. The Fully Convolutional Network (FCN) was the first to do this end-to-end by replacing the fully-connected classification head with convolutional layers and using transposed convolutions to upsample back to full resolution. U-Net (widely used in medical imaging) adds skip connections between the contracting encoder path and the expanding decoder path, recovering fine spatial detail lost during downsampling.

Instance Segmentation — Mask R-CNN ▶

While semantic segmentation classifies pixels by class, instance segmentation distinguishes individual object instances — "person #1" vs "person #2". Mask R-CNN extends Faster R-CNN with a parallel branch that predicts a binary segmentation mask for each detected region. It introduces ROI Align (replacing ROI Pooling) to avoid quantisation artefacts when mapping proposals to feature maps. The result is a per-pixel, per-instance mask combined with a class label and bounding box.

Optical Flow & Motion Estimation ▶

Optical flow estimates the apparent motion of pixels between consecutive video frames. In its dense form it produces a vector field where each pixel carries a (dx, dy) displacement. Lucas-Kanade assumes constant flow within a small neighbourhood and solves the optical flow constraint equation using least-squares; it is usually applied sparsely, at selected keypoints. Farneback and FlowNet (a CNN-based approach) compute dense flow over the whole frame. Optical flow enables action recognition, video stabilisation, object tracking, and frame interpolation.

Stereo Vision & Depth Estimation ▶

A stereo camera rig (two cameras at a known baseline distance) enables depth estimation by computing disparity — how many pixels a feature shifts between the left and right images. Depth is inversely proportional to disparity: Z = f × B / d, where f is focal length and B is baseline. Structure from Motion (SfM) reconstructs 3D structure from monocular video using feature matching across frames. Modern monocular depth estimation CNNs predict depth from a single image by learning scene priors from large datasets.
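The relation Z = f × B / d is easy to check numerically. In this sketch the focal length is assumed to be expressed in pixels and the baseline in metres, so depth comes out in metres; the rig parameters are invented for the example.

```python
def depth_from_disparity(disparity_px, focal_px, baseline_m):
    """Depth Z = f * B / d for a rectified stereo pair."""
    if disparity_px <= 0:
        # Zero disparity corresponds to a point at infinity.
        raise ValueError("disparity must be positive")
    return focal_px * baseline_m / disparity_px

focal_px = 700.0      # assumed focal length in pixels
baseline_m = 0.12     # assumed distance between the two cameras in metres

near = depth_from_disparity(42.0, focal_px, baseline_m)
far = depth_from_disparity(21.0, focal_px, baseline_m)
print(near, far)
```

Halving the disparity doubles the estimated depth, which is why a one-pixel matching error matters far more for distant objects than for near ones.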

GANs in Computer Vision ▶

A Generative Adversarial Network pits two networks against each other: a generator learns to produce realistic images from noise; a discriminator learns to tell real from generated images. This adversarial dynamic drives both to improve until the generator produces images indistinguishable from real ones. In CV, GANs power super-resolution (SRGAN), image-to-image translation (pix2pix, CycleGAN), and inpainting. Diffusion models have largely superseded GANs for high-quality image synthesis but GANs remain important in real-time applications.
