Computer vision is a field of artificial intelligence that enables machines to interpret and understand visual information from the world. It encompasses a wide range of tasks such as image classification, object detection, image segmentation and image generation. As the demand for advanced computer vision applications grows, so does the need for skilled professionals who can develop and implement these technologies effectively.
A pixel (short for picture element) is the smallest unit of a digital image. Each pixel holds a value representing color or intensity and when combined with millions of other pixels, it forms a complete image. Image resolution, on the other hand, refers to the amount of detail an image holds. It is usually expressed as the number of pixels along the width and height of an image (e.g., 1920×1080) or as pixel density (pixels per inch, PPI). Higher resolution means more pixels which generally results in clearer and sharper images.
The 2D Discrete Fourier Transform (DFT) is a way to convert a 2D image from its spatial form (pixels) into the frequency domain. In the frequency domain, we can see which patterns or details (like edges or textures) are present in the image. Each value in the DFT is a complex number: its magnitude gives the strength (amplitude) of a specific frequency component and its phase gives that component's spatial offset. This is very useful for tasks like filtering, compression and detecting patterns.
The Fast Fourier Transform (FFT) is a faster way to compute the DFT. Computing the 2D DFT of an $N \times N$ image directly from its definition takes $O(N^4)$ operations. The FFT reduces this to $O(N^2 \log N)$, which is much faster. It does this by breaking the problem into smaller parts and reusing calculations. This makes the FFT very practical for real-time image and signal processing.
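As a minimal sketch of the relationship between the two, the NumPy snippet below evaluates the DFT definition in its separable matrix form and checks it against `np.fft.fft2`. The helper name `naive_dft2` and the random 8×8 test image are illustrative assumptions, not part of any library.

```python
import numpy as np

def naive_dft2(image):
    """2D DFT from its definition, written in separable matrix form."""
    rows, cols = image.shape
    # DFT matrices: W[k, n] = exp(-2*pi*i*k*n / N)
    w_rows = np.exp(-2j * np.pi * np.outer(np.arange(rows), np.arange(rows)) / rows)
    w_cols = np.exp(-2j * np.pi * np.outer(np.arange(cols), np.arange(cols)) / cols)
    return w_rows @ image @ w_cols

rng = np.random.default_rng(0)
image = rng.random((8, 8))          # hypothetical test image

spectrum_naive = naive_dft2(image)
spectrum_fft = np.fft.fft2(image)   # same result, computed with the FFT

magnitude = np.abs(spectrum_fft)    # strength of each frequency component
phase = np.angle(spectrum_fft)      # phase of each frequency component
print(np.allclose(spectrum_naive, spectrum_fft))
```

The zero-frequency term `spectrum_fft[0, 0]` equals the sum of all pixel values, which is a quick sanity check when inspecting a spectrum.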
Convolution is a fundamental operation in image processing where a small matrix, called a kernel or filter, is applied over an image to extract certain features or modify the image. It works by sliding the kernel across the image and performing element-wise multiplication followed by summation to produce a new value for each pixel. Convolution is important because it allows us to perform essential tasks such as blurring, sharpening, edge detection and feature extraction in a systematic and efficient way. Most image processing and computer vision techniques rely on convolution for analyzing patterns in images.
$$O(x, y) = \sum_{i}\sum_{j} I(x - i,\, y - j)\, K(i, j)$$
Where $I$ is the input image, $K$ is the kernel and $O$ is the output image.
Convolution can be used for:
Correlation is a technique used to measure the similarity between two signals or images. In image processing, it involves sliding a small template or kernel over an image and computing a similarity measure at each position. High correlation values indicate that the pattern in the kernel closely matches the region in the image. Correlation is commonly used in template matching, pattern recognition and feature detection.
| Feature | Convolution | Correlation |
|---|---|---|
| Definition | Combines an image and a kernel by flipping the kernel and summing products | Measures similarity between an image and a kernel without flipping |
| Kernel Orientation | Kernel is rotated 180° before applying | Kernel is used as-is |
| Application | Used for filtering, edge detection, blurring, feature extraction | Used for template matching, pattern recognition, detecting similarities |
| Effect on Image | Can produce results like blurring or sharpening | Produces similarity map indicating where the template matches best |
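The kernel-orientation difference in the table can be illustrated with a short NumPy sketch. The function names `correlate2d` and `convolve2d` are defined here for illustration (they are not library calls), and only the "valid" region without padding is computed.

```python
import numpy as np

def correlate2d(image, kernel):
    """'Valid' cross-correlation: slide the kernel as-is and sum the products."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for y in range(out_h):
        for x in range(out_w):
            out[y, x] = np.sum(image[y:y + kh, x:x + kw] * kernel)
    return out

def convolve2d(image, kernel):
    """Convolution is correlation with the kernel rotated by 180 degrees."""
    return correlate2d(image, kernel[::-1, ::-1])

# A horizontal intensity ramp and an asymmetric (Sobel-like) kernel
image = np.arange(25, dtype=float).reshape(5, 5)
kernel = np.array([[1.0, 0.0, -1.0],
                   [2.0, 0.0, -2.0],
                   [1.0, 0.0, -1.0]])

corr = correlate2d(image, kernel)   # every entry is -8
conv = convolve2d(image, kernel)    # every entry is +8 (sign flips with the kernel)
```

For a symmetric kernel such as a Gaussian, the flip changes nothing and the two operations give identical results.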
Filters are used in image processing to modify an image, either to enhance features, remove noise or detect edges. Filters are classified into linear and non-linear based on how the output pixel is computed from its neighborhood.
Linear Filters: A linear filter computes each output pixel as a weighted sum of its neighboring pixels. These filters follow the principles of linearity and superposition, meaning the output changes proportionally to the input. Linear filters are mainly used for smoothing, sharpening and edge detection.
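Formula (2D linear filter):
$$g(x, y) = \sum_{s=-a}^{a}\sum_{t=-b}^{b} w(s, t)\, f(x + s,\, y + t)$$
Where $f$ is the input image, $w$ is the kernel of weights of size $(2a+1) \times (2b+1)$ and $g$ is the filtered output.
Examples: averaging filter, Gaussian filter, Sobel filter.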
Pros: Simple, efficient, good for smoothing/sharpening.
Cons: Can blur edges and fine details.
Non-Linear Filters: A non-linear filter computes each output pixel using a non-linear function of neighboring pixels. These filters are effective for noise removal while preserving edges and details.
Examples: median filter, morphological filters, adaptive filters.
Pros: Preserves edges, effective against impulsive noise.
Cons: Slightly more computationally expensive than linear filters.
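To make the contrast concrete, here is a small sketch that applies a linear filter (3×3 mean) and a non-linear filter (3×3 median) to a flat image containing one "salt" pixel. The helper `filter3x3` and the edge-replication padding are assumptions chosen for brevity.

```python
import numpy as np

def filter3x3(image, reducer):
    """Apply `reducer` (e.g. np.mean or np.median) to every 3x3 neighborhood."""
    padded = np.pad(image, 1, mode="edge")   # replicate border pixels
    out = np.empty_like(image, dtype=float)
    for y in range(image.shape[0]):
        for x in range(image.shape[1]):
            out[y, x] = reducer(padded[y:y + 3, x:x + 3])
    return out

image = np.full((5, 5), 10.0)
image[2, 2] = 255.0                          # a single impulse-noise pixel

mean_out = filter3x3(image, np.mean)         # linear: the outlier is smeared
median_out = filter3x3(image, np.median)     # non-linear: the outlier is removed
```

The mean filter spreads the outlier over its neighbors (the center becomes 335/9, about 37.2), while the median filter restores the flat value of 10 everywhere.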
Gaussian filtering is a type of linear smoothing filter used to reduce noise and blur an image in a controlled way. It uses a Gaussian function to assign weights to neighboring pixels, giving more importance to pixels near the center and less to those farther away. This weighted averaging preserves the general structure of the image while effectively removing high-frequency noise. Gaussian filtering is widely used in image preprocessing, edge detection and computer vision tasks because it smooths images without introducing sharp artifacts.
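Gaussian function (1D) formula:
$$G(x) = \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-\frac{x^2}{2\sigma^2}}$$
2D Gaussian function (used for images):
$$G(x, y) = \frac{1}{2\pi\sigma^2}\, e^{-\frac{x^2 + y^2}{2\sigma^2}}$$
Where $\sigma$ controls the spread of the Gaussian (larger $\sigma$ → more blurring).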
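A sampled version of the 2D Gaussian can be built in a few lines. This is a sketch that assumes an odd kernel size and normalizes the weights to sum to 1 (so the constant factor in front of the exponential is not needed); `gaussian_kernel` is an illustrative name.

```python
import numpy as np

def gaussian_kernel(size, sigma):
    """Sampled 2D Gaussian (odd `size`), normalized so the weights sum to 1."""
    half = size // 2
    coords = np.arange(-half, half + 1)
    x, y = np.meshgrid(coords, coords)
    kernel = np.exp(-(x**2 + y**2) / (2.0 * sigma**2))
    return kernel / kernel.sum()

narrow = gaussian_kernel(5, sigma=0.5)   # weight concentrated at the center
wide = gaussian_kernel(5, sigma=2.0)     # weight spread out: stronger blur
```

Comparing `narrow[2, 2]` with `wide[2, 2]` shows how a larger spread moves weight away from the center pixel.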
Purpose of Gaussian filtering:
Image enhancement involves improving the visual appearance of an image or making it easier to analyze. The goal is to highlight important features, improve contrast, reduce noise and make details more visible. Enhancement can be applied in the spatial domain (directly on pixels) or the frequency domain (using transforms like Fourier). Some commonly used enhancement techniques are:
1. Contrast Enhancement
2. Brightness Adjustment
3. Smoothing (Noise Reduction)
4. Sharpening
5. Edge Enhancement
6. Histogram Processing
7. Color Enhancement
8. Frequency Domain Enhancement
Histogram equalization is an image enhancement technique that improves the contrast of an image by redistributing its intensity values. It spreads out the most frequent intensity values across the entire range, making dark regions brighter and bright regions darker when necessary. This helps to reveal hidden details and makes features in the image more distinguishable, especially in low-contrast or underexposed images.
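One common way to implement this is to map each intensity through the normalized cumulative histogram. The sketch below assumes an 8-bit grayscale image stored as a `uint8` NumPy array; `equalize_histogram` is an illustrative name.

```python
import numpy as np

def equalize_histogram(image):
    """Histogram equalization for an 8-bit grayscale image (uint8 array)."""
    hist = np.bincount(image.ravel(), minlength=256)
    cdf = hist.cumsum()
    cdf_min = cdf[cdf > 0][0]              # CDF value at the darkest used level
    denom = max(image.size - cdf_min, 1)   # guard against a constant image
    # Map each level through the normalized CDF onto the full 0..255 range
    lut = np.round((cdf - cdf_min) / denom * 255.0)
    lut = np.clip(lut, 0, 255).astype(np.uint8)
    return lut[image]

# A low-contrast image that only uses levels 100..103
low_contrast = np.array([[100, 101], [102, 103]], dtype=np.uint8)
equalized = equalize_histogram(low_contrast)
```

After equalization the four levels are spread across the full range (0, 85, 170, 255) instead of being packed into 100..103.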
Color correction is an image enhancement technique that adjusts the colors of an image to make them appear more natural, accurate or visually appealing. It is used to compensate for lighting conditions, sensor limitations or color casts caused by environmental factors. The goal is to ensure that objects in the image have the correct color representation and maintain consistent color balance across different images or scenes.
Noise in images refers to unwanted random variations in pixel intensity which can degrade image quality and affect analysis. Noise can be introduced during image acquisition, transmission or compression. Different types of noise have distinct characteristics and require different filtering techniques for removal.
1. Gaussian Noise
2. Salt-and-Pepper Noise
3. Speckle Noise
4. Poisson Noise
5. Quantization Noise
6. Periodic Noise
Noise reduction techniques are used to remove unwanted variations in pixel intensity while preserving important image details like edges and textures. Different filters are effective for different types of noise.
1. Gaussian Filter
2. Median Filter
3. Bilateral Filter
4. Non-Local Means (NLM) Filter
Principal Component Analysis (PCA) is a dimensionality reduction technique that transforms high-dimensional data into a lower-dimensional space while retaining most of the important information. In image processing, it is often used to reduce the number of features or pixels, compress images, remove redundancy and extract the most significant patterns. It works by finding the principal components, which are the directions of maximum variance in the data, and projecting the original image onto these components.
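A minimal PCA sketch using the singular value decomposition is shown below. It assumes each row of `data` is one sample (for example a flattened image patch); the function names and the synthetic data are illustrative.

```python
import numpy as np

def pca_project(data, num_components):
    """Project the rows of `data` onto the top principal components."""
    mean = data.mean(axis=0)
    centered = data - mean
    # Rows of vt are the principal directions, ordered by decreasing variance
    _, singular_values, vt = np.linalg.svd(centered, full_matrices=False)
    components = vt[:num_components]
    projected = centered @ components.T
    explained = singular_values**2 / np.sum(singular_values**2)
    return projected, components, mean, explained[:num_components]

def pca_reconstruct(projected, components, mean):
    """Map low-dimensional coordinates back to the original space."""
    return projected @ components + mean

# Synthetic 3D data that varies along a single direction
rng = np.random.default_rng(0)
t = rng.standard_normal(50)
data = np.outer(t, [1.0, 2.0, 3.0]) + np.array([5.0, -1.0, 0.5])

projected, components, mean, explained = pca_project(data, 1)
restored = pca_reconstruct(projected, components, mean)
```

Because this synthetic data lies on a line, one component captures essentially all of the variance and the reconstruction matches the input; real image data would need more components.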
Affine transformations are geometric transformations that preserve points, straight lines and parallelism in an image. They are used to rotate, scale, translate, shear or reflect images while maintaining the general structure. Affine transformations are widely applied in image registration, object detection, image stitching and geometric corrections. They can be represented using matrix multiplication and vector addition, making them computationally efficient for image processing tasks.
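$$\begin{bmatrix} x' \\ y' \end{bmatrix} = \begin{bmatrix} a & b \\ c & d \end{bmatrix} \begin{bmatrix} x \\ y \end{bmatrix} + \begin{bmatrix} t_x \\ t_y \end{bmatrix}$$
Where $(x', y')$ is the transformed point, $a, b, c, d$ define rotation, scaling or shearing and $t_x, t_y$ define translation.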
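As a small worked example, the sketch below applies an affine map of the form p' = A p + t to the corners of a unit square, using a 90° rotation followed by a shift. The function name `affine_transform` and the chosen values are illustrative.

```python
import numpy as np

def affine_transform(points, matrix, translation):
    """Apply p' = A p + t to an (N, 2) array of (x, y) points."""
    return points @ matrix.T + translation

theta = np.pi / 2                      # 90 degree counter-clockwise rotation
rotation = np.array([[np.cos(theta), -np.sin(theta)],
                     [np.sin(theta),  np.cos(theta)]])
translation = np.array([1.0, 0.0])

square = np.array([[0.0, 0.0], [1.0, 0.0], [1.0, 1.0], [0.0, 1.0]])
moved = affine_transform(square, rotation, translation)
# Corners map to (1, 0), (1, 1), (0, 1), (0, 0)
```

Opposite sides of the square remain parallel after the transform, which is the defining property of an affine map.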
Geometric transformations are operations that change the spatial arrangement of pixels in an image. These transformations are used to resize, rotate, translate, warp or map images to a different coordinate system. They are essential for tasks like image registration, object alignment, perspective correction and image stitching.
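$$\mathbf{p}' = T\,\mathbf{p}$$
Where $T$ is the transformation matrix and $\mathbf{p}$ and $\mathbf{p}'$ are the original and transformed coordinates (typically written in homogeneous form).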
Morphological operations are image processing techniques that focus on the shape and structure of objects within an image. They analyze and process images using a small shape called a structuring element to probe and transform the objects. These operations are widely used in binary and grayscale images for tasks like noise removal, object segmentation and shape analysis.
Uses of Morphological Operations:
The morphological gradient is a morphological operation that highlights the edges or boundaries of objects in an image. It is computed as the difference between the dilation and erosion of an image using a structuring element. This operation emphasizes the transition regions between foreground and background, making it useful for edge detection and shape analysis in both binary and grayscale images.
Uses of Morphological Gradient:
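$$G = (A \oplus B) - (A \ominus B)$$
Where $A$ is the input image, $B$ is the structuring element, $\oplus$ denotes dilation and $\ominus$ denotes erosion.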
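The dilation-minus-erosion definition can be sketched directly in NumPy. This version assumes a square structuring element and edge-replicated borders; the helper names are illustrative rather than library functions.

```python
import numpy as np

def _neighborhoods(image, size):
    """Stack every shifted copy of the image covered by a size x size square."""
    pad = size // 2
    padded = np.pad(image, pad, mode="edge")
    h, w = image.shape
    return np.stack([padded[dy:dy + h, dx:dx + w]
                     for dy in range(size) for dx in range(size)])

def dilate(image, size=3):
    return _neighborhoods(image, size).max(axis=0)   # local maximum

def erode(image, size=3):
    return _neighborhoods(image, size).min(axis=0)   # local minimum

def morphological_gradient(image, size=3):
    return dilate(image, size) - erode(image, size)

# Binary image with a 3x3 foreground square (signed ints avoid underflow)
image = np.zeros((7, 7), dtype=int)
image[2:5, 2:5] = 1
gradient = morphological_gradient(image)
```

The result is a ring around the square: it is 1 on the 24 pixels where dilation and erosion disagree, and 0 both at the square's center and far from it.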
An edge in an image is a boundary or transition between regions with significant changes in intensity or color. Edges correspond to object boundaries, surface markings or texture changes. Detecting edges is a fundamental step in image processing because it helps identify important structures and shapes within the image.
1. Sobel operator: It is a gradient-based edge detection method that detects edges in both horizontal and vertical directions. It uses two 3×3 convolution kernels to compute approximate derivatives along x and y axes. The final edge strength is obtained by combining these gradients.
$$G = \sqrt{G_x^2 + G_y^2}$$
Where $G_x$ and $G_y$ are convolutions of the image with the Sobel kernels.
2. Prewitt Edge Detectors: The Prewitt operator is another gradient-based edge detection method similar to Sobel, but its kernels weight all three rows (or columns) equally instead of giving the center row extra weight. It also uses two 3×3 kernels to detect horizontal and vertical edges.
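The two operators can be compared with a short sketch that computes the gradient magnitude of a vertical step edge. The kernels are the standard Sobel and Prewitt x-kernels; the helper names are illustrative, and the sliding window is applied without flipping the kernel, which only changes the sign of the gradients and not their magnitude.

```python
import numpy as np

SOBEL_X = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]], dtype=float)
PREWITT_X = np.array([[-1, 0, 1], [-1, 0, 1], [-1, 0, 1]], dtype=float)

def filter_valid(image, kernel):
    """Slide the kernel over the image ('valid' region only, no flipping)."""
    kh, kw = kernel.shape
    out = np.zeros((image.shape[0] - kh + 1, image.shape[1] - kw + 1))
    for y in range(out.shape[0]):
        for x in range(out.shape[1]):
            out[y, x] = np.sum(image[y:y + kh, x:x + kw] * kernel)
    return out

def gradient_magnitude(image, kernel_x):
    gx = filter_valid(image, kernel_x)      # horizontal derivative
    gy = filter_valid(image, kernel_x.T)    # vertical derivative
    return np.sqrt(gx**2 + gy**2)

image = np.zeros((5, 6))
image[:, 3:] = 1.0                          # vertical step edge at column 3

sobel_mag = gradient_magnitude(image, SOBEL_X)      # rows are [0, 4, 4, 0]
prewitt_mag = gradient_magnitude(image, PREWITT_X)  # rows are [0, 3, 3, 0]
```

Both operators respond only near the step; Sobel gives the larger response because of the extra weight on the center row.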
The Canny edge detector is a multi-step gradient-based edge detection method designed to detect edges in images accurately and with minimal noise. It combines smoothing, gradient calculation, non-maximum suppression and edge tracking to produce clean edge maps.
Steps in the Canny Algorithm:
Step 1: Noise Reduction
$$S = G * I$$
Where $I$ is the input image, $G$ is the Gaussian kernel and $S$ is the smoothed image.
Step 2: Gradient Calculation
Step 3: Non-Maximum Suppression
Step 4: Double Thresholding
Step 5: Edge Tracking by Hysteresis
Overall, the Canny detector combines noise reduction, edge detection and thresholding for accurate edge extraction.
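Steps 4 and 5 can be sketched on their own, assuming a gradient-magnitude array has already been computed and thinned by the earlier steps. The function name `hysteresis_threshold`, the 8-connected neighborhood and the example thresholds are assumptions for illustration.

```python
import numpy as np
from collections import deque

def hysteresis_threshold(magnitude, low, high):
    """Keep strong pixels, plus weak pixels connected to a strong one."""
    strong = magnitude >= high
    weak = (magnitude >= low) & ~strong
    edges = strong.copy()
    rows, cols = magnitude.shape
    queue = deque(zip(*np.nonzero(strong)))
    while queue:
        y, x = queue.popleft()
        # Visit the 8 neighbors and promote any weak pixel found there
        for dy in (-1, 0, 1):
            for dx in (-1, 0, 1):
                ny, nx = y + dy, x + dx
                if (0 <= ny < rows and 0 <= nx < cols
                        and weak[ny, nx] and not edges[ny, nx]):
                    edges[ny, nx] = True
                    queue.append((ny, nx))
    return edges

magnitude = np.array([[0, 0, 0, 0, 0],
                      [9, 5, 5, 0, 5],
                      [0, 0, 0, 0, 0]], dtype=float)
edges = hysteresis_threshold(magnitude, low=4, high=8)
```

In this example the two weak pixels chained to the strong pixel are kept, while the isolated weak pixel at the right is discarded.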
A feature descriptor is a representation of an image region or keypoint that captures distinctive information about its appearance, shape or texture. Feature descriptors are used to describe and match keypoints across images, enabling tasks like object recognition, image matching and tracking. They transform raw pixel information into a compact and robust vector that can be compared across images even under changes in scale, rotation or illumination.
Scale-Invariant Feature Transform (SIFT) is a feature detection and description method used in computer vision to identify and describe distinctive keypoints in images. It is invariant to scale and rotation and partially invariant to illumination changes, making it well suited to matching objects across images taken from different viewpoints or under different conditions. SIFT detects keypoints and computes a robust feature descriptor for each one that can be used for image matching, object recognition and 3D reconstruction.
Step 1: Scale-space Extrema Detection: Identify potential keypoints by searching for local maxima and minima in the Difference of Gaussian (DoG) images at multiple scales.
Step 2: Keypoint Localization: Refine keypoints by eliminating unstable points with low contrast or poorly defined edges.
Step 3: Orientation Assignment: Assign a dominant orientation to each keypoint based on local gradient directions, making descriptors rotation-invariant.
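Step 4: Keypoint Descriptor Generation: Build a descriptor for each keypoint from histograms of local gradient orientations measured relative to its dominant orientation, giving a 128-dimensional vector.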
Speeded Up Robust Features (SURF) is a fast and robust feature detection and description algorithm in computer vision. It is designed as a computationally efficient alternative to SIFT, providing scale- and rotation-invariant keypoints and descriptors for tasks like image matching, object recognition and tracking.
ORB (Oriented FAST and Rotated BRIEF) is a fast and efficient feature detection and description algorithm used in computer vision. It combines the FAST keypoint detector with the BRIEF descriptor, adding orientation information to achieve rotation invariance. ORB is designed to be computationally lightweight while maintaining robustness, making it well suited to real-time applications.
| Feature | SIFT | SURF | ORB |
|---|---|---|---|
| Speed | Slow | Faster than SIFT | Fastest |
| Keypoint Detection | Difference of Gaussian (DoG) | Hessian matrix | FAST |
| Descriptor Type | 128-dimensional floating-point | 64- or 128-dimensional floating-point | Binary BRIEF |
| Rotation Invariance | Yes | Yes | Yes |
| Scale Invariance | Yes | Yes | Partially |
| Illumination Robustness | High | High | Moderate |
| Computational Cost | High | Moderate | Low |
| Applications | High-accuracy matching, object recognition | Image stitching, 3D reconstruction | Real-time tracking, SLAM, mobile applications |
Histogram of Oriented Gradients (HOG) is a feature descriptor used in computer vision to represent the local shape and appearance of objects in an image. It works by dividing the image into small cells, computing the gradient orientation in each cell and forming histograms of these orientations. HOG captures the edge and gradient structure of an object, making it effective for object detection, especially for detecting humans, vehicles and other rigid objects.
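The per-cell histogram at the core of HOG can be sketched as follows. This is a simplified illustration: it assumes 9 unsigned-orientation bins over 0 to 180 degrees, assigns each pixel to a single bin and skips the block normalization used by full HOG implementations. `cell_histogram` is an illustrative name.

```python
import numpy as np

def cell_histogram(cell, num_bins=9):
    """Magnitude-weighted histogram of unsigned gradient orientations."""
    gy, gx = np.gradient(cell.astype(float))   # derivatives along rows, columns
    magnitude = np.hypot(gx, gy)
    orientation = np.degrees(np.arctan2(gy, gx)) % 180.0   # unsigned angle
    bin_width = 180.0 / num_bins
    # The final modulo folds an angle that rounds to 180 back into bin 0
    bins = (orientation // bin_width).astype(int) % num_bins
    return np.bincount(bins.ravel(), weights=magnitude.ravel(),
                       minlength=num_bins)

horizontal_ramp = np.tile(np.arange(8.0), (8, 1))   # intensity grows with x
hist_h = cell_histogram(horizontal_ramp)            # all weight in bin 0
hist_v = cell_histogram(horizontal_ramp.T)          # all weight in bin 4 (90 deg)
```

A full descriptor would concatenate such histograms over all cells of the detection window.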
Template matching is a technique in computer vision used to find parts of an image that match a given template or reference pattern. It works by sliding the template over the input image and computing a similarity measure (e.g., cross-correlation) at each position. The location with the highest similarity indicates the best match. Template matching is simple and effective for detecting objects when their size, orientation and appearance are known and consistent.
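A brute-force version of the sliding-window search is sketched below, using zero-mean normalized cross-correlation as the similarity measure (one common choice among several). The function name `match_template` and the random test image are illustrative assumptions.

```python
import numpy as np

def match_template(image, template):
    """Return the (row, col) of the best match and the full score map."""
    th, tw = template.shape
    t = template - template.mean()
    t_norm = np.sqrt(np.sum(t**2))
    scores = np.zeros((image.shape[0] - th + 1, image.shape[1] - tw + 1))
    for y in range(scores.shape[0]):
        for x in range(scores.shape[1]):
            patch = image[y:y + th, x:x + tw]
            p = patch - patch.mean()
            denom = np.sqrt(np.sum(p**2)) * t_norm
            # Normalized score in [-1, 1]; flat patches get a score of 0
            scores[y, x] = np.sum(p * t) / denom if denom > 0 else 0.0
    best = np.unravel_index(np.argmax(scores), scores.shape)
    return best, scores

rng = np.random.default_rng(1)
image = rng.random((10, 10))
template = image[3:6, 4:7].copy()       # cut the template out of the image

best, scores = match_template(image, template)
```

Because the template was cut from the image, the best score is 1 at its original top-left position; a rotated or rescaled copy would not match this well.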
Limitations:
Optical flow is a technique in computer vision that estimates the motion of objects, surfaces or edges between consecutive frames in a video. It represents the apparent motion of pixels as a vector field, showing the direction and magnitude of movement. Optical flow is widely used in motion detection, video analysis, object tracking and robotics.
Lucas-Kanade Method: The Lucas-Kanade method is a sparse optical flow algorithm that estimates motion for a set of keypoints by assuming small motion and constant velocity within a local neighborhood. It solves a set of linear equations for each keypoint to compute the displacement vectors.
$$I_x u + I_y v + I_t = 0$$
Where $I_x, I_y$ are the spatial derivatives, $I_t$ is the temporal derivative and $(u, v)$ is the flow vector.
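For a single window, the Lucas-Kanade idea reduces to a small least-squares problem: every pixel contributes one equation of the form Ix·u + Iy·v = −It. The sketch below assumes the whole frame is one window and uses a synthetic image pair; `lucas_kanade_window` is an illustrative name.

```python
import numpy as np

def lucas_kanade_window(frame1, frame2):
    """Estimate one (u, v) flow vector for a whole window by least squares."""
    f1 = frame1.astype(float)
    f2 = frame2.astype(float)
    iy, ix = np.gradient(f1)            # spatial derivatives (rows, columns)
    it = f2 - f1                        # temporal derivative
    a = np.stack([ix.ravel(), iy.ravel()], axis=1)
    b = -it.ravel()
    flow, *_ = np.linalg.lstsq(a, b, rcond=None)
    return flow                         # array([u, v])

# Synthetic pattern I(x, y) = x * y, shifted by a small known motion
y, x = np.mgrid[0:16, 0:16].astype(float)
true_u, true_v = 0.2, 0.1
frame1 = x * y
frame2 = (x - true_u) * (y - true_v)    # frame1 moved by (true_u, true_v)

flow = lucas_kanade_window(frame1, frame2)
```

The estimate is close to (0.2, 0.1) but not exact, because the method linearizes brightness changes and therefore only holds for small motion.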
A Convolutional Neural Network (CNN) is a deep learning model that automatically learns important features from images for classification. It works by processing the image through multiple layers that detect patterns at different levels, from simple edges in early layers to complex shapes in deeper layers and finally outputs probabilities for each class.
Pooling layers are used in Convolutional Neural Networks (CNNs) to reduce the spatial dimensions of feature maps while retaining the most important information. By summarizing regions of the input, pooling layers help the network focus on dominant features rather than precise pixel locations. This makes the network more computationally efficient, robust to small translations and less prone to overfitting.
| Feature | Max Pooling | Average Pooling |
|---|---|---|
| Operation | Selects the maximum value in the pooling window | Computes the average value in the pooling window |
| Feature Emphasis | Highlights strongest activations | Provides a smoothed representation |
| Preservation of Details | Preserves prominent edges and textures | Can dilute strong features |
| Common Usage | Widely used between convolutional layers | Less common between layers, though global average pooling is often used before the classifier |
| Effect on Noise | Can ignore weak noisy activations | May be influenced by noise |
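Both pooling variants in the table can be expressed with one reshape. The sketch assumes non-overlapping windows (stride equal to the window size) and crops any rows or columns that do not fill a complete window; `pool2d` is an illustrative name.

```python
import numpy as np

def pool2d(feature_map, size, mode="max"):
    """Non-overlapping max or average pooling over size x size windows."""
    h, w = feature_map.shape
    h, w = h - h % size, w - w % size            # drop incomplete windows
    blocks = feature_map[:h, :w].reshape(h // size, size, w // size, size)
    if mode == "max":
        return blocks.max(axis=(1, 3))
    return blocks.mean(axis=(1, 3))

feature_map = np.array([[1, 2, 0, 0],
                        [3, 4, 0, 8],
                        [0, 0, 1, 1],
                        [0, 0, 1, 1]], dtype=float)

max_pooled = pool2d(feature_map, 2, mode="max")   # [[4, 8], [0, 1]]
avg_pooled = pool2d(feature_map, 2, mode="avg")   # [[2.5, 2], [0, 1]]
```

The top-right window shows the difference most clearly: max pooling keeps the single strong activation (8), while average pooling dilutes it to 2.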
Dropout is a regularization technique in Convolutional Neural Networks (CNNs) that helps prevent overfitting by randomly deactivating a fraction of neurons during training. By temporarily “dropping out” neurons, the network is forced to learn redundant and robust feature representations, reducing dependency on any single neuron and improving generalization to new data.
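A minimal sketch of the mechanism is shown below. It uses the "inverted dropout" variant, which is an assumption on top of the description above: surviving activations are rescaled during training so that nothing needs to change at inference time. The function name and arguments are illustrative.

```python
import numpy as np

def dropout(activations, drop_prob, rng, training=True):
    """Inverted dropout: zero a random subset and rescale the survivors."""
    if not training or drop_prob == 0.0:
        return activations                 # inference: pass through unchanged
    keep_prob = 1.0 - drop_prob
    mask = rng.random(activations.shape) < keep_prob
    return activations * mask / keep_prob

rng = np.random.default_rng(0)
activations = np.ones(10000)

train_out = dropout(activations, 0.5, rng)                  # values are 0 or 2
eval_out = dropout(activations, 0.5, rng, training=False)   # unchanged
```

About half of the activations are zeroed during training, and the rescaling keeps the expected value of each activation the same as at inference.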
Over the years, several CNN architectures have become milestones in deep learning, each introducing innovations that advanced image classification, feature extraction and efficiency.
Transfer learning is a technique in Convolutional Neural Networks (CNNs) where a pre-trained model, trained on a large dataset, is reused for a different but related task. Instead of training a CNN from scratch which requires large datasets and high computation, transfer learning uses the features learned by existing models (like edges, textures and object parts) and adapts them to the new task. This approach significantly reduces training time and improves performance, especially when the new dataset is small.
Data augmentation is a technique used in Convolutional Neural Networks (CNNs) to artificially increase the size and diversity of the training dataset by applying various transformations to the existing images. By exposing the network to modified versions of the same data, it learns to generalize better, reducing overfitting and improving performance on unseen data.
YOLO (You Only Look Once) and SSD (Single Shot MultiBox Detector) are real-time object detection models that predict object locations and class probabilities in a single forward pass of a CNN, making them fast and efficient for practical applications.
Region Proposal Networks (RPN) are a key component of Faster R-CNN, designed to generate candidate object regions (proposals) efficiently for detection. Instead of using external methods like selective search, RPN shares convolutional features with the detection network, allowing the model to propose regions and classify objects in a single unified framework.
Mask R-CNN is an extension of Faster R-CNN that adds instance segmentation capabilities to object detection. While Faster R-CNN predicts bounding boxes and class labels for objects, Mask R-CNN also predicts a pixel-level mask for each detected object, enabling the network to distinguish individual object instances, even when they overlap.
The key extension of Mask R-CNN is the addition of a mask prediction branch parallel to the existing classification and bounding box branches. To ensure precise mask alignment, it replaces Faster R-CNN’s RoIPool with RoIAlign which avoids quantization errors when mapping regions of interest (RoIs) from the feature map. This allows the network to generate accurate, pixel-level masks for each object.
By combining detection and segmentation, Mask R-CNN provides instance-level understanding of images, distinguishing overlapping objects with high precision.
| Feature | Semantic Segmentation | Instance Segmentation | Panoptic Segmentation |
|---|---|---|---|
| Definition | Classifies each pixel into a category. | Classifies each pixel into a category and instance. | Combines semantic + instance segmentation for all pixels. |
| Object Differentiation | Cannot distinguish different instances of the same class. | Distinguishes individual object instances. | Distinguishes instances and labels all pixels. |
| Output | Pixel-level class labels. | Pixel-level class labels + instance IDs. | Pixel-level class labels + instance IDs for all objects and background. |
| Use Cases | Road segmentation, medical imaging, satellite imagery. | Detecting multiple people, vehicles or objects separately. | Autonomous driving, scene understanding, complex visual reasoning. |
| Complexity | Lower than instance and panoptic segmentation. | Higher than semantic segmentation due to instance IDs. | Highest complexity, combines both semantic and instance segmentation. |
K-Means clustering can segment an image by grouping pixels with similar features (like color or intensity) into clusters. Each cluster corresponds to a segment in the image. Here are the proper steps to perform image segmentation using K-Means:
1. Feature Representation:
2. Initialize Clusters:
3. Assign Pixels to Clusters:
4. Update Centroids:
5. Iterate Until Convergence:
6. Generate Segmented Image:
7. Post-processing:
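Steps 1 to 6 can be sketched in plain NumPy as follows; post-processing (step 7) is left out. The function name `kmeans_segment`, the random initialization from existing pixels and the fixed iteration cap are assumptions made for this illustration.

```python
import numpy as np

def kmeans_segment(image, k, num_iters=20, seed=0):
    """Segment an image by clustering pixel values; returns (labels, segmented)."""
    h, w = image.shape[:2]
    # 1. Feature representation: one row of color/intensity values per pixel
    pixels = image.reshape(h * w, -1).astype(float)
    # 2. Initialize centroids from k randomly chosen pixels
    rng = np.random.default_rng(seed)
    centroids = pixels[rng.choice(h * w, size=k, replace=False)]
    for _ in range(num_iters):
        # 3. Assign each pixel to its nearest centroid
        distances = np.linalg.norm(pixels[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # 4. Update each centroid to the mean of its pixels (keep it if empty)
        updated = centroids.copy()
        for j in range(k):
            members = pixels[labels == j]
            if len(members) > 0:
                updated[j] = members.mean(axis=0)
        # 5. Stop once the centroids no longer move
        if np.allclose(updated, centroids):
            break
        centroids = updated
    # 6. Replace every pixel with the centroid of its cluster
    segmented = centroids[labels].reshape(image.shape)
    return labels.reshape(h, w), segmented

# Grayscale test image: dark left half (0..11), bright right half (200..211)
image = np.zeros((4, 6))
image[:, :3] = np.arange(12).reshape(4, 3)
image[:, 3:] = 200 + np.arange(12).reshape(4, 3)

labels, segmented = kmeans_segment(image, k=2)
```

For this image the two halves end up in separate clusters, and each half is painted with its mean intensity (5.5 and 205.5).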
Fully Convolutional Networks (FCNs) are a type of Convolutional Neural Network (CNN) designed specifically for image segmentation. Unlike traditional CNNs used for classification, FCNs replace fully connected layers with convolutional layers, allowing the network to output pixel-level predictions for the entire image. This enables the model to produce segmentation maps where each pixel is assigned a class label.
FCNs provide an end-to-end trainable framework for segmentation, enabling efficient and accurate pixel-level predictions without the need for manual feature engineering.
Training a Convolutional Neural Network (CNN) on a small dataset can be challenging due to the risk of overfitting and insufficient data to learn robust features. To overcome this, several strategies can be applied to improve generalization and performance.
A Generative Adversarial Network (GAN) is a type of deep learning model used for generating realistic data, such as images, from random noise. It consists of two neural networks, a generator and a discriminator, competing against each other in a game-theoretic setup. The generator tries to create realistic data while the discriminator attempts to distinguish between real and generated data. Through this adversarial process, the generator gradually learns to produce highly realistic outputs.
A Generative Adversarial Network (GAN) consists of two neural networks—the generator and the discriminator—that compete in an adversarial framework to produce realistic data.
Training Process:
A DCGAN is a type of Generative Adversarial Network (GAN) that uses deep convolutional neural networks in both the generator and discriminator instead of fully connected networks, making it particularly suitable for generating images. By using convolutional layers, DCGANs can capture spatial hierarchies and local structures, producing higher-quality and more realistic images than vanilla GANs.
| Feature | Vanilla GAN | DCGAN |
|---|---|---|
| Architecture | Fully connected (dense) layers | Convolutional layers in generator and discriminator |
| Image Quality | Often produces low-quality images | Produces high-quality, realistic images |
| Stability | Training can be unstable | Improved stability due to convolutional architectures and batch normalization |
| Upsampling / Downsampling | Dense layers with no learned spatial resampling | Transposed convolutions (upsampling) in generator and strided convolutions (downsampling) in discriminator |
| Applications | Simple synthetic data generation | Image synthesis, style transfer, super-resolution, etc. |
CycleGAN is a type of Generative Adversarial Network (GAN) designed for unpaired image-to-image translation. Unlike paired translation models such as pix2pix, which require input-output image pairs for training, CycleGAN can learn mappings between two domains without direct correspondence. It uses a cycle-consistency loss to ensure that translating an image to the target domain and back reconstructs the original image.
How it Works:
Use Cases:
Wasserstein GANs (WGANs) are a variation of Generative Adversarial Networks designed to improve training stability and convergence. Traditional GANs often suffer from problems like mode collapse, vanishing gradients and unstable training, making it difficult for the generator and discriminator to converge. WGANs address these issues by using the Wasserstein distance (Earth Mover’s distance) as a measure of similarity between the real and generated data distributions, instead of the standard Jensen-Shannon divergence used in vanilla GANs.
Conditional GANs (cGANs) are an extension of Generative Adversarial Networks (GANs) that allow the generation of data conditioned on additional information, such as class labels, text or other modalities. Unlike standard GANs which generate data from random noise alone, cGANs take both a noise vector and a conditional input to produce outputs that satisfy the specified condition. This enables controlled generation of images or data according to desired attributes.
How cGANs Work:
A Variational Autoencoder (VAE) is a generative model that learns to represent data in a continuous latent space and generate new data by sampling from this space. It consists of an encoder which maps input data to a probabilistic latent representation and a decoder which reconstructs data from the latent variables. Unlike traditional autoencoders, VAEs impose a probabilistic constraint on the latent space, encouraging smooth and continuous representations suitable for generating new samples.
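The probabilistic constraint is usually implemented with the reparameterization trick and a KL-divergence penalty toward a standard normal prior. The sketch below assumes a diagonal Gaussian latent distribution parameterized by `mu` and `log_var`, and a squared-error reconstruction term; all function names are illustrative, and the encoder and decoder networks are omitted.

```python
import numpy as np

def reparameterize(mu, log_var, rng):
    """Sample z = mu + sigma * eps with eps ~ N(0, I)."""
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * log_var) * eps

def kl_divergence(mu, log_var):
    """KL( N(mu, sigma^2) || N(0, I) ), summed over latent dimensions."""
    return -0.5 * np.sum(1.0 + log_var - mu**2 - np.exp(log_var), axis=-1)

def vae_loss(x, x_recon, mu, log_var):
    """Reconstruction error plus KL penalty, averaged over the batch."""
    recon = np.sum((x - x_recon)**2, axis=-1)
    return np.mean(recon + kl_divergence(mu, log_var))

rng = np.random.default_rng(0)
mu = np.array([[1.0, 0.0]])
log_var = np.zeros((1, 2))                # unit variance
z = reparameterize(mu, log_var, rng)      # one latent sample, shape (1, 2)
kl = kl_divergence(mu, log_var)           # array([0.5])
```

The KL term is zero only when the encoder outputs exactly the prior (zero mean, unit variance), which is what pulls the latent space toward a smooth, continuous shape.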
| Feature | VAE | GAN |
|---|---|---|
| Learning Approach | Probabilistic modeling of latent space | Adversarial training (generator vs discriminator) |
| Output Quality | Often blurry but smooth and diverse | High-quality and realistic images but may suffer from mode collapse |
| Training Stability | More stable and easier to train | Can be unstable, sensitive to hyperparameters |
| Latent Space | Explicit, continuous and interpretable | Implicit, learned through adversarial loss |
| Loss Function | Reconstruction + KL divergence | Adversarial loss (generator tries to fool discriminator) |
A Denoising Autoencoder (DAE) is a type of autoencoder designed to remove noise from input data. Unlike standard autoencoders which learn to reconstruct the input exactly, DAEs are trained to reconstruct the original clean data from a corrupted version. This encourages the network to learn robust and meaningful features rather than merely copying the input.
A Convolutional Autoencoder (CAE) is a type of autoencoder that uses convolutional layers instead of fully connected layers to encode and decode image data. By using convolutions, CAEs can efficiently capture spatial hierarchies and local patterns in images, making them particularly suitable for image-related tasks. The encoder compresses the input image into a latent feature map and the decoder reconstructs the image from this representation.
Applications:
A Vision Transformer (ViT) is a deep learning model for image analysis that applies the transformer architecture, originally designed for natural language processing, to computer vision tasks. Instead of using convolutions, ViTs process images by splitting them into patches and treating each patch as a token in a sequence, similar to words in a sentence. This allows the model to capture long-range dependencies and global context in images effectively.
How ViT Works:
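As a minimal sketch of the first stage only, the snippet below splits an image into non-overlapping patches and flattens each one into a token vector. It assumes the image height and width are multiples of the patch size; the later stages (linear projection, position embeddings, transformer layers) are omitted, and `image_to_patches` is an illustrative name.

```python
import numpy as np

def image_to_patches(image, patch_size):
    """Split an (H, W, C) image into flattened non-overlapping patches."""
    h, w, c = image.shape
    if h % patch_size or w % patch_size:
        raise ValueError("image size must be a multiple of patch_size")
    grid_h, grid_w = h // patch_size, w // patch_size
    patches = image.reshape(grid_h, patch_size, grid_w, patch_size, c)
    patches = patches.transpose(0, 2, 1, 3, 4)   # (grid_h, grid_w, ph, pw, c)
    return patches.reshape(grid_h * grid_w, patch_size * patch_size * c)

image = np.arange(8 * 8 * 3, dtype=float).reshape(8, 8, 3)
tokens = image_to_patches(image, patch_size=4)   # shape (4, 48)
```

With a 224×224 RGB image and 16×16 patches, the same function would return 196 tokens of length 768.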
A Swin Transformer is a hierarchical vision transformer designed to improve the efficiency and scalability of standard Vision Transformers (ViTs) for computer vision tasks. It introduces a shifted window-based self-attention mechanism which computes attention locally within windows and then shifts the windows between layers to capture cross-window connections, allowing the model to capture both local and global image features efficiently.
| Feature | Vision Transformer (ViT) | Swin Transformer |
|---|---|---|
| Attention Mechanism | Global self-attention over all image patches | Shifted window-based local attention |
| Computational Cost | High, scales quadratically with the number of patches | Lower, scales linearly with image size |
| Feature Hierarchy | Single-scale, fixed-size patch embeddings | Hierarchical, gradually reduces resolution like CNNs |
| Inductive Bias | Minimal, relies on large datasets | Includes locality and hierarchical structure, better for smaller datasets |
| Applications | Image classification, object detection | Image classification, detection, segmentation and dense prediction tasks |
A Convolutional Vision Transformer (CvT) is a hybrid architecture that combines the strengths of Convolutional Neural Networks (CNNs) and Vision Transformers (ViTs). It integrates convolutional layers into the token embedding and attention modules of a transformer, enabling the model to capture local spatial features efficiently while also modeling long-range dependencies through self-attention. This design improves performance, especially on smaller datasets, by providing an inductive bias similar to CNNs.
Applications:
CLIP is a multimodal model developed by OpenAI that learns to associate images with natural language descriptions. It jointly trains an image encoder and a text encoder to map both images and corresponding text into a shared embedding space, enabling the model to understand the relationship between visual and textual information.
How CLIP Works:
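The shared-embedding idea can be sketched with a symmetric contrastive loss over cosine similarities. This is a simplified illustration: the embeddings are stand-in arrays rather than encoder outputs, the temperature of 0.07 is an assumed fixed value and the function names are hypothetical.

```python
import numpy as np

def l2_normalize(x):
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def similarity_logits(image_emb, text_emb, temperature=0.07):
    """Cosine similarity between every image and every text, scaled."""
    return l2_normalize(image_emb) @ l2_normalize(text_emb).T / temperature

def contrastive_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric cross-entropy where the i-th image matches the i-th text."""
    logits = similarity_logits(image_emb, text_emb, temperature)

    def cross_entropy(l):
        l = l - l.max(axis=1, keepdims=True)             # numerical stability
        log_probs = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(log_probs))

    return 0.5 * (cross_entropy(logits) + cross_entropy(logits.T))

image_emb = np.eye(3)                    # stand-in image embeddings
matched_text = np.eye(3)                 # each text aligned with its image
shuffled_text = np.eye(3)[[1, 2, 0]]     # texts paired with the wrong images

loss_matched = contrastive_loss(image_emb, matched_text)
loss_shuffled = contrastive_loss(image_emb, shuffled_text)
```

The loss is near zero when each image is most similar to its own text and large when the pairs are mismatched. At inference, zero-shot classification picks the class prompt with the highest similarity to the image.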
Applications:
ALIGN is a multimodal model developed by Google that, like CLIP, learns to align images and text in a shared embedding space. However, ALIGN is trained on a much larger dataset of noisy image-text pairs collected from the web which allows it to scale to billions of examples and improve robustness. It uses contrastive learning to maximize the similarity of matched image-text pairs while minimizing similarity of mismatched pairs.
BLIP is a multimodal model that improves over CLIP by incorporating both contrastive and generative objectives. While CLIP only aligns images and text, BLIP adds image-text matching and image-conditioned text generation (captioning) objectives during pretraining, allowing the model to learn richer representations that support both retrieval and generation.
| Feature | CLIP | ALIGN | BLIP |
|---|---|---|---|
| Developer | OpenAI | Google | Salesforce |
| Training Data | About 400 million image-text pairs collected from the web | Billions of noisy web-scraped image-text pairs | Large-scale image-text datasets with captions |
| Training Objective | Contrastive learning | Contrastive learning | Contrastive + generative objectives |
| Architecture | Image encoder (ResNet/ViT) + text transformer | EfficientNet image encoder + BERT text encoder | ViT image encoder + text transformer (supports generation) |
| Zero-shot Performance | Good | Better due to massive-scale data | Improved via richer multimodal representation |
| Generative Capability | No | No | Yes, supports image-to-text generation (captioning, answering questions) |
| Use Cases | Zero-shot classification, image-text retrieval | Large-scale zero-shot classification and retrieval | Image captioning, VQA, retrieval and generation |
| Key Advantage | Aligns text and images effectively | Scales to massive noisy data, robust embeddings | Combines retrieval and generative tasks for richer understanding |
| Feature | Spatial Filtering | Frequency Filtering |
|---|---|---|
| Definition | Processes the image directly in the spatial domain, using kernels/masks on pixel values. | Processes the image in the frequency domain, using Fourier transforms and modifying frequency components. |
| Operation | Convolution or correlation with a kernel/mask. | Multiplying the Fourier-transformed image with a filter in the frequency domain. |
| Advantages | Simple, intuitive and works well for local operations like smoothing or edge detection. | Can easily perform global operations, like removing specific frequency noise or enhancing certain patterns. |
| Examples | Gaussian blur, Sobel filter, median filter | Low-pass filter, high-pass filter, notch filter |
| Computation | Direct pixel-wise computation | Requires FFT/IFFT transformations |
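The two columns above describe the same linear filter computed in two ways. The following sketch, which assumes wrap-around (circular) borders and uses illustrative helper names, applies a 3×3 averaging kernel once in the spatial domain and once by multiplying spectra, then checks that the results agree.

```python
import numpy as np

def circular_convolve2d(image, kernel):
    """Spatial filtering: weighted sum of shifted copies (wrap-around borders)."""
    out = np.zeros_like(image, dtype=float)
    kh, kw = kernel.shape
    for dy in range(kh):
        for dx in range(kw):
            # np.roll by (dy, dx) gives image[y - dy, x - dx] at position (y, x)
            out += kernel[dy, dx] * np.roll(image, shift=(dy, dx), axis=(0, 1))
    return out

def fft_filter2d(image, kernel):
    """Frequency filtering: multiply the image spectrum by the kernel spectrum."""
    padded = np.zeros_like(image, dtype=float)
    padded[:kernel.shape[0], :kernel.shape[1]] = kernel  # zero-pad to image size
    spectrum = np.fft.fft2(image) * np.fft.fft2(padded)
    return np.real(np.fft.ifft2(spectrum))

rng = np.random.default_rng(0)
image = rng.random((8, 8))
box = np.full((3, 3), 1.0 / 9.0)            # simple low-pass (averaging) kernel

spatial = circular_convolve2d(image, box)
frequency = fft_filter2d(image, box)
print(np.allclose(spatial, frequency))
```

For small kernels the spatial route is usually cheaper; the FFT route tends to pay off for large kernels or when the filter is designed directly in the frequency domain, as with notch filters.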
| Feature | Linear Filters | Non-Linear Filters |
|---|---|---|
| Definition | Filters where the output is a linear combination of input pixel values. | Filters where the output is a non-linear function of input pixels. |
| Operation | Uses convolution or correlation with a kernel. | Uses operations like median, maximum or morphological functions. |
| Superposition Principle | Obeys linearity and superposition. | Does not obey superposition. |
| Noise Handling | Effective for Gaussian noise, less effective for impulse noise. | Effective for impulse noise (salt-and-pepper) and preserving edges. |
| Examples | Averaging filter, Gaussian filter, Sobel filter | Median filter, morphological filters, adaptive filters |
| Effect on Edges | Can blur edges while smoothing noise. | Preserves edges better while reducing noise. |
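The difference in noise handling can be seen on a synthetic example. The sketch below assumes a 3×3 window, edge-replicated borders and a made-up test image with a vertical step edge corrupted by 5% salt-and-pepper noise; it compares a mean (linear) filter with a median (non-linear) filter.

```python
import numpy as np

def filter3x3(image, reducer):
    """Apply `reducer` (np.mean or np.median) over every 3x3 neighbourhood."""
    padded = np.pad(image, 1, mode="edge")
    h, w = image.shape
    # Stack the 9 shifted views so axis 0 indexes the neighbourhood members
    windows = np.stack([padded[dy:dy + h, dx:dx + w]
                        for dy in range(3) for dx in range(3)])
    return reducer(windows, axis=0)

# Synthetic image: dark left half, bright right half (a vertical step edge)
clean = np.full((32, 32), 100.0)
clean[:, 16:] = 200.0

# Corrupt roughly 5% of the pixels with salt-and-pepper (impulse) noise
rng = np.random.default_rng(0)
noisy = clean.copy()
mask = rng.random(clean.shape) < 0.05
noisy[mask] = rng.choice([0.0, 255.0], size=mask.sum())

mean_err = np.abs(filter3x3(noisy, np.mean) - clean).mean()
median_err = np.abs(filter3x3(noisy, np.median) - clean).mean()
print("mean filter error:", mean_err, "median filter error:", median_err)
```

The mean filter spreads each impulse over its neighbours and blurs the step edge, while the median filter discards isolated outliers and leaves the edge in place, so its error against the clean image is lower.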
| Feature | Image Sharpening | Image Smoothing |
|---|---|---|
| Definition | Enhances edges and fine details in an image. | Reduces noise and smooths variations in pixel values. |
| Purpose | To make edges and textures more prominent. | To remove noise and produce a visually smoother image. |
| Operation | Emphasizes high-frequency components using filters like Laplacian or high-pass filters. | Suppresses high-frequency components using filters like averaging, Gaussian or median filters. |
| Effect on Noise | Can amplify noise along with edges. | Reduces or removes noise effectively. |
| Common Filters | Laplacian, Sobel, Unsharp masking | Gaussian filter, median filter, averaging filter |
| Applications | Edge enhancement, feature extraction, medical imaging | Noise reduction, preprocessing for analysis, artistic smoothing |
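As a rough illustration of the two operations, the sketch below uses a 3×3 box blur for smoothing and unsharp masking for sharpening. The helper names and the test image are assumptions, and the sharpened output is left unclipped for clarity; a real pipeline would clamp it to the valid intensity range.

```python
import numpy as np

def box_blur(image):
    """Smoothing: average each pixel with its 3x3 neighbourhood."""
    padded = np.pad(image, 1, mode="edge")
    h, w = image.shape
    return sum(padded[dy:dy + h, dx:dx + w]
               for dy in range(3) for dx in range(3)) / 9.0

def unsharp_mask(image, amount=1.0):
    """Sharpening: add back the high-frequency detail that a blur removes."""
    return image + amount * (image - box_blur(image))

def max_horizontal_jump(image):
    # Largest difference between horizontally adjacent pixels
    return np.abs(np.diff(image, axis=1)).max()

# Step edge: intensity 50 on the left, 150 on the right
step = np.full((8, 8), 50.0)
step[:, 4:] = 150.0

smoothed = box_blur(step)
sharpened = unsharp_mask(step)
print(max_horizontal_jump(smoothed),
      max_horizontal_jump(step),
      max_horizontal_jump(sharpened))
```

Smoothing spreads the transition over several pixels, so the largest jump shrinks; sharpening overshoots on both sides of the edge, so the jump grows. The same overshoot is why sharpening also amplifies noise.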
| Feature | Erosion | Dilation |
|---|---|---|
| Definition | Shrinks or erodes object boundaries in a binary image. | Expands or grows object boundaries in a binary image. |
| Effect on Objects | Reduces size of foreground objects. | Increases size of foreground objects. |
| Effect on Holes | Enlarges background areas (makes holes bigger). | Shrinks background areas (fills small holes). |
| Structuring Element | Uses a kernel to remove pixels from object edges. | Uses a kernel to add pixels to object edges. |
| Applications | Removing small noise, separating objects, thinning. | Filling gaps, connecting components, smoothing object edges. |
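A minimal sketch of both operations on a boolean image, assuming a 3×3 square structuring element and treating everything outside the image as background (both are choices made here for illustration):

```python
import numpy as np

def _neighbourhoods(binary):
    """Stack the 3x3 neighbourhood of every pixel (outside the image = background)."""
    padded = np.pad(binary, 1, mode="constant", constant_values=False)
    h, w = binary.shape
    return np.stack([padded[dy:dy + h, dx:dx + w]
                     for dy in range(3) for dx in range(3)])

def erode(binary):
    # A pixel survives only if all pixels under the element are foreground
    return _neighbourhoods(binary).all(axis=0)

def dilate(binary):
    # A pixel is set if any pixel under the element is foreground
    return _neighbourhoods(binary).any(axis=0)

# 7x7 image containing a 3x3 foreground square
square = np.zeros((7, 7), dtype=bool)
square[2:5, 2:5] = True

print(erode(square).sum())    # 1: only the centre pixel survives
print(dilate(square).sum())   # 25: the square grows to 5x5
```

Applying `dilate(erode(x))` gives an opening (removes small specks) and `erode(dilate(x))` gives a closing (fills small holes), which is how the two operations are typically combined in practice.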
| Feature | Sobel | Prewitt | Canny |
|---|---|---|---|
| Definition | Computes edges by combining derivatives in x and y directions using a weighted kernel. | Computes edges using simple derivatives in x and y directions with uniform kernel. | Multi-stage edge detector using gradient, non-maximum suppression and hysteresis thresholding. |
| Kernel Size | Typically 3×3, weighted toward center. | Typically 3×3, uniform weights. | Uses gradient calculation (can use Sobel internally) plus additional processing steps. |
| Noise Sensitivity | Sensitive to noise; smoothing helps. | Sensitive to noise; less robust than Sobel. | Less sensitive due to Gaussian smoothing before edge detection. |
| Edge Localization | Moderate accuracy in locating edges. | Moderate accuracy; slightly less precise than Sobel. | High accuracy due to non-maximum suppression. |
| Complexity | Simple, fast | Simple, fast | More complex, slower than Sobel/Prewitt. |
| Output | Gradient magnitude map | Gradient magnitude map | Thin, precise edges after thresholding. |
| Applications | Basic edge detection, feature extraction | Basic edge detection, directional edges | Object detection, image segmentation, feature extraction requiring precise edges |
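The Sobel column can be made concrete with a short sketch that computes the gradient magnitude of a synthetic step edge. The helper names and edge-replicated borders are assumptions; Canny would add Gaussian smoothing, non-maximum suppression and hysteresis thresholding on top of a gradient like this one, none of which is implemented here.

```python
import numpy as np

# Sobel kernels: a central difference in one direction, weighted 1-2-1 in the other
SOBEL_X = np.array([[-1, 0, 1],
                    [-2, 0, 2],
                    [-1, 0, 1]], dtype=float)
SOBEL_Y = SOBEL_X.T

def correlate3x3(image, kernel):
    """Slide a 3x3 kernel over the image (edge-replicated borders)."""
    padded = np.pad(image, 1, mode="edge")
    h, w = image.shape
    out = np.zeros((h, w))
    for dy in range(3):
        for dx in range(3):
            out += kernel[dy, dx] * padded[dy:dy + h, dx:dx + w]
    return out

def sobel_magnitude(image):
    gx = correlate3x3(image, SOBEL_X)   # responds to vertical edges
    gy = correlate3x3(image, SOBEL_Y)   # responds to horizontal edges
    return np.hypot(gx, gy)

# Vertical step edge between columns 3 and 4
edge_image = np.zeros((8, 8))
edge_image[:, 4:] = 1.0

magnitude = sobel_magnitude(edge_image)
print(magnitude[0])
```

The response is two pixels wide (columns 3 and 4) rather than a single thin line, which is the kind of thick edge that Canny's non-maximum suppression is meant to thin.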
| Feature | Fast R-CNN | Faster R-CNN | Mask R-CNN |
|---|---|---|---|
| Region Proposal | Uses external methods (e.g., selective search) to generate region proposals. | Uses Region Proposal Network (RPN) to generate proposals internally. | Uses RPN like Faster R-CNN for region proposals. |
| Detection Process | Extracts features for each proposed region using RoIPool, then classifies and regresses bounding boxes. | Shares convolutional features between RPN and detection head, faster and more efficient. | Adds a mask prediction branch in parallel to classification and bounding box regression. |
| Segmentation Capability | No | No | Yes, provides pixel-level instance masks. |
| Speed | Slower due to external proposal generation. | Faster than Fast R-CNN due to integrated RPN. | Slightly slower than Faster R-CNN due to mask branch. |
| Output | Class labels + bounding boxes | Class labels + bounding boxes | Class labels + bounding boxes + instance masks |
| Applications | Object detection | Object detection | Object detection + instance segmentation |
Designing a face recognition system involves multiple stages, including data collection, preprocessing, feature extraction and classification. The goal is to accurately identify or verify individuals based on their facial features. Here’s a structured approach:
1. Data Collection
2. Face Detection
3. Face Alignment and Preprocessing
4. Feature Extraction
Extract a compact representation (embedding) for each face.
Methods:
The feature vector should be robust to pose, lighting and expression changes.
5. Feature Matching / Classification
6. Training Considerations
Data Augmentation: Apply rotations, flips, brightness adjustments or random crops to improve generalization.
Loss Functions for Deep Models:
7. System Deployment
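The feature matching step (5) can be sketched as a nearest-neighbour search over stored embeddings. Everything concrete here is hypothetical: the three-dimensional vectors stand in for embeddings that a trained network would produce in step 4, and the names, the cosine-similarity metric and the 0.6 threshold are illustrative choices rather than recommended values.

```python
import numpy as np

def cosine_similarity(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def identify(probe, gallery, threshold=0.6):
    """Return (name, score) of the closest enrolled face, or (None, score)
    when no enrolled embedding is similar enough."""
    best_name, best_score = None, -1.0
    for name, embedding in gallery.items():
        score = cosine_similarity(probe, embedding)
        if score > best_score:
            best_name, best_score = name, score
    if best_score < threshold:
        return None, best_score          # treat as an unknown person
    return best_name, best_score

# Hypothetical enrolled embeddings (real ones are typically 128-512 dimensions)
gallery = {
    "alice": np.array([1.0, 0.0, 0.0]),
    "bob": np.array([0.0, 1.0, 0.0]),
}

print(identify(np.array([0.9, 0.1, 0.0]), gallery))   # close to alice
print(identify(np.array([0.0, 0.0, 1.0]), gallery))   # matches nobody
```

The threshold controls the trade-off between false accepts and false rejects, so in a deployed system it would be tuned on a held-out validation set rather than fixed in advance.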
Overfitting occurs when a CNN performs well on the training data but poorly on unseen data, meaning it has memorized training features rather than learning general patterns. Several strategies can reduce overfitting:
1. Data Augmentation
2. Regularization Techniques
3. Reduce Model Complexity
4. Transfer Learning
5. Batch Normalization: Stabilizes learning and allows higher learning rates, indirectly reducing overfitting.
6. Cross-Validation: Use k-fold cross-validation to estimate model performance and detect overfitting.
7. Increase Dataset Size: Collect more data or use synthetic data generation to provide more examples for training.
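Data augmentation, the first strategy in the list, can be sketched with plain NumPy. The function below assumes float images in the range [0, 1] with shape height × width × channels; the crop size and brightness range are arbitrary illustrative values. In practice a framework's built-in transforms would normally be used instead.

```python
import numpy as np

def augment(image, rng, crop=28, max_brightness_shift=0.2):
    """Random horizontal flip, brightness shift and crop of an HxWxC image."""
    if rng.random() < 0.5:
        image = image[:, ::-1]                       # horizontal flip
    shift = rng.uniform(-max_brightness_shift, max_brightness_shift)
    image = np.clip(image + shift, 0.0, 1.0)         # keep values in [0, 1]
    h, w = image.shape[:2]
    top = rng.integers(0, h - crop + 1)              # upper bound is exclusive
    left = rng.integers(0, w - crop + 1)
    return image[top:top + crop, left:left + crop]

rng = np.random.default_rng(0)
original = rng.random((32, 32, 3))

# Each call produces a different view of the same training image
batch = np.stack([augment(original, rng) for _ in range(4)])
print(batch.shape)
```

Because the network sees a slightly different version of each image every epoch, it is harder for it to memorize individual training examples.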
K-Means clustering is an unsupervised method that can segment tumors based on pixel intensity differences in MRI scans.
1. Preprocessing:
2. Flatten Image:
3. Apply K-Means:
4. Reshape Clusters:
5. Post-processing:
The output is a segmented image in which the tumor region is highlighted for further analysis or classification.
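The flatten, cluster and reshape steps can be sketched on a synthetic image. This is only an illustration under stated assumptions: the "scan" is made-up data, K-Means is written by hand on scalar intensities, k=3 is chosen to match the synthetic image, and the tumor is assumed to be the brightest cluster, which will not hold for every MRI sequence. Preprocessing and post-processing (for example denoising and morphological clean-up) are omitted.

```python
import numpy as np

def kmeans_1d(values, k=3, iters=20):
    """Lloyd's algorithm on scalar intensities; returns labels and centroids."""
    # Deterministic initialisation: centroids spread over the intensity range
    centroids = np.linspace(values.min(), values.max(), k)
    for _ in range(iters):
        # Assign every pixel to its nearest centroid
        labels = np.abs(values[:, None] - centroids[None, :]).argmin(axis=1)
        # Move each centroid to the mean of its assigned pixels
        for j in range(k):
            if np.any(labels == j):
                centroids[j] = values[labels == j].mean()
    return labels, centroids

def segment_brightest(image, k=3):
    flat = image.reshape(-1).astype(float)       # flatten to a list of pixels
    labels, centroids = kmeans_1d(flat, k)
    mask = labels == centroids.argmax()          # assumption: tumor is brightest
    return mask.reshape(image.shape)             # reshape back to image layout

# Synthetic "scan": dark background, mid-grey tissue, bright lesion, mild noise
rng = np.random.default_rng(0)
scan = np.full((32, 32), 20.0)
scan[4:28, 4:28] = 100.0
truth = np.zeros((32, 32), dtype=bool)
truth[10:16, 12:18] = True
scan[truth] = 220.0
scan += rng.normal(scale=5.0, size=scan.shape)

tumor_mask = segment_brightest(scan)
print(tumor_mask.sum(), "pixels flagged as lesion")
```

On real scans intensity alone rarely separates tumor from healthy tissue this cleanly, so the resulting mask is usually a starting point for further refinement.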