The development of Artificial Intelligence is progressing rapidly to close the disparity between the capabilities of people and computers. Individuals involved in the field, both specialists and hobbyists, study multiple aspects of the subject to make incredible things happen, one of them being Computer Vision.
The objective of the field of Computer Vision is to enable machines to view the world as humans do, perceive it in a similar manner, and put that knowledge to use for a variety of tasks, such as Image & Video recognition, Image Analysis & Classification, Media Recreation, Recommendation Systems, Natural Language Processing, etc. The advances in Computer Vision with Deep Learning have been built up and refined over time, largely around one particular algorithm: the Convolutional Neural Network.
A Convolutional Neural Network (ConvNet/CNN) is a Deep Learning algorithm that can take in an image, assign importance (learnable weights and biases) to various features and objects within it, and distinguish them from one another. The pre-processing a ConvNet requires is much lower than that required by other classification algorithms. Where filters were traditionally engineered by hand, a ConvNet can learn these filters/characteristics on its own given enough training.
The architecture of a ConvNet is analogous to the connectivity pattern of neurons in the human brain and was inspired by the organization of the Visual Cortex. Individual neurons respond to stimuli only within a restricted region of the visual field known as the Receptive Field. A collection of such fields overlaps to cover the entire visual area.
Which is better, ConvNets or Feed-Forward Neural Networks?
Although an image is essentially just a grid of pixel values, flattening it into a vector (for example, a 3×3 image matrix would become a 9×1 vector) and feeding it to a Multi-Layer Perceptron (MLP) is not an ideal approach to classification.
For extremely simple binary images, this method may reach a satisfactory accuracy when predicting classes, but it will have little to no accuracy on more complicated images whose pixels depend on one another throughout.
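The flattening step mentioned above can be illustrated with a minimal NumPy sketch. The pixel values are made up for illustration and are not taken from the article; the point is only that the 2D neighbourhood structure disappears once the grid becomes a single column.

```python
import numpy as np

# A toy 3x3 single-channel "image"; the values are illustrative only.
image = np.array([[1, 0, 1],
                  [0, 1, 0],
                  [1, 0, 1]])

# Flatten row by row into a 9x1 column vector, the shape an MLP input expects.
flat = image.reshape(-1, 1)

print(flat.shape)  # (9, 1)
```

After flattening, pixels that were vertical neighbours in the grid sit three positions apart in the vector, so the network has no built-in notion of which inputs were adjacent.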
A ConvNet is able to capture the spatial and temporal dependencies in an image by applying relevant filters. The architecture fits the image dataset better because it involves fewer parameters and reuses its weights. In other words, the network can be trained to understand the sophistication of the image more efficiently.
The diagram displays an RGB image that has been separated into its three color planes: Red, Green, and Blue. Images come in many such color spaces, including Grayscale, RGB, HSV, CMYK, and so on.
You can imagine how computationally demanding things become once images reach a size such as 8K (7680×4320). The role of the ConvNet is to reduce the images to a form that is easier to process without losing the features that are critical for a good prediction. This matters when we are designing an architecture that is not only good at learning features but also scalable to massive datasets.
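To put rough numbers on that claim, here is a small back-of-the-envelope sketch. It assumes an 8K image with three color channels and compares one fully connected neuron against one convolutional filter; the 3x3 filter size is an assumption chosen for illustration.

```python
# Dimensions of an 8K RGB image.
height, width, channels = 4320, 7680, 3

# Number of raw input values the network would have to consume.
input_values = height * width * channels

# A single fully connected neuron needs one weight per input value.
dense_weights_per_neuron = input_values

# A single 3x3 filter (assumed size) spans all channels and is reused
# at every position, so its weight count does not depend on image size.
conv_weights_per_filter = 3 * 3 * channels

print(input_values)             # 99532800
print(conv_weights_per_filter)  # 27
```

The gap between roughly 100 million weights per neuron and 27 weights per filter is what weight reuse buys.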
Convolution Layer — The Kernel
Image Dimensions = 5 (Height) x 5 (Breadth) x 1 (Number of channels; an RGB image would have 3)
In the example, the green section represents our 5 x 5 x 1 input image, I. The element that carries out the convolution operation in the first part of the Convolutional Layer is called the Kernel/Filter, K, shown in yellow. We have chosen K to be a 3 x 3 x 1 matrix.
Kernel/Filter, K =
1 0 1
0 1 0
1 0 1
The Kernel shifts 9 times because the Stride Length is 1 (Non-Strided): a 3 x 3 kernel fits into a 5 x 5 image at 3 x 3 = 9 positions. At each position it performs an element-wise multiplication (Hadamard Product) between K and the portion P of the image that the kernel currently covers, then sums the products into a single output value.
The filter moves to the right by the Stride Value until it has scanned the full width. It then hops down by the same Stride Value to the beginning (left) of the image and repeats the process until the whole image has been traversed.
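The sliding procedure described above can be sketched in plain NumPy. The kernel is the K given earlier; the 5x5 image values are made up for illustration, and the function name `conv2d_single_channel` is hypothetical. Like most deep learning libraries, this sketch does not flip the kernel, so strictly speaking it computes a cross-correlation.

```python
import numpy as np

def conv2d_single_channel(image, kernel, stride=1):
    """Slide `kernel` over a 2D `image` without padding."""
    k_h, k_w = kernel.shape
    out_h = (image.shape[0] - k_h) // stride + 1
    out_w = (image.shape[1] - k_w) // stride + 1
    output = np.zeros((out_h, out_w))
    for row in range(out_h):          # hop down after each full sweep
        for col in range(out_w):      # move right across the width
            top, left = row * stride, col * stride
            patch = image[top:top + k_h, left:left + k_w]
            # Element-wise product of kernel and patch, then sum.
            output[row, col] = np.sum(patch * kernel)
    return output

# Illustrative 5x5x1 image (values are an assumption, not from the article).
image = np.array([[1, 1, 1, 0, 0],
                  [0, 1, 1, 1, 0],
                  [0, 0, 1, 1, 1],
                  [0, 0, 1, 1, 0],
                  [0, 1, 1, 0, 0]])

kernel = np.array([[1, 0, 1],
                   [0, 1, 0],
                   [1, 0, 1]])

feature_map = conv2d_single_channel(image, kernel, stride=1)
# feature_map holds [[4, 3, 4], [2, 4, 3], [2, 3, 4]] (as floats)
print(feature_map)
```

With a stride of 1 the kernel visits 9 positions, giving a 3x3 output; passing `stride=2` would visit only 4 positions and give a 2x2 output.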
For images with multiple channels (e.g. RGB), the Kernel has the same depth as the input image. Element-wise multiplication is performed between each pair in the Kn and In stack ([K1, I1]; [K2, I2]; [K3, I3]), and all the results are summed together with the bias to produce a squashed, single-depth-channel Convolved Feature Output.
The goal of the Convolution Operation is to extract features, such as edges, from the input image. ConvNets need not be limited to a single Convolutional Layer. Conventionally, the first ConvLayer captures Low-Level features such as edges, colors, and gradient orientation. With added layers, the architecture adapts to High-Level features as well, giving us a network with a thorough understanding of the images in the dataset, similar to our own.
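A minimal sketch of the multi-channel case follows, assuming a channels-last layout (height, width, channels), a stride of 1, and no padding. The function name `conv2d_multi_channel` and the random inputs are hypothetical and only serve to show the shapes involved.

```python
import numpy as np

def conv2d_multi_channel(image, kernel, bias=0.0):
    """Convolve an (H, W, C) image with a (kH, kW, C) kernel, stride 1."""
    k_h, k_w, _ = kernel.shape
    out_h = image.shape[0] - k_h + 1
    out_w = image.shape[1] - k_w + 1
    output = np.zeros((out_h, out_w))
    for row in range(out_h):
        for col in range(out_w):
            patch = image[row:row + k_h, col:col + k_w, :]
            # Multiply each channel slice by its kernel slice, sum over
            # all positions and channels, then add the bias.
            output[row, col] = np.sum(patch * kernel) + bias
    return output

rng = np.random.default_rng(0)
rgb_image = rng.integers(0, 256, size=(5, 5, 3)).astype(float)
kernel = rng.standard_normal((3, 3, 3))

feature_map = conv2d_multi_channel(rgb_image, kernel, bias=1.0)
print(feature_map.shape)  # (3, 3)
```

Note that the three input channels collapse into a single output channel; a layer with several filters would produce one such feature map per filter.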
The operation has two kinds of results depending on the padding used: with Valid Padding the convolved feature is smaller than the input, while with Same Padding its dimensions stay the same as the input (padding beyond that would make the output larger).
If we pad the 5x5x1 image with a one-pixel border, making it 7x7x1, and then apply the 3x3x1 kernel to it, the resulting convolved matrix is 5x5x1, the same size as the original input; this is why it is referred to as “Same Padding”.
However, if we perform the same operation without padding, the convolved matrix is 3x3x1 (5 - 3 + 1 = 3 along each side), which here happens to match the kernel's dimensions; this is known as Valid Padding.
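The output sizes in these two cases follow from the standard relation output = floor((n + 2p - k) / s) + 1, where n is the input size, k the kernel size, p the padding added on each side, and s the stride. A small sketch, with `conv_output_size` as a hypothetical helper name:

```python
import numpy as np

def conv_output_size(n, k, padding=0, stride=1):
    """Output length along one dimension of a convolution."""
    return (n + 2 * padding - k) // stride + 1

# Valid Padding: no padding, so a 5x5 input shrinks to 3x3.
print(conv_output_size(5, 3, padding=0))  # 3

# Same Padding for a 3x3 kernel at stride 1: one pixel on each side.
print(conv_output_size(5, 3, padding=1))  # 5

# Zero-padding a 5x5 image by one pixel on every side.
padded = np.pad(np.zeros((5, 5)), pad_width=1, mode="constant")
print(padded.shape)  # (7, 7)
```

For an odd kernel size k at stride 1, a padding of (k - 1) / 2 on each side keeps the output the same size as the input.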
This collection contains numerous GIFs which can assist in comprehending how Padding and Stride Length collaborate to attain the desired outcomes.
Like the Convolutional Layer, the Pooling layer is responsible for reducing the spatial size of the Convolved Feature. Shrinking the representation decreases the computational power required to process the data. Pooling is also useful for extracting dominant features that are rotationally and positionally invariant, which helps the model train effectively.
When considering Pooling, there are two distinct types. The first is Max Pooling, which provides the maximum value from any region of the image that is encompassed by the Kernel. The second is Average Pooling, which provides the mean of all the values from an area of the image that is enclosed by the Kernel.
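Both pooling types can be sketched with one small NumPy function. The 2x2 window, the stride of 2, the function name `pool2d`, and the feature-map values are all assumptions made for illustration.

```python
import numpy as np

def pool2d(feature_map, size=2, stride=2, mode="max"):
    """Apply max or average pooling to a 2D feature map."""
    out_h = (feature_map.shape[0] - size) // stride + 1
    out_w = (feature_map.shape[1] - size) // stride + 1
    reduce_fn = np.max if mode == "max" else np.mean
    output = np.zeros((out_h, out_w))
    for row in range(out_h):
        for col in range(out_w):
            top, left = row * stride, col * stride
            window = feature_map[top:top + size, left:left + size]
            output[row, col] = reduce_fn(window)
    return output

# Illustrative 4x4 convolved feature (values are made up).
feature_map = np.array([[1, 3, 2, 4],
                        [5, 6, 1, 2],
                        [7, 2, 9, 1],
                        [3, 4, 5, 6]], dtype=float)

max_pooled = pool2d(feature_map, mode="max")
avg_pooled = pool2d(feature_map, mode="avg")
# max_pooled holds [[6, 4], [7, 9]]
# avg_pooled holds [[3.75, 2.25], [4.0, 5.25]]
print(max_pooled)
print(avg_pooled)
```

Either way, the 4x4 input is reduced to 2x2; the two modes differ only in how each window is summarized.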
Max Pooling also acts as a noise suppressant: it discards noisy activations altogether while reducing dimensionality. Average Pooling, by contrast, suppresses noise only through the averaging that comes with dimensionality reduction. For this reason, Max Pooling often performs better than Average Pooling in practice.
The Convolutional Layer and the Pooling Layer together create the i-th layer of a Convolutional Neural Network. If the images contain high levels of complexity, the number of layers can be increased in order to accurately capture all the low-level details, though this will require more computational power.
After going through the above process, we have successfully enabled the model to understand the features. Moving on, we are going to flatten the final output and feed it to a regular Neural Network for classification purposes.
Classification — Fully Connected Layer (FC Layer)
Adding a Fully-Connected layer is a (usually) cheap way of learning non-linear combinations of the high-level features represented by the output of the convolutional layer. The Fully-Connected layer learns a possibly non-linear function in that space.
Now that the input image has been converted into a form suitable for our Multi-Layer Perceptron, we flatten it into a column vector. The flattened output is fed to a feed-forward neural network, and backpropagation is applied at every iteration of training. Over a series of epochs, the model learns to distinguish between dominant and certain low-level features in images and classifies them using the Softmax Classification technique.
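The forward pass of this classification stage can be sketched as follows. This is only an illustration under stated assumptions: the pooled output shape (2, 2, 8), the three classes, and the random weights are all hypothetical, and the training step (backpropagation) is left out.

```python
import numpy as np

def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    shifted = logits - np.max(logits)  # subtract max for numerical stability
    exp = np.exp(shifted)
    return exp / np.sum(exp)

rng = np.random.default_rng(0)

# Hypothetical output of the last pooling layer: 2x2 spatial, 8 channels.
pooled = rng.standard_normal((2, 2, 8))

# Flatten into a single vector of 32 features.
flat = pooled.reshape(-1)

# One fully connected layer mapping 32 features to 3 class scores.
num_classes = 3
weights = rng.standard_normal((num_classes, flat.size)) * 0.1
bias = np.zeros(num_classes)
logits = weights @ flat + bias

probabilities = softmax(logits)
predicted_class = int(np.argmax(probabilities))
print(probabilities, predicted_class)
```

During training, the weights and bias would be updated by backpropagation rather than left at their random initial values.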
Different structures of Convolutional Neural Networks (CNNs) are essential for creating algorithms that currently power Artificial Intelligence (AI) and will continue to do so in the future. A few of these are mentioned below: