
Convolutional neural networks
Convolutional neural networks or ConvNets, are deep neural networks that have provided successful results in computer vision. They were inspired by the organization and signal processing of neurons in the visual cortex of animals, that is, individual cortical neurons respond to the stimuli in their concerned small region (of the visual field), called the receptive field, and these receptive fields of different neurons overlap altogether covering the whole visual field.
When the input in an input space contains the same kind of information, then we share the weights and train those weights jointly for those input. For spatial data, such as images, this weight-sharing leads to CNNs. Similarly, for a sequential data, such as text, we witnessed this weight-sharing in RNNs.
CNNs have wide applications in the field of computer vision and natural language processing. As far as the industry is concerned, Facebook uses it in their automated image-tagging algorithms, Google in their image search, Amazon in their product recommendation systems, Pinterest to personalize the home feeds, and Instagram for image search and recommendations.
Just like a neuron (or node) in a neural network receives the weighted aggregation of the signals say input from the last layer which then subjected to an activation function leading to an output. Then we backpropagate to minimize our loss function. This is the basic operation that is applied to any kind of neural network, so it will work for CNNs.
Unlike neural networks, where an input is in the form of a vector, CNNs have images as input that are multi-channeled, that is, RGB (three channels: red, green, and blue). Say there's an image of pixel size a × b, then? ? the? ? actual? ? tensor? ? representation? ? would? ? be? ? of? ? an? ? a × b × 3 shape.
Let's say you have an image similar to the one shown here:

It can be represented as a flat plate that has width, height, and because of the RGB channel, it has a depth of three. Now, take a small patch of this image, say 2 × 2, and run a tiny neural network on it with an output depth of, say, k. This will result in a representation patch of shape 1× 1 × k . Now, slide this neural network horizontally and vertically over the whole image without changing the weights results in another image of different width, height, and depth k (that is, now we have k channels).
This integration task is collectively termed as convolution. Generally, ReLUs are used as the activation function in these neural networks:

The sliding motion of the patch over the image is called striding, and the number of pixels you shift each time, whether horizontally or vertically, is called a stride. Striding if the patch doesn't go outside the image space it is regarded as a valid padding. On the other hand, if the patch goes outside the image space in order to map the patch size the pixels of the patch which are off the space are padded with zeros. This is called same padding.

CNN architecture consists of a series of these convolutional layers. The striding value in these convolutional layers if greater than 1 causes spatial reduction. Thus, stride, patch size, and the activation function become the hyperparameters. Along with convolutional layers, one important layer is sometimes added, it is called the pooling layer. This takes all the convolutions in a neighborhood and combines them. One form of pooling is called max pooling.
In max pooling, the feature map looks around all the values in the patch and returns the maximum among them. Thus, pooling size (that is, pooling patch/window size) and pooling stride are the hyperparameters. The following image depicts the concept of max pooling:

Max pooling often yields more accurate results. Similarly, we have average pooling, where instead of maximum value we take the average of the values in the pooling window providing a low resolution view of the feature map.
Manipulating the hyperparameters and ordering of the convolutional layers, by pooling and fully connected layers, many different variants of CNNs have been created which are being used in research and industrial domains. Some of the famous ones among them are the LeNet-5, Alexnet, VGG-Net, and Inception model.