- Pawit Kochakarn

# Machine Learning with Python : Image Classifier using VGG16 Model - Part 1: Theory

In this post, we will be looking at how to program a simple image classifier using a pre-trained model called VGG16 from Keras and the popular computer vision library OpenCV.

Keras is an open-source neural network library for Python that contains models for tackling a wide range of data problems. For image classification, we will be using VGG16, a 16-layer deep convolutional network created by the Visual Geometry Group (VGG) team for the ILSVRC-2014 competition. We will run Keras and OpenCV inside the same TensorFlow environment I used in my last post to demonstrate a simple linear model. Before we dive into the code in the next post, let's look at the basic architecture of a convolutional neural network (CNN), and in particular the VGG16 model.

What is a Convolutional Neural Network (CNN)?

Just like a basic neural network, a CNN takes in the raw pixel data of an image as input and performs classification on it to predict what the image best represents (class scores). Each node of the CNN still performs dot products on its inputs, and the network is trained against a loss function (e.g. cross-entropy on softmax outputs). However, the major difference between a CNN and a simple perceptron is that CNNs are made up of multiple *convolution*, *max-pooling* and *fully-connected* layers, each with its own function within the algorithm. I will explain each layer in detail later.

Furthermore, each CNN layer is arranged in three dimensions (width, height, depth). The width and height represent the pixel dimensions of the image, and the depth represents the three colour channels (RGB). For example, an image input could have the dimensions 28x28x3. By the final layer, the output is reduced to a single vector of class scores (predictions), e.g. 1x1x8.
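To make those shapes concrete, here is a minimal NumPy sketch of the 28x28x3 input and 8-class output described above. The numbers are arbitrary, and the softmax at the end simply shows how raw scores become the class probabilities a CNN outputs:

```python
# Illustrative only: the input/output shapes of a CNN, using the
# 28x28x3 image and 8-class example from the text (values are random).
import numpy as np

rng = np.random.default_rng(0)
image = rng.random((28, 28, 3))   # height x width x RGB channels
logits = rng.random(8)            # raw class scores, one per category

# Softmax turns raw scores into class probabilities that sum to 1.
probs = np.exp(logits) / np.exp(logits).sum()

print(image.shape)   # (28, 28, 3)
print(probs.shape)   # (8,)
```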

Basic Architecture

I will now explain the main function that each of the layers mentioned above performs. A basic CNN is ordered as follows: INPUT - CONVOLUTIONAL - RELU - POOL - FC.

- The Input layer holds the raw pixel values of the image we want to classify, as a 3D matrix (e.g. 28x28x3).

- The Convolutional layer computes the output of each neuron in the layer by performing a dot product between a small chunk of the input and the corresponding weights. A filter (kernel) slides over the input matrix to localize each chunk for the dot product.

- The RELU layer applies our activation function to the dot-product output from the previous layer. This layer leaves the shape of the output matrix unchanged, however.

- The Pool layer downsamples the spatial dimensions (width and height only) of the matrix, resulting in a smaller output.

- Finally, the FC (fully-connected) layer computes the class scores, resulting in a 1D vector of numbers, one per possible category (e.g. 10 in a 10-category problem). The output consists of the probabilities that the image belongs to each of the categories.
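The layer ordering above can be sketched as a small `tf.keras` model. The filter count, kernel size and class count here are illustrative choices, not part of VGG16:

```python
# A minimal sketch of the INPUT - CONV - RELU - POOL - FC ordering
# described above, written with tf.keras. All sizes are illustrative.
import tensorflow as tf
from tensorflow.keras import layers, models

model = models.Sequential([
    layers.Input(shape=(28, 28, 3)),               # INPUT: 28x28 RGB image
    layers.Conv2D(32, (3, 3), activation="relu"),  # CONV + RELU: 3x3 filters
    layers.MaxPooling2D((2, 2)),                   # POOL: halves width and height
    layers.Flatten(),                              # unroll to a 1D vector
    layers.Dense(10, activation="softmax"),        # FC: 10 class probabilities
])

model.summary()
```

Note how the pooling layer shrinks only width and height, while the final Dense layer collapses everything down to the 1D vector of class scores.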

Below visualizes how the multi-layer CNN takes in an image of a sports car to produce a class score for it:

VGG16 Architecture

VGG16, as I introduced earlier, is a 16-layer CNN designed by Oxford's Visual Geometry Group. The network uses 3x3 convolution and 2x2 pooling layers throughout, and the good thing is that it's open sourced, so anyone can use it as they like. Since it comes pre-trained, the weights don't need much tweaking; it is already good to go. In the next post I will show you how to use this network to classify any set of images you like on your computer.
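As a preview of the next post, here is a hedged sketch of loading the pre-trained VGG16 from Keras and running a prediction. A random array stands in for a real photo here; in the next post we will load actual images instead:

```python
# Sketch: load pre-trained VGG16 and classify one (placeholder) image.
import numpy as np
from tensorflow.keras.applications.vgg16 import (
    VGG16, preprocess_input, decode_predictions)

# Loads the 16-layer network with its pre-trained ImageNet weights
# (downloaded automatically on first use).
model = VGG16(weights="imagenet")

# VGG16 expects a batch of 224x224 RGB images. A random array stands in
# for a real photo, so the predicted labels here are meaningless.
x = preprocess_input(np.random.uniform(0, 255, (1, 224, 224, 3)))

preds = model.predict(x)                    # shape (1, 1000): one score per ImageNet class
print(decode_predictions(preds, top=3)[0])  # top-3 (class_id, label, probability) tuples
```

Because the weights are already trained on ImageNet's 1000 classes, no training step is needed before predicting.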

Now obviously I haven't gone into much detail about the maths behind the computations involved, particularly in the Conv and RELU layers, since I don't want to go too deep into theory. However, if you want to know more about the maths that goes on in CNNs, the Stanford CS231n course notes would be the perfect resource: click __here__