Summer Deep Learning School Session 1

5 minute read

Math behind Deep Learning and Introduction to Keras

This post contains a basic introduction to Deep Learning and will also teach you to implement a simple Neural Network in Keras.

The session note book can be found here

ML Techniques

A Broad division

classification: predict class from observations
clustering: group observations into “meaningful” groups
regression (prediction): predict value from observations

Some Terminology

Features – The number of features or distinct traits that can be used to describe each item in a quantitative manner.
Samples – A sample is an item to process (e.g. classify). It can be a document, a picture, a sound, a video, a row in database or CSV file, or whatever you can describe with a fixed set of quantitative traits.
Feature vector – An n-dimensional vector of numerical features that represent some object.
Feature extraction – Preparation of feature vector and transformation of data in high-dimensional space to a space of fewer dimensions.
Training/Evolution set – Set of data to discover potentially predictive relationships.

Introduction to Keras

Keras is a high-level neural network API, written in Python. It was developed with a focus on enabling fast experimentation.

Any program written with Keras has 5 steps:

Define the model
Compile
Fit (train)
Evaluate
Predict

Here, we will discuss the implementation of a simple classification task using Keras, with a softmax (logistic) classifier.

Linear Classifier

A linear classifier uses a score function f of the form \begin{equation} f(x_i,W,b)=Wx_i+b \end{equation} where the matrix W is called the weight matrix and b is called the bias vector.

Softmax Classifier

We have a dataset of points \((x_i,y_i)\), where \(x_i\) is the input and \(y_i\) is the output class. Consider a linear classifier for which class scores f can be found as \(f(x_i;W) = Wx_i\). For this multi-class classification problem, let us denote the score for the class j as \(s_j = f(x_i;W)_j\), in short as \(f_j\).

Let \(z = Wx\). Then the scores for class j will be computed as: \begin{equation} s_j(z) = \frac{e^{z_j}}{\Sigma{_k}{e^{z_k}}} \end{equation} This is known as the softmax function.

If you’ve heard of the binary Logistic Regression classifier before, the Softmax classifier is its generalization to multiple classes. The Softmax classifier gives an intuitive output (normalized class probabilities) and also has a probabilistic interpretation that we will describe shortly. In the Softmax classifier, we interpret the scores as the unnormalized log probabilities for each class and have a cross-entropy loss that has the form:

\begin{equation} L_i = -log(\frac{e^{f_{y_i}}}{\Sigma{_j}{e^{f_j}}}) \end{equation}

Example

Let us consider a dataset with 4 input features and 3 output classes. So, the shape of the weight matrix (W) is 3x4 and that of the input vector \((x_i)\) is 4x1. Therefore, we get an output of shape 3x1, which is given by \(Wx+b\). Also, for this particular \(x_i\), the output class \((y_i)\) is 2 \((3^{rd} class)\).

Now we have the unnormalized class scores. Now we will take the softmax function of these class scores. Finally, we can observe that the normalized class scores sum to 1. The loss function is given by \(-log(s_2(z)) = -log(0.353) = 1.04\).

Now that you have understood the concept behind softmax classifier, let’s jump into the code.

Let us create our dataset using scikit-learn’s make_blobs function. The number of points in our dataset is 4000 and we have 4 classes as shown below. The dimension of the data (no.of features) is chosen to be 2, so that it is easy to plot and visualize. In reality, image classification tasks will have a very large number of dimensions.

alt

We will use scikit-learn’s train_test_split function to split our dataset into train and test, with a ratio of 4:1. After this, we will convert our labels (y) into one-hot vectors before passing it into our classifier model.

This is where we will define our model.

The Sequential model is a linear stack of layers to which layers can be added using the add() function.

The Dense function here, takes 3 parameters - no.of output classes, input dimension, type of actvation.

Epoch: One pass of the entire set of training examples.

Batch size: Number of training examples used for one pass (iteration).

The compile function configures the model for training with the appropriate optimizer and loss function. Here we have selected categorical cross-entropy loss since it is a multi-class classification problem.

The fit function trains the model for the specified number of epochs and batch size.

The evaluate computes the accuracy on the test set after training.

Train on 3200 samples, validate on 800 samples

After 10 epochs:

Test score: 0.18148446649312974

Test accuracy: 0.9225

Now, let us generate the predictions.

The predict function returns a 4-element vector of softmax probabilities for each class. The correct prediction is obtained using numpy’s argmax() function, which finds the index of the maximum element in the vector.

Softmax predictions:

\[\begin{bmatrix} 1.3248611e-03 & 2.5521611e-09 & 9.9851722e-01 & 1.5793661e-04 \\ 6.4598355e-03 & 9.7919554e-05 & 5.8634079e-04 & 9.9285591e-01\\ \vdots & \vdots & \vdots & \vdots\\ 1.2989193e-03 & 6.6605605e-09 & 9.8468834e-01 & 1.4012725e-02\\ 6.4598355e-03 & 9.7919554e-05 & 5.8634079e-04 & 9.9285591e-01\\ \end{bmatrix}\]

Class with maximum probability:

\[\begin{bmatrix} 2 & 3 & \cdots &3 &1 \\ 0 &1 & \cdots & 3 & 3 \\ \vdots \\ 1& 3& \cdots & 2 & 3\\ \end{bmatrix}\]

alt