Notes for Deep Learning with Python
Chapter 1 - What is deep learning?
Deep stands for the depth of the successive layers of representations learned from the data, starting from the input and ending at a useful output that we identify as the outcome of the learning system.
Terms
- Loss Function
- A function to measure the difference between the actual and the expected outputs
- Synonyms: Objective Function, Cost Function
- Optimizers
- The algorithm that implements backpropagation to adjust the weights, with the purpose of reducing the loss function.
- Weights
- Parameters of the various connections between the processing units (the neurons).
- Kernel Function (SVM)
- A computationally tractable function that maps any two points in the input space to their distance in the target representation space.
- Random Forests
- Ensemble of specialized decision trees.
- Gradient Boosting
- Ensemble of weak prediction models, built iteratively so that each new model addresses the weaknesses of the models added so far.
- Feature Engineering
- Transforming the input data (with mostly manual intervention) into forms that machine learning algorithms can consume. A step now greatly automated / made redundant with deep learning.
- CUDA
- A programming interface for Nvidia's GPUs. CUDA implementations of neural-network matrix operations started appearing around 2011.
- TPU
- Google's Tensor Processing Unit chip project, meant to be faster and more energy efficient than GPUs for neural network matrix operations.
- Activation Functions
- A class of functions that transform a neuron's input signal into its output signal.
History
Improvements in the gradient backpropagation algorithms were key in allowing faster training of more complex (deep) neural networks. Key aspects were
- Newer types of activation functions being tried
- Better weight initialization schemes
- Better optimization schemes like RMSProp and Adam
- Better ways to help gradient propagation
- Batch normalization
- Residual Connections
- Depth-wise separable convolutions
Chapter 2 - The mathematical building blocks of neural networks
Terms
- Tensor
- Multi-dimensional arrays. Tensors generalize vectors (1-D arrays) and matrices, which are rank-1 and rank-2 tensors respectively. By extension, scalars are rank-0 tensors.
- Differentiation
- Yes. Calculus.
- Gradient Descent
- Moving towards a (local) minimum of the loss function in search of the right-fit parameters. Calculus.
- Overfitting
- Learning too keenly from the training data, leading to a lower ability to generalize.
- Batch Axis
- Also called the Batch Dimension. The axis along which individual data samples are listed - typically axis 0 - and along which batching happens.
- Random initialization
- The process of filling the weight matrices with random values when they are first created.
Environment setup
Code samples here require some local setup to have been done, instructions for which are in Chapter 3.
Let's first activate the virtualenv from within Emacs (using pyvenv).
(pyvenv-activate "~/.venv/dl")
Next, we install the requisite Python packages with pip.
source ~/.venv/dl/bin/activate
pip install tensorflow keras matplotlib
python
from keras.datasets import mnist
(train_images, train_labels), (test_images, test_labels) = mnist.load_data()
print(train_labels[0:10])
print(test_labels[0:10])
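As a quick check (assuming the download succeeded), the loaded arrays are plain NumPy tensors whose rank, shape, and element type can be inspected directly.
print(train_images.ndim)   # 3 -> a rank-3 tensor
print(train_images.shape)  # (60000, 28, 28)
print(train_images.dtype)  # uint8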
With this setup, most of the code samples in the second chapter can be executed.
Real-world examples of data tensors
Most of the data-sets we will use fall into one of the following categories (sketched with NumPy after this list of terms)
- Vector data
- Rank-2 tensors of shape
(samples, features)
- Timeseries/Sequence data
- Rank-3 tensors of shape
(samples, timesteps, features)
- Images
- Rank-4 tensors of shape
(samples, height, width, channels)
or (samples, channels, height, width)
- Video
- Rank-5 tensors of shape
(samples, frames, height, width, channels)
or (samples, frames, channels, height, width)
- Learning rate
- A scalar factor that modulates the speed at which "learning" happens - in other words, it controls the rate at which the model converges to a steady state. Too high, and the learning may be unstable and never converge. Too low, and you may need to wait a lifetime for the learning to finish.
- Gradient Descent
- Descend along the gradient of the loss, in search of a (local) minimum.
- Stochastic Gradient Descent
- Gradient descent performed on randomly selected batches of the training data, with varying batch sizes possible.
- Backpropagation algorithm
- Applying the chain rule to the computation of the gradient values.
- Computation graph
- A data structure that forms the core of TensorFlow: a directed acyclic graph of tensor operations.
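A minimal NumPy sketch of the data-tensor shapes listed above; the sizes are made up purely for illustration.
import numpy as np

vector_data = np.zeros((1000, 20))          # (samples, features)
timeseries  = np.zeros((250, 390, 3))       # (samples, timesteps, features)
images      = np.zeros((128, 64, 64, 3))    # (samples, height, width, channels)
video       = np.zeros((4, 16, 64, 64, 3))  # (samples, frames, height, width, channels)

for name, t in [("vector", vector_data), ("timeseries", timeseries),
                ("images", images), ("video", video)]:
    print(name, t.ndim, t.shape)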
Tensor operations
- Element-wise operations
- Broadcasting
- Tensor product - Also called the dot product
- Tensor reshaping
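A small NumPy sketch of these four operations (the arrays here are arbitrary examples).
import numpy as np

x = np.random.random((2, 3))
y = np.random.random((2, 3))

# Element-wise operations: applied independently to each entry.
z = x + y
z = np.maximum(z, 0.)        # element-wise relu

# Broadcasting: the rank-1 bias is repeated along the missing (sample) axis.
bias = np.random.random((3,))
z = x + bias

# Tensor (dot) product: contracts the last axis of x with the first axis of w.
w = np.random.random((3, 4))
z = np.dot(x, w)             # shape (2, 4)

# Reshaping: same entries, different shape.
z = z.reshape((4, 2))
print(z.shape)               # (4, 2)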
Gradient-based optimization
If the loss function - a function of the weights of the neural network, amongst other parameters - represents a surface over the various values of the weight matrices, then traversing in a direction that reduces its value for our inputs (training samples) is akin to descending that surface towards some (local) minimum. That is gradient descent, or gradient-based optimization.
Conceptually, a training step looks like the following for neural networks (sketched in code after the list)
- Take a batch of input samples x and the corresponding targets y_true
- Run the model on x to obtain the predictions y_pred (the forward pass)
- Compute the loss for this batch - a measure of the mismatch between y_true and y_pred
- Update all weights of the model slightly, in a direction that reduces the loss on this batch.
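A sketch of this loop using TensorFlow's GradientTape; model, loss_fn, and the batch tensors are placeholders for whatever model and data you are training (this is the idea, not the book's exact code).
import tensorflow as tf

def training_step(model, loss_fn, x_batch, y_true, learning_rate=1e-3):
    # Forward pass: record the operations so gradients can be computed later.
    with tf.GradientTape() as tape:
        y_pred = model(x_batch)
        loss = loss_fn(y_true, y_pred)
    # Backward pass: gradient of the loss w.r.t. every trainable weight.
    gradients = tape.gradient(loss, model.trainable_weights)
    # Move each weight a small step against its gradient.
    for w, g in zip(model.trainable_weights, gradients):
        w.assign_sub(learning_rate * g)
    return loss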
Stochastic gradient descent
In this method, we compute the gradient of the loss with respect to the model's parameters; this step is called the backward pass. The weights are then moved in the direction that reduces the loss, by an amount governed by the learning rate. When done on random sub-sets (batches) of the training data, this is called mini-batch stochastic gradient descent, or mini-batch SGD. The stochastic part refers to the randomness in the selection of each batch. One can also go with a batch of a single sample (true SGD) or the entire training data in one go (the latter being called batch SGD).
There are multiple variants of SGD, focused on taking into account the previous weight updates while computing the next weight update, rather than looking only at the current gradient. This is where the concept of momentum arises. All intuition-based approaches, I suppose. Other example variants include Adagrad and RMSProp.
Momentum addresses two issues
- Convergence speed
- Local minima
The intuition is that if the update velocity is large (high momentum), then the chances of descending into a local minimum and staying stuck there are reduced - the momentum carries you over the local hills.
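A toy sketch of SGD with momentum on a one-dimensional loss, loss(w) = (w - 3)^2; the constants are arbitrary, but it shows how the velocity term accumulates past gradients.
learning_rate = 0.1
momentum = 0.9

w = 0.0         # the parameter to learn
velocity = 0.0  # running "speed" of the updates

for step in range(200):
    gradient = 2 * (w - 3)                            # d/dw of (w - 3)^2
    velocity = momentum * velocity - learning_rate * gradient
    w = w + velocity                                  # past updates keep contributing
print(w)        # ~3.0, the minimum of the toy loss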
The Backpropagation Algorithm - Chaining derivatives
This algorithm leverages the chain rule of derivatives: applying the chain rule to the computation of the gradient values, working backwards from the loss to each weight, is what the Backpropagation algorithm is about.
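A toy illustration: for loss = (w * x - y_true)^2, the gradient of the loss with respect to w is obtained by multiplying the local derivatives along the path from the loss back to w, which is exactly what backpropagation does across a full network. (The numbers are made up.)
x, y_true, w = 2.0, 10.0, 3.0

pred = w * x                        # forward pass, first node
loss = (pred - y_true) ** 2         # forward pass, second node

dloss_dpred = 2 * (pred - y_true)   # local derivative of the loss node
dpred_dw = x                        # local derivative of the prediction node
dloss_dw = dloss_dpred * dpred_dw   # chain rule: multiply along the path
print(dloss_dw)                     # -16.0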
Automatic differentiation with Computation Graphs
Computation graphs capture computation as a data structure. In the TensorFlow world, this is a graph of tensor operations. Because the operations and the graph are precisely specified, they can be transformed in a principled manner: graphs can be the input to computing functions and can equally well be their output, and so they can be composed and transformed programmatically.
Obviously, these functions and transformations are particularly exciting because they enable the optimizations discussed above, and thus, learning.
A graph that represents an expression can be subjected to a computation that outputs the derivative of that expression - in other words, we can do automatic differentiation. This clearly enables efficient learning of the kind we’ve been talking about above.
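In TensorFlow this is exposed through the GradientTape API; a minimal sketch, mirroring the hand-computed chain-rule example above.
import tensorflow as tf

w = tf.Variable(3.0)
x = tf.constant(2.0)
y_true = tf.constant(10.0)

# The tape records the graph of tensor operations applied to watched
# variables and can then output gradients automatically.
with tf.GradientTape() as tape:
    pred = w * x
    loss = (pred - y_true) ** 2

grad = tape.gradient(loss, w)
print(float(grad))   # -16.0, the same value obtained by hand above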