Sync'ing from Memory

Notes for Deep Learning with Python

Chapter-wise notes for Deep Learning with Python, with Python and Clojure code samples. The book is not done. These notes are evolving as I get deeper both into the book and the accompanying code.

Chapter 1 - What is deep learning?

"Deep" stands for the depth of the successive layers of representations learned from the data, starting from the input and ending at a useful output that we identify as the outcome of the learning system.

Terms

Loss Function

A function to measure the difference between the actual and the expected outputs

  • Synonyms - Objective Function, Cost Function
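
For instance, mean squared error is one such loss function. A minimal sketch (my illustration, not the book's code):

import numpy as np

def mse_loss(y_true, y_pred):
    # Mean squared error: average of the squared differences
    # between the expected and the actual outputs.
    return np.mean((y_true - y_pred) ** 2)

print(mse_loss(np.array([1.0, 0.0, 1.0]), np.array([0.9, 0.2, 0.8])))  # ~0.03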

Optimizers

The algorithm that uses the gradients computed by backpropagation to adjust the weights, with the purpose of reducing the loss function.

Weights

Parameters of the various connections between the processing units (the neurons).

Kernel Function (SVM)

A computationally tractable function that maps any two points in the input space to their distance in the target representation space.

Random Forests

Ensemble of specialized decision trees.

Gradient Boosting

Ensemble of weak prediction models that focuses on boosting the weaker of the contained models to improve prediction.

Feature Engineering

Transforming the input data (with mostly manual intervention) into forms that machine learning algorithms can consume. A step now greatly automated / made redundant with deep learning.

CUDA

A programming interface for Nvidia's GPUs. CUDA implementations of neural network matrix operations started appearing around 2011.

TPU

Google's Tensor Processing Unit chip project, meant to be faster and more energy efficient than GPUs for neural network matrix operations.

Activation Functions

A class of functions that transform a neuron's input signal into its output signal.
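
Two common examples, relu and sigmoid, sketched in numpy (my illustration, not the book's code):

import numpy as np

def relu(x):
    # Passes positive inputs through unchanged, clamps negatives to 0.
    return np.maximum(x, 0.0)

def sigmoid(x):
    # Squashes any input into the (0, 1) range.
    return 1.0 / (1.0 + np.exp(-x))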

History

Improvements in the gradient backpropagation algorithms were key in allowing faster training of more complex (deep) neural networks. Key aspects were

  • Newer types of activation functions being tried

  • Better weight initialization schemes

  • Better optimization schemes like RMSProp and Adam

  • Better ways to help gradient propagation

    • Batch normalization

    • Residual Connections

    • Depth-wise separable convolutions

Chapter 2 - The mathematical building blocks of neural networks

Terms

Tensor

Multi-dimensional arrays. Tensors are generalizations of vectors and matrices, which are rank-1 and rank-2 tensors respectively. By extension, scalars are rank-0 tensors.
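
A quick numpy check of ranks (my sketch; numpy calls the rank ndim):

import numpy as np

np.array(12).ndim                 # 0 - a scalar is a rank-0 tensor
np.array([1, 2, 3]).ndim          # 1 - a vector is a rank-1 tensor
np.array([[1, 2], [3, 4]]).ndim   # 2 - a matrix is a rank-2 tensor
np.zeros((2, 3, 4)).ndim          # 3 - a rank-3 tensor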

Differentiation

Yes. Calculus.

Gradient Descent

Moving towards a (local) minimum in search of the right-fit parameters, guided by the loss function. Calculus.

Overfitting

Learning too keenly from the training data, leading to a lower ability to generalize.

Batch Axis

Also called Batch Dimension. In data samples, typically the 0-th axis lists individual data items and batching happens on this collection. Batch Axis refers to the axis that lists the data items.

Random initialization

The weight-matrices, when initialized, are filled with random values. That process.

Environment setup

Code samples here require some local setup to have been done, instructions for which are in Chapter 3.

Let's first activate the virtualenv infrastructure for Emacs.

(pyvenv-activate "~/.venv/dl")

Next, we install the requisite Python packages with pip.

source ~/.venv/dl/bin/activate
pip install tensorflow keras matplotlib
python
Python 3.8.3 (default, May 27 2020, 20:54:22)
[Clang 11.0.3 (clang-1103.0.32.59)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> from keras.datasets import mnist
Using TensorFlow backend.
>>> (train_images, train_labels), (test_images, test_labels) = mnist.load_data()
>>> print(train_labels[0:10])
[5 0 4 1 9 2 1 3 1 4]
>>> print(test_labels[0:10])
[7 2 1 0 4 1 4 9 5 9]

With this setup, most of the code samples in the second chapter can be executed.

Real-world examples of data tensors

Most of the data-sets we will use fall under one of the following categories

Vector data

Rank-2 tensors of shape (samples, features)

Timeseries/Sequence data

Rank-3 tensors of shape (samples, timesteps, features)

Images

Rank-4 tensors of shape (samples, height, width, channels) or (samples, channels, height, width)

Video

Rank-5 tensors of shape (samples, frames, height, width, channels) or (samples, frames, channels, height, width)
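
A sketch of these shapes with made-up sizes (my illustration); in each case the 0-th axis is the samples (batch) axis:

import numpy as np

vector_data = np.zeros((1000, 20))             # 1000 samples, 20 features each
timeseries  = np.zeros((250, 390, 3))          # 250 samples, 390 timesteps, 3 features
images      = np.zeros((64, 28, 28, 1))        # 64 samples, 28x28 pixels, 1 channel
video       = np.zeros((4, 240, 144, 256, 3))  # 4 clips, 240 frames, 144x256 pixels, 3 channels

print(images.ndim, images.shape)               # 4 (64, 28, 28, 1)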

Learning rate

A scalar factor that modulates the speed at which "learning" happens - in other words, it controls the rate at which the model converges to a steady state. Too large, and the learning may be unstable and never converge. Too small, and you may need to wait a lifetime for the learning to finish.

Gradient Descent

Descend down the gradient in search of stability.

Stochastic Gradient Descent

Random batch training, with varying batch sizes.

Backpropagation algorithm

Applying chain rule to the computation of the gradient values.

Computation graph

A data structure that forms the core of TensorFlow: a directed acyclic graph of tensor operations.

Tensor operations

  • Element-wise operations

  • Broadcasting

  • Tensor product - Also called the dot product

  • Tensor reshaping
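
A minimal numpy sketch of these four kinds of operations (my example, not the book's):

import numpy as np

x = np.random.random((2, 3))
y = np.random.random((2, 3))

z = x + y                 # element-wise addition
z = np.maximum(z, 0.5)    # element-wise maximum

b = np.random.random((3,))
z = x + b                 # broadcasting: b is virtually repeated along axis 0 to match x

w = np.random.random((3, 4))
z = np.dot(x, w)          # tensor (dot) product: (2, 3) . (3, 4) -> (2, 4)

z = z.reshape((4, 2))     # tensor reshaping: same 8 elements, different shape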

Gradient-based optimization

If the loss function - a function of the weights of the neural network, amongst other parameters - is seen as a surface over the possible values of the weight matrices, then traversing in a direction that reduces its value for our inputs (training samples) is akin to descending down that surface towards some (local) minimum. That is gradient descent, or gradient-based optimization.

Conceptually, the process will look like the following for neural networks

  • Take a batch of input data x1 and the corresponding labels y-true1

  • Run the model on x1 to obtain the predicted labels y-pred1

  • Compute the loss for this batch - which is a measure of the mismatch between y-true1 and y-pred1

  • Update the weights of the model slightly, in a direction that reduces the loss on this batch.
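
A sketch of one such step using TensorFlow's GradientTape; model, loss_fn and learning_rate are placeholders I am assuming, not the book's code:

import tensorflow as tf

def training_step(model, loss_fn, x1, y_true1, learning_rate=1e-3):
    # Forward pass: run the model on the batch and compute the loss.
    with tf.GradientTape() as tape:
        y_pred1 = model(x1)
        loss = loss_fn(y_true1, y_pred1)
    # Backward pass: gradient of the loss w.r.t. the model's weights.
    gradients = tape.gradient(loss, model.trainable_weights)
    # Nudge each weight against its gradient, scaled by the learning rate.
    for w, g in zip(model.trainable_weights, gradients):
        w.assign_sub(learning_rate * g)
    return loss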

Stochastic gradient descent

In this method, we use the gradient of the loss with respect to the model's parameters. This step is called the backward pass. Weights are then updated to reduce the loss, in conjunction with the learning rate. When done on small subsets (batches) of the training data, this is called mini-batch stochastic gradient descent, or mini-batch SGD. The stochastic part refers to the randomness in the selection of the batch. One can also go with a single-sample "batch" or the entire training data in one go (the latter being called batch SGD).
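
For the batch-selection part, a rough sketch of drawing a random mini-batch (my illustration, reusing the MNIST arrays loaded earlier):

import numpy as np

batch_size = 128   # 1 would be true SGD; len(train_images) would be batch SGD
indices = np.random.permutation(len(train_images))[:batch_size]
x_batch = train_images[indices]
y_batch = train_labels[indices]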

There are multiple variants of SGD, focused on taking into account the previous weight updates while computing the next weight update, rather than looking only at the current gradient. This is where the concept of momentum arises. All intuition-based approaches, I suppose. Other example variants include Adagrad and RMSProp.

Momentum addresses two issues

  • Convergence speed

  • Local minima

The intuition is that if the current learning velocity is large (high momentum), then even if there is a local minimum, the chances of descending into it and getting stuck there are reduced, as the momentum carries you over the local hills.
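
A toy sketch of the momentum update rule (my example, not the book's): minimizing f(w) = (w - 3)^2, where the velocity accumulates past updates:

learning_rate = 0.1
momentum = 0.9     # fraction of the previous velocity carried over
w = 0.0
velocity = 0.0
for step in range(200):
    gradient = 2 * (w - 3)                             # df/dw
    velocity = momentum * velocity - learning_rate * gradient
    w = w + velocity                                   # past updates keep pushing w along
print(w)   # converges to ~3.0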

The Backpropagation Algorithm - Chaining derivatives

This algorithm leverages the chain rule of derivatives.

∂f(g(x))/∂x = ∂f/∂g · ∂g/∂x

Applying the chain rule to the computation of the gradient values is what the Backpropagation algorithm is about.
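
A quick numerical sanity check of the chain rule (my toy example): for f(u) = u^2 and g(x) = 3x + 2, the derivative of f(g(x)) is ∂f/∂g · ∂g/∂x = 2·g(x)·3.

def g(x):
    return 3 * x + 2

def f(u):
    return u ** 2

x = 1.5
analytic = 2 * g(x) * 3                      # 2u * 3 with u = g(x) = 6.5, i.e. 39.0
eps = 1e-6
numeric = (f(g(x + eps)) - f(g(x))) / eps    # finite-difference approximation
print(analytic, numeric)                     # both ~39.0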

Automatic differentiation with Computation Graphs

Computation graphs capture computation as a data structure. In the TensorFlow world, a computation graph captures a graph of tensor operations. The operations and the graph are precisely specified and can be transformed in a very principled manner. In other words, they can be input to computing functions, and can equally well be output from computing functions. And thus, they can be composed with various functions and transformed.

Obviously, these functions and transformations are particularly exciting from the perspective of enabling the above discussed optimizations, and thus, learning.

A graph that represents an expression can be subjected to a computation that outputs the derivative of that expression - in other words, we can do automatic differentiation. This clearly enables efficient learning of the kind we've been talking about above.
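
TensorFlow exposes this through GradientTape. A minimal sketch (my example):

import tensorflow as tf

x = tf.Variable(3.0)
with tf.GradientTape() as tape:
    y = x ** 2 + 2 * x            # the tape records the graph of operations on x
dy_dx = tape.gradient(y, x)       # automatic differentiation: dy/dx = 2x + 2
print(dy_dx.numpy())              # 8.0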

Chapter 3 - Introduction to Keras and Tensorflow