Notes for Deep Learning with Python
Chapter 1: What is deep learning?
Deep refers to the depth of the successive representation layers of the feature data, starting from the input and ending at a useful output that we identify as the outcome of the learning system.
Terms
 Loss Function

A function to measure the difference between the actual and the expected outputs

Synonyms: Objective Function, Cost Function

 Optimizers

The algorithm that uses the gradients computed by backpropagation to adjust the weights, with the purpose of reducing the loss.
 Weights

Parameters of the various connections between the processing units (the neurons).
 Kernel Function (SVM)

A computationally tractable function that maps any two points in the input space to their distance in the target representation space.
 Random Forests

Ensemble of specialized decision trees.
 Gradient Boosting

Ensemble of weak prediction models (generally decision trees), built iteratively so that each new model specializes in correcting the weak points of the previous ones.
 Feature Engineering

Transforming the input data (with mostly manual intervention) into forms that machine learning algorithms can consume. A step now greatly automated / made redundant with deep learning.
 CUDA

A programming interface for Nvidia's GPUs. CUDA implementations of neural network matrix operations started appearing around 2011.
 TPU

Google's Tensor Processing Unit, a chip project meant to be faster and more energy efficient than GPUs for neural network matrix operations.
 Activation Functions

A class of functions that transform the input signal arriving at a neuron into its output signal.
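As a rough illustration of the last entry, here are two commonly used activation functions, ReLU and the sigmoid, in plain Python. The function names and the sample inputs are my own choices for this sketch, not something taken from the book.

```python
import math


def relu(signal):
    """Rectified linear unit: passes positive inputs, zeroes out the rest."""
    return max(0.0, signal)


def sigmoid(signal):
    """Squashes any real input into the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-signal))


# Arbitrary sample inputs, one negative, one zero, one positive.
for signal in (-2.0, 0.0, 3.0):
    print(signal, relu(signal), sigmoid(signal))
```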
History
Improvements in the gradient backpropagation algorithms were key to allowing faster training of more complex (deep) neural networks. Key aspects were:

Newer types of activation functions being tried

Better weight initialization schemes

Better optimization schemes like RMSProp and Adam

Better ways to help gradient propagation, such as batch normalization, residual connections, and depthwise separable convolutions

Chapter 2: The mathematical building blocks of neural networks
Terms
 Tensor

Multidimensional arrays. Tensors generalize vectors and matrices, which are themselves rank-1 and rank-2 tensors. By extension, scalars are rank-0 tensors.
 Differentiation

Yes. Calculus.
 Gradient Descent

Moving towards a (local) minimum in search of the best-fit parameters, guided by the gradient of the loss function. Calculus.
 Overfitting

Learning too keenly from the training data, leading to a lower ability to generalize.
 Batch Axis

Also called Batch Dimension. In data samples, typically the 0th axis lists individual data items and batching happens on this collection. Batch Axis refers to the axis that lists the data items.
 Random initialization

The weight matrices, when initialized, are filled with random values. That process.
Environment setup
Code samples here require some local setup, instructions for which are in Chapter 3.
Let's first activate the virtualenv infrastructure for Emacs.
(pyvenv-activate "~/.venv/dl")
Next, we install the requisite Python packages with pip.
source ~/.venv/dl/bin/activate
pip install tensorflow keras matplotlib
python
3.8.3 (default, May 27 2020, 20:54:22)
[Clang 11.0.3 (clang-1103.0.32.59)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
from keras.datasets import mnist
(train_images, train_labels), (test_images, test_labels) = mnist.load_data()
Using TensorFlow backend.
print(train_labels[0:10])
print(test_labels[0:10])
[5 0 4 1 9 2 1 3 1 4]
[7 2 1 0 4 1 4 9 5 9]
With this setup, most of the code samples in the second chapter can be executed.
Real-world examples of data tensors
Most of the datasets we will use fall under one of the following categories:
 Vector data

Rank-2 tensors of shape (samples, features)
 Timeseries/Sequence data

Rank-3 tensors of shape (samples, timesteps, features)
 Images

Rank-4 tensors of shape (samples, height, width, channels) or (samples, channels, height, width)
 Video

Rank-5 tensors of shape (samples, frames, height, width, channels) or (samples, frames, channels, height, width)
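To make the ranks and shapes above concrete, here is a small NumPy sketch. All the sizes are hypothetical and chosen only for illustration; the channels-last layout is used for images and video. The last lines slice along the batch axis (axis 0).

```python
import numpy as np

# Hypothetical sizes, picked only to make the shapes concrete.
scalar = np.array(12)                   # rank-0 tensor
vector_data = np.zeros((1000, 20))      # (samples, features)
timeseries = np.zeros((250, 390, 3))    # (samples, timesteps, features)
images = np.zeros((128, 28, 28, 1))     # (samples, height, width, channels)
video = np.zeros((2, 10, 32, 32, 3))    # (samples, frames, height, width, channels)

tensors = {
    "scalar": scalar,
    "vector data": vector_data,
    "timeseries": timeseries,
    "images": images,
    "video": video,
}
for name, tensor in tensors.items():
    print(f"{name}: rank {tensor.ndim}, shape {tensor.shape}")

# Batching slices along axis 0, the batch axis; the other axes are untouched.
batch = images[:32]
print(batch.shape)
```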
 Learning rate

A scalar factor that modulates the speed at which "learning" happens; in other words, it scales the size of each weight update and so controls how quickly the model converges to a steady state. Too large, and the learning may be unstable and never converge. Too small, and you may need to wait a lifetime for the learning to finish.
 Gradient Descent

Descend along the gradient of the loss, step by step, until the loss settles at a (local) minimum.
 Stochastic Gradient Descent

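Gradient descent where each update is computed on a randomly drawn batch of training samples; the batch size can range from a single sample to the whole training set.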
 Backpropagation algorithm

Applying the chain rule to compute the gradient of the loss with respect to each weight.
 Computation graph

A data structure that forms the core of TensorFlow: a directed acyclic graph of tensor operations.
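The learning-rate trade-off described above can be seen on a toy problem. This is only a sketch: the loss f(w) = w**2, the starting point, the step count and the three rates are all assumptions made for illustration.

```python
def descend(learning_rate, steps=50, w=1.0):
    """Minimize the toy loss f(w) = w**2, whose gradient is 2*w."""
    for _ in range(steps):
        w -= learning_rate * 2.0 * w
    return w


too_small = descend(0.001)   # barely moves away from the starting point
reasonable = descend(0.1)    # ends up very close to the minimum at 0
too_large = descend(1.1)     # overshoots further on every step and diverges

print(too_small, reasonable, too_large)
```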
Tensor operations

Elementwise operations

Broadcasting

Tensor product: also called the dot product

Tensor reshaping
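A quick NumPy sketch of the four kinds of operations listed above. The arrays are made-up values, and the variable names are mine.

```python
import numpy as np

x = np.array([[1.0, -2.0], [3.0, 4.0]])   # shape (2, 2)
y = np.array([10.0, 20.0])                # shape (2,)

# Element-wise operation: applied independently to each entry.
relu_x = np.maximum(x, 0.0)

# Broadcasting: y is virtually repeated along axis 0 to match the shape of x.
summed = x + y

# Tensor (dot) product: multiplies and sums across the shared axis.
product = np.dot(x, y)

# Reshaping: the same six entries, rearranged into a new shape.
reshaped = np.arange(6).reshape((3, 2))

print(relu_x, summed, product, reshaped, sep="\n")
```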
Gradient-based optimization
If the loss function, a function of the weights of the neural network amongst other parameters, is pictured as a surface over the possible values of the weight matrices, then moving the weights in a direction that reduces the loss on our inputs (training samples) is akin to descending that surface towards some (local) minimum. That is gradient descent, or gradient-based optimization.
Conceptually, the process looks like the following for neural networks:

Take a batch of input data x_{1} and the corresponding labels ytrue_{1}

Run the model on x_{1} to obtain the predicted labels ypred_{1}

Compute the loss for this batch, which is a measure of the mismatch between ytrue_{1} and ypred_{1}

Update the weights of the model in a direction that slightly reduces the loss on this batch.
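The four steps above can be sketched with NumPy for a one-feature linear model. Everything specific here is an assumption made for illustration: the synthetic data (y = 2x + 1), the mean squared error loss, the learning rate, and the helper names forward and mse.

```python
import numpy as np

rng = np.random.default_rng(0)

# Step 1: a batch of inputs and the corresponding true labels (synthetic).
x = rng.normal(size=(8, 1))
y_true = 2.0 * x + 1.0

# Model parameters: a randomly initialized weight and a zero bias.
w = rng.normal(size=(1, 1))
b = np.zeros(1)
learning_rate = 0.1


def forward(x, w, b):
    """A one-layer linear model."""
    return x @ w + b


def mse(y_true, y_pred):
    """Mean squared error between labels and predictions."""
    return float(np.mean((y_true - y_pred) ** 2))


# Step 2: run the model on the batch to get predictions.
y_pred = forward(x, w, b)

# Step 3: compute the loss for this batch.
loss_before = mse(y_true, y_pred)

# Step 4: move the weights slightly against the gradient of the loss.
error = y_pred - y_true
grad_w = 2.0 * x.T @ error / len(x)
grad_b = 2.0 * error.mean(axis=0)
w = w - learning_rate * grad_w
b = b - learning_rate * grad_b

loss_after = mse(y_true, forward(x, w, b))
print(loss_before, loss_after)
```

The gradients are written out by hand here for the squared-error loss; in a real network they would come from backpropagation.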
Stochastic gradient descent
In this method, we compute the gradient of the loss with respect to the model's parameters; this step is called the backward pass. The weights are then moved a little in the direction opposite to the gradient, by an amount scaled by the learning rate, to reduce the loss. When each update uses a small random subset of the training data, this is called mini-batch stochastic gradient descent, or mini-batch SGD. The stochastic part refers to the randomness in the selection of the batch. One can also go with a "batch" of a single sample, or with the entire training data in one go (the latter being called batch SGD).
There are multiple variants of SGD that take the previous weight updates into account when computing the next one, rather than looking only at the current gradient. This is where the concept of momentum arises. All intuition-based approaches, I suppose. Other example variants include Adagrad and RMSProp.
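A minimal NumPy sketch of the loop described above, under assumptions of my own: a one-feature linear model, noiseless synthetic data (y = 3x - 0.5), mean squared error, and a hypothetical helper named sgd. The same function covers all three regimes by changing batch_size: a single sample, a mini-batch, or the whole training set.

```python
import numpy as np


def sgd(x, y_true, batch_size, learning_rate=0.1, epochs=100, seed=0):
    """Fit y = w * x + b with stochastic gradient descent; returns (w, b)."""
    rng = np.random.default_rng(seed)
    w, b = 0.0, 0.0
    n = len(x)
    for _ in range(epochs):
        order = rng.permutation(n)              # the "stochastic" part
        for start in range(0, n, batch_size):
            idx = order[start:start + batch_size]
            error = (w * x[idx] + b) - y_true[idx]
            # Backward pass: gradient of the mean squared error on this batch.
            grad_w = 2.0 * np.mean(error * x[idx])
            grad_b = 2.0 * np.mean(error)
            # Weight update, scaled by the learning rate.
            w -= learning_rate * grad_w
            b -= learning_rate * grad_b
    return w, b


data_rng = np.random.default_rng(42)
x = data_rng.uniform(-1.0, 1.0, size=200)
y_true = 3.0 * x - 0.5                          # synthetic targets

results = {}
for batch_size in (1, 32, len(x)):              # single sample, mini-batch, full batch
    results[batch_size] = sgd(x, y_true, batch_size)
    print(batch_size, results[batch_size])
```

For the same number of epochs, smaller batches mean more weight updates, which is one reason mini-batches are a practical middle ground.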
Momentum addresses two issues:

Convergence speed

Local minima
The intuition is that if the current learning velocity is large (high momentum) when a local minimum is reached, the chances of descending into it and staying there are reduced, as the momentum carries you over the local hills.
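A sketch of one common form of the momentum update, on the toy loss f(w) = w**2. It only illustrates the convergence-speed point, not the escape from local minima. The loss, the learning rate and the momentum value of 0.9 are assumptions chosen for the example.

```python
def minimize(momentum, learning_rate=0.01, steps=100, w=1.0):
    """Gradient descent on f(w) = w**2 with an optional momentum term."""
    velocity = 0.0
    for _ in range(steps):
        gradient = 2.0 * w
        # The velocity accumulates past updates, decayed by the momentum factor.
        velocity = momentum * velocity - learning_rate * gradient
        w += velocity
    return w


plain = minimize(momentum=0.0)           # ordinary gradient descent
with_momentum = minimize(momentum=0.9)   # same learning rate, plus momentum

# After the same number of steps, the momentum run is much closer to 0.
print(abs(plain), abs(with_momentum))
```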
The Backpropagation Algorithm: Chaining derivatives
This algorithm leverages the chain rule of derivatives.
∂(f∘g)/∂x = ∂f/∂g · ∂g/∂x
Applying the chain rule to the computation of the gradient values is what the Backpropagation algorithm is about.
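A small numerical check of the chain rule, using two made-up functions: g(x) = 3x + 1 and f(u) = u**2. The derivative obtained by chaining is compared against a central finite difference.

```python
def g(x):
    return 3.0 * x + 1.0


def f(u):
    return u ** 2


def chain_rule_derivative(x):
    """d/dx f(g(x)) as (df/dg evaluated at g(x)) times (dg/dx)."""
    df_dg = 2.0 * g(x)
    dg_dx = 3.0
    return df_dg * dg_dx


def numerical_derivative(x, eps=1e-6):
    """Central finite-difference estimate of d/dx f(g(x))."""
    return (f(g(x + eps)) - f(g(x - eps))) / (2.0 * eps)


x = 2.0
analytic = chain_rule_derivative(x)   # 2 * 7 * 3
numeric = numerical_derivative(x)
print(analytic, numeric)
```

Backpropagation does the same chaining, layer by layer, from the loss back to each weight.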
Automatic differentiation with Computation Graphs
Computation graphs capture computation as a data structure. In the TensorFlow world, such a graph captures tensor operations. The operations and the graph are precisely specified and can be transformed in a very principled manner. In other words, graphs can be passed as input to functions and returned as output from them, and can thus be composed and transformed by various functions.
These functions and transformations are particularly exciting because they enable the optimizations discussed above, and thus, learning.
A graph that represents an expression can be subjected to a computation that outputs the derivative of that expression; in other words, we can do automatic differentiation. This enables efficient learning of the kind discussed above.
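To see the idea without any framework, here is a toy reverse-mode automatic differentiation sketch in plain Python. It is not how TensorFlow is implemented (TensorFlow exposes this capability through tf.GradientTape). The Node class and the backward function are hypothetical names for this example, and only addition and multiplication of scalars are supported.

```python
class Node:
    """One value in a computation graph, plus a record of how it was produced."""

    def __init__(self, value, parents=()):
        self.value = value
        self.parents = parents    # pairs of (parent node, local derivative)
        self.grad = 0.0

    def __add__(self, other):
        return Node(self.value + other.value, ((self, 1.0), (other, 1.0)))

    def __mul__(self, other):
        return Node(self.value * other.value,
                    ((self, other.value), (other, self.value)))


def backward(output):
    """Walk the graph from the output back to the inputs, applying the chain rule."""
    order, seen = [], set()

    def visit(node):
        if id(node) not in seen:
            seen.add(id(node))
            for parent, _ in node.parents:
                visit(parent)
            order.append(node)    # a node is listed after all of its parents

    visit(output)
    output.grad = 1.0
    for node in reversed(order):  # output first, inputs last
        for parent, local_derivative in node.parents:
            parent.grad += node.grad * local_derivative


# loss = (w * x + b) ** 2, written as a graph of multiplications and additions
w, x, b = Node(3.0), Node(2.0), Node(1.0)
pred = w * x + b
loss = pred * pred
backward(loss)

print(loss.value, w.grad, b.grad)
```

Here w.grad and b.grad hold the derivatives of the loss with respect to w and b, which is exactly what a gradient descent step needs.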