Notes for Deep Learning with Python
Chapter 1  What is deep learning?
Deep refers to the depth of the (successive) representation layers of the feature data, starting from the input and ending at a useful output that we identify as the outcome of the learning system.
Terms
 Loss Function
 A function to measure the difference between the actual and the expected outputs
 Synonyms: Objective Function, Cost Function
 Optimizers
 The algorithm that uses the gradients computed by backpropagation to adjust the weights, with the purpose of reducing the loss.
 Weights
 Parameters of the various connections between the processing units (the neurons).
 Kernel Function (SVM)
 A computationally tractable function that maps any two points in the input space to their distance in the target representation space.
 Random Forests
 Ensemble of specialized decision trees.
 Gradient Boosting
 Ensemble of weak prediction models (generally decision trees), built by iteratively training new models that specialize in addressing the weak points of the previous ones.
 Feature Engineering
 Transforming the input data (with mostly manual intervention) into forms that machine learning algorithms can consume. A step now greatly automated / made redundant with deep learning.
 CUDA
 A programming interface for Nvidia’s GPUs. CUDA implementations of neural network matrix operations started appearing around 2011.
 TPU
 Google’s Tensor Processing Unit chip project meant to be faster and more energy efficient than the GPUs for neural network matrix operations.
 Activation Functions
 A class of functions that transform the input signal of a neuron into its output signal.
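As a rough illustration of two of the terms above, the following sketch implements two common activation functions (ReLU and sigmoid) and a mean-squared-error loss in plain Python. The function names and the sample numbers are illustrative choices, not something taken from the book.

```python
import math


def relu(x):
    """Rectified linear unit: passes positive inputs, zeroes out the rest."""
    return max(0.0, x)


def sigmoid(x):
    """Squashes any real input into the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def mse_loss(y_true, y_pred):
    """Mean squared error between the expected and the actual outputs."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


# Hypothetical expected vs. actual outputs for three samples.
y_true = [1.0, 0.0, 1.0]
y_pred = [0.5, 0.0, 1.0]

print(relu(-2.0), relu(3.0))
print(sigmoid(0.0))
print(mse_loss(y_true, y_pred))
```

The loss shrinks towards zero as the actual outputs approach the expected ones, which is what an optimizer tries to achieve by adjusting the weights.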
History
Improvements in the gradient backpropagation algorithms were key in allowing faster training of more complex (deep) neural networks. Key aspects were
 Newer types of activation functions being tried
 Better weight initialization schemes
 Better optimization schemes like RMSProp and Adam
 Better ways to help gradient propagation
 Batch normalization
 Residual Connections
 Depthwise separable convolutions
Chapter 2  The mathematical building blocks of neural networks
Terms
 Tensor
 Multidimensional arrays. Tensors are generalizations of vectors and matrices, which are themselves rank-1 and rank-2 tensors. By extension, scalars are rank-0 tensors.
 Differentiation
 Yes. Calculus.
 Gradient Descent
 Moving towards a (local) minimum in search of the best-fit parameters, guided by the loss function. Calculus.
 Overfitting
 Learning too keenly from the training data, leading to a lower ability to generalize.
 Batch Axis
 Also called Batch Dimension. In data samples, typically the 0th axis lists individual data items and batching happens on this collection. Batch Axis refers to the axis that lists the data items.
 Random initialization
 The weight matrices, when initialized, are filled with random values. That process.
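A small NumPy sketch of a few of the terms above: tensor rank, the batch axis, and random initialization. The array contents, shapes, and the range of the random values are arbitrary choices made for illustration.

```python
import numpy as np

scalar = np.array(12)                       # rank-0 tensor
vector = np.array([1, 2, 3])                # rank-1 tensor
matrix = np.array([[1, 2, 3], [4, 5, 6]])   # rank-2 tensor
batch = np.zeros((4, 2, 3))                 # rank-3 tensor; axis 0 is the batch axis

for name, t in [("scalar", scalar), ("vector", vector),
                ("matrix", matrix), ("batch", batch)]:
    print(name, "rank:", t.ndim, "shape:", t.shape)

# Random initialization of a weight matrix (illustrative shape and value range).
rng = np.random.default_rng(seed=0)
weights = rng.uniform(low=0.0, high=0.1, size=(3, 2))
print(weights)
```

In NumPy, the rank of a tensor is available as `ndim`, and the size along each axis as `shape`.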
Environment setup
Code samples here require some local setup to have been done, instructions for which are in Chapter 3.
Let’s first activate the virtualenv infrastructure for Emacs.
(pyvenv-activate "~/.venv/dl")
Next, we install the requisite Python packages with pip.
source ~/.venv/dl/bin/activate
#pip install tensorflow keras matplotlib
python
from keras.datasets import mnist
(train_images, train_labels), (test_images, test_labels) = mnist.load_data()
print(train_labels[0:10])
print(test_labels[0:10])
With this setup, most of the code samples in the second chapter can be executed.
Real-world examples of data tensors
Most of the datasets we will use fall under one of the following categories
 Vector data
 Rank-2 tensors of shape (samples, features)
 Timeseries/Sequence data
 Rank-3 tensors of shape (samples, timesteps, features)
 Images
 Rank-4 tensors of shape (samples, height, width, channels) or (samples, channels, height, width)
 Video
 Rank-5 tensors of shape (samples, frames, height, width, channels) or (samples, frames, channels, height, width)
 Learning rate
 A scalar factor that scales each weight update and so modulates the speed at which “learning” happens; in other words, it controls the rate at which the model converges to a steady state. Too high, and the learning may be unstable and never converge. Too low, and you may need to wait a lifetime for the learning to finish.
 Gradient Descent
 Repeatedly step the parameters in the direction opposite to the gradient of the loss (downhill), in search of a minimum.
 Stochastic Gradient Descent
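 Gradient descent where each update is computed on a randomly drawn batch of training samples; the batch size can range from a single sample to the whole training set.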
 Backpropagation algorithm
 Applying chain rule to the computation of the gradient values.
 Computation graph
 A data structure which forms the core of TensorFlow. In that context, it is a directed acyclic graph of tensor operations.
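The data categories listed earlier in this section (vector, timeseries, image, and video data) can be made concrete with a few placeholder arrays. This is only a sketch: the sizes are made up, and the image and video arrays assume the channels-last ordering.

```python
import numpy as np

# Sizes below are made-up placeholders, chosen only to show the axis order.
vector_data = np.zeros((1000, 3))            # (samples, features)
timeseries_data = np.zeros((250, 390, 3))    # (samples, timesteps, features)
image_data = np.zeros((128, 28, 28, 1))      # (samples, height, width, channels)
video_data = np.zeros((4, 24, 32, 32, 3))    # (samples, frames, height, width, channels)

for name, data in [("vector", vector_data), ("timeseries", timeseries_data),
                   ("image", image_data), ("video", video_data)]:
    print(f"{name}: rank {data.ndim}, shape {data.shape}")

# Slicing along axis 0 (the samples axis) yields a batch of samples.
first_batch = image_data[:32]
print(first_batch.shape)
```

In every case the samples axis comes first, which is why batching is done by slicing along axis 0.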
Tensor operations
 Elementwise operations
 Broadcasting
 Tensor product: also called the dot product
 Tensor reshaping
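The four kinds of tensor operations listed above can be tried out directly in NumPy. The sketch below uses small hand-picked arrays so that the results are easy to check by hand; the variable names are illustrative.

```python
import numpy as np

x = np.array([[1.0, -2.0, 3.0],
              [-4.0, 5.0, -6.0]])   # shape (2, 3)
y = np.array([10.0, 20.0, 30.0])    # shape (3,)

# Element-wise operation: relu applied independently to each entry.
relu_x = np.maximum(x, 0.0)

# Broadcasting: y is (virtually) repeated along axis 0 to match x's shape.
broadcast_sum = x + y

# Tensor (dot) product: matrix-vector product, shapes (2, 3) . (3,) -> (2,)
dot_xy = np.dot(x, y)

# Reshaping: the same six values rearranged into shape (3, 2).
reshaped = x.reshape((3, 2))

# Transposition is a special case of reshaping that swaps rows and columns.
transposed = x.T

print(relu_x)
print(broadcast_sum)
print(dot_xy)
print(reshaped)
print(transposed)
```

Note that the dot product combines entries across an axis, unlike the element-wise `*` operator, which works entry by entry.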
Gradient-based optimization
If the loss function (a function of the weights of the neural network, amongst other parameters) is pictured as a surface over the possible values of the weight matrices, then moving in a direction that reduces its value for our inputs (training samples) is akin to descending that surface towards some (local) minimum. That is gradient descent, or gradient-based optimization.
Conceptually, the process will look like the following for neural networks
 Take a batch of input data x_{1} and the corresponding labels ytrue_{1}
 Run the model on x_{1} to obtain the predicted labels ypred_{1}
 Compute the loss for this batch, which is a measure of the mismatch between ytrue_{1} and ypred_{1}
 Update weights on the model slightly in a direction that slightly reduces the loss on this batch.
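The four steps above can be sketched as a mini-batch training loop in NumPy. This is a minimal illustration under stated assumptions, not the book's code: the model is a single linear unit, the loss is mean squared error, the data is synthetic (generated from y = 3x + 2), and the learning rate, batch size, and step count are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(seed=0)

# Synthetic training data for a made-up linear target: y = 3 * x + 2.
x_train = rng.uniform(-1.0, 1.0, size=(256, 1))
y_train = 3.0 * x_train + 2.0

# Model parameters: a randomly initialized weight and a zero bias.
w = rng.normal(size=(1, 1))
b = np.zeros((1,))

learning_rate = 0.1
batch_size = 32

for step in range(500):
    # 1. Take a random batch of inputs and the corresponding labels.
    idx = rng.choice(len(x_train), size=batch_size, replace=False)
    x_batch, y_true = x_train[idx], y_train[idx]

    # 2. Run the model on the batch to obtain the predictions.
    y_pred = x_batch @ w + b

    # 3. Compute the loss (mean squared error) for this batch.
    error = y_pred - y_true
    loss = np.mean(error ** 2)

    # 4. Update the weights slightly in the direction that reduces the loss.
    grad_w = 2.0 * x_batch.T @ error / batch_size
    grad_b = 2.0 * np.mean(error, axis=0)
    w -= learning_rate * grad_w
    b -= learning_rate * grad_b

print("w:", w.ravel(), "b:", b, "final batch loss:", loss)
```

Here the gradients are written out by hand for this one model; for deeper networks they are obtained through backpropagation.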
Stochastic gradient descent
In this method, we compute the gradient of the loss with respect to the model’s parameters. This step is called the backward pass. The weights are then moved a little in the direction opposite to the gradient, by an amount scaled by the learning rate, to reduce the loss. When done on batches that are subsets of the training data, this is called mini-batch stochastic gradient descent, or mini-batch SGD. The stochastic part refers to the randomness in the selection of each batch. At the extremes, one can go with a single-sample “batch” (true SGD) or the entire training data in one go (the latter being called batch gradient descent).
There are multiple variants of SGD that take the previous weight updates into account when computing the next weight update, rather than looking only at the current gradient. This is where the concept of momentum arises. All intuition-based approaches, I suppose. Other example variants include Adagrad and RMSProp.
Momentum addresses two issues
 Convergence speed
 Local minima
The intuition is that if the current learning velocity is large (high momentum), then when there is a local minimum, the chances of descending into it and staying there are reduced, as the momentum carries you over the local hills.
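A minimal sketch of a momentum update, assuming the simplest form of the rule (a velocity that accumulates past gradient steps). The loss f(w) = w**2, the starting point, and the hyperparameter values are made up for illustration, and since this loss has a single minimum the sketch shows only the update rule, not the escape from a local minimum.

```python
def loss_gradient(w):
    """Gradient of the illustrative loss f(w) = w**2."""
    return 2.0 * w


def sgd_with_momentum(w, learning_rate=0.1, momentum=0.9, steps=500):
    """Run gradient descent with momentum and return the final parameter."""
    velocity = 0.0
    for _ in range(steps):
        gradient = loss_gradient(w)
        # The velocity blends the previous update with the current gradient step.
        velocity = momentum * velocity - learning_rate * gradient
        w = w + velocity
    return w


w_final = sgd_with_momentum(w=5.0)
print(w_final)
```

Setting `momentum=0.0` reduces this to plain gradient descent, which makes it easy to compare the two behaviours.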
The Backpropagation Algorithm: Chaining derivatives
This algorithm leverages the chain rule of derivatives.
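For a composition f(g(x)): \frac{\partial f}{\partial x} = \frac{\partial f}{\partial g} \cdot \frac{\partial g}{\partial x}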
Applying the chain rule to the computation of the gradient values is what the Backpropagation algorithm is about.
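A small worked example of the chain rule, checked against a numerical derivative. The functions are chosen here purely for illustration: an inner function g(x) = 3x + 1 and an outer function f(u) = u**2, so that the composition is f(g(x)) = (3x + 1)**2.

```python
def g(x):
    """Inner function."""
    return 3.0 * x + 1.0


def f(u):
    """Outer function."""
    return u ** 2


def dg_dx(x):
    """Derivative of the inner function with respect to x."""
    return 3.0


def df_dg(u):
    """Derivative of the outer function with respect to its input."""
    return 2.0 * u


x = 2.0

# Chain rule: multiply the outer derivative (evaluated at g(x)) by the inner one.
chain_rule_grad = df_dg(g(x)) * dg_dx(x)

# Numerical check using a central finite difference.
eps = 1e-6
numeric_grad = (f(g(x + eps)) - f(g(x - eps))) / (2 * eps)

print(chain_rule_grad, numeric_grad)
```

Backpropagation does the same multiplication of local derivatives, but over a chain of many layers instead of two functions.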

Automatic differentiation with Computation Graphs
Computation graphs capture computation as a data structure. In the TensorFlow world, a computation graph is a graph of tensor operations. The operations and the graph are precisely specified and can be transformed in a principled manner. In other words, graphs can be inputs to functions and outputs of functions, and so they can be composed and transformed programmatically.
These functions and transformations are particularly exciting because they enable the optimizations discussed above, and thus, learning.
A graph that represents an expression can be subjected to a computation that outputs the derivative of that expression; in other words, we can do automatic differentiation. This enables efficient learning of the kind we’ve been talking about above.
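To make the idea concrete, here is a toy reverse-mode automatic differentiation over a computation graph, in plain Python. It is a sketch under heavy simplifications and is not how TensorFlow is implemented: the `Node` class and `backward` function are hypothetical names, values are scalars rather than tensors, and only addition and multiplication are supported.

```python
class Node:
    """A value in a computation graph, linked to the nodes it was computed from."""

    def __init__(self, value, parents=()):
        self.value = value
        self.parents = parents   # pairs of (parent_node, local_gradient)
        self.grad = 0.0

    def __add__(self, other):
        return Node(self.value + other.value, ((self, 1.0), (other, 1.0)))

    def __mul__(self, other):
        return Node(self.value * other.value,
                    ((self, other.value), (other, self.value)))


def backward(output):
    """Propagate gradients from the output back to every input (chain rule)."""
    order, seen = [], set()

    def visit(node):
        # Depth-first traversal; a node is appended after all of its parents.
        if id(node) not in seen:
            seen.add(id(node))
            for parent, _ in node.parents:
                visit(parent)
            order.append(node)

    visit(output)
    output.grad = 1.0
    # Walk from the output towards the inputs, accumulating gradients.
    for node in reversed(order):
        for parent, local_grad in node.parents:
            parent.grad += node.grad * local_grad


# Graph for loss = (w * x + b) * (w * x + b), with made-up input values.
w, x, b = Node(2.0), Node(3.0), Node(1.0)
y = w * x + b
loss = y * y
backward(loss)
print(loss.value, w.grad, x.grad, b.grad)
```

Each operation records its local derivatives while the graph is being built, and `backward` then applies the chain rule along the edges, which is the essence of backpropagation on a computation graph.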