A Neural Network Primer

The rationale of ANNs from a programmer's perspective: a gentle but in-depth introduction to the topic.


I chanced upon this primer from the 90s, on artificial neural networks, and thought it might be a good idea to write something similar with a different flavour for a different time.

The idea is to treat the programmer-reader with respect while still being introductory reading.

The code samples are mostly in Clojure, with possibly some Python used via libpython-clj. They are instructional rather than play-along, as some background setup is not documented in this note. The intention, though, is for the code to be understandable in isolation.

Thinking about thinking

Biologically, the brain is known to be made up of a huge number of neurons - about 86 billion on average for humans. How it all works is probably still a mystery, but various models have been proposed. None may have been accurate, but all of them have been useful.

Neurons are considered the individual computing units of the brain - each with advanced and complex IO capabilities. In isolation, a single neuron may not seem very exciting, but as part of the huge network of connected neurons that the brain is, the system as a whole is highly capable. Capable enough to study itself – as we are doing right now.

Artificial neural networks (ANNs) are approximations of the biological model that we feel are extremely useful for certain classes of computational tasks that may be hard to code in the traditional style.

There are various kinds of ANNs that one comes across, and each of them is but an approximation of the way a biological brain might behave under specific circumstances. In other words, depending on the kind of problem being solved, a different ANN architecture may make more sense. But they did not all appear in one go. It started very simple.

Our goal here is not to track the history of the evolution of ANNs but to see the roadmap for a deeper understanding.

A Simplified Neuron


A neuron receives input electrical signals via dendrites


Once a neuron decides to fire a signal, it is carried outwards to downstream dendrites via axons

Activation Function

The summed up input signals via the dendrites are processed by the activation function that characterizes the computation of each neuron. Typically, these activation functions are directly correlated to the strength of the summed inputs. The correlation also happens to be one of the key choices to be made when defining models.
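As a minimal sketch of this idea, the snippet below sums the incoming signals and hands the sum to a chosen activation function. The names `step`, `sigmoid` and `neuron` are my own assumptions for illustration, and the two activation functions are just two possible choices, not the only ones.

```clojure
;; Two candidate activation functions (illustrative choices).
(defn step
  "Fires (1) when the summed input reaches the threshold, else stays silent (0)."
  [threshold total]
  (if (>= total threshold) 1 0))

(defn sigmoid
  "Smooth alternative: squashes the summed input into the range (0, 1)."
  [total]
  (/ 1.0 (+ 1.0 (Math/exp (- total)))))

(defn neuron
  "Sums the incoming signals and passes the sum through `activation`."
  [activation inputs]
  (activation (reduce + inputs)))

(neuron (partial step 1.0) [0.2 0.3 0.7]) ; the sum exceeds 1.0, so the neuron fires
(neuron sigmoid [0.0 0.0])                ; a zero sum lands mid-range
```

Swapping the activation function changes how the same summed input is interpreted, which is why that choice matters when defining a model.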

The Perceptron - Basics

The Perceptron is one of the oldest models of the neuron. It sums up all its inputs into a single value, feeds that value to the activation function, and outputs a binary value. In doing so, it discerns between different inputs and slots them into two categories.

So, let's consider the following table, where x and y are inputs, and z is the output. z = f(x, y)

x y z
0 0 0
0 1 1
1 0 1
1 1 1

This is an OR table. It probably didn't take you long to realize that. But how did you recognize it? Thinking about how you did it can be a great teacher in understanding how ANNs may have evolved, and can help you make sense of the landscape somewhat better.

Consider x and y as two inputs, and the corresponding outputs are captured in the z column. The perceptron is in effect just the function f as specified above. How can we implement one? If the inputs are binary as in the table above, the perceptron can be a simple lookup function into the table above and serve the purpose! (Think - a dictionary or a hash-map).
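The lookup idea can be sketched as follows. The names `or-lookup` and `lookup-perceptron` are assumptions made for this illustration.

```clojure
;; The OR table as a hash-map from [x y] to z.
(def or-lookup
  {[0 0] 0
   [0 1] 1
   [1 0] 1
   [1 1] 1})

(defn lookup-perceptron
  "Returns z for a known [x y] pair, or nil when the pair is not in the table."
  [x y]
  (get or-lookup [x y]))

(map (fn [[x y]] (lookup-perceptron x y))
     [[0 0] [0 1] [1 0] [1 1]])
```

This works only for inputs that appear in the table exactly; anything else comes back as nil.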

We can approximately capture the essence of the above OR table in a real-valued function as follows (while still using binary-valued inputs)

z = (1.5x + 1.5y ≥ 1.5)

Feeding x and y from the above table into the code below

  ;; input-table is assumed to hold the rows of the table above,
  ;; i.e. [[0 0] [0 1] [1 0] [1 1]]
  (let [w1 1.5
        w2 1.5
        f (fn [[x y]]
            [x y (>= (+ (* w1 x) (* w2 y)) 1.5)])]
    (map f input-table))
0 0 false
0 1 true
1 0 true
1 1 true

Let's shift the goal-post – slightly. Let us set the threshold to 1.51 from 1.5 and rerun with the same x and y inputs.

  (let [w1 1.5
        w2 1.5
        f (fn [[x y]]
            [x y (>= (+ (* w1 x) (* w2 y)) 1.51)])]
    (map f input-table))
0 0 false
0 1 false
1 0 false
1 1 true

z = (1.5x + 1.5y ≥ 1.51) – Now, that's AND!

So, AND and OR are almost the same, except for the discriminant being laterally translated.


We now have an example of a Perceptron with an ability to compute AND and OR over real values – even if the examples above used only binary values.

A discriminant is a possibly curved line in 2-D. By extension, it is a 2-D surface in 3-D, and so on: some boundary that divides the space it is embedded in into two disjoint regions.

The Perceptron – Getting Somewhat Real

What does the figure above indicate? The lines represent the exact values of our threshold, and any combination of real-valued x and y on one side of a slanted line will always classify into a single category – true or false. Which line? That will decide whether we are looking at the AND operation or the OR operation.

In real life, we expect inputs to be – well, real. Pardon the pun, but for well-defined inputs and well-known functions, we wouldn't be exploring neural networks.

The insight here is that both the functions are similar and map their inputs to opposite sides of some discriminant. Only that the discriminants are translated, while sharing the same slope.

Having the same slope is just incidental given our arbitrary choice of representation. The slopes can very well be different, and depend on multiple factors including what example datasets we work with.
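To make the point about slopes concrete, here is a small sketch with two different discriminants that both reproduce the OR table on binary inputs. The helper `linear-gate` and the second set of weights (1.0, 3.0, threshold 0.5) are hypothetical choices of mine, picked only for illustration.

```clojure
(defn linear-gate
  "Two-input threshold unit with explicit weights w1, w2 and threshold c."
  [w1 w2 c]
  (fn [[x y]]
    (>= (+ (* w1 x) (* w2 y)) c)))

(def or-a (linear-gate 1.5 1.5 1.5)) ; slope -1, as in the text
(def or-b (linear-gate 1.0 3.0 0.5)) ; slope -1/3, a hypothetical alternative

(def binary-inputs [[0 0] [0 1] [1 0] [1 1]])

(= (map or-a binary-inputs) (map or-b binary-inputs)) ; both reproduce OR
[(or-a [0.4 0.4]) (or-b [0.4 0.4])]                   ; but they disagree here
```

Both lines separate the four binary points identically, yet they carve up the rest of the plane differently. Which one is better depends on the data we expect to see.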

Can we do a simple neural network to demonstrate learning AND? Indeed. Let's first create some sample data. It's synthetic - we already know the solution to the AND problem, so we randomly generate x and y values and then label their z.

  (def naive-sample-source [0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0])

  ;; We're cheating - we label each sample using the AND discriminant
  ;; we already know: 1.5x + 1.5y >= 1.51
  (defn naive-class-fn [[x y]]
    (if (>= (+ (* 1.5 x) (* 1.5 y)) 1.51) 1 0))

  (defn naive-sample [sample-count]
    (let [f #(rand-nth naive-sample-source)
          xs (->> f repeatedly (take sample-count))
          ys (->> f repeatedly (take sample-count))
          ;; a vector of [x y] pairs; zipmap would silently drop repeated x values
          sample (mapv vector xs ys)]
      [sample (mapv naive-class-fn sample)]))

  ;; ten random draws from the source, then ten labelled samples
  (->> #(rand-nth naive-sample-source) repeatedly (take 10))
  (naive-sample 10)
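One classic way to actually learn from labelled samples is the perceptron learning rule: nudge the weights by the prediction error, sample after sample. The sketch below is only an illustration under my own assumptions, not necessarily the approach this note builds towards. The names `predict`, `train-step`, `train` and `and-samples` are hypothetical, the learning rate is fixed at 1, and the training data is the exact AND truth table rather than random samples so that the run is repeatable.

```clojure
(defn predict
  "Threshold unit: 1 when bias + w1*x + w2*y is non-negative, else 0."
  [{:keys [w b]} [x y]]
  (if (>= (+ b (* (w 0) x) (* (w 1) y)) 0) 1 0))

(defn train-step
  "Perceptron learning rule for one labelled sample: shift the weights and
  bias by the prediction error. Nothing changes when the prediction is right."
  [model [[x y] target]]
  (let [err (- target (predict model [x y]))]
    (-> model
        (update :w (fn [[w1 w2]] [(+ w1 (* err x)) (+ w2 (* err y))]))
        (update :b + err))))

(defn train
  "Runs `epochs` passes over the labelled samples, starting from zero weights."
  [samples epochs]
  (reduce train-step
          {:w [0 0] :b 0}
          (apply concat (repeat epochs samples))))

;; The AND truth table as labelled samples: [[x y] z].
(def and-samples [[[0 0] 0] [[0 1] 0] [[1 0] 0] [[1 1] 1]])

(def and-model (train and-samples 20))

;; Predictions of the trained model on the four training inputs.
(map #(predict and-model (first %)) and-samples)
```

Because AND is linearly separable, the rule settles on a set of weights after a handful of passes. The learnt discriminant need not match the hand-picked one from earlier; it only has to separate the training samples correctly.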

So, what's the conclusion?

Nothing earth-shattering, except that we now have a notion of AND and OR computation for approximately 1 and approximately 0 values of x and y. And of course, infinitely many combinations in the entire real-numbers domain.

And, the neural network?

We've seen a neuron in action. Let's dwell further on the class of problems it solves for us.

The AND and OR functions classify their inputs into two groups. What we have above is an absolutely naïve classifier.

It's worth repeating – in the case of simple boolean functions, we understand the behavior exactly and have precise formulae to represent them. So, creating a neural network is superfluous. What makes it interesting is that we may want to deal with real-world noisy x and y values that don't fit into the theoretical bounds (err, exact boolean values) as required.

The premise of (artificial) neural networks is to save us from the trouble of finding specific and precise solutions for arbitrary problems, just like humans learn and acquire new knowledge and skills.

Imagine if we fed the above truth tables to a black box, with x and y as our input, and z as the expected outcome. And in turn, this black box learnt the rules and then readied itself to respond with z values for any combination of x and y we threw at it – as long as those input values stayed within reasonable bounds. The definition of reasonable does not exist - it's subjective and is only meant to satisfy the human(s) that approved of that black box's behaviour at some point.

Not hard, right? We could cache all combinations of the input and corresponding output, and respond back with the right values. That's eminently doable for the small size of the training dataset we have. But it breaks down miserably when we unconstrain the input values (add some noise), or even deal with unforeseen input values that are different from the training data by wider margins.
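The breakdown is easy to see in a small sketch. Here `cached-or`, `threshold-or` and the `noisy-inputs` values are made up for illustration. The cache knows only the exact training inputs, while the threshold form of OR from earlier still copes with nearby values.

```clojure
;; Cache of the exact OR table (keys are the exact training inputs).
(def cached-or {[0 0] false, [0 1] true, [1 0] true, [1 1] true})

;; Threshold form of OR, as used earlier in this note.
(defn threshold-or [[x y]]
  (>= (+ (* 1.5 x) (* 1.5 y)) 1.5))

;; Hypothetical noisy readings close to the four training rows.
(def noisy-inputs [[0.05 0.1] [0.1 0.95] [1.02 0.03] [0.9 1.1]])

(map #(get cached-or % :unknown) noisy-inputs) ; every lookup misses
(map threshold-or noisy-inputs)                ; still behaves like OR
```

The cache has no notion of nearness, so even tiny perturbations fall outside it. The threshold function generalises simply because it is defined over the whole plane.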

As a first step towards creating such an entity, we can approximate the black box as a linear regression, and treat training it as the activity of solving this linear regression. The general equation looks as follows, for the two input-signal scenario

z = (w_x · x + w_y · y ≥ c)

Or, making it somewhat more general and rearranging terms to be on one side

z = (Σ w_i · x_i - c ≥ 0)

Which can again be re-written more generally as

Z = (W^T · X - c ≥ 0)

Where W and X are the weight and input matrices respectively, and W^T is the transpose of W.
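The general form can be sketched for any number of inputs, taking c to be the threshold from the two-input equation (the weighted sum must reach c for the output to be true). The names `dot` and `perceptron`, and the three-input weights, are assumptions made for this illustration.

```clojure
(defn dot
  "Dot product of two equally long vectors."
  [ws xs]
  (reduce + (map * ws xs)))

(defn perceptron
  "Returns true when the weighted sum of `xs` reaches the threshold `c`,
  i.e. when (W . X) - c >= 0."
  [ws c xs]
  (>= (- (dot ws xs) c) 0))

(perceptron [1.5 1.5] 1.51 [1 1])      ; two inputs, the AND unit from earlier
(perceptron [1.0 1.0 1.0] 2.5 [1 1 0]) ; three inputs, one of them off
(perceptron [1.0 1.0 1.0] 2.5 [1 1 1]) ; three inputs, all on
```

Nothing in the function depends on there being exactly two inputs; the weight vector simply grows with the input vector.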

Let's write some helper code to see this in action with real inputs

  (defn or' [x y]
    (or (= x 1) (= y 1)))

  (defn and' [x y]
    (and (= x 1) (= y 1)))

  (defn xor [x y]
    (not= x y))

x y
1 0
1 1
0 0
0 1

  (map (fn [[x y]] (list x y (or' x y))) input-table)
1 0 true
1 1 true
0 0 false
0 1 true

  (map (fn [[x y]] (list x y (and' x y))) input-table)
1 0 false
1 1 true
0 0 false
0 1 false

  (map (fn [[x y]] (list x y (xor x y))) input-table)
1 0 true
1 1 false
0 0 false
0 1 true