Probably definitely incomplete notes on probability.


Sample Space \(S\)
The set of all possible outcomes
Event \(E\)
A subset of the sample space \(S\).
The likelihood. Probabilities are associated with events.
  • \(P(S) = 1\)
  • \(0 \le{} P(E) \le{} 1\)
  • \(P(\cup_{n=1}^\infty{}E_n) = \Sigma{}_{n=1}^\infty{}P(E_n)\) – for mutually exclusive events.
Random Variables
Real-valued functions defined on the sample space. These are not events.

Topics and Tools You Should Definitely Know

These are some of the topics you should definitely know of with a good enough understanding to make understanding the theory of probability more accessible.

The science and the art of counting.

Probability, Loosely Speaking

If \(E\) is an event in \(S\) with a defined probability \(P(E)\), then the probability of the complement event set \(P(E^c)\) is \((1 - P(E))\).

Events are subsets. So, \(\cup{}E\) is the entire \(S\). And also, not all events are mutually exclusive. So, \(\Sigma{}P(E), \forall{} E\) is \(\ge\) 1, because you end up double+ counting. Which brings us to the following for a simple case of events \(E\) and \(F\).

\[P(E \cup F) = P(E) + P(F) - P(EF)\]

The notation \(EF\) conveys the intersection of the sets \(E\) and \(F\). It’s also written as \(P(E \cap F)\).

The generalised form is

\[P(E_1 \cup E_2 \ldots{} \cup E_n) = \Sigma_i{}P(E_i) - \Sigma_{i \lt j}P(E_i E_j) + \Sigma_{i \lt j \lt k}P(E_i E_j E_k) \ldots{} + (-1)^{n+1}P(E_1 E_2 \ldots{} E_n)\]

In words, the probability of the union of \(n\) events is equal to the sum of the probabilities of each event, minus the sum of the probabilities of the events taken two at a time, plus the sum of the probabilities of the events taken three at a time, and so on.

Conditional Probabilities

This is expressed as \(P(E|F)\). Intuitively, \(P(EF)\) is both of \(P(E)\cdot{}P(F|E)\) and \(P(F)\cdot{}P(E|F)\). \(F\) has a chance \(P(F)\) of occurring, and once that has happened, \(P(E|F)\) now informs you about \(P(EF)\).


Understanding independence is important, because correlation/causation are important to get right when dealing with data. When dealing with the probability of two independent events \(E\) and \(F\), \(P(EF) = P(E) \cdot{} P(F)\).

The generalization for \(n\) independent events is \[ P(A_1 \cap A_2 \cap \ldots \cap A_n) = P(A_1)P(A_2)\ldots{}P(A_n)\]

Thought experiment: How does one go about to infer the independence of two events, given we are allowed to run an experiment as many times as we wish?

Let the events be \(A\) and \(B\), with probabilities of occurrences being \(P(A)\) and \(P(B)\) respectively. One idea to explore would be to consider all the events \(A\) that occurred over multiple runs of the experiment, while also recording the \(B\) events. Consider the set of all the occurrences of \(A\) as a new sample space, and within that sample space, compute the probability of \(B\) occurring. If \(A\) and \(B\) are truly independent, then over a large number of experiments, the computed probability of \(B\) in the sub-sample will equal the probability \(P(B)\) over the original sample space.

Bayes’ Formula

\[E = EF \cup EF^c\]

Both \(EF\) and \(EF^c\) are mutually exclusive. How do we express \(P(E)\) if we have prior information about \(EF\) and \(EF^c\)?

\[P(E) = P(EF) + P(EF^c) = P(E|F)\cdot{}P(F) + P(E|F^c)\cdot{}P(F^c)\]

Simplified, another way to write it is

\[P(E) = P(E|F)\cdot{}P(F) + P(E|F^c|\cdot{}(1 - P(F)))\]

Example 1.5.2 from Book 2. Suppose urn #1 has 3 red and 2 blue balls, and urn #2 has 4 red and 7 blue balls. One urn, and then one ball, are randomly picked, and found to be blue. What is the probability that the ball came from urn #2?

There are two stages - the selection of an urn, and then the selection of a ball. Let \(A\) denote the event of urn #2 being picked, and \(B\) denote the event that a blue ball is selected.

\begin{align} P(A|B) &= \frac{P(A)}{P(B)}P(B|A) \\ &= \frac{\frac{1}{2}}{\frac{1}{2}\frac{2}{5} + \frac{1}{2}\frac{7}{11}} \frac{7}{11} \\ &= \frac{35}{57} \end{align}

There’s another way of arriving at the above, intuitively speaking. The probability of urn #2 being chosen, given we have a blue ball, is the probability of choosing a blue ball from urn #2, divided by the overall probability of choosing a blue ball. Which is,

\[ \frac{\frac{7}{11}}{\frac{2}{5} + \frac{7}{11}} \]

But, this can be deceiving, because it holds in the simple case where the probability of choosing either of the urns is equal - which, hence, cancels out in the numerator and the denominator. But the intuition is not wrong, just not too obvious for the general case.

Random Variables

To recall, random variables are real-valued functions defined on the sample space \(S\).

For example, when rolling two dice, the sum \(X\) of the values of the top faces could be called as a random variable. Random variables will have their own associated probabilities, counted in terms of the underlying event probabilities. For the two dice example, here are a few

  • \(P\{X = 2\} = P\{(1, 1)\} = \frac{1}{36}\)
  • \(P\{X = 5\} = P\{(1, 4), (2, 3), (3, 2), (4, 1)\} = \frac{4}{36} = \frac{1}{9}\)

In this example, the random variable \(X\) can take any value from 2 through 12.

The sum of all probabilities of the random variable \(N\), the number of flips of a coin required until the first head appears, is 1.

Suppose the probability of getting a head is \(p\). Then, \[P\{N = n\} = P\{(T, T, T, \ldots{}, T, H)\} = (1 - p)^{n - 1}\cdot{}p\]

\begin{align} P(\cup_{n = 1}^{\infty}\{N = n\}) &= \Sigma_{n = 1}^{\infty} P\{N = n\} \\ &= p \Sigma_{n = 1}^{\infty}(1 - p)^{n - 1} \\ &= \frac{p}{1 - (1 - p)} \\ &= 1 \end{align}


Random variables \(X\) are real-valued functions of outcomes \(s\) over the sample space \(S\). As \(s\) is random, the random variable associated with it is random as well.

If, say, \(B\) is a subset of the real values set that \(X\) can take, then

\(P\{X \in B\} = P\{s \in S : X(s) \in B\}\)

If \(X\) is a random variable, then the distribution of \(X\) is the collection of probabilities \(P(X \in B)\) for all subsets \(B\) of the real numbers.

Discrete Distributions

For many random variables \(X\), we have \(P(X = x) > 0\) for certain \(x\) values. If \(\Sigma_{x \in R^1}P(X = x) = 1\), then the random variable X is discrete. Because, we can enumerate all the $x$s for which the probabilities of the corresponding random variable is known.

But the above does not hold for continuous distributions.

The Binomial Distribution

Consider flipping n coins, each of which has a probability \(\theta\) of coming up heads, and hence a probability \(1 - \theta\) of coming up tails. \(0 < \theta < 1\). Let \(X\) be the total number of heads showing. Then, for \(x = 0, 1, 2, \ldots n\),

\begin{align} p_X(x) &= P(X = x) \\ &= \binom nx \theta^x(1 - \theta)^{n - x} \\ &= \frac{n!}{x!(n - x)!}\theta^x(1 - \theta)^{n -x} \end{align}

The random variable \(X\) is said to have the Binomial(x, \(\theta\)) distribution, also written as \(X \sim Binomial(x, \theta)\).

The binomial distribution is applicable to any situation involving n independent performances of a random system; for each performance, we are recording whether a particular event has occurred, called a success, or has not occurred, called a failure.

Computing a random variable - Example 2.5 problem

Working through this example, and spending time to grok it, should be helpful in grasping the idea of random variables better.

Suppose that independent trials, each of which results in any of m possible outcomes with respective probabilities \(p_1, p_2, \ldots{}, p_m\), \(\Sigma_{i = 1}^m p_i = 1\), are continually performed. Let \(X\) denote the number of trials needed until each outcome has occurred at least once.

The goal is to find a closed form solution for \(X\). Thinking of \(X = n\) feels somewhat out of grasp. The example presents a technique to compute the probability of the condition that at least one of the outcomes has not occurred after \(n\) trials, expressed as \(P\{X > n\}\).

\begin{align} P\{X > n\} &= P(\cup_{i = 1}^m A_i) \\ &= \Sigma_{i=1}^mP(A_i) - \Sigma\Sigma_{i<j}P(A_i \cdot A_j) + \Sigma\Sigma\Sigma_{i<j<k}P(A_iA_jA_k) - \ldots + (-1)^{m+1}P(A_1A_2 \ldots A_m) \end{align}

\(P(A_i)\) is the probability that each of the first \(n\) trials results in a non-\(i\) outcome. As each trial is independent,

\[P(A_i) = (1 - p_i)^n\]


\[P(A_iA_j) = (1 - p_i - p_j)^n\]

All other probabilities are similar. So, we can extend the above equation as

\[P\{X > n\} = \Sigma_{i=1}^m(1 - p_i)^n - \Sigma\Sigma_{i < j}(1 - p_i - p_j)^n + \Sigma\Sigma\Sigma_{i<j<k}(1 - p_i - p_j - p_k)^n - \ldots\]

Now comes the following key observation.

\[P\{X = n\} = P\{X > n - 1\} - P\{X > n\}\]


Probably definitely incomplete notes from the following