Last week, for pedagogical reasons, I had to solve the following problem:

In a bag there are \(4\) blue and \(5\) red balls. We randomly take \(3\) balls from the bag without replacement. Let the random variable \(X\) denote the number of blue balls we get. Find out the expectation and variance of \(X\).

The problem is quite a simple one. For finding the expectation, we can use the definition \[E[X] = \sum_{x} x P(X=x)\]

We can also compute \[E[X^2] = \sum_{x}x^2 P(X=x)\]

Then we get the variance using the formula \[\text{Var}(x) = E[X^2] - (E[X])^2\]

It is a straightforward exercise to calculate the probabilities. I got \(P(X=1) =
\frac{20}{42}\), \(P(X=2)=\frac{15}{42}\), and \(P(X=3)=\frac{2}{42}\)^{1}. Substituting in
the equations I got the expectation as \(\frac{4}{3}\) and variance as
\(\frac{5}{9}\).

I wanted to double-check the values. Instead of retracing the steps as a sane
person would have done^{2}, I wrote a small program to simulate the
experiment.

## The Simulation

Here is the class `RedBlue`

which defines the essential methods:

```
import random as rand
from enum import Enum
class Color(Enum):
RED = 1
BLUE = 2
class RedBlue:
def __init__(self, num_reds, num_blues):
self.num_reds = num_reds
self.num_blues = num_blues
self.bag = [Color.RED] * num_reds + [Color.BLUE] * num_blues
def prob_blues(self, num_picks, num_trials):
"""Return the probability of getting 'i' blue balls when 'num_picks' balls are
randomly taken from 'self.bag' without replacement. The return value is
a list whose ith element is the probability of getting 'i' blue balls.
"""
counts = [0] * (num_picks + 1)
for i in range(num_trials):
counts[ self.count_blues(num_picks) ] += 1
return [ x/num_trials for x in counts ]
def count_blues(self, num_picks):
"""Returns the number of blue balls observed when 'num_picks' balls are taken
from 'self.bag' without replacement."""
balls = rand.sample(self.bag, num_picks)
return balls.count(Color.BLUE)
```

The `sample`

function in Python’s `random`

library is used to get a random
sample sample from the input population, without replacement. The `count_blues`

function gets a sample, and then counts the number of blue balls it contains.
The `prob_blues`

function repeatedly calls `count_balls`

to estimate the
probability of getting each possible number of blue balls.

My initial idea was to compute the probabilities using the `prob_blues`

function, and then to use the formulas for \(E[X]\) and \(E[X^2]\) to calculate the
Expectation and Variance.

It then stuck me that it was an unnecessarily roundabout way. Expectation is
ultimately just the mean of the random variable. So we could as well take the
average of the values that `count_blues`

returns to get the Expectation.
\(E[X^2]\) can also be computed similarly.

So I ended up not using the `prob_blues`

function. The rest of the script is
given below:

```
rb = RedBlue(5, 4)
# Compute the expectation and variance
s = 0 # For E[X]
s2 = 0 # For E[X^2]
num_trials = int(1e5)
num_picks = 3
for i in range(num_trials):
num_blues = rb.count_blues(num_picks)
s += num_blues
s2 += num_blues * num_blues
mean = s/num_trials
mean_x2 = s2/num_trials
var = mean_x2 - mean * mean
print("Mean = {0}, Variance = {1}".format(mean, var))
```

The program outputs the estimated Expectation as \(1.3319\), and Variance as \(0.5589\). Both are close to the analytical quantities.

## Advantages

Is this method more efficient than computing the probabilities first and then finding the weighted average? The answer is no. Whether we compute the probabilities or not, we end up doing essentially the same work. If we do \(n\) trials and if there are \(d\) distinct values for the random variable, to compute the probabilities we need \(O(n)\) additions. Then to calculate the expectation and variance, we need a further \(O(d)\) multiplications and additions.

In the function given above, all of the computation happens in a loop which gets executed \(n\) times. Therefore we have \(O(n)\) multiplications and additions. If additions and multiplications are equally costly (as they are for small-enough numbers), both methods are equivalent in complexity. For bigger numbers (those which can’t fit in one machine word), multiplication is costlier than addition, and our function would perform worse.

There is, however, one important advantage: for computing the probabilities, we need to spend \(O(d)\) amount of space. We save that memory when we do the direct computation of Expectations. For our particular problem, that does not make any difference, as \(d\) is only \(3\). But in general, this might matter a lot.

### The Geometric Random Variable

The geometric random variable is a case in point. It stands for the number of trials we need to perform for being successful. If we encounter \(k-1\) failures before the first success, the value of the random variable is \(k\). We denote the probability of success in a single trial as \(p\).

The geometric random variable can have any positive integer value. Therefore, to estimate its expectation and variance using the probability method, we need \(O(n)\) amount of memory in the (unlikely) worst case, where each trial gives a different value. The direct method has no such headache.

Even more importantly, it also works for continuous random variables.

### The Uniform Random Variable

Possibly the simplest continuous random variable is the Uniform random variable.
Python’s random module’s `random()`

function is a uniform random variable with
range \([0, 1]\). To compute its expectation and variance, we need to use
integration. We can of course approximate the integration as a sum and use
something very similar to the `prob_blues`

function to evaluate it.

But it is much clearer to use the direct method:

```
s, s2 = 0, 0
num_trials = int(1e5)
for i in range(num_trials):
x = rand.random()
s += x
s2 += x*x
mean = s/num_trials
mean_x2 = s2/num_trials
var = mean_x2 - mean * mean
print("Estimated mean = {0:.4f}, variance = {1:.4f}".format(mean, var))
```

When you run the code, you should get the estimated mean to be close to \(0.5\), which is the Expectation of the Uniform random variable. The variance would be around the theoretical value \(\frac{1}{12}\).

If you want to play around with the code, you can get the files from
the `Expectation and Variance`

sub-directory of this git repository.