An Exploration of Basic Probability - Counting & Expectation

Posted on November 11th, 2014


I've come across probability in all sorts of contexts: in the social sciences, machine learning, a bit of measure theory, functional analysis, and more. As such, it has become a set of concepts and tricks I'm acquainted with but haven't really spent much time absorbing in a unified way. Here I'll try to highlight some very basic but subtle points, just for exploration's sake. I’ll start with a view of probability as simple counting, and play with expectations in an uncommon setting. There will be the occasional formality, but only when I think it adds to the exploration.


Suppose we have events , and a probability measure . Measures are easy, and you can read up on them here. A probability measure just sums to 1 over the space of events. This is precisely counting: let’s say in 100 trials, , , and happen respectively 30, 20 and 50 times. We’re not talking about the reason this happens; take it as a law of the universe. Then if we’re looking at a random variable , we have , and so on. Counting :)
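To make the counting view concrete, here is a minimal sketch. The labels `a`, `b` and `c` are placeholders chosen for this example, and `empirical_measure` is a hypothetical helper, not a library function.

```python
from collections import Counter

# Placeholder event labels; only the counts (30, 20, 50) come from the text.
observations = ["a"] * 30 + ["b"] * 20 + ["c"] * 50

def empirical_measure(samples):
    """Return each event's relative frequency: its count divided by the total."""
    counts = Counter(samples)
    total = len(samples)
    return {event: count / total for event, count in counts.items()}

P = empirical_measure(observations)
print(P)                # relative frequency of each label
print(sum(P.values()))  # the measure sums to 1, up to float rounding
```

Nothing here goes beyond dividing counts by the number of trials, which is the whole point.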

We could look at combinations of events, like . The set of all these possibilities is formalized as a . If you go through the definition, it’s actually very reasonable.

I’ve found that these things are always presented as very matter of fact. But what happens when we try and look at slightly funky things, like non-numeric events?


We presumably all know the formula . But what happens when we don’t have numeric values? What’s the expected value of , , and in our case?
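For reference, the usual discrete form of expectation can be written as follows. The notation is an assumption of this sketch: $x_i$ stands for the numeric value attached to the $i$-th event and $p_i$ for its probability.

```latex
\mathbb{E}[X] = \sum_{i} x_i \, p_i, \qquad p_i = P(X = x_i), \qquad \sum_i p_i = 1
```

The formula only makes sense once every event has a number $x_i$ attached to it.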

We could start by simply assigning real numbers. For example, we could set or . But then in the first case we have an expected value of 2.2, and in the second case 1.8, without changing the fundamental problem. This leads to, as far as I know, a fundamental restriction when we want to play with numeric concepts like expected value or variance. Technically speaking, that is why we define them in terms of a random variable, a function from where is a measurable space, usually . But this doesn’t tell us much about how we might think about assigning numbers to our events. Let’s investigate this a bit further since we’re already having so much fun.
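A small sketch of how the expected value moves with the labeling. The event names and the two assignments below are assumptions: they are one pair of labelings that happens to reproduce 2.2 and 1.8 with the probabilities from the 100-trial example, and other assignments would do so too.

```python
# Probabilities from the 100-trial counts; labels a, b, c are placeholders.
P = {"a": 0.3, "b": 0.2, "c": 0.5}

def expectation(values, probs):
    """Sum of value * probability over all events."""
    return sum(values[e] * probs[e] for e in probs)

# Assumed labelings, chosen because they give the two figures quoted above.
ascending = {"a": 1, "b": 2, "c": 3}
descending = {"a": 3, "b": 2, "c": 1}

print(round(expectation(ascending, P), 10))   # 2.2
print(round(expectation(descending, P), 10))  # 1.8
```

The probabilities never change between the two calls; only the numbers attached to the events do.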

I’ll try to develop a bit of intuition concerning expectation when we are using numbers, to see how we might want to make the leap from non-numbers. Consider now with .

[Figure: probability histogram]

Linearity in events

What happens if we play with our event values?

It’s not too hard to see from our formula for expected value that if we change a given event’s value by , we change expectation by .
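This can be checked numerically. The sketch below assumes placeholder labels, an arbitrary numeric assignment, and a hypothetical helper `shift_value`; it shifts one event's value at a time and compares the change in expectation with the shift times that event's probability.

```python
P = {"a": 0.3, "b": 0.2, "c": 0.5}       # placeholder labels
values = {"a": 1.0, "b": 2.0, "c": 3.0}  # assumed numeric assignment

def expectation(values, probs):
    """Sum of value * probability over all events."""
    return sum(values[e] * probs[e] for e in probs)

def shift_value(values, event, delta):
    """Copy of `values` with one event's value moved by `delta`."""
    shifted = dict(values)
    shifted[event] += delta
    return shifted

delta = 0.7
changes = {}
for event in P:
    changes[event] = expectation(shift_value(values, event, delta), P) - expectation(values, P)
    # Observed change next to delta times the event's probability.
    print(event, round(changes[event], 10), round(delta * P[event], 10))
```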

Linearity in probabilities

What happens if we now play with assigned probabilities over fixed ? It’s even more interesting now: since probabilities need to sum to 1, increasing an event’s probability means decreasing that of one or more others. In the simplest case (the complicated case really isn’t complicated), we have

In conclusion, expectation looks like a linear function over both its event space and their assigned probabilities. Nice! Linearity usually points to something like a vector space. Going back to our problem of dealing with non-numeric events, we now have a slightly better idea of how we might want to convert them to numeric values in a way that makes sense. To be precise, if we’re thinking of assigning to 1 or 4, we should be aware that the latter will pull expectation up by a factor of 4 times the assigned probability.
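The probability side can be sketched the same way. Assuming the same placeholder labels and values, and a hypothetical helper `move_mass`, moving a small amount of probability from one event to another should change the expectation by that amount times the difference of the two values.

```python
P = {"a": 0.3, "b": 0.2, "c": 0.5}       # placeholder labels
values = {"a": 1.0, "b": 2.0, "c": 3.0}  # assumed numeric assignment

def expectation(values, probs):
    """Sum of value * probability over all events."""
    return sum(values[e] * probs[e] for e in probs)

def move_mass(probs, source, target, eps):
    """Copy of `probs` with `eps` of probability moved from source to target."""
    moved = dict(probs)
    moved[source] -= eps
    moved[target] += eps
    return moved

eps = 0.1
Q = move_mass(P, source="a", target="c", eps=eps)
observed = expectation(values, Q) - expectation(values, P)
predicted = eps * (values["c"] - values["a"])
print(round(observed, 10), round(predicted, 10))
```

Because the mass removed from one event is added to another, the probabilities still sum to 1 after the move.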

Norms and “Pre-processing”

Consider assigning , with equal probabilities of . By our previous argument, contributes three times as much to expectation per apple of probability. We’re now comparing between events. The idea of comparison might tempt us to play with metrics or even inner products, but really all we need is norms. In this case, we have . Since we’re working in , we just have the ratio of absolute values.

The question then arises, might there be any purpose in converting non-numeric events to a normed vector space other than ? We could call this “pre-processing”, since in any case we’re ultimately taking a norm which leads to a real number. But there might be some modelling advantages. A simple case would be . Consider a point at . We could translate it to or ; either would change the by the same amount.
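As an illustration with made-up coordinates (the points below are assumptions, not the ones from the original example), two different translations of a point in the plane can land on the same Euclidean norm:

```python
import math

def l2_norm(point):
    """Euclidean length of a 2D point."""
    return math.hypot(point[0], point[1])

start = (3.0, 4.0)     # assumed starting point, norm 5
moved_one = (6.0, 8.0)    # moved further out along the same direction
moved_two = (0.0, 10.0)   # moved somewhere else entirely

print(l2_norm(start))      # 5.0
print(l2_norm(moved_one))  # 10.0
print(l2_norm(moved_two))  # 10.0
```

Both destinations raise the norm from 5 to 10, so once the norm is taken they are indistinguishable.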

[Figure: point translation]

In fact by using the we’re defining an equivalence class on a circle. For example each point could represent (height, age), and we’re defining a tradeoff function when it comes to probability expectation. Different norms lead to different tradeoff functions:

[Figure: unit norms]
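A rough way to see the different tradeoffs is to evaluate a few sample points under several norms. The points and the helper `p_norm` below are assumptions made for this sketch.

```python
import math

def p_norm(v, p):
    """The p-norm of a vector; p = math.inf gives the max norm."""
    if p == math.inf:
        return max(abs(x) for x in v)
    return sum(abs(x) ** p for x in v) ** (1 / p)

points = [(1.0, 0.0), (0.6, 0.8), (0.5, 0.5), (1.0, 1.0)]

for point in points:
    norms = {p: round(p_norm(point, p), 4) for p in (1, 2, math.inf)}
    print(point, norms)
```

Under these assumptions, (0.6, 0.8) has norm 1 under the 2-norm only, (0.5, 0.5) under the 1-norm only, and (1, 1) under the max norm only, while (1, 0) has norm 1 under all three. Each norm therefore groups points into its own equivalence classes.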

One could imagine first converting a non-numeric sample space into a meaningful vector space with “probability” equivalence classes induced by an appropriate norm.
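Put together, the idea might look like the sketch below. Everything in it is an assumption made for illustration: the placeholder labels, the 2D embedding, the choice of the Euclidean norm, and the helper `expected_norm`.

```python
import math

# Placeholder events with the probabilities from the counting example.
P = {"a": 0.3, "b": 0.2, "c": 0.5}

# Hypothetical embedding of each non-numeric event into the plane.
embedding = {"a": (3.0, 4.0), "b": (0.0, 5.0), "c": (6.0, 8.0)}

def euclidean(v):
    """Euclidean norm of a vector given as a tuple."""
    return math.hypot(*v)

def expected_norm(embedding, probs, norm=euclidean):
    """Expectation of the norm of each event's vector."""
    return sum(probs[e] * norm(embedding[e]) for e in probs)

print(round(expected_norm(embedding, P), 10))  # 7.5

# "a" and "b" have the same norm, so swapping their vectors changes nothing.
swapped = dict(embedding, a=embedding["b"], b=embedding["a"])
print(round(expected_norm(swapped, P), 10))    # 7.5
```

The events `a` and `b` sit on the same circle, so they fall in the same equivalence class as far as expectation is concerned.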


I’m still only at the beginning of this exploration. I hope I’m not being too naïve. Some parts may seem banal, and I could be much more formal, but hopefully by looking at concepts in a down-to-earth way we’ll encounter some interesting questions that I haven’t come across directly in the academic literature, as we did with expectation and non-numeric sample spaces here.