# Whence Softmax?

Posted on August 3rd, 2017

## Introduction

What follows is a quick summary of one way to construct the Softmax classifier. This is really just a review for myself, but will perhaps be of use to someone out there. I’m also hoping it serves as a warm-up for more novel ideas and their exposition.

## Probabilities and the Exponential Family

Probabilities (distributions) are just functions of a set of variables that are normalised to sum to one. Some of my favourite distributions follow.

### Multinomial

Here we have a $K$ sided die where side $k$ has probability $\theta_{k}$ of showing up. We can let $\bar{\theta} = [ \theta_1, \theta_2, ..., \theta_K ]$. The likelihood of a bunch of throws (or data points) is then given by $p(D \mid \bar{\theta}) = \prod_{k=1}^{K} \theta_k^{N_k}$ where $N_k$ is how many times side $k$ came up. We have a product since the data points are assumed to be independent, which is often a reasonable assumption.

### Beta and Dirichlet

We’ll often want to model the probability of something being between $0$ and $1$. For example, we might want the probability of some indexed value like economic GDP or height. More interestingly, we may want the probability of a probability parameter (think of a $\theta_k$ above). The Beta distribution is great for this: with two parameters $a$ and $b$, we can get a wide variety of “shapes” over $[0,1]$.

$B(a,b)$ is called the “beta function”: think of it as the normalisation factor. Here are some example Beta distributions with different $a$ and $b$:

I like to remember distributions in pairs. The simplest way of doing so is considering their univariate and multivariate versions. The multivariate version of the Beta distribution, where a bunch of values can range from $0$ to $1$, is called the Dirichlet distribution:

where $ind$ is the indicator function, which is 0 unless its argument is true. In this case, that $\theta$ is in the “probability simplex” $S_k$. What’s that? It’s the set of $\bar{\theta}$’s such that each $\theta_k$ is between $0$ and $1$, and they all sum up to 1. So if $\bar{\theta}$ is a joint distribution of parameters (or economic indices, indexed weights etc…), this would be a good way to model that. The indicator’s effect is super simple: is just means everything outside the simplex is 0.

Another great one is the Gamma distribution, which applies to positive real values (building heights, chemical concentrations etc…). I’ll leave that to the reader. Turns out all sorts of distributions like chi-squared and exponential are special cases thereof. Fun challenge: explore multivariate versions of each these!

## Generalised Linear Models (GLMs)

### The Exponential Family

As you can see, there are many common distributions. After seeing a handful, people began to notice patterns. One such pattern was that many distributions are of the following form:

$\eta$ (pronounced “eta”) is often called the “natural parameter”. In term of notation, I used a $\bar{bar}$ to denote vectors before, whereas here I do not ($\eta$ could be a vector, hence the transpose used here). What I’ve found is that different notations are good in different contexts. In this case the bar would be clunky. A good approach is to be conscientious about what notation to use when writing, and flexible when reading. It’s also a good test of understanding to think about what the notation actually means in a given context.

Here’s an example: the Bernoulli distribution we all know and love (coin toss) is part of the exponential family. To see this, notice that

And we’re done. Can you map this to our formula for the exponential family?

### GLMs

Many situations arise where we wish to impose a certain distribution on $y \mid x$; that is, given a value for $x$, we want to get the probability for $y$ using a Bernoulli, Multinomial, Poisson, or other distribution based on the situation (a coin toss, multi-classification, rain droplets…). Turns out that if we take this as a starting point and add a few “linear” assumptions, we obtain a machinery to crank out prediction algorithms, including Softmax. The three step process is:

1. Assume $y \mid x ; \theta \sim ExpFam(\eta)$

2. Given $x$, we know that we want our hypothesis function $h_{\theta}(x) = E[ y \mid x ; \theta ]$, in other words we “predict what we expect”

3. The natural parameter is linear in $x$. That is, $\eta = \theta^T \cdot x$

For instance, using Bernoulli as our starting point (it’s a member of the Exponential Family), let’s figure out what our hypothesis function should be:

$h_{\theta}(X) = E[y \mid x] = \phi$ and $\eta = lg \frac{\phi}{1-\phi}$ which implies $\phi = \frac{e^{\eta}}{1 + e^{\eta}}$

Substitute $\eta = \theta^T \cdot x$ and we have logistic regression! Brilliant.

The core limitations of GLMs lie in their assumptions. They’re pretty reasonable, but insofar as they are inaccurate, so is the system. For example, if $y \mid x$ follows a different distribution to the one we choose, our predictions may suffer.

## Softmax

Let’s say we want to predict one of several outcomes i.e. multi-classification. We can repeat the above process, using the Multinomial instead of Bernoulli distribution to obtain a suitable classifier.

First, we need to show that the Multinomial belongs to the Exponential Family. Beyond assuming this for our GLM, remember that it enables us to figure out what the natural parameter $\eta$ is relative to the Multinomial’s parameter vector $\bar{\phi}$. We’ll need this to figure out what $E[ y \mid x]$ should be in terms of $\eta$ - that’s how we get our classifer $h_{\theta}$.

Notice the trick in the last exponent $1 - \sum_{i=1}^{k-1} ind(y=i)$. Now how do we view this as part of the Exponential Family? The key is we want some $T(y)$ in the exponent; the rest are constants that follow naturally (no pun intended). Remember $T(y)$ can be a vector, so we set:

In other words $T(i)$ has a $1$ in the $i$’th place only. Since we have $k-1$ degrees of freedom, $T(y)$ need only be $k-1$ dimensional, with $T(k)$ having all zeros. Then we continue:

which has the correct form.

Now, where do we start, where do we end? We now want $p(y=i \mid x) = \phi_i$. Let’s get this:

And voilà! We rearrange the final line to get $\phi_k = \frac{1}{\sum e^{\eta_i}}$, and sub that into line $(2)$ to get what we want: $\phi_i = \frac{e^{\eta_i}}{\sum_j e^{\eta_j}}$, also known as the “Softmax response function”! Using our assumption that $\eta_i = \theta_i^T \cdot x$, we have our classifier given $x$. Explicitly:

Notice that each $\theta_i$ is a vector. In other words, each class $i$ has a bunch of weights for each element of $x$ (each feature of $x$ could represent say the height, weight etc. of an individual) that contribute to that being the right class. These are then “squooshed” (technically speaking) according to the Softmax function to obtain a probability for that class.

## Conclusion

It’s always unsatisfactory to be given a formula without having at least one more fundamental view of where it comes from. Here we’ve presented Softmax classification, as derived from a few reasonable assumptions in what we call the GLM framework. Perhaps with a bit of creativity, we’ll stumble upon other ways of constructing predictive tools that prove useful.

## Acknowledgements

As stated, the content here is not novel. It’s a re-exposition of Andrew Ng’s notes, with some inspiration from Andrej Karpathy. Hopefully it provides a slightly different review to some, and it most certainly served as one, as well as a warmup, to me.