Posted on December 20th, 2014
Writing explicit programs to accomplish sophisticated tasks can quickly become intractable. For example, it may be evident how to tell a computer to change screen color based on which user is logged in: this is definite. Writing a definite program to recognize pigeons in a photo is not obvious. The core issue is that it is difficult to determine what set of rules is optimal, and even if we knew them it would be a lot of work to hard code them all.
Machine learning is then a further level of abstraction: we hard code the process of learning more generally, and hope the algorithm identifies a good set of rules in classification or regression, based on labelled data (supervised learning), unlabelled data (unsupervised learning), or a mixture of both.
Neural networks are a class of these learning algorithms. Let’s look into why we might care about them at all, given some other possible techniques. Our analysis will not be comprehensive, but I hope to draw attention to some of the more interesting questions.
As we said, we want some sort of algorithm that’s good at learning rules from data. Probability is a good framework for this since data is seldom black or white: pigeons have a whole range of sizes, but some sizes are more likely (intuitively speaking) than others.
So how can we use data to learn? It can be stifling to try and imagine all the possible ways. Indeed there are a number of concrete techniques: SVMs that seperate classes through margins, decision tree methods like random forests, neural networks, and many others.
But before thinking about diving into the details ad infinitum, I find it helpful to just take a step back and look at the overall picture. One way to do this is to consider local methods versus global methods.
Suppose we have a bunch of data points each with a value $x$, and we’re hoping to estimate a value $y$ given some $x$. Without loss of generality let’s assume these are real numbers. For example, consider the following picture generated using an $R$ standard data set.
Now say we don’t have an exact data point for car weight of 3.5 (tonnes for instance). How would we reasonably estiffmate efficiency? One reasonable way would be to try to find $E(efficiency \mid weight = w)$ (a specific weight w), or $E(Y \mid X=x)$ more generally for random variables $X$ and $Y$. We don’t know the “true” value of this, so we’re estimating.
The local way to do this would be to define a region around $X=x$, and take the average of points in that region. For example, to estimate efficient at a weight of $3.5$, we could take the average between weights of $3.25$ and $3.75$:
With one dimension of data ($x$-values), the range is a line i.e. $3.25 - 3.75$. In higher dimensions, we have hyperspheres. Clearly if we have many points, taking a smaller range closer to our desired value is intuitively more accurate, but we lose statistical support with fewer data points.
In fact, even if we fix the size of the range we’re looking at, as we increase dimensions, we require exponentially more points to have the same density. Don’t worry, this is not trivial or completely intuitive! It’s called the curse of dimensionality. So using our direct, local approach (often seen as “nearest neighbors”), we pay a price through sparsity.
The answer? Global methods. These use all data points for estimation. The drawback is that we have to make modelling assumptions, which may or may not be correct. I personally see this as “injecting” assumptional information in order to make up for sparsity.
In our car example, we could fit different types of polynomials to all of the data. We would do so by minimizing some kind of loss, often the the square of distances between predictions and data points (vertical distance here). This loss function is often used because it’s nice and smooth, has probabilitic interpretations, and leads to our desired $E(Y \mid X=x)$, although one could contemplate other loss functions.
Clearly by fitting more and more complex polynomials we predict the data better, but this may not generalise to new data points: the famous bias and variance tradeoff.
I haven’t said much about Neural Networks here, and I’ll save the basic mechanics for the next post. What we have seen is a raison d’être for Machine Learning, and two lines of attack in making estimations given new data points. We’ve also encountered concepts like the curse of dimensionality and the bias and variance tradeoff.
So where do neural nets fit in? Their coolest aspect by far (astronomical units) in my view is in how they deal with input. After all, we’re hoping to throw in inputs, and get some sort of output with our machine learning models. An underlying assumption so far has been that we know what the inputs should be. However consider an image, taken as its pixels. In most approaches, we would provide the color for each pixel, and try to have the algorithm estimate things like class (i.e. is this an image of a dog or a cat).
But what features are important in such a decision? Just the pixels? Should we also feed in black and white versions of the image? Perhaps rotations? The signal’s Fourier Transform? This is where neural nets (can) excel. What is truly remarkable is that they transform features in ways that prove useful. The programmer/designer need not then think about how to normalize values, scale them etc (as much).
In conclusion, neural nets are just an approach in trying to estimate new data points based on learned ones. Where they stand out is in their ability to transform input in useful ways. This is in line with the objective of machine learning we mentionned: to abstract away explicit rule definitions.
How do they accomplish this? We’ll begin to explore neural net architecture and answers to such questions in the next post and beyond.
This series of posts is heavily inspired by the writings of Michael Nielsen, Richard Socher, and Christopher Olah, to whom much credit is due. Some of the statistical intuition is also influenced by Trevor Hastie and Rob Tibshirani.