17. The core concepts of machine learning
We’ve arrived at the last chapter of the book. Hopefully you’ve enjoyed the journey to this point! In this chapter, we’ll dive into machine learning. There’s a lot to cover, so this will be a pretty long chapter. To keep things manageable, we’ve structured the material into six sections. Here in Section 17, we’ll review some core concepts in machine learning, setting the stage for everything that follows. In Section 18, we’ll introduce the Scikit-learn Python package, which we’ll rely on heavily throughout the chapter. Section 19 explores the central problem of overfitting; Section 20 and Section 21 then cover different ways of diagnosing and addressing overfitting via model validation and model selection, respectively. Finally, in Section 22, we close with a brief review of deep learning methods—a branch of machine learning responsible for many recent advances, and one that has made considerable inroads into neuroimaging.
Before we get into it, a quick word about our guiding philosophy. Many texts covering machine learning adopt what we might call a “catalog” approach: they try to cover as many of the different classes of machine learning algorithms as possible. This won’t be our approach here. For one thing, there’s simply no way to do justice to even a small fraction of this space within the confines of one chapter (even a long one). More importantly, though, we think it’s far more important to develop a basic grasp of core concepts and tools in machine learning than to have a cursory familiarity with many of the different algorithms out there. In our anecdotal experience, neuroimaging researchers new to machine learning are often bewildered by the sheer number of algorithms implemented in machine learning packages like Scikit-learn, and sometimes fall into the trap of systematically applying every available algorithm to their problem, in the hopes of identifying the “best” one. For reasons we’ll discuss in depth in this chapter, this kind of approach can be quite dangerous; not only does it preempt a deeper understanding of what one is doing, but, as we’ll see in Section 19 and Section 20, it can make one’s results considerably worse by increasing the risk of overfitting.
17.1. What is machine learning?
This is a chapter on machine learning, so now is probably a good time to give a working definition. Here’s a reasonable one: machine learning is the field of science/engineering that seeks to build systems capable of learning from experience.
This is a very broad definition, and in practice, the set of activities that get labeled “machine learning” is quite broad and varied. But two elements are common to most machine learning applications: (1) an emphasis on developing algorithms that can learn (semi-)autonomously from data, rather than static rule-based systems that must be explicitly designed or updated by humans; and (2) an approach to performance evaluation that focuses heavily on well-defined quantitative targets.
We can contrast machine learning with traditional scientific inference, where the goal (or at least, a goal) is to understand or explain how a system operates.
The goals of prediction and explanation are not mutually exclusive, of course. But most people tend to favor one over the other to some extent. And, as a rough generalization, people who do machine learning tend to be more interested in figuring out how to make useful predictions than in arriving at a “true”, or even just an approximately correct, model of the data-generating process underlying a given phenomenon. By contrast, people interested in explanation might be willing to accept models that don’t make the strongest possible predictions (or often, even good ones) so long as those models provide some insight into the mechanisms that seem to underlie the data.
We don’t need to take a principled position on the prediction vs. explanation divide here (plenty has been written on the topic; see Section 17.4.3 below). Just be aware that, for purposes of this chapter, we’re going to assume that our goal is mainly to generate good predictions, and that understanding and interpretability are secondary or tertiary on our list of desiderata (though we’ll still say something about them now and then).
17.2. Supervised vs. unsupervised learning
Broadly speaking, machine learning can be carved up into two forms of learning: supervised and unsupervised. We say that learning is supervised whenever we know the true values that our model is trying to predict, and hence are in a position to “supervise” the learning process by quantifying prediction accuracy and the associated prediction error. “Ordinary” least-squares regression, in the machine learning context, is an example of supervised learning: our model takes as its input both a vector of features (conventionally labeled X) and a vector of labels (y). Researchers in various biomedical disciplines often use different terminology—calling X variables or predictors, and y the outcome or dependent variable—but the idea is the same.
Here are some examples of supervised learning problems (the first of which we’ll attempt later):
Predicting people’s chronological age from structural brain differences
Determining whether or not an incoming email is spam
Predicting a person’s rating of a particular movie based on their ratings of other movies
Discriminating schizophrenics from controls based on genetic markers
In each of these cases, we expect to train our model using a dataset where we know the ground truth—i.e., we have labeled examples of age, spam, movie ratings, and a schizophrenia diagnosis, in addition to any number of potential features we might use to try and predict each of these labels.
17.3. Supervised learning: classification vs. regression
Within the class of supervised learning problems, we can draw a further distinction between classification problems and regression problems. In both cases, the goal is to develop a predictive model that recovers the true labels as accurately as possible. The difference between the two lies in the nature of the labels: in classification, the labels reflect discrete classes; in regression, the labeled values vary continuously.
17.3.1. Regression
A regression problem arises any time we have a set of continuous numerical labels and we’re interested in using one or more features to try and predict those labels. Any bivariate relationship can be conceptualized as a regression of one variable on the other. For example, suppose we have the data displayed in this scatterplot:
import numpy as np
import matplotlib.pyplot as plt
# Simulate 30 observations: y depends linearly on x, plus Gaussian noise
x = np.random.normal(size=30)
y = x * 0.5 + np.random.normal(size=30)
# Plot the simulated data
fig, ax = plt.subplots()
ax.scatter(x, y, s=50)
ax.set_xlabel('x')
label = ax.set_ylabel('y')
We can frame this as a regression problem by saying that our goal is to generate the best possible prediction for y given knowledge of x. There are many ways to define what constitutes the “best” prediction, but here we’ll use the least-squares criterion: we want a model that, when given the x scores as inputs, produces predictions for y that minimize the sum of squared deviations between the predicted scores and the true scores, \(\sum_i (y_i - \hat{y}_i)^2\). This is what “ordinary” least-squares (OLS) regression gives us. Here’s the OLS solution: first, we add a column of ones to x. This column will be used to model the intercept of the line that relates y to x.
# Stack a column of ones (for the intercept) next to the x values
x_with_int = np.hstack((np.ones((len(x), 1)), x[:, None]))
Then, we solve the set of linear equations using NumPy’s linear algebra routines. This gives us parameter estimates for the intercept and the slope.
# Solve the least-squares problem for the weights (intercept and slope)
w = np.linalg.lstsq(x_with_int, y, rcond=None)[0]
print("Parameter estimates (intercept and slope):", w)
Parameter estimates (intercept and slope): [-0.36822492 0.62140416]
Then, we visualize the data together with the straight line that the fitted regression model defines:
fig, ax = plt.subplots()
ax.scatter(x, y, s=50)
ax.set_xlabel('x')
ax.set_ylabel('y')
# Evaluate the fitted line across the observed range of x
xx = np.linspace(x.min(), x.max())
line = w[0] + w[1] * xx
p = plt.plot(xx, line)
What is this model? Based on the values of the parameters, we can say that the linear prediction equation that produced the predicted scores above can be written as \(\hat{y} = -0.37 + 0.62x\).
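To make this concrete, we can check (a small addition on our part, using the x_with_int and w arrays defined above) that the fitted weights reproduce the predictions as a matrix product, and compute the quantity the least-squares criterion minimizes:
# Predicted scores: each row of x_with_int dotted with the weights
y_hat = x_with_int @ w
# Residual sum of squares: the quantity OLS minimizes
rss = np.sum((y - y_hat) ** 2)
print("Residual sum of squares:", rss)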
Of course, not every model we use to generate a prediction will be quite this simple. Most won’t—either because they have more parameters, or because the prediction can’t be expressed as a simple weighted sum of the parameter values. But what all regression problems share with this very simple example is the use of one or more features to try and predict labels that vary continuously.
17.3.2. Classification
Classification problems are conceptually similar to regression problems. In classification, just like in regression, we’re still trying to learn to make the best predictions we can for some target set of labels. The difference is that the labels are now discrete rather than continuous. In the simplest case, the labels are binary: there are only two classes. For example, we can use utilities from the Scikit-learn library (we’ll learn more about this library in Section 18) to create data that look like this:
from sklearn.datasets import make_blobs
# Simulate 100 samples with 2 features, drawn from 2 well-separated classes
X, y = make_blobs(centers=2, random_state=2)
fig, ax = plt.subplots()
s = ax.scatter(*X.T, c=y, s=60, edgecolor='k', linewidth=1)
Here, we have two features (on the x- and y-axes) we can use to try to correctly classify each sample. The two classes are labeled by color.
In the above example, the classification problem is quite trivial: it’s clear to the eye that the two classes are perfectly linearly separable, so we can correctly classify 100% of the samples just by drawing a line between them. Of course, most real-world problems won’t be nearly this simple. As we’ll see later, when we work with real data, the feature-space distributions of our labeled cases will usually overlap considerably, so that no single feature (and often, not even all of our features collectively) will be sufficient to perfectly discriminate cases in each class from cases in other classes.
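As a quick sanity check (our addition, not part of the original walkthrough; Section 18 introduces Scikit-learn estimators properly), we can fit a simple linear classifier to these points and confirm that it classifies the training samples perfectly:
from sklearn.linear_model import LogisticRegression
# Fit a linear decision boundary to the simulated blobs
clf = LogisticRegression().fit(X, y)
# On linearly separable data, training accuracy should be 1.0
print("Training accuracy:", clf.score(X, y))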
17.4. Unsupervised learning: clustering and dimensionality reduction
In unsupervised learning, we don’t know the ground truth. We have a dataset containing some observations that vary on some set of features X, but we’re not given any accompanying set of labels y that we’re supposed to try to recover using X. Instead, the goal of unsupervised learning is to find interesting or useful structure in the data. What counts as interesting or useful is, of course, very much person- and context-dependent. But the key point is that there is no strictly right or wrong way to organize our samples (or if there is, we don’t have access to that knowledge). So we’re forced to muddle along as best we can, using only the variation in the X features to try and make sense of our data in ways that we think might be helpful to us later.
Broadly speaking, we can categorize unsupervised learning applications into two classes: clustering and dimensionality reduction.
17.4.1. Clustering
In clustering, our goal is to label the samples we have into discrete clusters (or groups). In a sense, clustering is just classification without ground truth. In classification, we’re trying to recover the class assignments that we know to be there; in clustering, we’re trying to make class assignments even though we have no idea what the classes truly are, or even if they exist at all.
The best-case scenario for a clustering application might look something like this:
# Simulate samples that fall into distinct groups, but plot them
# without any labels
X, y = make_blobs(random_state=100)
fig, ax = plt.subplots()
s = ax.scatter(*X.T)
Remember: we don’t know the true labels for these observations (that’s why they’re all assigned the same color in the above plot). So in a sense, any cluster assignment we come up with is just our best guess as to what might be going on. Nevertheless, in this particular case, the spatial grouping of the samples in 2 dimensions is so striking that it’s hard to imagine us having any confidence in any assignment except the following one:
# Re-plot the same samples, this time colored by the intuitively obvious
# cluster assignment (here, the generating labels from make_blobs)
X, y = make_blobs(random_state=100)
fig, ax = plt.subplots()
s = ax.scatter(*X.T, c=y)
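In practice, we would have to recover an assignment like this from the features alone. Here is a minimal sketch of how that might look, using Scikit-learn’s KMeans; note that the choice of algorithm, and the assumption of 3 clusters (made by eyeballing the plot), are our additions:
from sklearn.cluster import KMeans
# Cluster the samples into 3 groups using only the features in X;
# n_clusters=3 is a guess based on visual inspection of the plot
km = KMeans(n_clusters=3, n_init=10).fit(X)
fig, ax = plt.subplots()
s = ax.scatter(*X.T, c=km.labels_)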
Of course, just as with the toy classification problem we saw earlier, clustering problems this neat rarely show up in nature. Worse, in the real world, there often aren’t any “true” clusters. Often, the underlying data-generating process is best understood as a complex (i.e., high-dimensional) continuous function. In such cases, clustering can still be very helpful: it can reduce complexity and give us insight into regularities in the data. But when we use clustering methods (and, more generally, any kind of unsupervised learning approach), we should always try to remember the adage that the map is not the territory—meaning, we shouldn’t mistake a description of a phenomenon for the phenomenon itself.
17.4.2. Dimensionality reduction
The other major class of unsupervised learning application is dimensionality reduction. Here, the idea, just as the name suggests, is to reduce the dimensionality of our data. The reasons why dimensionality reduction is important in machine learning will become clearer when we talk about overfitting later, but a general intuition we can build on is that most real-world datasets—especially large ones—can be efficiently described using fewer dimensions than there are nominal features in the dataset. Real-world datasets tend to contain a good deal of structure: variables are related to one another in important (though often non-trivial) ways, and some variables are redundant with others, in the sense that they can be redescribed as functions of other variables. The idea is that, if we can capture most of the variation in the features of a dataset using a smaller subset of those features, we can reduce the effective size of our dataset and build predictions more efficiently.
To illustrate, consider this dataset:
# Simulate two nominally distinct features that are almost perfectly
# correlated: y is just a noisy, scaled copy of x
x = np.random.normal(size=300)
y = x * 5 + np.random.normal(size=300)
fig, ax = plt.subplots()
s = ax.scatter(x, y)
Nominally, this is a two-dimensional dataset, and we’re plotting the two features on the x and y axes, respectively. But it seems clear at a glance that there aren’t really two dimensions in the data—or at the very least, one dimension is far more important than the other. In this case, we could capture the vast majority of the variance along both dimensions with a single axis placed along the diagonal of the plot—in essence, “rotating” the axes to a “simpler” structure. If we keep only the first dimension in the new space and lose the second dimension, we reduce our 2-dimensional dataset to 1 dimension, with very little loss of information.
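The rotation we have just described is what principal component analysis (PCA) accomplishes. The text above does not commit to a specific algorithm, so treat the following as a sketch of one standard option rather than the only one:
from sklearn.decomposition import PCA
# Stack the two features into an (n_samples, 2) array
data = np.column_stack([x, y])
# Rotate to the principal axes; the first component should
# capture nearly all of the variance in the data
pca = PCA(n_components=2).fit(data)
print("Variance explained per component:", pca.explained_variance_ratio_)
In the next section, we will dive into the nuts and bolts of machine learning in Python, by introducing the Scikit-learn machine learning library.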
17.4.3. Additional resources
If you are interested in diving deeper into the distinction between prediction and explanation, we highly recommend Leo Breiman’s classic paper “Statistical Modeling: The Two Cultures” [Breiman, 2001]. Another great paper on this topic is Galit Shmueli’s “To Explain or to Predict?” [Shmueli, 2010]. Finally, you can read one of us weighing in on the topic, together with Jake Westfall, in a paper titled “Choosing Prediction Over Explanation in Psychology: Lessons From Machine Learning” [Yarkoni and Westfall, 2017].