---
jupytext:
formats: ipynb,md:myst
text_representation:
extension: .md
format_name: myst
format_version: 0.13
jupytext_version: 1.11.5
kernelspec:
display_name: Python 3
language: python
name: python3
---
```{code-cell} ipython3
:tags: [remove-cell]
from ndslib.config import jupyter_startup
jupyter_startup()
```
(ml-overfitting)=
# Overfitting
Let's pause and take stock of our progress. In the last section, we developed
three fully operational machine learning workflows: one for regression, one for
classification, and one for clustering. Thanks to scikit-learn, all three
involved very little work. In each case, it took only a few lines of code to
initialize a model, fit it to some data, and use it to generate predictions or
assign samples to clusters. This all seems pretty great. Maybe we should just
stop here, pat ourselves on the back for a job well done, and head home for the
day satisfied in the knowledge we've become proficient machine learners.
While a brief pat on the back does seem appropriate, we probably shouldn't be
*too* pleased with ourselves. We've learned some important stuff, yes, but a
little bit of knowledge can be a dangerous thing. We haven't yet touched on a
number of concepts and techniques that are absolutely critical if we want to
avoid going off the rails with the skills we've just acquired. And going off the
rails, as we'll see, is surprisingly easy.
To illustrate just *how* easy, let's return to our brain-age prediction example.
Recall that when we fitted our linear regression model in an effort to predict
age from our brain features, we sub-sampled only a small number of features (5,
to be exact). We found that, even with just 5 features, our model was able to
capture a non-trivial fraction of the variance in age—about 20%. Intuitively,
you might find yourself thinking something like this: *if we did that well with
just 5 out of 1,440 features selected at random, imagine how well we might do
with more features!* And you might be tempted to go back to the model-fitting
code, replace the `n_features` variable with some much larger number, re-run the
code, and and see how well you do.
Let's go ahead and give in to that temptation. In fact, let's give in to
temptation systematically. We'll re-fit our linear regression model with random
feature subsets of different sizes and see what effect that has on performance.
First, let's load our ABIDE-II dataset again (and split it up into `features`
and `phenotypes` again):
```{code-cell} ipython3
from ndslib.data import load_data
abide_data = load_data("abide2")
features = abide_data.filter(like='fs')
phenotypes = abide_data.iloc[:, :6]
```
Now we can write some code to see how well we can do with different feature set
sizes. We start by setting up the size of the feature sets. We will run this 10
times in each set size, so we can average and get more stable results.
```{code-cell} ipython3
n_features = [5, 10, 20, 50, 100, 200, 500, 1000, 1440]
n_iterations = 10
```
We initialize a place holder for the results and extract the age phenotype,
which we will use in all of our model fits.
```{code-cell} ipython3
import numpy as np
results = np.zeros((len(n_features), n_iterations))
y = phenotypes['age']
```
Finally, we loop over feature set sizes and iterations. Each time, we sample
features, fit model, and save the score into out `results` array.
```{code-cell} ipython3
from sklearn.linear_model import LinearRegression
model = LinearRegression()
for i, n in enumerate(n_features):
for j in range(n_iterations):
X = features.sample(n, axis=1)
model.fit(X, y)
results[i, j] = model.score(X, y)
```
Let's plot the observed $R^2$ as a function of the number of predictors. We'll
plot a dark line for the averages across iterations and use Matplotlib's
`fill_between` function to add 1 standard deviation error bars, even though
these will be so small that they are hard to see.
```{code-cell} ipython3
import matplotlib.pyplot as plt
averages = results.mean(1)
fig, ax = plt.subplots()
ax.plot(n_features, averages, linewidth=2)
stds = results.std(1)
ax.fill_between(n_features, averages - stds, averages + stds, alpha=0.2)
ax.set_xlabel("# of brain features used")
ax.set_ylabel("explained variance in age ($R^2$)");
```
At first glance, this might look great: performance improves rapidly with the
number of features, and by the time we're sampling 1,000 features, we can
predict age perfectly! But a truism in machine learning (and in science, and in
life more generally) is that if something seems to good to be true, it probably
is.
In this case, a bit of reflection on the fact that our model seems able predict
age with zero error should set off all kinds of alarm bells, because on any
reasonable understanding of how the world works, such a thing should be
impossible. Set aside the brain data for a moment and just think about the
problem of measuring chronological age. Is it plausible to think that, in a
sample of over 1,000 people, including many young children and older adults, not
a single person's age would be measured with even the slightest bit of error?
Remember, *any* measurement error should reduce our linear regression model's
performance, because measurement error is irreducible. If an ABIDE-II
participant's birth date happened to be recorded as 1971 when it's actually 1961
(oops, typo!), it's not as if our linear regression model can somehow learn to
go back in time and adjust for that error.
Then think about the complexity of the brain-derived features we're using; how
well (or poorly) *those* features are measured/extracted; how simple our linear
regression model is; and how much opportunity there is for all kinds of data
quality problems to arise (is it plausible to think that all subjects' scans are
of good enough quality to extract near-perfect features?). If you spend a couple
of minutes thinking along these lines, it should become very clear that an $R^2$
of 1.0 for a problem like this is just not remotely believable.
Clearly, there must be something very wrong with our model. And there is: our
model is *overfitting* our data. Because we have a lot of features to work with
-- even more than samples, our linear regression model is, in a sense getting
creative: it's finding all kinds of patterns in the data that look like they're
there, but aren't really there. In a word, our model is starting to hallucinate.
If you find the idea that a statistical model could hallucinate kind of weird,
you've come to the right place. We'll spend the rest of this section exploring
that idea and unpacking its momentous implications.
## Understanding overfitting
To better understanding overfitting, let's set aside our relatively complex
neuroimaging dataset for the moment and work with some simple bivariate
examples.
Let's start by sampling some data from a noisy function where the underlying functional form is quadratic. We'll create a small function for the data generation, so that we can reuse this code later on.
```{code-cell} ipython3
def make_xy(n, sd=0.5):
''' Generate x and y variables from a fixed quadratic function,
adding noise. '''
x = np.random.normal(size=n)
y = (0.7 * x) ** 2 + 0.1 * x + np.random.normal(10, sd, size=n)
return x, y
```
We fix the seed for generation of random numbers and then produce some data and visualize it.
```{code-cell} ipython3
np.random.seed(10)
x, y = make_xy(30)
fig, ax = plt.subplots()
p = ax.scatter(x, y)
```
Now let's fit that data with a linear model. To visualize the results of the linear model, we will use the model that has been fit to the data to predct the values of a range of values similar to the values of `x`.
```{code-cell} ipython3
from sklearn.metrics import mean_squared_error
est = LinearRegression()
est.fit(x[:, None], y)
x_range = np.linspace(x.min(), x.max(), 100)
reg_line = est.predict(x_range[:, None])
fig, ax = plt.subplots()
ax.scatter(x, y)
ax.plot(x_range, reg_line);
mse = mean_squared_error(y, est.predict(x[:, None]))
ax.set_title(f"Mean squared error: {mse:.2f}");
```
The fit looks... meh. It seems pretty clear that our linear regression model is *under*fitting the data—meaning, there are clear patterns in the data that the fitted model fails to describe.
What can we do about this? Well, the problem here is that our model is insufficiently flexible; our straight regression line can't bend itself to fit the contours of the observed data. Clearly, the solution is to use a more flexible estimator! A linear fit won't cut it—we need to fit *curves*.
Just to make sure we don't underfit again, let's use a really flexible estimator—specifically, a 10th degree polynomial regression model.
This is also a good opportunity to introduce a helpful object in scikit-learn called a `Pipeline`. The idea behind a `Pipeline` is that we can stack arbitrarily many transformation steps together in a sequence, and then cap them off with an `Estimator` of our choice. The whole pipeline will then behave like a single `Estimator` -- i.e., we only need to call `fit()` and `predict()` (or `transform()`, for unsupervised models) once.
```{eval-rst}
.. index::
single: Scikit Learn Pipelines
```
Using a pipeline will allow us to introduce a preprocessing step before the
LinearRegression model receives our data, in which we create a bunch of
polynomial features (by taking $x^2$, $x^3$, $x^4$, and so on—all the way up to
$x^{10}$). We'll make use of Scikit Learn's handy `PolynomialFeatures`
transformer, which can be found in the `preprocessing` module (we've already
seen some other useful data preprocessing tools in there).
```{eval-rst}
.. index::
single: Polynomial features
```
Again, we'll create a function that wraps the code to generate the pipeline, so
that we can reuse that as well.
```{code-cell} ipython3
from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import Pipeline
def make_pipeline(degree=1):
"""Construct a Scikit Learn Pipeline with polynomial features and linear regression """
polynomial_features = PolynomialFeatures(degree=degree, include_bias=False)
pipeline = Pipeline([
("polynomial_features", polynomial_features),
("linear_regression", LinearRegression())
])
return pipeline
```
Now we can initialize a pipeline with `degree=10`, and fit it to our toy data:
```{code-cell} ipython3
degree = 10
pipeline = make_pipeline(degree)
pipeline.fit(x[:, None], y)
reg_line = pipeline.predict(x_range[:, None])
fig, ax = plt.subplots()
ax.scatter(x, y)
ax.plot(x_range, reg_line)
mse = mean_squared_error(y, pipeline.predict(x[:, None]))
ax.set_title(f"Mean squared error: {mse:.2f}");
```
At first blush, this model seems to fit the data much better than the first
model, in the sense that it reduces the mean squared error considerably relative
to the simpler linear model (our MSE went down from 0.6 to 0.11). But, much as
it seemed clear that the previous model was *under*fitting, it should now be
intuitively obvious that the 10th-degree polynomial model is *over*fitting. The
line of best fit plotted above has to bend in some fairly unnatural ways in
order to capture individual data points. While this helps reduce the error for
these particular data, it's hard to imagine that the same line would still be
very close to the data if we sampled from the same distribution a second or
third time.
We can test this intuition by doing exactly that: we sample some more data from
the same process, and see how well our fitted model predicts the new scores.
```{code-cell} ipython3
test_x, test_y = make_xy(30)
x_range = np.linspace(test_x.min(), test_x.max(), 100)
reg_line = pipeline.predict(x_range[:, None])
fig, ax = plt.subplots()
ax.scatter(test_x, test_y)
ax.plot(x_range, reg_line)
mse = mean_squared_error(y, pipeline.predict(test_x[:, None]))
ax.set_title(f"Mean squared error: {mse:.2f}");
```
That's... not good. When we apply our overfitted model to new data, the mean
squared error skyrockets. In fact, the model performs significantly worse than
even our underfitting linear model did. (As an exercise, you can apply the
linear model to new data, just so we're comparing apples to apples. You'lll
probably find that the linear model's error also increases when applied to new
data, but that the increase is much smaller than what we observe for our
10th-order polynomial model.)
Of course, since we wrote the data-generating process ourselves, and hence know
the ground truth, we may as well go ahead and fit the data with the true
functional form, which in this case is a polynomial with degree 2:
```{code-cell} ipython3
degree = 2
pipeline = make_pipeline(degree)
pipeline.fit(x[:, None], y)
x_range = np.linspace(x.min(), x.max(), 100)
reg_line = pipeline.predict(x_range[:, None])
fig, ax = plt.subplots()
ax.scatter(x, y)
ax.plot(x_range, reg_line)
mse = mean_squared_error(y, pipeline.predict(x[:, None]))
ax.set_title(f"Mean squared error: {mse:.2f}");
```
There, that looks much better.
Unfortunately, in the real world, we don't usually know the ground truth. If we
did, we wouldn't have to fit a model in the first place! So we're forced to
navigate between the two extremes of overfitting and underfitting in some other
way. Finding this delicate balance is one of the central problems of machine
learning -— perhaps *the* central problem. For any given dataset, a more
flexible model will be able to capture more nuanced, subtle patterns in the
data. The cost of flexibility, however, is that such a model is also more likely
to hallucinate -— i.e., to fit patterns in the data that are only there because
of noise, and hence won't generalize to new samples. Conversely, a less flexible
model is only capable of capturing simple patterns in the data. This means it
will avoid chasing down rabbit holes full of spurious patterns; but it does so
at the cost of missing out on a lot of real patterns too.
One way to think about this is that, as a data scientist, the choice you face is
almost never between good models and bad ones. but rather, between lazy and
energetic ones (later on, we'll also see that there are many different ways to
be lazy or energetic). The simple linear model we started out with is a
relatively lazy model: it has only one degree of freedom to play with, and if it
can't fit the data well with that one degree of variation, it doesn't care.
It'll just sigh and go back to the couch to watch TV.
The 10th-degree polynomial, by contrast, is hyperactive and prone to conspiracy
theories: it sees patterns everywhere, and if it has to go very far out of its
way to fit a data point that's giving it particular trouble, it'll happily do
that.
Getting it right in any given situation requires you to strike a balance between
these two extremes. Unfortunately, the precise point of optimality varies on a
case-by-case basis—so there is, as they say in machine learning, *no free
lunch*.
Later on, we'll connect the ideas of overfitting vs. underfitting (or,
relatedly, flexibility vs. stability) to another key concept -— the
*bias-variance tradeoff*. But for now, let's turn our attention to some of the
core methods machine learners use to diagnose and prevent overfitting.
### Additional resources
In case you are wondering whether overfitting actually happens in real research
you should read the paper "I tried a bunch of things: The dangers of unexpected
overfitting in classification of brain data" by Mahan Hosseini and colleagues
{cite}`Hosseini2020-uc`. It nicely demonstrates how easy it is to fall into the
trap of overfitting when doing data analysis on large complex datasets, and
offers some rigorous approaches to help mitigate this risk.