20. Deep learning#

While we promised that this chapter would not be a catalog of different machine learning methods, no book introducing machine learning – particularly one applied to a topic so closely bound up with image processing – would be complete without at least a brief mention of deep learning. Deep learning is a relatively new branch of machine learning that encompasses a set of techniques combining advances in artificial neural networks with clever optimization methods and large amounts of data to do remarkably well at many different machine learning tasks. We’d hesitate to call these methods “artificial intelligence” (AI), but these algorithms can do things that until recently we would have thought only humans could do, like recognize objects in a photograph, generate (at least somewhat) cogent written prose, or play Go. So we think it’s fair to say that this is the closest we’ve gotten to AI so far.

20.1. Artificial neural networks#

It is only too appropriate that we would use a model of a neural network to study the brain. A deep learning algorithm is composed of individual functional units that are connected to each other to simulate a kind of simplified neural network. Just like real neurons, each unit in the network receives inputs – either from outside the network or from other units in the network – and produces outputs, either to other units or as the final output of the network. Let’s look at a diagram of a very simple artificial neural network:

The circles denote nodes and the lines denote direct connections between nodes. The activity of the nodes at the bottom (denoted by \(X_{11}\) and \(X_{12}\)) is set by two different inputs to the network, and the activity of the node at the top (denoted by \(X_{21}\)) is the output of the network. The input layer is similar to the \(X\) matrix that you saw in previous chapters, with different rows of the matrix corresponding to different instances of the input. In this network, \(X_{21}\) is akin to the model predictions in the previous chapters. As an example, \(X_{11}\) might be the weight of an animal, \(X_{12}\) might be the wingspan of the animal, and \(X_{21}\) might be the model output: the predicted airspeed of the animal. This is the familiar regression setting that you saw in previous chapters, with two inputs and one output.

In the notation we will use here, the first subscript indicates the layer number and the second subscript indicates the number of the particular unit within the layer. So, \(X_{12}\) is the second unit in the first layer and \(X_{21}\) is the first unit in the second layer.

The activity in each unit in the input layer affects the activity in the output unit through a weight: \(W^2_{11}\) and \(W^2_{21}\). The notation for the weights is that the superscript tells us which layer these weights go into, the first subscript tells us the identity of the source unit and the second subscript tells us the identity of the target unit. For example, \(W^2_{11}\) is a weight from unit 1 in the first layer to unit 1 in the second layer.

How do the weights affect the activity in the second layer? In the simplest case: \(X_{21} = W^2_{11} X_{11} + W^2_{21} X_{12}\). That is, the activity in the second layer is the weighted linear sum of the activities in the first layer, where the activity in each unit is multiplied by the weight between that unit and the output.

Written out in code, this might look something like the following. We use a notation very similar to the one we used in writing out the math, where, for example, \(X_{11}\) is written as x11 and \(W^{2}_{11}\) is written as w_2_11.

x11 = 100
x12 = 40
w_2_11 = -1
w_2_21 = 3
x21 = w_2_11 * x11 + w_2_21 * x12
print(x21)
20

In our previous example, this might mean that an animal that weighs 100 grams and has a wingspan of 40 cm flies at a speed of 20 km/h (try out different numbers to see how the predicted speed is affected by these two factors, given these weights).

Neural networks become more sophisticated when we add more layers to them. For example, here is a three-layer network. There are two units in the input layer, two units in the hidden layer, and one unit in the output layer.

Exercise

Implement the code that calculates the output of this network, using the same notation that we used before. Don’t worry about the values of the weights – just make something up.

Another thing that adds power to neural networks is the addition of non-linear activation functions to the units. Schematically, it would look like this:

Where \(f()\) is the activation function. In each unit, this function is applied to the weighted sum of the inputs. Often, a non-linear activation function is used to give the network additional flexibility. For example, one typical choice is the hyperbolic tangent, which changes very little for very small or very large values, and changes gradually in between. In this case, the output of the first unit in the hidden layer is \(tanh(W^2_{11} X_{11} + W^2_{21} X_{12})\). Another non-linear activation function that is often used in practice is the rectified linear unit, or ReLU. This function sets activations that are negative to zero and passes through activations that are positive as they were. We can write this function in Python in very few lines of code:

def relu(x):
    if x < 0:
        return 0
    else:
        return x

We can incorporate this into our two-layer network. In this case, adding the ReLU did not change the output of the network, because the output was positive. But you can try changing the weights to see what happens when the weighted sum \(W^2_{11} X_{11} + W^2_{21} X_{12}\) becomes negative.

x11 = 100
x12 = 40
w_2_11 = -1
w_2_21 = 3
x21 = relu(w_2_11 * x11 + w_2_21 * x12)
print(x21)
20
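
For comparison, here is a minimal sketch (not part of the original example) of the same computation with a hyperbolic tangent activation instead of the ReLU, using NumPy’s tanh function. Because the weighted sum (20) is far from zero, the tanh saturates very close to 1, illustrating how this function changes very little for large inputs.

import numpy as np

# the same inputs and weights as above, but with a tanh activation
x11 = 100
x12 = 40
w_2_11 = -1
w_2_21 = 3
x21_tanh = np.tanh(w_2_11 * x11 + w_2_21 * x12)
print(x21_tanh)  # very close to 1.0, because tanh saturates for large inputs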

Exercise

Add the ReLU function as an activation to the hidden and the output layer of the three-layer network that you previously implemented.

20.2. Learning through gradient descent and back-propagation#

How do we know what values to assign to the weights of a neural network so that the input-output relationships are accurate? This is the goal of learning with these algorithms, and it is implemented using two main ideas. The first idea is gradient descent. This mechanism for learning is similar to the optimization procedures that we used to fit machine learning models earlier in this chapter, or that we used to register images to each other in Section 14. It usually starts with some (often random!) guess of what the values of the weights should be, and then – based on observations of training data – it determines a small change that should be introduced to improve the accuracy of the input-output relationship: that is, to reduce the difference between the true value of \(y\) for each training sample and the value of \(y\) predicted with the current values of the weights. In each round of training, the weights are adjusted gradually, based on the gradient of the error with respect to the weights. The gradient tells us how the error changes when each weight changes, and we adjust the weights in the direction of less and less error, which is why this algorithm is known as gradient descent. But how do we know how to change each weight to improve the predictions? That is done through backpropagation: a process that determines the direction and amount of change to apply to each weight, based on the errors observed in the current training data.
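
To make this a little more concrete, the basic gradient descent update for a single weight can be written as \(w \leftarrow w - \eta \frac{\partial E}{\partial w}\), where \(E\) denotes the error and \(\eta\) a small learning rate (these two symbols are introduced here just for this equation; below, the corresponding code variables are e21 and lr). In words: each weight is nudged in the direction that reduces the error, by an amount scaled by the learning rate.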

Backpropagation uses three quantities to determine an error that needs to be adjusted for every weight:

  1. The activation of the node feeding into the weight.

  2. The slope of the activation function of the node the weight feeds into.

  3. The error of the node a weight feeds into.

The first should be straightforward to look up, because it is one of the variables that describes our network. We will come back to the second one – this is also often straightforward (and particularly straightforward for the ReLU activation function, as we will see). But how do we determine the third of these? This is done by propagating error back from the output unit. We can determine the error for the output unit, because we are training the network on pairs of inputs and outputs (for example, animals for which we have measured weight, wingspan, and airspeed). For example, this might be the squared error: the squared difference between the output of the network (with the current weight settings) and the true value of the airspeed for this animal. Once we have determined this error, we can send it back to the nodes in the hidden layer that are sending inputs to the output layer. Multiplied by the slope of the activation function in the unit from which the error was back-propagated, and by the degree of activation of the node feeding into the weight, this tells us the error for the particular weight we are currently interested in. And once we determine that, we can send this error back to units further down in the network.

Coming back to the slope of the ReLU activation function, we notice that when the input to the function is negative, the output of the ReLU is always 0. This means that for negative inputs, the slope is 0 (that is, the function doesn’t change with changes in the input). For values above 0, the ReLU is the identity function (\(f(x) = x\)), so its slope is equal to 1. This means that we can implement a simple function that calculates the slope of the ReLU function for any input value that it is provided:

def d_relu(x):
    if x > 0:
        return 1
    else:
        return 0

As an example of learning, let’s imagine a simple dataset in which weight, wingspan, and airspeed have been measured for a single animal, and we use just that animal’s data to train the two-layer network that you saw above. The weights are initialized as before, so the result, as before, is 20.

x11 = 100
x12 = 40
w_2_11 = -1
w_2_21 = 3
x21 = relu(w_2_11 * x11 + w_2_21 * x12)
print(x21)
20

But let’s now consider the situation where this is an incorrect output. That is, the measured airspeed was 15 km/h, and not 20 km/h. In that case, the squared error for the output would be:

e21 = (x21 - 15)**2

Now, let’s propagate this error to determine how the two weights coming into the output unit should change. First let’s calculate the change that should be applied to \(W^2_{11}\). The three components are:

  1. The activation of the node feeding into the weight: x11.

  2. The slope of the activation function of the node the weight feeds into: d_relu(x21).

  3. The error of the node a weight feeds into: e21.

Which means that the change or error for this weight is the product of these three variables:

e_2_11 = x11 * d_relu(x21) * e21

We can calculate the same for the other weight. This would be the same as above, except we use the activation in the other unit in the input layer:

e_2_21 = x12 * d_relu(x21) * e21
print(e_2_21)
print(e_2_11)
1000
2500

These numbers seem rather large, and it is quite common not to apply the full change to the weight, but instead use a small learning rate to make changes to the weights.

lr = 10e-6

w_2_11 = w_2_11 - lr * e_2_11
w_2_21 = w_2_21 - lr * e_2_21

Having updated the weights, we run the input through the network again:

x21 = relu(w_2_11 * x11 + w_2_21 * x12)
print(x21)
17.100000000000023

We can see that the value that the network is predicting is now closer to the true value. We can repeat the process for as many rounds as we need, until the error is deemed small enough (for example, here we set the threshold at 0.01). In this case, we have to update the weights 81 times before the output converges to within that threshold of the true value.

iterations = 1
while e21 > 0.01:
    iterations = iterations + 1
    e21 = (x21 - 15)**2

    e_2_11 = x11 * d_relu(x21) * e21
    e_2_21 = x12 * d_relu(x21) * e21

    w_2_11 = w_2_11 - lr * e_2_11
    w_2_21 = w_2_21 - lr * e_2_21
    x21 = relu(w_2_11 * x11 + w_2_21 * x12)

print(x21)
print(iterations)
15.098741304117311
81

Exercise

Implement the code that calculates the changes to the weights in the three-layer network that you implemented.

20.3. Introducing Keras#

Understanding the basic mechanics of training a neural network is useful, but fortunately, when we work with artificial neural networks, we don’t have to program all of these low-level operations ourselves. While Scikit Learn has only a limited collection of artificial neural network algorithms, there are several other widely-used open-source software libraries for deep learning. Here, we will use the TensorFlow software library, which was originally developed at Google as a tool for their machine learning applications, but was made publicly available in 2015. As you have seen throughout this book, having well-designed APIs is very helpful in applying theoretical ideas in practice. Within the TensorFlow library, we will use a particular API called Keras, which makes access to the underlying machinery – constructing networks, training them with data, and using them for prediction – much easier. From TensorFlow, we import the Keras API, and some additional components of the Keras library: a sub-module that implements layers of networks, a sub-module that implements different model architectures, and a couple of utility functions.

from tensorflow import keras
from tensorflow.keras import layers
from tensorflow.keras import models
from tensorflow.keras.utils import to_categorical, set_random_seed

We can start with a small toy example, before we proceed to more interesting applications. As you saw in Section 15, Scikit Learn includes functions that create synthetic data that we can use to hone our intuitions. We’ll create a dataset like the one we used when we first introduced the idea of classification. Using the make_blobs function, we create a dataset that has three classes (y=0, y=1, and y=2) that differ in terms of their X values (each sample has two features, and the X array has two columns). Just like we did in previous chapters, we split the data into train and test sets. To make sure that the results are identical in each run, we set the random seed with the Keras utility function set_random_seed.

set_random_seed(2222)
from sklearn.datasets import make_blobs
X, y = make_blobs(centers=3)
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.5)

Our goal for now will be to create a neural network that learns how to classify y values based on X values – recall from Section 15 that the make_blobs function creates 3 different object categories. We can use the components we imported above to construct a simple neural network. The kinds of models that we saw before are called “Sequential” models in Keras. That’s because the activity proceeds sequentially through the network from the bottom to the top (in our diagrams). One way to construct this kind of network is to initialize an empty Sequential() object and then use the add method of this object to add layers; another, used below, is to pass a list of layers directly to the Sequential constructor. In our first example, we will add layers that are instances of the Dense class. This means that there is one weight between each unit in this layer and each unit in the previous layer (that is, they are “densely” connected; we’ll see other kinds of layers in just a bit). When initializing each layer, we need to specify the number of units in that layer (below, 8 in the first layer, 16 in the second layer, and 3 in the last layer). We can also specify an activation function that governs the output of each unit; here, we’ll use the ReLU function in the first two layers. The network also needs to know the size of the input – in this case, the number of features in X – which is specified with keras.Input. Finally, the last layer is designed to provide the output of the network. Because we are doing a classification task here, rather than a regression task, this layer has three units – one for each class – and the classification output is read out from the activity of these units: the predicted category is the one whose unit is most active. This readout, with one unit ultimately winning out over the others, is implemented by a function called a “softmax” function.

model = keras.Sequential(
    [
        keras.Input(shape=(X_train.shape[-1],)),
        layers.Dense(8, activation="relu"),
        layers.Dense(16, activation="relu"),
        layers.Dense(3, activation="softmax"),
    ]
)

After constructing the network, we need to do one more step before we can fit it, which is to compile the network. This is a feature of TensorFlow that allows it to work very fast when fitting even very large models. When we ask to compile a network, TensorFlow figures out a plan for computing the activations in the network, computing the loss, and optimizing the weights based on the information we provide, so that subsequent steps proceed at high speed. When compiling, we can ask for a particular error function to be used for back-propagation relative to the output of the network (also called a “loss function”). Here, we use the categorical cross-entropy function. We will not get into the details of this function, but suffice it to say that it is an error function that usually works well for classification problems. We can also ask for a particular kind of optimizer to be used to figure out how to apply changes to the weights (e.g., how to determine the learning rate), and we ask for a particular metric to be reported to us as learning proceeds: the classification accuracy.

model.compile(loss='categorical_crossentropy', metrics=['accuracy'])
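
The compile call above relies on the Keras default optimizer. For illustration only (we keep the defaults in everything that follows), explicitly choosing the optimizer and its learning rate might look like the following sketch, which uses the Adam optimizer with an arbitrarily chosen learning rate:

# an illustrative alternative, not executed here; we stick with the defaults above
model.compile(loss='categorical_crossentropy',
              optimizer=keras.optimizers.Adam(learning_rate=0.001),
              metrics=['accuracy'])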

Finally, we are ready to fit the model. Keras expects the y input to be converted into its format for categorical variables. This is an array that has one column for each category in y – three in this case. In each row of this array, one value is set to 1, corresponding to the category of that observation. The other columns – two columns in this case – are set to 0. For example, the first element of the training data is from category 2 (i.e., that sample comes from the third blob), so its row of the categorical variable is equal to [0, 0, 1]:

categorical_y_train = to_categorical(y_train)
print(y_train[0])
print(categorical_y_train[0])
2
[0. 0. 1.]

We can set the number of times that the neural network will go over all of the training data during training (the number of “epochs”) and the number of samples that the network sees in each step of training within an epoch (the “batch size”). During the course of training, Keras will update us with information about the value of the loss, which should progressively decrease. The value of the categorical cross-entropy is not so easy to interpret, so we also asked to see the classification accuracy, which is easier to understand – a number between 0 and 1, with perfect classification equal to 1. In this case, guessing at random would result in an accuracy of about 1/3.

history = model.fit(X_train, categorical_y_train, epochs=5, batch_size=5,
                    verbose=2)
Epoch 1/5
10/10 - 1s - loss: 1.0496 - accuracy: 0.6200 - 585ms/epoch - 59ms/step
Epoch 2/5
10/10 - 0s - loss: 0.8900 - accuracy: 0.6200 - 18ms/epoch - 2ms/step
Epoch 3/5
10/10 - 0s - loss: 0.7799 - accuracy: 0.6200 - 15ms/epoch - 2ms/step
Epoch 4/5
10/10 - 0s - loss: 0.6883 - accuracy: 0.6200 - 16ms/epoch - 2ms/step
Epoch 5/5
10/10 - 0s - loss: 0.6098 - accuracy: 0.6200 - 13ms/epoch - 1ms/step

To predict the categories of y in the test data, we need to convert the network’s outputs back from the Keras categorical format to an array of category labels (0, 1, or 2), using the numpy argmax function. We can then evaluate the predictions relative to the true values using Scikit Learn’s accuracy_score function, which quantifies the quality of the classification. We find that this network is not amazingly accurate, but it performs far better than a random guess would.

y_predict = model.predict(X_test).argmax(axis=1)
from sklearn.metrics import accuracy_score, balanced_accuracy_score
print(accuracy_score(y_test, y_predict))
0.66

One way to think about the performance of this neural network is to quantify the complexity of the model by counting the number of parameters it has. Keras models include a method that summarizes the model for us, telling us what layers the model contains and how many parameters it has overall:

model.summary()
Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
=================================================================
 dense (Dense)               (None, 8)                 24        
                                                                 
 dense_1 (Dense)             (None, 16)                144       
                                                                 
 dense_2 (Dense)             (None, 3)                 51        
                                                                 
=================================================================
Total params: 219
Trainable params: 219
Non-trainable params: 0
_________________________________________________________________

Let’s see how this number – 219 – comes about. The first layer has 8 units, each of which receives input from the two features of X (the two columns of the X array) and also has a so-called “bias term”, an additional parameter that can adjust the level of activity in the unit up or down by a fixed amount; that gives 8 times (2 + 1), which is 24 parameters. The second layer has 16 units, each of which receives inputs from each of the 8 units in the first layer, so 8 times 16, which is 128; adding a bias term for each of the 16 units gives 144. For the last layer, we have 3 units, each representing one of the possible output categories, and each of these receives an input from every unit in the middle layer, so 16 times 3, which is 48, plus one bias term for each unit, totalling 51. Putting together 24 + 144 + 51 gives us the 219 total parameters in this model. So, is this model complex? Well, that depends on what we compare it to. It’s certainly more complex – it has more parameters – than some of the models we’ve previously seen. But it is not as complex as some of the neural networks that are the state of the art in deep learning, which can have several million parameters. Such large complexity is achieved by adding many more layers to the network, or many more units in each layer.
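
As a quick check on this arithmetic, we can redo the bookkeeping in a few lines of code (this sketch just repeats the counts described above, assuming the two-feature input and the 8/16/3 layer sizes used in this model):

n_features = 2            # the two columns of X
layer_sizes = [8, 16, 3]  # units in each Dense layer

total = 0
n_inputs = n_features
for n_units in layer_sizes:
    # each unit has one weight per input, plus a bias term
    total = total + n_units * (n_inputs + 1)
    n_inputs = n_units
print(total)  # 219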

Exercise

Can you make this network even more accurate on these data? What happens if you add more dense layers? What happens if you add more units in each layer? How does each of these changes affect the number of parameters in the model?

20.4. Convolutional neural networks#

Let’s zoom out from the small toy examples you have seen until now. As it turns out, the real power of neural networks – particularly for image processing – was discovered when two technological trends converged. The first is the ability to fit the weights of neural networks that have a large number of layers. The “deep” in deep learning comes specifically from the models that started becoming feasible around 2010 and that by 2012 were reaching impressive results in the ImageNet benchmark, in which computer vision algorithms are compared in their ability to recognize thousands of different objects in millions of labeled color photographs. One of the main things that made these models feasible was the growing availability of graphics processing units (GPUs) that were used to process the incoming data and to update the weights in these very large artificial neural networks. The other technological trend that became important around that time is the availability of massive amounts of data. Of course, we don’t want to train any machine learning model on just one sample, as we did above, but it turns out that training a deep neural network that accurately recognizes objects (or does other sophisticated tasks) requires extraordinary amounts of data. The ImageNet dataset itself was, therefore, a driver of the development of many of these algorithms.

At the same time, a clever set of theoretical ideas also drove the development of deep learning, particularly for computer vision. One idea that was particularly important was that neural networks could incorporate properties of the biological visual system in order to better learn how to perform computer vision tasks. In particular, researchers who were working on these algorithms in the late 1980s and early 1990s were aware of the classic results showing that the mammalian primary visual cortex contains neurons that each have their receptive field in a particular part of the visual field, and that these neurons are each particularly sensitive to a narrow range of orientations and spatial frequencies. Each of these receptive fields operates as a small filter that lets through only image features corresponding to its preference, and you can think of the assembly of neurons with these filter properties as performing a convolution of the image on the retina with a kernel of this orientation and spatial frequency (see Section 12 to remind yourself what convolutions are). Based on this insight, researchers started designing neural networks for computer vision with units that performed a convolution of the input image. Instead of devoting a weight to each pixel in the input image, these units would have only a very small number of weights to fit (for example, for a 3-by-3 convolutional kernel they would have only 9 weights each). The properties of these convolutional filters – their orientation and spatial frequency sensitivity, for example – would be determined through learning, by backpropagation and gradient descent on the convolutional kernel weights, just like in the simple network we saw above.
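
To make the weight savings concrete, here is a minimal sketch (our illustration, not part of the analysis below) using scipy.signal.convolve2d: a convolutional unit with a 3-by-3 kernel has only 9 weights, no matter how large the image it is applied to. Note also what happens to the output shape, which will come up again below.

import numpy as np
from scipy.signal import convolve2d

rng = np.random.default_rng(0)
image = rng.random((50, 50))  # a stand-in for a 50-by-50 input image
kernel = rng.random((3, 3))   # a convolutional unit has only these 9 weights

# "valid" mode does not pad the image with zeros, so the output loses a
# one-pixel border on each side
output = convolve2d(image, kernel, mode="valid")
print(kernel.size)    # 9
print(output.shape)   # (48, 48)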

To demonstrate this, we’ll continue to look at the ABIDE2 dataset. In the previous sections, we used features of the data that had been computed using a complicated and sophisticated software pipeline – Freesurfer. Freesurfer was developed to incorporate a massive amount of knowledge about the brain and about MRI: for example, an understanding of which parts of the image contain cortical areas, white matter, and sub-cortical regions. The features represent biologically meaningful properties of the brain, such as cortical thickness. The process of creating features that are useful for machine learning is known as feature engineering and is an important part of many applications of machine learning. It is often the part of the pipeline that requires a deep understanding of, and intuition about, the scientific domain to which machine learning algorithms are applied. A feature of convolutional neural networks that makes them particularly remarkable is that they often do not require feature engineering to be performed in advance. Instead, the algorithms receive the raw images and learn to represent the features in the images that contribute to the task that the neural network is learning. For example, in this case, we will provide the algorithm with some of the ABIDE2 brain images themselves as input and ask: could we train a convolutional neural network that can accurately classify participants with autism directly from their brain images? When we previously tried to do that based on the Freesurfer-derived features, we found that we could classify with a reasonable level of accuracy based on these data. Would a convolutional neural network also be able to learn this classification directly from brain images? For simplicity, we will look at the data from only two of the ABIDE2 sites (KKI and OHSU) and only at the mid-sagittal slice of the T1-weighted image for each of 90 subjects from each of these sites, which we have resampled to a resolution of 50-by-50 pixels.

from ndslib import load_data
X, y, subject = load_data("abide2_saggitals")
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.8, random_state=100)

Instead of the Dense layers we used above, this model will use Conv2D layers, which implement the convolution kernels that we are training here. In addition to the number of units per layer and the activation function used by these units, we also need to specify the size of the kernel; here, we will use kernels of 3-by-3 pixels. To read out the class information from the network, we need to “flatten” the output of the second convolutional layer. That is, we need to turn it from a series of two-dimensional images – the results of the convolution filter operation in each of the units – into a one-dimensional vector of outputs. These are then combined through one last Dense layer that provides the output of the network – the class information – through the “softmax” function. Here, there are two classes – participants diagnosed with autism and participants who were not diagnosed with autism – so this readout layer has two units, and the softmax function selects between them.

model = keras.Sequential(
    [
        keras.Input(shape=(50, 50, 1)),
        layers.Conv2D(8, kernel_size=(3, 3), activation="relu"),
        layers.Conv2D(16, kernel_size=(3, 3), activation="relu"),
        layers.Flatten(),
        layers.Dense(2, activation="softmax"),
    ]
)

As before, we compile the model to use the categorical cross-entropy function as its loss function and to report to us about the accuracy of classification.

model.compile(loss="categorical_crossentropy", metrics=["accuracy"])

Before fitting this model, let’s contemplate the complexity of this kind of model, by looking at the number of parameters in this model.

model.summary()
Model: "sequential_1"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
=================================================================
 conv2d (Conv2D)             (None, 48, 48, 8)         80        
                                                                 
 conv2d_1 (Conv2D)           (None, 46, 46, 16)        1168      
                                                                 
 flatten (Flatten)           (None, 33856)             0         
                                                                 
 dense_3 (Dense)             (None, 2)                 67714     
                                                                 
=================================================================
Total params: 68,962
Trainable params: 68,962
Non-trainable params: 0
_________________________________________________________________

That’s a lot of parameters! To understand where they all come from, consider that each convolutional filter in the first layer has a 3-by-3 kernel, and each one also has a bias term, so we have 8 times (9 + 1), which is 80. This is more parameters than we had in the first layer of our fully-connected network, but consider the dimensionality of the input in this case: each image has 50-by-50 pixels, so the inputs are 2,500-dimensional, rather than the two-dimensional inputs of the fully-connected network we had before. In the second convolutional layer, each unit has to learn a separate 3-by-3 set of kernel weights for each of the 8 outputs of the first layer. Notice also that the output shape of the first layer is 48-by-48: in contrast to the convolution example you saw in Section 12, the images are not padded with zeros around the edges, so the convolution “peels off” one layer of pixels around the edges of the image (in this case, that might not matter much). The second layer performs convolutions on these outputs, and since it has 16 units, this means 16 (units in the second layer) times 8 (units in the first layer) times 9 (weights in each kernel); adding 16 bias terms (one for each unit in the second layer), we end up with 1,168 parameters in this layer. Up until now, this doesn’t seem too bad, but before we can read out the output of the network, we have to flatten the output of the second convolutional layer and feed it into the dense layer at the top of the network. At this point, we do have to include a weight for each pixel in the output, so 46 by 46 by 16 inputs into each of the two units at the top, plus a bias for each of those units, which adds up to the 67,714 parameters in that layer. Altogether, that makes 68,962 parameters for the full network.
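
As a quick sanity check, the same bookkeeping can be written out in a few lines of code (this sketch just repeats the arithmetic above, using the kernel size, layer sizes, and output shapes from the model summary):

conv1 = 8 * (3 * 3 + 1)         # 8 filters, each with a 3x3 kernel and a bias
conv2 = 16 * (8 * 3 * 3 + 1)    # 16 filters, a 3x3 kernel per input channel, plus a bias
dense = 2 * (46 * 46 * 16 + 1)  # a weight per flattened input and a bias, for each of 2 units
print(conv1, conv2, dense)      # 80 1168 67714
print(conv1 + conv2 + dense)    # 68962

How well does this network perform this task? We start by fitting it to the data.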

history = model.fit(X_train, to_categorical(y_train), batch_size=5, epochs=5,
                    verbose=2)
Epoch 1/5
29/29 - 1s - loss: 12.3917 - accuracy: 0.5845 - 754ms/epoch - 26ms/step
Epoch 2/5
29/29 - 0s - loss: 0.5865 - accuracy: 0.8028 - 199ms/epoch - 7ms/step
Epoch 3/5
29/29 - 0s - loss: 0.3195 - accuracy: 0.9085 - 194ms/epoch - 7ms/step
Epoch 4/5
29/29 - 0s - loss: 0.1622 - accuracy: 0.9507 - 206ms/epoch - 7ms/step
Epoch 5/5
29/29 - 0s - loss: 0.0866 - accuracy: 0.9789 - 197ms/epoch - 7ms/step

The fit accuracy seems rather high, reaching almost 100% correct within the 5 epochs of training. But when we apply the model to the test data, we see much less accurate performance:

model.evaluate(X_test, to_categorical(y_test), verbose=2)
2/2 - 0s - loss: 1.9438 - accuracy: 0.5833 - 167ms/epoch - 83ms/step
[1.943800449371338, 0.5833333134651184]

Looks like the neural network is overfitting to the training data, achieving a rather high accuracy of classification in the training data, but then falling to a much lower level of accuracy on the test data. One way to think about this is that the model has so many parameters, and the training data is so small, that the neural network can simply memorize each and every one of the samples in the training data. Then, when it sees an unfamiliar sample in the test set, it can’t do much better than guess what group this sample belongs to.

What can we do to improve the situation? The convolutional neural network tool-box contains several kinds of tricks to improve model fitting and make it less prone to over-fitting. One commonly used trick is called “dropout”. In this approach, invented by Nitish Srivastava and colleagues [Srivastava et al., 2014], overfitting is reduced by training only some of the units in a particular layer on each batch. That is, some proportion of the units in the layer is “dropped out” on each batch, so that those units are not affected by the images in that batch. This means that the units in the layer to which dropout is applied are trained a little less uniformly and can come to respond to different kinds of inputs. This works effectively as a regularization method (we previously discussed regularization in Section 19.2.1).
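
As a rough illustration of the idea (a NumPy sketch, not how Keras implements dropout internally, and ignoring the rescaling of the remaining activations that is usually applied): for each batch, a random mask zeroes out a proportion of a layer’s activations, so those units contribute nothing to the output for that batch.

import numpy as np

rng = np.random.default_rng(0)
activations = np.array([0.7, 1.2, 0.3, 2.1, 0.9, 1.5])

rate = 0.5                                   # proportion of units to drop
mask = rng.random(activations.shape) > rate  # True for the units that are kept
print(activations * mask)                    # dropped units become 0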

Another way to deal with the high flexibility of the model is to reduce the astronomical number of parameters that the model has. We could do that by reducing the number of convolutional filters in each layer, but another way to do it without hurting the ability of the network to generalize too much is to pool information across neighboring pixels. A common way to do that is to replace every group of pixels with the maximal value of that group. For example, if we perform pooling with a 2-by-2 pixel window on the outputs of a convolutional layer, we replace every four pixels in the output with a single pixel that has the largest value among the four. This means that the next layer receives an input that is half the size along each dimension, which substantially reduces the number of parameters needed further up in the network.
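
Here is a small sketch of 2-by-2 max pooling on a toy 4-by-4 “image”, just to illustrate the operation (Keras’s MaxPooling2D layer, used below, does this for us):

import numpy as np

image = np.array([[1, 3, 2, 0],
                  [4, 2, 1, 1],
                  [0, 1, 5, 2],
                  [2, 2, 3, 4]])

# group the pixels into non-overlapping 2x2 windows and keep the maximum of each
pooled = image.reshape(2, 2, 2, 2).max(axis=(1, 3))
print(pooled)  # [[4 2]
               #  [2 5]]

The model below incorporates both of these ideas: max pooling after each convolutional layer, and dropout before the final readout layer.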

model = keras.Sequential(
    [
        keras.Input(shape=(50, 50, 1)),
        layers.Conv2D(16, kernel_size=(3, 3), activation="relu"),
        layers.MaxPooling2D(pool_size=(2, 2)),
        layers.Conv2D(32, kernel_size=(3, 3), activation="relu"),
        layers.MaxPooling2D(pool_size=(2, 2)),
        layers.Flatten(),
        layers.Dropout(0.5),
        layers.Dense(2, activation="softmax"),
    ]
)

One of the important effects of adding max spatial pooling into the neural network is that the number of parameters goes down. Here, we go from a neural network with almost 70,000 parameters to a network with just over 12,500 parameters. The effect, particularly in this regime with few data samples, is that the network doesn’t fit the training data as well. At the same time, performance on the test data is improved.

model.summary()
Model: "sequential_2"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
=================================================================
 conv2d_2 (Conv2D)           (None, 48, 48, 16)        160       
                                                                 
 max_pooling2d (MaxPooling2D  (None, 24, 24, 16)       0         
 )                                                               
                                                                 
 conv2d_3 (Conv2D)           (None, 22, 22, 32)        4640      
                                                                 
 max_pooling2d_1 (MaxPooling  (None, 11, 11, 32)       0         
 2D)                                                             
                                                                 
 flatten_1 (Flatten)         (None, 3872)              0         
                                                                 
 dropout (Dropout)           (None, 3872)              0         
                                                                 
 dense_4 (Dense)             (None, 2)                 7746      
                                                                 
=================================================================
Total params: 12,546
Trainable params: 12,546
Non-trainable params: 0
_________________________________________________________________
model.compile(loss="categorical_crossentropy", metrics=["accuracy"])
history = model.fit(X_train, to_categorical(y_train), batch_size=5, epochs=5,
                    verbose=2)
Epoch 1/5
29/29 - 1s - loss: 2.8656 - accuracy: 0.5563 - 771ms/epoch - 27ms/step
Epoch 2/5
29/29 - 0s - loss: 0.9073 - accuracy: 0.6549 - 171ms/epoch - 6ms/step
Epoch 3/5
29/29 - 0s - loss: 0.6686 - accuracy: 0.6972 - 163ms/epoch - 6ms/step
Epoch 4/5
29/29 - 0s - loss: 0.5708 - accuracy: 0.7465 - 169ms/epoch - 6ms/step
Epoch 5/5
29/29 - 0s - loss: 0.4733 - accuracy: 0.7746 - 186ms/epoch - 6ms/step
model.evaluate(X_test, to_categorical(y_test), verbose=2)
2/2 - 0s - loss: 0.7002 - accuracy: 0.6667 - 149ms/epoch - 74ms/step
[0.7002447843551636, 0.6666666865348816]

What should we make of this 67% correct performance on the test set? Probably not too much. In addition to the issues of class imbalance that we covered above, one of the concerns with models that have this many parameters and multiple layers is that it is hard to say what exactly about the images accounts for their level of performance. Here again, we might worry that subtle differences in the images acquired at each site, coupled with different frequencies of each class of subject at each site, provide enough clues for the algorithm to substantially exceed 50% performance without actually learning anything truly generalizable about differences between the brains of individuals diagnosed with autism and the brains of individuals not diagnosed with autism.

A general criticism that has been leveled against the use of machine learning in general, and of deep learning in particular, is that the models can be rather inscrutable and are very sensitive to features of the data that are confounded with the variables of interest – in this case, for example, study site and autism diagnosis. One way to deal with this criticism is to be very careful in designing the experiments and to scrutinize the data that is used with a very critical eye.

In addition, analysis approaches have been developed to try to understand what neural networks and other machine learning algorithms are doing inside the “black box”. These approaches attempt to deconstruct the operations of the neural network by discovering the features of individual images that lead to a particular classification decision – for example, the features that lead a particular image in our dataset to be classified as the brain of an individual diagnosed with autism. This field is broadly known as “interpretable machine learning”, and it includes methods that are applied to neural networks as well as to other complex machine learning algorithms, such as the tree algorithms that we demonstrated above. In some cases, these methods can help uncover confounds – for example, if the deep learning network were to focus on some task-irrelevant features of the image that tell us which site the image came from. In other cases, these approaches help uncover interesting relationships in the data. This idea – of “interpretable machine learning” – brings us full circle to where this chapter started, with the tension between prediction and explanation. These methods try to create a bridge between these two goals, by unpacking the reasons that a model accurately predicts.

20.4.1. Additional resources#

The Deep Learning Book by Ian Goodfellow, Yoshua Bengio, and Aaron Courville offers a much deeper introduction to the topic than we are able to provide here.