9. Manipulating tabular data with Pandas

As we saw in a previous chapter, many kinds of neuroscience data can be described as arrays. In particular, the analysis of many physiological signals, such as BOLD, really benefits from organizing data in the form of arrays. But there’s another very beneficial way of organizing data. In neuroscience, we often have datasets where several different variables were recorded for each observation in the dataset. For example, each observation might be one subject in a study, and for each subject, we record variables like the subject’s sex and age, various psychological measures, and also summaries of physiological measurements using fMRI or other brain imaging modalities. In cases like this, it’s very common to represent the data in a two-dimensional table where each row represents a different observation and each column represents a different variable. Organizing our data this way allows us to manipulate the data through queries and aggregation operations in a way that makes analysis much easier. Because this type of organization is so simple and useful, it’s often called “tidy” data (a more technical name, for readers acquainted with databases, is the third normal form, or 3NF). This is also a very general way of organizing data in many different applications, ranging from scientific research to website logs and administrative data collected in the operations of various organizations. As you will see later in the book, it is also a natural input to further analysis, for example, with the machine-learning methods implemented in Scikit-learn. For this reason, tools that analyze tidy tabular data play a central role in data science across these varied applications. Here, we’ll focus on a popular Python library for analyzing this kind of data: “Pandas”. Pandas is not only the plural form of an exceedingly cute animal you can use as the mascot for your software library but also an abbreviation of “panel data”, which is another name for data that is stored in two-dimensional tables of this sort.

An example should help demonstrate how Pandas is used. Let’s consider a diffusion MRI dataset. In this dataset, diffusion MRI data were collected from 77 individuals ages 6–50. Let’s start with the subjects and their characteristics.

We import pandas the usual way. Importing it as pd is also an oft-used convention, just like import numpy as np:

import pandas as pd

Pandas knows how to take sequences of data and organize them into tables. For example, the following code creates a table that has three columns, with four values in each column.

my_df = pd.DataFrame({'A': ['A0', 'A1', 'A2', 'A3'],
                      'B': ['B0', 'B1', 'B2', 'B3'],
                      'C': ['C0', 'C1', 'C2', 'C3']})

It’s easy to look at all of the data in such a small table by printing out the values in the table.

my_df
    A   B   C
0  A0  B0  C0
1  A1  B1  C1
2  A2  B2  C2
3  A3  B3  C3

The Pandas library also knows how to read data from all kinds of sources. For example, it can read data directly from a comma-separated values file (“.csv”) stored on our computer, or somewhere on the internet. The CSV format is a good format for storing tabular data. You can write out the values in a table, making sure each row contains the same number of values. Values within a row are separated from one another by commas (hence the name of the format). To make the data more comprehensible, it’s common to add a header row. That is, the first row should contain labels (or column names) that tell us what each value in that column means. We often also have one column that uniquely identifies each one of the observations (e.g., the unique ID of each subject), which can come in handy.
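For example, the first few lines of a small, made-up CSV file of this kind (the column names and values here are purely illustrative) might look like this:

subjectID,Age,Gender
subject_000,20,Male
subject_001,31,Male
subject_002,18,Female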

Here, we will point to data that was collected as part of a study of life span changes in brain tissue properties [Yeatman et al., 2014]. The data is stored in a file that contains a first row with the labels of the variables: subjectID, Age, Gender, Handedness, IQ, IQ_Matrix, and IQ_Vocab. The first of these labels is an identifier for each subject. The second is the subject’s age, stored as an integer. The third is the subject’s gender, stored as a string value (“Male” or “Female”); the fourth is handedness (“Right”/”Left”); the fifth is the IQ score (a number); and the sixth and seventh are sub-scores of the total IQ score (numbers as well).

We point the Pandas read_csv function directly to a URL that stores the data (though we could also point to a CSV file stored on our computer!), as part of the AFQ-Browser project. We also give Pandas some instructions about our data, including which columns we would like to use; how to interpret values that should be marked as null values (in this case, the string “NaN”, which stands for “not a number”); and which columns in the data we want to designate as the index (i.e., the column that uniquely identifies each row in our dataset). In this case, our index is the first column (index_col=0), corresponding to our unique subject identifier.

All of the arguments below (except for the URL to the file) are optional. Also, read_csv has dozens of other optional arguments we could potentially use to exert very fine-grained control over how our data file is interpreted. One benefit of this flexibility is that we aren’t restricted to reading only from CSV files. Despite the name of the function, read_csv can be used to read data from a wide range of well-structured plain-text formats. And beyond read_csv, Pandas also has support for many other formats, via functions like read_excel, read_stata, and so on.

subjects = pd.read_csv(
   "https://yeatmanlab.github.io/AFQBrowser-demo/data/subjects.csv",
   usecols=[1,2,3,4,5,6,7],
   na_values="NaN", index_col=0)
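As an aside on that flexibility: the same function could read, say, a tab-separated file just by changing the delimiter. A minimal sketch, assuming a hypothetical local file named subjects.tsv:

# sep tells read_csv what separates values within a row ("\t" means tabs)
local_subjects = pd.read_csv("subjects.tsv", sep="\t", index_col=0)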

The variable subjects now holds a two-dimensional table of data. This variable is an instance of an object called a DataFrame (often abbreviated to DF). A DataFrame is similar to the numpy array objects that you saw in the previous chapter, in that it’s a structured container for data. But it differs in several key respects. For one thing, a DataFrame is limited to 2 dimensions (a Numpy array can have arbitrarily many dimensions). For another, a DataFrame stores a lot more metadata – data about the data. To see what this means, we can examine some of the data as it’s stored in our subjects variable. The .head() method lets us see the first few rows of the table:

subjects.head()
             Age  Gender Handedness     IQ  IQ_Matrix  IQ_Vocab
subjectID
subject_000   20    Male        NaN  139.0       65.0      77.0
subject_001   31    Male        NaN  129.0       58.0      74.0
subject_002   18  Female        NaN  130.0       63.0      70.0
subject_003   28    Male      Right    NaN        NaN       NaN
subject_004   29    Male        NaN    NaN        NaN       NaN

A few things to notice about this table: first, in contrast to what we saw with Numpy arrays, the DataFrame tells us about the meaning of the data: each column has a label (or name). Second, in contrast to a Numpy array, our DF is heterogeneously typed: it contains a mixture of different kinds of variables. Age is an integer variable; IQ is a floating point variable (we can tell because, even though the values are whole numbers, they are displayed with a decimal point); and Gender and Handedness both contain string values.
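Relatedly, we can inspect how each column is stored via the dtypes attribute (the output follows from the types just described; string columns show up as the generic “object” dtype, which we will come back to below):

subjects.dtypes
Age             int64
Gender         object
Handedness     object
IQ            float64
IQ_Matrix     float64
IQ_Vocab      float64
dtype: object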

We might also notice that some variables include values of NaN, which are now designated as null values – values that should be ignored in calculations. This is the special value we use whenever a cell lacks a valid value (due to absent or incorrect measurement). For example, it’s possible that subject_003 and subject_004 didn’t undergo IQ testing, so we don’t know what the values for the variables IQ, IQ_Matrix, and IQ_Vocab should be in these rows.

9.1. Summarizing DataFrames

Pandas provides us with a variety of useful functions for data summarization. The .info() method tells us more precisely how much data we have and how the different columns of the data are stored. It also tells us how many non-null values are stored for every column.

subjects.info()
<class 'pandas.core.frame.DataFrame'>
Index: 77 entries, subject_000 to subject_076
Data columns (total 6 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   Age         77 non-null     int64  
 1   Gender      76 non-null     object 
 2   Handedness  66 non-null     object 
 3   IQ          63 non-null     float64
 4   IQ_Matrix   63 non-null     float64
 5   IQ_Vocab    63 non-null     float64
dtypes: float64(3), int64(1), object(2)
memory usage: 4.2+ KB

This should mostly make sense. The variables that contain string values, Gender and Handedness, are stored as “object” columns. This is because they contain a mixture of strings, such as "Male", "Female", "Right" or "Left", and values that are considered numerical: NaNs. The subjectID column has a special interpretation as the index of the DataFrame. We’ll use that in just a little bit.

Another view on the data is provided by a method called describe, which summarizes the statistics of each numerical column, calculating the minimum, maximum, mean, standard deviation, and quantiles of the values in each of these columns.

subjects.describe()
             Age          IQ   IQ_Matrix   IQ_Vocab
count  77.000000   63.000000   63.000000  63.000000
mean   18.961039  122.142857   60.539683  64.015873
std    12.246849   12.717599    7.448372   8.125015
min     6.000000   92.000000   41.000000  36.000000
25%     9.000000  114.000000   57.000000  60.000000
50%    14.000000  122.000000   61.000000  64.000000
75%    28.000000  130.000000   64.500000  70.000000
max    50.000000  151.000000   76.000000  80.000000

The NaN values are ignored in this summary, but the number of non-NaN values is given in a “count” row that tells us how many values from each column were used in computing these statistics. It looks like 14 subjects in this data did not have measurements of their IQ.
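We can also count the missing values directly: the isnull method marks every missing cell as True, and summing those boolean values column by column counts them (consistent with the “count” row above, each of the IQ columns has 14 missing values):

subjects.isnull().sum()
Age            0
Gender         1
Handedness    11
IQ            14
IQ_Matrix     14
IQ_Vocab      14
dtype: int64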

9.2. Indexing into DataFrames

We’ve already seen in previous chapters how indexing and slicing let us choose parts of a sequence or an array. Pandas DataFrames support a variety of different indexing and slicing operations. Let’s start with the selection of observations, or rows. In contrast to Numpy arrays, we can’t select rows in a DataFrame with numerical indexing. The code subjects[0] will usually raise an error (the exception is if we named one of our columns 0, but that would be bad practice).

One way to select rows from the data is to use the index column of the DataFrame. When we loaded the data from a file, we asked to designate the subjectID column as the index column for this DataFrame. This means that we can use the values in this column to select rows using the loc attribute of the DataFrame object. For example:

subjects.loc["subject_000"]
Age              20
Gender         Male
Handedness      NaN
IQ            139.0
IQ_Matrix      65.0
IQ_Vocab       77.0
Name: subject_000, dtype: object

This gives us a new variable that contains only data for this subject.

Notice that .loc is a special kind of object attribute, which we index directly with the square brackets. This kind of attribute is called an indexer. This indexer is label-based, which means it expects us to pass the labels (or names) of the rows and columns we want to access. But often, we don’t know what those labels are, and instead, only know the numerical indices of our targets. Fortunately, Pandas provides an .iloc indexer for that purpose:

subjects.iloc[0]
Age              20
Gender         Male
Handedness      NaN
IQ            139.0
IQ_Matrix      65.0
IQ_Vocab       77.0
Name: subject_000, dtype: object

The above returns the same data as before, but now we’re asking for the row at position 0, rather than the row with label "subject_000".

You might find yourself wondering why we only passed a single index to .loc and .iloc, given that our table has two dimensions. The answer is that it’s just shorthand. In the above code, subjects.iloc[0] is equivalent to this:

display(subjects.iloc[0, :])
Age              20
Gender         Male
Handedness      NaN
IQ            139.0
IQ_Matrix      65.0
IQ_Vocab       77.0
Name: subject_000, dtype: object

You might remember the : syntax from the Numpy section. Just as in Numpy, the colon : stands for “all values”. And just like in Numpy, if we omit the second dimension, Pandas implicitly assumes we want all of the values (i.e., all of the columns in that row). And also just like in Numpy, we can use the slicing syntax to retrieve only some of the elements:

subjects.iloc[0, 2:5]
Handedness      NaN
IQ            139.0
IQ_Matrix      65.0
Name: subject_000, dtype: object

The .iloc and .loc indexers are powerful and prevent ambiguity, but they also require us to type a few more characters. In cases where we just want to access one of the DataFrame columns directly, Pandas provides the following helpful shorthand:

subjects["Age"]
subjectID
subject_000    20
subject_001    31
subject_002    18
subject_003    28
subject_004    29
               ..
subject_072    40
subject_073    50
subject_074    40
subject_075    17
subject_076    17
Name: Age, Length: 77, dtype: int64

age = subjects["Age"]

This assigns the values in the "Age" column to the age variable. The new variable is no longer a Pandas DataFrame; it’s now a Pandas Series! A Series stores a 1-dimensional series of values (it’s conceptually similar to a 1-dimensional Numpy array) and under the hood, every Pandas DataFrame is a collection of Series (one per column). The age Series object also retains the index from the original DataFrame.
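Because a Series is conceptually so close to a one-dimensional Numpy array, it is also easy to move between the two. For example, the to_numpy method returns the underlying values as a plain array (a quick sketch, continuing with the age variable):

age_array = age.to_numpy()  # a one-dimensional Numpy array of the ages
print(age_array[:5])
[20 31 18 28 29]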

Series behave very similarly to DataFrames, with some exceptions. One exception is that, because a Series only has one dimension, we can index into its rows without explicitly using loc or iloc (though both still work fine). So, the following two lines both retrieve the same value (note that recent versions of Pandas deprecate plain positional indexing like age[74] in favor of the explicit age.iloc[74]).

age['subject_072']
40
age[74]
40

We’ll see other ways that Series objects are useful in just a bit. We can also select more than one column to include. This requires indexing with a list of column names and will create a new DataFrame:

subjects[["Age", "IQ"]]
             Age     IQ
subjectID
subject_000   20  139.0
subject_001   31  129.0
subject_002   18  130.0
subject_003   28    NaN
subject_004   29    NaN
...          ...    ...
subject_072   40  134.0
subject_073   50    NaN
subject_074   40  122.0
subject_075   17  118.0
subject_076   17  121.0

77 rows × 2 columns

We can also combine indexing operations with loc, to select a particular combination of rows and columns. For example:

subjects.loc["subject_005":"subject_010", ["Age", "Gender"]]
             Age  Gender
subjectID
subject_005   36    Male
subject_006   39    Male
subject_007   34    Male
subject_008   24  Female
subject_009   21    Male
subject_010   29  Female

A cautionary note: beginners often find indexing in Pandas confusing, partly because there are separate .loc and .iloc indexers, and partly because many experienced Pandas users don’t always use these explicit indexers, opting instead for shorthand like subjects["Age"]. It may take a bit of time for all these conventions to sink in, but don’t worry! With a bit of experience, they quickly become second nature. If you don’t mind typing a few extra characters, the best practice (which, admittedly, we won’t always follow ourselves in this book) is to be as explicit as possible: use .loc and .iloc, and always include both dimensions when indexing a DataFrame. So, for example, we would write subjects.loc[:, "Age"] rather than the shorthand subjects["Age"].

9.3. Computing with DataFrames

Like Numpy arrays, Pandas DataFrame objects have many methods for computing on the data that is in the DataFrame. In contrast to arrays, however, the dimensions of the DataFrame always mean the same thing: the columns are variables and the rows are observations. This means that some kinds of computations only make sense when done along one dimension, the rows, and not along the other. For example, it might make sense to compute the average IQ of the entire sample, but it wouldn’t make sense to average a single subject’s age and IQ scores.

Let’s see how this plays out in practice. Like the Numpy array object, DataFrame objects have a mean method. However, in contrast to the Numpy array, when mean is called on a DataFrame, it defaults to taking the mean of each column separately. Also in contrast to Numpy arrays, different columns can have different types. For example, the "Gender" column is full of strings, and it doesn’t make sense to take the mean of strings, so we have to explicitly pass the mean method an argument telling it to average only the columns that contain numeric data.

means = subjects.mean(numeric_only=True)
print(means)
Age           18.961039
IQ           122.142857
IQ_Matrix     60.539683
IQ_Vocab      64.015873
dtype: float64

This operation also returns to us an object that makes sense: this is a Pandas Series object, with the variable names of the original DataFrame as its index, which means that we can extract a value for any variable straightforwardly. For example:

means["Age"]
18.961038961038962

9.3.1. Arithmetic with DataFrame columns

As you saw above, a single DataFrame column is a Pandas Series object. We can do arithmetic with Series objects in a way that is very similar to arithmetic in Numpy arrays. That is, when we perform arithmetic between a Series object and a number, the operation is performed separately for each item in the series. For example, let’s compute a standardized z-score for each subject’s age, relative to the distribution of ages in the group. First, we calculate the mean and standard deviation of the ages. These are single numbers:

age_mean = subjects["Age"].mean()
age_std = subjects["Age"].std()
print(age_mean)
print(age_std)
18.961038961038962
12.24684874445319

Next, we perform array-scalar computations on the "Age" column (a Series). We subtract the mean from each value and divide it by the standard deviation:

(subjects["Age"] - age_mean) / age_std
subjectID
subject_000    0.084835
subject_001    0.983025
subject_002   -0.078472
subject_003    0.738064
subject_004    0.819718
                 ...   
subject_072    1.717908
subject_073    2.534445
subject_074    1.717908
subject_075   -0.160126
subject_076   -0.160126
Name: Age, Length: 77, dtype: float64

One thing that Pandas DataFrames allow us to do is to assign the results of an arithmetic operation on one of its columns into a new column that gets incorporated into the DataFrame. For example, we can create a new column that we will call "Age_standard" and will contain the standard scores for each subject’s age:

subjects["Age_standard"] = (subjects["Age"] - age_mean) / age_std

subjects.head()
             Age  Gender Handedness     IQ  IQ_Matrix  IQ_Vocab  Age_standard
subjectID
subject_000   20    Male        NaN  139.0       65.0      77.0      0.084835
subject_001   31    Male        NaN  129.0       58.0      74.0      0.983025
subject_002   18  Female        NaN  130.0       63.0      70.0     -0.078472
subject_003   28    Male      Right    NaN        NaN       NaN      0.738064
subject_004   29    Male        NaN    NaN        NaN       NaN      0.819718

You can see that Pandas has added a column to the right with the new variable and its values in every row.

9.3.1.1. Exercise

In addition to array-scalar computations, we can also perform arithmetic between columns/Series (akin to array-array computations). Compute a new column in the DataFrame called "IQ_sub_diff", which contains in every row the difference between the subject’s "IQ_Vocab" and "IQ_Matrix" columns. What happens in the cases where one (or both) of these is a null value?

9.3.2. Selecting data

Putting these things together, we will start using Pandas to filter the dataset based on the properties of the data. To do this, we can use logical operations to find observations that fulfill certain conditions and select them. This is similar to logical indexing that we saw in Numpy arrays (in Section 8.2.8). It relies on the fact that, much as we can use a boolean array to index into Numpy arrays, we can also use a boolean Series object to index into a Pandas DataFrame. For example, let’s say that we would like to separately analyze children under 10 and other subjects. First, we define a column that tells us for every row in the data whether the "Age" variable has a value that is smaller than 10:

subjects["age_less_than_10"] = subjects["Age"] < 10
print(subjects["age_less_than_10"])
subjectID
subject_000    False
subject_001    False
subject_002    False
subject_003    False
subject_004    False
               ...  
subject_072    False
subject_073    False
subject_074    False
subject_075    False
subject_076    False
Name: age_less_than_10, Length: 77, dtype: bool

This column is also a Pandas Series object, with the same index as the original DataFrame and boolean values (True/False). To select from the original DataFrame we use this Series object to index into the DataFrame, providing it within the square brackets used for indexing:

subjects_less_than_10 = subjects[subjects["age_less_than_10"]]
subjects_less_than_10.head()
             Age  Gender Handedness     IQ  IQ_Matrix  IQ_Vocab  Age_standard  age_less_than_10
subjectID
subject_024    9    Male      Right  142.0       72.0      73.0     -0.813355              True
subject_026    8    Male      Right  125.0       67.0      61.0     -0.895009              True
subject_028    7    Male      Right  146.0       76.0      73.0     -0.976663              True
subject_029    8  Female      Right  107.0       57.0      51.0     -0.895009              True
subject_033    9    Male      Right  132.0       64.0      71.0     -0.813355              True

As you can see, this gives us a new DataFrame that has all of the columns of the original DataFrame, including the ones that we added through computations that we did along the way. It also retains the index column of the original array and the values that were stored in this column. But several observations were dropped along the way – those for which the "Age" variable was 10 or larger. We can verify that this is the case by looking at the statistics of the remaining data:

subjects_less_than_10.describe()
             Age         IQ  IQ_Matrix   IQ_Vocab  Age_standard
count  25.000000   24.00000  24.000000  24.000000     25.000000
mean    8.320000  126.62500  62.541667  66.625000     -0.868880
std     0.802081   14.48181   8.607273   8.026112      0.065493
min     6.000000   92.00000  41.000000  50.000000     -1.058316
25%     8.000000  119.00000  57.750000  61.750000     -0.895009
50%     8.000000  127.50000  63.500000  68.000000     -0.895009
75%     9.000000  138.00000  68.000000  72.250000     -0.813355
max     9.000000  151.00000  76.000000  80.000000     -0.813355

This checks out. The maximum value for Age is 9, and we also have only 25 remaining subjects, presumably because the other subjects in the DataFrame had ages of 10 or larger.
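If we also want the complementary group (subjects aged 10 or older), there is no need to compute another column: we can negate the boolean Series with the ~ operator (a quick sketch):

subjects_10_or_older = subjects[~subjects["age_less_than_10"]]  # the remaining 52 subjects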

9.3.2.1. Exercises

Series objects have a notnull method that returns a boolean Series object that is True wherever the value is not null (e.g., not NaN). Use this method to create a new DataFrame that has only the subjects for which IQ measurements were obtained. Is there a faster way to create this DataFrame with just one function call?

9.3.3. Selecting combinations of groups: Using a MultiIndex

Sometimes we want to select groups made up of combinations of variables. For example, we might want to analyze the data based on a split of the data both by gender and by age. One way to do this is by changing the index of the DataFrame to be made up of more than one column. This is called a “MultiIndex” DataFrame. A MultiIndex gives us direct access, by indexing, to all the different groups in the data, across different kinds of splits. We start by using the .set_index() method of the DataFrame object to create a new kind of index for our dataset. This index uses both the gender and the age group to split the data into four groups: the two gender groups ("Male"/"Female"), each further split into participants aged below and above 10:

multi_index = subjects.set_index(["Gender", "age_less_than_10"])
multi_index.head()
                         Age Handedness     IQ  IQ_Matrix  IQ_Vocab  Age_standard
Gender age_less_than_10
Male   False              20        NaN  139.0       65.0      77.0      0.084835
       False              31        NaN  129.0       58.0      74.0      0.983025
Female False              18        NaN  130.0       63.0      70.0     -0.078472
Male   False              28      Right    NaN        NaN       NaN      0.738064
       False              29        NaN    NaN        NaN       NaN      0.819718

This is curious: the first two columns are now both index columns. That means that we can select rows of the DataFrame based on these two columns. For example, we can subset the data down to only the subjects identified as males under 10 and then take the mean of that group:

multi_index.loc["Male", True].mean(numeric_only=True)
Age               8.285714
IQ              125.642857
IQ_Matrix        62.071429
IQ_Vocab         66.000000
Age_standard     -0.871679
dtype: float64

Or do the same operation, but only with those subjects identified as females over the age of 10:

multi_index.loc["Female", False].mean(numeric_only=True)
Age              22.576923
IQ              117.095238
IQ_Matrix        57.619048
IQ_Vocab         61.619048
Age_standard      0.295250
dtype: float64

While useful, this can also become a bit cumbersome if you want to repeat this many times for every combination of age group and sex. And because dividing up a group of observations based on combinations of variables is such a common pattern in data analysis, there’s a built-in way to do it. Let’s look at that next.

9.3.4. Split-apply-combine

A recurring pattern we run into when working with tabular data is the following: we want to (1) take a dataset and split it into subsets; (2) independently apply some operation to each subset; and (3) combine the results of all the independent applications into a new dataset. This pattern has a simple, descriptive (if boring) name: “split-apply-combine”. Split-apply-combine is such a powerful and common strategy that Pandas implements extensive functionality designed to support it. The centerpiece is a DataFrame method called .groupby() that, as the name suggests, groups (or splits) a DataFrame into subsets.

For example, let’s split the data by the "Gender" column:

gender_groups = subjects.groupby("Gender")

The output from this operation is a DataFrameGroupBy object. This is a special kind of object that knows how to do many of the things that regular DataFrame objects can, but also internally groups parts of the original DataFrame into distinct subsets. This means that we can perform many operations just as if we were working with a regular DataFrame, but implicitly, those operations will be applied to each subset, and not to the whole DataFrame.
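One handy method to know about here: get_group extracts any single subset as a regular DataFrame (a quick sketch):

female_subjects = gender_groups.get_group("Female")  # just the rows where Gender == "Female"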

For example, we can calculate the mean for each group:

gender_groups.mean(numeric_only=True)
              Age          IQ  IQ_Matrix   IQ_Vocab  Age_standard  age_less_than_10
Gender
Female  18.351351  120.612903  59.419355  63.516129     -0.049783          0.297297
Male    18.743590  123.625000  61.625000  64.500000     -0.017756          0.358974

The output of this operation is a DataFrame that contains the summary with the original DataFrame’s "Gender" variable as the index variable. This means that we can get the mean age for one of the Gender groups through a standard DataFrame indexing operation:

gender_groups.mean(numeric_only=True).loc["Female", "Age"]
18.35135135135135
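The .mean() method is just one of many possible aggregations; the .agg method even lets us apply several functions at once (a sketch, aggregating only the IQ column):

gender_groups["IQ"].agg(["mean", "std"])  # mean and standard deviation of IQ per gender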

We can also call .groupby() with more than one column. For example, we can repeat the split of the group by age groups and sex:

gender_and_age_groups = subjects.groupby(["Gender", "age_less_than_10"])

As before, the resulting object is a DataFrameGroupBy object, and we can call the .mean() method on it:

gender_and_age_groups.mean(numeric_only=True)
                               Age          IQ  IQ_Matrix   IQ_Vocab  Age_standard
Gender age_less_than_10
Female False             22.576923  117.095238  57.619048  61.619048      0.295250
       True               8.363636  128.000000  63.200000  67.500000     -0.865317
Male   False             24.600000  122.055556  61.277778  63.333333      0.460442
       True               8.285714  125.642857  62.071429  66.000000     -0.871679

The resulting object is a MultiIndex DataFrame, but rather than worry about how to work with MultiIndexes, we’ll just use .iloc to retrieve the first combination of gender and age values in the above DataFrame.

gender_and_age_groups.mean(numeric_only=True).iloc[0]
Age              22.576923
IQ              117.095238
IQ_Matrix        57.619048
IQ_Vocab         61.619048
Age_standard      0.295250
Name: (Female, False), dtype: float64

We could also have used .reset_index() and then applied boolean operations to select specific subsets of the data, just as we did earlier in this section.
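For example, here is a sketch of that alternative, flattening the group means back into ordinary columns and then filtering with boolean indexing:

flat_means = gender_and_age_groups.mean(numeric_only=True).reset_index()
# Gender and age_less_than_10 are now regular columns again, so we can filter on them:
flat_means[(flat_means["Gender"] == "Female") & ~flat_means["age_less_than_10"]]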

9.3.4.1. Exercise

Use any of the methods you saw above to calculate the average IQ of right-handed male subjects older than 10 years.

9.4. Joining different tables

Another kind of operation that Pandas has excellent support for is data joining (or merging, or combining…). For example, in addition to the table we’ve been working with so far, we also have diffusion MRI data that was collected in the same individuals. Diffusion MRI uses magnetic field gradients to make the measurement in each voxel sensitive to the directional diffusion of water in that part of the brain. This is particularly useful in the brain’s white matter, where diffusion along the length of large bundles of myelinated axons is much larger than across their boundaries. This fact is used to guide computational tractography algorithms that generate estimates of the major white matter pathways that connect different parts of the brain. In addition, in each voxel, we can fit a model of diffusion that tells us something about the properties of the white matter tissue within the voxel.

In the dataset that we’re analyzing here, the diffusion data were analyzed to extract tissue properties along the length of 20 major white matter pathways in the brain. In this analysis method, called tractometry, each major pathway is divided into 100 nodes, and in each node, different kinds of tissue properties are sampled. For example, the fractional anisotropy (FA) is calculated in the voxels that are associated with this node. This means that for every bundle, in every subject, we have exactly 100 numbers representing FA. This data can therefore also be organized in a tabular format. Let’s see what that looks like by reading this table as well. For simplicity, we focus only on some of the columns in the table (in particular, there are some other tissue properties stored in the CSV file, which you can explore on your own by omitting the usecols argument).

nodes = pd.read_csv(
    'https://yeatmanlab.github.io/AFQBrowser-demo/data/nodes.csv',
    index_col="subjectID",
    usecols=["subjectID", "tractID", "nodeID", "fa"])
nodes.info()
<class 'pandas.core.frame.DataFrame'>
Index: 154000 entries, subject_000 to subject_076
Data columns (total 3 columns):
 #   Column   Non-Null Count   Dtype  
---  ------   --------------   -----  
 0   tractID  154000 non-null  object 
 1   nodeID   154000 non-null  int64  
 2   fa       152326 non-null  float64
dtypes: float64(1), int64(1), object(1)
memory usage: 4.7+ MB
nodes.head()
                             tractID  nodeID        fa
subjectID
subject_000  Left Thalamic Radiation       0  0.183053
subject_000  Left Thalamic Radiation       1  0.247121
subject_000  Left Thalamic Radiation       2  0.306726
subject_000  Left Thalamic Radiation       3  0.343995
subject_000  Left Thalamic Radiation       4  0.373869

This table is much, much larger than our subjects table (154,000 rows!). This is because for every subject and every tract, there are 100 nodes and FA is recorded for each one of these nodes.

Another thing to notice about this table is that it shares one column in common with the subjects table that we looked at before: they both have an index column called ‘subjectID’. But in this table, the index values are not unique to a particular row. Instead, each row is uniquely defined by a combination of three different columns: the subjectID; the tractID, which identifies the white matter pathway (for example, in the first few rows of the table, "Left Thalamic Radiation"); and the nodeID, which identifies where along this pathway the values were extracted. This means that for each individual subject, there are multiple rows. If we index using the subjectID value, we can see just how many:

nodes.loc["subject_000"].info()
<class 'pandas.core.frame.DataFrame'>
Index: 2000 entries, subject_000 to subject_000
Data columns (total 3 columns):
 #   Column   Non-Null Count  Dtype  
---  ------   --------------  -----  
 0   tractID  2000 non-null   object 
 1   nodeID   2000 non-null   int64  
 2   fa       1996 non-null   float64
dtypes: float64(1), int64(1), object(1)
memory usage: 62.5+ KB

This makes sense: 20 major white matter pathways were extracted in the brain of each subject and there are 100 nodes in each white matter pathway, so there are a total of 2,000 rows of data for each subject.

We can ask a lot of questions with this data. For example, we might wonder whether there are sex and age differences in the properties of the white matter. To answer these kinds of questions, we need to somehow merge the information that’s currently contained in two different tables. There are a few ways to do this, and some are simpler than others. Let’s start by considering the simplest case, using artificial data, and then we’ll come back to our real dataset.

Consider the following three tables, created the way we saw at the very beginning of this chapter:

df1 = pd.DataFrame({'A': ['A0', 'A1', 'A2', 'A3'],
                    'B': ['B0', 'B1', 'B2', 'B3'],
                    'C': ['C0', 'C1', 'C2', 'C3'],
                    'D': ['D0', 'D1', 'D2', 'D3']},
                    index=[0, 1, 2, 3])

df2 = pd.DataFrame({'A': ['A4', 'A5', 'A6', 'A7'],
                    'B': ['B4', 'B5', 'B6', 'B7'],
                    'C': ['C4', 'C5', 'C6', 'C7'],
                    'D': ['D4', 'D5', 'D6', 'D7']},
                    index=[4, 5, 6, 7])


df3 = pd.DataFrame({'A': ['A8', 'A9', 'A10', 'A11'],
                    'B': ['B8', 'B9', 'B10', 'B11'],
                    'C': ['C8', 'C9', 'C10', 'C11'],
                    'D': ['D8', 'D9', 'D10', 'D11']},
                    index=[8, 9, 10, 11])

Each one of these tables has the same columns: 'A', 'B', 'C' and 'D' and they each have their own distinct set of index values. This kind of data could arise if the same kinds of measurements were repeated over time. For example, we might measure the same air quality variables every week, and then store each week’s data in a separate table, with the dates of each measurement as the index.

One way to merge the data from such tables is using the Pandas concat function (an abbreviation of concatenation, which is the chaining together of different elements).

frames = [df1, df2, df3]

result = pd.concat(frames)

The result table here would have the information that was originally stored in the three different tables, organized in the order of their concatenation.

result
      A    B    C    D
0    A0   B0   C0   D0
1    A1   B1   C1   D1
2    A2   B2   C2   D2
3    A3   B3   C3   D3
4    A4   B4   C4   D4
5    A5   B5   C5   D5
6    A6   B6   C6   D6
7    A7   B7   C7   D7
8    A8   B8   C8   D8
9    A9   B9   C9   D9
10  A10  B10  C10  D10
11  A11  B11  C11  D11

In this case, all of our individual DataFrames have the same columns, but different row indexes, so it’s very natural to concatenate them along the row axis, as above. But suppose we want to merge df1 with the following new DataFrame, df4:

df4 = pd.DataFrame({'B': ['B2', 'B3', 'B6', 'B7'],
                    'D': ['D2', 'D4', 'D6', 'D7'],
                    'F': ['F2', 'F3', 'F6', 'F7']},
                    index=[2, 3, 6, 7])

Now, df4 has index values 2 and 3 in common with df1. It also has columns 'B' and 'D' in common with df1. But it also has some new index values (6 and 7) and a new column ('F') that didn’t exist in df1. That means that there’s more than one way to put together the data from df1 and df4. The safest thing to do is to preserve as much of the data as possible, and that’s what the concat function does by default:

pd.concat([df1, df4])
     A   B    C   D    F
0   A0  B0   C0  D0  NaN
1   A1  B1   C1  D1  NaN
2   A2  B2   C2  D2  NaN
3   A3  B3   C3  D3  NaN
2  NaN  B2  NaN  D2   F2
3  NaN  B3  NaN  D4   F3
6  NaN  B6  NaN  D6   F6
7  NaN  B7  NaN  D7   F7

In this case, the new merged table contains all of the rows in df1 followed by all of the rows in df4. The columns of the new table are a combination of the columns of the two tables. Wherever possible, the new table merges the columns into one column; this is true for columns that exist in both DataFrame objects, such as columns B and D. In cases where this isn’t possible (because the column doesn’t exist in one of the inputs), Pandas preserves the input values and fills the gaps with NaNs, marking missing values. For example, df1 didn’t have a column called F, so the first row in the resulting table has a NaN in that column. Similarly, df4 didn’t have an 'A' column, so the 5th row in the table has a NaN for that column.

9.4.1. Merging tables

There are other ways we could combine data with concat; for example, we could try concatenating only along the column axis (i.e., stacking DataFrames along their width) to create an increasingly wide result. You can experiment with that by passing axis=1 to concat and seeing what happens, as in the sketch below; rather than belabor the point, we’ll then come back to our real DataFrames, which imply a more complicated merging scenario.
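A minimal sketch of that experiment (the result aligns on the row index, and rows that appear in only one input are padded with NaNs):

wide = pd.concat([df1, df4], axis=1)  # stack the columns side by side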

You might think we could just use concat again to combine our subjects and nodes datasets. But there are some complications. For one thing, our two DataFrames contain no common columns. Naively concatenating them along the row axis, as in our first concat example above, would produce a rather strange-looking result (feel free to try it out). But concatenating along the column axis is also tricky, because we have multiple rows for each subject in nodes, but only one row per subject in subjects. It turns out that concat doesn’t know how to deal with this type of data and would give us an error. Instead, we need to use a more flexible merging strategy that allows us to specify exactly how we want the merge to proceed. Fortunately, we can use the Pandas merge function for this. merge is smarter than concat; it implements several standard joining algorithms commonly used in computer databases (you might have heard of ‘inner’, ‘outer’, or ‘left’ joins; these are the kinds of things we’re talking about here). When we call merge, we pass in the two DataFrames we want to merge, and then a specification that controls how the merging takes place. In our case, we indicate that we want to merge on the index for both the left and right DataFrames (hence left_index=True and right_index=True). But we could also join on one or more columns in the DataFrames, in which case we would pass column names to the on argument (or to left_on and right_on).
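To make the column-based alternative concrete, here is a sketch of an equivalent merge after moving the index back into an ordinary column (the index-based version, which we actually run, follows):

# reset_index turns the subjectID index into a regular column in both tables,
# so we can merge on it by name instead of merging on the indexes
joined_alt = pd.merge(nodes.reset_index(), subjects.reset_index(), on="subjectID")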

joined = pd.merge(nodes, subjects, left_index=True, right_index=True)
joined.info()
<class 'pandas.core.frame.DataFrame'>
Index: 154000 entries, subject_000 to subject_076
Data columns (total 11 columns):
 #   Column            Non-Null Count   Dtype  
---  ------            --------------   -----  
 0   tractID           154000 non-null  object 
 1   nodeID            154000 non-null  int64  
 2   fa                152326 non-null  float64
 3   Age               154000 non-null  int64  
 4   Gender            152000 non-null  object 
 5   Handedness        132000 non-null  object 
 6   IQ                126000 non-null  float64
 7   IQ_Matrix         126000 non-null  float64
 8   IQ_Vocab          126000 non-null  float64
 9   Age_standard      154000 non-null  float64
 10  age_less_than_10  154000 non-null  bool   
dtypes: bool(1), float64(5), int64(2), object(3)
memory usage: 13.1+ MB

The result, as we see above, is a table, where each row corresponds to one of the rows in the original nodes table, but adds in the values from the subjects table that belong to that subject. This means that we can use these variables to split up the diffusion MRI data and analyze it separately by (for example) age, as before, using the split-apply-combine pattern. Here, we define our subgroups based on unique combinations of age (less than or greater than 10), tract ID, and node ID.

age_groups = joined.groupby(["age_less_than_10", "tractID", "nodeID"])

Now applying the mean method of the DataFrameGroupBy object will create a separate average for each one of the nodes in each white matter pathway, identified through the “tractID” and “nodeID” values, across all the subjects in each group.

group_means = age_groups.mean(numeric_only=True)
group_means.info()
<class 'pandas.core.frame.DataFrame'>
MultiIndex: 4000 entries, (False, 'Callosum Forceps Major', 0) to (True, 'Right Uncinate', 99)
Data columns (total 6 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   fa            4000 non-null   float64
 1   Age           4000 non-null   float64
 2   IQ            4000 non-null   float64
 3   IQ_Matrix     4000 non-null   float64
 4   IQ_Vocab      4000 non-null   float64
 5   Age_standard  4000 non-null   float64
dtypes: float64(6)
memory usage: 200.3+ KB

For example, the first row above shows us the mean value for each variable across all subjects, for the node with ID 0, in the Callosum Forceps Major tract. There are 4,000 rows in all, reflecting 2 age groups x 20 tracts x 100 nodes. Let’s select just one of the white matter pathways, and assign each age group’s rows within that pathway to a separate variable. This gives us two variables, each with 100 rows (1 row per node). The index for this DataFrame is a MultiIndex, which is why we pass tuples like (False, "Left Cingulum Cingulate") to select from it.

below_10_means = group_means.loc[(False, "Left Cingulum Cingulate")]
above_10_means = group_means.loc[(True, "Left Cingulum Cingulate")]

We can then select just one of the columns in the table for further inspection, for example, “fa”. To visualize the 100 numbers in this Series, we will use the Matplotlib library’s plot function (in the next chapter, you will learn much more about data visualization with Matplotlib). For an interesting comparison, we plot the same column for both age groups:

import matplotlib.pyplot as plt

fig, ax = plt.subplots()
ax.plot(below_10_means["fa"])
ax.plot(above_10_means["fa"])
ax.set_xlabel("Node")
label = ax.set_ylabel("Fractional anisotropy")
[Figure: fractional anisotropy as a function of node along the Left Cingulum Cingulate, one line per age group.]

This result is rather striking! In this part of the brain, there is a substantial difference between younger children and all of the other subjects. This is a part of the white matter where there is a significant amount of development after age 10.

9.4.1.1. Exercises

How would you go about comparing the development of male and female subjects in this part of the brain? How would you compare younger children to the other subjects in other tracts?

To summarize, Pandas gives us a set of functionality to combine, query, and summarize data stored in tabular format. You’ve already seen here how you get from data stored in tables to a real scientific result. We’ll come back to using Pandas later in the book in the context of data analysis, and you will see more elaborate examples in which data can be selected using Pandas and then submitted to further analysis in other tools.

9.4.2. Pandas errors

Before we move on to the next topic, we would like to pause and discuss a few patterns of errors that are unique to Pandas and common enough in daily use that they are worth warning you about. One common pattern of errors comes from confusion between Series objects and DataFrame objects. These are very similar, but they are not the same thing! For example, Series objects have a very useful value_counts method that creates a table with the number of observations in the Series for every unique value. Calling a Series-only method like this on a DataFrame can raise a Python AttributeError (recent versions of Pandas do add a DataFrame.value_counts that counts unique rows, but many other methods exist for only one of the two classes). Another common error comes from the fact that many of the operations that you can do on DataFrames create a new DataFrame as output, rather than changing the DataFrame in place. For example, calling the following code:

subjects.dropna()

would not change the subjects DataFrame! If you want to retain the result of this call, you will need to assign it to a new DataFrame:

subjects_without_na = subjects.dropna()

Or use the inplace keyword argument:

subjects.dropna(inplace=True)

This pattern of errors is particularly pernicious because you could continue working with the unchanged DataFrame for many more steps, leading to confusing results down the line.
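A short sketch that makes both pitfalls concrete (using the subjects DataFrame from above; all of the calls shown are standard Pandas methods):

counts = subjects["Gender"].value_counts()  # value_counts on a single column (a Series)
subjects.dropna()                   # computes a result, but subjects itself is unchanged!
subjects_no_na = subjects.dropna()  # assigning the result is what keeps it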

Finally, errors due to indexing are also common. This is because, as you saw above, there are several different ways to perform the same indexing operation, and, in contrast to indexing in Numpy arrays, indexing by position (i.e., with iloc) does something rather different from indexing with the row labels (i.e., with loc).

9.4.3. Additional resources

As mentioned at the beginning of this section, Pandas is a very popular software library, and there are many examples of its use available online. One particularly useful resource is the set of Pandas snippets available from Chris Albon on his website. If you’d like to learn more about “tidy data”, you can read the paper by Hadley Wickham with the same name [Wickham, 2014].