11. Data science tools for neuroimaging

In previous chapters, we saw that there is plenty of things that you can do with general-purpose software and data science tools. Many of the examples used data from neuroimaging experiments, but the ideas that we introduced – arrays and computing with arrays, tables and how to manipulate tabular data, scientific computing, and data visualization – were not specific to neuroimaging. In this chapter, we will discuss data science approaches that are specifically tailored to neuroimaging data. First (in Section 11.1), we will present a survey of neuroimaging-specific software implemented in Python. Next (in Section 11.2), we will discuss how to organize data from neuroscience experiments for data analysis that best takes advantage of these software tools. In the following chapters, we will dive into some of the applications of these data science tools to fundamental problems in neuroimaging data analysis.

11.1. Neuroimaging in Python

Within the broader ecosystem of Python tools for science, there is a family of tools specifically focused on neuroimaging (we will refer to them collectively as “NiPy”, which stands for “Neuroimaging in Python”). These software tools, developed by and for neuroimaging researchers, cover a wide range of data analysis tasks on a variety of different kinds of experimental data. In the next few sections, we will see in detail how some of these tools are used. But first, we will provide a broad survey of the different kinds of tools that currently exist. It is important to emphasize that this is a very dynamically evolving ecosystem, and some of these tools may evolve into other tools over time, or even disappear. New tools will inevitably also emerge. So, this survey will be, by necessity, a bit superficial and a bit dated. That said, we’ll try to give you a sense of how an ecosystem like this one emerges and evolves so that you can keep an eye on these trends as they play out in the future.

11.1.1. Reading and writing neuroimaging data

All of the NiPy tools depend to some degree on the ability to read neuroimaging data from commonly used file formats into the memory of the computer. The tool of choice for doing that is called nibabel. The name hints at the plethora of different file formats that exist to represent and store neuroimaging data. Nibabel, which has been in continuous development since 2007, provides a relatively straightforward common route to read many of these formats. Much in the way that Numpy serves as a basis for all of the scientific computing in Python, nibabel serves as the basis for the NiPy ecosystem. Because it is so fundamental, we devote an entire later section (Section 12) to some basic operations that you can do with nibabel. In Section 13, we will go even deeper to see how nibabel enables the integration of different neuroimaging data types. So, we will not say much more about nibabel right now.

11.1.2. Tools for MRI data analysis

Tools for neuroimaging data analysis are usually focused on particular data types and on particular ways of analyzing the data. Functional MRI (fMRI) is a very popular data type in cognitive neuroscience, and several tools focus on functional MRI. For example, both the BrainIAK software library and the Nilearn software library focus on machine learning analysis of functional MRI data. We will return to talk about machine learning in more detail in chapter Section 17. Some tools fit more traditional generalized linear models (GLM) to fMRI data. One of these is fitlins. It relies heavily on the Brain Imaging Data Structure standard, which we will discuss below in Section 11.2. Another one is glmsingle, implemented in both Matlab and Python, which focuses on the estimation of single-trial responses.

Another popular MRI modality is diffusion MRI (dMRI). It is used to characterize the properties of brain white matter and to find white matter connections between different parts of the brain. DIPY (Diffusion Imaging in Python) is a software library that implements a variety of methods for analysis of this kind of data. We will use image processing algorithms that are implemented in DIPY in chapter Section 16. Building upon DIPY, there are several tools that focus on more specific dMRI analysis tasks, such as DMIPY, which focuses on various ways to model the microstructure of brain white matter tissue, pyAFQ and TractSeg, which focus on delineating the major white matter tracts, and NDMG, which focuses on quantifying the connectivity matrix of the brain, based on dMRI data. Another widely-used software library for analysis of dMRI is the Medical Imaging Toolkit for Diffusion.

Other tools focus on analysis of structural MRI data. The advanced normalization tools (ANTs) software library implements a variety of algorithms for structural data, such as image registration and image segmentation, which will be described in more detail in chapter Section 14. The ANTs library, originally developed in the C programming language, is now also available through a user-friendly Python interface in the antspy library.

11.1.3. Tools for other neuroimaging modalities

One of the largest and mature projects for neuroimaging data analysis focuses on electrophysiology data that comes from experiments that use Magnetoencephalography (MEG) Electroencephalography (EEG), Electrocorticography (ECoG), and other similar measurements. The MNE software library has been in development since 2011 and contains implementations of many algorithms that are common across these different types of data. For example, analysis of the frequency spectrums of these signals. It also includes algorithms that are specific for particular types of data. For example, algorithms to localize the brain sources of MEG signals that are measured outside the skull.

11.1.4. Nipype, the Python tool to unify them all

One of the main challenges for anyone doing neuroimaging data analysis is that different algorithms are implemented in different software libraries. Even if you decide that you would like to use NiPy tools for most of your data analysis, you might still find yourself needing to use non-Python tools for some steps in your analysis. For example, to use novel algorithms that are implemented in the popular FMRIB Software Library (FSL) or Analysis of Functional NeuroImages (AFNI) frameworks for fMRI, the popular MRTRIX library for dMRI, or in FreeSurfer for analysis of structural MRI. In theory, if you wanted to construct an analysis pipeline that glues together operations from different frameworks, you would have to familiarize yourself with the varying ways in which these different frameworks expect their input data to appear and the varying ways in which they produce their output. The complexity can quickly become daunting. Moreover, creating reproducible data analysis programs that put together such pipelines becomes cumbersome and can be very difficult. Fortunately, these challenges have been addressed through the Nipype Python software library. Nipype implements Python interfaces to many neuroimaging software tools (including those we mentioned above). One of Nipype’s appealing features is that it provides a very similar programming interface across these different tools. In addition, it provides the functionality to represent pipelines of operations, where the outputs of one processing step become the inputs to the next processing steps. Running such a pipeline as one program is not only a matter of convenience but also important for reproducibility. By tracking and recording the set of operations that a dataset undergoes through analysis, as well as the parameters that are used in each step, Nipype allows users of the software to report exactly what they did with their data to come to a particular conclusion. In addition, Nipype allows a processing pipeline to be used with other datasets. However, this raises another kind of challenge: to achieve this kind of extensibility, the input data has to be organized in a particular way that the software recognizes. This is where community-developed data standards come in. In the rest of this section, we will discuss the Brain Imaging Data Structure, or BIDS, which is the most commonly-used data standard for organizing neuroimaging data for analysis.

11.2. The Brain Imaging Data Structure (BIDS) standard

In the previous section, we pointed out that Nipype can be used to create reproducible analysis pipelines that can be applied across different datasets. This is true, in principle, but in practice, it also relies on one more idea: that of a data standard. This is because to be truly transferable, an analysis pipeline needs to know where to find the data and metadata that it uses for analysis. Thus, in this section, we will shift our focus to talking about how entire neuroimaging projects are (or should be) laid out. Until recently, datasets were usually organized in an idiosyncratic way. Each researcher had to decide on their own what data organization made sense to them. This made data sharing and re-use quite difficult because if you wanted to use someone else’s data, there was a very good chance you’d first have to spend a few days just figuring out what you were looking at and how you could go about reading the data into whatever environment you were comfortable with.

11.2.1. The BIDS specification

Fortunately, things have improved dramatically in recent years. Recognizing that working with neuroimaging data would be considerably easier if everyone adopted a common data representation standard, a group of (mostly) fMRI researchers convened in 2016 to create something now known as the Brain Imaging Data Standard, or BIDS. BIDS wasn’t the first data standard proposed in fMRI, but it has become by far the most widely adopted. Much of the success of BIDS can be traced to its simplicity: the standard deliberately insists not only on machine readability but also on human readability, which means that a machine can ingest a dataset and do all kinds of machine processing with it, but a human looking at the files can also make sense of the dataset, understanding what kinds of data were collected and what experiments were conducted. While there are some nuances and complexities to BIDS, the core of the specification consists of a relatively simple set of rules a human with some domain knowledge can readily understand and implement.

We won’t spend much time describing the details of the BIDS specification in this book, as there’s already excellent documentation for that on the project’s website. Instead, we’ll just touch on a couple of core principles. The easiest way to understand what BIDS is about is to dive right into an example. Here’s a sample directory structure we’ve borrowed from the BIDS documentation. It shows a valid BIDS dataset that contains just a single subject:


There are two important points to note here. First, the BIDS specification imposes restrictions on how files are organized within a BIDS project directory. For example, every subject’s data goes inside a sub-[id] folder below the project root —- where the sub- prefix is required, and the [id] is a researcher-selected string uniquely identifying that subject within the project ("control01" in the example). And similarly, inside each subject directory, we find subdirectories containing data of different modalities: anat for anatomical images; func for functional images; dwi for diffusion-weighted images; and so on. When there are multiple data collection sessions for each subject, an extra level is introduced to the hierarchy, so that functional data from the first session acquired from subject control01 would be stored inside a folder like sub-control01/ses-01/func.

Second, valid BIDS files must follow particular naming conventions. The precise naming structure of each file depends on what kind of file it is, but the central idea is that a BIDS filename is always made up of (1) a sequence of key-value pairs, where each key is separated from its corresponding value by a dash, and pairs are separated by underscores; (2) a “suffix” that directly precedes the file extension and describes the type of data contained in the file (this comes from a controlled vocabulary, meaning that it can only be one of a few accepted values, such as "bold" or "dwi"); and (3) an extension that defines the file format.

For example, if we take a file like sub-control01/func/sub-control01_task-nback_bold.nii.gz and examine its constituent chunks, we can infer from the filename that the file is a Nifti image (.nii.gz extension) that contains BOLD fMRI data (bold suffix) for task nback acquired from subject control01.

Besides these conventions, there are several other key elements of the BIDS specification. We won’t discuss them in detail, but it’s good to at least be aware of them:

  • Every data file should be accompanied by a JSON “sidecar” containing metadata describing that file. For example, a BOLD data file might be accompanied by a side-car file that describes acquisition parameters, such as repetition time.

  • BIDS follows an “inheritance” principle —- meaning that JSON metadata files higher up in the hierarchy automatically apply to relevant files lower in the hierarchy unless explicitly overridden. For example, if all of the BOLD data in a single dataset was acquired using the same protocol, this metadata need not be replicated in each subject’s data folder.

  • Every project is required to have a dataset_description.json file at the root level that contains basic information about the project (e.g., the name of the dataset and a description of its constituents, as well as citation information).

  • BIDS doesn’t actively prohibit you from including non-BIDS-compliant files in a BIDS project -— so you don’t have to just throw out files that you can’t easily shoehorn into the BIDS format. The downside of including non-compliant files is just that most BIDS tools and/or human users won’t know what to do with them, so your dataset might not be quite as useful as it otherwise would be. BIDS Derivatives

The BIDS specification was originally created with static representations of neuroimaging datasets in mind. But it quickly became clear that it would also be beneficial for the standard to handle derivatives of datasets -— that is, new BIDS datasets generated by applying some transformation to one or more existing BIDS datasets. For example, suppose we have a BIDS dataset containing raw fMRI images. Typically, we’ll want to preprocess our images (for example, to remove artifacts, apply motion correction, temporally filter the signal, etc.) before submitting them to analysis. It’s great if our preprocessing pipeline can take BIDS datasets as inputs, but what should it then do with the output? A naive approach would be to just construct a new BIDS dataset that’s very similar to the original one, but replace the original (raw) fMRI images with new (preprocessed) ones. But that’s likely to confuse: a user could easily end up with many different versions of the same BIDS dataset, yet have no formal way to determine the relationship between them. To address this problem, the BIDS-Derivatives extension introduces some additional metadata and file naming conventions that make it easier to chain BIDS-aware tools (see the next section) without chaos taking hold.

11.2.2. The BIDS Ecosystem

At this point, you might be wondering: what is BIDS good for? Surely the point of introducing a new data standard isn’t just to inconvenience people by forcing them to spend their time organizing their data a certain way? There must be some benefits to individual researchers – and ideally, the community as a whole -— spending precious time making datasets and workflows BIDS-compliant, right? Well, yes, there are! The benefits of buying into the BIDS ecosystem are quite substantial. Let’s look at a few. Easier data sharing and re-use

One obvious benefit we alluded to above is that sharing and re-using neuroimaging data becomes much easier once many people agree to organize their data the same way. As a trivial example, once you know that BIDS organizes data according to a fixed hierarchy (i.e., subject –> session –> run), it’s easy to understand other people’s datasets. There’s no chance of finding time-course images belonging to subject 1 in, say, /imaging/old/NbackTask/rawData/niftis/controlgroup/1/. But the benefits of BIDS for sharing and re-use come into full view once we consider the impact on public data repositories. While neuroimaging repositories have been around for a long time (for an early review, see [Van Horn et al., 2001]), their utility was long hampered by the prevalence of idiosyncratic file formats and project organizations. By supporting the BIDS standard, data repositories open the door to a wide range of powerful capabilities.

To illustrate, consider OpenNeuro – currently the largest and most widely-used repository of brain MRI data. OpenNeuro requires uploaded datasets to be in BIDS format (though datasets do not have to be fully compliant). As a result, the platform can automatically extract, display, and index important metadata. For example, the number of subjects, sessions, and runs in each dataset; the data modalities and experimental tasks present; a standardized description of the dataset; and so on. Integration with free analysis platforms like BrainLife is possible, as is structured querying over datasets via OpenNeuro’s GraphQL API endpoint.

Perhaps most importantly, the incremental effort required by users to make their BIDS-compliant datasets publicly available and immediately usable by others is minimal: in most cases, users have only to click an Upload button and locate the project they wish to share (there is also a command-line interface, for users who prefer to interact with OpenNeuro programmatically). BIDS-Apps

A second benefit to representing neuroimaging data in BIDS is that one immediately gains access to a large, and rapidly growing, ecosystem of BIDS-compatible tools. If you’ve used different neuroimaging tools in your research -— for example, perhaps you’ve tried out both FSL and SPM (the two most widely used fMRI data analysis suites) —- you’ll have probably had to do some work to get your data into a somewhat different format for each tool. In a world without standards, tool developers can’t be reasonably expected to know how to read your particular dataset, so the onus falls on you to get your data into a compatible format. In the worst case, this means that every time you want to use a new tool, you have to do some more work.

By contrast, for tools that natively support BIDS, life is simpler. Once we know that fMRIPrep -— a very popular preprocessing pipeline for fMRI data [Esteban et al., 2019] -— takes valid BIDS datasets as inputs, the odds are very high that we’ll be able to apply fMRIPrep to our own valid BIDS datasets with little or no additional work. To facilitate the development and use of these kinds of tools, BIDS developed a lightweight standard for “BIDS Apps” [Gorgolewski et al., 2017]. A BIDS App is an application that takes one or more BIDS datasets as input. There is no restriction on what a BIDS App can do, or what it’s allowed to output (though many BIDS Apps output BIDS-Derivatives datasets); the only requirement is that a BIDS App is containerized (using Docker or Singularity; see Section 4), and accept valid BIDS datasets as input. New BIDS Apps are continuously being developed, and as of this writing, the BIDS Apps website lists a few dozen apps.

What’s particularly nice about this approach is that it doesn’t necessarily require the developers of existing tools to do a lot of extra work themselves to support BIDS. In principle, anyone can write a BIDS-App “wrapper” that mediates between the BIDS format and whatever format a tool natively expects to receive data in. So, for example, the BIDS-Apps registry already contains BIDS-Apps for packages or pipelines like SPM, CPAC, Freesurfer, and the Human Connectome Project Pipelines. Some of these apps are still fairly rudimentary and don’t cover all of the functionality provided by the original tools, but others support much or most of the respective native tool’s functionality. And of course, many BIDS-Apps aren’t wrappers around other tools at all; they’re entirely new tools designed from the very beginning to support only BIDS datasets. We’ve already mentioned fMRIPrep, which has very quickly become arguably the de facto preprocessing pipeline in fMRI; another widely-used BIDS-App is MRIQC [Esteban et al., 2017], a tool for automated quality control and quality assessment of structural and functional MRI scans, which we will see in action in Section 12. Although the BIDS-Apps ecosystem is still in its infancy, the latter two tools already represent something close to killer applications for many researchers.

To demonstrate this statement, consider how easy it is to run fMRIPrep once your data is organized in the BIDS format. After installing the software and its dependencies, running the software is as simple as issue this command in the terminal:

fmriprep data/bids_root/ out/ participant -w work/

where, data/bids_root points to a directory that contains a BIDS-organized dataset that includes fMRI data, out points to the directory into which the outputs (the BIDS derivatives) will be saved and work is a directory that will store some of the intermediate products that will be generated along the way. Looking at this, it might not be immediately apparent how important BIDS is for this to be so simple, but consider what software would need to do to find all of the fMRI data inside of a complex dataset of raw MRI data, that might contain other data types, other files and so forth. Consider also the complexity that arises from the fact that fMRI data can be collected using many different acquisition protocols, and the fact that fMRI processing sometimes uses other information (for example, measurements of the field map, or anatomical T1-weighted or T2-weighted scans). The fact that the data complies with BIDS allows fMRIPrep to locate everything that it needs with the dataset and to make use of all the information to perform the preprocessing to the best of its ability given the provided data. Utility libraries

Lastly, the widespread adoption of BIDS has also spawned a large number of utility libraries designed to help developers (rather than end users) build their analysis pipelines and tools more efficiently. Suppose I’m writing a script to automate my lab’s typical fMRI analysis workflow. It’s a safe bet that, at multiple points in my script, I’ll need to interact with the input datasets in fairly stereotyped and repetitive ways. For instance, I might need to search through the project directory for all files containing information about event timing, but only for a particular experimental task. Or, I might need to extract some metadata containing key parameters for each time-series image I want to analyze (e.g., the repetition time, or TR). Such tasks are usually not very complicated, but they’re tedious and can slow down development considerably. Worse, at a community level, they introduce massive inefficiency, because each person working on their analysis script ends up writing their own code to solve what are usually very similar problems.

A good utility library abstracts away a lot of this kind of low-level work and allows researchers and analysts to focus most of their attention on high-level objectives. By standardizing the representation of neuroimaging data, BIDS makes it much easier to write good libraries of this sort. Probably the best example so far is a package called PyBIDS, which provides a convenient Python interface for basic querying and manipulation of BIDS datasets. To give you a sense of how much work PyBIDS can save you when you’re writing neuroimaging analysis code in Python, let’s take a look at some of the things the package can do.

We start by importing an object called BIDSLayout, which we will use to manage and query the layout of files on disk. We also import a function that knows how to locate some test data that was installed on our computer together with the PyBIDS software library.

from bids import BIDSLayout
from bids.tests import get_test_data_path

One of the datasets that we have within the test data path is data set number 5 from OpenNeuro. Note that the software has not actually installed into our hard-drive a bunch of neuroimaging data – that would be too large! Instead, the software installed a bunch of files that have the right names and are organized in the right way, but are mostly empty. This allows us to demonstrate the way that the software works, but don’t try reading the neuroimaging data from any of the files in that directory. We’ll work with a more manageable BIDS dataset, including the files in it in Section 12.

For now, we initialize a BIDSLayout object, by pointing to the location of the dataset in our disk. When we do that, the software scans through that part of our file-system, validates that it is a properly organized BIDS dataset, and finds all of the files that are arranged according to the specification. This allows the object to already infer some things about the dataset. For example, the dataset has 16 subjects and 48 total runs (here a “run” is an individual fMRI scan). The person who organized this dataset decided not to include a session folder for each subject. Presumably, because each subject participated in just one session in this experiment, that information is not useful.

layout = BIDSLayout(get_test_data_path() + "/ds005")
BIDS Layout: ...packages/bids/tests/data/ds005 | Subjects: 16 | Sessions: 0 | Runs: 48

The layout object now has a method called get(), which we can use to gain access to various parts of the dataset. For example, we can ask to give us a list of the filenames of all of the anatomical ("T1w") scans that were collected for subjects sub-01 and sub-02

layout.get(subject=['01', '02'],  suffix="T1w", return_type='filename')

Or, using a slightly different logic, all of the functional ("bold") scans collected for subject sub-03

layout.get(subject='03', suffix="bold", return_type='filename')

In these examples, we asked the BIDSLayout object to give us the 'filename' return type. This is because if we don’t explicitly ask for a return type, we will get back a list of BIDSImageFile objects. For example, selecting the first one of these for sub-03’s fMRI scans:

bids_files = layout.get(subject="03", suffix="bold")
bids_image = bids_files[0]

This object is quite useful, of course. For example, it knows how to parse the file name into meaningful entities, using the get_entities() method, which returns a dictionary with entities such as subject and task that can be used to keep track of things in analysis scripts.

{'datatype': 'func',
 'extension': '.nii.gz',
 'run': 1,
 'subject': '03',
 'suffix': 'bold',
 'task': 'mixedgamblestask'}

In most cases, you can also get direct access to the imaging data using the BIDSImageFile object. This object has a get_image method, which would usually return a nibabel Nifti1Image object. As you will see in Section 12 this object lets you extract metadata, or even read the data from a file into memory as a Numpy array. However, in this case, calling the get_image method would raise an error, because, as we mentioned above, the files do not contain any data. So, let’s look at another kind of file that you can read directly in this case. In addition to the neuroimaging data, BIDS provides instructions on how to organize files that record the behavioral events that occurred during an experiment. These are stored as tab-separated-values (‘.tsv’) files, and there is one for each run in the experiment. For example, for this dataset, we can query for the events that happened during subject sub-03’s 3rd run:

events = layout.get(subject='03', extension=".tsv", task="mixedgamblestask", run="03")
tsv_file = events[0]
<BIDSDataFile filename='/opt/hostedtoolcache/Python/3.8.15/x64/lib/python3.8/site-packages/bids/tests/data/ds005/sub-03/func/sub-03_task-mixedgamblestask_run-03_events.tsv'>

Instead of a BIDSImageFile, the variable tsv_file is now a BIDSDataFile object, and this kind of object has a get_df method, which returns a Pandas DataFrame object

bids_df = tsv_file.get_df()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 85 entries, 0 to 84
Data columns (total 12 columns):
 #   Column                      Non-Null Count  Dtype  
---  ------                      --------------  -----  
 0   onset                       85 non-null     float64
 1   duration                    85 non-null     int64  
 2   trial_type                  85 non-null     object 
 3   distance from indifference  0 non-null      float64
 4   parametric gain             85 non-null     float64
 5   parametric loss             0 non-null      float64
 6   gain                        85 non-null     int64  
 7   loss                        85 non-null     int64  
 8   PTval                       85 non-null     float64
 9   respnum                     85 non-null     int64  
 10  respcat                     85 non-null     int64  
 11  RT                          85 non-null     float64
dtypes: float64(6), int64(5), object(1)
memory usage: 8.1+ KB

This kind of functionality is useful if you are planning to automate your analysis over large datasets that can include heterogeneous acquisitions between subjects and within subjects. At the very least, we hope that the examples have conveyed to you the power inherent in organizing your data according to a standard, as a starting point to use, and maybe also develop, analysis pipelines that expect data in this format. We will see more examples of this in practice in the next section. Exercises

BIDS has a set of example datasets available in a GitHub repository at https://github.com/bids-standard/bids-examples. Clone the repository and use pyBIDS to explore the dataset called “ds011”. Using only pyBIDS code, can you figure out how many subjects participated in this study? What are the values of TR that were used in fMRI acquisitions in this dataset?

11.3. Additional resources

There are many places to learn more about the NiPy community. The documentation of each of the software libraries that were mentioned here includes fully worked-out examples of data analysis, in addition to links to tutorials, videos, and so forth. For Nipype, in particular, we recommend Michael Notter’s Nipype tutorial as a good starting point.

BIDS is not only a data standard but also a community of developers and users that support the use and further development of the standard. For example, over the time since the standard was first proposed, the community has added instructions to support the organization and sharing of new data modalities (e.g., intracranial EEG) and derivatives of processing. The strength of such a community is that, like open-source software, it draws on the individual strengths of many individuals and many groups who are willing to spend time and effort evolving and improving it. One of the resources that the community has developed to help people who are new to the standard learn more about it is the BIDS Starter Kit. It is a website that includes materials (videos, tutorials, explanations, and examples) to help you get started learning about and eventually using the BIDS standard.

In addition to these relatively static resources, users of both NiPy software and BIDS can interact with other members of the community through a dedicated questions and answers website called Neurostars. On this website, anyone can ask questions about neuroscience software and get answers from experts who frequent the website. In particular, the developers of many of the projects that we mentioned in this chapter, and many of the people who work on the BIDS standard often answer questions about these projects through this website.