7. Sharing code with others

Collaboration is an important part of science. It’s also fun. In the previous chapters, you incrementally learned how to write code in Python and how to work with this code as you are improving it by debugging, testing, and profiling it. In the next few chapters, you will learn more and more about the various ways that you can write software to analyze neuroimaging data. While you work through these ideas, it would be good to keep in mind how you will work with others to use these ideas in the collaborative work that you will undertake. At the very least, for your own sake, and the sake of reproducible research, you should be prepared to revisit the code that you created and to keep working with it seamlessly over time. The principles that will be discussed here apply to collaborations with others, as well as with your closest collaborator (and at the same time the one that is hardest to reach!): yourself from six months ago. Ultimately, we will also discuss ways in which your code can be used by complete strangers. This would provide the ultimate proof of its reproducibility and also give it more impact than it could have if you were the only one using it.

7.1. What should be shareable?

While we love Jupyter notebooks as a way to prototype code and to present ideas, the notebook format does not, by itself, readily support the implementation of reusable code – functions that can be brought into many different contexts and executed on many different datasets – or code that is easy to test. Therefore, once you are done prototyping an idea for analysis in the notebook, we usually recommend moving your code into Python files from which you can import your code into your work environment (for example, into a notebook). There are many ways to organize your code in a way that will facilitate its use by others. Here, we will advocate for a particular organization that will also facilitate the emergence of reusable libraries of code that you can work on with others, and that follow the conventions of the Python language broadly.

However, not every analysis that you do on your data needs to be easy for others to use. Ideally, if you want your work to be reproducible, it will be possible for others to run your code and get the same results. But this is not what we are talking about here. In the course of your work, you will sometimes find that you are creating pieces of code that are useful for you in more than one place, and may also be useful to others, such as collaborators in your lab or other researchers in your field. These pieces of code deserve to be written and shared in a manner that others can easily adopt into their code. To do so, the code needs to be packaged into a library. Here, we will look at the nuts and bolts of doing that, by way of a simplified example.

7.2. From notebook to module

In the course of our work on the analysis of some MRI data, we wrote the following code in a script or a Jupyter notebook:

from math import pi
import pandas as pd

blob_data = pd.read_csv('./input_data/blob.csv')

blob_radius = blob_data['radius']

blob_area = pi * blob_radius ** 2
blob_circ = 2 * pi * blob_radius

output_data = pd.DataFrame({"area":blob_area, "circ":blob_circ})
output_data.to_csv('./output_data/blob_properties.csv')

Later on in the book (in Section 9) we will see exactly what Pandas does. For now, suffice it to say that it is a Python library that knows how to read data from comma-separated value (csv) files, and how to write this data back out. The math module is a built-in Python module that contains (among many other things) the value of \(\pi\) stored as the variable pi.

Unfortunately, this code is not very reusable, even while the results may be perfectly reproducible (provided the input data is accessible). This is because it mixes file input and output with computations, and different computations with each other (for example, the computation of area and circumference). Good software engineering strives towards modularity and separation of concerns. One part of your code should be doing the calculations, another part should read and munge the data, and yet other functions should visualize the results or produce statistical summaries.

Our first step is to identify the reusable components of this script and to move these components into a module. For example, here the calculations of area and circumference each seem like they could be (separately) useful in many different contexts.

Let’s isolate them and rewrite them as functions:

from math import pi
import pandas as pd


def calculate_area(r):
    area = pi * r ** 2
    return area


def calculate_circ(r):
    circ = 2 * pi * r
    return circ

blob_data = pd.read_csv('./input_data/blob.csv')
blob_radius = blob_data['radius']
blob_area = calculate_area(blob_radius)
blob_circ = calculate_circ(blob_radius)

output_data = pd.DataFrame({"area":blob_area, "circ":blob_circ})
output_data.to_csv('./output_data/blob_properties.csv')

In the next step, we might move these functions out into a separate file – let’s say we call this file geometry.py – and document what they do:

from math import pi

def calculate_area(r):
    """
    Calculates the area of a circle.

    Parameters
    ----------
    r : numerical
        The radius of a circle

    Returns
    -------
    area : numerical
        The calculated area
    """
    area = pi * r ** 2
    return area


def calculate_circ(r):
    """
    Calculates the circumference of a circle.

    Parameters
    ----------
    r : numerical
        The radius of a circle

    Returns
    -------
    circ : float or array
        The calculated circumference
    """
    circ = 2 * pi * r
    return circ

Documenting these functions will help you and others understand how they work. At the very least, a one-sentence description and detailed descriptions of the function’s input parameters and outputs (or returns) are helpful. You might recognize that in this case, the docstrings are carefully written to comply with the numpy docstring guide that we told you about in Section 5.
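One payoff of writing docstrings is that Python's own tooling can display them to anyone who uses the function. The following is a minimal sketch of that idea. It includes a copy of calculate_area so that it runs on its own, and uses the standard library inspect module to retrieve the documentation.

```python
import inspect
from math import pi


def calculate_area(r):
    """
    Calculates the area of a circle.

    Parameters
    ----------
    r : numerical
        The radius of a circle

    Returns
    -------
    area : numerical
        The calculated area
    """
    area = pi * r ** 2
    return area


# inspect.getdoc returns the docstring with its indentation cleaned up.
doc = inspect.getdoc(calculate_area)

# The first line is the one-sentence description of the function.
summary = doc.splitlines()[0]
print(summary)
```

In an interactive session, calling help(calculate_area) displays the same text, and in a Jupyter notebook typing calculate_area? does so as well.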

7.2.1. Importing and using functions

Before we continue to see how we will use the geometry module that we created, we need to know a little bit about what happens when you run an import statement in Python. When you run the import geometry statement, Python starts by looking for a file called geometry.py in your present working directory (or, when you are running a script, in the directory that contains that script).
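If you are ever unsure where Python would load a given name from, you can ask it directly. The sketch below uses importlib.util.find_spec from the standard library. The json module is used only because it is available in every Python installation, and the second name is a made-up one that is assumed not to exist on your machine.

```python
import importlib.util

# find_spec reports where Python would load a module from, without importing it.
spec = importlib.util.find_spec("json")
print(spec.origin)  # the file that "import json" would execute

# A (hypothetical) name that cannot be found yields None rather than an error.
missing = importlib.util.find_spec("no_such_module_for_this_demo")
print(missing)
```

Running the same call with "geometry" from the directory that holds geometry.py should report the path to that file.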

That means that if you saved geometry.py alongside your analysis script, you can now rewrite the analysis script as:

import geometry as geo
import pandas as pd

blob_data = pd.read_csv('./input_data/blob.csv')
blob_radius = blob_data['radius']
blob_area = geo.calculate_area(blob_radius)
blob_circ = geo.calculate_circ(blob_radius)
output_data = pd.DataFrame({"area":blob_area, "circ":blob_circ})
output_data.to_csv('./output_data/blob_properties.csv')

This is already good, because now you can import and reuse these functions across many different analysis scripts without having to copy this code everywhere. You have transitioned this part of your code from a one-off notebook or script to a module. Next, let’s see how you transition from a module to a library.

7.3. From module to package

Creating modular code that can be reused is an important first step, but so far we are limited to using the code in the geometry module only in scripts that are saved alongside this module. The next level of reusability is to create a library, or a package, that can be installed and imported across multiple different projects.

Again, let’s consider what happens when import geometry is called. In addition to a file called geometry.py, Python will also look in the present working directory for a Python package called geometry. What is a Python package? It’s a folder that has a file called __init__.py. This can be imported just like a module, so if the folder is in your present working directory, importing it will execute the code in __init__.py. For example, if you were to put the functions you previously had in geometry.py in geometry/__init__.py you could import them from there, so long as you are working in the directory that contains the geometry directory.

More typically, a package might contain different modules that each have some code. For example, we might organize the package like this:

    .
    └── geometry
        ├── __init__.py
        └── circle.py

The code that we previously had in geometry.py is now in the circle.py module of the geometry package. To make the names in circle.py available to us we can import them explicitly like this:

from geometry import circle
circle.calculate_area(blob_radius)

Or we can have the __init__.py file import them for us, by adding this code to the __init__.py file:

from .circle import calculate_area, calculate_circ

This way, we can import our functions like this:

from geometry import calculate_area

This also means that if we decide to add more modules to the package, the __init__.py file can manage all the imports from these modules. It can also perform other operations that you might want to do whenever you import the package. Now that you have your code in a package, you’ll want to install the code on your machine, so that you can import the code from anywhere on your machine (not only from this particular directory) and eventually also so that others can easily install it and run it on their machines.
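To see the role of __init__.py in action without touching your own files, the following sketch builds a throwaway package in a temporary directory and imports it. The package name geometry_demo_pkg is hypothetical, chosen so that it cannot clash with a real geometry package. The temporary directory is added to Python's list of import locations (sys.path) only for the duration of the demonstration.

```python
import importlib
import sys
import tempfile
from pathlib import Path

# Hypothetical package name, used only for this demonstration.
PKG_NAME = "geometry_demo_pkg"

CIRCLE_SOURCE = '''
from math import pi


def calculate_area(r):
    return pi * r ** 2


def calculate_circ(r):
    return 2 * pi * r
'''

# The same import line that the text suggests adding to __init__.py.
INIT_SOURCE = "from .circle import calculate_area, calculate_circ\n"

with tempfile.TemporaryDirectory() as tmp_dir:
    pkg_dir = Path(tmp_dir) / PKG_NAME
    pkg_dir.mkdir()
    (pkg_dir / "circle.py").write_text(CIRCLE_SOURCE)
    (pkg_dir / "__init__.py").write_text(INIT_SOURCE)

    # Make the temporary directory importable for this demonstration only.
    sys.path.insert(0, tmp_dir)
    try:
        importlib.invalidate_caches()
        pkg = importlib.import_module(PKG_NAME)
        # Both access routes lead to the very same function object.
        same_function = pkg.calculate_area is pkg.circle.calculate_area
        area_radius_2 = pkg.calculate_area(2)
    finally:
        sys.path.remove(tmp_dir)

print(same_function)
print(area_radius_2)
```

Because __init__.py runs when the package is imported, the functions are reachable both through the circle module and directly from the package.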

To do so, we need to understand one more thing about the import statement. If import cannot find a module or package locally in the present working directory, it will proceed to look for this name somewhere in the Python path. The Python path is a list of file system locations that Python uses to search for packages and modules to import. You can see it (and manipulate it!) through Python’s built-in sys library.

import sys
print(sys.path)
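As a small illustration of how this list controls imports, the sketch below writes a one-line module into a temporary directory, shows that it cannot be imported at first, and then imports it after the directory has been appended to sys.path. The module name path_demo_module is made up for this example.

```python
import importlib
import sys
import tempfile
from pathlib import Path

# Hypothetical module name, used only for this demonstration.
MODULE_NAME = "path_demo_module"

with tempfile.TemporaryDirectory() as tmp_dir:
    (Path(tmp_dir) / f"{MODULE_NAME}.py").write_text("ANSWER = 42\n")

    # The directory is not on sys.path yet, so the import fails.
    try:
        importlib.import_module(MODULE_NAME)
        found_before = True
    except ModuleNotFoundError:
        found_before = False

    # After appending the directory, the same import succeeds.
    sys.path.append(tmp_dir)
    try:
        importlib.invalidate_caches()
        module = importlib.import_module(MODULE_NAME)
        answer = module.ANSWER
    finally:
        # Undo the change so the demonstration leaves no trace.
        sys.path.remove(tmp_dir)

print(found_before, answer)
```

Editing sys.path by hand like this is fine for exploration, but it has to be repeated in every session, which is one reason to prefer a proper installation.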

If we want to be able to import our library regardless of where in our file system we happen to be working, we need to copy the code into one of the file system locations that are stored in this variable. But not so fast! To avoid making a mess of our file system, let’s instead let Python do this for us. The setuptools library is intended specifically for packaging and setup of this kind. It is not part of the Python standard library, but it is a widely used third-party library maintained by the Python Packaging Authority. The main instrument for setuptools operations is a file called setup.py, which we will look at next.

7.4. The setup file

By the time you reach the point where you want to use the code that you have written across multiple projects, or share it with others for them to use in their projects, you will want to also organize the files in a separate directory devoted to your library:

    .
    └── geometry
        ├── geometry
        │   ├── __init__.py
        │   └── circle.py
        └── setup.py

Notice that we have two directories called geometry: The top-level directory contains both our Python package (in this case, the geometry package) as well as other files that we will use to organize our project. For example, the file called setup.py is saved in the top-level directory of our library. This is a file that we use to tell Python how to set our software up and how to install it. Within this file, we rely on the setuptools library to do a lot of the work. The main thing that we will need to do is to provide setuptools with some metadata about our software and some information about the available packages within our software.

For example, here is a minimal setup file:

from setuptools import setup, find_packages

with open("README.md", "r") as fh:
    long_description = fh.read()

setup(
    name="geometry",
    version="0.0.1",
    author="Ariel Rokem",
    author_email="author@example.com",
    description="Calculating geometric things",
    long_description=long_description,
    long_description_content_type="text/markdown",
    url="https://github.com/arokem/geometry",
    packages=find_packages(),
    classifiers=[
        "Programming Language :: Python :: 3",
        "License :: OSI Approved :: MIT License",
        "Operating System :: OS Independent",
        "Intended Audience :: Science/Research",
        "Topic :: Scientific/Engineering"
    ],
    python_requires='>=3.8',
    install_requires=["pandas"]
)

The core of this file is a call to a function called setup. This function takes many different arguments, and the resulting script accepts several commands when it is run from the command line. One of these commands is install, which takes all the steps needed to properly install the software into your Python path. This means that once you are done writing this file and organizing the files and folders in your Python library in the right way, you can call:

$ python setup.py install

to install the library into your Python path, in such a way that calling import geometry from anywhere in your filesystem will be able to find this library and use the functions stored within it. Note that recent versions of setuptools deprecate running setup.py directly in this way. Running pip install . in the same directory achieves the same result and is the currently recommended approach. Next, let’s look at the contents of the file section by section.
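Once a library has been installed, one way to confirm that Python knows about it is to query its metadata. The following sketch uses importlib.metadata from the standard library (available in Python 3.8 and later). The helper installed_version is a hypothetical convenience function written for this example, not part of any library.

```python
from importlib import metadata


def installed_version(distribution_name):
    """Return the installed version string, or None if it is not installed."""
    try:
        return metadata.version(distribution_name)
    except metadata.PackageNotFoundError:
        return None


# "geometry" is the name passed to setup() above. This prints None unless
# a distribution with that name is installed in the current environment.
print(installed_version("geometry"))
```

After a successful installation of the example project, this would be expected to print the version string given in the setup file.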

7.4.1. Contents of a setup.py file

The first thing that happens in the setup.py (after the import statements at the top) is that a long_description is read from a README file. If you are using GitHub to track the changes in your code and to collaborate with others (as described in Section 3), it is a good idea to use the markdown format for this. This is a text-based format that uses the .md extension. GitHub knows how to render these files as nice-looking web pages, so your README file will serve multiple different purposes. Let’s write something informative in this file. For the geometry project it can be something like this:

# geometry

This is a library of functions for geometric calculations.

# Contributing

We welcome contributions from the community. Please create a fork of the
project on GitHub and use a pull request to propose your changes. We strongly encourage creating
an issue before starting to work on major changes, to discuss these changes first.

# Getting help

Please post issues on the project GitHub page.

The second thing that happens is a call to the setup function. The function takes several keyword arguments. The first few are general metadata about the software:

name="geometry",
author="Ariel Rokem",
author_email="author@example.com",
description="Calculating geometric things",
long_description=long_description,

The next one makes sure that the long description gets properly rendered in web pages describing the software (for example in the Python Package Index, PyPI; more about that below).

long_description_content_type="text/markdown",

Another kind of metadata is classifiers, which are used to catalog the software within PyPI so that interested users can more easily find it:

classifiers=[
    "Programming Language :: Python :: 3",
    "License :: OSI Approved :: MIT License",
    "Operating System :: OS Independent",
    "Intended Audience :: Science/Research",
    "Topic :: Scientific/Engineering"
],

Note in particular the license classifier. If you intend to share the software with others, please provide a license that defines how others can use your software. You don’t have to make anything up. Unless you have a good understanding of the legal implications of the license, it is best to use a standard OSI-approved license. If you are interested in publicly providing the software in a manner that would allow anyone to do whatever they want with the software, including in commercial applications, the MIT license is not a bad way to go.

The next item is the version of the software. It is a good idea to use the semantic versioning conventions, which communicate to potential users how stable the software is, and whether changes have happened that dramatically change how the software operates:

version="0.0.1",
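To make the idea behind semantic versioning a little more concrete: a version is read as MAJOR.MINOR.PATCH, and an increase in the MAJOR number signals changes that may break existing code. The sketch below is a simplified illustration that assumes plain three-part version strings. The helper names parse_version and is_breaking_change are made up for this example, and pre-release labels are not handled.

```python
def parse_version(version):
    """Split a plain "MAJOR.MINOR.PATCH" string into a tuple of integers."""
    major, minor, patch = (int(part) for part in version.split("."))
    return major, minor, patch


def is_breaking_change(old, new):
    """A higher MAJOR number signals changes that may be incompatible."""
    return parse_version(new)[0] > parse_version(old)[0]


print(parse_version("0.0.1"))

# Tuples compare element by element, so 0.10.0 is correctly newer than 0.9.3.
# Comparing the raw strings would get this wrong.
print(parse_version("0.10.0") > parse_version("0.9.3"))

print(is_breaking_change("1.4.2", "2.0.0"))
```

Note that under the semantic versioning conventions, versions with a MAJOR number of 0, such as the 0.0.1 used here, mark software that is still in initial development, where anything may change.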

The next item points to a URL for the software, for example, the GitHub repository for the software.

url="https://github.com/arokem/geometry",

The next item calls a setuptools function that automatically traverses the filesystem in this directory and finds the packages and sub-packages that we have created.

packages=find_packages(),

If you would rather avoid that, you can also explicitly write out the names of the packages that you would like to install as part of the software. For example, in this project, it could be:

packages=['geometry']

The last two items define the dependencies of the software. The first is the version of Python that is required for the software to run properly. The other is a list of other libraries that are imported within our software and that need to be installed before we can install and use our software (that is, they are not part of the Python standard library). In this case, the list contains only the Pandas library. Though we haven’t started using it in our library code yet, we might foresee a need for it at a later stage – for example, when we add code that reads and writes data files. For now, we have added it here as an example.

python_requires='>=3.8',
install_requires=["pandas"]

7.5. A complete project

At this point, our project is starting to take shape. The filesystem of our library should look something like this:

    .
    └── geometry
        ├── LICENSE
        ├── README.md
        ├── geometry
        │   ├── __init__.py
        │   └── circle.py
        └── setup.py

In addition to the setup.py file, we’ve added the README.md file, as well as a LICENSE file that contains the license we have chosen to apply to our code. At this point, the project has everything it needs to have for us to share it widely and for others to start using it. We can add all these files and then push this into a repository on GitHub (see Section 3). Congratulations! You have created a software project that can easily be installed on your machine, and on other computers. What next? There are a few further steps you can take.

7.5.1. Testing and continuous integration

We already looked at software testing in Section 6. As we mentioned there, tests are particularly useful if they are automated and run repeatedly. In the context of a well-organized Python project, that can be achieved by including a test module for every package in the library. For example, we might add a tests package within our geometry package:

    .
    └── geometry
        ├── LICENSE
        ├── README.md
        ├── geometry
        │   ├── __init__.py
        │   ├── tests
        │   │   ├── __init__.py
        │   │   └── test_circle.py
        │   └── circle.py
        └── setup.py

Here, the __init__.py file in the tests folder is an empty file, signaling that the tests folder is a package as well, and the test_circle.py file may contain a simple set of functions for testing different aspects of the code. For example, the code:

from geometry.circle import calculate_area
from math import pi

def test_calculate_area():
    assert calculate_area(1) == pi

will test that the calculate_area function does the right thing for some well-understood input.
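As the test file grows, it may be worth covering more than one input and comparing floating point results with a tolerance, because exact equality can fail for inputs where rounding errors creep in. The following is one possible sketch of such a file. It includes copies of the two functions so that it runs on its own, whereas a real test_circle.py would import them from geometry.circle. The use of math.isclose is our suggestion here, not something required by the testing tools.

```python
from math import isclose, pi


# Copies of the functions from circle.py, so that this sketch runs on its own.
def calculate_area(r):
    return pi * r ** 2


def calculate_circ(r):
    return 2 * pi * r


def test_calculate_area():
    # isclose allows for tiny floating point rounding differences.
    assert isclose(calculate_area(2), 4 * pi)
    assert calculate_area(0) == 0


def test_calculate_circ():
    assert isclose(calculate_circ(0.5), pi)
    # Doubling the radius should double the circumference.
    assert isclose(calculate_circ(3), 2 * calculate_circ(1.5))


# A test harness would call these for us. Here we call them by hand.
test_calculate_area()
test_calculate_circ()
```

If any of the assertions were false, the corresponding call would raise an AssertionError.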

To reduce the friction that might prevent us from running the tests often, we can take advantage of systems that automate the running of tests as much as possible. The first step is to use software that runs the tests for us. These are sometimes called “test harnesses”. One popular test harness for Python is Pytest. When it is called within the source code of your project, the Pytest test harness identifies functions that are software tests by looking for them in files whose names start with test_ or end with _test.py, and by the fact that the functions themselves are named with names that start with test_. It runs these functions and keeps track of the functions that pass the test – do not raise errors – and those that fail – do raise errors.
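To give a feel for what a test harness does, here is a toy sketch of the discovery-and-run idea described above. It is not how Pytest is implemented, only an illustration of the naming convention. The function run_tests and the example tests are hypothetical names invented for this sketch.

```python
def run_tests(namespace):
    """Run every callable whose name starts with "test_" and record outcomes."""
    results = {}
    for name, obj in sorted(namespace.items()):
        if name.startswith("test_") and callable(obj):
            try:
                obj()
                results[name] = "passed"
            except Exception:
                # Any raised error, including a failed assert, counts as a failure.
                results[name] = "failed"
    return results


def make_example_namespace():
    """Build a small collection of functions to run the toy harness on."""

    def test_addition():
        assert 1 + 1 == 2

    def test_broken():
        assert 1 + 1 == 3  # deliberately wrong, to show a failure

    def helper():
        # Not named test_*, so the harness ignores it.
        raise RuntimeError("this is never called")

    return {
        "test_addition": test_addition,
        "test_broken": test_broken,
        "helper": helper,
    }


results = run_tests(make_example_namespace())
print(results)
```

A real test harness adds a great deal on top of this, such as informative failure reports, but the core bookkeeping of passes and failures follows the same pattern.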

Another approach that can automate your testing even more is called “continuous integration”. In this approach, the system that keeps track of versions of your code – for example, the GitHub website – also automatically runs all of the tests that you wrote every time that you make changes to the code. This is a powerful approach in tandem with the collaborative tools that we described in Section 3 because it allows you to identify the exact set of changes that changed a passing test into a failing test and to alert you to this fact. In the collaborative pattern that uses pull requests, the tests can be run on the code before it is ever integrated into the main branch, allowing contributors to fix changes that cause test failures before they are merged. Continuous integration is implemented in GitHub through a system called “GitHub Actions”. We will not go further into the details of this system here, but you can learn about it through the online documentation.

7.5.2. Documentation

If you follow the instructions we provided above, you will have already made the first step towards documenting your code, by including docstrings in your function definitions. A further step is to write more detailed documentation and make the documentation available together with your software. A system that is routinely used across the Python universe is Sphinx. It is a rather complex system for generating documentation in many different formats, including a PDF manual, but also a neat-looking website that includes your docstrings and other pages you can write. Creating such a website can be a worthwhile effort, if you are interested in making your software easier for others to find and to use.

7.6. Summary

If you go through the work described in this section, making the software that you write for your use easy to install and openly available, you will make your work easier to reproduce, and also easier to extend. Other people might start using it. Inevitably, some of them might run into bugs and issues with the software. Some of them might even contact you to ask for help with the software – either through the Issues section of your GitHub repository or via email. This could lead to fruitful collaborations with other researchers who use your software. On the one hand, it might be a good idea to support the use of your software. For one, one of our goals as scientists is to have an impact on the understanding of the universe and the improvement of the human condition, and software that is supported is more likely to have such an impact. Furthermore, with time, users can become developers of the software, initially by helping you expose errors that may exist in the code, and ultimately by contributing new features. Some people have made careers out of building and supporting a community of users and developers around software that they write and maintain. On the other hand, that might not be your interest or your purpose, and it does take time from other things. Either way, you can use the README of your code to communicate about the level of support that others might expect. It’s perfectly fine to let people know that the software is provided openly, but that it is provided with no assurance of any support.

7.6.1. Software citation and attribution

Researchers in neuroimaging are used to the idea that when we rely on a paper for the findings and ideas in it, we cite it in our research. We are perhaps less accustomed to the notion that software that we use in our research should also be cited. In recent years, there has been an increased effort to provide ways for researchers to cite software, and also for researchers who share their software to be cited. One way to make sure that others can cite your software is to make sure that your software has a Digital Object Identifier (or DOI). A DOI is a string of letters and numbers that uniquely identifies a specific digital object, such as an article or dataset. It is used to keep track of this object and identify it, even if the web address to the object may change. Many journals require that a DOI be assigned to a digital object before that object can be cited so that the thing that is cited can be found even after some time. This means that to make your software citable, you will have to mint a DOI for it. One way to do that is through a service administered by the European Organization for Nuclear Research (CERN) called Zenodo. Zenodo allows you to upload digital objects – the code of a software library, for example – into the website, and then provides a DOI for them. It even integrates with the GitHub website to automatically provide a separate DOI for every version of a software project. This is also a reminder that when you are citing a software library that you use in your research, you should refer to the specific version of the software that you are using, so that others reading your article can reproduce your work.

7.7. Additional resources

Packaging and distributing Python code involves putting together some rather complex technical pieces. In response to some of the challenges, the Python community put together the Python Packaging Authority (PyPA) website, which explains how to package and distribute Python code. Their website is a good resource to help you understand some of the machinery that we explained above, and also to update it with the most recent best practices.

In Section 4.1 you learned about the conda package manager. A great way to distribute scientific software using conda is provided through the Conda Forge project, which supports members of the community, by providing some guidance and recipes to distribute software using conda.

Jake Vanderplas wrote a very useful blog post on the topic of scientific software licensing.

Developing an open-source software project can also become a complex social, technical, and even legal challenge. A (free!) book you might want to look at if you are contemplating taking it on is Producing Open Source Software by Karl Fogel, which will take you through everything from naming an open-source software project to legal and management issues such as licensing, distribution, and intellectual property rights.