7. Sharing code with others
Collaboration is an important part of science. It’s also fun. In the previous chapters, you incrementally learned how to write code in Python and how to work with this code as you are improving it by debugging, testing, and profiling it. In the next few chapters, you will learn more and more about the various ways that you can write software to analyze neuroimaging data. While you work through these ideas, it would be good to keep in mind how you will work with others to use these ideas in the collaborative work that you will undertake. At the very least, for your own sake, and the sake of reproducible research, you should be prepared to revisit the code that you created and to keep working with it seamlessly over time. The principles that will be discussed here apply to collaborations with others, as well as with your closest collaborator (and at the same time the one that is hardest to reach!): yourself from six months ago. Ultimately, we will also discuss ways in which your code can be used by complete strangers. This would provide the ultimate proof of its reproducibility and also give it more impact than it could have if you were the only one using it.
7.2. From notebook to module
In the course of our work on the analysis of some MRI data, we wrote the following code in a script or a Jupyter notebook:
from math import pi
import pandas as pd
blob_data = pd.read_csv('./input_data/blob.csv')
blob_radius = blob_data['radius']
blob_area = pi * blob_radius ** 2
blob_circ = 2 * pi * blob_radius
output_data = pd.DataFrame({"area":blob_area, "circ":blob_circ})
output_data.to_csv('./output_data/blob_properties.csv')
Later on in the book (in Section 9) we will see exactly what Pandas does. For now, suffice it to say that it is a Python library that knows how to read data from comma-separated value (csv) files, and how to write this data back out. The math module is a built-in Python module that contains (among many other things) the value of \(\pi\) stored as the variable pi.
Unfortunately, this code is not very reusable, even though the results may be perfectly reproducible (provided the input data is accessible). This is because it mixes file input and output with computations, and different computations with each other (for example, the computation of area and circumference). Good software engineering strives towards modularity and separation of concerns. One part of your code should do the calculations, another part should read and munge the data, and yet other functions should visualize the results or produce statistical summaries.
Our first step is to identify the reusable components of this script and to move these components into a module. For example, here the calculations of area and circumference seem like they could each be (separately) useful in many different contexts.
Let’s isolate them and rewrite them as functions:
from math import pi
import pandas as pd

def calculate_area(r):
    area = pi * r ** 2
    return area

def calculate_circ(r):
    circ = 2 * pi * r
    return circ

blob_data = pd.read_csv('./input_data/blob.csv')
blob_radius = blob_data['radius']
blob_area = calculate_area(blob_radius)
blob_circ = calculate_circ(blob_radius)
output_data = pd.DataFrame({"area":blob_area, "circ":blob_circ})
output_data.to_csv('./output_data/blob_properties.csv')
In the next step, we might move these functions out into a separate file (let’s say we call this file geometry.py) and document what they do:
from math import pi

def calculate_area(r):
    """
    Calculates the area of a circle.

    Parameters
    ----------
    r : numerical
        The radius of a circle

    Returns
    -------
    area : numerical
        The calculated area
    """
    area = pi * r ** 2
    return area

def calculate_circ(r):
    """
    Calculates the circumference of a circle.

    Parameters
    ----------
    r : numerical
        The radius of a circle

    Returns
    -------
    circ : float or array
        The calculated circumference
    """
    circ = 2 * pi * r
    return circ
Documenting these functions will help you and others understand how the functions work. At the very least, it is helpful to have a one-sentence description and detailed descriptions of each function’s input parameters and its outputs or returns. You might recognize that in this case, the docstrings are carefully written to comply with the numpy docstring guide that we told you about in Section 5.
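One practical payoff of docstrings is that Python can show them to a user on request. The following is a minimal sketch, assuming the documented calculate_area function from above; it uses the standard library's inspect module to retrieve the cleaned-up docstring. The variable names doc and summary are only for illustration.

```python
import inspect
from math import pi


def calculate_area(r):
    """
    Calculates the area of a circle.

    Parameters
    ----------
    r : numerical
        The radius of a circle

    Returns
    -------
    area : numerical
        The calculated area
    """
    area = pi * r ** 2
    return area


# inspect.getdoc returns the docstring with its indentation cleaned up
doc = inspect.getdoc(calculate_area)

# The first line is the one-sentence description
summary = doc.splitlines()[0]
print(summary)
```

In an interactive session, calling help(calculate_area) displays the same text.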
7.2.1. Importing and using functions
Before we continue to see how we will use the geometry module that we created, we need to know a little bit about what happens when you call import statements in Python. When you call the import geometry statement, Python starts by looking for a file called geometry.py in your present working directory.
That means that if you saved geometry.py alongside your analysis script, you can now rewrite the analysis script as:
import geometry as geo
import pandas as pd
blob_data = pd.read_csv('./input_data/blob.csv')
blob_radius = blob_data['radius']
blob_area = geo.calculate_area(blob_radius)
blob_circ = geo.calculate_circ(blob_radius)
output_data = pd.DataFrame({"area":blob_area, "circ":blob_circ})
output_data.to_csv('./output_data/blob_properties.csv')
This is already good because now you can import and reuse these functions across many different analysis scripts without having to copy this code everywhere. You have transitioned this part of your code from a one-off notebook or script to a module. Next, let’s see how you transition from a module to a library.
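The separation of concerns discussed earlier can also be applied to the input and output steps of the analysis script. The following is a rough sketch, not the approach taken in the rest of this chapter: it swaps in the standard library's csv module instead of Pandas so that it runs without extra installations, and the function names read_radii, compute_properties, and write_properties are hypothetical. In-memory text buffers stand in for the input and output files.

```python
import csv
import io
from math import pi


def calculate_area(r):
    return pi * r ** 2


def calculate_circ(r):
    return 2 * pi * r


def read_radii(file_obj, column="radius"):
    """Input only: read one numeric column from CSV text."""
    return [float(row[column]) for row in csv.DictReader(file_obj)]


def compute_properties(radii):
    """Computation only: no file access happens here."""
    return [{"area": calculate_area(r), "circ": calculate_circ(r)}
            for r in radii]


def write_properties(rows, file_obj):
    """Output only: write the computed rows as CSV text."""
    writer = csv.DictWriter(file_obj, fieldnames=["area", "circ"])
    writer.writeheader()
    writer.writerows(rows)


# In-memory stand-ins for the input and output files
source = io.StringIO("radius\n1.0\n2.0\n")
target = io.StringIO()

rows = compute_properties(read_radii(source))
write_properties(rows, target)
print(target.getvalue())
```

Because each function does only one thing, the computation can be tested without touching any files, and the reading and writing steps can be replaced without changing the calculations.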
7.3. From module to package
Creating modular code that can be reused is an important first step, but so far we are limited to using the code in the geometry module only in scripts that are saved alongside this module. The next level of reusability is to create a library, or a package, that can be installed and imported across multiple different projects.
Again, let’s consider what happens when import geometry is called. If there is no file called geometry.py in the present working directory, the next thing that Python will look for is a Python package called geometry. What is a Python package? It’s a folder that has a file called __init__.py. This can be imported just like a module, so if the folder is in your present working directory, importing it will execute the code in __init__.py. For example, if you were to put the functions you previously had in geometry.py in geometry/__init__.py you could import them from there, so long as you are working in the directory that contains the geometry directory.
More typically, a package might contain different modules that each have some code. For example, we might organize the package like this:
.
└── geometry
    ├── __init__.py
    └── circle.py
The code that we previously had in geometry.py is now in the circle.py module of the geometry package. To make the names in circle.py available to us we can import them explicitly like this:
from geometry import circle
circle.calculate_area(blob_radius)
Or we can have the __init__.py file import them for us, by adding this code to the __init__.py file:
from .circle import calculate_area, calculate_circ
This way, we can import our functions like this:
from geometry import calculate_area
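To see both import styles working end to end, here is a small self-contained sketch. It builds the geometry package in a temporary directory and temporarily adds that directory to the list of places Python searches, so that it can be imported from wherever this code is run. The temporary directory and the path manipulation are assumptions made only for this demonstration; they stand in for working inside the directory that contains the package.

```python
import importlib
import sys
import tempfile
from pathlib import Path

# Contents of geometry/circle.py (docstrings omitted for brevity)
circle_source = '''
from math import pi

def calculate_area(r):
    return pi * r ** 2

def calculate_circ(r):
    return 2 * pi * r
'''

# Contents of geometry/__init__.py: re-export the names from circle.py
init_source = "from .circle import calculate_area, calculate_circ\n"

with tempfile.TemporaryDirectory() as tmp:
    package_dir = Path(tmp) / "geometry"
    package_dir.mkdir()
    (package_dir / "circle.py").write_text(circle_source)
    (package_dir / "__init__.py").write_text(init_source)

    # Make the temporary directory searchable, for this demonstration only
    sys.path.insert(0, tmp)
    try:
        importlib.invalidate_caches()
        geometry = importlib.import_module("geometry")
        # Available directly on the package, thanks to __init__.py
        area = geometry.calculate_area(1)
        # Also available through the circle module
        via_module = geometry.circle.calculate_circ(1)
    finally:
        sys.path.remove(tmp)

print(area, via_module)
```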
This also means that if we decide to add more modules to the package, the __init__.py file can manage all the imports from these modules. It can also perform other operations that you might want to do whenever you import the package. Now that you have your code in a package, you’ll want to install the code on your machine, so that you can import the code from anywhere on your machine (not only from this particular directory) and eventually also so that others can easily install it and run it on their machines.
To do so, we need to understand one more thing about the import statement. If import cannot find a module or package locally in the present working directory, it will proceed to look for this name somewhere in the Python path. The Python path is a list of file system locations that Python uses to search for packages and modules to import. You can see it (and manipulate it!) through Python’s built-in sys library:
import sys
print(sys.path)
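Beyond printing the whole path, it can be useful to ask Python where a particular import would come from. The sketch below uses importlib.util.find_spec from the standard library for that; the helper name locate is hypothetical, and the exact file locations it reports will differ from one machine to another.

```python
import importlib.util
import sys


def locate(name):
    """Return the file that `import name` would load, or None if not found."""
    spec = importlib.util.find_spec(name)
    if spec is None:
        return None
    return spec.origin


# A standard library package: found in one of the sys.path locations
print(locate("json"))

# A name that cannot be found anywhere on the path
print(locate("a_module_that_does_not_exist"))

print(len(sys.path), "locations are searched")
```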
If we want to be able to import our library regardless of where in our file system we happen to be working, we need to copy the code into one of the file system locations that are stored in this variable.
But not so fast! To avoid making a mess of our file system, let’s instead let
Python do this for us.
The setuptools library is intended specifically for packaging and setup of this kind. It is not part of the Python standard library, but it is a widely used third-party package that tools like pip can fetch automatically when they build your code. The main instrument for setuptools operations is a file called pyproject.toml, which we will look at next.
7.4. The pyproject file
By the time you reach the point where you want to use the code that you have written across multiple projects or share it with others for them to use in their projects, you will want to also organize the files in a separate directory devoted to your library. The recommended way of organizing your Python library is:
.
└── geometry
    ├── LICENSE
    ├── README.md
    ├── pyproject.toml
    ├── src
    │   └── geometry
    │       ├── __init__.py
    │       └── circle.py
    └── tests/
Notice that we have two directories called geometry: the top-level directory contains the source for our Python library (in this case, just the geometry package) in a folder called src, as well as other files that we will use to organize our project. For example, the file called pyproject.toml is saved in the top-level directory of our library. This is a file that we use to tell Python how to set our software up and how to install it. Within this file, we define which tool should build our software (a “backend”), as well as metadata about our software and some information about the available packages within our software so that build tools (“frontends”) know how to go about installing it.
For example, here is a minimal pyproject.toml file:
[build-system]
requires = ["setuptools>=61.0"]
build-backend = "setuptools.build_meta"

[project]
name = "geometry"
authors = [
    { name="Ariel Rokem", email="arokem@gmail.com" },
]
description = "Calculating geometric things"
readme = "README.md"
classifiers = [
    "Programming Language :: Python :: 3",
    "License :: OSI Approved :: MIT License",
    "Operating System :: OS Independent",
    "Intended Audience :: Science/Research",
    "Topic :: Scientific/Engineering"
]
requires-python = ">=3.10"
dependencies = [
    "pandas==2.2.2"
]
version = "0.0.1"

[project.urls]
Homepage = "https://github.com/arokem/geometry"
Issues = "https://github.com/arokem/geometry/issues"

[tool.setuptools.packages.find]
where = ["src"]
include = ["geometry*"]
exclude = ["tests*"]
Once this file is in place, we can use a build tool called “pip” (the “package installer for Python”) to install the library on our machine in such a way that Python would be able to find it. There are a few ways of using pip, but we recommend the following:
$ python -m pip install .
This will install the library into your Python path, in such a way that calling import geometry from anywhere in your filesystem will be able to find this library and use the functions stored within it. Next, let’s look at the contents of the file section by section.
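After installing, one way to check that Python can see the library is to ask for its installed version. This is a minimal sketch using importlib.metadata from the standard library; the helper name installed_version is hypothetical, and what it prints depends on whether the geometry library has been installed in the environment where you run it.

```python
from importlib import metadata


def installed_version(distribution_name):
    """Return the installed version string, or None if it is not installed."""
    try:
        return metadata.version(distribution_name)
    except metadata.PackageNotFoundError:
        return None


# Prints the version recorded at install time, or None if not installed
print(installed_version("geometry"))
```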
7.4.1. Contents of a pyproject.toml file
The first thing that happens in the pyproject.toml file is the definition of the build system that we will use:
[build-system]
requires = ["setuptools>=61.0"]
build-backend = "setuptools.build_meta"
Next are a few vital stats, also known as metadata, about the library:
[project]
name = "geometry"
authors = [
{ name="Ariel Rokem", email="arokem@gmail.com" },
]
description = "Calculating geometric things"
Next, we point to a README file that contains a bit more information about the project:
readme = "README.md"
If you are using GitHub to track the changes in your code and to collaborate with others (as described in Section 3), it is a good idea to use the markdown format for this. This is a text-based format that uses the .md extension. GitHub knows how to render these files as nice-looking web pages, so your README file will serve multiple different purposes. Let’s write something informative in this file. For the geometry project it can be something like this:
# geometry
This is a library of functions for geometric calculations.
# Contributing
We welcome contributions from the community. Please create a fork of the
project on GitHub and use a pull request to propose your changes. We strongly encourage creating
an issue before starting to work on major changes, to discuss these changes first.
# Getting help
Please post issues on the project GitHub page.
Next, we define the dependencies of the software. The first is the version of Python that is required for the software to run properly. As of the writing of these lines (in July 2024), Python 3.10 is a good minimal version to work with (although versions as disparate as 3.8 and 3.12 could work). The other is a list of other libraries that are imported within our software and that need to be installed before we can install and use our software (that is, they are not part of the Python standard library). In this case, the list contains only the Pandas library. Though we haven’t started using it in our library code yet, we might foresee a need for it at a later stage – for example, when we add code that reads and writes data files. For now, we have added it here as an example. It is a good idea to specify exactly which version of other libraries is required.
requires-python = ">=3.10"
dependencies = [
"pandas==2.2.2"
]
Another kind of metadata is classifiers that are used to catalog the software within PyPI so that interested users can more easily find it:
classifiers = [
    "Programming Language :: Python :: 3",
    "License :: OSI Approved :: MIT License",
    "Operating System :: OS Independent",
    "Intended Audience :: Science/Research",
    "Topic :: Scientific/Engineering"
]
Note in particular the license classifier. If you intend to share the software with others, please provide a license that defines how others can use your software. You don’t have to make anything up. Unless you have a good understanding of the legal implications of the license, it is best to use a standard OSI-approved license. If you are interested in publicly providing the software in a manner that would allow anyone to do whatever they want with the software, including in commercial applications, the MIT license is not a bad way to go.
The next item is the version of the software. It is a good idea to use the semantic versioning conventions, which communicate to potential users how stable the software is, and whether changes have happened that dramatically change how the software operates:
version = "0.0.1"
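To make the convention concrete: a semantic version has the form MAJOR.MINOR.PATCH, and the parts are compared as numbers, not as text. The sketch below is an illustration only and handles just plain three-part versions; the function names parse_version and is_breaking_change are hypothetical. Note also that under semantic versioning, versions below 1.0.0 are considered unstable, so the major-version rule is mostly meaningful from 1.0.0 onward.

```python
def parse_version(version):
    """Split 'MAJOR.MINOR.PATCH' into a tuple of three integers."""
    major, minor, patch = (int(part) for part in version.split("."))
    return major, minor, patch


def is_breaking_change(old, new):
    """A major-version bump signals changes that are not backward compatible."""
    return parse_version(new)[0] > parse_version(old)[0]


current = parse_version("0.0.1")
print(current)  # (0, 0, 1)

# Compared as numbers, 0.10.0 is newer than 0.9.0
# (comparing the raw strings would get this wrong)
print(parse_version("0.10.0") > parse_version("0.9.0"))  # True

print(is_breaking_change("1.4.2", "2.0.0"))  # True
print(is_breaking_change("1.4.2", "1.5.0"))  # False
```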
The next section of the file points to URLs for the software, such as its GitHub repository:
[project.urls]
Homepage = "https://github.com/arokem/geometry"
Issues = "https://github.com/arokem/geometry/issues"
The next item configures the setuptools feature that automatically traverses the filesystem in this directory and finds the packages and sub-packages that we have created. It also tells setuptools which parts of the software not to install (in this case, the tests):
[tool.setuptools.packages.find]
where = ["src"]
include = ["geometry*"]
exclude = ["tests*"]
7.5. A complete project
At this point, our project is starting to take shape. The filesystem of our library should look something like this:
.
└── geometry
    ├── LICENSE
    ├── README.md
    ├── pyproject.toml
    ├── src
    │   └── geometry
    │       ├── __init__.py
    │       └── circle.py
    └── tests/
In addition to the pyproject.toml file, we’ve added the README.md file, as well as a LICENSE file that contains the license we have chosen to apply to our code. At this point, the project has everything it needs for us to share it widely and for others to start using it. We can add all these files and then push this into a repository on GitHub (see Section 3).
Congratulations! You have created a software project that can easily be
installed on your machine, and on other computers. What next? There are a few
further steps you can take.
7.5.1. Testing and continuous integration
We already looked at software testing in Section 6. As we mentioned there, tests are particularly useful if they are automated and run repeatedly. In the context of a well-organized Python project, that can be achieved by including a test module for every module in the library. For example, we might add a tests directory that includes a test_circle.py script:
.
└── geometry
    ├── LICENSE
    ├── README.md
    ├── pyproject.toml
    ├── src
    │   └── geometry
    │       ├── __init__.py
    │       └── circle.py
    └── tests
        └── test_circle.py
The test_circle.py file may contain a simple set of functions for testing different aspects of the code. For example, the code:
from geometry.circle import calculate_area
from math import pi

def test_calculate_area():
    assert calculate_area(1) == pi
will test that the calculate_area function does the right thing for some well-understood input.
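A test module usually grows to cover more than one input. The sketch below suggests a few additional tests in the same style; the test names are hypothetical. It defines a copy of calculate_area inline so that it can run on its own, whereas in the real test_circle.py the function would be imported from geometry.circle as shown above. For results that are not exactly representable as floating point numbers, math.isclose is a safer comparison than ==.

```python
from math import isclose, pi


# Copied from circle.py so that this sketch runs on its own
def calculate_area(r):
    return pi * r ** 2


def test_calculate_area_unit_circle():
    assert calculate_area(1) == pi


def test_calculate_area_scales_with_radius_squared():
    # Doubling the radius should quadruple the area
    assert isclose(calculate_area(2), 4 * calculate_area(1))


def test_calculate_area_zero_radius():
    assert calculate_area(0) == 0


# A test harness would normally call these; here we call them directly
test_calculate_area_unit_circle()
test_calculate_area_scales_with_radius_squared()
test_calculate_area_zero_radius()
print("all tests passed")
```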
To reduce the friction that might prevent us from running the tests often, we can take advantage of systems that automate the running of tests as much as possible. The first step is to use software that runs the tests for us. These are sometimes called “test harnesses”. One popular test harness for Python is Pytest. When it is called within the source code of your project, the Pytest test harness identifies functions that are software tests by looking for them in files whose names start with test_ or end with _test.py, and by the fact that the functions themselves are named with names that start with test_. It runs these functions and keeps track of the functions that pass the test – do not raise errors – and those that fail – do raise errors.
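The naming convention described above can be written down as two small checks. To be clear, this is not Pytest's own code, only an illustration of the rule as described here; the function names looks_like_test_file and looks_like_test_function are hypothetical.

```python
from fnmatch import fnmatchcase


def looks_like_test_file(filename):
    """File names that start with test_ or end with _test.py."""
    return (fnmatchcase(filename, "test_*.py")
            or fnmatchcase(filename, "*_test.py"))


def looks_like_test_function(name):
    """Function names that start with test_."""
    return name.startswith("test_")


for filename in ["test_circle.py", "circle_test.py", "circle.py"]:
    print(filename, looks_like_test_file(filename))

for name in ["test_calculate_area", "calculate_area"]:
    print(name, looks_like_test_function(name))
```

Following the convention means that running Pytest from the top-level directory of the project is enough for it to find test_circle.py and the tests inside it.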
Another approach that can automate your testing even more is called “continuous integration”. In this approach, the system that keeps track of versions of your code – for example, the GitHub website – also automatically runs all of the tests that you wrote every time that you make changes to the code. This is a powerful approach in tandem with the collaborative tools that we described in Section 3 because it allows you to identify the exact set of changes that changed a passing test into a failing test and to alert you to this fact. In the collaborative pattern that uses pull requests, the tests can be run on the code before it is ever integrated into the main branch, allowing contributors to fix changes that cause test failures before they are merged. Continuous integration is implemented in GitHub through a system called “GitHub Actions”. We will not go further into the details of this system here, but you can learn about it through the online documentation.
7.5.2. Documentation
If you follow the instructions we provided above, you will have already made the first step towards documenting your code, by including docstrings in your function definitions. A further step is to write more detailed documentation and make the documentation available together with your software. A system that is routinely used across the Python universe is Sphinx. It is a rather complex system for generating documentation in many different formats, including a PDF manual, but also a neat-looking website that includes your docstrings and other pages you can write. Creating such a website can be a worthwhile effort if you are interested in making your software easier for others to find and use.
7.6. Summary
If you go through the work described in this section, making the software that you write for your use easy to install and openly available, you will make your work easier to reproduce, and also easier to extend. Other people might start using it. Inevitably, some of them might run into bugs and issues with the software. Some of them might even contact you to ask for help with the software – either through the Issues section of your GitHub repository or via email. This could lead to fruitful collaborations with other researchers who use your software. On the one hand, it might be a good idea to support the use of your software. One of our goals as scientists is to have an impact on the understanding of the universe and the improvement of the human condition, and software that is supported is more likely to have such an impact. Furthermore, with time, users can become developers of the software, initially by helping you expose errors that may exist in the code, and ultimately by contributing new features. Some people have made careers out of building and supporting a community of users and developers around software that they write and maintain. On the other hand, that might not be your interest or your purpose, and it does take time from other things. Either way, you can use the README of your code to communicate the level of support that others might expect. It’s perfectly fine to let people know that the software is provided openly, but that it is provided with no assurance of any support.
7.6.1. Software citation and attribution
Researchers in neuroimaging are used to the idea that when we rely on a paper for the findings and ideas in it, we cite it in our research. We are perhaps less accustomed to the notion that software that we use in our research should also be cited. In recent years, there has been an increased effort to provide ways for researchers to cite software, and also for researchers who share their software to be cited. One way to make sure that others can cite your software is to make sure that your software has a Digital Object Identifier (or DOI). A DOI is a string of letters and numbers that uniquely identifies a specific digital object, such as an article or dataset. It is used to keep track of this object and identify it, even if the web address to the object may change. Many journals require that a DOI be assigned to a digital object before that object can be cited so that the thing that is cited can be found even after some time. This means that to make your software citable, you will have to mint a DOI for it. One way to do that is through a service administered by the European Organization for Nuclear Research (CERN) called Zenodo. Zenodo allows you to upload digital objects – the code of a software library, for example – into the website, and then provides a DOI for them. It even integrates with the GitHub website to automatically provide a separate DOI for every version of a software project. This is also a reminder that when you are citing a software library that you use in your research, you should refer to the specific version of the software that you are using, so that others reading your article can reproduce your work.
7.7. Additional resources
Packaging and distributing Python code involves putting together some rather complex technical pieces. In response to some of the challenges, the Python community put together the Python Packaging Authority (PyPA) website, which explains how to package and distribute Python code. Their website is a good resource to help you understand some of the machinery that we explained above, and also to update it with the most recent best practices. Some of this chapter was based on a tutorial that the PyPA has produced.
In Section 4.1 you learned about the conda package manager. A great way to distribute scientific software using conda is provided through the Conda Forge project, which supports members of the community by providing guidance and recipes for distributing software using conda.
Jake Vanderplas wrote a very useful blog post on the topic of scientific software licensing.
Developing an open-source software project can also become a complex social, technical, and even legal challenge. A (free!) book you might want to look at if you are contemplating taking it on is Producing Open Source Software by Karl Fogel, which will take you through everything from naming an open-source software project to legal and management issues such as licensing, distribution, and intellectual property rights.