2. The Unix operating system

The Unix operating system is a fundamental tool for controlling the way that the computer executes programs, including in data analysis. Its history goes back all the way to the 1960s, but it is still one of the most commonly used tools for computing today.

In the 1960s, when Unix was first developed, computers were very different: they were usually mainframe systems that were shared among multiple users. Each user would connect to the computer through their own terminal device, a keyboard and a screen that were used to send instructions to the computer and to receive its outputs. This was before the invention of the first graphical user interfaces (GUIs), so all of the instructions had to be sent to the computer in the form of text.

The main way to interact with the computer was through an application called a “shell”. The shell usually presents a prompt, where users can type commands. This prompt is also called a “command line”. The commands typed at the command line are sent to the computer’s operating system to perform operations or to launch other programs. Often, these programs then produce text output that is printed in the shell as well.

Over the years, computers evolved and changed, but the idea of a text-based terminal remains. In different operating systems, you will find the shell application in different places. On Windows, you can install a shell application by installing Git for Windows (this will also be useful for the following sections, which introduce version control with git and containerization with Docker). On Apple’s Mac computers, you can find a shell application in your Applications/Utilities folder under the name “Terminal”. On the many variants of the Linux operating system, the shell is quite central as well and comes installed with the operating system.

2.1. Using Unix

The developers of the Unix shell believed that programs that run in this kind of environment should each be built to do only one thing. Ideally, each program’s output should be formatted so that it can be used as an input to another program. This means that users can combine multiple small programs to construct more complicated programs and pipelines.

Let’s look at a simple example of that. Opening up the shell, you will be staring at the prompt. On my computer that looks something like this:

$

You can type commands into this prompt and press the “enter” or “return” key to execute them. The shell will send these commands to the operating system, and then some output might appear inside the shell. This is what is called a “read, evaluate, print loop”, or REPL. That is because the application reads what you type, evaluates it to understand what it means and what information to provide in return, prints that information to the screen, and then repeats the whole process in an infinite loop, which ends only when you quit the application or turn off the computer.

2.1.1. Exploring the filesystem

When you first start the shell, its working directory is your home directory. This means that the files and folders in your home directory are immediately accessible to you. You can type the ls command to get a listing of the files and folders that the shell sees in your working directory. For example, in the shell on one of our computers:

$ ls
Applications   Downloads      Music          Untitled.ipynb tmp
Desktop        Library        Pictures       miniconda3
Documents      Movies         Public         projects


Most of the items listed here are folders that came installed with Ariel’s computer when he bought it, for example Documents and Desktop. Others are folders that he created in the home directory, for example projects. There is also a single stray file here, Untitled.ipynb, a lone Jupyter notebook (ipynb is the extension for these files) that remains from some impromptu data analysis that he once did.

The ls command (and many other Unix commands) can be modified using flags. These are characters or words added to the command that modify the way the command runs. For example, adding the -F flag to the call to ls adds a slash (/) character at the end of the names of folders. This is useful in practice, because it tells us which of the names in the list refer to files, and which refer to folders that contain other files and folders.

$ ls -F
Applications/   Downloads/      Music/          Untitled.ipynb  tmp/
Desktop/        Library/        Pictures/       miniconda3/
Documents/      Movies/         Public/         projects/

In general, if we want to know more about a particular command, we can issue the man command, followed by the name of the command whose so-called man page we would like to read (man is short for “manual”). For example, the following command opens the man page for the ls command, which tells us about all of the options that we have for modifying the way that ls works.

$ man ls


To exit the man page, we press the q key. We can ask the shell to change the directory in which we are currently working by issuing the cd (or “change directory”) command:

$ cd Documents

This changes what the shell sees when we ask it to list the files:

$ ls -F
books/
conferences/
courses/
papers/


This is the list of directories that are within the Documents directory. We can confirm the change that has occurred by asking the shell where we are, using the pwd command (which stands for “print working directory”):

$ pwd
/Users/arokem/Documents

Note: this is the answer that Ariel sees (on his Mac laptop computer), and you might see something slightly different, depending on the way your computer is set up. For example, if you are using the shell that you installed with Git for Windows on a Windows machine (this shell is also called a Git Bash shell), your home directory is probably going to look more like this:

$ pwd
/c/Users/arokem


This is the address of the standard C:\Users\arokem Windows home directory, translated into a more Unix-like format.

If we want to change our working directory back to where the shell started, we can call cd again. This command can be used in one of several ways:

$ cd /Users/arokem

This is a way to refer to the absolute path of the home directory. It is absolute because this command would bring us back to the home directory, no matter where in the filesystem we happened to be working before we issued it. This command also tells us where within the structure of the filesystem the home directory is located. The slash (/) characters in the command are to be read as separators that designate the relationships between different items within the filesystem. For example, in this case, the home directory is inside of a directory called “Users”, which in turn is inside the root of the filesystem (simply designated as the / at the beginning of the absolute path). This idea, that files and folders sit inside other folders, organizes the filesystem as a whole.

Another way to think about this is that the filesystem on our computer is organized as a tree. The root of the tree is the root of the entire filesystem (/), and all of the items saved in the filesystem stem from the root. Different branches split off from the root, and they can split further. Finally, at the ends of the branches (the leaves, if you will) are the files that are saved within folders at the end of every path through the branches of the tree. The tree structure makes organizing and finding things easier.

Another command we might issue here that would also take us to the home directory is:

$ cd ..


The .. is a special way to refer to the directory directly above the directory in which we are currently working within the filesystem tree, bringing us one step closer to the root. Depending on what directory you are already in, it would take you to different places. Because of that, we refer to this as a relative path. Similarly, if we were working within the home directory and issued the following command:

$ cd Documents

this would also be a relative path. This is because it does not describe the relationship between this folder and the root of the filesystem, and it would work differently depending on our current working directory. For example, if we issued that command while our working directory was inside of the Documents directory, we would get an error from the shell because, given only a relative path, it can’t find anything called Documents inside of the Documents folder.

One more way I can get to the home directory from anywhere in the filesystem is using the tilde character (~). So, this command:

$ cd ~


is equivalent to this command:

$ cd /Users/arokem

Similarly, you can refer to folders inside of your home directory by writing ~/ before the relative path from the home directory to that location. For example, to get from anywhere to the Documents directory, I could issue the following, which is interpreted as an absolute path:

$ cd ~/Documents
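
To make the difference between absolute paths, relative paths, and the tilde shortcut concrete, here is a minimal sketch that can be run safely in a shell. It is only an illustration: the scratch directory, the papers subfolder, and all of the variable names are assumptions made for this example, and HOME is temporarily pointed at the scratch directory so that your real home directory is never touched.

```shell
# Remember where we started, and work in a throwaway directory.
start_dir=$(pwd)
old_home=$HOME
scratch=$(mktemp -d)
mkdir -p "$scratch/home/Documents/papers"

# Absolute path: starts at the root (/), so it works from anywhere.
cd "$scratch/home/Documents"
absolute_result=$(pwd -P)

# Relative path: .. means "one level up from the current directory".
cd ..
parent_result=$(pwd -P)

# Relative path: resolved against the current working directory.
cd Documents/papers
relative_result=$(pwd -P)

# Tilde: the shell expands ~ to the value of HOME.
# HOME is overridden here only for the purposes of this demonstration.
HOME="$scratch/home"
cd ~/Documents
tilde_result=$(pwd -P)

echo "absolute: $absolute_result"
echo "parent:   $parent_result"
echo "relative: $relative_result"
echo "tilde:    $tilde_result"

# Restore the original state and clean up the scratch directory.
HOME=$old_home
cd "$start_dir"
rm -rf "$scratch"
```

The first and last lines of the output should show the same directory, reached once through an absolute path and once through the tilde shortcut, while the two relative commands land in different places depending on where they were issued from.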


Exercise

The touch command creates a new empty file in your filesystem. You can specify the file to create using either a relative or an absolute path. How would you create a new file called new_file.txt in your home directory? How would you do that if you were working inside of your ~/Documents directory?

The mv command moves a file from one location to another. How would you move new_file.txt from the home directory to the Documents directory? How would this be different from using the cp command? (Hint: use the mv and cp man pages to see what these commands do.)

2.1.2. The pipe operator

There are many other commands in the Unix shell, and we will not demonstrate all of them here. Instead, we will now demonstrate one of the important principles that we mentioned before: the output of one command can be used directly as an input to another command. For example, if we wanted to quickly count the number of items in the home directory, we could provide the output of the ls command directly as input to the wc command (“word count”, which counts the lines, words, and characters in its input). In Unix, that is called “creating a pipe” between the commands, and to do so we use the pipe operator, which is the vertical line that usually sits in the top right of the US English keyboard: |

$ ls | wc
13      13     117

Now, instead of printing out the list of files in my home directory, the shell prints out the number of lines, words, and characters in the output of the ls command. It tells us that there are 13 lines, 13 words, and 117 characters in the output of ls. It looked as though the output of ls was three lines, but that column layout is used only when ls prints to the screen; when its output goes into a pipe, ls writes one name per line.

The order of the pipe operation is from left to right, and we don’t have to stop here. For example, we can use another pipe to ask how many words in the output of ls contain the capital letter D. For this, we’ll use the grep command, which prints only the lines of its input that contain a particular letter or phrase:

$ ls | grep "D" | wc
3       3      28


To see why this is the case, try running this yourself in your home directory. Try omitting the final pipe to wc and seeing what the output of ls | grep "D" might be. This should give you a sense of how Unix combines operations together. Though it may seem a bit silly at first (why would we want to count how many words have the letter “D” in them in a list of files?), pipes combined in the right way can give you a lot of power to automate operations that you’d like to do, for example identifying certain files in a directory and processing all of them through a command line application.
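
If you would rather experiment without touching your home directory, the following minimal sketch builds a small scratch directory and runs the same kind of pipeline there. The file names and variable names are hypothetical, chosen only for this illustration. It uses wc -l, which reports only the number of lines, and adds a final tr stage because some versions of wc pad the number with spaces.

```shell
# Remember where we started, and work in a throwaway directory.
start_dir=$(pwd)
scratch=$(mktemp -d)
cd "$scratch"

# Hypothetical files, created only for this demonstration.
touch Desktop.txt Documents.txt Downloads.txt music.txt notes.csv

# ls writes one name per line into the pipe; wc -l counts those lines.
# tr -d ' ' strips the padding that some wc implementations add.
n_all=$(ls | wc -l | tr -d ' ')

# Keep only the names that contain a capital D, then count them.
n_with_d=$(ls | grep "D" | wc -l | tr -d ' ')

# Keep only the names that end in .txt, then count them.
n_txt=$(ls | grep '\.txt$' | wc -l | tr -d ' ')

# Omitting the counting stage shows what grep passes along.
ls | grep "D"

echo "all: $n_all, with D: $n_with_d, ending in .txt: $n_txt"

# Return to the starting directory and clean up.
cd "$start_dir"
rm -rf "$scratch"
```

With the five hypothetical files above, the last line printed is `all: 5, with D: 3, ending in .txt: 4`. Each stage does one small job, and the pipe is what turns them into a single operation.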