2. The Unix operating system
The Unix operating system is a fundamental tool for controlling the way that the computer executes programs, including in data analysis. Its history goes back to the 1960s, but it is still one of the most commonly used tools for computing today. In the 1960s, when Unix was first developed, computers were very different: they were usually mainframe systems that were shared among multiple users. Each user would connect to the computer through a terminal device, a keyboard and a screen that were used to send instructions to the computer and receive outputs. This was before graphical user interfaces (GUIs) became available, so all of the instructions had to be sent to the computer in the form of text.
The main way to interact with the computer was through an application called a “shell”. The shell usually presents a prompt, where users can type a variety of different commands. This prompt is also called a “command line”. The commands typed at the command line are sent to the computer’s operating system to do a variety of different operations, or to launch various other programs. Often, these programs will then produce text outputs that are printed into the shell as well.
Over the years, computers evolved and changed, but the idea of a text-based terminal remains. In different operating systems, you will find the shell application in different places. On Windows, you can install a shell application by installing Git for Windows (this will also end up being useful for the following sections, which introduce version control with git and containerization with Docker). On Apple’s Mac computers, you can find a shell application in your Applications/Utilities folder under the name “Terminal”. On the many variants of the Linux operating system, the shell is quite central as well and comes installed with the operating system.
2.1. Using Unix
The developers of the Unix shell believed that programs that run in this kind of environment should be built to each do only one thing. Ideally, each program’s output should be formatted so that it could be used as input to another program. This means that users could use multiple small programs to construct more complicated programs and pipelines based on combinations of different tools. Let’s look at a simple example of that. Opening up the shell, you will be staring at the prompt. On my computer that looks something like this:
$
You can type commands into this prompt and press the “enter” or “return” key to execute them. The shell will send these commands to the operating system, and then some output might appear inside the shell. This is what is called a “read, evaluate, print loop”, or REPL: the application reads what you type, evaluates it to understand what it means and what information to provide in return, prints that information to the screen, and then repeats the whole process. The loop ends only when you quit the application or turn off the computer.
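The read, evaluate, print cycle can be sketched in a few lines of shell. This is only a toy illustration and not how a real shell is implemented: the function name toy_repl is made up for this example, and it reads its commands from standard input instead of from a keyboard.

```shell
# toy_repl: a hypothetical, minimal read-evaluate-print loop.
# Real shells do much more (prompts, history, job control, error handling).
toy_repl() {
    # Read: take one line of input at a time, until the input ends.
    while IFS= read -r line; do
        # Evaluate and print: run the line as a command. Whatever the
        # command writes to its output is what gets "printed".
        eval "$line"
    done
    # Loop: the 'while' statement repeats the cycle for every line.
}

# Feed the loop two commands, as if a user had typed them one after the other.
printf 'echo hello\necho goodbye\n' | toy_repl
```

Each line that is fed in is executed in turn, so the two echo commands print their text one after the other, and the loop ends when there is no more input.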
2.1.1. Exploring the filesystem
When you first start the shell, its working directory is your home directory.
This means that the files and folders in your home directory are immediately
accessible to you. For example, you can type the ls command to get a listing of
the files and folders that the shell sees in your working directory. This is
what that looks like in the shell on one of our computers:
$ ls
Applications Downloads Music Untitled.ipynb tmp
Desktop Library Pictures miniconda3
Documents Movies Public projects
Most of the items listed here are folders that came installed with Ariel’s
computer when he bought it, for example, Documents and Desktop. Others are
folders that he created in the home directory, for example, projects. There is
also a single stray file here, Untitled.ipynb, which is a lone Jupyter notebook
(ipynb is the extension for these files) that remains here from some impromptu
data analysis that he once did. The ls command (and many other Unix commands)
can be modified using flags. These are characters or words added to the command
that modify the way that the command runs. For example, adding the -F flag to
the call to ls appends a slash (/) character to the names of folders. This is
useful in practice, because it tells us which of the names in the list refer to
a file and which refer to a folder that contains other files and folders.
$ ls -F
Applications/ Downloads/ Music/ Untitled.ipynb tmp/
Desktop/ Library/ Pictures/ miniconda3/
Documents/ Movies/ Public/ projects/
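To try the effect of a flag without touching your own files, the comparison can be repeated in a temporary directory. The following is a minimal sketch; the names projects and notes.txt, and the variable names, are made up for this example. The commands are written without the $ prompt so that they can be pasted into a shell or saved as a script.

```shell
# Work in a throwaway directory so that nothing in your home directory changes.
demo=$(mktemp -d)
mkdir "$demo/projects"      # a folder
touch "$demo/notes.txt"     # an empty file

# List the same directory with and without the -F flag.
plain=$(ls "$demo")
flagged=$(ls -F "$demo")
printf 'ls:\n%s\n\nls -F:\n%s\n' "$plain" "$flagged"

# Clean up the temporary directory.
rm -r "$demo"
```

In the second listing, the folder is shown as projects/ while the file is still shown as notes.txt, with no suffix.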
In general, if we want to know more about a particular command, we can issue the
man command, followed by the name of the command for which we would like to read
the so-called man page (man is short for “manual”). For example, the following
command would open the man page for the ls command, telling us about all of the
options that we have to modify the way that ls works.
$ man ls
To exit the man page, we would press the q key. We can ask the shell to change
the directory in which we are currently working by issuing the cd (or “change
directory”) command.
$ cd Documents
This changes what the shell sees when we ask it to list the files.
$ ls -F
books/
conferences/
courses/
papers/
This is the list of directories that are within the Documents directory. We can
confirm that the working directory has changed by asking the shell where we are,
using the pwd command (which stands for “print working directory”).
$ pwd
/Users/arokem/Documents
Note: this is the answer that Ariel sees (on his Mac laptop computer), and you
might see something slightly different, depending on the way your computer is
set up. For example, if you are using the shell that is installed with Git for
Windows on a Windows machine (this shell is also called Git Bash), your home
directory is probably going to look more like this:
$ pwd
/c/Users/arokem
This is the address of the standard C:\Users\arokem Windows home directory,
translated into a more Unix-like format. If we want to change our working
directory back to where the shell started, we can call cd again. This command
can be used in one of several ways:
$ cd /Users/arokem
This is a way to refer to the absolute path of the home directory. It is
absolute because this command would bring us back to the home directory, no
matter where in the filesystem we happened to be working before we issued it.
This command also tells us where within the structure of the filesystem the
home directory is located. The slash (/) characters in the command are to be
read as separators that designate the relationships between different items
within the filesystem. For example, in this case, the home directory is
considered to be inside of a directory called “Users”, which in turn is inside
the root of the filesystem (simply designated as the / at the beginning of the
absolute path). This idea, that files and folders are nested inside other
folders, organizes the filesystem as a whole. Another way to think about this is
that the filesystem on our computer is organized as a tree. The root of the tree
is the root of the entire filesystem (/) and all of the items saved in the
filesystem stem from the root. Different branches split off from the root, and
they can split further. Finally, at the ends of the branches (the leaves, if you
will) are the files that are saved within folders. The tree structure makes
organizing and finding things easier.
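The defining property of an absolute path, that it leads to the same place no matter where we start, can be checked directly. This is a small sketch under stated assumptions: the folder names and variable names are made up, and a temporary directory stands in for the home directory so that your own files are not touched.

```shell
# Build a small tree under a temporary directory (a stand-in for a home directory).
# 'pwd -P' gives the absolute path of that directory with any links resolved.
root=$(cd "$(mktemp -d)" && pwd -P)
mkdir -p "$root/Documents/papers" "$root/projects"

# Start inside one branch of the tree and follow the absolute path.
cd "$root/projects"
cd "$root/Documents/papers"
from_projects=$(pwd -P)

# Start at the root of the whole filesystem and follow the same absolute path.
cd /
cd "$root/Documents/papers"
from_filesystem_root=$(pwd -P)

# Both lines show the same location.
echo "$from_projects"
echo "$from_filesystem_root"

# Clean up (step out of the temporary tree before removing it).
cd / && rm -r "$root"
```

The two printed paths are identical, because an absolute path describes the whole route from the root of the filesystem and does not depend on the working directory.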
Another command we might issue here that would also take us to the home directory is:
$ cd ..
The .. is a special way to refer to the directory directly above the directory
in which we are currently working within the filesystem tree, bringing us one
step closer to the root. Depending on what directory you are already in, it
would take you to different places. Because of that, we refer to this as a
relative path. Similarly, if we were working within the home directory and
issued the following command:
$ cd Documents
this would be a relative path. This is because it does not describe the
relationship between this folder and the root of the filesystem, and would work
differently depending on our present working directory. For example, if we
issued that command while our working directory was the Documents directory, we
would get an error from the shell because, given only a relative path, it can’t
find anything called Documents inside of the Documents folder.
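The way a relative path depends on the working directory can be seen in a short experiment. This is a sketch only: a temporary directory stands in for the home directory, and the variable names are made up for this example.

```shell
# A temporary directory stands in for the home directory.
home_like=$(cd "$(mktemp -d)" && pwd -P)
mkdir "$home_like/Documents"

cd "$home_like"
cd Documents          # works: Documents is inside the working directory
inside=$(pwd -P)

# The same relative path fails now, because there is no Documents
# folder inside of Documents. The error message is discarded here.
if cd Documents 2>/dev/null; then
    second_try="succeeded"
else
    second_try="failed"
fi

cd ..                 # one step toward the root: back where we started
back=$(pwd -P)

echo "$inside"
echo "$second_try"
echo "$back"

# Clean up.
cd / && rm -r "$home_like"
```

The second attempt reports that it failed, and the final cd .. returns to the directory that the experiment started from.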
One more way to get to the home directory from anywhere in the filesystem is by
using the tilde character (~). So, this command:
$ cd ~
is equivalent to this command:
$ cd /Users/arokem
Similarly, you can refer to folders inside of your home directory by writing ~/
before the relative path from the home directory to that location. For example,
to get from anywhere to the Documents directory, we could issue the following
command. The shell expands ~ to the home directory, so this is interpreted as an
absolute path.
$ cd ~/Documents
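To see what the shell does with the tilde, we can print the expanded value instead of changing directories. This sketch assumes that the HOME environment variable is set, as it is in a normal shell session; the variable name docs is made up for this example.

```shell
# The shell replaces an unquoted ~ with the value of the HOME variable
# before the command runs, so these two lines print the same path.
echo ~
echo "$HOME"

# ~/Documents is expanded in the same way, giving an absolute path
# that does not depend on the working directory.
docs=~/Documents
echo "$docs"
```

Note that the expansion happens whether or not a Documents folder exists: the shell only rewrites the text of the path, and it is cd that would complain if the folder were missing.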
2.1.1.1. Exercise
The touch command creates a new empty file in your filesystem. You can specify
the location of the new file using either a relative or an absolute path. How
would you create a new file called new_file.txt in your home directory? How
would you do that if you were working inside of your ~/Documents directory? The
mv command moves a file from one location to another. How would you move
new_file.txt from the home directory to the Documents directory? How would this
be different from using the cp command? (Hint: use the mv and cp man pages to
see what these commands do.)
2.1.2. The pipe operator
There are many other commands in the Unix shell, and we will not demonstrate all
of them here (see the table below for some commonly used commands). Instead, we
will now proceed to demonstrate one of the important principles that we
mentioned before: the output of one command can be directly used as an input to
another command. For example, if we wanted to quickly count the number of items
in the home directory, we could provide the output of the ls command directly as
input into the wc command (“word count”, which counts lines, words, and
characters). In Unix, that is called “creating a pipe” between the commands, and
we use the pipe operator to do so. This is the vertical line character, |, which
usually sits near the top right of the US English keyboard:
$ ls | wc
13 13 117
Now, instead of printing out the list of files in my home directory, the shell
prints out the number of lines, words, and characters in the output of the ls
command. It tells us that there are 13 lines, 13 words, and 117 characters in
the output of ls. It looked as though the output of ls was 3 lines, but when its
output is sent through a pipe rather than to the screen, ls writes each name on
a line of its own. The order of the pipe operation is from left to right, and we
don’t have to stop here. For example, we can use another pipe to ask how many
names in the output of ls contain the letter D. For this, we’ll use the grep
command, which finds all of the lines in its input that contain a particular
letter or phrase:
$ ls | grep "D" | wc
3 3 28
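The same kind of pipeline can be tried on a small, fixed input, which makes it easier to follow what each stage does. This is a minimal sketch: the three names are made up and stand in for the output of ls, and wc is called with its -l flag so that it reports only the number of lines.

```shell
# Three made-up names, one per line, standing in for the output of ls.
names=$(printf 'Desktop\nDocuments\nmusic\n')

# wc -l counts only lines, so the result is a single number.
total=$(printf '%s\n' "$names" | wc -l)

# grep keeps only the lines that contain a capital "D",
# and wc -l then counts the lines that are left.
with_d=$(printf '%s\n' "$names" | grep "D" | wc -l)

echo "total: $total"
echo "with D: $with_d"
```

There are three names in total and two of them contain a capital D (grep distinguishes between upper and lower case, so music is not counted). On some systems, wc pads the numbers that it prints with leading spaces.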
To see why this is the case, try running this yourself in your home directory.
Try omitting the final pipe to wc and seeing what the output of ls | grep "D"
might be. This should give you a sense of how Unix combines operations. Though
it may seem a bit silly at first (why would we want to count how many names
contain the letter “D” in a list of files?), when combined in the right way,
pipes can give you a lot of power to automate operations that you’d like to do,
for example, identifying certain files in a directory and processing all of them
through a command line application.
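As a sketch of what such automation might look like, the following pipeline picks out the files in a directory whose names end in .csv and runs a command on each of them. The file names are made up for this example, and wc -c (which counts the characters in a file) merely stands in for whatever command line application you would actually want to run.

```shell
# Create a temporary directory with a few made-up files in it.
data=$(mktemp -d)
touch "$data/sub-01.csv" "$data/sub-02.csv" "$data/notes.txt"

# Identify the files whose names end in .csv, then process them one by one.
ls "$data" | grep '\.csv$' | while IFS= read -r name; do
    wc -c "$data/$name"     # stand-in for a real processing step
done

# grep -c counts the matching lines, which tells us how many files were selected.
n_csv=$(ls "$data" | grep -c '\.csv$')
echo "processed $n_csv files"

# Clean up.
rm -r "$data"
```

Only the two .csv files are passed on to the processing step, and notes.txt is left out. This approach assumes ordinary file names. For names that contain unusual characters, such as line breaks, tools like find are more robust than reading the output of ls.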
| Command | Description |
|---|---|
| ls | List the contents of the current directory. |
| cd | Change the current directory. |
| pwd | Print the path of the current directory. |
| mkdir | Create a new directory. |
| touch | Create a new file. |
| cp | Copy a file or directory. |
| mv | Move or rename a file or directory. |
| rm | Remove a file or directory. |
| cat | Print the contents of a file to the terminal. |
| less | View the contents of a file one page at a time. |
| grep | Search for a pattern in a file or files. |
| sort | Sort the lines of a file. |
| find | Search for files based on their name, size, or other attributes. |
| wc | Print the number of lines, words, and bytes in a file. |
| chmod | Change the permissions of a file or directory. |
| chown | Change the ownership of a file or directory. |
| head | Print the first few lines of a file. |
| tail | Print the last few lines of a file. |
| diff | Compare two files and show the differences between them. |
2.2. More about Unix
We’re going to move on from Unix now. In the next two chapters, you will see some more intricate and specific uses of the command line through two applications that run as command-line interfaces. Together with the tools you will see in those chapters, and also independently of them, Unix provides a lot of power to explicitly operate on files and folders in your filesystem and to run a variety of applications, so becoming fluent with the shell will be a boon to your work.
2.3. Additional resources
To learn more about the Unix philosophy, we recommend “The Unix Programming Environment” by Kernighan and Pike. Both authors worked at Bell Labs, where Unix was developed. The writing is a bit archaic, but understanding some of the constraints that applied to computers at the time that Unix was developed could help you understand and appreciate why Unix operates as it does, and why some of these constraints have been kept, even while computers have evolved and changed in other respects.