UNIX command line
Overview
Teaching: 20 min
Exercises: 10 min
Questions
How can we navigate through folders and files on the computer using a command-line interface?
Which UNIX commands handle and edit files?
Objectives
Learn how to navigate, open, and handle files using the terminal.
Understand the basic commands to parse text files containing biological data.
A command-line interface and a graphical user interface are different ways of communicating with a computer’s operating system. The shell is a program that provides the command-line interface and lets you control the computer using keyboard commands. Few bioinformatics tools have a graphical user interface, so you will often have to use the shell. The shell is a powerful way of communicating with the computer that can help you work more efficiently, and learning to use it will be transformative for your work in bioinformatics. It can also be used to connect to remote and cloud computers.
Terminal Command line
Once you log in to HiperGator through SSH, you will start using a bash shell.
[<username>@login1 ~]$
The $ prompt shows that the bash shell is ready to accept bash commands.
Before learning some basic commands, there are a few recommendations regarding UNIX systems.
- The terminal syntax hates spaces in names. For long or complex names, use connectors such as “_” or ‘-’ instead of spaces.
- Uppercase is different from lowercase. ‘R’ is not the same as ‘r’ in commands, paths and arguments.
- UNIX systems use / in paths, unlike Windows, which uses \.
We will be working on ‘blue’ storage in HiperGator for our workshop under the group name ‘general_workshop’. Each user has a directory in the ‘general_workshop’ folder. There is also a ‘share’ directory where all the datasets and information for this workshop are stored. Please copy only the requested files from the share folder to your user folder, and run the analyses only in the folder (directory) with your username.
Your personal folder is named after your GatorLink username.
Enter the following command to go to your work directory (we will talk about cd shortly).
Do not forget to replace <username> with the username provided to you.
$ cd /blue/general_workshop/
When copying code, do not copy the $ or > prompt signs.
Basic Commands
Displaying current path/location
pwd displays your “path” (where you are located in the cluster).
$ pwd
/blue/general_workshop/
Displaying files and folders in current location
The ls command displays the files and folders in the current location.
$ ls
anujsharma     guest.11240     guest.11248     Intro_slides.pptx     share
emgoss         guest.11241     guest.11249     jhuguet
...
...
Adding the argument/flag -l to ls displays additional details such as permissions, file owner, size, and date modified.
$ ls -l
drwxr-sr-x 3 anujsharma      general_workshop    4096 Sep 10 03:47 anujsharma
...
...
-rw-r----- 1 jhuguet general_workshop 1290989 Sep  8 15:11 Intro_slides.pptx
drwxr-sr-x 7 jhuguet general_workshop    4096 Sep  8 15:27 share
drwxr-sr-x 2 emogss  general_workshop    4096 Sep  8 11:01 emgoss
Permissions in linux
Linux permissions look like this:
d r w x r w − r − −
- 1st character represents a special flag: file −, directory d or link l.
- The rest of the characters show permissions in sets of three: r for read, w for write and x for execute. − means permission denied.
- 2nd to 4th characters: permission for file owner
- 5th to 7th characters: permission for the group
- 8th to 10th characters: permission for others
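The layout above can be checked on a throwaway file. The following is a minimal sketch, assuming a temporary directory and a hypothetical file name example.txt; chmod, mktemp and the $( ) capture syntax are not covered in this lesson and appear here only to produce a known permission string.

```shell
# Work in a temporary directory so nothing in your workshop folder is touched
tmpdir=$(mktemp -d)
touch "$tmpdir/example.txt"          # hypothetical empty file

# 640 means: rw- for the owner, r-- for the group, --- for others
chmod 640 "$tmpdir/example.txt"

# Keep only the first 10 characters of the long listing:
# 1 flag character followed by three sets of three permission characters
perms=$(ls -l "$tmpdir/example.txt" | cut -c 1-10)
echo "$perms"                        # -rw-r-----
```

Reading the result from left to right: - (a regular file), rw- (owner can read and write), r-- (group can only read), --- (others have no access).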
Creating directories
mkdir creates a new directory in the current path.
Let’s create a new directory called newdir and 
then use the ls command to check if the directory was successfully created.
$ cd <username>
$ mkdir newdir
$ ls
newdir
cd <username> is for entering your working directory first. We will cover cd shortly.
Changing directories
cd changes your current path. 
Let’s change the current path to the directory we just created. 
Use pwd to check the current path.
$ cd newdir
$ pwd
/blue/general_workshop/<username>/newdir
$ cd ..
$ pwd
/blue/general_workshop/<username>
Common path symbols in linux
Linux uses some symbols to represent commonly used paths.
.. stands for the parent directory.
. stands for the current directory.
/ at the beginning stands for the root directory.
~ stands for the home directory.
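As a small illustration of these symbols, the sketch below builds a hypothetical parent/child directory pair inside a temporary directory and moves between them. The directory names and the use of mktemp and basename are assumptions made for the example, not part of the workshop setup.

```shell
# Build a small practice tree in a temporary location
base=$(mktemp -d)
mkdir "$base/parent"
mkdir "$base/parent/child"

cd "$base/parent/child"
cd ..                              # .. moves up one level, into parent
after_up=$(basename "$PWD")        # basename keeps only the last part of the path
echo "$after_up"                   # parent

cd ./child                         # . is the current directory, so this enters parent/child
after_down=$(basename "$PWD")
echo "$after_down"                 # child

echo ~                             # prints the path of your home directory
```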
Copying files
cp is used for copying files. Let’s copy file1.txt from the share folder to your working directory.
$ cp /blue/general_workshop/share/file1.txt ./file1.txt
$ ls
newdir     file1.txt
$ cp file1.txt file2.txt
$ ls
newdir     file1.txt     file2.txt
cp -r can be used for copying entire directories. -r stands for recursive.
Copy the demo folder from share folder to your working directory.
$ cp -r /blue/general_workshop/share/demo ./
$ ls
demo     newdir     file1.txt     file2.txt
Moving files
mv is used for moving files or directories. Unlike copying, moving deletes the original copy. Moving a file to a new name in the same directory simply renames it, as in this example.
$ mv file2.txt newfile.txt
$ ls
demo     newdir     file1.txt     newfile.txt
Deleting files
rm can be used for deleting files (and directories too, with the -r recursive argument).
$ rm newfile.txt
$ ls
demo     newdir     file1.txt
Removing directories
rmdir removes the specified directory if it is empty. 
Let’s remove the newdir we created earlier and 
check if it is removed using ls.
$ rmdir newdir
$ ls
demo     file1.txt
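The difference between rmdir and rm -r shows up when a directory is not empty. Here is a minimal sketch, assuming a temporary directory and hypothetical names full and notes.txt; the if/else construct is not covered in this lesson and is used only to report what happened.

```shell
# Create a directory that contains one file
base=$(mktemp -d)
mkdir "$base/full"
touch "$base/full/notes.txt"

# rmdir only removes empty directories, so this attempt fails
if rmdir "$base/full" 2>/dev/null
then
    rmdir_result="removed"
else
    rmdir_result="refused"
fi
echo "$rmdir_result"               # refused

# rm -r removes the directory together with everything inside it
rm -r "$base/full"
if [ -d "$base/full" ]
then
    still_there="yes"
else
    still_there="no"
fi
echo "$still_there"                # no
```

Because rm -r deletes without asking and there is no recycle bin on the cluster, double-check the path before running it.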
File content handling
There are a set of commands to read the contents of a file.
cat prints the entire content of a file to the screen and returns to the command-line prompt. Let’s see what file1.txt contains.
$ cat file1.txt
CM008465.1      4979077 A       C       intergenic_region
CM008458.1      97095206        A       G       intergenic_region
CM008459.1      72668492        G       A       intergenic_region
...
...
CM008463.1      16380489        T       A       intergenic_region
CM008463.1      123496326       T       A       intergenic_region
CM008457.1      21333612        G       T       intergenic_region
less and more display small chunks of the file content at a time.
In less output, you can scroll up and down the file content using 
↑ and ↓ keys.
In more output, you can scroll down the file using the Enter key.
You can return to the command-line prompt by pressing the q key.
Let’s view file1.txt again, this time using more or less.
$ less file1.txt
$ more file1.txt
head and tail can be used to read the start and end of the file respectively.
The -n argument can be used to specify the number of lines to read (default is 10 lines).
We can now read only the start or the end of file1.txt.
$ head file1.txt
CM008465.1      4979077 A       C       intergenic_region
CM008458.1      97095206        A       G       intergenic_region
CM008459.1      72668492        G       A       intergenic_region
CM008465.1      57718962        G       A       downstream_gene_variant
CM008465.1      225524501       C       T       intergenic_region
CM008464.1      76483552        T       A       intergenic_region
CM008464.1      22690967        C       T       intergenic_region
CM008463.1      130960788       A       G       intergenic_region
CM008457.1      19966217        A       C       intergenic_region
CM008466.1      61441376        G       T       intergenic_region
$ tail -n 3 file1.txt
CM008463.1      16380489        T       A       intergenic_region
CM008463.1      123496326       T       A       intergenic_region
CM008457.1      21333612        G       T       intergenic_region
Extracting lines from the middle
Lines from the middle of a file can be extracted using sed -n as follows. To extract the 5th to 7th lines:
$ sed -n 5,7p file1.txt
CM008465.1      225524501       C       T       intergenic_region
CM008464.1      76483552        T       A       intergenic_region
CM008464.1      22690967        C       T       intergenic_region
sed is a very powerful tool in bash and can be used for a wide range of text editing tasks. Here, the -n argument tells sed not to print every line automatically, and 5,7p prints only lines 5 to 7. You will see more uses of sed later.
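The same middle lines can also be reached by combining head and tail. The sketch below assumes a hypothetical five-line temporary file and uses the | operator, which is introduced later in this lesson, to pass the output of head to tail.

```shell
# Hypothetical five-line practice file
tmp=$(mktemp)
printf 'line1\nline2\nline3\nline4\nline5\n' > "$tmp"

# Lines 2 to 4 with sed: -n turns off automatic printing, 2,4p prints lines 2-4
middle=$(sed -n 2,4p "$tmp")
echo "$middle"

# Same result: take the first 4 lines, then keep the last 3 of those
same=$(head -n 4 "$tmp" | tail -n 3)
echo "$same"
```

Both commands print line2, line3 and line4. The sed form is usually easier to read because the line numbers appear directly in the command.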
File length
wc can be used for reading the length of a file. 
wc accepts the arguments 
-l for number of lines, 
-c for number of bytes (the same as the number of characters in plain ASCII text) and 
-w for number of words.
$ wc -l file1.txt
100 file1.txt
$ wc -c file1.txt
4302 file1.txt
$ wc -w file1.txt
500 file1.txt
Writing a file
The > operator can be used to write the output of a command into a file.
Let’s write some files.
$ head -n 5 file1.txt > head.txt
$ tail -n 5 file1.txt > tail.txt
$ ls
demo     file1.txt     head.txt     tail.txt
You can even save the list of files in the current directory into a file.
$ ls > files.txt
$ cat files.txt
demo
file1.txt
files.txt
head.txt
tail.txt
> vs >>
Writing to an existing file with > removes its existing contents. Use the >> operator to append new contents to the end of an existing file.
Concatenate files
The cat command is used for concatenation of multiple files. 
Let’s concatenate the new files created in the previous step. 
We can verify the concatenation by checking the number of lines in the concatenated file.
$ cat head.txt tail.txt > concat.txt
$ wc -l concat.txt
10 concat.txt
head.txt and tail.txt had 5 lines each. To confirm, you can check like this: wc -l head.txt.
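Since the steps above rely on redirecting output into files, it may help to see > and >> side by side. This is a minimal sketch on a hypothetical temporary file, not on the workshop data.

```shell
# Hypothetical scratch file
out=$(mktemp)

echo "first" > "$out"
echo "second" > "$out"             # > overwrites: "first" is gone
after_overwrite=$(cat "$out")
echo "$after_overwrite"            # second

echo "third" >> "$out"             # >> appends to the end of the file
after_append=$(cat "$out")
echo "$after_append"               # second, then third on the next line
```

When in doubt, use >> or write to a new file name, because > replaces existing contents without any warning.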
Print/output to screen
echo command can be used to display result to the screen.
$ echo "hi"
hi
Variables
Variables can be assigned values with the = operator. Do not put spaces around =.
$ a="Hello"
$ echo $a
Hello
$ b="World"
$ echo "$a $b"
Hello World
" vs ' (Double quotes vs single quotes)
" and ' mean different things in UNIX and should not be used interchangeably. Check for yourself how the output of echo '$a $b' differs from that of echo "$a $b".
Text manipulation
cut is used to extract fields of data from a string or a file.
head.txt contains part of the output of a real sequence analysis. 
The first part (CM0084xx) contains the names of the chromosomes, 
which we need to extract.
Since the chromosome name is the first 8 characters, 
cut -c can be used to extract the chromosome name. 
-c specifies the character positions to extract from each line.
$ cut -c 1-8 head.txt
CM008465
CM008458
CM008459
CM008465
CM008465
The columns in head.txt are tab separated. You can verify with cat head.txt.
So, we can alternatively just extract the first column using cut.
-f: specify columns to extract.
$ cut -f 1 head.txt
CM008465.1
CM008458.1
CM008459.1
CM008465.1
CM008465.1
Cut field delimiter
What if we want to define columns by a character other than Tab? The -d argument allows us to specify the delimiter used to break a line into columns. For example, suppose we don’t want the .1 part in the chromosome name in the code above. In this case, we can get the chromosome name by separating columns by . instead of Tab and extracting the first column.
$ cut -f 1 -d "." head.txt
CM008465
CM008458
CM008459
CM008465
CM008465
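The same idea applies to any single-character delimiter. Below is a minimal sketch with a hypothetical comma-separated file; the gene names and values are made up for illustration.

```shell
# Hypothetical comma-separated table: name, value, label
csv=$(mktemp)
printf 'gene1,12.5,high\ngene2,3.1,low\n' > "$csv"

# First column only
names=$(cut -d "," -f 1 "$csv")
echo "$names"                      # gene1 and gene2, one per line

# Several columns can be requested at once; the delimiter is kept in the output
pairs=$(cut -d "," -f 1,3 "$csv")
echo "$pairs"                      # gene1,high and gene2,low, one per line
```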
Sorting
sort is used to sort lines in a file. 
By default, sort sorts in alphanumeric order.
$ sort file1.txt
CM008455.1      27261226        C       A       intergenic_region
CM008455.1      289184259       AT      A       intron_variant
CM008455.1      289401193       G       A       intergenic_region
...
...
CM008466.1      61600177        C       T       intergenic_region
CM008466.1      71138968        G       A       intergenic_region
CM008466.1      95058324        T       C       intergenic_region
The top lines now begin with CM008455.1 and the bottom ones with CM008466.1.
-r argument reverses the order of sorting.
$ sort -r file1.txt
CM008466.1      95058324        T       C       intergenic_region
CM008466.1      71138968        G       A       intergenic_region
CM008466.1      61600177        C       T       intergenic_region
...
...
CM008455.1      289401193       G       A       intergenic_region
CM008455.1      289184259       AT      A       intron_variant
CM008455.1      27261226        C       A       intergenic_region
-k can be used to sort by a specific column.
$ sort -k2 file1.txt
CM008465.1      108474056       G       A       intergenic_region
CM008463.1      109077809       C       T       intergenic_region
...
...
CM008458.1      97095206        A       G       intergenic_region
CM008461.1      9871059 T       C       intergenic_region
-n argument is used to sort in numerical ascending order.
Numerical vs alphanumeric order
Sorting by numerical order is useful when dealing with numbers. In alphanumeric order, 2 comes after 15 because the first character “2” comes after first character “1”. In numerical order, 2 comes before 15, since 2 is smaller than 15.
$ sort -k2 -n file1.txt
CM008457.1      3542278 T       G       intergenic_region
CM008465.1      3999373 G       A       upstream_gene_variant
...
...
CM008455.1      289401193       G       A       intergenic_region
CM008455.1      294972840       C       T       intergenic_region
$ sort -k2 -n -r file1.txt
CM008455.1      294972840       C       T       intergenic_region
CM008455.1      289401193       G       A       intergenic_region
...
...
CM008465.1      3999373 G       A       upstream_gene_variant
CM008457.1      3542278 T       G       intergenic_region
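To make the difference between the two orders concrete, here is a small sketch using a hypothetical three-line file of numbers (not part of the workshop data).

```shell
# Hypothetical file with one number per line
nums=$(mktemp)
printf '2\n15\n100\n' > "$nums"

# Default sort compares character by character, so "100" and "15" come before "2"
alpha=$(sort "$nums")
echo $alpha                        # 100 15 2

# -n compares the values as numbers
numeric=$(sort -n "$nums")
echo $numeric                      # 2 15 100
```

Leaving $alpha and $numeric unquoted in echo prints the lines on a single row, which makes the two orders easy to compare.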
Argument shortcuts in bash
Multiple arguments, and sometimes the expected values of arguments, can be written together in bash. In the example above, you can replace -k2 -n -r with -k2nr. Guess what sort -k2nr file1.txt does.
Replacing text
sed command can be used for replacing text in a file. 
sed 's/old/new/g' replaces all instances of old with new.
$ sed 's/CM0084/Chr_/g' tail.txt > prettyfile.txt
$ cat prettyfile.txt
Chr_66.1        28255843        A       C       intergenic_region
Chr_65.1        236852266       C       T       intergenic_region
Chr_63.1        16380489        T       A       intergenic_region
Chr_63.1        123496326       T       A       intergenic_region
Chr_57.1        21333612        G       T       intergenic_region
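The g at the end of the expression matters when a pattern occurs more than once on a line. The following sketch uses a hypothetical one-line file to show the difference.

```shell
# Hypothetical file with the same word three times on one line
demo=$(mktemp)
echo "old old old" > "$demo"

# Without g, only the first match on each line is replaced
first_only=$(sed 's/old/new/' "$demo")
echo "$first_only"                 # new old old

# With g (global), every match on the line is replaced
all=$(sed 's/old/new/g' "$demo")
echo "$all"                        # new new new
```

In both cases the original file is left unchanged, because sed prints the edited text instead of modifying the file.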
Search and Extract
grep command is used to find a string in a file and return the matching line.
$ grep "downstream_gene_variant" file1.txt
CM008465.1      57718962        G       A       downstream_gene_variant
CM008466.1      236800990       C       A       downstream_gene_variant
CM008458.1      226074184       T       TA      downstream_gene_variant
CM008458.1      214028749       A       C       downstream_gene_variant
CM008460.1      5909607 A       T       downstream_gene_variant
CM008460.1      63781473        A       T       downstream_gene_variant
CM008465.1      138360632       A       G       downstream_gene_variant
Argument -c is used to return the number of matches.
$ grep -c "downstream_gene_variant" file1.txt
7
Piping commands
The | operator can be used for piping. 
Piping means that the output of the first command serves as the input of the second command, and so on. 
This eliminates the need to save intermediate results as files.
Let’s sort file1.txt numerically by the second column 
and display only the top 5 lines.
$ sort -k2 -n file1.txt | head -n 5
CM008457.1      3542278 T       G       intergenic_region
CM008465.1      3999373 G       A       upstream_gene_variant
CM008465.1      4979077 A       C       intergenic_region
CM008457.1      5681949 G       A       intergenic_region
CM008460.1      5909607 A       T       downstream_gene_variant
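Pipes can chain more than two commands. As a sketch, the example below combines grep, sort, head and cut on a hypothetical three-row, tab-separated table (the names chrA, chrB and chrC and their values are made up) to find the chromosome with the highest position among intergenic variants.

```shell
# Hypothetical tab-separated table: chromosome, position, annotation
data=$(mktemp)
printf 'chrA\t300\tintergenic_region\n'  > "$data"
printf 'chrB\t20\tintron_variant\n'     >> "$data"
printf 'chrC\t1000\tintergenic_region\n' >> "$data"

# 1. grep keeps only the intergenic rows
# 2. sort orders them by column 2, numerically, largest first
# 3. head keeps the top row
# 4. cut extracts the first column
top=$(grep "intergenic_region" "$data" | sort -k2nr | head -n 1 | cut -f 1)
echo "$top"                        # chrC
```

Building a pipeline one command at a time, and checking the output after each added step, is a practical way to catch mistakes early.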
Exercise: Finding “alien genes” in the plant pathogen Streptomyces scabies.
Let’s use the commands in a simple real case. Copy the file aliens_in_scabies from the share directory using the following command.
Make sure you are in the /blue/general_workshop/<username> directory. If not, use cd /blue/general_workshop/<username> to go to your personal working directory.
$ cp /blue/general_workshop/share/strep/aliens_in_scabies ./
Streptomyces scabies is a plant pathogen that produces necrosis in potatoes. Most of the virulence factors are located in regions that have low GC content. Also, virulence genes are highly expressed during the interaction with roots. The file “aliens_in_scabies” contains tabular data with more than 1000 genes (first column) along with the level of change in expression between growth in rich medium and interaction with roots (second column). GC content (%) is provided in the third column of the table.
Using UNIX commands
- Sort the table by expression levels (second column).
- Create a new file that contains the top 10 highest expressed genes.
- Using the new file, sort the table by gene GC content.
- Create a table with only the gene names but replace the name SCAB with SCABIES.
# 1
$ sort -k2n aliens_in_scabies
# 2
$ sort -k2nr aliens_in_scabies | head > newfile.txt
# 3
$ sort -k3n newfile.txt
# 4
$ cut -f1 aliens_in_scabies | sed 's/SCAB/SCABIES/g'
An intro to loops
Many tasks are repetitive. It is not necessary to repeat the same command multiple times. Instead, we can use wildcards and loops to repeat a task.
Wildcards
There are certain symbols in UNIX that represent multiple values, 
and are called “wildcards”. * is a universal wildcard that stands for anything.
$ ls *.txt
concat.txt     files.txt     prettyfile.txt
file1.txt      head.txt      tail.txt
$ mv *.txt demo
$ ls
demo
$ ls demo
concat.txt     files.txt     prettyfile.txt
file1.txt      head.txt      tail.txt
Other wildcards are
| Wildcard | Value |
|---|---|
| * | any character or characters |
| ? | any single character |
| [0-9] | any single digit |
| [A-Z] | any single uppercase letter |
| [a-z] | any single lowercase letter |
| [!x] | any single character other than x |
| {x,y} | x or y |
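The patterns in the table can be tried safely with echo, which simply prints the names that a pattern matches. The sketch below assumes a temporary directory with a few hypothetical empty files.

```shell
# Create hypothetical empty files in a temporary directory
dir=$(mktemp -d)
cd "$dir"
touch sample1.txt sample2.txt sampleA.txt notes.csv

one_char=$(echo sample?.txt)       # ? matches exactly one character
echo "$one_char"                   # sample1.txt sample2.txt sampleA.txt

digit=$(echo sample[0-9].txt)      # [0-9] matches one digit
echo "$digit"                      # sample1.txt sample2.txt

not_digit=$(echo sample[!0-9].txt) # [!0-9] matches one character that is not a digit
echo "$not_digit"                  # sampleA.txt

# In bash, braces expand to each listed alternative (no spaces inside the braces)
echo sample{1,A}.txt               # sample1.txt sampleA.txt
```

Previewing a pattern with echo or ls before using it with mv or rm helps avoid moving or deleting more files than intended.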
The ‘FOR’ loop
A for loop iterates over a range of numbers or a list of values. For example:
$ for x in {0..3}
> do
>   echo $x
> done
0
1
2
3
The commands between do and done are the task to be repeated.
The variable x takes consecutive values from the sequence {0..3} in each iteration.
Try echo {0..3} to see what {0..3} represents.
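A for loop can also iterate over file names produced by a wildcard, which is how loops are often used on real data. This is a minimal sketch, assuming a temporary directory with two hypothetical text files.

```shell
# Two hypothetical text files in a temporary directory
dir=$(mktemp -d)
cd "$dir"
printf 'a\nb\n' > one.txt
printf 'c\n' > two.txt

count=0
for f in *.txt                     # f takes each matching file name in turn
do
    echo "Found $f"
    head -n 1 "$f"                 # any command can be run on each file
    count=$((count + 1))           # keep track of how many files were seen
done
echo "$count"                      # 2
```

Quoting "$f" keeps the loop working even if a file name contains a space.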
Bash prompts
The $ prompt specifies that the shell is ready to start a command. # replaces $ for the root user. > specifies continuation of a multiline command from the previous line.
The ‘WHILE’ loop
A while loop iterates as long as a condition is true. Let’s look at an example.
$ x=0
$ while(($x < 4))
> do 
>   echo $x
>   ((++x)) 
> done
0
1
2
3
((++x)) means increase the value of x by 1. Alternatively, you can use x=$((x+1)). Double parentheses (( )) allow math operations in shell code.
As in the for loop, the commands to be looped over are enclosed between do and done.
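As a further sketch of the same pattern, the loop below adds up the numbers 1 to 4 using the x=$((x+1)) form of the increment. The variable names x and sum are arbitrary choices for the example, and the (( )) syntax assumes the bash shell.

```shell
x=1
sum=0
while (( x <= 4 ))                 # keep looping while x is at most 4
do
    sum=$((sum + x))               # add the current value of x to the running total
    x=$((x + 1))                   # without this line the loop would never end
done
echo "$sum"                        # 10
```

If a loop does run forever by mistake, pressing Ctrl+C in the terminal stops it.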
Learn more about unix commands
Key Points
pwd, ls, and cd are the three main commands for navigating through directories.
cat, more and less display text files.
sed and grep are commands for replacing and extracting text in a file.
cut and sort are commands for extracting columns from and ordering the lines of text files.
UNIX commands allow handling repetitive tasks using for or while loops.