More UNIX commands
Overview
Objectives
To understand UNIX conditional statements.
To learn about functions in bash.
To get started with bash scripting.
Parallelization in bash with GNU parallel.
The optional sections are targeted at more advanced users (note: more advanced than the regular sections, but still aimed at beginners). Feel free to try these sections once you are done with the regular lesson.
Conditionals
A conditional statement consists of commands that are executed only if certain conditions are satisfied. In bash, the general syntax of a conditional statement looks like:
$ if [ condition ]
  then
      commands if given condition is true ...
      ...
  elif [ another condition ]
  then
      commands if new condition is true but previous conditions are false ...
      ...
  else
      commands if all conditions are false ...
      ...
  fi
The elif and else sections are optional. The elif section can be repeated multiple times.
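Let’s look at a simple example. It uses double brackets [[ ... ]], because inside single brackets the shell treats < and > as file redirections rather than comparisons. Inside double brackets, < and > compare the values as strings.
$ if [[ 1 < 2 ]]
  then
      echo "true"
  else
      echo "false"
  fi
true
$ if [[ 1 > 2 ]]
  then
      echo "true"
  else
      echo "false"
  fi
false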
Operators
Instead of commonly used operator symbols such as = or >=, 
it is preferable to use test operators for numeric comparisons, 
especially when working with variables that have not been declared as numbers.
- -eq for equal to,
- -gt for greater than,
- -lt for less than,
- -ge for greater than or equal to,
- -le for less than or equal to.
$ a=1
$ b=2
$ if [ $a -gt $b ]
  then
      echo "a is greater than b."
  elif [ $a -lt $b ]
  then
      echo "a is less than b."
  else
      echo "a is equal to b."
  fi
a is less than b.
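To see why the test operators are preferable for numbers, consider the minimal sketch below. It assumes two made-up values, 10 and 9, and compares them once with < inside double brackets and once with -lt. Inside double brackets, < compares the values as strings, so 10 sorts before 9, whereas -lt compares them as integers.

```shell
a=10
b=9

# Inside [[ ]], < compares strings character by character,
# so "10" sorts before "9" because "1" comes before "9".
if [[ $a < $b ]]; then string_result="true"; else string_result="false"; fi

# -lt compares the two values as integers.
if [ "$a" -lt "$b" ]; then numeric_result="true"; else numeric_result="false"; fi

echo "string comparison says 10 < 9:  $string_result"
echo "numeric comparison says 10 < 9: $numeric_result"
```

The string comparison reports true and the numeric comparison reports false, which is why -lt and its relatives are the safer choice for numbers.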
Other useful operators are
- ! condition for checking if the condition is false.
- -z STRING for checking if a string (or variable) is empty.
- -v VARIABLE for checking if a variable is defined.
- -d /path/to/DIR for checking if a directory exists.
- -f /path/to/FILE for checking if a file exists.
- -s /path/to/FILE for checking if a file exists and is not empty.
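Several of these operators can be combined in a single if/elif chain. The sketch below is only an illustration: check_input is a hypothetical helper, and the file names are made up and live in a temporary directory created with mktemp.

```shell
# check_input is a hypothetical helper, not part of the lesson.
check_input() {
    if [ -z "$1" ]; then
        echo "no path given"          # -z: the string is empty
    elif [ -d "$1" ]; then
        echo "directory"              # -d: the path is a directory
    elif [ ! -f "$1" ]; then
        echo "missing"                # ! -f: the path is not an existing file
    elif [ -s "$1" ]; then
        echo "file with content"      # -s: the file exists and is not empty
    else
        echo "empty file"
    fi
}

tmpdir=$(mktemp -d)                   # scratch directory for the demo
touch "$tmpdir/empty.txt"             # a file without content
echo "ACGT" > "$tmpdir/seq.txt"       # a file with content

check_input ""
check_input "$tmpdir"
check_input "$tmpdir/none.txt"
check_input "$tmpdir/seq.txt"
check_input "$tmpdir/empty.txt"

rm -r "$tmpdir"                       # clean up the scratch directory
```

The order of the checks matters here: the directory test comes before the file tests, and the existence test comes before the size test.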
Single line conditional statement
For shorter commands, a conditional statement can be written as a one-liner
using double brackets [[ ... ]].
The general syntax is:
$ [[ CONDITION ]] && COMMANDS IF TRUE || COMMANDS IF FALSE
Let’s look at some examples.
$ ls files/demo
example.fasta
$ [[ -f files/demo/example.txt ]] && echo "file exists"
$ [[ -f files/demo/example.txt ]] && echo "file exists" || echo "file does not exist"
file does not exist
$ [[ -f files/demo/example.fasta ]] && echo "file exists" || echo "file does not exist"
file exists
$ [[ ! -f files/demo/example.fasta ]] && echo "file exists" || echo "file does not exist"
file does not exist
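One caveat of the one-liner form is that it is not a strict if/else: the command after || also runs when the command after && fails. The sketch below illustrates this under simple assumptions, using the built-in command false as a stand-in for a command that fails.

```shell
# Case 1: the condition is true and the command after && succeeds,
# so only that command runs.
first=$([[ 1 -lt 2 ]] && echo "condition true" || echo "fallback")

# Case 2: the condition is true, but the command after && fails
# (false always fails), so the command after || runs as well.
second=$([[ 1 -lt 2 ]] && false || echo "fallback")

echo "case 1: $first"
echo "case 2: $second"
```

When the commands in the true branch can fail, the full if ... then ... else ... fi form is the safer choice.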
An intro to functions
Functions are scoped, reusable and often portable chunks of code. Reusability is an important aspect of a function: once you define it, you can refer back to the same piece of code multiple times throughout your pipeline.
Function syntax
In bash, functions are defined as follows:
$ function_name () 
  {
      function code ...
      ...
  }
- function_name is the name of the function, and it will be used to call the function later.
- () declares that function_name stores a function, rather than a string or other data.
- {...} delimits the body of the function, i.e., it encloses the code that is part of the function.
Let’s write a simple function.
$ myfunc ()
  {
      echo "Hello, world!"
  }
Calling a function
As mentioned earlier, a function is called by its name. 
Let’s call the myfunc function defined earlier.
$ myfunc
Hello, world!
The code echo ... was executed when you called the function.
Arguments
Arguments are another important component of functions. Arguments are values passed to the function from the interactive prompt or from a parent script. They enable the function to use different input data in different calls.
You can pass arguments to a function by simply supplying the values after the function name. The function stores the arguments passed to it in variables $1 (first argument), $2 (second argument), … and so on.
$ myfunc2()
  {
      echo $1
      echo $2
  }
$ myfunc2 "Hello" "World"
Hello
World
Let’s call the function with new arguments.
$ myfunc2()
  {
      echo $1
      echo $2
  }
$ myfunc2 "This workshop" "is awesome."
This workshop
is awesome.
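Because arguments are filled in at call time, one definition can serve many inputs. A minimal sketch, assuming a hypothetical function greet and made-up names, calls the same function once per item in a loop:

```shell
# greet is a hypothetical function used only for illustration.
greet() {
    # $1 is the greeting, $2 is the name
    echo "$1, $2!"
}

# Call the same function with a different second argument each time.
for name in Alice Bob; do
    greet "Hello" "$name"
done
```

Quoting "$1" and "$2" inside the function keeps an argument that contains spaces, such as "This workshop", together as one value.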
Return value
Unlike functions in many other programming languages, bash functions do not return data, but rather return a status.
Instead, values can be saved to a global variable or a file. This part will not be discussed here.
Alternatively, standard output, e.g., from echo, can be captured as follows.
$ echo $(myfunc2) | wc -c
$ wc -c <<< $(myfunc2)
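The difference between the status and the captured output can be sketched as follows. The function is_positive is hypothetical; it prints a message to standard output and sets its status with return, where 0 means success and any other value means failure.

```shell
# is_positive is a hypothetical function used only for illustration.
is_positive() {
    if [ "$1" -gt 0 ]; then
        echo "$1 is positive"
        return 0    # status 0 means success
    else
        echo "$1 is not positive"
        return 1    # any non-zero status means failure
    fi
}

message=$(is_positive 5)   # capture what the function printed
status=$?                  # status of the function call inside $( )

echo "output: $message"
echo "status: $status"
```

Because the status is what if tests, the function can also be used directly as a condition, for example if is_positive 5; then ...; fi.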
Exercise: Write your own function
- Write a function to find the maximum of two input numbers.
 - Run the function with 1 and 2.
 - Run the function with 4 and 3.
 
# 1
$ max_num()
  {
      if [ $1 -gt $2 ]
      then
          echo $1
      else
          echo $2
      fi
  }
# 1 alternative one-liner
$ max_num() { [[ $1 -gt $2 ]] && echo $1 || echo $2; }
# 2
$ max_num 1 2
2
# 3
$ max_num 4 3
4
An intro to scripting
Scripts are another way of writing reusable portable code. Scripts are files which contain code to be executed and are called interactively or from another script.
Scripts are generally considered more standalone than functions. Scripts are preferable to functions when the code is reused in many other scripts.
Writing a script
A typical UNIX script looks like this:
#!/bin/sh
your code here ...
...
Script shell
The first line of a script is a shebang, i.e., #!, followed by the shell to run the script.
A shell is an interpreter, which reads user commands and executes them.
- The Bourne shell sh is available in all modern UNIX systems at the path /bin/sh.
- The Bourne-Again shell bash is present in most modern UNIX systems and is the default shell in many of them. It is slightly more feature-rich than sh and is available at the path /bin/bash.
- zsh, fish, ksh, dash etc. are other popular UNIX shells.
- Many programming languages provide their own interpreters. For example, python scripts can be run with /usr/bin/env python or /usr/bin/env python3 etc., and R scripts with /usr/bin/env Rscript.
If you are not sure about shell selection, bash is the safest choice for UNIX scripting.
Let’s write a simple script. 
Use nano to write a simple script as follows.
Once you are done, press Ctrl+x to exit nano.
When prompted, press Y and Enter to save the changes made to the file and return to the bash prompt.
$ nano code.sh
-----------------------------------------------------------------------------------------------
 GNU nano 3.3 beta 02                     File: code.sh
-----------------------------------------------------------------------------------------------
#!/bin/bash
echo "Hello, World!"
-----------------------------------------------------------------------------------------------
^G Get Help     ^O WriteOut     ^R Read File     ^Y Prev Page     ^K Cut Text       ^C Cur Pos
^X Exit         ^J Justify      ^W Where Is      ^V Next Page     ^U UnCut Text     ^T To Spell
-----------------------------------------------------------------------------------------------
Making script executable
So far, your script is a regular text file and does not have execution permission.
You can make it executable by using chmod +x.
$ chmod +x code.sh
Calling the script
You can run the script by using its filepath (path+filename).
$ ./code.sh
Hello, World!
Arguments
Just like with functions, arguments can be passed to a script by simply supplying the values after the call.
Similarly, argument values can be accessed from within the script with $1, $2 and so on.
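The steps so far (write, make executable, call with arguments) can be sketched in one go. This is only an illustration: greet.sh is a hypothetical file name, and the script is written with a here-document (cat > file << 'EOF') instead of nano so that the example is self-contained. It creates greet.sh in the current directory.

```shell
# Write a small script to a file; greet.sh is a hypothetical name.
cat > greet.sh << 'EOF'
#!/bin/bash
# $1 is the first argument, $2 is the second argument
echo "First argument: $1"
echo "Second argument: $2"
EOF

chmod +x greet.sh            # make the script executable
./greet.sh "Hello" "World"   # call the script with two arguments
```

The quotes around EOF keep $1 and $2 from being expanded while the file is written, so they are stored in the script as typed.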
Parsing the argument
Argument parsing in a script allows
- passing arguments without a fixed order
- passing options without any arguments
An example is nano -h for help, where -h is an option.
We do not cover argument parsing here, but you can learn more about it here.
If you are publishing your own script for others to use, it is considered best practice to adequately comment each step of the script.
Exercise: Write your own script
Use what you learnt so far to write a script that does the following:
a) Create a directory, named by the first argument, in the current directory.
b) Create a file, named by the second argument, within the directory from step a.
c) Write the content of the third argument to the file from step b.
d) Print the content of the file created in step b.
Next, run the script with the following arguments:
- first argument: demo
 - second argument: text
 - third argument: “Success!”
 
# 1
$ nano script.sh
------------------------------------------------------------------------------------------
 GNU nano 3.3 beta 02                     File: script.sh
------------------------------------------------------------------------------------------
#!/bin/bash
# 1a - creating a directory from first argument
mkdir -p "$1"
# 1b - creating a file from second argument within the directory
touch "$1/$2"
# 1c - writing third argument to the file created
echo "$3" > "$1/$2"
# 1d - reading the content of the file
cat "$1/$2"
------------------------------------------------------------------------------------------
^G Get Help     ^O WriteOut     ^R Read File     ^Y Prev Page     ^K Cut Text       ^C Cur Pos
^X Exit         ^J Justify      ^W Where Is      ^V Next Page     ^U UnCut Text     ^T To Spell
------------------------------------------------------------------------------------------
# 2
$ chmod +x script.sh   # making script executable
$ ./script.sh demo text "Success!"
Success!
Using the -p option with mkdir eliminates errors related to creating nested or pre-existing directories.
Parallelization
In computer science, parallelization is the execution of multiple, often similar/repeated tasks concurrently.
Parallelization is important in bioinformatics because of:
- large datasets leading to lengthy computation times, e.g., 50 million reads or a 30 Gb file
 - similar processing for each unit of data, e.g., alignment of each read to reference genome using identical processes.
 - availability of computing clusters with large computing resources.
 
Let’s look at a demo example of parallelization using GNU parallel.
First, let’s define a function that takes some time to complete.
$ myfunc3()
  {
    sleep 2   # suspend computing for 2 seconds
    echo $1
  }
We can execute this function n times sequentially, i.e., not in parallel, using a for loop.
$ for i in {1..10}; do myfunc3 $i; done
We need some way to monitor the time taken. Let’s use some simple math for this, i.e., subtract the start time from the end time.
$ start=$SECONDS   # Start time
$ for i in {1..10}; do myfunc3 $i; done
$ end=$SECONDS     # End time
$ echo -e "---\nExecution time is $(echo "$end - $start" | bc -l) seconds."
1
2
3
4
5
6
7
8
9
10
---
Execution time is 20 seconds.
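The same timing can be done with bash built-ins only, which helps on systems where bc is not installed. The sketch below is a variant under stated assumptions: slow_task is a hypothetical stand-in for myfunc3 with a shorter delay, and the subtraction uses arithmetic expansion $(( )) instead of bc.

```shell
# slow_task is a hypothetical stand-in for myfunc3 with a shorter delay.
slow_task() {
    sleep 1
    echo "$1"
}

start=$SECONDS                 # whole seconds since the shell started
for i in 1 2 3; do slow_task "$i"; done
end=$SECONDS

elapsed=$((end - start))       # integer subtraction, no bc needed
echo "Execution time is about $elapsed seconds."
```

Since SECONDS only counts whole seconds, the reported time can be off by one second in either direction.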
Now let us parallelize.
Parallelization with background tasks
The easiest way to parallelize in bash is by using background tasks.
This is accomplished by adding & at the end of the command (within the loop).
$ set +m           # Disable monitoring mode to prevent job status output
$ start=$SECONDS   # Start time
$ for i in {1..10}; do myfunc3 $i & done
$ wait             # Wait for all tasks to complete
$ end=$SECONDS     # End time
$ echo -e "---\nExecution time is $(echo "$end - $start" | bc -l) seconds."
1
2
3
4
5
6
7
8
9
10
---
Execution time is 2 seconds.
Your screen output might look slightly different. The background task management in RedHat Linux (used by HiperGator) prints some more job ID information on the screen. Nonetheless, the output of the myfunc3 function should be the same as above, though the numbers may not always appear in the same order.
GNU parallel
However, the built-in parallelization methods in UNIX systems are not
very intuitive or flexible, and sometimes vary slightly between different
UNIX distros. GNU parallel is a great alternative for parallelization.
Among other benefits, it can set the number of CPU cores per job based on the number of jobs
and has options to specify the minimum memory for a job,
which makes it a useful tool in bioinformatics.
On a personal computer,
parallel needs to be installed first, for example from source.
However, GNU parallel is already available in HiperGator;
it only needs to be loaded.
$ ml parallel
User-defined objects (variables and functions) are not accessible 
within parallel by default,
so the function myfunc3 has to be made global (globally available) first.
$ export -f myfunc3
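The effect of export -f can be seen without parallel. The sketch below, assuming a hypothetical function say_hi, calls the function from a child bash process before and after exporting it.

```shell
# say_hi is a hypothetical function used only for illustration.
say_hi() {
    echo "hi from $1"
}

# A child shell does not know the function yet, so this call fails
# and the message after || is printed instead.
bash -c 'say_hi child' 2> /dev/null || echo "say_hi is not visible in the child shell"

export -f say_hi   # make the function available to child processes

# Now the child shell can call the function.
bash -c 'say_hi child'
```

GNU parallel starts its jobs in child shells, which is why the function has to be exported first.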
Now, let’s run 5 jobs at the same time.
$ start=$SECONDS   # Start time
$ parallel -j5 --ungroup --will-cite myfunc3 ::: {1..10}
$ end=$SECONDS     # End time
$ echo -e "---\nExecution time is $(echo "$end - $start" | bc -l) seconds."
Breakdown of GNU parallel
- parallel is the command name.
- -j5 specifies 5 jobs at a time.
- --ungroup specifies printing output as soon as one input is processed (optional).
- --will-cite is added to suppress the prompt to cite parallel (optional).
- myfunc3 is the function to execute.
- {1..10} expands to a list of 10 arguments, which are passed to parallel after the ::: separator.
The function myfunc3 will be run 10 times, each time with a number from 1 to 10 as an argument.
1
2
3
4
5
6
7
8
9
10
---
Execution time is 5 seconds.
Thus, with parallelization, we were able to reduce the execution time.
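A rough lower bound for the run time follows from the numbers used above: 10 calls, 5 at a time, 2 seconds each. The sketch below does this arithmetic with bash integer math; the variable names are made up for illustration.

```shell
jobs_total=10       # number of calls to myfunc3
jobs_at_once=5      # value passed to -j
seconds_per_job=2   # sleep time inside myfunc3

# Number of batches, rounded up using integer arithmetic.
batches=$(( (jobs_total + jobs_at_once - 1) / jobs_at_once ))
lower_bound=$(( batches * seconds_per_job ))

echo "Batches needed: $batches"
echo "Expected lower bound: $lower_bound seconds"
```

The estimate is 4 seconds, so the measured time presumably includes some overhead for starting parallel and its jobs.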
Become a shellscript pro