More UNIX commands

Overview

Objectives

To understand UNIX conditional statements.

To learn about functions in bash.

To get started with bash scripting.

To parallelize tasks in bash with GNU parallel.

The optional sections are aimed at more advanced users (note: more advanced than the regular sections, but still suitable for beginners). Feel free to try these sections once you are done with the regular lesson.

Conditionals

A conditional statement runs a set of commands only if a given condition is satisfied. In bash, the general syntax of a conditional statement is:

$ if [ condition ]
  then
      commands if given condition is true ...
      ...
  elif [ another condition ]
  then
      commands if new condition is true but previous conditions are false ...
      ...
  else
      commands if all conditions are false ...
      ...
  fi

The elif and else sections are optional. The elif section can be repeated multiple times. The spaces after [ and before ] are required.

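Let’s look at a simple example. Inside [ ], the symbols < and > are treated as file redirection rather than comparison, so the example uses the numeric test operators -lt (less than) and -gt (greater than):

$ if [ 1 -lt 2 ]
  then
      echo "true"
  else
      echo "false"
  fi

true

$ if [ 1 -gt 2 ]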
  then
      echo "true"
  else
      echo "false"
  fi

false

Operators

Inside [ ], commonly used operator symbols such as < or >= do not compare numbers (< and > are treated as redirection, and = compares strings). Numeric comparisons use the following test operators instead:

-eq for equal to
-gt for greater than
-lt for less than
-ge for greater than or equal to
-le for less than or equal to

$ a=1

$ b=2

$ if [ "$a" -gt "$b" ]
  then
      echo "a is greater than b."
  elif [ "$a" -lt "$b" ]
  then
      echo "a is less than b."
  else
      echo "a is equal to b."
  fi

a is less than b.

Other useful operators are

! CONDITION for checking if the condition is false (note the space after !).
-z STRING for checking if a string (or the value of a variable) is empty.
-v VARIABLE for checking if a variable is set (give the variable name without $).
-d /path/to/DIR for checking if a directory exists.
-f /path/to/FILE for checking if a file exists.
-s /path/to/FILE for checking if a file exists and is not empty.
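
The string and file operators above can be combined in one if/elif chain. The following is a minimal sketch; the function name describe_path and the temporary files are made up for illustration and are not part of the lesson files.

```bash
#!/bin/bash
# describe_path is a hypothetical helper used only for this illustration.
describe_path() {
    local path="$1"
    if [ -z "$path" ]
    then
        echo "no path given"
    elif [ -d "$path" ]
    then
        echo "directory"
    elif [ -s "$path" ]
    then
        echo "non-empty file"
    elif [ -f "$path" ]
    then
        echo "empty file"
    else
        echo "missing"
    fi
}

tmpdir=$(mktemp -d)                # scratch directory for the demo
touch "$tmpdir/empty.txt"          # exists, but has no content
echo "ACGT" > "$tmpdir/seq.txt"    # exists and has content

describe_path ""                   # no path given
describe_path "$tmpdir"            # directory
describe_path "$tmpdir/empty.txt"  # empty file
describe_path "$tmpdir/seq.txt"    # non-empty file
describe_path "$tmpdir/nothing"    # missing
rm -r "$tmpdir"                    # clean up
```

The order of the tests matters here: -s is checked before -f so that an empty file falls through to the "empty file" branch.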

Single line conditional statement

For short commands, a conditional statement can be written as a one-liner using double brackets [[ ... ]] together with && (and) and || (or).

The general syntax is:

$ [[ CONDITION ]] && COMMANDS IF TRUE || COMMANDS IF FALSE

Let’s look at some examples.

$ ls files/demo

example.fasta

$ [[ -f files/demo/example.txt ]] && echo "file exists"

$ [[ -f files/demo/example.txt ]] && echo "file exists" || echo "file does not exist"

file does not exist

$ [[ -f files/demo/example.fasta ]] && echo "file exists" || echo "file does not exist"

file exists

$ [[ ! -f files/demo/example.fasta ]] && echo "file does not exist" || echo "file exists"

file exists
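
One caveat with the && ... || form is that it is not a true if/else: the command after || also runs when the command after && fails. The sketch below shows the difference, using the built-in false as a stand-in for a failing command.

```bash
#!/bin/bash
# The command after || runs when EITHER the test OR the command after && fails.
[[ 1 -lt 2 ]] && false || echo "fallback ran although the condition was true"

# An if/else statement chooses its branch from the condition alone.
if [[ 1 -lt 2 ]]
then
    false    # stands in for a command that fails
else
    echo "never printed"
fi
echo "exit status of the if statement: $?"
```

For anything beyond a simple echo, the full if/else form is therefore the safer choice.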

An intro to functions

Functions are scoped, reusable and often portable chunks of code. Reusability is an important aspect of a function: once it is defined, you can call the same piece of code multiple times throughout your pipeline.

Function syntax

In bash, functions are defined as follows:

$ function_name () 
  {
      function code ...
      ...
  }

function_name is the name of the function, and it will be used to call the function later.

() declares that function_name is a function, rather than a variable holding a string or other data.

{...} delimits the body of the function, i.e., it encloses the code that is part of the function.

Let’s write a simple function.

$ myfunc ()
  {
      echo "Hello, world!"
  }

Calling a function

As mentioned earlier, a function is called by its name. Let’s call the myfunc function defined earlier.

$ myfunc

Hello, world!

The code echo ... was executed when you called the function.

Arguments

Arguments are another important component of functions. Arguments are values passed to the function from the interactive prompt or from a parent script. Arguments enable the function to use different input data in different calls.

You can pass arguments to a function by simply supplying the values after the function name. The function stores the arguments passed to it in variables $1 (first argument), $2 (second argument), … and so on.

$ myfunc2()
  {
      echo $1
      echo $2
  }

$ myfunc2 "Hello" "World"

Hello
World

Let’s call the function with new arguments. The function does not need to be defined again.

$ myfunc2 "This workshop" "is awesome."

This workshop
is awesome.
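
Besides $1 and $2, bash also provides $# (the number of arguments) and "$@" (all arguments), which are not covered in the text above. A minimal sketch, using a hypothetical function list_args:

```bash
#!/bin/bash
# list_args is a hypothetical function used only for this illustration.
list_args() {
    echo "received $# argument(s)"
    local arg
    for arg in "$@"      # quoting "$@" keeps arguments with spaces intact
    do
        echo "- $arg"
    done
}

list_args "This workshop" "is awesome."
```

Each quoted string is passed as a single argument, so the call above reports two arguments even though it contains four words.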

Return value

Unlike functions in many other programming languages, bash functions do not return data; they return only an exit status (0 for success, non-zero for failure).

Instead, values can be saved to a global variable or a file. This part will not be discussed here.

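Alternatively, the standard output of a function, e.g., from echo, can be captured with command substitution $(...) and passed on to another command, as follows.

$ echo $(myfunc2 "Hello" "World") | wc -c

$ wc -c <<< $(myfunc2 "Hello" "World")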
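
To make the difference between exit status and captured output concrete, here is a small sketch. The function is_positive is hypothetical; return sets the exit status, $? reads the status of the last command, and $(...) captures what the function prints.

```bash
#!/bin/bash
# is_positive is a hypothetical function used only for this illustration.
is_positive() {
    if [ "$1" -gt 0 ]
    then
        echo "$1 is positive"
        return 0     # success
    else
        echo "$1 is not positive"
        return 1     # failure
    fi
}

message=$(is_positive 5)     # capture standard output in a variable
status=$?                    # exit status of the command substitution
echo "output: $message"
echo "status: $status"

is_positive -3 > /dev/null   # discard the output, keep only the status
echo "status: $?"
```

Because the status is what if tests, such a function can also be used directly as a condition, e.g., if is_positive 5; then ...; fi.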

Exercise: Write your own function

1. Write a function to find the maximum of two input numbers.
2. Run the function with 1 and 2.
3. Run the function with 4 and 3.

# 1
$ max_num() {
    if [ $1 -gt $2 ]
    then
        echo $1
    else
        echo $2
    fi
  }

# 1 alternative one-liner
$ max_num() {
    [[ $1 -gt $2 ]] && echo $1 || echo $2
  }

# 2
$ max_num 1 2

# 3
$ max_num 4 3

An intro to scripting

Scripts are another way of writing reusable portable code. Scripts are files which contain code to be executed and are called interactively or from another script.

Scripts are generally considered more standalone than functions. Scripts are preferable to functions when the code is reused in many other scripts.

Writing a script

A typical UNIX script looks like this:

#!/bin/sh

your code here ...
...

Script shell

The first line of a script is a shebang, i.e., #! , followed by the shell to run the script.

A shell is a command interpreter: it reads the commands you type (or the commands in a script) and runs them.

The Bourne shell sh (or a compatible shell) is available on all modern UNIX systems at the path /bin/sh.

The Bourne-Again shell bash is present on most modern UNIX systems and is the default shell on many of them. It has more features than sh and is usually available at the path /bin/bash.

zsh, fish, ksh, dash etc. are other popular UNIX shells.

Many programming languages provide their own interpreters, which can be named in the shebang line in the same way. For example, python scripts can use #!/usr/bin/env python or #!/usr/bin/env python3, and R scripts can use #!/usr/bin/env Rscript.

If you are not sure about shell selection, bash is the safest choice for UNIX scripting.

Let’s write a simple script. Use nano to create the file as follows. Once you are done, press Ctrl+x to exit, then press Y and Enter to save the changes made to the file and return to the bash prompt.

$ nano code.sh

-----------------------------------------------------------------------------------------------
 GNU nano 3.3 beta 02                     File: code.sh
-----------------------------------------------------------------------------------------------
#!/bin/bash

echo "Hello, World!"




-----------------------------------------------------------------------------------------------
^G Get Help     ^O WriteOut     ^R Read File     ^Y Prev Page     ^K Cut Text       ^C Cur Pos
^X Exit         ^J Justify      ^W Where Is      ^V Next Page     ^U UnCut Text     ^T To Spell
-----------------------------------------------------------------------------------------------

Making script executable

So far, your script is a regular text file and does not have execution permission. You can make it executable by using chmod +x.

$ chmod +x code.sh

Calling the script

You can run the script by using its filepath (path+filename).

$ ./code.sh

Hello, World!

Arguments

Just like functions, arguments to a script can be passed by simply supplying the values after the call.

Similarly, argument values can be accessed from within script with $1, $2 and so on.
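
As a sketch of how a script reads its arguments, the block below writes a small script to a temporary directory and runs it. The script name greet.sh is made up for illustration. $0 (the name the script was called with) and $# (the number of arguments) are not covered in the text above.

```bash
#!/bin/bash
workdir=$(mktemp -d)    # scratch directory; remove it with rm -r when done

# Write the hypothetical script greet.sh; the quoted 'EOF' keeps $1, $2 literal.
cat > "$workdir/greet.sh" << 'EOF'
#!/bin/bash
# Refuse to run without exactly two arguments.
if [ $# -ne 2 ]
then
    echo "usage: $0 GREETING NAME"
    exit 1
fi
echo "$1, $2!"
EOF
chmod +x "$workdir/greet.sh"

# Two arguments: the script prints a greeting.
"$workdir/greet.sh" "Hello" "World"

# One argument: the script prints its usage line and exits with status 1.
"$workdir/greet.sh" "Hello" || echo "the script exited with an error"
```

Checking $# at the top of a script is a simple way to catch a missing argument before any files are created.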

Parsing the argument

Argument parsing in a script allows

passing arguments without a fixed order

passing options without any arguments

An example is ls -l, where -l is an option that takes no argument.

We do not cover argument parsing in this lesson; the bash built-in getopts is a common starting point.

If you are publishing your own script for others to use, it is considered best practice to adequately comment each step of the script.

Exercise: Write your own script

Use what you have learnt so far to write a script that does the following:
a) Create a directory, named by the first argument, in the current directory.
b) Create a file, named by the second argument, within the directory from step a.
c) Write the content of the third argument to the file from step b.
d) Print the content of the file created in step b.

Next, run the script with following arguments:

first argument: demo

second argument: text

third argument: “Success!”

# 1
nano script.sh
------------------------------------------------------------------------------------------
 GNU nano 3.3 beta 02                   File: script.sh
------------------------------------------------------------------------------------------
#!/bin/bash

# 1a - creating a directory from first argument
mkdir -p "$1"

# 1b - creating a file from second argument within the directory
touch "$1/$2"

# 1c - writing third argument to the file created
echo "$3" > "$1/$2"

# 1d - reading the content of the file
cat "$1/$2"

------------------------------------------------------------------------------------------
^G Get Help    ^O WriteOut    ^R Read File    ^Y Prev Page    ^K Cut Text      ^C Cur Pos
^X Exit        ^J Justify     ^W Where Is     ^V Next Page    ^U UnCut Text    ^T To Spell
------------------------------------------------------------------------------------------
# 2
$ chmod +x script.sh     # making script executable
$ ./script.sh demo text "Success!"
Success!

Using the -p option with mkdir creates any missing parent directories and avoids an error when the directory already exists.

Parallelization

In computer science, parallelization is the execution of multiple, often similar/repeated tasks concurrently.

Parallelization is important in bioinformatics because of:

large datasets leading to lengthy computation times, e.g., 50 million reads or a 30 Gb file
similar processing for each unit of data, e.g., alignment of each read to a reference genome using identical processes
availability of computing clusters with large computing resources

Let’s look at a demo example of parallelization using GNU parallel.

First, let’s define a function that takes some time to complete.

$ myfunc3()
  {
    sleep 2   # suspend computing for 2 seconds
    echo $1
  }

We can execute this function n times sequentially, i.e., not in parallel, using a for loop.

$ for i in {1..10}; do myfunc3 $i; done

We need some way to monitor the time taken. Let’s use some simple math for this, i.e., subtract the start time from the end time. The bash variable SECONDS holds the number of seconds since the shell was started.

$ start=$SECONDS   # Start time
$ for i in {1..10}; do myfunc3 $i; done
$ end=$SECONDS     # End time
$ echo -e "---\nExecution time is $(echo "$end - $start" | bc -l) seconds."

1
2
3
4
5
6
7
8
9
10
---
Execution time is 20 seconds.
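
The same timing can be done without bc, since bash can do integer arithmetic itself with $(( ... )). The sketch below wraps this in a hypothetical helper, time_it, which runs any command and reports the elapsed whole seconds.

```bash
#!/bin/bash
# time_it is a hypothetical helper used only for this illustration.
time_it() {
    local start=$SECONDS
    "$@" > /dev/null              # run the command given as arguments
    echo $(( SECONDS - start ))   # elapsed time in whole seconds
}

elapsed=$(time_it sleep 2)
echo "Execution time is $elapsed seconds."
```

SECONDS counts whole seconds only, so tasks shorter than a second may be reported as 0.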

Now let us parallelize.

Parallelization with background tasks

The easiest way to parallelize in bash is by using background tasks. This is accomplished by adding & at the end of the command (within the loop).

$ set +m           # Disable monitoring mode to prevent job status output
$ start=$SECONDS   # Start time
$ for i in {1..10}; do myfunc3 $i & done
$ wait             # Wait for all tasks to complete
$ end=$SECONDS     # End time
$ echo -e "---\nExecution time is $(echo "$end - $start" | bc -l) seconds."

1
2
3
4
5
6
7
8
9
10
---
Execution time is 2 seconds.

Your screen output might look slightly different: background tasks can finish in any order, and the background task management in RedHat Linux (used by HiperGator) prints some more job ID information on the screen. Nonetheless, the output of the myfunc3 function should contain the same lines as above.
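
A plain wait does not tell you whether each background task succeeded. As a sketch of one way to check, the block below stores the process ID of every task ($! holds the ID of the most recent background task) and waits for each one in turn. The function check_sample and the sample names are made up for illustration.

```bash
#!/bin/bash
# check_sample is a hypothetical task that fails for the sample named "bad".
check_sample() {
    sleep 1
    [ "$1" != "bad" ]
}

samples=(s1 bad s3)
pids=()
for s in "${samples[@]}"
do
    check_sample "$s" &     # run the task in the background
    pids+=("$!")            # remember its process ID
done

failed=0
for pid in "${pids[@]}"
do
    # wait PID returns the exit status of that particular task
    wait "$pid" || failed=$(( failed + 1 ))
done
echo "$failed job(s) failed"
```

All three tasks run at the same time, so the whole block takes about one second rather than three.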

GNU parallel

However, the built-in parallelization methods in UNIX systems are not very intuitive or flexible, and they sometimes vary slightly between UNIX distributions. GNU parallel is a great alternative for parallelization. Among other benefits, it can set the number of CPU cores per job based on the number of jobs and has options to specify the minimum memory for a job, which makes it a useful tool in bioinformatics.

On a personal computer, parallel needs to be installed first, e.g., from source. However, GNU parallel is available on HiperGator; it only needs to be loaded.

$ ml parallel

User-defined objects (variables and functions) are not accessible within parallel by default, so the function myfunc3 has to be exported (made available to child processes) first.

$ export -f myfunc3
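
The effect of export -f can be checked without parallel: a function becomes visible to a child bash process only after it has been exported. A minimal sketch, using a hypothetical function shout_demo:

```bash
#!/bin/bash
# shout_demo is a hypothetical function used only for this illustration.
shout_demo() {
    echo "$1!"
}

# Before exporting, a child bash process does not know the function.
bash -c 'shout_demo hello' 2> /dev/null || echo "child shell: shout_demo not found"

# After exporting, the child process can call it.
export -f shout_demo
bash -c 'shout_demo hello'
```

GNU parallel starts its jobs in child processes in a similar way, which is why the export step is needed.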

Now, let’s run 5 jobs at the same time.

$ start=$SECONDS   # Start time
$ parallel -j5 --ungroup --will-cite myfunc3 ::: {1..10}
$ end=$SECONDS     # End time
$ echo -e "---\nExecution time is $(echo "$end - $start" | bc -l) seconds."

Breakdown of GNU parallel

parallel is the command name.

-j5 specifies 5 jobs at a time.

--ungroup specifies printing output as soon as it is available, rather than grouping it per job (optional).

--will-cite is added to suppress the prompt to cite parallel (optional).

myfunc3 is the function to execute.

::: separates the command from its arguments, and {1..10} expands to the 10 arguments 1 to 10.

The function myfunc3 will be run 10 times, each time with a number from 1 to 10 as its argument.

1
2
3
4
5
6
7
8
9
10
---
Execution time is 5 seconds.

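Thus, with parallelization, we were able to reduce the execution time from 20 seconds to 5 seconds: the 10 jobs ran in two batches of 5, each batch taking about 2 seconds, plus some startup overhead.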
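
Two further features of GNU parallel are often useful: the replacement string {}, which stands for the current input value, and -k (--keep-order), which prints results in the order of the inputs. The sketch below assumes GNU parallel may or may not be installed, so it checks first; the function name parallel_demo and the sample names A, B and C are made up for illustration.

```bash
#!/bin/bash
# parallel_demo is a hypothetical wrapper used only for this illustration.
parallel_demo() {
    # Run the demo only if the installed parallel is GNU parallel.
    if parallel --version 2> /dev/null | grep -q "GNU parallel"
    then
        # {} is replaced by each input value in turn;
        # -k keeps the output in the same order as the inputs.
        parallel -j2 -k --will-cite echo "sample {}" ::: A B C
    else
        echo "GNU parallel not found; skipping the demo"
    fi
}

parallel_demo
```

With -k the lines for A, B and C appear in input order even if a later job finishes first, which helps when the output is compared against a sample list.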

Become a shellscript pro

Advanced Bash-Scripting Guide
