Phylogenetic tree

Overview

Teaching: 30 min
Exercises: 15 min
Questions
  • How can we run other software and create an analyses pipeline in the script?

Objectives
  • Calling preinstalled software from the cluster computer

  • Run a genomic analyses pipeline.

Sequence alignment

Sequence alignment is the basic and the most important step in phylogenetic analysis. A wrong sequence alignment will lead to wrong result. Various alignment tools are used to identify similarity regions that indicate fuctional, structural, and/or evolutionary relationships sequences. Multiple sequence alignment tools such as ClustalOmega, Muscle, MAFFT are commonly used. They vary in terms of algorithms used; ClustalOmega uses HMM profile-profile techniques, and MAFFT uses Fast Fourier Transforms and both are considered good for medium to large alignments. Muscle is regarded good with proteins for medium alignments.

We will use MAFFT sequence alignment tool.

$ module load mafft

$ mafft avrBs2_all_genomes.fas > avrBs2_all_genomes_aligned.fas
nthread = 0
generating a scoring matrix for nucleotide (dist=200) ... done
Gap Penalty = -1.53, +0.00, +0.00

Making a distance matrix ..
    1 / 20
done.
...
...
$ more avrBs2_all_genomes_aligned.fas

Make sure you are still in phylogeny folder. You can use pwd to check.

Reminder: more displays small chunks of the file at a time. You can scroll down the file using enter key. You can return to command line prompt by pressing q key.

The two files avrBs2_all_genomes.fas and avrBs2_all_genomes_aligned.fas may not look very different at the moment considering the gene was present in all the sample genomes used here as seen from blast results. However, aligning the sequence prior to phylogenetic reconstruction is always valuable. Now, we have an input file ready to use for a software that can run a maximum likelihood phylogentic analysis. Read more about phylogenetic trees here.

Phylogenetic tree

There are several phylogenetic analyses software and tools that can be used. We will run a Maximum likelhood phylogenetic analysis. There are other models/methods such as Neighbor-Joining, Maximum parsimony, and Bayesian. However, our focus is on application of command line, we will only focus on running a specific analysis here. Maximum Likelihood uses probabilistic values to model the evolution and the tree with the highest probability is shared.

We will use RAxML - Randomized Axelerated Maximum Likelihood tool to construct the tree that we will submit through a batch file.

The SLURM submission script is present in `/blue/general_workshop/share/scripts/slurm_tree.sh

$ cp ../../share/scripts/slurm_tree.sh ./

$ tail -n8 slurm_tree.sh
# Load RAxML
module load raxml

# Make tree with RAxML
raxmlHPC -d -p 12345 -m GTRGAMMAI -s avrBs2_all_genomes_aligned.fas -n avrBs2_tree

# Success message
echo 'ML tree output in RAxML_bestTree.avrBs2_tree'

The argument -n allows us to name the output suffix. In our case, the output will be name RAxML_bestTree.avrBs2_tree.

Help with arguments

To understand what other arguments mean, you can check the help file ml raxml; raxmlHPC -h.

Edit the email address in the SLURM submission script and submit the job.

$ nano slurm_tree.sh

$ sbatch slurm_tree.sh

Running bootstraps

To bootstrap the tree, add arguments -b -#1000 where 1000 is the number of bootstraps. If you run bootstraps, the output will be in a file named RAxML_bootstrap.<suffix>. To understand about the arguments, you can run ml raxml; raxmlHPC -h

Tree visualization

We can download this tree file in our own computer and visualize using ‘Figtree’.

Alternatively, we can use commandline-based utilities to generate image. We’ll use ggtree package in R programming language.

R and all necessary packages are already installed in Hipergator. A helper R script is available from scripts directory. Run the script as follows:

Rscript <script> <input file> <output name>

$ ml R

$ Rscript ../../share/scripts/ggtree.R RAxML_bestTree.avrBs2_tree tree_avrbs2.pdf

$ ls tree_avrbs2.pdf
tree_avrbs2.pdf

Now we can navigate to OpenDemand dashboard to Files > /blue/general_workshop > username/phylogeny/. Select the file tree_avrbs2.pdf and click the Download button.

Phylogenentic tree

(Optional) Trees in ASCII!

To quickly visualize the tree within commandline, try this:

$ ml newick_utils/1.6

$ nw_display RAxML_bestTree.avrBs2_tree

Phylogenetic tree pipeline

We have now successfully extracted a gene of interest from our genomes and generated a phylogenetic tree. However, we completed the entire process using several distinct steps where we evaluated output of one step before starting to run the next step. In real world scenario, most of the intermediate results are not of interest, so those disjoint steps are often chained into one or few longer workflow, referred to as pipeline.

Question: Would it be possible to develop a script where we only need to provide external (generated outside the pipeline usch as gene and genome sequences) inputs and get the final result (the phylogennetic tree) in return directly?

YES*

For this, we will merge all the steps together into one script.

Warning: *Chaining multiple steps is not always straightforward. We have to be vigilant for what might change when the steps are automated and intermediate results are not evaluated. For example, we have to make sure that the genes of interest are actually in the genome. Otherwise, the BLAST may output match to different closely related gene, gene sequence alignment may add gaps and give us a wrong output and so on. Understanding what might go wrong comes from experience in bioinformatics.

Let’s put everything together into one script. The input files are available in /blue/general_workshop/share/all_in_one/ and the script is available as /blue/general_workshop/share/scripts/slurm_pipeline.sh.

$ cd /blue/general_workshop/<username>

$ mkdir pipeline

$ cd pipeline

$ cp -r ../../share/phylogeny/* ./

$ cp ../../share/scripts/slurm_pipeline.sh ./

$ cp ../../share/scripts/blast2fasta.sh ./

$ ls
avrBs2.fas         GEV1054.fasta     GEV909.fasta     GEV968.fasta          Xeu.fasta
blast2fasta.sh     GEV1063.fasta     GEV915.fasta     GEV993.fasta          Xg.fasta
GEV1001.fasta      GEV839.fasta      GEV917.fasta     slurm_pipeline.sh     Xp.fasta
GEV1026.fasta      GEV893.fasta      GEV936.fasta     TB15.fasta
GEV1044.fasta      GEV904.fasta      GEV940.fasta     Xc.fasta
$ tail -n23 slurm_pipeline.sh
# Load programs
module load ncbi_blast
module load mafft
module load raxml

# Loop over each genome
for genome in `ls *.fasta | sed 's/.fasta//g'`
do
  # Make database
   makeblastdb -in "$genome.fasta" -dbtype nucl -out "$genome"

  # Run blastn on that database
   blastn -query avrBs2.fas -db "$genome" -out $genome"_avrBs2.out" -outfmt 5 -evalue 0.001
done

# Combine all outputs and parse into multiFASTA file
cat *_avrBs2.out | sh blast2fasta.sh > avrBs2_all_genomes.fas

# Align sequences with MAAFT
mafft avrBs2_all_genomes.fas > avrBs2_all_genomes_aligned.fas

# Make tree with RAxML
raxmlHPC -d -p 12345 -m GTRGAMMAI -s avrBs2_all_genomes_aligned.fas -n avrBs2_tree

After editing the email address in nano, you are ready to submit the submission script.

$ sbatch slurm_pipeline.sh

Visualize tree

Once the job finishes, you can transfer all the outputs to your personal computer. Then, you can visualize the ‘RAxML_bestTree.avrBs2_tree’ using Figtree software. Setup page has instructions on how to install Figtree.

Key Points

  • Don’t forget to check parameters of the software. Can be viewed using ‘-h’ followed by the software name.