Phylogenetic tree
Overview
Teaching: 30 min
Exercises: 15 minQuestions
How can we run other software and create an analyses pipeline in the script?
Objectives
Calling preinstalled software from the cluster computer
Run a genomic analyses pipeline.
Sequence alignment
Sequence alignment is the basic and the most important step in phylogenetic analysis. A wrong sequence alignment will lead to wrong result. Various alignment tools are used to identify similarity regions that indicate fuctional, structural, and/or evolutionary relationships sequences. Multiple sequence alignment tools such as ClustalOmega, Muscle, MAFFT are commonly used. They vary in terms of algorithms used; ClustalOmega uses HMM profile-profile techniques, and MAFFT uses Fast Fourier Transforms and both are considered good for medium to large alignments. Muscle is regarded good with proteins for medium alignments.
We will use MAFFT sequence alignment tool.
$ module load mafft
$ mafft avrBs2_all_genomes.fas > avrBs2_all_genomes_aligned.fas
nthread = 0
generating a scoring matrix for nucleotide (dist=200) ... done
Gap Penalty = -1.53, +0.00, +0.00
Making a distance matrix ..
1 / 20
done.
...
...
$ more avrBs2_all_genomes_aligned.fas
Make sure you are still in phylogeny folder. You can use
pwd
to check.
Reminder:
more
displays small chunks of the file at a time. You can scroll down the file using enter key. You can return to command line prompt by pressing q key.
The two files avrBs2_all_genomes.fas and avrBs2_all_genomes_aligned.fas may not look very different at the moment considering the gene was present in all the sample genomes used here as seen from blast results. However, aligning the sequence prior to phylogenetic reconstruction is always valuable. Now, we have an input file ready to use for a software that can run a maximum likelihood phylogentic analysis. Read more about phylogenetic trees here.
Phylogenetic tree
There are several phylogenetic analyses software and tools that can be used. We will run a Maximum likelhood phylogenetic analysis. There are other models/methods such as Neighbor-Joining, Maximum parsimony, and Bayesian. However, our focus is on application of command line, we will only focus on running a specific analysis here. Maximum Likelihood uses probabilistic values to model the evolution and the tree with the highest probability is shared.
We will use RAxML - Randomized Axelerated Maximum Likelihood tool to construct the tree that we will submit through a batch file.
The SLURM submission script is present in `/blue/general_workshop/share/scripts/slurm_tree.sh
$ cp ../../share/scripts/slurm_tree.sh ./
$ tail -n8 slurm_tree.sh
# Load RAxML
module load raxml
# Make tree with RAxML
raxmlHPC -d -p 12345 -m GTRGAMMAI -s avrBs2_all_genomes_aligned.fas -n avrBs2_tree
# Success message
echo 'ML tree output in RAxML_bestTree.avrBs2_tree'
The argument -n
allows us to name the output suffix. In our case, the output will
be name RAxML_bestTree.avrBs2_tree
.
Help with arguments
To understand what other arguments mean, you can check the help file
ml raxml; raxmlHPC -h
.
Edit the email address in the SLURM submission script and submit the job.
$ nano slurm_tree.sh
$ sbatch slurm_tree.sh
Running bootstraps
To bootstrap the tree, add arguments
-b -#1000
where1000
is the number of bootstraps. If you run bootstraps, the output will be in a file namedRAxML_bootstrap.<suffix>
. To understand about the arguments, you can runml raxml; raxmlHPC -h
Tree visualization
We can download this tree file in our own computer and visualize using ‘Figtree’.
Alternatively, we can use commandline-based utilities to generate image.
We’ll use ggtree
package in R
programming language.
R
and all necessary packages are already installed in Hipergator.
A helper R script is available from scripts directory.
Run the script as follows:
Rscript <script> <input file> <output name>
$ ml R
$ Rscript ../../share/scripts/ggtree.R RAxML_bestTree.avrBs2_tree tree_avrbs2.pdf
$ ls tree_avrbs2.pdf
tree_avrbs2.pdf
Now we can navigate to OpenDemand dashboard to Files >
/blue/general_workshop > username
/phylogeny/.
Select the file tree_avrbs2.pdf and click the Download button.
(Optional) Trees in ASCII!
To quickly visualize the tree within commandline, try this:
$ ml newick_utils/1.6 $ nw_display RAxML_bestTree.avrBs2_tree
Phylogenetic tree pipeline
We have now successfully extracted a gene of interest from our genomes and generated a phylogenetic tree. However, we completed the entire process using several distinct steps where we evaluated output of one step before starting to run the next step. In real world scenario, most of the intermediate results are not of interest, so those disjoint steps are often chained into one or few longer workflow, referred to as pipeline.
Question: Would it be possible to develop a script where we only need to provide external (generated outside the pipeline usch as gene and genome sequences) inputs and get the final result (the phylogennetic tree) in return directly?
YES*
For this, we will merge all the steps together into one script.
Warning: *Chaining multiple steps is not always straightforward. We have to be vigilant for what might change when the steps are automated and intermediate results are not evaluated. For example, we have to make sure that the genes of interest are actually in the genome. Otherwise, the BLAST may output match to different closely related gene, gene sequence alignment may add gaps and give us a wrong output and so on. Understanding what might go wrong comes from experience in bioinformatics.
Let’s put everything together into one script. The input files are
available in /blue/general_workshop/share/all_in_one/
and the
script is available as /blue/general_workshop/share/scripts/slurm_pipeline.sh
.
$ cd /blue/general_workshop/<username>
$ mkdir pipeline
$ cd pipeline
$ cp -r ../../share/phylogeny/* ./
$ cp ../../share/scripts/slurm_pipeline.sh ./
$ cp ../../share/scripts/blast2fasta.sh ./
$ ls
avrBs2.fas GEV1054.fasta GEV909.fasta GEV968.fasta Xeu.fasta
blast2fasta.sh GEV1063.fasta GEV915.fasta GEV993.fasta Xg.fasta
GEV1001.fasta GEV839.fasta GEV917.fasta slurm_pipeline.sh Xp.fasta
GEV1026.fasta GEV893.fasta GEV936.fasta TB15.fasta
GEV1044.fasta GEV904.fasta GEV940.fasta Xc.fasta
$ tail -n23 slurm_pipeline.sh
# Load programs
module load ncbi_blast
module load mafft
module load raxml
# Loop over each genome
for genome in `ls *.fasta | sed 's/.fasta//g'`
do
# Make database
makeblastdb -in "$genome.fasta" -dbtype nucl -out "$genome"
# Run blastn on that database
blastn -query avrBs2.fas -db "$genome" -out $genome"_avrBs2.out" -outfmt 5 -evalue 0.001
done
# Combine all outputs and parse into multiFASTA file
cat *_avrBs2.out | sh blast2fasta.sh > avrBs2_all_genomes.fas
# Align sequences with MAAFT
mafft avrBs2_all_genomes.fas > avrBs2_all_genomes_aligned.fas
# Make tree with RAxML
raxmlHPC -d -p 12345 -m GTRGAMMAI -s avrBs2_all_genomes_aligned.fas -n avrBs2_tree
After editing the email address in nano
, you are ready to submit the submission script.
$ sbatch slurm_pipeline.sh
Visualize tree
Once the job finishes, you can transfer all the outputs to your personal computer. Then, you can visualize the ‘RAxML_bestTree.avrBs2_tree’ using Figtree software. Setup page has instructions on how to install Figtree.
Key Points
Don’t forget to check parameters of the software. Can be viewed using ‘-h’ followed by the software name.