BLAST search
Overview
Teaching: 25 min
Exercises: 35 minQuestions
How can we conduct local blast searches using the terminal?
Objectives
Learn how to conduct blast searches locally using the terminal
Basic Local Alignment Search Tool (Altschul S et al.) is a set of programs that search sequence database for statistically significant similarities. There are five traditional BLAST programs available in BLAST+ tools in commandline:
Program | Description |
---|---|
BLASTN | compares a nucleotide query sequence against a nucleotide sequence database. |
BLASTP | compares an amino acid query sequence against a protein sequence database. |
BLASTX | compares a nucleotide query sequence translated in a ll reading frames against a protein sequence database. |
TBLASTN | compares a protein query sequence against a nucleotide sequence database dynamically translated in all reading frames. |
TBLASTX | compares the six-frame translations of a nucleotide query sequence against the six-frame translations of a nucleotide sequence database. |
These are the same blast applications what you can see in the NCBI blast website. Read more about command line BLAST here.
BLAST terminology
- Query
- The sequence for which the search is performed.
- Database
- A set of sequences on which search is performed to find matches for the query sequence.
- Subject
- A single sequence within the database.
- Score (S)
- A value representing quality of each alignment.
- It is calculated as the sum of substitution and gap scores.
- Expectation value (E-value or E).
- A value representing probabilistic significance of each alignment.
- A number of different alignments with scores equivalent to or better than ‘S’ that are expected to occur in database search by chance.
- The lower the E value, the more significant the score.
Application of BLAST
BLAST can be run online (e.g. NCBI blast) as well as locally. Few applications of local BLAST are:
- to compare against multiple genomes of interest,
- to output in different formats that can be used in downstream applications.
Outline of sequences to BLAST
We will use the local BLAST to identify if the bacterial genomes carry a gene of interest. We will use bacterial spot causing Xanthomonas and one of the conserved effector ‘AvrBs2’ sequence as an example today. More information on Xanthomonas is available here.
Pathogen effectors proteins
Effectors are proteins that are secreted by pathogenic bacteria into the plant cells to. manipulate host prcoesses and achieve colonization. The effector gene we are analyzing in this course named avrBs2 is one of the most studied bacterial effectors because of its significant virulence contribution and is conserved across multiple Xanthomonas species.
In the folder /blue/share/xanthomonas/
, there are four Xanthomonas species genomes.
- Xanthomonas euvesicatoria →
Xeu.fasta
- Xanthomonas perforans →
Xp.fasta
- Xanthomonas gardneri →
Xg.fasta
- Xanthomonas citri →
Xc.fasta
The folder also contains nucleotide sequence for avrBs2 gene as avrBs2.fas
.
Our objective would be to see if the bacterial genomes carry the effector gene.
Before starting, copy all the required files to your working directory.
$ cd /blue/general_workshop/<username>
$ cp -r ../share/xanthomonas ./
$ ls
demo strep slurm xanthomonas
$ cd xanthomonas
$ ls
avrBs2.fas Xc.fasta Xeu.fasta Xg.fasta Xp.fasta
Loading BLAST in Hipergator cluster
First, we need the BLAST program.
BLAST tools is already installed in HiperGator, but not available by default.
We can use module load
command to load the program.
$ module load ncbi_blast
Creating BLAST database
Next, we need to specify the database to blast against,
which in our case are the genome files Xeu.fasta
, Xp.fasta
, Xg.fasta
, and Xc.fasta
.
makeblastdb
command, which is a part of BLAST+ package,
is used for creating BLAST databases from FASTA files.
$ makeblastdb -in Xeu.fasta -out Xeu -dbtype nucl
Building a new DB, current time: 09/15/2020 02:25:22
New DB name: /blue/general_workshop/<username>/xanthomonas/Xeu
New DB title: Xeu.fasta
Sequence type: Nucleotide
Keep MBits: T
Maximum file size: 1000000000B
Adding sequences from FASTA; added 3 sequences in 0.0926349 seconds.
$ ls
avrBs2.fas Xeu.ndb Xeu.not Xeu.ntf Xg.fasta
Xc.fasta Xeu.nhr Xeu.nsq Xeu.nto Xp.fasta
Xeu.fasta Xeu.nin
New database files with extensions
.ndb
,.nhr
,.nsq
etc. have been created.
If we were making a protein database, we would use
-dbtype prot
and provide amino acid fasta file.
Understanding command arguments
For most unix commandline commands, you can use
-h
or--help
argument/flag to see help or options. This will include list of arguments and their meanings. In this case,makeblastdb -h
command displays options avaiable formakeblastdb
.
Alternatively,man <command>
shows manual for the command.
Creating BLAST database in batch
Making database individually for all genomes will be manually laborious.
Instead, we can identify common patterns in the name of the genome files and
execute makeblastdb
in a loop.
What is one pattern that is common in all the genome files?
The extension - ‘.fasta’.
We can specify all files in current path with fasta extension by ./*.fasta
.
The command below has backticks
`
, which is the symbol just left to 1 key and just above the Tab key in the keyboard. Do not mistake it for quotes.
$ genomes=`ls *.fasta | sed 's/.fasta//g'`
$ echo $genomes
Xc Xeu Xg Xp
$ for genome in $genomes; do makeblastdb -in "$genome.fasta" -out $genome -dbtype nucl; done
...
...
Enclosing commands in
`
(backticks)
`
act like brackets in math. First, the commands inside backticks are executed, and then the output is used for commands outside.
One liners
Short loops can be written in a same line by separating commands with
;
.;
is equivalent to pressing Enter.
sed
command is being used to remove.fasta
extension from list of names.
For maually selecting the databases, you can use
for
loop like this:for genome in Xp Xg Xc; do makeblastdb -in "$genome.fasta" -out $genome -dbtype nucl; done
Merging database with
blastdb_aliastool
BLAST+ also includes a command
blastdb_aliastool
for combining databases; however, it is outside the scope of this workshop.
Usage:
blastdb_aliastool -dblist "Xeu Xp Xg Xc" -dbtype nucl -out Xspp -title "Xanthomonas genomic"
Performing BLAST search
We are now ready to do a BLAST search. Since both the query (avrBs2.fasta
)
and the database are nucleotide sequences, we will perform blastn
.
$ blastn -query avrBs2.fas -db Xeu -out Xeu_avrBs2.out -evalue 0.001
$ ls *.out
Xeu_avrBs2.out
Performing BLAST search in multiple databases in batch
We can blast multiple databases in a loop as well.
Let’s do this in a different way from previous example.
First lets create a list of all databases to BLAST against and save it into a file.
We can use nano
for writing to a file.
$ nano
A basic text editor will open. Type in all the databases to search against.
-----------------------------------------------------------------------------------------------
GNU nano 3.3 beta 02 New Buffer
-----------------------------------------------------------------------------------------------
Xp
Xg
Xc
-----------------------------------------------------------------------------------------------
^G Get Help ^O WriteOut ^R Read File ^Y Prev Page ^K Cut Text ^C Cur Pos
^X Exit ^J Justify ^W Where Is ^V Next Page ^U UnCut Text ^T To Spell
-----------------------------------------------------------------------------------------------
You do not have to press enter and go to the next line after typing ‘Xc’. The next line will be read as an empty line and give error message. Although the additional line and error message due to the empty line does not affect our outputs here, it might be relevant for other situations.
Press Ctrl+o to save the file.
Give it a name dblist.txt
and press Enter.
Press Ctrl+x to return to bash prompt.
Now we can run the blastn
in loop.
$ for dbname in `cat dblist.txt`
> do
> blastn -query avrBs2.fas -db "$dbname" -out $dbname"_avrBs2.out" -evalue 0.001
> done < dblist.txt
Exercise: Performing blast search in SLURM
We can also run
blast
as a SLURM job, which is useful for long and resource intensive search.The SLURM submission has been prepared for your and is available as
/blue/general_workshop/share/scripts/slurm_blast.sh
. Genome and query files are available in/blue/general_workshop/share/xanthomonas
. You can use the checklist to mark progress.
- Change your location to your working directory
/blue/general_workshop/<username>
.- Make a folder in your working directory called
slurm_blast
and enter that directory.- Copy all genome and query fasta files to current directory.
- Copy the submission script to the current directory.
- Open the script in nano and edit the email address.
- Submit the job to SLURM.
- After job is completed, check if output files exist in current directory.
#1 $ cd /blue/general_workshop/<username> #2 $ mkdir slurm_blast $ cd slurm_blast #3 $ cp ../../share/xanthomonas/* ./ #4 $ cp ../../share/scripts/slurm_blast.sh ./ #5 $ nano slurm_blast.sh → edit email address → ctrl+x → y #6 $ sbatch slurm_blast.sh #7 $ ls *.out
Key Points
Using UNIX commands, we can customize blast searches.