Using BLAST to compare sequences to a sequence database.

Tutorial adapted and abridged from Charles Delwiche's Bioinformatics course.

Consider the following DNA sequence:

ATTTGGAGCATCATGCCTGCAAACTCCGAGAAGGAGCACCTCTCCATCGT
GATTTGCGGCCATGTCGACAGTGGCAAGAGCACCACAACAGGGCGGCTCA
TCTTCGAGCTCGGTGGCCTTCCAGAGCGCGAACTTGACAAGCTGAAGCAG
GAGGCTGAGCGTCTTGGGAAAGGTTCTTTCGCCTTTGCATTCTACATGGA
CCGGCAGAAGGAGGAGCGTGAGCGTGGGGTGACCATCGCTTGCACCACGA
AGGAGTTCTACACCGAGAAGTGGCACTACACAATCATTGATGCACCGGGC
CACCGTGATTTCATCAAGAACATGATCACGGGTGCATCCCAGGCTGATGT
CGCACTCATCATGGTTCCCGCAGACGGAAACTTCACGACAGCAATCGCCA
AGGGCAACCACAAGGCGGGGGAAATCCAGGGCCAGACCAGGCAGCATTCC
CGGCTCATCAACTTGCTTGGCGTGAAGCAGATCTGCATTGGCGTGAACAA
GATGGACTGCGACACGGCGGCATACAAGCAGGCCCGTTATGATGAGATTG
CAAATGAGATGAAGAGCATGCTCGTGAANGTCGGGTGGAAGAAGGACTTT
ATTCGAGAAAACACACCCGTGATGCCCATCT

This is a DNA sequence that was obtained by random screening of a cDNA library, and we would like to learn more about it. One easy way to gain insight into a sequence is to find out whether it resembles sequences that have already been reported in other studies. To do this, we will use the program BLAST (the Basic Local Alignment Search Tool) to compare the sequence to the GenBank database maintained by NCBI (the National Center for Biotechnology Information, a branch of the NIH National Library of Medicine). The sequence above will serve as the query sequence. The actual analysis will be run on a massively parallel supercomputer operated by NCBI as a service to the research community. There are several ways to submit searches to the blast server; we will use the web interface.
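Before pasting a query into a search form, it can be useful to join the wrapped lines and check for ambiguous characters (the sequence above contains at least one N). The following is a minimal sketch in Python; the function names clean_sequence, summarize, and to_fasta are made up for this illustration and are not part of BLAST or any NCBI tool.

```python
from collections import Counter


def clean_sequence(raw):
    """Join wrapped lines, drop all whitespace, and upper-case the result."""
    return "".join(raw.split()).upper()


def summarize(seq):
    """Report length, number of non-ACGT characters, and GC fraction."""
    counts = Counter(seq)
    ambiguous = sum(n for base, n in counts.items() if base not in "ACGT")
    gc = counts["G"] + counts["C"]
    return {
        "length": len(seq),
        "ambiguous": ambiguous,
        "gc_fraction": gc / len(seq) if seq else 0.0,
    }


def to_fasta(name, seq, width=50):
    """Format a sequence as a FASTA record with fixed-width lines."""
    lines = [seq[i:i + width] for i in range(0, len(seq), width)]
    return ">" + name + "\n" + "\n".join(lines) + "\n"


if __name__ == "__main__":
    # A short stand-in; paste the full query sequence here instead.
    raw = """ATTTGGAGCATCATGCCTGC
             AAACTCCGAGAAGGAGCACN"""
    seq = clean_sequence(raw)
    print(summarize(seq))
    print(to_fasta("query", seq), end="")
```

Ambiguous bases such as N are accepted in a nucleotide query, but knowing how many there are helps when interpreting mismatches in the alignments.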

First, copy the sequence. Then go to the NCBI web site (http://www.ncbi.nlm.nih.gov/), follow the link for BLAST on the NCBI home page, and then follow the link for Standard nucleotide-nucleotide BLAST [blastn]. Paste the sequence into the space provided and click on the button that says BLAST!

The page will be replaced with a page called "formatting BLAST." Notice that it provides you with a blast ID number, an estimate of how long it will take for the results to be returned, and some formatting options.

While you are waiting for your blast results to be returned, open up another browser window and explore the NCBI home page. NCBI provides many useful resources, and it pays to know your way around them. You can also read the blast overview (http://www.ncbi.nlm.nih.gov/BLAST/blast_overview.html) and other information linked to the blast page.

After waiting for a seemly period of time, go back to the "formatting BLAST" page and click on the FORMAT button. The results of your blast search will be displayed on a new web page. At the top is information on how to cite this analysis in scientific publications and on the nature of your search. Below that is a set of colored lines that illustrate the results of the search, then text describing the results, and finally more text showing examples of the best matches.

Mouse over the colored lines and notice how the display changes. Look at how this information correlates with the text further down the page, and notice that there are links to the sequences which the query sequence matched.

What inferences about this sequence can you make from this information?

What is the identity of the sequence?

What gene do you think it encodes?

What organism do you think it comes from?

How reliable do you think this inference is? Why?

Hint: look at the bit score, at the e-value, and at the individual matches (notice that there are links you can follow).
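To see how the bit score and the e-value relate, the standard approximation is E = m * n * 2^(-S), where S is the bit score, m is the query length, and n is the total length of the database. The sketch below is only illustrative: the function name expected_hits is made up here, and BLAST itself uses edge-corrected "effective" lengths, so the e-values it reports will not match this calculation exactly.

```python
def expected_hits(bit_score, query_length, database_length):
    """Approximate e-value: E = m * n * 2**(-S).

    Assumes raw (uncorrected) lengths, so this is a rough guide only.
    """
    return query_length * database_length * 2.0 ** (-bit_score)


if __name__ == "__main__":
    # Hypothetical numbers chosen for illustration, not real search output.
    query_length = 600
    database_length = 10 ** 10
    for bits in (30, 50, 100, 200):
        print(bits, expected_hits(bits, query_length, database_length))
```

Two things follow from the formula: each additional bit of score halves the e-value, and the same bit score becomes less significant as the database grows.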


Recall that the sequence was from a cDNA library. That means that it is probably a protein-coding sequence. Blast is more sensitive to subtle patterns in amino acid sequences than in nucleotide sequences, so it can be helpful to try a search that takes advantage of the information that this is a protein-coding sequence. We don't know the reading frame of the sequence, so we will want to search a translation of the sequence in all six possible reading frames (three on each strand) against a protein database.
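To make "all six possible reading frames" concrete, here is a minimal sketch in Python of a six-frame translation using the standard genetic code. It illustrates the idea only and is not how blastx is implemented; the function names and the frame labels (+1 to +3, -1 to -3) are choices made for this example, and any codon containing an ambiguous base such as N is translated as X.

```python
BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ...
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}
COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq):
    """Return the reverse complement of an upper-case DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def translate(seq):
    """Translate complete codons; '*' is a stop, 'X' an unknown codon."""
    protein = []
    for i in range(0, len(seq) - 2, 3):
        protein.append(CODON_TABLE.get(seq[i:i + 3], "X"))
    return "".join(protein)


def six_frame_translations(seq):
    """Translate three frames on each strand, keyed '+1'..'+3', '-1'..'-3'."""
    seq = seq.upper()
    frames = {}
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for offset in range(3):
            frames[strand + str(offset + 1)] = translate(s[offset:])
    return frames


if __name__ == "__main__":
    for frame, protein in six_frame_translations("ATGGCCTAA").items():
        print(frame, protein)
```

A frame that runs for a long stretch without a stop codon (*) is a candidate for the true reading frame, which is one reason a translated search can be informative.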

Because you are working with a nucleotide sequence, you will need to perform a translated search. Return to the BLAST home page (http://www.ncbi.nlm.nih.gov/BLAST/) and under Translated BLAST Searches select Nucleotide query - Protein db [blastx].

Notice that there are a number of other options you can select, but don't change any of them.

Submit the search request, and chill out learning more from the site until the results are returned.

Note: Blast searches submitted via the web site are placed in a queue and given a priority that depends on the number of searches you submit at the same time. If you submit a series of searches from the same computer, each search will take progressively longer, so if you want to run multiple searches it is best not to use the web interface. Searches can also be submitted via email; if you want to know how, send an email consisting of the single word HELP to blast@ncbi.nlm.nih.gov.
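Another route for scripted searches is NCBI's BLAST URL API, which accepts the same kind of request as the web form. The sketch below only builds a submission and parses the reply; it does not contact the server. The endpoint and the parameter names (CMD, PROGRAM, DATABASE, QUERY) and the RID and RTOE fields are given as I understand the NCBI documentation, so check them against the current documentation and usage guidelines before relying on them. The function names are made up for this example.

```python
import re
from urllib.parse import urlencode

# Assumed endpoint for the NCBI BLAST URL API; verify before use.
BLAST_URL = "https://blast.ncbi.nlm.nih.gov/Blast.cgi"


def build_submit_body(query, program="blastx", database="nr"):
    """Encode the form fields for a search submission (CMD=Put)."""
    return urlencode({
        "CMD": "Put",
        "PROGRAM": program,
        "DATABASE": database,
        "QUERY": query,
    })


def parse_submit_reply(text):
    """Extract the request ID (RID) and estimated wait in seconds (RTOE)."""
    rid = re.search(r"RID = (\S+)", text)
    rtoe = re.search(r"RTOE = (\d+)", text)
    if rid is None or rtoe is None:
        raise ValueError("reply does not contain RID and RTOE fields")
    return rid.group(1), int(rtoe.group(1))


def build_result_query(rid, format_type="Text"):
    """Encode the query string for fetching results (CMD=Get)."""
    return urlencode({"CMD": "Get", "RID": rid, "FORMAT_TYPE": format_type})


if __name__ == "__main__":
    print(BLAST_URL + "?" + build_submit_body("ATTTGGAGCATCATGCCTGC"))
```

The request ID plays the same role as the blast ID number shown on the "formatting BLAST" page: it is what you present later to retrieve the finished results. A script that submits many searches should wait at least the estimated time before asking for results and should space out its submissions.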

How do the blastx results differ from the blastn search?

What inferences can you make from the different results in the two searches?

What is the identity of the sequence?

What gene do you think it encodes?

What organism do you think it comes from?

How reliable do you think this inference is? Why?