Andy Baxevanis:
All right, good morning, everyone, and welcome
to week two or this year's current topic series.
Let's go ahead and get the formalities out
of the way right from the start. So my CME
disclosure statement, "I don't have any relevant
financial relationships with any commercial
interests." Okay, so with that out of the
way, my job today in the same way that Dr.
Green laid the groundwork for upcoming lectures
in this series by providing you a broad overview
of the field of genomics last week, I'm going
to be providing you a solid foundation for
the upcoming lectures devoted to bioinformatics.
I mentioned last week that genomics and bioinformatics
go hand in hand with one another; they're
inexplicitly linked. And Eric really reinforced
that sentiment when he showed you this figure
from NHGRI's current vision document, showing
the spectrum of genomic-based research going
from the understanding of the structure of
genomes all the way through to our efforts
focused on improving the effectiveness of
healthcare. Now, our ability to work across
this entire spectrum really rests, in large
part, on being able to apply computational
approaches across the board. To analyze the
kinds of data that are generated in the course
of whole genome sequencing efforts, genome-wide
association studies, clinical sequencing efforts,
and pharmacogenomic studies, or any of the
other many kinds of genomic science you'll
be hearing about throughout this series.
So given, again, that inextricable relationship
between genomics and bioinformatics, what
we're going to go through over the next two
weeks is as gentle of an introduction as I
can give you to some of the major tools used
in bioinformatics. You're going to see the
tools and the concepts underlying these tools
again and again throughout this series. So
it's really important to have a basic understanding
of the theory that underlies these techniques
to start taking this technique -- these techniques
away from the realm of the black box. So I
hope you'll find this information useful,
and that you'll pick up a few strategies along
the way that will hopefully be of practical
use in your own studies.
So with that, let's just start right off by
reinforcing the importance of alignment as
one of the most powerful approaches that we
have in the study of sequence data, specifically
with respect to determining similarity and
deducing homology. So start off with a very
basic question: why bother constructing sequence
alignments; what do they do for us? And that
question applies whether we're talking about
a paralyzed [spelled phonetically] alignment,
where we're taking two individual sequences,
and aligning them to one another, or to a
multiple sequence alignment, where we're looking
at a collection of presumably related sequences.
So in either case what we're trying to do
is get a measure of relatedness between either
nucleotide or amino acid sequences, and once
we have that measure of relatedness it allows
us to draw certain biological inferences.
So we might find structural relationships
between these sequences, functional relationships,
or evolutionary relationships.
Now, before we start to talk about the technique
we need to define a couple of key terms; it's
important to use the right terminology when
we talk about these phylogenetic relationships.
So the first term we're going to talk about
is similarity. So what I'd like you to do
is imagine you have two sequences, A and B,
and that you align those two sequences to
one another. And the easiest way to convey
how related those sequences are to one another
is to just simply count the number of matches
and mismatches along that alignment. So that
gives you the quantitative measure; again
similarity: it's based on an observable -- in
this case how many positions are aligned with
one another? We usually do express this as
a percent identity, and by looking at this
it allows us to quantify changes that we see
as two sequences diverge from one another,
whether we have substitutions, insertions,
or deletions. It'll also, from a functional
standpoint, allows us to identify residues
that are crucial for maintaining structure,
and by extension, function. If we start seeing
very high percentages of sequence similarity,
high degrees of sequence similarity, what
that might be telling us is that these two
sequences in this paralyzed alignment have
a common evolutionary history, or possible
commonality in biological functions. So for
example, we might be able to identify regions
that are important from a functional standpoint
when thinking about things at the protein
level, or regions that are important to expression
when we're looking at sequences at the nucleotide
level.
The second term we need to define this morning
is the one that is most often misused, and
that one is homology. So what homology is,
is the conclusion that we get by looking at
similarity; it implies an evolutionary relationship.
When there is homology between sequences those
sequences are termed homologs. So these are
genes that have arisen from a common ancestor.
And unlike similarity -- that is the quantifiable
term -- this is the conclusion, again. So
either genes are, or are not homologous; we
don't measure this in degrees. And just to
drive home the point, a quote from Walter
Fischer who was one of the grandfathers of
bioinformatics: he says -- it's worth repeating
here, that "homology, like pregnancy, is indivisible;
you're either homologous/pregnant, or you
are not." So again, this is the conclusion.
Now, that term, homolog, actually breaks down
into several subtypes. It might describe several
different types of evolutionary relationships,
and we're going to focus specifically on two
of them here. So the first type of homology
has to do with orthologs. So orthologs are
genes that have diverged as a result of a
speciation event. So they're direct descendants
of a sequence in a common ancestor, they more
-- likely have similar domain and three-dimensional
structure; because of that, they usually retain
the same or very similar biological functions
over evolutionary time, and we can use that
information to predict gene function in novel
genomes that we're starting to assign functions
to newly identified genes. So put very simply
orthologs can be thought of as the same gene
in different species.
The second term, the second type of homology
we're going to talk about, is paralogy. So
when we have paralogs, these are just genes
that arose by the duplication of a single
gene in a particular lineage. So these might
be less likely to perform similar function,
but they can take on new functions over evolutionary
time. So this is where Mother Nature gets
to be inventive, providing some insight into
evolutionary innovations; taking bits and
pieces of proteins, restructuring domains
in order to accomplish new jobs within the
cells. So again, put simply paralogs can be
thought of as similar genes in the same species,
adapting or co-opting a preexisting gene product
for new biological functions.
So just to make this a little bit clearer,
because the definitions are a little hard
to follow, but so let's look at a picture.
So in this case we have a gene, gene alpha,
in a common ancestor and along the -- oh,
evolutionary time there's been a couple of
speciation events that have given rise to
three species. Each one of them has an ortholog
to gene alpha, Gene 1, Gene 2 and Gene 3.
So each one of these three genes arose from
this common ancestor alpha. Now, if we go
back a little further in evolutionary time,
we roll back the clock a bit, and we consider
the possibility of a gene duplication predating
alpha. So we have a gene duplication here
at the base of the tree giving rise to genes
alpha and beta in this case. So alpha went
along and gave rise to three orthologs; beta
in this case underwent some similar speciation
events -- not exactly the same -- but gave
rise to three genes, 4, 5, and 6. So, how
are these things all related to one another?
Any of the Genes 1, 2, and 3 are orthologous
to one another; Genes 4, 5, and 6 are also
orthologous to one another, because again,
in each case they had a common ancestor. But
now, if we take any pair of 1, 2, and 3 versus
4, 5, and 6, those are the paralogs, because
they go way back to this gene duplication
event; that's how they're related to one another.
So that takes care of the definitions. Now,
how do we actually identify the orthologs,
the same gene in different species? And the
most common way, and most straightforward
way we have of identifying candidate orthologs
is by looking for reciprocal best hits. And
when I say candidate orthologs, it's very
important -- the word candidate really, really
is important, because, keep in mind that a
lot of the things we're going to talk about
through this series deal with computational
predictions, so I really need to underscore
the importance of being able to experimentally
verify a lot of the things that you do using
these computational techniques.
Okay, so let's say we start with a gene that
we're interested in from a particular organism,
and we -- as a first step here, we just do
a sequence similarity search against all of
the genes in a second organism. And what we
might find is something that looks like this,
a series of hits using some sort of sequence
similarities search method that identifies
a series of genes that are similar to the
one that we started with. Now, I have the
first one here marked in yellow, so we're
going to just assume that based on some scoring
method, this one is the most related to the
red one that we started with. Okay, so that's
the first step. Now, where does the reciprocal
part come in? Now, what you do is you take
this yellow sequence and do the sequence similarity
search in the opposite direction. You go -- now
take this, and search it against all of the
genes in Organism 1. And if you're lucky,
what you'll find is that the best hit in both
directions involve these two genes. So the
red one is most related to the yellow one
in Organism 2; then, when you use the yellow
one as the search query, it's most related
to the red one in Organism 1.
Okay, so that's the general idea. Now for
the specifics. We have two broad types of
alignment method at our disposal. So the first
one are called global sequence alignments.
So what I want you to do is imagine that you
have two full-length sequences, either two
genes or two protein sequences, and you now
align them across their entire length. So
we're going to force an alignment over the
entire length of these sequences. So this
works really well if you have sequences that
are of similar length, but the problem now
is that, as these become more divergent, as
they become more dissimilar to one another,
the global alignment methods tend to miss
important biological relationships; because
you're trying to force an alignment over the
entire length of these sequences, it might
mean that you're just not aligning important
structural or functional domains.
So from a biological point of view, a slightly
different approach is better, and that's what
are called the local sequence alignments.
So here, we try to find the most similar regions
in two sequences that are being aligned with
one another, and the glorified term for this
is paired subsequences; it just means that
we're just looking for two regions within
an entire sequence that align with one another.
So any regions, when we do these local sequence
alignments, outside the area of the local
alignment, we just go ahead and exclude those.
By doing this, we're aligning two sequences;
we might find more than one local alignment
that could be generated for any two sequences
that we compare. So this works really well
for sequences that share some similarity,
but not high degrees of similarity, or for
sequences of different length. So again, I'll
just emphasis that is really the best way
to compare sequences from a biological standpoint,
especially if we're trying to deduce evolutionary
relationships. So the approach of choice for
biological discovery.
Okay, so let's say we have now aligned two
sequences, we talked about the concept of
similarity we're going to try to look for
percent identity, some metric to allow us
to quantify how good the alignment actually
is. How do we actually deduce that number,
how do we measure how good the alignment is
between Sequence A and Sequence B? So to do
that we use something called scoring matrices,
and these scoring matrices quite simply is
our empirical weighting schemes that we use
to assign these numbers, these numerical values.
And what's in those scoring matrices is information
concerning the physical, chemical, and biological
characteristics of either nucleotides or amino
acids. So they take into account things like
side chain structure, their chemistry, the
function of particular side chains based on
what we know about how -- especially on the
amino acid side of the house -- how these
particular amino acids actually are involved
in structure and function. So we know that
things like cysteine and proline residues
help to keep parts of a protein together;
tryptophan has a bulky side chain, so it can
only be accommodated in certain parts of a
protein; things like lysines and arginines,
they're positively charged, they're very important,
particularly in the context of DNA binding
proteins, because you need those positively
charged residues to bind to the negatively
charged backbone of the DNA.
Okay, so in the construction of these scoring
matrices, we try to take into account two
major factors. First is conservation: which
of the residues can substitute for any other
residue, and not adversely affect the function
of the protein? So we know that things like
isoleucines and valines can substitute for
one another in a protein sequence, because
they're both small and they're hydrophilic.
Things like serines and threonines, they're
both polar so we often see that substitution.
So the basic idea here is we want to just
keep track of charge, size, hydrophobicity,
and other physical, chemical factors, again,
to try to see, when we can't have an exact
match between two sequences, what's the next
best thing? The other factor is frequency.
If we look at the entire constellation of
proteins, all of the protein sequences available
to us, how often do we see any one of those
amino acids amongst the entire constellation
of proteins? Some are more rare; some are
more frequent, and that becomes important.
Okay, so why should you actually care about
any of this? Every single time you do any
sort of sequence comparison, whether it's
two sequences to one another, A to B, or a
Sequence A to an entire data base, like GenBank,
these scoring matrices are being used in the
background; they appear in every single analysis
involving sequence comparison. They also,
as you'll come to hopefully appreciate, they
implicitly represent evolutionary patterns.
So the choice of matrix really can strongly
influence what results come out of the methods
that you apply to these alignments. So the
default option may not always be the best
choice, and so it's -- becomes important for
us to talk about how these scoring matrices
are actually constructed. What guidelines
go into putting them together?
All right, so let's start on the easier side
of the house. If we look at a scoring matrix
that -- if we're doing a nucleotide-based
comparison -- we only have four nucleotides
A, T, G, and C -- and if we align two sequences
with one another, all we're going to do is
just apply a very simple matched or mismatched
scheme. So anytime you have two -- a particular
position in an alignment where the two nucleotides
match which one -- one another, you're going
to assign a positive value; any time you have
a mismatch between the two, you're going to
subtract points. So in this case, the match
gets you two points; the mismatch loses you
three. The matrix, as you'll see here, you
have the A, T, G, and C going across the top,
the same residues going down the side, and
just the way it works, that for every position
in the matrix -- let's say you have a T aligned
with a T -- you just look for the intersection,
and you'll see that that's -- in this case
it's an exact match, so that's the plus two.
You'll see the plus twos all across the diagonal
representing the exact matches, and the minus
threes in the other positions. This assumes
that each nucleotide occurs 25 percent of
the time.
Okay, so this works really well at the nucleotide
level, because nucleotides are not all that
chemically dissimilar from one another, but
as we've already alluded to, the new amino
acids are very different from one another,
a much more complex picture. So we now go
to something that looks like this. So this
matrix is called a BLOSUM62 Matrix, and again,
we have all of the amino acids going across
the top, all of them going down the side.
So again, I want you to imagine, now, in this
case a protein sequence alignment, Sequence
A with Sequence B. We're going to look at
a particular position, and see how many points
we can assign to a match at that position.
So we're going to image where we've got a
tryptophan in Sequence 1 and a tryptophan
in Sequence 2, so we're going to look for
where the W intersects with the W, and that
give us in this case 11 points. Now, if you
look at the diagonal here, unlike the case
we saw before, where no matter which nucleotides
we were aligning and getting exact matches,
where in the previous case we were assigning
the same value regardless, here on the diagonal
you'll see that the values change. The 11
is the highest number in the diagonal, and
that represents the frequency part of constructing
the matrices. This is the most rare of the
amino acids, so when you get an exact match
you want to give it more points than you would
if you had a more common amino acid aligning
with each other.
Okay, now to the point of substitution. All
right, so let's say, instead of having an
exact match of tryptophan for tryptophan,
this time we've aligned a tryptophan with
a tyrosine. So instead of W-W, we now have
a W against a Y. We take a look here, we see
now in this case we would only give it two
points. So it's still a positive score, which
represents that it's a conservative substitution
-- those two things can substitute for one
another, and in most cases not adversely affect
the structure or function of that particular
protein -- but the lesser value here represents
that it's just not as good as an exact match.
Let's just take as a final example a mismatch.
So let's say you've now aligned the cysteine
with a glutamine. Those two things are not
known to be able to substitute for one another,
so here we are to subtract points; in this
case, minus three. So the way the general
rules work, just in summary, you get the most
points for an exact match with the rare amino
acid scoring more points than the common amino
acids; you lose points for conservative substitutions,
and you lose points for mismatches.
Okay, now, so the matrix we were just looking
at was something called a BLOSUM62 Matrix.
It is part of a family of matrices called
the BLOSUM Matrices. These we're developed
by the Henikoffs at the Fred Hutchinson Cancer
Research Center. And that's part of the good
news: you don't have to derive these matrices
yourself; you just need to know which ones
to pick. So as far as the BLOSUM Matrices
go, what they were designed to do is look
for differences in conserved un-gapped regions
of our protein families, which the Henikoffs
termed as blocks. These are directly calculated
based on local alignments. So these are all
based on observables; so, using substitution
probabilities, again, the concept of conservation,
and the overall frequency of amino acids.
Because of the way these were constructed,
using known protein families, they're really
sensitive to detecting structural or functional
substitutions, and they perform better than
other matrix series that are out there, specifically
the PAM Matrices. The PAM Matrices were developed
way back in the distant past, as far as bioinformatics
goes, at a time where very little sequence
information was available; so we only had
sequence data available for globular proteins,
they were -- the matrices were extrapolated
from closely related proteins, and you'll
often see them as an option for a choice in
phylogenetic programs, but I would recommend
that you pick these BLOSUM Matrices instead,
because just bottom line, they really are
useful in identifying both closely and distantly
related sequences. So that's a big plus.
All right, so now here's where evolution comes
into play. When I showed you the BLOSUM Matrix
earlier, that was a BLOSUM62 Matrix. So there's
always a number that follows the word BLOSUM.
And what the N represents is the degree of
similarity of the sequences that were used
to construct the matrix. So in -- so here
you've got -- in the case of a BLOSUM62 Matrix
that means that the sequences used to construct
the matrix shared no more than 62 percent
identity. In the construction what happens
is that the contribution of sequences that
are higher than the threshold are clustered
and replaced by a single sequence that represents
the clusters. So what's the reason for doing
this? Last week Eric showed you the phylogenetic
range of a number of vertebrate species whose
genomes had been sequences, and while we have
more and more whole genome sequences in hand
from a wide variety of organisms, the genome
sequences that we have available to us are
not uniformly distributed; we have multiple
sequences for some organisms. So we really
need to just remove that bias from the data
set, and that bias tends to be more toward
closely related sequences. We want to do this,
because we don't want to miss anything that
might be biologically relevant.
So in practice how does this work? Let's say
we start with a protein family. Here we've
got one with six sequences in the group. Any
place you have an exact match you'll see an
asterisk at the top of the column. Let's say
we're trying to construct a BLOSUM80 Matrix.
So we're going to cluster together any of
the sequences that share 80 percent or more
similarity with any of the other sequences
in the set. So in this case we end up with
three clusters, one that has sequences, one
that has three, and one that just stays by
itself; it doesn't cluster with any of the
other sequences. We then, now collapse this
down, representing these six sequences as
three. So these three sequences are now represented
by this single sequence, these two sequences
are represented by this single sequence, and
now we use this to construct the matrix.
So what are the implications of this? When
we do the clustering, again, we reduce this
contribution of closely related sequences;
so we remove the bias that I mentioned earlier
about substitutions that occur in the most
closely related members of a family. But here
is the more important thing: when you reduce
N, the value of N, you're going to get more
distantly related sequences; if you increase
N, you're going to get more closely related
sequences. So, here is your first cheat sheet
of the morning; put a big star next to this
one, and -- as far as which one of these matrices
to choose.
The default by -- if you're going to do a
BLAST search, at NCBI it's going to be BLOSUM62.
The reason it's the default is, just through
a lot of very controlled trials, it has been
seen as being the most effective in finding
all potential similarities, and you're going
to get percent similarities that go down as
far as 30 to 40 percent. Okay, as we move
up the numbers here, as the number get higher,
they're going to be -- we're going to get
sequences that are more closely related. So
by the time we get to BLOSUM90, we're going
to have maybe shorter alignments, but they're
going to be much more similar to one another
in the range of 70 to 90 percent. We go in
the opposite direction, we're going to find
more distantly related sequences; so longer
alignments that might be weaker, and you can
push it down to probably around 30 percent.
Now, of course the question goes back to which
one to choose? If you know what you're looking
for in advance, if you want to find things
that are highly similar, you're going to push
to the higher end of this range; if you want
to look for things that are more distantly
related, you can go to the lower end of this
range, but the method that I would strongly
recommend you use is something that Steve
Altschul, who is a person who developed BLAST,
talked about in a paper called -- he called
it the True -- "Three Matrix Strategy." And
so, instead of doing a single search, instead
you're going to do three; you're going to
pick 62 as a starting point, pick one that
is going to give you more distantly related
results, one that's going to give you more
closely related results. And what you're going
to find is you're going to get three different
sets of results back. They'll be highly overlapping,
but one -- some of them will pop out certain
sequences if the other ones won't. So again,
I would strongly recommend you use this three
matrix strategy as you learn these techniques,
as you become more comfortable, more familiar
with them, but the bottom line to all of this
is that no single matrix is the complete answer
for all sequence comparisons.
All right, so one last thing we need to discuss
before we get into talking about BLAST, is
GAPS. So again, I want you to imagine Sequence
A and Sequence B. And when we do these alignments,
often you'll see we'll introduce GAPs in order
to improve the alignment, so we get these
local alignments better. So again, used to
improve the alignments between these two sequences.
More often than not, these are used to compensate
for insertions and deletions, and this is
really important, because the gaps -- because
we're representing insertions and deletions
-- they represent biological events, so you
can't just keep putting these gaps in willy-nilly
to improve your alignment, because you're
not actually reflecting biology by doing that.
So you need to keep them to a reasonable number,
again, to not reflect a biologically implausible
scenario. And a good rule of thumb for this
is about one gap per 20 residues, and it depends
on the case, but that's a good starting point.
Now, if you think again about your alignment,
think about the scoring matrices we've been
comparing, single positions, a letter in one
sequence to a letter in another sequence,
but now we have a situation where we have
a letter in the one sequence and nothing in
the other sequence; we have a gap there, so
we can't score this simply as a match or a
mismatch. So how do you score an amino acid
against essentially nothing? The method we
used to do this is -- involves something called
the Affine GAP Penalty, and so what this does
-- all you need to know about this is that
any time you open a gap, you assess a penalty;
anytime you extend the length of a gap, you
assess an additional penalty. So this deduction
is governed by a formula right here, G+L(N),
where G is the gap opening penalty. So as
soon as you introduce the first position of
a gap in a nucleotide alignment, you're going
to subtract five points; on a protein alignment
you're going to subtract 11 points. As you
make that gap longer, you apply the gap extension
penalty. So the second term, L(N), so it's
the penalty times the length of the gaps.
So every additional position you add to that
gap in a nucleotide alignment, you're going
to subtract an additional two points; in a
protein alignment, you're going to subtract
an additional one point.
So let's say we have a gap of length two in
a protein alignment. So G here is 11. If the
gap is of length two, so L is one times two,
the length of the gap. So 11 plus two is 13.
You would subtract 13 points for a gap of
length two. You can -- as you'll see as we
go through the methods, you can actually change
the gap penalties to make the introduction
of gaps either more or less permissive. So
if you want things -- if you want fewer gaps,
you can adjust for that; if you want to allow
additional gaps you can do that as well.
All right, so with all of that, half an hour
later, by way of introduction we finally get
to BLAST. It's finally time to talk about
BLAST, and this is really the most important
technique in bioinformatics. We're going to
hit this pretty hard today, because I want
to make sure you know how to use it properly.
And -- because there, unfortunately, are many
cases in the literature where wrong biological
conclusions have been made, because of not
interpreting BLAST results properly. Either
the people doing the analyses didn't know
how to use it, or didn't really know what
the statistics meant. The most important take-home
message I can give you today that applies
to this and any of the other methodologies
that you're going to learn throughout the
14 weeks of this series is that, just because
something appears on your screen or on a piece
of paper in the results does not automatically
imply biological significance. So we're going
to talk about what does imply significance
in a moment.
Okay, so what is BLAST? BLAST: a basic local
alignment search tool. What this does is it
looks for high-scoring segment pairs, HSP.
So again, just a glorified term for just two
sequences, a pair of sequences, that can be
aligned with one another. And when we do that
alignment, we're going to try to make that
alignment as long as possible in a statistically
robust way, and I'll tell you about that in
a moment. The score has to break a score threshold,
S, before it's reported to you, and the alignments
can be either gapped or un-gapped. Again,
because we're looking for these local alignments
within entire sequences, the results are not
going to be limited to just the best high-scoring
segment pair for the two sequences being aligned,
and in fact, I'll show you an example that
will produce two high-scoring segment pairs.
There are different flavors of BLAST, and
three that are used most commonly are the
first three on this slide. So BLAST N -- N
is for nucleotide. So we start with a nucleotide
sequence, and we compare that against a collection
of nucleotide sequences. BLAST P -- P is for
protein. So again, the comparison is of a
protein sequence to a database of protein
sequences. BLAST X is a little bit different;
here we're going to translate the query into
a protein sequence before we do the comparison.
So, yes, we start with a nucleotide sequence,
we do the six-frame translation, we take the
protein sequence arising from that, and then
compare them against a protein database. So
this is actually a methodology that's much
more sensitive than direct nucleotide searches.
The last two are on the slide for sake of
completeness. I can't remember the last time
I ran one of these. The T in these both stand
for translated, and you can see for yourselves
what the query and the target sequences actually
are, but the comparisons here are all done
at the nucleotide level.
All right. So how does the method actually
work? So we're going to start with a sequence
of interest. We have an amino acid sequence
here, our query. And you'll see I have three
letters in the middle of the sequence in red.
The P, the Q, and the G. And we're going to
start our search with this query word of length
three. And what happens in reality is it starts
at the one end, at the end terminal end, and
it just keeps moving along, the window moves
along. So it starts with a GSQ, and the SQS,
and moves along. But just for sake of symmetry
on this slide we're going to start in the
middle. What we want to do is now compare
this to all of these sequences in a protein
database, starting off just looking for other
occurrences of the P, the Q, and the G in
that order, but we also want to look for conservative
substitutions, because we might find some
interesting biological relationships by doing
that.
So when we do that we establish something
called the neighborhood. So the neighborhood
starts off with the exact match, the PQG,
and that has a score. So if you were going
to go back to your hand outs to the slide
that shows the BLOSUM62 matrix, and did -- and
find the intersection of the P for the P,
the Q for the Q, and the G for the G, you
would find that the P would give you seven
points, the Q would give you five, and the
G would give you six for a total of 18. What
I've done here on the slide is I've just changed
the middle position to show those conservative
substitutions. So we've got the PQG, EG, RG,
going down the slide. And you'll see that
because these matches are not exact at that
position, as you would expect, because now
you know how the scoring matrices work, the
score also goes down. And of course, you would
do this at all three of the positions.
Now, at some point you have to say, well,
maybe we're kind of going outside of the neighborhood.
So the method will automatically draw for
you a line at what it calls the neighborhood
score threshold; BLAST computes this for you.
In this case, T is 13 given the query word
that we started with, so anything above the
line, everything from 13 and up, we keep for
the next step. Everything from 12 and down
we throw out. All right. So let's take our
neighborhood, put it at the top of the next
slide, and here's the sequence now that we
have that we started with. Here's our PQG,
our query word. You'll notice, now I have
it aligned with a sequence that was found,
the subject, but in this case it's not a PQG;
it's the P, the M, and the G, and that's perfectly
fine, because if we look back at the top,
PMG is in our neighborhood. Now, just qualitatively
you can see how good the alignment is by looking
at this middle row. So any place you have
an exact match you see the letter repeated;
any place you have a conservative substitution
you'll see a plus sign; any place you see
a blank is just a mismatch.
All right, but we want to be able to apply
numbers to this, and we want to try to make
this alignment as long as possible. So here
we are, we've got this alignment starting
at three, but you'll see that there's many
amino acids. Here it's -- the alignment is
a blank-40. You go as far as you can in both
directions, aligning as many of the characters
as you can. So the question of course then
becomes, how far is too far? Okay. So here's
our alignment again. And here's just a graph
of how the scoring is working out for this
particular alignment. On the X-axis here we
have the extension. How many positions have
we aligned? On the Y-axis is the cumulative
score. So the scores that are coming from
our scoring matrix as we just keep adding
additional positions to this alignment. So
if we start out here, the extension is three,
what we started with, the query word. We're
above the threshold cutoff, as you'll remember
from the previous slide. So as we go in both
directions we're going to assume that we're
having more matches than we do mismatches,
more positive scores being assessed than negative
scores being deducted. So the curve is going
to go upwards. At some point, we're going
to break a threshold, the score threshold,
okay. So this -- from here up, there are the
things that are all reported to you in your
BLAST result, and herein lies pitfall number
one.
Breaking the score threshold does not imply
biological significance, and I'll show you
what does momentarily. We don't pay attention
to the scores, but the scores are used to
calculate the number we are going to pay attention
to shortly. Okay. So we're going to keep going
up, we're going to keep assuming that we're
matching more than we're mismatching, but
at some point we're going to have gone too
far. We're not going to be able to align any
more residues with one another. We start thinking
about that matrix again, and all those negative
values in the matrix. So as we go out further
we're going to start subtracting points faster
than we are actually adding them. At some
point, we're going to leave the top of this
curve and go down a certain amount, the significance
delay -- decay, excuse me -- that represents
both the mismatches, which assess negative
points, and the gap penalties, which assess
negative points. When you've reached the significance
decay, the method says, okay, I've gone far
enough; I'm going to go back to the peak of
this curve. That's going to represent the
longest alignment I can put together, the
length of my high-scoring segment pair. Okay.
So once we've done that we're going to have
to think about doing a probability calculation.
So again, why do we use the probabilities
instead of the scores? Now that you know about
the scoring matrices, the answer will hopefully
be easier to understand. I want you to imagine
two different alignments of equal length.
One is composed primarily of common residues.
One is composed primarily of rare residues.
Okay, and the composition of those alignments
are going to be reflected in the score. Even
if the number of positions in those two alignments
that match one another is exactly the same,
you're going to end up with two very different
scores, because of the frequency information
that's captured within the scoring matrices.
So a better measure is needed to be able to
compare one alignment to the next, any two
alignments to one another, and that again
is where the probability comes in. So this
is the equation, the Karlin-Altschul equation
that actually governs that calculation. You'll
see that in the exponent here is the S -- so
that is the normalized score -- and the other
factors that go into the calculation of this
score. Again, you don't have to do this; the
BLAST method will do it for you, but what
that value, E, represents is the number of
high-scoring segment pairs found purely by
chance. So this is really the number of false
positives that you have -- have come out of
your BLAST search. We want that value of E
to be as low as possible. Lower values signify
higher similarity. Okay.
So rule number one. Again, put a big star
on this one. So what we're going to look for
as far as probability values, if we're comparing
two nucleotides sequences to one another,
we want that E value to be 10-6 or lower.
If we're looking at protein sequences, we
want that value to be E to the -- 10-1/3 or
lower. And again, keep in mind that when we
talk about protein similarity searches that
means both BLASTP and BLASTX. Remember the
BLASTX; we start with the nucleotide sequence,
but the comparison's done at the protein levels,
so you're going to use the protein statistics
here.
All right, so let me take you through actually
how you're going to do this when you get back
to your offices and to your labs. So we're
going to pretend we're at the computer. You
have the screen captures in your handout.
You might want to look at the PDFs on your
monitors back at your desk, because they'll
be a little bit easier to read, but the most
important thing to do is practice, and just
take some sequences that are of interest to
you and the biological questions you're trying
to answer, and use the methodologies that
I'm going to show you today. All right. So
we're going to start off at NCBI's homepage,
which there is the URL, ncbi.nlm.nih.gov.
The most popular things that are on this website
are shown in the right sidebar, and there
is a direct link to BLAST there. So we're
going to go ahead and follow that link, bringing
us to the BLAST homepage, or you can just
go directly by typing in the full URL. So
let me quickly orient you to this page. At
the top of the page you can just go directly
to full genomes, and search each one of those
genomes individually. The middle of the page,
which is where the action is today, is where
all the basic BLAST searches are. And at the
bottom are a series of specialized BLAST searches,
which we'll talk about momentarily.
So we're going to start off doing a protein-based
search; so a protein BLAST. The link is right
there, and if we go ahead and click that it
will take us to the BLAST search page. But
before we do that we need a sequence that
we actually are going to use as the query.
So what I put together for all of you is a
webpage. Here is the URL that has the query
sequences for all of the examples that I'll
be showing you both this week and next week.
So this is a great way for you to go back
to lab and do a little bit of practice on
your own, reproducing what we're doing in
the lecture this morning. Okay. So we're going
to take that first sequence from the page
I just showed you, and just stick it right
into the box. Now, you can use the entire
length of that sequence -- that's great; if
you are only interested in part of that sequence,
you can actually specify the beginning and
ending amino acids you want to consider the
start position and the stop position. Okay.
So the first thing we have to do is pick,
what do we want to compare this sequence to?
What database do we want to look at? So we
have a bunch of choices available to us, and
I want to talk about one in particular that's
called RefSeq. So what RefSeq is, is an effort
to provide a single reference sequence for
each molecule of the central [unintelligible].
So to have every DNA sequence, every MRNA
sequence, and every protein sequence represented
once and only once. And the reason for doing
this is it really became obvious that we needed
to have a resource like this when a lot of
us were doing these BLAST searches, and you
might take a gene of interest, you'd get back
maybe two, five, 10, 20 exact matches, and
you just don't know by looking at the BLAST
results which one is the right one; the canonical
entry for that particular gene or for that
particular protein. So this is a curated effort
that helps address that problem.
So by virtue of the fact that each of these
sequences are represented once and only once,
of course, the database is non-redundant.
It's constantly updated by curators to reflect
the current knowledge of sequence data and
biology. And that's a great thing: that these
records keep up-to-date with what we know
functionally about these nucleic acid sequences
or these protein sequences. In the featured
tables, the part of these entries that captures
these biological attributes, you'll find information
about the gene, the gene transcript, or the
protein. They encompass a wide taxonomic range,
but primarily focus on both mammalian and
human species; and again, just to reemphasize
the ongoing update and curations. And you'll
see a time stamp showing you when that record
was last brought up-to-date.
Now, the way to know what a sequence -- whether
a sequence is a RefSeq sequence is by looking
at its accession number. So anytime I use
the term accession number that is a unique
identifier that if you type that in it would
produce back that one sequence unequivocally.
So it's basically that sequence's Social Security
Number. And there's two series in the RefSeq
accession numbers that I'm going to show you.
The first ones, they start with an N. And
these come from curation of GenBank entries.
So the Ns, let's say we start with an NT,
that's a genomic contig. It may have come
from a whole genome sequencing project or
some other high throughput sequencing effort
at the nucleotide level. The NM, M stands
for MRNAs. The NPs, P stands for proteins.
And the Rs stand for the non-coding transcripts.
So what these all have in common is that these
have all been experimentally verified. Somebody
has had these molecules in their hands, and
have actually directly sequenced them.
Now, of course, we crank out sequences at
break neck pace nowadays, and so we don't
always have the luxury of having these experimentally
verified sequences. So what we do is often
apply computational techniques to predict
where the MRNAs are, and where the proteins
are. So this is where this X-series come from.
So let's say we start with a genomic contig,
these NT entries. And we then take those,
apply the predictive methods to figure out
where the genes are -- so gene structure:
introns, exons, and so on -- to figure out
where we can come up with model MRNAs, and
then we use those in turn to come up with
the model proteins, the XPs. So please keep
in mind that, again, these are experimentally
verified; these come from genome annotation.
These are computational predictions. So you
have to look at them, not necessarily skeptically,
because the predictive methods are so good
nowadays, but you just have to keep in the
back of your mind that they haven't been experimentally
verified. So whenever you have a choice you're
going to go look at the ones in the N-series
instead.
One of the other choices that was on the pull-down
menu was Swiss-Prot. Similar idea; we'll talk
about that in greater detail next week. So
I like to use RefSeq, just because -- this
non-redundancy, you just get a neater and
shorter hit list from your results usually.
Okay. So let's go back to the screen here.
So now, we go a little bit further down. I
-- what I've done -- excuse me, for the purpose
of the example I've just left the database
at the default, which is the non-redundant
protein sequence set. So that's every sequence
that's in GenBank. As far as the organisms
go, you can just leave it blank and search
for everything, but you can limit by organism
or by taxonomic groups. So you can put in
the word human, you can put in the word bilaterian,
you can pick particular classes if you're
looking at a specific point in evolutionary
time. Okay. As we go down, program selection.
So BLASTP is already chosen for us, but you'll
see that there's a whole bunch of other variations
of BLASTP, and we're going to talk about those
in great detail next week.
All right. As we get to the bottom of the
screen you'll see that there is a link here
that says 'algorithm parameters.' And if you
click on that it's going to expand the screen,
and what we're going to see looks like this.
So this is where we get fine-tune what BLAST
actually does. First one asks us, well, maximum
number of target sequences. How many sequences
do you want to see in your results? The default
is 100. I always hike this up as far as possible.
So we're going to put that at 250 to make
sure that we don't miss any statistically
significant hits. We go down, word size is
three. So you already know how that works.
If you want three to -- if you want to just
use the default, that's fine; if you want
to push more towards exact matches, you're
going to make the word size longer. The E
value threshold, right here it says it's 10.
Remember that I've told you that we want to
find -- use an E value of 10-1/3 for our cut-off.
So I'm just go ahead and leave it at that
value for purposes of the example, but we're
going to apply the cut-off when we see the
results. Scoring parameters, we can chose
our matrices. Here's your choices, again.
All of the BLOSUM matrices -- remember what
the numbers actually mean -- but you also
have a couple of PAM matrices available to
you as well.
Okay. Finally, when we get to the bottom,
a couple of new things to talk about. So filters
and masking. So the first one says 'filter
low complexity regions.' We're going to go
check that off. What is a low complexity region?
So these are regions of biased composition.
So you might have an amino acid sequence or
a nucleotide sequence where on the nucleotide
side you might have a string of A's, a very
long string of A's; or on the protein side,
a single amino acid repeated over and over
and over again. So these homo-polymeric runs,
you might see short period repeats, or the
subtle overrepresentation of several residues.
The reason it's important to mask these is
it actually confuses BLAST.
BLAST relies on having uniformly distributed
amino acid frequencies. So when it hits these
homo-polymeric runs, it often leads to false
positives, because if you think about the
alignments, and you've got these long runs
of the same letter over and over and over
again, sometimes it's not exactly clear how
to actually align that part of the two sequences
with one another. So not enabled by default;
you should just always do that. Okay. Finally,
the last thing, just is a matter of convenience,
I always check off the box at the bottom that
says 'show results in a new window.' So once
we click the BLAST button down here at the
bottom, either a new tab or a new window will
open. Why that's nice to do is if you want
to come back and change the parameters, you
can just go back to this window and do that,
rather than having to keep clicking the back
key on your browser.
All right. So we'll go ahead, submit this.
Here's what we get back. So here we've started
with -- our query sequences is represented
in this case by the grey bar here. Right under
the grey bar we'll see that there's a block
here, and that represents a protein family
that was identified; so a particular domain
architecture. Again, we're going to talk about
that much more next week. You'll recall that
the default for how many sequences to return,
how many hits to return was 100. In this case,
if we look here it says that 225 BLAST hits
were found. So it's good that we put our value
there up to 250, because we just wouldn't
have seen all of the results that BLAST could
find for us. Okay.
Let's push this overview to the top here.
So this is a graphic overview of our results.
So here our sequence, our query sequence is
shown as the red bar, and you see just a bunch
of reds and purples here. So what is this
mean? All of the colors -- the color key is
at the top. Remember that the colors correspond
to the scores, not to the probabilities. These
are shown -- so there's a color key -- these
are shown in descending score order; so the
best score is at the top, the worst score
is at the bottom. If you mouse over any of
these lines, there's a window here at the
top that will tell you what the found sequence
was, what that particular line corresponds
to, what sequence it corresponds to. You may
see these lines in between. So what the gaps
will -- these either represent gaps or that
you have more than one high-scoring segment
pair for the query sequence versus that particular
sequence. If there -- and you'll see that
either as a gap, or you might just see this
tick mark here, which means the gap is actually
in the query sequence rather than in the found
sequence. Finally, you might see lines that
have these bars on them, but there's nothing
connecting them. Those are just unrelated
hits, just done that way to save space. Okay.
So let's go ahead to see what we actually
got. Let's scroll down, take a look at the
hit list, and see what we can learn. So let
me orient you to the results first. You have
a description of each one of the sequences
that was found that was deemed somehow related
to our query sequence. Each one of them has
a short description called the definition
line or the def line; just a quick one-line
description of what that sequence actually
is or does. You'll see that there are a couple
of columns here that give us the score values.
We have query coverage, so we'll see that
in a little bit more detail, but the important
column here is the E value. So I sorted this
by E value. Next to that, you'll see the percent
identity values, and then these are the accession
numbers; again, the unique identifiers, the
Social Security Numbers for each one of these
found sequences. All right. So if we look
at the very first one, we see the E value
is zero. So what the zero tells you is -- just
because of space there's not enough room to
show the number -- so whenever you see zero
it's 10-1000 or less. So you don't even have
to think about those. Those are great hits.
We look a little bit further down, here we
have a case 3e minus 179, so that just means
3*10-179. So it's just showing the exponential
notation there. I'm going to scroll down to
the bottom. And remember that this is now
a protein-protein search, so the E value cut-off
that I'm going to apply is 10-1/3. Now in
this particular case -- again, here's the
E value column -- this particular case, the
one at the bottom of the list actually beats
the threshold quite handily: it's 10-53. So
we're going to go ahead and just draw the
line, and accept all of these values for now.
If we saw any that were below that 10-1/3
threshold, we would go ahead and reject them,
but I want to provide a caution here, because
you might have a case of something that's
just below the line that actually is biologically
relevant. So this is where you have to put
your biology hat on -- you've read tons of
papers, you've done your own research, you
know something about the system that you're
looking at -- and you may have biological
evidence that argues in favor for something
being slightly below the line as being related.
So anytime you have biological evidence that
always, always trumps computational predictions.
So the guidelines that I give you here, these
cut-offs are starting points, but again, let
your knowledge of biology be your guide here.
It will never serve you wrong.
Okay. So, let's take a look at one of the
results that came out. And so here we have
a -- the classic way of representing a BLAST
search, so where we can see the alignments.
So at the top we have some scoring information,
our -- which we'll come back to in a moment.
We can see where this sequence aligns. So
if in our query sequence we're starting at
position 17, the first line goes from 17 to
76, and you can see the line numbers -- excuse
me, the position numbers going all the way
through this alignment. So the first thing
we want to do is take a look at these values
that we have calculated to give us yet another
way of determining whether or not these are
actually related to one another. So first
we have a percent identity number. In this
case, they're 100 percent identical to one
another. The numbers you want to look at -- this
is the second criteria, so again, put a star
here -- if we're doing a protein-based search
we want to see percent identity greater than
or equal to 25 percent. If we are doing a
nucleotide based search we want to see a percent
identity greater than or equal to 70 percent.
Right next to that you'll see where it says
positives. In this case it's still 100 percent,
so what the -- here what that is, is really
a percent similarity. It includes conservative
substitutions. So again, identity means identity.
Positive means the identities plus the conservative
substitutions. In this particular alignment,
no gaps, so we don't have to worry about that.
But what you will see is that there are some
regions of our query sequence that appear
a little bit greyed out in lower case letters.
Those are the low complexity regions, the
regions of biased composition where you have
the subtle overrepresentation of a couple
of amino acids. Here's a great case where
there are -- you can just see over and over
again just short period repeats, or the same
amino acid coming over and over again. Okay.
So that is one high-scoring segment pair that
came from the comparison of our query sequence
to this particular protein, this prospero
isoform, but if we scroll down a little bit
more we'll see something that says range two.
So we started with our two sequences and two
different high-scoring segment pairs were
found. So there's two different parts of the
alignments led to these local alignments.
So again, remember the difference between
global and local alignments. So instead of
forcing the alignment across the entire sequence,
we're just trying to find those local regions
of similarity. In this case there are two.
So we have the second high-scoring segment
pair detected. We're going to look at our
identity information again. So 95 percent
identical, the starting position for that
particular part of the alignment. Here we
do have a gap and that's just represented
by the dashes, in this case, in the subject
sequence.
Okay. So let's put all of this together. We
go back to this graphic. Here are the two
scoring lines from the two alignments that
I just showed you. So let's think about our
cut-off. So first we want to make sure that
E value is 10-1/3 or better. So, yep, we've
got that. So zeroes in both cases. We look
at the percent identities in protein base,
so we want that to be 25 percent or better.
We got that as well. So 100 percent for the
first one, 95 for the second. And if we just
look at where these correspond to the overview
that we had, the first high-scoring segment
pair goes from position 17 to 704; the second
one from 777 to 1403 with a small gap between
them. So the vast majority of this sequence
is align-able, but there's a region in the
middle that you can't align the one you started
with, with the one that you found. All right.
So your second cheat sheet for the morning
-- again, keep this handy -- if we're talking
in terms of E value at the nucleotide level
10-6 or smaller; protein level 10-1/3 or smaller.
As far as sequence identities go, for nucleotides
70 percent or more; proteins, 25 percent or
more. Again, don't use these cut-offs blindly.
I give this to you as a starting point, as
you start to learn these techniques, and start
to get a better sense for how they work, but
please, please do pay attention to the alignments
that are either side of that dividing line,
and never, ever ignore what you know about
the biology of those sequences. That always
is the most important thing to factor into
these analyses. Okay.
So let's take a little bit of a simpler approach
here. Let's say, instead of searching a database,
you have two sequences and you just want to
see how those two sequences align with one
another; their predetermined sequences. So
just like a BLAST search, we're going to find
these local alignments, either nucleotide
sequences or protein sequences. All of the
BLAST algorithms are available to you. Again,
you can select either BLOSUM or PAM as the
matrix for these comparisons. Same Affine
gap costs are -- can be applied here. You
can adjust them. The input sequences, again,
can be masked. So we come back to the BLAST
homepage. In this case the link to get to
this is in the lower section in the specialized
BLAST section, 'align two or more sequences
using BLAST,' BLAST two sequences. When we
click on that link it brings us to this page.
It looks very similar to the one we saw before
for the BLASTP search. We're going to look
at these top -- tabs at the top first, make
sure that we've clicked on BLASTP in this
case to see the search page that's relevant
to a protein-based search.
If you go to the example sequences page that's
up on the current topics website, you'll -- we're
going to just take the two sequences from
those sample sequences; paste one in the first
box, paste the second one in the second box.
As before, we can specify query sub-ranges
or subject sub-ranges if we don't want to
use the entire lengths of these sequences.
Once again, at the bottom, the algorithm parameters
link; we're going to go ahead and expand the
page now by clicking on that. We'll now see
all of the stuff here at the bottom. So the
general parameters here, as before, how many
sequences do you want returned? Well, we already
know we've got two, so we can just leave that
alone. The E values, the word size, as before.
Here I actually don't change anything, because
you already know that you've sequence A and
sequence B, so you're just going to, by inspection,
not have to worry about any of these things.
You're just trying to align the two sequences.
But what will come out is the statistical
measure of how well those two actually align
with one another. As before, you can choose
of a series of matrices. We can mask out the
low complexity regions, which you should always
do by default. Again, we're going to check
off the little box at the bottom where we'll
show our results in a new window, and then
just go ahead and click the big BLAST button
at the bottom of the page.
So what we get back is reminiscent of what
we just saw. We have our overview of the results.
Here's our query sequence, again, as the red
line. We only searched it against one sequence
so there is the one sequence. Part of it aligned
well, part of it a little less well. Let's
scroll down a little bit to see what those
results look like. So in our data table up
here, again, one and -- only one hit. We see
the E value and the percent identity. A greater
amount of information shown in the lower part
here. So here we have our first part of the
alignment, the longer part. The percent identity's
46 percent, so that beats our eval -- excuse
me, our percent identity threshold for proteins,
the expect value; 10-26 beats our 10-1/3 threshold,
and again, you can just see what part of the
sequences align. So in this case the query
from positions 95 to 178.
The second part, that little part at the end
terminal end, if we look at the stats here,
we're just -- it's a very short alignment;
as you can see here, it's only six amino acids
long. The E value is almost 2, so we're just
going to go ahead and reject that just on
the basis of that one number alone. We can
also look at this in terms of a dot-matrix;
it is a traditional way of looking at this.
Sequence 1 is represented on the x-axis, so
positions 1 to 178; Sequence 2 on the y, positions
1 to 134. We saw that there were two high-scoring
segment pairs, or at least two alignments,
one of them, which we actually accepted. Here
is that shorter alignment, so it just shows
you graphically which parts of the sequences
were aligned with one another. It's a little
bit more intuitive to look at this, but I
thought I would show you this just for the
sake of completeness.
All right, so we've now spent a lot of time
talking about things at the protein level,
but of course, there is a lot we can learn
by looking at the same types of information
at the nucleotide level, so we're going to
talk about a couple of methods here. All of
the nucleotide BLAST algorithms can be found
in the basic BLAST section by clicking on
the nucleotide BLAST button. And when you
get to the program selection part of the search
page, you'll see that there are three options
available to you. Mega BLAST, discontiguous
Mega BLAST, and BLASTn.
So, what are those doing for you, what do
they actually do? First one, Mega BLAST, is
the one you would pick if you were searching
against complete genomes or chromosomes. So
this is really optimized for aligning really,
really long or highly similar sequences; things
95 percent similar or more. You'll see the
word size, because we're looking for things
that are almost exact is very high at 28,
the Affine gap -- excuse me, this is the scoring
matrix. So for a match you get one point;
for a mismatch you lose two. This is a linear
gap penalty, so unlike the Affine gap penalty,
just a value is assigned, the same value for
opening the gap and for extending the gap.
Okay. The middle block here, talks about discontiguous
metal -- Mega BLAST, and this is what you'll
be doing most of the time at the nucleotide
level. The results between these two are very
similar, but Mega BLAST will run faster in
relative terms. So here we've got discontiguous
Mega BLAST; here is BLASTn, our standard;
we've got a nucleotide search that we're comparing
against a whole slew of nucleotide sequences,
or nucleotide database, and you'll see that
the values are all the same here. Finally,
in the lower block, if you've got things that
are very short, nearly exact matches, again,
you can use BLASTn to do it here, but rather
than using the default, you can just change
the values here to push in that direction.
All right, so enough about BLAST. Let's talk
about something else that you could put into
your armamentarium, and this is a method called
BLAT, which is the BLAST-like alignment tool.
This is a method that was developed at the
University of California Santa Cruz, and it's
very similar to Mega BLAST in that it was
designed to rapidly align really long nucleotide
sequences, so sequences at least 40 in length,
having a 95 percent sequence similarity or
more. It reliably can find exact matches down
to a length of 33, but this is really the
method of choice when looking for exact matches
in nucleotide databases. The other thing that
is very impressive about it is its speed;
it's 500 times faster than BLAST for most
MRNA or DNA searches. The downside: it might
miss divergent or shorter sequence alignments,
because it's not optimized to do that; it's
optimized to find that these long stretches
of similarity. You can use it on protein sequences,
but if you're looking at protein sequences
you should really be using BLASTP instead.
So, when do you use this? You've now done
a sequencing effort, you have got a brand
new gene that you're trying to characterize,
or a sequence fragment that you're trying
to get information on, so you'll be able to
find its genomic coordinates, determine gene
structure -- so where the introns or extrons
might be -- or find markers of interest in
the vicinity of that sequence. You can also
use it to find highly similar or identical
sequences. So if you have a ton of MRNA data,
you could align those MRNA sequences onto
a genome assembly, you can start to identify
additional genome family members, or you can
start doing cross-species alignments to identify
putative homologs, so going back to the beginning.
The other thing that's nice about it is you
can take a sequence and show it as a separate
track within the UCSC genome browser. So this
is a new concept, just thinking about a track
-- we'll see an example of that shortly, and
Dr. Wolfsberg will be spending quite a bit
of time discussing the use of the UCSC browser
when she presents her lecture in two weeks'
time.
So let's just go ahead and go to the BLAST
-- excuse me, the BLAT homepage at genome.UCSC.edu.
The link to BLAT is in the left sidebar, so
if we just go ahead and click on that, it
brings us to a very simple search page. Across
the top you have a small number of selections
you can make, so you can pick which genome
you want to search against. In this case I
am starting with a rat sequence, so we are
going to search against a rat genome. The
query type here is DNA, and then the other
parameters are predefined for you. Then just
go ahead and click on submit, and we're going
to get back a very easy to understand results
table. Okay. So across the top we have a number
of sequences that were found based on our
query, you'll find a score for each one of
the sequences that were found based on our
query. You'll see a score for each one of
them; so these are in reverse score order,
the highest one at the top; starting and ending
positions for each one of those alignments,
and a percent identity. We're going to go
ahead and click on the browser link for the
very first one, the best hit in the group,
just so we can see what this now looks like
in the context of the UCSC browser.
So we see where our chromosomal position is
here by the red bar on this chromosome, on
Chromosome 5 of the rat genome. Our sequence
is the one that is labeled, quite self-explanatory
'Your Sequence from BLAT Search.' So there
is the sequence, and you'll notice some discontinuity
in the end; so it's not a perfect alignment
across the entire length. There's some gaps
in there, and there are also some regions
that are marked in red along the way. So the
reds are mismatches. You'll see the color
key here at the bottom, so red genome inquiry
sequence have different bases at this position,
and you can read the rest of them for yourself;
the other colors are in the key.
What we also see here are other tracks. So
each one of these lines is a track, so we
have information on rat MRNAs from GenBank,
Rat ESTs, the expressed sequence tags, some
conservation information; so looking at data
from mouse, and human, and dog, and opossum,
and so on. So it's a really powerful way of
not just finding what the chromosomal context
of your gene of interest is, or sequence of
interest is, but a lot of other information
about related sequences, other types of data
that go well beyond just simple idea of alignment.
And again, Tyra is going to spend a lot of
time talking to you about this during her
lecture in a couple weeks.
All right. If you go back to our first page,
we have another set of links here called 'details,'
so details link for our best hit. And so here
what we have, the top part of this is our
query sequence, the one that we started with
from our sample set of sequences on the current
topics sequence page. This is the sequence
that was found. You'll see that there are
things in different colors and upper case
and lower case; the key is at the top, so
anytime you have a matching base in CDNA and
genomics sequences those are blue and capitalized;
the things that are in light blue are the
boundaries of gaps in either sequences, and
those are often splice sites [spelled phonetically].
So, it's a view; it's not a particularly informative
view. The view that I think is much easier
to look at is the one that is reminiscent
of the BLAST view, where we just have the
two sequences aligned with one another, you
can see the positions of the gaps along the
way, and get a good sense for how aligned
these two sequences are with one another.
Finally, I just want to talk about one last
sequence alignment program, and that's called
FASTA, and so what FASTA does is identifies
regions of local alignment, must like our
BLAST method does, but it uses a different
algorithm to use it; it uses something called
a Smith-Waterman Algorithm to find the best
alignment between the two sequences. So the
methodology that underlies this technique
is significantly different than the one that
is used for BLAST. Here are the URLs if you'd
like to check that out and play with it a
little bit. It's slower than BLAST, but it's
more stringent in certain ways, so I think
one way to include this in your armamentarium
is, if you think back to our BLAST results,
we would draw a line at our E value cutoff,
you could use it as a second method to rule
in or rule out things around the cutoff. So
the same thing -- way that in a lab you might
use two different experimental techniques
to verify a result, here you can use two different
computational techniques to verify your computational
results.
Okay. So with that, that's where we're going
to leave things today. We're going to build
upon the basic concepts next week when we
start to discuss how we can go well beyond,
just looking at letter by letter alignments
to find commonality between sequences, and
we're going to end up talking about how we
can use multiple sequence alignments, and
apply those to your own work. So thank you
once again for joining us today, and we'll
see you again next week.
[applause]
