Hi! My name is Tony Smith.
In this lesson, we're going to look at a practical
application of data mining in the world of biology.
Knowledge discovery with biological data,
or so-called bioinformatics.
Now, there are many different types of biological
problems that we might want to study, many
different data types.
I'm going to look at a subset that's quite
common called sequence analysis.
Sequences of nucleotides that make up genes
or sequences of amino acids that make up proteins.
In fact, we'll focus on the latter.
We're going to look at a very easily stated
sequence problem for proteins.
It goes like this: given a freshly produced
protein, which portion of it is the signal peptide?
Now, what does this mean?
Well, you might remember from high school
biology that along your DNA, there are nucleotide
sequences called genes.
Genes get copied into messenger RNA to produce
a transcript, and the transcript is used to
string together amino acids into a polypeptide
chain, which is a protein.
Proteins perform some function in a cell,
and, in order to do that, they have to be
transported to where they're going to perform
that function, and, through that transport,
they have to pass through a membrane.
In so doing, what happens is the 20 or 30
or so amino acids at the beginning of the
protein--called the signal peptide--they open
up a translocation channel that allows the
protein to pass through the membrane.
In so doing, the signal peptide portion gets
cleaved off.
The signal peptide is kind of like a key that
opens a door for a protein, and, if we know
what the key is, it gives us an idea as to
what the function of the protein might be.
We want to predict where the signal peptide
ends.
Where is the cleavage point?
We first ask ourselves what's our general
goal?
Do we want an accurate prediction or do we
want an explanatory model?
Something that gives us some knowledge.
We'll have to ask what features might be relevant
in predicting the cleavage site.
So, what features do we need to generate from
the data we're given?
What approach are we going to take?
Which learning algorithms in Weka might we
use? And how are we going to know if the model
produced by Weka is any good?
How do we know if we're successful?
Here are ten or so instances of new proteins.
As you can see, they're sequences of letters
where each letter corresponds to a different
type of amino acid.
M is Methionine, A is Alanine, S is Serine,
and so on.
About 25 or 30 residues along from the beginning
of the protein, marked in red here, is the
cleavage site.
That's the beginning of the mature protein,
the part that survives after cleavage.
That's what we're trying to predict.
Which of those residues is the cleavage site?
What properties do we think are relevant?
Do we want properties of the entire signal
peptide or just properties around the cleavage
site?
We might get some domain knowledge from a
biologist to help us out, or we might do some
ad hoc statistical analysis to look for things
that might correlate with the cleavage site.
For example, given the 1400 examples in our
dataset, we might find that the signal peptide
lengths are very tightly clustered, with a
mean length of 24.
Knowing the position of a residue might be
useful in predicting whether or not it's the
cleavage site.
If we look at the residue at the start of
the protein and, perhaps, the three residues
immediately upstream of the cleavage site
and the three residues downstream from it,
there might be some useful information there,
some context.
In fact, if we do a histogram of the upstream
region of the data we've got, we'll see that
it looks like the letter A, Alanine, and perhaps
the letter L and maybe S, as well, seem to
be quite frequent around the cleavage site.
So, that could be useful.
When we don't have much domain knowledge,
we might come up with a set of features that
include the position of the residue being
considered; the residues at each position,
three either side of the cleavage point; and
then for each residue that we know is the
cleavage site, we'll put that in the class
of yes this is the cleavage point; and we'll
just get some negative instances by randomly
choosing some other residues and producing
the same information.
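The feature-generation scheme just described can be sketched in Python. This is a hypothetical script, not part of the lecture; the attribute names and the padding character are my own choices.

```python
# A minimal sketch of generating the context features described above: the
# position of the candidate residue, the residues three either side of it,
# and a yes/no class label, plus randomly chosen negative instances.
import random

def context_features(sequence, pos, is_cleavage):
    """Build one instance for the residue at index `pos` (0-based)."""
    def residue(i):
        # Pad with '-' when the window runs off either end of the sequence.
        return sequence[i] if 0 <= i < len(sequence) else "-"
    return {
        "position": pos,
        "minus3": residue(pos - 3), "minus2": residue(pos - 2),
        "minus1": residue(pos - 1), "site": residue(pos),
        "plus1": residue(pos + 1), "plus2": residue(pos + 2),
        "plus3": residue(pos + 3),
        "cleavage": "yes" if is_cleavage else "no",
    }

def make_instances(sequence, cleavage_pos, n_negative=1, seed=0):
    """One positive instance plus randomly chosen negative ones."""
    rng = random.Random(seed)
    instances = [context_features(sequence, cleavage_pos, True)]
    others = [i for i in range(len(sequence)) if i != cleavage_pos]
    for pos in rng.sample(others, n_negative):
        instances.append(context_features(sequence, pos, False))
    return instances
```

Each dictionary corresponds to one row of the spreadsheet; writing the rows out with Python's `csv.DictWriter` would give the kind of CSV file Weka can load.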
We might do this inside a spreadsheet.
Here's an example.
Each column is an attribute and each row is
one instance of a residue.
We record all this information.
This can be saved in comma-separated values
(CSV) format in most spreadsheet packages.
Weka, of course, can load a CSV file.
We're going to go ahead and load in this data
into Weka and have a go seeing if we can predict
the cleavage site from it.
I've loaded up the dataset that I just showed
you into Weka.
We see here we've got the features, the length,
or the position of the acid in question.
Which residue is at the -3 position, -2, -1.
The residue at the cleavage site, and 1, 2,
and 3 downstream.
And I've recorded whether this is an example
of the cleavage site or a randomly chosen
other residue that's not.
Now, if I go straight to classify, I want
an explanatory model, so I'm going to go for
a C4.5 decision tree.
I'll go down to trees, load up J48, which
is C4.5, and, under the default settings of
10-fold cross-validation, I'm just going to
go ahead and start up Weka.
It comes back pretty quickly.
If we look at the accuracy, we'll see we've
got 78-79% accuracy.
That's pretty good considering other state-of-the-art
software for predicting the signal peptide
cleavage point performs at about 80-85% accuracy.
So, we've already done really well, but is
this model any good?
Now, if we look at the true positive rates
for the two classes.
Here we've got the Yes and No class, and if
we look at the true positive rates, they're
around 80%, so that's pretty good.
Let's take a look at the decision tree produced.
I'll just pop up the visualization of it.
Enlarge that a little bit.
Fit to the Screen.
Now, there's a couple of reasons why this
decision tree suggests we haven't come up
with a very good model.
One is that it's very wide, very shallow,
and highly branching.
Each of these tests seems to produce a lot
of very small subsets.
This suggests that we've actually found a model
that overfits the data.
Now, what does that mean?
Well, let me give you an example.
Machine learning algorithms are trying their
best to get predictive accuracy, and it's
often very easy for learning algorithms to
find some model that will work.
There are two reasons why we might get good
performance for the wrong reasons.
One is sparseness of data, and another is
overfitting the data.
Let's look at each of these problems and see
if we can figure out what's going on with
our example here.
Data sparseness is another form of overfitting,
but it's specifically because we don't have
enough instances to figure out the true underlying
relationship.
Consider this very small dataset here.
What I've done is that I've rolled two dice--six-sided
game dice--and I've tossed a coin.
Two dice, one coin.
I've recorded the outcomes.
I rolled a 3 with one die, a 5 with the other,
and a heads with the coin.
I did that four times and recorded the four
instances here.
Now, we know that there are six possible outcomes
for rolling a die.
I've got two dice.
Two outcomes for a coin toss.
That's 6 x 6 x 2.
That's 72 possible instances we could've had,
but we only have 4.
I give these four instances to Weka.
I say come up with a rule that allows me to
predict the coin toss from the roll of the dice.
It comes up with a model: if Die1 > 2 then
the outcome of the coin toss is heads, otherwise
it's tails.
That fits the data we've got here.
100% correct, but, of course, if we had additional
instances, then hopefully Weka would see that
there's no correlation, these are random outcomes.
This is the problem of overfitting due to
data sparseness.
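We can make the dice-and-coin example concrete with a short sketch. The four training instances below are hypothetical rolls consistent with the rule described above; the point is that a rule fitting four instances perfectly need not generalize at all.

```python
# Overfitting due to data sparseness: a rule that is 100% correct on four
# observed instances, but can never beat chance on the full instance space,
# because the coin is independent of the dice.
from itertools import product

# (die1, die2, coin) -- four hypothetical observed rolls
training = [(3, 5, "heads"), (5, 2, "heads"), (1, 4, "tails"), (2, 6, "tails")]

def rule(die1, die2):
    # The overfitted model: if Die1 > 2 then heads, otherwise tails.
    return "heads" if die1 > 2 else "tails"

# Perfect accuracy on the four instances we happened to see...
train_acc = sum(rule(d1, d2) == coin for d1, d2, coin in training) / len(training)

# ...but over all 6 x 6 x 2 = 72 equally likely instances, the rule is
# right exactly half the time, since each dice pair occurs once with
# heads and once with tails.
full_space = list(product(range(1, 7), range(1, 7), ["heads", "tails"]))
full_acc = sum(rule(d1, d2) == coin for d1, d2, coin in full_space) / len(full_space)
```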
This is a real problem with our signal peptide,
because we've recorded 7 different residues
around the cleavage site, so each of them
can be 1 of 20 residues.
That's 20^7 possible patterns.
We've also got the position attribute, with
about 60 different integer values.
The two class values.
That's 153 billion possible instances of which
we have 1400 positive ones and an equal number
of negative ones.
A tiny fraction.
That's data sparseness.
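The back-of-the-envelope arithmetic behind that claim looks like this:

```python
# Size of the instance space for the context-feature representation:
# 7 residue attributes with 20 values each, roughly 60 possible positions,
# and 2 class values -- versus the 2,800 instances we actually have.
possible = 20**7 * 60 * 2        # 153,600,000,000 possible instances
observed = 1400 + 1400           # positive plus negative instances
fraction = observed / possible   # a tiny fraction of the space is covered
```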
Overfitting, in general, can be indicated
when the model is overly complex, such that
the tests practically uniquely identify instances.
The model splits instances into lots of very
small subsets, and a tell-tale sign of this
is the model is complex, highly branching.
That's what we see from our example here.
We can usually tell if we've been overfitting:
if we get some more data and try to predict
it on the basis of the tree we learned, we'd
get poor performance.
Of course, we don't often have extra data.
Given these characteristics of an overfitting
model, I would look at the decision tree we've
got here and suggest that it is overfitting.
One way to test that is I've actually prepared
a dataset with three times as many negative
instances.
I'll just go back and load up file two here,
sigdata2.
That's the same as the first dataset, only with
three times as many negative instances.
We'll just go back to Classify under the same
default settings.
We'll go ahead and start it up.
Now, if we look at the accuracy, we'll see
it's even gone up, 82.5%.
But, if we look at the true positive rate
of the cleavage class, it's actually down
to almost 50%.
That is practically a coin toss in its accuracy
in predicting the very thing we're interested
in: is this the cleavage site?
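To see how overall accuracy can mislead when the classes are imbalanced, here is a hypothetical confusion matrix consistent with the figures above (1,400 positive and 4,200 negative instances, 82.5% accuracy, about a 50% true positive rate on the cleavage class). The exact counts are illustrative, not Weka's actual output.

```python
# Accuracy vs. per-class true positive rate under class imbalance.
tp, fn = 700, 700      # cleavage sites: only half predicted correctly
tn, fp = 3920, 280     # non-sites: the abundant class carries the accuracy

accuracy = (tp + tn) / (tp + fn + tn + fp)   # overall accuracy
tpr_yes = tp / (tp + fn)                     # cleavage class: a coin toss
tpr_no = tn / (tn + fp)                      # non-cleavage class: high
```

The overall accuracy looks respectable only because the "no" class is three times larger and easy to get right.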
This doesn't look like a very fruitful way
of going about trying to predict the cleavage
site.
Our amino acid context approach appears to
be overfitting the data.
What else could we try?
Well, we might look for a different set of
features that capture the more general properties
of signal peptides.
A more informed approach, which we might learn
about by consulting an expert, a biologist,
is we assume that the cleavage occurs because
of physical forces at the molecular level.
That is, amino acids have electro-chemical
properties.
We might create features that capture those
physicochemical properties of amino acids
around the cleavage site or of the signal
peptide as a whole.
We can get some domain knowledge from the
experts.
What kind of knowledge would we get?
Well, this diagram here shows a distribution
of the amino acids at positions relative to
the cleavage site.
If we look at the -1 position, that's the
amino acids immediately upstream of the cleavage
site.
Here the size of the letters is proportional
to the frequency of the amino acid type at
that position.
We'll see at the -1 position, there's a lot
of A's, quite a few G's, S's, some C's and
T's.
At the -3 position, we see A's, V's, S's,
and T's.
Also, sort of the region 5 to 15 upstream,
we see there's a lot of L's, V's, and A's.
What's going on here?
What are the electro-chemical properties of
A's and L's and V's that we might exploit
to capture this non-uniform distribution in
these relative positions?
It turns out that amino acids have well-known
types.
They can be molecules that tend to not like
being near water.
They're called hydrophobic.
You see on the right side of this Venn diagram,
we've got A, V, P, M, L, F.
These are all hydrophobic amino acids.
On the other side, we've got the hydrophilic
ones, the ones that like to be near water.
We also have some amino acids that are positively
charged and some are negatively charged.
This affects whether or not they stick together,
of course.
And then the rest are not really very charged.
There are residues with small side chains,
the bit of the molecule that distinguishes
one residue from another.
We've got A, V, P, G, C, N, S there all have
small side chains, and the other ones are
somewhat larger.
These are the kinds of properties we could
record about the molecule around the cleavage
site.
In fact, biologists know of the physicochemical
properties around signal peptides, and they
talk about this thing called the C-region,
H-region, and the N-region.
Now, the C-region is just those 3 to 6 residues
immediately upstream of the cleavage site.
They're usually uncharged, and the residues at
the -3 and -1 positions are small--they have
small side chains.
Adjacent to that upstream is the H-region,
about 8 residues long.
That was all the L's and V's we saw.
It tends to be a hydrophobic region.
Then, upstream of that, to the beginning of the
protein, is the N-region, which tends to be
positively charged.
This is information we can use to construct
more informed features.
The possible features we might include are
the size, the charge, the polarity, and the
general hydrophobicity of regions of the signal
peptide, especially at position -1 and -3,
because they seem to be quite distinct.
We might compute the total hydrophobicity
in an approximate H-region, about 5 to 15 upstream
of the cleavage site.
We might look at the total charge, polarity,
and hydrophobicity in the C-region and so on.
Then record whether or not that's the cleavage
site.
So, for a couple of randomly chosen residues
which are not the cleavage site, we'll compute
these same features.
In fact, I've created a dataset which just
includes the following four features: the
position, as we had before--the same as the
length we had in the previous dataset--the
overall hydropathy of the approximate H-region,
the side-chain size for the -1 residue, and
the charge of the -3 residue.
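One of those informed features, the overall hydropathy of the approximate H-region, can be sketched as follows. The region bounds (5 to 15 residues upstream) come from the lecture; the use of the well-known Kyte-Doolittle hydropathy scale is my assumption, since the lecture only says a published hydrophobicity chart was summed.

```python
# Total hydropathy of an approximate H-region (residues 5 to 15 upstream
# of the cleavage site), using the Kyte-Doolittle hydropathy scale.
# Positive values are hydrophobic; note I, V, L near the top, as expected
# for the H-region.
KYTE_DOOLITTLE = {
    "I": 4.5, "V": 4.2, "L": 3.8, "F": 2.8, "C": 2.5, "M": 1.9, "A": 1.8,
    "G": -0.4, "T": -0.7, "S": -0.8, "W": -0.9, "Y": -1.3, "P": -1.6,
    "H": -3.2, "E": -3.5, "Q": -3.5, "D": -3.5, "N": -3.5, "K": -3.9,
    "R": -4.5,
}

def h_region_hydropathy(sequence, cleavage_pos):
    """Sum hydropathy over residues 5..15 upstream of the cleavage site."""
    start = max(0, cleavage_pos - 15)
    end = max(0, cleavage_pos - 5)
    return sum(KYTE_DOOLITTLE[res] for res in sequence[start:end])
```

A strongly positive sum suggests a hydrophobic stretch, which is exactly what the H-region of a signal peptide should look like.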
If we go back to Weka here, we'll just load
in file 3, the one I prepared here.
I'll just load it in.
Here we can see the position, the charge at
the -3 position, whether or not it's small
in the -1 position, and the overall hydrophobicity
here of the H-region, which you'll see is
a numeric value.
There are charts of general hydrophobicity
for amino acids, and I've just summed them
up for a region upstream of the cleavage site.
Let's go back to J48.
It's still all set up here for 10-fold cross-validation.
We'll start her off under the default settings.
If we look at our accuracy here, we've got--holy
smokes--91.5% accuracy.
That's great! Now, is this all just because
we're predicting one class?
We look at the true positive rate, and we'll
see we've got an average true positive rate
of almost 92%.
That's quite good.
But, we might ask ourselves, are we overfitting
the data?
Now, if we look at the model, it's going to
be quite small, because we don't have very
many features.
Maybe this is a little on the big side.
(Fit to screen here.)
We might wonder, are we overfitting the data?
Have we got a problem of data sparseness?
Well, once again, I can generate three times
as many negative instances to see if we're
just getting a sort of random outcome.
We'll go back to Preprocess here, open the
file sigdata4.
It's the same as sigdata3, but with three
times as many negative instances.
I'll load them all in.
We've got 5,620 instances.
I'll go back to Classify.
Same default settings.
Go ahead and start it up, and let's look at
the accuracy first of all.
Accuracy has gone up to almost 94%, but let's
look at those true positive rates.
Here, we see that our average true positive
rate for our two classes still remains high, 94%.
This indicates, in fact, that the model has
been relatively good at discriminating between
cleavage sites and non-cleavage sites.
In fact, if we look at the model, if we visualize
the tree, we can see a number of features
here.
At the top of the tree, it's looked at the
H-region, which we knew was useful in predicting
the cleavage site, and then it's looked at
the smallness of the -1 position and so on.
Overall, this looks like it might possibly
be capturing, in a formal model, the general
principles biologists told us all about.
When we're doing bioinformatics, the considerations
we have for doing data mining are these: we have
to ask ourselves, what's our overall goal?
Do we want predictive accuracy or explanatory
power?
How do we prepare the data to generate features
which are actually going to be useful for
solving our problem?
How can we evaluate how good the model is
that we get, knowing that Weka's going to
do its best to come up with a highly accurate
model, and may do so under spurious circumstances?
Most importantly, bioinformatics is an instance
where data mining really is a collaborative experience.
So, seek expert advice whenever you can.
I've set up some exercises, some activities
for you to do where you can explore bioinformatics
and signal peptide prediction more thoroughly.
Enjoy!
