MONICA: So today, Eamonn Keogh
from UC Riverside is here to
talk about data mining time
series and images.
And as usual, this talk is going
to be archived on Google
Videos, so please don't say
anything confidential.
EAMONN KEOGH: Thank
you, Monica.
Thanks, [INAUDIBLE].
And welcome to the talk.
So a quick outline.
The first half is a little more of a review.
I'm going to give a quick review of time series data mining, and a way of looking at it, the generic data mining algorithm.
And I'm going to show a symbolic representation of time series data called SAX.
And again, this is all review.
And then I'm going to talk about
in the next half of the
talk some new work we're doing
in mining shape databases, and
in particular two problems,
shape discord discovery,
that's finding unusual shapes,
and shape motif discovery,
which is finding near-duplicate
shapes.
And if I have time, I have a DVD
bonus feature which I'll
try and get to also.
So let's get right to it.
First of all, what
is a time series?
I apologize if--
oh, OK.
So what is a time series?
A time series is simply a
collection of data collected
sequentially in time.
And so here's an example here.
And of course, the beauty of
this is actually we can plot
this and we can now see it's
actually a heartbeat, and
maybe even know what kind of a
heartbeat this actually is,
what kind of [INAUDIBLE].
So for the rest of this talk,
I'm going to actually tend to
show pictures more than
equations and numbers.
Hopefully you'll go
along with that.
So I want to spend 10 minutes
making the following point--
that time series
are ubiquitous.
They're really absolutely
everywhere.
And the reason why this is the
case is because people tend to
measure stuff.
We measure how many web clicks
we get, we measure our
popularity in Google, we measure
whatever it is, and
things change over time.
So for example, George Bush's
popularity changes over time--
down, down, down, down, down--
and your Google stock
changes over time--
hopefully up, up, up, up.
And so the world is full of
these time series in medicine,
science, arts.
But the other point I want to
make over the next five slides
is time series are ubiquitous
for another reason--
because many types of data which
are not time series can
be massaged to become
a fake time series.
Let me show you five examples.
So here's example one.
I took the Bible as a long
sentence and I took a window
of length 100 words and I slid it across the Bible.
And now I'm going to
measure the local
frequency of certain words.
So the first example is God.
So for example, in Numbers,
God is hardly mentioned.
And in Deuteronomy, it's
all God all the time.
So this is like a pseudo time
series, and I can use this for
various tasks.
So here's one task.
If I want to translate the Bible
into Spanish, or if I
actually want to retrieve the
Spanish text, I can look for a
time series similar to the
blue one, and I find in
Spanish, I get the word Dios, which of course is the translation for God.
Now, this is actually kind
of a trivial example.
There's got to be better
ways to translate text.
But this could be actually a
bursty time series of, say,
weblog clicks.
And I could find certain
words associated with
other kinds of queries.
AUDIENCE: [INAUDIBLE]?
EAMONN KEOGH: Oh, sorry.
AUDIENCE: What is the difference
between the blue--
EAMONN KEOGH: OK, I shouldn't
have glossed over that, but I
mean, I just made it for fun.
So actually, when I did this
originally, I noticed actually
the match is overall
quite good.
But in one local area, there's
actually quite
a difference here.
And so the question is, why
is that difference there?
So I subtracted the red from the
blue, and I queried on the
difference of those
two time series.
And the closest match is
actually this thing right
here, which is El Señor, The Lord, a synonym for God.
And so for some reason, El Señor is a synonym for God, but it's locally bursty.
It doesn't happen in the Old
Testament very much, but it
happens a lot in the
New Testament,
particularly in Ezekiel.
So there's actually a nice extra
bit of value we can get
out of this sometimes.
Here's an example we'll come
back to later, but I'll
briefly show you now.
We can also take shapes
and convert
shapes into time series.
And again, one reason for doing
this is because we have
a big arsenal of tools that
handle time series very well
that don't work well on the shapes directly.
So we can actually handle things more efficiently in this space.
Yet another example--
this is actually every
professor's worst nightmare, a
grad student with a gun.
So in this case, actually, on
the top video, the grad
student is pointing
over there.
And here's three examples
over there.
Here, I'm tracking her hand.
And in the bottom
example, she's
pointing a gun at somebody.
And here are three examples
of the gun.
And in this example, if you look
at the time series, it's
easy to actually figure
out which is which.
Because there's a little bump
here you can see which
corresponds to the hand pausing
over the holster and
then drawing the gun out.
So we can easily classify this
in the time series space, but
it's actually hard to do in
the original video space.
This is, of course, a very
contrived example.
But for example, in Las Vegas,
there are similar moves you
can make with your hand on a
blackjack table that people
actually want to know
about automatically.
Here's another example of a
data set which is not time
series, but we can fake it
to become a time series.
So here's a handwritten
archive.
I'll just briefly mention, I
know Google is interested in
doing this because
I was in Dublin
giving a talk in November.
And halfway through the talk,
the dean of the college kicked
the door in.
And he said, Google has
given us the check.
And the whole room went wild.
So you're obviously giving these
guys too much money.
Anyway, if one actually wants to index this, we can transform it a little bit, remove the slant, and then we can make a pseudo time series--
one above the word, and one below the word--
and then simply index the time series, which we can do very, very well.
It's a very simple, naive thing, but it actually works much better than Markov models and these other tricks in this domain.
Yet another example.
I'll go very quickly
over this.
But here, we actually have a
four-dimensional brain volume.
And of course, it's
not a time series.
But we can actually fill the
volume with a Hilbert curve,
measure the local intensity at
every part of the brain, and
then take the Hilbert curve
and stretch it out--
as I've done here--
into a pseudo time series.
And once again, the idea is
simply that we can't handle
efficiently this high-volume
data, but we can handle
one-dimensional time series
very efficiently.
So the conclusion from all this
is that there really are time series all over the world
in every kind of domain.
And now let's do something
interesting with it.
So what should we do with
this time series data?
Well, we actually want to
do everything with it.
We want to do clustering, maybe cluster customers with similar web clicks, et cetera.
We want to do classification.
Is my heartbeat normal
or abnormal?
We want to do motif discovery,
perhaps.
So basically, are there
any repeated patterns
in this time series?
We might want to do
rule discovery.
So rule discovery might say, if
ever you see a double peak
like this, then within 10
seconds we're going to see the
shape here with some support
and confidence.
We may wish to do query
by content.
Basically, where I'm going
to Google time series.
Many time series data sets
are very, very large.
So we might actually want to
visualize massive data sets.
And then finally, novelty
detection.
So here's a heartbeat, and
something strange and novel
happened here.
The person either died or the
machine got unplugged.
So these are a diverse set of
problems, but they all have
one thing in common.
To solve these problems, we need
to actually figure out
how to measure similarity.
If we can do that in a
meaningful way, all these
problems I claim are trivial.
But if we can't do this in a
meaningful, useful way, then
all these problems actually
are impossible.
So let's talk about similarity
very quickly.
So the classic similarity measure is the Euclidean distance.
And here's the equation,
but the picture's
very simple to see.
We have two time series, C and
Q. We put them on top of each
other by z-normalization.
And then we can simply measure the distance between the i-th points-- q_i to c_i--
take that distance, square it,
sum it up, square root of it,
that's the Euclidean distance.
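As a rough sketch of that computation (my own illustrative Python, not code from the talk, with z-normalization done the usual way):

```python
import numpy as np

def z_normalize(ts):
    # Put the series on a common scale: zero mean, unit standard deviation.
    ts = np.asarray(ts, dtype=float)
    return (ts - ts.mean()) / ts.std()

def euclidean_distance(q, c):
    # Square the point-wise differences, sum them, take the square root.
    q, c = z_normalize(q), z_normalize(c)
    return np.sqrt(np.sum((q - c) ** 2))
```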
Euclidean distance is
very simple, but
it's amazingly effective.
You have to do an awful lot of
work to actually beat this.
This is incredibly simple
and easy to do.
Again, we're doing a review
here at this point.
We'll get to the new
stuff in a moment.
So one thing about data mining
is that most of the time in
data mining, you're actually
going backwards
and forwards on disk.
CPU time is typically
not even relevant.
It's all disk access time.
So to handle that, most data
mining algorithms don't work on
the original raw data,
whatever that is.
They work on some kind of
abstraction of the data.
So here we have a time series
in red, but we can't fit it
into main memory.
So we're going to put instead
this blue abstraction of it,
and that will fit into
main memory.
So here's one example
of an abstraction.
I'll show you what it
is in a moment.
One important idea is that if
you want to measure the
distance, let's say, between
Google stock and Yahoo stock
and we can't fit it all into
main memory, we can actually
have the approximations in
main memory, and we can
measure the distance
approximately in this space.
Again, a simple and
useful trick.
Now, of course, the piecewise
linear approximation is only
one approximation we
could have used.
We could have also used
wavelets, Fourier transforms,
SVD, et cetera, et cetera.
So we actually have many, many
choices of things we could do.
OK, now let me show you the
generic data mining algorithm.
And this basically works in
[UNINTELLIGIBLE] of anything
I've shown you--
indexing, classification,
clustering, rule discovery, et
cetera, et cetera.
It's a basic generic paradigm
for mining data.
The first thing we want to do
is we want to create an
approximation of the
data which actually
fits into main memory.
Since we can't fit all of the data into main memory, we're going to approximate it, and fit the approximation into main memory.
And now, we're going to in
main memory approximately
solve the problem at hand,
whether it's rule discovery,
et cetera, et cetera.
At this point, though, we have
an approximate solution which
might not be exact because
we're working on an
approximation of the data.
So the next step, we're going to
go to disk-- hopefully only
once or twice--
to either confirm the model or
just slightly modify the model
a little bit.
And the hope is that we're
going to make very few
accesses of the disk in this
stage, rather than many, many
accesses we would do if we
worked in the raw data.
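In outline, the paradigm looks something like this (a minimal sketch; the function names are placeholders of mine, not anything from the talk):

```python
def generic_data_mining(raw_data, approximate, solve_in_memory, refine_on_disk):
    # 1. Build an approximation of the data that fits in main memory.
    approx = [approximate(item) for item in raw_data]
    # 2. Approximately solve the problem at hand (indexing, clustering,
    #    rule discovery, ...) entirely in main memory.
    candidate = solve_in_memory(approx)
    # 3. Go back to disk, hopefully only once or twice, to confirm the
    #    answer or modify it slightly against the raw data.
    return refine_on_disk(candidate, raw_data)
```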
So if you believe this model,
then all you have to answer is
which approximation
did you use?
And there's lots of
approximations out there.
This actually is a complete
taxonomy of them.
And don't worry if you
can't read this.
It's not really important.
The point is, we have lots and
lots of choices, and which one
should we pick?
There is one very, very
important requirement for any
approximation we use, and
this is lower bounding.
It turns out that if we do this
generic framework and our
approximation obeys
lower bounding--
which I'll explain in
the next slide--
then when we do this, we can
guarantee the answer we have
at this point is identical to
the answer we would have
gotten had we actually looked
at the raw data and then all
those accesses.
So this is actually a
very nice result.
Again, all I'm saying is that
if the approximation obeys
this lower bounding principle,
I can do this very efficient
thing and end up with the same
answer as I would have gotten
on the raw data.
So what is this lower
bounding thing?
Well, we've seen we can make
a, for example, Euclidean
distance right here, but again,
I can't fit all this
raw data into main memory.
So suppose, for example, I had
a piecewise [INAUDIBLE]
approximation which does fit in
main memory, and I measured
a distance approximately
on that space here.
So here's a measure for it.
Lower bounding says that for
any possible two objects, Q
and S, that the distance in the
reduced space is going to
be less than or equal to the
distance in the true space.
Hopefully, it's going to be very
closely approximated, but
the important thing is it's
always going to be less than
or equal to.
If this is true, then again, we can now use the generic data mining paradigm.
Now, there's lots of
approximations I mentioned.
This is not even a complete
list. But
one of them is special--
symbolic.
And although there's lots and
lots of ways of taking a time
series like this and converting
it to symbols--
I have encountered at least 200 different ways of doing that--
unfortunately, none of them lower bounds.
In contrast, I can lower bound
wavelets, Fourier transforms,
SVD, et cetera, et cetera.
I can lower bound
all of these.
But unfortunately, of the 200
ways out there, none of these
actually lower bound in
a symbolic space.
So why do we care?
Well, just think about what you
can do if you can actually
use symbols that you can't
do in the real space.
So for example, I can hash
symbols meaningfully, but you
can't really meaningfully hash
real valued numbers.
I can use suffix trees on
symbols, which again, you
can't really use on
real value things.
Markov models, for example.
More generically, there are lots
of people out there who
are experts in dealing
with symbols--
people at Google, for example,
or people in bioinformatics
who are used to dealing
with [INAUDIBLE]
symbols all the time.
And they have wonderful ideas,
algorithms, data structures,
et cetera, et cetera.
And if I can actually have a
symbolic representation, I can
steal all of their
wonderful ideas.
And again, I can't really do
this in this real space.
So a contribution I want to talk
about briefly is we've
created the first symbolic
representation of time series
that actually allows
lower bounding of
the Euclidean distance.
And this means we can
do all kinds of
wonderful, clever things.
And I'm going to give you some
examples in a little bit.
So our representation is called SAX.
And again, for clarity, all I'm
claiming is this-- that
you can give me a time series
like this one here, I can put
it into my little magic black
box, and I'm going to spit out
a symbol string instead.
And furthermore, I can measure
the distance between two sets
of symbols, and this will be
a lower bound to the true
distance in the original
raw space.
And if this is true, then I can
use all these wonderful
algorithms and guarantee I get
the right answer with my
generic data mining paradigm.
So the obvious question is,
how do you get SAX?
It's quite simple.
We have the original time
series here in blue.
And I'm going to simply take a
window of a fixed size and
measure local areas, look
at the average.
So it's the average of this
section, the average of this
section, and so on
and so forth.
It's a piecewise average
approximation.
Once I do this, I'm going to
take a little Gaussian curve
here, I'm going to divide it
into three equal areas--
so the area here, the area here,
and the area here, are
all equal--
and I'm going to extend
some break points out.
And anything above this break
point is a C, everything below
the break point is an A, and in
the middle here, it's a B.
So I can take this real value
time series in blue and
eventually spit out these
symbols here.
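A minimal sketch of that conversion in Python, assuming a three-symbol alphabet; the breakpoints around plus and minus 0.43 are the values that cut a standard Gaussian into three equal areas, and the rest is my own illustrative code:

```python
import numpy as np

def paa(ts, n_segments):
    # Piecewise aggregate approximation: the mean of each equal-width section.
    return np.array([seg.mean()
                     for seg in np.array_split(np.asarray(ts, float), n_segments)])

def sax_word(ts, n_segments=8, alphabet="abc"):
    # z-normalize, reduce with PAA, then map each segment mean to a symbol
    # using breakpoints that split the Gaussian curve into equal areas.
    ts = (np.asarray(ts, float) - np.mean(ts)) / np.std(ts)
    breakpoints = [-0.43, 0.43]   # equal thirds of N(0, 1) for a 3-letter alphabet
    return "".join(alphabet[np.searchsorted(breakpoints, m)]
                   for m in paa(ts, n_segments))
```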
So let me preempt a question
people often ask.
So why a Gaussian?
It's true that for long global
time series, they may be
actually non-Gaussian.
But for small, local
sub-windows of virtually any
time series, they do tend to
be approximately Gaussian.
And we've seen this actually
in 128 length data sets in
very diverse domains.
So here's visual proof,
if you like.
And again, I can show you
this in many, many
different data sets.
So the idea is, by using this
technique, we guarantee that
we have approximately
equal symbols.
We have as many A's
as B's as C's.
And most things we want to do,
like Markov models and hashing
and suffix trees, have their
best cases if you have an
equal number of symbols, equal
[INAUDIBLE] symbols.
Sir?
AUDIENCE: Is it only
[INAUDIBLE]?
EAMONN KEOGH: No, this is
actually arbitrary.
And we can choose the number
of symbols we want to use.
AUDIENCE: This function is just
to discretize the data,
though, right?
EAMONN KEOGH: That's all
it does so far, yes.
AUDIENCE: So it could be linear,
it could be any other
distribution.
EAMONN KEOGH: It could be.
So actually, we thought of
making this adaptive at some
point originally, but when we
made it adaptive, it always
became a Gaussian basically.
So we simply hard-coded
that fact.
I should say actually that
if this actually is a
non-Gaussian time series, then
the correctness of all the
algorithms doesn't change.
Everything is still correct.
It might be less efficient, but
we're pretty safe there.
Here's a quick visual
comparison.
Here's a time series and here's
its SAX representation.
And visually, here's a
comparison with these other
approximations.
Visually, we don't really
lose anything, right?
The real reconstruction
error is about the
same in most cases.
I can always find you some
examples that make one of
these look better than the
other, but on average, it
tends to be that SAX actually
works as efficiently as the
other things in terms of
reconstruction error.
Again, I told you there's
a lower bound here.
I'm not going to actually
prove it or derive it.
But I want to re-emphasize that
it actually is true, that
the Euclidean distance between
two time series is measured
like this, as we discussed.
And in the SAX space, this function here, the MINDIST function, actually gives us a distance which approximates this distance and will always lower bound it.
And we can prove it
by transitivity.
Basically, this is known to
lower bound this, and we can
show that we can lower
bound this.
So we now know that we have a
lower-bounding representation.
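For reference, the lower-bounding distance in the published SAX work has roughly this form (stated from memory, so check the paper for the exact definition), where $n$ is the original series length, $w$ the number of segments, and $\mathrm{dist}(\cdot,\cdot)$ a lookup table built from the Gaussian breakpoints:

$\mathrm{MINDIST}(\hat{Q}, \hat{C}) = \sqrt{\tfrac{n}{w}} \, \sqrt{\sum_{i=1}^{w} \bigl(\mathrm{dist}(\hat{q}_i, \hat{c}_i)\bigr)^{2}} \;\le\; D_{\mathrm{Euclidean}}(Q, C)$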
AUDIENCE: [INAUDIBLE]?
EAMONN KEOGH: There are
two parameters in SAX.
One is the cardinality of the alphabet-- A, B, and C in the [INAUDIBLE] example.
And the other is the number
of sections here.
And hard choices have to be
made here, unfortunately.
Although, actually, the choices are not critically sensitive in most cases.
OK, just a bit of a history of
SAX and then we're going to
get into this new stuff
with shapes.
So it was invented in 2002.
And the first thing we did is
we showed that for most
problems, it basically works as
well as Fourier transforms
and wavelets, et cetera, et
cetera, for all the classic
but boring problems--
classification, clustering, and indexing.
And in 2003, we showed that
actually SAX allows linear
time discovery of time
series motifs.
So if you've got a time series
and you have a repeated
pattern in it, we can actually
find those in linear time.
It's been referenced 60 times, but actually most of these references are applications of this.
People now use this technique
for motion capture, they use
it for telemedicine,
mining medical
data sets, and whatnot.
So people actually really
do use this quite a lot.
And in 2004, we showed how to
use SAX to do parameter-free
data mining.
Last year, we had a real
explosion in things we're
doing with SAX.
So we'll do an anomaly
detection, visualization of
massive data sets, joining
streams, et cetera, et cetera.
And recently--
for example, this
year at KDD--
there were many papers, at least submitted, that used SAX instead of wavelets and Fourier transforms. So I
really think that SAX is getting
some critical mass and
I think it's actually going
to take off quite a bit.
So this has all been review.
Today I'm going to show you that
with SAX, we can actually
now do data mining on shape
data sets and do very
interesting things
in that domain.
AUDIENCE: [INAUDIBLE]?
EAMONN KEOGH: No, no.
Any distribution is fine.
AUDIENCE: So uniform is fine.
EAMONN KEOGH: Any distribution you want.
Uniform is fine.
But in general, Gaussian will give you equiprobable symbols and uniform will not--
the A's will be very rare, and the B's will be more common, and then further up the alphabet, the symbols get less common again.
AUDIENCE: But isn't it that
you need to [INAUDIBLE].
This is the simple approach
to 30% of [INAUDIBLE].
EAMONN KEOGH: Uniform actually,
we can handle
uniform if we want.
But in general, virtually every
time series has this
Gaussian property.
It's kind of hard to believe,
and reviewers often don't
believe it.
So we actually built
a web page.
And for 100 data sets, we showed it's true, and we pointed our reviewers to that.
People's intuition happens
to be wrong there.
Again, globally, most time
series are not Gaussian.
But small, local windows
virtually always are.
AUDIENCE: [INAUDIBLE]
intuition [INAUDIBLE]
when I'd like to present my time
series as symbols, what I
will do is simply the same,
maybe not this [INAUDIBLE]
but I would take the high values and name them A, B, C, D with uniform divisions.
EAMONN KEOGH: If you do that, you will have symbols with different frequencies.
And you generally don't want that.
You generally want equal frequencies.
Because again, for Markov
models, for hashing, for
suffix trees, it's always
better to have
equiprobable symbols.
But you could do that
if you wanted to.
And SAX would allow that
if you wanted it.
That would be fine.
Question?
AUDIENCE: I feel like you
answered it, but why do you
want the equiprobable symbols?
[INAUDIBLE].
EAMONN KEOGH: Empirically,
that's true.
You can't prove it because there
can always be a special
case it's not true for.
Empirically, that is the case
that the lower bound is
maximally [INAUDIBLE]
when that's true, yes.
AUDIENCE: Because also, this
relies on the shape of the
time series.
Because the time series starts
at very low values and then
grows larger and larger.
Do we need more symbols to
represent this time series?
EAMONN KEOGH: The number of
symbols is fixed as a user
choice with a user parameter.
So that isn't the case.
AUDIENCE: [INAUDIBLE]
the number of symbols to be
only three, for example.
OK, I'm going to [INAUDIBLE]
that the number of frequencies
would be similar.
But three is not enough for
this type of series.
So is there a way to figure
out which number is
[INAUDIBLE]?
EAMONN KEOGH: In general, we
impose three or four as the
number of symbols.
But actually, you could try to learn it using, say, minimum description length, something like that.
And we've done some
work on that.
It's not in the talk today, because we found that basically three symbols, for virtually any kind of problem, any kind of data set, tends to work the best.
So we did do work to make SAX adaptive in the [INAUDIBLE] symbols, but really it becomes a waste of time.
The symbol thing just works
amazingly well.
So I hinted at this earlier on,
but I want to show it to
you again in more detail, that
we actually can convert shapes
to time series.
So here we have a shape.
And all we do is we find the
center point, the mean value
of the shape.
And now we're going to actually have a clock hand go around 360 degrees, measure the local distance to the edge here, and that local distance becomes the height on the y-axis.
So some observations
about this.
First of all, there's other ways
of doing this, but this
way I actually like quite a
lot because first of all,
there are no parameters-- it's
completely parameter-free--
it's very fast and easy, and
it's very, very intuitive.
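Here's a small sketch of that conversion, assuming the shape arrives as a binary image mask; the helper names and the angular binning are my own choices, not the talk's code:

```python
import numpy as np

def shape_to_time_series(mask, n_angles=360):
    # mask: 2-D boolean array, True for pixels inside the shape.
    ys, xs = np.nonzero(mask)
    cy, cx = ys.mean(), xs.mean()                  # center point of the shape

    angles = np.arctan2(ys - cy, xs - cx)          # angle of each pixel
    dists = np.hypot(ys - cy, xs - cx)             # distance from the center

    # Sweep a "clock hand" through 360 degrees; the height of the time
    # series at each angle is how far the shape extends in that direction.
    bins = np.linspace(-np.pi, np.pi, n_angles + 1)
    ts = np.zeros(n_angles)
    for i in range(n_angles):
        in_bin = (angles >= bins[i]) & (angles < bins[i + 1])
        if in_bin.any():
            ts[i] = dists[in_bin].max()
    return ts
```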
So why do this?
Why take shapes and map
them into time series?
Well again, there's a whole
arsenal of tools we can use in
time series space.
Let me give you some
quick examples.
This is the famous Brodtmann
image of two
howler monkeys idealized.
Here's the real deal.
Here's some real
howler monkeys.
And you actually can see that
Euclidean distance here will
really be quite intuitive, that
a howler monkey can be very similar to another howler monkey, and Euclidean
distance can reflect that,
and life is good.
So if we actually query with
this, we'll probably find this
thing here.
Another tool in our arsenal
is dynamic time warping.
So in some cases actually,
the shapes might be
morphologically changed.
So for example, different gorillas might have a different prominence of the brow ridge, or a different prominence of the jaw, but in
a time series space, that
simply becomes the fact that
these peaks move backwards and
forwards in the x-axis.
And again, we can beautifully
handle that within dynamic
time warping.
AUDIENCE: A question.
If there are phase shifts, so
how do you align this stuff?
Because--
EAMONN KEOGH: So [INAUDIBLE]
great questions.
I think the question's rotation invariance, basically?
So if one skull is rotated
around 90 degrees?
Lovely.
We actually can handle that,
but let me just push that
aside for the moment.
I actually brought some papers
to show you, or we could talk
about it offline.
This is actually the
general question.
How do you make sure the two
skulls are aligned in the
right direction?
And for the moment, I'm going
to ignore that, but we
actually can handle
that easily.
AUDIENCE: Why does the purple
line cross the orange line?
EAMONN KEOGH: Oh, it looks weird
but it's only because
you can't--
everything actually matches correctly, but because one's above, one's below, basically.
[INAUDIBLE] actually
changed around.
It looks weird at first, but
if I took this one here and
pulled it down here,
[INAUDIBLE] the
lines wouldn't cross.
AUDIENCE: Oh, I see.
EAMONN KEOGH: It's a bit
visually distracting.
But if I took this and
pulled it down here,
that will be good.
AUDIENCE: Question.
[INAUDIBLE]?
EAMONN KEOGH: Oh, sure.
Well, [INAUDIBLE] we normalized these all to unit length.
So the size and shape
makes no difference.
Obviously, if we have a skull
that's five times as big as
another skull, it's probably
just that the photograph was
taken from a different
distance.
So we're normalizing everything to unit length, basically.
And sometimes even dynamic time
warping isn't enough.
This actually is the famous Skhul 5 skull.
And when you actually see it in universities, typically you have this example here where they filled in the missing parts with epoxy.
But the real actual skull that
was found is basically missing
all the nose region.
Now, if I want to match this,
it's problematic because if I
insist upon matching the missing
nose region, I'm going
to get pathological results.
So once again, in the time
series arsenal, we actually
have a tool, longest common
sub-sequence, which basically
says, try to match this shape,
but if it's too hard to match
something, basically give up.
And so here, it matched
everything beautifully, but
just the nose part, it
essentially gives up
in that part here.
It knows it's like
a wild card.
I'm not going to match this.
So again, here's some examples
that simply show that we can
actually take these shapes,
convert to time series, and
then throw lots of different
tools at them.
But I want to point out,
actually, the Euclidean
distance tool works
exceptionally well by itself.
So here's a collection of
primate skulls based upon the
Euclidean distance.
And it's not the correct
evolutionary tree.
So for example, the humans here
should be closer to the
orangutans obviously.
But it is subjectively
sensible.
And at the lower levels of the
tree, all these red things
here, all these blue things
here, et cetera, it is
actually giving us the correct
sub-tree for the tree of life.
Notice by the way that the
humans are an outlier here
because of their massive
brain case.
So again, the point simply at
this point is that we can take
shapes, convert the time
series, and then we can
meaningfully measure similarity
in that space.
Once you believe that,
[INAUDIBLE]
actually do some real original data mining on shapes.
And again, this is some
ongoing work.
We're going to show some
interesting things we can do
with some shapes.
So the first thing I
want to look at,
we call image discords.
Let's say I give you a database
of shapes, it could
be quite large, and the question
is what's the most
unusual shape in
the collection?
So you can look at
this visually.
Probably most of you will
grab this one here.
And this is actually the correct answer, objectively speaking.
So we actually have an objective
definition for this.
We call it shape discord.
And the intuition is we want
to find the shape which has
the maximum distance to
its nearest neighbor.
So in terms of islands,
you think of
Hawaii as being a discord.
Its nearest neighbor state is
probably California or Alaska
which is very far away.
Whereas Nevada has a very
close nearest neighbor.
So of all the possible shapes
in here, which one has the
maximum distance to its
nearest neighbor?
And it happens to be the
shape right here.
So why is this useful?
Let me show you some examples.
So here's a fairly large
collection of fruit fly wings.
And often people modify the
genes and then see what
difference it makes in the
phenotype of these insects.
And actually, if you get the
wrong gene, like a [INAUDIBLE]
gene, you can actually make an
antenna sprout out of the wing
[INAUDIBLE] for example.
So in this case actually,
we mined this.
And this one popped out here.
It's clearly not a phenotype.
It's simply someone damaged
a wing basically.
But the point is that we can
actually find this unusual
shape quite efficiently,
as we'll see.
AUDIENCE: Is it similar
to outliers?
EAMONN KEOGH: It is somewhat
similar to outliers, but there
are some kind of differences.
Outliers in Euclidean
space are easy to
find relatively speaking.
But in this case, because we have a rotation-invariant metric, we can't actually embed into Euclidean space directly, as we'll see in a moment.
But it is somewhat similar
to outliers, yes.
So here's another example.
Here we have a very large
collection of petroglyphs--
basically, ancient graffiti.
And we want to actually find are
there any unusual shapes
in this data set?
This of course is
a subset of it.
And so we found the number one
discord here is the shape
right here.
And so the question is why?
If I take a different shape,
let's say this one here, and
look at its nearest neighbor,
which happens to be this one
here, you can see actually, they
match quite well.
Some small local differences,
but they're
basically very similar.
And this is true for
almost all of them.
They have some similar shape
in the database.
But this one here doesn't.
Why is that?
So here's actually the red one
and the blue one together.
There's a local difference
here.
And the explanation is that this
one uniquely has an arrow
sticking into it.
It's a bighorn sheep, and
someone stuck an arrow into it
which caused this large increase
in the time series
space here.
As it happens, on visual inspection, this is the
only one that actually
has this arrow in it.
Two more quick examples.
Here's a collection of
projectile points or
arrowheads.
And it's hard to estimate how
many of these things there
actually are.
At UCR alone, we have 1 million
of these things.
Unfortunately, most
of them are not
photographed or indexed.
But with a small subset of
these, a few thousand of them,
we run the algorithm and it
pops out this one here.
And actually, I was quite
surprised to know that
arrowheads can be asymmetric.
And finally, some kind
of special fishing
tools look like this.
And this example right here is a subset of a large slide of cells we looked at.
And most blood cells tend to be
round, which is like a flat
line in time series space,
or a figure-eight shape.
This basically corresponds to
either two cells put together
or a cell dividing.
And again, this isn't unusual
because it looks like this,
and that's not unusual because
it looks like this, and so on
and so forth.
But one of them, this one right
here which comes from
here, is unusual.
Because actually it's
tear drop shaped.
And it has no near match
in the data set.
And as it happens, I think
tear drop shapes are
indicative of a kind of
anemia basically.
So after these examples right
now, hopefully you believe
that this definition can be
useful for looking at large
data collections.
I guess the question is now,
how do we find these things
efficiently?
Well, first of all, how do
we find them in general?
Here's the actual
MATLAB code.
It's very simple to find them,
but unfortunately,
it's very, very slow.
So let me explain
the algorithm by
looking at this matrix.
We don't need to build a matrix,
we can actually simply
compute things on the fly.
But it's easy to understand if I
actually look at the matrix.
So the code basically is two
nested for loops which simply
walk across each column.
And then within each column,
visit every single row.
And they look for the
smallest value
apart from the diagonals.
So here I come down, I find
that in this column here,
1.1's the smallest value.
That's my best so far.
And I keep looking for those.
At this point, I'll find here
that 7.5 is the smallest value
in this column, and
it's actually the
largest smallest value.
So in this case, the discord is actually in column six.
So the good news is it's a very
simple, trivial algorithm.
The bad news is, of course,
it's quadratic.
It's actually constant space,
because I don't have to keep
this in memory.
But I actually have to visit all
the cells here, so I have
a quadratic algorithm.
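The talk shows this in MATLAB; a rough Python equivalent of the same brute-force nested loop might look like this (a sketch of mine, with the distance function left abstract):

```python
def brute_force_discord(shapes, dist):
    # For every candidate (outer loop), find its nearest-neighbor distance
    # (inner loop); the discord is the candidate whose nearest neighbor is
    # furthest away.  Quadratic in the number of shapes, constant space.
    best_dist, best_index = -1.0, None
    for i, candidate in enumerate(shapes):
        nn_dist = min(dist(candidate, other)
                      for j, other in enumerate(shapes) if j != i)
        if nn_dist > best_dist:
            best_dist, best_index = nn_dist, i
    return best_index, best_dist
```

With a rotation-invariant shape distance plugged in for dist, this is exactly the quadratic baseline the pruning and ordering tricks below try to beat.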
And the question is, can
we do better than that?
And the answer is,
of course, yes.
So one simple observation is
going to help us here, which
is suppose I'm searching
across here.
So I've found in this
column 1.1.
That's my best so far.
And then here, for example,
I find 2 which
is my best so far.
When I get to the next column,
as I scan down, if I find the
value here, apart from the
diagonal one, which is less
than my best so far,
I can stop.
When I find the 1.2 here, I
know there's no point in
continuing on because this is
not going to be the discord.
And so I don't have to visit
this cell or this cell.
So a very simple pruning
technique can
help us quite a bit.
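The pruning only changes the inner loop: as soon as any distance drops below the best-so-far, this candidate cannot be the discord, so we abandon it. A sketch along the lines of the previous one:

```python
def discord_with_early_abandon(shapes, dist):
    best_dist, best_index = -1.0, None
    for i, candidate in enumerate(shapes):
        nn_dist = float("inf")
        for j, other in enumerate(shapes):
            if j == i:
                continue
            d = dist(candidate, other)
            if d < best_dist:        # smaller than best-so-far: give up on i
                nn_dist = -1.0
                break
            nn_dist = min(nn_dist, d)
        if nn_dist > best_dist:      # new best-so-far discord
            best_dist, best_index = nn_dist, i
    return best_index, best_dist
```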
AUDIENCE: [INAUDIBLE]?
EAMONN KEOGH: So in this simple
example, [INAUDIBLE]
constructed.
But of course we don't really
[INAUDIBLE] make this.
We build it on the fly,
measure these
things on the fly.
OK, so actually potentially
could help a lot.
And there's one other
trick that could
help us quite a bit.
We can actually have heuristics,
called outer and
inner, which tell us
the order in which
to visit these columns.
So I'm assuming a for loop right
now that goes from left
to right and from
top to bottom.
But I could simply permute that and use any permutation.
I could visit this one, then
that one, then that one.
And then within each column, I
can visit this one, then that
one, then this one, et cetera.
And if we're very careful,
this can make a huge
difference.
If we actually visit them in a
special order, things get a
lot better.
How much better?
Let's imagine a magic
heuristic which is
given to us by God.
Well, what would a heuristic
actually do?
It would tell us that we should
visit the columns in
order of discord value.
In other words, we should
actually jump to the real
discord first, and then the
second best discord second,
and so on and so forth.
If that happened, then we would
find a large value here,
7.5, early on, and we could
do a very good pruning.
And likewise, for the inner
heuristic, if we were magical,
what we would do is that when we
got a column, we would jump
to the smallest value
right away.
We'd go straight to the nearest
neighbor, and it would
be less than our best so far.
And we could stop.
We wouldn't have to look
at the rest of the
things in the column.
AUDIENCE: What's [INAUDIBLE]?
EAMONN KEOGH: Yeah, so the
analysis basically is this.
If we had this magical
algorithm, the first column we
went to, we'd actually visit
every single thing--
that's order of n.
And then an order of n number
of times, we make one
[INAUDIBLE] move.
We'd jump to the nearest
neighbor and we'd stop.
So basically, it'd be
2n, which is linear.
AUDIENCE: But we don't know
the nearest neighbor--
EAMONN KEOGH: No, no.
I agree.
So this is a magic algorithm.
This is what God can do.
We can't do this.
OK, so if we were magical
and God has given us the
heuristics to order these
things, we've actually gone
from quadratic to linear.
So linear potentially is very
good, but the question is can
we actually do this?
So the answer is we can't
actually do it, but we can
approximate it very closely.
So the two observations are
this-- that although we'd like
to visit the columns in this
magical order, as long as we
actually get a reasonably high
value early on, it really
makes no difference.
So early on in our first few
attempts in the outer loop, we
want to find a pretty
high value here.
And if we do that, we can
prune quite well.
And the other observation is
that for the inner loop, we
don't really have to jump to
the nearest neighbor in
constant time as long as we
jump to a near enough
neighbor, a neighbor that's
actually less than the current
[INAUDIBLE] so far.
So if it's the fifth nearest
neighbor or the 20th nearest
neighbor, it might be
near enough to stop.
So based on these two
observations, we're going to
try to approximate the
magic algorithm.
AUDIENCE: So is this basically
a [INAUDIBLE] then?
Or similar to [INAUDIBLE], where
you're just picking a
random value and then trying
to jump around the space to
find a [INAUDIBLE]?
EAMONN KEOGH: No, we do
it better than random
[INAUDIBLE].
So the answer to this problem
is actually going to
be based upon SAX.
What else?
We take our shape, convert it to
a time series, convert that to SAX, and now we're going to put
these SAX words, c-a-a,
into this little matrix here.
So this is the first shape,
c-a-a, and so c-a-a goes here.
And I guess the next
shape was c-a-b.
And that goes here.
And so on and so forth.
So I'm going to populate this
[INAUDIBLE] matrix here.
As I'm building this, I'm also
going to insert these words
into a suffix tree.
So for example, image
one was c-a-a.
And I put a 1 here because image
one came from there.
And then for image two, c-a-b,
c-a-b, I put a 2 here.
And for image three,
it's also c-a-a.
So c-a-a I have an overspill.
I have a 3 here.
OK, so hopefully you believe me
that actually I can build
these in linear time, and
basically constant space, in
fact, a very small
constant space.
So [INAUDIBLE]
built these, and I'm
going to use these
to approximate magic.
So the first question is, I want
to visit the outer loop
where I'm going to find a likely
discord very early on.
And how do I do that?
Well, I have this count here--
1, 2, 3-- and I can map
it back into here.
So everything that is here--
c-a-a, this one here gets
a value of three.
This one here gets
a value of three.
This is simply a count of the number of objects that happen to map to that same word.
So the intuition is that if it
had an unusual shape, it
probably had an unusual time
series, which probably had an
unusual SAX approximation.
So its count here is going
to be very low.
In contrast, if I have a very
common shape, it's going to
have a very similar time series,
very similar SAX
words, and the count probably
will be very high.
So my outer magical heuristic
is going to be this.
Find the smallest value in this
column, and whatever that
is, I'm going to jump to that
value first. The value's
typically one.
And if I have some ties,
I'll jump to them.
I'll take all the ties and
go to those first.
So my simple approximation for
the magic outer heuristic is
find a small value here and then
go to those values first.
In the inner loop, the same
intuition works here.
So as I'm going through my
search, and I want to see if
this is the discord.
Or if not, I want to find
out very quickly.
The intuition is I want to find
similar things early on
to just be able to prune.
And how am I going to do this?
Well, similar things to this,
c-a-a, are likely to be either
1 or I guess 731.
Because they map to the same
SAX word, they're probably
similar, they probably have a
very small nearest neighbor
distance, and I can actually
abandon search early on.
So this is my heuristic
for the inner loop.
And my hope is with those two
simple heuristics, I can
actually approximate the
magical algorithm.
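Putting the two heuristics together, a rough sketch (entirely my own illustration; sax_words stands for whatever SAX conversion is used, and the orderings would feed the early-abandoning search sketched earlier):

```python
from collections import defaultdict

def sax_ordering_heuristics(sax_words):
    # Group object ids by their SAX word, like the array/suffix-tree above.
    buckets = defaultdict(list)
    for i, word in enumerate(sax_words):
        buckets[word].append(i)

    # Outer heuristic: visit objects whose SAX word is rare first; an
    # unusual shape probably has an unusual word, so its count is low.
    outer_order = sorted(range(len(sax_words)),
                         key=lambda i: len(buckets[sax_words[i]]))

    # Inner heuristic: when examining object i, compare it first against
    # the objects that share its SAX word; they are the likely near
    # neighbors, so the early-abandoning test tends to fire quickly.
    def inner_order(i):
        same = [j for j in buckets[sax_words[i]] if j != i]
        same_set = set(same)
        rest = [j for j in range(len(sax_words)) if j != i and j not in same_set]
        return same + rest

    return outer_order, inner_order
```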
So this is ongoing work.
I don't have very detailed
results here right now.
But on a small data set, like
200 arrowheads, we find that
we actually get within 22%
of the magic algorithm.
And the good news is for larger
and larger data sets,
we actually get better and
better and better.
We get asymptotically close
to the magical algorithm.
Essentially, we're solving a quadratic problem in near linear time.
Now, one small problem is this
is actually empirically true
in all that we try.
But in the worst case, it could
still be quadratic.
If you gave me a database where
every shape is identical, I'm
in trouble, right?
But for realistic data sets with
some kind of variance, we
actually get basically a
linear time algorithm.
OK, let me show you a
complementary problem.
And I'll do this at kind of a high level, to be a little bit faster.
Now we have a data set.
And the question is are
there any duplicates
in this data set?
So trivially here, I guess we
have this and this duplicate.
And then the question is again,
how can we find these
duplicates and find
them efficiently?
And once again, we
have a reasonable
definition for this.
Basically, we're trying to
minimize the Euclidean
distance between the
pair of objects.
And once again the question
is, is this useful?
And I'm going to just be
very brief and show
you this one example.
So here we have a data set
of about 4,000 different
petroglyphs from all over
North America, including
Canada and Mexico.
And the question is, are
there any similar ones?
And subjectively, there are
similar ones, but what are the
most similar pair in
this data set?
And actually, here's
the answer.
And it's actually quite
interesting.
First of all, notice actually
we had to be rotation invariant to find this.
So they're similar, but one's
rotated 90 degrees.
What's also interesting is
actually that these two came
from locations which are
only 150 miles apart.
So potentially, there's
actually some
meaning to this, basically.
Although, I actually don't
know at this point.
OK, so image motifs are quite
interesting in this example.
But how can we find them?
And again, for time reasons,
we're just going
to gloss over this.
But basically, the idea
is, once again, SAX.
We can use a SAX representation.
And as it happens, repeated
patterns in DNA strings are
very, very important.
So we're going to steal someone
else's algorithm, take
our shapes, convert them to
SAX, and then find DNA
repeated patterns essentially,
and then because of its lower-bounding property, we can go back to the data with the generic algorithm, do some cleanup work, and in basically linear
time, we can find the repeated
patterns using someone else's
[INAUDIBLE].
And I think the important thing
to point out here is
that you can really only
do this with SAX.
It's hard to imagine how you'd
do this with wavelets or
Fourier transforms
or anything else.
Because it relies upon the
ability here of hashing in the
random projection algorithm.
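The bioinformatics idea being borrowed here is random projection: repeatedly hash the SAX words on a random subset of their positions, and treat objects that keep colliding as candidate motif pairs. A very rough sketch, with every parameter an arbitrary choice of mine:

```python
import random
from collections import defaultdict

def candidate_motif_pairs(sax_words, n_iterations=10, n_masked=2, seed=0):
    rng = random.Random(seed)
    word_len = len(sax_words[0])
    collisions = defaultdict(int)
    for _ in range(n_iterations):
        # Keep all but a few randomly masked positions of every word.
        keep = sorted(rng.sample(range(word_len), word_len - n_masked))
        buckets = defaultdict(list)
        for i, word in enumerate(sax_words):
            buckets["".join(word[p] for p in keep)].append(i)
        # Everything that lands in the same bucket collides this round.
        for members in buckets.values():
            for a in range(len(members)):
                for b in range(a + 1, len(members)):
                    collisions[(members[a], members[b])] += 1
    # The most frequent colliders go back to the raw data for exact checking.
    return sorted(collisions, key=collisions.get, reverse=True)
```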
Actually, I'm going
to conclude here.
But I have a five minute bonus
feature which I will try to
get-- is that OK?
AUDIENCE: I think so.
So you can [INAUDIBLE].
You've got 13 minutes.
EAMONN KEOGH: Lovely.
So conclusions, and again,
I'll go on for
another five minutes.
But the first conclusion is--
and you probably already know
this basically--
the representation
is everything.
With the right representation,
you can solve any problem
efficiently.
And for time series data mining,
SAX is probably the
best answer for most problems.
But the nice new result here
hopefully is that SAX is also
probably the best answer for
all kinds of problems
with shapes too.
And I really can't overemphasize
this.
I really believe that SAX
actually is the way to go for
virtually all kinds of problems
in time series and
shape data mining.
You can just do things when
you have symbolic
representations that you can't
do in the real space.
And you can simply use so
many clever tricks.
Again, I will go on after this,
but my conclusion was
going to be, if I had enough
time, SAX appeal.
I'm always interested in new
collaborators and I'm always
interested in new problems, new
data sets, and of course,
I'm always interested
in money.
OK, so here's the five
minute bonus DVD
feature, if you like.
We can actually use SAX
to visualize very
massive data sets.
And here's how we're
going to do that.
Let's start with a digression.
Here's some DNA from two
different animals.
And the question is, are
they similar animals?
And obviously, there's
no way you could tell
by looking at this.
And of course, this is actually
a small subset of a
genome which is about 3 billion
base pairs in a mammal.
Let's try a little trick.
Let's build a 2x2 matrix, and
let's label the cells A, T, C, G arbitrarily.
And let's actually count how
often we see those letters in
the DNA sequence.
So in this case, 20% of the time
we saw a C, and 24% of
the time, we saw a T, and
so on and so forth.
I can actually make more
complicated matrices, by basically recursively doing a little trick here, and then
count every possible
combination of two.
C-C, C-T, T-C, et cetera,
et cetera.
And I can do this, of course,
at any level I
want to, if I want.
And now I count how often I see C-C's or C-T's, et cetera, et cetera.
Having done this, I now have
a matrix of numbers.
And I can map those to colors.
I can lay out a color scale arbitrarily from 0 to 1.
And in this case, I see that
this pixel here, for example,
occurs 7% of the time.
So I can look up 7% here
and paint it red.
Once I've done that, I have
converted the DNA string into a colored bitmap.
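A small sketch of that counting trick (illustrative Python of my own; the grid layout and the normalization are arbitrary choices here, not the exact recursive quadrant layout described above):

```python
from collections import Counter
from itertools import product

def dna_bitmap(sequence, k=2):
    # Count every overlapping k-mer, then scale the counts to [0, 1] so
    # each cell of the matrix can be looked up on a color scale.
    counts = Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))
    total = sum(counts.values())
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    freqs = [counts[m] / total for m in kmers]
    top = max(freqs)
    side = 2 ** k            # e.g. a 4x4 grid of pair frequencies when k = 2
    return [[freqs[r * side + c] / top for c in range(side)] for r in range(side)]
```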
And I guess the question
is now, so what?
Let's actually look at some
files of DNA, under Microsoft
Windows, the classic thing,
Notepad I guess.
And here are the icons for
Homo sapiens, Loxodonta
Africana, et cetera,
et cetera.
And looking at this, you have
no way of knowing whether these things are very similar.
But let's actually do this
now with this trick
I've just shown you.
Let's replace the normal icon
with the icon I've learned
with this approach I've shown
you in the previous slides.
When I do this, I see this.
And it strongly suggests
that these two things look
alike and those two
things look alike.
And as it happens, this is the
Indian elephant, the African
elephant, and this is the
chimpanzee and the human.
So it's sort of intuitive,
it makes some sense.
You can look at these bitmaps
and figure out, these are a
pair and these are a pair.
OK, that was a lot of digression
because I'm talking
about time series.
And the question is,
can I do something
similar for time series?
And so what?
Would it be actually useful?
And the quick answer
is, of course, yes.
We have SAX in our
arsenal of tools.
And with SAX I can make
a string, a pseudo DNA
string, if you like.
And once I have a pseudo DNA
string, I can map it into a
matrix, I can get the colors,
and life is good.
So I can do this very fast. The
question is, is it useful?
So here's the first example.
Here are some brain
waves from rats.
And if you look at this, it's
sort of like the Sesame Street
game of which one of these
things is unusual?
And the answer, I guess, is this
one looks different from
those ones.
As it happens, these are
all epileptic rats,
and this one isn't.
OK, let's try this.
I can actually add value to
this by mapping them into
Euclidean space based upon
multi-dimensional scaling.
I'm basically going to find
these similarities and cluster
them in two-dimensional space.
So this data set is actually the monthly power demand for Italian electrical power.
When I do this, I see that
actually all my winter months
are over here and all my summer
months are over here.
But one month, August,
is completely unique.
It's out there.
So why does it look different,
and why is it far from these
things here?
And actually, if you're Italian
or you're European,
you might know this.
In Italy, in August, the entire
country closes down.
[INAUDIBLE] closes, the
government closes, and
everyone goes to the beach.
And actually, we can see
this if you plot
the data right here.
Now, of course, in this case
we could actually see this
anomaly if we plotted
the data.
But I guess the utility of this
is you can see anomalies
without actually opening MATLAB
or Excel, whatever
tool you actually have. And in
any case, this anomaly here,
or this cluster here, is too
subtle for my eye to see in
this space here.
But we can actually
see it better in
this space right here.
Here's another example.
This is a famous data set, at
least in data mining, for
testing [INAUDIBLE].
And it allegedly contains
70 heartbeats.
Here's a subset of them.
Once again, in the classic view,
nothing really is added
to the value here.
But in the bitmap view,
we see this.
There appear to be two strong clusters--
these guys here and
those guys there.
So what's going on?
Why are these guys so different
from these guys?
As it happens, they're
not heartbeats.
A grad student many years
ago made a mistake.
So these five things here
are not heartbeats.
They're what's called
pacemaker cells--
but natural pacemakers,
not electronic ones.
So no one has noticed this, even
though many people have
actually used this data set to
test algorithms. And again,
the value here is that we could
actually find this if we
plotted the data and looked
at it potentially.
But we get it here for
free, essentially.
When we copy or paste or edit
or move files around, for
free, we can actually
see what's going on.
This is my final example, then
I'm all done for the day.
My hope actually is to use this
for basically almost like
a Google toolbar tool for
query by content.
So I can search for documents
based upon the date it was
created, its file size,
its type, et cetera.
And I can also search
based upon keywords.
But let's think about a more
general search, a search upon
structure, where you simply
say, find me a file that's
similar to this one.
Here's an example of that.
It's completely contrived,
but it makes sense.
Here we have a data set of large
mammals in the Americas.
And I right click on the
brown bear and I ask
for its nearest neighbor.
And so we search quickly through
an index, and the
nearest neighbor is
the polar bear.
And you can kind of tell
visually, actually they look
similar by these two
icons here, or
bitmaps here and here.
As it happens last night on
CNN, they announced that
actually they had found the
first polar bear/brown bear
hybrid in the wild, which was
an interesting coincidence.
So I've shown you actually
I can take time series
and map them into this.
And I've shown you how I can
take DNA and map it into this.
And right now, I'm working
on doing this for text.
So using [INAUDIBLE] index
[INAUDIBLE] something else, we
convert text into bitmaps.
And it actually works
quite well.
If you look at my papers, they
all look kind of similar,
because I only do one
thing, basically.
If you look at Shakespeare, it
tends to look a bit different,
and so on and so forth.
And so the question is, can we
actually use this to do a fast
search by type?
Basically, find me similar
documents to this document
right here.
Great.
I thank you for your
indulgence.
And now I really am finished,
and so I'll be happy to answer
any questions you might have.
AUDIENCE: So what are some
specific application areas
where you see this
being applied?
I guess you did refer
to [INAUDIBLE].
EAMONN KEOGH: You mean
for the shape thing?
Well, one [INAUDIBLE] you
could probably see is
anthropology.
So anthropologists typically
have lots of data.
As I mentioned, a million
arrowheads.
In one part of Nevada alone,
in 10 square miles, there's
evidently about 100,000
petroglyphs.
And amazingly, people
don't know
anything about these things.
It surprised me.
So anthropology's one
possible example.
Definitely looking at things
like cells in biology.
It's very tedious to actually
look at every single cell.
The fruit fly wings actually
makes a lot of sense too.
Because [INAUDIBLE] make lots
and lots of data like this.
So those are some potential
applications.
I'm hoping people will suggest
more when this work gets out
there a little bit.
And in this example what we're
actually doing was we're
looking at long video
sequences of,
say, martial arts.
And the idea is, we are trying
to find two video sequences
where the person actually
has the same pose.
If we have that kind of a
collision, then we can
actually run the video and
then jump from one video
stream to a different
video stream.
So basically we can make fake--
it's called video
textures, where we can
combine different
moves, kicks, and punches.
People have tried this in
motion capture data.
And it's very, very
hard to do.
In the simple example to have
a shape-- of which I don't
have any with me right now--
it actually works
exceptionally well.
So basically, for video games,
it's quite useful to do this.
We actually have a kick and
a punch, and we have
a kick and a fall.
And we can actually combine those, any combination of those, basically.
AUDIENCE: It seems like you're
focusing on silhouettes also,
instead of using more complex
algorithms like over curves in
2-D space, or something
like that.
EAMONN KEOGH: So we are
looking at silhouettes
obviously for simplicity.
For many shapes, you can look
at a silhouette and you can
actually tell what it is.
Virtually, that's
always the case.
Certain people's faces, it's
very easy to recognize
people's faces based on that.
For some shapes, of course,
the color, the texture, or
internal features
are important.
And currently, we're not
looking at those.
But we have to do something
to initially start this.
But that is potentially
future work.
Any more questions?
AUDIENCE: I wonder if you have
any thoughts on applying your
anomaly detection to
surveillance [INAUDIBLE]
applications.
EAMONN KEOGH: We do.
For one reason, it's because
it's very heavily funded.
And people like that
kind of stuff.
So we actually are looking at
them [INAUDIBLE] detection in
that kind of domain.
So you could think
about unusual
shapes and match shapes.
So for example, even things like
matching suitcases, based
upon the shape, they figure out
if it's been transferred
between the streams
and whatnot.
So it's something that we're
actually looking
at at a high level.
