>> So, it's an honor to have Tom Dietterich with us today.
I think I remember the first moment
I ran into Tom. I was passing a seminar room
at Stanford, in Margaret Jacks Hall.
I looked in and then walked in,
and there was some sort of a very
interesting debate going on.
Some machine learning topic.
I think Rich's folks were there, and so,
it's just a memory that I've had ever since.
I just remember that as the first glimmer of
the energy of Tom Dietterich.
I've come to know him and love his work over the years.
Tom mentioned during our fireside chat,
the video we did together last night,
when we were talking a little
bit about interests, that
I think we share
one thing: we both have what we call
"Research Attention Deficit Disorder."
We get very curious and excited about many topics
and it's amazing to me how deep
Tom goes on each of the topics he has attacked over the years.
Tom did his undergraduate work at Oberlin,
and then, did his master's at UIUC.
I think you mentioned you were following,
at the time, the line
of where you could find the best computer.
>> Right.
>> Sticking phones into
these 300 baud sockets and so on in those days.
And then, did his graduate work at Stanford University,
following I guess in [inaudible] after hanging out with
Michalski at UIUC, which
got him into the machine learning world.
And he has worked on many
different areas, I think following
all the recent trends and leading some of them.
Remember, you and I were at the first UAI conference
in 1985 together.
And I remember you wearing a pair of
shorts and asking hard questions
in front of the room back in those days.
I cannot forget, Tom,
any of these interesting interactions
we've had over the years.
Through the probabilistic graphical model phase,
his work has always been touching on applications.
There was quite a lot of work during the CALO project on
desktop events, understanding and
trying to infer the intentions
and goals of users from desktop streams of evidence.
More recently, he has been writing, like
several of us here, about the challenge of AI
in the open world, issues about
robustness and understandability of these algorithms.
He was a leader in this whole area
that's now called Computational Sustainability,
working with Carla Gomes at Cornell at defining this area
and somehow convincing the National Science Foundation
to fund it deeply.
Looking at applications of optimization and
other areas of machine intelligence to hard challenges
in the environment, species,
and the management of planetary resources,
sort of resonating with our AI effort now at Microsoft.
Today, he'll be talking about a piece of
this interesting area on robustness,
I think it's an important aspect: anomaly
detection algorithms, explanations, and applications.
So, Tom thanks very much for being with us.
>> Thank you. Is this working yet?
>> Yes.
>> Can you hear me?
>> Yes.
>> Okay. So of course,
everything that I'm talking
about today was done by someone else.
So, this is a collaboration with
two of my fellow faculty in Computer Science,
Alan Fern and Weng-Keen Wong.
Weng-Keen is currently rotating at NSF
as a program manager but he'll be back in the Fall.
And then two faculty from statistics,
Sharmodeep Bhattacharyya and Debashis Mondal.
And at least one of them did their thesis at
UW but I don't remember which one.
And then, several graduate students and
as their contributions come up,
I'll be pointing those out in the talk.
So, I have a sort of long agenda here,
this is more than I would
usually try to cover in one talk,
but some of them are fairly straightforward.
So, what I want to do is start by
defining this question of Anomaly Detection,
what do we mean by that because
that's not a terribly well-defined term.
And then, I'll
briefly review a big benchmarking effort we did.
Mostly, in the knowledge discovery
and database community,
there have been many algorithms
published in the anomaly detection area.
I think largely because they have
many business problems in fraud detection and management,
and this is a big tool there.
But a lot of those are just publishing an algorithm,
a mechanism without any kind of a theoretical foundation.
And so, we wanted to understand them, and then,
we've done a bit of work on
probably approximately correct type of
machine learning theory for
what we call rare pattern anomaly detection.
Then, I'll talk about scenarios like fraud detection,
where you have a ranked list
of anomalous transactions or users or whatever and now,
you have an analyst in the loop.
And so, you want to show the analyst, say,
your number one candidate and
explain to them why you think it's anomalous.
Obviously, it's not sufficient to dump
a 180-element vector on them and say,
we think this is an anomalous user.
So, I'll talk about
some very simple feature-based explanations.
And then, if we've got the analyst in the loop,
we can also get feedback from them and use that to
re-rank our candidates as we're presenting them.
And that turns out at least in
our benchmarking to be really powerful.
And then, I'll wrap up with two applications.
One is a big sort
of sustainability application involving
deploying 20,000 weather stations in
sub-Saharan Africa and then trying to automatically
detect when stations break and sensors break and so on.
And the other is, sort of open world AI problem,
which is the problem of
open category classification
in say multi-class classification.
So, defining anomalies.
So, we're going to assume that
our data are just feature vectors in
d-dimensional real space and that we have
a data set that is a mixture of
what we'll call the nominal points,
so that we don't call them the normal
points, and the others we'll call the anomalies.
Our assumption is that the anomalies are generated by
a different generative process than the nominal points.
But typically our data will be unlabeled,
we'll talk about that I guess on
the next slide, which is that there are really three settings.
So, in some cases people have tried
to formulate this as a Supervised Learning problem,
so we get some labeled data,
but in most of
the sort of anomaly detection applications,
the anomalies are quite rare.
Fortunately, 50 percent of
our users are not trying to defraud us,
although, maybe it seems like that at times.
And so, here we have severe class imbalance if we
approach things from that perspective.
Another possibility is that we are
training on clean data that
has no anomalies in it, but then,
maybe at test time we are going to
get contaminated data and we need to
identify those contaminants and that's what
happens in the open category case.
And then, another situation is
where our training data is contaminated with
some kinds of anomalies and we
want to figure out which points those might be.
So, we'll be looking at those.
So, I think a very important consideration
here is this assumption I call the WDAD assumption.
The assumption that the anomalies are drawn
from a well-defined probability distribution.
So, just to back up,
if you're formulating things in
the supervised learning way, you are
essentially assuming the anomalies
are coming from some well-defined distribution,
right, and I can learn about that.
But if we're in an adversarial setting
like in the fraud or cyber security kinds of
settings then that really
is not a valid assumption to make, right?
In fact, our adversaries are constantly trying to defeat
whatever distributional assumptions we've made.
Another case: in the weather network case,
it seems like it's adversarial in the sense that
the number of ways something can be
anomalous is sort of an unbounded set, and
I guess technically speaking it's still
a well-defined probability distribution, but you don't
have nearly enough data to have sampled the whole thing.
So, you really can't model it.
And another case that's come up,
I've collaborated a little bit with Kiri Wagstaff
at the Jet Propulsion Laboratory.
And their user is
usually a scientist, and they're looking at data
coming from some sensor; she's worked on
some spectrometers that are
on the Mars Science Laboratory.
And what the expert
wants is to find the interesting data points.
And now, the first few interesting data points are
actually ways that the instrument fails,
and they need to understand those.
But then, they start being interested in
unusual spectra that might
signal some interesting compounds.
But then, once they've seen those, then they say,
well don't show me any more of those I want to see,
show me real anomalies and so,
it's a time changing notion.
So I think, we need to be sensitive to
when it's appropriate to make this assumption.
So there are
two main strategies for Unsupervised Anomaly Detection.
It depends a lot on this parameter that we call Alpha,
which is the fraction of the training points or
test points that will be these anomalies.
If Alpha is reasonably large,
then we can think about fitting something like
a two component mixture model and actually trying
to model that anomaly distribution,
if we're willing to make
this well-defined distribution assumption.
There are various conditions for that to work.
Alternatively, if the proportion of anomalies is very
small as it typically is in these fraud cases,
then the main strategy is
to treat it as a problem of outlier detection.
So we're going to try to
model the distribution of the nominal points,
the ordinary users, and
then treat the outliers as our candidate anomalies.
The advantage there is that we don't have to make
any assumption about the distribution.
But of course, if the anomalies are not outliers,
then we will miss some of them.
Also, if the nominal distribution
itself has very heavy tails so that it
just naturally has a lot of
outliers, this won't work either.
So as I mentioned in the KDD literature,
there must be
20 or 30 different algorithms that have been published.
There's a huge number of papers there.
A problem with many of these is that,
unlike in a lot of other parts of
machine learning and Computer
Vision and natural language processing,
there's not a large collection of
standard benchmark datasets that
you could do a thorough evaluation on these algorithms.
Often, there's some sort
of a corporate application with
a proprietary dataset that they can't share.
Then the most popular benchmark dataset is
this computer intrusion dataset that
was collected as part of
a DARPA project back in the late 90s,
and it turns out to be extremely easy.
So, we strongly believe in benchmarking
and the need to have a
large and constantly growing collection
of anomaly benchmarks.
So, my student Andrew Emmott
conducted a big benchmarking study.
Our methodology was to repurpose existing
supervised learning benchmarks
either classification or regression problems.
So we chose 19 datasets from the UC Irvine repository,
which were all the datasets that
satisfied a set of criteria we had.
We wanted real-valued features.
We wanted enough data. We wanted enough dimension.
Then we would choose one or more of the classes
to be these anomaly classes,
the source of our anomalies,
and the rest to be the nominals.
This ensures that our anomaly points are being created by
some process that's different from
the data that's creating the nominals.
Then we systematically varied things like Alpha,
the relative frequency of the anomalies,
the difficulty of the points,
like how deeply embedded are they inside
our nominal distribution versus
how much are they outliers.
We added redundant or irrelevant features
to see how much that could defeat the algorithms.
We also looked at how clustered the anomaly points are.
Because to a large extent,
these anomaly detection algorithms are all
doing some form
of approximate density estimation.
You're essentially trying to model
the probability density of the nominal points,
and then score points so that
the very low density points are your anomalies.
Now, if the anomalies are all tightly clustered,
then they form a little peak of density,
they don't look like they're low-density
anymore, and so you'll miss them.
So, we were interested in that.
We created 20 replicates for
each configuration of these manipulated values,
and that gave us about 25,000
different benchmark datasets.
So then, we compared a set of eight different algorithms.
I won't go through all of them.
But basically, there are some
that are doing density estimation,
some that we call the quantile-based methods,
like the one-class Support Vector Machine
is probably the best known of those.
It tries to find a boundary
around your data that contains, say,
one minus delta of
your data, for some quantile delta that you choose.
There are nearest neighbor-type distance methods,
and then there are these projection methods.
I want to talk about these
because these two algorithms turn
out to be very attractive for many reasons.
So, the first algorithm and
the one that actually performed the best on
our benchmarking exercise is
something called the Isolation Forest.
Those of you who know me,
know I love decision trees and
anything like them: random forests, boosting, all these things.
So, I had a natural affinity for this algorithm.
Although, when I first saw this paper come out,
I said, "This is such a weird idea."
But the basic idea is we're going to construct
a completely random binary tree.
Again, recall that our data are not labeled,
so we just choose one of the attributes at random.
Then, we choose a splitting threshold
uniformly at random out of
the observed range of the data.
Then we do that, split the data recursively.
We split, in some sense overfitting radically,
until every data point is in its own leaf of the tree.
In the paper, actually,
they have ways of avoiding having to do that.
You can stop the splitting at some point.
The idea is we'll build an ensemble of these,
and the statistic we're interested in is actually
the depth of a point in the tree,
the depth at which a point becomes completely isolated.
Let's say that my one data point x_i
is the only data point in this leaf node here;
then we would say x_i
is isolated at depth three.
The intuition here is that, in some sense,
we're just throwing
random separators at the data, like darts at a dartboard.
How many random separations do we have to
do before a data point is
separated from all the rest of the data?
The intuition is, well,
if you're an outlier,
it's going to take only a few random
shots and we'll get you isolated.
Whereas if you're deeply embedded
in a tight cluster, I may have to split,
split, split, split forever to get down to isolating you.
So, for the anomaly score that we get,
d-bar is going to be the average isolation depth
over the ensemble, say 100 trees.
Then, one nice thing in this paper is that
they calculate the expected isolation depth
from a little bit of theory.
So, then they can normalize this so
that it lies in a zero-one scale,
where anomalies are given a score close to one,
and the nominal points typically have scores
less than 0.5 or around 0.5.
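To make that concrete, here is a minimal sketch of the isolation depth and the score normalization just described; the recursive splitter and the c(n) helper are simplified stand-ins for the tree building and subsampling in the original Isolation Forest paper, not a production implementation.

```python
import numpy as np

EULER_GAMMA = 0.5772156649

def c(n):
    # Expected isolation depth for n points (the normalizing constant from the
    # Isolation Forest paper, based on the average unsuccessful BST search path length).
    if n <= 1:
        return 0.0
    return 2.0 * (np.log(n - 1) + EULER_GAMMA) - 2.0 * (n - 1) / n

def isolation_depth(x, X, rng, depth=0):
    # Follow query x down completely random axis-aligned splits of the data X
    # until x is separated from all remaining points.
    if len(X) <= 1:
        return depth
    j = rng.integers(X.shape[1])            # pick an attribute at random
    lo, hi = X[:, j].min(), X[:, j].max()
    if lo == hi:                            # cannot split on this attribute any further
        return depth
    t = rng.uniform(lo, hi)                 # threshold uniform over the observed range
    side = X[:, j] <= t if x[j] <= t else X[:, j] > t
    return isolation_depth(x, X[side], rng, depth + 1)

def iforest_score(x, X, n_trees=100, seed=0):
    # Average isolation depth d-bar over the ensemble, normalized to (0, 1]:
    # anomalies score near 1, nominal points around 0.5 or below.
    rng = np.random.default_rng(seed)
    d_bar = np.mean([isolation_depth(x, X, rng) for _ in range(n_trees)])
    return 2.0 ** (-d_bar / c(len(X)))
```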
The other algorithm I want to talk about. Yes.
>> Does that favor a particular basis for the features?
If you have a linear transform of the features,
you get a different basis.
Is this invariant to that?
For example, maybe
the point is not obviously an outlier in one basis.
But if you transform it,
for example, even like to a sphere,
then it's fuzzy around that.
I don't know if you can tell just
by looking at this via one basis?
>> It's true. So, of course,
we've also looked at randomly rotating
the data before we chop it up.
It helps a little bit, but
the randomness increases the variance
of the isolation depth,
so then we have to make the ensemble bigger.
So, it hasn't been enough of a gain to be
worthwhile to us, because
the memory starts to be an issue here.
>> Does this basis issue arise
frequently enough in nature,
or in practice,
that you should actually care about it?
>> Well, so in nature, I don't know.
In our benchmark datasets and
in our fraud detection applications,
the features are generally meaningful.
We're not doing this on images or things like this,
where pixel 25 has no particular meaning at all.
So, yeah.
So I certainly wouldn't use this out of
the box on images, that would be crazy.
Maybe I would use it in the penultimate layer
of a CNN or something.
I'm not even convinced that would be
a good idea, but yeah.
>> Maybe it's some
relationship between the variables, like say,
x squared plus y plus z squared,
that is the meaningful variable for the outlier.
>> Well, usually, just as with decision trees,
these are basically looking at higher-order interactions
as you go deeper.
These things pretty much follow the data distribution.
We only put splits where we have data,
so they do a pretty good job.
There was an interesting paper by Nina Mishra, I think,
and colleagues at ICML a couple of years ago,
where they analyzed a slight variant of
this algorithm and showed that
the isolation depth is related to
an Lp-norm distance
between the outlier and the data.
So in this case, basically an L1 norm.
You can sort of imagine how this would
be sort of like an L1 distance.
If you look at these different algorithms,
I think a lot of them are essentially
doing something that's very similar.
I mean, the nearest neighbor algorithms
are calculating some distance,
and the Isolation Forest is
definitely calculating something like a distance also.
Clearly, these things are using kernel distance,
and these things are using something like
a Gaussian kernel in the original space.
So yeah, of course,
you could apply PCA to try
to rectify the data in some sense.
The other method that I want to mention,
or I should say, are there any other questions? Yes.
>> Do any of your benchmark datasets include
behavioral data or sequential data that was generated by
capturing traces of someone's behavior,
like a network intruder's behavior?
>> No.
So when we applied this,
we were in a DARPA program called ADAMS.
There, we were looking at
an organization that had 5,000 employees.
We had actual desktop spyware installed,
monitoring software on
their desktops. They were working
for some organization in
the military-industrial complex, is all I'm allowed to say.
So, we could go down and really look at
every email message they sent
and even the content and all that stuff.
But we summarized that at a daily time step,
and turned it into a vector
of how many emails they sent,
how many websites they visited,
how many of this, that, and the other thing.
Then we also did some normalization to try to deal with
both asking how unusual
was this day compared to the typical day of this user,
and how unusual is this user on
this day compared to everybody
else in the company on that day.
Because otherwise, we had these things where
some new web link would be sent
out through the whole company, everybody would go visit it,
and nobody had ever visited that link before,
and we would get all excited
if we were only looking at one user.
So we had to do
some pretty serious feature engineering
to turn this into something that
was almost IID.
But then, all these algorithms are
purely for that sort of feature vector case.
In my weather data, we're again,
just having to figure out how we can
engineer the features to take what is very much
a spatio-temporal time series and turn it into
little database views that
are approximately exchangeable.
Okay, well, the second algorithm I want to
mention is even simpler; it's called LODA.
Given a set of data points, what it does
is create a set of capital M, say again,
say 100, sparse random projections.
So, it randomly chooses square root of d
of the dimensions, where d is the dimension of the data,
and then
defines a random projection onto one dimension,
and then fits a one-dimensional density model.
I've drawn it as a kernel density estimate,
but they actually use an adaptive histogram estimator.
Then what they do, is just calculate
the average of the minus log density of a query point,
so the average surprise.
We can see, if we took a data point
like this one right here,
it would have a pretty high
probability in this projection.
But of course, if we got some other projection,
then it will be mapped to a low-density tail.
So, this is an extremely simple idea,
but it also works surprisingly well, presumably
because there are not
super high-order interactions in our data.
So looking at three or four dimensions
at a time works pretty well.
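As a rough sketch of that idea, assuming fixed-width histograms rather than the adaptive histograms the LODA paper actually uses:

```python
import numpy as np

def fit_loda(X, n_projections=100, n_bins=20, seed=0):
    # Build LODA-style sparse random projections with 1-D histogram density estimates.
    rng = np.random.default_rng(seed)
    n, d = X.shape
    k = max(1, int(np.sqrt(d)))             # number of nonzero coordinates per projection
    models = []
    for _ in range(n_projections):
        w = np.zeros(d)
        idx = rng.choice(d, size=k, replace=False)
        w[idx] = rng.normal(size=k)         # sparse random projection vector
        z = X @ w
        counts, edges = np.histogram(z, bins=n_bins)
        probs = (counts + 1) / (counts.sum() + n_bins)   # smoothed bin probabilities
        models.append((w, edges, probs))
    return models

def loda_score(x, models):
    # Anomaly score = average surprise (negative log bin probability) over projections.
    surprises = []
    for w, edges, probs in models:
        b = np.clip(np.searchsorted(edges, x @ w) - 1, 0, len(probs) - 1)
        surprises.append(-np.log(probs[b]))
    return np.mean(surprises)
```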
So then, how do we do our analysis?
I point you to the archive papers for the details.
But we basically chose a couple of
metrics: either the area under the ROC curve,
which we put through a logit transform,
or the log of the lift,
which is related to the area
under the precision-recall curve.
Then, we just try to set up
a linear model to see, across all of these benchmarks,
how much of the variation in the metric could be
explained by the relative frequency of the anomalies,
the difficulty of the points,
the clusteredness, the amount of irrelevant features,
what we call the mother set,
that is, the domain
from which the original dataset was drawn,
and which algorithm we are using.
Because I wanted to ask, well,
a typical finding in machine learning is that
the choice of problem is much
more important than which algorithm you use,
and we were curious if that was true here or not.
The answer is that it's pretty much true here also.
So, here are those five factors.
Here is the change in the R squared, basically
how much variation can be explained in these two metrics.
So area under the ROC curve,
if we take a model and delete the dataset feature,
how much R squared do we lose,
or how much can we assign to the dataset?
And that's the most important variable
for explaining the observed AUC.
Next is the relative frequency.
So, if you have
a dataset where the anomalies are very rare,
we can do very well because
we have very little contamination.
As the anomalies get more common,
then these techniques fail.
Because really only one of them has
any explicit attempt to be robust and use a robust loss,
that's this technique called
robust kernel density estimation.
All the others are just plunging
forward fitting some kind of a model.
So, the choice of algorithm is
really only the third most important factor,
and really not a whole lot more
important than the others.
But with that caveat,
here are the different algorithms that we tested,
sorted by decreasing change
in the metric compared to a control baseline.
What we see, is actually,
the Isolation Forest was performing
better than the other methods,
and then there's sort of like a five-way
tie for second place.
Surprisingly, the one-class Support Vector Machine
and Support Vector Data Description really
performed the worst on
these problems, which was quite a shock.
Because I think if you asked
a random machine learning person,
"If I had a one-class problem and I
needed to fit it, what would you use?"
And they'd all say, "Well, gee,
how about Scholkopf has this algorithm."
We're a little puzzled about this.
Of course, it's possible that we
just didn't tune these properly.
Although, we tried a lot of things,
but we wanted to have a fully
automated tuning procedure. So,
it may be that we didn't do that well.
But there have been some papers
recently in the literature on auto-tuning those,
and so we were trying to use the current best practices.
My own hypothesis here,
is that the metrics
that we're evaluating on are really
related to how well you can rank the data points.
Whereas these methods really just set it up
as: are you inside or outside of a particular quantile,
a particular level set of the density?
So, it's not clear that they've really been trained on
the right objective to do well on these metrics.
So, maybe that's why they don't do as well as some of
these more distance-based or density-based approaches.
So, we also find in general that
the Isolation Forest is a bit more
robust to irrelevant features,
and that it's also a bit more
robust to clustered anomaly points.
So that may be why it has
this slight advantage over the other methods.
Okay. Well, let's move on now to thinking
about whether we can have
some kind of a theory to explain what's going on.
Because one of the things we were quite
surprised at is when we look at the learning curves,
these algorithms all learn quite quickly.
Of course, if we look at
what existing sample complexity theory we
have for this kind of one-class or density estimation,
it's often quite discouraging.
So if you look at full density estimation,
you typically need a number of examples
that's exponential in the dimension of the data,
and that doesn't match our experience.
For the quantile methods,
there are polynomial sample complexity results.
So, our observations are
maybe more consistent with something polynomial time.
So, my colleague Alan Fern came up with
this theory we call rare pattern anomaly detection,
and it works as follows.
So we imagine we're going to define,
and this is a very Valiant-style
pattern learning kind of thing,
a space of patterns or regions.
So we'll define a pattern
as a Boolean function that takes a point in d-dimensional
space and maps it to either zero or one.
So it might be a half plane
or an axis-parallel hyper-rectangle,
some kind of region.
We'll let capital H
be the set of all the patterns in our pattern space.
Then, the idea is to define a notion of
a rare pattern and a common pattern.
To do this, we need to do a little bit
of being careful about measures.
So we're going to assume some baseline measure.
So if we're in, say, a hyperbox in d-dimensional
space, we might use
just the Lebesgue measure,
the uniform measure over that.
But if we're in unbounded real space,
we need to use something like a standard
multivariate Gaussian as our baseline measure.
We'll let mu(h) be the measure of that pattern,
pattern h. Then, let's let p
be the probability density of the nominal data points,
which is defined over R^d, also.
And we'll let p(h) be the probability
assigned to pattern h. So,
we'll say a pattern h
is tau-rare if the sort of
normalized probability density is
less than or equal to some threshold tau,
and otherwise we'll say that pattern is tau-common.
Then our definition of what we want,
is we're going to say,
a data point now,
a query point is tau-rare if there exists
a tau-rare pattern that that data point belongs to.
So that data point belongs to pattern h,
and h is tau-rare.
So it means there's some way of looking at the data,
there's some bin in the data that this query point x,
belongs to that is very unusual
under the nominal distribution.
So, then our goal,
an ideal anomaly detection algorithm would output all of
the tau-rare query points in a test set,
and not output any of the
tau-common points; that would be the idea.
So, I guess what's
interesting about this model is that instead
of trying to estimate
the entire density or
even trying to estimate a level set of the density,
we're just choosing some set of
regions in the space and just
trying to figure out roughly what
their probability density is or the probability of those.
I think that this is
why if we don't have too many of those regions,
then we can get
uniformly good estimates over most
of them from a relatively small sample.
So, we define an algorithm to
be probably approximately correct for RPAD
if, for any probability density p
and any threshold tau,
it draws a polynomial size sample,
and then with probability one minus delta,
it detects all the tau-rare points and rejects,
that is, does not output, any tau-plus-epsilon-common points.
So we have to introduce a sort of
elbow room parameter to
allow the algorithm to make some mistakes.
So if a point is between
tau-rare and tau-plus-epsilon-rare,
then the algorithm is allowed to do anything it wants there.
So, what we end up with
is an algorithm such as the following.
We can draw a polynomial size sample,
and we'll talk about what this is in a moment,
and then we just estimate p-hat,
is the fraction of the sample
that belongs to each of our patterns,
and then we can calculate the normalized rareness.
Then we just say a query point
is declared to be an anomaly
if one of these tau-rare patterns
is made true by that query point.
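A minimal sketch of that estimate-and-threshold procedure, assuming the patterns are given as Boolean functions with known baseline measures (names like fit_rpad are just for illustration):

```python
def fit_rpad(sample, patterns, measures, tau):
    # Estimate the normalized rareness of each pattern from a sample, and return
    # the indices of patterns whose estimate falls at or below tau.
    # patterns: list of Boolean functions h(x); measures: baseline measure mu(h) of each.
    n = len(sample)
    rare = []
    for i, (h, mu) in enumerate(zip(patterns, measures)):
        p_hat = sum(h(x) for x in sample) / n      # empirical probability of pattern h
        if p_hat / mu <= tau:                      # estimated normalized rareness
            rare.append(i)
    return rare

def is_anomaly(x, patterns, rare):
    # Declare x an anomaly if it satisfies any pattern estimated to be tau-rare.
    return any(patterns[i](x) for i in rare)
```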
So we can prove a sample complexity result that
says that the number of samples we
need is on the order of one over epsilon squared,
times the log of the number of
patterns plus the log of one over delta.
So, a very typical PAC-style result.
It's a little unfortunate that this
is Epsilon squared here.
So that's going to mean if we want Epsilon to be 0.01,
then we're going to need 10,000.
This will be a big number, a big constant.
Then we can get a VC dimension version of this,
where the VC dimension plugs in
here times log of one over Epsilon squared.
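Written out, the bound as stated, for a finite pattern space (the VC-dimension version roughly replaces the log of the number of patterns with a VC-dimension term):

```latex
m \;=\; O\!\left(\frac{1}{\epsilon^{2}}\left(\log|\mathcal{H}| + \log\frac{1}{\delta}\right)\right)
```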
So, there are lots of
different practical algorithms you
can design with this procedure.
So, half spaces and axis-aligned hyper-rectangles,
and in our paper, we worked out
the VC dimension of all of these.
In particular, we designed a version of
the Isolation Forest that exactly matches this theory.
To do that, the pattern space
has to be chosen a priori in this theory,
it can't be adaptive to the data.
So to deal with that, we use the usual trick of say,
taking our training data and dividing it in half,
and using half the data to grow
an isolation tree to some fixed depth k,
that fixes our pattern space,
and then the second-half of the data,
we can estimate the probability of all of these bins.
So, here are some examples on some datasets,
where the plot on the left is for
the actual Isolation Forest algorithm
grown to a fixed depth
k. This is our theoretical version of
it also at depth
k. The vertical axis is the area under the ROC curve,
and I apologize that the axes
are different in the two plots.
Then these different lines are
for different depths of tree.
But if we just look in this case,
the depth-one tree actually works the best.
We see that in this case,
the pattern min method
actually does a little bit better than
the Isolation Forest on the area under the ROC curve.
For another dataset particle,
the Isolation Forest does much better
because notice this is 0.71 down to 0.66,
and this would be 0.66 down to 0.52.
So, all of these curves basically are below all of these,
so the Isolation Forest was much better.
Then, here's a dataset shuttle,
where the RPAD,
the theoretically inspired algorithm is actually
doing quite a bit better than Isolation Forest.
This time, I did get the axes aligned.
So I think qualitatively, well,
the shapes of the curves aren't always the same,
but yeah,
there's sometimes some funny business
going on with very small samples.
This is particularly weird
that we do very well with very small samples,
and there's an interesting phenomenon
going on here that we don't fully understand.
But when you have a training sample,
and it includes some amount of contamination,
a standard practice in
the Isolation Forest is to
train on subsets of the data,
not on the entire dataset,
and the idea is that you're kind of thinning out.
It's a robustification strategy.
By training on smaller sub-samples,
the absolute number of contaminating points is reduced.
The fraction isn't reduced,
but it seems that if you have a tight clustered points,
you're reducing that local density of that cluster.
And I think this helps the algorithms do a better job,
but we don't have
a good theoretical understanding of this.
But empirically,
even the original authors of
the Isolation Forest algorithm
say that you should train on
a small subsets of
the data rather than the entire dataset,
and that this often helps.
For those of us in machine learning it's like,
"What, you're not using all your data?"
But of course, we're building an ensemble.
So across the whole ensemble,
we're using all of our data.
The smaller the subsets,
the bigger the ensemble needs to be,
because you also have to reduce
the variance that you've introduced
by choosing these random subsets.
So this is a paradox, this kind of thing
where the small samples are actually performing well.
It's not observed all the time,
but it's observed enough of the time
that we think it's a real phenomenon
and we're trying to
find a nice theoretical understanding of it.
Okay. So the conclusions from this is that we feel that
this rare pattern anomaly detection theory
captures a lot of the behavior of
the Isolation Forest or of LODA,
because these methods are solving a simpler problem than
full density estimation or
even estimating level sets of the density.
So, I guess maybe I should pause and say,
are there any questions about that? Yes.
>> How does this algorithm compare on
high-dimensional data, very high-dimensional data?
>> Well, I would guess on very high-dimensional data,
you need to use a very large ensemble of trees.
Let's see, what's the right way to talk about this?
If your space of rare patterns is, in some sense,
if you have a distribution of the data,
you want to have these rare patterns kind
of surrounding the data.
Obviously, one intuition is in high-dimensional space,
you're going to need a lot of
those patterns surrounding the cloud of data.
So, the number of patterns you might need to really tile
the entire region might grow exponentially.
So, our result says
the sample size is polynomial in the number of patterns,
but the number of patterns that
you need might be exponential.
So, that's a good question.
>> Have you tried this on a bag-of-words?
>> We have not tried this on
a bag-of-words thing for example, yeah.
Now, one might hope,
so I'll do the standard machine learning thing and say,
"Oh, but the data really lives
in a low-dimensional manifold."
So all we have to do,
is identify that manifold then.
Well, but we know that
tree-type algorithms will in
some sense follow that manifold.
So, we would maybe we'll still do okay.
I'm waving my hands. Yes.
>> So, it seems like it still
requires knowing the pattern space.
But then, if you're dealing with anomaly detection,
many successful intrusions happen exactly
because there is some sort of
behavioral pattern that you didn't take into account,
and so the adversary was able to exploit it.
So, is there a way to
extend or fix this theory to maybe at
least give you some indication that there's something in
the data for which you maybe don't have a pattern,
or that you didn't miss a pattern that may be an anomaly?
>> Well, if we think about what,
for instance, the Isolation Forest does,
it actually partitions the entire
input space into patterns.
So, there is always a pattern
for every possible observation.
Many of those patterns will be essentially empty.
In fact, I skipped over
a little detail of the Isolation Forest.
But when you do a split at a node and you split the data,
there are little empty spaces
between the observed range of the data
in the children and the observed range in the parent.
Actually, a query point can fall into
that empty space in between the observed ranges.
I'm being very inarticulate here about this.
But if I have a set
of data points along
a line and I choose an arbitrary split,
there's going to be some space
between adjacent instances, and
in effect the algorithm
allocates two additional patterns for that empty space.
Because if a query point comes down,
it's considered to be isolated if
it falls into this space that
was inside the range of
the parent and outside the range of the child.
That turns out to be really important.
If you don't do this right,
then the performance of
the algorithm suffers quite a lot.
So, I guess my first point would be that if we
use some of the data to build the tree structure,
and if the tree structure is really
following the structure of the data,
then we will have tiled the space nicely.
And every leaf node by construction is going to
be rare because it will have at most one point in it.
So we're going to have a lot of regions that
have essentially zero or one points in them,
and we've also completely tiled the space.
So, if there is an unusual behavior,
it will fall into one of
these empty spaces basically and will be detected.
>> Then how do you tell between
interesting anomalous behaviors
and uninteresting anomalous behaviors?
Because essentially, you won't be able to do that.
>> Well, yeah, because we're only
doing outlier detection here,
and we have no knowledge beyond the fact that a point is an outlier.
We can't tell the difference between
an outlier that's part of the natural variation of
the distribution and an outlier that's due to
some unusual behavior that's
caused by something different like
a fraudster or something, yeah.
In fact, of course,
a fraudster is going to try to
mimic the nominal distribution,
and so we might want to try to
use some essential behavior,
things that the fraudulent user is not
able to manipulate to help us detect their behavior.
But, all I'll say is that
in the competitions inside this DARPA program,
we were number one in several months
using these kind of techniques.
So, maybe that's a tribute
to our feature engineering skills
or maybe the red team made it too easy, I don't know.
But these methods were working pretty well.
We're competing against an IBM team that
actually has a product in this space, so I don't know.
Anything else? Okay, well,
let me move on now to
assuming that we can find these anomalies,
how can we explain them to the user?
So, our basic idea is the following.
We were really troubled
by what constitutes a good explanation,
and how could we measure that in
a rigorous way without doing human subject studies.
So, here is the thing that we came up with.
We imagined that the goal of
our anomaly detection system is to
persuade the analyst that
a candidate anomaly is a real anomaly.
The idea is we're going to expose
one feature at a time to the analyst,
and in the worst case
until we've presented all of
the different features to the analyst.
Then with an appropriate visualization tool,
until we've either convinced the analyst that it
really is an anomaly or the analyst says, no, it isn't.
We sort of view the analyst,
their job is really to decide whether to say launch
an internal investigation like
a fraud investigation, or not.
So, this is the kind of model here.
So, we call this a sequential feature explanation.
We had a paper at UAI two years ago on this.
So in pictures, the vertical axis here is the experts or
the analysts' assessment of
the probability that the data point is a nominal,
not an anomaly, given the features you've seen so far.
So we'll say that notionally it starts out pretty
close to one because say,
one minus Epsilon,
where Epsilon is the background
probability of being an anomaly.
Our sequential feature explanation
is some permutation of the features;
say, feature number three
we think is the most important.
So we start presenting these features to the expert.
If we're being persuasive,
the expert's posterior belief that
this is a normal point starts to drop,
and hopefully goes to zero.
What we do is choose some threshold,
and our metric is something we
call the minimum feature prefix,
MFP, which in this case would be four.
It took the first four features in our ordering
to drive
the expert's assessment down below, say,
whatever that is, 0.25.
So, this MFP is our metric.
Across the whole dataset of anomalies,
we'll look at the average of
that as being our performance metric.
So if our anomaly detector is working well,
or if we have a good explanation algorithm,
it should use short explanations.
So this is sort of
the minimum kind of thing we might have and obviously,
it's only appropriate for
problems where the features are meaningful.
So, in this case, for our technique,
we relied on another one of our anomaly detectors,
which is an ensemble of Gaussian mixture models,
the EGMM algorithm.
We had originally developed this just as a control.
So we apply PCA and then in the PCA space,
we fit an ensemble of Gaussian mixture models,
where we randomly vary the number
of mixture components and the starting initialization,
and we discard mixture components
that didn't fit the data very well.
And from the ones that survive,
we get our anomaly scores, which are actually,
again, the average surprise across those models.
So it's like the product of the density estimates, I guess.
So, then, we use a simple greedy algorithm.
So, the score of a particular feature
is how much the log likelihood
drops when we marginalize it out,
or I've got this backwards, there
should probably be a minus sign here.
It's how much the surprise decreases when we
compare the marginalized distribution
to the original distribution, so right there.
This will always have
a higher probability density after
we've marginalized that one feature,
than the full distribution had.
So, we do this: we score each feature, and
the one that led to the biggest change in the density,
we say that's the feature with the biggest impact,
and then we remove it from
the set of candidate features and add it
to our answer, and we just do that in a greedy fashion.
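Here is a minimal sketch of one reading of that greedy procedure, using a single Gaussian as the density model so that marginalization is just dropping rows and columns of the covariance; the actual system uses the EGMM ensemble, and greedy_sfe is a hypothetical name.

```python
import numpy as np
from scipy.stats import multivariate_normal

def greedy_sfe(x, mean, cov):
    # Greedily order features by how much marginalizing each one out increases the
    # log density of x under a Gaussian model (dropping rows/columns of mean and cov).
    remaining = list(range(len(x)))
    order = []
    while len(remaining) > 1:
        full = multivariate_normal(mean[remaining],
                                   cov[np.ix_(remaining, remaining)]).logpdf(x[remaining])
        gains = {}
        for j in remaining:
            keep = [i for i in remaining if i != j]
            marg = multivariate_normal(mean[keep],
                                       cov[np.ix_(keep, keep)]).logpdf(x[keep])
            gains[j] = marg - full         # increase in log density when feature j is dropped
        best = max(gains, key=gains.get)   # feature contributing most to the surprise
        order.append(best)
        remaining.remove(best)
    return order + remaining               # features ordered by explanatory impact
```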
Again, we used a bunch of our datasets.
We only used seven of
the UCI domains instead of 19 in this study.
Instead of using real human analysts,
we trained regularized random forests
on the supervised version of the data,
so they can see all the data.
Then the question was whether
we would convince them: if they
saw, say, the first four features,
has their predicted probability
that it belongs to
the nominal class
dropped below some threshold?
So, we compared against a random ordering of
the features and against an Oracle that
actually knows the labels of the data.
What we saw was that we always did better than random,
and sometimes we do almost as well as the Oracle.
So, this is random ordering.
This is the Oracle in blue,
and this is our greedy algorithm.
Sometimes, we don't do as well.
I guess one thing that's kind of
interesting is that for most of
these Irvine type domains, on average,
the Oracle only needs to show the analyst one
or at most two features to give an explanation
that's convincing.
So, that's kind of interesting.
This gives us a kind of metric of how hard a dataset is.
So, if we look at the KDD 1999 dataset,
for instance, it's always
the case that you need only one feature.
Even our algorithm only needs to
use one feature to explain an anomaly.
Whereas if we look at
this ADAMS insider threat dataset,
you need about seven on average,
so it's a much more challenging problem.
Well, if we're going to be talking to
an analyst and convincing them,
we can also get feedback from an analyst.
So, we wanted to also then look at,
could we take the feedback
from the analyst and feed it back into
our anomaly detection process to
re-rank the candidates that we have.
So, we had a paper at ICDM a year and a half
ago showing that this could be a potentially big win.
So what we do is,
we require one parameter, which is a quantile that
basically corresponds to how many cases
an analyst can look at in a given period,
like in a day or a month,
whatever the time period is that we're analyzing.
So, the idea is that the analyst can look at, say,
"Sixty cases in a month.
We want to optimize the 60 cases we present."
So we start by presenting to
the analyst the highest scoring anomaly
we have and then the analyst says,
"Yes, that's the case I'm going
to open an investigation, or no, it isn't."
And then we feed that back.
And what we try to do
is learn to reweight.
In this case, we're working with
the LODA anomaly detector,
which remember, creates a bunch of
random projections, one-dimensional random projections.
So we learned to reweight those projections to try
to rank all of the known anomalies above quantile tau,
and all of the known nominals below quantile tau.
To do this, we used a modification of
a method from Stephen Boyd and
famous colleagues called Accuracy at the Top that
was published at NIPS several years ago now.
So, we're re-ranking based on feedback from the expert,
and we call this active anomaly detection.
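A very rough sketch of the reweighting idea, using a simple hinge-style update around the tau-quantile rather than the actual Accuracy at the Top formulation from the paper (S, update_weights, and the learning rate are illustrative assumptions):

```python
import numpy as np

def update_weights(w, S, labels, tau, lr=0.01, epochs=50):
    # S: (n_points, n_projections) matrix of per-projection surprise scores.
    # labels: dict of point index -> 1 (confirmed anomaly) or 0 (confirmed nominal).
    # tau: fraction of cases the analyst can inspect; we push labeled anomalies above
    # the corresponding score quantile and labeled nominals below it.
    for _ in range(epochs):
        scores = S @ w
        q = np.quantile(scores, 1.0 - tau)            # current tau-quantile of the scores
        for i, y in labels.items():
            margin = scores[i] - q
            if y == 1 and margin < 1.0:                # anomaly not clearly above the quantile
                w = w + lr * S[i]
            elif y == 0 and margin > -1.0:             # nominal not clearly below the quantile
                w = w - lr * S[i]
        w = np.maximum(w, 0.0)                         # keep projection weights non-negative
    return w / max(w.sum(), 1e-12)                     # renormalize so scores stay comparable
```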
I guess the only interesting thing is
that if the anomalies are fairly rare,
then we will have a significant false alarm rate.
So it could very well be that for the first three or four,
five, ten cases we show to the analyst,
none of them are actual fraud cases. What can we do?
Well, one of the advantages of having
this threshold tau is that
even with no positive examples,
we still learn to push down all of
those non-anomalies below the threshold tau.
So, this seems to help us learn
even without any positives,
but we had to make some minor modifications.
So, we compared against
a baseline where we do no learning,
we just take our initial anomaly scores
and present them in that order.
We looked at randomly ordering them,
and then this is our method.
Then there are two other methods in the literature.
One called AI2, which I don't think has
ever been published except as a CSAIL Tech Report.
So it's probably a class project or something,
although it works quite well in some cases.
Then there's something called SSAD,
which was in a JAIR paper.
Their method from Gornitz, et al.
is called Marginalize and Cluster, I think.
We thought we had an idea for
improving it using our Accuracy at the Top method,
but as you'll see it was not an improvement,
but just for completeness.
So, what we're going to plot in these plots,
is along the horizontal axis is
how many cases have we presented to the analyst,
and the vertical axis is how many
of them turned out to be true anomalies.
Again, all this is done with
a simulated analyst, and so on.
So, we can see that our baseline method
is this blue curve.
So, if we have a budget of 60 cases we can look at,
we find about, I don't know,
25 or 26 or something.
Whereas if we can incorporate our feedback, we get 45.
So, huge advantage from having the analyst in loop.
However, this is the KDD 1999 dataset,
which is extremely easy.
So let's move onto some other ones.
So this is the Abalone dataset.
Here, early on, we
get some benefit of our method over our baseline.
No feedback is the blue curve,
and ours is the red curve.
Then this is SSAD method,
which is doing just about as well as ours.
But if we say, well, suppose we only
had time to look at 20 cases,
then we'd still be seeing
more than a factor of two improvement.
Eventually, obviously if you go out far
enough on this, there will be no benefit.
Because eventually, we'll have
shown everything to the analyst.
Here's another one for thyroid.
In this case, again,
compared to our baseline, we're doing pretty well.
In this case, this AI2 method is
doing almost as well as we are.
But again, the benefit is potentially almost a factor of
two in terms of the number of anomalies we can detect.
In mammography, similar story.
So in general, for all of these,
we're seeing that incorporating
the analyst feedback can be
really powerful at least for this LODA algorithm.
But now, let me speculate about
why this might be overly optimistic.
We have to remember that these datasets that we've
constructed here are constructed from
the supervised learning datasets.
What's going on is that when we
reweight these different projections,
we're essentially reweighting the original features
and putting more weight on some than on others.
So, it could be that because our anomalies are coming
from classes in the classification problem,
they were presumably determined to be classes
because they were believed to be
separable using the features.
It may be that we were
essentially, in a sort of sneaky way,
learning a classifier and taking advantage of that.
So, it's not clear whether in
a real fraud detection case we would get
this same factor-of-two improvement.
However, we have a paper under review right now at KDD,
where we're doing reweighting for
Isolation Forest using a sort
of online learning strategy.
We have applied this to
this advanced persistent threat cybersecurity case,
which is not a supervised learning case,
and we're still seeing a substantial,
not a factor two, but a substantial benefit
from integrating the feedback. Yes.
>> Would it help to add
label noise to answer that question?
>> Labeled?
>> So you have your classes.
>> Right.
>> If you put in some label noise,
it's going to start mixing the classes,
so they wouldn't be separable anymore.
>> Oh, you mean when we constructed, that's interesting.
That's interesting. We should maybe look at that.
That would be a good test at least for this particularly.
Yeah. I guess another thing is,
we could create benchmarks where our nominal points
come from say just one class and
the anomalies are from the mixture
of all the other classes.
Or maybe we should use lots
of classes on each side so that there's
probably not a simple decision boundary that
separates the nominals from the anomalies,
because they'll be very multi-modal distributions
on both sides. Yeah.
>> So the feedback you have here is
very example-driven, right, with the analyst judging things.
It seems like you could also make it feature-driven,
or, going back to pattern terminology, pattern-driven.
So in fraud cases,
people are often interested in patterns like
a small gas station purchase followed by a big purchase;
is that pattern right?
And then the anomalies are
evidence that this pattern exists.
Have you thought about trying to
order things in other ways where people might be able to
give more sample-efficient feedback over
an entire feature as to whether or
not they believe something anomalous is happening here?
>> We have not.
We did try to do some sort of
model-based work back in the Insider threat dataset,
where we had some theories of Insider threats and we
designed features specifically to look for those threats,
like someone copying a large number of files
onto a thumb drive.
But you have in mind maybe more of a knowledge discovery
setting where the analyst looks at our anomalies.
Or maybe can you do like as we've been discussing today,
maybe some cluster analysis of these anomalies
the way you guys are clustering mistakes and say,
"Oh, I see sort of a pattern in this,
and this looks completely bogus to me.
This looks like just some side effect of some instrument.
So, I don't want to see any more
of these cases like this,
for instance, or this looks interesting to me.
So, everything that looks like this,
let me label that whole cluster." Yeah.
>> Okay. How do you select the examples? Do you select
them with a kind of set-based criterion, right?
If you want to exclude a whole feature,
and you just randomly select examples,
chances are you won't exclude that feature.
Whereas if you start selecting them in conjunction,
you can do that.
>> Yes. We did look at whether the sort
of "select the biggest anomaly" approach is the right thing to do.
And Weng-Keen, you may or may not know,
but when he was a graduate student,
he was working on rare category discovery and
there you do have some diversity tricks.
So we could think of this as also,
could we elicit this from the expert.
Again, consistent with some of
the things I've been hearing about here.
Not only is that an anomaly,
but I'm going to call that a fraud
anomaly because this corresponds
to a kind of use case I know.
This is the "check that the credit card is working before
you use it" pattern.
No, we haven't thought about that,
but it's a great idea obviously. Yes.
>> Can you still reward diversity?
Because diversity will definitely change
the efficacy of the query versus the anomaly.
So, it seems like that trick
could completely change your metric,
or is that not true?
>> No, we would really have to
also have something in our metric that rewards diversity.
Well, as you say,
if we could essentially present
a group of proposed anomalies,
and say we think these points are all
anomalous for the same reason, are we right?
And get feedback on the whole group at once,
then that would be great.
Then the next query,
we would use obviously a different group.
But we might have to balance them,
I know in real situations.
I remember years ago,
I visited Bell Labs and Darrell [inaudible] was working
on payphone fraud for, well,
it wasn't even AT&T at the time,
and they had some statistical model
and they were looking for weird behavior.
But they also ranked them by how valuable they were
in terms of the financial impact of those frauds.
So, that was it.
Okay, well, I've got to
move along because we're really behind.
So, let's talk about these two applications.
So the first one is, data cleaning and weather networks.
So, I'm part of a group
that's working on something called
the Trans-African Hydro-Meteorological Observatory,
and it's founded by two folks.
One is a colleague of mine John Selker,
who's a hydrologist and
the other is a guy named Nick van de Giesen,
who's at the Technical University of Delft,
also a hydrologist.
They were at
some World Meteorological Organization meeting,
at which this plot was put up showing where are
the weather stations around the world that reliably
report data to the World Meteorological Organization.
Those purple blue dots are the ones
that are the good weather stations.
We can see that in most of Latin America and Africa,
the weather infrastructure is just very poor.
So, in the context of agriculture in Africa,
most agriculture is rain-fed,
there's not a lot of irrigation.
They noted that when you have very poor sensing,
then you can't buy crop insurance
because the insurance companies
don't know what risk they would take.
They don't know how to price the insurance.
So, that leads the farmers to
plant their plants very far apart,
so that each plant under
the worst case precipitation conditions will
still get enough water to do well.
There have been field trials showing that if
you do offer crop insurance,
then the farmers plant much more aggressively,
plant their plants much closer together and
get a much higher agricultural productivity as a result.
So, this is one of these examples
of if we could provide better data,
this could lead to crop insurance industry,
which could in turn
lead to higher agricultural productivity.
John Selker is not a modest man.
He wants to have Africa leapfrog over
the whole rest of the world and become
the best-sensed continent.
Of course, the price of sensors is dropping rapidly,
and we can now build
weather stations that are very cheap.
So, we're trying to deploy 20,000 such stations
across Africa and then create various data products.
I like to say, "Well,
weather stations are really cool, but for some reason,
people want to put them out in the weather and expose
them to the rain
and all kinds of problems, and then they break."
So, this is a kind of Internet of Things project.
I like to say, the Internet of Things is
going to be the Internet of broken things.
So, how can we automatically
detect which stations have
failed, which sensors have failed?
So, we might approach this as
a sort of medical diagnosis problem.
If we could say, "Well, these sensors
tend to get one in five different diseases,
can we learn to recognize the symptoms?"
But this is one of those cases where we feel like
there seems to be
an unbounded number of ways
these weather systems can fail.
So, where there are known symptoms,
we can write rules for them
or build probabilistic models,
but we want to use anomaly detection
to detect new kinds of failures.
So, I think in
the interest of time I won't go
through all the technical stuff here.
But I'll just say that we are combining
anomaly detection scores in a dynamic Bayesian network,
where we reason about the state of each sensor.
Then we're doing a form of MAP inference
to try to say, given
the observations that we've
made, turned into anomaly scores,
we try to not just flag the data with anomaly flags,
but actually do some root cause analysis,
and say the reason that you got these 2,000 alarms
is due to one particular thermometer
failing for some time interval.
So, I guess I should say,
maybe the most interesting or weirdest move here is this:
a typical problem you have when you
try to do diagnosis like this
is you want to model the sensor readings as a function
of the state of the sensors.
But these sensor reading distributions can be very messy,
and so we replace the actual distribution of sensor readings
by the anomaly score assigned to those sensor readings.
This has a nice virtue
of normalizing everything to be between zero and one,
and having a more predictable distribution.
At least that's our hypothesis,
but we don't have definitive results to prove this.
So, I'll say, we're currently working
a lot on comparing sensor readings at multiple stations.
So then the other problem which is more related to
open world is Open Category Classification.
So, given that I have training data for say
K classes that I've
collected and I trained
a multi-class classifier for them,
but then my test data may contain
queries corresponding to new classes,
how can I detect them?
We call them alien queries,
the queries that belong to new kinds of objects.
So, our approach is, given our training data,
we train both our multi-class classifier
but also an anomaly detector.
So, then when an input query comes in,
we look at the anomaly score.
And if it's greater than some threshold,
then we reject that point and say,
"We think this might belong to a new category,
so we're not going to classify it."
But otherwise, we feed it through
our classifier and make our prediction.
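(A rough sketch of that reject-or-classify pipeline, assuming scikit-learn, integer class labels, and a threshold chosen elsewhere; it illustrates the structure, not the exact system described in the talk.)

```python
# Rough sketch of the reject-or-classify idea: train a multi-class classifier and
# an anomaly detector on the same clean training data, then reject test queries
# whose anomaly score exceeds a threshold. Assumes integer class labels.
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.linear_model import LogisticRegression

def fit_open_category(X_train, y_train):
    clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
    detector = IsolationForest(random_state=0).fit(X_train)
    return clf, detector

def predict_or_reject(clf, detector, X_test, threshold):
    # score_samples is higher for more normal points, so negate it to get an
    # anomaly score that grows with outlierness.
    scores = -detector.score_samples(X_test)
    preds = clf.predict(X_test)
    return np.where(scores > threshold, -1, preds)    # -1 marks "alien" rejections
```

In practice, the threshold here would be set by something like the quantile procedure that comes up a bit later in the talk.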
So, this was, for me,
motivated by another ecological problem:
for about a decade I was working with some folks at
the EPA and on campus
on freshwater stream health monitoring.
So the EPA every year in the United States does
randomized sampling of freshwater streams,
and what they do is they use a kick-net,
which is what these guys are doing here,
to collect insect larvae that
live on the substrate of the streams,
and then they bring those specimens into the lab.
Currently, they manually identify
them to genus or species.
But this is very time consuming and so we thought, "Well,
we'll build a robotic device that
can singulate the specimens as they say,
and then photograph them,
and then we'll use Computer Vision to tell them apart.
We did this back in the mid 2000s when everything
was Bag of SIFT and Support Vector
Machines as the classification technology.
But we got up into the low 90 percent correct,
over ultimately 54 categories.
So we were all excited about that
and then we started to look at actually deploying it,
and we realized although our collaborators had
collected specimens and we had
these beautiful training data for all the specimens,
the test samples are going to contain other stuff.
It turns out there are more than 29 species,
or more than 54 species, out there;
it's estimated there are something like 1,200 species
of freshwater macroinvertebrates, as they call them.
So this led us to think about,
how could we deal with this problem?
Of course, more recently in Computer Vision,
this has become a popular problem.
So, there are several techniques
that are out there but as far as we know,
no one has looked at the question of whether we can
provide any kind of theoretical guarantee.
So, we are looking at a set up now,
where we have two training sets.
One training set is of our clean training data.
Our collaborators went out and
collected all these training data for us.
It's all nicely labeled,
and we know there are no contaminating samples in these.
But in addition, we're going to require that we have
an unlabeled collection of
samples that are representative of our test queries.
So, here, we are making
a well-defined anomaly distribution
assumption that we can get
a sample from the distribution of anomalies.
In fact, even to talk about what it means to have
a PAC-style guarantee on
detecting, say, 99 percent of all the aliens,
right there I'm assuming that's
a sensible statement, that there is such a distribution.
So, we're going to assume that this contaminated sample,
which we'll call S_m for mixture, is a mixture where, again,
the proportion of the aliens is Alpha.
So this density is a mixture:
one minus Alpha times the clean distribution p_0,
plus Alpha times the anomaly distribution.
Now, what we want to find is this:
suppose this is the CDF of
the anomaly scores of the anomaly points.
What we're really interested in,
say if we wanted to detect 95 percent of
the aliens or anomalies,
is this quantile:
the anomaly score such
that 95 percent of
the probability mass is greater than it.
So we'd like to estimate this tau at
0.05 or 0.01 or whatever we want.
So we'll call this q, quantile 0.05.
So, we note that since this is
a mixture distribution and
we're in a one-dimensional world,
the same is true for
the cumulative distribution functions:
the CDF of
the anomaly scores for the mixture distribution is
a convex combination of
the nominal distribution and the anomaly distribution.
So if you do a little algebra,
you can solve for the anomaly CDF.
We can estimate the mixture CDF from data,
and we can estimate the clean CDF from data.
If someone will just tell us
Alpha, or give us maybe a reasonable upper bound on Alpha,
then we can use that to estimate this anomaly CDF.
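(Spelled out, this is my reconstruction of that algebra, writing $F_m$, $F_0$, and $F_a$ for the CDFs of the anomaly scores under the mixture, clean, and anomaly distributions:)

$$F_m(t) = (1-\alpha)\,F_0(t) + \alpha\,F_a(t)
\;\;\Longrightarrow\;\;
F_a(t) = \frac{F_m(t) - (1-\alpha)\,F_0(t)}{\alpha}.$$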
So, the picture looks something like this.
This might be the CDF of the clean data,
the black one is the mixture data,
and maybe this is our anomaly CDF.
So, we're going to set our threshold
somewhere around in here maybe.
We'll get all the anomalies,
but we will have some number of false alarms too.
So, of course,
this all depends on how good the anomaly detector is,
and how well they can distinguish
the aliens from the known classes.
So, when we look in practice,
this is the empirical CDF of the clean data,
this is the empirical CDF of the mixture data,
and of course, we're kind of subtracting them.
So, unfortunately, this thing can go negative.
This is the estimated CDF of the anomalies.
So, it's pretty sad.
But if we have a larger sample,
it starts to look reasonable.
But it still can cross this threshold a few times,
and so we look at
the largest value of the anomaly score that
crosses this 0.05 line and we set our threshold there.
That's what this pseudocode says.
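(A minimal numpy sketch of what I take that pseudocode to be doing, assuming we already have anomaly scores for the clean sample and the mixture sample, plus an upper bound on Alpha; the function and variable names are mine, not from the paper.)

```python
# Sketch of the threshold-setting procedure: estimate the alien anomaly-score CDF
# from the empirical CDFs of the clean and mixture samples, then set the threshold
# at the largest score where the estimated alien CDF is still at or below q.
import numpy as np

def alien_threshold(scores_clean, scores_mixture, alpha, q=0.05):
    t = np.sort(np.unique(np.concatenate([scores_clean, scores_mixture])))
    F0 = np.searchsorted(np.sort(scores_clean), t, side="right") / len(scores_clean)
    Fm = np.searchsorted(np.sort(scores_mixture), t, side="right") / len(scores_mixture)
    Fa = (Fm - (1 - alpha) * F0) / alpha      # estimated alien CDF; can go negative
    below = np.where(Fa <= q)[0]
    # Fall back to the smallest score (flag almost everything) if Fa never dips below q.
    return t[below[-1]] if len(below) else t[0]
```

Points whose anomaly score exceeds the returned threshold are then treated as aliens.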
Then we're able to get a guarantee that if
the number of samples in
these two datasets is equal to n,
then the required sample size is something like this.
We have one over Epsilon squared term,
we have something that's sort of a one
over Alpha squared term,
and then we have this,
which is a funny thing
that comes from the fact that we're estimating
two different cumulative distribution functions and
there are some concentration results for CDFs.
So, then with probability one minus Delta,
the alien detection rate will be
at least one minus q minus Epsilon.
So once again, we need some elbow room
in the form of some Epsilon.
Anyway, so that's what we can do.
Then experimentally, I didn't put the results here,
but we have a paper under review for
ICML where we also show experimental results for this.
Of course, it's a loose bound but it is a bound,
and it does work in practice.
We see that for some reasonable settings
(we have results
on some image datasets and some benchmark datasets),
we get false alarm rates, using
Isolation Forest as our anomaly detector,
somewhere in
the 10 to 20 percent range for reasonable sample sizes.
So we're very happy with this, very pleased with it.
So, to finish then.
Outlier detection: I talked about the
different settings, unsupervised and mixed.
Our benchmarking suggests that
this Isolation Forest algorithm is quite nice.
Surprisingly, these percentile methods
or quantile methods are not doing so well.
I should say in passing,
obviously these things don't work at
all well by themselves for image data,
or any kind of signal data or time series data;
we've kind of discussed this already.
So, I think the next steps here,
I'm very intrigued by things like GANs, and autoencoders,
and other kind of manifold
learning and density estimation techniques
that are coming out of the deep learning literature,
and so that's the next thing I want to try.
There are some very nice papers at
the last couple of NIPS about this,
and I think that might give us
much better anomaly detectors than we can get using
these simple i.i.d. feature-based methods.
Then we talked about the RPAD theory,
which accounts for their behavior
I think reasonably well.
Then we talked about presenting
the most important features,
a simple ordered feature explanation
incorporating expert feedback using
this Accuracy at the Top strategy.
Finally, guarantees for Open Category Detection.
I apologize for running long,
and thank my funders,
and open it up for questions.
>> Thanks a lot.
>> Thank you very much. Yeah.
>> So, at the beginning of the talk,
you mentioned that there was a fundamental assumption
about the nominal distribution
and the anomaly distribution?
>> Well, the assumption I wanted to call out is,
is it safe in my application to assume that
the anomalies come from a well-defined distribution.
>> I'm curious, can you sketch how to weaken
such an assumption to
allow the fraudster in the loop as well,
not just the analyst in the loop,
but an arms race between
the anomaly detector and
the people who are trying to game the thing?
>> Well, so, I don't
know how to do that in a theoretical framework.
It seems to me that we end
up having to think about it more like a game,
a minimax kind of adversarial learning setup instead,
and try to get some guarantees that way,
so that would be quite different from this.
Of course, there is some interesting work
in adversarial learning for supervised learning.
Of course, everybody in deep nets right now
is trying to deal with adversarial
examples in deep learning.
Obviously, some of the strategies that are
being explored, like using GANs and
trying to map queries onto the data
manifold to strip out these adversarial attacks,
are very closely related to
the same idea: could we detect
the adversarial queries as
outliers that fall
outside of our data distribution?
So, I think anomaly detection
will again be part of the answer
but this theory for
open category will not be part of the answer there.
We need some other more adversarial approach.
>> And then just a clarification.
Was the DARPA ADAMS thing run
as an arms race between red and blue teams, or was it?
>> Yeah, so there was a red team that was
actually based on people from CERT,
the computer security guys.
They have this whole white paper on
different kinds of insider threats in organizations that
they've collected both from DOD and from industry.
My favorite one is what they called the Insider Startup,
where a sub-team decides,
"The company doesn't appreciate us,
we're going to take the source tree with us
and go form a startup."
But there were other things,
like someone who's passed over for promotion,
survivor guilt, all these,
and so they have this whole set of things,
a lot of them very psychological.
Then they hired a script writer to write scripts
about what kinds of emails and text messages
these people would be exchanging,
in addition to their desktop data.
Then they acted all of this out under
the desktop monitoring software,
and then overlaid it on the behavior of
real employees in the company.
I had access to this in a very controlled setting,
and we had to see if we could
find those red team inserts.
But unfortunately, we don't really know how hard they
were and of course,
we don't have access to the data anymore.
It would have been interesting,
once we had the ground truth labels, to go and see:
could we train a supervised learner,
like a nearest neighbor
classifier, and figure this
out quickly? We don't know.
>> In the weather station example, I'm curious.
Mostly, the anomaly detection algorithms you
talked about seemed to be optimizing
for low false alarm rates and high detection rates.
But in some of the systems
depending on how you're reacting to the anomaly,
you might be able to tolerate lots of false alarms
depending on what's important and how
you're recovering from a problem.
I was curious, in the weather station,
what happens when you detect an anomaly?
>> Well, so,
in all existing weather networks that I'm aware of,
they have a lot of flags and
then they rely on some human
eyeballs to look at them all.
So for instance, Oklahoma Mesonet
is the best weather network in the United States.
They have a big budget, and they have
four full-time meteorologists for 130 stations.
So scaling that up to 20,000 stations in Africa
would need a heck of a lot of meteorologists.
I just don't think that strategy
scales to the Internet of Things so
that's why I think we need to bring
much more intelligence to bear.
Also, most of the existing weather networks,
they use basically hand-coded expert systems for writing
these problem detection rules
and so they are completely ad hoc.
So, I think in some sense this is
low-hanging fruit for us, because by
bringing probabilistic inference on top of the alarming,
we should be able to greatly consolidate them.
We may still have lots of alarms,
but we can say these 1000 alarms
are really just due to one failure,
and create a trouble ticket for it.
Well, we're well on the way to building a system where
the weather network manager
logs in every day and gets a report of
what new failures we found in the last 24 hours.
Then he can very easily create trouble tickets and
assign those to the field engineers,
who then will be scheduled to go out and do the repairs.
In our dreams, that would be integrated with
an inventory system and on and on.
But that's how they're set up.
>> But don't you have to give it a lot more knowledge
than usual about how sensors fail,
essentially error models? In
most of the control systems I'm aware of,
the experts have very rich models about how
the sensors fail, they may be doing weird things,
but they filter them in certain ways. [inaudible]
>> In a mature system,
presumably, ultimately we find these recurring anomalies.
So, then they bring them into the lab and say,
"Okay, what's going on here?"
Then, we've already had this happen.
We didn't discover it with our code,
but there was a manufacturing defect
or design defect in the station,
and so all the generation-two stations
are having to be replaced.
But over time, you will accumulate a library of
failure models, with
very specialized detectors for those.
But we want this anomaly infrastructure as
a fallback because there's always something new.
We haven't had any of our stations attacked by
an elephant yet but I'm sure it'll happen.
Oklahoma, they have unbelievable stories
about cows getting into the thing.
What is the symptom of a weather station where
some ants have decided to lay
eggs all over the circuit board?
It's hard to have a disease model for that.
>> I was just
wondering about the online case of anomaly detection.
You seem to assume here in the problem statement
that the nominal distribution is stable over time.
>> Absolutely, yes.
>> What happens if
the nominal distribution starts drifting
over time, because
the instrumentation changes,
or whatever other issues?
>> So, somehow we're going to
have to track that distribution.
Yeah, we haven't confronted that case here.
Although in the weather data case,
it's not stationary, but
it is very predictable:
an annual cycle, a daily cycle,
there are lots of
periodic components that you
can, with sort of traditional time-series techniques,
model and pull out of the signal.
Then, in your dreams,
what's left is uncorrelated white noise and then it looks stationary.
>> [inaudible].
>> Maybe it's a part of the answer,
but actually we've spent
a whole year trying to do good feature engineering
and decided this isn't working for us.
Because it's just too much work to do all that modeling.
When you make modeling errors,
that introduces artifacts,
and then we start making mistakes because of those.
So, we're looking now at, basically,
nearby weather stations that
should be experiencing the same disturbances,
the same annual and daily cycles,
and then comparing them.
How well can we predict one using
the other or what's their joint distribution,
and using that to do the anomaly detection stuff.
So, yeah.
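(A toy sketch of that neighbor-comparison idea, assuming aligned hourly readings from two nearby stations; the window size, z-score cutoff, and function name are hypothetical choices, not the project's actual method.)

```python
# Toy sketch of neighbor comparison: predict station A's reading from nearby
# station B over a sliding window and flag time steps whose residual is unusually
# large relative to that window.
import numpy as np

def neighbor_residual_flags(a, b, window=168, z=4.0):
    """a, b: aligned hourly readings from two nearby stations."""
    a, b = np.asarray(a, float), np.asarray(b, float)
    flags = np.zeros(len(a), dtype=bool)
    for t in range(window, len(a)):
        xs, ys = b[t - window:t], a[t - window:t]
        slope, intercept = np.polyfit(xs, ys, 1)   # refit each step: correlations drift
        resid = ys - (slope * xs + intercept)
        pred = slope * b[t] + intercept
        flags[t] = abs(a[t] - pred) > z * (resid.std() + 1e-9)
    return flags
```

Refitting on a sliding window is one simple way to cope with the fact that the strength of the correlation between stations changes with the weather regime.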
>> So you retrain the models,
or there's some drift between predictions?
>> Yes. Actually, we need to retrain
the models, really sort
of a sliding window thing, because again,
the nature of the correlations changes.
In wintertime, for instance,
precipitation is our most important variable,
our most problematic sensor,
and it has the smallest spatial scale
in terms of spatial smoothness.
In the wintertime, if we think
about the Pacific Northwest,
it's very smooth, a very large storm
that affects the whole Pacific Northwest.
So, if it's raining in one place,
it should be raining at another station
and we can use that correlation.
But in the summertime, or in
the late spring when we get these showers,
the stations become extremely uncorrelated,
so then we need to, in a sense, switch models to use
a model that corresponds to
a showery day with lots of little cells.
Really, I think we need to be tracking that
because it really is happening day by day with the weather.
Like the first day of a rain event,
it might have this large scale and then by the third day,
it's into these little uncorrelated things.
So, I think we actually need to
track that kind of larger spatial phenomenon.
Speaking of larger spatial phenomena,
another thing is, if you have
a spatially clustered set
of alarms all ringing at once,
it's probably a real weather event,
an unusual weather event,
not any of the stations being broken.
That's knowledge that's not anywhere in our system,
but the meteorologists definitely know this.
Of course, there are other data sources
like satellites, and radar,
and things that we can
use as additional sources of data for this, so.
>> Okay. With that,
we probably should thank Tom. Thanks very much.
>> Thank you.
