>> So, I'm really happy to
bring this speaker today.
Nick Heard, is a well,
an excellent researcher,
but first and foremost,
he's a good friend of mine.
We go back at least to 2011,
which makes us both of a certain age, I think.
We met in true academic style,
in that I was working on my dissertation,
and I thought I had an original idea.
I don't know, many of you
have probably come across
this where you thought
you had an original idea.
I did a little bit of
a literature review
and it turns out
somebody beat me to it,
and it turned out it was Nick.
So, I sent him an email and said,
that's a great idea, I
thought it was mine.
But you're a smart guy
for having it
and that resulted in
a long-term collaboration.
We've got papers together,
and patents and I've mentored
his postdocs, and students.
I hope to extend that
relationship here at Microsoft,
and then do a lot
of collaboration
with Nick, and Imperial College.
So, with that we're going to
hear his views on statistics,
and cyber security. Thanks Nick.
>> Thanks Josh. Thank you
for the very generous
introduction.
It's not quite as he tells it;
he's very kind to say that.
So, thank you for the invitation,
I'm thrilled to be here to talk.
So, the title of my talk
is Statistics and Cyber-Security.
So I'm a researcher from the
Department of Mathematics
at Imperial College London,
and also seconded to the
Heilbronn Institute for
Mathematical Research.
So, here's the
obligatory slide of
people I've worked
with in the work here.
I'm looking forward to
adding Joshua's name and
Microsoft the next
time I give this talk.
So, the top line is my graduate students,
who work at Imperial, and then Silvia works
at the ATI, the Alan Turing Institute, as a post-doc.
Pat's a friend and collaborator at Bristol,
and Melissa is at LANL; there'll be lots of mentions
of Los Alamos, or LANL for short, throughout the talk.
All of my contact with them came through Josh,
so I'm very grateful to him for that long-lasting
and very fruitful collaboration; that's all due to Josh.
Okay. So I probably don't need to
sell this point to you guys so much,
but there's a general understanding,
I think, by governments and industry
that data science techniques
have some role to play in the future of
cyber security defenses.
There's plenty of data
sources available that
could help discover and
prevent network attacks.
At Imperial, our interests are in developing
statistical or probabilistic models for normal behavior,
and detecting intrusions by looking for outlying
behavior with respect to those models.
The advantage of being able to do this is that we can
discover zero-day attacks by looking at people's
normal behavior and looking for unusual deviations from it.
It could even be someone using legitimate credentials,
but they may be stolen, and therefore the person is
traversing the network in an unusual way.
So, these are the different data sources
we look at at Imperial.
Most of what I'm going to talk about concerns
network flow (NetFlow) data:
this IP address talked to this IP address on
this protocol, plus some other fields.
The start time in particular is something I'll look at,
so the timings of events,
but we also look at other things as well.
There are authentication events, so users
authenticating on computers, whether or not
they're successful, and what time that happened.
A bit of host-based stuff, so events and processes;
that's much more along the lines of what you guys
probably look at, and it's a space we're just
starting to work in.
So, a bit of that, and there's also the potential
to do some physical stuff like building
access control, IoT, and so on.
So, a lot of the models that we work with are
fairly simple: we're dealing with high-frequency data
and we want to do thinning,
so just screening and triage,
finding the interesting stuff amongst
the bulk of traffic passing through our network.
We want models that can be used in Hadoop if we can,
and things that we can stream.
So, there's lots of
different ways we can think
about analyzing
our cyber data and there's
lots of different approaches.
I'm going to focus on these three different ones.
You can think of these as heading down from
the macro scale at the top of that bulleted list
to the micro scale at the bottom.
So, I could take a network-wide view of my network
and do things like graph theory,
spectral decomposition,
community detection and clustering,
and also some high-level summaries
for network oversight.
I'm going to cover these in reverse order in the talk,
so I'll start with edges first.
So, we can look at the network level;
we can look at node-based models
and build models of the processes that a user
or computer runs, its connectivity,
its pattern of life, the times it's active;
or we can go right down to edge-based models,
looking for, say, beacons to specific IP addresses,
temporal dependence between edges,
typical packet sizes, and things like that.
The point is that all of these could yield
useful cyber analytics, and it's not a competition.
We can look at all of these, come up with analytics,
run them all in parallel, and gathering them together
is where we think we can gain strength.
So, we're going to
have a whole bunch of
analytics running and
somehow try to combine
them all so hopefully
getting a bunch of
weak indicators and combining
them into something strong.
That's the idea: we want to have lots of
different views of the problem and combine them.
So, this talk is going to be a summary of some of
the work we've been doing in this area:
some published, some stuff we've already done,
and some stuff we're doing right now.
So, starting at the most micro level,
let's look at edges.
By an edge between two IP addresses,
say an edge between X and Y,
what I mean is that X has at some point been
observed to initiate a connection to Y.
I'm going to keep my edges directed,
so all my graphs will be directed,
with an arrow on the links between nodes.
So, nodes will be an IP address.
You can think of the edge as being the fundamental,
the primitive if you like, of doing cybersecurity
data analysis.
It's sort of counter-intuitive that it isn't the node
that's the fundamental thing.
But in terms of observing traffic,
the edge is the primitive;
the node is the union of all the edges coming from it.
So the edge is the building block, if you like,
for all our models.
That's the most micro level we can look at:
a specific edge, the traffic from
this IP address to this IP address, okay?
So, in NetFlow, which is what most of
this talk will be about,
some edges will carry entirely automated traffic,
some will carry just human traffic,
and some will carry a combination of the two.
What I want to look at first is how we go about
classifying edges as either human or automated,
if we don't know anything about our network a priori.
One simple diagnostic you'd have is that automated edges
typically carry what I would call superhuman
levels of traffic volume.
So you could just threshold on that and say, okay,
that's way more than a human could generate in a day,
so it's got to be automated.
You could just do that, and I don't want to dismiss it,
but it's not always the case.
What we want is some more sophisticated filter,
besides just thresholding on volume, to say
that's an automated edge, or that edge contains
some automated traffic.
Automated traffic is often highly periodic;
think of beacons for updates or keep-alives.
So we have a paper from 2014 where we scanned for
periodicities in the data coming from a single edge.
We looked at the periodogram over
some observation time T, which can be efficiently
calculated using the fast Fourier transform.
So, I'm going to explain a bit
about what this formula does in a
second, and the second is now.
So, let's talk about this periodogram.
It's effectively, well, it's proportional to,
the squared magnitude of the resultant vector
if I place each event onto the unit circle
as a unit vector on an f^(-1)-second clock.
That is, if I pick a particular frequency f,
imagine a clock of circumference 1/f;
I start the clock going, and as events occur
I drop a mark down, and keep going round
dropping marks as the events arrive.
That's what this formula does.
Forget this bit; that's uninteresting,
just a constant as it were.
The increments of the counting process get
multiplied by this term here, and that's
essentially just plotting points down onto
the circle in this way.
So, just backing off for a second,
what we want to do is find surprisingly high levels
of periodicity at specific frequencies,
but we don't know a priori at which frequency
we're going to find them.
So we're going to look at a bunch of frequencies,
calculate this function, and see if there's any
clustering of these points on the circle
for any given f.
That is, does there exist a frequency f such that
the points are all bunched together if I put
them onto an f^(-1) clock?
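Below is a minimal Python sketch of the "clock" statistic just described; it is not the authors' code, and the function names, frequency grid and toy data are my own illustration. For each candidate frequency f, the event times are wrapped onto a clock of period 1/f and the squared magnitude of the resultant unit vector is recorded; a large value indicates bunching at that period.

```python
import numpy as np

def clock_periodogram(event_times, freqs):
    """Return |sum_i exp(2*pi*i*f*t_i)|^2 / n for each candidate frequency f."""
    t = np.asarray(event_times, dtype=float)
    n = len(t)
    # angle of each event on the f^(-1)-second clock, for every candidate frequency
    angles = 2.0 * np.pi * np.outer(freqs, t)       # shape (num freqs, num events)
    resultant = np.exp(1j * angles).sum(axis=1)     # sum of unit vectors per frequency
    return np.abs(resultant) ** 2 / n

# Toy example (made up): a 55-second beacon with jitter, plus background events.
rng = np.random.default_rng(0)
beacon = 55.0 * np.arange(200) + rng.normal(0, 0.5, 200)
background = rng.uniform(0, 55.0 * 200, 50)
times = np.sort(np.concatenate([beacon, background]))

freqs = np.linspace(1.0 / 120, 1.0 / 5, 5000)       # periods from 5 s to 120 s
s_hat = clock_periodogram(times, freqs)
print("strongest period ~", 1.0 / freqs[np.argmax(s_hat)], "seconds")
```

In practice, as noted in the talk, the same quantity can be computed over a dense frequency grid with the fast Fourier transform; the direct form above is just easier to read.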
So, this is what my s_hat of
F looks like from my own
computer's connections
to Dropbox.
So, you can see a peak at
around the 55 second mark.
And here are the event times projected onto
such a clock, a 55-second clock.
So you can see there's some bunching;
well, there are two sets of bunching.
That's okay: there's been some sort of refresh
at some point, where it was all beaconing at one phase,
and then there was perhaps a delay or a reboot
or something, and then it started beaconing again;
still the same periodicity, but with a phase shift.
That's okay, because if you look at the resultant vector,
those blue points still give a massive resultant vector
when you add all those unit vectors together.
Whereas if I just pick some other arbitrary period,
say 43, which was the number that came into my head,
and try putting that in, that's the projection
onto a 43-second clock.
You can see the points are rather more
uniformly scattered.
There's still a bit of bunching, but that's normal;
you tend to get multiple pulses for a single event
in a beacon sequence, a bit of blurring and
multiple events and so on.
So I get a non-zero resultant vector,
but nowhere near the same magnitude.
So, that's the point.
That's what's driving this spike in s_hat of f:
there is that fantastic bunching at
that specific frequency.
>> Hi.
>> Hi.
>> So, when you said, you know, [inaudible]
expectation on the frequency:
like everything between, you know,
a nanosecond and 23 weeks? Or is there-
>> Yeah, well, we only have data recorded to
the millisecond, so I'm never going to find
anything finer than that.
>> But in principle you can find stuff,
you know, at millisecond frequency.
>> Yes.
>> Then at, like, three-minute frequencies-
>> Yeah. Yeah. Like a few seconds is common.
That's possible, yeah.
Whereas if I go too big,
then I get 86,400 seconds, a day,
and that's just human again, so I'm mostly
interested in periods which are somewhat less
than that. Good point.
Okay, so that's just to see what it looks like.
If I look at all the traffic coming from my computer,
the left-hand image shows a sort of time-of-day
histogram of when my computer does its business,
and you can see I've got this big spike of activity
at 2:00 AM, which isn't me; I'm hopefully in bed
some of the time.
And you can see that it gets taken out if I filter
by doing this sort of processing.
So 80 percent or more of my traffic was easily found
to be automated just by doing this sort of filtering,
with really significant p-values on this test,
which I didn't really explain: it's Fisher's test,
using this statistic, the maximum of that function
divided by its sum over all the frequencies considered,
of which there are typically a few hundred.
Using simple central limit theorem arguments,
under the assumption of a null model of a
homogeneous Poisson process, these ordinates
would be chi-squared distributed under the null.
It's easy to work that out, so you can get
closed-form approximations to the p-values really
easily, and you can actually quantify these things.
That's stuff we did a few years ago,
so it's out there, and it's a handy technique.
It's the first go-to thing
in terms of classifying an edge.
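Here is a rough sketch, in Python, of a Fisher-style g-test built from the statistic just described: the largest periodogram ordinate divided by the sum over all candidate frequencies. The leading-term tail approximation used below is my own simplification and assumes the ordinates are approximately i.i.d. exponential under the homogeneous Poisson null; the exact alternating-series formula is omitted.

```python
import numpy as np

def fisher_g_pvalue(s_hat):
    """Leading-term approximation to the Fisher g-test p-value over m ordinates."""
    s = np.asarray(s_hat, dtype=float)
    m = len(s)
    g = s.max() / s.sum()
    # Fisher's exact tail is an alternating series; its first term, m*(1-g)^(m-1),
    # dominates whenever g is large enough to be interesting.
    return float(min(1.0, m * (1.0 - g) ** (m - 1))), g

# Example, reusing s_hat from the periodogram sketch above:
# p, g = fisher_g_pvalue(s_hat)
# print("g =", g, "p-value ~", p)
```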
But you can see, even in the Dropbox picture,
there are still a few events that don't fit in with
the bunch, and it's kind of interesting to think
about mixtures, where perhaps I've got some human
activity as well; perhaps I've changed a file there
or whatever.
There could be some human traffic mixed in with
the automated stuff. You can still detect it,
because even though I've got an edge which
has a mixture of human and automated- Hi.
>> For human, like,
is it one person or is
it possible that
there are multiple-
>> Well, I know this
computer. It's mine.
So, it's probably one person.
Yeah but probably multiple.
>> No, I mean, I'm
just saying that.
>> What?
>> Is it possible that the edge is not
just between a single computer and,
I don't know, a server, but behind
the scenes there are multiple people
on multiple computers?
>> That's a great question.
A bit later on in the talk, I'll talk about
mixtures of individuals, so that will come up.
Here I'm not being specific about that;
I just want to know, is it automated,
is it clearly periodic, or is it something else.
That's a fair question, but I'll come back to it.
So yeah, if I've got a mixture of human and
automated traffic, the automated traffic is usually
high volume and will dominate, so I will still get
to say, okay, there is a periodicity going on
on this edge, but there could be a mixture.
So the current work we're doing with Francesco is
looking to separate these out and classify
every single event, to say whether each individual
NetFlow event is automated or human.
So, for example, there are two ways in.
One is I can look at these clocks:
if this is the frequency I found, I can say, okay,
this event here is not in phase with
the rest of the beacon.
It's somewhat of an outlier, so that's one indicator
of it being, perhaps, a human event: it's not in
with the bunched events of the periodic signal
that I'm looking for.
So that's one piece of evidence: is it in phase
with the beacons? The second thing is,
what time of day did it happen?
Is it in the middle of the day,
or is it during the nighttime?
Those are two, say, orthogonal pieces of information
which I can bring together to do inference on
an event-by-event basis.
So briefly, what we're doing is fitting a Bayesian
mixture model to learn the human and automated
components of the mixture density that we think
might be going on on a given edge.
If I call fA the density for automated events and
fH the density for human ones, I want to learn
those two things.
Just for simplicity, as with lots and lots of these,
we use simple stats models, because we want to do
these things at scale; so we fit a wrapped normal
distribution on the unit circle.
Once we know the inferred frequency from Fisher's test,
we can rescale everything to the unit circle
and put it on 0 to 2 pi.
So we have an automated density fA, for which we
learn the mean; essentially, what's the phase?
Perhaps that should be a mixture distribution over
multiple phases, but at the moment it's just
a single one, and for this example that's fine.
We also learn the variance of these beacons.
So we want to learn the mean event,
the phase shift, and the blurring;
it's always plus or minus a bit of time,
it's never bang on.
I want to learn that for the automated part,
and for the human part we want to learn the
time-of-day pattern, essentially, which hopefully
has some sort of structure like that.
So for that we have a step-function density:
I assume it's just a piecewise constant density
with an unknown number of steps.
This is a particular hobby horse of mine;
I just love piecewise constant things.
They're really easy and tractable to work with,
and they're consistent: if I don't bound the number
of steps, then asymptotically, as I get more and
more data, I will recover the density.
So it's a fantastically flexible form, and that's
why we use it for the density of the human stuff.
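The following is a minimal illustrative sketch, my own simplification rather than the draft paper's model, of the two mixture components just described: a wrapped normal on the beacon phase for the automated part, and a piecewise-constant time-of-day density for the human part, combined into a per-event probability of being automated. All parameter values (mixing weight, phase variance, step heights) are made up for illustration.

```python
import numpy as np

def wrapped_normal_pdf(theta, mu, sigma, n_wraps=5):
    """Density on [0, 2*pi) from wrapping N(mu, sigma^2) around the circle."""
    k = np.arange(-n_wraps, n_wraps + 1)
    diffs = theta[:, None] - mu + 2.0 * np.pi * k[None, :]
    return np.exp(-0.5 * (diffs / sigma) ** 2).sum(axis=1) / (sigma * np.sqrt(2 * np.pi))

def step_pdf(hour_of_day, edges, heights):
    """Piecewise-constant density over the 24-hour day."""
    idx = np.searchsorted(edges, hour_of_day, side="right") - 1
    return heights[idx]

def automated_responsibility(event_times, period, w_auto=0.7, mu=0.0, sigma=0.3,
                             edges=np.array([0.0, 8.0, 20.0, 24.0]),
                             heights=np.array([0.005, 0.078, 0.005])):
    """P(event is automated | time), under a toy two-component mixture."""
    t = np.asarray(event_times, dtype=float)
    phases = 2.0 * np.pi * (t % period) / period
    hours = (t / 3600.0) % 24.0
    # automated component: wrapped normal over phase, uniform over time of day
    f_auto = wrapped_normal_pdf(phases, mu, sigma) * (1.0 / 24.0)
    # human component: uniform over phase, step function over time of day
    f_human = (1.0 / (2.0 * np.pi)) * step_pdf(hours, edges, heights)
    num = w_auto * f_auto
    return num / (num + (1.0 - w_auto) * f_human)

print(automated_responsibility(np.array([0.0, 55.1, 110.2, 50000.0]), period=55.0))
```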
>> Nick, is there
a paper for this?
>> Coming, yeah. So this is work in progress.
These figures are taken from the draft paper which
Francesco is waiting for me to read, so he'll be glad
to hear this question being asked and that I should
get on with it. Yes, that is the message.
So that's the beaconing stuff; that's our work in
terms of separating out automated from human.
Now, moving on to the human stuff.
Okay, the automated stuff is relatively easy
to understand: it's periodic, and that's okay.
But we want to build better models of the human
stuff as well.
So even human-generated network connections
don't arrive as a Poisson process.
They come in bunches; it's kind of bursty,
which Josh opened my eyes to when I first met him
in 2011.
That was a big blow, because we'd been doing
everything with Poisson processes.
Mathematicians like Poisson processes:
they're fantastically tractable, easy to work with,
lovely models.
But if they don't hold, they're just going to give
us the wrong inference.
So with Matt, another of my graduate students,
we're trying to model self-excitation:
looking at the counting process of events on an edge
which we've judged to be human, and trying to learn
the sort of self-excitation that goes on.
The model we've arrived upon has this sort of
non-parametric Wold process structure.
I've included a diagram below which is a bit
misleading, because this is the only relevant one;
these others are just other models we look at
in the paper.
These are just other models
we look at in the paper.
So what's going on is
saying, essentially, as if,
why is my county
impressive events for this edge
of interests that I have
some background intensity lambda
and then some
triggering effect here.
So if these my event times,
then I look at how long it's
been since the last event.
t minus y of Y of t. It says
how long has it been since
the last event and
these lambda j is going to be
some decreasing sequence with,
again, an unknown number
of changements.
I'm getting piecewise
constant but with
a decreasing intensity
going to back
down until eventually it goes
back down to the background.
Again , asymptotically we should
learn the correct intensity if
it's the right model and
in terms of structure,
then we should learn it.
People tend to work with Hawkes processes when
they're doing self-exciting models of point processes.
There, you add up: every time you have an event,
you get a boost in the intensity, like the
exponential-style boost that's drawn here.
So every time I have an event, these red crosses,
I get a boost in the intensity.
We looked at those models, and we looked at the
Wold process again, which just depends on how long
it has been since the last event rather than
accumulating.
Because that's the thing: I think with the Hawkes
process you just get too much, because NetFlow is
so bursty.
You get this massive boost, and then the model is
in a bit of shock when the burst finally stops.
So the Wold model seemed to fit better, and this
exponential kernel isn't quite right; we can fit
more flexible things.
Obviously this doesn't look so flexible in the
picture, but asymptotically we can fit however many
steps we need.
So this proved to be the best model in terms of
fitting the excitation that happens within the process.
It's a flexible model, and it provides a consistent
estimator: asymptotically, we'll get it right.
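The toy Python sketch below contrasts the two conditional intensities being discussed: a Hawkes process, where every past event adds an exponentially decaying boost, and a Wold-style process, where the intensity depends only on the time since the most recent event through a decreasing step function. The step edges, levels and rates are invented for illustration, not fitted values from the paper.

```python
import numpy as np

def hawkes_intensity(t, events, mu=0.1, alpha=0.5, beta=1.0):
    """lambda(t) = mu + alpha * sum over past events of exp(-beta * (t - t_i))."""
    past = events[events < t]
    return mu + alpha * np.exp(-beta * (t - past)).sum()

def wold_intensity(t, events, mu=0.1,
                   step_edges=np.array([0.0, 1.0, 5.0, 20.0]),
                   step_levels=np.array([2.0, 0.8, 0.2, 0.0])):
    """lambda(t) = mu + g(time since last event), g a decreasing step function."""
    past = events[events < t]
    if len(past) == 0:
        return mu
    gap = t - past.max()
    level = step_levels[np.searchsorted(step_edges, gap, side="right") - 1]
    return mu + level

events = np.array([1.0, 1.2, 1.3, 9.0])
for t in (1.4, 3.0, 30.0):
    print(t, hawkes_intensity(t, events), wold_intensity(t, events))
```

The design point this illustrates is the one in the talk: the Hawkes intensity accumulates over a burst, while the Wold-style intensity resets with each event and decays back to the background rate.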
It captures burstiness, and by capturing burstiness
we found it negates the need to fit a model for
seasonality, which you'd think would be the obvious
thing: even on a human edge, my intensity function
should have some component for what time of day it is,
and if it's the nighttime an event really should
surprise me. But we found the burstiness is enough.
People don't leave work at the same time every day,
so fitting seasonal models is something I'm starting
to go off.
We've done that quite a bit, but to do it well you'd
need a really flexible structure, and on top of that
there's a lot of work involved.
We found that, for the simple models we were looking
at, building in time of day was just getting swallowed
up by the burstiness: I've just done an event, so I'm
probably still at work; once I've stopped, then
perhaps it's the nighttime. So it kind of-
>> The other thing about seasonality is it's expensive.
In terms of keeping a million models around,
monitoring all the customers, we have to keep these
models for each user, and then we'd have to have all
these extra parameters for the time of day, right?
So it gets expensive computationally. That's great.
>> So, you can see the performance.
Blue is the homogeneous Poisson process:
if we did our best fit of a homogeneous Poisson
process to the data, this is the QQ plot, where we
want to be on the 45-degree line; pretending the data
were a homogeneous Poisson process gives the blue line.
The red is an inhomogeneous Poisson process, where
I've tried to learn the time-of-day intensity, and
I barely get any uplift at all from doing that.
Whereas the self-exciting models do particularly well:
the bold black line is the one with the non-parametric
intensity, and that's the one which does best,
which is why we favored it.
I think it's incredible, and since it's not my work,
it's Matt's, I can say that: I'm amazed we can fit it
that well, just predicting when the next event is
going to be.
So yeah, that's interesting stuff.
That's edges dealt with; not perfectly covered,
but that's all we've done on edges.
So, next up, nodes: one level higher up in terms
of granularity.
We've done some modeling of the destinations of nodes,
looking at the sequence of IP addresses, say,
that a computer connects to.
This really relates to some work I did with Josh on
port scoring a long time ago; it's the same model again.
Here we think about scoring every single event.
So look at a node, see who it connects to over time,
and score the surprise of every single attempt:
"Oh, we've connected to that one," and so on;
really simple things.
Then aggregate the surprise using control charts or
p-value combiners.
I'm big into p-value combiners these days,
which is quite odd for a Bayesian,
but I'm keen on them anyway.
So, in 2016 we had a paper where we looked at
modeling from a server-based perspective.
I look at a server, and it has its own model of,
"Okay, who's going to connect to me next?"
It has a probability mass function, maintained on
the go, of who it thinks is going to connect next,
which we learn over time.
So we look at the sequence of clients x1, x2, and so
on which connect to a server y, and we simply model
it as a multinomial; nothing clever going on at all,
but with an unbounded number of categories, which
makes it a bit more interesting.
Imagine I don't know the limits of my computer
network: there could be arbitrarily many computers
on the network, so I want to allow for that.
So I think about having a base measure on the names
of my computers, which I think is quite a fun thought,
and some measure of confidence in that, alpha.
Then, just from that simple Dirichlet process model
with a multinomial, we get a p-value for every
single observation.
You can think of alpha times the base measure of x
as the mass attached to observing the next event
being x.
So you get a chunk of probability from the base
measure, and then essentially it's a glorified
histogram: counts of how many times I've seen each
client connect before, along with the base measure.
Really simple stuff.
So, to score an edge, first of all we just look at
all the connections from each unique client to the
server y, and find the minimum p-value over those.
That minimum should follow a known distribution,
up to some approximations, at least assuming
independence and continuity and so on, that is,
the p-values being uniform on (0,1) rather than
discrete.
We transform those, and we get a p-value for every edge.
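Here is a minimal sketch, with my own simplifications, of the Dirichlet-process-style predictive just described: the probability that the next client seen by a server is x is proportional to the count of past connections from x, plus alpha times a base-measure mass for unseen clients, and the "surprise" of each arrival is the total predictive mass on outcomes no more likely than the one observed, giving a conservative discrete p-value.

```python
from collections import Counter

def dp_predictive(counts, alpha, base_mass):
    """Predictive pmf over seen clients, plus a lump of mass on an unseen client."""
    total = sum(counts.values()) + alpha
    pmf = {x: c / total for x, c in counts.items()}
    pmf["<new>"] = alpha * base_mass / total
    return pmf

def score_stream(clients, alpha=1.0, base_mass=1.0):
    """Return a p-value for each arriving client, updating counts as we go."""
    counts, pvals = Counter(), []
    for x in clients:
        pmf = dp_predictive(counts, alpha, base_mass)
        p_obs = pmf.get(x, pmf["<new>"])
        # discrete p-value: predictive mass on outcomes no more likely than x
        pvals.append(sum(p for p in pmf.values() if p <= p_obs))
        counts[x] += 1
    return pvals

print(score_stream(["a", "a", "a", "b", "a", "c", "a"]))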
Then, from a node perspective, what we do is combine
all of those: for a local client I want to say,
"Okay, what was the p-value I got for each of your
edges?" So for one computer here, which turned out
to be one of the more interesting ones, this is from
the Los Alamos network data, these are the p-values
we got for all its edges, so there's loads of
surprise going on.
This was one of the red-team-attacked nodes,
this guy here.
It was really easy to find; those data are so easy
that there's no great glory in winning on them,
but it's a nice demonstration of how we can fit
simple models and get well-calibrated, decent answers
out in an automated fashion.
So yeah, we use Fisher's method to combine the edge
p-values and get a node score in the end.
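A quick sketch of Fisher's method as used here to roll edge-level p-values up into a node score: under independence, minus twice the sum of the log p-values is chi-squared with 2k degrees of freedom, so its upper tail gives a combined p-value. (The toy p-values below are made up.)

```python
import numpy as np
from scipy.stats import chi2

def fisher_combine(pvals):
    pvals = np.asarray(pvals, dtype=float)
    stat = -2.0 * np.log(pvals).sum()
    return chi2.sf(stat, df=2 * len(pvals))

print(fisher_combine([0.9, 0.4, 0.01, 0.03]))  # a few weak signals combine
```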
Those are fairly arbitrary choices of combiners,
so a lot of my work since then has been on how you
make a principled choice of p-value combiner.
I've done a paper on that.
>> Is that paper
published yet or?
>> It is, it went to
Biometrika this year.
>> Excellent.
>> So, more interesting: that's okay, that's fine,
I can look at the surprise of connecting to edges
I've been to before and so on, but really we're most
interested in new edges.
If a node forms new edges, I want to think about how
surprising that is, okay?
In the red team attack in the LANL data, there were
whole bursts of new-edge activity created by the red
team, and that should be an easy way of detecting
unusual behavior.
So, what we want to do
is to understand whether
new edge activity is odd
for that host because
the point is that different hosts
form new edges at
very different rates.
Some make lots of new edges
and some are much
more consistent in
the nodes that they visit.
So, there are different
rates between hosts.
Also, we want to think about who the edges are being
formed with. If it's a new edge, okay, who to?
Is that a node you ought to be connecting to,
or is it a node someone else should be connecting to?
So we want to understand what the right nodes are
for you to be connecting to, as well as the rate at
which you should be forming new edges.
So we were looking at building models of these things.
Again, we treat this as a counting process:
the counting process of new edges being formed.
If capital X, my sources, are my clients, and
capital Y are my servers, the destinations, we want
to model the new-edge intensity lambda_xy(t) for a
client-server pair (x, y).
So if G_t is my current graph at time t, of all the
edges I've ever seen, I've got an indicator saying
the pair must not already be in G_t; it needs to be
in X cross Y minus G_t, so it has to not have been seen.
Otherwise this indicator is zero: there's no intensity
for a new edge there if I've already seen it.
That's all the first bit is saying. Makes sense?
Then there'll be some seasonality function, which
I'm going to treat as a nuisance parameter and not
do inference on; a recurring theme in this talk is
that we're just not doing much seasonality modeling.
Then some other covariates, which have to do with the
degrees, the out-degree of x and the in-degree of y,
so how much they make new edges, and whether they're
in a burst of activity.
So this indicator is equal to one if the last edge I
made was a new one, and this one is equal to one if
the last two edges I made were both new; further
terms didn't make any difference in the work we did.
Those turn out to be useful covariates: whether or
not I'm in a burst of making new edges is a good
predictor of whether I'm about to make another.
Plus a mysterious final term, a latent variable Z_xy,
which is going to characterize the attraction from
x to y in some model that we're going to build.
So we had two versions of doing that.
One is a hard clustering model: we cluster clients
and servers, a bit like a stochastic block model,
and say this type of client connects to this group
of servers and that's a good fit.
So we cluster the clients and servers into blocks
and say, "Okay, that block does tend to connect to
that block, so if you're in it that's okay,
otherwise it's more unusual."
Or we do something softer and fit a latent feature
model with an Indian buffet process prior.
There we're trying to do a much more flexible style
of modeling, and that actually turns out to do better.
That's a paper which is under review at the moment;
it's what we've done, but it's not over yet.
But yeah, the latent feature model turns out to be
the best thing for what we did.
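The schematic Python sketch below, in my own notation and loosely following the structure described above, shows the shape of such a new-edge intensity: it is zero for edges already seen, and otherwise log-linear in degree covariates, burst indicators, a seasonality nuisance term, and a latent attraction term Z for the pair. All coefficient values and names are hypothetical.

```python
import numpy as np

def new_edge_intensity(x, y, t, seen_edges, out_deg, in_deg,
                       last_was_new, last_two_were_new, Z,
                       beta=(0.5, 0.5, 1.0, 0.5), seasonality=lambda t: 1.0):
    if (x, y) in seen_edges:          # indicator: only unseen pairs carry intensity
        return 0.0
    b1, b2, b3, b4 = beta
    log_rate = (b1 * np.log1p(out_deg.get(x, 0)) +      # out-degree of the client
                b2 * np.log1p(in_deg.get(y, 0)) +       # in-degree of the server
                b3 * last_was_new.get(x, 0) +           # burst indicator: last edge new
                b4 * last_two_were_new.get(x, 0) +      # burst indicator: last two new
                Z.get((x, y), 0.0))                     # latent attraction term
    return seasonality(t) * np.exp(log_rate)

# Toy usage with made-up state:
seen = {("c1", "s1")}
rate = new_edge_intensity("c1", "s2", t=0.0, seen_edges=seen,
                          out_deg={"c1": 12}, in_deg={"s2": 40},
                          last_was_new={"c1": 1}, last_two_were_new={"c1": 0},
                          Z={("c1", "s2"): 0.3})
print(rate)
```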
Okay, here's a picture just to illustrate that in
action. These are the p-values we obtained for two
computers: this is the same infected one brought back
again, looked at now from a new-edge perspective,
and some other, random-ish, less anomalous computer.
These are control charts that combine p-values
over time.
Various p-value thresholds are drawn on, where we
work out by Monte Carlo what they should be to
calibrate these things, because some hosts have more
events than others and so on; but on a p-value scale
these are now comparable.
You can see that here we go way down into low
p-values in terms of the new edges being made,
whereas the uninfected one makes new edges, but
they're okay: they're not surprising either by rate
or by who they're with.
So, that's it in action.
>> Is the model adaptive? Is it learning over time?
>> I think so. We read it in as time-varying
covariates. The way we did it was that we have to
read the data in chunks and update the model over
time, and it seems to work out easily.
Reading all the data at once is a bit problematic
for us in fitting this model, so we tend to just take
a subset, and the learning becomes smoother.
So, I think so. Silvia Metelli has done all the hard
work for that, so she should be the person to
actually say, "No, it's not. Yes, it is."
Or it could be you.
So this idea, new edges, or edges that are unlikely
for some reason, is quite a nice general concept,
and we can apply it to different things.
We could score the surprise of the sequence of
processes executed by a computer and measure the
accumulated surprise, and also do data fusion,
combining host processes and network-level things
and getting scores for each of those; that's kind of
what we're doing under the data-centric engineering
programme at the ATI.
With Josh, in the past, we've looked at port scoring,
modeling how unlikely the use of a different service
is, say. There are lots of applications of such
simple models.
Then, finally, the whole-graph perspective,
which I might just have time for.
So this is the highest-level view we can think about
taking. This is joint work with Pat and Francesco.
Here we want to look at the full graph and see how
we can learn underlying structure in it.
We're going to consider a binary adjacency matrix,
not worrying about counts just yet: the binary
directed adjacency matrix A_t for my entire network,
such that (A_t)_ij is equal to one if and only if i
has connected to j, in that direction, by time t.
Here is just a cartoon graph of what G_t looks like:
perhaps one is connected to two, two is connected to
three, and so on, and we look at the binary
representation of that graph.
So that's the adjacency matrix I'm talking about.
Now, a low-rank approximation of this thing, which
is what I've observed up to time t: if I take a
low-rank approximation of A_t, or of some
Laplacian-style transformation of A_t, that can
provide something useful.
It's a bit like building a statistical model;
the low-rank approximation is there to prevent
overfitting. If I take a full SVD and just recover
the full matrix again, I don't have a model:
I'll just say, "Okay, I've seen this; if I've seen
it, it happens with probability one, and if I
haven't, probability zero," and that's not good.
With a low-rank approximation of that thing,
with some sensible treatment of values that fall
outside zero and one, you can think of that estimate
of A_t as being like a model, like a prediction.
So we can do estimation and prediction using this thing.
So here we think about the SVD a lot.
A is approximately A-hat-k, in the sense that if I do
the SVD and truncate to get A-hat-k, that's the
closest rank-k approximation of A in the Frobenius
norm. Essentially, that gives me two matrices of
vectors in k-dimensional space.
So if I'm thinking about clients and servers, say,
every client and every server gets a position in
k-dimensional space; even if they're the same
computer, they'll still get one position in client
space and one position in server space, and that's
the way we treated it.
So the ith or jth row of U_k or V_k respectively
provides a notional latent position in k-dimensional
space for that client or server.
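A minimal sketch of that truncated-SVD embedding, on a synthetic adjacency matrix (the data, rank and threshold are made up for illustration): the rank-k SVD gives each source node a row of U_k and each destination node a row of V_k as latent positions, and the rank-k reconstruction can be read as a score for how "expected" each observed edge is.

```python
import numpy as np

rng = np.random.default_rng(1)
A = (rng.random((40, 30)) < 0.08).astype(float)   # toy binary adjacency matrix

k = 5
U, s, Vt = np.linalg.svd(A, full_matrices=False)
Uk, Sk, Vk = U[:, :k], np.diag(s[:k]), Vt[:k, :].T

client_positions = Uk @ np.sqrt(Sk)   # one latent position per source (client)
server_positions = Vk @ np.sqrt(Sk)   # one latent position per destination (server)

A_hat = Uk @ Sk @ Vk.T                # rank-k "model" of the graph
# edges that are present but poorly supported by the low-rank structure:
suspicious = np.argwhere((A == 1) & (A_hat < 0.05))
print(suspicious[:5])
```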
So, a fairly simple slide, but the point I want to
make with it: look at this adjacency matrix picture.
This is like an observed adjacency matrix, mostly
zeros but with the occasional one scattered around,
and what I want to do is find the unlikely one:
the one in this picture that shouldn't be there,
compared to the rest of the structure of the graph.
That's the little game we want to play, either in
prediction for the future, or just looking at my
graph and saying, "That one shouldn't be there."
That's the game.
Some work we've done in that direction was with
Melissa from LANL, and Francesco is now carrying it
on, where we actually looked at the non-binary
version: rather than just ones and zeros in this
thing, in the ij position I've got the number of
connections I've seen up to time t.
So zero if I've never seen it, but otherwise some
positive integer if I've seen the edge.
It's a bit of a decision whether you think it's
informative to see how many times a client has
connected to a server.
There might be a bit of overkill in that signal,
but on the other hand you're getting some sense of
preference; not all edges are equal, and if I go to
this server a lot, then perhaps that's an important
server to me. I want to think of it as a
recommender-system-type problem, where this sort of
client really likes that sort of server.
So that work takes the approach that we have a
Poisson factorization model.
So my adjacency matrix now holds counts.
With the Poisson factorization it's the dot-product
idea again: there will be latent positions for both
the client and the server, and in this case user and
computer, or user and process, pairs get the same
treatment. There's some clever prior structure so
that we share some information between edges for the
same user or the same server; that gets at the idea
of busyness, this user does a lot, or this server
gets a lot of traffic.
So we have some sharing between edges;
it's not completely free.
So we have a p-value for anomaly detection given by
the upper tail probability of a count: how surprising
is it to see as many connections as we did, given
this Poisson model? Then we use Fisher's method,
similar to before, to combine edges and get a score
for users or clients.
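A small sketch of that scoring step, assuming a fitted Poisson rate is available for each (user, server) pair (the counts and rates below are invented): the surprise of an observed count is its upper tail probability, and edge-level p-values roll up via Fisher's method as before.

```python
import numpy as np
from scipy.stats import poisson, chi2

def count_pvalue(count, rate):
    """P(X >= count) for X ~ Poisson(rate); small when the count is unusually high."""
    return poisson.sf(count - 1, rate)

def user_score(counts_and_rates):
    pvals = [count_pvalue(c, r) for c, r in counts_and_rates]
    stat = -2.0 * np.sum(np.log(pvals))
    return chi2.sf(stat, df=2 * len(pvals))

# e.g. three edges for one user: (observed count, fitted rate)
print(user_score([(3, 2.5), (40, 6.0), (1, 1.2)]))
```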
Where we're now extending this work, and I was
talking with Josh and his team earlier about it,
is in incorporating known groupings of computers and
users alongside the latent factors.
So I might know something about particular users,
say, "Okay, this user is in this department," and so
on; therefore there should be some natural grouping
and some similarities.
I'm trying to find network structure, and if I
already know some information about that, arguably I
should be putting it into the model.
So we're looking at how to build those in as further
factors in this model, but it's the same framework.
Now, to revisit the question from earlier about
whether I've got more than one user on a single
address: that's certainly an option I'm interested in.
We can think about the network traffic flowing from
an IP address as being a mixture of multiple
individuals, each exhibiting their own mixture of
behavioral norms.
I can think of the traffic coming from an IP address
as a mixture of mixtures, even: it could be different
users, like a family of users sharing the same IP
address, and within that different devices, like IoT
stuff, all different sorts of physical devices making
network connections, as well as computers and other
things, and different tasks too.
So there are all sorts of different mixtures you can
think about going on.
Generally, I want to think about mixtures, or
mixtures of mixtures, of behaviors, and I find it a
very interesting, really hard research question:
is it possible to actually disentangle that mixture?
If I observe an IP address for long enough, could I
infer how many people are using it? Which devices?
Which types? Which users are active at different
times? Which behaviors there are when they're
active, and so on?
Those things feel like they should eventually be
observable because, just taking the user example,
they won't all be active all the time; they come and
go, and if you can see that repeated structure
temporally come and go over time, in principle it
should be possible to disentangle it: I can see
those sorts of connections happen in blocks of time,
and they stop and come and go.
So it should be possible to disentangle, but it's a
really awkward inference problem.
So I see this as a sort of topic modeling problem.
If we can do it, that's great, because we could build
better models: if I could say this actually isn't
just a single homogeneous thing, it's a mixture of
lots of different things, and if I could understand
what those things are, I'd be in a much better place
to build good models.
So it feels like topic modeling, that idea from text
analysis where I take a corpus of documents and count
how many times each word appears, a bag-of-words
representation. Each document is made up of a
mixture of topics, and those topics have
distributions on words, so a topic is about cats, or
about microwaves or whatever, and there'll be a
different lexicon of words used for each of those
different topics.
But a priori we don't know what the topics are or
what their word distributions are; we don't know
anything, yet people still try to infer the topics
automatically, they try to draw them out.
So it's a very active research area in computer
science as well as statistics.
So we see an analogous problem in cyber.
If you look at the traffic from an IP address or
user, then the topics could be the users, the
different users could each be seen as a topic, or
their behaviors could be seen as topics, and the
documents would be chunks of time for an IP address,
say one day of traffic.
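As a rough illustration of the analogy, the sketch below uses off-the-shelf scikit-learn LDA rather than the Bayesian non-parametric model the group is actually developing: "documents" are (IP address, day) count vectors over destination computers, and the inferred "topics" are candidate users or behaviors with their own preferred destinations. The synthetic data generation is entirely made up.

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation

rng = np.random.default_rng(2)
n_days, n_computers, n_users = 10, 32, 5
# synthetic ground truth: each "user" has a sparse preference over computers
user_prefs = rng.dirichlet(np.full(n_computers, 0.1), size=n_users)
# each document (day) mixes whichever users happened to be active that day
docs = np.vstack([
    rng.poisson(30 * user_prefs[rng.choice(n_users, size=2, replace=False)].mean(axis=0))
    for _ in range(n_days)
])

lda = LatentDirichletAllocation(n_components=n_users, random_state=0)
day_mixtures = lda.fit_transform(docs)        # per-day mixture over inferred topics
topic_prefs = lda.components_                 # per-topic preference over computers
print(day_mixtures.round(2))
```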
So yeah, we've done some work on that.
We made some initial, fairly low-level efforts,
which are out there, but that was a very easy
problem; trying to scale this up is really hard.
Bayesian non-parametrics is the standard way of
looking at these topic models, which may allow
potentially infinitely many temporally occurring and
maybe even time-varying topics, so there's a lot you
could do with this sort of stuff.
This is joint work with Genie, another of my PhD
students, and she's doing really well, I think,
considering.
So this is an example data set we're looking at;
real data again, from LANL.
Each row is a day of data, so there are 10 days of
data, and these are 32 different computers being
connected to.
Each block corresponds to a true user: this is user
one, user two, user three, user four, user five,
and the colors show how much they connected to each
of these computers on each of the 10 days.
What we do is take those data and smash them
together, just to make life difficult for ourselves,
and then see if we can untangle them again.
That's the game: can we infer the structure just by
putting in some ideas of sparsity?
We don't want topics which visit all the computers
here; we want some idea, some sort of purity, of
these topics.
A given behavior, or a given user, doesn't visit all
the computers, because they have their favorites and
they go to those.
So we push that into the models: we have ideas of
sparsity, and some sort of temporal dependence,
since you tend to see contiguous blocks of
connectivity over time, so we push that into the
model as well.
So we're still getting stuff wrong, but it's great
that we're getting some recovery of some of the
shapes here.
Genie is never happy with these, but I think they're
not bad. We're not there yet, but we're starting to
untangle things without knowing anything.
It's up to the viewer to decide whether they think
that's good or bad, but I think it's not too bad.
Okay, so on the difficulty scale, I think that's the
hardest thing we work on, and at real scale I think
it's going to be really problematic.
And finally, I'm going to talk at a bit more length
about some work on changepoint analysis.
Part of this is with an MSc student called Karl,
who's starting as a graduate student in October,
so he'll be looking at this stuff too, and part is
work I'm doing by myself on changepoints.
I'll enjoy talking about that for the end of this
talk; it's my current favorite thing.
So finally, this is an overall look at the network
side, the whole graph, and it's really just looking
at traffic volumes: to say, okay, from Imperial or
LANL or wherever our data come from, is there an
unusual spike on a particular port, or for a
particular packet size, or something like that?
We just want to count everything going through our
network and ask: is there something unusual going on
in those counts?
I feel like we ought to be able to do that.
This is the most basic detector we should have on
the go; we should have some situational awareness of
what's going on.
And so the motivating problem, if you like, is
WannaCry, okay.
These are data for the whole of 2017, from Imperial's
network: the counts of the number of NetFlow events
recorded on the three most commonly observed ports
over 2017, port 445, port 443, and port 80.
Port 445 is the new kid on the block in a sense;
in this graph it wasn't the most popular until just
before May 2017, which was of course the time of the
WannaCry attack, the biggest cyber attack to hit the
UK so far, affecting the NHS and various industry and
academic institutions.
Imperial hasn't been listed as one of them, but we
certainly saw an incredible spike on port 445 at that
time, and we shouldn't just be learning about that
from Twitter or something; we ought to be seeing it
here, right? I mean, something is already happening
there.
We shouldn't have to wait until the actual outbreak
to know that there was a gradual rise in the use of
that port, and maybe it's for legitimate reasons,
but we ought to at least have some situational
awareness of this sort of thing going on before it
goes sky high.
And if we're not counting these things, we still
don't even know; maybe the network is working fine,
we don't really know.
So that's the motivating picture: I want to detect
peaks like that. But it's not just ports.
It could be services, it could be other things like
traffic volumes or data volumes, or geolocations:
there's more traffic connecting to IP addresses in
location x, that's something else again.
There are lots of things I might want to count, and
I really don't want to put a limit on that.
So I'd like to count lots of different things, model
them in a flexible way, and look for changepoints.
So, this is a graph that Karl has made; it's very
pretty. It looks at just a week of data from
Imperial's network, across different ports, and
these are some changepoints he's fitted.
He's an MSc student, and it's excellent stuff to
have done this Bayesian changepoint analysis on
these multinomials.
So, one, it's nice: they're nice pictures, and you
can see why the changepoints happen.
It also looks at both changes in volume and changes
in the mixture.
He's doing a marked changepoint analysis, where we
say, okay, there's a changepoint, and it's a change
in the mixture, a change in the volume, or a change
in both.
So we classify the change, and that comes out of the
model; we fit that marked changepoint model.
>> So, the mixture is over the ports being observed?
>> Yeah, what the relative frequencies are; and then
the overall height of the bar is the volume, so we're
looking for changes in both.
He has this way of encoding it: a plus, I think, is
a change in volume, an x is a change in the mixture
probabilities, and a star is a change in both of
those things.
So that's nice and sort of makes sense, but the
other takeaway for me is that there are an awful lot
of changepoints, and I don't really want to be
getting alarmed that often; that's one week, and
everything has changed and then changed again.
There's only a certain amount of surprise I'm
willing to swallow.
This is the wretched seasonality problem from human
behaviors; so annoying.
If we all just did the same thing all the time,
never went to bed, never changed the clocks or
anything, everything would be so much easier, but
people insist on having varied lives and going to
bed and all sorts.
And for ports, that manifests itself in very
different ways for different ports.
You can see port 443 here; this is pretty.
[inaudible] messes things up even more.
So look at the seasonality pattern over 2017:
really big weekday spikes for port 443, less so for
port 445, and so on; these are some of the more
popular ports.
So there's all that level of variation going on all
the time.
So given these relative differences, if I just
naively start looking at those multinomials, I'm
going to keep saying, oh my God, suddenly port 443
is really popular, and now it's not again, and so on;
it's constant surprise over and over.
And I don't want to get involved in modeling
seasonality, as I've said, because no two weeks are
the same, and on top of that there are gradual
drifts going on over time.
I can't just look at last week; it's not the same,
and there are all sorts of different things going on.
So yeah, I'm going off fitting seasonality functions;
instead I'm trying to build robust models that just
say, okay, let's not get excited too often.
So, the changepoint model.
The idea of a changepoint model is that we try to
partition time up: we're looking through time,
observing a stream of data, and we chop time up into
segments, and we tend to fit really simple models
within each segment.
When I've fitted things piecewise, in segments, the
overall global model becomes somewhat more complex,
because I've been able to fit a local model and I
learn where to put the cuts and so on.
So it's a way of building a complex model out of
lots of little ones concatenated together.
But often the model we fit is an oversimplification,
and so we get more changepoints than would be
preferable, like in the figure I showed you.
You could model the seasonality, and that's
certainly one alternative, but as I say, no two
weeks are the same, and building those models is
labor-intensive as well; it's an extra modeling
problem we'd have to face for everything we want to
count, because they don't all vary in the same way.
So this is work I'm doing by myself, looking at
modeling data with changepoint segments where I
admit the models are probably mis-specified, and I
look for clear discontinuities, for real change,
rather than the gradual drift.
It might look like a bit of a hack, and maybe it is,
but I'm really enjoying working on it, so I'm going
to tell you about it to finish my talk.
So, a quick review of Bayesian changepoint analysis.
Tau will be my vector of changepoints; it's an
unbounded sequence that goes on forever.
I'm going to learn a sequence of changepoints tau_1,
tau_2, and so on, increasing in time, which are
unknown and which correspond to the change times of
a piecewise deterministic process theta(t).
The parameter of my model is going to be called,
generically, theta, and it varies over time but is
constant within any segment: for two neighboring
changepoints, and any t between them, this piecewise
deterministic process theta(t) takes the value
theta_j.
It's just a fancy way of saying I have a parameter
that changes at every changepoint; theta_j is the
parameter for the jth segment.
The usual Bayesian thing is to have a [inaudible]
process prior on the changepoints, and we try to
learn them given the data.
So suppose I observe, and I'm going to think about
discrete time, although actually what I'm describing
also works in continuous time, which is really nice,
but I'm not going to talk about that; I'm just going
to talk about discrete time.
So I observe a sequence x_1, x_2, and so on; they
could be counts, multinomial counts, or whatever,
just some data, and I've got a model
p(x_t | theta(t)), where theta(t) is one of these
piecewise deterministic processes.
Bayesian inference on tau is particularly tractable
and doable if I assume conjugate, independent priors
for the parameters theta_j; or priors which can at
least be well approximated by conjugate, independent
priors, in which case I can use the conjugate prior,
do some importance sampling, and everything works
out fine.
So we tend to like conjugate priors when we can get
away with them.
Because then those parameters can be integrated out,
and we obtain a marginal likelihood for all the data
we've observed given the changepoints, which
typically factorizes.
There are examples, like autoregressive processes,
where this isn't the case, but typically we have a
factorization where this is a product over segments
k of the marginal joint distribution of the data
within each segment: a product over segments of the
probability of the data within each segment, with
the thetas integrated out.
For each segment I've just got this marginal
likelihood for the observations within it, and I
take the product over segments and I'm done.
So that's the thing we're normally working with;
that's very standard.
So here's an example to make it concrete.
Suppose within the segments I have gamma-distributed
rates. So for the theta, which is everything for a
single segment, imagine theta_1 for the first segment
follows a gamma distribution, and then within the
segment, if there are n things in the segment, X_1 up
to X_n are assumed to be IID Poisson(theta_1).
The gamma is the typical nice conjugate prior for a
Poisson rate, so that's a typical Bayesian model for
the first segment of a changepoint model.
It's the conjugate prior, so I can integrate theta
out and get a marginal likelihood for X_1 up to X_n.
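Here is a compact sketch of that conjugate marginal likelihood: for a segment x_1..x_n assumed IID Poisson(theta) with theta ~ Gamma(a, b), theta integrates out in closed form, and the segmented marginal likelihood of the whole series is the product of these terms over segments. The hyperparameters and toy data are placeholders.

```python
import numpy as np
from scipy.special import gammaln

def log_marginal_poisson_gamma(x, a=1.0, b=1.0):
    """log p(x_1..x_n) with theta integrated out under the Gamma(a, b) prior."""
    x = np.asarray(x, dtype=float)
    n, s = len(x), x.sum()
    return (a * np.log(b) - gammaln(a)
            + gammaln(a + s) - (a + s) * np.log(b + n)
            - gammaln(x + 1).sum())

def log_marginal_segmented(x, changepoints, a=1.0, b=1.0):
    """Sum of segment log marginal likelihoods for given changepoint indices."""
    bounds = [0] + list(changepoints) + [len(x)]
    return sum(log_marginal_poisson_gamma(x[lo:hi], a, b)
               for lo, hi in zip(bounds[:-1], bounds[1:]))

x = np.array([2, 3, 1, 2, 9, 11, 10, 8])
print(log_marginal_segmented(x, []), log_marginal_segmented(x, [4]))
```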
So the first, I think interesting and non-intuitive,
observation to make about this framework: here I
draw a single theta and then draw X_1 up to X_n IID
from Poisson(theta).
Interestingly, even seemingly paradoxically, this
sampling scheme is equivalent to the following
dynamic theta strategy.
Here I've got the same theta for the whole of the
segment, which is how I described it with the
piecewise deterministic process, but actually, in
terms of the likelihood, in terms of the generative
scheme for the x's, this is equivalent.
Draw theta_1 again from Gamma(a, b) and draw X_1
from Poisson(theta_1); then I redraw a new theta
from the posterior, what would be the posterior of
theta given X_1, that's what that looks like, and
then draw X_2 from Poisson(theta_2), dot dot dot,
eventually drawing theta_n from the posterior given
the earlier observations and X_n from Poisson(theta_n).
That gives the same joint distribution for X_1 up to
X_n as the first scheme, which I think is something
truly cool. It's basic stuff in a way, just Bayes'
theorem and whatever, but it's really elegant.
In one case you've got a fixed theta for all time
and X_1 up to X_n IID from Poisson(theta_1); in the
second case, you draw just X_1 from Poisson(theta_1),
then draw another theta, which is influenced by X_1,
then X_2, and so on.
So that's an interesting observation in its own
right, I think.
But there's a point to making this observation,
which is that I could start messing with this, and
rather than doing exactly this, I could do something
a bit different.
In the back of Bayesian Theory, Bernardo and Smith
tabulate the marginal likelihoods for these things.
If I've got a Poisson distribution with a
gamma-distributed parameter, we can look up the
marginal likelihood; this form is a nice, familiar
one to those who work with Bayesian conjugate models.
You can think of this in a really nice way: the
joint distribution obviously has to be the product
of the predictive distributions.
I can write P(X_1, ..., X_n) as P(X_1) times
P(X_2 | X_1) and so on; that's just probability
theory, and that's what this looks like.
I think of it as a product of n terms, and they're
all cancelling out all the time: the numerator of
one term is the denominator of the next.
So as this runs up to i minus one and then to the
next i in the loop from 1 to n, these factors just
cancel out with the next one in the loop, and so on.
They all cancel, and I get a very simple form for
the marginal.
That's essentially what's going on, if you think
about it from the predictive-distribution point of
view.
De Finetti's representation theorem means it has to
be this way: if I want exchangeable observations
within a segment, which is what I'm generally
thinking about, IID from some theta where theta is
drawn from a prior, then De Finetti's representation
theorem says that this is the format it has to have;
this structure is exactly what it has to look like.
What if I know my model is wrong?
I know the data you're
on exchange but I know
say for example there's some
drift going on over time,
some slow level of
drift they don't
really care about and I don't
want to bother modelling it,
I just want to absorb.
In fact these data
aren't exchangeable,
I'm looking for continuous
things rather than IID,
so I'm willing to sacrifice
exchangability for
model robustness.
That's the argument I'm making.
I'm more likely to be like
the previous observation,
the one at the start of
the segment because I don't
really believe in this IID.
It's an underspecified model.
So we can break this full conditioning.
For example, rather than taking the product of the
predictive distributions like I'm supposed to,
conditioning on everything from 1 up to i minus one,
I can define an alternative probability distribution
on X_1 up to X_n, which I index by k for a given
integer k.
I define it to be the product from i equals 1 to n
of the predictive of X_i given, not X_{i-1} down to
X_1, but just the last k of them, so I'm essentially
windowing to the last k observations.
I can still write down what the marginal likelihood
will be under this formulation.
So with this window of k, each observation really
only depends on the last k things rather than
everything in the segment.
Then this will be my
marginal likelihood,
I can still write down,
integrate our
parameters, I can still
write down a marginal likelihood.
It's usually messier but
I can write it down.
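Written down, the windowed alternative being described is something like the following, with k the window length (my notation, not necessarily the slides'):

$$
p_k(x_{1:n}) \;=\; \prod_{i=1}^{n} p\!\left(x_i \,\middle|\, x_{\max(1,\,i-k)}, \ldots, x_{i-1}\right),
$$

where each predictive is the usual Poisson–Gamma (negative binomial) predictive, but computed with posterior parameters $\alpha + \sum_{j=\max(1,\,i-k)}^{i-1} x_j$ and $\beta + \min(k,\, i-1)$, i.e. using at most the last $k$ observations.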
But I don't know what k is, and I'm not trying to be too specific about it.
So I want to average this thing over some prior Q on k;
in the work I've implemented this with a geometric prior.
Essentially, I'm here at time t and I'm looking back into the past,
and it's a Bernoulli trial: there could have been a changepoint
with equal probability at every step back from here,
but truncated by where my inferred changepoints are.
So my changepoints are going to take on that interpretation.
So, jumping ahead: my joint distribution for X1 up to Xn,
for a given prior measure Q, which is just going to be geometric,
is then simply this thing; I just sum over my geometric
distribution for k. So I can write down the marginal likelihood
for a given Q; it's discrete, that's fine.
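A minimal sketch (my code, not the speaker's) of how that averaged marginal likelihood could be computed for Poisson counts with a Gamma(alpha, beta) prior; truncating the geometric sum at an arbitrary k_max is an assumption on my part:

```python
import numpy as np
from scipy.special import gammaln

def log_predictive(x_next, window, alpha, beta):
    """Log Poisson-Gamma (negative binomial) predictive of x_next given the window."""
    a = alpha + np.sum(window)          # posterior shape after the window
    b = beta + len(window)              # posterior rate after the window
    return (gammaln(a + x_next) - gammaln(a) - gammaln(x_next + 1)
            + a * np.log(b / (b + 1.0)) + x_next * np.log(1.0 / (b + 1.0)))

def log_marginal_windowed(x, k, alpha=1.0, beta=1.0):
    """log p_k(x_1..x_n): each predictive conditions on at most the last k observations."""
    return sum(log_predictive(x[i], x[max(0, i - k):i], alpha, beta)
               for i in range(len(x)))

def log_marginal_geometric(x, p=0.1, k_max=200, alpha=1.0, beta=1.0):
    """Average p_k over a (truncated) geometric prior on k, working on the log scale."""
    ks = np.arange(1, k_max + 1)
    log_prior = np.log(p) + (ks - 1) * np.log1p(-p)     # geometric pmf on {1, 2, ...}
    log_liks = np.array([log_marginal_windowed(x, k, alpha, beta) for k in ks])
    return np.logaddexp.reduce(log_prior + log_liks)    # log-sum-exp over k

x = np.random.default_rng(1).poisson(5.0, size=100)
print(log_marginal_windowed(x, k=10), log_marginal_geometric(x, p=0.2))
```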
Going back to that point I was making about the sequential thing:
there I was saying X1 up to Xn are IID from Poisson(theta),
where theta has a Gamma prior, and we draw theta_1, then X1,
then theta_2, and so on. Now, for the nth one, theta_n is drawn
from the posterior based not on j = 1 up to n minus 1,
but on j = n minus k up to n minus 1. So you're constantly
redrawing theta, but based on the last k observations rather
than everything, and I'm averaging over k.
>> Is it possible to fit an ARMA model to
this time series of thetas?
>> Sure, but that's exactly the thing I don't want to do.
I want to do something really, really simple.
But yeah, exactly: if I was a proper time-series guy and
actually wanted to learn the dependence structure,
I'd fit a separate ARMA model. But I'm thinking about doing this
over all sorts of different things I want to count,
and I don't want to be going down that road; maybe I should,
but I'm interested in running really simple models where I can
integrate every parameter out and just get a marginal likelihood.
>> The reason being that the ARMA model gets expensive?
>> Yes, that's the point here. I'm trying to develop
a generic changepoint method that can just deal with
temporal trends without any structural modelling at all,
just trying to get robustness by taking very simple
changepoint models and tweaking them a bit so they're more robust.
But it's a very fair question; it's the opposite of what
I'm trying to do, for good or bad.
So I think the interpretation of this tweaked model is that
I'm allowing that there may have been a changepoint at
any time in the past, where that time follows this distribution Q,
my prior on when I think the changepoint would have been,
truncated by the changepoints that I do infer.
So for a given fitted set of changepoints, what I'm saying is:
looking back, there's a geometric prior on how long it's been
since my last changepoint, but there definitely was one there.
That's kind of how it goes. Therefore, these things are in my mind,
on this interpretation, definite changepoints.
They really have to earn their place, because I'm already
allowing the possibility that there could have been
a changepoint anywhere, even within a segment, through this k.
So, this is what it looks like when simulated,
which is a nice place to start. This is Poisson-gamma again;
you can see the gamma parameters I'm using here,
very small alpha and beta. K is infinity, so that's just
the normal model, IID, and I've got a single changepoint;
in all these pictures I've got a single changepoint at 200.
So, IID data in each segment, with an intensity drawn from
a gamma with those values. That's what it looks like.
Now instead I change K to something different.
So this is what K equals 10 looks like, just drawing from that model,
then K equals 5 and K equals 1, and you can see we get
some sort of temporal trend. Drawing from this model gives me
data with temporal trend, and that's what I want.
So it's a really cheap, naughty way of getting temporal trend
without actually modelling anything; essentially it's the
geometric prior that's doing it.
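As a rough illustration of that simulation, here is a sketch (my own, with illustrative parameter values, not the slides' exact settings) of drawing from the windowed model, where a very large k recovers the usual IID-within-segment behaviour and a small k produces drifting counts:

```python
import numpy as np

def simulate_windowed(n, k, alpha, beta, rng):
    """Counts where theta is redrawn from the posterior of only the last k observations."""
    xs = []
    for i in range(n):
        window = xs[max(0, i - k):i]
        theta = rng.gamma(alpha + sum(window), 1.0 / (beta + len(window)))
        xs.append(rng.poisson(theta))
    return np.array(xs)

rng = np.random.default_rng(7)
# Very large k is effectively the usual IID-within-segment model; small k induces drift.
iid_like = simulate_windowed(200, k=10**9, alpha=0.1, beta=0.01, rng=rng)
drifting = simulate_windowed(200, k=1, alpha=0.1, beta=0.01, rng=rng)
```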
I'm now going to change the gamma parameters. Why is that?
The larger I make these, the more the redrawn thetas ignore the data
and look like fresh draws from the prior, and so the more
temporal variation I'm going to get.
So, I'm now going to try that. There you go.
So, there's IID again but now with a bigger alpha and a bigger beta,
and that's K equals 1 with this big alpha and beta.
I kind of like this example, because the value at this changepoint
is in no way a record value, nothing of that sort.
If I was doing changepoint analysis assuming exchangeability,
there's no way we'd fit a changepoint there, right?
It's close enough to the heart of what I've already seen that
there's just nothing to pick that point out, but there is
this discontinuity. So with a model that accepts this kind of
temporal feature, which this one does since the data were
simulated from it, you'd hope to find it, and I do.
So to finish, here's what these ideas look like when fitted to
those Port 445 data that characterized the one attack.
This is what I got just by running standard Bayesian changepoint
analysis on the count series, treating it as Poisson:
I end up fitting changepoints everywhere, which I'd say is just too many.
Whereas I'm claiming this is a more parsimonious representation
of what's happened: this initial burst here, then just before
the attack the bursts start happening again, and again here
where it really jumps up and down. My feeling is I really like
the way this green picture, the right-hand one, comes out,
but maybe that's because I'm biased, because I really like the work.
So, I'm calling it my robust changepoint analysis.
So, conclusions. What I'm trying to argue is that statistical methods
provide a framework for automating the detection of significant,
unusual cyber behaviour. That can be performed at different scales,
from the individual measurement level up to whole-graph analysis.
At each level of resolution, the models will typically be
under-specified, due to the complex nature of both human and
automated network traffic.
So, for example, when I was looking at Fisher's G-test for
the beaconing stuff, whether the Fourier-frequency version or
the stuff on a clock, it's only able to quantify, using
chi-squared type arguments, the significance of a beacon compared
to a homogeneous Poisson process; you get a p-value out against that.
Well, no one ever thinks there's a homogeneous Poisson process
going on in any of this, so that's a terrible null model,
an underspecified model. But nonetheless, it's a good enough
baseline null model that things which are significantly different
from it, and the more significantly different they are,
the more like beaconing they are.
So, even though I may not be using the most perfect statistical models,
the probability calculus is still giving me a coherent scale for
prioritizing the most significant-looking beaconing things,
even if the numbers themselves aren't quite right.
Okay, so that's going to be my argument, I think:
it gives me a coherent scale, but I'm not saying I necessarily
believe every single p-value.
For that reason, much of our future work is concerned with
identifying robust anomaly detection methods, through a combination
of the sort of robust models I've been championing at the end,
being a bit more open to saying, "Okay, I've probably misspecified this,"
and so wanting to be robust, and, more importantly, combining evidence,
doing data fusion: combining evidence from different tests,
from different data sets, and so on, so that I can synthesize
multiple weak signals into a strong signal. If robustness also
weakens my individual signals, then by combining lots of things
I can still get strength back.
So, I've just listed a few of the papers I mentioned.
Thanks for your time.
>> Yeah, I'm not sure whether I've understood correctly,
but the way you compensate for not taking the seasonality
into account is by re-learning from the previous observations,
is that the idea?
>> For the thing at the very end?
>> Yeah.
>> Okay. From the last K things,
where K is unknown,
average over [inaudible].
>> [inaudible].
>> So, K has some prior measure.
>> On the NetFlow stuff, this is going to be
a difficult question to answer, but as you go from
organization to organization, do you have any sense of
how much tuning and re-tweaking is needed?
Los Alamos is one thing, but if you went to a different
national lab that has a slightly different scale,
or different geographical locations,
how [inaudible] is most of what you're finding here?
>> So, in terms of the findings, [inaudible] Josh would be
in a way better position than me to answer a question like that.
I'd say the tools we're developing should be situation-independent,
should be useful in any other situation. How much beaconing
there would be and so on in a different organization would
vary, of course, but the tools we're looking at should be
general purpose. But as for how the findings vary across
organizations, that's why I want more collaborators;
I've been trying to do some work with Microsoft about
using [inaudible], and yeah, we need more collaborators.
It would be really good to see how these things vary between
organizations, because it's not easy to share these sorts of data.
Los Alamos are very unusual in publishing these sorts of data;
we don't have other open data. We're just academics,
so we have to work with-.
>> [inaudible]
>> I don't think so.
>> For your Fisher's G-test thing from way back,
how did you pick the sequence of k values to test?
>> So, it's just driven by how many time points I've got.
The fast Fourier transform naturally spits out the periodogram
for all k in that range, one up to floor of T over 2,
where T is the number of time points I've observed.
I'm not in any way an expert on the mechanics of the
fast Fourier transform, but I know there's an efficient algorithm
which throws out S(f) for all of those frequencies simultaneously,
much more easily than doing them one at a time.
So I didn't choose them; it's what the algorithm automatically
spits out, and it's great. It's really fast.
>> The test statistic, like the sum, yeah.
>> Yeah. So these are the Fourier frequencies, that's what they are.
If you ask your favourite programming language to do the FFT,
it will calculate this thing for exactly these frequencies;
that's just what it returns. So there's no choice of mine;
I get those for free without any effort, and it's super fast.
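To illustrate why the frequencies come for free, here is a small sketch (mine, with made-up data, not the speaker's code), assuming event counts binned into T time bins; numpy's real FFT returns ordinates at exactly the Fourier frequencies k/T for k = 1, ..., floor(T/2):

```python
import numpy as np

rng = np.random.default_rng(3)
T = 1000
counts = rng.poisson(2.0, size=T)        # e.g. events per time bin
counts[::7] += 5                          # an artificial beaconing signal with period 7 bins

fft = np.fft.rfft(counts - counts.mean())
periodogram = (np.abs(fft) ** 2)[1:]      # ordinates at Fourier frequencies k/T, k = 1..floor(T/2)
freqs = np.fft.rfftfreq(T)[1:]

# Fisher's g statistic: the largest periodogram ordinate relative to the total
g = periodogram.max() / periodogram.sum()
print(freqs[periodogram.argmax()], g)     # the peak should sit near 1/7
```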
>> When you're modelling the human traffic,
how are you filtering out the automated traffic beforehand?
Are you determining what frequencies the automated traffic
was occurring at, then filtering it out,
and then doing that analysis?
>> Right, yes. For this stuff we're looking at edges which
we've already determined, from those other analyses, to be human ones;
or the idea is that we would do a filtering step first,
piecing together the different pieces of work so they link up.
So if we believed that sort of classification and could classify
every single event as being human or automated, then I could take
the filtered human data and fit these models on that; that's the idea.
But for the examples we use in this paper, I think we're using
edges where, because we're doing this at Imperial on data from
my own computer and we kind of know what the things are,
we can treat them as human ones.
>> So, there should be a way of you knowing beforehand
what polling things were going on or what devices you [inaudible]
and you'd be able to identify [inaudible]?
>> Yes. Yeah. Yeah.
>> You mentioned that the Poisson process is a pretty poor model
for modelling human behaviour on a network. So when you're looking
at the adjacency matrix, or the entire network flow,
is there a reason why you just use a Poisson process?
Because I'm assuming the traffic flow is a combination of
human and automated. Could you add some other process
into the whole process there?
>> Sorry, going back to this: this part here is like
a homogeneous Poisson process, so this is like [inaudible] of
a homogeneous Poisson process with bursts.
That's exactly what's going on; there's a constant intensity.
So yeah, it's like saying, "I've got a homogeneous Poisson process,
except I'm going to throw down some more points on top
from this burst of self-excitation."
>> What does that self-excitation represent?
>> The natural burstiness that we observe in the data,
so it's somewhat artificial in that sense.
It's not that each event is literally triggered,
though I suppose it could be, in terms of one event
firing off another, and so on. So there's maybe a question
of what's physically happening. But it's more just a way of
trying to mathematically represent these bursts that we get,
by saying, "If I see an event, then I'm more likely to see
an event in the next instant than I was before I had
the initial one that got the sequence started."
>> When you model the network flow, looking at the entire network
and encoding that adjacency matrix, each entry is just
the rate for the Poisson process, right?
>> For the Poisson factorization stuff, yes, I see, good point.
Yes, it will have some fairly inflated counts in it,
so that's a very fair point. I can't remember;
I wasn't the person who actually did that analysis,
so I don't know what treatment was done in terms of
truncating the data down and factoring out the seasonality
of the automated stuff. There may have been some
pre-processing going on. That's a very good point;
we should do that.
Joining up these bits is the sort of thing that often drops
by the wayside; we work on these individual projects with
the knowledge that they should sort of fit together,
but whether that actually gets done, I don't know.
>> Are there
questions? Let's thank
our speaker one more time.
