Okay, so it's a great pleasure
to introduce Professor Karthik
Pattabiraman, from the
University of British Columbia.
Karthik got his PhD in 2009
from the University of
Illinois in Champaign Urbana.
And he spent a year at
Microsoft Research,
working with me as
a post-doc in 2010.
Since then, he's joined the faculty at the University of British Columbia, and has worked in the areas of resilience, security, and JavaScript web programming.
And today he's gonna talk
about two interesting topics,
one is IoT and the issues around building secure and reliable IoT systems.
The other is looking at systems
built on top of neural networks.
And how understanding
the resilience of the systems
is very important,
especially in the growing areas
in which they're being applied.
So with that,
thank you very much, Karthik.
>> Thank you, Ben,
thank you everyone,
both who are visiting online and
who are here.
So it's great to be here.
As Ben mentioned, I was
a post-doc here about five,
six years ago, and
have very good memories.
So I'm gonna be talking today
about a couple of recent projects we've been doing in my group.
My background is mostly in
fault tolerance and security.
And this is applying classical
fault tolerance security
techniques to two emerging
areas, self-driving cars and
smart devices.
So this is joint work
with my PhD students
as well as collaborators from NVIDIA Research.
So my research broadly is on
building error resilience and
secure software systems.
I work on three broad areas: software resilience techniques, web applications and JavaScript programming, and recently, cyber-physical system security.
I'll talk more about these in this talk.
But the focus here will
be cyber-physical system
security and resilience.
So what are cyber-physical
systems?
So these are systems we interact
with on an everyday basis.
They're more and more a part of our lives, in what is called the IoT.
So if you look at a modern car,
for example, it's mostly
a computer that controls many
of the essential functions.
So for example,
when you press the brake pedal
at a certain speed, it's
an indication to the computer
that you wanna slow down.
There's no physical hydraulics
anymore, at least in most cars.
Smart meters, many of us
have these in our homes.
These are also built on microcontrollers.
That's a Nest thermostat, which
was recently acquired by Google.
But there are many
other devices as well.
And of course embedded
medical devices in the body.
So these are examples of
cyber-physical systems.
These have to interact with the environment, and as a result their safety and security are extremely important.
So that's the good news.
So unfortunately the bad guys
are never far away, so to speak.
So there are a lot
of security and
reliability issues
that come to the fore.
So perhaps the most well known is the hack on the Jeep car. These guys, covered by Wired magazine, were able to hack a car that was going on the highway.
There has been work from [INAUDIBLE] on smart cars, pacemakers, and smart medical devices.
There have also been tons of attacks on smart meters. I can't list them all, but people have shown that you can pretty much make your smart meter read whatever you want, including negative values of power consumption, if you like.
The Nest thermostat, I don't know of any attacks, but there was a famous glitch, I think in January 2016, where it went down after a software upgrade, and this was when the temperature was 30 below in some places.
So it's not just security
of these devices,
even actual failures
are very common.
So with that said, how different
are these systems from regular
desktop or laptop computers?
So there are a bunch of
challenges that are unique to
cyber-physical systems.
So first, many of these systems operate under real-time constraints; take, for example, a pacemaker. That's an extreme example: you have to deliver a pulse every 1 or 1.5 seconds.
But many other systems also have similar real-time constraints, like your brakes. When you hit the brake pedal in your car, you'd better convey that information to the computer within a certain amount of time, or else you might hit the car in front of you.
So there are these constraints.
There are also resource constraints, because a big fraction of these devices are extremely resource constrained.
And they're very cost sensitive.
So to give you one example,
a typical smart meter costs
a few hundred dollars.
Even adding some extra memory
on that would actually
increase its cost by 10 to 20%,
which many of these
manufacturers don't want to do.
Likewise, in a modern car, I've been told the computers are at least 20% of the cost, especially as you move to self-driving cars and so on.
So these are cost
constrained as well.
They're hard to upgrade,
because they need to operate
continuously often in
remote environments.
So you can't restart
them on an ongoing basis.
And often there's no human in the loop, right? So you can't ask for confirmation of an action; when the system is dealing with an important event, like a life-critical event, you can't have a human say this is actually what they wanna do.
So that's the other
challenge that comes up.
So with that broad motivation,
I'm gonna be focusing on two
specific systems in this talk.
So one is,
we're going to look at some work
on resilience of deep neural
networks accelerators employed
in self-driving cars.
And here we're focusing on soft errors, hardware faults which cause single-event upsets or bit flips in these systems.
And in the second part of the talk, I'll talk about intrusion detection systems for smart embedded systems, for which we use dynamic invariant detection, a technique that has been used in software engineering. But here we adapt it to cyber-physical systems.
And then we talk about some of
the ongoing work in my group
around this theme as
well as some conclusions.
So I don't need to stand
here and tell you that
self driving cars are big,
depending on who you are,
they are either five years
in the future or 15.
But everyone acknowledges at
some point they'll be here.
Big companies are investing
a lot of money into this.
But more broadly, one of the main technologies that make self-driving cars possible is deep neural networks, for the ability to recognize other cars, pedestrians, etc., and take action accordingly.
So this is definitely a safety
critical application,
nobody wants to build
a self-driving car which
is only partially correct.
So it's very important
that we get this right.
Another important constraint
here is the real time
constraint.
So you want, for example, your car to look at an image and then classify it appropriately within a fraction of a second, or within some time bound, so that it can take action.
So, what many of the companies in this space have been doing is investigating specialized accelerators, hardware accelerators that they can deploy in these self-driving cars to run deep neural networks.
Now there have been other instances of accelerators for deep neural networks as well, like in the data center space. I'm not gonna talk about that now, but much of what I'm gonna present today, many of the results, also applies in that space.
Now, why do we care so
much about self-driving
cars in particular?
So, I'll show you examples where
even a single bit flip can actually cause a misclassification.
But more importantly,
in this particular space,
there's very strict regulation.
So, you have to make sure, or rather prove, that your chip has what they call a FIT rate, failures in a billion hours of operation, of less than ten overall. And this is for the entire chip, not just the DNN part. This is a standard; ISO 26262 enforces it.
And if you're a company that manufactures chips for these cars, you have to ensure that you meet that standard. So it's a very real problem. And this is regardless of the source of error; you've gotta show the FIT rate to be within that bound.
So one of the big challenges
in modern electronics is
soft errors.
So there you have alpha particles that come and upset your transistors, leading to charge deposition or depletion. I won't go into the device physics of this. But what happens is you have some electronic component, for example a multiplexer, and what you might end up getting is a wrong output vector, with one of the bits flipped. Similar things apply to flip-flops and latches as well. From a high-level programming perspective, these manifest as value corruptions.
And there've been
numerous instances of
soft errors actually affecting
business critical applications.
The famous one being that Sun had to recall their servers back in 2001, and it made the front page of the New York Times.
And there were other similar
events that actually caused
bad publicity as well as losses,
billions of dollars.
So these are things that chip
designers have to worry about.
Now, I show you just
one graph here.
So soft errors have been
increasing in computer systems
at least for
over the last decade or so.
This is a graph from Shekhar Borkar. The x-axis is the feature size, the number of nanometers, going down to the 16 or even 8 nanometers in use today. And the y-axis is in relative units.
These companies never tell
you the actual error rates,
they show you relative
error rates, and
you can see that's
going up exponentially.
So this is a real problem, and
it's going to get worse as we
integrate more and
more electronics.
Now, you might stand there and
say,
how would we solve this problem?
Yes, Ben, of course.
>> [INAUDIBLE] you go back-
>> Sure.
>> Okay, so, anyway. Even if it's independent, there'd be another step.
>> That's true.
>> How are they protecting the car chips now, right? Because it looks like they have some strategy for this.
>> Right, so
that's a great question.
In fact,
that's what we get to on the next slide.
>> Okay. [LAUGH] >> So today, what happens is they use guard bands. So they say, okay, I know that at this voltage, the probability of it actually flipping the bit is very high. So I'm going to [INAUDIBLE] double the voltage, for example, right?
Or they do things like, I'm going to replicate certain parts of the pipeline, DMR, and I don't necessarily think those are [INAUDIBLE], but they might work for more traditional CPUs, where we have a larger power budget and performance budget. In the DNN case, this incurs high overhead, and the same goes for ECC or parity on all storage elements. These systems have real-time constraints, so you can't go replicate your critical path, for example.
Another thing people have tried is generic micro-architectural solutions; there has been a lot of work on these, but they are DNN-algorithm agnostic, and they end up being non-optimal for these systems. To give you one example, a technique we've developed in other contexts is to selectively duplicate the computation.
So, figure out which parts of the program really influence the output, and then duplicate those. This doesn't really work on a DNN, because the actual program is five instructions. So if you duplicate it, okay, well, you pay 100% overhead or whatever. These techniques were devised for traditional programs that run on regular CPUs and so on.
We do have other work in my group looking at how to protect programs running on CPUs as well as GPUs from faults. I'm not going to talk about that today. However, there are similar challenges in that space as well.
So a very brief overview
of a deep neural network.
I'm not a machine
learning expert,
but this is sort of my
understanding of this.
So you have multiple layers in your neural network, and each layer has specialized functions. So you have an image, you feed it into the first layer, and the network goes and classifies it into one of many categories.
So there are different
feature maps,
there's a convolution layer,
a subsampling layer, and so on.
And different neural network
architectures have different
combinations of these layers.
So when I get into the details
of the experiments we did,
I'll talk about which sets of layers we considered.
And the other side of this is, so this is at the abstract level, but the accelerators actually have an architecture that lets you implement those layers directly in hardware without going through software. So the software simply says, do a convolution, but that operation is done directly in the hardware. The way they do it is they have this array of processing elements.
So this is an example of
an architecture from MIT
called Eyeriss.
So they have a global buffer, which goes to the DRAM; that's where it fetches the data. And then the data gets streamed into this [INAUDIBLE] array. So these all do some of the [INAUDIBLE] computation.
Again, yep, you have a question.
>> [INAUDIBLE] like the TPU that Google-
>> Yes, it's very [INAUDIBLE].
Thank you, yes.
So many of these have very
similar architecture.
In fact, at [INAUDIBLE] there are at least a dozen variants of these [INAUDIBLE].
We're looking at particular one,
which is Eyeriss.
But results are not
confined to this.
So we have tried to canonicalize the architecture here; we try to make it as generic as possible. And [INAUDIBLE], yes, sure.
>> Are these already products, or are they pretty much still in the research phase?
>> Okay, there are some products out there. I am not allowed to say anything about the specifics of those architectures. But Google already has the TPU, and there are other such architectures as well. What we are using is an open architecture called Eyeriss, which was published, so anybody can go and play with it.
So the processing element, the PE, is basically encoding these operations directly.
>> So
is that essentially a chip?
>> Yes, so [INAUDIBLE] his group and [INAUDIBLE] released a chip that is faster, etc. There are a couple of others that do this. There's a group from China, they also have a similar chip for doing DNNs.
It's as good as any of the other architectures, and one of the things it has going for it is that it can change the configuration dynamically, on the fly. So they claim the goal is to at least have more adaptivity; with some of the others, once you fix the architecture, you're stuck with that.
So, again, I am not a chip designer, so I can't speak on the merits and demerits of Eyeriss versus other architectures, but I am told it is fairly typical of many of these DNN accelerators.
So the goal here, what we are
trying to do is understand error
propagation in DNN accelerators
through fault injection.
So I'll tell you what exactly we mean by fault injection. Again, we want to both quantify the probability of error propagation as well as characterize it. And of course, we want to mitigate the failures, right? So the goal is to gain insights so that we can come up with efficient ways to detect errors, and I'll lay out some hardware as well as software techniques we developed to go and detect these faults.
So what is fault injection? As an [INAUDIBLE], we go and perturb different values in the actual implementation of these neural networks on the chip, according to a fault model, and then we study the effects. We consider four neural networks here: ConvNet, AlexNet, CaffeNet, and NiN. The data sets are given as well; these are fairly standard in this community.
So there's a whole bunch of input data sets that they cover, and the number there tells you how many classes they classify the result into.
There are different ways
of implementing these
neural networks.
You can use fixed point instead of floating point, so we looked at all these data types. We looked at 16-bit as well as 32-bit fixed point. To give an example, the 16-bit fixed point has one sign bit, 5 integer bits, and 10 fractional bits, so probably [INAUDIBLE], and likewise 32-bit. I will be more specific about what exactly we consider for the [INAUDIBLE], and likewise for floating point. These have been proposed by researchers, in terms of saying that for certain neural networks, certain data types are better. But we look at the whole gamut of them.
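To make that layout concrete, here is a minimal C sketch of the 16-bit fixed-point format just described: 1 sign bit, 5 integer bits, 10 fractional bits. The names and the two's-complement encoding are our assumptions for illustration, not the accelerator's actual code.

```c
#include <stdint.h>
#include <stdio.h>

/* Illustrative 16-bit fixed point: 1 sign bit, 5 integer bits,
 * 10 fractional bits, stored in two's complement. */
#define FRAC_BITS 10

static int16_t to_fixed16(float x)     { return (int16_t)(x * (1 << FRAC_BITS)); }
static float   from_fixed16(int16_t q) { return (float)q / (1 << FRAC_BITS); }

int main(void) {
    int16_t q = to_fixed16(3.14159f);
    /* Flipping a high-order (integer) bit changes the value by at least
     * 1.0, while a fractional-bit flip changes it by at most ~0.5 --
     * a preview of why wide integer/exponent fields are risky. */
    int16_t corrupted = q ^ (int16_t)(1 << 13);  /* flip an integer bit */
    printf("original %.4f, corrupted %.4f\n",
           from_fixed16(q), from_fixed16(corrupted));
    return 0;
}
```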
So how do we do
the fault injection?
So we actually inject 3000
random faults in each latch,
in each layer of
the neural network.
So we mostly consider faults
in sequential elements,
because this is the dominant
fault model today.
We use a simulator called Tiny-CNN, which was written in C, and what we do is map the software and hardware components onto the C code. So an example would be, okay, this is very simplified code: for each layer, for each [INAUDIBLE], for each latch doing a computation there, you would go and inject a fault, using a random function.
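As a concrete illustration of that loop, here is a minimal C sketch of single-bit-flip injection; the function names are hypothetical, and the real injector lives inside the Tiny-CNN-based simulator.

```c
#include <stdint.h>
#include <stdlib.h>

/* Flip one randomly chosen bit of a latch value (hypothetical sketch). */
static uint32_t flip_random_bit(uint32_t value, int width) {
    int bit = rand() % width;       /* pick a random bit position */
    return value ^ (1u << bit);     /* flip exactly that bit */
}

/* Inject one transient fault into a randomly chosen latch of a layer's
 * state; called once per fault-injection run. */
void inject_one_fault(uint32_t *latches, int num_latches, int width) {
    int target = rand() % num_latches;  /* random latch in this layer */
    latches[target] = flip_random_bit(latches[target], width);
}
```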
We also verified that this corresponds to the results of beam testing experiments: the places we injected faults are the places where real faults were observed, and we [INAUDIBLE].
So we've built this fault injector. We consider a single bit-flip fault, because that's the dominant fault model for transient faults. We consider faults that occur in latches in execution units as well as storage buffers like SRAM scratchpads. So in a nutshell, we inject faults in pretty much any sequential element in the chip, and we do this multiple times, one fault per run, to get statistical estimates.
So before I tell you about the results, maybe I'll sort of pull up a slide of what the failures look like. This is an illustrative one. So the neural network was supposed to classify this truck, and of course it's a truck, but it ended up classifying it as a bird. Now, depending on your [INAUDIBLE], will it actually act on this in real time? I don't know what would really happen, but my guess is if it sees a bird, it's not going to brake for it. So that could be a problem. So there are many cases where a single bit flip can actually cause the whole DNN output to be misclassified. Yes?
>> So, you will not only inject into the first layer; there is a possibility that you inject the flip at the very last moment, when it tries to output the-
>> That's true, it is possible. So we try to be [INAUDIBLE] here, right? So, let's say 99% of your time is being spent in the first few layers; then there's only a 1% probability we inject into the last layer, right? We want to emulate what happens in reality, as the layers are getting executed on the accelerator.
So any of these could actually
be subject to faults.
I will show you later the
distribution of faults by layer
and you will see that some of
the results are surprising.
>> This single example may not show what the probability of this-
>> That's right, so this is just one example. I will show you data which looks at the distribution by layer.
>> Another question: can you go back to the previous slide? I guess I missed something. I missed the connection between the C program and the chip.
>> The C program [INAUDIBLE] a simulator. So all I'm trying to say here is we do the injection in the simulator. [INAUDIBLE] Think of it like SimpleScalar for CPUs; this is a DNN accelerator simulator. And we identify which parts of that simulator actually map to the hardware. We know the spots that are susceptible in hardware, and we inject into the simulator. It's just a way for us to do the [INAUDIBLE] injection without having to do FPGAs and the like.
>> So this C program is not
very specific to this
kind of [INAUDIBLE].
>> No, it's not, so we have to do the mapping.
>> It's more a high-level thing.
>> That's correct, but we did the mapping, and we also validated it with [INAUDIBLE] beam testing data. So to a first order of approximation, we know these are the places, right? But of course, with all the abstraction, we do lose some information.
>> Yeah.
>> Okay, so in terms of timing: in the classification, just like in any real-time system, you've gotta reclassify continually, presumably. So how does that affect the ability to actually decide that's a truck? Because one frame out of a hundred or whatever saying it's a bird isn't so bad, right?
>> Okay, so that's a good point. It depends on the rate at which you're reclassifying. Typically, I've heard numbers from 0.3 to 0.5 seconds for each reclassification; it depends. Now, this is apart from this particular platform, but think of it this way. If you're going at 70 miles per hour, you have half a second to slam on your brake so that, by the laws of physics, you come to a stop, depending on the distance of the object.
So I am not claiming that all of these will result in collisions, or even most of them. But again, let's go back to the standards. The standards say if you have failures like this, there should be no more than ten [INAUDIBLE], and that's for the whole chip. So maybe you're lucky and this didn't actually cause a collision; you still fail to meet the standard if you can't actually guarantee that. So that is the goal we're after here. Do you have a question?
>> No.
>> Right, so we have a bunch of research questions here. First, what are the silent data corruptions? I should mention this: silent data corruption, SDC, is a standard term; we use it for outputs that actually deviate from the correct output under a fault. I'll tell you more about how we actually determine whether something is a silent data corruption or not. So we looked at what the SDC rates are in different DNNs for different data types, and then we look at the sensitivity on a bit-level basis, or which bits are sensitive. And then we looked at the actual values, the results, how far they are from the correct values, and then the question of how faults propagate layer by layer; we look at that.
>> Yes, so, sir, if I may ask, you talk about the different precisions of floating point; do different precisions of floating point result in different classifications?
>> Hold that thought,
I'm going to talk about that.
>> Yeah.
[LAUGH].
>> Okay.
>> I'm coming to that.
Actually, in fact, maybe just to give you the short answer: yes, by a significant amount, okay, and you'll see how.
So before going further, I should define the kinds of SDCs. Keep in mind that a DNN is not like a typical program that outputs one answer, right, where you can say the program is correct if it outputs that answer and incorrect otherwise; its output is a ranked set of predictions. So the first kind of SDC, what we call SDC-1, is a mismatch between the winner, the top answer, of the correct network and the faulty one. Then we look at SDC-5, where the correct winner is not in the top five predictions. So you might argue that if even one of the five predictions says it's a truck, it should stop, right? So we look at how far away that was. And then we do the same thing for a 10% or 20% drop in the confidence of the winner. So the neural network also outputs a confidence, saying I'm 90% confident this is a truck and 10% confident this is a bird, right? If the confidence drops by more than 10% or 20%, that's also an issue, because essentially you don't know what it is. And many of these things have thresholds that say, for example, for me to apply a braking action, I need to be 90% sure there is an obstacle in front of me. So those are the thresholds we're looking at.
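To make these three criteria concrete, here is a hedged C sketch of how they could be checked, given the outputs of a golden (fault-free) run and a faulty run; the struct and names are ours, not the paper's code.

```c
#include <stdbool.h>

/* Hypothetical per-run DNN output summary. */
typedef struct {
    int   top1;        /* winning class of this run */
    int   top5[5];     /* five highest-ranked classes */
    float confidence;  /* confidence of the winner, in [0,1] */
} dnn_output;

/* SDC-1: the winner of the faulty run differs from the golden run. */
bool is_sdc1(const dnn_output *g, const dnn_output *f) {
    return g->top1 != f->top1;
}

/* SDC-5: the golden winner is not even in the faulty top five. */
bool is_sdc5(const dnn_output *g, const dnn_output *f) {
    for (int i = 0; i < 5; i++)
        if (f->top5[i] == g->top1)
            return false;
    return true;
}

/* SDC-10%/20%: same winner, but its confidence dropped by more than
 * `drop` (0.10 or 0.20). */
bool is_sdc_conf(const dnn_output *g, const dnn_output *f, float drop) {
    return g->top1 == f->top1 && (g->confidence - f->confidence) > drop;
}
```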
So I'll explain the graph, but first I just wanna show you the overall trends here. The x-axis is the different kinds of precision: Double, Float, Float16; 16-bit fixed point, where you have 10 places for the fraction; and 32-bit fixed point, where you have 10 places after the binary point and 21 bits for the integer part, and so on. We also looked at the different kinds of SDCs, SDC-1 and SDC-5, and the 10% and 20% confidence drops; the bars are different shadings for these.
So one thing that stood out across all of this is that the 32-bit fixed-point type has a very significant SDC rate, as you see in the graph, without even looking at the details. This is because, some of you might have guessed it already, you have 21 bits for the integer part. So if one of those gets hit, you're gonna cause a large deviation, and you're gonna get a complete misclassification.
You can't tell from this graph, but I'll show you in the next one: the actual data type, even between Float and Float16, etc., actually makes quite a lot of difference in terms of the sensitivity, as does the kind of neural network. The takeaway here, at a high level, is that there can actually be a significant amount of state exposed when you are using these wide data types, and that can cause SDCs.
Another thing you might notice is that there's actually not that much difference between the different SDC types. Despite what I said on the previous slide, we looked at all these types of SDCs, but we found the differences don't matter that much in practice. So from now on, we're just going to go with SDC-1, because the results are similar. And the reason is that you have a thousand possible classes. Let's say you have a random fault, okay? Something just randomly gets corrupted. The chance of that random classification being in the top five out of a thousand is actually really small. So whether it's the top one or the top five, or the confidence dropping by less than 10%, it doesn't change much. Yes?
>> Why would we design
a thing that had 20
more bits of exponents?
That's a giant range
that you're expressing
relative to the number of bits
in the [INAUDIBLE], right?
>> That's an excellent question. So one thing is, many of these DNN accelerators are designed to be general, as general as possible, right? If you didn't accommodate that, you don't wanna be stuck with the choices you made. So one of the takeaways we're gonna get to is that it's better to restrict your data type as much as possible for the specific DNN you're trying to run, rather than make one accelerator to rule them all that is very general, because that gives you this large amount of bit space that can go wrong, right?
So in fact, for the second question, we looked at how sensitive the different bits are. At the top you see the results for the floating-point data types, and below for the fixed-point data types; again, the bits in question are the ones that encode the exponent, or sit on the integer side of the binary point. At a very high level, those bits are the ones that are very sensitive, as you would expect, because if you perturb an exponent or high-order bit, that can lead to a large deviation; we'll look at exactly what kinds of errors those are in a second. So again, it reinforces the point that you are better off designing your data types to be exact, or as close as possible to the dynamic ranges of the values. And-
>> You're saying that it means your data space should really be small, in the sense of not spanning a lot of orders of magnitude.
>> That's correct.
>> Is that easy to do for neural networks? Do they tend to have that property, or are there values that [CROSSTALK]
>> Okay. That's actually an excellent question, because that's the next question we-
>> [CROSSTALK] I ask [LAUGH]
>> No, but I'll give you the answer: not always. Okay, but it turns out in most of these cases, the values cluster around zero.
So there have been other studies, theoretical analyses [INAUDIBLE], that show that the values eventually converge toward zero, and the further a value deviates from zero, the higher the probability that you will get a wrong answer; that's exactly the intuition here.
So these graphs show the value distributions; the reds are the faulty values, the greens are the correct values. This is with float16; one set is the ones that result in SDCs, and the other the ones that result in benign outcomes. So right away, even without perhaps squinting too hard at the graphic, you can see that for the SDC ones, the faulty values actually go all over the place away from the correct value, right? The correct values cluster around 0. So the further you cause a deviation from 0, the harder it is for the neural network to converge, and as a result, you may end up with an SDC. Whereas the distribution is much more even in the benign case, right? You don't get this runaway effect over here.
>> Without errors,
does the neural network,
for its correct functioning,
require that, so
you're saying at convergence
they get toward [CROSSTALK].
At the beginning, do you need
these values that are so
many orders of magnitude
away from zero?
>> I see okay.
>> You say like I am not
going to have any exponents like-
>> Right.
So the answer is, it's network-specific. There has been work, for example in the architecture community, that looks at the power consumption for different ranges of these values. One of the findings is that to keep the power consumption low, you have got to learn the characteristics of the neural network and design data types exactly to fit them. So what this argues for is more specialization, rather than making a general accelerator platform. So if I know you're gonna run neural networks which have small dynamic ranges, then I should try to pick my data type to fit those ranges.
>> For networks that do have these narrow ranges.
>> Yes, that's correct. So as a designer, you have a choice to use those networks; of course, they may take longer to converge, or be slower, but that's good for both energy as well as resilience, so that's the tradeoff.
Okay, but this graph here is only the faulty distribution, not the correct value distribution. We'll actually use this data also to design what we call symptom-based detectors, which are in software, where you dynamically bound the range of the neuron values. So if a value is outside that range, chances are it's going to be an [INAUDIBLE], and you stop it from propagating. We'll look at that in a few seconds.
Now, looking at the distribution by layer, apologies for the quality of the graph there, but it's in the paper. So we have these networks, [INAUDIBLE] and CaffeNet, which are more or less in a similar range, and ConvNet.
One thing I should have mentioned earlier is, ConvNet doesn't have what we call normalization layers, whereas these other networks do, and you actually see the effect of that; you see it in AlexNet and CaffeNet. So when you have the fault in the first two layers, which are doing the normalization, the probability of it being a corruption is actually very low, because the normalization is bringing the value back toward the normal range.
And after that, the SDC probability keeps going up, and the reason it spikes toward the end is that the last layer is fully connected in these networks. So you have one fault that's going to get spread everywhere, all over the place, and maybe that's where that particular example is coming from: if you have a fault in the last layer, or in a later layer, it's going to get spread, and there is no normalization to bring it back. Whereas in the NiN case, actually, there is a normalization layer towards the end as well.
So that's another lesson you can learn from this, right? It might help to put normalization layers towards the end, so you bring the value back to an acceptable range rather than spread it all over the place. Now, one other thing is, as you go further into the layers, your probability of [INAUDIBLE] increases, so a fault early on may not matter as much; a fault later in the execution may be much more consequential.
You can see the same thing in the graph with the Euclidean distance of the values; we measure the distance between the correct value and the faulty value for the different networks. Again, you see in AlexNet and CaffeNet that the Euclidean distance actually drops quite sharply around layer 2 and then stays there, whereas for these other two, the value can be quite significantly different. So again, there's one thing to take away here: the further you go from the correct convergence value of 0, the higher the chance of SDC, and you want to put in layers that actually bring that value closer to it.
So the local response normalization layer in layers one and two normalizes the value back towards the normal range, and you want to put in such layers, if at all possible, to keep the values bounded.
So now you might say, okay, what is the use of all this? What if I don't really have the ability to choose the neural networks I want to run? Is there anything I can do as a chip designer to mitigate these faults?
So one thing that you can do is,
you can choose
narrow data types.
So for example, you can choose
32b_rb26, where you have 26 bits for your mantissa and just five bits for the exponent, rather than this other one, where you have 21 bits for the exponent, because that can vary widely.
But then anyway,
how much difference does that
actually make in practice?
So here we have to use some projections, because the Eyeriss architecture I talked about was fabricated at 128 nanometers, and we want to look at the effect at 28 nanometers, which is the current technology, or even lower. So we used some results from a paper in nuclear engineering where they did these neutron flux measurements, and scaled down.
So I had collaborators who deal
with this, not me personally.
But what we ended up doing was project the actual FIT rate for these different architectures.
And as it turns out,
if you use the wider data type,
your FIT rate can be about 60,
just for the DNN accelerator.
And that's a problem, because
remember I told you the FIT rate
has to be less than ten, right?
And that's for the whole chip, not just the accelerator.
Whereas, if you're judicious
about your choice of data type,
you can actually bring
it down quite a bit.
So here's an easy fix, right?
So you can choose the data type
that actually fits the neural
network you wanna implement, and
bring down the FIT rate by
almost an order of magnitude,
right, from 58.2 to
5 point something.
So this is one easy thing we can do to keep the FIT rate in check.
But what if this wasn't enough?
What if you wanted to bring
the FIT rate even lower?
So here we are proposing
symptom-based error detectors
that can be done in software.
So at the end of each layer,
you can check whether the value
deviates from zero by
a significant amount,
and detect that.
Or you can learn the correct ranges over a period of time, and then you can run a range check, so that if a value deviates too much from the correct range, you're going to stop the network from computing. And we find that if you do this, it's a fairly simple range check at the end of each layer.
And the recall of this can be
about 93% with 90% precision.
So you can actually bound
the ranges fairly tightly, and
they're fairly predictable for
the class of neural
networks we studied.
So this is something
you do in software.
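Here is a minimal sketch, in C, of what such a per-layer range check could look like, assuming the bounds have been learned from fault-free runs; the names and structure are illustrative, not the actual implementation.

```c
#include <stdbool.h>

/* Learned per-layer output bounds (from fault-free training runs). */
typedef struct { float lo, hi; } value_range;

/* Check one layer's outputs against its learned range; returns true if
 * any value falls outside the bound, i.e., a likely fault symptom that
 * should stop the errant value from propagating further. */
bool layer_range_check(const float *out, int n, value_range learned) {
    for (int i = 0; i < n; i++)
        if (out[i] < learned.lo || out[i] > learned.hi)
            return true;
    return false;
}
```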
If you wanna guarantee your FIT rates are bounded, you can also do things like selectively protect certain latches in hardware. Of course, this assumes that you are the designer of the accelerator platform, so you can decide which latches to harden.
So the baseline would be to duplicate everything, but that's a huge overhead. So here we ask, for a given target FIT rate reduction, what is the overhead? TMR is triple modular redundancy, where you replicate everything three times, and there are a bunch of other strategies here. The important thing is the curve has this inverse exponential shape.
Here, if you look at this fraction of protected latches, by protecting about 20% of the latches, you can bring down the FIT rate by 80%, right? And of course, to get the last 20%, you've got to pay the rest of the cost. So this is an important result, because it means that you can go after the low-hanging fruit and get pretty big wins. Yes, you don't get 100% protection, but you get 80% of the FIT rate reduction by protecting about 20% of the latches.
So you don't have to duplicate
everything, for example,
because that has its
own attendant cost.
So this is based on a theoretical model that my collaborators derived. We've tested this on regular CPUs, but this is not something we've tested on the DNN accelerator platform, because we'd have to go and refabricate the whole chip to incorporate it. But the takeaway here is, with 20% overhead, you can get almost a 100x reduction in the FIT rate.
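A minimal sketch of the selection idea behind this curve, assuming per-latch FIT estimates are available; the names are ours, for illustration, not the actual model.

```c
#include <stdlib.h>

/* Per-latch estimated contribution to the overall FIT rate. */
typedef struct { int id; double fit; } latch_info;

static int by_fit_desc(const void *a, const void *b) {
    double d = ((const latch_info *)b)->fit - ((const latch_info *)a)->fit;
    return (d > 0) - (d < 0);
}

/* Returns how many latches must be hardened to remove `target` (e.g.,
 * 0.80) of the total FIT rate; with skewed distributions like the one
 * described above, this can be only ~20% of the latches. */
int latches_to_harden(latch_info *latches, int n, double target) {
    double total = 0.0, covered = 0.0;
    for (int i = 0; i < n; i++) total += latches[i].fit;
    qsort(latches, n, sizeof *latches, by_fit_desc);
    int k = 0;
    while (k < n && covered < target * total)
        covered += latches[k++].fit;  /* protect most sensitive first */
    return k;
}
```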
So what we did in this work, I think, was characterize error propagation in DNN accelerators; we looked at different parameters: data types, layers, value types, and DNN topologies. And we proposed three mitigation methods; I'm not saying these are a panacea, but they do bring down the FIT rate considerably.
So things like restraining the
value range of the data type,
if you can do that, depending
on your neural network design.
Adding a value range checker, and we showed that you can do this with a pretty low overhead in software.
And selective latch hardening,
if you actually are able
to modify the hardware.
So the takeaway here is, if you can adapt your fault tolerance technique to the characteristics of the application or the neural network, you can get much better protection than the generic, one-size-fits-all approaches. You've got a question?
>> So is your [INAUDIBLE],
is it dynamically determining
what the right value range is?
>> That's correct.
>> Is that in hardware?
>> It's done in software,
because at the end
of each layer,
you can look at output values.
>> Uh-huh.
>> So
you're reusing the same
hardware for each layer, right?
>> Yeah, okay.
>> So you read back the values in software, and then you can run the check in software.
>> Okay, okay.
>> And in fact, you can use the CPU to do the check, because the CPU's sitting idle while the DNN accelerator is running.
>> Okay.
>> Yes.
>> When you say neuron values, you mean those values after each layer.
>> After each layer, yeah.
>> And
not the temporary variables.
>> No, not the temporary variables; the output of each layer. Well, they are all temporary in the sense that the neural network is a black box, right? But these are the values that are produced as the feature maps.
>> [INAUDIBLE] Two layers.
They are variables.
>> Right, so those are the ones,
the output of each layer-
>> Do you want to put a range for-
>> That's what we do actually.
We learn the ranges of
the outputs of each layer.
But in between the outputs,
there are the temporary
variables, which you
don't care about.
>> Okay.
>> So by output of each layer, I mean the actual values communicated between the layers, not just the ones at the end of the network.
>> Right.
Thank you.
>> Any other question about this
part?
Okay, so now I'm going to tell
you about something that is
actually quite different,
but has the same theme,
which is that we want to build
intrusion detection systems for
the cyber-physical systems.
And we're going to use
dynamic invariants here.
So the motivation here is we wanna provide low-cost security for cyber-physical systems.
So we wanna satisfy resource as well as real-time constraints, need no human intervention, and we want it to be able to detect zero-day attacks as far as possible. So we don't want to have to upgrade our signatures in the field for these devices and so on.
So we want to,
as much as possible,
make these systems autonomous and capable of running intrusion detection software on their own, on the host, without relying on a network connection, for example.
So here, we're going to
leverage the properties of
the cyber-physical system.
So most of the cyber-physical systems we consider have timing predictability.
What does this mean?
They have to send
a message every x seconds.
Or they have to send a pulse.
They have to do this loop
within a minute, for example.
So there are timing constraints
that are part of this
specification.
And we're gonna exploit this and
use time as a first class
constraint to detect intrusions.
And then the idea is actually straightforward: we learn the invariants based on normal executions, and we monitor these invariants at runtime to see if they're violated.
So let me tell you a little bit about the threat model we consider. This is a model of the cyber-physical system as a physical process and a cyber process, which could be a control algorithm, connected by a communication network. The physical process is sending measurements to the cyber process, and the cyber process is sending actuation commands. So we are looking at attacks, and attacks can occur at any part here: in the cyber process, in the physical process, or in the commands sent. We are not looking at attacks on the physical process, because if you attack the physical process, we learn the wrong values. What we are doing is trying to learn invariants from the cyber process and use these for intrusion detection.
A very simple example would be, let's say this was a smart meter, or let's say it's a car, and you know that your current speed value can't be greater than the previous speed value by more than some threshold.
So this is an invariant
you could learn, so
that if there is an attack
that tries to compromise this,
you would be able to detect it.
So of course, we are not the first to come up with intrusion detection systems. There's been a ton of work here. The classical ones are signature-based IDSs, where you have to give it a bunch of attack signatures. These don't quite work for these systems, because you need to update the signatures on the fly, and you want the ability to provide zero-day attack detection.
And then there are also anomaly-based IDSs, where you learn statistical models of the behavior. But those have false positives, and there's other work that's shown that these are less than suitable for these systems, because you don't have a human in the loop that you can alert when there's a false positive.
Having said that, what I'm going
to talk about is also related to
that, so
we'll come back to that.
Then there
are specification-based IDSs,
where you learn a specification
of what the software is doing,
and then you try
to detect attacks.
So you can do that through
static analysis, but
unfortunately, in these
cyber-physical systems,
there's a lot of interaction
with the environment.
A typical static analysis
doesn't capture this,
unless you go out of your way
to model the environment.
So what we are doing is, we are going to use dynamic analysis. We're going to learn the invariants by monitoring the system over a period of time, and then use the invariants to go and detect attacks; so this is sort of the model.
So what's an invariant? Many of you are probably familiar, but just to give some examples: you could say energy usage should be greater than or equal to 0; or the current value minus the past value should be less than or equal to a threshold; or some set of values should fall into a certain sequence, and so on.
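As a minimal sketch, those first two example invariants can be written as simple runtime checks; the variables and threshold here are hypothetical, not from any particular system.

```c
#include <stdbool.h>

/* Invariant: energy usage should never be negative. */
bool energy_nonnegative(double energy_usage) {
    return energy_usage >= 0.0;
}

/* Invariant: the value may not jump by more than a threshold
 * between consecutive readings. */
bool change_bounded(double current, double past, double threshold) {
    return (current - past) <= threshold;
}
```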
So there's been a lot of work on dynamic invariant detection in the software engineering space, starting from Daikon, and there's been work on Texada for distributed systems, Perfume for mining properties of the executing system, and so on.
And the common thing with all of these systems is, they look at only one or two of the dimensions of data, time, or events. So, to look at events: cyber-physical systems are gonna send and receive messages, and those are events, for example. Time is self-explanatory, and data is just the data values we are observing, right?
If we look at Daikon, for example, which is one of the classic dynamic invariant detection systems, it only looks at the data, how the data actually changes, right? There are other ones that look at just the event patterns, or at events and time.
And our insight here is that you
actually need to look at data,
time, as well as events,
the interplay between
all these three.
Because that's really where
some of the interesting attacks
are gonna try to break
your invariants.
And I'll give you an example later, but in a nutshell, what happens is, in cyber-physical systems there have to be predictable data value changes at certain times, and we want to be able to learn those invariants. And if you try to change the value prematurely, or you don't let the value change, or you suppress an event that's supposed to occur at a certain time, then we will detect that.
Now this will not work for
a regular general
desktop computer,
because you don't have that
level of timing predictability
in a typical application.
But applications on these devices are simple, and they follow these kinds of repeated patterns.
So that's the secret sauce,
if you will,
that lets us actually go and
detect these invariants.
So if I look at Data, Event, and Time, one way to learn these invariants would be to simply look at all three variables and navigate that space. The problem, however, is that this leads to state-space explosion, because almost every pattern you can think of can be learned within these three dimensions.
So the insight we are gonna leverage here is, I can actually look at Data and Time conditioned upon the event space. Cyber-physical systems are very event-driven, so there's usually an explicit command. If you go back to the slide I showed you, where we're sending a command to a physical process and getting measurements, each of these is an event. So you're reacting to some external event, and the idea is, I can break up the space by events, and learn the data values that have to change, as well as the time properties of that event; I can use this to come up with the invariants.
So the system we built is called ARTINALI. It's gonna condition data changes on events and time; we show in the paper that this is equivalent to learning Data, Event, and Time together. It's just basic probability. And what we end up doing is deriving Data per Event invariants, Time per Event invariants, and Data per Time invariants. So all these three together we can leverage to say, okay, this data has to change in response to this event, at a certain time.
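A minimal sketch of what these three invariant classes might look like as data structures, with one example runtime check; the names are ours, for illustration, not ARTINALI's actual API.

```c
#include <stdbool.h>

typedef struct {          /* Data per Event: when event e fires,       */
    int    event_id;      /* variable v must lie in [lo, hi].          */
    int    var_id;
    double lo, hi;
} d_per_e;

typedef struct {          /* Time per Event: event e must recur with a */
    int    event_id;      /* gap inside [min_gap, max_gap] seconds.    */
    double min_gap, max_gap;
} t_per_e;

typedef struct {          /* Data per Time: variable v may change by   */
    int    var_id;        /* at most max_delta within `window` seconds.*/
    double window, max_delta;
} d_per_t;

/* At runtime, each observed (event, data, timestamp) tuple is checked
 * against all three classes; any violation raises an attack flag. */
bool violates_t_per_e(const t_per_e *inv, double gap_since_last) {
    return gap_since_last < inv->min_gap || gap_since_last > inv->max_gap;
}
```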
So that's the idea. We built the ARTINALI system, where you give it the program and the tests, though you still have to log the events yourself. This is common to many of these other event-tracking systems as well.
But ideally,
you could have built a front
end to do that automatically.
But once you do the logging, everything else is automatic. We derive the data per event, the time per event, and then the data per time, and feed these directly to the intrusion detection system.
So it will use these invariants,
and look for
violations of them
to detect an attack.
So we evaluated this on two platforms. One thing I should mention here is, there aren't that many open-source cyber-physical systems. We tried to pick open-source systems so that we could find security loopholes and evaluate them with our system. This is not a requirement, but it just makes our life easier in terms of evaluation.
There aren't that many open-source cyber-physical systems we could use, but we found two. One was an advanced metering infrastructure called SEGmeter; this is a real smart meter, and we can actually look at its code. It's running on a Linux-based environment, and we can run our system on it.
And then there's also a smart artificial pancreas; it's an open-source project for people with type 1 diabetes, who have to have a control loop to monitor insulin levels and so on. So these were the two systems that we used for our evaluation.
And what we did is, we took each cyber-physical system and generated a whole bunch of traces, just normal operation of the system with different events and so on. Then we had a learning phase, where we took some traces and learned the invariants. And then we also had the intrusion detection phase, where, as the traces were being generated, they were fed to our intrusion detector, and if they violated the invariants, we would raise an attack flag.
To keep the comparison fair, we compared with three other systems: Daikon, Texada, and Perfume. These systems do not have all three dimensions; they look at data over event, event over time, and so on. They don't look at data, event, and time together the way ARTINALI does. So we took the invariants from each of these systems in turn, and then checked how good they were at detecting the intrusions.
So we wanted to make an apples
to apples comparison, so
any trace that we got, we'd run
it against all the invariants.
And we checked which of them would actually detect the security attack. Yeah?
>> [INAUDIBLE] what was the load you put on the device to get the traces in the first place?
>> So
you mean what we actually do or-
>> Yeah, what was the input that
you gave it to generate the
normal operation [INAUDIBLE]?
>> So for the SEGmeter, we just let the meter run for a day, I believe, and it was generating these traces; well, I mean, we randomly sampled. That's a different part of the [INAUDIBLE] of the source.
>> Okay.
>> Sensing, communicating with the server, and so on. With the OpenAPS, we didn't have any [INAUDIBLE] problems with its operations. So we used some data that was released by the National Diabetes Association to simulate different conditions of a typical patient.
So this is the input that we used to exercise the system under normal operation.
>> Yeah, I understand that.
It's a challenge, if you're gonna do intrusion detection, to have a sense of the variety in the wild of the kind of-
>> Behavior.
>> Yeah, behavior these things are gonna see, right?
>> Right, so one thing is we don't need a set of attacks, right, because we're characterizing the good behavior of the system. But you are right, that is going to be a challenge, yes.
>> So you must have some time lag; you must need to see some number of the traces before you start. [INAUDIBLE] Cuz I mean the very first-
>> Yeah, not detected.
>> Goes through, right? And then the next one is fine and you're like, whoa, that's an intrusion, right?
>> For the purpose of the experiment, we haven't actually gone and measured how long that time needs to be to get a theoretical bound. For the purpose of the experiment, I believe it was 50 traces.
>> So how long, in the case of the-
>> Fifty traces, and then on the next trace.
>> On the next trace, and I'll show you how, but you bring up a good point, right? You can make this system have very high recall by just flagging everything, but we try to be balanced. I will show you how we do that.
Yes?
>> Is the goal here to come up with a universal kind of detection, or per meter?
>> Per system, per system.
>> Like, I mean, for my family you have a detector system which is trained using the [INAUDIBLE]
>> So for a particular meter, let's say, you run the invariant detection and then feed it to the IDS for that [INAUDIBLE], because there are thousands...
>> For my family, and then if Mike's family [INAUDIBLE]
>> He has to run the, yes.
>> The training part.
>> I think that's the model we're using now. Having said that, you know, for a particular class of devices, we don't believe there's that much variation, but we want to explore that. So that's an interesting question, right: how much learning can I reuse? Here, we're learning it for a particular system and trying to detect intrusions on that system.
>> So
is there a notion in terms of
this training phase the notion
of coverage in the sense of-
>> I talk about that, so,
you mean coverage of inputs?
>> Yeah exactly right how do
you know that you've seen all
the things that are normal?
>> So we don't, but we do
measure false positives and
false negatives, right?
And we can look at how
to balance those two.
So yeah but the assumption
is that you'd be able to see
all the behaviors in the first
few minutes of training,
which will impact
the false positive rate.
So I'm going to show you the data. The first thing I want to say is, when we build a system like this, one question is how do you validate it? So we tried to come up with attacks ourselves on the SEGmeter and the smart artificial pancreas. This is, by definition, anecdotal, in the sense that we took attacks that other people have discovered on these systems, and then replicated them in our lab, and we were able to do it. And in this case, we were able to actually detect all the attacks that we talked about.
But this by itself is not sufficient, right, because one of our goals was to detect zero-day attacks. So yes, this is a low bar to beat: we took a bunch of attacks, we modeled them, we detected them. But it's by no means enough, because an attacker can do arbitrary things, right? So how do I emulate that?
So one thing we tried to do is emulate what an arbitrary attacker would do, using fault injection and mutation testing. This is for the case where, in the absence of data about exactly what attackers will do, you have to make the worst-case assumption and say, all right, if an attacker somehow manages to flip a branch, or somehow manages to mutate a value, I'm able to detect that with my invariants.
So, for example, here we did data mutation, for different kinds of pointers, for example, changing the values to emulate [INAUDIBLE] and so on. We then flipped branches, so where we were going down the if part, the then part, we'd go down the else part. Or we'd try to introduce delays at random places.
When I was here last year, I spoke about this project on [INAUDIBLE] attacks against smart meters. One of the things we found there was that even by delaying synchronization, you can actually introduce attacks. So we tried to use that knowledge of the kinds of attacks that could occur here, but we had to do this in a sort of blind way, to be unbiased.
We did a bunch of
these mutations.
I forget the exact number.
It's in the paper.
And tried to get statistical
estimates of how many
were detected and so on.
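As a minimal sketch, here are the three mutation classes just described, as they might be applied while replaying a run; the names and operators are entirely illustrative, not the actual harness.

```c
#include <stdlib.h>
#include <unistd.h>

/* Data mutation: perturb a data value by a random offset. */
double mutate_data(double v) {
    return v + (double)(rand() % 200 - 100);
}

/* Branch flipping: take the else-branch where the if-branch
 * would have been taken, and vice versa. */
int flip_branch(int taken) {
    return !taken;
}

/* Artificial delay: stall for a random number of seconds. */
void inject_delay(unsigned int max_seconds) {
    sleep(rand() % (max_seconds + 1));
}
```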
So now we come to the important question of false negatives and false positives, right? Depending on how you tune an IDS, you can have more false negatives or more false positives. I will spare you the definitions of these, but what we did was measure what is called the F-score, which tries to balance false positives and false negatives. It's a standard metric in information retrieval.
There's a standard parameter called beta, which lets you choose which one to prioritize more. If you have beta greater than one, you weight false negatives higher, so you want fewer false negatives; with beta equal to one, they're weighted equally; and beta less than one is the other way around.
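For reference, here is the standard F-beta formula from information retrieval, written as a small C function.

```c
/* F-beta score: beta > 1 weights recall (fewer false negatives) more
 * heavily; beta < 1 weights precision (fewer false positives) more
 * heavily; beta = 1 gives the familiar F1 score. */
double f_beta(double precision, double recall, double beta) {
    double b2 = beta * beta;
    return (1.0 + b2) * precision * recall / (b2 * precision + recall);
}
```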
So for the smart meter, we said, you know, the worst case that happens if you have a false negative is you lose some money; somebody steals your power. So it's probably not as bad. But you don't want false positives, because if you're a utility provider, you don't want to come look at a false alarm every few minutes; that's a cost. So we said we'd let a few false negatives slide, but we prioritize avoiding false positives.
For the medical device, we do it the other way around. Now, you could argue both ways, but our rationale was you don't want an attack to slip through, even if it means alerting the human every now and then; that's probably fine because you're wearing the device.
So here's one example of
the kind of tuning we did.
So the x axis is the number
of training traces.
The y axis is the percentage for
different false positive,
false negative, and
F-score balance.
Okay, so here you'll see the results for the different F-scores; this is for OpenAPS. You'll see that the false positive rate keeps decreasing as you increase the number of training traces. Your false negative rate might increase, but it does stabilize at some point, right, because you are gonna reach saturation.
Now, this is for one particular IDS that we want [INAUDIBLE] to the [INAUDIBLE] invariants for our platform. So from this figure we can find the optimal number of training traces, the point of maximum F-score, which happens to be around 15, or you can go higher as well if you chose to. But the important thing is, this kind of training is very platform-specific, very IDS-specific. So we had to do this for all the IDSs, for all the platforms.
And what we did is, for every system, be it Texada or Daikon or Perfume, we tried to find the point at which that particular system's F-score is maximized. That's the best [INAUDIBLE] configuration for that system. And then we compared the best configuration of each system, so in a sense we are looking at the best case for each of these systems. We could have done it for the [INAUDIBLE] case and the [INAUDIBLE] case, but it's important to make sure that all are compared on an even keel.
So first, let's look at the false negative rate. Here, we are looking at the mean value, with a 95% confidence interval.
So we looked at different kinds
of attacks, like data mutation,
branch flipping,
artificial delays or so on.
And one thing you see with all three systems is that, first of all, the false negative rate varies depending on the kind of attack you are introducing. So for example, Daikon is very good at detecting data mutation attacks, because it is data values that are in its invariants, right?
Whereas Perfume, for example, which looks at the timing of events, was very good at detecting artificial delays. But ARTINALI was good at detecting all three, because it was looking at the intersections: data, event, and time together.
So overall, the false negative rate with ARTINALI, our system, was much lower than the false negative rate of each of these systems. The white bars are the aggregated false negative rates. In fact, it's about 2%. That means only 2% of the time do we miss the attack, whereas with some of these systems it could be as high as 90%, or 50% on average.
Again, this is because we are able to say, specifically at this time, at this event, here are the data values that are allowed. So an attacker has very little freedom to change things.
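To make the data-event-time idea concrete, here is a minimal sketch of checking an observation against a learned invariant that ties together an event, its data values, and its timing; the types and names are hypothetical, not ARTINALI's actual implementation:

// A learned model per event: an allowed data range plus an expected
// inter-arrival window. Hypothetical structures for illustration only.
interface DETInvariant {
  minValue: number;       // allowed data range at this event
  maxValue: number;
  minGapMs: number;       // expected time since the previous occurrence
  maxGapMs: number;
}

interface Observation {
  event: string;          // e.g. "read_power_value" (hypothetical name)
  value: number;
  timestampMs: number;
}

class DETChecker {
  private lastSeen = new Map<string, number>();

  constructor(private invariants: Map<string, DETInvariant>) {}

  // Returns true if the observation violates the learned invariant.
  flag(obs: Observation): boolean {
    const inv = this.invariants.get(obs.event);
    if (inv === undefined) return true; // unseen event: flag it

    const dataBad = obs.value < inv.minValue || obs.value > inv.maxValue;

    const prev = this.lastSeen.get(obs.event);
    this.lastSeen.set(obs.event, obs.timestampMs);
    const gap = prev === undefined ? undefined : obs.timestampMs - prev;
    const timeBad =
      gap !== undefined && (gap < inv.minGapMs || gap > inv.maxGapMs);

    // Checking data AND time AT this event is what narrows the attacker's
    // freedom, relative to checking any one dimension alone.
    return dataBad || timeBad;
  }
}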
The story is not, however, as sanguine for the false positive case, though we still end up doing better than the other systems.
So here,
if you look at Daikon, Texada,
Perfume, the false positive
rate is pretty high.
In the defense of those systems,
I'll say they were not meant for
security or intrusion detection.
They're more for
software engineering, so
they had a programmer
look at the invariants.
So for them,
a false positive was fine,
as long as a programmer
could invalidate it quickly.
For us, we don't have that, but
even so, our false positive
rate was about 12%.
And so we were still improving on those other systems, but
we still think this
is pretty high,
so we do need to bring
the false positive rate down.
But I'll tell you this: one of the things that keeps the false positive rate down is that the chance of a coincidental path, where the data, event, and time all exactly match the correct ones, is actually lower, because you're looking at three things as opposed to any two, or just one.
So we could do better if we actually started sub-classifying the kinds of paths the software was taking, which we don't do; we just log it as one aggregate thing.
For example, I could say, whenever the device communicates with the server, I'm going to learn one kind of invariant, and then when it starts sensing the power values, I'll learn another kind of invariant. This would require us to customize the invariant generation system, which we tried not to do, but that's one way we could bring down the false positive rate.
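As a rough sketch of what that sub-classification could look like, assuming a hypothetical per-mode store rather than the actual invariant generator:

// Hypothetical per-mode invariant store: instead of learning one aggregate
// set of invariants, learn a separate set for each operating mode
// (e.g. "server_sync" vs "sensing"), so legitimate behavior in one mode
// isn't flagged just because it differs from another mode.
type Mode = "server_sync" | "sensing" | "idle";

class ModalInvariantStore {
  private ranges = new Map<Mode, { min: number; max: number }>();

  learn(mode: Mode, value: number): void {
    const r = this.ranges.get(mode);
    if (r === undefined) {
      this.ranges.set(mode, { min: value, max: value });
    } else {
      r.min = Math.min(r.min, value);
      r.max = Math.max(r.max, value);
    }
  }

  // A value is suspicious only relative to its own mode's invariant.
  flag(mode: Mode, value: number): boolean {
    const r = this.ranges.get(mode);
    return r === undefined || value < r.min || value > r.max;
  }
}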
Even without doing all that, it's still doing better than these other systems, but there is room for improvement. And of course, the big things are performance and memory, because these systems are very resource-constrained.
So here are results for the [INAUDIBLE] platform. We have about 4 MB of memory to use, because the board has 16 MB of RAM, and Linux and other things take 12 MB. So that's the limit we are operating under, and one thing is that all the systems end up being about similar.
Our invariants take about 3 MB to store, which is still under that limit, and even in the worst case, Perfume's file is less than 4 MB. So it's fine.
The more important thing is detection time. There is a performance overhead; at worst it imposes about a 33% overhead, which is also comparable to the other systems. But what does this mean in reality?
So in the case of the smart meter, it's sensing the power values, taking some action, sensing the power values, and so on. This loop happens every minute. So you have to finish whatever your IDS does within that minute; otherwise you won't be able to keep up, right?
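As a sketch of that timing constraint, assuming a hypothetical sensing loop and function names rather than the meter's actual firmware:

// Sketch of the one-minute sensing loop with the IDS check riding along.
// The only hard requirement is that checkInvariants() completes well before
// the next period begins; hypothetical names, illustrative only.
const PERIOD_MS = 60_000;

async function sensingLoop(
  readPower: () => number,
  checkInvariants: (value: number) => boolean
): Promise<void> {
  while (true) {
    const start = Date.now();
    const value = readPower();
    if (checkInvariants(value)) {
      console.warn("possible intrusion detected");
    }
    const elapsed = Date.now() - start;
    if (elapsed > PERIOD_MS) {
      console.error("deadline miss: IDS did not keep up");
    }
    // Sleep out the remainder of the period before the next sample.
    await new Promise(r => setTimeout(r, Math.max(0, PERIOD_MS - elapsed)));
  }
}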
And we find that we can finish within the first 10 or 15 seconds. So that's also the detection latency; as long as we are within that minute, we have room to spare. Yeah?
>> So what is your sense about the other load on the CPU when you're doing this?
>> So we are running whatever load the manufacturer's software puts on the processor.
>> I see, so this is in addition?
>> Yeah, that's already there.
Even in the worst case, our overheads are still within 25 to 30% of the time. So there is room for a certain amount of workload; whatever process you're running should still be okay.
But having said this, these devices send their data to the cloud; we're probably not going to do compute-heavy stuff on them. So that's the other side of it, but we are able to make sure that whatever load we put on them is not affecting the real-time performance, for these two systems.
So to summarize, ARTINALI is a multi-dimensional model for cyber-physical systems. We capture the data-event-time interplay, and we derive real-time data invariants. Here I'm going to say coverage; I don't mean coverage in the sense of test coverage, I mean coverage of the different attacks that could be introduced. So that increases coverage, and it decreases the rate of false positives, but there's still room to improve that. And the overheads are comparable to other invariant detection systems.
But the point I want to emphasize is not that ARTINALI by itself is a panacea; there is still room to improve it. Rather, the idea of applying dynamic invariant detection to these domains is important, because you can't have programmers write specifications for you, and these systems do very specific tasks.
So I think there is a lot of
opportunity here in terms of
coming up with generic invariant
detection systems that can learn
automatically without the
programmer having to describe or
write specifications.
And then we want to feed these back for security as well as reliability. Right now we are actually trying to run this on a UAV: we have a drone that we are trying to attack, and we try to prevent the drone from crashing just in time by running ARTINALI, but that's still work in progress. So,-
>> Yes of course.
>> So you're saying it took 19 seconds, right, in the minute it had? So it's 19 seconds for all the processing it did, and then the processor just sits idle for the rest of the time?
>> No, sorry, that is the workload that is already running. That's a [INAUDIBLE], and things happen within one minute: it reads the values and does some calculations, and sometimes it just sits idle, and so on.
Our IDS is running in parallel; it's running in a separate process.
>> Separate processor?
>> Separate processor, it uses a pipe to communicate.
>> You need another processor on the device?
>> We need another processor. It's running standard Linux. It's computing our-
>> I mean, it changes the cost
of the device to have two
processors.
>> It's one processor but
running two processes.
>> Two threads.
>> So we have a Unix process, right, and we communicate with pipes.
>> That's what I'm saying, so the competition might make the other process miss its deadline?
>> Correct, so we didn't observe that.
So in all of our tests we
never missed the deadline.
So you're right, I don't have
a proof here that we will never
ever miss a deadline.
>> So you're guessing, right?
>> Well, we finished our IDS processing in the first 20 seconds of the minute we have. So that leads me to believe we're not interfering; it would be a different matter if we had finished at, let's say, 59 seconds-
>> Yeah, I mean, it depends on how close the rest of the processing was to the deadline in the first place, right? Because it's the same thing as co-scheduling stuff in a data center, right? You have to worry about the interference, right?
>> I agree with you. One thing that would have been nice to do is to say, I can bound the amount of overhead I introduce. But I will say this: we can prioritize which parts of the system to monitor. So if we find that we are close to the brink, or are going to overflow our time budget, we could say, I'm going to monitor fewer invariants.
So of course you trade off coverage. We have another idea that we are trying to model, like a set cover problem: I have a property I want to establish; what is the minimum set of invariants I have to monitor to establish the property? You can treat it like an integer programming problem, right. But I agree with you.
What we have is a more anecdotal
demonstration that it doesn't
affect the real time operation,
but
it'd be nice to have a proof.
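As an illustration of that formulation, here is a standard greedy heuristic for set cover; the shape of the inputs is my own assumption, since the exact formulation is left open in the talk (it could also be solved as an integer program):

// Greedy set cover: repeatedly pick the invariant that covers the most
// still-uncovered behaviors, until the property's required behaviors are
// all covered. Illustrative only, not the system described in the talk.
function chooseInvariants(
  required: Set<string>,                    // behaviors the property needs
  covers: Map<string, Set<string>>          // invariant -> behaviors it checks
): string[] {
  const chosen: string[] = [];
  const uncovered = new Set(required);

  while (uncovered.size > 0) {
    let best: string | undefined;
    let bestGain = 0;
    for (const [inv, behaviors] of covers) {
      let gain = 0;
      for (const b of behaviors) if (uncovered.has(b)) gain++;
      if (gain > bestGain) {
        bestGain = gain;
        best = inv;
      }
    }
    if (best === undefined) break;          // property not fully coverable
    for (const b of covers.get(best)!) uncovered.delete(b);
    chosen.push(best);
  }
  return chosen;
}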
Any other questions?
So I'm gonna tell you a little bit
about some of the ongoing work
in my group and then conclude.
So I spoke about some of the work we did on formal methods, using model checking to find attacks against the smart meter.
We had this high-level model of the smart meter in Maude, in rewriting logic. We ran the model checker, and it found a lot of real attacks.
I spoke about this last time I was here. We had things like, you could tamper with, say, the communication between the two boards. This is a commodity smart meter board off eBay; it has some protection, but nothing major, so we can easily open it up.
And we can do things like connect a timer, which would send a pulse to the meter at exactly a certain time and reboot it.
And nine times out
of ten we wiped out
all the data that was
stored on it at that time.
So [INAUDIBLE] found many of these attacks, but what it requires is a high-level model of the CPS.
So we're now trying to see if
we can do this without having
the programmer write
the high level model.
So we're trying to model
these attacks directly in
the [INAUDIBLE] execution,
through symbolic execution.
And then we're gonna try and use these invariants with a [INAUDIBLE] to show that the invariants are sufficient for establishing a certain property.
So this is still a work
in progress, but
that's the goal, right?
So ultimately we want to have a system that comes up with these specifications, these invariants.
And then automatically
the model checker goes and
tries to find attacks.
And when it finds an attack, let's say, we come up with new invariants to detect that attack, and so on.
So the goal is to have this self-improving capability in the IDS, but we're not there yet.
Another recent project we started in my group is trying to build smart middleware for IoT devices. While we've done a bunch of work on JavaScript for the web, there is a lot of effort to run JavaScript on IoT devices. Samsung and Intel have VMs that actually run JavaScript, JavaScript VMs on their tiny devices.
But today, to program these,
you still have to program them
on a device-specific basis.
So our goal is to try to
come up with a framework for
programming IoT devices.
So that you write JavaScript
code in one place, and
the framework takes care of
distributing the computation.
Sending the chunks of code,
migrating them, scheduling them,
and so on.
So we call this SmartJS, though the name may change. We have some examples of code that we can migrate on the fly, without requiring the programmer to do anything.
So one of the challenges here is
keeping track of closure state
without modifying the VM.
I'm happy to talk about this
offline if anyone is interested,
but this is a recent
project we started.
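To give a flavor of the closure-state problem, here is a simplified TypeScript sketch, not SmartJS's actual mechanism: a closure's captured variables live inside the VM and can't be serialized directly, so one workaround is to lift them into an explicit, serializable environment object that can be shipped with the code.

// Problematic version: `count` is captured invisibly; the VM holds the
// state, so migrating this function to another device would lose it.
function makeCounterOpaque() {
  let count = 0;
  return () => ++count;
}

// Workaround sketch: lift the captured state into an explicit environment
// so the framework can snapshot it, send it, and rebuild the closure on
// the target device without modifying the VM.
interface CounterEnv { count: number; }

function makeCounter(env: CounterEnv) {
  return () => ++env.count;
}

const env: CounterEnv = { count: 0 };
const counter = makeCounter(env);
counter(); counter();                         // env.count === 2

const snapshot = JSON.stringify(env);         // state travels with the code
const restored = makeCounter(JSON.parse(snapshot)); // rebuilt elsewhere
restored();                                   // picks up where it left off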
And more broadly, going back to the DNN case, I think what this calls for is that we need to think about the resilience of machine learning applications. Today there's a lot of work on adversarial machine learning, where they change the inputs and so on.
But what if we were able to change the actual parameters of the algorithm, or change the implementation or the hardware that runs it? So what we want is to build algorithms that are resilient to perturbations, even in the parameters.
So for example, if you have, say, an algorithm that goes off on a different path for a small change in parameters, that's not a resilient one. Can I build machine learning algorithms, or systems, where a small change doesn't introduce large changes in the output?
So this is something we're looking at with a bunch of theoretical machine learning folks, trying to see if we can bound this algorithmically, as a property of these systems.
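As a toy illustration of that property, one could probe a model's sensitivity to small parameter perturbations empirically; this sketch uses a linear model and made-up numbers, not anything from the talk:

// Toy sensitivity probe for a model's parameters: perturb each weight by
// +/- epsilon and record the largest output change on a fixed input.
// A "resilient" model keeps this below some bound; illustrative only.
function linearModel(weights: number[], input: number[]): number {
  return weights.reduce((sum, w, i) => sum + w * input[i], 0);
}

function maxOutputChange(
  weights: number[],
  input: number[],
  epsilon: number
): number {
  const base = linearModel(weights, input);
  let worst = 0;
  for (let i = 0; i < weights.length; i++) {
    for (const delta of [epsilon, -epsilon]) {
      const perturbed = weights.slice();
      perturbed[i] += delta;                 // e.g. a bit flip's effect
      worst = Math.max(worst, Math.abs(linearModel(perturbed, input) - base));
    }
  }
  return worst;
}

// For a linear model this is just epsilon * max|input_i|, but the same probe
// applies to models where the sensitivity isn't analytically obvious.
console.log(maxOutputChange([0.5, -1.2, 3.0], [1, 2, 4], 0.01)); // ~0.04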
So to conclude, I hope I've convinced you that cyber-physical system resilience and security are important challenges.
We built two systems in this context: deep neural network accelerators for self-driving cars, and invariant monitoring for embedded system security.
In future work, we want to do formal analysis to actually prove properties about these systems in the presence of our mitigation mechanisms, as well as build smart runtimes to help programmers write resilient code.
And finally, we're interested in looking at resilient machine learning algorithms and systems in a broad sense.
So with that, I'm happy to take
your questions, and that's my
email address there, in case you
have more questions, thank you.
>> [APPLAUSE]
>> So when you use a formal analysis model to do CPS safety, the adversary model is different, right?
>> Thank you for that question. So here we assume an attacker that can tap the physical interfaces of the system. I'll give you an example: you can reboot the system at any point in time, let's say, right, which is easy, just plug it in and out.
You could read the values on the serial port; you can plug in the serial port cable and read the data. And even if it was encrypted, it doesn't matter; the meter would have said that it was being read.
And the third attack there is,
we can drop certain messages
sent to the server.
So I can drop all
the timing synchronization-
>> Assume that the device is
tamper resistant-
>> It does not-
>> [INAUDIBLE] it's something
very deep in the chip.
>> Yeah, we assume that,
okay, so
we assume that the attacker
is not like a nation state or
someone who has deep pockets,
right?
To go and do the kind of
thing like [INAUDIBLE]
the whole chip and so on.
What they can do is tap the physical interfaces, all the stuff that's actually mounted. We bought the equipment off eBay for less than $50; a solid-state timer costs $10 right now. So you can actually hook it up to your meter and send the reboot pulse at the right time.
The model said that you reboot when you come to this function in the code, and then the meter loses data; from that, the translation to the actual physical attack was straightforward.
So I don't know if I
answered your question.
Right, so all the attacks were done on a real meter, with equipment bought off eBay for very low cost, which is what we assumed the typical attacker would do.
>> So you talk about
the modelling and whether or not
you expect people to build these
models of the system,
[CROSSTALK] and discover.
How do they get it right
if they don't have
a model to begin with?
>> [LAUGH] That's
a great question.
We've talked to some companies that do IoT device development, and the feedback we got was that they started with a model, but the models were often outdated, so you're right.
So many of these devices are,
well,
they're glorified
state machines, right?
So you do have to have a model
of the device, but by the time
the code goes through
iterations, and it's ready for
shipping, they don't necessarily
keep the model up to date.
In fact, we were trying to pitch it to a company that does IoT device development, and we found we can't rely on the model. We initially started off by saying, give us your model, we'll turn that into invariants. But then, because the model turned out to be not accurate, we said, can we learn invariants directly from the code? But you're right, that is a-
>> Use a language like P
[INAUDIBLE] would that have
been a good input [INAUDIBLE]
possible input to your-
>> Absolutely. In fact, in the formal modeling work, we assumed the programmer actually expressed the code in a higher-level model, in rewriting logic. I don't know the specifics of P, but from what I know, I assume it is a higher-level logic to express your CPS program. So yes, certainly we could do better invariants in that case, and not have false positives, but this is just my [INAUDIBLE] take, right?
Many of the programmers we spoke to were not necessarily CS majors. Many of them had engineering backgrounds, but few had formal methods available to describe the program in a high-level logic, so they were just writing code.
>> But you were saying that they were expressing it as a state machine.
>> On the board.
>> Yeah, yeah.
>> On the whiteboard, sorry.
>> It's okay, I see
>> Not exactly; they gave us some sheets of paper, or, you know, diagrams in some places.
>> Yeah, I see, okay.
>> They had a model, but it wasn't actually executable, not what a CS person would think of as a model.
But maybe it was just this one
or two companies we spoke to.
Maybe there are other companies
which actually do a more
diligent job.
>> But I remember, the beginning of P was Windows developers who had drawn the state machine in PowerPoint.
>> [CROSSTALK]
>> It's the same story, except they had such a big thing, it couldn't fit on the board.
>> I mean, I know there's been a lot of work on model-based code generation in the software community, where people would say, okay, let me generate the code automatically and so on. I don't know whether those tools are actually used in these-
>> It is being used,
[INAUDIBLE] it's used in
drivers, it's used in measures,
it is a real thing, yeah.
>> Okay I would love to learn
more in depth, okay, thanks.
>> [INAUDIBLE]
>> Thank you.
>> [APPLAUSE]
