Thank you for having me. Volume okay? Okay.
So yeah, I'm going to talk today about applying machine learning to ops automation.
I'm going to start off by just going into a problem, a view of the problem.
The Expedia Group, for those who don't know, we're a travel company.
We help customers book travel online, all in a single place, right?
So if they want to book their flight, book a hotel, get a car, rent a car - whatever it is.
Expedia Group operates a large number of different properties.
This is, I think, most of them - there's some that aren't even on here.
But Expedia does operate a number of properties. It's a pretty big internet company.
I think last year (2018) we did $11.2B in revenue, and something like $99.7B in gross bookings, so lots of volume here.
So what that means is we have lots of customers coming in
and these customers generate a lot of traffic on our front page for our properties.
But it doesn't stop there
because behind the scenes, there are lots of different services, microservices, databases, this whole thing
and so of course there's a lot of traffic going into those as well
and it still doesn't stop there because with all of those, the services are calling each other
and so what ends up happening is you just have a huge volume of traffic
and there are a lot of things that can go wrong in the course of trying to serve up a good customer experience
and what that means is we get lots of alerts, right?
So this is just different anomalies that we detect in the environment, that end up going to end users
and it's very common - I'll leave the numbers out, but it's very common
for people to get large numbers of alerts in any given week
and having that volume of alerts, it undermines the effectiveness of the monitoring system
because at a certain point you can't really deal with that alert volume.
So from that picture a number of key challenges emerged.
So the first one is the one that I've emphasised, which is that there are too many alerts, or a lot of them.
A close second is false alerts, right?
Because a lot of these alerts come in, and a fair number of those alerts are noise
and so what we want to do is, we want to figure out
how do we reduce the amount of noise and improve the signal?
Many alerts that come in, they're not actionable, right?
They may have some service and tell you that some metric is wrong with the service
but it's unclear what to do with that information
even if you know what the service is, and if you know what the metric is.
Diagnosis: so one of the things, of course, is as alerts are coming in, and as an incident is starting to unfold
it's important to understand what it is that's really causing the issue so that you can do something about it
and that diagnosis can take too long.
Of course, the thing that we really care about with all this monitoring and diagnosis
is being able to actually fix the issue, so as to restore service to the customer
and that process also takes too long.
And these challenges obviously are not specific to Expedia.
These are challenges that I think anybody who works in a large company, and maybe even not a large company
you're pretty familiar with these.
So the question we want to explore in this talk is whether automation and machine learning can help to solve some of these problems.
My name is Willie Wheeler and today we're going to talk about Automating Operations with Machine Learning.
I do have a question for the audience. I know this is the DevOps audience.
Is there anybody in the audience who's done work in machine learning at all, or has any background in it?
Little bit. Okay. Okay, that's good. The talk is not for you.
We're not assuming a machine learning background.
My assumption is people are really wanting to understand what it's all about.
So that's fine. A couple things.
The first thing is in this talk if you have questions, feel free to shout them out.
You don't have to wait till the end.
And then the other thing is also, if you have comments, like, you know
sometimes people say "hey, I have a question" and then they have a comment.
It's okay, actually! I'm perfectly happy to take comments, if something comes to mind that you want to share.
Okay, so an overview of the talk.
The first thing we're going to look at is just automating operations in general
so not worrying about machine learning yet, just what does operations automation look like.
Then the second one is, we'll do a very quick overview of machine learning
just to get a baseline there for the following discussion.
And then finally, we'll marry those two concepts together
which is, okay, we have ops automation, we have machine learning
can we apply machine learning to the ops automation effort?
First operating - or, automating operations, rather.
So there's a key concept in operations called the operations loop.
The idea is that you have this closed loop that has some different components.
The first one is: you start off, the system is healthy, everything's running as you want it to be
but at some point in time, something breaks, right?
It could be a service breaks, it could be that some marketing campaign goes bad, or an A/B test goes bad
Whatever it is, something breaks
and so now we have an unhealthy state, and we want to understand and fix that
and the first step to that is simply being able to detect that that's happened, right?
After detecting it, we have to be able to turn that general sense that something is wrong
into an actual understanding of what it is that's wrong so we can go in and fix the issue.
That's the remediation piece.
And so this brings us back into the healthy state.
Now, this operations loop is not necessarily automated, and in fact in most cases, it's largely unautomated, right?
It's oftentimes people implementing this loop.
Some pieces are more common to be automated.
So for example the monitoring systems, generally it's some kind of automated monitoring system
that's paying attention to the time series metrics and then generating alerts.
But after the monitoring piece, most of it tends to be largely manual.
Sometimes there will be scripts that you can run to execute a rollback or that sort of thing.
In many cases, the operations loop is manual, and so what we want to do is
we want to explore some opportunities for automating different parts of the operations loop.
So how do we take that abstract concept of an operations loop
and turn that into an actual system that does this?
What I'm going to describe here is something we released in December.
So what we did - I'll describe it before I actually show all the details here
but what it is, is it's one of these closed-loop operations automation systems.
Expedia gets a fair amount of bot traffic, right?
And these aren't necessarily malicious bots.
What the bots are often doing is scraping the site for pricing information, for hotel information, whatever it is.
A lot of times, it's the partners.
But there will be bots that come in, and sometimes the bots are coming in at a volume that's problematic.
For whatever reason, the system can't handle the load that the bots are generating
and so we have to understand that that's what's going on
and then we have to basically intervene by applying a bot treatment.
There are different types of bot treatment available.
The most benign treatment would be to just reroute that bot traffic to a different farm
so that they're not interfering with actual paying customers, that's one.
Another kind of bot treatment would be to apply a CAPTCHA
so that's where someone has to type in a weird string in order to prove that they're a human.
So we have that, we can send it down a black hole, we can do a tar pit which slows down the traffic.
There's a bunch of different bot treatments we can apply.
So what we're looking at in the following loop is
being able to detect when there's some kind of bookings drop, let's say a hotel bookings drop
that the cause of the situation is heavy bot traffic coming in, or at least heavier than we'd like
and then being able to automatically diagnose that that's the issue, and then apply that bot treatment.
I'll show you how that works.
So first we have here, on the left, we have a bunch of different applications in our ecosystem
and the individual squares represent applications.
Yellow means that the application is having some kind of issue. It could be a performance issue, whatever.
Red means that it's having a serious issue.
For the sake of discussion, we'll assume it's one of our hotel related services that's having a problem.
So we do have an observability layer
and in our observability layer, we have the standard types of observability.
So there's the logging piece - we rely heavily on Splunk. We also use Elasticsearch and Kibana.
Then there's metrics capture, right?
So we want to be able to capture metrics coming in from the various applications or other systems as well
and so we have metrics capture.
We use Kafka for this. We use Grafana and Graphite as well, and Metrictank on the backend.
Then the third thing we have is tracing, distributed tracing.
For those who are unfamiliar with that, the idea with that is that when traffic comes in the front door
you'd like to be able to trace the flow of that traffic all the way into the back end
all the way to the database, and just see what's going on with that
so that when something's wrong, you can more easily diagnose it.
So we do have a tracing system and I'll talk a lot more about that a little bit later in the presentation.
One of the things you can see, coming into tracing, is that trace data coming in from the apps.
From tracing, we send data to metrics, too, and that's telemetry data, right - trace telemetry.
The idea is that when you're doing traces, you can measure various properties of the traces.
So for example, if you're tracing a call from one service to another, you can monitor the counts
or you can track the counts, you can look at success rates and error rates, you can look at latencies, durations...
And so that's what we have in place. And again, I'll talk about that a bit.
So after metrics, this is where we have our first stage of anomaly detection
and this is time series anomaly detection.
We have a system. We have multiple systems for this, but the one I'll be talking about today is called Adaptive Alerting.
What anomaly detection does, is it takes that metrics data, and it converts that
not into emails or pagers, but it just converts it into classifications.
Let's say this metric value is anomalous, or it's not anomalous, or it's weakly anomalous, or strongly anomalous, that kind of thing.
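To make that concrete, here's a toy sketch of turning metric values into classifications rather than pages. This is my own illustration, not the Adaptive Alerting code: the sigma band widths and the level names are invented.

```python
# Toy sketch: classify a metric value against weak/strong sigma bands
# around the history's mean. Band widths and level names are invented,
# not taken from Adaptive Alerting.
from dataclasses import dataclass
import statistics


@dataclass
class Classification:
    value: float
    level: str  # "NORMAL", "WEAK", or "STRONG"


def classify(history, value, weak_sigmas=3.0, strong_sigmas=4.0):
    """Classify a point by its distance from the history's mean."""
    mean = statistics.mean(history)
    sd = statistics.stdev(history)
    dist = abs(value - mean)
    if dist >= strong_sigmas * sd:
        level = "STRONG"
    elif dist >= weak_sigmas * sd:
        level = "WEAK"
    else:
        level = "NORMAL"
    return Classification(value, level)


history = [100, 102, 98, 101, 99, 100, 103, 97]  # mean 100, stdev 2
print(classify(history, 101).level)  # NORMAL
print(classify(history, 107).level)  # WEAK
print(classify(history, 160).level)  # STRONG
```

Note that nothing here emails or pages anyone; the output is just a labelled data point for downstream systems to consume.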
[AUDIENCE MEMBER] So you don't use Splunk?
[WILLIE WHEELER] Yeah, interesting. So we do use Splunk as well. We've historically used it quite a bit.
Splunk is hard to scale to larger volumes for anomaly detection.
I think of it as more primarily a logging centralisation tool.
So we do use it, but we're trying to scale that out to on the order of millions of metrics.
Okay, so diagnostics. The idea behind the diagnostics piece.
I mentioned that anomaly detection - that's really the time series anomaly detection - is kind of the first tier of anomaly detection.
What diagnostics is doing here is it's really that second tier of anomaly detection.
What we're trying to do is capture lots of these classifications coming out of the anomaly detection system.
They're high-volume, and as I mentioned at the beginning, they're generally low-information, considered individually, right?
So not particularly actionable if you take one at time
but if we can feed them into something that's able to make sense out of them
and reduce the number of anomalies into a small number of high information anomalies
then that's what we're doing here.
And then when we receive those anomalies, again, we don't directly send those to emails or pagers.
We're not trying to get people involved for every anomaly that goes on.
We send it to an auto-remediation system that we're in the process of building
and what this auto-remediation system does is, it has various things that it can do in the environment in order to fix things.
So for example if we had a bad release, then one of the kinds of things we might want to do
is either redeploy the service, or do a rollback, do a channel flip, scale up, wait for a little while...
There are different kinds of things that we can do.
And so being able to choose among those is one of the functions of this remediation system.
Another kind of thing, and this is just an example
but another kind of thing is what I mentioned, which is the bot treatment system.
And so in this case, in the scenario that I'm describing, that's what we're doing.
And then the third thing, and this may be a little bit interesting.
It's different than what a lot of monitoring systems and anomaly detection systems do.
We have alert management.
What alert management does is, it manages end user subscriptions
and it manages the actual dispatch of anomalies, and alerts actually, to interested parties.
Now what we've done is, we don't directly hook up the alert manager to the anomalies, especially not the time series anomalies
because we already said we're trying to not have a huge volume of alerts going out
So the idea behind this is the auto-remediation system basically tries to fix things on its own
without involving people, other than maybe letting people know that there's some high-level issue going on.
The other thing that auto remediation can do is
sometimes auto remediation either doesn't know how to fix something
or else it knows how to fix it, but we've decided not to let it automatically fix it
because we're trying to be cautious, right?
Like we don't necessarily want it to automatically, you know, do a rollback.
We might want it to ask a person
"Hey, it looks like this needs a rollback. Please check this out."
And if so, then, you know, use this chat bot in Slack to do the rollback.
And so that's why we've moved the alert manager after the automation. It's an automation first type of design.
So we do have human operators, of course.
We have a centralised team, like a NOC
we have another internal team that works closely with the app teams
and then of course the application teams themselves.
You know, we are a DevOps shop, so we do have application teams also managing their own applications.
And so we send alerts to people in appropriate situations.
So in the bot treatment case, we tell it to do the CAPTCHA, it applies the CAPTCHA, that fixes the app
and a kind of a funny story is we were getting ready to do a demo of this thing
so when we released it, it automatically woke up and started going crazy
and the guy who runs our operations team, he said "whoa, this thing's going crazy, can we turn it off?"
and then his team looked and said "no, it's finding actual bots here!"
It turned out it was actual bots, so that was a great way to be born into the world.
And then the bot treatment says "hey, I just fixed this bot problem for you."
Okay, so that was ops automation. That was the overall loop.
Now I'm going to talk a little bit about machine learning. This will be pretty brief.
Okay, machine learning systems. The idea behind machine learning is that
instead of individual developers programming the logic into some system that has to perform a task
instead, what you do is you take a data set and you show it to the machine
and the machine looks at the data set and figures out how to perform the task.
That's the idea behind it. So it doesn't require explicit programming.
One of the things that you'll see sometimes on the web is this kind of compare and contrast.
The traditional programming approach is that somebody like me or you will write a program
and then you'll feed inputs to the computer or into the program, and then it will generate outputs.
The machine learning approach, and this is a bit of a simplification
but the machine learning approach is different than that.
You give it the inputs, you give it the outputs, and then the machine looks at that and says
"Hey, here's an executable that's going to be able to do predictions or classifications or whatever it is
based on what I've seen in those inputs and outputs."
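As a tiny, hypothetical illustration of "inputs and outputs in, program out": instead of hand-coding the rule y = 2x + 1, we can hand the machine example inputs and outputs and let an ordinary least-squares fit recover the rule. The data here is made up for the example.

```python
# Toy "machine learning": recover a hidden rule from example
# inputs/outputs via least squares, instead of programming it by hand.
def fit_line(xs, ys):
    """Least-squares fit of y = a*x + b from example inputs/outputs."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    a = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum(
        (x - mx) ** 2 for x in xs
    )
    b = my - a * mx
    return a, b


xs = [1, 2, 3, 4, 5]
ys = [3, 5, 7, 9, 11]  # generated by the hidden rule y = 2x + 1
a, b = fit_line(xs, ys)
print(round(a, 3), round(b, 3))  # 2.0 1.0 - the "learned program"
print(a * 10 + b)                # 21.0 - a prediction for an unseen input
```

Real machine learning replaces the two-parameter line with models that have millions of parameters, but the shape of the exchange is the same: examples go in, an executable predictor comes out.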
And so the whole process that I just described - looking at data sets and then generating models and such - is backed by what's commonly called a machine learning pipeline, and I'll show you what that looks like.
First, there's training data. It could come from a database - it doesn't matter, it could be anything.
On that training data, you'll run training algorithms, and the idea is that these training algorithms - sometimes they run on CPUs, sometimes on GPUs - will look at the training data and try to learn some task.
Like I said, it could be a prediction or a classification task. And when they're done training - and it could be hours-long training, it could be even longer than that -
the output is a model that is able to perform the task, whatever that task is. Common tasks would be things like image classification.
You've probably seen that if you go on Facebook, it says, hey, your friend is in this photo, right? That's because of this kind of image classification task.
And then finally, it's not good enough for the model to sit in a repository. We have to actually have a runtime that we can use, and so there's an executable model runtime that's able to ingest those models and make them available to applications - machine-learning-enabled applications that want to use them.
So for example, like we'll talk about in a bit, anomaly detection: if we can load in anomaly detection models, then we can build ML-enabled anomaly detection applications.
So there are many, many approaches to machine learning, not just a single one. Two of the most famous ones:
one is called a convolutional neural network, and the kind of killer app for this is image classification, right?
You show it an image of, let's say, a dog - whatever it is - and you train the network on a huge number of images, and then it can perform
classification tasks such as: this looks to be a dog; it's not a bat, or a flying fox in Australia.
And then another important kind of neural network, another machine-learning-based approach, is something called a recurrent neural network.
What these do is take into account time series or sequence information. So, for example, if somebody has a sentence in one language that you want to translate to another language,
this is able to perform that task, because it's able to interpret sequence information and context and that sort of thing.
Okay, so that was it for the intro to ML. Now I'm going to get into automating operations with ML.
There are three topics here.
The first one is, I'll just briefly introduce the anomaly detection domain, because both of the applications that I describe will be anomaly detection applications.
So I'll talk a little bit about anomaly detection.
Then the first application is that first-tier time series anomaly detection I mentioned. We'll look at an open source project that Expedia is doing called Adaptive Alerting,
and how we use Adaptive Alerting to perform this time series anomaly detection task.
And then there's another one, and this is definitely future-looking - we haven't done this yet with the machine learning approach - and that's incident diagnostics.
Currently, we do incident diagnostics with playbooks, just because we know how to do it with playbooks - our operations teams already use playbooks.
But what we want to do is see if we can start to take some small problems within that playbook and replace them with machine learning, so that we can generate anomalies that are more informational than these low-level time series anomalies.
And this is based on a product, an app called Haystack - again, it's open source.
If you're interested in either of these things, I have links in the slide deck, and I've made the slide deck available to people as well, so you don't have to write anything down - but here's the Haystack project.
Okay, so anomalies and anomaly detection. First, this is very simple:
an anomaly is just an unusual data point - that's all it is. You have a data set or a stream of data;
most of the data looks normal, some of it is weird. That's an anomaly.
The reason we care about anomalies is that anomalies are often signals that something we care about has changed, right?
So, for example, if I'm monitoring bookings and something broke on the site,
then that will manifest itself as an anomaly in some metric that I'm tracking, such as bookings counts. And so anomaly detection is important, because we want to detect those.
So the first application is time series anomaly detection with Adaptive Alerting.
With Adaptive Alerting, what we're trying to do is minimize the mean time to detect, by performing anomaly detection on streaming data.
So, mean time to detect: the idea here is that there's an event in the world where something broke, and then a certain amount of time passes before we realize that something broke.
That's the time to detect.
We're trying to minimize the mean time to detect so that it's not more than, you know, a minute or two - that's the goal.
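As a minimal sketch of that measurement: the detection delay for each incident is the moment the detector flagged it minus the moment things broke, and MTTD is the average across incidents. The timestamps here are made up.

```python
# Minimal MTTD sketch: average (detected_at - broke_at) over incidents.
# The incident timestamps are hypothetical.
from datetime import datetime, timedelta

incidents = [
    # (moment something broke, moment the detector flagged it)
    (datetime(2019, 5, 1, 10, 0, 0), datetime(2019, 5, 1, 10, 1, 30)),
    (datetime(2019, 5, 2, 14, 0, 0), datetime(2019, 5, 2, 14, 0, 45)),
    (datetime(2019, 5, 3, 9, 0, 0), datetime(2019, 5, 3, 9, 2, 15)),
]

delays = [detected - broke for broke, detected in incidents]
mttd = sum(delays, timedelta()) / len(delays)
print(mttd)  # 0:01:30 - ninety seconds, inside the one-to-two-minute goal
```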
And we do this on streaming time series. What this means is that the data is coming in at high volume, in real time,
and so our anomaly detection has to deal with that way of receiving data - as opposed to
searching logs for anomalous data points, or querying the database looking for anomalies. We're talking about streaming data coming in.
Adaptive Alerting supports two broad categories of anomaly detection model.
The first one I'll just call classical models. These are models that come out of statistics, essentially, right?
So for people who are familiar: ARIMA, or Holt-Winters, that kind of thing.
They're kind of standard statistical models. And then the other one is machine-learning-based models that people have been developing recently.
So we support both.
So just to give you a sense of the variety of anomaly detection algorithms and how they work.
The kind of standard one, that probably every tool out there has, is the constant threshold.
So you have a metric that you're monitoring; if the metric goes either above or below
whatever threshold you've defined - that's a constant threshold - then you get an alert fired. So that's constant threshold detection.
That's good for certain kinds of metrics. For example,
if you want to know when you're about to run out of disk, then
this might be a good type of algorithm to use.
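A constant-threshold detector really is this simple. Here's a minimal sketch, with made-up thresholds, to show why disk usage is a natural fit - the meaning of "too high" never changes:

```python
# Constant-threshold detection: fire when the metric crosses a fixed bound.
# Thresholds here are invented for illustration.
def constant_threshold(value, upper=None, lower=None):
    """Return True if the value breaches either fixed threshold."""
    if upper is not None and value > upper:
        return True
    if lower is not None and value < lower:
        return True
    return False


# Disk usage (%) against a fixed 95% ceiling - only the last reading fires.
for pct in [72.0, 88.5, 96.3]:
    print(pct, constant_threshold(pct, upper=95.0))
```

The limitation is equally clear: the threshold never adapts, which is why it's a poor fit for metrics with trends or daily cycles.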
Another type of algorithm that's also very common is Holt-Winters - it's in Graphite, for example.
What Holt-Winters does is track the general trend of the metric, and it also pays attention to the
seasonality of the metric. What that means is regular cycles, right?
So a lot of metrics have regular cycles that recur every day, or every week, or even monthly
in some cases - sales cycles and such. So Holt-Winters is good at that.
Both of these are classical models, and neither one of these is a machine learning model, right?
You don't train them on large sets of data. In the constant threshold case,
it doesn't update itself at all - it just applies a threshold - while Holt-Winters does update its internal state.
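For a feel of that level/trend/seasonality bookkeeping, here's a toy additive Holt-Winters update in Python. This is not Graphite's implementation; the smoothing constants and the series are arbitrary, and the initialisation is deliberately naive.

```python
# Toy additive Holt-Winters: each observation updates level, trend, and
# a per-slot seasonal component; the one-step forecast combines all three.
# Smoothing constants are illustrative, not tuned.
def holt_winters(series, period, alpha=0.5, beta=0.3, gamma=0.4):
    """Return one-step-ahead forecasts for an additive-seasonal series."""
    level = series[0]                          # naive initialisation
    trend = 0.0
    season = [x - level for x in series[:period]]
    forecasts = []
    for t, x in enumerate(series):
        s = season[t % period]
        forecasts.append(level + trend + s)    # forecast before seeing x
        new_level = alpha * (x - s) + (1 - alpha) * (level + trend)
        trend = beta * (new_level - level) + (1 - beta) * trend
        season[t % period] = gamma * (x - new_level) + (1 - gamma) * s
        level = new_level
    return forecasts


# A perfectly repeating daily cycle: the forecasts track the recurring
# shape, and a large |actual - forecast| gap would be flagged as anomalous.
daily = [10, 30, 20, 10, 30, 20, 10, 30, 20]
print([round(f, 1) for f in holt_winters(daily, period=3)])
```

On this idealised series the forecasts reproduce the cycle exactly; on real metrics the residual between forecast and observation is what an anomaly detector would band-test.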
Okay, so the next ones that I'll show are all machine learning algorithms. The first two are
regression: so we have something called STL regression, and
another one is just, you know, custom regression. We have different teams who implement their own anomaly detection algorithms,
so we can run these kinds of standard ones, and if people have custom ones, we can pull those in too.
And then finally we get into machine learning proper: so, recurrent neural networks for time series information, and then even integration with AWS.
AWS has various forecasting and anomaly detection types of algorithms, and one that we've done some integration with is Random Cut Forest,
for example. So these are examples of the kinds that we support, and as you'll see later,
it's open-ended - we can add on any type of time series anomaly detection.
So the way Adaptive Alerting works: essentially, you have two topics that we care about. One is the input topic, where the metrics are coming in.
You can see an example of a metric there. This is a simplified version of it -
there's a little bit more to it, but essentially there's some metric identifier, there's a timestamp
on the metric, and then you have some value, right?
It goes into Adaptive Alerting, and
here we have a bunch of anomaly detectors, like the ones that I just showed you on the previous slide.
Then the anomaly detectors perform classifications, okay?
They're not generating alerts or emails or pages - they're just
classifying the metrics - and then they push those classifications out into an outbound topic; it's called "anomalies" for us. And you can see that we've added that classification: hey, this is a weak anomaly, that's coming out.
Our system runs off of Kafka.
The Adaptive Alerting system itself is based on Java, and we have a Kafka Streams app.
As you'll see in a bit, though, it's not restricted to Java.
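Here's a rough sketch of those two payload shapes - a metric record goes in, and the same record comes back out on the "anomalies" topic with a classification attached. The field names are simplified guesses at the shape described, not the real schema, and the stand-in detector is invented.

```python
# Sketch of the in/out topic payloads: consume a metric record, emit an
# anomaly record with a classification attached. Field names are
# simplified guesses, not Adaptive Alerting's actual schema.
import json


def classify_metric(record, detect):
    """Wrap a detector: metric record in, anomaly record out."""
    out = dict(record)
    out["level"] = detect(record["value"])  # e.g. NORMAL / WEAK / STRONG
    return out


metric = {"metricId": "bookings.hotel.count", "timestamp": 1557300000, "value": 42.0}

# Stand-in detector: anything under 50 bookings is a weak anomaly.
level = lambda v: "WEAK" if v < 50 else "NORMAL"
anomaly = classify_metric(metric, level)
print(json.dumps(anomaly))
```

In the real system this transformation runs inside a Kafka Streams topology; the sketch just shows that the output is the input plus a label, not an email or a page.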
Okay, so now if we want to attach machine learning to this, here's what that looks like. We have the same thing we were just doing - this is the non-machine-learning flow -
but if we want to actually train machine learning models,
what we have to do is add that pipeline. And the first thing we do is
tap that Kafka topic, the incoming topic - we tap that and send the data into
S3, so we can capture data on an ongoing basis and train up models.
After we do that data capture, we run training algorithms on the data there.
So if we want to run a regression model, or we want to run the recurrent neural network model, or any of Amazon's
models, we can run them based on this. So this is all kind of a model development pipeline.
It's kind of analogous to a dev pipeline, in that sense:
the data goes into training, and we're trying to create a model that we can use at runtime.
And so here's the output of that, right?
We have some machine-learning-based anomaly detector that we've deployed into a production environment. And so now that it's deployed there,
we can start sending metric data not only to the Java
version, the one that has the standard types of models in it, but we can also send it to these
anomaly detectors that are running on machine learning, and those can be
totally arbitrary -
language, framework, whatever, it doesn't matter. For the most part, they're just running in containers.
And so if somebody wants to build an anomaly detector in JavaScript, or if they want to do it in Python,
they can just put it in a container, deploy that, and then
wrap it with a Kafka Streams app that listens to the metrics topic and
pushes the classifications to the anomalies topic - and then they're doing anomaly detection.
Okay, so all of that was the first approach to integrating machine learning into our operations automation. Now I'm going to switch gears a little bit
and look at that second level, where we're taking all that information coming in and trying to diagnose what's happening, and generate those higher-level -
or, I guess I should say, more actionable - anomalies that I was talking about.
So I'm just going to say the same thing again: time series anomaly detection generates a large volume of individually uninformative anomalies.
What we need is a small number of actionable anomalies.
And so I'm going to now introduce Haystack.
Haystack is Expedia Group's open source distributed tracing product, and it has a number of key features. So the first one, obviously, is that it does distributed traces.
So if you know Zipkin, it has that type of functionality. I've actually been working with the Zipkin team on doing integrations between
Haystack and Zipkin as well, so that's been a pretty exciting development for us.
The next thing, though, that Haystack adds is trace telemetry.
I mentioned earlier that when we collect trace data, we want to be able to attach numbers to that, like
counts and success rates and such, so Haystack does support the ability to capture metric data on the traces.
The next thing - and I'll show this in more detail, but this is a very cool feature in Haystack - is the dynamic service graph.
A lot of people are familiar with this: you have your services, and you'd like to understand how the services are connected to each other -
which ones call which ones, that sort of thing. But once you get to any scale at all,
doing CI/CD, where releases happen on an ongoing basis - especially if you're doing microservices -
there's no way to maintain a correct static service graph, because it's always changing.
And because Haystack actually captures the trace data,
it's able to construct that service graph dynamically. And again, I'll show you what that looks like.
And then a fourth thing is anomaly detection on the telemetry, right?
So we have all that telemetry data, so wouldn't it be great if we could see that a certain call to a certain endpoint
is taking longer than it should be, for example?
Then finally there's this piece that just came out maybe a couple of weeks ago: state snapshotting.
The idea behind state snapshotting in Haystack is that you have your service graph, right?
This is that topology that I mentioned - all those services calling each other.
Then you also have an overlay of
anomaly data on those nodes - on the services - and on the edges too; the edges are the calls.
If you have that anomaly detection overlay, you can take it and just save it out on an ongoing basis.
So let's say you save it out every minute - just dump that to S3 - and over time, what you have is
training data to understand the behavior of the system: how different patterns in the system tell you what's broken.
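A minimal sketch of what one such snapshot record might look like - the service names, health labels, and field names are all invented for illustration:

```python
# Sketch of a state snapshot: the service graph plus per-node and per-edge
# health overlays, serialised as one JSON record per interval. Names and
# the label set are hypothetical, not Haystack's actual format.
import json

snapshot = {
    "takenAt": 1557300000,  # epoch seconds for this one-minute snapshot
    "nodes": {"ui": "HEALTHY", "search": "WARN", "hotel-svc": "CRITICAL"},
    "edges": {  # caller->callee call health
        "ui->search": "HEALTHY",
        "search->hotel-svc": "CRITICAL",
    },
}
line = json.dumps(snapshot)  # append one such line per minute to S3
print(line)
```

Accumulated over days, these labelled graph states become training data: the inputs are (graph, health overlay) pairs, and the targets are what actually turned out to be broken.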
So here's an example. This is a real-life Haystack service graph - the names of the services are redacted,
but those are all individual services and microservices.
So this graph, like I said, is dynamically generated -
no person linked all these together. And this is not the whole site; this is just a part of the site.
On the left-hand side, that's the front end, and a request goes in toward the back end,
which is on the right. The blue circles and the blue rectangles, those are the healthy ones.
Amber means that it's a little bit unhealthy, but not too bad, and red means that there's something broken here.
And so we capture all this in real time.
So having that data, it creates a number of interesting automation opportunities for us
So I'll talk about three of them here. First one is incident classification, right?
So, you know, you know something is broken, but you're not sure exactly what's going on yet
It's very nice to be able to say hey, here's what's happening pretty
Closely related to that is incident Diagnostics, right?
So on the classification side, you're looking at something say hey, we're having a hotel bookings drop and it's just in the u.s
Incident Diagnostics is more like being able to say hey
Here's the specific system that's responsible for breaking everything else
And here's the thing that we need to fix so that that's a diagnostic piece
Then finally, incident prediction is the inverse operation of diagnostics. Prediction is: if you see something broken, but it hasn't yet turned into a user-facing incident, you predict what's going to happen several time steps in the future, such that you can intervene before it actually becomes a user-facing incident. So the question is, can machine learning help with these three automation opportunities? We think that it can. Like I said, this part is the part that's future-looking, so these are things that we're working on right now. There's fairly recent interest here: Google and MIT put out a paper a few months ago on something called graph networks. This is a fairly recent type of network, kind of like convolutional networks or recurrent networks, but the difference is that graph networks are designed to work with graph-structured inputs. With a convolution, you're looking at, say, an image, which has a grid topology; here we're saying we have some system that we want to understand, and it has a graph type of structure, so we feed that in and then generate graph-type outputs.
I just mentioned the convolutional thing, so a little background on image classification. The way that works is you take a rectangular scanner and you scan it across the image, and you learn features. You learn edges, and you can learn higher and higher-level features, until at some point you're learning that, hey, there's a cat here, there's a dog here. And amazingly, it works. The idea with graph networks is that they do the same kind of scanning thing, but instead of looking at rectangles, they look at arbitrary graph topology, because in a graph you don't know that it's a rectangle.
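As a toy illustration of that arbitrary-topology scanning, here is a single message-passing step in plain Python, which is the core operation underneath graph networks. This is a hypothetical sketch with made-up service names and a trivially simple aggregation rule, not the actual graph networks library:

```python
def message_pass(features, edges):
    """One message-passing step over an arbitrary call graph.

    Each node's new feature is its own anomaly score plus the mean
    of the scores of the services that call it.

    features: dict mapping node -> float anomaly score
    edges:    list of (src, dst) call edges
    """
    incoming = {n: [] for n in features}
    for src, dst in edges:
        incoming[dst].append(features[src])
    return {
        n: features[n] + (sum(msgs) / len(msgs) if msgs else 0.0)
        for n, msgs in incoming.items()
    }

# geo-svc is unhealthy; its score bleeds into the services it serves.
scores = {"frontend": 0.1, "hotel-svc": 0.2, "geo-svc": 0.9}
edges = [("geo-svc", "hotel-svc"), ("hotel-svc", "frontend")]
print(message_pass(scores, edges))
```

A real graph network learns the aggregation and update functions instead of hard-coding them, and repeats the step several times so that information propagates across the whole graph.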
And so those three use cases: I mentioned them, but I'll go over them again just to give you some visuals. The first one is incident classification. We know that there are a bunch of things broken in the environment, and we're trying to figure out what it is that's broken. So we run that through a graph network, we bolt a classifier onto the end of that graph network, and the output would be classifications of different kinds, like: hey, is it a general bookings drop? For example, is the network down, or is it a holiday, something that would cause a general bookings drop? No, it's not that; this looks like a hotel bookings drop, maybe in a specific region or a specific point of sale. So that's one kind of problem.
Another kind of problem I mentioned was incident diagnostics. The input here is the same, where you have a certain pattern of breakage in the environment, but what we want to do is identify the one thing that needs to be fixed so that somebody can go fix it. So that's what this problem is: you feed in the fault pattern, and out comes the marker that says this is the one you want to take a look at. So in this example, hotels are broken because the geo service had a bad deployment.
That's the idea. And then finally, incident prediction; this is that inverse operation I mentioned. You have something broken, and what we want to be able to do is say: in eight minutes there's going to be a hotel bookings drop, unless somebody goes in and deals with the fact that, for example, the geo service had a bad deployment.
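A naive way to sketch the prediction idea, setting the learned model aside entirely, is to propagate a fault upstream through the call graph to see which user-facing services are at risk if nobody intervenes. The service names and the propagation rule here are hypothetical:

```python
from collections import deque

# Hypothetical call graph: caller -> list of callees.
CALLS = {
    "frontend": ["hotel-svc", "flight-svc"],
    "hotel-svc": ["geo-svc"],
    "flight-svc": [],
    "geo-svc": [],
}

def predicted_impact(broken):
    """Return every service that transitively calls the broken one,
    i.e. everything that will eventually feel the breakage."""
    callers = {svc: [] for svc in CALLS}
    for caller, callees in CALLS.items():
        for callee in callees:
            callers[callee].append(caller)
    impacted, queue = set(), deque([broken])
    while queue:
        svc = queue.popleft()
        for caller in callers[svc]:
            if caller not in impacted:
                impacted.add(caller)
                queue.append(caller)
    return impacted

# geo-svc had a bad deployment: hotel-svc and then the frontend are at risk.
print(predicted_impact("geo-svc"))
```

The learned version would additionally estimate how long the propagation takes (the "in eight minutes" part), which a pure reachability walk cannot give you.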
There are a number of expected benefits from this approach, and you can play around with some examples too; there is a library, and I'll have a link to it, so you can see some of these benefits in action. One of the benefits is being able to support relational reasoning, because the inputs have the explicit relational information baked in. That means the network is able to capitalize on that information to perform whatever task it's doing.
Another expected benefit, I should say, is being able to process service graphs of arbitrary topology and size, because, as I mentioned, our environment is changing. It's pretty organic; our teams are able to create new services when they need them, so we need to be able to handle changes in that service topology. And then finally there's this idea of composition.
This is a really important one. One of the challenges with traditional deep learning techniques is that they don't handle composition. In other words, the world has a certain compositional structure to it. If I'm trying to get back to Seattle, there's a structure to that: I know that I need to get to the airport, I know that I need to get from one airport to the other airport, and then get from that airport home, whatever it is. And I can break down any of those legs into further subproblems, like: how do I get from here to the airport? It's the same kind of thing with the diagnostics problem. Can we take an understanding of how an individual service works, and how other individual services work, and turn that into a larger understanding of how the site behaves?
So with that, I'll just leave you with some resource slides; again, I'll make this deck available so you can get the information. On ops automation, there's a good video that Matt recommended about closed loops. Forecasting and anomaly detection really go hand in hand; a lot of anomaly detection methods are based on forecasting, so there are some good resources there. On machine learning, there's a bunch of public data sets you can get on GitHub if you want to try your hand at some of this, and there's also a bunch of TensorFlow tutorials; TensorFlow is one of the important machine learning frameworks, and there are some other ones as well, but you can learn about that there. And then if you're interested in the graph network piece in particular, there's a bunch of links to that as well. And that's it.
Thank you very much
Yeah, challenges are especially welcome.
Yeah
Yeah
Yeah, yeah. So, on whether there's a lot of fighting over this: our ops management, and my boss happens to be here, recognizes the value, and the leadership in general does as well. I've been pretty glad that there's not a lot of fight against automation, or even cloud; people understand the value of that. Some people wonder how we can do it.
The way we've approached this: it is a touchy subject, even for a DevOps crowd, and especially for the ops side. It is touchy because there's always that kind of concern of "it's coming to take my job." So the way we've been thinking about this and approaching it is to think about the expertise that people have. We have certain people who are experts in operations; I'm not one of them, by the way, I'm a developer. I work in this space, but we have people whose expertise is in operations, and they bring that subject matter expertise, but in a lot of cases they don't have the development skill set to fully capitalize on that expertise.
So what we've been doing is saying, hey, this stuff that we're putting together, we're putting it together to extend your reach. And part of what we do is help the ones who are interested in acquiring that development skill set; we help them learn it. In other cases, if they're not interested in acquiring the development skill set, that's okay too, because there are other pieces that are important.
So for example, I mentioned that we have a playbook-based approach for diagnostics, and even for some of the reactions. We've been using AWS Step Functions to implement those playbooks, because they're graphical and JSON-driven, and people who don't have a dev background can really own those. So yeah, we're working with those teams as partners, and the intent is that they own this; we're just helping out. We're actually building it with members of the ops team, so we have people from the ops team on rotation. They say, hey, I'm interested in acquiring the skill set and also contributing my expertise to the effort, and they've been tremendously valuable. Then they take that information and knowledge back to their team, so as to empower their own teams and take their skills to that next level.
Yeah, any other questions or doubts on these? You might have to speak up a little bit.
Yeah, that's a great question. The question was whether I talked about mean time to resolution. I don't actually remember if I talked about MTTR, but I'll talk about it now. The way we've organized it is that there's this kind of flow from "something broke" to "it's not broken anymore." That end-to-end mean time to resolution is really the goal of this whole thing; we're trying to reduce that number down to as small as possible, and we break it up into three key segments. There's mean time to detect, which we treat as: when did the first signal come in for the particular thing, whether or not we recognized it as such at the time? We want that to be very low, like I said, maybe a couple of minutes.
From the time that we received that first signal to the time it's converted into an actionable alert, or anomaly I should say, we call that mean time to know: how long does it take for us to understand what's going on? And then the third segment is from when we have something actionable, and we know what to do, to when the service is actually restored. It's not enough to just pull the lever; the service has to be restored to the customer. You can't immediately call that whole thing MTTR, so there's that shorter mean time to restore, and then there's the end-to-end MTTR.
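As a quick arithmetic sketch of that breakdown, with entirely made-up numbers: the three segments, detect, know, and restore, sum to the end-to-end MTTR.

```python
# Illustrative incident timings in minutes: (detect, know, restore).
# These numbers are invented for the example.
incidents = [
    (2, 5, 20),
    (3, 8, 30),
    (1, 4, 10),
]

def mean(xs):
    return sum(xs) / len(xs)

mttd = mean([d for d, _, _ in incidents])        # mean time to detect
mttk = mean([k for _, k, _ in incidents])        # mean time to know
mtt_restore = mean([r for _, _, r in incidents]) # mean time to restore
mttr = mean([d + k + r for d, k, r in incidents])

# The three segments add up to the end-to-end number,
# so shrinking any one of them shrinks the whole.
print(mttd, mttk, mtt_restore, mttr)
```

The point of the decomposition is exactly that additivity: each segment can be attacked with a different kind of automation.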
So this does reduce the MTTR in cases where we can deal with the type of incident in question. I don't know if you caught the bot scenario earlier; when we turned it on, it caught the bot situation, what was going on, within about two or three minutes, automatically, and what we did was post a message to Slack that said, hey, here's the situation, so that somebody could go into the chat and act on it. Now, in that case, there was a lot of head scratching, and people wondered what was wrong with this thing, because we were surprised at the behavior, so we didn't immediately fix it. But the information was known right away; had a person acted on the chat message, we would have been fixed in about three or four minutes. Yeah, that would be a marked improvement over how long it normally takes us.
Yeah, so the question was what the training data would be. It's not generally log data. We talked about two different anomaly detection problems. For the time series anomaly detection problem, the training data is metric data; that can be any metric somebody thinks is interesting enough to build models from. We don't send everything into S3, but for something like bookings, or traffic through the sales funnel, those are things we care about, they're very high value. So that's just metric data, time series data. For the idea we have about training the diagnostics on the graph network, the training data would be the Haystack service graph, those state snapshots being pushed into S3; those are graphs rather than time series. Yeah.
Yeah, great question. The question is whether the anomaly detectors handle seasonality. It's very much dependent on the type of anomaly detector; this is why we have so many different models. For example, a constant threshold definitely does not handle seasonality; that's just a fixed value. Other anomaly detectors are able to adjust in real time, but they're not necessarily accounting for seasonality; for example, if the metric changes, then the anomaly detector over time adjusts to the new normal. And then finally there are anomaly detection algorithms that explicitly account for seasonality. That can be one seasonality, for example, like cold winters; just one seasonality, but it can deal with that. And we built an anomaly detector called Aquila, which is based on that STL regression thing I mentioned. That one handles two seasonalities, daily and weekly, and you can actually handle arbitrary seasonalities.
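To make the seasonality idea concrete, here is a toy seasonal detector in plain Python: estimate a repeating daily profile, subtract it, and flag large residuals. This is only a simplified stand-in for an STL-style decomposition; Aquila's actual approach is more sophisticated.

```python
import math

PERIOD = 24  # hourly data, daily seasonality

def seasonal_profile(series, period=PERIOD):
    """Average the values seen at each phase of the cycle."""
    buckets = [[] for _ in range(period)]
    for i, v in enumerate(series):
        buckets[i % period].append(v)
    return [sum(b) / len(b) for b in buckets]

def residuals(series, period=PERIOD):
    """What remains after subtracting the repeating seasonal pattern."""
    profile = seasonal_profile(series, period)
    return [v - profile[i % period] for i, v in enumerate(series)]

# Three days of a sine-shaped daily cycle, with a spike injected at hour 60.
series = [100 + 20 * math.sin(2 * math.pi * h / PERIOD) for h in range(48)]
series += [100 + 20 * math.sin(2 * math.pi * h / PERIOD) for h in range(24)]
series[60] += 50

res = residuals(series)
anomalies = [i for i, r in enumerate(res) if abs(r) > 30]
print(anomalies)
```

A constant threshold on the raw series would either fire on every daily peak or miss the spike entirely; subtracting the seasonal profile is what makes the spike stand out.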
Yeah, that's also a really good question. The question is: the classifier is basically deciding whether something is an anomaly or not, so do we have to deal with false positives, and how do we face that? The answer is definitely yes; there are a lot of false positives, and some of this is what's called irreducible error. When you're collecting samples of data, there's always some amount of noise that you're going to see, just in the way the numbers come in, even if you have a great detector. And the thing is that avoiding false positives trades against avoiding false negatives; for whatever threshold you choose, you're going to capture some amount of false positives and/or false negatives.
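That threshold trade-off can be shown with a tiny made-up example: loosening the alert threshold removes false negatives at the cost of false positives, and tightening it does the reverse. The scores and labels below are invented for illustration:

```python
# Detector anomaly scores and ground-truth labels (1 = true anomaly).
scores = [0.1, 0.2, 0.35, 0.4, 0.6, 0.8, 0.9]
labels = [0,   0,   0,    1,   0,   1,   1]

def confusion(threshold):
    """Count false positives and false negatives at a given alert threshold."""
    fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
    fn = sum(1 for s, y in zip(scores, labels) if s < threshold and y == 1)
    return fp, fn

print(confusion(0.3))  # loose threshold: more false positives
print(confusion(0.7))  # strict threshold: more false negatives
```

No threshold eliminates both error types at once here, which is the irreducible-error point: you pick the operating point whose mix of errors your ops team can live with.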
The idea we have on this has a couple of different pieces. The first part of the strategy is just to have better detectors. That's part of why we want to have machine learning detectors: they tend to be better, with tighter bands, better point estimates of where the midline is, that kind of thing. That's part of the strategy.
Another part of the strategy is taking the time series data and feeding it into something that's able to ingest it and tolerate a certain amount of noise. That's the diagnostics piece in the middle, the engine. Think of a photograph: say you have a photograph of a dog, and the photograph has scratches, one of the corners is torn off, and there's a bicycle parked in front of the dog. There's a certain amount of noise and crud, but at some level there's absolutely no ambiguity that I'm looking at a dog. The analogy is that we want to be able to tolerate a certain amount of noise coming in on the time series piece. If we can look at those patterns and say, I can see some noise, but it doesn't matter, I know that this means our geo service just had a bad deployment, that's the idea. And then there's a third part of the strategy.
We'll see if we can do this or not, but if we're able to actually do the incident classification that I mentioned, then we can essentially use that as the output: when we have this kind of input, this kind of pattern in the service graph, and we see that it maps to "hotel bookings are down," we can use that as the supervised training output, which enables the matching you're asking about. That's an idea; we'll see if it works, like I said.
You might have to speak up a little bit.
Okay
Okay
Yeah, if I heard the question correctly, it's: how important is it, or do we need to know the details of the machine learning? Yes, it depends on what you want to do. At a high level, users of the system don't have to at all; somebody else has put in the machine learning algorithms. Something I didn't get into in this talk, but one of the things that we want to do, and again, we haven't done this piece yet, is to be able to take a given metric and automatically pair it with one or more anomaly detection models that provide a good fit to that metric. If we're able to do that, then the requirement on users to know anything about machine learning goes way, way down. So that's why we see that as very important. But if somebody wants to actually build their own custom anomaly detector, then of course they would need to know more about machine learning; that's the advanced case.
Yes, the question is how we currently compare different models to each other. The current approach is pretty low-tech: we'll put a model out there, and if our operations team tells us that this model was no good, or too noisy, or whatever, we take that as a feedback signal and make adjustments. We just kind of eyeball it; we get a pretty good sense for that. There are also different metrics available for evaluating a couple of pieces. For example, for the point forecast, where we're trying to predict the next value of the metric, there's a measure called mean squared error, so you can quantify the difference between the observed value and the predicted value and make adjustments.
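For reference, mean squared error is just the average squared gap between observed values and a model's point forecasts; a small made-up example:

```python
# Observed metric values and a model's point forecasts (invented numbers).
observed  = [120.0, 118.0, 130.0, 125.0]
predicted = [119.0, 121.0, 128.0, 124.0]

# MSE: mean of the squared differences.
mse = sum((o - p) ** 2 for o, p in zip(observed, predicted)) / len(observed)
print(mse)  # -> 3.75
```

Squaring penalizes large misses disproportionately, which is usually what you want for a forecaster feeding an anomaly detector.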
And then the other way we evaluate this kind of gets back to the answer I gave a little while ago. Eventually, what we need to know is that the classifications are correct, because that's how you evaluate whether an anomaly detection algorithm is good or not. That's why we want to be able to do the incident classification: so we can use it as the yes-or-no signal, this one was right, this one wasn't. The idea behind it is that we want to get away from asking people to provide feedback, because at best they'll only do that when there was an alert. Nobody is going to sit there and say, hey, nothing is tanking, everything's great. They'll only do it when there's a false positive, and they won't even do that reliably. So what we want to do is basically look at the patterns, see that they map to a certain incident classification, like "hotels are down," and then use that to recognize that, yes, this was a real alert.
Yeah for sure
Yeah, so log data we mostly do not use to train the models; the log data just goes into Elasticsearch, and people search it. For the metrics data, the way we do that is we send the metrics data into S3, where it just sits, organized into different folders, so that people can load whatever metric they want into their training algorithm. Then, depending on the type of model you want, whether that's a regression model or one of the other model types I mentioned earlier, you'll pick that model, point it at the folder with all the training data you want, push the button, and it runs for hours and digests that information. When it's done, the output is: here's a model that can perform the classification task, or the anomaly detection. That's great.
Yeah, so this is kind of similar; a lot of people are converging on the same thing. To the questioner, I'd say you do need those labels, or those labels are very useful, I'll say, to identify whether something is an anomaly or not. We don't believe that relying on humans to do that labeling is going to be a scalable approach whatsoever; people have better things to do than label metrics coming in. So the idea would be that if we can generate those classifications, like the Haystack thing that I mentioned, if we're able to generate those classifications, then those will be the outputs we use for the supervised learning, like I mentioned.
