AMIR HERMELIN: Hey, everyone.
Can you hear me?
About to get started, so I
hope you enjoyed the keynote.
Amazing stuff.
And I really appreciate it
and welcome you to our session
here where we're going to be
talking about how Waze monitors a planet-scale
service with a relatively small team:
how they do it, which tools they use,
and the approach they've taken.
Before we get started, I'd like
just to introduce ourselves.
So my name is Amir Hermelin,
and I'm a product manager
on Google Cloud.
With me are two Wazers who
actually do the work that we'll
talk about, Yonit Gruber-Hazani
and Evgeny Andzhelo,
and they'll be telling
about their experiences
monitoring Waze.
So how many of you
don't know what Waze is?
Excellent.
Nobody.
So everyone knows--
that's great.
That's great.
And I hope you all use it.
And so you know how
big the service is
and how people have come to
depend on Waze to get them
where they're going,
and they depend on it
to be reliable and
up and available.
But we'll take you
behind the scenes
so you can understand the
complexity of monitoring
and managing such
a huge service.
So how the service is
built, how it's deployed,
what kind of issues
users can run into,
and what kind of issues the
service itself can encounter.
And then we'll talk
about Waze's approach
to taking the data
that they have
and representing it
in a unified manner
to make it easy to automate
their approach to monitoring.
And we'll wrap it up
by talking a little bit
about the tools they use for monitoring,
which are part of Stackdriver.
So with that, I'll
hand it over to Yonit.
YONIT GRUBER-HAZANI:
Thanks, Amir.
Hi, everyone.
I'm Yonit, and I've been a
Wazer for the last two years.
I'm a DevOps engineer on the
Waze infrastructure team.
And let's see a show of hands
if you used a navigation app
this week.
Of course you did.
We all do.
I use it daily.
So for the people
who don't know Waze,
Waze is a crowdsourced
navigation app.
It's free.
And among our best features
are timesaving routes and maps
that are updated daily thanks to
our 500,000 wonderful volunteer
map editors who update
the maps all the time.
So we currently have around 100
million active users monthly.
This amount of users takes a huge
infrastructure to support.
So we have thousands of
instances running on hundreds
of autoscaling groups.
We have about 2 petabytes
of Cassandra data running
on 1,000 Cassandra
instances, which makes us
one of the biggest deployments
of Cassandra in the world
today.
This is the amount
of API calls running
through our servers between
the different microservices.
This is a lot of activity, and
it needs a lot of monitoring.
So we monitor things that
show us the service level
that we give to our users.
So we monitor the
amount of traffic.
That's the query per minute.
We monitor the performance
of the service.
That's the latency.
And we monitor, of
course, failure rates.
So when you have
such a big system
and you help people
with their driving,
downtime is not an option.
When an online game is
down, people are annoyed.
I know when my son's online game
is down, he's devastated.
But when Waze is
down, people get lost.
So in some countries when
Waze is down, within minutes
it's in the news--
in the national news.
That's how essential we are.
So try to imagine that you are
a DevOps engineer on the
infrastructure team.
You're trying to debug
a problem at rush hour.
Everything is
failing, and reporters
are calling you to ask
details about what's going on.
That's a very stressful
situation, and we've been there.
So downtime is not an
option, but downtime happens.
And I use Waze every day.
When I drive home, I want
to know how much time it's
going to take me to
get to the school
to pick up my kids so I
can call ahead and beg
for mercy from the teacher.
So we feel responsible to keep
our system online all the time.
So now that you understand
the size of the challenge
and what we need to
monitor and how big
it is, let's take a
look at the technology stack
that we actually use at Waze
so you can see how we carry
this amount of traffic.
Waze is built on microservices.
We use Java-based code.
We run on standard
instances that are
part of an autoscaling group.
We use Redis or Memcache
as a caching layer,
and we use Cassandra or
Postgres as the databases.
Now most of our services
are pretty similar.
They'll look the same, but
we have a few outliers.
Some are not Java-based,
and some are not even
running on instances.
We also use Kubernetes,
App Engine, and Cloud Functions.
Now these hundreds
of microservices,
they are talking to each other.
Sometimes they're servers,
and sometimes they're clients.
But most of the time, they are
both, passing data between them
with API calls.
So to do that, we use a proprietary
communication protocol.
And this protocol also
collects statistics
about these API calls between
the different microservices
and sends them into
a central location.
So we have a place to
pick up all the statistics
about the performance
and usage of these APIs.
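(To give a feel for the idea: the protocol itself is proprietary, but conceptually every inter-service call records who called whom, how long it took, and whether it failed, and those counters are shipped to a central collector. The following is a minimal, hypothetical Python sketch; none of these names are Waze's actual code.)

```python
# Hypothetical sketch only: record per-call statistics for each
# (client, server, API) edge and flush them to a central collector.
import time
from collections import defaultdict

call_stats = defaultdict(lambda: {"count": 0, "errors": 0, "latency_ms": []})

def record_api_call(client_name, server_name, api_name, fn):
    """Wrap a single API call and record count, errors, and latency."""
    key = (client_name, server_name, api_name)
    start = time.time()
    try:
        return fn()
    except Exception:
        call_stats[key]["errors"] += 1
        raise
    finally:
        call_stats[key]["count"] += 1
        call_stats[key]["latency_ms"].append((time.time() - start) * 1000)
        # A background thread would periodically flush call_stats
        # to the central statistics service.
```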
Our production critical
servers are also
split into dozens of
geographical shards.
Now why is that?
Because it helps
to spread the load
between the different
areas, and it
reduces the impact of bad
updates when we deploy
code into production.
Now seems pretty
complicated, doesn't it?
Well, you haven't
heard the half of it.
This is only one of our
logical data centers,
split across three regions.
And that was only the
front-end part, the online part,
which has the actual user experience.
We also have an offline part,
which is doing the machine
learning and map building.
And we use big data tools
to do the post-processing.
We use MapReduce and
Athena and everything
needs to be monitored.
Oh!
And double everything
I just said because we
use both GCP and AWS.
To support deployments to
these two different clouds,
we are using the open source
Spinnaker to deploy our code.
This gives us an abstraction
layer between the clouds
and allows for seamless
management of GCP and AWS.
So what we see here
in this specific slide
is a sample of a microservice
that does routing.
You can see on the right-hand side
that we have 30 or more API
calls that it's using to talk
to about five microservices.
And on the left side,
you can see the scale:
all of this is just one
of those rectangles on the left.
So you can see the size and
the amount of the API calls.
This is what brings us to
this staggering number of 25
million API calls per minute.
On top of monitoring such a
large-scale infrastructure,
there are other things to
consider when monitoring.
We have new features.
We have hundreds of engineers
working all the time
to extend our system,
add new features,
and deploy new microservices
into production.
These need to have monitoring
added on the fly,
without interruption, and with
zero config from the developer's
side.
There are hardware failures.
Imagine we have thousands
of disks running around
in the system.
And as you all know, disks fail.
This is something that they do.
So we need to
monitor that as well.
We have quotas.
When you use a cloud
service, you always
hit 100% of the quota usage
on some API, usually at 3:00 AM.
So we want to alert at
75% of the quota
before we get there, so we have
time to request an increase,
because as you all know,
that takes time as well.
And certificates and
licenses are also
things that expire within
a large infrastructure.
So we want to be alerted
30 days in advance
so we can request an extension
or replace those certificates
with new ones.
And, of course, service SLOs.
For every service that we monitor,
we want to know its quality,
so we need to know the QPM, that's
the queries per minute, the failure
rate, and the latency
for the performance.
And, of course, don't
forget about the scaling.
And this whole thing
is very dynamic.
Imagine that we have the
geographical sharding,
and each country has
different driving
hours and driving habits.
And you have holidays.
Any autoscaling group can grow
from three servers up to 120
throughout the day and
shrink back at night.
Well, you can't
monitor something
like that with
configuration files.
So now that you know
how complicated it is
and what are the
challenges, can you
imagine how to start deploying
a monitoring service on such
a large-scale system?
Ladies and gentlemen,
I present to you
the man behind the machine.
He who brought order out of
chaos, the Python ninja, Evgeny
Andzhelo.
[APPLAUSE]
EVGENY ANDZHELO:
Thank you, Yonit.
Hi, everybody.
I'm Evgeny, and I'm
going to show you
how Stackdriver makes sense
of our complicated system
that Yonit just told you about.
Firstly, our previous monitoring
wasn't stable enough,
and it took us a while to
find out when it was down.
Secondly, monitoring new
microservices in our system
was complicated
and time consuming,
which was a big problem
because new microservices were
being added constantly, and we
were having trouble keeping up.
About two years ago, we heard
about a new monitoring system
called Stackdriver.
Stackdriver offers
powerful tools
for collecting and using
time series metrics.
We thought it was pretty
cool because it was easier
to see the status of an issue
over a long period of time.
Stackdriver automatically
collects metrics
from our GCP instances
and services.
It can also import the
data from AWS CloudWatch
with just a few
clicks of a button.
Thanks to this, we
hit the ground running
and got tons of useful info
without any effort on our part.
The built-in metrics
were great, but we also
needed another type of
metrics, and Stackdriver
allowed us to send
custom metrics
with all the info we needed.
It's a really flexible system
with great functionality.
We also found out that using
their API was super easy,
and the system was very stable.
And it could process a large
amount of metrics quickly.
We realized that with
Stackdriver, the sky
was the limit, and we
started adding new metrics
from all the various
corners of our system.
For example, we started writing
metrics like request rate
and latency and gave them
labels according to their cloud
platform, region, and more.
So that way you could see
in one metric information
from instances running
on both AWS and GCP.
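(For reference, writing one point of such a labeled custom metric with the Stackdriver monitoring API looks roughly like this. This is a minimal sketch using the Python client library; the project, metric name, and label values are illustrative, not Waze's actual ones, and the exact client surface may vary by library version.)

```python
# Minimal sketch: write one point of a custom request-rate metric,
# labeled with the cloud platform and region it came from.
import time
from google.cloud import monitoring_v3

client = monitoring_v3.MetricServiceClient()
project_name = "projects/my-project"  # hypothetical project

series = monitoring_v3.TimeSeries()
series.metric.type = "custom.googleapis.com/request_rate"  # illustrative name
series.metric.labels["cloud"] = "gcp"        # or "aws"
series.metric.labels["region"] = "us-central1"
series.resource.type = "global"

now = time.time()
interval = monitoring_v3.TimeInterval(
    {"end_time": {"seconds": int(now), "nanos": int((now % 1) * 1e9)}}
)
point = monitoring_v3.Point(
    {"interval": interval, "value": {"double_value": 1234.0}}
)
series.points = [point]

client.create_time_series(name=project_name, time_series=[series])
```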
After we added all this
data into Stackdriver,
we started thinking about
how to put it to good use.
First of all, we
wanted to make it
more easily accessible for
us and for our developers.
And we did that by
using the dashboard API
to create specific dashboards
for all our critical services.
They basically
allowed us to display
a lot of useful information,
both built-in and custom,
for any given service
in a single pane of glass.
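(A minimal sketch of what creating such a dashboard through the dashboard API can look like with the Python client; the project, dashboard name, widget title, and metric filter are illustrative only, and the API was in early access at the time, so the exact surface may differ.)

```python
# Minimal sketch: create a one-widget dashboard for a service.
from google.cloud import monitoring_dashboard_v1

client = monitoring_dashboard_v1.DashboardsServiceClient()

dashboard = monitoring_dashboard_v1.Dashboard(
    display_name="routing-service",  # illustrative service name
    grid_layout=monitoring_dashboard_v1.GridLayout(
        widgets=[
            monitoring_dashboard_v1.Widget(
                title="Server QPM",
                xy_chart=monitoring_dashboard_v1.XyChart(
                    data_sets=[
                        monitoring_dashboard_v1.XyChart.DataSet(
                            time_series_query=monitoring_dashboard_v1.TimeSeriesQuery(
                                time_series_filter=monitoring_dashboard_v1.TimeSeriesFilter(
                                    filter='metric.type="custom.googleapis.com/request_rate"'
                                )
                            )
                        )
                    ]
                ),
            )
        ]
    ),
)

client.create_dashboard(
    request={"parent": "projects/my-project", "dashboard": dashboard}
)
```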
Second, we wanted to be alerted
when specific metrics meet
predefined conditions.
To accomplish that, we started
using alerting API to create
alert policies, which
basically allowed
us to define one or more
conditions that will trigger
an alert, and define how the
alert will be delivered using
notification channels like
webhooks that can report a bug,
page the on-call, or trigger
our autohealing actions,
which allow our internal
scripts to try and fix
the issue before we
wake up the on-call.
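(A minimal sketch of creating such an alert policy with the Python client for the alerting API; the metric filter, threshold, and notification channel ID are made up for illustration, and the exact client surface may vary by version.)

```python
# Minimal sketch: alert when a custom failure-rate metric stays
# above 5% for 5 minutes, and deliver it to an existing channel
# (for example, the webhook that calls the autohealing bot).
from google.cloud import monitoring_v3

client = monitoring_v3.AlertPolicyServiceClient()

condition = monitoring_v3.AlertPolicy.Condition(
    display_name="failure rate above 5% for 5 minutes",
    condition_threshold=monitoring_v3.AlertPolicy.Condition.MetricThreshold(
        filter='metric.type="custom.googleapis.com/failure_rate" '
               'AND resource.type="global"',  # illustrative filter
        comparison=monitoring_v3.ComparisonType.COMPARISON_GT,
        threshold_value=0.05,
        duration={"seconds": 300},
    ),
)

policy = monitoring_v3.AlertPolicy(
    display_name="routing failure rate",  # illustrative policy name
    combiner=monitoring_v3.AlertPolicy.ConditionCombinerType.OR,
    conditions=[condition],
    notification_channels=[
        "projects/my-project/notificationChannels/1234567890",  # hypothetical
    ],
)

client.create_alert_policy(name="projects/my-project", alert_policy=policy)
```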
Over time, we added
even more metrics
such as failure rate, cache and
database layer state, errors,
and [INAUDIBLE]
client side metrics.
We could also use the data that
already existed in Stackdriver
to compare our current state
to the state we had last week
or at some other
time in the past.
This helped us find anomalies
and other irregular activity,
which had previously
been next to impossible.
For example, let's
assume that we usually
have about 10,000 logins
per minute in the Bay Area
during peak hour.
And today that number
is 10 times higher.
This could point to a real
problem like a [INAUDIBLE]
storm, and we need to
be able to see that.
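(The week-over-week comparison can be done by reading back the stored time series. This is a minimal sketch with the Python client; the metric name and the thresholds are illustrative only.)

```python
# Minimal sketch: compare the current login rate with the same
# window one week ago and flag a suspicious 10x jump.
import time
from google.cloud import monitoring_v3

client = monitoring_v3.MetricServiceClient()
PROJECT = "projects/my-project"  # hypothetical project
FILTER = 'metric.type="custom.googleapis.com/logins_per_minute"'  # illustrative

def average_over_window(end_ts, minutes=60):
    """Average the metric over the `minutes` ending at `end_ts`."""
    interval = monitoring_v3.TimeInterval(
        {
            "end_time": {"seconds": int(end_ts)},
            "start_time": {"seconds": int(end_ts - minutes * 60)},
        }
    )
    results = client.list_time_series(
        request={
            "name": PROJECT,
            "filter": FILTER,
            "interval": interval,
            "view": monitoring_v3.ListTimeSeriesRequest.TimeSeriesView.FULL,
        }
    )
    points = [p.value.double_value for ts in results for p in ts.points]
    return sum(points) / len(points) if points else 0.0

now = time.time()
current = average_over_window(now)
last_week = average_over_window(now - 7 * 24 * 3600)
if last_week and current > 10 * last_week:
    print("Login rate is 10x last week's level; possible login storm.")
```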
Now let's talk about automation.
After a period of
using Stackdriver,
we noticed that a lot of
our service dashboards
were very similar.
So we thought, why not
automate their creation?
And we did that using
a scheduled process that
scans for new
services in our system
and automatically creates a
dashboard using predefined
templates, without any additional
work for us or the developers.
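(A hypothetical sketch of what such a scheduled process could look like; discover_services, render_dashboard_template, and create_default_alerts stand in for internal tooling and are not real APIs.)

```python
# Hypothetical sketch of the scheduled sync loop: find services that
# have no dashboard yet and create one from the standard template,
# plus default alerts, with no developer involvement.
from google.cloud import monitoring_dashboard_v1

PROJECT = "projects/my-project"  # hypothetical project
dash_client = monitoring_dashboard_v1.DashboardsServiceClient()

def sync_dashboards():
    existing = {
        d.display_name
        for d in dash_client.list_dashboards(request={"parent": PROJECT})
    }
    for service in discover_services():        # e.g. scan the service registry
        if service.name in existing:
            continue
        dashboard = render_dashboard_template(service)  # fill the template
        dash_client.create_dashboard(
            request={"parent": PROJECT, "dashboard": dashboard}
        )
        create_default_alerts(service)          # default thresholds + channels

# sync_dashboards() would run on a schedule (for example, from cron),
# so new microservices get monitoring with zero config from developers.
```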
You have no idea how
much time it saved us.
It also reduced
the time it takes
us to find the root
cause of issues
and resolve them because all
the dashboards are created
using the same template.
The scheduled process also
creates alerts for new services
with default thresholds
and notification channels.
After a learning period,
we adjust them so that
they actually notify the on-call
when there are any issues.
Both dashboard and
policy definitions,
whether manually or
automatically created,
are saved in the
Git repo and can
be modified by the developers.
And any developer
working on a new service
can add themselves as
a notification channel
and debug the issue before the
service moves to production.
Everything we just talked about
helped our team sleep better at
night, with fewer false positives,
and resolve issues faster.
Now I'd like to show
you a demo based
on an issue we had
in the past that
was solved quickly and
easily thanks to Stackdriver.
So let's say that I'm
currently on call.
It's the middle of
the night, and I just
got a page on my phone.
I can see that it's a page
about high failure rate for one
of our routing services.
So I wake up, open my laptop
to see more information
about the alert.
In the email, we can see
a link to the open incident,
a link to our playbook, and a
link to the dashboard.
At the bottom, we can
also see the resources
that triggered the alert.
Before I move onto
the dashboard,
let's look at the
configuration of this alert.
We can see that it's
pretty easy to read,
and the developer
can easily modify
the notification channel's
duration or threshold
if needed.
The combiner allows me to
choose whether to trigger
the alert if all the
conditions are met
or at least one of them.
In the notification field,
we can choose one or more channels.
In this example,
we can see that the alert
will send an email and trigger
our webhook, which will
call the autohealing bot
with the details
of the incident.
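(A minimal sketch of what the receiving end of that webhook could look like; the endpoint path and run_playbook helper are hypothetical, and the payload fields shown follow the typical Stackdriver incident webhook format.)

```python
# Minimal sketch: receive the alert webhook and try an automated fix
# before paging a human.
from flask import Flask, request

app = Flask(__name__)

@app.route("/stackdriver-webhook", methods=["POST"])
def handle_alert():
    incident = request.get_json(force=True).get("incident", {})
    policy = incident.get("policy_name", "")
    state = incident.get("state", "")
    if state == "open":
        run_playbook(policy, incident)  # hypothetical autohealing helper
    return "", 200
```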
This is a dashboard that
was linked in the email.
The problem can be seen in the
server API failure rate widget.
Now I can drill into the
metrics in the widget
to focus on the specific
metrics related to the issue.
After scrolling down a
bit in the dashboard,
I see something interesting
in another widget
displaying service
dependencies failure rate.
And by dependencies, I
mean all the services
that routing needs
to function properly.
I see that geocoding
has a high failure rate.
It started at the
exact same time.
So I go to the
geocoding dashboard
and see something weird in
the Cassandra latency widget.
Read latency peaked just before
our routing issue started, so I
connected to our
Cassandra cluster
and manually fixed the issue
using our Cassandra playbook.
In the future, this
issue won't even
wake me up because we are
very close to finishing
our Cassandra
autohealing process that
will fix it automatically.
So as I just
demonstrated, Stackdriver
gives great visibility
into our system
and helps us resolve issues
much faster than before.
YONIT GRUBER-HAZANI:
Thanks, Evgeny.
[APPLAUSE]
So we have hundreds of
engineers working all the time
on improving the system,
adding new microservices,
and extending the existing ones.
Before we had all
these automation tools
that Evgeny just spoke
about, all of this
was done manually
in the DevOps team.
So at some point in
the process, we just
couldn't keep up
with the workload.
And large parts of the system
were inadequately monitored,
and some parts were
not monitored at all.
So let's take a look
at how it looks now
with all the
automation in place.
Let's say that I'm
a developer at Waze,
and I just wrote my
new cool microservice
and deployed it into
the staging area.
Now I used the Java
microservice framework
that all the
developers use in Waze.
And this framework already
includes our monitoring modules,
and they will
start working automatically.
I'll tell you a little bit more
about these modules.
As you can see, from the beginning,
this dashboard will be
auto-generated and
auto-deployed into Stackdriver
with the Stackdriver dashboard
API as soon as I
deploy my service.
And I'll go over each of these
sections because they're very
small on the slide.
So the first thing that we'll
see are the server APIs.
These are the same
APIs that I've
been talking about before, for
the specific service
that I'm looking at.
And I will see queries
per minute, I'll
see the server failure rate,
and I'll also see the latency.
You can see on the right side,
in the upper right section,
the server diff for the API.
This is the QPM from last week.
So we can see a
comparison and see
if there is a change in the
behavior of the application.
The next set of widgets
are the server exceptions.
Now this is from
the application log.
We collect those with the
included modules inside
the Java framework.
We take all the
errors from the log,
and we export them as a
graph in Stackdriver.
So I can see from
the same dashboard
how the application
is performing.
Now the client APIs are like
what Evgeny said before.
These are the clients that are
connecting to my microservice.
So I can see as well how
they are performing also
in the same dashboard
for the microservice.
So I can see how my service
failure is affecting
other services in Waze.
And the last set here
are the dependencies.
So my server is connecting
to other microservices.
And we can see here the
QPM, the Query Per Minute.
We can see the timeouts,
and we can see the errors
for connecting to
other services.
So I can see from this dashboard
if the problem is actually
in my microservice or
maybe it's downstream
in a different microservice.
So I can continue going
between the microservices
until I get to the problematic one.
Also, we can see here,
from those included
monitoring modules,
that the Memcache
QPM, queries per minute,
is showing up.
All the data layer
is also exposed
in the same dashboard, so I can
see if the specific service is
having a problem with the
Memcache or with the Cassandra
it's connecting to.
I can even see which table
is eating up all the space.
So these metrics are
great, but they're not
enough for my
particular use case.
Let's say that my microservice
is also writing to and reading
from a queue.
So I want to see what
the QPM of writing messages
to the queue is.
Now my application already
exports these metrics,
and they are already
in Stackdriver
thanks to the included modules.
But I want to see it on
the default dashboard.
So what I need to do is
edit one configuration
file from our Stackdriver
configuration repo,
which is saved in Git, push it
into production, and that's it.
In a couple of minutes,
Evgeny's automation tool
will pick up on
the change in Git,
and it will add that widget via
the Stackdriver dashboard API
into the dashboard
that I'm using.
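(A hypothetical example of what such a widget entry in the configuration repo might look like; the field names and metric are illustrative, not Waze's actual config schema.)

```python
# Hypothetical widget entry the developer adds to the config repo.
# The scheduled automation reads entries like this from Git and calls
# the Stackdriver dashboard API to append the widget to the dashboard.
QUEUE_WRITES_WIDGET = {
    "service": "my-cool-microservice",
    "title": "Queue writes (QPM)",
    "metric": "custom.googleapis.com/queue/write_rate",  # illustrative
    "labels": {"queue": "my-queue"},
    "chart": "line",
}
```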
As a developer, I can also
open a new config file
and create a whole new
dashboard only with the metrics
that I want to see when
I'm debugging a problem.
All these changes and the
repeating dashboard structure
help me solve production
issues at 3:00 AM.
I can tell which table in
Cassandra is failing.
I can see exactly
what the problem
is from the logs in
the same dashboard.
So instead of just getting
an alert that Waze is down
or the website is down,
I get an alert saying service A
is failing, and
in the dashboard,
I can see that it's
failing because of service
B, a downstream dependency
that is not performing well.
Looking into the future, we
are using Spinnaker open source
to deploy our services.
So now we are starting
to integrate the metrics
from Stackdriver
into Kayenta, the new canary
analysis tool from Spinnaker,
that will take these metrics.
And when you're doing a deploy,
it will fire up a few instances
and compare the metrics
between the new instances and
the older autoscaling group
to decide whether to move forward
with the deployment
or to stop, kill those
instances, and roll back.
So this way the
deployment process
automatically decides
whether to continue,
and no human
intervention is needed.
This, of course,
will reduce downtime
if a bad deploy is done.
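(The canary decision itself boils down to a metric comparison. The following is a minimal, hypothetical sketch of the idea, not Kayenta's actual scoring algorithm; the metric names and the 10% regression budget are made up.)

```python
# Hypothetical sketch: compare canary metrics against the existing
# autoscaling group and decide whether to continue or roll back.
def canary_verdict(canary, baseline, max_regression=0.10):
    """Return True to proceed with the rollout, False to roll back."""
    for metric in ("failure_rate", "latency_p95"):
        base = baseline[metric]
        if base > 0 and (canary[metric] - base) / base > max_regression:
            return False  # canary is more than 10% worse on this metric
    return True

# Spinnaker/Kayenta would feed both sides from Stackdriver time series
# and either trigger the next deployment stage or an automatic rollback.
```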
So this is me and my league
of extraordinary gentlemen.
We are mostly DevOps engineers
and a few developers.
And this tiny DevOps team
handles this whole operation
and its uptime.
We have one on-call
for primary issues,
one person who wakes up
at night, and one secondary
to handle the bugs that
Stackdriver can open through
a notification channel.
And where are we today?
We are sending 2 and 1/2
million time series per minute
into Stackdriver.
We have thousands of dashboards,
all created automatically
through the Stackdriver
dashboard API.
We have hundreds of
policies all set up
on the critical time series.
So we know when
something is failing.
And now I'll hand you back to
Amir, who will put everything
into perspective.
[APPLAUSE]
AMIR HERMELIN: Thank
you very much, Yonit.
Thank you, Evgeny.
So the next time you use
Waze and you're thinking,
how does this keep working
for me and hundreds
of millions of other users,
now you know the story.
This is the team.
Other members are pictured
here as well, if you can
recognize them from the photos.
So that's the story of
the approach they've
taken, but let's
talk a little bit
about Stackdriver,
which is the platform and
collection of services
that they use to enable
this monitoring approach.
So first of all, what
Yonit and Evgeny described
is how they use Stackdriver;
we focused mostly
on monitoring.
So Stackdriver is
a platform that
enables you to easily collect
metrics, logs, and other data
into a central repository.
They take that from
open source components,
from applications that are
running in Google Cloud or AWS.
We use APIs and other
methods to ingest.
Yonit did mention the
dashboard API, which
is currently in early access.
We intend to release
it in the future,
but it's not available
to everyone right now.
But the alerting API
as well as other APIs
mentioned is available.
And then you take that data.
First of all,
Stackdriver enables you to store
it and maintain it for longer
periods, and to
create visualizations
and insights from it:
visualizations
in the form of dashboards,
and insights that are sometimes
automatically inferred
from the data and
its correlations,
as well as alerts generated
via our various notification
channels.
So Evgeny mentioned
email and webhook.
We support other channels such
as PagerDuty, Slack, SMS, et
cetera.
And we are expanding
our support for that.
And finally Stackdriver
also supports
other third-party
vendor ecosystems,
so you can integrate
with other tools that
can use the data
and other services
in the cloud, one example
being taking logging data
and automatically
ingesting it to BigQuery
for deeper analysis.
Just to complete the picture,
Stackdriver is not only
about these capabilities.
We're building more.
So we're building
context capabilities
and auto-inference of services.
We're expanding more of our
visualization technologies,
adding tools that can
help you better mitigate
issues and deal with incident
response, et cetera, et cetera.
And you can hear about it
in other lectures as well.
And just going over
the different Stackdriver
products available to you.
So we mainly touched
upon monitoring,
which you can use to visualize
and alert on the health
of your application.
Stackdriver also has
a logging product
that collects your
logs in some cases
automatically depending
on the service
and lets you analyze
them and do fast queries
and even do things
like create metrics
from logs so you can use
those metrics in dashboards
and alerting policies and
other advanced capabilities.
We also have error
reporting, which
can automatically group
and detect and alert you
on exceptions and
failures in your code.
And, again, the nice
thing is that this suite
is tightly integrated, so
let's say you already have
logs in the platform,
then error reporting
comes for free.
You don't need to
instrument another service.
You don't need to sign
up for something else.
It's automatically
available for you.
We also have Trace, with
distributed tracing, which
we support via the API, the SDK,
and Zipkin collectors,
and it automatically
collects traces
and analyzes your
latency for applications
running in App Engine.
Debug is another
very, very useful tool
that lets you debug
applications in production.
So you can-- during
production, you
can insert breakpoints
and logging statements
without all the overhead.
So you don't need to
instrument it in advance.
It's a really cool
capability, and I
hope you can see a demo of it
in one of our sessions here.
And finally Profiler,
which is actually in beta.
So it's available, and it
helps you pinpoint hotspots
in your applications
running in the cloud.
So to recap it,
firstly how many of you
are running services
that you believe
are larger scale than Waze?
Nice.
Which services?
Spotify.
Yeah, that's a huge service.
That's great.
Love it.
I was hoping at least one
person would raise their hand.
But for the rest of you,
this also applies to you.
This approach can
also apply to you.
So let's recap what
we talked about today.
We talked about how
large the service is
and how challenging
it is to monitor it,
and what the implications are of
not catching errors in real time,
and what it means to users.
In the case of Waze,
it can mean people
not reaching their
destinations on time or at all,
or sometimes it can even make it
to the news in some countries.
So downtime is
definitely user pain,
and the team that is
tasked with keeping
the application up
is not a huge team,
so they need an approach.
The system is very dynamic,
which creates more challenges.
So any developer can
add microservices,
remove microservices on the fly.
They don't need to
check in with this team.
And it could be developers
in different floors
or in different buildings.
So the approach the team
has taken is first of all
to decide which metrics--
which signals they
collect, what they
look at in a more canonical way.
For all the microservices,
what are we interested in?
What metrics?
Which dashboards?
What alerting policies?
They started there.
After that, they fine tuned
because different microservices
have different requirements.
So you can build and customize.
And when you have that canonical
approach across the board,
you build in as much automation
as possible,
so the developers use config
files, and the alerting policies,
the dashboards, and the
metrics are adjusted.
Everything is automatically
generated to reduce human error
and also to promote uniformity.
So when problems
occur, engineers
can look at different
microservices
in the same manner, a
similar pane of glass.
So automation is a key part.
And lastly we talked about
the specific monitoring
suite of products that Waze
uses, which is Stackdriver.
Stackdriver is
available free to try.
Just google Stackdriver
and get started.
Just lastly, other talks.
I'm not going to
go over each one,
but these are talks
related to monitoring,
to SRE best practices,
and to how Waze uses Spinnaker.
These are all great talks.
We recommend that
you attend them
either today or in the next few days.
And don't forget to
rate our session,
especially if you liked it.
So looks like we have a
few minutes, but we are--
yeah, we are standing
between you and lunch.
The only thing worse is
standing between you and beer,
but we are standing
between you and lunch.
So what we're going to do is
we're going to wrap it up here,
but if you have
additional questions,
we're going to just be right
here next to the stage.
Feel free to line
up and come ask us.
The rest of you enjoy lunch
and the rest of the conference.
Thank you very much.
Thank you.
[APPLAUSE]
