KENNY YU: So, hi, everyone.
I'm Kenny.
I'm a software engineer at Facebook.
And today, I'm going to talk
about this one problem--
how do you deploy a service at scale?
And if you have any questions,
we'll save them for the end.
So how do you deploy a service at scale?
So you may be wondering how
hard can this actually be?
It works on my laptop, how
hard can it be to deploy it?
So I thought this too, four years ago, when I had just graduated from Harvard as well.
And in this talk, I'll
talk about why it's hard.
And we'll go on a
journey to explore many
of the challenges you'll
hit when you actually
start to run a service at scale.
And I'll talk about how Facebook has
approached some of these challenges
along the way.
So a bit about me--
I graduated from Harvard, class of 2014, concentrating in computer science. I took CS50 in fall of 2010 and then TF'd it the following year.
And I TF'd other classes
at Harvard as well.
And I think one of my favorite
experiences at Harvard was TFing.
So if you have the opportunity,
I highly recommend it.
And after I graduated,
I went to Facebook,
and I've been on this one
team for the past four years.
And I really like this team.
My team is Tupperware, and Tupperware is Facebook's cluster management system and container platform.
Those are a lot of big words, and my goal is that by the end of this talk, you'll have a good overview of the challenges we face in cluster management, how Facebook is tackling some of these challenges, and then, once you understand those, how this all relates to how we deploy services at scale.
So our goal is to deploy a
service in production at scale.
But first, what is a service?
So let's first define what a service is.
So a service can have
one or more replicas,
and it's a long-running program.
It's not meant to terminate.
It responds to requests
and gives a response back.
And as an example, you
can think of a web server.
So if you're running Python, it might be a Python web server, or if you're running PHP, an Apache web server. It responds to requests, and it gives a response back. And you might have multiple of these, and together they compose the service that you want to provide.
So as an example, Facebook uses Thrift for most of its back end services.
Thrift is open source, and it
makes it easy to do something
called Remote Procedure Calls or RPCs.
And it makes it easy for one
service to talk to another.
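To make that concrete, here's a minimal sketch of what calling a Thrift service from Python might look like. The transport and protocol classes are real parts of Apache Thrift's Python library, but the profile_service module and its getProfile method are hypothetical stand-ins for code the Thrift compiler would generate from your service definition.

```python
from thrift.transport import TSocket, TTransport
from thrift.protocol import TBinaryProtocol

# Hypothetical: this module would be generated by the Thrift compiler
# from an IDL file defining ProfileService and its methods.
from profile_service import ProfileService

# Connect to a replica of the profile service and make an RPC.
transport = TTransport.TBufferedTransport(TSocket.TSocket("localhost", 9090))
protocol = TBinaryProtocol.TBinaryProtocol(transport)
client = ProfileService.Client(protocol)

transport.open()
profile = client.getProfile(42)  # looks like a local call, runs remotely
transport.close()
```

The point of the RPC abstraction is exactly this: the call site reads like an ordinary function call, while the framework handles the network underneath.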
So as an example of a service at Facebook, let's take the website.
So for those of you that
don't know, the entire website
is pushed as one
monolithic unit every hour.
And the thing that actually runs the website is hhvm, which runs our version of PHP, called Hack, a type-safe language.
And both of these are open source.
And the way the website is deployed is that there are many, many instances of this web server running in the world.
This service might call other services
in order to fulfill your request.
So let's say I hit the
home page for Facebook.
It might need to fetch my profile and render some ads.
So the service will call maybe the
profile service or the ad service.
Anyhow, the website
is updated every hour.
And more importantly, as a Facebook
user, you don't even notice this.
So here's a picture of
what this all looks like.
First, we have the Facebook web service.
We have many copies of our
web server, hhvm, running.
Requests from Facebook users-- so either
from your browser or from your phone.
They go to these replicas.
And in order to fulfill the
responses for these requests,
it might have to talk to other services-- the profile service or the ad service.
And once it's gotten
all the data it needs,
it will return the response back.
So how did we get there?
So we have something that
works on our local laptop.
Let's say you're starting a new web app.
You have something working-- a
prototype working-- on your laptop.
Now you actually want
to run it in production.
So there are some challenges there
to get that first instance running
in production.
And now let's say your app takes off.
You get a lot of users.
A lot of requests start
coming to your app.
And now that single
instance you're running
can no longer handle all the load.
So now you'd have multiple
instances in production.
And now let's say your app--
you start to add more features.
You add more products.
The complexity of your application grows.
In order to simplify
that, you might want
to extract some of the responsibilities
into separate components.
And now instead of just having
one service in production,
you have multiple
services in production.
So each of these transitions
involves lots of challenges,
and I'll go over each of these
challenges along the way.
First, let's focus on the first one.
From your laptop to that
first instance in production,
what does this look like?
So the first challenge you might hit when you want to start that first copy in production is reproducing the same environment as on your laptop.
So as an example of the challenges you might hit, let's say you're running a Python web app. You might have various Python packages, libraries, or versions installed on your laptop, and now you need to reproduce the same exact versions and libraries in the production environment. So versions and libraries-- you have to make sure they're installed in the production environment.
And then also, your app
might make assumptions
about where certain files are located.
So let's say my web app needs
some configuration file.
It might be stored in
one place on my laptop,
and it might not even exist
in a production environment.
Or it may exist in a different location.
So the first challenge here
is you need to reproduce
this environment that you have on
your laptop on the production machine.
This includes all the files and
the binaries that you need to run.
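As a small illustration of one slice of this problem, here's a sketch that checks whether the production machine has the exact package versions the app was developed against. The pinned packages and versions are made-up examples; container images solve this more completely by shipping the whole filesystem.

```python
import importlib.metadata
import sys

# Example pins; in reality these would be captured from the
# development machine, e.g. via a requirements file.
REQUIRED = {"flask": "2.0.1", "requests": "2.26.0"}

def check_environment() -> bool:
    ok = True
    for package, wanted in REQUIRED.items():
        try:
            installed = importlib.metadata.version(package)
        except importlib.metadata.PackageNotFoundError:
            print(f"{package}: missing entirely")
            ok = False
            continue
        if installed != wanted:
            print(f"{package}: wanted {wanted}, found {installed}")
            ok = False
    return ok

if __name__ == "__main__":
    sys.exit(0 if check_environment() else 1)
```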
Next challenge is how do you make
sure that stuff on the machine
doesn't interfere with
my work and vice versa?
Let's say there's something more
important running on the machine,
and I want to make sure my dummy web
app doesn't interfere with that work.
So as an example, let's say my service--
the dotted red box--
it should use four gigabytes
of memory, maybe two cores.
And something else on the machine wants to use two gigabytes of memory and one core.
I want to make sure that that other
service doesn't take more memory
and start using some
of my service's memory
and then cause my service to
crash or slow down and vice versa.
I don't want to interfere with the
resources used by that other service.
So this is a resource isolation problem.
You want to ensure that no other workload on the machine interferes with my workload and vice versa.
Another problem with
interference is protection.
Let's say I have my workload
in a red dotted box,
and something else running on the machine-- the purple dotted box.
One thing I want to ensure is that
that other thing doesn't somehow
kill or restart or terminate
my program accidentally.
Let's say there's a bug in the
other program that goes haywire.
The effects of that service should
be isolated in its own environment
and also that other thing
shouldn't be touching important
files that I need for my service.
So let's say my service needs
some configuration file.
I would really like it if
something else doesn't touch that
file that I need to run my service.
So I want to isolate the environment
of these different workloads.
The next problem you might have is how
do you ensure that a service is alive?
Let's say you have your service up.
There's some bug, and it crashes.
If it crashes, this means users will
not be able to use your service.
So imagine if Facebook went down and
users are unable to use Facebook.
That's a terrible
experience for everyone.
Or let's say it doesn't crash. It's just misbehaving or slowing down, and restarting it might help mitigate the issue temporarily.
So what I'd really like is, if my service has an issue, restart it automatically so that user impact is kept to a minimum.
And one way you might be able to
do this is to ask the service, hey,
are you alive?
Yes.
Are you alive?
No response.
And then after a few seconds of
that, if there's still no response,
restart the service.
So the goal is the service
should always be up and running.
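Here's a minimal sketch of that "are you alive?" loop in Python. The health endpoint URL, the failure threshold, and the restart command are all assumptions for illustration; this is not how Tupperware implements it.

```python
import subprocess
import time
import urllib.request

HEALTH_URL = "http://localhost:8000/health"  # hypothetical endpoint
MAX_FAILURES = 3

def is_alive() -> bool:
    try:
        with urllib.request.urlopen(HEALTH_URL, timeout=2) as resp:
            return resp.status == 200
    except OSError:
        return False  # connection refused, timeout, etc.

def monitor(restart_cmd: list[str]) -> None:
    failures = 0
    while True:
        failures = 0 if is_alive() else failures + 1
        if failures >= MAX_FAILURES:
            print("service unresponsive, restarting")
            subprocess.run(restart_cmd, check=False)
            failures = 0
        time.sleep(5)

if __name__ == "__main__":
    monitor(["systemctl", "restart", "my-web-app"])  # placeholder command
```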
So here's a summary of
challenges to go from your laptop
to one copy in production.
How do you reproduce the same
environment as your laptop?
How do you make sure that once you're
running on a production machine,
no other workload is
affecting my service,
and my service isn't affecting
anything critical on that machine?
And then how do I make sure that my
service is always up and running?
Because the goal is to have users be
able to use your service all the time.
So there are multiple
ways to tackle this issue.
Two typical ways that companies have approached this problem are virtual machines and containers.
So for virtual machines, the
way that I think about it is you
have your application.
It's running on top
of an operating system
and that operating system is running
on top of another operating system.
So if you've ever run Windows inside a Mac with a virtual machine, that's a very similar idea.
There are some issues with this.
It's usually slower to
create a virtual machine,
and there is also an efficiency
cost in terms of CPU.
Another approach that companies
take is to create containers.
So we can run our application in some isolated environment that provides all the same guarantees as before, directly on the machine's operating system.
We can avoid the overhead
of a virtual machine.
And this tends to be faster
to create and more efficient.
And here's a diagram that shows
how these relate to each other.
On the left, you have my service--
the blue box-- running on
top of a guest operating
system, which itself is running on
top of another operating system.
And there's some overhead because you're running two operating systems at the same time. With the container, we eliminate that extra overhead of the middle operating system and run our application directly on the machine with some protection around it.
So the way Facebook has approached
these problems is to use containers.
For us, the overhead of
using virtual machines
is too much, and so that's
why we use containers.
And to do this, we have a program called the Tupperware agent running on every machine at Facebook, and it's responsible for creating containers.
And to reproduce the environment,
we use container images.
And our way of using container
images is based on btrfs snapshots.
Btrfs is a file system that makes
it very fast to create copies
of entire subtrees of a file
system, and this makes it very fast
for us to create new containers.
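To illustrate why this is fast, here's a sketch of stamping out a container root filesystem as a btrfs snapshot of a base image subvolume. The paths are made up; `btrfs subvolume snapshot` is the real mechanism, and it's copy-on-write, so the snapshot shares all data with the source until something is written.

```python
import os
import subprocess

def create_container_rootfs(image_subvolume: str, container_id: str) -> str:
    container_dir = f"/var/containers/{container_id}"
    os.makedirs(container_dir, exist_ok=True)
    rootfs = f"{container_dir}/rootfs"
    # A btrfs snapshot is nearly instant regardless of image size,
    # because no data is actually copied until it's modified.
    subprocess.run(
        ["btrfs", "subvolume", "snapshot", image_subvolume, rootfs],
        check=True,
    )
    return rootfs

# Example (requires root and a btrfs filesystem):
# create_container_rootfs("/var/images/python-webapp", "task-123")
```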
And then for resource isolation,
we use a feature of Linux
called control groups that allow
us to say, for this workload,
you're allowed to use this much memory,
CPU, whatever resources and no more.
If you try to use more than that, we'll throttle you, or we'll kill your workload, to prevent you from harming the other workloads on the machine.
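Here's a minimal sketch of that idea using the cgroup v2 filesystem interface on Linux: create a group, write limits into its control files, and move a process into it. It assumes a cgroup2 mount at /sys/fs/cgroup and root privileges; the group name and numbers are illustrative, not Tupperware's actual configuration.

```python
import os

def limit_workload(name: str, pid: int, memory_bytes: int, cpu_pct: int) -> None:
    group = f"/sys/fs/cgroup/{name}"
    os.makedirs(group, exist_ok=True)
    # Hard memory cap: the kernel reclaims or kills if the group exceeds it.
    with open(f"{group}/memory.max", "w") as f:
        f.write(str(memory_bytes))
    # CPU quota as "max period" in microseconds: 200000 100000 = two cores.
    with open(f"{group}/cpu.max", "w") as f:
        f.write(f"{cpu_pct * 1000} 100000")
    # Move the process into the group; the limits now apply to it.
    with open(f"{group}/cgroup.procs", "w") as f:
        f.write(str(pid))

# Example: cap a process at 4 GiB of memory and two cores.
# limit_workload("my-service", 12345, 4 * 1024**3, 200)
```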
And for our protection, we
use various Linux namespaces.
I'm not going to go over too much detail here-- there's a lot of jargon. If you want more detailed information, we have a public talk from our Systems @Scale conference in July 2018 that covers this more in depth.
But here's a picture that summarizes
how this all fits together.
So on the left, you have
the Tupperware agent.
This is a program that's running
on every machine at Facebook that
creates containers and ensures that
they're all running and healthy.
And then to actually create the
environment for your container,
we use container images, and
that's based on btrfs snapshots.
And then the protection
layer we put around
the container includes multiple things.
This includes control groups to control
resources and various namespaces
to ensure that the
environments of two containers
are sort of invisible to each other,
and they can't affect each other.
So now that we have one instance
of the service in production,
how can we get many instances
of the service in production?
There are new sets of
challenges that this brings.
So the first challenge
you'll have is, OK,
how do I start multiple
replicas of a container?
So one approach you may take
is, OK, given one machine,
let's just start
multiple on that machine.
And that works until that machine runs out of resources, and you need to use multiple machines to start multiple copies of your service. So now you have to use multiple machines to start your containers.
And now you're going to hit
a new set of classic problems
because now this is a
distributed systems problem.
You have to get multiple machines to
work together to accomplish some goal.
And in this diagram, what
is the component that
creates the containers
on the multiple machines?
There needs to be something that knows to tell the first one to create containers and the second one to create containers, or to stop containers as well.
And now what if a machine fails?
So let's say I have two copies of my service running. The two copies are running on two different machines.
For some reason,
machine two loses power.
This happens in the
real world all the time.
What happens then?
I need two copies of my service
running at all times in order
to serve all the traffic my service has.
But now that machine two is down,
I don't have enough capacity.
Ideally, I would want something to notice, hey, the copy on machine two is down.
I know machine three
has available resources.
Please start a new copy
on machine three for me.
So ideally, some component would have all this logic and do all this automatically for me, and this problem is known as failover. When real-world failures happen, we ideally want to be able to restart that workload on a different machine, and that's known as a failover.
So now let's look at this problem
from the caller's point of view.
The callers or clients of your service
have a different set of issues now.
So in the beginning, there are two copies of my service running--
there's a copy on machine one
and a copy on machine two.
The caller knows that it's on
machine one and machine two.
Now machine two loses power.
The caller still thinks that a
copy is running on machine two.
It's still going to send traffic there.
The requests are going to fail.
Users are going to have a hard time.
Now let's say there is some automation that knows, hey, machine two is down-- please start another one on machine three.
How's the client made aware of this?
The client still thinks the replicas
are on machine one and machine two.
It doesn't know that there's
a new copy on machine three.
So something needs to tell the client, hey, the copies are now on machine one and machine three.
And this problem is known as
a service discovery problem.
So for service discovery, the
question it tries to answer
is where is my service running?
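Here's a toy sketch of that abstraction: a registry that maps a service name to the set of endpoints where it's currently running. A real service discovery system is a distributed service in its own right, but the interface is essentially this.

```python
class ServiceDiscovery:
    def __init__(self) -> None:
        self._registry: dict[str, set[str]] = {}

    def publish(self, service: str, endpoint: str) -> None:
        self._registry.setdefault(service, set()).add(endpoint)

    def unpublish(self, service: str, endpoint: str) -> None:
        self._registry.get(service, set()).discard(endpoint)

    def lookup(self, service: str) -> list[str]:
        return sorted(self._registry.get(service, set()))

# Example: after a failover, the registry is what tells clients
# where to send traffic.
sd = ServiceDiscovery()
sd.publish("my-service", "machine1:8080")
sd.publish("my-service", "machine2:8080")
sd.unpublish("my-service", "machine2:8080")  # machine two lost power
sd.publish("my-service", "machine3:8080")    # replacement replica
print(sd.lookup("my-service"))  # ['machine1:8080', 'machine3:8080']
```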
So now another problem we might face
is how do you deploy your service?
So remember I said the
website is updated every hour.
So we have many copies of the service.
And it's updated every
hour, and users never
even notice that it's being updated.
So how is this even possible?
The key observation
here is that you never
want to take down all the replicas
of your service at the same time
because if all of your replicas
are down in that time period,
requests to Facebook would fail, and
then users would have a hard time.
So instead of taking them all down at once, one approach you might take is to take down only a percentage at a time.
So as an example, let's say I
have three replicas of my service.
I can tolerate one replica
being down any given moment.
So let's say I want to update my
containers from blue to purple.
So what I would do is take down one, start a new one with the new software update, and wait until that's healthy. And then once that's healthy and traffic is back to normal again, I can take down the next one and update that.
And then once that's healthy,
I can take down the next one.
And now all my replicas are healthy,
and users have not had any issues
throughout this whole time period.
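Here's a sketch of that rolling update loop. The helper functions are placeholders for real container and health-check operations; the point is the structure, which guarantees at most one replica is ever down.

```python
import time

def rolling_update(replicas: list[str], new_version: str) -> None:
    for machine in replicas:
        stop_container(machine)                # take one replica down
        start_container(machine, new_version)  # bring up the new version
        while not is_healthy(machine):         # wait before moving on, so
            time.sleep(1)                      # only one is ever down

def stop_container(machine: str) -> None:
    print(f"stopping container on {machine}")

def start_container(machine: str, version: str) -> None:
    print(f"starting {version} on {machine}")

def is_healthy(machine: str) -> bool:
    return True  # stand-in for a real health check

rolling_update(["machine1", "machine2", "machine3"], "v2")
```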
Another challenge you might hit
is what if your traffic spikes?
So let's say at noon,
the number of users
that use your app increases by 2x.
And now you need to increase the number of replicas to handle the load. But then at nighttime, usage decreases, and it becomes too expensive to run the extra replicas, so you might want to tear down some of the replicas and use those machines to run something else more important.
So how do you handle this
dynamic resizing of your service
based on traffic spikes?
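One simple way to frame that resizing decision, as a sketch: derive a replica count from the current request rate and an assumed per-replica capacity, with a floor so you always keep some redundancy. The numbers and the policy here are made up for illustration.

```python
import math

REQUESTS_PER_REPLICA = 1000  # assumed capacity of one replica
MIN_REPLICAS = 2             # keep headroom for failures

def desired_replicas(current_rps: float) -> int:
    return max(MIN_REPLICAS, math.ceil(current_rps / REQUESTS_PER_REPLICA))

print(desired_replicas(5400))  # noon spike -> 6 replicas
print(desired_replicas(800))   # nighttime  -> 2 replicas
```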
So here's a summary of
some of the challenges
you might face when you go
from one copy to many copies.
First, you have to
actually be able to start
multiple replicas on multiple machines.
So there needs to be something that coordinates that logic.
You need to be able to
handle machine failures
because once you have many machines,
machines will fail in the real world.
And then if containers are moving
around between machines, how are clients
made aware of this movement?
And then how do you update your
service without affecting clients?
And how do you handle traffic spikes?
How do you add more replicas?
How do you spin down replicas?
So all these problems
you'll face when you
have multiple instances in production.
And our approach for
solving this at Facebook
is we introduce a new component,
the Tupperware control
plane, that manages the lifecycle
of containers across many machines.
It acts as a central coordination
point between all the Tupperware
agents in our fleet.
So this solves the following problems.
It is the thing that will start
multiple replicas across many machines.
If a machine goes down, it
will notice, and then it
will be its responsibility to recreate
that container on another machine.
It is responsible for publishing service discovery information so that clients will be made aware that the container is running on a new machine,
and it handles deployments
of the service in a safe way.
So here's how this all fits together.
You have the Tupperware control
plane, which is this green box.
It's responsible for creating
and stopping containers.
You have this service discovery system.
And I'll just draw a
cloud as a black box.
It provides an abstraction where you give it a service name, and it will tell you the list of machines that the service is running on.
So right now, there are
no replicas of my service,
so it returns an empty
list for my service.
Clients of my service--
they want to talk to my replicas.
But the first thing they do is ask the service discovery system, hey, where are the replicas running?
So let's say we start two replicas
for the first time on machine
one and machine two.
So you have two containers running.
The next step is to update the service discovery system so that clients know that they're running on machine one and machine two.
So now things are all healthy and fine.
And now let's say
machine two loses power.
Eventually, the control
plane will notice
because it's trying to heartbeat
with every agent in the fleet.
It sees that machine two is
unresponsive for too long.
It deems the work on
machine two as dead,
and it will update the
service discovery system
to say, hey, the service is no
longer running on machine two.
And now clients will stop
sending traffic there.
Meanwhile, it sees that machine three has available resources to create another copy of my service.
It will create a container
on machine three.
And once the container on
machine three is healthy,
it will update the
service discovery system
to let clients know you can send
traffic to machine three now as well.
So this is how failover and service discovery are managed by the Tupperware control plane.
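Here's a toy sketch of that control plane loop: machines that miss heartbeats for too long are declared dead, their containers are recreated on machines with spare capacity, and service discovery is updated at each step. Everything here, from the timeout to the placement choice, is an illustrative simplification.

```python
import time

DEAD_AFTER = 60.0  # seconds without a heartbeat before a machine is "dead"

# service -> set of "machine:port" endpoints clients may talk to
registry = {"my-service": {"machine1:8080", "machine2:8080"}}
# machine -> services whose containers run there
placements = {"machine1": ["my-service"], "machine2": ["my-service"], "machine3": []}
last_heartbeat = {m: time.time() for m in placements}
last_heartbeat["machine2"] -= 120  # simulate machine two going dark

def reconcile(now: float) -> None:
    dead = [m for m, t in last_heartbeat.items() if now - t > DEAD_AFTER]
    alive = [m for m in placements if m not in dead]
    for machine in dead:
        for service in placements.pop(machine, []):
            registry[service].discard(f"{machine}:8080")  # clients stop sending
            # "Recreate" the container on the least-loaded live machine...
            target = min(alive, key=lambda m: len(placements[m]))
            placements[target].append(service)
            # ...and tell clients they can send traffic there.
            registry[service].add(f"{target}:8080")

reconcile(time.time())
print(placements)  # my-service now runs on machine1 and machine3
print(registry)    # {'my-service': {'machine1:8080', 'machine3:8080'}}
```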
I'll save questions for the end.
So what about deployments?
Let's say I have three replicas already running and healthy.
Clients know it's on
machine 1, 2, and 3.
And now I want to push a
new version of my service,
and I can tolerate one
replica being down at once.
First, let's say I want to update the replica on machine one. The first thing I want to do is make sure clients stop sending traffic there before I tear down the container.
So I'm going to tell the service discovery system to say machine one is no longer available, and now clients will stop sending traffic there.
Once clients stop
sending traffic there, I
can tear down the
container on machine one
and create a new version
using the new software update.
And once that's healthy, I can
update the service discovery system
to say, hey, clients you can send
traffic to machine one again.
And in fact, you'll be getting the
new version of the service there.
The process repeats for machine two.
We disable machine two, stop the
container, recreate that container,
and then publish that
information as well.
And now we repeat again
for machine three.
We disable the entry so that
clients stop sending traffic there.
We stop the container and
then recreate the container.
And now clients can send
traffic there as well.
So now at this point, we've updated
all the replicas of our service.
We've never had more
than one replica down,
and users are totally unaware that any
issue has happened in this process.
So now we are able to start many replicas of our service in production. We can update them, we can handle failovers, and we can scale to handle load.
And now let's say your web
app gets more complicated.
The number of features or products grow.
It gets a bit complicated
having one service,
so you want to separate
out responsibilities
into multiple services.
So now your app is now multiple services
that you need to deploy to production.
And you'll hit a different
set of challenges now.
So first, to understand the challenges, here's some background about Facebook.
Facebook has many data
centers in the world.
And this is an example
of a data center--
a bird's eye view.
Each building has many thousands of
machines serving the website or ads
or databases to store user information.
And they are very expensive to create--
so the construction costs,
purchasing the hardware, electricity,
and maintenance costs.
This is a big deal.
This is a very expensive
investment for Facebook.
And now separately, there are
many products at Facebook.
And as a result, there are many
services to support those products.
So given that data
centers are so expensive,
how can we utilize all
the resources efficiently?
And also, another problem we have is
reasoning about physical infrastructure
is actually really hard.
There are a lot of machines.
A lot of failures can happen.
How can we hide as much of this
complexity from engineers as possible
so that engineers can focus
on their business logic?
And so the first problem-- how can we effectively use the resources we have-- becomes a bin-packing problem.
And the Tupperware logo is actually
a good illustration of this.
Let's say this square
represents one machine.
Each container represents a different
service or work we want to run.
And we want to stack as many containers as will fit onto a single machine, to most effectively utilize that machine's resources.
So it's kind of like playing
Tetris with machines.
And yeah, so our
approach to solving this
is to stack multiple containers onto
as few machines as possible, resources
permitting.
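Here's a toy version of that bin-packing logic: a first-fit pass that places each container on the first machine with enough free memory and cores. Real schedulers weigh many more dimensions and constraints; the numbers here are made up.

```python
def first_fit(containers, machines):
    """containers: list of (name, mem_gb, cores);
    machines: dict of name -> [free_mem_gb, free_cores]."""
    placement = {}
    for name, mem, cores in containers:
        for machine, free in machines.items():
            if free[0] >= mem and free[1] >= cores:
                free[0] -= mem          # claim the resources
                free[1] -= cores
                placement[name] = machine
                break
        else:
            placement[name] = None      # nothing fits: need more capacity
    return placement

machines = {"m1": [16, 8], "m2": [16, 8]}
containers = [("web", 4, 2), ("ads", 8, 4), ("profile", 6, 3), ("db", 8, 4)]
print(first_fit(containers, machines))
# {'web': 'm1', 'ads': 'm1', 'profile': 'm2', 'db': 'm2'}
```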
And now our data centers are spread
out geographically across the world,
and this introduces a
different set of challenges.
So as an example, let's say we have
a West Coast data center and an East
Coast data center,
and my service just so
happens to only be running in
the East Coast data center.
And now a hurricane hits the East
Coast and takes down the data center.
Our data center loses power.
Now suddenly, users of that service
cannot use that service until we create
new replicas somewhere else.
So ideally, we should spread our
containers across these two data
centers so that when disaster hits
one, the service continues to operate.
And so the property we would
like is to spread replicas
across something known as fault
domains, and a fault domain
is a group of things
likely to fail together.
So as an example, a data center
is a fault domain of machines
because they're located
geographically together,
and they might all lose power
at the same time together.
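As a sketch of that spreading policy: always place the next replica in the fault domain that currently holds the fewest, so losing one data center never takes out most of the service. The domain names are illustrative.

```python
from collections import Counter

def pick_domain(domains: list[str], existing: Counter) -> str:
    # Choose the fault domain with the fewest replicas so far.
    return min(domains, key=lambda d: existing[d])

domains = ["west-coast", "east-coast"]
placed: Counter = Counter()
for replica in range(4):
    d = pick_domain(domains, placed)
    placed[d] += 1
    print(f"replica {replica} -> {d}")
# Replicas alternate between the two data centers: 2 west, 2 east.
```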
So another issue you might have is that hardware fails in the real world.
Data center operators need to
frequently put machines into repair
because their disk drives are failing.
The machine needs to be
rebooted for whatever reason.
The machines need to be replaced
with newer generation of hardware.
And so they frequently need to say, I need to take these 1,000 machines into maintenance-- but on those 1,000 machines, there might be many different teams' services running. Without automation, the data center operators would need to interact with all those different teams in order to have them safely move their replicas away before taking down all 1,000 machines.
So the goal is how can we safely replace
those 1,000 machines in an automated
way and have as little involvement
with service owners as possible?
So in this example, a single
machine might be running containers
from five different teams.
And so if we had no automation,
five different teams
would need to do work to move
those containers elsewhere.
And this might be challenging for teams
because sometimes a container might store local state on the machine, and that local state needs to be copied somewhere else before you take the machine down.
Or sometimes a team might not
have enough replicas elsewhere.
So if you take down this
machine, they will actually
be unable to serve all their traffic.
So there are a lot of issues
in how we do this safely.
So recap of some of the
issues we face here--
we want to efficiently use all
the resources that we have.
We want to make sure replicas
are spread out in a safe way.
And hardware will fail
in the real world,
and we want to make repairing hardware
as safe and as seamless as possible.
And the approach Facebook has taken
here is to provide abstractions.
We provide abstractions to make it easier for engineers to reason about the physical infrastructure.
So as an example, we can stack multiple containers on a machine,
and users don't need
to know how that works.
We provide abstractions that allow users to say, I want to spread across these fault domains, and the system will take care of that.
They don't need to understand
how that actually works.
And then we allow engineers
to specify some policies
on how to move containers
around in the fleet,
and we'll take care of
how that actually works.
And we provide them a high-level
API for them to do that.
So here's a recap.
We have a service running on our
local laptop or developer environment.
We want to start running
in production for real.
And suddenly, we have more traffic than one instance can handle, so we start multiple replicas. And now our app gets more complicated, so instead of just one service, we have many services in production.
And all the problems
we faced in this talk
are problems we face in the
cluster management space.
And these are the problems
my team is tackling.
So what exactly is cluster management?
The way you can understand
cluster management
is by understanding the
stakeholders in this system.
So for much of this talk, we've
been focusing on the perspective
of service developers.
They have an app they want to easily
deploy to production as reliably
and safely as possible.
And ideally, they should focus most
of their energy on the business logic
and not on the physical infrastructure
and the concerns around that.
But our service needs to
run on real-world machines,
and in the real world,
machines will fail.
And so data center operators-- what they want from the system is to be able to automatically and safely fix machines and have the system move containers around as needed.
And a third stakeholder
is efficiency engineers.
They want to make sure
we're actually using
the resources we have as
efficiently as possible
because data centers
are expensive, and we
want to make sure that we're
getting the most bang for our buck.
And the intersection of all these
stakeholders is cluster management.
The problems we face and the challenges
working with these stakeholders
are all in the cluster management space.
So to put it concisely, the goal of cluster management is to make it easy for engineers to develop services
while utilizing resources
as efficiently as possible
and making the services
run as safely as possible
in the presence of real-world failures.
And for more information about
what my team is working on,
we have a public talk from the Systems @Scale conference in July,
and it's talking about what
we're currently working on
and what we will be working
on for the next few years.
So to recap, this is my motto now--
coding is the easy part.
Productionizing is the hard part.
It's very easy to write some
code and get a prototype working.
The real hard part is to make sure it's reliable, highly available, efficient, debuggable, and testable so you can easily make changes in the future; that it handles all existing legacy requirements, because you might have many existing users or use cases; and that it can withstand requirements changing over time.
So let's say you get it working now.
What if the number just
grows by 10x or 100x--
will this design still work as intended?
So cluster management systems and container platforms make at least the productionizing part a bit easier.
Thank you.
[APPLAUSE]
