JAMES DUKE: Yeah, so we
wanted to spend a few minutes
this morning to share with you
some of the design and systems
that we have collectively
built at Twitter to enable
the security and identity-- and we'll
talk about our AAA strategy--
for the project to move Twitter's Hadoop
infrastructure into GCP.
So the aim of today
is, if you were
to go do a similar project,
we should provide you
with some inspiration--
hopefully mostly what to do,
or maybe a little bit of
what not to do as well.
But the idea is, if you were
to do something similar,
you would have a
good starting point.
So we're going to start off
with a little bit of an overview
of what is Partly Cloudy.
I see a few folks--
I see Derek in particular here
with the Partly Cloudy shirt
on.
And so it's a little bit
about what is the project.
We'll talk about
the architecture
that we've built and designed
and then some of the specifics
about the project's identity
and the much-maligned demigod
services.
So let's start.
Vrushali, what is Partly Cloudy?
VRUSHALI CHANNAPATTAN:
So Partly Cloudy
is a project at Twitter to
extend our data processing
platform from an on
premises-only model
to a hybrid model.
So we are not setting
it up such that you
can run part of your job
on premises and part of it
in the cloud.
It's more about running
your job either on premises
or having the ability to run it
entirely in GCP.
We started by deciding to
move our ad hoc compute
and our core data
storage to GCP.
And this project is not just
limited to the Data Platform
team at Twitter, but a
lot of teams at Twitter
have been very closely
involved with this.
And not just Twitter.
We've been working very
closely with Google on this.
We have an on site Google
presence at Twitter,
and we collaborated very closely
with Google for this project.
JAMES DUKE: So
with that in mind,
why did we do Partly
Cloudy, Vrushali?
VRUSHALI CHANNAPATTAN:
So let's take a look
at why we wanted to do
this project, right?
At Twitter, we've had
a long term desire
to have some cloud presence.
We wanted to be in the
cloud, but that
meant arriving at
a strategy that
would balance the
developer capabilities
and agility that come from
being in the cloud with
the associated cost
and the effort
that it takes to
move to the cloud.
Being in the cloud gives
us a very convenient way
to test changes at scale.
It gives us a way to
have business continuity.
You have redundancy.
And then, being in GCP means
we have access to so many
of the Google offerings,
like BigQuery,
Cloud ML, and anything new
that Google comes up with.
So hence, Partly Cloudy.
JAMES DUKE: Absolutely.
And so, as part of
that, we kind of
created a pretty strict brief
at the beginning of the project.
And specifically,
today, we wanted
to talk about what
we called the AAA
strategy-- the
authentication, authorization,
and audit capability.
And these were three things
that the Twitter team
were very, very clear on
from the very early days.
These were things which we had
to do when we actually rolled
out this project into Google.
And so, in the next
45 minutes or so,
we'll talk you through each one
of these, and all of the design
decisions, and all
of the complexity
that came out of that,
and how we actually
addressed that complexity.
So on the
authentication piece, we
wanted to make sure that,
any time a human being,
or any kind of service account
acting on behalf of a human
being, accesses data,
we can guarantee
that they are who they
say they are before they access
that data.
That we can guarantee
that that person
has the specific
authorization to access
that particular piece of data.
As you might imagine, there's
lots of data categories
that we're dealing with here.
So making sure that that
person, or that service account,
could only access that specific
thing was very critical.
And, finally,
after the event, we
needed to be able to say,
OK, who accessed what when?
In the event of an
incident, or even
in the course of the day-to-day
operations of the platform--
just to be able to
be certain that we
had good visibility into
what was actually happening.
And so let's talk a little
bit about the situation
before we did Partly Cloudy.
So Vrushali, do you want
to talk us through what
it looks like on premises here?
VRUSHALI CHANNAPATTAN: Sure.
So this is our
architecture before we
moved to Partly Cloudy.
This is a
representative picture.
We've shown that we
have more than one data
center on premises.
The boxes are not drawn to scale,
so don't read anything
into their sizes.
Anytime you see
more than one box,
we mean that we have more
than one such cluster, or more
than one such instance.
Let's start from the
left of the picture.
The little blue birds
that you see there,
those are the Twitter users,
who are interacting on Twitter,
maybe they're
tweeting, retweeting,
replying to somebody, having
a conversation on Twitter.
So all of these user
actions generate data.
Now this data is collected
by our log pipeline.
That's the yellow
rectangle in the picture.
Yesterday, my
colleague [? Lahood ?]
gave a talk about the whole data
replication system at Twitter.
So in case you missed it,
be sure to catch it online.
Now this log pipeline writes the
data to our real time clusters.
That is where the data
is collected, aggregated,
processed, and then, from then
on, it moves to the production
clusters, the ad hoc
clusters, and eventually
ends up on cold storage.
So this was our architecture
before Partly Cloudy.
Let's take a look
at how it changes
when we move to Partly Cloudy.
Now, as you see,
in this picture,
we have a third box that
is our Google Cloud setup.
Our ad hoc clusters from on
premises have moved to GCP.
Our data storage is now in GCS.
And we have introduced
a new cluster
on premises that is
the copy cluster.
This copy cluster has
the sole functionality
of replicating the data
from on premises to GCS.
We do not run general-purpose
jobs on this cluster.
The nodes on this
cluster have connectivity
to our internal systems,
as well as to GCS.
JAMES DUKE: And what was it--
why choose to do
the copy cluster?
What was the goal there?
VRUSHALI CHANNAPATTAN: Right.
So we wanted to restrict the
blast radius of how many nodes
on premises have
external connectivity
because all of the GCP
endpoints are public endpoints.
We wanted to make
sure that we have
a small set of controlled
hosts that have connectivity
to the external systems.
Now, this goes over a
dedicated bandwidth connection.
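To make that concrete, a replication run from the copy cluster could look roughly like the sketch below. This is our own illustration, not Twitter's actual tooling: it assumes the Hadoop GCS connector is installed so gs:// paths resolve, and the source path and bucket name are invented placeholders.

# Minimal sketch of a replication job run from the copy cluster.
# Assumes the GCS connector is installed so Hadoop understands gs:// paths;
# the paths and bucket name are placeholders, not Twitter's real ones.
import subprocess

SOURCE = "hdfs://prod-cluster/logs/category-a/2019/04/10"
DEST = "gs://logs.category-a.dp.twitter.domain/2019/04/10"  # hypothetical bucket

def replicate(source: str, dest: str) -> None:
    # DistCp does the heavy lifting; the copy cluster nodes are the only
    # on-premises hosts with connectivity to the public GCS endpoints.
    subprocess.run(["hadoop", "distcp", "-update", source, dest], check=True)

if __name__ == "__main__":
    replicate(SOURCE, DEST)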
JAMES DUKE: OK, so let's
dig in a little bit
then and see what that box
on the right, what it looks
like when we get into
the Google Cloud base.
So I want to start off with
the organization and project
structure.
We'll start at the beginning.
So what we built out was, we
created a Twitter organization.
And to be clear,
the names in here are
often placeholder names.
They're not necessarily
the real names.
We have an org-level
construct that is Twitter.
Within that, we have a number
of folders, one of which
is the data infrastructure
folder, which is the one where
all this magic happens.
And then, within that, we have
a number of different projects.
There are a number of what we
call pillar projects, which
are specific to the particular
areas of data processing
in Twitter.
And then we have a couple that
are kind of service projects
where-- we'll talk a
little bit about some
of those special cases
as we move forward.
So that's kind of the
project level structure.
What, then, do we have inside
each of these projects?
This is where the new
cloud native stuff really
starts to happen.
So, inside each of
those pillar projects,
we have a number of
pieces of infrastructure.
First off, we have what
we call the NES node,
the NES box, which is
essentially a bastion,
or a gateway machine.
So when users need to
access this platform,
this is the machine that they'll
actually be connecting to.
In addition, there are
a number of name nodes.
For those of you
familiar with Hadoop,
name nodes, resource
manager, and worker nodes.
Very standard makeup
of a Hadoop cluster here.
The access to the actual data
is where it gets interesting.
So the data, each one of
the log categories--
and we'll talk in a moment
about what each of those
means--
each of those
log categories is
stored in a particular bucket.
So here you see, on
the top right here,
the dataset
bucket-- that's where
the actual log data lives.
In addition, we have
the scratch bucket,
which you can think of a little
bit like temporary storage,
and a scrubbed bucket,
where log data that
has had any PII
scrubbed and removed
can be stored.
The worker nodes
are the ones that
directly access those buckets.
The exception is that,
in some cases,
users can access some
buckets directly,
but that's only for a very
narrow set of use cases.
The vast majority
of the access comes
from the tasks that are running
inside the worker nodes who
read that data out of
the storage buckets.
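As a rough illustration only (not Twitter's actual job code), a task on a worker node reading a partition straight out of the dataset bucket with the google-cloud-storage client might look like the sketch below; the bucket and object names are invented, and real jobs would normally go through the Hadoop/GCS connector instead.

# Rough sketch of a worker-node task reading log data directly from GCS.
# Bucket and object names are placeholders for illustration only.
from google.cloud import storage

def read_partition(bucket_name: str, object_name: str) -> bytes:
    client = storage.Client()  # authenticates as the node's active credentials
    blob = client.bucket(bucket_name).blob(object_name)
    return blob.download_as_bytes()

data = read_partition("logs.category-a.dp.twitter.domain", "2019/04/10/part-00000")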
Now, later on, we'll
talk a little bit
about how we do the
authentication around this
because, as you can
likely see from here, now
we've got a situation where you
have jobs which potentially are
running without the user there.
They might be
scheduled jobs, or they
might be jobs which take long
enough that a user wouldn't
want to sit in front of
the machine for five hours
to run that process.
And so you can see already
where some complexity
starts to come into this.
So up until now, this has
seemed fairly straightforward.
And that was our impression when
we started doing this as well.
As we got into it, we
found very quickly
that some complexity
comes in here.
And so we tried to build this
process with minimum complexity.
And every time we
see something where
we think, well, OK, hang on,
that's getting maybe
a little bit complex--
Let's see if we can
simplify and come back
to something which doesn't
involve so many moving parts.
So what we've
ended up with here,
which we're talking through
now, represents
the bare minimum
level of complexity
to solve the problem in
the way that we have.
VRUSHALI CHANNAPATTAN: So
one thing I want to call out
here is that we are running
our own Hadoop clusters in GCP.
The input and output
is in our GCS buckets.
But all of the VMs
are managed by us.
JAMES DUKE: Yeah, absolutely.
So each of the nodes
that you see here
is a virtual machine
in Google Compute
Engine that's managed by
the Data Platform team.
All right, so what
about the storage?
Vrushali, want to
talk us through that?
VRUSHALI CHANNAPATTAN: So
when we are replicating data
from on-premises from our
HDFS locations to GCS,
we wanted to make
sure we provide
our users with the same
unified experience.
On-premises, when our
users do data processing--
and when I say users, I mean
the data platform users,
not the Twitter users.
These are Twitter
employee users.
When they want to
do data processing,
they refer to an
on-premises path,
such as data center, cluster,
user, and then the path.
This is a logical path
on-premises as well.
This path gets mapped
to an actual HDFS
path by our UFS file system.
On GCS, when we wanted our users
to do data processing in GCP,
we wanted them to have a
very similar experience.
So we have this logical
cloud path /gcs/user/helen.
So the path remains the same.
They only have to
change the prefix
to indicate where they
want to run their jobs,
and where the data resides.
Now, this logical cloud path
maps to an actual bucket path,
with the gs://--
and you all might
recognize the bucket name.
The interesting thing to
note here is that all of our buckets
have prefixes and suffixes.
Prefixes are words like
user, logs, scrubbed.
Those indicate the type of
data that is in those buckets.
All our buckets are
dual-region buckets.
That's the actual bucket type.
But just by looking
at a bucket name, we
want to be able to determine what
kind of bucket this is.
Is this a user bucket?
Is this a logs bucket?
Which is why we have a
set of prefixes there.
The suffix is
dp.twitter.domain.
Now, twitter.domain is
a verified domain name
that we have verified with Google.
The reason is that bucket names in
GCS live in a global namespace.
We wanted to make sure that
our bucket names do not
collide with those of other GCP users
or other GCS customers.
Which is why we have a
verified domain name.
JAMES DUKE: Yeah, and
that gives you the ability
as well to quickly identify.
Like this is a Twitter
bucket, hey, this bucket
is a logs bucket.
And I think we wanted to do
this for a couple of reasons.
One was to maintain
a similar user experience.
And like you said,
when we say users here,
these are Twitter employees.
This is kind of an
internal service, right?
In Twitter employees'
workflows,
they are used to dealing with
these kinds of logical paths.
So we wanted to continue that.
Additionally, we wanted
to create a situation
where we could move
buckets, rename buckets.
We wanted to abstract the
implementation details
behind the scenes from
the user interface.
And that's kind of
another guiding principle
we've had through all
this has been-- let's
make it seamless for the user.
We haven't written that down
because it's somewhat evident,
and it's somewhat true for
every project that we do.
User experience
always comes first.
And so, to maintain a
good user experience here,
we didn't want to make
the user sit there
and type a gs:// prefix and a
complicated bucket path,
which could then change
at some point in time.
So we implemented
this virtual mapping,
which, again, it added more
complexity, because now we
have another layer
where we have the map
between the logical path and
the real GCS bucket path.
But again, it was complexity
which was worth the tradeoff
because it maintains
that user experience.
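To make that mapping concrete, here is a minimal sketch, written by us for illustration rather than taken from the real resolver, of how a logical /gcs/ path could be translated into a gs:// bucket path using the prefix-and-suffix convention described above. The exact bucket-name pattern is an assumption.

# Illustrative-only mapping from a logical cloud path to a GCS bucket path.
# The naming pattern mirrors the convention described in the talk; the real
# resolution is done by the path-mapping file-system layer, not this code.
DOMAIN_SUFFIX = "dp.twitter.domain"

def logical_to_gcs(path: str) -> str:
    # e.g. "/gcs/user/helen/output" -> "gs://user.helen.dp.twitter.domain/output"
    parts = path.strip("/").split("/")
    assert parts[0] == "gcs", "only /gcs/ paths are handled here"
    prefix, name, rest = parts[1], parts[2], parts[3:]
    bucket = f"{prefix}.{name}.{DOMAIN_SUFFIX}"
    return "gs://" + "/".join([bucket] + rest)

print(logical_to_gcs("/gcs/user/helen/output"))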
VRUSHALI CHANNAPATTAN: One
thing that I can add there,
like James said, we wanted to
move buckets, rename buckets
as we were working
on this project
because our design was evolving.
And we put forward
a request to Google
for a tool, or a utility,
that can move buckets
across GCP projects, or
that can rename buckets.
And Google has built this
tool and open sourced it.
It's on GitHub.
JAMES DUKE: Ah, the
famous bucket mover tool.
We are laughing because it's
been a fun project for us
to do.
And Google's professional
services with their partners
have written this project.
If you want to check it out,
search the Google Cloud GitHub
organization.
It is there.
So, moving then from
data now to users,
let's take a little bit
of a look about what
it looks like from a
user access perspective.
Now we're starting to
get really into the details of
this AAA strategy.
So the first thing to know
is, how did it used to work?
So, on-premises, Twitter
users are granted access
to the existing HDFS and Hadoop
infrastructure via their Unix
username, with Kerberos authentication
to actually access the platform
itself.
Now, Kerberos and Unix users
don't map all that well
onto Google's platform.
You can't grant a Unix
user access to a GCS bucket,
for example.
a similar thing in the cloud.
What we ended up on, after
many, many months
of different iterations
of how we should do this,
we landed on this
approach of using
what we call a shadow account.
So in this case,
a shadow account
is a Google service
account, which is nominally
mapped to a human user account
or an existing service account
in the on-premises system.
Which we've called
a headless user.
And so the way this
works is, it's actually
the shadow account
which is granted access
to the majority of the
resources in the platform.
And when the user needs to
perform some kind of access--
run a job-- it's actually the
service account that gets used.
Again, we wanted to make sure
this was a seamless experience.
The user doesn't actually
necessarily know.
VRUSHALI CHANNAPATTAN:
It's completely
transparent to the user.
JAMES DUKE: Yes, exactly.
We actually deliver the
credentials required
to authenticate to
that service account
to that user's Unix
username, and only
to that user's Unix username.
What that means is that the
user, when they run a job,
there's no login necessary.
They log in when they
access the system.
But there's no additional step
to generate the credentials
for access in the cloud.
That's all taken care
of in the background.
And in a few minutes,
we'll tell you
what the inevitable
complexity of that is,
and how we solved that.
But effectively, what we're
saying is, most of the access
comes via the shadow account.
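In practice, the job-side behaviour described here boils down to picking up a key file that the secrets system has already delivered for that Unix user and authenticating as the shadow account. The sketch below is a hedged illustration: the key path, project ID, and bucket name are invented, and in the real system the user never sees this step at all.

# Sketch of how a job could authenticate as its user's shadow account.
# Key path, project, and bucket names are invented for illustration; the
# secrets service places the key, and this is transparent to the user.
from google.cloud import storage
from google.oauth2 import service_account

KEY_PATH = "/var/secrets/shadow/helen.json"  # hypothetical per-user location

creds = service_account.Credentials.from_service_account_file(KEY_PATH)
client = storage.Client(credentials=creds, project="twitter-pillar-project")
# Subsequent GCS calls act as helen's shadow account, not helen herself.
for blob in client.list_blobs("user.helen.dp.twitter.domain", max_results=5):
    print(blob.name)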
Now, as I mentioned
earlier, there
are a few cases where a
user, a human user account,
may want to access some
data in the platform.
And so we do have,
in some cases,
the OAuth2 access granted.
Most cases, that's for what
we call the user bucket, which
is the output bucket
for jobs that users run,
and where they're storing
additional things, which
can be metadata and that kind of
stuff in the platform as well.
And then that maintains--
so Twitter is a G
Suite user as well.
And that maintains the access
to the G Suite platform
without changing it in any way,
which was another thing that we
needed to ensure.
VRUSHALI CHANNAPATTAN:
That way, other
Google offerings that
require G Suite access to
buckets would still
work for users.
JAMES DUKE: Exactly.
Another piece that we
haven't touched on,
but was a key
requirement here is
that there was a very, very,
very firm decision very early
on that we would not
store users' G Suite
credentials on these
shared machines.
Fundamentally, the nodes that
you saw inside the project,
they are shared resources.
This is a multi-tenancy system.
And we were very
clear from very early
on-- we're not going to store
the user's credentials there.
The reasoning being the risk
of leakage was too great.
When you have multiple
users all sharing a machine,
it can become tricky to
protect those credentials.
Twitter already had
a secrets system
in place, which we
were able to leverage
in order to actually
ensure some of that
without having to
reinvent the wheel.
Speaking of keys, key management
becomes a big part of this.
Now we have a situation
where the shadow
account is the gateway
to all this information.
So we better make sure
that we have good security
around that account.
Authentication for
a shadow account
is performed using a JSON key.
We generate a new key-- and
again, these are placeholder numbers.
We generate a new key every n days.
Then that key is valid
for 2n days.
In other words, if we were to
create a key which was valid--
and I'm choosing numbers made
up for demonstration purposes
here.
If we were to say the period of n
is 10 days, on day 1, the system
would issue that first key.
Right through to day 10 the
same key would be issued.
On day 11 we would
issue a new key.
And then that would be
valid-- that will continue
to be issued until day 20.
And then the same for
21 to 30, et cetera.
Each key, however, can still
be used for an additional n
days after we stop issuing it--
in this case, 10 more days.
So in this example, each key
would be valid for 20 days in total.
But new jobs would stop
receiving it after 10 days.
What that means is, it
gives the flexibility
to still rotate keys
quickly, but it also
means that longer
running jobs, which
might last more than
10 days, can still
execute in the background.
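Using the made-up numbers from this example (n = 10 days), the issuing and validity arithmetic can be sketched as below; this is our own illustration of the schedule just described, not the rotation service's real code.

# Sketch of the key schedule with the example numbers: a new key is issued
# every n days, and each key stays valid for 2n days so that long-running
# jobs launched near the end of an issuing window can still finish.
N_DAYS = 10  # illustrative value only

def key_generation(day: int) -> int:
    # Which key generation is handed out to new jobs on this day.
    return (day - 1) // N_DAYS + 1

def key_is_valid(generation: int, day: int) -> bool:
    first_issued = (generation - 1) * N_DAYS + 1
    return first_issued <= day < first_issued + 2 * N_DAYS

assert key_generation(10) == 1 and key_generation(11) == 2
assert key_is_valid(1, 20) and not key_is_valid(1, 21)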
As we mentioned,
they are distributed
by Twitter's existing
security service.
And it distributes the
keys only to the machines
that are going to need them.
So other machines,
for example, would not
have those keys on them.
And the key is only
readable by that user.
And it's actually obfuscated
from that user entirely.
Did I miss anything on the keys?
Good stuff.
VRUSHALI CHANNAPATTAN:
Yes, very exciting.
JAMES DUKE: Yeah, I
mean, what you're seeing here
is, like, a six-bullet-point
distillation of something which
took eight months, I guess,
to design and come
to an agreement on.
We went back and forth.
There's actually a few
other competing things.
A colleague of ours
that we work with
has just released a white
paper and a blog post
on a token broker
service, which we also
considered for this.
There are a number
of different ways
in which we can
solve this problem.
We ultimately
decided on this one,
for the reasons that
we've already discussed.
So what that means
is, we now have
a bunch of service
accounts, potentially
thousands of them.
So what we wanted to do is,
let's separate those out,
put them over in
their own projects
so that they're not
affected by changes
or anything else that we might
do in those pillar projects.
So we have a project
specifically for the Twitter
employee users which just
stores those service accounts.
At some point in time, we
may need more of those.
I'm optimistic that the
usage of the platform
will continue to grow.
And so at some point, we
may need to branch out
into additional ones.
But at least for now,
we have a single home
for those service accounts.
OK, so what do the users
actually get from the platform?
Vrushali?
VRUSHALI CHANNAPATTAN:
So, we saw what we've
created in GCP so far.
What is it that the users
themselves actually get?
The users get a shadow account
to access their data in GCS.
Now, this is still
transparent to the users.
The users do not
know, but they
have the capability to access
GCS via their shadow account.
They get a bucket
for their data in GCS.
They get access to a Hadoop
cluster, to a Presto cluster.
Basically, with the shadow
account and the GCS set up,
they can now run data
processing jobs in GCP.
Now, all of this is
done by our system.
For the users themselves,
it's like a platform
that has been set up for them to
do their data processing jobs.
And let's take a look at
who configures the set up.
Who does all of
this for the users?
JAMES DUKE: I think it's time
for a little therapy session.
VRUSHALI CHANNAPATTAN: Yes.
JAMES DUKE: So I'm very
conscious that this is
on camera and being recorded.
So I'm going to say
this for posterity.
Unfortunately, I'm responsible
for the name of these services.
When we were having this
discussion to begin with,
it quickly became clear
that we would need to have--
at the time, we were talking
about one application,
which would manage all of this.
And the concern,
from someone who
I won't name, but
is in the room here,
was we don't want to
have a godlike service.
And that makes perfect sense.
Because if you think about
everything that we just
described, there's a
lot of complexity there,
a lot of moving parts.
So it does make sense to break
it up into different things.
And I unfortunately said,
well, it will be a little bit
like a demigod service, then.
And the name stuck.
So I apologize.
It's not the best name in the
world, but it's what we have,
and what we've been
using for a while now.
So what are the
demigod services?
Well, it is a set of
services that's specifically
written and built by
Google and Twitter
together to actually manage
all of that complexity we just
talked about-- to rotate those
keys, provision those buckets,
manage those users.
And they all run inside GCP.
Do you want to talk
to us a little bit
about what some of the
features of those demigods are?
VRUSHALI CHANNAPATTAN:
So when we
started thinking about
these demigod services,
we had a few key design
considerations in mind.
We wanted them to run
asynchronously of each other.
We wanted them to run
independently of each other.
We wanted them to run with
exactly scoped privilege
accounts.
So like James mentioned,
we wanted these services
to have admin privileges,
but not a huge blast radius of admin
privileges-- a very restricted
scope of admin privileges
granted to each demigod service.
We wanted their runs to
be idempotent.
So even if they run
again and again,
they reset the system's
state to what is expected.
The demigod runs will
override any manual changes,
and that was something
that we wanted.
Because we have thousands of
buckets, thousands of user
accounts to manage.
And we wanted to make sure that
whatever setting exists in GCP
is what we expect it to be.
We wanted the demigods,
the code itself
to be modular in design so that
we can maintain each service
independently of each other.
We know exactly what
each service is doing.
And we wanted it to be
as simple as possible.
That is the key.
JAMES DUKE: There's
a diagram coming up
where you can see the
level of simplicity
we managed to achieve.
It's not what it could be,
but it is a good compromise.
And so I wanted to come back
to the idempotent piece again
for a second because I
think that's incredibly
important for this process.
Because all these services,
they are controlling access
to what can be fairly
sensitive data.
And so we wanted to
make sure that it's
a very predictable, repeatable,
and reliable process.
If we had a situation where
two different runs of the tool
could end up with two
different outputs,
we would be in a
really bad place.
Because it would be almost
impossible to maintain.
It would be impossible
to troubleshoot.
So we really wanted to
make it very declarative.
And in a moment you'll
see, most of the inputs
that we give to these tools
are of a declarative nature.
It's LDAP, or it's YAML that
actually specifically codifies
what the tool is going to do.
And the tools will do the
same thing with the same input
every time.
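A hedged way to picture what declarative and idempotent mean here: each demigod is essentially a reconcile loop that reads desired state (LDAP, YAML) and converges the actual GCP state toward it, so re-running it with the same input changes nothing. The sketch below is our own simplification, not Twitter's code.

# Simplified reconcile-loop shape shared by the demigod services (our own
# simplification). desired_state() reads LDAP/YAML and returns a set of
# resource names; actual_state() reads GCP; the diff drives the actions,
# so re-running with the same inputs is a no-op.
def reconcile(desired_state, actual_state, create, delete):
    desired = desired_state()   # e.g. buckets that should exist, per config
    actual = actual_state()     # e.g. buckets that do exist in the project
    for name in desired - actual:
        create(name)
    for name in actual - desired:
        delete(name)            # or flag for review, depending on the service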
OK, so let's talk about
the actual demigods.
We're going to talk about four
demigod services today.
Let's start off with
bucket creation.
Vrushali, what does the
bucket creation demigod do?
VRUSHALI CHANNAPATTAN: Like the
name says, it creates buckets.
It's responsible for the overall
bucket management as well.
This demigod, we wanted it to
run one per pillar project.
So if we have a demigod service
for one of our GCP projects,
that bucket creation
demigod has the ability
to create buckets only in
that particular GCP project.
So it's restricted access scope.
Almost all of our demigods
have configurable inputs
that they can take.
All of our demigods
look at an LDAP input.
This LDAP input tells
them information
about the bucket ownership,
where to create the bucket.
And there is a
YAML input as well
that gives it
additional information
about the different
configuration options
for that particular bucket.
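As a minimal sketch of the creation step, assuming the google-cloud-storage client: the bucket name follows the convention described earlier, and the dual-region location shown here ("NAM4") is a placeholder for whatever region pair the configuration specifies, not necessarily Twitter's choice.

# Minimal sketch of the bucket-creation step (placeholder names/locations).
from google.cloud import storage

def create_dataset_bucket(project_id: str, dataset: str) -> None:
    client = storage.Client(project=project_id)
    bucket = client.bucket(f"logs.{dataset}.dp.twitter.domain")
    # "NAM4" is one of GCS's predefined dual-region locations; the real
    # choice of region pair is a configuration input to the demigod.
    client.create_bucket(bucket, location="NAM4")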
JAMES DUKE: We also have the
shadow account management
demigod service.
So this is the service that's
responsible for provisioning
and deprovisioning the shadow
accounts that we discussed.
So it notices when a user
has been granted access
to the platform and will
create that service account,
that shadow account,
and provide it
with the correct IAM roles
on the basis of group
membership, which we'll
talk about in a moment.
Again, the inputs are LDAP and YAML.
And again, in this
case, there is
one of these across all
of the pillar projects.
The reason being,
we have one service
account provisioning project.
So it doesn't really
make sense to have
individual ones of these because
it's managing something that's
a single point for everything.
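For the provisioning half of that, a hedged sketch using the IAM service-account API is below. The project ID and account-naming scheme are assumptions for illustration, and the real demigod also handles deprovisioning and the Google Groups and IAM wiring described next.

# Sketch of provisioning a shadow service account for a new platform user.
# Project ID and account-naming scheme are invented for illustration.
from googleapiclient import discovery

def create_shadow_account(ldap_user: str) -> dict:
    iam = discovery.build("iam", "v1")
    return iam.projects().serviceAccounts().create(
        name="projects/twitter-user-accounts",   # hypothetical SA project
        body={
            "accountId": f"shadow-{ldap_user}",
            "serviceAccount": {"displayName": f"Shadow account for {ldap_user}"},
        },
    ).execute()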
We also have bucket
creation, which
is what we just talked about.
We also have policy management.
If you want to talk
us through that one?
VRUSHALI CHANNAPATTAN: Sure.
So this particular
set of services
is responsible
for setting the IAM
policies on all of our buckets.
This, again, runs one
per pillar project
so that it has its access
scoped to that project.
This demigod service
interacts with Google Groups
as well because it needs
to give access to a bucket,
and that is scoped
to a Google Group.
So IAM policies are
set for a Google
Group for a particular bucket.
We also made sure we added
an ignore list option.
I want to call this out.
So let's say you're
debugging a problem.
There's some situation that
you want to take care of.
And you don't want the
demigods to come and override
some settings that
you've done while you
were debugging things, or while
you were taking care of things.
So we have an ignore list
that you can add a bucket to,
which tells the demigod
service: do not
reset the bucket permissions
for this particular bucket.
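A hedged sketch of that policy step follows: grant a Google Group a reader role on a bucket, skipping anything on the ignore list. The group name, role, and ignore-list format are illustrative assumptions, not the real configuration.

# Sketch of the IAM-policy demigod's core action (illustrative names only).
from google.cloud import storage

IGNORE_LIST = {"logs.incident-debug.dp.twitter.domain"}  # hypothetical entry

def grant_group_read(bucket_name: str, group_email: str) -> None:
    if bucket_name in IGNORE_LIST:
        return  # leave manually adjusted buckets alone, as described above
    bucket = storage.Client().bucket(bucket_name)
    policy = bucket.get_iam_policy(requested_policy_version=3)
    policy.bindings.append({
        "role": "roles/storage.objectViewer",
        "members": {f"group:{group_email}"},
    })
    bucket.set_iam_policy(policy)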
JAMES DUKE: Exactly.
And then finally that we
want to talk about today,
we have the key lifecycle
management service.
So we abstracted this away from
the shadow account creation
service and instead run
it as its own service.
Again, this actually
executes what
we talked about earlier
of the key rotation.
It creates keys.
It expires those keys
and rotates those keys.
The other thing that it does,
though, which we didn't really
touch on, is it actually
pushes those keys
into the secrets system
that Twitter has.
So it will actually, once
it's generated that key, it
will grab that key and then
push it into the key store.
And again, one of its inputs is
the destination for the keys.
So it needs to know an
endpoint to push to.
And it'll watch LDAP,
and it will recognize when
those shadow accounts exist.
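A hedged sketch of the create-and-push step is below: the key material is decoded in memory and handed straight to the secrets system, never written to disk. The push_to_secret_store callable is a placeholder, since that key-distribution service is Twitter-internal and not a public API.

# Sketch of key creation and hand-off to the secrets system. The key bytes
# stay in memory; push_to_secret_store() stands in for Twitter's internal
# key-distribution service and is not a real API.
import base64
from googleapiclient import discovery

def rotate_key(shadow_account_email: str, push_to_secret_store) -> None:
    iam = discovery.build("iam", "v1")
    resp = iam.projects().serviceAccounts().keys().create(
        name=f"projects/-/serviceAccounts/{shadow_account_email}", body={}
    ).execute()
    key_json = base64.b64decode(resp["privateKeyData"])  # held in memory only
    push_to_secret_store(shadow_account_email, key_json)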
So we wanted to talk a little
bit about how we actually
put these things into
practice, and what
some of that deployment
would look like.
And I promised you
a diagram which
demonstrates the level of
simplicity that we achieved.
Well, here it is.
As simple as possible.
This is a complicated thing
that we're doing here.
Did you want to talk us through
the complexity here, Vrushali?
VRUSHALI CHANNAPATTAN: Yes.
So we have not shown all
of our demigod services,
or all of our projects--
it's not a
comprehensive diagram.
But it's a
representative diagram.
We would have run out of
space and arrows otherwise.
So all our demigods run
in an admin project.
This is a
separate GCP project.
We have our other data
processing pillar projects,
and the user project.
The demigod services
run in the admin project,
but are scoped to work
on those target projects.
And we have separate instances--
for example, a bucket creation
demigod for one pillar project,
and another bucket creation
demigod for a different project.
We have the
policy-granting demigod.
Starting from the
left in the picture,
you'll see some LDAP groups
and Google Group interactions.
The demigods are basically
set up to look at LDAP groups
as one of their inputs.
They interact very closely
with Google Groups.
We use the Google Groups API,
and we add users or shadow
accounts to Google Groups.
We set IM policies on our
buckets using Google Groups.
So we use Google
Groups extensively.
Our configuration input
goes into a bucket itself.
So, the demigod services,
whenever they start up,
they look at this
config bucket.
And for every run, if the config
changes, the run will change.
We have a key store, a secret
store that Twitter manages.
This is a Twitter key
distribution service.
And that is responsible for
moving the keys to whatever
nodes they need to be on.
The destinations can be set.
And this is our own system.
This is a similar thing that
we have on-premises as well.
So our users are used
to that interface.
Let's take a look at some of
the deployment considerations
that we had.
All of the demigod
services run in GCP itself.
They run on VMs.
The VM themselves run
with a service account.
Now this service
account is, again,
created in the admin
project itself.
And that service
account is scoped
to what it needs to
have, the privileges
that it needs to have.
The project itself has
limited SSH access.
So only admins can access it.
The reason we've separated
it into a separate project
is, we wanted to have a clear
delineation of what resources
we are using for our admin work.
What is it that
we need to set up?
We can control our access
to that particular project
accordingly.
The processes on those VMs run
as Kerberized headless users.
One important requirement that
we had for the key creation
service is that it should
not write the keys to disk.
We do not want to be
leaking keys at any point.
So the demigod service
holds it in memory,
and then writes it
to the secret store.
JAMES DUKE: Yeah,
and we've kind of
mentioned this a
couple of times,
but it's worth calling out.
One of the key parts
of this design,
and the reason we were able
to do this effectively,
was that we had this secret
management system already.
All of this depends
upon there being a way
to reliably and securely
distribute those keys where
they need to be.
So we were fortunate when
we began this project
that Twitter already had
one built and rolled out.
Had it not been
the case, we might
need to use something like
Vault or some other method
of distributing those keys.
And we looked into doing that
as well at the beginning.
And that comes with a
different set of challenges.
So for those of you who were
thinking about doing something
similar, that is a key
part that we haven't really
touched on too much today
that needs to get solved,
how you actually
distribute those keys.
OK, shall we look a little
bit at what happens
when these things actually run?
We told you why we have them.
We've designed them.
We've built them.
And now we're going to
actually run them and see
what it looks like.
So what happens when a user
joins an LDAP pillar group?
So the first thing that
happens is, a shadow account
is created for them.
So the LDAP pillar groups,
they are indicators
to the system that this
particular human being has
become a part of the team that
needs access to this service
or has been, for some reason,
granted access to the service.
Once that happens, the shadow
account creation service
will notice and will
provision that shadow
account for the user.
It will also create
a GCS user bucket
for that particular user.
That will be the bucket
creation service.
The key service will then
generate a key for that user
and add it to the
secrets system.
And then that secrets system
will distribute the keys
accordingly to the right
places in the infrastructure.
So each one of those things
is separate from one another.
They don't really-- they do
somewhat depend on each other.
But it doesn't have to
happen in order like this.
It would just be the next
time that service runs that it
will pick up from the changes.
Each one is-- think
of it like a flow.
One happens after the
other, but it doesn't matter
if it is exactly sequential.
What about when a new
data set is added?
Vrushali, what does
that look like?
VRUSHALI CHANNAPATTAN: So
our data set storage on GCS
is based on a data set
ownership model, like we saw.
So how do we ensure that
new data sets that get added
are actually getting
their buckets in GCS?
This is an event driven system.
Whenever a new data
set's information
gets added on-premises,
that information
gets replicated to a YAML
file in a config bucket.
Now this config
bucket is actually
one of the inputs to the
bucket creation demigod.
So once the demigod sees that
there's a new input there,
it looks up the
ownership LDAP, does
the corresponding ownerships
in settings in Google Groups,
and creates a bucket.
It also creates a
corresponding set
of scratch and scrubbed
buckets that we
require for our processing.
The IAM policy demigod will
then grant privileges.
These ownership--
or these read write
settings are, again,
a configuration
input to the demigod service.
So the demigod service is
more like a rules engine.
It takes a set of
inputs, and then it
has a set of actions
that it does.
Once the input changes,
it notices that,
and it performs the
corresponding action.
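To illustrate that event-driven part, again as a simplification with invented file layout, field names, and bucket names rather than the real config schema, the bucket-creation demigod's view of new dataset entries in the config bucket could look roughly like this.

# Simplified view of dataset onboarding: read the replicated YAML specs
# from the config bucket and derive the dataset/scratch/scrubbed buckets
# to create. File layout and field names are invented for illustration.
import yaml
from google.cloud import storage

def buckets_for_new_datasets(config_bucket: str) -> list[str]:
    client = storage.Client()
    wanted = []
    for blob in client.list_blobs(config_bucket, prefix="datasets/"):
        spec = yaml.safe_load(blob.download_as_bytes())
        name = spec["dataset"]
        for prefix in ("logs", "scratch", "scrubbed"):
            wanted.append(f"{prefix}.{name}.dp.twitter.domain")
    return wanted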
JAMES DUKE: And so the idea
here is, with the exception of--
I as an owner-- or it's not me--
Vrushali, as an owner or
administrator of the data info
platform, has one
step to add a user.
We add them to that group.
And there's some governance
around that as well.
But it's basically
one manual component.
Then everything else triggers
out and flows automatically
from there.
And so all those
demigod services
that are all operating
in the background
go into creating this system so
that the user can very quickly
and very seamlessly access it.
And we've actually
rolled this out.
It's there today.
And most importantly, it works.
Yes, it works.
We actually have users
using this platform
with the demigods running,
with them automating access.
And what we found is-- it's
still fairly early days.
But what we've found
so far is, we've
found a couple of bugs and
wrinkles we have to work out.
But all in all,
so far, we've seen
it working
pretty effectively.
Which is pretty good, right?
We've been working on this
for a couple of years.
It's good to have it
finally rolled out.
And so what we've seen is
that, now those users--
and we're actually seeing
people come to the team
and ask for access, right?
VRUSHALI CHANNAPATTAN: Yes.
And it's, like I said,
it's the same experience
that people have on-premises.
They join in LDAP groups.
They make changes to
their config files.
Those get checked into
our source repository.
So it's the same experience.
It's just that things,
again, happen on GCP.
JAMES DUKE: Exactly.
The user doesn't know
that it's in the cloud.
But it is in the cloud.
All right.
Well, with that, both
of us are hiring.
A quick chance for a plug.
If you're interested,
come see us afterwards
or visit these websites.
And with that, we have
about 10 minutes left.
Does anybody have any questions?
[MUSIC PLAYING]
