[MUSIC PLAYING]
KARAN BHATIA: Good
afternoon, everybody.
My name is Karan Bhatia.
I'm with Google.
And I lead the scientific
computing efforts
for Google Cloud.
Thank you for coming.
And the title of this session
is Scientific Computing
with Google Cloud Platform--
Experiences from the Trenches
in Particle Physics and Earth
Sciences.
For computational scientists,
this is a really exciting time.
They have the ability to leverage an almost infinite set of resources.
They have the ability to
access new processors,
new accelerators, and,
of course, new software
tools and platforms to
really take all that power
and put it at the
fingertips of researchers.
With Google's global network
and 17 regions distributed
around the world, it's no
wonder that many of the largest
scientific projects are looking
to leverage the Google Cloud
infrastructure for their
operations as they scale.
We've been working with a number
of the different scientific
computing projects,
including the CERN ATLAS
team, as they build
their infrastructure
to enable their
scientists to make
the next set of
scientific breakthroughs.
Today, we have Mario
Lassnig from CERN.
He's the ATLAS data manager.
He'll be here talking
about some of the efforts
that they have in
integrating the Google Cloud
infrastructure and services
into their existing data
and compute systems.
But Google Cloud is not just
for the largest projects.
In an effort to
make it accessible
to all scientific
projects, a few years ago,
we made a partnership with the
National Science Foundation
and provided $3 million
in Cloud Credits
for exceptional
research proposals
in the area of big data.
The proposals were reviewed through the same process the National Science Foundation uses for its other programs. And those that were funded had access to a significant amount of Google Cloud credits that they could use to enable their science.
One of those projects that
was funded by this program
was the Pangeo project.
This was a collaboration
between Columbia University,
the National Center for
Atmospheric Research,
and Anaconda, the company
that provides a lot
of scientific Python tools.
The focus of the Pangeo project is to build a science platform where earth scientists can do their science while, underneath, they have all the cloud capabilities with Kubernetes, Google Cloud Storage, and the other services.
Today, we have Dr. Matthew
Rocklin from Anaconda
here to show that
platform and to explain
how they're leveraging Google
Cloud in order to build it.
I should also point
out that Google
provides a fairly robust set
of services in the free tier.
So by signing up
with Google Cloud,
you get $300 worth
of credit in addition
to the core capabilities within
the free tier of the services.
In addition to that, for
academic researchers,
if you need more resources
than the free tier provides,
we have a research
credits program
where we will provide
additional credits
for the scientific
community as they
are building and developing
their scientific applications.
So please join me in
welcoming Dr. Matthew
Rocklin from Anaconda to talk
about the Pangeo project.
[APPLAUSE]
MATTHEW ROCKLIN: Thank you.
Hi, everyone.
My name is Matthew Rocklin.
I work for Anaconda.
So Anaconda is a
for-profit company
behind a lot of the
open-source Python stack.
Today, I'm also representing
a broader collaboration
that includes a
lot of scientists
and infrastructural
institutions.
Originally, Columbia and
NCAR, but now a dozen
other organizations.
So there we go.
So our goal in this particular collaboration is to help many scientists analyze large earth science data sets cheaply.
And we'll dive into those
few words in a moment.
So to give a quick example, this is a video of a simulation of the earth's oceans. It was created on a supercomputer at an institution in Colorado. And it is actually quite scaled down.
You can see a lot of
interesting detail here.
You can see the tendrils
coming off of Japan there.
You can see tides everywhere
there's an enclosed area.
This is scaled down; the full simulation is about two petabytes in total.
And this kind of data helps oceanographers understand how the oceans work, which in turn helps us understand things like erosion, and how heat moves throughout the earth, which drastically affects things like our climate.
So we simulate our planet, like we saw in the video.
We also observe it with
things like satellites
in the sky or
land-based systems.
This is critical
for many fields,
like oceanography,
also climate science.
Other uses are more immediately pragmatic, things like weather prediction-- will there be a hurricane on the Gulf in the next week-- or other domains like agriculture or insurance.
Is there going to be
a drought next year?
How will that affect crops?
How will that affect regional
instability in northern Africa?
However, once we have those large data sets-- sorry, come back here for a minute-- these are petabytes each for the simulations. NASA has missions that produce somewhere between one and tens of petabytes per year coming in as satellite imagery.
Once we have those data sets,
we now need to analyze them.
Typically this is done by
querying some large data
center, pulling off a small bit
of data to your local laptop,
and then using some
of the natural Python
tools like NumPy,
pandas, and scikit-learn
interactively with
Jupyter Notebooks.
This is the workflow that
we're really focused on.
We're focused on the analytics
side of these large earth
science data sets.
There are some challenges here.
The first is that
the data sets are
different from what normal
big data tools are used to.
These are large, gridded data sets.
Imagine the temperature,
and pressure,
and wind speed
everywhere on the earth,
in many different
altitudes over time.
That data set doesn't fit into a database. It's structured differently from what you normally find in most BI tooling.
It's very, very natural to use tools like NumPy, pandas, or xarray, which are designed for these gridded data structures.
But those tools have
traditionally not scaled well.
There's a mismatch.
We either have
scalable, big data tools
that are designed for
tabular data sets.
Or we have more
scientific tools that
are designed for
small amounts of data.
So one solution to that problem, and what I particularly work on, is a tool called Dask, which is a library for parallel computing in Python. Dask can use other Python libraries like NumPy and pandas to produce scalable variants of those libraries. It allows you to scale up the Python ecosystem to the 100-terabyte scale fairly naturally.
We'll see an example
of that in a moment.
The other challenge we
face is increased access.
So some people have access
to these supercomputers,
some people have access
to these data centers,
but relatively few.
You can imagine there might be a graduate student in the Philippines with fairly low resources, both in terms of global connectivity and finances. He can't download the petabytes from Colorado. He also may not have the financial resources to buy that kind of computing himself.
And so we need to create services that allow them to run their computations remotely, bringing the users to the data.
This is a problem that
many organizations face.
We also face it.
And we're going to solve
our problem with JupyterHub.
Some people may be
familiar with Jupyter.
And JupyterHub allows
an administrator
to easily provide Jupyter
instances to many people
within an organization.
And we're going to see an
example of that as well.
So I'm going to show an example
here of pangeo.pydata.org.
And if you want to, you
can go to this website
now, pangeo.pydata.org.
This is a JupyterHub
deployment running
on Google Cloud Platform.
You can sign in with
GitHub credentials.
And when you sign in, you have
the option to create a server.
We provide you with your own
personal Jupyter Notebook
server.
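Under the hood, a JupyterHub configuration along these lines wires GitHub sign-in to per-user servers on Kubernetes. This is a hedged sketch, not the actual Pangeo deployment (which is installed via a Helm chart); the image tag is hypothetical.

```python
# jupyterhub_config.py -- minimal sketch
# GitHub OAuth (client ID/secret come from the GitHub OAuth app settings).
c.JupyterHub.authenticator_class = "oauthenticator.GitHubOAuthenticator"
# Spawn each user's notebook server as its own Kubernetes pod.
c.JupyterHub.spawner_class = "kubespawner.KubeSpawner"
c.KubeSpawner.image = "pangeo/notebook:latest"  # hypothetical image tag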
Let's skip forward a little bit.
As we wait here, Google Cloud Platform is actually spinning up a node for us somewhere on the cloud. That will take a little bit.
And we're dropped here
into a JupyterLab session.
So JupyterLab is
the next generation
of the Jupyter Notebook.
It has the normal Notebook interface on the right that you've seen.
It also has things
like a file browser.
You can open up a terminal.
You have a full Linux
workstation here.
And you can do things like manage your environment with your [INAUDIBLE] libraries. And this is very, very nice on a personal machine. I use it daily. But it is very, very valuable when you're on some remote system, where you need to manage that system and keep it persistent.
So the combination of JupyterLab
and JupyterHub we found
has been a really good tool to
give a variety of people access
to a remote system.
And again, you can do this now
if you go to pangeo.pydata.org.
So here, let's do a little bit of science. We're in this notebook.
And we're going to
import some libraries
like NumPy and xarray.
And we're going to look at some
data stored on Google Cloud
Storage.
We're going to look at
the sea surface height
of a simulation over time.
We can see here we've opened up the data with xarray. Xarray is a mix of NumPy and pandas together.
You have a bunch of
different variables.
One of these variables is the sea level altitude, which is sizable.
It's not huge.
And it's chunked up into various
blocks on Google Cloud Storage.
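Opening a chunked dataset like this looks roughly as follows; the bucket path and variable name here are hypothetical stand-ins for the demo's public Pangeo data.

```python
import gcsfs
import xarray as xr

# Read a Zarr store straight from Google Cloud Storage, anonymously,
# since the data is public (bucket path is a hypothetical example).
fs = gcsfs.GCSFileSystem(token="anon")
store = fs.get_mapper("pangeo-data/example-ocean-simulation")
ds = xr.open_zarr(store)
ssh = ds["sea_surface_height"]  # lazily loaded, chunked like the store
print(ds)
```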
So the data set is too large to
work with on a single machine.
So we're going to use Dask
and Kubernetes to give us
more computational power.
We're going to ask for,
I think, 40 workers.
And then again, we're
using Google Cloud Platform
to rapidly scale this
user's environment
to now include many, many more
machines on that same cluster.
You're going to have
to connect to that.
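The cluster request in the demo looks roughly like this with the dask-kubernetes library; "worker-spec.yml" is a placeholder name for the pod template that defines each worker.

```python
from dask_kubernetes import KubeCluster
from dask.distributed import Client

# Spin up Dask workers as Kubernetes pods from a YAML worker template.
cluster = KubeCluster.from_yaml("worker-spec.yml")
cluster.scale(40)          # ask for 40 worker pods, as in the demo
client = Client(cluster)   # connect this notebook to the cluster
```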
And then we're going
to open up some data.
So we're going to
open up this data set
and look at just one day of it.
So this is showing us the
sea level altitude over time,
over the globe at
fairly fine resolution.
And so what you can see here--
it's a little washed
out here, unfortunately.
What you can see here is--
let's see-- this is the earth.
And you might be able to see
the continents on the left.
There's Africa.
And there's some wiggles
coming off of Africa.
And the red and the
blue are differences
in altitude of the oceans.
So you see areas where
the ocean might be higher,
it might be lower.
If you live in a low-lying
area like the Netherlands
or the Philippines, you
might be very concerned
about that altitude.
That data set's
around 70 gigabytes.
It's not huge.
But it's bigger than what we'd
want to work over a laptop.
We're going to do a few computations with that data. We're going to aggregate over the spatial dimensions to get an average over time; aggregate over longitude, so we keep both time and latitude; and take an average over time, keeping the spatial dimensions.
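In xarray terms, those three aggregations are one-liners. This sketch assumes the ssh variable from the earlier snippet has dimensions named time, latitude, and longitude (the real dataset's names may differ).

```python
# Assumed dims: (time, latitude, longitude) on the ssh variable from above.
timeseries = ssh.mean(dim=["latitude", "longitude"])  # global mean vs. time
by_latitude = ssh.mean(dim="longitude")               # time x latitude
spatial_mean = ssh.mean(dim="time")                   # map averaged over time
timeseries.compute()  # only now does the cluster actually do the work
```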
Before we do the
computation, we're
going to open up
the Dask dashboard
so we get a sense
of that computation
and what's happening.
So this is telling us the activity of the cluster we're running with.
So again, Dask is going
to give us the ability
to scale standard NumPy
and pandas-like workloads
onto many machines on a cluster.
Here, we're defining
some of those workloads
with the xarray library, which
is useful for geosciences.
And now we're computing that.
So the busy plot on the
right shows the activity
of all of the cores
of our cluster.
Every horizontal
line corresponds
to the activity of one core.
And over time, you
can see they're
working on different things.
So one color, like
the green, might
correspond to loading data
from Google Cloud Storage.
Another color, like
the light green,
might correspond to
computing some aggregation.
Red is communication
between nodes.
And so what we've done
is we've broken down
this large computation into
many thousands of small NumPy
computations.
And then Dask and
Kubernetes are working
through all those
computations for us.
And what that gives us now-- we've computed these three different data sets. We're going to look at the aggregations one by one.
First, we're going to
expand to full screen
so we get the full image.
And so we've taken all
the spatial dimensions.
And we're now just
looking at time.
And we're going to see
two different things.
One, we can see some
oscillation seasonally.
The Earth breathes a
little bit with the oceans.
You also see a steady upward trend. So sea level is rising.
And what's really nice here--
I'm going to pause
this just for a moment.
What's really nice here
is that we didn't actually
need to rely on an agency
or trust any organization
to give us this analysis.
All of the raw data was
available to us on the cloud.
It's publicly available.
And we were able to, with
our own computations,
go to that data set and do
that analysis on our own.
It removes any sense of
politics from the conversation
of whether or not sea
levels might be rising.
So it really democratizes the ability of anyone on the globe to analyze these sorts of data sets.
If we take that same analysis
and throw in latitude--
the slow part here is actually
rendering the image, not
the computation.
So if we throw in
latitude, we can
see how sea level rise will
change over time on the x-axis,
but also how it
varies over latitude.
You can see, for example, there's a strong difference here at 37 degrees latitude, which is actually, coincidentally, right where we are.
So the sea level rise
variation increases
quite a bit in that latitude.
And here's variation over time.
As you can see, on the eastern coast of a lot of land masses there is a lot of increased variability. You can see a lot of increase and decrease of the sea level rise over time.
So again, these are just some
examples of how you can, today,
right now, if you ignore
the rest of the talk,
you can go run this
analysis on your own
with open data sets
living on Google Cloud.
With computation
on Google Cloud,
this whole computation cost
us something like $0.50.
So this workflow is
scalable, but it's also
very, very familiar.
So if you're familiar
with the Python ecosystem,
the tools like NumPy,
pandas, xarray, and Jupyter
are very familiar to you.
And because of that,
this architecture
has been redeployed in many
different environments.
This has had great success.
This combination of scaling with
Jupyter and scaling with Dask
and the rest of the
computational ecosystem
has been wildly
successful, especially
in the scientific communities.
So this is very commonly deployed on HPC systems, for example, at institutions like NASA, the National Center for Atmospheric Research, or the UK Met Office, where they just deployed JupyterHub and Dask on an HPC job scheduler running in their own data centers.
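On an HPC scheduler, the same pattern typically uses the dask-jobqueue library instead of Kubernetes. A sketch with a PBS scheduler (SLURM and LSF variants also exist); the resource values and queue name are illustrative.

```python
from dask_jobqueue import PBSCluster
from dask.distributed import Client

# Each "worker" is a batch job submitted to the PBS queue.
cluster = PBSCluster(cores=24, memory="100GB", queue="regular")
cluster.scale(10)          # submit 10 batch jobs, each running a Dask worker
client = Client(cluster)   # connect the notebook, same as on the cloud
```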
This is great because these
institutions already exist.
It's a little bit
challenging though
because they all have their
own different data sets.
And that siloing of data
sets can be quite challenging
when you have to move
large data sets around.
Fortunately, the
same exact stack
also works on the cloud
with tools like Kubernetes.
And it works on
any cloud platform
that supports Kubernetes,
like Google Cloud Platform.
And what's nice here is that
because many institutions are
all placing their data
sets on the cloud,
suddenly all these computations
become much more universally
available.
You can start mixing and matching.
You can combine your
oceanographic data sets
with your population
movement data sets.
And it's all publicly
available, so it's
very pleasant to start
combining those things.
So just briefly, some
successes and some
challenges we've run into.
So this really started as a collaboration between a science group at Columbia, a data infrastructure group at the National Center for Atmospheric Research, and some open source software developers at Anaconda.
But it's rapidly spread out to other groups-- many government organizations and also many other scientific domains. What I showed above was just for geosciences, but it's equally applicable to genomics or to imaging.
Those communities
have reached out.
It's also applicable to many
more for-profit enterprises.
So NumPy and pandas are used in
geosciences and everywhere else
as well.
Scale, so scientists are
regularly interacting
with 100-terabyte data sets.
When I say interact,
I don't mean computing
on 100-terabyte data sets.
I mean they've opened
up a Jupyter Notebook,
and they're typing at that
notebook, and they hit Enter,
and they expect to
get a response back
within a minute or so.
So they're actually doing
interactive analysis
with these large complex data
sets at human time scales.
That's really important for
the innovation of science.
And we're also doing the science. So the people at Columbia whose job it is to do this work are doing this work.
And they're coming
up with new analyses.
And the cloud, in particular,
has been very nice
because autoscaling has been
very helpful to reduce costs
without maintenance.
We actually don't pay
anybody to manage the system
that you can play with now.
It's up there.
It's live.
It breaks very infrequently.
And it's usually my fault.
And these analyses are relatively cheap. Because we can scale our clusters up and down, we can burst out to 1,000 machines and then go back down to zero again. And we can be fairly cheap in that way.
This has also been pretty
impressive for access.
So if you watch the people who log in, a lot of them are graduate students, a lot of them undergraduates, and not all of them are in resource-rich countries.
And it's been very
exciting to see
this large-scale
analysis be taken out
into those other communities.
However, there's also
a lot of challenges.
So we're struggling with interactivity on petabyte-scale data sets. Dask's underlying parallel machinery seems to struggle to operate at interactive time scales at those sizes.
Data formats, so the established
formats for gridded data,
things like HDF5, don't work
very well in cloud storage.
There are alternatives.
But now we have
to build consensus
among a lot of
scientific communities
to establish new common
data formats designed
for infrequent random access.
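One alternative the community has been adopting is Zarr, which stores every chunk as its own object. A minimal sketch of converting a netCDF/HDF5 file, with hypothetical file and bucket names, assuming gcsfs is installed so xarray can write directly to a gs:// path:

```python
import xarray as xr

# Open an existing netCDF/HDF5 file (name hypothetical)...
ds = xr.open_dataset("simulation.nc")
# ...choose chunks to match the expected access pattern...
ds = ds.chunk({"time": 100})
# ...and write it as Zarr, where every chunk is its own object in the
# bucket, so remote readers can fetch slices concurrently and independently.
ds.to_zarr("gs://my-bucket/simulation.zarr")
```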
There's a general lack of
expertise among the science IT
community.
There's a ton of people who
are well-trained to handle
HPC clusters.
And there's relatively few
people in the science community
who have the
cloud-native skill set.
Some retraining process
needs to be done.
And then finally, there's
resource management.
So things like
JupyterHub or Kubernetes
lack some of the
fine-grained user management
around software environments,
resource limits, data sharing,
and governance.
They're tricky if you want to
manage a thousand users all
hammering a cluster with their
own particular resource queues.
So taking off my open source hat and putting on my enterprise hat: Anaconda Enterprise addresses these sorts of problems, and I actually didn't understand what these problems were before building a system like this. It was hard to understand all the pain that goes into stopping users from breaking the cluster before you have a bunch of users actively trying to break a cluster.
So that's it for my bit. I'm going to hand it off to Mario in a moment.
But give it a try.
So you go to pangeo.pydata.org.
You can play with that
exact same analysis.
It should all just work.
If you want to learn more,
you go to pangeo-data.org.
If you're an interested partner and you want to join in, there are many organizations that have come to collaborate with us. And you can also get involved.
So we're a community
organization.
We do all of our communication
on GitHub, pangeo-data/pangeo.
So thank you so much.
MARIO LASSNIG: Yeah, it works.
OK.
Hello, everyone.
My name is Mario Lassnig.
I'm with the
European Organization
for Nuclear Research,
commonly known as CERN.
Thank you, Karan, for
giving me the possibility
to speak here to you.
And I do this on behalf of
the ATLAS collaboration.
So ATLAS is one
of the experiments
that we host at CERN.
OK, maybe I should stand here.
So just a little introduction: CERN is a particle physics research lab on the border of Switzerland and France, by beautiful Lake Geneva.
And our accelerator complex
is 100 meters underground.
You have the Large Hadron Collider, the LHC, with a 27-kilometer circumference. And you have four experiments, four detectors, attached to the LHC.
And we basically make
particle beams collide
in the center of
these detectors.
And our experiment
is the ATLAS one.
So, this is what an
example event looks like.
So this is a cross-section
of the detector.
It's layered like an
onion, different types
of things that we can measure,
most importantly the energy
deposits over time of
the decaying particles.
So what we get--
the yellow dots that
you see are essentially
hits in the detector.
So you can imagine
this is a camera.
And it takes pictures
every 25 nanoseconds.
So that's a lot of data
coming out of there.
If I'm not mistaken, it's
roughly 1.7 petabytes
per second.
And then we trigger that down in hardware to event sizes of two to eight megabytes.
So this is just raw data
coming from the detector.
But computing at CERN has a much longer history than that.
On the left-hand side, you
can see the original machine
that Tim Berners-Lee used to
create the original technology
for the World Wide Web.
It's now in our museum.
And actually, it still works.
They had an event two years ago
where they tried it out again.
Then we improved a bit. So our data center right now is, I believe, at 250,000 cores, and we provide 300 petabytes of disk and 200 petabytes of tape.
But eventually,
all of this is just
the infrastructure underneath.
What we actually want
is plots like that
on the right-hand side.
This is what the people
are actually interested in.
And if you're a
particle physicist,
you can see the
Higgs boson in there.
Well, you might have
to zoom in a bit.
So how do we do the
computing for the experiment?
We leverage what we call
the WLCG, the Worldwide LHC
Computing Grid.
Now, CERN is not the only center providing capabilities. We have 170 data centers all over the world, in 40 countries-- altogether 800,000 cores, some disk, some tape.
And the aggregate
network that we can use
is roughly three terabytes.
It's always a tiered structure. So we have tier zero, which is CERN. Then the tier ones are the large data centers-- Brookhaven National Laboratory, Rutherford Appleton Laboratory, all those kinds of very big supporters. And then we have the tier twos and threes. Those are universities and laboratories, funded within the national research councils' funding agencies, that are attached to the larger data centers. It's really a tiered structure.
And what we do is
heterogeneous computing.
So our data centers
are publicly funded.
So it's not like
somebody in Spain
makes a decision to
buy Intel processors
and then everybody in
the world has to buy
exactly the same hardware.
No, every single data center has its own funding structure, its own decision-making process. So really, everybody can do what they want, literally. And we have to make sure, within the framework of the WLCG, that it all works together.
And not only that, the data centers not only support the ATLAS experiment but many different experiments as well, and then pledge resources to the experiments.
The distributed computing
project for ATLAS
takes two concepts, workflow
management and data management,
and merges them together.
We basically schedule computational tasks.
It's really like a huge batch
farm across all those data
centers.
It interacts with the compute.
And the data management is in
charge of all the experiment
data.
And you can see from the curve on the left-hand side that we're approaching 400 petabytes of experiment data there. Now, this includes everything from user output data to detector data.
But we also have a very large
operations and support team.
And without these people,
nothing would work.
So they do the databases,
the monitoring, the analytics
for the experiments.
Now I would like to quickly touch on the data management system, because that's what I'm primarily involved in.
When we think about data management for science, it's really completely different from a business use case, because science workflows, as Matt already explained, are a bit complicated.
So what we try to do is we try
to have this global namespace
that we can federate across all
the different storage systems
that the countries provide.
We need to control and account for the data and its end users.
We need to have some policies
and rules on the data.
We have to transfer data
between the data centers.
We have to make sure that
when the jobs access the data,
that it's available.
We can do popularity-based
things, replication, caching,
all those kinds of things.
And then we also
have to interact
with all the other systems
that actually need the data.
So the system is called Rucio. And this is the one that we actually integrated with the Google Cloud Platform.
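To give a flavor of the policy-and-rules idea, this is roughly what a replication rule looks like with the Rucio Python client; the scope, dataset, and RSE names are hypothetical, and the real ATLAS setup is more involved.

```python
from rucio.client import Client

# "Keep one copy of this dataset on the Google Cloud Storage endpoint
# for 30 days" (all names below are hypothetical examples).
rucio = Client()
rucio.add_replication_rule(
    dids=[{"scope": "user.jdoe", "name": "analysis_output"}],
    copies=1,
    rse_expression="GOOGLE_EUW1",  # the storage element representing GCS
    lifetime=30 * 24 * 3600,       # rule lifetime in seconds
)
```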
So I think in the
interest of time,
I will just skip
to the next slide
because that's actually
where it gets interesting.
When we do orchestrated
storage-to-storage transfers,
so basically copying a couple
of petabytes from France
to the United States,
we have to hop
through many different
research networks.
And as you can see, we
do roughly five petabytes
of transfers per week, all of
different types of activities.
But when the data is
there, the actual jobs
need to do something with them.
So the job IO we get per week is much, much higher than the actual orchestrated activity. The orchestrated transfers we can schedule: we have a list of data that we want to transfer.
We know roughly a month in
advance which data we actually
want to move.
But when we have users accessing
it, this is completely chaotic.
And as Matt showed,
you push a button,
and suddenly you try to work
through a petabyte sample
of data.
And we have to make sure that
this is properly orchestrated,
not only worldwide,
but also at the lower
level with the storage systems.
So that actually looks quite good, and it works. So what is the problem?
The problem you can
see here on this plot
on the left-hand side.
That is the prediction
of the amount of data
that we will need.
And you'll see there's
a very big jump in 2025.
This is when we will get
the high luminosity upgrade
to the Large Hadron Collider.
So from roughly one
exabyte of data,
we will jump up to four
and more exabytes of data.
And already now
at 400 petabytes,
we are having quite
a bit of trouble
getting the job IO working.
So this data deluge, this is
not only a problem for the LHC,
we see this across
many fields of science.
And the main problem
there is that, of course,
storage continues to be
the major cost factor.
The science output rate exceeds the infrastructure growth rate at any possible rate. I mean, you can't raise so much in taxes that you can build that many more science centers.
So we look to what we call tactical resources. We want to use the excess resources that are available in the world.
And so ATLAS and
Google joined forces
to start the Data
Ocean R&D project.
There's this data
lake concept that's
very much used in
business, so we
tried to give it a
funny spin and just
called it a data ocean
just to make it bigger.
There's absolutely no
meaning behind this.
So what we try to
do is we wanted
to make it easy for the
science users in the experiment
to benefit from what Google
Cloud can provide us.
But on the other
hand, we also wanted
to learn from Google how
they do their things.
So we wanted to evaluate,
adopt, and benefit
from the novel technologies that
are available on the market.
So at the Supercomputing
Conference last year in Denver,
we had our first meeting.
And we started out with
a large sheet of paper
and started to draw potential
scenarios what we could do.
And we ended up with our
first white paper on this.
So the first use case that we wanted to address is user analysis.
Imagine your job runs on the WLCG.
It produces output.
It writes it somewhere
in the world.
But for example, it ends up in a
data center in a remote country
with not so good network.
And the only administrator there
is a PhD student on 5% time.
Data center goes down, you
have to wait three days
to get your job output data.
That is clearly
not what you want.
So with Google
Cloud Platform, we
would be able to get
100% output availability
by making an additional
copy on the cloud.
Also, we should be
able to overflow
the CPU to cloud compute.
And the other thing that we wanted to test is data placement, replication, and popularity-based caching.
We really want to dynamically
expand the storage capacity,
use the additional network
that Google provides us
between our national
research networks.
For example, we can inject
data in our data centers
in Amsterdam and have it use
the Google network to move it
to the US without having to rely
on our own research networks.
And then, of course,
within the Google network,
try to replicate it even more.
And the third topic that we would like to test concerns the internal data formats that we use for the job IO itself.
What we use is-- if you know HDF5, it's a similar data format using trees. We use the ROOT system for this.
So what we want, we want to rip
it apart into its constituents
and try to figure out what would
happen if we don't use files.
So eventually, we would end up
with a couple trillion events
that we need to process.
So we're clearly not there yet.
So we wanted to do some
cloud-based marshalling
of events from this.
So how do we get data
into Google Cloud Storage?
This is the
necessary first step.
But the LHC is running,
so we can't just
shut it down and do this.
So we had to integrate this transparently and on the fly, because downtimes while the LHC is running are very, very costly.
So what we try to do is to
make GCS just another data
center within the WLCG.
So this means within all the
existing heterogeneous data
centers that we
have, we had to make
sure it supports all
the data policies,
it supports the user data
access with the authorization
and authentication.
It must support the existing
protocols that we use
and also the existing
tool chain that we have.
So this was quite a bit
of development effort.
So we started with S3 because we had full stack support from the top to the bottom.
And you can see in
the center, that's
the Rucio data
management system that
connects the WLCG sites and
the Google Cloud Storage.
But then we quickly figured out that if we continued to use S3 as our primary data access protocol, we would run into some serious problems, because it is rate limited on the storage sites.
And really, the major problem is that key distribution is problematic. Imagine we have our credentials to access Google Cloud Storage and we start to hand them out to our users; if somebody decides to just pull a couple of petabytes of data, we would rack up a bill that we could never fund.
So what we decided to do is say, OK, we don't use the S3 compatibility layer. We move to Google Cloud Platform-native, client-side signed URLs.
And with that, we have complete
control over the integration.
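A minimal sketch of minting such a signed URL with the google-cloud-storage Python client; the bucket and object names are hypothetical, and this is not ATLAS's actual integration code.

```python
from datetime import timedelta
from google.cloud import storage

# Mint a short-lived, read-only signed URL for one object, so the end
# user never sees long-lived credentials (names are hypothetical).
client = storage.Client()
blob = client.bucket("atlas-data-ocean").blob("user/jdoe/output.root")
url = blob.generate_signed_url(
    version="v4",
    expiration=timedelta(hours=1),
    method="GET",
)
print(url)  # hand this URL to the job; it expires after an hour
```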
The first movements to S3 we did quite manually-- slow and tedious.
But the signed URL support basically allowed us to use Google Cloud Platform like any other WebDAV storage, which was really cool.
And then we used this to do user analysis input to the europe-west1 and us-east1 storage areas in GCP.
And what happened was that we maxed out the intermediate throughput and started to overload the storage on the WLCG side, because we couldn't even inject data fast enough into Google Cloud Storage.
So we were quite
happy with that.
When you have your data in, you want to compute. And we have a system built in collaboration with the University of Texas at Arlington and Brookhaven National Laboratory. It's called PanDA.
And it has a component called Harvester. What Harvester does is basically provide a real-time view of the capabilities of your cloud. It tells you how much usage you have from the experiment point of view, not the infrastructure point of view, and how that relates to the actual cloud costs that we incur.
So we continuously fire
up virtual machines
with the actual
computational pilots in them.
We can poll the VM states.
We can shut them
down when necessary.
And we just have the Compute
Engine completely integrated
into our batch workflow.
And at the bottom, the Rucio system takes care of moving the data from the WLCG storage systems into Google Cloud Storage through orchestrated transfers.
So we started with
ATLAS simulation jobs.
When I said at the beginning that we get a lot of data out of the detector-- we get even more data from our simulation. I think the split is something like 30% to 70%. So 70% is just pure simulated data.
Now what we need
to do is we need
to run these simulations on the
Google Compute Engine clusters
and read the IO from
the CERN storage.
That was the first step.
We wanted to show that we
can run the compute on Google
but accessing the data at CERN.
And then we moved on to also use the IO from Google Cloud Storage.
In the meantime,
we decided that we
could try to profit from the
preemptible virtual machines.
And you can see, on the left-hand side of the plot, we used the normal VMs that are available on GCE, and then we switched to preemptible VMs.
So this basically means that
the infrastructure can take out
the VM underneath you.
However, if you do this, you save a lot of dollars per event.
So this was a very
positive experience there.
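For illustration, requesting a preemptible VM with today's google-cloud-compute Python client looks roughly like this. It's a sketch under stated assumptions, not the actual Harvester code, and the machine type, image, and names are hypothetical.

```python
from google.cloud import compute_v1

def create_preemptible_vm(project: str, zone: str, name: str) -> None:
    # Boot disk from a public image (image family is illustrative).
    disk = compute_v1.AttachedDisk(
        boot=True,
        auto_delete=True,
        initialize_params=compute_v1.AttachedDiskInitializeParams(
            source_image="projects/debian-cloud/global/images/family/debian-12"
        ),
    )
    instance = compute_v1.Instance(
        name=name,
        machine_type=f"zones/{zone}/machineTypes/n1-standard-4",
        disks=[disk],
        network_interfaces=[
            compute_v1.NetworkInterface(network="global/networks/default")
        ],
        # The key setting: preemptible capacity is much cheaper, but the
        # infrastructure may reclaim the VM underneath you at any time.
        scheduling=compute_v1.Scheduling(preemptible=True),
    )
    op = compute_v1.InstancesClient().insert(
        project=project, zone=zone, instance_resource=instance
    )
    op.result()  # wait for the create operation to finish
```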
So the ATLAS analysis also works; the simulation case we proved was just a first step. We can use GCE with IO from CERN or with IO from GCS, and it's completely and fully automated.
So the physicists
themselves don't even
know that their jobs
are running on Google
or that the data is served from Google.
So what is the
take-home message?
The science workflows and
the software that we use
are complex and complicated
to a point where we can't just
go ahead and change
things because we need
to have repeatable experiments.
So if we decide to
make an addition,
like enabling support for a
commercial cloud provider,
we have to make sure that
this is done in a very, very
conservative way.
So all these workflows have
to transparently integrate
these providers.
But then we come
to the next step
because the
scientists themselves
are starting to get cautious.
To a scientist, the data is
the most precious resource.
They literally don't care
about the infrastructure
and where it runs.
But if you tell them we're
taking all your data--
that is basically
your livelihood--
and we're putting it at a commercial company, not on publicly funded science infrastructure, they start to get very, very cautious.
So what we are trying to do here is win the scientists over-- we're basically trying to win ourselves over-- by providing ourselves tools to use the commercial platforms in a way that makes sense to us.
So we can allow our
existing science software
to run transparently against
a commercial provider while
at the same time keeping
complete control of the data.
So for us really,
the important part
was to have a data management
system to be able to do that.
So imagine you're a science community, and you're not on the scale of ATLAS-- you're maybe five to 10 people-- and you decide, OK, I have access to my university storage cluster, but I would really like to get some capacity from Google Cloud Storage completely integrated into my workflow. You can use our Rucio system to do this.
You can use the PanDA and Harvester systems to do this.
So we really give
you the possibility
here to integrate your existing
on-premise capabilities
with the commercial
cloud providers
and give you as a
science customer
full control over all
your collaboration data.
So ATLAS and Google are committed to a long-term collaboration. This is not a one-off thing.
So we're really looking forward
to scaling out even more.
We have many more experiments that are deploying the same tools.
And we are giving them
essentially the capability
for free to hook into the
commercial cloud providers
and use it as part of
their science workflows.
And with that, I'm at the end.
And thank you.
[APPLAUSE]
[MUSIC PLAYING]
