GARRICK EVANS: My name's
Garrick Evans.
I'm a software architect on
the Google Cloud Platform
solutions team.
And I want to tell you, one of the coolest parts about my job is the opportunity that comes up once in a while to get involved with projects that partners are doing, working on some very, very hard problems and asking some of the most intriguing and important questions in the world, in this particular case, in the universe.
So today, I'm actually pleased
to share with you
one of these projects.
It's the ATLAS experiment on
Google Compute Engine.
And with this project, what we wanted to demonstrate was tangible acceleration of scientific research by leveraging Google's Cloud, in particular Compute Engine.
The computational and data demands of the ATLAS experiment are pretty substantial. There are thousands of collaborators running, as a baseline, hundreds of thousands of jobs per day, peaking at over 10 times that amount. The experiment generates tens of petabytes of data a year, and there's currently well over 100 petabytes of data under management and coordination.
So with me today are Dr. Sergey Panitkin of the Brookhaven National Lab in New York and Andrew Hanushevsky of the SLAC National Accelerator Lab at Stanford. Sergey leads research and development of cloud computing at the ATLAS experiment at the Large Hadron Collider. He'll talk about the project itself and the compute clusters his team has assembled, and share his results with you.
Andy's an information systems specialist who designed and co-developed a data clustering technology, XRootD, which was used to federate data between ATLAS and the Cloud. He'll provide us an overview of the technology.
So with that, I'd like to
introduce you to Dr. Sergey
Panitkin of Brookhaven
National Lab.
[APPLAUSE]
DR. SERGEY PANITKIN: Today
I will talk-- oh.
[LAUGHTER]
DR. SERGEY PANITKIN: Today I will talk a little bit about the ATLAS experiment at the Large Hadron Collider, ATLAS computing and the challenges of a big data experiment, our interest in Clouds, and our recent project on Google Compute Engine, the computational clusters that we run on the grid and are now running in the Cloud. We'll also describe the XRootD technology; Andy will go into more details about that.
We're saying that there are products developed in the high energy physics community. Remember the World Wide Web. There is a new generation of technology developed by that community. It's open source, and we're saying it's ready to be shared with people who are interested in Clouds and large computational clustering, bridging virtual and real infrastructure, things like that.
ATLAS is a multipurpose detector at the Large Hadron Collider at CERN.
The ATLAS experiment itself
is a large international
collaboration of about 3,000
scientists and engineers from
many universities and labs
around the globe.
It took almost 20 years to design and build the apparatus, but nevertheless, it's a very young collaboration, with more than 1,200 graduate students working on ATLAS and driving the analyses.
This photo shows the outline of
the LHC tunnel on an aerial
view of countryside near
Geneva, Switzerland.
The LHC is one of the largest scientific instruments ever built and certainly one of the most complex. And everything about the LHC is extreme: its size, its energy, its detectors. It's the coldest and emptiest place in the solar system. It's the hottest place in the universe, with temperatures of particles created in collisions exceeding trillions of degrees Celsius.
The LHC tunnel is about 27 kilometers long. Thousands of superconducting magnets working at near absolute zero temperature are needed in order to accelerate and collide protons and heavy ions at the highest energies ever achieved in the lab.
Such energies are needed in order to explore the high energy frontier of modern particle physics, to discover things like the Higgs boson, the missing piece of the Standard Model, a particle that is responsible for electroweak symmetry breaking, which generates the masses of elementary particles.
The LHC also aims to explore physics beyond the Standard Model: things like supersymmetry, the possible existence of extra dimensions, possible candidates for dark matter and dark energy, and whatever else may be at that frontier.
The ATLAS detector is a
multipurpose apparatus
designed to detect particles
created in the collision of
the LHC beams.
It's the largest detector of its
kind, and also one of the
most complex.
It's built like a Russian doll
with detectors inside other
detectors with several giant
superconducting magnets.
All of this is needed to accurately register particles and identify them. It has very high granularity of sensing elements, with about 150 million sensors of various kinds.
And it's capable of taking
snapshots of the collision
events at the rate of 40
megahertz, or every 25
nanoseconds.
And it weighs about 7,000
tons, and it's
quite large in size.
This slide gives you a
feeling of the size
of the ATLAS detector.
Here, outlines of ATLAS and our sister detector, CMS, are superimposed on an image of a six-story office building at CERN. And ATLAS is taller than that building.
It's a huge apparatus.
The detector itself sits in the LHC tunnel about 100 meters below the surface, and it was assembled there piece by piece, carefully, like a ship in a bottle. It's quite an extraordinary engineering achievement, especially taking into account that the detector should be able to measure particle positions with accuracy better than 100 microns. So it had to be assembled with very high precision. And that's how it looks fully assembled in the cavern.
As I mentioned before, the LHC and ATLAS were built to explore the high energy frontier of modern particle physics, to search for new phenomena that may occur at these energies, to probe the very fundamental laws of nature. In particular, to search for the Higgs boson, the missing piece of the Standard Model of elementary particles, responsible for electroweak symmetry breaking and, again, the generation of the masses of other elementary particles.
The Higgs mechanism was suggested about 40 years ago, and since then physicists around the world have been searching for proof of its existence. And the recent discovery of a new particle at the LHC which looks like the Higgs boson culminated this search.
Less than a year ago, the two experiments at the LHC, ATLAS and CMS, announced the discovery of the new particle. It was called a giant leap for science, the most important discovery of the last decade in particle physics.
It generated a lot of public
interest and a lot of media
attention, with thousands
and thousands of print
stories and TV spots.
And most likely you have heard
about this discovery already.
But you probably haven't
seen my next slide.
I must note here that typically we do not detect elementary particles like the Higgs directly. Like many other subatomic particles, the Higgs is too heavy, too unstable, and too short-lived. When produced, it immediately decays into lighter, more stable, more long-lived particles that are actually registered by our detectors, like ATLAS. And by the way, such decays may be multi-staged, and [INAUDIBLE] stable object decay [INAUDIBLE]. After several such decays, you get the final state particles.
By using the collected information about the decay products of the Higgs, what we call the final state particles, we can nevertheless prove the existence of parent particles, like the Higgs. Indirectly, yes, but without any doubt, just by using the simple laws of conservation of energy and momentum familiar from freshman Physics 101.
Whatever energy existed before the particle decayed should be conserved after the decay. And mass is equivalent to energy, as was pointed out by Einstein, so the mass of the decaying parent particle doesn't disappear. It is transformed into the masses [INAUDIBLE] the decay products, into other particles. So again, by using the information collected about the final state particles, we can reconstruct the mass of the parent particle.
And we measure many events. For each event, we take all the final state particles that belong to a particular decay channel, calculate the effective mass of such a combination, and fill a histogram of effective mass. And eventually we expect to see a peak in that histogram that corresponds to the mass of the parent particle.
Of course, there may be other particles that decay into the same final state, and you will see a spectrum in the distribution. But there should be a peak corresponding to the mass of the particle you search for. There will be noise, but if the particle exists, there will be a signal, simply because of conservation of energy and momentum.
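To make that concrete, the reconstruction described here is the standard invariant mass calculation. In natural units (with c = 1), for final state particles with energies E_i and momentum vectors p_i, the effective (invariant) mass of the combination is

$$
M^2 = \Big(\sum_i E_i\Big)^2 - \Big\|\sum_i \vec{p}_i\Big\|^2
$$

Histogramming M over many events yields a peak at the parent particle's mass on top of the combinatorial background.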
And on this slide, you can see how the discovery of the Higgs boson unfolded. Here, we are looking at the spectrum of effective mass of [INAUDIBLE] detected by ATLAS. Decaying to [INAUDIBLE] is one of the predicted decay channels for the Higgs boson. And as we were collecting more and more data, the peak at 125 GeV-- GeV is a unit of mass used in high energy physics-- just kept growing, indicating that there is a new particle, very heavy, never seen before, a clear and beautiful peak.
And of course, you do the search in many different decay channels. The Higgs can decay into anything that is not forbidden by conservation laws. In one event, it can decay into gamma [INAUDIBLE], in another into [INAUDIBLE], and so forth, and some events may have no Higgs in there. But it's the same Higgs, so you should expect a peak at the same place, at the same mass, in these various decay channels. And you see the clear indication that there is something at 125 GeV.
Now, let's talk about the data challenges of such an analysis. The particles that we want to discover and study are rare. That's why the LHC runs at such high energy and with such intense beams, almost a billion interactions per second. Even in these conditions, the probability of creating a Higgs boson is tiny. You would need to search for one Higgs in more than a trillion events.
And at high beam densities, multiple collisions can occur in one beam crossing. This plot shows what such an event looks like. It's quite messy.
That explains why you need such
a big detector, such high
granularity, such strong
magnetic fields, so many
channels of sensors, so many
channels of electronics,
because you are searching
for something very rare.
And for Google developers, that's probably a familiar problem: your selectivity should be very, very high, like one in several trillion. It's like looking for one person's cells in the world population, or one needle in 20 million haystacks.
ATLAS is the quintessential big data experiment. The ATLAS detector generates about one petabyte per second of raw data. No one can store it, not even Google. Most of it is filtered out in real time by the trigger system. Interesting events are recorded for further reconstruction and analysis.
And as of this year, we're managing about 140 petabytes of data worldwide, distributed over 100 computing centers. And that's not only raw data, but also derived formats and simulation; simulation is about 50% of our data: Monte Carlo simulations of various properties, simulations of how particles propagate through the detector and interact, things like that.
And we expect that these data rates will only grow. After the LHC long shutdown that's happening now in 2013, the data rate will be higher and the energy of collisions will be higher. We expect an influx of already filtered data on the level of 40 petabytes per year.
And we have to deliver this
data to thousands of
physicists worldwide.
Now, a little bit about ATLAS computing. ATLAS uses the grid computing paradigm for organization of its distributed resources. Job distribution is managed by the PanDA Workload Management System. PanDA stands for Production and Distributed Analysis. Think of it as a grid metascheduler. PanDA was developed by ATLAS, and now it manages the distribution of computing jobs for hundreds of computing sites, about 100,000 cores, 100 million jobs per year, serving thousands of users.
Organizationally, it's a grid set up in a tiered system, highly hierarchical, with the Tier-0 center located at CERN. The Tier-0 center receives the raw data from the ATLAS detector, performs first-pass analysis, and then distributes the data among other tiers. Tier-1 centers are typically large national centers. One of them is the Brookhaven Lab. And each Tier-1 facility then distributes derived data to Tier-2 computing facilities, which provide data storage and processing capabilities for more in-depth end-user analysis.
This plot shows the distribution of running jobs, both Monte Carlo simulation-- we call them production jobs-- and analysis jobs, on the ATLAS grid for the past year. We run about 100,000 cores worldwide, processing 150,000 jobs per day.
It's clear that all available
computational resources are
fully stressed and utilized.
And on this plot, you see the distribution of pending jobs, submitted but waiting for execution on the ATLAS grid, for the past year. You can see that the job submission pattern is very uneven in time. Spikes in demand usually happen before major physics conferences or during data reprocessing.
And demand can exceed available
computational
resources by more than an
order of magnitude.
Lack of resources slows down
the pace of scientific
discovery, and that's why ATLAS
is interested in cloud
computing, and, in particular,
in public cloud resources.
A couple of years ago, ATLAS set up a cloud computing R&D project to explore virtualization and cloud computing, primarily as a tool to cope with peak loads on the grid. We wanted to learn how to use public and private clouds in typical ATLAS computational scenarios. Since then, we have gained experience with many cloud platforms, like Amazon, [INAUDIBLE] a consortium of European cloud providers, FutureGrid, [INAUDIBLE] academic clouds in the US and Canada, and many others.
We also explored private and hybrid cloud configurations based on OpenStack, CloudStack, OpenNebula, and others.
And now our latest project was
on Google Compute Engine, and
I'll talk about it
now in detail.
We were invited to participate in the Google Compute Engine trial period in August 2012.
And we were immediately
attracted by modern hardware,
powerful API, and competitive
pricing.
And this is Google, after all.
At the beginning, we were frustrated that none of the tools that we had used before supported Google Compute Engine, so initially a lot of manual labor in image building and cluster management was needed, since we couldn't reuse our standard tools. We're glad to see here at Google I/O that this situation is changing and many tools are supporting Google Compute Engine now.
But Google engineers were very helpful with the initial setup, debugging problems, and explaining to us the features of the Google Cloud Platform implementation. And Google was also very gracious in providing more resources than the initial trial quota, so we could start working on a larger scale cluster.
We wanted to try several ATLAS computational scenarios: high performance analysis clusters like PROOF. We wanted to learn about storage and data management in the cloud, in particular utilizing XRootD technology for storage aggregation, ephemeral storage aggregation, and federation. We also wanted to try a full-scale Monte Carlo production simulation in the cloud using the PanDA Workload Management System, as well as some other smaller projects.
So let me talk about PanDA [INAUDIBLE] on Google Compute Engine. Google agreed to allocate additional resources for ATLAS, to the tune of about five million core-hours. The resources were organized as an HTCondor-based PanDA queue, and that allows for transparent inclusion of the cloud resources into the ATLAS computational grid. Google Compute Engine looks like a part of the ATLAS grid, just like another grid site. Very transparent.
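As an illustration only, an HTCondor queue is fed by submit descriptions along these lines; the pilot script name and job count here are hypothetical, and in ATLAS this submission is automated by the PanDA pilot machinery rather than written by hand:

```
# minimal HTCondor submit description for a batch of pilot jobs
# (run_pilot.sh is a placeholder for the actual pilot wrapper)
universe    = vanilla
executable  = run_pilot.sh
output      = pilot.$(Cluster).$(Process).out
error       = pilot.$(Cluster).$(Process).err
log         = pilots.log
queue 100
```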
It was intended to run CPU intensive Monte Carlo simulation, and the idea was to try to have a production type of run on Google Compute Engine. The system was delivered to ATLAS as a production resource, not as an R&D platform.
We ran for about eight weeks. Two weeks were planned for start up. And we had very stable running. Google Compute Engine was rock solid. We had a few problems, and most of them were on the ATLAS side. These were computationally intensive jobs, without much I/O, for this particular workload: physics [INAUDIBLE] generators, [INAUDIBLE] fast detector simulation, full detector simulation. The produced data was automatically shipped to ATLAS grid storage for further processing and analysis. It was really usable data.
We completed about 450,000 jobs and generated and processed more than 200 million events. Very good performance, very comparable to the performance of the ATLAS grid.
This plot shows the job
failure rate as
a function of time.
Most failures occurred during the start up and scale up period, as we expected. Most problems were actually on the ATLAS side.
No failures were due to the
Google Compute Engine.
Very stable performance
of the platform.
This plot shows the distribution of finished and failed jobs. The green histogram is for finished jobs, the pink one for failed ones. Again, very good performance. We reached high rates of production, 50,000 jobs per day. A good number.
We also tried PROOF clusters. PROOF is an implementation of the MapReduce paradigm based on the ROOT framework. The ROOT framework for data analysis was developed by the high energy and nuclear physics community, and it is developed and supported by the ROOT team at CERN. It's written in C++, free, open source, very high performance. And we'll have a slide with pointers to the system.
And PROOF allows for efficient aggregation and use of distributed computing resources for data intensive, event based analyses. It uses XRootD for clustering, storage aggregation, and data discovery. XRootD is well suited for aggregating ephemeral storage into one namespace. And PROOF clusters can also be federated.
So on this slide, you can see the typical architecture of a PROOF cluster, with the super master, which serves as the users' single point of entry. The system's complexity is completely hidden from users. And it allows, for example, for interactive analysis, where you can send a query and in real time see how a particular histogram grows and changes. That's one of its distinctive features. But it also allows batch analysis: you can connect to the system, look at the histograms, and disconnect, then connect again, look again, see how it's going.
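For a flavor of what that looks like from the user's side, here is a minimal PROOF session sketch in ROOT; the master URL, dataset name, and selector are placeholders, not the actual ATLAS setup:

```cpp
// proof_sketch.C -- minimal PROOF session sketch (hypothetical names).
#include "TProof.h"

void proof_sketch()
{
   // Connect to the PROOF master (or super master); this is the user's
   // single point of entry, and the worker topology stays hidden.
   TProof *proof = TProof::Open("master.example.org");
   if (!proof) return;

   // Run a TSelector over a registered dataset. While the query runs, you
   // can watch the output histograms fill, or disconnect and reattach later.
   proof->Process("/default/user/mydataset", "MySelector.C+");
}
```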
And here's another view of the typical structure of a PROOF cluster. PROOF, as I mentioned, works very well with XRootD based storage. XRootD provides data discovery and helps PROOF exploit this data locality [INAUDIBLE].
Access to Google Compute Engine allowed us to build and test large PROOF clusters, up to 1,000 workers, something that is very difficult to do on our own infrastructure. We just don't have the resources to do this kind of test and see how it all scales. And the figure here shows a scalability test for a 500-worker cluster. It shows very good performance and pretty good scalability.
We also looked at the storage performance of Google Compute Engine. This plot shows the performance, in a typical ATLAS analysis scenario, of ephemeral storage and persistent storage, compared to what happens if you have all the data in memory. Note that ephemeral disk has better single-worker performance, but persistent storage shows better scaling and better peak performance. And of course, in this situation it's clear that RAID is needed for better performance.
We also looked at the data transfer capabilities. This plot shows the data transfer from our own Federated ATLAS XRootD storage to Google Compute Engine in extreme copy mode, which is sort of similar to BitTorrent: if several copies of the data exist, you can copy them in multisource, multistream mode. A Google Compute Engine XRootD cluster using ephemeral storage was used for this test, and the average transfer rate was about 60 megabytes per second. And this is very good, taking into account that this is over a completely unmanaged public network. We have no control there. But still, this is single client performance.
Many of the plots that I'm showing are single client, because the systems that we're running scale very well when you add resources. You expect clustering and scaling up, so what you are really interested in is how the building blocks are running. And here it is a single client, but you can run multiple clients on multiple VMs simultaneously and stream the data over.
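For reference, the xrdcp command-line client exposes this multisource mode; here is a sketch with placeholder names. The exact option spelling varies across XRootD versions, so treat the flag as an assumption:

```
# "extreme" (multisource) copy from a federated XRootD endpoint to local disk;
# -x enabled extreme copy mode in the classic xrdcp client (version-dependent)
xrdcp -x root://federation.example.org//atlas/data/sample.root /tmp/sample.root
```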
And we're also thinking about dedicated network peering between the ATLAS network infrastructure and Google storage. That would give us much higher performance, 100 gigabit per second, and then we would be able to control it and do manageable transfers. But this is if you just want to run and bring the data in. It's doable.
And now we'll talk about XRootD, the clustering and storage technology. Andy was the creator and driving force behind this project, and still is. Andy, please.
[APPLAUSE]
ANDREW HANUSHEVSKY:
[INAUDIBLE].
So I want to take a quick trip through a bit of technology that we developed a while back but that wound up being an absolutely ideal match for Google Compute Engine.
Now you'll say, never
heard of it.
Well, it's a system for scalable
cluster data access.
And you say, that's nice.
What is it really?
Well, what we have is an implementation of two services. One is an XRoot service that provides access to data. You would take one of these daemons and drop it on every node where you have data that you need to access. Now, there's a companion service. It's called CMSD. It stands for Cluster Management Services. And that's used for data discovery, as well as routing clients to where the data is, and server clustering.
So these are separate, but we
normally use them together.
And so for the purposes of this
talk, we'll always talk
about this particular pair.
Now I want to emphasize
that this system
is not a file system.
People are actually using the
system to cluster existing
file systems.
So we have people basically
taking HDFS, GPFS, Lustre, and
building one big giant cluster
out of that and having uniform
data access.
So while it's not a file system,
it's also not just for
file systems.
We have an experiment that's using this as a framework to cluster MySQL tables across hundreds of MySQL servers so they can do massively parallel queries. So the idea is that if we don't have a plug-in for your data, you can write a plug-in for your data, and then you can cluster it.
First, we start off with a protocol driver. You can plug any number of protocols into that driver. In our particular case, we want to do XRoot.
So let's take a look as we
plug stuff together.
So plug in your protocol.
Plug in your authentication
framework.
Plug in your logical file
system, your authorization
framework, your storage
system, and then your
clustering.
And all of a sudden, you've
built up a clustering system
for a particular kind
of application.
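To make that concrete, here is a minimal sketch of how those pieces are wired together in an XRootD configuration file, which both the xrootd and cmsd daemons read; hostnames and paths are placeholders, and a real deployment would add the authentication and authorization plug-in lines:

```
# one configuration file shared by xrootd and cmsd on every node
all.export /data

if headnode.example.org
  # the head node routes clients; it serves no data itself
  all.role manager
else
  # every other node is a data server reporting to the manager
  all.role server
fi

# where the cluster management (cmsd) manager listens
all.manager headnode.example.org:1213
```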
So let's get back to what the
data access problem is.
It's the High Energy
Physics regime.
Yeah, they do nasty things,
like start up thousands of
parallel jobs.
And they all start up pretty
much at the same time.
And if that weren't bad enough, each one of those opens 10 or more files. That's pretty much the profile of a [INAUDIBLE] denial of service attack, if you ask me. But basically, you have to handle that.
To make matters worse, the particular framework that they use does small block, sparse, random I/O. What do we mean by small block? An average read size of about 4K.
What do we mean by sparse?
Well, they have like a
10-gigabyte file, and you'll
be lucky if they read
half of it randomly.
So pretty challenging.
So we adopted a synergistic solution to try to attack this problem. And you'll see what we mean by synergy here. First, we wanted to minimize latency. And the key elements there were using a parallelizable protocol, file sessions, a sticky thread model, and lockless I/O. Next, we wanted to minimize hardware requirements: short code paths, compact data structures, structures that are cognizant of the memory cache so the code is friendly to the memory cache.
We don't actually move data around in the server, and we don't do cross-thread data sharing. So in the end, we wind up with less than seven microseconds of overhead per I/O request on a two gigahertz CPU and less than a 100-megabyte memory footprint. So pretty compact.
Now those two are synergistic
in the sense that if you
minimize latency, you'll see
opportunities to minimize
hardware requirements.
And if you start minimizing
hardware requirements, you see
opportunities to minimize
latency.
The next thing we wanted to do
was minimize human cost.
So what does that mean?
Well for us that meant a single
configuration file, no
database requirement.
You can add and delete
nodes at will.
We don't care.
You don't have to restart
anything.
You just add stuff
and delete stuff.
And you use your natural file
system administration tools to
administer this thing because
that's what you know.
Together with that, we wanted to maximize scaling. And those two are actually synergistic. You can't maximize scaling unless you minimize the human cost. And you immediately see opportunities between those two as you attack both of those problems.
So let's talk about scaling.
We use B64 trees for scaling. And we basically scale this using this pair, XRootD and CMSD.
So let's start out with
a single node.
And then what we'll do
is we'll add 64 data
servers to that node.
Gee, looks like a really dinky
cluster, doesn't it?
So what we do, it's a B64 tree,
well, we'll add 64 data
servers to each one
of those 64 nodes.
Well now we have a
cluster of 4,096.
Decent, but not very big.
We can just repeat this step, and now we have a cluster of 262,000 servers.
Well, that looks big, but what
if we iterate one more time?
Now we've constructed
a cluster of 16
million data servers.
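Spelling out the arithmetic: with up to 64 children per node, a tree of depth d addresses

$$
N(d) = 64^{d}, \qquad 64^{2} = 4096, \quad 64^{3} = 262{,}144, \quad 64^{4} = 16{,}777{,}216
$$

so the 262,000 and 16 million figures quoted above are the exact powers 64^3 and 64^4.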
And you look at that and
say, hm, that looks
like a house of cards.
Well, not really because we can
replicate the head node,
geographically distribute it,
and now we have quite a bit of
redundancy in the system.
So let's add some names
to these things.
The head node is the manager.
Intermediate nodes
are supervisors.
And data servers always
are at the leaf nodes.
So remember that.
Now, this is a B tree.
We can split it up
any way we want.
And this works great for cloud deployment, because in fact part of that tree can be inside GCE, another part can be in a private cloud, another part in a private cluster, and we can piece it all together to make it look like one big cluster.
Now, you look at that
and say great.
But I have 16 million nodes.
How do I get to the data?
I can't keep track of
16 million things.
Well, in fact, let's take
a look at how we do it.
So when you've got a big cluster
like this, you really
have to adopt a brand
new strategy.
So we have a client. It issues an open to the head node.
And here we're going to assume
that the head node knows
nothing about the file
the client needs.
So what does it have to do?
It has to find the route
to the file.
And it'll accomplish that
by just doing a directed
broadcast parallel query.
And that'll set up the routing
tables in this tree.
After that, the head node can
then redirect the client to
the next subtree.
That subtree in turn redirects the client to the leaf node. Now, this is the only scalable way of doing it, because you don't have to keep track of anything but the immediate route. That's pretty much how the internet works.
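From the client's point of view, none of this routing is visible; the library follows the redirects. Here is a minimal sketch against the XrdCl client API, with hypothetical host and file names:

```cpp
// open_sketch.cc -- open a file via the cluster head node; the XrdCl
// client follows cmsd redirects down to the data server transparently.
#include <XrdCl/XrdClFile.hh>

int main()
{
   XrdCl::File file;

   // The manager locates the file and redirects us; to the caller it
   // looks like a single open against one host.
   XrdCl::XRootDStatus st =
       file.Open("root://manager.example.org:1094//data/events.root",
                 XrdCl::OpenFlags::Read);
   if (!st.IsOK()) return 1;

   file.Close();
   return 0;
}
```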
So the bottom line here, this
is a simple, flexible, and
effective system.
I want to say it's simple, but the devil's in the details. We have a paper we presented at IPDPS that will give you some of the algorithms we had to use. And you can actually get the paper at XRootD.org.
It's LGPL open-source.
You can download it, run it.
It's managed by the XRootD
collaboration.
That collaboration is
open to new members.
And I do encourage you
to go to XRootD.org.
So now Sergey, finish it up.
DR. SERGEY PANITKIN: And let me summarize. All in all, we had a great experience with Google Compute Engine. We tested several computational scenarios on that platform: PROOF and XRootD clusters, PanDA batch clusters, and we ran large scale Monte Carlo production.
We think that Google Compute Engine is a modern cloud infrastructure that can serve as a stable, high performance platform for scientific computing. And the tools developed by the LHC community may be of some interest to the broader community of developers working on Google Compute Engine and other compute engines. And thank you very much.
[APPLAUSE]
GARRICK EVANS: Thanks a lot. So we're done, and we'd be happy to take any questions that you guys have at this time. Please come up to the mics. No? OK, well, thanks. Hope you had a great I/O.
