>> It is an honor to have Professor
Miron Livny with us here.
He is a Professor of
Computer Science at the
University of Wisconsin-Madison,
and he's going to talk
about High Throughput Computing
in the Service of
Scientific Discovery.
I think you can take it from there.
>> Yeah. Thank you.
It's a short, but so far
pleasurable visit here,
and I must admit that
I was not aware until
this morning that I am
scheduled to give a talk here.
I thought it would be just a visit.
So I am recycling
the talk that I gave
at ISS earlier this week,
but I strongly encourage you
to guide me through the
process with questions,
suggestions, disagreements,
whatever complaints,
especially if you want to know
deeper technical details
about what I am presenting,
because the talk is pretty
shallow on technical things,
and it's more about the big picture.
So I hope you will find it useful,
but I'm open to any adjustment
and suggestion to how to do it.
So I'll start with a little bit of
background about UW-Madison,
besides the fact that it
always looks like this.
But the really important
things are the numbers:
43,000 students,
22,000 faculty and staff,
$3 billion of total budget,
but what is important
for the discussion here
is that there is also
$1.2 billion of external support
for research that
goes into the campus.
So that's a very
significant research engine
that obviously is strongly
related to what I am
going to talk about.
So that's the reason I wanted
to make sure that this
is on the table as we start the talk.
So before I dive into
the more scientific part,
let me talk business for a moment.
It's always a challenge when you
are in the business world to
decide where you are on the
vertical versus the horizontal,
or the horizontal
versus the vertical.
Do you focus on a market
that is vertical,
very few users with very
specific requirement,
or do you go horizontal
and do something
that applies to everyone
and what have you?
Where you position yourself,
and where you put your effort,
and all that thing is a
challenge for any business.
So with this as a background,
when you work with Domain scientists,
then they are interested
in verticals,
they are interested in
solving the problem,
and they don't care about
what delivers the solution
to their computational problem.
How elegant is it?
Does it use blockchains or
does it use machine learning?
Whatever it is, it
has to be effective.
Because if it's effective,
they will adopt it,
and if not, who cares?
It's not about efficiency,
it's about effectiveness,
because it has to do what
they need to do with
the resources that they have in
the amount of time that they have.
So this is the
vertical that researchers or
Domain scientists care about.
Computer scientists are horizontal.
They want to come up with
something that everyone will use.
Therefore in terms of their science,
they are focusing, and I
am a Computer Scientist,
so I'm supposed to
belong to this side,
but I would say and we'll
see later that I strongly
believe that if you
want to be horizontal,
you must evaluate what you
are doing in real verticals.
So even if you want to be horizontal,
you cannot run away from the
verticals and work with them.
So I think the same realization
is shared by MIT, which a
year ago came out and said,
"We are going to invest a
billion dollars in creating
a new college that will reshape."
When I see MIT saying they will
reshape themselves, that's huge.
Institutions like this don't reshape.
Commercial entities restructure
every year or every six months,
but organizations like MIT
don't reshape themselves.
The main thing is to reorient
MIT: to bring computing and
AI to all fields of study,
and to use all fields
of study to impact
research and development
in computing and AI.
So this is recognizing the fact
that there is a strong synergy
between the two areas that
requires a different approach to
how an established and
successful institution
like MIT is going to move forward.
So rephrasing MIT's announcement
in the context of the Center
for High Throughput Computing,
I claim that we established the
center in 2006 to do exactly that
on a much narrower scope of
the computing technology,
and to do it with distributed
high-throughput computing,
to have impact on all fields
of study on the campus,
and to leverage this $1.2
billion enterprise to
do what we do better.
So one way to summarize what
we are doing in CHTC is to
harmonize the vertical
and the horizontal.
Now, that may sound as if we are
trying to square the circle.
Because usually, you
say that two things
that are completely independent,
they're orthogonal.
So if you think about
the horizontal and the
vertical as being orthogonal,
then you say how can
you harmonize them,
but I believe we can
square the circle.
So these are the love letters we
like to get from our verticals,
and this is a guy who is studying
mice on a remote island in
the South Atlantic, with
an evolutionary phenomenon
that is not understood.
Why are the mice there twice
the size of the mice in
Europe, even though they
arrived at this island not
by swimming from Europe,
but on cargo ships that went
to the island 200 years ago,
and suddenly they're twice as big
as their ancestors in Europe?
By the way, I learned as a result
of talking to these guys that
there are two types of field mice:
there are the German and
there are the French.
The German mice may [inaudible]
I don't know which one made
it or both of them made it.
But the thing is,
what they don't
understand is how you can
have an evolutionary
process in 200 years,
because they are twice
as big, and actually
they're causing problems there.
An island is basically
a rock in the middle of
the ocean; nothing gets there
unless it swims over, but
that's beside the point.
So this is what we like to see:
CHTC is essential to my project.
This is adoption by a vertical.
I couldn't do what
I did without this.
This is the effectiveness,
if you will.
So in order to do this harmonization,
I created this acronym as
I put together the slide,
you have to do DDDOE.
Namely, you have to
be able to Design,
to Develop, to Deploy,
to Operate, and to Evaluate.
The researchers are coming
with their upper layer.
You have to provide the rest which is
the lower layer and the
stuff in the middle
and you want to do the
best you can on this,
but you also have to make sure
that you partner with them,
which is more of a personal
commitment of time and effort.
What you would like to aim
at or you should aim at if
you really believe in it
is how can you influence
their science with the
capabilities that you offer?
See, because the ultimate success
is if the vertical that you
created not only was adopted but
changed the way they do science.
So if you can do more,
you can do better,
you can do different.
They may change the scientific
method to do things differently.
So their science is changing.
A simple way to think about
it: if in your science
you formulate the problem by
assuming you can only
calculate two points,
then you know that all you can
deal with is straight lines.
So you will not ask questions
about what happens
between the points.
But if you can do many points,
you can start asking
questions about the shape of
the curve as it goes
between the two points.
That's a different way
to do your science
because you realize you can do more.
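The two-points-versus-many-points idea can be sketched in a few lines of Python; the function and the point counts below are hypothetical stand-ins for an expensive simulation, not anything from CHTC:

```python
def expensive_simulation(x):
    # Hypothetical stand-in for a computation that costs real core hours.
    return x * x

# With a budget of only two evaluations, all you can fit is a straight line.
two_points = [expensive_simulation(x) for x in (0.0, 1.0)]

# With plentiful throughput you can sample densely and start asking
# questions about the shape of the curve between the points.
many_points = [expensive_simulation(i / 100.0) for i in range(101)]
```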
It's a question that I
frequently ask researchers:
"What would you do if I gave
you two million core hours?"
In many cases
I get a blank face.
Part of it is because
researchers don't even dream of
what they can do if they
had more computing power
available to them.
I would argue that
this is also true of
researchers in general; when
you come and say,
"If I gave you two million dollars
to do your work, what would you do?"
they always come back and say,
"Oh, I don't believe I
can get that much money."
I say, "Let's assume
you have the money."
But at the same time,
we want to take care
of our horizontals
and we have been
developing quite a bit of
new stuff in the area of what we
call distributed high-throughput
computing, and this has
materialized in the
HTCondor software, which is
a whole family of technologies;
obviously I don't
have the time to even
start explaining all
the different pieces.
What is also nice is that we were
recognized as
significant contributors
to two Nobel Prizes in physics.
The first one was the detection or
the identification of the
Higgs boson at the LHC
and the second one
was the detection of
gravitational waves, which proved
that the prediction, the
model that Einstein envisioned
100 years ago, was correct.
By the way, Einstein never believed
that it could be
measured, because it's
such a tiny thing, and measuring it
and detecting it, and all this.
I will get back to the
gravitational waves later.
The other part of our work has
been to pioneer the concept of
research computing facilitation
which is different from
user support, because it
really recognizes the role
of these individuals in
bringing computing and
the science together and not helping
somebody to implement a piece of
software to run on something.
We use non-technical people to do
research computing facilitation
because we learned over the years
that technical people
don't know how to listen.
Technical people have a solution
before you presented a problem
because the solution
is what they know.
So they will map whatever
your problem is to
their solution and facilitation
is about finding the
right solution for you.
Therefore, you should learn how
to listen and to understand
what the problem is rather than to
focus how to map the problem
to the solution that you know.
To give you a little bit of
a feeling for what we do on
the ground: last year we delivered
400 million core hours to
researchers on the campus from
250 different projects,
along with about
1,000 research computing
facilitation hours.
This is the snapshot that
was before my talk at TIFR
last week of the top 10
users of CHTC in 24 hours.
So we delivered 1.1
million core hours.
The first one, CMS, one
of the LHC experiments,
represents an international group
of more than three and a
half thousand researchers.
The second one is IceCube,
which I will talk about later;
it is also an international
collaboration
that is anchored at Wisconsin.
They did about 50
percent of the cycles.
Then you have individual
groups from physics.
This is the Biological Magnetic
Resonance Data Bank, math,
and even here, at number 10,
Nutritional Sciences used
almost a thousand cores solid
for 24 hours to do something.
So that's the way it
looks on the ground,
in what we are doing in CHTC,
and this is enabled through the
research computing facilitation.
This is Michael, who is leading
the facilitation effort,
and an important part of it is to
teach people how to fish
rather than to fish for them.
User support usually
does the fishing,
and says, let me write
the code for you.
Let me rewrite this thing.
Let me do this thing.
We don't do any of this.
We guide you to do
what needs to be done,
or we can connect you with somebody
that will help you do things,
but this is not what
the facilitators do,
and we apply the same concept of
scaling out not only
to our infrastructure,
but also to the facilitation process.
There are now national activities
that are related to it,
and there have been several
NSF projects related to it.
So it's becoming more
and more understood.
So IceCube is a detector of
neutrinos at the South Pole,
where they drilled probes 1.5
kilometers into the ice,
and through the ice,
they're detecting these
elusive particles.
The important thing is that all
the infrastructure is based
on HTCondor, it's a vertical,
and you may find it interesting
that GPUs are critical to
the computing capacity,
but they don't do machine learning.
They're using GPUs to do
simulation of the ice,
that is critical to
their ability to turn
the information that they get
into the science that they need.
This is the distribution of
GPU hours that they did over
the last year and a half,
where every color is a
different pool of resources,
or a different site,
or a different organization,
you will see it later,
that provided them with the GPUs,
and the upper value here is
300,000 hours
in these monthly columns.
Of the sites,
the first one is the University
of Wisconsin-Madison.
They did one and a
half million total GPU
hours in a year and a half.
But GPUs are not the only
thing that they need;
they also need CPUs,
which are obviously easier
for them to get.
Again, they can go out and
use CPU hours from all these
different institutions.
Here the top institution is at
35 million CPU hours total over
the year and a half.
So the important thing
is having many colors,
multi-color is good, which means
the contribution is coming
from many different sources.
Speaking about sources,
this is a picture
of a blazar that is emitting
particles towards Earth,
and the distance between this blazar
and Earth is about four
billion light-years.
They recently detected a particle,
which they call the ghost particle,
that was emitted from this blazar,
and by knowing where it came from,
and its properties and all that,
the ability of what's called
multi-messenger astronomy let others
focus on this blazar that
is related to the emission,
because the emission was created
by an event that was related
to the black hole in this galaxy.
So that's what we enable them to do,
in terms of the computational part,
once they get the signals out of
the ice from these
different detectors,
and I encourage you,
if anyone is interested,
to go search for it on the web;
there's a lot of
interesting stuff there.
But speaking about four
billion light-years, and trying
to give meaning to
the term multi-scale:
this is another love
letter that we got
earlier this year, in
July, saying,
look at this protein image that
we just created using CHTC.
This is 12 angstroms;
four billion light-years on
one end, 12 angstroms on the other.
Now, what we are seeing
more and more in science is
that the instruments, in this
case a cryo-EM microscope,
are generating a lot
of signals that require
significant computing power
to turn the signal into
data before you can do advanced
machine learning on it,
or whatever you are doing with
data in order to get your science.
So the computing barrier is more
in how do I get the data out of it,
rather than how do I turn
the data into the science,
and this is an example.
So in one case, we are getting
signals from particles that were
emitted four billion
years ago, and in this one,
it is something that was emitted by
an electron microscope
that is now being
deployed in more and
more universities around
the world to study structures.
They can see things
at the atomic level.
I mentioned earlier the
gravitational waves.
This is LIGO, this is a report that
came from the group in
Germany, and again,
we couldn't do it without
the collaboration and the technology
that we have been doing together,
and here, this is a partnership that
has been going on for
over 15 years now.
It is common that to
build these things,
you need a long relationship,
across technologies, across
science and software, and all that,
to get to where people are, because
they start with one instrument and
they upgrade their instrument.
The first instrument
couldn't detect a thing,
but they had to go through
the first instrument to
get the second instrument,
and then they make the detection,
and these instruments
are a big investment.
So the first installation of Condor
was at Wisconsin in 1985.
So even in our small world,
we are working on something
for quite a number of years.
From the beginning, we made
sure that we create verticals,
and at the beginning,
the vertical was just our
computer science department.
Now, about the work that went
into Condor: we had to change the
name because we had
a lawsuit about calling
the software Condor, and
we managed to get out of
it, after spending a lot of money,
by putting an HT in front of it.
This work is actually a
continuation of my PhD work.
That's the reason why I am
sympathetic with your work,
and my PhD work was
strongly influenced by
the distributed computing
era of the mid '70s,
and here's a paper that Enslow
published in Computer magazine.
Yes, we had magazines
about computers in '75,
for those of you who only believe
in web publication;
actually, I have another publication,
from the Communications of
the ACM, that is even earlier,
coming up in a few minutes.
But the important thing
here is that Enslow
listed the benefits of
a distributed system,
and as you can tell,
we have new generations
of buzzwords that are
re-promising us the
same things that were
listed by Enslow in the mid '70s.
He said, if you have
a true distributed system,
you would get all of this.
Then you ask yourself
the obvious question:
if we knew all this
in the '70s,
how come you
cannot order from
Amazon a distributed system in a box,
have it delivered, deploy it,
and be done, right?
Why do you have to call
these things Clouds now,
as if they're something new?
The reason for that is actually
in a technical report that
Enslow and his student
published in '81,
and they said, okay,
you want all these good things,
these are the properties that
the system must have in order to
be a true distributed system,
and if it's not a true
distributed system,
it will not give you the benefits.
I want to focus on
two elements of this.
The unity of control
and component autonomy,
because I think the stress
between these two is key.
Now, we are trying to
do system transparency;
we now call it virtual
machines or containers,
whatever the buzzword of the day is,
but everyone wants to
make it transparent.
So the unity of control says
that all the elements of
the system have to be unified
to achieve a common goal.
I know, it doesn't sound
like computer science.
You never talk about the components of
a database being unified
to achieve a common goal.
This sounds like social science,
but this is critical.
When you bring together the
pieces of a distributed system,
you have to assume that there is
some driving goal that they all
share and they want to achieve.
At the same time,
you have to make sure that
all the components are
autonomous because if they
are not fully autonomous,
you will not get all these
beautiful benefits of before.
Now, how you achieve
this commonality of
goal with full local autonomy
is the hard problem.
Therefore, if you're ever presented
with a distributed system and
somebody said, this is it.
Check the unity of control
and check the autonomy.
We as humans don't like to build
systems with local autonomy;
we like to be in control.
We like to know about everything
and make decisions about everything,
rather than letting components be.
You can take this to the question
of short-range forces and long-range
forces in a physical system.
You achieve stability
through the short-range forces,
so you are creating some local
autonomy in doing that.
So that is one of the things
that have been driving
us from the beginning.
So it's very important when
you do something in the space,
whether it's your PhD or
a continuation of it,
try to put it on principles
rather than on the latest
buzzword of the day.
At the same time, if you build
systems then you have to
go to the masters and see,
what did they teach us?
I assume I don't have to
introduce Dijkstra to you,
but he published this in '68.
It was presented in '67 at the
ACM Operating Systems conference,
which, I think, is
actually happening this week.
He said, if you want to build a system,
understand the sequential processes
and put them in a hierarchy,
and build them on
a solid design where
you understand what
the pieces are and what the
relationship between the pieces is.
Namely, don't sit down
at your terminal with
the latest scripting language
and start writing the system.
Yes, he was a logician.
He wanted to prove properties,
and that is that.
But another important
part of this paper,
which is a short read,
I encourage you to read it,
was that Dijkstra published
a paper where he listed the
mistakes that they made.
How many papers have you
read recently where people
listed the mistakes that they
made in doing anything?
All papers just say,
look how great I am.
Here he said, we made mistakes.
The first mistake was that they
tried to build a perfect system,
which is way too complicated,
way too involved with
the boundaries and the corners,
and all the things that
make it very complex;
and also, it's
continuously chasing assumptions
that are changing all the time.
You have to
simplify, and you have
to not build something
that can handle,
in the optimal way,
every corner case, because
you will end up with nothing.
The second mistake that
they made was that they
didn't include debugging
from the beginning.
This is something that I'm
sure you see all the time.
I'll write it, I will
make it run fast,
make it scale, all those
things; and debugging,
we'll deal with it later,
because debugging is
not functionality,
debugging is not publishable.
How easy is it to debug your
latest, greatest algorithm?
I can tell you that in my group,
when somebody comes to me and says,
how about if we use this as an
algorithm for doing something,
I say, are you willing to be
responsible for debugging and
supporting it in the field
on that many installations?
We don't have the number of
installations that you
guys are talking about,
but still, we have enough for the
size of the team that we have.
Typically, the answer is, no,
I don't want to deal with that.
So I say, then we're not going
to follow up and implement it,
even though it's an
amazing algorithm,
because it's so complex that we
will never be able to debug it,
and proving correctness
of implementation,
you know how difficult it is.
Taking all these principles and
all these concepts together,
it was unavoidable for us to
come up with a vision that said
we can do a worldwide computer,
a Flock of Condors;
we presented it at CERN in '92.
We said, let's connect all these
computers, all these things,
and create one worldwide
system, I know,
it sounds like a Cloud,
and we'll run on all of this.
We even already had a
system that could do it;
we had a demonstration of
a system that could do it.
So each of these is a Condor
pool, and they spanned from Dubna in
Russia all the way through
Europe to Wisconsin.
Yes, at that time,
in the early '90s, we had only
200 workstations in the pool,
and we were able to submit
jobs that traveled through
these interconnects from
Dubna to Wisconsin.
Actually, if you think about
execution of jobs
and message passing,
it's a little bit similar.
You enter at one place,
you send your job somewhere,
and you get the results back,
which are like an acknowledgment.
So the principles of routers and
network transfer can be applied,
and here we actually used routers and
gateways conceptually to
move the jobs across.
So if you went from Berlin to here,
you may have gone this
way to land in Wisconsin.
But the fact is that
even an advanced organization
like CERN didn't get it.
We presented it, but they didn't get it.
Now, 25 years later, they got it.
They installed it.
The transition to HTCondor for
all their batch processing,
which includes about 15,000
servers with 230,000 cores,
is complete, and it's running.
If I have time, we'll get to
Open Science a bit later.
We do this whole thing worldwide.
But there are two lessons here:
if you build on
the right primitives,
you reach things which
are unavoidable.
There are not that many
solutions to the same problem,
even though it seems that
you can choose here,
or you can choose there.
So anchor things on
the right principles,
and you have to wait.
If people adopt what
you are doing quickly,
I don't consider this as a good sign.
In '96, I decided,
it was more me, I said,
okay, we have to articulate
that we are different.
Everyone around us is HPC.
The only way that they refer to
the work that we're doing was,
this is embarrassingly parallel work.
It hurts you when people say that
what you do is embarrassing.
So part of what I have been doing
since is anytime somebody talks
about embarrassingly parallel,
I say, this is pleasantly parallel.
This is naturally parallel,
there is nothing
embarrassing about doing it.
I also coined the term
High Throughput Computing.
A year later, I was interviewed by
HPCwire about High
Throughput Computing,
which is also funny.
But here is the way that I try to
contrast High Throughput
versus high-performance.
So high-performance
measures everything
in floating point
operations per second.
If you look at the top
500 list of the machines,
they are based on LINPACK and
they are giving you the FLOPS.
How many floating point
operations you can do per second?
By the way, the way they build
these machines is really focused on
putting as much silicon as
possible into the floating point.
Let's assume that that's fine now.
>> I don't know.
>> Yes, it's hard to set up.
Fault tolerance is not that you
connect and you plug in,
and you still don't have power.
So they focus on FLOPS,
to be in the top of the top 500 list,
which is the biggest
machine in the world,
you have to run LINPACK once
and be faster than anyone else.
The scientists that we are
working with care about FLOPY,
which is floating point
operations per year.
Now, you know very well that
you cannot take what
you can do in a second
multiplied by the number of
seconds in a year and get
what you can do in a year.
If you can run a kilometer
in three minutes,
it doesn't mean that you can run
that many kilometers in a year.
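The seconds-versus-years point can be made with back-of-the-envelope arithmetic; all numbers below are illustrative assumptions, not measurements of any real machine:

```python
# All numbers are illustrative assumptions, not measurements.
peak_flops = 1.0e12                  # assumed peak rate, per second
seconds_per_year = 365 * 24 * 3600

# Naive extrapolation: the machine runs at peak for a whole year.
naive_flopy = peak_flops * seconds_per_year

# In practice jobs fail, machines go down, and capacity is shared,
# so only a fraction of peak is sustained (fraction is assumed).
sustained_fraction = 0.1
realistic_flopy = naive_flopy * sustained_fraction
```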
It's a different problem,
and it requires robustness,
but it also requires automation.
Why automation?
Because there is a huge
difference between running
one job for 100,000 hours
and running 100,000
jobs for one hour each.
The big machines are
designed for this.
High Throughput is designed for that.
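The 100,000-jobs-for-one-hour pattern maps directly onto an HTCondor submit description; here is a minimal sketch, where the executable and file names are hypothetical:

```
# sketch of an HTCondor submit description (names are hypothetical)
universe    = vanilla
executable  = simulate          # one run takes about an hour
arguments   = $(Process)        # each job gets its own index
output      = out.$(Process)
error       = err.$(Process)
log         = simulate.log
queue 100000                    # 100,000 independent jobs
```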
I will not have time to talk about
what we are doing for automation;
we are using DAGs, directed
acyclic graphs, for
capturing these things.
You can find information about it,
but that's the core of
high-throughput computing.
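The directed acyclic graphs mentioned here are expressed in HTCondor's DAGMan input format; a minimal sketch, with hypothetical job names and submit files:

```
# sketch of a DAGMan input file (names are hypothetical)
JOB  generate  generate.sub
JOB  analyze   analyze.sub
PARENT generate CHILD analyze   # analyze runs only after generate succeeds
RETRY generate 3                # failures are normal: retry up to 3 times
```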
Again, you have to be patient.
So I introduced it in 1996.
In 2017, a National
Academy of Science report
stated that many fields
today rely on high-throughput
computing for discovery.
Then they even said many
fields increasingly rely on
high-throughput
computing for discovery.
So there is now a recognition
that something that didn't exist
20 years ago in a way as
a concept is critical for
scientific discovery.
So the good news is that high
throughput is important.
We tried to do similar things;
we talked about the grid,
the grid world, and we
have a chapter in this
book that was referenced
I don't know how many tens
of thousands of times,
because everyone who wrote
any paper about the grid
referenced the book,
I think, without opening it.
It's like: grid?
Here is the book.
We tried to convince the community
that the key to the grid is
integrated mechanisms that are
robust, scalable, and portable.
The community didn't follow that,
and then the whole grid movement
disappeared because everything
that happened there,
people said, but it doesn't work.
We said, sure, it
doesn't work because
nobody focused on mechanisms
that are dependable.
We have a problem in
our field that people
like to work on policies.
People like to publish policies.
People claim that they know
how to evaluate policies.
Doing the same for mechanisms
is much more difficult.
But I would argue that if you
give me a good set of mechanisms,
I can get decent
behavior or performance
from any system with
a policy that will
take me 15 minutes to invent.
But if you give me an amazing policy
and no mechanisms, good luck.
It will take you more than 15 minutes
to come up with the mechanism.
So we need mechanism,
and at the end of the day,
even when you come up with an
optimizer as we talked earlier,
you still need something
that will run the plan.
If this sucks, or if this
breaks halfway
because of whatever,
the optimizer is not
really the solution.
So I'm checking on the time because
there is no clock here anywhere.
Are there any questions,
any suggestions, anything?
We are about 10
minutes from the hour.
I have more stuff.
I can continue, but I will
give you an opportunity. Yes?
>> So what is the bottom
line of the story?
I mean, like when jobs
crash or machines crash,
what happens then [inaudible]?
>> We consider all of
these things as normal,
and just try again.
>> So you start it from the beginning?
>> No. Now, if you
have a check point,
we'll start from where you are.
If two nodes disconnect,
we'll try to reconnect.
The question is how long
were the two disconnected?
In Condor, that's where
you submitted the job,
and that's where the job is running.
They have a relationship
with keep-alives,
and all that stuff.
Now, if this one goes away,
then how long will that one stay
committed to the relationship?
How long it will stay is an
autonomous decision of both players.
So we use this for doing live update,
for example, on the submit.
So this submit machine may have
10,000 jobs that are running,
and we want to upgrade it.
So actually, we send an update to
all the workers and say, would
you please wait longer,
because I'm going down and it
may take me longer to come back.
So you just keep doing
what you are doing,
and then reconnect; but we don't
want you to stay waiting
for the connection forever.
Because if I really
crashed and the machine
was burned and all that,
I want you to let go
because there will be
nobody to take the results.
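In HTCondor, this "please wait longer" commitment is a configurable lease on the claim; a sketch of the submit-side knob, with an illustrative value:

```
# sketch: ask execute machines to keep the claim alive across a
# disconnection (value in seconds is illustrative)
job_lease_duration = 7200
```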
So the advantage that
we have, or the luck,
is that we started with workstations,
where anything can happen.
And it's not only that:
in any engagement that happens,
the data or the rules that
govern the engagement can
change between the
time that they were
advertised and the time that
the action takes place.
So when the action
comes to take place,
we have to check whether
the conditions for this,
what we call a match,
are still valid.
If any partner, this
would be autonomy,
decides to get out of
this relationship,
everyone is, okay, we tried.
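The match described here is HTCondor's ClassAd matchmaking: each side advertises its own expressions, which get re-evaluated, so either side can autonomously end the relationship. A sketch, with illustrative attribute values:

```
# submit side: what my job requires and prefers
Requirements = (OpSys == "LINUX") && (Memory >= 2048)
Rank         = KFlops

# execute side: when this machine is willing to run jobs at all
START        = (KeyboardIdle > 15 * 60)
```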
So the mentality is if it worked,
it's a miracle, or
it's a coincidence.
Now if it didn't work,
it's not a problem.
It's normal.
Typically, when people come to
give presentations to my group,
they sometimes almost run away,
because they come and say, look.
I did this and this, and it can
run faster under these condition,
and all this, and we stay mum.
Then when they're done and they ask,
and they say, okay, question.
It's like, okay, what
happens at this task?
What happens if this goes away?
What happens if this then
fails to communicate?
What happens if this buffer is full?
I'm picking on databases because
my host is a database person.
The worst thing that
can happen to you is
a database that runs
out of disk space.
All the optimizers in
the world will not save
you, and it will take you a long
time to recover your database,
and it just ran out of disk.
It's not that it burned down;
it stopped journaling. So in Condor,
if you cannot log, you shut down.
We are working hard
to put into the log
a message that we shut down
because we couldn't log anymore.
We keep at it all the time.
Yeah, it doesn't always work,
because it's hard to
keep space on disk,
and we have to be prepared
for everything to fall apart.
So that's a long answer.
It has to be in the DNA,
not as an afterthought.
Anything else?
So let
me give you another principle
before we have to call it quits.
So in '92, we published a paper.
Here's what we did: we were
very proud of 250,000 jobs.
We do that at Wisconsin
probably in one hour these days,
and worldwide, maybe every second,
I don't know, something like that.
But the important thing is we said,
look, we listened to the user.
Back to the verticals.
We asked them, what
do you care about?
They said, we want
to get access to as much
capacity as possible from a single point,
and we want this point of access
to preserve our local environment.
By the way, the overhead here
is not networking overhead.
It's the effort to make it all
happen effectively.
So we turn this into a
principle that we use
throughout our system and
other systems that are
built in these contexts,
including Open Science
Grid and the like.
We said, submit locally
and run globally.
Submit to a local environment where
you manage everything locally.
It's not only that
you're managing locally,
you use a local namespace,
you use a local identity space,
and all that, and try to
reach as far as you can.
So that's, by the way,
the reason why we don't
assume shared file system,
because you cannot run
globally if you assume
a shared file system.
You have to be prepared.
I know it sounds like containers: pack everything up and go.
Can you do it for everything?
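Concretely, the no-shared-file-system stance shows up in an HTCondor submit description file: the job declares its inputs and Condor ships them to wherever the job lands, then brings the outputs back. A minimal sketch, with illustrative file names:

```
# Submit locally, run globally: no shared file system is assumed,
# so inputs travel with the job and outputs are transferred back.
executable              = analyze.sh
arguments               = input.dat
should_transfer_files   = YES
when_to_transfer_output = ON_EXIT
transfer_input_files    = input.dat
output                  = analyze.out
error                   = analyze.err
log                     = analyze.log
queue
```

The submit file lives in the user's local namespace; the file-transfer directives are what let the same job run on a machine that shares nothing with the submit host.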
But my understanding from
talking to a [inaudible] ,
that one of the challenges
that you guys are facing,
and I understand he's coming
to visit here shortly.
I saw him a couple of
weeks ago in Madison.
He said, yes, we are now
facing what you have been
complaining about for
such a long time.
In the Cloud environment, where you need elasticity in your database management system, how do you grow and shrink?
This is the same problem.
How do I give you
the data that you need where
you are going to execute?
How do I bring the data back?
It's not identical, but
it's the same fundamental.
Let me make a quick comment here.
There is this whole business
of resource acquisition
that is related today to the Cloud.
Maybe I'll make a comment on it later quickly when I show you one of the pictures on CHTC.
So that's the thing,
and I, for many years,
have been using this picture of
the desktop that goes to the floor,
that goes to the building,
that goes to the campus, that
goes to the region,
that goes to the world.
You want to sit here and run here,
but you want to use
everything in between.
It's not that you go to the Cloud,
and my understanding from
what I know about Azure
is that it has been the
Microsoft philosophy
of using it as an extension of your local environment rather than taking everything that you have and putting it there.
So that's the submit
locally and run globally.
But you go to the
young generation and
they don't have desktops
anymore and they
don't have workstations anymore,
they have Jupyter Notebooks.
Everything starts and ends
in a Jupyter Notebook.
So the question is, and that's
what we are working on,
just to give you a flavor
why we are not done,
is how do you bring Python to Condor? All the bindings, all the bases: what you create is an API, an interface from Python, to expose what Condor does as jobs and stuff like that.
But also to go in the opposite
direction and to say,
how can you bring the concept of
high-throughput computing
into the Python land?
Here, what we are doing is that Python has map as an important construct for capturing doing multiple things. We created a module called HTMap that implements the map as a high-throughput computing thing, through Condor in the back. So the functions that are invoked by the map are basically jobs that are running in Condor, with all the issues of asynchronicity and all that.
But at the end of it,
you have an object,
which is the map object that
got populated by Condor jobs.
But then you have to make the object richer, because there are all the questions about the asynchronous execution underneath.
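The idea can be sketched locally: a map object whose elements are filled in asynchronously by independent workers. Here a thread pool stands in for HTCondor jobs, and the class and function names are illustrative, not the actual HTMap API.

```python
from concurrent.futures import ThreadPoolExecutor

def simulate(x):
    # Stand-in for a scientific function the researcher wants to map.
    return x * x

class MapResult:
    """A map object populated asynchronously, one "job" per input."""

    def __init__(self, func, inputs, max_workers=4):
        # In HTMap these submissions would become Condor jobs;
        # here a thread pool plays that role.
        self._pool = ThreadPoolExecutor(max_workers=max_workers)
        self._futures = [self._pool.submit(func, i) for i in inputs]

    def __iter__(self):
        # Iterating blocks on each element until its "job" completes,
        # which is exactly where the asynchronicity questions arise.
        for f in self._futures:
            yield f.result()

result = MapResult(simulate, range(5))
print(list(result))
```

The richer questions the speaker mentions (what finished, what failed, what is still pending) are what the real map object has to expose beyond plain iteration.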
So I think that will be the last picture I will share with you, unless there is more interest later.
So this is the world that we present to our CHTC users.
So you have the researcher
with a problem to solve.
This is a workflow described as a DAG, a piece of it. And by the way, we have chairs in this conference, so not everyone is sitting on the ground.
and the researcher is
interfacing with CHTC that
basically presents
all the resources of
the campus as a single
accessible resource from here.
So the 1.1 million hours that are delivered daily are coming from a whole bunch of Condor pools on the campus, over a dozen.
Some of them are in CHTC and some of
them are owned and
operated by others.
But CHTC also can bring in
resources from HPC systems,
a growing source of computing power,
I would say in the US and
Europe now because there is
significant investment in it,
and they want it to be used
for science, including HTC.
You saw the previous
report that I showed you,
go to the Cloud.
So the guys at TIFR, the CMS, they expand their computing capacity using Condor into Azure. They can get 10,000 cores in Azure up in 10 minutes, connect them to the Condor pool at the IFO, and go. And the Open Science Grid.
So this is basically the worldview.
The thing is that, really, this researcher has to be shown here with a pile of money, and an allocation, and a priority, because using these resources has to be paid for. And that actually brings up what I call capacity planning, you can call it optimization,
which is moving from an era where
you did capacity planning
at very low frequency,
once a year or once every five years,
to high-frequency capacity planning.
Because you have to decide when to
use your money to
buy Cloud Resources.
When to use the allocation
to use HPC Resources,
and when to use your priority to
use Open Science Grid Resources.
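To make "high-frequency capacity planning" concrete, here is a toy decision function. The currencies match the three the speaker names (money, allocation, priority), but the thresholds and policy are invented for illustration, not anything CHTC actually runs.

```python
def choose_resource(budget_dollars, hpc_hours_left,
                    deadline_hours, core_hours_needed):
    """Toy policy: which 'currency' to spend for the next batch of jobs.

    Illustrative only: real capacity planning weighs prices, queue
    depths, and allocations that change hour to hour.
    """
    if deadline_hours > 24:
        return "open-science-grid"   # relaxed deadline: spend priority
    if hpc_hours_left >= core_hours_needed:
        return "hpc-allocation"      # spend the allocation on the deadline
    if budget_dollars > 0:
        return "cloud"               # spend money for elastic capacity
    return "open-science-grid"       # fall back to opportunistic cycles

print(choose_resource(budget_dollars=500, hpc_hours_left=0,
                      deadline_hours=6, core_hours_needed=10_000))
```

The point is the frequency: a function like this gets evaluated per batch of jobs, not once a year at procurement time.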
So that's as far as I can go in the time that we have. Any further questions? Or something which is not related, but you are still interested to know, I can try and answer. Yes.
>> So the Condor migration system, is that [inaudible]
>> We did the checkpointing, in the black and white. I think only a year ago, we decided to stop supporting it because it became unsustainable.
Actually, in this '92 workshop, we sent out a list of requests to the operating-systems-for-workstations community and said, give us checkpointing.
None of the solutions that were offered to us since, and there were promises of checkpointing [inaudible], and checkpointing containers, and all that, work in our environment.
Also, if we are not convinced that the checkpointing is reliable, it becomes unsustainable. If somebody is running a job in your system that checkpoints 20 times and then crashes, whose fault is it? The job's fault or your checkpointing's?
This can consume way too many
resources for a group like ours.
So we couldn't find anyone to do checkpointing in a different way than what we were doing with writing out the core, and more and more of the jobs are complex scripts and other crazy things.
Yes. It would be amazing.
What we are doing our best at today is supporting applications that do their own logical checkpointing. So we need to understand it, we need to move the things reliably back to the origin, and even in this case it's non-trivial: what is the signal? How do they tell you that they are done?
Usually, they want to shut down and restart, because only when you restart do you verify that the checkpoint you created is valid. Because how long do you keep the checkpoint?
But in the black and white,
we did it, but we had to give up.
We had users who used checkpointing to also address the local file system. They would start the job, read everything that they need, checkpoint, and then move.
So there are many science
applications that really
built the entire state by reading
in the first couple of minutes.
So rather than knowing which data you need and looking it up in a database remotely, they did it locally and then checkpointed.
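The logical-checkpointing pattern described here can be sketched as an application that writes its own state and resumes from it on restart. The file name and state layout are illustrative, not part of any Condor interface.

```python
import json
import os

CHECKPOINT = "state.ckpt"  # illustrative file name

def run(total_steps):
    """Do `total_steps` units of work, checkpointing after each one."""
    # On restart, resume from the last checkpoint instead of from scratch.
    if os.path.exists(CHECKPOINT):
        with open(CHECKPOINT) as f:
            state = json.load(f)
    else:
        state = {"step": 0, "accum": 0}

    while state["step"] < total_steps:
        state["accum"] += state["step"]   # stand-in for real work
        state["step"] += 1
        # Write to a temp file and rename, so a crash mid-write
        # cannot leave a corrupt checkpoint behind.
        with open(CHECKPOINT + ".tmp", "w") as f:
            json.dump(state, f)
        os.replace(CHECKPOINT + ".tmp", CHECKPOINT)
    return state["accum"]
```

The restart path is the part the speaker stresses: only by actually restarting from the file do you learn whether the checkpoint was valid.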
Maybe for these cases,
we can do something better with
containers and things like that.
But the jury is out.
There is also a question with
all these technologies
moving so fast.
That if we invest in something,
and get it to work there,
and then we wake up,
and people started with Docker in the science community, and now everyone is doing Singularity. Singularity has its own problems. Then Singularity will move to another version.
The DevOps world is pretty
demanding because we have
to build on stable and
dependable services.
By the way, as a footnote that I should have mentioned, we are one of the few, I don't want to say the only, but one of the few pieces of science middleware, if you want to call it that, that have a pretty wide Windows deployment.
Condor works very well on Windows,
and we have quite a bit,
including commercial users that
are running Condor on Windows.
We made a decision early on that took us quite a bit of effort: we did a deep porting to Windows, where we raised our abstraction level to a point where we can go to Linux, Unix, and Windows from the same abstraction level.
Because earlier we were tied, and rather than trying to go through some interpreter and all that, we said we need a native Windows implementation.
But we have quite a bit of users.
I would say more heavily
in the commercial world.
So since I mentioned
the commercial world,
I think that would be a fun thing.
So one of our users who is using Condor is DreamWorks. The entire rendering farm of DreamWorks, which the last time they reported it in one of our meetings was 45,000 servers and 15,000 desktop machines.
They are using it to do all
the production rendering.
The latest movie that they released took 300 million core hours to render. So if you watched any of their movies that was released since 2011, you're watching Condor.
Any other questions? Thank you.
