This is two-factor authentication, so basically to get in I need my card and then I need a PIN.
This is a scrambler pad, so every time you look at it the numbers are in a different order.
This is the High Performance Computing facility for the University of Nottingham.
SEAN>> What do you use it for?
All sorts of things. It's basically to do with high-compute research, so for example students and
researchers will use this for doing calculations based on things like fluid dynamics, aerospace,
genomics, astronomy... Anything that requires a large amount of compute.
SEAN>> And you've got earplugs in today, for obvious reasons.
Yes, it's a little bit noisy in here, yes.
SEAN>> So we will do some talking outside (LINK IN DESCRIPTION) but can you show us a bit of it before we go outside?
Certainly, yes. The main HPC facility, which we call Minerva, is...
...and then we've got some extensions in, on the racks on here.
SEAN>> All of these blinking lights, what's going on? Is this data activity...
...or processing? What's going on there?
Both! The actual lights that you can see there, the brighter ones, those are
actually the storage, the disk storage. The actual compute nodes don't actually
blink very much. The ones at the bottom there, that's the network activity.
We do shut it down for maintenance once a year for a day or so.
This at the moment is the third generation of HPC - the first one...
...which was installed about eleven years ago, and then we regularly refresh this.
SEAN>> So this one's been going for how long?
This one's been going for about four years.
SEAN>> Okay and then I hear rumors of a
new one on the horizon?
Yes, we're in procurement at the moment to put a replacement in.
SEAN>> And will that mean this gets ripped out and the whole new one just gets put in?
Good question. We would like to utilize as much as possible because, although it is
old, you know, there is still life left in it and we do try to "sweat the
assets", as they say, but certainly some of this will be replaced.
SEAN>> What's it running? Would we recognise any of the operating system or any of that?
Yes, most of the nodes are running a version of Linux and the storage is fairly standard,
but above that we use PBS as our main scheduler.
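As a rough illustration of what submitting work to a PBS scheduler involves, here is a minimal sketch that renders a job script. The `#PBS -N`, `#PBS -l select=...:ncpus=...` and `#PBS -l walltime=...` directives follow PBS Professional conventions (other PBS flavours spell the resource request differently); the job name, node counts and the `./solver` command are hypothetical, and nothing here reflects how this facility is actually configured.

```python
def render_pbs_script(job_name, nodes, cpus_per_node, walltime, command):
    """Return the text of a simple PBS job script (PBS Professional style)."""
    lines = [
        "#!/bin/bash",
        f"#PBS -N {job_name}",                            # job name shown in the queue
        f"#PBS -l select={nodes}:ncpus={cpus_per_node}",  # resource request
        f"#PBS -l walltime={walltime}",                   # maximum run time, HH:MM:SS
        "",
        'cd "$PBS_O_WORKDIR"',  # start in the directory the job was submitted from
        command,                # hypothetical program to run
    ]
    return "\n".join(lines) + "\n"


# Hypothetical two-node job allowed to run for up to two days.
script = render_pbs_script("cfd_run", 2, 16, "48:00:00", "./solver input.dat")
print(script)
```

A script like this would typically be handed to the scheduler with `qsub`, which queues it until the requested nodes are free.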
SEAN>> How many people might be using this at one time?
At any one time they're probably running hundreds of jobs.
SEAN>> Do they run for a long time? Might they be running years? How does it work?
We wouldn't have jobs that run for years, but certainly we could have jobs which are
running for months. Most of the jobs, you know, are probably only running for days.
SEAN>> Okay and so when you look at a system like this can you put a figure on how much it costs?
Capital cost for a system like this, we're probably talking in terms of about
one and a half to two million pounds ($2.1m - $2.8m).
The ongoing costs - we have about 250 kilowatts of air conditioning.
When we run this flat out - this particular block here running flat out pulls about 70 kilowatts of power,
and you're drawing that all the time. So to run this whole facility you're talking
about thousands of pounds just purely in power costs, and then of course there's
all the ongoing licensing and the support for that... So it's not insignificant.
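To put the power figure in context, here is a back-of-the-envelope sketch of what a constant 70 kilowatt draw adds up to over a year. The 70 kW is the figure quoted above; the electricity price is a purely hypothetical placeholder, not the university's tariff, and the air conditioning load is not included.

```python
HOURS_PER_YEAR = 24 * 365  # ignoring leap years and the annual maintenance day


def annual_energy_kwh(power_kw, hours=HOURS_PER_YEAR):
    """Energy used by a constant load over a year, in kilowatt-hours."""
    return power_kw * hours


def annual_cost_gbp(power_kw, price_per_kwh_gbp):
    """Yearly electricity cost for a constant load at a flat unit price."""
    return annual_energy_kwh(power_kw) * price_per_kwh_gbp


block_kw = 70                 # figure quoted in the interview for one block
assumed_price_per_kwh = 0.10  # ASSUMPTION: hypothetical price in pounds per kWh

print(f"Energy per year: {annual_energy_kwh(block_kw):,} kWh")
print(f"Cost per year at the assumed price: £{annual_cost_gbp(block_kw, assumed_price_per_kwh):,.0f}")
```

Swapping in a real tariff and adding the cooling load would give a fuller picture of the running cost.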
SEAN>> So that's a lot of power. Is there a big red
switch somewhere someone has to pull to turn it on?
Yes there is - and no, I'm not going to press it for you.
SEAN>> So it's obviously a lot of equipment and looks like it might be quite complicated.
Does it ever go horribly wrong? Does it ever have big problems?
Generally speaking it is pretty reliable. Individual nodes will fail.
Individual disks will fail but generally speaking the equipment itself is relatively...
...modern computer equipment is inherently reliable - we probably have
more problems with the air conditioning
than we do with the actual compute itself.
SEAN>> So the other thing I was thinking about when you look at this: is this
totally bespoke, or is it like a template? How does it work in terms of
how do you buy one of these - how would
you go and buy a high-performance computer?
That's the $64,000 question.
Basically you have to start to think "What do we need it for?" because there is
no one generic high performance compute job.
Different departments, different research, different requirements have
different computing requirements. Some are very, very high performance computing, you know,
it's a lot of number crunching. For others it's about manipulating data, so there's a lot of
data movement. For other things it's about visualization.
So the first thing you've got to do is to say, right, "What is our mix of jobs?" because the way
in which you set it up for high analytics is a different hardware set to what you set
up for visualization and things like that. So that's the first thing you've got to do.
You've then basically got to say, okay, these are the jobs that we want to run.
Once you've actually got that, you then go to a supplier to say, right,
this is what we want to do, this is how
much money we've got to spend. What can you give us?
Although this is fairly old now, you know, there is still quite a lot
of life left in here. Okay, it's not cutting edge - but it'll
still do a lot of the jobs, because a lot
of the jobs are purely about number crunching.
This is perfect for that, so basically we will put the new one in. We
will try and keep as much as we can of
the old one so that we "sweat our assets",
and that also means that we've got additional capacity for our
researchers to use as well.
And then basically we will go for a gradual replacement, so as new processors come online and as
new research projects come, you know, the balance of the jobs will change. That means we may have to
strip out a particular type of node and replace it with a different type of node,
so that will be far more organic in the future. We're not
expecting in the future to do a complete rip and shred, unless something
comes up and, you know, we build a new data center - but that's not on the cards at the moment.
The equipment itself is fairly generic, you know, these are standard
blade enclosures. The storage is standard
storage - We have about two hundred and
forty terabytes in this block here - it's all
connected up by InfiniBand
SEAN>> Is InfiniBand a speed of network?
It's a standard. This is 40 gigabit InfiniBand.
SEAN>> So at home you might have Gigabit - this is 40 of those?
Yes, 40 Gigabit, yes - and of course it's also multipath as well, so...
...because you know there's no point
in doing a lot of calculations if you
can't then get the result of those
calculations off.
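To get a feel for the difference between a home gigabit link and the 40 gigabit figure mentioned above, here is a small sketch comparing ideal transfer times. The one terabyte result size is a hypothetical example, and the calculation uses nominal link rates only: it ignores encoding and protocol overhead, so real transfers would take longer.

```python
TERABYTE = 10**12  # bytes, decimal terabyte
GIGABIT = 10**9    # bits per second


def transfer_seconds(size_bytes, link_bits_per_second):
    """Ideal time to move size_bytes over a link, ignoring all overhead."""
    return size_bytes * 8 / link_bits_per_second


result_size = 1 * TERABYTE  # ASSUMPTION: hypothetical size of a job's output

for label, rate in [("1 gigabit", 1 * GIGABIT), ("40 gigabit", 40 * GIGABIT)]:
    seconds = transfer_seconds(result_size, rate)
    print(f"{label}: {seconds:,.0f} s ({seconds / 3600:.2f} h)")
```

The same data that ties up a gigabit link for a couple of hours moves in a few minutes at the higher rate, which is the point being made about getting results off the machine.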
There're effectively two types of jobs. There are parallel jobs where you've got a job running on
multiple nodes and then you've got
single node jobs where basically
it's all running on one node. So again, with the parallel jobs you need network
connectivity to make sure you're not processing the same bit twice.
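The idea of a parallel job dividing the work so that no piece is processed twice can be sketched on a single machine. In this illustrative example, threads stand in for compute nodes and a sum of squares stands in for the real calculation; both are assumptions made for the sketch, and a real cluster job would coordinate over the network, for example with MPI.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_chunks(items, n_workers):
    """Split items into n_workers contiguous, non-overlapping chunks."""
    items = list(items)
    base, extra = divmod(len(items), n_workers)
    chunks, start = [], 0
    for worker in range(n_workers):
        size = base + (1 if worker < extra else 0)  # spread the remainder
        chunks.append(items[start:start + size])
        start += size
    return chunks


def sum_of_squares(chunk):
    """Stand-in for the real per-node computation."""
    return sum(x * x for x in chunk)


def parallel_sum_of_squares(items, n_workers=4):
    """Each worker handles exactly one chunk; partial results are combined."""
    chunks = split_into_chunks(items, n_workers)
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        partials = list(pool.map(sum_of_squares, chunks))
    return sum(partials)


print(parallel_sum_of_squares(range(1000)))
```

Because every item lands in exactly one chunk, the combined answer matches what a single-node run would produce.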
SEAN>> So for a researcher or someone who's
part of a project, what's the big benefit
of doing this rather than letting their
office computer do it? Is it the speed of
compute? Is it the fact that they can set
it off and come back another day, or
what's the main benefit?
Yes, it's the capacity. Because basically the job will start to run and it will then
continue to run. So for example,
Christmas is a very, very busy time for
us because a lot of researchers will
start a job going, then come back after
Christmas and pick up the data.
As I say, you could do these things at home, it's just that it would take you
months or years to do what this can do in days or hours.
SEAN>> Are they 'hot' swappable then?
Yes they are
SEAN>> (Joking) Come on then, let's pull one out...
No!
They're all single-phase power, but because the phase on this rack is
different to the phase on this rack
there is the possibility of having a
potential difference of more than 400
volts across the two racks. It's unlikely
because each of the... but from a "health and safety" point... and it's exactly the
same reason why you'll see a lot of these have got laser [warning stickers], because we use laser optics.
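The 400 volt figure can be checked with a short calculation. This sketch assumes the two racks sit on different phases of a three-phase supply, 120 degrees apart, with a nominal 230 volts between each phase and neutral; those supply details are assumptions rather than something stated in the interview.

```python
import math


def line_to_line_rms(phase_to_neutral_rms):
    """RMS voltage between two phases that are 120 degrees apart."""
    return math.sqrt(3) * phase_to_neutral_rms


def peak_from_rms(rms_volts):
    """Peak voltage of a sine wave with the given RMS value."""
    return math.sqrt(2) * rms_volts


nominal = 230.0  # ASSUMPTION: nominal phase-to-neutral voltage in volts
between_phases = line_to_line_rms(nominal)

print(f"Between phases (RMS): {between_phases:.0f} V")
print(f"Between phases (peak): {peak_from_rms(between_phases):.0f} V")
```

So two pieces of equipment that are each fed at ordinary mains voltage can still have roughly 400 volts RMS between them, with instantaneous peaks above that.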
SEAN>> For your networking?
Er, yes the fibre...
SEAN>> And what is that, the aircon?
Nope, that is the fire suppression
SEAN>> Oh, let's go have a look at that then.
The fire suppression system that we have in here is an IG55 system, which is an inert gas.
It's 50% Argon, 50% Nitrogen.
Basically if there is a fire in here, all
of the gas in there is released in one go. That
replaces about half the atmosphere in here, which takes the oxygen level down to
a point where it doesn't support combustion. It is just about breathable, but you wouldn't want to
run a marathon in it. You know, it's like
trying to run at the top of Mount Everest.
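The effect on the oxygen level can be estimated with simple arithmetic. This sketch takes the "about half the atmosphere" figure from the description above and assumes normal air is roughly 20.9% oxygen by volume; it is an idealised estimate that treats the displaced air as simply vented, not a design calculation for a real suppression system.

```python
NORMAL_OXYGEN_PERCENT = 20.9  # approximate oxygen content of dry air by volume


def oxygen_after_discharge(fraction_replaced, start_percent=NORMAL_OXYGEN_PERCENT):
    """Oxygen percentage left after a fraction of the air is replaced by inert gas."""
    if not 0.0 <= fraction_replaced <= 1.0:
        raise ValueError("fraction_replaced must be between 0 and 1")
    return start_percent * (1.0 - fraction_replaced)


# "About half" is the figure quoted in the interview.
print(f"Oxygen after discharge: {oxygen_after_discharge(0.5):.2f}%")
```

Halving the air's share of the room roughly halves the oxygen concentration, which is why the room becomes hostile to fire while staying just about breathable.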
SEAN>> So it suppresses the fire without damaging the kit?
Yes. The gas is released through these nozzles here.
SEAN>> They look like sprinklers but they're actually gas...
Gas nozzles, yes.
SEAN>> And how does it work with the cooling? Does it go in hot one side and out
cold the other?
Yes, basically we use aisle containment, so this is the cold aisle where we put cold air
in. It then goes through the equipment.
We'd expect to see a delta T in terms of
20-odd degrees - and on the other side basically it gets vented through...
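The temperature rise across the racks is tied to how much air has to move through them. This sketch uses the standard relation between heat carried, mass flow and temperature rise, with the 70 kilowatt and roughly 20 degree figures from the interview. The air properties are typical textbook values, and the assumption that all of the electrical power leaves as heat in the air stream is a simplification.

```python
AIR_SPECIFIC_HEAT = 1005.0  # J/(kg*K), typical value for air (assumption)
AIR_DENSITY = 1.2           # kg/m^3, typical value near room temperature (assumption)


def air_mass_flow_kg_per_s(heat_watts, delta_t_kelvin):
    """Mass flow needed to carry heat_watts away with a given temperature rise."""
    return heat_watts / (AIR_SPECIFIC_HEAT * delta_t_kelvin)


def air_volume_flow_m3_per_s(heat_watts, delta_t_kelvin):
    """The same flow expressed as a volume of air per second."""
    return air_mass_flow_kg_per_s(heat_watts, delta_t_kelvin) / AIR_DENSITY


heat = 70_000.0  # watts, the flat-out figure quoted for one block
delta_t = 20.0   # degrees, the "20-odd" rise quoted above

print(f"Mass flow: {air_mass_flow_kg_per_s(heat, delta_t):.2f} kg/s")
print(f"Volume flow: {air_volume_flow_m3_per_s(heat, delta_t):.2f} m^3/s")
```

A smaller temperature rise would need proportionally more air, which is one reason containment that keeps hot and cold air from mixing is useful.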
SEAN>> So through that glass is going to be 20 degrees warmer? Can we go in?
Yeah.
OK I think I'd like to spend my time on this side...
If you come down here you
can definitely feel the temperature difference.
So these are compute nodes.
SEAN>> ...and how many computers are in each one of those blocks then?
Each one of these, so in this particular one you've got 1, 2, 3, 4... ...8 individual blades in this blade enclosure here.
You asked about the big red button? That's the big red button
SEAN>> That would turn it off and on?
No, that would turn it off.
SEAN>> Ah that's like a "Danger danger!" -
press that?
Basically if I press that then everything will die immediately.
SEAN>> Let's stay away from the big red button then...
But that is the big red button, yes....
