Hi, thanks for tuning in to Singularity
Prosperity. This video is the seventh in a
multi-part series discussing computing
and the final one on classical
computing. In this video, we'll be
discussing what heterogeneous system
architecture is and how it's going to
shape the future of classical computing!
[Music]
To summarize what we've discussed in
previous videos, CPUs are general-purpose
devices designed to execute and manage
complex instructions while GPUs are
massively parallel computing devices
designed to execute streams of
calculations as fast as possible due to
their parallelism. This translates to
current architectures, with the CPU
managing most computer operations, such
as the operating system, input/output
devices and various other tasks, while the
GPU does the heavy lifting in terms of
computation. For the longest time, we
made the CPU execute and manage all
tasks, leaving the GPU only for
graphics and simulation purposes. This
was, and still is, extremely wasteful:
the CPU already has to deal with OS
overhead along with various other issues
that carry a performance penalty. With plateaued CPU clock
rates, the miniaturization of the
transistor coming to an end, and more
CPU cores no longer yielding significant
performance gains, together with the
increasing adoption of GPUs in general
computing and the growing popularity of
parallel platforms like CUDA, this is
beginning to change.
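The CPU/GPU division of labor just described comes down to data parallelism: the same small operation applied independently to a huge number of elements. Here's a minimal sketch of that idea in Python (a conceptual analogue only; the kernel is an arbitrary toy function, and in CPython the GIL means the threads illustrate the structure rather than delivering a real GPU-style speedup):

```python
from concurrent.futures import ThreadPoolExecutor

def kernel(x):
    # The small, independent operation applied to each element,
    # analogous to the work done by one GPU thread.
    return x * x + 1

data = list(range(1_000_000))

# Serial, CPU-style: one worker walks the whole stream.
serial = [kernel(x) for x in data]

# Parallel, GPU-style: split the data into independent chunks and
# process each chunk concurrently (threads stand in for GPU cores).
def process_chunk(chunk):
    return [kernel(x) for x in chunk]

n_workers = 4
size = len(data) // n_workers
chunks = [data[i * size:(i + 1) * size] for i in range(n_workers)]

with ThreadPoolExecutor(max_workers=n_workers) as pool:
    parallel = [y for part in pool.map(process_chunk, chunks) for y in part]

# Because every element is independent, the parallel result is
# identical to the serial one; only the execution strategy differs.
assert parallel == serial
```

The key property is that no chunk depends on any other, which is exactly what lets a GPU throw thousands of cores at the stream at once.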
This yields a new type of computing
architecture called HSA, heterogeneous
system architecture. HSA is where
multiple compute devices work in
unison instead of being segmented in
operation. Another huge factor in HSA
paradigms taking off is new and
improving memory and data standards. For
more information on the CPU, the GPU and new
innovations in memory and data, and to gain
deeper insight into what's discussed
here, be sure to check out the previous
videos in this computing series. Now, what
we haven't discussed in those previous
videos is FPGAs and ASICs, two other
types of computing devices that will
play a crucial role in HSA. The field
programmable gate array, FPGA, is a
special type of computing device. Unlike
other computing devices, its hardware can
be reprogrammed for specific tasks. To be
clearer: other computing devices have
fixed hardware, with software optimized to
run on it; FPGAs have reprogrammable
hardware, so the hardware can be optimized
for the software. Due to this hardware
reprogrammability, FPGAs are more
expensive and also quite difficult for
the average developer or computer
enthusiast to work with. However, they
allow for massive parallelism while
using much less power, which fits a
variety of needs, such as data
processing or streaming.
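One way to see why FPGAs fit streaming workloads so well: an FPGA design is laid out as a fixed pipeline that data flows through stage by stage, rather than as instructions fetched and decoded one at a time. A rough software analogue of that dataflow style, using Python generators (purely illustrative; the stages here are made-up examples, not a real FPGA toolflow):

```python
# Each stage is a small, fixed transformation, like a block of
# configured logic on the FPGA fabric; data streams through all
# stages one item at a time, never gathered into one big batch.

def parse(lines):
    # Stage 1: decode raw input into numbers.
    for line in lines:
        yield int(line.strip())

def scale(values, factor):
    # Stage 2: apply a fixed arithmetic transform.
    for v in values:
        yield v * factor

def threshold(values, cutoff):
    # Stage 3: keep only values above a cutoff.
    for v in values:
        if v > cutoff:
            yield v

# Wire the stages together into a pipeline, the way an FPGA
# design wires configured logic blocks together.
raw = ["1\n", "5\n", "3\n", "8\n"]
pipeline = threshold(scale(parse(raw), factor=10), cutoff=25)

print(list(pipeline))  # [50, 30, 80]
```

On real FPGA fabric all three stages run simultaneously on different items in flight, which is where the throughput and power efficiency come from.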
Referring back to heterogeneous
architecture, FPGAs can be used as
accelerators to process data and then
send it to the CPU; when paired in a
system with CPUs and GPUs, massive
improvements can be seen:
Now we already have
industry-leading capabilities, with our
Azure GPU offering, which is fantastic
for building trained AI models offline.
Okay, but to support live AI services
with very low response times at large
scale, with great efficiency - better than
CPUs, we've made a major investment in
FPGAs. Now FPGAs are programmable
hardware, what that means is that you get
the efficiency of hardware but you also
get flexibility because you can change
their functionality on-the-fly. This
new architecture that we've built
effectively embeds an FPGA-based AI
supercomputer into our global
hyperscale cloud. We get awesome speed,
scale and efficiency;
it will change what's possible for AI.
Now over the past two years, we've
quietly deployed it across our global
hyperscale data centers, in 15 countries
spanning five continents. Okay, so let's
start with a visual demo of what happens
when you add this FPGA technology to one
of our cloud servers. We're using a
special type of neural network, called a
convolutional neural net, to recognize
the contents of a collection of images.
Okay, on the left of the screen what you
see is how fast we can classify a set of
images using a powerful cloud-based
server running on CPUs. On the right, you
see what happens when we add a single
30-watt, Microsoft-designed FPGA board to
the server. This single board
turbocharges the server, allowing it to
recognize the images significantly
faster; it gives the server a huge boost
for AI tasks. Okay, now let's try
something a little harder, using a more
sophisticated neural network to
translate languages. The deep neural
network based approach we're using here
is computationally much harder and
requires much more compute, but it's
achieving record-setting accuracy in
language translation. Okay, so to test the
system, let's see how quickly we can
translate a book from one language
to another. Now I picked a nice small
book for this demo, 'War and Peace';
it's about 1,440 pages. We'll go over
to the monitor here and, using 24
high-end CPU cores, start
translating the book from Russian to
English. Okay,
now we'll throw four boards from our
FPGA-based supercomputer at the same
problem, which uses 1/5 less total power,
as you can see... thank you *applause*. As you saw,
just a single FPGA accelerator incorporated
in an HSA system can yield significant
boosts in performance for artificial
intelligence tasks, and similar
performance boosts extend to other
compute tasks as well. Beyond FPGAs, there
are also ASICs, application-specific
integrated circuits, the most optimized
type of computing device. ASICs are fixed
in hardware; however, as the name states,
they are application-specific, meaning both the
hardware and software are designed from the
ground up to be tightly coupled and
optimized for a specific subset of tasks,
which they do extremely well. For example,
ASICs have seen a lot of use in
cryptocurrency mining. We'll discuss
various types of ASICs in future videos
on this channel, such as on the Internet
of Things and other computing
applications, for example: tensor
processing units, TPUs, for use in AI and
Nvidia's Drive PX card for use in
self-driving cars. As a side note,
ASICs are the reason Apple phones and
laptops are so fast and fluid: all the
hardware is specifically designed for their
devices, paired with software that can
fully utilize all the hardware resources.
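The specialization trade-off behind ASICs has a loose software analogue: a general routine that can handle any input versus one with its task baked in. A toy sketch (illustrative only; a real ASIC's advantage comes from dedicated silicon, not from skipping a Python loop):

```python
def general_polynomial(coeffs, x):
    # General-purpose: evaluates ANY polynomial via Horner's method,
    # like a CPU running arbitrary instructions.
    result = 0
    for c in coeffs:
        result = result * x + c
    return result

def specialized_3x2_plus_2x_plus_1(x):
    # "ASIC" version: one fixed polynomial, 3x^2 + 2x + 1, with the
    # coefficients baked in. No loop, no coefficient storage; it can
    # only do this one job, but does it with minimal overhead.
    return (3 * x + 2) * x + 1

# Both agree on the task the specialized version was built for...
assert general_polynomial([3, 2, 1], 10) == specialized_3x2_plus_2x_plus_1(10) == 321
# ...but only the general one can handle anything else.
assert general_polynomial([1, 0, 0], 5) == 25
```

The same pattern explains the economics: the general routine is cheap to reuse everywhere, while the specialized one only pays off when its single task is run at enormous volume.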
The problem with ASICs is that they are
significantly more expensive in terms of
research, development and
implementation; this is why most
companies and people opt for generic
chipsets. However, with the increasing
complexity of the problems computers must
solve and the coming end of
transistor miniaturization, ASICs
will see exponentially increasing use in
the coming years.
[Music]
So, based on what we've discussed about
heterogeneous architectures: the CPU will
manage computational resources, FPGAs
will accelerate data processing, and GPUs
or ASICs will crank out the necessary
calculations. In terms of the computational
performance this yields, it all comes
back to what we've talked about over and
over again in previous videos in
this series: increased parallelism. Now,
when discussing parallelism in
heterogeneous architectures there is
another law we must look at, 
Gustafson's Law. This law essentially
states that, as data size increases,
the performance boost obtained through
parallelization (in other words, through
the addition of more cores and other
hardware and software parallelism)
increases, because the parallel portion
of the work grows with data size. Simply put,
until we run into power issues, we can
keep adding more cores, and as long as
the problems we give them are
sufficiently large, the cores will be
useful. In terms of heterogeneous
architecture, this means we can keep
adding more compute devices. Now luckily,
it seems the world has found such a
problem: deep learning for use in
artificial intelligence, a field of
computer science that has now gone
mainstream and is growing in
popularity every day. Deep
learning algorithms utilize large
amounts of big data and are
intrinsically parallel; we'll explore
this much deeper in this channel's AI
series. Deep learning also acts as a
positive feedback loop and can propel
the field of computing much further. We
can see this in technologies such as AI
smart caching and, going beyond that, deep
learning can help us identify ways to
maximize heterogeneous architectures,
develop new ASICs and much more to push
computing performance forward. So, with
heterogeneous architectures as well as
increasingly parallel software paradigms
that utilize big data, such as deep
learning,
we'll see massive increases in
performance over the years, well
exceeding the expectations of Moore's Law.
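Gustafson's Law from a moment ago can be made concrete. If a fraction s of the work is inherently serial and the rest grows with the data, the scaled speedup on N processors is S(N) = s + (1 - s) * N, which keeps climbing almost linearly as devices are added. A quick sketch (the 5% serial fraction is an assumed illustrative value, not a measured one):

```python
def gustafson_speedup(n_processors, serial_fraction):
    # Gustafson's scaled speedup: the serial part stays fixed while
    # the parallel part grows with problem size across N processors.
    # S(N) = s + (1 - s) * N
    return serial_fraction + (1 - serial_fraction) * n_processors

s = 0.05  # assumed: 5% of the work is inherently serial
for n in (1, 8, 64, 1024):
    print(f"{n:>5} processors -> {gustafson_speedup(n, s):8.2f}x scaled speedup")

# With only 5% serial work, 1024 processors still yield roughly a
# 973x scaled speedup: adding compute devices keeps paying off as
# long as the data grows with them.
```

This is the optimistic counterpart to Amdahl's Law: instead of asking how fast a fixed problem gets, it asks how much bigger a problem we can solve in the same time, which is exactly the regime big-data workloads live in.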
If we look at Moore's Law in terms of
heterogeneous architectures, we still have a
long way to scale, possibly another
75-plus years of performance increases. To
add to this, who knows what other types
of ASICs and architecture and software
changes are to come during this
inflection period in the field of
computing. Now before concluding this
video, to highlight the shift in the
computing industry to heterogeneous
architectures, watch this clip of
principal researcher at Microsoft,
Kathryn McKinley: So now what we're
seeing is specialization in hardware,
which is FPGAs,
specialized processors or
combining big and little processors
together; a big powerful fast processor
with a very energy-efficient processor.
So software has had an abstraction that
all hardware is about the same, with one
thread of execution; so, when it comes to making
software port to different versions of
crazy hardware, or different generations
of hardware, as we go through this
disruptive period, the software systems
are not prepared for this.
So my research is targeting both how
you do this as a software system, but
also programming abstractions that let
you trade off quality for energy
efficiency, and let you reason about the fact
that sensor data is not correct, and how
you deal with these inaccuracies in a
programming model where you don't need a
PhD in statistics or computer science in
order to use it. I hope you guys have
enjoyed and learned a lot over the past
few videos on classical computing. As
mentioned earlier in this series,
classical computing gives the illusion
of parallelism through hardware and
software and this illusion just keeps
getting better and better due to
increasing performance. In the next
videos in this computing series, we'll
cover truly parallel non-classical
computers, such as quantum and
bio-computers, as well as new emerging
paradigms in computing, such as optical
and neuromorphic computing. At this point
the video has come to a conclusion; I'd
like to thank you for taking the time to
watch it. If you enjoyed it, consider
supporting me on Patreon to keep this
channel growing and if you want me to
elaborate on any of the topics discussed
or have any topic suggestions, please
leave them in the comments below.
Consider subscribing for more content,
follow my Medium publication for
accompanying blogs and like my Facebook
page for more bite-sized chunks of
content. This has been Ankur, you've been
watching Singularity Prosperity and I'll
see you again soon!
[Music]
