YUFENG GUO: Is your
model training slower
than you'd like?
Perhaps your GPU isn't getting
the kind of utilization
you think it should.
Then you've come
to the right place,
because we're going to level up
your understanding of storage
options for your machine
learning and data science
workloads.
[MUSIC PLAYING]
Welcome to AI Adventures, where
we explore the art, science,
and tools of machine learning.
My name is Yufeng Guo.
And on this episode,
we're going to take
a look at various storage
options on Google Cloud.
When thinking about running
machine learning and data
science workloads in
the cloud, one factor
that is often
overlooked is storage.
It's easy to take
storage for granted.
In our day-to-day lives, storage
just happens on the drives built
into our phones and laptops.
But in the cloud, things
are a little bit different.
Depending on your use case
and your collaborators,
your data storage requirements
may have vast differences.
So today, we're going to look at
some different types of storage
options on Google Cloud and
see what kinds of trade-offs
are being made so
that you can pick
the right tool for your task.
Oftentimes when we're doing
tasks on Google Cloud,
the default is to just stick
files on Google Cloud Storage,
which I'll refer to as GCS.
You make a new bucket,
stick some files in it,
and you're done.
Since GCS can be accessed
over the network via REST API
calls, often wrapped in
a command-line tool or an SDK,
it's become a very convenient,
quick-and-easy storage
solution for when you need
to drop a few files here
and there to relay
some bits around.
The data gets replicated
across multiple regions
if you need it to be.
You only pay for the exact size
of the files you're storing.
And sharing is relatively easy.
This makes GCS especially
useful for systems that operate
in a serverless environment.
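As a concrete illustration, here's a minimal sketch of that
make-a-bucket, drop-in-some-files workflow using the
google-cloud-storage Python client. The bucket name and file
paths are just placeholders.

```python
# Minimal sketch of the GCS workflow described above, using the
# google-cloud-storage Python client. Bucket and file names are placeholders.
from google.cloud import storage

client = storage.Client()  # picks up your default project and credentials

# Bucket names must be globally unique.
bucket = client.create_bucket("my-ml-datasets-example")

# Upload a local file as an object in the bucket.
blob = bucket.blob("datasets/train.csv")
blob.upload_from_filename("train.csv")

# Any machine (or serverless function) with access can pull it back down.
blob.download_to_filename("/tmp/train.csv")
```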
But sometimes, you need the
performance of block storage.
Enter the persistent disk.
Yes, indeed, we're
moving all the way
down the stack to be closer
to the compute resource
and closer to the raw hardware.
OK, virtualized raw hardware.
These disks offer
high-performance
reads and writes.
And your code sees each one
as a local file system,
which can be very handy
for migrating existing
workflows into the cloud.
Virtual machines in
a Google data center
are scattered across
many physical machines.
And they're all
connected together
via the fast local network.
That applies not only to
compute, but to storage
as well.
So what we see as a
single persistent disk
is actually many slices of
different hard drives scattered
all over the data center.
This also means
that you can easily
connect multiple
virtual machines
to the same persistent
disk, as well
as connect multiple persistent
disks to one virtual machine.
All of these slices
can read and write
in parallel, which is why
larger persistent disks
have faster I/O
performance, up to a point.
We can see that in this
graph: read speeds
don't top out until after
the 10-terabyte disk size,
while sustained
throughput maxes out
at around 400 megabytes per
second at the 4-terabyte mark.
Persistent disks have
three tiers of choices--
standard, SSD, and local SSD.
Standard persistent disks
are your good old spinning
hard drives and are the
backbone of most data
storage systems out there.
SSD persistent disks up the read
and write performance,
but the data is still
distributed across the data
center in the same way as
with standard persistent disks.
Now, local SSD is co-located
with your compute resource,
and it comes in exact
increments of 375 gigabytes.
So you should really
only use it if you're
sure that you're going to need
those high-performance reads
and writes.
Additionally,
standard and SSD disks
both offer zonal and
regional options.
With regional
persistent disks, you
can fail over to a
second zone in the event
that your primary zone has an
outage, enabling you to build
high-availability services.
In machine learning
use cases, you
can take advantage of the
high I/O and throughput
performance of persistent
disks by loading large
datasets onto a persistent
disk and then
attaching it in read-only
mode to many, many instances.
This enables your team to
work with very large datasets
without duplicating the
data, all while everyone
gets to feel like the data is
right there on their own VM
instance.
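As a rough sketch of that pattern, here's one way you might attach
an already-populated disk to several training VMs in read-only mode
using the Compute Engine API's Python client. The project, zone,
disk, and instance names are placeholders, and each VM still has to
mount the disk inside the guest OS.

```python
# Sketch: attach one dataset disk to several VMs in READ_ONLY mode,
# using the discovery-based Compute Engine API client.
# Project, zone, disk, and instance names below are placeholders.
from googleapiclient import discovery

compute = discovery.build("compute", "v1")

PROJECT = "my-project"        # placeholder project ID
ZONE = "us-central1-a"        # placeholder zone
DISK = "shared-dataset-disk"  # placeholder: disk already loaded with the dataset

for instance in ["trainer-1", "trainer-2", "trainer-3"]:
    compute.instances().attachDisk(
        project=PROJECT,
        zone=ZONE,
        instance=instance,
        body={
            "source": f"zones/{ZONE}/disks/{DISK}",
            "mode": "READ_ONLY",  # read-only lets many VMs share the same disk
        },
    ).execute()
```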
Note that for workloads
that primarily
involve small random reads,
4 kilobytes to 16 kilobytes,
the limiting factor
is random input/output
operations per second, or IOPS.
But for workloads
that primarily involve
sequential or
large random reads,
like 256 kilobytes
up to 1 megabyte,
the limiting factor
is throughput.
This means that you should set
the block size of your disk
to match the type of
use case you anticipate.
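A quick back-of-the-envelope calculation shows why: throughput is
roughly IOPS times I/O size, so small reads hit the IOPS ceiling
long before they saturate throughput, and large reads do the
opposite. The IOPS limit below is a made-up round number for
illustration; the throughput ceiling just echoes the 400 megabytes
per second figure from the graph earlier.

```python
# Illustrative arithmetic only: throughput ~= IOPS x I/O size.
# The IOPS ceiling is a hypothetical round number, not a published limit;
# the throughput ceiling echoes the ~400 MB/s figure mentioned above.
IOPS_LIMIT = 15_000           # hypothetical random-read IOPS ceiling
THROUGHPUT_LIMIT_MB = 400     # sustained throughput ceiling in MB/s

for io_size_kb in (4, 16, 256, 1024):
    # Throughput you'd reach if you hit the full IOPS ceiling at this I/O size.
    mb_per_s = IOPS_LIMIT * io_size_kb / 1024
    bound = "IOPS-bound" if mb_per_s < THROUGHPUT_LIMIT_MB else "throughput-bound"
    print(f"{io_size_kb:>5} KB reads: {mb_per_s:8.1f} MB/s at the IOPS limit -> {bound}")
```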
So which storage
option will you choose?
Well, that'll depend
on your use case.
From the scaling flexibility
of GCS to the high performance
of persistent disks,
there's a storage option
suitable for every type of
machine learning workload.
Thanks for watching this
episode of Cloud AI Adventures.
If you enjoyed it,
click that Like button
and subscribe to get all
the latest episodes right
when they come out.
I'm Yufeng Guo on
Twitter @YufengG.
And if you're looking for longer
form machine learning and cloud
content, be sure to
check out the Adventures
in the Cloud YouTube channel,
which I've linked below.
And for more on disk
and performance,
check out the resources
in the description below.
And I'll see you on the next
episode of AI Adventures,
right here on the Google Cloud
Platform YouTube channel.
