DAVID J. MALAN: Hello, world!
This is CS50 Live, CS50's
episodic show wherein
we not only look at recent news
in tech, but also explain it.
This week's focus, user
error on a massive scale.
But first, a look back at
our earliest of seasons.
Hello, world!
This is--
ROBOT: CS--
RAMON GALVAN: 50 Live.
I'm Ramon Galvan, filling
in today for David--
DAVID J. MALAN: Who's lost his voice.
RAMON GALVAN: Today he'll be the
Andy Richter to my Conan O'Brien.
[MUSIC PLAYING]
DAVID J. MALAN: And of course,
who could forget Season 2?
[NO SPEECH]
CS50 Live first looks today at GitLab.
GitLab is a popular source code
hosting site, much like GitHub.com,
that developers can use in order
to store their code centrally,
in order to version control
it, share multiple copies,
as well as share it with other users.
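By way of example, pushing a local repository to a host like GitLab typically looks something like this, where the repository URL here is just made up for illustration:

  # hypothetical repository URL, for illustration only
  git remote add origin https://gitlab.com/example/project.git
  git push -u origin master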
Unfortunately, GitLab ran into
a bit of an issue very recently.
The whole incident started
when they saw this.
They support a feature
known as Snippets,
whereby users can
create snippets of code,
much like GitHub Gist, whereby users
can upload small snippets of code
to share them with other people.
Unfortunately, having some
1.5 million snippets of code
created over the course
of just a few days?
Not normal.
In fact, this seemed to be the
result of some spamming behavior
by some adversarial folks online.
Moreover, GitLab also noticed that one or
more spammers seemed to be using GitLab
inappropriately, as a content
delivery network, or CDN,
whereby they were serving up
files in ways that they shouldn't.
Now unfortunately,
these kinds of attacks
resulted in a bit of a ripple
effect on their back-end databases.
In particular, GitLab posted the following.
"We are experiencing issues
with our production database
and are working to recover."
Now unfortunately, just minutes later they posted, "We accidentally
deleted production data
and might have to restore from backup."
Now what exactly happened?
Well, it's quite common for databases
to be replicated from one to another
so that you have a
primary and a secondary,
the latter of which is a backup
of the former in real time.
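GitLab's database, for what it's worth, is PostgreSQL, and in such a setup a secondary is often seeded from the primary with a command along these lines, where the hostname, user, and path are illustrative only:

  # illustrative only: copy the primary's data into an empty standby data directory
  pg_basebackup -h primary.example.com -U replicator -D /var/lib/postgresql/data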
While trying to diagnose why this replication between the databases had slowed down,
one of GitLab's system
administrators very deliberately
executed a command quite like this.
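Per GitLab's own post-mortem, it was along these lines, though the exact path may have differed:

  sudo rm -rf /var/opt/gitlab/postgresql/data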
Now what is this command?
Well at the front of
this command is "sudo,"
which says execute the following
command with administrative, or root,
privileges.
What is the command to be executed?
Well, rm -rf apparently.
And rm, you might know, is to remove
files or folders from a system.
-r though means recursively.
Delete the following thing recursively,
so that any directories inside of that
also get deleted.
And unfortunately, the "f" in -rf means forcibly,
which means don't even prompt
the human to confirm or deny
that he or she wants to do this.
Now the system administrator meant
to execute this command deliberately
on their secondary database,
db2.cluster.gitlab.com,
so that they could then resume replication from their primary to their secondary database.
Unfortunately, it appears
to have been late at night,
and this was a stressful
situation, and darn it
if this command weren't executed
on db1.cluster.gitlab.com,
the actual primary database.
Now no big deal, surely we have
backups all over the place.
So we can just restore from
backup, and our customers
will be perfectly
happy and on their way.
Unfortunately, out of five backup
or replication techniques deployed,
GitLab reported that, "None are working
reliably or set up in the first place."
Indeed, if you'd like to read their
whole post-mortem in which they
discussed exactly what went wrong, and
how, you can check out this URL here.
But the moral of the story, for our purposes, is please, please beware the rm -rf, especially if you might be deleting not just some directory of your own, but potentially your customers' data as well.
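One defensive habit, at least on systems with GNU rm installed: double-check which machine you're actually on, and prefer a flag that prompts before mass deletion. The path below is just a placeholder:

  hostname                    # confirm you're on the machine you think you're on
  sudo rm -rI /path/to/data   # capital -I prompts once before removing recursively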
In other news, you might have
noticed that half of the internet
went down recently, and somehow
this was Amazon.com's fault. Well,
it turns out that Amazon is not
just the e-commerce site that you
might know and use.
They're also one of the world's
largest cloud providers,
where cloud computing is this technique whereby someone else runs the servers, and hard drives, and other services somewhere in the world.
And you as a customer can
essentially rent those services,
so that your website or application
isn't hosted by you in your own data
center, but in Amazon, or Microsoft,
or Google's own data center.
Now unfortunately, something went wrong
with one of Amazon's cloud services,
something called S3, its Simple Storage Service, such that the result, according to one popular ISP called Level 3, was outages across the US, if not beyond,
because these websites pictured--
here in this heat map--
were relying on at least
one of Amazon's services.
In fact you might notice some familiar
names among the websites affected.
Codecademy, Coursera, Docker, Giphy,
GitLab, GitHub, Heroku, Imgur,
Kickstarter, Medium, Quora, Slack,
Travis CI, and many, many more.
In fact, perhaps best was the irony of
a website called Is It Down Right Now?
being down right now.
This is a website that typically lets you check whether other websites are down, but if you visited it during Amazon's outage, you would have seen an error like this.
Now fair is fair.
Some of CS50's own infrastructure
also went down during this incident,
and that's because CS50 stores not
only some of its largest video files,
but also the data related
to its video player,
on Amazon S3, the cloud
service in question.
So in fact, during that outage, if you
tried to watch one of CS50's videos
in its own player, you probably
would have seen an error screen
quite like this.
Because the video player, which is JavaScript-based and runs client-side,
wasn't able to pull the requisite
data from Amazon servers.
Now what is Amazon S3, and what
technically went wrong here?
Well at first glance, it's
all pretty technical sounding.
"Amazon S3 is a simple
key-based object store,"
according to Amazon's documentation.
"Keys can be any string,
and can be constructed
to mimic hierarchical attributes."
But what does that mean?
Well, let's tease this apart.
It's a key-based object store.
Now an object, in this case, just refers to a file, where
a file is just a whole bunch of bits.
But Amazon kind of abstracts
away the notion of a file,
so there isn't really the notion of
files, and folders, and all of that.
There's just objects, which are,
for all intents and purposes, files.
But they are accessible via keys,
which typically are strings,
much like in a hash table, if familiar.
You access some value by
way of some unique key.
So for instance, in CS50, we posted this
first video from fall 2016 at this URL
here.
It's an mp4, which is a video file.
Now it turns out that the
video file actually lives
on a server that's similarly named, but notice what's in the name: cdn.cs50.net.s3.amazonaws.com.
Which is to say that indeed,
within CS50's own CDN--
content delivery network-- the
data itself comes from Amazon.
Now what about the key
that uniquely identifies
our objects, or videos, or other files?
Well, this string here, with slashes and words and so forth, looks like a file inside of a bunch of folders-- and it's fine to think about it that way-- but it really is just a unique string that resembles a file path.
We've adopted a scheme whereby
it looks like these are folders,
simply because it keeps our
data nicely hierarchical.
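To make that concrete, fetching one of our objects from that bucket over HTTP might look like this, where the key below is made up, in the hierarchical style just described:

  # hypothetical key; the bucket's hostname is as shown earlier
  curl -I https://cdn.cs50.net.s3.amazonaws.com/2016/fall/lectures/0/lecture0.mp4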
So what went wrong,
and what did users see?
Well, if you visited
Amazon's status page
on the day in question-- or the
days prior to the days in question--
you would have seen
beautiful green check marks,
from February 27 on back, indicating all was well.
Green check means good for
the S3 storage service.
Unfortunately, on February 28, this thing reared its head.
And suffice it to say, red icon bad.
In fact, in this case it means half of
the internet would appear to be down.
Now, you can read more on
the details of this story,
but let's take a look at
a few of the key moments.
At 2:37 PM Eastern Time on February 28, Amazon reported this.
"We can confirm high
error rates for requests
made to S3 in the US EAST-1 Region.
We've identified the issue, and are
working to restore normal operations."
Well what does that mean?
Well, US EAST-1 is simply one of Amazon's data center regions.
Like a lot of cloud providers,
they have data centers--
buildings with lots of servers
and lots of hard drives and more--
all over the world.
And US EAST-1 happens to
be one of the most popular.
It's physically located in northern
Virginia, in the United States.
And because CS50 isn't all that far
away, in Cambridge, Massachusetts,
many of our assets live in US EAST-1 by choice.
In fact, it's a trade-off.
We could absolutely replicate our data across multiple regions, and would thereby be much more tolerant of this kind of fault,
but it's a trade-off between
how much storage you need,
how much money it might cost, and how
much complexity you have to introduce.
So we very consciously put
much of our data in US EAST-1
so that it's as close
to campus as possible.
Now, Amazon explains that the
reason S3 became inaccessible,
and in turn, so many
of these customers--
CS50 among them-- went
offline was as follows.
"The team was debugging an issue
causing the S3 billing system
to progress more slowly than expected."
OK.
"An authorized S3 team member,
using an established playbook,
executed a command which was intended
to remove a small number of servers
for one of the S3 subsystems that is used by the S3 billing process."
OK.
"Unfortunately, one of the inputs to
the command was entered incorrectly,
and a larger set of servers
was removed than intended."
In other words, because of human error.
Literally a typographical error in
the equivalent of a terminal window.
By mistyping a command, Amazon took offline not just the few servers meant to help diagnose the problem, but a huge number of servers.
All of which then need to be
rebooted, which takes some time,
and which explains the downtime.
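Amazon didn't publish the literal command, but you can imagine how a single mistyped input to some internal tool, purely hypothetical here, could widen the blast radius:

  # purely hypothetical tool and flags, for illustration only
  remove-capacity --subsystem index --servers 8    # intended: take 8 servers offline
  remove-capacity --subsystem index --servers 80   # typed: one extra keystroke, 10x the servers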
In the real world, this might
be like if Amazon were hungry
for a little bit of
chocolate, and so went over
to the chocolate serving station,
and picked up this here X-ACTO knife,
and just wanted to take a tiny
little piece of the internet offline
so as to enjoy some chocolate.
And so you might just take
off a little corner like this.
Mm-mm, that's a good server.
But that's not what Amazon in fact did.
Amazon, because of a mistyped command, for which apparently there was not a sufficient prompting process to ask, "Human, are you sure you want to take down all of these servers?", effectively took out this
here saw, turned it on, and bit off
half of the internet.
Mm, that's good internet.
But, CS50, to be fair, is not
immune to these kinds of mistakes.
In fact, here on CS50 Live,
we've made our own fair share.
In fact, why don't we take a look now
at some of CS50 Live's own outtakes.
Hello, world.
This is--
ROBOT: CS--
DAVID J. MALAN: 50--
RAMON GALVAN: Live.
DAVID J. MALAN: So if you see
me trip, if you see me misspeak,
if you screw up, all of that is
happening literally right now,
in Cambridge, Massachusetts.
Oh hi, world.
Drumroll.
Perspect-- or, per second.
Pause the vizio if you'd like--
Book redder.
[INAUDIBLE]
Mark Zuckerberg's favorite patent,
to protect our nuclear missile.
Keeping [INAUDIBLE] as usual.
Bugle itself--
[INAUDIBLE]
Good episode for you.
It's actually quite--
Fine-- and--
Then you may recall--
Ted to this you are here.
TECHNICIAN: This is CS50-- gah.
DAVID J. MALAN: And now
I made the blooper reel.
Fantastic.
SPEAKER 1: Let me read you ending.
That should be the ending.
DAVID J. MALAN: Photos of Jason
Hirschhorn dressed as a pumpkin.
JASON HIRSCHHORN: Oh boy.
I don't know if I want
people to see that.
DAVID J. MALAN: No, now it's photos of
the Jason Hirschhorn dressed as a boy.
SPEAKER 2: And if you're
interested, I can actually
show you how it's going to be--
DAVID J. MALAN: Yeah, absolutely.
SPEAKER 2: He's a little
off balance this morning,
he hasn't had his coffee yet.
[BUS HONKING]
SPEAKER 3: Whoa!
Oh god.
DAVID J. MALAN: From--
where's he from?
SPEAKER 4: Right where my arm is, you
can see like the white characters.
There is the pole.
DAVID J. MALAN: And they don't know
that this is-- you should point here.
SPEAKER 4: Oh.
DAVID J. MALAN: Can you hear me, world?
RAMON GALVAN: Hello, world.
Welcome to CS50 Live, I'm Ramona Galvan.
DAVID J. MALAN: And I--
And I'm David Malan.
RAMON GALVAN: And today I'm
hosting today's episode.
DAVID J. MALAN: But with--
RAMON GALVAN: [BEEP] OK, OK.
TECHNICIAN: Don't say [BEEP] on the air.
RAMON GALVAN: He'll be the Robin to my
Batman, the Andy Richter to my Conan,
the Cheech to my Chong today.
It's most definitely a serious
thing that we're doing today.
This is not a joke.
Dropbox had made quite a fuss lately,
but I know nothing about this.
What is this about?
That was all above me.
And this-- is something I don't know of.
We also took a tour of Third Glass--
Third Deg--
DAVID J. MALAN: [BEEP] right there!
Allows you to swipe credit cards on your
iPhone in order to process payments.
RAMON GALVAN: I have a flip phone.
Let's play the clip.
DAVID J. MALAN: To host the first ever--
RAMON GALVAN: OK.
To host the first ever--
DAVID J. MALAN: I was
in graduate school.
RAMON GALVAN: And I was in fourth grade.
Although I loved Zamyla, I
would much rather not spend--
DAVID J. MALAN: Spend half
as much time with her.
RAMON GALVAN: Exactly.
DAVID J. MALAN: Come on out, Zamyla!
RAMON GALVAN: This was CS50.
DAVID J. MALAN: And this was terrifying.
RAMON GALVAN: This is terrifying.
Made a nice little teaser reel to kind of encapsulate the debauchery that took place.
DAVID J. MALAN: I love you.
RAMON GALVAN: I love you,
unlike David who circles you.
DAVID J. MALAN: Rumors that-- oh
That's it for CS50 Live.
Thanks so much to CS50 Live's whole
team, Arturo, and Ian, and Christian,
and Doug, and Marina, Marinda, Dan
Coffey, and Cynthia for our brand
new set.
This was CS50.
