Now, on CS50LIVE today, we look at
something else that's been in the news.
Namely, this frightening quote here.
"We have broken SHA-1 in practice."
But what does that mean?
Well SHA-1, it turns out, is what's
called a hash function, which
is a special algorithm that
takes input and produces output,
as all algorithms do.
And what it's generally
designed to do is
take pretty large inputs,
or pretty important inputs,
and reduce them to a number,
or a short string that
represents that original input.
And SHA-1, like a lot
of other algorithms,
is used in any number
of different contexts.
For instance, SHA-1 is used in what
are called digital certificates, which
are a digital mechanism whereby you
can prove with high probability
that some document, or some
credit card transaction,
or some bank transaction, or more
was indeed performed by, or signed
by a specific person.
It's also used in
software updates so that when
you download new software
from Apple or Microsoft,
you can confirm with high
probability that that update indeed
came from one of those players, and not,
for instance, from some malicious user
online.
Backup systems.
If you're in the habit, as you should
be, of backing up all of your files,
your software might be
using SHA-1 in order
to ensure that your file is
indeed correctly backed up
and it wasn't corrupted, for instance,
when being uploaded or copied.
And it's also used by Git, the version
control system with which you might
be familiar on a website like GitHub.
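To make the Git case concrete, here is a minimal sketch in Python of how Git derives an object ID for a file's contents in repositories that use its original SHA-1 object format: it hashes a short header followed by the raw bytes. The function name git_blob_id is hypothetical, chosen only for this illustration.

```python
import hashlib


def git_blob_id(content: bytes) -> str:
    """Return the SHA-1 object ID Git gives a file's contents (a "blob")."""
    # Git prepends the header "blob <size in bytes>" plus a NUL byte,
    # then hashes the header and the raw contents together.
    header = b"blob " + str(len(content)).encode("ascii") + b"\0"
    return hashlib.sha1(header + content).hexdigest()


if __name__ == "__main__":
    # Prints a 40-character hexadecimal ID for this sample content.
    print(git_blob_id(b"hello, world\n"))
```

Because the ID depends only on the contents, two files with identical bytes get the same ID, which is also why a SHA-1 collision matters to Git.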
Now it wasn't all that long
ago that a well-known security
researcher, Bruce Schneier, wrote this.
"It's time for us all to
migrate away from SHA-1."
Actually, it was a long time ago.
That was said in 2005,
at which time it was
noted that cryptographers had shown
that SHA-1 is not collision free.
Specifically, that is, they developed an
algorithm for finding collisions faster
than brute force.
Now what does collision free
mean, and what are collisions?
Well, let's take a closer
look now at SHA-1 itself.
It's indeed an algorithm,
which you recall
can be thought of as this black box
that takes inputs and produces outputs--
takes problems and produces solutions.
And indeed, we generalize that black box
as algorithms, but a type of algorithm
is a hash function,
of which SHA-1 is one.
Now, a hash function generally takes
as input a string or some other value
and produces as output
another string or a number.
You can think of it more
generally as the input being
x and the output being f of x, a
function of x, if you think back
to some of your math days.
Now let's take a specific example.
Suppose that we want to
hash foo to some value.
That is, we want to take a string like
foo as input and produce as output
some number that uniquely,
or with high probability,
uniquely represents foo, perhaps so that
we can use fewer bits or fewer bytes
to represent that same value.
Now quite simply, if you're
familiar with hash tables, which
use hash functions, we might start
simply by looking at f and thinking,
oh, well this is 0, 1, 2, 3, 4, 5.
This is the fifth zero-indexed
letter of the alphabet A through Z.
And so you know what?
We're going to represent the word foo
with a hash value, or integer, of 5.
Great.
How about bar?
Let's take a look at b,
the first letter there.
A is 0, B might be 1, and so we would
represent bar's hash value as 1.
And then, what about a
value like baz, B-A-Z?
Let's again take a look
at its first character.
And oh-oh, it produces the same value.
So this is a collision.
If you are using a hash function,
like this very simple one we're
using here looking only
at the first character,
you get collisions if two inputs
happen to produce the same output.
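The first-letter scheme just described can be sketched in a few lines of Python. The function name first_letter_hash is made up for this illustration, and the sketch assumes inputs that start with a letter A through Z.

```python
def first_letter_hash(word: str) -> int:
    """Hash a word to the zero-indexed alphabet position of its first letter."""
    # 'a' maps to 0, 'b' to 1, ..., 'z' to 25; case is ignored.
    return ord(word[0].lower()) - ord("a")


if __name__ == "__main__":
    for word in ["foo", "bar", "baz"]:
        print(word, first_letter_hash(word))
    # bar and baz both start with b, so they collide on the value 1.
```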
All right, well that's
clearly my fault for having
proposed such a silly naive algorithm.
Let's do something a little
better with the same inputs.
Foo, again, could be viewed not just
as a single character, its first,
but let's look at all three.
And let's convert all three of
those to some integer representation
where a again is 0 and z would be 25.
And therefore, foo is 5, 14, 14.
And you know what?
Why don't we mathematically
just add these together so
that we're taking into account
more information than just
the first characters?
So 5 plus 14 plus 14
is going to give me 33.
So in this algorithm, 33
shall be my hash value.
Bar, meanwhile, is going to become,
of course, B-A-R, or 1, 0, and 17.
Add those together and we get
18, another distinct value.
And baz, and remember this
was the problem before,
is going to be B-A- and
Z, or 1, and 0, and 25,
which of course is 26
when added together.
So all seems to be well.
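As a rough sketch of this second scheme, again in Python, we can add up the zero-indexed alphabet position of every letter. The name letter_sum_hash is hypothetical, and the sketch assumes inputs made only of the letters A through Z.

```python
def letter_sum_hash(word: str) -> int:
    """Hash a word by summing each letter's zero-indexed alphabet position."""
    # 'a' contributes 0, 'b' contributes 1, ..., 'z' contributes 25.
    return sum(ord(letter) - ord("a") for letter in word.lower())


if __name__ == "__main__":
    for word in ["foo", "bar", "baz"]:
        print(word, letter_sum_hash(word))
    # foo -> 33, bar -> 18, baz -> 26: three distinct values this time.
```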
Now a hash function like SHA-1 is
actually much more sophisticated
than looking at a single character, or
even adding the characters' values
together.
It actually produces much larger
hash values like this one here.
In fact, if you were to hash,
so to speak, the string CS50,
you'd actually see this as output.
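If you'd like to reproduce that hash yourself, one way is Python's standard hashlib module. This is only a sketch; the helper name sha1_hex is made up here, and it assumes the string is encoded as UTF-8 before hashing.

```python
import hashlib


def sha1_hex(text: str) -> str:
    """Return the SHA-1 hash of a string as 40 hexadecimal characters."""
    # SHA-1 operates on bytes, so the string is encoded first (UTF-8 assumed).
    return hashlib.sha1(text.encode("utf-8")).hexdigest()


if __name__ == "__main__":
    digest = sha1_hex("CS50")
    print(digest)
    # 40 hex characters at 4 bits each is the 160 bits SHA-1 always outputs.
    print(len(digest) * 4, "bits")
```

Whatever the length of the input, the output is always those same 160 bits, which is what makes the hash a compact stand-in for the original.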
But there's a problem with this
general principle of hashing.
In fact, notice what happens here.
Foo, again, has a hash value of 33.
But so does another word like
oof, because it's the same letters
if we're adding them together.
So simplistically, we're still going
to get 33, so that's a collision.
And what that means is
we can't necessarily
figure out from the
output what the input was
because there might be multiple
inputs that produce that value.
Moreover, if we're just doing
addition with this algorithm,
our hash values are just going
to grow, and grow, and grow.
And surely we can't count to
infinity if our computers on earth
only have a finite amount of memory.
So if we're hashing
qux, we might get 59,
or double quux, 79, or
triple quuux, or even more
than that we might just keep counting
higher, and higher, and higher.
And we probably want to cap the
number of possible hash values
so we can actually store stuff in
real computers in their memory.
So of course, in programming
we might use something
like this, the modulo,
or remainder operator,
which you recall allows
us to wrap values around.
And indeed, if you're familiar
with hash tables, which
you might think of
pictorially as this, this
is one of the ways we actually
ensure that we don't
index beyond the
boundaries of an array
by making sure we wrap back around.
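Both shortcomings of the letter-adding scheme, and the wrap-around fix, can be sketched in Python. The names letter_sum_hash and bucket_index are hypothetical, and the table size of 26 is an arbitrary assumption made just for this example.

```python
TABLE_SIZE = 26  # hypothetical number of slots in the hash table's array


def letter_sum_hash(word: str) -> int:
    """Sum each letter's zero-indexed alphabet position (a=0 ... z=25)."""
    return sum(ord(letter) - ord("a") for letter in word.lower())


def bucket_index(word: str, table_size: int = TABLE_SIZE) -> int:
    """Wrap the raw hash around so it is always a valid array index."""
    return letter_sum_hash(word) % table_size


if __name__ == "__main__":
    # Same letters in a different order give the same sum: a collision.
    print(letter_sum_hash("foo"), letter_sum_hash("oof"))
    # Longer words give ever larger sums; modulo keeps the index in range.
    for word in ["qux", "quux", "quuux"]:
        print(word, letter_sum_hash(word), bucket_index(word))
```

Note that wrapping makes collisions more likely, not less, since every possible sum now has to land in one of only 26 slots.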
So what is the problem then at hand?
And why is this so worrisome?
Well, hash functions, SHA-1
included, are supposed to be one way.
Given an input, you should
be able to get an output.
And you should not be able
to go in the other direction.
You shouldn't be able to
reverse engineer things so
that given the hash value you can
figure out what the original input was.
For instance, if you
saw this on the screen,
you should not be able to figure
out that this once upon a time
was CS50 as input.
Now SHA-1 is still OK in
that regard, but it's not OK
when it comes to these collisions.
After all, security
researchers had originally
claimed that two objects colliding
accidentally is exceedingly unlikely.
That is a mathematical tenet
of this and many algorithms.
And in fact, and don't
get scared, this would
be the formula with
which you can compute
the probability of a collision
using not our simple arithmetic,
but SHA-1 itself.
This yields really, really,
really, really low probabilities
that should be quite reassuring enough.
In fact, the probability of a
collision with the SHA-1 algorithm
should be 1 divided by 2
raised to the 160th power.
If we do that math, that's
1 divided by this big number
here, which is one
quindecillion, a number so big
I had to Google how to pronounce it
in anticipation of saying it just now.
If we actually do that math, the
probability is so, so close to 0 you
can see that 0.0000 all the way down to
those digits there is the probability
of a collision.
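That arithmetic is easy to check in Python, which handles integers of any size. This sketch assumes an idealized hash whose 160-bit outputs are all equally likely, so it gives the chance that one given input lands on one given hash value. The constant and variable names are made up for the example.

```python
from fractions import Fraction

DIGEST_BITS = 160  # SHA-1 produces 160-bit hash values

# Number of distinct hash values SHA-1 can produce.
possible_outputs = 2 ** DIGEST_BITS

# Chance of hitting one specific value, assuming uniform outputs.
probability = Fraction(1, possible_outputs)

if __name__ == "__main__":
    print(possible_outputs)                  # the big number, 49 digits long
    print(f"{float(probability):.3e}")       # the same chance in scientific notation
```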
Now, this seems all very abstract.
How can we wrap our minds around
this a little more effectively?
Well, consider the planet Earth.
There's a whole lot of sand on
Earth, and indeed according to NPR
there should be seven quintillion grains
of sand, give or take, on planet Earth.
Now that's a lot of grains of sand.
So the probability of a
collision in the SHA-1 algorithm,
theoretically, would be like looking on
Earth for a specific grain of sand out
of all of the grains of sand on Earth.
And actually, I'm oversimplifying.
You wouldn't have to just
check this Earth looking
for a specific grain of sand.
You would have to search
194 octillion planet
Earths identical to this one, each of
them with their own deserts of sand,
in order to find that
one possible collision.
Suffice it to say that should
be super, super unlikely.
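The planet count comes from dividing the number of possible SHA-1 outputs by the grains of sand on one Earth. Here is a sketch of that division in Python; it assumes 7.5 quintillion grains per Earth, since that is the value under which the figure of 194 octillion works out, and the names are invented for this example.

```python
GRAINS_PER_EARTH = 7_500_000_000_000_000_000  # assumption: 7.5 quintillion
SHA1_OUTPUTS = 2 ** 160                       # possible 160-bit hash values
OCTILLION = 10 ** 27


def earths_of_sand(outputs: int = SHA1_OUTPUTS,
                   grains: int = GRAINS_PER_EARTH) -> int:
    """How many Earths' worth of sand match the number of hash values."""
    return outputs // grains


if __name__ == "__main__":
    earths = earths_of_sand()
    print(earths)
    print(earths // OCTILLION, "octillion Earths, give or take")
```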
And yet, computers are a
lot faster than us humans.
In fact, in 2015 was The SHAppening,
not to be confused with the hit film The
Happening starring Mark
Wahlberg, at which point
security researchers announced,
we just successfully broke
the full inner layer of SHA-1.
So a portion of the
algorithm used for hashing
was cracked using, frankly, quite a few
GPUs, or graphics processing units.
A 64-GPU cluster was used to essentially
crack the hash function in order
to figure out a possible
form of collision.
The hardware looks a
little something like this-- in fact,
you might have these in your
PC if you're quite the gamer--
and GPUs are highly specialized for
lots of parallel processing.
So this is increasingly what
CS researchers are using.
This team ultimately estimated
that the SHA-1 collision cost today,
in 2015, should be between
$75,000 and $120,000
if you rent Amazon EC2 cloud
computing over a few months.
And this is the key takeaway.
And this is why it's so worrisome.
That financial amount is within the
resources of a criminal syndicate.
In other words, by investing that
much money in a hack or an exploit,
bad guys could theoretically
compromise the systems out
there in the world-- credit cards,
passwords, and the like-- that
rely on SHA-1 at least in part.
For more on this particular attack,
take a look at this URL here.
But in conclusion
today, unfortunately, it
was just days ago that SHAttered
followed The SHAppening, wherein
researchers from Google
and the Netherlands
announced, "We have
broken SHA-1 in practice."
So not just a portion of the
algorithm but the algorithm
itself, by finding a collision
and demonstrating a mechanism
for generating such collisions.
This research result is so significant
that it even has its own logo.
And ultimately, what the
researchers put forth was this.
This attack required over nine
quintillion SHA-1 computations.
This took the equivalent
processing power
as 6,500 years of
single CPU computations,
and 110 years of
single GPU computations.
Now of course, they didn't use
single CPUs or single GPUs.
They too had a rack of servers and
more using the so-called cloud in order
to perform these calculations.
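To see why spreading the work across many machines matters, here is a back-of-the-envelope sketch in Python. It assumes the work divides perfectly evenly across devices, which is an idealization, and the device counts below are hypothetical, not the researchers' actual setup.

```python
GPU_YEARS = 110    # single-GPU years quoted for the attack
CPU_YEARS = 6_500  # single-CPU years quoted for the attack


def wall_clock_years(device_years: float, devices: int) -> float:
    """Elapsed years if the work splits evenly across this many devices."""
    return device_years / devices


if __name__ == "__main__":
    # Hypothetical cluster sizes, just to show how the time shrinks.
    for gpus in (1, 10, 110, 1_000):
        print(gpus, "GPUs:", wall_clock_years(GPU_YEARS, gpus), "years")
```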
Which is to say that with enough
money, and with enough computers,
and CPUs and GPUs, this
attack is now possible.
Now thankfully, browser
manufacturers have at least
been aware of this for some time.
And Google, for instance, has been
defending against this in recent months
already by displaying a screen like
this if a website you are visiting
is using encryption that uses
this older version of SHA, SHA-1.
In fact, if we zoom in
here, you'll see that Google
is complaining, a bit arcanely,
of a weak signature algorithm.
Now Firefox has been
planning to do this,
and just days after this
announcement they
accelerated their plans to do so.
And Mozilla, the company behind Firefox,
announced that SHA-1 on the public web
is now over.
And there are indeed solutions,
and there have been for some time.
So the real result here is humans
really expediting their transition away
from SHA-1.
There are algorithms like
SHA-256, SHA-3 and others,
which not only are different algorithms,
but also use many, many more bits
to actually produce those hash values.
So collisions are even less likely.
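One way to see the difference in size is to ask Python's standard hashlib module how many bits each algorithm produces. This is a small sketch; the helper name digest_bits is made up, and the two SHA-3 variants shown are just examples from that family.

```python
import hashlib


def digest_bits(algorithm: str) -> int:
    """Return how many bits of output the named hash algorithm produces."""
    # digest_size is reported in bytes, so multiply by 8 for bits.
    return hashlib.new(algorithm).digest_size * 8


if __name__ == "__main__":
    for name in ("sha1", "sha256", "sha3_256", "sha3_512"):
        print(name, digest_bits(name), "bits")
```

Every extra bit doubles the number of possible hash values, so going from 160 bits to 256 multiplies the space of outputs by 2 to the 96th power.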
Indeed, if I may say so
myself, CS50's own website
here is secure, according to Google
Chrome, because we have indeed
been using SHA-256 for some time.
Now for more on this,
head to this URL here.
And indeed yes, this research
result is so significant it even
has its own website.
