[MUSIC PLAYING]
SPEAKER 1: Cryptography.
What is it, and why is it important?
We're going to answer those two
questions in exactly that order.
Let's start with what cryptography is.
It's the art and science of obscuring,
and ideally protecting, information.
Now it's an art and a science because
there's math involved with it.
It's pretty straightforward to
manipulate characters in some way
by adding some constant number
to them or to change them
in some systematic manner.
But it's an art, because doing so in a
way to defend against potential attacks
is not as easy as it might first appear.
There's a lot of
guesswork and calculation
that needs to go into play to
find a really strong cipher.
Cryptography gives us
the opportunity to have
a basic level of security
against an adversary who might
do bad things with the information.
We usually contrast
enciphered information
with information that is
presented in the clear, which
is to say there's no protection
surrounding it at all.
And it's generally considered better to
protect information using cryptography
than to have information just
freely available out there.
Now a cipher, we're going to start
by talking about cryptography
sort of through history.
We'll lead up to more modern
forms of cryptography,
which are derived from more
ancient forms of cryptography.
But a cipher is one of the most
fundamental forms of cryptography.
And ciphers are algorithms.
And recall that an algorithm is just
a step-by-step set of instructions
that we use to complete a task.
And in this case, the task is to
obscure or encipher information.
And ciphers can also be used in
reverse to unobscure, or decipher,
that same information that was
previously encoded or enciphered.
Now there are many
different ciphers out there
that have varying levels
of security potential.
Some of the more ancient ciphers
that we're going to start with
should be considered
to have no security potential at all
considering how easy they are to crack.
But again, this leads into the more
modern approach to cryptography,
which is much more secure
than some of these basic ones.
And now let's start by imagining that
we have possession of this device.
Now if you're looking at this device
and it seems somewhat familiar to you,
it may be because you've recently
seen the movie A Christmas Story,
where Ralphie, the
character there, obtains
one of these, which is a Little Orphan
Annie's Secret Society decoder pin.
And this decoder pin has a set
of numbers going sequentially one
through 26 around the
inner edge, and a set
of letters, which is not
presented in any particular order,
around the outer edge.
And what would happen is the
radio announcer would provide,
set your pins to some combination.
So line up one number with one letter.
And then it would read
off some secret message
that, ostensibly, only individuals
who possessed this pin,
or many of the duplicate versions
of this pin that were distributed
to children around the
country, could then decipher
by taking the numbers that
were given over the radio
and transforming them back into
letters so that it makes sense.
So if you
zoom in on this image,
it might be a little
difficult to see, but you
can see that the 3 corresponds to the
letter L, and the 4 corresponds to an M
based on this particular
setting of this decoder pin.
So this is one potential, what we
would call a substitution cipher,
where we're changing, we're substituting
a letter in this case for a number,
and that number will henceforth
represent that letter
for the rest of this message.
But what is the problem
with this cipher?
Or more generally, when we think
about issues in computer science
where we have adversaries who are
trying to penetrate some system,
or break a code, or break
in, or hack into anything,
hack your password, we sometimes frame
this in terms of asking the question,
what is the attack vector?
Where is the vulnerability
that is potentially
part of this particular cipher?
And in this case, it's that
anybody who has access to this pin
is able to break any cipher
that is made with this pin.
And again, this pin was distributed
pretty extensively in the 1930s and 40s
to children who listened to
this very popular radio program.
So these pins were in
the hands of many people.
And anybody who had
access to the pin would
be able to understand the message.
And so that is how we might
frame this attack vector:
the key, in this case the pin, which
we will call a key for this purpose,
is just very prevalent.
It's pretty well known how to use
this key and manipulate this key.
A lot of people have access to that key.
But that's just one example
of a substitution cipher.
We have many different examples of
substitution ciphers that we could use.
Let's just take another very
simple, straightforward one,
which is imagine we have all
of the letters of the alphabet
and we're just going to assign the
ordinal position of that letter
as its cipher value.
So with the secret society pin,
there was this sort of random element
to it, right?
The letters were being skipped.
There wasn't a rhyme or reason to them,
although the numbers were sequential.
Here let's just line up both.
Let's use sequential letters and map
them to their sequential numbers.
So A becomes 1, B becomes 2, and so on.
Both of these things
are increasing linearly.
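As a rough sketch of this direct mapping, assuming uppercase English letters only (the function names here are made up for illustration and are not part of the lecture):

```python
import string

ALPHABET = string.ascii_uppercase  # "ABCDEFGHIJKLMNOPQRSTUVWXYZ"

def letter_to_number(letter):
    """Map A..Z to 1..26, the ordinal position of the letter."""
    return ALPHABET.index(letter.upper()) + 1

def number_to_letter(number):
    """Map 1..26 back to A..Z."""
    return ALPHABET[number - 1]

def encipher(message):
    # Characters outside A..Z are simply skipped in this sketch.
    return [letter_to_number(ch) for ch in message if ch.upper() in ALPHABET]

def decipher(numbers):
    return "".join(number_to_letter(n) for n in numbers)

print(encipher("HELLO"))                # [8, 5, 12, 12, 15]
print(decipher([8, 5, 12, 12, 15]))     # HELLO
```

Anyone who knows the rule can run `decipher`, which is the same weakness the decoder pin had.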
Now you may recall that
as computer scientists,
we ordinarily start counting from
zero rather than counting from one.
I'm counting from one here because
this mapping of A to 1 and Z to 26
is much more familiar to
us intuitively as humans,
and I want to keep us grounded in this
discussion of cryptography right now.
But ordinarily, you might actually
instead see this as 0 to 25, 0 being A,
through Z being 25 as
opposed to 1 through 26.
But this cipher would
work exactly the same
and has roughly the
same security potential
as Annie's secret society cipher does.
And we can actually make this a little
bit better because we are consistently
increasing the letters, A through
Z, and consistently increasing
the numbers, 1 through 26.
We could also, instead of just
doing this direct mapping,
we could rotate around.
We could start the 1 somewhere
else as opposed to being A.
And now instead of having just one
cipher where A maps to 1, B maps to 2,
we have a variety of
different ciphers, depending
on where we decide we want
to have our starting point.
So for example, we might
instead add two to every number.
So instead of going from 1
to 26, we go from 3 to 28.
Now think about it.
If you're trying to break this
cipher and you see patterns
like this with all these numbers in
them, what might jump out at you?
Well, if you're used to seeing ciphers
that are 1 through 26, for example,
something where you
don't see any 1s or 2s
and suddenly you're seeing 27s and 28s
potentially in the message that might
be long enough to have, in
this case, Ys or Zs in it
might seem to you that
this is slightly off.
Like this cipher must
be shifted in some way.
Instead of being this
straightforward line,
there's some modification
that's been made to it.
That's kind of a tip off
if you're trying to defend
against somebody figuring that out.
And so instead of going
27, 28 at the end,
we might instead wrap
around the alphabet.
Once we have exhausted the 26
possible values that we started with,
the 26 letters of the alphabet, we
might instead, once we have X is 26,
say, well, instead of Y
being 27, Y is 1 and Z is 2.
And this is not a massive improvement
on the security of this cipher.
Like I said, it's still quite
fragile and quite easy to break.
But it doesn't give quite as much
of a clue to a potential adversary
as to how to crack it, how
to decipher the message.
And this can be done
for any different value
to obtain any number
of different ciphers.
Instead of going forward
by two positions,
we could add 20 to every
letter's value, again,
wrapping around the
alphabet when we exhaust,
when we get to 26, instead of having
27, 28, we would just reset at 1
and continue on.
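The wrap-around can be sketched with modular arithmetic. This is a minimal illustration assuming the 1 through 26 numbering used here; `shifted_number` is a hypothetical helper name.

```python
def shifted_number(letter, shift):
    """Return the cipher value (1..26) of a letter after shifting,
    wrapping back to 1 once we pass 26."""
    position = ord(letter.upper()) - ord("A") + 1   # A=1 ... Z=26
    # Move to 0..25, add the shift, wrap with % 26, move back to 1..26.
    return (position - 1 + shift) % 26 + 1

# Shifting by two: A is 3, X is 26, and then Y and Z wrap to 1 and 2.
print([shifted_number(ch, 2) for ch in "AXYZ"])   # [3, 26, 1, 2]
```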
But we can also add 26 to it.
But that doesn't look very
different than what we had before.
And that's where this cipher's
vulnerability comes into play.
There's only 26 possible
ways to rotate the alphabet
while keeping the order of
the letters preserved, right?
Unless we start skipping
A, D, G, and then,
you know, rearranging the other
letters in some other way.
If we want to keep everything
straightforward in a line,
again, wrapping around 26
when necessary, there's
only 26 ways to do it.
That is to say that shifting
the alphabet forward by 26
is exactly the same as shifting
the alphabet forward by 0.
And so that's our limitation.
We have a very small number of,
again, this word keys that can
be used to decipher using this cipher.
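To get a feel for how small that key space is, here is a hedged sketch that simply tries every rotation. It assumes a letter-to-letter version of the rotation (B becomes D for a shift of two) and uppercase English text; `caesar_shift` is an illustrative name.

```python
def caesar_shift(text, shift):
    """Rotate each uppercase letter by `shift` positions, wrapping around
    the alphabet. Other characters are left unchanged."""
    result = []
    for ch in text:
        if "A" <= ch <= "Z":
            result.append(chr((ord(ch) - ord("A") + shift) % 26 + ord("A")))
        else:
            result.append(ch)
    return "".join(result)

ciphertext = caesar_shift("HELLO WORLD", 2)
print(ciphertext)  # JGNNQ YQTNF

# An adversary only has to look at 26 candidates to find the readable one.
candidates = [caesar_shift(ciphertext, -key) for key in range(26)]
for key, candidate in enumerate(candidates):
    print(key, candidate)
```

With so few keys, reading through the list by eye is enough to spot the plain text.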
Now this is an example of something
called a rotational cipher,
and it's actually a rather
famous rotational cipher
known as the Caesar Cipher.
It's attributed to Julius
Caesar and was apparently used
more than two millennia ago for him
to encode messages to his troops
on the line.
And at the time, this was revolutionary.
And generally what you're
going to find with cryptography
is there's just this pattern of breaking
the mold and doing something new
and trying to stay one step ahead.
And oftentimes, other
people will then catch up.
And this cipher, which
was once, you know,
lauded as being a wonderful
cipher, is no longer as strong
as it once was thought to be.
And so we keep having to advance
and improve and get ahead
of it for whatever kind of
adversary that is, whether that's
a potential enemy on the battle line,
as might have been the case with Julius
Caesar, or whether that's a hacker
who's trying to break into your system
as might be the case today.
And fortunately, again, we're
not using Caesar Cipher today
to encipher any of our information.
We're using much more modern techniques.
But these modern techniques
evolved from seeing
codes being created, ciphers
being created and broken,
and then having to be
created anew to try
and defend against new vulnerabilities
that have been exposed.
So like I said previously, very easy to
decipher or to crack the Caesar Cipher,
but at the time, very, very difficult.
The limitation, again,
limited number of keys.
There's only 26 ways to rotate
the alphabet for it to make sense.
In the English alphabet, of course.
If you're using a
different alphabet, your
number of keys might
be different if you're
using the same rotational approach.
But the fundamental
limitation is you are
confined by how many
letters are in your alphabet
that you're using to
encipher information.
So let's take things one step further.
What is an improvement that we
might be able to make to Caesar?
That would lead us to this idea
potentially of the Vigenere Cipher.
So Caesar had this
limitation of there's one key
and there's only 26 possible
values for that key.
What Vigenere Cipher does is it,
instead of using a single key,
uses multiple keys.
Instead of picking a
number to shift by, we're
instead going to define a keyword.
And we're going to use the letters
of that keyword in sequence as we
go to change what our
key is at any given
time, such that our enciphered message,
instead of being enciphered using one
key, might use three keys
or five keys or 10 keys,
depending on the length of
the keyword that we use,
if that keyword is three
or five or 10 letters long.
So this keyword becomes
the interesting twist
that made Caesar much more
challenging for an adversary
to crack by using different keys.
Now let's walk through an example
of how the Vigenere Cipher works
because I think it makes more sense
to see this visually rather than just
discussing it verbally.
So what we want to do here
is encrypt the message HELLO
using the keyword LAW.
So here our message HELLO is what
also might be called plain text.
It is in the clear.
It is not enciphered.
It is not hidden against any adversary.
And our key is LAW.
All right, so let's take a
look at how we might do this.
So it oftentimes helps, especially
when trying to encipher or decipher
using the Vigenere
Cipher, to consider all
of the inputs that go into determining
the final outputted character.
So we're going to take
a look at plain text,
and we're going to convert
it, just like we did
with Caesar, to its ordinal position.
We're going to see where
in the alphabet that is.
Is it the first letter?
Then it's 1.
If it's the last letter, it's 26.
And so on.
We're going to do the exact same
thing with each letter of our keyword.
So we're going to take
a look at the keyword,
figure out what that letter's
numerical correspondence would be.
We're going to then add
those two things together.
If we go over 26, just as we
did with the Caesar Cipher,
we're going to wrap back around
such that we're confining ourselves
to that range of 1 through 26.
And then we're going to take that
number and transform it into a letter.
So for example, if the result there is
2, we're going to change that into a B.
And the reason for that is that B is
the second letter of the alphabet.
So let's walk through this with HELLO
as our plain text and LAW as our key.
So the first letter of our plain
text is H, and the ordinal position
of that H is 8.
It is the eighth letter of the alphabet.
We do the same thing with the first
L for LAW, the first letter of LAW.
L is 12, it's the 12th
letter of the alphabet.
So our next step is to add those
two values, eight and 12 together.
We get 20.
We don't need to wrap around.
We didn't go over 26, so we're still OK.
And the 20th letter of the alphabet
is T. So the first step of this
enciphering process with HELLO, using
the Vigenere Cipher, using the key LAW,
is to turn the H into a T.
So we can do this again,
we can take a look
at the E, the second
letter of our plain text.
We use the second letter
of our keyword now.
So we're not using the same key.
We're not using 12
over and over and over.
We're using a different key.
We're now using the A, the
second letter of our keyword,
whose ordinal position is 1.
So 5 plus 1 is 6, and that results in F.
Next, we use the first L
of HELLO, and the W of LAW.
So L is 12, W is the 23rd letter of
the alphabet, we add those together,
we're at 35.
35 is not a legal value
in terms of this cipher.
We are confined to 1 through 26.
And so we just subtract 26 and we
get down to 9, and now we have I.
So now what do we do?
We've exhausted our
keyword, but we still
have plain text that
we need to encipher.
Well, as you might expect,
the logical thing to do
is just go back to the beginning
of the keyword and continue on.
And so we will.
So we'll use the L, the second L of
our plain text, and the first L--
because we've now exhausted
all of those letters,
we have to go back to the beginning--
the L for LAW.
12 plus 12 is 24.
24, the 24th letter
of the alphabet is X.
And we do that finally as well for the
O, advancing it one position, because
of the A in LAW, to 16, and that is P.
So ultimately, HELLO
in this case becomes
this random set of characters, TFIXP.
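The walkthrough above can be sketched in a few lines. This follows the 1 through 26 convention used in this example, where the key letter A adds 1; many textbook descriptions count from 0 instead, so their output differs. The function names are made up for illustration, and the sketch assumes letters only.

```python
def ordinal(letter):
    """Ordinal position of a letter: A=1 ... Z=26."""
    return ord(letter.upper()) - ord("A") + 1

def vigenere_encipher(plaintext, keyword):
    """Add each plain text letter's position to the matching keyword
    letter's position, wrapping around when the sum exceeds 26."""
    out = []
    for i, ch in enumerate(plaintext):
        # Cycle back to the start of the keyword once it is exhausted.
        key_letter = keyword[i % len(keyword)]
        total = ordinal(ch) + ordinal(key_letter)
        if total > 26:
            total -= 26
        out.append(chr(total - 1 + ord("A")))
    return "".join(out)

print(vigenere_encipher("HELLO", "LAW"))  # TFIXP
```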
And some advantages might also
immediately jump out at you.
With the Caesar Cipher,
anytime we changed a letter,
it always was that same
letter every time we
saw it in the enciphered message.
So if we had a B and we were advancing
everything by two characters,
every B in the original
message would always
be a D because D comes
two letters after B.
So again, if our Caesar Cipher
key is two, every time we see a B,
it becomes a D, every time we
have an A, it becomes a C, always.
Here with the Vigenere Cipher,
because we have different keys
and we're rotating
these keys differently,
depending on which letter
of the keyword we are
and which letter of
the plain text we are,
those two Ls are not the same, right?
Instead of H-E-L-L-O, we
don't have some mapping.
Those two Ls are I and X. They
are not the same character.
And so already we're seeing
a bit more security here
because there's not
this potential to guess.
Vigenere is also much more
secure when you consider
how many keys are available to you.
With the Caesar Cipher we
had 26 keys available to us.
With the Vigenere Cipher
we have 26 to the n keys,
where n is the length of our keyword.
So for example, if we're using
a two letter long keyword,
for example, AA or AB or all the way
up, that leaves us with 26 squared,
or 676 possibilities.
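A quick way to check that count is to enumerate the keywords. This is only a sketch, and `count_keywords` is a hypothetical helper.

```python
from itertools import product
import string

def count_keywords(n):
    """Number of possible keywords of length n over a 26-letter alphabet."""
    return 26 ** n

# Enumerate every two-letter keyword: AA, AB, ..., ZZ.
two_letter_keywords = ["".join(pair)
                       for pair in product(string.ascii_uppercase, repeat=2)]

print(len(two_letter_keywords))   # 676
print(count_keywords(3))          # 17576
```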
Now if we extend to three letter
keywords or four letter keywords,
we're getting even more
and more possibilities.
And as we start to increase
the number of possibilities,
we start to really increase the
difficulty for some adversary
to figure out what the key is.
And that's really the goal
of cryptography, right?
We want to be able to
protect information
and we want to defend that information
from being determined by other people.
So the more work we put into making
more challenging keys, the more likely
we are to be successful in our
attempt to encipher information.
So again, Vigenere much
more of a secure cipher.
It's still not secure
and it's definitely
not a cipher that is used today.
There are computer programs that are
capable of figuring out how to decipher
using the Vigenere Cipher pretty well.
But it's more secure
than Caesar for sure
because of its changing alphabets
and its much larger number of keys.
Let's go back to this decoder pin and
think about another potential problem
that we have.
Now assume that your
adversary is actually
not a member of Annie's secret society.
They don't have this pin.
So that's already a step up.
We previously had assumed that anybody
who had the pin could crack it,
and that's still true.
But let's assume your adversary,
lucky you, doesn't have this pin.
Is there still a way that they would be
able to crack the code without the pin?
Think about it for a second.
Think about what characteristics
of the English language
there are that might help people
figure out what this cipher is.
Think about some unique features
of the English language,
which is one letter words, like I and
A, which might appear in the message.
If you see a single
letter word in a message,
you're probably going to guess
that it's either the letter I,
and every time I see that character or
that number I'm going to assume it's
an I, or you're going
to assume it's an A
and you're going to try and
plug in an A everywhere.
And some trial and error might
reveal some patterns that emerge.
And there is a very prevalent
pattern in the English language,
which is that letters appear
with a pretty regular frequency.
Given any arbitrary text
in the English language,
it's pretty likely that the
distribution of letters within that text
is going to follow this pattern:
roughly 13% of the time, give or take,
any arbitrary letter
selected from a text
is going to be the letter E. And only
1/10 of 1% of the time will it be a Z.
And only 2/10 of 1% might it be a J.
So there are some letters that
appear very frequently and
there are other letters that
appear very infrequently.
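Counting letters is easy to automate. The following is a minimal sketch of the counting step only, assuming English text; the sample sentence is an arbitrary placeholder, and a short sample will not match the long-run percentages closely.

```python
from collections import Counter

def letter_frequencies(text):
    """Return each letter's share of all letters in the text,
    ordered from most common to least common."""
    letters = [ch for ch in text.upper() if "A" <= ch <= "Z"]
    counts = Counter(letters)
    total = len(letters)
    return {letter: count / total for letter, count in counts.most_common()}

sample = "Meet me where the three trees meet the river"
for letter, share in letter_frequencies(sample).items():
    print(letter, round(share * 100, 1))
```

Running the same count on an enciphered message shows which symbol is most common, which is a natural first guess for E.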
And that is still a problem in
this generic substitution cipher,
even with the letters being
scrambled, which seems at first
blush to perhaps be much
more secure than one where
the letters are increasing
sequentially and the numbers
are increasing sequentially.
Even this scattershot mapping
of letters to numbers,
as long as we're still
confined to these two domains
where we have A through
Z and 1 through 26
and there's always a
mapping between them,
whether they're ordered
or not ordered, is still
a problem, in the English language
anyway, because of frequency analysis.
These are actually very common puzzles.
Humans might find it kind of tedious
to try and solve these puzzles,
but otherwise, this is
well known as a cryptogram.
You may, if you are the puzzling
type, have seen this type of puzzle,
which is called a cryptogram.
And this pattern is definitely
something that is across all messages
that appear in the English language.
There are plenty of other ciphers
that appear, that are used,
that are more secure
than any of these what
we might call one-to-one
ciphers, mapping
a single character to a different
character or to a number.
There are some ciphers that substitute
pairs or triples of characters
at a time.
And these ciphers, again,
form the basis for what
eventually becomes more
modern cryptography, which
we're getting to in just a moment.
There are also
transposition ciphers, where
instead of substituting
one character for another,
we simply use an algorithm to rearrange
all the letters in some systematic way.
And the defect there is that all the
letters of our original plain text
message are still there and all
we need to do is unscramble them.
And because there's
an algorithm that was
used to scramble them
in the first place,
there's got to be a
way to undo it as well.
With a little bit of trial and
error, we can probably sort that out.
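No particular transposition algorithm is named here, so the block-reversal scheme below is a made-up illustration. It only shows the general weakness: every original letter is still present.

```python
def transpose_blocks(text, block_size=3):
    """Hypothetical transposition: split the text into fixed-size blocks
    and reverse the letters inside each block."""
    blocks = [text[i:i + block_size] for i in range(0, len(text), block_size)]
    return "".join(block[::-1] for block in blocks)

plain = "HELLOWORLD"
scrambled = transpose_blocks(plain)
print(scrambled)                              # LEHWOLLROD

# The same letters are all still there, just in a different order.
print(sorted(scrambled) == sorted(plain))     # True

# Because the scrambling is systematic, it can be undone systematically.
print(transpose_blocks(scrambled))            # HELLOWORLD
```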
Finally, the most egregious issue
with these classical ciphers
is, how do you distribute the key?
How do you tell someone who you
want to share information with?
How do you tell your ally
what the key is for the cipher
that you are going to use?
You can't encrypt it because
if you encrypt the key,
how will they know what the real key is?
If you say, if you send
them a message and they
don't know how to interpret it, or
they see it and they interpret it
as something else, that's not
going to be helpful to you.
You want them to see the
key in the plain text.
You want them to see the
key in the clear, rather.
You want them to just have it.
You don't want to encrypt
that as you hand it to them.
That doesn't do them any good.
But if you're giving
the key to your ally
and your adversary is
within earshot, or they
have access to that same piece of paper
because your ally carelessly throws it
away and they can just
pick it up, now all
of a sudden all of your messages using
basic ciphers are fairly insecure.
But let's take a step forward
in modern cryptography.
Perhaps you've seen a screen that
looks like this at some point
when you're trying to
log in to some system.
Enter your email and we'll email
you a link to change your password.
Well, why don't you just
email me my password?
Like you're going to give
me a link to change it,
you must know it if I use
my credentials to log in
to your service any given day.
But OK, I guess, sure.
The reason for this is
actually a reason of security.
So let's distinguish ciphers, which
we've been talking about, from hashes.
So one of the most
critical distinctions is
that ciphers are generally reversible.
You can undo what you did.
That's the whole reason why it's
important to share with your allies
the key.
But hashes are generally not reversible.
Or certainly, they're not
supposed to be reversible.
And so it turns out, and we'll
learn about this a little bit
later, when you log in to
some service, if that service
is doing a good job of
protecting your data,
the reason they can't just send you
your password is because they actually
don't know your password.
And that might seem strange
because clearly, there
must be something-- if I type in
my password then I get logged in.
But a good service is one that does not
store your password in the database.
That's probably a good
thing if you think about it.
In case there was ever
a data breach, you
wouldn't want your password
to be in their database.
Instead what they do is they store a
hash of your password in the database.
And then when you provide
your password to them,
they run that password
through the same thing,
called a hash function, which is just
a generic idea for a function that
takes any arbitrarily large amount of
data and maps it to some other range
or some other set of values.
Now that might be an arbitrarily
long string of information.
It might be some fixed string where
if I run my password through this,
I'm going to get back something
that is always 20 characters long.
But it looks nothing like
my original password.
I've just made some weird
manipulations to it.
And that's what happens in
log-in systems more generally
is you will log in to
some service, you'll
type in your password, when that
information is then submitted
to the organization to check
your log-in credentials,
they will run your password through
that same hash function again.
And if that value matches what they
have in their database for you,
that is how they know that you have
provided the correct credentials.
They're mapping-- they're matching some
mapping of your password to the one
that they have stored, but they're not
actually checking your actual password.
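Here is a simplified sketch of that check. The choice of SHA-256, the in-memory `database` dictionary, and the function names are all assumptions for illustration. Real services do more than apply one bare hash, so treat this only as a picture of the idea.

```python
import hashlib

def hash_password(password):
    """Run the password through a hash function (SHA-256 is assumed here)."""
    return hashlib.sha256(password.encode("utf-8")).hexdigest()

# Stand-in for the service's database: it holds hashes, never passwords.
database = {}

def register(username, password):
    database[username] = hash_password(password)

def check_login(username, password):
    """Hash what the user typed and compare it with the stored hash."""
    stored = database.get(username)
    return stored is not None and stored == hash_password(password)

register("alice", "correct horse")
print(check_login("alice", "correct horse"))  # True
print(check_login("alice", "wrong guess"))    # False
print("correct horse" in database.values())   # False: only the hash is kept
```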
And that should probably give
you some sense of security.
And if you ever use a service where
you end up having to click on that link
and they actually send
you your password,
you probably don't want to
use that service anymore
because they're not taking strong
enough precautions to protect your data.
So as I said, once we have a
password stored in the database,
it is actually stored as a hash
rather than as the password itself.
The service should not be able to
tell you what your password really is.
So this idea of a hash
function-- what is it?
Well, as I said, it's something
that takes any arbitrary data--
and eventually we'll get into hashing
things like files and not just words
or strings, but for now let's
keep it to strings, strings
being a sequence of characters
or letters, like a word
or a phrase or a sentence--
and mapping it to some other range.
So we'll start out by just mapping a
string, a set of letters, to a number.
But it could be to a different
string, a string that's
always 10 characters long, and so on.
So there are some properties
that good hash functions have.
Let's take a look at
what some of these are.
So they should use only
the data being hashed.
There shouldn't be anything
else that comes into play.
They shouldn't be bringing
in any outside information.
It should rely exclusively
on whatever data is
being passed in to the hash function.
They should also use all
of the data being hashed.
It becomes a bit less effective if
every time I provide a word or a string
to my hash function, I'm only using
the first letter of that string,
such that my hash
function for every word
or every string I provide
that starts with A
is going to return the same value.
That's not terribly useful to me.
I want to get a better
distribution of values.
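To see why ignoring most of the data is a problem, consider this deliberately poor sketch, where `first_letter_hash` is a hypothetical function that looks only at the first letter.

```python
def first_letter_hash(word):
    """A poor hash: uses only the first letter (A=1 ... Z=26)
    and ignores the rest of the word."""
    return ord(word[0].upper()) - ord("A") + 1

for word in ["APPLE", "ANT", "AARDVARK", "BANANA"]:
    print(word, first_letter_hash(word))
# Every word starting with A lands on the same value, 1.
```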
Your hash function
should be deterministic.
And when we say deterministic,
we mean no random elements to it.
Oftentimes we think that random
numbers are nice to jumble things up.
But the problem is we want our
hash function to always output
the same value for the same inputs.
So if I give you my password
and hash it and I get
some output, every time
I provide my password
and run it through that
same hash function,
I want to get the same
output every time.
And that's what sites rely on when
they're using hashed passwords as part
of the credentialing check.
They're relying on the fact
that they will always get
the same output given the same input.
So that's a requirement
of a hash function.
Hash functions should
uniformly distribute data.
So oftentimes you're
mapping these strings,
let's say, to some set of values.
Those could be numbers,
again, those could be strings.
You want to spread those
out evenly, ideally,
across all of the possible
values that you have.
You don't want everything to hash
to 15 if your range is 0 to 100.
You'd ideally like everything
to be spread out such
that there's an equal
number of 0s, 1s, 99s,
and so on, as we talked about a little
bit when we discussed hash tables.
Finally, we also want to be able to
generate very different hash codes,
very different values
for very similar data.
For example, LAW and LAWS should
hash to very different values.
That would be ideal if
a tiny bit of variation
created a really dramatic ripple effect.
And creating this really
dramatic ripple effect
is pretty key when we're talking about
cryptographic hash functions, which
we'll get to in a second,
which form the basis of almost
all modern cryptography, which
form the basis of everything
that we do that we rely on when we think
of security in the computational field,
it's almost always relying on these
hash functions being really, really
good at making small changes
have very dramatic ripple effects
in the hash code or the hash
value, the data that comes out
of the hash function.
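One way to observe this ripple effect is to count how many output bits differ between two similar inputs. The sketch below assumes SHA-256 as the hash function, which is a choice made for illustration, and the helper names are hypothetical.

```python
import hashlib

def sha256_bits(text):
    """Return the SHA-256 digest of the text as a string of 256 bits."""
    digest = hashlib.sha256(text.encode("utf-8")).digest()
    return "".join(f"{byte:08b}" for byte in digest)

def differing_bits(a, b):
    """Count the bit positions where the two digests disagree."""
    return sum(x != y for x, y in zip(sha256_bits(a), sha256_bits(b)))

print(hashlib.sha256(b"LAW").hexdigest())
print(hashlib.sha256(b"LAWS").hexdigest())
# For a well-behaved hash, this tends to be around half of the 256 bits.
print(differing_bits("LAW", "LAWS"))
```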
So after all this talk
about good hash functions,
let's take a look at a
pretty bad hash function.
And we'll talk about why.
We'll talk about one of its virtues, but
some of its potential problems as well.
So instead, let's add up
all of the ordinal positions
of all the letters in the string being hashed.
So this ordinal position
idea is exactly the same
as we had a moment ago when we were
talking about Caesar and Vigenere.
So A is 1, B is 2, and so on.
So for example, for a word like STAR, if
we want to add up the ordinal positions
of all of the letters in
that word, we have S-T-A-R.
That's 19 plus 20 plus 1 plus 18.
So if you do that math
quickly, that ends up being 58.
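That arithmetic can be written as a short function. This is a sketch with a made-up name, `ordinal_sum_hash`, which counts only the letters A through Z.

```python
def ordinal_sum_hash(word):
    """Add up the ordinal positions (A=1 ... Z=26) of every letter."""
    return sum(ord(ch) - ord("A") + 1
               for ch in word.upper() if "A" <= ch <= "Z")

print(ordinal_sum_hash("STAR"))  # 19 + 20 + 1 + 18 = 58
```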
So what is a good thing
about this hash function?
Well, it's not reversible.
If I get a 58, I don't necessarily
know that the input that I had there
was STAR.
It could have been any one
of a whole variety of things.
It could have been ARTS
or RATS or TSAR
or MULL or this whole
random set of 29 Bs in a row.
All of these things, when run
through this really terrible hash
function that I've defined
here, all add up to 58
when I follow the rules
of this algorithm.
So I never know what my
input was given my output.
That is a good thing.
That is what a hash function should do.
Hash functions, unlike ciphers,
should not be reversible.
But the problem that I have here is
that I have a lot of collisions, right?
There are a lot of different
things that map to 58.
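Collisions like these are easy to find by grouping inputs by their hash value. A minimal sketch, using the letter-sum hash under an illustrative name and a small hand-picked word list:

```python
from collections import defaultdict

def ordinal_sum_hash(word):
    """Add up the ordinal positions (A=1 ... Z=26) of every letter."""
    return sum(ord(ch) - ord("A") + 1
               for ch in word.upper() if "A" <= ch <= "Z")

words = ["STAR", "ARTS", "RATS", "MULL", "B" * 29]

# Bucket each input under the value it hashes to.
buckets = defaultdict(list)
for word in words:
    buckets[ordinal_sum_hash(word)].append(word)

for value, members in buckets.items():
    print(value, members)   # every one of these inputs lands on 58
```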
And when we talked about
collisions a little bit previously,
we were talking about them in
the context of a hash table.
And collisions were OK in that context.
We were just clustering things together.
If they all happened to
have the same hash value,
we'll just put them in the same bucket.
When we're talking about
cryptography though,
when we start to get into relying on
cryptography to keep our data secure,
we can't have collisions at all.
In fact, pretty much we rely on the
fact that it is so mathematically
unlikely, nigh impossible, to have a
collision in order for these things
to work.
And so collisions, when we're talking
about cryptographic hash functions,
are definitely not a good thing.
So to recap, to check that a user
gave us the correct password, if we're
storing a hash of the password in
the database versus just storing
the plain text password in the database,
which hopefully no one is storing
a plain text password
in the database, we
run the actual password, the real
password through the hash function.
We get a hash value as an output,
some string or some number
or what have you as the output.
And if we get a match, odds are
they entered the right password.
Now I'm saying odds are
because we can't be 100% sure.
And we can never be 100% sure.
We can be really, really,
really sure, but there's always
a chance of a collision.
Even with the best designed
hash functions, even
with the best designed
cryptographic hash functions,
there's always a chance of a collision.
But ideally, that chance
is quite infinitesimal.
Very, very, very, very,
very, very unlikely.
So odds are, if the matching hash
comes out of this hash function,
it's quite likely,
like 99.9% plus likely
that they entered the correct
password, this is, in fact,
the user whose credentials are being
verified, and we should log them in.
Modern cryptography is just hashing.
It's just hashing that's quite a bit
more clever, certainly than the example
that I just talked about a moment ago.
Also, these algorithms tend not to
work on a character by character basis.
That's unlike the algorithm
that I just described, where
I was adding up every single letter.
I was looking at each one individually.
They tend to take,
these modern algorithms
tend to take clusters of letters,
pairs or triples or so on at a time,
maybe do even more things.
They might rearrange the letters
before they do things to them.
So there's multiple layers going on
with these encryption algorithms.
And unlike some of the ones
I've discussed earlier,
most of these also have the property
where given data of arbitrary size--
and now we're starting to really
expand our minds into not just words
or strings, but also images,
files, videos, documents, PDFs,
and so on; anything can be run through
a hash function to get a value--
but we're always going to get a
string of bits, a bit string, that
is always exactly the same size.
So depending on the
algorithm, maybe it's
going to be a 160-bit long
string, or a 256-bit long string.
But our range is finite.
It's always going to
be exactly 256 bits.
But the combination of those
bits will be different, ideally,
for every single piece of data we
might throw at it, no matter what.
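To see the fixed-size property for yourself, here is a small sketch using Python's standard hashlib module. The sample inputs are arbitrary choices of mine; the point is only that the output length does not depend on the input length.

```python
import hashlib

# Three inputs of very different sizes: 1 byte, 12 bytes, 1,000,000 bytes.
inputs = [b"a", b"hello, world", b"x" * 1_000_000]

sizes = []
for data in inputs:
    sha1_bits = len(hashlib.sha1(data).digest()) * 8
    sha256_bits = len(hashlib.sha256(data).digest()) * 8
    sizes.append((sha1_bits, sha256_bits))
    print(len(data), "bytes in ->", sha1_bits, "bits (SHA-1),",
          sha256_bits, "bits (SHA-256)")
```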
OK, so let's expand our
definition of a hash function
to this idea of a
cryptographic hash function.
What properties should they have?
They should be very difficult, very,
very difficult, basically impossible
to reverse.
It should be computationally impossible
for anybody to undo the encryption.
That's pretty much the same
as a regular hash function.
We're just really hammering the
point home when we say this here.
They should still be deterministic.
We don't want any random elements to it.
We still want to
hash a value and always
get the same output no matter what if
we run that same value through the hash
function an arbitrary number of times.
They should still generate
very different hash codes
for very similar data.
We still want things to
be spread out and we want
minor changes to have dramatic effect.
And they should never--
and this is one of those words
that computer scientists love--
they should never allow two different
sets of data to hash to the same value.
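Two of these properties, being deterministic and spreading similar inputs far apart, are easy to observe. The following is a rough sketch, assuming SHA-256 as the hash function; the two messages and the helper names are invented for illustration.

```python
import hashlib

def sha256_hex(text: str) -> str:
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

def differing_bits(a_hex: str, b_hex: str) -> int:
    # Count the bit positions where the two digests disagree.
    return bin(int(a_hex, 16) ^ int(b_hex, 16)).count("1")

first = sha256_hex("pay Alice $100")
again = sha256_hex("pay Alice $100")    # same input, hashed a second time
similar = sha256_hex("pay Alice $101")  # one character changed

print("deterministic:", first == again)
print(differing_bits(first, similar), "of 256 bits differ")
```

With a well-designed function you would expect roughly half of the bits to flip for a one-character change.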
Do you see a potential problem
when we frame it in this way?
When we say they should
never be able to do that?
We've already restricted ourselves
to a finite range of outputs, right?
I said a moment ago, maybe this hash
function maps to 160-bit long strings.
There's only so many
combinations of 160 bits.
Now that might be an unfathomably
large number, but using the word never
there becomes a bit dangerous.
We can't really rely on that.
And we'll see why this could
potentially be a problem.
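The reason "never" is dangerous shows up quickly if we shrink the output. The sketch below uses a made-up toy function, tiny_hash, that keeps only the first 16 bits of SHA-256. With only 65,536 possible outputs, feeding it 65,537 different inputs must produce a collision. Real digests are vastly longer, but the same counting argument applies.

```python
import hashlib

def tiny_hash(data: bytes) -> int:
    # Hypothetical toy: keep only the first 16 bits of the SHA-256 digest.
    return int.from_bytes(hashlib.sha256(data).digest()[:2], "big")

seen = {}
collision = None
for i in range(2**16 + 1):  # one more input than there are possible outputs
    message = str(i).encode()
    h = tiny_hash(message)
    if h in seen:
        collision = (seen[h], message)  # two different inputs, same output
        break
    seen[h] = message

print(collision)
```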
This static length string, by the way,
is usually referred to as a digest
in this context.
When we start to talk about more
modern cryptography techniques,
the output of a
cryptographic hash function
is usually referred to as a digest.
Let's take a look at one of these
cryptographic hash functions.
And certainly I'm not going to
dive into the mathematics of it.
I wouldn't be able to
explain the mathematics.
I wouldn't be able to do it justice if I
tried to explain the mathematics of it.
But let's just take a look at
some of the basics of this.
So SHA-1.
SHA-1 is quite a famous algorithm.
It was designed by the National
Security Agency in the mid-1990s.
So these are really smart people
who are tasked with working
with things like military intelligence.
These are people who are dedicating
their lives to trying to protect data
as best as they possibly can.
Far more brilliant
minds than I, for sure.
And this hash function-- and
this is a published paper.
Hash functions tend to be, actually
it's this very strange dichotomy where
you describe exactly
how the function works,
but it still should be irreversible.
And this just really becomes a question
of incredibly complicated mathematics
involved, such that even if you
knew so many of the pieces going in,
you still might not-- you still wouldn't
be able to undo it, even if you tried.
It's kind of amazing actually.
SHA-1's digests are
always 160 bits in length.
So this is one of those ones
I just said a moment ago.
That means that there are 2 to the
160 different SHA-1 digests, which
is a bit over 10 to the 48th power.
And again, 2 to 160 means for
every single one of the 160 bits,
that could be a 0 or a 1.
So we have that, two options times two
options times two options, 160 times.
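Python handles integers of any size, so the count can be written out exactly. A quick sketch:

```python
# Two choices for each of 160 bits.
digest_count = 2 ** 160

print(digest_count)
print(len(str(digest_count)), "decimal digits")
print("more than 10**48:", digest_count > 10 ** 48)
```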
Just to try and make it fathomable, to
understand how large this number is,
let me try and paint a picture for you.
So imagine that you are looking on
Earth for a specific grain of sand.
You're looking for one specific
grain of sand on Earth.
That is easier by far than trying
to have SHA-1 have a collision where
two values would map to the same thing.
There's about 8 times 10 to the 18
grains of sand on Earth.
So that's eight quintillion--
I had to look up that word--
eight quintillion grains of sand.
So way easier to find the
grain of sand on Earth
than it is to have a collision.
In fact, we go even further
and say that imagine
that every single one
of those grains of sand
was another planet Earth, each
of which also had sand on it.
So you have eight
quintillion planet Earths.
You're trying to find a
specific grain of sand
on one of those eight
quintillion planets.
It's still easier than trying
to have a collision with SHA-1.
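The sizes in that picture can be compared directly. This sketch takes the eight-quintillion figure from the lecture as given (it is an estimate, not a measured value) and compares the number of sand grains spread across that many Earths with the number of possible SHA-1 digests.

```python
grains_on_earth = 8 * 10 ** 18                  # the lecture's estimate
grains_on_all_earths = grains_on_earth ** 2     # one Earth per grain of sand
sha1_digests = 2 ** 160

print(grains_on_all_earths < sha1_digests)
# How many times more digests than grains, rounded down:
print(sha1_digests // grains_on_all_earths)
```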
SHA-1 is such an important
algorithm that it's actually
one of the algorithms that is
required in federal regulations
to be used by the government
for encrypting information.
There are others as
well, but SHA-1 is listed
by the National Institute of Standards
and Technology as a standard algorithm.
But there's a problem, which
is that SHA-1 is broken.
And it has this clever website
called SHAttered, shattered.io.
So the research team figured out
how to intentionally
create a collision.
And intentionally creating collision
has the effect of basically saying,
this cryptographic hash
function is broken.
And they have proven that there is
a way that they can systematically
generate collisions.
So that's bad.
And we'll see why that's
bad in just a moment.
But you can go to this
URL, shattered.io,
and read quite a bit about
how the researchers do it.
They explain it in different levels.
So if you really want to dive into the
technology and the mathematics of it,
you're certainly welcome to.
If you just want to understand it at a
base level and why this is a problem,
I definitely encourage you
to take a look at this site
and read more about this.
So what did these researchers do?
So they said, It is now
practically possible
to craft two colliding PDF files
and obtain a SHA-1 digital signature
on the first PDF file,
which can also be abused
as a valid signature
on the second PDF file.
In short, what they're
basically saying here
is we were able to
create two PDF files such
that if I run them through the SHA-1
algorithm, the digest that I get
is the same.
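Checking for this kind of collision is simple, even though producing one is not. The sketch below is an illustration with invented helper names and placeholder contract text; it does not include the researchers' actual PDFs. For ordinary data like this the check comes back False. For the two files the researchers published, the SHA-1 digests match even though the contents differ.

```python
import hashlib

def sha1_collides(a: bytes, b: bytes) -> bool:
    # True only when the inputs differ but their SHA-1 digests are identical.
    return a != b and hashlib.sha1(a).digest() == hashlib.sha1(b).digest()

low_rent = b"Rental agreement: rent is $1,000 per month."
high_rent = b"Rental agreement: rent is $9,000 per month."

print(hashlib.sha1(low_rent).hexdigest())
print(hashlib.sha1(high_rent).hexdigest())
print("collision:", sha1_collides(low_rent, high_rent))
```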
Why is this potentially bad?
For example, by crafting
the two colliding PDF files
as two rental agreements
with different rent,
it is possible to
trick someone to create
a valid signature for
a high-rent contract
by having him or her
sign a low-rent contract.
If you can take a PDF and
twist it into anything
you want it to be, but
have a valid signature,
a valid SHA hash associated
with it, that's not great.
Now before alarm bells start
going off because SHA-1 is still
used quite extensively, even now:
this SHAttered research result
was released in 2017, and SHA-1
was still being used then
and is still being used now.
Before you panic though, it has
not been broken that many times,
although they did very--
they worked for two years to
create this PDF collision.
And they demonstrated a
method for how to do it.
It has still not
happened that many times.
Cryptographic hash functions, once
they've demonstrated one collision,
are broken.
That is certainly true.
But the actual effects of this
have not yet really materialized.
The computational power
required to create this
is well beyond the capabilities of
most people, or most syndicates even.
So no cause for alarm yet.
But it does show that there
is a limitation with SHA-1,
and we still want to always
be staying one step ahead.
Just like when Julius
Caesar's enemies figured out
how to crack the Caesar Cipher, the
goal was, we need to get one step ahead.
As technologists, we always
want to stay one step ahead
to make sure that we are doing
our best job protecting our data.
And as lawyers, we
want to make sure we're
doing our best job
protecting our clients' data
against potential adversarial attacks.
So as I mentioned, there
are other standards
that are in use by other organizations,
including the federal government.
SHA-1, as I mentioned, is just one of
a few different options that they use.
SHA-2 and SHA-3 are much
more robust algorithms.
They use more bits,
basically, in their digest.
So instead of being 160 bits,
you can have anywhere between 224
and 512 bits.
So way larger of a range,
reducing the likelihood
of a collision that much more.
Again, imagine how unlikely
it was with 2 to the 160.
Now we make it even more so.
512 bits, that's unfathomably
large and difficult to duplicate.
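If you want to check the digest sizes yourself, Python's standard hashlib module reports them. This sketch only covers the fixed-length SHA-2 and SHA-3 variants that hashlib ships with, next to SHA-1 for comparison.

```python
import hashlib

names = ["sha1",
         "sha224", "sha256", "sha384", "sha512",           # SHA-2 family
         "sha3_224", "sha3_256", "sha3_384", "sha3_512"]   # SHA-3 family

digest_bits = {}
for name in names:
    # digest_size is reported in bytes, so multiply by 8 for bits.
    digest_bits[name] = hashlib.new(name).digest_size * 8
    print(name, digest_bits[name], "bits")
```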
MD5 and MD6 are other cryptographic
hash functions, or hash functions
that you may encounter.
MD5 in particular I've highlighted here
in yellow because it's not actually
considered secure anymore,
but it's still very, very
commonly used as a checksum.
Basically, what we do is
we run a file through MD5.
And say we're a distributor
of a file and we
want people to come download
our source, and they
want to be able to trust our source, we
might run our file through MD5 and say,
if you run this file through MD5, the
hash will be blah blah blah blah blah.
And other people can then download
the file and run it through MD5.
It's usually a program that is
available on computers for people
to just run any arbitrary data
through to get a hash result.
And they can check, OK, the hash value
that I received from this trusted
source matches the hash value
that I was told I would receive,
and so I will trust this.
Versus perhaps getting
that same software
from some corner of the internet
that you don't really trust.
If you find the MD5 hash
of the trusted source
does not match what you downloaded and
what you thought was that same file,
it's probably a sign that
something has changed in it
and you don't really want to--
you might want to be skeptical about
trusting that file rather than just
diving right into it.
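Here is roughly what that check looks like in Python rather than with a command-line tool. It is a sketch under assumptions: the function names are mine, and the "download" is just a small temporary file created on the spot so that the example runs on its own.

```python
import hashlib
import os
import tempfile

def md5_of_file(path: str) -> str:
    # Read the file in chunks so large downloads don't need to fit in memory.
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()

def matches_published(path: str, published_md5: str) -> bool:
    return md5_of_file(path) == published_md5.strip().lower()

# Stand-in for a downloaded file (hypothetical contents).
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"pretend this is an installer")
    demo_path = tmp.name

# Stand-in for the checksum the distributor published on their site.
published = hashlib.md5(b"pretend this is an installer").hexdigest()

ok = matches_published(demo_path, published)
bad = matches_published(demo_path, "0" * 32)  # a checksum that doesn't match
os.remove(demo_path)

print("matches published checksum:", ok)
print("matches a different checksum:", bad)
```

A mismatch doesn't tell you what changed, only that the file is not the one the distributor hashed.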
So what do we do that relies on
cryptography on the internet today?
Or you know, just using
our computers every day.
Email.
Email relies pretty
extensively on cryptography,
particularly when we start to use secure
email services, of which Gmail might
not be considered one, but
there are services out there,
for example, ProtonMail and others,
that do encrypt email completely
from point to point.
Much safer in terms of
protecting one's communications.
Similarly, you may be familiar
with the mobile app Signal, which is also
used to encrypt communications between
two people, in this case for
text-style messages
rather than email.
Secure web browsing, you may be
familiar with this distinction
between HTTP and HTTPS.
And if you're not, that's OK.
We're going to be talking about
that a little bit later as well.
But you want to make sure that your
web traffic is encrypted against people
who are able to just monitor the network
for all the traffic that is going by.
You probably don't want your
searches to be someone else's
fodder for entertainment.
VPNs.
If you use a VPN, that's a great thing
to do if you're traveling, for example,
and you may be on less secure networks
than you might find at your business
or at home or at a university
institution, for example.
VPNs allow you to encrypt
communications with a network,
and also allow the network to pretend
to do something on your behalf so that
your web traffic cannot be
traced back to you directly,
which might be advantageous
in some situations as well.
Document storage as well.
So if you use services like
Dropbox, for example, generally what
Dropbox is going to do is
break your document into pieces
and encrypt those pieces.
Rather than just storing the whole file
writ large in some server somewhere
on the cloud, it's going to
encrypt it before it sends it over
so that you have some more
comfort that your data is being
protected by these cloud services.
And certainly, we're going to talk
a bit more about what the cloud is
and what cloud services are and what
they can be used for a little bit
later in the course as well.
Hash functions and cryptographic
hash functions are great,
but they are well documented
and there's only the one.
There's only one version of SHA-1.
There's only one version of SHA-3.
And that is a limitation.
Now it might not be a severe
one because it's pretty strong.
They're pretty strong algorithms.
But are there ways that we can improve
our own cryptographic techniques
if we're trying to protect
data that we are receiving,
data that we are sending, and so on?
And that leads to this idea
of public-key cryptography,
or public- and private-key
cryptography, or asymmetric encryption.
You'll hear these terms kind
of used interchangeably.
Let's start by talking about public-key
cryptography by way of an analogy.
We're going to go way back to
arithmetic and algebra days here.
So imagine we have something like this.
We have 14 times 8 equals 112.
Multiplication we can
think of as a function.
It is a function.
If 14 is our input and our function
is times 8, the result is 112.
Now multiplication is not a hash
function because it is reversible.
I can take that 112, multiply it by
1/8, or equivalently divide by 8,
and get back the original input.
So multiplication is a function,
but it is not a hash function.
It is reversible because if we multiply
any number x by some other number y,
we get a result z.
And we can undo that whole process by
taking z, multiplying it by 1 over y,
or the reciprocal of y, and
getting back the original x.
Reversible.
Goes in both directions.
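In code, the multiply-and-undo idea is two tiny functions. This sketch uses Python's Fraction type so that multiplying by 1/8 is exact; the name undo_f is just for illustration.

```python
from fractions import Fraction

def f(n):
    # The function from the example: multiply the input by 8.
    return n * 8

def undo_f(z):
    # Multiply by the reciprocal, 1/8, to get the original input back.
    return z * Fraction(1, 8)

print(f(14))
print(undo_f(f(14)))
```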
Now let's take this function
and kind of obscure it.
We know for ourselves that this
function that I'm using is n times 8.
Whatever I pass in is going
to be multiplied by 8.
But I don't tell you what that is.
I don't tell my friends what that is.
I just say, hey, if you
want to send me a message,
just run it through this function.
So again, we're going to just
use math as an example here.
If my message is 14,
I might say, f of 14--
and again, this is getting back
to algebra, maybe a little bit
back in the day--
f of 14 is 112.
That is my public key, you might think.
And you might say, having just
gone through this whole example,
that, well, it's pretty
easy to undo that.
If I know that 14 is the plain
text and 112 is the cipher text,
I can probably figure out that
your function is n times 8.
And so I've broken
your encryption scheme.
I have figured out how to
reverse your cryptography.
Well, it's true that n times
8 is certainly one function
that I could use to turn that
plain text, 14 in this example,
into that cipher text,
112 in this example.
But there are other
ways that I can do it.
My actual function could have
been n times 10 minus 28.
So 14 times 10 is 140, minus 28 is 112.
And there are other contrived
mathematical examples
that I could continue to do
pretty much ad infinitum to define
ways to transform 14 into 112.
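To make that concrete, here are three candidate functions that all turn 14 into 112. The first two come from the example above; the third, built on the square of n, is one I made up for illustration. They agree on 14 and disagree on almost everything else, so a single input and output pair doesn't pin the function down.

```python
def times_eight(n):
    return n * 8

def times_ten_minus_28(n):
    return n * 10 - 28

def square_minus_84(n):
    # Hypothetical third candidate: 14 * 14 = 196, and 196 - 84 = 112.
    return n * n - 84

candidates = [times_eight, times_ten_minus_28, square_minus_84]

print([f(14) for f in candidates])  # every candidate agrees on 14
print([f(15) for f in candidates])  # but they disagree on 15
```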
So just because you see that
112, that doesn't mean you
have figured out how to
break my hash function.
You haven't figured out what
my encryption technique is.
If all I say is, here's a black box
that I would like you to feed an input
into, even if you see the output, you,
or really more concernedly an adversary
who sees that output as well should
not be able to, or cannot in this case,
undo it.
Because yes, I could have
been using n times 8.
I could have been using this crazy
thing involving the square of n.
And that's kind of the idea
behind public-key cryptography.
I am going to publicize that I
have a function that can be used,
but I'm not going to tell
you what that function is,
and I'm certainly not going
to tell you how to reverse it.
So public- and private-key cryptography
are actually two hash functions
where the goal is to reverse them.
We kind of talked about
this as hash functions
are supposed to be irreversible.
But the distinction here is that we are
creating two functions, f and g, which
are intended to reverse one another.
So it's not that there is a
single function that is reversible,
it is that we have two functions that,
working together, create a circuit.
If I take data and I run it through
function f, I get some output.
If I run that output through function
g, I get back the original data.
I have deciphered the information.
And the same thing works in reverse.
If I take some data and I
run it through function g,
I get some hashed output
that makes no sense.
And if I run that hashed
output through function f,
I get back the original data once again.
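Here is a toy pair of functions with exactly that circular property. It is a sketch in the style of RSA with deliberately tiny numbers of my own choosing (the modulus 3233 and the exponents 17 and 2753); real keys use numbers hundreds of digits long.

```python
N = 3233  # toy modulus, equal to 61 * 53

def f(x):
    # One half of the pair: raise to the 17th power, modulo N.
    return pow(x, 17, N)

def g(x):
    # The other half: raise to the 2753rd power, modulo N.
    return pow(x, 2753, N)

data = 65
print(f(data))      # scrambled output
print(g(f(data)))   # f then g gives back the original data
print(f(g(data)))   # and so does g then f
```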
Now the key is that-- pun intended--
the key is that one of these functions
is public and the other one is private.
One of them is available to
everybody, and everybody uses
that function to send you messages.
If you want to send me a
message using encryption,
using public and private key
encryption, you take the message
and you use my public key
to encrypt it, and you
send me the result, the
hashed encrypted result.
And I use my private key to decrypt it.
And I am, ostensibly,
the only person who
has my private key, even
though I've broadcasted,
made my public key widely available.
Now the math that goes into this is
well beyond the scope of a discussion
that we're going to have here today.
But basically, and most
encryption, most cryptography
involves the use of prime
numbers, particularly
very, very large prime numbers.
And you're looking for prime numbers
that have a particular pattern.
And when I say "you're"
looking for it, don't worry,
you don't have to do this yourself.
There are plenty of
programs out there, built on algorithms
like RSA, a very popular one,
that can be used to generate
these public and private key pairs.
But the amazing thing is that it can
generate these pairs very quickly,
but it's almost impossible
to break or figure out
what the underlying functions,
or even in this case
what the underlying
two prime numbers are
that are the foundation for
your own encryption strategy.
So it's pretty amazing that it's
easy to define these functions
and almost impossible to reverse
engineer them, so to speak.
So we start with a huge prime number,
we find some other prime number that
has a property, a special
property related to it,
and from those two numbers we generate
two functions whose goal in life
is to undo whatever the other one does.
So f's job is to undo what g does,
g's job is to undo what f does.
And this is called a public
and private key pair.
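As a rough sketch of where such a pair comes from, here is textbook-style RSA key generation with tiny primes. The specific primes (61 and 53) and the public exponent (17) are assumptions chosen so the arithmetic is easy to follow; real key generators pick enormous random primes and take many more precautions.

```python
from math import gcd

p, q = 61, 53               # two (tiny) prime numbers
n = p * q                   # the modulus, shared by both keys
phi = (p - 1) * (q - 1)     # used only while generating the keys

e = 17                      # public exponent; must share no factor with phi
assert gcd(e, phi) == 1
d = pow(e, -1, phi)         # private exponent (needs Python 3.8 or newer)

public_key = (e, n)
private_key = (d, n)
print("public key: ", public_key)
print("private key:", private_key)
```

Going from the primes to the keys is quick; going from the public key back to the primes is the part that is meant to be infeasible at real sizes.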
So your public key is really
some complicated hash function
that does work.
And that hash function is
represented as a very long string
of numbers and letters.
It looks just like a hash digest.
But it's just a human representation,
a readable representation
of a mathematical function.
And your private key is the same--
or your private key is also a
representation of letters and numbers.
It's not exactly the
same as your public key,
but it undoes the work
that your public key does.
And again, these keys are generated
using an algorithm called RSA.
So let's take a look
at exactly how we would
go about doing some
asymmetric encryption using
public and private keys.
So here we have some original data.
It's a message perhaps
that I want to send.
And I want to send it to you.
I want to send this
message to you, but I don't
want to send it to you in the clear.
I don't want to, you know,
it's sensitive information.
I don't want to send it via plain text.
And I don't want to use
a generic hash function
because if I use a generic hash
function, like SHA for example,
it's irreversible.
You will not be able to figure
out what I tried to say.
So instead, I take this original
data and I use your public key.
Your public key, again, is just
a mathematical-- a very complex--
mathematical function.
So I take this data, I feed it into your
public key, your public hash function,
and I get some garbled stuff out.
OK?
And this is what I send to you.
I send you this garbled stuff.
In order for you to figure out
what the original message is,
you use your private key.
Not your public key--
your public key is what
I use to encipher the information--
but your private key, which
is known only to you, hypothetically.
It should not be distributed to others.
It undoes the work that
your public key did.
And so if I give you the
scrambled data and you
use your private key
to try and decipher it,
you will get back that original data.
But here's the great thing.
No one else's private key
will be able to do that.
If anybody intercepts that message other
than you and they use their private key
or they use your public
key again, they will not
be able to decipher the
message that I sent to you.
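The whole exchange can be sketched with toy RSA-style numbers. Everything here is an illustrative assumption: the two tiny key pairs, the name "eve" for someone intercepting the message, and the message itself, which is just the number 65.

```python
# Toy key pairs written as (exponent, modulus); far too small for real use.
your_public, your_private = (17, 3233), (2753, 3233)
eve_public, eve_private = (17, 2773), (157, 2773)

def apply_key(value, key):
    exponent, modulus = key
    return pow(value, exponent, modulus)

message = 65

# I encrypt with YOUR public key and send you the result.
ciphertext = apply_key(message, your_public)

# You decrypt with your private key.
recovered = apply_key(ciphertext, your_private)

# An eavesdropper tries their own private key, then your public key again.
eve_guess = apply_key(ciphertext, eve_private)
public_again = apply_key(ciphertext, your_public)

print("ciphertext:", ciphertext)
print("you recover:", recovered)
print("eavesdropper gets:", eve_guess, "and", public_again)
```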
And so public and private keys
are very interesting because they
create these pairs.
They're these unique encryption
schemes that are unique to two people,
or really even to one person.
If you were to send
me a message back, you
would send me a message
using my public key.
You would then send me whatever the
encrypted sort of scrambled data
is for the message that you
sent using my public key.
I would then use my
private key, which is not
known to you or to,
hypothetically, anyone
else to decipher what you sent me.
And I would get back the secret
message, or the perhaps not-so-secret,
but sensitive message
that you sent to me.
And so that's this idea
of asymmetric encryption.
You can encrypt using
someone's public key.
And anybody can do so.
And for that reason, you'll often
find technically-minded people will
sometimes post their public
key literally on the internet,
such that anybody who wants to send
them a message using a secure channel
can do so.
And programmers as well.
So if I'm doing some work using a
tool called GitHub, a popular service
available online for sharing
and posting source code,
if I want to send something from
my computer to GitHub's servers
in the cloud, I might authenticate using
a public key and private key encryption
scheme so that they see that I'm
using their public key to send them
information, they're decrypting it.
When they send information back
to me, they're using my public key
and I use my private key to decrypt it.
It's actually part of--
it's part of a communication strategy
used by technically-minded folks.
And you're not restricted to just
having one public and private key.
For example, I have one
public and private key
that I use for a secure email, I
have one public and private key
that I would use for
secure texting on my phone,
and I have one public and private key
that I use for my GitHub repository.
So I have different sets and
different combinations of these keys.
But the key is that-- the
key, again, pun intended--
is that the decryption can only be done
by someone who has the private key, not
the public key, because only those
two functions are reciprocals
of one another.
They undo the work that the
other did in the first place.
But interestingly enough,
that's not the only thing
we can do with public and private keys.
So instead of just encryption,
we also have this idea
of a digital signature, which
is different than e-signature,
an e-signature just being the
tracing of a pen typically
along some surface and just logging
where all the pen strokes happen to be.
So we're talking about something
much more complex than that.
We're talking about
something cryptographically
based when we talk
about digital signature.
It's kind of the opposite of encryption.
And using someone's
digital signature, you
can verify the authenticity of a
document and verify, more precisely,
the authenticity of the
sender of a document.
And we're going to explain this
in great detail in just a moment,
but the basic idea is they're signing
the document using their private key.
You still don't see what the key is.
And because these public
and private key pairs
are specific to an
individual person, if you
were able to verify that
that document could only
have been signed using
someone's private key,
then you have quite a serious
belief that that person
is the person who signed the document,
who sent the document, and so on.
Digital signatures are 256
bits long pretty consistently,
which means there are 2 to the 256th
power distinct digital signatures,
which makes the potential of
a forgery effectively zero.
Again, I'm using this--
I'm trying to avoid saying never because
computer scientists don't like never.
But effectively, there is
no chance of a forgery.
Now the process for how one verifies
a digital signature is quite--
there's quite a few steps involved.
And I have a diagram here
that I sourced from online.
And what I'd like us to
do now is walk through
this process to hopefully give you
an understanding of how these work
and how you might be able to
rely on digital signatures.
And states and different entities
are recognizing digital signatures
as a valid way to sign
documents, but it really helps
to have a good understanding of
them such that you, as an attorney,
are comfortable with the fact that this
does represent a specific individual.
So let's take a look at
how this process works.
So we start with data.
Data in this case is any document.
Perhaps it's a scanned, signed version
of some PDF with somebody's actual ink
signature.
But again, the whole
thing is just scanned.
The next step is to use a hash function.
The hash function that we could use
in this context could be anything.
It could be SHA-1.
It could be something very complex.
In general, the hash function
that's going to be used here
is actually not a
cryptographic hash function.
It's going to be something like MD5.
So something that anybody has access to.
And that's going to result in a
hash, a set of zeros and ones.
In the case of MD5, it's going to be
128 bits, usually written as 32 characters.
Now where things get
very interesting is we
take that hash, that
set of zeros and ones,
and we encrypt it using
the signer's private key.
Remember, these functions are
reciprocals of one another.
A public key can undo
what the private key does,
and the private key can undo
what the public key does.
Notice in this case we're still
not sending anyone our private key.
We are just using our private
key to encrypt something.
So we take this hash that we received
from running our file through MD5,
we encrypt it using our private key,
and we get some other result out of it.
This number that comes out of running
the hash through our private key
is called the signature.
We then just couple that--
so when we send this off,
we send the signature plus
the original document,
and that would be considered
a digital signature.
So that's the signing
part of the process.
That's where we go.
We start with a file.
We run that file through
a generic hash function.
Not our public and
private keys, something
that is generally pretty accessible.
We take that hash, we encrypt
it using our private key
to get some other hash that looks
similar, different zeros and ones,
but totally different
pattern of zeros and ones.
We attach the original document and the
digital signature when we send it off,
and that is considered a
digitally signed document.
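The signing steps above can be sketched as follows. This is a toy under stated assumptions: it follows the lecture in using MD5 as the generic hash, it builds a small RSA-style key from two known Mersenne primes that I picked for convenience, and it skips the padding that real signature schemes require. The document text is invented.

```python
import hashlib

# Toy key material: two known Mersenne primes (far too small for real use).
P, Q = 2**61 - 1, 2**89 - 1
N = P * Q
E = 65537                            # public exponent
D = pow(E, -1, (P - 1) * (Q - 1))    # private exponent (Python 3.8+)

def md5_as_int(document: bytes) -> int:
    # Step 1: run the document through the generic hash function.
    return int.from_bytes(hashlib.md5(document).digest(), "big")

def sign(document: bytes) -> int:
    # Step 2: "encrypt" the hash with the private key.
    return pow(md5_as_int(document), D, N)

document = b"Lease: rent is $1,000 per month."
signature = sign(document)

# Step 3: send the document and the signature together.
signed_document = (document, signature)
print(hex(signature))
```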
Now the real crux is how do you
prove that I'm the person who
sent you this document, right?
If you want-- if you're
receiving something
that has a digital
signature, which is supposed
to be as good as any
other kind of signature,
it's supposed to have legal effect.
How do we verify that that
person who sent you the document
was actually the correct one?
So then we go to the verification step.
So we start, we've now received
this digitally signed data.
This is the same as this
digitally signed data here
that was sent by the sender.
We also received two
pieces of information.
We received the document,
the original document,
and we received the signature.
And recall, again, that
the signature is what
happens when we take
the hash of the document
and run it using our
private key to get a result.
Now the interesting
step here is remembering
that the public and private keys
are reciprocals of one another.
So we can take this
complicated signature hash
and we can use the public key,
which, again, is publicly available.
Anybody should ostensibly have
access to someone's public key, not
their private key.
And notice that the signer has
never sent their private key.
They've only used it
to encrypt some data,
but they never sent the private key.
The public key has always
been available though.
We take the signature, we run it
through the public key function,
and we get a hash.
We take the data, the document,
and we run it through MD5,
the same hash function that the sender
was supposed to use, and we get a hash.
And we're checking to
make sure that these two
hashes are equal to one another.
If they are equal to one another,
that means the signature is valid.
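The verification side can be sketched the same way. Again this is a toy under assumptions: MD5 as the generic hash, a small RSA-style key built from two Mersenne primes of my choosing, no padding, and made-up document text. The signing line is included only so the example runs on its own.

```python
import hashlib

# Toy key pair (hypothetical numbers, far too small for real use).
P, Q = 2**61 - 1, 2**89 - 1
N = P * Q
E = 65537
D = pow(E, -1, (P - 1) * (Q - 1))    # the signer keeps this private
public_key = (E, N)                   # the signer publishes this

def md5_as_int(document: bytes) -> int:
    return int.from_bytes(hashlib.md5(document).digest(), "big")

def verify(document: bytes, signature: int, public_key) -> bool:
    e, n = public_key
    hash_from_signature = pow(signature, e, n)   # undo the private key
    hash_from_document = md5_as_int(document)    # hash what we received
    return hash_from_signature == hash_from_document

# The sender signs; the receiver only ever uses the public key.
document = b"Lease: rent is $1,000 per month."
signature = pow(md5_as_int(document), D, N)

tampered = b"Lease: rent is $9,000 per month."
print("original verifies:", verify(document, signature, public_key))
print("tampered verifies:", verify(tampered, signature, public_key))
```

If anyone changes the document after it was signed, the two hashes no longer match and the check fails.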
Let's talk about why
that would be the case.
If we use the MD5 of this file,
the generic hash of this file,
and we encrypt it using our private
key, we get some result, OK?
But this is very easy to calculate.
It's MD5.
We're taking a basic document, we're
running it through a publicly known,
well-defined hash function.
Anybody who has access to this document
and a program on their computer
called MD5 can literally
run this document through it
and get this number.
This is not the tricky part of this.
We then take this hash value,
we encrypt it using our private key
to get some secret number.
The public key though will undo that.
Remember, the public and private
keys are reciprocals of one another.
Whatever one does, the
other one can undo.
And so only my public key will
undo the work of my private key.
So if I take this value and I
encrypt it using my private key,
and then I run this value
through the public key,
I should get the original result
again, the original MD5 hash.
And that's why we have to
send the document as well, not
just the digital signature,
the numbers that we
get by running it through our
private key in the first place.
That way we have a way to validate
that yes, this file has this checksum,
and the sender took that checksum, they
ran it through their own private key,
and when I used their
public key to undo it,
I get the same value. That is
effectively proof, or at least,
we'll term it as it's
very, very, very, very
likely that this person who
claimed to have sent the document
is, in fact, the person
who sent that document.
And so that's what digital
signatures can be used for.
It is a mathematical,
cryptographic way to verify
the identity of the sender of
a document or an individual.
Or in whatever context you might be
using or receiving digital signatures,
it is purely a verification step that
is based entirely in mathematics.
There's one other
potentially interesting use
of digital signatures that's
also quite buzzy right now,
and that's blockchain technology.
And what is the blockchain?
Digital signatures are really key
to knowing how the blockchain works
and why it is trusted as a decentralized
source of information for individuals.
So understanding
digital signatures means
you are in a position to
understand blockchain.
And I use here the term the blockchain,
but it really is a blockchain.
There's no such thing
as the one blockchain.
There are many different-- this is
just an idea that is implemented.
Generally, we're hearing it in
the context of a cryptocurrency,
but it does not need to be restricted to
that, although cryptocurrencies are so
discussed in the media and have been
dissected by so many researchers
that they provide an interesting
vehicle, an interesting lens
through which to consider blockchain.
And so our example today is
going to focus on Bitcoin.
It is the most well-documented
of the cryptocurrencies.
It is the most well-documented
implementation of the blockchain,
or among the most
well-documented implementations.
But this is not specifically
a lecture about Bitcoin.
We're just using Bitcoin as a lens
through which to understand blockchain.
There's also an outside source
that I strongly encourage.
This channel on YouTube provides
interesting mathematical dissections
of topics, and they tackle blockchain
and Bitcoin pretty extensively.
And this is an excellent
supplementary resource
to consider if you're
trying to dig into this
or understand it a little bit
more, because in this video
I'm going to omit some of the more
technical details for the sake
of, hopefully, broader understanding.
But if you want to dive
into it more deeply,
this is a resource
that I would recommend.
And I really like talking about
Bitcoin in the context of blockchain
because it's actually how I kind of
got started almost as an attorney.
When I was practicing, when
I graduated from law school,
I decided to go out on my
own and start my own firm.
I live in a small town and
so a lot of my early work
was doing estate plans, wills and
such for individuals in my town,
getting to know them.
But I had studied extensively
technology-related law in law school
and I really wanted to use it.
And a few years into my
practice, I had a friend
who needed an estate plan prepared, and
he asked if he could pay me in Bitcoin.
And I had no idea what that meant.
I didn't really know anything
about Bitcoin at the time.
And I looked it up and thought
it sounded interesting,
and so I said sure.
So I learned how to set up an account.
And it's also worth
mentioning at the outset,
as we're talking about
cryptocurrency, that you
don't need to understand how
Bitcoin works to use Bitcoin.
You don't need to understand
how the federal banking
system works to use a bank.
And the same is true here with Bitcoin.
But I ended up accepting
a Bitcoin payment
by creating what's
called a Bitcoin wallet.
I immediately sold the Bitcoin that I
received and turned it into cash, such
that I could use it for
more generic purposes.
And what I decided to do was
send out a press release saying,
oh, I accept Bitcoin, because
it was something that was novel
and I hadn't really
heard that much about it.
And this got the attention of
my local paper and companies
in the area that were
technically minded as well.
And so Bitcoin sort
of provided this forum
to meet new clients that also
allowed me to explore fields
of the law about which I am passionate.
So it's kind of an interesting segue
to be able to share that with you now.
All right, so stepping away from Bitcoin
again more broadly to blockchain.
What is the blockchain?
It's very similar to something
you've already learned about,
which is a linked list.
So recall that a linked list is
a set of nodes, each of which
has connections forward and
backward to other nodes in the chain.
They are linked together.
And similarly, with a blockchain, all
of the blocks are chained together.
It's basically the same
terminology slightly modified.
So a linked list is a set of nodes, each
of which is connected to the one prior
and the one after it.
We learned about linked lists as having
generally three pieces of information
associated with them-- a previous
pointer, which is basically
a reference to the prior node,
or in this case, the prior block;
we have the next pointer, which
is a reference to the next node
or the next block; and we had data.
And in this case, the data is
actually two different things.
There's the real data.
And again, in the context of
a cryptocurrency blockchain
we're going to be talking
about a list of transactions,
a numbered list of transactions
from person A to person B,
each of those transactions
being digitally signed such
that you can verify that the
person who logs that transaction
is actually the one who
made that transaction.
And also, something
called a proof of work.
And this proof of work
is very interesting
because this is how Bitcoin
ostensibly derives its authority.
There is no central controller
of the Bitcoin currency,
and it is very decentralized.
And there needs to be
some way for people
to agree as to what the true ledger is,
or what the true set of transactions
that have happened are.
And the way that is done is by relying
on something called the proof of work.
And we'll dive into
that shortly as well.
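As a rough sketch of the structure just described,
a block can be written as a doubly linked list node
whose data is a list of signed transactions plus a proof of work.
The class and field names here are hypothetical,
and the signature is only a placeholder number.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Transaction:
    sender: str
    recipient: str
    amount: int
    signature: int   # sender's digital signature (placeholder value here)


@dataclass
class Block:
    transactions: List[Transaction] = field(default_factory=list)  # the real data
    proof_of_work: Optional[int] = None   # filled in once the block is proven
    previous: Optional["Block"] = None    # reference to the prior block
    next: Optional["Block"] = None        # reference to the next block


def append_block(tail: Block, new_block: Block) -> Block:
    """Chain new_block after tail, just like appending to a linked list."""
    tail.next = new_block
    new_block.previous = tail
    return new_block


genesis = Block()
second = append_block(genesis, Block([Transaction("Doug", "you", 10, signature=0)]))
print(second.previous is genesis, genesis.next is second)
```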
So again, cryptocurrencies, that data is
a ledger of transactions, each of which
is digitally signed using
the digital signature
technique we've just
discussed by the person who
made or initiated that transaction.
And that ledger is decentralized,
which means that any time there's
ever a change, any time any
transaction is recorded, in this case,
using Bitcoin, again, our lens
through which to consider blockchain,
that message is broadcast out.
So if I make a transaction
in Bitcoin, I pay you $10,
I'm going to announce to everyone
else who has a Bitcoin wallet
or who is monitoring the blockchain,
the list of transactions, hey,
please add the following transaction
to this list, Doug pays you $10.
And that is announced to everybody,
everybody records it in their ledger,
and then some stuff is
going to start happening.
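That announce-and-record step might be sketched like this.
Node and broadcast are made-up names for illustration,
and a real network would also check the signature
before recording anything.

```python
from typing import List


class Node:
    """One participant holding its own copy of the ledger (hypothetical)."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.ledger: List[str] = []

    def receive(self, transaction: str) -> None:
        self.ledger.append(transaction)


def broadcast(transaction: str, network: List[Node]) -> None:
    """Announce a transaction so that every node records it."""
    for node in network:
        node.receive(transaction)


network = [Node("Alice"), Node("Bob"), Node("Carol")]
broadcast("Doug pays you $10", network)
print([node.ledger for node in network])
```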
But here is a potential issue.
How do you know that the
blockchain is legitimate?
How do you know that your copy of
what is being said is the truth?
How do you know that your copy of the
blockchain is accurate with respect
to all other transactions
that have happened?
Everybody else has
their own copy as well.
It's decentralized.
We all maintain, anybody
who's using Bitcoin maintains
their own copy of the blockchain.
How do you defend against
people modifying it?
That's a very interesting question.
The way that cryptocurrencies
do it is to assume--
and this is defined
in the Bitcoin paper--
the way the cryptocurrencies
do it is to assume
that the chain that has the most
computational work put into it
is the true chain.
This decision is completely arbitrary.
There's no reason why one needs
to be vetted over the other.
But something had to be agreed upon
by, collectively, users of Bitcoin
to say, in the event of a dispute
over which person's chain is
the accurate, de facto,
definitive list of transactions,
we're going to go with the one that
has been verified the most times.
And again, this word verified
is sort of a sketchy word.
There's nothing inherently
about proof of work
or anything else that proves that a
transaction has taken place in the way
that we normally think
of this term verified.
Rather it is the collective standard
by which we all agree to adhere,
that the person--
or that the blockchain that has the
most proof of work in it is the list.
That is just something we
must subscribe to as users
and consumers of blockchain.
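One way to picture that rule in code:
give every copy of the chain a score for the work it contains
and accept the highest scorer.
This sketch assumes each block is summarized by the number
of leading zero bits its hash had to show,
and that a block demanding k zero bits costs about 2**k guesses.
Both are simplifying assumptions for illustration.

```python
from typing import Dict, List


def total_work(chain: List[int]) -> int:
    """Each entry is the zero-bit requirement a block met; sum the implied guesses."""
    return sum(2 ** bits for bits in chain)


def choose_chain(copies: Dict[str, List[int]]) -> str:
    """Return the owner of the copy with the most computational work in it."""
    return max(copies, key=lambda owner: total_work(copies[owner]))


copies = {
    "honest majority": [12, 12, 12, 12],
    "lone attacker": [12, 12, 12],
}
print(choose_chain(copies))
```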
Now how do we determine
which copy of the blockchain
has had the most computational work put
into it?
Well, this is proof of work.
So proof of work is how the correct
blockchain of all the copies
that are decentralized is determined.
So recall how hashing works.
Hashing allows us to take any arbitrary
data and run it through a hash function
and get an outcome.
And that outcome is going to be, let's
say 256 bits, each of those bits being,
of course, 0 or 1.
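As a quick illustration, assuming SHA-256 as the hash function
(one common function with a 256-bit output),
here is what that outcome looks like in Python.

```python
import hashlib

digest = hashlib.sha256(b"any arbitrary data").digest()

# Write each of the 32 bytes as 8 binary digits.
bits = "".join(f"{byte:08b}" for byte in digest)

print(len(bits))    # 256
print(bits[:32])    # the first 32 of those zeros and ones
```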
Now there's a lot of
different combinations there.
But some of them will be very unique.
And here is how Bitcoin's
blockchain goes about
proving a particular block.
We are asking people who are
oftentimes called miners--
that's where this term comes
from, because they are mining.
Ultimately the reward for doing this
proof of work is to receive Bitcoin
that are sort of
generated out of thin air.
And so these people are termed miners.
But we are asking anyone who has a
computer to hash the entire block.
So hash the entire list of
transactions, the reference
to the previous block
and the next block.
And remember, all of that is contained
in a single node of this blockchain,
basically.
And we're looking for a
highly unusual pattern.
We're looking for maybe the first
30 bits or the first 40 bits
to all be zeros.
That's really weird.
Like, that's a really
difficult pattern to find.
And the only way to do it is to guess.
So you take this entire block,
you attach a single piece of data
to the bottom of it, like 1, 2, 3.
You can just count in
that way trying to guess.
And if you hash that
entire thing together,
do you eventually find a block
that, when hashed in this way,
produces this very, very unique pattern?
If so, you just say, here's
the number that I attached.
So let's say I took the entire
block and I hashed it with 12345
was the number, right?
It's very difficult to
find a value that would
create this unique
pattern of zeros and ones,
in particular, zeros, 30 zeros in a row.
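The guessing game can be sketched in a few lines of Python.
This assumes SHA-256, a made-up block of bytes,
and only 12 leading zero bits instead of 30
so that the search finishes quickly.
The function names are illustrative.

```python
import hashlib
from itertools import count

DIFFICULTY_BITS = 12  # the lecture imagines 30; 12 keeps this demo fast


def leading_zero_bits(digest: bytes) -> int:
    """Number of zero bits at the front of the digest."""
    value = int.from_bytes(digest, "big")
    return len(digest) * 8 - value.bit_length()


def mine(block: bytes, difficulty_bits: int = DIFFICULTY_BITS) -> int:
    """Count 1, 2, 3, ... until block + number hashes to enough leading zeros."""
    for number in count(1):
        digest = hashlib.sha256(block + str(number).encode()).digest()
        if leading_zero_bits(digest) >= difficulty_bits:
            return number


block = b"previous-block-reference|Doug pays you $10"
proof_of_work = mine(block)
print(proof_of_work)
```

Each extra zero bit doubles the expected number of guesses,
which is what makes 30 or 40 bits so expensive to find.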
But it's really, really easy to
verify that someone has done it.
To verify that someone has done it,
all you have to do is if they announce
the number that they used,
12345, as their proof of work--
and that's what the
proof of work really is,
it's that number that they
use to figure it out--
if they announce that and you
hash the block with that number,
you can verify, yes, that pattern
is actually 30 zeros in a row.
So I guess you have proven it.
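Checking an announced number is a single hash.
In this sketch, again assuming SHA-256 and a made-up block,
a short brute-force search stands in for the miner
so that there is a number to check.
The names and the 10-bit difficulty are assumptions of the example.

```python
import hashlib


def has_leading_zero_bits(digest: bytes, difficulty_bits: int) -> bool:
    # Shifting away everything but the top difficulty_bits must leave 0.
    return int.from_bytes(digest, "big") >> (len(digest) * 8 - difficulty_bits) == 0


def verify_proof(block: bytes, number: int, difficulty_bits: int) -> bool:
    """One hash is all it takes to check an announced proof of work."""
    digest = hashlib.sha256(block + str(number).encode()).digest()
    return has_leading_zero_bits(digest, difficulty_bits)


block = b"previous-block-reference|Doug pays you $10"
difficulty_bits = 10

# Stand-in for the miner: many guesses to find a number worth announcing.
announced = next(n for n in range(1, 10_000_000)
                 if verify_proof(block, n, difficulty_bits))

# The verifier's side: exactly one hash.
print(verify_proof(block, announced, difficulty_bits))
```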
Now this is, again, kind of arbitrary.
Like, this seems weird.
Why are you spending
all your time trying
to figure out a specific
pattern that exists somewhere?
That is a question that I cannot
answer other than to say that it is
the standard that people who have
subscribed to Bitcoin have
just agreed to be bound by.
The person who finds this
number is probably the--
is proving the validity of
all the transactions above it.
And this gets interesting when
you think about somebody trying
to perpetrate a fraudulent transaction.
So imagine I'm trying to perpetrate a
fraudulent transaction by initiating
a transaction that says,
I'm going to pay you $100.
And I announce that to you,
but I don't broadcast it
to everybody else who maintains
the blocks, who are maintaining
their own copies of blockchains.
Which is interesting because you
think that I have spent $100,
and as far as you're concerned
I have spent $100 to you,
but no one else is aware of that.
So no one else thinks
that I have spent $100.
They all think I am $100
wealthier than I actually am.
The problem then arises that
I need to verify that block.
I need to verify that transaction.
So I append the transaction to
my own copy of the blockchain
because I am the only
person other than you--
the two of us maybe have these
copies of the blockchain,
but everybody else, I didn't
broadcast this transaction
so no one else knows about it.
In order for it to have a
proof of work attached to it,
in order for it to be
considered the valid chain,
I would need to prove that block.
I would need to find that secret number
that when hashed with the entire block,
produces a pattern of 30 consecutive
zero bits before anybody else does.
So that's a 1 in 2
to the 30th power chance on each guess,
because I'm looking for a
pattern of 30 consecutive zeros.
There's only about a one-in-a-billion
chance that any one guess
is going to find that pattern.
And I have to find that
pattern before somebody else.
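The arithmetic behind those odds, as a small worked example.
It assumes each output bit of the hash behaves like a fair coin flip,
which is the usual idealization of a cryptographic hash.

```python
from fractions import Fraction

ZERO_BITS = 30

# Each bit is 0 with probability 1/2, so 30 zeros in a row is (1/2)**30.
chance_per_guess = Fraction(1, 2) ** ZERO_BITS
expected_guesses = 1 / chance_per_guess

print(chance_per_guess)    # 1/1073741824
print(expected_guesses)    # 1073741824
```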
And in the meantime, other transactions
are coming in on my ledger.
On my-- other people are
broadcasting their transactions.
And I have to keep
adding them to my ledger
and keep proving that work
over and over and over,
all the while trying to stay ahead
so that my fraudulent transaction is
considered ultimately
the correct blockchain.
Now the odd-- you just
can't beat the odds of that.
One malicious person trying to
perpetrate a fraudulent transaction
using the blockchain cannot stay ahead.
They can't win the
find the secret number
game over and over and over and over.
Eventually, some other chain,
which contains valid transactions,
will win out over my
attempted fraudulent chain.
And it will be disregarded.
Nobody will consider that to be a
valid part of the chain anymore.
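To put a rough number on "cannot stay ahead,"
the Bitcoin paper treats this race as a gambler's ruin problem.
The sketch below uses that simplified result:
an attacker controlling a share q of the computing power,
against an honest share p = 1 - q,
ever makes up a deficit of z blocks with probability (q/p)**z when q < p.
The function name and the 10% figure are assumptions for illustration.

```python
def catch_up_probability(attacker_share: float, blocks_behind: int) -> float:
    """Chance the attacker ever erases a deficit of blocks_behind blocks."""
    honest_share = 1.0 - attacker_share
    if attacker_share >= honest_share:
        return 1.0  # with half or more of the power, catching up is certain
    return (attacker_share / honest_share) ** blocks_behind


for behind in (1, 3, 6):
    print(behind, catch_up_probability(0.10, behind))
```

The probability shrinks exponentially with every block
the rest of the network adds on top.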
And so that's kind of how this works.
Again, it's arbitrary the way
they decide to resolve or verify.
There's nothing about
this process that proves
that person A sent person B money.
It's just the consensus
that we have decided, well,
if people have gone through the effort
to try and find these secret numbers,
and many different people are doing
it, and this one chain is longer than
the others because it's been verified--
again, using this term verified--
it's been proven with work
over and over and over, we're
just going to agree that
that's the right one.
So again, it's kind of strange.
And I do, again, refer you to
that video that I shared earlier
to get into some of the more
technical details of this,
which I'm glossing over a little
bit here in this discussion.
But proof of work is basically
the collective consensus
of blockchain users, or
in this case specifically,
of Bitcoin users, for which transactions
they are going to consider valid.
Because changing any one-- and if
you go back in time, as opposed
to trying to forward think I want
to add a new fraudulent transaction,
if you try and go back in time to
modify a transaction from the past,
say there was a transaction
that was you pay me $10
and I maintain a copy of the
blockchain, so I can go back in time
and modify that file, technically, I
change it to you pay me $100, well,
because I've changed
even the tiniest thing
and I'm hashing that block,
that means that when I hash it
with that secret number, I'm no longer
getting that secret pattern of 30
numbers, 30 zeros in a row.
And so that kind of calls that
transaction into question.
It also, because each of those
blocks contains a reference
to the next block and
the previous block,
it also invalidates all of the other
transactions in that blockchain.
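Here is a small sketch of why editing the past gets noticed.
It is deliberately simplified:
each block holds one transaction
plus the SHA-256 hash of the block before it,
and the proof-of-work number is left out.
The helper names are hypothetical.

```python
import hashlib
from typing import List, Tuple

Block = Tuple[str, str]  # (hash of the previous block, transaction text)


def block_hash(block: Block) -> str:
    previous_hash, transaction = block
    return hashlib.sha256((previous_hash + "|" + transaction).encode()).hexdigest()


def build_chain(transactions: List[str]) -> List[Block]:
    chain: List[Block] = []
    previous_hash = "0" * 64  # placeholder reference for the very first block
    for transaction in transactions:
        block = (previous_hash, transaction)
        chain.append(block)
        previous_hash = block_hash(block)
    return chain


def first_broken_link(chain: List[Block]) -> int:
    """Index of the first block whose stored reference is stale, or -1."""
    for i in range(1, len(chain)):
        if chain[i][0] != block_hash(chain[i - 1]):
            return i
    return -1


chain = build_chain(["you pay me $10", "Doug pays Ana $5", "Ana pays Bo $2"])
print(first_broken_link(chain))   # -1: every link matches

chain[0] = (chain[0][0], "you pay me $100")   # go back in time and edit
print(first_broken_link(chain))   # 1: the next block's reference no longer matches
```

To hide the edit, I would have to recompute every later block,
and in the real system redo the proof of work for each one too.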
And so because of this
weird technique we're
doing of hashing blocks, hashing data,
trying to look for specific patterns,
but realizing that any cryptographic
hash function with the tiniest
change to the input creates
a totally different output,
we actually are pretty well
defended against people
who try and go back in time and
make fraudulent transactions using
the blockchain.
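That tiniest-change property is easy to see for yourself.
This sketch assumes SHA-256 and counts how many
of the 256 output bits differ between two nearly identical messages.

```python
import hashlib


def differing_bits(a: bytes, b: bytes) -> int:
    """Count the bit positions where the SHA-256 digests of a and b disagree."""
    x = int.from_bytes(hashlib.sha256(a).digest(), "big")
    y = int.from_bytes(hashlib.sha256(b).digest(), "big")
    return bin(x ^ y).count("1")


print(differing_bits(b"you pay me $10", b"you pay me $100"))
```

For a good hash function, you would expect roughly half
of the 256 bits to flip even though only one character was added.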
So it's mathematical and it's quirky,
but it does provide a clever way
to defend against that
kind of thing, considering
we don't have a central
authority to rely on
to adjudicate these kinds of disputes.
We are collectively, not
trusting one another enough,
but agreeing to trust the mathematics
of the blockchain in order
for it to succeed.
So as I mentioned, we can very
easily verify the correctness
of someone's proof of work.
That proof of work is just the
number that is hashed with the block
to produce the secret pattern of
30 zeros and then some other bits,
and so on.
The longer a chain gets,
the more and more likely
it is that all the transactions
in it are "verified."
Again, I keep putting air
quotes around that word
because it doesn't mean
exactly the same thing
that we might colloquially
consider verified to mean.
It doesn't prove anything
about the transaction itself,
just that we accept it as the standard.
We accept this as the de facto
truth because of all the mathematics
that have been put into it.
So the longer a chain
gets, the more likely
it is that it consists of only
verified, legitimate transactions.
But that brings up a question
of, what is a transaction?
A transaction is just an
exchange between two people.
And if we start to
really spread things out,
we can almost think about a
transaction as a contract.
I offer you $10 for you to
do something on my behalf,
and assuming that we're intending for
me to actually give you these $10,
and you're intending to actually
do something on my behalf,
and the thing that you're
doing for me is not illegal,
we've basically formed a contract.
And so while Bitcoin can be
used, the blockchain for Bitcoin
can be used to send money
back and forth between people,
the data that goes into the data
block of any blockchain is arbitrary.
And there's no reason why, instead
of being a list of transactions,
that data couldn't be something
much more significant than that.
There's no reason it couldn't
be a digitally signed PDF
scan of a contract between two people.
There's no reason it can't be a
message from me typed to you saying,
I will pay you $100 if you
paint my house on Tuesday,
and you sending something back in
that same chain saying, I will paint--
I accept your offer for this payment.
I accept your offer.
I will paint your house on
Tuesday in exchange for $100.
We've just formed a contract
with no middleman at all.
We are announcing our intentions.
It is being recorded publicly in
everybody's version of the blockchain.
It is verified, again,
verified in the sense
that we collectively deem it
to be accurate rather than
proving that I definitely sent this,
although the digital signatures
associated with these
transactions do, again,
suggest yes, I am the person who made
this transaction because I digitally
signed it.
If I do the same thing with a
contract, if I send you an offer
and you accept, and both of
those items are in the chain,
we arguably have formed a contract.
And that is what the blockchain
associated with the Ethereum technology
is actually more akin to.
So Bitcoin is kind of
restricted in how it
approaches cryptocurrency and
approaches transactions between people.
And Ethereum opens up a little bit more.
And there are other blockchain
technologies and other services
that rely on the blockchain
in order to do things far
beyond what a cryptocurrency could do.
But all these things are only
possible because we rely on--
we rely so extensively on cryptography.
We use computers to send information
securely, encrypt information.
And the mathematical
unlikelihood of someone
being able to duplicate
our work, or certainly
reverse engineer this
encryption is what gives us
the confidence to make these
transactions in the first place.
And so cryptography forms the
basis of almost everything
that we do when we talk
about security on a computer.
But ultimately, cryptography
just relies on mathematics.
So the moral of the
story is probably this.
You are probably not
going to be implementing
your own version of the blockchain.
And really, you don't need to understand
it completely in order to use it.
Like I said, you can use Bitcoin without
knowing the mathematics of how Bitcoin
works, just like you can use a bank
without knowing the minutiae of how
the banking system works.
The point of the blockchain is
to remove a central authority.
We don't rely on one person or
one entity or one government
to determine what has happened,
what the transactions are
like we do with a bank.
Your bank has a ledger
of everybody's accounts.
With blockchain technology, we are
decentralizing this and making it
so that everybody has access to
all of the information at once,
and it is everybody's responsibility
to keep that ledger accurate.
And because these ledgers rely
so extensively on cryptography,
because this technology
relies on cryptography,
we can use the power of
cryptography, the fact
that things are very difficult to
reverse engineer mathematically
to verify that yes,
these are the things,
these are the things that have happened,
these are the transactions that
have been logged, and everybody
knows about it at the same time.
