Have you ever wondered how it’s possible
to scratch a CD or DVD and still have it play
back whatever it’s storing?
The scratch really does affect the 1’s and
0’s on the disc, so it reads off different
data from what was stored, but unless it’s
really scratched up, the bits it reads are
decoded into precisely the same file that
was encoded onto it, a bit-for-bit copy,
despite all those errors.
There’s a whole pile of mathematical cleverness
that allows us to store data, and just as
importantly to transmit it, in a way that’s
resilient to errors.
Well, actually, it doesn’t take too much
cleverness to come up with a way to do this.
Any file, whether it’s a video, or sound,
or text, code, an image, whatever, is ultimately
stored as a sequence of 1’s and 0’s, and
a simple strategy to correct any bit that
gets flipped would be to store three copies
of each bit.
Then the machine reading the file could compare
these three copies and always take the best
two out of three whenever there’s a discrepancy.
But what that means is using two thirds of
your space for redundancy.
And even then, for all that space given up,
there’s no strong guarantee about what happens
if more than 1 bit gets flipped.
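If you want to see that three-copies strategy spelled out, here's a minimal sketch in Python; the function names are just for illustration.

```python
from collections import Counter

def encode_repetition(bits):
    """Store three copies of each bit."""
    return [b for bit in bits for b in (bit, bit, bit)]

def decode_repetition(received):
    """Take the best two out of three within each triple."""
    return [Counter(received[i:i + 3]).most_common(1)[0][0]
            for i in range(0, len(received), 3)]

message = [1, 0, 1, 1]
sent = encode_repetition(message)          # 12 bits stored for 4 bits of data
sent[4] ^= 1                               # one bit gets flipped in transit
assert decode_repetition(sent) == message  # the majority vote still recovers it
```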
The much more interesting question is how
to make it so that errors can be corrected
while giving up as little space as possible.
For example, using the method you’ll learn
about in this video, you could store your
data in 256-bit blocks, where each block uses nine bits (nine!) to act as a kind of redundancy; the other 247 bits are free to carry whatever meaningful message or data you want.
And it will still be the case that if a bit
gets flipped, just by looking at this block
and nothing more, a machine will be able to
identify that there was an error and precisely
where it was, so that it knows how to correct
it.
And honestly, that feels like magic.
And for this particular scheme, if two bits
get flipped, the machine will at least be
able to detect that there were two errors,
though it won’t know how to fix them.
We’ll talk later about how this scales for
blocks of a different size.
Methods that let you correct errors like this are known, reasonably enough, as “error correction codes.”
For the better part of the last century, this
field has been a really rich source of surprisingly
deep math being incorporated into devices
we all use every day.
The goal here is to give you a very thorough
understanding of one of the earliest examples,
known as a Hamming code.
And by the way, the way I’m thinking about
the structure of this video is less about
explaining it as directly as possible, and
more a matter of prompting you to invent it
yourself, with a little gentle guidance
here and there.
So when you feel like you see where it’s
going at some point, take that moment to pause,
actively predict what the scheme is going
to be before I tell you.
Also, if you want that understanding to get
down to the hardware level, Ben Eater has
made a video in conjunction with this one
showing you how to actually implement Hamming
codes on breadboards, which is extremely satisfying.
Now, you should know, Hamming codes are not
as widely used as more modern codes, like
the Reed-Solomon algorithm.
But there is a certain magic to the contrast
between just how impossible the task feels
at the start, and how utterly reasonable it
is once you learn about Hamming.
The basic principle of error correction is
that in a vast space of all possible messages,
only some subset are going to be considered
valid messages.
As an analogy, think of correctly spelled
words vs incorrectly spelled words.
Whenever a valid message gets altered, the
receiver is responsible for correcting what
they see back to the nearest valid neighbor,
as you might do with a typo.
Coming up with a concrete algorithm that can efficiently categorize messages like this, though, takes a certain cleverness.
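To make that principle concrete, here's a brute-force sketch in Python: it snaps a received message to the nearest valid codeword by counting differing bits, with no attempt at efficiency (`valid_codewords` here is whatever hypothetical set of valid messages the sender and receiver have agreed on).

```python
def hamming_distance(a, b):
    """Number of positions where two equal-length bit sequences differ."""
    return sum(x != y for x, y in zip(a, b))

def correct(received, valid_codewords):
    """Snap a received message to its nearest valid neighbor."""
    return min(valid_codewords, key=lambda word: hamming_distance(received, word))
```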
The story begins in the 1940s, when a young
Richard Hamming was working for Bell Labs
and some of his work involved using a very
big, very expensive punch card computer that
he had only limited access to.
And the programs he kept putting through it
kept failing because every now and then a
bit would get misread.
Frustration being the crucible for invention,
he got so fed up that he invented error correction
codes.
There are many different ways to frame Hamming codes, but as a first pass, we're going to go through them the way that Hamming himself thought about them.
Let’s use an example that’s simple, but
not too simple: A block of 16 bits.
We’ll number the positions of these bits
from 0 up to 15.
The actual data we want to store is only going
to make up 12 of these bits, while 4 of the
positions will be reserved as a kind of redundancy.
The word “redundant” here doesn’t simply
mean copy, after all those four bits don’t
give us enough room to blindly copy the data.
Instead, they'll need to be a much more nuanced
and clever kind of redundancy, not adding
any new information, but adding resilience.
You might expect these four special bits to
come nicely packaged together, maybe at the
end or something like that, but as you’ll
see, having them sit in positions which are
powers of 2 allows for something that's really
elegant by the end.
It also might give you a little hint about
how this scales for larger blocks.
Also, technically it ends up being only 11
bits of data; you’ll find there’s a mild
nuance for what goes on with position 0, but
don’t worry about that for now.
Like any error correction algorithm, this
will involve two players: A sender who is
responsible for setting these four special
bits, and then a receiver who is responsible
for performing some checks and correcting
any errors.
Of course, the words “sender” and “receiver”
really refer to machines or software that's
doing all these checks, and the idea of a
message is meant really broadly, to include
things like storage.
After all, storing data is the same thing as sending a message, just from the past to the future instead of from one place to another.
So that’s the setup, but before we can dive
in, we need to talk about a related idea which
was fresh on Hamming’s mind in the time
of his discovery, a method which lets you
detect any single-bit error, but not correct it, known in the business as a “parity
check”.
For a parity check, we separate out only a
single bit that the sender is responsible
for tuning, and the rest are free to carry
a message.
The only job of this special bit is to make
sure the total number of 1’s in the message
is an even number.
For example, right now that total number of 1’s is 7, which is odd, so the sender needs
to flip that special bit to be a 1, making
the count even.
But if the block had already started off with
an even number of 1’s, then this special
bit would have been kept at a 0.
This is pretty simple, deceptively simple, but it’s an incredibly elegant way to distill the idea of change anywhere in a message down to a single bit of information.
Notice, if any bit of this message gets flipped,
either from 0 to 1 or 1 to 0, it changes the
total count of 1’s from being even to being
odd.
So if you're the receiver, you look at this
message, and you see an odd number of 1’s,
you can know, for sure, that some error occurred,
even though you might have no idea where it
was.
In the jargon, whether a group of bits has
an even or odd number of 1’s is known as
its “parity”.
You could also use numbers and say that the
parity is 0 or 1, which is typically more
helpful once you start doing math with the
idea.
And this special bit that the sender uses
to control the parity is called a “parity
bit”.
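In code, each side of a parity check is about one line; here's a sketch, assuming messages are lists of 0's and 1's.

```python
def add_parity_bit(message_bits):
    """Prepend one bit chosen so the total number of 1's is even."""
    return [sum(message_bits) % 2] + message_bits

def has_even_parity(bits):
    """Receiver-side check: True means no error was detected."""
    return sum(bits) % 2 == 0

sent = add_parity_bit([1, 0, 1, 1, 0, 1, 1])  # five 1's, so the parity bit is 1
assert has_even_parity(sent)
sent[3] ^= 1                       # any single flip...
assert not has_even_parity(sent)   # ...is detected, though not located
```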
And actually, we should be clear, if the receiver
sees an odd parity, it doesn't necessarily
mean there was just 1 error.
There might have been 3 errors, or 5, or any
other odd number, but they can know for sure
that it wasn’t 0.
On the other hand, if there had been 2 errors,
or any even number of errors, that final count
of 1’s would still be even, so the receiver
can’t have full confidence that an even
count necessarily means the message is error-free.
You might complain that a method which gets
messed up by only two bit flips is pretty
weak, and you would be absolutely right!
Keep in mind, though, there is no method for
error detection, or correction, that could
give you 100% confidence that the message
you receive is the one the sender intended.
After all, enough random noise can always
change one valid message into another valid
message just by pure chance.
Instead, the goal is to come up with a scheme
that's robust up to a certain maximum number
of errors or maybe to reduce the probability
of a false positive like this.
Parity checks on their own are weak, but by
distilling the idea of change across a full
message down to a single bit, what they give
us is a powerful building block for more sophisticated
schemes.
For example, as Hamming was searching for
a way to identify where an error happened,
not just that it happened, his key insight
was that if you apply some parity checks,
not to the full message, but to certain carefully
selected subsets, you can ask a more refined
series of questions that pin down the location
of any single-bit error.
The overall feeling is a bit like playing
a game of 20 questions, asking yes or no queries
that chop the space of possibilities in half.
For example, let's say we do a parity check
on just these 8 bits, all the odd-numbered
positions.
Then if an error is detected, it gives the
receiver a little more information about where
specifically the error is; namely that it’s
in an odd position.
If no error is detected among those 8 bits, it either means there's no error at all, or the error sits somewhere among the even positions.
You might think that limiting a parity check
to half the bits makes it less effective,
but when it’s done in conjunction with other
well-chosen parity checks, it counterintuitively
gives us something more powerful.
To actually set up that parity check, remember,
it requires earmarking some special bit that has control over the parity of that full group.
Here let's just choose position number 1.
So for the example shown, the parity of these
8 bits is currently odd, so the sender is
responsible for toggling that parity bit,
and now it's even.
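In code, that first check might look something like this sketch, with an arbitrary 16-bit block standing in for the one on screen.

```python
block = [0, 1, 0, 1, 1, 0, 1, 0, 0, 0, 1, 1, 0, 0, 1, 0]  # some 16-bit block
group_1 = [1, 3, 5, 7, 9, 11, 13, 15]        # the odd-numbered positions
parity = sum(block[i] for i in group_1) % 2  # 1 means the group's parity is odd
block[1] ^= parity                           # sender toggles bit 1 only if needed
```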
This is only one out of four parity checks
we’ll do.
The second check is going to be among the
8 bits on the right half of the grid, at least
as we've drawn it here.
This time, we might use position number 2
as a parity bit.
So these 8 bits already have an even parity,
and the sender can feel good leaving that
bit number 2 unchanged.
Then, on the other end, if the receiver checks
the parity of this group and they find that
it’s odd, they’ll know that the error
is somewhere among these 8 bits on the right.
Otherwise, it means either there's no error,
or the error is somewhere on the left half.
Or I guess there could have been two errors,
but for right now we’re going to assume
that there’s at most one error in the entire
block; things break down completely for more
than that.
Here, before looking at the next two checks,
take a moment to think about what these first
two allow us to do when you consider them
together.
Let's say you detect an error among the odd
columns, and among the right half.
It necessarily means the error is somewhere
in the last column.
If there was no error in the odd columns, but
there was one in the right half, well that
tells you it's in the second to last column.
Likewise, if there is an error in the odd
columns but not the right half, you know that
it's somewhere in the second column.
And then, if neither of those two parity checks
detects anything, it means the only place
an error could be is in that leftmost column,
but it also might simply mean there’s no
error at all.
Which is all a rather belabored way to say
that two parity checks let us pin down the
column.
From here, you can probably guess what follows.
We do basically the same thing, but for the
rows.
There’s going to be a parity check on the
odd rows, using position 4 as a parity bit.
So in this example, that group already has
even parity, so bit 4 would be set to a 0.
And finally there's a parity check on the
bottom two rows, using position 8 as a parity
bit.
In this case, it looks like the sender needs to turn bit 8 on in order to give that group an even parity.
Just as the first two checks let us pin down
the column, these next two let you pin down
the row.
As an example, imagine that during transmission
there’s an error at, say, position 3.
Well, this affects the first parity group,
and it also affects the second parity group,
so the receiver knows that there's an error
somewhere in that right column.
But it doesn’t affect the third group, and
it doesn't affect the fourth group.
And that lets the receiver pinpoint the error
up to the first row, which necessarily means
position 3, so they can fix the error.
You might enjoy taking a moment to convince yourself that the answers to these four questions really will always let you pin down a specific location, no matter where the error turns out to be.
In fact, the astute among you might even notice
a connection between these questions and binary
counting.
And if you do, again let me emphasize, pause,
try for yourself to draw the connection before
I spoil it.
If you’re wondering what happens if a parity
bit itself gets affected…well you can just
try it.
Take a moment to think about how any error
among these four special bits is going to
be tracked down just like any other, with
the same game of 4 questions.
It doesn’t really matter, since at the end of the day what we want is to protect the message bits; the error-correction bits are just riding along. But protecting those bits as well is something that naturally falls out of this scheme as a byproduct.
You might also enjoy anticipating how this
scales.
If we use a block of 256 bits, for example, then in order to pin down a location, you need only 8 yes or no questions to binary search your way down to some specific spot.
And remember, each question requires giving
up only a single bit to set the appropriate
parity check.
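As a quick sanity check on that scaling, here's the back-of-the-envelope arithmetic, counting one extra reserved bit for position 0, the nuance mentioned earlier.

```python
import math

for n in [16, 64, 256, 1024]:      # block sizes that are powers of 2
    checks = int(math.log2(n))     # yes or no questions needed to pin down n spots
    print(f"{n}-bit block: {checks} parity bits, {n - checks - 1} bits of data")
```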
Some of you may already see it, but we’ll talk about the systematic way to find what these questions are in just a minute or two.
Hopefully, this sketch is enough to appreciate
the efficiency of what we’re developing
here.
Everything except for those 8 highlighted parity bits can be whatever you want it to
be, carrying whatever message or data you
want.
The 8 bits are “redundant” in the sense
that they’re completely determined by the
rest of the message, but it’s in a much
smarter way than simply copying the message
as a whole.
And still, for so little given up, you would
be able to identify and fix any single bit
error.
Well...almost.
Okay, so, the one problem here is that if none of the four parity checks detect an error, meaning that the specially selected subsets of eight bits all have even parities just like the sender intended, then it either means there was no error at all, or it narrows us down to position 0.
You see, with four yes or no questions, we
have 16 possible outcomes for our parity checks.
And at first, that feels perfect for pinpointing
1 of the 16 positions in the block.
But you also need to communicate a 17th outcome,
the “no error” condition.
The solution here is actually pretty simple: Just forget about the 0th bit entirely.
So when we do our four parity checks, and
we see that they're all even, it unambiguously
means that there is no error.
What that means, is that rather than working
with a 16-bit block, work with a 15-bit block,
where 11 of the bits are free to carry a message,
and 4 of them are there for redundancy.
And with that, we now have what people in
the business would refer to as a “(15, 11)
Hamming code”.
That said, it is nice to have a block size that’s a clean power of 2, and there’s a clever way that we can keep that 0th bit around and get it to do a little extra work for us.
If we use it as a parity bit across the whole
block, it lets us actually detect, even though
we can't correct, 2-bit errors.
Here’s how it works.
After setting those four special error-correcting
bits, we set that 0th bit so that the parity
of the full block is even, just like a normal
parity check.
Now if there’s a single-bit error, then
the parity of the full block toggles to be
odd, but we would catch that anyway thanks
to our four error-correcting checks.
However, if there are two errors, then the
overall parity is going to toggle back to
being even, but the receiver will still see
that there’s been at least some error because
of what's going on with the four usual parity
checks.
So if they notice an even parity overall,
but something nonzero happening with the other
checks, it tells them there were at least
two errors.
Isn’t that clever?
Even though we can’t correct those two-bit
errors, just by putting that one little bothersome
0th bit to work, it lets us detect them.
This is pretty standard; it’s known as an “extended” Hamming code.
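Here's a sketch of that receiver-side reasoning, reusing the `locate_error` helper from before and assuming, as above, at most two flips.

```python
def diagnose(block):
    """Classify a received extended-Hamming block: 0, 1, or 2 errors."""
    position = locate_error(block)   # nonzero means some subset check failed
    if sum(block) % 2 == 1:
        # Odd overall parity: a single flip, sitting at `position`
        # (position 0 itself, if all four subset checks passed).
        return f"single error at position {position}, correctable"
    if position != 0:
        # Even overall parity, yet a check failed: at least two flips,
        # and `position` no longer points anywhere trustworthy.
        return "two errors detected, not correctable"
    return "no error detected"
```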
Technically speaking, you now have a full
description of what a Hamming code does, at
least for the example of a 16-bit block.
But I think you'll find it more satisfying
to check your understanding and solidify everything
up to this point by doing one full example
from start to finish yourself.
I’ll step through it with you, though, so
you can check yourself.
To set up a message, whether that’s a literal
message that you’re transmitting over space,
or some data that you want to store over time,
the first step is to divide it up into 11-bit
chunks.
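That chunking step is a single slice-and-step loop; here's a sketch, assuming the bit count happens to divide evenly (a real scheme would pad the tail).

```python
def chunk_message(bits, size=11):
    """Split a flat list of bits into size-bit chunks."""
    return [bits[i:i + size] for i in range(0, len(bits), size)]
```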
Each chunk is going to get packaged into an
error-resistant 16-bit block, so let’s take
this one as an example and actually work it
out.
Go ahead, actually do it: pause and try putting together this block.
....okay, you ready?
Remember, position 0 along with all the powers
of 2 are reserved for error-correction duty,
so you start by placing the message bits in
all of the remaining spots, in order.
You need this group to have even parity, which
it already does, so you should have set the
parity bit in position 1 to be a 0.
The next group starts off with odd parity,
so you should have set its parity bit to 1.
The group after that starts with odd parity, so again you should have set its parity bit to 1.
And the final group also has odd parity, meaning
we set that bit in position 8 to be a 1.
And then, as the final step, the full block
now has even parity, meaning that you can
set that bit number 0, the overarching parity
bit, to be 0.
So as this block is sent off, the parity of
the four special subsets, and the block as
a whole, will all be even, or zero.
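If you'd like to check your by-hand work against code, here's a sketch of the sender's whole procedure, reusing the `groups` list from earlier; positions 0, 1, 2, 4, and 8 are the reserved ones.

```python
def encode(message_bits):
    """Package 11 message bits into an error-resistant 16-bit block."""
    block = [0] * 16
    data_positions = [i for i in range(16) if i not in (0, 1, 2, 4, 8)]
    for position, bit in zip(data_positions, message_bits):
        block[position] = bit
    for parity_bit, group in zip([1, 2, 4, 8], groups):
        # Set each parity bit so its group ends up with even parity.
        block[parity_bit] = sum(block[i] for i in group) % 2
    block[0] = sum(block) % 2  # finally, the overarching parity bit
    return block
```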
As the second part of the exercise, let’s
have you play the role of the receiver.
Of course, that would mean you don’t already
know what this message is; maybe some of you
memorized it but, let’s assume that you
haven’t.
What I’m going to do is change either zero,
one or two of the bits in that block, and
then ask you to figure out what it is that
I did.
So, again, pause and try working it out.
...okay, so you as the receiver now check
the first parity group and you can see that
it’s even, so any error that exists would
have to be in an even column.
The next check gives us an odd number, telling
us both that there’s at least one error,
and narrowing us down into this specific column.
The third check is even, chopping down the
possibilities even further.
And the last parity check is odd, telling
us there’s an error somewhere in the bottom,
which by now we can see must be in position
number 10.
What’s more, the parity of the whole block
is odd, giving us confidence that there was
one flip and not two (if it’s three or more,
all bets are off).
After correcting that bit number 10, pulling
out the 11 bits that were not used for correction
gives us the relevant segment of the original
message, which if you rewind and compare,
is indeed exactly what we started the example
with.
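And here's the receiver's side, longhand, again reusing the earlier helpers; the far slicker single-operation version is what part 2 builds up to.

```python
def decode(block):
    """Fix at most one flipped bit, then pull out the 11 message bits.

    A real receiver would first rule out the two-error case, as in
    `diagnose` above; here we assume at most one flip.
    """
    position = locate_error(block)
    if position != 0 or sum(block) % 2 == 1:
        block[position] ^= 1  # correct the error (position 0 included)
    return [block[i] for i in range(16) if i not in (0, 1, 2, 4, 8)]

message = [1, 0, 1, 1, 0, 1, 0, 0, 1, 0, 1]  # any 11 bits you like
received = encode(message)
received[10] ^= 1                   # a flip at position 10, as in the exercise
assert decode(received) == message  # bit-for-bit recovery
```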
And now that you know how to do all this by
hand, I’d like to show you how you can carry
out the core part of all of this logic with
a single line of Python code.
You see, what I haven’t told you yet is
just how elegant this algorithm really is,
how simple it is to get a machine to point
to the position of an error, how to systematically
scale it, and how we can frame all of this
as one single operation rather than multiple
separate checks.
To see what I mean, come join me in part 2.
