[MUSIC PLAYING]
SPEAKER: OK.
So what is dynamic programming?
Normally, of course, I like
to ask you to contribute ideas
that we can discuss.
But in this particular case,
the point that I want to make
is either you already knew
the answer, or you're wrong.
Dynamic programming is designed
to be buzzword compliant.
It's something that was
invented by Richard Bellman,
and he came up with the name.
And it's actually-- and this is
according to his own autobiography--
the problem was that Rand
is a defense contractor,
so he was working at Rand,
working on computer algorithms,
and Rand worked for the Air
Force, and still works for--
it's part of the big
industrial military complex--
which meant that their budget depended
on ultimately the Secretary of Defense.
And if the Secretary of Defense
didn't like what he was doing,
or Congress didn't like what he
was doing, and they got wind of it,
that would be the end of it.
So apparently, according to Bellman,
the Secretary of Defense at the time
had a pathological aversion to research.
The way he described it is that he
would suffuse and become red in the face
if you mentioned the word
research in his presence.
And he really didn't like it if you
said something like mathematics.
And so Bellman wanted to be sure that
whatever he was doing if this showed up
in a Congressional report or in a
report that landed on the Department
of Defense's-- on the
Secretary of Defense's-- desk,
that it wouldn't say
anything objectionable.
So he came up with this term
programming, which in the math world,
means optimization-- find
the best answer to a problem.
And dynamic-- the idea
of dynamic was, this
is an adjective that, yes, it sort of
expresses that things are changing,
but mostly it has no
negative connotations.
Dynamic just sounds good.
So that's where the name comes from.
So dynamic programming
is basically a name
that nobody could possibly
object to, and therefore
decide to cancel the project.
There is a term that is much more
descriptive of what's going on,
which is a lookup table.
So what dynamic programming really
is, when it comes down to it,
is it's a way of looking at
certain kinds of problems.
With linked data
structures, linked lists,
tries, hash tables, all of these sort of
fall into a category of data structure
that you use when you've
got data and you want
to map the connections between them.
And so there's a whole category
of different data structures
that involve using pointers to map
the connections between things.
Dynamic programming is
a category of algorithms
where you say, in my problem,
when I want to solve it,
I end up asking the same question over
and over again as part of solving it.
The same question with the
same parameters, if you will.
And so rather than doing all the
same work over and over again,
let me just remember the answer
after I figure it out once,
and the next time I
ask the same question,
I'll just go look up
what the answer was.
That's the idea of dynamic programming.
And of a lookup table.
You've already used lookup tables.
For example, going way back
to the beginning of the class,
here's a for loop:
for (int i = 0; i < strlen(str); i++).
This is a loop through every character
of this string, str, one by one.
But the problem is every
time you go through the loop,
you recalculate the
length of this string.
And calculating the
length of the string in C
takes some time because you have to
walk down the whole string until you
reach that null character at the end.
So going all the way back again
to the beginning of the class,
remember, we talked about, it would be
better from a performance perspective
not to keep recalculating
the length of the string.
Instead, why not say something
like int length=strlen(str),
and save that value to a variable, and
just look up that value every time we
go through the loop and are checking--
have we reached the last value of i?
There, length you can think of as a
lookup table that stores one value.
There's already an
example that you've seen.
We're going to go to
lookup tables that store
the answer to more than one thing.
And so I'm going to go through today
a few different examples of problems
that you can solve using
dynamic programming.
So these are problems where you
can break them down in a way
that you keep asking the same
question over and over again
and just look up the answer
in a table or an array.
And the first such problem I want
to talk about is rod cutting.
Now the idea of this is I have
a rod of something valuable,
so maybe it's a Tootsie
Roll, or plutonium,
there's a wide range of things that
come as sort of like a long cylinder.
And you want to chop it up
into shorter pieces to sell.
And what you know is that different
length pieces, different length rods,
sell for different amounts.
And the question is, how
should I cut this long thing up
into smaller pieces to maximize
the money that I can sell it for?
So if I know, for example, that a rod
that's one inch long sells for $1,
and a rod that's two inches long
turns out to be a lot more valuable,
I can sell that for $5.
And when it's three inches
long I can sell for $8.
Now there's a lot more demand for
things that are three inches long
than things that are four inches long.
So if it's four inches
long, I can sell it
for $9, which is a little
more but not a lot more.
And then comes the question--
I've got a rod of length 10--
how should I chop it down into 1s and
2s and 3s and 4s to maximize the profit.
And there's different
ways I could do this.
I could chop it into 10 little one
inch rods, and sell each one for $1
and I've made $10.
I could chop it into five two inch
rods, and it would get 5 times 5 is $25.
That's a lot more money.
So being the unabashed capitalist that
I am, that's what I'm likely to do.
And there are some in-between options
that might make me some money.
The question is, how
to quickly figure out
what I should chop this rod
into to maximize my profit
and be sure that that's
the most that I can sell it
for, that there wasn't some other way to
chop this up that would be better.
And one way to think
about the problem is,
if I've got a rod that's 10 inches
long and I can chop it once an inch,
there are 1, 2, 3, 4, 5, 6, 7, 8, 9
possible places I could cut the rod.
And any one of those places I
could either decide to make a cut
or not make a cut.
And that will affect the end result.
So that's kind of 9 bits--
I either cut or I don't.
They're a 1 or a 0.
So I end up in terms of the
possible different combinations
of how I could cut this with
every possible 9-bit number
in binary, which has 2 to the 9th
possible values, which is 512.
So I could try 512 possible
different ways to cut this rod.
And I can figure out
which one does the best.
And then I could try having a rod
that was 20 inches long instead,
and discover there's 19 possible
places to decide I'm going to cut it
or I'm not, and there's 2
to the 19th possibilities
now, which is about 500,000.
2 to the 10th is about 1,000, 2 to the
20th is about a million, 2 to the 19th
is about a million divided
by 2, it's about 500,000.
So in general, if we said
this rod is n inches long,
so that there are n minus 1
places where I could make a decision
to cut it, there are approximately 2 to
the n possible different ways
of making combinations of cuts.
And that's a lot of things to try.
At the same time, you can sort of
see that splitting it into rods
of 1, 1, 2, 2, and 4 inches long--
this is not the only combination
of cuts that would give me that.
There are other combinations that would
also give me two 1s, two 2s, and a 4.
I could do a 4 first and then
two 2s and then two 1s.
I'd get the same value out of it.
So a lot of the possible
ways that we could cut--
they would give the same
result in terms of the value.
And so maybe I want to
reorganize the problem
where I don't have to try everything,
because I realize a lot of times
I'm just recalculating the same thing.
And one way to do that--
and this is where this concept of
dynamic programming or a lookup table
comes in--
is to say, let me decide first where
to cut where the left most cut will be.
So instead of deciding, am I
going to cut it 1 inch or not,
am I going to cut it 2 inches or not,
am I going to cut it 3 inches or not,
I'm going to decide what's the
first place that I'm going to cut?
That could be at 1 inch or 2
inches or 3 inches or 4 inches
or 5 inches or 6 or 7
or 8 or 9 inches, or I
could decide I'm not
going to cut at all,
I'm just going to leave the rod whole.
So with this rod of length 10, there's 10
possible places to make the first cut.
And the thing that I'm going to decide
is that that splits the rod into two--
the part on the left I'm
not going to cut again.
I'm not going to cut this 2 part--
or the 3 part--
I'm making that decision.
This is the first piece
that I'm going to sell.
So the total value
that I can get will be
whatever I can sell the piece on the
left for like $1 or $5 or $8 or $9,
plus whatever I can get for the right
part if I chop it into smaller pieces
the best possible way.
So the part on the
left I won't cut again.
I can just look up what is the value
of a rod of this length in the table.
But I also need to figure out
for the part on the right, what
is the best way to cut this up.
And even that, if you draw
out all the possibilities,
you're going to end up
with the same 2 to the n,
except you can start
seeing some commonalities.
So for example, if you decided
to cut a rod of length 2,
you've got eight leftover, and you
want to find the best way to cut it.
Now, if you only chopped off one
inch first, that was your first cut,
and then you did another cut where
you chopped off another inch,
suddenly you end up with what is the
best way to cut a rod of length eight.
So in the possible ways of
chopping this thing of length 9,
one possibility is, again,
we take one inch off to sell,
and we have eight inches left.
If right from the start
we've taken two inches off,
we would also have eight inches left.
And so we're asking the
same question both times.
We've said, I've got some stuff I've
already cut that I'm going to sell,
and now I've got an eight inch rod left.
How do I cut that?
There's no point
figuring that out twice.
We could just keep a little
array in which we store, what's
the best way to cut a rod of length 8?
What's the best way to cut a rod
of length 9, of 7, 6, of 5,
of 4, and for each one of those,
we just look in the array and say,
have we put a value in there yet?
If so, we just use it.
We don't recompute it.
And the result is, if
you start looking at it,
we've got 10 choices for our first
cut, and then on average we've
got something like up to nine
choices for our second cut,
and up to eight choices
for our third cut.
And so for each of the possible
10 choices the first time,
we've got nine choices to make for
our second cut, we're up to 90.
But continuing like this would get exponential.
The problem is a lot of these
cuts end up being the same.
And so you end up saying,
I only have to deal
with sort of 10 choices of what to make
as the first cut for a rod of length
10, and nine choices to make for
the first cut of a rod of length 9,
and eight choices for the first
cut to make on a rod of length 8,
because after that I'm getting
down to something shorter.
I'll figure that out eventually.
And you end up with
something like n squared.
So for rod of length 10, I'd
have roughly 10 times 10 choices
that I actually have to do work for.
And so you get down to n
squared as opposed to 2 to the n.
Another way to do this is to say,
let me start with a rod of length 1.
I have no choices for how to
cut it, so I know the answer
for how I'm going to sell that.
For a rod of length 2, I can either
leave it whole, or make a cut, sell
a rod of length 1, and then have
a rod of length 1 leftover--
I can look up what that costs.
And I'll store then which of those was
the best value for rod of length 2.
I'll say, for rod of length 2, I
can get $5, and the way to do that
is to not cut.
And now for a rod of length three, well,
I could decide to cut it after 1 inch,
and I've got 2 inches leftover.
So we'll look and say, well, the best
thing to do with a rod of 2 inches
is get $5 for it, so great.
For something with 3 inches, I can
get $6 by making a cut after one inch.
If I make the cut after 2 inches, I
get $5 for selling a rod of length 2,
and then I look up what do
I do with a rod of length 1,
and the answer is you just sell
it for $1, that gets me $6.
I look up, and what happens if
I just sell a rod of length 3--
and that's $8.
So I'll record-- great, for rods
of length 3, don't chop it up,
just sell it.
For a rod of length 4, now
I can chop it after 1 inch.
I look up, ah, for a rod
of length 1, I sell it
for $1, plus what do I do
with a rod of length 3?
I just sell it.
I'm done.
I've found that that gets me $9.
What if I cut at 2 inches?
Well I've got a two
inch bar I sell for $5,
and then I look up what I do with
the right half, which is 2 inches long.
I won't bother cutting it.
My table says just keep it.
And after I've tested all of those,
I know that cutting 4 into 2 and 2
is going to be the best thing to do.
So what's going to happen
is, as you build this up,
eventually you get to where you're
dealing with your rod of length 10.
For every cut that
you make, you can just
look up the left half
you know you're going
to sell as a single thing, the right
half you can look up in a table
and say, what's the maximum
value I can get for this?
And the table will say, you
can get this many dollars,
and here's where you make the first cut.
And then you've got a
shorter piece left over,
and you can look up where
to make the next cut.
Questions?
So I'm going to move on
to some other examples.
And the next one I'm going
go through fairly quickly.
And the one after that, I'm
going to go into more detail
even than I did the rod cutting.
And we'll draw out that table,
filling in some of the values.
If you think back to
networking last week, we've
got all sorts of computers that are
connected together-- for instance,
my computer, at least
when I'm in my office,
is probably hooked up through
a department-wide server,
or maybe just through
Yale's University server,
and Natalie also has a
computer that's connected,
and because our offices are close,
they may actually have a connection
directly to each other.
Now, David and Doug are up in the CS50
office at Harvard, and their computers
may also be directly
connected to each other.
So they can send messages
back and forth directly,
and they're also connected
to Harvard's server.
But if I want to send a
message to David, I will need--
and that might be an HTTP
request, for example,
if he's running a web
server on his computer--
I'm going to need to find a way
to hop from computer to computer.
I can't talk directly
to David's computer,
because there's no wire
running between them.
So I need to find some computer where
I say, I've got a message for David,
do you know how to deliver that?
And it will say, yeah,
I can do that for you.
And then it will pass it
on to some other computer,
saying this is for David,
and it keeps getting
passed from server to server, until
it gets to one that is actually
connected to David's computer.
And the problem of how to find who to
pass that message to is called routing.
And it's one of the
problems in networking.
And it's particularly
difficult because networking,
you have to deal with other people.
When you're just writing a program
of your own on your laptop,
if something goes wrong,
it was probably your fault,
and you can probably fix it.
If you're talking about
a network, it could
be that somebody turned off
the lights somewhere else,
or unplugged their
computer without warning.
That's completely out of your control.
But you still have to deal with it.
And so how do you get from my
computer to David's computer
is something that you
have to keep track of
and may keep changing, and to
get there, we need to find--
I need to be able to find not just
how I could get to David's computer,
but essentially who to pass that
message to to get it there the fastest.
I may have more than
one way of doing it.
And one way to do it
is going pretty fast,
and the other way is going to
go through China and Sierra
Leone and a congested undersea cable
to Africa which has very poor internet
links, and it will take a long time
for the message to get to David.
Unless, of course, somebody
from MIT went and found
the cable that connects Harvard
to Yale and snipped it just
for fun, in which case I might be stuck
taking the longer route for a while
because something broke
outside of my control.
And the question I want
to talk about today
is, how do you figure out
who to pass that message to?
And this is another place where
dynamic programming turns out
to be really helpful.
So the idea is, it's
not just that I need
to know how to get to
David's computer, I
need to know how to get to
everywhere on the internet.
And everywhere else on
the internet also needs
to know how to get to
everywhere on the internet.
Everybody is trying to solve this
problem of how to find the quickest
way to get to any other computer.
And to start out, I'm just plugged in,
I don't even know who I'm connected to,
let alone who else is out there.
So the process is, rather than me
trying to go snooping and exploring
the entire internet--
and by the time I'm done, somebody
will have unplugged their computer
and the process will need to change--
I want to try and make
sure that we are all
doing this work together and sharing
as much information as possible.
So what I will do is I will
just broadcast a message to all
of my neighbors, saying
haha, I'm here, I'm Benedict,
and now Natalie will know,
and Yale server will know,
aha, Benedict's computer is
connected directly to us.
So if we need to get a message to
Benedict, we can just give it to him.
Now at the same time, everybody
else is going to do that.
So I'll get a message
from Natalie, saying haha,
here I am, and I'll get one from
Yale saying haha, here I am.
And now, I know that I'm connected
directly to Natalie and Yale.
This is what we call being one hop away.
So if I need to send a message,
it needs to make one hop,
just like an airplane
makes one hop from city
to city, to get to the next computer.
And as far as I know, that's
all there is in the world.
But everybody's just done this, so now
everybody knows all of the computers
that they're adjacent to.
The next thing is, we can all send
that information around, and say, here
I am, and oh, by the way, here's
everybody that I'm connected to.
So if you give me a message
for any of those people,
I'm saying, anybody who passes a message
to me that needs to go to Natalie--
I can get it to her in one hop.
So then you can figure out that it
takes you one more hop than that
if you pass the message to me.
So after we share that
we can update our lists.
And what will happen is Yale
server has learnt from me, hahaha,
I can get a message to Natalie for
you, just give it to me and in one hop,
I'll get it to Natalie.
And Yale server says, yeah, but I
can just give it to Natalie myself,
so why bother doing that?
So Yale server will keep in its list, I
just found a way to get Natalie in two
hops, I don't care.
I can get there in one hop by just
giving it straight to Natalie.
However, I will have learned that Yale
can reach Qwest's server in one hop.
And so if I need to pass a message to
Qwest, I can give the message to Yale
and ask it to give it
to Qwest and that's
the fastest way I know to get there.
And so each step everybody
sends a pretty short message
to all their neighbors.
It takes a fixed amount of time.
And we're accumulating this
information and always keeping
track of, what was the shortest--
what is the fewest number of hops
it takes us to get to a server,
and who do we give the
message to to get there?
And as long as I give
the message to Yale,
Yale can worry about how it gets
to Qwest, it's just promised me it
takes one hop.
If I do this again, now it turns out
I can get to Google in three hops.
I can get to Harvard in three hops.
But all I care about is, I can get
there in three hops and the way I do it
is I give the message to Yale's server.
Yale's server doesn't even know
directly how to talk to Harvard's server
or to Google's server.
It just knows to pass the
message to Qwest's server.
Qwest, by the way, owns a
lot of the big fibers that
connect big data centers and things.
So when I traced what
happens if I try and connect
to Google's server from my office,
it goes up through a department--
CS department-- server, then a
Yale server, then it goes to Qwest,
and then it goes to Google.
There are some other
ones, like Internet2,
which connects a lot of
universities, and America Online
used to own a lot of this, I don't
know if they ever sold it off.
There's a few companies that
own most of those cables.
And eventually, after four rounds, I
find out that I can even reach David
and Doug in four hops by
giving the message to Yale.
And Yale knows that
it needs to just give
that message to Qwest, which
gives that message to Harvard,
and then Harvard knows how
to pass the message on.
And this is in fact how the
major routers on the internet
share this information.
Now, your computer or my
computer, what they typically know
is any message I need to send, I just
send it to the server I'm connected to,
to the wireless access point.
And the wireless access point basically
says, I send it to Yale's server.
Or maybe there's a very small--
it knows the following laptops are
connected directly to me and anything
else I just send to Yale's server.
But when you get up to the level of
all of Yale University, for example,
it may well start to participate in
this process of sharing this information
to know, I could talk to Qwest,
I could talk to Internet2,
I could talk to Comcast, there's
a bunch of people I could talk to.
Which cable should I
send this message out
on to make sure it
gets where it's going?
And the dynamic programming
aspect of this is, at each step
I'm reusing all the information
that was computed before
to just add another layer on top.
How do I get to places in three hops?
I'm not recomputing the route
to get from Yale to David,
let's say, when I figure out that
I can get there in four hops.
I'm just using the information that I
know that the Yale server can do it.
And so that reduces the amount
of actual work that I have to do.
Any questions?
Now, the third category of algorithm
that I want to talk about--
and I'm going to talk about this
one in a little more detail,
and I'm going to talk about a few
different applications of exactly
the same algorithm, so not
just of dynamic programming
but of this specific algorithm, of
different problems that it can solve.
And the one I'm going to start
with is DNA sequence alignment.
You also do this for
aligning protein sequences.
If you have a strand
of DNA, it's made up
of a chain of what are called
bases, which are chemicals.
They're molecules.
And there's four of them
that are used in DNA.
There's adenine, and thymine
and guanine and cytosine.
And we'll typically just
abbreviate each one with a letter.
So if you were to
unroll a strand of DNA,
and just write down the list of
these bases of these molecules
that are connected in a
long chain, you essentially
get a string that's
made up of four letters.
So it's a string of some combination
of the characters A, C, G, and T.
Now, it might be that I know by
heart the entire genome of a mouse
and I know what every single gene does.
Or at least I know what some of them do.
Now, over time we diverged from mice and
various mutations happened to our DNA,
because whenever it gets copied,
we sometimes make mistakes,
and somehow we ended up instead
of being a proto mouse we ended up
being a modern human.
And that same proto mouse
ended up being a modern mouse,
through different mutations.
So if I start sequencing
my DNA, I might find
a strip of it that corresponds
to a particular gene,
and I don't know what it does,
but I'd like to find out.
One way to do that would be
to try and match it-- to find
the best possible place for this
to match. Which gene of mouse DNA
does this match the best?
Keeping in mind that there might be some
gaps, a gene might have disappeared,
a gene might have reappeared, or gotten
added, and sometimes one of these bases
might have gotten replaced
by a different one.
So we might have a mismatch.
So there's different
things that could happen
in the process-- it won't be a perfect
identical gene in the mouse to what's
in the human, it will
just be mostly the same.
And I could use that to
say, what's the best match,
and then use that to predict
what this gene codes for,
or what the function of the
protein it codes for is.
You can do the same thing with proteins,
they're made up of amino acids.
You can look at that chain and look
for similar things between one protein
and a database of proteins
where you know what they do
to see if you can predict the function.
But the problem is there's a lot
of different ways of doing it.
So let me write these on the
board: we've got AACAGTTACC.
Now, let me try and come up with
the worst possible match that I can.
TAAGGTCA.
So there's different possible ways
that I could line these strings up.
They're not the same length, so I'm
going to need some gaps somewhere.
There's different places I
could try and put them and see
how the characters line up.
And the rule is going to be--
and this is based, in
fact, on how common
it is for a mutation
to occur where one base
switches to another one, versus a
base is actually deleted or added
in genetic mutations.
It turns out that it's roughly
twice as likely that a base just
changes to another base than that a base
completely disappears or appears out
of nowhere during mutation.
So we will say that when
we line up the strings,
any time we have a mismatch
between two characters--
so two bases that are not the same--
we're going to have a penalty of 1--
a cost of 1.
That's what we pay to force the
strings to match up that way.
And any time we've got a gap, where,
say, the C from the first string or DNA
strand doesn't match
anything in the second one,
so that corresponds to a base
getting inserted or deleted,
we're going to assign that
a penalty or a cost of 2.
So we'd rather have two mutations--
two things where a base turned
into something else--
that seems about as likely as having
something just get added or deleted.
And then I can add up
all those costs, and I
get the cost of that particular
way of matching the strings.
And I'm looking for the way to do this
with the lowest cost possible match.
This is also called edit
distance, because it's also
measuring how many edits I
would have to make to a text,
how many changes would I have to make
to the text to turn it from one thing
into another.
This doesn't just have
to be these four letters.
This could be anything.
But the thing that I've written on
the board that I want you to see
is the worst possible way I could
match these would be to say,
what if I just deleted the
entire gene from the mouse
and then did a bunch of insertions to
produce the entire gene from the human?
So each one of these characters,
we say, matches to a Gap.
That is one possibility.
It's not a very good one.
That would be 2 plus 2 plus 2 plus
2 plus 2 plus 2 blah blah blah blah
blah blah blah.
That would cost a lot.
That's a very unlikely way that we
would have gotten from this to this,
and we're looking for
the most likely way.
But what it does say is that there's
not really any point inserting yet more
gaps in the middle, because
those things line up perfectly,
they won't change the matching score.
That's kind of silly.
Once this thing is completely
gone, I might as well
start adding the new one in.
So the longest possible sequence where
we try and line the two of these up
is the length of one plus the length
of the other, which here is 18.
And now if we look at
each of these columns.
We sort of had a choice.
The same way with a rod, we had a
choice-- do we cut or do we not cut?
Here we have a choice
of do we do a deletion--
so we had a letter on the
first string, and we're
going to match it to a gap on the second
string, as if this letter got deleted.
Or we could do an insertion, in which
case it's kind of like this situation
where we had nothing
and we added a letter.
We could do that at the
beginning of the string.
Or we could actually try
and match two letters.
So at each point here, we've
got three choices we can make--
an insertion, a deletion, or a match.
And if we could make that
choice independently,
between each of the letters we'd have
three to the 17th possible combinations
of things to try.
That's a lot.
How much is 3 to the 17th?
Nobody knows this off
the top of your head?
OK, the technical term is a lot.
That's a number.
It's defined as 3 to the 17th.
So again, we don't want to keep--
we don't want to try everything.
And we really want to do this.
Computational biologists
really want to figure out, look
through a database of
genes or of proteins
and try and match something to
figure out what it might do.
Which parts of the genome
that we just sequenced
or the DNA strand that we just
sequenced are worth zeroing in on
and doing more studies on, because they
might code for something important?
It turns out that we like
to do this also on occasion,
or at least we do do it, when we want
to look at two people's submissions
and see how similar they are.
So fundamentally,
cheat checkers do this.
They want to quickly compute
how many changes would I
have to make to one student's program to
turn it into another student's program?
And once we've computed all of that, we
want to look and find the ones where--
for most people maybe
it took 1,000 edits,
but here we've got a pair
where it only took two.
And that's kind of suspicious.
So I'm going to sort of
rephrase the problem.
And it's going to be similar to rod
cutting, where I'm going to say,
let's pretend that I already know
how to match the right portions
of the two strings.
So if I already know the best
way to match from GTTACC to GTCA,
then to figure out the best
way to match the whole thing
is I just have to figure out
how to do the first part.
And this is kind of like
saying, let me decide
where I'm going to put the
first letter of the first word,
where I want to put the first gap,
let's say, or the first insertion,
and then figure out the rest of it.
But I'm phrasing it going
from the end to the beginning.
And I'm phrasing it going
from the end to the beginning
kind of like I've shown here
just because when we actually
compute the table, that's
going to end up putting
the answer in a more logical place.
So we can kind of say, at the
last position in this match,
are we going to have an insertion,
are we going to have a deletion,
or are we going to try
and match two characters?
And that gives us three possibilities.
And each one has a cost.
And so whatever the best way to
line up these two strings is,
the last thing that happens is going
to be one of these three things--
those are the only three possible
ways for the last part of the match
to happen--
has to be an insertion
or a deletion or a match
because they're the only options.
So if we know the
costs of those, then we
can say the cost of matching the entire
string is the cost of doing whichever
choice we make here plus the best edit
distance cost, or the best alignment,
of whatever's left.
Which is slightly different
in each of these three cases.
Except that if, for instance, we do
a deletion, and then for this choice
where we've got TAAGGTCA at the
bottom, the next time around we
tried to do an insertion, and
then the best of whatever's left,
we would have AACAGTTACC matching
to TAAGGTC, go figure that out.
And then we would have
insert A then delete C.
Sorry, we would end
up with these two strings.
But instead of having
to try to match C to A,
we would have modeled one
insertion and one deletion.
We'd have the same thing left here.
So as we build these
things up, we're going
to try avoiding finding the best match
for the same pair of strings twice.
And, realistically what's going to
happen is we can start with, well,
what if both strings were empty?
Then they match perfectly.
No problem.
What happens if, at
the end, I would like
to have the string C or the
character C, be deleted.
So I would like to do
some sort of match here
where the very last thing
that happens is C is deleted,
and the rest of that TAA blah blah
blah blah is somewhere over here.
But I'll worry about where later.
So in that case, we have a cost
of 2, and when we were done,
we would have used up this
character C, because we deleted it.
And so what we're left with after that
is what is in the corresponding cell
to the right in the table, which is the
cost of matching nothing to nothing.
Similarly, if we said the last thing
that we want to have in the table
is CC matching to
nothing, the cost of that
is 2 for the deletion of one C,
plus the cost of matching
the remaining C to nothing, which was
this cell in the table.
So we have 2 plus the 2 that's
right here, which is 4.
So in general, what's going to
happen is if I look at the cell
where the little hand is right now--
that just disappeared-- right
here, this is the cost--
we'll ultimately record the
cost of matching GTTACC to GGTCA.
And it will say, in order to get
the best match between these,
so the smallest edit
distance, if you will,
here's what the total cost
will be, and it will tell you
should you try and match those two G's
to each other, or should you do it as--
should you have a
deletion or an insertion?
That's what's going to be stored
in this cell of the table.
And ultimately that means
in entry 0, 0 of your array
at the beginning of the
table, will tell you the cost
of matching AACAGTTACC to TAAGGTCA.
It will tell you what
the total cost is, and it
will tell you whether you
should start with the insertion
or deletion or a match.
And depending on which one, that's
going to use up characters from one
or both of the strings, which will move
you to a new cell in the table that
will tell you what to do next.
So once it's primed, I can say,
well, at the end, if I just
look at the end of each string, I have
to match the letter C to the letter A.
And what is the best way to do that?
Well, I've got three options.
One option is I could delete the
C. Oh, there's an error here,
this should read i plus 1.
So I could say, let me
go over to cost i plus 1.
So I've used up the C.
But I'm going to stay in--
sorry, i plus 1 is this row, but
I'm going to stay in column j--
so I'm going to come down to here,
which says I've used up the A,
so this A right here got matched to
a gap, which is like an insertion.
It says somewhere at
the end of the table,
I'm going to end with that
A from the second string.
And, oh, I still have
the C floating around.
So let me switch back
to some white chalk.
But leftover, we still have a C here.
Because we were trying to
match A to C, and so far we
haven't dealt with that
C. So we would have
the cost of matching
A to a gap, which is
2, plus whatever the cost of
matching C to nothing is, which is 2.
Or we could say what if,
instead of doing that, we did C
and we'll delete C as
opposed to inserting A,
but it means we still have an A floating
around here that we haven't used yet.
That was an alternative way of matching
C and A. So then we have a cost of 2,
because we're doing a deletion.
Plus we have left over
the cost from the cell
to the right in the table of
matching the string A to nothing,
so that would be a cost of 4.
Or we could try and match C and
A. And when you match C and A,
well the cost of this is zero if
the letters are the same and one
if they're different.
So in this case it's one, because
C and A are not the same letter.
And we're left with two empty strings,
because we've used up the C and the A.
So we end up with 1 plus the
cost of whatever's over one
and down one in the table, which is 0.
So that total cost is 1.
And of those three possibilities,
we just take the smallest one.
We say, ah, the smallest one was if I
went diagonal-- if I matched A to C.
So that's what I'll record in the table.
Total cost of matching A to C
is one, and the best thing to do
is to actually try and
match these two characters.
It would be a mismatch.
And then look up in the table
what to do with whatever is left.
And now I could say, well, what
about matching the string CC to A?
I can do the same question.
I could say, I could start by matching
the letter C and the letter A,
and now I've got a C leftover.
I could delete the first C and now
I've got an A and a C leftover.
That would be going this way.
Sorry, if I match both.
I might have a C leftover.
If I delete-- if I just first
start with a C and do a deletion,
I end up with a C and A leftover.
If the first thing I do is insert
the A, I've got a CC leftover.
So each one of those
I've got a penalty of 2
to go right, 2 to go down, one to go
diagonal, that I add to whatever's
in that table.
I discover that if I try and match
A to C I pay a penalty of one,
and I've got a C leftover
that I have to delete,
so there's a total cost of three.
And that was as good
an option as I had, so
that's what I filled in to the table.
And then I can try and figure
out for the string ACC to A,
and just work backwards through
this table and up the rows
until it's all filled in.
And the nice thing about this is
every time when I look at the table,
I need three pieces of information.
I need the piece of information
in the cell just to my right,
the piece of information in the
cell immediately below me,
and the piece of information in the cell
that's one down and one to the right.
And because I'm working
backwards to the table,
it's nice and easy because
the data's already there.
I started by filling in the bottom
row and the last column, which
are sort of special
cases, and now I just
have some loops that run
backwards through this.
And if one string is n letters
long, that's n as in number,
and the other string is m,
which is m as in the mnemonic,
then the total is m times n.
And I see Doug smiling at mnemonic.
I have to attribute that particular joke
to Professor Mitzenmacher from Harvard,
as reported to me by my roommate,
because I wasn't actually
in the class where he made the joke.
But I liked it.
So you can fill up the table.
And now, instead of having 3 to
the n, essentially, things that you
had to try, you only have n squared.
So 18 letters.
18 squared is a little
less than 20 squared.
And 20 squared is 400.
So we've got at the most about 400.
But in this case it's not even
(n plus m) squared,
it's really just m times n.
It's really like 10 squared.
It's like 100.
So we've got roughly 100 different
steps to find the answer.
If we tried everything,
we'd be doing the same work
over and over and over again,
and we'd do more like 3
to the 20th, which again,
is a really big number.
2 to the 20th is a million.
And 3 is like 2 to the 1.2 or something,
let's say, so it's roughly a million
to the 1.2.
It's about 130 million, there you go.
I knew we had an idiot
savant in the room somewhere.
And again, each one of
these cells in the table--
it tells you essentially what to do.
Do you match this first pair of letters?
Do you do a deletion?
Or do you do an insertion?
And from that you can figure out how
much of each of the strings is left.
And then you look up in the next
cell of the table that corresponds
to that what the next thing to do is.
And so to figure out exactly
how these things match up
as well as just what is the
cost of matching them up
in the best possible way, you
work your way through the table.
Rather than trying to store all
that information in each cell
and store the same information
over and over again,
you store only one piece
of the answer in each cell.
Questions?
Yes.
AUDIENCE: In the runtime, it says
there's O(mn). Is the m a constant,
or is it another--
what is that [INAUDIBLE]?
SPEAKER: Right, so in this
running time, where it says O(mn),
what does that mean?
What we're saying is we've
got two different strings.
One of them is n characters long,
the other one is m characters long.
So instead of just saying
we've got two words that
are each n letters long,
and then it's sort of n squared,
we're saying maybe they're
very different lengths,
and so we can be a little more precise,
as a function of the input,
about how many steps it takes.
But the key thing there is you
see one thing times another,
so that's kind of like
something squared.
As you think about, as these
strings get longer and longer,
how much work is this?
So this is not too bad.
And it's really actually pretty
feasible to do this for DNA sequences
that might be a few
hundred base pairs long.
It's feasible to do
this for a source file
from homework for cheat checking--
is maybe a few thousand characters.
No big deal, even though
we've got 1,000 students,
and we've got all the past submissions,
so we've got 10,000 submissions,
so we've got roughly a
million pairs of files.
Something like that, and that's
not a big deal for a computer.
And each one of those takes
somewhere like a million steps,
because you've got 1,000
characters times 1,000 characters.
We can manage that.
Other questions?
Yes.
AUDIENCE: Do you have to do brute
force to fill out the table initially?
SPEAKER: So do you have to do brute
force to fill up the table initially?
That depends on how
you define brute force.
You do have to fill up all
the cells of the table.
Because in order to find
the answer at the top left,
even though you're ultimately only
going to use the value from one
of these three, you need
to look at all three
values-- the one to the
right and the two below.
So in order to find
the final answer, you
have to have filled up the whole table.
And there's m by n entries in the table,
which is where you get to this m times
n or n squared-ish kind of thing.
But compared to what you might
think of as brute force of try
every possible alignment
of the two strings
which involves lots
of subalignments, lots
of parts that align the same
way, in lots of different places,
we're not redoing those over and over.
So we're not doing this
really brutal brute force that
would say there's something
like 3 to the 2000 steps
to compare a single pair of files, and
then we have to do a million of those.
That would be a problem.
Other questions?
So I've sort of folded
two applications into one
right there with edit distance.
One is this computational biology where
you treat a DNA strand or a protein
sequence as a string.
And the other was some sort
of fancy file comparison
you might use for search and replace.
You might use this for
spelling checking to figure out
the best word in the dictionary.
Not just is the word
there, but of the words
that are in the dictionary,
which one is most likely what
I meant to type based on some model of
how frequent different kinds of errors
are.
And you could compute the edit distance
to every word in the dictionary
to propose a list of suggestions.
But there's another application
that I want to talk about,
which is image stitching.
This is what happens when
you take out your cell phone
and you build a panoramic image.
So you sweep the camera around
and it's going to take a video
or it might just end up taking a
series of pictures every second or two.
And then it stitches them
together into one big image.
And there's a few parts to that process.
One part is each of the pictures
that got taken, figuring out
how your camera moved between each one.
So it can sort of transform
the images as if--
to where they overlap in
just the right places.
But even then, they won't line up
perfectly, because maybe somebody's
walking around in the
scene, or it turns out
that if you're actually walking
along with your camera like this,
as opposed to just rotating
around, you get a parallax effect
where things that are closer to
you appear to be moving faster
than things that are further away.
So right now I see part
of the middle chair.
The right hand side of it to
me is hidden behind the tripod.
But if I walk over here, that's come
into view and the left side of the seat
is hidden behind the tripod.
So within that panorama, maybe there's
no perfect way to line up the images,
because there's different information
visible even in the overlapping areas.
And to resolve that problem,
the key is once you figure out
how the two images overlap,
you want to figure out and take
part of that overlapping area
from one image and part of it
from the other image.
So there's a seam between the two.
And a seam is just a
connected line of pixels.
And everything on one side comes from
image A, everything on the other side
comes from image B. And the question
is where should that seam go?
Now, that seam should go, ideally,
somewhere where we won't notice it.
So let's switch from one
image to the other at a place
where they are really similar.
And somewhere where I'm
seeing the parallax
from the tripod moving,
or somebody was walking
along in the scene,
I try and make this seam where I
connect the two images go around that.
So it's not picking up sort of a jump
where the two images were actually
different.
Well, the simplest way
to do that is to look
at this overlapping rectangle between
the pair of images and say, let's draw this seam
where pixels are really similar.
And I could do that by looking,
for instance, at the difference
in the pixel color between the two.
In this case, it's nice and easy because
the images are black and white right
here, so I just take the
difference in intensity,
which is a number between
0 and 255, and I'll
get a difference that's between
negative 255 and 255.
So I might subtract white from black.
I take the absolute value of that.
If that's 0, it means that pixel
was the same color in both images.
If it's really big, it means the
colors were really different.
And now, I could say
the cost of a seam going
through a particular pixel is that
difference in values between the image
pixels--
the difference in intensity.
And so for the total seam, I
want to figure out from there,
should I go right?
Should I go down?
Or should I go down and to the right?
Which I can do by
looking up in the table
the cost of a seam passing through the
pixel to the right, the pixel down
and to the right, or
the pixel below.
Here's the table.
It's just that now, instead of a cell
in the table saying the cost of
matching ACCC, for example, to TCA,
I'm going to interpret that cell as
the cost of having the image seam run
from here to the lower right corner.
And so in that sense, if you're
looking at image stitching,
you can see that matrix.
It's this rectangle of overlapping
pixels between the two images.
And the algorithm to find the optimal
seam is identical to edit distance.
Absolutely identical.
The only difference-- even though
I said it was absolutely identical,
there is of course a difference,
otherwise it would be no fun--
but the difference and the reason
I say the algorithm is identical
is this cost function to decide what
is the total cost of this cell based
on the cost of the thing to the
right, down and to the right
and down below, is different.
That cost function for edit
distance was some number
depending on whether they
matched or mismatched
or it was an insertion or deletion
plus the cost of the thing to the right
or down or diagonal.
So there was a different
penalty that you
paid for moving right in this
table, or down, or diagonal.
With this image stitch right now,
the way that I've defined it,
the cost is the same.
The penalty that you pay for having
a seam go through a particular pixel
is the same even if you then go on
to the right or diagonal or down.
It's just the difference in intensity.
So you would take the difference in
intensity plus the minimum
of what's to the right, what's down
below, or what's diagonal to the right.
Slightly different cost function,
but really the same algorithm.
And there's ways to implement
this where you can essentially
say to the algorithm, OK,
go fill in this table.
Use this cost function, so that you
could use exactly the same code.
You implement this once and
you can do edit distance
and you can also do image stitching.
And this is one of the
respects in which algorithms
and the abstraction of
computer science becomes
really interesting is when you start
to see two problems that seem really
unrelated and turn out to be the same.
And it's even more
fun when you can write
one program that does two
completely different things for you
and don't even have to re-implement it.
So any questions?
I'm going to end, then, with a
demo of one more algorithm that
uses dynamic programming--
and you can find various demos of this
on the web-- this is just one of them--
called seam carving.
This is very similar to
image stitching, where
we found that seam going from the upper
left corner to the lower right corner.
The idea here is that we want to
resize the image on the right.
And you did this in p set 3.
You resized images--
or p set 4, excuse me.
You resized images.
And if you made them narrower,
they got kind of squashed,
and if you made them wider,
they got kind of stretched,
and everything looked wrong.
Wouldn't it be nice if
we could resize the image
but have it still look normal?
So the idea of the
seam carving algorithm
is essentially to resize the image.
Instead of scaling every pixel to be
a little narrower than it was before,
let's just take a column
of pixels and delete it.
And so we might take this column
of pixels right here, delete it,
and maybe nobody would notice.
The problem is you can only do
this so many times in an image
before somebody notices.
Because you get some sort of jump,
where you've deleted useful information.
And one of the reasons for that is that,
for example, in this column right here,
in the water you could get
away with deleting a pixel.
In the sky you could get
away with deleting a pixel.
But in the balloon, you
better not delete a pixel,
because that would be really obvious.
The balloon will start to
look really funny really fast.
So instead of finding a straight
line of pixels to delete,
let's find a wiggly line.
We're still deleting one
pixel from every row,
so every row will end up being one
pixel smaller than it was before,
which means the image will still be
rectangular, but we have a choice.
If we delete a particular pixel at the
top of the image, on the next row down
we could either delete the
pixel right below it or the one
to the left or the one to the right.
So this is also pretty much--
we've got these three choices,
just like we had before--
slightly different cost function.
And then you've got-- you could start
by deleting any one of the pixels
at the top.
You pick the one that's going
to have the lowest cost.
And that cost might be how different
are you than your neighboring
pixels on the left and right.
So if you're pretty much the same
color as the thing to your left
and the thing to your right,
nobody will notice if you go away.
And with just a little
bit of extra work,
you can then update that
information, that table,
to reflect the fact that you've
deleted this one pixel from every row.
You can patch up all
of the seams that that
would have affected, to figure
out, if I need to squash
by another pixel, what do I throw out.
And you can start resizing the image.
And you can see, it's
deleting a lot of sky.
It's deleting a lot of water.
It's being pretty careful about
how it thins out these bushes.
And thinning them a lot less
than it thins some other things,
trying to do them in sensible ways,
so the stuff that we're used to seeing is
mostly staying normal and empty space
is getting squashed.
But that empty space
is in different places
depending on where you are in the image.
And down here, for example, we
don't have a lot of empty space.
It's hard to delete grass
without you knowing it,
because you're going to get half a blade
of grass and it's going to look funny.
And there's only so much water you can
delete before that becomes obvious.
So somewhere we had to start deflating
the reflection of the balloon.
So it looks funny, but it looks less
funny than if everything had just
been squashed by an equal amount.
As long as you save the information
about what you were deleting,
you can re-expand it.
So you see, even if we'd
squashed a whole bunch,
this balloon is still
whole, because it's
so easy to delete pixels in the sky
and still have the sky look good
that there's no need
to shrink the balloon.
The reflection of the balloon-- that
got harder because we're balancing it
off with deleting stuff in the grass.
So it takes just a little bit
of work beyond that image seam
and edit distance problem
to do this seam carving.
And it takes just a
little bit more work to be
able to resize both horizontally and
vertically, which is pretty cool.
And this is something that was published
at the ACM SIGGRAPH conference, which
is a giant computer graphics conference
every August, about 10 years ago,
I want to say.
I can tell you the exact date
by looking for the reference--
2007.
So yes, 10 years ago.
And it's actually a pretty short paper.
And it's really easy to understand,
because it's a pretty simple algorithm.
And it's a really cool idea.
And so it's one of these
rare papers that you
can sit down and read and understand.
You can go to the talk and understand.
And you can go home afterwards.
And if you've got a little bit
of a background in computer science,
and particularly in computer
graphics, you can understand this.
And you can just sit down and
in an hour you can implement it.
And so you get these web
demos that popped up.
Photoshop can do this.
And it started appearing all
over the place really fast,
because it was a really good idea.
It worked surprisingly well.
And it does that through
this dynamic programming.
Let's not recompute information about
how to delete a seam from some
particular point down to the bottom.
It doesn't compute that
for every possibility.
It just sort of figures out, well,
from here, the best thing to do
would be to go down, or the best thing
to do would be to go down and left.
And that pixel worries about
how to go on from there.
Any questions?
OK, well thank you all for coming.
Remember that tomorrow morning
there will be another lecture
streamed from Harvard.
And if you're at Harvard, you
can always have fun going.
We'll talk about an
introduction to Python,
which is another
programming language like C.
It's got slightly different
syntax, but the same basic deal.
It tends to get used for
certain kinds of programs
where its syntax makes it a
little bit easier to write them.
Then you will be getting instead
of a programming assignment,
you'll be getting a fun more
sort of written exercise
to work on over the weekend.
Remember your exam.
So that will come out
after lecture tomorrow.
And you'll have three days for that.
Then next week at Yale, there are no
sections because of our fall break.
So there will be no sections
Tuesday or Wednesday.
You've got some time off.
You've got some time to recover.
You've got some time for sleep.
And next Friday, which
is during our fall break,
there will be another
lecture streamed from Harvard
that will be more about Python.
And p set 6 will be coming out.
It's likely to tie a lot of the things
we've been talking about this week
and next week together into
a programming assignment.
And you will of course have
10 days to work on that.
So you can look at it when it
comes out, but it doesn't have
to ruin the rest of your fall break.
I promise.
And with that, again,
thank you for coming.
This was CS50.
