 
SPEAKER 1: All right, this is
CS50 and this is week five.
And let's take a look at
where we left off last time.
You may recall this guy here,
Binky from our friends at Stanford.
And we used Binky to start
talking about pointers.
What is a pointer?
So, a pointer is just
an address, the location
of some piece of data in memory,
because recall at the end of the day
your computer just has a few pieces
of hardware inside of it, one of which
is RAM or Random Access Memory.
And in RAM you have the ability to
store bunches and bunches of bytes,
or kilobytes, or
megabytes, or gigabytes,
depending on how much memory you have.
And if you assume that no
matter how much RAM you have you
can enumerate the bytes-- this is byte
0, this is byte 1, this is byte 2,
and so forth-- you can give each of
the bytes of your computer's memory
an address and those addresses
are simply called pointers.
And now in C we have the
ability to use pointers
both to go to any location
in memory that we want
and even to dynamically allocate
memory in case we don't necessarily
know a priori how much memory
we might need for a program.
Now, in terms of your
computer's RAM, recall
that we divided the
world into this picture
here whereby if this rectangular region,
arbitrarily, represents your computer's
memory, here is how the
computer divvies it up
when you're actually using a program.
At the bottom of your computer's area
of memory, you have the so-called stack.
And recall that the stack is where
any time you call a function,
it gets a slice of
memory-- a frame of memory,
if you will-- for all of its local
variables, all of its arguments
and anything else that it might need.
On top of that might go another
slice or frame of memory
if that first function calls another.
And if that second function in
turn calls another function,
you might have a third
frame on the stack.
Of course, this doesn't end well if you
keep calling function after function
after function after function.
And so, hopefully you don't accidentally
induce some kind of infinite loop
such that these frames pile on
top of each other infinitely
many times, because eventually they'll
run the risk of hitting the heap.
Now, the heap is the same
type of physical memory.
You're just using it in
a slightly different way.
The heap is used any time you want
to dynamically allocate memory,
when you don't know in
advance how many bytes
you need but you do know once the
program is running how many you now
want.
You can ask via functions
like malloc the operating
system for some number
of bytes, and those bytes
are allocated from the heap.
So, those bytes, too, have addresses or numbers.
And so, the operating
system, by way of malloc,
just figures out which of
those bytes are not yet
being used so that you can
now put whatever piece of data
you have in that particular place.
Now, beyond that appear
things like initialized data
and uninitialized data.
That's where things like global
variables that are initialized or not
end up that might be outside
of your main function.
And then above that is the
so-called text segment,
which are these zeros and ones
that actually compose your program.
So when you double click an
icon on Windows or Mac OS
to run a program or you type dot slash
something in the Linux command line
environment in order to run a program,
the bits that compose your program
are loaded also into memory
up into this region here.
So, at the end of the day, you have
access to just pretty generic memory,
but we use it in these different ways.
And it allows us to
ultimately solve problems
that we might not have
been able to in the past.
Recall for instance this example here,
deliberately shown in red because it
was buggy. This does not work.
Now, logically, it does do the swap that
we intend whereby a goes into b and b
goes into a.
And we achieve that result by
way of this temporary variable
so that we have a temporary placeholder
into which to store one of those values
while doing the swap.
But it had no permanent impact on the
two variables that were passed into it.
And that was because by default in C any
time you pass arguments to a function,
those arguments are passed,
so to speak, by value.
You get copies of those values
being passed into a function.
And so, if main, for instance,
has two variables, x and y--
as they did last time--
and you pass x and y
into a function like this
one here swap, x and y
are going to get copied
as a and b respectively.
So you might perfectly,
logically, correctly swap a and b,
but you're having no permanent
impact on x and y themselves.
But what if, per this
green version here,
we reimplement swap to be
a little more complicated
looking, but at the end of
the day actually correct?
Notice now we've declared
a and b not to be
integers but to be pointers to
integers, the addresses of integers.
And that's what's implied by
the star that we're putting
right there before the variable's name.
Meanwhile, inside of the
body of this function,
we still have three lines of code.
And we're still using a temporary
variable, and that in itself
is not a pointer.
It's just an integer
as before, but notice
we're using this star notation
again, albeit for a different purpose
to actually dereference these pointers.
Recall that int star a and int star
b means give me a variable that
can store the address of an integer.
That's declaring a pointer.
Meanwhile, if you just say star
a without declaring something
to the left of it with
a data type like int,
you're saying go to the
address that is in a.
So if a is an address,
star a is at that address,
which of course per its declaration
is going to be an integer.
Similarly, star b means
go to the address in b.
Star a means go to the address in a
and put the former into the latter,
ultimately putting the value of temp at
the address in b-- so absolutely more
complicated at first glance,
but if you consider again
the first principles of what's
going on here, all we are doing
are moving things around in memory.
And we can do that now
because we have the ability
to express the locations, the
numeric locations of where
things are in memory.
But nicely enough, we,
the programmer, don't have
to care where things are in memory.
We can access things symbolically
as we're doing here with a and b.
So even though we might
have seen on the screen
or you might see while debugging
actual addresses of memory,
rarely does that actually
matter in practice.
We can deal with everything we've
learned thus far symbolically.
Now, last time we also took a
look at the world of forensics,
and we took a look at how images are
implemented and specifically file
formats like BMP, and JPEG,
and GIF, and yet others.
And we glanced into her eyes here
as we tried to enhance this image,
but of course, there was only
a finite amount of information.
So, what you see is what you get
in terms of any kind of glint
or suspect in her eyes.
But we did this in part
so that we could also
introduce another feature of C that
allows us to declare our own data
types, indeed our own data structures.
For instance, we proposed
that if you wanted
to write a program
that stores a student,
you could actually declare
your own student data type
inside of which is a name and
inside of which is a dorm,
and anything else that
you might actually want.
Meanwhile, this syntax here gives
us a new data type called student
so that if we want to write a
program that implements students,
we can actually wrap related
information together like name and dorm
without having to maintain
a whole bunch of strings
for just names and a whole
bunch of strings for just dorms.
We can actually encapsulate things
all inside of one structure.
And indeed encapsulation is another
principle of computer science
that you'll see throughout programming
and throughout the field itself.
So, what do we now do this time?
So, today we introduce more
sophisticated ingredients
with which we can solve problems and
we revisit a problem from the past
that we thought we had rather
knocked off and had solved.
So, this might represent
a whole bunch of names,
a whole bunch of numbers, a whole bunch
of telephone numbers in a phone book
back to back to back to
back stored in this case
in the form of an array, the simplest
of data structure, so to speak,
that we've discussed thus far.
And an array, again, is a contiguous
block of memory whose elements--
typically of the same data type,
integers, or strings, or the like--
are by definition
back to back to back to back,
which allows you random access.
Which means you can jump
to any of these locations
instantly just by using in C
that square bracket notation
or as we saw last time
using pointer arithmetic,
actually using the star operator
and maybe adding some number to
an address to get at
some subsequent address.
But it turns out there's a few problems
with this fundamental approach.
As nice and simple as it is, it would
seem that we've rather painted ourselves
into a corner with this approach.
This array has 1, 2, 3, 4, 5, 6 total
elements, at least as depicted here.
So that's fine if you want
to insert a number, and then
another number, and
then four more numbers.
But what if you want to then
insert a seventh number,
not to mention an eighth number
or a ninth number or the like?
Well, where do you put them?
Well, you might think,
well, that's fine.
I'm just going to go put the
seventh number over here,
or the eighth number over here,
or the ninth number over there.
But you can't just blindly do that.
If this memory is being
managed not by you
per se but by malloc and by the computer
itself-- then in your program,
this memory over here, while
it might physically exist,
might be used by some other part
of your program altogether.
It doesn't necessarily belong to
you unless you've asked for it.
And the problem with an array is
that, as we've seen, you typically
declare its size in advance,
as with the square bracket notation,
and say give me six integers or give me
six something-or-others, but that's it.
You have to decide in advance.
You can't just grow it as you can in
some programming languages thereafter.
You've rather painted
yourself into a corner.
But with malloc and other
functions like we saw last time,
you can actually allocate
more memory using malloc.
Unfortunately, it might end up in
another location in your computer's
memory, so you might
have to do some copying
to take the original six
elements and move them
elsewhere just to make room for more.
And there is a function for
that, called realloc,
or reallocate.
And indeed it can do exactly that.
It can give you a bigger
chunk of memory and reallocate
what was previously there
to be a larger size.
But you have to do a little bit of work.
You have to invoke it
in order to achieve that.
You can't just blindly keep adding
things at the end of this array.
Now, unfortunately, while that is
a solution, it might not be very efficient.
Even if you can allocate a bigger
chunk of memory that's bigger than six
because you have more numbers,
for instance, to store,
what if that takes a bit of time?
And indeed it's going to.
If you allocate more integers
somewhere else in memory
you still have to copy
those original values,
and now it just feels
like you're wasting time.
Now, instead of just inserting
things into the list,
you might have to copy it into a
bigger space, reallocate things, grow.
It's a lot more work.
And all of that discussion of
running time and performance
comes back into play, because if that
whole copying process and reallocating
is costing you time, your
algorithm or your program
ultimately might not really be
as fast as you might want it.
So, what could we do instead?
What could we do instead
in order to solve
this problem dynamically, so to
speak, that being the operative word.
And luckily enough, last week we learned
that there is dynamic memory allocation
in C by way of that function malloc.
And we also learned that there are
ways of representing structures
in C that you don't necessarily
get with the language itself,
because they're not primitives.
They're not built in.
In other words, let me propose
this as a solution to our problem.
This is a list of, let's see,
five numbers it would seem,
9, 17, 22, 26, and 34.
Pretty arbitrary right
now, but you might
imagine very simply drawing those
same numbers-- 9, 17, 22, 26,
34-- in the form of an array and
they're clearly deliberately sorted.
But again, what if you wanted to grow
that array or even shrink that array
dynamically over time?
Well, let me propose that we not draw
those numbers back to back to back
to back literally next to each other
but allow ourselves potentially
a little bit of space?
But if that's the case and nine is here
in my computer's memory and 17 is here
and 22 is here, or over here,
or over here-- in other words,
what if I relax the constraint that
my numbers or my data types more
generally have to be stored contiguously
back to back to back to back in memory
and instead allow them to be anywhere,
indeed anywhere a function like malloc
wants to give me more
memory, that's fine.
If it wants to give me memory up here
in my computer, I'll deal with that.
If it wants to give me extra
memory over here, that's fine.
I'll deal with it, because I'll use
these conceptual arrows to stitch
together my data structure this time.
And now, where have we seen
these kinds of arrows before?
What feature of C allows
us to connect one thing
to another where a la chutes and
ladders get from one place to another?
Well, that's exactly what we saw
last time which was pointers.
While we've drawn these here per the
snippet from a textbook using arrows,
those are really just pointers.
And what does each of
these rectangles represent?
Well, clearly a number in the
top half of the rectangle,
but I claim that the bottom
half of these rectangles
is just another piece of data,
specifically an int star, a pointer.
Or rather, not quite an int star,
because it seems to be pointing
not just to the number
but to this whole rectangle,
so I need some new terminology.
I need some kind of structure to
contain an integer and this pointer.
And for that, I think I'm
going to need a struct.
And indeed let me propose that to
solve this problem we give ourselves
this building block as a new
C data type called a node.
You can call it anything
you want, but the convention
is to call this kind of
puzzle piece or building
block in a data structure a node.
Let me propose that we
define it as follows.
I'm using that same syntax from last
time with which we declared a student
data type, but here I'm saying
inside of this data structure,
this node shall be an int.
And that's pretty straightforward.
Just like a student might
have a name and a dorm,
this node will have an
int called n arbitrarily.
And then the only piece of detail
that's a little bit new now is
the second line, struct node star next.
Now, what does that mean?
It's pretty verbose, but struct node
is just recursively, if you will,
referring to this same
type of data structure.
Star means this is going to be a
pointer, the address of one such thing,
and next is just an arbitrary
but pretty reasonable
name to give to such a pointer.
So this line here,
struct node star next,
is the incantation in C
with which you declare
one of those arrows that will
point from one node, one rectangle
to another node, another rectangle.
And the fact that we have a
little bit of additional verbiage
up here, typedef struct
node, is because again C
is a language that is read
top to bottom, left to right,
so words have to exist
before you actually use them.
So, whereas last time when
we declared a student,
we didn't actually mention struct
student or anything like that.
We just said typedef open curly brace.
Today, when declaring
a node, we actually
have to have some additional syntax
here just called struct node.
And technically this
word could be anything,
but I'll leave it as
node for consistency.
And that allows me,
inside of this definition,
to specify that the
second data member is
going to be a pointer to exactly
that kind of data structure.
But typedef, just to be clear, allows
me to use a shorter name for this data
structure here, which I will
simply call node at this point.
So, what can we actually do with
this kind of data structure now?
And, indeed, let's give
this data structure a name.
Let's start calling it a linked list.
Previously, we had arrays, but now
we have linked lists, both of which
at the end of the day are types
of lists, but linked lists,
as the name suggests, are linked or
threaded together using pointers.
Now, when you have a linked list,
what might be some operations,
some algorithms that you
might want to run on them?
Well, if you've got a linked list of
say numbers, for the sake of discussion,
you might want to insert a
new number into that list.
You might want to delete
a number from that list
and you might want to search that list.
And that allows us to then
consider how we might implement
each of these kinds of things.
But it turns out that, while
fairly simple intuitively,
we're going to have to be a little
careful now by way of our pointers.
So, let's more formally declare a
linked list to look something like this.
It's a collection of nodes
that are linked together
with pointers as represented
by these arrows here,
but we're going to need some
special pointer, at least
at the beginning of the list.
Let's just call it first.
It doesn't necessarily
store an actual integer.
First is itself just a
pointer to the start of the list.
And by way of that pointer can we access
the first actual node in the list.
From there can we get at the second,
from there can we get at the third,
and the fourth, and the fifth,
and any number of others.
And this syntax over here
might just represent null.
Because you don't want
to have that pointer just
pointing off into no man's land, that
will have to be a null pointer so
that if we check for that
with a condition we know,
OK, we're at the end of the list.
So, let's pause for just a moment
and consider these three algorithms--
insert, delete, and search, and
consider what's going to be involved.
Well, how would you go about
searching for an element of this list?
Suppose I wanted to find the number 22?
What do you do?
Well, I, the human, can just
look at this and be like, all right,
22 is right there.
But a computer can't do that.
A computer every time
we've had this discussion
can only look at one thing at a time.
But moreover the computer this
time is even more constrained
because it can't just use
our old friend binary search
or divide and conquer, because how do
you get to the middle of a linked list?
Well, you have to find your way there.
The only thing you have in a
linked list from the outset
is one pointer called first or whatever
it is, but one pointer that leads you
to the beginning of the list,
the first node in the list.
So, if you want to get to
the second node in the list,
you can't just go to bracket one,
or bracket two, or bracket three
to get any number of other
elements in the list.
You have to follow these
bread crumbs, if you will.
You have to follow these arrows or these
addresses to go from one node's address
to the other to the other.
And so, we've paid a price already.
And we'll see that there
is still an advantage here,
but what's the running time?
What's an upper bound on the running
time of search for a linked list,
even if it is sorted?
 
Any thoughts?
 
Is it constant time like big O of 1?
Is it log of n?
Is it n, n squared?
What's the running time going to be?
Well, they're sorted, and that was this
magical ingredient, this assumption
we've been allowed to make in
the past which was helpful,
but that assumed that
we had random access.
In C, we had square bracket notation,
so that using some simple arithmetic
we could jump roughly to the
middle, and then the next middle,
and the next middle looking for Mike
Smith or whatever element it is we're
looking for.
Unfortunately here, one
price we have already
paid by taking this step
toward linked lists is linear time.
Big O of n would seem to be the running
time of searching a linked list,
because the only way you can
start is at the beginning,
and the only way you can get through
the list is by following these arrows.
And if there's n nodes
in the list, you're
going to need as many as n steps to
find, something like 22, or 26, or 34,
or any elements all together.
Well, that's not all that great.
What about insert?
What's an upper bound on
the running time of insert?
Well, here too it depends.
Suppose that we don't care
about keeping the list sorted.
That's kind of a nice advantage,
so I can be a little lazy here.
So, what's the running
time going to be if I
want to insert a new number like
the number 50 into this list,
but I don't care about
keeping it sorted?
Well, instinctively, where
would you put this element?
Where would you put it?
You might be inclined-- you kind
of want to put it over here,
because it's the biggest element.
But again, if you don't care
about keeping it sorted,
where is the fastest, the quickest
and dirtiest place to put it?
I would propose let's just put
it at the front of the list.
Let's take this first pointer,
point it at the new number
50 that we've somehow
added to the picture,
as by calling malloc, asking
malloc for a new node.
And then have 50, in turn, point to the
number 9, and then 9 can point to 17,
and 22, and so forth.
What if we want to insert
another number, 42,
and we don't care about where it goes?
Well, why don't we just put it
at the beginning of the list?
Then we have the first
pointer pointing at 42,
which in turn should point at 50, which
in turn can point at 9, then 17, then
22, and so forth.
So, if we're just lazy
about this, we can actually
achieve a great running time
for insert: constant time.
Unfortunately, if we want
to keep things sorted then
we're going to have to incur a
linear time cost again, right?
Because if we have to
insert 42 or 50, worst case
they might belong all the way at
the end of the list and that's
Big O of n steps.
And delete, too, unfortunately,
whether it's sorted or unsorted
is also like search
going to be Big O of n
because you don't necessarily know
when you're searching for a number
to delete if it's going to be at the
beginning, the middle, or the end.
So, in the worst case, it
might indeed be at the end.
You know what?
Instead of
walking through this verbally,
let's see if we can't
get some volunteers.
Can we get seven volunteers to play--
wow, to play the role of numbers here.
1, 2, 3, 4, 5, 6, and
yes, 7, come on up.
All right, so I have here some
printouts for all seven of you
that represent exactly the nodes
that we have here on the screen.
Let's meet one of our first contestants.
What is your name?
AUDIENCE: Scully.
SPEAKER 1: Scully, nice to see you.
So, you shall be literally first
and represent our first pointer.
So, if you want to come and
stand roughly over here.
And then what is your name?
AUDIENCE: Maria.
SPEAKER 1: Maria, nice to see you.
And you can be the number 9 right
next to our first contestant.
And your name?
AUDIENCE: Sarah.
SPEAKER 1: Sarah, nice to see you.
You shall be the number 17.
And your name?
AUDIENCE: Satoshi.
SPEAKER 1: Satoshi,
nice to see you.
You shall be 20.
And your name?
AUDIENCE: Mosof.
SPEAKER 1: Mosof, nice to see you.
And you shall be 22.
AUDIENCE: Jed.
SPEAKER 1: Jed, nice to
see you-- 29, formerly 26.
And your name?
AUDIENCE: Erin.
SPEAKER 1: Erin, nice to see you.
You shall be 34.
All right, so what we have here
is seven elements, six of which
are very similar to one another, one
of which is fundamentally different.
So, Scully here represents first,
and indeed her sheet of paper
is horizontal to suggest
that she is not a node.
She is just going to be the
pointer to a node in the list.
Everyone else's nodes
are vertical, as have
been the rectangles we've
been drawing on the screen,
because each of these guys represents
a number as well as a next pointer.
Now of course, you're only seeing
in front of you the number.
So we're going to go ahead
and if you wouldn't mind,
use your left hand to represent the
arrow that we've long had on the screen
to point to the person next to you.
And Erin, you're a bit
of an anomaly, because
you need to have a null
pointer at the end of the list,
so that you're not just
pointing aimlessly.
And pointing to the ground
seems fine, so literally
pointing to the ground we'll
interpret as null.
So, Scully, you are the only
thing keeping this list together,
so to speak.
So, you, too, need to point with
your one pointer to Maria there.
So, here we have a linked list.
And just to make this clear, could
everyone separate from each other
by a step or two in any direction?
Notice that the list is still--
it's almost identical to before.
Can some of you take a
step forward, a step back,
but still point at the other person?
So, now we're capturing a
little more accurately the fact
that these nodes and these
pointers can be anywhere in memory,
so long as they're linking themselves
together by way of these pointers.
All right, so suppose
now that I want to insert
an element like 55, which happens
to belong at the end of this list.
Let me go ahead and malloc
Christian if I could.
So we have asked malloc for
a chunk of memory equivalent
to the size of one
integer and one pointer.
That is going to be represented
with this rectangle here.
Nice to see you.
AUDIENCE: Nice to see you.
SPEAKER 1: You shall be 55.
And I'll play the role of the temporary
pointer, or predecessor pointer,
just using
my left or right hand
to try to figure out
where Christian belongs.
So, just like
we might want to search the list,
inserting is fundamentally
rather the same.
The only thing I have access to
at the outset of this algorithm
is my first pointer, and
it's only by way of Scully
that I can even access
the rest of the list.
I cannot jump to the
middle, jump to the end.
I can only start at the beginning
and literally follow the pointer.
So, let me go ahead and do that.
Let me go ahead and
point at whatever Scully
is pointing at, which happens to
be Maria, which is the number 9.
55, of course, is
bigger than that, and I
do want to keep the list
sorted for today's purposes.
So I'm going to very carefully
follow the next pointer, 17.
Follow the next pointer, 20.
Follow the next pointer, 22.
Follow the next pointer, 29.
Follow the next pointer, 34.
Follow the next pointer,
ah dammit, null.
And so this is why it is important
with some of these algorithms
to have a predecessor pointer, a
second pointer or really my left hand
so that maybe my left hand
can still point at Erin.
My right hand can realize,
ah, null, so that I still
have access to the last node
in the list so that Christian--
if you could come over here.
I'm going to go ahead and tell Erin
quite simply to point at Christian.
Good, and let's just say for
students' sake come on over here,
but technically we could have
left Christian there and just had
Erin pointing at him.
It's just going to get a
little confusing before long,
so we'll just cheat and
move you right over here.
But now we have a linked list
that has one additional member.
Suppose now that we want to
make another insertion.
Let me go ahead and propose
that we insert, say, the number 5.
Well, the number 5, of course,
belongs at the beginning of the list.
So you know what?
I need to malloc.
Can I malloc Jordan from off
camera, perhaps?
So malloc-- a very slow return value.
OK, we're going to store your
n value, five, here.
His pointer is undefined right
now, because he's not actually
pointing at anything.
And so where does he ultimately belong?
Well, he belongs at
the start of the list.
So, let me deliberately make a mistake.
Let me go ahead and update
Scully to point at Jordan,
thereby putting Jordan effectively
at the front of the list.
Unfortunately, whom should
Jordan now point at technically?
It should be Maria, but this is code.
The only thing we can do
is copy pointers in memory,
and if Scully's left hand is
no longer pointing at Maria,
I have literally orphaned
the entirety of this list.
I have leaked 1, 2, 3, 4,
5, 6, 7 chunks of memory,
seven nodes, because I got my
order of operations wrong.
Indeed, I should have done
what-- let's undo, control Z.
And now let me go ahead and do what?
Jordan should point at the exact
same thing Scully is pointing at,
which has no downside.
Even though it feels redundant,
we've not lost any information.
And now that Jordan
is pointing at Maria,
Scully's pointer can
be pointed at Jordan.
And now, even though the
list looks a little weird,
this is a key feature
of the linked list.
These nodes could have
been malloc from anywhere.
So indeed, even though we initially
kept everyone physically sorted
left to right-- and you've all cleaned
up the list since-- that's OK.
The point is that all of these
nodes are linked together.
So, thank you so much to our volunteers.
You can keep these pieces
of paper and later on we'll
have some stress balls for you.
But that's the key idea
here behind a linked list.
Thank you.
So, of course, there are some
more complicated operations
that we might have to deal with.
For instance, if we want to insert
into the middle of the list,
that's going to be a little more
of a burden on me, the programmer,
to keep track of where
things have to go.
But nicely enough, there's
only these three cases--
the beginning of the list, the end of
the list, and the middle of the list,
because middle of the list doesn't
have to mean literally the middle,
just anywhere that's not
the beginning or the end.
Of course, we should
be careful to make sure
that we handle the empty
list scenario, which
is equivalent to putting something
at both the beginning of the list
and the end of the list.
But that would be perhaps a special
case we could deal with separately.
Of course, there are other operations
like inserting-- or rather removing
from the tail of the list,
removing from the head of the list,
and removing in the middle.
And that would be the opposite
of malloc, if you will.
And in those cases, we
have to take care to call
our friend free to free
those bytes of memory,
give them back to the operating
system so that we don't leak memory.
But there, too, I'm probably going to
have to be careful as to what order
I change my pointers and free nodes.
Because what you don't want to
do, and what unfortunately you
might very well accidentally
do at some point,
is free a pointer and then try to access
that pointer or change the pointer,
even after you've told the operating
system I'm done with this address.
That can give you what's
called a segmentation
fault, which is just one
of the ways in which you
can deduce that kind of mistake.
So, let's actually implement
one of these methods.
And we'll pluck off one that
allows us to actually take
a look at the syntax with which
we can manipulate pointers.
And let's go ahead and
implement a function
called search, for
instance, where search
I propose just returns a
bool, true or false: this number n
is in the given list.
And now, why have I said node star list?
Well, at the end of the day, a linked
list is just a whole bunch of nodes.
But the first of those nodes that
we keep calling first is of what
data type?
If you have a pointer,
a variable, that's
pointing to a linked list, that means
it's storing the address of a node,
otherwise known as a node star.
So, this would be the syntax with which
you can pass to a function something
like a linked list.
You simply have to pass it a pointer
to the first element in that list.
And if I want to go ahead
now and implement this,
let me go ahead and
propose the following.
Let me go ahead here and give myself a
temporary value, so node star pointer
we'll call it, PTR.
And that's going to equal
the start of the list.
So, I'm just creating
another box of memory
and I'm storing inside
of it the same address
that I was passed in, just so that
I have a temporary variable that I
can use to update.
After this, let me go ahead and
say while that pointer is not
equal to null-- because recall that
null is this special sentinel value that
means end of the list.
So inside of this loop,
what do I want to do?
I'm going to go ahead
and say if pointer--
and now I have to get at
the number inside of it.
So, if I recall from the last time,
we only spent a little bit of time
on the student example, but
we said something like student
dot name or student dot dorm.
And in this case I'm inclined
to say pointer dot n, where n is
the number, the integer that's inside.
But pointer this time
is not a struct, per se.
It's the address of a node.
It's the address of a struct.
And so, perhaps the most
intuitive piece of syntax in C,
at least retrospectively
now, is that if you
want to access a piece of
data that's inside of a node
and you have a pointer to that node much
like our arrows in the pictures imply,
you literally draw an
arrow using a hyphen
and then using a right angle bracket.
So, now if we do see-- whoops,
let me finish my thought.
If pointer arrow n equals equals the n we're
looking for, let me go ahead in here
and say return true.
Or else, let me go ahead
and not return false,
because I don't want to just check one
element and then blindly say false.
I instead want to say pointer
should get pointer arrow next.
And then only after that
loop is all complete
should I say something
like nope, return false.
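Put together, the function just described might look like this sketch, assuming the usual node struct with an int n and a next pointer:

```c
#include <stdbool.h>
#include <stddef.h>

// A node in a singly linked list of integers.
typedef struct node
{
    int n;
    struct node *next;
}
node;

// Returns true if n is somewhere in the list, false otherwise.
bool search(int n, node *list)
{
    node *ptr = list;        // temporary pointer, starting at the head
    while (ptr != NULL)      // NULL is the sentinel meaning end of list
    {
        if (ptr->n == n)     // follow the arrow and inspect this node's n
        {
            return true;
        }
        ptr = ptr->next;     // advance to the next node, a la i++
    }
    return false;            // fell off the end without finding n
}
```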
So, what's actually going on here?
The function declaration,
again, took in two arguments--
one, an int n that we're looking
for, two a pointer to a node,
otherwise known as a
node in a linked list.
And per the pictures
we've been drawing, you
can access any other element in that
linked list by way of the first element
in that list, as suggested here.
So, now I'm just giving myself a
temporary variable called pointer,
but I can call it anything I want.
And I'm declaring it as node star, so
that it can store the address of a node
as well, and then I'm
just initializing it to be
the exact value that was passed in.
So, I don't want to accidentally
break the list that I was passed.
I don't want to change
the value of my parameter
unnecessarily and complicate things.
I just really want a
temporary variable, much
like i in the world of loops, that
allows me to constantly iterate
through something and update it as
I go. And all along the way
I want to be checking this:
While pointer is not null.
If pointer is null, that means
I'm at the end of the list,
or maybe more curiously, I was passed
null, in which case there is no list.
And that's a valid scenario that could
happen, even though it's a bit strange.
But if pointer is null, I don't want
to proceed further inside of this loop.
But so long as pointer is not
null, let me go ahead and do this.
Let me follow that pointer and
go inside that node and say
is your n value equal
equal to the n value that
I've been asked to search for?
And if so, return true.
I don't want to just return
false now because otherwise I'd
only ever be checking the
first element in a linked list.
So, I now want to do the equivalent
in spirit of i plus plus.
But I'm not using i's.
I don't need to use
pointer arithmetic here,
and indeed it won't work because I
have this thing stitched together.
It's not an array of contiguous memory.
I want to say that my
current temporary value
pointer should get
whatever pointer arrow next
is, and then let this loop continue.
If it's not null, check again.
If its n value equals what
I'm looking for and repeat, repeat,
repeat until pointer equals null.
So, let's make this more concrete.
Let me go ahead and just draw
a temporary picture here.
And let me suppose here
that what I have been passed
is something like the following.
Let's do a very simple linked list
that has maybe the number one,
and has the number two,
and has the number three.
And, again, I've drawn
gaps between these nodes,
because they could be
anywhere in memory.
So, technically they don't need
to be left to right like this,
but that'll keep us sane.
And if this is indeed
a correct linked list,
there are pointers in
each of those fields
that point to the next node
in the list, and that slash
I drew in the last one just means null.
You can draw it however you want.
But for a linked list to work, we need
to know the beginning of this thing.
So, we'll call this
first, and that of course
has to point to the first
element in my linked list.
So, here's the state of our world.
It's a linked list quite similar
in spirit to the other one.
It's a little shorter, just
for the sake of discussion.
And now, let's consider
my function search,
which again takes two arguments.
So, that the first argument
is of type int called n.
And suppose I'm searching
for the number three.
The second argument is a node star,
so the address of a node called list.
So, what does that mean?
When this function
search is called, let's
suppose that we currently
have the value n, which
is going to be 3, because
that's arbitrarily
the number I've decided to search for.
And then this other value pointer
is going to be initialized-- sorry,
not that.
List is going to be
whatever pointer is passed
in as the second argument
to this function.
So, let's suppose that this linked list,
this sample linked list at the top,
is indeed what is passed
in to this function.
So, I've passed in 3, because
I want to search for 3.
So what goes here?
If I pass this sample linked
list into my search function,
what is the value of list?
Well, list, if I'm passed
this first pointer,
is really going to be the pointer
to the first element in the list.
That's all we're saying.
Node star list just
means give me the address
of a linked list, which means give
me the address of the first node
in the linked list, which
means that initially when
I call search my picture--
my stack frame, if you will,
in terms of my local arguments--
is going to look like this.
All right, so with that said,
how does this code work?
Recall that in this code we
have this while loop that
just sits in a loop checking
whether the current node's
n equals equals the one we're looking
for, and if not, it updates it.
So, we need one more local
variable called pointer that's
initialized to the start of this list.
So, this will be a pointer and
it's initialized to the same thing
that my second argument
is initialized to.
So, this now is the state of our world
once one line of code has executed,
that very first one in there.
So, now let's implement the loop.
While pointer does not equal
null, so here's pointer.
Does it equal null?
No, because if it did,
we would just draw
a slash or some other piece of syntax.
But it's pointing at clearly
something that exists.
So, this node here has
some valid address.
Pointer is pointing at
it, so it's not null.
So, what do I do inside of my code?
I check if the n value
inside of pointer, PTR,
equals equals the
number I'm looking for,
and if so, return true, otherwise,
if not I update pointer.
So let's check.
Let's follow the arrow, PTR,
and look at the value n.
Recall that the top of these boxes is n.
The bottom of them is called
next-- n, next, n next.
So, I followed this pointer.
I'm looking at the box called n.
Does 3 equal 1?
No, obviously not.
So I update-- pointer gets pointer next.
So, to be clear, pointer
gets pointer next,
that second to last line of actual code.
So, what does that mean I need to do?
That means I need to update pointer
to be equal to pointer next.
What is pointer next?
Well, here's pointer and here's next.
We were looking a moment ago at n.
Now I'm looking at next.
So, pointer next means that I should
update whatever is inside this box--
and a lot more on the screen-- to
be equal to pointer next, which
is this field.
This field is pointing at
that, so that line of code
has the effect of updating PTR to
simply point at the second node.
So, what happens next?
I seem to still be in that loop and I
say, well, pointer does not equal null,
and it doesn't.
It's pointing at that second node.
If pointer arrow n equals
equals n, but no that's not
the case, because I'm looking for three.
I'm pointing at two,
so that is again false.
So, again, I don't return true.
I instead update pointer
to equal pointer next.
So, what has to happen here, at the
risk of deleting my handiwork again,
now pointer gets pointer next,
which is this element, which
is equivalent to pointing
at this node here.
And so, now I'm still inside that loop
while pointer-- not equal to null.
It's not null.
If pointer arrow n equals equals
n, well, let's follow that logic.
If pointer, follow the
arrow, n equals equals n,
three-- which is the one I'm
looking for-- returned true.
And so, how then does this
function ultimately behave?
It would seem in this case to return
true, because I have eventually
found that number three.
What would happen by contrast if I were
looking not for three, but for four
with this code?
In other words, what if I'm not
looking for three and I want to go one
step further?
Well, one step further
is going to update PTR
to equal null, that
slash in my last node.
And that means code wise,
I'm going to break out
of that loop, because
pointer now does equal null.
And so, by default that very last
line of code return false, not found.
So, complicated at first
glance, and it certainly
looks more complicated than
things we've written before,
but again, if you go back to basics,
what does each of these lines mean?
Consider that there's no magic here.
This first line means give me a
variable that's a pointer to a node.
It'd be in other words
the address of a node
and assign it whatever I was passed in.
While pointer not equals
null-- we've seen null before.
It's this special zero
value, and I'm just
making sure that the pointer I'm using,
PTR, does not equal that special value.
And then inside of this loop I'm using
one piece of new syntax, this arrow
notation, which just like the picture
suggests means go there, and then
look at the field called n and check
if it equals the n you're looking for,
and if so return true.
Otherwise, update yourself
much like i plus plus
but specifically update pointer to
be whatever the value is when you
follow the arrow in that next field.
So, this of course is just search.
We've not actually changed the
list, but imagine, if you will,
that you could now
implement insert and delete,
not simply by following these
pointers but actually changing
the value of next in a node to the left,
a node to the right, or a new node all
together.
So, who cares?
Why did we add all of this complexity?
We had arrays, which were working
really well for a whole bunch of weeks,
and now we've claimed that
arrays are not so good.
We want to use linked lists instead.
But why might we want
to use linked lists?
Well, linked lists gives us dynamism.
We can call malloc and give ourselves
more, and more, and more nodes
and grow our list of numbers, even
if we don't know in advance how
many such numbers we need.
And we can shrink them,
similarly, so we don't
have to allocate a massive
array unnecessarily.
We can shrink our data structure based
on how many numbers we actually need.
But we're paying a price.
Search is a little bit slower,
delete is a little bit slower.
Insert would be slower if we
insist on keeping things sorted,
so we've paid this price.
And indeed, this is thematic.
In CS50, in computer
science more generally,
there's often going to be
these trade-offs of time,
or space, or just complexity,
or your human time.
Any number of resources
can be in scarce supply.
And, indeed, we've seen
by way of linked lists
that we're solving one problem
while introducing another.
It's like that silly
situation you might see
memes of where you cover your hands--
put your hand around a hose that
has a leak and all of a sudden
another leak springs up over there.
We're just moving the problem
elsewhere, but maybe that leak
is less problematic to
us than this one here.
So, again, it's this
theme of trade-offs.
Now, this here is Mather Dining Hall.
And this, of course, is
a whole bunch of trays
where you might go and get
some food, lunch, breakfast,
or dinner, or the like, and you pick
up this tray and you put food on it.
But what's interesting about trays,
as Mather and a lot of cafeterias do,
is trays are stacked
one on top of the other.
And it turns out, now that
we have this second building
block with which to create data
structures-- we're not just
using arrays anymore; we now
have pointers and this general
idea of linking nodes together
in our toolkit-- we can now start
to imagine more interesting data
structures that solve problems
in slightly different ways.
For instance, suppose
that I wanted to implement
this paradigm of stacking things on
one on top of the other like this.
Indeed, this is a data
structure called a stack,
and it generally has two
operations associated with it,
push to push something on the stack and
pop to take something off of the stack.
And this is perhaps a
useful data structure
if you just want to store numbers
or something else in really just
the most efficient way for you
without regard really to fairness.
So, for instance, if this table
here is my initial data structure
and it's empty and I have a piece
of information that I want to store,
I'm just going to go
ahead and put it right there.
And now suppose I want to push
another number onto the stack.
I'm just going to go
ahead and simply put it
on top, third number on
top, fourth number on top.
But I've now committed to a
certain property, if you will,
a certain property whereby the last
tray in has to be the first tray out,
otherwise known in computer
science as LIFO-- last in,
first out-- because if I want to
get that first number, I mean,
I've created a mess for myself.
I have to lift these all up or
move them just to get at it.
That just seems stupid.
Intuitively, the easiest thing to
grab is probably going to be the top,
but that's not necessarily
the first element I put in.
But that's OK.
This might still be a
valid data structure--
and indeed later in the term when
we introduced web programming
and we look at languages
like HTML, there's
actually a number of applications
where it's actually super useful
to have a data structure where you can
just stack stuff on top of each other
in order to tuck some data
away for subsequent use.
And, indeed, when we talked
about memory in our own computer,
stacks clearly have some value.
We talked about main and then swap,
and then maybe other functions.
There are many contexts, one of which
we've seen already, where in life you
actually want to stack
things on top of each other
so as to keep track of really what
did I do most recently, because that's
the next problem I'm going to
deal with or the next frame--
in the case of memory-- that I'm
going to pop off of the stack.
So, how might we implement
this data structure?
Let me propose that if we want
to define our own data type
called the stack that implements that
idea of cafeteria trays or frames
in memory, let me go ahead and typedef
a structure called stack inside
of which are two data members.
Suppose that, for the
sake of discussion,
I'm not going to try to store trays
or something that doesn't really
exist computationally but
rather numbers, just integers.
Inside of this structure
called a stack is
going to be an array
called numbers of type int,
of size CAPACITY.
So capacity, let's assume, is hash
defined elsewhere to be some constant.
So, maybe it's 10, maybe it's 1,000,
but it's some constant integer elsewhere
that limits, ultimately, the capacity
of the stack to some integral value.
And then size-- this seems weird.
I have capacity here and then size here.
Well, there's the
semantic distinction here.
Just because you have
a stack ready to go,
as I did a moment ago-- just because
I had this stack ready to go,
empty initially, it's going
to have some capacity.
Realistically, I can only
go as high as the ceiling
or until the things fall over.
But there's also a size here,
and the size is currently zero,
but the capacity might be
like 1,000 or whatever.
So, that's the difference
there and the size now is 4
and the capacity is like
1,000 minus 4 at this point--
or rather, capacity is still 1,000,
because that's the total possible size,
not the actual size.
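A sketch of that structure, with CAPACITY as an illustrative constant and push/pop helpers of my own naming added to show the last-in, first-out behavior:

```c
#include <stdbool.h>

#define CAPACITY 10  // illustrative; maybe it's 1,000 instead

typedef struct
{
    int numbers[CAPACITY];  // storage for up to CAPACITY ints
    int size;               // how many are actually in use right now
}
stack;

// Pushes n onto the top; returns false if the stack is at capacity.
bool push(stack *s, int n)
{
    if (s->size == CAPACITY)
    {
        return false;
    }
    s->numbers[s->size++] = n;
    return true;
}

// Pops the top value into *n; returns false if the stack is empty.
bool pop(stack *s, int *n)
{
    if (s->size == 0)
    {
        return false;
    }
    *n = s->numbers[--s->size];
    return true;
}
```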
But what if that's a limit?
What if I don't want to restrict myself
to some fixed, finite number of trays
or a fixed number of numbers?
Well, I could instead declare
my stack as being a pointer.
Now, this pointer
initially has no value,
so let's assume that it's probably going
to be initialized to null in my code,
but that too is not going to be useful.
Why would I declare numbers now
not to be in an array, which
felt very straightforward.
Normally we draw arrays left
to right, but with the trays,
just imagine turning an array vertically
and thinking of the number stacked
on top of each other.
But this is just a pointer to a number.
Why might I want to implement a
stack as just a pointer to an int?
That seems wrong.
I want lots of numbers, not one number.
So, what could I do?
Well, what if in this world to implement
a stack I invoke our friend malloc
and I say to malloc, malloc,
give me enough memory
for 2,000 numbers or 5,000 numbers.
What is malloc going to return?
Well, by definition,
we know malloc is going
to return the address of a chunk
of memory, and that chunk of memory
is going to be of whatever
size I ask malloc for,
and the address of the first is
really just equivalent to the address
of one integer.
And so long as I, the programmer,
remember that I asked malloc
for 2,000 integers or
for 5,000 integers,
I know implicitly the end of
that chunk of memory and malloc
just needs to tell me the beginning.
So, it's perfectly fine to implement
the stack by way of a single pointer,
because all I need to
know is, hey, malloc,
where should I put my first integer?
Because I know via pointer
arithmetic, per last week,
that I can put my next
integer four bytes later,
four bytes later, four bytes later.
And I'm deliberately going
up this time, but it really
is just an array where you can think
of the array as left and right.
So, this would be a way of giving
ourselves a data structure called
a stack that is not fixed from the
outset like this previous version
to some specific capacity.
Now, we are limited only by how much
physical memory or virtual memory
my computer actually has.
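In code, that dynamic variant might be sketched like this; stack_init and the particular capacity are illustrative, not canonical:

```c
#include <stdbool.h>
#include <stdlib.h>

// A stack whose storage is decided at runtime rather than compile time.
typedef struct
{
    int *numbers;  // address of the first int in a malloc'd chunk
    int size;      // how many ints are currently in use
}
stack;

// Asks malloc for room for capacity ints; returns true on success.
// All we need to remember is where the chunk begins-- pointer
// arithmetic finds each subsequent int four bytes later.
bool stack_init(stack *s, int capacity)
{
    s->numbers = malloc(capacity * sizeof(int));
    s->size = 0;
    return s->numbers != NULL;
}
```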
So, suppose Apple or someone
similar implemented the lines
outside their stores for the
release of the iPhone as a stack.
So, it's weird maybe to think of
people stacking on top of each other,
but maybe you could imagine Apple
funneling everyone into the glass store
here in Manhattan, and then whoever is
the last one in gets their phone first.
Because why?
They're closest to the exit.
So, you have all these people show
up super early in the morning or days
before, you pile them all into
the store saying everyone, hey,
please go into the corner there.
Please get into the store.
And then as soon as 9:00
AM rolls around and it's
time to give out the iPhones, just for
logistical convenience you realize,
all right, why don't we just
give the person who came in
last their phone first because
they're closest to the exit
and get them out, last in, first out?
Good design, bad design?
It's correct in so far
as everyone's going
to get an iPhone if supply is there,
and that's never going to be the case.
So, it's not necessarily
very equitable or fair,
and indeed the humans are not
going to be very pleased with Apple
if they used a LIFO data
structure or a stack.
What would these fans of Apple
hardware prefer that Apple use?
We call it a line.
If you go to the UK,
they call it a queue,
which is actually a perfect answer,
because there's this other data
structure in the world
called a queue, which
is exactly what you would hope
the Apple store line would be,
a line whereby it's first in, first out.
So, the first person
there three days before,
at 5:00 AM gets his or her
phone first, and the one person
who comes in at 9:01 AM
doesn't get their phone
because they're at the last
position in the queue or the list.
And a queue, nicely enough, might
just have at least two operations--
enqueue and dequeue,
whereby enqueue means get
into line and dequeue means
get out of the line,
but these happen at different places.
For instance, if there's a whole bunch
of people lined up here on the stage,
closest over there's
the front of the list.
I get here last.
I enqueue myself at the
end of this data structure,
but you dequeue someone from
the beginning of the list.
By contrast, when we had a stack,
when you push someone onto the stack,
you pop it off, or him or
her off first by nature
of it being a LIFO data structure.
So, how might we implement a queue?
It's actually slightly
more complicated: 50% more
pieces of information to keep
track of, namely the front of the list.
But you can still do it in an array.
So, suppose that we do use an array,
and let me go ahead and draw this
as follows.
Suppose that like hopscotch we
draw the queue for an Apple Store
like an array like this.
And here is the door of the Apple store,
so you want to be at location zero,
ideally.
1, 2, 3, 4, 5, 6-- so
this is how many people
can fit into our queue in this case.
So, suppose that Alice wants to buy an
iPhone and she gets to the store first.
Where should she go to keep things fair?
This is the queue, so we don't
want to put her into the corner,
so to speak, in our first example.
We want to put her at
the front of the list.
So, Alice belongs right
there, pretty straightforward.
Now, Bob arrives and he comes
in slightly after Alice,
so he gets to get behind Alice in line.
And so Bob is there, and maybe Charlie
arrives thereafter, and then so forth.
David maybe comes in fourth and beyond.
So, that's how people would
queue up, so to speak.
And now, when it's time
for Apple to open this door
and start selling iPhones, what happens?
We want to take Alice
out of the list first.
We want to dequeue Alice.
So, we need to start
remembering some information,
because it's not sufficient
now to just remove-- whoops,
it's not sufficient just
to remove Alice like this,
because suppose that we do keep
adding other people, person
d, e-- whoops, that's not an f.
OK, don't know what happened there.
Person f is here, g, h.
Suppose that Alice has bought
her phone and left the store,
and then Bob does the same.
He goes ahead and leaves the store,
and then Charlie leaves the store.
Where do I put person i who
maybe shows up a little late?
It would seem that I want to put
them at the end of the queue, which
makes good sense, but right now d, e,
f, g, and h are still in the queue,
and this is an array I proposed.
My data structure's an array, so I can't
just move d to the front of the line
easily.
I have to actually
shift him or move him,
and this might conjure up some memory
of our searching and sorting examples
where when we had our
humans on stage, we actually
had to physically move
people like an insertion sort
to make room for those elements.
And that's fine.
We can absolutely say, hey, David,
come over here please and person e,
come over here please, and that's
obviously what the Apple store does.
But when you're implementing
this idea in memory,
you can't just ask the numbers
themselves or the elements themselves
to do that moving.
You need to do it for them, and
that's going to cost you time,
and that's going to be a
price that you have to pay.
But I propose that we
can be clever here.
We do not need to incur the cost of
moving d, e, f, g h where Alice, Bob,
and Charlie previously were.
Where can we put person i instead?
I mean, there's obviously
room in the line,
so maybe why don't we
just put person i here?
But, again, we don't want to piss off
everyone who's already in the line.
So, if this now is an array, we
have to be mindful of the fact
that the front of this list has
to be remembered separately.
This data member here, front, really
should store not 0 in perpetuity
but rather 0, then 1, then 2, then 3.
It should store the
current front of the list.
And I need another
variable, presumably, called
size that keeps track of how many
elements are in the list, which
in this case is going to be six.
So with a queue, if I'm
implementing it using an array,
there's some added
complexity if I want to avoid
the inefficiency of moving
all of these elements
and incurring the kind of running times
we saw when we talked about searching
and sorting the other day.
There's no reason mechanically to move
d, e, f, g, h anywhere in the array.
We can in constant time, and maybe
our old friend the modulo operator
that you might have
used in [INAUDIBLE], we
can just figure out where
i, and j, and everyone else
should go so long as we
keep track separately
with a queue of what the
front of the list would be.
And why is this 3?
Well, if I continue numbering the
array like this, as we've often done,
you can now see that d is the head of
the list, or the front of the list.
And so, we should remember
his location there as 3.
But, again, what happens if
j, k, and then l shows up?
There is no room for l in this
world, not to mention m, n, o.
So what if we solved
this problem as before
by changing the array from having
some fixed capacity to having
no pre-determined capacity,
just use a pointer so that we
can use malloc to dynamically
allocate a big chunk of memory,
remember its capacity
ultimately, but also
remember the front and the
size of this data structure?
So, the same idea there might apply.
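Putting those pieces together, here is a sketch of a fixed-capacity queue whose front index wraps around using the modulo operator; CAPACITY and the helper names are my own, chosen for illustration:

```c
#include <stdbool.h>

#define CAPACITY 7  // illustrative, like the seven hopscotch squares

typedef struct
{
    int numbers[CAPACITY];
    int front;  // index of the current front of the line
    int size;   // how many elements are currently enqueued
}
queue;

// Adds n to the back of the line; returns false if the queue is full.
bool enqueue(queue *q, int n)
{
    if (q->size == CAPACITY)
    {
        return false;
    }
    // Modulo wraps the back index around past the end of the array,
    // reusing slots vacated by earlier dequeues-- no shifting needed.
    q->numbers[(q->front + q->size) % CAPACITY] = n;
    q->size++;
    return true;
}

// Removes the front element into *n; returns false if the queue is empty.
bool dequeue(queue *q, int *n)
{
    if (q->size == 0)
    {
        return false;
    }
    *n = q->numbers[q->front];
    q->front = (q->front + 1) % CAPACITY;  // front advances, wrapping too
    q->size--;
    return true;
}
```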
So, at the end of the
day, what have we done?
We've taken these new
building blocks, pointers,
and this notion of linking
things together using pointers
much like a linked list,
and we've looked back
at our data structure called an array.
And using these now, we can implement
what are generally called abstract data
types. A queue and a stack
do not have as low-level
a meaning as an array does, which
is a very technical concept.
And a linked list, too,
is a very technical concept:
it's nodes with pointers
linking things together.
A stack is like a stack
of cafeteria trays,
or a queue is something like people
lining up outside of an Apple Store.
These are abstract data types
that can be implemented clearly
underneath the hood in at least
a couple of different ways.
You can use an array and
keep things simple, a la
weeks two and beyond in
the class, but you're
going to paint yourself into
a corner by fixing their size,
as I did a moment ago,
by declaring this queue
and before it a stack to
be of a fixed capacity.
But now that we have pointers, and
malloc, and dynamic memory allocation,
and this spirit of linked
lists, we can change
that to actually be
numbers and actually just
remember where things
are underneath the hood.
And nicely enough, a stack
and a queue don't even
need me to stitch things
together like a linked list.
I just need malloc in
order to allocate those.
So, let's tie these two
topics together if you would
and compare and contrast them
by way of a wonderful animation
that a fellow educator
made and posted online
that we've abbreviated here to give us
a sense of the difference between stacks
and queues.
[AUDIO PLAYBACK]
[MUSIC PLAYING]
-Once upon a time, there
was a guy named Jack.
When it came to making friends,
Jack did not have the knack.
So, Jack went to talk to the
most popular guy he knew.
He went to Lou and asked what do I do?
Lou saw that his friend
was really distressed.
Well, Lou, began just
look how you're dressed.
Don't you have any clothes
with a different look?
Yes, said, Jack.
I sure do.
Come to my house and
I'll show them to you.
So, they went off to Jack's,
and Jack showed Lou the box
where he kept all his shirts,
and his pants, and his socks. Lou
said, I see you have all
your clothes in a pile.
Why don't you wear some
others once in a while?
Jack said, well, when I
remove clothes and socks,
I wash them and put them away in
the box, then comes the next morning
and up I hop.
I go to the box and get
my clothes off the top.
Lou quickly realized
the problem with Jack.
He kept clothes, CDs,
and books in a stack.
When he reached something
to read or to wear
he chose the top
book or underwear.
Then when he was done he
would put it right back,
back it would go on top of the stack.
I know the solution,
said a triumphant Lou.
You need to learn to
start using a queue.
Lou took Jack's clothes
and hung them in a closet,
and when he had emptied
the box he just tossed it.
Then he said, now, Jack, at the
end of the day put your clothes
on the left when you put them away.
Then tomorrow morning
when you see the sunshine,
get your clothes from right
from the end of the line.
Don't you see, said Lou.
It will be so nice.
You'll wear everything once
before you wear something twice.
And with everything in queues
in his closet and shelf,
Jack started to feel quite
sure of himself all thanks
to Lou and his wonderful queue.
[END PLAYBACK]
SPEAKER 1: All right, so let's
take a look at another data type,
this one known as a tree.
Because now that we have the ability
to stitch data structures together much
like a linked list, we
now have the ability
to stitch things together not just left
to right or top to bottom conceptually,
but in any number of directions.
And indeed, there's nothing
stopping us from having
one node linked to by way of
multiple pointers, multiple nodes.
So, for instance, this picture here
from a textbook is a tree structure.
And it's very much like the
family trees that you might
have drawn in grade school or the like.
But in this case, you
have just one root node,
the node at the top of the
data structure, so to speak,
from which everything else descends.
And that node is said to have children.
For instance, 2 and 3 are children
of the node number 1 here.
And then there's other semantics in
this world of trees in computer science.
Much like family trees,
anything that does not
have children-- like 5,
6, and 7, or 8 and 9--
would be called leaves of the
tree, because like the leaves
at the end of the branches,
there is nothing beyond them.
So, nicely enough we borrow
a lot of the language
from family trees and
actual trees in order
to discuss this data
structure known as a tree.
But why in the world would we want
to lay out data in a tree structure?
At first glance, we just seem to be
doing things because we can.
Because, for instance, suppose we had
these numbers-- 22, 33, 44, 55, 66, 77,
and 88.
They're clearly sorted.
And suppose that I wanted to lay
these out in a data structure
and be able to search them
efficiently, assuming the whole time
that they are indeed sorted.
Well, if we wanted to do that, we have
our old friend arrays from weeks ago.
And we also have our old algorithm
from Mike Smith, our binary search
algorithm, divide and conquer.
And we can find nodes in
this data structure super,
super fast in logarithmic
time, big O of log n.
So, we've solved that problem.
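As a refresher, that divide-and-conquer idea over a sorted array can be sketched iteratively like so:

```c
#include <stdbool.h>

// Iterative binary search over a sorted array of length ints;
// halves the search range each pass, so O(log n) comparisons.
bool binary_search(const int values[], int length, int n)
{
    int low = 0;
    int high = length - 1;
    while (low <= high)
    {
        int middle = low + (high - low) / 2;
        if (values[middle] == n)
        {
            return true;
        }
        else if (values[middle] < n)
        {
            low = middle + 1;   // discard the left half
        }
        else
        {
            high = middle - 1;  // discard the right half
        }
    }
    return false;
}
```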
But it turns out we don't necessarily
have to use an array laying out data
from left to right because, again, one
of the prices we pay of using arrays
where as we've realized
today is this finiteness.
At the end of the day, the
size of an array is fixed.
You have to decide in advance how
big your array is going to be.
So, what if you want to
add more numbers to it?
What if you want to remove
numbers for efficiency
and not waste so much memory?
You can't really do that with an array.
You can, but you have to
jump through some hoops.
You have to reallocate the array, as
with a function like realloc,
if you indeed used malloc in
the first place to allocate it.
But then you have to copy the
old array into the new array,
so it's all possible.
Nothing's impossible once you
have a keyboard at your disposal,
but it's a lot of work,
and it's more time,
and it's expensive, therefore,
both in terms of your time
and the computer's time.
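To make that concrete, here is a minimal sketch of growing a malloc'd array with realloc. The function name grow is my own illustrative choice, not code from the lecture; note, too, that realloc itself takes care of copying the old elements into the new block.

```c
#include <stdlib.h>

// Grow a malloc'd array of ints to hold new_count elements.
// realloc copies the old elements into the new block for us.
int *grow(int *array, size_t new_count)
{
    int *tmp = realloc(array, new_count * sizeof(int));
    if (tmp == NULL)
    {
        free(array);    // realloc failed; free the old array to avoid a leak
        return NULL;
    }
    return tmp;
}
```

The hoop-jumping the lecture mentions is exactly this: a new allocation, a copy, and careful error handling, every time the array needs to change size.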
But could we achieve
the beauty of divide
and conquer and binary search from
week zero without the constraints
that arrays impose?
And today, the solution to
all of our array problems
seems to be linked lists or
more generally pointers so
that, one, we can dynamically allocate
more memory with malloc when we need it
and then use pointers to
thread or stitch together
that node with any existing nodes.
So, indeed let me propose this
variant on a tree structure
that the world calls binary
search trees or BSTs.
Binary in this case means two, and this
just means that every node in this tree
is going to have at most
two children: 0, 1, or 2.
And now, in this case binary search tree
means that for every node in the tree
its left child is less than it and
its right child is greater than it.
And that's a recursive definition.
You can look at the root of this
tree and ask that same question.
55?
Is it greater than its left child?
Yep.
Is it less than its right child?
Yep.
That is the beginning, it would
seem, of a binary search tree.
But it's recursive insofar as
this is indeed a binary search
tree only if those answers are the same
for every other node in the tree.
33, is its left child smaller?
Yep.
Is its right child bigger?
Yep.
How about over here, 77?
Left child smaller?
Yep.
Right child bigger?
Yep, indeed.
How about the leaves of the tree?
Is 22 greater than its left child?
I mean, yeah, there is no child,
so yes, that's a fair statement.
It certainly doesn't violate
our guiding principle.
Is it less than its right child, if any?
Yes, there just isn't any.
And so this is a binary search tree.
And indeed, if you took a scissors and
snipped off any branch of this tree,
you would have another binary
search tree, albeit smaller.
But it's recursive and that definition
applies to every one of the nodes.
But what's beautiful here now is that
if we implement this binary search
tree, similar in spirit to how
we implemented linked lists
using not arrays but using pointers
and not one pointer but two pointers
whereby every node in this tree
apparently has up to two pointers--
let's call them not next but how about
left and right just to be intuitive.
Well, if every node has a
left and a right pointer,
now you can conceptually attach
yourself to another node over there
and another node over there,
and they too can do the same.
So, we have the syntax already with our
pointers with which to implement this.
But why would we?
Well, one, if we're using
pointers now and not an array,
I can very, very easily allocate
more nodes for this tree.
I can insert 99 or 11 really
easily, because I just
called malloc like I did before.
I put the number 99 or
11 inside of that node,
and then I start from
the root of the tree,
much like I start from the first
element in the linked list,
and I just search for its destined
location going left or right
based on the size of that value.
And what's nice, too, here is
notice how short the tree is.
This is not a linked list.
It's not a long list, whether
vertically or horizontally.
This tree is very shallow.
And indeed I claim that if we've
got n elements in this list,
the height of this tree
turns out to be log of n.
So, the height of this tree is
log of n, give or take one or so.
But that's compelling, because
how do I search this tree?
Suppose I am asked-- I'm trying to
answer the question is 44 on my list?
How do I answer that?
Well, we humans can obviously just look
back and it's like, yes, 44 is in it.
It's not how a computer works.
We have to start from what
we're given, which in this case
is going to be, as the arrow
suggests, a pointer to the tree
itself, a pointer to its first node.
And I look is this the number 44?
Obviously not.
55 is greater than 44, so I'm
going to go down to the left child
and ask that same question.
33, is this 44?
Obviously not, but it's
less than it so I'm
going to go down to the right child.
Is this 44?
Yes, and simply by
looking at three nodes
have I whittled this problem
down to my yes no answer.
And indeed, you can think
of it again with scissors.
I'm looking at 55 at the
beginning of this story.
Is 44 55?
No, 44 is less.
Well, you know what?
I can effectively snip off
the right half of that tree,
much like I tore that phone book in
week zero, throwing half of the problem
away.
Here I can throw essentially half of the
tree away and search only what remains
and then repeat that process again,
and again, and again, whittling
the tree down by half every time.
So therein lies our
logarithmic running time.
Therein lies the height
of the tree, so long
as I am good about
keeping the tree balanced.
There's a danger.
Suppose that I go ahead and start
building this tree myself in code
and I'm a little sloppy
about doing that.
And I go ahead and I insert, for
instance, let's say the number 33.
And it's the first
node in my tree, so I'm
going to put it right
up here at the top.
And now suppose that the
next number that just happens
to get inserted into this tree is 44.
Well, where does it go?
Well, it has no children
yet, but it is bigger,
so it should probably go over here.
So, yeah, I'll draw 44 there.
Now, suppose that the
inputs to this problem
are such that 55 is inserted next.
Where does it go?
All right, 55, it's bigger,
so it should go over here.
And then 66 is inserted next.
All right, it goes over
here-- never mind that.
So, what's happening to
my binary search tree?
Well, first of all, is
it a binary search tree?
It is because this node is
bigger than its left child,
if any-- there just isn't any--
and it's less than its right child.
How about here, 44?
It's bigger than its left child,
if any-- because there is none--
and it's smaller than its right child.
The same thing is true for 55,
the same thing is true for 66.
So, this is a binary search tree and
yet somehow what does it look like?
It looks like a linked list, right?
It's at a weird angle.
I've been drawing
everything horizontally,
but that's a meaningless
artistic detail.
It devolves potentially
into a linked list.
And so, binary search trees if
they are balanced, so to speak,
if they are built in the right order
or built with the right insertion
algorithm such that they do
have this balanced height,
this logarithmic height, do afford
us the same logarithmic running time
that the phone book example did and
our binary search of an array did.
But we have to do a little
bit more work in order
to make sure that these
trees are balanced.
And we won't go into detail
as to the algorithmics
of keeping the tree balanced.
But realize, again, there's
going to be this trade-off.
Yes, you can use a binary search tree or
trees more generally to store numbers.
Yes, they can allow you to achieve that
same binary or logarithmic running time
that we've gotten so
used to with arrays,
but they also give us
dynamism such that we
can keep adding or even removing nodes.
But, but, but, but it
turns out we're going
to have to think a lot harder about
how to keep these things balanced.
And indeed, it's in higher-level
CS courses, courses
on data structures and
algorithms, that you'll
explore concepts along
exactly those lines.
How would you go about implementing
insert and delete into a tree
so that you do maintain this balance?
And there are yet more variants
on these kinds of trees
that you'll encounter accordingly.
But for our purposes, let's consider
how you would implement the tree itself
independent of how you might
implement those actual algorithms.
Let me propose this type of node.
Again, node is just a very
generic term in programming,
usually a container for
one or more other things, and this time
those things are an
integer-- we'll call it n
but it could be called
anything-- and two pointers.
And instead of next, I'm going to just
by convention call them left and right.
And as before, notice that I do
need to declare struct node up
here or some word up here.
But by convention I'm just going to do
typedef struct node, because C reads
things top to bottom, left to right.
So if I want to refer to
a node inside of a node,
I need to have that vocabulary, per
this first line, even though later on I
just want to call this
whole darn thing a node.
And so, that's the distinction.
This actually has the side
effect of creating a data
type by two different names.
One is called struct node, and
you can literally in your code
write struct node something,
struct node something.
It just feels unnecessarily
verbose, so typedef
allows you to simplify
this as just node, which
refers to the same structure.
But this is necessary for this
innermost implementation detail.
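In code, the node just described might look like this:

```c
#include <stddef.h>

// A node in a binary search tree: an integer and two child pointers.
// The "struct node" name up top lets the struct refer to itself;
// the trailing "node" is the typedef'd shorthand used elsewhere.
typedef struct node
{
    int n;
    struct node *left;
    struct node *right;
}
node;
```

Both names refer to the same structure: struct node is the verbose form the compiler needs inside the definition, and node is the convenient alias for everywhere else.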
So, now that we have the
ability with a structure
to represent this thing, what
can we actually do with it?
Well, here is where recursion
from a few weeks ago
actually gets really compelling.
When we introduced that sigma example
a while ago and talked in the abstract
about recursion, frankly, it's kind of
hard to justify it early on, unless you
actually have a problem that lends
itself to recursion in a way that
makes sense to use recursion
and not just iteration,
loops-- for loops, while
loops, do while, and the like.
And here we actually have a
perfect incarnation of that.
What does it mean to search
a binary search tree?
Well, suppose I'm
searching for a number n
and I'm being given a pointer
to the root of the tree,
and I'll call it tree.
So, just like when I was
searching a linked list,
I'm given two things, the
number I'm looking for
and a pointer to the first thing in
the data structure-- the first thing
in a linked list or the
first thing in a tree.
And in this case, we would
call that first thing
in a tree a root, generally speaking.
So, the first thing I had
better do in my search function
is check, wait a minute.
If tree equals equals
null, don't do anything.
Do not risk touching any pointers,
because as you may have gleaned already
or soon will with some
of CS50's problems,
you will cause quite probably
a segmentation fault,
a memory-related problem in your
program such that it just crashes.
It literally says segmentation
fault on this screen
if you touch memory that you should not.
And you should not touch null values.
You should not go to null values.
You should not do star of any
value that itself might be null.
And so, if tree equals equals null
is super, super important here,
because I want to make
sure to immediately
say, well, if you hand me null, that's
like handing me no tree whatsoever.
So, my answer is obviously false.
N can't be in a non-existent tree.
But we need that condition
up top, because the next case
is the following.
Else if n-- the value
we're looking for-- is less
than the value of n in
this node-- tree, recall,
doesn't refer to the whole
thing, per se, in this context.
It refers to the current
node that we've been
passed, which at the beginning of
the story is the root of the tree.
So, if the number n in the root of
the tree is greater than the number
we're looking for, we
want to go to the left.
Else we want to go to the right
and search the right subtree.
So, what's the syntax here?
If the n we're looking
for, like 44, is less
than the value at the current node,
55, then what do we want to do?
We want to call search, still
searching for the same number n
but searching on the left subtree.
And how do you pass in a
pointer to the left tree?
Well, you have in tree a
pointer to the root node.
Tree arrow left just means go to my left
child and pass that value in instead,
pass its address in instead.
Meanwhile, if the number
you're looking for
is actually greater than the
value in the current node, search
the right subtree, else return true.
Because if the list is not null-- if
there is actually a list and the number
you're looking for is not
less than the current node
and it's not greater than the current
node, it must be the current node,
so you can return true.
But there's one important detail here.
I didn't just call search.
I called return search in each
of these two middle cases.
Why is that?
Well, this is where recursion gets
potentially a little mind bending.
Recursion is the act of a
function calling itself.
Now, in and of itself, that sounds bad,
because if a function calls itself,
why wouldn't it call itself again, and
again, and again, and again, and again,
and just do this
infinitely many times such
that now you get a stack overflow
where all of those frames on the stack
hit the heap and bad things happen.
But no, recursion works beautifully
so long as every time you recurse,
every time a function calls itself it
takes a smaller bite of the problem.
Or rather, put another
way, it throws away
half of the problem, as in this case,
and looks only at a remaining half.
Because if you keep shrinking,
shrinking, shrinking, shrinking
the problem, you will
eventually hit this base case
where either there is no more tree
or you're looking right at the node
that you want to find.
And so, by returning
search on tree's left,
or returning search on tree's
right, you're deferring the answer.
When you, the search
function, are called and asked
is the number 44 in
this tree, you might not
know because the node you're looking
at at the beginning of the story
was again 55.
But you know who does know?
I bet my left child will know
the answer to that if I just
ask it by passing it-- passing to
search a pointer to it, my left child,
and passing in that same number 44.
So, saying return search is
like saying I don't know.
Ask my left child.
Or I don't know, ask my right child
and let me return as my answer
whatever my child's answer is instead.
So, you could do this same
function using iteration.
But you could solve it arguably much
more elegantly here using recursion,
because a data structure like
this-- like a binary search tree,
which again is recursively
defined-- each node is conceptually
identical, if numerically
different from the others,
allows us to apply this algorithm,
this recursive algorithm
to that particular data structure.
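Putting those pieces together, the recursive search function just described might look like this; the argument order and names here are one reasonable choice:

```c
#include <stdbool.h>
#include <stddef.h>

typedef struct node
{
    int n;
    struct node *left;
    struct node *right;
}
node;

// Search a binary search tree for n, recursively throwing away
// half of the (balanced) tree at every step.
bool search(int n, node *tree)
{
    if (tree == NULL)
    {
        return false;                    // base case: no tree, so no n
    }
    else if (n < tree->n)
    {
        return search(n, tree->left);    // defer to the left subtree
    }
    else if (n > tree->n)
    {
        return search(n, tree->right);   // defer to the right subtree
    }
    else
    {
        return true;                     // not less, not greater: found it
    }
}
```

The two recursive calls are the "ask my child" moments: each one returns whatever answer the subtree comes back with.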
Now, let's look at a
more concrete incarnation
of trees that allows us to do something
pretty neat and pretty real world.
Indeed, this is another problem borne
of a real world domain of compression.
We talked a couple weeks
ago about encryption,
the art of concealing or
scrambling information.
Compression, meanwhile, is the art
of taking something that's this big
and compressing it to make it smaller,
ideally without losing any information.
It's pretty easy to take
a 10 page essay that's
maybe-- that was supposed
to be a five page essay
and just remove paragraphs from
it or remove sentences from it.
But that changes the meaning of
the paper, makes it a worse paper,
even though you're compressing
it by making it smaller.
No, what most students would typically
do, if you've written 10 pages
and it needs to fit into five,
you really, really, really
shrink the font size or
increase the margins.
Or maybe more realistically you
write a five page paper that's
supposed to be a 10 page paper,
and so you increase the font size
or increase the margins so as to
expand or decompress the essay.
So, similarly here, what if
we wanted to compress text,
but we want to do it
losslessly in a way that we
don't lose any information
by just throwing away
characters, or paragraphs,
or pages, but we
want to use the system with which
we're familiar from week zero.
So ASCII, again, is just this code,
this mapping of letters to numbers.
And so, A is-- capital A is 65
and that's some pattern of bits,
but it's some pattern of
8 bits-- 7 historically,
but really 8 bits in
practice So every one
of the characters in the
English alphabet, at least here,
takes up 8 bits.
Now, that sounds fine.
That allows us to express as many
as 256 possible characters, which
is more than enough for English
characters, plus some punctuation
and so forth.
But it seems wasteful.
I type A, E, and I, maybe
O and U pretty often.
I use the vowels often.
B and D, I feel like I use those a lot.
I don't really type Q all
that much, Z all that much.
So, there are certain
letters that I just
feel like I don't type them
that often, and indeed,
probably if we analyzed a dictionary,
we wouldn't see them as frequently
as other letters.
Indeed, if you've ever played
or watched Wheel of Fortune,
certainly all the
contestants on that show
know which are the most popular
letters in English words.
And it seems silly and
perhaps inefficient--
certainly for a computer
scientist-- that we are not somehow
embracing the fact that some letters
are more commonly used than others,
and yet we are just blindly using
8 bits, the same amount of memory,
for every darn letter in our alphabet.
Why?
If you keep writing a certain
letter again and again,
why not use fewer bits for
the more popular letters,
and more bits for the
less popular letters
so that at least you're optimizing
for the common case, so to speak?
Well, it turns out that someone
named Huffman years ago did
figure this out and introduced what's
generally known as Huffman coding.
And, at first glance, it's a little
similar in spirit to something
some of you might have grown up learning
a little something about called Morse
code, but it's better
in a couple of ways.
Morse code is typically transmitted with
electrical signals or audible signals.
It has dots and dashes
where a dot is a quick beep
and a dash is a slightly
longer beep, and you
can use those series of dots and
dashes, as per this chart here,
to represent letters of the
alphabet and some numbers.
The one problem, though, as efficient
as this seems-- and by efficient
I mean look at E. Mr. Morse realized
that it is super popular, so he
used literally the shortest symbol
for it, just a dot, a simple blip,
to represent an E. And,
meanwhile, as I kind of imagined,
Z is not that common,
so dash, dash, dot,
dot is longer than just a single dot.
So Z is probably less popular,
and that's why he did this.
And Y may be even less
popular-- dash, dot, dash, dash--
I don't know why I'm using this voice.
But it's longer than E, so we
optimized for the shorter characters.
Unfortunately, suppose that you receive
the message dot, dot, dot, dot, dot,
dot, so six dots in a row, and I
technically paused in between them.
Six dots, what message
did I just send you?
 
Six dots.
 
So, I wanted to say hi, so I said
dot, dot, dot, dot, which is H,
and then dot, dot which is I. I
should not have paused between them,
because the whole point of Morse code
is to do this as quickly as possible,
even though you probably do want
to pause to resolve ambiguity,
and indeed, that's the problem.
I wanted to send you
hi, H-I, but maybe I
just sent you E, E, E,
E, E, E, six Es in a row,
because those two were just dots.
So, in other words, Morse code
is not immediately decodable
when you're reading, or hearing,
or seeing the dots and dashes come
over the wire, so to speak,
because there's these ambiguities,
unless this transmitter
does indeed pause,
as I accidentally did there, to give
you a moment to take your breath
and realize, oh, that was an H. That's
an I. As opposed to E, E, E, E, E, E.
So, it's not necessarily the best
system in so far as some letters
share prefixes with other letters.
In other words, I, dot dot, has a
common prefix with E. Both of them
start with a single dot.
It just so happens that
I is a little longer,
and that can lead
potentially to ambiguity,
and it certainly means that the
transmitter should probably slow down.
So, the whole system is meant to
be super fast, super efficient,
but you probably should
pause between certain letters
so that the recipient
doesn't get confused as
to the message you're actually sending.
Well, thankfully Huffman coding--
which as we'll see in a moment
is based on trees-- does
not have that ambiguity.
It is immediately decodable.
And suppose for the sake of
discussion, as per this example here,
you just have a whole bunch of
text that you want to transmit.
This is meaningless.
There's no pattern in these
As, Bs, Cs, Ds, and Es,
but if you go through and count them
up, each of these letters-- A, B, C, D, E--
occur with some frequency in this text.
So, it's meant to be representative
of an essay, or a message,
or whatever that you
want to send to someone.
Indeed, if you count up all of the
As, Bs, Cs, Ds, and Es, and divide
by the total number of
letters, it turns out
that 20% of the characters in that
random string are As, 10% are Bs,
10% are Cs, 15% are
Ds, and 45% are Es, so
it's roughly consistent
with what I'm claiming,
which is that E is pretty popular.
So, intuitively, it
would be really nice
if I had an algorithm that came up
with some representation of bits
that's not just 8 bits
for every darn letter
but that is a few bits
for the popular letters
and more bits for the
less popular letters,
so I optimize, again, so to
speak, for the common case.
So, by this logic E, hopefully,
should have a pretty short encoding
in binary, and A, and B, and C, and D
should have slightly longer encoding,
so that again if I'm using E a lot I
want to send as few bits as possible.
But I need this algorithm
to be repeatable.
I don't want to just arbitrarily
come up with something
and then have to tell you in advance
that, hey, we're using this David Malan
system for binary.
We want an algorithmic process here.
And what's nice about trees is that
it's one way of seeing and solving
exactly that.
So, Huffman proposed this.
If you have a forest of nodes, so to
speak, a whole bunch of trees-- each
of size one, no children-- think of them
as each having a weight or a frequency.
So, I've drawn five circles
here, per this snippet
from a popular textbook that has 10%,
10%, 15%, 20%, 45% equivalently in each
of those nodes.
And I've just labeled the
leaves as B, C, D, A, E,
deliberately from left to right because
it will make my tree look prettier,
but technically the lines could cross
and it's not a big deal in reality.
We just need to be consistent.
So, Huffman proposed this.
In order to figure out the so-called
Huffman tree for this particular text,
in order to figure out what to encode
its letters as with zeros and ones,
go ahead and take the two smallest nodes
and combine them with a new root node.
So in other words, B
and C were both 10%.
Those are the smallest nodes.
Let's go ahead and combine
them with a new root node
and add together their weights,
so 10% plus 10% is 20%.
And then arbitrarily,
but consistently, label
the left child's edge or arrow as a
0 and the right child's edge as a 1.
Meanwhile, repeat.
So, now look for the two smallest nodes.
And I see a 20%-- so
ignore the children now.
Only look at the roots of these trees.
And there's now four trees, one of which
has children, three of which don't.
So, now look at the smallest roots
now and you can go left to right here.
There's a 20%, there's a 15%,
there's a 20%, and a 45%.
So, I'm not sure which one
to go with, so you just
have to come up with some
rule to be consistent.
I'm going to go with
the ones on the left,
and so I'm going to combine the 20%
with the 15%, the 20% on the left.
Combine their weights.
That gives me 35% in a new root,
and again label the left branch 0
and the right branch a 1.
Now, it's not ambiguous.
Let's combine 35% and 20%
with a new root that's 55%.
Call its left branch
0, its right branch 1.
And now 55% and 45%, combine
those to give us 100% at the root.
So why did I just do this?
Well now I have built up the so-called
Huffman tree for this input text
and Huffman proposed the following.
To figure out what
patterns of zeros and ones
to use to represent
A, B, C, D, E, simply
follow the paths from the
root to each of those leaves.
So what is the encoding for A?
Start at the root and then
look for A-- 0, 1, so 0,
1 shall be my binary
encoding for A in this world.
How about B?
0, 0, 0, 0 shall be my binary
encoding for B. How about C?
0, 0, 0, 1 shall be my
encoding for C. How about D?
0, 0, 1.
And beautifully, how about E?
1.
So, to summarize, what
has just happened?
E was the most popular letter, B and
C, were the least popular letters.
And if we summarize these,
you'll see that, indeed,
B and C got pretty long encodings,
but E got the shortest encoding.
And it turns out
mathematically you will now
have a system for encoding letters
of the alphabet that is optimal,
that is you will use
as few bits as possible
because you are biasing things
toward short representations
for popular letters, longer
representations for less
popular letters.
And mathematically this gives
you the most efficient encoding
for the original text without
losing any information.
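As a sanity check, the full mapping we just derived can be captured as a simple lookup; this function is a sketch of my own, just tabulating the paths read off the tree (left edge 0, right edge 1):

```c
// The Huffman encodings derived above, as a lookup function.
// Only A through E appear in this example text.
const char *encode(char c)
{
    switch (c)
    {
        case 'A': return "01";
        case 'B': return "0000";
        case 'C': return "0001";
        case 'D': return "001";
        case 'E': return "1";
        default:  return "";    // no other symbols in this alphabet
    }
}
```

Notice that no encoding is a prefix of any other, which is exactly why a stream of these bits is immediately decodable, unlike Morse code.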
In other words, now if Huffman
wanted to send a secret message
to someone in class or over the
internet, he and that recipient
simply have to agree on
this scheme in advance
and then use these encodings
to transmit those messages.
Because when someone receives 0, 1 or 0,
0, 0, 0 or 0, 0, 0, 1 from Mr. Huffman,
they can use that same
look-up table, if you will,
and say, oh, he just sent me an A
or, oh, he just sent me a B or C. So,
you have to know what
tree Huffman built up.
And, indeed, what typically
happens in actual computers is
when you use Huffman coding
to compress some body of text
like we just have here, you
store the compressed text
by storing your As, Bs, Cs,
Ds, and Es and other letters
using these new
encodings, but you somehow
have to embed in that file in the
compressed file the tree itself
or this cheat sheet of encodings.
So, with compression-- maybe
you're compressing a Microsoft Word
file, or a dot TXT file,
or any other type of file,
you have to store not just the
compressed text using these shorter
representations-- not 8-bit ASCII, but
these shorter representations-- but you
also somewhere, maybe at the beginning
of the file or at the end of the file,
somewhere where someone else can find
it, you need to store this mapping
or you need to store the tree
itself in some digital form.
And so, it's possible
by this logic that you
might try to compress
a really small file,
and that file could
actually become bigger
because you're storing a
tree inside the file to--
with which to recover
the original information.
Or better yet, most
algorithms or most actual
compression programs will realize,
wait a minute, if compressing this file
is actually going to make it bigger,
let's just not compress it at all
and leave it alone untouched.
So, what if you take a compressed file
and compress it again, and compress it
again, and compress it again?
A dangerous assumption to make
is, well, I could just maybe keep
compressing that video file again,
and again, and again, and again,
and I can maybe compress my big
essay, or my big video file,
or big music file to just maybe one bit.
Right?
That's the logical extreme,
just keep compressing,
compressing, compressing, compressing.
But, of course, that
can't possibly make sense,
because if you compress some file
down to just a single bit, 0 or 1,
you've clearly thrown away information
and can't possibly recover it all.
So, at some point, too, you've hit this
lower bound on the size of the file
until you need to start throwing
actual information away.
At some point, the file just has so
much entropy, appears to be so random,
there really is no pattern to start
to leverage to compress it further.
And so, there generally is some
maximum amount of compression
you can apply to something.
So, how would we represent this?
Let's whip out a C struct here.
So, this time each of the
nodes in a Huffman tree
need a little something different.
They need, at least in the
leaves, some kind of character
to remember the symbol.
Now, technically only the leaves
need to know what symbols they are,
so it's a little redundant
to have this in every node,
but we can keep things simple and use
the same type of node for everything.
Float frequency: I could use an integer
and treat it exactly as a percentage,
or I can use a float, as the nodes
were, with 0.1 and 0.45 and so forth,
and I'll call that frequency.
And then each of those nodes
needs a left child potentially
and a right child potentially.
And, again, I'll call
these things a node.
So, again, it's getting a little more
involved this node, but it still allows
me to represent it ultimately in C.
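That node, as just described, might look like this in C:

```c
#include <stddef.h>

// A Huffman-tree node: a symbol (meaningful only in the leaves),
// a relative frequency, and up to two children.
typedef struct node
{
    char symbol;
    float frequency;
    struct node *left;
    struct node *right;
}
node;
```

It's the same typedef pattern as before, just with a char and a float alongside the two pointers.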
And now, it's time to pursue lastly
the holy grail of data structures,
if you will.
Thus far, we've been solving
problems, creating new problems,
trying to solve those again.
And the problems we've been exploring
this week are things like dynamism,
if we want to be able to grow
or shrink our data structure.
Malloc and pointers
give us that flexibility
but might cost us a bit
more time, because we
have to keep things
sorted differently or we
have to follow all of those pointers.
And so, a lot of the algorithms
we've been discussing today
at least have-- like linear time,
searching, or inserting, or deleting
potentially like in a linked list.
Better still would be
something logarithmic
like a balanced binary search tree,
so still preserving that nice binary
aspect from week zero.
But the holy grail of a data
structure for its operations
is Big O of 1 so to
speak, constant time.
If you are searching, or inserting,
or deleting, and somehow changing
a data structure, wouldn't it be
amazing if every darn operation
takes just one step, or
maybe two steps, or three
steps but a constant number of steps?
Now, it might be a little
naive for us to expect
that we can store an arbitrary
amount of data in some fancy way
that we get constant time,
but maybe just maybe if we're
clever we can get close to that.
So, let's introduce a step toward that.
It turns out there exists in this
world things called hash tables.
And a hash table can be
implemented in any number of ways,
but you can think of it
really as just an array.
So, for instance, this might be a way of
representing a hash table called table,
whose first location is bracket zero
and whose last location is bracket
n minus 1 for however long this is.
And I just left it as blanks.
I don't even know what this
hash table might want to store.
It could be numbers, it could
be names, it could be letters,
it could be anything we want.
But hash table has this
nice theoretical property
that if well-designed
and thought through,
you can maybe just maybe get
constant look up time in it.
And let's do a simple
example of a hash table.
Hash tables are often nicely
thought of as buckets,
so we borrowed these from the loading
dock outside just a little moment ago,
and we've attached thanks to
Arturo some of these signs to them.
This is going to be Z, so
I'll just put this over here.
This is going to be C, so I'll put
this over here, and B here, and A.
And we thought we might get chased
away by the folks on the loading dock,
so we didn't bother getting D
through Y, so we'll just pretend
that we have 26 such buckets here.
And suppose that the goal
at hand is-- I don't know,
it's like at the end
of an exam, so we've
got our old blue books that a
class might use for students
writing essays in some class.
And it's time for the students
to come submit their blue books.
Now, we could just collect them all
and make a big mess as would generally
be the case, or we can be a little
more methodical to at least make
our jobs easier.
Now, at the end of the day, what's going
to be interesting about hash tables
is that there's going
to be this distinction
between actual benefits and
theoretical benefit, or lack thereof.
So, we'll come to that in just a
moment, but here's A, B, C, D, and Z.
And you know what?
I just am going to ask the students
in this class-- there are so
many people in the room
after an exam, I just
want them to at least make
my life 1/26 as difficult
by putting all the As over there,
all the Bs here, all the Cs here,
all the Zs here, so that I don't have
a massive mountain of As through Zs
that I have to sift
through individually.
It would just be nice if
they do the first pass
of bucketizing the values based on
the first letter in their last name.
In other words, my hash
function, my algorithm,
is going to be for each student
to consider his or her last name,
look at the first letter therein,
and put his or her exam
in the appropriate bucket.
So, here is, for instance,
someone with the letter
C. I'm going to put
that blue book in here.
Here's someone with the letter
A. That one's going to go here.
Letter Z?
This one's going to go over here.
Letter B?
This is going to go over here.
C, and B, and Z, and all of
the letters of the alphabet
in between.
So, hashing really has this
visual and conceptual equivalence
of putting something in this bucket,
putting something in that bucket,
putting something in this
other bucket, ultimately
bucketizing all of your elements.
And you can think of this,
frankly, as just an array,
but it's not just an
array with one spot.
It looks like I can stack multiple
numbers or multiple blue books inside
of that array.
So, we're going to have to come
back to that, because this clearly
can't be an array.
Normally, a slot in an array would be
filled the moment you put one value in it.
But this hashing is
the interesting part.
The juicy ingredient today is if I take
into account as input what it is I'm
trying to store, use some piece of that
information to decide where to put it,
that's an algorithm, because
I can repeat that process,
so long as it's not random.
You go over here, you go over here.
That's amazing.
Wow, OK, pushing my luck.
OK, so I'm not just randomly
putting things here.
I'm actually giving some thought
as to where I'm putting things,
and that makes the algorithm
deterministic, repeatable, predictable
so that if you insert something
now, you can absolutely
find it if present later.
Unfortunately, if our
hash table does look
like this, just a simple array from
bracket 0 to bracket n minus 1 dot,
dot, dot in between,
and it's just an array
for integers or an array for strings or
whatever, once you put something here,
or here, or here, that's it.
There is no more room to
put another element there,
however wide I might have drawn this table.
If there's an int there, that's it.
So, what could you do?
Suppose that you do have an
array structure like this,
and that limit of one element
per slot is unacceptable.
You have a whole bunch of elements
and a table that looks like this.
And maybe it's just where you're
supposed to take attendance or put
people's names.
So, if you say, oh, Alice is here today.
Let me go ahead and hash on Alice's
name and put her where the As should go.
Oh, Zoe is here, Z-O-E, so
we'll put her down there.
And then who else?
Alex is here.
Dammit, Alex, no room for
you in our hash table,
because Alice is already there.
This is stupid.
If we have data we want to
insert into this data structure,
it would seem that I have 24 available
spots into which I could put Alex
and yet I'm just stubbornly trying
to put him where only the As belong.
So, in this kind of
scenario, I need to put Alex in here.
I clearly have space.
You know what?
Let me just probe the array looking
for the first available spot.
OK, Alex, you're just going to go here,
and if someone else like Erin appears,
fine.
You just are going to go over here.
So, you try to put the
As where you want them to go,
but if there's already
someone there, just
probe deeper into the data structure
looking for the first available slot.
So, this is a general technique in
programming called linear probing
whereby you have a data structure.
If you hash to some location like
the letter A there's a collision,
something is there, you probe
further in the data structure just
looking for some place you can put it.
So, you get close to constant
time decision-making.
Put A here, put Z here.
And because this is an array, you have
random access with your square bracket
notation, but if you have lots of As
and not too many Zs, or Bs, or Ds,
it's possible this approach could
devolve back into linear time.
So, in the ideal case, where we have one
A, one B, one Z, and everything
in between, that's constant time.
We have our holy grail, constant
time operations for a data structure,
but not if we want to support
insertion of other elements,
even those that hash
to the same location.
So, what's the fix?
Well, if the problem
is that we've already
made room-- we already have used
this space for Alice, you know what?
If we need to put someone else
here, why don't we just create
dynamically some more space?
We have malloc now.
We have dynamic memory allocation.
Why don't we just extend our data
structure laterally, horizontally--
artistically here-- so that, yes,
you try to go to that first location.
But if there's multiple people
that are meant to go there,
multiple values, go ahead
and just link them together,
thereby merging the idea of a hash table
and a linked list with a data structure
that might look like this.
So, this is an example, somewhat
arbitrary, of 31 days out of a month.
And if you actually hash
on people's birth dates,
as I think this author did, you
can think of your hash table
still as an array.
But that array does not store
strings, it does not store integers.
It only stores pointers, 31 total
in this case-- some of which
might be null, per the
vertical diagonal slash--
but those pointers in turn point
to the beginning of linked lists.
So, if multiple people were born
on the fourth of some month,
you would put J. Adams and W. Floyd
in a linked list at that location.
If both Aaron, and Alex, and Alice,
and other students with the names A
all belong at that first location
in my previous table, that's fine.
Just string them together
with a linked list.
Much like with these buckets, at the end
of the day, I'm still creating piles.
And at the end of the day, I still have
to go through them all, ultimately.
But each of these piles
is 1/26 the size it
would have been if everyone just
came up at the end of the exam
and piled all their
books in the same pile.
So, whereas these data structures
are, at the end of the day,
still devolving, if you will,
into linear time operations
in the worst case, if these things
just get really long and stringy,
at least in actuality they might be as
good as 1/31 as long or 1/26 as tall.
And so, now there's this
dichotomy in this week five
of asymptotic running time,
the theoretical running
time that we've really been belaboring
and the actual running time.
Just because something is n
squared does not mean it's bad.
If there's only a few
elements, n squared is great.
It's going to happen super fast
if your computer is 1 gigahertz,
or 2 gigahertz, or faster these days.
N squared in and of itself isn't bad.
It just gets really bad
when your data gets large.
But in practice, even n squared divided
by 2 is actually better than n squared.
So, a couple weeks ago when I was
saying don't worry about the lower order
terms, the constant terms,
focus only on n squared
and not n or anything you're dividing
by, that's fine theoretically,
but in actuality you're going
to feel that kind of difference.
So, here's one last data structure
that we'll call a trie-- so trie,
short for retrieval somehow,
T-R-I-E, but pronounced try.
And this one is cool
because this now is really
like a weird offspring of these
data structures from today.
But it's a tree each of
whose nodes is in an array.
And a trie is really good for storing
words like words in a dictionary.
Indeed, one of the problems
I have for you in CS50
is going to be to implement a spell
checker, which effectively means building
a dictionary in memory,
and you'll be challenged
to spell check words as
fast as you can, storing
as many as 100,000 English words
somehow in your computer's memory
and answering questions of the form
is this a word, is this a word,
is this a word.
That's, after all,
what spell checking is.
So, a trie is kind of interesting in
that-- and this is an excerpt of an
artist's rendition thereof--
the root node here
is this rectangle here, and that of course
looks like an array.
And notice what's implicit in this.
If this is location A
and this is location Z,
the author here has just
decided to only show you
those letters that matter
for the sake of discussion.
But the fact that the M
location here is not blank
means there's a pointer there.
Indeed, what are these arrays?
They are arrays of
pointers to other nodes.
So, the fact that M is not null and it
leads to this node, and notice that A
is not null and it leads to this node,
and then this node, and then this node.
And this is where the artist
is just taking some liberties.
This tree would be monstrously
wide, because all of these arrays
are so darn wide, so he or she is just
showing you the elements
that we care about-- M, A, X, W, E, L,
L, and then some special sentinel symbol
delta, but it could be anything.
This is null, really.
This is how using a trie a programmer
could store the name Maxwell,
M-A-X-W-E-L-L, by simply leaving
little bread crumbs, if you will,
from one node to another such that
each of those elements in the array is
a pointer to another array.
And if you keep following these
pointers, following the bread crumbs
and you eventually find yourself
at this special sentinel value--
and actually, it wouldn't be null, it
would be like a Boolean saying true.
This is a word. You can, just by storing
a single yes or no at this location way
down here, implicitly reveal that
M-A-X-W-E-L-L was in fact a word.
Let's follow another.
So, let's say Turing,
T-U-R-I-N-G, check, Boolean true.
Turing is in this dictionary as well.
So, if there are bread crumbs that
lead to null, that word is not in here.
So, apparently there are no
names starting with A through L,
and none starting with U through
Z or some of the letters in between,
because those pointers are
implicitly and pictorially null.
But let's consider, then, what is the
running time of inserting or looking up
a name in a trie? Thus
far, pretty much all of the data
structures we've talked about
have had pretty slow running times,
linear in the worst case.
So, if we used an array
to store people's names,
or we used a linked list to store
people's names, in the worst case
we had linear running time, unless
maybe we sort things, but even
then that costs us some time.
So, linear, or maybe logarithmic,
was the best we could do.
And even with a hash
table, whereby, maybe we
store Maxwell at the M
location in our table,
he might still have a linked list of
a whole bunch of other M people.
That, again, can devolve into
something linear, a linear linked list.
But what about a trie?
To answer the question, is
Maxwell in a trie, what do we do?
We start at the root and we follow
the pointer that represents M,
and then we follow the pointer there
that represents A, then X, W, E, L, L,
and at the end of that series
of steps we look for a true/false value.
And if it's true, yes, Maxwell is here.
What about Turing?
Well, we start at the
pointer that represents
T, then U, R, I, N, G, then check.
Oh, true.
Turing is in there.
Let's look for David.
No, false.
There's not even a pointer there.
David is not in this dictionary.
So, how many steps did that each take?
To tell whether Maxwell
was in the dictionary,
it was M-A-X-W-E-L-L and then look at
the Boolean, so that was eight steps.
And to look up Turing was
T-U-R-I-N-G. And then that Boolean,
that was seven steps.
Those numbers have nothing to do with
how many words are already in the trie.
There are only a dozen or so here,
but there might be thousands
of actual words in this
dictionary, and we're still
going to find Alan Turing by way
of T-U-R-I-N-G Boolean seven steps,
and M-A-X-W-E-L-L Boolean, eight steps.
It doesn't matter how many other
data elements are in this trie.
And that's what's
powerful, because if there
is an upper bound on the
number of letters in an English
word-- which is kind of true.
I've rarely typed words
that are longer than, I don't
know, 10 characters, 15 characters.
At some point there
might exist these words,
but no one actually says
or types these words.
Those are effectively constants.
The maximum length of a word in
English is surely some constant,
because there is one
word that's the longest.
That's a constant value,
which means inserting a name,
or searching for a name, or
removing a name from a trie
does depend on the length
of the name, but it does not
depend on how many pieces of data
are already in the data structure.
And as such, it is constant time.
So, now in C, we have a
whole bunch of new syntax
with which to represent data
structures, namely actual structs in C,
and we have pointers, and
we have malloc with which
we can build more interesting
shapes, if you will, in memory.
And we now have a number of abstract
data types and actual data structures
we can build using these ingredients
with which we can now solve problems
that are going to demand all the more
resources, all the more time, all
the more space, in which case
efficiency and good design
are going to be ever more important.
All this and more next time.
 
AUDIENCE: She thought she
was doing the right thing.
 
[AUDIO PLAYBACK]
-Tell me more.
-David was sure it had to be the
muppet, something called muppet mode,
but the pressure was too much.
-This is Mario in muppet mode.
Take 23.
 
[HONKING]
 
-What's happening?
I thought this is what
you always wanted,
to star in the walkthrough videos,
to have all of YouTube's eyes
watching you.
[HONKING]
 
-Yes, you are.
You have to be.
Now stand up straight, tuck in
your shirt, look into the camera!
Take it again from the top.
 
