[Week 6]
[David J. Malan] [Harvard University]
[This is CS50.] [CS50.TV]
This is CS50, and this is the start of Week 6,
so a couple of new tools are now available for you to take advantage of,
the first of which is called CS50 Style.
Odds are if you're like me or any of the teaching fellows,
you've probably seen a program whose style looks a little something like this.
Maybe you start cutting some corners late at night, or you'll deal with it later,
and then a TF or CA comes over during office hours.
Then it's hard for us to read.
Well, this code is syntactically correct, and it will compile, and it will actually run.
But it's definitely not a 5 for style.
But now, if we go into this directory here—
and notice that I have conditions2.c—
and I run this new command, style50, on this file conditions2.c, Enter,
notice that it's informed me that it has been stylized.
Gedit noticed that the file has been changed on disk,
and if I click reload, all your problems are now automated.
[applause]
That's one of the things we did this weekend.
Realize that it is imperfect because there is some code
that it simply won't be able to stylize perfectly,
but realize this is now a tool you can take advantage of
if only to tidy up some of the more errantly placed curly braces and the like.
But more compelling now is CS50 Check.
With CS50 Check, you can actually perform the same correctness tests
on your own code that the teaching fellows are able to.
This is a command line utility that comes now in the appliance
as soon as you do an update50 as per
pset 4 specifications, and you use it essentially like this.
You run the command check50.
Then you pass in a command-line argument, more generally known as a switch or a flag.
Generally, things that have hyphens are called a switch
to a command line program, so -c specifies
the checks that you want to run.
The tests that you want to run are identified uniquely by this string,
2012/pset4/resize.
In other words, that's just an arbitrary but unique string
that we use to uniquely identify pset 4's correctness tests.
And then you specify a space separated list of the files that you want to upload
to CS50 Check for analysis.
For instance, if I go into my solution here for resize.c—
let me open up a bigger terminal window—
and I go ahead and run let's say check50 -c 2012/pset4/resize,
and then I go ahead and specify the names of the files,
resize.c, and then hit Enter, it compresses,
it uploads, it checks, and I just failed a whole bunch of tests.
The one in red at top left says that resize.c and bmp.h exist.
That was the test. That was the question we asked.
And it's unhappy because the answer was false.
The white text below it says expected bmp.h to exist, and that's simply my fault.
I forgot to upload it, so I need to upload both files,
resize.c and bmp.h.
But now notice all of the other tests are in yellow because they haven't run,
and so the smiley face is vertical because he's neither happy nor sad,
but we have to redress that issue in red before those other checks will run.
Let me fix this.
Let me zoom out and rerun this, this time with bmp.h also
on the command line, Enter, and now if all goes well,
it's going to check and then return a result of—hold your breath—
all green, which means I'm doing really well on pset 4 so far.
You can see and infer from the descriptive text here
exactly what it is we tested.
We tested first do the files exist?
We then tested does resize.c compile?
Then we tested does it not resize a 1x1-pixel BMP when n, the resize factor, is 1.
Now, if you have no idea what n is, you will once you dive into pset 4,
but that simply is a sanity check to make sure that you're not resizing
an image at all if the resize factor is 1.
By contrast, it should resize a 1x1-pixel BMP to 2x2 correctly
when n is 2, and similarly, mine performs accordingly.
In short, this is meant to, one, take the crossing the fingers
out of the equation right before you submit your pset.
You will know exactly what your TF will soon know
when you go about submitting some of these problem sets,
and also the pedagogical motivation is really to put
the opportunity in front of you so that when you know a priori
that there's bugs in your code and tests that aren't being passed,
you can put in more effective time up front to solve those problems
rather than lose points, get feedback from your TF,
and then go, "Ahh," like I should have figured that out.
Now at least there's a tool to help you find that.
It's not going to point out where the bug is, but it will tell you
what is symptomatic of it.
Now realize the tests aren't necessarily exhaustive.
Just because you get a screen full of green smiley faces
doesn't mean your code is perfect, but it does mean
that it has passed certain tests prescribed by the spec.
Sometimes we won't release checks.
For instance, whodunit, one of the aspects of pset 4,
is kind of disappointing if we give you
the answer as to what it is, and there's a number of ways to reveal
who the person is in that red noise.
The spec will always specify in the future for pset 5 onward
what checks exist for you.
You'll notice there's this white URL at the bottom.
For now, this is just diagnostic output.
If you visit that URL, you'll get a whole bunch of crazy, cryptic messages
that you're welcome to look through, but it's mostly for the staff
so that we can diagnose and debug bugs in check50 itself.
Without further ado, let's move on to where we left off.
The CS50 library we took for granted for some weeks,
but then last week, we started peeling back one of the layers of it.
We started putting aside string in favor of what instead?
[Students] Char.
Char*, which has been a char* all this time,
but now we don't have to pretend that it's an actual data type string.
Rather, it's been a synonym of sorts for char*,
and a string is a sequence of characters,
so why does it make sense to represent strings as char*s?
What does a char* represent in the context of this concept of a string?
Yeah.>>[Student] The first character.
Good, the first character, but not quite the first character.
It's the—[Students] Address.
Good, the address of the first character.
All that's necessary to represent a string in a computer's memory
is just the unique address of its very first byte.
You don't even have to know how long it is
because how can you figure that out dynamically?
[Student] String length.
You can call string length, excellent, but how does string length work?
What does it do? Yeah.
[Student] Keep going until you get the null character.
Yeah, exactly, it just iterates with a for loop, while loop,
whatever, from the char* to the end, and the end is represented
by \0, the so-called nul character, nul,
not to be confused with null, which is a pointer,
which will come up in conversation again today.
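The loop just described can be sketched in a few lines. This is only an illustration of the idea, not the library's actual source, and the name my_strlen is made up for this example:

```c
#include <stddef.h>

// Hypothetical stand-in for strlen: counts chars until the terminating '\0'.
size_t my_strlen(const char *s)
{
    size_t n = 0;

    // Start at the address of the first char and walk forward
    // until we reach the nul character that marks the end.
    while (s[n] != '\0')
    {
        n++;
    }
    return n;
}
```

All the function needs is the address of the first byte; the \0 convention tells it where to stop.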
We peeled back a layer of GetInt, and then we took a look at GetString,
and recall that both of those functions, or really,
GetString, was using a certain function
to actually parse, that is, read or analyze, the user's input.
And what was that new function?
Scanf or sscanf. It actually comes in a few different flavors.
There's scanf, there's sscanf, there's fscanf.
For now, though, let's focus on the one most easily illustrated,
and let me go ahead and open up in the appliance
a file like this, scanf1.c.
This is a super simple program,
but that does something that we've never done
without the help of the CS50 library.
This gets an int from a user. How does it work?
Well, in line 16 there,
notice that we declare an int called x, and at this point in the story,
what is the value of x?
[inaudible student response]
[David M.] Right, who knows, some garbage value potentially, so in 17, we just tell the user
give me a number, please, and step 18 is where it gets interesting.
Scanf seems to borrow an idea from printf in that it uses these format codes in quotes.
%d is of course a decimal number.
But why am I passing in &x instead of just x?
The former is correct. Yeah.
[inaudible student response]
Exactly, if the goal of this program, like the function GetInt itself,
is to get an int from the user I can pass functions
all the variables I want, but if I don't pass them by reference
or by address or by pointer, all synonymous for today's purposes,
then that function has no ability to change the contents of that variable.
This would pass in a copy just like the buggy version of swap
that we've talked about a few times now.
But instead, by doing &x, I'm literally passing in what?
[Student] The address.>>The address of x.
It's like drawing a map for the function called scanf and saying here,
these are directions to a chunk of memory in the computer
that you can go store some integer in.
In order for scanf to now do that,
what operator, what piece of syntax is it going to have to use
even though we can't see it because someone else wrote this function?
In other words--what's that?
[Student] X read.
There's going to be some reading, but only with regard to x here.
If scanf is being passed the address of x,
syntactically, what operator is bound to exist somewhere
inside of scanf's implementation so that scanf
can actually write a number to that address?
Yeah, so the *.
Recall that the * is our dereference operator, which essentially means go there.
Once you've been handed an address, as is the case here,
scanf is probably—if we actually looked around its source code—
is doing *x or the equivalent to actually go to that address and put some value there.
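A small sketch of why the address matters. These two functions are hypothetical and are not scanf's real implementation; they only contrast receiving a copy with receiving an address:

```c
// Receives a COPY of the caller's int, so the assignment is lost on return.
void set_by_value(int n)
{
    n = 50;
}

// Receives the ADDRESS of the caller's int; *p means "go there",
// so the assignment lands in the caller's variable.
void set_by_address(int *p)
{
    *p = 50;
}
```

Calling set_by_value(x) leaves x unchanged, while set_by_address(&x) stores 50 in x, which is the same reason scanf is handed &x.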
Now, as for how scanf gets input from the keyboard,
we'll wave our hands at that for today.
Just assume that the operating system allows scanf to talk
to the user's keyboard, but at this point now in line 19,
when we simply print out x, it seems to be the case
that scanf has put an int in x.
That's exactly how scanf works, and recall last week
that's exactly how GetString and GetInt and the rest of that family of functions
ultimately work, albeit with slight variants like sscanf,
which means scan a string instead of the keyboard.
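To make the sscanf variant concrete, here is a minimal sketch that scans an int out of a string rather than the keyboard. The helper name parse_int is an assumption for illustration, and the CS50 library's own GetInt does more checking than this:

```c
#include <stdbool.h>
#include <stdio.h>

// Hypothetical helper: tries to read one int from the string `line`.
// On success, stores it at the address `out` and returns true.
bool parse_int(const char *line, int *out)
{
    // sscanf returns the number of conversions it completed,
    // so 1 means the %d matched and *out was written.
    return sscanf(line, "%d", out) == 1;
}
```

As with scanf, the caller passes the address of an int so that the function has somewhere to write the result.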
But let's take a look at a little variant of this.
In scanf2, I actually screwed up.
What is wrong—and I'll hide the comment that explains as much—
what is wrong with this program, version 2?
Be as technical as possible this time.
It looks pretty good.
It's nicely indented, but—
okay, how about let's prune it down to shorter questions?
Line 16. What's line 16 doing in precise but technical English?
Getting a little awkward. Yes, Michael.
[Student] It's pointing to the first letter of a string.
Okay, close. Let me tweak that a little bit.
Pointing to the first letter of a string, you are declaring a variable called buffer
that will point to the first address of a string,
or rather, that will point more specifically to a char.
Notice it's not actually pointing anywhere because there's no assignment operator.
There's no equal sign, so all we're doing is allocating the variable called buffer.
It happens to be 32 bits because it's a pointer,
and the contents of buffer presumably eventually
will contain an address of a char, but for now, what does buffer contain?
Just some bogus, who knows, some garbage value,
because we haven't explicitly initialized it, so we shouldn't assume anything.
Okay, so now line 17 is—what does line 17 do?
Maybe that will warm this up.
It prints a string, right?
It prints String please.
Line 18 is kind of familiar now in that we just saw a variant of this
but with a different format code, so in line 18,
we're telling scanf here is the address of a chunk of memory.
I want you to read in a string, as implied by %s,
but the problem is that we have not done a couple of things here.
What's one of the problems?
[Student] It's trying to dereference a null pointer.
Good, null or just otherwise unknown pointers.
You're handing scanf an address, but you just said a moment ago
that that address is some garbage value because we didn't actually assign it to anything,
and so you're telling scanf effectively go put a string here,
but we don't know where here yet is,
so we haven't actually allocated memory for buffer.
Moreover, what are you also not even telling scanf?
Suppose this was a chunk of memory, and it wasn't a garbage value,
but you're still not telling scanf something important.
[Student] Where it actually is, the ampersand.
Ampersand, so in this case, it's okay.
Because buffer is already declared as a pointer
with the * piece of syntax, we don't need to use ampersand
because it's already an address, but I think I heard it here.
[Student] How big is it?
Good, we're not telling scanf how big this buffer is,
which means even if buffer were a pointer,
we're saying scanf, put a string here,
but here could be 2 bytes, it could be 10 bytes, it could be a megabyte.
Scanf has no idea, and because this is a chunk of memory
presumably, it's not a string yet.
It's only a string once you write characters and a \0 to that chunk of memory.
Now it's just some chunk of memory.
Scanf won't know when to stop writing to that address.
If you recall some examples in the past where I randomly typed on the keyboard
trying to overflow a buffer, and we talked on Friday about exactly that.
If an adversary somehow injects into your program a much bigger word
or sentence or phrase than you were expecting, you can overrun
a chunk of memory, which can have bad consequences,
like taking over the whole program itself.
We need to fix this somehow.
Let me zoom out and go into version 3 of this program.
That's a little bit better.
In this version, notice the difference.
In line 16, I'm again declaring a variable called buffer,
but what is it now?
It's an array of 16 chars.
This is good because this means I can now tell scanf
here is an actual chunk of memory.
You can almost think of arrays as being pointers now,
even though they're not actually equivalent.
They'll behave differently in different contexts.
But it's certainly the case that buffer is referencing
16 contiguous chars because that's what an array is
and has been for some weeks now.
Here, I am telling scanf here's a chunk of memory.
This time, it's actually a chunk of memory,
but why is this program still exploitable?
What's wrong still?
I've said give me 16 bytes but—
[Student] What if they type in more than 16?
Exactly, what if the user types in 17 characters or 1700 characters?
In fact, let's see if we can't trip over this mistake now.
It's better but not perfect.
Let me go ahead and run make scanf3 to compile this program.
Let me run scanf3, String please: hello, and we seem to be okay.
Let me try a slightly longer one, hello there.
Okay, let's do hello there how are you today, Enter.
Getting kind of lucky here, let's say hello there how are you.
Damn it.
Okay, so we got lucky. Let's see if we can't fix this.
No, it's not going to let me copy.
Let's try this again.
All right, stand by.
We'll see how long I can pretend to focus while still doing this.
Damn it. That's rather appropriate, actually.
There we go.
Point made.
This, embarrassing though it is, is also one of the sources of great confusion
when writing programs that have bugs because they manifest themselves
only once in a while sometimes.
The reality is that even if your code is completely broken,
it might only be completely broken once in a while
because sometimes, essentially what happens is the operating system allocates
a little more memory than you actually need for whatever reason,
and so no one else is using the memory right after your chunk of 16 characters,
so if you go to 17, 18, 19, whatever, it's not such a big deal.
Now, the computer, even if it doesn't crash at that point,
might eventually use byte number 17 or 18 or 19 for something else,
at which point your data that you put there, albeit excessively long,
is going to get overwritten potentially by some other function.
It's not necessarily going to remain intact,
but it won't necessarily cause a seg fault.
But in this case, I finally provided enough characters
that I essentially exceeded my segment of memory, and bam,
the operating system said, "Sorry, that's no good, segmentation fault."
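One common way to close the hole that version 3 leaves open is to give %s a maximum field width. This fix is not shown in the lecture; it is a sketch under that assumption, and it uses sscanf on a string (with the made-up name read_word) so that it can be tried without typing at the keyboard:

```c
#include <stdio.h>

#define BUFFER_SIZE 16

// Hypothetical helper: copies one whitespace-delimited word from `input`
// into `buffer`, never writing more than BUFFER_SIZE bytes.
// Returns 1 if a word was stored.
int read_word(const char *input, char buffer[BUFFER_SIZE])
{
    // The width 15 is BUFFER_SIZE - 1: at most 15 chars are stored,
    // leaving one byte for the terminating '\0'.
    return sscanf(input, "%15s", buffer);
}
```

With the width in place, a 1700-character word is cut off after 15 characters instead of running past the end of the array.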
And let's see now if what remains here in my directory—
notice that I have this file here, core.
Notice that this is again called a core dump.
It's essentially a file that contains the content of your program's memory
at the point at which it crashed,
and just to try a little example here let me go in here
and run gdb on scanf3 and then specify a third argument called core,
and notice here that if I list the code,
we'll be able as usual with gdb to start walking through this program,
and I can run it and as soon as I hit—as with the step command in gdb—
as soon as I hit the potentially buggy line after typing in a huge string,
I'll be able to actually identify it here.
More on this, though, in section in terms of core dumps
and the like so that you can actually poke around inside of the core dump
and see on what line the program failed you.
Any questions then on pointers and on addresses?
Because today on, we're going to start taking for granted that these things exist
and we know exactly what they are.
Yes.
[Student] How come you didn't have to put an ampersand next to the part—
Good question.
How come I didn't have to put an ampersand next to the character array as I did previously
with most of our examples?
The short answer is arrays are a little special.
You can almost think of buffer as actually being an address,
and it just so happens to be the case that the square bracket notation
is a convenience so that we can go into bracket 0, bracket 1,
bracket 2, without having to use the * notation.
That's a bit of a white lie because arrays and pointers
are, in fact, a little bit different, but they can often but not always be used interchangeably.
In short, when a function is expecting a pointer to a chunk of memory,
you can either pass it an address that was returned by malloc,
and we'll see malloc again before long, or you can pass it the name of an array.
You don't have to do ampersand with arrays because they are already
essentially like addresses.
That's the one exception.
The square brackets make them special.
Could you put an ampersand next to the buffer?
Not in this case.
That would not work because, again, of this corner case
where arrays aren't quite actually addresses.
But we'll perhaps come back to that before long with other examples.
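A brief sketch of that interchangeability, with a hypothetical function fill that expects a pointer to a chunk of memory. It can be handed either the name of an array or an address returned by malloc:

```c
#include <stddef.h>

// Hypothetical function: writes n - 1 copies of 'x' followed by '\0'
// into the n bytes starting at address dst.
void fill(char *dst, size_t n)
{
    if (n == 0)
    {
        return;
    }
    for (size_t i = 0; i + 1 < n; i++)
    {
        dst[i] = 'x';   // bracket notation works on a plain pointer too
    }
    dst[n - 1] = '\0';
}
```

Given char a[4], the call fill(a, 4) needs no ampersand; given char *p = malloc(4), the call fill(p, 4) looks exactly the same.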
Let's try to solve a problem here.
We have a data structure that we've been using for some time known as an array.
Case in point, that's what we just had.
But arrays have some upsides and downsides.
Arrays are nice why?
What's one thing that you like—to the extent you like arrays—about arrays?
What's convenient about them? What's compelling?
Why did we introduce them in the first place?
Yeah.
[Student] They can store a lot of data, and you don't have to use an entire thing.
You can use a section.
Good, with an array you can store a lot of data,
and you don't necessarily have to use all of it, so you can overallocate,
which might be convenient if you don't know in advance how many of something to expect.
GetString is a perfect example.
GetString, written by us, has no idea how many chars to expect,
so the fact that we can allocate chunks of contiguous memory is good.
Arrays also solve a problem we saw a couple weeks ago now
where your code starts to devolve into something very poorly designed.
Recall that I created a student structure called David,
and then that was actually an alternative, though,
to having a variable called name and another variable called, I think, house,
and another variable called ID because in that story I then wanted to introduce something else
like Rob into the program, so then I decided wait a minute,
I need to rename these variables.
Let's call mine name1, ID1, house1.
Let's call Rob's name2, house2, ID2.
But then wait a minute, what about Tommy?
Then we had three more variables.
We introduced someone else, four sets of variables.
The world started to get messy very quickly,
so we introduced structs, and what's compelling about a struct?
What does a C struct let you do?
It's really awkward today.
What?>>[inaudible student response]
Yeah, specifically, typedef allows you to create a new data type,
and struct, the struct keyword, allows you to encapsulate
conceptually related pieces of data together
and thereafter call them something like a student.
That was good because now we can model
much more sort of conceptually consistent the notion of a student in a variable
rather than arbitrarily having one for a string, one for an ID, and so forth.
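As a reminder, the student structure looked roughly like this. The fields follow the id, name, and house described in lecture; writing char * where the earlier version said string is an assumption that matches this week's shift away from the CS50 type:

```c
// One student's related data, bundled under a single new type name.
typedef struct
{
    int id;
    char *name;
    char *house;
}
student;
```

One variable of type student replaces the name1, house1, ID1 trio, and an array of student replaces the numbered variables altogether.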
Arrays are nice because they allow us to start cleaning up our code.
But what is a downside now of an array?
What can you not do? Yeah.
[Student] You have to know how big it is.
You have to know how big it is, so it's kind of a pain.
Those of you with prior programming experience know that in a lot of languages,
like Java, you can ask a chunk of memory, specifically an array,
how big are you, with a length property, so to speak, and that's really convenient.
In C, you can't even call strlen on a generic array
because strlen, as the word implies, is only for strings,
and you can figure out the length of a string because of this human convention
of having a \0, but an array, more generically, is just a chunk of memory.
If it's an array of ints, there's not going to be some special character
at the end waiting for you.
You have to remember the length of an array.
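In practice that means passing the length alongside the array. A minimal sketch, with a hypothetical function sum:

```c
// Hypothetical example: adds up `length` ints starting at `values`.
// Nothing in the memory itself marks the end of an int array,
// so the caller has to supply the length.
int sum(const int values[], int length)
{
    int total = 0;
    for (int i = 0; i < length; i++)
    {
        total += values[i];
    }
    return total;
}
```

If the caller passes a length that is too large, the loop reads past the end of the array, and nothing stops it.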
Another downside of an array reared its head in GetString itself.
What's another downside of an array?
Sir, just you and me today.
[inaudible student response]>>It's what?
It's declared on the stack.
Okay, declared on the stack. Why don't you like that?
[Student] Because it gets reused.
It gets reused.
Okay, if you use an array to allocate memory,
you can't, for instance, return it because it's on the stack.
Okay, that's a disadvantage.
And how about one other with an array?
Once you allocate it, you're kind of screwed if you need more space
than that array has.
Then we introduced, recall, malloc, which gave us the ability to dynamically allocate memory.
But what if we tried a different world altogether?
What if we wanted to solve a couple of those problems
so we instead—my pen has fallen asleep here—
what if we instead wanted to essentially create a world that's no longer like this?
This is an array, and, of course, this kind of deteriorates once we hit the end of the array,
and I now no longer have space for another integer or another character.
What if we sort of preemptively say well, why don't we relax
this requirement that all these chunks of memory be contiguous back to back,
and why don't, when I need an int or a char,
just give me space for one of them?
And when I need another, give me another space,
and when I need another, give me another space.
The advantage of which now is that if someone else
takes the memory over here, no big deal.
I'll take this additional chunk of memory here and then this one.
Now, the only catch here is that this almost feels like I have
a whole bunch of different variables.
This feels like five different variables potentially.
But what if we steal an idea from strings
whereby we somehow link these things together conceptually, and what if I did this?
This is my very poorly drawn arrow.
But suppose that each of these chunks of memory
pointed to the other, and this guy, who has no sibling to his right,
has no such arrow.
This is in fact what's called a linked list.
This is a new data structure that allows us to allocate a chunk of memory,
then another, then another, then another, any time we want
during a program, and we remember that they're all somehow related
by literally chaining them together, and we did that pictorially here with an arrow.
But in code, what would be the mechanism via which you could somehow connect,
almost like Scratch, one chunk to another chunk?
We could use a pointer, right?
Because really the arrow that's going from the top left square,
this guy here to this one, could contain inside of this square
not just some ints, not just some char, but what if I actually allocated
a little extra space so that now,
each of my chunks of memory, even though this is going to cost me,
now looks a little more rectangular where one of the chunks of memory
is used for a number, like the number 1,
and then if this guy stores the number 2,
this other chunk of memory is used for an arrow,
or more concretely, a pointer.
And suppose I store the number 3 over here while I use this to point at that guy,
and now this guy, let's suppose I only want three such chunks of memory.
I'll draw a line through that, indicating null.
There is no additional character.
Indeed, this is how we can go about implementing
something that's called a linked list.
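The picture of three rectangles holding 1, 2, and 3 can be sketched in code as follows. The node type is the one the lecture defines a little later; the helper names make_node and free_list are made up for this sketch:

```c
#include <stdlib.h>

typedef struct node
{
    int n;              // the number stored in this chunk
    struct node *next;  // the "arrow" to the next chunk, or NULL
}
node;

// Allocates one chunk holding `value` whose arrow points at `next`.
// Returns NULL if malloc fails.
node *make_node(int value, node *next)
{
    node *ptr = malloc(sizeof(node));
    if (ptr == NULL)
    {
        return NULL;
    }
    ptr->n = value;
    ptr->next = next;
    return ptr;
}

// Frees every chunk in the list, starting from the first.
void free_list(node *first)
{
    while (first != NULL)
    {
        node *next = first->next;  // remember the arrow before freeing
        free(first);
        first = next;
    }
}
```

Building from the right, make_node(3, NULL) is the chunk with the line drawn through it, make_node(2, third) points at it, and make_node(1, second) is the head of the list.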
A linked list is a new data structure, and it's a stepping stone toward
much fancier data structures that begin to solve problems
along the lines of Facebook-type problems and Google-type problems
where you have huge data sets, and it no longer cuts it
to store everything contiguously and use something like linear search
or even something like binary search.
You want even better running times.
In fact, one of the Holy Grails we'll talk about later this week or next
is an algorithm whose running time is constant.
In other words, it always takes the same amount of time no matter
how big the input is, and that would indeed be compelling,
even more so than something logarithmic.
What is this on the screen here?
Each of the rectangles is exactly what I just drew by hand.
But the thing all the way on the left is a special variable.
It's going to be a single pointer because the one gotcha
with a linked list, as these things are called,
is that you have to hang onto one end of the linked list.
Just like with a string, you have to know the address of the first char.
Same deal for linked lists.
You have to know the address of the first chunk of memory
because from there, you can reach every other one.
Downside.
What price are we paying for this versatility of having a dynamically
sizable data structure that if we ever need more memory, fine,
just allocate one more chunk and draw a pointer from
the old to the new tail of the list?
Yeah.
[Student] It takes about twice as much space.
It takes twice as much space, so that's definitely a downside, and we've seen this
tradeoff before between time and space and flexibility
whereby now we need not 32 bits for each of these numbers.
We really need 64, 32 for the number and 32 for the pointer.
But hey, I have 2 gigabytes of RAM.
Adding another 32 bits here and here doesn't seem that big of a deal.
But for large data sets, it definitely adds up to literally twice as much.
What's another downside now, or what feature do we give up,
if we represent lists of things with a linked list and not an array?
[Student] You can't traverse it backwards.
You can't traverse it backwards, so you're kind of screwed if you're walking
from left to right using a for loop or a while loop
and then you realize, "Oh, I want to go back to the beginning of the list."
You can't because these pointers only go from left to right as the arrows indicate.
Now, you could remember the start of the list with another variable,
but that's a complexity to keep in mind.
An array, no matter how far you go, you can always do minus, minus, minus, minus
and go back from whence you came.
What's another downside here? Yeah.
[inaudible student question]
You could, so you've actually just proposed a data structure called a doubly linked list,
and indeed, you would add another pointer to each of these rectangles
that goes the other direction, the upside of which
is now you can traverse back and forth,
the downside of which is now you're using three times as much memory as we used to
and also adding complexity in terms of the code you have to write to get it right.
But these are all perhaps very reasonable tradeoffs, if going in reverse is important.
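A doubly linked node might look like the sketch below. The names dnode, prev, and link_pair are assumptions for illustration; the lecture only describes the idea:

```c
#include <stddef.h>

// A node with arrows in both directions.
typedef struct dnode
{
    int n;
    struct dnode *prev;  // arrow to the left, or NULL at the start
    struct dnode *next;  // arrow to the right, or NULL at the end
}
dnode;

// Hypothetical helper: places a immediately before b.
// Both arrows must be updated, which is the extra bookkeeping
// a doubly linked list costs.
void link_pair(dnode *a, dnode *b)
{
    a->next = b;
    b->prev = a;
}
```

Each node now carries one number and two pointers, which is where the three-times-the-memory figure comes from.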
Yeah.
[Student] You also can't have a 2D linked list.
Good, you can't really have a 2D linked list.
You could. It's not nearly as easy as an array.
Like an array, you do open bracket, closed bracket, open bracket, closed bracket,
and you get some 2-dimensional structure.
You could implement a 2-dimensional linked list
if you do add—as you proposed—a third pointer to each of these things,
and if you think about another list coming at you 3D style
from the screen to all of us, which is just another chain of some sort.
We could do it, but it's not as simple as typing open bracket, square bracket. Yeah.
[inaudible student question]
Good, so this is a real kicker.
These algorithms that we've pined over, like oh, binary search,
you can search an array of numbers on the board
or a phone book so much more quickly if you use divide and conquer
and a binary search algorithm, but binary search required two assumptions.
One, that the data was sorted.
Now, we can presumably keep this sorted,
so maybe that's not a concern, but binary search also assumed
that you had random access to the list of numbers,
and an array allows you to have random access, and by random access,
I mean if you're given an array, how much time does it take you
to get to bracket 0?
One operation, you just use [0] and you're right there.
How many steps does it take to get to location 10?
One step, you just go to [10] and you're there.
By contrast, how do you get to the 10th integer in a linked list?
You have to start at the beginning because you're only remembering
the beginning of a linked list, just like a string is being remembered
by the address of its first char, and to find that 10th int
or that 10th character in a string, you have to search the whole damn thing.
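The cost of giving up random access shows up in code. In this sketch the helper node_at is a made-up name, and index is assumed to be zero or greater:

```c
#include <stddef.h>

typedef struct node
{
    int n;
    struct node *next;
}
node;

// Returns the node at zero-based position `index`,
// or NULL if the list has fewer nodes than that.
node *node_at(node *first, int index)
{
    node *ptr = first;

    // Unlike array[index], which is one step, this follows
    // one arrow per position: `index` steps in total.
    for (int i = 0; i < index && ptr != NULL; i++)
    {
        ptr = ptr->next;
    }
    return ptr;
}
```

Binary search relies on jumping straight to the middle element, and with a linked list even that single jump takes a number of steps proportional to the list's length.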
Again, we're not solving all of our problems.
We're introducing new ones, but it really depends on what you're trying to design for.
In terms of implementing this, we can borrow an idea from that student structure.
The syntax is very similar, except now, the idea is a little more abstract
than house and name and ID.
But I propose that we could have a data structure in C
that is called node, as the last word on the slide suggests,
and a node is just a generic container in computer science.
It's usually drawn as a circle or a square or rectangle as we've done.
And in this data structure, we have an int, n,
so that's the number I want to store.
But what is this second line, struct node *next?
Why is this correct, or what role does this thing play,
even though it's a little cryptic at first glance?
Yeah.
[inaudible student response]
Exactly, so the * sort of spoils that it's a pointer of some sort.
The name of this pointer is arbitrarily next,
but we could have called it anything we want, but what does this pointer point to?
[Student] Another node.>>Exactly, it points to another such node.
Now, this is sort of a curiosity of C.
Recall that C is read by a compiler top to bottom, left to right,
which means if—this is a little different from what we did with the student.
When we defined a student, we actually did not put a word there.
It just said typedef.
Then we had int id, string name, string house,
and then student at the bottom of the struct.
This declaration is a little different because,
again, the C compiler is a little dumb.
It's only going to read top to bottom,
so if it reaches the 2nd line here
where next is declared and it sees, oh, here's a variable called next.
It's a pointer to a struct node.
The compiler is going to realize what is a struct node?
I've never heard of this thing before,
because the word node might not otherwise appear
until the bottom, so there is this redundancy.
You have to say struct node here, which you can then shorten later on
thanks to typedef down here, but this is because
we are referencing the structure itself inside of the structure.
That's the one gotcha there.
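For reference, the declaration being described reads roughly as follows, reconstructed from the description above since the slide itself is not part of the transcript:

```c
// "struct node" is written at the top so the compiler already knows
// that name by the time it reaches the second line inside the braces.
typedef struct node
{
    int n;               // the number to store
    struct node *next;   // plain "node" is not yet defined at this point
}
node;                    // from here on, "node" is shorthand for "struct node"
```

Writing node *next; inside the braces would not compile, because the typedef name node only comes into existence at the last line.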
Some interesting problems are going to arise.
We've got a list of numbers. How do we insert into it?
How do we search it? How do we delete from it?
Especially now that we have to manage all of these pointers.
You thought pointers were sort of mind-bending
when you had one of them just trying to read an int to it.
Now we have to manipulate an entire list's worth.
Why don't we take our 5-minute break here, and then we'll bring
some folks up on stage to do exactly that.
C is much more fun when it's acted out.
Who would literally like to be first?
Okay, come on up. You are first.
Who would like to be 9? Okay, 9.
How about 9? 17?
A little clique here. 22 and 26 in that front row.
And then how about someone over there being pointed at.
You are 34. Okay, 34, come on up.
First is over there. Okay, all four of you guys.
And who did we say for 9?
Who is our 9?
Who really wants to be 9? All right, come on, be 9.
Here we go.
34, we'll meet you over there.
The first part is make yourselves look like that.
26, 22, 17, good.
If you can stand off to the side, because we're going to malloc you in a moment.
Good, good.
Okay, excellent, so let's ask a couple of questions here.
And actually, what's your name?>>Anita.
Anita, okay, come on over here.
Anita is going to help us sort of solve one fairly simple question first,
which is how do you find whether or not a value is in the list?
Now, notice that first, represented here by Lucas,
is a little different, and so his piece of paper is deliberately sideways
because it's not quite as tall and doesn't take up as many bits,
even though technically he has the same size of paper just rotated.
But he's a little different in that he's only 32 bits for a pointer,
and all of these guys are 64 bits, half of which is the number, half of which is a pointer.
But the pointer is not depicted, so if you guys could somewhat awkwardly
use your left hand to point at the person next to you.
And you're number 34. What's your name?
Ari.
Ari, so actually, hold the paper in your right hand, and left hand goes straight down.
You represent null on the left.
Now our human picture is very consistent.
This is actually how pointers work.
And if you can scrunch a little bit this way so I'm not in your way.
Anita here, find me the number 22,
but assume a constraint of not humans holding up pieces of paper,
but this is a list, and you only have Lucas to begin with
because he is literally the first pointer.
Suppose you yourself are a pointer, and so you too have the ability to point at something.
Why don't you start by pointing at exactly what Lucas is pointing at?
Good, and let me act this out over here.
Just for the sake of discussion, let me pull up a blank page here.
How do you spell your name?>>Anita.
Okay, Anita.
Let's say node* anita = lucas.
Well, we shouldn't call you lucas. We should call you first.
Why is this in fact consistent with reality here?
One, first already exists.
First has been allocated presumably somewhere up here.
Node* first, and it's been allocated a list somehow.
I don't know how that happened. That happened before class started.
This linked list of humans has been created.
And now at this point in the story—this is all going on Facebook apparently later—
at this point in the story, Anita has been initialized to be equal to first,
which doesn't mean that Anita points at Lucas.
Rather, she points at what he points at
because the same address that's inside of Lucas's 32 bits--1, 2, 3--
is now also inside of Anita's 32 bits--1, 2, 3.
Now find 22. How would you go about doing this?
What's that?>>Point to whatever.
Point to whatever, so go ahead and act it out as best you can here.
Good, good, and now you're pointing at—what's your name with 22?
Ramon.>>Ramon, so Ramon is holding up 22.
You have now done a check.
Does Ramon == 22, and if so, for instance, we can return true.
Let me—while these guys stand here somewhat awkwardly—
let me do something quickly like bool find.
I'm going to go ahead and say (node* list, int n).
I'll be right back with you guys. I just have to write some code.
And now I'm going to go ahead and do this, node* anita = list.
And I'm going to go ahead and say while (anita != NULL).
The metaphor here is getting a little stretched, but while (anita != NULL), what do I want to do?
I need some way of referencing
the integer that Anita is pointing at.
In the past, when we had structures, which a node is,
we used the dot notation, and we would say something like
anita.n, but the problem here is that Anita is not a struct per se.
What is she?
She's a pointer, so really, if we want to use this dot notation—
and this is going to look deliberately a little cryptic—
we have to do something like go to whatever Anita's left hand is pointing at
and then get the field called n.
Anita is a pointer, but what is *anita?
What do you find when you go to what Anita is pointing at?
A struct, a node, and a node, recall, has a field called n
because it has, recall, these 2 fields, next and n,
that we saw a moment ago right here.
To actually imitate this in code,
we could do this and say if ((*anita).n == n), the n that I'm looking for.
Notice that the function was passed in the number I care about.
Then I can go ahead and do something like return true.
Else, if that's not the case, what do I want to do?
How do I translate to code what Anita did so intuitively by walking through the list?
What should I do up here to simulate Anita taking that step to the left, that step to the left?
[inaudible student response]>>What's that?
[inaudible student response]
Good, not a bad idea, but in the past, when we've done this, we've done anita++
because that would add the number 1 to Anita,
which would typically point at the next person, like Ramon,
or the person next to him, or the person next to him, on down the line.
But that's not quite good here because what does this thing look like in memory?
Not that. We have to disable that.
It looks like this in memory, and even though I've drawn 1 and 2 and 3 close to one another,
if we really simulate this—can you guys, while still pointing at the same people,
can some of you take a random step back, some of you a random step forward?
This mess is still a linked list,
but these guys could be anywhere in memory,
so anita++ is not going to work why?
What's at location anita++?
Who knows.
It's some other value that just so happens to be interposed
among all of these nodes by chance because we're not using an array.
We allocated each of these nodes individually.
Okay, if you guys can clean yourselves back up.
Let me propose that instead of anita++, we instead do anita gets—
well, why don't we go to whatever Anita is pointing at and then do .next?
In other words, we go to Ramon, who's holding the number 22,
and then .next is as though Anita would be copying his left hand pointer.
But she wouldn't go farther than Ramon because we found 22.
But that would be the idea. Now, this is a god-awful mess.
Honestly, no one will ever remember this syntax, and so thankfully,
it's actually a little deliberate—oh, you didn't actually see what I wrote.
This would be more compelling if you could. Voila!
Behind the scenes, I was solving the problem this way.
Anita, to take that step to the left,
first, we do go to the address that Anita is pointing at
and where she will find not only n, which we just checked for comparison's sake,
but you will also find next--in this case,
Ramon's left hand pointing to the next node in the list.
But this is the god-awful mess to which I referred earlier,
but it turns out C lets us simplify this.
Instead of writing (*anita).n, we can instead just write anita->n,
and it's the exact same thing functionally, but it's a lot more intuitive,
and it's a lot more consistent with the picture that we've been drawing
all this time using arrows.
Lastly, what do we need to do at the end of this program?
There's one line of code remaining.
Return what?
False, because if we get through the whole while loop
and Anita is, in fact, null, that means she went all the way to the end of the list
where she was pointing at—what's your name again?
Ari.>>Ari's left hand, which is null.
Anita is now null, and I realize you're just standing here awkwardly in limbo
because I'm going off on a monologue here,
but we'll involve you again in just a moment.
Anita is null at that point in the story, so the while loop terminates,
and we have to return false because if she got all the way to Ari's null pointer
then there was no number that she sought in the list.
We can clean this up too, but this is a pretty good implementation then
of a traversal function, a find function for a linked list.
It's still linear search, but it's not as simple as ++ a pointer
or ++ an i variable because now we can't guess
where each of these nodes are in memory.
We have to literally follow the trail of breadcrumbs or, more specifically,
pointers, to get from one node to another.
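Pulling the pieces of that walkthrough together, the find function might look roughly like this. It is a sketch assembled from the fragments spoken above, not a copy of what was on screen, and the node definition is repeated so the example stands alone:

```c
#include <stdbool.h>
#include <stddef.h>

typedef struct node
{
    int n;
    struct node *next;
}
node;

// Returns true if n appears somewhere in the list, false otherwise.
bool find(node *list, int n)
{
    // Start by pointing at whatever the list pointer points at.
    node *anita = list;

    while (anita != NULL)
    {
        // anita->n is shorthand for (*anita).n
        if (anita->n == n)
        {
            return true;
        }

        // Follow the trail: copy the current node's next pointer.
        // anita++ would not work, since nodes need not be contiguous.
        anita = anita->next;
    }

    // Reached NULL without a match, so n is not in the list.
    return false;
}
```

Called on the list 9, 17, 22, 26, 34, find returns true for 22 and false for, say, 55, and it returns false immediately for an empty (NULL) list.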
Now let's try another one. Anita, do you want to come back here?
Why don't we go ahead and allocate one other person from the audience?
Malloc—what's your name?>>Rebecca.
Rebecca. Rebecca has been malloced from the audience,
and she is now storing the number 55.
And the goal at hand now is for Anita to insert
Rebecca into the linked list here in its appropriate place.
Come on over here for a moment.
I have done something like this.
I have done node*. And what's your name again?
Rebecca.>>Rebecca, okay.
Rebecca gets malloc(sizeof(node)).
Just like we have allocated things like students and whatnot in the past,
we need the size of the node, so now Rebecca
is pointing at what?
Rebecca has two fields inside of her, one of which is 55.
Let's do what, rebecca->n = 55.
But then rebecca->next should be—like right now, her hand is kind of who knows?
It's pointing at some garbage value, so why don't we, for good measure,
at least do rebecca->next = NULL so that her left hand is now at her side.
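Those three steps (allocate, store the number, set the pointer to NULL) can be sketched as one small helper. The function name create and the check for malloc failing are assumptions added for illustration; the lecture only wrote the three lines inline:

```c
#include <stdlib.h>

typedef struct node
{
    int n;
    struct node *next;
}
node;

// Hypothetical helper: allocates one node holding n, with next set to NULL.
// Returns NULL if malloc could not find the memory.
node *create(int n)
{
    node *rebecca = malloc(sizeof(node));
    if (rebecca == NULL)
    {
        return NULL;
    }

    rebecca->n = n;        // the number she is holding
    rebecca->next = NULL;  // left hand at her side, not a garbage value
    return rebecca;
}
```

So create(55) would play the role of mallocing Rebecca, and the caller is responsible for eventually calling free on the result.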
Now Anita, take it from here.
You have Rebecca having been allocated.
Go ahead and find where we should put Rebecca.
Good, very good.
Okay, good, and now we need you to provide a bit of direction,
so you've reached Ari.
His left hand is null, but Rebecca clearly belongs to the right,
so how do we have to alter this linked list
in order to insert Rebecca into the appropriate place?
If you could literally move people's left hands around as needed,
we'll fix the problem that way.
Okay, good, and meanwhile, Rebecca's left hand is now by her side.
That was too easy.
Let's try allocating—we're almost done, 20.
Okay, come on up.
20 has been allocated, so let me go ahead and say again here
we've just done node* saad.
We have malloc(sizeof(node)).
We then do the same exact syntax as we did before for 20,
and I'll do next = NULL, and now it's up to Anita
to insert you into the linked list, if you could play that exact same role.
Execute.
Okay, good.
Now think carefully before you start moving left hands around.
You by far got the most awkward role today.
Whose hand should be moved first?
Okay, wait, I'm hearing some no's.
If some folks would politely like to help solve an awkward situation here.
Whose left hand should be updated first perhaps? Yeah.
[Student] Saad's.
Okay, Saad's, why, though?
[inaudible student response]
Good, because if we move—what's your name?>>Marshall.
Marshall, if we move his hand first down to null,
now we have literally orphaned four people in this list
because he was the only thing pointing at Ramon and everyone to the left,
so updating that pointer first was bad.
Let's undo that.
Good, and now go ahead and move the appropriate left hand pointing at Ramon.
This feels a little redundant.
Now there's two people pointing at Ramon, but that's fine
because now how else do we update the list?
What other hand has to move?
Excellent, now have we lost any memory?
No, so good, let's see if we can't break this once more.
Mallocing one last time, number 5.
All the way in back, come on down.
It's very exciting.
[applause]
What's your name?>>Ron.
Ron, okay, you are malloced as number 5.
We've just executed code that's almost identical to these
with just a different name.
Excellent.
Now, Anita, good luck inserting number 5 into the list now.
Good, and?
Excellent, so this is really the third of three total cases.
We first had someone at the end, Rebecca.
We then had someone in the middle.
Now we have someone at the beginning, and in this example,
we now had to update Lucas for the first time
because the first element in the list now has to point at a new node,
who, in turn, is pointing at node number 9.
This was a hugely awkward demonstration, I'm sure,
so a big round of applause for these guys if you could.
Nicely done.
That's all. You may keep your pieces of paper as a little memory.
It turns out that doing this in code
is not quite as simple as just moving hands around
and pointing pointers at different things.
But realize that when it comes time to implement something like
a linked list or a variant of it, if you focus on really
these basic fundamentals, the bite-size problems you have to figure out,
like is it this hand or this hand, realize that what is otherwise a fairly complex program
can, in fact, be reduced to fairly simple building blocks like this.
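The three cases acted out on stage (end, middle, beginning) can be sketched in code as follows. This is one possible implementation under the assumption that the list is kept in ascending order; the name insert and the choice to return the new first pointer are illustrative, not from the lecture:

```c
#include <stddef.h>

typedef struct node
{
    int n;
    struct node *next;
}
node;

// Inserts new_node into an ascending list and returns the (possibly new)
// first pointer. Hypothetical helper written for illustration.
node *insert(node *first, node *new_node)
{
    // Beginning (or empty list): first itself has to change, like Lucas.
    if (first == NULL || new_node->n <= first->n)
    {
        new_node->next = first;
        return new_node;
    }

    // Walk until the node after ptr is bigger (middle) or missing (end).
    node *ptr = first;
    while (ptr->next != NULL && ptr->next->n < new_node->n)
    {
        ptr = ptr->next;
    }

    // Order matters: point the new node at the rest of the list first,
    // and only then redirect the predecessor, so nobody is orphaned.
    new_node->next = ptr->next;
    ptr->next = new_node;
    return first;
}
```

Swapping the last two assignments is exactly the mistake from the stage: the predecessor would let go of the rest of the list before anything else was pointing at it.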
Let's take things in a more sophisticated direction still.
We now have the notion of the linked list.
We also have—thanks to the suggestion back there—a doubly linked list,
which looks almost the same, but now we have two pointers inside of the struct
instead of one, and we could probably call those pointers previous and next
or left or right, but we do, in fact, need two of them.
The code would be a little more involved.
Anita would have had to do more work here on the stage.
But we could certainly implement that kind of structure.
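A doubly linked node might be declared along these lines. The names dnode, prev, and link_after are assumptions for the sake of the sketch; the lecture only says the two pointers could be called previous and next or left and right:

```c
#include <stddef.h>

// A node with two pointers instead of one.
typedef struct dnode
{
    int n;
    struct dnode *prev;  // the node before this one, or NULL at the front
    struct dnode *next;  // the node after this one, or NULL at the end
}
dnode;

// Hypothetical helper: hooks b on after a, assuming a is currently last.
// Note that two pointers have to be kept in agreement, not just one.
void link_after(dnode *a, dnode *b)
{
    b->prev = a;
    b->next = NULL;
    a->next = b;
}
```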
In terms of running time, though, what would be the running time
for Anita of finding a number n in a linked list now?
Still big O of n, so it's no better than linear search.
We can't do binary search, though, again.
Why was that the case? You can't jump around.
Even though we obviously see all the humans on the stage,
and Anita could have eyeballed it and said, "Here is the middle of the list,"
she wouldn't know that if she were the computer program
because the only thing she had to latch on to at the start of the scenario
was Lucas, who was the first pointer.
She would necessarily have to follow those links,
counting her way until she found roughly the middle,
and even then, she's not going to know when she's reached the middle
unless she goes all the way to the end to figure out how many there are,
then backtracks, and that too would be hard unless you had
a doubly linked list of some sort.
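To make that cost concrete, here is a rough sketch of what finding the middle of a singly linked list involves. The function names length and middle are made up for illustration. Both loops follow links one at a time, so even reaching the middle is linear, which is why binary search buys nothing here:

```c
#include <stddef.h>

typedef struct node
{
    int n;
    struct node *next;
}
node;

// Counts the nodes by walking all the way to the end.
size_t length(node *list)
{
    size_t count = 0;
    for (node *ptr = list; ptr != NULL; ptr = ptr->next)
    {
        count++;
    }
    return count;
}

// Returns the node at index length / 2, or NULL for an empty list.
node *middle(node *list)
{
    size_t steps = length(list) / 2;
    node *ptr = list;
    for (size_t i = 0; i < steps; i++)
    {
        ptr = ptr->next;
    }
    return ptr;
}
```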
Solving some problems today, but introducing others.
What about a different data structure altogether?
This is a photograph of the trays in Mather House,
and in this case, we have a data structure we've also kind of already been talking about.
We talked about a stack in the context of memory,
and that's sort of deliberately named because a stack in terms of memory
is effectively a data structure that has more and more stuff layered on top of it.
But the interesting thing about a stack, as is the case in reality,
is that it's a special kind of data structure.
It's a data structure whereby the first element in
is the last element out.
If you are the first tray to be put onto the stack,
you're going to be unfortunately the last tray to be taken off the stack,
and that's not necessarily a good thing.
Conversely, you can think about it the other way around,
the last in is the first out.
Now, do any scenarios come to mind where having a stack
data structure where you have that property
of the last in, first out, is actually compelling?
Is that a good thing? Is that a bad thing?
It's definitely a bad thing if the trays weren't all identical
and they were all special different colors or whatnot,
and the color you want is all the way at the bottom.
Of course, you can't get that without great effort.
You have to start from the top and work your way down.
Similarly, what if you were one of these fan boys
who waits up all night trying to get an iPhone and lines up
at a place like this?
Wouldn't it be nice if the Apple store
were a stack data structure?
Yay? Nay?
It's only good for the people who show up at the last possible minute
and then get plucked off the queue.
And in fact, the fact that I was so inclined to say queue
is actually consistent with what we would call this kind of data structure,
one in reality where the order does matter,
and you want the first one in to be the first one out
if only for the sake of human fairness.
We'll generally call that a queue data structure.
It turns out besides linked lists, we can start using these same basic ideas
and start creating new and different types of solutions to problems.
For instance, in the case of a stack, we could represent a stack
using a data structure like this, I would propose.
In this case, I've declared a struct, and I've said inside of this structure
is an array of numbers and then a variable called size,
and I am going to call this thing a stack.
Now, why does this actually work?
In the case of a stack, I could draw this effectively on the screen as an array.
Here is my stack. Those are my numbers.
And we'll draw them as this, this, this, this, this.
And then I have some other data member here,
which is called size, so this is size, and this is numbers,
and collectively, the whole iPad here represents one stack structure.
Now, by default, size has presumably got to be initialized to 0,
and what's inside of the array of numbers initially
when I first allocate an array?
Garbage. Who knows? And it doesn't actually matter.
It doesn't matter if this is 1, 2, 3, 4, 5, completely randomly
by bad luck stored in my structure because so long as I know that the size of the stack
is 0, then I know programmatically, don't look at any of the elements in the array.
It doesn't matter what's there.
Don't look at them, as would be the implication of a size of 0.
But suppose now I go ahead and insert something into the stack.
I want to insert the number 5, so I put number 5 here,
and then what do I put down here?
Now I would actually put down 1 for the size,
and now the stack is of size 1.
What if I go ahead and insert the number, let's say, 7 next?
This then gets updated to 2, and then we'll do 9,
and then this gets updated to 3.
But the interesting feature now of this stack is that
I'm supposed to remove which element if I want to pop
something off of the stack, so to speak?
9 would be the first thing to go.
How should the picture change if I want to pop an element off the stack,
much like a tray in Mather?
Yeah.>>[Student] Set size to 2.
Exactly, all I do is set size to 2, and what do I do with the array?
I don't have to do anything.
I could, just to be anal, put a 0 there or a -1 or something to signify
that this is not a legit value, but it doesn't matter because
I can record outside of the array itself how long it is
so that I know only look at the first two elements in this array.
Now, if I go and add the number 8 to this array, how does the picture change next?
This becomes 8, and this becomes 3.
I'm cutting a few corners here.
Now we have 5, 7, 8, and we're back to a size of 3.
This is pretty simple to implement,
but when are we going to regret this design decision?
When do things start to go very, very wrong? Yeah.
[inaudible student response]
When you want to go back and get the first element you put in.
It turns out here even though a stack is an array underneath the hood,
these data structures we've started talking about are also generally known as
abstract data structures whereby how they're implemented
is completely beside the point.
A data structure like a stack is supposed to support
operations like push, which pushes a tray onto the stack,
and pop, which removes an element from the stack, and that's it.
If you were to download someone else's code who already implemented
this thing called a stack, that person would have written
only two functions for you, push and pop, whose sole purpose in life
would be to do exactly that.
You or him or her who implemented that program
would have been entirely the one to decide how to implement
the semantics of pushing and popping underneath the hood
or the functionality of pushing and popping.
And I have made a somewhat shortsighted decision here
by implementing my stack with this simple data structure why?
When does this data structure break?
At what point do I have to return an error when the user calls push, for instance?
[Student] If there's no more space.
Exactly, if there's no more space, if I've exceeded CAPACITY,
which is all caps because it suggests that it's some kind of global constant.
Well, then I'm just going to have to say, "Sorry, I can't push another value
onto the stack," much like in Mather.
At some point, they're going to hit the top part of that little cabinet.
There's no more space or capacity in the stack, at which point there's some kind of error.
They have to put the element somewhere else, the tray somewhere else,
or nowhere at all.
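The array-plus-size design described above could be sketched like this. The field names numbers and size and the constant CAPACITY come from the lecture; the value 5 and the exact signatures of push and pop are assumptions chosen for illustration:

```c
#include <stdbool.h>

#define CAPACITY 5  // assumed value; the lecture does not give one

typedef struct
{
    int numbers[CAPACITY];  // may hold garbage beyond index size - 1
    int size;               // how many elements are actually in use
}
stack;

// Puts n on top of the stack. Returns false if there is no more space.
bool push(stack *s, int n)
{
    if (s->size == CAPACITY)
    {
        return false;
    }
    s->numbers[s->size] = n;
    s->size++;
    return true;
}

// Removes the top element and stores it in *n. Returns false if empty.
bool pop(stack *s, int *n)
{
    if (s->size == 0)
    {
        return false;
    }
    s->size--;                 // the array itself is left untouched
    *n = s->numbers[s->size];
    return true;
}
```

Starting from size 0, pushing 5, 7, 9, popping once (which yields 9), and then pushing 8 leaves 5, 7, 8 with a size of 3, matching the picture drawn above. Two more pushes fill the array, and the next one returns false.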
Now, with a queue, we could implement it a little differently.
A queue is a little different in that underneath the hood, it can be implemented
as an array, but why, in this case, am I proposing
to also have a head element representing the head of the list,
the front of the list, the first person in line at the Apple store, in addition to size?
Why do I need an additional piece of data here?
Think back to what numbers is
if I've drawn it as follows.
Suppose this is now a queue instead of a stack,
the difference being—just like the Apple store—queue is fair.
The first person in line at the start of the list, number 5 in this case,
he or she is going to be let into the store first.
Let's do that.
Suppose that this is the state of my queue at this moment in time, and now the Apple store
opens and the first person, number 5, is led into the store.
How do I change the picture now that I have de-queued the first person
at the front of the line?
What's that?>>[Student] Change the queue.
Change the head, so 5 disappears.
In reality, it's as though—how best to do this?
In reality, it's as though this guy disappears.
What would number 7 do in an actual store?
They would take a big step forward.
But what have we come to appreciate when it comes to arrays
and moving things around?
That's kind of a waste of your time, right?
Why do you have to be so anal as to have the first person
at the start of the line at physically the start of the chunk of memory?
That's completely unnecessary. Why?
What could I just remember instead?>>[inaudible student response]
Exactly, I could just remember with this additional data member head
that now the head of the list is no longer 0, which it was a moment ago.
Now it's actually the number 1. In this way, I get a slight optimization.
Just because I've de-queued someone from the start of the line at the Apple store
doesn't mean everyone has to shift, which recall is a linear operation.
I can instead spend constant time only
and achieve then a much faster response.
But the price I'm paying is what to gain that additional performance
and not having to shift everyone?
Yeah.>>[inaudible student response]
Can add more people, well, that problem is orthogonal
to the fact that we're not shifting people around.
It's still an array, so whether or not we shift everyone or not—
oh, I see what you mean, okay.
Actually, I agree with what you're saying in that it's almost as though
we're now never going to use the start of this array anymore
because if I remove 5, then I remove 7.
But I only put people to the right.
It feels like I'm wasting space, and eventually my queue disintegrates into nothing at all,
so we could just have people wrap around,
and we could think of this array really as some kind of circular structure,
but we use what operator in C to do that sort of wraparound?
[inaudible student response]>>The modulo operator.
It would be a little annoying to think through how do you do the wraparound,
but we could do it, and we could start putting people at what used to be the front of the line,
but we just remember with this head variable who the actual head of the line actually is.
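A queue along those lines, with the modulo wraparound, might be sketched as follows. The fields numbers, head, and size are from the lecture; CAPACITY's value of 5 and the function names enqueue and dequeue are assumptions made for the example:

```c
#include <stdbool.h>

#define CAPACITY 5  // assumed value for illustration

typedef struct
{
    int numbers[CAPACITY];
    int head;  // index of the person at the front of the line
    int size;  // how many people are currently in line
}
queue;

// Adds n to the back of the line. Returns false if the queue is full.
bool enqueue(queue *q, int n)
{
    if (q->size == CAPACITY)
    {
        return false;
    }
    // The back of the line wraps around to reuse freed slots at the start.
    int tail = (q->head + q->size) % CAPACITY;
    q->numbers[tail] = n;
    q->size++;
    return true;
}

// Removes the front of the line into *n. Returns false if the queue is empty.
bool dequeue(queue *q, int *n)
{
    if (q->size == 0)
    {
        return false;
    }
    *n = q->numbers[q->head];
    q->head = (q->head + 1) % CAPACITY;  // nobody shifts; only head moves
    q->size--;
    return true;
}
```

Both operations take constant time. After enqueuing 5, 7, 9 and dequeuing once, head is 1 and index 0 is free, so a later enqueue can wrap around and land there.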
What if, instead, our goal ultimately, though,
was to look up numbers, as we did here on stage with Anita,
but we really want the best of all these worlds?
We want more sophistication than an array allows
because we want the ability to dynamically grow the data structure.
But we don't want to have to resort to something that we pointed out
in the first lecture was not an optimal algorithm,
that of linear search.
It turns out that you can, in fact, achieve constant time,
or at least close to constant time, whereby someone like Anita,
if she configures her data structure not to be a linked list,
not to be a stack, not to be a queue, could, in fact,
come up with a data structure that allows her to look up things,
even words, not just numbers, in what we'll call constant time.
And in fact, looking ahead, one of the psets in this class is almost always
an implementation of a spellchecker, whereby
we give you again some 150,000 English words and the goal is to
load those into memory and rapidly be able to answer questions of the form
is this word spelled correctly?
And it would really suck if you had to iterate through all 150,000 words to answer that.
But, in fact, we'll see that we can do it in very, very quick time.
And it's going to involve implementing something called a hash table,
and even though at first glance this thing called a hash table is going to
let us achieve these super rapid response times,
it turns out that there is in fact a problem.
When it comes time to implement this thing called—again, I'm doing it again.
I'm the only one here.
When it comes time to implementing this thing called a hash table,
we're going to have to make a decision.
How big should this thing actually be?
And when we start inserting numbers into this hash table,
how are we going to store them in such a way
that we can get them back out as quickly as we got them in?
But we'll see before long that this question of
when everyone in the class has a birthday will be quite germane.
It turns out that in this room, we've got a few hundred people,
so the odds that two of us have the same birthday is probably pretty high.
What if there were only 40 of us in this room?
What are the odds of two people having the same birthday?
[Students] Over 50%.
Yeah, over 50%. In fact, I even brought a chart.
It turns out—and this is really just a sneak preview—
if there's only 58 of us in this room, the probability of 2 of us
having the same birthday is hugely high, almost 100%,
and that's going to cause a whole bunch of hurt for us on Wednesday.
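Those odds can be checked with a few lines of C. This sketch assumes 365 equally likely birthdays and ignores leap years, and the function name shared_birthday is made up for the example. It multiplies out the chance that everyone's birthday is distinct and subtracts that from 1:

```c
// Probability that at least two of `people` share a birthday,
// assuming 365 equally likely days and no leap years.
double shared_birthday(int people)
{
    double all_distinct = 1.0;
    for (int i = 0; i < people; i++)
    {
        // Person i must avoid the i birthdays already taken.
        all_distinct *= (365.0 - i) / 365.0;
    }
    return 1.0 - all_distinct;
}
```

Under those assumptions, 40 people comes out to roughly 89%, and 58 people to a little over 99%.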
With that said, let's adjourn here. We'll see you on Wednesday.
[applause]
[CS50.TV]
