[MUSIC PLAYING]
DAVID J. MALAN: All right, this
is CS50, and this is week 4.
So let's consider first what we did
last time, which was focus on numbers
and focus on how we can search
them and how we can sort them.
And in particular, we took a
look at a number of algorithms
that we finally gave names to, a couple
of which we actually saw back in week
0, but didn't treat as formally.
So we looked more formally
last time at linear search,
this process of searching elements
from left to right or right to left
as though you're walking along
a line looking for some value.
We looked at binary search, a
divide and conquer algorithm
whereby we started, in
this case, in the middle,
and then we looked left or
right and then went to the left or
went to the right, and then did
the same problem again and again.
We looked also at bubble sort.
How do you get to the
point where you can
use an algorithm like
binary search, which
assumes that your inputs are sorted?
So how do you get to that sorted state?
Well, bubble sort was one
algorithm that essentially
swapped elements pairwise from
left to right, from left to right,
until finally the biggest
elements bubbled up to the top,
and the whole list
was ultimately sorted.
But we also looked at selection sort.
And selection sort, as the
name implies, selected
the smallest element again and again
and again, plucking it out of the list,
and putting it where it belongs.
Insertion sort, meanwhile, walked
through the list really just once,
and as it encountered
each element, it then
inserted it into the sorted half of
the list, the left half of the list
in its proper place.
And then lastly, and where we really
concluded and really got more bang
for our buck was this
final algorithm merge
sort, which actually solved this same
problem no less correctly, but so much
more efficiently.
And indeed, lends itself perhaps
to a fundamentally better design
in so far as it leverages a much
better running time, so to speak.
In fact, we started to ascribe
these kinds of formulas
to the running times, or the
performance, of our algorithms: n
squared, otherwise known
as quadratic, whereby
you take n times n number of
steps or minutes or seconds
or some unit of measure; n log n,
which is actually what we achieved
with merge sort,
which was much faster,
recall, than the n squared algorithms
like bubble sort and selection sort
and insertion sort in
their upper bounds.
n would be something linear, like
a straight line relationship.
Logarithmic of n, log
of n, log base 2 of n,
technically, would be
something like week 0's divide
and conquer approach
looking for Mike Smith,
as well as last time's binary search.
And then one, which doesn't
have to be literally one step.
Maybe it's two, maybe it's 10, but
it is a fixed finite constant number
of steps.
And that too might be the
running time of some algorithm.
Now how do we describe the
running times of algorithms?
Well, we use some special
notation, asymptotic notation,
so to speak, which while it might
look cryptic at first glance,
really is just a handy way of succinctly
expressing the fact that you know what
the upper bound on some
algorithm's running time is,
what the lower bound on some
algorithm's running time is.
And if those are one and
the same formula,
you have a coincidence of the two,
written, so to speak, as capital theta rather than big O.
So that, while Greek,
literally is just a way
of expressing a bit more succinctly
what these running times are.
And we'll continue to
revisit this issue as we
look at more algorithms and
soon data structures still.
But this time, we apologize.
We pull back a layer here and admit
that there is no such thing as a string.
Indeed, all this time we've been saying
that there's ints, and there's floats,
and there's chars, and
there's doubles and more.
And we've also been
saying there are strings,
but there really aren't strings.
This is sort of a figment of the
imagination of our so-called CS50
library.
But it's a pedagogical
simplification that we've
been using for the past
several weeks, so as to not get
lost in the weeds, the
lower level implementation
details of what a string
is, so that we can just get
real work done in these first weeks.
But now we'll begin to
look underneath that hood
and see what a string actually
is and what the implications are.
And it turns out, while more
complicated in some sense,
it really just boils down
to some first principles,
what it is the computer is
doing underneath that hood.
So let's take a look at string first
by way of a couple of examples.
Let's go into CS50 IDE, create a new
file, save it as compare0.c,
and look at a little program
that actually doesn't necessarily
do what we think it's going to do.
In particular, let me go
ahead and include our typical,
include cs50.h. And let me go ahead
and include stdio.h as well.
Let me go ahead and use
main void, so I'm not
going to worry about any
command line arguments for now.
And then I'm going to go ahead
and just prompt the user.
Hey user, give me a string
called S for instance.
And then I'm going to have no newline.
I just want that all in one same line.
Now I'm going to go
ahead and do string s
gets get_string, open
paren, close paren,
so as to get a string from the user.
And then, let me do this.
Let me also print T colon and ask
the user, essentially, for a string
that I'll call T, since T
comes after S quite simply.
And now let me just compare these
strings as the filename suggests.
So I know how to compare,
not with if s equals t,
because that's the assignment operator.
But we know that s equals equals t
should compare the values on the left
and on the right.
So let's try this.
So if s equals equals t, then I'm
going to go ahead and print out same.
Else, they are presumably
different, so I'm
going to go ahead and print out
different with a newline character
and then save it.
So a pretty simple program, so let me
go down into my terminal window and run
make compare0.
Now let me go ahead and
run dot slash compare0. For s,
I'm going to go ahead
and type in Zamyla.
And for t, I'm going to go ahead
now and type in Maria.
OK, and they're different.
I expected as much.
Now let's go ahead and run this again.
Again with Zamyla.
And let's just say
Zamyla again, different.
What did I do wrong?
Let me try this again.
Maybe it's the capitalization.
So Zamyla in all lower case, different.
Well, maybe it's just Zamyla's name.
Let me try Rob or RLB?
How about RLB?
Those are different, as is Rob and Rob.
So what is going on?
Those strings pretty
much look the same to me.
I'm typing the same
incantation of strings,
so what is it that's going on here?
You know what, let me do a
test with something else.
Let me go ahead and
create a new file here.
I'm going to call this copyzero.c.
Maybe I'm just misunderstanding
how comparison works.
But surely I should be able to copy
a string and make an identical copy,
so let's do that now.
Let me go ahead and create
a file called copyzero.c.
And let me do an include cs50.h.
Include stdio.h. int
main void, just as before.
And now, let me go ahead and
just prompt the user like before.
Give me a string s.
And I'll put that in a variable,
s, calling get string as before.
And now, I remember from prior
classes that I'm supposed to do,
if s equals equals null, maybe I
need to do some error checking here.
And if that's the case, I'm
going to arbitrarily return 1.
So recall that main can return values.
0 means all is well.
1 or any other non-zero value
means something is wrong.
So maybe that's what I was doing wrong
earlier, just not error checking.
So let's try that.
Now, let me declare a variable of type
string called t and assign s to it.
So in other words, I
want to copy s into t.
And I know that this happens
from the right to the left.
So that's as I think it should be.
And then, if strlen of
t is greater than 0--
you know what, I'm going
to do one other thing.
Not just copy it.
I'm going to go ahead and do this.
Recall, that you can
treat strings as arrays.
So I'm going to say the zeroth
character, the first character of t--
you know what?
I want to uppercase it.
I want to really make sure that
these two strings are indeed
different as I intend.
And let's go ahead now and do this.
Let me go ahead now and do printf of s: %s.
And let me plug in s's
value and a new line.
Plug in s.
And then printf t: %s
for a placeholder again.
Comma t, semicolon.
So in other words, if the length
of t is greater than zero,
let's capitalize it by
changing that first character.
And then, just print out s and t.
And surely, by capitalizing only t, I
should see only one capitalized word.
Now, I'm using both
strlen and toupper.
So rather than let Clang have a chance
to yell at me, I'm going to go in here
and preemptively add ctype.h.
Which recall is a library
you might have seen
or certainly will soon see that
has a number of functions in it.
Among them, toupper.
And then, I also need to include
string.h, so that I can use strlen--
the function that gives me
the length of the string.
So Clang would have yelled at me if I
forgot that, but let me preemptively
solve that problem.
And now, do make copy0.
All seems to be well.
Let's run dot slash copy 0.
All right.
Let's go ahead and type in-- how about
just my own name in all lowercase.
Huh?
Now, why is this confusing?
So I wrote code that got a string
from the user and stored it in s.
I then wrote code that
declared a second variable, t.
And I set t equal to s,
thereby making I would
think a copy-- as in past weeks
of using the assignment operator.
Then down here, I made
sure t was long enough.
That's just a quick sanity check.
And then on this line
here, I'm just saying,
set the first character of t, t bracket
0, equal to the result of upper casing
the first character of t.
So the only code that's
touching t is this one here.
And yet, somehow my name gets
capitalized both in s and in t.
So what is it that's
actually going on here?
It seems to be broken still.
In fact, let me go ahead and open
another example, rather than type
this one out ourselves.
Let me go ahead and open up
an example called no swap.
As the name suggests,
it's a bit of a spoiler.
I wrote this program in
advance to do the following.
First, I've included standard
io.h, so I can use printf.
I have a prototype of my function
called swap up here, because indeed,
the goal at hand I decided was I just
want to write a simple function that
swaps two values.
Given a and b, make b a and a, b.
And then, return.
So I'm just going to arbitrarily test
this out by declaring a variable x.
Setting it equal to 1.
A variable y setting it equal to 2.
And then, just as a sanity check, I'm
going to print out x is such and such,
y is such and such.
And then, I'm going to
claim swapping dot dot dot.
And then, the key line
is apparently this one.
Call a function called
swap, passing in x and y.
And if I implemented swap correctly,
this should swap the two variables.
Thereafter, I'm going to claim swapped
x is such and such, y is such and such.
So let's run this program and see
what else is apparently a lie.
Make no swap in my source directory.
./noswap enter.
And if we scroll back up in my
history, you'll see x is 1, y is 2.
Swapping... swapped.
x is 1, y is 2.
All right.
So maybe just the swap
function is buggy.
This isn't necessarily
indicative of a misunderstanding.
So let's look at the
implementation of swap.
Swap returns no value and I think
that's OK, so long as it takes inputs.
Swap takes two inputs, an int and
an int, called a and b respectively.
And then, let's consider how this works.
So I've declared a temporary
variable, called temp--
though I could call it anything I want.
And I'm storing in it the value a.
So I'm taking a-- and it's the number 1.
And I'm just temporarily
storing the number 1
in this additional temporary
variable, so that I now
have two copies of the number
one-- in a and in temp.
Then, I change the
value of a to equal b.
So at this point in the story,
a should have a value of 2.
b should have a value of 2.
Which could be problematic, except
I've saved the original value 1
in my temporary variable.
And so, indeed I change b to
equal that temporary value.
So indeed, I can't do
this magical switcheroo.
Because if I simultaneously
took one value and one value
and put them in each other's
places, I would essentially
risk clobbering one or the other.
Because the computer typically
can only do one thing
at a time underneath the hood,
even though it's super fast.
So I run the risk of somehow getting
two copies of that same value,
unless I'm careful.
I really want to take one value,
put it in a temporary space.
Then, copy one value over.
And then, do that switcheroo.
But it doesn't seem to work.
And yet, this is a good
opportunity for one more tool.
Let me do a further check.
Let me go ahead and run
debug50 on noswap.
And you know what?
I'm going to set a break point right at
the beginning of this program on line
10 by simply clicking in the so-called
gutter of the line number here.
A red dot appears, so that
now when I run debug50,
notice my right hand bar pops out.
Notice that the line 10 in question
is highlighted, because execution is
paused at this line.
Let's take a look at what's going on
the right hand side of my debugger.
Notice that x has a value
of 1 and y has a value of 2.
And that's what I intend.
Now, let's begin to
step over these lines.
So let's step over the printf.
And we see on my terminal
window that x is 1.
Step over the next printf
and we see that y is 2.
Now, let's step over that and we
see indeed swapping dot dot dot.
But rather than step over my call to
the swap function, let's step into it.
So I can actually see
maybe my logic is buggy.
So let's go ahead and step into swap.
And notice that debug50 immediately
jumps down to the first line in swap.
And notice now on the right
hand side, what's going on?
a is 1 and b is 2.
And that's as expected,
because I passed in x and y.
Swap calls its two inputs a and
b respectively, so a is 1, b is 2.
Temp though, very curiously is 32,767.
And where is that value coming from?
Well, it turns out when
you declare a variable,
but haven't yet set it to some
value, it's going to, by definition, have
an undefined value.
A so-called garbage value.
Remnants of other
aspects of your program
having used that chunk of memory before.
And we'll see why this is in just a bit.
So this is anomalous.
It's not a bug in the program.
It's to be expected, because
I haven't yet assigned
temp a value until-- let me zoom out.
Until I step over line 21 and
actually let temp equal a.
And notice, if you watch the
right hand side of my screen now,
temp is currently 32,767.
And when I step over it, OK.
It becomes 1.
So that's as expected.
Now, a is about to become b.
And indeed, we now see that a
is 2, b is 2, and temp is still 1.
So now we have to put temp into
b, so let's step over that line.
And indeed, we've done
a switcheroo of a and b.
a is 2, b is 1.
OK.
Let's go ahead and play this out.
Let's continue here.
I claim that I've swapped.
I'm printing x-- and somehow
x and y are still 1 and 2.
Even though the debugger
told me it would
seem that my swap function was correct.
So what is happening after all of these
weeks of CS50 and all of these problems
solved thus far?
It would seem that all of
our understanding of things
is kind of unraveling.
And that's because we've been very
careful over the past few weeks,
at least in class
and in sections, to try to avoid
tripping over some of these lower-level
implementation details.
And there's not that many of them.
But today, is now a time
to peel back this layer
and understand exactly
what it is that's going on.
Indeed, all of this time when you
run a program on your computer,
double-click an icon or run the
program's name with dot slash
something or other at your
terminal window-- what happens
is that your computer gives that program
the illusion of a really big chunk
of memory all of its own.
Maybe two gigabytes of memory,
even though it doesn't necessarily
use all of that memory.
And that memory-- if you just
think of your computer's memory,
as we've done before, is
like a rectangular region.
And we could number all of
the bytes in my memory.
But it doesn't really matter what the
addresses are, what the numbers are,
just that it exists.
It turns out that when
you run a program,
your operating system typically lays
out your program's memory in a fairly
standard way.
There's a chunk of memory down here for
something called environment variables.
There's a bigger growable chunk
of memory down here called
the so-called stack.
On the opposite side of the
picture is a so-called heap.
Another chunk of memory that actually
grows in the other direction.
So long story short, bad things
can happen if both of those
grow bigger than you intend.
Then, there's some kind of uninitialized
and initialized data up top.
And then, text.
Now, it turns out text is
the segment of memory where
your program's zeros and ones live.
So when you double-click
an icon on your Mac or PC
or when you run the command
dot slash something,
those zeros and ones are loaded from
your hard disk or solid state disk
into RAM or memory.
And it's put conceptually
at the top of the memory
that your computer program is using.
And below that is the actual
data that your program is using.
The variables and the
values inside of it.
Now, each of these types of
memory have different purposes.
And we'll see in just a moment
what it is that's going on.
And we'll ultimately
peel back these layers.
So what is it that's actually
going on underneath the hood here?
Well, let's consider this
to be my computer's memory.
So focusing on just that
bottom most portion,
which I called the stack a moment ago.
So if we draw just the bottom of my
computer's memory kind of like this,
the bottom of it has technically
environment variables.
But let's focus on the
region known as the stack.
And the stack, as the name implies,
is kind of like the stack of trays
that you might see in a
cafeteria or a dining hall,
where you put trays on top of another
until it can get potentially pretty
tall.
And it turns out when
you run a program, not
only is your program given the illusion
of this really big memory space laid
out as proposed, but
it also by convention
uses this memory in a
fairly standard way.
Specifically, when main is called,
main is given a chunk of memory
at the bottom, so to
speak, of this stack space.
And so, let me go ahead
here and write main.
And any local variables that main
has and any arguments to main,
namely argc and argv, end up inside here.
So if indeed you are using
something like argv and argc,
you might have a value
like this down here.
And you might have another chunk
of memory carved out here for argv.
And if you have a couple of
local variables, for instance x
and another one, y, those two would
be allocated space in this slice
if you will.
This frame of memory.
Meanwhile, if main calls a
function, like swap-- swap
is allocated a swath of memory, a
frame of memory, above main by design.
So if I've called swap,
its memory ends up here.
And if the swap function
itself has arguments,
like a and b or any
other local variables,
those values too are put inside
of the so-called stack frame.
So this might be a and this might be b.
In other words, the concepts
that we've been taking for
granted, both in Scratch and in C, at
the end of the day, boil down to values
needing to go somewhere physically.
And so, if you assume that
the big rectangular region
here is your computer's memory.
And then, you consider that
the operating system really
just slices and dices this memory,
such that main's memory is down here.
Any function that main calls
is immediately above it.
And frankly, if swap called its own
function, it would end up above it.
But now, given this basic
definition of memory management
and the layout of computer
program's memory space,
you can perhaps start to infer
why all of these failures
have started to happen in my program.
A moment ago, I didn't
have argv and argc.
I just had for instance x and y.
And I had the value 1
and I had the value 2.
Main then called swap and put a copy
of 1 there and a copy of 2 there.
And indeed, that's the key insight.
When a function calls another function,
passing in arguments as inputs,
that function is being
passed copies of those inputs.
So at this point in time, if you
opened up the lid of your computer
and looked inside digitally,
you would see 1 and 2 down here.
And you would see
another pattern of bits
representing 1 and 2
up here in duplicate.
So now when my swap
function operates, it
declares a temporary
variable recall, called temp.
So let me draw that here.
And as I recall, it stores
in temp, which value?
The value of a.
The value of a is 1.
It then took the value of b, put it
in a-- which puts that value here.
And now at this point in the
story, a and b are incorrect.
We still need to put the value 1 in b.
And that's why we then took
temp's value, put it in b.
Thereby giving me ultimately
the number 1 in b's slot as well.
And so, at this point in the
story, temp still exists.
And a and b have the correct
answers, 2 and 1 respectively.
But the catch is the moment
that swap returns, this happens.
Essentially, everything that was
being used above main disappears.
It's not actually deleted.
All of those bits are still
there, so technically the numbers
are still there.
But the computer just
forgets about those values.
And indeed, the key
takeaway here is that when
execution returns to main-- that
is when swap says, return I'm done.
And main takes over operation again,
1 and 2 are completely unchanged.
So all of the work that
we just did up there
was a complete waste
of time in some sense.
Because even though it was correct
in so far as it swapped a and b,
it did not actually have a
permanent impact on the variables
that I actually cared about.
Let's consider another example now.
Recall my previous example where when I
whipped up some code to compare values,
I called get string once and
stored its return value in s.
I called get string twice and
stored its return value in t.
And then, I just tried
to compare s and t.
Well, what does that actually mean?
It turns out that when
you call get string,
it's not technically
returning a string to you per se.
It's not technically returning
to you a sequence of characters.
It's actually returning
something much simpler.
When I call get string and do this--
string s gets get underscore string,
all that's happening is this.
The left hand side of this expression
is telling the computer, hey, computer.
I need a variable called s that's big
enough to fit a string so to speak.
On the right hand side, get string
indeed gets a string from the user,
like D-A-V-I-D or Z-A-M-Y-L-A.
But where does that string come from?
It turns out it comes from
somewhere else in memory.
And suppose that the user has indeed
typed in Z-A-M-Y-L-A quote unquote
so to speak.
That's just kind of floating
somewhere in my computer's memory.
And if this string happens to
exist at address number 123,
byte number 123 in my computer's memory.
What's actually going to get
returned is the value 123.
In other words, a string is
technically just the address
of a character in memory.
In fact, if we zoom
in further on Zamyla,
recall that Zamyla's name really
looks like this underneath the hood.
It's an array, a block of
contiguous memory with a backslash 0
at the very end.
And if this first byte just
happens to be the number 123,
the second byte is going to be 124.
The third byte it's going to be 125.
And so forth.
In other words, if my
computer has a billion bytes
of memory, 2 billion bytes of memory--
like a gigabyte or two gigabytes,
you can certainly number each
of those bytes and the computer
does underneath the hood.
And so Z-A-M-Y-L-A backslash 0 is simply
a sequence of 7 characters that live
somewhere physically in memory.
And what a string technically is,
is just kind of a breadcrumb--
specifically the address of, the number
of the first byte of Zamyla's name.
So when you return a string, you're not
handing back Z-A-M-Y-L-A backslash 0,
you're handing back
something much smaller.
A little scrap of paper, if you
will, that's a map to Zamyla's name
somewhere in your computer's memory.
And so over here, this number 123 is
generally called, not just a number,
but an address.
Much like post boxes and houses have
addresses physically that uniquely
identify them, 123 uniquely
identifies that byte in memory.
And so via this address,
this map if you will,
can my program later on find Zamyla's
name and do something with it.
Like print it out, or capitalize it,
or compare it, or anything else.
And so, how though
does the computer know
where Zamyla's name begins and ends?
Well, it begins at 123.
And again, recall from two weeks
ago, it ends with backslash 0.
So with those two markers,
where it begins and
where it ends,
you can infer the
entirety of someone's name
or the entirety of a string in between.
Now, consider my second line of code.
String t gets get string.
So what does this mean?
This is another function call.
And suppose that I just happen to type
in Z-A-M-Y-L-A enter, just like before.
What's happening there is that the
computer, blind to the fact that,
that might already exist
somewhere in memory,
is going to give me another chunk of
memory somewhere with Z-A-M-Y-L-A.
And then, backslash 0.
Whereby, this is effectively
an array of its own.
And I don't know where it is.
Maybe it's at address number
234, followed by 235 dot dot dot.
It's somewhere else in memory.
I don't know or need to
even care where it is.
But the key detail here is that t,
insofar as it is a variable itself,
gets what value?
Well, if the string Zamyla,
the second time I type it in,
just happens to end up at location 234,
what ends up in t is
simply the number 234.
So if we change back to my code where
I was comparing s and t, as I am here.
I've called get string
once, stored the value in s.
Called get string twice
stored the value in t.
And then, I'm quite simply saying,
if s equals equals t, then say same.
Else, say different.
Now, when you consider what is really
going on underneath the hood all
these weeks, certainly 123
is not the same as 234.
And so, s does not equal t.
Capitalization meanwhile.
What happens with that other program?
Well, we'll do this one more quickly.
But if I for instance,
call string s gets get
underscore string open
paren close paren.
That again, is going to
give me something like this.
And it's going to give me something
like Z-A-M-Y-L-A backslash 0,
all of which I can think of, as before,
like an array-- if sloppily drawn.
Maybe starting again at address 123.
And so that's what ends up in s.
But in my copy 0 program,
recall that I did this.
String t gets s.
I didn't call get string again.
I just said, store s inside of t.
Effectively, I thought,
making a copy of it.
What this left-hand side gives
me is another box called t.
If this is s, this is t.
But what goes inside of this new box t?
Well, what goes inside
is literally a copy of s.
What is s? s is 123.
OK. t then, is 123, which
means later in my copy program,
when I simply decided to
capitalize t bracket 0,
the first character in
my string t-- that's
kind of misleading, because my
string t is really just my string s.
They're sort of synonyms for
one another at this point,
because one indeed equals the other.
And so what has happened is that I
have gone to the first character in t,
which of course is z.
And recall from my example earlier when
I typed it in all lowercase, David,
for instance, with a lowercase d, it
capitalized it not only in t, but also
in s.
In other words-- and frankly, it
doesn't really matter, typically,
what these addresses are.
I'm just using 123 and 234 because
they're sort of easy to say.
But you can really think of
s as again, being a map that
leads you to the string you care about.
A pointer, if you will.
Literally an arrow.
And t, similarly, can be
thought of as a pointer.
And the key detail here is that
because I've set s equal to t-- or t
equal to s-- they are effectively
pointing at the same thing.
So strings are a lie.
There is no such thing
as a string data type.
There are things called
chars, characters
that can live somewhere in memory.
And we humans can
arbitrarily decide that hey,
if we put a backslash 0 character at the
end of a sequence of other characters,
we can all just agree universally
to treat that as the end of a quote
unquote "string," that
is a word or a phrase
or paragraph or anything even bigger.
But we need a convention for
remembering where strings begin.
We've already solved the
where do they end problem.
So where does the string begin?
It begins at an address.
It begins with a pointer.
And so this special data type
that we declare in CS50's library
called string really is just
in CS50 IDE an 8-byte value,
a 64-bit value that is just a
really big number that represents
the address in memory of a string.
And I say really big just
because the IDE gives you access
to lots of memory, certainly
numbers bigger than 123.
But a string is just a number, is
just an address, AKA a pointer.
And that explains, then, why
all three of these examples
did not behave as I might
have hoped, because rather
they were taking things
a little too literally.
Or I was failing to appreciate
what's actually going on.
Let's pause for a moment,
take things down a notch.
Make things a little more
real with a bit of claymation
that will motivate, eventually,
peeling back this layer further
and seeing what's really going on.
[STRUMMED CHORD]
NICK PARLANTE: Hey, Binky.
Wake up.
It's time for pointer fun.
BINKY: What's that?
Learn about pointers?
Oh, goodie!
DAVID J. MALAN: Binky, who exists
here only in claymation form,
is the product of a good friend of
ours, Nick Parlante at Stanford,
who teaches computer science there.
You'll see more of Binky and hear
more of Nick in just a moment.
But these here are
sort of metaphorically
the training wheels that we've
had on for the past few weeks.
And the goal now at hand
is to take these off,
and to finally start looking at what's
really going on underneath the hood.
And starting to remove,
if you will, let's
see if-- [BANG] --probably
not the best idea.
Remove, if you will,
these training wheels,
and actually see what's
going on, and understand
and take advantage of the same.
As follows.
Let's go ahead now into CS50
IDE, and go ahead and open up,
let's say, compare1.c, which I
wrote in advance to look as follows.
And you'll notice that it works a
little differently from version zero.
Here we have a prompt for string s.
And we store in it the
return value of get string.
But notice what's on the left-hand side.
Char star s, all of a sudden.
Indeed, all of this time,
I've been treating things
as though they are strings, literally.
But it turns out a
string is just a synonym
for a data type known as a char star.
And the new syntax today,
then, is this star operator.
The asterisk that actually has special
meaning in certain contexts, not just
multiplication.
But in this case, it specifies
that the type of a variable
is not a char literally,
but a char star, the address
of a char, a pointer to a char.
Now, why char?
I thought we were talking about strings.
But again, recall that a string
is just a sequence of characters
back to back to back, and
therefore, you can define a string
by the address of its first character.
Ergo, what we really
need underneath the hood
is a data type that lets us
store the address of a character.
There is no string.
And so what does this allow us to do?
Let me go into CS50 IDE, and let
me declare then, on this line
here, that s, this time, will
not be a "string" quote unquote.
That was from the CS50 library.
But rather it's going to be a char star.
It turns out that all this time
get string, again, does not
return a string, it returns
the address of a string, AKA
the address of the first
character in a string.
And so the type of value it's
returning is not just a number.
It's not just an int.
It's a special type of an int.
It's used for a different purpose.
It's simply an integer that
represents the address of a char.
And the way you type out address
of a char is literally char star.
So this, then, is identical to my
previously in weeks past having typed
string s.
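To make that synonym concrete, here is a minimal sketch. The alias name text and the helper functions are hypothetical, chosen only for illustration; the point is simply that a type alias for char * and char * itself are interchangeable.

```c
#include <stdio.h>

// Hypothetical alias, for illustration only: "text" is just another
// name for char *, the address of a char.
typedef char *text;

// Returns the character stored at the address where a string starts.
char first_char(text s)
{
    return *s; // go to the address in s and read the char there
}

// Prints the same string twice: once via the alias, once via char *.
void show_both(void)
{
    char name[] = "david";
    text s = name;
    char *t = s; // no conversion needed: both are char *
    printf("%s %s\n", s, t);
}
```

Because the alias and char * are the same type, a value of one can be assigned to the other without any cast.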
Now I'm going to start
typing it as char star.
Meanwhile, t is going to be the same.
So when I prompt the user
for another string there,
I'm going to store that
return value inside of t.
And now, notice, just for
good measure, I'm making sure
that both s and t are not null.
I'm using a bit of conditional
logic there, saying if s is not null
and t is not null, it's safe to proceed.
Because recall that get string can
accidentally return null sometimes
if your computer is out of memory,
or something else goes wrong.
Or not so much
accidentally, but by design.
But notice this new chunk of code.
It turns out-- and we
know now from a moment
ago-- you can't just
compare s against t.
They're not going to
equal the same thing
if you type in two independent strings.
We need a special function that actually
compares strings in a conceptual way.
I mean that a string is equal to another
string if every one of its characters
equals every one of the
other string's characters.
Thankfully, there exists in C a function
that does exactly that called strcmp,
string compare, and it
takes two arguments.
The first is a string.
The second is a string.
Or more properly, the first
is the address of a string.
Or even more properly, the first
is the address of a character.
The second is the
address of a character.
And str compare is just going to
hope that both of those strings
eventually end in a
backslash zero, so that they
don't loop forever through memory.
They eventually hit that
special null terminating byte.
And if so, and those characters are
all entirely equal, you print same.
Else, as before, we print out different.
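As a rough sketch of the difference between the two comparisons, assuming two hypothetical helper functions (same_address and same_characters are not from the lecture's compare1.c):

```c
#include <stdbool.h>
#include <string.h>

// True if the two pointers hold the same address (what == compares).
bool same_address(const char *s, const char *t)
{
    return s == t;
}

// True if the two strings hold the same characters (what strcmp checks).
bool same_characters(const char *s, const char *t)
{
    return strcmp(s, t) == 0; // strcmp returns 0 when the strings match
}
```

Two separately stored copies of david would fail the first check but pass the second, which is why the program calls strcmp rather than using == directly.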
So now, when I compile this
program, make compare1,
and then I do ./compare1.
Now I'll type in david in all
lowercase, david in all lowercase.
They're indeed the same.
Let's do it again.
Zamyla with a capital Z. Zamyla with a
capital Z. They are indeed the same.
Let's do Zamyla with a capital
Z. zamyla with a lowercase z.
Different.
And then FUBAR, clearly different.
Now we're actually comparing
these things properly.
Because now we're appreciating what
it actually means to be a string,
and we are underneath the hood
comparing what we should be doing.
Now, underneath the hood,
what is str compare doing?
Honestly, it's probably just
a while loop or a for loop
that is iterating over the
string and their lengths
and looking at the i-th
character in each string,
and making sure they're
all in fact equal.
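One way such a loop might look, as a sketch only. The function name strings_equal is hypothetical, and this is not the actual library implementation; the real strcmp also reports which string sorts first, whereas this version only answers equal or not.

```c
#include <stdbool.h>

// Hypothetical stand-in for what a string comparison might do inside.
bool strings_equal(const char *s, const char *t)
{
    // Walk both strings in step, one index at a time.
    for (int i = 0; ; i++)
    {
        if (s[i] != t[i])
        {
            return false; // first mismatch, including one string ending early
        }
        if (s[i] == '\0')
        {
            return true; // both reached the terminator at the same index
        }
    }
}
```

The loop relies on both strings ending in backslash zero; without that terminator it would keep reading memory past the end.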
Let's go ahead and fix
copy with version 1 here.
copy1.c that I've written in
advance now looks like this.
I still prompt the user for a
string s with these lines here.
I then, just to be safe, say
hey, wait a minute, if s equals
equals null, return 1.
And again, 1 is just arbitrary.
I just want to get out, lest
I break something later.
Now, down here, this
is a new line of code.
And this is perhaps one of
the most powerful ingredients
we'll see this week, is this
new function called malloc,
memory allocate.
This is a special function via which
you can ask the operating system, Linux
in the case here, or Mac OS, or Windows,
if you're running the code locally,
hey operating system, please give
me a bunch of bytes of memory.
Now, why do I want this?
This program is copy1.c.
The goal at hand is to create a
program that copies a string s
and stores the copy in t, so to speak.
Last time, it was not sufficient
just to say t equals s,
because that copied the addresses.
That didn't give me a copy of
Z-A-M-Y-- Z-A-M-Y-L-A backslash 0.
It instead just gave me
a copy of the address.
So how do I get a complete
copy of Zamyla's name?
I need to preemptively do
a little bit of arithmetic
and say, all right, how
long is Zamyla's name?
Well, it's the length
of s, str len of s.
But plus one.
Why plus one?
Why plus one?
Yeah?
Exactly.
We've now hit the lowest
level of the computer.
If we don't ask the operating system for
memory for that extra backslash 0 byte,
we're not going to get it.
So we have to explicitly
say, give me one more byte,
because I know how strings are
implemented underneath the hood.
I need to put that backslash
zero there, ultimately,
and then whatever that expression is,
the length of Zamyla, so Z-A-M-Y-L-A,
six, plus one, seven bytes.
Times the size of a character.
Turns out it's always going
to be one, by definition.
But just for good measure, I'm
clearly saying, give me seven times
the size of a char,
which is going to be one.
That gives me seven total bytes.
So just to simplify.
If you multiply all this out,
because the line looks unnecessarily
cryptic at the moment, this really
is equivalent, at the moment,
to just this.
Call the function malloc.
Give it the number seven, so that
malloc, and in turn, the operating
system, looks inside of its memory
bank, so to speak, and says,
hmm, where are there seven available
bytes that aren't currently in use?
Ah, here is a chunk of them.
And it's a contiguous chunk.
It's going to find a block of memory,
a rectangular region, if you will,
and grab seven bytes, and
return them to my main function.
But what do I mean to
return a chunk of memory?
Well, just as get
string returns a string
by returning the address of the
first character in that string,
so does malloc equivalently
simply return the address
of the first byte of memory.
But the danger now is
that unlike a string,
malloc is not giving you characters.
It's just giving you seven bytes in
a row that you are now free to use.
It does not give you a backslash
zero at the end of them.
If you want to remember the
length of the chunk of memory
you just allocated, the burden is
entirely on you, the programmer.
And indeed, one of the most common
sources of bugs in writing code in C
is to forget about how long
was this chunk of memory,
and to accidentally, with a loop,
go too far past the end of it.
And we'll see what can
happen in those cases.
So now, assuming I do have in t the
address of that chunk of memory,
let me just say, if t equals
equals null, return 1.
Something happened that's bad,
probably the operating system just
didn't have seven extra
bytes of memory to give me.
So fine, I'll quit.
Then down here, what do I want to do?
Well, I now need to implement, at
least in this example, my own copying
process.
Here, at this point in the story,
I have two variables, s and t.
s contains the address
of Zamyla's name. t
contains the address of a new
chunk of memory of length seven.
So here's what I want to do.
Just like a couple of
weeks ago, I'm going
to iterate from zero on up
to the length of the string.
But not up to, but up through
the length of the string.
Because in this case, I actually
want to iterate with a for loop
up through that backslash 0 byte.
And then just this syntax
from a couple of weeks ago,
when we simply manipulated strings as
for our cryptography ciphers, character
by character.
Make the i-th character of t
equal to the i-th character of s.
And this is perfectly valid,
because so long as this loop doesn't
go past n, the number of characters
that I allocated, seven, in this case,
I can go to t bracket 0, bracket 1,
bracket 2, all the way up through n,
effectively copying the string.
And so now when I actually print out s
and t, I should see that t truly is a copy of s.
Because even when I
force its first character
to upper case with this same
line of code here as before,
I'm actually changing different memory.
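The copying steps just described can be sketched as follows. The function names copy_string and capitalize_first are hypothetical and are not taken from copy1.c itself; the standard library's strcpy performs a similar character-by-character copy.

```c
#include <ctype.h>
#include <stdlib.h>
#include <string.h>

// Returns a freshly allocated copy of s, or NULL if malloc fails.
// The caller is responsible for calling free on the result.
char *copy_string(const char *s)
{
    size_t n = strlen(s);
    char *t = malloc((n + 1) * sizeof(char)); // +1 for the '\0'
    if (t == NULL)
    {
        return NULL;
    }
    // i <= n, not i < n, so that the '\0' is copied too
    for (size_t i = 0; i <= n; i++)
    {
        t[i] = s[i];
    }
    return t;
}

// Capitalizes the first character in place, if there is one.
void capitalize_first(char *t)
{
    if (t[0] != '\0')
    {
        t[0] = (char) toupper((unsigned char) t[0]);
    }
}
```

Capitalizing the copy leaves the original untouched, because the two addresses refer to different chunks of memory.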
So let's compile this.
make copy1, ./copy1.
And let me go ahead and type
in zamyla in all lowercase,
and now notice the
program does seem to work.
Zamyla is reprinted in lower
case for s, but it's then
printed in uppercase for
its first letter for t.
And because the z's look pretty
similar, let's do my name again,
whereby I type david in all lowercase.
Type Enter.
And now, you see s is
still david all lowercase,
but t has only now been
capitalized itself.
It hasn't had a side
effect of some sort on s,
because they're different
chunks of memory.
Why?
Well, what has just happened
in this program is this.
We have, again, done
string s gets get string.
And when we typed get
string, this gives me
a chunk of memory for the address of s.
Get string gives me a name,
like D-A-V-I-D in all lowercase,
plus that backslash 0.
Which again, is really just an
array underneath the hood like this,
that starts at some byte, and maybe
it's again, by coincidence, 123.
Just because it's easy
to say, but that's not
where it's necessarily going to end up.
And so what ends up here is 123.
And then later, when
I allocate t, I again
get this little chunk of
memory that's supposed
to store the address of a character.
And actually, recall
that we're now doing this
as char star, not even string.
So t is similarly a char star.
And what happens, malloc,
when I ask it for seven bytes,
gives me 1, 2, 3, 4, 5,
6, 7 bytes of memory.
There's no null terminating
character just yet.
It's just a block of memory.
And frankly, there could be
some random values here, as
denoted with question marks here.
It's just a chunk of memory that
might have been used previously
in my program for some other purpose.
But what gets stored here, if
this happens to be at address 234,
is this value here, 234.
And if you're not liking
the numbers, again, you
can think of these as just
being pointers, arrows,
to these chunks of memory.
But now, in my C code, when I have
a few lines above this loop whereby
I am copying from s
bracket i into t bracket
i, each of the characters in my
loop, what's actually happening?
Well, fairly intuitively, this lower
case d ends up getting copied here.
This lower case a ends
up getting copied here.
V-I-D, on through.
And David can't count,
so-- backslash-- oh, right.
David's name is shorter than Zamyla's
name, which means we didn't actually
ask for this many characters over here.
But we have taken the
computer more literally now.
Give me six bytes, not
seven bytes, in this case.
And then literally copy each of the
characters from the original string
into this new string all the way up
through that backslash 0 character.
And then when you capitalize
the first character in t,
you are literally only changing
this-- we can do better than this.
We are only changing this
first character here,
which looks like that now.
And that's what's going
on underneath the hood.
So this is why, then, in
the beginning of the class,
we don't introduce
strings as char stars,
because you very quickly get
caught up in a lot of this minutiae.
But at the end of the day,
it's not all that complicated
once you realize that a
string is just an address,
and malloc, this new function,
also just returns an address.
This is very powerful,
because now you have
these sort of breadcrumbs that can
lead you to different places in memory.
A little map, so to speak, that can
lead you to actual strings in memory,
and can actually now solve
problems more effectively.
For instance, we can go back
and solve one other problem
we saw a moment ago, which was swap.
So this version of swap was broken why?
What was the source of
this fundamental problem?
Yeah?
STUDENT: [INAUDIBLE]
DAVID J. MALAN: Yeah.
When you went back to main, you
erased the memory on top of it,
and the fundamental problem there
was when I passed in x and y,
they became copies called a and
b, in different chunks of memory.
And so the fundamental
problem seems to be
that swap is incorrectly implemented.
It's logically correct.
It does swap two values.
And we saw that with debug50, but it's
kind of fundamentally flawed in so far
as it requires, it
seems, by design of C,
that a and b be passed in by
value as copies, so to speak.
We need some way to change this function
to say main, hey, uh-uh, don't give me
copies of your variables.
Give me a treasure map that
will lead me to your variables.
Give me the address of x.
Give me the address of y.
And I'll still call them a
and b, or whatever I want,
but lead me to the original values.
Don't just pass me
copies of those values.
And so we can change swap as follows,
from a function that's
entirely incorrect
into one that is correct.
And we need to change
the syntax a little bit.
So before is what we had here.
After is what we now have.
Before, after.
Before, after.
So if you see it in
rapid succession there,
all you see is that a whole bunch
of stars are appearing in the code.
And unfortunately, C was not
designed in the best of ways
to make clear what star
means in different contexts,
but it's all related as follows.
The fact that I have now
changed a and b to be not ints,
but int stars, int pointers, if you
will, means that when main calls swap,
it is by design of the C language,
going to pass in the address of x
and the address of y.
So that's what the star means.
Give me the address of an int and the
address of an int, not actual ints.
Now, down here, the star unfortunately
means something slightly different,
but related in spirit.
Int temp just gives me an integer,
an int variable called temp.
Star a, in this context, without
the word int in front of it again,
means go to that location.
Follow the treasure map, so to speak.
Go to the address that is in a.
So for instance, if the
address of a is say,
33 Oxford Street,
Cambridge, Massachusetts.
That happens to be the address
of the CS building at Harvard.
Star a means go to 33 Oxford
Street in Cambridge, Massachusetts.
The star just means go to
that particular address.
So what does that mean, then, down
here, when I say star a gets star b?
That means go to the address
in b and get its value,
and store it at wherever
a is pointing to.
So go to a and wait for me for a value.
Go to b, get a value, and put
that value at the location in a.
And then lastly, this just
means go to the location in b,
go to whatever building
that is, so to speak,
and put the value that is in
temp inside of that building.
So a pointer is just an address.
These stars just mean
pointers are involved.
Give me the address of an int,
give me the address of an int.
And again, confusing,
admittedly, the star
in this context where we don't have
the word int in front of it again,
on either side of the equal sign,
just means go to that address.
Go to that building.
Go to that other building
and put something there.
So we can now fix our
swap program correctly.
We can now open up, as I will here,
swap dot c, which I wrote in advance.
That looks almost the same, except
that I've changed the swap function
as follows. a is now
int*, b is now int*,
and I also borrowed the stars
inside of the function, as well.
But something's gotta change.
There's one more line of code I need
to change for all of this to work.
What is that?
What line needs to change?
Well who cares about swap?
It's main that was calling
this thing in the first place,
so let's go back to the original story.
Main, here, declares x and y as 1 and
2, does some printfs here, as before.
But notice this line has to change.
So one more piece of syntax today.
And we're running out of new symbols.
We've seen most of C already.
&x and &y means get me the address
of x, and get me the address of y,
and pass those in instead.
So x,y would just mean pass in
a copy of x and a copy of y,
or the values thereof.
&x &y means give me a little, you know,
map with the address of x and a little
map with the address of y, so that
swap-- who's receiving those maps--
can go there.
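Putting the pieces together, the pointer-based swap can be sketched like this. It follows the three lines walked through above; the comments are added here for illustration.

```c
// Receives addresses, so it can reach the caller's own variables.
void swap(int *a, int *b)
{
    int temp = *a; // go to the address in a and read the int there
    *a = *b;       // read the int at b, store it at the address in a
    *b = temp;     // store the saved value at the address in b
}
```

A caller holding int x = 1 and int y = 2 would write swap(&x, &y), after which x is 2 and y is 1. Writing swap(x, y) instead would pass plain ints where addresses are expected, which the compiler should flag.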
So what does this mean
in pictorial form?
If we now go back to the
beginning of this story,
where we were looking at my computer's
memory as this big rectangular region
like this.
With main's chunk of
memory at the bottom here.
And inside of main was two variables,
like x, and another variable y.
And inside of those were
the numbers 1 and 2.
And then I called swap.
And so swap gets its own frame
on the stack, so to speak.
This is swap's frame.
It, too, had a variable called
a and a variable called b.
But what goes in there now?
It's not 1 and 2.
We need to know a little something
more about my computer's memory.
And I don't know where
everything's laid out,
but let me just arbitrarily
assume that, you know,
it's inside of my computer's memory.
Maybe this is byte number 90.
This is going to be 91.
This here is going to be 92,
93, 94, 95, and so forth.
I just need to know that there's
some kind of numbering scheme there.
So what goes inside of a is 91.
What goes inside of b is 92.
And not the values 1 and 2, but rather
the addresses of those values 1 and 2.
Because now my code for the swap
function, consider what it does.
It says, upon receiving
the address of an integer,
called a, upon receiving the address
of another integer, called b,
go there and store that value in temp.
Go to the address in b and store
that value at the address in a.
Store the value in temp
at the address in b.
So let's see what happens then.
So first of all, I need another
variable here, called temp.
Temp, meanwhile, is not a pointer.
It's just an integer,
but what does it store?
Well, according to my
code, temp gets the value
of going to a, going
to the address in a.
So what is a?
a is 91.
That's like a treasure map leading
to, OK, this chunk of memory
down here in my computer.
And what value is there
once I've gone there?
Once I've gone to the CS building
inside of it, I see the number 1,
and so I put the number 1 in temp.
Meanwhile, my second line of
code says go to the address in b
and grab its value, and put
it at the address that's in a.
So what does that mean?
Well, star b means
start here and go to 92.
So it's like an arrow-- kind
of like chutes and ladders,
if you know the game--
like go to address 92.
What value is there?
The number 2.
And the other half of the
equation, on the left,
said, go to the address
in a, which is here,
and copy the 2 into that location.
And then the last line--
only one more line--
is this, get the value in temp-- that's
easy-- and put it at the address in b.
So we have to go to b
and put temp in there.
So to do that, here's
temp, it's the number 1.
I have to go to the address in b.
The address in b is 92.
So let's go there, and
aha, let me go ahead then,
and overwrite the value that's
there with the number 1.
So now this frame of
memory on the stack--
the 91, the 92, and the
temporary variable--
they are, by design of my
new function, disposable.
I really don't care, after swap
returns, if those things continue--
I did care a little bit about that.
I don't care if those
things continue to exist.
All I care about is that
x and y continue to exist.
So in this way, the new and
improved version of the swap function
is actually having a permanent
impact on my data.
And with the frame, the
memory still looks like that,
because it's gone to the address in a.
Gone to the address in b, which
leads it to the original x and y.
And so by way of pointers,
by way of these addresses,
do we have the ability to actually
go much, much deeper into a program
and actually get at
values that previously we
had no way of even expressing.
So it's at this point in the story where
I usually admit that, at least for me,
this has been among the most challenging
topics when I, myself, was a student.
And in fact, all these years later--
it's like, 20, 20 year-- yeah,
I think we're up to 20 years ago.
20 years ago-- I didn't
take this photo then--
but I sat in what was, at the time,
the back right hand corner of Eliot
House's dining hall, here at Harvard.
And I sat down with my teaching
fellow, who of all the TFs I had
as an undergrad, still remember to
this day, [? Nishat ?] [? Meda ?].
And we just reconnected on
Facebook, all these years later.
Very exciting.
And it was he who
wonderfully sat down with me
at office hours one
day in the dining hall,
trying to help me understand pointers,
because it was just so much more
technical than all the other stuff.
Like, there is no puzzle piece in
Scratch for the address of something
that leads you somewhere so
powerfully as these stars seem
to be able to, here.
And this is only to say that
this is among those topics that
might take a little bit of
time to sink in, but it does.
And when it does, it really is that
proverbial light bulb that goes off.
And for me, that light bulb
went off right then and there.
Now, what more can we do with these
things, after that motivational speech?
Pointer arithmetic.
So, sort of complicated sounding
topic, but really, it just
goes back to first principles,
as to what a pointer actually is.
And it allows us now
to do things like this.
Let me go ahead and open up one other
program that I wrote in advance here,
called pointers dot c.
And take a look at what
this thing does here.
It works a little differently
from the syntax we're used to,
and from any of our
crypto problems thus far.
So notice on this first line here,
I get string and I store in s.
No more string right
now, just char star.
We can be real and talk about
it as the address of a char.
A little sanity check,
is s equal equal to null?
If so, just return.
Something went wrong, so
let's not deal with it now.
Down here, a for loop.
For i gets 0 all the way up to n.
So this is just a standard syntax
we've used a few times now, even back
in week 1 when we just
wanted to iterate.
Or in week 2, when we wanted to iterate
over the characters in a string.
But we've never seen this
kind of craziness before.
A star, and then some
arithmetic in parentheses.
In the past, when we wanted to print
out a character, as implied by %c here,
we quite simply, as I recall, did this.
Which was nice and intuitive, right?
The square brackets
denote to treat the string
as though it's an array, which it
really is, an array of characters.
And that means get the
i-th character of s.
But now that we understand what s is, we
don't need to use this syntactic sugar,
as it's called.
Any time a language has a
feature that's convenient to use,
and easier to read sometimes,
but isn't fundamentally
necessary to express
yourself, it's often
called syntactic sugar, which means
it's just kind of a nicety to have.
And indeed, that square
bracket notation is just
sugar for this more arcane, but
perhaps more well-defined syntax now.
The star operator in this context is
the dereference operator, technically.
It's the go there operator,
as I've been describing it.
Go to some address.
Well s, recall, is a string.
But there is no string.
Strings are just the
addresses of characters now.
The first in a string.
So initially in this
loop, what am I doing?
s is the address of a string, the
address of its first character.
And I'm saying, add to s, the value i.
Well, i is just this variable in my
for loop that's initialized to 0.
So s plus 0 is obviously just s.
s is the address of a char.
*s means go to s.
What do you find when you get there?
A character, because s is a char
star, the address of a character.
And so printing out %c
*s effectively means,
go print that character right there.
On the second iteration of
the loop, i is, of course, 1.
So s plus 1 is 1 byte farther
from the beginning of the string.
And the star means go to that
character and print it with %c.
One more iteration, i is now 2.
s plus 2 is 2 bytes away
from the start of the string.
Go there and print that
character in the string.
And so forth.
And do this up until the
length of the string.
Now this is perfectly correct.
And if you really kind of want
to look cool with your code
you can use pointer arithmetic
in this way, because all it is
is just expressing more precisely
what is going on underneath the hood.
But it's a little more cryptic.
It's certainly a couple more characters.
But it is functionally
equivalent to what
we've been doing for weeks, which
has been, again, just s [i].
So whereas some of today's
ideas are admittedly
new-- allocating memory
and actually looking
underneath the hood
at what a string is--
we're not really getting
newfound capabilities
that we didn't already have when it
comes to manipulating existing strings.
So this is pointer arithmetic,
so to speak, insofar
as we are doing
arithmetic with pointers.
Math with pointers.
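A small sketch of that equivalence, using hypothetical helper names that are not part of the lecture's pointers.c:

```c
#include <stddef.h>

// Reads the i-th character using square brackets.
char char_with_brackets(const char *s, size_t i)
{
    return s[i];
}

// Reads the same character using pointer arithmetic.
char char_with_arithmetic(const char *s, size_t i)
{
    return *(s + i); // go to the address i bytes past the start of s
}

// Counts characters up to the '\0' by advancing a pointer.
size_t length_by_pointer(const char *s)
{
    const char *p = s;
    while (*p != '\0')
    {
        p++; // move one char farther along
    }
    return (size_t) (p - s); // distance travelled, in chars
}
```

The first two functions always return the same character for the same inputs, since s[i] is defined in C to mean *(s + i).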
All right, let's take a look now
at where things can go wrong.
So this is a program written by our
friend Nick Parlante at Stanford,
inspired by Binky, who will
return in just a moment.
And it's fundamentally broken.
This code is incorrect.
It also doesn't really
do anything useful.
But it's meant to be demonstrative
of things that can go wrong.
So at the top of this program, notice
we are declaring two variables, x and y.
But today those variables are not ints,
as they might have been in weeks past,
they are int*s, the addresses of ints.
They're not being initialized
yet to any value, so that's fine.
So really, this is just
giving us, if you will,
like, two boxes on the screen.
So x at this point in the story looks
like this, y at this point in the story
looks like this.
I have no idea what's inside of them.
They have garbage values, so to speak.
Because if we didn't
assign them a value,
the computer is not
going to do it for us,
so they might just have remnants
of some past usage of that memory.
So we don't know what's inside of them.
But that's OK, because on this next
line I call malloc and actually allocate
enough memory for one int.
Now this is kind of a silly use.
It's not the best way
to give yourself an int.
We've seen for weeks
now how you get an int.
You say, like, int x, or int y, or int
z, and the computer gives you an int.
But if we want to use malloc
in the simplest way possible,
we can just say, hey, malloc,
give me enough space for an int.
And recall from the past that an
int here is generally 4 bytes.
So give me 4 bytes of
memory, specifically
the address of the first of those bytes.
That's all malloc is doing, and
it's storing that address in x.
For good measure I should check for
null, but we're not, in this case,
per Binky.
So what's the next line do?
This next line means go to the address
in x and put the number 42 there.
That seems OK.
Because assuming malloc
returns the address
of a chunk of memory, the
first address of 4 bytes,
we can go there and put the number 42
in binary, in 4 bytes worth of bits.
But what about this line?
I've flagged it in red
because this program probably
is going to go no further.
In fact, something very,
very bad is about to happen.
Why?
Well, what is the value in y?
Well, originally x was a garbage
value until I called malloc
and asked malloc, hey, malloc,
give me enough space for an int.
So I'll draw it as a box here.
Give me the address of
that chunk of memory.
So this question mark is really now
an arrow to that chunk of memory.
And then I said, hey, computer, go
ahead and go there and put the value 42.
But then my next line of code said,
hey, computer, go to the address in y
and put the unlucky number 13 there.
So that's like saying, go-- I don't know
where to go, because this has no value.
And so something very
bad is going to happen.
Because the question mark
implies, this is a garbage value.
Maybe it's 0, maybe it's 1,000, maybe
it's some number in between or bigger.
It's some garbage value, which
means if I just go there,
who knows where I'm going to end up?
But odds are I'm going
to end up somewhere
I shouldn't, because I should not be
touching memory that is not my own.
And indeed, thanks to Binky, we're about
to see that bad things indeed happen
when you touch memory
that you shouldn't.
Let's take a look.
[VIDEO PLAYBACK]
-Hey, Binky, wake up.
It's time for pointer fun.
-What's that?
Learn about pointers?
Oh, goody!
-Well, to get started, I guess we're
going to need a couple pointers.
-OK.
This code allocates two pointers
which can point to integers.
-OK, well, I see the
two pointers, but they
don't seem to be pointing to anything.
-That's right.
Initially, pointers
don't point to anything.
The things they point to are called
pointees, and setting them up's
a separate step.
-Oh, right, right, I knew that.
The pointees are separate.
Er, so how do you allocate a pointee?
-OK, well, this code allocates
a new integer pointee,
and this part sets x to point to it.
-Hey, that looks better.
So make it do something.
-OK, I'll dereference the pointer x to
store the number 42 into its pointee.
For this trick I'll need my
magic wand of dereferencing.
-Your magic wand of dereferencing?
Uh, that's great.
-This is what the code looks like.
I'll just set up the number and--
[POP]
-Hey, look, there it goes.
So doing a dereference on x follows
the arrow to access its pointee.
In this case, to store 42 in there.
Hey, try using it to store the number
13 through the other pointer, y.
-OK, I'll just go over here to
y and get the number 13 set up,
and then take the wand of
dereferencing and just--
[BUZZER]
Oh!
-Oh, hey, that didn't work.
Say, Binky, I don't think
dereferencing y is a good idea,
because, you know, setting up
the pointee is a separate step,
and I don't think we ever did it.
-Oh, good point.
-Yeah, we allocated the pointer y, but
we never set it to point to a pointee.
-Hm, very observant.
-Hey, you're looking good there, Binky.
Can you fix it so that y points
to the same pointee as x?
-Sure.
I'll use my magic wand
of pointer assignment.
-Is that going to be a
problem, like before?
-No, this doesn't touch the pointees.
It just changes one pointer to
point to the same thing as another.
-Oh, I see.
Now y points to the same place as x.
So wait, now y is fixed.
It has a pointee, so you can try
the wand of dereferencing again
to send the 13 over.
-Oh, OK, here goes.
-Hey, look at that.
Now dereferencing works on y.
And because the pointers are sharing
that one pointee, they both see the 13.
-Yeah, sharing, whatever.
So are we going to switch places now?
-Oh look, we're out of time.
-But--
-Just remember the three pointer rules.
Number one, the basic structure
is that you have a pointer
and it points over to a pointee.
But the pointer and
pointee are separate,
and the common error
is to set up a pointer,
but to forget to give it a pointee.
Number two, pointer dereferencing
starts at the pointer
and follows its arrow over
to access its pointee.
As we all know, this only works if there
is a pointee, which kind of gets back
to rule number one.
Number three, pointer
assignment takes one pointer
and changes it to point to the
same pointee as another pointer.
So after the assignment the two
pointers will point to the same pointee.
Sometimes that's called sharing.
And that's all there is to it, really.
Bye bye, now.
[END PLAYBACK]
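The sequence from the video, including the fix, can be sketched as follows. The function name pointer_rules_demo is hypothetical, and the null check and the call to free are additions for safety rather than part of the original Binky code.

```c
#include <stdlib.h>

// Returns the value both pointers see at the end, or -1 if malloc fails.
int pointer_rules_demo(void)
{
    int *x; // the pointers exist, but have no pointees yet
    int *y;

    x = malloc(sizeof(int)); // allocate a pointee and point x at it
    if (x == NULL)
    {
        return -1;
    }

    *x = 42; // dereference x to store 42 in its pointee
    // Writing *y = 13 here would be the bug: y has no pointee yet.

    y = x;   // pointer assignment: y now shares x's pointee
    *y = 13; // safe now, and x sees the change too

    int result = *x; // 13, because x and y share one pointee
    free(x);         // give the memory back exactly once
    return result;
}
```

Since x and y hold the same address, freeing through either one releases the single shared pointee; freeing through both would be a mistake.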
DAVID J. MALAN: So what
else can go wrong now
that we have the ability
to touch, correctly
or incorrectly, any
memory that we actually
have access to inside of our program?
Well, memory leaks are one such problem.
Now that you have the capability to
ask the operating system for memory
via the malloc function, you
have the ability to accidentally
not give that memory back.
In fact, a very common mistake
in some programming languages
is to ask the operating
system for a chunk of memory,
use it, and then never actually free it.
To give it back so that
other parts of your program,
or other programs on the computer,
can actually make use of that memory.
But fortunately there exist tools
via which we can detect this,
and one of them is called Valgrind.
So this is another debugging
tool that you'll now
be able to use once
you start dynamically
allocating memory in your program.
So up until now you probably
have not used malloc,
and therefore you have
not likely actually asked
the operating system for more
memory in this very dynamic way.
Instead, you have just declared
an integer or an array,
and asked the computer for memory
in other ways, not using malloc.
But suppose that we have
a program like this.
In memory dot c I've made some mistakes.
Deliberately, but mistakes nonetheless.
And indeed, this draws inspiration from
the documentation for this very tool
to highlight a couple of
its most useful features.
So let's look at this program here.
There's no prototype
for this function f,
just because the example that Valgrind,
this tool, gives just puts it up top,
and that's fine.
But let's focus on main first.
Main takes no command line
arguments, per its mention of void,
and it returns an int as usual.
The function f gets called here.
So f open parenthesis, close parenthesis
just means call the function f,
but pass it no inputs, and
then eventually return 0.
So let's look at f.
What happens?
Well, f takes no inputs,
produces no outputs.
It only apparently has
these two lines of code.
And let's consider what it does.
It calls malloc, memory allocate,
10 times the size of int.
So that just means, hey, malloc,
give me enough space for 10 integers.
Now we know on CS50 IDE that the size
of an int recall is 4 bytes, or 32 bits.
So this is like saying, hey,
computer, give me 40 bytes total.
4 times 10.
Malloc recall returns the
address of that chunk of memory,
and it stores it in x, which, according
to the star, is the address of an int.
So that's all.
So borrowing some of the same
ideas from the Binky example.
What is bad about this
is this next line.
Why is x[10]=0 a problem?
Well, the size of this chunk of
memory is what, 10 ints worth.
So 10 times 4, 40 bytes.
But remember how chunks
of memory are indexed
when you use square bracket notation.
When you use square bracket
notation, you better
start counting at 0, which means
if we have 10 ints, first one is 0,
the last one is 9, and an
element at index 10 does not exist.
And so if I'm saying
x[10] gets any value,
that's like saying go there
and put the value 0 there,
but I don't have access
to that chunk of memory.
I did not ask the
computer for that memory.
I asked the computer for all of this
other memory, only 40 bytes, enough for 10 ints.
So this feels problematic.
Moreover, I have asked the operating
system for memory via malloc,
hey, give me 40 bytes.
I never handed it back.
Thereby introducing what we
would call a memory leak.
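One way to see why index 10 is off limits is to write the bounds check out by hand. This is only a sketch: set_checked is a hypothetical helper, not part of memory.c or of any library.

```c
#include <stdbool.h>
#include <stddef.h>

// Stores value at x[i] only if i is a valid index for an array of n ints.
// Valid indices run from 0 through n - 1.
bool set_checked(int *x, size_t n, size_t i, int value)
{
    if (x == NULL || i >= n)
    {
        return false; // e.g. n == 10 and i == 10 is one past the end
    }
    x[i] = value;
    return true;
}
```

With n equal to 10, index 9 is accepted and index 10 is refused. C itself performs no such check on x[10], which is why the mistake compiles and appears to run.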
Now, as an aside, once your program
quits, all of the memory it's allocated
is actually automatically
given back, but the problem is
that, for long running programs,
things like your browsers,
or word processing
programs, or any number--
Skype-- or any number of other
programs that you might run locally
on your Mac or PC, if those bigger
programs have memory leaks, such
that they keep asking Mac
OS or Windows for memory
and never actually
remember to give it back,
you might experience what you might
have in the past of your Mac or PC
really starting to slow
down and kind of crawl.
And that can be for any number
of reasons, but one of them
is if your program or programs inside
of them have some form of memory leak.
Not your fault. It's the programmer's
fault. But memory leaks, nonetheless.
So what is Valgrind good for here?
Well, it turns out, that in Valgrind
you can run the command like here.
Valgrind.
Leak check equals full.
Dot slash.
And then the name of your program.
And so let me do that.
Let me go ahead back into CS50 IDE
and let me go ahead and make memory.
Compiles OK, so it doesn't appear
to be syntactically flawed.
Dot slash memory.
Nothing bad seems to happen.
So that's all fine and good.
And you might think that, OK my
program works, it's time to submit,
all is well, but not necessarily.
Let me go ahead and run Valgrind.
And let me actually go
ahead-- let me go up here
and not create a new
file, but a new terminal.
Just so that we have
a much bigger window.
And let me go ahead into this directory
and run Valgrind, dot slash memory.
Without any command line
arguments besides that.
And hit Enter.
And the downside of
Valgrind, frankly, is
that it's just-- output is atrocious.
Like it's just not easy to read.
But you start to learn
to notice patterns in it.
So I've learned to notice when I make
mistakes in programs that invalid write
of size 4, not a good thing.
That's indeed a problem of some sort.
But what's nice about
Valgrind is that it tells me
that the source of this invalid
write, whatever that means,
starts in memory.c line 13 and really
rears its head in line 8 of memory.c.
Well, what are those lines?
Let's go to line 13.
Ah, that's my call to f. And line 8.
Oh, that's my use of x bracket 10.
So why is Valgrind telling
me invalid write of size 4?
Well, size 4 sounds familiar.
That's the size of an int.
Invalid just means bad.
Write means change something,
like mutate a value.
So invalid write of size
4 maybe just means this.
I am in an invalid way
changing 4 bytes of memory
by storing 0 from the right hand side
into x bracket 10 on the left hand
side because, again, that x
bracket 10 is an 11th byte, rather,
an 11th int, that I never
asked the operating system for.
I only asked it for 10, not 11.
And so Valgrind is telling me, somewhat
cryptically, that I've screwed up.
On line 8 ultimately I've screwed
up by touching 4 bytes of memory
that I shouldn't have.
Meanwhile, a little
more worrisome is this.
Address such and such is
0 bytes after a block
of size 40 alloc'd, da-da da-da.
Oh, not that.
Let's focus on-- here we go.
Uh oh.
Leak summary.
Definitely lost.
Oh, my god.
Like I have lost 40 bytes
of the computer's memory.
Now what does that mean?
This simply means that
Valgrind has noticed
that I must have called malloc
or some similar function,
asked for memory, 40 bytes'
worth, and never gave it back.
Well, we haven't introduced the
way of giving memory back so let
me at least address that now.
But let me take this advice.
Rerun with leak check full to
see details of leaked memory.
So let me go ahead and do that.
Let me go ahead and copy
this so I can rerun Valgrind.
Whoops.
Valgrind.
With that command line argument.
Dot slash memory.
And now I'll see a little
more detail sometimes.
Although, it appears in this
case it's already pretty verbose.
And so indeed I've still
lost 40 bytes in one block.
And up here again is a reiteration
of that invalid write of size 4.
So how do I fix this?
Well, let me go back into my program.
And you know what?
The program's still pretty useless,
but at least let me fix that mistake.
And go ahead now and change it to x bracket 9.
Let me remake my program.
Make memory.
And let me in my bigger terminal
window do Valgrind dot slash memory.
And now, oh, damn it.
Still have definitely lost
40 bytes in one blocks,
but notice the output
is much shorter now.
I don't have a mistake that
was higher up here before.
So I fixed one problem.
Let me go ahead and
fix that other problem.
After using this memory,
let me go ahead and free x.
So free is the opposite
of malloc in this case.
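Put together, the corrected f might look roughly like this. It is reconstructed from the description above rather than copied from the actual file, and the NULL check is an extra precaution the walkthrough does not mention.

```c
#include <stdlib.h>

// Same allocation as before, but with an in-bounds write and a free.
void f(void)
{
    int *x = malloc(10 * sizeof(int)); // room for 10 ints, indices 0 through 9
    if (x == NULL)
    {
        return; // nothing was allocated, so there is nothing to use or free
    }
    x[9] = 0; // last valid element (this was x[10] before)
    free(x);  // give the memory back, which removes the leak
}
```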
And if I go ahead and recompile
make memory and then run it
in my bigger terminal window, nice.
All heap blocks were freed.
No leaks are possible.
Now what does that mean?
Well, it turns out, when we
look at our computer's memory,
there is indeed that stack at the
bottom that we talked about earlier.
And the stack is where frames of
memory go, slices of memory go,
that are used when
functions are being called.
And they get layered
on top of each other
and get un-layered only
once the functions return.
The heap, meanwhile, is a chunk of
memory above the stack, by design.
At least as we've drawn it here.
And in the heap you have just
a pool of available memory
that you can draw from at any point.
And what's powerful
about the heap is that,
whereas the stack keeps growing
up and then as functions return
it grows back down thereby
throwing away, effectively
losing track of, any memory
that we had on the stack,
like x's and y's and a's and
b's and temp variables still,
the heap does not experience
that kind of disposal of memory.
If you ask for memory via malloc, malloc
gives you a chunk of memory over here,
or over here, or over here, up above
the stack inside of the so-called heap.
And even when your function
returns, whether it's f, or swap,
or something else, that use of
memory persists permanently.
Only once you call free do
you give that chunk of memory
back to the operating
system and allow it to reuse
that memory for something else.
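To make that concrete, here is a small sketch of memory that outlives the function that asked for it. The name make_filled is made up for illustration.

```c
#include <stdlib.h>

// Returns a heap-allocated array of n ints, each set to fill,
// or NULL if malloc fails. The caller is responsible for calling free.
int *make_filled(size_t n, int fill)
{
    int *block = malloc(n * sizeof(int));
    if (block == NULL)
    {
        return NULL;
    }
    for (size_t i = 0; i < n; i++)
    {
        block[i] = fill;
    }
    // The local variable block goes away with this stack frame,
    // but the heap memory it points to does not.
    return block;
}
```

A caller can keep using the returned address after make_filled has returned, and should pass it to free once it is done with it.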
But unfortunately this
picture kind of suggests
a problem and bad design in some sense.
Although, given a finite
amount of memory, eventually
something's got to bump
up against something else
if you want to keep asking
for and using memory.
Those arrows kind of suggest
that the stack and the heap
are really destined to collide
with each other if what happens?
Well, if the stack grows too large.
In other words, if you keep calling
function after function after function
after function and never return.
And maybe this is accidental.
Maybe you're using
recursion, per last time,
and you accidentally
don't have a base case
so you just keep calling functions.
So much so that you
run out of stack space.
Or maybe you call malloc so many
times that the heap keeps growing down
and down and hits the stack, meanwhile.
Either of those problems can persist.
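For the recursion case, the base case is the thing that keeps the stack from growing without bound. A minimal sketch, with sum_to as an illustrative name:

```c
// Sums 1 + 2 + ... + n recursively.
// The base case is what stops new frames from piling up on the stack.
int sum_to(int n)
{
    if (n <= 0)
    {
        return 0; // base case: no further calls
    }
    return n + sum_to(n - 1); // each call adds one more stack frame
}
```

Delete the if block and every call makes another call, so the frames keep stacking until the program runs out of stack space.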
And you might recognize
some familiar terms.
The first of those scenarios
describes a stack overflow.
So if any of you have ever discovered
the website, Stack Overflow,
which is a wonderful
place to go for advice
and tips and tricks with
programming, more generally.
stackoverflow.com draws its name
from exactly that programming
risk and problem.
Heap overflow is similar in spirit.
It's just the opposite.
When the heap grows down
too far, as we've drawn it.
And they're both
examples, more generally,
of what you might call buffer overflows.
A buffer is a chunk of space that's
typically finite in some form
and you can eventually overflow it
by trying to use more memory than
you should have.
And a buffer overflow
typically actually relates
to arrays and chunks of memory.
And, in fact, a great example
from Wikipedia is this one here.
The code is a little cryptic too because
it doesn't really do anything useful,
but it's fundamentally
flawed in the following way.
Let's consider exactly what happens
when you have a stack overflow.
It turns out that this is a bad
enough problem that you can actually
lose control over your computer itself.
It can be hacked into if
an adversary, a bad guy,
has the ability somehow to inject his
or her own adversarial code, code that
deletes your files, or encrypts your
files, or does something malicious,
into your program because of a
memory related mistake you've made.
Let's consider this example.
Here's the main function.
In this particular example, it takes
command line arguments argc and argv
and just apparently calls this function
foo, passing in the second word
that the user typed in.
So the first command line argument
that the user ran at the prompt.
And it just blindly does this.
It doesn't check argc, it doesn't check
for null, it doesn't check anything.
It just blindly passes
in argv bracket 1.
And that's generally bad practice.
Right?
If you're not error checking, bad things
are likely going to happen eventually.
Well, what does foo do?
Foo takes a char star as inputs.
Now, a week ago, we would
have said string bar,
but we don't need to
hide that detail anymore.
Bar is just the address of a character.
It's a string.
Foo returns nothing, but
has two lines of code.
First, it declares an array
called c-- just because.
This is kind of a
contrived example online,
but the pictures will
tell it all-- of size 12.
So this is saying, hey, computer.
Give me an array of 12 characters.
Or give me a chunk of memory of size
12 that I plan to put characters into.
And then, memcpy, you
might not have seen before,
but it essentially does this: it
copies into this chunk of memory
whatever is at that chunk of
memory up to this many bytes.
So memcpy, as the name
suggests, copies memory from here
to here a total number
of this many bytes.
So why is this worrisome?
Well, if for whatever reason
you have only allocated 12 bytes
and the user has typed
a word at the prompt
when running your program
that is more than 12 bytes,
it would seem that the user
is able to touch memory
that you the programmer never intended.
Why?
Well, your use of memcpy is
checking the length of something,
but it's checking the length of bar.
So if the user types in not 12
characters, but 20 characters,
strlen of bar is going to be 20.
The user's input is 20 characters long.
So your code is saying copy
20 characters from bar into c.
Unfortunately, c is only of size 12 and
so that's 8 bytes you end up copying
that you shouldn't.
So if you were given a chunk of
memory here that's only of size 12
and you copy 20, you're touching
8 more bytes to the right of it,
so to speak, that don't
actually belong to you.
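A guarded version of that copy could look something like the following. The helper copy_checked is hypothetical, and unlike the example as described it also copies the terminating '\0', so it needs room for one extra byte.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

// Copies the string bar into dest only if it fits, terminator included.
// dest_size is the capacity of dest in bytes (12 for the array c above).
bool copy_checked(char *dest, size_t dest_size, const char *bar)
{
    if (dest == NULL || bar == NULL)
    {
        return false; // e.g. argv[1] is NULL when no argument was typed
    }
    size_t len = strlen(bar);
    if (len >= dest_size)
    {
        return false; // too long: refuse rather than write past the end
    }
    memcpy(dest, bar, len + 1); // len characters plus the '\0'
    return true;
}
```

The important difference is that the number of bytes copied is compared against the size of the destination, not just taken from the length of the input.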
Now, at best, nothing's going to happen.
Your computer, your program
is not going to notice.
Things are just going to hum along
and all is going to seem fine.
Much like my memory
program a moment ago.
Seemed fine, but Valgrind
thought otherwise.
In this case, really
bad things can happen.
Because if what the human adversary,
the bad guy, has typed at the prompt
is not just some word like foo,
or bar, or David, or Zamyla,
or any number of sort
of innocuous strings,
but is actually the ASCII equivalent,
so the textual equivalent,
of some malicious code-- so if he or
she actually somehow typed at the prompt
a pattern of characters that really
represent a pattern of bits that
do something bad, like
delete all your files,
or spam everyone in your address
book, or any number of things.
If they provide a string that's
longer than 12 characters,
the overflow is going to
end up somewhere in memory.
And where is that?
Well, let's zoom in a little lower level
to what your computer's memory looks
like on the stack.
So it's a little more detail.
So let's absorb this for just a moment.
Think of this now as the bottom of
the stack, parent routine's stack.
So you can think of this as main, or
foo, or some function on the stack.
It turns out-- and I didn't
mention this earlier--
that besides there being arguments
or parameters on the stack
and local variables on the stack,
it turns out that the computer also
uses the stack just to kind of store
a reminder for itself of the address
in memory of the
function that called it.
So if this picture represents foo, the
function right now, who called foo?
Main called foo and so what foo
does is, on its own stack, it
just kind of jots down a note to
itself here in pink, where it says,
return address, when I'm done,
return to this address of main.
The computer does that for itself.
So when foo is done executing,
previously, I just wipe the screen
and remove foo.
But how does the program know where
to go back to after foo is done?
It's because it put a little
bread crumb for itself.
The address of main, itself.
Not the address of variables, the
address of the function itself,
which, recall, is in the text
segment of your computer's memory.
But never mind that for now.
So now, let's look at what the stack
frame for foo is actually like.
Here is that array of size 12.
And it's drawn as a rectangle.
If we made it wider, we could just
do 0, 1, 2, all the way through C 11,
but the author of this graphic has just
drawn things a little more square like.
So this is the first byte in C and
this is the last byte in C, 0 and 11,
respectively.
And then bar, recall, is the
only argument to the foo function
and that, as I did mention before,
belongs on the stack as well.
But here's the problem.
And this is by design.
It appears that whoever
designed the stack,
has it growing upwards,
as in this picture.
And that's how I described it earlier.
So the stack itself grows up.
Frames go up and up and up and up.
But within a frame, notice what happens.
Within a frame, if you've got a buffer,
an array of memory, a chunk of memory,
and you write to it, you start
writing to that memory, by design,
at the top left corner, so to
speak, to the right, and then down,
and then down.
So top left to bottom
right, so to speak.
Pictorially.
This was not necessarily a good
thing, at least in this case.
Because if the stack is growing up
and this is the top of foo's frame
and its use of the array
c is by design going down,
that's all fine if the user only
provides a string of length 12 or less,
12 or shorter, but in this case if
the adversary types in a 20 character
string or 200 character
string, the computer
is not just going to stop there.
Unless you wrote code
with an if condition
to check the length that the user
is trying to pass in and protect
against that, it's
going to overwrite all
of these bytes, all of these bytes,
whatever this is, all of these bytes,
and this is what's worrisome.
If you have a really smart
adversary up against you,
trying to hack into your system, and he
or she is smart enough or lucky enough
to figure out how to inject a pattern
of bits into your program in this way,
such that he or she overwrites
this return address,
a really clever adversary
can trick your program
into returning to, not the
function that called foo, main,
but returning to the address
of the adversary's own function
that he or she has somehow injected
by way of typing input at the prompt,
in this case.
In other words, if the user types in
just, hello, h-e-l-l-o, backslash, 0,
all is fine and good and it
fits perfectly in that frame.
But because we didn't
have good error checking
on the length of the
memory we're copying,
suppose the adversary includes a
whole bunch of adversarial code, which
the author here abstracted away as a,
a, a, adversary, or attack, attack,
attack, what might actually happen here?
Well, if that adversary gets
lucky enough or is smart enough,
he or she might be able to
overwrite these 4 bytes here
with the address, brilliantly,
of his or her own attack code,
which is the bits or the characters
that he or she typed at the prompt.
So again, a lot of this is luck.
Or a lot of this is trial
and error sometimes,
to attack a program in
this way, but if you
can trick the computer into
jumping to an address that
happens to point at data you
injected into the program,
you have effectively taken
over that user's program,
tricking the program into running
any and all code that in this case
you provided by way of argv. So if you've
ever heard of a server getting hacked,
or in the future you will hear
of a server getting hacked,
or a program getting compromised,
could mean any number of things,
but one of them, and an
all too common approach,
is that the programmer who wrote
the software used a buffer,
used an array, a chunk
of memory, and he or she
did not check the boundaries
of that array sufficiently.
Did not make sure that
his or her own code
was not going too far as was
this particular example here.
So now that we have this
power of going anywhere
we want in memory, it's incredibly easy
to accidentally or maliciously misuse,
but that's it.
That boils down to just
this basic understanding
of what's going on underneath
the hood and what you can now
do with the computer's memory.
And now, let's transition to the
second of our real world domains.
Now that we have the ability
to talk about memory addresses
and actually do things at
this lower level, turns out,
we can start solving some really
interesting problems, among them
related to images and forensics, the art
and science of recovering information.
In fact, you might recall from various
TV shows or movies, that it's all
too common for the good guys
in the movies and the TV
shows to be looking over
the shoulder of a tech
and seeing some footage of a burglary
or some other crime that's committed
and say, hey, can you clean that up?
Or can you enhance that?
And, indeed, let's take a look at
one such clip from the real world
and see how it relates
to the actual world
and how we can leverage some of this
new found savvy with memory addresses
and computing to actually solve
some of these problems for real.
[VIDEO PLAYBACK]
-He's lying.
-About what?
-I don't know.
-So what do we know?
-That at 9:15 Ray
Santoya was at the ATM.
-OK, so the question is,
what was he doing at 9:16?
-Shooting the 9 mm at something.
Maybe he saw the sniper.
-Or was working with him.
-Wait.
Go back one.
-What do you see?
-Bring his face up.
Full screen.
-His glasses.
-There's a reflection.
-That's the Nuevita's baseball team.
That's their logo.
-And he's talking to
whoever's wearing that jacket.
[END PLAYBACK]
DAVID J. MALAN: OK.
So this is just nonsense.
Like, you can't just
clean up an image, or look
at something that's very
pixelated, so to speak,
where pixels are dots on the
screen, and just kind of zoom in
and clean that up and enhance it.
And, indeed, that's kind
of this running joke.
Enhance doesn't really
generally mean enhance when you
have a finite amount of information.
For instance, here is a
finite amount of information.
A photograph of Zamyla that
looks to be of very high quality.
You can see lots of detail.
And so you might think that we can
maybe see the reflection of someone
in her eye if we really just
zoom in and enhance, but no.
This is what you see, and it's actually
kind of creepy at this distance.
This is what you see when you zoom
in on that exact same photo of Zamyla
to find that glint in her eye of the
bad guy, or the badge on his shoulder,
or whatever it was.
There is only a finite
amount of information here.
There is literally only a finite
number of rows and columns
and therefore really big pixels
from which to glean information.
Now, there do exist algorithms and
software that can smooth this out.
You can enhance the image in the
sense that you can kind of maybe
tweak the color so that it looks a
little more gradual, the color changes,
and a little less jarring, but
you can't just put pixels there,
put bits there that are not
existent in the first place.
And so this invites the
question well, then what is an image,
how do you represent it, and what
can you actually glean from it
and do from it?
Well, let's consider in the simplest
case what an image might be.
Here is an image of a
wonderfully simple smiley face.
It's black and white.
And if it's black and white, frankly, we
can get away with just ones and zeros.
I might arbitrarily say that the number
1 shall represent the color white
and the number 0 shall represent
the color black and, as such,
if I just have a bunch of patterns
of 8 bits, ones and zeros,
I can effectively think
of them as a grid.
Rows and columns.
And if I see a 0 pixel, assume it's
a black dot, if I see a 1 pixel,
it's a white dot.
And, as such, I can
create a bitmapped image.
Bitmap, it's like a map, x's and
y's, of bits, bitmapped image.
There is the simplest, perhaps,
smiley face we can draw.
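A sketch of that idea in code: take one row of 8 bits and draw it as text. The function name render_row, the choice of ' ' and '#' as the two "colors", and treating the highest bit as the leftmost pixel are all assumptions made for illustration.

```c
// Turns one row of 8 bits into 8 characters plus a terminator.
// Following the convention above: 1 is white (drawn as ' '),
// 0 is black (drawn as '#').
void render_row(unsigned char bits, char out[9])
{
    for (int col = 0; col < 8; col++)
    {
        int bit = (bits >> (7 - col)) & 1; // leftmost pixel is the highest bit
        out[col] = bit ? ' ' : '#';
    }
    out[8] = '\0';
}
```

Calling this once per row, with one byte per row, is enough to draw a small black and white picture in a terminal.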
Now it's not all that interesting.
It would be a pretty useless
photograph if you only
had dots of that level of detail.
And indeed we saw, when you
really zoom in on Zamyla's eyes,
you have really big grids.
And those were colorful, but
there isn't that much information.
We need more resolution.
We need more dots,
more pixels, more bits.
And so the general case is to use
not something as simple as this,
but a more standard format like JPEG.
JPEG is what you see on Facebook
and on your cameras, most likely.
It's an image format that
photographs are commonly stored in,
because it stores millions of
colors-- potentially for photographs--
and it also allows you to compress them.
You can compress JPEGs in
such a way that you throw away
some of the 0's and 1's,
thereby degrading the image.
So it's a little blurrier,
or a little splotchy,
but at least it's much, much smaller
and can be emailed or texted or stored
with far less space involved.
And it turns out, if we
put on our forensics hats
now-- it turns out that JPEGs
all share something in common.
Any of the JPEGs you see on the
internet or have on your hard drive
or on your phone start with the
same three bytes, typically.
Specifically, the first three bytes
are the numbers 255, 216, 255.
Why that?
It's just the standard that was
adopted some time ago for JPEGs.
It's like a magic number, if you
will, at the start of the image.
Now it turns out these
are decimal numbers.
And frankly, in the world
of computing, and really
the world of file inputs and
output-- file management--
we don't really talk in decimal.
It's just not commonly done.
More commonly done is to talk,
not in decimal, not in binary
as we did in week 0, but in hexadecimal.
So it looks a little spooky
at first, but it really
is an alphabet of 16 characters
instead of 10, instead of 2.
And you start counting at 0, as the real
world does-- 0, 1, 2, 3, 4, 5, 6, 7, 8,
9-- but there is no 10, because
these are single digits.
So after 9 comes A, B, C,
D, E, F. So with hexadecimal,
you can effectively count
from 0 to 15 so long
as you borrow some
letters of the alphabet
to do 10, 11, 12, 13, 14, and 15.
So let's map that onto
this pattern of numbers
that demarcates the start of a JPEG.
These numbers, if we do them
out in binary, equal this.
And we can leave this
as an exercise at home
if you'd like, but trust
me, per week 0, that these
are the patterns of bits that are
equivalent to 255, 216, and 255.
Those are the light bulbs you
would need to turn on and off.
Meanwhile, we can add some
space there-- so before, after--
and I did this deliberately.
Because it turns out that hexadecimal,
insofar as you can count as high as 15
for 16 total characters-- it turns
out that hexadecimal is really useful.
Because with four bits,
four 0's and 1's, you
can count from zero
all the way up to what?
All the way up to 15, which is perfect,
because if you have four 1 bits,
that's 15.
If you have four 0 bits, that's 0.
And it turns out that patterns
of 4 bits-- so half of a byte--
map perfectly to hexadecimal characters.
You can express chunks of 4 bits with
hexadecimal characters perfectly.
As such, it turns out that 15,
again, is F. So we have FF.
And it turns out that
1101 and 1000 from 216,
if you take those 8 bits
and just spread them out
slightly into two groups of four, are the
letter D in hexadecimal and the number
8 in hexadecimal.
And then we have another couple
of 15's-- two sets of four 1 bits.
So that's FF.
Which is to say that the first
three bytes at the start of any JPEG
happen to be FF D8 FF.
And it's human convention to prefix
hexadecimal digits for clarity
in the real world with just 0x.
It means nothing fundamentally
other than, hey world, here comes
a hexadecimal number, so that the
world just knows what it's looking at
and doesn't confuse it for some
other numeric or base system.
So 0x just means here comes a
series of hexadecimal digits.
So that is what is at
the beginning of a JPEG.
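The split into two groups of 4 bits can be sketched directly in C. The helper name byte_to_hex is made up here; the shifting and masking is standard.

```c
// Writes the two hexadecimal digits of one byte into out, plus a terminator.
// The byte is split into two groups of 4 bits, and each group maps to one digit.
void byte_to_hex(unsigned char byte, char out[3])
{
    const char digits[] = "0123456789ABCDEF";
    out[0] = digits[byte >> 4];   // high 4 bits, e.g. 1101 -> D
    out[1] = digits[byte & 0x0F]; // low 4 bits,  e.g. 1000 -> 8
    out[2] = '\0';
}
```

Passing in 255 gives FF and passing in 216 gives D8, which matches the three bytes worked out above.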
And in fact, one of the problems
we put before you in CS50 each year
is to recover some JPEGs.
In fact, it's all too common,
unfortunately, to lose photographs
from a memory card or to accidentally
delete files from your computer.
And typically, you might
freak out or be worried
that, damn, I didn't mean to delete
those photos or delete those files.
And sometimes they are gone for good.
But it turns out that
computers and even phones
don't necessarily delete
photos right away.
They forget where they are, but
they don't outright delete them.
They don't change the 0's to 1's and
the 1's to 0's unless you start taking
more and more photos immediately.
But if you have accidentally formatted
or corrupted your memory card
in a phone or a computer, such that
you think you've now lost your photos,
what if you did this?
What if you wrote software that iterated
from the start of your memory card
or your phone all the way to the end of
your phone or your memory card's memory
looking for this pattern of
bits, this pattern of numbers
or hexadecimal digits, this pattern
of 24 bits-- 8 plus 8 plus 8?
With high probability,
if, upon iterating
through your phone's memory or
camera's memory, you see FF D8 and FF,
you can perhaps infer that the following
megabytes-- 2 megabytes, however
big the photograph is-- is
the entirety of a photo.
And if you just grab those
bits and save a copy,
maybe you can recover
all or some of the photo.
And indeed, one of the challenges
you'll soon see in our problems
that focus on forensics is
going to be exactly this.
We will accidentally format a memory
card from a camera or a phone.
We will give you a forensic image
of that memory card, so to speak,
a perfect copy of all of the 0's and
1's therein, give it to you as a file,
and challenge you with writing software
that iterates over that file's bytes,
from the 0th byte all the way to the end
of the file, looking for this pattern.
And any time you see this
pattern, you will need to say,
this is probably a JPEG.
Let me start copying these bytes into
a separate file name-- something.jpg.
And indeed, if you get the code right,
suddenly out of this forensic image
pops any number of photographs
that we thought were lost forever.
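The search itself can be sketched as a loop over bytes. This version assumes the forensic image has already been read into memory as an array, and find_jpeg is a hypothetical name; it only locates the pattern and leaves the copying into a new file aside.

```c
#include <stddef.h>

// Returns the offset of the first FF D8 FF sequence at or after start,
// or len if none is found. data is an in-memory copy of the card image.
size_t find_jpeg(const unsigned char *data, size_t len, size_t start)
{
    for (size_t i = start; i + 2 < len; i++)
    {
        if (data[i] == 0xFF && data[i + 1] == 0xD8 && data[i + 2] == 0xFF)
        {
            return i; // probably the first byte of a JPEG
        }
    }
    return len;
}
```

Calling it again with start set just past the previous match finds the next candidate photo, and a return value equal to len means there are no more.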
Bitmap is another type of
image file-- older, perhaps,
than JPEG and simpler--
truly a bitmapped image
with slightly less fancy
algorithms for compression.
But if you remember this
world, this is perhaps
a perfect example of a bitmap image
from Windows XP back in the day.
The wallpaper or desktop image that many
of you might have had on your computers
looked like this.
As an aside, if you go poking
around I think on Wikipedia,
someone wonderfully went
to this same spot, found it
where the artist had taken
this photograph for Microsoft,
and it now looks like this.
So it's not quite the
beautiful place it once was,
but that, indeed, is the same hill.
And this is really just
an opportunity to talk
about this picture,
which is what's really
underneath the hood of that
beautiful grassy meadow.
It's this kind of file header.
We can very quickly make nature
seem very boring with this.
But this is a very complicated,
convoluted looking way
of just formalizing
what we mean by a file.
Like, what is a file?
A file is just a bunch of bits, 0's and
1's, stored somewhere on your laptop,
or desktop, or phone, or wherever.
But what does it mean
to be a Microsoft Word
file versus a JPEG versus
a bitmap versus an Excel
file versus any number
of other file formats
that you might have on your computer?
MPEG4, or MP3, or AAC, or any number
of media formats as well-- well really,
what it means to be a file, is to be
filled with 0's and 1's-- that's true--
but to follow a certain pattern of 0's
and 1's such that the first several
bits or first many bits of a
file typically follow a pattern.
And indeed, JPEGs have
a pretty simple pattern.
Those first three bytes,
those first 24 bits,
follow a certain pattern
of 255, 216, 255.
And that demarcates the start of a JPEG.
And there's some more
complexity therein, too.
In a bitmap, this looks
overwhelming at first glance,
but this is how the world
decided to standardize
what's inside of the first
few bytes of a bitmap
file like that grassy
meadow from Windows XP.
And we'll come back to this in more
detail in the problem set before long.
But you'll notice some identifiers
here that might jump out at you.
So width, height, size, compression--
so there are some key words in there
that might make some intuitive sense.
And indeed, somewhere inside of a
bitmap image, like that grassy meadow,
is information stored, like
what is the width of that image?
What is the height of that image?
Is it compressed, and how?
And then more interestingly,
at the bottom of this picture
is the beginning of what bitmap calls
RGB triples-- Red, Green, Blue triples.
It turns out with
patterns of three bytes,
you can represent, top to bottom,
left to right, essentially, the color
of every one of the dots in an image.
So when we enhanced Zamyla's image,
you literally saw those really big dots
or pixels.
Each of those dots or pixels
has 24 bits representing
its color-- how much red,
how much green, how much blue
per our discussion of colors in week 0.
And so that's what most
of a bitmap file is here.
The dot dot dot just means these
can go on for quite some time.
What color is this dot?
What color is this dot?
What color is this dot?
All of that is standardized
inside of the file.
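To make the arithmetic concrete, here is a minimal sketch of the "three bytes per dot" idea. Both helper names are made up for illustration, and the byte count ignores any padding or header that a real bitmap file may add.

```c
#include <stddef.h>

// Hypothetical helper: each pixel's color takes 3 bytes (24 bits),
// one byte each for red, green, and blue.
size_t color_bytes_needed(size_t width, size_t height)
{
    return width * height * 3;
}

// Hypothetical helper: pack three 8-bit amounts into one 24-bit
// number, with red in the highest byte and blue in the lowest.
unsigned long pack_rgb(unsigned char red, unsigned char green,
                       unsigned char blue)
{
    return ((unsigned long) red << 16)
         | ((unsigned long) green << 8)
         | (unsigned long) blue;
}
```

So a pure red dot would be 255, 0, 0, and a tiny 2 by 2 image would need 12 bytes for its colors alone.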
And it turns out in C, we can express
and represent that kind of structure
using one last keyword for this week.
In C, there is also a keyword
called struct-- structure--
that allows you to create
an actual structure
inside of which you store information.
For instance, the programming
language C does not
come with a data type
that represents students.
It comes with ints, and chars,
and floats-- heck, it doesn't even
come with string.
It certainly doesn't come with student.
But I would generally
think of a student as being
the combination of a few things.
And let's keep it simple.
A student is a combination of the
student's name and their dorm.
And bunches of other detail, but for
now, let's just focus on name and dorm.
Now the syntax is a
little new for us here,
but all this is saying is, hey,
C, define a type for me that
is a structure containing two
fields inside of it, name and dorm,
and call this new type student.
So it's a little more complicated
than our typedef for a string,
because a string was just
a synonym for char star.
In this case, student is a synonym
for this container of things.
It's like encapsulating two
pieces of data-- name and dorm.
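The slide itself isn't reproduced in this transcript, so treat the following as a sketch of what such a definition might look like. It uses char star for both fields, which is what the CS50 library's string is a synonym for.

```c
// A new type called student that bundles two fields together.
// The field names follow the lecture; the exact layout of the
// original definition is an assumption.
typedef struct
{
    char *name;
    char *dorm;
}
student;
```

Once this exists, student can be used like any other type, for instance to declare a variable or an array.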
But this is really useful as follows.
With this kind of C code, we
can now do things like this.
Let me go ahead and open up structs.h.
And you'll see a little
header file I whipped up
here that contains very
little information,
just that same typedef
for struct, but I've also
included cs50.h in my own header file.
Why?
Well, just for today's
purposes, I wanted
to relax our discussion of char star
for a moment and talk about string.
But technically, I could just do this.
And technically I could do this.
And then I could get rid of that,
because now this is true C code.
There is no need for the CS50 library.
But either is equivalent, and whatever
you're more comfortable with for now
is certainly fine.
But let me now open up structs-0.c
and do something with this.
So in this program, notice
that the top of this file, I
first define a value
called STUDENTS to be 3.
This is an example of a constant.
Sharp define or pound define here
says, define, not a variable,
but a constant-- and the
convention in programming
is to use all capital letters--
define a constant equal to some value.
This is not a variable
whose value can change.
This is just a keyword,
STUDENTS, in all caps,
that is fixed now on that value 3.
And it's one way to define a constant
in C, a value that does not change.
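A small sketch of that idea follows. STUDENTS comes from the lecture; the function and the array of ints are hypothetical, just there to show the constant being used as an array size.

```c
// Before compiling, every occurrence of STUDENTS is replaced with 3
#define STUDENTS 3

// Hypothetical helper: sizes an array with the constant and
// reports how many elements that array has.
int roster_capacity(void)
{
    int ids[STUDENTS] = {0};
    return (int) (sizeof(ids) / sizeof(ids[0]));
}
```

If you later want room for more students, you change the 3 in one place and recompile.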
What am I now going
to do with this value?
I first, here, declare a variable called
students containing three students.
So it's a little confusing that we've
got students, STUDENTS, student,
but they all mean different things.
students is just
the name of my array.
STUDENTS in all caps is the value
3 from that constant up above.
And then student is the type of this
array per the header file, structs.h.
So this is just really a
succinct way of saying,
hey, computer, give me an array of size
3 inside of which are three students,
ultimately.
Now here's a loop that
iterates i from 0 up to 3.
I print out name colon, prompting
the user with getstring for a name.
And then notice this new operator dot.
Because struct is being used, and
students bracket i is a structure, not
an int, or a float, or a string itself,
we have to go inside of that structure
to get its field called name.
So dot name says, here's the
structure student-- go inside,
and get at the name field, and
put at that name field whatever
the return value of getstring is.
And do the same thing for dorm.
Go into the i-th student's structure,
and store at the dorm field
the string that has been typed in next.
Now let's scroll down a
little lower and take a look
at what this program proceeds to do.
It's a very simple loop that sort
of, just for demonstration's sake,
says so-and-so is in such-and-such,
printing out the student's name
and dorm respectively.
In other words, if I go down here
and compile structs-0, and then
run structs-0-- I'm going
to go ahead and type
in David and Mather and Rob and
Kirkland and Zamyla and Currier,
and you see literally a
regurgitation of what I wrote.
But what's key here is the
syntax that I've introduced.
I'm using a struct, and I'm
accessing its properties or fields
within by way of this dot operator.
Just means go inside the
struct and get that value.
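Here is a minimal sketch of the dot operator at work. It is not the lecture's structs-0.c: instead of prompting the user, the hypothetical set_student function takes the name and dorm as arguments, so the example has no dependency on the CS50 library.

```c
#include <stdio.h>

#define STUDENTS 3

typedef struct
{
    char *name;
    char *dorm;
}
student;

// Go inside the i-th structure and store a name and a dorm there
void set_student(student students[], int i, char *name, char *dorm)
{
    students[i].name = name;
    students[i].dorm = dorm;
}

// Print one "so-and-so is in such-and-such" line per student and
// return how many lines were printed
int print_students(const student students[], int count)
{
    for (int i = 0; i < count; i++)
    {
        printf("%s is in %s.\n", students[i].name, students[i].dorm);
    }
    return count;
}
```

You would declare student students[STUDENTS], call set_student once per index, and then call print_students(students, STUDENTS).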
But more cool than that is that
now that we understand files,
and now that we have the ability to
store arbitrary structures that did not
come with C, so to speak,
notice what else we can do.
In this version of the program,
I'm adding one final feature,
combining today's discussion
of pointers with this most
recent discussion of files.
It turns out, in C-- and
it's a little inconsistent--
there is a data type called FILE in all
capital letters, for whatever reason.
In lower case here, I'm just saying
hey, computer, give me a variable called
file of type FILE star.
That is, give me a
variable that will store
a pointer to a file, the
address of a file, if you will.
fopen is a new function that
we've not used before that says,
hey, computer, open up the
file called students.csv
in write mode. w means write mode.
So I want to save a file, not open and
read a file, but write and save a file.
Sanity check-- is file
not equal to null?
If so, we're going to proceed.
I just want to make
sure nothing went wrong.
And something might go wrong, if maybe
I don't have permission on the computer
to create files, maybe the computer's
out of space or something-- something
could go wrong and null is returned per
the documentation for fopen or its man
page.
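That open-then-check pattern can be sketched like this. The function name is hypothetical; fopen, fclose, and the comparison against NULL are standard C. Keep in mind that opening in "w" mode creates the file, or empties it if it already exists.

```c
#include <stdbool.h>
#include <stdio.h>

// Hypothetical helper: try to open a file in write mode and report
// whether that worked.
bool can_open_for_writing(const char *path)
{
    FILE *file = fopen(path, "w");

    // fopen returns NULL if something went wrong
    if (file == NULL)
    {
        return false;
    }

    // Always close what was successfully opened
    fclose(file);
    return true;
}
```

Asking for a file inside a directory that doesn't exist is one easy way to see the NULL case.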
So here I have a loop.
For i from 0 to 3-- so do
this 3 times-- fprintf.
So fprintf is nearly identical
to printf that you've
come to get to know in recent
weeks, but fprintf is file printf.
And it allows you to print strings,
not to the screen, but to a file.
So its first argument
is the pointer to the file
that you've opened that
you want to print to.
Second argument-- and all the others--
are just the same as printf as always--
a string, maybe with some
format codes, and then some
values that you want to
plug in here and here.
So if you've ever seen a CSV file in the
real world-- Comma Separated Values--
these are very simple files that
Microsoft Excel, Apple Numbers,
or Google Spreadsheets
can export as well-- they are just
text files whereby all of the fields
are separated by commas.
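Putting fopen and fprintf together, a sketch of saving students as comma separated values might look like the following. This is not the lecture's structs-1.c: the function name and its parameters are assumptions, and it does nothing special for a name or dorm that itself contains a comma.

```c
#include <stdio.h>

typedef struct
{
    char *name;
    char *dorm;
}
student;

// Hypothetical helper: write one "name,dorm" line per student.
// Returns 0 on success, or 1 if the file could not be opened.
int save_students(const char *path, const student students[], int count)
{
    FILE *file = fopen(path, "w");
    if (file == NULL)
    {
        return 1;
    }

    for (int i = 0; i < count; i++)
    {
        // First argument is the FILE pointer; the rest is like printf
        fprintf(file, "%s,%s\n", students[i].name, students[i].dorm);
    }

    fclose(file);
    return 0;
}
```

Calling save_students("students.csv", students, 3) should leave behind a text file with three lines, each a name, a comma, and a dorm.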
And so we now, today, have the ability
not only to understand what's going on
underneath the hood
memory-wise, but we now
have the ability to use the star
operator-- and therefore pointers more
generally, which are required
in order to manipulate files
with fopen and this FILE data type-- and
with fprintf I can write information.
I can print information to those files.
So for the first time ever,
I have written a program
that, when I run in just a moment,
is going to have the ability
to save information.
I have effectively implemented the
equivalent of the File Save menu
for the first time.
Every other program
we've written so far just
throws its information and its
memory away, but not this time.
Let me go ahead and do
make structs-1, and then
dot slash structs-1, and let me go ahead
and type in David and Mather and Rob
and Kirkland and Zamyla and Currier.
Enter.
Nothing seems to happen.
But let me go to my file browser in the
IDE, and notice this-- students.csv.
If I open that, I have created this
file-- David comma Mather, Rob comma
Kirkland, Zamyla comma Currier.
And if I downloaded this
file, double-clicked it
as you can now if you'd like,
and open it on your Mac or PC,
if you have Numbers installed or Excel
installed or some comparable program.
That should open that program, because
it will recognize the .csv extension.
And it will display all of those
names and all of those dorms
or houses in separate columns.
Because the world decided some
time ago that the format known
as CSV, Comma Separated Values,
is just a simple text file
with commas separating values.
And so we now have the ability
to express all of this and more.
So on the horizon is to solve some
of these same kinds of problems
and to actually implement for yourself
code that writes files and reads files.
But to solve actual problems
motivated from the real world domain,
to recover information
that's been deliberately
scrambled or obscured,
to recover information
that's been accidentally
or maliciously deleted,
to recover things, photographs,
memories that you actually
care about-- all with a
fundamental understanding of what's
going on inside of your own computer.
But until then, keep an eye out for
such TV shows and movies as these.
[VIDEO PLAYBACK]
-OK.
Now let's get a good look at you.
-Hold it.
Run that back.
-Wait a minute.
Go right.
-There.
Freeze that.
-Full screen.
-OK, freeze that.
-Tighten up on that, will you?
-Vector in on that guy
by the back of you.
-Zoom in right here on this spot.
-With the right equipment, the
image can be enlarged and sharpened.
-What's that?
-It's an enhancement program.
-Can you clear that up any?
-I don't know.
Let's enhance it.
-Enhance section A6.
-I enhanced the detail and--
-I think there's enough to enhance.
Release it to my screen.
-Enhance the reflection in her eye.
-Let's run this through
video enhancement.
-Edgar, can you enhance this?
-Hang on.
-I've been working on this reflection.
-Someone's reflection.
-Reflection.
-There's a reflection of the man's face.
-The reflection.
-There's a reflection.
-Zoom in on the mirror.
-You can see a reflection.
-Can you enhance the image from here?
-Can you enhance it right here?
-Can you enhance it?
-Can you enhance it?
-Can we enhance this?
-Can you enhance it?
-Hold on a second.
I'll enhance.
-Zoom in on the door.
-Times 10.
-Zoom.
-Move in.
More.
-Wait, stop.
-Stop.
-Pause it.
-Rotate it 75 degrees
around the vertical, please.
-Stop.
And go back to the part
above the doors again.
-Got an image enhancer that can bitmap.
-Maybe we can use the Pradeep
Sen method to see into the windows.
-This software is state of the art.
-The eigenvalue is off.
-With the right
combination of algorithms--
-He's taken illumination
algorithms to the next level,
and I can use them to
enhance this photograph.
-Lock on and enlarge the z-axis.
-Enhance.
-Enhance.
-Enhance.
-Freeze and enhance.
[END PLAYBACK]
DAVID J. MALAN: All right.
That's it for CS50.
We will see you next time.
[VIDEO PLAYBACK]
-Everyone knows you went to Yale,
but nobody knows what happened.
What can you tell me about that weekend?
What can you tell me about Rosebud?
[DRAMATIC CHORD]
[DRAMATIC MUSIC PLAYING]
[END PLAYBACK]
