[MUSIC PLAYING]
DAVID MALAN: All right,
well, this is CS50.
And this is lecture 4.
And you'll recall that last time,
we focused really on algorithms.
Not so much code, but on algorithms,
the step-by-step instructions
for solving problems.
And we did this in the
context of numbers.
But more specifically, we assumed that
we had places to put these numbers.
We called it an array.
And that was back-to-back-to-back memory
where we could put numbers or maybe
characters or something else.
But for the most part, we haven't
really had more sophisticated ways
of laying information out in
memory, and so we're kind of
stuck going left to right, right
to left, or maybe a little more
intelligently using what we
call binary search, or divide
and conquer, and kind
of split the difference.
But we can start to get fancier.
Because if you recall,
we have memory inside
of our computers, which we keep
modeling as this rectangular region
with numbers and so forth.
But that's kind of a nice canvas, right?
Much like you could paint on a canvas,
so could we kind of move things around
in memory.
But we don't yet have the vocabulary
or really the data structures,
as they're going to be
called, to do anything more
interesting than this linear approach.
And so today and next time
when we begin to start
to solve problems more cleverly and then
revisit some of the kinds of algorithms
that we've seen already, and
those algorithms last time
were like these, linear
search and binary search.
And that brought us full circle back
to week zero when we used those,
albeit not by name.
And then recall that
we introduced sorting.
Sorting's a more involved
algorithm, more time-consuming.
But it was a precondition for
which of those first two algorithms
to actually work?
Yeah, so binary search
assumes, just like a phone book
assumes that the names are
actually sorted from left to right.
Otherwise, you're completely
wasting your time,
if you're splitting
a phone book in half,
if there's no rhyme or reason to
where the names and the numbers are.
So this is sort of a
requisite ingredient
to get better efficiency
with other algorithms.
And we're to see that moving forward.
And it's going to be up
to decide, you know what?
It's going to take me
more time to figure out
how to write the code
for an algorithm than it
is to just write it the easy
way, and just cut a corner
and run the code once.
But if you're a company,
if you're a website,
if you're a piece of software that's
doing the same algorithm again
and again and again, maybe you do
want to spend that upfront cost.
Maybe it's human time, like your
time, figuring out the algorithm.
Or maybe it's n squared or
hopefully something better, n log n,
that you have to spend just to
get the data in a nice format,
and then, thereafter,
everything can be blazing fast.
So especially when we
get to web programming,
we'll revisit these questions when
we talk about databases as well.
But recall last time we also
introduced a bit of formalism.
And we won't get more
mathematical than this.
But These were sort of
broad strokes with which
we can paint the efficiency of
algorithms from sort of slowest on top
to fastest on the bottom.
And recall that we introduced
some Greek symbols here just
to kind of standardize what
it is we're talking about.
And as a quick check, big O
refers to what kind of boundary?
The upper bound.
So maybe in the worst case, what's the
upper bound on your algorithm's running
time?
How many steps, how many seconds,
how many hours might it take?
By contrast, we had omega,
which was the lower bound.
And then less commonly
used, at least in CS50,
will be theta, which is just
when those two are the same.
But realize those are
the three ingredients,
especially if you continue on in other
CS classes, that might be revisited.
And we introduce this, so that
we're not just counting seconds.
We're not, like, looking at our watch
and counting how fast our algorithm is,
because that's not a very reliable way
of measuring whose algorithm is better
than someone else's.
Maybe my Mac is faster
than your Mac or your PC.
And so we don't want to
just look at raw time.
We want to think about things
a little more theoretically,
albeit without worrying about
denominators and lower order terms.
We talk generally in terms
of these higher order terms
that we saw last time.
So I thought it would be
fun to kind of reminisce.
Some years ago, when a
certain someone was still
just a senator, being interviewed
by Google's own Eric Schmidt,
who was the former CEO of Google.
And this was an interview that took
an interesting algorithmic turn.
If we could dim the lights for a moment.
[VIDEO PLAYBACK]
[APPLAUSE]
- Now, Senator, you're here at Google.
And I like to think of the
presidency as a job interview.
Now, it's hard to get a job--
- Right.
- --as president.
- Right.
- I mean, you're going
through the rigors now.
It's also hard to get a job at Google.
- Right.
- We have questions.
And we ask our candidates questions.
And this one is from Larry Schwimmer.
- OK.
- What-- you guys think I'm kidding?
It's right here.
What is the most efficient way to
sort a million 32-bit integers?
- Well, I--
- Maybe-- I'm sorry, maybe we should--
- --no, no, no, no, no, no, I think--
- That's not a fair question.
- --I think the bubble sort
would be the wrong way to go.
- Come on.
Who told him this?
[END PLAYBACK]
DAVID MALAN: OK, so today
we peel back the layers
that we've been assuming
for some time now.
We've been talking about
integers and characters.
But we also had this
higher level concept
of a string, which was
a generic way of saying
you had back-to-back
characters that represent words
or sentences or paragraphs or whatnot.
But today we reveal that that's
actually been a bit of a lie.
There actually is no such thing as
string in the language called C.
And indeed, you might
have kind of suspected
as much, given that we keep including
cs50.h, which gives you things
like GetInt and GetString and so forth.
But it also gives you
string, literally, a keyword
that actually doesn't come with C.
And before we peel back
what exactly it is,
let's consider perhaps
what problems it creates
and what powers it reveals to understand
what's going on underneath the hood.
And let me propose that we try to
implement, at least in pseudocode,
this algorithm here, swap.
So I proclaim that swap is a
function whose purpose in life
is to take two inputs--
let's call it a and b--
and just swap them, so that
a becomes b, and b becomes a.
And before we even get into the
weeds of pseudocode or actual code,
we actually have two values here.
Might anyone like to join me onstage for
just a moment for a drink of Gatorade?
A little Gatorade?
Around Maybe someone a
little farther back today?
A little drink?
Yeah?
OK, come on down.
What's your name?
KATE: Kate.
DAVID MALAN: Kate, come on down.
Join me for some Gatorade onstage.
Welcome to Kate.
All right.
So the challenge at hand is this.
Let me just set us up here.
So we have some green.
That's very unnatural looking.
OK, we have some pink.
And you know what?
I think, actually, I'd like the
pink, and maybe you could have green.
I need you to swap the values
of these two cups if you could.
I need you to get the pink into
the green cup and the green
into the pink cup.
KATE: I think I need another cup.
DAVID MALAN: OK, so she thinks
she's going to need another cup.
And OK, so good.
I came prepared.
So we need another
variable, if you will.
OK, so here we go, Kate.
Set us up.
All right, green goes
into the empty cup.
All right, so pink goes
into the former green cup.
And green goes-- most green
goes back into the original cup.
Thank you so much to Kate.
Why don't we--
I don't know if you'd like that flavor.
Delicious.
OK, well, thank you very much.
Thanks for coming up.
Here we go.
Oh, and this.
Oh, and if you'd like.
KATE: Sure.
Thank you.
DAVID MALAN: So this was
obviously very intuitive.
That one's not bad.
This is actually pretty intuitive,
of course, for any of us in the room
that you obviously can very cleanly
swap two values or two drinks.
You need some kind of
temporary storage place.
And so we can introduce this in
the form now of some actual code
by claiming that if you want to swap two
values, you need the exact same idea.
So even if variables have
seemed a little abstract
or you're at least used to them
in the context of algebra x and y,
in a programming language,
variables are just
like these empty cups into
which you can put values.
In reality, it's green and pink.
But it could be a number, it could
be a letter, or maybe something more.
Maybe even something like a string.
So we're going to come
back to this idea.
Because if you agree that
Kate's algorithm was correct,
once she had that temporary
variable, it would
seem that this is a pretty
reasonable translation to code.
Put store a in tmp.
So move a into the temporary variable,
just like she did into the empty cup.
Then change a to be b.
And then change b to be what was
a, which is currently in temp.
So it's kind of this
three-step switcheroo,
just like Kate enacted for us here.
But recall last time that we have--
actually, let's do this.
Let me actually go ahead
and open up a program here.
You know what?
Let me go ahead and open up,
let's say, today's example
called noswap, which
is a bit of a spoiler,
insofar as the name suggests
what's actually going to or not
going to happen here.
And if I go ahead here and
open this up in source 4.
Let me go ahead and open up noswap.
Let's take a look at what this
program actually looks like.
Doesn't seem to do all that much.
But it does have a
main routine up front.
And it's got an include of stdio.h.
And then what does main do?
It declares two variables,
this time called x and y,
initializing them to
1 and 2 respectively.
And then a couple of printfs.
So printf x is such and
such, y is such and such,
printing out its actual values.
So that's sort of week
one use of printf.
And then we do the-- it looks like
the exact same thing two lines later.
But in between those two lines is
an actual function call to something
called swap, which, as it
turns out, if we scroll down,
is literally the same thing as we
had on the screen a moment ago.
I'm calling it a and b
here, but I could have
called it anything I want, x and
y, fu and bar, anything like that.
I have my temporary variable or
temporary empty cup, like Kate had,
and I do the switcheroo.
So when I run this program
after compiling it with--
let me go into source 4,
and then do make no swap,
Enter It seems to compile OK.
So when I run noswap, what sentences
should I see on the screen?
Based on main.
AUDIENCE: [INAUDIBLE] x is 2, y is 1.
DAVID MALAN: Yeah, x is 1 comma y is 2.
And then hopefully x is 2, y is
1, if swap, indeed, works exactly
as Kate enacted in the real world here.
So as the name might suggest,
doesn't seem like something's
going to go quite right here.
That's weird.
It didn't actually
seem to swap the value.
OK, so maybe it's a bug.
We've screwed up before.
Maybe I just made some error.
Maybe I'm kind of saying I'm doing one
thing, but am actually doing another.
So let's double-check.
So int x gets 1.
Y gets 2.
That seems correct.
My sentence is correct.
It's x, y and then it's again x, y.
So I didn't accidentally reverse them.
And printing the same thing.
I'm calling swap.
And so it seems Kate was wrong.
She did not swap the green and the
pink Gatorade correctly somehow,
because, at least in
code, it's not working.
So why is this?
And it turns out that the answer
to this, Kate's algorithm's
actually correct, but our interpretation
of it in code isn't quite correct.
Even though it compiles, even
though it runs, and it actually
is doing something underneath the
hood, it's not actually, obviously,
doing precisely what we want.
So why is that?
Well, let's come back to this picture
here, which is just a stick of RAM
that you might have on your
desktop or your laptop.
And each of those black chips
has some number of bytes.
Maybe a gigabyte.
Maybe less.
Maybe more, these days.
And I keep proposing that we think of
these, if you can just kind of zoom in,
as just a grid of values.
So if we zoom in on that, and we think
of the top left corner is the byte 0,
the one next to it byte 1, byte 2.
We can literally number all
the bytes in our computer
from 0 to, like, a billion or 2 billion,
however much RAM you actually have.
But it turns out that computers
use that memory not exactly as left
to right, top to bottom.
There's a bit more structure to it.
And so let's actually be a little
more precise today and zoom in
and propose this.
It's a lot of words
all up front and we'll
tease them apart in just a moment.
But if you think of this as that black
chip on your computer's memory stick,
it turns out that the computer's going
to use different areas of memory--
the stuff down here, the stuff down
here-- just a little bit differently
conceptually.
Humans, years ago,
decided to use this memory
for certain things, this
memory for certain things,
and we've all kind of standardized
on that since, at least in C here.
So there's going to be
two salient features here,
so called the stack and the heap.
And the heap we'll get to before long.
But a stack is just
like it is in English.
If you go over to Annenberg
or some dining hall on campus,
you just have a stack of plastic trays?
That's the same idea,
we'll see, where you
can put more and more
stuff on the stack,
and the heap's going to work
a little bit differently.
But you can still think now, even
though we've started to kind of label
different parts of memory with these
words-- heap, stack, and others--
you can still think of the
idea as being exactly the same.
Within the so-called
heap portion of memory,
we're still going to number our
bytes 0, 1, 2, 3, 4, much like I
proposed with some of squares here.
Same thing for the stack.
The numbers might be different,
because they're obviously
farther away from those other boxes, but
we're still just going to number them.
So the mindset is the same.
It's just we're putting different
types of things in different places
starting today.
And let's see what this
means for us in reality.
I'm going to go ahead here and create
a new program called compare0.c.
If you'd like to play along
at home, the same code
will be online on the course's website.
As usual I'm going to do,
like, an include of cs50.h.
I'm going to do an include of stdio.h.
Then I'm going to do int main void.
I'm not going to worry about
command line arguments today.
And then I'm going to go ahead
and get myself two strings.
So I'm going to do string s,
we'll call it, get_string.
And I'm just going to call this s.
So I'm going to prompt the user
with a pretty simple string.
Like, just give me s.
I want one more.
String t gets get_string, t
colon and then a double quote.
So just sort of, again, week one stuff.
Just give me one string, then give
me another string, and put them in s
and t respectively.
And now, let me just do something that
you might have been inclined to do
or had to do previously, which
is just to compare these things.
So you know what?
I want to check.
If the user typed in the same
word twice, let me just say,
if s equals equals t, I'm going to go
ahead and print out, quote unquote,
"same," and a new line.
I also am going to go ahead and
just literally print out different.
So I've whipped up a
simple program that I
think is just going to ask
the user for two strings,
and then just compare them.
Now, in the past, I have made
mistakes when it comes to equal signs.
Have I used the correct number of
equal signs for something like this?
Or should it just be the one?
AUDIENCE: [INAUDIBLE]
DAVID MALAN: So it should be the two.
Because the one is used
already as assignment.
Move this from right to left.
So equals equals seems to
have the right semantics.
Like, I'm trying to compare
s and t for equality.
So let me see, if I got no syntax
errors here, let me go ahead
and make compare0.
OK, compiled.
And then ./compare0.
Enter.
Let me go ahead, and I'll
type Stelios's name again.
I'm going to go ahead
and type his name again.
Looks good.
And [INAUDIBLE] different.
OK, maybe, I mean, it's kind of a long
name, maybe I just screwed up somehow.
So let's try this again.
So I'll type my own name.
Twice.
No.
OK, maybe we should try
another, like Maria, Maria.
Three times incorrect.
There's got to be something
wrong with my code.
So what is actually going on?
Well, maybe we should just go
about this a little differently.
Let me go ahead and try another program.
Let me go ahead and create a
new program called copy0.c.
And this time, maybe comparing--
I'm just going to copy the
strings this time in order
to actually do something
simple and see the results.
So let me go ahead and do an include
of cs50.h and include of stdio.h,
int main void.
I'm not going to worry about
command line arguments.
And I'm going to, again, do
string s gets get_string.
And I'm just going to say, give me s.
And you know what?
This time rather than complicate
things by getting a second string,
let's just say t is going to equal s.
So I think this is the
correct use of equals,
because it's one equals, which means
copy s from the right to t on the left.
And then I want to
capitalize this string.
So let me just insert a
little bit of logic here.
And I only want to capitalize the string
if the user typed something in that's
long enough.
So I want to do a little
bit of safety checks,
because we're starting to get in the
habit of better error checking now.
So if the length of t is greater than
zero, you know what I want to do?
I want to go ahead and change
the first letter of t, the copy,
to be the result of calling toupper
of that first character of t.
And then you know what?
Let's just print it out.
So s is going to be %s.
And we plug that value in.
And then let me go ahead and print
out t %s backslash n comma t.
So it's a bunch of syntax all at once.
But to recap, I'm getting a string,
calling it s, just like before.
I'm not getting a second string.
Now I'm just saying, you know what?
The string called t is
going to be equal to s.
And we've used assignments,
certainly, before.
Just to be clear, why did I complicate
my code with this use of strlen here?
Why did I do that?
What could go wrong otherwise?
Someone else?
Propose to me what the user could
do that might make problems for me?
Yeah.
AUDIENCE: [INAUDIBLE]
DAVID MALAN: Exactly.
Suppose the user doesn't type his or
her name, and just hits, like, Enter.
That's going to return
a string of zero length.
Now, technically, a
string of zero length
uses up how many bytes in memory?
AUDIENCE: One.
DAVID MALAN: Why one?
AUDIENCE: Backslash.
DAVID MALAN: Right.
There's still a backlash 0.
So recall last week when we talked about
what strings are underneath the hood,
they're always terminated
by backslash 0.
Whether there are zero characters, one
characters, five characters, or 1,000,
they end with a backslash 0.
So even if the user just
types Enter, he or she
is really creating a string
that's of 1 byte in memory,
but its length, so
far as we humans care,
is zero, because there's
no characters in it.
But if you look at that zeroth
character, just to be clear,
what is going to be at t
bracket 0 in this case?
Backslash 0.
And if you now change this to
be toupper, it's just weird.
You shouldn't be
touching that backslash 0
and certainly not
trying to capitalize it.
So I'm being a little defensive here.
Though, hopefully, frankly, toupper
would not break in that case,
because it's not an alphabetical letter.
But I'm at least thinking
about these circumstances.
Now I'm just going to go ahead
and print out both s and t.
So let's see the results.
Let me go ahead and make copy0.
And oh, OK, we've seen this before.
Let me zoom in on the bottom.
So the first error
message here has something
to do with implicitly
declaring library functions.
That's a pattern you should
start recognizing now.
What does that mean I probably omitted?
AUDIENCE: [INAUDIBLE]
DAVID MALAN: Yeah, so some
library, some header file up top.
And maybe by instinct or
maybe by running help50,
you would recall at
this point, oh, right, I
need to do, like, include string.h.
And just to anticipate,
is there one other I
should add before I embarrass
myself with another error?
AUDIENCE: [INAUDIBLE]
DAVID MALAN: Yeah, so include ctype.
It's not so much an embarrassment.
It's just I should know it
once I've seen it once before.
So here we have ctype.
Because what's in ctype?
Toupper, right?
And we would only know that from
[INAUDIBLE] having done it before.
Otherwise, you pick it up along the way.
All right, so now let me go
ahead and rerun make copy0.
Seems to work OK.
Let me go ahead and do ./copy0.
And let me type in Stelios's
name, but all lowercase for s.
Hit Enter.
OK, kind of worked.
But what's the symptom here now?
Yeah?
AUDIENCE: [INAUDIBLE] capitalize
the s string as well [INAUDIBLE]..
DAVID MALAN: Yeah,
capitalize both s and t.
But maybe this is, like, a screen thing.
Like, a lowercase s and capital
S kind of look the same anyway.
So maybe we're just
kind of seeing things.
So let me just type
my name in lowercase,
where the D is hopefully going
to look quite different when--
no, same behavior.
So why is this?
What's going on?
Again, the code here
is just an application
of the ideas we've been using
for the past couple of weeks.
If you want to get a
string, call get_string.
If you want to copy a variable, use the
assignment operator from right to left.
And so here is where we're beginning
to find weaknesses or sort of signs
of the lie we've been telling
about what a string actually is.
Because it seems that you can't just
compare two strings for equality.
And it seems that you can't just
copy a string into another just
by using the same techniques we've
used for chars and for ints and for all
of the primitives, so to speak,
all of the lowercase types,
like double and float and
char and int that come with C.
But string itself is not
something that comes with C.
So what's actually going
on underneath the hood?
Well, at the risk of or in
an attempt to be dramatic,
today is when we sort of get to take
off these training wheels that you might
have had on your bicycle
for some amount of time
and now reveal that a string is
actually a little something more arcane.
And you might have seen
a glimpse of this, maybe
in textbooks or Google
or online or whatnot.
But a string is actually just
a synonym that CS50 staff
created for a little more complicated
expression called char star.
Now, char is familiar.
Character.
And for now you can think
of the char as maybe
implying that there's going to
be multiple characters involved.
Because at the end of the
day, that's what a string is.
A string, at the end of the day,
is still a sequence of characters.
But more precise than
that is the question
that we're going to explore now.
So let me go ahead and do this.
Let me go ahead now and take
away the layer that is string,
and look at this instead
just in the context
now of using char star
instead of that same keyword.
I'm going to go ahead and
create one more file here.
I'm going to call compare1.c.
So hopefully, an improvement
on my previous version.
I'm going to go ahead
and include cs50.h,
so I can still use GetString
and any other functions.
I'm going to go ahead and include
stdio, so I can actually print things.
And I'm going to go ahead and
preemptively include string.h,
so I have some other fancy
features available to me.
So let's do this now, int main void.
So no command line arguments still.
Instead of string s, I'm just going
to take off that training wheel.
Char star s equals get_string,
quote unquote, "s."
And then you know what?
Let's do char star t equals
get_string t: like that.
So the only difference, thus far, is
that I have literally sort of dropped
from my vocabulary the word string and
I'm starting to write it as char star
instead.
It's not multiplication.
So there's a limited number
of keys on the keyboard.
So humans, years ago, decided to use
different symbols for multiple things
sometimes.
And so that's where we're at.
And now recall that last time, if s
equals equals t, I wanted to print out,
quote unquote, "same" else I
wanted to print out "different."
So I think this is the exact same
program at this moment in the story,
except I've just change
star string to char star.
So maybe that's it.
Let's see if maybe all
I need to do is sort
of take away those training wheels.
Run make compare1.
Seems to compile.
And then let me zoom in at the
bottom here and do ./compare1.
And let's go ahead here
and compare, again,
Stelios's name against Stelios's name.
Still different.
Let's try my name.
David, David.
Still different.
Maybe it's a capitalization thing.
Let's do david, david.
All right, so it's still broken.
So obviously, just taking
off the training wheels
doesn't make the problem
better, apparently.
We need to actually understand how
the bicycle works, if I can really
milk the metaphor today.
So how does a string really
work underneath the hood?
Well, you know what?
I'm not quite sure
how to explain it yet.
But I know that I can
actually solve this problem
by not using equals equals.
I can instead do this.
If strcmp of s comma t equals
equals 0, of all things,
now print that they're the same.
You would never know this
unless you were told or found it
in a reference book or online resource
or whatnot that strcmp exists.
As its name kind of suggests,
it's just succinctly written,
strcmp, cmp, so you're
comparing two strings s and t.
And if we read the documentation, like
the man page or CS50 reference online
or Google or whatnot, we would see
that the definition of strcmp is this.
If s and t are visually the
same string, then return zero,
just because a human decided that.
If s comes before t alphabetically,
return a negative number.
If s comes after t alphabetically,
return a positive number.
So that's kind of nice, because
there's three scenarios where
you're comparing two things.
Either they're the same or one is
bigger or alphabetically before
or alphabetically after or smaller.
So there's kind of
three cases to consider.
And we've seen that before
when we've written out
pseudocode when testing for things like
equality or looking for Mike Smith.
So strcmp leverages that same idea.
It's just a little messy that
we have to use numbers, like 0
for equal and negative for less
than and positive for greater than.
But let me try recompiling this now.
Make compare1.
And then let me do ./compare1,
Stelios, Enter, Stelios, Enter.
And now they're the same.
Let's try this again.
Let's do Stelios again.
Maria.
They're different.
So again, proof by example should
be sufficiently compelling.
But it's better, it seems.
And indeed, it is correct now.
So what is it that's going on?
And you know, maybe we
just got lucky here too.
Let me go ahead and iterate
on this and just improve
things in a couple of ways.
Let me go ahead and open
up or point out this.
Let me go ahead and open up copy1, which
is an improvement on our original copy
version as follows.
Let me do this here.
So what's actually going on?
So it turns out--
actually, let's do this.
Before we jump ahead to compare1,
let's consider what's actually
happening in this computer program.
So to recap, we're getting
a string, calling it s.
We're getting another string.
We're calling it t.
And then previously, we were
just comparing equal equal.
But now strcmp seems
to solve this problem.
So someone solved this problem for us.
Let's see if we can't
infer what's going on.
Well, let me go ahead and just pull
up a little chalkboard here of sorts
and propose to consider what
exactly GetString is doing.
All this time, we say that GetString
gets a string from the user
and just returns it to you.
And indeed, when we did
an example the other day,
and our volunteer was wearing
the GetString name tag,
he just handed me back a slip of paper
that said the audience member's name.
So GetString does work like that.
But we only have access
to this canvas of memory.
So what does it really
mean to return a string
and to put it somewhere in memory?
Well, it turns out when you do a line
like this, string s equals get_string,
there's two parts to this,
the left and the right.
So what's actually going on?
Well, the left-hand side of
this expression, string s,
is telling the computer, hey,
computer, give me a chunk of memory,
and I'm going to draw it
as a box, and call it s.
The right-hand side of this expression,
obviously, does get someone's name.
And I'm going to draw that just for
the moment as, like, Stelios's example.
So Stelios's name.
And just to be super precise, there's
a backslash 0 in there, recall.
Let's continue that assumption.
But what exactly is going in here?
Like, Stelios's name,
literally, visually,
cannot fit in that tiny little box.
So what is it, when Stelios's
name is handed to me,
I'm actually storing in this little box?
Well, what is Stelios name
here actually implemented as?
I've just written it out.
But what is it more technically,
using some CS jargon?
It's an array of characters.
So let's see that.
So let's just kind of draw out what
we know is underneath the hood,
even though we can just kind of
take for granted that this works.
So this is not to scale.
But each of these characters,
letters in his name, take up a byte.
They're an ASCII character, so to speak.
So I don't know where
these things are in memory,
but I'm just kind of going to guess.
I've been using my computer for a while.
So maybe this is at, like, byte 100.
This is byte 101.
This is byte 102, and so forth.
So those bytes are numbered
whatever the numbers actually are.
So given that, the fact that a string
is just a sequence of characters,
and those characters each
take up, like, a byte
of actual memory in your
computer, and those bytes
can be thought of as having numbers from
zero to, like, a billion or 2 billion,
what would you propose, just
intuitively, we put in the box
when we get back a string?
AUDIENCE: 106.
DAVID MALAN: 106?
OK, why 106?
AUDIENCE: Because you can
tell where the [INAUDIBLE]..
DAVID MALAN: If we say 106.
So that would be like--
oh, sorry, do you mean--
that's my way of writing 100.
Do you mean 106 as in one, two,
three, four, five, six, this one?
AUDIENCE: Oh, no, no, no.
DAVID MALAN: Oh, just
my messily written 100.
AUDIENCE: So we can, like, store the
location of the various [INAUDIBLE]..
DAVID MALAN: Correct.
So if we fix my horrible
handwriting, we could put in this box
just the address of that string.
That's not his whole name.
It's just, like, the location
in my computer's memory
of the first character in his name,
which feels a little reckless,
because Stelios's name is not s.
But why is this sufficient
information to store Stelios's string
in this variable effectively?
AUDIENCE: Backslash.
Backslash 0 is there.
DAVID MALAN: Exactly.
The backslash 0 is there.
And even though a computer, if you
go back to our locker metaphor,
kind of has to open each door in order
to see what character is behind it,
we can do that with just a
simple for loop or a while loop.
And a computer, given the address
of Stelios's name's first character,
it's kind of like a map,
like, X marks the spot.
Like, the computer can go
to that location in memory
and say, oh, here is the
first letter in his name.
And then the computer,
if printing out his name
using printf or something
like that, the computer
can just print out the s, then
open the locker door next to it.
And if it's not a backslash
0, print out the t.
Open the next door.
If it's not a backslash
0, print out the e.
And repeat and repeat and repeat.
And as soon as it does get to that
special sentinel value, as we say,
backslash 0, it closes the
locker door and stops printing.
So because all this time we have been
terminating our strings with backslash
0 as the special
demarcation, all we need
to know logically to store a string
is where does that string begin,
the upside of which is now we can store
just a single tiny chunk of memory.
It tends to be 4 bytes or 8 bytes,
depending on what kind of computer
you have, 32 bits or 64 bits.
But what you don't need is as many bytes
as there are characters in his name,
because those are already
using other bytes in memory.
So when I now declare t in my
first version of the program,
string t gets get_string,
quote unquote, "t:"
just as my user's prompt,
close quote, semicolon,
what's happening in this scenario?
Well, the logic is exactly the same.
This is saying, hey, computer
give me a chunk of memory.
Call this chunk t.
And then even if the human types
in literally the same thing, like,
S-T-E-L-I-O-S, that is underneath the
hood just an array that we can keep
drawing like this.
And of course, the
computer, for us, is going
to put that backslash 0,
because the computer's been
doing that for the past few weeks.
It's not going to be the
same location, though.
Maybe this is now at, like,
location 300, and this is 301,
and this is 302, and so forth, because
the computer's doing other things.
It's using memory over
here or over here.
So it's not going to be
in the same location.
But what, therefore, gets
stored in t when we assign
the return value of get_string to t?
300 in this case.
And that goes here.
And so if we go back to our original
computer program, which, again,
was this program compare0.c, why now
does it make perfect sense, eventually,
that s and t are always different
no matter what you type in?
Because what's actually getting
compared on this highlighted line
here when you do if s equals equals t?
The memory locations.
You're literally doing the same
thing for the past few weeks.
You are literally just comparing
two variables, s and t.
The difference is that now that you've
broken the abstraction layer, taken
the training wheels off, however you
want to think of what we're doing here,
you are literally doing the
same thing, but you're just
comparing two addresses, two locations.
You are not very fancily comparing
every character against every character
to determine what we humans think
of as equal or identical strings.
So this notion of an address,
this notion of a location,
is given the buzzword pointer.
And a pointer is just an
address of something in memory.
And you can think of it, literally,
as pointing to something.
Like, what is 100 pointing at?
You can visualize it as well.
It's kind of pointing to the first
byte of some array of memory.
And t, meanwhile you can think
of as pointing to, with an arrow,
the first byte of some
other chunk of memory.
And as such, you know what?
At the end of the day,
we really aren't going
to have to care where
things are in memory.
These are uninteresting
implementation details
and shouldn't really
matter to us, because we
can talk about things symbolically.
We can say s instead of
hard-coding the stupid number 100.
We can say t instead of
300, and let the computer
and the compiler do all
that kind of math for us,
just knowing we have this canvas
of memory at our disposal.
And so why does string not exist?
Well, this notion of a string--
apologies, again, for my
handwriting-- is, again,
just a synonym for char star.
And this string, too, is
just a synonym for char star.
And what does that actually mean?
Star is just the symbol that
humans, years ago, decided
would represent an address,
a pointer, a location.
Char is relevant.
Because what type of pointer is this?
It is a pointer to what data type?
AUDIENCE: A character.
DAVID MALAN: A character.
Or it's the address of a character.
So you can of it as pointer to, address
of, however makes sense in your mind.
And so even though this
is a string, well, t
and s are not pointing
at a string, per se.
If you really zoom in,
they're technically
only pointing at a character.
And it's our choice of
implementation that, yes,
we can eventually find the end of a
sequence by looking for backslash 0.
But this is what's really
powerful and also confusing
about C sometimes is that at the end of
the day, it's very simple operations.
And when you copy something,
you're literally just
moving one number or one character
from one place to another.
And if you want higher level
constructs, like strings,
you have to implement
them underneath the hood.
And this is not sort of a
fun way to start programming
in the very first week of a class.
You'd be like, hey,
everyone, today we're going
to talk about pointers and addresses.
It's mind-numbing, it's
confusing, and it really
doesn't allow us to
focus on the algorithms
and the actual problem-solving.
But now we're at the point in
CS50, and soon with other problems
sets, where you want to--
or we'll have problems
that really can't or shouldn't
be solved by just numbers
and characters and arrays alone.
We actually want to leverage
the power of our intel inside
and do something more sophisticated.
And so we do, at this point, need to
understand what's really going on.
And that's going to unlock
a lot of capabilities.
So with that said, if compare0
was wrong, but compare1 was right,
what can we infer
about how strcmp works?
Someone wrote this years
ago, decades ago even,
but what did he or she do in order
to implement strcmp correctly,
do you think?
AUDIENCE: They gave the
same memory location.
DAVID MALAN: They gave
the same memory location.
Can't be that, because they don't
have control over the strings
I'm comparing, right?
I am passing in s,
and I am passing in t,
and I got those from wherever I want.
So 20, 30, 40 years ago, someone only
could take in as inputs two strings,
or technically two pointers, or
really technically two addresses.
So what did this person
do to implement GetString?
How about here.
AUDIENCE: [INAUDIBLE] compare the
values that are actually [INAUDIBLE]..
DAVID MALAN: Yeah, just intuitively,
if this program's correct,
strcmp must be doing what we humans
thought equals equals would do
or might do.
And so what strcmp is presumably
doing is it's taking two addresses--
we'll call them s and t--
and it's going to those
addresses, and it's checking,
are these two letters
literally the same?
And if so, it moves onto the next.
Are these three letters
literally the same?
If so, it moves on.
And the moment it notices a mismatch,
what does strcmp probably do?
AUDIENCE: [INAUDIBLE]
DAVID MALAN: Well, strcmp
doesn't print out different.
That was me in 2017.
AUDIENCE: It returns a negative
or a positive [INAUDIBLE]..
DAVID MALAN: Exactly.
It would return negative
or positive based
on whether the string is determined
to be alphabetically before or after.
And so only if we get to the very end
and we see, ooh, we made it to two
backslash 0's at the same moment
in time can strcmp just return 0
and say these two strings
are, in fact, equal.
So we, too, could implement that, right?
It just sounds like a for loop, a while
loop, just comparing those characters.
Yeah.
AUDIENCE: So when you were
comparing [INAUDIBLE]??
DAVID MALAN: Oh, so say that once more.
AUDIENCE: When you were
comparing Stelios [INAUDIBLE]??
DAVID MALAN: Stemios.
OK.
OK, so good question.
So let's actually see this.
Let me make one quick tweak here.
And let me go ahead and do this.
I'm going to go ahead and do int
answer gets strcmp s comma t, just
so I can tuck it away in a variable.
And that means I can now just
do if answer equals equals 0.
So now my-- sorry, I'm the
only one who is playing along.
So what I just did was this.
Let me rewind in time.
So just a moment ago, my
code looked like this.
I was just calling strcmp, passing in
s and t, and checking the return value.
I'm just going to give
myself a variable here,
and store the return
value of strcmp now.
And then I'm just going to
change this to be answer,
so that my code is identically,
functionally, the same,
it's just now I have access to answer.
Because I want to answer
your question empirically.
I want to just try this out.
I don't want to have to
wrestle with the documentation
even Let's just try and see.
So let me go ahead here and do
this, printout %i backslash n.
And then even if they're different,
let's go ahead-- actually,
you know what?
I don't even need to put that there.
Let's just put it out here.
So printf answer is %i semicolon.
AUDIENCE: Can you put,
like, comma [INAUDIBLE]??
DAVID MALAN: Yes, exactly there.
Precisely my point.
So now let me go and do
make compare1 ./compare1.
And let me zoom in here.
And let's type in Stelios and
his evil brother Stemois, Enter.
And so the answer is negative 1.
Why?
Well, alphabetically, Stelios, with
the L, comes before Stemois, with an M,
and so we get back a negative number,
which happens to be negative 1.
But I think the documentation
just says it's a negative number.
So you don't check for
negative 1, per se.
And now if we reverse them, let's
just do a little quick check here.
If now we do the evil brother
first and then Stelios,
Enter, now it's a positive value,
because I've switched the order.
And of course, hopefully, all this time,
when you type in Stelios both times,
the answer is, in fact, 0.
So a very simple idea.
We're just getting back an integer,
but doing something ultimately pretty
powerful with it.
So what about copy?
So copy needs a better
solution than we have at hand.
Because when I do copy,
for instance-- let
me go ahead and just clear this and
propose what's actually going on.
So with copy, I have
string s gets get_string.
But let me drop the
training wheels again.
So let me just write
it as what it really
is, char star s equals get_string.
And then the prompt for the
user was, quote unquote, "s."
And now that gives me a picture
in memory, like this, called s.
And that's going to give me a string,
which is going to look like this.
And let's go ahead and
do Stelios's name again.
So Stelios with a
really big backslash 0.
And my array as before.
And I don't care about
the addresses anymore.
This is, like, an uninteresting
story to keep making up numbers.
Let's just point at it.
Conceptually.
So a picture is worth as many words.
But now in my second
program, remember, copy0--
let me just pull it up to remind.
In copy0, recall, we had this code,
where I copied a string's address
by claiming just t gets s, t equals s.
So what's the implication now for this?
Well, let's just do char star t gets s.
And again, the only
thing I've changed now
is I've changed the word
string to char star,
just to be more pedantic
about what's going on.
This gives me what kind of picture?
Well, it gives me a little
chunk of memory called t.
But to be clear, what goes in s?
Just an address, right?
Char star makes super clear
now that t is an address.
So all we can fit is an address here.
We can't fit, like,
S-T-E-L-I-O-S and so forth.
We can only fit an address.
So what address?
Well, let's go back
to the previous story.
If we did have numbers, like
100 and 101 and 102, as such,
technically in this box,
it's just the number 100.
But again, who cares?
We can just draw pictures at this point.
What is in t?
The same thing.
So it's also 100.
But again, who cares about the numbers?
It's kind of conceptually
the same thing.
So at this point in the story, when
I have written string t gets s,
it is working correctly.
It is copying what's in
s and putting it in t.
But what's in s is an address.
And so that's there for the same
thing that's going to be in t.
Yeah.
AUDIENCE: How is the upper
function able to work?
DAVID MALAN: Ah!
So how has the upper
function able to work?
But work in what sense?
Because recall that when
we used copy0 before--
let me zoom in here, ./copy0, and I
typed in my own name all in lowercase,
it worked in the sense that it
seems to have changed both s and t.
So it kind of worked, but
it overworked, if you will,
because it should have
only changed the copy.
So why is that?
Well again, let's go back to
kind of the fundamentals here.
If copy0-- and let me go
in to highlight these lines
to which I've added some comments now.
These are the lines of code
to which you're referring.
T bracket 0 gets toupper of t bracket 0.
So I kind of read that
now, even though it's
a little cryptic, as pass
the first character of t,
t bracket 0, into the toupper
function, and if it's a little d,
become big D, if it's a little
a, become big A, and so forth,
and put that back in the same location.
But the catch is that, OK, so
let's do that on the screen.
So t bracket 0 means go to the
first element in the array,
and it happened to be a
little d in the example I did,
and capitalize that letter.
Here it would be s, but
it's already capital.
So now I'm kind of
combining the stories.
And so when you change
the first letter of t,
you're literally changing
the first letter of s.
Because t and s are both kind of
like maps, little treasure maps,
that lead to literally
the same location.
And so here, too, is something
that I've not yet stated clearly.
Even though t we've
been calling a string,
and now we're all of a sudden
calling it a char star,
you can still manipulate its characters
using square bracket notation.
So when you see something
like t bracket 0,
we've all known for
the past week or two,
and especially with
Caesar and Visionare,
that that just means change the zeroth
character of the string called t.
But a little more technically
now, if t is a pointer,
it really means go to the
address and then get bracket 0.
So there's a little more
work that's been happening.
It's just more of a mouthful
than we needed a week or two ago.
So what is the solution then
to this fundamental problem?
Copy0 is still broken.
You might understand how it works.
And if not, you can certainly
read through this again
or take it a little slower or use
the debugger to see what's going on.
But for now, it's broken.
So what's the fundamental
conceptual solution here?
AUDIENCE: Set up two addresses.
DAVID MALAN: Yeah, set up two--
and not just two addresses.
A little more.
AUDIENCE: Two houses?
DAVID MALAN: Two what?
AUDIENCE: Two houses.
DAVID MALAN: OK, sure,
two houses, right?
One for Stelios and one
for evil twin or whatnot.
The key being, we need two
of these chunks of memory
to be different strings.
We don't just want two addresses.
So we need to kind of
fill in this gap here.
So how do I get an
extra chunk of memory,
so that I can put a copy of s and a copy
of t and a copy of E-L-I-O-S backslash
0?
What mechanism allows me
to get as many characters
as I need to fit the original name,
so that I truly have two copies, so
that this picture is not this anymore?
How do I create a scenario in a
correct version of my copy program,
so that t actually
points to a true copy,
so that to your point, when I
change the zeroth character,
I'm changing the zeroth character
of the copy, not of the original?
Well, we don't have
this answer just yet.
But we do if we introduce
this one function here.
So I'm going to go ahead and
open up here, proactively,
copy1.c, which I wrote in advance,
which looks the same up here,
except for at least one.
You'll see eventually.
So I'm going to open up this program
I wrote in advance called copy1.c.
And it's almost the same, but I've
added a few more lines of code.
So here's a familiar line now.
Last week it was string
s gets get_string.
Today it's char star s gets
get_string, because we now kind of know
what's going on underneath the hood.
And just propose-- now I'm
just being a little anal
here by writing more lines of code.
Line 12, 13, 14, 15.
Why do you think I might be having this
if condition all of a sudden today?
This is technically better.
And I've kind of been cutting some
corners the past couple of weeks.
AUDIENCE: Just to
check if s is a string.
DAVID MALAN: Yeah, it's kind
of-- it essentially is that.
Just to make sure that
s is actually a string.
Because it turns out, my
computer, as you know,
has a finite amount of memory, right?
I only have so many of
those little black chips.
I only have so many bytes
or gigabytes in my computer.
And even though we're really not
taxing my computer's capabilities
with typing Stelios's name or
mine, theoretically, my computer
can run out of memory.
And frankly, if your Mac or PC has ever
frozen or crashed or hung or whatever,
maybe it ran out of memory.
It just didn't handle it well.
And so bad things happen.
So bad things can happen.
So how do we know if
something bad has happened?
Well, it turns out that get_string,
if you read the documentation for it,
and you might have or might
be for the current problem set
where we ask you to consider
what it's really doing,
it turns out that get_string
can sometimes have problems.
Out of memory.
Dammit.
What do you do?
If the user typed in his
or her whole life history,
and it couldn't fit in
memory, what do you return?
You don't just return part of their
life history, just a few of the words
or paragraphs.
You instead return a special value
that we've not used until now.
And that special value is
a keyword that can actually
be something called NULL,
which is confusingly named,
because N-U-L is the name that
humans gave to backslash 0.
N-U-L-L is the word people
gave to this notion today.
So get_string can, in some cases,
fail, because you run out of memory.
The user typed in too many
characters at his or her keyboard.
So get_string can return NULL.
If you don't check for
null, maybe your program
will crash or hang or do
something unpredictable.
And so to hedge against
that, we ask this.
And this is a very
succinct way of doing this.
Let me actually be a
little more verbose.
If s equals equals NULL, return 1.
So there's a couple of
things going on here.
Line 12, if you believe me
that get_string can sometimes
return a special value--
literally, N-U-L-L,
all caps, no quotes--
it stands to reason
you can check for that
and do something based on that decision.
And that decision can be to
return 1 or 2 or negative 1.
So long story short, main,
recall, recall that main,
we've typed in a couple of ways,
sometimes with command line arguments,
sometimes without.
But the word int has always
been there now for weeks.
And I kind of like just ignored
it for the past several weeks.
It turns out that main
is a little special.
And humans, years ago, decided that
main will just always return an int.
And that integer is
used by the computer,
like Linux or macOS or Windows, to just
know if a program succeeded or failed.
If a program worked OK, succeeded,
main is supposed to just return 0.
If something goes wrong, main is instead
supposed to return 1 or 2 or negative 1
or negative 2, any number
of values except 0.
0 is the only number in
the world that means OK,
which is a little paradoxical,
because it usually means false or off.
But there's one 0, and there's, like,
an infinite number of other numbers.
And a lot of things can go wrong
in programs is the thinking there.
So in fact, if you've
ever, on your Mac or PC,
seen an error message that's,
like, error negative 239.
Like, what the hell is that?
It just means that some
program returned a number
that some human decided would represent
whatever problem just happened.
I, thankfully, don't
have a very big program.
Not that much can go wrong.
So I'm just returning the first number
I thought of, which was positive 1.
But it's kind of arbitrary.
But at least it's not 0.
So now I'm doing a little check there.
But you know what?
It turns out, null is just a
synonym for 0, it turns out.
More on that before long.
So what does the
exclamation point mean in C?
AUDIENCE: Not.
DAVID MALAN: It means not.
Like, reverse the answer.
So this is saying, if s if not, s.
So if s is null, according to my
definition today, it's equal to 0.
So if not 0 means if
true, then return 1.
And this is completely
backwards and confusing I think.
But this is a way of
saying, long story short,
if something went wrong, return 1 now.
That's all.
That's all.
So let's not dwell too much on that,
because that is not the juicy part.
The juicy part is the scary thing.
So let's just consider for
a moment what's going on.
This is probably the scariest line or
longest line of code we've yet seen,
but it's doing something
relatively simple.
What's going on here?
Well, on the left-hand side,
let's pluck off the easy one.
Someone, in English,
just tell me, what's
the left-hand side doing to
the left of the equal sign?
Sorry.
Say again?
AUDIENCE: Declaring a string.
DAVID MALAN: Declaring a string.
Give me a string called s.
But let's be a little more precise.
Give me a variable called s
that's going to store today a--
AUDIENCE: Pointer.
AUDIENCE: Array.
DAVID MALAN: --not an array, but
a pointer or the address of a--
of a character.
So again, if we really
want to be uptight today,
yes, it's declaring a string
called s-- or sorry, t.
Sorry.
Yes, it's declaring a string called t.
But more precisely, it's
declaring a variable called
t that's going to store
the address of a character.
So it's more words.
This is why code is nice and succinct.
You can say in a few characters what
just took me this whole sentence.
So that's all that's happening
on the left-hand side.
So again, if we draw this as a
picture, what's happening now
is that char star t just
gives me this tiny little box.
That is not nearly enough
room to store Stelios or David
or any number of other names.
So the magic must be in the
right-hand side of this expression.
And this is a bit of a
mouthful at the moment.
So let me distill it to its essence.
It turns out there is a function
in the world called malloc
for memory allocation.
Succinctly named.
It takes one argument, the
number of bytes you want.
And Windows or macOS or Linux,
whatever your computer's running,
its purpose in life is to hand
you back a chunk of memory
that is equal to that length,
5, in this arbitrary case.
So what does that mean pictorially?
If I call malloc 5, that means that
the computer finds somewhere among all
of those green circuit boards and those
black chips we've been talking about,
it finds me a chunk of 5 identical
bytes that are back-to-back-to-back--
identically sized bytes that are
back-to-back-to-back-to-back.
And then returns to me, malloc
does, what, do you think?
AUDIENCE: The address.
DAVID MALAN: The address of?
AUDIENCE: [INAUDIBLE]
DAVID MALAN: Exactly.
Of the new place you've
just allocated memory.
So if this is arbitrarily at
location 400 now, and 401,
and so forth, what malloc
returns is just a number.
And more precisely, an address.
It returns the address of the
beginning of that chunk of memory.
But it's worth noting,
malloc is completely generic.
It just gives you a chunk of memory.
It has nothing to do with strings
or integers or anything like that.
So the burden is entirely on you to
know how many bytes you asked for.
So I literally hard-coded 5.
So hopefully, I, the human, remember
that when I write more lines of code.
But notice, absent from
my picture is what?
Deliberately.
AUDIENCE: [INAUDIBLE]
DAVID MALAN: There's no backslash 0.
This isn't necessarily a string.
Maybe this is a really long
number or something else.
A data structure, as
we'll soon call them.
I just have to remember
how long it actually is.
And in particular,
it's worth noting now,
and we've used this
terminology before, when
you don't initialize memory yourself,
you should think of those values,
in almost all context, as
just having garbage values.
Maybe it's the number 1.
Maybe it's a 0.
Maybe it's the number 42.
Anything.
It's just garbage from that memory
maybe having been used previously
in your program for some other purpose
before you didn't need it anymore.
So now if I go back to my
code, 5 doesn't quite work,
because it doesn't even fit my name.
D-A-V-I-D is 5.
But how many bytes do I
need to store my own name?
AUDIENCE: 6.
DAVID MALAN: 6, because
of the backslash 0.
So if I want enough memory to store
my name, I actually need this to be 6.
But this is obviously kind of stupid,
because now I can never support--
I can't even support Stelios.
I can support me and Maria and
some other short names, but not
longer names.
So that's why, at first
glance, the code was scarier.
But let's just break it down.
This expression here in parentheses is
asking probably a familiar question.
What is the length of s?
The string the human typed in.
And then why the plus 1, to be clear?
AUDIENCE: Backlash 0.
DAVID MALAN: The backslash 0, right?
Strlen gives you the
human length of a string,
like, what we view on the screen.
That plus 1 gives us the backslash 0.
And then just to be super precise,
I want that many chunks of memory
times the size of the chunk
of memory that I want.
And technically in C,
char is always 1 byte.
So technically, this is not
necessary in this context.
But just to be super clear, I
decided to be super pedantic
and just say, give me this many
chunks of memory, each of which
should be this size.
And it turns out in C, there's an
operator called sizeof, where you just
specify what type do you want to get
the size of, and it will just tell you.
1, 2, 4, 8, whatever it is.
So now, even though this looks pretty
darn cryptic, all it's implementing
is this picture.
And it's storing in this
variable t the number 400.
Or if, again, your sort of eyes
glaze over at that level of detail,
just think of it as an arrow
pointing at the chunk of memory.
So when I then do this
just intuitively, why
am I asking this question, if not t?
And again, if not t is the same
thing as saying if t equals
equals NULL, why am I doing this?
You've just met malloc.
But maybe what could go wrong?
AUDIENCE: You get a garbage value.
DAVID MALAN: Garbage values are OK.
I can deal with those, because
I can just change them.
Say again?
AUDIENCE: [INAUDIBLE]
DAVID MALAN: You don't have enough.
Maybe I said, give me malloc
of, like, not 5, but give me
malloc of, like, 5 0, 0, 0, 0, 0,
0, 0, 0, 0, like, 5 billion billion.
Maybe I said, give me 5 billion bytes.
And the computer just doesn't
have it, so it's not going
to return me part of the memory I want.
It's just going to say, mm-mm,
like, NULL is the returned value.
It failed completely.
So that's why we check for it.
But that's all.
It's just another form
of error checking.
And then now, notice the
kind of work I have to do.
If I want to copy the string, it's
clearly, today, not sufficient
to just do the equal sign,
moving from right to left.
You have to do the work yourself.
So here's a for loop that
we might have used a week
or two ago for just
iterating over strings.
And that's all I'm doing.
I'm iterating over s, and
copying each of s's characters
into t by just this line of code here.
And now, why-- this looks like a
bug in almost every other context.
Why am I starting at 0,
iterating up to the length of s,
but technically up through the length
of s, even though I'm 0 indexed, right?
In almost every for loop
we've written in the past,
we've done this with strings.
Why did I deliberately
add the equal sign today?
AUDIENCE: You wanted to
include the NUL character.
DAVID MALAN: Yeah, you wanted
to include the NUL character.
N-U-L, just one L, the
special backslash 0.
Otherwise you might end up putting,
like, D-A-V-I-D, garbage value,
which might then be
printed on the screen.
It might be considered part of my name.
And so there could be any number
of garbage values, all of which
will be conflated for
letters of my name.
And so now when we do
this line of code, which
we did have before, we are
literally only capitalizing
the first character of t.
It is completely independent of s.
Because the picture we've
done is have not just this,
but an identical chunk of memory that
has those same characters down there.
And we're only mutating or
changing those characters.
So when we finally print the
result down here, s and t,
we're printing the original s and
the copy of s with its first letter
capitalized.
And that is why then when I did
./copy1 in my source directory.
So if I go into src4,
and then I do ./copy1.
Dammit.
Make copy1.
And I do ./copy1.
And I type in david, all lowercase.
I capitalize only t.
And if I run it again, say,
with stelios, all lowercase,
I capitalize only t, because now
I'm actually manipulating the memory
as I originally intended.
Any questions there?
OK, that's a lot.
Let's take a five-minute break, and
we'll come back with one more layer.
All right.
So we're back.
And to recap where we're at, we sort
of revealed what a string actually is.
It's just a char star.
Star means it's an address
or pointer, location,
however you want to think of it.
And the type in front of the star
means, what is it the address of?
And we've been talking
char star, so that
means the address of a character,
which we can think of, higher level,
as just a string.
So if you're comfortable with
that-- and even if you're not,
it'll sink in over time.
But if you're comfortable with at
least the claim that a string is just
the address of a single
character, then let's
consider what we've been doing all
this time by way of this example.
String 0, which is on
the course's website
if you'd like to look later, too.
And notice, I'm just following
a familiar paradigm now.
I'm just getting a
string and I'm storing it
in s, which is technically a lie.
I'm storing the address of the
first character of that string
in s, to be super precise.
And then I'm just making sure, if
there was not enough memory for this,
return 1 and just abort the program.
If there was plenty of
memory, do this for loop.
And we've done this for the past
couple of weeks in various contexts.
Initialize i to 0.
Initialize n to the
length of that string.
And then go from i less than n just
to print out each character at a time.
And just to be clear, because of
the backslash n, if I type in like,
Stelios, it's going to print S-T-E-L and
so forth, each character on one line.
So this is sort of week one,
week two stuff at this point.
But today, just realize there's
an equivalent to a fancier syntax,
but maybe just less intuitive.
If I open up string 1 now, notice
I, again, do the same thing.
Just get a string.
Call it s.
Make sure there's enough memory.
And then move on.
And what do I do here?
This is kind of funky.
So it's still the same
for loop from i to n.
But I threw away my
square bracket notation,
which we've been comfy with, perhaps,
for the past few weeks with strings.
And square brackets allow
me to index into an array.
And an array is just a chunk of memory.
But you know what?
Those have really just been
what a programmer would
call syntactic sugar, which is
a really weird way of saying
they're like a feature of the
language that just kind of simplify
aesthetically a fundamental idea.
And today's fundamental idea is
that characters have addresses.
And therefore, string is just
the address of a character
or the first character of a string.
And so what could this possibly mean?
This is an example of a fancier
technique called pointer arithmetic,
which, no surprise, involves pointers
and arithmetic, adding them together.
But what am I doing?
Well, just to be clear, in this
example, what is s on line 19?
AUDIENCE: It's the address [INAUDIBLE].
DAVID MALAN: Good.
It's the address of the
string that the user typed in.
So if it's Stelios, it's
the address of capital S,
or if it's David, a capital D,
or whatever the character is,
s is the address of the
string or more precisely
the address of the first
character in the string.
I is what in general?
It's just an integer.
It starts at 0, because of my
for loop, and it goes on up
to the length of the string.
So it's 0, then 1, then 2, then 3.
And what do we know about strings?
Strings are sequences of
characters back-to-back-to-back.
And every character is 1 byte.
So beautifully, if you think about
what's going on underneath the hood,
if this is location 400, I don't
even need to write the other numbers.
You intuitively know that this
is location 401, 402, 403, 404.
It's just arithmetic to
get from one to the other.
And so it stands to reason that if you
know that, you can kind of throw away
the training wheels or the syntactic
sugar of square brackets even,
and just say, you know what?
I want to print out the character
that's at location s plus some offset,
so to speak, plus 0 more bytes
or 1 byte away or two bytes away.
So I just do s plus i.
S is the address.
I is a number from 0 on up.
So it just kind of moves the
address from left to right.
But there's one funky syntax.
And this is by far one of the most
annoying design decisions in C,
especially when you're
first learning this stuff.
There's another damn star here.
And the star means something
different based on the context.
So notice, previously
when we used this star--
well, previously, previously,
star meant multiplication,
and life was much simpler.
Today star means it's
an address of something.
But there's two different
contexts in which
a star is going appear moving forward.
Here is where we're
declaring a variable.
So any time you see a
star that's clearly not
being used for
multiplication and instead is
something fancier, like
this, if there's a data
type to the left of the star, that
means you are declaring a variable.
Specifically, a pointer variable that
points to that type of data type,
like, on the address of a character.
If, though, in another context, you see
the star used in a weird way-- like,
it's not multiplication, because there's
nothing to the left of that star.
So it's not grade school math here.
It's just a star.
And it seems to be kind of related
to the parenthetical expression that
follows, s plus i.
So s is an address.
I is a number.
S plus i is, therefore,
just a different address.
So the star operator in this
context just tells the computer,
go to that address.
Follow the treasure map and go to where
X marks the spot, where X is just s
plus i, in this case, the address.
So when you see the star,
when it's not multiplication
and there's no data type to the left,
it means, don't give me a pointer.
Go to that pointer.
Go to that address.
And so this for loop now is
equivalent to the program
we saw just a moment ago, even
though it's kind of crazy-looking.
Which one is better?
I mean, honestly, when I
wrote code, when I write code,
I still generally write it this way.
The syntactic sugar is nice.
It's just a little easier to read.
I don't have to do
arithmetic in my head.
And frankly, it sort of just reads
more intuitively than this version.
But both are correct.
And among today's goals are to reveal
why this other fancier version is also
correct.
Let me introduce one
other example that kind of
takes things in the opposite direction.
So all this time, we've
had these training wheels
of GetInt and GetString.
And frankly, that's just because
in C, it's really annoying
to get input from the user.
C does not make it easy, especially if
you want to add error checking, right?
If you guys have used
GetInt or GetString,
if you use GetInt and you don't
type a number, you type a word,
it's going to reprompt you and
reprompt you and reprompt you.
So there's a lot of,
like, those features
that are just nice in the first few
weeks of an intro CS class, when
it's hard enough to get the parentheses
and the semicolons right, it'd
be nice if at least when you ask
for an int, you get an int back.
And that's why we have
those training wheels.
But if we take those off,
and take away even today--
we're not going to yank it
away, but if you sort of
decide to let go of the CS50
library, we have to come up
with other ways of getting user input.
So let's use today's
ideas to do exactly that.
It's a super small program.
It's a little cryptic-looking.
But let's see how you can get an int
from the user without using GetInt.
Here is one way.
Hey, computer, give me a variable
called x in which to store an int.
So that's, like, week one stuff.
Just give me a variable.
No magic.
This is also week one.
Just print out x.
And then down here,
print out a prompt for x.
And then down here, just
print out whatever is in x.
So the only new line is
this one here, scanf.
Scan means to kind of
read from the keyboard.
Like scan whatever the user typed in.
F means formatted.
And that means you can read a
certain type of input from the user.
So %i, recall, has
generally meant integer.
So scanf borrows some
of the syntax of printf,
but it's kind in the opposite direction.
Printf goes out to the screen.
Scanf comes in from the keyboard.
But dang it, if there isn't
this one new piece of syntax.
But you can maybe reason
through what's going on here.
So x is an integer, but we
want to store something in it.
And up until now, any time we wanted
to store something in an integer
or whatever variable, we just used the
assignment operator from right to left.
And that's exactly how GetInt works.
But GetInt, underneath the
hood, uses other techniques
that we haven't yet seen.
And what could be implemented as, though
it's actually even fancier than this,
it could be implemented using scanf.
The way scanf works is this.
Scanf takes in a format
string like this,
which just tells the function
what type of data to expect.
And then you pass in not a variable,
but the address of a variable.
So the ampersand means get
the address of some variable.
Why is this useful?
Well, when you just do int x, like
here, like we've done for weeks,
you just get a chunk of memory.
A ampersand x figures out
in memory where that is.
And maybe it's at address 500, and
so ampersand x literally returns 500.
So scanf then, as input, takes this
format string, which just tells it
what kind of data to read, and then
it takes the address of a variable,
because scanf's purpose in life is
going to be to go to that address
and put at it whatever
number the user typed in.
Now, why is this necessary?
Why in the world can
I not just do scanf x?
Well, it's related to a
problem we ran into earlier.
Recall that when Kate came up and did
the swapping of those two variables,
we claimed that logically
she was correct.
And we claimed or I
claimed that my code was
similarly correct or at least a decent
translation of what she did here
in reality.
But this did not work.
It still left one in x and two in
y, and never actually changed them.
Well, why is that actually the case?
Well, let me go into noswap in the IDE.
Let's go ahead and do this.
Let me go ahead and use my old
friend eprintf and say a is %i.
And then enter a there.
And then here I'm going
to go ahead and say b--
actually, let me do it in one line
just to kind of simplify the code.
So a is %i and b is %i.
And I'm going to pass in a and b here.
I just want to see inside of swap
if Kate's logic is actually correct.
And maybe a and b are being swapped,
but maybe somehow x and y are not.
So let's actually see
what's going on here.
So if I go ahead now and run
make noswap to compile it.
Oh, and I got to include CS50's library.
Include cs50.h.
Let me go ahead and compile this.
Make noswap ./noswap.
Let's see what actually happens.
So x is 1.
Y is 2.
And at the end of the
story, x is 1, y is 2.
That was the problem.
Now I'm digging in a little deeper.
And I'm looking inside of noswap.
So at noswap line 20, I
printed out a is 1, b is 2.
Oh, my god.
Kate was right.
Her code works, but it
doesn't work lastingly.
It works within the
function, but it seems
to have no effect on passing in x and y.
So it's kind of like
she swapped it onstage.
But as soon as she
stepped offstage, they
were kind of back the way they
once were, which is just weird.
So what's actually happening?
Well, it turns out that when we
talk about a computer's memory,
we have this layout here.
And there's different areas of memory.
And the two we're focusing on today
are just these, stack and heap.
And all this time when we've
used malloc to get more memory,
that memory's been coming
from the so-called heap, which
you can think of as, like, the
top of your memory right now.
Though technically, you can
think of it as the bottom.
But it's from a specific area
of your computer's memory.
The stack, though, is used differently.
The stack is a region of memory
that any time you call a function,
the memory for your local
variables comes from the stack.
And you get a sliver of memory
from the so-called stack,
just like you would get a tray off of a
stack in a cafeteria or a dining hall.
So what does this mean in code?
Well, if we look at the code
for noswap, and we see here
the code as follows, that
main declares x and y
and takes no command line arguments, but
swap takes in two arguments, a and b,
and has one other variable called temp.
Where is that memory actually going?
Well, you can think of your
computer's memory again
as having these two
regions, heap and stack.
And previously, malloc was
taking memory from up here.
Now let's focus on the bottom
of this region of memory,
which we can think of
as this here stack.
So this is, like, where all
the trays go in the cafeteria.
And it turns out when you
run a program like noswap,
this region of memory at the very bottom
is where any of main's local variables
go.
And to be clear, what local
variables does main have?
What are they called?
Just x and y.
So that means somewhere
in this sliver of memory,
there's space for something called x and
there's space for something called y.
And I, the programmer, put
the numbers 1 and 2 there.
So that's what my computer's
memory looks like in this program.
But when swap is called, much like
I called Kate up onto the stage,
she technically gets
her own memory space
for any work she's actually doing.
And so swap gets its own
slice or, technically,
frame of memory, tray of
memory, in some sense.
And in there go any of hers or our
swap functions, local variables, which
are called what again?
A and b.
So a and b.
And one more?
Temp, a third one.
So swap in its stack
frame, more precisely,
has room for three local
variables, a and b and temp.
So what, though, happens
when main calls swap?
Any time you pass a value, an input
to a function, that receiving function
gets a copy of the value.
So as soon as you call swap, x gets
copied into a, y gets copied into b.
And then when Kate finally
used the empty cup or swap uses
the temporary value, some other value
gets copied into temp, like 1, in order
to keep that value of a around.
So when Kate then swapped
the Gatorades, and when
swap does its thing equivalently
in code, it does actually work.
1 becomes 2 and 2 becomes 1.
But as soon as swap returns, what
happens to that memory perhaps?
It just gets thrown away.
The stack, as the name
suggests, is constantly growing,
but then it's constantly shrinking
as soon as these functions return.
So as soon as swap is done doing
its thing, and as soon as Kate
walks off the stage, it's
as though she was never here
to mutate or change x and y,
because all the work she did
is just kind of ignored thereafter.
And that's because functions don't have
access to each other's memory spaces.
They are only getting copies
of what's actually passed in.
So what then is the implication
or what is the solution here?
Well, let me go ahead and open up
swap.c, which, as the name suggests,
actually does do a swap, and
show how I've changed this.
So in main, I still
declare x and y as 1 and 2.
I changed, though, line 13 to use
ampersand, which again means what?
AUDIENCE: [INAUDIBLE]
DAVID MALAN: Not go to that address.
The opposite.
AUDIENCE: Go find.
DAVID MALAN: Find the address or get
the address of x, get the address of y,
and then pass those in to the function.
Meanwhile, and frankly,
it's understandable
if this is kind of visually
overwhelming syntactically,
because of all the damn stars, but
what is swap being declared now?
It's not taking an int.
It's not taking a second int.
It's taking the address of
an int and calling it a.
And it's taking the address of
another int and calling it b.
Different idea.
Then most confusingly of all, but
it'll take time with practice for this
to sink in, this is
implementing now what we want.
This is just storing in
a local temporary value
whatever the contents of a are.
Because star, as you said a
moment ago, albeit prematurely,
means go to the address in a.
And that's going to give you the
number 1, because 1 is at that address.
So put 1 in temp.
This means go to the
address in a again and put
at it whatever is at the address in b.
So take the 2 and put it at
whatever a is pointing at.
And then the final switcheroo is
go to the address in b and store
quite simply the number 1.
So what does that mean for our picture?
Because the syntax, frankly, does
not make this all that much fun.
For our picture here,
if we consider what's
actually happening
underneath the hood now.
Let me just kind of
tidy this up, and then
go back to a cleaner
chunk of memory here.
Temp is going to be the same as before.
It's going to store just an int.
But a and b are fundamentally different.
And I don't quite know what
to draw there, because I
need to be a little more precise.
So let's go back to our original
example, where maybe this is byte 100,
maybe this is byte 101, and so forth.
So what goes in a, to be clear,
in the correct version of swap?
It takes the address of an
int, which would be, like, 101,
and then it takes the
address of an int--
sorry, sorry-- 100, and
then it takes the address
of the other int, which
would be 101, which are not
the numbers we want to
swap, but they're, like,
two treasure maps that lead to
the original x and the original y
in someone else's frame on the stack.
So a function can only change
another function's memory
if you give it a map to
that or those locations.
And so now because of all the
crazy uses of star in our code,
we're going to the address 100, which
means go here, and grabbing its value.
We're going to the address in
b and getting its value here,
moving them around so that
these values don't change.
But when swap is called
in this version, 1
becomes 2 and 2 becomes 1, because swap
is a little more sophisticated now,
and it's like using these
stars to follow a treasure map.
Go to this address, go to that
address, and move those values around.
And as complicated, honestly, as
this might feel or actually be,
realize that it boils down to
just these two primitives--
get me the address of
something, with ampersand,
and go to the address
of something using star.
They kind of undo each other.
And just using those
two new ingredients,
albeit with some strange-looking syntax,
can we start to go really anywhere
we want in a computer's memory.
So if we go back now to
scanf, does it perhaps
make more sense as to why scanf
needs not x, but the address of x?
If we just passed an x to scanf,
that would be like saying,
here is an integer.
Do whatever you want with it, because
I'm not going to see the difference,
just like with the
original no swap version.
But if I say, here is the
address of an integer, swap--
sorry, scanf can go to that address and
put any number it actually wants there.
And similarly-- let's do this.
scanf(1).
The world was so simple
just a moment ago,
but I can break my world very
quickly with this example.
Here is maybe an incorrect
implementation of getString recall
that getString gets a
string from the user
and returns the address
thereto, as of today.
But here are four lines of code that
are just a translation of the integer
example to strings.
But strings are different.
This is saying, in line 7, give
me a pointer to a character.
Give me space for the
address of a character.
This is just saying, hey, printfs colon.
scanf(%s)-- % feels right, because
that's what printf would use--
and pass in s, and then
print out whatever s is.
But there's a problem here.
s is indeed a pointer already,
so I don't need the ampersand.
s is an address.
That's all scanf expects.
But if I say to scanf,
here's the address
of a pointer which happens to be
called s line 7, what is that really?
What is at that address?
Say again?
AUDIENCE: [INAUDIBLE]
DAVID MALAN: It's not even
that, because a pointer.
char(*s), is just the
address of a character.
But what goes in variables
by default in a program?
Garbage values.
So this is some random address
is in the variable called s.
So it is the address, but god knows
where that address is in memory.
So this is like giving scanf a map
to some random location, some garbage
value.
And his purpose in life is to
go there, wherever that is,
and change the contents by putting
whatever characters the human has
typed in.
So this is very bad,
because what I've not done
is build in space for the actual
characters in Stelios' name or Maria's
name, or mine.
For that, what was the--
how can we actually get it that?
Well, let me open scanf 2--
also online to look at later--
and how have I mitigated
this, at least in part?
Well, in this case, I'm saying mm-mm.
s is going to be an array of
characters, specifically five of them.
And it turns out, because of
the syntactic sugar, really--
and technically a more
sophisticated feature than that--
I can pass in the name
of an array to scanf.
Or any function that expects
the address of something,
I can pass in the name of an array,
and it will be treated as a pointer.
And so scanf will now put at that
address whatever the human types in.
But there's a danger.
What could go wrong given that logic?
Yeah.
Exactly.
If you type in something long,
something bad is going to happen.
When I declare s with char s bracket
[5] close bracket, that means,
literally, give me an
array in memory that
can hold five things, that
therefore looks like this.
s, when passed into
scanf, it's like saying,
OK, there's the address
of that chunk of memory.
But if we type in Stelios'
name at the keyboard,
and the programmer has only
allocated enough memory for five
total characters, scanf's not
going to know the difference,
because there's no backslash
0, there's no fanciness.
It's not yet a string.
It's a chunk of memory.
And so o and s and the
backslash n are going
to end up in no man's land,
memory you did not ask for.
Maybe you're tripping over Maria's
name, or something else altogether.
And this is when computers crash.
You have an overflow of the
buffer, or the chunk of memory,
that you've actually allocated.
And we can see this a little more
playfully thanks to a friend of ours
at Stanford, Nick Parlante, who has
spent quite a bit of time, I think,
with claymation.
And so in just a moment--
you're going to see the screen--
Nick Parlante at Stanford put
together the following claymation
that shows us the little world
of Binky, and takes a look
at some C lines of code and translates
them into some claymation fun.
[VIDEO PLAYBACK]
- Hey Binky, wake up.
It's time for pointer fun.
- What's that?
Learn about pointers?
Oh, goodie.
- Well, to get started, I guess we're
going to need a couple of pointers.
- OK.
This code allocates two pointers
which can point to integers.
- OK.
Well, I see the two pointers, but they
don't seem to be pointing to anything.
- That's right.
Initially, pointers
don't point to anything.
The things they point to are called
pointees, and setting them up
is a separate step.
- Oh, right, right.
I knew that.
The pointees are separate.
Er, so, how do you allocate a pointee?
- OK.
Well, this code allocates
a new integer pointee,
and this part sets x to point to it.
- Hey, that looks better.
So make it do something.
- OK.
I'll dereference the pointer x to
store the number 42 into its pointee.
For this trick, I'll need my
magic wand of dereferencing.
- Your magic wand of dereferencing?
Uh, that-- that's great.
- This is what the code looks like.
I'll just set up the number, and--
[POP]
- Hey, look.
There it goes.
So doing a dereference
on x follows the arrow
to access its pointee, in this
case, to store 42 in there.
Hey, try using it to store the number
13 through the other pointer, y.
- OK.
I'll just go over here to y
and get the number 13 set up,
and then take the wand of
dereferencing and just--
[KLAXON]
Woah!
- Oh, hey, that didn't work.
Say, Binky, I don't think dereferencing
y is a good idea, because, you know,
setting up the point is a separate
step, and I don't think we ever did it.
- Mm.
Good point.
- Yeah, we allocated the pointer y, but
we never set it to point to a pointee.
- Hmm.
Very observant.
- Hey, you're looking good there, Binky.
Can you fix it so that y points
to the same pointee as x?
- Sure.
I'll use my magic wand
of pointer assignment.
- Is that going to be
a problem like before?
- No, this doesn't touch the pointees.
It just changes one pointer to
point to the same thing as another.
- Oh, I see.
Now y points to the same place as x.
So wait, now y is fixed.
It has a pointee.
So you can try the wand of
dereferencing again to send the 13 over.
- Uh, OK.
Here goes.
- Hey, look at that.
Now dereferencing works on y.
And because the pointers are sharing
that one pointee, they both see the 13.
- Yeah, sharing, whatever.
So are we going to switch places now?
- Oh, look.
We're out of time.
- But--
- Just remember the three pointer rules.
Number one, the basic structure
is that you have a pointer
and it points over to a pointee.
But the pointer and
pointee are separate,
and the common error
is to set up a pointer
but to forget to give it a pointee.
Number two, pointer dereferencing
starts at the pointer
and follows its arrow over
to access its pointee.
As we all know, this only works if there
is a pointee, which kind of gets back
to rule number one.
Number three, pointer
assignment takes one pointer
and changes it to point to the
same pointee as another pointer.
So after the assignment,
the two pointers
will point to the same pointee.
Sometimes that's called sharing.
And that's all there is to it, really.
Bye bye now.
[END PLAYBACK]
DAVID MALAN: So in a moment,
we're going to transition
to one of our domain-specific
problems, namely forensics
and the art of recovering information
that may have accidentally
or deliberately been deleted.
But those of you in Elliott House might
remember this shot of the back right
corner of the Elliott House dining hall,
one of the undergraduate dining halls
here.
And this is where I was, like, 20
years ago, when I finally got pointers.
The key point being, it was
not in the lecture hall,
and it was not the first time I sat down
with a problem set related to pointers.
Because this topic in
particular, in CS 50,
and specifically in this language C, is
definitely tough for a lot of people.
And if you've totally followed
everything thus far, that's great.
You're already ahead of where I was.
But more so than most any
other topics in this course,
rest assured that if it takes a little
while for this to sink in conceptually,
and for it to kind of come out of your
fingers syntactically, that's normal.
Or at least if I'm normal, it's
normal, because that was precisely
my experience.
And I mean it.
It sunk in literally in this
location, when my teaching
fellow at the time, Nishat Meda--
who was a year or two
older, I think a senior
in Elliot House, which
is why we were there--
he was walking me through it again,
maybe the second or the third time.
And then finally, that
proverbial light bulb went off.
And so keep in mind, more so
than most any other topic,
that that here too might
be your experience as well.
So here, of course, is Zamyla,
one of our senior-most staff
who appears in the course's
walkthroughs for various problems.
And you'll notice that it seems
to be a pretty high quality
photograph, or frame from a video.
But it seems to be the case
that in Hollywood, especially--
you can always do what's
called enhance, which
is the verb that, whenever you want
to find out who done some crime,
you enhance the image,
and there he or she is
in the glint of someone's eye or
contact lens, or something like that.
So this affords us an
opportunity, nicely enough,
to continue along a
trajectory that we've
been on the past few weeks,
whereby with problem set two
we looked at the underlying
representation of strings
via cryptography and Vigenere
and Caesar and so forth.
And then this current week, you're
focusing on music and the underlying
representation of sounds.
And this coming week, we focus
on the underlying representation
of imagery, of which this is
of course just one example.
But it's worth noticing that if
Zamyla is suspected of something,
and we decide-- or rather--
we've got to spin it the right way.
If Zamyla witnessed something, and
we want to see the glint of her eye
who done it, this is about
all you're going to see,
because just as in our
computers, where we
have a finite amount of RAM or
hardware, so in an image is there
a finite amount of information
in the zeros and ones that
compose the GIF, or
the PNG, or the JPEG,
or whatever the file format actually is.
And indeed, if you zoom up closely
enough on this photograph of Zamyla,
you will see the
so-called pixels, or dots,
because an image is just a bunch of
rows and columns of dots, each of which
is colored.
And I'm hard-pressed to imagine the
reflection of any crime being committed
based on the glint of someone's eye.
And yet, nonetheless, if we could
dim the lights for a few seconds,
there are such claims
in Hollywood as these.
[VIDEO PLAYBACK]
- He's lying.
About what, I don't know.
- So what do we know?
- That at 9:15, Ray
Santoya was at the ATM.
- The question is, what
was he doing at 9:16?
- Shooting the 9
millimeter at something.
Maybe he saw the sniper.
- Or was working with him.
- Wait.
Go back one.
- What do you see?
- Bring his face up.
Full screen.
- His glasses.
- There's a reflection.
It's [INAUDIBLE] Vita's baseball team.
That's their logo.
- And he's talking to
whoever's wearing a jacket.
[END PLAYBACK]
DAVID MALAN: OK.
So that's not happening.
This is all that is in the information.
And so this is a wonderful opportunity,
in a pretty cool-sounding and actually
cool world of digital forensics, for us
to bridge a couple of our worlds now.
Up until today, we've
really only had arrays
as our only data
structure, the only way we
could lay out more than just
individual types of memory.
But once we have arrays, that allowed
us to start talking about strings.
And it turns out, strings can be
represented really as just addresses.
And so that now that
we have addresses, we
can kind of stitch together any kind
of structures we want in memory.
Not everything has to
be back to back to back.
And this means we can now start to think
about how we would store information
permanently on disk, on your
hard drive of a computer, how you
can solve problems more efficiently.
And we'll explore that by
first considering how you
represent even information like this.
Well, let's try to take away some of the
complexity of Zamyla's photograph here,
and just boil her down to
this happy smiley face.
This is perhaps the simplest image
we could create using zeros and ones,
because if you think of white dots
as just being represented by ones,
and black dots as being represented
by zeros-- or vice versa,
it doesn't really matter--
you could construct an image from
a grid of bits, zeros and ones,
if you just interpret them
and present them on the screen
according to some pattern
or rule of thumb like that.
And here, then, is an image.
It's got rows-- or scan lines,
as they might be called--
and columns that represent the
individual dots, or a monitor's
interpretation of those zeros and ones.
Now, Zamyla, of course, has many more
dots than just the couple of dozen
we saw on the screen before.
But it's the exact same
idea, because, indeed,
if we zoom in even on the finest
detail of Zamyla's photograph here,
do we actually see the actual dots.
But instead of just
using an individual bit,
one and zero-- because if you've only
got one bit per pixel, or per dot,
you can pretty much do no
better than two colors,
black and white, red or blue,
or whatever two colors you want.
You got to pick something.
And that's when you have very
limited colors, or black and white.
So in Zamyla's case here,
turns out this is a JPEG.
Pretty common for
photographs and the like.
Typically as many as 24
bits are used for every dot.
And with 24 bits, you
can express any number
of millions of numbers,
which means you can
get light blue, dark blue, hot pink,
purple, and any number of colors
in between from the
rainbow, because you have
so much expressiveness
with that many bits,
way more, certainly,
than zeros and ones.
Turns out, JPEGs are a little special.
Any JPEG, at the end of the day,
is just a file on your hard drive,
or an attachment in an email.
And a file is just a
pattern of zeros and ones.
But what does it mean to be a JPEG?
What does it mean to be a Word
document or an Excel file?
Well, generally, some
human or some company
decided that a file that
is a JPEG shell always
start with the same few bits,
the same pattern of bits.
Or maybe, equivalently, a JPEG shall
always end, at the end of the file,
with the same pattern of bits.
Or with something like
an Excel file, there
is something that Microsoft decided
that they decided how to represent rows,
how to represent columns in the file.
And the only way you
can write software that
reads Excel files is if
you read the documentation
Microsoft wrote so you know how
to interpret zeros and ones.
Remember, everything
is context-sensitive.
In another context, these
zeros and ones might
be like a secret message
written in ASCII,
if you interpret these as ASCII codes.
But if you opened this instead
in Photoshop or something,
maybe those same zeros and
ones are presented as an image.
But it turns out, then, if you
open a file that is of type JPEG--
so something.jpeg, Zamyla.jpeg--
the first three bytes of that
file, by definition of a JPEG,
per its documentation, per the Wikipedia
page, are that the first three bytes--
or the first 8 plus 8 plus 8, 24 bits--
should be these three
numbers in decimal.
255, 216, and 255.
Those are going to be zeros
and ones at the end of the day,
but it's annoying to
look at zeros and ones.
Decimal is a little more familiar.
So it's these three
bytes start every JPEG.
Now it turns out, in
the world of files--
which, again, will be one of the
domains for the coming week--
people don't really use
decimal, just because,
and people definitely don't
use binary, because it's
just way too hard for humans to
wrap their minds around typically.
They instead use something
called hexadecimal,
which is a little weird-looking,
but you've probably
seen it in various contexts.
Hexa means 16, in this case.
And you have literally 16 characters
in this alphabet, 0 through 9,
and when you can't count higher
than 9, you go to a, b, c, d, e, f.
So 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, a
is 10, b is 11, dot dot dot, f is 15.
And so that's hexadecimal.
But it's the same idea.
If we can do the whole columns and
ones place and 16s place and so forth,
same idea as week 0, but we
have a different alphabet.
Now just so you've seen it, particularly
for the coming week's problems,
know that those three bytes--
255, 216, and 255--
in binary happen to be this.
Now, what's nice about hexadecimal
is that each of those characters,
0 through 9 and a through f,
technically represent four bits.
Why four?
Well, long story short, if you have
four bits, that's this many patterns.
2 times 2 times 2 times 2.
That's 16, ergo hexadecimal.
So long story short, hexadecimal
is just super convenient
because you can take 8 bits,
or a byte, and separate them
into two hexadecimal values,
each of which is four bits.
So with that said, how
do I represent these--
1111, I said it's 15, and
again, if we do the week 0 math,
1111 will equal 15, or f.
Meanwhile, 1101 will equal d, and
1000, per week 0, is 8, and so forth.
So this is just a very methodical way
of taking decimal and presenting it
instead as hexadecimal.
And humans decided years ago
that it's not always obvious
when you're looking at hexadecimal
that it is hexadecimal.
So if humans decided that any time we
humans write something in hexadecimal,
we're usually going
to prefix it, whatever
we're writing, with 0x, just because.
It's like a visual cue to, hey
human, here comes hexadecimal.
So with that said, these now
are the first three bytes
of any JPEG in a file.
This is relevant, because among
the problems for the coming
week are going to be this.
We are going to give
you what's generally
thought of as a forensic image, a copy
of a digital camera's memory card,
on which are 50
photographs, all of which
have somehow been accidentally
or maybe deliberately deleted.
But it turns out, because each of
those 50 photographs are all JPEGs,
and therefore all of them start
with the same pattern of bytes,
if you have a mechanism for
reading over that file's zeros
and ones, one byte at a time, you
could notice, ooh, I just saw ffd8ff.
This is probably the start of
one of the 50 lost photographs.
Let me actually extract that
and a whole bunch of bytes
afterward, and save them
to a temporary file.
And the next time I see
this pattern, ooh, maybe
that's the second photograph that's
been lost on the digital camera.
Let me write that out to a file.
And so among the coming challenges is
going to be to reintroduce you to JPEGs
and have you recover 50
photographs that might have been
accidentally or deliberately deleted.
But there's other graphical
file formats in the world,
and one of the earliest
common ones was bitmap, BMP.
And as the name suggests-- it's kind
of a perfect name for a graphic file,
because it's a bit map, a map of
bits, like the ones and zeros that
compose Zamyla's black and white
happy face just a moment ago.
With the bitmap file format--
you might be familiar
if you ran Windows XP for
some time, if you grew up
with this loading up on your screen.
This was a bitmap file, called
a wallpaper more generally.
And bitmaps clearly are
supportive of color.
Some of them are bits per pixel.
And what's nice about bitmaps is
that they're a relatively simple file
format.
And I say relative because you'll
see there's some complexity,
but they're way simpler
than JPEGs actually are
beyond those three knows characters.
As an aside, I thought
it's always fun to take
a look at what this beautiful field
looks like some 20 years later.
That is the same thing that you might
have grown up with on your wallpaper
instead now.
But here is-- at first
glance overwhelming, but in
the problem set more
methodically presented,
what's an example of a file structure.
So what does it mean to be a file?
At the end of the day, every file
on your computer is just a thing
containing zeros and ones.
But again, those zeros and
ones are laid out in patterns.
JPEGs, for instance,
start with 255, 216, 255.
Microsoft Word files,
Excel files, anything else,
is going to start with
different patterns as well.
And this is the formal way, from
Microsoft's own documentation,
for what a bitmap file looks like.
And we won't dwell on the
low-level details here in class,
but long story short, this documentation
tells us that the first couple of bytes
are something called a
word, and the next few bites
are something called a dword,
or double word, and so forth.
So long story short, if you are
an aspiring programmer and you
want to write a piece of software
that reads and writes, opens and saves
graphics files, you have to
read documentation like this
and understand in what order
you read zeros and ones in,
and what order you write them out in.
Because if it's just a whole
bunch of zeros and ones,
you don't know what the
heck you're looking at.
You need to reference documentation like
this to know that, oh, it's a number.
Oh, then it's a character.
Oh, then it's a whole word.
Oh, then it's a bunch of red pixels
or green pixels or blue pixels.
And indeed, if you look
at the bottom here,
what's nice enough about this file
format is that once you go through some
of these header fields,
as they're called--
more on that in the problem itself--
you get to something a
little more familiar.
If you grew up ever hearing an
expression RGB, red, green, blue--
using those three colors, you can
combine them like waves of light,
or paint if you will, into any number
of colors of the rainbow by mixing
in a little bit of red,
a lot of blue, no green,
just like we did a few
days back when we looked
at that weird representation of yellow
in the context of a graphics program.
But what's in a bitmap
file at the end of the day,
wonderfully, is a whole
bunch of our RGB triples,
which essentially say top left pixel on
the screen will be this amount of red,
this amount of green,
this amount of blue.
Then next in the file is whatever the
color is for the second pixel, and then
the third pixel, and then
the fourth pixel Essentially
a row, a row, a row that
compose the actual image.
And it's a little fancier than
that, but that's the essence.
It's a whole bunch of bits that
represent some color of the rainbow.
But unfortunately, we
haven't seen any way
of dealing with a structure like
this, because all we have arrays.
We have no data type called file yet--
until today.
So it turns out that there is
a keyword in C called struct,
and as the name suggests,
it gives structure to data.
And indeed, one of my
claims at the start of today
was, we want to eventually have
data structures in our toolkit.
Arrays are not sufficiently powerful
to solve real interesting problems
real effectively.
So it turns out, there is no data
type in C for the notion of a student.
There's int, and there's
floats, and there's chars,
and there's kind of sort of strings,
but apparently not even those exist.
Those are just char stars.
But wouldn't it be nice if
implanting a piece of software
that your friends can use
online, or that the registrar can
use to register students, you actually
could declare your own data type
called student?
And indeed, you can.
And over the course of this
problem set and others,
we will introduce syntax like this.
There's a keyword called
typedef and struct
that together let you invent, just
like in Scratch, your own puzzle
piece, or your own data-type
here, called student in this case.
And as you may imagine,
a student might have
associated with him or her a name, and
a dorm, and probably other fields too.
But this is a way of
inventing your own data type,
calling it something very
reasonable and obvious to you,
and including inside of it,
or encapsulating inside of it,
two other known data types, like
two strings or two integers,
or 10 integers, or any
number of other things.
And so simply by using
this syntax now can
we build up things that
aren't just arrays,
but have more meaningful data types
associated with them, all of which
are going to be grouped together.
And among the things
you'll see in problems
set four, moving
forward in its problems,
is an ability to not only declare
data structures like this,
but also read them from
disk, read them from files,
and then write them to files as well.
And indeed, what you'll
do in one of the examples
as well is manipulate bitmap files,
take one as input and then resize it.
Make it much bigger, but
scale it all proportionately.
Or maybe shrink it,
and if you shrink it,
figure out what
information you throw away
and what information
you actually retain.
And in one third problem,
still, who done it,
you'll be presented with
an image with a whole lot
of random noise, a lot
of red speckles, much
like you might have seen in a
child's puzzle book years ago.
And only if, in the real
world, if you hold up
one of those red cellophane
pieces of plastic, can
you kind of see the
magic image behind it.
You'll implement the
notion of that red filter
to reveal who done it in
the context of that example.
But I thought we'd end today
not only with that teaser
but with also this reassurance
that perhaps certain shows get
these kinds of details
more right than others.
[VIDEO PLAYBACK]
- Magnify that death sphere.
Why is it still blurry?
- That's all the resolution we have.
Making it bigger
doesn't make it clearer.
- It does on CSI Miami.
[SIGH]
[END PLAYBACK]
DAVID MALAN: All right,
that's it for lecture four.
We will see you next time.
