DAVID MALAN: Hi, everyone.
This is CS50's own David Malan and--
COLTON OGDEN: Colton Ogden.
DAVID MALAN: So this is
a new approach for us.
We, of course, have lectures
once a week here in Cambridge.
And of course, we've done a number of
live streams over the past few years.
But we thought we'd try something
new with our Twitch channel here.
Colton is a big fan of Twitch.
He's quite the gamer himself,
and a programmer of games.
And we thought we'd use this as a more
interactive time to actually chat with
some folks in the Twitch chat room, and
actually walk through some code in real
time-- live coding--
to get into more depth
on some of the things
that we might otherwise talk about
in class, and then, ultimately,
take people's questions and steer the
conversation in any direction folks out
there would like to go.
COLTON OGDEN: Yeah.
It's a very flexible format.
Everybody who's in, feel free,
definitely, to chime today and chat.
If you aren't following
the CS50 TV account,
I believe there's a 10
minute waiting period.
So definitely follow us
so you can talk to us.
But I see we do have one
person, Srini Vasank says hello.
So hello, Srini Vasank.
Nice to see you here.
DAVID MALAN: Hello.
Nice to meet you as well.
COLTON OGDEN: So yeah.
I guess we can take a
little transition here
to your laptop where we
have your code there.
DAVID MALAN: Magic.
Yep, so we're actually sitting
here in Cambridge, Massachusetts
on Harvard's campus with a
magical green screen behind us.
And so we're superimposing
ourselves, the two humans,
which is why you see this
nice glow on both of us today.
COLTON OGDEN: I wish I had a
shot of just the green screen.
All I have is this.
DAVID MALAN: That's green, but
you see it as gray, perhaps.
And this allows us to superimpose
ourselves over the code,
like you've seen in some other
channels, so that we can actually
talk in greater proximity.
COLTON OGDEN: We do it for
lecture for the code it itself.
DAVID MALAN: We do.
We do this in Sanders Theater.
But the green screen's way behind.
And we use some camera magic to actually
capture it across the room as well.
COLTON OGDEN: Yeah.
DAVID MALAN: All right,
anymore hellos to tend to?
COLTON OGDEN: Not just yet, no.
Just Srini Vasank.
We have 10 viewers currently active.
DAVID MALAN: All right, wonderful.
Well, so glad to have you here in class.
It's like a CS50 section here on campus.
What we thought we'd do today is--
our first trial of this is-- take
a closer look at one of the topics
we look at in CS50.
Those of you have been
following along for some time
and are pretty much
through half of the course
or so might have implemented your
own spell checker at some points.
And for those unfamiliar, roughly
halfway during CS50's semester
here on campus and online,
we challenged students
to build their own spell checker, which
is a piece of software that looks over
all the words you've
typed out in a file,
and tells you, often with read
underlines, which words you've
misspelled, that aren't in a dictionary.
And that's where the CS comes in.
That dictionary is just a
really big list of words,
hopefully alphabetically sorted.
And it's a non-trivial
problem to actually look up
every one of the words in your file
against every word in the dictionary.
If you recall from our discussion
of asymptotic notation,
that's like an n squared problem,
if n is the number of words in each,
just to simplify a bit.
And that can actually take quite
a bit of CPU cycles and memory.
And so one of the challenges in
CS50 is, do that efficiently.
Minimize your use of memory.
Minimize your use of
CPU time, so to speak.
And just write the fastest
spell checker possible.
Now, what language, of course, do
we tend to write this in though?
COLTON OGDEN: Well we start it
in C, generally, for the course.
Yeah.
DAVID MALAN: And that's great.
Because C is super low level.
It's super fast.
And really takes
advantage of the hardware.
But what are some of the prices you'd
say you pay by writing stuff in C?
COLTON OGDEN: Programmer
time would be number one.
Obviously looking for memory
leaks, and seg faults,
and all those sorts of things
that trip up a lot of programmers.
DAVID MALAN: Yeah, those of
you programmed in C, maybe C++,
have dealt with pointers, memory
addresses, segmentation faults,
buffer overflow exploits, and the like.
All of that comes into play.
And so why don't we start there?
Why don't we actually take a look
at what it is you have had to do,
or if you're following along
with CS50 for the first time,
might have to do if you tackle this
particular problem down the road.
And then let's actually do it in a much
more friendly environment of Python.
But consider what prices we
actually pay for doing that.
So we have up here
behind us dictionary.h.
This is the so-called header file
that we give to CS50 students.
And in this header file,
we define four functions.
Do you want to summarize what
functions they have to implement?
COLTON OGDEN: Actually,
offhand, I don't remember.
I have to look at the source code.
DAVID MALAN: Oh, that's cool.
COLTON OGDEN: Sorry.
DAVID MALAN: We have it
right here if you'd like.
OK.
Well, we clearly prepped here.
COLTON OGDEN: It looks like check.
DAVID MALAN: Yeah, check
is one of them, isn't it?
COLTON OGDEN: Load size and unload.
DAVID MALAN: Yeah, now fortunately,
the file's completely commented.
So we can follow along here.
So these four functions characterize,
if you will, the API for spellchecker.
API is just application
programming interface.
And while this might sometimes mean some
web based service or third party you're
using, it can also refer to local
software that you yourself are writing,
or that someone else has started
writing and you now need to finish.
And indeed, the challenge for this
problem is to flesh out that API
and write the actual
implementation thereof.
All right, so in order of
operation, not alphabetically,
load is the first function that
students in this class have to write.
This is a function that
takes as input in C--
it looks like "const char* dictionary".
And how should folks think
about what that means?
COLTON OGDEN: Well, const
meaning that it can't be mutated.
So whatever data that they
pass into this function
should be not tampered
with by the function.
A char* is just a sequence of
chars or a pointer to a char,
which can be any number
of chars after that.
But it's a string, effectively.
DAVID MALAN: Yep.
I agree.
COLTON OGDEN: Text data that's
going to become their dictionary.
DAVID MALAN: And then the name
of this parameter, of course,
is just dictionary.
And so the idea is that this is the
name or the path to the file containing
all of those words that
I mentioned earlier
that represent the actual dictionary.
And load's purpose in life,
per the comment atop it,
is to load dictionary into memory.
And it's supposed to return a bool--
true if successful, else false.
For those unfamiliar, we
are using standard bool.h,
which is a header file that you can
include in C. That actually gives you
access to two definitions of
true and false as 1 and 0,
respectively, effectively.
So after load is called,
once implemented,
then it's the function called "check"
that's supposed to get it called again,
and again, and again for
every word in the file
you're actually spell checking.
So it's that one that's going
to be especially performant.
Super fast, super minimal
memory usage, hopefully,
if you're really trying to
optimize those two things.
What does it appear that
this takes as input to,
if you want to tackle that one?
COLTON OGDEN: The check function.
So it looks like the same thing.
So char*, meaning a pointer to a char.
So a variable length string,
or a sequence of chars.
And then words.
So it's basically going
to be any size word.
And we're going to, basically, check
for the string header representing
the signature, whether the word
is actually in the dictionary.
So you check our
dictionary for this word
and see whether it's actually inside
that dictionary, that data structure.
DAVID MALAN: Yeah.
And so strictly speaking,
you don't have to use const
in either of these definitions.
This is just a nice mechanism for really
protecting yourself from yourself,
lest you accidentally try changing
word or try changing dictionary.
The fact that you defensively-- or
whoever wrote this file-- defensively
put const there means, you
just won't accidentally
screw up, change the
value, and introduce
some bug that could be very easily
avoidable by making it read only,
as const effectively does.
So lastly, there's two other
functions in this file.
Per the comments, those
are size and unload.
Unload is meant to do
the opposite of load.
So whatever work we end up doing
in load is supposed to be undone,
freed up in unload.
And then size is supposed to just return
the number of words in the dictionary.
Now, maybe that's a linear operation.
You just iterate over
whatever your data structure
is, counting up every one of
the words and return that value.
Or if you're smart about
it, you could probably just
keep like an int or a long around,
and just, as you're loading words,
keep track of how many words there are.
And then just spit out in
constant time that same value.
So in C, this is a pain in the neck
to actually implement this thing.
You can implement it as a big array, a
linked list to give you some dynamism
and growth.
You can implement it as a hash
table, which is often implemented
as an array with multiple linked lists.
Or you can implement it using a try,
an even fancier data structure, that
just consumes lots of
memory, but theoretically,
is constant time in its operation,
at least asymptotically.
So we're not going to
do that though in C.
Indeed, that's one of the
challenges in CS50 itself.
And one of the great moments
in a course like this
is when you actually change
context, and you switch from C--
a very low level language-- to
something like Python, which
is higher level, so to
speak, that has lots
more abstractions, lots more
features, lots more capabilities built
into the language itself.
You can whittle down a 10, 20, 30, 40
hour project into truly just minutes
if not seconds, once you're actually
super learned in the language.
Shall we dive into
some actual live code?
COLTON OGDEN: Sure.
Before we do, Srini Vasank
has a question for you.
DAVID MALAN: Yeah.
So pass by value can be
used here instead of const.
Short answer, no.
You are already technically passing
in the dictionary by its value.
But the catch is that the value
you're passing in is its address.
And in so far as passing in an address
allows you to dereference that address
and go to that address in memory,
means that it's potentially
vulnerable to being changed.
And so we're declaring const
here for the very reason
that we're technically passing
by reference here, or by value.
But that value is an address.
And so this is why we have this
defense mechanism in place.
If we were instead passing in a
specific char, or an int, or a float,
or any other primitive type that
doesn't involve memory addresses,
then yes, pass by value avoids this.
Because the worst you can do is
change a copy of the argument.
But in this case, by definition,
we are passing in a string,
which is indeed a char*, or address of
a character, or a sequence of character.
So we can't-- it's not as
easily said as passing by value.
COLTON OGDEN: Correct.
DAVID MALAN: The values here
are what are [INAUDIBLE]..
COLTON OGDEN: I guess the equivalent
would be maybe duplicating
the data structure in the function.
But then that would be like
horribly inefficient for something
massive, like a big dictionary file.
DAVID MALAN: Yeah, you can
totally copy the whole dictionary.
But even then, you'd have this
window of a few lines of code
where you could still screw up and
accidentally change things just
by nature of making a mistake.
COLTON OGDEN: Yeah, that's true.
DAVID MALAN: Really good question.
So keep the questions coming if
you have any as we move ahead.
But let me go ahead and open up a file.
Let's call it dictionary.py,
and actually implement this API,
but using Python, and therefore
Python syntax and the equivalent.
And then if we have
time, we can take a look
at a new speller.py, which is a
porting or translation of speller.c,
which we haven't looked at yet,
but CS50 students do in the class.
That actually implements
the spell checker itself.
But the real intellectual work
and the data structure stuff
really takes place in dictionary.c,
or today, dictionary.py.
So let me go ahead and proactively
save this as dictionary.py
so we get our syntax highlighting.
And let me propose that we just
have place holders initially
for those four functions.
So again, we can't use
dictionary.py with speller.c.
And so this is just a porting of the
whole project to another language.
So in Python, you declare functions
using slightly different syntax.
In Python, you don't
specify a return type.
And you don't specify the types
of your arguments to functions.
But you do specify their names.
And recall that one of
those names was "check".
It does take an argument, in this
case, which we'll call "word",
just as we did in C. But you have
to define the function, which
in Python uses the def keyword.
And then of course, instead of currently
braces in Python, if you're familiar,
you instead indent--
instead of using those curly braces--
often with colons, implying that
here comes some indentation.
And I'm not ready to do this yet.
So I'm just going to
say "pass", literally
passing on implementing the function.
But we'll come back to that.
But that will get the IDE here.
We're using CS50 IDE, or
cloud nine, to just stop
yelling at us with some red marks.
And quite welcome for the
pass by value question there.
So let's go ahead and do two others.
Let's go ahead and define load.
And load takes in a dictionary.
Here, too, I'm just going to pass
for now on actually implementing it.
I'm going to go ahead and implement
size, which takes no arguments.
So you just specify open
paren and closed paren.
You don't need to say void
as you do in C. Let's pass
on implementing that for now.
And then unload, which
frankly, we're probably
not going to need at all in Python,
because memory is managed for you.
You might have to allocate
it, albeit implicitly.
But you're not going to have to
worry about managing it like in C.
So there's the API ported to Python.
And again, you'll notice no return
types and no parameter types.
Python's loosely typed.
It still has type strings, and
ints, and floats, and other values.
But it doesn't actually require
that the programmer, like us,
actually specify those things.
All right, so what do you think
we should do to think about this?
We have to implement a
dictionary for a spellchecker.
A dictionary, by
definition, is a whole bunch
of words that are passed
in in a big text file.
How would you go about thinking about
how to begin solving this in Python?
COLTON OGDEN: I think the
first thing I would want to do
is understand the dictionary
file so that we can parse it
and we can load it into the dictionary.
DAVID MALAN: OK.
All right, well so if we want
to parse the dictionary file,
it's nice that we're actually in Python
because it's even easier in Python
than in C, where you have to really
think about how long your lines are
and how many bytes of
memory you might need.
So let me go ahead and go up there.
Let's implement load first, or
at least the beginning of it.
And if I want to open a file
in Python, it's actually
as simple as saying Open the file
name-- dictionary being the file name--
and then you can specify, do
you want to read the file,
or maybe read and write the file?
And so just as in C, with the
F open function, if you recall,
you can say quote unquote,
"read", which just says,
hey Python, open the
file called "dictionary",
or whatever that value actually
is, and then read it for me.
But this is going to return,
essentially, a file handle to the file,
so to speak-- kind of like
a reference there, too.
So I want to keep that around.
And I'm just going to declare a
variable on the left called "file".
But I could call it anything I want--
F or whatever-- and assign that
the return value of "open".
Notice, I don't have to
specify the type of file.
It's going to return--
it's going to store some
kind of file object.
But I don't need to worry about that.
And now in C, you would probably proceed
to iterate using F read, or F scan F,
F get as, or F get C, or any number
of other file I/O reading functions.
But in Python, there's these
higher level abstractions, where
if you know that text file just
contains word after word after word,
you can just say as much in
Python by literally saying,
you know what, for each line in
the file, go ahead and do this.
And so just as I kind of
rattled that off verbally, so
can you type it in Python.
And it's a lot more English
like, if you will, in that sense.
Now, I don't really know what we
want to do with this word yet.
So let's just say for now, store
word in some data structure.
Because that's kind of an interesting
design question to come back to.
But once we're done
doing that, we're just
going to go ahead and
call close on the file.
Or rather-- I'm using C syntax already--
we're going to call file close.
And here's where Python
is also object oriented.
The file reference that came back itself
has functions or methods built into it.
And if you look at the
documentation, one of those is close.
So file.close actually
closes that specific file.
And then after that, let's
assume for the sake of discussion
that everything went well.
So I'm just going to go ahead and
return true, here in this case.
So returning true just
signifies everything is good.
I do seem to have an error here, but
that's only because I haven't actually
done something in this for loop.
So let me also just say pass
on implementing that for now.
And that error should go away as well.
COLTON OGDEN: There's
an even cleaner way
to do what you've done
with the file, right?
DAVID MALAN: Oh yeah?
COLTON OGDEN: Using what's
called a context manager.
DAVID MALAN: All right,
how do we do that?
COLTON OGDEN: With the "with" keyword.
So if you have the-- instead of needing
to open a file handler with file--
well, you still need a file handler.
But instead of having,
basically, the dock close,
you can enclose all of
these operations that
require this file within its own
indented block using the "with"
keyword.
And this is essentially just
what's called a contact manager.
It manages the context of this
file and the operations on it.
So basically, it saves
you a line of code.
DAVID MALAN: Yeah.
So great point to make.
This is more Pythonic, if you will.
Here on line 7 what I've
done per Colton's suggestion
is actually say, with the opening
of this file in read only mode,
call the return value "file".
And then notice how I've indented by
tabbing everything below that-- line
8, 9, and 10 at the moment, inside
of that indentation-- thereby
implying that this file variable is
useful within every indented line
underneath.
And as Colton remarks, this
notion of context is important.
Because what's going
to happen ultimately
is that Python is going to close this
file for me by the time we hit line 11.
So I just don't need to think about it.
So this is a very common
approach just to clean up
your thinking, focus on the
work you actually care about,
which is opening the
file and getting at it.
It's less interesting,
intellectually, to close the file
and, indeed, much better
if the language itself
deals with that overhead for you.
COLTON OGDEN: Plenty of people forget
to close their files in C as well.
DAVID MALAN: Indeed.
In fact, one of most common
sources of memory leaks,
where you allocate RAM
and forget to give it back
is, indeed, because you forgot to
close some data structure, like a file.
COLTON OGDEN: We have a
few people in the chat.
We have Trebinator who says,
greetings from Germany.
DAVID MALAN: Greetings from America.
COLTON OGDEN: Amit [INAUDIBLE],,
says, greetings from India.
DAVID MALAN: Nice to meet you,
Amit, as well from Cambridge.
COLTON OGDEN: And then Srini Vasank has
another that says, comment in line 9
is stores w.
Comment in line 9.
Store word in some data structure.
DAVID MALAN: Oh yeah.
So entire line, instead
of one-- yes, it is.
Well, it depends what I do with this.
So I can save myself here.
I haven's actually
written this line of code,
so I can claim it's going
to do whatever I want.
But yes, if we do no other action
here in the commented line on line 9,
yeah, we'd be storing the whole line.
But if we assume, as we're allowed
to in CS50 C version of the PSet,
that the dictionary contains
words, one word per line.
Then, with a bit of fanciness,
can I actually extract
from that line the one, and
only one, word that's on it,
get rid of maybe any leading or
trailing whitespace or messiness,
and just load that word.
And indeed, that's the to
do on line 9 that remains.
COLTON OGDEN: For context, do you happen
to have the dictionary file available
for us to look at?
DAVID MALAN: Yeah, absolutely.
If you want to take a
look at the dictionary,
let's go into another folder.
And in that folder, we
have a dictionaries folder.
And I'm going to go ahead and
open up a large dictionary here,
which has got 140,000 words.
Even the IDE here was struggling
for a moment to open it.
And here we go.
So here are the lines in the file.
"A" is the shortest
English word I can think
of that starts with A. Apparently
"aaa" and "aaas" have meaning too.
Aardvark--
COLTON OGDEN: So we've gone ahead--
we've lower cased all the
words, in this case, in advance.
DAVID MALAN: We have.
And so that's kind of an
assumption we're making.
If we didn't want to make
that assumption, that's fine.
We can deal with it in code.
But yes, that would have been one
of the setups in the C version
of the problem, which is that,
yeah, it's all lowercase.
COLTON OGDEN: We got [INAUDIBLE],,
greetings from Italy.
Hello, [INAUDIBLE].
DAVID MALAN: Buongiorno, [INAUDIBLE].
COLTON OGDEN: We have a SB
Venner, greetings from London.
Nice.
DAVID MALAN: Hello.
COLTON OGDEN: Good to see you.
DAVID MALAN: Nice from
the Cambridge here.
COLTON OGDEN: Srini Vasank.
Sorry if I am jumping.
No, that's totally fine.
I think it was a good thing
to catch in the comment.
DAVID MALAN: Yeah.
No, that's fine.
Call me on anything
that we're missing here.
So if you want to see all 140,000 words,
we can scroll here for quite a while.
You can see there's a lot
of A words in English.
And this is why it's actually
important at the end of the day
to think about how much
memory you're using,
how many CPU cycles you're using.
Because at the end of the
day, this is going to add up.
And indeed, I got bored with scrolling.
We haven't even gotten
through all of the A words.
So there's a lot of words there.
COLTON OGDEN: And that's why we have
a small and a large, I'm guessing,
so that we have a dichotomy.
We can performance measure
between the small--
our algorithm operating on the small
database versus the large database.
DAVID MALAN: Yes.
You do not want to try to
debug a problem in your code
with 140,000 inputs coming at you.
Can you even imagine setting breakpoint
in a debugger and walking through that?
COLTON OGDEN: That would be terrible.
DAVID MALAN: So we
have small, which just
has two words-- cat and caterpillar.
So this is actually-- fun fact.
So this is kind of curious
that we have these two words.
Why might you, in writing some
test code, use two words like this,
as opposed to cat and dog?
COLTON OGDEN: I'm thinking from
the perspective of the try,
because they're both going to operate
on the same nodes in the tree.
Cat and then caterpillar--
caterpillar is kind of a
superset of cat in that sense.
That's my first inclination.
DAVID MALAN: Yeah.
No, and that was the motivation.
Because you want to try to think,
when testing your code, of potential
corner cases.
And it's definitely potentially
worrisome if two of your words
are so similar but different.
So whether you're using
a try or anything else,
just choosing those
kinds of corner cases--
one letter words, two letter
words, substrings of other words--
it's a really good defensive
mindset to go into debugging with.
Fun fact two-- it was just
a few years ago that I
learned how to spell caterpillar.
Because it occurred to me, since
childhood, I could pronounce this word.
I knew what a caterpillar was.
And then the one time,
off the cuff, I said, oh,
why don't we spell check cater--
pill-- ar.
I had to Google the darn
word in the front of class.
COLTON OGDEN: How did
you think it was spelled?
DAVID MALAN: A cata-- a cat-a-pillar?
I don't know.
COLTON OGDEN: Oh, I see.
DAVID MALAN: You just take
for granted for 30 years--
COLTON OGDEN: It's a weird word.
I think there's other
words like that too.
DAVID MALAN: I was embarrassed.
COLTON OGDEN: Srini
Vasank says, sorry, I
didn't realize that dictionary
falcon needs only one word.
DAVID MALAN: No need to apologize.
We didn't tell you, so
you wouldn't have known.
No big deal.
All right, so let's go
back to the Python API
that we've been in the
midst of implementing here,
and see if we can't flesh
this out a little more.
So I feel like the big
to do is really this.
We've kind of cut some corners,
and we've only thought about--
we have a special guest here today--
but we've only really thought
about what to do logically.
But where are we going
to tuck this data away?
So how do you think about--
how should anyone think about actually
storing these words, would you say?
COLTON OGDEN: Well in
CS50, I think normally,
we use what's called a hash table.
So some way for us to essentially
map a word to some storage bucket
somewhere, using a
function that's optimized
for even distribution of our words,
so we don't have all of the same words
in, basically, one long linked list.
But I think we should
show people maybe how
to implement a hash table in Python.
DAVID MALAN: Oh yeah.
You want to do a hash table in Python?
COLTON OGDEN: Yeah.
Let's do a hash table in Python.
DAVID MALAN: All rigth.
So if you want to declare a
hash table-- for instance,
a variable called "hash
table in Python"--
you could technically just do that.
COLTON OGDEN: Oh, wow.
DAVID MALAN: And then done.
What would you like me to do next?
COLTON OGDEN: OK, so now I think--
done.
I think we're all set.
DAVID MALAN: Kind of.
But this actually does invite
some interesting design questions.
So this is indeed the definition
of a hash table-- or in Python,
it's called a dictionary or dict--
where you map keys to values.
And that's the fundamental
definition of a hash table.
It's a mapping of keys to values.
COLTON OGDEN: So cat equals
1, dog equals 2, for example.
DAVID MALAN: It could be.
Yeah, but you don't have to
use numeric indices like that.
It really suffices in a dictionary
to say, yes or no, the word is in it.
So you could probably just say, cat
true, dog true, caterpillar true.
But even that feels a little redundant.
And so when it comes to designing
a data structure like this,
hash table might not necessarily
need be what you want.
You can actually use bunches of others.
So you could actually get away
with just using a list in Python,
which is a dynamically
realizable array or vector,
if you're coming from the C++ world.
That's nice too, because we'll
automatically grow to fill the data.
And you're also not unnecessarily
storing true, true, true,
true, or 1, 2, 3, 4.
You're just storing the
words themselves-- the keys.
COLTON OGDEN: More
efficient terms of memoy.
DAVID MALAN: Yeah, I mean, you're
saving some number of bytes.
However Python implements
true and false--
it's got to be non-zero, presumably.
But would you caution against
using just a list, would you say?
COLTON OGDEN: I would,
for access purposes.
Because if you want to find out
whether something's in a list,
it's going to be an O of n operation.
DAVID MALAN: Yeah.
So a big O of n is a
linear time operation.
Because if that word is "zoo", and it's
all the way at the end of your list,
if it's alphabetized, then you've got
to search all 140,000 words just looking
for the darn word that you care about.
COLTON OGDEN: We've got a
couple of questions also.
Trebinator-- are you
doing PSet5 in Python?
Yes.
DAVID MALAN: Yes, essentially.
And fun fact-- PSet5 just
became PSet4 this year.
So stay tuned for CS50x 2019.
COLTON OGDEN: And do
you want to read off
SB Venner as the last question or two?
DAVID MALAN: Yeah, sure.
SB Venner asks, "what does the
passed reserve keyword do in Python?
I've not really seen that before.
Steve."
So it's actually just literally a pass.
When you want to just have a
placeholder line that does nothing,
as I'm doing here sort of
pedagogically, you can just say pass.
That's a bit of a contrived use case.
It really means, when
you need a line of code,
but you don't want to take any
action there, you can use pass.
So just to pull up a
separate file here--
let me just create foo.py
for the sake of discussion.
This would not be the best design,
but it's one possible use of this.
If you wanted to say something
like, if x is greater than y--
but in that case, you don't
want to do anything about it,
you could just say pass.
Now, I say this is bad
design because, frankly,
if you just want to check a
condition and then not do something,
and instead do something only if the
opposite is true-- like do this--
you can just flip the logic, frankly.
And you can, instead, say, if x is
less than or equal to y, then do this.
And you can avoid the pass altogether.
But I've been in the habit on some
applications of actually using it
in the context of exceptions.
So we don't tend to talk
about these in CS50,
but you might in a higher level class,
or in your own professional work.
Often, it's the case
in Python that you want
to try to do something inside of which
you might have some lines of code.
Except, something bad might happen.
And so you might say, except--
exception as e, for exception.
But if you want to do
something with the exception,
you might not want, sometimes, to
do something with the exception.
But you do want to catch
it and then ignore it.
Because you know more than the computer.
And you know that sometimes
an exception might happen.
But if you decide, not a big
deal, you can literally just
say pass, effectively
ignoring the exception,
but still handling it so the
whole program doesn't abort.
So I would say that's a more compelling
use of pass that's not pedagogical,
but it is functional.
COLTON OGDEN: Essentially,
because in Python,
if you try to do a function
definition without a body,
it will just give you a compiler error.
DAVID MALAN: Yeah.
Yeah, indeed.
So if you actually want to have the
equivalent of abstract methods inside
of a class that you then
override in a child's class,
you might just say pass, because
you don't want the function
to have to do anything
by default. But it
does need to be well-defined
grammatically in the language.
That's a really good question.
Keep them coming.
Hope that helps, Steve.
All right, so if we turn our
attention back to dictionary.py,
I would propose we could
take this one step further.
Python has a whole buffet, if
you will, of data structure.
It's not just lists,
not just hash tables,
but it also has one that we
introduced in class, albeit briefly.
It actually has what are called sets.
You might not have thought
about these recently.
But mathematically, a set is a
collection of numbers, typically.
But a key detail is
that the set does not
contain duplicates, which is great
because your dictionary probably
shouldn't contain duplicates.
The word is either a word or it's not.
And so a set is kind of nice
because it maintains that property.
But more importantly, if you read
the documentation for Python,
set is more performant than list.
It will give you an approximation
of constant time access.
So it's probably implemented with
a hash table underneath the hood,
much like a dictionary.
But it does have this
property where you can just
think about membership therein, not
an association of keys with values.
COLTON OGDEN: It's kind
of like the hash table,
but rather than needing to store that
true every time, which like you said,
probably amounts to some non-zero
amount of bytes, we just get all of that
sorted for free with the same
functionality of the membership check.
DAVID MALAN: Yeah, indeed.
And so a set is what you would
often call an abstract data type.
But you don't necessarily
know or have to care how
it's implemented underneath the hood.
All you care about is its properties--
the fact that you can add to it,
remove from it, and get an approximately
constant time access to it.
That's what you care about.
But underneath the hood, it probably
is implemented with some kind of tree,
or a hash table, or anything.
That's not for us to worry
about in a language like Python.
So let's call this actually what it is.
This is our dictionary of words.
So let's just say-- we've got
a variable called "words",
and it's initialized to
be just an empty set,
containing no words initially, until
we're ready to put them into the file.
COLTON OGDEN: We've got
a question from Andre.
Andre-- we see Andre on the Facebook
group all the time, actually.
DAVID MALAN: Indeed.
Hello, Andre.
Nice to see you here on Twitch.
COLTON OGDEN: "Would checking
access aligned bounding box
collisions be a good use case for
pass, since you are typically testing
whether for cases are not true?"
In that case, you still would
check all four conditions.
So I don't think you would, because
you want to make sure all of them
aren't true.
Because that's just part of
how that algorithm works.
So yeah, I don't think pass,
in that particular instance,
would be appropriate.
DAVID MALAN: Yeah, you can still flip
the logic even if syntactically it
starts to get a little messy or sloppy.
Generally, in any language,
having just empty code blocks
where you take no action
or explicitly say pass,
it usually suggests that
better design is possible.
COLTON OGDEN: Yeah, I usually see it as
a non-implemented sort of thing like a
to do.
DAVID MALAN: It's like in the
real world, at least in the US
and in English, you sometimes have silly
business or legal documents that say
"this page intentionally left blank".
I mean, everyone kind of
rolls their eyes at that,
because well, this is kind of silly.
And I think the same idea there.
You don't want to just have
place holders for code.
You want to get the logic
right from the get go.
COLTON OGDEN: Good question.
DAVID MALAN: Yeah, nice to
see you here on another venue.
All right, so this line
9 still kind of remains.
Store line in some data structure.
So honestly, this is pretty
straight forward in Python.
If I have this data
structure called "words",
and I just want to go
ahead and add to it,
I can literally just
add the line to the set.
Now, this isn't quite right.
You notice this earlier--
I think one of our--
COLTON OGDEN: I think Srini Vasank.
DAVID MALAN: Yes-- noted earlier
that, doesn't it contain a full line?
And it does.
But if we know that the dictionary
contains one word per line,
but every line probably ends with a
backslash n, or CRLF, or the like,
we're going to want to strip that off.
And if you read the CS50
specification, you'll
know that, indeed, it ends with
backslash n, or a new line, or a line
feed.
So what's nice in Python is, you can
actually strip this off pretty nicely.
You can say, reverse strip--
rstrip a specific character or
characters from the end of the string,
hence a write strip from the end.
And that's actually going to
strip off one or more new lines
from the end of the
string, thereby giving us,
effectively, the whole
word from the file.
COLTON OGDEN: Nice.
Clean.
DAVID MALAN: And that's it.
Now, even though Colton and I have
been talking for quite a few minutes
here, at the end of
the day, we wrote what,
six or so lines of code to get done
much of a CS50 problem, namely the load
function here.
So let's take a bit of a breather.
And let's go ahead and
just implement size.
This function, according
to the problem statement,
is return the number of words
in the dictionary file, which
is now loaded into memory, presumably.
How would you go about
thinking about this?
COLTON OGDEN: Well, one approach
might be to use some sort of loop
to basically go over our data
structure and say, for-- in Python,
for example-- for something in our
collection-- for word in words.
And then just increment the counter
that you've declared-- size.
So size plus equals 1.
And then yeah.
Just returned the size.
DAVID MALAN: OK.
All right, so that's
pretty straightforward.
It feels like, if words
is a data structure,
it probably supports
the notion of length.
Could we whittle this down
into something more Pythonic?
COLTON OGDEN: Yeah, I would say
most abstract data types probably
do have some sort of length
function, or length field on them.
So I believe you can just call len
on set-- the length function, which
is a function that works on pretty
much any core Python data structure.
And that should give
you the number of words.
It should be stored internally.
DAVID MALAN: Indeed.
This is a nice generalization.
It's a little inconsistent across
Python that you don't call words.lang,
but this is essentially
an API in Python.
leng, or length, is a function that
works across various different data
structures and data types and can even
work across your own customized ones
that you yourself invent, so long
as your data structure implements
itself a magical method called "leng"
itself, which is built into the object.
This len function will
call that length method,
thereby making sure that you can
get the length of any data structure
in a standard way, in this case.
So actually, just letting
the data structure,
the set itself, tell
us what's your length,
it could probably now be done in
constant time, instead of linear.
And again, that's the goal--
speed everything up today.
Well, let's do a spoiler
here with unload in Python.
Even though in C, you would have
to allocate all of this memory--
maybe it's on the stack, or the heap,
or wherever-- once you've decided that,
you have to actually free that memory.
In Python, there's really
no work to be done here.
So I'm just going to go
ahead and boldly say "done".
COLTON OGDEN: Return true.
DAVID MALAN: Yeah.
COLTON OGDEN: What's the mechanism
called that does that for us?
DAVID MALAN: So it would be
called "garbage collection".
And phew, I wasn't sure if what
you were fishing for there.
It'd be called "garbage
collection", which
is kind of a cool way of describing a
process that happens behind the scenes.
Whenever that program isn't really
doing much else, or has a breather,
or just needs to collect garbage,
it will free up any memory
that you are known not
to be using anymore,
so that it can be used for other
purposes in your program or somewhere
else.
Yeah.
COLTON OGDEN: Languages like Java, and
Ruby, and such also include-- typically
interpreted languages, and some
compiled languages have this feature.
DAVID MALAN: Gotcha.
COLTON OGDEN: C definitely
does not have this feature.
DAVID MALAN: No.
No, and that's what makes it all
the more painful sometimes to use.
And Srini, you got it.
Thank you for the summary
there to everyone.
A few people have some
long usernames here today.
So we're kind of left
with just one function
that we passed on earlier, literally.
That is the check.
So check, recall, is
going to be the function
that we're going to
call again, and again,
and again to spell check any of the
words that are actually passed in.
So let's consider how we can
go implement this up top.
We know that the data structure
in question is called "words".
It's a set.
And it turns out, it's pretty
easy to just ask the question,
is this string in this set?
You can literally just say something
like, if word in set, then go ahead
and return true.
Or if that's not the case,
with your colons here,
you can go ahead and return false.
But in Python, anytime you find yourself
in the habit of very verbosely saying,
if this, then return true,
else if that, return false,
you can probably whittle
it down even more cleanly.
And indeed, you can
actually do this in C.
Though we tend not to do it
in the earliest weeks of CS50.
I could actually just be a little more
elegant here, and just say, what--
return word in words.
And that in itself is
a Boolean expression
that will return true or false.
I might want to do one enhancement here.
And you would only know this from
reading the problem statement.
Even though the dictionary
that we were given,
with 140,000 words is sorted in all
lowercase, is in the problem statement,
the file your
spellchecking might not be.
Because that could be any file
that some human typed out.
So we should probably, if we're looking
for words in a lowercase dictionary,
we should probably force every word
that we're being passed to lowercase,
just to make sure that we're comparing
apples and apples, so to speak.
COLTON OGDEN: Good point.
DAVID MALAN: So any questions,
then, on how we've tackled this?
Believe it or not, in 17
lines or fewer, can you
implement a spellchecker in Python
that conforms to CS50's own API, which
takes many more lines and, undoubtedly,
many, many, many more hours,
if not tears in C to implement.
COLTON OGDEN: Sure.
Maybe we have a couple
of minutes for questions.
In the meantime, when was the
first year that we had this PSet?
Do you remember?
DAVID MALAN: I do remember.
In CS50 here at Harvard, 2007.
And AI has yet to
replace a spellchecker.
So we haven't had to replace it yet.
COLTON OGDEN: Did you come
up with this from scratch?
Or was this an inherited PSet?
DAVID MALAN: I did come up with
this from scratch, I believe.
It's a while ago now.
And in fact, I think I first wrote this
PSet, or problem set, even years prior.
Let's roll back to 2002
or 2003. i was teaching
at Tufts University,
another school up the road,
in their computer science department.
And I was making all new homework
assignments for that class.
And I think, unless I'm
rewriting history here,
I think I did write that
from scratch way back when.
And it's evolved a little bit over time.
And in fact, it started off as a
C++ problem set, if this was true.
That, or I'm completely making
all of this up, and in 2007
did I first write this PSet.
COLTON OGDEN: Was it because you were
having issues with Word's spellchecker?
DAVID MALAN: [INAUDIBLE] No, I mean,
we had Microsoft Word in my day.
[INTERPOSING VOICES]
You can actually just ask
the software to do this.
COLTON OGDEN: ZAlien's got
a couple of kappas for us.
Hey, ZAlien, good to see you.
DAVID MALAN: Thank you very much.
COLTON OGDEN: Trebinator, "love
your course and self teaching.
[INAUDIBLE]
DAVID MALAN: Oh, thank you.
Oh, PSet5 and C--
I'm sorry we showed you
the answer in Python.
Otherwise--
[INTERPOSING VOICES]
dictionary.h, the file we started
with, is actually given to students.
So that's not even a spoiler there.
So fun fact-- when I was
a kid, Microsoft Word
came on four floppy disks.
COLTON OGDEN: Oh man.
DAVID MALAN: And it had all the
features I actually needed or cared
about back in, what, 1995 give or take.
Microsoft Word 5.1A.
COLTON OGDEN: How much smaller
was the dictionary back then?
DAVID MALAN: We had as many
words, I think, in English.
Although, you hear about more and more
stupid words coming out every year
that are apparently
immortalized in the dictionary.
COLTON OGDEN: I can't
remember any from this year.
DAVID MALAN: I don't know.
You always roll your eyes.
COLTON OGDEN: You have the floppy
Word in your office at 125, don't you?
DAVID MALAN: Oh, maybe.
COLTON OGDEN: Or Excel.
No, you have the Excel floppy disk.
DAVID MALAN: Oh, possibly.
Yeah.
It was cool.
Actually, a couple of years ago now,
CS50's video team went on the road
to visit our friends at Microsoft.
So I probably shouldn't have said ill
things about Microsoft Word just now.
And they have a wonderful archive.
And we got to explore some of the
boxes and shelves of old stuff.
And they just had all of the oldest
software right there on floppy disks.
We have some fun photos of picking
up entire boxes of really, really
heavy products.
But the stupid thing is,
the heavy products were not
like the actual software.
It was the damn instruction manuals.
They used to come with
dictionary sized user manuals
that we held up above our head.
I don't think we can find the
photo on such quick demand.
Maybe?
Can we do that?
All right.
CS50's own Dan Coffey here is suggesting
that we go to our own photo website
here.
So let's see if we can't dig up
a little memory from yesteryear.
If I go here, I think
we can probably search.
COLTON OGDEN: Srini is asking,
is that a 3.5 inch floppy disk?
Oh, did you already
answer that question?
DAVID MALAN: Hey, look at this.
Hey, look at that search.
That's a good dictionary we've got here.
COLTON OGDEN: Wow.
DAVID MALAN: So this is me looking
very surprised at how heavy Microsoft
Office was a long time ago.
COLTON OGDEN: That's really
how big the manuals were?
Are those the manuals
or the full software?
DAVID MALAN: This is mostly the manual.
So let's enhance here.
And yeah, there were a few floppy
disks or maybe CDs by then in there.
But those are mostly manuals.
That's not even the biggest one.
Here we go.
This is Visual C++.
What the hell?
That's the compiler.
Wow, that's me a long time ago.
That's a compiler and
the manuals for it.
Slightly creepily, too, is that photo.
But also, Microsoft apparently--
what are those called?
Teletubbies?
COLTON OGDEN: Teletubbies, yeah.
DAVID MALAN: Teletubbies used
to be powered by Microsoft.
COLTON OGDEN: Microsoft did--
I had no idea.
DAVID MALAN: Yeah.
COLTON OGDEN: Windows 3.1 was
distributed on 8 3.5 inch floppies,
says Andre.
I believe it, 100%.
DAVID MALAN: Indeed.
Oh, and it's apropos here.
Off camera is our friend Dan Coffey,
who is right there on camera with that.
And then of course, it wouldn't
be CS50 without our friend Natasha
Chornesky, Nacho, who's at Microsoft
with other colleagues of hers as well.
And there she is, CS50 proud,
having taken CS50 herself.
COLTON OGDEN: Do you
remember per Srini's comment
whether it was 3.5 or 5.5 inch floppies?
DAVID MALAN: Let's see, for Office?
In my day, the version we
had but on Macs was 3.5.
But I don't think Macs
ever had 5.25 disks.
But Windows might have, certainly.
If there was Windows.
I mean, DOS and Windows
3.1 and onward might have.
COLTON OGDEN: Yeah.
I remember having a Windows 95
machine with floppies on it.
But it's been a long time.
DAVID MALAN: Yeah.
Hey, we're all sounding
really old right now.
So if there's any you kids
tuning in wondering what the hell
this channel just became--
COLTON OGDEN: I think we
should have prefaced it
with a picture of what a floppy disk is.
DAVID MALAN: That is true.
So here.
Let's do that.
What's a floppy disk?
Floppy disk.
They came in different shapes and forms.
They're mostly died out now.
COLTON OGDEN: You used to have a
really good presentation in CS50
where you would have people
rip open a floppy disk.
You still have those
in storage, by the way.
DAVID MALAN: We do still have
some floppy disks in stock.
Yeah.
Fun fact-- it actually got
really expensive to do.
Because the price of storage and media
just went down, and down, and down,
and down.
So for a few pennies, we could
give 1,000 students in the room
their own floppy disk, and
then tear it apart in class.
But then once nobody
started using them, you
could only get them on
eBay, not even Staples.
And so then it's $1 each, and that
makes for a really expensive demo.
COLTON OGDEN: I remember
when we bulk ordered those.
There were thousands of them.
DAVID MALAN: Yeah.
So this is a somewhat hard floppy disk.
Inside is the actual floppy material.
And then as Srini was referring,
we had 5 and 1/4 inch disks,
and even bigger ones back in the day--
small, medium, and large, if you will.
And I started off life, I think,
with the middle disks there.
But fun fact two-- if we enhance this,
these 5 and 1/4 inch floppy disks,
which were indeed flexible, came with
a little divot out of the corner,
which was to help align them so
you would know which is on the left
and which is on the right, so you
don't put the disk in upside down.
But the catch is that the
materials inside the disk
are actually perfectly symmetric.
And so, if you actually take a
scissors or a special $5 device,
you could snip a hole in the
floppy disk exactly symmetrically
on the other side, thereby
tricking the floppy disk
drive to write bits onto the other
side of the actual floppy disk inside.
Now I'm sure if I had any notion of
electrical engineering at the time,
I would know that this just creates
interference and bad things.
But to a kid trying to
double his storage space,
it doubled my storage space.
So all of us used to do this and
make your floppy disks double sided.
COLTON OGDEN: That's a TIO for me.
I didn't know you could do that.
DAVID MALAN: Yeah, well it
probably wasn't recommended.
The better practice would be
just buy more floppy disks.
But that's what we did back in the day.
COLTON OGDEN: Yeah.
That's cool.
That's a good piece of history.
DAVID MALAN: Yeah.
So any questions from
the audience there?
Anything we can pluck off, either on
C, Python, or anything in between?
COLTON OGDEN: Yeah, if you guys have
any sort of small exercises or examples
you want us to implement
in Python, let us know.
You said potentially resize as well?
Is the one that you have?
I can print that for you.
DAVID MALAN: We do have resize.
But I think we also have the speller,
which we can go from C to Python.
COLTON OGDEN: Oh, that's true.
Yeah, we have that other spelling.
Yeah, we could do that
if folks are interested.
We should let people
decide what they want.
My desk still has a few
of those [INAUDIBLE]..
Do you folks prefer us to do resize or
to take a closer look at speller in C
maybe?
A good chance for some feedback.
DAVID MALAN: I do think
speller in C is nice.
Because we could literally
do it side by side.
COLTON OGDEN: That's true.
That's a good point.
DAVID MALAN: Let's see
how the lines line up.
COLTON OGDEN: That's a good point.
DAVID MALAN: All right.
So let's do that.
COLTON OGDEN: Let's start on that.
DAVID MALAN: All right.
So let's go ahead and let me unenhance--
close a couple of our distractions
there.
And let me go into my
same folder as before.
Let me go ahead and
open up speller.c, which
is indeed the version of this
program that students are given.
So I'm afraid, too,
this won't help you out,
Trebinator, with PSet5, since you
two are given this distribution code.
But what's nice about the IDE
here is we can split our windows
and actually do some
side by side coding,
and essentially comport or translate
C from Python from left to right here.
COLTON OGDEN: We could
also do this magic.
And now we're out of the scene.
Look at that.
DAVID MALAN: Oh, nice.
We can even get rid of you and me.
Nice.
All right.
So here we go.
Let me go ahead and create a new file.
We'll call this speller.py.
And essentially, we're going to
try to recreate this file over
at the right hand side.
So we'll do some simple, some more
sophisticated things for a moment.
Keep watching, Trebinator.
We won't show you dictionary.c.
That's where the real magic is.
COLTON OGDEN: Actually, do you
want me to maybe make us smaller
and put us in the middle
so we can still see--
DAVID MALAN: Yeah, can we unenhance us?
COLTON OGDEN: Yeah, let's do that.
DAVID MALAN: Goodbye.
Oh, there we go.
COLTON OGDEN: There we go.
DAVID MALAN: Oh, we could do a little
Mac versus PC, like C versus Python--
[INTERPOSING VOICES]
COLTON OGDEN: I think I should
get on the other side then.
DAVID MALAN: Yeah, this
is kind of backwards.
But the keyboards-- that's OK.
I'll look this way, you look this way.
And we'll focus on our own code.
All right, so one thing
to note-- first of all,
in Python, there's different
ways to write comments.
And so for multi-line
comments like this,
how would you go about
writing them in your file?
COLTON OGDEN: In C or-- oh, in Python.
Python has a cool syntax--
I mean, one way you could
do it is just preface
every single line with the hash mark.
DAVID MALAN: So just do
every single line like this.
COLTON OGDEN: Which
would get very tedious.
DAVID MALAN: Yeah, for sure.
COLTON OGDEN: Python
has another way you can
do it using either a triple full
quotes or single quotes, which
will allow you to do the
same thing, essentially,
as this, except on the left.
But you don't need to have
the star on each line.
You can just write strings,
literally, as you want.
DAVID MALAN: Yeah, So problem
set 5 implements a spellchecker.
And then down here, we can close that
thought using double or single quotes,
so long as we're actually
consistent across the two.
And now those are our comments.
And you can see it's nicely highlighted.
Its color coded a little differently,
just because the IDE treats comments
in C and Python a little differently.
But same exact idea.
COLTON OGDEN: I think
Srini had the right idea.
He had the triple hash mark.
But just off on just the
quote being the key there.
DAVID MALAN: Indeed.
You definitely want to use the
single or the double quotes
there and then be consistent throughout.
So on the left in C, we had
a whole bunch of header files
that we included-- one of which
was our own, dictionary.h--
and then a bunch of which were
some lower level things really used
for instrumentation.
No problem at all.
And the comparables that we'll
need are a little different.
So there's not necessarily a
one-to-one mapping left to right.
And in fact, rather than
do these preemptively,
we'll come back to some of our imports.
And let's just trip over the libraries
we need once we actually get to them.
Otherwise, it's magical just by
knowing where we're going with this.
But I will say this.
We do know that there's a file
called dictionary.py from earlier.
So I can immediately say, you
know what, from dictionary,
go ahead and import those functions--
check, load, size and unload.
Those are the four functions
we did implement earlier.
And this way, we can just keep
the logic of our spellchecker
separate from the implementation
details of the dictionary.
We could even swap in and out
different dictionaries if we want,
and therefore implement
the API in different ways.
COLTON OGDEN: Basically
how you would port a header
file-- the idea of a header
file from C into Python.
DAVID MALAN: Yeah, exactly.
Now, I've seen in some code-- and
frankly, some of my own code--
this approach, what would this do?
Do you know?
COLTON OGDEN: So from
dictionary import * means--
* being sort of the glob or a catch all
operator in Unix and other systems--
basically means everything.
So from dictionary-- from
the dictionary.py file,
just import everything and pretend
like I wrote this in my own file.
So I'll have access to check.
I'll have access to unload.
I'll have access to size.
All those different ones.
DAVID MALAN: Yeah.
And that otherwise seems pretty handy.
Because then I don't have to
enumerate all four of those functions.
But the catch is, you can potentially
scoop in more than you intend.
So for instance, we had a global
variable in that file called "words",
which was that set.
And by scooping it up
with *, we're accidentally
introducing to this namespace,
so to speak, speller.py,
that same variable, which means I
could somehow accidentally screw
up and mutate or change that
variable, when it really is meant
to be used only by the dictionary.
Now, we can defend against
that in a bunch of ways.
We can introduce classes
in dictionary.py.
We can introduce an actual module
and kind of separate that out.
But in general, just
because * is convenient,
does not mean it's good design.
And in fact, it would be more
conventional and proper to enumerate,
even tediously, the
functions that you actually
care about so that you're
importing only what it is you
care about, and nothing extra--
especially if someone else goes
and changes the implementation of
the dictionary, and all of a sudden
your variables are colliding.
COLTON OGDEN: Exactly.
DAVID MALAN: All right.
So I've imported the dictionary.
Let's go ahead now and start
translating some of the code.
And for the most part, it will line
up, but not necessarily one for one.
So we can point out some of
the differences as we go.
And we might even disagree
how we might implement things.
But we'll try to do
it line for line here.
So over here on line
20, in our C version,
was the definition of a
constant, as it would be called.
It's, by convention, in all caps.
And then this just hardcodes the path
to that big file with 140,000 words.
What would you propose be the
equivalent in Python here?
COLTON OGDEN: Well,
Python doesn't really
have the idea, or at least the
hardcoded idea of constant variables,
unfortunately.
So what I would probably just do is
something like capital dictionary,
just like you have right there, gets--
then the string, "dictionaries/large".
So almost the same thing, just
without the define being necessary.
DAVID MALAN: So if you
can capitalize the word,
though, is this not now just a constant?
COLTON OGDEN: Not in
functionality, but--
not ad hoc-- but implicitly a constant.
Other programmers reading this are going
to see the fact that dictionary is all
capitalized, and just by convention
across multiple programming languages,
all capitalized variables
with underscores
tend to be constants,
even if that idea is not
enforced by the compiler interpreter.
DAVID MALAN: Yeah, and this is
just kind of a language decision.
The reality is that different
communities and different languages--
authors disagree.
And the mentality with Python is, well
yeah, you could change that value,
but just don't, and be a
little more mature about it.
Now, it goes both ways.
I mean, that's fine to be said.
And OK, I promise not
to touch that variable.
But when programs get
big and complicated,
you might forget which
is yours, which is
someone else's, even though the
capitalization is supposed to help.
So a lot of languages--
C among them, Java, C++--
really try to protect you from
yourself, albeit at a price of tedium,
when you have to write even
more code or more thinking just
to get the same kind of work done.
COLTON OGDEN: Exactly.
DAVID MALAN: All right, so that's a
pretty straightforward translation
of that line.
Below that is a
prototype, which, in C, is
a proactive declaration of a
function's name, return value,
and arguments-- at least
the types and order thereof.
There isn't really a
comparable in Python.
So we'll skip over that.
And in Python, there isn't
necessarily a main function.
You can absolutely have one.
But we won't bother because
there's actually going
to be no other functions in this file.
So we'll keep it simple.
So let's focus on argc here.
In C, argc is the argument count--
the number of arguments
that have been passed in.
But in Python, you have to
do this a little differently.
We need a module, right?
COLTON OGDEN: We do, yeah.
Because Python-- C
gives you argc and argv
for free as part of its implementation.
But it's not something
that's necessarily
part of Python's implementation.
Real quick, though,
Srini has a question.
"All caps mean generally
object or class?"
Sorry, I'm not sure.
DAVID MALAN: Yeah, good question.
COLTON OGDEN: [INAUDIBLE]
DAVID MALAN: Oh, you're
testing me again.
All caps generally means, either you're
yelling in an email, or a text message
or whatever--
sorry for those of you
wearing headphones, actually--
or it means a constant,
or a social agreement
that this should be a constant value.
When only the first
letter is capitalized,
or you have a multi-word variable name
or a symbol that has capitalization
throughout, but not the whole thing,
that typically signifies, yes, a class.
And then, if you use more traditional
camel case, like lowercase letters
to begin, or more conventionally,
underscore variable names,
otherwise known as snake case, then
that's just a variable, typically.
COLTON OGDEN: Snake case because Python?
DAVID MALAN: No, I think name
comes from back in the day.
Because it looks like a snake.
It's just your long string
of words with the underscore.
COLTON OGDEN: Oh, interesting.
DAVID MALAN: Hey, you learned
something new here on CS50 live.
COLTON OGDEN: It's interesting
how coincidental that is, though.
DAVID MALAN: Snake case.
Why?
COLTON OGDEN: Because Python.
DAVID MALAN: Oh, interesting.
No, I'm pretty sure it predates it.
But I'm sure Wikipedia
could answer this too.
COLTON OGDEN: That's fascinating.
DAVID MALAN: We'll check, and
next week we'll let you know.
So to gain access to our command
line arguments in Python,
we need to import the system module,
which actually gives us access to that
and a bunch more features.
Or frankly, if we want to
be a little more precise
and make it look a
little more like C, we
can just say, from the sys library,
import argv, or argument vector,
which is a subset of the
functionality therein.
And so down here, if I
actually want to check
did the user run this program with the
right number of command line arguments,
I can actually say something like, well,
if the length of argv does not equal 2,
and length of argv
does not equal 3, then
I can go ahead and exit and complain.
So I can actually do this
in a couple different ways.
In Python, you generally use
print instead of print def.
So you would say "usage:
speller" and then,
if I mimic exactly what we have in
C, I would do something like this.
This is going to be the
name of the program.
If I actually add a shebang and
put it in a file called "speller",
or install it via package,
dictionary is in square brackets
because it's meant to be optional.
That's the file that you
want to load optionally.
And then text is the file name that
you actually want to spellcheck.
And then I can go ahead just like
in C, exit with a certain value.
But for this, I need to do exit 1, which
if I want to return a 1 status code,
I would do that.
But honestly, you can actually
do this a little more simply.
You can get rid of all that.
And you can just exit with a string.
And if you read the documentation for
Python, when you do exit with a string,
it's going to return 1 for you.
But it's also going to
print out that value.
Now Python-- or the IDE
here isn't liking this.
Because I've done something wrong.
Oh, I'm taking C too literally.
You have to literally say "and"--
professionals here-- and change
it back to some human English.
COLTON OGDEN: And so,
because dictionary is
optional, that means that we're going
to essentially use that hardcoded
as a default, I'm guessing?
DAVID MALAN: Exactly.
That's why we have the line 14, a
constant that we'll use as a default.
Why?
Just because we decided
that we don't want
to force the student to have to
specify a dictionary every time
they run their program.
Instead, we want them to
have some default built in.
COLTON OGDEN: And we saw len before--
the len function-- when we used a set.
And now we're using it on argv.
So what is argv?
What is the data structure
that argv is, if it's
following the same
model of data structures
being able to use the length function?
DAVID MALAN: Yeah, in this
case, it happens to be a list.
And this is a list of the
command line arguments
that the human is presumed to have
typed in from index location 0 on up.
And so, leng just tells us
the length of that list,
just like the leng in
the context of sets
gives us the whole
size of the set itself.
COLTON OGDEN: OK, awesome.
DAVID MALAN: All right.
So we're already
trimming away some lines.
We're on line 17 instead of 32.
So it's a little more compact.
And frankly, a lot of our
lines were comments anyway.
So let's proceed here.
Now, in C, we had to do something a
little different using certain structs,
or data structures called rusage
structs, before and after.
We don't strictly need those in Python.
Because we can actually time our
program a little more cleanly.
But more on that in just a moment.
But you'll notice that in C,
we did define a few variables--
time_load, time_check, time_size,
and time_unload-- all of which
seem to be floats, because they
have decimal points in them.
And they're all initialized to 0.
I did this on one line on the left
hand side, just because you could.
You don't strictly have to do that.
You could have four
separate lines or variables.
But I can do roughly the same here.
I can say time_load, time_check,
time_size, and time_unload equals 0.0,
0.0, 0.0, 0.0, effectively inducing
what's called a tuple on the left hand
side and the right hand side.
A tuple is a collection of one or more
values, often written in parentheses,
but not necessarily so.
But if I want to be really
explicit as to what I just did,
it's kind of like doing this,
which is a nice one liner of doing
four things on the right
to four things on the left.
And in this case, the assignment
is what you're actually doing here.
COLTON OGDEN: And we can call
this tuple unpacking, correct?
DAVID MALAN: Yeah, indeed.
Because you're going from a
tuple with four elements in it
to a tuple with four elements in it.
But you're really doing
a one-to-one mapping
between all of the tuple elements.
And so the parentheses
are not strictly required.
The commas imply the presence thereof.
All right, so what else?
Let's go ahead and actually initialize.
Let me bring us back to the middle here.
Here we go.
Let's put this wall between us again.
There we go.
All right, so now if we want
to go ahead and initialize
this variable, dictionary,
to be either argv1, which
is the name of the dictionary the
human maybe typed in, or the default.
We can do this with the ternary
operator here in Python as well.
Let me go ahead here and first,
let's see what Python's--
oh, accidental breakpoint.
No complaints at all.
Let me go ahead and just
initialize a dictionary variable,
no type needed, to either
argv1, if the length of argv
equals 3, else initialize it to
dictionary, just like that there.
So you'll see that this is
a little different from C
syntax in that I'm
using if and else there.
But this is a nice one
liner, a ternary operator
effectively, that's
assigning dictionary to argv1
if, literally, the length argv is 3,
else it's assigning it to dictionary.
COLTON OGDEN: So dictionary, lowercase
now, is our actual 100% dictionary.
And then capital dictionary
is our ostensible dictionary
when we start off, if
we haven't actually
input a dictionary into the command
line when they ran the program.
DAVID MALAN: Exactly.
Yep.
And so we have here another
example of Python code
that kind of reads like English.
On line 21 here on the right, dictionary
equals argv1, if the length of argv
is 3, else dictionary is what it equals.
So it reads a little more
naturally than the more arcane C
syntax for ternary operator.
All right, let's just do
a few initializations.
Again, this one doesn't map perfectly
to Python, because get rusage
is very much a C thing.
But the closest I could come up with was
to declare a few variables like this--
the time involved before
we're going to run some code.
Then we're going to go ahead here and
let's get a variable called "loaded"
and call the load function
in our actual file.
Then let's go ahead and
get the time after that
by using process_time again,
which is the same function built
into the time module like this.
And then, just as a sanity check
here, let's ask the same question.
Well, wait a minute.
If the dictionary was not loaded per
its return value there, let's go ahead
and bail out and say something like,
exit could not load dictionary.
You'll notice, though, we might want
to substitute in a full sentence there.
So I can actually paste
in the variable's value,
so long as I make this a format
string, otherwise known as an F string.
Now the IDE is yelling at us here.
Do you have a sense of
why this red X appeared?
COLTON OGDEN: OK, so before
time.process_time-- so before--
I'm a little confused as
to what you're doing there.
DAVID MALAN: All right, well let me
explain to you a la my rubber duck
here.
So what I'm trying to do is just figure
out what is the current time of day,
if you will, right now.
Then I want to call my load function.
Then I want to ask another question--
what is the time of day now?
Because you can infer
from that, presumably,
if you subtract after minus before,
how many seconds, milliseconds,
whatever, transpires.
COLTON OGDEN: It's basically
like having a stopwatch.
Time functions like a
stopwatch-- start and then stop.
DAVID MALAN: Exactly.
COLTON OGDEN: But I notice you're
still getting the error though, even
though you put the equals sign there.
So the IDE is helpful.
It's giving us some information--
undefined variable time.
And so time, presumably, is a
Python module just like sys.
So it looks like we just
haven't included that--
[INTERPOSING VOICES]
DAVID MALAN: We have to import time,
which is kind of a magical concept.
But yeah.
And that's it.
Import time.
COLTON OGDEN: What was the--
there's a fun little thing
you can do in Python.
It's like import something,
and it redirects you
to a web page or something, right?
What is that?
DAVID MALAN: Oh, I don't know.
But you can import functions
from the future, which is cool.
COLTON OGDEN: Python-- hold on.
Python funny import.
What is it?
What is it?
What is it?
What is it?
Import this, I think.
DAVID MALAN: Import this?
What is that?
What do you mean?
COLTON OGDEN: I think it does--
let me look it up before
we do it on screen.
Python import this-- foo.py.
[INAUDIBLE] wants me to do what?
COLTON OGDEN: Import this.
DAVID MALAN: Import this.
COLTON OGDEN: You can do
it in interpreter too.
And I think that will be--
if you can open interpreter.
DAVID MALAN: All right,
slight digression here.
Let's open a terminal window here.
Let's run Python 3, the latest version.
Let's run a line of code with
dash-C and just say import this.
COLTON OGDEN: And this is "The
Zen of Python" by Tim Peters.
DAVID MALAN: That's a
wonderful Easter egg.
Mind you, on the second
screen, off-camera,
we are googling funny Python imports.
COLTON OGDEN: I remembered that--
I didn't remember what it was.
Oh, we missed a few comments
or things in the chat here.
So argv[1] gets the second argument--
DAVID MALAN: Hey, wait, wait.
I want to import antigravity first.
That's do important
things first, if we could.
COLTON OGDEN: I missed that one.
I missed that one.
DAVID MALAN: This is trying to
make a connection somewhere.
COLTON OGDEN: This is the
one that does the web page.
Here, we can-- do we have
a computer on [INAUDIBLE]
DAVID MALAN: I can do it
on my Mac here, probably.
All right.
Here, Andre, here we go.
Here we go-- Python 3-3,
import anti-gravity.
COLTON OGDEN: Nice.
DAVID MALAN: That's amazing that this is
built into an actual language in 2018.
I had never actually seen that.
That's pretty beautiful.
COLTON OGDEN: I think I saw
it a couple of years ago.
But I forgot what it was.
DAVID MALAN: OK, now Serenity is
being a very good student here
and is following along.
Yes, time was missing.
So, Serenity, you are
absolutely correct.
Apologies for the digression here.
We got a little carried away with xkcd.
But, yes, import time was missing.
So replacing that does
actually fix that problem.
And then, let's see-- we had
another question from SB Venner.
COLTON OGDEN: SB Venner--
on line 16, would it not be a
little better to say something
like argv underscore
line equals line argv?
And then on the next line, say, if
line underscore argv is not equal to 2
and line underscore
argv is not equal to 3,
it would save calling the
length function method twice?
DAVID MALAN: It's a good question.
I don't know if we'll agree on this.
Short answer-- yes, theoretically, by
tucking away that answer in a variable,
you don't have to ask
the same question again.
I would generally
argue, though, that it's
a premature optimization, if you will.
There are so many other things I
could be doing poorly in this program.
And just counting the number of command
line arguments efficiently probably
isn't one of them.
The reality is, too,
the human is the one
that's going to be providing these
command line arguments most likely.
And he or she is not going
to type all that many words.
So argv is probably going
to be size one, two, three,
maybe 10, but most likely not.
And so even if we are, in linear time,
checking the length of those things,
it's still going to be super fast.
And, frankly, if Python's interpretation
of the implementation of a list
is efficient, Python is
probably using its own variable
to keep track of the length
of a list, so that it
can return its length in constant time.
So good instincts.
But those are the kinds of
details that you should generally
relegate to the library
and trust or check
the documentation to see if
it's doing things as smartly
as you might be inclined.
COLTON OGDEN: If we were incurring
this function call six million times
instead of a nested
for loop, I would say
that would be a good point
and to cash that value.
DAVID MALAN: OK.
That's fair.
COLTON OGDEN: For this app's, I
don't think that's necessary at all.
Two function calls versus
six million function calls--
I think that's a different
kind of optimization, for sure.
DAVID MALAN: I agree.
And I would say it doesn't
have to be six million.
If we're talking about a
dictionary with 140,000 lines,
I would worry about anything you're
doing inefficiently with that input,
not the human's two or
three command line argument.
But totally valid, but probably not
something you should worry about.
Better to spend those mental
cycles on harder problems.
COLTON OGDEN: Exactly.
Trinity up above also asked-- argv[1]
gets the second argument or the first
one?
DAVID MALAN: So in argv[0], as in
C, is the name of the program itself
or the name of the Python file, which
would be speller.pi in this case.
So argv[1] gets the first
command line argument,
but the second word that the human
provided at the command line,
if that distinction makes sense.
COLTON OGDEN: And the word Python
itself doesn't even get included.
If I'm remembering correctly, it
doesn't get included in argv at all.
DAVID MALAN: Correct.
Just the name of the file is in argv[0].
COLTON OGDEN: In C, though,
you do get the dot slash,
whatever your script is, as the first--
DAVID MALAN: Whatever
the name of the file.
But same in Python 2.
If you did Python
space dot slash foo.py,
you'd see that as well in argv[0].
DAVID MALAN: Trebinator--
COLTON OGDEN: Do you want to
go ahead and read his question?
DAVID MALAN: Yeah.
When having a try with
a struct try letters--
an array-- what is a letters
initialized when using malloc?
So the initialization of
that is totally up to you.
If you're just calling
malloc, it's not getting
initialized with anything whatsoever.
You're just getting
back a chunk of memory--
if enough is available-- equal to the
number of bytes that you've requested.
If you want to clear that memory,
setting it equal to all zeros--
so not technically null--
null is a pointer value--
but all zeros, you can use
calloc, or clear alloc,
which is just another C function.
Usage is identical, but
it will do the clearing.
Now, historically, people would use
calloc very sparingly, or not at all,
because why bother clearing
your memory if you're just
going to overwrite it yourself.
But in the case of a try,
and in a CS50 spell checker,
if you're planning to
clear that memory anyway,
as with a for loop or a while loop,
then, yeah, let calloc do it for you.
Because it probably is using a for
loop or a while loop itself to do that.
COLTON OGDEN: I was just reading an
article recently about calloc and how
it was not really used historically too.
DAVID MALAN: No, because--
I mean, performance.
Like, literally, by calling calloc
you are asking the computer to do,
let's say, twice as much work.
Give me memory.
Oh, and by the way, clear it for me.
I mean, clear your own memory
was kind of the mindset.
And only initialize it when
you absolutely have to.
COLTON OGDEN: And that would've
been O of N, too, right?
Because it doesn't have to go over every
single byte in that set and [INAUDIBLE]
DAVID MALAN: Most likely,
unless there's some hardware
optimizations where
you can actually clear
a whole page of memory efficiently.
COLTON OGDEN: But I guess
that'd be up to the [INAUDIBLE]
DAVID MALAN: Beyond my bi familiarity.
All right.
So where did we leave off?
We had just benchmarked, if you
will, the running time of the load
function by computing after and before.
So if we actually want to compute
the total time involved in loading,
let's go ahead and call
that time underscore load,
and go ahead and calculate that value--
which is actually pretty easily done--
just do after minus before.
In C, we have to do some
annoying calculations
because of how those structures work.
Python, you just don't
need to worry about that.
You can just compute it with arithmetic.
COLTON OGDEN: Did we
write those functions
ourselves that calculate functions?
DAVID MALAN: We wrote
calculate ourselves.
If you go to the bottom
of this file, you'll
see an annoying function
that actually dives
into this data structure being used.
You just don't need to
do that in Python at all.
COLTON OGDEN: Import time.
DAVID MALAN: Import time, or
anti-gravity if you prefer.
So this is going to be a heck of
an implementation before long.
So let's see what we're actually going
to code live versus just kind of whip
out a sample solution
from the oven here.
But let's see how this actually
maps to the code in question.
Because I think we'll probably just
want to pull up a spoiler at some point
here.
COLTON OGDEN: It's getting close.
It's 4:06.
We have around 24 minutes-ish
before the stream officially ends.
DAVID MALAN: So let's see
how far along we get here.
Let's go ahead and open the actual text.
So on the left-hand side here,
we get some pointers involved.
Thankfully, those are not going
to be involved here in Python.
But let me go ahead and convert
this as closely as I can here.
Give me a variable
called text on the left.
Let me go ahead and go into argv[2].
If the length of argv
equals 3, else argv[1].
So this is just giving me the
last command line argument,
which is either position
2 or position 1,
depending on how many
total words were passed in.
Then let me go ahead and open the
file called Text in read-only mode.
And just for good measure, I can
actually specify an encoding,
though this might not
strictly be necessary
given what dictionary you're using.
But I'm going to specify Latin 1,
which is essentially Ascii, as it
is in our C problem set here.
Notice I misspelled argv,
so the IDE is yelling at me
with that little red circle.
And then I'm just going
to do some error checking.
So if there's not actually a
file, we might have screwed up,
or the user might have messed up.
So I'm just going to go ahead and
say something like print could not
open the name of that text.
Let's go ahead and make this
an F string, or format string.
Then let's go ahead and call on load--
whatever it does-- even
though we secretly know
it doesn't need to do
anything in Python.
And then just go ahead and exit with 1.
So some of these lines
aren't strictly necessary.
We could introduce a context
manager, like you proposed before.
But I wanted to handle the
errors exactly like we are in C,
just so that you can kind of
see C to Python left to right.
COLTON OGDEN: There's
one thing on line 33
that I would suggest
for folks who might not
know some of the shortcuts you
can use with lists in Python.
For example, all we're essentially
trying to do here on line 33
is just take the last--
no matter what it is, we're always
going to take the last thing, right?
So instead of using argv[2] or
argv[1], we can always take,
without any condition, argv[-1].
And that will just give us
the last element in that list.
DAVID MALAN: Indeed.
Now this does assume
that earlier in the file,
we have validated that this has the
right number of command line arguments.
Because if it's got
like 10 arguments, we
don't want to take the 10th,
or the 9th, or whatever.
We want to make sure it has
exactly the right number.
But we did that,
indeed, earlier already.
I like that a lot.
Let's keep that.
All right.
So after that, notice that we want to
prepare to report the misspelled words.
That's actually kind
of an easy one-liner.
So let's give ourselves
some room and scroll up here
so we have a place to put this.
All right.
So right up here,
let's go ahead and just
print out exactly as we did in C,
using some new lines to format it.
Nothing really interesting there.
Misspelled words, couple new lines.
Notice-- I only need
one new line, though,
because we get one for free with Python.
Because the end of a line in
Python is by default that.
But if we know that, we can
just leverage that assumption.
Now let's go ahead and prepare
to spell-check the word.
And I'm going to go
ahead and give myself
an empty word, which is the
equivalent of this line over here.
In C, you have to know how many
characters you're going to fit.
It turns out from the
problem statement--
if you read into CS50's homework-- you'd
see that the maximum length of any word
is going to be "length," which
is a constant, which is like 45
or so characters, defined elsewhere.
But we don't care in Python, because
the words can grow and shrink.
Then I'm going to go ahead and
initialize some variables on the Python
side to be like an index
of misspellings and words,
all equaling 0, 0, and 0, using this
fancy [INAUDIBLE] approach, as before.
All right.
Very welcome, Trebinator.
All right.
So now let's go ahead and
spell-check every word in the file.
This is painful in C because you have
to take baby steps in reading the input.
The problem in C is
that you can't really
safely read in from an arbitrary text
file an arbitrary number of bytes.
You have to decide in advance how many
bytes maximally you're going to read.
The problem, though, is if
we're reading words from a file,
I don't know how long or how
short my lines are going to be.
It could be infinitely long, or however
much memory I actually have on disk.
And so here the best way in
C is to take baby steps--
read one little character at a time
and just accumulate those characters,
rather than blindly
read as many as you can.
Python-- I don't have to
worry about any of that.
I'm going to go ahead and do the
following in an infinite loop,
while true as follows.
If I want to read a character from the
file, let me go ahead and just say--
oops, that's not how you
spell true in Python.
Let me just go ahead and say
file.read one character here.
If there's not a
character, then I'm going
to go ahead and break out of this.
That means I'm at the end of the file.
But if I actually want to now assemble
words from the file-- and this
is what's tricky here.
Unlike the dictionary, where I can just
read one word at a time from a line,
the file we're spell-checking might
have sentences and paragraphs.
They might have apostrophes and
punctuation and all this nastiness.
So we need to actually distinguish.
So I'll do this one a
little quickly, essentially
porting our C from the left to
our Python code here at the right.
But among the approaches I
could take here is this--
if re.match this regular expression--
A-Za-z, c-- or it is the case
that c equals an apostrophe
and I have already
read in part of a word,
implying that this variable called index
from our C version is greater than 0,
then I can go ahead and append
this current character to the word.
And I'm going to go ahead and increment
my index counter advancing in the word.
So what am I doing here?
I'm tip-toeing through the
file that we're spell-checking,
taking one character at a time.
So if the word in the file is
"Colton," I'm getting a C-O-L-T-O-N--
yay, ra ra.
So we're reading in Colton
and one character at a time
and then just appending it to a
variable called "word," incrementally
building up your name.
COLTON OGDEN: And index you're
using to keep track of the fact
that you're inside a word, right?
DAVID MALAN: Exactly.
Now if, meanwhile, I've
read too many lines,
I might actually say this--
so if the index is actually
now longer than the maximum
length of a word that's allowed,
I'm actually going to go ahead
and consume the rest of the string
and just throw this away,
because it's not supported.
Something is wrong.
So while true, I'm going to go ahead
and read in one more character.
And if it is not a character
or it is not the case
that that character matches the
following regular expression,
or pattern, as it really is, A-Z or a-z,
then what I'm going to go ahead and do
is just break out of this all together.
Essentially, something has gone wrong.
If nothing has gone wrong, I'm going to
go ahead and reset index and word to 0
in the empty string, with
a little one-liner there.
But I could write that
syntactically a little different
and actually proceed as follows.
So I don't like how the IDE is yelling
at me in so many different places here.
Let's see-- undefined variable re.
COLTON OGDEN: Well, we
solved that problem before.
DAVID MALAN: We did.
So that's re-- regular expression--
refers to pattern matching
capabilities of a library.
So let's try that trick before--
import re, scroll back down,
come on-- magic, nice-- we've
whittled this down further.
How about here?
Undefined variable length.
So I cut some corners
here and I didn't actually
define the length of the longest word.
Should've done that way early on.
Looks like I skipped a step.
Length equals 45.
I think I just said that one verbally.
All right.
So let's go back down here.
Voila.
All my code looks correct, of course.
And now let's go ahead and
check a couple other things.
So this line up here was saying if each
of the characters we're reading one
at a time out of the
file is alphabetical--
A-Z-- uppercase or lowercase--
or it is an apostrophe,
like Colton's-- the possessive-- then
we're going to go ahead and accumulate
it.
But if we run into other things, we
want to handle those differently.
So else if it is the
case that C is a digit--
this is a method built into strings--
then I'm going to go ahead
and just consume these.
Let's just get rid of them, because
they're not part of the file
I want to read.
So while true, let's go ahead and read
one more character with file-read(1).
Then let me go ahead and say, well,
if that's not actually a character,
or it's not an alphabetical
character, and it's not a digit,
then let me go ahead
and do this with it--
all on one line here--
let's go ahead and just break out.
And then let's go ahead and reset
ourselves-- index word gets 0 here.
So we're going kind of quickly
through the Python version.
But, again, the goal here is to convert
the C code on the left to the Python
code on the right.
So just to fast forward to
where we are in the story,
we've just implemented on the left this
C code with isdigit with the equivalent
Python code on the right.
COLTON OGDEN: Is this all distrib
code that they usually get, or this
is the solution that we're looking at?
DAVID MALAN: This is
all distribution code.
So everything on the
left students were given.
They didn't have to come up
with this all on their own.
COLTON OGDEN: Because they just had
to implement the dictionary files,
or dictionary functions, rather.
DAVID MALAN: Indeed.
So I think, why don't we go ahead and--
COLTON OGDEN: Oh, sorry, we
got a question from Andre.
DAVID MALAN: We got
a decent amount left.
Let's take this question from Andre.
Of course.
COLTON OGDEN: Can I just
ask-- line 76 in C--
let's take a look at line 76--
is a for loop that sometimes
causes some confusion.
Is this something you
would typically use?
Or is it there just for
pedagogical purposes
to show they aren't strictly
limited to simple incrementation
within a for loop?
DAVID MALAN: It's a good question.
And, Andre, I do think
you make a valid point.
In CS50 we tend not to show for loops
involving non-numeric initialization
and updates in for loops.
We pretty much always use
I and J and K and so forth.
I would say we're using this
because it is the clearest design.
The for loop does exactly
this-- initialize it
to some value, check a condition,
and then iteratively do something
again and again and again.
And as such, that is the
right looping construct.
You could use a while loop
and build it that way.
But this is what a for loop is good at.
So if anything, it's my
fault or the course's fault
that we don't show students
different paradigms of for loops.
But all the better if they get to
see it here in the distribution code.
COLTON OGDEN: Yeah, I think if you have
a discreet way of knowing what the end
result of your loop is, without having
some non-deterministic thing happening
in the loop-- which
is normally infinite--
I think that's a great reason
to use a for loop at all costs.
I think it's easier to read.
DAVID MALAN: But if you want us to
sound smarter then, yeah, absolutely,
this is a pedagogical decision.
It's informed by years of experience.
And we really just wanted students
to glean some additional insight
from our distribution code.
COLTON OGDEN: I think it is
illustrative of that paradigm, though.
I think it's a good point.
DAVID MALAN: I know.
That's fair.
But I think we should
just 'fess up, that, well,
we just did it that way because.
COLTON OGDEN: It just happened
to be what we felt was the best.
DAVID MALAN: Indeed.
And, amazingly, we've been talking so
long, that Trebinator has implemented
load in C. So congratulations.
That's some really good progress.
I suppose it helps that we're
showing the logic in Python.
But that's fine.
Zamyla has walked through,
similarly walks students
through the process of
thinking about how you load.
But nicely done, Trebinator.
COLTON OGDEN: I think somebody taking
some Python and implementing in C
is much more impressive
than the opposite.
DAVID MALAN: So for all those of you
who have just done the opposite--
C to Python-- Colton is not impressed.
Just kidding, just kidding.
So why don't we go ahead and
finish up this last for loop.
And then I think I'll probably pull
out some of the actual solution code,
because it's going to
get tedious to type
out some of the last
stuff, the printing stuff.
So here we just claim if,
at this point in our code,
we have found a whole word, then we
can go ahead and do the following--
elif index is greater
than 0, thereby implying
that we've accumulated
at least one character,
let's go ahead and increase our counter
of words, which is a variable we
initialized way up above to 0.
Then let's do some timing.
So the time before we're about to do
this benchmark we'll call before again.
And let's just call time
process time with no arguments.
Then let's go ahead and get
the return value of check word,
thereby checking if
it's misspelled or not.
And then let's get the time
after-- time.process.time.
And this is actually cool.
Notice here that we did this one-liner.
So we want a variable called
misspelled to be true if the word is
misspelled and false otherwise.
The catch is that check returns
true if a word has checked out
and it's not misspelled.
But that's OK.
In C you can invert true
and false, or false and true
with a bang, or an exclamation point.
In Python, you don't do that.
You actually literally say not.
And that reverses the logical effect
from true to false or false to true.
COLTON OGDEN: And a lot of people
will read the line on the left
as not check word either way.
DAVID MALAN: Yeah, they'll still
say not, even though in Python you
literally have to write "not."
COLTON OGDEN: We got a few--
it looks like Trebinator says--
DAVID MALAN: Hey, we rock.
Nice.
Or maybe that's for Trebinator.
Good job.
COLTON OGDEN: I missed
initializing my letter array with--
[INTERPOSING VOICES]
DAVID MALAN: I don't think
it's Trebinator then.
That's OK.
That's OK, you caught it pretty fast.
We'll be posting on Facebook.
We'll take a look.
This is our first one.
We're going to do some post-processing.
And just ping us on email or on Facebook
to ask after this if you don't mind.
COLTON OGDEN: I think it'll
at least be on YouTube.
DAVID MALAN: Yeah.
All right.
So let's finish this up.
So I've just done a benchmark
here with checking for misspelled.
Let me do some total time checking here.
So time check is going to now equal the
difference between after minus before.
That's the total amount of time.
And we're doing this in our loop,
which is why I'm now accumulating.
We want to do this for every word,
see how much time we've totally spent.
And now here, if a word
was misspelled, let's
go ahead and print that word, which is
required by the problem specification
in the homework assignment itself.
And then let's just
increment misspellings by 1
to keep track of how many
words are misspelled.
And then let's get started
to do another loop.
Index and word should be
reset to 0 and quote-unquote.
Could do this in
different ways, but I just
adopted that little tuple approach.
And then assuming all of this went
super well, let's close that file.
And as before, we could use context
manager and let that happen for us.
But we're trying to do a
one-to-one mapping with the C code.
And indeed, if I scroll
down here, you'll
see a called fclose,
which is also checking.
There's some f error code which
isn't necessary in Python,
but checking if there was
an error, what went wrong.
We would ideally do these in
Python using exceptions anyway.
You know what?
We're almost at the bottom of the file.
Let's take it home.
COLTON OGDEN: I think we should.
We should do it.
DAVID MALAN: Let's make
ourselves a spell checker here.
COLTON OGDEN: Scotty says,
"You guys are awesome."
Thanks, Scotty.
DAVID MALAN: All right.
Here we go.
Here we go.
Don't say that until
we get to the end here.
Now let's go ahead and count how long
it takes to compute the dictionary size,
even though it's probably
pretty fast since we're just
checking a variable's value.
N shall be the size, the result
of calling our size function.
And after, of course,
becomes time.process.time.
Then here let's store
in our variable time
underscore size, which we
initialized earlier to 0, which
strictly speaking wasn't necessary.
We were just doing that for
parity with the C version.
Now let's go ahead and time the unload
function, which should really be fast.
Process.time.
COLTON OGDEN: As an
aside, for the P set,
is there a maximum time
that the students are
given for their solution to work?
DAVID MALAN: Not really.
I mean, they're given
a week to implement it.
But the code itself can
run as long as they'd like.
Realistically, if they
run it through check 50,
our auto-grading
infrastructure, probably
gets stopped automatically
after 30 seconds.
But even with the slowest of
data structures, like an array,
that should be plenty of time, even to
spell-check some of those larger files.
We're really talking milliseconds,
not so much seconds or minutes.
After we've done this, let's do an after
benchmark, calling time.process.time
again.
And then let's go ahead and check--
if not unloaded, let's go ahead
and yell as much to the user--
could not load our
dictionary, just as in C.
COLTON OGDEN: 104's
got a very subtle bug.
DAVID MALAN: If not unload--
oh, you are right.
I was accidentally
checking a function value.
That was not going to work.
Nice catch.
Let's go ahead and exit
with this one there.
Let's scroll up.
And I honestly don't really want to
type out all these prints in Python.
This is very ugly looking.
So just like Julia Child,
we're going to go ahead and go
in to our Python version here,
which is in our speller.pi file.
And voila, let's move it
over to the right-hand side.
Voila--
COLTON OGDEN: Magic.
DAVID MALAN: Look at that.
I suppose it's mostly copy-paste,
although the format strings change
and so forth.
So if you're not familiar, let's
point out a couple of things here.
One, we're using print
instead of print def.
Two, we are using f strings or format
strings to plug those things in there.
And, three, we're using these
format codes with f strings,
where you specify the variable
name that you want to print, colon,
and then dot some number, which
is the number of decimal points
you want to place.
Print-- 2, in this case-- and
then f for a floating point value.
And then the nicest line of
all is this exit success.
We don't need to do sys, since
I didn't import sys in that way.
You can just use exit in this case.
But sys would be more
proper version, I'd say.
COLTON OGDEN: For folks
who might have familiarity
with older Python, but not 3.6 and
over, the f string is a new feature.
DAVID MALAN: It is.
You need this in Python 3.6.
COLTON OGDEN: Previously,
you used percentage signs
and the string.format function.
And it was a lot uglier.
But now you can just
preface a string with f
and then include the
sort of bracket variables
in there with format
specifiers and it'll just work.
DAVID MALAN: Indeed.
And Scotty529 asks, "Are there any more
in-person visits to Harvard available?"
Indeed.
Let me see if I can
remember the URL off-hand.
I think it's CS50 lectures 2018--
no.
CS50 lectures-- there we go.
Is this the right year?
Let's just check.
2018.
Yeah.
So if you go to-- let me enhance--
picatic.com/cs50-lectures, you will see
a landing page where you can sign up
for free tickets to any of CS50's
lectures that remain this fall.
We only have a few left, if
this is of interest to you,
and you can travel to Cambridge
or you'll be in the Boston area
and want to drop by.
They're all on Friday mornings.
Thereafter, you can go on
a free tour of the campus
with some of our undergraduate
friends, who are kindly
offering those right
after lecture itself,
and then see a bit of
the university too.
Any other questions from
folks here tuned in?
COLTON OGDEN: We didn't run it.
We got to run it, right?
Are we going to run this?
DAVID MALAN: Oh, you want to
see that my code actually works?
Oh, I missed-- did we
miss that question?
COLTON OGDEN: No.
DAVID MALAN: OK.
Now you're really
putting me on the spot.
Now we have to trust that
the code actually works.
So let's go ahead and open
up a terminal window here.
Let me go into, if you will,
just the pre-baked version,
just in case I did anything
stupid here, and go into speller.
In this directory, notice, I've
got speller.py and dictionary.py
from before.
In dictionaries, recall that
we have large and small.
We use the big one.
And if I do ls on text, we see
a whole bunch of sample texts.
Let's say, Shakespeare,
which is a pretty big file.
In fact, if you want to look at
Shakespeare in the text directory,
this will actually be a pretty big file,
as indicated by the progress bar there.
Come on, come on, come on.
And you'll see the complete
works of Shakespeare.
So, mind you, there's going to
be a lot of misspelled words
that are really just old English.
But that's fine, because we are
spell-checking against something.
So let's go ahead and run this.
Let me go ahead and run Python, or
Python 3, to be specific on, let's say,
the large dictionaries.
So we'll go into dictionaries-large,
and then we'll run this on--
let's do this on texts and Shakespeare.
Something went wrong.
Wait.
What am I doing?
Oh, I'm not running--
I'm trying to interpret--
COLTON OGDEN: Can't interpret
the dictionary file, right?
DAVID MALAN: Yes, key step, key step.
OK, so let's actually run speller.py.
OK, that was going to be the
worst ending to any stream here.
And now let's do this on large.
Aardvarks is not a Python command.
Here we go.
All right, here we go.
We are spell-checking all
these words coming out,
recall, are the result
of that print statement.
Because I have a pretty
tall screen right now
and because this is a web-based IDE,
my spell-checker seems to be god awful.
But that's just an
internet latency thing.
It's just taking some time for all
the print statements to get spit out.
And, indeed, if you actually compute
the total CPU time running that spell
checker on the complete works of
Shakespeare, it only took 1.7 seconds.
The number of words that were
misspelled was 45,000 or more.
The number of words in that dictionary
is indeed just over 140,000.
The number of words in Shakespeare is
a very talkative 904,000-plus words.
And you can see a
breakdown of the number
of milliseconds we spent in
each of those functions checking
taking the most time.
COLTON OGDEN: And those
print statements--
those are all about
misspelled words, correct?
DAVID MALAN: All the misspelled
words were what got spit out.
And so if we actually
hid that print statement,
we would actually see these
results much more quickly.
It's just because of the
internet speed and the printing
thereof that creates the illusion that
the spell-checker is slower than it is.
But those running times are the
result of time.process.time,
which is indeed an accurate count.
COLTON OGDEN: Based on the words
that I saw coming up on the terminal,
it looked like it was
all pretty correct to me.
DAVID MALAN: I think that's
proof by correctness.
COLTON OGDEN: Scotty says,
"Do you guys play video games?
What's your favorite game right now?
Do any help you be a better programmer?"
I'll let you answer first
and then I'll follow up.
DAVID MALAN: I don't really anymore.
I think the last video game
I played was with Colton.
We played the newish Zelda--
COLTON OGDEN: Breath of the Wild--
DAVID MALAN: Breath of the
Wild on Nintendo Switch.
But I realized that if
I only allowed mys--
we played until I essentially needed
to fall asleep, at which point
I decided, OK, that's it for
Switch, Wii, Nintendo for the year.
COLTON OGDEN: I'm pretty
sure you fell asleep too.
DAVID MALAN: Did I fall--
OK.
So, yes.
But it was fun.
I think it was after
last year's CS50 fair.
We decided, no, we're going to
get through the school year.
Then we're going to play
a few hours of Switch.
COLTON OGDEN: Was it
after the hackathon?
I think--
DAVID MALAN: No, it wouldn't
have been after the hackathon.
COLTON OGDEN: Maybe
it was after the fair.
DAVID MALAN: I think it was
after CS50 was all done.
This is like a year ago now.
COLTON OGDEN: Two years ago.
It was 2016.
DAVID MALAN: We didn't even play
anything after last year's CS50 fair.
COLTON OGDEN: No, you're right.
It was like during J term of 2017.
DAVID MALAN: All right, so
the point is no, is really--
when I was a kid, I used to play
Commodore 64, Macintosh games.
We weren't allowed to
have a Nintendo, but I
used to play on my best friend's
Nintendo when he let me borrow it.
So--
COLTON OGDEN: That's rough.
DAVID MALAN: So did any of that
help me be a better programmer?
Yes.
You can-- yes.
CS50 at Harvard says that playing video
games makes you a better programmer.
Honestly, no, not in my case.
I don't think I played
enough or thought about it
to any kind of sophisticated
level where that's an issue.
But, Colton, maybe?
COLTON OGDEN: Certainly, games back
then were less flexible in that area.
Scottie mentions, I heard
you can program in Minecraft.
That's absolutely true.
There are certain mods in
Minecraft-- one of them
is called Computer Craft,
which actually lets you
program in Lua, which is pretty cool.
You can actually program the game
blocks in the world in Lua in real time.
It was called Computer Craft.
I'm not sure if it's
been redone recently.
This was several years ago.
But there's games like SHENZHEN
I-O, which is a hacking game.
And TIS-800, I believe, are
also another programming game.
Very sort of-- well, not
realistic, but kind of realistic.
It requires the same logic.
And I would say if you're looking for
games that could help you program,
definitely check those out.
As for what I'm playing, right now,
I'm playing Fallout New Vegas, which
is not helpful for programming at all.
DAVID MALAN: And Colton's
being modest here.
He's not plugged a
course he teaches here
at Harvard's Extension School,
that's now freely available on edX.
If you go to CS50.edx.org/games, that
will whisk you away to this website
here, where you can register for free.
They want to know our
location for some reason.
But you can enroll now for free
by clicking the audit option.
And this will actually introduce
you to Lua the programming language,
Love 2D, which is a gaming
library framework that you
can use to make your own
games, and, actually then, yes,
claim it is homework
to be playing games.
COLTON OGDEN: True.
Yeah, and I think, going
forward too, I would definitely
like to start streaming games made
using these same technologies.
So if any of those appeal to you,
definitely check the course out.
DAVID MALAN: And a question here--
"Guys, I'm planning to enroll
in a computer science degree,
but it will be after a year from now.
I finished CS50.
What course do you recommend me to study
during my preparing to the CS degree?"
Totally depends.
A typical undergraduate here
at Harvard, for instance,
would generally take
another software class,
specifically a class in
functional programming
or object-oriented programming.
And I think that's-- for those folks
interested in software especially
and introductory computer science--
it's a nice way to
kind of counterbalance
the very procedural programming
that we do in a class like CS50,
and a lot of courses also
do, albeit some courses
do focus on object-oriented stuff.
But I also recommend of course by
some of our friends at Princeton.
There's a free introduction to
algorithms class that's on Coursera.
If I do a quick Google here--
Coursera with Kevin Wayne and his
colleague on algorithms at Princeton--
this should be freely
available, part 1 and part 2.
You should be able to sign up with Bob
and with Kevin here on Coursera.org,
also for free.
And this is more
theoretically oriented class.
But that also gives
you another perspective
as to what CS is and should definitely
challenge you in a different way.
COLTON OGDEN: Do you
know Robert, by chance?
Have you met him?
DAVID MALAN: A little bit.
We've met once or twice.
But I know Kevin much better
from past conferences.
COLTON OGDEN: Interesting.
I happen to know his--
I've seen his books, Robert's book.
DAVID MALAN: Yeah, the two
of them now actually writes
quite a bit I think, together as well.
COLTON OGDEN: Nice.
And I think Trebinator mentioned--
you can program mods for it.
I think you have to use Java.
That is correct, yes, for Minecraft.
DAVID MALAN: So please feel free
to join us here again online.
We're going to figure out
what the next day and time is.
We'll publicize it on the Facebook
channel, Twitter, subreddit
and the like.
We'll figure out how to get
some of these assets online too,
if you want to follow along
later today or thereafter.
Certainly tell your friends
if you'd like to join as well.
COLTON OGDEN: Yeah, and definitely
let us know what sorts of topics
you all are interested in
us going over and teaching.
I know I'm going to probably
cater towards game programming,
but I'm more than happy to
do like Python and web stuff,
even though web's not necessarily
as much of a strong suit.
And I think we're also talking
to Brian about him setting up--
I think Brian is going to maybe set
up a microblog-- do a live microblog
implementation.
DAVID MALAN: Does Brian know this yet?
COLTON OGDEN: Yeah, he does.
DAVID MALAN: Oh, he does now.
OK.
He didn't this morning.
COLTON OGDEN: Oh, no, I
talked to him yesterday.
DAVID MALAN: Oh, you did.
Oh, OK, he knew as of yesterday.
COLTON OGDEN: But, yeah,
we have a bunch of ideas.
But let us know what's
obviously most valuable to you,
so we can have a better
idea of where to go.
DAVID MALAN: Yeah, I think we're
pretty much at the end point here.
Special thanks to Dan, who's
been off-camera here, helping us
run the show.
And all the credit for
this idea goes to Colton,
who spearheaded this idea with Twitch.
So we're really glad
you all tuned in today.
COLTON OGDEN: Yeah, thanks,
everybody for tuning in.
It was a lot of fun.
DAVID MALAN: Yeah.
See you online soon.
