[MUSIC PLAYING]
DAVID MALAN: All right.
So this is CS50, and this day
we continue our look at Python,
but also introduce another language.
This one is called SQL, or
Structured Query Language,
and it's going to allow us to solve
some different problems as well
as some old problems better.
But to do this, let's try to motivate
some of the problems initially.
So this, of course, is just
an example of a spreadsheet.
Odds are you use these
pretty frequently,
whether it's in the form of
Google Spreadsheets or Excel
or Apple Numbers and the like.
And this is actually the simplest
form of what we might call a database,
just some kind of application
that stores lots of data.
And odds are if you've ever taken
registrations for a student group
or had people sign up or
fill out a form or a survey,
odds are you've used Google Forms to
get your data into a Google Spreadsheet.
And so we thought we
would begin there today,
a representative problem whereby you
might want to collect a lot of data
from users and then
do something with it,
but ideally more powerfully than
you could do as a human on your own.
So we've put into place a
Google Form that quite simply
asks a couple of questions today.
What are your favorite TV shows?
And we ask you for the title of
your favorite show and the genre
or genres into which that show falls.
So if you wouldn't mind, go
ahead and open up this URL here
on your laptop or phone.
If you don't have one, that's fine.
You can use the person's
next to you, if I may.
And take just a moment to go to that URL
there and answer those two questions.
What is the title of your favorite TV
show, and into what genre or genres
does it fall?
Favorite TV shows.
And when you go to that URL, you'll
find yourself at, again, this form here,
asking for the title of your favorite show and its genres.
So by the look of my own screen here, we've got some 62 responses, now 74, coming in already, so we can track this live. So 111 responses so far.
That's a lot of favorite TV shows.
This number keeps going up: 169, then 191.
Keep them coming.
One per person is fine.
It will give us more than enough data.
246.
We'll give it another few
seconds, but suffice it to say,
this is kind of a lot of data.
And if the next step
in this process were
for me to analyze the data
because I'm curious to know
what's the most popular show
on campus or if I wanted
to do some other
analytical work on this,
it'll be useful to
have all of this data,
not in Google Form's
proprietary interface,
but in Google Spreadsheet form with rows
and columns, a format you're probably
pretty familiar with.
So we'll let this counter
keep going up and up and up,
and I'm going to go ahead
and click on responses.
And if we really wanted
to, we could look through--
The Office is up there, followed
by Game of Thrones, Friends, OK.
Perhaps pretty predictable here.
Let me go ahead and
click this icon, which
is going to open the Google
Spreadsheet version of this.
And you'll see that there's
three columns by default.
[LAUGHTER]
Oh, that's funny.
[APPLAUSE]
Thank you.
Thank you.
OK.
OK.
Well-played.
Now-- so you'll see that by default we get three columns. Timestamp is something Google Forms just does automatically, to timestamp the submission of the form.
Title comes from the first question.
Genres comes from the second question.
And if we kept scrolling
and scrolling, odds
are we're seeing more and more rows.
Let's propose that we have
enough of these rows already,
and how might we go
about analyzing the data?
Well, if you're pretty expert
with Google Spreadsheets or Excel
or Numbers, maybe you could
use some functions that
are built into these programs or you
could do some Command-F or Control-F
and search for the
favorite show that you're
looking for to try to run some numbers.
But we have programming
abilities nowadays.
And if you haven't used
this feature already,
if you go to Google Spreadsheets' File menu, you can go to Download, and you can download your data into a bunch of formats.
And the most useful
one I'd propose for now
is going to be this one,
Comma-Separated Values or CSV.
And we've seen this before.
In fact, you've generated this before.
So if I go ahead and do that, it pretty quickly downloads a CSV file that, if I really wanted to, I could open in Excel or Apple Numbers or the like.
But let's do something instead with
that, and go ahead and put it into,
for instance, my IDE so that I
can actually do something with it.
I'm going to grab my
downloaded file here.
And I'm going to go ahead
and drag that into the IDE.
And so now you'll see a
whole bunch of files here.
Let me put-- whoops.
Let me put this right in
the root of my folder.
And we'll see this file here.
And if I double-click on it,
you'll see exactly the same data,
albeit in CSV format.
So it looks the same, but we have
commas separating the values.
And Google Spreadsheets did
that for us automatically.
So let's now write a program
that analyzes this data
and see if we can't make sense of
what some of the most popular values
are in there.
Let me go ahead and close
this, open a new file,
and I'm going to go ahead and call
it, for instance, favorites.py,
since the goal is to figure
out everyone's favorites here.
And let me go ahead and use
some syntax from last week
that you might be familiar
with now from pset 6.
Let me go ahead and
import CSV, the module,
so that I have access to that
library of functionality.
Then let me do this: with open.
And then I'm going to go ahead and
open this file, and I could rename it,
but you know, I can just grab the
whole file name this way, too.
It's a little excessive, but I'll go
ahead and paste that in in read mode.
And that gives me a file.
Let me go ahead and zoom
out just a little bit there.
And once I have this file, I can
throw it in to a dict reader,
for instance, to make it
easier to read the CSV.
So reader gets csv.DictReader.
Pass in that file.
And then after this, I
have the ability, now,
to iterate over each row like this,
much like you probably did for DNA
or one or more other problems, as well.
Now, just as a sanity check, let
me just go ahead and print out
each of the row's titles.
So I'm not going to
see anything new here,
but this will be a little sanity
check that I'm writing correct code.
So run Python of favorites.py, and
voila, fwoom, there's all the data.
So The Office is both at the beginning
and at the end here, it seems.
All right, so that's all fine and good,
a nice stepping stone, but let's start
to analyze this data,
this time using some code.
So let me go ahead and say this.
Suppose I wanted to
keep track and determine
what the most popular show is.
So I've got this big CSV file
with title, title, title, title.
Invariably, there's going
to be some duplicates.
A bunch of you probably like
The Office or some other show,
and so it'd be nice to kind
of aggregate that data.
What might be a useful data structure
in Python whereby for each row
we can use this data
structure to just kind of do
the equivalent of like
1, 2, 3, and count each
of the occurrences of those titles?
Any-- OK, I heard a hash table,
but more properly in Python?
A dictionary.
So a dictionary is kind of the Swiss army knife of data structures for a lot of people, in the sense that it allows you to store keys and values, and those keys can be strings, and those values can be anything you want, including, for instance, in this case, numbers, like a counter. It's a really versatile data structure when you just want to throw some data into a data structure and do something with it.
So let me go ahead and give myself
a dictionary at the top of the file.
I can do this in a couple of ways,
and if my goal is to store the counts
of each of these titles,
I could call it counts,
and I could set it equal to dict(),
which is just a function that hands me
an empty dictionary.
Or, recall, I can use this curly-brace syntax, which is just equivalent shorthand notation.
But what I want to do now, inside of my loop, every time I iterate through this file, is check whether I've seen this title before. If so, I want to add 1 to my counter. But if I've never seen this title before, what do I want to do instead?
Yeah?
AUDIENCE: [INAUDIBLE]
DAVID MALAN: Add the
title to the dictionary
as a key with the value of
what, the first time I see it?
Well, the first time I see it.
AUDIENCE: 1.
DAVID MALAN: So we can
just initialize it to 1.
So there's two scenarios here.
Either we've seen this title
before and we want to do plus 1,
or we've not seen the title before
and we just want to set it equal to 1.
So how might we express that logic?
Well, let me go ahead
and say, first of all,
let's get the title into
a variable of its own,
just to shorten our syntax a little bit.
And then I can say something like this: if title in counts-- well, what does that mean? That means the title is already inside of my counts dictionary, so I can index into it at that location and increment whatever is there by 1, it having been set on some previous iteration; else, I go ahead and set counts of title equal to 1.
So these are the two scenarios.
If I've seen this title
before, go ahead and add 1
to its value in the dictionary.
Otherwise, just default to
I have seen it one time.
So some of the shows are going to
stay stuck at one if they're not
very popular among the
audience, or they're
going to keep getting
plus 1, plus 1, plus 1.
So what I often do when writing code like this, as I'm sort of finding my way, is not write the whole thing at once. Let me go ahead and just, at the end of this, print counts, right? I'm just curious to see what it is. So I'm just going to go ahead and add print(counts).
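For reference, the whole program at this point looks roughly like this. The file name here is a stand-in for the verbose Google export's name, and the column name "title" is an assumption matching the form's first question:

    import csv

    # Dictionary mapping each title to the number of times we've seen it
    counts = {}

    with open("favorites.csv", "r") as file:  # stand-in file name
        reader = csv.DictReader(file)
        for row in reader:
            title = row["title"]  # assumes the CSV's header calls this column "title"
            if title in counts:
                counts[title] += 1
            else:
                counts[title] = 1

    print(counts)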
Let me go ahead and run
now Python of favorites.
And voila, you see a huge,
messy data structure.
But let's go to the very top of it.
Notice that it starts with curly braces,
indicating it's, indeed, a dictionary.
It starts with a quoted value,
which is one of the titles in it,
and it looks like two of you said
Dynasty is your favorite show.
25 of you said The Office
is your favorite show.
One of you said Blindspot
is your favorite show.
One of you said 24.
Friends, 19, and so forth.
So the data is kind
of all over the place,
because it's based on the order
in which you all input it.
So let's just clean
this up a little bit.
And let me go ahead and
print this out as follows.
Instead of just printing
a big, messy dictionary,
why don't I go ahead and
iterate over the dictionary?
And we didn't see this
necessarily last week,
but you can actually
do something like this.
If your dictionary's storing titles
as keys and counts as values,
we can do something fancy like
for title comma count in counts,
but if you want to iterate over not just
the keys but all of the items therein,
there's this other function
in Python associated
with a dictionary that will hand
you back two things at once,
key and value together.
And you can call them anything you want.
I'm calling them title
comma count, respectively.
Now, let me go ahead and print out,
for instance, something like the title,
and then let me go ahead and just
print out the count right next to it.
So now, if I save this and run
favorites.py, OK, it's prettier,
but it's still a little
messy in that we have--
the numbers are just right
there after the name.
So let's go ahead and tidy this up.
Recall that when you print
two things with print
separating them with a
comma, what character
did you get automatically
in between them?
Yeah, just a single space.
We saw that at the very
beginning of last week.
But we can override that
separator, otherwise known as sep,
which has a default value of
quote-unquote with a space.
But suppose I just want to give
some visual delineation here,
maybe like a vertical bar to kind of
sort of make it look like a spreadsheet
again.
Let me go ahead and run this.
And now it's not super,
super pretty, but you
can see title followed by a vertical
bar followed by the number 1.
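In code, the printing loop now looks something like this:

    for title, count in counts.items():
        print(title, count, sep=" | ")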
So this is all fine and good,
but it'd be nice, for instance,
if we could sort these,
because I don't really
want to sift through this visually,
looking for all of the possible values.
So you know, the easiest way to sort
these things might be like this.
Instead of passing in
counts of items, there's
another function in Python where you
can literally just sort something
with sorted.
You don't have to worry about bubble
sort, selection sort, insertion sort,
merge sort, anything.
Just, you sort it, and Python will use one of the fastest implementations of sorting available to it.
Now, if I rerun this, we'll see that--
[CHUCKLES] OK.
We'll see that we have
all of these values
here starting with 24,
9009, Adventure Time.
But that's not what I meant.
I wanted to sort not
by title, but by what?
Count.
So this gets a little less obvious.
And it turns out Python does this
in a couple of possible ways.
But let me go ahead and
introduce this one first.
It turns out that this sorted function
takes an optional argument that
happens to be called key.
And key simply specifies how you want to sort the key-value pairs that are being passed in.
By default, it literally
uses the key, so title.
But if you want to use the value,
you know what you can do, is this.
Let me go ahead, and up here, even
though this is a little weird,
and let me go ahead
and say the following.
Let me define a function that takes an item. So def f(item).
I'm just giving myself
a pretty generic name
for a function, because I'm going
to get rid of it in a moment.
I'm going to go ahead and
return item bracket 1.
Now, why is this?
Take a look at this line here.
And even though sorted is new today, as is this counts.items(), all items does is return key comma value pairs, and sorted then sorts them by key, by default, handing them back as title comma count. Again, I could call these anything I want, but I'm calling them, more semantically usefully, title and count. So that's giving me back two things.
Well, just like an array
in C or a list in Python,
if you're getting two things at
once, the first of those things,
conventionally, is going to
be accessible in bracket zero.
The second thing is going to
be accessible in bracket 1.
So this super, super simple function
called f takes, as input, an item.
And if that item has two
values, a key and a value,
return item one is just
literally going to return
the second thing, whatever it is.
In this case, it's going to
return, of course, the value.
Now, why is this useful?
Well, because the sorted function takes
an additional argument called key,
you can actually pass it
the name of a function
to use in order to determine
how to sort the previous items.
So this is an example of
a really powerful feature
of Python and other higher-level
languages, in which case
you can pass in, as arguments,
the names of functions.
I'm not calling f.
I'm just passing in its name.
If I were calling it, I'd have
open paren and a closed paren,
but I'm just passing it in by name.
And now, this is going to, by default,
sort things from small to large.
Turns out if you read the
documentation, there's
another argument to sorted called
reverse that will invert it so that now
you'll get largest to smallest.
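Putting those pieces together, a sketch of the sorted version:

    # Helper that, given a (key, value) pair, returns the value
    def f(item):
        return item[1]

    # Sort by count, largest to smallest
    for title, count in sorted(counts.items(), key=f, reverse=True):
        print(title, count, sep=" | ")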
So now let's try this.
Let me go ahead and rerun
Python of favorites.py
enter, and now, if I
scroll up to the top,
we should see The Office is indeed
the most popular show, apparently
consistent with Netflix's
data these days, too,
followed by Friends, Game of Thrones,
Breaking Bad, Rick and Morty, Black
Mirror, and so forth.
And the least popular show among you,
although there looks like there's
a lot of ties, Demon Slayer.
[CHUCKLES] Or Game of Thrones, but GOT.
So interestingly, there are some--
[LAUGHTER]
There's-- OK.
OK.
So here, we've gone and
sorted these values.
Now, let me go ahead and
clean this up a little bit.
Much like in C where it's kind
of silly to define a variable
and then use it only in one
place, similarly in Python,
if you're going to define a function,
but then only use it in one place,
you don't actually need to.
There's a fancy one-line equivalent
where you can actually say this.
And you won't see this too often,
but it's a handy thing just
to have seen at least once.
You can define a function called a
lambda function that takes in as input
any argument like this, and then you
can literally just return that value.
So if I go ahead and get rid of this
function f and replace it with this,
the syntax is super
weird-looking at first glance,
but lambda is the term
of art in computer
science for an anonymous function.
I don't care that it's called f.
I'm never going to use it again.
So lambda just means give me a function.
Item means that is the input to this
function, the argument, and the return
value, or output, is going
to be item bracket 1.
So it's a little arcane.
Odds are you'll have to double check
the documentation when using this yourself in the future, but it's the same thing.
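So the one-line equivalent, with the lambda in place of f, is roughly:

    for title, count in sorted(counts.items(), key=lambda item: item[1], reverse=True):
        print(title, count, sep=" | ")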
And indeed, if I rerun
the program, I'm going
to get back the exact same results,
this time with the items still sorted
like that.
All right, any questions on the code
here, the features of the syntax?
Anything at all?
Yeah.
AUDIENCE: [INAUDIBLE]
DAVID MALAN: Yes, if you're writing
a lambda function like this,
you are indeed limited to one line.
That was OK, because f, even
though it was two lines, only
had one line of actual content.
But yes, it has to be one line, indeed.
All right.
So the GOT thing opens up some
interesting messiness, right?
Because GOT presumably denoted Game
of Thrones, and yet most of you
wrote Game of Thrones,
capitalized properly,
and yet, it'd be nice if we
could have some equivalences.
And honestly, I'm guessing there's a
bunch of capitalization differences,
too.
So you know what I
might be inclined to do?
What might be the easiest
way for us to standardize
what we're calling these shows'
titles that would be a little more
resilient against some of
you typing with capitals,
some of you typing with lower case?
Let's ignore the GOT problem, though.
Yeah, so make them all lowercase,
or heck, make them all upper case.
It doesn't matter so long
as you're consistent.
So you know what?
Before I get title from
the row, let me go ahead
and force it all to lower case,
and then let me go ahead and--
in my-- go ahead and force
it to lower case there.
So no matter what the
input is, I'm going
to go ahead and force it
on entry to lower case.
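So the loop's body becomes something like this:

    for row in reader:
        title = row["title"].lower()  # force the title to lowercase on entry
        if title in counts:
            counts[title] += 1
        else:
            counts[title] = 1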
Now, if I do this and scroll back up,
looks like our counts went up a little,
right?
The Office is now at 27.
Game of Thrones is now at 21.
Friends is at 20.
So it looks like some of you,
kind of, reasonably if sloppily,
didn't capitalize it exactly as
the show officially is called.
But that should be OK, right?
This is all about UX or user experience.
This should be one stepping stone toward
making your program much, much more
resilient against user input.
But you know what?
What if there were a better
way to do this, right?
Instead of writing code,
this feels pretty tedious.
Could we not just use a different
tool, different language,
and achieve these same goals?
Today's answer is indeed
going to be yes, with SQL.
But first a question.
AUDIENCE: [INAUDIBLE]
DAVID MALAN: No?
No question.
All right, so then the
segue is exactly this.
I'm going to go ahead now
and close favorites.py.
And I'm just going to maximize the
size of my terminal window here.
And I'm going to go ahead and
run a program called SQLite3.
SQLite3 is a command line program that
you can install on your Mac or PC--
it's already on CS50 IDE--
that allows you to use another
language called SQL interactively.
And it's got this really
cool feature whereby
if I have this file called, very
verbosely, CS50 2019 Lecture 7,
and so forth, I'm going to copy
that after highlighting it.
I'm going to run SQLite3, and I'm
going to put SQLite3 into .mode CSV.
It's a weird syntax, .mode CSV, but you
don't have to use these commands very
often.
Then I'm going to use one other
command here, which is called import.
And I'm going to go ahead and
import this into a database.
But what database, exactly?
Well, let me go ahead and more
specifically-- let me exit out of this
here.
Whoops.
Let me exit out of this here and run
SQLite3 in the name of a database.
I'm dealing with favorite
shows, so maybe I'll
call this favorites.db,
DB denoting database.
I'm going to now go
ahead and do .mode CSV.
I'm going to now do .import,
quoting the name of the file,
and quotes are important, because
it's got some ugly spaces in it,
and I'm going to import this into a
table, kind of like a spreadsheet,
that I'm just going to call
in all lowercase favorites.
And hit Enter.
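Reconstructed, the full sequence of commands at the terminal and at SQLite's prompt looks roughly like this. (The CSV's file name is shortened here to a stand-in; the real one is the verbose export's name, hence the quotes.)

    $ sqlite3 favorites.db
    sqlite> .mode csv
    sqlite> .import "favorites.csv" favorites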
Nothing seems to happen now.
But if I go back to my
terminal window and type ls,
notice that in addition to my CSV
file and favorites.py and src7,
which I downloaded in
advance with today's code,
I now have this file
called favorites.db.
And if I rerun SQLite, it turns out when
I pass in the name of that database,
now I can ask the same kinds
of questions as before,
but I can do it with a different syntax.
I can go ahead and select all
the titles from my favorites
by typing this command, SELECT title
FROM favorites; Enter, and voila,
we get all of the titles there.
Suppose I want to sort
those titles as before,
or I can SELECT title FROM favorites ORDER BY title; ending with-- oh, sorry. Semicolons are back. Then I'm going to go ahead and hit Enter here.
And you'll see that
now it's alphabetized,
and there's all those
Offices by first letter.
Now, there are some bugs.
If you scroll down, you'll
see lowercase letters as well,
so we'd have to deal
with that situation, too.
But this seems to be a user-friendly
way of just selecting data
that I care about.
And watch this.
Let me try to group things together.
I'm going to go ahead
and select title, and I'm
going to go ahead and say
count the number of titles
from this table called favorites.
But I want to group
those things by title.
Now, we'll see in a moment what this
means, but the effect is what's cool.
Voila, I now have output
much like my Python program.
It's not really as orderly
as I would like, but notice
what it's done at the very top.
The very first line of output,
if I keep scrolling is this here.
On the left is my title,
then a vertical bar.
On the right is literally count of
title, how many titles match that.
So let me go ahead and do this.
Let me go ahead and first of all
limit this to the top 10 titles,
so I can just deal with this
and see all this at once.
So here we have some
alphabetized list of titles.
But now, suppose I want to go
ahead and rename this thing.
Let me go ahead and say that.
Let me go ahead and say AS-- COUNT(title) AS n.
Now notice that the title of
this output is title, and then n.
But the reason I did that
was so that I can do this.
I'm going to now do ORDER
BY n, descending order,
and then limit to the top 10.
Enter, and voila.
Now, using a completely
different language,
I think I've gotten the same results,
except for the lower casing issue,
which we can come back to another time.
But The Office is here
as the number one hit.
Friends then, 19.
Game of Thrones, 18, and so forth.
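So the full query, as built up on screen, is:

    SELECT title, COUNT(title) AS n FROM favorites GROUP BY title ORDER BY n DESC LIMIT 10;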
So what's then the motivation in part
for this new language called SQL?
And you've seen just
snippets of it here.
It allows us to solve
the same kind of problems
much more easily, because I
don't have to sit down and write
a whole darn Python program,
faster though it is than
C. I don't have to write custom code
just to answer questions about my data.
And in fact, if you pursue
the web track in a couple
of weeks' time or the
mobile track, both of those
are going to offer support for SQL
so that if you want to store data
on users or interactive information
on an app or a website application,
SQL is going to be ultimately where you
can put that data and access it later,
thereby writing even
less code than today.
So let's go ahead and take a look at
some of the features of these things.
What we've really just
done is introduce this,
something called a relational database.
The goal at hand is not to
rely on CSV files anymore.
CSVs are like the weakest
form of databases you can use.
Why?
Well, every time you want to look in a
CSV file for data, you have to open it.
You have to read it in.
You have to iterate over
it, line by line by line,
looking for the data you care
about, and then you're done.
That's going to be super slow when
you have a lot of data in a CSV file.
And in fact, in a
little bit today, we'll
download a really big
database in text form
that we're going to then see 500-plus
megabytes later is much better handled
by what are called
relational databases, pieces
of software you can run on your
Mac or PC or servers in a company.
All they do is store data and
provide you with faster access to it.
So SQLite3 is just a
command line program
via which we can
interact with that data.
It's not terribly compelling.
It's, again, a black and
white blinking prompt,
but we'll see a graphical user
interface, or GUI, in a bit,
too, that'll allow us to
navigate that same data.
Here's how we went ahead
and imported the data.
And .schema is the last of the commands.
In fact, let me go ahead
and type this at the prompt.
Not .mode or .import but .schema.
And you'll see what automatically
happened for me is something
like this, CREATE TABLE favorites.
This all happened automatically when
I imported my data, but in a bit,
we'll be doing this
more manually ourselves.
So where does this then bring
us in terms of features?
Well, in the world of
storing data, whether it's
storing users who've
registered for your websites
or the number of likes a
post on Instagram has gotten
or any form of data that you
might want to read or write,
there are four fundamental
operations in the world.
CRUD is the sort of crude
way of remembering this.
These stand for Create,
Read, Update, Delete.
And I dare say that with
these four operations
you can do anything you want
when it comes with data.
Creating that data, reading that
data, updating it, or deleting it.
Now, in the world of SQL, or
Structured Query Language, or S-Q-L,
this is just another
programming language.
It tends to be used only in
the context of databases,
but you can use it to solve problems
not unlike Python, but in the context
of the data you care about.
So with SQL, these four
CRUD operations actually
have different keywords mapped to them.
They happen to be these,
INSERT, SELECT, UPDATE, DELETE.
That is to say, unlike Python, which has
hundreds of functions available to you,
today is kind of neat insofar as SQL has maybe a few dozen keywords and functions total that you can use.
Much more manageable, and these are the
four that we'll use most frequently.
So beyond this, how do you
go about creating a table?
Well, what does that mean?
Well, relational database
is literally that.
It's like a fancier version of Google
Spreadsheets or Microsoft Excel
or Apple Numbers that allows
you to store all of your data
in rows and columns.
But it gives you these
four functions and more
via which to select data you care about,
or delete or update or insert data
that you care about as well.
So we'll see in a bit that this is going to be the syntax via which, in SQL, you can create a new table.
In Google Spreadsheets you would
literally go to the plus icon
and say New Sheet.
In SQL, you would express this more
programmatically with a line of code
like this.
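As a minimal sketch, such a statement for a table like our favorites one might look as follows. The column names and TEXT types here are assumptions; SQLite's .import, for what it's worth, defaults imported columns to TEXT:

    CREATE TABLE favorites (
        "Timestamp" TEXT,
        "title" TEXT,
        "genres" TEXT
    );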
But you're going to have to make
a few decisions along the way,
and that's because SQL
has its own data types.
Now, fortunately, they're pretty
reminiscent of C and Python.
But there's five main ones here.
Let's just walk through
a few examples thereof.
So first, INTEGER.
If you know in advance that you want
to store a whole bunch of information
in a database like someone's age
or the number of likes a post has
or anything that lends
itself to an integer,
SQL supports an integer data type.
But in some versions of SQL, you
actually have a decision to make.
Do you want an integer,
a smallint, or a bigint?
And these speak to, as in
C, the size, the number
of bits that are used to store it.
An integer is 32 bits.
A bigint is 64 bits.
A smallint is fewer bits than both of those, typically 16.
Now, we're using, for class,
something called SQLite.
It's a free and open source, very
user friendly version of SQL.
But there's others you
might have heard about.
So if you've ever heard
of the company Oracle,
their bread and butter
is a database program
called Oracle that is
a SQL database that
allows companies to store
huge amounts of data
and select and create and
update and delete data
using software they've written.
MySQL, Postgres, MariaDB, Microsoft Access, SQL Server, and bunches of others are all examples of SQL databases.
And they'll each support data
types, typically, like this.
We happen to be showing
you ones from Postgres,
which is a super popular open source
one that you can use on the internet,
for instance for final projects.
Real numbers, just like in Python or
C, are similar in spirit to floats.
However, you have a couple of options.
One is a real, which uses 32 bits.
Another is called double precision.
It's actually two words, but
it gives you 64 bits as well,
to give you more precision numerically.
Then you've got this catch-all known as numeric in SQL, whereby if you have a value that's number-like, essentially, but isn't just an integer, this category encompasses it: things like Boolean values; dates, which have a predefined format; years, which are typically four digits, YYYY; and date-times, which add something like 00:00:00 for hours, minutes, and seconds. And then you have other values as well.
What's nice about this
numeric type specifically
is you can actually solve the
floating point problem in both Python
and C. Recall that floats
or even doubles in C
ultimately have finite precision,
which is bad in terms of values
not quite adding up as you intend.
But with numeric data
types and databases,
you can specify exactly how many digits
you want before the decimal point
and after, maximally.
So if you're dealing with financial
information or scientific information,
you can be super, super precise thanks
to this data type in the database.
So finally, a problem that we've
solved, versus those other languages.
Text.
When it comes to storing
data in a database,
you can specify that your columns can be char(n), which isn't an individual character like C's char. You literally specify n, the number of characters that every cell in that column will be.
You never do this in Excel.
You never do this in Google
Spreadsheets or Apple Numbers.
In a database, though, you can tell
the database ever more precisely,
I want to store two characters
in every cell in some column.
Why?
Well, maybe like US state codes.
If you have MA for Massachusetts
or CA for California,
it might be nice to tell the database
that you can store, minimally
and maximally, two characters for
every value I'm going to give you.
Varchar is a little different.
This is a variable number of characters.
If your column might have a few
characters or a lot of characters based
on what some human types in,
you can give an upper bound
n on how many characters
will be stored in every cell.
So if you don't know in advance how long
a user's email address is going to be
or how long my name or your
name is going to be in advance,
you can specify varchar and then pick
some upper bound, some value that, ugh,
no one's going to have more than
20 characters in their name or 200
characters in their name, whatever
line you decide to draw in the sand.
As an aside, does anyone
want to conjecture
what a good upper bound is
for the number of characters
in a human name that might register
for a website you're making?
AUDIENCE: [INAUDIBLE]
DAVID MALAN: 25?
Pretty good.
AUDIENCE: [INAUDIBLE]
DAVID MALAN: 30, I heard.
AUDIENCE: [INAUDIBLE]
DAVID MALAN: 32, OK.
AUDIENCE: [INAUDIBLE]
DAVID MALAN: 45?
Higher.
60?
AUDIENCE: 64.
DAVID MALAN: 64?
This is the kind of thing that
Google, I dare say, is good at.
So let me go ahead and
open up Google real fast.
Suppose the goal at hand is to store, in a database table's column (that is, the equivalent of a spreadsheet's column), the longest name in the world.
All right.
Looks like we had 988 characters
in a fellow's name, Hubert.
And I won't even try to pronounce
his last name, but there it is.
So if you want to fill--
if you to be able to
fit everyone's name--
oh, and note, senior.
[LAUGHTER]
You might need even more
characters than that.
So this is a rabbit hole we
won't really go down today,
but suffice it to say that
the answers to these questions
are often not obvious.
And if you think about websites
you've visited in the real world,
whether on campus or off, odds are,
you have tried typing in some input
to some web-based form or
even some mobile application,
where all of a sudden your
keyboard stops working, right?
They won't tolerate any more characters.
Maybe it's the Common Application
and the length of your essay.
Maybe it's the length that
your resume can be in a field.
Maybe it's the length of your name
when registering for a website.
Well, why is that?
Those applications are probably
using a database, probably using SQL,
and they had to decide, in advance,
what is the maximum length of an input
that we're going to
tolerate from a human?
And then the computer
ultimately enforces that.
Why?
Well that would be a very clever
attack, for some bad actor out there
to exploit.
If you had no bounds on
how big the input could
be that a human could type
in, they could-- much like I
need to try the other
day with Emma's name--
just keep pasting pasting,
pasting, pasting a massive input,
hoping to overwhelm your computer's
memory, maybe make your server crash,
and therefore deny
service to other people.
So there are reasons for
these kinds of defenses.
Lastly is BLOB.
At the very top, Binary Large Object.
Don't need it typically
that often, but if you
want to store binary data, raw zeros and
ones that represent files or the like,
you can store it in a
BLOB format as well.
Now, just to give you a
whirlwind tour of other syntax
before we start using
it, you can insert data
into a database using
syntax quite like this.
You can select data from a database
using syntax like I already did.
You can use functions as well.
In fact, in just a bit, when we
load more data into the database,
I could actually select all of the
distinct names from a database--
and in fact, I could have
done this a moment ago.
Rather than SELECT title
FROM favorites, as I
did earlier-- that's going to give me
all of the titles in that database.
I could instead do something like
SELECT DISTINCT title FROM favorites,
and that's going to filter
out all of the duplicates
without me having to write a
single line of code to do that.
You can also count, as we did before.
You can compute averages,
mins, or max, which
is really useful for analytical
work that you might want
to do as a data scientist or the like.
And there's a whole bunch of other
functions available to you as well.
But there's some more powerful features.
And we'll begin to use
these in just a little bit.
We used WHERE before.
I used LIMIT before.
I used GROUP BY before.
And we'll see a bunch of examples
that reinforce all of these.
But there's keywords
like LIKE, whereby if you
want to search for everyone's input
who mentions The Office, you know what?
I bet we could do something
like this instead.
I know that I can SELECT
star for-- oh, let's go ahead
and SELECT title FROM favorites WHERE,
and this is kind of nonsensical,
but title equals "The
Office," quote unquote.
And I get back everyone who
typed in literally The Office.
However, if I instead did something
like this, "the office" in lower case,
I'm going to get one of you who
typed in "the office" in lowercase.
And the thing here is just
the title of this output.
But what if I want to do
anything like The Office?
I could say something like LIKE
"%office%", and it turns out in SQL,
this new language,
percent is a placeholder.
It's a wild card that just says,
eh, anything can go before,
and, eh, anything can go after.
So now I'm searching for any titles
that have the word The Office.
We picked up another entry here.
So one of you said Office, without the
word "the," also pretty reasonable.
Some of the uppercase and
lowercase is all over the place.
Now we've caught those.
And if you want to count the
total number of offices now,
now I can do something like this.
COUNT title-- so I can
combine all of these building
blocks, not unlike Scratch,
and get back the total number
of offices, which looks now to
be 33 when we tolerate a lot more
variability in our users' input.
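That combined query, for reference:

    SELECT COUNT(title) FROM favorites WHERE title LIKE "%office%";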
Most powerfully of all,
we'll see something
like JOIN in just a
little bit, and that'll
be all toward the end of using
databases better with better design.
But with SELECTs, we, of course, don't
just want to select all of the titles.
We might select where
some condition is true.
So conditions, just
like in Python and C,
exist in SQL, but at the end
of our queries, as we've seen.
If you want to update
something, for instance,
you can update data in
a database by saying
UPDATE table name SET the
column name equal to some value,
WHERE condition is true.
So how might I use this?
Well, let's again look at the
data where it's like The Office.
And you know what?
I'm going to go ahead and do this.
If I want to clean all of this
data up-- so data cleaning is
a very common approach,
whether you're doing
research or analytical work,
often when you have messy data,
the first pass is honestly grunt work.
Like, writing some code, or
god forbid, manually copying
and pasting in an Excel file,
just to clean up your data.
You can do this a lot
more powerfully in SQL.
So for instance, let me go ahead
and say UPDATE my favorites SET
title equal to "The Office,"
the canonical form, capital
T, capital O, WHERE title LIKE--
and now let me do this
wildcard search, knowing
that that's going to slurp
in all 30-plus of those rows.
But now, when I hit Enter,
nothing seems to happen.
But if I then SELECT title
again WHERE LIKE "%office%",
now they all look the same,
because I've updated my data.
And so in fact, now I could
go back to The Office,
because I've canonicalized my data,
and everything now looks the same.
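The update that did that canonicalization, for reference:

    UPDATE favorites SET title = "The Office" WHERE title LIKE "%office%";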
So with SQL, you can do that kind of cleanup as well. Now, let's go ahead and SELECT title FROM favorites.
And let's just limit this to 10 of them
this time, so they don't all fly by.
Suppose-- VEEP is a good show.
Maybe I'm not a fan of, say--
let's give ourselves 20
here, see what comes up next.
OK, I actually don't like Friends, even
though everyone in the world seems to.
No problem.
DELETE FROM favorites--
[LAUGHTER]
--WHERE title = "Friends";.
Now, no more Friends.
And so you can delete your data as well.
[LAUGHTER]
You can delete your data as well
as by using a query of this form.
This one is perhaps the
most dangerous of all.
Consider what might happen if you're
a little sloppy, a little hasty,
and omit the WHERE condition.
What happens if you go ahead
and DELETE FROM favorites;?
Everything goes away.
And slightly worse than that, though
we won't have occasion to use it,
there's another function
called DROP, where if you just
want to delete the table altogether,
delete all of your rows and columns,
you can say DROP TABLE favorites;,
and that will delete all of the data
as well.
And we'll come back to that, because
the mere fact that you can scarily
delete lots of data at
once makes you super
vulnerable to bad actors on the
internet or in the office who
want to potentially
wreak havoc on your data.
So we'll come back today, talking
about matters of security, as well.
So any questions before, now,
we take things up a notch
and look at a much more massive data set than our hundreds of favorites here?
Is that a question there?
Yeah.
AUDIENCE: [INAUDIBLE]
DAVID MALAN: Did not look
like Friends got removed.
Oh, interesting.
You are right.
Let's try a little harder.
LIKE "%friends%".
Better?
Now they're gone.
So I'm guessing-- I would have
to look closer at the data,
but I'm guessing one
or more people typed it
with a slightly
different capitalization,
or maybe they hit a spacebar
character or something, where
I wasn't doing a precise
match, thereby making
the LIKE operator a more robust fix.
However, let's consider
for just a moment.
What might the gotcha here be?
Like, this feels like a
nice quick and dirty fix,
but it really is quick
and dirty in what sense?
AUDIENCE: Like, if you
have another TV show
that has the word friend in it,
but not Friends, [INAUDIBLE]..
DAVID MALAN: Yeah, there
might be another show
with the word "friends" in it.
Quite reasonable.
And I just blew those away
as well from my database.
And in fact, we can
probably see this for real.
If I go ahead and look
at, for instance--
let me go into a browser here, IMDB.com,
super popular website for just getting
movie reviews or
information about movies.
Let me go ahead and search for friends.
And yes, the most popular hit seems
to be the 1994 version of Friends,
but there's Friends with Benefits.
There's Just Friends, We Are Your
Friends, My Best Friend's Wedding,
Fisherman's Friends, and so forth.
Some of those movies,
some of those TV shows.
And had we poked a little further among
our own data of hundreds of rows, maybe
some of you did have some of those
shows as your favorites as well.
So beware that, too.
So in preparing for
today, we wanted to play
with an even juicier data set then
one we could generate for ourselves.
And wonderfully, IMDB makes
a lot of their database
available for download as TSV
files, Tab-Separated Values.
It's really not fundamentally different from Comma-Separated Values.
One human at IMDB just decided
to use tabs to separate
their values instead of commas.
And if you read through
this page here, you'll
see that they let you download a
bunch of different files, one of which we'll focus on initially, specifically title.basics.tsv.gz.
So that's a mouthful,
but this is just saying
this is basic information
about titles in their database.
Gz means it's G-zipped,
which is like Zip.
If you're familiar, on
a Mac or PC, it's just
another way of compressing information,
but you can unzip that on a computer
as well.
Dot TSV means Tab-Separated Values.
Now, we read in advance
the information here,
but let me summarize it as follows.
There's some juicy
information in that file.
And I'm going to go ahead and
download this file over here.
If I click a link on
IMDB's website, I'll
see a file called title.basics.tsv.gz.
I'm going to go ahead and let
that download in the background.
It looks like it is 100-plus megabytes.
If I click on it on my Mac,
it's going to decompress it.
And you'll see here, this apparently is
532 megabytes once it's decompressed.
So that's way bigger than the
CSV file I downloaded from Google
a few minutes ago, which is
24 kilobytes, 24,000 bytes.
This is 532 million bytes.
So it's an order of magnitude bigger,
which means our design decisions today
are really going to matter.
Now, if we take a look
inside of this file,
we'll see a bunch of interesting fields.
And I've essentially summarized this
from their own online documentation.
There's these five fields that
are of interest right now,
and we'll tease these apart now.
So tconst, I didn't recognize that.
It's an IMDB-specific term, it seems,
but it means some kind of title
constant.
It's a unique identifier for
every movie in the database.
So for instance, one of the
movies in that TSV file,
which I'll open up in a moment,
has the unique value tt4786824.
Why?
Just because.
But it turns out we can
actually see what this is.
Let me actually grab that value
and go to IMDB.com/title/tt4786824.
Enter.
These actually do have significance.
That is the unique
identifier for The Crown,
popular hit TV show that you
can now watch on Netflix.
So if you go back to the actual field--
that's how I spent the summer.
So if you go back to the fields
here, we have not only tconst,
which uniquely identifies the
film, we also have title type.
So it turns out in
this 500 megabyte file,
there's huge amounts of information
on movies, TV series, documentaries,
shorts, way more data than we could
possibly play with in one day.
So we'll focus just on TV series,
like this, the show The Crown.
Primary title is the
name in the file that is
given to the primary title of the show.
Sometimes shows apparently
have different titles
in different geographies or
communities, so the primary title
is the one that most people know it by.
Start year is the year in which that TV show began.
The Crown came out first in 2016.
And genres, a little weirdly,
is a comma-separated list
of genres into which that show falls.
And I say it's a little weird,
because we do have a TSV file,
but they're using commas inside.
So this is probably among the motivations for having used tabs to separate all of the columns, because they want to use commas for some other purpose here.
But it's a little messy.
This is actually not necessarily
the best design decision here.
So what can we go ahead and
do with all of this data here?
So let's go ahead and start
to grab some of this data
and write some scripts involving it.
But let me issue a disclaimer.
It turns out when you have
a lot of data in a database,
it can take quite a long time
to analyze it or parse it.
And so if you ever watch certain
baking shows like Julia Child
from down the road here in
Cambridge, very often there
would be some TV magic whereby
someone mixes the cake mix
and puts it into the oven, and then
two minutes later, voila, it comes out.
That's because some of the
scripts we'll write today
might actually take many
minutes to run, but we'll
go ahead and run them
in the backgrounds,
and I'll use the outputs that
I've created here in advance.
So how do we go about beginning
to navigate this data set?
Well, first of all, let me go
ahead and just open this file.
And I'm going to use my
own terminal window here.
This file was called title.basics.tsv.
Notice that even in my program
here, it took a moment to open,
but this is what it looks like.
It's a little cryptic at first
glance, but notice the first row
is the headers of the file.
So we've got tconst and title type
and primary title and so forth.
Then every row thereafter seems to start
with one of those unique identifiers,
and they seem to be monotonically
increasing, 1, 2, 3, 4,
but they start with tt
then some number of digits.
Then you see that these are a bunch
of shorts, so short films or the like.
And if we kept scrolling,
we'd see some other values.
Here's the first movie in the database.
And if we keep scrolling and scrolling
and scrolling, we'll see more.
So if actually I use my text editor
here and search for literally the word
The Crown, it's going
to take a little while
to get to, because there's a
bunch of shows apparently called
The Crown or Crowning or something.
But you know what'll be faster?
Let me go ahead and search for--
what was it?
If I go back to tt4786824,
Enter, it takes a moment
to find this because it's searching
through hundreds of megabytes,
but voila, down here is
this tconst, TV series type.
It's called The Crown.
In another geography it's called
also The Crown, 2016, and so forth.
So that's a huge amount of data.
In fact, how much data?
Well, if I actually analyze this
with a command on my computer,
there are 6,267,469 titles in this file, be they TV shows or movies or documentaries or the like.
That's a huge data set.
So suppose I want to focus really
just on a subset of that data.
What can I go about using instead?
Well, let me go ahead and open
up, for instance, the same file,
but let me try putting it in CS50 IDE.
I'm going to go ahead and download
this, as you might have downloaded
past problem sets, with wget.
That's going to give
me a pretty big file.
There's a command now that
you've not had occasion
to use in the IDE called gunzip, which
will unzip a file that starts with--
that ends with .gz.
That's going to take a moment
to unzip the whole file.
A moment, a moment, a moment, OK.
[CHUCKLES] And no space left on disk.
But that's OK, because it
turns out what I'm also
going to start doing today is
using my own Mac a bit more.
And you'll recall last week,
I did that a little bit,
because when I wanted to play with
speech recognition in my microphone,
it's a lot easier to have Python
running on my Mac or your PC
if I want to use hardware
built into my computer.
Plus my MacBook, is a lot faster,
for instance, than a single cloud
account on a shared server.
So I'll go ahead and write some of this
code here on my own machine instead.
So let me go ahead and open
a file called import.py.
And I'm going to go ahead and
import the CSV module, as always,
and I'm going to go ahead and open this
file, which is called title.basics.tsv
in read-only mode, and I'm going
to call this variable titles.
What am I next going to do?
Let me give myself a reader
using csv.DictReader,
as before, reading in those titles.
But csv.DictReader feels
like the wrong reader.
Why?
Just to be clear.
Yeah, I mean, maybe I want TSV.
But it turns out a TSV module doesn't exist.
So even though I might be inclined,
for instance, to change this
to a T and this to a T, the
CSV module does enough for me,
but I need to tell it that I want
to use a different delimiter.
Instead of the default, which is a comma,
I can actually override
that and say, you know what?
Use a tab.
And just like backslash n in C
and in Python, it's a new line.
Backslash t in both
languages is a tab character.
All right, so once I've done this,
let me go ahead and open up, now--
let me go ahead and open up shows0.csv.
My goal in life now is to make this
file a lot more manageable for myself.
I want to take a 500-megabyte file and
extract only the TV shows therein.
Moreover, how about only the
TV shows from 1970 onward?
We won't go even further back than that.
So let me go ahead and open up, in
write mode, a file called shows0.csv.
And I'm just going to
call that variable shows.
Then I'm going to go ahead
and give myself a writer.
And you might recall
using this in the past.
This is simply going to give me a
variable called writer, via which
I can write to a new file.
Because again, the goal is to read
this file and write to this file
a subset of the data therein.
So let me go ahead and
write one row first.
Write a row, passing in a list
of values, specifically tconst,
which is the title--
which is the ID field;
primaryTitle, which
is the title field; startYear,
which is the year field; and genres, which was one more field that I mentioned earlier.
So my goal is to export only those
four columns that I care about for now.
So I'm going to go ahead and do this.
For each row in my reader, I'm
going to go ahead and say if that
row's titleType ==
tvSeries, which, recall,
The Crown was an example of, then
I'm going to go ahead and write
to the writer--
whoops-- writer.writerow,
a list containing what?
The row's tconst value, the row's
primary title value, the row--
whoops-- the row's start
year, and lastly, the row's--
[CHUCKLES] the row's genres.
So what am I doing
here, just to be clear?
So what I have done is I've
first written out-- once and only
once-- literally these
values, because I want
headers in the first line of my file.
After that, for each row in
the reader, which is currently
iterating over this
file in read-only mode,
I want to print out the current row's
tconst, current row's primary title,
the current row's start year,
and the current row genres.
But notice, I'm ignoring movies and
shorts and documentaries and bunches
of other values as well.
And you know what?
Just for good measure, let's
shrink this a little bit.
And how about the row's isAdult == "0", for today's purposes?
So that'll filter the list further.
If you'd like to flip
that later, that's fine.
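Assembled all together, this first version of import.py looks roughly like this:

    import csv

    # Read IMDb's big TSV and write a smaller CSV of just the TV series therein
    with open("title.basics.tsv", "r") as titles:
        reader = csv.DictReader(titles, delimiter="\t")
        with open("shows0.csv", "w") as shows:
            writer = csv.writer(shows)
            writer.writerow(["tconst", "primaryTitle", "startYear", "genres"])
            for row in reader:
                if row["titleType"] == "tvSeries" and row["isAdult"] == "0":
                    writer.writerow([row["tconst"], row["primaryTitle"], row["startYear"], row["genres"]])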
So let me go ahead and do Python 3,
for version 3 on my Mac-- actually,
no, let's do Python of
import.py, all right?
I'm going to cross my fingers here.
Hopefully the file is
working, working, working.
But it's taking a decent amount of time.
Like, this is how much time it
takes, apparently, to process over
millions of rows of data.
Still running.
Still running.
But the goal, again,
is to shrink the amount
of data I have to ultimately care
about so that we can actually
search it much more effectively.
So, OK, it actually finished.
So let me go ahead
and open up shows0.csv.
Notice now, in my text editor,
I've got a lot less data.
I've thrown away everything
I don't care about,
but I've been left with tconst,
primary title, start year, and genres,
and everything herein is now
consistent with that filtration.
But I haven't filtered
everything I said.
What did I say I wanted
to get rid of earlier?
Yeah, the shows before 1970.
And clearly some of these are
coming from 1940s and so forth.
So let's go ahead and get
rid of those, but see how.
This is the CSV file that we just
opened, but in Google Spreadsheet form.
So I literally just imported
it into Google Spreadsheets
that so we could see it.
Literally the same data as before,
and there are those 1940s movies.
But there's something curious
that I wanted to be mindful of.
If I scroll down in the start years
and keep going and going and going,
huh, those seem OK.
Those are increasing in order.
But let me try this.
Let me just poke around my
data, sorting in reverse order.
It's going to take a while, because even
this is a decent number of TV shows.
Notice this weirdness.
At the top of start year now,
once I've reversed sorted them,
there's a whole bunch of
backslash capital N's.
Now, this has nothing to do with
C, and nothing to do with Python.
It has everything to do
with the documentation.
If you read IMDB's data, as I only did
carefully eventually, you'll see this.
A backslash and capital N is used to denote that a particular field is missing or null for that title.
Now, this is important,
because if I want
to filter out movies
that are after 1970,
I need to be resilient against that.
So let me go ahead and do this.
So if the current row's startYear
does not equal backslash capital N,
then I'm going to go ahead and
check that it's a TV series
and that it's not an adult show.
So that would help.
But furthermore, how
can I check this here?
Let me do year.
And how can I convert row
startYear to an integer?
Well, everything in a
spreadsheet, by definition
of it having been in a spreadsheet,
or a TSV file or a CSV file, is text.
But start year looks like
years, so what Python
function can I use to
actually convert text that
resembles a number to an actual number?
Yeah.
So we can do something like this.
So I can convert year
to an int, and now I
can say, if year greater
than or equal to 1970,
now I'm going to go ahead
and do those lines instead.
Now, there's an opportunity,
surely, for better design,
because once your code
starts doing this,
you've done something
suboptimally, right?
This is not going to end well if
all of my code starts wrapping.
So I could clean up the logic in
a little bit, but let's go ahead
and run this just one more time.
This time, changing this to shows1.csv
so we can see slightly different
outputs.
Let me go ahead and
run Python import.py.
Huh, syntax error,
Unicode error, codec--
that's a weird one.
But this is because backslashes, recall,
in C and Python, have special meaning.
So when you do something like this, backslash capital N, even though it's not a lowercase n, the backslash, recall, is the escape character in C and Python.
So this is like telling Python,
this is a special character.
But it's not.
And we've never really
had occasion to do this,
but how would do you
think we could output
a literal backslash before a capital N?
Yeah.
So it turns out the solution
to this problem, usually,
no matter the language, is that if
you want a literal character, not an escape character, you literally put another backslash before it.
So even though this
looks a little funky now,
this backslash backslash capital
N literally will mean backslash N.
All right.
So now, let me go ahead
and run this on import.py.
This time, I'm hopefully
going to actually generate
a new file called shows1.csv
that has even less data that
actually is going to contain my
shows, but only a subset of them.
And let's go ahead and pull the
cake out of the oven this way.
This is what I get now this time.
So if I actually load the CSV,
shows1.csv, into Google Spreadsheet,
just because it's pretty easy to look
at than the black and white window,
now you can see that I apparently am
only getting shows 1970 and onward.
And indeed, if I sorted them,
I would see no backslash N's.
I would have thrown away everything
that doesn't meet that criteria.
Well, let me go ahead and
do one last thing here.
I'm going to go ahead
and make one more change.
And first, let's
improve the design here.
This indentation is the result of
my asking questions again and again
and again and indenting if and
only if those things are true.
But notice, you can start to
flip your logic here, right?
Instead of saying if the start
year does not equal backslash N,
what if I just do this and say continue?
I can then unindent this--
because if you've not used
it before in Python and in C,
if you say continue inside of a loop,
it's not going to continue down there,
it's going to [WHOOSH] continue
to the start of the loop again.
So via this logic, we can actually
keep wrapping around again and again.
And here, too, we could say, if
year less than 1970, I can go ahead
and say continue, which would then
allow me to unindent this as well.
So there are solutions,
design-wise, to actually avoiding
that infinite indentation.
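In code, that flattened version might look
something like this sketch, under the same
assumptions as before:

    for row in reader:
        # Skip anything that isn't a non-adult TV series.
        if row["titleType"] != "tvSeries" or row["isAdult"] != "0":
            continue
        # Skip rows whose start year is missing (\N).
        if row["startYear"] == "\\N":
            continue
        # Skip anything that started before 1970.
        year = int(row["startYear"])
        if year < 1970:
            continue
        writer.writerow([row["tconst"], row["primaryTitle"], year, row["genres"]])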
All right.
Let's go ahead and do one last version.
Then I'm going to go ahead and
pull one out of the oven that I
premade.
So the last thing I didn't
load before was this.
Suppose that I want to load into
the CSV file, all of the genres
associated with the show.
It looks like all of these shows have
one or more genres, just like you
were asked for your favorite shows.
And so now, we have a CSV file with
tconst, primary titles, start year,
and genres, where genres is
itself a comma-separated list.
But there's a fundamental problem here.
Even though I have all of this
data here, the best I can do
is a program like this.
Let me go ahead and search for,
for instance, the following.
Let me go ahead and
grab a file real fast.
Let me go ahead and grab
a copy of shows2.csv
and write one final program here.
If I want to go ahead now and search
this very large, still, data set,
in shows2.csv, well, let
me go ahead and do this.
import csv.
Let me go ahead now and
ask the user for a title.
I could use CS50's get string, but
there's really no need for that
anymore now that we have the title
function-- the input function, recall.
So I'll just use that.
And then I'm going to go ahead and
open up shows2.csv in read-only mode.
And I'm going to call that my file.
Then I'm going to go ahead and give
myself a reader from csv.DictReader,
passing in that file.
And now I'm going to go ahead and,
for row in reader, do the following.
The goal now is to write a
Python program that allows
me to search only those TV shows.
So I could say something like this,
if title == row "primaryTitle",
then I can go ahead, for instance,
and print out row "primaryTitle,"
started in row "startYear."
So what is the goal of this program?
It's going to ask the user for input.
It's going to open this
big CSV that I've created.
But that's still smaller
than the 50-megabyte version.
It's going to iterate over every
row in that file via DictReader,
checking if the title the human typed in
equals the current row's primary title.
And if so, it's going to print
the title and year of that show.
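Assembled, search.py is roughly the following;
the prompt and the exact print formatting here are guesses:

    import csv

    # Ask the human which title to search for.
    title = input("Title: ")

    # Linear search: read every row of shows2.csv, top to bottom.
    with open("shows2.csv", "r") as file:
        reader = csv.DictReader(file)
        for row in reader:
            if title == row["primaryTitle"]:
                print(row["primaryTitle"], "started in", row["startYear"])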
So if I go ahead and
run Python on search.py,
typing in something like The Crown,
Enter, voila, I get that answer.
If I go ahead and do The Office,
Enter, there's a bunch of Offices.
And in fact, if you haven't
seen more than the American one,
there's the UK one, and
apparently several others.
And we can actually
corroborate this now.
If you go to IMDB, where all of
this data originally came from,
and type The Office, there are all
of The Offices that actually line up
with our very own data set.
It's going to be so
damn tedious to write
a Python program every time
you want to search for data
or update data or insert new data,
so we need a few more features of SQL.
But I think first, we
need some Halloween candy
for our five-minute break out back.
All right, we are back.
So we have a whole bunch more
data now, because we've downloaded
that really big TSV file from IMDB.
I've simplified it into some CSV files,
but that really only gets me half of the way
there, because now if I want to
search the data, as with search.py,
I still have to look over
thousands of rows of TV shows,
and it's only going to be linear search
by nature of just how files are read,
top to bottom, left to right.
But it turns out in Python that you
can actually write SQL code itself.
And CS50 has a Python library that
has not only get string and get int
and so forth, but it also has a SQL
function built in that allows you
to connect to, so to speak, a file
that ends in something like .db,
which is to say, in a moment, we'll
start to write some Python code now
toward an end of loading a really large
data set like IMDB's into a proper SQL
database, thereby allowing us
hereafter to use all of the power
and expressiveness of SQL, and more
examples of that in just a bit like
SELECT, UPDATE, DELETE, and INSERT,
without having to write a whole bunch
of Python code.
So to be clear, instead of using Python
to search and manipulate our data,
we're going to write a script, a program
in Python, whose sole purpose in life
is to get data from one
format into another.
And you can imagine this being
generally useful, whether it's
a Google Spreadsheet you've
downloaded or a large data
set you found on the internet that
you want to use for a final project
or for some other class.
Python can be a really powerful way
of taking one data source as input
and producing, as its output now,
as of today, SQL instead.
So let's go ahead and iterate one
final time over our title.basics.tsv.
But this time, not just
save it into a CSV file.
Let's put it into a proper SQL
database on my own Mac or your PC.
So let me go ahead and do this.
First, let me go ahead and say, just
like on the slide, db = cs50.SQL,
and then quote unquote, "sqlite:///"--
so the third slash is not a typo.
It should indeed be there.
And I'm going to say
shows3.db, just because this
is version 3 now of my import script.
I'm going to go ahead, just as
last week, and now import csv--
and CS50's library as well.
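Concretely, the top of this version of import.py
might look like this sketch; note that the class
in CS50's library is spelled SQL in uppercase:

    import csv

    import cs50

    # The third slash is not a typo: sqlite:// is the scheme,
    # and / begins the path to the database file.
    db = cs50.SQL("sqlite:///shows3.db")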
But for this to work, the file
shows3.db needs to exist first.
And there's a couple of ways on a Mac
or a PC or a Linux computer, typically,
to create an empty file
that's ready to receive data.
You can literally use
the command touch, which
just will create an empty file by
whatever name you type at the prompt.
Or we can do this programmatically.
And I'm going to do it
programmatically, because I
bet I'm going to screw up
one or more times here,
and it's going to be useful to let
my Python program create and recreate
the database again and again
and again until I get it right.
So let me go ahead and open a file
called shows3.db in write mode.
And recall, from Python,
and from C's fopen,
anytime you open a file in write mode,
it will overwrite any file that exists
or create any file that doesn't.
That's all I needed to do.
So in Python, too, recall that we
were able to use this dot notation.
And it turns out here,
when you open a file,
if you want to immediately close it,
because your only goal was to create
it, you can just do .close on
the very thing you just opened.
That is equivalent, just
to be clear, to doing
something a little more pedantic like
this, file = open, and then file.close.
But we can collapse this into
one slick one-liner, so to speak,
by just doing instead
what I did a moment ago.
All that does is create
an empty shows3.db file.
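That is, roughly:

    # The pedantic version: open the file in write mode, then close it.
    file = open("shows3.db", "w")
    file.close()

    # The one-liner: opening in write mode creates (or truncates) the
    # file, and .close is called on the very thing just opened.
    open("shows3.db", "w").close()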
Now, open that file for SQLite.
And again, SQLite is
the light version of SQL
that anyone can use on
their own Mac or PC.
You don't need a special server
to get up and running with it.
So now let me go ahead and
open up title.basics.tsv.
And then here, let me go ahead
and create myself a DictReader so
that I can iterate over
the lines in that TSV file.
And now, let me go ahead, and for
row in reader, do the following.
I first want to filter
out stuff just as before.
So I'm going to say if
row["titleType"] == "tvSeries"
and row["isAdult"] == quote unquote "0",
then I'm going to go ahead and check
one other thing.
I'm going to go ahead and give myself a
start year variable, similar to before,
although I called it year earlier.
Then let me go ahead
and do row "startYear",
just so I can cast that to an int.
But I only want to do that
if row "startYear" does not
equal that special backslash N
that IMDB told me to watch out for.
So I only want to do
that if it's not that.
And then if startYear is
greater than or equal to 1970,
let's go ahead and do the following.
Let's go ahead and do
genres = row["genres"].
Let's go ahead and get
tconst = row["tconst"],
just so I can put these in some
slightly shorter variable names,
just to keep myself sane.
primaryTitle is going to
be from row["primaryTitle"].
and then let me go
ahead and give myself--
we already have startYear, so those
are the only other three fields I need.
So now, I want to go ahead and
insert this row from my TSV
into a SQLite database.
And the operative word that we saw
earlier, but haven't used yet,
is INSERT.
We did use SELECT.
We did use UPDATE.
We did use DELETE.
We haven't used INSERT yet.
So I'm going to do that in a moment.
But first, I need my
database to actually exist,
so I need to create an actual table.
So I'm going to go up here first
and do this. db is a reference now,
a variable representing
my database, and I'm
going to call the only
function inside of it that's
useful for our purposes, called execute.
What I can now do is
execute any SQL I want.
So what do I want to
load into this database?
I think I want to load in the tconst,
the primaryTitle, the startYear,
and the genres, just like we had
earlier from title.basics.tsv.
I want to load rows that represent
this kind of data, all right?
So how am I going to do this?
Well, let me go ahead
and create a table.
I'm going to call it shows, because that
seems nice and conceptually consistent.
I'm going to go ahead and
create a list of columns now.
tconst is going to be one column.
primaryTitle is going to be another.
startYear is going to be another.
And genres is going to be the last.
I can literally, that is to say,
write SQL inside of a string
that I pass to a Python
function called db.execute.
And because db.execute--
or rather, because db
was configured with shows3.db, when
I execute this string in Python,
it's going to get executed on
that database file, shows3.db.
So it's a nice way of
bridging these two worlds.
So I'm going to have to be a
little more specific, though.
Recall that SQL has a bunch of types.
And I'm going to keep it simple.
I'm going to go ahead and say that
the type of that tconst value is text.
The type of the primaryTitle is text.
The type of startYear is going
to be numeric, kind of a catch-all
for dates and date times.
And then genres is going
to be text as well.
So the syntax is a little funky.
You actually specify the name
of the column and then the type,
as opposed to the opposite, which we
did in C. But that's the way SQL is.
So I'm going to go
ahead and save that now.
And just to comment this,
this is going to create
a table called shows in
a database file called shows3.db,
just to be super explicit.
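That line, then, is roughly this sketch; SQLite
happily accepts these loose type names:

    # Create a table called shows, with four columns, in shows3.db.
    db.execute("CREATE TABLE shows (tconst TEXT, primaryTitle TEXT, startYear NUMERIC, genres TEXT)")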
So what am I going to do down here?
It looks like I have the
ability with CS50's library
to execute any SQL I want.
So let me go ahead and insert
into shows the following values,
a tconst, a primaryTitle,
a startYear, and genres.
What values?
I want to insert these values.
Now, I don't know in
advance, so I'm literally
going to put some question marks here.
And it turns out in SQL, this is valid
syntax for the library we're using.
This is this INSERT query.
INSERT INTO the table
name a parenthesized list
of the columns you want
to insert data into.
Then, a set of values
in separate parentheses.
And for now, I'm using question
marks for placeholders,
for reasons we'll come back to.
But I'm going to go ahead
and plug the following values
into those placeholders, tconst,
primaryTitle, startYear, and genres.
And what the db.execute
function is going
to do for me automatically is it's
going to look at this SQL query.
Notice that, oh, it's got four
question marks in it, or placeholders.
Those, in SQL, are like the %s
in C or the curly braces
in Python f-strings.
So this says, give me 1, 2, 3, 4
placeholders and plug in, ultimately,
the following four values,
tconst, which is just a variable;
primaryTitle, which is the same;
startYear, which is the same;
and genres, which is the same.
So what am I going to do?
For every TV series in this
file that's not an adult series
and that started in 1970 or later,
insert it into my database.
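So the line inside the loop is roughly this:

    # Insert one row per qualifying series; db.execute plugs the four
    # variables into the four ? placeholders safely.
    db.execute("INSERT INTO shows (tconst, primaryTitle, startYear, genres) VALUES(?, ?, ?, ?)",
               tconst, primaryTitle, startYear, genres)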
If I've made no typos, I'm going to go
ahead and run this, cross my fingers,
and enter.
This one is going to take more
time, because it turns out writing
to a CSV file is actually pretty quick.
You can just write row, write
row, write row, write row,
but inserting into a SQLite
database is going to take more time.
More time upfront, but it's going to
be a lot faster to search thereafter.
So let me go ahead and do
the cake in the oven thing
and go ahead and now
open up a file I made
in advance in today's src3
directory called shows3.db
using SQLite3, that command
line program we used earlier.
Recall that I can say .schema to see
the types of data in the database.
And indeed, look what
I've done in advance.
I created a table called shows
with exactly these four columns: tconst,
primaryTitle, startYear, and genres.
But I did this in advance to save
us time so that I can now do SELECT
* FROM shows.
And let me not get all of them.
Let me do the first 10, semicolon.
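That is, presumably something like:

    SELECT * FROM shows LIMIT 10;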
All right.
So we see the first 10 shows
from IMDB in whatever order
IMDB distributes them.
You can see their ID
numbers are incrementing.
And All My Children.
I remember growing up
with that show years ago.
And it seems that that's a drama--
it's not really mystery, but so
be it, and a romance show there.
But it's indeed 1970, as is every
show thereafter, 1970 or onward.
I can go ahead and search
for more like this.
Let me give myself the first 100 shows.
All right, so this is a
pretty large data set.
And let me go ahead and count them all.
Recall that you can use a
COUNT function so that we don't
have to print them all on my screen.
It looks like there are
153,331 TV series in IMDB.
Crazier than that, you want to know
how many of them came out this year?
WHERE startYear = 2019;.
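Spelled out, that query is presumably:

    SELECT COUNT(*) FROM shows WHERE startYear = 2019;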
There were 6,099 new TV shows in the
world this year according to IMDB.
Just one of those is The
Office that won our--
won our vote earlier.
So SELECT *, star denoting wildcard,
in this case, everything.
And unfortunately, SQL
has two wildcards.
Star means select all of
the columns in the table.
Percent means let any characters
come before or after a quoted
string in a WHERE clause.
So let me go ahead and SELECT *
FROM shows WHERE title = The Office,
and we'll actually see-- whoops.
primaryTitle, sorry.
Let me fix that.
primaryTitle = The Office.
There are all of those Offices.
And indeed, 2005 is probably
the one we know and love.
If I go to a browser and go to
IMDB slash title slash that ID,
indeed, that's probably the
one we're all thinking of,
unless you voted for
the UK version instead.
So again, this is actually real, live
data that we're now playing with.
Well, what more can we do?
Well, there's one thing
that I don't really
like about this, which is that
when we select all of the columns--
and let's go ahead and do this.
Let's select another
hundred of them before.
This feels a little messy that we have
all of these nice, clean columns except
for when we get to genres.
Then we just have this
arbitrary comma-separated list.
Suppose, for instance,
I want to search for all
of the comedies that came out in 2019.
I could say SELECT * FROM shows
where genres = "Comedy" AND--
turns out you can use
conjunctions like this--
startYear = 2019.
So that gives me a whole bunch.
Let's count them.
So COUNT this here, Enter.
OK, a thousand of those
6,000 shows are comedies.
But I think that's an underestimate.
Why is this query buggy at the moment?
Yeah?
AUDIENCE: [INAUDIBLE]
DAVID MALAN: Yeah, some of
them had more than one genre,
so comedy is somewhere in
that comma-separated list.
And so what I should probably do
instead is not say genre = "Comedy,"
but maybe genres LIKE "Comedy," and
allow something maybe to appear before,
something maybe to appear after.
And that's going to give
me 1,593 comedies that
came out this year in 2019.
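In other words, something like this, where each %
lets any characters, or none, appear on that side:

    SELECT COUNT(*) FROM shows WHERE genres LIKE '%Comedy%' AND startYear = 2019;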
So that seems a little better.
But this is not very robust, right?
Once you start resorting
to techniques like this,
it should start, as a programmer,
to rub you the wrong way.
It's kind of a hack, right?
Like, you're searching for comedy,
but there could be something before it
or something after it,
and odds are, there
is no other word I can think
of in the world of genres
that starts with or ends with
comedy, so we're probably OK.
But this is kind of hack-ish, that
you're just kind of searching.
It would be nice if we could just search
for a specific column called genre.
So how can we go about doing that?
Well, let me go ahead and
do the following instead.
Let me go ahead and open up a
final version of my import script,
this one that does two things up top.
At the top, I'm going to create two
tables, one called shows, which has--
I'm going to clean up the column names, too.
IMDB is a little unconventional.
What most people would do when
describing a unique identifier,
they're going to call it ID, not tconst.
So we're going to rename it to ID.
They're not going to call
their title primaryTitle.
They're going to call it title,
so we're going to rename it title.
They're not going to name it startYear.
We're going to call it year.
And then that's it.
We'll come back to primary
key in just a moment.
But notice this.
In my new and final version
of this script, I'm creating,
I propose, a second table called
genres whose purpose in life
is to contain a value called show_id
and another one called genre.
So what's going on?
Well, let me go ahead and show this.
If I load the resulting database from
this one, shows4.db, and I do .schema,
you'll see that I
indeed have two tables.
Let me go ahead and SELECT * FROM
shows WHERE title, this time,
because I've renamed it from
primaryTitle, = The Office.
OK.
That's a lot of Offices.
But let's go, AND year = 2005, which
is the one we're all thinking about.
And it's that one.
And now, notice this.
Notice that I'm getting back what?
An ID, I'm getting back a title, and
I'm getting back a year, but no genres.
That's because there's another table
now called genres that's separate.
And you know, I'm kind of curious.
I see that the genres table has show_id.
Let me go and do this.
SELECT * FROM genres
WHERE show_id =, and let
me do a little copy paste
here, this show's ID.
And what might I see?
Comedy.
So what have we done now?
For any TV show that was
in IMDB's database that
was a comma-separated list of genres,
I've exploded it, so to speak.
I've split that value on the commas.
And if the show is a comedy, I've
added a row in these genres table,
but then I've jotted down the
show's ID next to that genre
so I remember that that
show was of that genre.
But if another show has
multiple genres-- for instance,
let's go ahead and search for not
The Office but, say, The Crown.
And there's only one of those.
And now I do SELECT * FROM genres
WHERE show_id = this number,
we'll see that, oh, The Crown now
has drama and history as a genre.
And so therefore, in
the genres table, notice
that there's two IDs and two genres.
But now we can use equality.
I can now search for all of the comedies
from 2019 in kind of a powerful way,
SELECT * FROM shows WHERE year = 2019
AND id IN-- and here's the cool part--
SELECT show_id FROM genres
WHERE genre = "Comedy".
So this is admittedly a mouthful,
but let's consider what's happening.
First, I'm selecting star from shows.
That means give me all the
shows in all of the columns.
But filter as follows, only show me
those shows where the ID of the show
is in the following list of IDs.
Now, you can look at a nested
query inside the parentheses here.
This list here selects all of the
show IDs from the genres table
where genre = "Comedy".
So the highlighted parenthetical
right now returns essentially a list
of all of the ID numbers of shows
that are associated with comedy,
even if they're associated
with other things, too.
And we're making sure
that the year equals 2019.
So if I now hit Enter, we'll see a whole
bunch of results, but we should see,
if I count these by
using my usual syntax,
there were 1,593 shows
that are comedies in 2019.
That does happen to equal the same
count we did earlier by using LIKE,
but this is better designed in the
sense that there's no ambiguity.
You're not just hackishly
looking for a substring,
so to speak, in a comma-separated list.
You can actually now
search more robustly
by having redesigned your data.
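Put together, the counting version of that query is roughly:

    SELECT COUNT(*) FROM shows
    WHERE year = 2019
    AND id IN (SELECT show_id FROM genres WHERE genre = 'Comedy');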
And what we've done really
is something like this.
Instead of storing our data in
just one table called shows,
and every show has an ID, a
title, a year, and genres,
we've instead exploded those genres
into two separate tables, such
that now our shows
table looks like this.
We have an ID, a title, and a year.
And notice that the ID of a
show can also appear over here
in another table called genres, and
just by convention, to keep us sane,
instead of calling it
ID here, we've called
it show_id to connote that it came
from a table called shows, plural,
but it's a single ID from a show.
So by convention, humans often call them
table name, minus the S, underscore ID.
And then the keyword here or genre
is comedy or drama or documentary
or some other genre as well.
And now, this is a little
subtle, but the fact
that this little symbol here,
drawn from the database world,
flails out into three
separate places, this
is a one-to-many
relationship, so to speak.
You can have one show over here
mapping to many genres over here,
or maybe zero, but it's zero
or more possible genres.
All right.
Any questions just yet?
All right.
So the real power then, to
be clear, is coming, now,
from this kind of expressiveness.
So now, let's play around with
some other queries and features.
But first, let's give
ourselves a bit more data.
It turns out besides the
file called title.basics.tsv,
IMDB.com makes a bunch of
others available to us as well.
There is one called name.basics.tsv.
And this is one that
has information on all
of the names of actors and
actresses and directors and writers
and other people in the world.
So for instance, there is an nconst,
which is like a name constant
or an ID, nm2946516,
which happens to belong
to the actor whose primary name is
Claire Foy, the star of The Crown.
She was born in 1984, and there are
some other fields in that file as well.
But also juicy is this
file, title.principals.tsv,
and this is where it
gets interesting, too.
In this file, notice there
are no actual titles.
There's no primary titles.
There's no actual human names.
Instead, there's just two unique
identifiers, a tconst and an nconst,
which IMDB speak for a title
identifier and a name identifier.
So for instance, in one of the
rows in this TSV file called
title.principals.tsv, there is a
row that starts with tt4786824;
also has nm2946516, and
has the word actress,
thereby implying that
if you look up the nm--
the nconst in the names file and you
look up the tconst in the titles file,
you will be able to, by transitivity,
infer that Claire Foy is in The Crown.
This allows us to have a
many-to-many relationship.
One movie or show
can have many actors,
and one actor can be in many shows, so
we're using a sort of join file here,
a join table, that's
going to somehow allow
us to link two different data sets
together, and more on that in a moment.
But what's really fun in IMDB is
that it also has a bunch of ratings
that humans have typed in saying, I
give this a 10 out of 10, a 0 out of 10,
and so forth.
And they keep track of the number
of votes that shows have gotten.
And so in title.ratings.tsv, yet
another file you can download from IMDB,
you can look up, for a given tconst,
the unique identifier for a title,
its average rating
and its number of votes.
And in fact, if I pull this up, for
instance, on The Crown, if I go back
to IMDB itself--
IMDB, search for The
Crown, the 2016 version,
you'll see that indeed, it is an
8.7, which lines up with exactly what
we have here.
But over time, that number is going to
go up or down, because IMDB is updating
their data set every day as well.
So besides this data, we
also then have the ability
to consider what this all
looks like collectively.
So in this case here,
here is another diagram.
It's more complicated now, but
it just captures the intuition
that you would derive by just reading
through IMDB's documentation, which
defines the following.
It turns out if you read
closely among those files,
you'll see that, oh, we can glean a
whole bunch of shows that are going
to have IDs, title, year, and episodes.
I want to go ahead and associate those
shows with a whole bunch of stars.
But people are the entities in the
world that have IDs, names, and births.
So now things get a little weird.
Let's focus only on
these two tables here.
This is a diagram of the
goal at hand, and this
is a script I wrote in advance.
And in a moment, we'll open up a
SQL database that represents this.
There's going to be a table
called shows, every row of which
has an ID, title, year, and
some number of episodes,
so you can see how long-running it is.
There's also going to be
a table called people.
Claire Foy is going to be among them.
She and the other humans will have an
ID, name, and birth year associated
with them as well.
But there's going to be this
other table here called stars.
And you'll notice there's a line
that links shows to stars to people,
again by transitivity.
There is, in the stars table, a show
ID, a.k.a. tconst, and a person ID,
a.k.a. nconst, that links, for instance,
Claire Foy to The Crown.
It's going to link Steve
Carell to The Office.
It's going to link every other
actor to their show as well.
Similarly for writers, we won't play
too much with the writers today,
but writers are people, too,
just as stars are people.
And so here's another feature
or design goal of SQL.
You want to ideally factor
out the commonalities, so
that Claire Foy appears
only in one place by name,
but her unique identifier might
appear in bunches of places.
There's a lot of actors in the world
who are also writers themselves.
They are, at the end of
the day, people, but they
might appear both in the stars
table and in the writers table
by way of their person ID.
So the goal of SQL is to not copy and
paste Claire Foy, Claire Foy, Claire
Foy, or Steve Carell, Steve Carell,
Steve Carell, all over the place.
You want to have one
authoritative place for all
of your people, one authoritative
place for all of your shows,
and then you have these
other tables called
join tables, which are similar
in spirit to the TSV files
you can download that somehow
link these identifiers together.
So let me, on my computer here, open a
file that I came with in advance,
that you can also play with online
as well, called, finally, shows.db.
And I do .schema on this one.
This is the largest of the databases,
and this is the one that we, the staff,
actually generated from all
of the TSV files online,
but we threw away a lot of information.
We threw away anything before
1970, and we filtered out
everything except TV series,
so that we can actually
play with the data in this data set.
But honestly, using a black and
white prompt and a terminal window
tends not to be fun, especially
when the data flies across the screen.
So there are also things called
GUIs, Graphical User Interfaces,
and indeed, there is a program
that's freely available for Macs,
PCs, and other types of operating
systems, called DB Browser.
And indeed, we'll point
you at this online.
This is just a program that allows you
to explore SQL files on your own Mac
or PC much more visually and much
more pleasantly than maybe just
a command line interface allows.
So let me go ahead and open
up, for instance, shows.db.
And we'll see a whole bunch of things.
First of all, the
graphical user interface
here shows me the same information,
just in slightly prettier format.
Shows.db, per my diagram a moment
ago, has six tables, people,
like I mentioned;
shows, like I mentioned;
also, stars and writers, ratings, and
then that separate table called genres.
And if you look over
here on the right, you
can actually see the SQL code
we wrote to create those tables.
But cooler than that is that
notice these tabs up top here?
I am currently on Database Structure,
and if you're using Windows,
your interface will
look a little different,
but the options are still there.
I'm going to go ahead
and click on Browse Data.
And now you'll see a little
dropdown of all of the tables.
So if I want to go ahead and
look at a whole bunch of shows,
I can actually see all of my data
here, not unlike Google Spreadsheets.
But notice, this is 153,331 movies--
or shows, rather, that I
can see altogether here.
And lastly, what's cool is that
if I go over to the SQL tab,
I can now execute some sample queries.
So let me go ahead and do this.
SELECT * FROM shows;.
This is going to give
me a whole lot of shows.
Enter.
So I hit the little Play button
that just executes that query,
and you see the resulting
rows that have come back.
So again, how did we get to this point?
We, the staff, downloaded all
of those TSV files in advance.
We wrote a Python script that imported
all of the data from those files
into memory, threw away the
stuff we didn't care about,
and then inserted it into tables
like this table called shows.
And what's nice about this tool is
that if you are an aspiring data
scientist or you're just trying
to get some analytical work done
for this class or any
other, or any other project,
a graphical tool lends itself
to just kind of poking around.
So for instance, you'll see that the
shows table no longer has any genres.
But that's OK.
We can reconstitute that data.
I can go ahead and SELECT * FROM shows--
but you know what I can do?
I can actually join in the genres table.
So I can take the shows table
here, the genres table here,
and essentially link them
together by way of the ID.
How is that possible?
Well, if you look at shows, a show
has an ID, title, year, and episodes.
Genres has a show ID.
So think of shows' ID column as
my left hand here,
and genres' show_id column as
my right hand here.
What we want to do is
lineup ID with show ID
to make one larger, wider table that
constitutes all of that data together.
So how do I do this?
Well, in SQL, you can join two tables.
You say what table you want to
join with what other table and
how you want to do it.
Well, I want to go ahead and join
it on shows.id = genres.show_id;.
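In full, then:

    SELECT * FROM shows JOIN genres ON shows.id = genres.show_id;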
And now, when I hit
execute, it took a moment--
indeed, it took 408 milliseconds.
But my god, that's a lot faster
than writing a whole Python
script to do this.
Now I have a table with
all of the shows as before.
But notice the table got wider.
This is a temporary table
that SQL has returned to me.
This one now has genre and show ID.
So in fact, let me go
ahead and filter this.
WHERE title = "The Office" AND year
= 2005, Play, we'll just get this.
So notice I have a wider table
containing all of the columns
from both of those joined tables.
But if I change this now to The Crown,
and that was 2016, and hit play,
notice I get back seemingly
redundant information.
But this might be useful
nonetheless, because now I
can iterate over all of the rows knowing
that every row has not only a title
and a year, but also a genre as well.
So I can reconstitute
the table like that.
Well, how can I ask
other questions as well,
like which actors are
in which shows?
Well, let me go ahead and
do something like this.
For instance, let's select
all of Steve Carell's movies.
Or-- let's select Steve
Carell himself first.
So SELECT * FROM--
let's see-- no.
Let's do this.
How about Ellen?
So SELECT * FROM people
WHERE name = "Ellen DeGeneres".
I spelled that right.
Semicolon, Play.
All right.
So this is handy, because I now know
that Ellen's birth year is 1958,
but her ID is 1122.
That's the same thing as that
nconst, but we threw away the nm
and we got rid of all the zeros and made
it into a proper number for efficiency;
it's better to uniquely identify
humans, typically-- or anything--
by numbers, which fit in 32 or 64
bits, rather than by longer strings.
So now that I know Ellen's ID
is 1122, what can I do?
Well, let me go ahead and SELECT *
FROM stars WHERE person_id = this.
This will tell me what?
This will tell me all of the information
about shows that Ellen starred in,
including, presumably, her own.
OK, so I now see person_id is
the same, the same, the same,
but Ellen is apparently
in all of those shows.
But that's not that
helpful, and also, this
is kind of lame that I've
just hardcoded Ellen's ID.
But I don't have to do that.
I can do a subquery, SELECT id FROM
people WHERE name = "Ellen DeGeneres",
closed parenthesis, Play.
Now it's dynamic, so now
I've not hardcoded anything.
But this isn't that useful.
Let me go ahead and just
select show_id here.
So I now have SELECT show_id
FROM stars WHERE person_id
= whatever Ellen's ID is here.
How can I take this one step further?
Well, what if I do SELECT * FROM
shows WHERE the ID of the show
is in the following list of values?
So not only does SQL
support equals, when
you want to compare one value against
the next, much like in Python,
you have the keyword IN, where you can
say select everything from shows where
the ID of the show is in
the following list of shows,
which happens to represent the
list of show IDs that Ellen is in.
Phew.
Let's hit Play on this.
It took a moment, but it looks
like these are all of the shows
that Ellen has been
in, according to IMDB.
And it looks like The Ellen
DeGeneres Show is one.
She's been on the air
for 2,865 episodes.
There is Ellen, her original TV show,
which was on for a few seasons back
then.
Looks like she's doing some game shows
these days as of 2017, and so forth.
So using these step-by-step
thought processes, we
can actually build up
more interesting queries
to get back information like that.
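Written out, the query we just built up step by step is roughly:

    SELECT * FROM shows WHERE id IN
        (SELECT show_id FROM stars WHERE person_id =
            (SELECT id FROM people WHERE name = 'Ellen DeGeneres'));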
All right, any questions before
we try a few others out as well?
No?
All right.
Well, let me show one other
approach to this same problem.
It turns out these are what
are known as nested queries.
You keep using
parenthesization, much like in math,
where you can nest arithmetic
expressions in parentheses.
You can do the same in SQL.
But you can also join information
in slightly different ways as well.
I can actually do something like this.
Let me go ahead and SELECT
title FROM the following tables,
people JOIN stars ON
people.id = stars.person_id--
and we'll walk through
this in just a second--
JOIN-- sorry--
JOIN shows ON stars.show_id = shows.id
where name equals "Ellen DeGeneres."
This, while more of a mouthful, is
equivalent to what I've just done.
Notice that I've selected
the title of the show
from the following, the result
of joining people on stars.
How do you join the people's
table and the stars table?
Well, people have IDs.
Stars have person IDs,
according to the diagram.
What else do you want to join together?
Let's join all of that with shows.
How?
Well, let's go ahead and say
stars.show_id = shows.id.
So it's a way of linking,
transitively, multiple tables together,
and then filter all of that with
the name of Ellen DeGeneres.
This is what we would
call an explicit join.
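Written out, that explicit join is roughly:

    SELECT title FROM people
    JOIN stars ON people.id = stars.person_id
    JOIN shows ON stars.show_id = shows.id
    WHERE name = 'Ellen DeGeneres';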
And if I click Play there,
notice it's taking a moment.
Took, in fact, 1,990 milliseconds,
almost two human seconds.
So it got slower, but it
does give me another syntax
via which to select data.
And honestly, what's cool about this
is I can select other fields as well.
They don't have to come from the
original query in my nested selects.
But let me go ahead and do
this, just as a teaser, and then
we'll consider some
problems before wrapping up.
So it turns out 2,000 milliseconds--
2 seconds-- is kind of slow, and if I do
this again, it took 2,029 milliseconds.
If I do it again, it
took 1,963 milliseconds,
and I'm looking at the time here.
That's slow, right?
That means if you have a finite
number of servers in the world
and each of them can only handle
some number of users at a time,
it's kind of a waste of hardware and of
money, arguably, to spend more seconds
or milliseconds servicing one query from
a given user on your app or your web
application.
But it turns out we can do
something kind of smart here.
Notice that in our database structure,
we've done a few things here.
Let me open up, for instance, people.
Rather, let me go over to people,
and you'll see a few keywords
that I ignored earlier.
When you're defining a table, you
can specify that one of your columns
is what's called a primary key.
That is the column by which you're going
to uniquely identify all your data,
so it's like those numeric IDs
we've given every person and show.
A foreign key is what
we call the same number
when it appears in some other table.
So when we saw a person_id or show_id,
those are what are called foreign keys,
because it's the same
numbers, but in another table
that we're using to join
things together, eventually.
And you can also define
columns to be unique,
or you can index them for efficiency.
And that's, in fact, where
we're going right now.
If you look at the same
query, you'll notice that I'm
searching on a bunch of columns.
I'm clearly searching on the name field.
I'm also searching on-- that
is, joining on-- show_id,
and I'm searching on person_id, not
to mention the individual ID fields.
But the cool thing about
a primary key-- and
we have, in advance, defined all of
our ID columns, all of the columns
called ID, to be primary keys-- is that you
get a feature for free with SQL.
SQL builds up what's called
an index, a very fancy data
structure, a tree-like structure,
that actually allows you to search
for information quite efficiently.
So when you define a
column to be a primary key,
you get what's called an index.
Specifically, if we go back a few
weeks to our discussion of trees,
it looks like this.
This is what's called a B-tree,
which is not a binary tree.
It's a B-tree in the sense
that it's got a lot of nodes
that might have one or two
or more children each,
but it's very short, with a lot
of nodes-- a lot of width.
So that means when you store your
data in a tree-like structure,
long story short, it just tends
to be very efficiently searchable.
So when you define a
column to be a primary key,
you get that speed for free.
But we can also tell our
SQL database, you know what?
I plan on searching on person
IDs and show IDs and names, also,
so let's go ahead and
create myself some indexes.
Let me go ahead and execute the
following queries just once.
Let me create an index
called person_index on the stars table
using its person_id column.
So pretty quick syntax.
This means create a B-tree,
a fancy tree structure,
called person_index, on
the stars table--
on its person_id column.
That is, make a tree like this in
memory to store all of the person IDs.
Why?
So I can find them faster than
linear search in the column itself.
Let me go ahead and execute this.
And you'll see it takes a moment.
It's thinking.
It took me 1.6 seconds.
So a little slow, but I
only have to do this once.
Let me create two more.
CREATE INDEX called show_index
ON stars ON the show_id column.
So almost the same, but
this just means give me
a tree that looks like
this in memory, so
that when I query for
data like Ellen DeGeneres,
it searches a tree-like structure
instead of a list-like structure
in a column alone.
Let me execute that.
And I screwed up ON stars show_ID.
There we go.
That took 1 second.
And lastly, let's create an index called
name_index on people on the name column
so that I can search for people
by name more efficiently as well.
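Taken together, those three one-time statements are roughly:

    CREATE INDEX person_index ON stars (person_id);
    CREATE INDEX show_index ON stars (show_id);
    CREATE INDEX name_index ON people (name);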
Otherwise, without these
indexes on a column like name,
it is going to check every
damn cell in the column looking
for Ellen DeGeneres, Ellen DeGeneres,
Ellen DeGeneres, using big O of n
running time, or linear search.
So now, let me go back to my query here.
I've not made any changes to the query.
The last time I ran this,
it took almost 2 seconds.
Now, after creating these indexes
and telling my SQL database
I plan to search on those
columns, watch this,
and watch the blue highlighted number.
8 milliseconds.
8 milliseconds.
7 milliseconds.
7, 7, 7.
Looks like 4 milliseconds that time.
So this is what the Googles of the
world, the Facebooks of the world,
the Microsofts, who
have very large data sets,
they not only store
their data in databases
like we are here, but they also
index their tables intelligently,
drawing on ideas from weeks ago, so that
the database, for them, and for free,
and sort of magically, creates
these kinds of structures in memory,
but does it so that you can
search and insert and update
your data all the more efficiently.
So there's got to be some
problems or some prices paid.
So let's consider just
a couple of those.
So what problems actually arise
when using a SQL database like this?
Well, the first is what's
called a race condition.
And perhaps the best way to
explain this is by way of
a story that I was told
when I took a class like this years
ago, a class called
CS161, Operating Systems.
So contrived scenario,
but consider this.
You and your roommates have
a fridge in your dorm room,
and you and your roommate
really like milk.
And one of you comes home one day,
opens the fridge, and sees, oh, dammit,
we're out of milk.
And so you close the fridge and
you walk into Harvard Square
and you head to CVS or
some other such store.
Meanwhile, your other roommate
comes home, opens the fridge,
really wants some milk, but,
oh, darn it, we're out of milk.
So they close the fridge,
walk outside, and head
to some other store that sells milk,
and for reasons that we're contriving,
don't bump into you at CVS.
A few minutes later, you both get home,
of course, having bought some milk.
But of course, milk
doesn't last that long,
so now you've got twice as much
milk, and surely some of it's
going to go bad.
Like, a horrible, horrible
problem, right?
It's contrived, but this-- fond
memories of when it was taught to me.
So what is the fundamental problem
here, if you put your finger on it?
Both you and your roommate
were allowed to inspect
the state of that refrigerator
without knowing that the other was
about to do the same.
Better would have been for one of you to
leave a note on the fridge saying, gone
for milk, so that the other
person doesn't do it as well.
Or more stringently, just
lock the refrigerator
so that no one else besides you
can see inside of it until you have
updated the state of that refrigerator.
So this comes into play
all too often these days
when dealing with
really large data sets.
The Twitters of the world, the
Instagrams of the world, all of you
are probably in the habit
of liking or upvoting posts
on those sites or some other.
But if a lot of other people are,
too, especially when things go viral,
there might be code like
this underneath the hood that
essentially does the following.
Here are three lines of representative
Python code that use SQL.
The first line here selects the
number of likes from a table
called posts where the ID
of the post is something,
a variable called ID, 1, 2, 3, 4, 5.
That just tells me, what is the
total number of likes this post has?
Then, I store that answer
in a variable called likes.
And you'll see this
syntax in problem set 7
as you begin to play with SQL
yourself in the CS50 library.
Suppose that your third line of
code is to then update posts,
setting likes equal to some
placeholder, where that placeholder
is going to be likes plus 1.
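In code, those three lines might look roughly like this,
with the table and column names per the description above:

    # Ask the database how many likes this post currently has.
    rows = db.execute("SELECT likes FROM posts WHERE id = ?", id)

    # Remember that count in a variable...
    likes = rows[0]["likes"]

    # ...then write back count + 1. Another server running this same
    # code in parallel can interleave here, losing one increment.
    db.execute("UPDATE posts SET likes = ? WHERE id = ?", likes + 1, id)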
This is the same problem
as the milk example,
because if you check
how many likes there
are on this post-- just as you might
check how much milk
is left in the fridge--
and then go off and begin to
make a decision, like buying milk,
your roommate, so to speak, on
a server that's running
the same code to handle thousands
and thousands of people,
might ask that same question.
How many likes does
this current post have?
Suppose the post has a million likes.
Both of you execute
that first line of code,
see that, oh, I have a million likes.
I've just clicked plus 1 on the site.
And so you try changing the number of
likes with this second update query
to 1 million plus 1.
But if both of you have
been making a decision that
gets interleaved with the other
person, as will absolutely
happen with Instagram and
Twitter, who have thousands
of servers all operating in parallel,
instead of changing the number of likes
from 1 million to 1,000,002, you might
both change it to 1,000,001,
thereby wasting one of the
counts, because it gets lost,
because you inspected the state
of the fridge or the likes.
You made a decision.
And then you updated it
based on that decision
without realizing that
life changed in between.
Your roommate got home
and checked the fridge,
too, or someone else clicked the upvote.
So this is a bad situation,
but there are solutions in SQL
that we won't look at this week,
known as transactions.
That is a solvable problem, and more on
that in a higher level database class.
And one final example of
this, because this one you
can avoid throughout the coming
weeks in CS50 and beyond.
A SQL injection attack is
when you write bad code that
somehow allows some bad actor on
the internet or an application
to trick your code into running
SQL code that you did not intend.
For instance, this is how Yale
students log on in New Haven.
This, of course, is how Harvard
students log on here in Cambridge.
You're asked for a username
and password, or a Harvard key
and a password.
Suppose, though, you as
a user don't cooperate,
and instead of typing in your
email address and password,
suppose you type something like
this, like your email address,
Malan@Harvard.edu, then maybe a
single quote, and then a dash dash.
Well, it turns out in
SQL, dash dash is how
you start a comment, similar to hash
in Python or slash slash in C.
But suppose that Harvard had
implemented its website such
that there's a query like this
going on underneath the hood,
if the Harvard key is implemented in
Python, that says this, SELECT * FROM
users WHERE username = placeholder
AND password = placeholder,
passing in username and password.
That is good.
That is correct code, because
those place holders are important.
What the CS50 library does, and other
libraries in the world like it, is
it escapes user input.
If a user tries to trick your code into
inputting comments or single quotes,
db execute and other libraries
will sanitize the user's input
and prevent the
following from happening.
If, however, you do
this, it's not good, but bad.
Suppose that you really practice what
we preached last week using f-strings;
this is now bad in the context of SQL.
If you create a format string
like this with a little f
that is literally SELECT *
from users WHERE username =,
and then use the fancy Python
notation for curly braces,
and password = curly braces, this
is a correct SQL construction
so long as the human cooperates and puts
their username here and password here.
But what if the user
is malicious, like me,
and actually includes a single quote
in their input, and then dash dash?
So Malan@Harvard.edu
quote dash dash would
have the effect of injecting
that into the query itself.
But dash dash I claim
as a comment, so it's
a way of tricking the computer into
executing not SELECT * FROM users WHERE
username = this AND password = that,
but, in effect, just
SELECT * FROM users WHERE
username = "Malan@Harvard.edu".
And if-- and you'll
see this in p set 7--
db.execute returns one or
more rows from the table--
that's how you can actually
select data using Python code--
and the number of rows
equals 1, because it
selected Malan@Harvard.edu,
you could log in
as me, because you very simply tricked
the computer into executing code
that you injected.
And the biggest takeaway here is
that this is a 100% solvable problem.
You just need to know
that it is solved and not
write code like this using
f-strings in Python, but instead
to use library code like db.execute,
or after this course, anything like it.
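To make the contrast concrete, here is a sketch of the two
versions, with illustrative table and column names rather
than Harvard's actual code:

    # Bad: an f-string splices the user's raw input into the SQL itself,
    # so input like malan@harvard.edu'-- comments out the password check.
    rows = db.execute(f"SELECT * FROM users WHERE username = '{username}' AND password = '{password}'")

    # Good: placeholders let db.execute escape (sanitize) the input first.
    rows = db.execute("SELECT * FROM users WHERE username = ? AND password = ?", username, password)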
With that said, we'll end with a joke.
OK.
That's it for CS50.
We'll see you next time.
