PROFESSOR DAVID J MALAN:
All right, we are back.
And our final session is
ostensibly on web programming
and some of the ingredients
that are commonly used therein.
Thought we'd make this a mix of
conceptual and design, concepts
and design, as well as some
hands on, which will
end the day, using a
little bit of JavaScript
in a couple of different contexts.
But first, let's make sure we
knock off the last of these two
as well as a more
general answer to where
do we actually put all of the data
when making web-based applications.
We've talked about
databases in the abstract.
We've talked about them in
the infrastructure sense.
But we never actually
opened up the database
and talked about how you
actually put data in there.
And what are some of the kinds
of questions to consider.
What makes a database design good.
What makes an engineer designing
a database good or not so good.
And so let's begin with that.
So in terms of database
technologies, there
are generally two classes
of databases, one of which
is called relational databases
and one of which might
be called object-oriented databases.
So relational and OO,
or object oriented,
or document stores as
they're typically called.
And we'll do these in reverse order,
these object-oriented databases,
specifically these document
stores, things like MongoDB
which is a very common one.
So let me start another
list of ingredients,
gives you the ability
to store data in what
effectively resembles something called
JavaScript Object Notation, JSON.
And we'll actually wrap with
a look at JavaScript itself.
And it looks a little
something like this.
So if I were to store
my data in JSON format
and my data, for
instance, was a customer,
I might represent a customer with
the following kind of syntax.
I might have the customer's
name being David.
I might have the customer's
address as being,
say, 1 Brattle Square, Cambridge, Mass.
02138 USA.
I might have the email
address be malan@harvard.edu.
And now this user might
also have an ID of, say,
123, which is just a
unique numeric identifier.
And there might be other fields still.
So JSON generally refers to
this key value pair approach
to data, where you have keys
like name, address, email, ID,
and their respective values.
And there are some other features.
You can have arrays built into this.
So for instance, let's see, what
might I have a whole bunch of?
So if this isn't so
much course-- actually,
not so much customers
but courses I've taken,
maybe we could do something
like this, an array of courses.
Like I've taken courses with
ID number 5 and 7 and 18.
So we could represent things
like arrays using square bracket
notation, as is the
convention, to represent
a list of things associated with me.
So this is the general idea.
And if I had a second
such customer, I would
start another one of these
objects, if you will,
using the curly brace notation.
So this is totally language specific.
But in this context, curly braces
represent an object and square brackets an array.
An object is just a data structure
containing keys and values
for our purposes.
And an array is exactly what we
discussed earlier but in JavaScript
as opposed to C.
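A minimal sketch of the customer record described above, written in Python and serialized as JSON. The key names follow the fields mentioned in the lecture; the exact spelling of the keys and of the address string is an assumption.

```python
import json

# Hypothetical customer record mirroring the fields described above.
customer = {
    "name": "David",
    "address": "1 Brattle Square, Cambridge, MA 02138 USA",
    "email": "malan@harvard.edu",
    "id": 123,
    "courses": [5, 7, 18],  # square brackets: an array of course IDs
}

# Serialize the object to JSON text, then parse it back.
text = json.dumps(customer, indent=2)
restored = json.loads(text)
print(text)
```

Curly braces mark the object and square brackets mark the array, in the JSON text just as in the Python literal.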
So one of the upsides of storing your
data in this object-oriented way where
you think about a customer or a
student as having data in this way
is that there is some hierarchy to it.
You might have a list of courses,
which itself has a child, which
is this array of course IDs.
You could actually explode this so
that, yes, it's an array of courses.
But you know what, we don't have
to think of those courses in terms
of their IDs, we can think of them
in terms of their names like, say,
computer science for business leaders
whose ID is, for instance, the number
5, and whose start date is something.
And so forth.
So you can have this nested
structure where you just
continually and progressively associate
more and more data with whatever object
is of interest to you.
This is nice and it
lends itself, let me say,
in programming to
accessing the data easily.
Just in terms of code, you can
write relatively little code
to get at data like this.
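As a rough illustration of the nesting just described, the array of course IDs can be expanded into an array of course objects. Only the first course name and its ID come from the lecture; the start dates and the second course are made-up placeholders.

```python
# Hypothetical nested record: each course is itself an object.
student = {
    "name": "David",
    "courses": [
        {
            "id": 5,
            "name": "Computer Science for Business Leaders",
            "start": "2016-07-26",  # placeholder date, not from the lecture
        },
        {
            "id": 7,
            "name": "Example Course",  # placeholder name
            "start": "2016-09-01",     # placeholder date
        },
    ],
}

# Relatively little code is needed to reach into the hierarchy.
first_course_name = student["courses"][0]["name"]
course_ids = [course["id"] for course in student["courses"]]
print(first_course_name, course_ids)
```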
Unfortunately, it doesn't necessarily
give you as much expressiveness,
depending on the service you're
using, as something more traditional
called a relational database does.
And indeed, this is one of these
religious things, or at least trends
right now, whereby people are
absolutely still using SQL databases
as they're typically called.
SQL databases.
By contrast, there are
NoSQL databases, which
generally mean you're
not using SQL and it
happens to be something
a little more like this.
And there's a larger array of
space and options here but let's
focus on one of the
more traditional ones,
if only because it lends
itself to, at least I think,
some more intuitive design
decisions that certainly relate
to object-oriented databases as well.
And one of the design decisions you
initially have to make, typically,
is what type of data do you want to
store and how do you want to store it?
And by contrast to this
hierarchical approach,
a relational database typically
has you store your data
in a very flat way, very
Excel-like or Google spreadsheet.
So for instance, if we want to
create a database of customers
in a relational database context,
we might do the following.
Well, what makes a customer?
I have a few fields here, so name,
address, email, maybe unique ID,
maybe-- course is irrelevant because I'm
changing the story back to customers.
What else might you
associate with a customer?
Phone number is a good one.
What else?
What they bought, so
like an order history.
Anything else?
AUDIENCE: Age.
PROFESSOR DAVID J MALAN:
OK, age, good one.
AUDIENCE: Contacts within the company.
PROFESSOR DAVID J MALAN: Contacts
within the company, so like,
let's say, like customer
service history kind of thing.
I'm sure there's a better word for that.
AUDIENCE: Mailing list, yes or no.
PROFESSOR DAVID J MALAN: OK, so yeah,
let's do an opt in kind of field.
OK, so that's a good list.
And I'm sure there's innumerable
more we could come up with.
So now let's dive in a little deeper.
What does this data look
like and why do we care?
So in a relational database, we
would typically specify a data type
for these kinds of fields.
And we would ultimately store
this kind of information
in the equivalent of a Microsoft Excel
worksheet or Google Spreadsheets sheet
that allows us-- once Excel opens.
Come on.
Open a new file.
OK, so over here we might
put, what do we have,
name, address, email, phone, order
history, age, customer service history,
opt in.
And I deliberately left
room for ID just because I'm
going to put the ID to the left here.
OK, so here's how we
might lay out this data.
And I might be customer number 123.
David Malan 1 Brattle Square, Cambridge,
Mass, 02138, malan@harvard.edu,
617-495-9000.
Order history, we'll come back
to that because that sounds hard.
Age, we'll just leave that
blank and customer service
we'll just leave that blank.
And opt in will be a 1 for
yes, I've opted in to emails.
So this is all fine and good in Excel.
And Excel has a little
bit of expressiveness
for how you can display your data.
But you don't really specify
what type of data it must be.
You can always override
Excel's default settings.
And indeed, you can go to Format
and specify this is a number.
This is how many digits to show.
But it really is just an aesthetic
detail for the most part,
that you can actually impact for
better or for worse your data
by specifying those things.
So instead here, let's consider
the question of how we actually
represent this information.
So name, feels just like
a sequence of characters.
Age feels like a number.
Phone number is a little weird
but it's more like words.
And it's like alphanumeric
with some punctuation.
So it's not just strictly a number.
Customer service history and order
history kind of are scary right now,
so we'll come back to that.
Opt in feels like a Boolean,
like a Boolean value
meaning 1 or 0, true or false,
anything like that, yes or no.
Email has a pretty standard
format, something at something
dot something, maybe
something more and so forth.
And then address, which is
just like a phrase or sentence
or something like that.
But we can dive in a little
deeper, in particular, we
have a whole bunch of data types in
SQL, Structured Query Language, which
itself is a programming
language with which you
can query for data on a database.
And indeed, SQL and
relational databases more
generally are examples
of CRUD systems, whereby
you can create data, read data, update
data, and delete data, silly acronym.
And specifically, they
have instructions that
are called insert and select
and update and delete.
So in other words, even though,
theoretically, these operations
are generally referred
to with these words,
in actuality when you're
programming in SQL,
you use these four keywords instead.
And they kind of do what
they mean where select
is the only non-obvious one, where
select means search the database
and give me back some rows.
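The mapping from create, read, update, delete to INSERT, SELECT, UPDATE, DELETE can be sketched with Python's built-in sqlite3 module. The table and column names here are assumptions chosen for illustration, and SQLite stands in for whatever SQL database is actually in use.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # throwaway in-memory database
conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT, email TEXT)")

# Create -> INSERT
conn.execute(
    "INSERT INTO customers (id, name, email) VALUES (?, ?, ?)",
    (123, "David Malan", "malan@harvard.edu"),
)

# Read -> SELECT (search the table and get back some rows)
rows = conn.execute("SELECT id, name, email FROM customers").fetchall()

# Update -> UPDATE
conn.execute("UPDATE customers SET name = ? WHERE id = ?", ("David J. Malan", 123))
updated = conn.execute("SELECT name FROM customers WHERE id = 123").fetchone()[0]

# Delete -> DELETE
conn.execute("DELETE FROM customers WHERE id = ?", (123,))
remaining = conn.execute("SELECT COUNT(*) FROM customers").fetchone()[0]
```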
So for instance, with
Microsoft Excel here,
you could do this with
formulas or macros or whatnot,
but generally you don't--
many people, myself included,
tend to use Excel more for storing
data and not necessarily for writing
software against it, in particular
because it's going to be slow overall,
especially with thousands
and thousands of rows.
You would generally use a database,
something like a SQL database
or the like.
So what it means to select
data is to take a database
like this that presumably has
more customers than just me
and select subsets of them.
Select all the customers
that we have that
are in the age range of like 18 to 49.
Or give me all customers
who live in Massachusetts.
Give me all customers in
specifically 02138, that zip code.
Or give me all customers who have spent
more than $100 this month with us.
Any number of queries can be
solved using SQL as a language.
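A small sketch of those kinds of queries, again using sqlite3. The three customers and the column names are invented for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT, age INTEGER, state TEXT, zip TEXT)"
)
# Hypothetical sample data.
conn.executemany(
    "INSERT INTO customers VALUES (?, ?, ?, ?, ?)",
    [
        (1, "Alice", 34, "MA", "02138"),
        (2, "Bob", 52, "MA", "02139"),
        (3, "Carol", 19, "NY", "10001"),
    ],
)

def names(query, params=()):
    """Run a SELECT that returns one column and collect it into a list."""
    return [row[0] for row in conn.execute(query, params)]

# Customers in the age range 18 to 49.
in_range = names("SELECT name FROM customers WHERE age BETWEEN 18 AND 49 ORDER BY id")
# Customers who live in Massachusetts.
in_mass = names("SELECT name FROM customers WHERE state = ? ORDER BY id", ("MA",))
# Customers in zip code 02138 specifically.
in_zip = names("SELECT name FROM customers WHERE zip = ? ORDER BY id", ("02138",))
```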
But before you even get to the
point of using your database,
you have to design it.
And among the data types
we have at our disposal
are data types like char
for character, one or more,
varchar for variable
number of characters,
where you don't necessarily know
how long the thing is going to be.
We have things like int.
Have data types like big int.
We have data types like decimal,
float, year, date, date time,
which is both, time, and there
is many, many more than this.
But this is a decent list with
which to start, which is to say,
if we want to store this
data in our database,
we first have to ask ourselves
how should that data be stored,
for a couple of reasons.
One, among the features of a database
is to ideally give you data quickly
and to make updates or
insertions or deletions quick.
And to help the database do that, it
helps to tell it what type of data
it is.
Because it turns out,
storing things as numbers
is often faster than storing
them as alphabetical characters.
For instance, the number,
let's say, 1 million.
1-0-0-0-0-0-0.
That's 7 characters
to type at a keyboard.
So if I type that using text or
store that using text, a.k.a.
ASCII from yesterday, I need 7 bytes.
However, the number 1 million is far
less than our special value 4 billion.
And you know how many bits, perhaps
now, we need to store 4 billion.
Which is how many bits?
32.
Which if you divide by
8 is how many bytes?
4.
So if we instead store the number 1
million as an integer, so to speak,
as an int, and not as
a string of characters,
we can go from 7 down to 4 bytes.
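The 7-versus-4 comparison can be checked directly. This sketch uses Python's struct module to pack the number as a 32-bit integer; a real database has its own on-disk layout, so treat this only as an illustration of the arithmetic.

```python
import struct

# One million stored as text: one ASCII byte per digit.
as_text = "1000000".encode("ascii")

# One million stored as a 32-bit (4-byte) little-endian integer.
as_int = struct.pack("<i", 1_000_000)

print(len(as_text), len(as_int))
```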
So this is an example
of why you actually
want to care about your
underlying representation
because you can speed things
up, you can save on space
and generally help the
database to do its job better,
which doesn't matter for small websites
but for medium-large scale websites,
absolutely, all of these kinds
of things can start to matter.
But the second concern, too, is
you can leverage your database
to protect yourself from yourself.
You can have the database make sure that
the only type of value that can go here
is an integer.
The only type of value
that can go here is
a year, which means it must
be a four digit number only.
You must be able-- you can specify
that this has to be a date.
So it has to be year, year, year, year,
dash, month, month, dash, day, day.
And even if you or your
programmers accidentally screw up,
the database will prevent you
from inserting a bogus value.
And this is an added layer of
defense and a good thing in general.
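A rough sketch of the database rejecting bogus values. SQLite is lenient about column types by default, so this example uses CHECK constraints to approximate what a stricter SQL database would enforce from the declared type alone; the table and column names are assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE customers (
        id INTEGER PRIMARY KEY,
        -- only accept values actually stored as integers
        birth_year INTEGER CHECK (typeof(birth_year) = 'integer'),
        -- only accept text that is already a valid YYYY-MM-DD date
        birth_date TEXT CHECK (birth_date IS date(birth_date))
    )
    """
)

def try_insert(year, day):
    """Return True if the row was accepted, False if the database refused it."""
    try:
        conn.execute(
            "INSERT INTO customers (birth_year, birth_date) VALUES (?, ?)", (year, day)
        )
        return True
    except sqlite3.IntegrityError:
        return False

accepted = try_insert(1990, "1990-07-27")
rejected_year = not try_insert("nineteen ninety", "1990-07-27")
rejected_date = not try_insert(1990, "07/27/1990")
stored = conn.execute("SELECT COUNT(*) FROM customers").fetchone()[0]
```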
So you can also specify
what types of numbers
you're storing, how long
those numbers might be.
And there, too, we have an opportunity
to discuss design decisions
where the length of these,
in particular, matters.
So for someone's name,
the first question
when designing a relational
database might be,
how many characters
shall a user's name be?
So what's the typical length
of a human's full name?
10?
Feels a little short,
maybe for a single name.
D-A-V-I-D space M-A-L-A,
dammit, I'm one short already.
So 11 minimally seems to
be the current lower bound.
If we polled everyone on their
full names, probably 20, 30.
What's that?
AUDIENCE: 25, 30.
PROFESSOR DAVID J MALAN: 25, 30.
And I bet, just to play devil's
advocate, longest name in the world.
How many characters is this?
That's crazy.
I don't have to count-- his full name.
All right, let me take out a program
that will do the counting for me.
OK, 226 characters, I think, or 225
if I'm interpreting this correctly.
226 characters.
So incorrect.
So we need at least 226 characters.
But this is, actually, it's
kind of a can of worms.
Like I don't know what the upper bound is.
And apparently it seems to be
this, pragmatically speaking.
So it turns out there
are certain conventions.
Like in a SQL database,
it was very common
for years for a name-- or rather, for
a character-based field to be 255.
Why?
That was the maximum
length for some time.
And thankfully it just about
fits this fellow's name.
So there's a difference, though,
between a data type, as these are.
We talked about int
earlier in the context of C
and SQL has these data types, too.
And we're going to need to assign
one of these to each of these.
A character field is defined as
a fixed number of characters.
So if you specify 255 and the
data type is char, or character,
that means you will always use
255 characters to store the data.
And if it's only D-A-V-I-D
M-A-L-A-N with a space,
all of the other 200 plus characters
are just blanks, essentially,
but they're allocated.
A varchar is when I don't
really know what the biggest
name is going to be so you say, 255.
But the database only uses as many
characters or bytes as it needs.
So for David Malan it might store
11 total bytes, give or take.
And it won't waste the
other 200 plus of them.
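The difference can be mimicked with plain Python strings. This is only an analogy for how fixed-width and variable-width fields behave, not how any particular database stores them; a real database typically also keeps a small length marker for a varchar, hence "give or take."

```python
WIDTH = 255
name = "David Malan"

# CHAR(255)-style: always padded with blanks out to the full width.
as_char = name.ljust(WIDTH)

# VARCHAR(255)-style: only as many characters as needed, capped at the bound.
as_varchar = name[:WIDTH]

print(len(as_char), len(as_varchar))
```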
Well, this seems silly.
This sounds better.
Like I give it an upper
bound but it's less wasteful.
Why might char still even exist?
Why might you want to commit a
priori to a specific number of bytes?
Even wastefully.
Yeah?
Anessa?
AUDIENCE: [INAUDIBLE].
PROFESSOR DAVID J MALAN: So that's true.
But this decision would be
independent of the front end.
So presumably, you want to support
users whose names are of any length
because some of us may certainly
have run into websites,
or even if you've been
filling out comment forms,
like big, obnoxious companies, tend not
to let you leave very long comments.
And they will literally countdown
the characters, a la Twitter.
And why is that?
Well, one, they probably,
from a business perspective,
don't want to read too much text.
But two, they've probably
specified that we're
going to store your message in
a field of a specific length.
So let's separate the front end
and those kinds of decisions
from the underlying distinction
between what's stored in the database.
So a char field will store a
fixed number of characters,
even if a lot of them are blank.
But varchar will only store
up to a certain amount.
So why would you want one or the other?
AUDIENCE: Scalable.
PROFESSOR DAVID J MALAN: More
scalable, let's come back to that.
AUDIENCE: So maybe a phone number or
all of the area codes or something,
be able to know at a certain point if
the character is going to be something.
PROFESSOR DAVID J MALAN: OK.
Yeah, so that helps.
So if you have a fixed
format you could certainly
go with char because you know in
advance how long it's going to be.
Phone number, could
break down if you want
to support international folks who have
longer, different length phone numbers.
Maybe zip code, if
we're just US customers
and we throw away the
extra four digits, we just
have five digit zip codes or
maybe nine character zip codes,
that could work as well,
or 10 with the dash.
But yeah, if you know in advance
how long it is going to be,
you might as well tell the
database it's a fixed length.
But it's actually for a scalability
or really a performance reason.
It turns out that if
you think of your data
as being stored in a
column in Microsoft Excel,
if you specify that your
field, every value in this column,
is going to be 5 characters, 5
characters, 5 characters, for a zip
code, each one is going
to be exactly this length.
And much like our discussion earlier,
you can address these things.
So this is address 0, 5, 10,
15, 20, 25, and so forth.
And that, specifically, is
the number of bytes away
from which each of these things is.
In other words, there's
a gap of five bytes
because I'm assuming 5 characters.
And what feature do we gain when
we know our data is back to back
to back to back at predictable gaps?
AUDIENCE: Binary search.
PROFESSOR DAVID J MALAN: Potentially
binary search, if it's sorted.
And it also allows us, more
generally, random access.
I can jump to the middle
of my rows because I just
do some simple arithmetic, x minus
y, like the total minus wherever
I start at.
And that's a feature.
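A minimal sketch of why fixed-width values allow random access: the position of record i is just i times the width. The zip codes below are sample values for illustration.

```python
WIDTH = 5  # every zip code is exactly 5 characters

zips = ["02138", "10001", "60601", "94105", "98101"]
blob = "".join(zips).encode("ascii")  # records stored back to back

def record_at(i):
    """Jump straight to record i using simple arithmetic on the offset."""
    start = i * WIDTH
    return blob[start:start + WIDTH].decode("ascii")

offsets = [i * WIDTH for i in range(len(zips))]
middle = record_at(len(zips) // 2)
print(offsets, middle)
```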
If by contrast we don't know what the
length of the strings are going to be,
deterministically, and when we
say it could be as many as 255.
The visual effect might be this
first string might be pretty long.
This next one might be half the length.
This next one might be like 7 3/4.
This one might be really short.
This one might be blank.
And now you have these
ragged edges, which
means the numbers no longer apply.
This row might start at location 0.
This might still start at a location 5.
This might start now at location 7.
This might start at
5, 11, let's say, 12.
This is also going to be 13.
And then this one might be
14 or something like that,
depending on the lengths.
In other words, the numbers are
now useless because there is not
a predictable offset, which means
you can't just skip around randomly.
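By contrast, here is a sketch of variable-length values stored back to back with a separator. The names and the newline separator are assumptions for the example; the point is only that finding record i now means walking past every earlier record.

```python
names = ["Alexandria", "David", "Jo", "", "Nicholas"]
blob = "\n".join(names)  # ragged lengths, separated by newlines

def record_at(i):
    """Find record i by scanning past the i separators before it."""
    pos = 0
    for _ in range(i):
        pos = blob.index("\n", pos) + 1
    end = blob.find("\n", pos)
    return blob[pos:] if end == -1 else blob[pos:end]

# Starting offsets are no longer multiples of a fixed width.
starts = []
pos = 0
for value in names:
    starts.append(pos)
    pos += len(value) + 1

print(starts)
```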
So this is the kind of thing where
the database can leverage the data's
structure if you help inform it.
And as a DBA, database administrator,
or just generally a developer who's
doing database design, you can provide
these kinds of hints to the database
so that it performs better, to help
things like Twitter analyze or search
through their data or store
their data more quickly.
So we have to specify a length.
So for a name, how long of an upper
bound do we want to give a name?
Probably don't want to use
char because most people don't
have 226 characters in their name.
So it feels wasteful to
have all those blanks.
So let me propose varchar.
But what's a reasonable
upper bound, then?
What's that?
AUDIENCE: 30.
PROFESSOR DAVID J MALAN: 30?
Well, the only catch
with going small again
is we're kind of screwing
over the gentleman who
was written up in the Guinness
Book and probably other people.
There's probably thousands of
people who have pretty long names.
So what would be common
convention would be,
you know what, I'm going to make this
a varchar with a max length of 255,
partly just by convention.
255 is a little arbitrary but it happens
to be the former boundary on a string.
We know, empirically, there is no
one with a longer name right now.
But if someone does create a
name on their birth certificate
that's 256 characters, their
name will get truncated.
They're going to have to sacrifice
one of their names when they register.
But that's one of the tradeoffs here.
By contrast, address is
fundamentally harder to think about.
How long might an address be?
I don't know.
Let's see.
We have our little Excel file here.
One, so I proposed this address
here where we currently are.
So 1, 2, 3, 4, 5, 6, 7, 8, 9, 10,
looks like 50 characters or so.
Is that long enough?
No, maybe 100.
I don't know, 255.
Here, too, there might
be some common defaults.
AUDIENCE: [INAUDIBLE].
PROFESSOR DAVID J MALAN: Not
necessarily for that reason.
Because at the end of the
day, even if you split it up,
the total effect is
probably about the same
but there's a more compelling
reason to split up the address.
What I have done is very lazy and
very bad design at the moment.
Adam?
AUDIENCE: I was just going to
say so that they're searchable.
PROFESSOR DAVID J MALAN:
Yeah, right now there
is no easy way to search this because
Cambridge is kind of sandwiched
in the middle of the comma and MA.
The number is all the way at
the end but it's not alone.
It would be nice to kind of clean this up.
And indeed, let me go ahead and do that.
Let me go ahead and
insert a few new columns.
And instead of calling this address,
how about we'll call this street,
city, state, zip, and for
now, we could do country,
but let's bias ourselves
to the US for now
just so we can discuss
zip codes specifically.
So in this case, I
might now rewrite this
as 1 Brattle Square, Cambridge-- no.
Wait, I got those backwards.
Cambridge, Mass, 02138.
So it's a little cleaner.
And to your intuition, we can now
search those fields individually.
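A sketch of what splitting the address buys you, using sqlite3. The column names follow the ones proposed above; the second customer is invented, and note that SQLite accepts declarations like CHAR(2) but does not itself enforce the length.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE customers (
        id INTEGER PRIMARY KEY,
        name VARCHAR(255),
        street VARCHAR(255),
        city VARCHAR(255),
        state CHAR(2),  -- length is declared but not enforced by SQLite
        zip CHAR(5)
    )
    """
)
conn.executemany(
    "INSERT INTO customers (id, name, street, city, state, zip) VALUES (?, ?, ?, ?, ?, ?)",
    [
        (123, "David Malan", "1 Brattle Square", "Cambridge", "MA", "02138"),
        (124, "Example Person", "1 Example Street", "Boston", "MA", "02108"),  # hypothetical
    ],
)

# With separate columns, searching by city is a simple equality test.
in_cambridge = [
    row[0] for row in conn.execute("SELECT name FROM customers WHERE city = ?", ("Cambridge",))
]
```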
So each of these fields
I really don't know,
but it should no longer
be called just address.
This should be called
Street, City, State and Zip.
Meanwhile, each of city and
street should be, I don't know,
maybe varchar 255, if only because it's
kind of an arbitrary but conventional
default and doesn't paint
us too much into a corner.
As an aside, you can have longer strings
of text in your database than 255.
Indeed, varchar can be bigger.
I think it can be 65,535 nowadays.
But there comes a point where if
you have even bigger blobs of text,
because maybe you're letting
people upload their resume
or maybe you're letting people upload a
college essay or really large documents
or something, there are
other data types that
are on this list called, quite
literally, text and large text, I
believe, which are even bigger.
But they're stored in the
database in a different way,
in a way that's slightly
slower to access.
So that would be one of the
motivations for using varchar.
And again, you'd have to read
the fine print of your database,
although they do tend to
follow certain conventions.
But that's the kind of
intuition behind that.
But state, let's just
assume for simplicity
the US, how long should that be?
AUDIENCE: Two.
PROFESSOR DAVID J MALAN: Two.
There is an advantage to use
not varchar but char two,
because if we use the two letter
codes we can save some space there,
which I like.
Zip code, again, we can be
a little presumptuous here.
We could do char 5 or char 9 or char
10 if we include the dash, depending
on whether we want to store that.
But we'll keep it simple, just do five.
Post office will figure it out.
Email, what data type should it be?
AUDIENCE: Varchar.
PROFESSOR DAVID J MALAN: Yeah, probably.
And here, I'm getting a little
lazy by sort of encouraging
us to use 255 for everything.
But it is just common.
But you know, so long as you're
comfortable with the value that's
what matters in the end.
Unfortunately, in a
database, typically, you
can't impose a formatting constraint.
You can't say, has to have an at
sign, has to have a .com or .net.
That has to be in your code.
But at least you can specify
its maximum length here.
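A rough sketch of doing that check in code, assuming a deliberately loose "something at something dot something" pattern. Real email validation is more involved; the function name and the pattern are illustrative choices, not a standard.

```python
import re

MAX_LENGTH = 255  # matches the varchar bound discussed above
# Loose pattern: something, an at sign, something, a dot, something.
EMAIL_PATTERN = re.compile(r"[^@\s]+@[^@\s]+\.[^@\s]+")

def looks_like_email(value):
    """Return True if the value fits the length bound and the loose pattern."""
    return len(value) <= MAX_LENGTH and EMAIL_PATTERN.fullmatch(value) is not None

print(looks_like_email("malan@harvard.edu"))
```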
Now things get a little
more interesting, an ID.
Typically an ID would have been
the first thing we discussed.
But now that we've kind of
had a logical progression,
now it's time to go back and
improve this and give everyone
a unique identifier.
But wait a minute, wouldn't their
email be a unique identifier?
AUDIENCE: What if they
don't have an email?
PROFESSOR DAVID J MALAN: What
if they don't have an email?
So reasonable problem.
Let's make the business decision
that, to hell with these people,
they need to have an email address to
use our website, for whatever reason.
So not concerned about that.
AUDIENCE: Two people share an email.
PROFESSOR DAVID J MALAN: If two people
share an email could be a corner case.
Grace?
AUDIENCE: That was mine.
PROFESSOR DAVID J MALAN:
Could be people sharing
email for family or significant others,
or you just happened to be logged in,
it's easier to use the
same email account.
So that could certainly happen.
And there's another
more technical reason.
What's that?
AUDIENCE: Change in email.
PROFESSOR DAVID J MALAN: If
they want to change their email,
that's actually a good one, too.
It turns out, even though
we're talking right now
about one worksheet, one
database table, so to speak,
it turns out that the unique
identifier is probably
going to end up in other
worksheets or other database tables
like customer service
history, order history.
In other words, whatever we're
using to uniquely identify the user,
we probably are not going to put their
order history in the same worksheet,
if only because, like,
where do I put it?
Well, I could put a column
here for their first order.
And then a column here for their second.
And then their third order.
But this very quickly becomes
messy because where does it end?
Some users are going to have one order.
Some users might have 100 orders.
Doesn't feel like a very clean
way of organizing your data.
Your rows should really be what you
keep adding to a database, not columns.
So as such, much like you
might in your own spreadsheets,
you'll probably put our orders
or customer service history
in their own worksheets where each
email or each order is its own row.
But to do that, if we
have another worksheet,
there needs to be some
common link among them,
and maybe that's their email address.
But that could be
problematic, then, if they
change their email
address, oh my god, now I
have to change it in so
many different places.
There's yet another reason to
use something other than email
to uniquely identify your users.
Yeah?
AUDIENCE: Does it have the
@ symbol as an integer?
PROFESSOR DAVID J MALAN: Yes, so
it will-- an integer is better.
So let's clarify the question further.
I'll claim, it is better to identify
your users via a unique integer
than by an email address.
Why might that claim further be true?
AUDIENCE: An int is
going to take up less space.
PROFESSOR DAVID J MALAN:
Yeah, that's the biggie.
An int is going to take up four bytes.
An email address might take up five
bytes, 10 bytes, I mean, 20 bytes,
depends on how long
your email address is.
And that just seems
unnecessarily inefficient.
So indeed, it's the case in databases
the ID will almost always be
an arbitrary but a consistent
unique number per user.
And it's usually just auto incremented.
So I might be user one.
Nicholas might be user two.
Avi might be user three, and so forth.
And it just keeps getting
incremented automatically
in the database each time
someone registers for the site.
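A sketch of an auto-incremented ID linking two tables, using sqlite3. The orders table, its columns, the ordered items, and the example.com addresses are assumptions for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE customers (id INTEGER PRIMARY KEY AUTOINCREMENT, name TEXT, email TEXT)"
)
conn.execute(
    "CREATE TABLE orders ("
    "id INTEGER PRIMARY KEY AUTOINCREMENT, "
    "customer_id INTEGER REFERENCES customers(id), "
    "item TEXT)"
)

# The database assigns 1, 2, 3, ... as each person registers.
ids = []
for name, email in [
    ("David", "malan@harvard.edu"),
    ("Nicholas", "nicholas@example.com"),  # hypothetical address
    ("Avi", "avi@example.com"),            # hypothetical address
]:
    cur = conn.execute("INSERT INTO customers (name, email) VALUES (?, ?)", (name, email))
    ids.append(cur.lastrowid)

# Each order is its own row, linked to the customer by ID rather than by email.
conn.executemany(
    "INSERT INTO orders (customer_id, item) VALUES (?, ?)",
    [(ids[0], "book"), (ids[0], "pen")],  # hypothetical orders
)

# Changing the email touches exactly one row; the orders are unaffected.
conn.execute("UPDATE customers SET email = ? WHERE id = ?", ("david@example.com", ids[0]))
history = conn.execute(
    "SELECT c.email, o.item FROM orders o "
    "JOIN customers c ON c.id = o.customer_id ORDER BY o.id"
).fetchall()
```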
Phone number, integer?
AUDIENCE: You put dashes
between [INAUDIBLE].
PROFESSOR DAVID J MALAN:
Hm, could do that.
We could just store it
as an integer and just,
because we know we're dealing with
only Americans in the US right now,
we can just forcibly insert, visually,
in the presentation of our data
the parentheses or the
dashes, or whatever.
Possible, and this won't really bite
us because, again, not to belabor math
too much, this is how many
digits, just to be safe?
So we are-- oh, actually.
Three, no, no good.
Why?
Did I count correctly?
Four of these, three of these, yep.
Can't represent the zip code for
430-- the area code 430 or 431 or 432.
All right, so big int to the rescue.
So it turns out there
is big int on the list.
It's 64 bits instead of 32.
But this, too, is kind of foolish
but for a different reason.
That is plenty big to
represent a phone number.
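The arithmetic behind this can be sketched as follows. The 617 number is the one used earlier in the lecture; the 212 number is a made-up example, and the helper name is hypothetical.

```python
UINT32_MAX = 2**32 - 1  # 4,294,967,295: largest unsigned 32-bit value
INT64_MAX = 2**63 - 1   # largest signed 64-bit value

def fits_in_32_bits(number):
    """Return True if the number fits in an unsigned 32-bit integer."""
    return 0 <= number <= UINT32_MAX

low_area_code = 2125550100   # hypothetical number, area code 212
high_area_code = 6174959000  # 617-495-9000 written as one ten-digit number

print(fits_in_32_bits(low_area_code), fits_in_32_bits(high_area_code))
```

Any ten-digit number starting with 430 or higher exceeds 4,294,967,295, which is why a 32-bit int is not enough while a 64-bit one is far more than needed.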
AUDIENCE: Another int.
PROFESSOR DAVID J MALAN: What's that?
AUDIENCE: [INAUDIBLE].
PROFESSOR DAVID J MALAN: OK, so
we can kind of cheat and just use
another int, which is
reasonable, especially
back in the day, an int for the area
code and then an int for the number.
We could even do it for the exchange
and then the last four digits.
But not necessary.
In fact, there is sort of a
semantic thing that should
start to rub you the wrong way here.
Like a phone number,
we call it a number.
It's a collection of
numbers, but it's not
an arbitrary number from 0 to 4 billion
or so, or 0 to whatever this number is.
It is a pattern of 10
digits, in the US case, only.
And so arguably, you
know, I probably store
this as a character field
of length 10, or maybe
a character field of length 10
plus 2, 12, to have hyphens.
Or maybe a couple more characters
if you want parentheses.
But frankly, there's no reason to store
any of the punctuation in the database.
I would probably just
store a 10 character field
because now I know that
the length is bounded
and I'm going to have
to relegate to my code
the check of whether it's all numeric.
So that feels better.
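A small sketch of that division of labor: the code strips punctuation and checks for exactly ten digits before storing, and adds punctuation back only for display. The function names are hypothetical, and this assumes US numbers only, as in the discussion above.

```python
import string

def normalize_us_phone(raw):
    """Keep only the digits and require exactly ten of them."""
    digits = "".join(ch for ch in raw if ch in string.digits)
    if len(digits) != 10:
        raise ValueError("expected a 10-digit US phone number")
    return digits  # this is what would go into the 10-character field

def format_us_phone(digits):
    """Insert the punctuation at presentation time only."""
    return f"({digits[:3]}) {digits[3:6]}-{digits[6:]}"

stored = normalize_us_phone("617-495-9000")
shown = format_us_phone(stored)
print(stored, shown)
```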
An integer really should be
unbounded except by the size
of the data type itself.
A phone number feels to
me, but you could argue it
both ways, that it
should be something else.
But an int, too.
An integer should be
something you do math on.
Shouldn't be doing math
on your phone number.
Feels wrong and feels irrelevant, too.
You'd never have a use case for that.
So let's jump to another number, age.
Here is a good candidate
for an integer, right?
Who hates this idea?
OK, someone should hate
this idea, leading question.
But why?
It's fine to represent age with an int.
Dan?
AUDIENCE: Because it would change.
PROFESSOR DAVID J MALAN:
Yeah, I don't want
to really be changing my database
365 times a year by incrementing
1/365 of my customer base by one just
because their birthday is any given
day.
Better than representing
their age would be what?
Their birth date using
this data type, which
happens to be in the format yyyy,
month month, day day, typically, which
sorts nicely.
In fact, this is an interesting aside.
Computer scientists tend
to think in this way.
There is a huge benefit, well,
huge is subjective, I suppose,
to storing dates whether
it's in your file names
or whether it's in your
database in this format
as opposed to the silly American
convention of month month,
day day, year year year year, or
even the EU approach of like this,
and ignore the errors.
I'm using a calculator
to type out words.
Why is the first way that I claimed
is what the database uses, better?
Dan?
AUDIENCE: It makes sense
to sort by year, rather
than which day in the year or
which month in the year it is.
PROFESSOR DAVID J MALAN: Exactly.
AUDIENCE: If you're going
to have a date on an item,
it would make sense to do it by
year first, so you could see.
PROFESSOR DAVID J MALAN: Exactly.
Everything sorts chronologically as a
result because if you have something
like 2016, 07, 20-whatever
today is, 6 or 7 or so.
So here's one filename or
here's one row in my database.
And now let's pick a day
in August 2016 08 29.
If you compare these alphabetically
or lexicographically
as they would appear in a
dictionary, this later date
actually will come alphabetically
later than everything else and so it
sorts properly.
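A quick check of the sorting claim, comparing the same three dates written year-first and month-first. The specific dates are sample values.

```python
# The same three dates in year-month-day and in month/day/year form.
iso_dates = ["2016-08-29", "2015-12-31", "2016-07-27"]
us_dates = ["08/29/2016", "12/31/2015", "07/27/2016"]

# Plain alphabetical (lexicographic) sorting of the strings.
iso_sorted = sorted(iso_dates)
us_sorted = sorted(us_dates)

print(iso_sorted)  # chronological order
print(us_sorted)   # the 2015 date ends up last
```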
So you can tell who a
computer scientist is
if they cringe when people store
their dates in the wrong format.
Anyhow, slight tangent.
But age, bad, date of birth, better,
would be a better design decision here.
So date of birth.
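A sketch of deriving age from a stored birth date whenever it is needed, so nothing in the database has to change on anyone's birthday. The function name and the sample dates are assumptions.

```python
from datetime import date

def age_on(birth, today):
    """Compute age in whole years as of the given day."""
    years = today.year - birth.year
    # Subtract one if this year's birthday has not happened yet.
    if (today.month, today.day) < (birth.month, birth.day):
        years -= 1
    return years

birth = date(1990, 7, 27)  # hypothetical birth date
day_before = age_on(birth, date(2016, 7, 26))
birthday = age_on(birth, date(2016, 7, 27))
print(day_before, birthday)
```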
Opt in can really just
be a Boolean field.
It turns out most databases
can't just give you a bit,
they can give you a byte.
So you have to waste a few of those
bits to effectively store true or false,
1 or 0, or the like.
So let's talk about one
last detail here that
also rears its head in
programming languages as well.
It turns out that there's different
types of numbers in the world you might
recall from grade school, some of them
have decimal points and some of them
don't.
Integers do not have decimal points.
It's numbers like negative
1, 0, 1, dot, dot, dot,
to infinity in both directions.
Then there are real numbers which are
a superset of those numbers, which tend
to be represented with decimal points.
And even though there is an
infinite number of integers,
there are even more real numbers in some
sense because of the decimal point.
And there's sort of an interesting
theoretical argument there.
But for our purposes, know
that computers, of course,
only use a finite amount of memory.
This is why the biggest int a computer
can typically represent is 4 billion,
if using 32 bits.
And even that's an overstatement.
If you want to support negative numbers,
you have to steal one of those bits,
essentially for the equivalent of
the negative sign or positive sign.
So that gives you only 2 billion
numbers, from negative 2 billion
to positive 2 billion, give or take.
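The "4 billion" and "2 billion" figures can be checked directly. This is a small sketch in JavaScript, which uses a typed array (Int32Array) to get a genuine 32-bit signed integer, since ordinary JavaScript numbers are not 32-bit ints.

```javascript
// 32 bits give 2^32 distinct patterns: roughly 4 billion.
const unsignedCount = 2 ** 32;   // 4294967296

// Spending one bit on the sign leaves roughly 2 billion on each side of zero.
const maxSigned = 2 ** 31 - 1;   // 2147483647
const minSigned = -(2 ** 31);    // -2147483648

// An Int32Array element really is a 32-bit signed integer,
// so going one past the maximum wraps around to the minimum.
const cell = new Int32Array(1);
cell[0] = maxSigned;
cell[0] += 1;
console.log(cell[0]); // -2147483648
```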
So float, as a real
number is called, a float
in a programming language or a database
is a number that has a decimal point.
This is even more problematic because
if you have a finite number of bits, 32
or 64, you can only represent a
finite number of digits in a number.
Unfortunately, there's a lot of
numbers in the world that have
an infinite number of digits in them.
And they're not dot 0, 0, 0, 0.
It's things like pi, 3.14159.
And I don't know the rest
of pi, but it's a lot.
And it goes on forever.
And so at some point, the computer
essentially has to truncate the number
or round the number.
Which is to say, if you choose a float
for a data type in a computer program
or in a database program,
you will be, occasionally,
making mathematical errors.
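A quick way to see these errors, sketched in JavaScript, whose ordinary numbers are 64-bit floats. The specific values 0.1 and 0.2 are just a convenient illustration, not something from the lecture.

```javascript
// 0.1 and 0.2 have no exact binary representation, so their sum is slightly off.
const sum = 0.1 + 0.2;
console.log(sum === 0.3); // false

// The error accumulates: adding ten dimes does not give exactly one dollar.
let total = 0;
for (let i = 0; i < 10; i++) {
  total += 0.1;
}
console.log(total === 1); // false
```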
And unfortunately I can't cite these
examples in the undergrad class
anymore because none of them have
actually watched Superman 3 or even
Office Space.
I mentioned that one in a high
school class recently and I felt old.
But you might recall if you did
see either or both of those movies,
that Richard Pryor and Ron
Livingston and his character
made an awful lot of money,
sort of accidentally,
by skimming fractions of
pennies off of their companies.
Because they realized that
in financial transactions,
they were only looking
at the number.cents.
And if it were half a cent or a
quarter of a cent, that would normally
get rounded away, truncated away.
And so they figured
out in both scenarios,
and Office Space stole
the idea from Superman 3
in the narrative of the story, they
just put all of those fractions of cents
in their bank account.
And as I recall, spoiler
alert, but I think
the movie's been out for 10
or 20 years, 30 or 40 years,
they ended up with a whole lot of money
in their bank account because of this.
And that was because the computers were
effectively using floats, and therefore
imprecise data types.
Thankfully, databases like MySQL,
PostgreSQL, Oracle and Microsoft SQL
Server support decimal
types instead, which
are numbers also that
have decimal points
but you have the luxury of specifying
how many digits to the left
and how many digits to the
right of the decimal point.
And so for a database storing
financial information,
you would absolutely want
to use this over the more
familiar, because of
programming languages,
float, because you get exact precision.
And the database figures
out how best to do that.
So this is a common
sort of subtle thing.
Maybe it doesn't matter
for most companies,
certainly banking companies should be
in the know as to details like this
because it can actually
add or subtract money
from the total account
balances as a result.
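JavaScript has no built-in decimal type, so the sketch below swaps in a different but related technique: keeping money as a whole number of cents. It is not how a database's decimal type works internally; it only illustrates why fixing the number of digits after the decimal point keeps sums exact. The helper names toCents and formatCents are hypothetical, and the sketch assumes non-negative amounts written as strings.

```javascript
// Hypothetical helper: turn a string like "19.99" into a whole number of cents.
function toCents(amount) {
  const [dollars, cents = "0"] = amount.split(".");
  // Keep exactly two digits to the right of the decimal point.
  return Number(dollars) * 100 + Number(cents.padEnd(2, "0").slice(0, 2));
}

// Hypothetical helper: turn cents back into a dollars.cents string.
function formatCents(cents) {
  return `${Math.floor(cents / 100)}.${String(cents % 100).padStart(2, "0")}`;
}

// Integer addition is exact, so no fraction of a cent is lost or invented.
const total = toCents("0.10") + toCents("0.20");
console.log(formatCents(total)); // "0.30"
```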
All right, so at the end of the day,
what do some of the queries look like?
I'll just give you a couple of
samples but we don't actually
play with an actual database here.
If you want to select data
from a table called customers,
you would typically see programmers
type something like this,
or even analysts or less technical
people often pick up a bit of SQL
so that they can do
their own data analytics
or answer their own questions
based on the data set.
You don't necessarily have to
feel like you are or actually
be a professional programmer.
Select star from customers, semicolon.
This is a representative SQL query
that would select all of the rows
from the table called customers
and let me iterate over
them, a la Scratch, one at a time, like
the repeat block or the forever block.
If I want to insert into
my customer's database,
I might want to insert a new name,
email-- just name and email, let's say.
Specifically these values, David
and then malan@harvard.edu.
That's how I might insert a
customer into the database.
Delete from customers where
email equals malan@harvard.edu.
That would delete me as a customer.
And I deliberately chose to
delete me based on my email, why?
AUDIENCE: [INAUDIBLE].
PROFESSOR DAVID J MALAN: Yeah, I
don't want to accidentally unregister
all of the Davids in the database.
And frankly, even email is a
little sloppy for the reasons
we discussed earlier.
Two people might have the same
email address, if you allow that.
So better still would be
where ID equals 123, where
123 happens to be my unique identifier.
But notice, I didn't
insert a unique identifier.
One of the features
you get from databases,
typically, is that they
will generate the ID for you
and let you know what it is, which
in this case I'm assuming was 123.
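Since no actual database is used here, the following is only a sketch. The spoken queries are written out as strings in standard SQL spelling (the exact punctuation is an assumption, since the lecture reads them aloud), and a hypothetical in-memory CustomersTable class in JavaScript imitates two behaviors described above: the generated ID, and why deleting by ID is safer than deleting by name.

```javascript
// The spoken queries, written out as text; nothing here contacts a database.
const queries = {
  selectAll: "SELECT * FROM customers;",
  insert: "INSERT INTO customers (name, email) VALUES ('David', 'malan@harvard.edu');",
  deleteById: "DELETE FROM customers WHERE id = 123;",
};

// Hypothetical stand-in for a table that hands out IDs on insert.
class CustomersTable {
  constructor(firstId) {
    this.nextId = firstId;
    this.rows = [];
  }
  insert(name, email) {
    const row = { id: this.nextId++, name, email };
    this.rows.push(row);
    return row.id; // the caller is told which ID was generated
  }
  deleteWhere(predicate) {
    const before = this.rows.length;
    this.rows = this.rows.filter((row) => !predicate(row));
    return before - this.rows.length; // number of rows removed
  }
}

const customers = new CustomersTable(123);
const davidId = customers.insert("David", "malan@harvard.edu");
customers.insert("David", "another.david@example.com"); // a second, made-up David

// Matching on name would remove both Davids; matching on ID removes exactly one.
const removed = customers.deleteWhere((row) => row.id === davidId);
console.log(removed, customers.rows.length);
```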
We can update values as well.
But we can also scope our
queries to be more limited.
Customers where, let's
say, zip code equals 02138.
This would give me the ability to select
only customers in this particular zip
code.
Notice I quoted the string.
I did not do this, because in
many programming languages,
SQL among them, quoting a value or
not actually has semantic meaning.
When you quote a value, it's a string,
a sequence of alphabetical characters
or alphanumeric characters.
When you don't quote a
sequence of characters,
it's interpreted, generally,
as being a number.
Unfortunately, a zip code is
not a number, semantically.
It's a sequence of digits, so to speak.
But of course, in the
decimal notation, that
would be equal to this, which
also suggests if we now rewind,
what data type should we use
to be clear for our zip code?
Characters is probably
better than numbers.
And in fact, I learned this
the hard way when years ago, I
was using Microsoft Outlook
for years for email.
I eventually decided to
switch to Gmail and I
exported all of my
contacts using Outlook
as like a big CSV file,
comma separated values, which
is like an Excel spreadsheet.
And then I must have done
a spot check and I double
clicked and opened it
in Excel, looked at it,
must have instinctively
or reflexively hit Save
and then quit it without
really making any changes.
But dammit if Excel
didn't presumptuously
decide that any column that has numbers
must surely be numbers, not zip codes.
So to this day I occasionally
look up a friend's address for
like mailing them something
and I find that they
live in Cambridge,
Massachusetts 2138 USA
because Excel treated the data
as a number and not as a string.
And so to this day I always sort of
cringe, like years and years later,
I'm still finding
friends who live in 2138.
But it sort of speaks to this
kind of corner case or issues.
So this should have been
considered a string or sequence
of characters in both cases.
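The zip code mishap is easy to reproduce. A minimal sketch in JavaScript, where the repair step assumes every value is meant to be a five-digit US zip code:

```javascript
const zipAsText = "02138";

// Treating the digits as a number silently drops the leading zero.
const zipAsNumber = Number(zipAsText);
console.log(zipAsNumber); // 2138

// Once the zero is gone, the original text can only be guessed at.
// This repair assumes the value was always a five-digit zip code.
const repaired = String(zipAsNumber).padStart(5, "0");
console.log(repaired); // "02138"
```

Storing the zip code as a string from the start avoids needing any repair.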
All right, so we've only
just scratched the surface.
But this should give you, hopefully,
a sense of the sort of litany
of design decisions
that have to be made.
And this is the kind of thing that
actually does determine whether someone
is good or not so good at this.
And it determines how well your
website performs under load.
Because even beyond this, just to give
you three final ingredients, or one
final ingredient, there
are things in databases
called indexes and primary keys,
which we've only just alluded to.
And let's see, full text is a feature of
MySQL and other databases, and unique.
And these are just keywords
where I can specify
in advance that a field in my database
should be optimized for searchability.
In other words, if I know
in advance that I'm really
going to search on zip codes a
lot, I should tell my database
to index that field.
And I do this with a certain command
or by clicking something in a web page.
And I do that once when I
first set up my database
and I'm designing my website.
And then thereafter, the
database's purpose in life,
and why I am paying Oracle or why I am
using a popular open source free tool,
is because they claim to be more
high performing than others.
And that's because they are good
at building fancy tree-like data
structures underneath the hood
to get me my data quickly.
But I have to give them these hints.
And I need to tell them,
hey, this is unique.
Don't let me-- don't let two users
with the same email address register.
Hey, let me search free form text.
So if the user just types
in some random words,
I want to be able to search over
their whole profile using something
like this.
And then primary is a way of
saying, this ID number in this field
shall uniquely identify my users.
I am sort of contractually
agreeing to that as the programmer
so that the database can
actually leverage that detail.
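To give a feel for what these hints buy you, here is a sketch in JavaScript. It swaps in a Map (a hash-based lookup) for the tree-like structures a real database builds, so it illustrates the idea of an index rather than any database's implementation. The function names and the sample rows other than David's are made up for illustration.

```javascript
const customers = [
  { id: 123, name: "David", email: "malan@harvard.edu", zip: "02138" },
  { id: 124, name: "Alice", email: "alice@example.com", zip: "02138" },
  { id: 125, name: "Bob", email: "bob@example.com", zip: "10001" },
];

// Without an index: look at every row on every search.
function findByZipScan(rows, zip) {
  return rows.filter((row) => row.zip === zip);
}

// With an index: do the work once up front, then each lookup is direct.
function buildIndex(rows, field) {
  const index = new Map();
  for (const row of rows) {
    if (!index.has(row[field])) index.set(row[field], []);
    index.get(row[field]).push(row);
  }
  return index;
}

// A "unique" hint: refuse a second row with an email already in use.
function addCustomer(rows, emailsInUse, row) {
  if (emailsInUse.has(row.email)) {
    throw new Error(`email already registered: ${row.email}`);
  }
  emailsInUse.add(row.email);
  rows.push(row);
}

const zipIndex = buildIndex(customers, "zip");
const emailsInUse = new Set(customers.map((row) => row.email));
console.log(zipIndex.get("02138").length, findByZipScan(customers, "02138").length);
```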
So it's all about sort of
educating machines in this way.
And while this is not
machine learning, it's
a decent opportunity to mention a
couple of these topics which generally
fall into the category of
ingredients that we can bring
to solve problems in a software sense.
Machine learning is
one incarnation of AI,
or artificial intelligence, whereby
you write software that somehow learns.
And you typically provide your
program with training data,
sort of representative financial
data or maybe sales data,
or any type of data, that
is sort of retrospective.
And you want to sort
of train the software
you've written to leverage that
data and predict future results.
So there's this kind of
feedback loop whereby
you train your data set and then
you try to apply it to new problems.
Or what's the stock price
tomorrow going to be like?
What are our projections for sales
going to be like in the future?
And so this is very much a trendy
and fundamentally compelling
subfield of computer
science, whereby you
can leverage this to try to
answer questions more effectively,
things like Siri and Cortana are
really about machine learning.
Apple and Microsoft and others
trying to train a software
to interpret my own voice better.
And in fact, machine learning can
sometimes take individual ingredients.
They don't do this so much anymore,
but what's it called, Dragon speak,
I think, the software
where you could actually
talk to your computer
for dictation software,
would often train the software
based on your own voice,
having you read certain things.
And that, too, would be an example
of machine learning as well.
Hadoop, meanwhile, is
a piece of software
that's commonly used in
distributed applications.
It's software that you can run.
And this would have tied in
pretty well to yesterday's chat
about cloud computing where you have
access to lots and lots of machines.
Hadoop allows you to take some job, for
instance, even something like the New
York Times, for example,
generating a whole lot of PDFs
of millions and millions of
articles, but distributing
that load over a whole
bunch of worker nodes,
whereby there is one master
node that somehow orchestrates
all of this in a cluster but then
it just kind of farms out all
of the actually hard or
interesting work to these worker
nodes, who eventually report back.
And that data all gets
aggregated somehow.
And so Hadoop is very popular for that.
And it's very popular
in the cloud context
because people often want to spin up
or turn on a whole bunch of machines
at once, run some distributed
job and then that's it.
It doesn't necessarily
have to be run ongoingly
but it certainly could
on premise as well.
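The master and worker arrangement can be sketched in a few lines of JavaScript. This is not Hadoop's API and it runs in a single process; the function names and the word-counting job are assumptions chosen only to show the shape of the idea: split the job, let each worker handle its share, then aggregate what they report back.

```javascript
// Deal the items out to the workers round-robin.
function splitIntoChunks(items, workerCount) {
  const chunks = Array.from({ length: workerCount }, () => []);
  items.forEach((item, i) => chunks[i % workerCount].push(item));
  return chunks;
}

// Each worker counts the words in its own share and reports a partial result.
function worker(chunk) {
  const counts = new Map();
  for (const text of chunk) {
    for (const word of text.toLowerCase().split(/\s+/).filter(Boolean)) {
      counts.set(word, (counts.get(word) || 0) + 1);
    }
  }
  return counts;
}

// The master farms out the work, then aggregates the partial results.
function master(items, workerCount) {
  const partials = splitIntoChunks(items, workerCount).map(worker);
  const total = new Map();
  for (const partial of partials) {
    for (const [word, n] of partial) {
      total.set(word, (total.get(word) || 0) + n);
    }
  }
  return total;
}

const wordCounts = master(["a b a", "b c", "a"], 2);
console.log(wordCounts.get("a"), wordCounts.get("b"), wordCounts.get("c"));
```

In a real cluster each call to worker would run on a different machine, which is where the speedup comes from.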
Damn it, I've got to keep thinking
of an answer to this one, now.
All right, any questions,
then, on database design
or those kinds of topics?
Yeah?
AUDIENCE: How does MongoDB
fit into all of this?
Just like an online
database, a program online?
PROFESSOR DAVID J MALAN: No,
online wouldn't have meaning here.
It's software that you can download.
You can run it here on my laptop.
You could run it in the cloud on
Heroku or Amazon Web Services.
It is an answer to the first type
of database that we talked about,
an object-oriented
database, where you can
store things that look
like those JSON objects
that I first wrote with
the more textual syntax.
They're especially trendy now because
they're easier to use in some sense.
You can think they're designed to allow
you to think a lot less about your data
but you do pay a price sometimes
in terms of redundancy.
You might sometimes have the same
data stored in multiple locations,
though there is the notion
of unique identifiers
that allows you to factor that out.
MongoDB and things like that
are a little more conducive
to languages that are in vogue
these days, JavaScript specifically.
So it, too, is just a
trend and representative
of a class of databases.
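Part of why document stores pair well with JavaScript is that a JSON document is already the shape of a JavaScript object. A small sketch, using the customer from earlier; the key names are assumptions, and no MongoDB calls are shown, only the round trip between an object and its JSON text.

```javascript
// The customer as a JavaScript object; key names are assumed for illustration.
const customer = {
  id: 123,
  name: "David",
  address: "1 Brattle Square, Cambridge, Mass. 02138 USA",
  email: "malan@harvard.edu",
};

// Object to JSON text, as it might be stored or sent over the network.
const stored = JSON.stringify(customer);

// JSON text back to an object, with no schema declared anywhere.
const loaded = JSON.parse(stored);
console.log(loaded.name, loaded.email);
```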
Yeah?
AUDIENCE: Is JSON like XML?
PROFESSOR DAVID J MALAN: Yes, it's sort
of a lighter weight version of XML.
XML is just very verbose.
It's kind of dying off
as a popular format
because it was such a pain to use.
Good intentions, just very heavyweight.
Yeah?
Anessa?
AUDIENCE: [INAUDIBLE].
PROFESSOR DAVID J MALAN: Possibly.
I would need to know more and would need
to read up on some specific technology
to speak to that better.
But the general principle is absolutely.
Irrespective of how you
store your data, you
might need to massage it into some
other format, as someone would say,
whereby you ready it for some
other analytical process.
So hard to answer in the
abstract but absolutely,
that would be a commonly done thing,
especially for analytics if you're
trying to aggregate the data somehow.
Yeah, Avi and Marco?
AUDIENCE: [INAUDIBLE].
PROFESSOR DAVID J MALAN:
Short answer, yes.
For instance, when I mentioned that
really big textual strings are stored
elsewhere, I meant that literally.
So if you think of a table as really
just being an array of memory,
when you have really
big chunks of text that
are bigger than a varchar
supports, they wouldn't
be stored in this rectangular
region of memory, so to speak.
It might be stored over here
where there is more space,
albeit at the cost of slower to access.
And there would be the
equivalent of a pointer
where that cell would be in the
database pointing over here.
So your schema decisions,
your design decisions
do affect the lower
level details for sure.
AUDIENCE: Otherwise, the data is
actually stored in the physical table?
PROFESSOR DAVID J MALAN:
Physical disk and on top
of that is layered the idea of a table.
So at the end of the day,
everything is stored permanently
on disks these days, so like
mechanical disks, maybe SSDs.
But you get more space from
mechanical disks, still.
And it might live temporarily in memory.
So to yesterday's comment about
in memory as being a feature,
all the data is hopefully
being still stored on disk.
But the system probably comes
with a lot of RAM or memory
to hold it temporarily.
Marco?
AUDIENCE: I don't know
if it's true or not,
but some months ago there was a story
about a woman with the last name Null,
N-U-L-L.
PROFESSOR DAVID J MALAN: OK.
AUDIENCE: Every time she tried to
register or to buy an airplane ticket,
for instance, she had
problems, because the website
crashed because of her last name.
PROFESSOR DAVID J MALAN: Really?
AUDIENCE: I don't know if it's true.
PROFESSOR DAVID J MALAN: It could be.
I mean, it doesn't fundamentally
need to be the case.
There are bugs in the
software, then, that
are not handling her name properly.
I can imagine what was
happening, whereby they were just
plugging her name into it a-- context.
Would that do it?
No.
It's possible.
I can't think of a specific
language where that would happen.
So it could be kind of a
myth or a joke but maybe.
Let me think about what
language could trick-- you
could trick null to thinking it's zero.
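One way such a bug could arise, purely as a guess and not a claim about any real airline's code: software that converts values to text and then checks for the word "null". The function names below are hypothetical, and the sketch is in JavaScript.

```javascript
// Hypothetical buggy check: treats the text "null" as if nothing was entered.
function isMissingBuggy(value) {
  return value === null || String(value).toLowerCase() === "null";
}

// A safer check: only a genuinely absent or empty value counts as missing.
function isMissing(value) {
  return value === null || value === undefined || value === "";
}

console.log(isMissingBuggy("Null")); // true: a real surname is rejected
console.log(isMissing("Null"));      // false: the surname is accepted
```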
AUDIENCE: [INAUDIBLE].
PROFESSOR DAVID J MALAN:
No, not a problem here.
Yeah?
AUDIENCE: If where does-- [INAUDIBLE].
Like if you delete a
field from a table and you
might delete all that data with it too.
What does it--
PROFESSOR DAVID J MALAN: Good question.
That one I think will depend
much more on the database.
That's a level of detail that the
database user wouldn't necessarily
know.
In reality, yesterday, I was
definitely oversimplifying a bit
because there are so many layers in
between us and our files these days.
There is the physical hard drive.
There is the software or
the firmware, so to speak,
that's running on the hard drive.
There's the device driver
built into the operating
system that talks to the hard drive.
There is the operating system that talks
to the device driver and each of those
can do whatever it wants
with the layer below it.
So it's hard to say.
Odds are, space is
re-used where possible,
except for performance
reasons, sometimes it
might be packed,
especially tight together.
For instance, there's an archive
data format for certain databases
whereby the moment you write or
insert a row into the database,
it gets compressed.
And something tells me that is
really compacted in memory back
to back to back because you're
making the contract with the database
that you're not going
to change that data.
You want it to be archived.
But hard to say without looking at the
actual source code or documentation.
Other questions?
All right, so let's take a
final look at web programming
through the lens of an
actual language, JavaScript,
playing in turn with some
sample code and a sample API.
So we'll get to an API.
Let's start with a bit of JavaScript.
And let's do it as follows first.
If you go to, let's
say, this screen here.
Let me give a couple definitions.
Here's a very simple web page, again.
And I've highlighted in yellow
two new tags, the script tag.
And we saw these briefly when we looked
at Google source code, but in no detail
yesterday.
But we also saw other tags
in the head of a web page
when we looked at CSS, for
Cascading Style Sheets.
So we're introducing script
because in this scenario
you can actually put programming
code between that open script tag
and closed script tag.
Specifically, the
language is JavaScript.
Back in the day, you could use something
like VBScript, Visual Basic script
in Microsoft IE.
No one really did that and
it's not cross-platform.
So JavaScript is really the only
thing you can put there these days.
Let me stipulate that putting JavaScript
code in the head of your web page,
not good, for all of the reasons we
discussed yesterday because you're
co-mingling your data with the
presentation with now some business
logic that you would express in code.
And so while possible, this is
generally not the right approach.
A more correct approach
tends to be this,
where you write all of
your JavaScript code,
a bit of which we'll
write in just a moment.
But you put in a separate file, maybe
it's called scripts.js or whatever.
But you reference that file in this way.
You then get the benefits of caching.
You then get the benefits of
separating your logic from your markup
language and all of the same
answers we gave yesterday
for Cascading Style Sheets.
So let's play with this
in the following way.
So these are some examples
from a colleague at Stanford.
So if you could, from today's slides,
go to this URL, this URL here.
And let me introduce you to the
simplest of APIs as follows.
Let me grab one thing.
I'm looking at the source code
of the page for just a moment
so I can remember something.
Where is that?
OK.
OK, I'm about to define
the following API for us.
And that ties together nicely
enough a whole bunch of topics.
So an API, or application
programming interface,
is a fancy way of describing a way
of using a library, if you will.
A library is a bunch of
code that someone else wrote
that does something that you can use.
An API is kind of a
higher level concept.
It is the documentation
for how you use that code.
If you're using an API, you are
using a library in a prescribed way,
if you will.
And this can be more concretely
defined in the following way.
We're about to introduce
you to JavaScript.
But using your keyboard
only, no mouse, no clicking
and dragging, because JavaScript
is a textual language.
So what you're about to see
are the textual equivalent
of Scratch's puzzle pieces,
or the programming blocks
we used a moment ago.
You're about to have the
ability to call, so to speak,
a few different puzzle pieces.
A puzzle piece, or a function,
or method as we would call it,
called getRed, that
takes two values, x
and y, where x and y are the Cartesian
coordinates of a pixel in an image.
Henceforth, we're going to
assume that an image is really
just a rectangle on the screen.
And it's a GIF or PNG or JPEG,
things that we see every day
on Facebook and Gmail and the like.
And generally speaking,
this is 0, 0 over here.
This would be like something comma 0.
This would be 0 comma something.
And this would be
something comma something.
So you count this way
and that way, generally.
So when I say x and y, this means get
me the x-th y-th pixel at x comma y
location.
So there are two other functions:
getGreen, x comma y.
And get, as you might
guess, blue, x comma y.
So those are three API calls, so
to speak, three functions that you
can call in this way.
And then there's three others, set red.
And actually,
capitalization is important
so I should be a little less sloppy.
setRed, x, y, and I'm going to call it
n, where n is the number, in this case,
from 0 to 255, I
believe, for Nick's code.
setGreen, x, y, n.
setBlue, x, y, n.
And I've deliberately
written my method names
in what's called camel case,
where camels have humps
and so similarly does the
text kind of have humps to it
where the convention
is, you start lowercase.
And then you capitalize each
subsequent word in the method
or in the function's name.
So this is a convention.
And Nick, the professor at
Stanford who wrote this code, just
adhered to convention.
But this is not a technical thing,
it's more of a human convention.
And it varies by language
what people tend to do.
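To make those six calls concrete, here is a sketch of a stand-in class in JavaScript. It is not Nick's library; TinyImage is a hypothetical name, it loads no files, and clamping values to the 0 to 255 range is an assumption. It only mirrors the method names described above so the calls can be tried out.

```javascript
// Hypothetical stand-in: a grid of pixels, each with red, green and blue values.
class TinyImage {
  constructor(width, height) {
    this.width = width;
    this.height = height;
    // Three numbers per pixel; Uint8ClampedArray keeps each within 0..255.
    this.data = new Uint8ClampedArray(width * height * 3);
  }
  offset(x, y) {
    return (y * this.width + x) * 3; // where pixel (x, y) starts in the array
  }
  getWidth() { return this.width; }
  getHeight() { return this.height; }
  getRed(x, y) { return this.data[this.offset(x, y)]; }
  getGreen(x, y) { return this.data[this.offset(x, y) + 1]; }
  getBlue(x, y) { return this.data[this.offset(x, y) + 2]; }
  setRed(x, y, n) { this.data[this.offset(x, y)] = n; }
  setGreen(x, y, n) { this.data[this.offset(x, y) + 1] = n; }
  setBlue(x, y, n) { this.data[this.offset(x, y) + 2] = n; }
}

const im = new TinyImage(2, 2);
im.setRed(1, 0, 200);  // top row, second column
im.setBlue(0, 1, 300); // out of range, so it is stored as 255
console.log(im.getRed(1, 0), im.getBlue(0, 1), im.getGreen(0, 0));
```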
So here's the challenge at hand.
Number one, an iron image puzzle.
So this iron puzzle.png
image is a puzzle.
It contains an image
of something famous.
However, the image has been distorted.
The famous object is in the red values.
However, the red values
have all been divided by 10.
So they're too small by a factor of 10.
So all of the redness
in these pixels has
been dulled down so much that you can't
really tell what the image is anymore.
The blue and green values are just
all meaningless random values, a.k.a.
noise, added to obscure the real image.
You must first undo these
distortions to reveal the real image.
And how to do this.
First, set all of the
blue and green values
to zero to get them out of the way.
Look at the result. If you look very
carefully, you may see the real image,
though it is very dark,
way down towards zero.
Then multiply each red value
by 10, scaling it back up
to approximately its proper value.
What's the famous object?
So this ties together our
discussion yesterday of RGB,
whereby each of these
thousands of dots on the screen
has three numbers associated with
it, how much red, how much green,
how much blue.
What Nick is saying is
that he's just added
a whole bunch of green and blue values.
So for every pixel that has
three values, two of them
are just random numbers that Nick
has thrown at the puzzle creating
this noisy static-y image.
The red, meanwhile, he's turned
the dial all the way down.
So there's still a little red there.
If he turned it all the way to
zero, there'd be no information.
But there's enough information, it's
just a tenth as much information
as you want.
So we're going to have to zero
out the blue and green values
and ratchet up, magnified by
a value of 10, the red values.
So let me get you started
and you're welcome to work
with the person next to you.
And the goal here, really,
is just to give you
a taste of programming in JavaScript
with a very nice visual impact.
And here, in this text box below
the image, is some sample code.
Let me walk you through it
and give you a bit of syntax
and then send you on your way to
see if you can recover this image.
Here's how it works.
This top line on the left declares
a variable called IM for image.
It's arbitrary, Nick just was
succinct, so IM is what he chose.
Right hand side says new simple
image and then iron puzzle png.
This is just code that's using a
library called the simple image library.
And Nick knows that to open
a file using this library,
you literally type
new simple image quote
unquote "filename"
with some parentheses.
The effect of that is
to store in the variable
called IM, not a number, not a word
like we've discussed in the past
as in Scratch, but to
store in a variable
a whole image, a whole grid
of pixels, if you will.
The next line of code is similar
to Scratch's repeat block.
It's a for loop, so to speak.
And the syntax here is saying,
initialize a variable called x to zero.
Then increment x on each
iteration of this loop by one.
So x plus plus just means add 1, add
1, add 1, add 1, starting from zero.
And then this condition,
notice the less than sign,
says, keep doing this so long as x
is less than the width of that image.
So this syntax here is
image is the variable name.
Dot means go inside of that
variable and call, that is,
use the puzzle piece called get
width, whose purpose in life
is to just give you the
width of that image.
So excuse me, in
layman's terms, this just
means do the following thing x times
where x is the width of the image.
So it's like iterating over every
column of pixels, if you will.
And then you can perhaps guess
what does the inner loop do,
the for loop that involves y?
If the outer loop is
iterating over the columns,
probably y is representing the
rows, down and down and down.
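The order in which the two loops visit pixels can be seen with a tiny sketch in JavaScript. The 3 by 2 size is an arbitrary assumption standing in for the image's width and height.

```javascript
const width = 3;  // stands in for the image's width
const height = 2; // stands in for the image's height
const visited = [];

// Outer loop: one column at a time, left to right.
for (let x = 0; x < width; x++) {
  // Inner loop: every row in that column, top to bottom.
  for (let y = 0; y < height; y++) {
    visited.push(`${x},${y}`);
  }
}

console.log(visited.join(" ")); // 0,0 0,1 1,0 1,1 2,0 2,1
```

Every (x, y) pair is visited exactly once, so whatever goes inside the inner loop runs once per pixel.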
So this here is just a comment.
So I'm going to delete this.
And let me give you this tidbit.
If you want to set the
green value to something,
you would do image.setgreen???.
If you want to do
image.setblue, you would
do something, something, something.
And if you want to get
the value of red, you
might say red gets image.getred
of something something.
And that's it.
Red, here is a variable.
And I'm omitting one
final line, which will
allow you to set the amount of red.
But let me turn on some
music for a couple minutes,
even if you've never
programmed before, you're
welcome to work with the person or
persons to the left, to the right,
in front and behind, whoever
helps you get this done.
And re-read the problem
statement if you need to.
But I claim that my little
hints here are probably
enough puzzle pieces
for you to figure out
how to implement this in JavaScript.
So let me start to fill in some blanks.
Would someone like to offer up, what
is the line of code with which I can
set all of the green values to zero?
AUDIENCE: im.setGreen(x, y, 0).
PROFESSOR DAVID J MALAN: OK, good.
So let me run this per my
suggestion of baby steps.
Click Run, Save.
And notice, it suddenly
gets much more blue.
Why is that?
Well we've essentially
turned off the green.
Just for demonstration's
sake, let me do the opposite.
Let me ratchet it up
to 255 instead of 0.
And now the image is really green.
So really, we're just kind
of turning a knob there.
But let's leave it at zero.
And someone else, how do I
set the blue to zero as well?
setBlue x, y, 0.
So now let me hit Run Save.
Now unfortunately, it
looks really, really black
and really washed out on
this screen, certainly.
And you can probably tilt your
laptop and turn up your brightness
and kind of see
something, and that's just
because the red value is so close to
zero, but there's information there.
But as they would say
in the cheesy TV shows,
we need to enhance the image
so as to increase the fidelity.
So we need one other line of code.
And I gave you this one.
I said red gets image.getred at xy.
And I gave that hint to
you so that you would
have a way of referencing the amount
of red currently in the image.
And how did people go about
magnifying it by a factor of 10?
And I did not give you this ingredient,
so it's perhaps non-obvious.
How can I multiply this value?
So it turns out, if you
want to take the red value
and set it equal to its current value
times 10, you might think it's x.
But of course x, we've already
seen, is a variable in this case.
So it turns out that many
programming languages
use an asterisk as multiplication.
You wouldn't know that so it's fine
if you struggled with the final step.
But let me multiply it by 10.
But it's not enough to
just change the variable.
What do I now need to do
with the variable called red?
AUDIENCE: Set the red to it.
PROFESSOR DAVID J MALAN: I
need to set the red to it.
So you can think of red, this
variable, as a puzzle piece
that I now need to drag and drop into
one of those question mark placeholders
and say image.setred at x
comma y to not 0 not 255,
but to whatever this red value is.
So if I now click Run Save, if
you've not solved it on your screen,
the answer is the Eiffel Tower.
And it's just there by nature of
having ratcheted up the red value so
that there's still black
in the image, the Eiffel
Tower itself is mostly black.
But against this red sky, it
rather pops out as the result.
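Putting the steps together, the whole solution can be sketched as follows in JavaScript. Because the real image file and library are not available here, makeImage is a hypothetical stand-in that mimics the method names from the lecture on a tiny 2 by 2 grid of made-up values; only the loop body reflects the puzzle's actual recipe.

```javascript
// Hypothetical stand-in for the image object: same method names, no file loading.
function makeImage(width, height, fill) {
  const pixels = [];
  for (let i = 0; i < width * height; i++) {
    pixels.push(fill(i)); // each entry is [red, green, blue]
  }
  const at = (x, y) => pixels[y * width + x];
  const clamp = (n) => Math.max(0, Math.min(255, n));
  return {
    getWidth: () => width,
    getHeight: () => height,
    getRed: (x, y) => at(x, y)[0],
    getGreen: (x, y) => at(x, y)[1],
    getBlue: (x, y) => at(x, y)[2],
    setRed: (x, y, n) => { at(x, y)[0] = clamp(n); },
    setGreen: (x, y, n) => { at(x, y)[1] = clamp(n); },
    setBlue: (x, y, n) => { at(x, y)[2] = clamp(n); },
  };
}

// Made-up pixels: faint red values 1 to 4, with green and blue as noise.
const im = makeImage(2, 2, (i) => [i + 1, 200, 77]);

for (let x = 0; x < im.getWidth(); x++) {
  for (let y = 0; y < im.getHeight(); y++) {
    im.setGreen(x, y, 0);      // remove the green noise
    im.setBlue(x, y, 0);       // remove the blue noise
    let red = im.getRed(x, y); // read the faint red value
    red = red * 10;            // scale it back up
    im.setRed(x, y, red);      // store it in the image
  }
}

console.log(im.getRed(0, 0), im.getRed(1, 1)); // 10 40
```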
So very nice.
And this is an example of a general
technique known as steganography,
or the art of hiding information
in other information.
And the world starts to get kind of
spooky when you think about this.
Because we've clearly hidden in
what was previously a whole bunch
of seemingly noise, an actual image.
Now that image could have been text.
This could have been my secret
message to [INAUDIBLE] earlier.
It could have been in the
form of an image, not even
in the form of a note or an email.
And you can imagine artists leveraging
this to watermark their images.
We typically see pretty blatant
ugly watermarks on images,
but there's no reason you
couldn't embed much more subtlety
in the pixels of an image your name,
your initials, even more information.
So that if someone is
ripping off your images
you can claim, especially
if you're in the media,
that you are the original
source of these images.
Or you can actually transmit messages.
I mean, what more clever a way for two
bad guys to communicate on the internet
than to both have seemingly very
innocuous websites, a blog if you will,
photos of what they've
done during the day.
But if you actually run
code on the images, embedded
in every one of those publicly
accessible images on Tumblr or Facebook
or whatever, might very
well be secret messages
using a technique not unlike this one.
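As an illustration of hiding text in pixels, here is a sketch of a different, common technique: storing each bit of the message in the least significant bit of a color value, which changes each value by at most 1. This is not the scaling trick used in the puzzle, the function names are hypothetical, and the sketch assumes a plain ASCII message whose length the recipient already knows.

```javascript
// Hide each bit of the message in the lowest bit of successive color values.
function hideMessage(values, message) {
  if (message.length * 8 > values.length) {
    throw new Error("not enough pixel values to hold the message");
  }
  const out = Uint8ClampedArray.from(values);
  for (let c = 0; c < message.length; c++) {
    const code = message.charCodeAt(c) & 0xff; // assumes ASCII characters
    for (let bit = 0; bit < 8; bit++) {
      const i = c * 8 + bit;
      // Clear the lowest bit, then write one bit of the character into it.
      out[i] = (out[i] & 0xfe) | ((code >> (7 - bit)) & 1);
    }
  }
  return out;
}

// Read the lowest bits back and reassemble the characters.
function revealMessage(values, length) {
  let text = "";
  for (let c = 0; c < length; c++) {
    let code = 0;
    for (let bit = 0; bit < 8; bit++) {
      code = (code << 1) | (values[c * 8 + bit] & 1);
    }
    text += String.fromCharCode(code);
  }
  return text;
}

const cover = new Uint8ClampedArray(64).fill(200); // made-up channel values
const hidden = hideMessage(cover, "hi");
console.log(revealMessage(hidden, 2)); // "hi"
```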
Let's do one more.
This next one is a reddish image,
also showing something famous.
And the definition here is
that the true image this time
is in the blue and green values.
However, all of the blue and green
values have been divided by 20.
So the values are very small.
Excuse me.
The red values are just random numbers,
noise, that's been added on top.
So you need to undo those
distortions to reveal the image.
Let me take the first
line of code that would
allow us to set the red values to zero.
What do I have to do this time?
image.setRed at x, y to 0.
All right, next.
Let me go ahead and run this.
Pretty black.
So let's see what comes next.
Multiply the blue and green values
by 20 to get them back approximately.
So how did someone do
the green values first?
Any suggestions?
Green, I hear Alycia
mouthing green equals--
AUDIENCE: [INAUDIBLE].
PROFESSOR DAVID J MALAN: Getgreen at xy.
And I'm going to be a
little presumptuous,
blue gets image.getBlue at x, y.
And now what do I want
to do with these values?
AUDIENCE: [INAUDIBLE].
PROFESSOR DAVID J MALAN: OK,
so green gets green times 20.
Blue gets blue times 20.
And then lastly,
image.setGreen at x, y to green.
image.setBlue at x, y to blue.
Holding our breath, Ah.
A little more color
this time because we're
using both the green and the
blue channels and just not
the red this time.
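The steps just dictated can be sketched as one self-contained function. This is only an approximation of what was typed on screen: the in-class environment supplies its own image object with getters and setters, whereas this sketch assumes a flat array of RGBA values, four per pixel, and a hypothetical function name, revealHiddenImage.

```javascript
// Assumption: `pixels` is a flat array of RGBA values (0-255), four per pixel.
function revealHiddenImage(pixels, factor = 20) {
  // A clamped array caps anything above 255 at 255 on assignment.
  const out = Uint8ClampedArray.from(pixels);
  for (let i = 0; i < out.length; i += 4) {
    out[i] = 0;                       // red is only noise, so zero it out
    out[i + 1] = out[i + 1] * factor; // green: undo the division by 20
    out[i + 2] = out[i + 2] * factor; // blue: undo the division by 20
    // out[i + 3] is alpha and is left untouched
  }
  return out;
}

// One pixel: noisy red 200, green 5, blue 12, fully opaque.
const decoded = revealHiddenImage([200, 5, 12, 255]);
console.log(Array.from(decoded).join(", ")); // 0, 100, 240, 255
```

The recovered values are only approximate, since the original division by 20 discarded the remainder.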
So if in an at home exercise you'd
like to tackle the west image puzzle,
there is one more in here
that you might enjoy as well.
So let's take a gamble here.
And in our remaining time ratchet
things up so that you either feel-- so
hopefully, this isn't one of
those demos that backfires
because there's a few moving parts.
The goal is to get
everyone up and running
with your own instantiation
of a tiny, tiny web app that
implements the Google Maps API.
So how are we going to do this?
First, if you would, go to cs50.io like
yesterday, and go ahead and log in.
And I'm going to go ahead
and do the same here.
And I'm going to go ahead
and sign in to this here.
And take a moment to just get
back to where you were yesterday,
which should load after a moment or so.
So eventually you should
be at a screen like this.
And in the mean time, if you
could open up today's slides
and also open in a separate
tab, this URL here,
which is the entry point for an API
from Google that folks like Uber,
I believe, and lots and lots and
lots of people on the internet
use to embed maps into
their own applications
so that you can start to
do things that use maps,
but simply assume it as an ingredient
to your own, more interesting,
application.
So that URL there.
And at this point in the story,
hopefully everyone has Cloud9 open
to roughly this state?
It's OK if you have other
tabs open from yesterday.
And let's go ahead and do the following.
Go ahead and go to File, New File.
And that will give you a new tab.
And then just type in the
word, Tuesday, or something,
just so we have a quick and dirty test
of whether or not this is working.
Go to File, Save.
And call it map.html.
And odds are this will co-exist
alongside yesterday's file, which
was hello.html.
So when you hit Enter, odds are your
interface looks roughly like mine,
with map.html open in the editor,
also in the file browser at left.
And you probably have your little blue
terminal window open at the bottom.
So now, if you would, typically,
since we're using the free accounts,
the web server typically turns itself
off and your account hibernates
after some amount of time.
So just for good measure, go ahead in
your terminal window and run apache50
start period, with spaces in between.
And hit Enter.
And if it's still running, that's fine.
It might say stopping and then starting.
And you should see the same URL that
you were encouraged to visit yesterday.
And if you could, the third and
final window to open in a tab
here is click on that URL, open
your website, and visit map.html.
And you should see one of
two things, ultimately.
Either forbidden, like mine, or you
see Tuesday, or whatever you typed.
If you see forbidden, what was the
solution in your terminal window?
Yeah.
Chmod a+r for read, on map.html.
Let all of the world read it.
And again, that's just
giving global permissions.
Nothing should seem to
happen when you hit Enter.
But if you go back now
to the forbidden window--
and notice I didn't mention this
yesterday-- if you look at the tab,
it says 403 forbidden.
There's that HTTP status code.
Not 404 but 403.
If you reload, hopefully
you see Tuesday or whatever
it is you typed into your tab.
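To make "let all of the world read it" concrete, here is a small sketch of what a+r does to a file's permission bits. The starting mode of 0o600 (owner-only read and write) is an assumption for illustration, since the actual mode of the file was not shown, and the helper names are hypothetical.

```javascript
// Add read permission for user, group, and others: the effect of `chmod a+r`.
function addReadForAll(mode) {
  return mode | 0o444;
}

// Format the low nine permission bits the way `ls -l` shows them.
function formatMode(mode) {
  const flags = "rwxrwxrwx";
  let s = "";
  for (let i = 0; i < 9; i++) {
    s += mode & (1 << (8 - i)) ? flags[i] : "-";
  }
  return s;
}

const before = 0o600; // assumed starting mode: only the owner can read and write
console.log(formatMode(before));                // rw-------
console.log(formatMode(addReadForAll(before))); // rw-r--r--
```

With the read bit for "others" set, the web server can read the file, which is why the 403 goes away on reload.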
Just catch my eye if you
want me to run over or look
on with the person next to
you. [INAUDIBLE], question?
Yeah?
AUDIENCE: Oh, I was-- I have--
PROFESSOR DAVID J MALAN: OK.
Oh, OK.
Down here you want a terminal window.
Somehow you closed it.
AUDIENCE: OK.
PROFESSOR DAVID J MALAN: So
use the blue window there.
Sure.
Oh, you capitalized map, which is fine.
But just when you type
the name, you're going
to have to chmod it with a capital letter.
All right, any questions?
Use the buddy system or use
me to run over to unstick.
All right.
All right.
Meanwhile, in that
other tab I invited you
to open earlier, you probably
see this Google screen.
And we will just barely
scratch the surface.
The goal here is not to
build an application, per se,
but really just to get you up and
running with their very simple sample
map just so that you understand the
workflow and feel like if you do
want to tinker afterward,
you have a little something
to build on if you would like.
Notice that Google offers maps for
different platforms, Android, iOS, web
and web services.
Web is what we want.
So if you're on this screen,
go ahead and click web.
That will lead you to a
page that looks like this.
And notice, maps user love-- there's
different ways to embed maps.
And frankly, it all can be a
little overwhelming at first.
So sometimes Google
is your-- ironically,
Google is your friend as to figure out
what you actually want by just googling
around for recommendations.
But I figured it out for us.
So go to the Google Maps JavaScript
API, the very first link.
And now here, too, they have
not made it very obvious
because there's a lot of fluffy
like images and text here.
But click guides at the top here.
So not overview but guides.
And that should finally lead
you to something more technical.
So what you are looking at is
essentially, API documentation.
This is not a standard format.
Every company will do this
a little bit differently,
but generally, good
API documentation will
have formal definitions
of what their API does,
the functionality they're giving
you, and how you can actually
use it with sample code.
So we will literally do the
Hello World sample here.
And it's going to be
relatively straightforward.
But I'll run around and unstick
any issues people are having.
Notice that down below underneath Hello
World there's a whole bunch of HTML.
And for better or for worse there is
some JavaScript commingled in the page.
So not best practices but
it makes the simple example
Google's giving us all self-contained.
The objective at hand is to
quite simply copy and paste
that sample code into map.html,
save it but with one change.
Notice down here, and
they've highlighted it,
they are using a script tag
in the sample program that is
src=https://maps.googleapis.com/maps--
but notice it says key equals your API
key.
So the way they keep track
of who's using their API
and impose limits on how
often people query their API is everyone
gets assigned a big pseudo-random
number that they save in their database.
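The key travels as an ordinary query parameter on the script's URL. As a minimal sketch, the helper below sets or replaces a key parameter on any URL; the function name withApiKey, the example.com address, and the key abc123 are all placeholders, not Google's real URL or a real key.

```javascript
// Set (or replace) the `key` query parameter on a script URL.
// Both arguments are placeholders: substitute the URL from the sample code
// and the key from the slides.
function withApiKey(scriptUrl, apiKey) {
  const url = new URL(scriptUrl);
  url.searchParams.set("key", apiKey);
  return url.toString();
}

console.log(withApiKey("https://example.com/maps/js?key=YOUR_API_KEY", "abc123"));
// https://example.com/maps/js?key=abc123
```

In practice you would simply edit the src attribute by hand; the sketch only shows which part of the URL changes.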
Rather than have everyone
here sign up for this,
hopefully mine has
not been overused, you
can go to the next slide
in today's handouts
and definitely go to the slides,
don't try to transcribe this.
Here is an API key that I created
for us that you can copy paste.
So again, if you need
today's slides, you'll
never be able to transcribe that URL.
Recall that today's slides
exist here, just like yesterday.
And definitely copy and
paste from the slides.
Don't manually transcribe.
And again, the goal is copy
the Hello World example
into your own map.html file.
Save it, reload and change the API key.
And hopefully, you will have your very
own map.html with an embedded Google
map.
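For a sense of what the JavaScript in that sample boils down to, here is a minimal sketch, not Google's sample verbatim. It assumes the Maps script has loaded and defined the google.maps namespace, which is passed in as a parameter here so the function stands on its own; the function name createMap, the coordinates, and the zoom level are illustrative choices.

```javascript
// `mapsApi` stands in for the `google.maps` namespace that Google's script
// defines once it loads; passing it in is a choice made for this sketch.
function createMap(mapsApi, element) {
  return new mapsApi.Map(element, {
    center: { lat: 42.3736, lng: -71.1097 }, // roughly Cambridge, Mass. (illustrative)
    zoom: 12,                                // illustrative zoom level
  });
}

// In the browser, after Google's script has loaded, assuming the page
// contains an element with id "map":
// createMap(google.maps, document.getElementById("map"));
```

The map fills whatever element it is given, so that element needs a height set in CSS for anything to show up.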
The goal really was to give
you a sense of, one, JavaScript
as a language and, two, using an API.
And frankly, just how exciting
the world of programming
can be, when you have these APIs
and libraries and third party
support on top of which you
can build your own product.
And indeed, what's especially exciting
about software development these days,
is it's so much more increasingly about
weaving together various ingredients
and standing on the shoulders
of others equivalently in order
to make some really cool applications.
And case in point is something
like Uber where they are not
in the mapping business, per se.
But having access to the ability
to embed interactive maps
into their application was the
enabling technology, dare I say,
on top of which they could then
build a car sharing service as well.
So it's really quite
cool what you can do.
Thank you so much to the whole
team who's been behind the scenes
both in the room and outside
the room for the videos today.
We'll edit these and make them
available online and follow up
via email at some point.
The slides are already available,
so all those references are there.
Do feel free to keep in touch
if you have any questions.
But otherwise, let me
officially step out
so you feel comfortable
filling out the evaluations.
And I'll linger in the lobby
if anyone has questions.
But thanks so much for
coming to town this week.
See you soon.
Thanks.
[APPLAUSE]
