[MUSIC PLAYING]
BRIAN YU: OK, let's get started.
Welcome, everyone, to the
final day of CS50 Beyond.
And the goal for today is
to take a look at things
at a bit of a higher level.
There is going to be less
code in today's lecture.
The focus of today is
on two main topics--
security and scalability--
which are both important as you
begin to think about this: you've written
all this code for your web application.
You're ready to deploy it so
that people can actually use it.
What are the sorts of considerations
you need to bear in mind?
What are the security
considerations in making
sure that wherever you're hosting the
application, you and the application
itself are secure and that your users are
secure from potential vulnerabilities
or potential threats?
And also, from a
scalability perspective,
we've been designing applications
that so far probably only you
or a couple other
people have been using.
But what sorts of things
do you need to think about
as your applications begin to scale, as
more and more people begin to use it,
and you have to begin to think about
this idea of multiple people trying
to use the same application
at the same time?
So a number of different
considerations come about there.
We'll show a couple of code examples.
But the main idea of this is going to
be high level, just thinking abstractly,
sort of trying to design the product,
trying to design the project,
trying to figure out how exactly we
need to be adjusting our application
to make sure that it's secure and
to make sure that it's scalable.
So we'll go ahead and
start with security.
And on the topic of
security, we're going
to look at a number of different
security considerations
as we move all throughout the week,
from the beginning of the week
until the end of the week, thinking
about the types of security
implications that come about.
And so one of the first things we
introduced in the class was Git,
the version control
tool that we were using
to keep track of different
versions of our code
in order to manage different branches
of our code, so on and so forth.
And so a couple of important security
considerations to be aware of
with regard to Git.
You all probably created
GitHub repositories
over the course of this week,
maybe for the first time.
And GitHub repositories
by default are public.
And this is in the spirit of the idea
of open source software, the idea
that anyone can see the code.
Anyone can contribute to the code.
And that, of course,
comes with its trade-offs.
On one hand, everyone being
able to see the code certainly
means that anyone can help you
to find bugs and identify bugs.
But it also means that anyone on
the internet can see the code,
look for potential
vulnerabilities, and then
potentially take advantage
of those vulnerabilities.
So definitely, trade-offs,
costs, and benefits that
come along with open source software.
And another thing just to be aware of,
we mentioned this earlier in the week,
but your Git commit history is going
to store the entire history of any
of the commits that you have
made, as the name might imply.
And so if you make a
commit and you do something
you shouldn't have done, for instance--
you make a commit that accidentally
includes database credentials
inside of the commit somewhere
or includes a password
inside of the commit
somewhere-- you can later
on remove those credentials
and make another commit
and remove the credentials.
But the credentials are still
there inside of the history.
If you go back, you could
still find the credentials
if you had access to the
entire Git repository
and could go back and find
that point in Git's history.
So what are the potential solutions
for if you do something like this,
accidentally expose credentials
at some point in the repository
and then remove them?
What could you do?
Yeah?
AUDIENCE: Change the credentials.
BRIAN YU: Certainly.
Changing the credentials, something
you should almost definitely do.
Change the password.
It's not enough just to remove
them and make another commit.
And there's also something you
can do with tools like git filter-branch
or the BFG Repo-Cleaner, where
you can effectively purge a commit from
the history, sort of overwrite history,
so to speak, in order to
replace that, as well.
But even that, if it's
been online on GitHub,
who knows who may have been
able to access the credentials?
So definitely always a good
idea to remove those, as well.
On the first day, we
also took a look at HTML.
We were designing basic HTML pages.
And there are a number of
security vulnerabilities
you could create just with HTML alone.
Perhaps one of the most basic is just
the idea that the contents of a link
can differ from where
the link takes you to.
This is a pretty
obvious point: you often
have text that links you
to a particular page.
But this can often be
misleading and is commonly
used in phishing email
attacks, for instance,
whereby you have a link
that shows you one URL
but actually takes you to another URL,
which can be misleading, for sure.
Or I can have situations where I could--
let's go into link.html--
I have a link that presumably
takes me to google.com.
But if I click on google.com,
it could take me anywhere else--
to some other site, for instance.
And the way that it does
that is quite simply by just
having a link that takes you to a
URL, but the contents of that URL
are something different or
something else entirely.
And so that alone is
something to be aware of.
But that problem is compounded
when you consider the idea
that even though your server-side
code-- application code
you write in Python and
Flask, for instance--
you can keep secret from
your users, HTML code is not
kept secret from users.
Any user can see the HTML and do
whatever they want with it.
And so on the first
day, you may have been
trying to take a look at an HTML
page and try and replicate it
using your own HTML
and CSS, for example.
The simplest way to
do something like that
would just be to copy the source code.
So I could go to bankofamerica.com, for
instance, Control-Click on the page,
view the page source, and all right.
Here's all the HTML on Bank
of America's home page.
I could copy that, create a new
file, and call it bank.html.
Paste the contents of it in here.
Go ahead and save that.
And now, open up bank.html.
And now, I've got a page that basically
looks like Bank of America's website.
And now, I could go in.
I could modify the links, change
where Sign In takes you to,
make it take you to
somewhere else entirely.
And so these are potential
threats, vulnerabilities,
to be aware of on the internet
that are quite easy to actually do.
So this is less about when you're
designing your own web applications
but, when you're using web
applications, the types of security
concerns to definitely be aware of.
So let's keep moving forward
in the week-- yeah, question?
AUDIENCE: Can you copy JavaScript
source code in the same way?
BRIAN YU: Yes.
Any JavaScript code that is
on the client, you can access
and you can modify.
You can change variables
and so on and so forth.
And this is actually a
pretty easy thing to do.
So if I go to like, I don't know, The
New York Times website, for instance,
and I look at the source code there--
let me go ahead and inspect
the element, and I'll
try and hover over a main headline.
OK.
This is the name of a CSS class.
You could access any JavaScript.
You can also run any JavaScript
in the console arbitrarily.
So I could say, all right,
document.querySelectorAll-- let's
get everything with that CSS class.
Or maybe it's just the first one,
because it's two CSS classes.
All right.
Great.
I'll take the first one,
set its inner HTML to be,
like, welcome to CS50 Beyond.
And you can play around with websites
in order to mess around, change them.
So all of the JavaScript
CSS classes, all of that,
is accessible to anyone who is
using the page, for example.
Other questions before I go on?
Yeah.
AUDIENCE: Any thoughts on
JavaScript obfuscation?
BRIAN YU: JavaScript obfuscation--
certainly something you can do.
So since JavaScript is available to
anyone who has access to the web page,
there are programs called
JavaScript obfuscators
that basically take plain
old looking JavaScript
and convert it into something
that's still JavaScript
but that's very difficult
for any human to decipher.
It changes variable names and does
a bunch of tricks in JavaScript
to still execute the exact same
way but that looks quite obscure.
Definitely something you can do.
Still not totally foolproof,
because there are ways
of trying to deobfuscate JavaScript
code, at least to some extent.
So it's not perfect, but definitely
something that you can do.
Other things?
All right.
Let's take a look at--
OK, when we were writing
Flask applications,
we were writing web servers.
And so one thing that's just good
to know from a security perspective
is the difference between HTTP,
the Hypertext Transfer Protocol,
and the secure version of it, HTTPS.
And that has to do with the
idea that on the internet,
we have computer servers that
are trying to communicate
with each other that are trying to
send information back and forth.
And when these computers are trying
to send information back and forth,
we would like for that
to happen securely,
that when one computer is sending
information to another computer,
that information is going through
a number of different routers.
And each of those routers
could hypothetically
have information that's intercepted.
Someone could try and intercept a
packet on its way from computer number
one to computer number two.
So how do we securely try and
transfer information from one location
to the other?
And this has to do with the
entire field of cryptography,
which is a huge field that
we're only going to be
able to barely scratch the surface of.
But the basic idea here is
that we would like some way
to encrypt our information, that if I
have some plain text that I would like
to send from my computer
to someone else's computer,
I would like to encrypt that plain text,
send it across in some encrypted way,
such that the person on the
other end could decrypt it.
And so this is perhaps a
more sophisticated version
of what you might have done
in CS50's problem set two
when you were using the
Caesar or the Vigenere cipher
in order to encrypt something.
The ciphers that are used in computing
on the internet, for instance,
are just much more secure, for example.
But they follow a similar principle.
And so one form of cryptography
is called secret-key cryptography,
where the idea is that if
I am a computer up here
and I have some plain text
that I want to encrypt,
I also have some key that only I know.
And I can take the plain
text, and I can take that key
and run an algorithm on it.
And that generates some ciphertext,
some encrypted version of the plain text
that was encrypted using the key.
I can then send that ciphertext
along to the other person.
And so long as the other person
has both the ciphertext and the key
that was used to encrypt it, they
can reverse the process
and just decrypt it, generating
the plain text from it.
That way, the ciphertext is
transferred, not the plain text,
from one side to the other
side of this communication.
And so long as both parties in this
instance have access to the same key,
they can encrypt and
decrypt messages at will.
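The shape of secret-key cryptography can be sketched in a few lines of Python. This is a toy XOR cipher purely for illustration, not one of the real ciphers (like AES) used on the internet; the point is only that the same shared key both encrypts and decrypts.

```python
# Toy illustration of secret-key (symmetric) cryptography using XOR.
# Real systems use vetted ciphers like AES; this only shows the idea
# that one shared key performs both encryption and decryption.
from itertools import cycle

def xor_cipher(data, key):
    """XOR each byte of data with the key (repeated as needed).

    Applying this twice with the same key returns the original bytes,
    which is exactly the symmetric-key property described above."""
    return bytes(b ^ k for b, k in zip(data, cycle(key)))

plaintext = b"transfer $100 to alice"
key = b"secret"                           # both parties must already share this

ciphertext = xor_cipher(plaintext, key)   # sender encrypts
recovered = xor_cipher(ciphertext, key)   # receiver decrypts with the same key

assert recovered == plaintext
assert ciphertext != plaintext
```

Notice that the receiver can only decrypt because they hold the very same key, which is exactly why this model breaks down once the key itself has to travel over the network.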
Why doesn't this quite work
on the internet, though?
What is the problem with this model?
Yeah?
AUDIENCE: If you're sending the
key as well as the ciphertext,
then it's just as revealing as sending
the plain text that you have.
BRIAN YU: Exactly.
When we transfer the ciphertext
across, the other person
also needs access to the key.
We need to transfer the
key across the internet,
as well, to give it to the other person.
And so anyone who is
intercepting the ciphertext
could also have intercepted
the key and therefore could
have decrypted the information
and gotten the plain text
as a result of it.
So this secret-key
cryptography, ultimately, it
doesn't work in the
context of the internet
if it needs to be the
case that the key is just
transferred across the internet.
Now, you could try encrypting
the key, for example.
But then whatever key you
used to encrypt the key,
that also needs to be
sent across the internet,
and you end up with this problem where
you can never figure out a way in order
to make sure that information
can be transferred securely.
So the solution to this lies in a
different idea called public-key
cryptography, where the idea here
is that instead of having one key,
we'll have two keys--
one called a public key,
one called a private key.
And the idea here is that a public key
is something you can share with anyone.
Doesn't matter who has it.
And a private key is a key
that you keep to yourself
that you don't give to anyone,
even the person that you're
trying to communicate with.
And because we have two keys, each key
is going to serve a different purpose.
They're going to be
mathematically related.
And take a theory of
computing class if you
want to understand the exact
mathematics behind this.
But the basic idea is that the public
key can be used to encrypt messages,
and the private key can be
used to decrypt messages that
were encrypted using the public key.
And so what does this model look like?
Well, I have some
public and private key.
And if I want some other
person to send me information,
I will give them my public key.
Just give the other person the public
key so that they have access to it.
Remember, the public key
is used to encrypt data.
So they can use the public key
and encrypt the plain text,
generate some ciphertext.
And then all the other person needs
to do is send me that ciphertext.
The ciphertext comes across to me.
And I now have the private
key, the key that I
can use to decrypt the information.
And using the private
key and the ciphertext,
I can then decrypt the message
and generate the plain text.
So this is the basic idea
of public-key cryptography,
this idea that we use a public key to
encrypt information and a private key
to decrypt information.
And by separating this out
into two different keys,
we can share the public
key freely without needing
to worry about the potential
for internet traffic
to be intercepted and
decrypted, for example.
And so this is the basis on
which internet security works.
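The public-key flow described here can be illustrated with a toy RSA example in Python. The primes below are absurdly small and there is no padding, so this is strictly a teaching sketch of the idea that the public key encrypts and only the matching private key decrypts; real RSA uses primes hundreds of digits long.

```python
# Toy RSA sketch with tiny primes, purely to illustrate public-key
# cryptography. Never use numbers this small (or raw RSA without
# padding) for anything real.

p, q = 61, 53                 # two (tiny) secret primes
n = p * q                     # the modulus, part of both keys
phi = (p - 1) * (q - 1)
e = 17                        # public exponent, coprime with phi
d = pow(e, -1, phi)           # private exponent: modular inverse (Python 3.8+)

def encrypt(m):
    """Anyone holding the public key (e, n) can encrypt."""
    return pow(m, e, n)

def decrypt(c):
    """Only the holder of the private key (d, n) can decrypt."""
    return pow(c, d, n)

message = 65
ciphertext = encrypt(message)

assert ciphertext != message          # the ciphertext reveals nothing directly
assert decrypt(ciphertext) == message # the private key recovers the plain text
```

The public pair (e, n) can be shared freely; recovering d from it would require factoring n back into p and q, which is the hard problem discussed below.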
Yeah?
AUDIENCE: What if someone
else intercepts the ciphertext
and they also have a private key?
Would they be able to decrypt it?
BRIAN YU: If someone else intercepts the
ciphertext and they have a private key,
they won't be able to decrypt
it, because the private key
and the public key are
mathematically related in such a way
that if you encrypt
something with a public key,
you can only decrypt it with
the corresponding private key.
And so generally speaking,
you'll generate both the public
and the private key at the same time,
such that only messages encrypted
with one can be
decrypted with the other.
So you can't just have some other random
private key and decrypt the message.
It can only decrypt messages
from the public key.
AUDIENCE: So how did this person
get that specific [INAUDIBLE]?
BRIAN YU: So this person down
here generated both the public
and the private key at the same time.
There's just an algorithm that
you can use to randomly generate
a public and private key.
You share the public key with anyone you
want to be able to send you messages.
That person you share it with can use
the public key to encrypt the message.
And then you, the person
who generated these keys,
can take the encrypted message, use
the private key that you generated,
and get the plain text out of that.
Yeah?
AUDIENCE: How difficult is it to get
the private key from the public key?
Is it impossible?
BRIAN YU: How difficult is it to get
the private key from the public key?
Long story short, we don't really know.
We think it is very difficult to do.
We think that it would
take a very long time.
If you took a computer and tried
to get it to go from the public key
to the private key, we think it would
probably take billions or trillions
of years, even with a computer operating at
top speed trying to do this calculation.
But no one has been able to
technically prove that it is difficult.
And so this is a big open
question in computing right now.
You can take a theory
of computation class
for more information
on this sort of thing.
But there are some open
unsolved problems in computing,
and this happens to be one of them.
Yeah?
AUDIENCE: Is it based on primes
and very large primes, and you
multiply them together?
BRIAN YU: Yes, this is basically
the idea of very large prime numbers
that you multiply together.
The long story short of it
is it's based on the idea
that there are some mathematical
operations that are easy
and some mathematical operations
that are believed to be difficult.
And if you take two
very big prime numbers,
a computer can multiply
those numbers very easily
and calculate what the product
of those two numbers is.
It's just a simple
multiplication algorithm.
But if you have that result,
that big multiplied prime number,
it's very difficult
to factor that number
and figure out which two prime
numbers were multiplied together
in order to generate that number.
And nobody has been able to come up with
an efficient algorithm for factoring
it.
And so as a result, because we
believe factoring numbers to be
a very difficult problem,
we use it as the basis
for computing security on the internet.
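You can see the asymmetry directly in Python: multiplying two primes is a single instant operation, while recovering them from the product requires search. The primes below are modest, so naive trial division still succeeds; at real RSA sizes (hundreds of digits) even the best known factoring algorithms would take astronomically long.

```python
# Multiplying primes is easy; factoring the product is (believed to be)
# hard. Trial division works here only because these numbers are small.

def factor(n):
    """Naive trial division: fine for toy numbers, hopeless at RSA sizes."""
    for candidate in range(2, int(n ** 0.5) + 1):
        if n % candidate == 0:
            return candidate, n // candidate
    raise ValueError("no factors found; n is prime")

p, q = 10007, 10009            # two modest primes
n = p * q                      # one multiplication: effectively free

# Going backwards means testing thousands of candidates even at this size.
assert factor(n) == (10007, 10009)
```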
Brief teaser of theory of computation.
Take any of the 120 series
here at Harvard, at least,
for more information about that.
Other things?
Some other security considerations
when designing web applications
to be aware of-- we
mentioned this before,
but when it comes to
storing credentials,
you should generally
always store credentials
in environment variables
inside of your application
rather than have inside of
your Python code some password,
whether it's the secret
key of your application,
whether it's the credentials
to your database,
whether it's some other
credentials for an API key,
for example, that you're
using the server to access.
Usually best not to put that in
the code in case someone else
gets access to the code.
Generally best to put
it in an environment
variable, a variable that's just
stored in the command line environment
where your server's being run from.
And then add code that just pulls
the credentials from the environment.
In Python, at least, you can
use os.environ.get
to get some information from
the application's environment.
And this is generally going to be a
more secure way of doing the same thing.
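Here's what that looks like in practice. The variable names (DATABASE_URL, SECRET_KEY) are just examples; use whatever names your application expects, and set them in your shell or in Heroku's config vars rather than in the code.

```python
# Pulling credentials from environment variables instead of hard-coding
# them in the source, so the repository never contains secrets.
import os

# os.environ.get returns None (or a supplied default) when the variable
# is unset, so a missing credential can be detected cleanly instead of
# crashing with a KeyError.
database_url = os.environ.get("DATABASE_URL", "sqlite:///dev.db")
secret_key = os.environ.get("SECRET_KEY")

if secret_key is None:
    # A real app might refuse to start here rather than just warn.
    print("warning: SECRET_KEY is not set in the environment")
```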
Yeah?
AUDIENCE: How do we do
that in Heroku if we
want to upload our code to the website?
BRIAN YU: Yeah.
So if you're uploading this to Heroku,
if you go to your Heroku application
and go to the Settings
panel, there is a section,
I think it's called config vars, that
basically just lets you add environment
variables to the Heroku application.
And that will automatically set
those environment variables such
that when you run the
application, it can
draw from those environment variables.
Yeah?
AUDIENCE: Is it [INAUDIBLE]
yesterday, or is that something
you can't have access to?
Because if you just did
[INAUDIBLE] and then the key,
it goes away when you close
the terminal, correct?
BRIAN YU: Yes.
So that's true.
So you can certainly,
on your own computer,
set aliases or environment
variables inside
of your profile that automatically
set credentials in a particular way.
The idea is that you never want
to be taking those credentials
and committing them to a
repository that other people might
be able to see, for instance.
That's where things
start to get less secure.
OK.
Moving on in the week to talk about
some other security considerations.
We'll talk about SQL,
the idea of databases.
And when we introduce databases, there
are a lot of security considerations
that come about.
But we'll just touch
on a couple of them.
The first is how you store passwords.
So you can imagine that
inside of a database,
you might be storing users
and passwords together.
And maybe we have a whole users
table that has an ID column,
a column for people's usernames,
and a column for people's passwords.
And you could imagine just storing
passwords inside of the row.
But why is this not particularly secure?
Yeah?
AUDIENCE: If anyone gets
access to the data table,
they can see what all the passwords are.
BRIAN YU: Exactly.
If anyone gets access to the
database, they immediately
have access to all of the passwords.
And this is probably not a
secure way to go about things,
because you probably hear
in the news from time
to time that databases aren't perfectly
secure, that every once in a while,
there's some big security vulnerability
where someone's able to get access
to passwords inside of a database.
And that becomes a
major security concern.
And so one way to try
and mitigate this problem
is, instead of storing passwords
inside of the database,
store a hashed version of the password.
A hash function, as you might recall
from CS50, just takes some input
and returns some deterministic output.
And a hash function can
generally take any input password
and turn it into what looks like a whole
bunch of random sequences of letters
and numbers.
And the idea here is
that it's deterministic.
The same password will always
result in the same hash value.
So when someone tries to log
in and types in their password,
rather than just literally
comparing their password
and asking whether it matches up
with the password in this column,
you can say, all right, let's
hash the password first.
And if the hashes match up,
then with very high probability,
the user actually signed in to the
website with the correct password.
And you can then log the user in.
And now, if someone was able
to get access to the database,
they wouldn't get access
to all the passwords.
They would only get access
to the password hashes.
Now, it's still a
security vulnerability,
because someone could, in
theory, be able to figure out
information about the password
from the password hashes.
But better, certainly, than
literally storing the raw text
of the password in the database.
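The hashing scheme just described can be sketched with the standard library's PBKDF2. Real applications usually reach for a maintained helper (werkzeug.security, bcrypt), but the idea is the same: store a salted hash, and at login re-hash the attempt and compare.

```python
# Salted password hashing with the standard library. The database stores
# (salt, hash), never the raw password.
import hashlib
import hmac
import os

def hash_password(password, salt=None):
    """Return (salt, hash). A fresh random salt per user means two users
    with the same password still get different stored hashes."""
    if salt is None:
        salt = os.urandom(16)
    digest = hashlib.pbkdf2_hmac("sha256", password.encode(), salt, 100_000)
    return salt, digest

def verify_password(password, salt, expected):
    """Re-hash the login attempt with the stored salt and compare.

    hmac.compare_digest avoids leaking information through timing."""
    _, digest = hash_password(password, salt)
    return hmac.compare_digest(digest, expected)

salt, stored = hash_password("hunter2")   # what goes into the users table

assert verify_password("hunter2", salt, stored)
assert not verify_password("wrong", salt, stored)
```

Because the function is deterministic given the salt, matching hashes mean (with very high probability) the user typed the correct password, yet a leaked table reveals only the hashes.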
Yeah?
AUDIENCE: Do we know how the hash
functions generate that code?
BRIAN YU: Yeah.
The hash functions tend
to be deterministic,
and you can look up what the hash
functions themselves are.
So there are a couple of
quite popular hash functions
that are out there that
do this sort of thing.
But the idea of the hash
function is similar to the idea
of public and private keys, that
it's very easy to hash something,
and it's very difficult to
go in the other direction.
I can easily hash a
password and generate
something that looks like this.
But it's a difficult operation to
take something that looks like this
and go backwards and figure out what
it was that the original password was.
And so that's one of the
properties of a good hash function.
Yes?
AUDIENCE: Did you actually hash these,
or did you just hit the keyboard?
BRIAN YU: I think these are probably--
there might be hidden messages
here if you look carefully.
But separate issue.
Other things?
OK.
So how is it that potential data is
leaked as a result of using a database?
Well, there are a number of ways
that applications can inadvertently
leak information.
Take a simple example.
Oftentimes, you'll see websites
that have a Forgot Your Password
screen where you type in an email
address, and you click Reset Password.
And that sends
you an email that allows you
to reset your password, for example.
And you can imagine that you
type in an email address,
and you get, OK, password
reset email has been sent.
But maybe some applications
work such that if you type
in an email address
that doesn't exist, then
you get an error that says, OK, error.
There is no user with
that email address.
What data has this
application now exposed?
What information can you get just by
using this part of a web application,
for instance?
Yeah?
AUDIENCE: You know that that email
address is not in the system,
so you know that person
is not using that app.
BRIAN YU: Yeah, exactly.
Just using the Forgot Password
part of this application,
you can tell exactly who has
an account for this application
and who doesn't just by typing email
addresses and seeing what comes back.
So there's potential
vulnerabilities in terms of data
that gets leaked there, as well.
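The usual fix is to return the same message whether or not the address exists, so the reset form can't be used to probe for accounts. In this sketch, `registered_users` and `send_reset_email` are placeholders standing in for a real database lookup and mailer.

```python
# A forgot-password handler that doesn't leak account existence:
# the response is identical for known and unknown email addresses.

registered_users = {"alice@example.com"}   # stand-in for a database query

def send_reset_email(address):
    pass  # a real app would email a single-use reset link here

def handle_password_reset(address):
    if address in registered_users:
        send_reset_email(address)
    # Same response either way, so an attacker learns nothing.
    return "If an account exists for that address, a reset email was sent."

# An attacker probing the form sees identical output for both cases.
assert handle_password_reset("alice@example.com") == \
       handle_password_reset("nobody@example.com")
```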
And there are all sorts of different
ways that information can get leaked.
There's also a growing field
of timing attacks, whereby
you can get information about
the data inside of a database
just based on the amount of time
it takes for an HTTP request
to come back.
If you make a request that takes
a long time, that can tell you
something different than if a request
comes back very quickly, because that
might mean fewer database queries
were required in order to make
that particular operation work
or any number of different things.
And so there are security
vulnerabilities there, as well.
Final one.
I'll briefly mention SQL injection.
We've already talked about that.
But again, something to be
aware of just to make sure
that whenever you're
making database queries,
you're protecting yourself
against SQL injection,
that you're making sure to either use a
library that takes care of this for you
or escape any characters
that you might be using that
could ultimately result
in vulnerabilities in SQL.
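With Python's built-in sqlite3, for instance, letting the library bind values through placeholders means user input travels as data and is never spliced into the SQL text, which is what defeats injection.

```python
# Parameterized queries with sqlite3: the ? placeholder lets the driver
# bind the value safely instead of pasting it into the query string.
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE users (username TEXT, password_hash TEXT)")
db.execute("INSERT INTO users VALUES ('alice', 'abc123hash')")

# A classic injection payload: bound as a parameter, it's a harmless string.
username = "alice'; DROP TABLE users; --"

# BAD (commented out): string formatting invites injection.
# db.execute(f"SELECT * FROM users WHERE username = '{username}'")

# GOOD: let the library escape and bind the value.
rows = db.execute("SELECT * FROM users WHERE username = ?",
                  (username,)).fetchall()

assert rows == []   # no user is literally named that; nothing was injected
assert db.execute("SELECT COUNT(*) FROM users").fetchone()[0] == 1  # table intact
```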
Yeah?
AUDIENCE: How about
the websites or tools
like LastPass that store your
credentials for other sites?
Don't they have to have some way
of reversing their own hash on it
in order to give you that credential
when you go to another site?
So when it auto fills your
username and password,
it has to-- if they're storing a hashed
version on their side but filling
in the plain text version
in the password field,
how are they able to reverse
that in a way that is secure?
They would have to have a
table of keys or something
that then is just as vulnerable
as leaving the password.
BRIAN YU: Yeah.
So for password manager-type
applications, it's a good question.
I think the way most of them do this
is that you have a master password that
unlocks the entire database of the
passwords that are stored there.
And the idea would be
that they're encrypted
using the master password as
the key to be the unlocker such
that they're encrypted.
And only by getting the
master password correct
can you then decrypt
the information and then
access the plain text version of
the passwords that are inside.
And so hashing and encryption and
decryption are slightly different.
In the case of encryption
and decryption,
you still want to be able to go from
the ciphertext back to the plain text,
whereas in the case of
the password hashing,
you don't really care about the ability
to reverse engineer it to go backwards.
All right.
And finally, on the
topic of security, we'll
talk a little bit about JavaScript.
JavaScript opens a whole host of
different potential vulnerabilities
from a security standpoint.
But we'll talk about a couple.
The first is this idea
called cross-site scripting,
or the idea of taking a script and
being effectively able to inject it
into some other site by putting
some JavaScript that the web
application didn't intend into
the web application itself.
And so here's a very simple web
application written in Flask.
And this is the entire web application.
It's got a route, a default route,
called / that just returns, "Hello,
world!"
And it's got an error handler that
we didn't really see in the class.
But basically, it
handles whenever there's
a 404 error, whenever you're trying
to access a page that was not found.
And it just returns, "Not found,"
followed by request.path, whatever it
is that was the URL that you requested.
And so I could run this application.
I'll go ahead and start up
Chrome, and I'll go ahead
and go to the source code for XSS1.
I'll run this application.
Go here.
It says, "Hello, world!"
And if I go to helloworld/foo, for
example, some route that doesn't exist,
I get not found, /foo, because that's
not a route that's available on this
page.
I go to /bar.
Not found, /bar.
What could go wrong here?
Where's the security
vulnerability, again,
thinking in the context of JavaScript?
The page my application is returning
is literally just "not found"
followed by whatever was
typed into the request path.
And so what I could do is you could
imagine that instead of running /foo,
I could instead make a request
that looks something like
/<script>alert('hi')</script>, for instance,
injecting some JavaScript into the
request path whereby if I do that,
I say, OK, /<script>alert('hi')</script>.
Press Return.
And OK, Chrome is
being smart about this.
Chrome actually isn't
allowing me to do this,
because Chrome has some more
advanced features that are basically
saying Chrome detected
unusual code on this page
and blocked it to protect your
personal information, with an error
that the page was blocked by the XSS auditor.
That's cross-site scripting.
So Chrome is automatically
auditing for this.
But not all browsers are like that.
And I can, I think--
let's see if I can disable--
if I disable cross-site
scripting protections,
I think I can get this to-- yeah, OK.
Disabling cross-site
scripting protections,
we can still type in the URL
and actually get some JavaScript
that the page didn't intend to still
run on this particular web page.
And so if someone were to send you
a link that took you to this page,
/<script>alert('hi')</script>, you could
get JavaScript to run that you
didn't intend.
And maybe that's not a big deal.
But it could be a bigger
deal in a situation that
looks like this, where
we have JavaScript
and document.write is a function
that just adds something to the page.
And here, we're loading
an image, img src,
and the source is some hacker's website.
And then we say, cookie=
and then document.cookie.
Document.cookie stores the
cookie for this particular page.
And so effectively, what's
happening in this script
is that your page, when you
load it, is going to make a web
request to the hacker's URL.
And it's going to provide
it as an argument whatever
the value of your
cookie is, for instance.
And that cookie could be
something that you use in order
to log in as the credentials
for some website,
like a bank application or whatnot.
And as a result, the
hacker now has access
to whatever the value of
your cookie is, because they
can look at their list
of all the requests
that have been made to the
application much in the same way
that you've been able
to do in the terminal
to see all the requests
for your Flask application.
And they can see that someone requested
hacker_url?cookie= this cookie,
and they can then use that cookie to
be able to sign in to other sites,
as well.
So most modern browsers,
like Chrome, are
pretty good at defending
against this sort of thing.
But definitely something that is a
potential vulnerability, especially
for older browsers.
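The standard server-side fix for this class of bug is to escape anything taken from the request before echoing it back. This sketch uses the standard library's html.escape on a plain function standing in for the Flask 404 handler; in real Flask code you would typically let Jinja2 templates do this escaping, which they perform by default.

```python
# Fixing the vulnerable 404 handler by HTML-escaping the request path
# before echoing it into the page.
import html

def not_found_page(request_path):
    # <, >, & and quotes become entities, so an injected <script> tag
    # renders as inert text instead of executing in the browser.
    return "Not found: " + html.escape(request_path)

page = not_found_page("<script>alert('hi')</script>")

assert "<script>" not in page        # nothing executable survives
assert "&lt;script&gt;" in page      # the text is still visible, harmlessly
```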
Questions about this
cross-site scripting?
Yeah?
AUDIENCE: Are you getting
the user's cookie,
or whose cookie are you getting there?
BRIAN YU: Whoever opens the page.
So the user's cookie, potentially
on an entirely different site.
The idea is that if your site
is vulnerable to cross-site
scripting in this form, then
you open up a possibility
where someone could generate
a link to your website that
includes some JavaScript injected
like this whereby someone else could
steal the cookies of your
users on your website.
And they could get the
cookies for themselves
and use those cookies to
sign into your website
and pretend to be people that
they're not, for example.
There's a potential
security threat there.
So cross-site scripting is one
example of a JavaScript vulnerability.
Another vulnerability is called
cross-site request forgery.
Imagine that you have a
bank website, for instance,
and that bank gives you
a way to transfer money.
And if you go to that URL /transfer and
then you provide arguments as to who
you're transferring money to and
how much money you're transferring,
you can transfer money.
Might be a web request
that allows you to do that.
Imagine some other
website, some website where
hackers are trying to steal
money, where they have code that
looks a little something like this.
They have a link that
says, "Click Here!"
And when you click on the link, that
takes you to yourbank.com/transfer
transferring to a particular person,
transferring a particular amount.
And some unsuspecting user on this
website could click the button.
And as a result, that
takes them to their bank.
And if they happen to be logged
into their bank at the time,
that could result in actually
making that transfer.
So cross-site request
forgery is the idea
that some other site can make a request
on your site by, in this case,
linking to it.
This still isn't an enormous threat,
because the person actually still needs
to click on the link in order to
actually go
to yourbank.com/transfer/whatever.
But you can imagine that a clever
hacker might be able to get around this
by doing something like this--
rendering an image, for example,
and saying the source of the image
is going to be this.
And when the browser sees an image tag
in the HTML, it's just going to go to
that URL and try to download that image.
It's going to go to the URL,
try and fetch that resource.
And here, that resource is
yourbank.com/transfer and then
transferring that money.
So the user doesn't even
have to click on anything.
And by making a GET request
to yourbank.com/transfer,
if yourbank.com isn't implemented
particularly securely and just allows
you to go to a URL like this to transfer
money, then that could be the result.
So how do you protect against this?
How would you protect
against your website
being able to do something like this?
Because your website
probably wants some way
of being able to transfer money
if you have a bank application,
but you don't want to allow
people to make requests like that.
Answer, yeah?
AUDIENCE: Yeah.
It's facetious.
BRIAN YU: Go for it.
AUDIENCE: You get a better bank.
BRIAN YU: Get a better bank.
OK.
Certainly something that would work.
Other thoughts?
Yeah?
AUDIENCE: Change the form request
type so it's not literally in your own
[INAUDIBLE].
BRIAN YU: Yeah.
Change the form request type so
that it's not literally here.
So this right here is a GET request.
You might imagine that instead, it's
a form that's submitted by a POST,
like a POST request, a
form that you actually
have to submit, click on a Submit
button, in order to submit that form.
And so now, you could imagine
that someone could still
create a vulnerability by
doing something like this.
They have a form whose action is
yourbank.com/transfer submitting
by a method POST.
And now, they have these
inputs of type hidden,
which are just input fields that
don't show up inside of a page.
And they can have
hidden input fields that
specify who it's to, what the amount
is, and then just some button that says,
"Click Here!"
And if they click
here, then unwittingly,
the user could be submitting
a form to the bank that's
initiating some transfer.
And in fact, if the hacker
is being particularly clever,
you don't even need the
user to click anything,
because we can use event
listeners to get around this.
I could say body onload--
in other words, when the body
of the page is done loading,
run this JavaScript.
Document.forms returns an array of
all the forms in the web document.
Square bracket 0 says
get the first form.
And there's a function in JavaScript
called .submit that submits a form.
So you can say, all right, get
all the forms, get the first form,
and run submit.
And that's going to result
in submitting this form,
making a POST request to
yourbank.com/transfer,
which results in some
amount being transferred.
So this is a potential
vulnerability, as well.
If you're writing this
bank application, you
don't want to allow code like this to
be able to get through your security,
because that opens up a whole host of
potential security vulnerabilities.
And in general, the way
that people tend to deal
with this is by adding what's called
a CSRF token, a Cross-Site Request
Forgery token, basically adding
some special value that changes
into their own forms and
then, anytime someone submits
the form, checking to make
sure the value of that token
is, in fact, a valid token.
And that way, someone couldn't
fake it because some other form
on some other hacker's website
isn't going to have a valid CSRF
token inside of their form page.
And so larger scale web application
frameworks, like Django,
offer easy ways to add CSRF
tokens to your forms, as well.
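As a rough sketch of how a CSRF token might work under the hood (the function names and the plain session dictionary here are simplified stand-ins, not any particular framework's API):

```python
import secrets

def issue_form(session):
    # When rendering the form, generate a fresh random token,
    # remember it server-side, and embed it in a hidden input.
    token = secrets.token_hex(16)
    session["csrf_token"] = token
    return f'<input type="hidden" name="csrf_token" value="{token}">'

def validate_submission(session, submitted_token):
    # On submission, the token sent back must match the one we issued.
    # compare_digest avoids leaking information through timing.
    expected = session.get("csrf_token")
    return expected is not None and secrets.compare_digest(
        expected, submitted_token
    )
```

A form on a hacker's site can't include a valid token, because the token was generated per-session and never shown to them, so their forged submission fails the check.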
But just something to
be aware of as you begin
to think about, when you're
designing a web application,
how could someone exploit it?
How could someone make
requests on behalf of users
that they don't intend
to in order to get
some malicious result to come about?
So lots of security things
to be thinking about.
Questions about security or
any of the security topics
that we've covered or talked about?
Yeah?
AUDIENCE: [INAUDIBLE] the token
is generated [INAUDIBLE] event,
or it's a unique token for every user?
BRIAN YU: Yeah.
Imagine that in the
case of CS50 Finance,
for instance, that when I click
on the Buy page that takes me
to the page where I can buy
stocks, my route for buy
is going to basically
generate a new token
and insert it into the form
that then gets displayed to me.
And then when I submit that
form, it gets submitted back
to the same application.
And the application can then check.
Did the token that came back match the
token that I inserted into the page?
And if they do, in
fact, match, then that's
a way of sort of verifying
that the user was actually
submitting the actual form
and not some fake form
that they were tricked into submitting.
All right.
In that case, let's
switch gears a little bit,
and let's talk about scalability.
Here again, there's going
to be even less code.
And the idea is just going to
be, all right, what happens when
we begin to scale our web application?
We've got some web server,
and we've got some users
that are using that web server, which
we're going to represent as that line.
And so what happens
when that server starts
to have more users that
are all trying to use
the application at the same time?
What do we do?
Well, the first thing to probably
do is figure out how many users
our website can actually support.
How many can it handle before it
stops being able to support users?
And so this is where
benchmarking is quite important.
Benchmarking is just this process by
which we can load test
our application to see how many users
we could potentially
handle on our server.
And so what happens if we find
out via benchmarking that,
OK, our server can only hold 100 users?
What if we need to support
101 users or 102 users?
What can we do?
One thing we can do is called
vertical scaling, where the idea here
is, all right, we have a server.
And that server only supports 100 users.
All right, well, let's just
get a bigger server, right?
Let's get a server that
supports 200 users or 300 users.
And that's going to be able
to better handle that load.
But there's a limit to this, right?
There's a limit to how much you can
just increase the size of a server
and increase its ability to handle load.
And so what could you do to
be able to handle more users?
AUDIENCE: More servers.
BRIAN YU: More servers.
Great.
And this is an idea called
horizontal scaling, where
the idea is that we have some server.
And let's say, instead
of having one server,
let's go ahead and have two servers
that are running the exact same web
application.
And now, we have two servers that
are able to run the application
and handle twice as many people.
What problems come
about now, logistically?
User tries to access our
website, and now what?
Yeah?
AUDIENCE: That means you could
have a race condition situation
or how the servers communicate
to each other [INAUDIBLE].
BRIAN YU: Yeah.
How do the servers
communicate with each other?
Certainly, race conditions
become a threat, as well.
And then a fundamental problem
is a user comes to the site,
and which server do they go to, right?
We need some way of deciding which
server to direct a particular user to.
And so generally, this is solved by
adding yet another piece of hardware
into the mix, adding some load
balancer in between the user
and the servers whereby a user,
when they request the page,
rather than going straight to the
server, they go to the load balancer
first.
And from there on, the load
balancer can split people up,
say certain people go to this server,
certain people go to that server,
and try and decide how it is
that people are going to be
divided into the different servers.
And so how could a load balancer decide?
If there are five servers
and a user comes along,
how should a load balancer decide
which server to send a user to?
There is no one right answer to this.
There are a number of possible
options, a number of different
what are called load balancing methods.
But how could you decide
where to send a user?
Yeah?
AUDIENCE: The server with the
least amount of users currently.
BRIAN YU: Sure.
The server with the fewest
users currently, what's often
called the fewest connections
load balancing method.
You try and figure out which
server has the fewest people on it.
And whichever one has the fewest
people on it, send the user there.
Definitely good for trying to make sure
that each one has about an equal load,
but potentially
computationally expensive.
You're doing a lot of calculation
now, so there's a trade off.
Yeah?
AUDIENCE: You could just do it randomly.
BRIAN YU: You could do it randomly.
You could just generate a
random number between 1 and 5
and randomly assign someone
to a particular server.
Definitely something you could do.
Other things?
Certainly the random approach is quick.
It doesn't involve having to
do any calculation across all
the different servers.
But if you're unlucky,
you could end up putting
a lot of people on server number two and
not many people on server number eight
or whatnot.
And so what else could we do?
Yeah?
AUDIENCE: Just set up
a counter [INAUDIBLE].
BRIAN YU: Sure.
Some sort of counter.
If you only have two, you just
alternate odd, even, odd, even.
Go to this server.
Go to that one.
If you've got eight, you just
rotate amongst the eight--
1, 2, 3, 4, 5, 6, 7, 8 and go back to 1.
And so these are probably three of the
most common load balancing methods--
random choice, whereby you just pick a
random server, direct the user there;
round robin, where we do exactly that,
just basically go one up until the end
and then go back to server number one;
and then fewest connections, whereby
you try and actually calculate
which server currently
has the fewest number
of people on it and then
try and direct the user to that
one with the fewest connections.
There are other methods
in addition to this,
but these are perhaps
three of the most intuitive
where you can start to
see their trade offs.
Depending upon the
type of user experience
you want, depending
on how computationally
expensive certain operations are, you
might choose different load balancing
methods.
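As a sketch, the three methods could be implemented along these lines (the server names are made up, and a real load balancer works at the network level rather than in application code like this):

```python
import random
from itertools import count

servers = ["server1", "server2", "server3", "server4", "server5"]

def random_choice():
    # Pick any server uniformly at random: fast, no bookkeeping,
    # but an unlucky streak can overload one server.
    return random.choice(servers)

_counter = count()
def round_robin():
    # Cycle through the servers in order, wrapping back to the first.
    return servers[next(_counter) % len(servers)]

connections = {s: 0 for s in servers}
def fewest_connections():
    # Scan every server's current load: most even distribution,
    # but costs a comparison across all servers per request.
    return min(servers, key=lambda s: connections[s])
```

The trade-offs discussed above show up directly: random and round robin are O(1) per request, while fewest connections is O(n) but tracks actual load.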
Yeah?
AUDIENCE: [INAUDIBLE] benchmarking, and
what are some common ways to do that?
BRIAN YU: Yeah, there are
software tools that can do this.
There are a number of different
ones-- the names are escaping me
at the moment--
where you can basically
test on a particular URL
and get a sense for how well
it's able to handle that load.
And if you have particular use cases, I
can chat with you about that, as well.
So all right, let's imagine
we have two servers now.
And every time a user
makes an HTTP request
to a server, every time
they request a page,
we direct them to one server
or the other server using
one of these methods, either by
choosing randomly or by round robin
or by figuring out which one currently
has the fewest users connected to it
or is handling the fewest connections.
What can go wrong?
Whenever we're dealing with issues of
scale, we just try and solve a problem
and figure out what new
problems have arisen.
Yeah?
AUDIENCE: You only have five
servers, and now you need six.
BRIAN YU: Yeah.
Certainly, if you only have five
servers and suddenly you need six,
that could potentially
become a problem, as well.
But let's even assume that
we have enough servers.
We have five servers, and
every time someone loads a page,
they get sent to a different server
based on one of these methods.
What can still go wrong
with the user experience?
And in particular, I'll give you a hint.
Let's think about sessions.
What can go wrong?
Remember, sessions were ways of
storing information-- in our case,
inside of the server--
about the user's current
interaction with the server.
It stored which user was logged in.
It stored the current state
of the tic-tac-toe game.
It stored other information.
Yeah?
AUDIENCE: You have to
pick one [INAUDIBLE].
BRIAN YU: Yeah, exactly.
If I initially load a page and I go
to server one and some information
about me is stored in the session,
like whether I'm logged in
or the current state of
my game or something else,
and then I load another
page and it takes
me to server four this
time, well, now, that server
doesn't have access to the
same session information
that server one had if the
information about the session
was stored in the server.
And now, that information is lost.
So I could load a page,
and suddenly, now, I'm
logged out of the page
for no apparent reason
even though I've logged
in just a moment ago.
And then I could go to another
page, and maybe by chance,
I'm back to server one, and
now I'm logged in again.
So strange things can begin to happen.
And so to solve that, what could we do?
How can we make sure that
sessions are preserved
when the user is requesting pages?
Again, no one correct answer.
Multiple possibilities here.
How do we solve this problem?
Yeah?
AUDIENCE: Would there be any way to
store the session on the load balancer?
BRIAN YU: Store the session
on the load balancer.
That's a good idea.
And that will actually get
me at the first idea here,
which is this idea of sticky sessions.
And this is slightly different.
Rather than store all the session
information in the load balancer,
it just needs to store for
this particular user which
server has their session information.
So if I went to server
number one initially,
the load balancer will remember me based
on my IP address, cookie, or whatever
and say, all right, next time
I try and request a page,
let me direct them back to
server number one, for instance.
That way, whenever I come back, I'm
always going to go to the same place.
There are other ways to
solve this problem, as well.
You could store session
information in the database
that all the servers have access to.
You could store session information
on the client side, whereby
it doesn't matter what server you go to,
because all the session information is
inside the client.
So there are a number of
ways to solve this problem,
but these generally fall under
the heading of session-aware load
balancing.
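One simple way to approximate sticky sessions, as a sketch: hash some stable client identifier, like the IP address, so the same client always maps to the same server without the load balancer storing a per-client table (server names here are hypothetical):

```python
import hashlib

servers = ["server1", "server2", "server3", "server4", "server5"]

def sticky_server(client_ip):
    # A deterministic hash of the client's identifier means the
    # same client lands on the same server on every request,
    # so its server-side session data is always found.
    digest = hashlib.sha256(client_ip.encode()).digest()
    return servers[digest[0] % len(servers)]
```

The catch, as with any sticky scheme, is that clients behind a shared IP all land on one server, and changing the server list reshuffles everyone, which is why real systems often use cookies or consistent hashing instead.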
Someone mentioned the problem of,
OK, well, I have five servers,
but what happens when I need six?
To solve this in the
world of cloud computing--
where nowadays most people don't
maintain their own hardware
for their web applications but instead
rent hardware on someone else's servers,
for instance on AWS,
using Amazon's servers--
you can take advantage of auto scaling,
which automatically will grow or shrink
the number of servers based upon
load, whereby you could initially
have two servers.
But if more users come
about and you need more,
we can add a third server into the mix.
More people come about, we need even more.
We add a fourth server.
And auto scaling goes
in both directions.
So if suddenly we find, all
right, we had a lot of load
at this particular peak time
of the day but now there are
fewer users on the site, auto scaling
can sort of say,
all right, we don't need
four servers anymore.
Let's go back to three and then
later on, if it needs doing,
go back up to four again.
And it can automatically, dynamically
reconfigure the number of servers
in order to figure out
what the optimal number is
given the number of users that are
currently using the application.
What happens, though, when one of
the servers fails for some reason?
The server just dies, for instance.
The load balancer doesn't
necessarily know about that.
And so if it's still directing
people across four different servers,
it could direct users to that server
that is no longer operational.
Any thoughts on how we
might solve that problem?
Yeah?
AUDIENCE: Have the load balancer ping
the server at determined intervals
to see if it's still there.
BRIAN YU: Yeah, some
sort of ping to make sure
that the server is still there.
And often, one of the easiest
ways that this is done
is via what's called a heartbeat,
whereby each of the servers
gives off a heartbeat every fixed number
of seconds or minutes-- for instance,
every 10 seconds the server sends out
a heartbeat, and that
gets sent to the load balancer.
If ever the load balancer doesn't
hear the heartbeat from the server,
it can know that that server is no
longer operational, and it can say,
all right, you know what?
Let's stop sending users there and only
send users to the other three servers.
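A minimal sketch of that heartbeat tracking, with the timeout value and server names as made-up placeholders (real load balancers typically combine this with active health-check probes):

```python
TIMEOUT = 30  # seconds of silence before a server is considered down

class HeartbeatMonitor:
    def __init__(self):
        self.last_seen = {}

    def record(self, server, now):
        # Each server reports in periodically; remember when.
        self.last_seen[server] = now

    def healthy_servers(self, now):
        # Only route traffic to servers heard from recently.
        return [s for s, t in self.last_seen.items()
                if now - t < TIMEOUT]
```

Passing the clock in as `now` rather than calling `time.time()` inside keeps the sketch easy to reason about and test.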
Questions about that or any of the
ideas of how we scale our servers
to be able to handle load?
We decided, all right, if too
many people are on one server,
we need to split up into
two different servers.
But that introduced a
bunch of problems that we
had to solve-- problems about load
balancing, problems about what to do
about sessions, so on and so forth.
Yeah?
AUDIENCE: Do you hear a lot
about distributed servers?
I'm wondering how they [INAUDIBLE].
BRIAN YU: Sure.
How do servers share data?
Well, they use databases.
And of course, as we start to figure out
what to do with more and more servers,
we also need to figure out
what to do about databases,
figure out how to scale databases
and make sure that as we scale them,
the databases are able to
handle that load, as well.
And so in the past, we've had,
all right, a load balancer.
We've got servers.
And in our model right now, we have
a database that both of these servers
are connected to.
But of course, the problem is
soon going to arise of, all right,
now we've got a lot of
servers that are all
trying to connect to the same database.
And now, we've got yet
another single point
where things could
potentially go wrong or where
we could potentially be overloaded.
So how do we solve this type of problem?
One of the most common ways
is database partitioning.
One form of database partitioning
you've, in fact, already seen,
and it's just an extension of
what we've been doing with SQL,
whereby we have this flights table.
And we could say, all right, rather than
store the origin and the origin code,
let's go ahead and separate
what's in one table
into a couple different tables.
Let's separate the flights
table into a locations table
where the locations table has a
number for each possible location.
And then it also, in
the flights table, now,
only needs to store a single number for
the origin ID and the destination ID.
We could also separate
tables in different ways.
If we have some general
way we could partition
a table into different
parts that are generally
going to be queried
separately, then we can
do another partition where
I could say, all right,
my flight's table is getting big.
Let's split it up.
And all right, at my airline, the
international departures and arrivals
are handled separately from the
domestic departures and arrivals.
So no need for those to
be in the same table.
Let me just go ahead and
take flights and separate it
into a domestic flights table and
an international flights table,
for instance.
One way to just partition things
into two different tables that
could potentially be stored in
different places that ultimately
allows for handling of scale.
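A toy sketch of that routing idea in Python, standing in for what would really be two separate SQL tables, possibly on separate machines (the predicate and table names are purely illustrative):

```python
# Two partitions of the original flights table. Rows are routed
# by whether the flight crosses a border, since the lecture's
# premise is that the two sets are queried separately.
domestic_flights = []
international_flights = []

def insert_flight(origin_country, destination_country, row):
    # The application decides at write time which partition
    # a row belongs to; queries then only touch one partition.
    if origin_country == destination_country:
        domestic_flights.append(row)
    else:
        international_flights.append(row)
```

The win is that each partition stays smaller and can live on separate hardware; the cost is that queries spanning both partitions now need extra work.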
But ultimately, all
of these are problems
that are still going to lead to the
fundamental problem of if I only
have one database and 10 or
dozens of servers that are all
trying to communicate
with that same database,
we're going to run into problems.
The database can only handle
some fixed number of connections.
And so one solution to this
is database replication.
So all right, how does
database replication work?
Well, probably the simplest
form of database replication
is what's called single
primary replication, whereby
I have one what's called
primary database and maybe
three databases in total,
but only one that I'm
going to consider the primary one.
And you can read data
from any of the databases.
You can get data out of
any of the three databases,
whereby if there are three servers
and each one wants to read data,
they can just share among the
three databases reading data
to make sure that we're
not overloading any one
database with too many connections.
But you can only write
data to a single database.
And by only writing data
to a single database,
that means that anytime
this database is updated,
then this database,
our primary database,
just needs to update
the other two databases.
Say, all right, there's been a
change made to the primary database.
And it's the primary
database's responsibility
to then communicate to the other two
databases what those changes are.
And so that's
single-primary replication.
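A toy model of single-primary replication, just to illustrate the flow of writes (the classes and the propagation step are heavily simplified; real databases replicate via logs and have to handle failures and lag):

```python
class Database:
    def __init__(self):
        self.rows = {}

class Primary(Database):
    def __init__(self, replicas):
        super().__init__()
        self.replicas = replicas

    def write(self, key, value):
        # All writes go through the primary...
        self.rows[key] = value
        # ...which is responsible for pushing each change
        # out to every replica.
        for replica in self.replicas:
            replica.rows[key] = value

replicas = [Database(), Database()]
primary = Primary(replicas)
primary.write("balance", 100)
# Any of the three databases can now serve reads for "balance".
```

Reads can be spread across all three copies, which is the scalability win; writes remain bottlenecked on the one primary, which is the limitation the multi-primary model tries to address.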
Yeah?
AUDIENCE: How is that more efficient
than just communicating with all three
of them?
Because I think you're
sending information
from the first database
to the second and third.
[INAUDIBLE] information sent that's
just rewriting to all three of them.
BRIAN YU: That's true, though.
Databases could potentially
batch information
together into transactions
and groups
so as to be a little bit more efficient.
So certainly ways around that problem.
But yeah, a good point.
Of course, this helps the read problem.
It makes it easier to be able
to read data out of databases.
But it leaves open a
potential vulnerability
or a potential scalability problem
with regard to writing data,
because there is still only a single
database on which I can actually
write data to if that one database
is responsible for updating
all of the other databases.
And so a more complex
version of this is what's
known as multi-primary
replication, where
the idea is that each database
can be read from and written to.
But now, updates get a
lot more complicated.
All of the databases need to
have some notion and some way
of being able to update each other.
And there, conflicts begin to arise.
You can have update conflicts
where two different databases
have updated the same row.
All right, how do you
resolve that problem?
You can have uniqueness
conflicts, whereby
if you add a row to each of two
databases at the same time, maybe
they get the same ID.
Maybe this one only has
27 rows, so this database
adds a new row with ID number 28, and
this database does the same thing.
And now, when they try
to update each other,
we have two rows with the same ID.
And now, we need some
way of resolving those,
because the IDs are
supposed to be unique.
And so that can create
problems, as well.
And then there are other types of
conflicts, too-- delete conflicts,
whereby one database tries to
delete a row at the same time
another database tries to update a row.
So which do you do?
Do you update the row?
Do you delete the row?
And so these are all conflicts
that when you're setting up
a multi-primary replication
system, you need
to figure out how you're going to
ultimately resolve those conflicts.
You gain the ability to
write to all the databases,
but new problems arise
as you begin to do that.
Yeah?
AUDIENCE: So is the information
in each database the same?
Are they [INAUDIBLE] with each other?
BRIAN YU: Yeah.
In this model, the
databases in general are
going to be the same, though
they're not always perfectly going
to be in sync, which is yet another
problem, whereby there might
be some time after I
write to this database
before that data propagates through
all of the databases, for instance.
AUDIENCE: So why not keep it in one?
BRIAN YU: You could keep all
the information in one database.
But a single database server can
only handle so many connections.
And so you might imagine that having
three different servers, three
different computers that are all
able to handle incoming requests,
just increases the capacity
of your application
to be able to handle that kind of load.
All right.
Questions about databases, database
replication, any of the scale problems
that come about there?
All right.
Final thing I'll mention on the
topic of scaling that can be helpful
is just the idea of caching.
Caching is something we've
talked about a lot before.
But a general idea could be that in
order to try and solve this problem
of constantly having to request
information from the database,
if we could store data in some
other place-- in particular,
inside of a cache--
then we don't need to access the
database as often, because we've
got the information already stored.
And so one way to do this
is via client-side caching.
And so inside of the HTTP
headers, when an HTTP response
is sending back
information to a user, you
can add an HTTP header called
Cache-Control that basically
says for up to this number of seconds,
you can just store information
about this page and not
request it again if you try
and request the page for a second time.
And this helps to make sure that
if the browser tries to request
the page again, it doesn't need to.
It can just use the version
that's stored inside of the cache.
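As a small sketch, the header itself might be constructed like this (the max-age value is arbitrary, and real responses often carry other Cache-Control directives such as public or no-store):

```python
def cacheable_response_headers(max_age_seconds):
    # max-age tells the browser it may reuse its stored copy of
    # the page for up to this many seconds without asking the
    # server again.
    return {"Cache-Control": f"max-age={max_age_seconds}"}

# e.g. allow caching for one day:
headers = cacheable_response_headers(86400)
```

In Flask this would typically be set on the response object before returning it, but the header string itself is all the mechanism there is.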
And a more recent development is this
idea of an ETag, or an entity tag.
And the idea here is that if we have
some web resource, some document,
some piece of data from a database
that our web application is sending out
to users, when I send users
that resource, that document,
I'll send that document, and
I'll also send an entity tag that
corresponds to that particular
version of the document
and send them both to the user.
And imagine this is a big document.
It's a lot of data, so it's expensive
to query and to send to the user.
The next time the user tries to
request this page, what the user can do
is the user can send the entity tag,
the ETag, along with their request.
I would like to request this
resource, and, oh, by the way,
I already have this version
of the entity stored
locally inside of my computer's cache.
And if the web application then
looks at that ETag and says,
all right, you know what?
That's the latest
version of the document.
The web application can just respond--
in particular, with an HTTP status
code of 304, meaning not modified,
to just say, you know what?
This entity tag is the
most recent entity tag.
Don't bother trying to
request the document again.
Just use the version you
saved locally in your cache.
And if, on the off chance,
the document's been updated
and therefore has a new ETag
value, then the web application
goes through the process of sending
that entire document back to the user.
But by taking advantage
of technologies like this,
this can allow us to
make sure that we're not
making too many requests
to the database,
that we don't make redundant requests
if a particular resource hasn't changed.
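A rough sketch of that ETag round trip (hashing the document body is one common way to derive a tag; the function names here are made up, and real servers read the client's tag from the If-None-Match request header):

```python
import hashlib

def make_etag(document):
    # Any stable fingerprint of the content works as an ETag;
    # a hash of the bytes is a common choice.
    return hashlib.sha256(document.encode()).hexdigest()

def respond(document, if_none_match=None):
    # Returns (status_code, body, etag). If the client already
    # holds the current version, send 304 with no body at all.
    etag = make_etag(document)
    if if_none_match == etag:
        return 304, None, etag
    return 200, document, etag
```

The saving is in the 304 path: the server skips sending the whole document, and the client keeps using its cached copy.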
So caching can be done
on the client side.
Caching can also be done
on the server side, which
changes our diagram slightly
so as to look a little bit more
like this, whereby now, we've
got some more complications here.
We've got some load balancer
that's communicating
with a bunch of different servers.
All of those servers have to
interact with the database,
and maybe you've got multiple databases
going on here that are each able to do
reads and writes, either
in a single-primary model
or a multi-primary model.
And those servers also have access
to some cache that makes it easier
to access data quickly,
in a sense, saying,
if there's some
expensive database query,
don't bother performing the database
query again and again and again.
Take the results of that
database query once.
Save it inside of the cache.
And from then on, the server
can just look to the cache
and get information out of there.
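A minimal sketch of that server-side caching idea (`run_query` stands in for whatever actually talks to the database; real caches like Redis or memcached also handle expiry and invalidation, which this omits):

```python
cache = {}

def query_with_cache(sql, run_query):
    # The expensive database query only runs the first time a
    # given query string is seen; afterwards, the saved result
    # is returned straight from the cache.
    if sql not in cache:
        cache[sql] = run_query(sql)
    return cache[sql]
```

The open question any real deployment has to answer is invalidation: when the underlying data changes, the stale cached result has to be evicted or expire.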
So lots of security and
scalability concerns
that can potentially come about as
you begin web application development.
And so the goal of today was
really just to give you
a sense for the types of
concerns to be aware of,
the types of things
to be thinking about,
and the types of issues
that will come about
if you decide to take a web application
and begin to have more and more people
actually start to use it.
So questions about that or
about any of the other topics
we've covered this week?
All right.
So with the remainder of this morning,
between now and about 12:30 or so,
we'll leave it open to more
project time, an opportunity
to work on any of the
projects you've worked on
so far over the course of this week
and also an opportunity to work
on something new if you would like to.
I know many of you yesterday decided
to start on new projects, projects
of your own choosing
built in React or Flask
or using JavaScript or any
of the other technologies
we've talked about this week.
Before we conclude, though, I do
have to say a couple of thank yous,
first to David for helping to advise
the class, to the teaching fellows--
Josh and Christian
and Athena and Julia--
for being excellent in
helping to answer questions
and helping to make sure that the
course can run smoothly, to Andrew up
in the back, who's been taking care
of the production side of everything
over the course of this week, making
sure that all the lectures are recorded
and making sure they're posted
online, such that afterwards, you,
when you're here or
when you're not here,
are able to come online to see them.
So thank you to everyone for
helping to make the course possible.
Thank you to all of you
for coming to the course.
Hope you enjoyed it.
Hope you got things out of it.
We've really only scratched
the surface, though,
of a lot of the topics
that we've covered
over the course of the past week.
There's a lot more to CSS and HTML
and JavaScript and Flask and Python
and React than we were really able to
touch on over the course of the week.
It was really meant to
be more of an opportunity
to give you some exposure to some
of the fundamentals of these ideas,
some of the tools and the
concepts that you can ultimately
use them as you begin to design
web applications of your own.
So I do hope that you've learned
something from the week but,
in particular, that you found things
that are interesting to you, such
that you continue to take
those ideas and explore them.
Go beyond just what we've been able
to cover over the course of this week
and explore what else these technologies
and these tools and these ideas
ultimately have to offer.
So thank you so much.
We'll stick around until 12:30
to help with project time.
[APPLAUSE]
But this was CS50 Beyond.
