[MUSIC PLAYING]
DAVID MALAN: All right.
This is CS50, and last
time where we left off
was here, focusing on data structures.
And indeed, one of the last
data structures we looked at
was that of a hash table.
But that was the result of a
progression of data structures
that we began with this
thing here, an array.
Recall that an array was a data
structure that was actually introduced
back in week two of CS50, but
it was advantageous at the time,
because it allowed us to do things
efficiently, like binary search,
and it was very easy to use
with its square bracket notation
and adding integers or
strings or whatever it is.
But it had limitations, recall.
And among those
limitations were its lack
of resizeability, its lack of dynamism.
We had to decide in advance how big
we wanted this data structure to be,
and if we wanted it to be any bigger,
or for that matter any smaller,
we would have to dynamically
ourselves resize it and copy
all of the old elements into a new
array, and then go about our business.
And so we introduced last
time this thing here instead,
a linked list that addresses that
problem by having us on demand
allocate these things
that we called nodes,
storing inside them an integer or
really any data type that we want,
but connecting those nodes
with these arrows pictured
here, specifically connecting them or
threading them together using something
called pointers.
Whereby pointers are just
addresses of those nodes in memory.
So while we pay a bit of a price
in terms of more memory in order
to link these nodes together,
we gain this flexibility,
because now when we want to grow or
shrink this kind of data structure,
we simply use our friend malloc or
free or similar functions still.
But we then, using linked
lists-- and at a lower level,
pointers as a new building block,
did we begin to solve other problems.
We considered the problem of a
stack of trays in the cafeteria,
and we presented an abstract
data type known as a stack.
And a stack supports
operations like push and pop.
But what's interesting about
a stack for our purposes
recall is that we don't
need to commit necessarily
to implementing it one way or another.
Indeed, we can abstract away
the underlying implementation
details of a stack, and implement
it using an array if we want,
if we find that easier or convenient.
Or for that matter we can
implement it using a linked list,
if we want that additional
ability to grow and shrink.
And so the data type itself in a stack
has these two operations, push and pop,
but they're independent,
ultimately, of how we actually
implement things underneath the hood.
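To make that independence concrete, here is a minimal sketch in Python (not the C used earlier in the course) of one stack interface with two interchangeable implementations. The class names ArrayStack and LinkedStack are hypothetical, and a Python list stands in for an array.

```python
class ArrayStack:
    """Stack backed by a Python list (standing in for an array)."""

    def __init__(self):
        self._items = []

    def push(self, value):
        self._items.append(value)

    def pop(self):
        # Removes and returns the most recently pushed value (LIFO).
        return self._items.pop()


class _Node:
    """One node of a singly linked list: a value plus a pointer to the next node."""

    def __init__(self, value, next_node):
        self.value = value
        self.next = next_node


class LinkedStack:
    """Stack backed by a singly linked list; the head of the list is the top."""

    def __init__(self):
        self._head = None

    def push(self, value):
        # The new node points at the old head and becomes the new head.
        self._head = _Node(value, self._head)

    def pop(self):
        if self._head is None:
            raise IndexError("pop from empty stack")
        node = self._head
        self._head = node.next
        return node.value
```

Code that only calls push and pop cannot tell which of the two it was handed, which is the sense in which the implementation is abstracted away.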
And that holds true as well
for this thing here.
A line, or more properly
a queue, whereby
instead of having this last
in, first out or LIFO property,
we want something more fair in the
human world, a first in, first out.
So that when you enqueue
or dequeue some piece
of data, whatever was enqueued
first is the first thing
to get out of that queue as well.
And here too did we see that we could
implement these things using a linked
list or using an array, and
I would wager there are yet
other possible implementations as well.
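As a small illustration of first in, first out, here is a sketch using Python's collections.deque as the underlying storage. The names in line are made up for the example.

```python
from collections import deque

line = deque()

# Enqueue: new arrivals join at the back of the line.
line.append("Alice")
line.append("Bob")
line.append("Carol")

# Dequeue: whoever has waited longest leaves from the front.
first_out = line.popleft()
still_waiting = list(line)
```

Here first_out is "Alice", the first name enqueued, and still_waiting holds Bob and Carol in their original order.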
And then we transitioned from
these abstract data types
to another sort of paradigm for
building a data structure in memory.
Rather than just linking things together
in a unidirectional way, so to speak,
with a linked list, we introduced
trees, and we introduced things
like binary search trees,
that so long as you
keep these data structures pretty well
balanced, such that the height of them
is logarithmic and is not
linear like a linked list,
can we achieve the kind of efficiency
that we saw back in week zero
when we did binary
search on a phone book.
But now, thanks to these pointers
and thanks to malloc and free
can we grow and shrink the data
structure without committing in advance
to an actual fixed size array.
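A rough sketch of a binary search tree in Python follows. The names TreeNode, insert, contains and height are chosen for this example only, and nothing here rebalances the tree, so the height depends entirely on the order of insertion.

```python
class TreeNode:
    def __init__(self, value):
        self.value = value
        self.left = None    # subtree of smaller values
        self.right = None   # subtree of larger values


def insert(root, value):
    """Insert value and return the (possibly new) root of the subtree."""
    if root is None:
        return TreeNode(value)
    if value < root.value:
        root.left = insert(root.left, value)
    elif value > root.value:
        root.right = insert(root.right, value)
    return root  # duplicates are ignored


def contains(root, value):
    """Walk down from the root, going left or right at each node."""
    while root is not None:
        if value == root.value:
            return True
        root = root.left if value < root.value else root.right
    return False


def height(root):
    return 0 if root is None else 1 + max(height(root.left), height(root.right))


balanced = None
for v in [4, 2, 6, 1, 3, 5, 7]:      # inserted so the tree stays balanced
    balanced = insert(balanced, v)

lopsided = None
for v in [1, 2, 3, 4, 5, 6, 7]:      # sorted input degenerates into a linked list
    lopsided = insert(lopsided, v)
```

With the same seven values, the balanced tree has height 3 while the lopsided one has height 7, which is the difference between logarithmic and linear search.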
And similarly did we solve
another real world problem.
Recall that a few weeks
ago we looked at forensics,
and most recently did we look
at compression, both of which
happen to involve files.
And in this case, the goal
was to compress information,
ideally losslessly, without throwing
away any of the underlying information.
And thanks to Huffman coding
did we see one technique
for doing that, whereby instead of using
seven or eight bits for every letter
or punctuation symbol in some text, we
can instead come up with our own coding
that we use one bit like a one to
represent a super common letter like e,
and two or three or four
or more bits for the less
common letters in our world.
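The idea can be sketched in Python as follows. This is an illustrative version built on the standard heapq module, not the course's own implementation; the function names huffman_codes and decode and the sample text are assumptions made for the example.

```python
import heapq
from collections import Counter
from itertools import count


def huffman_codes(text):
    """Return a dict mapping each symbol in text to a string of 0s and 1s."""
    freq = Counter(text)
    tiebreak = count()  # keeps heap entries comparable when frequencies tie
    heap = [(n, next(tiebreak), sym) for sym, n in freq.items()]
    heapq.heapify(heap)
    if not heap:
        return {}
    if len(heap) == 1:
        return {heap[0][2]: "0"}
    # Repeatedly merge the two least frequent subtrees into one.
    while len(heap) > 1:
        n1, _, left = heapq.heappop(heap)
        n2, _, right = heapq.heappop(heap)
        heapq.heappush(heap, (n1 + n2, next(tiebreak), (left, right)))
    codes = {}

    def walk(node, prefix):
        if isinstance(node, tuple):      # internal node: (left, right)
            walk(node[0], prefix + "0")
            walk(node[1], prefix + "1")
        else:                            # leaf: a single symbol
            codes[node] = prefix

    walk(heap[0][2], "")
    return codes


def decode(bits, codes):
    """Invert the encoding; works because no code is a prefix of another."""
    reverse = {code: sym for sym, code in codes.items()}
    out, current = [], ""
    for bit in bits:
        current += bit
        if current in reverse:
            out.append(reverse[current])
            current = ""
    return "".join(out)


text = "eeeeeeeetttaao"
codes = huffman_codes(text)
encoded = "".join(codes[c] for c in text)
```

In this sample the very common e ends up with a one-bit code and the rarer letters get two or three bits, so the whole text fits in far fewer bits than eight per letter, and decode(encoded, codes) recovers the original losslessly.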
And then again we came to hash tables.
And hash tables too are an abstract type
that we could implement using an array
or using a linked list or using
an array and a linked list.
And indeed, we looked first at a hash
table as little more than an array.
But we introduced this idea
of a hash function, that
allows you, given some input,
to decide on some output,
an index, a numeric value,
typically, that allows
you to decide where to put some value.
But if you use something
like an array, of course,
you might paint yourself
into a corner, such
that you don't have enough
room ultimately for everything.
And so we introduced separate chaining,
whereby a hash table in this form
is really just an array, pictured
here vertically, and a set of linked
lists hanging off that array,
pictured here horizontally,
that allows us to get some
pretty good efficiency in terms
of hashing, finding the chain that
we want in pretty much constant time,
and then maybe incurring
a bit of linear cost
if we actually have a
number of collisions.
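A minimal sketch of separate chaining in Python might look like the following. The class name ChainedHashTable is hypothetical, the hash function (index by first letter) is only one simple choice, Python lists stand in for the linked lists, and words are assumed to be non-empty.

```python
class ChainedHashTable:
    def __init__(self, buckets=26):
        # The array: one (initially empty) chain per bucket.
        self._buckets = [[] for _ in range(buckets)]

    def _hash(self, word):
        # Illustrative hash function: a -> 0, b -> 1, ..., z -> 25.
        return (ord(word[0].lower()) - ord("a")) % len(self._buckets)

    def insert(self, word):
        chain = self._buckets[self._hash(word)]   # constant-time jump to the chain
        if word not in chain:                     # linear scan of that chain only
            chain.append(word)

    def contains(self, word):
        return word in self._buckets[self._hash(word)]

    def chain_length(self, word):
        """How many entries share this word's bucket (collisions included)."""
        return len(self._buckets[self._hash(word)])


table = ChainedHashTable()
for name in ["alice", "adam", "bob"]:
    table.insert(name)
```

Here alice and adam collide in the same bucket, so looking up either one means hashing in constant time and then walking a chain of length two.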
Now today, after leaving behind
these data structures-- among them
a trie, which recall was our last
data structure that allowed us
in theory in constant time to look up
or insert or even delete words in a data
structure, depending only
on the length of the string,
not how many strings were in there--
do we continue to use these ideas,
these building blocks,
these data structures.
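For reference, a trie can be sketched in Python roughly as below. The names TrieNode and Trie are chosen for this example, and a dictionary of children stands in for the fixed-size array of pointers one might use in C.

```python
class TrieNode:
    def __init__(self):
        self.children = {}    # maps one character to the next node
        self.is_word = False  # True if a word ends at this node


class Trie:
    def __init__(self):
        self.root = TrieNode()

    def insert(self, word):
        node = self.root
        for ch in word:
            node = node.children.setdefault(ch, TrieNode())
        node.is_word = True

    def contains(self, word):
        # The number of steps depends only on len(word),
        # not on how many words are stored.
        node = self.root
        for ch in word:
            node = node.children.get(ch)
            if node is None:
                return False
        return node.is_word


trie = Trie()
for w in ["cat", "cats", "dog"]:
    trie.insert(w)
```

Looking up "cat" takes three steps whether the trie holds three words or three million.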
But now today we literally
leave behind the world of C
and start to enter the
world of web programming,
or really the world of web
pages and dynamic outputs
and databases, ultimately, and
all of the things that most of us
are familiar with every day.
But it turns out that this time we don't
have to leave behind those ingredients.
Indeed, something like
this, which you'll soon
know as HTML-- the language in which
web pages are written-- HyperText Markup
Language-- even this textual document,
which seems to have a bit of structure
to it, as you might glean here from the
indentation, can underneath the hood
be itself represented as a tree.
A DOM, or Document Object
Model, but indeed, we'll
see now some real world, very
modern applications of the same data
structures in software
that we ourselves use.
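As a rough illustration of HTML becoming a tree, the sketch below uses Python's standard html.parser module to build nested dictionaries from a tiny page. The TreeBuilder class and the sample page are assumptions for this example, and the code assumes well-formed markup with no self-closing elements; a real browser's DOM is considerably richer.

```python
from html.parser import HTMLParser


class TreeBuilder(HTMLParser):
    """Builds a tree of {"tag": ..., "children": [...]} nodes from HTML."""

    def __init__(self):
        super().__init__()
        self.root = {"tag": "document", "children": []}
        self._stack = [self.root]          # path from the root to the open element

    def handle_starttag(self, tag, attrs):
        node = {"tag": tag, "children": []}
        self._stack[-1]["children"].append(node)   # child of the current element
        self._stack.append(node)                   # and now the current element

    def handle_endtag(self, tag):
        if len(self._stack) > 1:
            self._stack.pop()              # back up to the parent

    def handle_data(self, data):
        text = data.strip()
        if text:
            self._stack[-1]["children"].append(text)


page = "<html><head><title>hello</title></head><body>world</body></html>"
builder = TreeBuilder()
builder.feed(page)
builder.close()
html = builder.root["children"][0]
```

The resulting html node has two children, head and body, and the title node sits one level below head, mirroring the indentation one would use when writing the page by hand.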
Because today, we look at
how the internet works,
and in turn how we actually
build software atop it.
But first, a teaser.
[VIDEO PLAYBACK]
[MUSIC PLAYING]
-He came with a message,
with a protocol all his own.
He came to a world of cruel
firewalls, uncaring routers,
and dangers far worse than death.
He's fast.
He's strong.
He's TCP/IP, and he's got your address.
Warriors of the Net.
[END PLAYBACK]
DAVID MALAN: All right, so coming
soon is how the internet works.
And it's not quite like that.
But we'll see in a bit more detail.
But let's consider first something a
little more familiar, if abstractly,
like our own home.
So odds are, before coming
to a place like this,
you had internet access at home or
at school or at work or the like.
And inside of that building--
let's call it your home--
you had a number of devices.
Maybe a laptop, maybe a desktop,
maybe both, maybe multiple.
And you had some kind of internet
service provider, Comcast or Verizon
or companies like that, that actually
run some kind of wired connection,
typically-- though it could
be wireless-- into your home,
and via that connection are you on
your laptop or desktop able to get out
onto the internet.
Well it turns out that the internet
itself is a pretty broad term.
The internet is really
just this interconnection
of lots of different networks.
Harvard here has a network.
Yale has a network.
Google has a network.
Facebook has a network.
Your home has a network and the like.
And so the internet really
is the interconnection
of all of those physical networks.
And on top of this internet, do there
run services, things like the web
or the world wide web.
Things like email.
Things like Facebook Messenger.
Things like Skype.
And any number of applications
that we use every day
run on top of this physical
layer known as the internet.
But how does this internet itself work?
Well, when you first plug in
your computer to a home modem
that you might get from Verizon
or Comcast-- it might be a cable
modem or a DSL modem or another
technology still-- or more commonly
these days, you connect wirelessly,
such that your Mac or PC
laptop connects somehow wirelessly to
this device, what actually happens?
Like the first time you have
internet installed in your home,
how does your computer know
how to connect to that device,
and how does that device know
how to get your laptop's data
to and from the rest of the internet?
Well, odds are you
know on your Mac or PC
you at least get to choose
the name of your network,
whether it's Harvard University or
Yale or Linksys or AirPort Extreme
or whatever it is at home, and
then once you're connected to that,
it turns out that there's
special software running
on this device in your
home called a router.
And actually, it can be
called any number of things.
But one of its primary functions
is to route information,
and also to assign certain
settings to your computer.
Indeed, running inside of this
so-called router in your home
typically is a protocol, a special type
of software called DHCP-- Dynamic Host
Configuration Protocol.
And this is just a fancy way of saying
that that little device in your home
knows how to get you onto the internet.
And how does it do that?
Well, the first time you
turn on your Mac or PC
and connect to your home network-- or
Harvard's or Yale's for that matter--
you are assigned, thanks to this
technology DHCP an IP address,
a numeric address, something
of the form something
dot something dot something dot
something that uniquely in theory
identifies your computer
on the internet,
so long as your computer speaks this
protocol IP, or the Internet Protocol.
And we'll see in a bit
that IP and TCP-- or more
commonly known as TCP/IP-- is
really just a set of conventions
that governs how computers talk
to each other on the internet.
And the first way they do that
is by agreeing upon in advance
what each of their addresses look like.
Now, these addresses are actually
changing in format over time,
because frankly, we're running
out of these addresses.
But the most common
address right now still
is an IP version 4, or v4 address, that
is literally of the form something dot
something dot something dot something.
And so when your computer first
turns on in your home network,
you are given a number that looks
a little something like that.
And via that address now can you talk
to other computers on the internet,
because this is like your from
address in the physical world,
and you can receive responses
from computers on the internet,
because they now know
you via this address.
So much like the CS building
here is at 33 Oxford Street,
Cambridge, Massachusetts,
or the CS building at Yale
is at 51 Prospect Street, New
Haven, Connecticut, much as those
addresses uniquely identify
those two buildings,
so do IP addresses in the world of
computers uniquely identify computers.
So here for instance just
happens to be by convention
what most of Harvard's own
IP addresses look like.
Now that I'm on this network here,
odds are my IP address starts
with 140.247 dot
something dot something,
or 128.103 dot something dot something.
Or at New Haven at Yale, it
might look like 130.132 dot
something dot something, or 128.36
dot something dot something.
And it turns out that
each of these somethings
simply is by definition a
number between 0 and 255.
0 to 255.
I feel like we've heard
these numbers before.
And indeed, if you can
count from 0 to 255,
that means you're using what? 8 bits.
And so each of these numbers
is 8 bits plus 8 plus 8 plus 8.
So that's 32 bits.
And indeed, an IP address typically
these days-- at least version 4--
is a 32-bit value which means there
can be in total no more than 4 billion
or so computers on the internet.
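That arithmetic can be sketched in Python. The function name ipv4_to_int is made up for this example; it packs the four 8-bit numbers into one 32-bit value and rejects anything outside 0 to 255.

```python
def ipv4_to_int(address):
    """Pack a dotted IPv4 address into a single 32-bit integer."""
    octets = [int(part) for part in address.split(".")]
    if len(octets) != 4 or any(not 0 <= o <= 255 for o in octets):
        raise ValueError(f"not a valid IPv4 address: {address!r}")
    value = 0
    for octet in octets:
        value = (value << 8) | octet   # shift left 8 bits, append the next 8
    return value


TOTAL_ADDRESSES = 2 ** 32              # 4,294,967,296 possible values
smallest = ipv4_to_int("0.0.0.0")
largest = ipv4_to_int("255.255.255.255")
```

Since largest is 2 ** 32 - 1, there are exactly 2 ** 32 distinct addresses, which is the "4 billion or so" ceiling. An address with a part like 275 raises ValueError, because 275 does not fit in 8 bits.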
And we're actually starting to bump
up against that, because everything
these days seems to be on the internet,
whether it's your phone, laptop,
or even some smart device in your home.
And so there is a way to mitigate that.
It turns out that your computer,
even if you're on campus,
might not quite have one of
those Harvard or Yale IPs.
You might instead have depending on
where you are on campus a private IP
address, or if you're in
your home, you similarly
might have one of these addresses.
And these are private
in the sense that they
are used to route information within
your home or within your school
or within your company,
but these addresses are not
meant to be used by the outside world.
Instead, what you get
from Harvard or Yale
or Comcast or Verizon when
you connect to their network
typically is at least the ability
to have one or more public IP
addresses that the rest
of the world knows you by.
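A short sketch of how software might tell the two kinds of address apart, using Python's standard ipaddress module. The three ranges listed are the commonly reserved private IPv4 blocks; the function name is_private and the sample addresses are assumptions for the example.

```python
import ipaddress

# The three IPv4 blocks conventionally reserved for private networks.
PRIVATE_NETWORKS = [
    ipaddress.ip_network("10.0.0.0/8"),
    ipaddress.ip_network("172.16.0.0/12"),
    ipaddress.ip_network("192.168.0.0/16"),
]


def is_private(address):
    """True if address falls inside one of the private blocks above."""
    addr = ipaddress.ip_address(address)
    return any(addr in network for network in PRIVATE_NETWORKS)


home_laptop = is_private("192.168.1.10")   # a typical home address
campus_host = is_private("140.247.0.1")    # a hypothetical address in Harvard's public range
```

Here home_laptop is True and campus_host is False: the first address only means something inside its own network, while the second is one the outside world could route to.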
So what does this actually mean?
Well, sometimes it doesn't
really mean anything at all.
And in fact, if you look at popular
media today or various television
shows, you'll see that IP is
either miscommunicated or outright
misunderstood.
Let's take a look.
[VIDEO PLAYBACK]
-It's a 32-bit IPv4 address.
-IP, as in the internet?
-Private network.
To meet is private network.
It's just so amazing.
It's in their IP address.
She's letting us watch what
she's doing in real time.
[END PLAYBACK]
DAVID MALAN: No, no, that is not
what a hacker does in real time,
and that is not how you
watch a hacker in real time.
Indeed, if you zoom in
on this screen here,
you'll see that what's
actually being looked at
has nothing to do with
networking per se.
This is actually
programming code written
in a language called
Objective-C, which happens
to be used conventionally
for Mac applications
or more recently iOS applications.
And of all the things for
them to have pulled out,
they use this code,
which has to be something
related to some kind of drawing program
insofar as it's talking about crayons.
Moreover, if you actually look at one
of the other scenes from this show,
this was the IP address in question.
This too is not technically accurate.
What's wrong with this IP address
in this frame here from the show?
Yeah, so if the IP addresses
can only be from 0 to 255,
275 is definitely too big.
Now, in their defense, this
is probably a good thing,
because now they're not broadcasting
some random, unsuspecting person's
actual IP address.
But there too there's
a technical limitation.
But of course, we humans, when we visit
websites using Safari or Chrome or IE
or Edge or whatever,
we rarely if ever type
in the address of websites or servers
by these numeric IP addresses.
Rather, we seem to use
more user-friendly words,
like www.google.com, or harvard.edu, or
yale.edu, or facebook.com, or the like.
And thankfully, there
exists in the world
another system, another technology
known as DNS-- Domain Name System.
And what DNS does is it simply
converts numeric IP addresses
to more human-friendly host names,
or fully qualified domain names.
Which is to say when I first sit down
at my Mac or my PC on my home network
or Harvard's or Yale's and I type in
something like www.google.com and hit
Enter, the way that my computer
actually talks to google.com
is by way of those numeric IP addresses.
But the way my Mac or PC figures out
what that IP address is of google.com
is it asks the local operating
system-- Mac OS or Windows--
and if Mac OS or Windows doesn't
know, my operating system asks
Harvard's network or Yale's
network or Comcast's network,
wherever I physically am,
because each of those networks
has their own DNS server, whose
purpose in life is to convert IP
addresses to host names and
host names to IP addresses.
And in the event that Comcast or
Yale or Harvard, wherever I am,
doesn't know the answer to what is
the IP address for www.google.com,
there exist root servers in the world.
Servers that are globally
administered that, at the end of the day,
can at least help those DNS servers
figure out what the answers are.
And indeed, when you
buy, or really
rent, a domain name, among
the things you're doing
is informing the world
via a set of standards
what your server's IP addresses are.
And so that's exactly what
Google and others have done.
But of course, the data at the end of
the day still has to get from my laptop
to Google.
And then my search results
have to get from Google to me.
And how does that happen?
I mean, most of Google's
servers are probably
out in Mountain View, California or
maybe here on the East Coast somewhere,
if they have multiple servers.
Or maybe somewhere in the world.
And indeed, big companies these days
have servers all over the place.
So how does one little
old laptop know how
to request search results
from Google or how
to request my news feed from Facebook
or how to do any number of other things
on the internet?
Well it does it by way of
these things called routers.
It turns out that between me and
most any other point on the internet,
there's one or more
routers-- special servers
that could be this big, this big,
any number of sizes these days.
They're just computers that typically
live in data centers of some sort.
And these routers' purpose in life
is to quite simply route information.
So when my Mac wants
to talk to google.com,
my Mac constructs what we call a
packet of information inside of which
is my request.
Give me all of your
search results for cats,
for instance, if that's
what I'm searching for.
And that packet is handed
off to the nearest router.
That router happens to be, at this
point in the story, at Harvard here.
Harvard has its own routers.
And Harvard's routers are somehow
wired or wirelessly connected
to other routers in the world.
And those routers, typically
no more than 30 routers away,
can get my data by routing it, routing
it, routing it, routing it, routing it,
until it eventually reaches
its correct destination.
In its simplest form, what
you can think of these routers
as doing is looking at those
IP addresses-- something
dot something dot something dot
something-- and deciding, based
on those numbers, which direction to go.
So maybe if my IP address
starts arbitrarily with 1,
maybe the packet should go
that way to that router.
If it starts with 2,
it should go that way
and be routed to that router,
or that way, or that way.
It doesn't really matter.
This all happens dynamically
thanks to software.
But routers just use those
IP addresses to decide
which way to route your information.
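A toy version of that decision can be sketched in Python with the standard ipaddress module. The routing table and the hop names below are entirely hypothetical; real routers build their tables dynamically, but the core step of matching a destination against address prefixes looks similar.

```python
import ipaddress

# Hypothetical routing table: (address prefix, next hop to forward to).
ROUTES = [
    (ipaddress.ip_network("140.247.0.0/16"), "stay-on-campus"),
    (ipaddress.ip_network("130.132.0.0/16"), "router-toward-new-haven"),
    (ipaddress.ip_network("0.0.0.0/0"), "default-gateway"),   # matches everything
]


def next_hop(destination):
    """Pick the next hop whose prefix matches the destination most specifically."""
    addr = ipaddress.ip_address(destination)
    matches = [(network, hop) for network, hop in ROUTES if addr in network]
    # Prefer the longest (most specific) matching prefix.
    network, hop = max(matches, key=lambda match: match[0].prefixlen)
    return hop
```

A packet addressed to 130.132.1.1 would be handed toward New Haven, one addressed to 140.247.5.9 would stay on campus, and anything else falls through to the default gateway.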
And we can actually see this.
Let me go ahead into CS50 IDE, and
Macs and PCs and other computers
have the same software.
This will allow me to do a number
of things at my command line here.
For instance, suppose
that I wanted to check
what the IP address is for google.com.
Because if I want to send Google a
letter, like a packet of information
requesting a whole bunch of
search results about cats,
I need to know their IP address.
So what I can do at
the command line here
is run a command that's pretty popular
called nslookup-- name server lookup.
And I can type in something
like www.google.com Enter,
and voila, I seem to get the answer here
that Google's IP address is apparently
172.217.4.36.
And I know that answer,
because Harvard's server--
and I know it's Harvard, because it
starts with 140.247-- Harvard's DNS
server somewhere here on
campus just knew that result.
But it's non-authoritative, in the sense
that Harvard does not run google.com.
But Harvard has previously
asked Google or someone else
for Google's IP address.
And so Harvard is answering the question
for me, but not authoritatively.
It's a delegate who is relaying
that information to me.
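The chain of asking, and the difference between an authoritative and a relayed answer, can be modeled with a toy resolver in Python. The Resolver class and the server names are hypothetical, and this ignores how real DNS expires cached answers and how root servers refer queries onward rather than answering them directly.

```python
class Resolver:
    """A toy DNS server: answers from its own records or asks upstream."""

    def __init__(self, name, records=None, upstream=None):
        self.name = name
        self.cache = dict(records or {})
        self.upstream = upstream

    def lookup(self, hostname):
        """Return (ip_address, name_of_the_server_that_answered)."""
        if hostname in self.cache:
            return self.cache[hostname], self.name
        if self.upstream is None:
            raise KeyError(hostname)
        ip, answered_by = self.upstream.lookup(hostname)
        self.cache[hostname] = ip      # remember it for next time
        return ip, answered_by


# Only the last server in the chain actually holds the record.
authoritative = Resolver("authoritative", {"www.google.com": "172.217.4.36"})
campus = Resolver("campus-dns", upstream=authoritative)
laptop = Resolver("laptop", upstream=campus)

first = laptop.lookup("www.google.com")    # travels all the way up the chain
second = laptop.lookup("www.google.com")   # answered from the laptop's own cache
```

The first lookup is answered by the authoritative server, and every server along the way caches the result. After that, the campus server can answer the same question for anyone else on campus without asking again, which is the kind of non-authoritative answer nslookup reported. On a machine with network access, Python's socket.gethostbyname performs a real lookup of this kind.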
Now, suppose I want to
do this for another site.
Let me go ahead and search for
nslookup say www.facebook.com.
And you'll see here that Facebook's
IP address is apparently 31.13.80.36.
And there's some more
cleverness going on here.
It turns out there's
other types of DNS records
or entries, star-mini.c10r.facebook.com.
I don't really know what that means.
Facebook's a big enough
company that there's probably
a lot more complexity going on.
But just out of curiosity, let me go
ahead and copy this IP address here.
And in a browser, go to
http:// that IP address.
Enter.
And voila, I make my way to Facebook.com.
But it would be pretty bad for
business if everyone in the world
had to know that Facebook's
IP address is this.
Back in the day when people
still used phone numbers,
you might have services like
1-800-COLLECT, C-O-L-L-E-C-T,
these mnemonics, so that it was easier
for humans to remember phone numbers.
Thankfully, DNS does all
of this automatically.
We just have to remember
facebook.com, and DNS
does that conversion even more
dynamically than the old school
1-800-COLLECT tricks
that the world adopted.
So that's how my computer
would get the TO address.
So at this point in the story, if I
want to send a request to google.com--
and this is just an envelope in
which I might send a letter--
I need to have two
pieces of information.
I need to have the TO address
here, which for Google recall--
let me look it up
again-- is 172.217.4.36.
And so I'm going to put that
in the TO field of this envelope.
And now I need to know
my own IP address.
So it turns out my computer
has its own IP address.
And so when I send this request
over the internet to Google,
I'm going to need to include my own
IP address, which Windows or Mac
OS knows for me.
And so in the top corner
of this envelope might
I write my actual IP address as well.
So now I have to actually
route this information.
I first have to write
Google a note, and I
might say on this blank sheet
of paper, search for cats.
So this might be my search request.
And I'm going to go ahead
and just bundle this
up, put this inside of this envelope.
But now I need to send this envelope
or this so-called packet of information
to www.google.com.
And who knows where they are?
Maybe they're in California.
Maybe they're here on the East Coast.
Maybe they're somewhere else.
How do I route this information?
Well, turns out that
Harvard has a router, again,
and Harvard's routers
know of other routers.
And in turn, using
the same command prompt,
can we actually see the
path that my data should
take if I trace the route one query
at a time from here to www.google.com.
And now what you see, one row
at a time, is the following.
The first hop between me and Google
is apparently this router here.
Row number one,
mr-sc-1-gw-vl427.fas.net.harvard.edu.
Don't quite understand
all of that, but I
do know just from knowing the people
there, MR is the machine room.
So here at Harvard Science Center,
there is a room with machines.
And that's where this
server apparently is.
SC means Science Center.
GW by convention means
gateway, which is just
a synonym for router,
this kind of device.
And then I don't know what VL427 means.
But I do know that if we
continue to the next hop here,
row two, Core Science Center gateway,
or Core Science Center router.
So one router is connected
to another router.
The third hop to which
my data is delivered
is bdrgw2, which I know by
convention means border gateway.
And so this data is being passed
from hop one to two to three.
And once it goes there, it
goes to hop four or router
number four, which is nox1sumgw.
So nox is the northern crossroads,
which is a common peering point here
in the Northeast of the US, which
just means lots of different internet
service providers interconnect
their cabling and their technology
so as to route data
to and from locations.
That's apparently where
we're connected here.
Then I don't know where
row five is, but it
looks like it's owned
by Internet2, which
is a fast level of internet service
that a lot of universities use.
Then router 6, 7, 8, 9, 10, and 11 don't
even disclose that they have names.
And they might not.
Routers don't and computers
don't need to have
domain names or human-friendly terms,
it's just useful for us humans.
But then lastly in hop 12, we finally
make our way to whatever this is,
which seems to be some kind of synonym
or alias for one of Google's servers.
So it seems that in just 12 hops,
I can get data from here to Google.
And you know how long it takes to get
from here to Google, wherever they are?
9 milliseconds in total.
That's pretty darn fast to
make a request from my computer
to some other computer, especially when
that computer could be most anywhere
in the world or in the country.
Now, there's a lot of variability.
If you look at each of these rows--
1.5 milliseconds, 1.9, 2.9, 25, 25, 25.
These aren't cumulative.
What my computer is doing is sending
a packet to the first router,
then to the second router,
then to the third router,
and measuring each
time how long it takes.
So you really just get a rough
sense, an average of sorts,
based on running this command like this.
So it seems to take between
10 and 30 milliseconds
to get my data from me to Google.
Now, I don't know where
Google's servers are,
but I do know that UC
Berkeley is in California,
and their servers I do
think are in California.
So let's do another by tracing
the route to www.berkeley.edu
where some of our friends there are.
That was super fast, even though
it still took some 93 milliseconds.
So I'm going to infer that
the server of Google's
that I'm talking to isn't
all the way in California,
because to get to California in
reality seems to take a good 100 or 90
milliseconds.
But let's see what we can glean here.
So Machine Room Science Center.
It's a core gateway.
It's a border gateway to Northern
Crossroads, to an unnamed server.
Don't know what this one is.
But I can guess maybe what this is.
And notice in particular, router
number six jumps from seven
milliseconds to like 49.
That's a pretty good distance.
And indeed, if you look at the
name here, Hous, this I'm guessing
is a router that's in Houston,
Texas, halfway across the country.
After that, maybe Los
Angeles here in step 8.
And that, indeed takes
a little more time.
So you can probably infer
that it's farther away.
No name, no name.
This one here, I'm not really sure.
But now we seem to be in Berkeley's
campus and CalWeb-- California web,
their server farm production.
Indeed, it takes some 90 milliseconds
in total to get to Berkeley.
What about MIT?
MIT should be pretty close.
Let's do a trace route to MIT.edu.
And it takes-- all right, so it seems
that two routers between us and MIT
aren't even cooperating, and
that's their prerogative.
Not actually responding to our requests.
And so in about 10 milliseconds,
we get to MIT's server,
which seems to be
hosted by a third party
company called Akamai, which
is a content delivery network,
among other things.
Which means MIT has
outsourced to some third party
the physical hosting of their
servers, which is not uncommon.
But let's do one more.
Let's do one like for CNN,
but not here in the US.
But maybe .co.jp for the Japanese
version of CNN's website.
Let's go ahead and run this.
Initially following the same route,
Machine Room, Core Gateway, border.
And then voila, 189 milliseconds later,
we seem to have gotten to Japan.
But what can we glean
from these numbers?
I'm not quite sure where
all of these hops are.
But what is interesting to me is this
one here between routers 8 and 9, what
do you notice?
That's a sizable jump in time.
And it's not a fluke.
It's not an anomaly, because
indeed, it seems to persist.
So if we go farther and
farther into this trace, then
indeed it's staying at
170 plus milliseconds.
So what do you think is in
between routers number 8 and 9?
What would be between these?
I dare say there's an
entire ocean between them.
And we can see that thanks
to this animation here,
there's a whole lot going
on between points A and B,
including sometimes some pretty big
cables and some pretty big oceans.
Let's take a look.
[MUSIC PLAYING]
All right, there's something
about really cool music
that makes lines cool.
But indeed, those pictures
capture the complexity
of all the wiring that's actually
interconnecting all of the continents
and countries of the world that
actually explains more technically some
of those differences in timings.
But at the end of the day, this
packet has to get somewhere.
And suppose it does make its
way over to Google's servers,
and Google receives this
packet of information,
realizes, oh, someone is
searching for cats again.
What does Google actually do in
order to respond to that request?
Well, it turns out that Google too is
going to use a whole bunch of packets.
And whereas previously, it was
their address in the TO field
and my address in the
FROM field, now they're
just going to simply reverse this
so that the TO field now is to me,
the FROM field is from Google.
And inside of this envelope is going
to be their various search results.
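The envelope metaphor maps onto a small data structure. In this Python sketch the Packet class, the reply_to function and the laptop's address are all assumptions made for illustration; real IP packets carry many more fields.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Packet:
    source: str        # the FROM address on the envelope
    destination: str   # the TO address on the envelope
    payload: str       # the letter inside


def reply_to(request, payload):
    """Build a response by swapping the TO and FROM addresses."""
    return Packet(
        source=request.destination,
        destination=request.source,
        payload=payload,
    )


MY_ADDRESS = "140.247.0.10"   # hypothetical address for the laptop
request = Packet(MY_ADDRESS, "172.217.4.36", "search for cats")
response = reply_to(request, "one picture of a cat")
```

The response is addressed to the laptop and comes from the address the request was sent to, which is all that "reversing the envelope" amounts to.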
Now turns out we found one
such search result here.
So suppose Google has decided to
send me back this search result.
Maybe I was feeling lucky
and clicked that button.
So I just get back one result. They're
going to put the cat into the envelope.
But sometimes, the data is pretty big.
Sometimes this image might be kilobytes,
megabytes, or if it's a video file,
could be gigabytes large.
And it would be kind of
rude if Google, in order
to send me a really big response,
shoved a really big piece of information
in its packet and then clogged the
internet's so-called tubes on their way
back to my laptop, thereby preventing
anyone else from talking to Google
or nearby websites at
that same moment in time.
So indeed, what Google
and what many websites do
is they leverage a feature of
IP, and its sister protocol
TCP that lets us fragment this.
And indeed, they will take this
perfectly nice picture of a cat,
and they will fragment it, thanks to IP,
into maybe four different pieces, each
of which is smaller than the original.
And inside of this envelope
then goes one piece at a time.
And so if I put one such
piece in this first envelope.
I can then much more efficiently
clearly proceed to transmit this.
And then if I do the same
with a second and a third
and maybe a fourth envelope, now
Google can respond with one, two, three
and maybe more packets of information
that make their way on the internet,
not even necessarily
following the same path.
In fact, there's no
guarantee that A to B
is going to be the same route as B to
A. Things change dynamically over time.
But Google's going to have
to include a little bit
more information on this envelope.
It's not sufficient anymore
just to send me four envelopes.
What else had they probably
best do so that I can actually
see my cat when it gets back to me?
I've got to know how many
packets they sent me,
and I need to know in what order.
So it turns out that what
Google is probably going to do
is something like this, write on this
envelope the number of the packet
and really how many there are.
And this is a bit of a
white lie, it's actually
done a little differently
thanks to some other fields
that are inside of this envelope.
But we can think of it
really as 1/4, 2/4, 3/4, 4/4,
so that if I only get
two of these envelopes
or three of these envelopes or four, I
now know definitively, wait a minute,
I only got 3/4 of my cat.
And moreover, the ones I did
get, I know the order in which
I can reassemble those packets.
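The "1/4, 2/4, 3/4, 4/4" labeling can be sketched in Python. As the lecture notes, real packets record this differently, so the fragment and reassemble functions below model the simplified picture rather than the actual protocol fields; the byte string is a stand-in for the cat picture.

```python
import math


def fragment(data, size):
    """Split data into (number, total, chunk) pieces of at most size bytes."""
    total = math.ceil(len(data) / size)
    return [
        (i + 1, total, data[i * size:(i + 1) * size])
        for i in range(total)
    ]


def reassemble(fragments):
    """Rebuild the original data from fragments received in any order."""
    if not fragments:
        raise ValueError("no fragments received")
    total = fragments[0][1]
    by_number = {number: chunk for number, _, chunk in fragments}
    missing = [n for n in range(1, total + 1) if n not in by_number]
    if missing:
        raise LookupError(f"missing fragments: {missing}")
    return b"".join(by_number[n] for n in range(1, total + 1))


picture = b"meowmeowmeowmeow"            # 16 bytes standing in for an image
envelopes = fragment(picture, 4)          # four envelopes of 4 bytes each
arrived = list(reversed(envelopes))       # they need not arrive in order
rebuilt = reassemble(arrived)
```

Because every envelope carries its number and the total, the receiver can put the pieces back in order, and it can tell when it has only 3/4 of the cat: calling reassemble with one envelope missing raises LookupError naming the gap.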
Now, I mentioned this
other protocol, TCP,
that indeed often works
in conjunction with IP.
And you can think of IP as giving
you features like addressing, assigning
every computer in the world a
unique address, and fragmentation,
being able to chop things up.
But TCP further allows us to associate
sequence numbers with packets
that allow me, the receiver,
to know, wait a minute,
I'm missing one or more packets.
So TCP is often said to guarantee
delivery. So long as your Mac or
your PC or your computer
supports it, which
they all do these days,
if it determines, hey, wait a
minute, I'm missing this packet,
TCP is the protocol, the set of
conventions, that says, Google,
I need this packet again
or these packets again,
and they will be retransmitted.
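
The spirit of that "I need these packets again" conversation can be sketched as a loop in Python. This is a toy model under stated assumptions: real TCP numbers bytes and uses acknowledgements and timers, and both receive_all and the send callback here are hypothetical names, with send standing in for the network.

```python
def receive_all(total: int, send) -> int:
    """Keep asking for missing packets until all `total` of them have arrived.

    `send(numbers)` is a hypothetical stand-in for the sender plus the network:
    it is asked to transmit the given packet numbers and returns the ones that
    actually made it. Returns how many rounds of requests were needed.
    """
    received = set()
    rounds = 0
    while len(received) < total:
        missing = [n for n in range(1, total + 1) if n not in received]
        received.update(send(missing))  # ask again only for what is missing
        rounds += 1
    return rounds


def perfect_network(numbers):
    """A network that never loses anything, so one round is enough."""
    return list(numbers)


rounds = receive_all(4, perfect_network)
```

Each extra round is the latency cost mentioned in the lecture: the receiver waits for the rest of the cat instead of forging ahead.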
Now, you pay a price in
terms of performance,
because now you might have to
wait for the rest of the cat.
So there might be a bit of a latency
in order to get back that response.
And that might not always be desirable.
And indeed, I can think
of some scenarios,
like if you're watching a baseball
game on TV or soccer or football
where you're watching a live stream--
or maybe it's the Oscars or the Emmys,
or something live, where you really want
to stay in sync with that broadcast,
even if sometimes there's
network issues or there's
buffering-- you don't
necessarily want it to buffer.
You don't necessarily want lost
information to be retransmitted.
You'd rather just lose a
few seconds of the show
so that at least you're staying
current, especially if you're there
with a bunch of other people and
it would be just silly if you
gradually over time drift out of date.
And so the rest of the world is
finished watching the show or the game,
and you're still chugging along.
So as an alternative to TCP, there's
other protocols, one of which
is called UDP that's very
often used for live streaming
and for video and applications like
that, where you really just want
the software to forge ahead,
rather than wait for some new data
to get transmitted.
But there's other things we
can do with the internet.
And indeed, there's lots of
things we ourselves do every day.
It's not just the web, like in
downloading cats from Google.
But there's email, and there's
Skype, and Facebook Messenger,
and any number of other services.
So how in the world does a computer
upon receiving a packet of information
know if it is an email or if it is
a web page, or put more concretely,
how do I know if I should show this user
this cat in his or her email program
or in his or her browser,
which might be the same?
In other words, how do I distinguish
between one type of program
running on the internet from another?
Well, turns out that TCP also provides
a standardization of services.
And that is just a fancy way of
saying that in addition to saying
on this envelope to whom it is and
what number it is and from whom it is,
I also need to uniquely
identify the type of service
whose information is in that packet.
And I do this just by writing a number.
And I typically write
one of these numbers.
80 if that packet is meant
to be web information.
So HTTP is the string that most of
us type most every day-- or at least
see these days, even though our
browsers generally fill it in
if we don't explicitly type it.
It turns out that the
world decided years ago
that if you want to send
information from yourself
to a web server like
Google to request cats,
you had better write the number
80 in the TO field in addition
to Google's IP address.
This way, Google knows it's not
an email destined for Gmail,
knows it's not a message destined
for Google Hangouts or the like.
Google servers can
actually distinguish this
as an HTTP request or web request
from any number of other services.
If you're using encryption,
HTTPS, that special number
that the world standardized on is 443.
You rarely see this, but
it's on the envelopes
that your Macs or PCs are actually
sending to Google servers.
Meanwhile, there's other
port numbers, so to speak.
If you've ever heard of
FTP, file transfer protocol.
This is software that's
not recommended anymore,
because it's completely unencrypted.
But it's still unfortunately
popular in some applications
or with some less
expensive web services.
21 is the number that
identifies that service.
And that just means
inside of this packet
is information related to transferring
files, not a web page per se.
22, SSH, Secure Shell.
This is a very popular protocol,
at least among computer scientists
and others, that allows you to
run commands from your Mac or PC
on a remote server, but
in an encrypted way.
And those kinds of packets
contain the number 22.
SMTP-- Simple Mail Transfer
Protocol-- is what's generally
used for outbound email.
So if you send an email, your
envelopes have 25 on them.
And then lastly, DNS
is again that service
that converts host names to
IP addresses and vice versa.
So when your Mac or PC asks
the world, hey, wait a minute,
what is the IP address
for www.google.com?
That envelope has the
number 53 on the outside.
And dot dot dot, there's dozens or
even hundreds of these other things,
for Skype and for Google
Hangouts and the like.
But these here are just
some of the most common.
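
As a quick illustration, the port numbers just listed can be written down as a lookup table in Python. The table contains only the six ports mentioned here, and the names WELL_KNOWN_PORTS and service_for are assumptions made for this sketch.

```python
# Port numbers mentioned in the lecture, mapped to the service they identify.
WELL_KNOWN_PORTS = {
    21: "FTP",
    22: "SSH",
    25: "SMTP",
    53: "DNS",
    80: "HTTP",
    443: "HTTPS",
}


def service_for(port: int) -> str:
    """Return the service name for a port, or "unknown" if it is not in the table."""
    return WELL_KNOWN_PORTS.get(port, "unknown")


print(service_for(80))
print(service_for(443))
```

A receiving computer does essentially this kind of lookup, in effect, when it decides which program should be handed an incoming packet.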
So the envelope, at
the end of the day, has
a decent amount of information on it.
The TO address, the FROM address,
and that TO address furthermore
has a port number associated with it.
And then, if it's been
fragmented especially,
there's got to be some
kind of number that
identifies the packet itself so that
you can detect if something is missing.
But there's kind of a side
effect, or really a feature
of having this level of detail
on each of these envelopes.
You've probably heard of a firewall.
Maybe not in the real world.
In the real world, a
firewall is literally
a wall that's meant to
block fire, typically
in like strip malls and
offices or stores that
are next to each other physically.
A firewall is meant to
keep a fire that breaks out
in one store from traveling into another
store, creating even more damage.
But in the software world, a
firewall is a piece of software
that really keeps packets out
that you don't want coming in,
or keeps packets in that
you don't want going out.
So a firewall might be used
by parents to prevent kids
from accessing Facebook
or Google, or silly things
during the day for instance, if they
want them focusing on other things.
It might be used by
universities or corporations
to block access to certain
websites that you simply
don't want your students or
your staff actually accessing.
It might be used to keep
corporate data inside,
so that nothing accidentally leaks
out-- financial information, or emails,
or the like.
You can use a firewall to
block outbound access as well.
But this invites the question then,
how is a firewall implemented?
Well, it's not all that hard, really.
Because if the internet is just a
whole bunch of these packets flying
back and forth between computers,
between routers, leaving and entering
our own network, whether that's my
home or my campus or my company,
I could just have my
routers, for instance,
look at every one of those
envelopes, look at the TO address,
maybe look at the FROM address, and
just blacklist certain addresses.
Indeed, if I know that I don't want
my employees accessing Facebook,
I could, for instance, just say to
my routers, configure my routers,
do not allow any data going to
or from IP address 31.13.80.36.
Now, it might be easier said than done,
because in reality, Facebook probably
has multiple IP addresses.
So we might have to grow this list
or dig a little deeper in order
to block them.
And better yet, we could potentially
look inside of the envelopes themselves
to see, is this a Facebook packet?
But if they're using encryption,
which they do by default
these days, that might
not really be feasible.
So we can have kind of
a heavy-handed solution
there, and just block everything
we think is Facebook.com.
But certainly, things might leak out
potentially over time if things change.
But what else could we do?
Suppose that I really don't want
people Skyping during the day,
or I don't want people
using Facebook Messenger,
or some software that has its
own unique TCP port number
that some company or the
world has standardized on.
You could block all outbound email by
just blocking port 25, it would seem,
or a few other ports that are popular.
You could block all web
access by blocking 80 and 443.
You could block all DNS
traffic, if you really want.
And indeed, a lot of
companies do this, especially
like Starbucks kind of places, internet
cafes in airports and the like.
Sometimes they only want
you using their DNS server,
not your own company's
or your own home's.
And so they can block access to any
DNS server other than their own.
This is unfortunately often
or sometimes for advertising
reasons, so that they can actually
keep track of what you're accessing
and where and why-- or where, at least.
But it's all possible technologically
with this underneath the hood.
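
Here is a rough sketch in Python of the kind of check such a firewall could make on each envelope. It assumes a very simplified packet with only a FROM address, a TO address, and a TO port; the Packet class, the two block lists, and the allowed function are all made up for illustration, and real firewalls are configured quite differently.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Packet:
    """A simplified envelope: who it is from, who it is to, and the TO port."""
    src: str
    dst: str
    dst_port: int


# Hypothetical policy: block the one address mentioned in the lecture
# and block outbound email on port 25.
BLOCKED_ADDRESSES = {"31.13.80.36"}
BLOCKED_PORTS = {25}


def allowed(packet: Packet) -> bool:
    """Return True if the packet may pass, False if the firewall drops it."""
    if packet.src in BLOCKED_ADDRESSES or packet.dst in BLOCKED_ADDRESSES:
        return False
    if packet.dst_port in BLOCKED_PORTS:
        return False
    return True


web = Packet(src="192.0.2.10", dst="198.51.100.7", dst_port=443)
mail = Packet(src="192.0.2.10", dst="198.51.100.7", dst_port=25)
print(allowed(web), allowed(mail))
```

The addresses beginning 192.0.2 and 198.51.100 are reserved for documentation and stand in for real computers here. As noted above, a list of addresses like this would need to grow over time to keep up with a site that has many IP addresses.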
So what are some of
the defenses in place,
especially when you want to visit some
site that isn't necessarily encrypted?
Or maybe you want to visit
some site that is blocked,
and you want to simply be able to work
around this, because you're traveling
or you need to be able to
access something privately
at your home or your work.
Well it turns out, that there are
services called VPNs or Virtual Private
Networks.
And Harvard has one
VPN at vpn.harvard.edu.
And Yale has one as
well at access.yale.edu.
And this is simply software that
you generally download to your phone
or your computer that allows you to
connect via some protocol and some port
to your company or to your home's
network, but in an encrypted way.
So a VPN gives you an
encrypted tunnel, so to speak,
so that you are connected
to the internet.
That's a precondition.
You have to get on the internet itself.
But then you configure
your Mac or PC to route
all-- in theory-- of your
internet traffic through the VPN.
So even if I'm just visiting Gmail
or Facebook or whatever on my Mac,
if I'm connected to Harvard's
VPN, all of that traffic by design
is going through Harvard.edu
first, and then it's
going out to Facebook or Google
or wherever it's destined.
Similarly, if I'm traveling
in a foreign country that
happens to block a lot of internet
access, if they do allow VPN access,
I can, in my hotel room or wherever,
connect to Harvard or to Yale, route
all of my internet traffic through
Harvard or Yale, and then from Yale
or Harvard to wherever
I'm going on the internet.
And the upside of this is
that it's entirely encrypted,
which means no one at that
company or that country in theory
knows what data is going
through the tunnel.
But it also potentially costs
me a good amount of time.
We've seen that we're really
only talking milliseconds,
but hundreds of milliseconds
can certainly add up.
So if I'm abroad, for instance, trying
to connect to some website that's
going from that country to Harvard,
to the destination, back to Harvard,
back to the country I'm in, your
internet connectivity might be slower,
but at least it's not
actually permanently blocked.
So if you've ever heard of friends
of yours actually accessing services
like Netflix or Hulu, that
for licensing reasons,
do restrict you typically
to being in this country--
this is why you might have read
that Hulu and Netflix and others are
cracking down on people
using VPNs, whether it's
Harvard's or Yale's or
a third-party company's,
so as to circumvent those
licensing restrictions.
But technologically, all
it's doing is giving you
an encrypted tunnel
between you and someone
you have an affiliation
with, like Harvard or Yale,
and encrypting all of your
traffic in between there,
and routing all of your
traffic through it.
So with that said, we've looked
at DNS, and we've looked at DHCP,
and we've looked at routers.
And there's other hardware
still, whether it's
in your home or campus
or office, there's
things like switches, which are fairly
simple devices that just have lots
of ethernet jacks, so to speak, that
you can plug physical cables into,
and those cables can
then intercommunicate,
so that you can wire
computers together en masse.
There are things called
access points or APs.
Those are the things around campus
that have the little bunny ear
antennas that are often blinking.
Those are the wireless access points.
And access points often have firewalls,
often have routing software built in.
So the line is increasingly blurry these
days as to what these small devices do.
So it really is the
services that matter.
And indeed, while a
little dated, I thought
it would be fun to take a look now
at a longer form version of the 60
second trailer of
Warriors of the Net that
was made a few years ago to paint
a more visual picture of how
the internet works.
It definitely takes some liberties
with shall we say accuracy.
But it also helps
paint a picture of what
really is going on underneath the hood.
So let's take a look at the internet.
[MUSIC PLAYING]
[VIDEO PLAYBACK]
-For the first time in
history, people and machinery
are working together, realizing a dream.
A uniting force that knows no
geographical boundaries, without
regard to race, creed, or color.
A new era, where communication
truly brings people together.
This is the Dawn of the Net.
Want to know how it works?
Click here to begin your
journey into the net.
Now exactly what happened
when you clicked on that link?
You started a flow of information.
This information travels down
into your own personal mail
room, where Mr. IP packages it,
labels it, and sends it on its way.
Each packet is limited in its size.
The mailroom must decide how to divide
the information, and how to package it.
Now the package needs a label,
containing important information,
such as sender's address, receiver's
address, and the type of packet it is.
Because this particular packet
is going out on to the internet,
it also gets an address for
the proxy server, which has
a special function, as we'll see later.
The packet is now launched onto
your Local Area Network, or LAN.
This network is used to connect
all the local computers, routers,
printers, et cetera for
information exchange
within the physical
walls of the building.
The LAN is a pretty uncontrolled
place, and unfortunately, accidents
can happen.
The highway of the LAN is
packed with all types of information.
These are IP packets,
Novell packets, AppleTalk packets.
They're going against traffic, as usual.
The local router reads the
address, and if necessary,
lifts the packet onto another network.
Ah, the router.
A symbol of control in a
seemingly disorganized world.
[METHODICAL MUTTERING]
There he is, systematic, uncaring,
methodical, conservative,
and sometimes not quite up to speed.
But at least he is
exact, for the most part.
As the packets leave the
router, they make their way
into the corporate intranet
and head for the router switch.
A bit more efficient than
the router, the router switch
plays fast and loose with IP packets,
deftly routing them along the way.
A digital pinball wizard, if you will.
[ERRATIC MUTTERING]
As packets arrive at
their destination, they're
picked up by the network interface,
ready to be sent to the next level.
In this case, the proxy.
The proxy is used by many
companies as sort of a middleman
in order to lessen the load
on their internet connection,
and for security reasons as well.
As you can see, the packets
are all of various sizes,
depending on their content.
The proxy opens the packet and
looks for the web address or URL.
Depending upon whether
the address is acceptable,
the packet is sent on to the internet.
There are, however, some
addresses which do not
meet with the approval of the proxy.
That is to say, corporate
or management guidelines.
These are summarily dealt with.
We'll have none of that.
For those who make it,
it's on the road again.
Next up, the firewall.
The corporate firewall
serves two purposes.
It prevents some rather nasty
things from the internet
from coming into the
intranet, and it can also
prevent sensitive corporate information
from being sent out onto the internet.
Once through the firewall, a
router picks up the packet,
and places it onto a much narrower
road, or bandwidth, as we say.
Obviously, the road is not
broad enough to take them all.
Now, you might wonder what
happens to all those packets
which don't make it along the way.
Well, when Mr. IP doesn't
receive an acknowledgement
that a packet has been
received in due time,
he simply sends a replacement packet.
We are now ready to enter the
world of the internet, a spider
web of interconnected networks
which span our entire globe.
Here, routers and switches
establish links between networks.
Now, the net is an entirely
different environment
than you'll find within the
protective walls of your LAN.
Out here, it's the Wild West.
Plenty of space, plenty
of opportunities,
plenty of things to
explore and places to go.
Thanks to very little
control and regulation,
new ideas find fertile soil to push
the envelope of their possibilities.
But because of this freedom,
certain dangers also lurk.
You'll never know when you'll
meet the dreaded ping of death.
A special version of a normal request
ping, which some idiot thought up
to mess up unsuspecting hosts.
The path our packets take may be via
satellite, telephone lines, wireless,
or even transoceanic cable.
They don't always take the fastest
or shortest routes possible,
but they will get there eventually.
Maybe that's why it's sometimes
called the world wide wait.
But when everything is
working smoothly, you
can circumvent the globe five times
over at the drop of a hat, literally.
And all for the cost of
a local call or less.
Near the end of our destination,
we'll find another firewall.
Depending upon your
perspective as a data packet,
the firewall could be a bastion of
security or a dreaded adversary.
It all depends on which side you're
on and what your intentions are.
The firewall is designed to let in only
those packets that meet its criteria.
This firewall is operating
on ports 80 and 25.
All attempts to enter through other
ports are closed for business.
Port 25 is used for mail packets, while
port 80 is the entrance for packets
from the internet to the web server.
Inside the firewall, packets
are screened more thoroughly.
Some packets make it
easily through customs,
while others look just a bit dubious.
The firewall officer
is not easily fooled,
such as when this ping of death
packet tries to disguise itself
as a normal ping packet.
For those packets lucky
enough to make it this far,
the journey is almost over.
It's just a line up on the interface
to be taken up into the web server.
Nowadays, a web server
can run on many things,
from a mainframe to a webcam
to the computer on your desk.
Why not your refrigerator?
With a proper set up,
you can find out if you
have the makings for chicken cacciatore
or if you have to go shopping.
Remember, this is the dawn of the net.
Almost anything's possible.
One by one, the packets are
received, opened, and unpacked.
The information they contain, that
is, your request for information,
is sent on to the web
server application.
The packet itself is recycled,
ready to be used again, and filled
with your requested information,
addressed, and sent out on its way
back to you, back past
the firewall, routers,
and on through to the internet, back
through your corporate firewall,
and onto your interface, ready to supply
your web browser with the information
you requested, that is, this film.
Pleased with their efforts,
and trusting in a better world,
our trusty data packets
ride off blissfully
into the sunset of another
day, knowing fully they
have served their masters well.
Now isn't that a happy ending?
[END PLAYBACK]
DAVID MALAN: All right, so
that is how the internet works.
And as has been our tendency
over the past few weeks,
now that we know how we can get
data from point A to point B,
we can abstract above
that, and just take
for granted now that we can move
data from point A to point B
and start moving the actual data.
So that invites the question now
of what is inside this envelope.
When I get a response back from Google
containing a whole bunch of cats,
or when I get back my news feed from
Facebook, or my inbox from Google.
Well, inside of these
packets quite often
are messages that conform to HTTP,
the Hypertext Transfer Protocol.
So this is just one of those
services that we alluded to earlier.
Among them also were SSH, and
DNS, and SMTP, and yet others.
But HTTP is perhaps by far
the most common one in so far
as we use the web so much these days.
So inside of HTTP,
there are certain types
of messages, messages that conform
to certain patterns by which we
get information.
Now, what is the P in HTTP?
HTTP, Hypertext Transfer Protocol.
Well, let me borrow Arthuro over here.
And we have this silly
human convention of course
that when you meet someone for the
first time or the first time in a while,
you say, oh, hi, my name is David.
Nice to meet you, Arthuro.
And we exchange hands.
And when I put out my hand,
Arthuro knows to put out his hand.
And then we do this silly handshake.
Why is that?
Well, it's just a protocol.
It's a convention.
It's a set of conventions that
we humans for better or for worse
have adopted by which
we greet each other.
Similarly do computers have
protocols via which they communicate,
and sets of conventions that
govern how you start to communicate
and how you finish communicating.
So what do those messages
actually look like?
The simplest of them is
quite literally this verb
here, get, whereby inside
of this envelope, when
I'm requesting information of
Google for the first time--
and indeed, I put that
message before, search
for cats-- that actually has a certain
message at the top of it, really,
that is literally get.
There's a little more information, but
at the end of the day, it just is get.
Specifically, these are
the first couple of lines
inside of any request that my
browser makes of a web server,
like in this case, harvard.edu.
If I want to get the default home
page of Harvard, I literally,
inside of my envelope, write this
message-- GET / HTTP/1.1,
which is the latest version
of HTTP that people use.
Then below that, I specify the host
that I want to talk to, just in case
Harvard or Google or whoever has
multiple domain names physically
running on the same
servers, which is possible.
So I say Host: www.harvard.edu.
And then maybe there's some other text.
But this first line or two
is really the most important.
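
Written out exactly, those first lines of the request look like what this small Python sketch builds. The function name build_get_request is made up, and the Connection: close line is an addition not mentioned in the lecture; it simply asks the server to hang up after responding.

```python
def build_get_request(host: str, path: str = "/") -> bytes:
    """Build the text of a minimal HTTP/1.1 GET request for `path` on `host`."""
    lines = [
        f"GET {path} HTTP/1.1",  # the verb, what we want, and the protocol version
        f"Host: {host}",         # which site, in case one server hosts several
        "Connection: close",     # assumption: not from the lecture
        "",                      # a blank line marks the end of the headers
        "",
    ]
    # HTTP separates lines with a carriage return plus a newline.
    return "\r\n".join(lines).encode("ascii")


print(build_get_request("www.harvard.edu").decode("ascii"))
```

These bytes are what would travel inside the envelope addressed to port 80 of the web server.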
And then what comes
back from the server,
whether it's being sent to
Harvard or being sent to Yale,
is a response that hopefully
says, literally, OK, inside
of which is the cat or inside
of which is the inbox for Gmail
or inside of which is my
news feed from Facebook.
All of which typically
are in this language here,
HTML-- HyperText Markup Language.
So whereas HTTP is a protocol,
like a sort of handshake agreement
that governs that when I want to
request information of a server,
I should say GET and
then a few other words,
and then the server should respond
with OK and a few other words,
HTML is the language
in which the actual web
pages that are coming back from
Google or Facebook or Harvard or Yale
are actually written in.
It's not a programming
language like C or Scratch.
It's a markup language,
as we'll see, that
really controls formatting and layout.
There aren't ifs and loops and
other such constructs.
But that's what's below the dot dot
dot when the response comes back
from Harvard or Yale or
Google is this language HTML.
Now, 200 is a status code, so to speak,
that we almost never actually see
from a server.
But odds are, some of you have seen at
least one of these status codes before.
And perhaps the most
obvious or the most familiar
is probably this one here, when
you've requested some web page,
and either it doesn't exist anymore
or you have a typo more commonly
or the URL is broken for some reason.
Odds are you have literally
seen the status code 404,
because the server is
just showing it to you.
But at a lower level,
these numbers are actually
typically sent in these
packets of information
back and forth from the server to me.
But we'll see before long that you
can use status codes like 301 and 302
to induce redirects, so to speak.
If you want to send the user
from one URL to another-- maybe
the domain name is changed--
you can do that there.
For efficiency, a server
can say 304, not modified.
As in, you already
asked me for this page.
It hasn't been modified since
you asked me for it,
I'm not going to send
it to you again, thereby
saving a bit of time and bandwidth.
401 Unauthorized or 403
Forbidden generally means
that you don't have access
to the file for some reason.
And 500's actually pretty bad.
So we'll probably induce this ourselves
before long when we actually write
programs that run on a web server.
But 500 means there's generally
a problem in your code
that's supposed to be serving
up web content to browsers.
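
For reference, Python's standard library happens to know these status codes and their official phrases, so a short sketch can print the ones just mentioned. The helper name describe is made up for this example.

```python
from http import HTTPStatus


def describe(code: int) -> str:
    """Return a status code followed by its standard reason phrase."""
    status = HTTPStatus(code)
    return f"{status.value} {status.phrase}"


# The status codes discussed in the lecture.
for code in (200, 301, 302, 304, 401, 403, 404, 500):
    print(describe(code))
```

Roughly speaking, codes in the 200s mean success, the 300s involve redirection, the 400s mean the request was at fault, and the 500s mean the server was.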
So let's actually see
these kinds of things too.
It turns out that I can pretend to
be a browser at my command line here.
In fact, I can use a
program called Telnet,
which is an older program, similar
in spirit to something called SSH,
which I mentioned earlier,
but it's not encrypted.
But it allows me to connect to a remote
server specifically on a certain port.
So I for instance, can connect
to harvard.edu and on port 80
specifically.
I could actually with textual commands
send emails to Harvard in this way,
or send chat messages
if they support that.
But for now, we're focusing only
on HTTP, the unencrypted version.
And if I go ahead and
hit enter, you'll see
that I'm connected to
www.harvard.edu.cdn.cloudflare.net,
which is curious.
But it turns out-- and we could see this
if we poked around with nslookup again.
It turns out that Harvard
is also outsourcing its home
page to a third party CDN-- Content
Delivery Network-- called Cloudflare,
so Harvard's servers
really live elsewhere.
And now I talked too long and the
connection got automatically closed.
So let me go ahead and redo this, and
just pretend to be a browser by typing
GET / HTTP/1.1, Host: www.harvard.edu,
and then Enter twice.
And it flew across the screen, but
let me scroll back up to the top.
This is-- even though it might
look cryptic to you at the moment
if you've never made web pages before--
this is this language called HTML.
And it's quite a lot of HTML, so let me
keep scrolling up and up and up and up.
Until hopefully if we go up high
enough-- oh, I've exceeded my buffer.
So I'm going to do this differently.
I'm going to go ahead
and-- you might recall
from a past problem, where you can
actually redirect the output to a file.
So I'm going to go ahead and save
this in a file called output.txt.
GET / HTTP/1.1, Host:
www.harvard.edu, Enter, Enter.
And now I'm going to go ahead and
open this file, which is here.
And you can see that what
just happened was this.
The server responded with
200 OK, which is great.
And then the date of the
server in Greenwich Mean Time.
And then a bunch of information.
Cookies, we'll come back
to these before long.
But those will be germane
to when we actually
write our own software for the web.
Drupal, seems that
Harvard's website is using
Drupal, a popular content
management software for websites.
And then there's some
other stuff about caching
and when the site expires and so forth.
This is a little strange.
Harvard's website
apparently expired in 1978.
But more on that another time.
And so there's some
interesting HTTP headers
besides things like the host field
that we sent and the GET and the OK
that I mentioned earlier as well.
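
To see how a program might pick those headers apart, here is a minimal Python sketch that splits the top of a response into its status line and its headers. The sample text is made up for illustration and is not the actual output shown in the lecture, and parse_response_head is a hypothetical name.

```python
def parse_response_head(raw: str):
    """Split a raw HTTP response into (status code, reason phrase, headers)."""
    head = raw.split("\r\n\r\n", 1)[0]               # everything above the blank line
    status_line, *header_lines = head.split("\r\n")
    _version, code, reason = status_line.split(" ", 2)
    headers = {}
    for line in header_lines:
        name, _, value = line.partition(":")          # a word, a colon, then a value
        headers[name.strip().lower()] = value.strip()
    return int(code), reason, headers


# Made-up sample in the same shape as a real response.
SAMPLE = (
    "HTTP/1.1 200 OK\r\n"
    "Date: Mon, 01 Jan 2018 00:00:00 GMT\r\n"
    "Content-Type: text/html; charset=utf-8\r\n"
    "Expires: Sun, 19 Nov 1978 05:00:00 GMT\r\n"
    "\r\n"
    "<!DOCTYPE html>..."
)

code, reason, headers = parse_response_head(SAMPLE)
print(code, reason, headers["expires"])
```

Header names are stored in lowercase here because HTTP treats them as case-insensitive.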
Now, Telnet is not a very
user-friendly way to do this.
I'm going to actually redo this
with a different command, Curl,
whereby I can do a curl -I,
and I'm going to then do
the full URL-- www.harvard.edu, Enter.
And now what's nice with curl
is that I don't actually see the HTML.
I only see in this case the HTTP
headers, which are still quite a few,
but we can now at least see
them a little more readily.
In fact, let me go and do
the same now for yale.edu,
and see if we can glean any
differences in their servers.
There we go here.
So the headers that are coming back for
Yale are these that I've highlighted.
And it looks too that there's
some interesting stuff going on.
It seems that Yale also uses Drupal.
So it seems that both universities
are doing something rather familiar.
But most of this information
is not all that useful.
But it is useful if maybe we do this.
What if we visit, for
instance-- why don't we
go to HTTP-- how about we go to
reference.cs50.net, which you might
use as an alternative to man pages.
And this is a little curious.
It moved permanently.
This is not 200 OK.
Moved permanently.
Where did it go?
Well, wait a minute, let me go
ahead and highlight that URL.
And let me go ahead in
another tab and just go there.
OK, it's there.
So where did it move to?
And in fact, if I look
at the domain again,
it is indeed there, but notice this.
Almost all of CS50's websites
actually run not over HTTP per se
but HTTPS, where the S
means secure, whereby
all of our websites for the
most part are encrypted.
But that's not what I typed.
I just went to
http://reference.cs50.net.
And yet when I do that with this
command line interface, which
mimics the behavior of a browser, if I
visit HTTP, I'm told by CS50's server,
moved permanently, status code 301.
But notice this one other header
that's kind of interesting-- location.
This location header--
and a header to be clear
is just a word, a
colon, and then a value.
This header specifies where we move to.
So this seems to be a
mechanism whereby using HTTP
headers-- sort of messages
inside the envelope
that the human doesn't really see, but
that the browser does understand.
This seems to be a way
that we can forcibly
redirect all users from the
insecure version of our website
to the secure version, so that
thereafter, all of the information
is secure.
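
What a browser does with that status code and that Location header can be sketched in a few lines of Python. This is a simplification that handles only the 301 and 302 codes mentioned earlier, and redirect_target is a made-up name.

```python
def redirect_target(status: int, headers: dict[str, str]):
    """Return the URL to visit next if the response is a redirect, else None."""
    if status in (301, 302):
        for name, value in headers.items():
            if name.lower() == "location":  # header names are case-insensitive
                return value
    return None


# The kind of response described above: the insecure URL points at the secure one.
print(redirect_target(301, {"Location": "https://reference.cs50.net/"}))
```

A browser that gets a URL back from a check like this simply makes a second request to that URL, which is why the human never notices the hop.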
And frankly, there's not all that much
private information going on there.
But if you don't really want
the whole world or the NSA
or Harvard or Yale knowing
what pages, what functions you
need to look up on reference.cs50.net,
by forcing everything
to HTTPS, in theory, everything
is perfectly secure now so
that only you know what
pages you're visiting.
And we, since we run the server.
But no one in between.
And indeed, that's one of the biggest
values of using HTTPS-based URLs,
so that even if there is
some man in the middle,
so to speak, a bad guy, an adversary
between you and that remote server,
whether it's here on campus
or in Starbucks or the airport
or some random adversary on the
internet, he or she in theory
should not be able to see
anything between points A and B
if you are, one, as before, using a
VPN between those points, or two,
using a protocol like HTTPS that by
design is encrypting information.
And suffice it to say the encryption
is far fancier than Caesar or Vigenère.
But it is indeed
similar in spirit, where
those zeros and ones going back
and forth are scrambled in some way
that only you and the point B server can
actually decode them or decrypt them.
So let's visit an actual
website now, Google.
But before we do that, let's turn
off some of the more modern features
by going to Settings,
going to Search Settings,
and turn off so-called instant results.
Because for our purposes
today, instant results
use a technology or
language called JavaScript,
which we'll get to in a few
weeks' time, but for now it's
just going to be a distraction
from the underlying HTTP feature.
So I'm going to go ahead and
indeed never show instant results.
So that now when I search for something
like cats on google.com and hit Enter,
I'm going to find myself at a fairly
long URL, indeed this URL here.
And I have no idea what most
of this URL means, not knowing
how Google works underneath the hood.
But I'm looking for
some familiar patterns.
And indeed, if I pretty much a little
ignorantly but hopefully cleverly just
delete anything I don't understand,
I'm going to deliberately leave myself
with just the essence of this URL.
So notice, I didn't type this URL.
I ended up at this URL after I typed in
cats to that search box and hit Enter.
Now I found myself in a
really long URL and then
I just started deleting things
I didn't understand to distill
this URL into quite simply this.
https://www.google.com/search?q=cats.
Well, it turns out that
much like in the world of C,
you have functions from CS50
like get_string and get_int,
or if you implement them yourself,
scanf or other such functions
whereby you can get user input.
It's less obvious at first glance how
a web server can get input from a user.
Because there is no-- well,
rather, you can see the search
box that I typed into,
but until I hit Enter,
the server doesn't see that
information necessarily.
And that's a bit of a white lie,
because nowadays thanks to JavaScript
and thanks to autocomplete,
Google's actually
seeing every keystroke you type.
But in theory, when I hit
Enter, only when I hit Enter,
do they see the full word cats.
And how do they get access to it not
having physical access to my keyboard?
They see it in the URL here.
And so indeed HTTP, beyond
supporting status codes and the sort
of digital equivalent of
my handshake with Arthuro,
also supports input, specifically
input parameters that in this case
is arbitrarily but reasonably
called q, because back in the day,
Google decided that the default input
to its search page would be q for query.
And indeed, if I hit Enter now,
the results seem no different.
So for whatever reason, Google uses
by default a lot more parameters,
all of which I deleted.
But the only necessary one is q equals cats.
And notice even without changing
the page, I can go up in here
and change my cats to
dogs and hit Enter.
And now notice I've searched for dogs
just as though I had typed this myself.
But indeed, the only thing I've been
changing up here is the keyword.
And if I search for mice now,
I'm changing the search result.
So it seems that the
essence of an HTTP request
boils down to what is sent here.
So let's try this as well.
Let me go ahead and copy that URL.
And just for good measure, I can
go ahead and do something like curl
and then paste this URL.
And let me go ahead and quote it,
just because it has a question
mark that could break things.
And hit Enter.
It's pretty overwhelming here,
but this is all of the HTML
that's coming back from google.com.
So when I see these search
results in google.com,
this web page is written in
this language called HTML.
And HTML, as we'll see, is a little
overwhelming perhaps at first glance,
but follows some very simple patterns.
And we can see them better in
browsers like Chrome as follows.
If you Control-Click or right click
on your web page, most any web page
if you're using Chrome,
you can choose Inspect.
And there's keyboard shortcuts
and other menu options
by which you can access this.
And notice the Elements
tab here that just popped up.
And notice now, again
a little overwhelming.
But what's nice about Chrome-- and
Edge can do this and Firefox and Safari
and others-- it can
pretty print your HTML.
Sort of like style50 can
sort of see through any messiness,
similarly can the browser
kind of look at the mess that
just came across the wire from
Google and format it as follows.
And indeed, it looks like this language
HTML follows a certain pattern.
There's always this at the top, open
bracket, exclamation point, doc type,
HTML, close bracket.
Then there's open bracket html in lower
case, then some other words and quotes
and equals signs perhaps.
Then a head, then a body.
Maybe some divs for
divisions of the page.
And even though this is quite a
lot, let's look at a simpler one
just for kicks real fast.
Let's go to harvard.edu and hit Enter.
And indeed-- well, actually, it
looks just about as complicated.
Here's the HTML that
composes harvard.edu.
So let's try to distill
this into its essence.
I showed a web page earlier.
Let's go back to that to
point out-- to be clear,
these were called query strings.
Let's come back to HTML.
So HTML is up to version 5 these days.
And this governs what syntax you
should use when writing HTML.
And here per the
earlier slide is perhaps
the simplest web page we can make.
So the key components-- and
there's others we can add
and others we will soon
add-- boil down to this.
This first line, this is so-called
document type declaration.
This is just a fancy way of saying,
you have to type this line first
in your file in order to
tell the browser that's
reading this file top to
bottom, left to right this web
page is written in version 5 of HTML.
Previous versions either didn't have
this or had longer versions of this.
It's just a globally understood
symbol that means version 5.
Then below that is
your actual HTML tags.
So web pages are composed of HTML
tags, or more properly, elements.
And most elements have an open
tag and a closed tag-- a start tag
and an end tag-- that are identical,
except for typically the slash.
So indeed, notice the symmetry.
This tag here, in so far as it's what
we'll call an open tag or start tag,
means hey browser, here comes
a web page written in HTML.
Hey browser, here comes
the head of the web page.
Hey browser, here comes
the title of the web page.
And there's no technical
reason I wrote this all on one
line instead of putting hello world
on its own line and this other tag
on its own line.
It just felt short enough to just
write in one line, so I went with it.
But notice that title is opened here.
Then there's literally some
hard coded text, hello world.
And then there is the opposite,
so to speak, of the tag.
It's the same word for
the tag, but with this forward
slash inside of the tag,
which closes or ends the tag
and sort of ends the
whole title element.
Meanwhile, that's it for the
head, at least in this example.
So hey browser, that's it for the head.
Oh hey, browser, here comes the body.
Hey browser, here's some actual text.
Hey browser, that's it for the body.
Hey browser, that's it for the web page.
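For reference, the page being walked through can be sketched roughly as follows. The slide itself isn't reproduced in this transcript, so the exact capitalization, punctuation, and indentation here are assumptions.

```html
<!DOCTYPE html>

<html>
    <head>
        <!-- open tag, hard-coded text, and close tag, all on one line -->
        <title>hello, world</title>
    </head>
    <body>
        hello, world
    </body>
</html>
```

Reading it top to bottom matches the narration: each open tag says "here comes" an element, and each tag with a forward slash says "that's it for" that element.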
So I've also by convention-- and
for stylistic purposes like in C--
indented things to be very pretty
printed, very readable to humans.
But the browser certainly doesn't care.
And indeed, we saw when we looked at
the mess that is Google's website,
it's just a big mess of
tags and markup so to speak.
But for Google, that
makes sense, because you
don't want to have to transmit
any characters unnecessarily.
Indeed, if you think about
it, if Google's website gets
visited by a billion
people per day, which
actually feels kind of reasonable.
And suppose that a programmer at Google
hits the space bar just one extra time
and saves Google's home page.
Well what's the implication
of Google having just one
additional space in their web page?
If that web page is
downloaded a billion times,
that's a billion extra ASCII characters
that get downloaded per day.
And a billion ASCII characters is a
billion bytes, which is one gigabyte.
So just by hitting the spacebar
can really big players like Google
cost themselves a huge amount
of space and maybe cost or time.
So that's why a lot of big websites
minify or compress their information,
whereas we will be a
little more lax here,
because it's more important
for now certainly that things
be readable and understandable.
But the white space does
not matter to the browser.
So let's actually do
something with this.
Keeping in mind the following, just
as this indentation kind of implies,
this really if you think
about it is a tree structure.
There's some document on the screen,
which I will literally call document,
because that's what browsers do.
The top element of which--
I'll draw it with a rectangle, to
distinguish it from the document
itself-- is the HTML element that
starts here and ends here.
And in so far as it starts here and ends
here, everything that's inside of it,
you can think of as
children in a family tree.
And the first child is
head, the second child
is body, left and right respectively.
The head tag meanwhile
has the title child,
and so that's why we see title here.
And then I'll draw it with an
ellipse, just different shape
because it's raw text.
It's not an actual tag.
And similarly does body
have some text below it.
So this is just a tree.
It's not a binary tree, although
it might be by coincidence here,
because there aren't many children.
But it's some kind of tree structure,
each of whose nodes has zero
or more children.
And indeed, underneath
the hood what is IE,
what is Edge or Firefox or
Chrome or Safari actually doing
when it downloads a web page like this?
Some programmer or programmers
have after taking classes like CS50
and knowing what these data structures
are implemented in code a tree that
represents that web page.
And indeed, once in a few
weeks we get to JavaScript
using yet another language
will you be able to manipulate
that tree in real time to change
the contents of a web page
and what a user is seeing.
Indeed, if you kind of
fast forward in your mind,
suppose that you do use something
like Facebook and Messenger
built into it for sending
messages to people or Gmail,
where you suddenly get new rows
of emails and your web page,
what's really happening?
Every time you get a
message in Facebook,
it's just as though this tree is
getting modified with like another child
somewhere in here.
Every time you get a
new email in Gmail, it's
like another node is
appearing in this tree.
So there really is this equivalence between
this markup language HTML and the tree
structures that we've just
come from in recent weeks.
So let's actually now
do something with this.
I'm going to go over
to CS50 IDE, and I'm
going to go ahead and make if you will
the simplest of web pages as follows.
I'm going to go ahead and
create a new file, a text file.
I'm going to call it hello.html.
And I'm going to go
ahead and populate this
with exactly what we saw a moment ago.
Doc type, HTML.
Open bracket, HTML.
And notice that CS50 IDE is
trying to be helpful here,
and when it notices you
typing something familiar,
it's going to try to finish
your thought for you.
So indeed, it did.
I'm going to go ahead and
open now the head of the page.
It's going to complete that thought.
I'm going to open the title
of the page, hello world.
And now I'm going to move my
cursor down here physically
to do body, close bracket,
hello comma world, save.
So I have written code.
It's source code, but it's code written
in HTML-- HyperText Markup Language.
And indeed, you see no loops
or conditions or functions.
There's no logic.
This is just markup.
Do this, stop doing this.
Do this, stop doing this.
It's fairly mundane.
But it's going to allow us to
actually visit this file in a browser.
Indeed, let me go into a browser
now and visit this page hello.html.
Incredibly underwhelming.
Indeed, this is a huge screen.
And all I've created is a web page
that says hello world up here.
And if I scrolled up, I
could actually see the tab
whose title is also hello world.
But that's my first web page.
And if I now apply a lesson learned,
if I go ahead and right click
or Control-Click Chrome's
backdrop and choose inspect,
now you'll notice finally
here's a simple web page,
and not all the messiness that
was Harvard's or Google's.
You can actually see your HTML.
You can't permanently
change the files here,
because you need to do that in
CS50 IDE and change the files.
And so here's where there's a
potential point of confusion.
CS50 IDE is of course
a cloud based service,
and it's where I'm writing
and saving my files.
And it just so happens
that built into CS50 IDE
is its own web server just
for serving students work.
So when I visit this web page here in another
tab, I'm visiting not CS50 IDE per se,
but the web server running
on a certain port on CS50 IDE
so I can serve up these web pages.
So let's go ahead and do something
a little more interesting than that.
Let me go ahead now and create
another file say as follows.
Let me go ahead and copy
this just for good measure
so I don't have to
recreate the whole thing.
And let me go ahead and create
a new file called image.html.
Paste this in here.
And instead of hello world, I'm just
going to write say image up here.
And how do I embed an image?
Well, turns out that there
is literally an image
tag-- img, to be succinct.
Indeed, you might want
to write this out.
But nope, back in the day people
decided that img is sufficient.
I'm going to go ahead
and give it a source.
What should the source of this be?
Well, let me just do a quick
search for like a grumpy cat.
And there's a good one.
So I'm going to go ahead
and Control-Click or Right
Click for our purposes now
just the image address here.
We'll assume this is my image
and I'm grabbing the address here
for the moment.
I'm going to paste it in here, and
that there is the URL of a JPEG that
is of a grumpy cat.
Now with an image, there
isn't really the same concept
of like starting an image and
stopping an image like there
is start the title stop the title,
start the body, stop the body.
And so there are so-called
empty elements in HTML
that you can express either by doing
this, which feels a little silly.
Like you're opening the image tag
and then immediately closing it,
which feels a little ridiculous.
And so there's shorter hand
syntax where you can actually
put the slash inside of
the open tag like this so
that the element is empty so to speak.
Open and closed.
It's not strictly required,
but at least this way
we're making clear our intent is to
open and close the thing all at once.
Now for accessibility purposes, for
someone who has trouble with vision,
you might want to provide some
alternative text like grumpy cat
so that if they're using a screen
reader or some other device,
they can actually have assistive
support explaining what it
is that you might otherwise be seeing.
So let me go ahead now and
open this file image.html.
And it's pretty darn simple.
But there is my own web page
with this big white background,
and nothing else yet
and this grumpy cat.
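A rough sketch of what image.html might contain at this point. The image URL below is a made-up placeholder, since the address actually pasted in the lecture isn't in the transcript, and the alt text wording is likewise an assumption.

```html
<!DOCTYPE html>

<html>
    <head>
        <title>image</title>
    </head>
    <body>
        <!-- src is a placeholder URL, not the one used in lecture -->
        <!-- the slash before the closing bracket marks the element as empty -->
        <img alt="Grumpy Cat" src="https://example.com/grumpy-cat.jpg"/>
    </body>
</html>
```

The alt attribute is what a screen reader would announce in place of the picture.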
All right, but of course this
web page doesn't do anything.
It would be nice if I could click
on something and go somewhere.
So let's do that.
Let's do another example whereby--
I'll call this link.html.
And in here-- let me get started
just by copying and pasting
that-- instead of the cat, let
me go ahead and do an anchor.
So it's a little counterintuitive.
It's not link, it's anchor.
And then anchor, confusingly,
has an href, or hyper reference,
which is the link to which it goes.
And I'm going to go ahead
and do something clever
like
https://www.google.com/search?q=cats.
And then close bracket.
And now notice CS50 IDE
is trying to be helpful.
It closes the tag for me, and
I can just write the word cats.
But let me finish this thought.
Let me say search for cats period.
And so now, even though we've
seen only some simple tags so far,
you can use HTML
inline, so to speak, sort of
in the middle of another thought.
If I want to convey the
sentence search for cats,
but I want cats to be clickable so
that when you click on the word cats
it actually goes to Google
and searches for cats,
I can borrow the idea
from earlier-- and I just
happen to remember that q is the
query that I have to pass in.
And notice that I surround cats with
the open tag and the close tag.
So that now if I open a
browser with this file,
I see again, a very simple web page.
And I can even zoom in
to make this more clear.
All it says is search for cats period.
But notice, it's the link
alone that's underlined.
And it happens to be purple
by default, because we already
searched for cats earlier, and browsers
typically remember URLs you visited.
So that's why it's purple and not say
blue, which tends to be the default.
But if I click on this, indeed,
I get a page full of cats.
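As a sketch, link.html as described might look something like this. The title text is an assumption; the href is the URL distilled earlier.

```html
<!DOCTYPE html>

<html>
    <head>
        <title>link</title>
    </head>
    <body>
        <!-- only the word between the open and close anchor tags is clickable -->
        Search for <a href="https://www.google.com/search?q=cats">cats</a>.
    </body>
</html>
```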
I can combine these ideas.
Let me actually go into the IDE,
and instead of the word cats,
let me go ahead and paste the image tag.
So it's a little hard
to see all on one line
here, but notice I have search
for, a href, close this tag.
And then immediately open the image
tag with its same value as before.
And then close that.
And then close the anchor tag.
Save that, reload.
Now it's a little stupid grammatically.
Search for cat picture.
But notice if I hover over the cat,
my cursor becomes a little pointer.
And indeed, if I look in Chrome's bottom
left corner, I'll see that if I click,
it's going to lead me to a URL.
And indeed, if I click on
the cat, anywhere on the cat,
now I've made a hyperlink.
So now the world wide web so to
speak is getting more interesting.
It's getting pretty ugly, but at
least it's getting more interesting.
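The combined version just described could be sketched like this, with the image nested inside the anchor. As before, the image URL is a placeholder and not the one used in lecture.

```html
<!DOCTYPE html>

<html>
    <head>
        <title>link</title>
    </head>
    <body>
        <!-- the img element is a child of the a element, so the whole picture is the link -->
        Search for
        <a href="https://www.google.com/search?q=cats"><img alt="Grumpy Cat" src="https://example.com/grumpy-cat.jpg"/></a>
    </body>
</html>
```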
So what are these things?
They're not tags, per se.
These are what we'll call attributes.
So indeed, it seems that
based on these simple examples
alone certain tags like image
can have their behavior modified
with these attributes.
And the format for those is a
keyword like alt for alternative
equals and then quote
unquote some value,
and source-- src-- which is by design.
You can't write out source
S-O-U-R-C-E. You'd have to do src per
the documentation equals
quote unquote some URL.
And you would only
know that these things
exist by googling around, reading some
online documentation, taking a class.
But thankfully, there's not
terribly, terribly many of them.
And most every one can be
looked up on demand when
you're curious how to do something.
In fact, let's take a
look at a few other tags
some this time that I've
put together in advance.
We have a whole bunch of online examples
that you're welcome to look for online.
Here's one that has a
whole bunch of paragraphs.
So in this page here, notice that
I've done a couple of things.
Inside of my body, I have a
bunch of Latin paragraphs.
Sort of nonsensical Latin,
but I've wrapped each of them
in an open p tag and a closed p
tag, simply because I want these
to be three separate blocks of text.
And let me go ahead into my browser now
and open this file in today's directory
as paragraphs.html.
And that's it.
It's a little more interesting
now that it fills the screen.
But indeed, there are
distinct paragraphs.
There's one other tag that I
proactively included here, which
is a little cryptic at first glance.
But this is a meta tag that has to
go in the head of the web page.
And here too you would know
this from some online reference.
And it's cryptic only insofar
as there's a lot of words here.
But the effect of this essentially
is that if this same web
page is viewed not on my browser but
on my phone, which might otherwise
be pretty small to look at, and
I'd have to squint to see the text,
this tag is one technique
for actually telling the web
page to sort of resize itself and the
text for whatever the device width is.
So without this tag,
these three paragraphs
you might have to squint to actually
read them pretty well on an Android
phone or an iPhone.
With that tag, the
font size will sort of
grow to take into account the
fact that this is a smaller device
and everything should not
just be squeezed in on there.
But otherwise, syntactically,
everything else there is the same.
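The file itself isn't shown in the transcript, so the following is only a sketch of what paragraphs.html plausibly looks like. The viewport meta tag shown is the commonly used form and is assumed to be the one meant here; the Latin is shortened placeholder text.

```html
<!DOCTYPE html>

<html>
    <head>
        <!-- asks mobile browsers to lay the page out at the device's own width -->
        <meta name="viewport" content="width=device-width, initial-scale=1">
        <title>paragraphs</title>
    </head>
    <body>
        <p>
            Lorem ipsum dolor sit amet, consectetur adipiscing elit.
        </p>
        <p>
            Mauris ut dui in eros semper hendrerit.
        </p>
        <p>
            Aenean venenatis convallis ante a rhoncus.
        </p>
    </body>
</html>
```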
Let's look at another example.
If I go into headings.html, this
one doesn't do all that much.
But it seems to demonstrate tags
called H1 through H6, literally saying
one, two, three, four, five, six.
And by convention, though this
differs ever so slightly by browsers,
H1 is big bold text.
H2 is not quite as big,
but still bold text.
H3 is not quite as big.
H4 not quite as big.
Headings that you might
see in a research paper
or in the chapters and sections
or subsections of a book.
It's a way of adding sort of semantic
headings to a web page that in our case
might look ultimately like this.
From bigger to smaller.
And so these might just be the
section headings in some book
or some kind of reference like that.
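A minimal sketch of headings.html, assuming (as the narration suggests) that each heading simply spells out its own number:

```html
<!DOCTYPE html>

<html>
    <head>
        <title>headings</title>
    </head>
    <body>
        <!-- browsers typically render h1 largest and h6 smallest -->
        <h1>One</h1>
        <h2>Two</h2>
        <h3>Three</h3>
        <h4>Four</h4>
        <h5>Five</h5>
        <h6>Six</h6>
    </body>
</html>
```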
What about lists, which
are pretty common?
Well, if we go into list.html,
it's pretty common on the web
or in various applications to have
bulleted lists or ordered lists.
This is an unordered list of
bullets, foo, bar, and baz, which
are just silly variable
names in computer science.
And if we want to see what this
one is, if I go into list.html,
you'll see quite simply that we
just have a little more nesting.
Body, UL, and LI.
So UL is Unordered List, LI is
List Item, and foo, bar, and baz
are each of the three list items.
If I change this ever so
slightly to OL, Ordered List,
and then go back to that
web page and reload,
now it's an automatically numbered list.
So there's a lot of features
you sort of get for free here,
not unlike a typical Word processor.
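The nesting described here, body, then ul, then li, can be sketched as follows. The title text is an assumption.

```html
<!DOCTYPE html>

<html>
    <head>
        <title>list</title>
    </head>
    <body>
        <!-- change ul to ol (in both the open and close tags) to get a numbered list -->
        <ul>
            <li>foo</li>
            <li>bar</li>
            <li>baz</li>
        </ul>
    </body>
</html>
```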
If we want to go really all
out and see a lot of nesting,
you can see a table here,
which might be useful
if you want to show a whole bunch of
tabular data for research purposes
or maybe sports scores and data
on an ESPN site or the like.
It's a little more involved, but
if you just read it top to bottom,
it all becomes pretty intuitive.
Inside of this page's body
there's an HTML table.
This table has a TR, Table Row.
And that table row has table
data, table data, table data.
So three columns, left to right.
And another row with
another three columns,
another row with another three columns,
another row with another three columns.
And I chose these
values arbitrarily just
to kind of mark up an old school
telephone keypad, because indeed,
if we go into this with
table.html, you see this.
You can add borders, and we'll see ways
you can actually tweak the aesthetics.
But it's just laying
things out in a grid here,
like you might tabular style data.
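A sketch of the table just described: four rows (tr) of three cells (td) each. The cell values are assumed to follow a standard telephone keypad, since the file's contents aren't reproduced in the transcript.

```html
<!DOCTYPE html>

<html>
    <head>
        <title>table</title>
    </head>
    <body>
        <table>
            <!-- each tr is one row; each td is one column within that row -->
            <tr>
                <td>1</td>
                <td>2</td>
                <td>3</td>
            </tr>
            <tr>
                <td>4</td>
                <td>5</td>
                <td>6</td>
            </tr>
            <tr>
                <td>7</td>
                <td>8</td>
                <td>9</td>
            </tr>
            <tr>
                <td>*</td>
                <td>0</td>
                <td>#</td>
            </tr>
        </table>
    </body>
</html>
```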
But none of these have been
all that pretty thus far.
Indeed, I'm just using the default fonts
and sizes, which apparently are just
black text, white
background, Times New Roman
font, and pretty small text at that.
The web of course these days
is much prettier than this.
So how do you actually
start to stylize things?
Well, as we often do, let's
take a progression of ideas.
Let me go into version
zero of this file.
css0.html.
That does something terribly simple.
It's more interesting
than any of the pages
we've seen thus far, if only because
we have some slightly differing
font sizes and some actual content,
but it's still pretty simple.
So what am I doing?
This is big and bold and centered.
This is kind of medium
and bold and centered.
And this is kind of small,
this copyright holder there.
So let's solve this in one way, but
then iteratively improve upon this
as follows.
Let me go into css0.html, and we'll see
that I've introduced amazingly already
another language.
CSS-- Cascading Style
Sheets-- is another language
that is almost always used in
conjunction with HTML these days.
And whereas HTML is all
about formatting-- rather,
all about markup and all
about layouts and sort
of semantically tagging things
in a way that makes sense,
CSS is used to kind of
take things the last mile
and stylize things so that they
look and appear in exactly the way
that you intend.
So this is a little messy
at the moment, because I
seem to be co-mingling my HTML
and CSS literally as follows.
Turns out that in HTML
there's a generic tag
called the div for just
a division of the page.
If you want to think of the page
as having rectangular regions,
div would be one way of doing that.
Or you could use a p tag or paragraph.
And I can add a style attribute
here that's a style font
size colon 36 pixels semi-colon
font weight colon bold semi-colon.
And not all of the semi-colons, at
least on the end there, are necessary.
But this is two CSS properties.
A property called font size
with a value of 36 pixels,
and a property of font
weight with a value of bold.
And then similarly, notice what I've
done in a div tag outside of this.
I have wrapped it with
text align center.
And that's a property called text align.
Its value is center, and it's going to
center all of its children so to speak.
So we can use the same language from
our discussion of data structures
and trees.
Meanwhile, you'll notice that my middle
div is slightly smaller at 24 pixels
and not bold, and my
last one is 12 pixels.
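Putting those pieces together, css0.html might look roughly like this. The property names and values come from the description above; the visible text inside each div is a placeholder, since the page's actual wording isn't in the transcript.

```html
<!DOCTYPE html>

<html>
    <head>
        <title>css0</title>
    </head>
    <body>
        <!-- the outer div centers all of its children -->
        <div style="text-align: center;">
            <div style="font-size: 36px; font-weight: bold;">
                John Harvard
            </div>
            <div style="font-size: 24px;">
                Welcome to my home page!
            </div>
            <div style="font-size: 12px;">
                Copyright &#169; John Harvard
            </div>
        </div>
    </body>
</html>
```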
But this is a little
messy now, because I've
co-mingled my HTML markup with my CSS.
It would be kind of nice if we
could factor out the aesthetics,
put them in one central spot
to make it easier to edit.
And so let me propose this instead.
I've now simplified the body of my page
to just have three divs, each of which
has a unique ID.
Turns out there's an attribute
in HTML called ID that
allows you to have a unique identifier.
You can use almost
any word you want,
though there are some restrictions
on the letters you can use,
or where you can have
numbers, and so forth.
But I'm just going to
sort of conveniently call
the top div top, middle, and bottom.
And those are unique.
And now that I have the ability
to identify those divs uniquely,
let's look at another tag up here.
Inside of the head of my web page
now, notice I have a style tag.
Not a style attribute,
an actual style tag.
And the syntax here is a
little different from before,
but it's kind of reminiscent
of C. But none of this
has to do with programming per
se, this is just aesthetics now.
This syntax here says, hey,
browser, apply to the body tag
the following CSS properties
in between curly braces.
Text align center for the entire body.
Hey, browser, apply the
following properties
to whatever HTML tag
has a unique ID of top.
So the hashtag here means ID.
It's just a symbol that
the world has adopted.
So this means whatever HTML
tag has a unique ID of top,
apply these two properties to it.
Notice the semicolons on the
end, and I've indented everything
to keep things nice and pretty.
Middle will have this property,
bottom will have that property.
So now it's cleaner in that
I've relegated to the top
to one central spot all of
the aesthetics of my web page.
I've left all of the lower
level markup down here.
So that if on a whim
tomorrow I want to change
the font size or the
color or the layout,
I can do that very simply without
actually changing the data.
So the data is things like
these white words here.
And I've got some metadata, these
red tags and green attributes,
here, so that I can uniquely
identify things in the page.
But the aesthetics are now
fundamentally separated.
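As a sketch, this second version (presumably css1.html, by the naming pattern) would look something like the following. The selectors and properties are the ones described above; the text inside each div is placeholder content.

```html
<!DOCTYPE html>

<html>
    <head>
        <!-- a style tag in the head, not a style attribute on each div -->
        <style>

            body
            {
                text-align: center;
            }

            /* # means: the element whose id is top */
            #top
            {
                font-size: 36px;
                font-weight: bold;
            }

            #middle
            {
                font-size: 24px;
            }

            #bottom
            {
                font-size: 12px;
            }

        </style>
        <title>css1</title>
    </head>
    <body>
        <div id="top">
            John Harvard
        </div>
        <div id="middle">
            Welcome to my home page!
        </div>
        <div id="bottom">
            Copyright &#169; John Harvard
        </div>
    </body>
</html>
```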
But it's still a little messy, because
they're still in the same file.
So let me open a third
version of this, css2.html,
which makes the file even smaller.
What do I seem to have done here?
So in this case, I seem
to have similarly given
IDs to these three divs.
But I've introduced into the
head of the page not a style tag,
but a link tag, confusingly named,
because it's not an anchor tag,
it's link with an href.
So even more confusing.
But all this means is hey, browser, grab
the contents of this file-- css2.css--
the relation to this file
is that of style sheet.
So it's stylisation.
And then apply it to this web page.
What is in css2.css?
It's just those same CSS rules as
before, but in their own file.
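A sketch of css2.html under the same assumptions as before (placeholder text in the divs). The stylesheet's contents are summarized in a comment, since they live in the separate file css2.css.

```html
<!DOCTYPE html>

<html>
    <head>
        <!-- css2.css is assumed to hold the same body, #top, #middle, and #bottom rules -->
        <link href="css2.css" rel="stylesheet"/>
        <title>css2</title>
    </head>
    <body>
        <div id="top">
            John Harvard
        </div>
        <div id="middle">
            Welcome to my home page!
        </div>
        <div id="bottom">
            Copyright &#169; John Harvard
        </div>
    </body>
</html>
```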
So what's the purpose of this?
At the end of the day, the result
in each of these three cases
is an identical web page.
All three of these things look exactly
like this, so they're no prettier.
But from a design perspective
underneath the hood,
these things are
fundamentally better designed,
because now this CSS file in theory
could be shared across multiple pages.
Multiple pages of mine could now
have this one link tag up top,
so that once a browser downloads
css2.css or whatever the file is,
it can reuse and cache the
file for my entire website
so that as the user clicks
around to my website,
they don't have to
re-download the CSS file.
And indeed, even if
the browser tries, it
can get that HTTP 304 not modified
message so that it doesn't waste time
or bandwidth redownloading the file.
So this also allows me to
use, as we'll eventually
see in future problems,
third party libraries.
It turns out that a lot of people in the
world who are better than little old me
at design certainly have created files
ending in .css that have some really
beautiful stylizations that you can
apply to your own web pages so that you
don't have to worry about
as much the aesthetics.
Bootstrap is one such tool
formerly from Twitter,
and other such libraries exist that
allow you to stylize your site just
by using themes or skins, so to
speak, that other people have created.
One last piece of syntax
here I should draw attention
to is this thing here.
So this cryptic sequence of characters
is what's known as an HTML entity.
It turns out there are some
symbols that to my knowledge
I can't type on my Mac's keyboard,
like the copyright symbol.
You can maybe do it on iOS these
days via special software support.
But this is the canonical way of putting
certain special characters inside
of a web page that you might
not be able to express or easily
express on your keyboard.
And these are standardized, too.
So if I actually
googled HTML entities, I
could actually see
whole charts telling me
that ampersand hashtag 169 semi-colon
will give me the copyright symbol.
And just to be clear, when
that's actually rendered,
you don't see that in the page.
You instead see the more
familiar copyright symbol there.
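In isolation, the entity looks like this. The surrounding text is a placeholder.

```html
<!-- &#169; is rendered by the browser as the copyright symbol -->
<div>
    Copyright &#169; John Harvard
</div>
```

The same symbol can also be written with the named entity `&copy;`.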
So let's now finally try to tie
some of these things together.
I know that Google supports
search queries via GET.
And this is in contrast just to
be clear with one other thing.
That is POST.
It would be a little
worrisome if every time you
logged into Facebook or
Google or any website,
or any time you bought something
on Amazon or any website,
if your credit card and your
password and all your sort
of semi-private information
appeared in the URL
just like these Google search queries.
So it turns out that HTTP
supports another verb.
And there's a few others, but the
two we'll focus on are GET and POST.
And POST is inside the
envelope's initial message,
just like my handshake to
AJ, almost identically.
But instead of GET, it's POST.
What do you want to post information to
and what protocol do you want to use?
This is an example of a snippet
of how I might log into Facebook.
When I log in to Facebook, I don't want
my friends or my siblings or my family
members being able to see in my
browser's history or the search box
what my user name or
really what my password is.
And that's exactly what
HTTP GET does by design.
POST is just another way of
submitting information to a server,
still using the same conventions of
HTTP parameter equals some value.
And indeed, you can send
multiple ones by separating them
in this case with an ampersand.
No relationship to the ampersand
we just saw in an HTML entity.
But notice that this email
and password are deliberately
below the HTTP headers.
So they're not in the URL
bar, they're instead deeper
inside the envelope, if you will.
But I need to know this because
when I make my own web pages,
this becomes relevant.
Let me go ahead and create a super
simple web page called search.html
that again has the doc type declaration
at the top, that then has my HTML tags,
my head tags, my title tags--
and I'll call this search.
And then over here I will
have the body of the page.
And then I'm just going to do
an H1 for CS50 search, which
is just a big bold heading on the page.
And now I'm going to have a form.
And I'm going to have action equals
https://www.google.com/search.
The method I want to use is
necessarily GET, not POST.
Though in different contexts,
I might want to use POST.
But I'm not doing logons
or something like that.
I'm using Google's search engine.
So now I have the HTML form
element, which we've not yet seen.
But it turns out there's
another tag called input
that you can give a name to like
q, that can be a type like text,
and it's empty.
And then we can have
another input whose type
might be quote unquote
submit and close that tag.
And then save the page.
If I now go back into this file and
go to search.html, if I zoom in,
we see if you will, version one
of Google, without any aesthetics.
And indeed, the actual
version one of Google
wasn't all that much more complicated.
But if I now type in cats, submit
this query, I go to actual Google,
typing in effectively
cats, because of the URL
I was redirected to-- which
is to say that using HTML,
we can reconstruct exactly what
Google's been doing all this time.
Because if you distill the essence of
Google into just a few lines of code,
this is it.
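What the browser does on submit can be sketched in a few lines of Python: it takes the form's action, appends a question mark, and then appends each input's name and value. The function name form_get_url is hypothetical, and real browsers handle more cases (existing query strings, multiple encodings) than this sketch does.

```python
from urllib.parse import urlencode


def form_get_url(action, fields):
    """Build the URL a browser would visit when a GET form is submitted.

    Simplified sketch: assumes the action has no query string of its own.
    """
    return action + "?" + urlencode(fields)


url = form_get_url("https://www.google.com/search", {"q": "cats"})
print(url)  # https://www.google.com/search?q=cats
```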
And indeed, this is essentially what
Google looked like a few years ago.
Although, to be fair,
they also had this.
They had another input
whose type was submit,
and whose value even early
on was I'm Feeling Lucky.
And if we save this, it's not
going to actually do anything,
because we need a little more
logic in order to make that work.
But if I reload, now we get the
second Google button as well.
And so all we've implemented for now is
the front end of Google, so to speak.
We have completely punted to Google's
back end, their own databases,
their own software, the actual
searching of things, because
we don't really
have a language yet,
a way of expressing searches ourselves.
Indeed, we could using C
and using HTML and using
CSS start to build our own
server, and we could actually
write code in C that receives something
like q equals cats, parses it,
that is, reads it and extracts
cats from that string,
and then figures out in our own database
where can I find some cats.
But it's going to be incredibly,
incredibly tedious to do that in C.
In fact, if you think back to
the problems Vigenere and Caesar
and the like, even just manipulating
strings in C is really non-trivial
and gets quickly tedious.
And so we really need a better language.
And that language is going to be
in the coming weeks Python, which
is a higher level language than
C. In fact, the Python interpreter
so to speak itself is written in
C. So the world some years ago
used C to write support for really
what many would call a better language
for solving problems like this.
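As a small taste of why a higher-level language helps, here is a sketch of the steps just described: receive something like q=cats, extract cats from that string, and look it up. The PAGES dictionary is a hypothetical stand-in for a database, and the search function is invented for illustration; only parse_qs comes from Python's standard library.

```python
from urllib.parse import parse_qs

# Hypothetical stand-in for a real database of pages.
PAGES = {
    "cats.html": "pictures of cats and kittens",
    "dogs.html": "pictures of dogs",
    "pets.html": "caring for cats and dogs",
}


def search(query_string):
    """Return the names of pages whose text contains the q parameter."""
    # parse_qs turns "q=cats" into {"q": ["cats"]}.
    params = parse_qs(query_string)
    terms = params.get("q", [])
    if not terms:
        return []
    term = terms[0].lower()
    return sorted(name for name, text in PAGES.items() if term in text.lower())


print(search("q=cats"))  # ['cats.html', 'pets.html']
```

The string handling that would take many lines of C, with manual memory management, collapses here into one library call.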
And so not only can you use Python
for command line applications
and processing and analyzing data like
a data scientist might use it for,
we can also use Python to actually
write the back end of google.com,
or the back end of Facebook,
or the back end of any web server
that has to read the parameters,
understand them, maybe look up
some data or store some
data in a database,
and respond to the user
with dynamic output.
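A back end of that kind might be sketched as follows, assuming Python's standard WSGI convention for web applications. This is not necessarily the approach the course itself will take; the function body and the wording of the response are invented for illustration. It reads the q parameter, and responds with a page generated on the fly.

```python
from html import escape
from urllib.parse import parse_qs


def application(environ, start_response):
    """Minimal WSGI app: read the q parameter and echo it back in HTML."""
    # The web server places everything after the ? in QUERY_STRING.
    params = parse_qs(environ.get("QUERY_STRING", ""))
    query = params.get("q", [""])[0]

    # escape() keeps user input from being interpreted as HTML tags.
    body = (
        "<!DOCTYPE html><html><body>"
        f"<h1>You searched for {escape(query)}</h1>"
        "</body></html>"
    )
    payload = body.encode("utf-8")

    start_response("200 OK", [
        ("Content-Type", "text/html; charset=utf-8"),
        ("Content-Length", str(len(payload))),
    ])
    return [payload]
```

To try it locally, one option is the standard library's wsgiref.simple_server: calling make_server("", 8000, application).serve_forever() and then visiting http://localhost:8000/?q=cats should show the dynamic page.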
So all that and more in the weeks ahead.
[MUSIC PLAYING]
