¡Hola, y bienvenidos de vuelta! Hi and
welcome back. In the previous video, we
described words as bundles of features,
so that, for example, a word could be
defined as having the feature anger: yes
or no, having the feature joy: yes or no.
We used these features to describe the
emotions associated with a word, and then
the emotions associated with a document
overall. In this video, we'll use words as
features of documents, so that a document
will be defined as having the word sushi
or not, having the word Hanover or not.
This will let us measure the similarity
between two documents. It will let us
measure the distance between a search
query and a document, and it will help us
cluster similar documents together. So
let's say we have a group of documents,
for example the very small webpages that
we have here: twelve documents. Some of
them are about sushi in Hanover, some of
them are about basketball, and some of
them are about how Hanover is cold, for
example. We want to find some sort
of feature system that we could use to
describe these documents, and hopefully
to group them together. We could come up
with an arbitrary system, and again, every
system would be arbitrary. For example:
having the word sushi or not. Which of
these documents have the word sushi?
Documents D1, D5, D7, D8, D9, D10, and
D11 have the word sushi, so maybe all of
them will be similar. Some of them, like
D1 and D5, are about places where you
can get sushi in Hanover, because they
have fresh sushi or authentic sushi, for
example. Some of the documents, like D7
and D8, are about places that are not
Hanover where you can get sushi, such as
Osaka and Seattle. So sushi could be a
feature to describe documents. Let's try a
second feature, whether the document has
the word Hanover or not, because it
seems that sometimes we're talking about
Hanover, sometimes we're not. Some
documents do have both of these words:
documents D1, D5, D9, and D10 have both
the words sushi and Hanover. Documents D1
and D5, as we mentioned, are about
restaurants in Hanover that serve sushi
(imaginary restaurants). D10 is of the
same type. D9, however, is slightly
different. D9 appears to be a document
about a cultural club that has sushi
Fridays and also origami Tuesdays, so it
must be some sort of cultural club.
Let's use a third feature:
whether the document has the word
origami or not to try to describe how
these documents are different, because
some of them talk about restaurants and
some of them talk about cultural
activities which might include origami.
So which of these documents have the
words sushi, Hanover, and origami? Really
just that one, D9. This is a feature
system with three features: whether you
have the word sushi or not, whether you
have the word Hanover or not, and
whether you have the word origami or not,
which can be used to describe some
documents and to figure out how they are
different from some other documents. And
again, this system is completely
arbitrary; I made it up. There are many
other feature systems that we could use
to describe these documents. This is just
one example of what you can do, of how
you could define features and then feed
them into some machine learning system.
Let's keep playing with these three
features and see how far it can take us.
How can we use these features to measure
similarity between documents? Intuitively,
we were already dealing with that
concept: some of these documents were
very similar. Some of them were about
restaurants in Hanover, and some of them
were about eating sushi in places that are
not Hanover. So how can we use these
features to measure distance? Let's set
up a three-dimensional system. Whether
the document has the word sushi or not
is one of our axes: if you don't
have the word sushi, your value would be 0;
if you do have the word sushi, your value
would be 1. There's a second
dimension for whether the
document has the word Hanover or not.
If the document does not have the word
Hanover, the value for this dimension
would be 0. If the document does have the
word Hanover, the value for this
dimension would be 1. Let's set up a
third dimension, where we decide
whether the document has the word
origami or not. If the document doesn't
have origami, the value would be 0; if it
does have the word origami, the value
would be 1. We have three features for
three dimensions, so we could draw a kind
of cube. At one of the vertices of the
cube, as you can see there, we would have
documents that have 0 for all three
dimensions: 0 for sushi because the
word sushi is not in the document, 0 for
Hanover because Hanover is not there, and
0 for origami because origami is not there.
That is a document like D4, so that would
be the place in this three-dimensional
system where this document would reside. Notice
I'm using three dimensions because
that's as many as I can draw. In theory
you could have four dimensions, 200
dimensions, or as many as we want
mathematically, but I'm gonna stick
with this three-dimensional example for now.
So that's the place where D4 would live
where all of our dimensions are set to 0.
How about a document like D8, "Sushi in
Seattle is much better than you'd expect"?
This document has the word sushi, so the
value for the sushi dimension is 1. The
value for the other dimensions is 0
because it does not have the words
Hanover or origami. So in our little
three-dimensional system, if D4's corner
is the center, D8 would be just one step
along the sushi axis, and no steps along
the Hanover axis or the vertical origami
axis. So notice that we went one step
from the previous document, D4, to D8,
where we are now. How about a document like D12,
"Hanover is particularly cold"? This
document does not have the word
sushi, so the value for the sushi
dimension is 0. It does have the word
Hanover, so the value for that
dimension is 1. If this is the
center, we would have one step in the
Hanover dimension, zero steps in the
sushi dimension, and zero steps in the
vertical origami dimension. So the
distance between this document
and D4, the original one, would also be
one step, this time in the Hanover
direction, so the distance would be 1.
We can have another document, which says
"SushiFresh is the freshest sushi in
Hanover". This one has both the words
sushi and Hanover in it. So in our little
three-dimensional system, it will sit
on the floor of the cube, with
one step in the sushi dimension,
one step in the Hanover dimension, and
no steps vertically in the origami
dimension. It will live on the floor
here, diagonally opposite the original
document, so the distance between them
would be this line right here. Let's say we have
something that has a value of 1 in all
three dimensions, for example "Hanover
Japan club is origami Tuesdays and make
your own sushi Fridays". This one has a
value of 1 for sushi because it does
have the word, a value of 1 for Hanover
because it does have it, and a value of
1 for origami because it does have
the word origami. So if this is
the center, this one will be one step in
the sushi dimension, one step in the
Hanover dimension, and one vertical step
in the origami dimension. So now this
document lives here. If we want to
measure the distance between this document
and the original document, it will now be
this diagonal line that goes through the cube here.
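As a rough sketch of the step we just walked through, here is one way to turn a document into its three 0/1 values in Python. The texts are the ones read out in this video. The function name `featurize`, the lowercase-and-split tokenization, and the dictionary keys are assumptions made for illustration; the video does not say which document number the SushiFresh sentence belongs to, so it is keyed by name here.

```python
import re

# Feature order assumed here: (sushi, Hanover, origami).
FEATURE_WORDS = ("sushi", "hanover", "origami")

def featurize(text, feature_words=FEATURE_WORDS):
    """Return one 0/1 flag per feature word: 1 if the word occurs in the text."""
    # Lowercase and keep runs of letters, so "Sushi" and "sushi" count as the same word.
    tokens = set(re.findall(r"[a-z]+", text.lower()))
    return tuple(int(word in tokens) for word in feature_words)

# Document texts as quoted in the lecture; the "SushiFresh" key is a hypothetical label.
documents = {
    "D8": "Sushi in Seattle is much better than you'd expect",
    "D12": "Hanover is particularly cold",
    "SushiFresh": "SushiFresh is the freshest sushi in Hanover",
    "D9": "Hanover Japan club is origami Tuesdays and make your own sushi Fridays",
}

vectors = {doc_id: featurize(text) for doc_id, text in documents.items()}
for doc_id, vector in vectors.items():
    print(doc_id, vector)
```

Each tuple is the corner of the cube where that document lives. A document with none of the three words, like D4, would come out as (0, 0, 0).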
As we mentioned, each of those steps
measures one unit, an arbitrary
unit of measurement, which is whether the
document has the word or not. So if you
start in the center, you go one step in
the sushi dimension, one step in the
Hanover dimension, and one vertical step
in the origami dimension: 1 1 1.
What would be the distance between 1 1 1 and
0 0 0, which was the original document, D4?
The distance between these two
documents is this line here. In
three-dimensional space, we do have a
formula to calculate it, and as a matter
of fact this formula can be extended
to n-dimensional space. In three
dimensions, the first term is the
difference on the sushi axis between D9
and D4, squared: 1 - 0 is 1, and 1 squared
is 1. The second term is the difference in
the Hanover dimension between D9 and D4:
1 - 0 is 1, and 1 squared is 1. The third
term is the difference in the origami
dimension between D9, which is 1, and D4,
which is 0: 1 - 0 is 1, and 1 squared is 1.
We add up those three squares and take the
square root: the square root of 3 is about
1.73 units in our system. So if D4
lives here and D9 lives here, they are
1.73 steps from one another.
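Written out as an equation, the calculation for D9 = (1, 1, 1) and D4 = (0, 0, 0) looks roughly like this, with the coordinates ordered as sushi, Hanover, origami. The ordering and the notation d(D9, D4) are conventions chosen here, not something shown in the video.

```latex
\[
d(D_9, D_4) = \sqrt{(1-0)^2 + (1-0)^2 + (1-0)^2} = \sqrt{3} \approx 1.73
\]
```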
This does correctly capture the intuition
that these two documents are not related.
They're not talking about the same
things, one is talking about basketball
and one is talking about a cultural club
in Hanover. We could calculate the
distance between two documents that are
more similar. For example, D9 and D1 are
very similar: both of them talk about
sushi and Hanover, but one talks about
origami and the other one does not. So
the distance between these two would use
the difference in the sushi dimension,
1 - 1; the difference in the Hanover
dimension, 1 - 1; and the difference in the
origami dimension, 1 - 0. Square each one
and add them up, and you get the square
root of 1, which is 1. As a matter of
fact, the difference between these two is
just 1 unit: the origami unit.
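The two calculations so far can be checked with a few lines of Python. This is a minimal sketch: the function name `euclidean_distance` is an assumption made here, and the vectors are written in the order (sushi, Hanover, origami) using the values described in the video.

```python
import math

def euclidean_distance(a, b):
    """Square the difference on each axis, add the squares, take the square root."""
    if len(a) != len(b):
        raise ValueError("both vectors need the same number of features")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# (sushi, Hanover, origami) values as described in the video.
D1 = (1, 1, 0)  # sushi restaurant in Hanover, no origami
D4 = (0, 0, 0)  # basketball document, none of the three words
D9 = (1, 1, 1)  # cultural club: sushi, Hanover, and origami

print(round(euclidean_distance(D9, D4), 2))  # 1.73
print(euclidean_distance(D9, D1))            # 1.0
```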
Now we have the capability of
measuring distance between documents. The
distance between D9 and D1 is a
single unit, and the distance between D9
and D4 is 1.73 units. Again, this
correctly captures the intuition that D9
and D1 are somehow similar. They are
both talking about sushi in Hanover, so
they have more in common with each other
than either of them has with D4, which is
located further away. Here's where the query
would come in. A web search, for example
a query into a search system, is
essentially another document, and all we
need to do is figure out the
distance between this incoming document,
the query, and all of the other documents
in our collection. So say the query has the
word sushi, so it is a 1 in the sushi
dimension; it has the word Hanover, so it
is a 1 in the Hanover dimension; and it
has the word origami, so it is a
1 in the origami dimension.
So the distance between that query, which
is 1 1 1, and D9, which is 1 1 1, is 0. Those
two are the same in our feature system, so
this correctly captures the intuition
that this query would prefer to
get document D9, which is at a very short
distance, rather than D4, which is at a
longer distance and appears to be
unrelated to the query that we were after.
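Here is a small sketch of that idea in Python: treat the query as one more vector and sort the collection by distance to it. The names `rank_by_distance` and `collection` are assumptions for illustration, and only the five documents whose values are spelled out in this video are included.

```python
import math

def euclidean_distance(a, b):
    """Straight-line distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# (sushi, Hanover, origami) values taken from the walkthrough in the video.
collection = {
    "D1": (1, 1, 0),
    "D4": (0, 0, 0),
    "D8": (1, 0, 0),
    "D9": (1, 1, 1),
    "D12": (0, 1, 0),
}

def rank_by_distance(query, docs):
    """Return (doc_id, distance) pairs, nearest document first."""
    scored = [(doc_id, euclidean_distance(query, vec)) for doc_id, vec in docs.items()]
    return sorted(scored, key=lambda pair: pair[1])

query = (1, 1, 1)  # the query mentions sushi, Hanover, and origami
ranking = rank_by_distance(query, collection)
for doc_id, dist in ranking:
    print(doc_id, round(dist, 2))
```

D9 comes out first at distance 0 and D4 comes out last, which matches the intuition in the video. A real search system would not compare the query against every document one by one like this; the loop is only meant to show the idea.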
Notice, by the way, that we are
essentially calculating vector
distances over three-dimensional space.
We can have four features and it will be
vector distance in four-dimensional
space. We could have 200 dimensions and
it will be vector distance over 200
dimensions. The math would be the same,
and we could have an arbitrary number of
features to define and explain a
document, and we could always calculate
the distance between them using the kind
of algebra that we saw here. So if we
have a document and we define it by
features, for example having the word
sushi or not, having the word
Hanover or not, we can now measure the
distance between two documents, because
we will be measuring the distance between
points such as 1 1 1, 0 0 0, 1 0 0, and
so forth. This is
essentially a kind of vector distance,
and by the way, this idea is at the heart
of search systems like Google's, for
example. It's a kind of vector search,
where you have a vector for a query and
you try to find the documents that are
nearest to the incoming query.
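The same distance for any number of features can be written in one line. In this sketch, q and x stand for the query and document vectors and n for the number of features; those symbols are notation assumed here, not taken from the video. With n = 3 it reduces to the cube calculation from earlier.

```latex
\[
\operatorname{dist}(\mathbf{q}, \mathbf{x}) = \sqrt{\sum_{i=1}^{n} (q_i - x_i)^2}
\]
```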
In our next video, we're gonna look at a
beautiful side effect of this kind of
system, which is that documents that are
alike cluster together.
