I'm very delighted to introduce Dan Gillick to you. Dan is one of the original and most appreciated instructors in our Master of Information and Data Science program, about which I will say something more later, but I'll give you a bit of his background. He graduated with
a BA from Wesleyan University and this
tells you a lot about Dan: he double majored in computer science and English literature, and then he came
Berkeley to do his PhD in computer
science where I guess he didn't do much
English literature he has been working
at Google ever since he finished
graduate school working mostly on
natural language processing which as
many of you probably have noticed has
kind of exploded over the past couple of
years. I've heard Dan say that when he started at Google there was a small group of them doing NLP, and now there are hundreds, maybe thousands, of engineers in the field. One thing that you probably don't know about Dan, unless you've heard it from him, is that he has won both national and world championships playing Ultimate Frisbee. Very cool, very cool.
he's on his way to New Zealand and we're
really delighted that we got a little
bit of time from him today I just wanted
to say that he's an incredibly gifted
instructor and speaker so you're lucky
to be able to hear from him today about
the work he's doing he built a course on
machine learning ... our sort of
applied machine learning class he also
built a course on natural language
processing and then we added in
I think bow to the students NLP with
deep learning because everybody wanted
deep learning classes both of them are
incredibly popular classes and so I
think you're you know should be ready
for a treat now I'll hand it over to you
Dan
anyway nice to be here nice to see like
some familiar people so that's good
okay so I've never really given a talk quite like this before. It doesn't represent any particular paper that I've worked on; it's just some stuff I've been working on at Google over the last six months or so, and I think it's interesting. I'll try to present it in a way that is hopefully high-level enough that you can get the general idea, with a little bit of technical detail, but it's a little hard to trade off technical detail against a high-level overview, and I'm not quite sure what everyone's backgrounds are, so I'll give it a shot.
okay so here's a screenshot from Quora
which is this website that you go to and
ask questions
this particular question I guess Quora
thinks is of interest to me but anyway
so here I've typed in a question which
is of interest to me how does Blue
Bottle cook their eggs something that
I've actually wondered about for quite
some time
Blue Bottle being just down the street
from my house I've always been impressed
by how they cook their eggs and if I
type this into Quora you'll see that it
suggests a bunch of possibly related
questions that maybe I do care about
like where does Blue Bottle get their
glass coffee mugs from no actually
that's not really that related how do
IHOP and Denny's cook their eggs no
definitely not related and so on down at
the bottom we've got how do blue tongue
lizards lay their eggs yeah that's
that's pretty far from what I'm
interested in but anyway they have this
search feature where you type in a
question and they like search their
database and you know maybe you'll find
the answer to something that that's
related enough that it will satisfy you
okay and so this is the problem of
similar question retrieval so Quora they
have something like ten million
questions with answers maybe it's more
by now as people are typing in questions
all the time other people are coming
along experts in various categories or
non experts and answering questions and
the problem that I'm interested in at
least for this talk is given a new
question can we retrieve the closest
matches, right? So how do I pull this needle out of a haystack: given the question I have, find the relevant question that may be out there. Relatedly, how do we score the similarity between two questions? You can see how those two problems might be related.
By the way, Google has the same problem, although at a somewhat different scale. So instead of going to Quora, you may have tried typing a question into Google, and sometimes you get sort of a direct answer, like this: type in how to poach an egg (by the way, all my examples have to do with poaching eggs, so I'm sorry). In this particular case Google has pulled out this little recipe from this BBC site ... sorry, bbcgoodfood.com, probably not related to the BBC ... so they have the same problem, right? Google knows about probably an order of magnitude more questions and answers that they know are pretty good, from the user logs and so on, and given some
new question that might be phrased a
little bit differently they want to know
like well can we find an answer to your
question you know even though we don't
have that exact phrasing in our database
actually this is like this kind of
retrieval problem appears all over the
place like lots of companies have
frequently asked questions sites and
people are always asking them the same
question and they want to know like how
can they route new questions that they
get to like their database of questions
that have known answers right so this
kind of problem comes up a lot and
actually it's a specific case of search
right it's a specific place where the
input is a question and the output is a
question but there's all kinds of other
search problems so just to give you a
few examples like the standard web
search is you type in a query and you're
looking for a document well one that
Google's interested in is typing in a
query and you're looking for an ad
usually you're not but you know Google
wants to find a relevant ad for you
maybe you type in a query and you want
an image so there
the modality is different right like you
have to compare queries and images or
maybe you type in a query and you're
looking for a video or even something
sort of the reverse where you have an
image like you take a picture with your
camera and Google would like to show you
a document that's most relevant right so
I take a picture of like this egg dish
that I'm eating and Google says oh that
looks like you know the eggs that they
serve at Blue Bottle, or whatever. Okay, so really this is a talk about search, and I'll use the question-to-question case for a lot of my examples, because I think it's simple enough and easier to discuss. By the way, please interrupt and ask questions if stuff I'm saying is confusing. Okay, so
first of all classical retrieval systems
well they date back further than the 90s
but the basic ideas are essentially
unchanged since search engines were
created and the key idea is the inverted
index right so if you want to find
documents or maybe a question that's
relevant to the question you just asked
you should keep track of this inverted
index so that you can take each word in the question and look up all the places where that word appears really quickly, right? And here's a little
schematic of what an inverted index
construction kind of looks like on the
left you have terms on the right you
have the items that you're indexing so
like items 1 2 & 3 all contain the word
blue item 1 contains the word where and
so on so you're probably all familiar
with this kind of thing so the way
search typically works is you have this
inverted index that you've created from
all your data then given some query you
retrieve some candidates and then you
re-rank them so you sort them based on
whatever criteria you might have to show
people. You know, when Quora was showing you the top questions that it thinks are relevant, it used some inverted index to find other questions that had lots of similar words, and then it did some ranking, and I'm not sure how the ranking works.
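To make that concrete, here's a minimal sketch of an inverted index in Python; the toy "database" and the whitespace tokenizer are just stand-ins for illustration:

```python
from collections import defaultdict

# Toy database of indexed questions (hypothetical examples).
items = {
    1: "where does blue bottle get their glass coffee mugs",
    2: "how do ihop and dennys cook their eggs",
    3: "how do blue tongue lizards lay their eggs",
}

# Build the inverted index: term -> set of item ids containing that term.
index = defaultdict(set)
for item_id, text in items.items():
    for term in text.split():
        index[term].add(item_id)

def retrieve_candidates(query):
    """Return ids of items sharing at least one term with the query."""
    candidates = set()
    for term in query.split():
        candidates |= index.get(term, set())
    return candidates

# The candidates would then be re-ranked by some scoring function.
print(retrieve_candidates("how does blue bottle cook their eggs"))
```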
so that's sort of classical retrieval
the shortcomings are that recall is
limited by this word overlap right so in
my case I was looking for something very
specific about poached eggs at Blue
Bottle and maybe there's a
question out there in Quora which is
relevant but we weren't able to find it
and maybe it was because we were limited
by this word overlap thing right the
other shortcoming is we can't use the
inverted index directly to retrieve
across modalities like if I wanted to
retrieve images there would be no way to
directly use an inverted index so often
people do this sort of trick where you
try to annotate images or videos with
words and then you use that to construct
the inverted index, right, but you could miss a lot that way. Okay, so
what I want to talk about is not
classical retrieval but sort of a new
idea which has to do with neural
networks and I'm gonna call it
continuous retrieval and there's a few
pieces and I'll try to talk about each
of the pieces okay so the first piece is
you'll see they all involve some kind of
learning whereas classical search or the
inverted index doesn't really involve
any learning so the first piece is let's
learn to represent objects as continuous
vectors okay so we want to come up with
some framework some kind of model that
can take arbitrary objects text or
images or videos or sound and represent
them as vectors in some continuous space
right so here's an example of
two-dimensional space and you can
imagine my model might place this object
how to cook eggs in this part of the
space and this other object Golden
Retriever in some other part of the
space right so I haven't told you yet
anything about how to actually do this
but we want something that's able to place objects
in space like this the second thing is
that we'd like to learn to place similar
objects close together right so not only
do we want a model that's able to sort
of position items in this space but we
want it to be the case that the question
how to cook eggs and the phrase poached
egg even though you know not all the
words are the same the intents are
related and so we like them to be close
together in space
similarly dog and golden retriever like
those are pretty similar so we want them
to be close together and it would be
really nice if an image of a poached egg
and an image of a golden retriever like
appear close to those sort of
descriptive phrases of that right okay
and then the third piece is we have to
be able to retrieve neighboring objects
really fast
So in order to actually use something like this to do search, we have to be able to take some new query or question, maybe someone types in boiling eggs, and we need to be able to
very quickly like locate that thing in
this space and find all the neighbors
that are very close to it and that may
include the image and these other
phrases or questions close by and return
them to the user right okay so those are
the three pieces that are essential to
being able to do continuous retrieval by
the way, just to be a little more concrete in our Quora case: I couldn't find an existing representation of questions on Quora, but this picture represents questions on Reddit, so that's close enough. So here's this big cloud of questions that people ask on different topics on Reddit, and the goal of the image is
to show that you know kind of similar
questions end up getting clustered
together so you've got like a music
section or the fitness section and so on
okay so with Quora we would take our 10 million questions
and place them in some learned space
such that similar questions are close
together and then given a new question
we'd encode it in the learned space, find the nearest neighbors, and retrieve them. So I just repeated what I just said, but with this picture next to it. All right, let's go on. Okay, so
the outline for the rest of the talk is
that I'm basically going to talk about
each one of these three components in
some detail so ok so first learning to
represent objects as continuous vectors
I guess I'm not going to just repeat all
these things because I've said them
already any questions before I go on ... yeah
yeah so you're making the point that
actually I dramatically simplified how
search works which is a fair point so
it's true that the language of
specifying queries is much more complex
and allows you to like compose different
things that you know with sort of simple
operators like and and or so you can you
know you can require that you want
certain words to be there and certain
words not to be there and that's true
but at the end of the day you're still using this inverted index, you're still indexing documents by the terms they contain, and that places the same limitation on you, the one I'm going to try to get past by representing things in continuous space ... yeah
okay
right
yeah
Yeah, that's fair ... I think that's true. So the idea of latent semantic indexing dates back even further, right; I think there are patents from the early 80s
where like the idea of latent semantic
indexing like you're saying is to
represent arbitrary objects like
documents in some fixed dimensional
space which is exactly what I'm talking
about doing with neural networks so
that's true I agree there was all this
research I just mean that in large-scale
search systems I don't think anybody
does it right so I'm not claiming that
you know I've invented some sort of new
paradigm for doing this yeah I agree
this has been around for a long time I'm
more interested in sort of making it
practical
okay fair enough
all right
okay other questions
yeah
not a question okay
okay so I want to talk about learning to
represent objects as continuous vectors
I should say originally when I was
making these slides I put a picture of
the ... like the patent that was filed
like back in the 80s and I had some
discussion about LSA but I decided like
I just didn't have enough time to talk
about it
okay yeah that sounds pretty close to
what I'm about to talk about okay so
this is probably gonna be very familiar
to some of you and probably not familiar
to others, but anyway, ask me questions as they come up. Okay, so let's talk about something sort of totally different for a second: the language modeling problem. The language modeling problem is, given some context, can you predict the
next word right so here's a really
simple example and let's say that my
vocabulary just includes five words
right
in reality you might have fifty thousand
or a hundred thousand words in your
vocabulary and that makes everything
more complicated so let's keep it really
simple for now okay so I'm gonna try to
frame this with logistic regression
which I think is probably familiar to
many people okay so what if we wanted to
solve language modeling with a
traditional logistic regression this is
gonna look a little bit funny but you
would do something like this
you would say okay let's represent each
word as a one-hot vector right so this
is sort of like an indicator for the presence or absence of each word in your vocabulary at each one of these three positions, right? So in the previous-word position, where the word poach is, that's a one-hot vector: one of the five words in the vocabulary is present, so there's a one, and the rest are not present, so they're zeros, right? And you could say the same
thing about the word to and the word how
right and then you might concatenate
those to get a representation with 15
numbers 15 features of the context right
this is a like 15 dimensional
representation of the context
and we want to use these 15 features to
predict what the next word is going to
be. So what does that look like? It's kind of a funny-looking logistic regression, but this is multinomial logistic regression, sometimes called softmax regression. There are five possible words, each word in the vocabulary, for which we'd like to predict a probability, right? And on the right there ... I guess I can use my mouse to point ... this represents the model parameters. Each of these columns you can think of as representing
one of the words in our vocabulary
right and so in logistic regression
you take your feature vector which is
those 15 dimensions and you multiply it
by these learned parameters for each word in our vocabulary, and
that gives us some score some scalar
value for each word so the word chicken
has some score and the word eggs has
some score and so on and then we apply
the usual normalization which is that
little function there where we exponentiate
each score and then in the denominator
we have the sort of total sum of all the
exponentiated scores and then we get a
probability distribution and so let's
say our model then predicts sort of this
picture of the probability distribution
where eggs has some reasonably high
probability chicken because you can also
poach chicken and the other words have
some low probability, right? Okay, so does that look okay?
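Here's a minimal numpy sketch of that setup; the five-word vocabulary and the context are from the slide, but the weights are random stand-ins, since the point is just the shape of the computation:

```python
import numpy as np

vocab = ["chicken", "eggs", "how", "poach", "to"]  # toy 5-word vocabulary
word_to_id = {w: i for i, w in enumerate(vocab)}

def one_hot(word):
    v = np.zeros(len(vocab))
    v[word_to_id[word]] = 1.0
    return v

# Context "how to poach": concatenate three one-hot vectors -> 15 features.
context = ["how", "to", "poach"]
features = np.concatenate([one_hot(w) for w in context])  # shape (15,)

# Softmax regression parameters: one 15-dim column per output word.
rng = np.random.default_rng(0)
W = rng.normal(size=(15, len(vocab)))  # untrained, random for the sketch

logits = features @ W                          # one score per vocab word
probs = np.exp(logits) / np.exp(logits).sum()  # softmax normalization
print(dict(zip(vocab, probs.round(3))))
```

Okay, now let's do the same thing with word embeddings.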
so this is probably something you've
heard of and
this is something that I think I can say was invented fairly recently, like in the last ten years anyway. And it's come to have
like incredible importance in natural
language processing and also in some
other areas too and it's really a simple
idea the idea is instead of using a one-hot vector to represent each word let's
just arbitrarily create some
two-dimensional representation for each
word, right? So instead of having this one-hot vector with a bunch of zeros and a single one for each word, let's say every word has two parameters that
describe it and we're going to learn
those as well
and those are gonna be sort of latent in
the model and so now I'm gonna add two
parameters to the set of all the
parameters in the model for each word so
chicken has these two parameters eggs has these two parameters and so on and the
first step is, instead of creating the one-hot vector, I just go to my embeddings
table and I look up the current value of
the embeddings and I write them down
right so each of these three words in
their context has some value for its
embeddings then I'll concatenate them
just like I did before and then
everything looks the same, right? So now the features that are going into my model: before I had 15, now I happen to have six (these are sort of arbitrary numbers). And now I do the same thing,
right it's still logistic regression
it's just that now the features rather
than being fixed are learned right and we
can learn them using back propagation so
as sort of an extension of the kind
of gradient descent you would use to
train your logistic regression model you
can learn them in two steps: first you estimate the updates to your model parameters in the logistic
regression layer and then you update the
parameters in the embeddings layer right
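As a hedged sketch of that change, here's the same toy model with the one-hot lookup swapped for a learned embeddings table (random initial values standing in for learned ones):

```python
import numpy as np

vocab = ["chicken", "eggs", "how", "poach", "to"]
word_to_id = {w: i for i, w in enumerate(vocab)}

rng = np.random.default_rng(0)
# Learned parameters: a 2-dim embedding per word, plus the softmax layer.
embeddings = rng.normal(size=(len(vocab), 2))  # updated by backprop too
W = rng.normal(size=(6, len(vocab)))           # 3 words * 2 dims = 6 features

context = ["how", "to", "poach"]
# Look up the current embedding of each context word and concatenate.
features = np.concatenate([embeddings[word_to_id[w]] for w in context])

logits = features @ W
probs = np.exp(logits) / np.exp(logits).sum()
print(dict(zip(vocab, probs.round(3))))
```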
okay so here's a little picture I tried
to draw of what that training looks like
right so in the previous picture I was
asking the model to make a prediction
about the next word right during
training you're saying I have a lot of
examples of sequences of words and I
want to use that training data to
estimate these model parameters right so
in training
I actually know the identity of the next word, right? And for a second I hide it from the model and ask the model to make its predictions; that's what these guys are. And then I say, well,
actually I knew the next word was eggs
all along so that sort of looks like
this distribution right it's a
distribution with a 1 and a bunch of
zeros and then I compare these two
distributions using a metric in this
case cross-entropy
right so I compare these two
distributions and I observe like okay
these distributions are not exactly the
same I'd like to update all the
parameters of my model so that next time
around if I were to see this exact same
example I come a little bit closer to
predicting the true distribution right
So, this isn't an animation, but if I go back and forth you can kind of see that something like this happens: the value of the cross entropy is propagated back down through the model and makes small updates to all the parameters, so the next time around I get a little bit more confident about eggs.
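Continuing the toy embeddings sketch above (same variables), here's roughly what one such training step looks like; the gradient formulas are the standard ones for softmax plus cross-entropy:

```python
import numpy as np

# The actual next word was "eggs" (index 1) -> one-hot truth distribution.
truth = np.zeros(len(vocab))
truth[1] = 1.0

loss = -np.log(probs[1])        # cross-entropy against the true next word

# For softmax + cross-entropy, the gradient w.r.t. the logits is simple:
grad_logits = probs - truth                 # shape (5,)
grad_W = np.outer(features, grad_logits)    # gradient for the softmax layer
grad_features = W @ grad_logits             # backpropagate into the features

lr = 0.1
W -= lr * grad_W                            # update the softmax layer...
for i, w in enumerate(context):             # ...then the embeddings layer
    embeddings[word_to_id[w]] -= lr * grad_features[2 * i : 2 * i + 2]
```

Any questions?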
That's to keep the diagram simple, but in practice people do not use two dimensions; people usually use something like a hundred to a thousand. And you
know you could ask like well what do
these dimensions represent right we've
sort of obfuscated things in the model
right like you know often you use the
logistic regression so that you can sort
of say something concrete about how
important each factor is and what I'm
doing is I'm saying well actually I'm
gonna represent poach with a hundred
numbers, and when I do that I
exert no control over what each one of
these dimensions is gonna represent and
I will let the model decide during
training what they mean and some of them
may end up with some interpretable
meaning and some may not yeah
Yeah, good question; people usually just choose something. I mean, this is like a hyperparameter of the model, right? And there are trade-offs associated with it: going from a hundred to two hundred dimensions, you
know for the task that you're trying to
do if you really care about language
modeling you know increasing the
dimensionality might give you better
performance but it might slow things
down or you know
yeah yeah that's a good question I've
got some pictures so people have thought
a lot about how you might evaluate the
quality of an embedding right and
actually for the purposes of this talk
I'm not that interested in the
embeddings themselves I'm interested in
something else but it's worth talking
about for a second so one thing that
people have observed is that the
embedding dimensions are meaningful in
some respect, and one way to show that is to look at the nearest neighbors of certain words, right?
and well one thing I have to explain is
that usually people don't use two
dimensions right they use like a hundred
or 500 dimensions, and then you need to project that down to just a couple of dimensions, and they do this fancy nonlinear projection so that nearest neighbors are preserved; that's what that t-SNE thing is. Anyway, if you then project all the words in your vocabulary into some big space, you can zoom in on that space and observe some kind of neat things,
like look all these number related
things are really close together and
even if like you know you have numbers
spelled out like five and six or you
have numbers you know in decimal format
like 10 and 20 and here's half and the
model has learned that all these things
go close together, and also these sort of number-related words like some and many end up in generally the same space. Also on the right
you have all these sort of related words
like leader and president and chairman
and so on. So one thing you can do is examine the space to confirm that things which are similar do indeed appear close together.
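That projection step looks something like this with scikit-learn's t-SNE; the embedding matrix here is a random stand-in for a real trained one:

```python
import numpy as np
from sklearn.manifold import TSNE

# Stand-in for a trained embedding table: 1000 words, 100 dims each.
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(1000, 100))

# Nonlinear projection down to 2 dims, trying to preserve neighbors.
coords = TSNE(n_components=2, perplexity=30).fit_transform(embeddings)
print(coords.shape)  # (1000, 2) -- ready to plot and zoom in on
```

Another thing that people like to do is look at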
these analogy-like relationships. These are like SAT analogies: France is to Paris as Italy is to blank, and then you ask the
model if it can solve the analogy right
And the way you do that is, well, look at this example up here: you observe that man is embedded at some point in this space and woman is embedded at some other point, and you can then draw the
vector that connects them and then you
can apply that same vector to uncle and
see where you end up like what is the
nearest thing to uncle along that vector
and it turns out that indeed the word
aunt is very close right so that's sort
of the model solving this analogy and
you can ask how often it solves the analogies. So here are some examples the model is getting close to right, like France is to Paris as Italy is to Rome, and so on; big is to bigger as small is to larger; cold is to colder ... you've got some
interesting things like Einstein was a
scientist and Messi is a midfielder and
so on. I think if you look closely you can find some mistakes the model makes, like here: Berlusconi is to Silvio as Putin is to Medvedev; well, not quite right, but not too bad. Okay, I'm gonna move on to the next part, unless you have some questions.
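The arithmetic behind those analogy examples is simple enough to sketch; `embeddings` here is assumed to be a dict mapping words to trained vectors:

```python
import numpy as np

def solve_analogy(a, b, c, embeddings):
    """a is to b as c is to ? -- the classic vector-offset method."""
    target = embeddings[b] - embeddings[a] + embeddings[c]
    best, best_sim = None, -np.inf
    for word, vec in embeddings.items():
        if word in (a, b, c):
            continue  # exclude the query words themselves
        # Cosine similarity between this word and the offset target.
        sim = vec @ target / (np.linalg.norm(vec) * np.linalg.norm(target))
        if sim > best_sim:
            best, best_sim = word, sim
    return best

# e.g. solve_analogy("france", "paris", "italy", embeddings) -> "rome",
# if the embeddings were trained on enough text.
```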
okay so that's sort of a primer on the
idea of embedding so the next part is
about learning to place similar objects
close together ... Okay, so the history of NLP involves a lot of training models from small labeled data sets. So one thing that I, and many people I know, have spent a lot of time working on is the parsing problem:
given a sentence can you train a machine
to diagram it like this right to break
it down into the verb phrase and the noun phrase and where the prepositional phrases are. And over time
we've made a lot of progress with this
but it's hard to find labeled sentences
like sentences labeled with parse trees
right and so what we've done over time
is like pay highly trained experts to
annotate some of them. So we took a corpus of The Wall Street ... I'm saying we, but it obviously was not me ... so there's an annotated corpus of The Wall Street Journal, about a million words. Twenty years ago that was a pretty big size, but these days that's quite small. So, a million words annotated with parse structures, and
you can take that data and try to learn
these algorithms ... and, for example, try to improve translation by somehow training a parser and inserting it in the right place in the translation model. A lot of people spent a lot of time doing this with very limited
success like it was very hard to get
even the best parser to make any
difference to a translation system and
part of the problem is just that the
translation system is trained on so much
more data that it sort of is able to
learn what it needs to know about these
sort of sub components like parsing
okay
Alright, I don't think I want to say anything else about that. Okay, so I'm gonna tell you about this model that I'm gonna call the dual encoder model. I guess one last thing to say about translation is that translation is a generative task: given a sentence in English, can you generate text in French. What I'm
concerned about is this retrieval
problem so it's actually not generative
I just want to place objects close
together in space so it's a little bit
different okay so here's a like overview
schematic of this thing I'm calling a
dual encoder model so the idea is that
an encoder is something that represents
an arbitrary object as a fixed
dimensional vector so I just told you
how to represent words as fixed
dimensional vectors let's like scale
that out to representing whole sentences
or maybe images as fixed dimensional
vectors right the encoder is like one of
those things in the blue box and the
dual encoder takes two fixed dimensional
vectors produced by two possibly
different encoders and tries to put
similar ones in the same space right and I'll
try to describe how that works in a second
Here's a sort of zoomed-in view of what a sample encoder neural network might look like. It looks similar to before: maybe you take each of your words and represent each word with a two-dimensional embedding, then at the next layer you combine the embeddings, maybe you average them; that's sort of a
standard thing that people do and then
you can add sort of arbitrary hidden
layers that sort of add extra
computation on top
All these arrows are indicating that this node right here is some learned combination of what's at the previous layer. Okay, so there are all kinds of different possible encoder neural networks, and I'm not gonna spend more time talking about them; the important point is that it's some neural network that generates a fixed dimensional representation of an object. So that's an example for text; the analogous thing for images ... oh sorry, question?
I'll talk about it too I promise you can
imagine something similar for images I'm
not going to describe it now it involves
convolutional neural networks which
basically is like sliding this little
window over regions of the image and
extracting features and you can use that
to also get a fixed dimensional
representation so let's set that aside
for now but you know here's like a
schematic for the dual encoder that
would place text and images close
together right so here you have your
question encoder and your image encoder
each produces a fixed dimensional
representation and then you have your
similarity function that says like how
similar these two objects are.
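Here's a hedged numpy sketch of that two-tower shape, with a bag-of-embeddings text encoder on one side and a stand-in image encoder on the other; all the parameters and inputs are random placeholders, since the point is just the structure:

```python
import numpy as np

rng = np.random.default_rng(0)
EMBED_DIM, ENCODE_DIM, VOCAB = 32, 64, 10_000

# Text tower: embed each word, average, then a hidden layer.
word_embeddings = rng.normal(size=(VOCAB, EMBED_DIM))
W_text = rng.normal(size=(EMBED_DIM, ENCODE_DIM))

def encode_text(word_ids):
    avg = word_embeddings[word_ids].mean(axis=0)  # average the embeddings
    return np.tanh(avg @ W_text)                  # fixed-size vector

# Image tower: in reality a convolutional network; here just a projection
# from some precomputed 2048-dim image feature vector.
W_image = rng.normal(size=(2048, ENCODE_DIM))

def encode_image(features):
    return np.tanh(features @ W_image)

def similarity(u, v):
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))  # cosine

q = encode_text(rng.integers(VOCAB, size=5))      # fake question word ids
img = encode_image(rng.normal(size=2048))         # fake image features
print(similarity(q, img))
```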
Okay, so I have to back up for a second and tell you about this idea called the sampled softmax; this is sort of a one-weird-trick thing for machine learning. So let me back up to the language modeling case.
so this is the picture that I showed you
before where we have the truth
distribution that comes from the label
like the actual next word which was eggs
and the predicted distribution that comes from the model, and then we compare these two before we do back
propagation now in order to get this
predicted distribution remember we need
to compute a value using the model
parameters for every single word in our
vocabulary so that was no big deal when
we had five words in our vocabulary but
if we have a hundred thousand or a
million words in our vocabulary then
every time we want to make a prediction, we have to do a lot of computation: we have to compute this scalar, which people call a logit,
that represents each individual word in
the vocabulary and then do this
normalization and then you get your
distribution that's a lot of computation
And so the sampled softmax trick is to say, okay, during training let's compute the prediction for the correct label, because that's important, we want to know what the model thinks is the probability of eggs, and then let's compute the predictions for some random words. So let's choose like a thousand random words, like one percent or a tenth of one percent of the words in the vocabulary, and we'll estimate the softmax distribution given that random sample; for everything else we'll say it's just zero. And then we can still compare these
two distributions sort of you know in an
inexact way and it turns out that's good
enough and you can then train a language
model much faster like many orders of
magnitude faster and so that's led to a
lot of improvements in language modeling
but I'm going to exploit this in a much
more important way for to answer the
question that you just asked for how to
train the dual encoder model so now
instead of predicting the next word I
want to predict the similarity between
you know sentence 1 and sentence 2 or
question 1 and question 2, right? So in this example, on one side I have how to poach eggs, and on the other side I have poached eggs recipe, or I forget what my example was, let's say it's poached eggs recipe. All right, so I have these two examples and I want them to be close together. So I have
this positive pair and I want these two
things to be close together and that's
sort of all I know but let's say I have
some very large database of all kinds of
other sentences or other questions like
in Quora there's 10 million questions and
so I will make the assumption that if I
sample a few of them at random they're
sort of negative examples right if I
sample some random negative like where
is Fiji the chances are that that's not
synonymous with how to poach eggs right
so this is the sampled softmax trick applied to dual encoders.
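A minimal sketch of that loss for one positive pair, assuming the two texts have already been run through their encoders:

```python
import numpy as np

def pair_loss(q_vec, pos_vec, neg_vecs):
    """Sampled-softmax-style loss for one positive pair.

    q_vec:    encoded query, e.g. "how to poach eggs"
    pos_vec:  encoded positive, e.g. "poached eggs recipe"
    neg_vecs: encodings of a few questions sampled at random from the
              database, assumed (usually correctly) to be negatives.
    """
    logits = np.array([q_vec @ pos_vec] + [q_vec @ n for n in neg_vecs])
    probs = np.exp(logits) / np.exp(logits).sum()
    return -np.log(probs[0])  # push the positive above the random negatives
```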
yeah
yeah that's a good question so I haven't
told you yet where to get training data
for this all I've said so far is assume
you have positive pairs right so let me
come back to that; we'll talk about training data a little bit later, but my
assumption is that you can get lots of
these positive pairs and then once you
have positive pairs you can use this
sampled softmax trick to find random
negatives that sort of stand in for the
whole world of possible negatives now in
this case the sampled softmax is
crucially important because we can never
enumerate all the possible sentences
that are negative examples like with our
language modeling case there was some
fixed vocabulary and so we could
enumerate all the examples but here we
can never enumerate them so sampled
softmax is even more important okay by
the way here's sort of an alternate
model architecture so another way this
is a bit of a non sequitur but it's
important so another neural network
architecture for predicting the
similarity between two questions is
something like this so maybe you can see
the difference between this and what I
showed before I can back up if you'd
like so here I have these two inputs
then I do the embedding ... I produce some embedding for the words on either side, then I do a sort of combined embedding: I concatenate
the representation of sentence 1 and
sentence 2 and then I have some like
arbitrary hidden layers some like extra
neural net computation on top and
eventually it's a regression problem and
I want to predict how similar these two
things are so there's this diagram and
then on the other hand there's ... well I
don't have a great picture but a diagram
like this where there's sort of two
separate towers right
and why is this an important distinction
well remember that I want to use these
models to do retrieval right which means
that I need to use the model to create a fixed, static representation of
everything in my data set so if I have
my 10 million Quora questions I need to
use the model to represent all 10
million of them and have them just sit
there the problem with a model like this
is that I'm required to see both the
left and right side to make the
prediction and then it becomes very
expensive to do retrieval right because
then I need to compare my current query
with every single thing in that 10-million-item database, right? But the other model architecture, where I have these two separate towers that are only connected by a distance, allows me to index all of the items in my data set and then use nearest neighbors to find the things that are closest.
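That's the whole payoff of the two-tower shape, sketched here with hypothetical names (`encode_text` is the encoder sketched earlier; `database` stands in for the indexed items):

```python
import numpy as np

# Offline: encode all database items once and keep the matrix around.
# item_matrix = np.stack([encode_text(item) for item in database])

# Online: encode only the query; one matrix multiply scores everything
# (or hand the query vector to an approximate nearest neighbor index).
def top_k(query_vec, item_matrix, k=10):
    scores = item_matrix @ query_vec   # dot product with every item at once
    return np.argsort(-scores)[:k]     # indices of the best matches

# With the cross-input architecture this precomputation is impossible:
# every query would have to be paired with all 10M items through the
# full network before any score could be produced.
```

Okay, by the way, maybe not necessary to talk about, but it turns out that you can kind of compose these things together and train them all together. So, you know, I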
told you maybe you have some data set of
similar questions right but maybe I have
a data set of you know matched questions
and images or maybe you have a data set
of matched questions and documents that's starting to sound a little bit like what
Google web search has right
It turns out that you can train all of these encoders together in a way that sort of improves all of them, and what ends up happening is that we're able to encode all these objects in the same space; we sort of force all these objects to end up in the same space by training all these things together. By the way, I threw
in some sort of interesting retrieval
examples. These are some examples that I noticed from Quora where there's very little word overlap, right? So given this phrase on the left, what does diatonic semitone mean in music, the model retrieves what is a half step called. Not a perfect one-to-one correspondence between those things, as maybe my brother would tell you, but they're pretty similar, and it's cool because they don't
share any common words right or what is
the average price for portable toilet
rental and how much are porta-potties
right so like again very little overlap
or how to increase your cerebral
capacity how to grow mind power so it's
able to do this kind of cool thing: not only can we place words with similar meaning close together, but we can place whole
sentences with similar meaning close
together so that's kind of cool okay
questions about that
all right I want to talk briefly about
this last step just because it's sort of
the last thing that's important to make
this all work
although this is not really something
that I work on personally okay so this
is the problem of finding nearest
neighbors and you know you could imagine
if you've got these 10 million Quora
sentences and they're all embedded in a
hundred dimensional space that's like
this really complex space and then I'm
coming along with a new question a new
question and I want to embed it and I
want to find it in that space I need to
be able to do that fast enough that
search is actually viable right so
people tend to call this problem
locality-sensitive hashing I guess
that's sort of the name for the solution
and it's a form of approximate nearest
neighbor search where you create these
hash codes that correspond to little
areas so that you can quickly retrieve
the hash code for a new object and then
quickly go look up what else is in that
area, just like a zip code, right? So it's sort of like constructing a zip code system from scratch, where you don't know the map, but it's analogous: in the zip code system, the first thing we do is this partitioning with the first digit of the zip code, right? Like the
whole west coast is like the nine zip
code New England is like the zero zip
code and so each successive digit of the
hash code refines the area right until
in this picture like you get down to
these very local areas. So if someone says, oh, this object is 94704, that tells you a lot about the object, because now you know it's basically on Berkeley campus.
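One common version of this idea is random-hyperplane LSH; here's a hedged sketch (dimensions, bit count, and data are arbitrary stand-ins), including the final exact re-ranking within the retrieved bucket:

```python
import numpy as np
from collections import defaultdict

rng = np.random.default_rng(0)
DIM, BITS = 100, 8

# Random hyperplanes; the sign pattern of the projections is the hash
# code, playing the role of the zip code.
planes = rng.normal(size=(BITS, DIM))

def hash_code(vec):
    return tuple((planes @ vec > 0).astype(int))

# Index: bucket every database vector by its code.
vectors = rng.normal(size=(100_000, DIM))  # stand-in for 10M questions
buckets = defaultdict(list)
for i, v in enumerate(vectors):
    buckets[hash_code(v)].append(i)

# Query: look up the bucket, then rank exactly within that small area.
query = rng.normal(size=DIM)
candidates = buckets[hash_code(query)]
ranked = sorted(candidates, key=lambda i: -(vectors[i] @ query))
```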
Okay, and I guess the last thing I want to say is: does it work? And actually yes, it works very well, sort of surprisingly to me, because I hadn't played around with this before. But it is an area of active research, like scaling up to many billions of candidates; sometimes you have more than can fit on one machine, and so you need to do this in a distributed way. Basically, if you want to do real search, you're gonna need to do this at a very large scale. Yeah.
Yeah, that's a good point, I hadn't thought about that way of putting it; I agree there is some similarity, but what you might end up with could be quite different, because of the way that things are represented. Well, I don't know; in this little plot here I did the following ... let me describe this
experiment I don't think it's gonna
answer your question but I'll describe
the experiment anyway so here I wanted
to know how much does this LSH the
locality-sensitive hashing diverge from
exact search right so if I'm willing to
spend the time I can compute the exact
distance between my query and every
single thing in the data set as long as
the data sets not prohibitively big
right? And so I said, okay, let me choose a bunch of different sized data sets, like a hundred thousand objects, up to, in my example, a million objects. And that's reasonably small ...
like it's a big enough number that
retrieval is important but it's small
enough that I can do brute force
comparison, where I take each item and compare it to every single thing in the data set. And I want to compare this exact search
to what I get with locality-sensitive
hashing and you might expect that as the
number of candidates grows the
divergence would grow but actually that
wasn't the case: it turned out that as the number of candidates increased, the divergence looked either flat or slightly decreasing. That caused me to be impressed with these hashing techniques and their ability to solve this needle-in-a-haystack problem, finding the things which are close in this semantic space. One thing to add, about the way LSH often works:
you might use the hashing to find all of
those sort of nearest neighbors in some
small circle right or some small sphere
and then you might still compute sort of
one-to-one comparisons within that
little area to sort of get a more
precise ranking
right
right yeah I think the place where this
kind of technique can give the most
benefit is improving recall right like
things that you wouldn't have been able
to find. I mean, I don't mean to suggest that search doesn't work well; it works great. But I think it can improve recall. I think that, for example, in the sort of long tail of stuff on the web, search engines tend ... partly as a result of the inverted index ... to be biased towards the head, and this might be a way to expose things that are more relevant in the tail. Not a hundred percent sure, but that makes sense to me. Yeah.
yeah
right
Yeah, I don't know; I haven't worked on the document retrieval problem.
I think there are people at Google who
are thinking about this but it's early
So one issue with long documents is that they're long, and the models that we tend to train for text usually operate at a smaller scale ... like for translation, we don't do translation at a large scale, we do translation one sentence at a time, because it breaks down otherwise: if we try to use the same sorts of models, say an LSTM over a thousand words, it stops working. So we don't have good models, I
think for sort of dealing with longer
pieces of text I think this will be a
big area of research like I think it's
exciting I think you might need sort of
a more structured representation of
what's in a document also one of my
assumptions in this whole talk was like
you're gonna create a fixed dimensional
representation of all the objects and
that might break down when it comes to
document search because you know lots of
very different queries can resolve to
the same document like the Wikipedia
article for Barack Obama like that
responds accurately to a huge variety of
queries and so to say that we have to
represent it by one fixed dimensional
vector seems like too much of a
shortcoming so what's the right approach
like I don't know what the right
approach is so I think it'll be a while
before this is actually used to
represent whole documents I think what
will happen first is we'll find ways to
represent answers and do question
answering better. So one thing I'm interested in is: okay, I can represent all these different objects in the same space; let's index lots of sentences and see if I can do question answering by finding the sentence that's closest to the question that you asked, by having a question encoder and an answer encoder. Then we don't necessarily have to encode whole documents, just sentences. So anyway, I don't really have an answer to your question, but ...
Yeah, that's right, that's right. I think we've also trained ourselves to search using keywords; over the last 20 years we've evolved with Google, learning how to issue queries that are likely to get the information we want, and so we've created this query language that's sort of this new thing. And Google, by doing stuff like suggest, tries to get you to ask it questions in natural language, but
there's all these cases where because
the search system doesn't understand the
way words compose together, it's often hard to find what you're looking for if it's a little bit more complicated. Right, like a
famous one that I've seen is like
Macintosh controlled infrared remote
right so that's like this series of four
complicated nouns and so Macintosh
controlled infrared remote like that is
a phrase that indicates that you're
looking for software that can control
your TV right but if you search for that
in Google you're gonna get a bunch of
remotes like ... not software right so
Google doesn't understand how all those
words like compose together when just
doing keyword search so this kind of
like neural retrieval at least has the
opportunity to like learn how to compose
stuff together and then represent it in
the right part of the space okay I only
have a few more slides okay I wanted to
just sort of mention a few sort of
high-level things so about what this
kind of retrieval could mean so one
thing this could allow is retrieval
across languages like right now every
retrieval system that I know of is
language specific: you search in English, you get stuff in English; you search in Spanish, you get stuff in Spanish. But we can train models that
embed all languages in a shared space
right
in fact I've done experiments like this
this is something that's like always
excited me about language processing and
always seemed far away and now it's like
actually here and this is a picture from
actually from translation yeah question
Yeah, sure. I mean, there are words with no translations in other languages, but you could always sort of summarize the meaning with enough words, right? So as long as we're not limiting ourselves to mapping one word to one other word, a model like this might be able to embed some word with no English translation near an English phrase that roughly means the same thing. You can
always do nearest neighbor search and
find you know for this Russian word or
this Russian phrase what are the English
phrases or words that are sort of
closest and there's no reason why we
can't do that and so you know this can
allow you to issue a query in English
and ask for retrieval in any language oh
I was going to show you this ... so this
picture is actually from a translation
paper ... but the idea is the same and
so if you zoom in on this like little
region right here you see whole
sentences; well, I don't read Korean or Japanese, so actually it's not that exciting, but hopefully you'll take the author of this picture's word for it that these are translations. And you see that
like the translations for these three
languages are sort of clustered very
tightly together in this large space and
we can do the same thing with these dual
encoder models. The other thing, which I already talked about, is that this also allows retrieval across modalities, where we can do document, image, audio, or video retrieval given a query. We
don't need to constrain ourselves to
keywords anymore we can index all kinds
of crazy things, right? I was talking to my brother the other day about indexing: you could index audio clips or whole songs; you
know I don't know the details of how
this would work but the general
framework still applies and then for an
arbitrary query you could find the
nearest neighbors in that space right
like you know
I could ask for the sound of a bird chirping in South Hall, and maybe we'd find some audio recording, maybe not from South Hall, but something from somewhere close by. And so there's no reason why
these sorts of models can't embed all
kinds of different sorts of objects in
the same space I mentioned before this
allows for learning complex
relationships between objects and then I
was also arguing that this would
increase retrieval in the long tail I
wanted to say quickly one thing I didn't talk about ... I think this is the last slide ... which is, well, who has this data? What kind of data is necessary to do stuff like this, and who has access to
it and certainly Google is a good place
to do this research because the search
logs have all this data in it right like
I can look in the search logs ... I can exploit the fact that Google search works really well to get lots of examples of queries and documents that people clicked on, to connect queries and documents, or queries and images that people clicked on, to connect those two modalities. I can look for
pairs of similar queries or pairs of
similar images but actually I think
there are lots of creative ways to get
really large data sets of positive pairs
like from public data. It is true that Google has this really high quality data, but I think anyone could do this research effectively. So one example
I can think of is Reddit. Reddit contains these sort of forum-like threads, which are in the public domain, and there are questions and answers and then more discussion below, and you can train one of these models to place parent and child in the same space, and that sort of has the same effect as paraphrasing. If you can place a parent and a child
in the same space well actually there's
sort of a longer argument but you can
sort of think of it as question
answering right like you can take a
parent as like a question and a child as an answer.
You can train one of these dual encoder models, with the same sort of techniques I described, to place this text data in one space, and Reddit has billions of pairs, so there's lots of data there. I think there are other sources of data that are also in the public domain that you could be creative in exploring.
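As a purely hypothetical sketch of mining positive (parent, child) pairs from that kind of threaded forum data (the `threads` structure here is made up):

```python
# Hypothetical thread data: each parent post with its direct replies.
threads = [
    {"post": "how do I restring a rare guitar",
     "replies": ["loosen the old strings first, then ..."]},
]

pairs = []
for thread in threads:
    for reply in thread["replies"]:
        # Treat the parent as a question and the child as its answer;
        # these become positive pairs for dual encoder training.
        pairs.append((thread["post"], reply))
```

Okay, so that's all I have; you can ask more questions if you want.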
you may have already asked all your
questions
So I have a quick question: when you do this similarity measurement between two documents, are your embeddings the same size because you have the same number of words, or something like that? Like, the embedding was two-dimensional for the words and then you concatenated them; if they're not the same size, would you use an RNN for that? ... Um, the only thing that I did
was ... let's see, what's a good picture here ... the embeddings are concatenated, but let's see, in this model, this is an encoder that I said produces a fixed length representation: there's this averaging layer, and so this is going to be the same size regardless of the number of words.
so in this feed-forward model this is a
way to sort of factor out the number of
words right sure you can use an RNN as
well people have found so far that they
haven't been able to get much better
performance out of RNNs; but yeah, there are all kinds of other models that you could stick in its place. I kind
of put in the simplest thing I could
think of here and if you have lots of
data it turns out the simplest thing
works quite well
So thank you very much; I really liked this framing of search as things that are close to each other in some space; new to me, maybe not new. I
think that the one thing that sticks
with me is this idea that the
dimensionality of the embedding space is
a hyper parameter do you think that
there is any way to estimate an
appropriate dimensionality empirically
through some sort of corpus I know that
obviously we can't anticipate all
searches and often the value of search
is that some of them are novel but do
you think that there's any method yes I
think there's a method but ... and I think
the method is sort of strictly practical
right like typically in these models
there's something that you're trying to
do like retrieve stuff and you'll create
some evaluation set, and you can increase the embedding dimensionality until you don't see any improvement anymore, and that's
the right method but your question is
interesting because it suggests somehow
that there is a like a reasonable or
right number of dimensions to represent
sort of words like vocabulary words or
you know images or whatever it may be
And yeah, I don't know how much I want to say about it, but that suggestion represents some amount of bias that we have coming in: that there is some right number of dimensions, of features, that we use to describe anything in the world. Yeah, well, I don't know of any way to ... I'm a little bit confused about what you're asking, but the way I think about these embeddings
is, sort of, as strictly functional. I was just trying to say that it is true that after you've trained this model, if you go and inspect what each dimension represents, you might find some interesting correlations between what's in, say, dimension 7 and some characteristic of words; you might find that dimension 7 corresponds to the gender of pronouns, for example, but that may not be the case for all the dimensions. Anyway.
sure yeah
Two very quick questions. There is a long history of interlingua research: there's Marti's thesis grandfather Roger Schank, and there's people like Sergei Nirenburg around the same time, which is a more sophisticated one. First of all, have you looked at any of this stuff? ... Only insofar as it was mentioned to me when I learned about translation in graduate school, but I don't know much ... yeah ... tell me more ... Well, no, I'm asking. Okay, you have answered, so ... I don't yet ... The second question is about the use of this word neural. For example, some of the top minds in neuroscience are meeting up in Simons Hall; the issue they're trying to work out is what the basic algorithms the brain uses are. What do you think the basic algorithms the brain uses are? When you use that word neural, what do you mean?
when I use the word neural I'm only
using it sort of because of the
convention of referring to like these
kinds of things as neural networks like
I think in my head there are some sort
of loose associations between what this
is and maybe what we know about what
happens in our brains, but it's not specific for me; I'm just sort of using it as a descriptor.
I don't know ... do you want me to say a little more about the sort of association that I
understand ... if you'd like to sure ... okay
yeah so I think one crucial part of
neural networks sort of in the machine
learning setting is the idea of
nonlinearities so actually I sort of
left out this discussion from
my description of this picture but you
know, what's happening here at this node ... often we refer to this as a neuron, right? It takes input from these two guys, and this value is a linear combination of whatever's here and whatever's here: this thing times some learned parameter plus this thing times some learned parameter gives you
this and then what comes out of there is
like you apply this nonlinear function
and the nonlinear function is designed ... I
don't want to say it's like ... I mean it's
sort of inspired by what happens in your
brain where like a neuron receives some
stimulus input from other neurons and
then nothing happens up to a certain
point at which point it fires and sort
of sends its information on to the next
neurons, right? And so there's this loose analogy where that neuron fires its information off: it's representing a 0 up until it receives some threshold of input, at which point it sends its information up to the next layer. And that non-linearity, like
that threshold is really important to
getting neural networks to work in
practice. I think that's something that parallels what happens in brains.
thank you
um so I'm curious I'd like to know what
happens if I ask an ambiguous question
to your system and I've been like
racking my brain to try to think of a
clever ambiguous question the best I
could come up with is where can I buy a
Jaguar. So presumably ... okay, so actually the Jaguar one is a little bit easier for me to talk about ... sorry, Marti tried to rescue me. But the reason I ask is, I would imagine that questions
about cars will be mapped to sort of one
region of this representation space and
questions about cats will be mapped to a very different region, and because this is a metric space, whichever way my question is embedded it can't be close to both of those, so I'm wondering what happens ... Yeah, good question. So the same
problem happens when you're training
just plain word embeddings right forget
about complex sentences like the word
Jaguar itself is ambiguous right and so
when you're training your language model
like does Jaguar end up close to the
cats or does it end up close to the
cars ...
anyway I'm just trying to create an even
simpler version of your question and
it's true this is sort of a shortcoming
of the way that people think about word
embeddings like every embedding has ...
every word embedding has sort of exactly
one point in space that's representing
it one fixed set of parameters and what
you end up getting is just reflective of
what's in the data so if your data is
coming from news and there's lots more car Jaguars, your embedding for Jaguar is just going to be closer to the cars. Similarly, in a
model like this you get the same thing
it's sort of whatever the data shows I
guess a couple other things to say about
it one is when you embed whole sentences
rather than individual words there's
some hope that the rest of a sentence
can disambiguate it right and so the
more context you have the more sort of
disambiguation the model could do and
sure in your case in your example like
maybe we can't do disambiguation and
then you sort of get whatever the
majority class is. But the longer the input, the better; and there's also opportunity for models like this to take additional input, so maybe when we're indexing possible answers we'd also like to include some context information about the document in which we saw that answer, and that document could help to disambiguate, sort of decide whether this is the sort of Jaguar that appears near the cats or the cars.
well
There is some work on this; I mean, there are possible solutions to this problem, and we don't want to have to throw up our hands. For example, with your word embeddings you might
decide like okay words have different
senses so I'm gonna make multiple copies
of each word embedding and I'll allow
them to move sort of to different parts
of the space and people have done some
research like that so there are possible
solutions
yes
So thanks, Dan, so much for a great talk. My question is kind of related to Paul's, but I want to think it through from the perspective of the training data and the implications the training data has for the kinds of similarities that you're learning. And to
take the example of search engines: there's a lot of different kinds of information seeking behavior that people use when they go to a search engine, and a lot of different kinds of strategies they use when they issue queries. Some of it is this classical case where you search for a question you want answered, or you search for a string you want some information about; sometimes it's just typing Twitter into Google so you can click on that first link rather than typing in twitter.com; or it's just foraging, where you want to know something about Fiji, so you type in Fiji and just see where you go. I'm curious to
have you talk about where real training data comes from for learning similarity for these kinds of query models, and how the choice of where that training data comes from can influence whether or not you're privileging or biasing these kinds of similarities toward one of these modes of information seeking behavior. In short, how does the training data influence the representations that you're learning, as a function of the data? ... Yeah, good
question I'm not sure I can give a good
answer you know at Google the training
data is mostly going to come from the
search logs and that definitely means
that there's a bias and the bias is
towards the way Google search works
today so there's this interesting kind
of interaction between like well we want
to train a model that could in principle
replace or you know improve on Google
search but we're sort of limited to data
that we can draw from Google search so
and I'm not sure quite how that influences what we end up getting. But
one thing that's interesting about these models ... to be a little more concrete, I'm making this claim that one nice thing about these models is that we can retrieve from the long tail, right? Like you ask some complex question about restringing rare guitars, and I surface this forum buried in some Reddit thread from 10 years ago where people are talking about exactly this. Google doesn't work that way today, but in my mind it could. But
how are we ever gonna get there if I'm
training off of this click data where
people only ever click on one of the top
10 results I think there is a little bit
of hope because the amount of logs data
is huge
like it's really big and the number of
parameters in the model is by comparison
small so because of that bottleneck like
if we can figure out how to like
actually leverage a huge amount of training data with a relatively small number of parameters, we could potentially force the model to generalize well enough, I think, to do the kind of thing that I just suggested. But I think it's a hard problem, and I'm not sure how difficult it will be.
