So welcome back to the fourth lecture of this week. We have been talking about topic models, and in the last lecture we covered the generative model of LDA and how you use Gibbs sampling to estimate the parameters of LDA, using as observations the various documents and the words in those documents. In this lecture and the next we will be talking about different variants of LDA and how we use them for different applications. We may not cover many of these topics in detail, but once you get the idea of what these variants are, then given an application you can go and look more into these topics. I will start this lecture with a short exercise (Refer Time: 01:00) on Gibbs sampling, on how you estimate parameters from a given Gibbs sample, and then we will move on to the variants of LDA.
So let us take this example problem. What are we given here? We are saying that there is a corpus with five documents, with document ids one to five, and five words: river, stream, bank, money and loan, and there are only two topics that you want to estimate. Now, while you are doing Gibbs sampling, at a certain point in time you are given the current assignment of topics to the different words in the documents. What do you see here? The first row indicates that document one contains four instances of the word bank, six of the word money and six of the word loan, and at that point in time all these sixteen words have been assigned to topic one; black circles indicate topic t1 and white circles topic t2. Document two has most of its words assigned to topic t1 and one word to topic t2, and so on. So you are given the topic assignments at a given point in time.
Now your task is to use this snapshot to estimate the different parameters of your model. Remember the two main parameters: one was your theta and the other was your beta. Beta is the probability of a word given a topic, and theta is the probability of a topic given a document. In this example we will try to estimate two different beta values: beta_{money,2}, the probability of money in topic t2, and beta_{bank,1}, the probability of bank in topic t1. You are also given that eta and alpha are 0.1.
Now, if you remember the formula, how do you compute beta_{money,2}? For that you will need your count matrices. Remember we considered two matrices, C^WT and C^DT: C^WT counts which word is assigned to which topic, and C^DT counts which topics are assigned in which document. For beta_{money,2}, the probability of the word money under topic t2, we only need the word-topic matrix C^WT. In terms of this matrix, the estimate for the i-th word and the j-th topic is

beta_{ij} = (C^WT_{ij} + eta) / (sum over all words k of C^WT_{kj} + W * eta),

where the numerator adds the hyperparameter eta to the count of word i in topic j, and the denominator sums the counts of all the words assigned to topic j; since eta is added once for each of the W words, you get the W * eta term. That is the formula with which we estimate the parameter beta_{money,2}.
So let us see how we estimate this parameter from the data. First you will have to construct the matrix C^WT. It has five rows, one for each word: river, stream, bank, money and loan, and you have to find out how many times each word has been assigned to topic t1 and to topic t2, not including the current instance; that exclusion we will skip for now, so we will count all the instances. River is not assigned to topic t1 at all, only to topic t2; counting across the documents, river is assigned to topic t2 nine times. Stream is assigned zero times to t1 and twelve times to t2. Bank is assigned eleven times to topic t1 and sixteen times to topic t2. Money is assigned only to topic t1, seventeen times in total, and loan is assigned to topic t1 thirteen times. That is your matrix C^WT.
Now let us see how we compute beta_{money,2}. In C^WT_{ij}, i is money and j is t2; money is not assigned to the second topic t2 at all, so this count is zero, plus eta, which is 0.1. Then we divide by the sum over all words assigned to topic t2: adding up that column gives 9 + 12 + 16 = 37, plus W times eta, which is 5 times 0.1 = 0.5. So beta_{money,2} comes out to be 0.1 divided by 37.5. Similarly, for beta_{bank,1} I find out how many times bank has been assigned to topic t1, which is eleven, plus eta, 0.1, divided by the total number of word instances assigned to topic t1, which is 11 + 17 + 13 = 41, plus 0.5. So beta_{bank,1} comes out to be 11.1 divided by 41.5.
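To make the computation concrete, here is a minimal sketch in Python that reproduces these two estimates from the count matrix; the matrix values are the counts read off above, and the function name is my own, not something from the lecture.

```python
import numpy as np

# word-topic count matrix C^WT read off the example above
# rows: river, stream, bank, money, loan; columns: topic t1, topic t2
C_WT = np.array([
    [0,   9],   # river
    [0,  12],   # stream
    [11, 16],   # bank
    [17,  0],   # money
    [13,  0],   # loan
])
eta = 0.1
W = C_WT.shape[0]   # vocabulary size, 5

def beta(i, j):
    """smoothed estimate of p(word i | topic j) from the word-topic counts"""
    return (C_WT[i, j] + eta) / (C_WT[:, j].sum() + W * eta)

print(beta(3, 1))   # beta_{money,2} = 0.1 / 37.5
print(beta(2, 0))   # beta_{bank,1}  = 11.1 / 41.5
```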
So like that you can compute all your different betas at this given time point, and you can compute the thetas in the same way. You can take this as an exercise: find out what theta is for document one or document two for the different topics. Now, one more thing that might be interesting. Suppose I ask you: in this iteration, what is the multinomial distribution from which you will sample a topic for a given word, say the first instance of bank in this document? You have to assign a new topic to it; that is what you do in each iteration. For that you will again have to compute the different betas and thetas, but what you have to keep in mind is that you must exclude the current instance. So when you are computing the counts you remove the current instance: if the count is eleven, you subtract one from it, and so on. You compute each value after removing the current assignment, compute the betas and thetas, and use the formula to find the probability of topic t1 and of topic t2; from this distribution you sample a topic. That is something you should keep in mind. So this was a simple example of how you use Gibbs sampling to estimate your parameters.
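As a sketch of this resampling step, the function below resamples the topic of a single word occurrence in the way just described: it removes the current assignment from the counts, forms probabilities proportional to beta times theta, and samples a new topic. The function and variable names are my own; C_WT and C_DT are the two count matrices mentioned above.

```python
import numpy as np

def resample_topic(C_WT, C_DT, w, d, current_z, eta, alpha, rng=np.random.default_rng()):
    """Resample the topic of one occurrence of word w in document d.

    C_WT: word-topic counts, C_DT: document-topic counts,
    current_z: topic currently assigned to this occurrence (excluded first).
    """
    W, K = C_WT.shape                  # vocabulary size, number of topics
    # exclude the current instance from both count matrices
    C_WT[w, current_z] -= 1
    C_DT[d, current_z] -= 1
    # p(topic k | rest) is proportional to beta_{w,k} * theta_{d,k}
    beta_w = (C_WT[w, :] + eta) / (C_WT.sum(axis=0) + W * eta)
    theta_d = (C_DT[d, :] + alpha) / (C_DT[d, :].sum() + K * alpha)
    p = beta_w * theta_d
    p /= p.sum()
    new_z = rng.choice(K, p=p)
    # add the word back with its newly sampled topic
    C_WT[w, new_z] += 1
    C_DT[d, new_z] += 1
    return new_z
```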
Now, we talked about certain applications of LDA in the last lecture. We saw that we can use it for computing similarity between words and similarity between documents; those are some of the very promising applications. But what are some other tasks where you can use these topic models?
First let us see the simplest task: can we model the documents using the topics? That is the most straightforward thing you can do with LDA. Here is a collection that was also one of the motivations with which we started these topic models: a collection of Science papers from nineteen ninety to two thousand. There are seventeen thousand documents and eleven million words, and twenty thousand unique terms after removing the stop words and rare words. Now suppose you run your LDA model on this collection. For running the LDA model you need to tell it the number of topics; suppose you set it to a hundred topics. Once you have learnt your model with a hundred topics, using either Gibbs sampling or variational inference (these are the two different possibilities for estimating your parameters), try to see what your documents look like: what are the topic distributions there? Remember, this was the article we were looking at, "Seeking Life's Bare (Genetic) Necessities", and we found three or four topics there, right? Some data analysis, some genetics, evolutionary biology and so on.
Now, when we run this topic model over the whole corpus, we can find out what happens to this document. The document gets a probability assignment over the hundred topics, and a few topics get a high probability. Then we go back and look at these four topics and the most common words in them, and we see exactly what we were looking for. The first topic contains words like human, genome, DNA, genetic, so it is about genetics; the second topic is about evolution and biology; the third is about diseases and bacteria; and the fourth is about data analysis. These are the four topics that come out on top. And this looks very interesting, because you do not give any information to this model about which topics a document contains; still, by learning from a large corpus it was able to learn the different topics and the topic assignment for a given document. This is a very interesting aspect of LDA.
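If you want to try this yourself, here is a minimal sketch using the gensim library (my choice of tooling, not something prescribed in the lecture); the tiny toy corpus below is only a placeholder for the Science collection discussed above. gensim's LdaModel uses variational inference, one of the two estimation options mentioned.

```python
from gensim import corpora
from gensim.models import LdaModel

# placeholder documents, already tokenised and with stop words removed
docs = [
    ["human", "genome", "dna", "genetic"],
    ["evolution", "species", "organisms", "life"],
    ["data", "analysis", "computer", "information"],
]
dictionary = corpora.Dictionary(docs)
bow_corpus = [dictionary.doc2bow(d) for d in docs]

# num_topics is the number of topics you must fix up front (e.g. 100 for Science)
lda = LdaModel(bow_corpus, num_topics=2, id2word=dictionary, passes=50, random_state=0)

print(lda.get_document_topics(bow_corpus[0]))   # theta: topic distribution of a document
print(lda.print_topics(num_words=4))            # beta: top words of each topic
```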
Now, apart from modeling the topics that are present in a document, what else can be modeled using these topic models? We will see how to model different other assumptions in the data. Till now, what are we assuming? We have a static corpus, over whatever time span it covers; there is a fixed set of topics; and the topics are also kind of independent of each other: we do not say that if topic t1 occurs in a document then topic t2 should also occur. But can we also model such assumptions? For that we have different models, like correlated topic models, dynamic topic models, and models for measuring scholarly impact.
So we will see how to go from LDA to these variants. Let us start with the correlated topic models. Right now, we have a Dirichlet distribution that helps me sample the probability distribution of topics for a given document: a point on the simplex, a positive vector whose components add up to one. However, in the Dirichlet distribution the components of the probability vector are essentially independent of each other, which means it does not model dependence between the topics. Suppose I want to say that this is an article about fossil fuels, and if I know that the topic fossil fuels occurs in the article, then probably the topic geology will also occur, rather than genetics. This is something I might know: these two topics are quite correlated and those two topics are not correlated. Can I use this intuition to model my topic distributions within the documents, so that certain topics are correlated and will occur together, while certain topics are not correlated and will probably not occur together? This cannot be modeled using the Dirichlet distribution, so we use a different distribution to model the topics in a document, and that is where we use the multivariate normal distribution.
So, something like this: you have k topics, so you sample a k-dimensional vector, but now from a multivariate normal distribution with a mean and a covariance. The mean is the prior information we have about the different topics, and sigma, the covariance, captures how these different topics are correlated with each other; that is what you provide in your model.
So how does your model change? Everything remains the same, except that instead of sampling from a Dirichlet distribution you now sample from a multivariate logistic normal distribution with mean mu and covariance sigma; the eta values are sampled from this distribution. That is how the topics can exhibit correlations: these two topics are correlated with each other, while those two are not. Once you have done that, what do you finally get? You again get your k topics, say the hundred topics in the case of Science, plus you also know which pairs of topics are correlated with each other. And this can be used very nicely to draw a map: these topics form a single cluster, a single group, because they are correlated with each other, and those topics form another group whose members are correlated with each other.
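To illustrate the sampling step that changes, here is a minimal sketch of drawing one document's topic proportions under a logistic normal: sample from a multivariate normal and map the result onto the simplex. The mu and Sigma values are illustrative placeholders, not learned parameters.

```python
import numpy as np

K = 4                      # number of topics
mu = np.zeros(K)           # mean: prior information about the topics
Sigma = np.array([         # covariance: how topics are correlated
    [1.0, 0.8, 0.0,  0.0],   # topics 0 and 1 tend to occur together
    [0.8, 1.0, 0.0,  0.0],
    [0.0, 0.0, 1.0, -0.5],   # topics 2 and 3 tend not to occur together
    [0.0, 0.0, -0.5, 1.0],
])

rng = np.random.default_rng(0)
eta = rng.multivariate_normal(mu, Sigma)     # unconstrained draw
theta = np.exp(eta) / np.exp(eta).sum()      # softmax: topic proportions on the simplex
print(theta)
```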
Now, how do you know that this works better than LDA? One good method of evaluation is to find out what log likelihood the model assigns to held-out data. Held-out data is data that you did not use for training your topic model: you do not give it as input to Gibbs sampling or variational inference. Once you have learnt the topics, you see what probability the model gives to this held-out data, some separate data from the same domain; whichever topic model gives a better log likelihood is probably the better model. This is similar to what we did in the case of language modeling, where we computed the perplexity on held-out data; here, similarly, it is the log likelihood on held-out data. And we do see that the correlated topic model gives a better held-out likelihood than simple LDA. That is what you see here: the held-out likelihood is plotted against the number of topics. Interestingly, if the number of topics is small, say thirty to forty, both models give about the same log likelihood, but as you increase the number of topics, the likelihood given by the LDA model starts decreasing, and this does not happen with the CTM model. So if you want to have more topics, CTM is a better choice than LDA; CTM can model richer assumptions in the data than LDA when you have a larger number of topics.
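As a rough sketch of this evaluation protocol with gensim (again my choice of tooling; gensim does not ship a CTM implementation, so this only shows how one model would be scored on held-out data): train on one part of the corpus and compute the per-word likelihood bound on documents that were kept aside.

```python
from gensim.models import LdaModel

def heldout_score(train_bow, heldout_bow, dictionary, num_topics):
    """Train on train_bow and return the per-word log-likelihood bound on heldout_bow.

    Higher (less negative) is better; comparing this score across models is the
    evaluation described above.
    """
    lda = LdaModel(train_bow, num_topics=num_topics, id2word=dictionary, passes=20)
    return lda.log_perplexity(heldout_bow)
```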
You can see that here as well: the likelihood difference between CTM and LDA keeps increasing as you increase the number of topics. And this is how the map will look. You see here a topic that talks about United States, women, universities, students, education, and one about research, funding, support, science, scientists, people; these are correlated topics, but they are not so correlated with topics like the one here with stars, astronomers, universe, galaxies, galaxy, which again form a different cluster of topics. And the model learns this simply by using the covariance matrix: these topics are connected together. So this was one assumption that we can model.
Now let us take another assumption: not which topics are correlated with each other, but how topics change over time. Right now we are assuming that you have a static corpus in which there is the same set of topics over time, and by topic I mean that the distribution of words in the topic is also the same over time. But this is not true in general. Suppose you have a collection that spans multiple decades or even centuries, say two hundred years of data. What you will see is that, for the same topic, the type of words that you see changes over time: initially you see one sort of words and later on you see a different sort of words. The topic might be the same, but the kind of words keeps changing, and the probabilities of the words keep changing as well. This you cannot model with a simple LDA model, so how do you actually specify it? That is where a dynamic topic model is used.
So what is the problem with LDA? It assumes that the order of documents does not matter, and this is not appropriate for corpora that span hundreds of years. We might want to track how the language within the topics changes over time, and for that we use dynamic topic models. This is a very interesting and quite direct extension of LDA, where you now model how the topics change over time. How do you do that? When you have a large collection, you divide it into multiple different time points: this is your corpus one, corpus two, corpus three, corpus four and so on over time, starting from the first up to the last corpus. Now, when you define your topic distributions, you say that the initial corpus has the distribution beta_{k,1}, that is, beta for topic k at time step one. As you go from time step one to time step two, the next beta, beta_{k,2}, will not be the same as beta_{k,1}; it will again be a distribution, drawn with mean beta_{k,1} and some variance. So you are allowing the probabilities of words within the topic to change, and you can do that over time: the previous topics influence the next topics, but the next topics can also change with a certain variance. And this is how the model looks: you have time steps from one to capital T, which is the same as having capital T different corpora and running a topic model for each corpus.
But now you are not doing it independently, because your betas are connected: you are saying that beta_{k,1} is an input to beta_{k,2}, in that beta_{k,2} is drawn from a normal distribution with mean beta_{k,1} but with some variance. So you are biased towards keeping the same words with the same probabilities, but the variance allows the probability distribution of words to change and also allows new words to enter the topic. That is the only thing that changes: you have different betas over time, but they are connected, from the first time step up to the last, so the topics at the previous time point influence the topics at the next time point.
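Here is a minimal sketch of that drift for a single topic, under the usual logistic normal construction: the unnormalised word weights at each time step are a Gaussian step away from the previous step, and a softmax turns them into a word distribution. All sizes and the variance below are placeholders.

```python
import numpy as np

V = 6          # vocabulary size (placeholder)
T = 5          # number of time steps
sigma = 0.1    # drift standard deviation: how fast the topic is allowed to change

rng = np.random.default_rng(0)
beta_tilde = np.zeros((T, V))
beta_tilde[0] = rng.normal(0, 1, size=V)              # topic at time step 1
for t in range(1, T):
    # beta_{k,t} ~ Normal(beta_{k,t-1}, sigma^2): biased towards the previous step
    beta_tilde[t] = rng.normal(beta_tilde[t - 1], sigma)

# map each time step's weights to a word distribution
beta = np.exp(beta_tilde) / np.exp(beta_tilde).sum(axis=1, keepdims=True)
print(beta[0])    # word probabilities at the first time step
print(beta[-1])   # ... and at the last: similar, but drifted
```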
Now, once you have this dynamic topic model, how can it help? Suppose you are modeling how a particular topic in science changes over time. Let us say we are modeling Science starting from eighteen eighty one up to two thousand, and the topic is atomic physics. What you see is that the kind of words in the topic keeps changing over time: early on you have words like force, energy, motion, light, magnet, matter, but as you go forward in time the words are energy, electron, magnetic, field, atom, quantum, physics. So the word quantum comes up and the word electron comes up, which were not there at the initial time point. You can also see this over time in a nice plot that shows how three of these words vary over the decades within this topic.
You see that the word matter initially starts with a very high probability but then starts decreasing over the decades; the word electron comes up at a certain time point, around nineteen hundred with the cathode ray experiments, and then stays more or less stable over time; and then the word quantum also comes up and reaches a very high probability. So this gives you a nice visualization of how, within a given topic, the words are evolving over time, how the topic itself is evolving over time.
Similarly, if you look at the topic of neuroscience, you can observe that initially the word nerve has a very high probability, but over time you get words like neuron coming into the picture, and then CA2, which is a particular brain area. You can also correlate this with the various seminal papers that were published, which might have given rise to these terms coming up in these topics. So it is interesting that in science, within the same topic, different words keep coming up over time.
Now this also gives rise to a nice application: can you model which are the most influential articles in science, the most influential papers in science? What would be the idea? An influential paper is one that affects the topic model, that affects the change in the topic model. So with each document you can have an influence variable, and the idea is that changes in the topics are affected more by the influential papers than by the non-influential papers; that way we can model which article is more influential than another. So how will this model look? Influential articles reflect the future change in the language usage within the topic, and the influence of an article can be thought of as a latent variable: influential articles affect the drift of the topics that they discuss. So we can model it as a variable, and the posterior that we infer for this variable will tell me how influential the article was.
So again I make a very small change to the model: now beta_{k,2}, instead of depending only on beta_{k,1}, also depends on this influence variable i_d, that is, on how influential the article is with respect to the topics it discusses. This i_d is there for each document, and finally, while computing the posterior, I find out which documents get the highest influence; whichever document gets a higher i_d is the more influential article. So the model remains the same as the dynamic topic model, only now the drift is also affected by this i_d parameter: each document has an influence score i_d, and each topic drifts in a way that is biased towards the documents that have a high influence. This is a posterior that I estimate from the data, and it explains the changes in the topic in the future.
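Purely as a conceptual sketch of that bias (this is my own simplification, not the exact formulation of the document influence model): the topic's drift from one time step to the next is pulled towards the word usage of documents with a high influence score.

```python
import numpy as np

V = 6                                          # vocabulary size (placeholder)
rng = np.random.default_rng(0)

beta_prev = rng.normal(0, 1, size=V)           # unnormalised topic weights at time t
doc_words = rng.dirichlet(np.ones(V), size=3)  # word proportions of three documents at time t
influence = np.array([0.05, 1.5, 0.1])         # influence scores i_d (illustrative only)
sigma = 0.1

# documents with higher i_d pull the topic's next state more strongly towards their words
drift_mean = beta_prev + influence @ doc_words
beta_next = rng.normal(drift_mean, sigma)      # topic weights at time t + 1
print(beta_next)
```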
Now let us see another very interesting variant of LDA, one that can be used in the supervised setting. What do you mean by a supervised setting? Till now we are saying: I have a set of documents, certain data occurs in those documents, I give it to my model, and by using Gibbs sampling or variational inference I can find out which topics occur in which documents. That is what we can do, and we can model certain assumptions, like how topics change over time and which topics are correlated, etcetera.
What we are saying now is: in the supervised setting, can we also use it to do certain predictions? Think about movie reviews paired with ratings: from the text, can I predict the rating this review will give? Or web pages paired with a number of likes: how many likes will this web page get? Or documents paired with links to other documents, or individuals paired with a category. There are lots of examples where the data points have some sort of class or category; can you also model these using topic models? There are two different ways in which it can be done: we will talk about supervised topic models and then see what the other variation is.
So what is the idea? You model the documents along with the responses or categories, and the responses are fitted so as to find topics that are predictive of the response: how do the topics relate to the response? How is it done? This is the plate notation for the LDA model, which we had seen earlier, with your beta_k, alpha and so on: you draw the topic proportions and, for each word, the topic assignment. What is additional here is that for a given document, which is your data point, you also sample a response from the topic assignments: from your z_{d,n} you sample the response, from a normal distribution with mean eta transpose z-bar and variance sigma squared. So this response variable depends on the topic distribution of the document, with some variance.
Let us quickly see what this means. y_d is sampled from a normal distribution with mean eta transpose z-bar and variance sigma squared. What is z-bar? z-bar is the empirical topic proportions of the document: I know that topic t1 occurs 0.1 of the time, topic t2 occurs 0.3 of the time, and so on. eta is like the weights given to the different topics: the weight is high if the topic is associated with a higher response, and it can also be negative if the topic is associated with a negative response. So eta says whether a topic pushes towards a higher or a lower response. eta transpose z-bar, that is eta_1 z-bar_1 plus eta_2 z-bar_2 and so on, gives you a scalar; this scalar is the mean, and with the variance sigma squared you sample your response. So you need to find your z-bar and you need to estimate your eta; you have to estimate both z-bar and eta from your model.
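Here is a minimal sketch of that generative step for one document; the topic proportions, weights and variance below are illustrative placeholders, not estimated values.

```python
import numpy as np

z_bar = np.array([0.1, 0.3, 0.6])    # empirical topic proportions of one document
eta = np.array([2.0, -1.5, 0.5])     # per-topic regression weights (can be negative)
sigma2 = 0.25                        # response variance

rng = np.random.default_rng(0)
mean = eta @ z_bar                     # scalar: eta_1 * zbar_1 + eta_2 * zbar_2 + ...
y = rng.normal(mean, np.sqrt(sigma2))  # sampled response, e.g. a movie rating
print(mean, y)
```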
Now, there is also an alternative to using supervised LDA. You could ask: why are we taking this complicated model where we have to estimate eta inside the model? Why can't we simply run our topic model, get a theta_d for each document, that is, a distribution over topics, and then fit something like a regression on top: this theta gives me a score of, say, five, another theta gives me minus five, and so on. That is another possibility: I have different thetas for different documents, and I fit a regression from these thetas to the actual scores. This is called the LDA plus regression model, whereas what we are doing in supervised LDA is sampling the response inside the model, during estimation itself.
So what happens in that case? You are not using the response as an observed variable; it remains unobserved while fitting the topics, whereas here, in supervised LDA, the response is also observed. The intuition is that if you take the response as an observed variable, then your topics can be much better aligned to the actual responses, whereas in the other case the topics are not aligned to the actual responses and you have to later find a mapping from the topics to the actual response. In supervised LDA this is done inside the model itself, and that is why supervised LDA works better for predicting ratings and responses than the LDA model plus regression, where you take the topic proportions theta and build a regression in a separate model to find y. Here, the response variable is taken as an important observation to infer the topics in this supervised manner.
So that is the interesting thing here. What happens is that we fit the sLDA parameters to documents and responses, and we get the topics and the coefficients. Using these together, given a new document, you can estimate the response as eta transpose times the expected value of z-bar given the words in the document; this expectation you can estimate from your sLDA model the same way you would from LDA, that is, the topic distribution for a new document. And this is what you get; this experiment uses the movie review data from the Pang and Lee paper, two thousand five.
So what they did is they took a ten-topic sLDA model and fit it on movie reviews. You see here the ten topics with their most important words, plotted against their corresponding etas. A high eta means the topic corresponds to a high score, a high rating, and a negative value of eta means the topic corresponds to a negative score, a negative rating. So what do you see here? On the higher side you have words like motion, simple, perfect, fascinating, power, complex; "perfect" and "fascinating" are nice words, and they come out on the higher side. And here you have words like least, problem, unfortunately, supposed, flat, and you can immediately see that they carry a negative image, so they come out with a very low, negative value. This also very nicely puts your topics on a scale from negative to positive, which was not available earlier: earlier you had different topics, but you did not know whether a topic corresponds to a positive or a negative rating.
Now you can also find this out from your model itself, and it gives a very good held-out correlation as well: even as you increase the number of topics, it gives a much better correlation than the LDA model. So what did you see here? sLDA enables model-based regression where the predictor variable is a text document; you do not have to run a regression separately, it is run inside the model: inside the model itself you are sampling your response using a regression.
It can be used wherever LDA is used in an unsupervised fashion, so you can use it with images, music, etcetera, wherever the data is paired with some sort of response variable. You can also say that sLDA is a sort of supervised dimensionality reduction technique, whereas LDA is an unsupervised technique: here you are seeing the response, and by seeing it you are shaping your dimensionality reduction. So that is about using sLDA. There are some other variants of topic models as well; we will see some of those, like the relational topic models, and some simple intuition about nonparametric topic models, in the next lecture.
Thank you.
