Hi guys, this video is aimed at teaching
you latent Dirichlet allocation (LDA) for topic modeling,
so let's get started.
So what exactly is LDA?
LDA is a topic modeling
technique that generates topics based on word
frequency from a set of documents.
LDA is particularly useful for finding reasonably
accurate mixtures of topics within a given
document.
Why do we need LDA?
So, for example, the task at hand that I have
right now is that I want to find out the news
highlights of France in 2018.
I am given a data set which contains all the
news articles of the country from 2018.
I'll make use of LDA here to find out the
main topics of 2018 in France.
One of the major topics that you can think of right now is
France winning this year's World Cup.
So if I want to model something of this sort, I'll make
use of LDA.
So how do we go about doing LDA?
We create a collection of documents from the
news articles; each document represents one
news article.
Data cleaning is the next step; in every NLP
problem we have to clean the data. So we start
off with tokenizing, which is converting a document
into its atomic elements; then removing unnecessary
words such as stop words via stop word removal; then
we go in for stemming, which is merging words
that have equivalent meaning.
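These cleaning steps can be sketched in a few lines of Python. The stop word list and the suffix-stripping stemmer here are toy stand-ins of my own (the video doesn't name a specific library; a real pipeline would use something like NLTK's stop word corpus and Porter stemmer):

```python
import re

# toy stop word list -- a real pipeline would use a full stop word corpus
STOP_WORDS = {"the", "a", "an", "of", "in", "is", "and", "to"}

def preprocess(doc):
    # tokenize: split the document into its atomic elements (lowercase words)
    tokens = re.findall(r"[a-z0-9]+", doc.lower())
    # stop word removal: drop unnecessary high-frequency words
    tokens = [t for t in tokens if t not in STOP_WORDS]
    # toy stemming: strip common suffixes so equivalent words merge
    def stem(word):
        for suffix in ("ing", "ed", "s"):
            if word.endswith(suffix) and len(word) > len(suffix) + 2:
                return word[: -len(suffix)]
        return word
    return [stem(t) for t in tokens]

print(preprocess("France wins the World Cup in 2018"))
# -> ['france', 'win', 'world', 'cup', '2018']
```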
Once we have
the data ready, we can directly use packages
which are available for LDA, but the idea
of this video is not to do that; it's to give
you an idea about how LDA works
within the whole library. So what LDA does
is assign a random topic to each word
in the corpus of documents that you have provided.
So it starts off randomly assigning
topics. So given, say, a document i that I consider,
which has the words football, World Cup, 2018,
France and so on, it will assign each word a random
number out of three, because I want three
topics derived from the documents. So
I start with three, one, three, one and so on.
Then what it does is calculate a document-to-topic
count: in a given document, say the document i
I considered above, how many times does my topic 1
appear, how many times does my topic 2 appear,
and how many times does my topic 3 appear?
I have a count of this.
Plus, I also have a count of how many times
a word is associated with a topic: so 'football'
is associated with topic 1 just once, and
it's associated with topic 3 thirty-five times.
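As a sketch, the random initialization and the two count tables can be built like this. The documents and the number of topics are made-up stand-ins for the France news corpus:

```python
import random

random.seed(0)  # fixed seed so the random assignment is repeatable

# toy corpus standing in for the preprocessed news articles
docs = [
    ["football", "world_cup", "france", "2018", "football"],
    ["election", "president", "france", "vote"],
]
K = 3  # I want three topics derived from the documents

# step 1: assign a random topic (0..K-1) to every word in every document
assignments = [[random.randrange(K) for _ in doc] for doc in docs]

# step 2a: document-to-topic counts -- how often each topic appears per document
doc_topic = [[0] * K for _ in docs]
# step 2b: word-to-topic counts -- how often each word is assigned to each topic
word_topic = {}
for d, doc in enumerate(docs):
    for word, topic in zip(doc, assignments[d]):
        doc_topic[d][topic] += 1
        word_topic.setdefault(word, [0] * K)[topic] += 1
```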
Now the idea is that, after the random initialization,
you want to converge at a point where your
words signify the topics that you are trying
to figure out. So we go on reassigning the
topic of each word at every pass. So what happens
is, if I am at document i, at the word 'world cup',
I remove the topic that is currently
assigned to it, which was initially topic
2. So then the counts change: for document i,
since the word 'world cup' is removed from topic
2, the topic 2 count changes from
one to zero.
As well, the 'world cup' count for topic 2
likewise changes from 8 to 7.
Then I make use of two calculations.
First, I calculate how much the document likes each
topic, based on the other assignments in the document:
so in our case, topic 1 in document i appears
two times, topic 2 does not appear right
now because I've removed it, and topic 3 appears
two times.
Second, I calculate how much each topic likes the
word 'world cup', based on the assignments
in the other documents: so, as you can see from
the blue bars, topic 1 appears ten times,
topic 2 appears seven times and topic
3 appears just one time.
Next, all we have to do is multiply the two values
that we calculated in the previous two steps
and find a so-called area between how
much the document likes a topic and how much the
topic likes the word. Based on this, we find that
topic 1 encloses the maximum area, so the next
step for LDA is to reassign the topic for the word
'world cup' from two to one; that's what happens here.
This process repeats for all the words in my corpus
of documents in one pass, and depending on
how many passes you have in your LDA setup,
this process iteratively keeps changing
the topics for all the words, and after
a stage the whole thing converges.
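The reassignment step just described can be sketched roughly like this. The counts are made-up numbers mirroring the example, and picking the largest "area" is a simplification: full LDA uses Gibbs sampling, which samples a topic in proportion to these scores (with smoothing priors) rather than always taking the maximum:

```python
K = 3

# made-up counts mirroring the example: topic counts for document i
# (2 x topic 1, 1 x topic 2, 2 x topic 3 before removal)
doc_topic = [2, 1, 2]
# made-up counts of how often "world cup" is assigned to each topic
# across all documents (topic 2 holds 8 of them before removal)
word_topic = {"world cup": [10, 8, 1]}

def reassign(word, current_topic):
    # step 1: remove the word's current assignment from both count tables
    doc_topic[current_topic] -= 1
    word_topic[word][current_topic] -= 1
    # step 2: multiply "how much the document likes each topic" by
    # "how much each topic likes this word" -- the enclosed "area"
    areas = [doc_topic[k] * word_topic[word][k] for k in range(K)]
    # step 3: reassign to the topic with the maximum area
    new_topic = max(range(K), key=lambda k: areas[k])
    doc_topic[new_topic] += 1
    word_topic[word][new_topic] += 1
    return new_topic

# topic indices are 0-based here, so 1 means "topic 2"
print(reassign("world cup", 1))  # -> 0, i.e. topic 1, matching the example
```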
You don't have to code any of this yourself; we
just have to call a simple function
from the gensim library, and this is how
our output would look. Now, how do I identify,
say, given this output,
how is it that football
has come up first, then France and then World
Cup? What happens is that after the model has converged,
it sorts all the words based on the count
of times each appears in a topic and gives out a
probability score, which is what you see here, that
is 0.070*"football" and 0.054*"France". So you sort based on the count for each word
that appears in a topic and then give out
that number of words in descending order.
So here, if I ask for just three words, then these
are the three words, and it gives out a probability
score for each as well.
So yeah, that's about it for the LDA topic
modeling part. Hope you liked the video.
Don't forget to subscribe to my channel, thank you
so much.
