Hello everyone
In this video,
I will go over a simple example
demonstrating the use case of CountVectorizer
& TfidfVectorizer
in extracting features from text.
So let's get started by importing the necessary modules.
The reason we are changing from text
to numbers is that machines,
as advanced as they may be,
are not capable of
understanding words & sentences
in the same manner as humans do.
In order to make document corpora more
palatable for computers,
they must first be converted into some
numerical structure or representation.
For doing that, I have created a function
called create_document_term_matrix,
which takes in multiple
documents in the form of a Python list;
the second argument is a vectorizer, which is used to convert text to numbers.
There are different vectorizers available which
follow different strategies for
converting text to numbers.
The first thing that I do inside this function is
call the fit_transform method,
which transforms all my messages,
based on the vectorizer, into numbers
& stores them in a matrix-like format
called doc_term_matrix.
After this matrix has been generated,
I return a DataFrame
wherein each cell holds the value
the vectorizer assigned to a given word,
the columns are my overall
feature names, that is, individual words,
& my rows are the individual
documents that I've considered.
So, I run this cell to load this function into memory.
Before we jump into the details of
what CountVectorizer is,
I'll first explain what I've done here.
So, I have created a simple list
having 2 elements
or 2 documents.
The first document contains the text
"My name is Bhavesh"
& the second document contains the text
"Please subscribe to my YouTube channel".
I run the cell to load this message into memory.
Then I create an instance of the CountVectorizer
& save it into a variable called count_vect.
Now, essentially, what happens
when you use the CountVectorizer is that
it takes what is
called the bag-of-words approach.
Each message inside my 
document is separated into tokens
so essentially my first token would be "My"
second token would be "name"
the third token would be "is"
and the fourth token would be "Bhavesh"
in case of my first document.
Once each message inside
the document is separated into tokens,
the number of times each token occurs
in a message is counted, in the case of a CountVectorizer.
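You can see the tokens CountVectorizer produces by building its analyzer directly; note that, by default, it lowercases the text and drops single-character tokens:

```python
from sklearn.feature_extraction.text import CountVectorizer

# build_analyzer() returns the callable that splits raw text into tokens.
analyzer = CountVectorizer().build_analyzer()
print(analyzer("My name is Bhavesh"))  # ['my', 'name', 'is', 'bhavesh']
```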
So when I call the function that I have just defined above
this is what I get.
As you can clearly see
the first row signifies the first document that I have in hand,
which had the text - "My name is Bhavesh"
So, in case of the first document
and the first word which I'm considering
which is "Bhavesh"
"Bhavesh" occurs only once.
So that is denoted by this number 1.
I go to the second word
since I'm considering a bag of words approach.
I'll consider all the words that are 
there in my corpus of documents
The word "channel" does not occur in the first document.
My first document was "My name is Bhavesh",
and there is no mention of the word "channel",
so the count of "channel" is zero,
the count of "is" is one
the count of "my" is one
and since "name" also occurs 
you have a one associated with that
The "please", "subscribe", "to", & "youtube"
tokens have a count of 0
because they do not exist in document 1.
When I go to the second document,
you will have a count of one
for words such as "please",
"subscribe"
"to"
"my"
"youtube"
and "channel"
because these are the words that are occurring
in the second document.
Now, if you're working on a classification task,
you can then
use this matrix as your training data,
that is, your X_train,
& you will have corresponding y_train values,
& you can fit a model
on those values as well.
But there are values,
or rather words, which are
not very significant in the
case of a classification task,
like "is",
"to",
and so on and so forth,
so you have to remove unwanted words
from your overall corpus of documents,
or you have to give them a lower weight as
compared to the other words which are important.
The way you do that is with something called
Term Frequency & Inverse Document Frequency.
Now let's go into the specifics of
Term frequency & Inverse document frequency.
Term frequency is a weight representing how
often a word occurs in a document:
if we have several occurrences
of the same word in 1 document,
we expect the TF-IDF to rise.
Now, the way you read this formula is:
the term frequency
of the word i
in the jth document,
tf(i, j),
is how many times the word i
occurs in that document
divided by the total number of words in the document.
Coming to the next part of TF-IDF, which is
inverse document frequency.
Inverse document frequency
is another weight, representing how common a word is across all documents:
if a word is used in many documents, then the
TF-IDF value would decrease.
So, essentially, the formula that I've mentioned here:
the IDF of a given word "w"
is the log
of the total number of documents
divided by the number of documents
in which the word appears.
So if you have a word appearing
in two out of three documents,
then your IDF value would
be log to the base 10
of 3 by 2.
Now, there are different ways of computing TF-IDF.
Sklearn, or scikit-learn, adds a value of 1
to the numerator & denominator
in the IDF calculation, just to avoid division by zero,
and also adds 1 to the resulting IDF value.
You can always refer to the documentation of scikit-learn
to understand how IDF is calculated under the hood.
Now that the definitions are out of the way,
we will go through a few examples
to see how it actually works.
Let's start by creating a new list called msg_2,
which contains two documents.
My first document contains the text
"Bhavesh is my name";
the second document in the list
contains the text
"Bhavesh likes python programming language".
Let's import this message into memory.
Let's create an instance of the TfidfVectorizer
and store it into a variable called
tfidf_vect.
Now I pass that message and
the TfidfVectorizer to the function that I had created above
and this is what I observe
As you can clearly see the word Bhavesh
has the lowest value in the first row.
The value of "Bhavesh" in the first row is 0.37
The value of "is" is 0.53
Similarly, the value of "my" is 0.53,
& the value of "name" is also 0.53.
This is the overall structure that I get
when I apply tfidf to the first document.
So this is what I'm observing
similarly if I go to the second document as well
the value of Bhavesh is
the lowest as compared to the other
words that are there
Now, if I also look at the value of
"Bhavesh" in the second document, I see
something similar.
The value of "Bhavesh" is 0.33 in the second document,
& the values of the other tokens are higher:
"language",
"likes",
"programming" & "python" all have a value of 0.47.
Since "Bhavesh" occurs in both the documents,
its lower IDF brings down the overall TF-IDF score of that word
across the various documents it is present in.
now you might wonder why is the value of
"is" 0.53 and the value of "language" 0.47
The reason for this is the difference 
in the total number of words in each document
The first document has four words
the second document has five words
so "is" occurs 1 out of 4 times,
while "language" occurs 1 out of 5 times;
that is, the term frequency of
"is" is greater than the term frequency of "language".
If this idea is clear to you,
let's change our messages a bit.
I've added the word "Bhavesh"
two more times in the first document,
and I expect my overall term frequency
for "Bhavesh" to increase,
& therefore the TF-IDF value for
"Bhavesh" in the first message to increase,
so let's visualize if this is correct or not,
so I have imported the message in memory
Now, I run the function that I've created.
The value for Bhavesh
in the first message went up
just as expected
There are two things worth noticing here.
First, the values of the other words
in the first message have decreased:
the values of words like "is",
"my",
and "name" have gone down from 0.53
to 0.36.
The reason is that the term frequency
of "Bhavesh" has increased,
so after normalization the other words carry relatively less weight.
Second, the value of "Bhavesh"
in the second message
remains the same.
Both documents still contain the word "Bhavesh",
so the IDF portion is as it is,
& I have not changed
the second document at all,
so the number of times
"Bhavesh" occurs there is unchanged too.
So neither TF has changed
nor has IDF changed in the
case of the second document.
If this idea is also clear to you,
let's go on to the third case,
where I try to change the IDF portion
of the word "Bhavesh".
Now, I'll change the second message from
"Bhavesh likes python programming language"
to
"I like python programming language".
I am expecting the TF-IDF value
of "Bhavesh" in the first document to increase,
since now "Bhavesh" occurs only in the first document,
so when I run this cell
and also run the function,
this is what I observe.
"Bhavesh" now has a value of 0.866025,
which is greater than the 0.77 in the previous case,
the reason being that now
"Bhavesh" only occurs in the first document.
I do not see any occurrence
of "Bhavesh" in the second document,
so it is a special word that is
occurring in only one document
& is not a common word that appears
in all the documents
or most of the documents.
So this was my take on how you 
can smartly utilize TF-IDF
& CountVectorizer methods 
to convert your text into numbers.
I hope you enjoyed the video.
If you do have any questions
about what we covered in this video,
then feel free to ask in the comment section below
& I'll do my best to answer those.
If you enjoy these tutorials & would like to support them
then the easiest way is to simply 
like the video & give it a thumbs up
& also it's a huge help to share these videos with anyone
who you think would find them useful.
Please consider clicking the subscribe button to be notified for future videos.
& Thank you so much for watching the video.
