Hello. Welcome back to Class 2: Data stream
mining in Weka and MOA. We've arrived at the
last lesson of this class. We are going to
look at an application of classifying tweets.
Let's start.
Twitter is a very nice example of a data stream,
because it is data that is produced in real
time. Twitter is a micro-blogging service
that was built to discover what is happening
at any moment in time. There are more
than 300 million users, more than 2100 million
search queries every day, and a very nice
thing for us is that the data is public and
it can be accessed through a streaming API.
In this lesson, we're going to look at an
application of sentiment analysis. Sentiment
analysis is the task of classifying messages
or tweets into two categories, positive or
negative, depending on the feelings that we
can see inside the messages. Many times it's
very difficult to get labeled data. In sentiment
analysis with Twitter, there is a very basic
approach, but it works very well. We can get
labeled data using the tweets that have emoticons
inside. Many tweets have positive or negative
emoticons, and then we can use this information
to classify them as positive or negative.
We can use all of these tweets to train our
model, and then we can predict using the tweets
that don't have emoticons: we can predict
what is the current polarity, and what is the
current sentiment around any specific product
or company or topic.
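As a minimal sketch of this emoticon labeling idea, here is how it could look in Python. The emoticon lists and the function name label_tweet are hypothetical and only for illustration. Removing the emoticons from the text before training is also an assumption of this sketch, so that a model cannot simply memorize them.

```python
# Hypothetical emoticon lists; a real system would use longer ones.
POSITIVE = (":)", ":-)", ":D", "=)")
NEGATIVE = (":(", ":-(", "=(")


def label_tweet(text):
    """Return (label, cleaned_text), or None if the tweet cannot be labeled."""
    has_pos = any(e in text for e in POSITIVE)
    has_neg = any(e in text for e in NEGATIVE)
    if has_pos == has_neg:
        # Either no emoticon at all, or both kinds: no usable label.
        return None
    label = "positive" if has_pos else "negative"
    # Assumption of this sketch: strip the emoticons from the training text.
    for e in POSITIVE + NEGATIVE:
        text = text.replace(e, "")
    return label, " ".join(text.split())


labeled = label_tweet("I love this phone :)")
unlabeled = label_tweet("just landed in Madrid")
```

Tweets for which the function returns None are the ones we would pass to the trained model for prediction.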
An important thing that we need to look at
when we are classifying tweets is whether the
data is balanced or not. Let's look at an
example. In this simple confusion matrix,
what we see is that we are predicting 82%
of the instances as positive and 18% as negative.
We are correctly classifying the positive class
for 75% of the instances and the negative class
for 10% of the instances. Our accuracy in this
case is 85%.
Is this good performance? To answer this,
one way is to look at a random classifier.
Imagine a classifier that predicts randomly
but follows the same distribution between
the positive class and the negative class.
This is the confusion matrix at the bottom.
There we can see that this classifier also
predicts 82% of the instances as positive and
18% as negative. The interesting thing is that
it gets the positive class right for 68% of
the instances and the negative class right
for 3% of the instances.
That means that the accuracy here is 71%.
That means that, if our classifier is predicting
with an accuracy higher than this, then we
can say that it is a good classifier, but if
it's predicting less than this 71%, then our
classifier is not doing well.
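The 71% figure can be reproduced with a short sketch. The function name chance_accuracy is hypothetical. The two off-diagonal cells (7% and 8%) are not read out in the lesson; they are derived here from the totals, since 82% predicted positive minus 75% correct leaves 7%, and 18% predicted negative minus 10% correct leaves 8%.

```python
def chance_accuracy(tp, fp, fn, tn):
    """Accuracy of a random classifier that keeps the same class proportions."""
    total = tp + fp + fn + tn
    actual_pos = (tp + fn) / total   # share of instances that are positive
    actual_neg = (fp + tn) / total   # share of instances that are negative
    pred_pos = (tp + fp) / total     # share of instances predicted positive
    pred_neg = (fn + tn) / total     # share of instances predicted negative
    # A random guess is right when it happens to match the actual class.
    return actual_pos * pred_pos + actual_neg * pred_neg


# Cells in percent: 75 and 10 come from the lesson, 7 and 8 are derived.
p_c = chance_accuracy(tp=75, fp=7, fn=8, tn=10)
```

With these cells, p_c comes out at about 0.71: 0.83 × 0.82 for the positive class plus 0.17 × 0.18 for the negative class.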
To measure this, as you may know, there is
the kappa statistic. It measures the difference
between the accuracy of our classifier and
the accuracy of a random classifier that is
predicting using the same distribution of
classes. Basically, the kappa statistic computes
this difference, then it applies a normalizing
factor, so a classifier that is no worse than
the random one gets a value of kappa between
0 and 1.
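Written out, the usual form of the kappa statistic is as follows. The symbols p_0 (accuracy of our classifier) and p_c (accuracy of the random classifier) are names chosen here for illustration. The worked numbers are the 85% and 71% from the confusion matrices in this lesson.

```latex
\kappa = \frac{p_0 - p_c}{1 - p_c}
\qquad\text{e.g.}\qquad
\kappa = \frac{0.85 - 0.71}{1 - 0.71} = \frac{0.14}{0.29} \approx 0.48
```

A kappa of 0 means our classifier is no better than the random one, and a kappa of 1 means it is always right.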
Now let's look at an application. There is
this Twitter sentiment corpus that was made
by students at Stanford that contains tweets
that were collected between April 2009 and
June 2009. There are 800,000 tweets with positive
emoticons and 800,000 tweets with negative
emoticons.
If we do a prequential evaluation using these
tweets and we use a multinomial Naive Bayes
classifier, a stochastic gradient descent classifier,
and a Hoeffding Tree, what we see is that
at the end of the stream, the stochastic gradient
descent classifier gets an accuracy of 100%.
This is something that is not normal, so
it's worth seeing why it's happening.
If you look at the kappa statistic, what we
see is that at the moment that the accuracy
goes up to 100%, the kappa statistic goes
down. That means that, at that point, the
data becomes completely unbalanced and belongs
to only one class.
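This effect can be reproduced with a small sketch, under strong simplifying assumptions. The model here is a toy word-count classifier, not stochastic gradient descent, and the stream is synthetic, not the Stanford corpus. All names (prequential, toy_stream, accuracy_and_kappa) are hypothetical. When a window contains only one class, kappa is undefined, and this sketch reports it as 0.

```python
from collections import Counter, defaultdict, deque


def accuracy_and_kappa(pairs):
    """Accuracy and kappa over a list of (true_label, predicted_label) pairs."""
    n = len(pairs)
    p0 = sum(t == p for t, p in pairs) / n
    true_counts = Counter(t for t, _ in pairs)
    pred_counts = Counter(p for _, p in pairs)
    pc = sum(true_counts[c] * pred_counts[c] for c in true_counts) / (n * n)
    # Only one class in the window: kappa is undefined, report 0.
    kappa = 0.0 if pc == 1.0 else (p0 - pc) / (1.0 - pc)
    return p0, kappa


def prequential(stream, window=100):
    """Test-then-train; report sliding-window metrics every `window` instances."""
    counts = defaultdict(Counter)  # toy model: word -> label counts
    recent = deque(maxlen=window)
    checkpoints = []
    for i, (word, label) in enumerate(stream, start=1):
        seen = counts[word]
        pred = seen.most_common(1)[0][0] if seen else "positive"  # test first
        recent.append((label, pred))
        seen[label] += 1                                          # then train
        if i % window == 0:
            checkpoints.append(accuracy_and_kappa(list(recent)))
    return checkpoints


def toy_stream():
    """200 balanced, slightly noisy instances, then 200 of a single class."""
    for i in range(200):
        label = "positive" if i % 2 == 0 else "negative"
        word = "good" if label == "positive" else "bad"
        if i % 5 == 0:  # every fifth instance uses the misleading word
            word = "bad" if word == "good" else "good"
        yield word, label
    for _ in range(200):
        yield "bad", "negative"


results = prequential(toy_stream())
acc_mixed, kappa_mixed = results[1]    # last window of the balanced part
acc_single, kappa_single = results[3]  # last window of the one-class part
```

In the balanced part, the window accuracy is 0.8 and kappa is about 0.6. In the one-class part, accuracy reaches 1.0 while kappa is 0, which is the same pattern we see at the end of the tweet stream.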
In this data stream, if we compare the accuracy
and kappa of the multinomial Naive Bayes,
stochastic gradient descent, and Hoeffding
Tree classifiers, what we can see is that stochastic
gradient descent is better, but this is something
that may not apply to other data streams.
What is very interesting is that in data stream
mining, we should not only look at the
accuracy, but also look at the resources,
at time and memory.
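As a rough sketch of measuring resources, here is one way to do it in plain Python with the standard library modules time and tracemalloc. The helper names measure and count_words are hypothetical, and the word counter is only a stand-in for a real learner. This is not how MOA itself reports resources.

```python
import time
import tracemalloc


def measure(fn, *args):
    """Run fn(*args) and return (result, elapsed_seconds, peak_bytes)."""
    tracemalloc.start()
    start = time.perf_counter()
    result = fn(*args)
    elapsed = time.perf_counter() - start
    _, peak = tracemalloc.get_traced_memory()  # (current, peak) in bytes
    tracemalloc.stop()
    return result, elapsed, peak


def count_words(tweets):
    """Stand-in for a learner: count word occurrences over the stream."""
    counts = {}
    for tweet in tweets:
        for word in tweet.split():
            counts[word] = counts.get(word, 0) + 1
    return counts


counts, seconds, peak_bytes = measure(count_words, ["good day", "bad day"] * 1000)
```

Two classifiers with similar accuracy and kappa can then be compared on seconds and peak_bytes as well.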
We have arrived at the end of this lesson.
In this lesson, we have seen an application
of Twitter classification. Twitter is a micro-blogging
streaming service that is built to discover
what is happening at any moment in time and,
more specifically, what is happening now.
Data may be unbalanced in many data streams,
so it's always important to not only look
at the accuracy, but also look at other measures
such as the kappa statistic.
Thanks for being there. I hope you enjoyed
it. Bye bye!
