So, let's see where it's installed. Okay, so it's right here, WEKA 3.8.3, so
I'll click on the WEKA 3.8 shortcut. Okay,
so I just click on OK here. It's just an
informational message telling us that we can
use many learning schemes and tools from the Tools menu. Okay, click on OK,
and let's get started building
our model, so why don't you click on the
Explorer button. The WEKA Explorer
allows you to intuitively
build your prediction model by clicking
on specific functions. So let's get
started by importing the data set; and
the data set that we're going to use in
this practical tutorial will be the
famous Iris data set. The Iris data set is a
public-domain data set that is commonly
used as an example for teaching
data mining. So let's find it: click
on the "Open File", go to the C Drive, go to
the "Program Files", I'm not sure whether
it is "Program Files" here or "Program
Files" with x86. So let me try the first
one, so there should be a folder called
WEKA. Okay, so it's in "Program Files" > WEKA 3.8; go to the "data" subfolder and then
find "iris". Okay, here we go, "iris.arff".
Now they also have the iris 2D, so
let's go with "iris.arff". Click on
"Open" and then this is what we see. So
let's just have a look at what the
various features and menus do. So in
this panel here, it tells you about the
attributes, or the variables, that this
data set has; and we can see that there are a total
of five variables: the sepal length, the sepal width, the petal
length, the petal width and the class (label). So this is a data set of 150
flowers that are closely related called
Iris and they are described by four
variables: the length and the width of
the sepal, the length and the width of
the petal, and each flower is given
a corresponding label as
being either a setosa, a versicolor or a
virginica. And so here we can see again
that this is the Iris data set, it
has a total of 150
flowers, and there are the five attributes or
variables that we see here. If
you click on one of them, to
the right you will see some description
about it. So you can see that there are
50 Iris setosa, 50 Iris versicolor and 50
Iris virginica. And there are no missing
data here, which is good to know. So if we click
on the first variable, we can see that
the minimum value is 4.3, the maximum
value is 7.9, the mean value 5.843
with a standard deviation of 0.828
and then we get the same information by
clicking on the subsequent variables. So
we see the minimum, the maximum, the mean and the
standard deviation. Okay, so notice that
the mean and standard
deviation here differ for each variable:
the mean tells us the average value of each variable,
while the standard deviation tells us
the variability of each variable. So
before we build a prediction model, let's
first start by normalizing our data
because each variable will have
different minimum and
maximum values. As you can see, the first variable
has a minimum of 4.3 and a maximum of 7.9.
The second variable has a minimum of 2 and a maximum of 4.4.
The third one, the petal length, has a minimum of
1 and a maximum of 6.9. The fourth variable has
a minimum of 0.1 and a maximum of 2.5, and we
notice that the mean and standard
deviation of each one of them are
different. So let's get started by first
normalizing or standardizing our
variables. So let me begin by normalizing the
minimum and maximum to be 0 and 1. So we
can do that very easily in WEKA. So we
have to click on the "Choose" button, go to
filters > unsupervised > attribute, click on "Normalize" and
then "Apply". So before I click on the
button, notice that the minimum, maximum
are 0.1 and 2.5, 1 and 6.9, 2 and 4.4, 4.3 and 7.9.
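What the Normalize filter is about to do is simply min-max scaling. Here's a minimal sketch in Python of what happens to, say, the sepal-length column (the values in the list are illustrative, but 4.3 and 7.9 are the actual minimum and maximum we just noted):

```python
def min_max_normalize(values):
    """Rescale values so the minimum maps to 0 and the maximum maps to 1."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

# A few illustrative sepal-length values; 4.3 and 7.9 are the actual
# minimum and maximum of the sepal-length column in the Iris data set.
sepal_length = [4.3, 5.1, 6.1, 7.9]
normalized = min_max_normalize(sepal_length)
print(normalized)  # first value is 0.0, last value is 1.0
```

This is the same transformation WEKA applies to every numeric attribute when you click Apply.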
So I'll click here "Apply" and
notice what changes. So the
minimum and maximum values become 0 and 1, and
we also notice that the mean and
standard deviation have changed. And
here, same thing: the minimum and maximum are 0
and 1. Third variable, same thing, 0
and 1. Fourth variable, 0 and 1.
Alternatively, I can click on "Undo",
which will bring us back
to the original state, and
instead of Normalize I could use
Standardize. It's in the same
subfolder: filters > unsupervised > attribute,
and then I'll find the "standardize", so
click on "standardize" and click on "Apply".
But notice that the mean and the
standard deviation will be altered. We
see that they are 5.8 and 0.8 here, 3
and 0.4, 3.7 and 1.7, and
1.199 and 0.763. So click on "Apply".
Okay, and so the mean becomes 0
and the standard deviation becomes 1.
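The Standardize filter does the analogous thing with z-scores: subtract each column's mean and divide by its standard deviation, so the column ends up with mean 0 and standard deviation 1. A minimal Python sketch (the values are illustrative):

```python
from statistics import mean, stdev

def standardize(values):
    """Shift and rescale values so they have mean 0 and standard deviation 1."""
    mu, sigma = mean(values), stdev(values)
    return [(v - mu) / sigma for v in values]

# Illustrative sepal-length values; after standardizing, the column's
# mean is 0 and its (sample) standard deviation is 1.
z = standardize([4.3, 5.1, 5.8, 6.4, 7.9])
print(mean(z), stdev(z))
```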
So this happens for every one of
them: the second variable, the third variable and also the
fourth variable. However,
nothing happens to the class, because that is
the class (label) on which we are going to make our
classification. So in data mining there
are many tasks that you can do. You could
visualize data, you could cluster the
data, you could classify the data, you
could build a regression model. But for
this example because our class or our
output label is a qualitative label
therefore we will perform classification.
By classification we mean that we will
categorize each of the 150 flowers into
one of the three class labels, here either
as a setosa, versicolor
or virginica. So this step is called data
pre-processing, where we normalize or
standardize the variables. So decide on
one or the other, either normalize
or standardize, but not both.
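To make the classification task concrete before we build the model: the classifier will map the four measurements of a flower to one of the three species. A toy rule-based classifier in Python gives the flavor (the thresholds here are illustrative guesses, not the rules J48 will actually learn):

```python
def classify(sepal_length, sepal_width, petal_length, petal_width):
    """Toy classifier: assign one of the three Iris species from the
    four measurements, using hand-picked (illustrative) thresholds."""
    if petal_length < 2.5:   # setosa has distinctly short petals
        return "Iris-setosa"
    if petal_width < 1.75:   # versicolor tends to have narrower petals
        return "Iris-versicolor"
    return "Iris-virginica"

print(classify(5.1, 3.5, 1.4, 0.2))  # prints "Iris-setosa"
```

A real decision tree learner such as J48 finds thresholds like these automatically from the data.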
When you are ready, go to the next step, which is to
click on the "Classify" tab. So let's click
on that and then go to the classifier.
Choose a classifier, and let's go with...
how about a decision tree? So let's begin
with J48, which essentially uses
the C4.5 algorithm by Ross Quinlan. So
click on J48, and the
default test option is
10-fold cross-validation. So I'm
going to cover this in a future video
about how you can split your data set
into training and testing and also how
you can do cross-validation. So
in this tutorial I'm going to stick to
the default and so we'll click on the
"Start" button and then your prediction
model will be constructed so you have
see that it takes only a couple of
seconds, not even a second,
so maybe half a second to create your
model so here, this is the summary of
your prediction model. So let's start by
scrolling up to the top. Okay, so this
provides a description of the run:
you're using the J48 algorithm, you
have a sample size of 150, you
have 5 variables, you're using
10-fold cross-validation, and this is the
resulting decision tree created and found
inside your prediction model. There are a
total of 5 leaves, and the size of the tree is 9,
and then these are the performance
metrics of your prediction model. So you
see that you have 96% accuracy,
correctly classifying
144 out of 150 flowers into one of the
three classes, and that we have
six that were misclassified. And we
have the kappa statistic here, the mean absolute
error, root mean squared error and others
as well. We also are provided with the
true positive rate, false positive rate,
precision, recall, F-measure, MCC or the
Matthews correlation coefficient, ROC,
and the class. So here we are given
the performance metrics for each of the
three classes, and then this is the
weighted average
over all three classes. The
confusion matrix is provided below. So
what is the confusion matrix? It allows
you to see how your prediction model is
confused. So if you look under the hood,
you have 50 flowers in the Iris
setosa class, 50 flowers for Iris
versicolor and 50 flowers for Iris
virginica. So out of 50
we have correctly classified 49 and we
have misclassified 1 of them. So "A", right here, is represented by Iris
setosa and so for the Iris setosa, out of
50 one of them is misclassified to be "B".
B is Iris versicolor. So we can see
that one flower that is supposed to be
classified as Iris setosa was
misclassified as Iris versicolor. Okay, so
going to the second row we see that 47
Iris versicolor have been correctly
classified and 3 Iris versicolor have
been misclassified to be Iris virginica.
Okay, going to the third row, 48
Iris virginica have been correctly
classified as Iris virginica, however two
of the Iris virginica have been
misclassified to be B or Iris versicolor.
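The headline numbers can be recomputed directly from the confusion matrix itself. Here's a small Python sketch using the matrix from this run (rows are the true classes a, b, c; columns are the predicted classes):

```python
labels = ["Iris-setosa", "Iris-versicolor", "Iris-virginica"]

# Rows = true class, columns = predicted class, from the run above.
confusion = [
    [49, 1, 0],   # a: 1 setosa misclassified as versicolor
    [0, 47, 3],   # b: 3 versicolor misclassified as virginica
    [0, 2, 48],   # c: 2 virginica misclassified as versicolor
]

total = sum(sum(row) for row in confusion)
correct = sum(confusion[i][i] for i in range(3))
accuracy = correct / total
print(f"accuracy = {correct}/{total} = {accuracy:.2%}")  # 144/150 = 96.00%

# Per-class recall (true positive rate): diagonal over the row sum.
for i, label in enumerate(labels):
    recall = confusion[i][i] / sum(confusion[i])
    print(f"{label}: recall = {recall:.2f}")
```

This reproduces the 96% accuracy and the per-class true positive rates WEKA reports.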
So this is very useful in helping us to
understand the confusion made by our
prediction model. So until next time I'm
Chanin Nantasenamat on the Data
Professor channel and if you haven't
subscribed yet please consider
subscribing and clicking on the notification bell so that you will be
notified of the next video. So I'll see you in the next one!
