Welcome back to the second video on how to
use R from Weka.
Last time we saw how we can install the R
plugin for Weka so that we can use some of
the functionality in R from Weka.
We saw how we can issue a simple command in
the R console in the Weka Explorer to plot
some data.
Today, we're going to look a bit more at the
extensive plotting functionalities in R.
More specifically, we'll look at the ggplot2
package for R and what kinds of plots we can
generate with this package.
Okay! Let's get started.
To save some time, I've loaded the iris data
into the Explorer already.
Now, we go to the R console to issue our commands
in the R language.
The first thing we need to do is install the
ggplot2 package, which is the plotting package
that we want to use: 'install.packages("ggplot2")'.
Okay, it's finished.
Now that we've downloaded and installed the
package, we can load it into the R environment
by using the library function.
We use "library(ggplot2)".
Now the library is loaded.
We can use it to plot some data.
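The two setup commands typed so far can be sketched as follows. The requireNamespace guard is an addition that is not in the video; it only skips the download when the package is already installed.

```r
# Install ggplot2 only if it is missing. The guard is an assumption added
# here; in the video, install.packages("ggplot2") is called directly.
if (!requireNamespace("ggplot2", quietly = TRUE)) {
  install.packages("ggplot2")
}

# Load the package into the current R session.
library(ggplot2)
```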
In ggplot2, we construct a plot in layers.
We can add several different layers of plots
to construct very complex plots, but there's
one layer that is always present in every
plot.
That is the data layer, which specifies the
data that needs to be plotted.
The data layer is specified using the ggplot
function.
With the ggplot function, we specify the data
we want to use.
In this case, the data is referred to using
the rdata variable.
rdata is the name of the variable that refers
to the data that we've loaded into the Preprocess panel.
Then, we need to also say which attributes
in the data we want to use.
This is done using the aesthetics function,
the aes function.
As the second argument to ggplot, we pass the
result returned by the aesthetics function.
Say "x = petallength" to specify the petallength
attribute as the attribute we want to plot.
In this case, you're just generating a plot
based on a single attribute in the data.
This is now the data layer for our plot.
We also need to add a geometry layer, which
actually specifies what type of plot we want
to generate.
Let's say we want to generate a kernel density
estimate based on this attribute that we have
selected.
Then we add another layer to our plot using
the plus operator.
We call the geometry function for density
estimates, geom_density().
Okay, let's try this.
Right.
Now we have a kernel density estimate for
the petallength attribute.
On the X axis, we have the value of the petallength
attribute, and on the Y axis, we have the
density estimate.
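Put together, the data layer and the geometry layer might look like the sketch below. Inside Weka, rdata is provided by the plugin; the stand-in built here from R's built-in iris data is an assumption so that the snippet also runs outside Weka (its class labels differ slightly from the ones in Weka's file).

```r
library(ggplot2)

# Assumption: outside Weka there is no rdata variable, so this stand-in is
# built from R's built-in iris data, renamed to Weka's attribute names.
rdata <- setNames(iris, c("sepallength", "sepalwidth",
                          "petallength", "petalwidth", "class"))

# Data layer (data plus aesthetics) and a density geometry layer, joined with "+".
p <- ggplot(rdata, aes(x = petallength)) + geom_density()

# At an interactive console the expression alone displays the plot;
# in a script, print() is needed.
print(p)
```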
You can see that there are two peaks in this
density estimate, but you can also see that
the plot is not wide enough to cover the entire
area that is relevant.
We should increase the limits of the plot,
and we can do that by adding a call to the
xlim function, where we specify the lower
limit and the upper limit.
Let's say we use 0 as the lower limit and
8 as the upper limit.
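As a sketch, widening the X axis is one more term added with the plus operator. The rdata stand-in is again an assumption for running outside Weka.

```r
library(ggplot2)

# Assumption: stand-in for Weka's rdata, built from R's built-in iris data.
rdata <- setNames(iris, c("sepallength", "sepalwidth",
                          "petallength", "petalwidth", "class"))

# xlim(lower, upper) sets the range shown on the X axis.
p <- ggplot(rdata, aes(x = petallength)) +
  geom_density() +
  xlim(0, 8)
print(p)
```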
That looks better, but perhaps this kernel
density estimate is still a little bit too smooth.
It doesn't show enough detail in the data,
because the kernels that are used are too wide.
Let's reduce the width of each kernel.
We can do that by specifying the adjust argument
for the geom_density function.
This multiplies the width of each kernel by
the given parameter.
Let's say we halve the width of each kernel
estimator.
Now, we get a plot showing a little bit more
detail.
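Halving the kernel width could be written like this; adjust = 0.5 multiplies the default bandwidth by one half. The rdata stand-in is an assumption for running outside Weka.

```r
library(ggplot2)

# Assumption: stand-in for Weka's rdata, built from R's built-in iris data.
rdata <- setNames(iris, c("sepallength", "sepalwidth",
                          "petallength", "petalwidth", "class"))

# adjust scales the kernel bandwidth: values below 1 give a less smooth estimate.
p <- ggplot(rdata, aes(x = petallength)) +
  geom_density(adjust = 0.5) +
  xlim(0, 8)
print(p)
```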
In Weka, we primarily deal with classification
problems.
So, really, we should try to take the class
information into account in our plot.
We can do that by generating three different
plots, one for each class value, and combine
them into one graph.
How do we do that?
It's very simple.
We just add another argument to the call of
the aesthetics function.
Just say the color is given by the class attribute
in rdata.
Class is the name of the class attribute in
the iris data.
We just say that the color is based on the
class attribute.
Now, we get a separate kernel density estimate
for each of the three classes.
You can see that the distributions for iris_versicolor
and iris_virginica overlap a little bit, but
iris_setosa is nicely separated.
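A sketch of the per-class version: the only change is the extra color argument inside aes. The rdata stand-in is an assumption, and its class labels come from R's built-in iris data rather than from Weka's file.

```r
library(ggplot2)

# Assumption: stand-in for Weka's rdata, built from R's built-in iris data.
rdata <- setNames(iris, c("sepallength", "sepalwidth",
                          "petallength", "petalwidth", "class"))

# Mapping color to the class attribute yields one density curve per class value.
p <- ggplot(rdata, aes(x = petallength, color = class)) +
  geom_density(adjust = 0.5) +
  xlim(0, 8)
print(p)
```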
We may want to enhance this plot by filling
the area under each estimate.
This is also easy.
It's again done by providing an additional
argument to the aesthetics function.
You just say the fill color should also be
based on the class attribute.
You can see that there is a little bit of
a problem here.
We can't really differentiate the iris_versicolor
and the iris_virginica cases.
We should introduce some transparency in our
plot.
We can do that by providing an alpha value
for our kernel density estimators.
This is a value between 0 and 1 that determines
the amount of transparency.
1 means no transparency; 0 means totally transparent.
Let's set this to 0.5.
Now, we have a nice plot of the three kernel
density estimates.
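The filled, semi-transparent version might be sketched as follows. Here fill is mapped inside aes, while alpha is passed to geom_density as a fixed value. The rdata stand-in is an assumption for running outside Weka.

```r
library(ggplot2)

# Assumption: stand-in for Weka's rdata, built from R's built-in iris data.
rdata <- setNames(iris, c("sepallength", "sepalwidth",
                          "petallength", "petalwidth", "class"))

# color and fill both follow the class; alpha = 0.5 makes the filled areas
# half transparent so overlapping classes stay visible.
p <- ggplot(rdata, aes(x = petallength, color = class, fill = class)) +
  geom_density(adjust = 0.5, alpha = 0.5) +
  xlim(0, 8)
print(p)
```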
Let's say we want to generate the same kind of
plot, but for all four attributes in the iris
data, not just the petallength attribute.
We can also do that, but we need to massage
our data a little bit to achieve that.
We need to load a library called reshape2:
"library(reshape2)".
Then, we can call the so-called melt function
to transform our data into an appropriate
format: "melt(rdata)".
The new data, in the new format, will be stored
in ndata.
Let's just have a look at what this data
looks like.
We can just type in "ndata", and it will show
us the data.
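The reshaping step can be sketched like this. The rdata stand-in is an assumption for running outside Weka, and head is used here instead of printing all rows.

```r
library(reshape2)

# Assumption: stand-in for Weka's rdata, built from R's built-in iris data.
rdata <- setNames(iris, c("sepallength", "sepalwidth",
                          "petallength", "petalwidth", "class"))

# With no id variables given, melt() uses the factor column (class) as the
# identifier and stacks the four numeric attributes into variable/value pairs.
ndata <- melt(rdata)

# Show only the first few rows of the transformed data.
head(ndata)
```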
You can see that we have 600 instances in
the transformed dataset.
There are three attributes in the dataset.
The class value is given as the value of the
first attribute.
The name of the attribute is given as the
second attribute, and the attribute value
is given as the third attribute.
Scrolling all the way up to the first instance,
we can see the first attribute now is called
class.
The second attribute is called variable, and
the last attribute is called value.
We have 600 instances because there are 4
attributes and 150 instances in the original
dataset.
Effectively, we now have a separate block of
150 rows for each of the attributes.
First, we have all of the attribute values
for the 150 iris flowers for sepallength.
Then we have the values of all 150 iris flowers
for sepalwidth.
Then we have petallength, and finally we have
petalwidth.
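One way to check this structure without scrolling is sketched below. The rdata stand-in is an assumption, and id.vars is spelled out here although the video calls melt(rdata) without it.

```r
library(reshape2)

# Assumption: stand-in for Weka's rdata, built from R's built-in iris data.
rdata <- setNames(iris, c("sepallength", "sepalwidth",
                          "petallength", "petalwidth", "class"))
ndata <- melt(rdata, id.vars = "class")

dim(ndata)             # 600 rows and 3 columns
names(ndata)           # "class" "variable" "value"
table(ndata$variable)  # 150 rows for each of the four original attributes
```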
Now that we have the data in this format,
we can use the variable attribute as a way
to generate different plots for each attribute.
How do we do that?
It's quite simple.
Our X value is now based on the value attribute
in this transformed data.
That is the actual numeric value for each
of the attributes.
The color is still based on the class, and,
at the end, we now use the facet_grid function,
to generate a grid of facets, where facets
are subplots.
Here, as the argument for the facet_grid function,
we specify a formula that says which attribute should
be used for the rows of the grid, and
which attribute should be used for the
columns of the grid.
In this case, we only have one meaningful
dimension.
Let's say we want to use variable as the variable
determining the rows.
Then we use the tilde character to separate
the rows from the columns.
In this case, we don't have a variable for the columns of the grid, so we just use
a full stop.
This means there will be just one column in
the grid.
I forgot to change the name of the data.
We want to plot ndata, not rdata.
Now, you can see that we have a different
plot for each attribute.
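A sketch of the faceted plot with one row per attribute follows. The rdata stand-in is an assumption, and so is the exact set of layers carried over from the earlier plot, since the video does not spell out the full command.

```r
library(ggplot2)
library(reshape2)

# Assumption: stand-in for Weka's rdata, built from R's built-in iris data.
rdata <- setNames(iris, c("sepallength", "sepalwidth",
                          "petallength", "petalwidth", "class"))
ndata <- melt(rdata, id.vars = "class")

# variable ~ . puts one facet per attribute into its own row;
# the dot means there is no variable for the other dimension.
p <- ggplot(ndata, aes(x = value, color = class, fill = class)) +
  geom_density(alpha = 0.5) +
  facet_grid(variable ~ .)
print(p)
```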
In the first facet, the first row in this
case, we have the sepallength.
The second row we have the sepalwidth.
The third row we have the petallength, and
in the fourth row, we have the petalwidth.
We can also use columns instead of rows simply
by swapping the order of the arguments here.
We can use a dot on the left-hand side of
the tilde and "variable" on the right-hand side.
Now we have the kernel density estimates arranged
side by side, one column per attribute.
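The swapped version might look like this; only the facet_grid formula changes. The rdata stand-in and the choice of the other layers are assumptions.

```r
library(ggplot2)
library(reshape2)

# Assumption: stand-in for Weka's rdata, built from R's built-in iris data.
rdata <- setNames(iris, c("sepallength", "sepalwidth",
                          "petallength", "petalwidth", "class"))
ndata <- melt(rdata, id.vars = "class")

# . ~ variable puts one facet per attribute into its own column.
p <- ggplot(ndata, aes(x = value, color = class, fill = class)) +
  geom_density(alpha = 0.5) +
  facet_grid(. ~ variable)
print(p)
```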
Now that we have generated a nice looking
plot, we may want to save it as a PDF file.
We can do that quite easily, as well.
We just need to redirect the output of the
plot.
We do that by using the pdf function, and
we specify the file name.
Let's say "/Users/Eibe/Documents/test.pdf".
Then we simply call the plotting function
again.
Now it's actually printing the plot into the
PDF file, and to redirect our plot to the
window again, we just call the "dev.off()"
function.
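Saving to PDF can be sketched as follows. The temporary file path is an assumption standing in for the path used in the video, and the rdata stand-in is an assumption for running outside Weka; any ggplot2 plot can be written this way.

```r
library(ggplot2)

# Assumption: stand-in for Weka's rdata, built from R's built-in iris data.
rdata <- setNames(iris, c("sepallength", "sepalwidth",
                          "petallength", "petalwidth", "class"))
p <- ggplot(rdata, aes(x = petallength, color = class, fill = class)) +
  geom_density(alpha = 0.5)

# Assumption: a temporary file replaces the path typed in the video.
out_file <- file.path(tempdir(), "test.pdf")

pdf(out_file)  # open a PDF graphics device; plots now go to the file
print(p)       # draw the plot into the PDF
dev.off()      # close the device, which finishes writing the file
```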
There are many other types of plots that we
can generate with ggplot2.
We can generate scatter plots, two-dimensional
kernel density estimate plots, and many other plots.
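These two plot types are not shown in the video, so the following is only a sketch of what they could look like. The choice of petallength and petalwidth, and the rdata stand-in, are assumptions.

```r
library(ggplot2)

# Assumption: stand-in for Weka's rdata, built from R's built-in iris data.
rdata <- setNames(iris, c("sepallength", "sepalwidth",
                          "petallength", "petalwidth", "class"))

# Scatter plot of two attributes, colored by class.
p_scatter <- ggplot(rdata, aes(x = petallength, y = petalwidth, color = class)) +
  geom_point()
print(p_scatter)

# Two-dimensional kernel density estimate drawn as contour lines.
p_density2d <- ggplot(rdata, aes(x = petallength, y = petalwidth)) +
  geom_density_2d()
print(p_density2d)
```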
One very useful type of plot that we cannot
generate with Weka's own graphical user interfaces
is a box plot.
So, let's generate a box plot for the iris
data for each attribute individually using
facet grids.
First, we need to specify the data layer again
using the ggplot function.
We call "ggplot" and, let's say, use the ndata
that I've already prepared.
And then, we use the aesthetics function to
specify what exactly we want to plot.
We want to plot the value on the Y axis in
a box plot, and we want to use the class to
distinguish different box plots on the X axis.
We want the color to be also based on the
class.
Now, we use the geom_boxplot function to generate
box plots.
We use the facet_grid function again to generate
the grid of plots.
In this case, let's use variable to determine
the columns.
As you can see here, we have a really nice
set of box plots.
First, we have the box plot for sepallength,
then for sepalwidth, then for petallength
and for petalwidth.
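The box plot command described above might be sketched like this. The rdata stand-in is an assumption for running outside Weka.

```r
library(ggplot2)
library(reshape2)

# Assumption: stand-in for Weka's rdata, built from R's built-in iris data.
rdata <- setNames(iris, c("sepallength", "sepalwidth",
                          "petallength", "petalwidth", "class"))
ndata <- melt(rdata, id.vars = "class")

# One box per class on the X axis, attribute values on the Y axis,
# and one column of the facet grid per attribute.
p <- ggplot(ndata, aes(x = class, y = value, color = class)) +
  geom_boxplot() +
  facet_grid(. ~ variable)
print(p)
```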
So, we have generated a fairly complex plot
here.
You can generate many more types of plots
using ggplot2.
Hopefully, this has given you a taster.
See you next time.
