Hi! My name is Geoff Holmes, and today's lesson,
1.6, is infrared data from soil samples.
Before starting to talk about the actual application
we'll develop and look at in the Activity
1.6, I thought I'd just mention something
about application development in general.
The top academic conference in machine learning
is called ICML, International Conference on
Machine Learning.
This is where all the top people in the field
present their work.
In 2012, a paper was published at this conference
which was something of a wake-up call to the
machine learning community.
The author was Kiri Wagstaff from the Jet
Propulsion Lab in Pasadena, CA, and the paper,
which is accessible to anyone with an interest
in machine learning, is called Machine Learning
that Matters.
The URL there on the slide will enable you
to download it and read it.
What the paper points out is that the field is focusing too much on new
methods and on the accuracy of those methods,
and not enough on the kinds of applications
that will really make a difference.
What Kiri did was to suggest six challenges
for machine learning applications.
I'm not going to go through all six of the
challenges listed there on the slide.
I just want to talk about the highlighted
one: $100 million saved through improved decision
making provided by an ML system.
Now, believe it or not, you can develop an
ML system using near-infrared data on soil
samples that could save $100 million.
This lesson is only a starting point for such
a system, but it is possible.
Before we do that, let's just take a moment
to think about what machine learning requires
in order for us to develop an application
of any kind.
Well, it needs input and output in its training
phase.
In our case, we need a set of samples--
those are going to be soil samples in some
form, and you'll see that in a while--
and an output target value.
In our case, this is going to be a real-valued
number representing a property of interest
of the soil.
That could be organic carbon, organic nitrogen,
available nitrogen, or potassium.
Something that we're interested in predicting
from the input.
Our problem, of course, is to learn a mapping
that describes the relationship between the
input and the output.
We refer to this mapping as a model.
We build the model on our training data, and
then we apply that model to unseen observations--
new soil, if you like--
in order to predict the target soil property
we're interested in, such as the organic carbon.
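Just to make that pipeline concrete, here is a minimal sketch of the build-then-apply step using WEKA's Java API. The file names soil_train.arff and soil_new.arff are hypothetical stand-ins, and I'm assuming the target property is the last attribute in the file:

    import weka.classifiers.functions.LinearRegression;
    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;

    public class SoilModelSketch {
        public static void main(String[] args) throws Exception {
            // Training data: each row is one soil sample's NIR spectrum (the X values)
            // plus the wet-chemistry value for the property of interest (the Y value).
            Instances train = DataSource.read("soil_train.arff"); // hypothetical file name
            train.setClassIndex(train.numAttributes() - 1);       // Y is the last column

            // Learn the mapping (the model) from spectrum to soil property.
            LinearRegression model = new LinearRegression();
            model.buildClassifier(train);

            // Apply the model to an unseen spectrum (new soil) to estimate its Y.
            Instances unseen = DataSource.read("soil_new.arff");  // hypothetical file name
            unseen.setClassIndex(unseen.numAttributes() - 1);
            double estimatedY = model.classifyInstance(unseen.instance(0));
            System.out.println("Estimated soil property: " + estimatedY);
        }
    }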
Now we need to think about where we're going
to get X and Y from for this particular application.
Traditionally, soil samples are processed
using techniques called "wet chemistry" techniques,
and what those wet chemistry techniques are
trying to do is to determine the properties
of the soil, such as available nitrogen, organic
carbon and so forth.
They will result in the Y values that we're
interested in.
What we need for this application is for a
number of samples to have been processed using
wet chemistry to determine these Y values
for us.
Let's say we're interested in available nitrogen.
We need, let's say 50 or 100 different soil
samples to have been processed using wet chemistry
to produce 50 to 100 Y values.
We need to take a portion of each of those
samples from, let's say, a thing called a
"soil bank".
Suppose we've got a soil bank.
We divide each soil sample in half.
We send half off to the wet chemistry lab
to get the property determined, and with the
other half, we put that through a near-infrared
device.
That will produce the X values for our input.
Now, the near-infrared device produces a signature,
if you like, for the soil sample.
I've got an example of one there below on
the slide.
These values will form the input.
In terms of an ARFF file, they are the reflectance
values for a given wavelength band.
You'll see in the ARFF file produced for the
activity that it starts at around 350 nanometers--
that's the first attribute.
The next one might be 360 nanometers, then
370, 380, 390, 400, 410, and so on.
The number of attributes we have, as you'll
see in the example, is something like 200,
one for each of those spectral wavelength bands,
and the values are numeric--
the amplitudes, if you like, of the spectrum,
just the reflectance values that you get from the device.
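To make that concrete, here's a hypothetical, cut-down sketch of such an ARFF file. The real file for the activity has around 200 spectral attributes, and its attribute names will differ; this just shows the shape of the data:

    % Each attribute is the reflectance in one wavelength band; the last
    % attribute is the wet-chemistry target (here, organic carbon).
    % Attribute names and values are made up for illustration.
    @relation soil-NIR-sketch

    @attribute band_350nm numeric
    @attribute band_360nm numeric
    @attribute band_370nm numeric
    @attribute organic_carbon numeric

    @data
    0.412,0.418,0.425,4.3
    0.388,0.395,0.401,3.1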
So, as I said, you need a few hundred samples,
and that's not cheap: whatever number of samples
you've got, it's very cheap to get the X,
but expensive to get the Y, because you've
got to send the samples off for wet chemistry
analysis.
So, to put together a decent training set
is expensive.
Given that, why would you bother doing that
for the soil in this particular application?
Well, once you've got your 50-100 samples
and you've built your model, then if a farmer
comes in with a new soil sample and says "I
want to know what the available nitrogen is",
we just get out the available nitrogen model
that we built, we get the NIR spectrum for
that new sample--
that represents a new X, if you like--
we run it through the model, and it produces
an estimate of Y for that soil signature.
We'll be able to tell the farmer "for your
soil sample, the available nitrogen is 4.3"
or whatever that estimated Y value is.
Instead of days for the wet chemistry to take
place, we're talking about milliseconds for
the NIR device to produce the signature for
us to run through the model and get the estimate of Y.
That's the first thing that makes it useful.
It's very fast.
The second thing that makes it useful is that,
if we've got enough models, we can produce
estimates for a number of soil properties
from the same input, not just one.
If we've got, for example, wet chemistry which
has determined the potassium, available nitrogen,
the organic carbon, the organic nitrogen, and
so on, then we can build models for each of
those and for the same X value, we can produce
predictions for each of those soil properties.
So, in very short order--
on the order of milliseconds--
we can tell the farmer with the soil sample
what the values are for each of those soil
properties.
All right, so that's the value of it.
How do we actually go about doing the modelling?
Well, the training set, remember, let's imagine
it's an ARFF file.
The right-most column, or the class column,
would be a set of numeric values, so we're
talking about a regression problem.
Then the attributes are all these reflectance
values at various wavelengths.
They're all numeric values, as well.
Our X values are numeric, and so is Y.
The classifiers of interest are things such
as LinearRegression, REPTree, the M5P model
tree ("M5 prime"), RandomForest, support vector
machine regression, GaussianProcesses, and so on.
What I've done there is line up the algorithms
roughly in order of their processing speed.
In the activity, you'll process the data using
the first four, because, as you'll see, it's
quite a large dataset, and the other two really
take too long to be useful in the activity.
We'll be saying more about that later.
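As a rough sketch of what that looks like in WEKA's Java API, here are the four classifiers, all in default mode, each evaluated with 10-fold cross-validation; the data file name is a hypothetical stand-in. The correlation coefficient it prints is the figure of merit we'll come back to shortly:

    import java.util.Random;
    import weka.classifiers.Classifier;
    import weka.classifiers.Evaluation;
    import weka.classifiers.functions.LinearRegression;
    import weka.classifiers.trees.M5P;
    import weka.classifiers.trees.REPTree;
    import weka.classifiers.trees.RandomForest;
    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;

    public class SoilBenchmark {
        public static void main(String[] args) throws Exception {
            Instances data = DataSource.read("soil_raw.arff"); // hypothetical file name
            data.setClassIndex(data.numAttributes() - 1);      // the target (Y) column

            // The four classifiers from the slide, all with default parameters.
            Classifier[] learners = {
                new LinearRegression(), new REPTree(), new M5P(), new RandomForest()
            };
            for (Classifier learner : learners) {
                Evaluation eval = new Evaluation(data);
                eval.crossValidateModel(learner, data, 10, new Random(1));
                System.out.printf("%s: correlation coefficient = %.3f%n",
                        learner.getClass().getSimpleName(),
                        eval.correlationCoefficient());
            }
        }
    }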
The big message of the activity, though, is
that pre-processing can make a big difference
to a classifier's performance.
So, what the activity will basically take
you through is establishing a benchmark just
using the classifiers on raw data, and then
using various pre-processing techniques and
seeing whether or not the classifiers improve
on the datasets produced after pre-processing.
Typically for near-infrared data the kinds
of things you can do to pre-process it are
to sample the data, to remove baseline effects
and to smooth the spectrum.
You'll be doing all three of those and combinations
of those three in the activity.
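Just to illustrate the ideas (these are not the activity's actual filters, which you'll meet later), here's a minimal sketch of what those three operations might look like on a spectrum stored as an array of reflectance values:

    import java.util.Arrays;

    public class SpectrumPreprocessSketch {

        // Down-sampling: keep every k-th reflectance value to reduce the attribute count.
        static double[] downsample(double[] spectrum, int k) {
            double[] out = new double[(spectrum.length + k - 1) / k];
            for (int i = 0; i < out.length; i++) out[i] = spectrum[i * k];
            return out;
        }

        // First differences: subtracting adjacent values removes a constant
        // (additive) baseline offset from the whole spectrum.
        static double[] firstDifference(double[] spectrum) {
            double[] out = new double[spectrum.length - 1];
            for (int i = 1; i < spectrum.length; i++) out[i - 1] = spectrum[i] - spectrum[i - 1];
            return out;
        }

        // Moving average: smooth each point by averaging over a small window.
        static double[] movingAverage(double[] spectrum, int halfWindow) {
            double[] out = new double[spectrum.length];
            for (int i = 0; i < spectrum.length; i++) {
                int lo = Math.max(0, i - halfWindow);
                int hi = Math.min(spectrum.length - 1, i + halfWindow);
                double sum = 0;
                for (int j = lo; j <= hi; j++) sum += spectrum[j];
                out[i] = sum / (hi - lo + 1);
            }
            return out;
        }

        public static void main(String[] args) {
            double[] spectrum = {0.41, 0.43, 0.40, 0.45, 0.44, 0.47};
            System.out.println(Arrays.toString(downsample(spectrum, 2)));
            System.out.println(Arrays.toString(firstDifference(spectrum)));
            System.out.println(Arrays.toString(movingAverage(spectrum, 1)));
        }
    }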
The reason I mentioned the slower classifiers,
such as support vector machines and
GaussianProcesses, is that the activity involves
processing roughly 4,000 soil samples.
What you'll be doing is looking to see if
you can develop a model for organic carbon.
That would be the Y value.
But, as I said previously, organic nitrogen
is also in the dataset.
So, if you want to run the activity completely
again using organic nitrogen instead of organic
carbon, then you're welcome to.
What you'll do is you'll process the data
raw, and then you'll see what happens to the
results when you start applying the pre-processing
techniques.
The classifiers respond in different ways
to the different pre-processing techniques.
Some get better, some get worse, some stay
the same.
You'll see all of those effects through the
activity.
One thing that's worth bearing in mind is
that you're about to enter experimental machine
learning, where you're going to have lots
of results, because the activity takes you
through the first four classifiers on the
previous slide, but all in default mode.
Now, each of them has parameters that can
be tweaked, and each could form the basis
for a separate experiment.
You'll be using four pre-processing methods,
one of which is to do nothing, just use the
raw spectrum.
Now, some of those methods themselves have
parameters, as well.
Of course, you can combine the pre-processing
methods, as well.
So, the space of experiments is extremely
large: even before any parameter tweaking,
four classifiers times four pre-processing
methods gives sixteen experiments, and combining
pre-processing methods multiplies that further.
From all of that, you'll be able to produce
some pretty good results.
Now, what you'll particularly be looking at
is the correlation coefficient: how well does
the predicted value match the known value
from the training data, using cross-validation?
That will give you some idea of how close
you are, and what we want, of course, is to
produce models that get close to 1.0, a perfect
correlation with what you've seen in the training
data previously.
Now, you'll see that that's not possible,
because typically there's too much error in the data.
But it will be a starting point.
Mainly, you'll see the improvement you can
get, from the baseline or benchmark that you
establish with the raw data, when you apply
the various pre-processing techniques.
I hope you enjoy that.
I hope it whets your appetite for machine learning
application development.
