Hi! It's me, back again.
I've probably got a bit of a tan since I saw
you last.
I've been out sailing around the coast of
New Zealand in my beautiful yacht, Beulah--here
she is--for a couple of weeks while the other
guys have been recording the lessons.
Anyway, I'm just back here to close out the
course.
We've done a lot in this course.
This is just a summary.
We've covered a lot of ground, and you've
learned a lot.
Congratulations for getting this far, and double
congratulations if you've managed to do all
the activities.
At the end of the last course, More Data Mining
with Weka (this is a slide from the
last lesson in the last course), I looked at
what we'd missed in that course, and I proposed
that we might do a third course, Advanced
Data Mining with Weka.
Well, this is what we've done in Advanced
Data Mining with Weka.
We've done most of the things that I proposed
at the end of the last course.
A couple of things have been missed out: multi-instance
learning and latent semantic analysis.
You'll have to learn those yourself, I'm afraid.
We didn't do a lesson on one-class classification,
but there was a good activity on that, Activity 3.1.
We've done some extra things.
We've done some scripting in the Python and
Groovy languages, and we've done some applications.
And, of course, we've done the Weka package
system.
So, we've done pretty well what we promised
to do, or what we suggested we might do last
time, and a bit more besides.
I hope you've enjoyed it.
The applications we've looked at have been
particularly enlightening, I think.
The first one was Geoff Holmes talking about
infrared data from soil samples.
He explained that it was hard to achieve sufficiently
good performance for practical application.
In the activity, you didn't get there.
You need to do more work on those datasets.
You need to investigate dealing with outliers,
improving the quality of the data, and doing
some more tweaking of the classifiers and
filters in that huge space of experimentation.
Then Tony Smith talked about bioinformatics,
the problem of signal peptide prediction,
and he emphasized that domain knowledge is
vital.
You need to collaborate with experts.
That's true, of course, for all applications.
You need to know whether you're looking for
an accurate prediction or an explanatory model,
and overfitting, of course, is a big issue
in all applications.
Then Pamela talked about functional MRI neuroimaging
data.
You know, what's going on in your brain! It
was a 4D dataset:
the 3 dimensions of your head plus an extra
dimension of time.
Again, the performance we got in the activity
was not all that high, and there were various
things that we might consider doing to improve
that, most of which would involve domain experts
to help interpret the data.
This is a common thread through all the applications.
A very interesting finding was that, in an early
competition, the demographic data alone did well;
in fact, it won the competition!
It's extremely important to evaluate what
you're doing and try the simple models first.
We've been saying that all along.
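Just to make "try the simple models first" concrete, here is a minimal sketch
in plain Python rather than Weka itself. It mimics the idea behind Weka's ZeroR
baseline: always predict the majority class, and see how far that gets you.
The function names zero_r and accuracy, and the toy labels, are made up for
illustration; they are not from the course or from any competition.

```python
from collections import Counter


def zero_r(train_labels):
    """Return the most frequent class in the training labels (the ZeroR idea)."""
    return Counter(train_labels).most_common(1)[0][0]


def accuracy(predicted, actual):
    """Fraction of predictions that match the actual labels."""
    correct = sum(p == a for p, a in zip(predicted, actual))
    return correct / len(actual)


# Hypothetical toy labels, invented purely for illustration.
train_labels = ["yes", "yes", "no", "yes", "no", "yes"]
test_labels = ["yes", "no", "yes", "yes"]

# The baseline ignores all attributes and always predicts the majority class.
majority = zero_r(train_labels)
baseline = accuracy([majority] * len(test_labels), test_labels)
print(f"ZeroR predicts '{majority}' with accuracy {baseline:.2f}")
```

Any fancier model ought to beat that baseline figure on the same test data
before you put much faith in it.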
Finally, Mike told us
about image classification and the specialist
feature extraction techniques for images.
In fact, when I asked him to do this lesson,
we didn't have the feature extraction package
that we now have in Weka.
He created it in order to do the lesson.
This is typical in applications.
You need different extraction techniques for
different kinds of data.
I'm interested in enabling you to carry on
learning, to keep learning in the future.
One really good way is to look at data mining
competitions.
There's a website called Kaggle.
Let me just find it for you.
Just do a Google search for Kaggle, and here
we have it.
Kaggle Competitions.
There are a large number of competitions here.
The first group, this group here, these are
the featured competitions, and here you can
win money.
This AI Science Challenge is worth $80,000,
for example, and the Home Depot Product Search
Relevance is worth $40,000.
You can win real money doing data mining with
these competitions.
The second group of competitions is for recruitment
purposes.
You can get jobs if you do well with the Airbnb
challenge or the Telstra challenge, or the
Yelp challenge.
They'll offer you a job in data mining, so
that's pretty cool.
Here are some featured datasets.
Actually, you're very familiar with the Iris
dataset from the first courses, but here are
some interesting ones:
the Ocean Ship Logbooks, and Salaries in
the San Francisco area.
And some datasets for playing around:
here's the San Francisco crime classification
dataset;
sounds very interesting.
And this last group, "Getting Started",
contains tutorial/educational competitions.
You can play around with these and look at
what other people have done.
These are all current competitions.
You can find past competitions by searching
for "completed competitions";
that's the phrase.
Let's just look for those.
Here we've got competitions from two years
ago.
Half a million, two years ago.
Sorry, you're too late for that, but anyway,
someone won half a million two years ago.
There's big money in competitions.
Here's a quarter of a million,
again a couple of years old.
So there are just a lot of past competitions.
On the Kaggle website, we have not just those
competitions, but information about completed
competitions, past solutions, interviews with
winners on the Kaggle blog, and descriptions
of winners' solutions.
So there's a lot of information there.
If you want to keep learning about data mining,
Kaggle would be a good place to start.
I have to finish with a little word on
ethics.
Don't forget! I'm always saying this.
Ethics of data mining is very much in the
news these days.
These are just a few web quotes I got with a
very quick search.
"More than ever, knowingly or unknowingly,
consumers disseminate personal data in daily
activities." Well, we all know that.
"As companies seek to capture data about consumer
habits, privacy concerns have flared." Yes.
"Data mining: where legality and ethics rarely
meet." That's an interesting little title,
and the point of that article was that just
because you're doing things in accordance
with the law doesn't necessarily mean you're
doing things ethically.
I would like you to do things ethically, because
you're an ethical person.
It's the right thing to do.
You have personal integrity.
But if that's not enough for you, there are
good business reasons for doing things ethically.
"Big data might be big business, but overzealous
data mining can seriously destroy your brand."
You have to be very careful when you're doing
data mining.
And, the final one, "What big data needs:
A code of ethical practices".
So please be aware of ethical issues when
you do your data mining.
Well, that's it.
This is the end of the course.
I hope to meet you again in some other place,
some other time.
I look forward to that.
Meanwhile, enjoy your data mining.
Good luck with the assessment that you're
about to do to get your statement of completion,
and while you're doing that, I'll go back
to doing something I really love, and play
some music.
Bye for now!
