Well, today I want to talk about data mining,
which is what I'm really interested in,
and I want to explain a little bit about the inner workings of data mining,
a little bit of the sort of terms that you might have heard when you read the first lecture or the first book.
I want to talk about supervised learning,
unsupervised learning, what exactly are these things, and then
I want to get on to something new semi-supervised learning and also
What's the research at the moment in this area?
It's called machine learning.
That's the sort of applied artificial intelligence: machine
learning, where you get some data and you want to mine the data.
Broadly there are kind of two categories of methods for how this works, so if I could pull up my prop. Yes, I've carefully prepared.
Here are some items of data that I have brought along. The first method maybe I should explain is unsupervised learning,
because it's perhaps the easier one. It's called unsupervised learning
because we don't have any examples that are labeled, so it's learning from unlabeled data.
I guess the idea is a supervisor would know the answer, and here we don't have anybody who knows the answer.
So we get the data to begin with and we don't really know anything about it
We know obviously the attributes, we know the values, but we don't know what categories they are in. Let's say that's the problem.
So unsupervised learning very often is just a sorting of the data.
So you get your first data item and you put it somewhere,
and then comes another data item, and you basically go (let's do colors): is this similar or is this different?
Now this one is quite different, so we put it there. And then comes another data item. Oh,
is this similar or is it different?
It's a little bit similar to the yellow ones, so we'll put it a little bit closer to the yellow one.
Then comes another data item and no
This is obviously quite similar to the yellow one so we put it closer to here and then so over time you get all these
Data items in and they might end up a bit like
Something maybe a bit like that
So what have I done? I've done a sorting of the data,
and the approach I've used is based on similarity measures. These unsupervised methods all use a similarity measure; in this case
I've sorted kind of by color. The other way these methods usually work is to actually start out by asking: how many groups would
you like your data to be in? How many clusters would you like it to be in?
So let's say you want them in three clusters.
Well, then maybe the solution might look like this: it's clustered by color into three clusters.
If there had been four clusters, maybe the solution would have looked like this,
and if there were maybe two clusters it might even have looked like this. So you might ask, okay:
so what's the data mining about this sorting of the data? Well,
once we've sorted the data in this way, we can of course have a look at it all.
So what ended up together? Maybe these things have ended up together,
and maybe now we can say: oh, these are the light colors, these are the dark colors, and we certainly have two groups.
I mean, we wouldn't normally sort color cubes.
You would sometimes sort patients: are they mildly ill
or are they very ill, that sort of thing; we could sort anything like this. Now, most of the
unsupervised methods work exactly like I described: they work by sorting. The difference is how they measure the difference between things. Is
it a statistical similarity? Is it an
algebraic similarity? Is it some metric measure you can imagine? There are so many ways you can measure the difference between things.
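To make that concrete, here is a small sketch in Python of two common ways to measure how different two data items are, straight-line (Euclidean) distance and cosine similarity; the RGB tuples are just made-up example values.

```python
import math

def euclidean(a, b):
    # Straight-line distance between two points: small means similar
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cosine_similarity(a, b):
    # Angle-based similarity: 1.0 means the vectors point the same way
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

yellow, gold, navy = (255, 255, 0), (230, 200, 20), (0, 0, 128)
print(euclidean(yellow, gold))   # small: yellow and gold are close together
print(euclidean(yellow, navy))   # much larger: yellow and navy are far apart
```

Which measure you pick changes which items count as "similar", and therefore what sorting you end up with.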
Unsupervised learning is sort of quite a simple way of doing it
I mean, the algorithms are quite quick, but it's not as powerful as other methods.
What's the problem with it? One of the problems with it is actually quite straightforward.
Let's say we end up with this solution. Well, is this a good solution, or is it not a good solution?
It's actually really hard, it's really hard to evaluate, because we obviously don't know much about the data.
We don't know, so we're looking at it
going: does this look okay?
But maybe not. And then very often what happens, actually, is that if you look at the data from one way
it looks like a good solution, but now I do my reveal: we sort of turn the data a bit,
and suddenly we have another angle on the data, and actually now it's a mess.
They're not really sorted very well at all, are they?
That's often what happens with unsupervised learning: you sort them one way and they look quite good,
but then we look at the data differently and actually this hasn't quite worked,
and it's not so great. The other downside with
unsupervised learning is that the algorithms really only work when you tell them how many groups you want the data to be in:
two groups, three groups, four groups.
For some problems you might know: maybe you have, like I say, ill patients and healthy patients,
and you know there are two groups. But very often, how many groups you have is the whole question, so you can't really use
these methods that well. If you want to know some technical terms: k-means, for example, is a classic unsupervised method
that's very popular, so if you look it up you'll learn a bit more about it.
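As a rough sketch of how k-means works (a from-scratch toy, not production code; for real use you would reach for a library implementation such as scikit-learn's), the algorithm alternates two steps: assign each point to its nearest center, then move each center to the mean of its points.

```python
import random

def kmeans(points, k, iters=20, seed=0):
    # Classic k-means: you have to choose k, the number of clusters, up front
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    for _ in range(iters):
        # Assignment step: each point joins its nearest center
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k),
                          key=lambda i: sum((a - b) ** 2 for a, b in zip(p, centers[i])))
            clusters[nearest].append(p)
        # Update step: each center moves to the mean of its cluster
        for i, c in enumerate(clusters):
            if c:
                centers[i] = tuple(sum(vals) / len(c) for vals in zip(*c))
    return clusters

# Two obvious blobs of 2-D points (made-up data)
data = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
groups = kmeans(data, k=2)
```

Note that `k=2` had to be supplied by hand, which is exactly the limitation described above.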
now...
The second way of doing learning would be the supervised way.
We said unsupervised, so there must be a supervised way. Here the difference
is that you have some data which has some answers attached to it already, so you can learn from it,
really learn from this data. And a classic way of doing it is, well,
neural networks, one of the best-known ones. How does that work? Okay, well,
so I have some data again, and this time let's say we want to do something a bit different:
we want to just sort them into light colors and dark colors, for example.
And what happens is I get my data in, and already somebody has labeled the data for me.
They said: these are light colors, these are dark colors. So we already know the answer for this data.
We don't know it for some other data, but we know it for this.
This is our training data.
And now I'm going to do some learning, a neural network. The first data item comes in, it goes here.
The next data item comes in and goes here.
And I keep doing this, and maybe I end up with something that looks like this.
And now of course I can assess the quality of the solution and go: oh well, algorithm, you've done
okay, but you haven't done
it really well, because these two should be over there, and this one should really be there. Fix the function a bit and do it again.
Okay, back,
and we might end up like this. It's like,
okay, that was better,
but it's still got one wrong. Fix the function again and do it again; this is called a backpropagation neural network.
And we'll do this again,
and of course if you do this long enough, eventually the algorithm will learn the perfect function for how to sort things, and
then the idea is a new data item comes along, and
it will go through the same function, and because the function is now perfect, it will end up in
exactly the right place, no problem.
and then ah
and then no problem
so
It's supervised because we have labels, and because of the labels
we can assess the quality. In neural networks that's the classic way of doing this, and in general supervised learning is very powerful because,
as long as we have enough data with enough labels, we can always learn the function, and then it should work really well.
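That train-and-correct loop can be sketched with a single artificial neuron, much simpler than a real backpropagation network, but the same idea: predict, compare with the label, nudge the weights. The labeled RGB examples here are made up for illustration.

```python
import math
import random

# Made-up training data: RGB color -> label (1 = light, 0 = dark)
training = [
    ((250, 250, 210), 1), ((255, 255, 0), 1), ((240, 230, 140), 1),
    ((25, 25, 112), 0), ((0, 0, 0), 0), ((47, 79, 79), 0),
]

random.seed(0)
weights = [random.uniform(-0.1, 0.1) for _ in range(3)]
bias = 0.0
rate = 0.1

def predict(rgb):
    # Scale channels to 0..1, weigh them, squash with a sigmoid to get 0..1
    z = bias + sum(w * (c / 255) for w, c in zip(weights, rgb))
    return 1 / (1 + math.exp(-z))

for _ in range(2000):                 # repeat: guess, measure the error, nudge
    for rgb, label in training:
        error = label - predict(rgb)  # how wrong was the guess?
        for i in range(3):
            weights[i] += rate * error * (rgb[i] / 255)
        bias += rate * error

# A light color the neuron has never seen still lands on the right side
print(predict((255, 250, 205)) > 0.5)
```

Each pass through the loop is the "fix the function a bit and do it again" step from the demonstration.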
But well, there wouldn't be any research if we were finished with it,
so there's obviously a problem with this as well. The problem is that it can lead to overfitting.
What does overfitting mean?
It means, like, tight jeans, you know. No, not that. It means that you have put
too much emphasis on getting the function right; you make it too right.
So the function is absolutely perfect. In fact it's so perfect, it's brittle; it's just not good anymore.
So what happens is a new data item comes along, one that you've not seen before.
I've got one.
An unsupervised method wouldn't have a problem with this, because it just goes by similarity and will go:
it's kind of a light color, you'd probably end up here.
But a supervised method has never seen this color before, and the function goes: what do I do with this? And it...
pffft...
breaks, or it puts it just at a random place, maybe here. So supervised learning is really good,
but if you overdo it then you've overfitted, and the problem is that you actually make the system worse again. You've made it brittle.
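A deliberately silly sketch of what "brittle" means here: a model that has simply memorized its training data (the extreme case of overfitting) next to a similarity-based fallback, with made-up colors.

```python
# Made-up training data: RGB color -> label
training = {(255, 255, 0): "light", (240, 230, 140): "light",
            (25, 25, 112): "dark", (0, 0, 0): "dark"}

def memorizer(rgb):
    # The extreme overfit: perfect on everything seen, useless on anything new
    return training.get(rgb, "???")

def nearest_neighbour(rgb):
    # Similarity-based: copy the label of the closest known color
    closest = min(training, key=lambda t: sum((a - b) ** 2 for a, b in zip(t, rgb)))
    return training[closest]

new_color = (250, 250, 210)           # never seen during training
print(memorizer(new_color))           # prints ???   (the function "breaks")
print(nearest_neighbour(new_color))   # prints light (similarity still helps)
```

Real overfitting is subtler than a lookup table, but the failure mode is the same: a model tuned too tightly to its training examples has nothing sensible to say about new ones.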
The other downside of supervised learning is you must actually have enough data
with labels, which for some problems you have, that's fine, but for some problems you don't really have it.
So let's talk about a practical problem that I was working on. I was working with doctors in a hospital,
clinicians who look after colon cancer patients, and
they took many years to collect the data of about 500 patients:
classic medical data, so we've got age, clinical history,
we've got genetic values, blood values, and so on and so on.
They get diagnosed into different categories of illness, some more serious, some less serious, and the doctors wanted some help with this categorization.
The most serious cases and the least serious cases
are quite clear, but it's this whole group in the middle,
and they wanted to see whether we could split them a bit better.
And so we were working on this with them, and this is a classic problem.
In that case there were 500 patients that were already categorized as to what
category of illness they were in, so actually a supervised approach was really good, because we could learn from those
500 and build up a picture, and as long as we were careful not to overdo it, we'd be fine.
But then, and this leads me on to what my research is at the moment,
what actually happened is that not
for all of the 500 patients did they have all the labels, because some of the technology has been changing over the years.
There are more modern tests now that they didn't have ten years ago,
so actually for the last 50 patients they had some additional labels that they didn't have for all the others.
So we were talking about what to do with this,
and there's a method called semi-supervised learning, which is kind of what the research is on.
Can we take the best of both worlds and maybe combine them a bit? What if you've got a few labels?
It's not enough to learn perfectly, but maybe we can do something. So what we've done is a semi-supervised method,
and it's kind of a mixture of the two.
You get your data, and let's just say we want to split them into light and dark colors;
that's basically our more serious patients and our less serious patients.
And you might end up sorting the data something like that, because it's an unsupervised approach first of all; we don't know exactly how good
this is.
But then for some of the data items we have a label, and we can look up
what the number on them is. And because some of them have a label now,
we can say: okay, are all the ones with the same label, or with a similar label, in the same group? Suddenly
we can assess the quality of this.
So we don't have a label for all of these, but we have a label for some of them. Are they in the same group?
Yes, and the same labels are in the same group.
Yeah,
that looks like a good solution. Semi-supervised learning is probably the future, because as datasets get bigger and bigger,
you don't have labels for everything anymore, because nobody has time to label everything, and computers can't really label things very well.
So you'll have the experts labeling a few things, and
semi-supervised learning will be where this is going.
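The check described here, cluster everything without labels and then use the few labels you do have to judge the grouping, can be sketched like this. The colors, the brightness-threshold "clustering", and the labels are all made up for illustration.

```python
def brightness(rgb):
    return sum(rgb) / 3

colors = [(255, 255, 0), (240, 230, 140), (250, 250, 210),
          (25, 25, 112), (0, 0, 0), (47, 79, 79)]

# Unsupervised step: two groups by a simple brightness threshold
groups = {c: ("A" if brightness(c) > 128 else "B") for c in colors}

# Only a few items have been labeled by the experts
known_labels = {(255, 255, 0): "light", (0, 0, 0): "dark", (25, 25, 112): "dark"}

def consistent(groups, known_labels):
    # Quality check: do items carrying the same label land in the same group?
    seen = {}
    for item, label in known_labels.items():
        if label in seen and seen[label] != groups[item]:
            return False
        seen[label] = groups[item]
    return True

print(consistent(groups, known_labels))  # True: the grouping agrees with the labels
```

The unlabeled items never get checked directly; the handful of labeled ones stand in for them, which is the "mixture of the two" described above.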
But then the next step really would be to have it interactive; that would be even better.
So that's kind of what we're working on right now. It's called man-in-the-loop or human-in-the-loop learning,
where
you maybe have no labels at all, or maybe just very few, and then you do some sorting of the data, and then we ask
the expert: has the sorting gone well? Has it not gone well?
Well, what about this one item, what would be the label you would give it? So it's a bit interactive,
and I think that will be much better, because then it's more in real time, and you can actually also
bring in the latest developments, tacit knowledge that you might not even have in the data.
So that's like spot checking? Yeah, exactly, it's like spot checking, but then putting that knowledge back into the algorithm,
so the algorithm can learn from it again, and it sort of reinforces it a bit.
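That interactive loop is close to what the literature calls active learning: the algorithm asks the expert about the item it is least sure of. A minimal sketch, with a simulated expert function and made-up colors standing in for the real human and data:

```python
def brightness(rgb):
    return sum(rgb) / 3

def expert_opinion(rgb):
    # Stand-in for a real human doing a spot check
    return "light" if brightness(rgb) > 150 else "dark"

colors = [(255, 255, 0), (0, 0, 0), (130, 128, 125)]  # the last one is ambiguous
labels = {}

# Confidence: distance from the light/dark decision threshold of 128
least_sure = min(colors, key=lambda c: abs(brightness(c) - 128))

# Ask the expert only about that one item and feed the answer back in
labels[least_sure] = expert_opinion(least_sure)
print(least_sure, labels[least_sure])
```

The point is the economy of it: the expert is only asked about the single most ambiguous item, and that answer becomes a new label the algorithm can learn from.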
