I mean, data mining is: you get a lot of information in a lot of raw data,
and you want to get the nuggets of information out.
Hence the word mining: the gold in the data. That's data mining.
It usually starts with people saying, "Oh, we've got loads of data.
Can we make some money from this? Is there something interesting in there that we haven't found ourselves yet?
You tell us, you're the expert." That's how it usually starts.
So it might be big companies, or medical people, doctors and hospitals; they might have lots of data.
Actually, the first step usually is: what do you actually want to know from the data? Because people aren't always that clear
what they're actually after. This is where it comes to artificial intelligence.
So a lot of the work that we do is actually applied artificial intelligence.
So you get your data, and then how do you get a pattern out of it? It involves algorithms,
it involves programming. It may be some mathematical or statistical
systems that you want to design, or it might be more artificial intelligence,
using something like evolutionary computation or machine learning in the broadest terms. For example, sometimes you get your data
and it just looks like this: this is how long you studied, this is the grade you get on the exam. Really obvious.
There's obviously a clear correlation.
Perfect, yes. I mean, this is not very difficult, and we can use some simple statistics with this. That's fine, that's easy.
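As a rough sketch of the "simple statistics" case, the snippet below computes a Pearson correlation coefficient by hand. The hours and grades are made-up illustration values, not data from the talk, and `pearson` is just a helper name chosen here.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Made-up illustration data: hours studied and the exam grade obtained.
hours = [1, 2, 3, 4, 5, 6, 7, 8]
grades = [42, 50, 55, 61, 68, 72, 80, 85]

r = pearson(hours, grades)
print(f"correlation between hours and grade: {r:.3f}")
```

A value close to 1 is what "obviously a clear correlation" looks like numerically.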
Often, though, the data you get doesn't quite look like this. It might look a bit more like...
it's more messy, like this. And even in that case you could probably still do some statistics, a little.
Yeah, there's probably some relationship.
It looks a bit like this, but it's getting a bit more ambiguous now, and it's not so obvious anymore what it is.
And statistics may work, or it may not work anymore. Where it gets really interesting is when the data starts looking something like
this.
I'm sort of exaggerating a little bit now, but you do statistics
and it comes back with, "Oh, there's no relationship," because there is something like a zero correlation between things. And obviously you and me look
at it and go, "Hang on, I don't know what's going on here, it's a football or something, but there's clearly something in this data.
Something is going on here that we don't understand." And that's where the artificial intelligence comes in. As experts,
we look at this and we see a circle or something here. How can we teach the computer to figure out that there is a shape?
How do we get that shape out? And then, of course, the real problem
is that if it was just looking like this, it'd be easy to teach a computer. Of course, in reality
it's so messy. The data can be all over the place,
something like this, and somewhere in here there is this little pattern, and maybe it repeats in a few places.
But it gets really hard to see now.
How do you teach your computer now to get this pattern out? And of course, remember, you don't know what the pattern looks like.
You're looking for the unknown.
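To make the "zero relationship, yet clearly a shape" case concrete, here is a small sketch. It assumes, purely for illustration, that the hidden shape is a clean ring; the radius of 5 and the helper names are choices made here, not something taken from the talk.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical data: 360 points evenly spaced on a ring of radius 5.
points = [
    (5 * math.cos(math.radians(d)), 5 * math.sin(math.radians(d)))
    for d in range(360)
]
xs = [p[0] for p in points]
ys = [p[1] for p in points]

# Linear correlation sees nothing: the value is essentially zero.
r = pearson(xs, ys)

# Yet the structure is perfect: every point is the same distance from the origin.
radii = [math.hypot(x, y) for x, y in points]
spread = max(radii) - min(radii)

print(f"correlation: {r:.6f}, spread of radii: {spread:.6f}")
```

The spread-of-radii check only works here because the shape was known in advance. In the situation described above the shape is unknown, which is exactly what makes the problem hard.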
If you know what you're looking for, it'd be easy. That's what Google does, actually. Google has an easy job, because they have a huge database;
it's just a lookup table. We don't have that.
We don't actually know what the thing looks like. So you need to find something,
you don't know what it looks like, it's somewhere hiding in there, or maybe it isn't. That's the hard part.
The way we do data mining, really, from this stage, there are lots of different steps.
I mean, the first thing we usually do when we get data, and it doesn't look as simple as the original, is we do something
called pre-processing of the data. We make it a nicer shape. Maybe we bring it onto a certain scale.
We're actually trying to plot the data, just looking at it, plotting it one way, plotting it another way, two dimensions, three dimensions.
Maybe we can eliminate some of the variables because they
turn out irrelevant,
or maybe lots of information is missing so we can't use them. Sometimes also the data is not always numbers; it might be text, it
might be pictures.
And then how do we put that in? So there's all this sort of messing around in the beginning with the data to bring it into some
sort of okay shape, so that we can look at it more easily.
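Two of the pre-processing steps mentioned above, dropping variables that are mostly missing and bringing values onto a common scale, could be sketched like this. The column names, the 50% cut-off, and the sample rows are all assumptions made for the example.

```python
def drop_sparse_columns(rows, max_missing=0.5):
    """Keep only columns whose share of missing (None) values is at most max_missing."""
    columns = list(rows[0].keys())
    keep = [
        c for c in columns
        if sum(r[c] is None for r in rows) / len(rows) <= max_missing
    ]
    return [{c: r[c] for c in keep} for r in rows]

def min_max_scale(rows, column):
    """Rescale one numeric column to the range [0, 1], leaving None untouched."""
    values = [r[column] for r in rows if r[column] is not None]
    lo, hi = min(values), max(values)
    span = (hi - lo) or 1.0  # avoid division by zero for a constant column
    return [
        {**r, column: None if r[column] is None else (r[column] - lo) / span}
        for r in rows
    ]

# Hypothetical raw records; None marks a missing value.
raw = [
    {"age": 20, "income": 18000, "notes": None},
    {"age": 35, "income": 42000, "notes": None},
    {"age": 50, "income": None, "notes": "n/a"},
    {"age": 65, "income": 30000, "notes": None},
]

cleaned = drop_sparse_columns(raw)      # "notes" is 75% missing, so it goes
for col in ("age", "income"):
    cleaned = min_max_scale(cleaned, col)

for row in cleaned:
    print(row)
```

Whether a sparse column is really useless is a judgment call; the threshold only automates a decision someone still has to defend.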
So a lot of it is eliminating background noise, as it were, is it?
Yeah, and of course the problem with that is: is it really noise, or is the noisy bit maybe
the really interesting bit in the data? You know, when you get your data,
it's like lots of data is like this, then one point is out there, and now of course the question is: well, okay,
maybe that's where your sensor failed. Maybe the person didn't fill in the questionnaire properly. Or maybe that's actually the one data point
that's really interesting, that's where you can make your money.
And that's part of the question: is this an outlier, which is something
we need to get rid of, or is it actually the one interesting pattern that
makes you the money, or finds the person that needs the drug, or whatever the question is that you're looking for?
This is statistics, isn't it?
So it would be statistics, if I go back to my picture, for data like this,
and we can do things which are quite simple. And even in machine learning we still do some statistics.
But statistics on its own is not enough.
It has too many limitations, so we need to go beyond statistics.
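The outlier question raised above (throw the odd point away, or study it?) suggests flagging such points for review rather than deleting them automatically. A minimal sketch, using the median-based "modified z-score" with its conventional 0.6745 constant and 3.5 cut-off; the sensor readings are invented for the example.

```python
import statistics

def flag_outliers(values, threshold=3.5):
    """Return indices of values whose modified z-score exceeds threshold.

    Uses the median and the median absolute deviation (MAD), which a single
    extreme value distorts far less than it distorts the mean and standard deviation.
    """
    med = statistics.median(values)
    mad = statistics.median([abs(v - med) for v in values])
    if mad == 0:
        return []  # no spread at all, so nothing can be judged unusual
    return [
        i for i, v in enumerate(values)
        if 0.6745 * abs(v - med) / mad > threshold
    ]

# Hypothetical sensor readings with one suspicious value.
readings = [10.1, 9.8, 10.3, 10.0, 9.9, 10.2, 25.0, 10.1]

suspects = flag_outliers(readings)
for i in suspects:
    print(f"reading {i} = {readings[i]} needs a human look: failure or discovery?")
```

The code can only say that a point is unusual. Deciding whether it is a failed sensor or the one point worth the money remains a question for someone who knows the data.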
And that's where computer programming is really good, because it's more flexible than statistics. We can deal with text, we can deal with pictures.
I mean, doing statistics on pictures doesn't work; you can't really do statistics on text.
We can do pattern mining on that alphabet. We can do pattern mining on a picture. We can do image recognition.
We can mix all these things together.
We can also deal with it when a lot of information is missing. I mean, usually when you get these big data sets...
I mean, this is one of the myths of big data. People think, "Oh, I've got so much data,
so, you know, no problem." And of course big data usually also means a lot of data is missing, a lot of data is messy.
The data isn't always clean, and the bigger the data gets, this problem doesn't get any smaller, it gets bigger as well.
So statistics usually can't cope when there are missing values, or when the data is messy, because it has some very
fundamental
requirements. So with computer science we can deal with this missing data, okay?
We write a bit of code that deals with missing data.
The data is messy? We write a bit of code that massages the data, or, I don't know, whatever it is.
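"A bit of code that deals with missing data" can be as small as the sketch below, which fills gaps with the median of the observed values. This is only one crude strategy picked for illustration, and the blood-pressure readings are invented.

```python
import statistics

def fill_missing(values, strategy="median"):
    """Replace None entries using the median or mean of the observed values."""
    observed = [v for v in values if v is not None]
    if not observed:
        return list(values)  # nothing to learn from; leave the gaps alone
    if strategy == "median":
        fill = statistics.median(observed)
    else:
        fill = statistics.mean(observed)
    return [fill if v is None else v for v in values]

# Hypothetical readings; None marks a missing measurement.
blood_pressure = [120, None, 135, 128, None, 142]

filled = fill_missing(blood_pressure)
print(filled)
```

Filling gaps this way keeps every record usable, but it also invents values, so it can hide a real pattern or create a fake one if the data is not missing at random.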
So that's very much more flexible than statistics. Statistics may be more powerful when it works,
but when it doesn't work, then computer science. What is it, "correlation does not imply causation"?
The idea, in layman's terms, is that just because you see a pattern doesn't mean it's relevant. Would that be a way of putting it?
Well, it's like, you know, only because you're carrying an umbrella doesn't mean it's going to rain today, right? I mean, exactly. You
observe things together, but what you don't know is whether
A is actually related to B, or B is related to A, or maybe, and usually, actually, there's another reason
underlying it all, which is what's happening. I mean, take a lot of the medical work we do.
This is interesting: we were looking at patients taking certain drugs,
and we want to know whether the drug is really helping them or not. And then after they've taken the drug, something might happen to
them,
something you don't want. Maybe they have a heart attack. And the question, of course, is: well,
was it the drug that caused the heart attack, or was it something else?
And it's fundamentally important to understand that you can't just conclude, only because they took a drug,
that that's why they had the heart attack. Usually it's not at all because of the drug.
Yes, otherwise we could say, "I drank water, and therefore..." Exactly. It's something called confounding, in technical terms.
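A tiny simulation can show what confounding does. In the sketch below a hidden variable `z` drives both `x` and `y`, and `x` never influences `y` at all. Everything here, the variable names, the noise level of 0.3, and the seed, is an assumption made for the illustration and not data from the medical work described.

```python
import math
import random

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

rng = random.Random(0)  # fixed seed so the run is repeatable
n = 2000

# z is the hidden common cause (the confounder).
z = [rng.gauss(0, 1) for _ in range(n)]
# x and y each depend on z plus their own independent noise; x does not cause y.
x = [zi + 0.3 * rng.gauss(0, 1) for zi in z]
y = [zi + 0.3 * rng.gauss(0, 1) for zi in z]

raw_r = pearson(x, y)

# Take out the confounder's contribution. Here its effect is known exactly
# (coefficient 1); with real data it would have to be estimated, e.g. by regression.
x_resid = [xi - zi for xi, zi in zip(x, z)]
y_resid = [yi - zi for yi, zi in zip(y, z)]
adjusted_r = pearson(x_resid, y_resid)

print(f"x vs y, ignoring z:      {raw_r:.2f}")
print(f"x vs y, after removing z: {adjusted_r:.2f}")
```

The strong raw correlation disappears once the common cause is accounted for, which is the trap in concluding that the drug caused the heart attack.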
There's lots of confounding issues, and you need to understand the difference between them. You will have heard of this;
it's obviously in the newspaper every year again:
eating salt is bad for you. Why is eating salt bad for you? Because it apparently causes high blood pressure.
Where's the evidence for this? So what happens is, I looked at this. In
1988 there were some studies on this, and this is what this picture shows; it's from the British Medical Journal, where
they were looking at different countries, and they were looking at, basically, salt in urine. And
what's happened is each of these dots is a different population, i.e. a different country, and they measured,
in thousands of people, how much salt there was in the urine, and
they also measured the blood pressure. And this is the graph: so you've got this, how much salt; this, how much blood pressure; and you
can see there is a sort of slight positive correlation there. And it's actually based on this study,
or it's kind of a meta-study, that the conclusion was: okay, so more salt must lead to higher blood pressure.
However,
actually, if you look at this picture properly, you might see that, wait a moment,
there's lots of points here, and those four over there look a bit strange. And actually,
if you were to remove those four points from the analysis, your trend would suddenly become negative.
It's only because of these that the trend is positive. It turns out that these outliers are from countries where salt
isn't very popular in the diet, sort of non-industrialized countries in Africa. So,
okay,
actually, is it possible
that this study is flawed?
This data should never have been included, because it's not what we are used to in Western
society. And if you look at ours, you'd actually think salt is good for you,
because it seems to slightly reduce blood pressure.
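The effect described above, where a handful of points decides the sign of a trend, is easy to reproduce. The numbers below are invented to mimic the situation and are not the study's data: six "main" populations with a gently falling trend, plus four low-salt populations with low blood pressure.

```python
def slope(points):
    """Least-squares slope of y against x for a list of (x, y) pairs."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    num = sum((x - mean_x) * (y - mean_y) for x, y in points)
    den = sum((x - mean_x) ** 2 for x, _ in points)
    return num / den

# Invented (salt, blood pressure) pairs: within this group, more salt goes
# with slightly LOWER blood pressure.
main_group = [(150, 122), (160, 121), (170, 120), (180, 119), (190, 118), (200, 117)]

# Four invented low-salt, low-blood-pressure populations far from the rest.
outliers = [(5, 100), (10, 102), (40, 105), (50, 104)]

slope_all = slope(main_group + outliers)
slope_main = slope(main_group)

print(f"trend with all ten points:   {slope_all:+.3f}")
print(f"trend without the four:      {slope_main:+.3f}")
```

With all ten points the fitted trend is positive; with the four distant points removed it is negative. Which answer is "right" depends on whether those four populations belong in the same analysis at all.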
There has been more work on this since then, and there probably is a link between salt and blood pressure,
but it's not as obvious, and
there may be much more going on. And actually, what hasn't been established at all is whether slightly raised blood pressure might actually be good
for you, because it might also help you in other ways. It's not the
data itself, it's the interpretation
of it, and getting the right data in: rubbish in, rubbish out.
Actually making sure that what you put in makes sense, and, yes, thinking of confounding things.
If you use data that you shouldn't have used, because it's from a country which is completely different, for example, like here,
it doesn't help.
