Another very common method of
pre-processing is sampling.
So those of you, like Ron,
who are from a statistics
background, will understand
sampling quite well.
So, sampling is the main technique
that we use for data selection.
It's almost always used for
preliminary investigation
of the data, but it's often used
for the final data analysis as well,
even in data science.
Statisticians have been sampling
for as long as the discipline
has existed, because obtaining
the entire set of data of interest
is either too expensive, too time
consuming, or, in a lot of cases,
theoretically impossible.
There is simply no way to obtain
the entire set of some kinds of data;
it's just not possible.
So you have to sample carefully.
Data miners sample often
because processing the entire
set of data is too expensive
or time consuming.
If you're talking about a group
like LinkedIn, or Facebook,
or Google, you're talking about
hundreds of terabytes to petabytes
of data stored on their servers.
You cannot process that kind
of data in anything remotely
resembling a human lifespan,
even with modern technology.
We can process a lot of
data, but there's still
a fundamental limit on
what we can process,
and on top of that,
there's a fundamental limit
on what we as humans can look
at all at the same time.
So, when you're sampling, there
is one thing more than anything
else that you have to keep in
mind, which is representation.
The key principle
when you're sampling
is that a sample will work
almost as well as using
the entire data set if,
and only if, the sample
is representative.
And representative is
one of those fun words
that means something different
for every data set, right?
Sometimes representative is as
easy as unweighted random sampling.
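
As a quick illustration, here is roughly what unweighted
random sampling looks like in pandas. This is just a sketch;
the file name, the column names, and the 1% sampling fraction
are made up for the example.

import pandas as pd

# Hypothetical data set; the file and its columns are placeholders.
df = pd.read_csv("transactions.csv")

# Unweighted (simple) random sample: every row has the same chance
# of being selected. frac=0.01 keeps roughly 1% of the rows, and
# random_state makes the draw reproducible.
sample = df.sample(frac=0.01, random_state=42)

print(f"full data: {len(df)} rows; sample: {len(sample)} rows")
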
Other times it's harder.
This is particularly true
if you're doing something
like anomaly detection,
where we need to make sure
that whatever sample
we take has an appropriate
proportion of anomalies
versus normal data.
In other contexts, it gets
even more complicated.
Sometimes we want to make sure
we balance out our different
classes in a classification
context, or that certain kinds
of attribute values,
not even target values,
just attribute values,
are all represented
in a certain way; there's a sketch
of that idea just below.
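
When proportions matter like this, a stratified sample is the
usual tool: sample each class separately at the same rate, so
rare classes survive the sampling in proportion. A minimal
sketch, again with a hypothetical file and a hypothetical
"label" column:

import pandas as pd

# Hypothetical data set with a class label; names are placeholders.
df = pd.read_csv("transactions.csv")

# Stratified sample: draw the same fraction from each class, so the
# sample preserves the original class proportions, including rare
# classes such as anomalies.
stratified = df.groupby("label").sample(frac=0.01, random_state=42)

# The class proportions before and after should match closely.
print(df["label"].value_counts(normalize=True))
print(stratified["label"].value_counts(normalize=True))
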
And Balachander notes that
sampling will typically
exclude outliers
and may have noise,
and that's absolutely true.
Sampling, if done improperly,
won't really add noise
in our context, but it
certainly can introduce noise.
And outliers are
probably not going
to appear, because you
don't sample enough rows
to make them appear, and that's true.
That's actually one of the
advantages of sampling:
it will exclude
outliers most of the time.
So if we aren't in
an anomaly detection
context, we don't want outliers
muddying the
waters, so to speak;
we'll want to exclude them, and
sampling can help us do that.
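
To put a rough number on that intuition: if a fraction p of the
rows are outliers and we draw n rows at random, the chance that
the sample contains no outlier at all is about (1 - p)^n. The
outlier rate and sample sizes below are made up purely for
illustration.

# Back-of-the-envelope: chance a random sample contains zero outliers,
# approximating the draws as independent. Numbers are illustrative only.
p = 0.001  # assume 0.1% of rows are outliers

for n in (100, 1_000, 10_000):
    print(f"sample size {n:>6,}: P(no outliers) ~ {(1 - p) ** n:.3f}")

# sample size    100: P(no outliers) ~ 0.905
# sample size  1,000: P(no outliers) ~ 0.368
# sample size 10,000: P(no outliers) ~ 0.000

So small samples usually miss the rare stuff, which is exactly
why an anomaly detection context calls for something like the
stratified approach above instead.
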
