So now we get to the
much-foreshadowed data
preprocessing section.
So data preprocessing is
sometimes called data cleaning,
but data preprocessing
should involve more steps
than just cleaning the data,
just removing the problems
with the data.
So data cleaning is kind of
a subset of preprocessing.
But most of what we do
during data preprocessing
is, in fact, data cleaning.
So, again, lots
of different terms
to refer to basically
the same thing.
So there are a lot of different
types of preprocessing.
And I'm going to talk about a
lot of different strategies,
aggregation, sampling, all
the ones on the screen here.
I'm going to talk about all
these different strategies.
But we don't want to use all
of these different strategies
on every data set.
Right?
There are a lot of different
strategies we can use,
but for any given
data set, we're
only going to use a
couple of them usually.
We don't want to overwhelm ourselves.
We're not going to need every
technique and every tool
in our toolbox every time.
Another note before we keep
going: not all of these
are strictly independent.
These terms and categories
are all things you
see thrown around,
terms you see used
around the industry.
But because, again, data science
is such a heterogeneous field,
not all of these things
are strictly independent.
So if you see some
overlap in what
I'm talking about between
different attributes,
that's why.
