So, those of you who have a
scientific or signal processing
background are probably
familiar with the term noise.
Noise in a data
science context is
when we have an invalid
signal of some sort that
overlaps valid data.
This obscures our
actual attribute values.
And, fundamentally,
what it means
is that some of our data
objects have invalid values
in some of the attributes.
They don't have real values there;
they have inaccurate
values.
So, examples of
this in real life--
we have the distortion
of a person's voice
over the phone, snow on
old television screens,
particularly the old
CRT television screens.
Noise can appear because
of human inconsistency
in labeling.
You see this a lot in
sports, for instance,
that require human judging.
There's a lot of
inconsistency in how
people get labeled there.
And, just in general, if you're
trying to, say, rank websites,
for instance, human
inconsistency in labeling
can be a real problem.
So, as sort of a practical
example of what noise
can do when there's
a lot of it--
this is a pretty
straightforward signal.
We've got two sine waves here
with different frequencies
but the same amplitude-- there's
a blue one and a green one--
and, so, we could
generate the sine wave.
It looks very
clean, very pretty.
We can even distinguish the
two different sine waves.
If we add those two waves
together and then throw noise
at it-- just sort
of basic white noise
like you might see in any
kind of randomization thing--
we end up with something
that looks like this.
So, the noise has completely
obscured our actual signal.
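
To make the sine wave example concrete, here is a minimal sketch of the same idea. It is not the code behind the slide. The frequencies, sample count, noise level, and the choice of Gaussian white noise are all assumptions picked for illustration, and the names noisy_signal and rms are hypothetical helpers.

```python
import math
import random

def noisy_signal(n=500, f1=3.0, f2=7.0, amplitude=1.0, noise_std=2.0, seed=0):
    """Return (clean, noisy): the sum of two sine waves, without and with noise.

    All parameter values are illustrative assumptions, not taken from the lecture.
    """
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    clean, noisy = [], []
    for i in range(n):
        t = i / n  # time in [0, 1)
        # Two sine waves: different frequencies, same amplitude, added together.
        s = (amplitude * math.sin(2 * math.pi * f1 * t)
             + amplitude * math.sin(2 * math.pi * f2 * t))
        clean.append(s)
        # Assumed noise model: independent Gaussian (white) noise on every sample.
        noisy.append(s + rng.gauss(0.0, noise_std))
    return clean, noisy

def rms(values):
    """Root-mean-square size of a list of numbers."""
    return math.sqrt(sum(v * v for v in values) / len(values))

clean, noisy = noisy_signal()
noise = [b - a for a, b in zip(clean, noisy)]
print(f"signal RMS: {rms(clean):.2f}, noise RMS: {rms(noise):.2f}")
```

With these assumed settings the noise is larger than the signal it sits on top of, which is the situation on the slide: if you plot noisy, the two sine waves are very hard to see.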
So, noise is, again,
fundamentally, invalid data
points that are
obscuring our signals.
So, there's always some
noise in any system.
It's just the nature
of the universe, sadly.
But understanding where
your noise is at its worst
and how you can deal with
it is very important.
But even recognizing that
it's there is the first step--
recognizing which
of your attributes
are more noisy and which
of them are less noisy.
Sort of the
complementary problem to noise
is the problem of outliers.
So, outliers often look
like noise at first.
They're data objects
that have characteristics
that are considerably different
from most of the other objects
in the data set.
So, if we look at
the visual here--
we've got some sort of
two-dimensional graphing
of our data, and
each dot, each pixel point,
represents a data object that's
been plotted on the graph.
So, we've got four clusters--
very nicely defined clusters--
and then we've got these three
other points just hanging out
in the middle of nowhere,
far away from all
of the other data.
So, the big distinction
between outliers and noise
is that outliers are
actually valid values.
The data was collected
properly-- it's clean,
but it's outside of
the normal range.
The data object,
for some reason,
doesn't look like
a normal object.
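
As a rough sketch of the picture described above, the code below builds four tight clusters plus three isolated points and flags the points that sit far from everything else. The lecture does not name a detection method, so the k-th nearest neighbor distance used here is a swapped-in technique, and the coordinates, k, and threshold are all made-up values for illustration.

```python
import math
import random

def knn_distance(points, k):
    """For each point, return the distance to its k-th nearest neighbor."""
    result = []
    for i, p in enumerate(points):
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        result.append(dists[k - 1])
    return result

rng = random.Random(1)  # fixed seed so the run is repeatable

# Four well-separated clusters of 25 points each (synthetic, assumed layout).
centers = [(2, 2), (2, 8), (8, 2), (8, 8)]
points = [(cx + rng.gauss(0, 0.3), cy + rng.gauss(0, 0.3))
          for cx, cy in centers for _ in range(25)]

# Three points "in the middle of nowhere": valid values, just unusual ones.
isolated = [(5.0, 5.0), (5.0, 0.0), (0.0, 5.0)]
points += isolated

# Assumed rule: a point is a candidate outlier if its 3rd nearest
# neighbor is more than 2.0 units away.
scores = knn_distance(points, k=3)
threshold = 2.0
flagged = [p for p, s in zip(points, scores) if s > threshold]
print(flagged)
```

Note that a rule like this only tells you a point is far from the rest. It cannot tell you whether the point is noise or a genuine outlier; that still depends on knowing whether the value was collected correctly.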
All right-- so that's
outliers and noise.
That's sort of the first
category of data quality
problems that you'll
encounter a lot.
