Another one that shows up very
frequently is missing values.
Sometimes values are missing because the information
was never collected. Whether you're looking at
census data or survey data in particular, people
will often decline to give their age and weight,
or decline to give their annual income.
So you just have missing values.
Other times, the attributes you're collecting
may not be applicable to all cases, right?
If a survey asks for the annual income
of each member of the household,
well, the children in the household
don't have an annual income.
It doesn't make sense, so you just
code that as a missing value.
And we'll talk a lot more about handling
missing values when we get to data pre-processing.
But the fundamental ways we can handle them
are these: we can throw out all the data objects
that have any missing values; we can estimate
the missing values using means or medians
or something else; with some algorithms, but not all,
we can ignore the missing values
on a row-by-row basis; or we can throw the
attribute out entirely, which is something
we might want to do if we have an attribute
that is 80% missing. In that case we'd
probably just throw the column out.
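Those four options can be sketched in pandas; the tiny survey-style DataFrame, its column names, and the 80% threshold here are just illustrative:

```python
import pandas as pd
import numpy as np

# Hypothetical survey data with gaps; columns are illustrative.
df = pd.DataFrame({
    "age":    [34, np.nan, 51, 29, np.nan],
    "weight": [70.0, 82.5, np.nan, 64.0, 90.1],
    "income": [np.nan, np.nan, np.nan, 55000.0, np.nan],
})

# Option 1: throw out every data object (row) with any missing value.
dropped_rows = df.dropna()

# Option 2: estimate missing values from the column mean
# (df.median() works the same way).
imputed = df.fillna(df.mean())

# Option 4: throw out attributes that are mostly missing
# (here, drop any column with 80% or more missing).
mostly_present = df.loc[:, df.isna().mean() < 0.8]
```

Option 3, ignoring missing values row by row, isn't a data step at all; it depends on the algorithm itself tolerating gaps, so there's nothing to show at this stage.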
And one other thing you can do sometimes,
with some algorithms, is replace missing
values adaptively. This happens a lot
with categorical attributes:
you count the probability of each
attribute value appearing
over your whole dataset, and then
replace the missing values
such that those probabilities
don't change.
And we'll talk a little bit more about that
when we get to pre-processing; for now
I just want to get out there, in a very
basic sense, how you handle missing values.
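One simple way to do that frequency-preserving replacement is to sample the fill-ins from the observed value distribution; a minimal sketch, with a hypothetical categorical attribute:

```python
import pandas as pd
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical categorical attribute with missing entries.
color = pd.Series(["red", "blue", None, "red", None, "blue", "red", None])

# Value frequencies among the non-missing entries (red 0.6, blue 0.4).
probs = color.value_counts(normalize=True)

# Draw replacements from that same distribution, so the overall
# proportions of each value stay (approximately) unchanged.
fill = rng.choice(probs.index.to_numpy(), size=color.isna().sum(),
                  p=probs.to_numpy())
filled = color.copy()
filled[filled.isna()] = fill
```

Sampling is one common implementation of the idea; with many missing values you could also allocate replacements deterministically in proportion to the counts.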
And the third category then,
alongside missing values and noise
and outliers, is duplicate data.
This is particularly a problem
when we're merging data
from heterogeneous sources.
So if we have some data
from Google Analytics coming
from our website, and we have some
other data from actual usage, you know,
click counts, dwell time, and
things like that,
that's from another system, or
maybe we have a Java applet,
as much as those things still exist
on the internet, that collects
some data inside of it, then
when we merge that data, we will
sometimes have duplicate data objects.
We'll have the same person
with multiple email addresses.
We'll have the same
person represented
with two different IDs,
because they're coming
from two different systems.
Generally speaking, though,
duplicate data is pretty easy
to handle, assuming
that you can detect
it properly: you just
get rid of the duplicates.
[LAUGHS] Merge them together.
But if you've got heterogeneous data,
data that comes from multiple
sources, then you do
have to be really careful about
filtering out your duplicates.
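A small sketch of that merge-then-deduplicate step, using two hypothetical systems where the same person shows up under different IDs and differently-cased email addresses; the key point is normalizing the matching field before dropping duplicates:

```python
import pandas as pd

# Hypothetical records from two separate systems: IDs differ,
# but the email address identifies the same person.
system_a = pd.DataFrame({"id": [1, 2],
                         "email": ["Ana@example.com", "bo@example.com"]})
system_b = pd.DataFrame({"id": [901, 902],
                         "email": ["ana@example.com", "cy@example.com"]})

merged = pd.concat([system_a, system_b], ignore_index=True)

# Normalize the matching key first; a naive drop_duplicates on the
# raw email column would miss "Ana@..." vs "ana@...".
merged["email_norm"] = merged["email"].str.strip().str.lower()
deduped = merged.drop_duplicates(subset="email_norm", keep="first")
```

Real record linkage is messier than exact matching on one field, but the pattern, merge, normalize keys, then filter duplicates, is the basic shape.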
