Now that we've got
that basic definition,
and we understand
what attributes are in data
objects and the different types
of them,
we can move on to talking
about data quality.
Now data quality is, particularly
for new data scientists,
one of the most commonly
overlooked or shortened steps.
Pieces of it get ignored
or skipped because they just
don't seem that necessary.
But understanding your
data quality problems
and understanding where
they could come from
is very, very
important to creating
robust models that will
actually work in production.
You have to know what
to expect in order
to handle it appropriately.
So there are three fundamental
questions around data quality,
right?
We have to ask these of
every dataset we get.
One, what problems do
we have to worry about?
Two, how do we detect those
problems? And three, what can we
do about those problems?
Those are the three
fundamental questions
you should ask
yourself every time
you approach a new dataset.
And some of your
earliest explorations
should really be focused on
answering these questions.
So I am going to give you
some examples of how we answer
each of these three
questions and some
of the categories
of things that come up.
So there are three
very common kinds
of data quality problems--
noise and outliers, missing
values, and duplicate data.
These show up in
production all the time.
So let's go through and think
about these in this context.
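
As a rough sketch of what a first pass at detecting these three kinds of problems might look like, here is a small Python example. The toy records, the function names, and the 1.5 x IQR outlier rule are all illustrative assumptions, not something prescribed in the lecture. Real data would usually need domain-specific checks on top of this.

```python
from statistics import quantiles

# Hypothetical toy records: (customer_id, age). None marks a missing value.
records = [
    (1, 34),
    (2, 29),
    (3, None),   # missing age
    (4, 41),
    (2, 29),     # exact duplicate of an earlier row
    (5, 38),
    (6, 250),    # implausible age, likely noise or an outlier
    (7, 31),
]


def count_missing(rows):
    """Count rows that contain at least one missing (None) field."""
    return sum(1 for row in rows if any(value is None for value in row))


def find_duplicates(rows):
    """Return rows that exactly repeat a row seen earlier."""
    seen, duplicates = set(), []
    for row in rows:
        if row in seen:
            duplicates.append(row)
        else:
            seen.add(row)
    return duplicates


def iqr_outliers(values, k=1.5):
    """Flag values more than k * IQR outside the quartiles (assumed rule)."""
    q1, _, q3 = quantiles(values, n=4)
    spread = q3 - q1
    low, high = q1 - k * spread, q3 + k * spread
    return [v for v in values if v < low or v > high]


# Outlier detection only makes sense on the values that are present.
ages = [age for _, age in records if age is not None]

print("rows with missing values:", count_missing(records))
print("duplicate rows:", find_duplicates(records))
print("suspected outliers:", iqr_outliers(ages))
```

Each function answers the "how do we detect it" question for one category. What to do about the flagged rows is a separate decision that depends on the dataset.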
