All right, we can move on
to data set classification.
So there are a lot of different
types of data sets.
And they require different
approaches to analysis.
The pre-processing steps,
the modeling steps,
pretty much
everything that you do
with these different
types of data sets
is going to be different.
The kinds of models you use,
the kinds of visualizations
you construct, the
kind of cleaning that
is proper for that kind of data.
Understanding the structure
of your data at the beginning
is very important to
not wasting time and not
producing incorrect results.
And it's in this step,
understanding the structure
of your data, that things
like domain knowledge
tend to be very important.
But there are still,
certainly, categories
that tend to be similar no
matter what domain they're in.
So we'll talk about these
three different types
of data sets, record,
graph, and ordered data sets,
in a little bit more
detail coming up here.
So record data is
data that consists
of a collection of
records, each of which
consists of a fixed
set of attributes.
So this particular tax data set,
which I use in several places,
is record data.
Every data object has one tax
ID, has a value of whether they
asked for refund,
marital status,
whether they're single,
married, or divorced,
a taxable income
field, and whether they
cheated on their taxes or not.
So that's, sort of, the
structure of this data set.
So for any data that consists
of this kind of collection
of records, each with
a fixed set of attributes,
you almost always
represent this kind
of data as a table,
whether a database
table, or a spreadsheet,
or something like that.
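To make the table idea concrete, here is a minimal Python sketch of record data. The field names follow the attributes just described, but the class name `TaxRecord` and every value in the rows are made up for illustration. They are not taken from the actual data set.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class TaxRecord:
    """One record: every data object has the same fixed set of attributes."""
    tax_id: int
    refund: bool
    marital_status: str  # "Single", "Married", or "Divorced"
    taxable_income: float
    cheat: bool


# Hypothetical rows; all values are invented for illustration only.
records = [
    TaxRecord(1, True, "Single", 120_000.0, False),
    TaxRecord(2, False, "Married", 100_000.0, False),
    TaxRecord(3, False, "Divorced", 80_000.0, True),
]

# The fixed attribute set becomes the table's columns, one row per record.
column_names = [f.name for f in fields(TaxRecord)]
table = [[getattr(r, name) for name in column_names] for r in records]

print(column_names)
for row in table:
    print(row)
```

Because every record has the same fields, every row has the same length, which is what lets this kind of data sit naturally in a database table or a spreadsheet.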
And it's the most
common kind of data.
So for a lot of people, if you
talk about data or data sets,
record data is exactly
what they visualize.
So it's, sort of, your
most common and, sort of,
fundamental kind of data set.
So within record data, there
are a few useful subsets.
So this record data,
with the tax data,
has some categorical values
and then one ordinal variable.
So tax ID is ordinal, right?
Or is it?
It's really more of a
nominal variable, when
you think about it, because
ordering doesn't necessarily
matter.
Right, sure, it
takes numbers, but 10
is not meaningfully
greater than five.
There's no ordering
implied here.
So tax ID is a nominal
categorical field.
Tax refund is a categorical
field, marital status also,
and taxable income is
a continuous field.
So most data that you encounter
has mixed data types like this.
You have some
categorical, some numeric,
and that's, sort of, your
traditional type of record
data.
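As a small sketch of why the attribute type matters, the Python below picks a different summary depending on the type of each field. The type labels follow the discussion above. The `ATTRIBUTE_TYPES` dictionary, the `summarize` helper, and all row values are assumptions made up for this example.

```python
from collections import Counter
from statistics import mean

# Attribute types as discussed above; this mapping is an illustrative assumption.
ATTRIBUTE_TYPES = {
    "tax_id": "nominal",
    "refund": "categorical",
    "marital_status": "categorical",
    "taxable_income": "continuous",
}

# Hypothetical rows; all values are invented for illustration only.
rows = [
    {"tax_id": 1, "refund": "Yes", "marital_status": "Single", "taxable_income": 120000.0},
    {"tax_id": 2, "refund": "No", "marital_status": "Married", "taxable_income": 100000.0},
    {"tax_id": 3, "refund": "No", "marital_status": "Divorced", "taxable_income": 80000.0},
]


def summarize(rows, attribute):
    """Return a summary that makes sense for the attribute's type."""
    values = [row[attribute] for row in rows]
    kind = ATTRIBUTE_TYPES[attribute]
    if kind == "continuous":
        # Arithmetic is meaningful, so an average is a sensible summary.
        return mean(values)
    if kind == "categorical":
        # Only counting per category is meaningful.
        return Counter(values)
    # Nominal identifier: neither order nor arithmetic means anything,
    # so just count how many distinct identifiers there are.
    return len(set(values))


for name in ATTRIBUTE_TYPES:
    print(name, summarize(rows, name))
```

Averaging the tax IDs would run without error, since they are stored as numbers, but the result would be meaningless. That is the practical reason to settle each attribute's type up front.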
If, on the other hand, your
record data consists entirely
of numeric attributes, so
it is entirely continuous,
entirely interval
or ratio variables,
then we can think of it as a
mathematical matrix rather than
just a table.
So we would have
an m by n matrix.
There are m rows, one
for each data object,
and n columns, one
for each attribute.
And this is nice because we
can think of these data objects
as points in a
multi-dimensional space,
where each attribute
is represented
along one dimension.
And that allows us to use a
number of numeric techniques,
specifically those
involving distance.
This not only makes some
algorithms easier,
but some
algorithms require it.
There are a number
of algorithms that
require you to have a data
matrix, that is, all numeric data.
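Here is a minimal Python sketch of the data matrix idea, treating each row as a point and measuring Euclidean distance between rows. The matrix values and the helper name `euclidean_distance` are made up for illustration, and Euclidean distance is just one common choice of distance, not the only one.

```python
import math

# Hypothetical all-numeric data: m = 3 data objects (rows),
# n = 2 attributes (columns). Values are invented for illustration.
data_matrix = [
    [0.0, 0.0],
    [3.0, 4.0],
    [6.0, 8.0],
]

m = len(data_matrix)      # number of data objects
n = len(data_matrix[0])   # number of attributes (dimensions)


def euclidean_distance(p, q):
    """Straight-line distance between two points with the same dimensions."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


# Pairwise distances between all data objects: an m by m matrix.
distances = [
    [euclidean_distance(p, q) for q in data_matrix]
    for p in data_matrix
]

print(m, n)
for row in distances:
    print(row)
```

This only works because every attribute is numeric. With a categorical field such as marital status in the mix, the subtraction inside `euclidean_distance` would have no meaning.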
