So we have objects,
and we have attributes.
So each attribute has
a set of values which
the objects can draw from.
So each attribute,
each object is defined
by a set of attribute values.
And each attribute we
can think of as being
defined by the set of
values that it can hold.
So we can have the
same attribute mapped
to different attribute values.
Height can be measured
in meters or feet,
temperature can be measured in
Celsius, Kelvin, or Fahrenheit,
lots of other sorts
of things like that.
And different
attributes will often
be mapped to the
same set of values.
ID numbers and age
are both usually given
as integer values.
Temperature and
height are both often
given as floating point
values, as decimal values.
So the properties
of our attributes
can also be different.
Height, for instance, has
a pretty practical maximum
and minimum value, as
does something like age,
whereas ID number
has no real limit.
It's whatever the people
who created the dataset
define it to be.
So and that kind of gets into
an interesting question of,
who defines what value set
that a given attribute uses?
And the answer to that is
essentially, we do, right?
The people who create the
dataset do, the people who
hand us the data, the data
engineers or the Twitter API
or other APIs that we're
calling in order to get the data
will have some definition of it.
But we can set
that ourselves too.
We can change our
attitudes to be maps
to different sets of values.
And we'll use that in
a variety of places.
All right, so
attributes have, so we
know that we have
these attribute values.
So it's useful to
talk about attributes
as being part of different
classes, different types
of attributes that
we're going to end up
having to handle differently
as we get into the actual data
mining and modeling processes.
So there's two sort
of fundamental types
of attributes,
discrete attributes
and continuous attributes.
So discrete attributes have
either a finite or countably
infinite set of values.
For those of you who don't know,
the term countably infinite
basically means integers.
If you can turn your
attribute into integers,
then it's countably
infinite, or finite
if you've got only a
limited set of integers.
So good examples of
these are zip codes,
things like click counts,
the set of a word count,
word counts in a
collection of documents.
We could in theory, have
as many clicks as we want.
There's a countably
infinite set,
but there are always
going to be integers.
So we have a countably
infinite set of values there.
Usually we represent these
as integer variables.
And binary attributes
are a pretty special case
of discrete attributes
that we end up
having to handle
differently in some cases.
Binary attributes
have only two values.
And we might call those yes
or no, dead or alive, 1 or 0.
And those kinds of columns
are sort of a special case.
In some contexts, we really like
them, they make things easier.
In other contexts, they
could be problematic, which
is pretty much everything.
The other big type of attribute
classification that we
see are continuous attributes.
So in this case, we
have real numbers
as our attribute values.
There's no limitation
to just integers.
So temperature, height, weight,
oxygen level, taxable income,
all these things
have real numbers
as their attribute values.
They can theoretically
take any value at all.
Now in practice
of course, we have
to put these things
into a computer,
and computers can only
measure and represent
a finite set of digits.
So generally speaking,
these attributes
are usually represented as
floating point variables.
So floating points,
for those of you
who are farther out from
your learning of programming,
are essentially
just variables that
hold a real number,
that can hold a decimal,
the floating point being the
decimal point in the number.
