The purpose of this
particular webinar
is to give you all some
basic vocabulary and a very
basic understanding of a number
of different important topics
regarding data
science fundamentals.
So a lot of this talk
is a vocabulary lesson.
So it's really
important that you guys
make sure you understand all
the terms that I'm introducing
and all the ways
that they're used.
We're going to be
covering a lot of material
over the next couple of hours.
So it is pretty
aggressively paced,
but we should be able to
get through all of it.
All right.
So you see on your
screen here the topics
that we're going to be covering.
So we're going to be talking
to start about data and data
types and sort of setting
some groundwork for all
the things we'll be talking
about over the course
of the Boot Camp.
Then we're going to talk
about data quality and data
preprocessing, which are
very connected things.
And, finally,
we're going to talk
about some similarity
and dissimilarity metrics
and also some data
exploration and visualization.
So we'll cover data exploration
and visualization very briefly
here.
We're going to talk
about it a lot more
next week in the
Introduction to R webinar.
So without further ado, then,
let's start with data and data
types.
So "what is data?" is sort of
a very fundamental question
that we can ask.
And here's where our
vocabulary lesson starts.
So data is a
collection of objects
that are defined by attributes.
So attributes are the
properties or characteristics
of our objects.
So every entry in
our table, here--
and not all data can be
represented nicely in a table,
but a lot of it can be.
So in this case,
a data object is a row,
and a data attribute
is a column.
So we think of the
attributes as being
properties of the objects.
So the eye color of a
person, the temperature,
whether someone filed
for a tax refund
in the next year, what
their taxable income was,
those are all attributes
of our data objects.
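To make the row-and-column picture concrete, here is a minimal sketch in plain Python. The table values and the helper names (`attribute_names`, `attribute_values`) are invented for illustration and are not taken from the slide; the only point is that each row is a data object and each column is an attribute.

```python
# Toy table with invented values (an assumption, not the webinar's slide).
# Each dict is one data object (a row); each key is an attribute (a column).
people = [
    {"eye_color": "brown", "filed_refund": True, "taxable_income": 125_000},
    {"eye_color": "blue", "filed_refund": False, "taxable_income": 100_000},
    {"eye_color": "green", "filed_refund": False, "taxable_income": 70_000},
]


def attribute_names(table):
    """Return the attribute (column) names, taken from the first object."""
    return list(table[0].keys())


def attribute_values(table, name):
    """Return one attribute's value for every object, i.e. one column."""
    return [row[name] for row in table]


first_object = people[0]                             # one row: a data object
eye_colors = attribute_values(people, "eye_color")   # one column: an attribute

print(attribute_names(people))
print(eye_colors)
```

Reading across a row gives you everything known about one object; reading down a column gives you one attribute for every object.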
So one of the struggles
people sometimes
have in getting
into data science
is that because data science is
a synthesis of probably three
or four completely distinct
fields all coming together
in one way, there are a
lot of different terms
for the same things
in a lot of cases.
So this is our first
encounter with that.
And it's going to show up again.
So attribute is sort of a
decent name for these ideas.
But they're also called
variables and fields
and characteristics and
features and predictors.
And if you've got
tabular data, they'll
be called columns sometimes.
So all of those
different names all
refer to essentially
the same thing.
They're all attributes.
Each one is a property or
characteristic of our objects.
Similarly, each of
our objects is
then, basically,
a collection of attributes.
It's kind of a
circular definition.
But it's what we've got.
So each object is defined by
its exact attribute values.
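One way to see what "defined by its attribute values" means is a small sketch with a Python dataclass. The `Person` class and its two fields are hypothetical, chosen only for illustration: two objects with the same attribute values compare as equal, and changing any one value gives a different object.

```python
from dataclasses import dataclass, astuple


# Hypothetical record type; the field names and values are assumptions.
@dataclass(frozen=True)
class Person:
    eye_color: str
    taxable_income: int


a = Person("brown", 125_000)
b = Person("brown", 125_000)
c = Person("blue", 125_000)

# Dataclass equality compares the objects field by field.
print(a == b)      # same attribute values
print(a == c)      # one attribute value differs
print(astuple(a))  # the object viewed as just its attribute values
```

In real data sets you will often have an extra identifier attribute so that two records with otherwise identical values can still be told apart.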
And objects-- we'll use
the term data objects
throughout this talk,
but in general, objects
have a lot of different names.
You'll see them called records
and points and cases, samples,
entities, entries,
instances, all of those
and many more.
You'll also see a set of
data called a data set.
But sometimes it will
be called a table.
And sometimes you'll
just hear, oh, yeah,
we have our data, referring
to the set as a whole.
