All right, so another way to reduce the dimensionality of data, other than just PCA, comes from the fact that a lot of the time we have redundant or irrelevant features.
So this is going back
to Theresa's questions
about dimensions
being independent.
So if we have redundant features or irrelevant features, they increase our dimensionality artificially. They contain little to no information, but they still increase our dimensionality.
So we want to be very careful
about trying to detect these.
So an example of a redundant feature is the purchase price of a product and the amount of sales tax paid on that product. Given the state, those two things are completely connected. You can calculate one from the other. They're perfectly correlated.
So, as a result, you want to get rid of one of them, because it increases your dimensionality without adding new information.
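To make that concrete, here's a minimal sketch of one way to catch this kind of redundancy, assuming a pandas DataFrame of made-up sales numbers; the threshold is just an illustrative choice:

```python
import pandas as pd

# Made-up sales data: "tax_paid" is a fixed fraction of "price",
# so the two columns carry the same information.
df = pd.DataFrame({
    "price": [10.0, 25.0, 40.0, 55.0],
    "tax_paid": [0.70, 1.75, 2.80, 3.85],  # 7% sales tax
    "units_sold": [3, 1, 4, 2],
})

# Absolute pairwise correlations between features.
corr = df.corr().abs()

# Flag any pair above an (illustrative) threshold as redundant.
threshold = 0.95
redundant = [
    (a, b)
    for i, a in enumerate(corr.columns)
    for b in corr.columns[i + 1:]
    if corr.loc[a, b] > threshold
]
print(redundant)  # [('price', 'tax_paid')] -- drop one of the pair
```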
Same thing with
irrelevant features.
A student's ID number, the
vast majority of the time,
is irrelevant to the task
of predicting a student's GPA.
And these types of redundant
and irrelevant features
don't just harm us via
increased dimensionality.
Redundant features effectively weight the same information multiple times. If we have the same information contained in two separate columns that the model thinks are both important, we have double-weighted that information.
Similarly, irrelevant features
can confuse our model.
The model will try to do some
fitting based on those features
and it'll just sort of diffuse
the effectiveness of the model.
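As a quick illustration of the student ID example, one simple heuristic is to check a feature's correlation with the target; this only catches linear relationships, and the data here is made up:

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up data: arbitrarily assigned student IDs and GPAs.
student_id = rng.permutation(np.arange(10000, 10500))
gpa = rng.uniform(2.0, 4.0, size=500)

# A simple (linear-only) relevance heuristic: correlation with the
# target. An arbitrarily assigned ID should sit near zero.
r = np.corrcoef(student_id, gpa)[0, 1]
print(f"corr(ID, GPA) = {r:.3f}")  # near 0 -> candidate to drop
```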
So one of our big steps
of data pre-processing
is making sure we figure out which attributes are redundant or irrelevant and
aggressively cutting them out
of our data set.
And there's a number
of different techniques
you can use to do this
kind of subset selection.
You can brute force it, just
try all your different feature
subsets.
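As a sketch of what brute force can look like, using scikit-learn and its built-in diabetes dataset as a stand-in; note the cost grows as 2^d, so this is only feasible for a handful of features:

```python
from itertools import combinations

from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

X, y = load_diabetes(return_X_y=True, as_frame=True)
features = list(X.columns)

best_score, best_subset = -float("inf"), None
# Try every non-empty subset -- 2^d - 1 of them.
for k in range(1, len(features) + 1):
    for subset in combinations(features, k):
        score = cross_val_score(
            LinearRegression(), X[list(subset)], y, cv=5
        ).mean()
        if score > best_score:
            best_score, best_subset = score, subset

print(best_subset, best_score)
```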
Some algorithms, including some of the most popular algorithms in use, actually do feature selection naturally. And so that's always good.
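One well-known example of this (an illustration, not the only option) is Lasso regression, whose L1 penalty zeroes out the coefficients of unhelpful features as a side effect of training:

```python
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LassoCV

X, y = load_diabetes(return_X_y=True, as_frame=True)

# The L1 penalty drives coefficients of unhelpful features to
# exactly zero, so fitting the model selects features for free.
model = LassoCV(cv=5).fit(X, y)
print(list(X.columns[model.coef_ != 0]))
```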
Sometimes you have
a filter approach
where you use your
exploration and what
you know about the data
set in order to filter out
the bad features.
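An automated flavor of the filter approach, again sketched with scikit-learn: score each feature against the target with a univariate statistical test and keep the top k, with no model in the loop. The choice of k=5 here is arbitrary:

```python
from sklearn.datasets import load_diabetes
from sklearn.feature_selection import SelectKBest, f_regression

X, y = load_diabetes(return_X_y=True, as_frame=True)

# Score each feature against the target with a univariate test
# and keep the top k -- no model training involved.
selector = SelectKBest(score_func=f_regression, k=5).fit(X, y)
print(list(X.columns[selector.get_support()]))
```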
And sometimes you can get
the data science inception
going on where you use a data
mining algorithm on your data
mining algorithm
in order to find
the best subset of attributes.
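Recursive feature elimination is one concrete version of that inception: a model is repeatedly fit and used to discard its own weakest feature. A sketch with scikit-learn:

```python
from sklearn.datasets import load_diabetes
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression

X, y = load_diabetes(return_X_y=True, as_frame=True)

# Repeatedly fit the model, drop its weakest feature, and refit --
# a model selecting features for a model.
rfe = RFE(LinearRegression(), n_features_to_select=5).fit(X, y)
print(list(X.columns[rfe.support_]))
```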
But that's feature subset selection. There isn't a lot more to say about it, so I'm going to move on a little quickly.
Please ask questions as they arise.
But we're running a little bit behind, which is fine. I love the discussions we've had, and they're important.
The front half of
this presentation
is more critical
than the back half.
But I am going to
start increasing
the pace a little bit,
just as a heads up.
So please ask your
questions as they come up.
So another common technique,
and this goes with aggregation
to a certain extent,
is feature creation.
So we have the curse of dimensionality on the one hand,
but other times we don't
have enough features.
We don't have
enough information.
There is more information
that we could have.
So we can extract things, say, combine two columns in order to get new information. For instance, in sales we could determine the tag price from the total amount paid by filtering out the sales tax, which might be important.
Other times we have aggregation and things like that as part of feature construction.
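A minimal sketch of that kind of extraction, with made-up transaction data, backing the tag price out of the total paid:

```python
import pandas as pd

# Made-up transactions: we recorded the total paid and each state's
# sales-tax rate, but the model may care about the pre-tax tag price.
df = pd.DataFrame({
    "total_paid": [10.70, 26.75, 42.80],
    "tax_rate": [0.07, 0.07, 0.07],
})

# Construct the new feature by backing the sales tax out.
df["tag_price"] = df["total_paid"] / (1 + df["tax_rate"])
print(df)
```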
And last, and
really mostly least
because we don't do this
that much, is mapping data
to a new space.
So those of you from a
scientific background
are probably familiar with
the Fourier transform, which
takes data that is
in the time domain
and converts it to be in
the frequency domain, which
allows you to pick out
different pieces of information.
We don't do this kind
of transformation
that much in data
science because it
tends to require transforming
the entire data object.
But it is something to be aware of, to have in the back of your head.
Because there are times when you really do want to do some sort of massive transformation like this.
Particularly in a time-series anomaly detection context,
you might want to do
things like take a Fourier
transform of your data.
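A small sketch of that idea with NumPy, on a made-up hourly series with a daily cycle: the Fourier transform exposes the dominant period, and points that break that cycle stand out in the frequency domain:

```python
import numpy as np

# Made-up series: a 24-hour cycle plus noise, sampled hourly for 30 days.
t = np.arange(24 * 30)
rng = np.random.default_rng(1)
series = np.sin(2 * np.pi * t / 24) + 0.3 * rng.normal(size=t.size)

# Map from the time domain to the frequency domain.
spectrum = np.abs(np.fft.rfft(series))
freqs = np.fft.rfftfreq(series.size, d=1.0)  # cycles per hour

# Skip the DC component; the peak should sit at the 24-hour cycle.
dominant = freqs[np.argmax(spectrum[1:]) + 1]
print(f"dominant period = {1 / dominant:.1f} hours")
```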
