So the first strategy--
and this one is first
because we see it a
lot-- is aggregation.
So we'll combine two or
more attributes or objects
into a single
attribute or object.
So this can be
where we are trying
to reduce the scale of our
data, reduce the number
of attributes or objects.
So we could, for instance,
combine a high-temperature
attribute and a low-temperature
attribute in order to get
a temperature-difference
attribute. We've now combined
two columns into one column.
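A minimal sketch of that attribute combination, assuming a hypothetical pandas table with high_temp and low_temp columns (all names here are illustrative):

```python
import pandas as pd

# Hypothetical daily weather table; column names are made up.
df = pd.DataFrame({
    "high_temp": [71.0, 68.0, 75.0],
    "low_temp":  [54.0, 50.0, 61.0],
})

# Combine the two attributes into one, then drop the originals:
# two columns become one column.
df["temp_diff"] = df["high_temp"] - df["low_temp"]
df = df.drop(columns=["high_temp", "low_temp"])
print(df)
```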
Basically every algorithm's
running time depends on
the number of attributes
it runs on,
and certainly in terms of
visualization and exploration,
there's only so many
attributes that you
can look at at the same
time or hold in your head
at the same time.
On the other hand, we
might want to combine
a bunch of different objects.
If we have users who have
many different sessions,
or who navigate to
many different pages,
we'll have dwell times that
are different for every page
and every session,
and we might want
to average all
those dwell times in order
to get one data object
that is the average user
behavior for each user,
rather than the 10
or 15 different
sessions for that user.
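One way to sketch that object aggregation, assuming a hypothetical session log with per-page dwell times (the column names and values are made up for illustration):

```python
import pandas as pd

# Hypothetical log: one row per (user, session, page) dwell time.
log = pd.DataFrame({
    "user":    ["a", "a", "a", "b", "b"],
    "session": [1, 1, 2, 1, 1],
    "page":    ["home", "docs", "home", "home", "blog"],
    "dwell_s": [12.0, 40.0, 8.0, 25.0, 5.0],
})

# Collapse the many sessions and pages into a single data object
# per user: that user's average dwell time.
per_user = log.groupby("user")["dwell_s"].mean().reset_index()
print(per_user)
```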
So the reason why we do
this is exactly that:
when we average user dwell
times, for instance, we're
changing our scale.
We want to aggregate cities into
regions, states, or countries.
We want to aggregate dwell
times across sessions
or across pages.
And one of the big
advantages of aggregation,
particularly averaging,
is that aggregated data
tends to have less variability.
It's a way of reducing
the effect of random noise.
If you've got human
labeling errors,
then you've got human
labeling errors.
If you've got sampling
procedure errors,
you have sampling
procedure errors.
But if you've got
random errors, say
random noise, then aggregated
data will very much
tend to reduce that.
So as an example of that-- and
I really like this next page
for this--
these two are graphs of
precipitation in Australia.
So these are histograms.
So the height of each
bar is the number
of locations where
precipitation was measured
which had, in this case,
a standard deviation equal
to the x value when we
measured it on a monthly basis.
So we're measuring the
average monthly precipitation
and measuring the
standard deviation
of that monthly precipitation
at 500 different land
locations in Australia.
When we do that on
a monthly basis,
we get this very wide spread
of standard deviations.
Some places are very
consistent in their rainfall.
There's these two
peaks, and then you
have this long tail of
places that are just
all over the place in
terms of the variability
in precipitation.
On the other hand, if we take
those exact same land locations
and instead, find the average
yearly precipitation--
the standard deviation
of that--
we get this very nice,
mostly single-peaked,
very short-tailed histogram.
We've significantly
reduced our variability.
We've reduced our random
noise in our dataset
by increasing the scale
by aggregating our data
over a longer time period.
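This effect can be reproduced with simulated data-- a rough sketch, not the actual Australian precipitation dataset: each location gets a constant per-location mean plus independent random noise every month, and we compare the spread of the raw monthly values against the spread of the yearly averages.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated monthly precipitation at 500 locations over 20 years:
# a per-location mean plus independent random noise each month.
n_locs, n_years = 500, 20
means = rng.uniform(20, 80, size=n_locs)  # made-up per-location means
monthly = means[:, None, None] + rng.normal(0, 15, size=(n_locs, n_years, 12))

# Standard deviation of the raw monthly values at each location...
monthly_sd = monthly.reshape(n_locs, -1).std(axis=1)

# ...versus standard deviation of the yearly averages (aggregated data).
yearly = monthly.mean(axis=2)  # average over the 12 months of each year
yearly_sd = yearly.std(axis=1)

# Averaging 12 independent monthly values shrinks the random-noise
# standard deviation by roughly sqrt(12).
print(monthly_sd.mean(), yearly_sd.mean())
```

The systematic part (each location's mean) survives the aggregation; only the random noise is averaged away, which is the point of the two histograms above.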
So that's one of the big
reasons that we use aggregation.
