There are several
different types
of sampling that are important
and that will come up
over the course
of the boot camp.
So there's simple
random sampling,
where there's an equal
probability of selecting
any particular item.
There's stratified
sampling, where
we split the data into
several partitions
and draw out random samples
from each partition.
If we're doing
stratified sampling
with equal sized partitions
and drawing the same number
of points from each one,
then every item has the same
chance of being selected,
just as in simple
random sampling.
But in a lot of
cases, we don't do it
with equal sized partitions.
We have different sized
partitions to draw from,
or we draw different
numbers of points out
of the different partitions,
which is what makes
it fundamentally different
from simple random sampling.
So those are our two
fundamental ways
of actually grouping the data.
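
As a rough illustration of the two approaches just described, here is a minimal sketch using only Python's standard library. The function names, the `records` data, and the per-partition counts are all made up for this example and are not from the lecture.

```python
import random

def simple_random_sample(items, k, seed=None):
    """Draw k items so that every item is equally likely to be picked."""
    rng = random.Random(seed)
    return rng.sample(items, k)

def stratified_sample(items, key, per_stratum, seed=None):
    """Split items into partitions by key, then draw from each partition.

    per_stratum maps each partition label to the number of items to draw.
    """
    rng = random.Random(seed)
    partitions = {}
    for item in items:
        partitions.setdefault(key(item), []).append(item)
    sample = []
    for label, members in partitions.items():
        sample.extend(rng.sample(members, per_stratum[label]))
    return sample

# Hypothetical data: 90 "common" records and 10 "rare" records.
records = [("common", i) for i in range(90)] + [("rare", i) for i in range(10)]

# Simple random sampling: the rare group may barely show up.
srs = simple_random_sample(records, 10, seed=0)

# Stratified sampling with unequal partitions but equal draws:
# the rare group is guaranteed five of the ten slots.
strat = stratified_sample(
    records,
    key=lambda r: r[0],
    per_stratum={"common": 5, "rare": 5},
    seed=0,
)
```

Here the partitions are different sizes but we draw the same number of points from each, so a rare record is far more likely to be picked than a common one. That is the kind of difference from simple random sampling described above.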
And then when we're
actually sampling,
there are two kinds of
sampling that come up.
The first is sampling
without replacement,
which is what most
people think of when
they're thinking of sampling.
So for sampling without replacement,
say we have a bag,
and it's got five red
balls and four blue balls
and three green balls in it.
We reach into the
bag and pull a ball out
and we say, aha,
I drew a red ball.
Then we take that red ball,
and we put it on the table.
And then if we
want another item,
we reach back in and pull
out another ball.
So now the second
time we draw, instead
of there being five reds and
four blues and three greens,
there's four reds, four
blues, and three greens.
So that's sampling
without replacement.
We do not put what we've
sampled back into the bag.
On the other hand,
there are uses for
sampling with replacement.
In fact, a very common
type of modeling
uses sampling with
replacement as a
fundamental part of it.
So in sampling with replacement,
instead of taking the red ball
out and then putting it on
the table and drawing again,
we reach into the bag, pull out
a ball and say, aha, it's red,
note down on a piece of
paper that it's red,
then put the red ball
back, shake the bag up,
and draw another ball.
We record its color and put
it back in the bag.
So without replacement
and with replacement are
exactly what they sound like.
But they end up having very
different mathematical results.
And because of that,
they are used in
different contexts.
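
To make the bag example concrete, here is a small sketch using Python's standard `random` module: `random.sample` draws without replacement and `random.choices` draws with replacement. The seed, the number of draws, and the helper `p_red_after_red` are choices made for this illustration only.

```python
import random
from collections import Counter
from fractions import Fraction

# The bag from the lecture: five red, four blue, three green.
bag = ["red"] * 5 + ["blue"] * 4 + ["green"] * 3
rng = random.Random(42)  # fixed seed, chosen arbitrarily

# Without replacement: every drawn ball stays on the table.
without = rng.sample(bag, 5)

# With replacement: every ball goes back before the next draw,
# so we can draw more times than there are balls in the bag.
with_repl = rng.choices(bag, k=20)

# Drawing 13 balls without replacement from a bag of 12 is impossible.
try:
    rng.sample(bag, 13)
    too_many_failed = False
except ValueError:
    too_many_failed = True

def p_red_after_red(replace):
    """Chance the second ball is red, given the first ball was red."""
    counts = Counter(bag)
    total = len(bag)
    if not replace:
        # The red ball is on the table, so the bag has changed.
        counts["red"] -= 1
        total -= 1
    return Fraction(counts["red"], total)

print(p_red_after_red(replace=False))  # 4/11
print(p_red_after_red(replace=True))   # 5/12
```

Without replacement, the second draw depends on the first, because the bag now holds four reds out of eleven balls. With replacement, every draw sees the same five reds out of twelve. This is one simple way the two schemes give different mathematical results.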
All right, so another
aspect we need to
think about around sampling
is what size of
sample we want to take.
And I really like
this picture because I
think it illustrates
very well
the problems with sample sizes.
So when we sample, we
do lose information,
just like with aggregation.
So you have to be careful not
to make your sample too small.
So if we look over here,
we have this data set,
and it's just position data.
This is, I think, some sort
of lithography picture.
So we've got these
black structures,
and then we've
got this sine wave
in the background and then a
little bit of just random noise
scattered all over the place.
So if we subsample
this down to a quarter,
so we sample 2000
points, we can still
see that the structures, the
big thick structures,
are still represented.
But the sine wave has
almost entirely disappeared.
We've lost that
background image.
And if we go down even
further, if we subsample
to a quarter again,
down to 500 points,
we've lost even the
information about the structures.
Like you can look
at this and you
can kind of see the structures,
but only because you know what
the structures
need to look like.
If I showed you just
this graph first,
you wouldn't pick
out the structures.
You wouldn't be able to. There's
just not enough information
there.
So we want to reduce
our sample size.
We want a sample small
enough that we can process
it efficiently, that we can
analyze it efficiently, that we
can explore it efficiently.
But we have to be really careful
not to take too small a sample.
And unfortunately, there
isn't necessarily a good rule
of thumb for this.
You just
need to play with it.
You need to take lots
of different samples
of different sizes
to figure out
when your information
starts to disappear.
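
One way to play with sample sizes is sketched below on a synthetic stand-in for the picture. Everything about the data is an assumption made for illustration: the 8000-point total is only inferred from "a quarter" giving 2000 points, and the layer sizes, coordinates, and labels are invented. Real data would not come with such labels.

```python
import math
import random

rng = random.Random(0)  # fixed seed, chosen arbitrarily

def make_points():
    """Build a hypothetical 8000-point data set with three labelled layers."""
    points = []
    # Thick structure: a dense vertical bar.
    for _ in range(5000):
        points.append(("structure", rng.uniform(4.0, 5.0), rng.uniform(0.0, 10.0)))
    # Faint background: points lying along a sine wave.
    for _ in range(1000):
        x = rng.uniform(0.0, 10.0)
        points.append(("sine", x, 5.0 + 2.0 * math.sin(x)))
    # Random noise scattered over the whole area.
    for _ in range(2000):
        points.append(("noise", rng.uniform(0.0, 10.0), rng.uniform(0.0, 10.0)))
    return points

def layer_counts(sample):
    """Count how many sampled points come from each layer."""
    counts = {"structure": 0, "sine": 0, "noise": 0}
    for label, _, _ in sample:
        counts[label] += 1
    return counts

points = make_points()

# Take samples of several sizes (without replacement) and compare
# how many points are left to describe each layer.
results = {}
for size in (8000, 2000, 500):
    sample = rng.sample(points, size)
    results[size] = layer_counts(sample)
    print(size, results[size])
```

In this sketch the sine wave has only about an eighth of the points at every size, so at 500 points there are very few left to trace it across the whole width. That is the same effect as the background disappearing in the picture. On real data you would plot or summarize each sample and look for the size at which a feature you care about stops being visible.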
