Now I'd like to introduce
professor Ramesh Johari.
He's today's featured speaker.
He's an associate professor at
Stanford University with a full-time
appointment in the Department of
Management Science & Engineering.
And courtesy appointments in
the departments of computer science and
electrical engineering.
He received an AB in
mathematics from Harvard,
a certificate of advanced study
in mathematics from Cambridge.
And a PhD in electrical engineering and
computer science from MIT.
He's a recipient of several awards
including the Okawa Foundation
research grant.
The MS&E Graduate Teaching Award, a
National Science Foundation CAREER Award,
and the Cisco faculty scholarship.
He's also an associate editor for
two journals.
Management Science, for which he focuses
on the area of models and simulation.
And Operations Research, for
which he focuses on information, games and
networks.
And on that note,
I will turn the floor over to Ramesh.
>> All right, thank you Marsha.
I'm really excited to be here.
It's a fun thing to be able to present
the material I'm talking about today.
I think before I do that, I just want to
give you a couple of words about myself.
So I have a pretty diverse academic
background in terms of things I've
worked on.
Recently, my two main interests are the
design of online platforms, so I work
a lot on things like how Airbnb or Uber or
Lyft should structure their platforms.
Then the other one is that I spend
a lot of my time thinking about how
organizations can make
better decisions from data.
And that's what I wanted
to talk to you about today.
So the topic of today's talk is
A/B testing,
which is the term that most of the tech
industry uses for experimentation.
I want to tell you a little bit
about some of the problems that
can arise when you use
experiments in online settings,
as we typically do in
the tech industry and beyond.
And also, what we can do about it.
So before I start,
just one thing I should highlight.
The work I'm talking about was actually
done under my role as a technical adviser
to a company called Optimizely,
which is a large a/b testing platform.
So they provide, essentially,
a platform that allows other companies
to run a/b tests on their sites.
So I'll tell you a little bit about
first what exactly is a/b testing,
just to make sure we're
all on the same page.
So, what is A/B testing?
For those of you who
are in the tech industry,
you've probably figured out that we need
to have our own names for old ideas.
And so A/B testing is nothing more than
a new name for randomized controlled trials.
But if you're in the tech industry, they
will always be referred to as A/B testing.
And the idea is,
as you can see in this picture,
there are two variations in the most
classic, vanilla A/B test.
In one of them, you've got a webpage
with a black Buy Now button.
In another one, you've got a webpage
with a red Buy Now button.
And, you basically want to ask yourself,
which of these two versions
of the webpage is better?
Are people more responsive to
a red button or a black button?
So what you do is you take all
the visitors coming into your site,
split them at random between
the two variations of the web page.
And then just measure to
see which one does better.
And so
the kinds of questions you might ask are,
which has the higher conversion rate?
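To make that concrete, here is a minimal sketch in Python; the button names and conversion rates are made-up, illustrative numbers, not from the talk:

```python
# A minimal sketch of the vanilla A/B test described above: split incoming
# visitors at random between two variations and compare conversion rates.
import random

def assign_variation(visitor_id):
    # Randomize each incoming visitor to the black or red Buy Now button.
    return random.choice(["black_button", "red_button"])

# Simulated outcomes: True if the visitor converted (clicked Buy Now).
results = {"black_button": [], "red_button": []}
for visitor_id in range(10_000):
    arm = assign_variation(visitor_id)
    converted = random.random() < (0.10 if arm == "black_button" else 0.11)
    results[arm].append(converted)

for arm, outcomes in results.items():
    rate = sum(outcomes) / len(outcomes)
    print(f"{arm}: {len(outcomes)} visitors, conversion rate {rate:.3f}")
```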
So, as I mentioned, A/B testing is
essentially the same as
randomized experimentation, and
it's been around for at least 100 years.
But in the tech industry in particular,
even there it has a very long history.
I'd say, going back at least two decades
in terms of its deep use across a range of
applications.
What's really interesting today is that
we're kind of living through a period
where the ability to run these experiments
is being democratized quickly.
What I mean by democratized is that now,
sort of a typical employee at
a business has the ability to run these
experiments without really a lot of
engineering or statistical knowledge.
That's kind of what the point
of Optimizely really was.
It was to make it so
that all you needed to know was a little
bit of ability to code up a web page.
And all you had to do is code up
two variations of the web page,
in this case the black and the red button.
And then they would take care of the rest.
They would make it so that all
the visitors coming to your
site are randomized between
the two variations.
They would take care of all the
statistical computation to show you what
was doing better or worse.
So you really didn't need to know a lot.
And I think if there's one point I
can have you take away from what I'm
going to talk to you about today,
it's that as we live through this sort
of democratization of statistics,
how we use statistical tools and
data science tools really
needs to be completely rethought.
The consumer of statistics and data
science is no longer a statistician or
a scientist, it's really everyone.
But stats and
science were not really developed in
an environment where that was the case.
So, kind of what I've found
is that they have a bit of a user
interface problem.
And this talk is, in part,
about that user interface problem.
But it's really just
the tip of the iceberg.
I think there's a lot more to do here.
So, with that in mind, let's think
a little bit about what exactly was
going on 100 years ago
when the foundations
of randomized experimentation were
first developed in statistics.
So, there's a phrase I'm going to use,
which is fixed sample size testing.
And you can see from this picture,
this picture looks about as far away from
the tech world as you can imagine,
which is, it's a picture of a farm.
I think this picture's
actually from Napa Valley, so
it's not that far from the tech industry.
But what we're going to talk about
is really the initial kind of
setting within which randomized
experiments were analyzed.
Which was actually studying crop yields,
and
in particular the effect of
fertilizers on crop yields.
And so in particular,
we're going to consider Farmer Ron,
who wants to test two fertilizers.
The standard thing that you
would have been told to do,
is to apply what's called
fixed sample size testing.
Where the first thing you commit to,
in advance, is how many plots of your land
are you going to use to try to
compare these two fertilizers, okay?
So, that's the sample size.
Next, you wait for a crop cycle.
So, essentially,
you choose how many plots to use.
In some of the plots,
you apply one of the fertilizers and
others of the plots you
apply the other fertilizer.
And you wait for the crop cycle, and
then collect the data at the end, okay?
Once you've collected the data,
you ask yourself a very basic question,
which is a bit counterintuitive.
But if you wrap your head around it using
kind of the English description I put
here, it's easier to understand.
What are the chances that you would
see the results that you saw,
if there was actually no difference
between the two fertilizers?
So when you apply the two fertilizers
to the two different groups
of plots of land,
you may see a difference in exactly
what the crop yield is between the two.
And the question is,
what are the chances that you would
see the difference that you saw?
If there actually was no difference in
the performance of the two fertilizers.
So you may have heard the term p-value or
you may have used p-values and
this is what a p-value is.
A p-value is literally
the false positive probability.
It's the chance that you observe
the difference that you have observed,
or a difference at least
as extreme as what you saw.
Even if there really was
no difference between the two.
Okay, so in particular,
when p-values are small, what
that's saying is that it's really unlikely
you would see the difference that you saw,
if the two fertilizers
were actually the same.
And what we typically say about that,
is that then we can reject the hypothesis
that the two fertilizers
are actually the same, okay?
And that's kind of how we argue that
there's a difference between them.
So in particular, what does small mean?
A typical cutoff in industry
would be something like 5%.
If the p-value is below 5%, we declare the
results statistically significant, right?
And then we can say there's
a difference in the two fertilizers.
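To make this concrete in code, here is a hedged sketch of that fixed sample size rule, phrased in terms of conversion rates rather than crop yields; the counts are illustrative:

```python
# A sketch of fixed sample size testing: commit to N in advance, collect the
# data, then compute a single p-value at the end (pooled two-proportion z-test).
import math
from scipy.stats import norm

def fixed_sample_p_value(conversions_a, n_a, conversions_b, n_b):
    """Two-sided p-value for 'no difference between A and B'."""
    p_a, p_b = conversions_a / n_a, conversions_b / n_b
    p_pool = (conversions_a + conversions_b) / (n_a + n_b)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / se
    # Chance of seeing a difference at least this extreme if A and B are the same.
    return 2 * norm.sf(abs(z))

# Illustrative numbers only (not from the talk):
p = fixed_sample_p_value(conversions_a=120, n_a=1000, conversions_b=150, n_b=1000)
print(f"p-value = {p:.4f}; significant at 5%? {p < 0.05}")
```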
So this is a really powerful approach.
And one of the reasons it's so
powerful is that it actually does
something much stronger than just dealing
with controlling false positives.
So one thing you can see from this is
that, if you cut off the p-value at 5%,
one of the things you're saying is
that you're not willing to tolerate
more than a 5% false positive rate, all right?
So, the second thing is that, in fact, what
it's doing is optimally trading off
your ability to control false positives
against the detection of true positives.
because you see that one way you
could control your false positives,
is to never declare that
there's a difference.
So, you run this experiment,
and no matter what you see,
you declare,
I think the two fertilizers are the same.
So, that's a world in which you'll
never have a false positive, but
it's just because you never
declare anything positive.
And, the problem with that, is that you're
never going to find a real difference.
So in all statistical testing,
what we're doing is we're trying to
balance controlling false positive results
with finding as many true
positive results as we can.
And what's really neat about this
approach is that, in fact, it does that.
It actually optimally balances true
positives and false positives.
It does it essentially as
efficiently as possible.
And I wanted to highlight one
other really valuable thing and
I think this is why p-values are so
prevalent in the A/B testing world.
It's cool that you can summarize the
entire experiment through just this one
number which is the p-value.
And one thing that is nice about that is,
different people might run
different experiments.
There might be different statistical
properties of the experiments that they
run, but at the end if what
you produce is a p-value,
then all I need to do is apply
this simple rule in bullet form,
cut off the p-value at my desired
control on false positive probability.
And I know that I've got this optimal
trade off between true positives and
false positives, without having to
really understand anything about outside
the experiment itself.
And p-values, the way we like to talk
about this is that they have kind of
a transparency and a portability to them,
that allows organizations to share
results from multiple experiments
across people using kind of a simple
statistic of a p-value, okay?
So that's a very valuable
thing about p-values.
Now that said,
I'm going to guess that many of you have
probably read things in the popular press,
that suggest that p-values are a kind
of measure under great scrutiny.
There's been a replication
crisis in psychology,
where people are kind of concerned
that results that are published
in journals are not reproducible.
That has put a lot of attention on whether
the use of p-values, the way we do it in
the scientific literature,
is actually reasonable.
I don't want this talk to be a defense
of p-values necessarily.
But I think what I'd like to ask
is given some of the pitfalls
that arise in A/B testing, can we adapt p-
values to be something slightly different
to address those pitfalls, right?
So now let's get into that,
what are the pitfalls of using
p-values in A/B testing today?
And for that we really need to
compare what the difference is between
running randomized
experiments on a website, and
running a randomized
experiment on crop yields.
So, in 1916, and Farmer Ron was actually
an allusion to Ron Fisher, who was kind
of the father of experimental statistics
the way that I described it to you.
In 1916, I think there are two characteristics
that are important to highlight.
One, data was really expensive and
really slow.
You had to commit these plots of
land to run your experiments on, and
you just had to wait an entire
crop cycle to run the experiment.
There was nothing else to do.
You'd just go grab a cup of coffee and
wait around until you get your results.
The other thing is, this point that I
highlighted at the very beginning, which
is that in 1916, if you wrote up a stats
paper, you wrote up your results using
statistical analysis, the person you were
giving it to was a trained statistician.
So you didn't have to worry about
whether they were interpreting a p-value
directly or drawing the correct inferences
from the dashboard that you showed them.
Today we're in a completely
different environment.
Data is super cheap and in real time,
so it's coming in very rapidly.
You can decide on the fly to kill
an experiment, to start a new experiment.
And then the other point I made at
the very outset which is everybody is
a practitioner.
You're a practitioner, and you may not even
realize it, but many of the things that
you're doing today in your job will be
A/B tested in some form within ten years.
If you think about logistics,
you think about health care,
you think about just about any industry,
there are so many different
points in these different disciplines
where it's possible to introduce
some of these statistical methods of
a/b testing to make better decisions.
So we're going from a world
in which practitioners were
trained statisticians to one in which
really everybody is going to have
to be doing this kind of work, even
if they weren't trained in it.
So, what that obviously highlights
is that we need to ask ourselves,
how are you different from Farmer Ron?
You being the prototypical
user of A/B testing today,
how are you different from Farmer Ron,
right?
So, the first one is that time is money.
Now because data is cheap and plentiful,
and experiments can be started and
stopped on the fly, you want results
as quickly as possible, okay?
So that's going to be very important.
I think that while Farmer
Ron might have wanted results as quickly
as possible, he didn't have much of
a choice except to wait for a crop season.
You have a choice,
you can actually choose to stop.
And because you can choose to stop,
what you're going to do,
is continuously monitor the results
that you're getting from the tests, and
use those to decide whether or
not you want to continue.
Okay?
And in particular, you're going to use
the dashboard to tell you when the results
are significant, and that's probably the
thing that's going to guide your decision
on whether to stop or keep going.
So this, in contrast to what I
call fixed sample size testing,
is called adaptive sample size testing.
You adjust the test length in real
time based on the data coming in, okay?
So I hope that's clear to everyone.
There's a huge difference between
what I just told you here,
and what Farmer Ron was doing.
Farmer Ron had committed to a fixed
sample size at the beginning.
Largely out of necessity,
because there was a fixed number of land
plots that he was going to work with, and
he had to wait an entire season
to get data back.
You, on the other hand, are in a world
in which data is coming in real time, and
therefore you are given the ability,
the opportunity, to adjust
the test in real time, okay?
Or adjust the experiment in real time.
And in particular, what is
fascinating about this is not just that
you have the ability, it's actually that we
encourage you to use the ability.
So here's the sample results
dashboard from Optimizely and
I want to call out
a couple things about it.
So in this dashboard, what's going on
is there's actually three variations
that are different from the control,
not just one.
So it's variations one, two, and three,
and then the original is the control.
And then there's three metrics of
interest: views, example click,
and pick click.
And so, those nine boxes that
go from views to pick click and
variation one to variation three,
are all comparisons on that metric of
that variation against the control.
And the green and red are telling you
where the dashboard thinks you have
significant results in the positive or
negative direction.
So what I want to highlight for you are
these huge numbers on the left-hand side,
unique visitors and days running.
Those are like counters that are ticking
forward, that are basically telling you,
hey, here's how long your
test has been running.
Every time you log in, we're trying
to remind you that time is money.
We're showing you that time is elapsing,
and here's the results you have.
And you're being put in a position
where you're always being asked to
trade those things off.
So, we're kind of forcing you into
an adaptive sample size testing mentality
just by the design of the dashboard.
And the other thing that I want to
highlight is this little blue link
that says, how long should I run my test?
Now if you click on that link, what that
brings up is a sample size calculator,
which essentially tries to get you
to think how much longer should
the test be running?
The great irony is, this link is on
the results page which means I'm having
you think about how long you want to run
the test once you've started the experiment.
Now contrast that with Farmer Ron: what we
said about fixed sample size testing is that
Farmer Ron is asked to do that
before he ever runs the experiment.
And indeed, if you look at any text
on experimental design, that's one of the
first things you'll learn: commit to the
sample size before you run the experiment.
So when I started working with
Optimizely, this was one of the things I
found most fascinating, is that this link,
how long should I run my experiment,
was on the results page, not on
the experiment design page, all right?
So here's a thought experiment.
And I should've set this up as a poll,
and I didn't.
So I'm just going to ask you to
commit to what your answer is.
Here's a thought experiment that's
going to help you get some insight, okay.
And I assure you this is
not a typo on the slide.
Suppose 100 individuals run A/A tests.
So that's not a typo,
what do I mean by an A/A test?
If you're a customer of Optimizely, one of
the first things you want to figure out is
whether Optimizely is actually
doing what they said they would do.
And one of the ways that you might test
that is to run an experiment where you
know up front that the control and
the treatment are exactly the same thing,
that the original and
the variation are the same, okay.
So that's what's called an A/A test.
And the reason it's called an A/A test is
at least in that case you know there's no
difference from the beginning.
So it's like a diagnostic you can run:
when Optimizely tells me that they're
going to help me find real differences,
is that what they're really doing, okay?
So what do you do?
You run this A/A test.
And now you monitor the dashboard that
I showed you on the previous slide.
And you just keep looking at it every day.
And you wait for
the first time that the p-value drops
below 5% in this A/A test, okay.
Which means you wait for the first time
that the dashboard tells you that the data
that you're seeing is unlikely,
if there was really no difference.
And let's say we do this up to a maximum
of 10,000 observational units, or
10,000 visitors to my site, okay.
So the question is how many people
find a result and stop early?
How many people find a significant
result and stop early?
So I'm just going to step through
a few potential values here.
Let's make sure that we all understand: if
this is working as it's intended, then no
more than 5% of you should
find a significant result and
stop, because what we're saying is
that any result here that's declared
significant is a false positive, right?
There's no difference between A and A, so
if I declare that there's a difference,
that is a false positive.
I'm saying they found a difference and
there wasn't one.
So anything that's found that's
significant is a false positive.
When we use a 5% cutoff on the p-value,
what we're saying is we don't want
more than 5% false positives.
So nominally, 5% is what we should expect.
So I can tell you I've run this survey
multiple times, counting up from 5%.
I ask everyone to put their hands up, and
I climb up from 5% to 10% to 20%.
And the hands stay up.
And when I get to 20% to 30% is
when most of the hands drop.
And then you get up to 40%, 50%, and
there's basically nobody left, and 60%,
nobody left.
Well, it turns out the right
answer is actually close to 60%.
Over half of these individuals
are going to find a significant result.
And I want you to stop for
a second to understand what this means.
It means,
you're a data scientist in a company,
your boss is asking you to use an
A/B testing methodology.
You decide to test out your
methodology by running an A/A test,
where you know there's no difference.
And out of half of the A/A
tests that you run,
you discover that there's
a significant difference.
So essentially what that means is A/B
testing would be completely useless,
if that was the case.
It would mean that one out of two times
that you ran an experiment where there was
really no difference,
you're declaring a difference.
In which case, what's the point?
You might as well just flip
coins to decide what to do.
There's no point of actually
running experiments, okay.
And so here's the first point, and
I understand that up to now I haven't
explained to you why this is happening.
But the first point is, in A/B testing,
peeking can dramatically inflate the
number of false positives that you get.
Now, what is peeking?
Peeking is watching
the dashboard continuously, and
deciding to stop the first time
you get a significant result.
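Here is a small simulation sketch of that peeking behavior on A/A tests; the conversion rate, check interval, and horizon are illustrative, but the inflation it shows is the phenomenon being described:

```python
# Run many A/A tests (no real difference), check the p-value after every
# batch of visitors, and stop the first time it drops below 5%. The fraction
# that ever "finds" a significant result is far above the nominal 5%.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)

def aa_test_peeks(base_rate=0.10, max_n=10_000, check_every=100):
    a = rng.random(max_n) < base_rate   # control conversions
    b = rng.random(max_n) < base_rate   # "treatment" conversions, same rate
    for n in range(check_every, max_n + 1, check_every):
        pa, pb = a[:n].mean(), b[:n].mean()
        pool = (a[:n].sum() + b[:n].sum()) / (2 * n)
        se = np.sqrt(pool * (1 - pool) * 2 / n)
        if se == 0:
            continue
        p_value = 2 * norm.sf(abs(pa - pb) / se)
        if p_value < 0.05:
            return True   # stopped early with a (false) positive
    return False

runs = 200
false_positives = sum(aa_test_peeks() for _ in range(runs))
print(f"{false_positives / runs:.0%} of A/A tests stopped with a 'significant' result")
```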
Okay, so let's try to understand
why that's happening, all right.
And that's kind of the crux of this talk
is exactly what is causing this problem.
I mean, you might imagine that it
shouldn't matter that I'm looking at
the results, right, because whether
I look at them or not, it doesn't matter.
The data is the same.
So the problem isn't really the looking,
the problem is the doing.
In particular, what do I do after
I look at the results, all right?
So, remember, Farmer Ron had committed to
a fixed sample size at the very beginning,
and only looked at his data
after that sample size was done.
You, on the other hand, are choosing to
look at your data every single day, and
waiting until the first time that
you see the p-value drop below 5%.
That's the big difference is adaptive
sample size versus fixed sample size.
So to see what's happening, let's look
at a sample run of an A/A test.
This is actually from Optimizely,
all right.
If you look at the sample run,
it's an A/A test, right.
And I should tell you, what's being plotted
here is actually one minus the p-value.
So when this graph is high,
that means the p-value is low, okay.
And so what you observe here is that very
early in the test, and late in the test,
there are two points at which the one
minus the p-value crosses 95%, or
equivalently, the p-value falls below 5%,
okay.
This is despite the fact
that it's an A/A test.
And this graph, actually, this is
the type of thing that was the basis for
many customer support calls to Optimizely.
Now, put yourself in the shoes of
the poor customer support rep.
They take the call, and
the person on the other line says, hey,
I just ran this A/A test, and
I'm getting all these significant results.
What's going on?
And in customer support,
you're typically left to explain that
maybe there was a bug in the platform,
maybe they're not using it properly, they
should look at a knowledge base article.
And the reality is they
were in a tough spot.
It's not clear what to tell this
person because there's nothing wrong,
there's no bugs,
this is what the data's actually doing.
And so what's going on is that when
you watch the p-value over time,
then you allow yourself a lot more
opportunities to decide whether or
not something is significant.
Farmer Ron only had one opportunity,
the very end of the experiment.
In your case, you're looking at it
after 100 visitors, after 200 visitors,
after 300 visitors.
You have all these different
opportunities to conclude whether or
not something is significant.
Well, obviously,
if you give yourself more chances to
decide if something is significant,
that increases the likelihood that you're
going to find a significant result.
And that's exactly what's happening.
In particular, if you're allowing yourself
to continuously watch the p-value,
and if you're willing to wait long enough,
there's actually a high
chance of a result that's essentially
inconclusive, as in this graph,
looking significant at at
least one point along the way, okay.
Said differently, if I'm allowed to just
keep watching the p-value over time,
let's suppose that it's true that
eventually only 5% of these experiments,
of these A/A tests,
are going to have a p-value that's below
5%, so are going to have a false positive.
Even if that's true eventually, a huge
fraction of them, in fact a majority,
are going to have a p-value that's less
than 5% at some point along the way.
Okay, so that's the difference.
And if you allow yourself to keep looking,
you're going to pick up on that, right.
So here's a kind of technical issue.
And I want to highlight now, there's
going to be a couple points in the talk
where I do a deeper dive
on the technical side.
This is one of them.
So one way to think about what's going
on in fixed sample size testing,
if you unpack what that p-value
less than 5% really amounts to,
is to look at the observed
difference in sample means between
the control and the treatment.
So I measure the sample mean
conversion rate in the treatment group,
the sample mean conversion
rate in the control group, and
I take the difference of those two, okay.
Now, it's a well known kind of fact in
probability that when you've got N samples
of something, the central limit theorem,
or a lot of other techniques,
will tell you that kind of
the spread of the sample mean is
going to look about like one
over square root of N, okay.
So if you think about what
the sample mean looks like,
it goes down like one over
square root of N over time.
So if I want to declare that there's a
significant difference between control and
treatment, typically, what these rules
involve is saying that you declare
significance based on
the difference of those two.
And on the left-hand side,
there should have been an absolute
value there, actually.
It's not just the raw difference,
it's actually the absolute
value of the difference.
If the absolute value of the difference
exceeds sort of a multiple of
one over square root of N,
then we declare it to be significant.
Now, how do we choose that multiple?
The multiple is based on
the significance level.
So in this case,
we had a 5% significance level.
That would dictate what the constant
in the numerator looks like, all right.
So this is kind of a convenient way to
think about what we're actually doing
because I find that it's more intuitive
to think of sample means rather than
p-values.
You can actually interpret what
the sample mean is going to look like.
The more samples you get,
the lower the spread in the sample mean
is going to be, and its uncertainty.
And so this is a way that
we declare significance.
Okay.
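Written out (notation mine, not from the slides), the rule is:

```latex
% Fixed sample size decision rule, evaluated once at the committed N:
% declare a significant difference iff
\left|\,\bar{X}_{T,N} - \bar{X}_{C,N}\,\right| \;>\; \frac{c_{\alpha}}{\sqrt{N}},
% where c_alpha is the multiple set by the significance level alpha
% (e.g. alpha = 5%), so that under "no difference" this happens with
% probability at most alpha.
```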
Now here's the problem.
So what we're saying is,
in fixed sample size testing,
you declare significance if
the difference in sample means
exceeds a constant over square root of N
at the chosen fixed sample size.
But in probability there's a classical
result that says that the difference of
sample means exceeds not just a constant
over square root of n, but actually a
constant times square root of
log log n over n, for infinitely many n.
Now practically speaking, what this means
is that if you're willing to wait long
enough, and your rule is that you
declare significance any time
the difference of sample means is bigger
than a constant over square root of N,
you're guaranteed to eventually do it,
okay,
if you're willing to wait long enough.
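For reference, the classical result being alluded to is the law of the iterated logarithm; a loose statement for a centered sample mean (notation mine):

```latex
% Law of the iterated logarithm, for i.i.d. Z_1, Z_2, ... with mean 0 and
% variance sigma^2:
\limsup_{n \to \infty}
  \frac{\left|\,\bar{Z}_n\,\right|}{\sqrt{2\sigma^{2}\log\log n \,/\, n}}
  \;=\; 1 \quad \text{almost surely,}
% so |Z-bar_n| exceeds any boundary of the form c / sqrt(n) for infinitely
% many n. Under the null, the centered difference of sample means behaves
% like such a Z-bar_n.
```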
So this basically says, I mean, I could
have asked you a different question.
Which is, imagine that every one of
those 100 people waits indefinitely
to see if the p-value ever drops below 5%.
And what this result is saying
is that every single person would
eventually declare a significant result.
So your false positive
rate would be 100%, right,
which is much worse than even the majority
that I was giving you a second ago.
So thinking about it practically,
if you think about what the effect is in
real data, on the x-axis in this graph,
what I have is the number of
observations in the experiments.
It goes up to 10,000.
And on the y-axis is
the false positive probability, and
there are three different curves, all right.
Each shows the false positive probability
I see up to that sample size if I use
a different cutoff for the p-value.
So 1%, 5%, or 10%.
And so in the A/A experiment we did,
I was showing you
an alpha of 5%, or
a cutoff of 5%, up to 10,000 observations.
So that's kind of the final
point of this green line.
But you can see that in all of these
cases, red, blue, and green there
is massive inflation over what the nominal
desired false positive probability was.
So, for example, if I use a 1% cutoff, then
I still get an inflation up to like a
21%, 22% false positive probability
when I have 10,000 observations.
And that means that I'm essentially
getting a 20-fold increase in the number
of false-positives because I'm doing this
thing of peeking at my experiments, okay?
So the point is, if you just
think about the incentives, right?
As a practitioner, there's huge
incentive to find significant results,
because that's how you're judged.
Your boss wants you to say, yeah,
hey look, I ran this test and
I got this amazing result, right.
So, you can have these organizations
where people are running experiments and
finding significant results all the time,
only to discover that most of them are
spurious and kind of not meaningful, okay.
So, I guess what I'd like to ask then is,
are there irreconcilable differences here?
I mean what would the user like?
The user wants to choose
to stop adaptively,
to trade off the test length against
the detection of true positives.
And at that time, if
the p-value is small enough,
the results are declared significant, okay?
So that would be like essentially
that the user wants to use adaptive
sample size testing.
They want to be able to look at that
graph I showed you of an A/A test, and
then stop whenever they want, say the first
time that the p-value looks small,
instead of being committed
to a fixed sample size.
On the other hand, what I told you about
fixed sample size testing is that it
optimally balances false positives and
true positives, but it's not adaptive.
But what I told you about adaptive testing
is that it massively
inflates false positives.
So what are you supposed to do?
You know how do you balance
between these two things?
What's the resolution of
the irreconcilable differences?
Okay?
Well, there are really two views on this.
One of them I got by talking to colleagues
at Stanford, and the safe-for-kids
version is called the RTFM approach, which
is the read the freaking manual approach.
And, all right, what does that mean?
Well, when I asked my colleagues they
were like, wow, that's incredible.
If they would just pick up a textbook,
they would see that
the right way to run an experiment is to
commit to the sample size in advance.
And then the usual sort of ivory tower
comments regarding the fact that
practitioners don't know
what they're doing and
they should all take stats classes or
data science classes.
And I think, I am an academic
as well, in the ivory tower.
So I'm sort of prone to thinking
this way myself at times.
And initially I was kind of
sympathetic to them.
I'm like, okay, I'll go back to Optimizely.
I'll just tell them, hey, why don't we
just put up a tutorial that says commit to
your sample size and don't peek, right?
But then we sat back for a second and
what we realized is this makes no sense.
The reason that users
are peeking is because they can.
So we're basically telling you you're
using a platform where you get data in
real time.
But I don't want you to take
advantage of the data in real time.
I want you to sit back and
turn off your computer.
Go get coffee.
Wait like Farmer Ron would.
Essentially, I want you to
drop yourself back 100 years
in your practice, so
that you're not taking advantage of
all the data you get in real time.
Now when you put it that way,
it makes no sense, right.
And that brings us to kind of
the first maxim of good product design,
which is: the user is always right.
So what you don't want to do
is you don't want to commit
to saying that your methodology
is right and the user is wrong.
Because usually,
when users consistently have a certain
behavior there's a good reason for it.
And in this case, it's because the data
is actually coming in in real-time.
The users are right,
they have good reasons to be adaptive.
And our goal should be to
change what we're doing to
fit a natural user model and one which
is consistent with real-time data.
But that's this point about
user interface design,
which is: think of p-values
as the user interface.
The point is not that users are using
the current p-values incorrectly.
The point is we've designed
the wrong p-values for
the way that users want to make
decisions with real-time data, all right?
And so that brings us to how we
reconcile what we're doing with what
users want to do.
And the approach is really just
change the P-value to account for
the fact that people continuously monitor.
So, basically, what we want to do is allow
users to continuously monitor the test,
allow them to choose whenever they want to
stop, based on the p-value that they see,
and really, what we want is
what's called an always valid p-value.
No matter how the user
chooses to stop the test,
we want to guarantee them that there's no
more than a 5% false positive probability.
Okay, by the way,
I saw one of the questions here.
Is there any significance
in the value of 5% and
I just want to point out that we've
been using 5% as an exemplar throughout
the talk but
it's not really consequential.
It can be 1%, 10%,
it's really just that number is whatever
your desired control on false-positive is.
Okay.
So what we want to do is we want to
report this always valid p-value and
make sure that no matter how
the user chooses to stop the test,
there's no more than a 5%
false positive probability.
And in our solution what we do is we
provide a near-optimal balance between
test length and detection of true positives
while controlling false positives and
allowing the user to be adaptive.
All right.
So
we're getting that near-optimal
balance between false positives and
true positives that we got from
fixed sample size testing,
while controlling false positives.
And we're allowing the user
now to be adaptive,
so this is the reconciliation
that we wanted.
Now, how are we doing that?
The crux of all of this rests on
choosing the right p-value, okay.
So what I'm going to do is walk you
through it just quickly, and I think I have
about 10 more minutes.
>> Yeah.
>> Does that sound right?
Yeah.
It's a little bit technical, but
I'll just walk you through
quickly how we do this.
And there is a lot more
details available through
some of the papers that
we've written on it.
So, remember what I told you, which
is that what we do is we wait for
the difference of sample means
to cross one over square root of N, or
a constant over square root of N,
and that's going to
happen infinitely often,
even under what's called
a null hypothesis.
Even if there's no difference between
the two, treatment and control, okay?
So, that's a problem.
What we want to do, if we're
going to allow the user to stop whenever
they want, is become more conservative.
We need to pick a stopping rule, or a rule
that declares there is a significant
result, that's kind of more conservative
than a constant over square root of N.
And what we do is we pick
a boundary that instead
looks like a constant times
square root of log N over N, okay?
In particular, let's now ask the question,
what is the chance
that the difference of
sample means will ever cross
a boundary that looks like square root
of log N over N times the constant?
It turns out that that is
actually a bounded probability.
Which means that even if there's
no difference between the two,
the treatment and control, the chance that
you ever cross the boundary constant times
square root of log N over
N is actually bounded.
And in particular by changing the
constant, I can change that probability.
So if I want to do is
control false positives,
then I could do it the following way.
You give me your desired false
positive probability control.
That's what I'm calling alpha.
I will spit out a constant C(alpha) such
that if you stop the first time that
you're difference of sample means
crosses C(alpha) times square root of
log N over N, you will only make mistakes
a fraction alpha of the time in cases
where there's no difference.
So your false positive probability
will actually only be alpha, okay?
In words, the chance the boundary is
ever crossed is no more than alpha.
And this holds when
there is actually no difference between
the treatment and control.
So it's a bit technical again, but the
real point here is that what I'm doing, by
forcing the difference of sample
means to be larger before I
reject the null hypothesis, is
becoming more conservative, okay.
And essentially,
that's the rule that we use.
It's not exactly this, but
it's approximately this.
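To make the shape of that rule concrete, here is a small simulation sketch; it is not the exact mSPRT formula used in the Optimizely work, just a boundary of the stated shape, with the constant calibrated by brute force on simulated A/A tests (all rates and horizons are illustrative):

```python
# Stop and declare significance the first time |difference of sample means|
# crosses C * sqrt(log(n) / n). The constant C is calibrated by Monte Carlo
# so that, under no difference, the boundary is ever crossed <= alpha of the time.
import numpy as np

rng = np.random.default_rng(1)

def ever_crosses(C, base_rate=0.10, max_n=5_000, check_every=100):
    # One simulated A/A test: both arms have the same conversion rate.
    a = rng.random(max_n) < base_rate
    b = rng.random(max_n) < base_rate
    for n in range(check_every, max_n + 1, check_every):
        diff = abs(a[:n].mean() - b[:n].mean())
        if diff > C * np.sqrt(np.log(n) / n):
            return True
    return False

def calibrate_constant(alpha=0.05, runs=200):
    # Smallest C on a coarse grid whose simulated crossing rate is <= alpha.
    for C in np.arange(0.2, 2.01, 0.05):
        crossing_rate = sum(ever_crosses(C) for _ in range(runs)) / runs
        if crossing_rate <= alpha:
            return C, crossing_rate
    return None, None

C_alpha, rate = calibrate_constant(alpha=0.05)
print("calibrated C(alpha):", C_alpha, "crossing rate:", rate)
```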
And what that rule kind of is doing for
us is what's called
a sequential test of power one.
This rule, stop and declare significance
when a constant times square root of log N
over N is crossed, is what's called
a sequential test of power one.
And it basically has two properties.
One is it controls our false
positive probability at level alpha.
So if alpha is 5%, then
I'm guaranteeing myself that,
if there's no difference between the two,
the chance that I declare
significance is no more than 5%.
But it also has the property that,
if there is a real difference,
eventually, I'll find it,
because if there is a real difference,
eventually square root of
log N over N goes to 0.
And so, eventually,
the difference of sample means
will actually cross that level.
You can see this better in the picture.
There are four curves in this picture.
The blue and
the gold are basically these curves that
look like a constant over square root of N.
So what you want to be thinking
to yourself is the sample mean,
the difference in sample means,
is something which is moving around
in the vertical dimension here as
the number of observations increases.
If the difference in sample means ever
crosses either the blue boundary for
a 10% cutoff or the gold boundary for
a 5% cutoff,
you declare significance using the fixed
sample size testing model, okay.
Now those boundaries are fairly low,
which means even a small difference
of sample means is enough for
you to reject and declare that those
are significant results, okay?
Just stop and declare significant results.
And what I pointed out to you earlier is
that this green curve, which is the law of
the iterated logarithm boundary that looks
like square root of log log N over N,
is actually guaranteed to be
crossed infinitely often.
And you can see the gold and
the blue curve essentially lie underneath
the green curve, with the exception
of the region at the very beginning,
for small sample sizes.
On the other hand this boundary square
root of log N over N that I just
explained to you lies way above
the other three boundaries.
And that gap, that additional gap there is
what essentially is allowing us to ensure
that even if you continuously watch
the difference of sample means over time,
the chance it ever
crosses the red boundary,
even if you were to wait indefinitely
is bounded by either 5% or 10%.
I should say the red curve here
is actually for the 5% setting.
And then the curve for the 10%
setting would be a little bit lower.
Okay, so
those are kind of the technical details.
And kind of giving a little bit of homage
where homage is due, the rule that I
just described to you is what's called the
mixture sequential probability ratio test.
It was actually developed here at Stanford
by Herb Robbins and David Siegmund.
Siegmund is still here.
Robbins passed away, and both are kind
of very well known statisticians.
So here's how we use it to get p-values.
You basically define T of alpha to
be the stopping time of a test that
stops when your desired false-positive
probability control is alpha.
And of course, that will be later when
you have a tighter desired control on
false-positive probabilities.
And I define the p-value at time N to be
the smallest value of alpha such that
the corresponding stopping time for that
level of alpha has already gone by, okay?
Where, if that was the desired
alpha that I wanted,
I would have already stopped by now.
So the p-value's the smallest alpha,
where I know I would have
already stopped by now, okay?
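In symbols (notation mine), that definition is:

```latex
% Always valid p-value at time n: the smallest alpha whose stopping time
% T(alpha) has already occurred,
p_n \;=\; \inf \{\, \alpha : T(\alpha) \le n \,\},
% so that "stop the first time p_n <= alpha" recovers the stopping rule
% T(alpha), and the false positive probability is controlled at alpha.
```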
Well, it turns out that,
with that definition,
the first time that the p-value ever
falls below alpha is actually T alpha.
In words, what this means
is that if you follow the rule of stopping
the first time the p-value goes
below alpha, then what you're doing
is you're just recovering that
stopping rule that I just developed.
Okay and so in particular you're going to
be finding a rule that controls your false
positive probability at level alpha.
Okay, so it has false positive
probability bounded by alpha as desired.
More importantly,
the same thing is actually true for
any adaptive stopping rule that
the user may choose to use,
even if they don't want to stop the first
time the p-value falls below the alpha.
No matter what they do, the false positive
probability is controlled by alpha.
Now that requires a little
bit of reasoning.
And I'll skip past the proof of that.
But you can kind of easily work through
from what the definition was on
the previous slide here.
So from the definition of the p-value,
you can actually show that no matter how
the user chooses to stop, their false
positive probability will never be larger
than alpha, then their desired alpha.
Okay, so
what I want to try to convince you of,
the last thing I want to
show you on this front,
is that not only are we controlling
your false positive probability.
We're actually doing a really good
job of detecting real effects, and
to do that, what I want to do
is go back to the platform and
compare what we're doing against
fixed sample size testing.
So I need to make sure I do
an apples to apples comparison here.
And to do that, one of the things that I
need to explain is that when you do fixed
sample size testing,
in general you estimate the effect
that you want to detect in advance.
Because you're committing
a sample size in advance.
So this minimum effect that you care
about detecting is going to be what
dictates how large your
experiment needs to be.
And that minimum effect is what's
called the minimum detectable effect.
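For concreteness, here is a sketch of the textbook fixed sample size calculation this refers to; the exact calculator in the product may differ, and the numbers below are illustrative:

```python
# Given a baseline conversion rate, a minimum detectable effect (MDE),
# a significance level, and a desired power, compute the per-arm sample size
# using the standard two-proportion formula.
import math
from scipy.stats import norm

def fixed_sample_size(baseline, mde, alpha=0.05, power=0.80):
    """Visitors per variation needed to detect an absolute lift of `mde`."""
    p1, p2 = baseline, baseline + mde
    z_alpha = norm.ppf(1 - alpha / 2)   # two-sided significance level
    z_beta = norm.ppf(power)            # desired detection of true positives
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return math.ceil((z_alpha + z_beta) ** 2 * variance / (p2 - p1) ** 2)

# Underestimating the MDE makes the committed sample size balloon:
# halving the MDE roughly quadruples the required N.
print(fixed_sample_size(baseline=0.10, mde=0.010))
print(fixed_sample_size(baseline=0.10, mde=0.005))
```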
So what we do is we go through
the following procedure.
We took 40,000 experiments on Optimizely
that had run for a really long time.
So we kind of assumed that the effect
that we're seeing at the very
end of the experiment is the real
difference between the treatment and
control in these experiments.
And for each of them,
what we do is we consider what the fixed
sample size run length would be, if we had
correctly estimated in advance what the
effect in the experiment is going to be.
If we underestimated the effect by 30%,
if we underestimated the effect by 50%.
30 and 50% are not that big by the way.
What I mean by this is like imagine the
conversion rate difference in treatment
and control is like 1% and that's the real
difference, underestimating that by 30%
means that I actually believe it's
something like a 0.7% difference.
And underestimating by 50% means I think
it's a 0.5% difference, which, when you're
measuring differences in conversion
rates, would be well within
the margin of error of what you
believe is going to happen.
So obviously the fixed sample size testing
here has a huge advantage because I'm
feeding into it what I observe
to be the final difference in
conversion rates at the end of these
experiments that we had historically.
So, in reality I won't
have access to that.
And in particular,
the test that we're using, the way we're
developing p-values, does not assume we
know anything in advance about what
the effect is actually going to be.
So what is a consequence?
So in this graph what I'm showing you is
the relative run length of the test that
we developed against what you would get if
you used a fixed sample size procedure.
And the first graph I want to draw your
attention to is the fully dotted graph,
the one that's most spread out.
And that's the case where
we find a fixed sample size
that gives us 80% detection
of true positives.
And then we compare how long our test,
our newly developed MSPRT test, takes
to run against a fixed sample size.
And what you can see here is often it
is no more than 1x the run length, but
sometimes it's 2 or 3x the run length.
The x-axis here is measuring kind
of how long the continuously
monitored adaptive sample size method runs
relative to the fixed sample size.
So sometimes it's 2 or 3x, even.
So that doesn't sound that great.
But part of the reason for
that is because we favorably biased
the fixed sample size calculation.
We fed into it exactly what
the correct effect would be, and
therefore what the correct sample size would be.
What I want to draw your attention
to instead is the solid line,
which is what you get if you
underestimate the MDE by 50%,
and then compute a fixed sample size for
that.
And what that does is it makes you
compute a much longer fixed sample size.
As a result, when you're adaptive,
you do much better.
And then what you're seeing is that more
often than not, you stop in a time that's
much shorter than what the fixed
sample size would've been.
So, in general, kind of this is like
a tip of the iceberg that shows you that
you're able to get kind of really good
detection of true positives in good time.
But still control false-positive
probabilities despite the fact that you're
being allowed to continuously
monitor the experiment, okay?
So kind of the high level takeaway
from all the technical discussion,
is that we're developing a procedure
to change the p-values so
that you're allowed to continuously
monitor the tests, still get really good
detection of true positives and yet
control your false-positives, okay?
I'm going to wrap up because
I'm nearly out of time here.
Just by mentioning one other thing
that I think is important to deal with
which is what's called
multiple hypothesis testing.
And the point is just that if you remember
that dashboard that I showed you,
if there's really no
difference in any cell, right?
Then you can ask yourself a question like,
what fraction of the cells that I'm
looking at, you know, variations and
metrics, will have p-values less than 5%?
I'll see a green when it's up, a red when
it's down, a color when it's significant,
when the p-value's less than 5%.
But what I'm saying is that if there's
no difference between treatment and
control in any of these cells,
then I should only see about 5%
of them popping up as positive.
But if I'm running a lot of variations
looking at a lot of metrics,
I'll have a lot of cells in my dashboard.
Imagine I have 100 cells in my dashboard,
this is saying 5 of those cells
are going to pop up as being significant
just due to random chance alone.
Even if there are no differences.
Now, our vision is drawn to
the significant results, right, and
we're coloring them green and red.
And, that's a real problem, because what
it'll mean is, you're getting a lot of
false positives and then drawing all
your attention into the false positives.
Instead, what you typically want to do
is you want to be a little bit more
conservative when you run
a lot of hypothesis tests,
to make sure these false positives
are not polluting your dashboard.
So that when your attention is drawn to
the green and red, really you don't have
a lot of false positives mixed
in with the greens and the reds.
And this is what multiple
hypothesis testing is.
I'm going to skip this slide.
I just want to say kind of in words that
in addition to what we did to control
peeking in the Optimizely dashboard,
we actually implemented correction for
multiple hypothesis testing.
So that people are able to draw more
meaningful inferences from these
environments where they run many
variations and many metrics.
Okay, so like I said,
this is only the tip of the iceberg of
the stuff that there is to learn here.
I teach a class in the fall
called Small Data, MS&E 226,
which really the reason it's called Small
Data is it's everything except big data.
And the point of that class is really
to give you the conceptual foundations
to help you think about what to do in this
kind of brave new world where A/B testing,
other tools like it are democratizing
access to statistics and
scientific inquiry in
virtually every industry.
And so what you really are going to
need to do in that environment is become
an effective conceptual communicator of
exactly what the data is really saying
within the organization, and
also you're going to need to set norms for
how the organization uses data so
that you don't make poor decisions.
Kind of like what I've
been talking about here.
Lots of other things that we have done
in this regard, but I'll stop there, and
turn it back to Marsha.
>> Great, thank you very much,
Prof. Johari.
That was a perfect introduction to A/B
testing, a lot of extra detail as well.
So we've got some questions in, we still
have time if you want to send any others
and I'll hand back to Ramesh so
he can tackle some of these.
>> All right, so I think just given
the time I want to make sure I
take kind of a breadth of questions here.
The one question I want to try to get
into, there's a few different questions
that ask about kind of the relationship
between adaptive and fixed sample size.
When would you choose one over the other?
Now thinking about it at a high level,
one thing that I want to point out is that
you're basically always better off
being adaptive at some level, or
you shouldn't be any worse off, right?
Because one thing you could do when you're
adaptive is just wait until whatever
the fixed sample size would've been and
stop there, and then look at the data.
So you're kind of strictly increasing
the range of things you can do.
You would imagine that
shouldn't leave you worse off.
What makes you worse off
is not that you became adaptive,
it's because the method by
which we choose to stop or
declare significance once we
become adaptive is sort of flawed.
So I guess what I would say is that in
general, our goal should be to be building
platforms which allow users
to become adaptive, and
do at least as well as they would've done
using fixed sample size methodology.
In terms of how to choose,
really the point is that we live
in an age of real-time data.
And I think in the age of real-time data,
I think adaptivity is going
to be the wave of the future.
There's no reason to commit,
so it's kind of funny,
I've come full circle here from when I was
asking my colleagues and they're all like,
just read the stats book, it'll tell you
to commit to the sample size in advance.
I guess what I'm saying is we
need a different vision on that.
In the 21st century,
the stats book should lead by saying
imagine your data is coming in,
in real time and you want to choose how
long you run your experiment adaptively.
Couple other things I want to mention.
Someone asked about why not
use a Bayesian approach, and
I think this is a great question.
It would take a while to kind of set
up the background to what it means
to be Bayesian here.
At a high level, what it would mean is
that I'd start with some belief about what
I think I will see in the experiment,
that's called a prior.
I get the data, and I update my prior
belief to what's called a posterior
belief, which sort of is a combination
of what I believed when the experiment
started plus the data I received.
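For concreteness, here is a minimal sketch of that prior-to-posterior update for a single conversion rate, using a conjugate Beta prior; all numbers are made up for illustration:

```python
# Bayesian update for a conversion rate: start from a Beta prior, observe
# conversions, and the posterior is again a Beta blending prior and data.
from scipy.stats import beta

prior_a, prior_b = 10, 90            # prior belief centered near a 10% rate
conversions, visitors = 130, 1000    # the data the experiment produced

posterior = beta(prior_a + conversions, prior_b + (visitors - conversions))
print(f"posterior mean conversion rate: {posterior.mean():.3f}")
print(f"95% credible interval: {posterior.interval(0.95)}")
```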
What I want to point out is that there
isn't really much of a difference between
being Bayesian or not, in at least
the method that we implemented.
One of the ways that this method works is
it requires you to sort of average over
all possible values for
that difference between the treatment and
control that you think could be observed.
And that averaging overall of them,
it turns out that, that has a very close
connection to this notion of a prior
distribution in Bayesian statistics.
So I'll kind of leave that as a teaser.
I do spend at least one lecture in my
course in the Fall talking about the role
of Bayesian inference and how it contrasts
with things that are done in practice.
I think that I would also add to
this that my understanding and
kind of my own personal belief is that
practitioners are woefully undertrained
in Bayesian methods in
modern data science.
I think lots of organizations would
benefit from using all the past data they
have to build prior beliefs.
And that hasn't really been happening
as extensively as I would like.
So this is a major area of
research emphasis for me.
And then, finally,
the other thing I want to mention here is,
just this is totally self-serving.
But somebody asked me to briefly
describe how the correction of multiple
hypothesis testing works.
Which is great; as an academic,
you never love anything more than when you
get a question that lets you continue your talk.
So that's what I'll do.
[LAUGH] So on multiple hypothesis testing,
kind of the main point is this, right?
Let's ask, what is the problem
with getting 5 out of 100?
Let's say there's 100 cells in that matrix
where I have variations and metrics.
And of those,
there's no difference in any of them.
And I declare five of
them to be significant.
The problem really is imagine
what you'll show your manager.
They'll want to see the dashboard for
the test.
In that dashboard they're going to see
five things that look rather clean.
Their eyes will be drawn to that.
They'll want to spend a lot of
time understanding what happened.
So what's happening is we're spending
all our time zooming in on these
cells that look significant.
So what we're trying to do when we apply
multiple hypothesis testing correction
is make sure that, within
the cells that look significant,
we don't have too many false positives.
So this is what's called
the false discovery proportion.
Which is we want the fraction of false
positives among those that we declare to
be significant to be bounded by something
small, to be, say, something less than 5%.
And the control of that on average
is called the false discovery rate.
Basically what the procedure amounts
to is declaring significant those
things that sort of, generally speaking,
have smaller p values.
So p-values that are larger and
closer to the cutoff will be less
likely to be declared significant.
The details of it, it's called
the Benjamini-Hochberg procedure, and
that's something that you can
kind of learn more about.
It's a very simple procedure
using the output p-values.
It's something which even the Wikipedia
page does a pretty good job of describing.
And I talk about it a little
more in 226 as well.
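For concreteness, here is a small sketch of the Benjamini-Hochberg procedure just mentioned; the p-values are illustrative:

```python
# Given the p-values from all the cells in the dashboard, decide which ones
# to flag so that the expected fraction of false positives among the flagged
# cells (the false discovery rate) is at most q.
import numpy as np

def benjamini_hochberg(p_values, q=0.05):
    p = np.asarray(p_values)
    m = len(p)
    order = np.argsort(p)
    # Largest k such that the k-th smallest p-value <= (k/m) * q.
    thresholds = (np.arange(1, m + 1) / m) * q
    below = p[order] <= thresholds
    if not below.any():
        return np.zeros(m, dtype=bool)
    k = np.max(np.nonzero(below)[0])      # index of the last passing p-value
    rejected = np.zeros(m, dtype=bool)
    rejected[order[:k + 1]] = True        # flag everything up to that one
    return rejected

p_values = [0.001, 0.008, 0.039, 0.041, 0.20, 0.74, 0.90]
print(benjamini_hochberg(p_values, q=0.05))
```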
Okay and then a couple of other things,
one question about computational issues.
So, and actually there were a couple
of questions about this,
what's really cool about the method we
developed is that these are not algorithms,
they are formulas.
So in terms of implementation,
all of the computation of these p-values
can actually be done inside the browser.
It's super fast.
You can actually get the formulas,
if you want, from, like I said,
these technical reports that are online,
either through optimizely or by looking
at the research papers on the topic.
I think, in particular, this is a really
important consideration in the world of
streaming data,
when data's coming in real time.
Any dashboard that we build for
people to continuously monitor experiments
does have to be able to work
with data in real time and
deliver sort of the computational
results in real time.
One advantage of what we've developed is
that it's as easy as kind of classical
p-values which can also be easily
computed in a streaming format.
I do think that as we move to more complex
methods, and this will include
some methods like multi-armed
bandit algorithms,
the computational issues do
really come to the fore, and
I do think this is an interesting
area of research right now.
There's an intersection between
designing the user interface,
designing the computational methods
that relay statistics to the user,
and then actually processing the data
coming in in real time on the backend.
So you've got infrastructure,
you've got algorithms,
you've got statistics that are all coming
together to deliver these tools for
people to make their decisions from data.
Let's see.
Maybe one more question?
Does that sound good?
Or, I wonder how much more time we have.
>> [INAUDIBLE] keep going.
>> You want to keep going?
Okay, cool.
So I'll take a few more, then.
So, wow there's some really
good questions here.
So I guess one of the things
I want to talk about is
a question that asks, what's
the best way to conclude early on that a test
doesn't really
have any difference, right?
So it's basically about looking
at the data coming in in real time.
I took a leave from Stanford to spend
time with data science teams in industry,
and one question I used to love was,
how long do I have to wait to
get statistical significance?
I mean just think about what
that question is asking.
It's taking all the methods of
fixed sample size testing and
completely blowing them up because you're
going back to the data science team and
asking them regularly like,
how long am I going to have to wait.
What's really going on underneath
that question is, you know,
again, people care about how
long they're running a test.
They want to know how likely is it,
based on the data that they've received,
that there really isn't any difference
here, even if they were to wait longer.
I mean, I guess my answer to the question,
what's the best way to know,
is to use a method
like the one that we proposed.
It's not the only method of
this sort that could work.
But it's certainly a convenient
one because, like I said, it reduces
to a simple formula.
The reason a method like that is necessary
is because as soon as you ask,
how can I know early on, as the data's
coming in, whether or not there
will eventually be a real difference,
whether there is a real difference,
then you're being adaptive.
That's an adaptive question,
a question that says, I care about
making decisions as the data's coming in.
And as soon as you're being adaptive,
you need to put in some guard rails
to prevent yourself from
spuriously deciding that there are
significant results early on, just because
of some random variation in the data.
Like that A/A test that I showed you.
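To see why those guard rails matter, here is a small simulation sketch, purely illustrative and not from the talk, of an A/A test where both arms share the same true conversion rate but you peek at a naive fixed-sample z-test after every batch of visitors and stop the first time it dips below 0.05.

```python
import numpy as np
from math import sqrt, erf

rng = np.random.default_rng(1)

def normal_two_sided_p(z):
    return 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))

def aa_test_with_peeking(n_batches=100, batch=100, alpha=0.05, rate=0.1):
    """Both arms have the same true conversion rate; return True if peeking
    at a naive z-test after every batch ever declares 'significance'."""
    conv = np.zeros(2)
    n = np.zeros(2)
    for _ in range(n_batches):
        for arm in (0, 1):
            conv[arm] += rng.binomial(batch, rate)  # same rate in both arms
            n[arm] += batch
        pooled = conv.sum() / n.sum()
        se = sqrt(pooled * (1 - pooled) * (1 / n[0] + 1 / n[1]))
        if se > 0:
            z = (conv[1] / n[1] - conv[0] / n[0]) / se
            if normal_two_sided_p(z) < alpha:
                return True   # stopped early on what must be a false positive
    return False

runs = 200
hits = sum(aa_test_with_peeking() for _ in range(runs))
print(f"{hits}/{runs} A/A tests spuriously declared significant under peeking")
```

The spurious-detection rate ends up well above the nominal 5%, which is exactly the failure mode that adaptive, always-valid methods are built to prevent.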
I think the coda
on the A/A testing is worthwhile.
One of the things that made me so
excited to work on this project wasn't
just that it was
technically interesting,
it was cool to combine stats with
a platform that actually got built.
So I should add, what I've just described
is in Optimizely's core
platform, used by all its customers.
So today it's been used
by tens of thousands of customers to run
hundreds of thousands of experiments.
So it's really been pretty wide-ranging.
And beyond that, what's been great is,
because it's publicly known,
it's also something that many other
tech companies have either tried out or
implemented internally for
their own A/B testing platforms.
So I think one of the things that's
important to realize about that is that
what we're trying to
do in that environment
is really to change the question.
Instead of asking, what's the best way to conclude
whether there's going to be a difference
or not, you just do what's natural.
Don't think about it, right?
So if you use the p-values that
are coming out of this dashboard and
do what's natural, which is that when
the p-value's small you decide to stop,
you will actually be making
decisions in a near-optimal way.
You'll be making the tradeoffs in a good way.
So I think that's the job of
those of us who build the platforms.
It's to create an environment where
users don't have to [INAUDIBLE]
about a very difficult technical question,
like that question of whether or
not I should stop or [INAUDIBLE] And
let me see, there was one other question.
A little bit.
I think there's one on bandits, and
that's going to be the one I'll stop with.
>> Sorry.
>> Stop on the next page.
So maybe I'll just
talk briefly about this.
So there's a question about
multi-armed bandits, and
I'm guessing many of you have heard about
bandit algorithms as another kind of
technique for making decisions over time.
There's a lot of value
to bandit algorithms.
Really what they amount to,
when it comes to experimentation,
is what's called adaptive reallocation.
So in the experiments
that I was talking about,
we committed in advance,
at least, to the fraction
of incoming individuals who would be
allocated to treatment and control.
And in a bandit algorithm, what happens
is you adaptively change whether
someone gets allocated
to treatment or control.
A canonical example would be
something like headline testing at
the New York Times.
If I have an article and
I have two headlines for that article,
when I'm testing out the headlines and
I discover one is working better
than the other, then in real time,
I might want to shift more people over
to the headline that's working better.
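As a concrete, hypothetical illustration of adaptive reallocation, here is a minimal Thompson-sampling sketch for two headlines; the click rates are made up, and this isn't meant to describe any real headline-testing system.

```python
import numpy as np

rng = np.random.default_rng(2)
true_click_rates = [0.04, 0.06]    # hypothetical: headline B is better
alpha = np.ones(2)                 # Beta(1, 1) prior on each headline's rate
beta = np.ones(2)
shown = np.zeros(2, dtype=int)

for visitor in range(20_000):
    # Sample a plausible click rate for each headline from its posterior
    # and show the headline whose sample is highest (Thompson sampling).
    arm = int(np.argmax(rng.beta(alpha, beta)))
    clicked = rng.random() < true_click_rates[arm]
    alpha[arm] += clicked          # update only the chosen arm's posterior
    beta[arm] += 1 - clicked
    shown[arm] += 1

print("Impressions per headline:", shown)
# Most traffic ends up on the better headline, so the losing headline
# gets few samples, which is exactly what makes later inference hard.
```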
One of the problems with this is that,
while that's great when what you care
about is finding the best thing,
it can be problematic if what you care
about is inference, which means really
understanding why certain things are
performing better or worse than others.
Because, as you can imagine,
what's happening here is
I'm pulling samples away from
the lesser-performing arms.
At the end, what that means is
it'll be harder for me to determine why
the lesser-performing headline
did as badly as it did.
Was it the wording,
was it the punctuation?
Was it the font?
There are lots of different reasons
why it might have performed badly, and
I won't be able to disentangle any of
these effects using my knowledge of
the types of people that have seen it and
whether or not they clicked on it.
So inference is the procedure
of understanding.
And bandits are a procedure
of optimization.
What I love about modern data
science is that it's kind of
equal parts understanding and
business optimization.
What I am often fond of saying, and
maybe this is the last
comment I'll leave you with,
is that people often want to
be optimizers prospectively.
So when you start
an experiment, people say,
just find me the best thing and do it.
But they often want
understanding retrospectively.
And that's because when you find
them the best thing and do it, and
it's not what they thought would be the
best thing, then they want to know why.
So a VP comes to you and says, hey,
I think this is going to work well,
go find the best thing and do it.
You run the test,
you find the best thing, and
it's not what the VP said
would be the best thing.
Now they want to know why
it wasn't the best thing.
So one of the challenges, I think,
in business optimization and
inference is exactly to balance
optimizing with understanding.
So that you're able to answer
those questions after the fact,
when they're asked, and yet you know
you made the best decisions possible.
So let me leave it there.
Thank you all.
It was a lot of fun.
And hopefully,
I'll see some of you in 226 in the fall.
>> Okay Ramesh thank you again for
the talk and
also those really helpful detailed answers
and some great practical examples as well.
Thank you to everyone who attended
the webinar and sent in questions.
I hope you'll be able to join us again for
a future one.
Enjoy the rest of your day or,
depending on where you're joining
us from, have a great evening.
Thanks.
