DAVID BLEI: Thanks very much.
Thanks very much for
inviting me to this event.
I'm really excited to
interact with more economists
and think about
economics problems.
My name's Dave Blei.
I'm going to talk
about economics
and probabilistic
machine learning.
John asked me to give an
overview of the type of machine
learning that I
work on, and so I'll
start off with a little
bit of an overview.
But then I want to talk about
some details around machine
learning in economics that I've
been working on lately to give
you a sense of what it might
look like when economics
and machine learning meet.
So I work in
probabilistic modeling.
And modern probabilistic
modeling, I hope to show you,
is an efficient way to
discover meaningful patterns
in massive data.
So this is a picture that
describes traditional machine
learning, how you might
use traditional machine
learning and statistics
to solve a modern problem.
So you get a cookbook of
machine learning ideas.
You get a shoehorn so that
you can get your problem
into the requirements of one of
those machine learning ideas.
And you never need
just one idea,
so you need a lot of
duct tape in order
to create a very ugly but
interesting thing that spits
out answers to your question.
Probabilistic machine
learning is more about
doing a tinkertoy
type thing where
you build a tailored procedure
for the particular problem
that you have at hand.
And the advantages of
probabilistic machine learning,
building these tailored
models to the problem
that you care about, include
composing and connecting
reusable pieces--
which I'll show you
some a little later--
developing methods that
are driven by your domain
expertise and discipline
knowledge and the particular
questions that you have.
Probabilistic machine
learning has focused recently
on large scale data, both in
terms of high dimensional data
and number of data points.
Though it's not
exclusively about this,
there's been a focus on using
these kinds of techniques
to discover and use structure
in unstructured data.
So you have a big
collection of data
that you think has some
hidden structure inside,
and you can use probabilistic
machine learning to discover
what that structure might be.
And in probabilistic
machine learning,
we've focused also
on things like
exploratory and observational
and causal analyses.
So I think this is one
area in machine learning
that could be very useful
to economics problems.
OK.
I'm not trying to
sell you on it.
The cookbook shoehorn duct
tape approach is good.
There's many software
packages available.
It's typically fast and
scalable in a good way
to get a quick first answer.
Probabilistic
machine learning can
be more challenging to
implement-- that's something
that we're working on
in that community--
and may not be fast or scalable,
but that also is something
that we're working
on in this community.
The way I think about
this kind of problem
is with this
probabilistic pipeline.
It's kind of almost a goofy
flow chart, where you start out
with your knowledge
about a discipline
and a question you
want to answer,
use it to make assumptions
about your data.
This is a graphical model, which
is basically a joint probability
distribution of hidden
and observed variables.
You imagine the hidden variables
that live in your data,
and how they interact to
form the observations.
Then, from your data and
the assumptions
you make together, you
discover the patterns that
live in your data that
you're interested in.
And finally, you use
those discovered patterns
to form predictions,
explore your data,
do whatever it is
you are trying to do.
So just some examples
from my research group
to give you a flavor of
this type of research.
So this is a population
genetics analysis
of 2 billion genetic
measurements.
This is actually the
Pritchard, Stephens, and Donnelly
model of population genetics.
That's a model that came out of
the University of Chicago here,
or I think so.
It analyzes our genomes
to understand how
we mix ancestral populations.
Here's another example of
overlapping communities
discovered in a large network.
So this is a 3.7
million node network.
You can imagine you
might have a network
of social connections
or a network of patents
citing each other.
And you want to uncover
overlapping communities.
So in a social
network, for example,
you might know some people
from your neighborhood.
You might know some
people from work.
You might know some
people from your family.
And you want to identify how
each person in the network
exhibits these
overlapping communities.
And this is an example
of uncovering overlapping
communities in a big network.
I've done a lot of work
on analyzing texts, also
with John, with John Lafferty.
This is an example of topics
found in 1.8 million articles
from The New York Times.
So here each little
piece is a topic.
It's a group of words
associated under a single theme.
So here we have art, museum,
gallery, artists, street,
paintings, exhibition.
And the way that
this picture was made
was, we ingest 1.8 million
articles from The New York
Times and do some
machine learning
and uncover these patterns
of word co-occurrences
that seem to persevere
in that collection,
and then visualize them.
Here's a neuroscience
analysis of 220 million
fMRI measurements.
So here this is another kind
of super high dimensional data.
This is fMRI data.
And we're finding significant
regions of brain activity
as they connect to
different stimuli
that the person is experiencing
in an fMRI machine.
More recently,
we've been working
with a historian who has two
million diplomatic cables.
So this is a historian at
Columbia, Matt Connelly.
He's got two million cables that
were sent between US embassies
in the 1970s.
So this is I think a
good example of this kind
of tailored type of problem.
And Matt wants to uncover events
in these two million cables.
So each cable has a
sender, a receiver.
And there's two million
of them, so even
though Matt's a
historian, he can't
read all two million cables.
He wants to find what
significant events happened
and what cables are
associated with those events.
He also doesn't have
a catalog of events.
He wants to discover
that from these cables.
And so we built a model
based on topic models,
but also based on
other assumptions
that we want to make
about how history unfolds
and things like that to
uncover those events,
and then to lead the historian
to his primary sources that
make sense around those events.
And then finally, these are
item characteristics learned
from 5.6 million purchases.
So I'm going to go back to this
picture later in this talk,
but this is joint
work with Susan Athey
and with my postdoc Fran Ruiz.
And what we're doing
is, we are ingesting
massive amounts of people
shopping at a grocery store.
And we are learning the
latent characteristics
of each item in
the grocery store
and how they relate
to each other.
So in this particular
segment of the-- oh no,
segment's a real word for you.
In this particular
piece of this larger
map of items and
their characteristics,
you can see Mexican ingredients,
Mexican food ingredients, all
together.
And we'll go back
to this picture
later and discuss it in detail.
So that's a bunch of examples
of what kinds of things
you might be able to do with
this little probabilistic
pipeline.
And our perspective is that this
kind of customized or tailored
data analysis is becoming
important to many fields.
And this pipeline separates the
key activities of understanding
what our knowledge
is of the discipline
and what kinds of
assumptions we want
to make based on that
knowledge, doing computation
that reflects the
assumptions we want to make
and can handle the
data that we have,
and then use that computation
to solve whatever problem we're
trying to solve.
And this, I think,
makes it easy to think
about how to solve
problems collaboratively
between people like machine
learners and economists.
Now, we won't talk
about this here,
but what we need to work on
in machine learning then,
if we are going to
live with this picture,
is build a language
for building models out
of assumptions, flexible
and expressive components.
Build algorithms that
can scale to massive data
and are generic algorithms that
can handle many, many models.
So if you're familiar
with Bayesian statistics--
I'm sure most of
you are-- this is
kind of like applied
Bayesian statistics.
And when I say generic
inference algorithms,
I mean algorithms that don't
worry about the functional
form of the model as much as you
would have to if you were going
to commit to exact inference or
something like a [INAUDIBLE].
And the third thing we work on
is to develop new applications,
to stretch probabilistic
modeling into new areas,
and to enrich this
whole framework
through continuing
to solve new problems
and have new challenges.
There's another piece,
an important piece,
which is to criticize the
model and go through this loop.
We want to make assumptions,
uncover patterns,
do what we need to do,
understand what went right
and what went wrong, and
use that understanding
to then revise our theories
and assumptions and proceed.
OK.
So that's my overview.
Took about five minutes.
Today I want to discuss
two threads of research
with Susan Athey's group.
So Susan Athey and I have
been working together
for a couple of years.
It's been a lot of fun.
And I want to just tell
you a little bit about-- we
have some works in progress that
I want to tell you about around
building probabilistic models
to analyze large scale consumer
behavior.
So this is data
where many consumers
are choosing among many items.
So caveat is that
I'm not an economist,
so that means that I will
say things about economics
that might not make sense.
That's my fault.
If I say anything
about economics that does make
sense, that's Susan's fault.
OK.
Also, as I mentioned, this is
joint work with Francisco Ruiz.
OK, so our big vision is that
we have this data, many people
shopping in a big grocery
store, and they're
buying lots of things.
Or it might be many people
clicking on news articles,
or it might be many
people consuming
other kinds of things.
And our vision is that we
want to build a utility
model for baskets of items
where the utility of a basket
has a bunch of terms.
One of these terms,
the first term here,
is the interaction
between those items.
We want something that
might reflect substitute
and complement patterns.
If that made sense,
that's thanks to Susan.
If it didn't make
sense, it's my fault.
Another term might
be user preferences.
So this term, the
shopper term might
reflect the particular shopper
and what the shopper is buying
in the basket and how
that shopper prefers
or doesn't prefer some of
those items in the basket.
And those preferences could
be correlated across items.
There might be an effect
based on prices of items,
of course, and other things
we didn't think about,
and an error term.
OK, so this is our
vision to build up
this utility model of a basket.
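Written out loosely, that vision might look like the following; the symbols and functional forms here are only a sketch of the terms just described, not the model we actually fit.

```latex
% A hedged sketch of the basket utility terms described above (notation illustrative).
U_u(B) \;=\;
  \underbrace{\sum_{i \in B}\;\sum_{j \in B,\, j \neq i} g(\text{item}_i, \text{item}_j)}_{\text{item-item interactions (substitutes/complements)}}
\;+\; \underbrace{\sum_{i \in B} h_u(\text{item}_i)}_{\text{shopper preferences}}
\;+\; \underbrace{\sum_{i \in B} c(\text{price}_i)}_{\text{price effects}}
\;+\; \varepsilon_{uB}
```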
And our goal then is to design,
fit, and check and revise
this model, and eventually to
answer counterfactual questions
about purchase behavior.
And since this is
an overview talk,
I just want to point out
that I'm painting a cartoon,
but in a lot of
machine learning,
these kinds of
counterfactual questions
are not on anyone's radar.
We want to do pure prediction.
And so it's been exciting
to work on a problem
where we have this deeper, more
meaningful type of question
to answer, to try to answer.
So first, I'm going
to talk about a way
to think about substitutes
and complements,
co-purchases in this
market basket data.
And we've been
working on something
called economic embeddings.
In machine learning, a very
newly popular technique
is called word embeddings.
This is a method for
analyzing language.
And word embeddings are
proving to be very powerful.
The idea in a word
embedding is to discover
what's called a distributed
representation of words.
Somehow you have a big
corpus of text documents.
You've got a vocabulary
of say 100,000 words.
You want to identify, what
are the characteristics
of the different words?
That's called the
distributed representation.
This is a famous old
paper by Rumelhart
and others in Cognitive
Science around how
distributed representations
of words might be useful.
And recently in
machine learning,
we've been trying to
discover them ourselves.
With words,
distances in the space
of the feature representation
of the vocabulary
appear to capture
semantic similarity.
So king and queen might be close
to each other in the space.
Peanut and walnut might be close
to each other in this space.
If you look it up
on the internet,
you'll find 10,000 papers
about word embeddings,
each one a little different
and a little the same.
But they each reflect
the same main ideas.
One, we place words in a
low-dimensional latent space.
We're going to discover
these characteristics.
We're not going to know in
advance that king and queen are
close to each other.
And two, a word's probability
in an underlying probability
model of language is going
to depend on its distance
to other words in its context.
So if you see the word queen
in the context of other words,
then the probability
of queen depends
on the distance of that
word to the other words'
representations.
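As a minimal sketch of that idea, in my own simplified notation rather than any particular published embedding: the score for a word given its context is a dot product between the word's vector and the sum of the context words' vectors, pushed through a softmax over the vocabulary.

```python
import numpy as np

# Toy word-embedding conditional: p(word | context) from dot products (illustrative only).
rng = np.random.default_rng(0)
V, K = 1000, 50                     # hypothetical vocabulary size and embedding dimension
rho = rng.normal(size=(V, K))       # one embedding vector per word
alpha = rng.normal(size=(V, K))     # one context vector per word

def p_word_given_context(word_id, context_ids):
    """Softmax probability of a word given its surrounding context words."""
    context_sum = alpha[context_ids].sum(axis=0)
    scores = rho @ context_sum                    # one score per vocabulary word
    scores -= scores.max()                        # for numerical stability
    probs = np.exp(scores) / np.exp(scores).sum()
    return probs[word_id]

print(p_word_given_context(word_id=7, context_ids=[3, 12, 99]))
```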
So what we've been
trying to do in my group
is generalize this idea
to other types of data.
And we've been
developing something
called exponential family
embeddings for this purpose.
This is kind of like the way you
might go from linear regression
to generalized linear
models, from a word embedding
to an exponential
family embedding.
And it uses exponential families
and generalized linear models.
And we've been
exploring this idea
in a lot of different
contexts, like neuroscience,
recommendation systems, network
analysis, and with Susan,
shopping baskets.
So here's some neuroscience
and social network examples
that we've been exploring.
My talk is long, so I'm
going to not dwell on it.
So let's get right to
discussing embeddings
in an economic context.
To give you the idea, let's
imagine a vacation town deli.
So this deli has
six items in it.
That's all it sells.
Jam, Skippy peanut butter,
Smucker's natural peanut
butter-- that's my preferred
peanut butter-- Coke-- no, Diet
Coke-- bread, and pizza.
And people are going
to walk into this deli,
and they're either going
to buy pizza and soda,
or they're going to buy
peanut butter, jam, and bread.
Important, customers
are only going
to buy one type of
peanut butter at a time.
Nobody's going to buy two jars
of different peanut butters.
I hope you agree with that.
Items bought together
are co-purchases.
And the peanut butters
are substitutes.
So if Skippy was out of stock,
I would likely buy Smucker's.
I'm ignoring a lot of issues
like formal definitions,
price, causality,
things like that.
But I want to capture
this purchase behavior.
I want to somehow take
data from this deli
and capture these facts.
So what we're going to do,
inspired by embeddings,
is we're going to endow each
item with two unknown locations
in some real valued space.
One of those is called
an embedding rho.
We're also going to call it
an interaction coefficient.
You'll see why in a second.
And the other is called
a context vector alpha.
We'll also call
that an attribute.
You'll see why as well.
And the idea is that the
conditional probability
of each item in
the basket depends
on its embedding, rho i.
So here's the conditional
probability of how many times
I'm going to buy this item given
everything else in the basket.
And it's going to depend
on its embedding, rho
i, and the context
embeddings, or the attributes
of the other items
in the basket.
So again, it's important.
I'll keep repeating it.
The alpha location, these are
latent product attributes,
characteristics of
the other products.
And rho indicates how product i
interacts with other products'
attributes.
So there's some attribute
that's jamminess and breadiness,
and peanut butter likes to be
bought with jammy and bready
types of things.
There's a pizzaness
type of attribute,
and soda likes to be
bought with pizza.
That's the idea, and this
comes from a Poisson.
These are all counts.
And it's actually
not a real model,
but don't worry about it,
and don't ask me about it.
It is a real utility
function though.
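To make that concrete, here is a tiny numerical sketch of the deli conditional, assuming a Poisson rate of the form exp(rho_i dot the sum of the other items' alphas); the two-dimensional vectors are made-up numbers for the cartoon, not fitted values.

```python
import numpy as np

# Toy deli: interaction coefficients (rho) and attributes (alpha) in two dimensions.
rho = {"skippy_pb": np.array([1.0, -0.5])}
alpha = {"jam":   np.array([0.8, 0.1]),
         "bread": np.array([0.9, 0.0]),
         "pizza": np.array([-0.2, 1.0])}

def poisson_rate(item, other_items):
    """Rate for how many of `item` I buy, given the other items in the basket."""
    context = sum(alpha[j] for j in other_items)
    return float(np.exp(rho[item] @ context))

# Peanut butter is likelier in a jam-and-bread basket than in a pizza basket.
print(poisson_rate("skippy_pb", ["jam", "bread"]))   # about 5.2
print(poisson_rate("skippy_pb", ["pizza"]))          # about 0.5
```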
So we found a deli like this.
It took us a long time.
We got all their data.
And we fit this little model.
And this is the results.
So this is-- we fit this
model with two dimensions.
And in this plot, the
attributes, alpha,
are red dots.
The interaction coefficients, or
embeddings, rho, are blue dots.
And remember, there's
a dot product here.
So what you want to look at
is the sign of the dot product
between the two pieces.
So let's just stare at this plot.
Here, let's see.
Pizza and soda are never bought
with bread, jam, and peanut
butter, and vice versa.
OK.
So, well let's first
look at what we know
gets bought together.
So bread and jelly get bought
together in our little cartoon.
And you can see that because
these are in the same quadrant.
They're going to
have positive value.
Pizza and soda are never bought
with bread, jam, or peanut
butter.
So here's soda and pizza.
And determining if I'm
going to buy soda--
let's say I have
jelly in my basket--
determining if I'm
going to buy soda,
I look at the red dot for jelly
because that's its context
vector.
And you can see that
they cross the axes,
and it's going to be
a negative number.
And so pizza and
soda is never going
to be bought with--
pizza is never
going to be bought with jelly.
So this-- if you think about
all those combinations,
you can sort of see that
the characteristics we just
described are depicted in this.
And we obtain this picture
by fitting that model.
We just generated data.
I lied when I said
we found a real deli.
We just generated
data from this deli,
and then we fit that model,
and that gave us this picture.
Bread, jam, and peanut butter
are all bought together.
Now let's look at
the peanut butters.
This is the important one.
Peanut butter one is never
bought with peanut butter two.
Right?
Peanut butter one is bought
with jelly and bread.
The dot product
here is positive,
even though it crosses an axis.
But peanut butter one is never
bought with peanut butter two.
You can see that because peanut
butter one is in this quadrant,
and peanut butter
two is up there.
Its red dot is up there.
Peanut butter two is never
bought with peanut butter one.
So somehow that's capturing--
but not entirely-- the fact
that they are substitutes.
But more important,
or equally important,
is that peanut butter one
is bought with similar items
as peanut butter two.
So these two-- the
blue dot tells you
when you're going to buy it.
And these two blue dots
are close together.
That means that they induce
the distribution of these--
and the red dots are
close together too.
So they induce similar
distributions of other items.
And so what we're
going to do is,
we're going to play this
game with that big data set
in a second.
OK.
So more generally, just
to give you a sketch
of the general idea here.
These exponential family
embeddings, the goal
is to discover a useful
representation of data.
And the way we notate that,
we have n observations.
xi is a vector, a d-vector.
And these might be language,
neurons, social networks,
or shopping, where we have
items and baskets and the number
of times they were purchased.
And when you set up an
exponential family embedding,
you set up three ingredients.
The context, the exponential
family you're going to use,
and an embedding structure.
And so the idea--
this is a little--
if you're not familiar, this
is a little graphical model
where nodes are random variables
and edges are dependencies
between random variables.
This just describes sometimes
a joint distribution.
But the idea here
in this picture
is xi is my observed data point.
So xi, for example, might be,
did I buy Oreos on this trip.
Rho sub i is the
embedding for xi.
So in this case, that's
going to be the embedding
vector for Oreos,
so the interaction
coefficients for Oreos.
And you can see the red
highlighted observations
are the context of xi.
Those are the other
items in my basket.
That's the context
of me buying Oreos.
Maybe I'm buying Oreos
with milk and with a Coke.
And alpha j for
each of those items,
those are the context
vectors for Oreos.
And what we're
going to do is we're
going to model each data
point condition on its context
and on these latent
variables, which
are the embeddings and
the context vectors.
Now, here's a picture
of the context.
So I already explained
it, but the idea
is that I'm going to
model the black node
condition on the
red node, and then
I'm going to move
this window over.
I'm going to model
the second black node
condition on those red nodes.
I'm going to model
the third black node
condition on the red nodes.
And you can see that each
item plays two roles.
It is modeled, but
it's also appearing
in the context of other items.
And so you can
define these contexts
in a lot of different settings.
In language, we model
words conditioned
on the words around them.
In neuroscience, we
model neural activity
and some scientific
data conditioned
on the other
neurons around them.
In a network, we model
an edge conditioned
on the other edges
of the two nodes
that we're trying to discuss,
trying to contemplate.
And in shopping, we
model a purchased item
given the other item
counts on the same trip.
This is all the
idea of a context.
Second, as I mentioned,
we want to define
the conditional
exponential family.
So in neuroscience, for
example, that's real value data.
And the conditional exponential
family is a Gaussian.
Here, in shopping data,
we're going to use a Poisson.
And the idea is that the
data I'm modeling conditioned
on its context comes from
an exponential family whose
parameter is a function of
the value of the observations
in the context and these two
types of vectors, the embedding
and the context vector.
Or, in our language
here, the interaction
coefficients and the attributes.
And in particular--
and you can see
if you're familiar with
generalized linear models--
this is all mirroring
the treatment
of generalized linear models.
The natural parameter
for the data point
is going to be a
function of the dot
product of the embedding,
or the interaction
coefficients, and the context
vectors, or the attributes.
Put in plain language,
if I'm buying
Oreos, that depends on what
Oreos likes to get bought with
and what else is in my basket.
If I have carrots and
celery in my basket,
maybe I'm not buying Oreos.
But if I have milk and
Diet Coke in my basket,
maybe I am buying Oreos.
Finally, there is the
embedding structure.
The embedding
structure determines
how these parameters are shared
so that we can learn something
about the items.
And this is kind
of a technicality,
but the idea is that
I'm going to-- my data
are purchase counts at
different shopping trips.
And I want to make sure that
I learn something about Oreos
across shopping trips.
So I'm going to share
rho and alpha for Oreos,
no matter what
shopping trip it is.
And when we set that up,
it just looks like this.
The same latent item
characteristics and latent
interaction
coefficients are being
used across shopping trips
for the same item, Oreos
for example.
And that's also how
we learn something
about neurons or about
a node in a graph
or about a word in a
language modeling task.
We fit this model,
modeling each data point
conditional on the others.
And we combine these
ingredients in what's
called a pseudolikelihood.
But for us, it's basically
just this utility function
where we have the log
probability of each observed
item conditioned on its
context plus some regularizers
for the two different
types of vectors.
And we fit this with
stochastic optimization, where
exponential families-- the fact
that we use exponential families--
makes these gradients simpler.
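As a rough sketch of what that objective looks like for one basket, under a Poisson conditional with an exponential link; the regularizers and the zero down-weighting are left out, and this is my simplification rather than the actual fitting code.

```python
import numpy as np

rng = np.random.default_rng(0)
n_items, K = 200, 20
rho = 0.01 * rng.normal(size=(n_items, K))     # interaction coefficients
alpha = 0.01 * rng.normal(size=(n_items, K))   # latent attributes / context vectors

def basket_pseudo_loglik(counts):
    """Sum over purchased items of log Poisson(count | rate from the rest of the basket)."""
    bought = np.flatnonzero(counts)
    total_alpha = alpha[bought].sum(axis=0)
    ll = 0.0
    for i in bought:
        context = total_alpha - alpha[i]          # everything else in the basket
        rate = np.exp(rho[i] @ context)
        ll += counts[i] * np.log(rate) - rate     # Poisson log-density, up to a constant
    return ll

# Stochastic optimization would repeatedly sample a basket, take the (noisy) gradient
# of this term plus regularizers, and update rho and alpha.
toy_basket = np.zeros(n_items)
toy_basket[[3, 17, 42]] = 1
print(basket_pseudo_loglik(toy_basket))
```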
Just a high level aside.
Stefan mentioned
machine learning
being successful in
many different spheres.
And it has been.
And I think that is
due in large part
to stochastic optimization.
So when you take
the gradient, this
resembles a collection
of GLM likelihoods.
Doesn't matter.
But part of our
goals in this work
is to make a connection
between these ideas
in neural networks and
probabilistic modeling.
And the stochastic gradients
of this pseudolikelihood
connect to neural net ideas
like negative sampling.
So if you get interested
in this sphere of ideas,
you'll read about
negative sampling
in all of the word
embedding papers.
And that looks something
like a stochastic gradient
of this type of objective.
Let me show you what we can do
with this on some real data.
So we have purchase counts
of items in shopping trips
at a very large grocery store.
And we're going to
fit this in two ways.
In one way we're going
to look at categories.
So we're going to replace
each item with its category.
And there are 478 categories.
This is a data set of 6.8
million purchases, 635,000
baskets.
And we're also going to fit
this at the item level, where
we have 5,675 different items,
620,000 baskets, and
5.6 million purchases.
Again, I know I've
mentioned this.
Here the context is other items
purchased in the same trip.
The structure is that the
embeddings for each item
are shared across trips.
And the conditional exponential
family is a Poisson.
And we also down
weight the zeros.
And we can talk about
that at the end.
There's an open question
there, but it's a good idea
to down weight the zeros.
Again, recall the conditional
probability is Poisson.
And that we think of
alpha as reflecting
the attributes of item
i and rho as reflecting
the interaction of item i with
attributes of other items.
And so what we're
going to do is we're
going to take this data--
620,000 shopping trips-- fit
that model that I
just described to you
with stochastic optimization,
sampling trips repeatedly
and getting noisy gradients and
then fitting the vectors of all
of these items.
And we're going to
then visualize--
I think we used 100
dimensions-- but we're
going to visualize alpha in two
dimensions using a dimension
reduction technique called t-SNE
to show you what it looks like.
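The visualization step is just off-the-shelf t-SNE applied to the fitted attribute vectors; a minimal sketch with scikit-learn, where the random `alphas` matrix stands in for the fitted 100-dimensional item attributes.

```python
import numpy as np
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

# Stand-in for the fitted attributes: one 100-dimensional alpha per item.
alphas = np.random.default_rng(0).normal(size=(5675, 100))

# Reduce the 100 dimensions to 2 for plotting.
coords = TSNE(n_components=2, init="pca", random_state=0).fit_transform(alphas)

plt.scatter(coords[:, 0], coords[:, 1], s=2)
plt.title("t-SNE map of item attributes (alpha)")
plt.show()
```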
Here's a 2D representation
of category attributes.
So earlier with the peanut
butter and the pizza,
I plotted the actual
values because I
had a two-dimensional
representation.
But now we have 100 dimensions.
And so this is a reduced
dimension visualization
of those 100 dimensions.
And here are the category
attributes, alpha i.
And you can look at
different pieces of this map,
and you can see things
like baking ingredients
all going together.
So this is like zooming in
on one part of that big plot.
And you can see that
things you would
use for baking all go together.
And here, things that you need
for babies all go together.
So that's ingesting
market baskets,
running this algorithm,
getting alphas, the latent item
characteristics, without
any prior knowledge,
and then projecting it
down into two dimensions
using this t-SNE algorithm.
Now here is that same
procedure, but applied to items.
So now I've got however
many it was, 5,000 items.
And again, I got 100 dimensions.
And again, I projected it
down to two dimensions,
and we get this map.
And in this map, the colors are
the categories of the items,
just for us to see that
the categories either mix
or don't mix.
That's what the
color represents.
Now again, we can zoom in
on some portions of that map
to see what it's representing.
And you can see things like here
is pasta and sauce together.
In that map, they are
commonly purchased together.
Here is crackers and cheese.
Which I didn't know
went well together,
but thanks to the model, now
I'm a big crackers and cheese
eater.
And here is the
Mexican food section,
where we have shredded
cheese and tortilla
chips and tortillas and salsa
all together on the map.
Because this is a conference
about machine learning
and economics, and in a
machine learning talk we need
slides like this that
say we do the best.
So we do the best.
OK.
What we did here, I
can explain a tiny bit,
is we want to assess how
well this model is working.
We can look at
Mexican food maps.
And as satisfying
as that is, it's
not a quantitative
evaluation, anyway.
And so what we did was we
took these 600,000 or so
market baskets,
and for each one we
held out a few of the items,
one or two of the items.
And then we used
our model to try
to predict the probability
of the held out items
given the other items
that are in there.
And what we're comparing
here are different dimension
reduction methods.
So additive Poisson embedding
and hierarchical Poisson
factorization.
This is actually the method
I'm going to tell you
about in the second
half of the talk,
but it's better than
that method in this case.
Poisson-type PCA--
that's basically just
PCA with a Poisson
likelihood-- is another kind
of classical matrix
factorization method.
And in all these cases, the
Poisson embedding method
assigns higher probability
to these held out items.
It's a better fit of the data.
Gives us a better
estimate of the data.
Down weighting
zeros is important.
So that's what we learned
at the category level.
At the item level, we see
the same kind of pattern.
Ultimately though, we want
to fit this model in order
to understand something about
patterns of purchase behavior.
And so I'm going to define
some terms, possibly correctly.
Exchangeables.
We're going to call two
items exchangeable if they
have a similar effect on
the purchase of other items.
I'm being a little fast and
loose with causal language,
but if things were
causal, exchangeables
have a similar effect on
the purchase of other items.
Same category items tend
to be exchangeable--
they have a similar effect on
the purchase of other items--
and rarely purchased together.
So this is a subset
of substitutes.
Think about the two
types of peanut butter.
I'm going to buy these
two types of peanut butter
with similar other
items, but I'm never
going to buy these two types
of peanut butter together.
Complements are purchased
or not purchased together.
Like hot dogs and buns.
Rarely will you buy hot dog
buns but not buy hot dogs.
So we want to try to take
those embeddings that we fitted
to the whole dataset
and try to capture
some of these types
of purchase patterns.
And we have a way to do that.
So remember, peanut butter
one and peanut butter two
induce similar distributions
of other items.
You know that because their red
dots are close to each other.
The red dots are
close to each other.
So that means if I have
peanut butter one or peanut
butter two in my basket,
the distribution of what
other stuff I have
in my basket's
the same, no matter which
peanut butter is in there.
They are rarely
purchased together.
Remember, that's
because the blue dot
is far away from the red
dot for the peanut butters.
So let's define the sigmoid
function between two items.
It's loosely the
probability that I'm
going to purchase
one item given that I
purchased the other item.
OK.
And so this is
the sigmoid: sigma ki is
the probability of purchasing
item k given i is in my basket.
I switched from
logistic to Poisson.
We actually used Poisson.
I just didn't update the slides.
But it's all the same
for explaining this.
So that's the
probability that I buy
jam given that peanut
butter is in my basket.
And sigma ki bar is 1
minus that-- the probability
that I don't buy jam given that
peanut butter is in my basket.
The substitute predictor,
properly put in quotations,
looks like this.
And so what this
is, the first term
is something like an
exchangeability measure.
This is a measure of how
similar-- so let's be clear.
There's two items at
play, item i and item j.
And this first term is
the K-L divergence--
a measure of the difference
in distribution--
between whether
or not I buy item
k given that item i is in my
basket, and whether or not I
buy item k given that
item j is in my basket.
So fix another item-- say jam--
and compute the K-L divergence
between the probability that
I buy peanut butter one given
that jam is in my basket
and the probability
that I buy peanut butter two
given that jam is in my basket.
That's this first term.
And I sum over all other items.
So I look at each item,
and I ask, what's
the K-L divergence between
these two distributions?
And that's a measure
of how similar
the distributions of other items
are for each of these items.
That's the first term.
And then the second
term penalizes us
if these two items
are bought together.
So if peanut butter one
and peanut butter two
were bought together,
then that second term
penalizes us for that.
What this wants is that when
I buy peanut butter one,
I am not going to buy
peanut butter two.
The complement predictor is
the negative of the last term.
So this is finding
something like co-purchases.
That when I buy jam,
I'm going to buy pizza.
When I buy jam, I'm
going to buy bread.
That's the complement predictor.
And I gave the non-symmetrized
version of these.
We use the symmetrized versions
of both quantities.
And these quantities generalize
to other exponential families.
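One way to write those two scores down in code, hedged heavily: the exact form and signs below are my reading of the verbal description, with a Bernoulli link used for simplicity and the symmetrization omitted, not the paper's definition.

```python
import numpy as np

def p_buy(k, given, rho, alpha):
    """Rough probability of buying item k given that item `given` is in the basket."""
    return 1.0 / (1.0 + np.exp(-rho[k] @ alpha[given]))   # sigmoid link

def bernoulli_kl(p, q):
    """KL divergence between two Bernoulli distributions."""
    return p * np.log(p / q) + (1 - p) * np.log((1 - p) / (1 - q))

def substitute_score(i, j, rho, alpha, n_items):
    """High when i and j induce similar distributions over other items (small KL)
    but are rarely purchased together (small co-purchase probability)."""
    kl = sum(bernoulli_kl(p_buy(k, i, rho, alpha), p_buy(k, j, rho, alpha))
             for k in range(n_items) if k not in (i, j))
    return -kl - np.log(p_buy(j, i, rho, alpha))

def complement_score(i, j, rho, alpha):
    """High when i and j tend to be purchased together."""
    return np.log(p_buy(j, i, rho, alpha))

rng = np.random.default_rng(0)
rho, alpha = rng.normal(size=(6, 2)), rng.normal(size=(6, 2))
print(substitute_score(1, 2, rho, alpha, n_items=6))
```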
So one thing I'm interested
in doing basically next week
is to ask the question, can
we think about complements
and substitutes in the context
of these other applications
that we've been thinking
about embeddings in?
For example, in language.
You can imagine that a
substitute in language
is a synonym, and a
complement in language
is two words that go
together, like New York.
First let's look at
the category level.
Here are some example
co-purchases
at the category level.
And what you can see is,
things like organic vegetables
and organic fruits go together.
Vegetables and beets, baby
food and diapers, cranberries
and stuffing,
gravy and stuffing,
pie filling and evaporated milk.
Because if you get the
canned pie filling, you need
evaporated milk to make the pie.
Deli cheese, cheese and
crackers, pasta and sauce,
and so on.
Mayonnaise and mustard.
Here are the top 10
potential substitutes
at the category level.
And first, let me explain
something about what
we did with this data.
So here, we moved to
categories precisely
because we didn't want to
deal with the complexity
of substitutes.
And so in this work,
where we're trying
to think about how to
identify possible substitutes,
we engineered our
data in a way where
we took pizza, the
pizza category,
and we artificially
divided it in half.
We said, OK.
Half of these are pizza
type one, and half of these
are pizza type two.
And so hidden in this
category level data
that we analyzed for
this particular table are
substitutes that we know
are actual substitutes,
where we just took
half the pizzas
and called them
one type, and half
and called them the other type.
And then, when you look at that
measure on that fitted model
to that data, you see
that in the top 10,
these engineered substitutes
all rise to the top.
We identified that
pizza one and pizza two
are indeed in the same category.
Bottled water one
and bottled water two
are as well, and so on.
So that's one thing.
But interesting is that even
though we moved to the category
level in order to
avoid this issue,
we see that same
category items are still
in there, like bouquet
and roses, or bouquets
and blooming.
And another one, although
it's lower on the list,
is block ice and beverage ice.
So accounting for this
kind of purchase pattern
could be important even
at the category level.
So now let's go to the UPC
level, to the item level.
So here are example
items that we
co-purchase at the UPC level.
And what you see are
things like yogurts,
different kinds of beans,
different kinds of cat food,
rutabagas and parsnips-- I'm
not sure-- apples and celery.
Here are hamburgers and buns,
mangos and kiwis, and so on.
Here's taco seasoning
and shredded cheese,
Star Magazine and In
Touch magazine, and so on.
I'm not sure about
this table. I think
those maps kind of reflect
this idea a little bit better.
But this is what we get
from that measure, a way
to identify these items
that are purchased together.
But now here are potential
substitutes at the item level.
And I felt that
this was exciting
because this was hard
to get out of that map.
So you see that
potential substitutes are
two different sizes of coffee.
You wouldn't go in and buy
two different sizes of coffee.
If they said they
were out of one size,
maybe you would move
to the other size.
Regular sized and large sized
sandwich, same with the coffee.
Two different kinds of flours,
two different types of sushi,
the Semifreddi's baguette and
the crusty sweet baguette.
Maybe you won't buy two
different types of baguettes.
Two different kinds of gum,
two different kinds of candy
bars, and so on.
Detergents, the same
type of beer in a bottle
and the same type
of beer in a can.
And these two types of salads.
One is called Greek
salad, and the other
is called Neptune salad.
We can assume they
are both Greek salads.
So this is the kinds of
potential substitutes
we find at the item level.
And yeah, so that's where we
are with this analysis now.
So in summary of this
first half of the talk,
word embeddings have become
a staple in natural language
processing.
And what I've told
you about today
is how we distill their
essential elements
and generalize them to
consumer level data.
And this has been
working in a lot
of different contexts for us.
So also in
recommendation systems,
and in neuroscience, and in
modeling people reading
scientific articles, and here
in the shopping basket data.
But there are some
questions, of course.
So one is, how can we capture
higher order structure
in the embeddings?
Somehow, looking at
pairwise relationships
to uncover categories
is unsatisfying.
You might want to think
about if there is high level
structure in these
attributes, which of course we
know there is.
If we could really
observe those attributes,
there might be
hierarchical structure,
a clustering structure.
So how can we capture that
in this kind of model?
A second question is,
down weighting the zeros
is important, but how
do you justify that?
And in some other
work, we've worked
on justifying that in the
context of recommendation
systems.
And so I hope we can look
to that for inspiration
for how to down weight
that in this setting.
And then of course,
we demonstrated
that we can find these
purchase patterns at the item
level with very large datasets.
But how can we include
things like price
and other economic complexities
in this kind of model?
So the second thing
I wanted to talk
to you about with
the same data set
is Poisson factorization,
a computationally efficient
method for discovering
correlated preferences.
Again, we're looking
at the same data set.
And we want to understand
people's purchase behavior
through this data.
In economics, we typically look
at items within one category,
like yogurt, and
then try to estimate
the effects of interventions,
like coupons or price or layout
or stock out.
In machine learning, we
look at all the items
typically, not minding
that they are all mixed up
into complements
and substitutes,
estimate user preferences,
and make predictions.
So this is recommendations.
So if you hand this kind of
data to a computer scientist,
she's going to do something
like in the Netflix challenge
and build a recommender system
so that when you buy Oreos,
it suggests that
you should buy milk.
In machine learning,
as I mentioned before,
again this is a cartoon.
We typically ignore things like
causal effects of interventions
and applied causality,
although that's changing.
Happily there's really a
nice new effort in causality
in machine learning.
So in Poisson
factorization, this
is work that we did in the
world of recommendation system.
We're modeling something
called implicit data.
What implicit data
means is, it's
users interacting with items.
So this could be users clicking
on articles on the web.
It could be users liking
things on the web.
Or it could be users
purchasing things.
There's less
information than what's
known as explicit data, which
is when users rate items.
Like in Netflix, you might give
a movie one star or five stars.
But it's much more prevalent.
It's much easier
to get this kind
of what's called
implicit feedback
data than explicit data.
So here's a picture of
users clicking on movies.
And so a sample
user might have seen
Star Wars Empire Strikes Back.
I know that Wrath of Khan
is actually Star Trek 2,
so that's a mistake
in the slide.
But anyway, they might have
seen those three movies
and not seen those other movies.
Return of the Jedi,
When Harry Met Sally,
Barefoot in the Park.
And so what you want to do
is, in a typical machine
learning application, is
recommend to this user
that they watch
Return of the Jedi.
So we've been working on this
recently with something called
Poisson factorization.
And this is now, again,
a graphical model.
Let me explain the graphical
model a little bit more.
So in a graphical
model, like I said,
nodes are random variables.
Edges denote dependence
between random variables.
This is-- I know you look at
graphs like this in economics
too.
Here we don't give any
mechanistic or causal story
behind this.
It's really simply
a way to represent
a factorized joint
probability distribution.
That's what it is.
Where hidden variables
are unshaded,
observed variables are shaded.
And these boxes
are called plates.
They represent replication.
You might have replication of
variables that are distributed
in the same way in your joint.
And that's what
those boxes mean.
So in this picture,
this is the picture
for Poisson factorization.
I'll try to describe it in
the context of shopping.
Here's the idea.
yui is whether or not
user u bought item i.
That's yui.
And you can see that there
are these two plates.
One is the plate for users,
and one is the plate for items.
And yui sits at
their intersection
because there is a
yui-- whether or not
the user bought an item--
for each user and item.
Theta u is a k vector,
a vector of length k.
And it represents the
user's preferences.
The user likes stuff.
Theta u represents
what the user likes.
It's a k vector.
And it's positive.
Beta i represents
who likes that item.
So beta i is a k vector,
and it represents
who-- we call them
latent attributes again,
but it represents
information that's
going to help tell me who's
going to like this item, who's
going to purchase this item.
And the way it works is that how
many times a shopper purchased
an item comes from
a Poisson whose
rate is the dot product
between theta u and beta i.
There are echoes of the
embedding ideas here,
but it's a different model.
So we just take the inner
product of my preferences
and the item's attributes--
say, coffee's attributes.
And if it's high, then I'm
likely to buy the coffee.
And so in graphical
models, of course-- again,
this is basically applied
Bayesian statistics--
it's the posterior distribution
that we care about.
The probability of the
user preferences, the item
attributes, given
the observed data
of who bought what reveals
purchase patterns about people.
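A minimal generative sketch of that model; gamma priors on the non-negative vectors are the usual choice in the hierarchical Poisson factorization work, but treat the specific shapes and sizes below as placeholders.

```python
import numpy as np

rng = np.random.default_rng(0)
n_users, n_items, K = 1000, 500, 30

# Non-negative latent vectors: user preferences theta_u and item attributes beta_i.
theta = rng.gamma(shape=0.1, scale=1.0, size=(n_users, K))
beta = rng.gamma(shape=0.1, scale=1.0, size=(n_items, K))

# Purchase counts: y_ui ~ Poisson(theta_u . beta_i).
y = rng.poisson(theta @ beta.T)

# Inference targets the posterior p(theta, beta | y), approximated in practice
# with variational methods that exploit the sparsity of y.
print(f"fraction of zero counts: {(y == 0).mean():.3f}")
```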
Why is this advantageous?
One, it captures
heterogeneity among users.
So Larry might-- I
think you have a cat.
Maybe not anymore.
LARRY: I had two
and [INAUDIBLE].
DAVID BLEI: You had
two, and now he has one.
OK, so we see Larry's cat.
I won't make a joke about it.
Yeah, so I had one.
Now I have zero cats.
So anyway, Larry might
like organic vegetables,
and he has a cat.
So he buys cat food.
And so he's heterogeneous.
John-- thanks.
John has a dog.
He buys dog food, but he might
also like organic vegetables.
So they're both similar in some
ways, different in other ways.
We can capture that kind of
heterogeneity with this model.
For interesting and
technical reasons,
this implies a
convenient distribution
of total consumption.
How much stuff Larry's
going to buy at the store,
we can model that
with this picture,
even though we're modeling
each item individually.
It's because of properties
of the Poisson distribution.
And most importantly,
but also-- yeah,
most importantly-- we can
do efficient approximation
of that posterior, only
requiring the non-zero data.
That too has to do with the
functional form of a Poisson
distribution.
I won't say why that happens.
It's something for
the blackboard.
But what we can do
is even though we
have 5,000 items and however
many hundreds of thousands
of users, we can
efficiently approximate that
posterior only using
the information
about what each user
bought, not using the zeros.
And what happens is the zeros
are still in the likelihood.
They just are easy
to work with because
of the nature of the Poisson.
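Very roughly, the blackboard fact is this: the Poisson log-likelihood splits into a term over the nonzero counts plus a term that sums every rate, and that second term factorizes into a single dot product, so the zeros never have to be touched one by one. A small numerical check of that identity, on toy sizes rather than the real data:

```python
import numpy as np

rng = np.random.default_rng(1)
n_users, n_items, K = 200, 300, 10
theta = rng.gamma(0.3, 1.0, size=(n_users, K))
beta = rng.gamma(0.3, 1.0, size=(n_items, K))
y = rng.poisson(theta @ beta.T)

# Naive log-likelihood (up to the y! constant): touches every user-item pair.
rates = theta @ beta.T
naive = np.sum(y * np.log(rates) - rates)

# Sparse version: loop only over nonzero counts; the sum of all rates
# factorizes as (sum_u theta_u) . (sum_i beta_i).
users, items = np.nonzero(y)
sparse = np.sum(y[users, items] * np.log(np.sum(theta[users] * beta[items], axis=1)))
sparse -= theta.sum(axis=0) @ beta.sum(axis=0)

print(np.allclose(naive, sparse))   # True
```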
So we worked on this originally
in the context of recommending
articles to people.
And in fact, they implemented
this at The New York Times
to give recommendations.
And for two different datasets,
here are some of the components
that we get.
So what does this mean?
Remember, each
user is represented
as a vector of preferences.
Each item is represented
as a vector of attributes.
And what I'm doing is
taking an attribute
and looking at the items
that exhibit that attribute
with the most strength.
By the way, if
you're familiar, this
is very similar to non-negative
matrix factorization.
It's kind of like a Bayesian
version of non-negative matrix
factorization.
And so we found
interesting stuff.
Like in the news articles
from The New York Times,
business self-help was
one-- I'm naming these.
That's why it's in quotes.
Business self-help was
one of these attributes.
Personal finance,
all things airplanes.
So even though The New York
Times doesn't have an airplane
section, there are
some people that
always click on
airplane articles,
like "Flying Solo, Crew Only 787
Flight is Approved by the FAA."
That's amazing.
I didn't-- that's crazy.
And so these are the kinds
of patterns that might not
be available to whoever
is making these items
and implementing the system.
And they can learn something
about their users this way.
With scientific
articles for Mendeley--
this is scholars
like you sharing
their libraries of
scientific articles--
we find things like
biodiesel and astronomy
and political science.
Again, patterns of usage.
Again, machine learning
talks require a plot
like this or a slide like this.
So it does better in
lots of different ways.
And on the item data, the
600,000 trips to the market,
we also get patterns.
And so this is how
that looks here.
So this is-- actually, this
is on the category level data.
So there is like a fruit
component with stone fruit
and pears and tropical
fruit and apples and grapes.
There's the cat care component,
baby essentials component.
I thought it was funny
that coffee was number
two in the baby component.
And healthy-- I don't
know what to call it--
where you buy soy
milk and organic food
and organic frozen
vegetarian options.
And again, we can capture that
consumers are heterogeneous.
Some like to eat
healthy and have cats.
Some have cats and have babies.
We're interested in
really folding this
into an economics model.
And one way to connect
this to an economics model
is like this.
And this also has to do with
properties of the Poisson
distribution, but I
think it's interesting.
So imagine a utility model.
So let's just imagine
that we're not
capturing the utility
of baskets of items,
but just we're going to
purchase one item at a time.
And so we have the
utility of an item, yui.
What's the utility
of buying cat food?
And let's imagine that
that utility is log
of theta u transposed beta i.
Right?
The preferences
transpose the attributes.
But logged.
You might not be used to
seeing that log there.
Plus epsilon, where epsilon
is a Gumbel, as usual for you.
For me, it's where I
learned about the Gumbel.
Now, suppose a shopper-- some
shopper u-- buys n items.
Let's condition on
that number of items
that she's going to
buy, condition on n.
yu is going to come
from a multinomial
of n items, a vector
over items that
sums to n with probabilities pi
u, special probabilities pi u.
Pi u i, the probability
that user u buys item i,
is proportional to the
exponential of that utility-- which,
because of the log, is just theta u
transposed beta i-- in this little picture.
Thank you.
What does that mean?
So this, I'm sure, is standard
procedure for all of you.
But what that means is, again,
because of the properties
of the Poisson, that the
unconditional distribution
of counts for this user is
going to be from a Poisson
factorization, where yuj
comes from a Poisson with rate
theta u transposed beta j.
And that means that we can
use Poisson factorization
to analyze this large scale
consumer behavior data.
And it is implicitly kind of
assuming this utility model,
up to all kinds of caveats.
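Spelled out as equations, the chain of reasoning goes roughly like this; the notation is a reconstruction, and the last step is the standard multinomial-Poisson connection.

```latex
% Utility with a Gumbel error gives multinomial logit choice probabilities:
u_{ui} = \log(\theta_u^\top \beta_i) + \varepsilon_{ui}, \quad \varepsilon_{ui} \sim \mathrm{Gumbel}
\;\Longrightarrow\;
\pi_{ui} = \frac{\exp\{\log(\theta_u^\top \beta_i)\}}{\sum_j \exp\{\log(\theta_u^\top \beta_j)\}}
         = \frac{\theta_u^\top \beta_i}{\sum_j \theta_u^\top \beta_j}.

% Conditional on basket size n_u the counts are multinomial; if n_u is itself Poisson
% with rate \sum_j \theta_u^\top \beta_j, the unconditional counts are independent Poissons:
y_u \mid n_u \sim \mathrm{Multinomial}(n_u, \pi_u), \quad
n_u \sim \mathrm{Poisson}\Big(\textstyle\sum_j \theta_u^\top \beta_j\Big)
\;\Longrightarrow\;
y_{uj} \sim \mathrm{Poisson}(\theta_u^\top \beta_j).
```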
And so what we're
doing now is we
are exploiting this connection
and devising new utility
models that take into account
various other properties
of the shopping experience,
like time of day,
whether or not something's
in stock, the dates.
So you can add date
terms and find out
that ice cream is more
likely in the summer,
and that helps us calibrate
people's preferences.
Observed item
characteristics and category,
demographic information
about the shopper, and so on.
And what's nice is that
inference is still efficient
when we add in all of
these complexities.
And then with assumptions
and with checking assumptions
and trying to check
assumptions, we
can answer
counterfactual questions.
So that's how we're connecting
this idea that we developed
for recommendation systems
into something that we
can work on economics with.
So Poisson factorization
efficiently
analyzes large scale
purchase behavior.
Again, some of
the next steps are
to include these other
properties of the shoppers.
And research in recommendation
can help economic analysis.
That's satisfying.
I said what I
thought I was going
to say next, next, by myself.
Good.
So that's my talk.
Probabilistic
machine learning lets
us design expressive
models to analyze data.
You can tailor the method
to the question you have
and to your knowledge
of the discipline.
You can use generic and
scalable inference, especially
as we continue to develop
those important tools,
to analyze large data sets.
Here we had 600,000 trips
and many thousands of items.
And we needed to rely on things
like stochastic optimization
to fit models of
that size and then
form predictions,
hypotheses, inferences,
and revise the model, all with
this very large scale data,
and with a technique that
is specifically tailored
to the question that you have.
I think there are a lot of
opportunities for economics
in machine learning.
One is to push economics
towards high dimensional data
and scalable computation.
Two is to push machine
learning to explainable models.
I like that in
economics, you care
a lot about interpretability.
And that's something that could
be more important in machine
learning, and it
pushes us there,
applied causal inference
and new problems.
And developing new
modeling methods together
is a ripe activity.
Thanks very much.
[APPLAUSE]
AUDIENCE: So you
emphasized the distinction
between what I would call
causal predictive modeling
and non-causal
predictive modeling,
mentioning that more
traditionally in machine
learning it's non-causal,
but that now there's
a lot of interesting
things [INAUDIBLE].
Even if we stay
non-causal, there
is an issue of what
I would call a point
versus interval or density
prediction [INAUDIBLE]
quantifying the uncertainty.
And my casual impression--
I'm an economist,
so I don't know a lot.
But is that the machine
learning community is much more
kind of straight up
point prediction, almost
like a certainty equivalence.
Get a point forecast.
Plug it into a decision rule.
And then do something,
like make a recommendation,
or put baby diapers here
and not there or something.
But a lot of the
stuff we want to do
involves probabilistic
quantification
of uncertainty predictions.
Is there work going on around
things like that in the machine
learning literature?
Or maybe my whole impression
is not exactly right.
DAVID BLEI: Yeah, I think
your impression is right.
That's a great point.
I think that's another
way-- and that's
a difference between machine
learning and statistics,
for example.
In statistics, they of course
cared about that kind of thing
for a long time.
And it's starting to change.
There's nothing-- there's
an exception for everything.
But it's true that frequently
in machine learning,
we look simply at
posterior expectations
and don't worry about
the spread around them,
or don't try to assess some
kind of frequentist estimate
of spread around an estimate
of a posterior expectation.
And I think that that
should be more important,
and I would hope
that that would be
one way that this kind
of interdisciplinary work
could push machine
learning to new places.
So probabilistic machine
learning kind of
grew and developed in
parallel with applied Bayesian
statistics.
This is now my own
personal opinion.
Looking at posterior
variance as a way
to assess uncertainty I think
is a little bit misplaced
because it assumes
that the model is true.
And so it doesn't
really give us a sense
of uncertainty under any kind
of theoretical framework,
except the one where
the model is true.
And that might be what
you're trying to check.
So I think that there are
some delicate issues there,
especially at
these large scales.
But it's a great point.
Yeah.
Thank you.
Yeah.
AUDIENCE: I actually wanted to
ask a question about something
you just said.
It seems like it's also related
to the question of how much you
can down weight zeros.
And there's a more
general question of,
what is the right model?
And I was just wondering how
you think about the model
misspecification.
Do you think there
is some true model,
and you'd like to
know what it is?
Or what is the criteria
that led you to the Poisson
linear functional form?
DAVID BLEI: Yeah.
That's a good question,
set of questions.
So OK, first, what
I just said is
kind of how I think about
model misspecification.
So I find that--
there's a line of work
on what's called posterior
predictive checks that
is really interesting.
It came out of
Bayesian statistics,
but it's really about
essentially frequentist checks
of Bayesian methods.
Started with people
like George Box,
and then Don Rubin
and Andrew Gelman
have developed this set of
methods, which is around
if I-- loosely,
at a high level--
if I set up a Bayesian model
where I have hidden variables
and observations and
I get the posterior,
and then I just immediately
start making decisions based
on the posterior,
as I mentioned,
that assumes that
the model is true.
And what George Box pointed
out a long time ago-- 1980
or before-- is that in order
to understand whether or not
that model is working, you need
to step outside of the model.
You have to step
outside this assumption
that the model is true.
And that's what these posterior
predictive checks do loosely.
They generate new data
from your estimates,
and then check whether that
new data looks like the data
you conditioned on.
And if it doesn't, it's
like a Bayesian adaptation
of a classical
goodness of fit test.
If it doesn't, then you've
done something wrong,
and you need to
revise your model.
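As a minimal sketch of that kind of check, with a toy model, a stand-in posterior, and a test statistic that are all my own assumptions: simulate replicated data from posterior draws and ask how extreme the observed statistic looks.

```python
import numpy as np

rng = np.random.default_rng(0)

def posterior_predictive_check(y, posterior_draws, simulate, statistic):
    """Compare a statistic of the observed data to the same statistic on replicated data."""
    observed = statistic(y)
    replicated = np.array([statistic(simulate(draw)) for draw in posterior_draws])
    # Posterior predictive p-value: values near 0 or 1 signal model misfit.
    return np.mean(replicated >= observed)

# Toy example: a Poisson model checked against over-dispersed counts, using the variance.
y = rng.negative_binomial(2, 0.2, size=500)                     # "observed" data
posterior_draws = rng.gamma(y.mean() * 50, 1 / 50, size=200)    # stand-in posterior over the rate
pval = posterior_predictive_check(
    y, posterior_draws,
    simulate=lambda lam: rng.poisson(lam, size=500),
    statistic=np.var)
print(pval)   # near 0: the Poisson model cannot reproduce the data's variance
```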
So I think that that kind of
activity is very important.
And that's what I think
about model misspecification.
I don't think there
is a true model.
I think these models
are all simplifications.
And they're for a purpose.
There's a knowledge
and a question.
I've done a lot of work in
topic modeling with John.
And we don't believe that
text comes from a topic model.
That would be insane
to believe that.
But rather, that
particular characterization
of the hidden
variables in texts is
useful for building
navigators based on themes
that help somebody quickly
explore a massive data set.
So personally, I'm not
sure there's a true model.
Although I'm not
speaking for Susan here.
SUSAN: [INAUDIBLE].
We're working on trying
to adapt these posterior
predictive checks,
particularly for [INAUDIBLE]
counterfactual inference.
So like in our data,
we have stock outs,
and we have price changes.
And so under some
assumptions, you
can identify price
effects and so on.
So we're trying to adapt these
posterior predictive check
type of framework to
assess whether, say,
an unconfoundedness
condition holds,
or to assess other
aspects of the model
for making counterfactual
predictions about what
happens when prices change.
So I think that as you
start to do counterfactuals,
you have to really
think harder about how
you're assessing your model.
And I think that's kind
of forcing this connection
to happen.
DAVID BLEI: Yeah,
that's a great point.
So if the question
is a causal question,
depending on how crude
that causal question is,
it might not be that you
need the actual mechanism
in your hand in order to
answer that causal question,
but that other assumptions that
wash out some of the details
are enough to answer it.
But then you need
to be able to check
these certain assumptions.
And so that's one of
the research threads
that we're doing.
SUSAN: We maybe also--
just back to your
question-- these
are Bayesian models.
So one difference between
these class of models
and other types of machine
learning models like just
the straight matrix
factorization is really,
this is not very different
than a BLP model or some
of the big models that are
used in marketing by Rossi
and so on.
There's a bunch of latent variables.
We have a Bayesian model
of these latent variables.
It's the same thing.
There's nothing
really too different,
except for we can compute them.
And there's some
approximations and debates
about exactly how you
think about the fact
that we're computing
them in a different way.
But roughly, they're just
larger scale versions
of economic Bayesian
latent variable models.
DAVID BLEI: I want to
make two other points.
Although this is a whole--
we could be all night
at a bar talking about this.
So one is that-- another
difference between machine
learning sometimes and
classical Bayesian statistics
is that in machine
learning, we often
use what's called variational
inference to approximate
the posterior here.
I'm not sure if you're
familiar with it.
Where unlike
something like MCMC,
there are no guarantees
that you're really
sampling from the
posterior of your model,
even if you wait forever.
And that's very
unsettling to many.
But if you don't think
your model is true,
then it's OK because
you can check
this bundle of your model
and the inference procedure.
And so that's another place
where model misspecification is
crucial, or thinking about
model misspecification
is crucial in that we have
given up on-- giving up's
the wrong word.
We don't need to
have a procedure that
gives us the exact posterior in
some long term infinite limit.
It's not advantageous
because we don't
have infinite amount of time.
And so we need ways
of checking this,
even if we had a true model.
You asked about how we came
up with the Poisson model.
And I'll tell you the answer.
We're working on
a different model.
[INAUDIBLE] and I were
working on a model
for finding
overlapping communities
and massive social networks.
And there was this paper
by Mark Newman and Ball
and Karrer-- Newman,
Ball, and Karrer--
which used Poisson
factorization style
models to analyze networks.
And it was a little funny
because the parameters grew
with the size of the network.
But it seemed to have
some good properties.
And so we were inspired
by that and thought,
OK, let's look into this.
And then, of course, when we
went back, that kind of model
had also been tried in computer
science by John Canny in 2004
and in statistics by
people like David Dunson.
AUDIENCE: [INAUDIBLE].
DAVID BLEI: It was
good at prediction.
It was good at predicting
held out links.
And then we thought more about
these kinds of properties,
about how it models
total consumption.
In the context of a social
network, it models it.
You just can't have
a bazillion friends.
You can only have
so many friends.
And the inference.
Then we sort of saw the
advantages in inference.
And so these three totally
different considerations led us
to that.
AUDIENCE: [INAUDIBLE].
DAVID BLEI: Yeah, yeah.
Thank you.
[APPLAUSE]
