- Hi, I'd like to welcome
you to the Radcliffe
Institute for Advanced Study.
My name is Alyssa
Goodman, and I am
both a professor in the
astronomy department
and also one of the
co-directors here for science.
And I have had the
privilege of inviting
a number of young scholars
to come here to Harvard.
Young researchers to
come here to Harvard
to talk about a field that
I think many of you know
is extremely popular these days.
And that field has come to
be known as data science.
And I want to tell you that
the world's attitudes
toward this kind of
work have changed so much
that about seven years
ago some friends of mine
and I were on the inaugural
version of something
at the National Academies that's
called the Board on Research
Data and Information.
And on that board we were
talking amongst librarians--
information scientists, computer
scientists, other scientists.
And these are-- put it this way.
I was the youngest person
on the committee, OK?
So just-- you got that picture?
So anyway.
So among this group
of people, the words
"data scientist" came up
when people said, well,
what should we call
people who are dealing
with these mountains of
data and diverse data sets
and trying to learn new
things about the world,
but who are not exactly just
your usual straightforward
academics whose values are only
publishing tons of papers--
they really want to
share code and share data,
and make it possible for other
people to learn from the data?
And somebody said, well, what
about just data scientist?
And most of the
people in the room
said, what's a data scientist?
That's not a thing.
What is that?
And I was sitting there
going, well, I don't know.
I think that's a thing.
I think that's-- I know a
lot of people who do that,
and I think maybe I do that.
And you know, we
should have that.
And that was literally
seven years ago,
and now most members
of the public
have some idea what
a data scientist is.
And that's because we
have so much information
available online and
through also private means,
but largely online.
Whether you can get to it or
not is a different question.
But anyway, the
kinds of questions
that people can ask
and legitimately answer
have completely changed
in just the lifetimes
of the young people
sitting in the front row.
And so to just explain
a tiny bit more,
the Next in Science program--
I believe it was started
by John Hughes, right?
My predecessor in this role.
And it used to be just
the Next in Science,
and we added, in
parentheses, data science just
for this year.
And then we had this
tremendous challenge
because the idea
was to find people
who are early in their
careers, but who are already--
I'm sorry to embarrass
you all, but superstars--
and bring them here to
Harvard and Radcliffe
so that you can hear
what they're doing
and learn about the future.
And there are a lot
of these people.
And so we had a
hard time choosing,
and so I hope that you very
much enjoy the four people who
we've brought here today.
And so just to give
you a little preview.
We're going to
have Jen Pan who's
coming to us from
Stanford, and she's
going to talk about
uncovering online censorship
and propaganda in China.
And then we're going to
have Nathan Sanders who
once upon a time started out
here at Harvard
in astrophysics, but is now
at Legendary Entertainment
as a data scientist.
And he'll talk about Hollywood
data science, the role
of inference and prediction.
Those are words you haven't
heard together too often:
Hollywood, inference, prediction.
And then after we
have a short break--
and I hope everybody will enjoy
that and then come right back
in--
we're going to have a talk from
Saki Takahashi who will take us
into the realm of epidemiology.
And she's now a graduate
student at Princeton University.
And then last, but
certainly not least,
we will explore
the wider universe
with my astrophysics
colleague, Rene [? Logic ?] who
will talk among other things
about how data are being
anticipated, and
then will be used
with projects that
are truly gigantic,
like the Large Synoptic
Survey Telescope.
And if you ask
her nicely, she is
going to show you an amazing
movie at the end of the day.
So just remember that.
Somebody ask her that
in the questions.
So then, we're going to get
started with our first speaker,
and the format is that each
speaker is going to talk
for about 15 or 20 minutes,
and then you'll have 5,
maybe 10 minutes at the outside,
to ask questions
of just that person.
So in the order that
I just mentioned.
And then at the
end, there will be
a joint 20-minute
or so conversation
with all four people.
So the first person is
going to be Jen Pan,
and she currently
holds the title
of assistant professor
of communication
at Stanford University.
She went to Stanford
from Harvard
where she got her PhD
here in the government
department working with
Gary King and other people.
And before that she did get
her undergraduate degree
from Princeton, but then she
had an interesting time--
or maybe it was interesting, I
don't know, you can tell us--
as a consultant at McKinsey
in New York and Beijing.
And so, I personally,
am very curious how
that ties into graduate
school and now professorship.
And she's also worked
for the Chinese Center
for Diabetes Control and the
Clinton Foundation HIV/AIDS
Initiative.
And she knows about how
authoritarian regimes
manipulate the techniques
of data science
to manipulate people, I
guess is what we could say.
And I'm not going to
say more about what
she does because she's going
to say plenty more about what
she does [INAUDIBLE].
[APPLAUSE]
- Thank you so
much for having me.
It's great to be here.
Can everyone hear
me from the back?
It's all good?
OK, awesome.
So today I'm going to talk to
you about online censorship
and propaganda in China.
When we think about
the Chinese government
and what it does to
control information,
this is a program that's
obviously designed to suppress,
to distort, to manipulate,
and to control information.
But paradoxically, it can
actually reveal something to us
about the intentions and
goals of this opaque regime
when we collect and observe
these data at scale.
And so this is where
data science comes in.
And I feel like the term
data science-- sometimes we
say data science,
sometimes we say computational
social science or machine
learning.
What we mean is using tools
that analyze large-scale data,
or at least that's
what I'm talking about.
The work I'm talking
to you about today
comes from a series
of three papers.
The first is the
observational study
of how the Chinese
government removes
content from social media.
The second is an
experiment that establishes
a causal linkage between what
people post and what's removed.
And the third is a paper that's
looking at online propaganda.
And what I mean by that is
how the government fabricates
social media content as
if it were the opinions
of ordinary people.
Starting with online
censorship, there
are rumors of as many as
200,000 workers in China working
for firms like Tencent,
Alibaba, [? Signa, ?]
whose whole job is to
censor and remove content.
A lot of the work
is manual, aided by technology.
But there is at least
the perception, or belief,
that there is human
oversight of what
is or is not removed.
And there are some
who think that maybe
China's efforts to censor
represent the largest
selective suppression of human
expression in recorded history.
So that sounds really
big, but at the same time
we have more and more data
recorded about our lives
now than ever before.
How we studied censorship
is we gathered millions
of social media posts the
instant they appeared,
so before the government
and the social media
sites could censor them, from
a lot of different social media
sites.
And then we went back carefully
to each of these posts
later on to see if
it was censored,
and we did this in
an automated fashion.
And censorship is very clear.
You see these sorts
of messages that say,
sorry, the post you
were looking for
does not exist, has been deleted,
or is being investigated.
Or this is the good one, which
is like, the page you requested
is not here, but why don't
you go look at something else
instead.
So we were clearly able
to detect censorship.
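The revisit-and-check step described here can be sketched as a simple string match against the platforms' standard deletion notices. This is only an illustration under assumptions: the function names and the notice strings below are placeholders, not the study's actual code or the sites' exact wording.

```python
# Illustrative sketch: after saving each post the instant it appears,
# revisit it later and check whether the page now shows a deletion notice.
# Notice strings here are paraphrased placeholders.

DELETION_NOTICES = [
    "sorry, the post you were looking for does not exist",
    "has been deleted or is being investigated",
    "the page you requested is not here",
]

def is_censored(page_text):
    """Return True if a revisited post's page shows a deletion notice."""
    text = page_text.lower()
    return any(notice in text for notice in DELETION_NOTICES)

def censorship_status(revisited_pages):
    """Map post id -> censored flag, given each post's revisited page text."""
    return {post_id: is_censored(text)
            for post_id, text in revisited_pages.items()}
```

In practice the revisit would be an automated fetch of each saved post's URL; only the final classification step is shown here.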
And when we looked
at the previous research
on censorship in China,
the belief was that
China would censor everything
that was bad for the
government, bad for the state,
bad for the regime.
That could be criticism.
That could be things about
protest or collective action.
What we find is that
this is not the case.
China censors something much
narrower and more specific,
which is just censoring
to stop collective action.
When I use the term
collective action,
I mean people getting
together in the real world
to do something.
A lot of times
this is protesting
against the government, but
it doesn't always have to be.
And at the same time,
while censorship
is focused on stopping
collective action,
there's a lot of criticism
that still flourishes online.
When we talk about
social media in China,
as well as other places, we
usually talk about burstiness.
So certain things will trend
at certain points in time.
But what also is interesting
in the Chinese cases is
that censorship is also bursty.
So in this plot, the x-axis
is a six-month time frame.
The y-axis is a
count of posts.
So in the gray are
just number of posts
by day over the
six month period.
And red are all the
posts that are deleted.
And what you can see
is that everything here
is about a city
in southern China
called [INAUDIBLE],
known for its lychees
and textile manufacturing.
In June of this particular year,
there was a large scale migrant
worker protest.
There was a corresponding
burst of discussion
online about this city
because of the protests,
and there's also this
burst of censorship.
So the unit that
we're interested in
is not topics and not posts,
it's these bursts of discussion
because the censorship is
happening in these bursts.
So some bursts are censored,
but not all bursts are censored.
This is the same type of plot.
Six-month time period,
count on the y-axis.
But everyone here is
talking about power prices.
In this particular frame,
there was a heat wave.
At the same time, the government
increased power prices.
So there are a lot of people
complaining online about this,
and that's the
burst you see here.
But you don't see a
corresponding burst
in censorship.
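The burstiness idea above can be made concrete with a minimal sketch: flag days whose post volume jumps well above a trailing baseline. The window and threshold below are arbitrary illustrative choices, not the study's actual burst definition.

```python
# Minimal burst detector over a daily count series: a day is a "burst"
# day when its count exceeds `threshold` times the average of the
# preceding `window` days. Parameters are illustrative only.

def find_burst_days(daily_counts, window=7, threshold=3.0):
    """Return the indices of days whose volume spikes above baseline."""
    bursts = []
    for i in range(window, len(daily_counts)):
        baseline = sum(daily_counts[i - window:i]) / window
        if baseline > 0 and daily_counts[i] > threshold * baseline:
            bursts.append(i)
    return bursts
```

Running the same detector on the deleted-post series would then show whether a discussion burst has a matching censorship burst, as in the two plots described above.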
So our hypothesis going in is
that the Chinese government
is censoring all the posts in
these volume bursts associated
with events with collective
action potential.
It doesn't really matter
what the post is saying.
It matters what the overall
discussion is about.
And we test this in a
couple of different ways.
So one test we do
is to look at what
are the events
precipitating these bursts,
and then to categorize these
events into news, policy,
collective action, and others.
And we create a
measure, something
we call censorship magnitude.
This is the intensity of
censorship in these bursts as
opposed to outside.
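One simple way to operationalize a measure like this is the difference between the censored fraction inside a burst and outside it, so that zero means no change in censorship intensity. This is a hedged sketch of the idea; the paper's exact formula may differ.

```python
# Illustrative "censorship magnitude": censored fraction inside a burst
# minus the censored fraction outside it. Near zero means censorship is
# no more intense during the burst; large positive means a censorship burst.

def censored_fraction(censored, total):
    """Fraction of posts that were removed (0.0 when there are no posts)."""
    return censored / total if total else 0.0

def censorship_magnitude(in_censored, in_total, out_censored, out_total):
    """Positive when posts inside the burst are censored at a higher rate."""
    return (censored_fraction(in_censored, in_total)
            - censored_fraction(out_censored, out_total))
```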
And you can see that for the
most part, these bursts related
to policy and news, which are
purple and blue, in the darker
colors, have a censorship
magnitude around zero, which
means that there's really
no increase in censorship
when the bursts happen.
Sometimes there's
actually a decrease,
because during a burst
maybe less content is removed.
But when you look, there's
a big long right tail,
and a lot of these are in red.
So all the kind of
little segments in red
are events or bursts related
to collective action,
and that's where the censorship
magnitude is much higher.
We also see that in the
yellow, pornography is censored.
That's less interesting
for our particular focus
because every country to some
extent censors pornography.
But we go a step
further, not just looking
at the kind of discussion--
overall discussion
of the bursts, but what
each post is saying.
And there we see that regardless
of the content or sentiment
of the posts, what really
matters for censorship
is the overall discussion.
So in this plot, the y-axis
is the percent censored.
In the left panel, these
are posts in bursts
related to collective action.
And there, whether the post
is critical of the state, in red,
or supportive of the
state, in green,
it's highly censored.
Versus posts
that are happening in bursts not
related to collective action--
there, whether you're critical
or not doesn't really matter.
The censorship is very low.
So censorship or this
removing of content
seems really focused
on collective action.
What happens with
online propaganda?
We can think of propaganda as
kind of reverse censorship.
Instead of taking away
we're adding information.
And here, I'm talking about
the specific form of propaganda
that's the fabrication of
content by the government
as if it was the opinions
of ordinary people.
So sometimes this is called
government astro-turfing.
It can be called misinformation
campaigns, disinformation,
trolling.
There are lots of
different names and words
that people are using
for this phenomenon.
In the Chinese context, the
people who do this type of work
are known officially as internet
commentators, [INAUDIBLE]..
So it's a very innocuous name.
But more colloquially, they're
known as 50 cent party members.
People who are paid piecemeal
$0.50 for each post that they
make online.
And so what do they do?
The prevailing view-- and this
is my view before all this
research-- is that these
50 cent party members--
I'm just going to call them
that for the rest of the talk--
that they are arguing with
and debating against those
who criticize China.
So let's say, online someone
is criticizing Chinese leaders,
Chinese policies, or
being really favorable
toward the US or democracy,
50 cent party members
will come in and argue
against those people.
That was my belief of what
50 cent party members do.
But with our research,
what we found
is that that's not at
all what's going on.
These 50 cent party
members do not argue.
They don't even touch
these controversial issues.
And instead,
they're distracting.
They're putting out lots of
really positive, happy content
that's totally
irrelevant, does not
touch on any critical,
controversial, debatable
issues.
And the way we study that is to
try to get ground-truth data.
So this is hidden--
this is fabricating
content and pretending
it comes from real people.
So this is not a public program.
And ground truth data has
been difficult to come by,
but we were lucky that
a few years ago there
was a leaked email archive from
the Zhanggong internet propaganda
office.
This is one county in
south-central China
where these e-mails
were leaked, and you
see lots of emails sent to
and from this Zhanggong internet
propaganda office, which
was in this network
plot in the middle.
And all around it
are lots of e-mails
coming in by internet
commentators claiming credit
for the posts they've made
on behalf of the government.
So this is the
ground truth data,
and it's publicly available now.
But it has been
largely overlooked
because it's really messy.
The e-mails are in
different formats.
Some of this credit
claiming is in screenshots.
Some is in attachments.
And what we do is use a lot
of different machine learning
as well as large-scale
hand annotation methods
to extract and make
this data systematic.
And from this data we extracted
over 43,000 known 50 cent party
posts.
And so that's what I mean
by ground-truth data.
We know that these are workers
making these fake posts
at the behest of the government.
And then we build out
from this ground-truth
to extrapolate and
make predictions.
So first we analyze the
content of the leaked
data, which is as I mentioned,
from one county in China.
And then we extrapolate to
all 2,800 counties in China.
And then we validate
that extrapolation,
and we also validate
our validation.
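The build-out from ground truth can be illustrated with a toy text classifier: train on known 50 cent versus ordinary posts, then score unseen posts. Everything below is hypothetical -- the tiny training data, the plain naive Bayes model, the function names. The study's actual pipeline combined several machine learning methods with large-scale hand coding.

```python
# Toy extrapolation sketch: a bag-of-words naive Bayes classifier with
# add-one smoothing and uniform priors, trained on labeled posts.
import math
from collections import Counter

def train(posts_by_label):
    """posts_by_label: {label: [post strings]} -> per-label word counts."""
    model = {}
    for label, posts in posts_by_label.items():
        words = Counter()
        for post in posts:
            words.update(post.lower().split())
        model[label] = words
    return model

def predict(model, post):
    """Return the label whose word distribution best fits the post."""
    best_label, best_score = None, -math.inf
    for label, words in model.items():
        total = sum(words.values())
        vocab = len(words) + 1
        score = sum(
            math.log((words[w] + 1) / (total + vocab))
            for w in post.lower().split()
        )
        if score > best_score:
            best_label, best_score = label, score
    return best_label
```

With ground-truth labels from one county, a model like this could be scored on held-out known posts (within-county validation) before being applied to other counties' accounts.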
What we're interested in is the
content of these 50 cent party
posts.
We want to know what
are they saying.
And so we estimate the
distribution of these posts
into five categories.
The first category
is what we call,
taunting of foreign countries.
So this is criticizing
people online
who are supporting the
US, supporting democracy,
supporting the west.
The second is what we
call argumentative praise
and criticism.
So these are people who
are critical of those
who are critical of
China, China's policies,
and then praising those
who endorse China.
The third category we call
non-argumentative praise
or suggestions.
So this is statements like,
we should increase prosperity
or we should improve
the plight of the poor.
Things at such a general level
that everyone can agree with.
The fourth category
is just announcements
of policies and programs.
And the fifth is this
cheerleading category.
So motivational, inspirational,
positive messages.
You get messages like,
I woke up this morning
feeling so thankful that
I'm a Chinese citizen.
I hope everyone else
has a wonderful day.
So just being very,
very positive and happy.
Not touching on any
sort of content.
The prevailing view is that this
is what the 50 cent party
members are doing:
this taunting, this
argumentation, debating.
And what we find
is that almost all the content
falls into the
last three categories.
And I'm going to
show you analysis
from five data sets that
move from the leaked data
out to other places in China.
And so this plot on the
y-axis is just the proportion
of content in each of
these five categories
that I just showed you,
which are labeled here.
The starting data set is these
43,000 posts from
the leaked e-mails.
They're made to all sorts
of websites in China.
And here, over 80% of the
content is cheerleading.
And that was really
surprising to us,
because we thought
the content would
be like down here in
taunting argumentation,
and that wasn't the case.
And we think this
is very unlikely
to be due to any sort
of sampling error.
But we push really
hard and try to find
any content that would go into
the taunting and argumentation
categories.
And in the end, we found,
I think, 26 posts out
of 43,000 that could conceivably
be in those categories.
So next, we want to extrapolate.
To set up for that,
we look just at
posts made on Weibo,
which is the Chinese kind
of version of Twitter.
And we're looking at posts
from all the accounts
that were leaked.
So this is about
now 160,000 posts.
And again, primarily the content
is focused on cheerleading.
When we looked at
these Weibo accounts,
we found that there are
two types of accounts.
One is what we
call ordinary accounts.
These are accounts that
you and I might have.
They're posting on
food, travel, kids.
And interspersed
are a few posts
made at the behest
of the government.
So it's incredibly difficult
to identify these accounts that
are making these posts
because they are actually
people's normal accounts.
Oh, and I should say
that the people who
are making these posts, we found
in the process of systematizing
the data, were basically
all government employees who
have other jobs, who are working
for the county, who are working
for the township,
villages, neighborhoods,
but who are making
these posts on the side.
And so some of them were using
their real Weibo accounts
and making these posts.
Very difficult to identify.
But there are also accounts
that are what we call
exclusive accounts, made just
to make these 50 cent posts.
These are much
easier to identify,
and reassuringly, ordinary
and exclusive accounts
do very similar things.
So it's not the case
that ordinary accounts
are doing the taunting
and argumentation while
exclusive accounts are
doing the cheerleading.
And we're going to extrapolate
and predict focusing just
on these exclusive accounts,
so that's important for us.
We first start with
within-county prediction
and then move to
out-of-county prediction.
And the picture is
very, very consistent.
50 cent party content is
focused on cheerleading,
these positive,
inspirational messages.
The other thing to note
is that these posts
are not made uniformly over
time, but are very bursty.
This is from the leaked data.
They're making posts
before certain holidays,
before sensitive political
meetings, when protests
and collective action
might be more likely,
as well as after events,
collective action events, that
get a lot of national
attention like the [INAUDIBLE]
or [INAUDIBLE]
railway explosion.
So it seems that this content--
especially the cheerleading
content--
is aimed or timed to distract
from collective action,
at least to a certain extent.
So how do we validate
our predictions?
We decided to ask the
people we predicted
to be 50 cent party members.
You might think, how
is that going to work?
These people's jobs
are to mislead.
They work for the government.
Maybe their jobs are at stake.
They're not your optimal
survey respondents.
But we decided to do it
anyway, and we administered
a double-blind survey on
Weibo through direct messaging,
where we asked
this question:
I saw your comment, and
it's really inspiring.
I want to ask, do you have
any public opinion guidance
management or online
commenting experience?
So we're not using the
terminology 50 cent,
we're using the government's
own terminology.
So we asked this question
of the accounts we predicted
to be 50 cent accounts.
We also asked the same
question of the accounts
we know to be 50 cent accounts.
And then we asked the same
question of accounts
that are unlikely to
be 50 cent accounts.
And we find that in
the predicted accounts,
59% say, yes, they do have
this experience and training.
In the known accounts,
57% say, yes.
And in the accounts unlikely
to be 50 cent from our method,
only 19% say, yes.
And between the first two,
the predicted and
the known, there's
no significant difference
in that response rate.
So asking the
predicted accounts,
that's our validation.
And then asking the known
and unlikely accounts
is our validation
of the validation.
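The comparison of those response rates (59% predicted, 57% known, 19% unlikely) is the kind of thing a standard two-proportion z-test handles. The sample sizes below are made up purely for illustration; the paper reports its own counts and tests.

```python
# Two-proportion z-test: |z| < 1.96 means no significant difference
# between the two response rates at the 5% level.
import math

def two_prop_z(yes1, n1, yes2, n2):
    """z statistic for the difference between two sample proportions."""
    p1, p2 = yes1 / n1, yes2 / n2
    pooled = (yes1 + yes2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    return (p1 - p2) / se
```

With hypothetical samples of 100 each, 59 vs. 57 "yes" responses give a z well inside 1.96 (no significant difference, matching the predicted-vs-known result), while 59 vs. 19 give a z far outside it.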
So altogether, China's strategy
in censorship and propaganda
can be reduced to
very simple rules.
The first is do not engage
on controversial issues.
Don't censor criticism.
Don't argue against critics.
Don't touch people's
real opinions and thoughts
about policies, the
leaders, and the state.
And the second is to
stop collective action.
So censor discussions
of collective action
that are happening
online, and then try
to distract with this cheerleading,
positive content
after collective action
happens or before time periods
when collective
action might happen.
And altogether, I think,
for a social scientist
like me, the promise
of data science
is that when we study
these phenomena at scale,
government information
control efforts leave really
big footprints.
Before, we couldn't
see these footprints
so we couldn't really
understand what
these opaque, non-democratic
governments were doing.
But with data science,
computational social science,
machine learning, whatever
we want to call it,
these methods allow
us to see things that
were not visible to us before.
So I'll stop there.
Thank you very much.
[APPLAUSE]
- Yes, questions?
- Yes.
- So, we'll take questions.
Go ahead.
- So I was wondering whether
you guys have thought about what
the Chinese government might be
doing [INAUDIBLE] data and how
they might have--
because it looks like they
certainly proved to themselves
that the methods
that they were using
are the most effective ways
of dispensing propaganda.
Right?
So are there any known
tools that they're
using to [INAUDIBLE].
- So we don't know for sure
what the government is doing,
but we do know that the
government is putting
a lot of funding into
Chinese academic institutions
to improve technologies in
data science and computational
methods.
A lot of the work on censorship,
at least, that's being done
is being done by the
internet content providers.
So they're the ones who
are developing technologies
and paying for censorship.
But on the propaganda
side of things,
there likely are studies
within the Chinese regime
of how effective
these strategies are.
But I think at the
same time, one thing
to keep in mind--
and I'm coming from a political
science background--
is not to forget the politics
within the government and
its bureaucracies.
The propaganda department has
been in charge of propaganda
for decades.
And so they have established
ways of doing things,
and they're encountering
these new technologies.
So it's not entirely clear,
even if there are effectiveness
studies, what that
means in terms
of changing their behavior.
- Maybe this is too
basic a question,
but if these propaganda
cheer leading messages come
in bursts, it's effectively a
reactive coordination message.
So do these people get
little messages somehow
or code like, OK, go for it?
I mean, how is it such
a coordinated burst?
- Yeah, so in
the email archives you also see
outgoing messages
from that county
propaganda department
saying, OK, now it's time.
So in one of the bursts, which
was after a protest
in western China, you see
at 5:00 PM an email go out,
and then at 8:00 PM you see
hundreds of people reporting
back of the post they made.
So that's how the
coordination is happening.
- [INAUDIBLE]
- [INAUDIBLE]
- So I wondered how you
defined which messages
had content about
collective action,
and do you think
your definition is
the same as the Chinese
government would have?
- So on the censorship
side, right?
- Yes.
- Yeah.
So we have a pretty
narrow definition,
which is people talking about
real world collective action
events.
So we're not including
online mobilization.
It's people talking
about other people who
are in the real world
doing things together.
So it's a pretty
narrow definition.
And in the experiment--
our experimental study,
which I didn't talk about as
much-- we actually
made posts in these
collective action bursts,
and we also tried to make posts
in kind of online mobilization
bursts, and there that
content is not censored.
So it seems that it's really
about offline people doing
things in the real
world together that
triggers the censorship online.
Yeah.
- One more question.
In the back.
- You.
Sorry, in the very back, though.
You had your hand
up for a long time.
Sorry.
- OK, thank you.
I think this phenomenon can
be explained in [INAUDIBLE]..
It is very normal
in China, I think,
because China's government
wants people-- young people,
most people-- to
do economy and technology,
but not a lot of politics.
This is a lot
related to democracy, I think.
This is very
different from America.
So in China, organizations
or collective actions
are designed by the government.
Yeah.
There are many different
collective actions,
but all these actions are
designed by the government.
So if you want to take
collective actions,
you can participate in the
designed organizations.
So if you do not want
to do collective--
do not want to participate in
these designed organizations,
you can participate
[INAUDIBLE] in technology.
So I think so.
Thank you.
- Yeah.
No, I think that's right, and
there are folks here at Harvard
who are studying kind of
state-led mobilization
as a way of diverting
people's attention.
But I think at the
same time, there
are lots of collective action
events happening in China.
So I have a new
project which is trying
to identify collective action
events through social media
data, and we are finding
dozens of protests every day
just through social media data,
which has its own biases.
So I think regardless of
the government's efforts
to try to divert
collective action,
people face grievances
and problems
in their everyday lives
that they can't sometimes
solve on an individual
level basis,
and they get together with
others to face similar problems
and they engage in collective
action and protest.
- Thank you, Jen.
[APPLAUSE]
- Again, and for those
of you who just came in,
we're going to have
another speaker now who
I'll introduce in a second,
and then we'll have a break,
and then we'll have our
two speakers afterwards,
and we're hoping everybody
comes back and joins us.
So that was terrific.
And we will now move from
online censorship to the movies.
But let me tell you a little
bit about Nathan Sanders who's
our next speaker.
He's currently the
senior director
of quantitative analytics
at Legendary Pictures, which
I'm betting is a post that
did not exist a decade ago.
Before he did that,
I mentioned earlier
that he got his PhD from
this fine university.
Don't worry, we do have people
on the program who did not
get their PhDs from Harvard.
Just these first two, and me.
Anyway, so he got his PhD at
Harvard in 2014 in astronomy.
But while he was
here, it wasn't enough
for him to very quickly
get a PhD, which
he did, studying
supernovae and learning
a lot about data
science by looking
at so-called
light curves, the time
behavior of the
brightness of supernovae.
He also started two things that
some people in the audience
may have heard of.
One is called
Astrobites, which
is a service where graduate
students write summaries
of professional
articles for people.
And building off
of that, there's
now a big conference
he might tell you
something about, called
the Communicating Science
Conference, or ComSciCon, which
is vastly oversubscribed,
and people attend it to learn
how to communicate science
with the public.
And Nathan doesn't
like when people
know that about him in advance
because it puts pressure on him
for giving an amazing talk,
which I'm sure he will.
But anyway, so you can see
that he has historically had many
extracurricular interests.
And so he's not going to
talk about new statistical
methodologies for
measuring the light curve
properties of hydrogen-rich
Type II supernovae today.
But don't worry, we have
a different astrophysics
talk for you later.
So instead, he's going
to talk about his work
at Legendary Entertainment
and how that works in the data
science world today.
David.
[APPLAUSE]
- While we're getting set
up, thank you so much,
Alyssa, and thank you to
everybody at Radcliffe
for convening this
amazing forum.
It's such a pleasure
to be able to join you
and to share some of
my own experiences.
So as Alyssa said, I
really want to actually do
two things in my talk.
I want to give you a
little bit of information
about my own background
and research.
First in astronomy,
and now I'll focus
on what I do today at
Legendary Entertainment
and what our group does.
But importantly, I
also want to draw
connections between the two.
I want to make it
really clear how
data science and
industry and academia
are really codependent.
So that'll be my goal.
So as Alyssa said, I have
a background in astronomy.
I started off-- I did my
undergraduate work in astronomy
at Michigan State, my PhD here
at the Center for Astrophysics
at Harvard.
And today, I work at
Legendary Entertainment.
I'll tell you all about
Legendary Entertainment
in a moment, but I
thought I'd start off
with a story about my
graduate work here at Harvard.
So I did my thesis research
with Alicia Soderberg
who's a phenomenal scientist
who studies supernovae here
at Harvard, and our work spanned
the full gamut from actually
detecting the supernovae.
And I won't talk
about this in detail
because I think Renee might
tell us a little bit more
about astronomical transient
discovery in a little bit.
But suffice it to say,
it's a very hard challenge
because we're trying to detect
these incredibly minute changes
in brightness in the sky, which
correspond to these incredibly
energetic and violent
explosions of massive stars that
happen in the distant universe.
Often literally, halfway
across the observable universe.
So my focus in this
work was, as Alyssa
said, the statistical
modeling of the light curves
of these stars, of
these star explosions.
So this is what that looks like.
A light curve is
very straightforward.
It's actually just a plot
of the brightness of one
of these explosions over time.
So the x-axis in this plot
is just the time in days.
The y-axis is the
brightness of the event.
Don't worry about the crazy unit
systems that astronomers use.
Don't worry really about
the details of this plot.
What I wanted to show you
is how messy this data
is, to be very frank.
So we're trying to observe
these incredibly faint objects
in the distant universe,
and as a result,
there's at least two big
challenges that we face.
One is that each individual
observation is noisy.
And two, we don't
get as complete
a data set as we'd like to have.
We only really get a few
observations of each exploding
star, in just the few weeks
that it's bright on the sky
before it fades into obscurity.
And that's a huge challenge from
a statistical point of view.
It means we really
need to treat each
and every single one of these
observations as something
truly precious.
There's an enormous amount of investment by the scientific community, in dollars, in building these telescopes to be
able to collect these
precious few data points.
So it is my mission, the
mission of our group,
to use statistical
modeling to extract
every last piece of
information that we
can from these relatively
few precious data points.
And what we do with that
is understand the physics
of supernova explosions.
So we relate the
observational parameters
that we measure from
these limited observations
of these explosions.
So for example, on
the x-axis here,
the total duration of the event.
How long it stays
bright in the sky.
On the y-axis, how bright it is.
We relate those
observational parameters
to the theoretical parameters
of different types of stars
and the explosions they produce.
So other astronomers actually
explode stars in computers,
and they figure out
what the light curve
properties would look like.
And by comparing these
two things together,
we can make inferences about what the actual parameters of the stars that we watch explode are, like how massive they are and what their chemical composition was.
So that's a brief tour
of my work in astronomy.
I'm going to try to tie
that back to the work
that we do today at
Legendary Entertainment.
So most people that I
talk to don't actually
know our brand name, Legendary,
but almost everyone I talk to
has seen at least
one of our movies.
So Legendary is probably best
known for the Dark Knight
Trilogy of Batman films.
The Christopher Nolan trilogy.
We made a bunch of Christopher
Nolan's other films
as well, like Interstellar
and Inception.
More recently, we've made
the Godzilla reboot in 2014.
The Warcraft movie
based on the video game.
The recent Kong
Skull Island movie.
And right now, in
theaters today, you
can go see Pacific Rim Uprising.
So if anyone leaves
during my talk,
I'll know it's because you're
rushing out to see Pacific Rim
and I won't mind at all.
That would be fine.
So Legendary is-- we are
best known as a film studio,
but we also produce TV shows.
We make comic books.
We produce digital
content and online media.
Legendary is really an
entertainment conglomerate.
And we are headquartered
in Hollywood,
and we're a little bit of an
unusual company by Hollywood
standards.
We're quite young.
Legendary has been around
for less than two decades,
and in that time we've
been very successful.
We've had more than $12 billion
in total global box office
gross.
We also have an
office in Beijing.
And we have an applied
analytics group,
which is where I
and my colleagues
work, here in Boston.
So our applied analytics
group really has two missions.
The first is to
radically increase
the efficiency of
marketing in Hollywood
to make every
dollar that we spend
to entice people to go to
movies and to see our films,
to make that spend as effective
and as efficient as possible.
And secondly, to support and inform the creative process: the work that the incredible creative talent we have in our office in Hollywood is doing to put the amazing concepts that they have in the best possible position to succeed in the marketplace.
So for me, the most exciting
part about working at Legendary
is that we have just a
phenomenal team of scientists
working here in Boston.
So I just wanted
to flash up for you
the faces of the folks
in our quantitative team.
And since we're
here at Radcliffe,
I wanted to highlight
that we have
lots of Harvard connections.
So perhaps, many of you will
recognize some of the faces
here.
And so I won't introduce
you to everybody here,
but I wanted to point
out a couple of examples.
We have lots of people, like myself, that have physical or computational science backgrounds, including one face, Jonathan Foster, who, among the many things he's notable for, including just being a tremendous researcher, is a fellow alum of the astronomy program here and one of Alyssa's former students.
So thank you very much, Alyssa.
- You're welcome.
- And another
category of scientists
that we have working with us
are amazing social scientists.
So I thought I'd point
out Phoebe Wong whose
photo is on the bottom there and
is also in the audience today.
And Phoebe is an analyst on our primary research team, doing human subjects research at Legendary, and is also a data scientist.
So I want to make
this connection
now between the
type of data science
that we do at Legendary and
the type of data science
that's applied in
academic research
throughout many
fields of science.
And I want to do that through the lens that I think about this with, which is the duality between inference and prediction.
So the first thing
I'm going to do
is make very explicit what
I mean by those terms,
and I'll do that by introducing
this very generic concept
of what is a mathematical
model or statistical model.
So any statistical model has a few fundamental components.
There's some set of input data, X, that is fed into the model: indicators that can be used to predict other things. There are some outputs from the model, Y: outcome data that we want to understand and model.
The way the model
goes from X to Y
is by making predictions
by taking the input data
and transforming it into an
estimate for the outcome Y.
And the goal of
the model is really
to do as good a job as possible
in making the predictions match
the actual outcome data that's
used to train the model.
And then importantly, there's a
set of parameters in the model
that we'll call beta.
And these parameters
are really how
the model makes
predictions, and it's
learning the values
of these parameters
in the fitting
process that allows
us to learn about
the actual mechanisms
that we're trying
to model and study.
And finally, there is
error in the model.
There is some uncertainty
associated with it as well.
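To make those components concrete, here is a minimal sketch with a toy linear model. All the data and the numpy-only fitting choice are my own illustration, not anything specific to the models discussed in the talk: inputs X, outcomes Y, parameters beta, predictions, and an error term.

```python
import numpy as np

# Toy illustration of the generic model components: inputs X,
# outcomes Y, parameters beta, predictions, and an error term.
rng = np.random.default_rng(0)

X = rng.normal(size=(200, 2))            # input data (indicators)
true_beta = np.array([1.5, -2.0])        # the "mechanism" we hope to learn
noise = rng.normal(scale=0.1, size=200)  # error / uncertainty term
Y = X @ true_beta + noise                # outcome data

# Fitting: choose beta so the predictions match the observed outcomes.
beta_hat, *_ = np.linalg.lstsq(X, Y, rcond=None)
predictions = X @ beta_hat

print(beta_hat)  # close to [1.5, -2.0]
```

Prediction is reading off `predictions`; inference is reading off `beta_hat`, which recovers the mechanism that generated the data.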
So the next thing I'm going
to say is entirely obvious,
but I'm going to
say it anyway, which
is that there's an
aspect of this modeling
process that is critical in
both science and industry,
and that's prediction.
So the predictive aspect of
statistical models of course,
is fundamental to
how data science is
integrated into business.
So I'll give you a
few examples of how
this particular aspect
of the model prediction
is very important.
So in science for
example, if you're
very interested
in exploding stars
and you've observed a supernova
over the past few nights
and you want to know what it's
going to do in the future,
you might make a prediction
of the brightness
of that exploding star tonight.
What's it going to look like if
I try and observe it tonight?
In industry, particularly in our industry, we're interested in people's movie-going behaviors, and we want to know whether or not they might buy a ticket.
So if there is
somebody that we have
some information
about like, Alyssa,
we might ask a question like,
is she going to go see a movie.
- Not taking [INAUDIBLE].
- I predict with 100%
confidence that Alyssa
will see Pacific Rim Uprising
before it leaves theaters.
[LAUGHTER] We'll
see if that's true.
OK.
So of course, we apply
predictive modeling
all over the place in
our work at Legendary.
I don't want to give
you an exhaustive survey
of the broad set of research
that all the scientists that I
showed you on the
earlier slide are doing,
but I'll just give you
a scattering of examples
in different categories.
So one big category of work that
we do is in network analysis.
Using social media graph data,
the connections between people
online to understand what
people's interests are,
to look at who people
are connected to, what
influencers or celebrities they
follow to get an understanding
of what type of content they'd
be interested in, what type
of movies they may want to see.
Secondly, we do a lot of
work in image recognition
using exciting relatively
new deep learning
models, like convolutional
neural network models,
to take in images as data,
to actually work directly
on pixel data and
make predictions
about what those images mean
and what it means for someone
to have shared them.
So in particular,
in our industry,
we're very interested
in people's response
to visual media.
If we send a creative message
out as an advertisement,
what will people respond
to and what types of people
will respond to each
type of message?
So for example, from our work on the Kong movie, what type of person
is going to be interested
in seeing Kong himself,
the actual monster in the
film, versus other aspects
of the imagery in the film?
And one more
category of examples
is natural
language processing.
Building models that actually understand text as people write and share messages with each other online
to understand what those
individuals are actually
trying to express to their
friends and to others.
So we use a variety of
approaches for this.
The example I am
showing you is using
something called a
hierarchical attention network.
This is a type of
model that understands
not only what it means
to use a particular word,
but also the context in
which words are used together
to form sentences.
And very importantly,
what to pay attention to,
literally to attend
to, within a sentence.
What words carry
the most meaning
when someone composes a message.
And among other things,
we use this type of model
to predict the
emotions that people
are trying to express when
they share messages online.
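As a rough illustration of that attention idea, the sketch below scores each word in a toy sentence and turns the scores into softmax weights, so the model can literally attend to the most meaningful word. The sentence and the scores here are invented for illustration; a real hierarchical attention network learns its scores from data.

```python
import numpy as np

# Toy sketch of attention: score each word in a sentence, then
# softmax the scores into weights that say what to "attend to".
words = ["i", "absolutely", "loved", "that", "monster", "movie"]
scores = np.array([0.1, 1.2, 2.5, 0.1, 1.0, 0.3])  # assumed relevance scores

weights = np.exp(scores) / np.exp(scores).sum()  # softmax: sums to 1
top_word = words[int(weights.argmax())]
print(top_word)  # 'loved' carries the most weight
```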
So I'm sure it's no huge surprise that companies are interested in using predictive modeling for purposes like these, but I actually have a concern about how data science is often projected and spoken about in industry, which is that I think sometimes there's an over-interest in, or over-reliance on, this concept of prediction.
So I think sometimes
machine learning
is spoken of as synonymous
with prediction.
So what I mean is
I think sometimes
when people think of
the modeling process,
this is not actually the
picture they have in mind.
What they have in
mind is something
more like this, where there
is one element of the modeling
process that sort of takes over.
And I think there's at least
one other part of the modeling
process that's equally
interesting and important,
and that's inference.
What I mean by
inference is being
able to learn from the model
fitting process itself,
to be able to learn from
the comparison between model
and data about the process
that we're trying to model.
So my sort of definition
of statistical inference
is the ability to learn from the
comparison of models and data.
So to make this concrete, let
me give you those same examples
again, but from an
inference perspective.
So in science we might want
to ask not just how bright
is a supernova going
to be, but what
are the properties of the stars
that led to those explosions.
And in industry, we want to
know not just whether or not
someone's going
to go see a movie,
but what led them to
make that decision.
What is it about the film?
What is it about the
person that made them
interested in that content?
So ultimately, for scientists in both academia and in industry, inference is a decision-making tool.
It's what allows us to use data
and use data science to inform
the decisions that we
make as movie producers,
as executives, and
as scientists trying
to evaluate different theories.
So let me give you sort of
another scattering of examples
of how we apply statistical
inference at Legendary.
And again, I'll
just kind of survey
a few different categories.
So the first category
is inferences
that we use to understand
content, understand
the actual products
that we produce.
And this is another area where
natural language processing
is very valuable to us.
So we'll look at the
textual descriptions
that we have for films that are
generated by users, consumers
who are discussing them
online, or even generated
by our own creative team.
And among other things,
we use a type of model
called a partially labeled
[INAUDIBLE] allocation model.
This is a model that can
take a large stream of text
and decompose it into
fundamental elements.
And we use this
really to understand
what are the themes
that are present
within different movies.
This allows us to statistically
compare movies to each other.
Understand how they're
similar and how they're
different from each other.
And also, it's useful
for making predictions.
We can use those same inferences
about thematic content
to anticipate what types of
themes are going to be popular
and what types of people
they're going to appeal to.
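As a toy stand-in for that kind of thematic comparison, far simpler than the topic model just described and using invented one-line descriptions, you can already compare films statistically with nothing more than bag-of-words cosine similarity:

```python
import math
from collections import Counter

def cosine_sim(a: str, b: str) -> float:
    """Cosine similarity between bag-of-words vectors of two texts."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[w] * vb[w] for w in va)
    na = math.sqrt(sum(c * c for c in va.values()))
    nb = math.sqrt(sum(c * c for c in vb.values()))
    return dot / (na * nb)

# Hypothetical one-line "descriptions" of three films
kong = "giant monster battles soldiers on a mysterious island"
godzilla = "giant monster battles the military in a ruined city"
drama = "a quiet family drama about memory and loss"

print(cosine_sim(kong, godzilla) > cosine_sim(kong, drama))  # True
```

A real topic model decomposes text into latent themes rather than raw words, but the output is used the same way: a vector per film that supports similarity comparisons.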
We also use inference to
understand market behaviors
at large.
So I'll tie this back
specifically to the problem
that I described of having this
messy and relatively sparse
data in observing supernovae.
Actually, in a lot of ways,
Hollywood is very similar.
So if you look at
any particular genre,
there's typically only a
few major film releases
in that genre per year.
And just like those observations
of the supernova explosion,
every time another
studio releases a film,
that's a precious
opportunity for us
to learn about how the
market is responding
to that type of product.
So we invest a lot of time
in building models that
are sensitive enough and have
a detailed enough accounting
of uncertainty so that we
can learn as much as possible
from these relatively
infrequent events.
And finally, we-- sort
of another category
is doing inference on
individual consumer behavior.
So this is our tie back
to Alyssa and her decision
to go see Pacific Rim,
which she's going to do.
And the way we know
she is going to do that
is that we look at
data from consumers
that we have information about.
We know some of
their interests and
their demographic attributes.
And we know what messages
that they've been exposed to,
what advertisements
they've received
for different films over time.
And we compare that to
their eventual decision
to go see a film or not.
And we learn from
that comparison.
What are the attributes and what are the messaging techniques that actually influence people to change their opinion and be persuaded to go see a film?
The last aspect of this
I want to talk about
is one more
fundamental difference
that I see between prediction
in general and inference
in general, which is
that a prediction can
be automated directly
into business processes.
So of course, we
can build algorithms
that directly operate on the
output of a predictive model
without necessarily having
any human involvement,
without having a
person actually looking
at the output of that model.
But in my view, inference
is fundamentally
different from that.
So in my view, to actually
extract value from inference
to benefit from
statistical inference
we need to have
people in the loop.
People who are actually
looking at models,
interpreting them, and
communicating those results
to other people who are going to
be involved in making decisions
based on that data.
So for me this has
profound implications
for how we prepare scientists
to do this type of work.
Frankly, whether it's
in academic research
or in industry.
So for me, the two
big implications
are one, what we need to
prepare scientists to do.
And in my mind, the
critical aspect of this
is the communication part.
So this is the way I always
make the argument for why
communication is a critical
and fundamental skill
for scientists.
And it's that you can do
the most amazing research
imaginable.
It can be absolutely
groundbreaking.
It can be rigorous,
sophisticated.
It can be perfect.
But if you can't
successfully persuade someone
to pay attention to what you're
doing, to be interested in it,
and to have faith that what
you're doing is correct,
then it may as well
not have happened.
Your work will have no
impact in the world.
That's why I think
communication is critical.
The next question is, at what step in a scientist's life cycle should we focus on building these skills?
And in my mind, there's
an obvious answer
to that too, which
is that I think
it has to be at the
graduate student level.
I think there's two
reasons for this.
One is that if you ask the
question of who does science,
if you look at academia and say,
who's the one doing the work?
It's graduate students.
Graduate students in different STEM fields in the US outnumber postdocs and faculty by nearly 10 to 1.
Graduate students are
really the broad base
of our enterprise of
scientific research.
And secondly, graduate students of course go on to different career paths.
Some of them go on into
research and academia.
Some of them go on into
industry, and many other paths.
So if we want all
scientists, no matter
where they're working, to
have this kind of preparation
and be equipped to communicate
the results of their work,
I think we have to
target graduate students.
So the last thing I'll
say is that I'm really
proud to have been part of
a group of graduate students
here at Harvard in 2012.
We founded a program
that's designed
to equip graduate
students with these skills
and empower them to
take leadership roles
in how scientists
communicate to the public.
And that's an organization
called ComSciCon,
the Communicating
Science Conference
here that Alyssa kindly
alluded to earlier.
We also are fortunate to have
one of the leaders of ComSciCon
in the audience, Ben Cook,
who is an astronomy graduate
student here and leads
ComSciCon's local organizing
committee.
And ComSciCon really
does two things today.
We have a national workshop
on leadership and science
communication for
graduate students,
and we have a network of
ComSciCon franchises operated
by our student alums
throughout the country who
are spreading the work
that we do even further.
And if you're
interested, like me,
in how scientists are prepared
to communicate with the public,
and if you're interested in
supporting graduate students
in doing this work, please
check out ComSciCon.com,
and please feel free to ask
me questions about it as well.
Thank you.
[APPLAUSE]
- I learned that I have
to go see Pacific Rim.
- That's true.
- And a lot of other things.
Thank you, Nathan.
So we can take some
questions for Nathan.
Go ahead.
- What do you think are the differences between being
a data scientist at a non-tech
company like Legendary
versus being a data
scientist at a tech company
like Facebook and Google, and
why did you choose the former?
- Oh, that's a very
interesting question.
You know, I'll tell you what
made me really excited to join
the team at Legendary,
and that's the opportunity
to work in a domain that
traditionally hasn't
hired people with
backgrounds like mine
and maybe yours to
work in a space that's
sort of a green field
for data science
and for scientific
research in general.
I think that's what all
scientists are looking for,
no matter what
domain they're in.
A lot of science is, by necessity, incremental, building upon something someone else has done.
And that's critical to the
field, and so exciting to do,
but I think scientists
are most excited when
they get the opportunity
to build something new,
create new
capabilities, identify
and think about
problems that people
haven't thought about before.
And for our team
at Legendary, we
have the opportunity
to do that every day
just because
there's not a legacy
of quantitative scientists
working in our field.
- Not to pick on Alyssa
again, but in your example
about the movie ticket,
when I think of inference,
there's always some
error bar in the data.
So how do you determine that
Alyssa has bought the ticket
and what is your error bar
on her ticket buying-ness?
- You come from much the
same perspective as us,
which is that the uncertainty
analysis aspect of that problem
is absolutely critical.
And so I want to
highlight some work
that one of our colleagues at Legendary, Victor Lee, has done.
He's actually a Harvard
alum from the ICS program.
He's built this fantastic [INAUDIBLE] probabilistic model to relate the ticket sale data that we have to individual consumers' preferences, interests, and the ads they've been exposed to.
And this probabilistic framework, I think among the many reasons why it's really valuable, allows us to account in great detail for the uncertainty associated with the varying channels that people get exposed to advertisements from.
And also to handle, again, the relatively sparse data that we have available for this type of study. I would actually draw a contrast between the type of modeling that we do in this domain, which is called attribution modeling, and the work that's done at some of the tech companies,
for example, where
they are operating
on the sales of digital
products that you can
track very easily with a click.
And in that domain, you often
have millions and millions
of cases, millions and
millions of data points
to do this analysis on.
At that scale,
perhaps uncertainty
is not critical to
understand because you
can beat down those error
bars by sheer number.
But in our domain, where tickets are sold at brick-and-mortar stores, there's no digital tracking system for them, and instead we have to use primary research mechanisms to collect that data and end up with smaller sample sizes, understanding the uncertainty is critical.
And so this is one of the few areas where we've actually published our work, so it's publicly available, actually on GitHub, if anyone wants to download it and build their next better version of it.
So if you search for my
colleague's name, Victor Lee
and attribution
modeling, you can see
exactly how we do that work.
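A tiny sketch of why that uncertainty accounting matters at survey scale versus click scale, using hypothetical data and a plain bootstrap rather than the published attribution model:

```python
import numpy as np

# Hypothetical illustration: the error bar on an estimated
# ticket-purchase rate is wide at survey scale and nearly
# negligible at click scale.
rng = np.random.default_rng(1)

def purchase_rate_ci(purchases, n_boot=1000):
    """Bootstrap a 95% interval for the mean purchase rate."""
    n = len(purchases)
    means = rng.choice(purchases, size=(n_boot, n)).mean(axis=1)
    return purchases.mean(), np.percentile(means, [2.5, 97.5])

small = rng.binomial(1, 0.3, size=50)    # survey-scale sample
large = rng.binomial(1, 0.3, size=5000)  # click-scale sample

widths = {}
for name, sample in [("survey", small), ("clicks", large)]:
    rate, (lo, hi) = purchase_rate_ci(sample)
    widths[name] = hi - lo
    print(f"{name}: rate={rate:.3f}, 95% CI width={hi - lo:.3f}")
```

With tens of respondents instead of thousands of clicks, the interval is wide enough that ignoring it would change conclusions.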
- [INAUDIBLE]
- Thank you very much
for a fascinating talk.
What kind of measures--
[INAUDIBLE]
- Effectiveness, sure.
Yeah, one of the
challenges that we face
is that-- again, Hollywood
is an industry where
there's relatively few
product releases per year.
And of course, every one is absolutely critical to each production studio.
So we don't have the ability that you do in some other industrial domains to run what are often called A/B tests, to release the product in different circumstances and measure which one the performance is higher in.
So we actually have to use
predictive inferential methods
even to understand
the effectiveness
of our other predictive
inferential methods.
So a lot of the
work that we do is
in building models
that predict what
the outcome of film
releases in general
will be so we can understand
first of all, what
is the optimal strategy for us
when we're releasing our film,
and second of all, what would
the performance of our release
have been had we made
different decisions
before it actually came out.
And by measuring the difference
between that prediction
and the actual outcome,
that's one of the ways
that we can determine
our efficacy.
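In other words, the efficacy estimate is just the gap between the counterfactual forecast and what actually happened. A trivially small sketch, with invented numbers:

```python
import numpy as np

# Toy sketch of the efficacy measurement described above: compare each
# film's actual gross to a model's counterfactual forecast of what it
# would have earned under a baseline strategy. (Hypothetical figures, $M.)
predicted_baseline = np.array([120.0, 45.0, 80.0])  # model forecasts
actual_gross = np.array([151.0, 43.0, 96.0])        # observed outcomes

lift = actual_gross - predicted_baseline  # [31, -2, 16]
print(f"mean estimated lift: ${lift.mean():.1f}M")  # prints $15.0M
```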
- One more question.
Sorry.
Anybody who didn't get
to ask your question,
don't forget that
there's a larger question
session at the end [INAUDIBLE].
- Hi.
I just wanted to ask
you about kind of where
you get your social data from.
So you know, now we're living
in the post-Cambridge Analytica
world.
I feel like it's good
that we understand
how firms like yours are
collecting that kind of data.
I understand how you
get your other data,
but in regard to the network
relationships between people
and the messages they're sending to each other.
- A very astute question. It did not escape my notice, and I'm sure none of yours as well, that Mark Zuckerberg is actually testifying right now in Congress about some of these issues.
And I want to say two
things about that.
One is that I'm personally, and I think everyone who is an advertiser is, actually really glad that this conversation is happening,
and that there's going to
be a public examination
of the processes by which
targeted marketing happens
today.
Because especially for consumer
oriented brands like ours,
like for movie producers,
the relationship
that we have with consumers and
the trust in that relationship
is critical.
If we have a brand name that's
tarnished by advertising
practices that people
are uncomfortable with,
then people aren't going
to want to see our movies
or buy other products
from other advertisers.
And so it's really
critical that we
identify where the
right boundaries are
in protecting
privacy and offering
services like Facebook.
So I hope that this conversation that's happening is really productive.
The second thing
I'll say is that when
I look at the capabilities
that are being developed
by my colleagues right
now at Legendary,
and I don't want to get
into specific examples.
But the ones that are most
exciting to me, actually
do not rely on any individual
consumers' data at all.
So I'll give you an analogy
as an example for the type
of thing I'm talking about.
I showed you that
we do a lot of work
to understand the efficacy of
different advertising channels.
Maybe not a surprise
that in our industry
one of the most important
ones is actually
theatrical marketing,
actually showing trailers
before films in cinema.
And of course, that's
been done for a long time.
And of course in
doing that, you don't
need to know anything about
who's walking in the door.
You don't need to
know the identity
of any individual
showing up to see
your trailer in the theater.
But it can still have
a big impact on them.
It's also a very targeted
approach in the sense
that it's being
shown only to those
who are obviously moviegoers,
and you can sort of
choose the films that you're
showing them on as well.
So by that analogy, by using that type of strategy that operates on sort of an aggregate level, I think we can be really effective targeted marketers without being exposed to the privacy concerns that are in the debate right now.
- Thank you so much.
[APPLAUSE]
- I still think there's some
uncertainty on that Pacific Rim
issue.
