[MUSIC PLAYING]
- Well, first I
just want to start
the second half of our program
by thanking the audience.
So you've been wonderful so
far, asking great questions.
And it's always a
sign of a good event
when you can't get people
back in to sit down,
because you're too busy
talking to each other.
But so now we do get to hear a
couple more presentations,
and then we'll have
this panel discussion,
and then we'll have even more
amazing snacks at the end
where you can continue--
[LAUGHTER]
--the conversations
that you began.
And I should just tell
you that, again, please
think of the questions
that you want
to ask each individual
speaker, but then keep in mind
the ones that you
didn't get to ask
or that apply to
more than one speaker
for joint discussion at the end.
And so again, before
I continue, I'll
just explain that these two
talks by Ben Shneiderman who's
our special guest from
the University of Maryland
next week--
the first one will be a very
large venue in Science Center
D, but you're still requested
to register if you want to come.
And that one is about
algorithmic accountability.
So the reason I mention this
now is because you heard
the question to Nathan at
the end, which I'm sure
will come up again in the
discussion, essentially about
what do we do about
the fact that we have
these algorithms that either we
don't know what they're doing
or they're being applied
for sometimes less
than completely
honorable reasons--
not by Nathan or
legendary entertainment.
But we can talk about
Mr. Zuckerberg later.
Anyway, speaking of Harvard
alums or almost Harvard alums,
I lied to you when I said
that the next speaker wouldn't
be a Harvard alum.
She does not have a Harvard PhD,
like the previous two speakers,
but she does have, I believe,
an undergraduate degree
from Harvard University.
Renee, do you have
any Harvard degrees?
- I have not been to Harvard.
- OK.
Unless we suddenly confer
an honorary degree on Renee,
we're going to be--
- I'm fine.
- --we're going to be good.
You want an honorary degree?
- [INAUDIBLE].
- OK, wait about 30 years
and then we'll give you one.
OK?
So anyway, her master's
degree is not from Harvard,
it's from Johns Hopkins--
this is Saki Takahashi who I'm
talking about-- in epidemiology,
and her undergraduate
degree was in applied math.
And currently, she's
a graduate student
at Princeton University, which
I hear is a pretty good school.
It's OK.
[LAUGHS]
And we learned about her
work from our colleagues
in epidemiology here, and also
from the leader of her lab--
whose name is Jess Metcalf--
who some of you might
remember as a speaker
in the wonderful Contagion event
that my Codirector of Science
here, Janet Rich-Edwards,
organized back in the fall.
And what Jess was talking
about was essentially
the use of data on difficult
problems in epidemiology,
but in this era of the
availability of lots
of large online datasets.
And one aspect of that
that particularly interests
me-- that I'm really eager to
hear about today from Saki--
is looking at
spatial distributions
and temporal
distributions and bringing
the tools of other
fields of science
and other fields of
statistics to epidemiology
in ways that either they
weren't there before
or were not possible before.
And I'm being purposely vague.
She's going to show you now--
that means as soon as I put
the microphone back together.
So let's welcome
Saki, and I can't
wait to hear what she says.
[APPLAUSE]
- Great.
Thank you so much.
Thanks to the organizers of
this event for inviting me here.
So today, I'm going to tell
you about some of the work
that we've been doing on
combining different data
sources to generate fine
scale maps of susceptibility
to vaccine preventable
infectious diseases.
And this work is
really exciting to me
because it was driven by real
time public health policy
questions.
And it's especially exciting
to be here because I actually
first got involved in
epidemiology and public health
as an undergrad, I had a mentor
in the biostats department
at the School of Public Health,
who introduced me to people
in the epidemiology department.
And needless to
say, I got hooked.
And I've been doing infectious
disease epidemiology
since then.
So I know that you guys
have hosted previous events
on epidemiology, so perhaps
you've come across this map
before.
For those of you
who haven't, this
is a map of cholera
cases in London
that was made by
John Snow in 1854.
And he traced the source of
the large cholera outbreak--
and cases are shown as
these black rectangles--
to the famous Broad
Street Pump shown in red.
John Snow is known as
one of the founders
of the field of epidemiology.
But I'm going to
argue that he was also
a founder of spatial
epidemiology.
And my talk will be on
spatial epidemiology.
So one infectious
disease that I,
and a lot of people in my
field, think about often
is measles, which
as you may know,
is one of the most transmissible
and potentially deadly
infectious diseases.
It has an R0, or basic
reproductive number,
of 10 to 20.
And R0 is the expected number
of secondary cases caused
by one infectious
individual in a completely
susceptible population.
And so basically, R0
is a metric of how
transmissible an infection is.
And so let's compare
that to the R0 of,
let's say, seasonal influenza,
which is around two.
So an R0 of 10 to 20
for measles is quite high.
And measles is a leading
cause of death in children
under five years of age.
And measles still kills almost
90,000 people each year.
And so here's a toy
schematic to give you
an idea of how transmissible
an infection with an R0 of 10
really is.
So individuals who are
susceptible to a disease here
are shown in green.
And the infected and
infectious individual
is shown in orange.
And now let's say we have
a completely susceptible
population, so everyone's green.
And then we introduce
a single person
who has infection with a
disease with an R0 of 10.
On average, that
one infected person
will cause 10 new infections.
So in addition to R0, a key
concept in infectious disease
epidemiology is herd immunity.
And herd immunity is
the indirect protection
against an infectious
disease that
occurs when a sufficient
proportion of the population
is immune, or in other
words vaccinated.
And herd immunity is related to
R0 in quite a simple way, which
is that from
epidemiological theory,
we know that the proportion
of the population that
must be vaccinated in
order to actually achieve
herd immunity--
or what we call p of c--
is calculated as
1 minus 1 over R0.
And so this means from
a pragmatic standpoint,
the higher the R0
is, the greater
the proportion of the population
that needs to be vaccinated.
And so for measles, which I said
earlier has an R0 of 10 to 20,
this means that roughly 90
to 95% of the population
actually needs to be vaccinated
in order to eliminate
the disease.
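The herd immunity threshold described here can be sketched in a few lines of Python (an illustrative sketch; the function name is ours, not from the talk):

```python
# Herd immunity threshold from epidemiological theory: p_c = 1 - 1/R0.
def herd_immunity_threshold(r0):
    """Proportion of the population that must be immune to block spread."""
    return 1.0 - 1.0 / r0

# Seasonal influenza (R0 ~ 2) vs. measles (R0 ~ 10-20):
print(herd_immunity_threshold(2))    # 0.5
print(herd_immunity_threshold(10))   # 0.9
print(herd_immunity_threshold(20))   # 0.95
```

The higher the R0, the closer the required coverage gets to 100%, which is why measles is so hard to eliminate.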
So now here's that
same schematic
again showing a measles-like
infection with an R0 of 10,
except this time we vaccinate
91% of the population.
So there are 11 people in our
population, and 10 out of 11
are vaccinated--
vaccinated individuals
are shown in blue.
And let's assume that vaccine
is perfectly immunizing.
And again, we introduce
a single infected person
into that population.
This time the infected person
won't cause any new infections,
because a sufficient
proportion of the population
is vaccinated, and so
they won't get infected.
In addition, the one
person shown in green
here who wasn't
vaccinated, will also not
get infected because
they're protected
by this concept of herd immunity
that I introduced earlier.
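The schematic's arithmetic can be checked with the effective reproduction number, R_eff = R0 × (fraction still susceptible), assuming a perfectly immunizing vaccine as in the talk (a minimal sketch; the function name is ours):

```python
# Effective reproduction number under partial vaccination coverage,
# assuming a perfectly immunizing vaccine: R_eff = R0 * (1 - coverage).
def r_effective(r0, coverage):
    return r0 * (1.0 - coverage)

# 10 of 11 people vaccinated (~91% coverage), R0 = 10:
r = r_effective(10, 10 / 11)
print(round(r, 2))  # 0.91 -- below 1, so the introduced case causes no sustained spread
```

Because R_eff falls below 1, the one unvaccinated person is protected too: that is herd immunity in miniature.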
So measles vaccination is one
of the most cost-effective
public health interventions.
And this is because
measles, the disease,
has both a potentially high
case fatality ratio of up to 20%
in some situations, as well
as a safe, effective, and
inexpensive vaccine.
And so this figure here
is showing the number
of reported global measles
cases, in blue bars,
from 1980 to 2016.
And the corresponding
yearly vaccination coverage
during this time period
in blue and red dots.
And you can see that the gains
in vaccination coverage that
have been made
over recent decades
have led to a large reduction
in measles cases globally.
And so as a result of this,
all WHO regions and countries
currently target elimination
of measles, the disease,
by the year 2020.
However, continued measles virus
circulation in many countries
makes this goal seem
particularly elusive.
And as I said earlier,
measles is still
one of the leading killers
of children globally.
And so to get a clearer
picture of what
the vaccination coverage
estimates in that previous
slide actually
represent, this is
a map showing reported measles
vaccination coverage in 2016
by country.
And you can see that
across countries, there's
quite a bit of spatial
heterogeneity in coverage.
Countries where coverage
is particularly low--
below the 80% threshold--
are shown in light and dark red.
And countries in
Sub-Saharan Africa
have some of the lowest
vaccination coverage levels.
And this is also
where the majority
of the world's remaining
measles cases are found.
So what I'd like to try
to convince you in my talk
is that it's important to think
at an even more local level.
And since a country
level estimate--
so any of these single
polygons-- is still an average,
a country can have
high overall coverage,
but also have pockets
of low coverage,
and that could sustain
a measles outbreak.
So I'm going to tell you
a little bit about how
measles vaccination
is actually delivered
with a focus on Sub-Saharan
African countries.
So there's two ways in which
measles vaccine is delivered.
First, by routine immunization
at health centers.
And this targets young
children at nine months
of age for their first
dose of measles containing
vaccine, which we call MCV-1.
Secondly, countries
conduct catch-up campaigns,
and they're known as
Supplementary Immunization
Activities, or SIAs.
And the purpose
of an SIA campaign
is to boost population
level immunity.
And they do that by targeting
a wider age range of children,
compared to routine
immunization.
So SIAs will sometimes target
kids up to five years of age,
but even possibly older--
even up to 15 years of age
in some scenarios.
And so what SIAs, or
these catch-up campaigns,
do is they provide
a first dose of vaccine
to kids who were missed
by routine immunization,
and they also
provide a second dose
to those who were already
vaccinated by routine programs.
And in many countries,
we have kids getting
two doses of measles vaccine.
And these campaigns
can be conducted
across an entire
country, or they
can target a specific
geographic area,
and they're conducted
every two to five years.
So this equation for p of c
that I mentioned earlier--
which represents the proportion
of the population that
needs to be vaccinated--
this equation assumes that
susceptible individuals
are evenly distributed
throughout the population.
And so this doesn't account for
any heterogeneity
across space.
And in reality, the spatial
clustering of unvaccinated
people-- and unvaccinated
people being susceptible people,
right--
makes it even more difficult to
actually achieve herd immunity,
compared to the situation
where the same number
of susceptible people are
randomly distributed across space.
And so along those same lines,
estimating vaccination coverage
by taking the average
across large spatial areas
can miss zones of
vulnerability that are small,
or those that don't follow
administrative or political
boundaries.
And this has been increasingly
recognized among public health
policymakers, and is
reflected by a shift
from simply setting
country level targets
to focus on ensuring
uniformly high vaccination
levels across a country.
And so in 2010, the
World Health Assembly
set a goal to achieve 90%
measles vaccination coverage
by country, as well as
at least 80% coverage
in every district
within a country.
And so in response to this
shifting focus towards more
detailed geographic
information and equity,
we've been setting up an
analytical pipeline that
combines data with models
that incorporate the two
mechanisms of vaccine
delivery, towards the goal
of understanding measles
vaccination coverage
at the sub-national level.
So the primary data that we
use are demographic and health
surveys, or DHS.
These are large
cross-sectional, geolocated and
nationally-representative
household surveys.
They are a rich and
publicly available
source of information on
demographic and health
indicators, including
vaccination,
and they're conducted
across over 90 countries.
And they're usually done every
few years in each country,
and they ask
hundreds of questions
using a standardized framework.
And so every survey
has information
on thousands of participants.
In particular, a DHS
survey has information
on each interviewed woman's
children who are five years
of age or younger at the
time of the survey.
And it includes relevant
dates-- their date
of birth and the date of the
survey-- and from those we
can calculate children's ages.
And then there's
also information
on their measles
vaccination status, which
we're really interested in.
For some children,
we have information
on their date of vaccination
from their vaccine cards,
but for others we only know
that they were vaccinated
based on their mother's recall.
Lastly, we have data
on their GPS locations
at the level of
household cluster.
And what cluster
here means is it's
an aggregate of households.
And there are 300 or more of
these clusters per survey.
So on the right here is the
data from the DHS survey
that was conducted in
Madagascar in 2008-2009.
And that's the most
recently available survey
from that country.
And there were over
10,000 children
included in this survey.
And each point on this map
corresponds to a GPS cluster.
And the color of the point
reflects the proportion
of children who are
vaccinated against measles.
And then the size of the point
corresponds to the total number
of children in that cluster.
And so the information that goes
into the DHS vaccination data
comes from individual vaccine
cards, such as this one.
And you can see
here for this child
you can see their date
of birth and their date
of measles vaccination.
And if you do the math, they
got their measles vaccination
at around nine
months of age, which
is the recommended age,
at least in this context.
And this is what the
country level vaccination
data in that Madagascar DHS
survey actually looks like.
And so this histogram is
showing children's age at
measles vaccination, if we
know it-- and that's in months.
And this is
information that comes
from vaccine cards like the
one that I just showed you.
And you can see that for
a majority of children
for whom we know when
they were vaccinated,
they got it at
around nine months.
But there is some [INAUDIBLE]
distribution around that.
And this histogram
is a different way
to look at the data, which is
children's measles vaccination
dates that we got from
the DHS survey, also
from the Madagascar survey.
And this is binned into month
and year of vaccination.
And the months with a
turquoise or pink bar
contained a measles
vaccination campaign.
And in Madagascar, they
actually do two campaigns a year
to try to catch up kids.
And so you can see that months
associated with the campaign
have higher numbers of
vaccination dates on cards.
So this is kind of the
signature that these campaigns
are doing something.
And so for each child in the
DHS survey, what we do is we
take information on
their vaccination status,
their age at the survey,
and then their age
at the vaccination
if we know it.
And then we feed it all into
a likelihood-based model,
and then we estimate
parameters that
describe the probability of
any child being vaccinated
given their location.
And so we've been developing
the statistical methods
that we think are widely
applicable to countries
that have DHS surveys.
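The likelihood-based fitting described here can be sketched as follows. This is NOT the authors' actual model, just an illustrative stand-in: for one DHS cluster we treat each child's vaccination status as a Bernoulli outcome whose probability rises with age past 9 months at an unknown routine rate, and we maximize the log-likelihood over that rate by grid search (all names and the hazard form are ours):

```python
import math

def prob_vaccinated(age_months, rate):
    """P(vaccinated by this age) under a constant hazard after 9 months."""
    eligible_months = max(0.0, age_months - 9.0)
    return 1.0 - math.exp(-rate * eligible_months)

def log_likelihood(children, rate):
    """children: list of (age_months, vaccinated 0/1) for one cluster."""
    ll = 0.0
    for age, vax in children:
        p = min(max(prob_vaccinated(age, rate), 1e-12), 1 - 1e-12)
        ll += math.log(p) if vax else math.log(1.0 - p)
    return ll

# Hypothetical cluster data: (age in months, vaccinated yes/no).
children = [(12, 1), (24, 1), (18, 0), (36, 1), (10, 0)]

# Grid search for the routine-vaccination rate that maximizes the likelihood.
best_rate = max((r / 1000 for r in range(1, 500)),
                key=lambda r: log_likelihood(children, r))
```

The real model additionally incorporates campaign (SIA) doses and spatial structure; the point here is only the shape of the likelihood machinery.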
And so as we were
working on these methods,
the West African Ebola
epidemic was under way.
And so in addition to the
considerable morbidity
and mortality that were a
direct effect of Ebola,
the epidemic also
disrupted the normal means
by which routine health care was
delivered in these countries.
And this is because Ebola shut
down the health infrastructure
in the affected countries.
And so disruptions in
routine health care
lead to a reduction
in vaccination rates.
And that leads to
increasing population level
susceptibility to
measles, especially
in the young age cohorts.
And that can lead to a
breakdown of herd immunity.
And so Guinea, Sierra
Leone, and Liberia
had all been concerned about
growing measles susceptibility
in their countries,
and had planned
to conduct these catch-up
campaigns in the future.
And so our program manager
at the Gates Foundation
told us that, given the
shutdowns of health services
due to Ebola, measles
outbreaks are inevitable,
and it would be good to
have some sort of risk map
to show and to guide
a preemptive response.
And so for us, this
was an opportunity
to apply the methods
that we've been
developing to analyze the
geolocated DHS vaccination
data.
And what we did was we pulled
the most recently available DHS
survey from affected countries
and surrounding countries,
and we pulled information on
their most recent SIA campaigns
that they've done.
And our aim here was
to map and quantify
how Ebola related health
care disruptions increased
the risk of a measles
outbreak, and that
could create a potential
second public health
crisis following Ebola.
And so we've observed that
measles epidemics frequently
follow humanitarian disasters.
For instance, a
survey of households
that were displaced due to
the Ethiopian famine in 2000
found measles to
be a contributing
cause in 35 deaths.
Measles epidemics also
followed disruptions
in the health system due
to natural disasters.
And the current political
conflict in Syria
has also led to large
numbers of measles cases
due to the collapse
of the health system.
So now I'm going to take
you through the steps
of our analysis
pipeline, and how
we applied it to
understand how Ebola
related health care disruptions
increased the risk of measles.
So first, we obtained the
geolocated DHS vaccination data
that I've been describing
from these countries
at the epicenter
of the outbreak,
as well as the
adjacent countries.
And then we estimated the
relevant vaccination parameters
from our model at the
DHS cluster locations,
and we incorporated
the two mechanisms
of vaccination, both
routine health care
services and these campaigns.
Inevitably, we don't
have DHS data
from every single
point in space,
which is what we would like.
But what we do is we take our
estimated parameter values
at the locations for
which we do have data,
and then we interpolate,
or we assign,
their values to locations
that are unsampled.
And we assume that
neighboring locations should
have similar parameter values.
And so this, you
might have heard of,
is essentially the
first law of geography,
which is that everything is
related to everything else,
but near things are more
related than far things.
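The interpolation step can be sketched with simple inverse-distance weighting, which directly encodes the first law of geography. This is an illustrative stand-in, not the geostatistical model actually used in the analysis (function names and sample values are ours):

```python
# Inverse-distance weighting: unsampled locations borrow values from
# sampled clusters, with nearby clusters weighted more than distant ones.
def idw(sampled, x, y, power=2.0):
    """sampled: list of (x, y, value); returns interpolated value at (x, y)."""
    num = den = 0.0
    for sx, sy, v in sampled:
        d2 = (sx - x) ** 2 + (sy - y) ** 2
        if d2 == 0:
            return v  # exactly at a sampled point
        w = 1.0 / d2 ** (power / 2.0)
        num += w * v
        den += w
    return num / den

# Three hypothetical cluster locations with vaccination proportions:
clusters = [(0, 0, 0.9), (1, 0, 0.8), (0, 1, 0.4)]
print(round(idw(clusters, 0.1, 0.1), 2))  # dominated by the nearby 0.9 cluster
```

Doing this on a grid of 5 km × 5 km cells is what turns the scattered DHS cluster estimates into the continuous coverage maps shown on the slides.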
And so this map is showing
the baseline probability
of receiving routine
vaccination by two years of age.
And these are at five kilometer
by five kilometer grid cells.
And so we see that
there's low vaccination
coverage in Guinea
and Liberia overall,
compared to Sierra Leone.
But there is considerable
within-country heterogeneity
in vaccination
coverage, as well.
So we used the DHS data to
estimate the baseline--
or pre-Ebola--
vaccination coverage,
and then we applied
this framework
to look at the effects of
a reduction in vaccination
due to Ebola.
So reports during
the Ebola outbreak
were indicating that at least
half of health care centers
were closed, and that
those that remained open
were receiving fewer patients.
And so what we did was
we looked at the effect
of a 75% reduction in
routine vaccination rates,
and this is based on the
best available information
we had at the time.
And so we used those
reduced vaccination rates
to estimate post-Ebola
vaccination coverage.
And so what we did was we
took our DHS survey estimates,
and then we combined
them with data
from the WorldPop Project.
And they have high-resolution,
age-structured population
size and birth
rate information.
And those are based on satellite
data on human settlements.
And so on these
maps, lighter colors
indicate higher
population density.
And so now that we have this
additional demographic data,
we take our proportions of
vaccinated children, which
we've estimated at each
location, and then we
can generate the expected
number of unvaccinated kids
at a given age.
So the last step was to
project the population forward
in time using the
spatial birthrate data,
and under different scenarios
in which routine vaccination is
disrupted for six to 18 months.
And so what we
did was we mapped,
and then we counted
up the number
of children who were
unvaccinated against measles
under five years of
age in the region,
under this assumption of a 75%
reduction in vaccination rates.
And then what we found was that
with every month of health care
disruption, an additional
19,000 children
are unvaccinated against measles
in the three focal countries.
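The accumulation logic behind that projection can be sketched in a few lines. The numbers below are hypothetical placeholders, not the paper's inputs; the 75% reduction is the only figure taken from the talk:

```python
# Back-of-the-envelope version of the forward projection: each month of
# disruption, newly eligible children miss routine vaccination in
# proportion to the assumed reduction in vaccination rates.
def additional_unvaccinated(monthly_birth_cohort, baseline_coverage,
                            reduction, months):
    """Extra unvaccinated children accumulated over the disruption period."""
    missed_per_month = monthly_birth_cohort * baseline_coverage * reduction
    return missed_per_month * months

# Hypothetical cohort of 50,000 children/month, 65% baseline coverage,
# 75% reduction (the talk's assumption), 12 months of disruption:
print(additional_unvaccinated(50_000, 0.65, 0.75, 12))
```

The real analysis does this per 5 km grid cell with WorldPop birth rates, then sums over the region; the per-month structure is why the result is reported as "an additional 19,000 children per month of disruption."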
And so you can see on this map
here that susceptible children
reside in these
contiguous clusters that
cross national borders.
So to summarize this work
we know that the Ebola
outbreak led to a
significant disruption
in health care services, and
a corresponding reduction
in routine childhood
vaccinations-- including
those against measles.
And then based on our
synthesis of DHS survey
data with high resolution
demographic information,
there would be these
large connected clusters
of unvaccinated children.
Some of the caveats
of this work are
that the exact numbers
of susceptibles
depend on the assumed value of
the reduction in vaccination.
And that could vary
spatially, based on factors
like local Ebola incidence.
But regardless of
the exact numbers,
there is a clear path
to avoiding outbreaks
of childhood vaccine
preventable disease, which
would be to conduct a
multi-country vaccination
campaign aimed at age groups
that were unprotected.
And a robust campaign
could virtually
eliminate the risk of
measles in the region.
So we've heard secondhand
that this analysis, which
we published during the last
year of the Ebola epidemic,
was useful in motivating
a public health response.
So in Liberia there
was an outbreak
of measles in the
first half of 2015,
and a follow up
national campaign--
vaccination campaign
against measles,
as well as polio, was
conducted that June.
And so we've worked with
various organizations
to do specific
follow-up analyses.
And then in the Lola
prefecture of Guinea,
which is near the
Liberian border,
there was also an outbreak
of post-Ebola measles
in the first half of 2015.
And there were over 700
suspected measles cases.
And this was another example
of Ebola-associated disruption
to the health care system
since an SIA campaign had
been planned in Guinea
during the previous year,
but it was interrupted
by the Ebola outbreak.
And it actually never
got to Lola prefecture.
And so Matt Graham,
who was a post-doc
at Johns Hopkins at the
time, led some of our work
in estimating susceptibility
to measles in Lola,
and forecasting the course
of the measles epidemic.
And this was done with the
European CDC and the WHO.
And so to conclude,
measles vaccination
is a best buy in public health.
The DHS survey data are
an invaluable resource
for understanding the spatial
distribution of health
indicators, including
childhood vaccination.
And I hope I've demonstrated
that this type of work that
leverages various data sources
and applies spatial analysis
techniques can reveal
useful insights
for real-time public health
planning and advocacy.
So with that, I'd like to thank
my mentors, my collaborators,
and funding sources.
And thank you very much.
[APPLAUSE]
- Thank you very much.
We'll take just a few
questions now for [INAUDIBLE]
and some more afterwards.
- Oh, the microphone.
You sound [INAUDIBLE].
- So I was really interested
in the demographic data
that you used from
the satellite images.
How does one link just
population density
to population
density of children?
Or are you just
making a correlation
that there's some average number
of kids per number of people?
- Yeah, so these are
actually age-structured data.
So they go in and estimate
at five-year intervals,
like, the number of
zero to five-year-olds,
five to 10-year-olds, 10
to 15-year-olds, by pixel.
- You can-- sorry, you can tell
the number of five-year-olds
from space.
- Apparently.
- Wow.
- Yeah.
Well, so it's based on
demographic projections
by UNDP and whatnot,
and fertility rates
and those things go
into the estimation.
But yeah, it's pretty cool.
- I had a clarification
question on the R0.
And I'm kind of assuming
that, in the background
there, there's some assumption
about kind of how many people
the typical child interacts
with, that's in the background
there somewhere.
And I'm wondering if
you could also tell us
a little bit about the
state of the art in regard
to, say, the social graph
structure in the sense
that there may be some super
connectors who interact
with a lot more people,
and kind of what
does your literature have
to say about that problem,
and how should we be
thinking about it.
- Yeah.
So for your first
point about R0,
so that's something that,
for the context of measles,
transmission is driven by
kids at schools, usually.
So there's some great
work showing that,
during the school--
like when school is in session,
measles, R0, is higher.
And then during the summer
term or winter vacation,
R0 goes down.
And that's because kids
aren't mixing in schools.
And so there's a
whole literature
around the seasonality
of measles.
And then in terms of
I guess mixing, yeah,
so I think that
what this work does
is it sets the baseline
level of susceptibility.
But then when you're actually
thinking about transmission
and epidemic dynamics, it
gets much more complex,
because you have
these non-linearities
between susceptibles
and infecteds.
And you know, it's--
yeah, so I guess
all I would say is
that trying to
understand susceptibility
is possibly a more
tractable question.
I think that-- yeah,
so in other work,
I also think about
like epidemic dynamics
and how transmission plays
out in the population.
And I think you have
to think much more
about how kids contact, or
how people contact each other,
not just by space,
but also by age.
So there are these studies that
show that essentially mixing
patterns between
age, like there's
this diagonal where I'm
more likely to interact
with people of my age.
But then there's also these
interesting off-diagonal
effects, where parents
and children interact.
And so there are some
strong mixing patterns
between those age
cohorts, as well.
So, yeah, it's a really--
- I'm gonna take the
microphone to ask a followup
question, which will be the
last question, which is just--
so I'm very interested
in spatial epidemiology.
And part of the problem is that
these R0 models, very basic
models, just assume, you know,
who's infected, [INAUDIBLE]
no spatial information
whatsoever.
And so what you've done is
take smaller and smaller
spatial cells, and then
model those very carefully,
and then get a spatial
map of what's going on.
And there's interactions
within those spatial cells
of different kinds of people,
to answer David's question
about transmission, but
is there in your modeling
any interaction between
the cells themselves,
and does that
spatial relationship
of the individual dots--
- Yeah.
Right now, it's
completely static,
but I think that
this could probably
serve as the basis
for dynamic modeling.
Absolutely.
Yeah.
But there's not-- yeah, we don't
have kids moving or interacting
in these flat maps.
- [INAUDIBLE] the answer to
all questions in this session.
But anyway, thank you again.
- Thanks.
[APPLAUSE]
- So now we will move to
the largest scale of today.
We will move to the
entire universe.
And I will read a sentence
which is actually true,
which is that Renee
Hlozek's goal is
to understand the structure
and amount of dark energy
in the universe constraining
theoretical cosmological models
with observations.
So for those of you who
don't study astronomy,
that means that she's
actually studying
the fundamental mysteries
of the universe.
And we don't understand
like 95% of the universe,
optimistically.
And part of what Renee does
and the kind of data science
that she's going to
talk to you about today
is something that
people in astronomy
have come to appreciate
in the last decade or so.
What used to happen
in astronomy is
that people would
make observations,
and then they would
try to back out
what the physical
parameters that explained
those observations are.
Now, there's a lot of
what in statistics often
would be called forward
modeling, where you say,
OK, I think maybe I
understand the physics.
If I make a
prediction, to go back
to Nathan's world, about
what the universe would
look like under
those assumptions,
what would that look like?
And that's what
Renee will probably
call a model, or a cosmological
model, in her case,
of the universe.
And then the statistically
difficult stuff
is to get enough data and
enough clever statistics
to constrain the difference
between the models
and the observations.
And sometimes, people
even go so far--
and I don't know whether that's
what she'll talk about today--
but to make synthetic
observations,
to take these models and
then make fake observations,
as if you have real telescopes.
And so part of what
distinguishes Renee's work
is that she's
actually interested
in those real telescopes,
even though she
trained as a theorist.
So I should explain that
she did not go to Harvard,
but we let her
come today anyway.
So Renee is originally
from South Africa.
And she went both to the
University of Pretoria
and the University of Cape Town.
And then she went
to Oxford, where
she got her DPhil in 2011.
And I met Renee sometime
around then, I don't know,
at a funky [INAUDIBLE] event at
Google, and before that once.
But Renee has a lot of
extracurricular interests,
not unlike Nathan.
And she was, among other
things, a senior TED fellow,
while she was all
these other things,
including she was a post-doc
before her current assistant
professorship at the
University of Toronto.
But she's also at
the Dunlap Institute,
which I should mention.
But she was a post-doc at
Princeton of many flavors
that involve the
name Lyman Spitzer,
who is one of my great
heroes in astrophysics.
And you can ask me why later.
That's not important right now.
What's important
is that Renee is
going to tell us about the most
modern techniques in trying
to understand the
mysteries of the universe.
So, no big deal.
[APPLAUSE]
- Thank you so
much for having me.
It's wonderful to be here.
I'd like to state for the record
that I had my measles and mumps
booster a couple of months ago.
I'm definitely going to
see Pacific Rim uprising,
and I promise to stop adding
extra noise to your tweets,
because I often post that
I'm really glad to be alive,
and I really like the
universe and space.
So I'm going to stop
that now and only be
mildly pleased by the universe.
But I do have the best
job, and I'm probably going
to tweet about that later.
So, thank you so much.
So I lived in
New York City for a year
while I was at Princeton,
and one of the reasons why
I like bringing that up
when I talk about what I do
is because, while
Hurricane Sandy gave us
many terrible things, one thing
it did give us was darkness.
And so I'm going to
take you back to Sandy.
So this photo was
taken by Jared Levy
of a darkened
lower Manhattan.
And imagine that you were here.
You can see there
were cars on the road,
but there are no street lights.
And you want to cross
the road and not die.
So what do you do?
You typically, if you've
grown up in a city,
you look at the cars
approaching you,
and if their head
lamps are bright,
you realize they're
probably close to you
and will kill you if
you cross the road.
And so you don't.
And if their head
lamps are faint,
you realize they're probably far
away, and you cross the road.
But did you ever ask
who told you to do that?
Like, who taught you to do that?
If anyone has
grown up in a city,
you just learn how to
do that, because you
understand that there's a
connection with brightness
and distance.
But it relies on a
fundamental assumption.
It relies on the assumption
that the person who
put the headlamps
in the car put them
in with the same brightness.
Otherwise, that whole
calculus you did in your head
without thinking would fail, and
you'd cross the road and die.
So we call these things
standard head lamps.
Or if you're in cosmology and
you like to be a little bit
archaic, you call
them standard candles,
because that sounds cooler,
and candles were a thing back
in the day.
But we do this in
astronomy, too.
We actually use
exploding stars that we
think explode with roughly
the same brightness wherever
they are in the
universe to tell us
something about the distance.
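The standard-candle logic here is the inverse-square law, usually written as the distance modulus. A minimal sketch in Python, using the textbook Type Ia peak absolute magnitude of about -19.3 and an invented apparent magnitude:

```python
def luminosity_distance_pc(apparent_mag, absolute_mag):
    """Distance in parsecs from the distance modulus m - M = 5 log10(d / 10 pc).

    A toy sketch: assumes a perfect standard candle and ignores dust
    extinction and cosmological corrections.
    """
    return 10 ** ((apparent_mag - absolute_mag + 5) / 5)

# Sanity check: equal apparent and absolute magnitude means d = 10 pc.
print(luminosity_distance_pc(5.0, 5.0))  # 10.0

# A hypothetical Type Ia supernova (peak M of roughly -19.3) observed at m = 24:
d_pc = luminosity_distance_pc(24.0, -19.3)
print(f"{d_pc / 1e6:.0f} Mpc")
```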
So for example, this is an
image of the Whirlpool Galaxy.
And I'm going to tell you
there's an arrow there, which
is going to guide your eye.
But even without the
arrow, if I show you
an image taken before
and after, you'll
see that there's something
there that wasn't there before.
This is an exploding
star, just like the kind that
Nathan was talking about,
a supernova that shines
very brightly in its galaxy.
And so these objects
actually allow
us to make a measurement of
distance in the universe.
And I care about
that as a cosmologist
because, if I understand
how big the universe is,
that tells me something about
what the universe contains,
and also how it
changes with time
and how it evolves and
all of these really
grandiose questions
that I want to answer.
In order to measure
these kinds of objects,
I need to build bigger
and bigger telescopes.
And so this is the wonderfully
named Large Synoptic Survey
Telescope, or LSST, or,
as I like to call it,
the Careful What You
Wish For telescope.
And I'll tell you
why in a second.
This is being constructed
in northern Chile.
I have a great movie
of the construction
later, as Alyssa said,
but I didn't put it
in the beginning of
the talk because I
figured it was too dramatic.
But we can get to the
dramatic parts in a bit.
And LSST is really
incredible because it has
incredibly amazing instruments.
And it looks at a large part
of the sky at any given time.
So its field of view is
3.5 degrees in diameter.
So that's seven
times the diameter
of the full moon in the sky.
And if you were to point the
Hubble telescope, which we all
know and love, to take
images of that same patch,
you'd need 3,000 Hubble images
to cover that whole patch of sky.
So it's really incredible.
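Those field-of-view numbers can be checked on the back of an envelope; the Hubble camera footprint assumed below, roughly 3.4 arcminutes on a side, is a round stand-in value:

```python
import math

lsst_fov_diameter_deg = 3.5
moon_diameter_deg = 0.5  # the full moon spans about half a degree
print(lsst_fov_diameter_deg / moon_diameter_deg)  # 7 moon diameters across

# Comparing sky areas gives the "thousands of Hubble pointings" figure.
lsst_area_sqdeg = math.pi * (lsst_fov_diameter_deg / 2) ** 2  # ~9.6 sq. deg
hubble_area_sqdeg = (3.4 / 60) ** 2                           # ~0.003 sq. deg
print(round(lsst_area_sqdeg / hubble_area_sqdeg))  # on the order of 3,000
```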
And the way that
you do this, the way
you get a telescope like this
is you build a giant camera.
So this is a
3.2-gigapixel camera.
And the gentleman in the
image, who has no head,
is there for scale.
So the LSST camera
is about the size
of a Volkswagen Beetle, which
again is a car from the past.
But the incredible
thing is that LSST
will scan the sky very rapidly.
So it will scan the whole sky
about once every three days,
which is something that
really isn't possible now
with current telescopes
to that same depth
and to that same precision.
And that allows us to really see
the sky in a whole new light.
So if we take a journey
back in time to pre-2000,
if you digitize old plates,
old plates from telescopes,
you get something like this.
You can see the sky, and
it's really incredible.
And I'm going to show you
what it looked like post that.
So this is coming
out a little dark,
but the Sloan Digital Sky
Survey was a small telescope
that also scanned the
sky a lot, and really
revolutionized astronomy when
it came about in the 2000s
because, until then, it hadn't
been possible to really scan
the whole sky.
And so one of my old
advisors as an undergrad
said that Sloan essentially did
his entire 20-year body of work
in about three weeks.
So it really
revolutionized things.
But LSST-- this is a simulation
of what LSST will see--
it will just be
incredibly deep, and show
us really incredibly
large numbers of objects
that we just
haven't seen before.
So the projected estimates
are that we'll measure something
like 40 billion
stars and galaxies.
Those numbers start
to sound astronomical,
for want of a better word.
And what it'll
actually do is now
we can start to see the
sky not only as static,
but really as transient,
that there are objects that
come in and out, as
Nathan introduced us
to earlier, these objects
that brighten and then fade.
And if you looked at the
sky-- this is a simulation
of the kinds of objects that we,
or the kinds of explosions that
we would expect
to see with LSST--
you expect to see so
many of these supernovae
of different kinds going off
in the sky at any time, which
is great.
So we're going to get vast
amounts of data from LSST.
But you have to be
careful what you wish for.
So I like analogies, because
I'm a professor of astronomy,
and we have so many cool things
we can make analogies to.
And so I like to say, imagine if
all the supernovae that we know
of up to now, that
we have really
studied very well with
smaller telescopes,
let's just assume those
are all modeled by a fish.
This fish.
I can see it really well.
I understand it's got some
cool, like, lip action going on.
There's fins.
I can look at its markings, and
I can understand it very well.
And that's fantastically
useful if I want
to understand this one fish.
LSST is going to give us
something more like this.
I'm going to get a lot of
data, but now I can't really
tell how many types
of data there are,
or how many different
kinds of fish there are.
Actually, this comes from
a study on coral diversity.
And I tried to--
it's really hard
to find images of
schools of fish
that are not just
one type of fish.
But apparently, there are
148 different species of fish
in this image, according to
the website that I shamelessly
stole it from.
And that's both a
fantastic challenge
and it's incredibly
scary, because now
we really need to know
about each different type.
The reason why that's
important is that the kinds of stars
I use to do cosmology,
the ones that really are
intrinsically bright everywhere
in the universe,
are a very specific subset
of these supernova transients.
They're called
Type Ia supernovae.
And they have this relationship.
But other dying stars,
similar to the ones
that Nathan studies,
explode at the ends
of their lives
with various different masses.
And so we don't have
the same correlation
between explosion brightness
and distance that you would have.
And so if I just use
all of these fish,
assuming they're
of one type, I'll
actually get a very different
answer than if I hand-pick out
the ones that I want.
But I can't hand-pick out so
many fish because I'd need them
to look like this,
and I want to look--
I know they will look like this.
So the question is,
well, how do we classify
astronomical transients?
I'm going to highlight
some of the work by myself,
but also lots of my colleagues
who are really exploding
in interest in this field.
And it's really fantastic.
So again, this was
introduced before,
but we do something typically
called a difference image.
I take a photo of you
today, I take a photo of you
tomorrow, I remove the average,
which is just how old you are,
how tired you are.
And then I probably see that
you were crying this afternoon
or something like that.
Probably don't cry at
the end of my talk.
But we do this in space.
And so we take a photo of a
galaxy today and tomorrow.
And if we see a difference,
we know that some
explosion has gone off.
And you see this.
What looks like just a
boring white dot is actually
incredible because
it's showing us
that there's a transient at
that position in the galaxy.
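The difference-imaging step she describes is just pixel-wise subtraction plus a detection threshold. A toy sketch with synthetic frames (the galaxy, noise levels, and transient position are all invented):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy frames: the same static galaxy (a blurry blob) plus noise; the
# second frame also contains a new point source, the "transient".
yy, xx = np.mgrid[0:64, 0:64]
galaxy = 100.0 * np.exp(-((xx - 32) ** 2 + (yy - 32) ** 2) / 200.0)
frame_today = galaxy + rng.normal(0, 1, (64, 64))
frame_tomorrow = galaxy + rng.normal(0, 1, (64, 64))
frame_tomorrow[40, 20] += 50.0  # hypothetical supernova

# Difference image: the static light cancels, the transient remains.
diff = frame_tomorrow - frame_today
detections = np.argwhere(diff > 5 * diff.std())
print(detections)  # expect the transient pixel [40, 20]
```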
This allows us to build
light curves, because we have
different filters on LSST.
So we actually measure
the sky not just in broad
optical light; we select very
specific wavelength bands of light,
and so we can actually measure
what this transient looks like
in different filters.
And we give them very
original names, like r, g, i,
and z. And y.
And so you can see
that the brightness
of the object in these filters
changes as a function of time.
And the way that
they change with time
is, luckily, different
for Nathan's supernovae
and my supernovae.
Sorry, I'm throwing
you under the bus now.
You did it to Alyssa, so.
So one of the ways
you can distinguish
these different
types of supernovae,
or type of transients
in general,
is to fit some kind of--
to extract some features.
So I know that
there's a light curve.
Can I describe it in some way?
Can I figure out what the
components of the light curve
are, or can I fit
a template to it?
And myself and
other colleagues do
a lot of these kinds of
approaches, where you find out
the number of features that
describe this light curve,
and then you can train your
machine learning classifier
on that using
various techniques.
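A minimal sketch of that feature-extraction approach, with a made-up two-class problem and a nearest-centroid rule standing in for the machine-learning classifiers she mentions:

```python
import numpy as np

rng = np.random.default_rng(1)
t = np.linspace(0, 100, 50)

def toy_light_curve(rise, fade):
    """Hypothetical transient: linear rise to a peak, then exponential fade."""
    flux = np.where(t < rise, t / rise, np.exp(-(t - rise) / fade))
    return flux + rng.normal(0.0, 0.02, t.size)

def extract_features(flux):
    """Two hand-picked summary features, standing in for a full template fit:
    peak brightness, and the mean brightness after the peak (decline speed)."""
    peak = flux.argmax()
    return np.array([flux.max(), flux[peak:].mean()])

# Two made-up classes: fast-fading vs slow-fading transients.
train = {0: [extract_features(toy_light_curve(10, 5)) for _ in range(20)],
         1: [extract_features(toy_light_curve(10, 40)) for _ in range(20)]}
centroids = {c: np.mean(f, axis=0) for c, f in train.items()}

def classify(flux):
    """Nearest-centroid rule in feature space."""
    feats = extract_features(flux)
    return min(centroids, key=lambda c: np.linalg.norm(feats - centroids[c]))

print(classify(toy_light_curve(10, 6)))   # expect class 0 (fast fader)
print(classify(toy_light_curve(10, 35)))  # expect class 1 (slow fader)
```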
And these are great,
but it helps--
the key in this area is really
to understand the clustering
of these objects.
So what types of supernovae
or what types of transients
have the same
characteristics or features,
and how do they
cluster together?
Which is really important.
And so some of my colleagues
spend a lot of time
making graphs like this.
So this is a t-distributed
stochastic neighbor
embedding, or t-SNE.
And the only thing you need
to take away from this plot
is, if the areas are
overlapping in this space,
you can't tell them
apart very well.
What you want is actually, in
this particular dimensional
representation, you want all the
little types or classifications
to be separate.
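Her colleagues use t-SNE for these 2-D maps; as a dependency-free stand-in, the sketch below projects invented feature vectors onto the top two principal components instead, and shows the same point: overlapping classes stay smeared together while distinct ones separate cleanly.

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical 10-D feature vectors for three transient classes.
class_a = rng.normal(0.0, 1.0, (40, 10))
class_b = rng.normal(0.4, 1.0, (40, 10))   # heavily overlaps class A
class_c = rng.normal(6.0, 1.0, (40, 10))   # well separated from both
X = np.vstack([class_a, class_b, class_c])

# 2-D embedding via the top two principal components (PCA stand-in for t-SNE).
Xc = X - X.mean(axis=0)
_, _, vt = np.linalg.svd(Xc, full_matrices=False)
emb = Xc @ vt[:2].T  # one 2-D point per object

def spread(points):
    return np.linalg.norm(points - points.mean(axis=0), axis=1).mean()

# Separation relative to within-class spread: small for overlapping classes,
# large for distinct ones -- what the smeared regions of the plot tell you.
gap_ab = np.linalg.norm(emb[:40].mean(0) - emb[40:80].mean(0))
gap_ac = np.linalg.norm(emb[:40].mean(0) - emb[80:].mean(0))
print(gap_ab / spread(emb[:40]), gap_ac / spread(emb[:40]))
```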
And one of the things we find
in astronomical supernova data
is that actually we
have a lot of overlap.
A lot of the things look a
little bit like each other.
If I change the characteristics
of the supernova explosion,
it could look a little like a
variable star, which is again
very interesting, but not
telling me about cosmology
necessarily.
And so one way that
you can do that
is to try and reintroduce some
sort of balancing to your data.
One of the things that happens
is some kinds of objects
are much more probable to occur
than other kinds of objects.
And so when you look
at a set of data
that you get from a telescope,
you get a whole lot of one kind
and none of the other.
If I want to then
train some classifying
algorithm on that
training data, it's
going to do very well at
classifying this kind of data
and very badly at classifying
this kind of data.
Just like if you've never
met a South African,
it's hard to classify my accent.
But the more South Africans
you meet, the better you get.
And so the way that you
can do that is either bring
more South Africans into the
room or kind of copying me.
So make different copies of my
accent, change it subtly to try
and make the balance in the
room a little bit better.
And this is what some
of my colleagues do.
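That "copies of my accent, changed subtly" trick is oversampling with jitter. A sketch on invented feature vectors:

```python
import numpy as np

rng = np.random.default_rng(3)

# Imbalanced training set: many "common" objects, few "rare" ones.
common = rng.normal(0.0, 1.0, (500, 4))
rare = rng.normal(3.0, 1.0, (20, 4))

def oversample(minority, n_target, jitter=0.1):
    """Resample the minority class with replacement, adding small noise
    so the copies are subtly different rather than exact duplicates."""
    idx = rng.integers(0, len(minority), n_target)
    return minority[idx] + rng.normal(0.0, jitter, (n_target, minority.shape[1]))

rare_balanced = oversample(rare, len(common))
X = np.vstack([common, rare_balanced])
y = np.array([0] * len(common) + [1] * len(rare_balanced))
print(np.bincount(y))  # [500 500] -- the classes are now balanced
```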
Another thing that's
really interesting
is to try and say, well, can I,
rather than extracting features,
can I turn a light
curve into an image
and then apply a lot
of very interesting
visual deep learning techniques?
So this is again from
a colleague of mine
who has done this, and is now
able to, instead of saying
I think I know what features
are in this light curve,
turn it into an image
and apply something
like a convolutional
neural network
to try and do this
classification.
And it's shown that
they do pretty well.
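Turning a light curve into an image can be as simple as binning the observations into a 2-D histogram; a sketch with a synthetic light curve:

```python
import numpy as np

rng = np.random.default_rng(4)
t = np.linspace(0, 100, 200)
flux = np.exp(-((t - 30) ** 2) / 200.0) + rng.normal(0, 0.05, t.size)

# Rasterise the light curve into a small 2-D image: each pixel counts how
# many observations fall in that (time, flux) bin. A convolutional network
# can then treat the light curve exactly like a picture.
image, _, _ = np.histogram2d(t, flux, bins=(32, 32))
print(image.shape)  # (32, 32), ready for a CNN
```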
This is a little
side note because I
asked a colleague
of mine in Toronto
about the work he's doing
in machine learning,
and he told me about
craters on the moon,
and I just had to
share it with you.
It's not related to
astronomy of exploding stars,
but I thought it
was really cool.
We see craters on the moon.
Turns out, they're very
hard to find and identify.
Did you know?
I did not know that until
a couple of days ago.
And so for a colleague of mine
at the University of Toronto,
the problem is there's a lot of data
and there are a lot of craters.
And typically, grad students
go and circle them by hand.
But the little craters
are very numerous.
And particularly, the
backside of the moon
doesn't have a lot of
images taken of it.
And so these
colleagues in Toronto
actually developed, again, a
convolutional neural network
to build up templates
of what they think the crater
outlines on the moon are.
And then there's some
data that you can
use to validate your results.
And they can identify 92% of the
craters that exist in images.
And they're having a great
time at the back of the moon.
So this is really
cool work that I just
wanted to highlight.
But that's a little diversion.
One question that we
ask is, can we actually
skip this difference
imaging completely?
So the problem with taking an
image of me today and an image
of me tomorrow and taking
the difference is, sometimes,
something will happen to
the image that introduces
an artifact that's not real.
So I could-- if I'm not wearing
exactly the same outfit,
something moves, you might
think that there's a transient,
but actually it's just a
factor of the image processing.
And that happens a
lot, particularly
with things like cosmic rays.
So these are things that come
not from a cosmological origin,
but just more local
to the instrument.
And that can really
influence your result.
But some colleagues
of mine have said,
well, listen, we can model
a galaxy plus a little star,
and we can train
our neural networks
to predict that and
then test it with data.
And actually, they're
finding pretty great results,
where they don't
need to just take
differences between day
one and day two, which
is really useful.
So how do we get ready for LSST?
So something that I've become
very involved with recently
is trying to do better
at simulating what
we think LSST will be like.
People talk about
turning the faucet on
and data just comes
streaming out.
And the big problem with
LSST is that, even though we
can write down what we
think the data rate will be,
we've just never seen this many
objects because we've never
had a telescope
that is good enough
to look with such precision and
such accuracy and such repeated
cadence.
So it is going to
just be a data deluge.
And you have to try and
prepare yourself for that.
So as I said before,
we have a problem that,
often, the data
that you train on
isn't representative of
the data that you receive.
And we know that that
will be the case with LSST
because it's just looking
deeper than anything has before.
So what we can try
to do is actually
simulate data sets and collect
all the data to try and make
something more representative.
So this is just another
example of that,
where the training data in black
were used on the testing data.
And you can just see that those
two data sets are really not
representative of each other.
And so you really struggle,
your algorithms really
struggle to predict
the red if all you had
was the black to train on.
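One crude way to quantify how non-representative a training set is: measure the overlap of the two distributions. A sketch with invented magnitude distributions for a shallow training survey and a deeper test survey:

```python
import numpy as np

rng = np.random.default_rng(5)

# Hypothetical brightnesses: the training set comes from a shallow survey
# (only bright objects), the test set from a deeper one like LSST.
train_mag = rng.normal(19.0, 0.8, 2000)  # bright, well-studied sample
test_mag = rng.normal(22.5, 1.2, 2000)   # fainter population

# A crude representativeness check: how much do the histograms overlap?
bins = np.linspace(15, 27, 40)
p, _ = np.histogram(train_mag, bins, density=True)
q, _ = np.histogram(test_mag, bins, density=True)
overlap = np.minimum(p, q).sum() * np.diff(bins)[0]
print(f"distribution overlap: {overlap:.2f}")  # near 0 -> not representative
```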
And so some of my work--
some of the work that a
colleague of mine, Rafael,
who is in the audience, is doing is
actually collecting data from
lots of different surveys within
a particular area of the sky
that's been well-studied,
and also trying to say, well,
so these--
on the y-axis are different
types of variables, and trying
to figure out, can
we oversample where
we need to, can we build
complete light curves,
can we actually build
a good data set
that people can use to
classify on with existing data.
And this partly inspired us
to do this from a simulation
standpoint, as well.
So with Rafael and
others, I'm actually
leading a group of people around
the astronomical community
who are bringing their
theoretical models.
So they're actually
bringing the models
that they've inferred from data.
And now we're saying,
can we simulate the sky?
Can I tell you I'm
going to make a sky that
has supernovae, and Type Ia
supernovae, and core collapse
supernovae, and transients, and
variable stars, and asteroids,
and throw them all together,
and then open it up
to the machine learning
community in general,
not just astronomers, to
say, we are going to give you
an object of about-- sorry,
a data set of about a million
objects as a function of time.
Can you classify them, and
how well are you going to do?
And what happens if we change
the cadence of how often LSST
looks at the sky?
So this is really
exciting work, and it's
proving to be very interesting
and very complex as we try
and piece apart the ways to
make the data rich and usable,
and also to engage
people that are not
just astronomers to try
and help us understand
the incredible sky.
I'm really lucky that I am
able to do this as my job.
But looking up at the
sky shouldn't just
be something I think that
is reserved for people who
live in universities, because
trying to understand what
goes bang in the
night is something
that matters to all of us.
Thanks.
[APPLAUSE]
- [INAUDIBLE].
We'll show the other
beautiful movie
at the end of the discussion.
But anyway, we can
take questions now.
Go ahead.
- Nice talk.
I have kind of a niggledy
piggledy question
on the stochastic
clustering embedding chart.
What's the rationale for not
including individual points?
- So one of the things--
I mean, you can try and do a
clustering over multiple axes.
So in this sense,
what-- oh, you mean
just the visual
representation, why
we used a KDE to smooth it out?
I think that's basically just
to make the plot look
a little bit better.
So one of my
bugbears, for everyone
who's interested in data
science, typically the plots
that we make are really ugly
because they're things like
a t-SNE plot showing
some kind of clustering
that, if you look at it,
doesn't really show
what you're supposed to
believe, or a ROC curve
that literally just looks
like every other ROC curve.
Sorry if anyone uses them a lot.
So using the KDE was
a way, I think, for--
particularly for my
colleague, Dr. [INAUDIBLE],
to actually make it
a little bit more
tactile and show
some sort of 3D-ness,
even though that's not
necessarily real, rather than
just individual points.
But the points are there.
They're not sort of
hidden or anything.
- Thank you.
- One more question for
Renee, and then we'll
bring everybody back together.
So, go ahead.
- A simple question.
When the James Webb
eventually gets launched--
when the James Webb
telescope eventually
gets launched in two or three
years, something like that,
is that going to--
I mean, that's looking
at longer wavelengths.
Is it going to produce
a lot of new information
on these kind of events
that you're talking about?
- So one of the interesting
things about longer wavelengths
is that of course these
objects, supernovae,
behave differently-- different
kinds of supernovae behave
differently as a
function of wavelength.
And so one of the cool
things about some supernovae
is they have another bump.
They rise again in brightness
in the infrared, which
is very useful and
very interesting.
James Webb will go
very deep and will
be able to give us
incredible resolution.
So this will definitely
supplement us.
But in the supernova community,
because we're really looking at,
you know, wide areas of
the sky, one of our
biggest problems is
actually finding other surveys
that can really match the area
that we're going to survey.
And as far as I know,
James Webb will not
do a significantly
better job
at actually overlapping
with us, although we're
going to bring all the data
that we can to overlap.
- But there is an
interesting point
in our community in
general that there
is this emphasis on both
very targeted, very narrow
observations like James
Webb can do, and then
these huge surveys.
So a lot of times,
people think data science
is about huge data sets, and
just only hugeness of it.
But there's also
this kind of how
do you optimally combine
different kinds of data,
which I think goes across
all of our fields here.
So why don't we sit down
and talk about that.
But anyway, thank
you again, Renee.
[APPLAUSE]
So we're going to do this
for about 15 minutes.
We'll have everybody come
sit in the front here.
And then we'll have
our little party.
Although this is
like a party here.
So I have a couple
questions to get
it started, which we'll
start with the one
that I just almost asked.
But then please, we'll
just go to the audience,
because I can ask
the questions later.
You don't have to
hear that conversation
if you don't want to.
So but this question
that I just asked
about this challenge
of combining
a lot of different,
diverse data sets,
and a lot of different sizes
of data sets and types of data,
and then I think that there's
this kind of public perception
that the macho data
science, or good data
science is just about the
gigantic, enormous data sets.
And so you don't
all have to answer,
but if anybody has thoughts
on that particular,
you know, the perception
both within the data science
community and in the broader
public, where they don't
understand that diversity of
data, different kinds of data
is so important.
- I mean, one of the things
we're doing with LSST
is actually figuring out--
we know that LSST
only takes images,
so it can't take a very
fine spectrum of energy
as a function of
wavelength of the objects.
And there's some benefits
to taking a kind of spectrum
like that, where you
learn a lot of the very
fine physics of the explosions.
And so we are
trying to figure out
what surveys can give that to
us, and how to design that.
But I think one of the things
that is a real challenge
is, if we throw a
huge bunch of data--
kind of to the point of
inference versus prediction
that Nathan brought up, if
we just throw a bunch of data
at a machine learning
algorithm that we don't really
understand, and we get an answer
that is predictive but doesn't
tell us about the
physics, I'm much less
interested in that,
because that just
means I found a cool new
object that I don't understand.
But I really want
to know, you know,
is there something weird
in the explosion mechanism,
or is the dark energy
changing with space or time?
And so we need to do much more
targeted small surveys, as well
as the big ones, I think.
- I think so.
- One of the ways that we
think about this [INAUDIBLE]
and the way that we've
actually structured our group,
our analytics division
here in Boston--
and I think you see
this more and more
in sort of data-oriented
groups and industry--
is that we actually have
a separation in the teams
and types of people we hire
for between the data science
skill set that I sort of spoke
about and represent here today
and the software development
and data engineering skill
set and perspective.
And it's really useful for us
to have that specialization
because it means
people with an interest
like mine, who are really
interested in developing
models and understanding
and modeling systems,
can focus on that,
and we can benefit
from the incredible skill
set that our colleagues
on the software development
side and data engineering side
have to be able to assemble and
wrangle and help us to combine
together [INAUDIBLE].
- I'm really glad that you
brought up the specialization
because my most pressing
question-- is David still here?
Yeah, David Parks, who's one
of the co-directors of the new
Data Science
Institute at Harvard--
Dave, wave, so people
will know and they
can talk to you afterwards--
David was present at
a conversation where
I won't name the other people
involved because it's important
not to, but the conversation
was about science
and the future of science, and
whether or not knowing anything
about actual science, like
principles in science, matters,
or whether you could have
this version of science that
was put forward on the
cover of Wired magazine
a decade ago where people
were like "the end of theory."
There was a big X on the
cover of the magazine.
And the idea was
that, if you have
enough statistics
and enough data,
that you can just
infer everything,
and that you don't have
to understand things
like, I don't know, gravity.
And so you can tell
from my tone of voice
that I don't really subscribe
to this point of view,
but on the other hand, you
noticed in Nathan's talk--
and I want to ask the
other speakers, too,
but in Nathan's talk, you
were talking about things that
had very human terminology.
You know, do they like
robots or, you know,
do they go to the movies
after dinner or before dinner,
and these human concepts of
dinner and robots and all that.
For those of you who don't know
very much about data science,
a lot of it is about
feature vectors.
And so some of like
what Renee was just
showing in these t-SNE
plots are very abstract.
You know, they're not a
physical property of something,
they're something that shows
up again, again repeatedly
statistically, and not even
the experts know what they are.
And so there's this
tension between, like,
human interpretable
things that you
can measure, like what you
were talking about, Nathan,
and then not.
And I'm just curious what,
in your various fields,
you think about--
I mean, you talked about
humans in the loop,
but I'm talking about
something slightly different,
which is like human
interpretable features,
shall we call them.
- Yeah, I think there really is
a sort of division in the data
science community
between people who
do have a scientific
background and are
interested in learning from
data, interested in making--
- That's maybe the
inference part, yeah.
- --exactly right,
interested in inference,
interested in being able to
learn from the comparisons
between models of data,
to update our own theories
and our cognitive model
for how a system works,
and people who just are coming
from a different background.
And as I sort of alluded
to, I think the model
that we have at
our company, and I
think a model that's really
successful in general,
is to have both those
skill sets represented,
and to allow the
interaction between them,
to literally have those
individuals collaborate to get
to the best possible outcome.
- Yeah, because there are
these sort of warring factions.
But I'll turn it into a more
specific question for Jen.
So you were talking about
manipulation in China, right?
But a lot of us-- most
of us, I think-- live
in the United States, and
we wonder about those kinds
of things in the United States.
You know, Facebook
manipulating our lives, things
like their emotional contagion
experiments and all of that.
And so if you think
about that, some
of the data scientists at places
like Facebook, which I'm sure
comes to mind, especially
given where you live,
very often, do and don't
care about the kind of--
whether they understand what
the algorithms are telling them,
and what that means
about real people,
rather than just
what it tells them
about improving the algorithm.
And so what's your
opinion-- like, if you
were going to apply
what you were doing,
and you were going to try
to have a sort of social,
emotional,
psychological, whatever,
but in a kind of human
language way, what would
you do that would change
anything about what you're
doing, and what do you
think that has to do
with what's going on in the US?
That's like five
questions in one.
- That's a lot of questions.
- Pick any of them.
- Actually, so picking
up on this point,
I think from a social
science perspective
there's a lot of
pushback against data
science, computational social
science, machine learning
because of the
focus on prediction
versus the focus on inference.
So as social scientists,
we want to do inference.
We don't want to do prediction.
- Which is the human
understanding side.
- Exactly.
- Right.
- But computer scientists
and others, engineers,
are coming in and
doing prediction
on these human behaviors.
And there's now-- so how can
those communities, instead
of being at odds
with each other,
kind of collaborate together.
And I think where it happens,
at least in the areas I'm in,
is in causal inference.
So we want to know what causes
certain types of behavior
at individual, societal levels.
So are there ways of using more
of these predictive tools that
do more individual-level
causal inference.
- Yeah.
And I'm also curious--
I'll ask one last question
and I'll turn it over
to the audience,
but just for Saki,
I had the pleasure at one
point of meeting Ban Ki-moon
during the Ebola epidemic,
and I was interested then
in spatial epidemiology, too.
And he told me that the reason
that apparently the predictions
about the epidemic and the
spread of the epidemic were
mostly too pessimistic-- at
least that's what he told me--
and he said that the reason
was because the models
of human behavior and how
well they would cooperate
with health officials and
with instructions were wrong,
and that people were more
cooperative than the models
assumed they would be.
And so how much of that
kind of human factor stuff
comes into what you do?
And how good are we
at modeling that?
- Yeah.
So I guess in terms
of just simple
epidemiological
mechanistic models,
we don't usually incorporate
the human behavior element.
But there are social
scientists and people
that we work with who
incorporate rational behaviors
into the epidemic process.
So basically, if you see
a lot of other people
who are infected, you're
more likely to, you know,
stay away from them or
quarantine yourself.
So there are-- and that's
kind of-- it's called
a rational epidemic.
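The "rational epidemic" idea, contact rates falling as prevalence rises, can be sketched in a toy discrete-time SIR model (all parameter values invented):

```python
# Toy SIR model where the contact rate drops as prevalence rises: people
# who see many infections around them reduce contact.
def run_sir(beta0=0.4, gamma=0.1, fear=0.0, days=300):
    s, i, r = 0.999, 0.001, 0.0
    peak = 0.0
    for _ in range(days):
        beta = beta0 / (1.0 + fear * i)  # behaviour: avoidance at high prevalence
        new_inf = beta * s * i
        new_rec = gamma * i
        s, i, r = s - new_inf, i + new_inf - new_rec, r + new_rec
        peak = max(peak, i)
    return peak

print(run_sir(fear=0.0))   # no behaviour change: larger epidemic peak
print(run_sir(fear=50.0))  # avoidance flattens the epidemic curve
```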
And that's something that
I'm very interested in just
because like, yeah, there's a
bunch of cool new data coming
out from, like, social
media, people using,
like, Twitter data to look
at how behaviors spread
on a network, right?
And so like,
specifically in my field,
it's like vaccine refusal or
human behaviors about health.
They also spread on a
network, just like diseases
spread on a network.
So yeah, there are--
- Good to hear.
- --a lot of links to the
social sciences, I think.
- OK.
Now I will fulfill my promise.
Go ahead.
- I actually also had
a question a little bit
about the interpretability,
and probably
mostly about the
inference domain,
but so with the explosion
in data science,
there's obviously a lot
more budding data scientists
I think than there are
people willing to undergo
like this strong quantitative
type of training,
let's say a PhD in statistics
or a related field, right?
So I was wondering if you guys
had any idea maybe for how to I
guess mitigate any
potential deficit of people
highly trained in these
quantitative methods?
Because I feel that there
are sort of problems
that it creates, right?
On the one hand, you
have a lot of people
who are super willing to
solve real-world problems
with big data science, right?
And you need to catch
them up to speed.
But on the other hand, you don't
want to make any, you know,
incorrect, haphazard inferences.
- So the question is what
happens when people get too
gung-ho about data science?
- Exactly.
- OK.
- What do we do
to mitigate that?
- OK.
- Well, we definitely see
this from an employer side.
So we'll sometimes go
to recruitment events
for data scientists where
the recruiters far outnumber
the candidates in the room.
That's how much demand there is
for this skill set right now.
And for me, the
solution to this is
creating pathways for
young scientists and people
early in their
careers in general
to have exposure to
data science practices
so that they can
learn whether or not
it's a good fit
for them, something
they're interested in,
and develop the skill set.
So I think there's a couple ways
that I think about doing this.
One I bet everyone in
this room is familiar
with is the idea of having
open competitions online,
like the ones that
Renee talked about,
which I've seen just be a
gateway into data science
for so many people, and I
think are incredibly valuable.
From a more formal
perspective, I think
part of what
academia needs to do
is create pathways for
students to get experience
practicing data science in an
applied setting in industry.
And of course, in
a lot of fields,
this is standard
practice already.
In statistics and
engineering, it's
very common, especially
for graduate students,
to do internships.
But in some other
fields-- actually,
in astronomy and physics
and in other domains--
that's very unusual, and
frankly often frowned upon.
And I think that's actually
counterproductive for science
as an enterprise, which needs to
develop this skill set
and be able to take
advantage of it,
and certainly for society
at large and for employers
to have candidates who are
experienced and really skilled
and know that they're
interested in that topic.
- So this is, I think, a really
important thing to discuss,
because astronomy is already
open to large data and already
has people doing data science,
potentially under another name.
But there is this perception
that, first of all, if
you introduce students
to data science in astronomy,
they will leave, right?
So there's this fear like you
don't put your best students
to work on anything
data science,
because they will leave.
- She meant leave astronomy,
not leave data science,
just to be clear.
- Yeah, sorry, leave
astronomy for data science.
And sometimes, I
think colleagues
can be scared of that.
But then the other
bias sometimes
that I hear from my
more senior colleagues
is that, exactly
what you alluded to,
are we teaching them--
I mean, someone said
to me once, are you
doing physics or statistics?
And I was like, both.
But I think,
particularly in my field,
we've always really needed to do
surveys in cosmology because we
have to understand
things, you know,
on a population level
of the universe,
rather than individually.
But in some other
groups, I think,
you could only study your
kind of object or your star
or your region of the universe.
And so there's a
paradigm shift that
has to happen as people
realize that you can do both.
You can get the answers that you
need astronomically and excite
people, as you said,
with real-world problems.
And then if they
leave, fantastic,
and if they don't, fantastic.
But often, I find myself
having discussions
with colleagues about
sort of whether or not
we should be allowing people
to call it science or astronomy
if they're doing something
that is effectively statistics
and data science and
computer engineering, which
to me is just a moot point.
- I think it's a question
that hasn't been answered.
In social sciences,
the expectation
is that if you want
a PhD in, let's say,
political science, you
need a master's in
computer science
or statistics.
That's increasingly the norm.
Or you're spending two or three
years of your PhD
just training
on the technical skills,
and then you don't
have that theory.
So it's an issue
that is not resolved.
Some have proposed can we
work just across disciplines,
but that collaboration
is very difficult
unless both
collaborators actually
know something about where
the other is coming from.
So I don't know that we have
an existing kind of answer
to how exactly we change this in
academia in terms of training,
given the time constraints.
- And from the
student perspective,
I think the internet
has made it really easy
to learn new skills:
people put
their code up on GitHub
in a public way, it's commented,
and we can just go through
and see what they did.
So the internet
has really made it so that--
- Yeah, I'm so glad.
- --we can try to figure
it out on our own;
we don't necessarily have
to take a course on it.
At Princeton,
we don't take
courses in our PhD,
and there aren't data science
for biology courses yet.
But informally, online,
we've been able to
pick up a lot of that.
- David, you have a question?
- We need the
microphone [INAUDIBLE]..
- It's right behind you.
- So if I could just also
make a comment first,
which is that we're launching
a new master's of data science
here within the FAS.
And kind of on this
note about, you know,
the supply and demand,
we got 1,300 applications
and we made 70 offers.
So there's a huge number
of students out there
in the world that are looking to
do a data science master's, and
clearly programs like
Harvard's are
not going to be able
to supply all of that.
So we do need to really
think carefully about this.
And this is a very
integrated program.
My question was, you
know, a lot of us
are seeing this revolution
happening right now
with the black box kind
of end-to-end models.
A number of you mentioned
convolutional neural nets,
but also many of you,
and especially Nathan,
talked actually about
very detailed models
that clearly have been
crafted extremely carefully
by people with high
expertise in the domain.
And I just wanted to invite
you to talk about that.
I'm curious about, in
your case, you know,
how many years of
effort have gone
into getting your
team to the place
where you're able to craft
those models that can really
provide insight?
And invite others on the panel
to talk to that, as well.
- Yeah, maybe I can give
the first response, which
is that I think it's
incredibly important for those
on my team at Legendary to
build that domain expertise.
If we really want
to be in a position
to create a positive impact
for the rest of the company,
and particularly to be able to
collaborate with and effectively
support the creative
side of our company,
it's our responsibility.
It's incumbent on us to
build that domain expertise
to be able, again,
to interpret our models,
to understand what consequences
the comparisons between model
and data that we're making
have for decision-making
in the company, and to be able
to communicate back to them
in terms that are going to be
understandable and actionable,
how our inferences should
change their decision-making.
So that absolutely is a
challenge and a process
that we've had to go through
over a period of about
four years now to do that.
And I think we're still
learning every day.
And it's a fantastic
opportunity for those of us
who are at Legendary to be
able to interact directly
with incredibly skilled,
talented, and experienced
creative people in our company,
for us to be able to build
that domain expertise.
And it's a pleasure to be able
to have that collaboration.
- And we have some
[? lunch ?] [? and ?] coffee,
but just to clarify for David,
is it right that you started
that group at Legendary, and
that that four years is how
long you've been doing this?
- So I was one of the first
people to join the group.
It was started by
Matt [INAUDIBLE],
who's our chief
analytics officer,
who came from a sports domain.
- OK.
- [INAUDIBLE].
- But then it was you and him,
and now it's how many people?
- So now we have about 70 people
across our applied analytics
division.
- In four years?
- Right.
- There's your answer, David.
OK, so what I'm
going to do now is
I'm going to suggest that
we thank all the speakers.
[APPLAUSE]
