[APPLAUSE]
>> One bit of housekeeping for
those who are particularly observant
out in the audience, I'm very pregnant.
>> [LAUGH]
>> And I'm also just getting over a head
cold, and that combination
of things sometimes means,
when I speak publicly,
I get a little out of breath.
Nobody panic,
it doesn't mean I'm going into labor.
>> [LAUGH]
>> It's just a reminder that sometimes I
have to slow down a little bit.
So with that out of the way,
let's get started.
[COUGH] This morning I'm going to tell
you a little bit about my organization,
who we are and what we do.
I'm going to tell you about the two
primary analytical methods that we
use to do our work.
I'll give you just a couple of
examples of code snippets to let you
get a sense of what our day-to-day work
is like and then, I'll wrap things up.
So I'm the executive director
of an organization called
the Human Rights Data Analysis Group or
HRDAG.
We're a non-profit based in
San Francisco and we are essentially
the behind the scene scientists for
human rights advocacy organizations.
And my co-founder, Dr. Patrick Ball, has
been doing this work for 25 years.
And over those two and a half decades,
our team collectively has been really
fortunate to partner with a wide variety
of human rights advocacy organizations.
Some of those are big international groups
you may have heard of: Human Rights Watch,
Amnesty International. We've worked with a
variety of branches of the United Nations.
We occasionally work
with truth commissions.
And we sometimes work with local
grassroots non-governmental organizations.
Now the shape of our work,
the outcome of our work,
really varies with the specific project and
the specific setting.
Sometimes we write memos,
sometimes we're a footnote,
sometimes we're a technical appendix.
But I think one of the most visible ways
that our analytical work comes into play,
is when we're asked to provide expert
witness testimony in court cases.
As Patrick Ball was in 2002,
at the International Criminal Tribunal for
the former Yugoslavia, when he was asked
to testify against Slobodan Milosevic.
And here he's presenting patterns of
refugee flow across the border into
Albania and patterns of known
killings during that conflict.
He also testified in Guatemala in 2013
against general Efraín Ríos Montt,
the former de facto leader
of Guatemala in the 80s.
And here he's presenting a comparison of
the relative risk of death for members of
the indigenous population as compared to
members of the non-indigenous population.
And this comparison allowed lawyers to
make an argument that the patterns of
violence were consistent with
what you would expect to see with
targeted killing aimed
at a specific ethnicity.
So patterns of violence that would
be consistent with acts of genocide.
Ríos Montt was found guilty of
acts of genocide. Unfortunately,
that verdict only stood for ten days.
The Guatemalan Constitutional Court
overturned that verdict on
a legal technicality, and
this case is continuing to wind its way
through Guatemalan courts.
[COUGH] Excuse me.
Most recently, in 2015 Patrick
testified again, this time in Senegal,
against the former leader of Chad,
Hissène Habré.
Here he's presenting the crude mortality
rate in the secret prisons run by
the secret police managed by Hissène Habré.
And this trial has the most positive
outcome of our stories so far.
In 2015, Patrick testified.
In 2016, the judges came
back with a guilty verdict,
sentenced Hissène Habré to life in jail.
So far, that verdict has stood.
So what are we actually doing when we're
presenting these results in court?
Well, for most of our projects, we rely
on two specific analytical methods.
And the first one is a big class
of methods called record linkage, known
in various other applications as database
de-duplication or entity resolution.
Fortunately for us this problem arises in
a wide variety of fields and applications.
And so this is an incomplete list
of the references that we rely
on in our evolving thinking
about this problem.
Given this conference, I mostly wanna
highlight the last bullet point, that we
rely very heavily on specific modules in
Python to carry out much of this work.
And that's the specific example
I'm gonna touch on in a minute.
But I do wanna mention,
the second analytical method we use,
which we refer to as
multiple systems estimation.
Now for the statisticians in the audience,
you'll know that MSE usually means
something totally different.
But we prefer to call this
multiple systems estimation.
It was developed 150
years ago in ecology for
studying the size of animal populations.
And there, it's called capture-recapture.
But as a human rights researcher, I try
to avoid talking about capturing humans.
So I prefer to call this method MSE.
It's a very broad class of methods.
Again, this is an incomplete list
of references that we rely on.
And to highlight that last bullet point,
right now we primarily implement this
work in R, using two specific packages,
that fortunately for
us were written by friends and colleagues.
The dga package was written by Kristian
Lum, James Johndrow, and Patrick Ball.
And LCMCR was written by
Daniel Manrique-Vallier.
And both of those implement MSE
models using [COUGH] excuse me,
using a Bayesian approach.
So I'm not gonna have time to talk much
about MSE today but during the Q and A or
offline afterwards I'd be
happy to get into that.
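As a rough illustration of the idea behind MSE, here is a sketch of its simplest possible form, the classic two-list Lincoln-Petersen estimator from capture-recapture, with entirely made-up numbers. This is not how HRDAG's analyses are actually done; real projects use three or more lists and the Bayesian models implemented in dga and LCMCR.

```python
# Simplest two-list capture-recapture (Lincoln-Petersen) estimate.
# Illustrative only: real MSE work uses 3+ overlapping lists and
# Bayesian models (e.g. the dga and LCMCR R packages).

def lincoln_petersen(n_a, n_b, n_both):
    """Estimate total population size from two overlapping lists.

    n_a    -- number of records on list A
    n_b    -- number of records on list B
    n_both -- number of records appearing on both lists
    """
    if n_both == 0:
        raise ValueError("no overlap between lists: estimate is undefined")
    return n_a * n_b / n_both

# Hypothetical example: two lists of documented victims.
# 900 + 600 - 300 = 1200 unique documented victims, but the overlap
# pattern suggests roughly 1800 victims in total.
estimate = lincoln_petersen(n_a=900, n_b=600, n_both=300)
print(round(estimate))  # 1800
```

The intuition: the smaller the overlap between independent lists, the more of the underlying population neither list has reached.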
To get into our specific example this
morning: this project started in 2012,
when the United Nations Office
of the High Commissioner for
Human Rights came to us and said, here
are four different sources that
are recording information about
people who've been killed in Syria.
And that's a little unusual for us.
Usually, we have one trusted
partner on the ground
who provides us with information.
We're not really sure how to make sense
of this particular data landscape.
And so
they asked us to help them analyze and
describe these four different data sets.
So for a couple of years,
we co-authored a variety of publications.
This one is somewhat out of date,
but it is the last official publication
that we wrote with the UN.
It's important to note that this headline
is missing two important modifiers.
This number describes the number
of documented identifiable
individuals who've been killed in
the ongoing conflict in Syria.
And if you think about violence
in conflict,
it should rapidly become obvious
that that's a subset of all victims,
because not every victim is identified.
Not every victim's story
is told right away.
It may be days or weeks or
months or years before we hear
certain stories from the conflict.
So inevitably, this is a subset
of the total number of victims.
And I'm going to come
back to that at the end.
So I just want you to sort of think
about what that means for our data and
our analysis.
But for now, we'll start with
what we do know, which is based,
currently on four sources: the Center for
Statistics and Research – Syria,
the Damascus Center for Human Rights
Studies, the Syrian Network for
Human Rights, and
the Violations Documentation Centre.
And each of these groups
collects lists of named victims.
And from the beginning of
the conflict through the end of 2015,
they've collected over 400,000
records of named victims.
Now I need to be very clear here:
that in no way implies that
that is the number of victims.
There is a lot of duplication
in that collection of records,
because many victims are reported
multiple times to the same source,
perhaps from different community members.
Or they're reported multiple
times to different sources.
Perhaps their family member reports them
to all of the sources
that they have access to.
So the first thing we have to do
is de-duplicate those records.
And what do those records look like?
This is a screen grab from
the public-facing website
of the Violations Documentation Centre.
This is part of their advocacy work.
They think and
we agree that it's important to
raise awareness about these victims.
To tell their stories.
To say their names.
And most of the lists
look similar to this.
They are the names of the victims,
the date and location where they died,
and any other demographic information
they've been able to collect.
And that's where we get to
this record linkage problem,
which HRDAG sort of organizes
in this very full diagram.
And I mostly just show this to say
there are a lot of steps involved.
I'm gonna tell you about
two of them today.
But what we do is supervised record linkage.
So all those symbols on the outside,
everything that starts with TS for
training set, is data that goes out for
human review.
And everything down the middle
is what we automate and code and
rely on computers to help us do.
If you wanna get into some of
the more technical details,
we've been building out
the tech corner on our website.
Last year Patrick wrote a geeky deep-dive
on some of our blocking techniques.
Last summer he also wrote about
clustering and solving the right problem.
I really enjoy this one
because he describes,
as I'm sure all of us have experienced,
really enthusiastically solving the wrong
problem multiple times,
before you find the right one.
But again, I'm just gonna focus
today very briefly on compare and
classify, kinda in
the middle of this diagram.
So we think about pairs of records, and
we just wanna ask ourselves:
does this pair of records refer
to the same person or not?
And if I have any Arabic speakers in the
audience who are watching the live stream,
I apologize when I mix Arabic and English,
one of them has to end up backwards.
So all the Arabic on these
slides is backwards.
But there are two ways we can
think about this comparison.
We can ask a human, do you think these two
records are the same person, yes or no?
That's how we create training data.
And we can also numerically
summarize how similar or
dissimilar those records are and
feed that into a computer model.
We can do all kinds of string metrics and
comparisons, and look at tokens,
and we could look at how many days
apart the dates of death are, and
how different, geographically,
the locations are.
And those are the two key
pieces of information
that we feed into a classification model.
We give them the human-labeled
training data.
And we give them the features,
the numerical summaries
of the similarities and
the differences between those records.
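As a sketch of what those numerical summaries might look like, here is a hypothetical pair comparison built only on standard-library metrics. The field names, the example records, and the choice of metrics are all illustrative assumptions, not HRDAG's actual feature set, which uses many more string metrics and handles Arabic names.

```python
# Sketch: turn a pair of records into a numeric feature vector.
# Field names and metrics are hypothetical, for illustration only.
from datetime import date
from difflib import SequenceMatcher

def compare_pair(rec1, rec2):
    """Return (name similarity, days between deaths, same location?)."""
    # Character-level similarity in [0, 1] between the two names.
    name_sim = SequenceMatcher(None, rec1["name"], rec2["name"]).ratio()
    # Absolute number of days between the two reported dates of death.
    days_apart = abs((rec1["date_of_death"] - rec2["date_of_death"]).days)
    # Crude geographic comparison: exact match on the location string.
    same_location = 1.0 if rec1["location"] == rec2["location"] else 0.0
    return (name_sim, days_apart, same_location)

# Two hypothetical records that plausibly describe the same person,
# with transliteration differences in the name and dates two days apart.
a = {"name": "Ahmad Khalid", "date_of_death": date(2013, 5, 1), "location": "Homs"}
b = {"name": "Ahmed Khaled", "date_of_death": date(2013, 5, 3), "location": "Homs"}
print(compare_pair(a, b))
```

Each pair of records becomes one such feature vector, and those vectors are what the classifier sees.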
Fortunately for us, scikit-learn has baked
in basically any classification model
you might wanna consider, and
a variety of diagnostic tools.
So just as an example, we might look
at RandomForest, DecisionTrees, and
LogisticRegression.
And again we feed them that
human-labeled training data, and
those feature vectors, that numerical
comparison of the pairs of records.
And then we're going to get back a score.
We call it a probability, playing fast and
loose with our terminology, but
essentially we're gonna get back a number
that we can set a threshold on, and
say: above that, I'm gonna
call this the same person;
below that, I'm gonna say these
are not the same person.
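The train-then-threshold step can be sketched like this, using one of the scikit-learn models mentioned above. The training pairs and threshold value here are made up for illustration; in practice the labels come from human review and the threshold is tuned with diagnostic tools.

```python
# Sketch: train a classifier on labeled pairs, then threshold its score.
# Feature vectors: (name similarity, days apart, same location).
# Labels and numbers are invented for illustration.
from sklearn.ensemble import RandomForestClassifier

X_train = [
    (0.95, 0, 1), (0.90, 2, 1), (0.88, 1, 0), (0.97, 3, 1),      # same person
    (0.30, 200, 0), (0.10, 45, 0), (0.25, 90, 1), (0.05, 300, 0),  # different
]
y_train = [1, 1, 1, 1, 0, 0, 0, 0]

model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X_train, y_train)

# Score a new pair: probability-like score that the pair is a match.
score = model.predict_proba([(0.92, 1, 1)])[0][1]

# Apply a decision threshold (0.5 here is an arbitrary example value).
THRESHOLD = 0.5
print("same person" if score >= THRESHOLD else "different people")
```

Swapping in `DecisionTreeClassifier` or `LogisticRegression` changes only the model line; the labeled-pairs-in, thresholded-score-out shape stays the same.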
And so that gets us to classification.
Now eventually we're going to have to work
our way through the rest of this diagram.
We're going to have to decide: if record
a and record b are the same person,
and record b and
record c are the same person,
what about record a and record c?
And we're going to have
to form some clusters.
We're going to have to
break up some clusters.
We're going to have to do some more
post-processing and human review.
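The cluster-forming step described above is essentially a transitive closure over matched pairs, which can be sketched with a union-find structure. This shows only the merging step; the cluster-breaking, post-processing, and human review are assumed away here.

```python
# Sketch: group matched record pairs into clusters via transitive
# closure (union-find). Record IDs are hypothetical placeholders.

def cluster_records(record_ids, matched_pairs):
    # Each record starts in its own cluster.
    parent = {r: r for r in record_ids}

    def find(r):
        # Follow parent pointers to the cluster representative,
        # compressing the path as we go.
        while parent[r] != r:
            parent[r] = parent[parent[r]]
            r = parent[r]
        return r

    # Merge the clusters of every pair the classifier called a match.
    for a, b in matched_pairs:
        parent[find(a)] = find(b)

    # Collect records by final representative.
    clusters = {}
    for r in record_ids:
        clusters.setdefault(find(r), set()).add(r)
    return list(clusters.values())

# If a~b and b~c were classified as matches, then a, b, and c land in
# one cluster even though a and c were never compared directly.
print(cluster_records(["a", "b", "c", "d"], [("a", "b"), ("b", "c")]))
```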
But eventually, we're gonna get to
the end of this record linkage problem.
And what we're going to have at the end
of that problem is a database that
tells us the number of uniquely
identifiable victims who've been
documented by one or
more of the sources that we have access to.
Now that's not the same as the question
that I started this talk with,
because I have slightly misled you.
We're only about halfway
through this problem.
Because what we have done now
is we've identified that subset
of identifiable victims.
And our job as data scientists is to
recognize that there is some amount
of the population that
we don't know about yet.
And we need to use the rest of the tools
in our toolbox, in our particular
case that means MSE, but we have other
options, to estimate what we don't know.
And so that's really the next step.
If we wanna be able to answer
the substantive question,
we have to do more than just look at what
we've been able to document and record.
Unfortunately, that's a whole other
talk that I don't have time to give
you this morning.
So I will go ahead and
stop there and say thank you.
I'm happy to take questions.
>> [APPLAUSE]
>> Thank you so much, Megan.
>> [LAUGH]
>> Perfect.
Okay, I'm sure there
are lots of questions.
Can we have a mic here, and maybe in
the meantime another mic is already ready?
>> No
>> No, [INAUDIBLE]
>> Hi Megan, great talk.
I have a big picture question not just for
you, but for the tech leaders here.
I'm a postdoctoral researcher
working on climate and energy data.
>> Great.
>> So, as you know, we're under immense
pressure under the new administration.
>> And we have been recently notified
that whatever research we do is going to
go under quote-unquote political review.
So I want to know if tech leaders will
be willing to step up to fill this gap
because this is gonna be the next
big humanitarian crisis and
we have been muzzled.
And so it's not just from
a business point of view but
from the independent
research point of view.
If they can step up and fill this gap.
>> I could not agree more.
It's easy for me to say,
since I work in the non-profit space,
to say we occupy that space now,
we're gonna keep occupying it.
I hope that other tech leaders
will step into it as well.
One of the things that
we all reflected on,
on our team after the election was that,
for the last 25 years we have
held state leaders accountable for
violations of human rights.
And we will continue to do
that at home and abroad.
[APPLAUSE]
>> Thank you so much for your talk.
My question gets at something that you
were talking about at the very end.
So we know, of course,
that many of the people that died in wars
like Syria are not killed
directly; they're
victims of disease and
indirect consequences of conflict.
So are you guys trying also, as you're
estimating the number of people who've
died, to get at those kinds of effects?
Or are you just trying to get at
the number you think have been
killed in actual violent events?
>> That's a great question.
Our particular project and
our particular analysis focuses on
what we refer to as conflict deaths.
So those are directly
related to the violence.
And the MSE methods that estimate what
we're missing, like any other inferential
method, are based on what we are able to
observe, and the lists are conflict deaths.
But if we did want to get at indirect
deaths, then we might use more
conventional public health approaches
like a retrospective mortality survey or
a household survey or other demographic
techniques to look at what we would expect
the population distribution to be,
to try and get at some of those questions.
We're not looking at
that right now in Syria.
We have historically taken that
approach in some of the other conflicts
we've worked in.
>> We have a Twitter question.
>> From Twitter we have: how do we
incentivize data science for
social good when it's often not coming
with a direct monetary benefit?
>> That is such a great question
that I have been pushing on
other people in my field really hard and
I wish I had a better answer.
Because I meet young folks all the time
who want to go into this field and
I want to encourage them, but
it is a hard reality that there
are not a lot of jobs in this field.
And I don't have a very good answer for
how we can grow the support for
the infrastructure that is going to be
required to create more organizations like
mine, more centers, more places that can
then hire folks to do this kind of work.
Unfortunately, I don't have
a good solution for that.
>> It would be great, actually, if
the big companies would dedicate
a certain percentage of their research
capability to those questions.
>> I would love that.
[LAUGH]
>> Wouldn't that be great.
I'm just putting a bug
in people's ears here.
>> [LAUGH]
>> Think about that.
We need it more than ever.
We have time for
one more question from the audience.
>> Hi, so kind of on that note, for those
of us who are still students who might be
interested in doing kind of data science
for public good, what recommendations
do you have for either skills or
experiences or things for us to go after?
>> Sure.
In terms of skills and experiences,
I think mostly it's a lot like
the last answer to that question:
it's being really creative
in your problem solving and
being really open-minded, because our
data always break all of our tools, and
so you have to kind of be willing and
comfortable thinking about weird corner
cases and building your methodology
out to suit your needs.
In terms of resources,
we can talk a little bit more offline,
but the main two that I usually recommend
to students, unfortunately,
they're not money-making, but
they're a good way to kind of get a sense
of the landscape. Within the American
Statistical Association, there is a group
called Statistics Without Borders.
And they're all volunteers, and anybody
can sign up for their newsletters, and
so you can get a sense of
the work that's being done.
And then there's also DataKind,
which is a volunteer organization.
It has chapters in a variety of locations
including here in the area and again,
it can be a way to get a sense of
the problems that are out there and
the questions that are being asked.
>> Fantastic.
Thank you so much, Megan, for your work,
for your talk, for everything.
Thank you.
>> [APPLAUSE]
