[MUSIC PLAYING]
SPEAKER: I'm very excited
today to introduce to you
Cathy O'Neil, who will be
speaking to us about her book,
"Weapons of Math Destruction."
Cathy is uniquely qualified to talk about this subject of mathematical models and how they can go bad.
She has a PhD in math from Harvard University.
She has held positions at
Barnard College and MIT.
She's worked at the
hedge fund DE Shaw,
and also as a data scientist
at various startups.
She also is active in the Occupy Wall Street Movement, especially in their alternative banking group.
She blogs at mathbabe.org.
It's a great blog, you
should check it out.
She also participates in the Slate Money podcast.
And she is the author or
co-author of two other books,
"Doing Data Science" and
"On Being a Data Skeptic."
Please join me in
introducing Cathy O'Neil.
[APPLAUSE]
CATHY O'NEIL: Hi, guys.
Thank you so much for coming.
So I'm a math nerd.
I love math.
I was attracted to math in high
school, because it was true.
It was beautiful.
Like, you could
disagree with somebody
about politics and
everything, about what
Manifest Destiny really meant.
But you'd have to agree on
math, because you guys would
be so careful with
your assumptions
and then your logical arguments.
I thought of it as something--
wouldn't it be beautiful
if we could sort of
port this honesty
and truth and clarity to other
fields of the real world?
So after I became an assistant professor at Barnard College, I realized that it wasn't the right place for me, that I wanted to have more to do with the real world.
And it was 2007, so I did
what everybody else does,
and I went to a hedge fund.
And quickly realized that--
well, the world blew up almost
the moment I stepped
into the hedge fund.
I was working with the
experts like Larry Summers.
I worked with him on
a couple projects.
I was very disillusioned by what
the experts seemed to actually
know about what was going on.
But I was most disillusioned--
and actually ashamed,
to be honest-- about one of
the most fundamental reasons
that the financial
crisis happened.
Which was the
triple-A ratings on
the mortgage-backed
securities, which
I consider a mathematical lie.
It was the opposite
of how I had fallen
in love with
mathematics, as something
that would bring forth clarity.
It really was hiding corrupt
practices behind mathematics.
Like trust the math, be
impressed by the fact
that we have a bunch of PhDs
in math, who are supposedly
carefully going
through this data
to double-check, triple-check that these mortgage-backed securities are very unrisky.
Trust us.
There's math here.
And invest in these
very opaque instruments.
And I'm not saying that's
the only thing that happened.
Obviously, the mortgage brokers
made a bunch of bad mortgages.
But the reason that they were
able to build this machine
and scale it to the
massive international scale
that it became was
largely because
of the triple-A ratings,
because of the trust.
And that was trust
in mathematics.
And it was abused.
I left finance after a
small stint in risk--
which I could go
into if you guys are
interested in the Q&A--
where I was equally
disillusioned by the part I was
playing in particular in risk.
And I left finance altogether,
started my blog MathBabe.
The idea was to sort of
expose corrupt practices
and mathematical
abuse, as I saw it,
which would happen in finance.
In the meantime, I
needed a day job.
So I became a data scientist.
Very easily, I should say.
I just renamed myself
a data scientist
and got a job the next day.
Well, because I was qualified.
Because what do you do when
you're a data scientist?
You predict people instead
of predicting markets.
I'd been predicting markets,
futures and credit default
swaps and stuff.
So it wasn't that much of a
stretch to predict people.
In fact, it was
very, very simple.
And I wanted to think
that what I was doing
was less destructive than how
I'd been feeling in finance.
But what I realized pretty
quickly within a year or so
is that I was just very
much part of something
that I still
considered not benign.
Like actually negative.
And what I realized was the difference: the havoc created by the financial crisis was something everybody noticed-- the financial systems of the entire world were at risk, and everyone was worried-- whereas the havoc being wreaked in the data science world was happening at the individual level.
Individuals were
being lost or they
were being deemed losers
or winners by this stuff.
And moreover, what was most
disillusioning and kind of
frightening for me-- because I
was still kind of idealistic--
was that these were all
being sort of marketed
as objective and fair, these
algorithms, when in fact they
were not that.
Algorithms are nothing more
than opinions embedded in code.
So let me give you an example.
And you guys are
technical, so you'll
understand this quite well.
This is my son.
I like to give the
example of what
is an algorithm by
using the example
of my own internal algorithm for
cooking dinner for my family.
So an algorithm essentially
has two big choices,
lots of little choices,
which I'll ignore.
But the big choices are the data
that you train your algorithm
on, and the
definition of success,
or the objective function,
which also includes
your penalty for mistakes.
But ignoring that for now.
So what's the data going into
my algorithm for cooking a meal?
Well, the food I have in my
kitchen, the time I have,
the ambition I have.
By the way, I should say,
I curate that already.
I curate the food.
I do not consider
ramen noodles food.
My teenagers do.
I have teenage sons as well.
So I'm already imposing
my definition of food.
And then the definition of
success for me for dinner
is whether my kids
ate vegetables.
Now, my 7-year-old, if he were in charge-- this guy right here, Wolfy-- he would define success for a meal as whether he got a lot of Nutella.
And the reason this matters is
because over time, we optimize
our algorithms for success.
That's why we define it, right?
We want to get closer and
closer to that success.
So over time, that means I make
more and more meals that have
been successful in the past.
I train my algorithm to success.
That is a very
different succession
of meals than my son's
algorithm would be,
if we optimized on Nutella.
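Here is a minimal sketch of that idea in code, with made-up meals and made-up point values: the same training data produces a very different menu depending on whose definition of success you optimize.

```python
# Minimal sketch: one "algorithm", two definitions of success.
# All meals and point values are invented for illustration.

meals = [
    {"name": "stir-fry",        "vegetables": 3, "nutella": 0},
    {"name": "pasta with kale", "vegetables": 2, "nutella": 0},
    {"name": "crepes",          "vegetables": 0, "nutella": 2},
    {"name": "nutella toast",   "vegetables": 0, "nutella": 3},
]

def parent_success(meal):
    # My definition of success: did the kids eat vegetables?
    return meal["vegetables"]

def kid_success(meal):
    # My son's definition of success: did he get a lot of Nutella?
    return meal["nutella"]

def plan_dinners(history, success, nights=3):
    # "Optimizing" here just means repeating the meals that scored best
    # under the chosen definition of success in the past.
    ranked = sorted(history, key=success, reverse=True)
    return [meal["name"] for meal in ranked[:nights]]

print("Parent's algorithm:", plan_dinners(meals, parent_success))
print("Kid's algorithm:   ", plan_dinners(meals, kid_success))
# Same data, different objective function, very different succession of meals.
```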
So that's me imposing my
agenda onto the algorithm.
We always do that.
There is no such thing as an objective algorithm,
because at the very
least, the person building
the algorithm defines success.
And that's usually
success for them.
But there's often other
stakeholders in play,
and the question
is, is that success
for the stakeholders as well?
It's not always.
Generally speaking, the pattern I found looking through more and more data science was that we were increasingly marketing these things as fair and objective, as following the numbers, as mathematical-- but there were people behind them building the definitions of success, these objective functions, and those people were essentially invisible behind that mathematical shield.
A lot of these algorithms
were benign or fine or great.
But some of them were
really not great,
and I focus on those in my book.
I focus on those,
because I do not
like vague discussions
about what could go wrong
with a bunch of big data.
I want to know exactly what
is going wrong right now.
It's a kind of triage on
the world of algorithms.
So I focus specifically on
the worst possible algorithms.
If you're wondering why
I'm so negative in my book,
I'm not actually negative.
I love data.
I'm a data scientist still,
and I believe in data.
But I do want to make it
very clear to the public--
and this book is
written for the public--
that some of the times, this
stuff is really fucked up.
And I define that with
three characteristics.
A weapon of math destruction
is one of these really terrible
algorithms.
They have three characteristics.
One is that they're widespread.
Important.
OK, so nobody cares
about the algorithm
I make food for my kids with.
That is not important.
What makes something
important is
it happens to a lot
of people, and it
makes big decisions for them.
Of course, algorithms
don't make big decisions.
To be clear, it's
humans that have set up
processes by which a
score of an algorithm
ends up being a decision.
But I'm going to-- let's
just put that aside for now.
If people's life options-- important life options,
like whether they get a job,
or whether they go to school,
whether they get a loan, how
much they pay for insurance,
how long they go to jail.
If those kinds of things
are determined in part
by algorithms, essentially
mostly scoring systems,
then they're important.
They're widespread
and important.
The second characteristic of
things I'm really worried about
is the mysteriousness,
the secrecy.
These scoring algorithms in general aren't transparent-- the formulas are not available to the people who are being scored by them.
Often, they are not even aware
that they are being scored.
And because they are
important, that is not OK.
Because they are
tantamount to laws,
it is not OK for
them to be secret.
We have a constitutional
right to know what
the laws of our country are.
And finally, I care
about algorithms not
that are improving the world--
I mean, those are great--
but that are destructive.
So I'm really focused on
things that are destructive.
And I mean that in two ways.
Like the first way I
mean that is that they
ruin people's lives unfairly.
So those people
whose important life
decisions are being
informed by these scores,
they're getting unfairly
booted from opportunities.
Opportunities are unfairly
being taken from them.
Moreover, they're destructive
in a larger sense.
They usually set out to solve some larger problem-- often with good intentions.
But not only do they fail
to solve that problem,
but they actually make
the problem worse.
They usually create a destructive feedback loop.
And I'm going to give
you a bunch of examples,
and then we'll have Q&A.
So the first example comes
from teacher assessment.
We've had basically a two-decade-long war on teachers in this country.
The idea is we have a
problem with education,
let's find the bad teachers
and get rid of them.
And I will put aside
for now the question
of whether we could
actually solve our problems.
And the big problem
being the achievement
gap between rich and poor kids.
Could we solve that by
getting rid of bad teachers?
That's a different question.
The point is that we have been
looking for these bad teachers
so we can get rid of
them for two decades.
The first generation of teacher
assessment tools was really,
really stupid and unfair.
And here's how it worked: look for teachers a majority of whose students did not pass some proficiency standard on their standardized tests.
Now, the thing you need
to know to understand this
is that there's a strong
correlation between scores
on standardized tests and poverty.
So poor kids just don't do as
well on standardized tests.
That's not specific
to the United States.
That's true over time.
That's true internationally.
But what it means is that if you
target a teacher, if you label
them as bad because
a lot of their kids
do not pass a certain
threshold of proficiency,
you're basically targeting
teachers of poor kids.
So that was clearly unfair.
So that was discarded.
The second generation was
meant to be more fair,
and it had good intentions,
but it didn't work.
And let me tell you what it was.
It was called the
value-added teacher model.
And the idea here is
that you can't really
blame a teacher if a kid gets
like a 60 in fourth grade,
and they're the
fifth grade teacher.
You can't blame them
if they don't get 100
at the end of
fifth grade, right?
They're starting from 60.
So it's just better if they
get more than expected, right?
So the idea was, there
was a primary model, which
is the expected score for
each student in a class.
And then there was
the actual score
that that kid got at
the end of the year.
And the teacher was basically held accountable for the difference between those two scores.
So if their expected score
was 62, but they got to 65,
the teacher was sort of given
credit for those three points.
If they were supposed to
get a 62, but they got a 55,
they were dinged for those seven
points that they didn't get.
Does that make sense?
Now the problem with this-- it
actually makes sense, right?
And if you have like a
Google-level test of this,
where instead of having
25 kids in the class,
you had 20,000
kids in this class,
then the statistical signal
would be pretty clean.
The problem is we
do not have that.
So what we have instead
is a pretty bad model,
which is estimating the
expected score for a kid
at the end of the year.
It's actually really
hard to take a kid
at the end of fourth
grade and estimate
what they're going to get
at the end of fifth grade.
It's not a precise thing.
There's a lot of uncertainty,
especially for kids in poverty.
Then the second
source of uncertainty
is like what they actually
got at the end of fifth grade.
Like, they could
have gotten more
if they'd taken the
test in the morning
instead of the afternoon, or in
an air-conditioned room instead
of a hot room, or after
they ate a meal because they
were hungry.
So there's a lot of uncertainty in both those numbers.
And then you're
taking the difference.
If you think about it statistically,
that's the noise term.
It's called the error term
of that expected score model.
So you're holding teachers
accountable for the error
terms of their students.
Turns out, it's not
a very good model.
It's noisy-- very, very noisy.
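A rough simulation of why, using invented numbers: suppose every teacher adds exactly zero value, the expected-score model has some per-student error, and the test itself has its own per-student noise. With 25 students per class the "value-added" scores still spread out widely; with 20,000 students they collapse toward zero.

```python
# Rough sketch of why value-added scores are noisy at classroom scale.
# All numbers are invented; every simulated teacher adds exactly zero value,
# so any spread in the scores below is pure noise.
import random
import statistics

def value_added_score(class_size, model_error_sd=10, test_noise_sd=10):
    # Per student: (actual score) - (expected score). With zero true effect,
    # that difference is just the sum of two error terms.
    diffs = [random.gauss(0, model_error_sd) + random.gauss(0, test_noise_sd)
             for _ in range(class_size)]
    return statistics.mean(diffs)

random.seed(0)
for class_size in (25, 20_000):
    scores = [value_added_score(class_size) for _ in range(1000)]
    print(f"class size {class_size:>6}: "
          f"spread (sd) of teacher scores = {statistics.stdev(scores):.2f}")
# The 25-student classes show a wide spread of "value added" even though
# no teacher adds anything; the 20,000-student classes barely move.
```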
Now, I don't know this because I have the algorithm-- I don't have it, though I tried to get it.
I did a Freedom
of Information Act
request for the source
code, but it was denied me.
I ended up talking to
someone in Wisconsin.
This is for the New
York City version.
There's versions all
over the country.
But I focus on the
New York City version.
And I eventually
talked to the people
there in Madison, Wisconsin
who built that model,
and they explained to me that
I would never get the source
code for that, because in
fact, nobody in New York City
got the source code for that.
Because by the contract,
it was stipulated
that it was proprietary.
So no one in the
Department of Education
understood how these teachers
were being evaluated.
And yet these scores were being
used to deny people tenure.
So I should mention
the reason why
I thought I might be
able to get the source
code through a Freedom
of Information Act is
because "The New York
Post" actually did
get the scores with
all the teachers' names
through a Freedom of
Information Act request.
And they published them all as
an act of shaming the teachers
with bad scores.
And I thought, OK, maybe
I can get the source code.
If you can get the
scores, I can get the way
the scores were made, right?
No, I couldn't.
But here's the
thing that happened.
This really smart high school
math teacher at Stuyvesant High
School, Gary Rubinstein,
he took those same numbers
that "The New York
Post" had gotten,
and he found teachers that had
two scores for the same year.
You can get two scores if
you teach seventh grade math
and eighth grade math.
And he figured, if
they're both supposed
to be like the final say on
whether you're a good teacher,
they should be
consistent, right?
So if you get a 75 for
seventh grade math,
you should get a 78 maybe
for eighth grade math, right?
Or a 72.
So he plotted them
on a scatter plot.
[LAUGHTER]
It looks almost like a uniform distribution, which is to say, almost random. It's actually a 24% correlation.
There is some
signal in it, which
is why I said the
thing about if you
had 10,000 kids
in your class, you
might actually know whether
your teacher brings up scores.
But it's simply not
good enough to hold
someone accountable for
that at an individual level.
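Rubinstein's consistency check is easy to sketch with simulated scores (the numbers below are invented): if a teacher's score were mostly signal, two scores for the same teacher in the same year would correlate strongly; if it's mostly noise, the correlation collapses toward the kind of figure in that scatter plot.

```python
# Sketch of Rubinstein's consistency check, with simulated scores.
# Each teacher gets two scores in one year (e.g. 7th and 8th grade math).
# A score = shared "true quality" signal + independent noise.
import random

def pearson(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx ** 0.5 * vy ** 0.5)

def paired_scores(n_teachers, signal_sd, noise_sd):
    pairs = []
    for _ in range(n_teachers):
        quality = random.gauss(0, signal_sd)   # what the score claims to measure
        pairs.append((quality + random.gauss(0, noise_sd),
                      quality + random.gauss(0, noise_sd)))
    return pairs

random.seed(1)
for noise_sd in (1, 4):   # a low-noise vs a high-noise scoring system
    xs, ys = zip(*paired_scores(5000, signal_sd=2, noise_sd=noise_sd))
    print(f"noise sd {noise_sd}: correlation between a teacher's two scores "
          f"= {pearson(xs, ys):.2f}")
# If the same teacher's two scores barely correlate, the score is mostly
# measuring noise, not the teacher.
```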
But we trusted it
anyway-- when I say we,
I mean the DOE--
essentially because they
didn't understand it.
In spite of that, Sarah Wysocki, who's pictured here, was fired in Washington, DC in 2011 after she got a bad teacher assessment, 50% of which was her teacher value-added model score.
I should say it's 50% of her
score, but it's more than 50%
of the variance of the score.
The big complaint of the way
that teachers were assessed
was that everybody got the same
grade from their principal.
The principals would
categorize them
all as acceptable teachers.
So they're like, we need some spread in these scores so we can rank the teachers and find the worst 2%.
So they got spread, but the
question was, is it meaningful?
I claim it's not meaningful,
from the scatterplot we just
saw.
But it was particularly irksome
for Sarah because guess what?
She was teaching fifth grade. A bunch of her students came in with very high scores from the end of fourth grade but couldn't read or write.
And it turns out that in
the Washington, DC district,
Michelle Rhee had instituted
both carrots and sticks.
You get a bonus if you get a high value-added model score or a high teacher assessment.
You get fired if
you get a bad one.
Moreover, she found out that the school where some of her kids came from had exceptionally high erasure rates on their end-of-year standardized tests.
So she has reason to believe
that the previous teachers
actually cheated on the test.
If you think about
what that would mean,
those kids get elevated
scores, and they're
expected to keep them elevated
at the end of her class.
But she just got them.
She taught them
well, but they didn't
get ridiculously
high scores, so she
got punished for the
previous teacher's cheating.
I should mention that she got a new job, a couple days after she got fired, in an affluent suburb of Washington, DC-- because most of the places where the value-added model is being used, and I think it's actually more than that at this point, are urban school districts.
So that brings me
to the failure.
Remember I said there's
failure on-- well,
let me just go back
over the three things.
It's widespread,
secret, and it's
destructive on an
individual level--
so we saw Sarah losing her
job-- and at a systemic level.
I claim that the
value-added model does not
get rid of bad teachers.
Remember the whole point was get
rid of the bad teachers, that
will solve the problem
with education.
We're not getting
rid of bad teachers.
We're getting rid
of good teachers.
We got rid of Sarah,
which was just dumb.
But there are
countless teachers who
have quit, who
have retired early,
who have not gone
into teaching--
we have a national
teacher shortage--
or have fled to
the suburbs, where
they don't use this
arbitrary and punitive regime
for teachers.
My next example comes from the
world of personality tests.
This is Kyle Behm.
He was a college student
in the Atlanta area.
He wanted to get a job after
school at a grocery store--
Kroger's.
And he took a personality test.
He failed it.
Most people never find
out they failed it,
but he found out,
because his friend worked
at Kroger's and told him.
So he was unusual in finding out that he got red-lighted by this algorithm, which was built by Kronos, a small big data company, actually around this neighborhood.
The other thing that
was unusual about Kyle
is that his father is a lawyer.
Most people that are applying
for minimum wage jobs,
which this was, do not have
fathers that are lawyers.
So his father said what were
the questions like on this test?
And Kyle said, they were
a lot like the questions
I got at the hospital
when I was being
treated for bipolar disorder.
It's called the Five Factor Model.
And his father said,
well that's illegal.
Under the Americans with Disabilities Act,
it's illegal to have a
health exam-- including
a mental health exam-- as
part of a job application.
And he said, well, can
you apply to other places?
Kyle ended up applying to six other places-- Lowe's, Yum Brands, which owns Taco Bell, other big companies.
They all had exactly the
same personality test,
and he failed all of them.
So his father is suing those seven companies in a class action lawsuit, on behalf of anyone who's ever taken that test, on the grounds that it is illegal-- it violates the Americans with Disabilities Act.
So that's widespread.
It's secret.
I should mention personality tests are not new.
RadioShack's old sort of
easily gamed personality test
looked a lot like this.
"Agree or disagree--
I'm always happy."
I think your employer wants
you to agree with that.
These are the more
recent ones, much harder.
"What do you agree with more?
I sometimes get confused by
my thoughts and feelings,
or I don't really like it
when I have to do something
I haven't done before."
Imagine doing 50 of those.
So it's absolutely inscrutable.
So it's secret, and I
would say it's destructive.
It's destructive
for Kyle, because he
didn't get a job at any
of those big companies
he applied to in the
entire area of Atlanta.
But it's also destructive
in a larger sense,
if what we're worried
about is true,
that they're actually sort
of systematically denying
employment to an entire
subpopulation-- people
with mental health problems.
It's not relegated
to minimum wage
work, this idea of
having algorithms
that sort through resumes.
Do you guys know who this is?
This is Roger Ailes.
He recently got kicked
out of Fox News.
He led Fox News for a long time.
He got kicked out for like
basically sexually harassing
women and keeping them
from getting promoted.
So I was doing a
thought experiment.
I don't know if Fox News uses
an algorithm to hire people,
but I'm just imagining
that they do.
Play along with me.
What does that
algorithm take in?
Well, it takes in data, probably
historical data from Fox News.
So all the history of people who've applied to be, say, anchors at Fox News.
And then it takes a
definition of success.
What does a successful
employee at Fox News look like?
Let's stipulate a successful
employee at Fox News
stays for three years and
gets promoted at least once.
OK, that's typical for how
these algorithms are designed,
and how they're trained.
So if you think about
what that means,
the algorithm, if you've trained
on a lot of historical data,
what it will do
is given a new set
of applicants for an
anchor job, it will say,
who among these new
applicants looks like someone
who was successful in the past?
And I would not be surprised
if it filtered out women,
because not only did
women not necessarily
get that job as often, but
when they did get the job,
they were systematically
pushed out
by Roger Ailes and the
entire culture there.
Does that make sense?
So the point here-- and this is a very important point-- is that algorithms are not inherently objective or fair. All they are doing-- what they're really good at-- is picking out past patterns and repeating them.
Which is to say, if we had a
perfect way of hiring people,
we would want to codify that.
We would want to
automate that, because it
would save us money and time.
But until we have a perfect way
of doing this, all we're doing
is we are literally
codifying past practices
and propagating them.
My last example comes
from criminal justice.
There's actually
two different kinds
of algorithms that worry me
a lot in criminal justice.
The first is
predictive policing.
You guys know all about
the Black Lives Matter
movement, which is objecting
to black people getting shot.
But of course it's not just being shot. It's also the amount of over-policing and uneven policing that's happening to poor black communities.
Some evidence: whites and blacks smoke pot at similar rates-- whites a little bit more if they're young.
But blacks get arrested a lot
more-- a lot more for that.
In fact, depending
on the jurisdiction,
blacks can get arrested
up to 10 times more
often than whites
for smoking pot.
So the two things I want
you to take away from this
are it's biased against
blacks, and actually it
depends very much on local
conditions of how the police
force is expected to act.
So when you hear
about arrest records,
I want you to think just as
much about police practices
as you do about crime, at least
when it's nonviolent crime.
And I do make a distinction
between violent crime
and nonviolent crime.
The problem is that nonviolent
crime is much more prevalent,
and it is much more predictable.
Think about it.
We have people being arrested in
this country for mental health
problems-- when they're
poor, not when they're rich.
We have people being arrested
for addiction problems
when they're poor.
And we have people being
arrested much more often
for low level, nonviolent
crime like drug use.
And for that matter, for
like crimes of poverty.
Just like literally not having
a place to go to the bathroom.
And that goes into
their arrest records.
And the reason I'm
mentioning this
is because the way predictive
policing works is very simple,
and it's actually very stupid.
It's geolocated arrest records.
And the algorithm
says go put police
where we saw a crime in the
past, which is really where
we saw arrests in the past.
Where we saw police arresting
people for things in the past.
This is just as much a
way of predicting police
as it is of predicting crime.
I'm just reiterating that.
It might be different, by the
way, if we really focused only
on violent crime.
Imagine that.
Imagine a predictive
policing algorithm
where they only focus on
violent crime like murder.
The problem is that murder
is really hard to predict.
Like, even if you did predict murder, what would you do-- stand outside a house waiting for someone to get murdered?
The problem is that
the things that
are actually easy to predict
are things like poverty.
So my claim is that predictive
policing is more or less
creating a feedback
loop, where you're
having a pseudo-scientific
basis for sending police back
to neighborhoods that
are already over policed.
Said another way, another
thought experiment,
imagine that if after
the financial crisis,
all the cops had been
told to go to Wall Street
and arrest the
bankers and find out
if they had cocaine
in their pockets,
because they all
do-- not all of them.
Then the police records-- the
police data, which is just
a reflection of what police
do-- the data would tell them
in these predictive
policing algorithms,
go back to Wall Street, because
that's where the crime is.
But that's not what we did.
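Here is a minimal simulation of that feedback loop, with invented neighborhoods and invented numbers: two neighborhoods have the same underlying rate of nonviolent offending, but one starts with more patrols, so it produces more arrests, so the "predictive" step keeps sending the patrols back there.

```python
# Minimal sketch of a predictive-policing feedback loop.
# Both neighborhoods have the SAME true offense rate; only the initial
# patrol allocation differs. All numbers are invented.
import random

TRUE_OFFENSE_RATE = 0.1   # identical everywhere
TOTAL_PATROLS = 100

def run(rounds=5):
    random.seed(0)
    patrols = {"neighborhood A": 70, "neighborhood B": 30}  # biased start
    for r in range(rounds):
        arrests = {}
        for hood, n_patrols in patrols.items():
            # Arrests happen where police are looking, not where crime is worse.
            arrests[hood] = sum(random.random() < TRUE_OFFENSE_RATE
                                for _ in range(n_patrols * 10))
        total = sum(arrests.values()) or 1
        # "Predictive" step: allocate next round's patrols proportionally
        # to past arrests -- i.e. to past police activity.
        patrols = {hood: round(TOTAL_PATROLS * arrests[hood] / total)
                   for hood in patrols}
        print(f"round {r}: arrests={arrests} -> next patrols={patrols}")

run()
# The initial imbalance never gets corrected: the arrest data keeps
# "confirming" that neighborhood A is where the crime is, and the algorithm
# keeps sending police back there.
```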
So the next example also in
the criminal justice system
is recidivism risk.
This is a score.
Recidivism, by the way,
is coming back to jail.
So recidivism risk is the risk
somebody comes back to jail.
Recidivism risk scores are
given to judges when they
sentence criminal defendants.
Now, there's two kinds of
data that goes into that.
One is arrest records.
We just talked about
arrest records.
They're very biased.
The second is a questionnaire.
The most commonly
used recidivism
risk algorithm is called the LSI-R.
And I'm going to show you a
couple of the questions that
are asked of the
defendant in the LSI-R.
Number 29. "Do you live in
a high crime neighborhood?"
It's a proxy for class
and race, because that
is pointing to people who
are already poor and black.
And by the way, if you say
yes or no to these things,
it goes exactly as you imagine.
If you say yes to "I have
a mental health problem,"
you're a higher risk.
And I forgot to mention,
if you're a higher risk,
a judge will sentence
you to longer in jail.
I mean, I mention that
because it's not obvious
that that's what
you would do, right?
Like, if you're a higher
risk of recidivism,
you get put in jail longer.
It's a little bit Minority Report-ish,
because what you're doing
is you're preemptively
punishing someone for something
they haven't done yet.
But that is the practice
that judges now have.
Here's another set of questions.
"Have you been
suspended from school?"
Number 17.
I'll show you a plot of just how
much that is a proxy for race.
Black girls and
boys are much more
likely to be
suspended from school.
But I think the thing
that bothers me the most--
and I think should bother
absolutely everyone-- is
number 26.
"Was somebody in your
family in prison?"
This is something
that in an open court
would be thrown out by a
judge as unconstitutional.
If a lawyer said, "Your honor,
please sentence this person
to longer, because their
father was also a criminal."
That is not how we do it.
But because it's being
embedded in a risk score, which
is being claimed
to be scientific,
this somehow has
the authenticity
of mathematics and
science, and again,
is being used to send
people to jail for longer.
So I talk about failures.
I should also mention,
this is important.
It's being used in more
than half the states.
It's secret.
People do not
understand what they're
getting into with these scores.
And judges are actually very
secretive about exactly how
much weight they
put on these scores.
But my claim is that they are
the most destructive thing
you can imagine.
They create their own reality.
If you're a high risk, you're
sentenced to jail for longer.
And guess what?
If you're sentenced to prison--
if you're in prison longer,
you don't tend to benefit
from that experience.
You end up out of jail-- 97% of
people eventually leave prison,
I should mention.
So this happens.
But you end up
with no resources,
no connections to your
community, very little wealth.
You often have a felony to your name, so it's hard to get a job.
And then you end
up back in prison,
partly because you've
got this high risk score.
So you were deemed
high risk, and then you
end up back in prison.
So I'm almost done.
Those are my examples.
But I do want to mention
again that I don't hate data.
I just really think
we have not yet
started understanding what it
means to build safe algorithms.
It's like we are
building cars and just
putting them on the road
without understanding
that cars can kill people.
So I want data
scientists to take
their ethical
responsibility seriously,
which means building some
kind of ethical framework,
like a Hippocratic oath
for data scientists.
I also think that
we need to learn
how to scrutinize these
algorithms, to monitor them,
to audit them for
safety, for fairness,
and for discrimination, and
for meaning-- making sure
that we're actually
building meaningful things.
The teacher value-added
model was not meaningful.
And I also think
that in situations
where it's very, very important
to a given person's life
how they're being scored,
they should have the right
to scrutinize that score.
Like the teachers.
If they are being given
a score that will maybe
make them fired, they should
be able to understand exactly
how that scoring system works.
Because to be
clear, the teachers
were not-- even when they
appealed their score,
they were not told how they
were being actually evaluated.
And that's not right.
Thank you guys.
[APPLAUSE]
If you have any
questions, I'm here.
AUDIENCE: What's the best way
to account for and mitigate
limitations in a model?
So suppose you have to come up with a mathematical model that affects somebody's life. How do you figure out where the boundaries of application are?
And how do you I guess
iteratively readdress
that to see if it's doing
what you expect it to?
CATHY O'NEIL: I mean, it's
really a vague question,
so it's hard for me to answer.
I'll tell you how frustrating
it is as a data scientist.
I'm a data scientist.
Like, I worked in the
city hall of New York.
And I was asked to
use the data they
had to figure out how
long a family was going
to be in homeless services.
And I had race, I had the number of children, I had whether the parents had been in social services.
I had all this data.
But they weren't
telling me how they
were going to use this
algorithm-- this scoring
system, once I had it.
So in particular,
one thing I didn't
know whether I should use
was the attribute race,
or all the other
things that are proxies
to race.
Or should I decorrelate those
other things from the race?
How is this going to be used?
Number one possibility.
If you are high
risk, then you're
going to be put
into worse housing.
If you're expected to be in long-term housing, you're going to be put into worse housing.
If I knew that was how it was going to be used, then I would know that for people at higher risk, there'd be a disproportionate impact racially, right?
But if instead it was being
used to sort of figure
out interventions
so that-- maybe they
wanted to understand
why black families were
in homeless services for
longer, and to intervene
to try to make that
discrepancy smaller,
then it would by all means make sense to have race as an attribute.
Does that make sense?
So I'm basically not
answering your question.
But I am making it clear that
you cannot answer that question
until you really know the
use case for this algorithm.
And I should also
add that a given
algorithm could have a
positive or negative effect
on the world.
It's really tricky.
Let me give you an example.
Health.
Everyone's talking
about how great
it is that we can predict
each other's health.
Well that's great if your
doctor has that algorithm
and can keep you well.
But it's not great if
an insurance company
has that algorithm and
can charge you more
if you're about to get sick.
Or if Walmart-- and I'm not saying this is happening-- put a health risk score on top of everybody who's applying for a job, and says, we don't want to spend money on our people's insurance, so we're not going to hire people with high health risks.
I mean, just saying, same
algorithm, different use cases.
Could be used for good or bad.
So the answer is,
there is no rule here.
Algorithms are decision
making processes
that are as complicated
as anything in the world.
So we can't pretend
that there are
like formulas for how to use
them and how to make them safe.
They're very, very contextual.
AUDIENCE: Do you have any sense
of-- this is probably a naive
question-- whether the
scale of the problem
is bigger or smaller in a
country with stronger libel
and slander laws, like
for example, England?
CATHY O'NEIL: Say more.
Why would you imagine that?
AUDIENCE: Oh, why would I
imagine it would be different?
CATHY O'NEIL: Yeah.
AUDIENCE: Well, I guess
the people who are victims
probably don't have
the resources to pursue
slander cases, anyways.
CATHY O'NEIL: I think that's
one very important point.
Thank you for making it.
Which is that one of the
commonalities of almost all
these algorithms is
that the people who
are losing by these
scoring systems
are often the most vulnerable
people in our society.
So generally speaking,
they don't have
lawyers to protect them, right?
But I would also
say the following.
Anonymization, which is a tool
that people often bring up
as like a way of solving
some of these problems,
is not really a solution to me.
Let's think about the medical
model I just mentioned.
Like let's say I'm Walmart--
and again, Walmart's not--
I'm a large employer, and
I get long term health data
off of all my employees, because
I force them to use Fitbits
or whatever I do.
I get long term health outcomes.
And I don't charge them.
They're already my employees.
I treat them well.
But the point is, I can
build the algorithm.
I can build the neural
network, what have you.
I can train it so that when
someone applies to my job,
I only have to ask
them six questions,
and I've already categorized--
I've segmented them
into a risk group score.
And it is completely anonymous.
In other words, you can build
an algorithm anonymously,
and still apply it to someone
in a very precise way,
and it still can be problematic.
So for me, it's not
about anonymity.
AUDIENCE: So in a
lot of the reporting
that I have read about
this sort of thing--
and I'm thinking specifically
for example the [INAUDIBLE]
thing on the criminal
justice, the sentencing thing.
I haven't seen a lot of talk
about how these algorithms
perform relative to-- in this example-- just a judge without this score, or what the preexisting case was.
CATHY O'NEIL: Yeah, great point.
And I'm desperate
for that, right?
Because the point is
that the justice system
is very racist already.
The intention of
this recidivism stuff
was to make judges more
objective and less racist.
And it might be doing that.
With all the flaws that I
just mentioned, which I really
believe in, it
might actually still
be better than what
we have already.
We do not have data for that,
and the Department of Justice
is not coughing up that data.
I've tried.
I've tried to get data to
do an audit of the DOJ.
It should be possible.
The data exists, because some
jurisdictions have this stuff
in use, and some of them don't.
You can just compare
them before and after
or whatever, and compare
it to each other.
I should also add that
like here's the good news.
If we made those recidivism risk
algorithms actually not racist,
then that would definitely be
better than the current judges.
It's like we threw
them out there
and we're just like
oh, they must be good,
because they're algorithms.
They're by definition good.
No, they're not
necessarily good.
But again, you're right.
They might be better
than the existing system.
AUDIENCE: So in the other
examples that you talked about,
do you have a similar sense of
whether that data is out there?
Like the teachers, you know.
Maybe the teachers
is a bad example.
CATHY O'NEIL: The teachers
are just terrible.
AUDIENCE: You're not even trying
to solve the right problem.
CATHY O'NEIL: And by the way, I still think recidivism risk being high shouldn't necessarily mean you go to jail longer.
Maybe we should be asking, why do we sentence people the way we sentence people?
We should maybe make a redo
of that entire conversation.
If we were thinking
data-driven-ly,
like if we're
Google-thinking people,
we should ask really
basic questions.
Like to what extent does
GED training in a prison
help people when they leave?
To what extent does solitary
confinement help or hurt
people, or sexual assault?
Like, we should actually
know what these things are
doing to our final
outcomes, which hopefully
is a combination
of public safety
and the well-being of
the actual prisoners.
We don't have any of that.
Yeah?
AUDIENCE: I entirely agree
with your wonderful notion
that some of the
data is disparate.
I mean, it hurts
vulnerable communities.
But despite that, the thing
you most drew attention to
was the fact that
people whose relatives
are criminals are judged
more likely to commit crimes.
Well, people whose
relatives are actors
are more likely
to become actors.
People whose
relatives are plumbers
are more likely to
become plumbers.
And even though it's
a horrible thing,
I strongly suspect that
someone whose parents grew up
by stealing cars knew a heck of
a lot more about stealing cars
than someone whose parents didn't.
What do you suggest?
CATHY O'NEIL: I
suggest we follow
the Constitution, which
privileges justice
over anything like that.
AUDIENCE: So send the
police to Wall Street,
because it's equally
likely to have muggings.
CATHY O'NEIL: As a culture, we
decide what is against the law.
But the Constitution
declares that we
have to care about fairness
to the criminal above all.
And this is not fairness to the criminal-- saying you're implicated because your father was implicated.
And I also want to
add that I live next
to Columbia University.
And every couple
of years, there's
a huge drug bust in
the fraternities.
Like a huge drug bust.
And then I followed the
NYPD, and they're constantly
boasting about their data-driven
criminal justice stuff,
and how they peg
people as gang members
if they're associated on
social media with gang members.
But those gang members
are always in Harlem,
and they're black.
You see what I mean?
They're not Columbia students.
So it's also an
inconsistently-defined
association.
So there's two different
kinds of problems.
AUDIENCE: OK, I see that.
Thank you.
CATHY O'NEIL: Yeah.
AUDIENCE: So I
think it's very easy
to see that if you're not
careful about how you define
what kind of success
you're optimizing for,
you wind up with an algorithm
that very strongly reinforces
the status quo.
And if we're looking at things
like are hiring practices fair,
then if there is
existing unfairness,
then we're going
to perpetuate it.
So that makes total sense to me.
What I struggle with, is if
I'm asked to do something
like that, and I try
to not incorporate
the existing structural biases
into the algorithm that's
going to be my new recommended
hiring practices thing,
how do I deal with the challenge
that I'm actually optimizing
away from success conditions?
CATHY O'NEIL: OK, so there's
two strategies I suggest.
The first one is
basically a parable.
So there's this thing called a
blind audition for orchestras.
Where the orchestras
acknowledged
that they were being nepotistic
in their practices of hiring.
So they wanted to get
rid of their nepotism.
So they decided to put a
sheet between their judges
and their auditioner
so they wouldn't
know if the person who
was behind the sheet
was their friend.
At first, they saw the shoes.
And they were like,
wait a second.
We can see if it's
a man or woman.
So they brought the
sheet down to the ground.
And they also installed rugs
in the hallway walking up
to the spot so they couldn't
hear if it's high heels or not.
Not only did they
get rid of nepotism,
but they increased the
number of women in orchestras
by a factor of five.
So that's the story.
And the way I would characterize
what they successfully
did there is two things.
First, they decided a priori what the requirement for this job was. How do you assess somebody for this job? And the answer was sound.
The second thing they did, which
was absolutely as important
as the first thing, was
to ignore everything else.
The promise of big data that we get confused by-- because it sounds really convincing, but it's wrong-- is that the more data, the better.
Just throw all this
data at the wall.
Correlations are just
as good as causation.
Not true.
Because if we did the orchestra example again, but we didn't have the sheet, and we thought to ourselves, we're looking for good sound-- but we also knew that it was our friend, or we also knew that it was a woman-- that creeps into us. That creeps into our brain.
That's excess information
that we should be ignoring,
but we're not ignoring.
So the first strategy I would
suggest for hiring people
is to think a priori.
To actually build a model
that by construction is fair.
So build the model and say, these are the things we actually want for this job.
The second possibility
is use a machine
learning algorithm that
takes all these correlations
and blah, blah.
But then audit it for fairness.
Then say, wait, does this
basically filter out women?
Check.
Check to see if women
are filtered out.
Check to see whether
they're filtered out
in a reasonable way.
Sometimes they are.
Like I heard this
story on Twitter
just the other day, where way
more Asian people were filtered
out of a certain
job, and people were
complaining because so many
of the applicants were Asian.
But then people were saying, well, there were more Asians who were applying who weren't actually qualified for this job.
It's complicated.
And expect it to be complicated.
That's my third suggestion.
The other promise of big data that I'm pushing back against-- and I hope you understand and agree-- is that big data is not a silver bullet.
It's just a tool.
It is not automatically
going to solve our problems.
It's just a tool
that we might be
able to use to solve
some of our problems.
But we have to be careful
about it, and we have to check.
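As a sketch of what that check can mean in practice-- the model and applicant pool below are hypothetical: after training whatever model you like, compare selection rates across groups and flag a large gap, in the spirit of the rough "four-fifths" threshold auditors often use for adverse impact.

```python
# Sketch of a post-hoc fairness audit on a trained model treated as a
# black box. The model and applicant pool here are invented.

def selection_rates(candidates, model):
    # candidates: list of (features, group); model: features -> True/False
    rates = {}
    for group in {g for _, g in candidates}:
        decisions = [model(f) for f, g in candidates if g == group]
        rates[group] = sum(decisions) / len(decisions)
    return rates

def adverse_impact(rates, threshold=0.8):
    # Rough "four-fifths"-style check: flag any group whose selection rate
    # falls below 80% of the best group's rate.
    best = max(rates.values())
    return {group: rate / best < threshold for group, rate in rates.items()}

# Hypothetical black-box model and applicants, purely for illustration.
model = lambda features: features["score"] > 5
candidates = ([({"score": i}, "women") for i in range(10)] +
              [({"score": i + 3}, "men") for i in range(10)])

rates = selection_rates(candidates, model)
print(rates)                  # e.g. {'women': 0.4, 'men': 0.7}
print(adverse_impact(rates))  # women get flagged; men don't
```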
AUDIENCE: Do you have any
other success stories,
where you would go to a judge
or a jurisdiction or a school
system, and they would say,
oh my god, you're right.
We have to fix this.
And they do?
CATHY O'NEIL: Sorry, what?
AUDIENCE: Do you have any
other success stories,
like the orchestra, where
you might go to someone
and say, you know, there's
this problem, and they say,
oh my gosh, as opposed to,
no you can't see our data,
no that's proprietary,
go away, don't bother us?
CATHY O'NEIL: You'd be
surprised how few people want
me to see their source code.
Although I should
say that someone
asked me whether I
thought it would just
be impossible to ever get
a good teacher assessment
tool using data.
And I said, I'll
never say never.
Because like 10 years
ago, I didn't think
we'd have self-driving cars.
But I think now that
we're going to save lives
from drunk driving
with self-driving cars.
But what it's going to take to get a good teacher assessment tool is a lot of evidence that it's actually working.
Right now we have no ground truth for the teacher value-added model. Literally there is no comparison to some other kind of qualitative assessment to see whether a good teacher is getting a good score and a bad teacher is getting a bad score. It's just whatever the number is, that's your score, which is ridiculous.
So we actually do have
qualitative assessments
for teachers.
They're political,
because different people
disagree about what
makes a good teacher.
But let's say we had agreement.
We're defining this kind
of qualitative assessment
to be important, to
be the ground truth.
And then we could try to find--
maybe we put a bunch of sensors
in the classroom.
We try to replicate that using
data, literally replicate it.
And if we got an algorithm
that after 4,000 teachers
got almost exactly
the same score using
the qualitative
assessment as using
this data-driven
assessment, then we
would build trust that
this data-driven assessment
is doing a good job.
It's very similar to saying,
we have this many miles
of a self-driving car
without an accident.
So we start trusting that
car-- that algorithm.
That make sense?
Right now, we have
nothing like that
in the field of teacher value-added models.
They just want a silver bullet.
That's all they want.
They don't want any
prying eyes, either.
AUDIENCE: Something
that's kind of come up
a bunch is sort of
as an undercurrent
like the idea of open sourcing
how you're doing things.
And I can see that potentially
being really helpful,
for example, now you
can actually criticize,
because you can see the code.
But I can also see it being
somewhat ineffective if you're
still doing-- if you're
still asking questions that
are totally beside the point.
So I guess I'm curious
about your thoughts
as to the effectiveness of
things like open sourcing,
and also the limitations.
CATHY O'NEIL: That's
a great question.
And you guys know
that it's hard.
Because it's relatively
easy to make something
arbitrarily difficult
to understand.
If you know in advance that
it's going to be open source,
you can just make it
impossible to understand.
And the other example
I gave is like, nobody
would want the code
for Google search,
because literally it
probably would not
work on any computer system
except Google's computer
system, right?
So there's a limit to
what open sourcing can do.
But that's not to say I
think we should give up.
I think what we should ask for
is an auditing, like a fairness
auditing.
And I think we
should think about
along the lines of
the way sociologists
audit things for fairness.
So like sociologists will see
whether a hiring practice is
racist by sending a
bunch of applications
with black names
and white names,
and like similar kind
of qualifications,
and see whether white
people get more callbacks.
So we can do that kind
of thing to algorithms.
And Latanya Sweeney did
that for Google search.
You guys know that example?
Latanya Sweeney
googled her own name,
and she found the ad next
to her search results
was, "Are you looking
for the criminal arrest
records for Latanya Sweeney?"
And she googled a white name, and it didn't happen.
Then she did a
comprehensive test,
because she's a
computer scientist,
and found that black names
were way more likely to have
arrest records next to them--
this ad for arrest records.
Now that's not in my opinion
a weapon of math destruction,
because it's not as directly
impactful in people's lives.
But it's certainly not great.
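In code, that style of audit is just a paired test. The scoring function below is an invented stand-in for whatever opaque system is being audited: submit inputs that are identical except for the name, and look for a systematic gap in the outcomes.

```python
# Sketch of a paired ("correspondence") audit: feed an opaque scoring system
# inputs that differ only in a name associated with race, and compare.
# The black_box below is a made-up stand-in for the system being audited.

BLACK_NAMES = {"Lakisha", "Jamal", "Latanya"}

def black_box(name, years_experience):
    # We only observe inputs and outputs; internally this toy system
    # penalizes the black-associated names (the bias we want to detect).
    score = years_experience - (3 if name in BLACK_NAMES else 0)
    return score >= 5          # True = callback

def paired_audit(profiles):
    results = []
    for years, white_name, black_name in profiles:
        results.append((years,
                        black_box(white_name, years),
                        black_box(black_name, years)))
    return results

profiles = [(6, "Emily", "Lakisha"), (5, "Greg", "Jamal"), (10, "Kristen", "Latanya")]
for years, white_cb, black_cb in paired_audit(profiles):
    print(f"{years} yrs experience: white-name callback={white_cb}, "
          f"black-name callback={black_cb}")
# Identical qualifications, different names: a systematic gap in callbacks
# is evidence the system keys on the name, or something correlated with it.
```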
Go ahead.
AUDIENCE: Thanks.
I was just going to
agree with you before,
when you were talking
about there is no easy way
to do the teacher evaluation.
I think it goes back to your
point of success, right?
Because then it gets into
like student evaluation,
and are the assessments that
they do on students fair?
Or is that just a proxy for
success in their long term
career?
So I think the
teacher evaluation,
there's probably
no right answer.
Do you agree?
CATHY O'NEIL: Yeah, it's
a very tricky problem.
So I think the trickier the
question is, the less likely
it's going to be solved by
some simple scoring system.
AUDIENCE: A lot of the
machine learning algorithms
that are very popular
these days are things
like neural networks, which are not designed to be opaque-- they're opaque without anyone even intending it.
Like, if you click on the
little why button about why
it made a decision,
it'll just say
I did these 3,000
matrix multiplies,
and there's no simpler
answer than that.
But they do give much better
predictions than other things
like decision trees, which
are much more interpretable.
Do you have any
thoughts about how
in the context of these
problems to-- should
we give deliberately worse
predictions because we
need them to be interpretable?
CATHY O'NEIL: Yes, absolutely.
That's probably
the hardest thing
that you guys are going
to hear from me today.
But I definitely think we
need to sacrifice accuracy
for fairness.
Absolutely.
And one of the reasons is
because we as technologists,
we are not the ones
that are at risk.
So our concept of what looks fair to us is not good enough, right?
We might have some
better understanding
of how things work,
but we're like, oh,
but it makes it so
much more accurate.
And I actually
talked to somebody
who does recidivism risk
algorithms for a living.
He does it for a state.
And I said, well, do
you ever use race?
And he was like, oh,
no, that would be wrong.
And I said, well do
you ever use zip code?
He goes like, well,
sometimes, because it
makes it so much more accurate.
And I'm like, OK, but
that's a proxy for race.
So you're basically using race.
So another way of saying that
is if it's interpretable,
then it's much easier for other people to say, that's not fair.
It's a transparency measure.
There is precedent
for this, by the way.
In credit card law, if
you're a credit card company
and you deny someone
a credit card,
you have to be able
to explain why.
Which restricts people
to using decision trees,
for the most part.
They don't like it.
But that's OK.
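A sketch of why interpretability makes that explanation possible-- the rules and thresholds below are invented: a small rule-based model, standing in for a decision tree, can hand back the exact rule that caused a denial, which is exactly what 3,000 matrix multiplies cannot do.

```python
# Sketch: an interpretable rule-based model can explain a denial.
# The thresholds and rules are invented for illustration.

def credit_decision(applicant):
    # Each rule returns both a decision and the reason for it.
    if applicant["income"] < 20_000:
        return "deny", "income below 20,000"
    if applicant["debt_to_income"] > 0.4:
        return "deny", "debt-to-income ratio above 40%"
    if applicant["late_payments_last_year"] > 2:
        return "deny", "more than 2 late payments in the last year"
    return "approve", "passed all rules"

decision, reason = credit_decision(
    {"income": 35_000, "debt_to_income": 0.55, "late_payments_last_year": 0}
)
print(decision, "--", reason)   # deny -- debt-to-income ratio above 40%
# A neural network might be more accurate, but it cannot hand the applicant
# a reason like this -- which is the trade-off being described here.
```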
There's actually lots of things about lending that are a trade-off between fairness and accuracy.
So take FICO scores: by law-- anti-discrimination laws called the Fair Credit Reporting Act and the Equal Credit Opportunity Act-- it's illegal to base FICO scores on race or gender or zip code.
Is that because it's more
accurate when you restrict?
No, it's actually keeping
these credit scores
from being as accurate
as they might be.
But it was deemed
more fair that way.
As a society, we care
about the public, right?
So the idea is-- this
basically happened
in the '70s when especially
divorced women were
being denied loans.
And the idea there was, if
you never let women get loans,
then they're never going to be
able to build up their credit
scores so they'll
be credit worthy.
So it's a feedback loop, right?
And so these
anti-discrimination laws
were specifically designed
to prevent the feedback loop.
And we need that.
We need that kind of
thing, because algorithms--
like that algorithm
I just talked about.
That benefits the banks or
the people that make loans
to the detriment of the public.
So we have to
balance the benefits
for the public
versus the benefits
for the private companies.
I have time for one question.
AUDIENCE: So I was going to say
I'm in the opposite boat, where
I work in a search feature
that looks at lots of queries.
And we have a model for English
where the masculine pronouns
are the default, and some
common names are the default.
So the model knows
a lot about that.
And we're kind of wondering whether, for queries that women ask, we do terribly, or worse-- or whether there's some class like that.
Except we don't know
anything about the people
who ask the queries.
We've tried really
hard not to know.
I don't know gender.
I don't know income.
I could probably get
zip code in bulk.
So I'm sitting here.
I'm trying to think how I would
write a machine learning model
that would do the
reverse question
and figure out what's the most
unfair thing my current system
is up to?
CATHY O'NEIL: That's a good
question to be asking yourself.
But I would also
add-- and thank you
for asking that to yourself.
I think that's what
we all need to ask.
That's one of the
first questions
we should be asking ourselves.
You made the point yourself,
but I'll just reiterate it.
I'm certainly not the first
person to make this point.
Sometimes you actually do
need to know these attributes
like race and gender in order
to measure your fairness.
So I understand that the
desire to be race-neutral
or gender-neutral, but that
doesn't mean you should not
collect that data.
Because then you're basically
saying, I'm not going to look.
You know, it must
be fair, because I
don't collect that data.
That's not proof that it's fair.
Which isn't to say that it's
easy to collect that data.
AUDIENCE: That's
not why we don't.
CATHY O'NEIL: OK.
I'm saying maybe collect it.
At least for some
experiments as an audit.
Thank you, guys.
[APPLAUSE]
