Okay. Hi everyone, uh, let's get started.
Um, so Chris is traveling this week so he's not here.
But I'm very excited to say that today we've got
Margaret Mitchell who is a Senior Research Scientist at Google AI.
She's going to tell us about, uh, the latest
work defining and understanding and improving
the situation with bias in artificial intelligence.
Uh, Margaret has a background working in NLP and deep learning,
so I'm really interested to hear what she has to say today. Take it away.
Great, thank you. Um, can you guys hear me okay?
I'm not sure if this mic is exactly picking up my voice,
everything's cool? Okay, cool.
Um, so this work is, uh,
the product of a ton of different people and
collaborators that I've tried to put up here.
Um, some students at Stanford also Johns Hopkins, Google,
Facebook and Microsoft are all represented, cool.
So, um, for those of you who haven't seen the set of slides before,
what do you see here? Just shout it out.
Bananas.
Bananas. Okay what else?
Stickers.
Stickers. What else?
[NOISE] Shelves. What else?
Bunches of bananas.
Bunches of bananas. What else?
Yellow, ripe bananas.
You said ripe bananas, good.
So you can add [LAUGHTER] bananas with stickers on them.
You can start doing, like, embedded clauses, you know,
bunches of bananas with stickers on them on shelves in the store to get, kinda, crazy.
But we don't tend to say yellow bananas, right?
So given something like this,
we might say green bananas or we might say unripe bananas.
Given an image like this we might say ripe bananas or,
uh, bananas with spots on them.
Uh, if you're me, you might say bananas that are good for banana bread.
Um, but given an image like this or something like this in the real world,
we tend not to mention the yellowness.
And the reason for this is because yellow is prototypical for bananas.
So the idea of prototypes, uh,
stems from prototype theory which goes back to the early '70s,
uh, coming out of the work of Eleanor Rosch and colleagues.
Um, and it's this idea that there are
some stored central prototypical notions of objects,
um, that we access as we're operating,
uh, throughout the world.
There's some disagreement about whether these prototypes are
actual exemplars of objects or something like a distribution over what's likely,
but there is general agreement that we do have some, sort of,
sense of what's typical and what's atypical of the things in
the world, and we tend to notice and talk about the things that are atypical.
Um, so this is a riddle that I
heard in middle school that worked a little bit more at that time,
um, some of you might have heard it before.
A man and his son are in
a terrible accident and are rushed to the hospital in critical care.
The doctor looks at the boy and exclaims,
"I can't operate on this boy,
he's my son," How could this be? [NOISE].
Two dads?
Two dads or he has a mum who's a doctor, right.
Otherwise known as a female doctor,
which might be contrasted with doctor.
Um, in a study they did,
uh, when they first, sort of,
put forward this riddle at Boston University,
they found that the majority of test subjects
overlooked the possibility that the doctor could be a she.
And that included men, women and self-described feminists.
So the point is that,
these, kinds of, uh,
ways of talking about things and assumptions that we make,
aren't necessarily something that speaks to a negative intent,
but something that speaks to how we actually store representations in
our minds and how we access those representations as we interact,
uh, in the world.
So this, uh, this affects what we can learn when we're learning from text.
So, um, this is work from 2013,
where they took a look at what was, sort of,
most likely, what would you learn if you were just learning from raw text,
um, what were some things that were common in the world?
Um, they found that in this setup
something like murdering was ten times more likely than blinking.
And the reason for this is because people tend
not to mention these typical things that go without saying.
We don't tend to mention things like blinking and breathing,
but we do mention atypical events like murder and that affects the, kind of,
things a machine can learn from texts that we put out in the world,
because it's been subject to all of
these filtering processes that we have as humans before we, uh, communicate.
Um, this issue in particular is known as Human Reporting Bias.
Which is that the frequency with which people write
about actions, outcomes or properties,
is not a reflection of real-world frequencies or
the degree to which a property is characteristic of a class of individuals,
but says a lot more about how we're actually
processing the world and what we think is remarkable.
So this affects everything a system can learn.
Um, in a typical machine learning paradigm,
one of the first steps is to collect and potentially annotate training data.
From there a model can be trained,
uh, from there, uh,
media can be filtered, ranked, aggregated,
generated in some way,
um, and from there people see the output.
And we like to think of this as a relatively straightforward pipeline,
um, but at the very start, uh,
even before we're collecting the data,
actually within the data itself,
are a host of different kinds of human biases.
So things like stereotyping, things like prejudice,
things like racism and that's embedded within the data before we collect it.
Then as we collect and annotate data,
further biases become introduced.
So things like sampling errors, confirmation bias, um,
uh, in-group bias and out-group bias and I'll talk about these,
um, a little bit.
Oh, and I should mention feel free to ask questions as I go,
um, totally fine to just,
kind of, interact, uh, throughout.
So here are some of the biases that I think are
relatively important for work in AI and machine learning.
There's hundreds you can go into,
um, but some of the ones that I've, sort of,
become the most aware of working in this space,
um, are these sets and I'll go through each of these a bit.
Um, so I talked about reporting bias earlier,
which is, uh, which affects what we can learn from text.
Um, another example of a kind
of bias that really affects what we can learn from text is selection bias.
So, uh, a lot of times when we get data annotated,
we use something like Amazon's Mechanical Turk, um,
and the distribution of workers across the world is not, sort of,
a uniform distribution, it's actually, um,
concentrated in India, the US and then some in Europe.
So this leaves out South America,
this leaves out Africa,
this leaves out a lot of China and that affects the, kind of,
things that we'll be able to learn about the world when we have things annotated.
Um, another kind of bias is Out-group Homogeneity Bias,
which is the tendency to see out-group members as more alike than in-group members.
And this is gonna affect what people are able to describe
and talk about when they're annotating things such as emotion.
So, uh, so for example we have these two, like,
adorable puppies on the left here and they're looking at these four cats.
Um, these are all different black cats,
very different in different ways,
but the two puppies look at the cats and they see four cats basically the same.
And it's kind of trivial to understand how that also extends to
human cognition and how we also process people.
Um, it's this sense we have that the
cohort that we're in,
the people that we interact with,
those are the kinds of people that are nuanced, and
everybody else is somehow less nuanced,
has less detail to them.
It's a trick our minds play on us in order to help us process the world,
but it affects how we talk about it and it affects further how we annotate it.
Um, this leads to stuff like biased data representations.
So it's possible that you have an appropriate amount of data for
every possible human group you can think of in your data,
um, but it might be the case that some groups
are represented less positively than others.
And if we have time I'll go into, uh,
a long- a longer example of that.
Um, it also leads to things like biased labels.
So, um, this is an issue that came up when we were
getting some annotations for the Inclusive Images competition,
asking people to annotate things like bride and wedding and groom.
And we found that given three different kinds of bride,
wedding and groom images,
um, the ones that were more Western or European American, uh,
got the appropriate labels, and the ones that weren't
just got, sort of, more generic person
kinds of labels, uh,
not able to actually tease out what's actually happening in these images.
Compounding this issue are biases in interpretation
when the model outputs, uh, its decisions.
So, um, one, one issue is confirmation bias,
which is the tendency to search for, interpret, favor,
and recall information in a way that confirms preexisting beliefs.
And so a lot of times when we, uh,
build end-to-end systems and try and test our hypotheses,
we're kind of just testing it towards, uh,
things that we want to be true and analyzing the results in a way that will,
uh, help confirm what we want to be true.
Um, overgeneralization, which is coming to
a conclusion based on information that's too general or not specific enough.
Um, this is an issue that happens a lot of times
in the analysis of deep learning model results um,
where it's assumed that there's,
there's some kind of general, uh,
conclusion that can be taken away when really it's actually just,
uh, an effect of really skewed data.
Um, this is also closely related to overfitting, which
is kind of the machine learning version of overgeneralization,
where you're still making predictions, um,
but they're based on a small set of possible features,
so it's not actually capturing the space of features that's correct for,
uh, the desired output prediction.
Um, there's also a correlation fallacy,
which is confusing correlation with causation.
And this happens a lot again in talking about what
machine learning models are learning and
deep learning models are learning in particular, um,
where just because things happen together,
doesn't mean that one is causing the other,
but, uh, deep learning models directly don't
tell you anything about the causal relations.
And so it's easy to think that some output that is predicted
based on a correlation is actually something that's causal,
and I'll talk about some examples of this too.
Um, a further issue is automation bias,
and this really affects the machine learning models we put out there in the world that
then get used by people in systems like justice systems.
Um, so that's the tendency to, um,
favor the suggestions of
automated systems that output predictions over the,
um, suggestions of another human.
Um, and this happens even in the face of contradictory evidence.
So, if a system is telling you, you know, "This,
this is the score or this is the risk of this individual",
then we're more likely to think it's true because it came out of a mathematical system,
and we automatically sort of see this as something more objective,
something more mathematical, something that's going to
be more true than humans somehow.
Um, and that's automation bias.
So, um, rather than this kind of
clean straightforward pipeline that we have in machine learning,
um, we have human bias coming in at the very start in the data, um,
and then human bias coming in in data collection, annotation,
and then further getting propagated through the system as we train on that data,
um, as we start putting outputs based on that data,
as people act on that data.
And this creates a feedback loop, where
the kinds of things that we output for people to act on
then serve as further training data for input into your system,
so you end up amplifying even further these different kinds of implicit biases.
This is known as a Bias Network Effect or Bias "Laundering", I like to call it.
And so, the message is that human data perpetuates human biases.
And then as machine learning or deep learning learns from human data,
the result is a bias network effect.
So, I want to steer clear of the idea that if I say bias, or someone says bias, that equals bad;
it's a little bit more nuanced than that.
Um, so there are all kinds of things that people
mean when they're talking about bias, um,
and even the same bias can be good in some situations and bad in some situations,
so let's start with bias in statistics and ML.
Um, we talk about the bias of an estimator, which is
the difference between the predictions and the ground truth.
Uh, we talk about the bias term in linear regression.
Um, we also have cognitive biases,
and I talked about that in the beginning,
and not all of those are negative or,
or have to be, uh,
or have to be seen as negative.
So optimism is another kind of bias that we can
have that affects our worldview and the way we sort of process things.
Um, and even things like recency bias and
confirmation bias are just ways that our minds can like, um,
handle the combinatorial explosion of all the different things that can be
true in the world and put it down to something
tractable that we can sort of operate with in the real world.
Um, so algorithmic bias is what a lot
of people mean in headlines and whatnot when we're talking about bias,
which is, uh, more about unjust,
unfair or prejudicial treatment of people that's an output of
an automated decision system.
Um, and the focus here is really on, uh,
unjust, unfair or prejudicial treatment of people.
So, a lot of the work in this space right now is focusing on trying to understand,
what does it mean to be unjust from an algorithm,
what does it mean to be unfair from an algorithm,
and how can we handle this,
how can we sort of mitigate these issues in order to be able to keep
developing technology that's useful for people without worsening social divides.
Um, and I felt the Guardian put it really well a few years ago.
Um, they said, "Although neural networks might be said to write their own programs,
they do so towards goals set by humans using data collected for human purposes.
If the data is skewed, even by accident,
the computers will amplify injustice."
And it really keyed in on this amplify injustice idea.
Um, and let's talk about what that can mean.
So, one of the avenues of deep learning research that's
taken off in the past few years is predicting criminal behavior.
Um, so, um, how many of you are familiar with Predictive Policing?
[NOISE] Okay, like, half of the class.
Okay. So, in predictive policing, algorithms, um,
predict areas to deploy officers where crime is considered to be likely to occur.
But the data that the models are trained off
of is based on where police officers have already gone and made arrests.
So, the systems are simply learning the patterns of bias that
humans have in where they go and where they decide to try, uh,
to find crime, um,
and then reflecting them back.
So, because this system hones in on some of
the top spots where people have been arrested,
notice that's not the same thing as where crimes have been committed, right?
It's where arrests have been made.
Um, it means that the other areas that
might be explored for crime don't get explored at all.
That worsens the situation.
Um, some neighborhoods, uh,
get really acutely focused attention on them,
and that heightens the chances of serious repercussions for
even minor infractions, and that means arrests.
And that means a feedback loop in the data saying that
you will get an arrest in this place if you go there.
Um, another, uh, sort of related issue in this space is, uh, predictive sentencing.
Um, so there was a really nice article that came out
from Pro- ProPublica a few years ago discussing this.
Um, but when most defendants are booked in jail,
they respond to a questionnaire called COMPAS.
Um, and their answers are fed into this software system that
generates scores that correspond to the risk of recidivism,
that's the risk of, um,
committing a crime again.
Um, and the questions are used to gather data
on the defendant's socio-economic status,
family background, neighborhood crime,
employment status, and other factors in order to reach
some prediction of an individual's criminal risk.
Um, but what ends up happening is that it ends up focusing on the key bias issues
that humans have and propagating it
back with something that looks like an objective score.
So, you're a lot more likely, um,
to be convicted of a crime, um,
if you're black than if you're white,
even if you've committed the exact same crime.
And the system will pick up on this,
and will reflect this back to say that people who are
black are more likely to recidivate,
more likely to, uh,
commit a crime again.
Um, so this is an example of automation bias, preferring the output of a system, uh,
in the face of overgeneralization, feedback loops,
and correlation fallacy,
confusing things that are occurring together as being somehow causal.
There's another, uh, sort of area of research and, uh,
startups looking at predicting criminality in particular from things like face images.
So there's a company out there, uh, called Faception.
They are based in Israel and they claim to be able to,
um, use individual images, uh,
with computer vision and machine learning technology for profiling people
and revealing their personality based only on their facial image,
um, recognizing things like high IQ,
white-collar offender, pedophile, and terrorist.
Um, and their main clients are Homeland Security,
lots of other, uh,
lots of other countries dealing with sort of public safety issues.
They've not published any details about their methods,
their sources of training data,
or their quantitative results.
We know that in light of automation bias,
people will tend to think it just works even when it doesn't work well.
Um, but there was a paper that came out in
a similar line, uh, purporting to predict criminality from individual face images,
and that one had some results and,
uh, some more details about the data that we could kinda dig
into to understand where these kinds of claims are coming from.
Um, so this was an article that was posted on arXiv near the end of 2016.
Um, and they said they were using less than 2,000 closely cropped images of faces, um,
including wanted suspect ID pictures from specific regions,
and they claimed that even based on this very small training dataset, um,
that they were able to predict, uh,
whether or not someone was likely to be a criminal with,
uh, greater than 90 percent accuracy.
Um, and they got so lost in this,
this idea that, uh,
it's sort of funny to read to just take
a step back and realize what's actually happening.
So for example, one of
their really great exciting claims was that the angle Theta from nose tip to
two mouth corners is on average
19.6 percent smaller for criminals than for non-criminals.
This is otherwise known as smiling. [LAUGHTER]
Uh, and [LAUGHTER] you know,
exactly the kind of images people would
use when trying to put out wanted criminal pictures,
probably not really happy pictures.
But you get so lost in the confirmation bias.
You get so lost in the correlation and the
feedback loops that you end up overlooking these
really obvious kinds of things.
Um, so that's an example of selection bias,
experimenter's bias, confirmation bias, correlation fallacy,
and feedback loops all coming together to create
a deep learning system that people think is
scary and can do things that it can't actually do.
Um, one of the issues with this was that the media loved it.
Like it was all over the news,
and there's been similar kinds of things happening again and again.
Media wants to sell the story,
and so it's part of our job as researchers,
that people who work on this stuff,
to be very clear about what the technology is actually doing,
uh, and make a distinction between what you
might think it's doing and what it's actually doing.
Um, so another issue that has come up recently, um,
is claiming to be able to predict
internal qualities, but specifically ones that are subject to discrimination,
um, and loss of opportunity.
So in particular, there was this work that came out that claimed
to be able to predict whether or not someone was homosexual,
just based on single face images.
Um, now, it's important to know that the images that they used in the study included
images that were from dating websites where people self-identified as straight or gay,
and identified whether they were looking for a partner who was straight or gay,
and these became the sources of the training data,
and still from this, uh.
Oh! Before I go on, can you guys just understand
just from that what the issue might have been?
Rainbows.
[LAUGHTER] I don't think that there was actually anything about rainbows,
but that's really unfortunate.
[LAUGHTER].
[inaudible]
Right. Yeah. So this has more to do with the presentation of the self,
the presentation of the social self when you're trying to for example,
attract a partner on a website,
and less to do with how you look day to day.
Um, and yet they kind of went to
these large conclusions that aren't supported at all by the data or by their study,
um, conclusions like, consistent with a prenatal hormone theory of sexual orientation,
gay men and women tended to have gender-atypical facial morphology.
Now, none of the authors actually were prenatal hormone theory specialists, you know.
They have doctor in their name so maybe that's a thing.
Um, this was a Stanford professor and like I've,
I've presented this a few times at Stanford and gotten into
some like pretty harsh fights about this.
So I'm ready if anyone wants to take me on.
[LAUGHTER] But uh, me and my uh,
some of my colleagues decided we'd,
we'd play around with this a bit,
and what we found was that a simple decision tree,
um, so I'm kind of assuming you guys know what a decision tree is,
so, okay,
cool, a simple decision tree based on wearing makeup or wearing glasses
got us pretty close to the accuracy reported in
the paper. That says nothing about internal hormones,
that says nothing about any of that,
and it says a lot about the physical presentation,
the things that are on the surface.
Um, it says a lot more about how people are
presenting themselves than what is happening internally.
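To make the simple-baseline point concrete, here's a minimal sketch of what a sanity check like that could look like. The features and labels here are hypothetical placeholders, not the actual study's data or code; the point is just that a two-feature decision tree is a legitimate baseline to compare a deep model against.

```python
# Hypothetical sketch of a surface-level baseline (not the actual study's data or code).
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)
n = 1000
# Assumed binary features: does the person wear makeup / glasses in the photo?
X = rng.integers(0, 2, size=(n, 2))   # columns: [wears_makeup, wears_glasses]
y = rng.integers(0, 2, size=n)        # placeholder labels

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
baseline = DecisionTreeClassifier(max_depth=2).fit(X_train, y_train)
print("surface-feature baseline accuracy:",
      accuracy_score(y_test, baseline.predict(X_test)))
```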
Um, so the key thing that's recently kind of
been overlooked is that deep learning is, uh,
sort of considered to be somehow magically going beyond the surface level.
But the point is that it's working on the surface level and working well.
And in the face of confirmation bias and other kinds of bias factors,
it's easy to assume that something else is happening that's not.
Without critical examination, uh,
for example simple baselines, uh,
simple sanity checks, these kinds of things can just be ignored
and not noticed at all.
Um, so that's an example of selection bias,
um, experimenter's bias, and correlation fallacy.
Okay. So now I'm going to,
uh, talk about measuring algorithmic bias.
So I just said a lot about different kinds of biases that come in in the data,
in the collection, in the interpretation of the results.
[NOISE] Let's talk about actually quantitatively measuring different kinds of biases.
Um, so one of the key things that's, uh,
emerged in a few different works and really ties nicely to a lot
of fairness work is this idea of disaggregated evaluation.
So in disaggregated evaluation,
you evaluate across different subgroups as opposed to
looking at one single score for your overall testing data set.
Um, so, okay.
You guys are probably familiar with the training testing data split.
You kind of train on there,
on your given training data,
you test on your given testing data, and then you report, like, precision,
recall, F-score, things like that.
Um, but what that masks is how well the system is actually
working across different kinds of individuals and across different subgroups.
Um, and so one just straightforward way to handle
this is to actually evaluate with respect to those different subgroups.
So creating an evaluation for each, sort of, subgroup-prediction pair.
Um, so for example,
you might look at face detection for women,
face detection for men, and look at how the
error rates are
different or, um, similar.
Um, another important part of this is to look at things intersectionally,
um, combining things, um,
like gender and race at the same time and seeing how those, uh,
how the error rates on those sorts of things
change and how they're different across uh, different intersections.
Um, and this is inspired by Kimberlé Crenshaw,
um, who pioneered intersectional research,
uh, in critical race theory.
Um, and she discussed the story of Emma DeGraffenreid, uh,
who was a woman at General Motors, um,
who claimed that the company's hiring practices discriminated against black women.
Um, but in the court's opinion,
the judges ruled that General Motors hired, um,
many women for secretarial positions and many black people for factory roles,
and thus they could not have discriminated against black women.
What they failed to do was look at the intersection of
the two and understand that the experience there might be
fundamentally different than any of
the experiences of either of these sort of subgroups in isolation.
Um, and the same becomes true when you start looking
at errors that are regularly made in deep learning systems.
Um, so we've been able to uncover a lot of
different kinds of unintended errors by looking not only at
the disaggregated evaluation but also at intersectional disaggregated evaluation.
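To make that concrete, here's a minimal sketch of a disaggregated and intersectional evaluation, with made-up column names and data; the real thing would run over your actual test set.

```python
# Sketch of disaggregated and intersectional evaluation (hypothetical columns and data).
import pandas as pd

df = pd.DataFrame({
    "gender":     ["f", "f", "m", "m", "f", "m"],
    "race":       ["a", "b", "a", "b", "b", "a"],
    "label":      [1, 0, 1, 1, 0, 0],
    "prediction": [1, 1, 0, 1, 0, 0],
})
df["correct"] = (df["label"] == df["prediction"]).astype(int)

# Disaggregated: error rate per single subgroup.
print(1 - df.groupby("gender")["correct"].mean())

# Intersectional: error rate per (gender, race) pair, where disparities can hide.
print(1 - df.groupby(["gender", "race"])["correct"].mean())
```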
Um, so I'm going to walk through a bit how this works.
This is probably going to be review for most of you,
but I think it's really important to understand this because it also
ties to how we measure fairness and when we say like,
uh, algorithmic fairness, what we're talking about.
So um, the confusion matrix.
Okay. Are you guys familiar with the confusion matrix?
[LAUGHTER]. I just want to know where everyone is.
Okay. Awesome. Cool. So you're familiar with the confusion matrix, right.
So you have model predictions and references.
Um, and you can kind of look at these as negative and positive,
uh, binary classification, uh,
kind of approach here where if
the ground truth says something is true and the model predicts it's true,
it's a true positive.
If the ground truth says, uh,
it's false,
um, and the model predicts it's false, it's a true negative.
Um, and the errors, the kind of different issues that
arise, are false negatives and false positives.
Um, so in false positives the, um,
the ground truth says something is negative but the model predicts that it's positive.
Uh, and then in false negatives, vice versa.
Um, from this, you know,
uh, basic kind of, uh,
basic breakdown of errors,
you can get a few different metrics.
Um, these metrics actually trivially map to a lot of different fairness criteria.
So um, for example,
if we're looking at something like
a female versus male patient results and figuring out things like precision and recall,
which is relatively common in NLP, um,
if you have equal recall across your subgroups
that's the same as the fairness criteria of equality of opportunity,
um, I could work through the math.
But I mean, this is basically just,
just the main point that, that, uh,
it says that given that something is true in the ground truth,
the model should predict that it's true,
uh, at equal rates across different subgroups.
So this ends up being equivalent to having the same recall across different subgroups.
Similarly, um, having the same precision across
different subgroups is equivalent to a fairness criterion called predictive parity.
And so fairness has been defined again and again; um,
originally, some of these definitions came in
1966, following the Civil Rights Act of 1964.
Um, they were reinvented a few times, uh,
and most recently reinvented in, uh, 2016.
Um, but they all sort of boil down to
this disaggregated comparison across subgroups, and the math,
the metrics, end up being roughly equivalent to what we get from the confusion matrix,
specifically in classification systems.
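As a minimal sketch of that equivalence, with hypothetical predictions: computing recall and precision per subgroup and comparing them is exactly this disaggregated check; matched recall corresponds to equality of opportunity and matched precision to predictive parity.

```python
# Sketch: per-subgroup recall and precision as fairness checks (hypothetical data).
from sklearn.metrics import recall_score, precision_score

y_true = [1, 0, 1, 1, 0, 1, 0, 1]
y_pred = [1, 0, 0, 1, 0, 1, 1, 1]
group  = ["f", "f", "f", "f", "m", "m", "m", "m"]

for g in sorted(set(group)):
    idx = [i for i, gi in enumerate(group) if gi == g]
    yt = [y_true[i] for i in idx]
    yp = [y_pred[i] for i in idx]
    # Equal recall across groups ~ equality of opportunity;
    # equal precision across groups ~ predictive parity.
    print(g, "recall:", round(recall_score(yt, yp), 2),
             "precision:", round(precision_score(yt, yp), 2))
```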
So which kind of fairness metric do you use,
what are the different criteria you want
to use to look at the differences across different subgroups,
that really comes down to the trade-offs
between false positives and false negatives.
So this is the same problem that you are dealing with
when you're just figuring out how to evaluate generally.
Um, there's no one fairness criterion that is
the fairness criterion to rule them all, um,
and deciding which one is better than the other is the same as
kind of trying to decide which is better, precision or recall, right?
It depends on what the problem is and what you're interested in measuring.
Um, so a case where false positives might be better than
false negatives, and so you're willing to tolerate a higher false positive rate,
ah, across subgroups, is privacy in images.
So here a false positive is when something that doesn't need to be blurred gets blurred.
That's just kind of a bummer.
Um, but a false negative would be when something that needs to be
blurred is not blurred, and that can mean identity theft.
It's a much more serious issue.
And so it's important to prioritize
the evaluation metrics that stress the false negative rates.
Um, an example where false negatives
might be better than false positives is in spam filtering.
So a false negative could be an e-mail that's spam not getting caught, so you see it in your inbox,
that's usually just annoying, it's not a big deal.
Um, but a false positive here would be an e-mail flagged as
spam and then removed from your inbox, which,
you know, if it's from a friend or a loved one,
it can be, it can be a loss,
maybe a job offer, something like that.
All right.
So, um, I just kind of covered how AI can unintentionally lead to
unjust outcomes, and some of the things to be aware of here
are the lack of insight into sources of bias in the data and in the model,
the lack of insight into the feedback loops, where the original data that's collected
as an example of what humans do is then repurposed,
re-used, acted on, and then further fed in.
Um, a lack of careful disaggregated evaluation,
looking at the disparities,
the differences between different subgroups in order to understand this bias,
this difference across the subgroups.
Um, and then human biases in interpreting, and accepting,
and talking about the results,
which then kind of further the media cycles and the hype around AI right now.
Um, but it's up to us to influence how AI evolves.
So I like to think of this in terms of short term,
middle term, and long-term objectives.
So short term today,
we might be working on some specific model where we're trying to find some local optimum,
we have a task, we have data, something like that.
And that's sort of short-term objectives.
Um, we might have a slightly longer-term objective of getting a paper published,
or if you're an industry like getting a product launched,
whatever it might be.
Um, from there we might see our next endpoint is getting an award or,
you know, maybe become sort of famous for something for
a few minutes, something like that and that's cool.
Um, but there's a longer-term objective that we
can work towards as well at the same time.
And that's something like a positive outcome for humans in their environment.
So instead of just kind of focusing on these local decisions,
these local optima and these
sort of local paper by paper-based approaches to solving problems,
you can also kind of think about what's the long-term objective.
Where does this get me as I trace out an evolutionary path for artificial intelligence,
down the line in 10 years,
15 years, 20 years.
Um, and one of the ways you can address this is by thinking,
now how can the work I'm interested in now be best focused to help others?
And that involves talking to experts,
um, and kind of going outside your bubble,
speaking across interdisciplinary fields like
cognitive science which I've just talked a bit about.
Um, so let's talk about some things we can do.
So first off is data.
Um, so a lot of the issues of bias and fairness,
ah, in machine learning models really come down to the data.
Unfortunately in machine learning and deep learning,
working on data is really not seen as sexy.
Ah, there's a few datasets, ah,
that people use that are out there,
that's what people use,
and there's not a lot of analysis done on,
on how well these datasets capture different truths about the world,
how problematic they might be,
[NOISE] um, but it's a pretty wide area that needs a lot of, uh,
a lot of additional future work.
Um, [NOISE] so, moving on to understanding the data skews and the correlations.
If you understand your data skews and the,
ah, correlations that might be problematic in your data,
then you can start working on either models that address those,
or data augmentation approaches in order to sort of make
the dataset a little bit better or a little bit more representative
of how you want the world to be.
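As a minimal sketch, assuming a labeled dataset with an identity-term column, checking skews and label correlations can be as simple as counting; the columns and values here are hypothetical.

```python
# Sketch: quick checks for data skew and label correlation (hypothetical columns and data).
import pandas as pd

df = pd.DataFrame({
    "identity_term": ["gay", "tall", "gay", "straight", "tall", "gay"],
    "toxic":         [1, 0, 1, 0, 0, 1],
})

# Skew: how often does each identity term appear at all?
print(df["identity_term"].value_counts(normalize=True))

# Correlation: how often is each identity term paired with the positive label?
print(df.groupby("identity_term")["toxic"].mean())
```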
Um, it's also important to abandon the single training set and testing
set from a similar distribution approach to advancing deep learning.
So um, when we do projects in deep learning,
you know, we tend to have the training set,
and the testing set and then that's what we sort of benchmark on and prioritize,
but the point is, as you move around different testing sets,
you're gonna get vastly different results.
Um, and so by keeping to
just this sort of one training-testing dataset paradigm,
you're really likely to not notice issues that might otherwise be there.
And one way to really focus in on them
is having a hard set of
test cases that you really wanna make sure the model does well on.
So these are things that are particularly problematic.
Things that would be really harmful to individuals,
um, If they were to experience the output.
Um, and you kinda collect those in a small test set and then it's really easy
to evaluate on that test set as you benchmark improvements on your model,
as you add different kinds of things to your model,
in order to see, um,
not just how your model is doing overall,
in terms of your testing dataset,
but how well you're doing in terms of these examples,
you really want it to do well on.
That you know are going to be a problem if it doesn't do well on them,
and any sort of degradation on those
you might want to prioritize, um,
to fix above degradation in overall accuracy.
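A minimal sketch of that kind of hard test set, with a hypothetical predict function and made-up examples, might look like this:

```python
# Sketch: a small "must not regress" test set tracked separately from overall accuracy.
# The examples and the predict function are hypothetical placeholders.
HARD_CASES = [
    ("i am a proud gay person", 0),   # must not be flagged as toxic
    ("you are a dork",          1),   # must be flagged as toxic
]

def evaluate_hard_cases(predict):
    failures = [(text, gold) for text, gold in HARD_CASES if predict(text) != gold]
    # Any regression here gets prioritized over small changes in overall accuracy.
    return len(failures), failures

num_failures, failed = evaluate_hard_cases(lambda text: int("dork" in text))
print(num_failures, failed)
```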
Um, and it's also important to talk to experts
about the additional signals that you can incorporate.
Um, so we've put out a tool to help with this,
ah, with understanding data skews, called Facets,
um, it's just available there.
Um, and it's a really handy kinda visualizer for slicing, ah, understanding,
um, you know, what some of the differences are between different subgroups
and different representations and you can sort of dig in and explore a bit more.
So this is just to sort of help people, ah,
come to terms with the data that they're actually using and,
and where there might be, um,
unwanted associations or, or missing,
missing kind of features.
[NOISE] Um, another approach that's been put forward recently,
ah, specifically on the data side, is this, uh,
Datasheets for Datasets approach.
Um, so this is this idea that when you release a dataset,
it's not enough to just release the dataset with like
some pretty graphs and like talking about basic distributional information,
you need to talk about who the annotators were, where they were,
what the inter-annotator agreement was,
what their background information was,
um, motivation for the dataset.
All these other kinds of details.
So now you actually know that this isn't just a dataset,
this is a dataset that has these specific biases.
There's no such thing as a dataset that isn't biased in some way.
A dataset, by virtue of the fact that it's collected from the world as a subset,
is a, is a biased sample of the world in some way.
The point is to make it clear what it is,
how it is biased, what are the,
what are the various biases,
ah, that are important to know about in the dataset.
So that's one of these ideas behind Datasheets for Datasets,
when releasing datasets publicly.
All right. Now let's switch a little bit to machine learning.
Um, so there are a couple of techniques that I like to use. Um, I'll talk about two.
One, ah, is bias mitigation,
which is removing the signal for a problematic output.
Um, so removing, ah, stereotyping,
sexism, racism, trying to remove these kind of effects from the model.
Um, this is also sometimes called de-biasing or unbiasing,
but that's a little bit of a misnomer, because you're generally just kind of
moving around bias based on a specific set of words, for example,
um, so to say it's unbiased is not true.
Um, but you are kind of mitigating bias with respect to
some certain kinds of information that you provide it with.
Um, and there's inclusion which is then adding signal for desired variables.
So that's kind of the opposite side of bias mitigation.
So increasing model performance with attention to
subgroups or data slices with the worst performance.
Um, so, ah, in order to
address inclusion, ah,
kind of adding signal for underrepresented subgroups,
one technique that's worked relatively well is multi-task learning.
Um, so I've heard that you guys have studied multi-task learning which is great,
um, so I'll tell you a bit about a case study here.
Um, so this is work I did, ah,
in collaboration with the UPenn World Well-Being Project, ah,
working directly with clinicians,
and the goal was to create a system that could alert
clinicians if there was a suicide attempt that was imminent.
Um, and they wanted to understand the feasibility of
these kinds of diagnoses when there were very few training,
ah, training instances available.
So that's similar to kind of the minority problem in datasets.
Um, [NOISE]
And, uh, in this work,
we had two kinds of data.
One was the internal data, which was the electronic health records, um,
that were either provided by the patient or by the family.
Um, it included mental health diagnoses,
uh, suicide attempts or completions, um,
if that were the case, along with,
uh, the user's, uh,
the person's social media data.
And that was the internal data that we did not publish on,
but that we were able to work with clinicians on in
order to understand if our methods were actually working.
Um, the external data, the proxy data,
the stuff that we could kinda publish on and talk about,
was based on Twitter.
Um, and this was, uh,
using regular expressions in order to extract, uh,
phrases in Twitter feeds that had something that was kind of like a diagnosis.
So something like, I've been diagnosed with X,
or I've tried to commit suicide.
And that became kind of the,
the proxy dataset and the corresponding social media feeds for,
for those individuals, uh,
for the actual diagnoses.
Um, and the state of the art in clinical medicine, uh,
kind of until this work,
there's been more recently, but, uh, it's,
it's sort of this single-task logistic regression setup.
Where you have some input features,
and then you're making some output predictions like true or false.
Um, you can add some layers and start making it deep learning which is much fancier.
Um, you can have a bunch of tasks in order to
do a bunch of logistic regression tasks for a clinical environment.
Um, or you can use multitask learning, uh,
which is taking the basic deep learning model and adding a bunch of heads to it,
uh, predicted jointly at the same time.
Um, and here we had a bunch of diagnosis data.
So, um, we predicted things like depression,
anxiety, uh, post-traumatic stress disorder.
Um, we also added in gender because this is
something that the clinicians told us actually, uh,
had some correlation with some of these conditions,
and that they actually used it in making decisions themselves,
for whether or not someone was likely to,
uh, attempt, uh, suicide or not.
Um, and this also used this idea of comorbidity.
So multi-task learning is actually kind of perfect for comorbidity in clinical domains.
So comorbidity is, um,
when you have one condition,
you're a lot more likely to have another.
Um, so people who have
post-traumatic stress disorder are much more likely to have depression and anxiety.
Um, and depression and anxiety tend to be comorbid,
so people who have one often have the other.
So this points to the fact- this points to the idea that perhaps there's
some underlying representation that is similar across them,
that can be leveraged in a deep learning model,
with individual heads further specifying,
uh, each of the different kinds of conditions.
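As a minimal sketch of that architecture, a shared representation with one head per condition in Keras might look like this; the input size, layer sizes, and task names are illustrative, not the actual clinical model.

```python
# Sketch of a shared-representation, multi-head (multi-task) model (illustrative sizes/names).
import tensorflow as tf

inputs = tf.keras.Input(shape=(300,), name="text_features")  # e.g. averaged embeddings
shared = tf.keras.layers.Dense(128, activation="relu")(inputs)
shared = tf.keras.layers.Dense(64, activation="relu")(shared)

# One sigmoid head per condition (plus gender), all trained jointly off the shared layers.
task_names = ["suicide_risk", "depression", "anxiety", "ptsd", "gender"]
outputs = {name: tf.keras.layers.Dense(1, activation="sigmoid", name=name)(shared)
           for name in task_names}

model = tf.keras.Model(inputs=inputs, outputs=outputs)
model.compile(optimizer="adam",
              loss={name: "binary_crossentropy" for name in task_names})
model.summary()
```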
Um, and so what we found was that as we moved from
logistic regression to the single task deep learning to the multi-task deep learning,
we were able to get significantly better results.
And this was true both in the suicide risk case where we had a,
a lot of data, as well as
the post-traumatic stress disorder case where we had very little data.
Um, the behavior here was a little bit different.
So going from logistic regression to,
um, single task deep learning,
when we had, um,
a lot of data, uh,
as we did with the suicide risk, um,
the single-task deep learning model
worked better than the logistic regression model.
Um, but when we had very few instances, uh,
this is where the deep learning models really struggled a lot more.
Um, and so the logistic regression models were actually much better.
But once we started adding heads for the different kinds of comorbid conditions,
the different kinds of tasks, um,
that related to, you know,
whether or not the person might attempt suicide, um,
we were able to, uh,
bump the accuracy way back up again.
Um, and, you know,
it's roughly 120 at-risk individuals that we were able to identify, uh,
in the suicide case, that we wouldn't have otherwise been able to,
to notice as being at risk.
Um, one of the approaches we took in this was to
contextualize and consider the ethical dimensions of releasing this kind of technology.
So, um, it's really common in NLP papers to give examples.
Um, but this was an area where we decided that
giving examples of like depressed language,
could be used to discriminate against people,
like at, you know, job,
interviews, or something like that, you know,
the sort of armchair psychology approach.
So we decided that while it was important to talk about the technique,
and the utility of multitask learning in
a clinical domain and for bringing in inclusion of underrepresented subgroups,
it had to be balanced with the fact that there was a lot of
risk in talking about depression,
and anxiety, and how those kinds of things could be predicted.
Um, so we tried to take a more balanced approach here, um,
and since then I've been putting ethical considerations in all of my papers.
Um, it's becoming more and more common actually.
Um, so another kind of approach that's now turning this on its head,
where you're trying to remove some effect, um,
mitigate bias in some way,
is adversarial multi-task learning.
So I just talked about multi-task learning,
and I'll talk about the adversarial case.
Um, and the idea in the adversarial case is that you have a few heads.
Um, one is predicting the main task,
and the other one is predicting the thing that you don't
want to be affecting your model's predictions.
So for example, something like whether or not someone should be promoted based on,
uh, you know, their performance reviews,
and things like that.
Um, you don't want that to be affected by their gender.
Ideally, gender is independent of a promotion decision.
And so you can, uh,
you can create a model for this that actually,
uh, puts that independence, um,
criteria in place by saying, uh,
I want to minimize my loss on the promotion,
while maximizing my loss on the gender.
And so how we're doing that is just predicting gender,
and then negating the gradient.
So removing the effect of that signal.
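A minimal sketch of that negated-gradient idea, as a gradient-reversal layer feeding an adversarial head; the shapes and names here are illustrative, not a particular production model.

```python
# Sketch of adversarial multi-task learning via gradient reversal (illustrative shapes/names).
import tensorflow as tf

@tf.custom_gradient
def flip_gradient(x):
    # Forward pass is the identity; backward pass negates the gradient,
    # so the shared layers learn to remove the adversary's signal.
    def grad(dy):
        return -dy
    return tf.identity(x), grad

class GradientReversal(tf.keras.layers.Layer):
    def call(self, x):
        return flip_gradient(x)

inputs = tf.keras.Input(shape=(100,))
shared = tf.keras.layers.Dense(64, activation="relu")(inputs)

# Main task head (e.g. promotion decision): minimize its loss as usual.
promotion = tf.keras.layers.Dense(1, activation="sigmoid", name="promotion")(shared)

# Adversarial head (e.g. gender): its gradient is flipped before reaching the shared layers.
gender = tf.keras.layers.Dense(1, activation="sigmoid", name="gender")(GradientReversal()(shared))

model = tf.keras.Model(inputs, {"promotion": promotion, "gender": gender})
model.compile(optimizer="adam",
              loss={"promotion": "binary_crossentropy", "gender": "binary_crossentropy"})
```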
Um, this is another adversarial approach.
So you might be familiar with, like, generative adversarial networks.
So this is like two discriminators, uh,
two different task heads, uh,
where one is trying to do the task that we care about,
and the other one is removing the signal, uh,
that we really don't want to,
um, uh, be coming into play in our downstream predictions.
Um, so this is a way of,
uh, kind of putting this into practice.
So the probability of your output,
uh, predicted output given the,
the ground truth and your sensitive attribute like gender, um,
is equal across all the different, uh,
sensitive attributes or equal across all the different genders.
Um, and that's an example of equality of opportunity in supervised learning,
being put into practice.
So this is one of the key fairness definitions.
It's equivalent to, uh,
equal recall across different subgroups as I mentioned earlier.
Um, and that's a model that will actually,
uh, implement that or help you achieve that.
Um, where you're saying that a classifier's output decisions should be the same
across sensitive characteristics given what the,
what the correct decision should be.
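Written out, that equality-of-opportunity criterion is, with $\hat{Y}$ the predicted output, $Y$ the ground truth, and $A$ the sensitive attribute:

$$P(\hat{Y} = 1 \mid Y = 1, A = a) \;=\; P(\hat{Y} = 1 \mid Y = 1, A = b) \quad \text{for all } a, b,$$

which is just saying the true positive rate, the recall, is the same for every subgroup.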
Okay, so how are we on time?
Cool. Are there any questions so far? Are we good?
Okay, cool. So I'm gonna go into a little bit of a case study now, an end-to-end, uh,
system that Google has been working on, uh,
my colleagues have been working on, uh,
that is in NLP domain and deals with some of these bias issues.
Um, so you can find out more about this work, um,
in papers at AIES 2018 and a FAT* tutorial in 2019,
um, called Measuring and Mitigating Unintended Bias in Text Classification.
Um, and this came out of Conversation-AI, which is a, uh,
which is a product that's, um,
part of what's called a bet at Google,
a kind of spin-off company called Jigsaw that
focuses on trying to, like, combat abuse online.
Um, and the Conversation-AI, uh,
team is trying to use deep learning to improve online conversations.
Um, and collaborate with a ton of different,
uh, different people to do that.
Um, so how this works is,
oh you can try it out too, on perspectiveapi.com.
So given some phrase like "you're a dork," uh,
it puts out a toxicity score associated with that, like 0.91. [NOISE]
Um, and the model starts sort of falsely associating
frequently attacked identities with toxicity.
So this is a kind of false positive bias.
So "I'm a proud tall person" gets a model,
uh, toxicity score of 0.18.
"I'm a proud, uh,
gay person" gets a model toxicity score of 0.69.
And this is because the term gay tends to be used in really toxic situations.
And so the model starts to learn that gay itself is toxic.
But that's not actually what we want,
and we don't want these kinds of predictions coming out of the model.
Um, so, uh, the bias is largely caused here by the dataset imbalance.
Again, this is the data kind of rearing its head again.
Um, so frequently attacked, uh,
identities are really overrepresented in toxic comments.
There's a lot of toxicity towards LGBTQ identities, um,
it's really horrible to work on this stuff, like,
really [LAUGHTER] it can really affect you personally.
Um, uh, and, uh,
one of the approaches that the team took was just to add nontoxic data from Wikipedia.
So helping to- helping the model to understand that these kinds of terms can be used in,
you know, more positive sorts of contexts.
One of the challenges with measuring, uh,
how well the system was doing is that there's not
a really nice way to have controlled toxicity evaluation.
Um, so in real-world conversation,
it can be kind of anyone's guess what the toxicity is of a specific sentence.
Um, and if you really wanna control for different kinds of
subgroups or intersectional subgroups,
it can be even harder to get, uh,
really good data to evaluate properly.
So what the team ended up doing was developing a synthetic data approach.
Um, so this is kind of like a bias Mad Libs.
Um, where you take template sentences [NOISE], um,
and you use those for evaluation. This is the kind of, um,
evaluation you'd want to use in addition to your target downstream
ah, kind of dataset.
But this helps you get at the biases specifically.
So, um, some template phrase like I am a proud blank person,
and then filling in different subgroup identities.
And you don't want to release the model unless you see that
the scores across these different kinds of, uh,
synthetic template sentences, um,
are relatively kind of the same across, ah, yeah,
all of the different model runs.
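A minimal sketch of that template-based check, with a placeholder scoring function standing in for the real model and a couple of illustrative templates:

```python
# Sketch of template-based ("bias Mad Libs") evaluation; score_toxicity is a placeholder.
TEMPLATES = ["I am a proud {} person", "Being {} is great"]
IDENTITY_TERMS = ["tall", "gay", "straight", "deaf"]

def score_toxicity(sentence):
    return 0.5  # stand-in for the real model's toxicity score

# Before release, scores for the same template should be roughly the same
# no matter which identity term is filled in.
for template in TEMPLATES:
    scores = [score_toxicity(template.format(term)) for term in IDENTITY_TERMS]
    print(template, "score spread across identity terms:", max(scores) - min(scores))
```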
Cool. Um, so some assumptions that they made in this were that the dataset, um, uh,
didn't have annotated bias, and they didn't do
any causal analysis, because they were just trying to focus in particular,
um, on this toxicity problem.
Um, they used a CNN,
ah, convolutional, yeah you guys know, blah, blah, blah,
uh, with pretrained GloVe embeddings.
This is probably like your bread and butter,
pretrained GloVe embeddings.
I'm sure you know all about this and Word2vec.
Cool, uh, a Keras implementation of this.
Um, and, uh, and using these kind of data augmentation approaches, um,
both a Wikipedia, uh,
kind of approach as well as actually collecting positive statements about LGBTQ identity.
So there's this project called Project Respect at Google,
where we go out and,
and talk to people who identify as queer or people who have friends who do,
and like talk about this in a positive way,
and we add this as data.
Um, so we can actually know that this can be a positive thing.
Um, and in order to measure the model performance here, um,
again it's looking at the differences across different subgroups and trying to
compare also the subgroup performance to some sort of general distribution.
So here they use AUC, um,
where AUC is essentially the probability that a model will
give a randomly selected positive example
a higher score than a randomly selected, uh, negative example.
So, um, here you can see some toxic comments and
nontoxic comments with an example of a sort of low AUC.
Um, here, ah, this is an
example with a high AUC,
so the model is doing a relatively good job of separating these two kinds of comments.
Um, and there are different kinds of biases that they've defined in this work.
So, uh, low subgroup performance means that
the model performs worse on subgroup comments than it does,
ah, on comments overall.
And the metric they've introduced to measure this is called subgroup AUC.
Um, another one is subgroup shift.
And that's when the model systematically scores comments,
um, from some subgroup higher.
Um, so this is sort of like to the right.
Um, and then there's also, uh,
this Background Positive Subgroup Negative shifting to the left.
Yeah. Um, yeah that's sort of saying what I said.
It can go either way to the right or the left and there's just
kind of different metrics that can define each of these.
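As a minimal sketch of how those metrics can be computed, this is my reading of the definitions, with made-up scores and subgroup flags:

```python
# Sketch of subgroup AUC and BPSN AUC (my reading of the metrics described; hypothetical data).
import pandas as pd
from sklearn.metrics import roc_auc_score

df = pd.DataFrame({
    "score":       [0.9, 0.2, 0.8, 0.3, 0.7, 0.4, 0.6, 0.1],
    "toxic":       [1,   0,   1,   0,   1,   0,   0,   0],
    "in_subgroup": [True, True, False, False, True, False, True, False],
})

sub = df[df["in_subgroup"]]
bg = df[~df["in_subgroup"]]

# Subgroup AUC: how well toxic vs. nontoxic are separated within subgroup comments.
subgroup_auc = roc_auc_score(sub["toxic"], sub["score"])

# BPSN AUC: background positives vs. subgroup negatives; low values mean
# nontoxic subgroup comments are being scored like toxic background comments.
bpsn = pd.concat([bg[bg["toxic"] == 1], sub[sub["toxic"] == 0]])
bpsn_auc = roc_auc_score(bpsn["toxic"], bpsn["score"])

print("subgroup AUC:", subgroup_auc, "BPSN AUC:", bpsn_auc)
```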
Cool. Um, and the results in this,
ah, sort of going through not only just looking at, you know,
qualitative examples, um, and general evaluation metrics,
but also focusing in on some of the key metrics defined for this work,
these sort of AUC-based approaches.
And they were able to see significant differences between
the original release, which didn't account for any of these unintended biases,
and downstream releases, uh, which did,
which incorporated this kind of normative data
that reflected the sort of things that we thought the model should be learning.
Cool. Um, so, um,
the last thing to keep in mind as you sort of develop and,
and work towards, uh, creating deeper better models is to release responsibly.
Um, so this is a project I've been working on with
a ton of different people called Model Cards for Model Reporting.
It's, uh, it's a little bit of like the next step after Datasheets for Datasets,
um, where, um, Datasheets for Datasets focuses on information about the data.
Ah, Model Cards for Model Reporting focuses on information about the model.
Um, so it captures what it does,
how it works, why it matters.
Um, and one of the key ideas here is disaggregated and intersectional evaluation.
So it's not enough, uh,
any more to put out human-centered technology that just
has some vague overall score associated to it.
You actually need to understand how it works across different subpopulations.
And you have to understand the data that's telling you that.
Um, so here's some example details that a
model card would have,
um, who it's developed by,
what the intended use is,
so that it doesn't start being used in ways that it's not intended to be used.
Um, the factors across which the model is likely to have
disproportionate performance.
Um, so different kinds of identity groups, things like that.
Um, the metrics that, ah,
you're deciding to use in order to understand the fairness of the model, or
the differing performance of the model across different kinds of subgroups and factors,
and information about the evaluation data and training data.
Um, as well as ethical considerations, um,
so what were some of the things you took into
account or what are some of the risks and benefits,
um, that, uh, that are relevant to this model?
Um, and additional caveats and recommendations.
So for example, in the conversation AI case,
they're working with synthetic data.
So this is the sort of limitation of the evaluation that's important to understand, uh,
because it can tell you a lot about the biases,
but doesn't tell you a lot about how it works generally.
[NOISE] And then the key component in the quantitative,
uh, section of the model card is to have
this both intersectional and disaggregated evaluation.
And from here, you trivially get to different kinds of fairness definitions.
The closer you get to parity across subgroups,
the closer you're getting to something that is mathematically fair.
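To make the structure concrete, a minimal sketch of model card contents as structured data might look like this; the field names follow the sections above, and the values are placeholders rather than anything prescribed.

```python
# Sketch of model card contents as structured data (placeholder values).
model_card = {
    "model_details": {"developed_by": "...", "version": "..."},
    "intended_use": "...",
    "factors": ["identity groups", "other factors with disproportionate performance"],
    "metrics": ["false positive rate", "false negative rate"],
    "evaluation_data": "...",
    "training_data": "...",
    "ethical_considerations": "...",
    "caveats_and_recommendations": "...",
    # Disaggregated, intersectional quantitative results, one entry per subgroup pair.
    "quantitative_analyses": {
        ("subgroup_a", "subgroup_b"): {"false_positive_rate": None,
                                       "false_negative_rate": None},
    },
}
```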
Okay. So hopefully by paying attention to these kinds of approaches,
taking into account all these kinds of things,
we can move from majority representation of data in
our models to something more like diverse representation,
uh, for more ethical AI.
Okay. That's it.
Thanks. [APPLAUSE]
