HANNAH: Hello everyone.
Welcome to the first colloquium
in the winter quarter.
I am very honored to introduce
our speaker today, Kathy
McKeown, as our first
distinguished lecturer
this quarter.
Kathy McKeown is one of the world's leaders in natural language processing and also in data science.
She is a professor at Columbia
University and the founding
director of the Data
Science Institute.
She is also an Amazon Scholar.
I learned about that today.
Her work is mainly natural
language processing and using
big data.
And her group works on many
different research areas,
including text summarization
and generation,
also question answering,
and a lot of work in analyzing social media text.
She has received many awards; it's hard for me to remember and name all of them. So just to name a few, she's an elected member of the American Academy of Arts and Sciences.
She is a founding fellow
of ACL, a fellow of AAAI.
And some of her earlier awards include the Presidential Young Investigator Award, among many others.
OK, so today she will talk
about NLP and social good.
Thank you.
[APPLAUSE]
KATHLEEN MCKEOWN: Hey, thanks
Hannah for that introduction.
And thank you, also,
for inviting me here.
I haven't been on
campus for a while.
And I really enjoyed
walking around earlier
and seeing some of it again.
OK, so as text on the web has
grown, our group at Columbia
has become interested
in seeing how
we can make use of that
data to solve problems
that would benefit society.
And so today I'm going
to be talking to you
about the work our
group at Columbia
has been doing over
the last five years
to address societal needs.
Now as we all know,
data on the web
doesn't only come in English.
We have text data in
many different languages,
and societal problems occur
in many different countries.
Furthermore, if we
look at text that
appears in various
online genres,
often the language
is so different
from standard American
English that we also
need to take an approach that is quite different from our usual approaches in NLP.
So my vision is to develop systems that can analyze social media for insight into events in the world and generate presentations that connect events, personal accounts, and answers to questions.
As is the case for almost
everybody in NLP nowadays,
we take a machine
learning approach.
We start with data,
which is often labeled.
We extract features
from the text data.
And we also look at
representational learning.
And we use that to be
able to predict output
for the problems and the
tasks that we're working on.
Since my research is
not in machine learning,
the kinds of questions
that I'm interested in
are what data is
available for learning
and what features
and representations
yield good predictions.
I'm going to talk today about
work over a period of time.
I'll start with work that
is a little further back,
and so it will take the
approach of extracting features
from data.
And then I'll move to
the last 2/3 of the talk
on representational learning.
So I'm going to look at this within the context of three problems: disasters; the ability to access global information produced around the world; and, third, analyzing posts made by gang-involved youth on social media.
So turning first to
our work on disasters,
the first problem that
I'm going to talk about
is being able to provide
updates during a disaster.
So we want to develop a system that, as the disaster progresses over days, can each day provide us with an update on what's new.
So this is equivalent
to a problem
of monitoring events over time.
We're taking, as
inputs, streaming data.
It can be either
news or web pages.
And at every hour, we want to produce a summary which tells us what's new: what happened now that we didn't know about before.
So ultimately, we
might like to use
this to be able to track
events and the sub-events
that they spur.
So for example, when
Hurricane Sandy hit New York,
it caused a blackout
in Manhattan.
This was a follow-on event. It also caused a fire in Breezy Point, which is an area in Queens.
And it caused flooding in the
subway, so a public transit
outage.
So these are all
events that we would
like to be able to
pick up and report on.
For this problem,
we're using data
that was collected by NIST for
a challenge called the Temporal
Summarization Challenge.
And this is data that was collected by doing a two-year web crawl in the period 2011 to 2013, covering 11 different categories of disaster.
So some of them are shown here.
There is data on protests,
on terrorist events,
on bombings, and so forth.
So in this scenario, the system is going to take in documents over the course of an hour; just four are shown here, but in the real setting we would have far more than that.
And the system will
then generate a summary
at the end of an hour.
So shown here is a summary
that the system generated
for a particular
disaster in the NIST
data, an earthquake that hit Guatemala.
And you can see in the first
summary, the first update,
it provides information
about what happened.
The earthquake hit, about
where it hit, the magnitude
of the quake, and so forth.
In the next hour after getting
the next sequence of text,
it will produce a new update.
But for this update it has to avoid repeating the information that it has produced so far.
So for this work, we took
a temporal summarization
approach.
We divide time up
into time slices.
And at each time, t, we
want to first predict
which sentences
in our input data
are salient and are worthy of
being included in the summary.
This is what is called
an extractive approach
to summarization, because we're
extracting full sentences,
including those sentences
verbatim in the summary.
And here we use features that
are specific to the disaster
to help us in being able
to predict salience.
Having selected
salient sentences,
we have to remove those
that are redundant with what
was already reported.
We still will have far more
information or sentences
than we can include
in a summary.
And so at this
point, we cluster.
We use an affinity propagation
clustering algorithm
where we can incorporate
salience prediction as a prior.
And then from each
cluster, we'll
select exemplar sentences
to form the summary.
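To make that clustering step concrete, here is a minimal sketch using scikit-learn's affinity propagation, assuming TF-IDF cosine similarity and salience scores on a scale comparable to the similarities; the names and data are illustrative, not the actual system.

```python
import numpy as np
from sklearn.cluster import AffinityPropagation
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def select_exemplars(sentences, salience_scores):
    """Cluster sentences with affinity propagation, using each sentence's
    predicted salience as its 'preference' (a prior on being chosen as a
    cluster exemplar), and return one exemplar sentence per cluster."""
    tfidf = TfidfVectorizer().fit_transform(sentences)
    similarity = cosine_similarity(tfidf)
    ap = AffinityPropagation(affinity="precomputed",
                             preference=np.asarray(salience_scores),
                             random_state=0)
    ap.fit(similarity)
    return [sentences[i] for i in ap.cluster_centers_indices_]

summary = select_exemplars(
    ["A 7.4 magnitude earthquake struck off the coast of Guatemala.",
     "Officials issued a local tsunami alert after the quake.",
     "The quake had a magnitude of 7.4, officials said."],
    salience_scores=[0.9, 0.7, 0.5])
```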
So here are some of the features that we use.
We make use of language models.
We have a generic language model that was built from news corpora.
We use 10 years of AP and
New York Times articles.
This kind of language
model will score a sentence
by how typical it
is of the language.
So a higher score,
in this case, means
a sentence that is more fluent.
And this is useful
in this context.
Because we're taking sentences from web pages, we often will have sentences that are ill-formed or noisy in some way.
And that language model can
help us filter those out.
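As a toy illustration of this kind of scoring (the real models were trained on ten years of AP and New York Times text), here is an add-one-smoothed bigram model that scores a sentence by its average log-probability per token; higher means more typical of the training language.

```python
import math
from collections import Counter

class BigramLM:
    """Tiny add-one-smoothed bigram language model; a stand-in for the
    news-trained (or disaster-domain) models used to score sentences."""
    def __init__(self, corpus_sentences):
        self.unigrams, self.bigrams = Counter(), Counter()
        for sent in corpus_sentences:
            tokens = ["<s>"] + sent.lower().split() + ["</s>"]
            self.unigrams.update(tokens)
            self.bigrams.update(zip(tokens, tokens[1:]))
        self.vocab_size = len(self.unigrams)

    def score(self, sentence):
        """Average log-probability per token; higher = more fluent/typical."""
        tokens = ["<s>"] + sentence.lower().split() + ["</s>"]
        logp = 0.0
        for prev, cur in zip(tokens, tokens[1:]):
            numerator = self.bigrams[(prev, cur)] + 1
            denominator = self.unigrams[prev] + self.vocab_size
            logp += math.log(numerator / denominator)
        return logp / (len(tokens) - 1)
```

The same toy class, trained instead on Wikipedia articles from one disaster category, would play the role of the domain-specific model described next.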
The domain specific language
model is more interesting.
Here we create different
language models
for each type of domain
that we're looking at.
So we'll have one for
earthquakes, one for protests,
and so forth, which we train
by using disaster related
articles in the category
drawn from Wikipedia.
And this kind of language
model will score a sentence
by how typical it is
of the disaster type.
So it would help us
to get, for example
in the context of an
earthquake, sentences that are
more typical of earthquakes.
So here, for example, our disaster model related to earthquakes will score the first sentence more highly, because it contains phrases that we would expect to see in the context of an earthquake, like "local tsunami alert" or "disaster management."
These are phrases
that we would see
in the context of an earthquake,
but not, for example,
in the context of
a terrorist event.
But we want to make sure that we're not extracting sentences about any old earthquake. Rather, we want the particular event that we're reporting on at this point in time.
So we also use
geographic features.
We tag the input with
a named entity tagger.
We then get coordinates
for locations
and mean distance to the event.
And this kind of
feature will again
give a higher score
to the first sentence
here, because
Nicaragua is closer
to Guatemala than Mexico City.
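A small sketch of how such a geographic feature might be computed, assuming a named entity tagger and geocoder have already produced coordinates upstream; the helper names are hypothetical.

```python
import math

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two (lat, lon) points, in kilometers."""
    earth_radius = 6371.0
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlam = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2)
    return 2 * earth_radius * math.asin(math.sqrt(a))

def geo_feature(mention_coords, event_coords):
    """Negated mean distance from a sentence's location mentions to the
    event, so that closer mentions yield a higher feature score."""
    if not mention_coords:
        return 0.0
    mean_dist = sum(haversine_km(lat, lon, *event_coords)
                    for lat, lon in mention_coords) / len(mention_coords)
    return -mean_dist

# Nicaragua is closer to Guatemala than Mexico City is, so the first
# sentence's mentions score higher (coordinates are approximate).
print(geo_feature([(12.9, -85.2)], (15.8, -90.2)))  # Nicaragua
print(geo_feature([(19.4, -99.1)], (15.8, -90.2)))  # Mexico City
```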
Finally, we also use
features, semantic features,
so that we can recognize words
that are related to the event
that we're interested in.
So we use a number of
different event type synonyms,
hypernyms, or hyponyms.
So here, for example, in the first sentence, we have a hypernym of earthquake.
So again, we'll get a
slightly higher score.
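One way to implement this kind of semantic feature is with WordNet; here is a sketch, assuming NLTK and simple whitespace tokenization (this toy matcher ignores multi-word lemmas).

```python
from nltk.corpus import wordnet as wn  # requires: nltk.download("wordnet")

def event_related_terms(event_word):
    """Collect synonyms, hypernyms, and hyponyms of an event-type noun,
    e.g. 'earthquake' -> {'quake', 'temblor', 'geological_phenomenon', ...}."""
    terms = set()
    for synset in wn.synsets(event_word, pos=wn.NOUN):
        lemmas = list(synset.lemmas())             # synonyms
        for hypernym in synset.hypernyms():
            lemmas += hypernym.lemmas()            # hypernyms
        for hyponym in synset.hyponyms():
            lemmas += hyponym.lemmas()             # hyponyms
        terms.update(lemma.name().lower() for lemma in lemmas)
    return terms

def semantic_feature(sentence, event_word):
    """Fraction of a sentence's tokens related to the event type."""
    related = event_related_terms(event_word)
    tokens = sentence.lower().split()
    return sum(token in related for token in tokens) / max(len(tokens), 1)
```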
So what have we learned here?
In this graph, the
red line represents
the results for our system.
So this uses both the affinity
propagation clustering
and the salience predictions.
The blue line is affinity
propagation clustering only.
So how do we do when we drop
those salience features?
The purple line is hierarchical agglomerative clustering.
And this is what was
used in summarization
up to that point in time.
And then we also looked at a ranking approach to search.
And we can see that the
salience predictions
lead to high precision quickly.
So we're looking over the course
of time here on the x-axis.
And they also allow us
to more quickly recover
more information that would
be relevant to the event.
So this work was done
before neural methods,
only four years ago.
So since then we've been
exploring neural summarization
methods for extraction.
And in our work in this area we
found that simpler is better.
There have been a number
of models proposed that use
RNNs or CNNs for encoding.
And we find that word
averaging works just as well.
And pre-trained embeddings are better than relearning them.
We're currently working on being able to incorporate representations that would come from information extraction, so, for example, entities and locations, into a neural model to help us better identify events.
So we've also been
interested in how we can
summarize personal experiences.
For people who have lived
through the disaster, what did
they experience.
And this is a quite different kind of text; it requires looking at personal narratives that occur online.
This is an example of one
about Hurricane Sandy.
And we can see that it
follows a structure that
was identified by a
linguist, William Labov, who
studied personal narrative.
It starts out with
the background.
We were sitting down to
a late dinner on Monday
night when the storm
was supposed to hit.
It then goes into a series of
these complicating actions.
By 10:00 PM, the skies lit up
in a purple and blue brilliance,
and the power started to
go out here and there.
That's when I noticed neighbors
across the street running out
of their homes and fire
trucks racing down the block.
I saw a trickle of
steady water coming down
the street on both
sides and then
water began pouring in through the cracks in the basement door.
And then finally
in this story, we
have what Labov called
the most reportable event.
This is the moment of
sort of life and death
in the story, the climax.
And here we have he went
upstairs to get a tool.
And in those few seconds, ocean
waves broke the steel door lock
and flooded the basement
six feet high in minutes.
So our goal in
summarizing these stories
is to be able to
find and extract
the most reportable event.
And we hypothesize that this reportable event could serve as a summary of what the story is about.
So for this task, the data we used came from Reddit. We used the AskReddit subreddit with prompts.
So here, for example, is the prompt "What's your creepiest real-life story?"
And we had a number of
these different prompts
by which we could gather
these personal narratives.
For our experiments,
we used 3,000 stories.
But of course, this is a method
by which we could get more.
And then for a small number of them, we manually labeled the most reportable event and often some of the other elements of the narrative.
We had Columbia
students do that.
And this gave us a
seed number of stories
that were well annotated.
You know, it was more of
a high quality annotation.
But it's time consuming and
very hard to get a lot of them.
So we also used a large
amount of automatically
labeled data, making use
of distant supervision.
And we used a number
of different heuristics
to do that.
One example is TL;DR, "too long; didn't read," where a person gives some sort of summary of their own story.
And then we looked at semantic similarity between the TL;DR and different sentences in the narrative, where the sentence with the highest semantic similarity would be labeled as the most reportable event.
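A minimal sketch of that heuristic, where `embed` is an assumed helper mapping a string to a vector (for example, averaged word embeddings):

```python
import numpy as np

def label_mre_by_tldr(sentences, tldr, embed):
    """Distant supervision: pseudo-label as the most reportable event the
    story sentence whose embedding is most similar to the author's TL;DR."""
    def cosine(a, b):
        return float(np.dot(a, b) /
                     (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))
    tldr_vec = embed(tldr)
    similarities = [cosine(embed(s), tldr_vec) for s in sentences]
    return int(np.argmax(similarities))  # index of the pseudo-labeled MRE
```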
For our approach, we draw
from linguistic theory.
So Gerald Prince argued that
stories are about change.
Polanyi noted that the
turning point in a story
is marked by change in
formality, style, and emphasis.
And Labov noted that the MRE
is often accompanied in text
by a change in the verb tense.
So we wanted to monitor these changes over the course of the story and use them to help pick out the most reportable event.
We looked at the
different kinds of things
we could model as scores
which might change.
So we had syntactic scores.
So one example would
be sentence length.
We could also look at
sentence complexity
in terms of the parse tree.
For sentence length, we
reasoned that for example,
if the rest of the story was
told with very long sentences,
the most reportable event might
suddenly switch to a short one
to give emphasis.
We use semantic scores,
looking at similarity
to surrounding
sentences, hypothesizing
that the most
reportable event would
be a break with the
story that came before.
And then we use scores reflecting affect, drawn from the Dictionary of Affect in Language.
And so this is a dictionary
that, for each word,
contains scores along
three dimensions.
For each word, how
pleasant is it, how active,
so how intense is the
language, and how much does
the word evoke an image of
something in the real world.
And so you can see
here a plot where
we have some changes
in affect noted
across the course
of a narrative.
So here we would be looking at averages, or at the min and max within a sentence, as we progressed through the story.
And our hypothesis is that the
most reportable event, shown
here with the red line, would
occur at either a min or a max
in these scores.
We looked at other features,
also, so tense of the main verb
and inter-sentence
shifts in tense,
the position of the
sentence in the narrative,
and the semantic similarity of
the sentence with the beginning
sentences of the narrative.
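The core of the change-based hypothesis is easy to state in code: compute one score per sentence (length, affect, similarity, and so on) and look at where it hits an extreme. A sketch with made-up numbers:

```python
import numpy as np

def extreme_positions(sentence_scores):
    """Return the sentence indices where a per-sentence score hits its
    minimum and maximum; under our hypothesis, the most reportable event
    lies at one of these extremes."""
    scores = np.asarray(sentence_scores, dtype=float)
    return int(scores.argmin()), int(scores.argmax())

# e.g., sentence lengths in a story told mostly in long sentences:
lengths = [28, 31, 27, 25, 6, 24]          # sentence 4 stands out
print(extreme_positions(lengths))          # (4, 1)
```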
So what have we learned here?
Well we've learned that
our change features
of the different
features we looked at
were the most effective.
We also experimented
with how to use the data.
So we experimented
with the seed data
only, which was small but
nonetheless very accurate.
We also experimented with the
distantly supervised data,
which gave us a lot
more, but it was noisy.
And then we tried a self-training approach, where we first used the seed data to generate a model, labeled the data, and then retained, to augment the training data, only those results that matched what the distant supervision gave us, so two sources had to agree before the labels were fed back.
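A sketch of that self-training loop, where `model` is any classifier with scikit-learn-style fit/predict (for raw text, imagine a pipeline with a vectorizer in front); all names are illustrative:

```python
def self_train_with_agreement(model, seed_data, unlabeled, distant_labels):
    """One round of self-training: fit on the seed data, label the
    unlabeled stories, and keep only the examples where the model's
    prediction agrees with the distant-supervision label before
    augmenting the training set and refitting."""
    texts, labels = zip(*seed_data)
    model.fit(list(texts), list(labels))
    predictions = model.predict(unlabeled)
    agreed = [(text, pred) for text, pred, distant
              in zip(unlabeled, predictions, distant_labels)
              if pred == distant]
    texts, labels = zip(*(list(seed_data) + agreed))
    model.fit(list(texts), list(labels))
    return model
```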
And we can see the self
training does the best here.
We actually get quite
a big boost in recall,
but also a boost in precision.
So the biggest problem when dealing with this kind of data for summarization is that, if we have people summarize these stories and look at what a good summary would be, the summaries themselves include substantial rephrasing of how the original personal narrative was presented.
And this is because online
on social media, the language
is very informal.
We could get run-on sentences.
We have a lot of pronouns.
We have a lot of interjections,
which are sort of content free.
So in order to do a good job
on generating an abstract,
we really need a
lot of paraphrasing.
And so one of the things
that we looked at next
was a model for aligning the
sentences and the phrases
within the sentences
of the extract
with a manually
generated abstract.
And this is work that we reported on at ACL this summer. It's a neural model to align what we call sentential paraphrases, because they are fairly extreme.
And in future
work, we're looking
at being able to generate
those paraphrases.
So I'll turn now to our work
on global information access.
As I alluded to
in the beginning,
disasters don't just
happen in the US.
We see them happening
around the world.
And often they
happen in countries
where we don't have good
machine translation.
And furthermore, we don't
have a lot of resources
in that language or
paired with English
in order to develop good
machine translation.
And here are a few examples.
So we have been
looking at developing
cross lingual summarization for
these low resource languages.
Here the task is: given a document in one language, our low-resource language, we want to be able to summarize it in another language. And for us, that's English.
There are two different
approaches to this task.
One would be to first
summarize the document
in the low resource language
and then translate the summary.
And the other would be to
translate the low resource
document and then
summarize the translation.
But when we're dealing with
low resource languages,
we have little to no data
to train our summarization
systems.
We certainly do not have, within
the low resource language,
pairs of documents and summaries
that we can use for training.
And if we're looking
across languages,
from the foreign
language to English,
there are very few languages
for which this kind of corpora
has been developed.
I think Chinese is
one of the few cases.
And recently, I saw some work
on developing that for Spanish.
So in our approach,
given the lack
of data in the low
resource language,
we have to take the approach to
translate and then summarize.
But our machine translation
is going to be poor.
And we also have this problem of lacking pairs from the foreign language to English.
We're working with a
large team on this.
So we have machine
translation that's
being developed at University of
Maryland and also at Edinburgh.
This is an example of what the
machine translation produces
for a weblog in Swahili.
And so you can see that
it's not very fluent.
It may be hard to get a
good idea of even what
the weblog is about.
We, as people, can probably
make some inferences.
This is a person
named Mange Kimambi
who's running for Parliament.
And then it says "not special seats kinondoni without drugs is possible." So probably he's running on a platform related to drug use, reducing drug use in the country.
So we reasoned that this is a
good task for which to apply
abstractive summarization.
So in contrast to
extractive summarization,
in abstractive summarization
we would take sentences
from the input, but
we would rewrite them.
We may introduce
synonyms, we may change
the syntax, and so forth.
So the first thing that
we needed was data.
And we created a multilingual summarization corpus by taking a summarization corpus in English, the New York Times summarization corpus, which has news articles in English paired with news summaries in English.
And we translated those documents into the low-resource language (here we were working with Tagalog, Swahili, and Somali) and then back into English again.
So this gave us
documents now that
are noisy, very much like
the first one I showed you.
So we have noisy documents.
But we have well written English
summaries and they're paired.
So this looks closer
to the kind of scenario
we have in our system.
So for our data, we used the New York Times document-summary pairs.
We translated 112,000 of them
into each of Somali, Swahili,
and Tagalog using neural MT
systems developed in our larger
project.
And then we translated
them back to English.
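The corpus construction itself is a simple round trip; here is a sketch, where the two translate functions stand in for the project's neural MT systems:

```python
def build_synthetic_corpus(nyt_pairs, translate_to_lrl, translate_to_en):
    """Round-trip each New York Times article through the low-resource
    language (LRL) to get a noisy English document paired with its
    clean, human-written English summary."""
    corpus = []
    for article, summary in nyt_pairs:
        noisy_article = translate_to_en(translate_to_lrl(article))
        corpus.append((noisy_article, summary))  # noisy input, clean target
    return corpus
```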
This gives you an idea of what
these synthetic translations
look like.
So they're not quite as bad as
the translation of the weblog.
First of all, of
course, they are
newswire, which is a genre
that our systems are more used
to dealing with.
But you can still see
some of the same problems.
So if you look at the second sentence: "in his comments on January 0, he praised the consultation of the community of Kansas City, which half of the participants failed."
So again, you know,
disfluent, and we may not
have a great idea
of what it's about.
So the abstractive model that we developed was a sequence-to-sequence model with copy attention and coverage. This is a standard abstractive model that has been used in the neural summarization community.
We pre-trained for 12 epochs on
the unmodified New York Times,
so in English.
And then we trained this baseline for additional epochs on 100,000 synthetic translations from each of the three languages.
We also tried mixed
corpus training.
So we took 100,000 of these
synthetic translations
randomly from across
the three languages,
ending up with an equal
amount in each language.
And then we had a validation set
of 6,000 for parameter tuning.
So here is the performance
of our baselines
shown on the top line.
And the metric that we use here is one that's commonly used in summarization, called ROUGE. ROUGE-1, for example, looks at word overlap, the unigram overlap between a system-generated output and a model summary. So here we have the model summaries from the New York Times. ROUGE-2 would look for bigram overlap.
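For intuition, here is a simplified, single-reference version of that n-gram overlap (real ROUGE implementations add stemming, multiple references, and F-scores):

```python
from collections import Counter

def rouge_n_recall(system_summary, model_summary, n=1):
    """Fraction of the model summary's n-grams that also appear in the
    system summary (ROUGE-1 for n=1, ROUGE-2 for n=2)."""
    def ngram_counts(text, n):
        tokens = text.lower().split()
        return Counter(zip(*[tokens[i:] for i in range(n)]))
    system_ngrams = ngram_counts(system_summary, n)
    model_ngrams = ngram_counts(model_summary, n)
    overlap = sum((system_ngrams & model_ngrams).values())
    total = sum(model_ngrams.values())
    return overlap / total if total else 0.0

print(rouge_n_recall("the quake struck guatemala",
                     "a quake struck western guatemala"))  # 0.6
```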
These are two state of the art
systems at the point in time
of this paper.
And you can see we
get similar results.
We didn't care about getting the absolute top results, because we wanted to compare how this would do against our new model trained on our cross-lingual corpus.
So first for evaluation,
we did a sanity check
by evaluating on a held
out test set of 6,000
of these synthetic translations.
And I'm showing here the results
for the summaries generated
from the Tagalog data set.
And you can see that our
results trained on Tagalog
are better than our
results trained in either
of the other languages.
And we do get significant
improvement over the baseline.
The training on the mixed
corpus is slightly better.
And so that's encouraging.
But we also wanted to do
a real world evaluation.
What happens when we take
weblogs from each language
and summarize them using
machine translation in the loop?
We don't have any
model summaries
for this, so no gold standard.
So we needed to use
a human evaluation.
So we had five human evaluators,
and we asked them to score
the summaries along
two dimensions.
Content, how informative was
the summary, and fluency.
And they scored from 1 to 3, where 3 was highest.
Well, we expected to do well on fluency; that's what we were aiming for, and we thought we would do better on fluency than the baseline.
And in fact, you can see
quite a bit of improvement.
This is for the Swahili weblogs.
And we did equally well
training on the Swahili data
set versus the mixed data set.
What we were surprised about
was that the system also
did better on content.
And so we can see that the
mixed data set gives us
the highest scores on content.
And we think this is because
with a more fluent summary,
it's easier for the end user to
understand what's being said.
And the results are similar
for other languages.
So if we go back to our example, now the document is color-coded to show how information is copied from the document into the summary.
This is our baseline system
summary at the bottom.
And you can see that
most of the copying
happens from the first sentence.
So large chunks of it are copied
and placed into the summary.
There is this extra little bit of information, "is on blog," which gets in there.
And so we don't have much
of an improvement on fluency
when we use this approach.
But when we use our system
trained on the mix corpus,
again we can see some
copying from the input.
But then we have copying from different places: "to vie for," which appears at the end, "kinondoni without drugs," and some language that is newly generated, "comments on his plans." This gives us a better idea that Mange Kimambi is running on a platform for Kinondoni without drugs.
So what have we learned here?
Well, summaries are a good target for improving fluency in this cross-lingual case, making use of abstractive approaches. We've provided a new multilingual summarization corpus and a method for obtaining that kind of corpus in any language pair.
I showed that it improved on the
languages that we trained on.
But what I didn't show is that we also tested it on unseen languages.
So we looked at Arabic
using model summaries
that were produced in an
earlier summarization task.
And we also got
improvement there.
Turning to our current directions: one of the big problems with either neural language generation or abstractive summarization using neural methods is that systems tend to hallucinate.
They hallucinate
phrases that were not
represented in the input.
And they also can make errors
in doing the generation.
So we're beginning to look
at what kind of training
we can do to avoid
those hallucinations.
And we've piloted an approach
in a language generation task,
so not summarization.
Here this was a task to generate restaurant reviews, where our input is structured data, attribute-value pairs, and our output is a short text paragraph giving a review.
And we looked at methods to
do data augmentation and then
self training on that data.
And we'll begin
to look now at how
we can do similar
things in summarization.
So in this area we've also looked at how we can use sentiment analysis to help in the context of a disaster, now around the world.
The goal is we want to
analyze the subjective posts
of everyday people
and identify sentiment
and emotion in those posts.
But again, it could be in
a low resource language.
And the idea is that if we find negative sentiment or negative emotion, this is an area where problems likely still persist, whereas positive sentiment or emotion can indicate where the problem has been resolved.
And given the lack
of data, we're
looking at creating these models
without machine translation.
So we've developed
an approach called
direct transfer where
we train the sentiment
model on a high resource
language where we do
have labeled data.
So we do have lots of labeled
data in English for sentiment
where we have, for example,
Twitter posts where
each post is labeled as being
either positive, negative,
or neutral.
And we can use that
to train the system.
But in the neural approach,
rather than using English word
embeddings, we now train a set
of bilingual word embeddings.
And we use those as features
in the sentiment model.
And when we train it,
then we can directly
create a model for
transferring sentiment
to the low resource language.
One of the things that
we've experimented with here
are what kind of data
we have available
and how it impacts the
training of these cross lingual
embeddings and what works best.
So we looked at the use of an in-domain parallel corpus. So if we're dealing with disaster, an in-domain parallel corpus would be, for example, Twitter data that reports on disasters.
And we would want to
have paired sentences,
both in our low resource and
our high resource language.
That could be hard to find.
So we also look at
out-of-domain parallel corpus.
There is a lot of that available
using the Bible and the Quran,
because the Bible and the
Quran have been translated
to almost all languages.
But as you can imagine,
this is very out-of-domain.
The language is not very similar
to what we find in Twitter.
[LAUGHTER]
And probably, it's not
talking about disaster either.
Although, sometimes it is.
[LAUGHTER]
We also looked at what's
called comparable corpora.
So this would be, for example,
you can think about Wikipedia
where we have English Wikipedia
and the foreign language
Wikipedia.
And we may have articles on
the same topic, for example,
on the same disaster.
But they're not
translations of each other.
So they're not
parallel, but they're
likely to contain some
translation somewhere
in the two articles.
And then finally, what if
we don't have any of that?
All we have is a
monolingual corpus.
So we did this experiment
in 17 languages
in 5 broad language families.
Some of them are high
resource languages
like Portuguese or Spanish.
But we did it in a low resource
context with less data.
Some of the low
resource languages
are Sinhalese,
Uyghur, or Tigrinya.
This is the model architecture that we use. So it's a bi-directional LSTM; this helps with languages that have different word order.
And we use pre-trained cross
lingual word embeddings
for them as input.
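A PyTorch sketch of what this direct-transfer classifier might look like; the dimensions and random vectors are placeholders, and the real model also takes the sentiment features described next:

```python
import torch
import torch.nn as nn

class DirectTransferSentiment(nn.Module):
    """BiLSTM over frozen, pre-trained cross-lingual word embeddings:
    train on labeled English tweets, then apply unchanged to the
    low-resource language, since both share the embedding space."""
    def __init__(self, crosslingual_vectors, hidden_size=128, classes=3):
        super().__init__()
        self.embed = nn.Embedding.from_pretrained(crosslingual_vectors,
                                                  freeze=True)
        self.lstm = nn.LSTM(crosslingual_vectors.size(1), hidden_size,
                            batch_first=True, bidirectional=True)
        self.out = nn.Linear(2 * hidden_size, classes)  # pos/neg/neutral

    def forward(self, token_ids):
        states, _ = self.lstm(self.embed(token_ids))
        return self.out(states.mean(dim=1))  # mean-pool over tokens

vectors = torch.randn(10000, 300)  # placeholder shared embedding table
model = DirectTransferSentiment(vectors)
logits = model(torch.randint(0, 10000, (2, 20)))  # a batch of 2 tweets
```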
We also experimented with some sentiment features, so bilingual sentiment embeddings and bilingual sentiment scores.
These embeddings and
weights are pre-trained,
and so you can see
them at the bottom.
And then we also looked
at bilingual features.
So suppose we only have a large amount of monolingual data, but we have a small parallel corpus.
Well, we can build a
bilingual dictionary
from the small parallel corpus.
And then we can take the
large monolingual corpus,
and we can replace each word
in the low resource language
with the translation.
Then we get this code-switched
partial translation
of the training data.
And this does allow us to
update the parallel embeddings.
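That lexicalization step is essentially a dictionary lookup; here is a sketch with a tiny, made-up Swahili-to-English dictionary (not the real resource):

```python
def code_switch(tokens, bilingual_dict):
    """Replace each low-resource-language word that appears in the small
    bilingual dictionary with its English translation, leaving unknown
    words in place, to produce the code-switched partial translation."""
    return [bilingual_dict.get(token, token) for token in tokens]

swahili_to_english = {"habari": "news", "mbaya": "bad"}
print(code_switch(["habari", "mbaya", "kinondoni"], swahili_to_english))
# ['news', 'bad', 'kinondoni']
```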
So this is a large chart.
You don't need to
understand the details
across the bottom or all
of the different languages.
The yellow bars show how we could do if we had labeled data in the foreign language.
And that's sort of
our upper bound.
The gray is our approach,
the direct transfer.
And the others are
some baseline systems.
And what we see is that transfer
outperforms the baseline.
It lags behind the
supervised model
by about 15 points on average.
So what lessons have
we learned here?
If we have an in-domain parallel
corpus, even if it's small,
we will do the best there.
If it's not available, then the order in which you want to use resources is first out-of-domain parallel, then comparable, and then monolingual data.
And the best
features we've found
are embeddings learned
on the bilingual context.
If we have an
evaluation data set that
has a lot of sentiment in it,
then the bilingual sentiment
embeddings are best.
And target language
lexicalization helps.
So in the last part of the
talk, I want to turn to our work
on analyzing the posts
of gang involved youth.
The background for this
would be the large amount
of firearm related deaths that
happen in the US and the fact
that violence disproportionately
impacts low income
cities, like
Chicago, for example,
which had more than 3,000
shooting victims in 2015.
So we're doing this work in collaboration with Desmond Patton, a faculty member in social work, who studied social media, which he calls the digital street, and saw that it parallels what happens on the physical street. And violence is exacerbated by taunting that occurs on the digital street.
Furthermore, he saw
that violence was often
triggered by trauma.
So prior to aggressive
posts, there
would have been some
traumatic event.
In his work, he was studying a girl named Gakirah Barnes, a recently deceased gang member in Chicago.
And she became gang
involved at the age of 13
after her friend, Tyquan,
was shot and killed.
And at that point, she
renamed her Twitter
handle TyquanAssassin.
And she was unusual
in that she was
very prolific on social media.
So she had some 27,000 posts to
her name in a three year period
and had roughly 4,000
followers on Twitter.
Now if we look at this
data, it is a challenge
for current natural language
systems, which are primarily
trained on news.
The language is quite different and would not be handled well by, for example, many of the parsers or part-of-speech taggers that we've developed.
So we work closely
with the social work
team who do analysis and
annotation of this data.
They also bring in, as part of
their team, youth from Chicago
who were formerly gang involved
who can provide interpretation
for the different posts.
From this, we have about
5,000 labeled tweets
from Gakirah and her
top communicators.
And one of the
things they have seen
is that context is
critical in interpretation.
We also have a much larger number of unlabeled tweets from 279 different posters, who were found in a snowball sampling fashion.
So we can see here some examples of the annotation.
They do a very fine
grained analysis.
So they would annotate
the first tweet
as a threat and the
second tweet as an insult.
But they both are
categorized as aggression.
And at this point in time, doing a three-way categorization is as much as we could hope for.
And the bottom one is about loss, where somebody is mourning the death of their friend, Shorty.
So our goal for natural
language processing
is to develop a system that
can predict for unseen tweets
whether they express
aggression, loss,
or some other information.
We want to be able to exploit
the unlabeled data set
in doing that.
Our ultimate goal in
developing this tool
is to develop something that
community outreach workers can
use to intervene and avert
violence that could happen,
either when aggression
is expressed
or at an earlier point when
people first experience loss.
Given the social work
researchers' observation
that the context of
the tweet is very
important to the
interpretation, we
wanted to explore using the
semantic context of what
had been said, the emotional
contexts preceding the post,
and the user's social network.
Our method was a word-level convolutional neural network,
and we exploited the
unlabeled data first
by creating domain
specific word embeddings
and also creating a lexicon
of aggression and loss
from that data.
So we used a method to automatically induce the lexicon, using what's known as the SENTPROP algorithm.
This constructs a lexical
graph from word embeddings,
connecting words that
are semantically similar.
And it propagates labels from
seed words using a random walk.
So we started off with a small
set of seed words labeled
by our domain experts
and then propagated
that over all of the
vocabulary in our corpus.
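Here is a sketch of random-walk label propagation in the spirit of SENTPROP (the published algorithm differs in its graph construction and normalization details); embeddings are assumed unit-normalized so dot products act as cosine similarities:

```python
import numpy as np

def propagate_labels(embeddings, seed_indices, beta=0.85, iterations=50):
    """Spread mass from seed words over a similarity graph via a random
    walk with restart; returns one score per word. Run once with the
    aggression seeds and once with the loss seeds."""
    similarity = embeddings @ embeddings.T
    np.fill_diagonal(similarity, 0.0)
    similarity = np.clip(similarity, 0.0, None)   # keep edges non-negative
    transition = similarity / (similarity.sum(axis=1, keepdims=True) + 1e-9)
    restart = np.zeros(len(embeddings))
    restart[list(seed_indices)] = 1.0 / len(seed_indices)
    scores = restart.copy()
    for _ in range(iterations):
        scores = (1 - beta) * restart + beta * (transition.T @ scores)
    return scores

vectors = np.random.randn(100, 50)
vectors /= np.linalg.norm(vectors, axis=1, keepdims=True)
aggression_scores = propagate_labels(vectors, seed_indices=[0, 3, 7])
```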
So for the context features, we want to represent what the user said, using the word embeddings, so the semantics, as well as what the user felt, using the aggression or loss lexicon.
And some of the different
hyperparameters that we use
are the number of
days that we look back
in the conversation,
half-life ratio,
so whether more recent tweets
have more of an impact,
the type of the post, and then
how we aggregate from the word
level to the tweet level to
the context, whether we average
or whether, for example,
we take the max.
For the social network interactions, we build a representation of each pair of users, where we say a user is involved in a tweet if they posted it, retweeted it, or were mentioned in it.
And for each pair of users
in the unlabeled and labeled
corpus, we aggregate
all tweets in which
they both were involved.
And this gives us the
architecture shown here.
It's a standard word level CNN.
But at the last layer, we concatenate in context features representing semantics, emotion, and the user interaction.
So we first looked at the
impact of using the domain
specific information.
And you can see, at the bottom, that when we use our domain-specific word embeddings, we get the highest F1.
And we compared this against
using word embeddings built
from Google News, where you have
much more data than we have.
We also wondered whether, if we had a corpus of African American Vernacular English that was larger than ours, it would do better. But that's not the case.
And so we are dealing
with a language
that is very hyper local
to our demographic.
We thought maybe
tweets that came
from the location, the specific
location out of which the gang
members operate,
would be better,
but that is not the case either.
Word embeddings trained on GloVe Twitter do the best of these, but still don't come close.
For the hyperparameters, we found that for the semantic context, what the user said, 90 days was best.
And the half-life
ratio was important.
So the most recent
statements had more impact.
Averaging worked best, and we only used the user's own posts.
For what the user
felt, however, we
saw something quite different.
Only the posts in the immediate
two days prior were helpful.
And here, we sum.
This allows us to
retain peaks of emotion.
And these findings reflect
insights from our social work
research team.
That is that loss precedes
aggression in a two day window
and emotions fluctuate.
If we look at our
results, we can
see that our CNN with
domain specific resources
outperforms our baseline.
And this was a baseline
developed in 2016
by an undergraduate
working with me,
Terra Blevins, who is now a PhD student here at the University of Washington.
But we get additional
significant improvement
from adding the
context features.
And we see that same
improvement in our original SVM
if we add the context features.
I'm reaching the end, so I'm going to have to say very quickly that we also looked at bias, given that we wanted to use this work in a live setting.
We were concerned with how well
the model makes the decision,
and we wanted to avoid bias
in the learning algorithm.
So we did a systematic interpretation of the model, which did reveal bias: surprisingly, in 10% of the cases, a stop word was the most influential word in the tweet, something which should not have an impact at all.
We did adversarial testing, inserting "a" and "on" into every tweet. And we found that the prediction flips to aggression in 3% of the cases for "a" and 5% of the cases for "on."
And yet, you would never want
to have that happen if you
were making true predictions.
And to reduce bias, we
made use of rationales.
We had the domain expert
identify the words
within the tweet that
led them to their label.
And then we incorporate those
rationales into the model
using trained attention.
So we have the ground
truth for attention,
which is distributed over
the words of the rationale.
And we use KL-divergence
loss in training.
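A PyTorch sketch of that loss term, assuming the model exposes per-token attention logits and the rationale is given as a 0/1 mask over tokens:

```python
import torch
import torch.nn.functional as F

def rationale_attention_loss(attention_logits, rationale_mask):
    """KL divergence between the model's attention distribution and a
    'ground truth' spread uniformly over the expert's rationale words;
    added to the classification loss during training."""
    attention_log_probs = F.log_softmax(attention_logits, dim=-1)
    target = rationale_mask / rationale_mask.sum(dim=-1, keepdim=True)
    return F.kl_div(attention_log_probs, target, reduction="batchmean")

logits = torch.randn(4, 30)                    # attention over 30 tokens
mask = torch.zeros(4, 30)
mask[:, 5:8] = 1.0                             # rationale = tokens 5 to 7
loss = rationale_attention_loss(logits, mask)
```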
And I'll just point
to this first line.
The average rationale
rank is a new metric
that we devised to measure
the influence of each word.
And we see that, more often in our new model with rationales, the rationale words are ranked highest, that is, they have the lowest average rank.
And in our new model, we have
the fewest flips to predicting
aggression when we
insert these words.
So what have we learned here?
Measuring bias is important.
We didn't change the F-measure score, the accuracy.
But our metrics allowed
us to test what kind
of bias crept into the model.
We see that integrating
information from context
improves performance and that
we can exploit information
with the unlabeled data.
So in conclusion, I
hope to have shown
that multilingual and
non-standard languages require
attention, and that we
need to pay attention
to the data we use, the
semantics, and real world
context in neural
architectures, as well.
And I would close by
thanking my current PhD
students, one of whom was one
of your undergraduates here,
Emily Allaway.
I had some very talented undergraduates who participated in the gang-involved youth work, Serina Chang and Ruiqi Zhong, who are now at Stanford and Berkeley.
And Yanda Chen is a
junior, so he's continuing
to do research with us.
And then all my past students.
So thank you.
[APPLAUSE]
HANNAH: So we have time
for a few questions.
AUDIENCE: So thanks
for a great talk.
Exciting work.
I want to push back on
this bias thing, where
you found that the
stop words were
triggering a flipped decision.
And I was thinking of,
like, why would this happen.
And sometimes these little
words are important, right?
So there's a big difference between "John hit Mary" and "John hit on Mary."
Right?
So they often play a role in
larger multi-word expressions
and completely
change the meaning
to something that's not
at all compositional.
KATHLEEN MCKEOWN: Yes.
AUDIENCE: So I wonder if you--
did you dive into those
examples to see if it was
something like that going on?
KATHLEEN MCKEOWN: So we
did look at the examples,
and we really had to have the
social work researchers look
at the examples.
We did think maybe it was something about the language, because some of these function words in the language are words like "duh" for "the," and so forth.
We did not find
anything at this point,
but it is something that
we could look at further.
We did look at the labels and
have social work researchers
look at them again.
And in fact, they get
falsely classified
when we use that adversary.
HANNAH: So I have
one quick question.
Sorry, you go ahead.
AUDIENCE: For the cross-language summarization, you talked about how [INAUDIBLE] on the mixed synthetic corpora
had better metrics than
each language alone.
Now I'm wondering why do you
think that that's the case?
KATHLEEN MCKEOWN: So I mean, there's possibly some generalization going on when we look across different languages.
The difference between the
mixed language and the language
in which the document originally
appeared is quite small.
I'm not sure we would
say it's significant.
It's very close.
But the fact that the mixed-language corpus helps means that we can move to use it on other languages; we don't have to have one for every single language pair.
So it's helpful from
that point of view.
HANNAH: So one quick question
about the Twitter analysis
data.
Did you also consider
emojis for encoding?
KATHLEEN MCKEOWN: Yeah, we have emojis. They're just words. So each one is a character, but it's treated essentially as a word.
HANNAH: Thank you.
[APPLAUSE]
