MODERATOR: Danna Gurari
is an Assistant Professor
at the University of Texas at Austin
where she directs the Image
and Video Computing group
for the iSchool, basically.
She did her undergrad work
at Washington University
in St. Louis, and her
PhD at Boston University.
But in between there, she
worked here in the front range
at Raytheon and Boulder Imaging.
So she actually knows about Colorado.
That's why she's back.
Her research interests span computer vision,
machine learning, human computation, crowdsourcing,
human-computer interaction, accessibility,
and biomedical image analysis.
And she's won a stack of
awards at conferences in areas
ranging from big ones in
CHI and computer vision,
and also more specialized ones
like medical image computing.
And she's gonna talk to us
today about some of that work.
Thank you.
It's a pleasure.
Thank you so much for having me.
[audience applauding]
Thank you all so much for being here.
My name is Danna Gurari, and
I look forward to talking
to you about my team's work
on designing computer vision algorithms
to support real users, and
recognize multiple perspectives.
So, in this talk, I'm gonna start
with a little bit of
background on computer vision.
Then go into the past
research and future ideas.
So diving in, computer
vision is the field around
making algorithms that see.
Today, you'll see computer
vision algorithms in things
like self-driving cars, vehicles
on Mars, guided surgery,
visual assistance for
people who are blind,
as well as to more efficiently
log on to personal devices
by just snapping a picture of your face.
And so I wanted to talk a little bit about
how did we get to where we are today.
The basic ingredients are emulating
the ingredients of sight: the brain and the eyes.
So if we take a trip back in time
to 1945, the time of World War II,
the first computer emerged.
We needed faster computation
to help the war effort,
and the first programmers were women,
who were not out at war.
Fast forward to 1957,
the first digital image.
This is it.
It's of a three-month-old
son of the person
who took the picture.
What is an image?
In the case of a grayscale image,
it's a grid of numbers.
That's what the computer sees.
What we see is the
mapping to various colors;
think paint by numbers.
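To make that concrete, here's a minimal sketch of a grayscale image as a grid of numbers (using NumPy; the pixel values are made up for illustration, not from the talk):

```python
import numpy as np

# A tiny 4x4 grayscale "image": each number is a pixel brightness,
# 0 = black, 255 = white. This grid of numbers is all the computer sees.
image = np.array([
    [  0,  50, 100, 150],
    [ 50, 100, 150, 200],
    [100, 150, 200, 250],
    [150, 200, 250, 255],
], dtype=np.uint8)

print(image.shape)   # (4, 4): height x width
print(image[0, 3])   # 150: the pixel in row 0, column 3
```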
Fast forward to 1966, out
of MIT comes the paper,
the Summer Vision Project.
The dream, solve vision in one summer.
[audience laughs]
AUDIENCE MEMBER: Find an undergrad.
[laughs] Yes.
Here I am today. [laughs]
So, in this space, I believe
there are a lot of people
here doing artificial intelligence,
thinking about how we make machines
that do intelligent
things in different ways.
Computer vision sits as a subset:
think about machines that see.
While, hopefully, you see
computer vision has been around
for a long time, the research
community didn't pounce
on it until between 1980 and 1990.
The most important
conferences in our community
include CVPR, followed by ICCV in 1987,
followed by ECCV in 1990.
So if you're curious,
those are the premiere places
to be publishing your work.
Many other conferences
also exist, with variants
on various applications and more.
If you look over the past 15 years or so,
from 2006 to 2019, at the
number of attendees at CVPR,
this is an example of what's going on
at all the computer vision conferences.
What you're gonna see is rapid growth.
So, last year when I was there,
there were 9,400 attendees.
They sold out.
You couldn't register if you
didn't get registered in time.
Why the rapid community growth?
Over the past ten years,
we see that computer vision
can work in practice.
People see that, they wanna get involved.
And so we see that there's
a huge rise of people
who are interested in
joining the community.
The typical computer vision pipeline today
consists of the following three steps.
First, create a large dataset.
Then, design an algorithm:
train it and evaluate it.
Finally, deploy that algorithm
in some real system
that goes out to users.
This idea, if you look at the
first two steps, is analogous
to how you might teach
a child to recognize
whether or not an image shows a person.
Beforehand, you might show
that child many examples of people
in many different shapes,
configurations, sizes, and so on,
so that when they see a
novel case, they can say yes,
there's a person, or no, there's not.
That's the idea of what's
going on with algorithms today.
So, just to pinpoint
where my specialty lies,
I've created two courses
that I have taught now
for three semesters each on
addressing these two steps.
The first is called Crowdsourcing
for Computer Vision.
All my slides are publicly
available on my website
if you ever wanna know
what I teach on this,
as well as a class on
introduction to machine learning.
So now I wanna talk about my research,
and what we're addressing.
In doing that, I want to
start with the inspiration
for me to pursue a career in research.
I did work here in Colorado.
I worked at a place
called Boulder Imaging,
and at the time, we developed
cutting edge video recording systems.
We were recording massive amounts
of data for many customers
including NASA, Lexmark,
all over the world.
And, what we were doing is
we were fueling the data
that would be used for analytics.
In parallel, I was supported to come
to seminars just like
this at the CS department,
which I did.
I thought it was so cool
that I decided to move away
from Colorado
[audience laughing]
and get a PhD, and here I am now.
Okay, so thank you for inviting me.
It's a pleasure to come
full circle and be back here,
and share about the research
that I've been doing.
In my team's research,
we largely focus on three key problems.
The first is that there's
a dataset mismatch
for real users' needs.
The second is that there's also
often an algorithm mismatch
for real users' data.
And the third is that algorithms return
one-size-fits-all responses.
And so the goal of my
team is largely thinking
about how to design
computer vision algorithms
that support real users and
recognize multiple perspectives.
So, I would like to
spend the rest of my talk
on these ideas.
I'll go over next the research questions
that I'm going to address.
How do we empower the community
to design algorithms
that match real users' needs?
Next, how do we design
human-machine partnerships
to efficiently produce accurate results?
How can algorithms anticipate when and how
crowd responses will differ?
And finally, how can algorithms recognize
the multiple possible
solutions for a task?
So those first two questions
are really focusing
on how do we support real users.
The next two are about how
we think about designing
future algorithms that can embed
multiple perspectives in them.
So diving into the first one:
the motivation, again, is the mismatch
between the datasets we use in practice
and real users' needs.
This work that I'm gonna
discuss has been funded largely
by Microsoft and Amazon.
The tasks I'm gonna talk
about revolve around describing an image.
So given an image like this,
one thing you might wanna do
is describe it with caption such as,
"A bunch of small light brown
mushrooms in a green field."
Alternatively, you might
wanna answer a question
about that image, such as
is it edible or poisonous?
And get back an answer,
in this case, poisonous.
These two tasks are very
important for real users today.
Already, people who are blind
have been using these systems
to take a picture, optionally
record a spoken question,
have that sent off to remote humans,
whether that's employees at
a company, or crowd workers,
who will in turn provide a caption
or an answer to that question.
This idea has been around
now for about a decade,
starting with VizWiz, which Tom Yeh,
wherever he is, was part of creating.
And there have been many other
companies and products
that have emerged over time
since then that do exactly this.
While the computer vision
community has put out datasets
to try to drive progress
in addressing these tasks,
there again is a mismatch.
I show three examples on the top row
from existing state-of-the-art datasets,
as well as three visual
questions from real users.
The mismatch stems in part from how
the data was collected.
In prior work, the picture and question
were collected independently.
Pictures were scraped off the
internet, often from Flickr.
Questions were contrived.
So typically, you would show
a crowd worker an image,
and say, think of a question
that would stump a robot.
For real users, it's gonna
be the same user who both
took the picture and asked the question.
And the data reflects
real users' interests,
such as learning the flavor of the food,
learning the temperature
on the thermostat,
as well as whether their shoes match.
So, given that, we decided
to embark on creating
a big labeled dataset
originating from real users.
Through a previous app that had
been used from 2011 to 2015,
over 11,000 people had submitted
over 72,000 requests.
Users agreed to share
their data for 62% of those requests.
So that is the foundation
of this body of work.
In making the dataset,
again, this data comes from real users,
so we had a lot of extra steps
to make sure that we
could publicly share it.
This included anonymizing the data
by transcribing questions
to remove voices,
re-saving images to remove metadata
such as people's locations,
and filtering out any sort of
personally identifiable information.
There aren't taxonomies yet
around what is private information,
what is personally
identifiable information.
We took a stab at it: we went
through all the images
numerous times and removed what we
deemed private,
including people's faces,
their medication information,
as well as their mail, which
can show their address
and other important information.
[audience member sneezes]
Bless you.
Finally, we took the remaining
data and sent it off
to the crowd to label it
with high-quality captions
as well as answers.
The result is two datasets.
For visual question
answering, we have ten answers
for every visual question.
For image captioning, we have
five captions for every image.
For the community, this
is the first dataset
where real users requested
image captions and answers
to their visual questions to
support their real daily needs.
Most commonly, people are
talking about everyday objects.
So if you look at a word map of
the most common words used
to describe their images,
often we have things like
bottles, packages, tables, counters,
boxes, and computers.
What kinds of questions
are people interested in?
Yes.
AUDIENCE MEMBER: So I'm wondering,
do you have the original answers
from when these
questions were first asked?
We did; we threw them out.
They weren't of sufficient
quality, and it was sporadic
how many answers you had.
At the time the app collected on average,
three answers per visual question,
but it could range from none to like six.
AUDIENCE MEMBER: Yeah, it'd
be especially interesting
if you have feedback from the
people whether the answers--
Yes.
AUDIENCE MEMBER: Worked for them.
Yes, yes.
Hold on to that thought.
Hopefully, I'll get to
that a little bit more,
but I think that's a
very important direction
for continued future work.
Oh yeah, and just one
comment: if you have questions,
please feel free to interrupt
throughout the talk.
So what kind of information
were people asking about
when they had a question?
Often they asked, "What is this?"
From a language perspective,
this dataset is very, very interesting
because many people use very
conversational language.
Unlike in prior datasets, they
say "Please," "Hi," "Okay"
when providing questions.
And so, if you're interested,
this whole dataset is live;
you can actually search through it.
Here's just a visualization
where I'm gonna show
a tidbit, expanding out to 50 images
from the entire dataset.
You can click on any one of those images
to zoom in and see the question,
the collected captions,
as well as the answers.
There's also extra metadata
on additional labeling we did
that you can use to search
and learn more about the
interests of this population.
Cool, we have a new dataset.
The goal was to get people
in machine learning, AI,
and computer vision more generally
to think about this problem.
So we next wanted to figure out:
how hard is this dataset
for the community?
So, in doing that, the standard
pipeline is: you have your data.
Split that data into a
training set and a test set.
Train the algorithm on the training set,
then take that trained algorithm
and apply it to your test data.
So for an example like
this, predict an answer,
and see whether it matches.
Tally over many examples and you can come up
with some performance metric.
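As a rough sketch of that evaluation loop (the `train` and `model.predict` pieces are hypothetical stand-ins for a real VQA system, and exact-match accuracy is a simplification of the metrics actually used):

```python
def evaluate(predict, test_set):
    """Tally exact-match accuracy over (image, question, answer) examples."""
    correct = 0
    for image, question, target_answer in test_set:
        if predict(image, question) == target_answer:
            correct += 1
    return correct / len(test_set)

# Hypothetical usage with an 80/20 train/test split:
# split = int(0.8 * len(data))
# train_set, test_set = data[:split], data[split:]
# model = train(train_set)
# print(evaluate(model.predict, test_set))
```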
So we did this
for pre-trained state-of-the-art algorithms:
we took the state-of-the-art algorithms
that had been fed the most popular
computer vision dataset
that preceded this dataset.
We evaluated across many metrics.
I show just a subset here.
The point is, algorithms
performed very poorly.
So, second scenario:
maybe the problem wasn't the algorithms.
Maybe it was the data.
So what happens if you take
these same algorithmic architectures
and you feed them data taken
by people who are blind?
Same story: you get a bit of a boost,
but algorithms performed poorly.
Let's try one more thing:
this is a small dataset
for machine learning.
What if we use the combination
of the old and the new datasets?
Tried that; in a nutshell,
algorithms performed poorly.
Great for publishing. [laughs]
So, what makes this hard?
We've tried to disentangle that.
This data matters for a real use case.
It's fun for publishing,
but if solved,
it could change people's lives.
This dataset is challenging for algorithms
for many reasons, including that
many images are low quality.
Many contain text, which is not
the case for other datasets
out there.
And they contain novel concepts.
So, many images contain things
like currencies and captchas.
And it turns out a lot of
them are pictures
of Kellogg-branded items.
[audience laughing]
So, we put out our dataset,
the goal again is to help real users.
We wanted to engage the community.
So that's what we did.
To accelerate progress, we
shared our datasets publicly.
Others are also sharing our dataset:
you can download it
on Academic Torrents and on Kaggle,
and Facebook put out
a new language-and-vision platform
where it's included as well.
To track progress,
we also provided a
public evaluation server
with a leaderboard.
Currently, people across
the world are competing,
and they can see where they
stand in an international ranking
in terms of their algorithms' performance.
We also organized events
to foster community
and celebrate progress.
Our first event was in 2018.
We will have another one in
2020 at CVPR this summer.
And for those of you...
And so in terms of impact,
we have a growing community.
We've had people write Medium articles
and MIT Technology Review articles.
People are writing articles
about how they're winning
or are in the top three on this challenge,
and tweeting about it and the dataset.
And so, if you're interested
or you wanna put a student on this,
I invite you to join.
Here's how it works.
As before, I said we split each dataset
into a training set and a test set,
with the target responses for the test set
hidden from you all.
Design an algorithm and
submit its predictions
on the test set by May 15th.
The top teams will win $10,000
in cloud computing credit.
If you're interested, feel
free to ask me questions
right after this talk,
and I believe we have a
colloquium afterwards.
Yes.
AUDIENCE MEMBER: So the
questions that people are asking,
are they very objective questions,
or are they kind of subjective?
Great question.
So they're not fact-based.
They're not just objective;
there are many variants.
The third part of my talk is gonna cover
what kinds of differences we see:
what kinds of questions,
and what kinds of answers are being elicited.
Awesome.
Great question.
Okay, so at this point, I
wanna shift to discussing
how to design human-machine partnerships
to efficiently produce accurate results.
In this case, the motivation is
that there is algorithm
mismatch for real users data.
This work has received lots of funding
from the Silicon Valley
Community Foundation.
Priscilla Chan and Mark
Zuckerberg are trying
to spend 95% of their
funds before they die.
Part of that is being able to figure out
how to prevent and treat many diseases.
And they funded our project
to help them develop tools
to analyze cells in the human body.
And so this work is
sponsored in part by them.
And the task for this work is
object segmentation.
So while the motivation
and funding for this work
come from biomedical images,
this task matters all across the board,
for biomedical images, medical images,
and everyday images,
and for many tasks, including
object classification,
object tracking,
image retrieval, and rotoscoping.
For rotoscoping, think Hollywood, where you create
a really cool background
when someone's not actually
at that cool background.
A typical user experience looks like this.
You have a user interface
with a bunch of algorithms;
let's call those one to seven.
Often these algorithms are interactive,
where a person will click
some initial region.
That doesn't take a lot of human effort.
And then, run an algorithm.
And sometimes, that
algorithm does terribly.
So at that point, you might
run another algorithm,
and in the second case, you'll
find it works beautifully.
So, it takes a lot of time
for the user to figure out
which tool to use when.
So, key observation, algorithm
performance is inconsistent.
In some cases, algorithms do well.
In other cases, they don't do well.
So, the idea of this
work is to involve humans
when they're needed most:
when algorithms are failing.
The foundation of this
work is a method
that can predict the quality
of an object segmentation.
So given an image with a ground truth,
which is what we want
the algorithm to produce,
we want the quality prediction mechanism
to be able to decide,
with only the image
and an algorithm-drawn segmentation to look at:
is it really good, is it bad,
or is it somewhere in between?
So when I talk about quality,
there are many metrics we could use.
In this case, I'm gonna talk
about the Jaccard index,
just to put us on the same page.
We'll use this as a toy example.
You have your ground truth,
you have your algorithm result:
intersection over union.
Overlay them and look at the pixels in dark green,
which is the intersection:
19 are in common.
Look at all the pixels
together: that's 27.
The performance for this
algorithm-generated result
would be 70%, and so
that's what we're talking
about for quality.
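In code, that metric is just a few lines; a minimal sketch over binary masks (NumPy assumed):

```python
import numpy as np

def jaccard_index(ground_truth, prediction):
    """Intersection over union of two binary segmentation masks."""
    intersection = np.logical_and(ground_truth, prediction).sum()
    union = np.logical_or(ground_truth, prediction).sum()
    return intersection / union

# Toy example in the spirit of the slide: if 19 pixels are in common
# and 27 pixels are covered in total, the quality score is 19/27 ~= 0.70.
```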
So, the key ingredients
for putting this together:
we got a dataset of
over 10,000 images
from four datasets.
For each image, we applied
14 different algorithms.
We also, because we wanted
high-quality results
in an amount similar to the failure cases,
derived three different segmentations
from the ground truth.
We used all of that to train two prediction models,
one for our 2016 paper
and one for our 2019 paper,
using a linear regression model
and a regression tree ensemble.
We did try CNN-based
methods; they did terribly.
So for those of you who are curious,
I'm not showing those results
'cause they just don't do as well,
and I'm happy to talk offline about that.
For each training instance, we
evaluate the segmentation's
similarity to the ground truth.
We compute nine features that describe
the binary segmentation mask,
as well as three CNN features
that describe the image properties.
So, for our data, you'll have
all the segmentation examples,
which are each either training
or evaluation examples;
you'll have the score first,
then the nine binary mask features,
and then the rest of the features
that describe the image.
After training this method,
what you get is this:
as input, you have an image
and a segmentation algorithm.
The algorithm produces a segmentation.
That segmentation and the image get passed
to the feature extraction module;
the features are passed on
to the prediction module,
and out comes a quality score.
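A rough sketch of that flow; the feature extractor below is a simplified stand-in for the paper's nine mask features and three CNN features, and scikit-learn's RandomForestRegressor stands in for the regression tree ensemble:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def extract_features(image, mask):
    """Stand-in feature extractor: a few simple binary-mask statistics."""
    ys, xs = np.nonzero(mask)
    if len(xs) == 0:
        return np.zeros(4)
    area = mask.mean()  # fraction of the image the mask covers
    bbox_fill = mask.sum() / ((ys.ptp() + 1) * (xs.ptp() + 1))
    return np.array([area, bbox_fill,
                     xs.mean() / mask.shape[1],
                     ys.mean() / mask.shape[0]])

# Training: each example pairs (image, algorithm mask) with its true
# Jaccard score against the ground truth.
# X = [extract_features(img, m) for img, m, _ in examples]
# y = [score for _, _, score in examples]
# model = RandomForestRegressor().fit(X, y)

# Deployment: predict quality without ever seeing the ground truth.
# quality = model.predict([extract_features(image, mask)])[0]
```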
How does this do in practice?
We explored, again, with four datasets,
and evaluated across two different metrics:
the correlation coefficient
and the mean absolute error.
I'm showing our results for
our top-performing method,
the random forest method,
as well as the previous
state-of-the-art method.
Without going into
the details of the numbers,
what you wanna see is higher
CC scores and lower MAE scores,
and our method consistently
outperformed the previous
state-of-the-art method.
What is really cool about this,
in my opinion,
is that what I'm showing is results
from a cross-dataset analysis.
So we train on the three
datasets not shown in red,
which are grayscale images
and color everyday images,
then apply that to the test examples,
which come from a biological dataset.
And what you get is very good
results, where the algorithm
can generalize what's learned
from everyday images
to biomedical images.
With that in place, we turned back
to think about the user:
how do we help the user be more efficient
in collecting segmentations?
So, this is the pipeline;
let's dive through the steps.
The first step is to pair
each image with a segmentation.
To do that, for each image we first
produced 14 computer-drawn options,
used our prediction module to predict
the quality of each one,
and then kept the best
one, shown highlighted in red,
for each image.
Next, we applied an algorithm to refine
each chosen segmentation per image;
we use the chosen segmentation
as a coarse initialization.
Given an image and ground truth,
let's just go through
what that looks like.
You can have an input that
comes from our predicted choice:
it looks pretty good, and refining it
gets a little bit more detail.
In contrast, you might
get a chance output,
where you randomly choose one option
from those 14 candidates
and then produce an output.
Or, another standard approach is to start
from a rectangle and then get an output.
AUDIENCE MEMBER: Excuse me.
Yes.
AUDIENCE MEMBER: Could you explain
what the 14 algorithms are?
Oh yeah, [laughs] so,
CPMC is Constrained-something, I
forgot what it stands for,
but it's an object region proposal method.
So that's one: it looks
at an image, and it tries
to find salient regions.
We also used what's called Otsu thresholding,
which is an image-based
thresholding method.
Hough transform.
I don't remember what else we did.
I don't remember all of them, truthfully.
But it's a whole span.
So, how did we choose them?
Maybe I'll answer that
without going into all of them.
So, we were working
across two communities:
the biological community
and the computer vision community.
And while in theory, you would think that
because we're all working with images,
we might care about the same tools,
that's actually not what happens.
The communities have some separation.
So, we took the methods
that are being actively used
by the biological community,
which are actually really old algorithms.
They work very well.
The Hough transform is my favorite;
it works just really well in practice.
And then we took the state-of-the-art methods
from the computer vision community,
which are these region proposal methods:
sophisticated methods that say,
here are definitely objects in the image.
And that's how we came up with
our collection of methods.
I hope that answers your question.
AUDIENCE MEMBER: How many
were from the biology side
and how many were from
the computer vision side?
Good question. I think it
was about half and half,
but I don't remember.
That's a good question.
And so, all together, you can see
that things like a rectangle work
somewhat well in practice,
and chance output does
pretty terribly overall,
but our method does very well.
We did quantify that:
we found on average that
intelligently choosing
from 14 options leads to a 15% improvement
in the resulting segmentations
over existing baselines,
for three refinement tools.
So at that point, we took the result
from the refinement algorithm
and predicted the quality
of the resulting segmentation per image.
And so we were able to pair each
image with a segmentation.
The next step is that we rank the images,
from the cases where we believe
we have the highest-quality
computer-generated results
to the lowest quality.
At that point, we allocate the
available human budget
to those instances where
the predicted quality
is the lowest.
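A minimal sketch of that allocation step, assuming we already have a predicted quality score per image:

```python
def allocate_human_budget(predicted_quality, budget):
    """predicted_quality: {image_id: predicted Jaccard score}.
    Send humans the images predicted to be worst; keep the
    computer-generated segmentations for the rest."""
    ranked = sorted(predicted_quality, key=predicted_quality.get)
    needs_human = set(ranked[:budget])     # lowest predicted quality
    keep_computer = set(ranked[budget:])   # highest predicted quality
    return needs_human, keep_computer

# allocate_human_budget({"a": 0.9, "b": 0.3, "c": 0.7}, budget=1)
# -> humans annotate {"b"}; computer results kept for {"a", "c"}
```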
In terms of results, we compared our approach
to the state-of-the-art at that time,
which took an image
and involved humans for every image,
using one of three options.
It predicted whether to
have the human provide a fine-grained result,
giving all the details,
which takes about 54 seconds;
or draw what's shown in green,
the square, which takes about seven seconds;
or create a coarse outline,
which takes about 20 seconds.
That was very innovative
when it came out,
and our work compares to that.
And here are the results.
What you see on the left
side is where computers
generated all the results;
on the right side is where
humans created all the results;
and in between is where you have some sort
of mix of human effort.
The key points I wanna make
are that our system works
even when a budget is not
available for every image.
That's one advantage.
Another is that intelligently
allocating human effort
typically eliminated 5.2 to 11.5 seconds
of annotation effort per image,
compared to what had been
the state-of-the-art method
within that range.
And so in this case, the same quality level
that took about 40% human involvement before,
you could get with about 20% human involvement.
So to summarize this component,
this work showed that you can
intelligently involve humans
where they are needed the most.
So now I'm gonna shift to that earlier
question about subjectivity [laughs]
and explore how algorithms can anticipate
when and why crowd responses will differ.
And so this work addresses
the fact that algorithms
return one-size-fits-all responses.
This work has been sponsored by NSF
to explore exactly the
question that I am about
to present to you.
We've done this for a couple of tasks,
but I'll just focus on
visual question answering,
which is: given a question about an image,
return an answer.
Key observation: again, we
collect multiple responses
for each task from different people.
Sometimes, crowd responses match.
When that's the case,
collecting redundant answers
compromises efficiency;
it's a waste of time and money.
Other times, crowd responses differ.
In that case, collecting multiple
answers captures diversity:
it shows the range of possible answers.
So if we really distill that down,
redundancy can be wasteful or valuable.
So, the key idea for this work
is to predict whether crowd
responses will differ.
We created a module that does that:
it takes the visual question
and predicts "answers match"
when it believes that's the case,
and "answers differ"
when it believes they won't.
You can take those predictions
to then decide when to collect one answer,
'cause that should be enough,
versus many answers when one is not enough.
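A minimal sketch of that decision rule, with a hypothetical `answers_will_differ` classifier standing in for the trained module:

```python
def num_answers_to_collect(image, question, answers_will_differ,
                           n_if_differ=10):
    """Decide how much redundancy to pay for: one answer when the
    module predicts agreement, many when it predicts disagreement."""
    if answers_will_differ(image, question):
        return n_if_differ  # capture the diversity of valid answers
    return 1                # one answer should be enough
```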
Compared to the status quo
application at the time
for collecting answers, what we found was
that using this approach could save 23%
of the cost and time,
without loss of diversity,
for over 120,000 visual questions.
This led to a follow-up question.
That's great, we can be more efficient,
but we still have crowd
responses differing,
and we wanted to know why.
And we wanted a machine to
be able to then tell a user why.
So, with that, we embarked
on the idea of designing
a dataset and algorithm for recognizing
why crowd responses will differ.
So we made a dataset.
We labeled almost
45,000 visual questions
from two different VQA datasets
with nine different reasons,
indicating for each visual question
whether or not each reason is valid.
And then we provided an algorithm
that, given an image and question,
predicts which, if any, of those reasons
explain why crowd
responses will differ.
So, looking at the taxonomy we chose:
it was informed by our prior work
and our analysis of the data.
We came up with a taxonomy
that includes subjectivity.
So, could the floor use some mopping?
Some people think yes,
some people thought no.
We also found there could often be synonyms.
So, how are the water conditions?
You look at the answers
and people say calm, lucid,
various similar meanings.
Other times the visual
question was ambiguous.
Which side of the room is the toilet on?
You'll see from crowd workers,
the middle of the room,
the left of the room, and
the right of the room.
Other times, it's just spam,
so you couldn't have predicted it.
Some person showed up
and gave an answer
that's just garbage.
That's very rare;
it's less than 1% of the time.
A lot of attention is given to spam,
but in my experience with this dataset,
it's the least important reason.
Other times, the answer was just missing:
you can't find evidence of
the answer in that image.
Visual questions could be super difficult:
how many sheep are there?
Low-quality images make
it a little bit harder
for people to come to agreement.
Sometimes you have an invalid question.
"I just wanted to say thank
you for your assistance."
People don't know how to respond.
[audience laughs]
And then, there's the level of detail people offered,
especially when you have
text present in the image.
People offered different amounts of detail
in how much they copied
over from the image
to provide as an answer.
So, what book is this?
Some people just said the title;
others gave the subtitle as well.
So we came up with that taxonomy
and labeled our datasets.
This taxonomy, we
found really interesting
because it helps in thinking
about future system designs.
One thing you can think about is that,
if you knew, given a visual question,
whether the image was low quality,
or it's an invalid question,
or the answer's missing,
you could tell the user to re-take the image
or re-ask the question.
The alternative, if you
have human-based services,
is that users can wait up to
two or three minutes
to learn that they need to
re-phrase their visual question.
For system designers,
there's a number of things they can do.
I didn't put one of them up,
but for Bo, we talked
about output agreement:
how do you aggregate different responses?
There's a pool of algorithms for that.
You could certainly help system designers
figure out which methods to use
to aggregate different responses.
You could also figure out when to recruit
more costly experts to come do a task.
So, that was the taxonomy that we came up with.
In terms of our algorithm,
hopefully it becomes clear
that it's not obvious how to design
an algorithm to do this.
Because as a human, you need the image,
you need the question, and
for some of these cases,
you need to see the answers.
But at test time, all
you have is the image
and the question.
So the key idea was to take advantage
of the fact that we believe algorithms
are currently implicitly learning
when there might be
multiple answers
as they predict an answer.
We have algorithms today
that, given an image and a
question, will predict an answer.
Some of those algorithms
will give you a distribution
of their confidence:
I'm 60% sure it's yes,
I'm 40% sure it's no,
some confidence distribution.
And so the idea is that maybe
the computer is getting confused
when it realizes there should be
more than one answer.
And so that's what we did
in designing this algorithm;
we took advantage of that intuition.
We designed a neural
network based algorithm
that takes in the image and the question,
passes them through an
answer prediction module,
and then has the image, the question,
and the predicted answer distribution
go to another module,
which then flags the reasons
why the responses would differ.
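A rough PyTorch-style sketch of that wiring (the fusion scheme, layer sizes, and answer/reason counts here are invented for illustration; the real model is more involved):

```python
import torch
import torch.nn as nn

class DisagreementPredictor(nn.Module):
    def __init__(self, feat_dim=512, n_answers=3000, n_reasons=9):
        super().__init__()
        # Module 1: predicts a confidence distribution over answers.
        self.answer_head = nn.Linear(2 * feat_dim, n_answers)
        # Module 2: sees the fused image/question features *and* the
        # answer distribution, and flags reasons responses would differ.
        self.reason_head = nn.Linear(2 * feat_dim + n_answers, n_reasons)

    def forward(self, image_feat, question_feat):
        fused = torch.cat([image_feat, question_feat], dim=-1)
        answer_probs = torch.softmax(self.answer_head(fused), dim=-1)
        reason_input = torch.cat([fused, answer_probs], dim=-1)
        return answer_probs, torch.sigmoid(self.reason_head(reason_input))

# img_feat = torch.randn(1, 512); q_feat = torch.randn(1, 512)
# answers, reasons = DisagreementPredictor()(img_feat, q_feat)
```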
So this was a novel
problem for the community,
so we didn't have any
methods to compete against.
The question was: does this work?
We found yes, algorithms can anticipate
why crowd responses will differ,
better than random
guessing by a long shot.
And if you look at the related methods
that you might try to
adapt for this purpose,
you find this kind of system
works even better than those.
These problems matter for real end users.
So as before, we have
shared the dataset publicly,
and we have created a
public evaluation server
with a leaderboard to try
to get a whole community
working around this problem.
Any questions before I
go into the last piece
of prior work?
Cool.
So, the last thing I wanna talk
about is: how can algorithms
recognize the multiple
solutions for a task?
This work has largely been done
in collaboration with Adobe,
and we're again addressing
the fact that algorithms return
one-size-fits-all responses.
For this task, it's thinking
about what's called
image inpainting: given an
image with the region in red,
we wanna remove that and
insert something that's
possibly realistic, but different.
There are many applications for this,
including personal photo editing,
advertising, landscape architecture,
augmented reality, interior design,
as well as privacy obfuscation,
which is something I care a lot about.
Status quo: today, if you wanna do this,
many people do it this way.
They take an image,
cut out their little hole,
go search some database to
find the beautiful image
that they love, find it,
put that in the image,
apply an algorithm such as PatchMatch
that synthesizes the surrounding background,
and get a result.
And they do that over and over and over
until they find what
they're interested in.
In 2018, a team said,
"Let's do a little bit better than this."
And they proposed an algorithm
that can take the image
with the region and a category,
then look through that database
and output possible
things to put in that hole.
Our novel idea is a program we call
Unconstrained Foreground
Object Search, or UFO Search:
take in an image with the region,
search the database,
and figure out
all the compatible objects,
across all categories.
So, in this work, the goal is to go beyond
what we did in the previous study,
which was just recognizing when and why
crowd responses will differ;
let's start to identify the
variety of possible results.
Intuitively, if you look at the top image,
you'll see what looks like an umpire.
You might not be able to see it;
there's a strong light,
but there's an umpire,
and there's someone who appears
to be running over there.
It seems like only a
catcher in a baseball game
might fit there.
In contrast, in the bottom image,
you have a plate on a table;
perhaps any kind of food
could fit in that context.
So figuring out what's
appropriate for each context
is the challenging thing for an algorithm.
The solution: embed both
the background image with the hole
and all the candidate objects
in the same latent space.
Putting those together, you then
find the compatible objects,
and with post-processing
you can come up
with the alternatives.
So, I'm not gonna go
into how to train this.
I'm happy to answer questions.
I have a slide where I'm
happy to go into details,
but I thought with the fourth topic,
you would [mumbles and
laughs] of the details.
What happens is, computed offline,
we come up with a way to
map all of the objects
so that similar objects are
closer in the search space
and dissimilar objects are
further away.
We also train beforehand an algorithm
that can encode
the image with a hole
into that same search space,
such that it lands
close to the objects
it is compatible with, and
far away from those it's not.
And then we return a ranking
of the candidates
from most to least compatible.
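A minimal retrieval sketch, assuming the two encoders just described already exist and produce embeddings in the shared space (cosine similarity stands in for the learned compatibility score):

```python
import numpy as np

def ufo_search(scene_embedding, object_embeddings, top_k=5):
    """Rank candidate foreground objects by compatibility with a
    background scene, where both live in the same embedding space."""
    scene = scene_embedding / np.linalg.norm(scene_embedding)
    objects = object_embeddings / np.linalg.norm(
        object_embeddings, axis=1, keepdims=True)
    scores = objects @ scene              # higher = more compatible
    order = np.argsort(-scores)           # most to least compatible
    return order[:top_k], scores[order[:top_k]]
```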
Hitting on what I think is
really exciting about this work:
this system is able to learn
when to return no diversity,
so it's able to learn when
to return only one category, catchers.
It's also able to learn
when to return a diversity of options,
in the bottom case, many
different kinds of food.
So it's trying to learn
how to use the context
to figure out what's
appropriate within that context.
AUDIENCE MEMBER: So it's because the label
is limited to certain choices?
When you say the label is
limited by certain choices,
can you tell me what you mean by label?
AUDIENCE MEMBER: So I mean,
no diversity versus diversity,
the model in certain cases
can keep diversity.
Is it because [muttering].
No, it's a great question.
So the challenge with
this is that when you start,
you have one image with one ground truth.
What do I mean by one ground truth?
You have that original
image with only one object
that you know could live there.
You don't have anything else.
And so the trick that I didn't get to
in talking about the details
is that part of what we did
is figure out how to create a bunch
of examples that are appropriate
to fit within this context.
So part of the novelty of
this work is figuring out
how to artificially create training data,
with no human labeling,
and how to do all that automatically.
So that's one aspect of the novelty,
and the other aspect is algorithm design.
But when you start,
for each image you have
just one labeled truth,
which is the content
that was originally there
before you removed it.
AUDIENCE MEMBER: So
you're creating some truth
[muttering] distribution that's possible
and then [muttering].
I wouldn't call it truth.
We're making an approximation,
and it works well
in practice in our experience.
We did crowdsourcing
studies to gauge realism,
quantitative studies as
well as qualitative studies,
and what we found is
that this method can do
a reasonably good job.
The crowd workers are
like, "Yeah, that looks
"like a realistic object
to put in that image."
AUDIENCE MEMBER: Also,
for the catcher or something,
how do you deal with,
how do you call it, the size?
The scaling of that object
kind of matches to the--
Yes, great question.
That is one of the
deficiencies of this algorithm.
So future work, if you're
interested in building off of this,
would be figuring out how
to deal with scale.
Geometry is another issue
we didn't deal with.
So it's learning what's
appropriate for the context,
but it's not necessarily learning
how to perfectly fit
within the image.
There are a lot of aspects;
it's really hard to do
the whole pipeline perfectly.
AUDIENCE MEMBER: What
makes it difficult
to handle that scale, for the
models you want to scale?
I mean, partially, that's a good question.
So what makes it difficult?
If you have unlimited
data, everything's trivial,
right?
But you don't, so that's part of it.
If you have an algorithm
that has embedded within it
a module, so we built an [mumbles] system,
you need to build some sort of module
that would deal with the scale.
But then you need to make
sure everything still works,
and that that scale
module doesn't compromise
the performance of other things.
So, could you do it?
Yes.
Do I think it's important for it to work?
Yes.
Did we do it?
No. [laughs]
So, yeah.
I think in general, when
you try to put together
an entire system, you find
that choosing to prioritize
one thing, such as
diversity, gets diluted
when you try to put in other
things that you're trying
to optimize as well.
And the key is, how do
you do that seamlessly?
Does that answer your question?
Okay, great questions.
Actually, I can't see a clock,
so I don't know how I'm doing on time.
MODERATOR: Ten minutes.
Ten minutes, okay great.
So that's it for the
past work that I wanted
to share with you all.
To summarize, this work
was really looking at datasets that match
real users' needs,
designing systems that can
efficiently involve humans
to counteract algorithm mistakes,
and also designing
algorithms that can recognize
and convey multiple perspectives.
In terms of future work, we're still
looking at that pipeline.
The future work I'd like to talk about
is datasets for analyzing private content;
systems that, again, involve
humans to counteract
algorithm mistakes, but now in 3D,
which is much more complicated;
as well as context-aware algorithms.
So, to motivate the first problem,
the first of these three
future work projects:
we did some initial work
that revealed that 12% of images
taken by people who are
blind show private content.
So this is a real problem.
Here you see some samples of this;
all the original data is obscured.
I will not show you
any private information,
so you'll see in red
a circle around the region
that's obscured.
It might even be hard to find it
if you had the original information,
'cause these are
again low-quality images, right?
But that's the issue:
how do you deal with that?
And of note, it's not enough to just flag
whether or not images
contain private information,
because 20% of the time, people
share private information
intentionally as a trade-off to learn
about the visual world.
For example, they might want to know:
is this pregnancy test
positive or negative?
What is the medicine in this pill bottle?
Yes.
AUDIENCE MEMBER: So in the other 80%,
they're asking about
something else of interest?
The other 80% accidentally
captured private information.
So what that means is that
if you want to think about
how you deal with
privacy, we want a system
that can handle multiple things.
And I'll get to that in a second.
We did come up with a taxonomy.
It's an open problem
how to even come up
with a taxonomy that really
addresses what is private
versus not.
We're doing user studies
to ask people who are blind
to help us with that;
we'll have some responses by April
on what people who are blind,
at least, think is private.
But we, in-house, went through a process
of figuring out some categories,
and this is what we came up with.
So if we think about this,
the data really is complicated,
because we can have
things that are private
from an object standpoint,
as well as things where it's text,
and where that text is situated
makes them private.
And so in thinking about how we deal
with this real-world problem,
we want to develop algorithms
that recognize, obfuscate,
or answer questions about
private information.
We do not want to ever share
the private information.
So, this situates within a
broader set of problems,
long-tail problems,
where you just don't have
a lot of data examples,
beyond just the focus on privacy.
So, where I'd like to go for future work
is generating a dataset that
contains fake, realistic
private content so we can
start to support this use case.
Hopefully some of you are like,
"Whoa, that sounds terrifying!"
Yes, that's terrifying.
We need to figure out what information
is deemed private, and when.
And I've also already contacted a lawyer
to think about what
usage regulation guidelines
are needed to avert unintended
dangerous uses and consequences.
So, very interdisciplinary work,
but I think very
practically important work.
And technically, it's very challenging,
'cause private information
spans objects and text,
and figuring out how to inpaint all that,
we are far from that.
The results that we obfuscated with,
hopefully you remember
what they look like:
they kind of look like garbage,
and that's the state-of-the-art algorithm.
So we've got a long way to go technically.
Another future direction is
compensating for algorithm
failures on 3D data.
So, two examples of thinking about 3D data.
One is you have an image from a stack;
lots of medical data looks like this.
And so, you might want to find regions,
segment out regions,
or find matching spots.
The other is following objects over time
in video data, such as these cells:
being able to locate where
they are, their shapes,
all that metadata.
Generally, if we can build
tools that can do this analysis
of 3D data, we could accelerate
basic science research
and medical discovery.
I believe algorithms are far
from being where we need them to be,
and we need to make something useful.
And I think what's interesting in that
is creating the hybrid partnerships.
The key question is
figuring out when and why
algorithms fail,
not per algorithm, but starting to think
about properties across
different algorithms.
And then, how do you design
crowdsourcing systems
when you're dealing with
3D data or with video?
It would be tedious to show a human
the entire video, or the whole 3D stack,
and have them maneuver it,
so there are some challenges
in crowdsourcing.
Also, figuring out what
kind of human involvement to use
is an interesting aspect of this.
And the last one is thinking
about context-aware algorithms
and one use case is image captioning.
So here's an image.
If you ask today's state-of-the-art
algorithms to describe that image,
you will get from an
algorithm something like,
"A man and a woman sitting on a bench."
We conducted a study asking 28 people,
what kind of information they would want
in seven different contexts.
Some of you, I think, are
familiar with this work.
It came from Abigale Stangl,
who finished her PhD
at CU Boulder and is
now a postdoc in my group,
and she helped us uncover what people
who are blind want in different contexts.
In particular, seven.
Just to narrow it down
to three of our findings:
one is, if the media
was in a social media feed,
you'll want the person's
facial expressions,
their body language, and their interaction
with the environment.
So, the content in red.
On a dating website, you would wanna know
the person's physical characteristics
and the setting, to learn
about that person's interests.
On a shopping website,
let's assume it's for clothes,
all you care about is the clothes
and the person's body type,
to know whether or not that
model's body is similar
to your body.
And so the goal of
this work is to develop
algorithms that can move
beyond one-size-fits-all
and really return the
appropriate caption per context.
This has a number of key challenges.
One is: how do you efficiently
crowdsource those captions
to ground content for each context?
When I say ground, I mean
put language to actual
physical locations,
so seeing the blue region, the red region,
or the purple region.
To efficiently do that with crowdsourcing
is very difficult;
it requires a lot of innovation
to think about
how we collect a large-scale dataset,
if that's the way we go, or maybe
human-machine partnerships.
The other is: how do you design
new algorithmic frameworks
that can simultaneously
learn all the information
across these contexts, and also figure out
when to return each kind of information
in a human-like description?
So that wraps up the future work.
There are many people to acknowledge.
I've had an amazing group of mentees.
Their images are up in
the upper right corner.
In terms of the work,
I've had many colleagues
who have been very
instrumental in the work
that I've presented.
There have also been many
people who have contributed
to the dataset challenges that we have.
It really takes a community
to run a challenge,
and we have been very
fortunate in the people
who have contributed.
And then of course, there have been thousands
of crowd workers who've
given tens of thousands
of hours to this work.
We've also been very
fortunate to receive generous funding
from NSF, Microsoft, Adobe, Amazon,
the Silicon Valley Community
Foundation, as well
as UT Good Systems.
So with that, I'd very
much welcome any questions.
Thank you for your time.
[audience applauding]
MODERATOR: We had a lot
of questions in line,
but unfortunately there's
only time for a few more.
AUDIENCE MEMBER: So, great
talk, thank you very much.
On the subject of one-size-fits-all,
a lot of these treat [coughing
drowns out speaker] humans
as interchangeable units,
where we know little about them
or what makes them different.
And so I'm curious about your
thoughts in these domains:
what are we missing by taking that view
of humans, [mutters] some opportunities
where, if we had a richer model
of what humans were actually doing,
that would make this work even better?
So I certainly think you can;
there are a lot of cues that tell you about
people's mindsets
when they're responding.
And so if you think about the
crowdsourcing environment,
you have cues like: how long did they take
before they responded?
How many times did they
type and then delete?
Like, how contemplative were they?
So you can certainly use
behavioral mechanisms.
I say that, but I also
wonder how ethical that is.
And so there are some questions
in crowdsourcing about
how far we go in collecting
people's information
without notifying them.
But I think that's a valuable direction.
I think fundamental
cognitive science studies,
I'm not gonna go there because
it's just too far a leap,
but I would love to collaborate
with people who are there,
just understanding how
different perspectives can arise.
And that can even just be in terms
of how people have learned language,
which can vary culturally between
cities, countries, states.
So that's on kind
of the data collection side,
but I also think we have a
lot of studies to do asking
what users care about.
Like, maybe a user doesn't
care if they're synonyms;
they're all equally good,
and we shouldn't bother
collecting redundant
responses in that case.
And so doing user studies
on the end-user aspect,
and really understanding what
these important problems are.
We kinda came up with a taxonomy,
but enriching our focus on
what really matters to people.
Those are some thoughts.
Thanks for listening.
MODERATOR: So given all the questions
that were in line,
we're kind of out of time, so--
[Danna laughs]
[audience applauds]
