I put this here for my own record as well: in the future, when I go through slides I've given before, I will remember that this was the virtual ICLR; hopefully this will be the only virtual one for the foreseeable future. I hope everyone is doing okay. Personally, it's 5 p.m. where I live, in London, UK, and today is a pretty rainy day, so we won't be sharing much more than the slides, but hopefully that still creates some social aspect that will be good for later discussions. I'm going to be giving the first talk, so I thought I would switch around a bit the two words this social is about: generalization and generative
models. Hopefully I can manage to explain the three main points given in the social description: generation by neural networks, generalization by neural networks, and data augmentation for neural networks. In my keynote I'll try to cover all three areas; I think there are some interesting connections that make the talk flow reasonably well from the first to the second to the third.

So let's begin with generation by neural networks, generative models. Some people conflate this with unsupervised learning, or use that term to refer to the same set of techniques, but in this talk, and maybe in the whole social (although we'll see about the other talks), what I understand by a generative model is a model, obviously parametrized, perhaps with a dataset to learn from, whose only job is to give you a sample of some data modality of interest. That's the first thing that comes to mind when I think of a generative model. This is the tweet that Prabhu posted earlier, and there's a bunch of modalities in it: text, and some video and images here that we can see. So the dream, perhaps, of generative models is that, just as we generate data modalities every day, good models would generate these modalities as well.

(There's some noise; I don't know if everyone is muted, but it's potentially distracting for others.)
So let's dive a bit into what the model is. In general we are interested in neural networks because they're quite high-capacity, expressive models for generating high-dimensional modalities such as images, videos, and text. I took this slide from Aaron (van den Oord), a colleague at DeepMind, who basically characterizes the model part of the equation by four categories. The first kind are what are called latent variable models; the VAE is probably the famous example that everyone still uses and researches these days. The second are what are called implicit models, of which GANs are famously a part, and I'll describe a bit more not only their differences in preference but also, superficially, what these models differ on. The third are transformation-based models, such as RealNVP or NICE, and I think we actually have a keynote at ICLR talking precisely about these flow-based models. Last but not least, and certainly one where I have quite a lot of work myself, are autoregressive or so-called fully observed models; there's a lot of early work here, and also more recent work such as PixelRNN, WaveNet, and PixelSNAIL, and I guess GPT-2 would fall into this category as well.
There's a bunch of tutorials listed on the slide; they're fairly old because the slide is slightly outdated. But let me try to unpack these four categories for you a bit more, or really three, depending on how you look at them.

The first one, which was the last one on the previous slide, is autoregressive or fully observed models. Remember, we're trying to model a multi-dimensional, complicated data modality. These models assume you're going to observe every pixel, or every word, or every character in a sequence, and so on, and they directly model the probability distribution of the data so that you can sample from it, generally in an autoregressive fashion using the chain rule of probability. So these models represent all the random variables involved in a modality by simply applying the chain rule; you train the model, and then you can sample from it, and to train you generally just maximize the likelihood of a given dataset.
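The chain-rule decomposition above can be sketched in a few lines. This is a toy stand-in, not a neural model: `cond_prob_one` plays the role of the network's predicted conditional p(x_t = 1 | x_<t), and sampling is the usual ancestral, one-symbol-at-a-time procedure.

```python
import math
import random

# Toy autoregressive model over binary sequences: p(x) = prod_t p(x_t | x_{<t}).
def cond_prob_one(prefix):
    """P(x_t = 1 | x_{<t}): a hypothetical conditional that favors runs of 1s."""
    if not prefix:
        return 0.5
    return 0.8 if prefix[-1] == 1 else 0.2

def log_likelihood(seq):
    """Chain rule: log p(x) = sum_t log p(x_t | x_{<t})."""
    ll = 0.0
    for t, x in enumerate(seq):
        p1 = cond_prob_one(seq[:t])
        ll += math.log(p1 if x == 1 else 1.0 - p1)
    return ll

def sample(length, rng=random.Random(0)):
    """Ancestral sampling: draw x_t from p(. | x_{<t}), one symbol at a time."""
    seq = []
    for _ in range(length):
        seq.append(1 if rng.random() < cond_prob_one(seq) else 0)
    return seq
```

Training a real model means adjusting the parameters inside the conditional to maximize `log_likelihood` over the dataset; sampling is exactly this loop, which is why it's cheap per step but inherently sequential.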
The second category of models does something a bit different. It does not necessarily (although some variants do) decompose p(x) with the chain rule as before; instead it assumes there are latent variables z, factors that are not observable. These can be as simple as a Gaussian prior, Gaussian noise, and you then transform this simple distribution onto the real data distribution over x. This is what the transformation-based models, and also latent variable models such as VAEs, adopt; how they actually infer z from x changes depending on the flavor of the model. There's lots of interesting work in this area.
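A minimal sketch of the z-to-x direction just described, with a hypothetical affine `decode` standing in for a learned decoder network:

```python
import random

# Toy latent-variable model: z ~ N(0, I), x = decode(z).
def sample_z(dim, rng):
    """Draw a latent from the simple Gaussian prior."""
    return [rng.gauss(0.0, 1.0) for _ in range(dim)]

def decode(z):
    """Hypothetical decoder: an affine map from latent to data space.
    The weights (2.0) and bias (1.0) are placeholders for learned parameters."""
    return [2.0 * zi + 1.0 for zi in z]

rng = random.Random(0)
x = decode(sample_z(3, rng))   # one "generated" data point
```

The inverse direction, inferring z from x, is the part that varies by model family (exact inverse for flows, an approximate posterior for VAEs).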
And then the last one is perhaps the most famous from the last few years, and to me the most researched upon: implicit models, of which GANs are maybe the major example. In GANs what you generally do is infer x from z, and then you score x by essentially trying to classify whether samples come from the real distribution or not. There are some interesting variants of GANs where you also try to infer z from x, BiGAN being the notable example, but generally GANs only deal with the mapping from z to x.
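The "classify real vs. generated" scoring can be written down as the standard GAN losses; here `d_real` and `d_fake` are assumed to be the discriminator's output probabilities D(x) and D(G(z)), everything else is the textbook objective:

```python
import math

def d_loss(d_real, d_fake):
    """Discriminator loss: push D(x) toward 1 for real samples
    and D(G(z)) toward 0 for generated ones."""
    return -(math.log(d_real) + math.log(1.0 - d_fake))

def g_loss(d_fake):
    """Non-saturating generator loss: make D believe samples are real."""
    return -math.log(d_fake)
```

Training alternates gradient steps on these two losses; note that nowhere do we evaluate p(x) itself, which is exactly why the model is called "implicit".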
So there's an interesting characterization I put together here, which tells us the pros and cons of these three categories, or four if you count transformation-based models as their own category like Aaron does, though I generally prefer not to. For fully observed models, we know the advantages are that evaluating the likelihood is very simple: the chain rule of probability is tractable. Sampling is also simple, but it's actually quite expensive, because you need to generate one element after the other. And extracting some sort of representation is hard, or near impossible, although wherever you see "hard" or "impossible" in this table it means there's generally some research going on; in fact, perhaps the greatest breakthrough of these fully observed models has been BERT, so we can say BERT is a good way to extract a representation from a fully observed model.

Now, latent variable models have different trade-offs: sampling is exact and quite cheap, which is always a desirable property. That's why, to do things like text-to-speech, you first train a fully observed model, WaveNet, but then you actually distill it into an essentially feed-forward model. And for the last one, let's say GANs, simplifying a bit: it's quite hard or impossible to evaluate p(x), but it's again simple and cheap to draw samples from the distribution these models implicitly learn, and extracting a representation is a bit hard, although there's a very nice paper on BigBiGAN, which reverses the mapping and, instead of going from z to x, goes from x to z, and finds that the learned representations are useful for ImageNet classification and so on.
Then there's a last category; this is more about feature learning in general, rather than generative models understood as "I want to sample from the model, just give me samples and I'm happy". This fourth category has gained a lot of popularity recently, so I thought it would be interesting to mention: unsupervised feature learning, or self-supervised learning; people have given it many names. A lot of these methods right now use contrastive losses, and thanks to these losses they learn features, they extract representations, from which training, say, a linear classifier for ImageNet is very, very simple. We've seen a lot of progress and interesting results, and models like MoCo, CPC, and SimCLR are examples of this last category, which I'm not going to describe beyond its row in the table, so that you can see the full picture of feature learning in general.
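A sketch of a contrastive (InfoNCE-style) loss for a single anchor, assuming the similarities (e.g. cosine similarities between embeddings) have already been computed; the softmax is evaluated stably via log-sum-exp:

```python
import math

def info_nce(sim_pos, sim_negs, temperature=0.1):
    """Contrastive loss for one anchor: pull the positive pair together,
    push negatives apart. sim_pos is the similarity to the positive view,
    sim_negs are similarities to negative examples."""
    logits = [sim_pos / temperature] + [s / temperature for s in sim_negs]
    m = max(logits)  # subtract the max for numerical stability
    log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
    return -(logits[0] - log_denom)
```

A well-trained encoder gives high `sim_pos` and low `sim_negs`, which drives this loss toward zero; the learned features are then what you feed the linear classifier.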
So that brings me from the model categorization to generalization. I think the best examples of generative models that actually do generalize are conditional generative models.
Machine translation is perhaps the example that comes to mind first, and in addition it's something I've worked on in the past: it generates text in a different language. Let's say I don't speak Chinese and I want to translate from Chinese to English; this is a generative model, where you sample or decode from the model's conditional distribution p(y = English words | x = Chinese characters). This is actually what's driving most of the big translation frameworks used by many of the big companies. So generative models really do generalize, and they're used every day, obviously in circumstances that go beyond the training distribution. They're obviously not perfect, but one of the best models we have right now is actually a generative model based essentially on sequence-to-sequence learning. So this is an obvious example. The other one that was mentioned in the introduction is text-to-speech: the same principle, a conditional generative model again, where given a few characters you generate audio. This is how your Google Assistant talks to you, for example; so, again, used by many, and it generalizes quite well.
Then there are other applications: in the last few years we've seen very impressive computer vision generative models, where you use a generative model to help you create modalities or transfer between modalities. There are quite a few examples; I'm going to show a few that I think have been quite remarkable. Obviously, given an image you can generate another image that relates to the input image; in this case you make a caricature out of an image, and this is fairly early work, from 2017. We've all seen these examples of transforming zebras into horses, and I think this is a nice example of generalization: you train your model to do this image-to-image translation, but at test time you give it an image it hasn't seen, and these models generally work reasonably well, although it might be a bit harder to characterize their generalization capabilities, because this task sits more on the creative side, let's say.

There are also interesting applications for what's called sim-to-real in robotics. It's very common to have access to a good simulator, but these simulators don't tend to look as photo-realistic as the real images the robots would see, so an obvious thing to try is, again, to regenerate these simulated images into something that looks closer to reality, and there's a lot of interesting work on this kind of domain adaptation for these tasks as well. And then all sorts of other things:
very creative usage of generative models, such as super-resolution of images. This is another example of super-resolution, from Angela Dai et al., that is not exactly on images but on meshes that you can extract from partial scans that are a bit noisy, and you can see how the model helps you inpaint and reconstruct these gaps; also very impressive stuff.

Going back a bit to what has happened in generative models of text, before moving on to images: I want to mention, because there's been quite a lot of progress there, that generating samples from a model is something that's been done for quite a while. If you read the paper from Shannon from way back, you can see he describes a procedure for sampling from an n-gram model to see what the English language could look like, and in fact the SLP book (Jurafsky & Martin) shares a few samples from a trigram model, which you can see here. This is not very good text, but of course this is a fairly old procedure: train these models and then sample from them.
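Shannon's procedure is easy to reproduce in miniature, here with bigram counts from a tiny corpus:

```python
import random
from collections import defaultdict

# Shannon-style sampling from an n-gram (here bigram) model:
# count word pairs in a corpus, then repeatedly sample the next word
# given the previous one, in proportion to the counts.
def train_bigrams(tokens):
    counts = defaultdict(lambda: defaultdict(int))
    for prev, nxt in zip(tokens, tokens[1:]):
        counts[prev][nxt] += 1
    return counts

def sample_text(counts, start, length, rng):
    out = [start]
    for _ in range(length - 1):
        nxts = counts.get(out[-1])
        if not nxts:            # dead end: no observed continuation
            break
        words, weights = zip(*nxts.items())
        out.append(rng.choices(words, weights=weights)[0])
    return out
```

With a real corpus and higher-order n-grams this produces the locally-plausible, globally-incoherent text shown on the slide.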
Ever since then, we've had a few leaps in quality. Ilya (Sutskever) had a nice paper, the one that actually got me into recurrent neural networks, that trained a language model and sampled from it, and the samples looked much more realistic than those you would get from n-gram models. Then later on, with LSTMs and bigger models, you got single sentences that started to look like something you could believe was written by a person; the samples from the model were fairly realistic. Then, in 2018, with the introduction of Transformers, Liu et al. had a very nice paper on summarization, with very nice examples where you condition the model on the title of a Wikipedia article and the model generates the whole article, fairly consistent long-term and reasonably good. And perhaps the example that, if anything, is the one you've heard about is GPT-2, which again is a model trained on a lot of text, and it notably generates samples that leap forward even further beyond what language models could do in the past. So clearly there have been a lot of advances here, and I expect to keep seeing more and higher-quality samples from the
next generation of models. Now, in images, the story is fairly similar, so I'm not going to expand too much. Even before the GAN paper, Marc'Aurelio (Ranzato) had a very nice paper in 2011 where he was already generating faces from a dataset of faces. Back in the day there were also interesting generative models of trees, rockets, apples, all sorts of different classes, and you could see that the samples statistically started to look like the images they were representing; there was also very interesting work on generative models of textures, which are, statistically speaking, perhaps slightly less complicated than full images. Moving forward, from GANs onwards (though Marc'Aurelio's paper should also have been included here), we've seen a tremendous advance in the quality of samples we're getting from generative models, and GANs have definitely taken the lead in terms of showing this progression. The latest generation of models is very impressive: these are BigGAN samples, and the images really do look real. Of course the question is whether they're just copying from the dataset; there are a lot of interesting questions around evaluation, which I will actually get to in the third part of the talk.
StyleGAN2, this is almost unbelievable, the quality that you're getting: you can do advanced image processing where you mix two people and see what the result looks like. This is really a testament that generalization for generative models, even if it's not clear how to quantify it, has performed astonishingly well, and the applications for all this image transformation and processing, augmented by neural networks, are just seeing the light of day; it's really interesting to witness how the field is advancing. But, as I said, besides nice samples and possibly augmenting Photoshop and so on with these tools, there's been a realization that just assessing the quality of these models is actually quite hard.
If you think about it, even for GPT-2 and so on, having a metric that tells us whether this sample, this text, really reads as if it was written by a human is almost like solving the Turing test for text or images. We would need a scoring function, and unless we ask humans to rate the samples, which many people do, that problem feels as hard as the problem of generating the samples to begin with. There was a very interesting, thought-provoking paper by Lucas (Theis) presented at ICLR, I believe around 2015, so a very early ICLR, which argued that all the metrics people use to claim "my model is not overfitting", or "likelihood is a good metric", or all the other sorts of metrics for image generation, can be gamed, and so there's really no single metric for assessing how good generative models are that is widely accepted in the community. And that's fine, because it makes it a research topic in itself: the metric
by which you can measure this. So, if you've followed progress in GANs a bit, there are two metrics people use, and one thing many people have argued for, and this is the third part of the talk, data augmentation for neural networks, is to use data augmentation precisely as the metric for assessing how good a generative model you have. This is a fairly simple idea, so simple it can be presented in one slide. Imagine you have a BigGAN model; this is a generative model that goes from noise to the image modality, and it also conditions on the class. You train the BigGAN on the full ImageNet dataset, a dataset of a million images across a thousand classes; you do that, you get samples from it, and you're happy. But maybe the question is: is this model, from which I can sample, essentially a training set of equal characteristics to the original training set? Would a training set generated entirely from the BigGAN produce a classifier that generalizes to the real test or validation set of ImageNet? So you take your training set, you train your generative model, you throw away the training set, you sample from the model, and you create a new training set which now comes only from BigGAN samples, with their classes of course. Then you go and train your favorite ResNet model for ImageNet on it, and your metric of interest is: how well does this model, trained on fake samples, samples from the model, do on real images from the ImageNet validation set?
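The whole pipeline fits in a short sketch. Everything here is a toy stand-in: a per-class Gaussian plays the role of the trained generative model, and a nearest-centroid rule plays the role of the ResNet; the structure (train only on generated data, evaluate only on real data) is the point.

```python
import random

# Train a classifier ONLY on samples drawn from a generative model,
# then evaluate it on real held-out data.
def fake_sampler(label, rng):
    """Hypothetical generative model: class 0 centered at -2, class 1 at +2."""
    return rng.gauss(-2.0 if label == 0 else 2.0, 1.0)

def train_on_samples(sampler, n_per_class, rng):
    """'Train' a nearest-centroid classifier on generated data only."""
    centroids = {}
    for label in (0, 1):
        xs = [sampler(label, rng) for _ in range(n_per_class)]
        centroids[label] = sum(xs) / len(xs)
    return centroids

def accuracy(centroids, real_data):
    """Evaluate on (x, label) pairs from the REAL validation set."""
    correct = sum(
        1 for x, y in real_data
        if min(centroids, key=lambda c: abs(x - centroids[c])) == y
    )
    return correct / len(real_data)
```

If the "generative model" matches the real class-conditional distributions, this accuracy approaches what training on real data would give; the gap is the metric.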
Many people have realized this is actually a reasonable metric, and pushing this metric forward would also implicitly make the models we wish to use for data augmentation useful: we would then be able to generate, instead of a million images, ten million images, and then, great, we have a much larger dataset, which would probably be better in terms of generalization, if these samples had labels and so on. It's a data augmentation technique.
So we did a fairly wide study on this, in a paper presented at NeurIPS 2019 last year. We took standard, very good-looking, state-of-the-art models, BigGAN, VQ-VAE, and an autoregressive model, and we just tested this idea on the very latest state of the art. With real data, that is, if you don't do any replacement, a ResNet-50 gets, let's say, 76 percent top-1 accuracy, with a very high Inception Score and a very low FID (for FID lower is better, for Inception Score higher is better, and of course for accuracy higher is better). So, if you had to guess, using the procedure I described, what does BigGAN or VQ-VAE achieve? I must say it was both unimpressive and impressive to see these numbers. What you get is, well, first, much lower accuracy: you train a classifier that is not nearly as accurate as the state of the art. But at the same time I think this is impressive, because you now have a generative model, and by training solely on it, with no real data used, you can actually get fifty-five percent top-1 accuracy, which is almost AlexNet-level, so it's a fairly good model trained purely on generated samples. Of course there's work to be done to make this accuracy as good as with the real data, or perhaps to surpass it, if we wish to use these models for data augmentation; but at the same time it's obviously much better than chance, which would be one in a thousand. It also tells you something about the metrics in use: looking at Inception Scores and FIDs you might think BigGAN's samples are better, but for training a classifier it turns out that, in this particular case, VQ-VAE seems to be better.
And the cool thing about this metric is that it's becoming practical, because of course it requires you to train a classifier from scratch on the new dataset that you sample. It turns out that it takes less than a day, and 10 or 20 dollars, to train an ImageNet classifier nowadays, so as a metric it starts to be feasible for people to use, and hopefully the big corporations are happy to help academics evaluate their models with this kind of metric, which does take a while to compute, something like 10 hours; not a small amount of time. This also lets you do things like look at, for BigGAN for instance, the classes for which a classifier trained on generated data is actually better than one trained on the real training data, so perhaps the model really understood those two particular classes in the dataset; and you can also look at the classes where you sample, train a classifier, and it just doesn't generalize at all to the test set. You can see for balloons it kind of, well, you could believe there are balloons here, but it's a fairly bad sample, so even for detecting sample quality this metric is fairly reasonable, I would say.
So, in summary, I tried to cover the three themes of this social event. Generation by neural networks: there are a few model categories one could study, and fully observed, latent, and implicit models are what we just briefly discussed. Generalization by neural networks is happening: we have a lot of tasks and applications that are actually hitting production systems, in language, speech, image domain transfer, robotics. And last but not least, data augmentation: for it to serve a purpose for neural networks, I would say state-of-the-art generative models are still not at the point of creating new data from which a better classifier could be trained. With that, I think we might have time for one question or so, but otherwise we can obviously take these questions during the break. Thank you very much. I have no clue if the talk was working the whole time, but I hope it was; otherwise I've been talking to myself.

(Host: Maybe we have time for one question. Kyunghyun posted a question; maybe he can ask it if he wants.)
(Kyunghyun:) Hey Oriol, great talk. (Oriol: Oh, thank you.) So I have one question. You talked about latent variable models at the beginning, and one of the advantages you pointed out was that it's easy to extract the latent vector. It would be nice if you could clarify what it means to actually extract the latent vector from a latent variable model: if it's true posterior inference, that's as difficult as anything else, but I guess you mean the approximate posterior's mean is fine? What did you mean by the extraction?

(Oriol:) Yeah, that's true: when you train a VAE you have a model that maps from the modality to the latent, but that's just there to approximate the ELBO and so on. At the very least, though, that neural network can be repurposed to extract a feature, the latent factors that this image would be condensed into, such that the decoder can then go ahead and regenerate it, via the usual objective with its KL term. It's not clear that those features will be any good for classification, but it's at least an easy way to see how it could work: if the latent variables your VAE encodes happened to be the ImageNet class, then your latent is literally the class label, and you can simply classify with it. Of course, not every VAE model has very meaningful latent variables, and, as I was saying, semi-supervised learning, self-supervised learning, and so on are great fields to look at for the tricks that make these models better at extracting meaningful features from the latents.

(Kyunghyun:) Okay, gotcha, thank you. (Oriol:) Yep, cool. (Kyunghyun:) Thank you, wonderful talk.
(Oriol:) Alright, thank you very much. I'm not sure I can stop sharing by myself.

(Host:) Yeah, so maybe Phillip can set up his screen meanwhile; I'll give a short introduction and then he can start. Phillip is currently a professor at MIT; he was previously at OpenAI, and was primarily associated with UC Berkeley. His research spans generative models, transfer learning, and few-shot learning, so today's talk will be based on things like few-shot learning and how the benchmarking problem works. So, yeah, if you can go ahead.

(Phillip:) Okay, so I won't talk about few-shot learning, actually; that might have been a different talk, but that's okay. Thanks for organizing. I'm going to share my screen. Okay, can everyone see my screen? (Host:) Yep.
(Phillip:) Okay, great. So I'm going to talk about this view of generative models as a kind of data visualization, and this gets at what Oriol was just talking about: what's the point of a generative model? There are a lot of ideas out there, representation learning, prediction, just making pretty images, but I think one that might be a little bit overlooked is the kind of visualizations of data you can achieve with generative models. The idea is that if you take a big dataset of images, it's some sparse set of samples in data space, and what a generative model does is fill in all the gaps; that's what one of these latent variable or implicit models does. A lot of people have shown great sample images from these, celebrity faces or Flickr-Faces-HQ, and those are cool, but I think that's kind of the least interesting thing you could do with a generative model, just making more data samples, because we already have the dataset, and you already have a way of getting a lot of random faces: just Google "faces". What's a lot more impressive is to get a result like this, where you actually fill in the gaps in between the faces, and with the generative model you go from the sparse data samples to this visualization of the entire manifold of, let's call it, face reality.
I'm going to be talking about generative models that map from some latent variable z to a data distribution. I'll be presenting two models that are based on GANs, but you could also do this with VAEs or other models that have latent variables. The way they work, as we're probably all familiar with, is you pass your random noise sample through the network and it makes an image, and that image lies somewhere on the manifold of natural images; we sample another z and it lies somewhere else on the manifold. So we can think of this as transforming some prior distribution, this Gaussian noise, into a more complicated distribution, and what's really cool is that this latent space ends up organizing the data space. We can draw random samples, but what's even better is that we have this mapping between the latent variable z and the data x. What people have seen in a lot of these models is that if you do some simple transformation in latent space, like walking along a line, it corresponds to something that looks meaningful in data space. Here's an example from BigGAN where you can interpolate in z-space between a bluebird and this fly, and every intermediate image looks reasonable.
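The z-space interpolation is literally just a linear walk between two latent vectors; a sketch (the decoder that turns each point into an image is omitted):

```python
# Linear interpolation between two latent vectors z1 and z2:
# decoding z(t) = (1 - t) * z1 + t * z2 for t in [0, 1] gives the kind of
# smooth morph between two samples shown on the slide.
def lerp(z1, z2, t):
    return [(1.0 - t) * a + t * b for a, b in zip(z1, z2)]

def interpolation_path(z1, z2, steps):
    return [lerp(z1, z2, i / (steps - 1)) for i in range(steps)]
```

In practice people sometimes use spherical interpolation instead, to stay in the high-density region of the Gaussian prior, but the idea is the same.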
So that transformation doesn't actually exist in nature, but nonetheless it looks like we have observed something that is plausibly physical. What we wanted to ask is this: these generative models are interpolating, creating this underlying physical continuum, but we have another way of doing that too, which is just to use a graphics engine. A graphics engine is a generative model of visual data with all these really cool properties: you can move the camera around, change the lighting. So are we getting anything new out of these learned models, and how close are they to having the capabilities of traditional graphics, which is one of the most powerful ways of visualizing the world?

I'm going to talk about two little papers on this. The first one is actually an ICLR paper, which we're also presenting live at the end of this hour, I think: this is on the "steerability" of generative adversarial networks, and this is work that Ali and Lucy did. We just wanted to ask whether a GAN, in latent space, can model the most basic transformations that a game engine could model, like a camera transformation or a lighting change; how close is a GAN to this kind of classic graphics engine? The way we tried this is really simple: we just looked
try this is really simple we just looked
for paths in latent space that
correspond to some image transformation
in particularly focused on linear paths
so we start with some Z vector and we
add a vector W and we walk in about
alpha and that should correspond to
moving somewhere in image space and
where do we want to move well we can
define target transformations so for
example we can say we want a vector W to
correspond to shifting the flower up to
the top
the image and we do that by setting the
target to be be translated flower up for
the top of the image okay so here's the
objective we search for a direction w
min latent space that matches some edit
operation in image space that edit could
be shifting the pixels it could be a
zooming in on the pixels it could be
rotating them any kind of image
operation you want to define okay so
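A miniature version of this objective: find w such that G(z + alpha*w) matches edit(G(z), alpha). Here a 1-D linear "generator" and a translation edit stand in for the real network and the pixel-space operation, and simple stochastic gradient steps recover the direction:

```python
import random

def G(z):
    return 2.0 * z           # placeholder for the generator network

def edit(x, alpha):
    return x + alpha         # target transformation in "image" space

def fit_direction(steps=500, lr=0.05, rng=random.Random(0)):
    """Minimize E_{z, alpha} [ (G(z + alpha*w) - edit(G(z), alpha))^2 ] over w."""
    w = 0.0
    for _ in range(steps):
        z = rng.gauss(0.0, 1.0)
        alpha = rng.uniform(-1.0, 1.0)
        err = G(z + alpha * w) - edit(G(z), alpha)   # residual of the objective
        # gradient of 0.5 * err^2 w.r.t. w is err * dG/dw = err * 2 * alpha
        w -= lr * err * 2.0 * alpha
    return w
```

In this toy setup G(z) = 2z and the edit adds alpha, so the optimal direction is w = 0.5; with a real GAN, w is a vector found the same way with autodiff, and walking along it produces the zooms and shifts shown next.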
This is really inspired by things like word2vec, where, in the latent space or representational space of a word representation model, people found the very striking result that there exist directions that correspond to semantics: for example, there's a direction that corresponds to the tense of a verb, so "walking" goes to "walked" in the same direction as "swimming" goes to "swam". Here we're essentially asking: are there directions in the latent space of a GAN that correspond to simple image transformations, like zooming in or rotating a photo? We ran that optimization, and indeed you can find a direction that corresponds to zooming in on the cat, or shifting the cat left to right, or brightening this image, or darkening this image. Of course, we notice something weird happens where the volcano explodes when brightened, because that's the semantics that has been learned from the data.
Okay, so we've visualized something about the data distribution on which this model was trained by doing this. So we wanted to go a little bit further: we can achieve these transformations in latent space, but what can we learn about the model and the data by looking at these transformations? So we took an example like this hot air balloon, and we tried to shift it upward by walking in the shift-y, up, direction in latent space, and this is what happens: you can see that the balloon does go up, and we can also shift it down, but it kind of forms this sigmoid curve; it looks like it gets stuck at the edges of the image. And you can see that for a lot of different transformations. For example, if I zoom in on this cat, initially it works, I can get closer to the cat, but eventually it gets stuck; you just can't zoom in too far on a cat. On this pizza at the bottom, we tried to rotate it, and it can rotate very slightly, but again it gets stuck, and it can't rotate more than a few degrees. So why is that happening?
Our hypothesis is that this is related to the data that the model has been trained on: if the model has never seen a super-zoomed-in cat, of course it can't learn how to generate a super-zoomed-in cat. So this is kind of a way of visualizing the biases and limitations in the data. There are a lot more details in the paper, but I'll just mention two of the results. We did try to quantify the relationship between the distribution of the data and the ability of the model to achieve these transformations. So on the x-axis here we have the variance of an attribute, the zoom attribute, in the data: we measured, for example, the variance of how zoomed-in cat faces are; we can just use an object detector to detect the size of the cat face. And on the y-axis we have how well the GAN was able to shift the same variable: if we walk in the direction corresponding to zooming in and out, how large a transformation is achieved. And there's a decent correlation between those two things. Here are two example points: we have the robin and the laptop. The laptop is an object that comes in a lot of different sizes in the training data, which is ImageNet; the robin only comes in one canonical pose. And as you can see, it's possible to zoom in on the laptop but not the robin, reflecting these biases in the data set.
You can get similar things with color biases. If I try to walk in the direction that goes toward blueness in latent space, you can make a car go from red to blue, but you can't make a firetruck turn blue, and of course the reason is that there are no blue firetrucks in the data set. So this is kind of a way of visualizing properties of the data, and maybe it's also interesting to look at as an evaluation metric for the capabilities of a generative model: to what degree did the generative model generalize to be able to do things like paint a firetruck blue, which BigGAN, the model here, was not able to do.
So I'm going to quickly go over one more paper where we did a very similar thing. We could say what it looks like for an image to turn more white, or to zoom in or out, but what if we want to ask about something where we don't know what it actually looks like? What if we want to ask: what does it look like to make an image become more memorable? I don't know what that transformation looks like, but here's a trick where we can actually visualize it, and the way you can do that is as follows. This is work that we presented at ICCV in 2019, the GANalyze paper with Lore and Alex. We took data on how memorable different photos are; we got this from some older work that we had done. So some images people remember well and some they don't remember well, and we can just visualize, let's say, which images people remember and which they forget. Here are a bunch of dogs, and the ones toward the right and toward the bottom are more memorable according to human judgments. But as you can see, it's pretty messy; it's pretty hard to tell what's going on by just looking at the raw data. So can we look at the data in GAN space and get a better sense of what's going on? The way we do it is: we take our z vector, we generate an image from that z vector, and then we assess the memorability of that image using a classifier, or a regressor, that's been trained to predict how likely it is that a human will remember that image, based on our training data of memorability scores.
Then we can try to modify the z to produce an image that moves toward that area of high memorability. So we simply walk along a line in the GAN's latent space such that we increase or decrease the predicted memorability of the image, and here's what the result looks like: it takes this panda, and to make it more memorable, it thinks that means zooming in on the panda; to make it less memorable, it kind of zooms out. You can now mine through the data and see a lot of interesting properties. One property that comes out is that you almost always zoom in to make an image more memorable. Another is that if you take an image and make it more symmetric or circular, that tends to increase its predicted memorability. We also tested that these images to the right are indeed more memorable in psychophysical experiments. So simpler images tend to be more memorable, and if you blur out the background, that makes the image more memorable. And there are some things that we didn't quite know how to quantify but that I think might exist: making a dog cuter and more expressive, with these bright eyes, makes it more memorable. Okay, you can do this for whatever attribute you want: if you have data about some image attribute, you can train a model to predict that attribute, and then you can ask, say, what would it look like for a car to become more popular on Instagram, and it makes it into a sports car, or what does it look like for a dog to become more emotionally positive, and it makes it more colorful and kind of happy looking.
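A minimal sketch of this kind of assessor-guided walk, under the assumption that we want the predicted attribute to change by roughly alpha per step (placeholder `G` and `assessor`; not the paper's exact objective):

```python
import torch

def learn_attribute_direction(G, assessor, z_dim=128, steps=1000, batch=8, lr=0.01):
    """Learn a latent direction theta such that walking z -> z + alpha*theta
    changes the assessor's predicted attribute (e.g. memorability) by ~alpha."""
    theta = torch.zeros(z_dim, requires_grad=True)
    opt = torch.optim.Adam([theta], lr=lr)
    for _ in range(steps):
        z = torch.randn(batch, z_dim)
        alpha = torch.empty(batch, 1).uniform_(-0.5, 0.5)
        score_now = assessor(G(z))                    # predicted attribute before the walk
        score_walked = assessor(G(z + alpha * theta)) # predicted attribute after the walk
        loss = ((score_walked - (score_now + alpha)) ** 2).mean()
        opt.zero_grad()
        loss.backward()
        opt.step()
    return theta.detach()
```

The difference from the previous setup is that no pixel-space target is needed: the assessor network alone defines which way to walk.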
Okay, so just to summarize: there are more details in all these papers, but the idea is that one of these generative models that maps from some well-organized latent space is a way of visualizing data as a continuum, and we can learn about the data distribution and its properties by navigating in that latent space. Okay, thank you, and maybe there's time for a question, but I know this is pretty hard. Oh, I see there might be some chats; the questions go to the Q&A session later. I won't be at the Q&A session, but I'll try to track the chat later, so feel free to message me. Okay, thank you.
So, Kilian, maybe you can start sharing your screen, and meanwhile I'll introduce you. So, as announced, Kilian is a professor at Cornell University in the Department of Computer Science, and he has been working on a lot of applied machine learning and Gaussian processes, with many ongoing projects in natural language processing and vision. He is also associated with a startup called ASAPP, which works on data augmentation and NLP problems. Today's talk will be on feature normalization, so I guess Kilian can now take the stage. All right, can you see the screen? Yeah, maybe now, okay, awesome. Good, thanks for holding this workshop, thanks for the invitation.
I'm Kilian. What I'm presenting today is really work by Boyi and Felix, two students that I co-advise, and we also got help from Ser-Nam from Facebook research. So yeah, as was just said, I'll be talking about feature normalization and data augmentation, and normalization actually goes way back, so... sorry? "We can't see your screen." Oh shoot, I'm sorry. Wait, what am I sharing then? I think nothing is being shared... Okay, are you seeing it now, or are you seeing the wrong stream? "We see the two slides... just one slide now. Okay, good." All right, good. So I'm the guy on the right; the two on the left are the students, and they did all the work.
All right, so feature normalization really goes back all the way to the early days of neural networks. In fact, maybe there are people before that, but in 1998 Yann LeCun had a paper on Efficient BackProp, actually a great paper to read, and back then basically he and his co-authors said: when you train a neural net, you should always normalize your features, because it really helps optimization, etc. That was 1998, and back then neural networks didn't have many layers. Since then, of course, we have much, much bigger neural networks, so it makes sense to normalize not just at the beginning but throughout, and of course the famous paper on batch norm essentially suggests exactly this. Nowadays there's basically no network anymore that's trained without feature normalization, so it's a really, really big part of the breakthrough of neural networks, or deep learning, as we call it. And why is this?
Well, number one, it generalizes much better. Here we have the training curves of a neural network, and, can people see my mouse? I guess you can, because I'm sharing my screen. Here the orange line is a neural network trained with batch normalization; it reaches much, much higher accuracy than without batch normalization. But why is it that batch norm is so important? Essentially, the reason is that it allows you a larger learning rate. If you lower the learning rate, you would just have to train for longer: here the green line is actually a neural network with batch norm but the same learning rate as the red line, the network without batch normalization, and you see they're essentially identical. So if you have the same learning rate, there's really no benefit from normalization. But one thing happens if you increase this learning rate: it's very small here, 10^-4, but if I increase the learning rate by orders of magnitude, then if I don't use batch norm, the network diverges, whereas if I do use batch norm, I can actually train up to four orders of magnitude faster, and that's really the cool part about normalization. You can actually see this if you look at the gradients: with batch norm, the gradients are all nicely concentrated around zero and somewhat similar to each other, whereas if you don't use batch norm, you have a few gradients that are massive, orders of magnitude larger than all the other gradients, and what happens is that these few gradients basically dominate the updates and cause the divergence. So batch norm
is really, really important. So let me just introduce some notation here; of course you all know how this works. We have a convnet, and an input image of this cute cat generates features, these different channels. Just to make things simpler, so we don't have to draw these cats, we imagine we vectorize each one of these feature activations, so we now have a vector for each channel, and this vector has as many dimensions as the height and the width multiplied: if this is a 100-by-100 feature map, this is a 10,000-dimensional vector. So this is essentially the feature matrix that I'm generating with my neural network. Of course, I don't just have one image during training; I always use a mini-batch, so I have a whole bunch of images, and each of them generates such a matrix. I can put them together, and now I have this feature block. This feature block is neat because you can now illustrate very nicely what batch normalization does. In batch normalization, you essentially take the features, all these green features that I'm highlighting, and you average them: the mean essentially averages out the position and the mini-batch. So this is the mean that you get, and the standard deviation, and what you do is you essentially just throw them away, and the new features that you obtain are normalized, zero mean and standard deviation one.
Of course, you then learn a new scale and offset for these features. The reason I'm introducing this notation is that it makes very clear what the alternatives to batch norm are. So, for example, you can use instance norm instead. What does instance norm do? It does not normalize over the mini-batch; it just normalizes over the position, so this is what you get, and you project along a different dimension. In group norm, you essentially do the same thing as instance norm, but you also normalize across multiple channels, and in layer norm, you normalize across all the channels and the position.
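In this notation, the variants differ only in which axes get averaged out; here is a rough sketch for NCHW tensors (the learned scale and offset are omitted, and the number of groups `g` is an arbitrary choice):

```python
import torch

def normalize(x, dims, eps=1e-5):
    """Subtract the mean and divide by the std over the given dims of an NCHW tensor."""
    mean = x.mean(dim=dims, keepdim=True)
    std = (x.var(dim=dims, keepdim=True) + eps).sqrt()
    return (x - mean) / std

def batch_norm(x):    return normalize(x, (0, 2, 3))  # over mini-batch and position
def instance_norm(x): return normalize(x, (2, 3))     # over position only
def layer_norm(x):    return normalize(x, (1, 2, 3))  # over channels and position

def group_norm(x, g):
    n, c, h, w = x.shape
    y = normalize(x.view(n, g, c // g, h, w), (2, 3, 4))  # over channel groups and position
    return y.view(n, c, h, w)
```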
This made us realize, and it was really Boyi and Felix who saw this: essentially all of these normalization techniques normalize over the position; they average out the position of the feature. But that's curious, because the position of a feature is actually very, very important information. So couldn't we come up with a normalization scheme that actually respects the position? That's what we propose, and it's called positional norm, or PONO. Essentially what we're proposing is to normalize across the channel dimension: we say, don't touch the position, and also don't touch the mini-batch; when you have an image, just average out across the channels. So, formalizing this: for the mean, you essentially go over all the features and average out the channel, and the same thing for the standard deviation. I can illustrate this with cute cat pictures: here we have this cat, here are the channels that you're getting, and essentially what you do is average these channels. This is what you get, the average feature cat, for the first layer, second layer, and so on up to the last layer, and below are the standard deviations.
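Positional normalization itself is then just a mean and standard deviation over the channel dimension, one per spatial position; a minimal sketch for NCHW tensors (again omitting the learned scale and offset):

```python
import torch

def pono(x, eps=1e-5):
    """Positional normalization: normalize each spatial position
    across channels. x has shape (N, C, H, W)."""
    mean = x.mean(dim=1, keepdim=True)                # (N, 1, H, W)
    std = (x.var(dim=1, keepdim=True) + eps).sqrt()   # (N, 1, H, W)
    return (x - mean) / std, mean, std                # keep the moments for reuse
```

Returning the moments alongside the normalized features is what makes it possible to reuse them later.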
One thing that's really, really cool about this way of doing normalization is that if you look at these moments, you can actually see what the neural network is extracting in each layer. So the hypothesis that neural networks learn low-level features early on and high-level features later on seems consistent with this: early on, what you see is mostly an edge detector, with these kinds of interest points all over the place, and in the later layers, what you really see is that the face and the ears are highlighted, which presumably are important features for a cat. Okay, so now we can use PONO. Essentially what we're doing is extracting these average features and standard deviations, and we could just use them for normalization and throw them away, just like in batch norm, and now have these normalized images. But if we want
to do generation, then what we typically have is some kind of U-Net structure: we have an image, you go down to a lower-dimensional representation, a lower-resolution image and then lower-resolution features, and then you go back up and generate a high-resolution image. And in some sense, as I said earlier, the positional information in these features is actually quite important, and if you look at this mean, it somewhat captures the essence of the cat: there's information here about the cat that we used to throw away. If we want to generate the cat again later on, we would have to re-learn all of this from a lot of data. So one thing we could do is actually recycle it and re-inject it later on, and we call this the moment shortcut. So if you use PONO, it allows you to use this moment shortcut: it's basically a skip connection, a shortcut connection, where you take the means and standard deviations obtained from normalization early on and, instead of throwing them away, you re-inject them later on, and that supposedly makes it easier to generate images again.
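A sketch of the moment shortcut idea (the `encoder` and `decoder` here are hypothetical stand-ins for whatever network sits between the two ends of the shortcut):

```python
import torch

def moment_shortcut(x, encoder, decoder, eps=1e-5):
    """Extract positional moments early, normalize, and re-inject
    the saved moments later, like a skip connection for moments."""
    mean = x.mean(dim=1, keepdim=True)
    std = (x.var(dim=1, keepdim=True) + eps).sqrt()
    h = encoder((x - mean) / std)     # process the normalized features
    return decoder(h) * std + mean    # re-inject the saved moments
```

With identity encoder and decoder, this reconstructs the input exactly, which is the point: the moments carry structure that the decoder no longer has to learn from data.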
And of course, the obvious next thing you can do is say: instead of just taking the mean as it is, we can also run a little convnet on it and change it, adapt it to whatever task we're trying to do, if we actually want to modify this cat and generate something slightly different. Here's an example of multimodal image-to-image translation, where the training set is fairly small. If you just try, with these methods, to naively learn the cat in the pose of the dog and the dog in the pose of the cat, it doesn't work; if you have a lot more data, it looks a lot better. But if you just make this one change, where, in this case with the small training set, you insert PONO with moment shortcuts, the results actually look a lot better, and the reason is that you're actually capturing the structural information: you don't have to learn what the means and the standard deviations are from scratch. In fact, what we're doing here is that we can actually insert the mean from a cat into a dog to generate these kinds of images. We also tried this on CycleGAN, which Phillip, who just spoke, famously introduced, and you can get improvements, but let me just focus on
actually what's let me just focus on
some some paper that just came out which
I think it's really really cool
it actually really shows the advantage
of ponal with and you can get so here
this is a paper that's I think just that
CBPR and this year you know it maybe you
have sketches and you try to basically
generate and images with some
ex-employer so ya wanna have the sketch
and you fill it in with Steve Jobs etc
and here in this case they use it with
him without pono and you can see that
sorry right that actually the opponent
makes a huge difference yes we actually
here they isolate this you have an
expert image I would like to generate a
carpet that someone has the straight and
the pono regularization really makes a
huge difference here and easing your
learning right so basically in some
sense I'm putting in some some prior
information and saying well this
capturing the structure of an image by
computing the first and second moments
of the features and so that will be
useful later on but I'm actually
generate alright so that was cool
normalization but we're not done yet
It turns out you can also use this for data augmentation, and data augmentation, again just like normalization, goes way back. There's actually the idea of vicinal risk minimization: Vapnik mentioned it first, and then Chapelle et al., together with Vapnik, explored it around 2001. The idea is essentially, and you all know this: I have a sample, and I don't just want to minimize the loss at the sample itself, but actually in its vicinity; around the sample, the model should make the same prediction.
So traditional things are, of course, resizing the image, cropping, or horizontal flips; most things are invariant to horizontal flipping. These are so-called label-preserving data augmentations: essentially what I'm saying is that under these transformations, the label of the image should not change. Lately there's been a trend towards label-perturbing augmentation, which essentially says: actually, I want to go further than that. I don't just want the label to stay the same around the image I have; I also want to go in a certain direction. So, for example, what you can do is take an image of a cat and an image of a plane and mix them. There are methods like mixup and CutMix that basically say: if I combine these two images, then the classifier should predict something in between. It shouldn't suddenly predict "chair"; I'm clearly between a plane and a cat. So somehow I want to linearize, to smooth out, this decision surface, so that I also know something about what's going on between these samples. This is something that Phillip was alluding to earlier
on in his talk just now. So when we thought about this, we actually realized that we were already doing something very similar with PONO: with PONO we take these images, we extract the first and second moments, and later on we can re-inject them if we have a generative task. But one thing we could actually do is take two different images that have nothing to do with each other, totally random, and extract the first and second moments in the feature space. So here we have the average plane features and here the average cat features, and then, when injecting them, we actually swap them. So basically what we're learning here is: we have a cat, we make the features zero mean, and then we add the mean from the plane. So now they're cat features, but they are translated to the space, to the average location, where plane features usually are, and you also multiply by the plane's standard deviation in the same way. And then we also change the label: instead of just "cat", we actually say it's a little bit of cat and a little bit of plane, where lambda is some trade-off.
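A minimal sketch of this moment exchange step on positional moments (shapes assumed NCHW; the paper gives its own one-line pseudo-code, and this is just the idea):

```python
import torch

def moex(x, eps=1e-5):
    """Moment exchange: swap the positional moments of each sample
    in a batch with those of a randomly chosen other sample.
    x has shape (N, C, H, W); returns the mixed features and the
    permutation, so the label can be interpolated with a trade-off lambda."""
    mean = x.mean(dim=1, keepdim=True)
    std = (x.var(dim=1, keepdim=True) + eps).sqrt()
    perm = torch.randperm(x.size(0))
    normalized = (x - mean) / std
    return normalized * std[perm] + mean[perm], perm
```

The training loss would then be interpolated accordingly, something like `lam * loss(out, y) + (1 - lam) * loss(out, y[perm])`.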
The reason this should work, and it does work, is essentially that when I'm training a classifier, there are many different ways of recognizing that this is a cat and many different ways of recognizing that this is a plane. For example, the blue sky alone could just be enough: if, of all the data points in my data set, only planes have a blue sky, then the network will just learn "blue sky" and say it's a plane, because cats rarely have a blue sky behind them and planes fly in the sky. So what we're doing here is essentially: by subtracting this mean, we're removing a lot of information while keeping the structure, and if you inject the moments into the cat, the cat essentially only carries this piece of the structure of the plane, so it has to predict a little bit of plane at the end. So really what we're doing is training two different classifiers: one that's based on the normalized features and one that's based on the moments; the first one should classify cat, the second one should classify plane, and during test time you're essentially running both of them simultaneously, so on average you're combining the two.
We can try this. So here we have different network architectures, and you can see the baseline here uses standard data augmentation like flipping, cropping, etc.; the blue one additionally adds moment exchange, as we call it, and you see a very consistent drop in error rate. The same thing holds for other data sets, and one thing, by the way: there's nothing specific about images here; you can actually also exchange these features for NLP tasks and so on, and we did that too; that's all in the paper. Also, we don't really have to use PONO: with any normalization technique, you can just exchange the mean and the standard deviation that it subtracts; which direction doesn't matter, as long as it doesn't go over the mini-batch. So we can use layer norm, instance norm, group norm, and it turns out that for these image tasks PONO seems to work best, but maybe not on some other tasks; the jury is still out.
And finally, how does it stack up against other data augmentation methods that are also pairwise? So here the baseline is essentially just standard data augmentation; then we tried various recent methods, and MoEx does very well compared to them. There is, however, one method that is better, and that's CutMix; CutMix is slightly better than MoEx. But the cool thing is that CutMix actually operates completely in the image space, the pixel space: it basically cuts out some part of one image and pastes it into the other one. So you can just combine it with MoEx: you can use both, and if you do that, then because MoEx operates in the feature space, they don't interfere with each other, and you get a new state of the art in terms of data augmentation. Okay, good. So the
summary, essentially, is: feature normalization and data augmentation are really important. We introduced PONO, positional normalization, which normalizes across channels for each position independently, and PONO allows two really cool things. The first one is moment reuse: when you extract these moments, you can then re-inject them later on, and that really helps in generative tasks. The second one is moment exchange: we extract the moments of one image and inject them into another image, and this helps a lot with generalization. And the cool thing is, it's really, really simple: it's literally one line of code, and we actually have the pseudo-code in the paper. So it works really well; try it out. That's all, thank you very much. Thank you so much, Kilian, for this exciting talk. We should hold the questions for the Q&A at the end, where
we'll be able to discuss them. Okay, yep. So I'll introduce our next speaker, Puneet Dokania. He is a senior researcher at the University of Oxford, working in the Torr Vision Group. He is also associated with a startup on self-driving cars called Five AI, and most of his work has been in semantic segmentation, transfer learning, generative adversarial models, and some applications in language processing as well. So maybe Puneet can start sharing his screen. I cannot see it yet... all right.
Can you? Yep, okay, great. First of all, thanks to the organizers for the invitation. So today I'm going to talk about improving the generalization of machine learning, specifically deep learning, models using stable rank normalization, and this is joint work with Amartya Sanyal and Philip Torr. Before we begin, let's have a very quick look at the definition of generalization error. Let's say that we are given a data distribution, and the objective is to learn a function that maps input x to y, and we want to do this very accurately; this is exactly what we try to do in machine learning. However, the data distribution is normally unknown: for example, if you take images for image classification, we don't know the underlying data distribution of natural images, but we do see them around, and therefore, even if we don't know the data distribution, we know how to sample from it. So that is exactly what we do: we collect a training set, approximate the joint distribution using the empirical one, and try to learn a function by minimizing some loss under this empirical distribution. Now, once
we have done that, we would also like to quantify how well this function that we have learned using the training data actually performs when exposed to new data coming from the same joint distribution P, and that is exactly where the idea of the generalization gap comes into play. So under this setting, we define the generalization gap as the difference between the expected loss over the distribution and the loss over the training set. Now, as I mentioned before, this data distribution is unknown, so we approximate the expected loss using a test set: we sample another set from the data distribution, and we make sure that this test data is disjoint from the training data. That is exactly what the generalization gap is.
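Written out, with loss ℓ, data distribution P, and a training set of n samples, the gap just described is:

```latex
\mathrm{gap}(f)
  \;=\; \underbrace{\mathbb{E}_{(x,y)\sim P}\big[\ell\big(f(x),y\big)\big]}_{\text{expected loss}}
  \;-\; \underbrace{\frac{1}{n}\sum_{i=1}^{n}\ell\big(f(x_i),y_i\big)}_{\text{loss on the training set}}
```

with the expectation approximated in practice by the average loss on a held-out test set.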
Once we have defined this term, it's very obvious that we want to learn functions that have a small generalization gap, because we want functions that, once trained on the training data, perform well even on unseen test data. And the question is: can we come up with properties of a function that would guide us, that would tell us whether a given family of functions is going to actually give us good generalization in practice? Take the example of neural networks: it is very intuitive and very obvious to see that many factors play a role in generalization. For example, if I use different optimization algorithms, different optimization problems give me different optimal solutions, and each optimal solution, each set of parameters, will have different properties, different norms and different ranks, and all these different factors play a very important role when it comes to the generalization gap. There are many other factors too, like the learning rate and other hyperparameters. So even though there are so many factors playing a role in the generalization gap, there are really nice recent theoretical papers talking about bounds on this generalization gap. These theoretical bounds have many parameters, many factors that obviously play an important role, but they are of very little practical use: there are tons of papers out there, but the bounds don't actually match the real generalization gap, and they don't tell us much when it comes to, say, predicting the generalization ability of a model.
So what we do here is say: can we actually look into these generalization bounds and at least see if there is something that can guide us to define a family of functions that would give us some confidence of better generalization? So we looked into two really nice recent papers, by Bartlett et al. and Neyshabur et al., two different papers, and if you look at the generalization bound, there is an upper bound, and we look only at the parameter-dependent terms in it, ignoring all the other factors that play a role in the generalization bound. That is to say, all the other factors are kept constant, the same as in the standard training process, and we only look at the parameter-dependent factors. These factors turn out to be these two terms: the first term is the product of the spectral norms of the linear layers, where there are n linear layers and the spectral norm is the maximum singular value, and the second term is a sum of something called the stable rank, which is the ratio of the Frobenius norm squared to the spectral norm squared. So let's have a closer
look into what this is actually what
this tells us so the first term if you
look into the first term is exactly the
distance of a bond and this is what the
spectral log and paper is being the me
Otto people they actually what they're
doing is they the new spec turn on the
ignition of each linear layer and by
doing that they don't kind of you know
decreasing the upper bound ellipses
upper bound of the network and that has
that is directly related to the one of
the this is one of the factor in the
reticle upper bound for the emulation
area now since we know that Spector norm
is extremely successful it's being used
quite a lot these days but if you
observe very carefully this upper bound
is extremely news because it's data
independent it does not depend on theta
at all and so even though there is an
upper bound X is upper bounded I've done
a lot of assumptions over there and but
it works very well so what we ask
ourselves now into the second question
for this part so this is the ratio of
this stable state is the son of the
stable ranks which is the ratio of the
four genious norm square divided by
spectral norm and this is a very known
quantity it's just a proxy to rank is
upper bounded by R and K because if all
the similar values are same then this
will be exactly same as the rock of the
matrix and it is the controls noise
sensitivity there are many nice
properties of this string back now we
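As a concrete illustration of the quantity just described (a sketch of mine, not from the talk): the stable rank takes a few lines to compute, and the properties mentioned, being upper bounded by the rank and matching it exactly when all singular values are equal, are easy to check numerically.

```python
import numpy as np

def stable_rank(W):
    """Stable rank: squared Frobenius norm over squared spectral norm."""
    fro_sq = np.sum(W ** 2)
    spec = np.linalg.norm(W, ord=2)  # largest singular value
    return fro_sq / spec ** 2

rng = np.random.default_rng(0)
W = rng.standard_normal((50, 30))
# The stable rank never exceeds the rank of the matrix.
assert stable_rank(W) <= np.linalg.matrix_rank(W)
# With all singular values equal (orthonormal columns), it equals the rank.
Q, _ = np.linalg.qr(rng.standard_normal((50, 30)))
assert np.isclose(stable_rank(Q), 30)
```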
Now we try to answer this question: what happens if we jointly optimize these two terms? We have seen that optimizing one factor alone, the spectral norm as Miyato et al. did, works very well, but there are other factors playing a role in the generalization bound; if we optimize all of them together, can we obtain something with better generalization properties? This is exactly what we do, so let's look at the formal definition of the stable rank normalization problem.
Mathematically, this is how we define it. Say W is the parameter matrix of a linear layer given to you. What we want is to find a Ŵ that is closest to the given W in terms of Frobenius distance, such that the stable rank of Ŵ is the desired stable rank r. We also make the problem more general by adding another constraint, which we call the spectrum preservation constraint: it says stay within stable rank r while ensuring that the top k singular values are exactly the same for both matrices W and Ŵ. You can always put k = 0, which removes this constraint, or use k = 1 or k = 2, depending on the problem.
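In symbols (my transcription of the problem as described in the talk; the paper's exact statement may differ, for instance in whether the stable rank constraint is an equality or an inequality), with σ_i(·) the i-th largest singular value:

```latex
\hat{W}^{*} \;=\; \arg\min_{\hat{W}} \; \big\lVert W - \hat{W} \big\rVert_F^2
\quad \text{s.t.} \quad
\mathrm{srank}(\hat{W}) \le r,
\qquad
\sigma_i(\hat{W}) = \sigma_i(W) \;\; \text{for } i = 1, \dots, k,
```

where srank(Ŵ) = ‖Ŵ‖_F² / ‖Ŵ‖₂²; setting k = 0 drops the spectrum preservation constraint.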
alright so this is the problem that we
want to solve however the solution is
not straightforward and I'll give me
pigeon spy since steam around is the
ratio of the two knobs which is four
videos on the spectrum now you cannot
just simply normalize and do stable
randomization because they are the same
if you if I divide the matrix W by by
you suppose eater this table and it will
be in the same so I can't do something
similar to spectral normalization where
I simply find the maximum singlet value
divide the matrix by that and that also
is not an optic optimal solution to
spectral normalization but that is what
we do in fact is that this also cannot
be done similar to forgiveness
normalization so we need the proper way
of movement also for K equals to 0 this
problem is not founded so what we want
now is to find a solution to this
problem with these difficulties that
either this C so the solution the
theorem in everything is given in the
paper or the proof and everything how we
find the optimal solution to this
problem
here I'll give you in English what's actually happening in that theorem. Say we are doing this for k = 1, for simplicity. Given a matrix W, we can always write it as a first matrix corresponding to the maximum singular value, σ₁u₁v₁ᵀ; we know that W can be written as the sum over i of σᵢuᵢvᵢᵀ, so what I'm doing here is simply pulling out the first term and writing W as that first component plus the remainder. The algorithm first does spectral normalization of the first component, and for the second component the theorem says: find an η. We find η using the solution given by the theorem, and the final matrix is then the spectrally normalized first component plus η times the remainder. That is the solution we obtain. All the details are in the paper, but computationally it is almost exactly the same cost as spectral normalization: we also perform power iteration to get the spectral norm (because there is the constraint over the spectral norm), and then there is one extra step only to find the η term. When we do that, we get the final matrix, which is spectrally normalized and has the desired stable rank.
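Here is a minimal sketch of the k = 1 case just described. It is in the same spirit but is not the paper's implementation: it uses a full SVD for clarity instead of power iteration, and it picks η directly from the target stable rank r rather than via the paper's theorem (which, per the talk, guarantees η ≤ 1).

```python
import numpy as np

def stable_rank(W):
    s = np.linalg.svd(W, compute_uv=False)
    return np.sum(s ** 2) / s[0] ** 2

def srn_k1(W, r):
    """Stable rank normalization sketch with k = 1 spectrum preservation.

    Split W into its top singular component u1 v1^T plus a residual,
    spectrally normalize, then scale the residual by eta <= 1 so the
    result has stable rank at most r.
    """
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    top = np.outer(U[:, 0], Vt[0])      # u1 v1^T, spectral norm 1
    residual = W / s[0] - top           # remaining directions, rescaled
    tail = np.sum((s[1:] / s[0]) ** 2)  # Frobenius mass of the residual
    eta = 1.0 if tail <= r - 1 else np.sqrt((r - 1) / tail)
    return top + eta * residual

rng = np.random.default_rng(0)
W_hat = srn_k1(rng.standard_normal((40, 40)), r=3.0)
assert np.isclose(np.linalg.norm(W_hat, ord=2), 1.0)  # spectrally normalized
assert stable_rank(W_hat) <= 3.0 + 1e-8               # target stable rank
```

As the talk notes, in a training loop the SVD would be replaced by power iteration (as in spectral normalization) plus one extra step to obtain η.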
this is what it's doing okay suppose W
is the given matrix the first one and I
am writing two and three as the si there
are two singular values so this is how
many present a singular values but for
us the spectral norm will just divide
everything by 3 because that is the
Sigma Max and it will make the top 1 1
and who will become 2 by P which is like
this so this is what spectral norm will
do and the stable and found the addition
will actually multiply this 2 right
factor which is less than which we prove
always turns out to be less than equal
to one so basically it's again so
spectral or reduce some direction
through which information could
propagate and this is reducing it
further so in one sense we are learning
even more robust mappings because if a
lot of singular values are reduced then
if you if there is any noise in the
input a lot of direction of the noise
will cancel off so that is how these
normalization is work so this is what's
favored our normalization is doing in
the end and then the entire installation
comes from looking at the theoretical
results that are being given and only
working on the parameter dependent
quantities so now let's have a look into
how this works in practice
What we do is apply stable rank normalization to two sets of problems: classification experiments and GAN experiments. In the classification experiments, what we show is that SRN, while providing good generalization, also has a nice low-memorization property; that is one way of showing it works well. Another way to quantify it is to compute some recently proposed complexity measures and see how well the different normalizations do on those.
The first set of experiments is this. The top row is the clean-data accuracy, the test accuracy on the clean data, where higher is better, and we try different models (ResNet, WideResNet, AlexNet). Interestingly, with plain spectral normalization the performance drops relative to the vanilla models, and that may be why people don't use spectral normalization for classification problems (it is primarily used in GANs), even though the original spectral normalization paper reports that it improves generalization. With stable rank normalization we generally see improvement, though in the case of AlexNet the results are mixed; and the moment we apply it on top of spectral normalization, we see improvement there too, so in a way it is fixing what spectral normalization does. Those are the results for test accuracy.
We then take exactly the same models and do the shattering experiment, where you randomly shuffle all the labels of the clean data. To get good training accuracy in this shattering experiment, the model has to memorize, because sometimes the data says this image is a cat and the next time the same cat is labeled dog, and this happens very frequently, so the only way a model can do well is by memorizing the training set. In this case, then, low training accuracy indicates less memorization, and we see that with stable rank normalization the training accuracy is always lower. So the conclusion is that the same model with stable rank normalization performs very well on the clean test data, while in the shattering experiment its training accuracy drops: while giving good generalization, it memorizes less.
in breakage now what else can we
actually quantify to find out the the
some of the properties of the function
that we have learnt so some complexity
these are a decent equals in which they
are different parameter dependent
complexity they depend on parameter and
detachable compensating reason is that
these authors are proposed and take for
example they sometimes transform to we
can say like the quantified roberson is
like if you look into the this
particular term so this is say
that is how sensitive your parameters
are to input marks because they are some
kind of quantifying ipsus yes right so
these are these complexity measures
lower the better that is so he we plot
Instagram for all of them and we
observed that with this stable R and so
when Allah is always higher like this is
with known on that means and the
spectrum has improved the vanilla so it
has lower but then when you lose this
table it's lit further people on the
left side so this is all the fascination
experiments now we turn to Gannett
screens now the question is why should a
sudden make sense in case of Gans so
again we only focus on the discriminator
of the and it's a binary classifier so
most of the things that we have seen for
the classification problems like the
theories abound should actually the
apical applicable theories but also like
a lot of people talk about this tension
between capacity and generalizability
obviously a discriminator so from high
level the discriminator should have
enough capacity so that it can
distinguish between real and fake but
should be simple also in order to
generalize well but it is the trade-off
is not maintained then so you always
have to find a traitor so if you see the
recent work like the Massa Stein can
this is like suggesting that cooking
will have no lips is crystal emitter
then there so there are different
theories of mean that Whateley
ingredient there are many many millions
in which it's it's been shown that if
you have a simple family of functions
like your lips means it won't spectra so
that is and as we have already said that
so this is the motivation from the lips
in front of you so now we already argued
in case of classification problems this
might not be good enough so again the
same logic follows here that if we
applies a sorry numbers discriminator of
the ionic should provide some
improvements so for gann experiments
what
[Moderator] Sorry, maybe you can quickly wrap up; we're running short of time. [Speaker] Right, I'm just about finished. For the GAN experiments we compute histograms of empirical Lipschitz estimates, where lower is better. We saw that as we decrease the stable rank the inception score improves, but decrease it too much and training collapses completely, which simply shows that the capacity of the discriminator has been dropped too far. Comparing against the different variants, we show that stable rank normalization works better than spectral normalization. Similarly, we show experiments on the neural divergence score, which quantifies memorization in GANs and for which higher is better; there too we compare favourably to spectral normalization. To conclude: we used recent generalization bounds to guide us toward normalization schemes that come with some guarantees of improved generalization properties, and that is what we showed with stable rank normalization. Thank you.
[Moderator] Thank you for the talk. I'm not sure if we're supposed to take questions right now; maybe later, in the Q&A session. So I'll quickly introduce Hugo. He is currently leading the Google Brain team in Montreal, he is also associated with Mila, he was previously affiliated with the Université de Sherbrooke, and he has worked at Twitter. His work spans generative modeling and applications of deep learning to language, text classification, and vision, and he was also a program chair for NeurIPS in 2019. Hugo, maybe you can take the stage now.
[Hugo] Yep, let me try to share my screen. Could you stop sharing so I can let people see my screen? ... Yep, okay, cool.
Thanks to the organizers for inviting me. I'll try to keep this as short as possible since we're running a bit out of time, and I want to leave some time for Emily to give her talk as well. This is very much work in progress, not entirely perfectly well-formed ideas, that I thought I would submit to you in this short talk, which is about evaluating generative models using ideas and benchmarks from few-shot learning. It was derived from some experiments from work I did with Eleni Triantafillou, an intern at Brain, and with colleagues who are research scientists at Brain as well.
I thought I'd start by describing a little bit where I come from in terms of this line of work on generative models. When I started my PhD in the mid-2000s, which is about when the resurgence of interest in neural nets and deep learning started, I was really actively working on restricted Boltzmann machines. Many of you might not know what these are or remember them; if you do, you're probably in your thirties at least, or maybe Canadian. This was kind of the prototypical generative model based on neural networks, and we were evaluating how useful they are not solely based on samples, but on how useful they were for providing an initialization for supervised learning of deep neural nets. However, this is not really used anymore; we essentially became much better at regular supervised learning with deep nets by using batch norm and dropout for better regularization, better optimizers like Adam, data augmentation, and so on.
But I continued working on generative models, and in particular I proposed the neural autoregressive distribution estimator, or NADE. Again, it's been a while since I published this work and maybe some of you don't remember it so much, but it was partly motivated by the fact that these models, a bit like what Oriol presented, allow computing the exact log probability of test data under the model, and it was really great to have a quantitative metric for performance that was directly related to the generative model. When you think about whether that's a good metric, though: really, the main way of motivating the average log probability of a test set as a metric is that you care about compressing the test set, in terms of the number of bits that would be required to do so using your model. In reality, we rarely do compression at all.
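For concreteness (a standard conversion, not specific to NADE): under an ideal entropy coder, a model assigning an average negative log-likelihood of L nats per example compresses the test set at L / ln 2 bits per example, often reported as bits per dimension. The numbers below are hypothetical, for illustration only.

```python
import math

def bits_per_dim(nll_nats, num_dims):
    """Convert an average negative log-likelihood (nats per example) into
    the bits-per-dimension an ideal entropy coder would achieve under the
    model."""
    return nll_nats / (num_dims * math.log(2))

# Hypothetical: a model with an NLL of 2400 nats on 32x32x3 images.
bpd = bits_per_dim(2400.0, 32 * 32 * 3)
assert 1.0 < bpd < 1.2  # roughly 1.13 bits per pixel channel
```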
And especially when you're interested in AI applications, you're really mostly thinking about learning representations or solving specific tasks. So I've stayed somewhat involved in research on generative models, but I moved on to other things, and one of the particular problems I moved to in the past two years or so is few-shot learning: trying to solve problems that look like this, where we have a classification problem (here a five-way classification problem with one image per class), which is very little data for training a classifier that can properly classify examples in a test set. Really, that's the reason I was interested in generative models as a form of unsupervised learning: they can leverage unlabeled data, and maybe that would be key to relying less on labeled data for solving certain AI-related perception problems such as image classification. So I've been really interested in trying to address this problem more directly, and I'll say later how this might connect back to generative models.
Essentially, in the past few years there's been a lot of progress in few-shot learning research, and the way we move the needle and measure success in few-shot learning parallels how we would measure success in regular supervised learning. In supervised learning, to have a hope of generalizing well to new test examples, we want a lot of labeled examples available; in few-shot learning, we're interested in the ability to generalize to new classification tasks. So what we try to do is simulate and collect as many few-shot learning tasks as we can for the few-shot learning model to be trained on, and then evaluate it on new, different few-shot learning tasks at evaluation or test time.
There have been a fair number of benchmarks proposed, and one of the largest out there right now is Meta-Dataset. This is work I've been involved in, and Eleni Triantafillou is the lead author; she's actually presenting it right now at the poster session, which is currently ongoing, and she'll also be presenting in the fifth session of the day, so if you're interested in learning more, go check out her five-minute presentation and then discuss it with her in the poster session. Essentially, Meta-Dataset provides a bunch of few-shot classification tasks for training a few-shot learner, and then also defines a set of test few-shot classification tasks to evaluate your few-shot learner: a learner that can take in a small training set for a task and produce a classifier that can be applied to the corresponding test set of that same task. Key to having a good few-shot learner is having a lot of different few-shot learning tasks to train on, so in Meta-Dataset what we aim for is to have as rich as possible a training set of tasks, as well as a test set of tasks, for the evaluation of few-shot learning research.
To do this, we leverage a variety of publicly available datasets for image classification: ImageNet, Omniglot, Aircraft, VGG Flower, Quick Draw, Traffic Signs, and so on. We use those available sources of data to generate few-shot classification tasks for training and for evaluation, and we construct the benchmark so that the test set of tasks corresponds to classes (of objects, animals, and so on) that are strictly different from the classes used for training. So we really are interested in generalization, but in the space of classification tasks.
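The episode construction just described can be sketched as follows. This is a simplified sketch with a fixed "way" and "shot" (Meta-Dataset itself varies both per episode), and `class_to_images` is a stand-in for a real dataset loader.

```python
import random

def sample_episode(class_to_images, n_way=5, k_shot=1, n_query=5, rng=None):
    """Sample one few-shot episode: pick n_way classes, then a k_shot
    support set and an n_query query set per class. class_to_images maps
    a class name to a list of image ids."""
    rng = rng or random.Random()
    classes = rng.sample(sorted(class_to_images), n_way)
    support, query = [], []
    for label, cls in enumerate(classes):
        imgs = rng.sample(class_to_images[cls], k_shot + n_query)
        support += [(img, label) for img in imgs[:k_shot]]
        query += [(img, label) for img in imgs[k_shot:]]
    return support, query

# Toy "dataset": 20 classes with 30 images each.
data = {f"class_{c}": [f"img_{c}_{i}" for i in range(30)] for c in range(20)}
support, query = sample_episode(data, rng=random.Random(0))
assert len(support) == 5 and len(query) == 25          # a 5-way 1-shot task
assert {label for _, label in support} == {0, 1, 2, 3, 4}
```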
Since, intuitively, the key is to have as many tasks as possible for training, one of the things we started thinking about is whether we can somehow augment the number of classification tasks we can train on, beyond relying only on the available datasets, and this is where the idea of using generative models came about and we started exploring it. Specifically, if we have a model like BigGAN (a little bit like what Oriol explained, and quite related to his NeurIPS paper on training classification models on generated data), we were interested in trying the same thing for few-shot learning tasks. That is, we take a BigGAN model trained on ImageNet, and we generate a bunch of new synthetic examples by feeding in the one-hot label of any one of the thousand ImageNet classes, say remote control, along with a noise vector for each image we want to generate. We then get new examples of this synthetic class from the BigGAN, which we can use to build new few-shot learning tasks from an essentially unlimited number of these synthetic images. This might actually work better, because there's more data to create synthetic tasks from, and maybe the BigGAN, or any generative model, has learned something about the factors of variation within classes, or factors that are shared across classes, though we've seen from Philip's talk that this actually seems to be somewhat challenging to do.
Another thing we thought we could do, which hasn't been mentioned so far and which is really interesting for few-shot learning, is to actually create new classes by manipulating what we feed to the BigGAN model as the class definition, the vector that defines the class. In particular, one thing we played with was taking the interpolation of the one-hot vectors of two different classes and using that to define a new sort of mixed class. In this case we took half of the violin class and half of the barracuda class (which is a fish): we took the interpolation between these two one-hot vectors, fed it to the BigGAN model, and got these somewhat weird images of someone either playing a fish or holding a violin. And then we just say: let's consider that a new class that we invented, from which we can get examples to create new few-shot learning tasks on which we can train a few-shot learner.
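The class-vector trick can be sketched like this (a sketch of mine; the class indices below are illustrative, not the actual ImageNet indices used in the talk):

```python
import numpy as np

def mixed_class_vector(n_classes, a, b, alpha=0.5):
    """Soft conditioning vector interpolating the one-hot labels of classes
    a and b. Feeding it to a class-conditional generator (e.g. a
    BigGAN-style model) defines a synthetic 'mixed' class as described
    above."""
    y = np.zeros(n_classes)
    y[a] += alpha
    y[b] += 1.0 - alpha
    return y

# Half "violin", half "barracuda" (indices illustrative):
y = mixed_class_vector(1000, a=889, b=389, alpha=0.5)
assert np.isclose(y.sum(), 1.0)   # still sums to one, like a one-hot label
assert y[889] == 0.5 and y[389] == 0.5
```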
So we started playing with this idea to see if it had legs, and some of the results are quite interesting. What you're seeing here: each row corresponds to a different test set of tasks, where the images come from different sources of data (traffic signs, aircraft, and so on), and each column is a different source of data from which we created the few-shot learning tasks used to train a meta-learner, or few-shot learner. In particular we used a prototypical network, which I won't describe here; just think of it as a type of learner that is trained on few-shot learning tasks. What we found is that if we train on tasks generated from the raw ImageNet images, we get a particular performance, and if we instead use synthetic data generated from a GAN model with the hard labels, the true one-hot labels, we actually get pretty close performance. If we use the soft labels, we get slightly better performance that is not quite at the ImageNet level but not that far off, which is interesting to contrast with what Oriol presented, where the gap was much bigger in the regular supervised learning setting. This might be because here we're interested in generalization to new problems, new classes: we're not measuring generalization within the ImageNet domain but actually across other domains, so in a sense it's more challenging, but that means there might be even more benefit to learning a generative model instead of using only the true examples. In fact, we can then train jointly on the original ImageNet classification tasks and on the GAN-generated classification tasks, and we get the best results when we use the GAN with the soft labels, generating entirely new, synthetic classes that don't actually exist in ImageNet; we're getting an interesting boost from doing that.
So this is the proposal I'm submitting for discussion, perhaps during the Q&A or offline: maybe few-shot learning benchmarks like Meta-Dataset could serve the other way around. Instead of thinking of generative models as a way to improve few-shot learning, maybe few-shot learning benchmarks can be used to evaluate the quality of generative models: you would ask what happens if, instead of a BigGAN, I use some other generative model, any new model you might be developing, for generating the data that is used to create the few-shot learning tasks to train on. And who knows, maybe we can even surpass the original ImageNet training procedure using synthetic data alone, or maybe combining all of these data sources together is what will push the boundary of what we can do in few-shot learning, which would be a really interesting by-product.
I'll stop there, with just a mention of some pros and cons of this approach for you to think about. It's a metric that can work for any generative model: unlike pre-training-based evaluation, where you really need a mapping between your generative model and the weights of a supervised neural net, here we're working with samples, so really any model can be used. It's also a metric one can really care about, because it's about improving performance on what I think is an important problem, namely few-shot learning, whereas log-likelihood, that is compression, is not so interesting, or at least is definitely not one of the most studied problems in machine learning. And compared, say, to the work that Oriol presented, it might be a particularly interesting way of measuring the performance of compositional generative models: models that can carry out this kind of new-class-generating procedure by making novel combinations of their components to generate data that is really different from what they were trained on, which can be really valuable when you're interested in generalizing to new few-shot learning tasks and new object categories you've never seen before. As for disadvantages, it is really only usable if you have a class-conditional generative model, though I would say generative models are mostly interesting anyway when they are conditional; unconditional generation has, I think, much more limited interesting use cases, but that's perhaps debatable. The other thing is that it's not directly assessing visual quality: there's no theoretical reason why we need to generate visually perfect samples to get good performance from a few-shot learner trained on those samples. So if you're interested, I'm happy to discuss this either later in this meeting or offline; don't hesitate to reach out in various ways during the conference or after. That's it, thank you.
[Moderator] Thank you, Hugo, for this amazing talk. Should we move to the last talk and take questions after? Yes? Cool. All right, I'll quickly introduce Emily. You can see my screen, right? Yes? Okay. Emily is a research scientist in the Google machine intelligence group. She has been working on generative models but has now shifted her focus to ethics, bias, and algorithmic fairness, so she'll be talking about a perspective most of us do not think about: critiques of generative AI. I would also like to thank Emily, because she is also attending another conference and accepted this invitation at the last moment. Over to you.
[Emily] Okay, can you see my slides? Yep? Okay, awesome.
Cool. So, thank you everybody for joining us today and for all the awesome talks so far, and thanks to the organizers for putting this together during some pretty challenging times. I'm Emily, a research scientist at Google Brain in New York City, in my living room right now. I want to share a little of my research background so you have some context for this talk. I started out doing cognitive science and then moved over during my PhD to focus on generative models and unsupervised learning, with a little bit of forward modeling as well; but over the course of my PhD I became increasingly aware of the broader social consequences of AI, and I've now shifted my research focus towards more sociotechnically oriented AI ethics work. So I'm not currently working on generative models, and when I got this talk invitation I thought I'd use it as an opportunity to return to the topic of generative models, but with this new kind of sociotechnical, ethics-oriented lens that I have now. That being said, this was kind of thrown together at the last minute; it's really an informal musing of my thoughts on some different things, and I hope it's enough to spark conversation. This talk is split into two parts.
In the first half, I'm going to give a non-exhaustive overview of some risks and harms that have emerged in recent years from generative techniques. Some of these I think folks will be fairly familiar with, but I also hope to tread on some new territory, or at least on topics that are so far under-discussed in the generative modeling community. A caveat here is that I'm going to focus primarily on images and videos, because this is only a fifteen-minute talk and because that's where a lot of my expertise lives, but you can imagine a lot of what I'll say generalizing to other domains. Another thing I want to say is that some of the ethical concerns will be focused on downstream uses of generative techniques, at the application level, but I'm going to argue that this doesn't absolve the generative modeling community, or at least the academic community, from being attentive to these things; and other things I'll talk about are oriented more towards the academic work we're doing and the research papers we're publishing. The second part of my talk will then move more into how we, as a field, can move forward.
Cool, so, part one. I'm going to begin with the obvious observation that these generative techniques have advanced to a point where they can now be leveraged to create synthetic media that is not immediately identifiable as computer-generated. As I think we all know, this can be used to digitally edit audio or visual content and make it appear as though a person is saying or doing something they didn't actually say or do, and a multitude of harms can result, both when an individual consuming the media is deceived into thinking it's authentic and even when they know the content has been altered. Some tangible, concrete consequences of this: these techniques can be used to promote or circulate various forms of fake news or misinformation by altering media content to suggest, for example, that a political figure said or did something they didn't. In some instances this contributes to individuals consuming and believing that such content is in fact true; more broadly, it can erode public trust in news sources, make it difficult for the public to distinguish fact from fiction, and afford a new route for deniability and the discrediting of evidence. I think this is one of the topics there's a lot of focus on, and folks are also working in this space on detecting these kinds of fake videos. These types of audio and visual manipulation have also been weaponized in pretty vile and violating ways, disproportionately targeting women: for example, deepfake techniques can insert faces into sexually explicit content without the consent of the individuals, and different sorts of apps have been proposed and developed that "undress" people in images, again with no consent required, and again disproportionately targeting women.
Regardless of whether the content is mistaken for real or not, in these cases these tools provide a new avenue for targeted harassment and abuse. Closely related, there is a website, These Nudes Do Not Exist, that sells algorithmically generated nudes for one dollar each. This is a male-run company profiting off the commodification of women's bodies, following trends we already know, and it raises questions that I think are actually important for all ML practitioners to be thinking about, relating to how the data was sourced. The company owners say the dataset is composed of women mostly twenty to forty years old, all white, and that they only use public-domain images; but much of the pornographic data in the public domain has actually been stolen from sex workers. This raises a multitude of issues: not only have the women powering these models almost certainly not consented to this, they have most likely not been compensated for it either.
want to shift a little bit away from the
kind of downstream applications and
towards considerations that might seem
like a little bit more of an immediate
concern for this this field so
Generative techniques, by design, reproduce statistical correlations in the data set that they were trained on, and I think there is an increasing conversation about different types of dataset bias. There's been a lot of really interesting work looking at how to understand and mitigate the biases in the data that generative models are trained on. Here I just want to emphasize the importance of critically examining dataset bias from a social perspective, and understanding how these correlations might reflect different kinds of social stereotypes or patterns of structural inequality in the world.
I'm gonna go through an example with CelebA, which is a pretty standard face dataset. I use this because it's a dataset that I've used in the past, and so a lot of my thinking in this field involves reflecting back on my own practices, on how I could have done better, and on things I didn't notice then that I'm thinking through now. Some natural questions to ask about this dataset are: who is represented in the dataset, and who is underrepresented? In CelebA, for example, there are a lot of demographic skews; it contains predominantly lighter-skinned celebrities. We can also ask: how are people represented?
In CelebA there are really strong correlations between certain characteristics and certain attributes in the dataset. For example, the women in the dataset tend to be relatively young and are rarely wearing glasses. This means that generative models trained on this dataset, if they're trained naively, are going to learn a particular model of what this type of person looks like. We can also question the values and politics that might be embedded in a particular dataset. With CelebA we see a very specific logic built into the attribute taxonomy: it has attributes such as "attractive", "male", and "young", and these are all far from objective or natural categories. These socially relevant biases can manifest in techniques or applications that ultimately end up reinforcing or amplifying sexist, racist, or otherwise offensive biases and worldviews. The figure I show here, manipulating the "attractiveness" dimension of a generative model, is actually from a paper that is proposing a method to mitigate these types of dataset biases, so the authors are not making any claims that this is what an attractive person looks like. But we have actually seen this kind of thing manifesting in real applications that propose "beauty" or "hotness" filters; I've seen, for example, such filters lightening skin tones as "hotness" is increased. So again, the risk is that these generative AI techniques often unintentionally end up reinforcing some normative view of, for example, what beauty looks like or what a particular gender looks like, and this in turn can reinforce societal expectations and norms. Now I want to dive a little bit deeper into race and gender
briefly. I definitely don't have time for a full gender or racial literacy talk, nor do I really think that I'm the best person to give such a talk, but I want to convey hopefully enough about this topic to encourage folks to dig into it a little bit more on their own. In short, gender and race are social categories: they're not fixed attributes attached to people or images. They're socially situated, both temporally and spatially, and they are unstable in a variety of different senses. Racial categories in particular are tied to racial projects, and I have a whole other line of research and a paper in this space looking at the use of racial categories in machine learning. For now I'll just say that when they're adopted, as they often are in machine learning, as fixed and stable attributes simply attached to people, then the social and political dimensions of these categories end up being lost or obfuscated, and this in turn can contribute to the production and reproduction of racial inequality. Similarly, the reification of normative gender categories, which we often see happening in different kinds of machine learning technologies, can operate in similarly oppressive ways. So my point here is that generative models of images of people, which is what I'm going to focus on primarily, though we can also think about other kinds of signals, like speech, that are intimately tied to people's identities, risk re-inscribing social categories like race and gender as visual and natural attributes.
So I'll just give a couple of examples of what I'm talking about here. Something seen very frequently in papers about generative models of images is this idea of manipulating images along different latent dimensions, and the notion of gender as a latent dimension is very frequently utilized. I've just pulled an example from a recent paper, but this is extremely common; I'm not flagging this paper over others for any particular reason.
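To be concrete about the technique being discussed: the latent-dimension manipulations in these papers usually amount to moving a latent code along an estimated attribute direction. The following is a minimal sketch of that general idea, not any specific paper's method; the helper names, the synthetic latents, and the deliberately neutral toy attribute are all illustrative assumptions of mine, not from the talk.

```python
import numpy as np

def attribute_direction(latents, labels):
    """Estimate a latent 'direction' for a binary attribute as the
    difference between the mean latents of the two labeled groups."""
    latents, labels = np.asarray(latents), np.asarray(labels)
    d = latents[labels == 1].mean(axis=0) - latents[labels == 0].mean(axis=0)
    return d / np.linalg.norm(d)  # unit vector

def manipulate(z, direction, alpha):
    """Move a latent code along the attribute direction by strength alpha."""
    return z + alpha * direction

# Toy demo with a synthetic 4-D latent space (no real generator involved).
rng = np.random.default_rng(0)
latents = rng.normal(size=(100, 4))
labels = (latents[:, 0] > 0).astype(int)  # pretend the attribute tracks dim 0
d = attribute_direction(latents, labels)
z_edit = manipulate(latents[0], d, alpha=2.0)
```

In real uses, `latents` would be generator codes labeled via a classifier or dataset attributes; note that `d` captures whatever visual characteristics happen to correlate with the label in the training data, which is exactly the talk's point about such directions not being intrinsic properties of people.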
Um, and so this is something that I really want to urge people to think about, and subsequently to speak carefully about. These kinds of gender-swapping methods, whether in an app that anybody can use or in an academic paper describing new techniques, can be harmful in a variety of ways. First, they reify binary gender schematics and reduce gender to visual appearance, which can hurt a range of people, but is specifically likely to hurt trans, non-binary, gender non-conforming, and otherwise gender-diverse individuals. They also reinforce a normative view of what a particular gender looks like: for example, "male" filters might shorten hair and add facial hair, while "female" filters might lengthen hair and add makeup, which in turn feeds into these societal expectations. So, generative models are extremely powerful descriptive tools, and they have great utility for visualizing and understanding complex data; we saw this in Phillip's talk earlier today. But when we're dealing with categories such as gender, even if you as a researcher understand that by manipulating the "gender" dimension of a generative model you're really just manipulating a set of visual characteristics that, in the dataset the model was trained on, happen to correlate with a particular label, as opposed to some intrinsic property of a person, the manner in which we as academics speak about these things, both in our own papers and in broader public discourse, really matters. There's another related line of research I want to touch upon, focused on generating faces from speech. These works have a tendency to refer to gender, race, and ethnicity as visual properties of faces, or sometimes as "biophysical parameters", and they speak of producing gender-, ethnically-, or racially "appropriate" faces from voices, which again re-inscribes race and gender as both natural and visual attributes. I bring this up because it emphasizes the importance of understanding the underlying logics of a resulting application and how it fits into a broader social context. Okay, I'm running low on time, so I'm going to rush through the second half of my
talk. So, really, I think that individual researchers and the machine learning field as a whole need to start seriously reckoning with, and being held accountable for, the far-reaching social consequences of the work that we're doing, and really questioning the cultural logics embedded in the datasets and the resulting techniques that we're producing. This obviously isn't gonna be easy. I think critical and reflexive research practices are things that need to be learned, and computer science as a field has historically not prioritized or valued these types of methodologies, so what this is going to require, I think, is shifting cultural norms and practices. I'm going to go through a bunch of things that I think might get us some of the way; some of these are things that individual researchers can think about and grapple with, and others are really going to require community-wide efforts. The first is adopting critical data practices. This is where most of my research lies these days, so if anybody's interested in digging into this more, we can chat during the breakout sessions or later. Critical data practices, to me, mean a variety of things. They mean questioning and critiquing who is represented in datasets, what kinds of statistical correlations might exist in these datasets, and how these correlations might relate to social structures of power and privilege.
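As a minimal sketch of what auditing such correlations can look like in practice, one can compare conditional rates of one attribute across groups defined by another. The helper `conditional_rate` and the simulated 0/1 attribute table below are my own illustrative assumptions; the column names merely echo CelebA's attribute annotations, and the skew is built in by construction.

```python
import numpy as np

def conditional_rate(attrs, target, given, value=1):
    """Estimate P(target == 1 | given == value) from 0/1 attribute columns."""
    mask = attrs[given] == value
    return float(attrs[target][mask].mean())

# Toy stand-in for a CelebA-style attribute table, with a built-in skew:
# eyeglasses are made much rarer for the Male == 0 group.
rng = np.random.default_rng(0)
n = 10_000
male = rng.integers(0, 2, size=n)
glasses = np.where(male == 1,
                   rng.random(n) < 0.15,
                   rng.random(n) < 0.02).astype(int)
attrs = {"Male": male, "Eyeglasses": glasses}

rate_m = conditional_rate(attrs, "Eyeglasses", "Male", 1)  # ~0.15 by construction
rate_f = conditional_rate(attrs, "Eyeglasses", "Male", 0)  # ~0.02 by construction
```

On a real dataset one would load the actual attribute annotations and run the same comparison over many attribute pairs; large gaps between groups flag exactly the kinds of correlations, and their possible social readings, discussed above.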
They also mean looking beyond these distributional properties and questioning the conditions of construction of a given dataset: what categories might underlie it, what personal and organizational imperatives shaped its collection, and what taken-for-granted, shared background assumptions shaped its construction. And these points aren't only for dataset creators, right? I think there are a lot of folks who are really just interested in the modeling side of things, in developing algorithmic techniques, and they may just pick a dataset that has some property relevant to the model they're developing, without much care for what's contained in that dataset. But every time we uncritically adopt a dataset, even for purely research purposes, we contribute to that dataset being reified as an apolitical scientific object, and many of these datasets have very specific political contexts. There's also a role for the community at large to play, in terms of shifting reviewing practices, incentive structures, and educational efforts. One really concrete thing that we as a field, but also as individual researchers, can do is to normalize and incentivize the inclusion of ethical-considerations sections in papers, and of comprehensive model and data reporting and documentation frameworks. Model cards and datasheets for datasets are two examples; these have come out of collaborators on my team, and they're really amazing frameworks. What they do is help model and dataset developers proactively consider the limitations of their work, detail their assumptions and motivations, and do a little bit of introspection and reflection. One final thing is to
really think about our unique positions in the world and the set of experiences that shape our understanding of it and limit the bounds of our perspective. Our positionality limits our epistemic worldview and the way in which we see things, and accepting that we only have a partial perspective on the world is really about recognizing that people with power and privilege see the world in very different ways than people holding marginalized identities. So we really need to educate ourselves, consider diverse stakeholders, and be aware of the limitations of our own perspectives. Interdisciplinarity: I'll skip this slide for the sake of time, but I think interdisciplinarity is incredibly important, and here are a couple of papers that talk about why. AI is fundamentally a social and technical endeavor, and our practices need to reflect this. And finally, I think we need to shift our thinking from intent to impact. Most people, I think, are really well-intentioned, but some might not know the ways in which their work might actually be impacting the world. We are building socio-technical systems, and we need to be explicitly thinking about the social aspects of our work, engaging with diverse stakeholders, and actively working to develop these self-critical and reflexive practices. So I will stop there, and just thank some of my colleagues who I've chatted with about a lot of this stuff. Thanks.
