Good evening, my name is Kimberly Lau and I'm the current chair of the Academic Senate
and I'm so pleased to welcome all of you
to the fifty-third faculty research
lecture at UCSC... it's wonderful to see so
many of you here tonight.  The Faculty
Research Lecture is an institution
supported by the Academic Senates on
all 10 campuses of the University of
California. Its purpose is to highlight a
faculty member's distinguished research
record and to provide the campus and the
local community with an opportunity to
learn more about the honoree's work. The
first faculty research lecture was
delivered at UC Berkeley in 1918 and UC
Santa Cruz initiated its tradition with
Maurice Alexander Natanson's 1967
lecture. Each year a committee of
Academic Senate faculty from across all
five academic divisions nominates a
professor to present this prestigious
lecture based on the overall excellence
of their work. This year the nomination
of professor Lise Getoor was approved
by enthusiastic acclamations at the Senate
meeting of May 16th 2018. Dr. Getoor's
research spans several fields in data
science, including machine learning and
reasoning under uncertainty, data
management, visual analytics and social
network analysis. As you can see from
this brief overview her work is
extremely timely, and as the director of
the UCSC D3 Center, which stands for Data,
Discovery and Decisions, she is also
particularly concerned with the
responsible use of big data. I'll leave
the details of her work to Dean Alex
Wolf, who will present them in his
introduction and of course to Dr.
Getoor herself. Before then however it's
my great pleasure to introduce our
Chancellor George Blumenthal.
Good evening everyone, and thank you all
for coming. This event, which is a peer-
chosen honor and the highest academic
honor we bestow at UCSC, is always a real
highlight of the year for me.
And it is so because it spotlights
research which is so central to our
mission at UC Santa Cruz. At UCSC
research goes hand-in-hand with
excellent teaching, and we are built on a
foundation of academic research laid over
the past half century. Over the time that
I've seen UC Santa Cruz grow and evolve
from an innovative and pioneering
experiment in higher education to a
nationally ranked research university,
I've been incredibly proud, and it's the
work of our faculty, people like Lise
Getoor, tonight's lecturer, that has
gotten us to this place.
It really feels amazing that this is our
53rd annual Faculty Research Lecture, and
just for the record while I've been to a
lot of them I haven't been to all of
them. It's important to note that
research is also what sets the
University of California apart from
other systems of higher education. It's
an engine of discovery that has fueled
the Golden State for decades; importantly
research also satisfies that uniquely
human quest for exploration, for
knowledge, and for insight. UC Santa Cruz
is the home to artists and scientists
philosophers and engineers, astronomers
and poets, sociologists and economists...
scholars across all of these disciplines
advance our understanding of the world
around us. Their contributions enrich us,
and they enrich the educational experience
that we offer to our students. UC Santa Cruz
students get extraordinary opportunities
to work side by side with leading
scholars, big thinkers who are advancing
their disciplines and really
transforming their fields... our
professors keep asking important
questions, keep writing, and keep sharing
their knowledge. Tonight we get to hear
from one of our very best,
Lise Getoor, the 2019 Faculty Research
Lecturer. I had the immense pleasure of
getting to know Lise on an international
trip we made a few years ago to South
Korea, where I remember that she wowed
executives at several Korean companies
that we visited, just as I'm sure she
will wow us tonight with her
presentation. And here to formally
introduce Lise is Baskin School of
Engineering Dean Alex Wolf. Alex?
thank you, George, and thank you all for coming tonight. I'd
like to offer a special thank you to my
fellow academic senators for recognizing
tonight's speaker. As Dean of the Baskin
School of Engineering, one of the many
privileges I get is to get
acquainted with our remarkable faculty
and their very impressive work, and then
I get to trumpet how wonderful they are
to as many people as I can...
so tonight I could not be more honored
to introduce Professor Lise Getoor, a
scholar who embodies the definition of
world-class faculty. First, though, I'd
like to offer a bit of biographical
background on Lise... Lise is a genuine
product of the West Coast... she was born
in Seattle and grew up in San Diego...
Lise is also a child of the UC system:
her father was a well known
mathematician on the faculty of UC San Diego
She earned her bachelor's degree
in computer science at UC Santa Barbara
and a master's at UC Berkeley and then
something went wrong...
somehow she earned a PhD in computer
science at Stanford. PhD in hand,
Lise joined the computer science faculty
at the University of Maryland in College
Park and distinguished herself with her
research in as Kim said machine learning,
reasoning under uncertainty, data
management, visual analytics and social
network analysis. In 2013, she returned to
her UC roots and joined us here at UC
Santa Cruz in the department of computer
science. When you hear just some of the
ways in which Lise has distinguished
herself you'll understand why we're so
lucky to have recruited her and why
we're so proud to have her here with us
today... needless to say she has been an
incredibly successful and highly cited
researcher and a highly, highly
sought-after speaker and lecturer...
Lise is a fellow of the Association for the
Advancement of Artificial Intelligence
(AAAI)... Lise is the PI for one of only
a handful of National Science Foundation
Transdisciplinary Research in
Principles of Data Science, or TRIPODS,
Phase One grants. This grant brings
together UC Santa Cruz faculty to develop
a unified theory of data science, applied
to uncertain and heterogeneous graph and
network data. She and her colleagues are
now hard at work on a Phase Two proposal,
which hopefully will move the work into
a larger multi-institutional research
setting... Lise led the establishment of
the UC Santa Cruz Data, Discovery and
Decisions data science research center,
D3, and serves as its founding director. D3
among its research activities works with
companies to provide opportunities for
collaborations between students, faculty
and industry. Beyond her scholarly
work, Lise has been a highly visible
national and international leader and
advocate. Lise served as chair of the
Computing Research Association's
Subcommittee on data science
and was lead author of a highly
influential CRA study on computer
science and the emerging field of data
science. Lise serves on the National
Academies' Roundtable on Data Science
Education, where the next roundtable
discussion is focused on collaborations
between industry and academia on data
science. Lise is also a quiet yet
exceptionally strong supporter and
advocate for women in the data science
and machine learning communities, and is
committed more broadly to mentoring
students, junior faculty, and women in
STEM. To say the least, Lise brings great
distinction to the Baskin School of
Engineering and to UC Santa Cruz and I'm
delighted to see her accomplishments
recognized so publicly tonight. So please
join me in welcoming the presenter of
the 2019 Faculty Research Lecture,
Professor Lise Getoor
All right, well thank you, George and Alex
and Kim for a really nice introduction
and thanks everybody for helping to make
this happen
and thank you guys for coming out in the
rain, I'm really happy to see all of you
here and I'm gonna be talking about a
topic that I think is really important:
Responsible Data Science, and I hope by
the end of this lecture you will agree
with me. Now, as for my goals in this
talk, I have several: the first, like
any good academic (and I am a
professor), is that I want to educate you
a little bit... I want to also excite you about some
of the opportunities in data science... I
also want to caution you... and at the end
I want to leave you with some tools...
tools that ideally help you separate
some of the reality from the hype
and vice versa... but also enough
background so that you can
engage in what I think is going to be a
really important emerging area around just
what we do as we go forward with data
science and more... so I'm gonna start at
the very beginning... so what is Data
Science? Is it this kind of emerging
discipline, or, you know, is it just a
giant fad that's gonna go
away? I really like this quote that is
from Cathryn Carson. She is a historian
and ethnographer of contemporary science
and what she has to say is "data science
is a shimmering concept, no one agrees
exactly what it is, but it gets at changes
underway that are serious and real, both
in academic disciplines and out there in
the world. Something is emerging in the
space where a mass of our pervasive data,
its computational handling, and its
analytic manipulation come together to
underwrite inferential conclusions and
actions in a data-fied world." She goes on
to say, "whether we're looking at
foundational approaches, domain area uses or contextual implications and
entanglements, the intellectual terrain
is just starting to get mapped. And
however it goes forward, there's
something irreversible happening:
whatever else happens to the techniques,
the questions, or even the discipline,
the data are not going away."
And so, as Alex mentioned, I've been
involved in a number of these
conversations, nationally and
internationally, around data science, and
it is true that something is
emerging... just to give you one
example at UC Berkeley they have a
course which is their Data Science 8
course, and this is an interesting course
it's designed for freshmen and
sophomore students, they don't have to
have any computation or statistics
background; the most recent offering, in
fall of this past year, had 1300
students and 300 more on the waitlist.
The thing that is novel about it
is, well, like a lot of our courses it's
grounded in applications, but many of
the applications concern the social
world...
so their goal is that eventually all
incoming students will take this class
and will then be literate in
data science: computational thinking,
statistical thinking, and ethics. Here at
Santa Cruz
Abel Rodriguez has led many of the
efforts around data science together
with folks in CSE and more... what is
interesting as well is, you know, there
are a lot of cool things about the course,
but for example the gender ratio in the
course is much more equal. The focus
of this talk, however, is on responsible
data science. And so in order to ground
this work in responsible data science
I'm going to be using the concept of
socio-technical systems. "Socio-technical
systems" is a term that comes from the
science, technology and society community...
but you know fundamentally it gets at
this interaction between people
and technology... and you'll see throughout
the talk I'm going to return to this as
an important component of responsible
data science... so in this talk I have
three parts: I'm gonna give you some basics, and in the basics what I really hope to
do is unpack a number of terms that are
being used in the media: you know,
algorithms, AI, machine learning...
then I'm gonna go over some research, and
then I'm gonna spend some time going
over cautions and things to be aware of...
and throughout the talk, what I'd like
you to keep in mind is some of the
how, when, why, and why not... I am gonna try
and give you enough of an intuition
about how some of these things work that
ideally you can have also some
understanding of you know when they
might be applicable and also when they
might go wrong... so let me get started
with that and let me get started with
the most basic piece, which is 'what is an
algorithm?' This term is being thrown
around by the media quite a bit, and the
basic idea is you know it's a just a
kind of recipe for doing something... and
you know people use them all the time...
you know, any time that you have some
task that you need to do... you know, 'I'm
gonna bake a cake,' you know, 'I need some
ingredients,' I have some steps... so I
follow those steps, and I get a cake out of it...
you know that's a human algorithm... when
the term is used by the media, they're
talking about computer algorithms... so I
am going to give you a crash course in
algorithms, and I'm gonna go through five
different kinds of algorithms, and as I
go through them, keep in the back of your
mind, you know what are the different
ways these could be used, but what are
the different ways that they could
potentially go wrong? So first one,
straight-line algorithms, okay, everybody
admit it, how many of you have ever
written a computer program? Okay, bravo, so the
very first program you probably wrote
was a straight-line program, where you
basically just go through... you say some
steps... you know this is an example of one
that computes your account balance after
you make a withdrawal...
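To make that concrete, a straight-line version of that withdrawal example might look like this in Python (the amounts are made up for illustration, not taken from the slide):

```python
# A straight-line program: a fixed sequence of steps, no branching.
balance = 500.00       # starting account balance (illustrative)
withdrawal = 120.00    # amount being withdrawn (illustrative)

balance = balance - withdrawal  # apply the withdrawal
print(balance)                  # the new balance: 380.0
```

Every run executes the same steps in the same order, which is what makes it "straight-line."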
the next kind of algorithm is a
rule-based algorithm, where you have some
sort of conditionals: there's some sort
of if-then; so even in my little code
snippet some of you may have thought,
'well, it might have been smart if you
check to see if you had enough money in
your account before you made the
withdrawal.' The interesting thing is
these are the kinds of programs that are
often used for diagnosing things, so
here's a little snippet of you know how
you might diagnose what's going wrong
with your printer if your printer is not
working... and the key thing to note is
these kinds of algorithms go by the term
"expert systems" in AI, so these were
initially super popular in the 80s but
they're still very popular for, you know,
detecting problems and debugging things...
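A rule-based version of that printer diagnosis might be sketched like this; the specific rules and messages are invented for illustration, not taken from the lecture slides:

```python
# A rule-based (if-then) algorithm, in the spirit of an expert system.
def diagnose_printer(powered_on, connected, has_paper):
    """Walk through if-then rules and return the first matching advice."""
    if not powered_on:
        return "Plug the printer in and turn it on."
    if not connected:
        return "Check the cable or network connection."
    if not has_paper:
        return "Load paper into the tray."
    return "Try restarting the print job."

print(diagnose_printer(powered_on=True, connected=True, has_paper=False))
# prints: Load paper into the tray.
```

The knowledge lives in the ordered conditionals; nothing is learned from data.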
you know those annoying call
centers that you call, where they route you
through things? Those are a form of expert
system. Okay, now let's get to
data-driven algorithms... so a data-driven
algorithm is something where you have
some data coming in; you go through the
data, you do some counts, and check how
often something occurs... so as an example
here's my... I have some shopping data, and
I've counted the number of times someone
bought bananas and they bought milk... the
number of times they bought bananas and
bread, bananas and carrots, and, you know,
whatever it is in that white jar... and
then a new customer comes in, they have
bananas, and I'll pick whichever one of
those happened most often; maybe most
often it was milk, and I'll say, 'oh, you
know, go down the milk aisle.' Now, these
kinds of algorithms are often called
machine-learning algorithms, and this is
an example of a recommender system in
particular; now you can get fancier than
this, so you can have some statistical
model behind this and so on... but the key
underlying thing that's done is counting...
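That counting idea fits in a few lines of Python; the baskets below are made up, but the logic, count co-occurrences and recommend the most frequent companion, is the whole trick:

```python
from collections import Counter

# Made-up shopping data: each basket is a set of items bought together.
baskets = [
    {"bananas", "milk"},
    {"bananas", "milk"},
    {"bananas", "bread"},
    {"bananas", "carrots"},
]

# Count what co-occurs with bananas across all baskets.
co_counts = Counter()
for basket in baskets:
    if "bananas" in basket:
        co_counts.update(basket - {"bananas"})

# A new customer has bananas: suggest the most frequent companion item.
suggestion, count = co_counts.most_common(1)[0]
print(suggestion)  # milk, since it co-occurred most often
```

Fancier recommenders put a statistical model on top, but the raw signal is still these counts, which is why they fail when the counts are sparse or unrepresentative.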
so keep that in mind when you're
thinking about when these will work and
when they won't. Another very useful kind
of algorithm is a randomized algorithm.
Randomized algorithms have some sort
of simulated coin flipping involved, and
these are super useful for different
things: one is if you're searching
very large spaces, where randomness helps;
another is if you're doing simulations of
distributions in some sort of
statistical model; and
they're also useful if you're designing
a game where you want it to be
interesting, so the same thing doesn't
happen every time you play the game... and
then the last algorithm category that I
want to go over is deep learning. So how
many of you have heard of deep learning?
Ok, so deep learning is getting a lot of
attention these days, and at the
simplest level, deep learning is
all about constructing a super-simplified
abstract neural network. When you have
large numbers of input-output pairs,
these models are very good at coming up
with a compressed representation, so
they're good at memorizing large amounts
of data and constructing an abstract
representation. The models are considered
what's called 'black box' because, just
inspecting a neural net, it's really
hard to figure out what exactly
it's doing... and there's been some pretty
serious issues... there's a ton of examples
of deep networks gone wrong: one of the
basic first ones came out of Irvine
where they were looking at a deep
network that was trained to recognize
the difference between huskies and
wolves, and so it performed really well
on the training data; then they took some
new data and, you know, it didn't do so well
on the new data... and it turned out that
what the neural net had memorized was
all of the wolves had snow in the
background; all the dogs had grass in the
background; and that's what it had
learned. So a colleague of mine, Rich
Caruana, and a number of other people are
advocating caution with these... he has
this saying, 'friends don't let friends use
black box models,' and I actually think this
WIRED title gets it:
"Greedy, Brittle, Opaque, and Shallow: The
Downsides to Deep Learning." So I think
there's a there's some really cool
things that they can do but there are
some cautions and I'm going to be
returning to those in a later part of the
talk. So you have just had a crash course
in algorithms... you get your gold star and
check mark... now you've had an
extensive education in data science. I'm
now going to turn to talking about
research, and as we go through, we're going
to come back to these algorithms. In
talking about my research I'm
going to return to talking about these
kind of socio-technical systems that are
complex, connected heterogeneous and
so on... and one of the things that all of
these algorithms, machine learning and
data science algorithms typically do is
they take this rich structure and they
flatten it... so they flatten it into
tables, where each row in the
table is treated independently and
atomically, and so they take this
kind of cool data and they put it into a
table... so what are the issues with that?
It turns out that there's actually a
bunch of issues, so first off you know
this flattening oftentimes is making
incorrect independence assumptions...
further, the models end up oftentimes not
being interpretable, and here I mean not
interpretable in a different way than
I was talking about for deep learning,
because oftentimes what you do is a lot of
feature engineering to transfer
that rich graph structure into the
columns in this table, and then, you know,
you wrote some code to do that, you lose
the code, and it ends up you don't
remember how you got it... you know, very
bad... it's not declarative in
that way... but the one that I want to
emphasize is that it doesn't support
collective reasoning, and so collective
reasoning
is this idea that rather than treat all
of the inferences independently, let's
model the dependence... and the particular
way that we model the dependence is that
we can use local information about
whatever we're trying to make inferences
about, but then also use the
relational structure... and you can use
this for prediction... I probably will
lapse into talking about prediction but
you can also use it for discovery, so
just trying to kind of understand what's
going on in your domain or your data... and
you can also use it for causal modeling;
causal modeling is important when you
actually want to make an intervention
I'm not going to be talking about that
here so much but I'm very interested in
causal modeling in networks where there
might be interference and so on, so if
anybody's interested in that come talk
to me afterwards... so let me go through
some examples of this collective
reasoning... and I'm gonna go through them;
they're gonna be simple examples, but I
hope you'll see the utility of
each of them... so the first one is
information integration, and information
integration is something that happens
all the time when you're doing any kind
of data science problem where you have
to... you have these different digital
representations for things, and you have
to figure out you know which ones are
talking about the same thing... so think
about bioinformatics... you have a bunch of
articles that are talking about genes,
and they're referred to in different ways;
how do I figure out which ones are the same?
You know I have different treatments, how
do I figure out which drugs are the same?
Or you know something like digital
humanities, where you have a bunch of
text and maybe you have video, maybe you
have images...
and again you're trying to kind of
figure out you know is this person over
here or this place over here the same in
these different texts... so the challenge
with all this is typically there's all
sorts of noisy clues that you
essentially want to piece together, and
so I'm going to do a very relevant
Santa Cruz example... so this is a
collection of documents about the
university, and talking about our mascot
and the idea is to figure out which
references refer to the same real-world
entity. I'm going to make use of local
information, just within a single document... so I'm
gonna figure out you know which things
refer to the university, to UC Santa Cruz
and because I figured out those two are
the same, then I can figure out the
University and the school are the same; I
can look at mascot and banana slug and
so this is something that in natural
language processing is referred to as
co-reference resolution, kind of figuring
out all these things...
whether they're the same in say a single
document, but now I can go to a document
collection and again, do the same kind of
reasoning but reasoning that goes across
these different documents to real world
entities, and then I might have another
document; this happens to be about
Falafels Santa Cruz, and talking about
the owner Sam, and I figure out that that
Sam is not the same as Sammy, and so
there's this kind of figuring out what's
the same, and figuring out what's
different. So this has a number of
challenges, one of the biggest challenges
is that there's tons of uncertainty, I made
this seem like, 'Oh, it was totally obvious
which one referred to the same thing,'
usually, there's a lot more ambiguity
and being able to handle all that
ambiguity is the challenge, and again I
showed it to you for like four documents...
you know, how do you do this at scale
when you're overloaded with digital
information? So that's the
first example of collective reasoning,
the next one I want to do is collective
classification... collective
classification is just the idea that in
many data science problems you have a
bunch of entities that you're trying to
label or classify in some way; now it
could be according to some demographic
attribute... so it could be according to
you know gender, could be some sort of
political persuasion or so on... I'm going
to do a slightly more complex example
which is trying to figure out in an
online debate different people's
positions on a topic, and so I have a
debate, people are posting things. Those
arrows, so green arrow is 'I agree with
you'
the red arrow is that I disagree with
you, and the question is, you know, what is
their stance? Is it you know pro-the
topic or against the topic? And this is a
topic that my former PhD student Dhanya
Sridhar, who's now doing a postdoc at
Columbia, worked on together with Marilyn
Walker. Marilyn Walker does tons of work
in this space of dialogues; I'm gonna be
going through a simplified version of
this, but again, in order to figure out
users' stances, I can start off by looking
just at the document, and from the
sentiment or other words used in it I can
infer something about what the stance is
likely to be, but then the
relational information is exactly this:
using these disagree and agree
links to reason about what the stances
would be, and I can reason about the
stances by saying things like 'okay, if
someone's pro, and someone agrees with
them, then the other person is pro. If
someone's pro and a person disagrees
with them, then that person is likely to
be anti-.' And just like before, while I'm
writing this as a logical rule, it's
actually going to be uncertain, so it
won't always hold, but these are little
clues about what the stances might be,
so I can go through and
make the inferences... now here one of the
challenges that may have crossed your
mind as I described this is well, 'what, if
I didn't want you to figure out my
stance?' So, privacy: with this kind of
inference, even if I try to hide my
attribute, maybe my demographic attribute,
it's oftentimes very easy to infer it
through this kind of reasoning... and I've
been in a number of conversations around
big data and privacy where people kind of
say, 'oh, the way that we'll make sure
you can't disclose this is we'll say
you're not allowed to store it.' Well, if
you're able to infer it with high
probability, then that's an important
kind of leak... and my former Ph.D
student
Elena Zheleva, who's now
an assistant professor at the University
of Illinois at Chicago,
actually way back in 2009 did some
interesting work on basically Facebook,
looking at the ways that you could
figure out demographic attributes
through group memberships and you know
we're kind of used to, 'Oh, your friends
tell you something about yourself,' but it
turns out that group membership tells
you even more, and interestingly enough
for group memberships, the group owner
had control over that information. You
didn't even have control over, you know,
who saw this, so this whole topic of
privacy and, particularly, privacy and
socio-technical systems is really
interesting and important to be aware of...
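Coming back to the stance example for a moment, the agree/disagree propagation can be sketched with hard rules in Python; in the actual models these are soft, weighted rules, and the names and links here are made up:

```python
# Toy collective stance inference: propagate stances over signed links.
links = {("ann", "bob"): +1,   # ann agrees with bob (made-up data)
         ("bob", "cal"): -1}   # bob disagrees with cal

stances = {"ann": "pro"}  # local (textual) evidence fixed one stance

# Rules: agree => same stance; disagree => opposite stance.
changed = True
while changed:
    changed = False
    for (u, v), sign in links.items():
        for a, b in ((u, v), (v, u)):  # links are symmetric clues
            if a in stances and b not in stances:
                flip = "anti" if stances[a] == "pro" else "pro"
                stances[b] = stances[a] if sign == +1 else flip
                changed = True

print(stances)  # {'ann': 'pro', 'bob': 'pro', 'cal': 'anti'}
```

This is also exactly why hiding an attribute doesn't protect it: one labeled neighbor plus the link structure is enough to recover it.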
the last pattern that I want to talk
about is recommendation, and so
recommendation is something that I'm
sure all you guys are familiar with... a
very common thing where you're trying to,
you know, recommend some item, or give
a user a ranked list of documents or
something... and again you can use some
information. So here's an example: I'm
trying to figure out if a user will like
an item, and which items the
user will like, and I can use local
information: so, if a user is interested
in a topic, and the item, in this case
maybe a news article, is about that topic,
then I can say that the user will like
that item. But then I can also use this
kind of graph information, so you can
say, 'okay, if a user likes an item, and
another item is similar to that, then the
user will like that second item.' I can
also reason about the similarity between
users,
so I can say, 'if a user likes an item, and
here's another user that's similar to that
user, then the second user will like that item too,'
so I can kind of infer all these little
triangles... so what's the challenge here?
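Those triangle rules can be sketched as one-step hard-rule inference in Python; the likes and similarities below are invented, and in practice the rules are soft and weighted rather than absolute:

```python
# Toy recommendation by propagation: item-item and user-user rules.
likes = {("ana", "article1")}                 # observed: ana likes article1
similar_items = {("article1", "article2")}    # article2 resembles article1
similar_users = {("ana", "ben")}              # ben resembles ana

predicted = set(likes)
for (user, item) in likes:
    # Rule 1: user likes an item, another item is similar => likes it too.
    for (i, j) in similar_items:
        if item == i:
            predicted.add((user, j))
    # Rule 2: user likes an item, another user is similar => they like it too.
    for (u, v) in similar_users:
        if user == u:
            predicted.add((v, item))

print(sorted(predicted))
# [('ana', 'article1'), ('ana', 'article2'), ('ben', 'article1')]
```

Everything hinges on how `similar_items` and `similar_users` get defined, which is the question taken up next.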
The challenge here is how do I define
the similarity measure; you know, there's
not going to be one unique similarity
measure to use, there are all kinds of
different ways that you can be similar,
and part of our work is very much, you
know, 'we're going to allow you to specify
a bunch of different ways of doing this'
and combine them together in a
tractable way... so the commonality with
all of these is there's some kind of
relational structure, there are these
complex heterogeneous interdependencies,
and there's noise and
uncertainty... so my research is all about
how you take a more nuanced approach that 
takes into account these relationships
but at the same time the context and the
probabilistic dependencies... and so I've
done work in this space for a long time,
I'm going to be telling you about some
of our most recent work, which is called
probabilistic soft logic, or PSL, and
it's a programming language that allows
you in an easy and scalable way to
represent these kind of collective
inference problems... and the cool thing
about it is that it combines those
rule-based algorithms that we saw
before with data-driven approaches; it
combines logic and probability; it
combines both hard and soft constraints,
hard constraints that have to be
satisfied and soft constraints which
you'd like to have satisfied; and
fundamentally it ends up combining knowledge and data in a
really interesting way... and so at this
point I have to acknowledge my awesome
group of students,
this is LINQS for Lise's Inquisitive
Students, I have some of them around in
the audience, I think. So they're awesome,
they're the ones that make being a
professor awesome, that's the best part
so thank you guys!
So PSL is this probabilistic programming
language for collective reasoning, and
the way it's encoded is we have some
weighted rules, then we have some data
together with the rules, and the cool
thing is it then defines a probability
distribution over the collective
outcomes. This is an
example of a PSL program for that stance
example from before; it's pretty
simple to write, it's pretty
interpretable, and basically you have a
program and some data,
and it's instantiated into a
particular kind of distribution, and the
distribution is a particular form of
Markov Random Field, a Hinge-Loss
Markov Random Field. You know, if I were in
a more technical setting I would go off
on a deep dive here for like an hour
telling you all about this, but since I'm
not, I'll have to give you the high level...
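For those who want one notch more detail, here is a toy Python illustration of the flavor of the idea, not PSL's actual code: truth values live in [0, 1], a weighted rule like 'pro(A) & agrees(A, B) -> pro(B)' incurs a hinge-shaped penalty to the degree it is violated, and inference minimizes the weighted sum of those penalties.

```python
# Toy hinge-loss scoring for one weighted soft rule (illustrative only).
def hinge_penalty(body, head):
    # Under the Lukasiewicz relaxation, the implication body -> head is
    # violated to degree max(0, body - head): its "distance to satisfaction".
    return max(0.0, body - head)

weight = 2.0                      # rule weight (assumed value)
pro_a, agrees_ab = 0.9, 0.8       # soft truth values (assumed values)
body = max(0.0, pro_a + agrees_ab - 1.0)  # Lukasiewicz conjunction

# Candidate values for pro(B): the one that satisfies the rule scores lower.
for pro_b in (0.1, 0.7):
    print(round(weight * hinge_penalty(body, pro_b), 2))
```

Because each penalty is a hinge (piecewise-linear and convex) in the unknown truth values, summing them over all instantiated rules yields a convex objective, which is what makes inference scale.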
the high level that's really cool is
it's taking logical inference, which is
known to be intractable, and mapping it
to convex optimization and
the neat thing about this, and this is
very much the work of my former PhD
student Stephen Bach, who is now an
assistant professor at Brown, and my
former postdoc, who is now an assistant
professor at Virginia Tech, is that they
looked at those randomized algorithms and
some results from there, some results
from machine learning and graphical
models, certain kinds of relaxations for
them, and soft logic, something from the
AI world where, rather than having
boolean values that are true/false, you
have values that can be interpreted
either as degree of truth or
similarity... we were able to show that
there was one formalism with an
equivalent optimization under these
three different interpretations, and that
it ended up giving this scalable way of
doing inference in these large models,
and so this is kind of really cool:
whenever you get something where, under
three quite different interpretations,
you end up with the same optimization, I
think that shows that there's
something fundamental there... there's
still a lot more work to do, you know; if
this excites any of you, come talk to me
because, yes, there's lots more
things we can do from here. So, PSL
in a nutshell: it's able to do
inference really fast, and you can make it
even faster by using state-of-the-art
optimization techniques and distributed
processing, and it turns out
there's a lot of fine-grained local
parallelism... I'm not talking about it
here, but you can learn the rules, you can
learn the weights for the rules, and you
can also deal with latent variables, which
I'm gonna give an example of in just a
minute... it combines data and knowledge in
this interesting way that ends up giving
you these models, which are much more
interpretable... and the cool thing is
it's open source; code, data, and
tutorials are all available online, and
if any of you are interested in using it,
we like to help people, so come talk to
us. Let me go over now two examples that
are a little bit more complex than the
ones we went over so far the first was
an example where we used PSL models to
detect cyber bullying, and so we were
able to not only kind of look at
different messages and infer you know
when the messages were bullying messages
but we were also kind of trying to tease
out the social structure so what are the
different roles, and then also discover
what are the different kinds of attack
types, and the PSL program for this, oh
sorry, this is work by Sabina Tomkins,
so she was a former Ph.D student who's
now doing a postdoc at Harvard who'd
just graduated from here last year... this
is the program I won't make you read
that but if you want the code it's here...
and she demonstrated it on some... this was
cyberbullying on Twitter, which you can
imagine has a lot of complexities:
you see a tweet and, you know,
is it really a bullying tweet or not?
One of the things we were able to show
was that by modeling the
uncertainty in the label you
could actually do a better job.
She also was able to
uncover some evidence of certain
kinds of power dynamics, and found
things like what the most
common attack types are, and so on... so this is
one example. Another example in the
social domain was some work by my former
postdoc that I mentioned
before... on inferring trust, social trust.
So here I have these individuals, and the
green links mean that they trust one
another, the red links mean that they
distrust each other...
and there are actually, you know, two very
well-known theories of trust
in social psychology. One is called
structural balance; this is the idea that
a friend of a friend is a friend,
an enemy of an enemy is my friend, and so
on... so it gives you all these
combinations of trust and distrust and
says which ones are stable, and
the theory says you're going to move
towards these stable configurations. The competing theory is social status
theory, and this is much more about
hierarchical relationships, so the idea
that I trust people that have more
expertise than me or that are higher in
the hierarchy, and I distrust the people
below me. The interesting thing is
that this one will give you
different combinations of trust and
distrust links. So what we could do
is build a PSL model for each
of these theories, a balance model and a status model;
we compared them to a baseline
and to what were at the time two state-of-the-art methods, and this is for predicting
distrust links, which is actually really
hard to do.
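As a rough illustration (this is my own sketch, not the actual PSL models), the structural-balance intuition can be written as a simple sign check over triads of people:

```python
def balanced(sign_ab, sign_bc, sign_ac):
    """Structural balance: a triad is stable when the product of its
    edge signs is positive, e.g. (+, +, +) 'friends all around' or
    (+, -, -) 'my friend's enemy is my enemy'.
    Signs: +1 = trust, -1 = distrust."""
    return sign_ab * sign_bc * sign_ac > 0

balanced(1, 1, 1)    # True: all friends, stable
balanced(1, -1, -1)  # True: my friend's enemy is my enemy
balanced(1, 1, -1)   # False: two friends who disagree about a third
```

Status theory scores the same links differently, since trust is directed up a hierarchy rather than symmetric; a PSL model can encode each theory's preferred configurations as weighted rules.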
Both our models ended up doing better
than the state-of-the-art methods. But the thing
that we did from here, I think, is really
cool: we then added in a psychological
variable, a latent variable that talks
about how trusting or how trustworthy an
individual is, and adding this into the
model ended up boosting performance
significantly... I think this happens a lot
in these kinds of models, with the
judicious insertion of latent
variables, and when you're able to
have some interpretation for them it's
very interesting... so the cool thing is to
then go back and look at the
organization and try to say, 'oh,
what is it about the places where the
balance model was working? What is it
about the parts where the status model was
working, and can I say more?' We've done a
lot of other projects, just to mention
some work by Arti Ramesh, a former PhD
student who is now an assistant
professor at SUNY
Binghamton, has looked at engagement and different
kinds of social and behavioral effects in
MOOCs;
Pigi Kouki, who just recently graduated, has
done not just recommendation but
also explanations, like why did you give
me that recommendation? And
Jay Pujara, who is now an assistant professor at USC,
did a lot of work on knowledge graphs,
and this goes a little bit
beyond the information integration that
I talked about so far, where you're
trying to extract these
knowledge structures from text or other
digital information... so that was about
research and I am
very very happy to talk more about it...
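To make the soft-logic idea from earlier concrete, here is a minimal sketch (my own illustration under stated assumptions, not PSL's implementation) of Łukasiewicz-style truth values and the hinge "distance to satisfaction" that turns rule violations into a convex penalty:

```python
def soft_and(a, b):
    # Łukasiewicz conjunction: truth values live in [0, 1]
    # rather than being boolean true/false.
    return max(0.0, a + b - 1.0)

def distance_to_satisfaction(body, head):
    # A rule "body -> head" is satisfied when the head is at least
    # as true as the body; otherwise the gap is a hinge penalty.
    return max(0.0, body - head)

# Hypothetical groundings of a rule like
# Friends(A, B) & Trusts(B, C) -> Trusts(A, C):
friends_ab, trusts_bc, trusts_ac = 0.9, 0.8, 0.4
body = soft_and(friends_ab, trusts_bc)               # ~0.7
penalty = distance_to_satisfaction(body, trusts_ac)  # ~0.3
```

Inference then minimizes the weighted sum of such penalties over all ground rules, which is a convex optimization; that is what makes the scalability possible.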
I want to go into talking about some
cautions, and for the cautions: what can
go wrong? Go back to those little
algorithms that I gave at the beginning
and think about some of the
things that can go wrong; it turns out a
lot can go wrong. I'm gonna go through
some examples that I've gotten from my
colleagues; I'm actually going to be
teaching a course on this next year, so if
you have other examples, send them my way...
these have been getting a lot of press,
and I'm just going over a couple of
these... so the first example comes from
Amazon; this was covered by
Bloomberg in 2016. They built a tool to decide where they
should offer same-day delivery, a
Prime service, and they trained it on
users' buying history, income, and location.
Focus right now on just the left
side of this slide: the part in gray
shows the part of Atlanta that did get the
service, so the northern part got it;
this is Chicago, everybody except
the South Side got it; and this is Boston,
everybody except Roxbury got it. What
happens to be true about these
assignments is that predominantly
white areas got the service and
predominantly African-American areas did
not get the service. So essentially
they built a tool, and the
tool did not explicitly use race, but,
not surprisingly given those attributes, it
basically came up with a digital
redlining algorithm... and so they got a
lot of flak... and everybody in these
areas now is receiving the service...
that's one example... another Amazon
example is that they built a tool to
recommend resumes, and guess what
they trained it on their own resumes, and
their own resumes were predominantly
male, and guess what that algorithm ended
up showing a gender bias... so Amazon's not
alone; here are some Google examples... as a
matter of fact, Science and Justice is
going to have a speaker in March who's
going to be talking about bias in
search results... so I'm going to cover a
couple different Google examples... one of
them is in their Smart Compose, where
they are filling in pronouns: it turns
out that if you type in 'CEO'
and it fills in the pronouns, they're
all going to be male pronouns... this is
Google Translate... you take a language
like English that is gendered and you
say 'she is a doctor, he is a nurse,' you
translate it into Turkish, which is not gendered,
you translate it back into English, and you
get 'he is a doctor and she is a nurse.'
But probably the example that has gotten
the most attention is around recidivism
prediction... so how many of you are
familiar with this story, this ProPublica
story? Okay, so not a lot. So this is
really interesting... I'm not going
to be able to do justice to all of
the nuances... but basically there is an
algorithm that is being used
widely in the criminal justice system
for pre-trial bail and sentencing where
the algorithm is supposed to predict the
defendant's likelihood of committing a
crime; well it turns out they were able
to show, in a particular county, that what
was happening was there was strong
racial disparity in the effects. So first
off, false positives, where you say someone
is high risk but they're not high risk:
this happened twice as often to
African-Americans as to whites. Then
false negatives, where you say they're
low risk but they actually do go on
to commit a crime: that kind of error
happened almost twice as often for white
defendants. So the issue here is over-
predicting recidivism for
African-Americans and under-predicting
recidivism for whites... so what went wrong?
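Before getting to the causes, the two error types just described can be made concrete; here is a small sketch with made-up numbers (not the ProPublica data) showing how false positive and false negative rates are computed per group:

```python
def group_error_rates(y_true, y_pred, groups, g):
    """False positive rate and false negative rate for one group.
    y_true: 1 = did reoffend, 0 = did not.
    y_pred: 1 = flagged high risk, 0 = flagged low risk."""
    rows = [(t, p) for t, p, gr in zip(y_true, y_pred, groups) if gr == g]
    negatives = [p for t, p in rows if t == 0]   # people who did not reoffend
    positives = [p for t, p in rows if t == 1]   # people who did reoffend
    fpr = sum(negatives) / len(negatives) if negatives else 0.0
    fnr = positives.count(0) / len(positives) if positives else 0.0
    return fpr, fnr

# Made-up example: the same predictor can over-predict risk for one
# group and under-predict it for another.
y_true = [0, 0, 1, 1, 0, 0, 1, 1]
y_pred = [1, 1, 1, 1, 0, 0, 0, 1]
groups = ['a', 'a', 'a', 'a', 'b', 'b', 'b', 'b']
group_error_rates(y_true, y_pred, groups, 'a')  # (1.0, 0.0): over-predicts risk
group_error_rates(y_true, y_pred, groups, 'b')  # (0.0, 0.5): under-predicts risk
```

This is the shape of the disparity ProPublica reported: even with similar overall accuracy, the error rates can land very differently on different groups.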
Okay, so a bunch of things went
wrong. One of the things that went
wrong in all of these examples is biased
data: if the input to your system
is biased, whether the bias is coming
from selection bias, institutional bias, or
societal bias, then the output is going
to be biased. This is the famous
computer science phrase 'garbage in,
garbage out': if you give it bad data
you're gonna get out bad results... now, it
turns out that there's a lot of other
kinds of biases in addition to these
that are important to take into account...
and one of them is automation bias, and
this is a well-known effect where
people have this habit of
'the algorithm told me to do it,'
and they trust the algorithm too
much. This is a serious issue:
sometimes it's that you trust the
algorithm, sometimes it's, 'oh, it's a hard
decision, so the algorithm
made me do it.' In the case of
recidivism prediction it's actually even
worse, because in a number of states, if
the judge disagrees with the algorithm
then they have to write a report
explaining why, and so that adds an
additional overhead. All of this
leads to what's referred to as
algorithmic discrimination: the idea
that algorithms can amplify bias, which
we just saw, but they can also
operationalize it, so that it's
applied much more widely than it would be
by just a biased individual, and they also
legitimize it, so that it's given too
much credence, and so not surprisingly
people have been... this has become a very
active area of research, so as an example
here's an example: in 2011 there were
hardly any papers about machine learning
and fairness; this is from a tutorial by
Solon Barocas and Moritz Hardt; by
2017 there were a ton of them, there are
several workshops, and there's now
actually an ACM conference on Fairness,
Accountability, and Transparency, which
specifically tries to bring
together technologists, policy folks,
and practitioners to talk about these
issues, so there's been you know a huge
surge in academic research,
not surprisingly there's been some
criticism of that research, so there's
been a healthy amount of critiquing
of these technological
fixes: it's like, 'oh, I'm gonna have my
machine learning algorithm and then I'm
gonna put an ethics box at the end to
fix everything.' The cool thing is, I think,
there's starting to be some real
collaborations where through, you know,
some discovery of what was happening on
the technical side, together with some
interpretation of that on the social
side, kind of new interesting things are
coming out, and I really like this quote
by Bill Howe at UW:
'responsibility means going beyond
technical or technocratic solutions to
also involve substantive debate about
ethics, values, and competing interests.' How
is ethical expertise defined? Who needs to
be at the table? What are the limits of
certain kinds of solutions? So this is
what's happening in academia; it turns
out there are things happening in
the real world too... this is the
first law in New York on automated
decision systems, and there are a number
of other efforts... so...  I'm happy about
this... there's a lot of energy and
research that's happening here, I want to
talk about what else can go wrong... and
what are the things that can go wrong
that I don't think are receiving the
attention that they deserve?
So first is just poor quality... just like
there's bad data, there are bad algorithms.
Now, if you look at that first line of
the ProPublica piece, this thing
only had 61 percent accuracy;
that's barely better than flipping a
coin...
we should not be using these things in
these settings... the second one is magical
thinking: I think there's a lot of writing
right now about machine learning and AI
and all the things they can do... it is
overinflated, and so you really need to
question the capabilities and I hope I'm
giving you some of the tools that will
help you be a little bit more skeptical
of some of the claims... and to unpack why
you should be skeptical there's... I'm
gonna go over one technical problem
which is the frame problem... the basic
idea, which you should see from the
examples that I gave, is that these are
really crude models: they make simplifying
assumptions, so they can only take into
account some limited amount of
information, and from that limited amount
of information they can make big
mistakes... so AI and machine learning
methods can work well, but usually in
very kind of constrained settings and
this is what's referred to as weak AI or
narrow AI... the fourth thing that can go
wrong is values... so all of these
algorithms at their heart are optimizing
some metric... there's a question of who
supplied the metric, you know, who gets to
decide it, whose values are encoded in it,
and so on...
another issue is just bad code... so just
like we had bad data and bad models you
can have bad code... the whole area
of software engineering, which is one of
our Dean's research areas, is all
about how you ensure that
code doesn't have mistakes in it... now,
when you move to data science
algorithms and data-driven algorithms,
how do you ensure things like statistical
properties like... and good science
practice around reproducibility,
generalizability, transparency, and
interpretability? These are all
important, and the last one that I want
to mention is taking into account that
algorithms can actually shape people, so
again going back to our socio-technical
system, we have people interacting with
technology and algorithms... sometimes
people adjust and adapt their
behaviors in response to algorithms... you
can probably think of something you've
done yourself. Sometimes it's in benign
ways... sometimes it's in adversarial ways,
and sometimes it's just in unintended
ways... and we really need to think more
carefully about the potential impacts on
people's behaviors, in particular their
agency and their autonomy, as we design
ever more complex adaptive systems...
so in closing, I have a couple more
things that I'd like to go over... first
off, I'm going to give some advice; my
students and colleagues will attest to
the fact that I love to give advice. So
first, advice to computer scientists,
some things to keep in mind: first off,
data is not objective;
it has biases, historical context, and more.
This is something that social scientists
are totally well-versed in,
and something that's not as
commonly taught or emphasized
in computer science...
Technologies are not neutral; they have
values baked in, and it's important to think
about them. There's a moral
imperative to understand the domains in
which your work is being applied and to
consider the domains to which your work
could be applied... there's a ton of new
research that needs to be done to deal
with these incomplete, uncertain, biased
socio-technical systems. And finally, we
really need to educate people... we
need to educate people about some of the
limitations of algorithms,
machine learning, AI, and more. Okay, so
advice for collaborating with computer
scientists, some things to keep in mind:
one of the things is we love
abstractions and simplifications, and I
think if you're, say, a humanist, this is
a totally different way of thinking;
humanists are much more about the specific
context, and so on. So oftentimes
keeping in mind that you have
different mindsets is actually
useful. We also really like logic, so
we like zeros and ones, true and false,
black and white; that's another useful
thing to keep in mind. We don't always
have the greatest people skills; yes,
there are more introverts and people with
Asperger's, among other things, among
computer scientists and engineers, and it's
useful to keep this in mind; I remember
the first student that I had that had
Asperger's, once I kind of figured out
that that was the issue, it's like it
made everything much, much easier to deal
with. We're not typically trained in
ethics yet, though there is a huge surge of
interest nationally
in ethics training, and ethics
training that's not just a band-aid, you
know, 'here's your ethics class,' but that
really goes throughout the curriculum,
and like many, you know, we want to make
the world a better place, so you know
we want to collaborate with folks
in doing this so what is responsible
data science? I've emphasized the
literacy aspects of it: computational
literacy, statistical literacy, ethics
literacy, justice literacy, domain literacy. And
it's also data science for social good;
this is an active area where a
number of universities and other
organizations are developing
programs around data science for social
good. Rayid Ghani at the University of
Chicago was one of the first people to
do this, and I've had him come speak on
campus a number of times. One of the
things I find interesting is that I
think the most interesting ethics
discussions are coming out of the folks
that are working in this space; some of
the deepest discussions that I've heard
about ethics are coming from folks here...
So responsibility means working
on these hard societal problems: we
can use data science to discover
bias, to discover injustice; we
can look at hard problems like
homelessness, and we can look at things
like education justice. Rebecca London and Rod
Ogawa are doing some cool work where
they're combining education
outcomes with juvenile justice
and social programs to understand
what's happening, especially to
marginalized populations... some work by
my former Ph.D student combining two hard things: environment and human trafficking,
and you know looking at what the impacts
are. So I think we have a choice, we have
a choice about what we build: we can build
these socio-technical systems as the
dystopian kind, where we
automate bias and they surveil us and
make us angrier and unhappier, or we can
build the utopian kind, where
we collaborate and take
collective action and help make the
world a better place... and I think it's
our choice; we really need to be
thinking about the bigger picture:
which are we going to choose? So,
returning to my goals: hopefully to teach,
excite, caution, and maybe give you some tools.
In terms of takeaways: this literacy
aspect, computational literacy,
statistical literacy, ethics and justice literacy; and
then a process-oriented view towards
designing and critiquing the systems
that we're building. And to do this
we really need to collaborate, we need
to collaborate across
engineering and the humanities, social
science, art, education, and, you know, what
better place to do this than UC Santa
Cruz? We have strength in all these areas,
so let's work together... Thank you,
thank you, Lise... you know, I'm reminded
of something my mother always said to me:
just because you can doesn't mean
you should, which seems to be
kind of a theme here. And that
thoughtfulness that you're bringing to
it is, I think, just incredibly important.
What's also important is that it's
not just about the technology but
about the social context that surrounds
the technology. So as you said, UC Santa
Cruz is a wonderful place to explore
those questions, and speaking of
questions, we have time for some... I
think there are some runners... So this is
not my area, and a naive question; I'm
right down here in front of you. Okay,
so these algorithms that look at
Facebook posts or Twitter messages: is
the core data just individual words? Are
they really just doing word counts and
associations, or are they doing something more
sophisticated, like the meaning of a word
in a sentence? Okay, so that was the question;
it was on the mic, so I
don't have to repeat it.
Right, okay... and I wanted to put up some
resources... so first off, you
know, I have to give my disclaimer
for Facebook or Twitter... To give an answer:
hopefully you could see, even from what I
was saying, that just looking at the text
you can do a surprising amount, so text
would just be the word counts... but
looking at the relational structure, so
who liked things, who's friends with whom,
and other kinds of attributes, that's
something that leaks a lot of
information... in terms of inferring
additional meaning yes there are
techniques for trying to infer meaning
from text; sometimes that's well grounded
and sometimes it's not so well
grounded, though that's kind of a separate topic...
I think part of what might be behind
your question is how much do they know about us...
one of the things
that they do is aggregate a lot of
external information, and that external
information gives a
ton of semantics, and so on; so it's
really through the combination
of these things... although I will admit
that when I've gone to talk for example
to Facebook, they say that they're not
doing a lot of graph inferences, a lot of
what they're doing is kind of pretty
basic... any questions?
So in one of your last slides you showed
the picture of the utopia and the dystopia,
and you said that we can choose which
one we have; but do we really have that
ability, or is it out of our hands? You
know, this data is external, like you said...
So first off, I think it is important
that we think beyond just the
technology to understand important
concepts... economic concepts, political
concepts, social concepts, psychological
concepts, to understand where we can
effect change and where we can effect it
most effectively. I do think that right
now we're on the cusp of an opportunity
for significant changes, and so yes, now
is a time where, especially around
privacy, there's a changing feel to the
debate... and participating in it is really,
really important. And then also
think of how addicted you are to your
little device, and try and change
some of your own behaviors, which you do
have control over...
Okay, Lise, first, thank you very much for
your talk. I don't know how many
in the room will know Hannah Arendt or
her work, but she has this
very wonderful... a lot of us in the
humanities and social sciences, I think,
are syncing with her these days... she
wrote a book called The
Human Condition, and in it she opens
with her worries about the rise of the
mathematical sciences and how they might
lead to a world in which we could no
longer speak and talk about that which
we do. What I love about what you've
done here is that you have done the work
to speak about that which you do, which
can be very technical, but you've made it
into a language that we can share and
talk about, and so I just want to say
that I really appreciate it, and now I
have five different versions of
algorithms to work with... I used to only
have one... and I used to code back
when it was BASIC in the 1980s, I don't
know how many people have had that experience. But I want to go to what you're raising:
Arendt alerted us to
these issues a long time ago, but they're
really coming home to roost right now,
and you've presented all these wonderful
examples of that, and the question that
we're getting to at the end is how do we
change. This last question
was, is it just that we choose? Yes,
there's an agency part of
this, we can choose, but who's the 'we'? And
how are we going to choose? That becomes a
really hard bit. So I want to go back to
a point in your talk where you said
there is a moral imperative for
engineers to think about how what
they're doing is gonna play out... I'm
not quite sure exactly how
you phrased it, but
you said there's a moral imperative... so
first of all I'm interested in how you
ground the moral imperative, but secondly
I want to know how you turn the moral
imperative into a practical endeavor, and
that if we think about and then if we
think about and I think universities are
really important sites for this, because
what are we doing other than training
the
next generation of scientists and
engineers... so how do pedagogically do we
change and what do you see as some of
the pedagogical challenges in
engineering schools what do you see
being tried what do you think is
promising what do you think can happen
here? So easy questions... This is what I
love... these kinds of big questions...
awesome... and this is why I love
being able to talk with folks like
you around UC Santa Cruz on these topics...
First off, I think it's very
interesting that a lot of the work on socio-
technical systems from, I don't know, two
decades ago is all super, super
relevant now. And this is something
that I've been a participant in, being a
technologist talking to humanists about
some of the concepts, and we do
struggle with the language. Like, I know
when I first got
here and was talking to Rod about justice, I
was like, justice, yes, justice sounds good,
of course I like justice,
and then in the past
few years, reading up more, I
realized there's a whole
freaking iceberg of meaning behind it.
And right now, around ethics and
technology, a number of folks have said
they feel like there's a Renaissance
emerging; that Renaissance is
nascent right now, and a lot of it is that
we don't always have the language
yet for talking about hard issues. I've
been fortunate enough to have had some very
patient economists explain
all these different structures to me, and
philosophers explaining things, and so,
learning about those, I do see
that the conversations that even as
recently as three years ago seemed like
band-aids have gone much deeper, and so I
see progress. But it requires this kind
of respect, respect that goes
across disciplines: yes, I'm willing to
explain the complexities of these
structures; what are the places
where you can have practical impact,
if I had to rank them, where would I
look, and so on; and what are the
concepts that are needed if you
are trying to train folks, going
beyond just saying, oh, you know,
there's Kant and there's utilitarianism; they
need something deeper than
that.
And that deeper thing comes from
understanding social science:
understanding, I think, economics, power
and how that fits in, understanding
political science, understanding
sociology, understanding psychology, and
more, are important in that. And yep,
let's work together on this... Lise, I would
like to ask you a question. I'm
quite impressed,
because I think it's the first data
science talk I've ever heard that did
not include the phrase 'big data'; anybody
notice that? But still... it used to be
'very large data,' but 'big data'... you know, to
me there is an important sense
in which it's something to be
considered... because we think that
bringing information together is going
to provide us with data that will feed
these models with better information, or
at least more information, allowing us to
make larger decisions that consider more
factors. But big data also has the
implication of a concentration, a
concentration of power and of knowledge,
and this kind of gets at an earlier
question, because that concentration then
becomes something that's harder to work
against. So do you have any thoughts
about how to deal with that aspect of
the data science question? This
collecting and collocating
and encapsulating of large amounts of
data that actually drives this
whole socio-technical enterprise?
Now, that is a very interesting question,
and there are multiple levels to answer it
at... One of the important
concerns is, yes, there is a concentration
of power due to a very small
number of players having a lot of data;
that's beyond a data science issue,
and it's really important, probably
the most important thing.
Then the other piece, though, and
hopefully this came across in my talk, is that I
don't like things to be purely
data-driven: so there's the data-driven part, but
then I have some knowledge, I
have some theory about the way the world
works, and I want to have a way of
representing that as well and of
marrying the two. And so I didn't emphasize
big data, in part, though I do like to
brag: I did say we're
scalable, and by the way, we do do the
largest graphical models that anybody
does; even Google is impressed
with how big our graphical models are...
yes, yes. But I think this idea that just
'oh, big data is gonna solve the problem'...
certain things are not big data problems,
and since I'm interested in
data science for social good, a lot of
those problems are not big data issues,
for a variety of reasons... could I ask one
question? Yeah, I think we have time for
one more question. Okay, lovely...
oh yeah, I'm over here, hi...
thank you so much for talking to us.
I just wanted to ask about
the natural trade-off between keeping
track of the ethical concerns and
implications of other people using your
work, and progress... I think it's really
hard to try to
model how this could
affect other people in a very negative
way while also trying to make progress,
right? There's only so many hours in the
day, and there's that trade-off, and I was
wondering, is there like a sliding
scale of maybe prioritizing certain
ethical implications that we cannot, you
know, put aside for the time being to
make progress... this
is where the language comes into play... I
don't have the language to describe
the ethical implications, but is there a
way to balance both? Because I feel like
they're at odds with one another... being
very conscientious of your work, and
wanting to do things quickly so that
they can have a very positive impact
potentially. Does that make sense? I think,
I mean, if I'm interpreting your
question the way that you intend it, then
my answer would be no; we need to
think through the ethics, we
can't just throw
something out. Actually... Cathryn Carson,
whose quote I gave
at the very beginning, she has another
great one... it's an extract from a Twitter
stream... talking about data
scientists, and basically making an
analogy to atomic physics and saying,
you know, we basically released all
these things, we don't even
know their impact, and we're just in
a giant experiment, we're gonna see what
happens. So I think it's important to not
delay... Yeah, do you think we can do both?
We can make progress and have a positive
impact and also be aware of all the
ethical implications of our work, as best
as we can... you think we can do both?
Well, and this is why you want to
think of it as a process: it's a
process where you design, you critique,
and you build, so it's not just one
after the other. Yes, absolutely, it
should be intertwined. Thank you.
Okay, well, at this point I think we need to
close. We invite you to have some
refreshments upstairs, immediately
through those doors, but I also want to
once again thank Lise for an incredibly
stimulating conversation, and to thank
her for her work here at UC Santa Cruz.
