Good morning, we're going to go ahead and get started. Once again, I'm Mary Hall from the University of Utah, and I am introducing the FCRC 2019 plenary speakers. These speakers were selected by the conferences for the interdisciplinary impact of their work, and this week's talks have shared a common theme of new applications, unprecedented scalability, and societal impact. Today we're going to have a talk that really focuses on the societal impact side: how data science can be used to address societal challenges. Our speaker is Jeannette Wing from Columbia University, and she's had a career that spans academic, industry, and governmental organizations. She is at Columbia University now, but also had a long career at Carnegie Mellon, served as vice president at Microsoft Research, and as assistant director for the NSF CISE Directorate. She received the CRA Distinguished Service Award in 2011 and the ACM Distinguished Service Award in 2014, and she's a fellow of the AAAS, ACM, and IEEE. With that, I'd like to introduce Jeannette.
Thank you, everyone. I'd like to thank the organizers for inviting me to give this presentation, and thank you for coming. I want to talk about data science, and specifically data science at Columbia University, and the tagline that I use, data for good, and what I mean by that.

Let me start with the bigger picture, what I like to call the data lifecycle, to put into context what data science is all about. It of course starts with generating data. We generate lots and lots of data, but lots and lots of data have also been generated, if we remember, by large scientific instruments: the Large Hadron Collider, large telescopes, neutrino detectors at the South Pole. For decades already, scientists have been generating lots and lots of data. I think one of the new things in recent history, and why we're so excited about big data and data science, is the amount of data that we people are generating, and that others are collecting about us. I have a whole other talk about the fairness of that, but I'm going to save that for later.

In any case, we generate lots and lots of data, and then this data is collected; not all data that is generated is collected. We do some processing, and by processing I'm throwing in everything from encryption to compression, but also data cleaning, data wrangling, all the kinds of things that might happen before you actually store the data, lay it down in some storage medium. We of course will store the data in some way, through data management techniques, database techniques, such that we can retrieve this data very efficiently later. And then of course there's analysis. I think it's analysis, and by that I'm throwing in deep learning, machine learning, AI, data mining, all the analysis techniques that we have cultivated in the past few years, that is making data science so exciting, not just to us in the computing community but outside: every single sector, every single profession is getting very excited about whether they can use deep learning on their data sets. So it's usually data analysis that gets the attention of people outside of computing with respect to what data science is, but I wanted to put this in context with respect to the whole data lifecycle, because each of these phases is very important.

After our analysis, of course, we have to visualize the data. It's not enough to spit out some probabilities or some numbers; we actually have to visualize the data so that we can interpret the data in some way, so that the end user can do something with that analysis. It was not until I joined Columbia University that I actually added the last phase in my lifecycle chart. I call it interpretation. It was when I was talking to my colleagues in the journalism school, and they said, you know, Jeannette, it's not enough to just visualize the data, it's not enough to just show pie charts and bar graphs and so on; you have to tell a story about what it is the end user is supposed to get out of that visualization, out of that analysis. The journalists like to call that storytelling; I just use a more neutral term, data interpretation. But in the end there is an end user, for now still a human being at the other end, who is taking the results of the interpretation and doing something with them: making a decision, making a prediction, and so on.

So throughout this entire data lifecycle I very much emphasize the importance of the privacy and ethical concerns of collecting the data, analyzing the data, telling stories about the data, and so on. This is very much what underlies my advocacy for data for good.
So what is data science? I have a very succinct definition that I'd like to share with you: data science is the study of extracting value from data. There are two important words in this definition. The most important word is the word value. Value is deliberately subject to the interpretation of the end user. To a scientist, value is discovering new knowledge. To a business person, the CEO of a company, value is actually calculable: it's bottom line, it's number of clicks, it's however a company might accrue value because of the data, for instance, that it collects about us. To a policy maker, the value of data comes from helping the policy maker make decisions; so, for instance, a mayor in a coastal city should be paying attention to data that says something about sea level rise. So all of that is a different kind of value for the end user. The other important word in this definition is extracting: the point is that it takes a lot of work to get this value from this data.

I want to share with you my three-part mission statement for the Data Science Institute at Columbia University. The first is to advance the state of the art in data science. This is really speaking about basic research, pushing the frontiers of data science, as it itself is a new, emerging field. We in academia are still trying to figure out what this data science field looks like; meanwhile, we continue to make advances in the underpinnings, the foundations, of data science. At Columbia we define the foundations of data science to stand on three pillars of strength: computer science, statistics, and operations research, but we also have a lot of colleagues in signal processing, from EE, participating in defining the foundations of data science.

The second part of the mission statement really speaks to the breadth of the university, and that is to transform all fields, professions, and sectors through the application of data science. I thought when I first joined Columbia two years ago that I would have to go around talking with all the deans and the department chairs and faculty; I thought I would have to cajole them into the importance of data science and the analytical techniques that we are using for analyzing data. It's been just the opposite. They've been knocking at my door saying, we really need data science expertise, we know these data and data science techniques are going to transform my own discipline. And I'm getting this down the line, whether they're in the medical school or the history department or architecture, and those are quite enjoyable conversations; it was quite a surprise to me. So the transformation is actually just happening under my nose.

The last part of my mission statement really speaks to the data for good tagline, and this is to ensure the responsible use of data to benefit society. "To benefit society" speaks to the societal grand challenges that we scientists and engineers, and also humanists, really aim to tackle in terms of helping humanity, helping society: healthcare, climate change, energy, social justice, and so on. It's the "ensure the responsible use" part of that mission statement that I'm really passionate about, because this really speaks to the privacy and ethical concerns about data, data science, and so on.

So, as I mentioned, my tagline is data for good, and this really stands for two things: one is using data for good, to benefit society, but also using data in a good manner, in a responsible manner.

At Columbia we have over 350 faculty affiliated with the Data Science Institute, coming from all 16 schools, colleges, and institutes throughout the university.
They literally come from all walks of life, all different disciplines, all different professions. We loosely structured the Data Science Institute around some thematic centers; these are just one way for me to have a reach into the entire university. Some of these themes are quite obvious, like smart cities, being in New York City, but we also have one in health analytics, one in business and financial analytics, and so on.

We have a very strong master's in data science program. I'll just run through this quickly, just to say that the demand continues to grow, our program is rigorous, and everyone gets a job after they graduate. We just started a PhD specialization in data science; this is mainly for students who are already enrolled in certain PhD programs in the engineering school and in statistics, and they can take additional courses and get a PhD specialization. We also have something going on with Columbia Entrepreneurship that we call the Collaboratory, where faculty from two or more different disciplines come together to develop and co-teach a course, to bring data science, computer science, and technology to these other disciplines. Probably one of the most successful is actually not one course but a curriculum offered by the business school; it's a sequence of eight courses, and now 50% of the MBAs coming out of Columbia Business School have had some exposure to data science.

We have a robust industry affiliates program; some of the sponsors of FCRC, or rather their logos, are up there as well, and we have international representation: you'll see some well-known Chinese companies, there's a Brazilian company, and so on. At any one point in time we have about 25 different companies affiliated with this program. One of the benefits the affiliates get is to help with the capstone course that's part of our master's program: they have access to a team of master's students who work on a real-world data set, answering real-world questions. And then just last year we created a partnership with IBM on a center for blockchain and data transparency, organized along research, education, and entrepreneurship.

OK, so now what I really wanted to do is share some stories with you, and these key off my mission statement. I'll share only one story having to do with advancing the state of the art; it has to do with causal inference. Then I'll share little snippets of other stories to show how expansive data science is at Columbia University, and then a few stories having to do with the societal challenges and also the privacy and ethical concerns, and so on.

So let me start with multiple causal inference. Causal inference, of course, has been a long-standing interest not just of computer science but of course of statistics, and of other fields: in the social sciences and economics, being able to conclude causality is really the bread and butter of some of these fields.
This work is done by Yixin Wang and David Blei, and it's addressing a slightly different problem: not single causal inference but multiple causal inference, recognizing that that's actually the more common situation. If you want to understand some effect of, say, a medication, in fact there are multiple causes that may lead to that particular effect. And it turns out that solving the multiple causal inference problem is slightly easier than solving the long-standing problem of single, univariate causal inference.

Let me set up an example to give you a concrete way to think about the problem they are addressing. Let's pretend I am a producer of movies, and I want to produce a new movie, and I want to decide which actors to place in my movie. So the actors I choose are the causes, and the revenue that I get from the movie is what I'd like to estimate. At my disposal is a database that has, for each movie, here are the actors and here's how much money that particular movie made. What I want to do mathematically, and this is the potential-outcomes rendition of what I need to compute, is really the estimated potential outcome: if I place this set of actors in my movie, how much money will I make? In the Pearl approach to causal inference one would write it something like that; this is just for those of you in the audience who are familiar with causal inference.

The problem, of course, with causal inference is that there are confounders, and these confounders can affect both the causes and the effect. For example, the genre of a movie is a kind of confounder: it's going to determine which actors I put in my movie, because there are certain actors who play well in action movies versus artsy movies, and the genre can also affect the effect, in that action movies tend to make more than artsy movies. Also, whether a movie is a sequel is going to be a confounder, because I'm likely to choose actors who were in the first version of the movie for my sequel, et cetera, et cetera. This multiple-cause situation is actually quite common: in genetics, what genes affect a trait; choosing actors for a movie is like choosing athletes for a sports team; choosing the prices you're going to put on items at a supermarket will help determine how much people spend; et cetera, et cetera.

In classical causal inference, the way in which we estimate a causal effect is, first, you think of all the confounders: sequel, genre, and sometimes you put one actor in a movie because they go well together with other actors; all of these are potential confounders. So in classical causal inference your first job is to think of all the confounders, and then, having measured all those confounders, you can estimate the causal effect; the w_i's are the confounders, you condition on those confounders for each of the movies, and then there's an iterative process that allows you to estimate causal effects. That's what the little equation on the right basically stands for. As I already said, confounders can affect both the causes and the outcomes, and we need to correct for all the confounders; that's what's going on on the right-hand side, which requires, in principle, that we have identified and measured all the confounders. But that is actually a very strong assumption; it's untestable, and it's called strong ignorability. The assumption is that there are no unobserved confounders. So we have to think, and think, and make sure we've thought of all the confounders, and then we can play this little algorithm out, and so that's the way it is in the classical setting.
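The equation itself is on the slide rather than in this transcript, but for readers who want the notation, a standard rendering of the adjustment she is describing, under the strong ignorability assumption, looks roughly like this (a sketch, not the exact slide):

```latex
% Sketch of the classical adjustment (not the exact slide). A = (A_1,...,A_m)
% are the assigned causes (the actors), Y is the outcome (revenue), and
% W = (W_1,...,W_k) are the measured confounders (genre, sequel, ...).
% Strong ignorability assumes the potential outcomes are independent of the
% assignment given W, so that
\[
\mathbb{E}\!\left[\,Y(a_1,\dots,a_m)\,\right]
  \;=\;
\mathbb{E}_{W}\!\left[\,\mathbb{E}\!\left[\,Y \mid A_1=a_1,\dots,A_m=a_m,\; W\,\right]\right].
\]
```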
So the new idea is called the deconfounder, and let me just operationalize it for you. The first thing you do is fit a latent-variable model over the assigned causes, the observables, using something like factor analysis. Then you can actually test whether this model is good or not against some objective function that you define. You then use the inferred latent variables as substitute confounders, and then run the same iterative process for estimating the causal effects. The main point is this: if I can find a factor model that captures the population distribution of all the assigned causes, then we will have discovered a variable that captures all multi-cause confounders, because the multi-cause confounders are exactly what induce dependence among the assigned causes. Now, you don't get anything for free here. There is still an assumption, and that assumption is that there is no unobserved single-cause confounder, but it's a weaker assumption than the one needed for classical causal inference. So it's a weaker assumption, the model that you build can be checked for certain properties, so there's an operational, computational element there, and the authors prove in the paper itself that the deconfounder gives you unbiased causal inference. So this is an example of pushing the state of the art in causal inference in this particular case, and in data science more generally.
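To make the recipe concrete, here is a minimal sketch of the three steps on toy movie data: fit a factor model over the causes, run a crude predictive check, and regress the outcome on the causes plus the substitute confounders. It assumes linear models throughout and is illustrative, not the authors' implementation.

```python
# Minimal deconfounder-style sketch on toy data (not Wang & Blei's code).
import numpy as np
from sklearn.decomposition import FactorAnalysis
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)

# Toy data: rows are movies, columns are actors (1 = actor appears in movie).
n_movies, n_actors = 500, 40
A = rng.binomial(1, 0.15, size=(n_movies, n_actors)).astype(float)
revenue = A @ rng.normal(1.0, 0.5, n_actors) + rng.normal(0, 1, n_movies)

# Step 1: fit a factor model over the assigned causes (the actors).
fa = FactorAnalysis(n_components=5, random_state=0)
Z_hat = fa.fit_transform(A)            # substitute confounders, one per movie

# Step 2: a crude check -- does the factor model reconstruct the cause matrix
# better than a column-mean baseline? (The paper uses a predictive check.)
A_recon = fa.mean_ + Z_hat @ fa.components_
passes = np.mean((A - A_recon) ** 2) < np.mean((A - A.mean(axis=0)) ** 2)
print("passes crude check:", passes)

# Step 3: estimate causal effects by regressing the outcome on the causes
# *and* the substitute confounders.
X = np.hstack([A, Z_hat])
outcome_model = LinearRegression().fit(X, revenue)
actor_effects = outcome_model.coef_[:n_actors]   # deconfounded actor "values"
print(actor_effects[:5])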
Let me come back to movies, just to put this in concrete terms. This is a snippet of a 007 James Bond movie; that's Sean Connery in the middle, he plays 007, James Bond, and then there are two other actors who play the characters Q and M. With the deconfounder, Sean Connery's value goes up, and the values of the actors who play Q and M go down. Another way to say that is, without the deconfounder, the value of Sean Connery is underestimated and the values of the actors who play Q and M are overestimated. Intuitively that should make sense to you, because for the longest time, when Sean Connery was 007, you went to see the next James Bond movie because he was 007. And with this model, of course, you can ask and answer a lot of what-if questions. So just to summarize this story: the deconfounder corrects for unobserved confounders like genre, sequel, and so on.

OK, so now, switching gears, what I want to do is run through a series of examples of how data science is used in other disciplines. Let me start with something pretty obvious, again recognizing that the science community has been working in big data for decades. The example I want to start with is biology. We all know that at the turn of the century, with the Human Genome Project, big data was going to really revolutionize biology, and now, with DNA sequencers pumping out data like mad, there's a lot we can do with all this data. In this particular example, the researchers were studying the microbiome of pancreatic tumor cells, and just with DNA sequencing of the microbiome and the cancer cells, they discovered that the microbiome was counteracting the effect of the chemotherapy that was meant to treat the cancer. So that's not good. But they went one step further, and they saw that if you inject an antibiotic into the tumor cells, then the antibiotic counteracts the effect of the microbiome, thereby making the chemotherapy effective. And that was really all through analyzing big data. So that's my biology story; there are many, many such stories out there.

My next story is about astronomy, and I believe Daniel Hsu, a co-author on this paper, is in the audience, so you can ask him about the details of this work. I think the astronomy community in general is very forward-thinking: they have so much data that they're very open to using techniques that come out of computer science or statistics or anywhere, if those techniques can help them manage all this data. In this particular case the astronomers and the data scientists were working together to look at weak gravitational lensing data. This is data that the astronomy community is trying to understand, build models for, and estimate cosmological parameters for, in what's called the Lambda cold dark matter model. And what they found is that yes, with CNNs they were able to better estimate those cosmological parameters, above and beyond traditional statistical techniques. Now, this was all done on simulated data; more recently, I know that they've been wanting to use real data to do the same analysis. But this gave them a glimmer of hope that, yet again, deep learning has yet another astonishing success in yet another discipline.
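For readers who want to picture the setup, here is a generic sketch of a CNN that regresses cosmological parameters from simulated lensing convergence maps; the architecture, map size, and choice of parameters (e.g. Omega_m and sigma_8) are illustrative assumptions, not the paper's.

```python
# Generic sketch (not the paper's architecture): regress two cosmological
# parameters from simulated weak-lensing convergence maps.
import torch
import torch.nn as nn

class LensingCNN(nn.Module):
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(), nn.AdaptiveAvgPool2d(1),
        )
        self.head = nn.Linear(64, 2)      # e.g. (Omega_m, sigma_8)

    def forward(self, x):
        return self.head(self.features(x).flatten(1))

model = LensingCNN()
maps = torch.randn(8, 1, 64, 64)          # stand-in for convergence maps
params = torch.randn(8, 2)                # stand-in for true parameters
loss = nn.MSELoss()(model(maps), params)  # regression loss used for training
loss.backward()
```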
Let me turn to a completely different discipline: economics. In economics, people are always interested in understanding cause and effect. This particular work uses a technique that came out of the economics community called double machine learning, and it looks at the question of whether online labor markets like Amazon Mechanical Turk behave like regular markets. What they discovered is that, in fact, they behave more like monopsonies than like any kind of free market. A monopoly is when you have one seller and multiple buyers; a monopsony is when you have one buyer and multiple sellers. So it turns out that Amazon Mechanical Turk and the like behave more like monopsonies. One example data point they use to justify this conclusion is that high-reward tasks do not get picked up faster by workers than low-reward tasks. You'd imagine that if you want to make a lot of money you go after the high-reward tasks, but instead you can go after a lot of little low-reward tasks and make your money.
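As a flavor of the technique she names, here is a minimal sketch of double machine learning in its partialling-out form, on made-up data where a task's reward is the "treatment" and its uptake is the outcome; the variables and models are illustrative, not the authors' specification.

```python
# Minimal double machine learning (partialling-out) sketch on synthetic data.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_predict
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)
n = 2000
X = rng.normal(size=(n, 5))                              # task covariates
reward = X @ rng.normal(size=5) + rng.normal(size=n)     # "treatment"
uptake = 0.3 * reward + X @ rng.normal(size=5) + rng.normal(size=n)

# Cross-fitted nuisance estimates: predict treatment and outcome from X.
reward_hat = cross_val_predict(RandomForestRegressor(n_estimators=100), X, reward, cv=5)
uptake_hat = cross_val_predict(RandomForestRegressor(n_estimators=100), X, uptake, cv=5)

# Regress residual on residual: the slope is the debiased treatment effect.
theta = LinearRegression().fit(
    (reward - reward_hat).reshape(-1, 1), uptake - uptake_hat
).coef_[0]
print("estimated effect of reward on uptake:", theta)
```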
Another example is in finance. The finance industry has already been putting out these things called robo-advisors. If you happen to be wealthy, you might have a personal financial advisor, but right now the industry is looking at robo-advisors to give you personal advice. Now, when you have a human being as your personal financial advisor, it takes some amount of time, usually months, perhaps years, for that advisor to learn your investment preferences: are you conservative and you want to keep all your money under the mattress, or are you very courageous and you try all sorts of crazy instruments out there, or somewhere in between? What this robo-advising work is trying to do is actually predict investor preferences, using in particular this one equation here, which you're probably all familiar with; it's using reinforcement learning. Automating this allowed this particular team of researchers to learn investor preferences in basically eight or nine iterations of the algorithm. This is all simulation data; this particular group of researchers actually wants to work with a real financial institution that has real data to test out the idea.
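The equation she refers to is on the slide, not in the transcript, so rather than guess at it, here is only a toy illustration of the general point, that an automated advisor can home in on a hidden risk preference in a handful of rounds of feedback; it is not the authors' reinforcement-learning formulation.

```python
# Toy illustration (not the authors' method): the investor has a hidden
# risk-aversion level in [0, 1]; each round the advisor proposes a portfolio
# riskiness, the investor says "too risky" or "not risky enough", and the
# advisor bisects its estimate.
def learn_risk_preference(true_risk_aversion, rounds=9):
    lo, hi = 0.0, 1.0
    comfort = 1.0 - true_risk_aversion          # riskiness the investor accepts
    for t in range(rounds):
        proposal = (lo + hi) / 2                # proposed portfolio riskiness
        if proposal > comfort:                  # feedback: "too risky"
            hi = proposal
        else:                                   # feedback: "not risky enough"
            lo = proposal
        print(f"round {t + 1}: estimate = {proposal:.3f}")
    return (lo + hi) / 2

estimate = learn_risk_preference(true_risk_aversion=0.7)
# After 8-9 rounds the estimate is accurate to within about 1/2**9.
```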
For something completely different: we have a very modern history department at Columbia. Many of the faculty are using data science, topic modeling, NLP, sentiment analysis, routinely in their work, and this is one example I wanted to share with you. There's a history faculty member at Columbia who amasses, every year, all the declassified documents that come out of the federal government: every year there's a release of documents, and this person is accumulating all of them. One set of documents that he analyzed, to see if he could detect anomalous events, was the cablegrams that diplomats sent each other in the 1970s. What he wanted to do was go through the contents of the cablegrams, the "to" field, the "from" field, and also the contents, the subject of the cable, and so on, and see if he could detect unusual events and identify whether those unusual events actually mapped to real historical events. Sure enough, every black dot there represents one of those unusual events, and some of them you'll recognize, like the evacuation of Saigon and the death of Mao Zedong. OK, so that's our history department.
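As a hedged sketch of the flavor of such an analysis, and not the historian's actual pipeline, one very simple way to surface candidate "unusual events" is to count cables per day and flag days whose volume jumps well above the recent average; the tiny data frame below is invented.

```python
# Hedged sketch: flag anomalous days by cable volume (not the real pipeline).
import pandas as pd

# Assumed columns: date sent, "from" post, subject line.
cables = pd.DataFrame({
    "date": pd.to_datetime(["1975-04-28", "1975-04-29", "1975-04-29",
                            "1975-04-30", "1975-05-01"]),
    "from_post": ["SAIGON", "SAIGON", "STATE", "SAIGON", "BANGKOK"],
    "subject": ["EVACUATION", "EVACUATION", "EVACUATION", "STATUS", "ROUTINE"],
})

daily = cables.groupby("date").size()
rolling_mean = daily.rolling(3, min_periods=1).mean()
rolling_std = daily.rolling(3, min_periods=1).std().fillna(1.0)
z = (daily - rolling_mean) / rolling_std
anomalous_days = daily[z > 2]      # candidate "unusual events" to inspect
print(anomalous_days)
```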
Now I want to turn to data for good. I'll start with responsible use of data. I know there's a lot of talk in the computing community and beyond about FAT and FATE; I added the S for safety and security: fairness, accountability, transparency, ethics, safety, security. The ACM now sponsors FAT* and so on. I throw all of those into what I mean by responsible use, and, as I mentioned, my sole contribution to this acronym is the S for safety and security, which partly reflects my own background. So I wanted to share two research stories, two research efforts, along those lines.

Last year, some of my colleagues, who are actually programming languages, software engineering, formal methods types of people, borrowed, or let's say were inspired by, two programming languages and software engineering techniques and applied them to better understanding DNNs. One of those techniques is code coverage: when you have a program, you usually want to test it by tickling all the paths in your program, covering all the code. Analogous to that, they define a notion of neuron coverage, where you'd like to tickle all the nodes of the network, all the edges, and so on; that's a very intuitively easy idea to understand. The other idea they borrow from software engineering is the notion of differential testing: when you have two pieces of code that are supposed to compute the same answer, you see if they differ in some way. So if you have two DNNs that are supposed to do the same classification on, say, a set of images, you see where they differ.
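Here is a minimal sketch of those two borrowed ideas on made-up models: neuron coverage as the fraction of units that any test input activates, and differential testing as finding inputs where two networks that should agree disagree. It is not the DeepXplore implementation.

```python
# Sketch of neuron coverage and differential testing on toy networks.
import torch
import torch.nn as nn

net_a = nn.Sequential(nn.Linear(10, 32), nn.ReLU(), nn.Linear(32, 3))
net_b = nn.Sequential(nn.Linear(10, 32), nn.ReLU(), nn.Linear(32, 3))
inputs = torch.randn(100, 10)

def neuron_coverage(net, x, threshold=0.0):
    activations = []
    hooks = [m.register_forward_hook(lambda mod, inp, out: activations.append(out))
             for m in net if isinstance(m, nn.ReLU)]
    net(x)
    for h in hooks:
        h.remove()
    acts = torch.cat([a.flatten(1) for a in activations], dim=1)
    covered = (acts > threshold).any(dim=0)      # a neuron is "covered" if any
    return covered.float().mean().item()         # test input ever activates it

def differential_test(net1, net2, x):
    labels1 = net1(x).argmax(dim=1)
    labels2 = net2(x).argmax(dim=1)
    return (labels1 != labels2).nonzero().flatten()   # inputs where they differ

print("coverage of net_a:", neuron_coverage(net_a, inputs))
print("disagreements found:", differential_test(net_a, net_b, inputs).numel())
```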
Using those two ideas, they built a tool called DeepXplore that showed that, at the time, all the state-of-the-art vision DNNs had flaws. What they actually do is take certain images and systematically perturb them in natural ways, such that the classifier labels them incorrectly. In this particular case, you see here a typical image: this is the vision system in, say, your self-driving car. The self-driving car sees the guardrail in the image on the left, and the classifier says, veer to the left so that you don't go into the guardrail; that's fine. What their system does is, automatically and systematically, for all images, perturb these images in a natural way. The natural way here is to darken the image, which is a very natural occurrence because we don't always drive in sunlight. By darkening the image, the classifier actually says veer to the right, and then you die. So that's the fatal error of these DNNs. They also applied this to other things besides images: malware detection and so on.

Another example I wanted to give has to do with privacy. It's inspired by differential privacy, but it's really more about robustness, again to perturbed input images for DNNs. We all know how easy it is to fool a DNN by putting graffiti on a stop sign; all of a sudden the DNN doesn't classify it as a stop sign, and again you go through the intersection and you die. So that's not good. Instead of the testing approach, where for each DNN and each image you do something, wouldn't it be nicer if you could guarantee, for some set of perturbations applied to a set of images, that you have some robustness guarantee at the end? That's what this work, PixelDP, is all about, and the idea is inspired by the stability guarantee that you get from differential privacy. I know you had a whole talk by Cynthia on differential privacy, so I'm not going to belabor that, but really the beauty behind differential privacy is not just that it's applicable to privacy; it's that it's actually a stability guarantee. So the solution these researchers came up with was to add a layer of noise to a DNN that will, at the end, guarantee some confidence interval around the prediction for each of the labels.
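Here is a sketch of only the basic mechanism she describes, a noise layer whose averaged predictions are stable under small input perturbations; the certification math and architecture of the actual PixelDP paper are not reproduced here, and the base model below is a placeholder.

```python
# Sketch inspired by the PixelDP idea: average predictions over noise draws
# and look at the margin between the top two scores. Not the paper's exact
# construction or its formal certificate.
import torch
import torch.nn as nn

class NoisyClassifier(nn.Module):
    def __init__(self, base, sigma=0.25):
        super().__init__()
        self.base, self.sigma = base, sigma

    def forward(self, x, samples=100):
        # Average softmax scores over Gaussian noise added to the input.
        scores = torch.stack([
            torch.softmax(self.base(x + self.sigma * torch.randn_like(x)), dim=1)
            for _ in range(samples)
        ]).mean(dim=0)
        return scores

base = nn.Sequential(nn.Flatten(), nn.Linear(28 * 28, 10))   # placeholder model
model = NoisyClassifier(base)
images = torch.randn(4, 1, 28, 28)
scores = model(images)
top2 = scores.topk(2, dim=1).values
print("prediction margins:", (top2[:, 0] - top2[:, 1]).tolist())
```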
This work appeared in this year's Security and Privacy conference.

OK, so now I want to turn to some of the societal grand challenges that we at Columbia like to work on. One I want to share with you is some work on social justice. This is work done in collaboration between some natural language processing faculty and some faculty in the School of Social Work. They are trying to understand how to reduce, or intervene in, gang violence. They looked at a Chicago gang and, in particular, the tweets that the gang members sent each other, and the idea is that if you can do sentiment analysis on the tweets, then you may be able to intervene before any violent action happens. The idea was then to look at their tweets and see if they could classify a tweet as being aggressive in sentiment, or more about showing sadness, as opposed to aggression. What happened first, of course, is that before they could even get to that kind of classification, they had to analyze the language; it is its own language. So our NLP people had to do part-of-speech analysis, because the language the gang members use is heavy with emoticons, symbols, and words that we may not understand if we're not in that Chicago gang. That's the first thing they did, and that part-of-speech analysis is available online, so you can check it out yourself. Then they were able to do the sentiment analysis, as I said, with the idea that if you can detect that the gang members are tweeting something that sounds aggressive, then a social worker or law enforcement can intervene before a physical act of violence occurs.
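As a hedged baseline sketch, and not the Columbia group's model, the classification step can be pictured as a simple bag-of-words classifier over labeled tweets; the tiny examples below are invented placeholders.

```python
# Baseline sketch: classify tweets as "aggression" vs "loss" with TF-IDF +
# logistic regression. Invented examples; not the actual system or data.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

tweets = ["on my way to get you", "rip lil bro miss you",
          "you better watch out", "cant stop crying today"]
labels = ["aggression", "loss", "aggression", "loss"]

clf = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LogisticRegression())
clf.fit(tweets, labels)
print(clf.predict(["watch your back"]))   # flag for a social worker to review
```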
So that's that story. We also have at Columbia one of the world's largest groups of climate scientists, at the Lamont-Doherty Earth Observatory, part of the Earth Institute. They amass many, many volumes of climate data from around the world, doing simulations at different levels of resolution for different parts of the earth, the oceans, the air, and so on. What I'm showing you here is not so much the obvious, which is, let's throw deep learning or machine learning at some of the data they are analyzing; the climate scientists are doing that as I speak. I also wanted to emphasize the systems nature of doing science: Pangeo is an open-source platform for all climate modelers and climate scientists to come together to share data, to share algorithms, to share models, to share results. You can see that it's funded by multiple organizations, and it's already being used for doing simulations on ocean turbulence, hydrology, meteorology, and so on.
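To give a flavor of what that shared platform makes routine, here is a small sketch in the style of the Pangeo stack (xarray, typically dask-backed); the synthetic dataset, variable name, and dimension names are placeholders standing in for a cloud-hosted archive.

```python
# Sketch of a typical xarray reduction over a gridded climate field.
import numpy as np
import pandas as pd
import xarray as xr

time = pd.date_range("2000-01-01", periods=24, freq="MS")
ds = xr.Dataset(
    {"sst": (("time", "lat", "lon"),
             15 + np.random.randn(24, 10, 20))},    # stand-in for sea surface temp
    coords={"time": time, "lat": np.linspace(-45, 45, 10),
            "lon": np.linspace(0, 355, 20)},
)

annual_global_mean = (ds["sst"]
                      .groupby("time.year").mean()   # annual means
                      .mean(dim=["lat", "lon"]))     # global average
print(annual_global_mean.values)
```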
My last example of societal benefit is in healthcare. This is a project called OHDSI, Observational Health Data Sciences and Informatics; we pronounce it "Odyssey." It's coordinated by Columbia University and it involves 25 different countries around the world. There are 80 or more different data sets that are federated, much like this conference is federated, but what's unique about this is that there are 600 million patient records in this federated data set, and, here's the amazing part, they're all in the same format. When you have 600 million EHRs in the same format, you can actually do some very interesting analysis that you could not do through clinical trials and so on. So this group is really emphasizing the importance of observational data: when you have it at scale, you can discover drug interactions, the effects of certain medications for treating certain kinds of diseases, just by looking at the data.

One example I wanted to give, and this is quite a busy visualization, but I have time; the reason I like showing this is to emphasize the importance of an earlier point I made, that it's not enough to just show pretty pictures, someone has to tell you a story about them. In this OHDSI data set, each ring of circles represents one of those clinical data sites; for instance, the ring of circles in the middle on top, you can't read it, is the Columbia University Medical Center. I'm just showing you nine different sites, for three different diseases that they studied: on the left is diabetes, in the middle is hypertension, and on the right is depression. Now let me tell you how to interpret each of those rings of circles. What they are studying is what drugs are given to each patient to treat a particular disease. So, for instance, if you're being treated for hypertension, that's the middle column, and you're a patient at Columbia University Medical Center, that's the top ring of circles; each patient is one radius through that ring of circles. If I'm being treated at Columbia University Medical Center for hypertension, the first drug I'm given is represented by the color, there's a color coding here, of the first medication in the inner ring of circles. If all is fine, then I don't get treated with any other drug; but if that doesn't work, then I'm treated with a second drug, and that's represented by the second circle out in that ring; and if that doesn't work, then I'm given a third drug, and so on and so forth. So each circle in a ring of circles represents a drug that I'm given in my sequence of medications treating me for that particular disease. I hope that makes sense.
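Because every site's records are in the same format, the treatment pathways behind those rings reduce to a simple group-by. Here is a hedged sketch on an invented table whose column names only follow the spirit of a common data model; it is not the OHDSI tooling.

```python
# Sketch: per-patient treatment pathways and the fraction treated uniquely.
import pandas as pd

drug_era = pd.DataFrame({
    "person_id":  [1, 1, 2, 3, 3, 3],
    "drug_name":  ["metoprolol", "lisinopril", "lisinopril",
                   "amlodipine", "hydrochlorothiazide", "lisinopril"],
    "start_date": pd.to_datetime(["2010-01-01", "2010-06-01", "2011-02-01",
                                  "2012-03-01", "2012-09-01", "2013-01-01"]),
})

pathways = (drug_era.sort_values("start_date")
            .groupby("person_id")["drug_name"]
            .apply(tuple))                 # ordered drug sequence per patient

counts = pathways.value_counts()
unique_fraction = (counts == 1).sum() / len(pathways)
print(pathways)
print("fraction of patients with a unique pathway:", unique_fraction)
```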
So now that I've interpreted each ring of circles for you, let me just give you a couple of observations that the researchers made by looking at the data, just looking at the data, no clinical trials. First, for hypertension: if one were to collapse all the rings of circles into one big ring of circles, they found that one-fourth of the people treated for hypertension in this federated data set are treated uniquely. One-fourth. That means if you have hypertension and you're in that one-fourth, and I'm your doctor and you ask, Jeannette, is there anyone else in the world being treated like me, the answer would be no. That's amazing, astounding; it's one of those pieces of information that come at you after you look at the data. Now, to understand why is a whole other set of research questions to be answered.

Another interesting observation made by looking at just the data is in the lower left-hand corner. If you look at the left column, that is for treating diabetes: all of the other rings of circles pretty much look like the top two, there's this drug, represented by the chartreuse ring, that for the most part works. But you'll notice that in the lower left-hand corner the chartreuse is very small in the first ring of circles. It turns out that that lower left-hand corner is data from a Japanese medical clinic, and in looking at the data they surmised, and discovered, that the Japanese have a disposition against that particular drug represented by that chartreuse color, and it's not shared by Chinese or Korean populations; looking at the data, it was really unique to the Japanese in that medical center. So those are the kinds of things one can glean when you have so much data.

So what I wanted to do to close: I started by talking about multiple causal inference, and I want to tie my first story back to this story. These are really hot-off-the-press results; I've never talked about this before, the work was just put on arXiv two months ago. It's basically bringing these two together: let's build a medical deconfounder. This is really preliminary work, to show what could be done. What this group of researchers did was take the OHDSI data set and build these medical deconfounders for specific tasks, and then see which medications have a positive or negative effect on an indicator for a particular disease. They then compared the results of their deconfounder model with the medical literature; that was their validation mechanism. What I wanted to share with you is just one example of what they were able to do with this medical deconfounder, and it's related to treatment effects on hemoglobin A1c, which is a protein that's an indicator of blood glucose levels, which in turn is an indicator for diabetes. The unadjusted model, basically a linear regression model, is the expression on top, and the medical deconfounder model is the expression in the middle.
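The slide's exact expressions are not in the transcript; a plausible rendering of the two models she contrasts, hedged as a sketch, is:

```latex
% Sketch, not the slide: y_i is patient i's hemoglobin A1c, A_{ij} indicates
% whether patient i takes medication j, and \hat{z}_i is the substitute
% confounder inferred by the factor model over the medications.
\[
\text{unadjusted:}\qquad
  y_i = \beta_0 + \sum_j \beta_j A_{ij} + \varepsilon_i
\]
\[
\text{medical deconfounder:}\qquad
  y_i = \beta_0 + \sum_j \beta_j A_{ij} + \gamma^{\top}\hat{z}_i + \varepsilon_i
\]
```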
I wanted to just give you the bottom line. First of all, this deconfounder actually reduces both false positives and false negatives. For example, and I'm not going to be able to pronounce these things properly: acetaminophen, which in the unadjusted model indicated a causal effect, goes from causal to non-causal in the deconfounder model. Acetaminophen is what's in Tylenol, so maybe that's not a surprise. All of this is what they observe; now they have to understand the why. In the middle you'll see aspirin, which has no causal effect in either model, so maybe that's analogous to why Tylenol, or acetaminophen, should not either; but I don't know, I'm not the medical scientist. On the other side there are two drugs, amlodipine and hydrochlorothiazide, that go the other way: in the unadjusted model they showed no causal effect, but in the deconfounder model they do show a causal effect. Again, the medical scientists in the room can better explain what questions these results are raising and what further study they suggest. The other point I really wanted to make with this particular example is that the medical deconfounder is actually more consistent with the medical literature with respect to which medications increase or decrease hemoglobin A1c. So this all gives us hope that this deconfounder idea could actually be very useful. It also gives us a different way to understand the massive amounts of medical information that we have, the medical data that we're collecting.

I'm going to stop there. Data for good: just remember, that's what data science at Columbia stands for. Thank you very much.
OK, is this on? Yes. So, Jeannette, thanks for a wonderful talk. I'm just wondering, for us as a CS research community, when you think of your mission, the three parts, advancing data science, impacting other fields, and promoting responsible use of data, where do you think we need to make the most headway, given where we are right now in computer science?

Yes, in computer science broadly. I would say, to be perfectly honest, and now I'm speaking very pragmatically, given what's happening in the real world, I would say for computer science, and this is happening, I'm seeing this at many computer science departments, I would actually say the responsible use of data, ethics, computer science ethics. Look, I never took an ethics course in my life; I was trained as a computer scientist, straight and narrow, and I love computer science. But now I'm seeing the implications of not understanding the consequences of the technology that we develop. Many of you are perhaps uncomfortable with nuanced questions; ethical questions are always nuanced, and there's no clear right and wrong. But I think we need to face these, because if we don't as technologists, then it will be the policymakers, not in consultation with us, making regulation or passing laws that won't make sense. I saw some of this when I was at the National Science Foundation, when there was a lot of interest in cybersecurity and growing interest in privacy, and I remember seeing draft legislation on my desk saying, we should do this, and I thought, you know... So first of all it does speak to the importance of scientists engaging with policymakers, something I think many of you know I've been promoting. But it's also something I've really come to appreciate being at Columbia University, talking to law professors and business school professors and journalists, people who are not techie by training: they grapple with really hard problems all the time, societal problems where there is no right or wrong, where there are nuances, there are debates, there are many sides. And it's uncomfortable for us, it's uncomfortable for me, because as a computer scientist especially you like black or white, yes or no, true or false, zero or one, but that's not the way the world is. So that would be something we can work on together as a computer science community. It's one of the reasons I'm promoting it through data science: data science is new enough that we can get the ethics in from day one. Thank you, thank you for the question.
Hi, Sai Minh from Ohio State, interesting talk. What are your views on topological data analysis? You didn't mention the subject in your talk.

To be honest, I don't know a lot about topological data analysis, so I guess I don't have a view. I would say that if it's coming from the theory community or the topology community in mathematics, and I have a colleague who works on that at Columbia, if it can help us understand big data from a structural point of view, where you can identify certain invariants about the data, or invariants over the algorithms that manipulate that data, then I would say it's great stuff. OK, thank you.
Hello, Ted Lehr with the City of Austin. The question has three parts; the basic question is, what's happening at Columbia? The motivation is that cities are responsible for collecting a lot of data; with smart cities we're going to be collecting a lot, lot more data. We have a responsibility to be open and transparent with that data. In Austin we have programs to protect privacy; we don't think we have our act together, but we think we're making great progress, we're exploring differential privacy, for example. But we have a question about system impacts, let's call it community impacts. I'll give you two examples, which will focus the question for you. Recently the federal Department of Housing sued Facebook, and I'm not calling Facebook a bad actor here, but they sued Facebook on the grounds that they allowed advertising for housing and apartments to discriminate against minorities, religious groups, that kind of thing, because, as you know how Facebook works, you can decide who you'd like to target your advertising at, and apparently they're being accused of targeting, or excluding, groups that might have been minorities. That is using data, and there's a right involved there that's not a privacy right; it's not a right of privacy, here we might claim there's a right to something else, it's fairness, but it's a community right as well. The second piece is that we've experienced this in Austin, and maybe you're familiar with filter bubbles: we're not even sure sometimes, when we're talking to our community, that we're being informed by the same information. We've had our own version of Pizzagate in Austin, where it was unbelievable having the discussion with the community when we didn't have the same facts, so to speak. So my question, back to the original piece: what is Columbia doing about addressing these kinds of problems? They're not privacy issues for you and me; they're system-level issues about what people do with the data and how that impacts the system, meaning the community, as well as: what are you doing to help improve communities talking to each other, so they have shared information, or at least the same... I'm going to have to have a conversation about fake news. Does that help?
Yeah, so let me take those two questions separately. First of all, I happen to be on the New York City task force for automated decision systems. Believe it or not, New York City passed legislation last year that basically asks city agencies, if they use these automated decision systems, how can we be assured that they're using them in a non-discriminatory, fair way. And then they want explanations, and they want everything. This is an example where legislation is ahead of where the science is. So the good news is that New York City is thinking about these issues, but these are hard issues: this is a 12-member task force, there are four techies on it, including myself, plus lawyers, policymakers, city agency people, and even defining what an automated decision system is has raised a lot of issues just within the task force, let alone how a city agency is supposed to operationalize this. So the community works very, very closely with the New York City government, partly because the Data Science Institute was first created out of the economic development corporation that New York City had under Mayor Bloomberg's watch, so we have a very good relationship with city agencies and the city in general, and they do look to us for conversations, for leadership, for thought leadership, and so on. On campus itself we have a lot of people interested in different aspects of the community issues you mentioned; we have really strong interest from the School of Social Work, the School of Public Health, and other parts of campus, and those people are working directly with community organizations, also working with underserved youth in the city; all sorts of activities are going on.

With respect to your second question, about every one of us getting different pieces of information: this has just been exacerbated by the ease of the digitization of information, the ease of disseminating it, and now the ease of changing, of altering, content. We've been working at Columbia with the School of Journalism in general about things like fake news and how to combat it. As you know, there's no easy answer; even just deciding whether a piece of information is authentic or not is not an easy question to answer. The Data Science Institute is sponsoring a couple of seed grants that are looking at what kinds of technology we can bring to the table to do things like veracity checking and so on, but these are really research ideas. So that's what I can offer in terms of what we're doing. Thank you.
Thank you. This is the last question.

Hi, [inaudible] University, great talk, Jeannette. I very much agree with you on the importance of keeping data open and sharing it; it's good in the sense that we all develop faster this way, right? And I think computing, computer science as a field, has benefited tremendously over the last few decades by being very light on patents, or at least we don't use patents much. But I can also see that as we start to commercialize this, which will surely happen in the next few years, and we get the medical industry and everything involved, there is likely going to be a greater and greater demand for monetizing it, and then I bet we're going to see more and more patents. And there I start to get a bit worried, because on the one hand I like commercialization, and I like my healthcare to be better, but patents are certainly going to start offsetting things: I can't compete with you for 20 years or something. So how do you handle this in your position leading a data science institute? Do you have any policies about this? Do you want not just the data to be free, but also what we do with the data?

Yeah, this is another one of those questions that have multiple sides. I'm a great believer in open science and being transparent, and in the conduct of science as we know it today, which is being open: sharing results, sharing data, sharing algorithms, sharing, sharing. But what you're saying now, the implications you're talking about, are more about commercialization, and then we're talking about the IP and patents and so on that go with that. I haven't really thought about commercialization so much in terms of the great work that we're producing. What concerns me more right now is that a lot of the advances in computing, especially in AI, deep learning, machine learning, and so on, are happening in industry, not in academia, and I think this is concerning for a couple of reasons. One is that industry has a different mission from academia: it is about the bottom line, it's not about advancing science, it's not about discovering new knowledge. And right now industry has the data and they have the compute. The data, we academics will never get our hands on; that is an asset to a company. Compute, well, maybe the commercial cloud providers will provide some free cycles for some amount of time, some amount of free storage for some amount of time, but that's not how science works. So I'm concerned about that, and there are different ways in which we might think about industry and academia working together; I think we need to have a conversation about that. The other reason I'm concerned, and I know you heard Geoff Hinton and Yann LeCun talk about deep learning earlier this week, and I just want to say that Geoff Hinton and I were colleagues at Carnegie Mellon together in the mid-80s, and it was just as he probably said: in the mid-eighties we were all humoring Geoff, and now look. What I want to say is that maybe, and I don't know, I'd say maybe, eventually deep learning will hit a wall, and then what? Is it not the role of academia to know what that next thing is? We should be, and probably already are, thinking about what comes after. So, and this is the NSF in me talking, and the Microsoft Research in me, it is about supporting basic, long-term research, and that is the role of academia, for science. Thank you so much.
