Hi everybody, how's it going? Good. Sorry we're starting a little late; I wanted to give a few extra minutes for people to come in, since I hear there are some students coming from other classes. I'd like to introduce myself, and I'd also like to introduce you all to the DataEDGE conference. I'm super excited to be here. This is my second year, and I think it's the second year of the conference, so it's really an honor to come back and kick things off. I hope we have a fun discussion here that will open your brains to some of the things you're going to see later on. This is going to be a workshop. I thought about different ways to present this material, and what I'm going to do is a little bit of a discussion, maybe with some Q&A within the talk. We'll talk about the scope of the tooling around data science, some practical applications of available tools, and where the data science field is going, and afterwards we can have a Q&A and a little discussion about this topic as well.
Let me introduce myself. My name is Michael Manoochehri. I'm a MIMS alum; I graduated from the I School master's program. Yes, it was great, and I try to come back as much as I can, but I'm pretty busy. I currently work at Google on the Cloud Platform team, where I manage a team of developer relations engineers who work with data projects: external developers use our platform to build data applications of their own. I work specifically with a technology called BigQuery, which is very interesting. I'm not going to talk much about that today, but if you're interested in some of the stuff we're doing at Google, let me know later. I'm also writing a book for Pearson called "Data Just Right," and I'm going to talk a little bit about some of that material today. You can find me at my Google+ page, and you can follow me on Twitter as well. I don't actually have slides, so I decided to just put up this short link to the notes for the presentation. I've already taken notes for you, so it's pretty user friendly. I'll leave that up for a short time.
I have a couple of other things to talk about too. Why did I decide to write this book about practical applications of data science? Being at Google for a few years, I started to think about what it means to be a data scientist. While there are exceptions to this rule, and there are people who have the role of data scientist at Google, we don't have a lot of data scientists; a lot of people don't call themselves that, but they do a lot of the work that we think of data scientists doing. I really love this quote by Hilary Mason, who I believe is going to be at the conference. OK, well, she's easy to find. Anyway, Hilary Mason put together a really good definition of what a data scientist does; let me see if I have the quote here and pull it up. Hilary describes a data scientist as somebody who does three fundamentally different things, which I think is very interesting. One is math: statistics and computational modeling. Another is code: actually being able to implement your ideas in code, to make them happen in software. And the final one, which is probably the most important, is communication: being able to communicate these ideas and tell stories around the things you've found in the data you're working with. I realized that some of the people I work with don't call themselves data scientists, but they do all of these things, and I started to become very sensitive to the mythologies around data science.
I found this paper by danah boyd, who's also an I School PhD graduate, written with Kate Crawford, who is going to be here today; I believe Kate Crawford will be here, and that's going to be a great talk you should definitely go to. This quote talks about why data science is so confusing: why are you all here, why are you all learning more about this, why isn't this a mature technology, why don't we know what a data scientist really is? The quote has really stuck with me. What is data science? What is this field? It's a cultural, technological, and scholarly phenomenon that rests on the interplay of technology, analysis, and mythology, and that provokes extensive utopian and dystopian rhetoric. And this just really struck me.
I love this quote because it captures that there are people who are doing things with data, having successes, doing new and novel things; and there are people who think that data will solve everything, that collecting lots of data will be a magic bullet. You all know about this, right? You've probably all been down this road: if you're at this conference, you're interested in some of the issues that come up around collecting data, trying to find magic treasure inside it, what tools we need, and what technology we need to learn. All these things started resonating with me. Since I work with many customers who are actually building successful things using data, building data processing applications and pipelines with tools like R, Pandas, and the Python PyLab stack to actually get value out of data, do new things, and show new visualizations, I decided to write a book around use cases: use cases that come up time and time again, which I've grouped into different patterns. I'll talk about a few of those today.
Before I go on, I wanted to get a sense of who's in the audience; I wasn't really sure what to expect. I see some people who are former students, and I see some professors back there. Who here is a Berkeley student, or a student of any kind? An academic? Mostly graduate level, I'm guessing; any undergrads in here? OK, great. And who here is a professor or some kind of teacher? A few, that's great. How about industry: are there people here from industry who are trying to learn some new things and learn about new tooling? That's great. Are there people here trying to figure out if they should dump their Hadoop installation and install Spark from the AMPLab? I'm just wondering; we'll find out this week whether that's a good idea. OK, great, so it's a really good mix. I wasn't sure what level of technology people are aware of, so I'm just going to start from scratch and we'll talk about a few things.
Another great quote about big data that I love is from George Dyson, whom I met at the Strata conference (Strata is an industry conference around data science). He said that big data is what happened when the cost of storing information became less than the cost of making the decision to throw it away. Basically, we couldn't really do anything with all that data before because we didn't have the means, and now it's becoming too expensive to get rid of it, because who knows what we'll find in that buried treasure. To me, this field is rapid innovation: constantly evolving technology, lots of overlap, open-source projects that do sort of the same thing but not quite, and it's very hard to tease out what to use in which situations. Part of the game in the data science tooling field right now is: how do I solve the problem that I have with the technology available to me? And the most important part of that sentence is, what's the problem that I have? What is the use case? What am I trying to do? What's happening in this field is that there's such an onslaught of technology, open source and non open source, and money being pumped in, that it's hard to know what to do. People are in the situation where they're thinking technology first, not use case first; they're thinking, I have all these things I need to learn, and I don't know quite what I'm doing.
So what I'm going to talk about today is a little bit of that. The other thing I always grapple with is: what is a data scientist? A lot of people are talking about that today. What does it mean? Is this a real academic field? Is this a real job role? I grapple with that all the time. DJ Patil is very famous; he was formerly the chief data scientist at LinkedIn, and I believe he works for Greylock now. He's sometimes around Berkeley at conferences like this. He has defined the data scientist role and become famous, beyond just his normal work, for evangelizing it. At first I grappled with the term "data scientist": I think that all scientists are data scientists, and it's an unfortunate name for something that is actually very important. But what DJ Patil and others who have advocated for this position did very well is recognize that inside companies like Google and Facebook, and inside academic institutions as well, there are people who are skilled in multiple things: the math, the code, the visualizations, and the insight to put it all together and do practical things with data. By naming that role, whatever the name is, people like DJ Patil are saying: there is a role out there, and there are people doing these amazing hybrid, interdisciplinary things with data. That, to me, is really important, so I agree with that. I don't necessarily think "data scientist" is the best term for this role, but it's a term, and it names a thing that happens.
People are able to combine the math, the understanding of what makes a good statistical question or a good model, with the code to implement it and the engineering skills to get it done, plus the ability to communicate these things, tell other people what's going on, and actually convince them that this is a valid answer to a data question. By naming these people, I think we're saying: this is a role, this should be a role, and these people should help others gain access to this kind of technology. So that's where I'm coming from with some of these things; there's some very important stuff in the notes.
I also put in a quote from a guy named Sean Taylor, who at the time was a PhD candidate. He has a well-known blog post called "Scientists Make Their Own Data," and in it he questions the data scientist term, but that's not really the point of the post. What he says is that many social and digital scientists are reluctant to invest in making data, because it's more costly and risky than analyzing data you already have. I've seen this time and time again: academics especially will grab a data set that already exists, when something like 90% of the tough stuff in data science is actually collecting the data, building the pipeline, and finding the right tools. It's messy and dirty. One thing I don't like about what DJ Patil says about data science is that it's the sexy job of the 21st century. No way: it's the messy, dirty, dirt-under-your-fingernails job. It is tough, and if you've ever worked with a data set of any size, collecting any real data, you know how difficult it is and how much work you have to do to make the data valid, clean it up, convince people that it is valid, and check, and check, and check again. So I think it's important for you, especially those of you in academics, to really understand what it takes to collect data: not just come up with technology that doesn't solve practical use cases, but actually dive in there and see what's going on. OK, with all that said (that was a long introduction), let's talk about some practical use cases and some of the ideas that come up from these problems.
One of my favorite use cases, which I see time and time again, is: how do I share big data sets with people? It seems like an easy problem, especially with the cost of cloud storage dropping: you've got Amazon, Google has a product as well, Rackspace; you can host a lot of data very cheaply. So what's the problem? Put up some XML files; it's pretty easy. Well, recently the Library of Congress decided to try a project where they were going to collect, I don't remember exactly, I think it was something like 13 billion tweets. Don't quote me, because I don't remember the number, but it was huge: a huge number of tweets. The Library of Congress was getting into this, and I actually got into a sort of Twitter battle online where I was prodding them, because they had published that this was hard: we bit off more than we could chew, we don't know what to do with all this data, we collected it but what do we do with it? I'd been posting comments on their blogs saying: you've got to do something with it. What was the point of collecting it? What were you trying to do? This is an example of a group that decided to collect the data first, without thinking about why, what they were going to do with it, or who the audience was.
Not to call out other government institutions, but I've also been in conversations with people from, let's say, groups that collect data about the census, who have trouble understanding what the use cases are. They'll put up data on an FTP server in XML format and expect that to be the solution to open government data. When these kinds of problems come up, you have to think about who the audience is and what the point of the data is. What file format should they use: CSV, XML, JSON? For those of you who aren't familiar with these formats, don't worry; they're just formats that are good for different purposes. XML is an excellent format for document interchange: if you want to take a document in one format and make sure it can be converted into another, great. But it's also very heavy, and not always the best thing for over-the-wire data transfer. CSV is very useful and almost universally accepted by just about everything, so if you're trying to share millions and millions of records, maybe CSV is the right choice; I've met a lot of people in the data field who just want CSV files of everything, because it's so easy. And then, what do programmers want? Everybody on the web wants JSON: they want to pull the data into their JavaScript visualization and use it immediately. Furthermore, why not an API? Maybe your users need API access. Or if you're a municipality, maybe you need to put up some dashboards, so that a person with no programming background can go in, start to play with the data, and get some value out of it. There are a lot of considerations here, and we see it happen over and over: people miss the point of what they should be doing with their data, who the audience is, and what the purpose is; they jump to technology first and never figure it out. These are some of the issues that I think are really important for those of you interested in this topic to really understand and think about.
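To make those trade-offs concrete, here's a minimal Python sketch (the records are invented for illustration) of the same tiny data set serialized as CSV for the spreadsheet-and-stats crowd and as JSON for web programmers:

```python
import csv
import io
import json

# A few hypothetical open-data records (not from the talk).
records = [
    {"city": "Berkeley", "population": 112580},
    {"city": "Oakland", "population": 390724},
]

# CSV: near-universally accepted, easy to load into spreadsheets and stats tools.
buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=["city", "population"])
writer.writeheader()
writer.writerows(records)
csv_text = buf.getvalue()

# JSON: what web programmers want, ready for a JavaScript visualization.
json_text = json.dumps(records, indent=2)

print(csv_text)
print(json_text)
```

Same data, two audiences; the choice is about who consumes it, not which format is "best."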
Another one that I see all the time is collecting lots of data. This is a pattern I'm very interested in personally, and one I work on a lot at Google; in fact, a lot of what happens at Google involves these kinds of software and tooling patterns. You'll have people on mobile devices; I work with a lot of game companies, and this is where we see it most, because they have millions of users on social networks and they're collecting data all the time, in real time. They want to transform the data, they want a regular schema, they want to make sure it's in an easily queryable format, they want to store it indefinitely, and then they have to ask questions about it: sometimes aggregate questions over the entire data set, sometimes not. This is a real need, and I've been working with this pattern a lot; I'm very interested in finding the best, most accessible ways to build it. What happens in this space is that there's a lot of open-source tooling with a lot of mismatch, and I think one of the reasons people come to conferences like this is to wrap their heads around the best ways to solve these problems: I'm going into industry, I'm making a game or a mobile application, and I need it to always run, I need to collect data, I need that data to be easily transformable and movable from place to place, and I need to store it and query it. The problem is that in the open-source space, lots of tools solve little parts of this problem, but they overlap in weird ways, and there's not a lot of standardization.
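That collect, transform, store, and query pattern can be sketched in a few lines of Python. Everything here (event names, fields) is hypothetical, and a real pipeline would write to durable storage rather than an in-memory list:

```python
from collections import defaultdict

# Hypothetical raw events, as a mobile game backend might receive them.
raw_events = [
    {"user": "u1", "type": "level_complete", "level": 3, "ts": 1000},
    {"user": "u2", "type": "purchase", "amount": 0.99, "ts": 1001},
    {"user": "u1", "type": "purchase", "amount": 1.99, "ts": 1005},
]

def transform(event):
    """Normalize a raw event into a fixed, easily queryable schema."""
    return {"user": event["user"], "type": event["type"], "ts": event["ts"]}

# "Store" the transformed rows (stand-in for a database or object store).
store = [transform(e) for e in raw_events]

# Aggregate question over the whole data set: how many events of each type?
counts = defaultdict(int)
for row in store:
    counts[row["type"]] += 1

print(dict(counts))  # {'level_complete': 1, 'purchase': 2}
```

Each stage here is exactly where the open-source tools overlap: collection, transformation, storage, and query are all served by different, partially competing projects.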
I'm always looking for analogies from past technology fields that can give us some insight. One I was looking at was the development of TCP/IP. It's not really the same, but there are similarities: lots of different standards were popping up, and they were incompatible, and at some point TCP/IP became the de facto standard, for many reasons, one being that it pulled good ideas from a lot of different applications. In this space (and I put this in the notes, which are here), there's a really funny blog post I've been following as it evolves, called "Cassandra vs Mongo vs Couch vs Redis vs Riak": just about every single non-relational database you can possibly imagine. I love this right now, because we're in a very innovative, exciting space where all these different tools are coming out with new ideas. Someday we will see a convergence, and I'm already starting to see it; I'll talk a little bit about that later. But right now we all have to hang in there and see what happens in the open-source space, and find the best design patterns; some will win and some won't. This is an exciting time, because we can try a lot of different things. It also means that, as data scientists, you're going to have to deal with this kind of gray area for quite a while, maybe a few more years, before we really understand what the best approaches are.
The same thing is happening in the space around Hadoop. If you've ever heard of Hadoop, it's a great tool for batch processing of data over many machines. One of the reasons we're in this great moment for data applications is that we can now use cheap commodity hardware and open-source software to distribute a data processing workload over a lot of computers or virtual machines.
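The classic illustration of that style of distributed batch processing is a word count expressed as map, shuffle, and reduce steps. This single-process Python sketch only mimics what Hadoop actually spreads across machines:

```python
from collections import defaultdict
from itertools import chain

documents = ["big data big tools", "data pipelines"]

# Map: emit (word, 1) pairs, as a mapper would for each input record.
mapped = chain.from_iterable(
    ((word, 1) for word in doc.split()) for doc in documents
)

# Shuffle: group values by key (Hadoop does this across the cluster).
groups = defaultdict(list)
for key, value in mapped:
    groups[key].append(value)

# Reduce: sum the counts for each word.
counts = {word: sum(vals) for word, vals in groups.items()}
print(counts)  # {'big': 2, 'data': 2, 'tools': 1, 'pipelines': 1}
```

The point of the pattern is that the map and reduce steps are independent per key, which is what lets the real system run them on cheap commodity machines in parallel.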
This is super exciting, and this is what Hadoop allows. But Hadoop is not always the best thing for every data problem: it's being overloaded as a machine learning tool, a database, a business intelligence tool, and all these other things, and we're still trying to figure out new paradigms. So it's done two good things: it's opened up a lot of possibilities for data, and at the same time people are looking at Hadoop and saying, OK, Hadoop is not the best thing, but we can build other open-source tools. If you've not heard of it, the AMPLab at Berkeley is doing amazing work on new ideas for data processing; I think Ion Stoica is talking afterwards. I've been hanging out with that group quite a lot, and I'm very impressed with some of their work rethinking distributed data processing. If you're interested in learning about what they're doing, you should definitely check out the workshop afterwards; it's quite exciting stuff.
So the landscape is filling out, which is very exciting, but we still really need to think about the problem spaces. I think the challenge, if you're interested in the data tooling space and in learning more about these tools, is to start thinking about the hard stuff. What you're going to be doing right now is gluing together things that have a mismatch, and you have to deal with that: you're going to have to learn a little code and a little system administration. That will change, but for now it's part of the game, and you're going to have to concentrate on technologies that solve real problems. The theme I keep coming back to is: think about the problems. So let's look at some use cases that will illustrate my point. By the way, I wanted to show you some other things here. Does anyone know who this guy is? Anyone? He's very famous to database people, in the exciting world of databases.
This is Edgar Codd. Speaking of mythologies around data, he is the father of the relational database paradigm. He worked at IBM in, I guess, the '50s and '60s, and he really understood what a relational database is. If you're not familiar with this terminology, relational databases are all about consistency, and the query language SQL, which also came out of this work, is a way to ask questions about data; the whole idea was built around consistency. What was fascinating about this man's life is that he used to be in the Royal Air Force. He was a military person, probably very interested in structure, and I think a lot of the ideas around relational databases might have some connection to that. In contrast, what we're seeing now is people running away from this paradigm. People at Google, for example, have done a lot of work with non-relational databases, where schema doesn't matter so much and consistency is not so important: we can spread the work around lots of computers, and it's not critical to us that they are always immediately consistent. Breaking through the kind of structure this person's paradigm imposed has opened up the ability to run databases on many computers, and it's now very common to see big data applications run across a cloud computing framework or many machines. I just find that very interesting: I'm trying to dig a little deeper into where we're going mentally and where we've been, and I think this is a good example of that.
Another pioneer is this man, Hans Luhn, the first person to coin the term "business intelligence," also at IBM. A very fascinating guy. He wrote a paper about business intelligence that you can find online; it's great. What he says he wants a computer system to do is send information to the party who needs it, at the time they need it. The one thing missing from his idea soup was the Internet. I think we've arrived at these business intelligence systems in a kind of backwards way: when I think of things like Yelp, where you're carrying your mobile device around and it says, hey, some information you wanted is in the area you're in, I don't think he ever imagined that, because he was coming from a corporate business perspective. But it's interesting to see that we keep rethinking these old models, and I guess the lesson here is that when you start to think about actual use cases and actual problem sets, you're able to get around technologies that were developed just for particular business use cases. In this open-source world we now have this soup of ideas floating around, and I think that by applying them to more and more different use cases, we're going to get a lot more interesting technology. So I just wanted to bring those things up.
I think another problem people have is: what tool do I use for the math component? Here's an example. A lot of people are interested in the tool R, and Hal Varian, who was one of the deans of the I School, has called R one of those tools you need to know if you're a statistician. It's an open-source tool, and it's fantastic. But is it accessible? That's a question that's going to come up: if you need to invest in something, is it really going to be R? There's also a growing number of people working with Python. Python is a general-purpose scripting language, it's very accessible, and there are a lot more Python programmers than R programmers. So if you're investing in tools, you might want to think a little about investing in something like Python: yes, it's not the best tool, it doesn't have all the features you need for statistics, and it's not yet used by as many statisticians as R, but it will be eventually, and it will let you build more interesting applications. I have some examples of that, actually. Can you guys see the screen? I was thinking about the differences between tools for statistics and math processing, and I was playing around with R, trying to do some stuff with tweets, and I realized there's a lot I don't know about R, but also that I wouldn't use it to build actual applications. There's another great Hilary Mason quote (she works at Bitly as the chief data scientist): R is great for exploration, but it's not good for building applications. You're missing the testing frameworks, and the programmers and coders. If you're doing real data applications, you need to express your work in code; you need to be able to build what you have in your head, and something like R is great for playing with data but not great for expressing your ideas when it comes to actually building a robust app.
There's another great Berkeley project called IPython, which you should really look into. Yes, I love IPython; it's fantastic, it lets you collaborate, and it's a tool you should really look into if you're interested in this stuff. Here's an example of something I was just playing with. I'm very fond of the Python data processing libraries: Pandas, which is great for time series data, and SciPy and NumPy, which you should look into too.
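As a taste of why Pandas is pleasant for time series work, here's a minimal sketch (the per-minute counts are made up) that rolls events up into two-minute buckets in one line:

```python
import pandas as pd

# Hypothetical per-minute event counts.
idx = pd.date_range("2013-05-30 09:00", periods=6, freq="min")
events = pd.Series([3, 5, 2, 8, 1, 4], index=idx)

# Resample into 2-minute buckets and sum each bucket: the kind of
# one-liner that makes Pandas nice for time series exploration.
totals = events.resample("2min").sum()
print(totals)
```

The same resampling by hand, with a dict of bucket timestamps, is a surprising amount of code; that gap is a lot of the library's appeal.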
I think I have links to all of those tools in my notes. I was trying to play with some Twitter data, and I realized I just didn't know how to express, in R, what I wanted to do: actually capture that data, process it, and do something with it. So I was hacking on this program that I call tweet stats; let me see where I am, and pull it up here. What I realized is that I can combine a bunch of different libraries, like the NLTK tokenizer, the Pandas library, and tweetstream, which is something people use to grab Twitter data, into a mash-up really quickly, and that's not something you can easily do with R. As data scientists, you're going to have to pick the right tool for the right job, and in this case I wanted to look at hashtags. So let's actually run this; I haven't played with it in a while, but let's see what happens. It's going to ask me for my username, and then it's going to read the public tweet stream, look at tweets, pull out hashtags, and use NLTK, which is a great tool for natural language processing, to compute some quick statistics; I just ran some easy things. So these are actual live hashtags, the most popular ones, and "love" is number two. There's not much in here, but this is an easy hack, and as a data scientist I want to be able to do these kinds of things quickly: combine a lot of different ideas. With something like R, which is a fantastic statistics tool (and I hope you all learn something about it), it's hard to build these kinds of things quickly in R.
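The demo itself isn't reproducible here (it needs live Twitter credentials), but the core of that hashtag mash-up can be sketched without the network. The sample tweets below are invented, and plain whitespace splitting stands in for the NLTK tokenizer used in the talk:

```python
from collections import Counter

# A few sample tweets standing in for the live public stream
# (the talk pulled real ones with the tweetstream library).
tweets = [
    "good morning #love #coffee",
    "new data post up #data",
    "so much #love for this conference",
]

def hashtags(text):
    # Whitespace splitting is enough to illustrate the idea;
    # the real demo used NLTK's tokenizer.
    return [tok.lower() for tok in text.split() if tok.startswith("#")]

counts = Counter(tag for tweet in tweets for tag in hashtags(tweet))
print(counts.most_common(2))  # [('#love', 2), ('#coffee', 1)]
```

The whole pipeline, capture, tokenize, count, is a few dozen lines of glue, which is exactly the kind of quick mash-up that's awkward to assemble in R.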
In my opinion, people are good at it, but I think what you're going to find in industry, when you're doing practical things, is more Python programmers, JavaScript programmers, and general-purpose programmers than people using these specialized statistics tools. The code around statistics is pivoting a little bit toward Python, away from proprietary technologies and toward open source, and I think that's something to look out for if you're interested in this space.
Another area is visualization tools. There are different ideas around visualization; it's a very rich field with a great deal of history, and it's hard. It's easy to make mistakes, easy to tell a lie with a visualization, and hard to convince people who are really knowledgeable about what makes a good visualization unless you're questioning it yourself. A mistake a lot of people make in data science is to pick the wrong types of visualizations without really understanding what's going on. If you're interested in this field, this is a type of tooling you really need to understand, at least well enough to know what makes a good visualization and a compelling narrative. In terms of building them, there are lots of great tools. There are commercial tools (Tableau is very popular, QlikView is very popular, and so on), there's a great open-source tool called Gephi which is good for graphs, and there's a huge number of startups doing visualization work, because communicating this data is so important. If you're working in something like R, there are also good libraries for that, and I'm a big fan of Python's Matplotlib, which is in that family around IPython and the PyLab stuff; you should definitely check some of that out. I think I might have some examples of the types of plots you can make; maybe we can run some of that right now.
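As a small example, here's a minimal Matplotlib sketch (the eruption durations are invented, in the spirit of R's Old Faithful data set) that renders a histogram off-screen to a PNG file:

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen; no display needed
import matplotlib.pyplot as plt

# Hypothetical eruption durations, in minutes (not the real faithful data).
durations = [1.8, 1.9, 2.0, 3.5, 3.6, 4.0, 4.1, 4.3, 4.5]

fig, ax = plt.subplots()
ax.hist(durations, bins=5)
ax.set_xlabel("eruption length (minutes)")
ax.set_ylabel("count")
fig.savefig("eruptions.png")
```

A handful of lines gets you an exploratory plot you can drop into a notebook or an email, which is the same quick-feedback loop that makes ggplot in R so appealing.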
think that if you're if you're building
and communicating it's very important to
think about the web as well and another
another kind of technology that you
should really be pretty proficient it is
being able to build web based
visualizations for sharing we've seen a
lot of great stuff coming out of the
Guardian UK data blog, which builds interactive visualizations to support data journalism projects, and Mike Bostock, who I believe went to Stanford and is now working for The New York Times. He's famous for a library: first Protovis, and now D3.js, which is an amazingly expressive visualization library, and in my opinion not for the timid; it's pretty advanced. But if you're interested in doing data science and you want to communicate it to the world, I think being a JavaScript expert and understanding web technology is very important, and that's something you should try to think about. The great thing about
some of these libraries is they're very
well connected to some of the other good
stuff that's happening at UC Berkeley
there's a big visualization kind of
academic world at UC Berkeley and I
think that there's a lot of innovation
happening I mean there's new types of visualization communication tools
coming out all the time
and it's not a dead field by any means; it's actually very vibrant, and
with the web and interactivity we're
coming up with new ways to communicate
this stuff all the time so that's very
exciting
I can give you some examples right now. If you're using R, the ggplot library is fantastic; let's see if I have some examples here. One thing that's great about R, I think, is that there are built-in data sets to play with, even though I just did a tirade about not using built-in data sets; this is fun for learning. One of my favorites is Old Faithful: it's a data frame with the length of each eruption and how long you had to wait for it. What's great about ggplot is you're able to plot some of this stuff immediately, come up with some amazing exploratory visualizations, and communicate this to other people in your organization very quickly. So you do a sort of a plot of these things... oh, and it didn't work for some reason, because I spelled something wrong: eruptions. These are the kinds of tools that I think you should definitely focus on if you're interested in building beautiful visualizations in seconds, and you should really think about some of these things, and also explore the D3.js library. I think that's a great way to go.
so let's talk about some other things
that I've done lately I wanted to show
off
some of the stuff I did at Google I/O I'm
interested in exploring the world of
Internet of Things and I don't know much
about it, and so what I wanted to do is start a project that I could just dive into to understand what it means to collect data.
I wanted discussions to happen around it
I wanted to talk to attendees at the
conference about how they felt about you
know Internet devices collecting data
all the time so what we did was we got a
bunch of Arduinos and we put sensors
on them and put them all over the conference area at Moscone Center, and we put signs on them saying this was a sensor collecting environmental data. So we collected temperature and pressure and things like that. We also collected things that were slightly more personal, like how many footsteps went over a motion mat, so we put mats on the ground,
people were walking over them and we
collected RF energy: we had antennas that would collect noise, so anytime there was cell phone energy at a certain wavelength, the reading would be higher or lower, and we visualized this stuff in real
time. So I was just playing with D3.js, looking for things like the 40 noisiest minutes at Google I/O so far; not a great visualization. And then we
met the Tableau team who were next door
and they took some of our data and did
some stuff with it this is an
interesting one: it's total steps per
minute per floor during the conference
And this tells a great story, at least to me, because I was aware of the context. The blue was the top floor, orange is the second one, I believe, and green was the first floor. This is sort of the keynote happening at this moment; everyone wants to go see the keynote, what's the new device they're giving away or whatever, so there's a lot of activity. But there's also some interesting stuff here that I didn't understand: what was happening right here, and what was happening down here, this little dip? And
apparently, on the floor where all the session rooms are, there was a kind of a dip, like nobody was around, and I couldn't figure out why. So I started asking people what was happening around 2:00 p.m., and that was the gift giveaway: there was a giveaway of swag, and everybody ran down to the second floor; there were
no sessions. These are the kinds of things I can look at and glom onto very quickly from this data project, to understand things that were totally hidden from me; I didn't realize any of this was happening, and it can inform us next year. Another thing that I was really interested in was something I called the serenity metric. This is probably not very good statistics, but I wanted to see which rooms were the quietest and had the lowest variance in sound. So I just combined two metrics, variance and audio noise, and I found that room three, which is off in the corner, quiet, with not very many people in it, was actually the most serene, which is very interesting. And then there was
this huge spike at the end, and I didn't know what that was at first. Apparently the label was "blimp," and what had happened was some people took some of our sensors and put them in these blimps that were flying around with motors, and so those were the least serene.
So I'm very interested in this type of metric, something I just couldn't put together before, and using visualization tools really helps me do some of these things. I thought it was a very interesting case study about putting these things together very quickly with some of these tools: we used D3, and we used Python to do some of the statistics as well. This is the kind of thing that I think is really fun. Another one that I think is really great is a study, and I have a link to it in the
a study and I have a link to it in the
notes about the positions in the NBA and
a very interesting study where people
looked at right now there's five
positions right there's two guards two
forwards in the center it is this right
you know is this is this a good way to
kind of think about sports players and
in this study they they actually looked
at clustering of the different
characteristics of these players and
found out that there's more like
thirteen or you know twelve positions
there there are types of people that are
clustered in different ways and I think
that's the kind of sort of type of work
you can do with some of these open
source technologies very easily so um
let's see I know we want to do some Q&A
later, so let me just jump ahead to what's really important to me: future trends. What does all this mean? Where is this going? What should we invest in, and what should you invest in if you're interested in this kind of work? Some of the things that I've actually seen working with many companies include a push toward everything in the cloud. Right now there's a lot of people doing things with clusters, with machines, buying bigger machines,
and I think a trend that's happening in
the world is application development is
moving to the cloud all of your data is
kind of going into the cloud when you
build an application, a game, a website, a lot of that technology is there, and with mobile devices growing in use, more and more application development is actually cloud-based, so a lot of data is in the cloud as well,
and so what's happening is you're seeing
a bunch of industry shifts where people
you know, things that were traditionally servers in the back room are becoming these kinds of virtualized, cloud-based systems. An example is Amazon's Redshift, which is a business intelligence tool that's completely in
the cloud; very interesting. The product I work on, Google BigQuery, is the same: it's completely hosted, it's an analysis tool for big data, it's completely cloud-based, and you access it through an API. What you're starting
to see is a lot of this development is
going into the cloud where APIs are
talking to different APIs and the kind
of limiting factor is the speed of the
Internet, and I think that's very interesting. I think that trend is just going to continue; there's going to be more and more of that. It's not going to solve everybody's problem, but it's going to solve, you know, 80% of industry use cases, and I definitely see that trend continuing. What that means is that applications are going to be more and more mashed up, and
we're going to have to think a lot more, even more than we do now, about policies: how security works, what protocols we're using, where that data goes, how we track and log data transfers. That stuff is going to be even more important, and this points toward
a theme that I'm always talking about: what jobs will be mechanized, and what jobs will require more humans. This is one of them, I
believe. The policy around these kinds of security considerations is very important. The same goes for some of the statistics things I was talking about: you can automate some of this stuff, but we really need more people who understand math and stats, who understand how to tell a good story and how to find a successful use case in all this data. To me that's very important. I
think that's a trend that you'll see, and you can actually see it already: more and more schools are creating data science and statistics programs. I was watching something on YouTube and I saw a commercial for Colorado State University's online statistics program. It's pretty important that we find more people who are knowledgeable about this stuff, and I think that's a trend that's continuing.
Another one is convergence; I talked about this before. What I'm seeing in the open-source world is the best ideas of different products coming together. Relational databases, things like your Postgres and your MySQL, are learning how to be more scalable, learning how to deal with more data. This is a trend
you see more and more, and on the flip side you're seeing that scalable databases, which are not necessarily the best for consistency but great for web-scale data collection, are getting more relational features: querying with SQL, the ability to look like a traditional database, easy access to that stuff rather than having to write code every time you want to do a query or a MapReduce function. So I see
a lot of this convergence and I see new
paradigms. I love the Mesos project at AMPLab right here in Berkeley, and the Spark project is a really interesting rethinking of how to do distributed data processing. I think that stuff is happening. When I talk about these things, it brings me to my suggestions for what you should do if you want to become data scientists. I have a list of things that I always think of as short-term skills that are very important, that I see time and time again; all the companies that I work with are often hurting for these kinds of skills. I call them the short-term skills because they're things you can do quickly while you learn about the more in-depth things. To me they are things like a working understanding of R, because a lot of people use R.
Proficiency in Python and JavaScript: I think these are really key. When you're doing actual data analysis like I just described, some of the things you can do with Pandas and the IPython suite are very important, and there's a growing number of people moving in that direction. And why JavaScript? To communicate ideas on the web, because everything is moving to the web. I see that more and more: data journalists need to understand it, people who are writing the code around some of these projects should be doing that, and municipalities who want to share their data too; I think dashboarding is very important for them, and for being able to help communities do open data projects. It's very important that people understand how to do things on the web, and so JavaScript is a big one.
Learning SQL: I think SQL is not going away. Query languages are being glommed onto other tools; you'll see things like Hadoop with Hive and Impala, things that use SQL to actually query those data sets. BigQuery itself actually uses SQL, even though it's very far from what a relational database is. So, an understanding of SQL, and learning how to work your way around a UNIX shell, I think is very important; a lot of this stuff is being done on virtual servers, on things that look like UNIX, and understanding some of that technology is very important.
Distributed data tools: you know, Hadoop is not going anywhere. It's very well used, very well understood, and it's evolving every day. Be able to run a Hadoop instance locally; try it out if you've never done it before and you're interested in being a data scientist. Learn what hurts about it, learn what it's great for. It's just one technology in a wide spectrum of ways to solve problems, and it's interesting to know when it works and when it doesn't. Write a streaming MapReduce job in Python; I think it's very important to understand how to actually process some data, how hard it can be, how easy it can be, what the limitations are, because in your job you're going to need to do this sometimes as well. And build a toy project with a non-relational database; there's a link here to about 20 non-relational databases, MongoDB, CouchDB. Understand what they do; understand how to solve this pipeline problem, because you're going to see it time and time again: lots and lots of data coming in somewhere, and you'll need to be able to process it. Those are the short-term skills, things like that.
Long-term skills: dive into statistics. Understand what k-means clustering does, what linear regression is, what a Bayesian model is, and when you use these things; why would I use one of them? A real data scientist needs to be able to understand this and be able to tell the right story at the right time, and I think this is very key.
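For instance, ordinary least-squares linear regression, one of the models just mentioned, fits a line y = a + bx with a closed-form slope and intercept. A from-scratch Python sketch, with invented data points chosen to lie near y = 2x + 1:

```python
# Simple (one-variable) least-squares regression from scratch:
# slope b = cov(x, y) / var(x), intercept a = mean(y) - b * mean(x).
# The points below are invented and lie near the line y = 2x + 1.
from statistics import mean

xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [3.1, 4.9, 7.2, 8.8, 11.0]

mx, my = mean(xs), mean(ys)
b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / \
    sum((x - mx) ** 2 for x in xs)
a = my - b * mx

print(f"fitted line: y = {a:.2f} + {b:.2f}x")
```

Knowing what this formula assumes (a linear relationship, noise in y but not x) is exactly the kind of understanding that separates running a library call from telling the right story with it.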
So I think these are the kinds of skills that we just don't have enough of in industry; we need more people who understand mathematical modeling, what it means, what the limitations are, and when you would use one thing versus another. And then data visualization: explore the world of data visualization, a very rich, deep field, and one that I think is more and more important every day. I'm looking at the product spectrum right now around devices: there's a great device coming out of graduates of the I School called Automatic,
which is actually something you plug
into your car and it tracks data about
your car for you. There's a tool called Fitbit, there's a tool called Nest, things that are data collection devices for the quantified self, and I think you'll see more and more of these things. The key component of that is actually to show data in a way that you can use, and so the whole visualization area is one that is going to become more and more important
to consumer devices, and we need more people who understand what that means, who know how to tell a good story, as more and more people come online and use some of these tools. I think that's a key thing.
Also, just being able to tell the story: if you're interested in data journalism, being able to talk about what you're doing, to communicate the story to people, is very, very important. I think understanding how to build an interactive visualization that is compelling is a very important skill. And then finally, I think
you should throw away everything I just said, because the most important thing is actually to dive into a real data set. I see so many people who have never really played with real data: they read about the technology, they play with a toy, they might even be developers
building things. I was talking a lot about the AMPLab; one of the things that I've heard the AMPLab needs to do more of, just to move more into the mainstream, is actually solve real problems, and they are. But I think this is key: your technology needs to solve real use cases or else it's worthless. And I think what's
happened in the Hadoop field is that
people have taken Hadoop and they said
let's use this for something more than
batch data processing let's use this as
a data warehouse let's use this as a
machine learning tool, and they're finding that it's not always the best thing. The product I work on, BigQuery, was actually a reaction to that: it was a tool developed
at Google because we found that we were
using MapReduce to do aggregate queries
of large datasets and it was just too
slow, so we redesigned our system from the ground up, and it became
something called Dremel which is
something you can read about; there's a Google research paper on it. It has
MapReduce paradigm it's a completely
different tool and we're seeing that pop
up time and time again so actually
getting your hands dirty with a real
challenge, a novel challenge. You can keep running queries over the same old data sets all the time, but go to your municipality and
solve an open data problem go to a
non-profit
and solve a data problem that doesn't even require a lot of data; maybe the problem is complexity, actually taking two data sets and merging the schemas together. Work on a project; look at the final projects from the I School MIMS teams. They do amazing, very novel work around data sets that are not necessarily big data; maybe it's complex data, or maybe it's a space that's never had a data perspective, and
I think the real key to becoming a data scientist is finding these real use cases and understanding how to solve them. You might not always solve them in the right way, but going through the process of seeing the nitty-gritty of dealing with data is very important, and real data scientists, to me, are people who have gone through that and understand how to deal with some of those problems.
Before I wrap up this section, I wanted to show you a problem that I've had that I might want some of your help with. I was playing with this BigQuery tool that we have, and we have this data set in our public data samples called the natality data set. It's many millions of American births; I don't remember the actual dates, but I think it's something like from 1972 to somewhere in the 2000s. The details look something like this. To me it's a small data set, but it's actually
got a hundred million records, and the data looks like this: the day, the month (hopefully you can see this, it's a little bit small), the state, some demographic information, the weight in
pounds. So I was just playing with this data, running queries for no reason; I just wanted to see some information. I ran this query where I selected the average weight plus four times the standard deviation; basically I wanted to see what the fattest babies were like, what that cutoff is. So let me run that now. The high was 12 pounds; wow, that's pretty high. That's four times the standard deviation plus the average. That's pretty fat, I guess; I'm not a father, so I don't know how that works.
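The cutoff logic behind that query, sketched in Python with simulated birth weights (the real query ran as SQL over BigQuery's natality sample, with on the order of a hundred million rows):

```python
# The cutoff logic behind the query: take mean + 4 standard deviations
# over the whole column, then keep the rows above that cutoff. Here the
# "column" is 200 simulated ordinary birth weights plus three extreme
# ones; the values are invented for illustration.
import random
from statistics import mean, pstdev

rng = random.Random(42)
weights = [rng.gauss(7.3, 0.5) for _ in range(200)]  # typical weights (lb)
weights += [12.5, 12.8, 13.0]                        # a few extreme records

cutoff = mean(weights) + 4 * pstdev(weights)
heavy = [w for w in weights if w > cutoff]

print(f"cutoff: {cutoff:.1f} lb, records above it: {len(heavy)}")
```

Note that the extreme records themselves inflate the standard deviation, so the cutoff lands well above the typical range; with only a handful of rows, a mean-plus-four-sigma rule can easily select nothing at all.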
So then I decided to ask: well, which states have the most fat babies? I'm just going to do a count. It's probably California or New York; what do you guys think? Something like that? Mississippi, okay, those are good choices. Let's see if I have that here. So I took that standard deviation cutoff, and I'm saying give me babies where the weight in pounds is greater than that, by state, and I'm grouping by the month and year so we can see what month. Running this query, I found that Maryland in 1990 had 22 really fat babies, and of course you see New York and California and New Jersey. So okay, here's an example of a
big data problem, because I don't know why this happened. It might be bad data, it might be something else; I don't know. I have the huge data set, I have a powerful tool for crunching through the data, and there's no information here that makes any sense. This problem is not a big data collection problem, and it's not that I'm lacking the tools; it's that I don't have a well-posed problem here. I need to figure out what this means
to me. So this is an example of where I need somebody who is interested
in statistics and interested in going to the newspapers and checking: were there really that many babies? Was the hospital's data incorrect? What was happening here? Was this just a bunch of duplicate data? I don't know, and I still don't know; I haven't had a lot of time to dig into this, but it has piqued my curiosity, and so
this is what data science is all about: finding out this stuff, the nitty-gritty. Maybe this data set is just broken, I don't know; maybe there really were
babies in Maryland in December 1990, this huge number of really chubby babies, for some reason I don't know, and maybe that points to something. But to me, starting with the technology, the tools like I have here, and working backwards is the wrong way to do it.
what we really need is to start with the
real problems right real problems might
be looking at social data and wondering about something. There was a great Center for Investigative Reporting report, which the Bay Citizen actually published, about how late Muni was. They looked at some data sets, wondering: why is Muni always late, and why isn't Muni doing something about this? They actually found that Muni was reporting the wrong times for the buses: things were reported as on time when they were actually much later. These journalists solved that problem by looking at the data, and they started with the problem: why isn't Muni taking this seriously? And I think
that's the kind of challenge that you guys, if you're starting this process of learning about data science, should grapple with first, and you'll learn a lot from that. The technology is really secondary to answering some of these questions and solving some of these problems. So
I'll leave it there for talking about some of this technology, because we're sort of getting close to three. But if anyone wants to help me solve the mystery of the fat babies, please let me know, because I'd be interested in working with you on a research project.
So let me put this back up here so you can... if I can find it... oh, I lost it. Anyway, I'll put the link back up to the notes, where I have links to almost everything I've talked about. Since we're getting sort of close to
three, I wanted to open it up for questions. I wanted to make sure that you guys had a chance to talk and discuss some of the things you might want to cover around data science, around tooling, if you have questions about what this field is like and what we're lacking, or if you have questions about some of the stuff I was just ranting about earlier. Let me know; I'm here. Let's discuss some of these things.
Do we have a microphone for questions, or is there a way to do that? There was a microphone around here... okay, oh, is it broken? I'll make a new one. Ah,
okay so one of the reasons I wanted to
come and meet you guys and sort of like
kick off this thing was to understand
what your interests are. Why are you at the DataEDGE conference? What's your motivation? What kinds of things do you want to learn, and what kinds of issues and technologies are you interested in talking about? I'd love to
get some feedback from some of you about
about that and if you have any questions
for me about some of the things that I
do I'd love to talk to you about that as
well yes
Maybe we can pass the microphone. Sure, we have the microphone; I have the mic, I had it all along. Yeah, please. There's a question here, and maybe you guys can just pass the mic to each other. Hi, thank you for your presentation. My name is Matt Cho, I'm a student at Carnegie Mellon University, and I just want to ask you about the details of what you said was the new paradigm diverging from the relational database. Yeah,
I wasn't sure how familiar people are with some of these things. So, relational database technology was really all about an idealized understanding of what data should look like. In the 60s and 70s and even the 80s, people like Edgar Codd were interested in rigid consistency, meaning every time you make a transaction to the database and add some data, it is locked down and consistent, and once you query for that data it will be the same. And we've seen in the past decade, basically because of the internet, that that type of system doesn't scale well past a single machine.
right so the idea was you'd have to keep
getting bigger and bigger machines as
you got more data coming in. And what happened at places like Facebook and Google, places that were handling a lot of data, for example indexing the web from multiple clients (you know, Google would send out these spiders, look at every single public website, and send that data back to some servers): well, a single server couldn't handle that. So people started to think, maybe we don't need that rigid consistency; maybe we could spread this work across ten machines, and one machine might get a set of data and another machine might get another set of data, and they may not be consistent at first; we'll find some back-end process to make them consistent. And so they threw away the idea of strict consistency. You sometimes hear this paradigm called ACID, which stands for things like atomicity and consistency, and they started to throw away some of those things, relax some of that strictness. And that's why
I brought up the idea of this sort of military person who came up with this, because to me that's sort of what it was about: it was an idealized mythology around what data should look like, when in practice what people really needed was a system to collect data quickly, and consistency wasn't the main problem.
and so new theories came out actually
there's a person from Berkeley, Eric Brewer, who came up with something called the CAP theorem, which says that you can have only two of the following: consistency, availability, and partition tolerance, which means that you can actually split your work across multiple machines. You can't have everything. And so to me that is a relaxation of this mythology of idealized data in order to solve a
problem and you'll see that more and
more. You tend to see that in this space because it's rapidly innovating, and there's lots of little projects working on different things. Here's another example: there's a difference between collecting data quickly, having a system that's always on with data streaming into it all the time, versus a system that's quick to query. They're very different, and unfortunately we're in a state where there are very different technologies for those little niche applications, and that's why you get these kinds of confusions around which non-relational database to use. Do I use Mongo, do I use Couch, do I use Redis? They all kind of look the same, but they do very different things, and part of the job of being a data scientist at the moment is to tease out those differences for different use cases. It's a tough job, actually, but there are best tools for the job.
That being said, you can use one of the other tools for something else; you can use the wrong tool to solve a problem, and what's most important is that you solve the problem. These things only really get you into trouble when you scale up, when you're at Facebook scale and Google scale and Twitter scale, when you really need to be efficient and every single efficiency saves you thousands of dollars. When you're just starting, it's good to just understand what these things do. We're dealing with a weird, inconsistent space in technology; we'll figure it all out, but for the moment we're stuck. Thank you.
I'm Vladimir Zadorozhny University of Pittsburgh
School of Information Sciences my
question is, in general, how much science do we think we have in data science? What I'm saying is, Edgar Codd laid out a nice foundation for what turned out to be a sustainable technology, and that's why we have relational databases. What we have right now is a zoo of tools, and skills which allow us to use those tools to solve some kinds of problems, so it feels more like engineering, and perhaps even craftsmanship. In a way it resembles the situation we had in the pre-relational state of database technology, where people were doing many things with many different kinds of tools. Yeah, I love this
question; this is actually a very core theme of what I've been thinking about a lot: what's the timeline for what a data scientist actually is? I remember last year I had a mini debate with Anno about what the data science field is; do we need another degree for this? And I think the answer is that it's a new role, but
it's going to change rapidly I'm always
trying to find, as I mentioned before in the talk, the human element of what a data scientist is, because that's the thing that will probably persist. And I'm very concerned with the things that we can automate, because I want to automate them; I want to get rid of some of this stuff. Right now a data scientist does have to write code to glue systems together, and I'm hoping that'll go away; that's the kind of craftsmanship that I hope goes away. But what I don't think is going to go away is the storytelling: being able to understand what a good data question is, and how to compellingly tell a story to convince others that this is valuable. I think that's not going away, and I think we need more of those people. So to
me, statisticians, people who know how to write code for visualizations, people who know how to do math are not going away; they're not going anywhere. The engineers who put together MongoDB clusters probably will be going away: you have to do that now, but you might not have to do that in the near future, and I'm hoping we won't. You'll see that with Hadoop as
well I think the Cloudera guys are in
the other room and you could probably go
talk to them but I think what's
happening in the Hadoop world is that's
all being virtualized like I don't want
to set up a Hadoop cluster I don't care
I just want it to work I want to throw
some data at it and get an answer back
and I think that
whole industry around distributed
batch processing is moving toward
automation
and you're seeing that right now you're
seeing people like MapR building APIs
to existing long-running Hadoop
clusters and that's happening so it's
something that we have to watch but I
think finding those human elements
around statistics and math and telling
the story is what's going to persist and
that is a real field what DJ Patil
said about needing to name this because
it's an actual role that we have in
companies I think that's going to stay
for a while I don't think that's going
away I think that's something that's
going to persist for a while and by the
way you just mentioned it - relational
databases are still useful I
didn't mean to say that they're obsolete
they're just useful for a particular
niche of problems
sure that was a great question
and one that I have to think about some
more hi Michael hi my name is Ethan I
work at Students First we're a
non-profit based in Sacramento we are
engaged in fixing our K through 12
public education system yes and so one
of the things that we see as kind of an
emblem of the education reform
movement and the education system here
in America is that there's so much data
available but we are concerned at how
little of it is actually being used to
create better systems and better
outcomes such as student achievement so
we're here to try to really learn more
about how we can better use data
analysis and visualization to address
these types of problems and
communicate solutions in a compelling
way so just to follow up is
that data that you're talking about
accessible I mean is it out there to use
or is it in weird formats that you have
trouble accessing and what's the problem
what's the actual problem with the data
that you're talking about so I would say
it's twofold the first issue with
the data is that often it comes in
incredibly hard to process formats so
that's one factor we've come to learn a
lot about you know parsing data quality
control - that's the problem I see -
that's the trouble yeah please
absolutely and the second area that this
is more of a future facing question as
we see it strategically but a lot of the
data addresses status
quo and the question everybody wants to
know is how can we make student outcomes
better and that's kind of the big
question it's very industry specific
but that's the question
everyone wants to get at but all the data
is kind of addressing what's currently
happening right without that one element
so we're in a position where we are
working with researchers and working
with people that want to find data and
they're asking us what do we what do we
need to look at and we're also on the
backend trying to figure out how to
visualize that by the way this
reminds me when I was at the I School
I worked on a paper and I was doing I
think I was working with Jenna Burrell
in a class that I was looking at how
public schools report data in the Bay
Area in the East Bay and I realized it
was easy to find well relatively easy to
find general statistics on attendance
why because schools get paid on
attendance that's how they
raise money a school with high
attendance raises more money so they're
really good about collecting that data
and they're really bad about
everything else like reporting you know
that stuff should be public I don't mean
that we should have private information
about a particular student but I want to
know how the school is doing and it
should be pretty and you know in my face
I should be able to find it absolutely
and so have you can you give me some
examples about what what kind of data
you're looking for absolutely so um so
just to get at that point one of our big
issues is data transparency on the
fiscal side fiscal efficacy you know
knowing what kind of dollars how they're
being spent to to address which types of
programs and what are the outcomes of
those programs there's a wall between
how money is spent and what the outcomes
are in terms of data collection and in
districts' and schools'
ability to actually even collect that
data even if they wanted to because of a
wide variety of reasons and then on the
other side we have national trend data
to address the achievement gap issue and
knowing you know so we know for example
on average the per pupil expenditures
about ten thousand dollars public
perception is that it's about five
thousand dollars but we know that it
goes up to twenty twenty-five thirty
thousand dollars in some of the poorest
districts yet as more money
goes in the achievement levels have
not changed yeah what a fascinating UC
Berkeley I School type problem right
it's like it's not even a technology
problem like there's so much more to it
right
you you need to find incentives to get I
guess lawmakers and school
administrators to be able to report this
data like like the incentive of getting
attendance equaling dollars there's no
incentive for them to report anything
actually they want to hide some of this
data probably if they're underperforming
schools absolutely so that's kind of the
big question that we're coming to this
conference with just to give you an idea
of what kind of compelling data
visualizations we can use to address an
industry that's really trying hard to
use data in a in an accurate and
compelling way this is a great use case
and it shows the complexity of what
you're trying to do like you don't quite
know what you want you don't
quite know what those visualizations are
you don't quite know how to get the data
you just know that there's some data
you're missing and you
need to convince people of something and
it's not even a technology problem like
the technology is probably really easy
in this case I'm guessing we
would generally think it's probably the
easiest part of your problem but yeah
I'd love to talk to you maybe we can
talk a little bit offline about some of
these things I used to actually do a lot
of work with nonprofits and I'm just
interested in the basic use case
absolutely thank you yeah it's also the
kind of thing I hope you share like
you're a non-profit
when you figure out some of these
things share share share like share it
online put it on a blog send it around
I'd love to see more people look at this
thanks Michael it reminds me of another
quick story while you're passing the mic
um this is a rumor but I believe the NHS
in the UK they report things like hospital
accidents and we were joking with
somebody who was really interested
interested in this field and he
was telling me that it's better to go to
a hospital with more accidents because
those are the ones that report them
because what they were
tending to find is people with lots of
accidents were actually burying them you
know they wouldn't report them and so
it's just interesting to see this is
kind of the problem here is like in the
school I'm guessing that if you're an
underperforming school and you're
pumping money into the
school and it's not doing well
somebody probably wants to hide
that data it's a
real problem a real data problem so
there was a question over here okay yes
so early on in the talk you were talking
oh I'm Mark Giordano I'm an alum of the
master's program awesome a little bit
earlier than you yes nice to meet you Mark um
so early on you were talking about how
the gathering of the data is not the
sexy job it's the dirty grit-under-your-fingernails job and then not two minutes
later you're talking about people
gathering the data normalizing it
putting it up on the web to be given
away on an FTP site to someone else so I
don't I mean I'm sure that you're aware
of that happening but if it is happening
why would anyone do that and then give
the information away to someone else to
get all the glory and fame of doing the
analysis and then getting quoted in the
media and telling the story which is
going to be a lot cooler than having
gathered the data in the first place
that's a great question if
I'm a government I want people to
use data for some public good I'm
guessing like this example with schools
is a good one right it's hard to get
that data I mean look what he's going
through to get some of that data but if
I'm a school district and I want some
public good to come out of the data I do
want to share it and I want to make it
accessible to people in some way and that
is hard that takes a lot of
non-technical work to figure out what
that is and so I just think ok taking a
step back I just think there are good
success stories in sharing data in
public datasets and open datasets
it's not a magic silver bullet that will
solve all you know social problems not by
any means but I think there are really
good reasons to have it out there at
least to inform journalists at least to
inform academics to actually tell
stories around this and I think that we
we just have a lot of challenges there
so ok yeah there is a lot of glory in
some of that stuff but you know we do
need it to happen but you were
specifically talking about public
datasets that we'd already paid the
taxes to gather you weren't talking about
privately funded gathering of data actually
I'm not exactly sure what you're talking
about at this point I was talking about
just the process technically of any
kind of data collection so one example
is these game companies I work with a
product called App Engine and a
lot of game companies use it as the
backend of their game and they're making
mobile games and social games and you
know a lot of that data it's random data
right it's different users
coming in from different countries with
you know it's a game right so it's a
it's a Tetris game and they they want to
have information about which blocks were
being played and these kind of things
which items were in the game and this is
a lot of complex data right it's the
schemas can be different they have to be
able to have uptime all the time to
collect you know millions and millions of
records a day and so that's a hard job
that's actually what I was trying to get
at is that process technically is very
difficult it's not easy to do
by any means and I can actually tell
when I meet someone who's gone
through that process because
it's such a hard process and I think
what we talk about mythologically is
that big data will solve all
problems but if you actually get into
the nitty-gritty of it you'll you'll
have a different perspective about it so
I think I was trying to sort of make
that distinction but you're talking
about different things public versus
private datasets and I think there's a
lot of differences there thanks a lot
Maggie Kelly from the Environmental
Science Department and great talk but
this last question just prompted me to
follow up on that and in the academic
world
we're being increasingly pushed to
release our data and to publish our
datasets and there are a few kind of
predictable and unpredictable barriers
to that and one is just this reluctance
of a lot of scientists - competition in
publishing right - well that's right and
it's very realistic you can understand
it because that's the way we advance
right so there's these this internal
resistance to putting data out even
though our funding agencies are really
pushing us but there's also
technological barriers to doing that and
so the data scientists that are being
trained here and at the I School have the
skills but your average ecologist does
not right so what are the improvements
that could be made to help get the data
out and usable
yeah so I don't know I mean this is
where I can say I'm an engineer I don't
know what the answer to that is but it's
a big problem right so I've actually
been talking to a lot of people about
this problem specifically of I
don't want to share any data until
I've published and maybe even then I
don't want to share any data
correct me if I'm wrong but I've seen
studies where the verification
of some of these sort of data projects
is not there like they're actually
maybe misleading and these are cases
where you really want a lot of eyeballs
on a dataset if they're telling stories
with statistical models to say well
maybe you're wrong you know this is a
case where I think we need more of that
well I think often you do want a lot of
eyeballs on the data right but
it's a different culture yeah
the typical ecologist say isn't part of
that culture and I think that if
the datasets are shared you get a richer
understanding of the ecology because you
see more you know you see the trees as
well as the forest right so but it's
it's a challenge I'll answer the second
part of the question first I often
introduce myself - I didn't do that today -
as somebody who
believes that what I call utility
computing or cloud computing and and
sort of these kind of big data projects
in the cloud will actually lower the
barrier of entry and I'm this is why I'm
a big fan of automating some of this hard
stuff here's an example a lot of
you maybe use R as well
for statistics and it's okay it's hard
to use R on large datasets it's just
it's not really meant for really big
datasets and there are people that need
that capability so what I want to see is
people and this is happening actually
right now people are working on running R
on unlimited-size datasets and actually
bringing the cost of computing lower so
you can throw whatever data you want
into some kind of virtual server in the
cloud and run R and not worry about
limitations right those things I
think are getting
out of the way basically an analogy
is the way that well a lot of you
probably use like Hotmail or Gmail or
some kind of cloud email UC Berkeley I
think switched to one of those and
and a lot of that administration that
we used to have is you know all
kind of abstracted away and I think this
is happening in data it's just going to
take a little bit more time so that's
happening and I look forward to the day
when you have all the computing power
you need with very little effort and all the
tooling you need coming soon it's not
quite here yet but it's coming soon and
the data scientists will help build it
the first part of the problem I don't
know I'd love to hear some of your ideas
or some of the ideas around here about
solving that problem because it's a huge
problem yes I agree I think it is going
to take a huge cultural shift and maybe
different incentive shifts because I
know for you guys
publishing is what you're really after yeah
it's different incentivizing of the process or lots of
retirements yeah yep
yes was there another question hi my
name is Fernando Perez I'm here from UC
Berkeley - it's good to meet you - nice meeting you
and I do have a question on
something that you touched
on earlier and then I want to make a
comment about something from the from
the I School so I'll mention that first
that Raymond Yee
 one of the instructors from the
I School just finished teaching a course
called working with open data yeah this
past spring semester and they did a
spectacular job his students put out
just incredible projects working with
open datasets and all of the tools stack
that you were talking about Raymond and
I are putting out a blog post just later
today we just finished it to kind of
showcase the projects it was really
beautiful right ah and he did a
great job with that but my question goes
to the sharing and hosting of public
datasets because we just ran into this
problem with a colleague just in the
last couple weeks we've been trying to
find a convenient way of making
available a data set on the order of
100 gigabytes kind of too big for
Dropbox too small to really go into
the big systems you start looking into
things like Amazon S3 has public
datasets but it's only free for internal
within S3 access and we want something
where we can host it for normal whether
it's external access or within the cloud
access and it turns out it's not easy I've
been talking to the California Digital
Library folks and yes the Cal
library has a project that may be a
solution if you use it - same thing happened
to the Library of Congress exactly -
this it's not easy and so I mean you
touched on it very briefly but it's
really a critical problem for us right
now we have these datasets they're part
of the validation of open source
projects we need to make them available
as part of our test suites that are
publicly available for research and
the hosting is kind of a nightmare it's
hard and also I don't know
what format this data is in
but - this is biomedical data
so these are basically
binary blobs with brain
imaging data in this particular case but
yeah but the problem is kind of generic -
so I've been thinking a lot
about this things like Amazon's and
Google's offerings are cheap
but they're not that cheap like it does
take some money maybe this is a new
industry that we need because I think
this is coming up time and time again
I keep meeting startups - I think Data
Market is one of them - that are
interested in making a data marketplace
and having accessible data that's kind
of cheap and I think you'll probably
have to pay something you know even as
an academic institution there's
there's gonna be some fee because it's
not free but I think maybe this is a new
industry we need because you're right
it's not easy that sounds like a startup
opportunity to be honest and part of
the issue is that we're all getting
the NSF's new guidelines the NIH's
new guidelines everybody is now coming
out with guidelines where you say when
you put in your grant proposal you have
to actually have a data management plan
for
how you're going to share these datasets
for the long haul well the grant runs
out in three years so how are you gonna
pay for the hosting of these five
terabytes of data for ten years for 20
years who is going to archive these
datasets for 10 to 20 years yes it's a big
problem so we have a huge unsolved
problem here it sounds like um
fundraising hasn't happened for that
problem you're talking about - well
there's kind of a blind spot on that
actually at the fundraising and
solicitations level from the funding
agencies - yeah I think this is
another thing we need to talk about a
little more I mean I'm interested in
this problem as well to me it
sounds like either a startup
opportunity or somewhere industry hasn't come
through you see the Amazons of
the world this is not quite their focus
but I think somebody needs to deal with
this problem um one thing that's happening
is the cost of that will start coming
down right so like longevity
of public datasets and how how
accessible are they right so in this
example you have an NSF grant for let's
say three years I'm just making that up
three years to share data - petabytes
of data? - how much? - a hundred gigabytes
hundreds of gigabytes let's say well
we have terabytes but not petabytes or
exabytes right now you might be able to
pay Amazon a subscription fee or Google
or somebody like that or Rackspace a
subscription fee to host this data but
your money will run out and what's going
to happen to that data actually who has
access to it where does it go this
is actually a big - I agree it's a big
problem and I think this is another myth
about the cloud and in terms of big data
is it's going to be there forever and
you know it's going to be available you
see people like archive.org attempting
to archive the whole internet but for
your purpose there doesn't seem to exist
something like that and the huge myth is
that even the storage fees that you can
actually rack up on Amazon the only
known cost is for storage but the big
question mark is for transfer if your
data set proves popular in the third
year of the grant you'll have a hundred
thousand dollar transfer bill because
when data moves out of their cloud
that's when they charge you and so
it's kind of unmanageable there's a
complete mismatch between the funding
model and the reality of storing and
accessing these datasets and as for the
solution there just is no solution I
agree with that yeah and in fact I don't
work with as many nonprofits and
research institutions as I want to
because I mean the product I work on
is a premium offering like it does cost
money to use it and I do want to try to
give discounts to people at some point
but it does cost something and to have
this kind of like this guarantee of data
being there for the long haul I think is
it's just a space there's no there's no
product for it
sounds like a startup opportunity to me
the problem is this so this is a good
idea like having data marketplaces come
up time and time again but what's the
value of this data how many people are
willing to pay for it and how much I mean that's
sort of the supply and demand issue
around this data might be tricky I mean
there's a great data set that's
available from the government the
airline on time data I love this data
set every month the
airline board publishes each airline and
how late each particular flight was and
a lot of people would be interested in
this data I think you can do a lot with
that
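The kind of question you would ask of that airline on-time data is plain SQL; here is an illustrative sketch using an in-memory SQLite table with invented rows and hypothetical column names, standing in for the real dataset and for a hosted query service:

```python
import sqlite3

# Tiny invented stand-in for the airline on-time dataset;
# the column names here are hypothetical, not the official schema.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE flights (carrier TEXT, dest TEXT, arr_delay_min REAL)")
conn.executemany(
    "INSERT INTO flights VALUES (?, ?, ?)",
    [("AA", "SFO", 12.0), ("AA", "SFO", 30.0), ("UA", "SFO", 5.0), ("UA", "JFK", 0.0)],
)

# The shape of question you'd ask of the full dataset:
# average arrival delay per carrier into SFO.
rows = conn.execute(
    "SELECT carrier, AVG(arr_delay_min) FROM flights "
    "WHERE dest = 'SFO' GROUP BY carrier ORDER BY carrier"
).fetchall()
print(rows)  # [('AA', 21.0), ('UA', 5.0)]
```

against the real data the same query shape would run over millions of rows on a service instead of a toy in-memory table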
but this data set you're using I don't
know what it is I'm guessing not a lot
of people are interested in it there's
not a lot of value around it probably
to make money off of is that true
it's just for academic yeah these are
specialized research datasets right that
cater to fairly small communities and
that to me is the problem right like we
have to figure so I agree this is a big
problem we have to address and it's one
I haven't thought through the thing you're
talking about is a great use case I'll
have to ponder this some more but let's
keep talking about it great great
questions you guys are smart what did
you have a question did somebody else
have a question over here I saw someone had  a
hand up this is a really great
discussion I'm pretty impressed yes
that's you hi I'm Sarah and I'm actually
a MIMS alum too from 2006 and now I
currently work up the hill at
Berkeley Lab so you know we have a lot
of use cases a lot of data a lot of
engineers that can make tools to create
sort of specialized applications for
scientists to analyze their data they're
always very domain-specific they're
always a kind of one-off every project
needs their own sort of dashboard or
something like that and so I'm wondering
if in your work with different um big
data problems are you seeing certain
patterns in the big data questions that
could become
generalizable solutions instead of
developing these one-off solutions
totally like I don't mean to keep
bringing up Google but that's my
main experience with this I
think internally we've done a good job
of trying to
make things into services
you're in a different spot because you
have a lot of different things going on
whereas Google as big a company as it is
really does just sort of the same things
over and over
but this is where I
think a lot of data things are going
when I say moving to the cloud I also
mean moving towards services machine
learning as a service right things like
that for common use cases I brought up
k-means clustering and linear
regression and when I meet with
customers I hear that all the time they
just want a linear regression model or
you know a multiple
regression model and I think some of
those things are automatable some with a
little bit of human help for you know
guiding guiding the ship but I think
that's where a lot of this is going so
you're gonna - I've actually seen this
happen a lot - you're going to start
seeing more and more startups and
probably people like the
Amazons of the world providing more
things as a service the
Redshift product is a good example right
it's business intelligence as a
service it's just something out there
and it does a limited amount of things
and you're going to start to see more
and more if you haven't already of you
know MapReduce as a service and
some of the mathematical modeling as a
service and maybe even one where you can
define your own mathematical models
there's a tool that you'll hear about
more if you haven't called Pig and it's
a workflow modeling tool and there are
libraries like one of my
favorites Cascading for Hadoop
it's a Java based library for
writing workflows and so I think what
we're missing is sort of the workflow
SQL I mean Pig might be that but a
way to define data workflows in an easy
accessible way that some system can
consume that's what we need more of that
and that's coming but we're not there
yet I think that's where a lot of the
stuff is going for your purposes though
you might always have to do some custom
things because you're you know doing
special things that very few people in
the world are doing I guess let's hope
for services that was a long answer to a
short question sorry about that
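The linear regression customers keep asking for as a service is, in its simplest one-variable form, a closed-form calculation; a minimal plain-Python sketch with toy data (no real service API assumed):

```python
def fit_linear(xs, ys):
    """Ordinary least squares for y = a + b*x, the closed-form version
    of the one-variable regression those services would automate."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope: covariance of x and y over variance of x
    b = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / \
        sum((x - mean_x) ** 2 for x in xs)
    # Intercept: the line must pass through the mean point
    a = mean_y - b * mean_x
    return a, b

# Perfectly linear toy data: y = 1 + 2x
a, b = fit_linear([0, 1, 2, 3], [1, 3, 5, 7])
print(a, b)  # 1.0 2.0
```

a hosted service would run the same math at scale with the model fitting and data plumbing abstracted away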
hi I'm Chris Hoffman I work here at UC
Berkeley in the central IT group in a
group that is supporting research on
campus and we do a lot of work with
museum collections we do a lot of data
work so that this pattern of dealing
with dirty data trying to clean it
normalize it and then make it available
is very familiar to us you've been
talking about you know tools or
solutions for some of that nitty-gritty
work that you mentioned as being really
hard right do you have any good
pointers yeah there are people
working on this and off the
top of my head we had a great product at
Google called Refine which I think is
now an open source tool the limitation
there was the amount of data you could
deal with but it had a great
interface for some of this stuff and I'm
sorry that I'm not very familiar with
tools that are helpful in this space but
this is - so I'll say this - a
problem that is one of the number
one problems for some of the big
businesses I talk to and it's another
gap it's another industry gap of like an
easy-to-use tool for normalizing data
for you know dealing with missing data
sets extrapolation of data that's
missing and I'm sorry I
don't have good information for you off
the top of my head but people are working on it
and you're going to start to see at
least more commercial tools in the space
and they're going to be in the cloud
they're going to be services like that
where you you know stick a
bunch of data in the cloud and
you say I want all the US United
States and USAs to be the same thing
and you run that workflow and that
that's coming but it's not quite here
yet
okay so that's great yeah right
that sounds great I'm not unfamiliar
with it and thank you for saying that
yeah but yeah a lot of people are just
still using their own custom MapReduce
things for everything when it comes to
that and that's going to change
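That "US, United States, USA" cleanup can be sketched as a simple canonicalization pass; the mapping table below is invented for illustration, and tools like Refine handle far messier cases:

```python
# Invented canonical map for illustration: many spellings, one value.
CANONICAL = {
    "us": "USA",
    "u.s.": "USA",
    "usa": "USA",
    "united states": "USA",
    "united states of america": "USA",
}

def normalize_country(raw):
    # Trim whitespace and lowercase before looking up the canonical form;
    # unknown values pass through untouched rather than being guessed at.
    key = raw.strip().lower()
    return CANONICAL.get(key, raw.strip())

records = ["US", "United States", "usa ", "Canada"]
cleaned = [normalize_country(r) for r in records]
print(cleaned)  # ['USA', 'USA', 'USA', 'Canada']
```

a cloud normalization service would run the same kind of rule as a workflow over the whole dataset instead of one list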
oh hey Raymond how's it going so um yeah
any other questions I don't know how
we're doing are we good on time a few
more minutes yeah this discussion is
fantastic I'm learning so much just
hearing your questions alright well if
there's no more questions should we
adjourn a bit early and grab some
yeah I mean just personally it was
the amount of people building games I
just didn't realize that was
going to happen I always thought we'd do
more things that were sort of
enterprising beyond games and the reason
for it is so if you guys aren't familiar
the product that I'm talking about that I
work on is called Google Cloud Platform
it's an application development platform
it's sort of a fully managed platform if
you've ever heard of App Engine that's
one piece of it and the thing that I
work mostly with is BigQuery which is
this big data analytics tool as a
service it scales to really
really big datasets terabytes even
petabytes well we don't have any
customers at that size but very very
large and we use that internally
at Google and so in some ways the cloud
platform is sort of using our
infrastructure to build your own
applications and what surprised me was
the type of applications the amount
of people building things like games
which is really interesting game
developers move fast they
don't want to worry about infrastructure
they want to do graphics and they want
to introduce new things and pivot
super duper fast oftentimes there will be
two coders a graphic designer and a
product manager and that's the whole
team and so they don't want to deal with
infrastructure and that's what we're
trying to do is get the infrastructure
out of the way just write your app
put it on the web and it's web scale
immediately and that's where I'm coming
from in the data space I think more
and more things are going to be like
that as well so BigQuery the product I
work on is just an API you throw your
data in the cloud you ask a question
using something like SQL and you get
a response back as a JSON RPC call
right so basically it's a
service where you ask questions about
datasets and the reason I believe in
this - I don't want to sound like a
corporate shill but this stuff that
we've done at Google is often years
ahead of what you see in the open source
world just because you know we
oftentimes will publish a research paper
about what we're doing and then other
people will build similar tools using
the ideas in those papers the MapReduce paper came out in
2004 and you know Hadoop has really hit
its stride in the last couple of years
and now projects like Spark are
building off of some of the ideas around
Hadoop they're saying okay Hadoop
doesn't work so well for these use cases
let's build something new and that kind
of cascade of ideas has come out and so
I think what happens inside of Google
is going to be similar to
what happens in the world in a
year or two or three years or five years
down the line so
I'm sorry Quentin I didn't really answer
your question but I think
things are going to be services
infrastructure is going away the web is
going to be the platform right yeah I'm
sorry I don't know much about
the user base of Compute Engine since I'm
not directly on that project people have
been clamoring for it and it fills a
really good need for people with
their own application stack so one thing
that we do at Google is we have an
opinionated stack meaning if you use App
Engine you have to write your code
in a certain way and if you've ever
written code that scales to a web scale
like let's say you're a Rails developer you
often have to make modifications to that
code just to get it to work at scale
meaning sharding or whatever and we kind
of use that we're already
doing that so it's a sort of opinionated
stack if you're using R you can't run
it on App Engine and a lot of people use
R so Compute Engine is a great tool to
fill in that gap and actually we
did this for that um data sensing
project that sort of Internet of Things
project we actually used R on the data
we wanted to do a correlation you
know somebody had R and they wanted to
run a simple correlation between two
data streams and we couldn't
run it on App Engine so we had Compute
Engine so we kind of
pipelined all this stuff together and I
think that's that's really exciting to
me that we can out now build these sort
of end-to-end apps in the cloud without
any of infrastructure and I think that's
where a lot of stuff is going in I think
Amazon is doing a similar thing right
they have a similar to stack and and you
see that with other competitors as well
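The analysis being described is simple to sketch. The actual project used R on Compute Engine; this is a hypothetical Python equivalent, with made-up stream names and values, just to show the kind of computation involved:

```python
# Hedged sketch: a Pearson correlation between two sensor data streams,
# the kind of simple analysis described above. Stream names and values
# are invented for illustration.

def pearson_correlation(xs, ys):
    """Compute the Pearson correlation coefficient of two equal-length streams."""
    n = len(xs)
    assert n == len(ys) and n > 1, "streams must be the same (nonzero) length"
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

# Two made-up data streams, e.g. temperature and humidity readings.
temperature = [20.1, 20.5, 21.0, 21.4, 22.0]
humidity = [40.2, 41.0, 41.8, 42.5, 43.1]

r = pearson_correlation(temperature, humidity)
```

In practice you would pull the streams from wherever the sensing pipeline stores them; the point is just that the statistical step itself is a few lines once the infrastructure question is out of the way.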
So yeah, I'm looking forward to it. To me, this sort of answers your question about accessibility: no longer do you have to worry so much about, ah, my server's not big enough. I ran into this problem when I was an I School grad. For the final project I worked on, we were limited: we used a single Mac in somebody's office, and we had a post-it note that said, please don't turn off this computer, because we have a final project presentation. You didn't know that when I gave my final project presentation, but we were stressed that someone was going to kick out the cord. But no, we don't want to worry about that anymore, and I think that's where we're going. At least I feel like our stack is pretty complete. I don't know what the usership of Compute Engine is, I'm not very familiar with those stats, but hopefully it's useful for people.
Maybe. Heroku is a good example, right? They're on the Amazon stack, and they have an interesting platform. Parse is another one, maybe. Okay, so maybe lowering the costs for those examples? Yeah, yeah, interesting.
I hope so, right? That's the whole point, to make sure there's an ecosystem around that. I mean, I'm not on the business side; to me it's all about accessibility. All I care about is getting people to build their apps the way they want, in a way that's as pain-free as possible. And what's cool about Hadoop and the open source data wave that's happening is that people are able to do things they were never able to do before. Small teams can do things with pretty meager resources, and I like that. I want to see more of that. I want to see utility computing happen. I think it's pretty good for society, where computing is much like the electricity coming out of a socket: it's a utility, and you just plug into it. I think that's going to be beneficial for a lot of people, so I hope that happens.
