Back here in data science, we're going to
continue our attempt to define data science
by looking at something that's really well
known in the field: the Data Science Venn
Diagram.
Now if you want to, you can think of this
in terms of, “What are the ingredients of
data science?”
Well, we're going to first say thanks to Drew
Conway, the guy who came up with this.
And if you want to see the original article,
you can go to this address.
But, what Drew said is that data science is
made of three things.
And we can put them as overlapping circles
because it is the intersection that’s important.
Here on the top left is coding or computer
programming, or as he calls it: hacking.
On the top right is stats or mathematics,
or quantitative abilities in general.
And on the bottom is domain expertise, or
intimate familiarity with a particular field
of practice: business, or health, or education,
or science, or something like that.
And the intersection here in the middle, that
is data science.
So it's the combination of coding and statistics
and math and domain knowledge.
Now, let's say a little more about coding.
The reason coding is important is because
it helps you gather and prepare the data.
A lot of the data comes from novel sources,
is not necessarily ready for you to gather,
and can be in very unusual formats.
And so coding is important because it can
require some real creativity to get the data
from the sources and into your analysis.
Now, a few kinds of coding are important.
For instance, there is statistical coding.
A couple of major languages here are R
and Python, two free, open-source
programming languages.
R is specifically for data.
Python is general-purpose, but well adapted
to data.
The ability to work with databases is important
too, because that's where the data is.
The most common language there is SQL, usually
pronounced “Sequel,” which stands for
Structured Query Language.
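To make the database idea concrete, here is a
minimal sketch in Python using its built-in
sqlite3 module. The table name "visits," its
columns, and the rows are all made up for
illustration; a real database would look different.

```python
import sqlite3

# In-memory database; the table and its rows are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE visits (patient TEXT, clinic TEXT, cost REAL)")
conn.executemany(
    "INSERT INTO visits VALUES (?, ?, ?)",
    [("a", "north", 120.0), ("b", "north", 80.0), ("c", "south", 200.0)],
)

# The SQL does the gathering: total cost per clinic, sorted by clinic name.
rows = conn.execute(
    "SELECT clinic, SUM(cost) FROM visits GROUP BY clinic ORDER BY clinic"
).fetchall()
conn.close()

print(rows)
```

The point is only that a short query can pull
a summary out of where the data lives, before
any statistics happen.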
Also, there is the command line interface,
or if you are on a Mac, people just call it
“the terminal.”
The most common language there is Bash, which
actually stands for Bourne-again shell.
And then searching is important, and that
means regex, or regular expressions.
While there is not a huge amount to learn
there (it’s a small field), it’s
sort of like super-powered wildcard searching
that makes it possible for you to both find
the data and reformat it in ways that are
going to be helpful for your analysis.
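Here is a small sketch of that find-and-reformat
idea, using Python's re module. The sample text
and the assumption that dates are written as
month/day/year are made up for illustration.

```python
import re

# Hypothetical messy text with dates written as month/day/year.
raw = "Order 17 shipped 3/4/2021; order 18 shipped 12/25/2021."

# Find: one or two digits, a slash, one or two digits, a slash, four digits.
pattern = re.compile(r"(\d{1,2})/(\d{1,2})/(\d{4})")


def to_iso(match):
    """Reformat one matched date as year-month-day with zero padding."""
    month, day, year = match.groups()
    return f"{year}-{int(month):02d}-{int(day):02d}"


found = pattern.findall(raw)        # every date, split into its three parts
cleaned = pattern.sub(to_iso, raw)  # same text with the dates rewritten

print(found)
print(cleaned)
```

The same pattern both locates the dates and
drives the rewrite, which is the "super-powered
wildcard" part.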
Now, let's say a few things about the math.
You’re going to need things like a little
bit of probability, some algebra, and, of course,
regression (a very common statistical procedure).
Those things are important.
And the reason you need the math is that
it is going to help you choose the appropriate
procedures to answer the question with the
data that you have.
And probably even more importantly, it is
going to help you diagnose problems when things
don’t go as expected.
And given that you are trying to do new things
with new data in new ways, you are probably
going to come across problems.
So the ability to understand the mechanics
of what is going on is going to give you a
big advantage.
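As one illustration of knowing the mechanics,
here is a rough sketch of simple least-squares
regression worked out by hand in plain Python.
The numbers (hours of study and test scores) are
invented, and in practice you would likely use a
statistics library rather than write this yourself.

```python
# Hypothetical data: hours of study (x) and test score (y).
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [52.0, 55.0, 61.0, 64.0, 68.0]

n = len(x)
mean_x = sum(x) / n
mean_y = sum(y) / n

# Slope = (sum of cross-deviations) / (sum of squared x deviations).
sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
sxx = sum((xi - mean_x) ** 2 for xi in x)
slope = sxy / sxx
intercept = mean_y - slope * mean_x

# Residuals are what the line fails to explain; they are a first
# place to look when results do not come out as expected.
residuals = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]

print(slope, intercept)
print(residuals)
```

If you know that the residuals from a fit like
this should sum to roughly zero, you have a quick
check for when something has gone wrong.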
And the third element of the data science
Venn Diagram is some sort of domain expertise.
Think of it as expertise in the field that
you're in.
Business settings are common.
You need to know about the goals of that field,
the methods that are used, and the constraints
that people come across.
And it’s important because whatever your
results are, you need to be able to implement
them well.
Data science is very practical and is designed
to accomplish something.
And your familiarity with a particular field
of practice is going to make it that much
easier and more impactful when you implement
the results of your analysis.
Now, let's go back to our Venn Diagram here
just for a moment.
Because this is a Venn, we also have these
intersections of two circles at a time.
At the top is machine learning.
At the bottom right is traditional research.
And on the bottom left is what Drew Conway
called “the danger zone.”
Let me talk about each of these.
First off, machine learning, or ML.
Now, you think about machine learning and
the idea here is that it represents coding,
or statistical programming and mathematics,
without any real domain expertise.
Sometimes these are referred to as "black
box" models.
You kind of throw data in and you don’t
even necessarily have to know what it means
or what language it is in, and the model will
just crunch through it all and
give you some regularities.
That can be very helpful, but machine learning
is considered slightly different from data
science because it doesn’t involve the particular
applications in a specific domain.
Also, there's traditional research.
This is where you have math or statistics
and you have domain knowledge, often very
intensive domain knowledge, but without the
coding or programming.
Now, you can get away with that because the
data that you use in traditional research
is highly structured.
It comes in rows and columns, and is typically
complete and is typically ready for analysis.
That doesn’t mean your life is easy, because
now you have to expend an enormous amount
of effort on the methods, the design
of the project, and the interpretation of the
data.
So it is still very heavy intellectual, cognitive
work, but it comes from a different place.
And then finally, there is what Conway called
“the danger zone.”
And that's the intersection of coding and
domain knowledge, but without math or statistics.
Now he says it is unlikely to happen, and
that is probably true.
On the other hand, I can think of some common
examples, such as what are called “word counts,”
where you take a large document or a series
of documents, and you count how many times
a word appears in there.
That can actually tell you some very important
things.
Another example is drawing maps and showing
how things change across place and maybe even
across time.
You don’t necessarily have to have the math,
but it can be very insightful and helpful.
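The word-count idea mentioned above can be sketched
in a few lines of Python. The sample sentence is
made up, and the rule for what counts as a word
(lowercase letters and apostrophes) is just one
simple assumption among many possible ones.

```python
import re
from collections import Counter

# Hypothetical document; a real one would be read from a file.
text = "Data science needs data. Science needs questions, and questions need data."

# Lowercase everything, then pull out runs of letters and apostrophes.
words = re.findall(r"[a-z']+", text.lower())

counts = Counter(words)       # maps each word to how often it appears
top = counts.most_common(3)   # the three most frequent words

print(top)
```

There is no statistics here, just coding plus
a sense of which words matter in the domain.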
So, let's think about a few backgrounds
that people come from here.
First is coding.
You can have people who are coders, who can
do math, stats, and business.
So you get the three things, and this is
probably the most common: most of the people
come from a programming background.
On the other hand, there is also stats, or
statistics.
And you can get statisticians who can code
and who also can do business.
That's less common, but it does happen.
And finally, there are people who come into
data science from a particular domain.
And these are, for instance, business people
who can code and do numbers.
And they are the least common.
But, all of these are important to data science.
And so in sum, here is what we can take away.
First, several fields make up data science.
Second, diverse skills and backgrounds are
important and they are needed in data science.
And third, there are many roles involved because
there are a lot of different things that need
to happen.
We’ll say more about that in our next movie.
