Welcome to Data Science: An Introduction.
I'm Barton Poulson, and what we are going to
do in this course is have a brief, accessible,
and non-technical overview of the field of
Data Science. Now, some people
when they hear Data Science, they start with
Data and think about piles of
equations and numbers, and then throw on top
of that Science and think about people working
in their labs, and they start to say: eh, that's
not for me. I'm not really a technical person,
and that just seems much too techy. Well,
here's the important thing to know. While
a lot of people get really fired up about
the technical aspects of Data Science, the
truth is that Data Science is not
so much a technical discipline as a creative one.
And, really, that's true. The reason I say
that is because in Data Science you use tools
that come from coding, statistics, and math,
but you use those tools to work creatively
with data. The idea is that there's always more
than one way to solve a problem or answer
a question. But most importantly, the goal, no
matter how you go about it, is to get insight
from your data. And
what makes Data Science unique, compared to
so many other things, is that you try to listen
to all of your data, even when it doesn't
fit in easily with your standard approaches
and paradigms. You're trying to be much more
inclusive in your analysis, and the reason
you want to do that is because everything
signifies. Everything carries meaning, and
everything can give you additional understanding
and insight into what's going on around you.
And so in this course, what we are trying to
do is give you a map to the field of Data
Science and how you can use it. So now
you have the map in your hands, and you can
get ready to get going with Data Science.
Welcome back to Data Science: An Introduction.
And we're going to begin this course by defining
data science. That makes sense. But, we are
going to be doing it in kind of a funny way.
The first thing I am going to talk about is
the demand for data science. So, let's take
a quick look. Now, data science can be defined
in a few ways. I am going to give you some
short definitions. Take one of my definition:
data science is coding, math, and
statistics in applied settings. That's a reasonable
working definition. But, if you want to be
a little more concise, I've got take two on
a definition: data science is the analysis
of diverse data, or data that you didn't think
would fit into standard analytic approaches.
A third way to think about it is that data
science is inclusive analysis. It includes
all of the data, all of the information that
you have, in order to get the most insightful
and compelling answer to your research questions.
Now, you may say to yourself, "Wait... that's
it?" Well, if you're not impressed, let me
show you a few things. First off, let's take
a look at this article. It says, "Data Scientist:
The Sexiest Job of the 21st Century." And
please note, this is coming from Harvard Business
Review. So, this is an authoritative source
and it is the official source of this saying:
that data science is sexy! Now, again, you
may be saying to yourself, "Sexy? I hardly
think so." Oh yeah, it's sexy. And the reason
data science is sexy is because first, it
has rare qualities, and second it has high
demand. Let me say a little more about those.
The rare qualities are that data science takes
unstructured data, then finds order, meaning,
and value in the data. Those are important,
but they're not easy to come across. Second,
high demand. Well, the reason it's in high
demand is because data science provides insight
into what's going on around you and critically,
it provides competitive advantage, which is
a huge thing in business settings. Now, let
me go back and say a little more about demand.
Let's take a look at a few other sources.
So, for instance the McKinsey Global Institute
published a very well-known paper, and you
can get it with this URL. And if you go to
that webpage, this is what's going to come
up. And we're going to take a quick look at
this one, the executive summary. It's a PDF
that you can download. And if you open that
up, you will find this page. And let's take
a look at the bottom right corner. Two numbers
here, I'm going to zoom in on those. The first
one is that they are projecting a need in the
next few years for somewhere between 140,000
and 190,000 deep analytical talent positions.
So, this means actual practicing data scientists.
That's a huge number; but almost ten times
as high is the 1.5 million more data-savvy managers
who will be needed to take full advantage of big
data in the United States. Now, that's people
who aren't necessarily doing the analysis
but have to understand it, who have to speak
data. And one of the main purposes
of this particular course is to help people
who may or may not be practicing data
scientists learn to understand what they can
get out of data, and some of the methods used
to get there. Let's take a look at another
article from LinkedIn. Here is a shortcut
URL for it and that will bring you to this
webpage: "The 25 hottest job skills that got
people hired in 2014." And take a look at
number one here: statistical analysis and
data mining, very closely related to data
science. And just to be clear, this was number
one in Australia, and Brazil, and Canada,
and France, and India, and the Netherlands,
and South Africa, and the United Arab Emirates,
and the United Kingdom. Everywhere. And if
you need a little more, let's take a look
at Glassdoor, which published an article this
year, 2016, and it's about the "25 Best Jobs
in America." And look at number one right
here, it's data scientist. And we can zoom
in on this information. It says there are going
to be 1,700 job openings, with a median base
salary of over $116,000, and fabulous career
opportunities and job scores. So, if you want
to take all of this together, the conclusion
you can reach is that data science pays. And
I can show you a little more about that. So
for instance, here's a list of the top ten
highest-paying careers, which I got from US
News. We have physicians (or doctors), dentists,
and lawyers, and so on. Now, if we add data
scientist to this list, using data from O'Reilly.com,
we have to push things around. And it comes
in third, with an average total salary (not the
base we had in the other one, but the total
compensation) of about $144,000 a year. That's
extraordinary. So in sum, what do we get from
all this? First off, we learn that there is
a very high demand for data science. Second,
we learn that there is a critical need for
both specialists (the practicing
data scientists) and generalists, the
people who speak the language and know what
can be done. And of course, there is excellent pay.
And all together, this makes Data Science
a compelling career alternative and a way
of making you better at whatever you are doing.
Back here in Data Science: An Introduction,
we're going to continue our attempt to define
data science by looking at something that's
really well known in the field: the Data Science
Venn Diagram. Now if you want to, you can think
of this in terms of, "What are the ingredients
of data science?" Well, we're going to first
say thanks to Drew Conway, the guy who came
up with this. And if you want to see the original
article, you can go to this address. But,
what Drew said is that data science is made
of three things. And we can put them as overlapping
circles because it is the intersection that's
important. Here on the top left is coding
or computer programming, or as he calls it:
hacking. On the top right is stats or
mathematics, or quantitative abilities
in general. And on the bottom is domain expertise,
or intimate familiarity with a particular
field of practice: business, or health, or
education, or science, or something like that.
And the intersection here in the middle, that
is data science. So it's the combination of
coding and statistics and math and domain
knowledge. Now, let's say a little more about
coding. The reason coding is important is
because it helps you gather and prepare the
data. Because a lot of the data comes from
novel sources, it is not necessarily ready
for you to gather, and it can be in very unusual
formats. And so coding is important because
it can require some real creativity to get
the data from its sources into your
analysis. Now, a few kinds of coding that
are important; for instance, there is statistical
coding. A couple of major languages in this
are R and Python, two free, open-source programming
languages. R is built specifically for data; Python
is general-purpose, but well adapted to data.
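As a quick illustration of what statistical coding looks like, here is a minimal Python sketch; the numbers are made up for this example.

```python
# A minimal sketch of statistical coding in Python, using only the
# standard library. The ages here are hypothetical sample data.
import statistics

ages = [23, 31, 29, 45, 38, 27]

mean_age = statistics.mean(ages)   # average of the sample
sd_age = statistics.stdev(ages)    # sample standard deviation

print(f"mean = {mean_age:.1f}, sd = {sd_age:.1f}")  # → mean = 32.2, sd = 8.0
```

In real projects you would reach for richer libraries, but the idea is the same: a few lines of code turn raw numbers into summaries.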
The ability to work with databases is important
too, because that's where the data is. The most
common language there is SQL, usually pronounced
"sequel," which stands for Structured Query
Language. Also, there is the command
line interface, or if you are on a Mac, people
just call it "the terminal." The most common
language there is Bash, which stands for
"Bourne-again shell." And then searching is important, with
regex, or regular expressions. While there
is not a huge amount to learn there (it's
a small little field), it's sort of like super-powered
wildcard searching that makes it possible
for you to both find the data and reformat
it in ways that are going to be helpful for
your analysis. Now, let's say a few things
about the math. You're going to need things
like a little bit of probability, some algebra,
of course, regression (very common statistical
procedure). Those things are important. And
the reason you need the math is because that
is going to help you choose the appropriate
procedures to answer the question with the
data that you have. And probably even more
importantly, it is going to help you diagnose
problems when things don't go as expected.
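As a hedged example of the kind of quick probability reasoning this enables (the numbers here are conventional choices, not from the course): if you run many independent tests, a little math shows why surprising results start appearing by chance.

```python
# Why a bit of probability helps you diagnose problems: with ten
# independent tests, each at the conventional 0.05 level, the chance
# of at least one false alarm is far higher than 0.05.
alpha = 0.05   # significance level for a single test
tests = 10     # hypothetical number of tests

p_at_least_one = 1 - (1 - alpha) ** tests
print(round(p_at_least_one, 2))  # → 0.4
```

Knowing that calculation exists is exactly what lets you spot a "significant" result that is really just noise.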
And given that you are trying to do new things
with new data in new ways, you are probably
going to come across problems. So the ability
to understand the mechanics of what is going
on is going to give you a big advantage. And
the third element of the data science Venn
Diagram is some sort of domain expertise.
Think of it as expertise in the field that
you're in. Business settings are common. You
need to know about the goals of that field,
the methods that are used, and the constraints
that people come across. And it's important
because whatever your results are, you need
to be able to implement them well. Data science
is very practical and is designed to accomplish
something. And your familiarity with a particular
field of practice is going to make it that
much easier and more impactful when you implement
the results of your analysis. Now, let's go
back to our Venn Diagram here just for a moment.
Because this is a Venn diagram, we also have these
intersections of two circles at a time. At
the top is machine learning. At the bottom
right is traditional research. And on the
bottom left hand is what Drew Conway called,
"the danger zone." Let me talk about each
of these. First off, machine learning, or
ML. Now, you think about machine learning
and the idea here is that it represents coding,
or statistical programming and mathematics,
without any real domain expertise. Sometimes
these are referred to as "black box" models.
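To make the idea concrete, here is a hedged sketch of one of the simplest machine learning methods, a one-nearest-neighbor classifier; the data and labels are invented, and the point is that the code finds an answer without knowing what the numbers mean.

```python
# One-nearest-neighbor classification: predict the label of whichever
# training point is closest to the query. Nothing in the code knows
# what the numbers or labels actually represent.
def predict_1nn(train, query):
    closest = min(train, key=lambda pair: abs(pair[0] - query))
    return closest[1]

# Hypothetical training data: (feature value, label) pairs.
training_data = [(1.0, "A"), (1.2, "A"), (4.8, "B"), (5.1, "B")]

print(predict_1nn(training_data, 4.5))  # → B
```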
You kind of throw data in, and you don't even
necessarily have to know what it means or
what language it is in, and it will just kind
of crunch through it all and it will give
you some regularities. That can be very helpful,
but machine learning is considered slightly
different from data science because it doesn't
involve the particular applications in a specific
domain. Also, there's traditional research.
This is where you have math or statistics
and you have domain knowledge; often very
intensive domain knowledge but without the
coding or programming. Now, you can get away
with that because the data that you use in
traditional research is highly structured.
It comes in rows and columns, and is typically
complete and is typically ready for analysis.
That doesn't mean your life is easy, because now
you have to expend an enormous amount of effort
in the methods and the designing of the project
and the interpretation of the data. So, still
very heavy intellectual cognitive work, but
it comes from a different place. And then
finally, there is what Conway called, "the
danger zone." And that's the intersection
of coding and domain knowledge, but without
math or statistics. Now he says it is unlikely
to happen, and that is probably true. On the
other hand, I can think of some common examples,
what are called "word counts," where you take
a large document or a series of documents,
and you count how many times a word appears
in there. That can actually tell you some
very important things. And also, drawing maps
and showing how things change across place
and maybe even across time. You don't necessarily
have to have the math, but it can be very
insightful and helpful. So, let's think about
a couple of backgrounds where people come
from here. First, is coding. You can have
people who are coders, who can do math, stats,
and business. So, you get the three things
(and this is probably the most common); most
of these people come from a programming background.
On the other hand, there is also stats, or
statistics. And you can get statisticians
who can code and who also can do business.
That's less common, but it does happen. And
finally, there are people who come into data
science from a particular domain. And these
are, for instance, business people who can
code and do numbers. And they are the least
common. But, all of these are important to
data science. And so in sum, here is what
we can take away. First, several fields make
up Data Science. Second, diverse skills and
backgrounds are important and they are needed
in data science. And third, there are many
roles involved because there are a lot of
different things that need to happen. We'll
say more about that in our next movie. The
next step in our data science introduction
and our definition of data science is to talk
about the Data Science Pathway. So I like
to think of this as, when you are working
on a major project, you have got to do one
step at a time to get it from here to there.
In data science, you can take the various
steps and you can put them into a couple of
general categories. First, there are the steps
that involve planning. Second, there's the
data prep. Third, there's the actual modeling
of the data. And fourth, there's the follow-up.
And there are several steps within each of
these; I'll explain each of them briefly.
First, let's talk about planning. The first
thing that you need to do, is you need to
define the goals of your project so you know
how to use your resources well, and also so
you know when you are done. Second, you need
to organize your resources. So you might have
data from several different sources; you might
have different software packages, you might
have different people. Which gets us to the
third one: you need to coordinate the people
so they can work together productively. If
you are doing a hand-off, it needs to be clear
who is going to do what and how their work
is going to go together. And then, really
to state the obvious, you need to schedule
the project so things can move along smoothly
and you can finish in a reasonable amount
of time. Next is the data prep, which is like
food prep: getting the raw ingredients ready.
First, of course, you need to get the data.
And it can come from many different sources
and be in many different
formats. You need to clean the data and, the
sad thing is, this tends to be a very large
part of any data science project. And that
is because you are bringing in unusual data
from a lot of different places. You also want
to explore the data; that is, really see what
it looks like, how many people are in each
group, what the shape of the distributions
are like, what is associated with what. And
you may need to refine the data. And that
means choosing variables to include, choosing
cases to include or exclude, making any transformations
to the data you need to do. And of course,
these steps can bounce back and forth
from one to the other. The third group is
modeling or statistical modeling. This is
where you actually want to create the statistical
model. So for instance, you might do a regression
analysis or you might do a neural network.
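As a hedged sketch of what creating and checking a simple model can look like (every number here is invented): fit a toy regression on part of the data and hold the rest back to test against.

```python
# A toy modeling step: fit y = slope * x on a training portion of the
# data, then measure the error on held-out points. All data invented.
data = [(1, 2.0), (2, 4.1), (3, 6.0), (4, 8.1), (5, 9.9), (6, 12.2)]
train, holdout = data[:4], data[4:]   # a real project would shuffle first

# Least-squares fit through the origin (a deliberately minimal model).
slope = sum(x * y for x, y in train) / sum(x * x for x, y in train)

# Average absolute error on the data the model never saw.
error = sum(abs(y - slope * x) for x, y in holdout) / len(holdout)
print(f"slope = {slope:.2f}, holdout error = {error:.2f}")
```

A small holdout error on unseen points is what gives you some confidence the model is not just memorizing the training data.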
But, whatever you do, once you create your
model, you have to validate the model. You
might do that with holdout validation. You
might even do it with a very small replication
if you can. You also need to evaluate the
model. So, once you know that the model is
accurate, what does it actually mean and how
much does it tell you? And then finally, you
need to refine the model. So, for instance,
there may be variables you want to throw out;
maybe additional ones you want to include.
You may want to, again, transform some of
the data. You may want to get it so it is
easier to interpret and apply. And that gets
us to the last part of the data science pathway.
And that's follow up. And once you have created
your model, you need to present the model.
Because it is usually work that is being done
for a client; it could be in house, it could be
a third party. You need to take the insights
that you got and share them in a meaningful
way with other people. You also need to deploy
the model; it is usually being done in order
to accomplish something. So, for instance,
if you are working with an e-commerce site,
you may be developing a recommendation engine
that says, "people who bought this and this
might buy this." You need to actually stick
it on the website and see if it works the
way that you expected it to. Then you need
to revisit the model because a lot of the
time, the data that you worked on is not
necessarily all of the data, and things can
change when you get out in the real world
or things just change over time. So, you have
to see how well your model is working. And
then, just to be thorough, you need to archive
the assets, document what you have, and make
it possible for you or for others to repeat
the analysis or develop off of it in the future.
So, those are the general steps of what I
consider the data science pathway. And in
sum, what we get from this is three things.
First, data science isn't just a technical
field, it is not just coding. Things like,
planning and presenting and implementing are
just as important. Also, contextual skills,
knowing how it works in a particular field,
knowing how it will be implemented, those
skills matter as well. And then, as you can tell
from this whole thing, there are a lot of
things to do. And if you go one step at a
time, there will be less backtracking and
you will ultimately be more productive in
your data science projects. We'll continue
our definition of data science by looking
at the roles that are involved in data science.
The way that different people can contribute
to it. That's because data science tends to be
a collaborative effort, and it's nice to be able
to say that we are all working together towards
a single goal. So, let's talk about some of
the roles involved in data science and how
they contribute to the projects. First off,
let's take a look at engineers. These are
people who focus on the back end hardware.
For instance, the servers and the software
that runs them. This is what makes data science
possible, and it includes people like software
developers and database administrators.
And they provide the foundation for the rest
of the work. Next, you can also have people
who are Big Data specialists. These are people
who focus on computer science and mathematics,
and they may do machine learning algorithms
as a way of processing very large amounts
of data. And they often create what are called
data products. So, a thing that tells you
what restaurant to go to, or that says, "you
might know these friends," or provides ways
of linking up photos. Those are data products,
and those often involve a huge amount of very
technical work behind them. There are also
researchers; these are people who focus on
domain-specific research. So, for instance,
physics, or genetics, or whatever. And these
people tend to have very strong statistical
skills, and they can use some of the procedures
and some of the data that come from the other
people, like the Big Data specialists, but
they focus on the specific questions. Also
in the data science realm, you will find analysts.
These are people who focus on the day-to-day
tasks of running a business. So for instance,
they might do web analytics (like Google Analytics),
or they might pull data from a SQL database.
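As a hedged sketch of that kind of work, here is a tiny example in Python using the standard library's sqlite3 module; the table and the figures in it are invented for illustration.

```python
# Pulling structured data from a SQL database: a throwaway in-memory
# database with a hypothetical sales table.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
conn.executemany("INSERT INTO sales VALUES (?, ?)",
                 [("east", 100.0), ("east", 150.0), ("west", 90.0)])

# A typical analyst query: total sales per region.
rows = list(conn.execute(
    "SELECT region, SUM(amount) FROM sales GROUP BY region ORDER BY region"))
print(rows)  # → [('east', 250.0), ('west', 90.0)]
```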
And this information is very important and
good for business. So, analysts are key to
the day-to-day function of business, but they
may not, exactly, be data science proper,
because most of the data they are working
with is going to be pretty structured. Nevertheless,
they play a critical role in business in general.
And then, speaking of business, you have the
actual business people; the men and women
who organize and run businesses. These people
need to be able to frame business-relevant
questions that can be answered with the data.
Also, the business person manages the project
and the efforts and the resources of others.
And while they may not actually be doing the
coding, they must speak data; they must know
how the data works, what it can answer, and
how to implement it. You can also have entrepreneurs.
So, you might have a data startup; they are
starting their own little social network,
their own little web search platform. An entrepreneur
needs data and business skills. And truthfully,
they have to be creative at every step along
the way. Usually because they are doing it
all themselves at a smaller scale. Then we
have in data science something known as "the
full stack unicorn." And this is a person
who can do everything at an expert level.
They are called a unicorn because truthfully,
they may not actually exist. I will have more
to say about that later. But for right now,
we can sum up what we got out of this video
with three things. Number one, data science
is diverse. There are a lot of different people
who go into it, and they have different goals
for their work, and they bring in different
skills and different experiences and different
approaches. Also, they tend to work in very
different contexts. An entrepreneur works
in a very different place from a business
manager, who works in a very different place
from an academic researcher. But, all of them
are connected in some way to data science
and make it a richer field. The last thing
I want to say in "Data Science: An Introduction,"
where I am trying to define data science,
is to talk about teams in data science. The
idea here is that data science has many different
tools, and different people are going to be
experts in each one of them. Now, you have,
for instance, coding and you have statistics.
Also, you have fields like design, or
business and management, that are involved.
And the question, of course, is: "who can
do all of it? Who's able to do all of these
things at the level that we need?" Well, that's
where we get this saying (I have mentioned
it before): it's the unicorn. And just as
in legend, the unicorn is a mythical
creature with magical abilities. In data science,
it works a little differently. It is a mythical
Data Scientist with universal abilities. The
trouble is, as we know from the real world,
there are no unicorns (the animals), and
there are not very many unicorns in
data science. Really, there are just people.
And so we have to find out how we can do the
projects even though we don't have this one
person who can do everything for everybody.
So let's take a hypothetical case, just for
a moment. I am going to give you some fictional
people. Here is my fictional person Otto,
who has strong visualization skills, who has
good coding, but has limited analytic or statistical
ability. And if we graph out his abilities...
So, here we have five things
that we need to have happen. And for the project
to work, they all have to happen at a level of
at least eight on a zero-to-ten scale. If we
take his coding ability, he is almost there.
Statistics, not quite halfway. Graphics, yes
he can do that. And then, business, eh, alright.
And project, pretty good. So, what you can
see here is, in only one of these five areas
is Otto sufficient on his own. On the other
hand, let's pair him up with somebody else.
Let's take a look at Lucy. And Lucy has strong
business training, has good tech skills, but
has limited graphics. And if we get her profile
on the same thing that we saw, there is coding,
pretty good. Statistics, pretty good. Graphics,
not so much. Business, good. And projects,
OK. Now, the important thing here is that
we can make a team. So let's take our two
fictional people, Otto and Lucy, and we can
put together their abilities. Now, I actually
have to change the scale here a little bit
to accommodate both of them. But our criterion
still is at eight; we need a level of eight
in order to do the project competently. And
if we combine them: oh look, coding is now
past eight. Statistics is past eight. Graphics
is way past. Business way past. And then the
projects, they are too. So when we combine
their skills, we are able to get the level
that we need for everything. Or to put it
another way, we have now created a unicorn
by team, and that makes it possible to do
the data science project. So, in sum: you
usually can't do data science on your own.
That's a very rare individual. Or more specifically:
people need people, and in data science you
have the opportunity to take several people
and make collective unicorns, so you can get
the insight that you need in your project
and you can get the things done that you want.
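The Otto-and-Lucy logic above can be sketched in a few lines of Python; the skill numbers are invented stand-ins, and combining by addition is an assumption, but the check against the criterion of eight is the one from the video.

```python
# Combining two hypothetical skill profiles and checking every area
# against the project's criterion of eight.
CRITERION = 8

otto = {"coding": 7, "statistics": 4, "graphics": 9, "business": 5, "projects": 7}
lucy = {"coding": 7, "statistics": 7, "graphics": 3, "business": 8, "projects": 6}

# Neither person clears the bar alone, but the combined team does.
team = {skill: otto[skill] + lucy[skill] for skill in otto}

print(all(level >= CRITERION for level in team.values()))  # → True
```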
In order to get a better understanding of
data science, it can be helpful to look at
contrasts between data science and other fields.
Probably the most informative is with Big
Data because these two terms are actually
often confused. It makes me think of situations
where you have two things that are very similar
but not the same, like the ones we see here in the
Piazza San Carlo in Italy. Part of the
problem stems from the fact that data science
and big data both have Venn Diagrams associated
with them. So, for instance, Venn number one
for data science is something we have seen
already. We have three circles and we have
coding and we have math and we have some domain
expertise, that put together get data science.
On the other hand, Venn Diagram number two
is for Big Data. It also has three circles.
And we have the high volume of data, the rapid
velocity of data, and the extreme variety
of data. Take those three v's together and
you get Big Data. Now, we can also combine
these two if we want in a third Venn Diagram,
we call Big Data and Data Science. This time
it is just two circles, with Big Data on the
left and Data Science on the right. And the
intersection in the middle, there is Big Data
Science, which actually is a real term. But,
if you want to do a compare and contrast,
it kind of helps to look at how you can have
one without the other. So, let's start by
looking at Big Data without Data Science.
So, these are situations where you may have
the volume or velocity or variety of data
but don't need all the tools of data science.
So, we are just looking at the left side of
the diagram right now. Now, truthfully, this
only works if you have Big Data without all
three V's. Some say you have to have the volume,
velocity, and variety for it to count as Big
Data. I basically say anything that doesn't
fit into a standard machine is probably Big
Data. I can think of a couple of examples
here of things that might count as Big Data,
but maybe don't count as Data Science. Machine
learning, where you can have very large and
probably very complex data sets, doesn't require
very much domain expertise, so that may not
be data science. Word counts, where you have
an enormous amount of data and it's actually
a pretty simple analysis, again doesn't require
much sophistication in terms of quantitative
skills or even domain expertise. So, maybe/maybe
not data science. On the other hand, to do
any of these you are going to need to have
at least two skills. You are going to need
to have the coding and you will probably have
to have some sort of quantitative skills as
well. So, how about data science without Big
Data? That's the right side of this diagram.
Well, to make that happen you are probably
talking about data with just one of the three
V's from Big Data. So, either volume or velocity
or variety, but singly. So for instance, genetics
data. You have a huge amount of data, it comes
in a very set structure, and it tends to
come in all at once. So, you have got a lot of
volume and it is a very challenging thing
to work with. You have to use data science,
but it may or may not count as Big Data. Similarly,
streaming sensor data, where you have data
coming in very quickly, but you are not necessarily
saving it; you are just looking at these windows
in it. That is a lot of velocity, and it is
difficult to deal with, and it takes Data
Science, the full skill set, but it may not
require Big Data, per se. Or facial recognition,
where you have enormous variety in the data
because you are getting photos or videos that
are coming in. Again, very difficult to deal
with, requires a lot of ingenuity and creativity
may or may not count as Big Data, depending
on how much of a stickler you are about definitions.
Now, if you want to combine the two, we can
talk about Big Data Science. In that case,
we are looking right here at the middle. This
is a situation where you have volume, and
velocity, and variety in your data and truthfully,
if you have the three of those, you are going
to need the full Data Science skill set. You
are going to need coding, and statistics,
and math, and you are going to have to have
domain expertise. Primarily because of the
variety you are dealing with, but taken all
together you do have to have all of it. So
in sum, here is what we get. Big Data is not
equal to, it is not identical to data science.
Now, there is common ground, and a lot of
people who are good at Big Data are good at
data science and vice versa, but they are
conceptually distinct. On the other hand,
there is the shared middle ground of Big Data
Science that unifies the two separate fields.
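The word-count case mentioned above is worth a quick sketch, because it shows how conceptually simple a Big Data task can be; the sentence here is just a stand-in for a large collection of documents.

```python
# Word counts: count how many times each word appears in a text.
# The same few lines would work on an arbitrarily large collection.
from collections import Counter

text = "the quick brown fox jumps over the lazy dog the end"  # stand-in text
counts = Counter(text.split())

print(counts.most_common(1))  # → [('the', 3)]
```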
Another important contrast you can make in
trying to understand data science is to compare
it with coding or computer programming. Now,
this is where you are trying to work with
machines and you are trying to talk to that
machine, to get it to do things. In one sense
you can think of coding as just giving task
instructions; how to do something. It is a
lot like a recipe when you're cooking. You
get some sort of user input or other input,
and then maybe you have if/then logic, and
you get output from it. To take an extremely
simple example, if you are programming in
Python version 2, writing print and then,
in quotes, "Hello, world!" will put the words
"Hello, world!" on the screen. (In Python 3,
print is a function, so you would write
print("Hello, world!").) So, you gave
it some instructions and it gave you some
output. Very simple programming. Now, coding
and data gets a little more complicated. So,
for instance, there is word counts, where
you take a book or a whole collection of books,
you take the words and you count how many
there are in there. Now, this is a conceptually
simple task, and domain expertise and really
math and statistics are not vital. But to
make valid inferences and generalizations
in the face of variability and uncertainty
in the data, you need statistics, and by extension,
you need data science. It might help to compare
the two by looking at the tools of the respective
trades. So for instance, there are tools for
coding or generic computer programming, and
there are tools that are specific for data
science. So, what I have right here is a list
from the IEEE of the top ten programming languages
of 2015. And it starts at Java and C and goes
down to Shell. And some of these are also
used for data science. So for instance, Python
and R and SQL are used for data science, but
the other ones aren't major ones in data science.
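To make the word count example from a moment ago concrete, here is a minimal sketch in Python. The sample text and the counting approach are purely illustrative, not taken from any particular toolkit:

```python
# A minimal word count: the conceptually simple "coding with data" task
# described above. The sample text is invented for illustration.
from collections import Counter

text = "the quick brown fox jumps over the lazy dog the end"

# Normalize to lowercase, split on whitespace, and tally each word.
counts = Counter(text.lower().split())

print(counts["the"])          # how many times "the" appears: 3
print(counts.most_common(1))  # the single most frequent word
```

Scaled up from one sentence to a whole collection of books, the logic stays the same; only the volume changes.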
So, let's, in fact, take a look at a different
list of most popular tools for data science
and you see that things move around a little
bit. Now, R is at the top, SQL is there, Python
is there, but for me what is the most interesting
on the list is that Excel is number five,
which would never be considered programming,
per se, but it is, in fact, a very important
tool for data science. And that is one of
the ways that we can compare and contrast
computer programming with data science. In
sum, we can say this: data science is not
equal to coding. They are different things.
On the other hand, they share some of the
tools and they share some practices specifically
when coding for data. On the other hand, there
is one very big difference in that statistics,
statistical ability is one of the major separators
between general purpose programming and data
science programming. When we talk about data
science and we are contrasting with some fields,
another field that a lot of people get confused
and think they are the same thing is data
science and statistics. Now, I will tell you
there is a lot in common, but we can talk
a little bit about the different focuses of
each. And we also get into the issue of definitionalism:
the claim that data science is different simply
because we define it differently, even when there is
an awful lot in common between the two. It
helps to take a look at some of the things
that go on in each field. So, let's start
here about statistics. Put a little circle
here and we will put data science. And, to
borrow a term from Stephen Jay Gould, we can
call these non-overlapping magisteria; NOMA.
So, you think of them as separate fields that
are sovereign unto themselves with nothing
to do with each other. But, you know, that
doesn't seem right; and part of that is that
if we go back to the Data Science Venn Diagram,
statistics is one part of it. There it is
in the top corner. So, now what do we do?
What's the relationship? So, it doesn't make
sense to say these are totally separate areas,
maybe, because data science and statistics
share procedures, data science is a subset
or specialty of statistics, more
like this. But, if data science were just
a subset or specialty within statistics then
it would follow that all data scientists would
first be statisticians. And interestingly
that's just not so. Say, for instance, we
take a look at the data science stars, the
superstars in the field. We go to a rather
intimidating article; it's called "The World's
7 Most Powerful Data Scientists" from Forbes.com.
You can see the article if you go to this
URL. There's actually more than seven people,
because sometimes he brings them up in pairs.
Let's check their degrees, see what their
academic training is in. If we take all the
people on this list, we have five degrees
in computer science, three in math, two in
engineering, and one each in biology, economics,
law, speech pathology, and one in statistics.
And so that tells us, of course, these major
people in data science are not trained as
statisticians. Only one of them has formal
training in that. So, that gets us to the
next question. Where do these two fields,
statistics and data science, diverge? Because
they seem like they should have a lot in common,
but they don't have a lot in training. Specifically,
we can look at the training. Most data scientists
are not trained, formally, as statisticians.
Also, in practice, things like machine learning
and big data, which are central to data science,
are not shared, generally, with most of statistics.
So, they have separate domains there. And
then there is the important issue of context.
Data scientists tend to work in different
settings than statisticians. Specifically,
data scientists very often work in commercial
settings where they are trying to get recommendation
engines or ways of developing a product that
will make them money. So, maybe instead of
having data science a subset of statistics,
we can think of it more as these two fields
have different niches. They both analyze data,
but they do different things in different
ways. So, maybe it is fair to say they share,
they overlap, they have analysis in common
of data, but otherwise, they are ecologically
distinct. So, in sum: what we can say here
is that data science and statistics both use
data and they analyze it. But the people in
each tend to come from different backgrounds,
and they tend to function with different goals
and contexts. And in that way, they are rendered
conceptually distinct fields despite
the apparent overlap. As we work to get a
grasp on data science, there is one more contrast
I want to make explicitly, and that is between
data science and business intelligence, or
BI. The idea here is that business intelligence
is data in real life; it's very, very applied
stuff. The purpose of BI is to get data on
internal operations, on market competitors,
and so on, and make justifiable decisions
as opposed to just sitting in the bar and
doing whatever comes to your mind. Now, data
science is involved with this, except, you
know, really there is no coding in BI. Instead,
you use apps that already exist. And the statistics
in business intelligence tend to be very simple,
they tend to be counts and percentages and
ratios. And so, it's simple, like a light bulb
is simple; it just does its one job, and there
is nothing super sophisticated there. Instead
the focus in business intelligence is on domain
expertise and on really useful direct utility.
It's simple, it's effective and it provides
insight. Now, one of the main associations
with business intelligence is what are called
dashboards, or data dashboards. They look
like this; it is a collection of charts and
tables that go together to give you a very
quick overview of what is going on in your
business. And while a lot of data scientists
may, let's say, look down their nose upon
dashboards, I'll say this, most of them are
very well designed and you can learn a huge
amount about user interaction and the accessibility
of information from dashboards. So really, where
does data science come into this? What is
the connection between data science and business
intelligence? Well, data science can be useful
to BI in terms of setting it up. Identifying
data sources and creating or setting up the
framework for something like a dashboard or
a business intelligence system. Also, data
science can be used to extend it. Data science
can be used to get past the easy questions
and the easy data, to get to the questions that
are actually most useful to you, even if they
sometimes require data that is hard
to wrangle and work with. And also, there
is an interesting interaction here that goes
the other way. Data science practitioners
can learn a lot about design from good business
intelligence applications. So, I strongly
encourage anybody in data science to look
at them carefully and see what they can learn.
In sum: business intelligence, or BI, is very
goal oriented. Data science perhaps prepares
the data and sets up the form for business
intelligence, but also data science can learn
a lot about usability and accessibility from
business intelligence. And so, it is always
worth taking a close look. Data science has
a lot of really wonderful things about it, but
it is important to consider some ethical issues,
and I will specifically call this "do no harm"
in your data science projects. And for that
we can say thanks to Hippocrates, the guy
who gave us the Hippocratic Oath of Do No
Harm. Let's specifically talk about some of
the important ethical issues, very briefly,
that come up in data science. Number one is
privacy. Data tells you a lot about people,
and you need to be concerned about confidentiality.
If you have private information about people,
their names, their social security numbers,
their addresses, their credit scores, their
health, that's private, that's confidential,
and you shouldn't share that information unless
they specifically gave you permission. Now,
one of the reasons this presents a special
challenge in data science is because, as we will
see later, a lot of the sources that are used
in data science were not intended for sharing.
If you scrape data from a website or from
PDFs, you need to make sure that it is ok
to do that. But it was originally created
without the intention of sharing, so privacy
is something that really falls upon the analyst
to make sure they are doing it properly. Next,
is anonymity. One of the interesting things
we find is that it is really not hard to identify
people in data. If you have a little bit of
GPS data and you know where a person was at
four different points in time, you have about
a 95% chance of knowing exactly who they are.
You look at things like HIPAA, that's the
Health Insurance Portability and Accountability
Act. Before HIPAA, it was really easy to identify
people from medical records. Since then, it
has become much more difficult to identify
people uniquely. That's really important
for people's well-being. And then also,
proprietary data; if you are working for a
client, a company, and they give you their
own data, that data may have identifiers.
You may know who the people are; they are
not anonymous anymore. So, anonymity may or
may not be there, but you should make major
efforts to keep the data anonymous. But really, the primary thing
is even if you do know who they are, that
you still maintain the privacy and confidentiality
of the data. Next, there is the issue about
copyright, where people try to lock down information.
Now, just because something is on the web,
doesn't mean that you are allowed to use it.
Scraping data from websites is a very common
and useful way of getting data for projects.
You can get data from web pages, from PDFs,
from images, from audio, from really a huge
number of things. But, again the assumption
that because it is on the web, it's ok to
use it is not true. You always need to check
copyright and make sure that it is acceptable
for you to access that particular data. Next,
and our very ominous picture, is data security
and the idea here is that when you go through
all the effort to gather data, to clean up
and prepare for analysis, you have created
something that is very valuable to a lot of
people and you have to be concerned about
hackers trying to come in and steal the data,
especially if the data is not anonymous and
it has identifiers in it. And so, there is
an additional burden to place on the analyst
to ensure to the best of their ability that
the data is safe and cannot be broken into
and stolen. And that can include very simple
things, like a person who was on the project
but no longer is, yet took the data with them
on a flash drive. You have to find ways to make sure
that that can't happen as well. There's a
lot of possibilities, it's tricky, but it
is something that you have to consider thoroughly.
Now, there are two other things that come up in terms
of ethics but usually don't get addressed
in these conversations. Number one is potential
bias. The idea here is that the algorithms
or the formulas that are used in data science
are only as neutral or bias-free as the rules
and the data that they get. And so, the idea
here is that if you have rules that address
something that is associated with, for instance,
gender or age or race or economic standing,
you might unintentionally be building in those
factors, which, under rules like
Title IX, you are not supposed to do. You might
be building those into the system without
being aware of it, and an algorithm has this
sheen of objectivity, so people think they
can place confidence in it without realizing
that it is replicating some of the prejudices
that may happen in real life. Another issue
is overconfidence. And the idea here is that
analyses are limited simplifications. They
have to be, that is just what they are. And
because of this, you still need humans in
the loop to help interpret and apply this.
The problem is when people run an algorithm
to get a number, say to ten decimal places,
and they say, "this must be true," and treat
it as written-in-stone absolutely unshakeable
truth, when in fact, if the data were biased
going in; if the algorithms were incomplete,
if the sampling was not representative, you
can have enormous problems and go down the
wrong path with too much confidence in your
own analyses. So, once again humility is in
order when doing data science work. In sum:
data science has enormous potential, but it
also has significant risks involved in the
projects. Part of the problem is that analyses
can't be neutral, that you have to look at
how the algorithms are associated with the
preferences, prejudices, and biases of the
people who made them. And what that means
is that no matter what, good judgment is always
vital to quality and success of a data science
project. Data Science is a field that is strongly
associated with its methods or procedures.
In this section of videos, we're going to
provide a brief overview of the methods that
are used in data science. Now, just as a quick
warning, in this section things can get kind
of technical and that can cause some people
to sort of freak out. But, this course is
a non-technical overview. The technical hands
on stuff is in the other courses. And it is
really important to remember that tech is
simply the means to doing data science. Insight
or the ability to find meaning in your data,
that's the goal. Tech only helps you get there.
And so, we want to focus primarily on insight
and the tools and the tech as they serve to
further that goal. Now, there's a few general
categories we are going to talk about, again,
with an overview for each of these. The first
one is sourcing or data sourcing. That is
how to get the data that goes into data science,
the raw materials that you need. Second is
coding. That again is computer programming
that can be used to obtain and manipulate
and analyze the data. After that, a tiny bit
of math and that is the mathematics behind
data science methods that really form the
foundations of the procedures. And then stats,
the statistical methods that are frequently
used to summarize and analyze data, especially
as applied to data science. And then there
is machine learning, ML, this is a collection
of methods for finding clusters in the data,
for predicting categories or scores on interesting
outcomes. And even across these five things,
the presentations aren't too techie-crunchy;
they are basically still friendly. Really,
that's the way it is. So, that is the overview
of the overviews. In sum: we need to remember
that data science includes tech, but data
science is greater than tech, it is more than
those procedures. And above all, that tech
while important to data science is still simply
a means to insight in data. The first step
in discussing data science methods is to look
at the methods of sourcing, or getting data
that is used in data science. You can think
of this as getting the raw materials that
go into your analyses. Now, you have got a
few different choices when it comes to this
in data science. You can use existing data,
you can use something called data APIs, you
can scrape web data, or you can make data.
We'll talk about each of those very briefly
in a non-technical manner. For right now,
let me say something about existing data.
This is data that already is at hand and it
might be in-house data. So if you work for
a company, it might be your company records.
Or, you might have open data; for instance,
many governments and many scientific organizations
make their data available to the public. And
then there is also third party data. This
is usually data that you buy from a vendor,
but it exists and it is very easy to plug
it in and go. You can also use APIs. Now,
that stands for Application Programming Interface,
and this is something that allows various
computer applications to communicate directly
with each other. It's like phones for your
computer programs. It is the most common way
of getting web data, and the beautiful thing
about it is it allows you to import that data
directly into whatever program or application
you are using to analyze the data. Next is
scraping data. And this is where you want
to use data that is on the web, but there
isn't an existing API. And what that means,
is usually data that's in HTML web tables
and pages, maybe PDFs. And you can do this
either by using specialized applications
for scraping data or you can do it in a programming
language, like R or Python, and write the
code to do the data scraping. Or another option
is to make data. And this lets you get exactly
what you need; you can be very specific about
what you gather. You can do something
like interviews, or you can do surveys, or
you can do experiments. There are a lot of
approaches, and most of them require some specialized
training in terms of how to gather quality
data. And that is actually important to remember,
because no matter what method you use for
getting or making new data, you need to remember
this one little aphorism you may have heard
from computer science. It goes by the name
of GIGO: that actually stands for "Garbage
In, Garbage Out," and it means if you have
bad data that you are feeding into your system,
you are not going to get anything worthwhile,
any real insights out of it. Consequently,
it is important to pay attention to metrics,
or methods for measuring, and to their meaning - exactly
what it is that they tell you. There's a few
ways you can do this. For instance, you can
talk about business metrics, you can talk
about KPIs, which means Key Performance Indicators,
also used in business settings. Or SMART goals,
which is a way of describing the goals that
are actionable and timely and so on. You can
also talk about, in a measurement sense, classification
accuracy. And I will discuss each of those
in a little more detail in a later movie.
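To give one small concrete taste of API-style data, here is a hedged sketch in Python. Real APIs return structured text, most often JSON; to keep the sketch self-contained, the payload below is hard-coded and entirely made up, where in practice you would fetch it from a live endpoint with something like urllib.request:

```python
# Parsing an API-style JSON response. The payload is invented for
# illustration; a real API would return text shaped much like this.
import json

# In practice this string would come from a web request to an API.
payload = '{"city": "Logan", "temps": [61, 63, 58]}'

data = json.loads(payload)  # JSON text -> Python dictionary
average = sum(data["temps"]) / len(data["temps"])
print(data["city"], round(average, 1))
```

The point is the one made above: the API hands your analysis program structured data it can use directly, no copying and pasting required.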
But for right now, in sum, we can say this:
data sourcing is important because you need
to get the raw materials for your analysis.
The nice thing is there's many possible methods,
many ways that you can use to get the data
for data science. But no matter what you do,
it is important to check the quality and the
meaning of the data so you can get the most
insight possible out of your project. The
next step we need to talk about in data science
methods is coding, and I am going to give
you a very brief non-technical overview of
coding in data science. The idea here is that
you are going to get in there, and you are
going to be King of the Jungle, master of your
domain, and make the data jump when you need
it to jump. Now, if you remember when we talked
about the Data Science Venn Diagram at the
beginning, coding is up here on the top left.
And while we often think about sort of people
typing lines of code (which is very frequent),
it is more important to remember when we talk
about coding (or just computers in general),
what we are really talking about here is any
technology that lets you manipulate the data
in the ways that you need to perform the procedures
you need to get the insight that you want
out of your data. Now, there are three very
general categories that we will be discussing
here on datalab. The first is apps; these
are specialized applications or programs for
working with data. The second is data; or
specifically, data formats. There's special
formats for web data, I will mention those
in a moment. And then, code; there are programming
languages that give you full control over
what the computer does and how you interact
with the data. Let's take a look at each one
very briefly. In terms of apps, there are
spreadsheets, like Excel or Google Sheets.
These are the fundamental data tools of probably
a majority of the world. There are specialized
applications, like Tableau for data visualization,
or SPSS, it is a very common statistical package
in the social sciences and in businesses,
and one of my favorites, JASP, which is a
free open source analog of SPSS, which actually
I think is a lot easier to use and replicate
research with. And, there are tons of other
choices. Now, in terms of web data, it is
helpful to be familiar with things like HTML,
and XML, and JSON, and other formats that
are used to encapsulate data on the web, because
those are the formats that you are going to
have to program against
when you get your data. And then there are
actual coding languages. R is probably the
most common, along with Python, a general-purpose
language that has been well adapted for
data use. There's SQL, the structured query
language for databases, and foundational languages
like C, C++, and Java, which are used more
in the back-end of data science. And then
there is Bash, the most common command line
interface, and regular expressions. And we
will talk about all of these in other courses
here at datalab. But, remember this: tools
are just tools. They are only one part of
the data science process. They are a means
to the end, and the end, the goal is insight.
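As one tiny taste of how small these tools can be in practice, here is a hedged sketch using Python's built-in regular expressions, which were just mentioned. The line of text is invented purely for illustration:

```python
# Regular expressions let you pull structured values out of loosely
# structured text. The line below is an invented example, not real data.
import re

line = "order #1042 shipped to ZIP 84321 on 2015-06-01"

order_id = re.search(r"#(\d+)", line).group(1)    # digits after "#"
zip_code = re.search(r"ZIP (\d{5})", line).group(1)  # five-digit ZIP
print(order_id, zip_code)
```

A few lines like these are often all it takes to turn messy text into usable data, which is exactly the sense in which tools serve the goal rather than being the goal.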
You need to know where you are trying to go
and then simply choose the tools that help
you reach that particular goal. That's the
most important thing. So, in sum, here's a
few things: number one, use your tools wisely.
Remember your questions need to drive the
process, not the tools themselves. Also, I
will just mention that a few tools are usually
enough. You can do an awful lot with Excel
and R. And then, the most important thing
is: focus on your goal and choose your tools
and even your data to match the goal, so you
can get the most useful insights from your
data. The next step in our discussion of data
science methods is mathematics, and I am going
to give a very brief overview of the math
involved in data science. Now, the important
thing to remember is that math really forms
the foundation of what we're going to do.
If you go back to the Data Science Venn Diagram,
we've got stats up here in the right corner,
but really it's math and stats, or quantitative
ability in general, but we'll focus on the
math part right here. And probably the most
important question is how much math is enough
to do what you need to do? Or to put it another
way, why do you need math at all, when
you have got a computer to do it? Well, I
can think of three reasons you don't want
to rely on just the computer, but it is helpful
to have some sound mathematical understanding.
Here they are: number one, you need to know
which procedures to use and why. So you have
your question, you have your data and you
need to have enough of an understanding to
make an informed choice. That's not terribly
difficult. Two, you need to know what to do
when things don't work right. Sometimes you
get impossible results. I know that in statistics
you can get a negative adjusted R²; that's
not supposed to happen. And it is good to
know the mathematics that go into calculating
that so you can understand how something apparently
impossible can work. Or, you are trying to
do a factor analysis or principal component
analysis and you get a rotation that won't converge.
It helps to understand what it is about the
algorithm that's happening, and why that won't
work in that situation. And number three,
interestingly, some procedures, some math
is easier and quicker to do by hand than by
firing up the computer. And I'll show you
a couple of examples in later videos, where
that can be the case. Now, fundamentally there
is a nice sort of analogy here. Math is to
data science as, for instance, chemistry is
to cooking, kinesiology is to dancing, and
grammar is to writing. The idea here is that
you can be a wonderful cook without knowing
any chemistry, but if you know some chemistry
it is going to help. You can be a wonderful
dancer without knowing kinesiology, but it is
going to help. And you can probably be a good
writer without having an explicit knowledge
of grammar, but it is going to make a big
difference. The same thing is true of data
science; you will do it better if you have
some of the foundational information. So,
the next question is: what kinds of math do
you need for data science? Well, there's a
few answers to that. Number one is algebra;
you need some elementary algebra. That is,
basically, the simple stuff. You may have to
do some linear or matrix algebra because that
is the foundation of a lot of the calculations.
And you can also have systems of linear equations
where you are trying to solve several equations
all at once. It is a tricky thing to do, in
theory, but this is one of the things that
is actually easier to do by hand sometimes.
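Here is a worked sketch of that by-hand idea. The system itself is invented for illustration, and the elimination steps are written out as comments so you can follow them without a computer at all:

```python
# Solving a small system of linear equations "by hand," with each
# elimination step written out. The system is just an example:
#   2x + y = 5
#   x  - y = 1
# Add the two equations: the y terms cancel, giving 3x = 6, so x = 2.
# Substitute back into the first equation: y = 5 - 2x = 1.
x = 6 / 3        # from adding the equations: 3x = 6
y = 5 - 2 * x    # substitute x into the first equation
print(x, y)      # 2.0 1.0

# Check: both original equations should hold.
assert 2 * x + y == 5
assert x - y == 1
```

For a small system like this, the pencil-and-paper route really can be faster than firing up software.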
Now, there's more math. You can get some Calculus.
You can get some big O notation, which has to do with
the order of a function - roughly,
how fast it grows. Probability theory
can be important, and Bayes' theorem, which
is a way of getting what is called a posterior
probability can also be a really helpful tool
for answering some fundamental questions in
data science. So in sum: a little bit of math
can help you make informed choices when planning
your analyses. Very significantly, it can
help you find the problems and fix them when
things aren't going right. It is the ability
to look under the hood that makes a difference.
And then truthfully, some mathematical procedures,
like systems of linear equations, that can
even be done by hand, sometimes faster than
you can do with a computer. So, you can save
yourself some time and some effort and move
ahead more quickly toward your goal of insight.
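The Bayes' theorem mentioned a moment ago fits in just a few lines of code. The numbers below are a classic made-up screening-test illustration, not real data:

```python
# Bayes' theorem: P(A|B) = P(B|A) * P(A) / P(B).
# Invented illustration: a condition with 1% prevalence, a test that
# detects it 90% of the time, and a 5% false-positive rate.
p_condition = 0.01
p_pos_given_condition = 0.90
p_pos_given_healthy = 0.05

# Total probability of a positive test result, across both groups.
p_pos = (p_pos_given_condition * p_condition
         + p_pos_given_healthy * (1 - p_condition))

# Posterior probability: the condition, given a positive test.
posterior = p_pos_given_condition * p_condition / p_pos
print(round(posterior, 3))  # about 0.154
```

That posterior, roughly 15%, is the kind of counterintuitive answer Bayes' theorem gives you, and it's a good example of the fundamental questions it helps with in data science.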
Now, data science wouldn't be data science
and its methods without a little bit of statistics.
So, I am going to give you a brief statistics
overview here of how things work in data science.
Now, you can think of statistics as really
an attempt to find order in chaos, find patterns
in an overwhelming mess. Sort of like trying
to see the forest and the trees. Now, let's
go back to our little Venn Diagram here. We
recently had math and stats here in the top
corner. We are going to go back to talking
about stats, in particular. What you are trying
to do here; one thing is to explore your data.
You can have exploratory graphics, because
we are visual people and it is usually easiest
to see things. You can have exploratory statistics,
a numerical exploration of the data. And you
can have descriptive statistics, which are
the things that most people would have talked
about when they took a statistics class in
college (if they did that). Next, there is
inference. I've got smoke here because you
can infer things about the wind and the air
movement by looking at patterns in smoke.
The idea here is that you are trying to take
information from samples and infer something
about a population. You are trying to go from
one source to another. One common version
of this is hypothesis testing. Another common
version is estimation, often expressed as confidence
intervals. There are other ways to do it,
but all of these let you go beyond the data
at hand to making larger conclusions. Now,
one interesting thing about statistics is
you're going to have to be concerned with
some of the details and arranging things just
so. For instance, you get to do something
like feature selection - that is, picking
the variables, or combinations of variables,
that should be included -
and there are problems that can come up that
are frequent problems and I will address some
of those in later videos. There's also the
matter of validation. When you create a statistical
model you have to see if it is actually accurate.
Hopefully, you have enough data that you can
have a holdout sample and do that, or you
can replicate the study. Then, there is the
choice of estimators that you use; how you
actually get the coefficients or the combinations
in your model. And then there's ways of assessing
how well your model fits the data. All of
these are issues that I'll address briefly
when we talk about statistical analysis at
greater length. Now, I do want to mention
one thing in particular here, and I just call
this "beware the trolls." There are people
out there who will tell you that if you don't
do things exactly the way they say to do it,
that your analysis is meaningless, that your
data is junk and you've lost all your time.
You know what? They're trolls. So, the idea
here is don't listen to that. You can make
enough of an informed decision on your own
to go ahead and do an analysis that is still
useful. Probably one of the most important
things to think about in this is this wonderful
quote from a very famous statistician and
it says, "All models or all statistical models
are wrong, but some are useful." And so the
question isn't whether you're technically
right, or you have some sort of level of intellectual
purity, but whether you have something that
is useful. That, by the way, comes from George
Box. And I like to think of it basically as
this: as wave your flag, wave your "do it
yourself" flag, and just take pride in what
you're able to accomplish even when there
are people who may be criticizing it. Go ahead,
you're doing something, go ahead and do it.
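Before summing up, here is a tiny concrete taste of the exploratory and descriptive statistics discussed earlier, using Python's built-in statistics module. The scores are invented for illustration:

```python
# Descriptive statistics with Python's standard library.
# The scores below are made-up example data.
import statistics

scores = [4, 8, 6, 5, 3, 7, 9, 5]

print(statistics.mean(scores))             # central tendency: 5.875
print(statistics.median(scores))           # middle value: 5.5
print(round(statistics.stdev(scores), 2))  # spread (sample SD)
```

Three numbers like these are often the first, and entirely respectable, step in finding order in the chaos of a data set.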
So, in sum: statistics allow you to explore
and describe your data. It allows you to infer
things about the population. There are a lot
of choices available, a lot of procedures.
But no matter what you do, the goal is useful
insight. Keep your eyes on that goal and you
will find something meaningful and useful
in your data to help you in your own research
and projects. Let's finish our data science
methods overview by getting a brief overview
of Machine Learning. Now, I've got to admit
when you say the term "machine learning,"
people start thinking something like, "the
robot overlords are going to take over the
world." That's not what it is. Instead, let's
go back to our Venn Diagram one more time,
and in the intersection at the top between
coding and stats is Machine Learning or as
it's commonly called, just ML. The goal of
Machine Learning is to go and work in data
space so you can, for instance, take
a whole lot of data (we've got tons of books
here), and then you can reduce the dimensionality.
That is, take a very large, scattered, data
set and try to find the most essential parts
of that data. Then you can use these methods
to find clusters within the data; like goes
with like. You can use methods like k-means.
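Here is a bare-bones sketch of the k-means idea in one dimension: assign each point to its nearest center, move each center to the mean of its points, and repeat. The data and the starting centers are invented for illustration:

```python
# A toy k-means in one dimension. Data and starting centers are invented.
data = [1.0, 1.2, 0.8, 8.0, 8.3, 7.9]
centers = [0.0, 10.0]  # deliberately rough starting guesses

for _ in range(10):  # a few iterations is plenty for this toy data
    # Assignment step: group each point with its nearest center.
    clusters = [[], []]
    for x in data:
        nearest = min(range(len(centers)), key=lambda i: abs(x - centers[i]))
        clusters[nearest].append(x)
    # Update step: move each center to the mean of its cluster.
    centers = [sum(c) / len(c) for c in clusters]

print(centers)  # the centers settle near the two clumps in the data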
You can also look for anomalies or unusual
cases that show up in the data space. Or,
if we go back to categories again, I talked
about like goes with like. You can use things like
logistic regression or k-nearest neighbors,
KNN. You can use Naive Bayes for classification
or Decision Trees or SVM, which is Support
Vector Machines, or artificial neural nets.
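Of the classifiers just listed, k-nearest neighbors is perhaps the easiest to sketch: classify a new case by the majority label among its k closest known cases. The points and labels here are invented, and distance is plain Euclidean distance:

```python
# A minimal k-nearest neighbors (KNN) classifier.
# The labeled points below are invented for illustration.
from collections import Counter
import math

known = [((1, 1), "A"), ((1, 2), "A"), ((2, 1), "A"),
         ((6, 6), "B"), ((6, 7), "B"), ((7, 6), "B")]

def knn_label(point, k=3):
    # Sort known cases by distance to the new point, take the k nearest,
    # and return the most common label among them.
    nearest = sorted(known, key=lambda p: math.dist(point, p[0]))[:k]
    labels = [label for _, label in nearest]
    return Counter(labels).most_common(1)[0][0]

print(knn_label((2, 2)))  # lands near the "A" clump
print(knn_label((6, 5)))  # lands near the "B" clump
```

The other methods, from Naive Bayes to neural nets, are more elaborate, but they all serve the same end of putting similar cases next to each other.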
Any of those will help you find the patterns
and the clumping in the data so you can get
similar cases next to each other, and get
the cohesion that you need to make conclusions
about these groups. Also, a major element
of machine learning is predictions. You're
going to point your way down the road. The
most common approach here, and the most basic,
is linear regression, then multiple regression.
There is also Poisson regression, which is
used for modeling count or frequency data.
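The most basic of those prediction methods, simple linear regression fit by ordinary least squares, fits in a few lines. The x and y values are invented and chosen to lie exactly on the line y = 2x + 1 so the answer is easy to check:

```python
# Simple linear regression by ordinary least squares.
# The data are invented, lying exactly on y = 2x + 1.
xs = [1, 2, 3, 4, 5]
ys = [3, 5, 7, 9, 11]

n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n

# Slope = covariance(x, y) / variance(x); the line passes through the means.
slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
         / sum((x - mean_x) ** 2 for x in xs))
intercept = mean_y - slope * mean_x

print(slope, intercept)   # 2.0 and 1.0 for this made-up data
predicted = slope * 6 + intercept
print(predicted)          # prediction for a new case, x = 6: 13.0
```

That last line is the whole point of prediction: once the model is fit, you can point your way down the road to scores for cases you haven't seen yet.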
And then there is the issue of Ensemble models,
where you create several models and you take
the predictions from each of those and you
put them together to get an overall more reliable
prediction. Now, I will talk about each of
these in a little more detail in later courses,
but for right now I mostly just want you to
know that these things exist, and that's what
we mean when we refer to Machine Learning.
So, in sum: machine learning can be used to
categorize cases and to predict scores on
outcomes. And there are many choices and procedures available. But, again,
as I said with statistics, and I'll also say
again many times after this, no matter what,
the goal is not that "I'm going to do an artificial
neural network or an SVM," the goal is to get
useful insight into your data. Machine learning is a tool; use it to the extent that it
helps you get that insight that you need.
In the last several videos I've talked about
the role of technical skills in data science.
On the other hand, communicating is essential
to the practice, and the first thing I want
to talk about there is interpretability. The
idea here is that you want to be able to lead people along a path through your data. You want
to tell a data-driven story, and that's the
entire goal of what you are doing with data
science. Now, another way to think about this
is: when you are doing your analysis, what
you're trying to do is solve for value. You're making an equation: you take the data and solve for value. The trouble is
this: a lot of people get hung up on analysis,
but they need to remember that analysis is
not the same thing as value. Instead, I like
to think of it this way: that analysis times
story is equal to value. Now, please note
that's multiplicative, not additive, and one consequence shows up when you go back to analysis times story equals value. Well,
if you have zero story you're going to have
zero value because, as you recall, anything
times zero is zero. So, instead of that let's
go back to this and say what we really want
to do is, we want to maximize the story so
that we can maximize the value that results
from our analysis. Again, maximum value is
the overall goal here. The analysis, the tools,
the tech, are simply methods for getting to
that goal. So, let's talk about goals. For
instance, an analysis is goal-driven. You
are trying to accomplish something that's
specific, so the story, or the narrative,
or the explanation you give about your project
should match those goals. If you are working
for a client that has a specific question
that they want you to answer, then you have
a professional responsibility to answer those
questions clearly and unambiguously, so they
know whether you said yes or no and they know
why you said yes or no. Now, part of the problem
here is the fact the client isn't you and
they don't see what you do. And as I show
here, simply covering your face doesn't make
things disappear. You have to worry about
a few psychological obstacles. You have
to worry about egocentrism. And I'm not talking
about being vain, I'm talking about the idea
that you think other people see and know and
understand what you know. That's not true;
otherwise, they wouldn't have hired you in
the first place. And so you have to put it
in terms that the client works with, and that
they understand, and you're going to have
to get out of your own center in order to
do that. Also, there's the idea of false consensus;
the idea that, "well everybody knows that."
And again, that's not true, otherwise, they
wouldn't have hired you. You need to understand
that they are going to come from a different
background with a different range of experience
and interpretation. You're going to have to
compensate for that. A funny little thing
is the idea about anchoring. When you give
somebody an initial impression, they use that
as an anchor, and then they adjust away from
it. So if you are going to try to flip things
over on their heads, watch out for giving
a false impression at the beginning unless
you absolutely need to. But most importantly,
in order to bridge the gap between the client
and you, you need to have clarity and explain
yourself at each step. You can also think
about the answers. When you are explaining
the project to the client, you might want
to follow a very simple procedure: state
the question that you are answering. Give
your answer to that question, and if you need
to, qualify as needed. And then, go in order
top to bottom, so you're trying to make it
as clear as possible what you're saying, what
the answer is, and make it really easy to
follow. Now, what about discussing your process, how you did all this? Most of the time, they probably don't care; they just want to know what the answer is and that you used a good method to get it. So, discuss processes or technical details only when absolutely necessary. That's
something to keep in mind. The point here is to remember what analysis means. This, by the way, is a mechanical typewriter broken into its individual components. Analysis means to take
something apart, and analysis of data is an
exercise in simplification. You're taking
the overall complexity, sort of the overwhelmingness
of the data, and you're boiling it down and
finding the patterns that make sense and serve
the needs of your client. Now, let's go to
a wonderful quote from our friend Albert Einstein
here, who said, "Everything should be made
as simple as possible, but not simpler." That's
true in presenting your analysis. Or, if you
want to go see the architect and designer
Ludwig Mies van der Rohe, who said, "Less
is more." It is actually Robert Browning who
originally said that, but Mies van der Rohe
popularized it. Or, if you want another way
of putting a principle that comes from my
field, I'm actually a psychological researcher;
they talk about being minimally sufficient.
Just enough to adequately answer the question.
If you're in commerce you know about a minimum viable product; it is sort of the same idea within analysis here, the minimum viable analysis.
So, here's a few tips: when you're giving
a presentation, more charts, less text, great.
And then, simplify the charts; remove everything
that doesn't need to be in there. Generally,
you want to avoid tables of data because those
are hard to read. And then, one more time
because I want to emphasize it, less text again. Charts, not tables, can usually carry the
message. And so, let me give you an example
here. I'm going to use a very famous dataset: the Berkeley admissions data. Now, these are not
stairs at Berkeley, but it gives the idea
of trying to get into something that is far
off and distant. Here's the data; this is
graduate school admissions in 1973, so it's
over 40 years ago. The idea is that men and
women were both applying for graduate school
at the University of California Berkeley.
And what we found is that 44 percent of the
men who applied were admitted, that's their
part in green. And of the women, only 35 percent
of women were admitted when they applied.
So, at first glance this looks like bias, and
it actually led to a lawsuit, it was a major
issue. So, what Berkeley then tried to do
was find out, "well which programs are responsible
for this bias?" And they got a very curious
set of results. You can break the applications down by program (and here we are calling them A through F), six different programs, with male applicants on the left and female applicants on the right. If you look at program A,
women actually got accepted at a higher rate,
and the same is true for B, and the same is
true for D, and the same is true for F. And
so, this is a very curious set of responses
and it is something that requires explanation.
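You can reproduce this kind of reversal with a few lines of Python. To be clear, the counts below are made up purely to illustrate the effect; they are not the actual Berkeley numbers:

```python
# Illustrative (made-up) admissions counts: (applied, admitted).
men   = {"A": (100, 80), "B": (20, 4)}    # A is easy, B is selective
women = {"A": (20, 18),  "B": (100, 25)}

def rate(applied, admitted):
    return admitted / applied

# Per department: women are admitted at a HIGHER rate in both.
for dept in ("A", "B"):
    print(dept, rate(*men[dept]), rate(*women[dept]))

# Aggregate: women's overall rate is LOWER, because far more women
# applied to the selective department B.
men_total   = rate(sum(a for a, _ in men.values()),
                   sum(ad for _, ad in men.values()))
women_total = rate(sum(a for a, _ in women.values()),
                   sum(ad for _, ad in women.values()))
print(men_total, women_total)
```

Women do better in each department yet worse overall, simply because of where the applications went.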
Now in statistics, this is something that
is known as Simpson's Paradox. But here is
the paradox: bias may be negligible at the
department level. And in fact, as we saw in
four of the departments, there was a possible
bias in favor of women. The explanation is that women applied to more selective programs, programs with lower acceptance rates. Now,
some people stop right here and say therefore,
"nothing is going on, nothing to complain
about." But you know, that's still ending
the story a little bit early. There are other
questions that you can ask, and if you are producing a data-driven story, this is the kind of follow-up you would want to do. So, for instance, you may
want to ask, "why do the programs vary in
overall class size? Why do the acceptance
rates differ from one program to the other?
Why do men and women apply to different programs?"
And you might want to look at things like
the admissions criteria for each of the programs,
the promotional strategies, how they advertise
themselves to students. You might want to
look at the kinds of prior education the students
have in the programs, and you really want
to look at funding level for each of the programs.
And so, really, you get one answer, which leads to more questions, maybe some more answers, and still more questions, and you need to address enough
of this to provide a comprehensive overview
and solution to your client. In sum, let's
say this: stories give value to data analysis.
And when you tell the story, you need to make
sure that you are addressing your client's
goals in a clear, unambiguous way. The overall
principle here is be minimally sufficient.
Get to the point, make it clear. Say what
you need to, but otherwise be concise and
make your message clear. The next step in
discussing data science and communicating
is to talk about actionable insights, or information
that can be used productively to accomplish
something. Now, to give sort of a bizarre
segue here, you look at a game controller.
It may be a pretty thing, it may be a nice
object, but remember: game controllers exist
to do something. They exist to help you play
the game and to do it as effectively as possible.
They have a function, they have a purpose.
In the same way, data is for doing. Now, that's a paraphrase of one of my favorite historical figures, William James, the father of American psychology and of pragmatism in philosophy. And he has this wonderful quote,
he said, "My thinking is first and last and
always for the sake of my doing." And the
idea applies to analysis. Your analysis and
your data is for the sake of your doing. So,
you're trying to get some sort of specific
insight into how you should proceed. What you
want to avoid is the opposite of this from
one of my other favorite cultural heroes,
the famous Yankees catcher Yogi Berra, who
said, "We're lost, but we're making good time."
The idea here is that frantic activity does
not make up for lack of direction. You need
to understand what you are doing so you can
reach the particular goal. And your analysis
is supposed to do that. So, when you're giving
your analysis, you're going to try to point
the way. Remember, why was the project conducted?
The goal is usually to direct some kind of
action, reach some kind of goal for your client.
And the analysis should be able to guide
that action in an informed way. One thing
you want to do is, you want to be able to
give the next steps to your client. Give the
next steps; tell them what they need to do
now. You want to be able to justify each of
those recommendations with the data and your
analysis. As much as possible be specific,
tell them exactly what they need to do. Make
sure it's doable by the client, that it's
within their range of capability. And that
each step should build on the previous step.
Now, that being said, there is one really
fundamental sort of philosophical problem
here, and that's the difference between correlation
and causation. Basically, it goes this way:
your data gives you correlation; you know
that this is associated with that. But your
client doesn't simply want to know what's
associated; they want to know what causes
something. Because if they are going to do
something, that's an intervention designed
to produce a particular result. So, really,
how do you get from the correlation, which
is what you have in the data, to the causation,
which is what your client wants? Well, there's
a few ways to do that. One is experimental
studies; these are randomized, controlled
trials. Now, that's theoretically the simplest
path to causality, but it can be really tricky
in the real world. There are quasi-experiments, a whole collection of methods that use non-randomized data,
usually observational data, adjusted in particular
ways to get an estimate of causal inference.
Or, there's theory and experience. And
this is research-based theory and domain-specific
experience. And this is where you actually
get to rely on your client's information.
They can help you interpret the information,
especially if they have greater domain expertise
than you do. Another thing to think about
are the social factors that affect your data.
Now, you remember the data science Venn Diagram.
We've looked at it lots of times. It has got
these three elements. Some have proposed adding a fourth circle to this Venn diagram, and
we'll kind of put that in there and say that
social understanding is also important, critical
really, to valid data science. Now, I love
that idea, and I do think that it's important
to understand how things are going to play
out. There are a few kinds of social understanding.
You want to be aware of your client's mission.
You want to make sure that your recommendations
are consistent with your client's mission.
Also, that your recommendations are consistent
with your client's identity; not just, "This
is what we do," but, "This is really who we
are." You need to be aware of the business
context, sort of the competitive environment
and the regulatory environment that they're
working in. As well as the social context;
and that can be outside of the organization,
but even more often within the organization.
Your recommendations will affect relationships
within the client's organization. And you
are going to try to be aware of those as much
as you can to make it so that your recommendations
can be realized the way they need to be. So,
in sum: data science is goal focused, and
when you're focusing on that goal for your
client you need to give specific next steps
that are based on your analysis and justifiable
from the data. And in doing so, be aware of
the social, political, and economic context
that gives you the best opportunity of getting
something really useful out of your analysis.
When you're working in data science and trying
to communicate your results, presentation
graphics can be an enormously helpful tool.
Think of it this way: you are trying to paint
a picture for the benefit of your client.
Now, when you're working with graphics there
can be a couple of different goals; it depends
on what kind of graphics you're working with.
There's the general category of exploratory
graphics. These are ones that you are using
as the analyst. And for exploratory graphics,
you need speed and responsiveness, and so
you get very simple graphics. This is a base
histogram in R. And they can get a little
more sophisticated and this is done in ggplot2.
And you can break it down into a couple other
histograms, or you can make it a different
way, or make it see-through, or split them
apart into small multiples. But in each case,
this is done for the benefit of you as the
analyst understanding the data. These are
quick, they're effective. Now, they are not
very well-labeled, and they are usually for
your insight, and then you do other things
as a result of that. On the other hand, presentation
graphics which are for the benefit of your
client, those need clarity and they need a
narrative flow. Now, let me talk about each
of those characteristics very briefly. Clarity
versus distraction. There are things that
can go wrong in graphics. Number one is color.
Colors can actually be a problem. Also, three-dimensional effects and false dimensions are nearly always a distraction.
One that gets a little touchy for some people
is interaction. We think of interactive graphics
as really cool, great things to have, but
you run the risk of people getting distracted by the interaction and starting to play around with it, going, "Ooh, I press here and it does that." And that distracts from the message.
So actually, it may be important to not have
interaction. And then the same thing is true
of animation. Flat, static graphics can often
be more informative because they have fewer
distractions in them. Let me give you a quick
example of how not to do things. Now, this
is a chart that I made. I made it in Excel,
and I did it based on some of the mistakes
I've seen in graphics submitted to me when
I teach. And I guarantee you, everything in
here I have seen in real life, just not necessarily
combined all at once. Let's zoom in on this
a little bit, so we can see the full badness
of this graphic. And let's see what's going
on here. We've got a scale that starts at 8% and goes to 28% and is tiny; it doesn't even cover the range of the data. We've got this bizarre picture on the wall. We've got no axis lines on the walls. We come down here;
the labels for educational levels are in alphabetical order, instead of the more logical order of increasing education. Then we've got the data
represented as cones, which are difficult
to read and compare, and it's only made worse
by the colors and the textures. You know,
if you want to take an extreme, this one for
grad degrees doesn't even make it to the floor
value of 8% and this one for high school grad
is cut off at the top at 28%. This, by the
way, is a picture of a sheep, and people do
this kind of stuff and it drives me crazy.
If you want to see a better chart with the
exact same data, this is it right here. It
is a straight bar chart. It's flat, it's simple,
it's as clean as possible. And this is better
in many ways. Most effective here is that
it communicates clearly. There's no distractions,
it's a logical flow. This is going to get
the point across so much faster. And I can
give you another example of it; here's a chart from earlier about salaries. I have a list here, and I've got data scientist in it.
If I want to draw attention to it, I have
the option of putting a circle around it and
I can put a number next to it to explain it.
That's one way to make it easy to see what's
going on. We don't even have to get fancy.
You know, I just got out a pen and a post-it
note and I drew a bar chart of some real data
about life expectancy. This tells the story
as well, that there is something terribly
amiss in Sierra Leone. But, now let's talk
about creating narrative flow in your presentation
graphics. To do this, I am going to pull some
charts from my most cited academic paper,
which is called, A Third Choice: A Review
of Empirical Research on the Psychological
Outcomes of Restorative Justice. Think of it as mediation for crimes, mostly juvenile. And this paper is interesting because
really it's about fourteen bar charts with
just enough text to hold them together. And
you can see there's a flow. The charts are
very simple; this is judgments about whether
the criminal justice system was fair. The
two bars on the left are victims; the two
bars on the right are offenders. And for each
group on the left are people who participated
in restorative justice, that is, victim-offender mediation. And for each set on
the right are people who went through standard
criminal procedures. It says court, but it
usually means plea bargaining. Anyhow, it's
really easy to see that in both cases the
restorative justice bar is higher; people
were more likely to say it was fair. They
also felt that they had an opportunity to
tell their story; that's one reason why they
might think it's fair. They also felt the
offender was held accountable more often.
In fact, if you look at the court bar for offenders, that one's below fifty percent, and that's
the offenders themselves making the judgment.
Then you can go to forgiveness and apologies.
And again, this is actually a simple thing
to code and you can see there's an enormous
difference. In fact, one of the reasons there is such a big difference is that in standard court proceedings the offender very rarely meets the victim. It also turns out I need
to qualify this a little bit because a bunch
of the studies included drunk driving with
no injuries or accidents. Well, when we take
them out, we see a huge change. And then we
can go to whether a person is satisfied with
the outcome. Again, we see an advantage for
restorative justice. Whether the victim is
still upset about the crime, now the bars
are a little bit different. And whether they
are afraid of revictimization and that is
over a two to one difference. And then finally
recidivism for offenders or reoffending; and
you see a big difference there. And so what
I have here is a bunch of charts that are
very very simple to read, and they kind of
flow in how they're giving the overall impression
and then detailing it a little bit more. There's
nothing fancy here, there's nothing interactive,
there's nothing animated, there's nothing
kind of flowing in seventeen different directions.
It's easy, but it follows a story and it tells
a narrative about the data and that should
be your major goal with the presentation graphics.
In sum: presenting, or the graphics you use
for presenting, are not the same as the graphics
you use for exploring. They have different
needs and they have different goals. But no
matter what you are doing, be clear in your
graphics and be focused in what you're trying
to tell. And above all create a strong narrative
that gives different levels of perspective
and answers questions as you go to anticipate
a client's questions and to give them the
most reliable solid information and the greatest
confidence in your analysis. The final element
of data science and communicating that I wanted
to talk about is reproducible research. And
you can think of it as this idea; you want
to be able to play that song again. And the
reason for that is data science projects are
rarely "one and done;" rather they tend to
be incremental, they tend to be cumulative,
and they tend to adapt to these circumstances
that they're working in. So, one of the important
things here, probably, if you want to summarize
it very briefly, is this: show your work.
There's a few reasons for this. You may have
to revise your research at a later date, your
own analyses. You may be doing another project
and you want to borrow something from previous
studies. More likely you'll have to hand it
off to somebody else at a future point and
they're going to have to be able to understand
what you did. And then there's the very significant
issue in both scientific and economic research
of accountability. You have to be able to
show that you did things in a responsible
way and that your conclusions are justified;
that's for clients, funding agencies, regulators,
academic reviewers, any number of people.
Now, you may be familiar with the concept
of open data, but you may be less familiar
with the concept of open data science; that's
more than open data. So, for instance, I'll
just let you know there is something called the Open Data Science Conference, at ODSC.com. It meets three times a year in different places. And it is, of course, entirely devoted to open data science: using open data, but also making the methods transparent to the people around them. One thing that can make this
really simple is something called the Open
Science Framework, which is at OSF.io. It's
a way of sharing your data and your research with other people, along with annotations on how you got through the whole thing. It makes
the research transparent, which is what we
need. One of my professional organizations,
the Association for Psychological Science
has a major initiative on this called open
practices, where they are strongly encouraging
people to share their data as much as is ethically
permissible and to absolutely share their
methods before they even conduct a study as
a way of getting rigorous intellectual honesty
and accountability. Now, another step in all
of this is to archive your data, make that
information available, put it on the shelf.
And what you want to do here is, you want
to archive all of your datasets: both the totally raw dataset, before you did anything with it, and every step in the process up to your final clean dataset. Along with that,
you want to archive all of the code that you used to process and analyze the data.
If you used a programming language like R
or Python, that's really simple. If you used
a program like SPSS, you need to save the syntax files so the steps can be rerun.
And again, no matter what, make sure to comment
liberally and explain yourself. Now, part
of that is you have to explain the process,
because you are not just this lone person
sitting on the sofa working by yourself, you're
with other people and you need to explain
why you did it the way that you did. You need
to explain the choices, the consequences of
those choices, the times that you had to backtrack
and try it over again. This also works into
the principle of future-proofing your work.
You want to do a few things here. Number one;
the data. You want to store the data in non-proprietary
formats like a CSV or Comma Separated Values
file because anything can read CSV files.
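For instance, Python's standard library can write and read CSV with no extra software at all. This sketch uses an in-memory buffer and illustrative values; with a real file you would use open('data.csv', 'w', newline='') instead:

```python
import csv
import io

# A tiny dataset -- the values here are purely illustrative.
rows = [("country", "life_expectancy"),
        ("Japan", 84),
        ("Sierra Leone", 54)]

# Write the rows as CSV text.
buf = io.StringIO()
csv.writer(buf).writerows(rows)
text = buf.getvalue()
print(text)

# Anything can read it back -- no proprietary software needed.
back = list(csv.reader(io.StringIO(text)))
print(back[1])  # ['Japan', '84']
```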
If you stored it in the proprietary SPSS .sav format, you might be in a lot of trouble when
somebody tries to use it later and they can't
open it. Also, there's storage; you want to place all of your files in a secure, accessible location; GitHub is probably one of the best choices. And then the code: you may want
to use a dependency management tool like Packrat for R or virtual environments for Python, as a way of making sure that there are always working versions of the packages that you use, because sometimes things get updated and break. This is a way of making sure that the system that you have will always work. Overall, you can think
of this too: you want to explain yourself
and a neat way to do that is to put your narrative
in a notebook. Now, you can have a physical
lab book or you can also do digital books.
A really common one, especially if you're
using Python, is Jupyter with a "y" there
in the middle. Jupyter notebooks are interactive
notebooks. So, here's a screenshot of one
that I made in Python, and you have titles,
you have text, you have the graphics. If you
are working in R, you can do this through
something called RMarkdown, which works in a similar way: in RStudio, you use Markdown and you can annotate your analysis. You can
get more information about that at rmarkdown.rstudio.com.
And so for instance, here's an R analysis
I did; you can see the code on the left and the Markdown version on the right. What's neat about this is that this
little bit of code here, this title and this
text and this little bit of R code, then is
displayed as this formatted heading, as this
formatted text, and this turns into the entire
R output right there. It's a great way to
do things. And if you do RMarkdown, you actually
have the option of uploading the document
into something called RPubs; and that's an
online document that can be made accessible
to anybody. Here's a sample document. And
if you want to go see it, you can go to this
address. It's kind of long, so I am going
to let you write that one down yourself. But,
in sum: here's what we have. You want to do
your work and archive the information in a
way that supports collaboration. Explain your
choices, say what you did, show how you did
it. This allows you to future-proof your work,
so it will work in other situations for other
people. And as much as possible, no matter
how you do it, make sure you share your narrative
so people understand your process and they
can see that your conclusions are justifiable,
strong and reliable. Now, something I've mentioned
several times when talking about data science,
and I'll do it again in this conclusion, is
that it's important to give people next steps.
And I'm going to do that for you right now.
If you're wondering what to do after having
watched this very general overview course,
I can give you a few ideas. Number one, maybe
you want to start trying to do some coding
in R or Python; we have courses for those.
You might want to try doing some data visualization,
one of the most important things that you
can do. You may want to brush up on statistics
and maybe some math that goes along with it.
And you may want to try your hand at machine
learning. All of these will get you up and
rolling in the practice of data science. You
can also try looking at data sourcing, finding the information that you are going to work with. But,
no matter what happens try to keep it in context.
So, for instance, data science can be applied
to marketing, and sports, and health, and
education, and the arts, and really a huge
number of other things. And we will have courses
here at datalab.cc that talk about all of
those. You may also want to start getting
involved in the community of data science.
One of the best conferences that you can go
to is O'Reilly Strata, which meets several
times a year around the globe. There's also
Predictive Analytics World, again several
times a year around the world. Then there's
much smaller conferences; I love Tapestry, at tapestryconference.com, which is about
storytelling in data science. And Extract,
a one-day conference about data stories that
is put on by import.io, one of the great data
sourcing applications that's available for
scraping web data. If you want to start working
with actual data, a great choice is to go
to Kaggle.com and they sponsor data science
competitions, which actually have cash rewards.
There's also wonderful data sets you can work
with there to find out how they work and compare
your results to those of other people. And
once you are feeling comfortable with that,
you may actually try turning around and doing
some service; datakind.org is the premier
organization for data science as humanitarian
service. They do major projects around the
world. I love their examples. There are other
things you can do; there's an annual event
called Do Good Data, and then datalab.cc will
be sponsoring twice-a-year data charrettes,
which are opportunities for people in the
Utah area to work with the local nonprofits
on their data. But above all of this, I want
you to remember this one thing: data science
is fundamentally democratic. It's something
that everybody needs to learn to do in some
way, shape or form. The ability to work with
data is a fundamental ability and everybody
would be better off by learning to work with
data intelligently and sensitively. Or, to
put it another way: data science needs you.
Thanks so much for joining me in this introductory
course and I hope it has been good and I look
forward to seeing you in the other courses
here at datalab.cc. Welcome to "Data Sourcing".
I'm Barton Poulson and in this course, we're
going to talk about "Data Opus"; that's Latin for "data needed." The idea here is that no
data, no data science; and that is a sad thing.
So, instead of leaving it at that we're going
to use this course to talk about methods for
measuring and evaluating data and methods
for accessing existing data and even methods
for creating new, custom data. Take those
together and it's a happy situation. At the
same time, we'll do all of this still at an
accessible, conceptual and non-technical level
because the technical hands-on stuff will
happen in later other courses. But for now,
let's talk data. For data sourcing, the first
thing we want to talk about is measurement.
And within that category, we're going to talk
about metrics. The idea here is that you actually
need to know what your target is if you want
to have a chance to hit it. There's a few
particular reasons for this. First off, data
science is action-oriented; the goal is to
do something as opposed to simply understand
something, which is something I say as an
academic practitioner. Also, your goal needs
to be explicit and that's important because
the goals can guide your effort. So, you want
to say exactly what you are trying to accomplish,
so you know when you get there. Also, goals
exist for the benefit of the client, and they
can prevent frustration; they know what you're
working on, they know what you have to do
to get there. And finally, the goals and the
metrics exist for the benefit of the analyst
because they help you use your time well.
You know when you're done, you know when you
can move ahead with something, and that makes
everything a little more efficient and a little
more productive. And when we talk about this
the first thing you want to do is try to define
success in your particular project or domain.
Depending on where you are, in commerce that
can include things like sales, or click-through
rates, or new customers. In education it can
include scores on tests; it can include graduation
rates or retention. In government, it can
include things like housing and jobs. In research,
it can include the ability to serve the people
that you're trying to better understand. So, whatever
domain you're in there will be different standards
for success and you're going to need to know
what applies in your domain. Next, are specific
metrics or ways of measuring. Now again, there
are a few different categories here. There
are business metrics, there are key performance
indicators or KPIs, there are SMART goals
(that's an acronym), and there's also the
issue of having multiple goals. I'll talk
about each of those for just a second now.
First off, let's talk about business metrics.
If you're in the commercial world there are
some common ways of measuring success. A very
obvious one is sales revenue; are you making
more money, are you moving the merchandise,
are you getting sales. Also, there's the issue
of leads generated, new customers, or new
potential customers because that, then, in
turn, is associated with future sales. There's
also the issue of customer value or lifetime
customer value, so you may have a small number
of customers, but they all have a lot of revenue
and you can use that to really predict the
overall profitability of your current system.
And then there's churn rate, which has to
do with, you know, losing and gaining new
customers and having a lot of turnover. So,
any of these are potential ways for defining
success and measuring it. These are potential
metrics, there are others, but these are some
really common ones. Now, I mentioned earlier
something called a key performance indicator
or KPI. KPIs come from David Parmenter, and
he's got a few ways of describing what a key
performance indicator for business should be.
Number one, it should be nonfinancial: not just
the bottom line, but something else that might
be associated with it or that measures the
overall productivity of the organization. They
should be timely, for instance, weekly, daily,
or even constantly gathered information. They
should have a CEO focus, so the senior management
teams are the ones who generally make the
decisions that affect how the organization
acts on the KPIs. They should be simple, so
everybody in the organization, everybody knows
what they are and knows what to do about them.
They should be team-based, so teams can take
joint responsibility for meeting each one
of the KPIs. They should have significant
impact, what that really means is that they
should affect more than one important outcome,
such as profitability and market reach,
or improved manufacturing time and fewer defects.
And finally, an ideal KPI has a limited dark
side, that means there's fewer possibilities
for reinforcing the wrong behaviors and rewarding
people for sort of exploiting the system.
Next, there are SMART goals, where SMART stands
for Specific, Measurable, Assignable to a
particular person, Realistic (meaning you
can actually do it with the resources you
have at hand), and Time-bound, (so you know
when it can get done). So, whenever you form
a goal you should try to assess it on each
of these criteria and that's a way of saying
that this is a good goal to be used as a metric
for the success of our organization. Now,
the trick, however, is when you have multiple
goals, multiple possible endpoints. And the
reason that's difficult is because, well,
it's easy to focus on one goal if you're just
trying to maximize revenue or if you're just
trying to maximize graduation rate. There's
a lot of things you can do. It becomes more
difficult when you have to focus on many things
simultaneously, especially because some of
these goals may conflict. The things that
you do to maximize one may impair the other.
And so when that happens, you actually need
to start engaging in a deliberate process
of optimization, you need to optimize. And
there are ways that you can do this if you
have enough data; you can do mathematical
optimization to find the ideal balance of
efforts to pursue one goal and the other goal
at the same time. Now, this is a very general
summary and let me finish with this. In sum,
metrics or methods for measuring can help
awareness of how well your organization is
functioning and how well you're reaching your
goals. There are many different methods available
for defining success and measuring progress
towards those things. The trick, however,
comes when you have to balance efforts to
reach multiple goals simultaneously, which
can bring in the need for things like optimization.
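To make the idea of balancing multiple goals concrete, here's a minimal sketch in Python. The budget and the two payoff curves are made-up assumptions for illustration, not anything from a real project:

```python
import math

# Hypothetical scenario: split a fixed budget of 100 units between two
# conflicting goals. Each payoff curve has diminishing returns, so
# pouring everything into one goal is not optimal. The curves are invented.
def total_payoff(split, budget=100):
    goal_one = math.sqrt(split)            # e.g., payoff from spending on new sales
    goal_two = math.sqrt(budget - split)   # e.g., payoff from spending on retention
    return goal_one + goal_two

# A simple grid search over every whole-number split finds the best balance.
best_split = max(range(0, 101), key=total_payoff)
print(best_split)  # with symmetric curves, the best balance is an even 50/50 split
```

With real data you would estimate the payoff curves from past results and could use a proper optimizer instead of a grid search, but the logic is the same: quantify each goal, then search for the balance of efforts that does best overall.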
When talking about data sourcing and measurement,
one very important issue has to do with the
accuracy of your measurements. The idea here
is that you don't want to have to throw away
all your ideas; you don't want to waste effort.
One way of doing this in a very quantitative
fashion is to make a classification table.
So, what that looks like is this, you talk
about, for instance, positive results, negative
results... and in fact let's start by looking
at the top here. The middle two columns here
talk about whether an event is present, whether
your house is on fire, or whether a sale occurs,
or whether you have got a tax evader, whatever.
So, that's whether a particular thing is actually
happening or not. On the left here, is whether
the test or the indicator suggests that the
thing is or is not happening. And then you
have these combinations of true positives;
where the test says it's happening and it
really is, and false positives; where the
test says it happening, but it is not, and
then below that true negatives, where the
test says it isn't happening and that's correct
and then false negatives, where the test says
there's nothing going on, but there is in
fact the event occurring. And then you start
to get the column totals, the total number
of events present or absent, then the row
totals about the test results. Now, from this
table what you get is four kinds of accuracy,
or really four different ways of quantifying
accuracy using different standards. And they
go by these names: sensitivity, specificity,
positive predictive value, and negative predictive
value. I'll show you very briefly how each
of them works. Sensitivity can be expressed
this way, if there's a fire does the alarm
ring? You want that to happen. And so, that's
a matter of looking at the true positives
and dividing them by the total number of fires,
that is, the total events present. A positive
test means there's an alarm, and a present event
means there's a fire; you want there always
to be an alarm when there's
a fire. Specificity, on the other hand, is
sort of the flip side of this. If there isn't
a fire, does the alarm stay quiet? This is
where you're looking at the ratio of true
negatives to total absent events, where there's
no fire, and the alarms aren't ringing, and
that's what you want. Now, those are looking
at columns; you can also go sideways across
rows. So, the first one there is positive
predictive value, often abbreviated as PPV,
and we flip around the order a little bit.
This one says, if the alarm rings, was there
a fire? So, now you're looking at the true
positives and dividing it by the total number
of positives. Total number of positives is
any time the alarm rings. True positives are
because there was a fire. And negative predictive
value, or NPV, asks, if the alarm doesn't ring,
does that in fact mean that there is no fire?
Well, here you are looking at true negatives
and dividing it by total negatives, the time
that it doesn't ring. And again, you want
to maximize that so the true negatives account
for all of the negatives, the same way you
want the true positives to account for all
of the positives and so on. Now, you can put
numbers on all of these going from 0%
to 100% and the idea is to maximize each
one as much as you can. So, in sum, from these
tables we get four kinds of accuracy and there's
a different focus for each one. But, the same
overall goal, you want to identify the true
positives and true negatives and avoid the
false positives and the false negatives. And
this is one way of putting numbers,
an index really, on the accuracy of your measurement.
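Those four calculations are easy to express in code. Here's a small Python sketch; the fire-alarm counts are invented for illustration:

```python
# The four accuracy measures from a classification table.
# tp = true positives, fp = false positives,
# tn = true negatives, fn = false negatives.
def accuracy_metrics(tp, fp, tn, fn):
    return {
        "sensitivity": tp / (tp + fn),  # if there's a fire, does the alarm ring?
        "specificity": tn / (tn + fp),  # if there's no fire, does it stay quiet?
        "ppv":         tp / (tp + fp),  # if the alarm rings, was there a fire?
        "npv":         tn / (tn + fn),  # if it's quiet, is there really no fire?
    }

# Invented counts: 90 true alarms, 10 false alarms,
# 80 correct all-clears, 20 missed fires.
m = accuracy_metrics(tp=90, fp=10, tn=80, fn=20)
print(round(m["sensitivity"], 3))  # 90 / 110 = 0.818
```

Each value runs from 0 to 1 (0% to 100%), and the goal is to push all four as close to 1 as you can.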
Now data sourcing may seem like a very quantitative
topic, especially when we're talking about
measurement. But, I want to mention one important
thing here, and that is the social context
of measurement. The idea here really, is that
people are people, and they all have their
own goals, and they're going their own ways.
And we all have our own thoughts and feelings
that don't always coincide with each other,
and this can affect measurement. And so, for
instance, when you're trying to define your
goals and you're trying to maximize them you
want to look at things like, for instance,
the business model. An organization's business
model, the way they conduct their business,
the way they make their money, is tied to
its identity and its reason to be. And if
you make a recommendation and it's contrary
to their business model, that can actually
be perceived as a threat to their core identity,
and people tend to get freaked out in that
situation. Also, restrictions, so for instance,
there may be laws, policies, and common practices,
both organizationally and culturally, that
may limit the ways the goals can be met. Now,
most of these make a lot of sense, so the
idea is you can't just do anything you want,
you need to have these constraints. And when
you make your recommendations, maybe you'll
work creatively within them as long as you're
still behaving legally and ethically, but
you do need to be aware of these constraints.
Next, is the environment. And the idea here
is that competition occurs both between organizations,
so company A here is trying to reach a goal,
but they're competing with company B over
there, but probably even more significantly
there is competition within the organization.
This is really a recognition of office politics.
And when you, as a consultant, make a recommendation
based on your analysis, you need to understand
you're kind of dropping a little football
into the office, and what you recommend may further
one person's career, maybe to the detriment
of another. And in order for your recommendations
to have maximum effectiveness they need to
play out well in the office. That's something
that you need to be aware of as you're making
your recommendations. Finally, there's the
issue of manipulation. And a sad truism about
people is that any reward system, any reward
system at all, will be exploited and people
will generally game the system. This happens
especially when you have a strict cutoff;
you need to get at least 80 percent or you
get fired, and people will do anything to make
their numbers appear to be 80 percent.
This happens an awful lot when you look at
executive compensation systems, it happens a
lot when you have very high-stakes school testing,
it happens in an enormous number of situations,
and so, you need to be aware of the risk of
exploitation and gaming. Now, don't think,
then, that all is lost. Don't give up, you
can still do really wonderful assessment,
you can get good metrics, just be aware of
these particular issues and be sensitive to
them as you both conduct your research and
as you make your recommendations. So, in sum,
social factors affect goals and they affect
the way you meet those goals. There are limits
and consequences, both on how you reach the
goals and, really, on what the goals should
be. And when you're giving advice on how
to reach those goals, please be sensitive to
how things play out with metrics and how people
will adapt their behavior to meet the goals.
That way you can make something that's more
likely to be implemented the way you meant
and more likely to predict accurately what
can happen with your goals. When it comes
to data sourcing, obviously the most important
thing is to get data. But the easiest way
to do that, at least in theory, is to use
existing data. Think of it as going to the
bookshelf and getting the data that you have
right there at hand. Now, there's a few different
ways to do this: you can get in-house data,
you can get open data, and you can get third-party
data. Another nice way to think of that is
proprietary, public, and purchased data; the
three Ps I've heard it called. Let's talk
about each of these a little bit more. So,
in-house data, that's stuff that's already
in your organization. What's nice about that,
it can be really fast and easy, it's right
there and the format may be appropriate for
the kind of software and computers that
you are using. If you're fortunate, there's
good documentation, although sometimes when
it's in-house people just kind of throw it
together, so you have to watch out for that.
And there's the issue of quality control.
Now, this is true with any kind of data, but
you need to pay attention with in-house, because
you don't know the circumstances necessarily
under which people gathered the data and how
much attention they were paying to something.
There's also an issue of restrictions; there
may be some data that, while it is in-house,
you may not be allowed to use, or you may
not be able to publish the results or share
the results with other people. So, these are
things that you need to think about when you're
going to use in-house data, in terms of how
can you use it to facilitate your data science
projects. Specifically, there are a few pros
and cons. In-house data is potentially quick,
easy, and free. Hopefully it's standardized;
maybe even the original team that conducted
this study is still there. And you might have
identifiers in the data which make it easier
for you to do an individual level analysis.
On the con side however, the in-house data
simply may not exist, maybe it's just not
there. Or the documentation may be inadequate
and of course, the quality may be uncertain.
Always true, but may be something you have
to pay more attention to when you're using
in-house data. Now, another choice is open
data like going to the library and getting
something. This is prepared data that's freely
available; it consists of things like government
data and corporate data and scientific data
from a number of sources. Let me show you
some of my favorite open data sources just
so you know where they are and that they exist.
Probably, the best one is data.gov here in
the US. That is the, as it says right here,
the home of the US government's open data.
Or, you may have a state level one. For instance,
I'm in Utah and we have data.utah.gov, also
a great source of more regional information.
If you're in Europe, you have open-data.europa.eu,
the European Union open data portal. And then
there are major non-profit organizations,
so the UN has unicef.org/statistics for their
statistical and monitoring data. The World
Health Organization has the global health
observatory at who.int/gho. And then there
are private organizations that work in the
public interest, such as the Pew Research
Center, which shares a lot of its data sets
and the New York Times, which makes it possible
to use APIs to access a huge amount of the
data of things they've published over a huge
time span. And then two of the mother lodes:
there's Google, which hosts public data at
google.com, a wonderful thing. And then
Amazon at aws.amazon.com/datasets has gargantuan
datasets. So, if you needed a data set that
was like five terabytes in size, this is the
place that you would go to get it. Now, there's
some pros and cons to using this kind of open
data. First, is that you can get very valuable
datasets that maybe cost millions of dollars
to gather and to process. And you can get
a very wide range of topics and times and
groups of people and so on. And often, the
data is very well formatted and well documented.
There are, however, a few cons. Sometimes
there's biased samples. Say, for instance,
you only get people who have internet access,
and that can mean, not everybody. Sometimes
the meaning of the data is not clear or it
may not mean exactly what you want it to.
A potential problem is that sometimes you
may need to share your analyses and if you
are doing proprietary research, well, it's
going to have to be open instead, so that
can create a crimp with some of your clients.
And then finally there are issues with privacy
and confidentiality and in public data that
usually means that the identifiers are not
there and you are going to have to work at
a larger aggregate level of measurement. Another
option is to use data from a third-party,
these go by the name Data as a Service or
DaaS. You can also call them data brokers.
And the thing about data brokers is they can
give you an enormous amount of data on many
different topics, plus they can save you some
time and effort, by actually doing some of
the processing for you. And that can include
things like consumer behaviors and preferences,
they can get contact information, they can
do marketing identity and finances, there's
a lot of things. There's a number of data
brokers around, here's a few of them. Acxiom
is probably the biggest one in terms of marketing
data. There's also Nielsen which provides
data primarily for media consumption. And
there's another organization, Datasift, that's
a smaller, newer one. And there's a pretty
wide range of choices, but these are some
of the big ones. Now, the thing about using
data brokers, there's some pros and there's
some cons. The pros are first, that it can
save you a lot of time and effort. It can
also give you individual level data which
can be hard to get from open data. Open data
is usually at the community level; data brokers can
give you information about specific consumers.
They can even give you summaries and inferences
about things like credit scores and marital
status. Possibly even whether a person gambles
or smokes. Now, the cons are these: number one,
it can be really expensive, I mean this is
a huge service; it provides a lot of benefit
and is priced accordingly. Also, you still
need to validate it, you still need to double
check that it means what you think it means
and that it fits in with what you want. And
probably the real sticking point here is the
use of third-party data is distasteful to
many people, and so you have to be aware of that
as you're making your choices. So, in sum,
as far as sourcing existing data goes,
obviously data science needs data, and there are
the three Ps of data sources, Proprietary
and Public and Purchased. But no matter what
source you use, you need to pay attention
to quality and to the meaning and the usability
of the data to help you along in your own
projects. When it comes to data sourcing,
a really good way of getting data is to use
what are called APIs. Now, I like to think
of these as the digital version of Prufrock's
mermaids. If you're familiar with "The Love
Song of J. Alfred Prufrock" by T.S. Eliot, he
says, "I have heard the mermaids singing,
each to each." And I like
to adapt that to say, "APIs have heard apps
singing, each to each," and that's by me.
Now, more specifically when we talk about
an API, what we're talking about is something
called Application Programming Interface,
and this is something that allows programs
to talk to each other. Its most important
use, in terms of data science, is it allows
you to get web data. It allows your program
to directly go to the web, on its own, grab
the data, bring it back in almost as though
it were local data, and that's a really wonderful
thing. Now, the most common version of APIs
for data science are called REST APIs; that
stands for Representational State Transfer.
That's the software architectural style of
the world wide web and it allows you to access
data on web pages via HTTP, that's the hypertext
transfer protocol. They, you know, run the
web as we know it. And when you download the
data, you usually get it in JSON format;
that stands for JavaScript Object Notation.
The nice thing about that is that's human
readable, but it's even better for machines.
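To see why JSON is so machine-friendly, here's a tiny Python sketch that parses a made-up response. The field names and values are invented for illustration, not the actual payload of any real API:

```python
import json

# A made-up JSON payload, shaped loosely like what a REST API might return.
payload = """
{
  "season": "1957",
  "drivers": [
    {"givenName": "Juan", "familyName": "Fangio"},
    {"givenName": "Stirling", "familyName": "Moss"}
  ]
}
"""

data = json.loads(payload)  # one call turns the text into Python objects
surnames = [d["familyName"] for d in data["drivers"]]
print(surnames)  # ['Fangio', 'Moss']
```

One parsing call and the data is ready to hand to the rest of your program; that's the whole appeal.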
Then you can take that information and you
can send it directly to other programs. And
the nice thing about REST APIs is that they're
what is called language agnostic, meaning
any programming language can call a REST API,
can get data from the web, and can do whatever
it needs to with it. Now, there are a few
kinds of APIs that are really common. The
first is what are called Social APIs; these
are ways of interfacing with social networks.
So, for instance, the most common one is Facebook;
there's also Twitter. Google Talk has been
a big one and FourSquare as well and then
SoundCloud. These are on lists of the most
popular ones. And then there are also what
are called Visual APIs, which are for getting
visual data, so for instance, Google Maps
is the most common, but there's also YouTube,
which lets you access YouTube content on a
particular website, or AccuWeather, which is
for getting weather information. Pinterest
for photos, and Flickr
for photos as well. So, these are some really
common APIs and you can program your computer
to pull in data from any of these services
and sites and integrate it into your own website
or here into your own data analysis. Now,
there's a few different ways you can do this.
You can program it in R, the statistical programming
language, you can do it in Python, also you
can even use it in the very basic BASH command
line interface, and there's a ton of other
applications. Basically, anything can access
an API one way or another. Now, I'd like to
show you how this works in R. So, I'm going
to open up a script in RStudio and then I'm
going to use it to get some very basic information
from a webpage. Let me go to RStudio and show
you how this works. Let me open up a script
in RStudio that allows me to do some data
sourcing here. Now, I'm just going to use
a package called jsonlite, I'm going to load
that one up, and then I'm going to go to a
couple of websites. I'm going to be getting historical
data from Formula One car races and I'm going
to be getting it from Ergast.com. Now, if
we go to this page right here, I can go straight
to my browser right now. And this is what
it looks like; it gives you the API documentation,
so what you're doing for an API, is you're
just entering a web address and in that web
address it includes the information you want.
I'll go back to R here just for a second.
And if I want to get information about 1957
races in JSON format, I go to this address.
I can skip over to that for a second, and
what you see is it's kind of a big long mess
here, but it is all labeled and it is clear
to the computer what's going on here. Let's
go back to R. And so what I'm going to do
is, I am going to save that URL into an object
here, in R, and then I'm going to use the
command fromJSON to read that URL and save
it into R, which it has now done. And
I'm going to zoom in on that so you can see
what's happened. I've got this sort of mess
of text, this is actually a list object in
R. And then I'm going to get just the structure
of that object, so I'm going to do this one
right here and you can see that it's a list
and it gives you the names of all the variables
within each one of the lists. And what I'm
going to do is, I'm going to convert that
list to a data frame. I went through the list
and found where the information I wanted was
located; you have to use this big, long statement
here, and that will give me the names of the drivers.
Let me zoom in on that again. There they are.
And then I'm going to get just the column
names for that bit of the data frame. So,
what I have here is six different variables.
And then what I'm going to do is, I'm going
to pick just the first five cases and I'm
going to select some variables and put them
in a different order. And when I do that,
this is what I get. I will zoom in on that
again. And the first five people listed in
this data set that I pulled from 1957, are
Juan Fangio, which makes sense, one of the greatest
drivers ever, and other people who competed
in that year. And so what I've done is by
using this API call in R, a very simple thing
to do, I was able to pull data off that webpage
in a structured format, and do a very simple
analysis with it. And let's sum up what we've
learned from all this. First off, APIs make
it really easy to work with web data: they
structure the call for you, and then
they feed the data straight into the program for
you to analyze. And they are one of the best
ways of getting data and getting started in
data science. When you're looking for data,
another great way of getting data is through
scraping. And what that means is pulling information
from webpages. I like to think of it as when
data is hiding in the open; it's there, you
can see it, but there's not an easy, immediate
way to get that data. Now, when you're dealing
with scraping, you can get data in several
different formats. You can get HTML text from
webpages, you can get HTML tables from the
rows and columns that appear on webpages.
You can scrape data from PDFs, and you can
scrape all sorts of data from media like images
and video and audio. Now, we will make one
very important qualification before we say
anything else: pay attention to copyright
and privacy. Just because something is on
the web, doesn't mean you're allowed to pull
it out. Information gets copyrighted, and
so when I use examples here, I make sure that
this is stuff that's publicly available, and
you should do the same when you are doing
your own analyses. Now, if you want to scrape
data there's a couple of ways to do it. Number
one, is to use apps that are developed for
this. So, for instance, import.io is one of
my favorites. It is both a webpage, that's
its address, and it's a downloadable app.
There's also ScraperWiki. There's an application
called Tabula, and you can even do scraping
in Google Sheets, which I will demonstrate
in a second, and Excel. Or, if you don't want
to use an app or if you want to do something
that apps don't really let you do, you can
code your scraper. You can do it directly
in R, or Python, or Bash, or even Java or
PHP. Now, what you're going to do is you're
going to be looking for information on the
webpage. If you're looking for HTML text,
what you're going to do is pull structured
text from webpages, similar to how a reader
view works in a browser. It uses HTML tags
on the webpage to identify what's the important
information. So, there's things like body,
and h1 for header one, and p for paragraph,
and the angle brackets. You can also get information
from HTML tables, although this is a physical
table of rows and columns I am showing you.
This also uses HTML table tags, that is like
table, and tr for table row, and td for table
data, that's the cell. The trick is when you're
doing this, you need the table number and
sometimes you just have to find that through
trial and error. Let me give you an example
of how this works. Let's take a look at this
Wikipedia page on the Iron Chef America Competition.
I'm going to go to the web right now and show
you that one. So, here we are in Wikipedia,
Iron Chef America. And if you scroll down
a little bit, you see we have got a whole
bunch of text here, we have got our table
of contents, and then we come down here, we
have a table that lists the winners, the statistics
for the winners. And let's say we want to
pull that from this webpage into another program
for us to analyze. Well, there is an extremely
easy way to do this with Google Sheets. All
we need to do is open up the Google Sheet
and in cell A1 of that Google Sheet, we paste
in this formula. It's IMPORTHTML, then you
give the webpage and then you say that you
are importing a table, you have to put that
stuff in quotes, and the index number for
the table. I had to poke around a little bit
to figure out this was table number 2. So,
let me go to Google Sheets and show you how
this works. Here I have a Google Sheet and
right now it's got nothing in it. But watch
this; if I come here to this cell, and I simply
paste in that information, all the stuff just
sort of magically propagates into the sheet,
makes it extremely easy to deal with, and
now I can, for instance, save this as a CSV
file, put it in another program. Lots of options.
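If you'd rather code the scrape yourself, the same kind of table can be pulled apart with nothing but Python's standard library. This is a minimal sketch; the sample HTML, and the numbers in it, are made up:

```python
from html.parser import HTMLParser

# Collect the text of every cell in every table row.
class TableScraper(HTMLParser):
    def __init__(self):
        super().__init__()
        self.in_cell = False
        self.rows = []
        self.current = []

    def handle_starttag(self, tag, attrs):
        if tag in ("td", "th"):
            self.in_cell = True
        elif tag == "tr":
            self.current = []          # start a fresh row

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self.in_cell = False
        elif tag == "tr":
            self.rows.append(self.current)

    def handle_data(self, data):
        if self.in_cell:
            self.current.append(data.strip())

# Made-up sample HTML standing in for a real page.
sample = ("<table><tr><th>Chef</th><th>Wins</th></tr>"
          "<tr><td>Flay</td><td>50</td></tr></table>")
scraper = TableScraper()
scraper.feed(sample)
print(scraper.rows)  # [['Chef', 'Wins'], ['Flay', '50']]
```

On a real page you would first fetch the HTML (for example with urllib), find the table you want by trial and error just as with the index number above, and then feed it to the parser.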
And so this is one way that I'm scraping the
data from a webpage because I didn't use an
API, but I just used a very simple, one-line
command to get the information. Now, that
was a HTML table. You can also scrape data
from PDFs. You have to be aware of whether it's
a native PDF, I call that a text PDF, or a
scanned or imaged PDF. With native PDFs, the
scraper looks for text elements; again,
those are like code that indicates this is
text. And you can deal with raster images,
that's pixel images, or vector, which draws
the lines, and that's what makes them infinitely
scalable in many situations. And then in PDFs,
you can deal with tabular data, but you probably
have to use a specialized program like ScraperWiki
or Tabula in order to get that. And
then finally media, like images and video
and audio. Getting images is easy; you can
download them in a lot of different ways.
And then if you want to read data from them,
say for instance, you have a heat map of a
country, you can go through it, but you will
probably have to write a program that loops
through the image pixel-by-pixel to read the
data and then encode it numerically into your
statistical program. Now, that's my very brief
summary and let's summarize that. First off,
if the data you are trying to get at doesn't
have an existing API, you can try scraping
and you can write code in a language like
R or Python. But, no matter what you do, be
sensitive to issues of copyright and privacy,
so you don't get yourself in hot water, but
instead, you make an analysis that can be
of use to you or to your client. The next
step in data sourcing is making data. And
specifically, we're talking about getting
new data. I like to think of this as getting
hands-on and getting "data
de novo," new data. So, can't find the data
that you need for your analysis? Well, one
simple solution is, do it yourself. And we're
going to talk about a few general strategies
used for doing that. Now, these strategies
vary on a few dimensions. First off is the
role. Are you passive and simply observing
stuff that's happening already, or are you
active where you play a role in creating the
situation to get the data? And then there's
the "Q/Q question," and that is, are you going
to get quantitative, or numerical, data, or
are you going to get qualitative data, which
usually means text, paragraphs, sentences
as well as things like photos and videos and
audio? And also, how are you going to get
the data? Do you want to get it online, or
do you want to get it in person? Now, there's
other choices than these, but these are some
of the big delineators of the methods. When
you look at those, you get a few possible
options. Number one is interviews, and I'll
say more about those. Another one is surveys.
A third one is card sorting. And a fourth
one is experiments, although I actually want
to split experiments into two kinds of categories.
The first one is laboratory experiments, and
that's in-person projects where you shape
the information or an experience for the participants
as a way of seeing how that involvement changes
their reactions. It doesn't necessarily mean
that you're a participant, but you create
the situation. And then there's also A/B testing.
This is automated, online testing of two or
more variations on a webpage. It's a very,
very simple kind of experimentation that's
actually very useful for optimizing websites.
So, in sum, from this very short introduction
make sure you can get exactly what you need.
Get the data you need to answer your question.
And if you can't find it somewhere, then make
it. And, as always, you have many possible
methods, each of which has its own strengths
and its own compromises. And we'll talk
about each of those in the following sections.
The first method of data sourcing where you're
making new data that I want to talk about
is interviews. And that's not because it's
the most common, but because it's the one
you would do for the most basic problem. Now,
basically an interview is nothing more than
a conversation with another person or a group
of people. And, the fundamental question is,
why do interviews as opposed to doing a survey
or something else? Well, there's a few good
reasons to do that. Number one: you're working
with a new topic and you don't know what people's
responses will be, how they'll react. And
so you need something very open-ended. Number
two: you're working with a new audience and
you don't know how they will react in particular
to what it is you're trying to do. And number
three: something's going on with the current
situation, it's not working anymore, and you
need to find what's going on, and you need
to find ways to improve. The open-ended information
where you get past your existing categories
and boundaries can be one of the most useful
methods for getting that data. If you want
to put it another way, you want to do interviews
when you don't want to constrain responses.
Now, when it comes to interviews, you have
one very basic choice, and that's whether
you do a structured or an unstructured interview. With a
structured interview, you have a predetermined
set of questions, and everyone gets the same
questions in the same order. It gives a lot
of consistency even though the responses are
open-ended. And then you can also have what's
called an unstructured interview. And this
is a whole lot more like a conversation between
you, the interviewer, and the person you're
talking to: your questions arise in response
to their answers. Consequently, an unstructured
interview can be different for each person
that you talk to. Also, interviews are usually
done in person, but not surprisingly, they
can be done over the phone, or often online.
Now, a couple of things to keep in mind about
interviews. Number one is time. Interviews
can range from just a few minutes to several
hours per person. Second is training. Interviewing's
a special skill that usually requires specific
training. Now, asking the questions is not
necessarily the hard part. The really tricky
part is the analysis. The hardest part of
interviews by far is analyzing the answers
for themes as a way of extracting the new
categories and the dimensions that you need
for your further research. The beautiful thing
about interviews is that they allow you to
learn things that you never expected. So,
in sum: interviews are best for new situations
or new audiences. On the other hand, they
can be time-consuming, and they also require
special training, both to conduct the interview
and to analyze the highly qualitative
data that you get from them. The next logical
step in data sourcing and making data is surveys.
Now, think of this: if you want to know something
just ask. That's the easy way. But you want
to do a survey only under certain situations. The
real question is, do you know your topic and
your audience well enough to anticipate their
answers, to know the range of their answers
and the dimensions and categories that
are going to be important? If you do, then
a survey might be a good approach. Now, just
as there were a few dimensions for interviews,
there are a few dimensions for surveys. You
can do what is called a closed-ended survey;
that is also called a forced choice. It is
where you give people just particular options,
like a multiple choice. You can have an open-ended
survey, where you have the same questions
for everybody, but you allow them to write
in a free-form response. You can do surveys
in person, and you can also do them online,
by mail, or over the phone. And
now, it is very common to use software when
doing surveys. Some really common applications
for online surveys are SurveyMonkey, and Qualtrics,
or at the very simple end there is Google
Forms, and at the simple and pretty end there
is Typeform. There are a lot more choices,
but these are some of the major players and
how you can get data from online participants
in survey format. Now, the nice thing about
surveys is, they are really easy to do, they
are very easy to set up and they are really
easy to send out to large groups of people.
You can get tons of data really fast. On the
other hand, the same way that they are easy
to do, they are also really easy to do badly.
The problem is that the questions you ask,
they can be ambiguous, they can be double-barreled,
they can be loaded and the response scales
can be confusing. For instance, if your item
says, "I never think this particular way," and
the person marks "strongly disagree," the double
negative makes the answer hard to interpret. So,
you have to take special effort to make sure
that the meaning is clear, unambiguous, and
that the rating scale, the way that people
respond, is very clear and they know where
their answer falls. Which gets us into one
of the things about people behaving badly
and that is: beware the push poll. Now, especially
during election time, like the one we are in right
now, a push poll is something that sounds
like a survey but is really a
very biased attempt to get data, just fodder
for social media campaigns or for a chart
that says 98% of people
agree with me. A push poll is one that is
so biased, there is really only one way to
answer the questions. This is considered
extremely irresponsible and unethical from
a research point of view. Just hang up on
them. Now, aside from that egregious violation
of research ethics, you do need to do other
things like watch out for bias in the question
wording, in the response options, and also
in the sample selection because any one of
those can push your responses off one way
or another without you really being aware
that it is happening. So, in sum, let's say
this about surveys. You can get lots of data
quickly; on the other hand, it requires familiarity
with your audience and their possible answers,
so you know roughly what to expect. And
no matter what you do, you need to watch for
bias to make sure that your answers are going
to be representative of the group that you
are really concerned about understanding.
An interesting topic in Data Sourcing when
you are making data is Card Sorting. Now,
this isn't something that comes up very often
in academic research, but in web research,
this can be a really important method. Think
of it this way: much like building a model
of a molecule, you are
trying to build a model of people's
mental structures. Put more specifically,
how do people organize information intuitively?
And also, how does that relate to the things
that you are doing online? Now, the basic
procedure goes like this: you take a bunch
of little topics and you write each one on
a separate card. And you can do this physically,
with like three by five cards, or there are
a lot of programs that allow you to do a digital
version of it. Then what you do is you give
this information to a group of respondents
and the people sort those cards. So, they
put similar topics with each other, different
topics over here and so on. And then you take
that information and from that you are able
to calculate what is called dissimilarity
data. Think of it as like the distance or
the difference between various topics. And
that gives you the raw data to analyze how
things are structured. Now, there are two
very general kinds of card sorting tasks.
There are generative and there's evaluative.
A generative card sorting task is one in which
respondents create their own sets, their own
piles of cards using any number of groupings
they like. And this might be used, for instance,
to design a website. If people are going to
be looking for one kind of information next
to another one, then you are going to want
to put that together on the website, so they
know where to expect it. On the other hand,
if you've already created a website, then
you can do an evaluative card sorting. This
is where you have a fixed number or fixed
names of categories. Like for instance, the
way you have set up your menus already. And
then what you do is you see if people actually
put the cards into these various categories
that you have created. That's a way of verifying
that your hierarchical structure makes sense
to people. Now, whichever method you use, generative
or evaluative, what you end up with when you
do a card sort is an interesting kind
of visualization called a dendrogram, which
literally means a branching tree diagram. And what we have
here is actually a hundred and fifty data
points; if you are familiar with Fisher's
iris data, that's what's going on here. And
it groups it from one giant group on the left
and then splits it in pieces and pieces and
pieces until you end up with lots of different
observations, well actually, individual-level
observations at the end. But you can cut things
off into two or three groups or whatever is
most useful for you here, as a way of visualizing
the entire collection of similarity or dissimilarity
between the individual pieces of information
that you had people sort. Now, I will just
mention very quickly that if you want to do digital
card sorting, which makes your life infinitely
easier because keeping track of physical cards
is really hard, you can use something like
Optimal Workshop, UserZoom, or UX Suite.
These are some of the most common choices.
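If you are curious what the analysis itself looks like, here is a rough sketch in Python of how card-sort piles become dissimilarity data and then a dendrogram. The card labels, the pile assignments, and the use of SciPy's hierarchical clustering here are all my own illustrative assumptions, not material from the course files:

```python
# Hypothetical card-sorting results: each respondent assigns every
# card (topic) to a numbered pile; labels and piles are made up.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

cards = ["Shipping", "Returns", "Pricing", "Careers"]
sorts = [
    {"Shipping": 1, "Returns": 1, "Pricing": 2, "Careers": 3},
    {"Shipping": 1, "Returns": 1, "Pricing": 1, "Careers": 2},
    {"Shipping": 1, "Returns": 2, "Pricing": 2, "Careers": 3},
]
n = len(cards)

# Dissimilarity = share of respondents who put two cards in DIFFERENT piles
diss = np.zeros((n, n))
for s in sorts:
    for i in range(n):
        for j in range(n):
            if s[cards[i]] != s[cards[j]]:
                diss[i, j] += 1 / len(sorts)

# The condensed (upper-triangle) distances feed hierarchical clustering,
# whose merge tree is exactly what a dendrogram draws
condensed = diss[np.triu_indices(n, k=1)]
tree = linkage(condensed, method="average")

# "Cut" the tree into two groups, as described above
groups = fcluster(tree, t=2, criterion="maxclust")
```

With this toy data, Shipping and Returns land in the same branch while Careers splits off on its own, and calling scipy.cluster.hierarchy.dendrogram(tree) would draw the actual branch diagram.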
Now, let's just sum up what we've learned
about card sorting in this extremely brief
overview. Number one, card sorting allows
you to see intuitive organization of information
in a hierarchical format. You can do it with
physical cards or you can also have digital
choices for doing the same thing. And when
you are done, you actually get a hierarchical
or branched visualization of how the pieces of information
are structured and related to each other. When
you are doing your Data Sourcing and you are
making data, sometimes you can't get what
you want through the easy ways, and you've
got to take the hard way. And you can do what
I am calling laboratory experiments. Now of
course, when I mention laboratory experiments
people start to think of stuff like, you know,
Doctor Frankenstein in his lab, but lab experiments
are usually far more mundane. Nearly every experiment
I have done in my career has been a paper-and-pencil
one with people in a well-lighted
room, and it's not been the threatening kind.
Now, the reason you do a lab experiment is
because you want to determine cause and effect.
And this is the single most theoretically
sound way of getting that information. Now,
what makes an experiment an experiment is
the fact that researchers play active roles
in experiments with manipulations. Now, people
get a little freaked out when they hear "manipulations,"
thinking that you are coercing people and messing
with their mind. All that means is that you
are manipulating the situation; you are causing
something to be different for one group of
people or for one situation than another.
It's a benign thing, but it allows you to
see how people react to those different variations.
Now, when you are going to do an experiment,
you are going to want focused research;
an experiment usually tests one thing or one
variation at a time. And it is usually hypothesis-driven;
usually you don't do an experiment until you
have done enough background research to say,
"I expect people to react this way to this
situation and this way to the other." A key
component of all of this is that experiments
almost always have random assignment: regardless
of how you got your sample, once people are
in your study, you randomly assign them to
one condition or another. And what that does
is it balances out the pre-existing differences
between groups and that's a great way of taking
care of confounds and artifacts, the things
that are unintentionally associated with differences
between groups and that provide alternate explanations
for your data. If you have done good random
assignment and you have a large enough group
of people, then those confounds and artifacts
are basically minimized. Now, one place
where you are likely to see laboratory experiments
is, for instance, eye tracking
in web design. This is where you have to
bring people in front of a computer and you
stick a device there that tracks where they are
looking. That's how we know for instance that
people don't really look at ads on the side
of web pages. Another very common place is
research in medicine and education and in
my field, psychology. And in all of these,
what you find is that experimental research
is considered the gold standard for reliable
valid information about cause and effect.
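As a small illustration of the random-assignment step just described, here is a minimal Python sketch; the participant IDs, condition names, and fixed seed are all made up for the demonstration:

```python
# Randomly assign an already-recruited sample to conditions; with a
# large enough sample this balances pre-existing differences on average.
import random

def randomly_assign(participants, conditions=("control", "treatment"), seed=42):
    """Shuffle the sample, then deal participants out evenly
    across the conditions. All names here are hypothetical."""
    rng = random.Random(seed)  # fixed seed only to make the demo reproducible
    shuffled = list(participants)
    rng.shuffle(shuffled)
    # Deal the shuffled list round-robin into one list per condition
    return {c: shuffled[i::len(conditions)] for i, c in enumerate(conditions)}

sample = [f"p{i:02d}" for i in range(40)]
groups = randomly_assign(sample)
```

Every participant lands in exactly one condition, twenty per group here, and which condition any individual lands in is left to chance rather than to anything about that person.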
On the other hand, while it is a wonderful
thing to have, it does come at a cost. Here's
how that works. Number one, experimentation
requires extensive, specialized training.
It is not a simple thing to pick up. Number two,
experiments are often very time consuming
and labor intensive. I have known some that
take hours per person. And number three, experiments
can be very expensive. So, what that all means
is that you want to make sure that you have
done enough background research and you need
to have a situation where it is sufficiently
important to get really reliable cause and
effect information to justify these costs
for experimentation. In sum, laboratory experimentation
is generally considered the best method for
assessing causality. That's because
it allows you to control for confounds through
randomization. On the other hand, it can be
difficult to do. So, be careful and thoughtful
when considering whether you need to do an
experiment and how to actually go about doing
it. There's one final procedure I want to
talk about in terms of Data Sourcing and Making
New Data. It's a form of experimentation and
it is simply called A/B testing and it's extremely
common in the web world. So, for instance,
I just barely grabbed a screenshot of Amazon.com's
homepage and you've got these various elements
on the homepage and I just noticed, by the
way, when I did this that this woman is actually
an animated gif, so she moves around. That
was kind of weird; I have never seen that
before. But the thing about this, is this
entire layout, how things are organized and
how they are on there, will have been determined
by variations on A/B testing by Amazon. Here's
how it works. For your webpage, you pick one
element like what's the headline or what are
the colors or what's the organization or how
do you word something, and you create multiple
versions, maybe just two: version A and version
B, which is why it's called A/B testing. Then when
people visit your webpage you randomly assign
these visitors to one version or another,
you have software that does that for you automatically.
And then you compare the response rates on
some outcome. I will show you those in a
second. And then, once you have enough data,
you implement the best version, you sort of
set that one solid and then you go on to something
else. Now, in terms of response rates, there
are a lot of different outcomes you can look
at. You can look at how long a person is on
a page, you can actually do mouse tracking
if you want to. You can look at click-throughs,
you can also look at shopping cart value or
abandonment. A lot of possible outcomes. All
of these contribute through A/B testing to
the general concept of website optimization;
to make your website as effective as it can
possibly be. Now, the idea also is that this
is something that you are going to do a lot.
You can perform A/B tests continually. In
fact, I have seen one person say that what
A/B testing really stands for is always be
testing. Kind of cute, but it does give you
the idea that improvement is a constant process.
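To make the comparison of response rates concrete, here is a rough Python sketch of the kind of two-proportion z-test that A/B testing software runs under the hood; the click counts are invented, and real tools also apply sequential-testing corrections that this simple version ignores:

```python
# Two-proportion z-test on made-up click-through counts for two
# versions of a page; this is the comparison A/B tools automate.
import math

def two_proportion_ztest(success_a, n_a, success_b, n_b):
    """Return (z, two-sided p-value) for H0: the two rates are equal."""
    p_a, p_b = success_a / n_a, success_b / n_b
    pooled = (success_a + success_b) / (n_a + n_b)  # rate under H0
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / se
    # Two-sided normal tail probability via the error function
    p_value = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
    return z, p_value

# Version A: 120 clicks in 2,400 visits; version B: 180 in 2,400
z, p = two_proportion_ztest(120, 2400, 180, 2400)
```

Here version B's 7.5% click-through beats version A's 5%, and the very small p-value suggests the difference is not chance; with enough data like this you would implement B and move on to testing the next element.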
Now, if you want some software to do A/B testing,
two of the most common choices are Optimizely
and VWO, which stands for Visual Website Optimizer.
Now, many others are available, but these
are especially common and when you get the
data you are going to use statistical hypothesis
testing to compare the differences or really
the software does it for you automatically.
But you may want to adjust the parameters
because most software packages cut off testing
a little too soon and the information is not
quite as reliable as it should be. But, in
sum, here is what we can say about A/B testing.
It is a version of website experimentation;
it is done online, which makes it really easy
to get a lot of data very quickly. It allows
you to optimize the design of your website
for whatever outcome is important to you.
And it can be done as a series of continual
assessments, testing, and development to make
sure that you're accomplishing what you want
to as effectively as possible for as many
people as possible. The very last thing I
want to talk about in terms of data sourcing
is the next steps. And probably
the most important thing is, you know, don't
just sit there. I want you to go and see what
you already have. Try to explore some open
data sources. And if it helps, check with
a few data vendors. And if those don't give
you what you need to do your project, then
consider making new data. Again, the idea
here is get what you need and get going. Thanks
for joining me and good luck on your own projects.
Welcome to "Coding in Data Science". I'm Bart
Poulson and what we are going to do in this
series of videos is we're going to take a
little look at the tools of Data Science.
So, I am inviting you to know your tools,
but probably even more important than that
is to know their proper place. Now, I mention
that because a lot of the times when people
talk about data tools, they talk about it
as though that were the same thing as data
science, as though they were the same set.
But, I think if you look at it for just a
second that is not really the case. Data tools
are simply one element of data science because
data science is made up of a lot more than
the tools that you use. It includes things
like, business knowledge, it includes the
meaning making and interpretation, it includes
social factors and so there's much more than
just the tools involved. That being said,
you will need at least a few tools and so
we're going to talk about some of the things
that you can use in data science if it works
well for you. In terms of getting started,
there are a few basic things. #1 is spreadsheets,
the universal data tool, and I'll talk about
how they play an important role in data science.
#2 is a visualization program called Tableau,
there is Tableau Public, which is free, and
there's Tableau desktop and there is also
something called Tableau server. Tableau is
a fabulous program for data visualization,
and I'm convinced that for most people it provides
the great majority of what they need. And
while it is not a tool, I do need to
talk about the formats used in web data, because
you have to be able to navigate them when
doing a lot of data science work. Then we
can talk about some of the essential tools
for data science. Those include the programming
language R, which is specifically for data,
there's the general purpose programming language
Python, which has been well adapted to data.
And there's the database language SQL, often
pronounced "sequel," for Structured Query Language. Then if
you want to go beyond that, there are some
other things that you can do. There are the
general purpose programming languages C, C++,
and Java, which are very frequently used to
form the foundation of data science; sort
of high-level production code is going to
rely on those as well. There's the command
line interface language Bash, which is very
common, a very quick tool for manipulating
data. And then there's the sort-of wild card:
supercharged regular expressions, or regex.
We'll talk about all of these in separate
courses. But, as you consider all the tools
that you can use, don't forget the 80/20 rule.
Also known as the Pareto Principle. And the
idea here is that you are going to get a lot
of bang for your buck out of a small number
of things. And I'm going to show you a little
sample graph here. Imagine that you have ten
different tools and we'll call them A through
J. A does a lot for you, B does a little bit
less, and it kind of tapers down until you have
got a bunch of tools that each do just a little
of what you need. Now, instead of looking
at the individual effectiveness, look at the
cumulative effectiveness. How much are you
able to accomplish with a combination of tools?
Well, the first one, A, starts at 60%, and
then you add on the
20% from B and it goes up, and then you add
on C and D and ever smaller,
smaller pieces, and by the time you get to
the end, you have got 100% of the effectiveness
from your ten tools combined. The important
thing about this is, you only have to go to
the second tool, that is, two out of ten, or
20% of your tools, and in this made-up
example, you have got 80% of your output.
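That made-up cumulative arithmetic is easy to check with a few lines of Python; the payoff numbers below are invented to match the tapering shape just described:

```python
# Hypothetical effectiveness of ten tools, A through J, in percent;
# the values taper off and sum to 100, as in the example above.
from itertools import accumulate

payoff = [60, 20, 6, 4, 3, 2, 2, 1, 1, 1]
cumulative = list(accumulate(payoff))  # running total across tools

# How many tools are needed to reach 80% of the total output?
tools_needed = next(i + 1 for i, total in enumerate(cumulative) if total >= 80)
```

With these numbers, two tools out of ten, that is 20% of the tools, already deliver 80% of the output.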
So, 80% of the output from 20% of the tools,
that's a fictional example of the Pareto Principle,
but I find in real life it tends to work something
approximately like that. And so, you don't
necessarily have to learn everything and you
don't have to learn how to do everything in
everything. Instead you want to focus on the
tools that will be most productive and specifically
most productive for you. So, in sum, let's
say these three things. Number one, coding here means
simply the ability to manipulate data with
programs and computers. Number two, coding is important,
but data science is much greater than the
collection of tools that's used in it. And
then finally, as you're trying to decide what
tools to use and what you need to learn and
how to work, remember the 80/20, you are going
to get a lot of bang from a small set of tools.
So, focus on the things that are going to
be most useful for you in conducting your
own data science projects. As we begin our
discussion of "Coding in Data Science," I actually
want to begin with something that's not coding.
I want to talk about applications or programs
that are already created that allow you to
manipulate data. And we are going to begin
with the most basic of these, spreadsheets.
We're going to do the rows and columns and
cells of Excel. And the reason for this is
you need spreadsheets. Now, you may be saying
to yourself, "no no no not me, because you
know what I'm fancy, I'm working in my big
set of servers, I've got fancy things going
on." But, you know what, you too, fancy people,
you need spreadsheets as well. There's a few
reasons for this. Most importantly, spreadsheets
can be the right tool for data science in
a lot of circumstances; there are a few reasons
for that. Number one, spreadsheets, they're
everywhere, they're ubiquitous, they're installed
on a billion machines around the world and
everybody uses them. There are probably more
data sets in spreadsheets than in anything else,
and so it's a very common format. Importantly,
it's probably your client's format; a lot
of your clients are going to be using spreadsheets
for their own data. I've worked with billion
dollar companies that keep all of their data
in spreadsheets. So, when you're working with
them, you need to know how to manipulate that
and how to work with it. Also, regardless
of what you're doing, spreadsheets, specifically
CSV (comma-separated values) files, are sort
of the lingua franca or the universal interchange
format for data transfer, to allow you to
take it from one program to another. And then,
truthfully, in a lot of situations they're
really easy to use. And if you want a second
opinion on this, let's take a look at this
ranking. There's a survey of data mining experts,
it's the KDnuggets data mining poll, and these
are the tools they most use in their own work.
And look at this: lowly Excel is fifth on
the list, and in fact, what's interesting
about it is it's above Hadoop and Spark, two
of the major big data fancy tools. And so,
Excel really does have a place of pride in a
data analyst's toolkit. Now, since we're
at sort of the low-tech end of things,
let's talk about some of the things you can
do with a spreadsheet. Number one, they are
really good for data browsing. You really
get to see all of the data in front of you,
which isn't true if you are doing something
like R or Python. They're really good for
sorting data, sort by this column then this
column then this column. They're really good
for rearranging columns and cells and moving
things around. They're good for finding and
replacing, and seeing what happens so you know
that it worked right. Some more uses: they're
really good for formatting, especially conditional
formatting. They're good for transposing data,
switching the rows and the columns, they make
that really easy. They're good for tracking
changes. Now it's true if you're a big fancy
data scientist you're probably using GitHub,
but for everybody else in the world, spreadsheets'
built-in change tracking is a wonderful way
to do it. You can make pivot tables, which
allow you to explore the data in a very hands-on
way, in a very intuitive way. And they're
also really good for arranging the output
for consumption. Now, when you're working
with spreadsheets, however, there's one thing
you need to be aware of: they are really flexible,
but that flexibility can be a problem in that
when you are working in data science, you
specifically want to be concerned about something
called Tidy Data. That's a term I borrowed
from Hadley Wickham, a very well-known developer
in the R world. Tidy Data is for transferring
data and making it work well. There's a few
rules here that undo some of the flexibility
inherent in spreadsheets. Number one, what
you want to do is have each column be equivalent
to a variable: columns and
variables are the same thing. And then,
each row is exactly the same thing as
a case. Next, you have one sheet per file, and
that you have one level of measurement, say,
individual, organization, or state,
per file. Again, this is undoing some of the
flexibility that's inherent in spreadsheets,
but it makes it really easy to move the data
from one program to another. Let me show you
how all this works. You can try this in Excel.
If you have downloaded the files for this
course, we simply want to open up this spreadsheet.
Let me go to Excel and show you how it works.
So, when you open up this spreadsheet, what
you get is totally fictional data here that
I made up, but it is showing sales over time
of several products at two locations, like
if you're selling stuff at a baseball field.
And this is the way spreadsheets often appear;
we've got blank rows and columns, we've got
stuff arranged in a way that makes it easy
for the person to process it. And we have
got totals here, with formulas putting them
all together. And that's fine, that works
well for the person who made it. And then,
that's for one month and then we have another
month right here and then we have another
month right here and then we combine them
all for first quarter of 2014. We have got
some headers here, we've got some conditional
formatting and changes and if we come to the
bottom, we have got a very busy line graphic
that eventually loads; it's not a good graphic,
by the way. But, similar to what you will
often find. So, this is the stuff that, while
it may be useful for the client's own personal
use, you can't feed this into R or Python,
it will just choke and it won't know what
to do with it. And so, you need to go through
a process of tidying up the data. And what
this involves is undoing some of the stuff.
So, for instance, here's data that is almost
tidy. Here we have a single column for date,
a single column for the day, a column for
the site, so we have two locations A and B,
and then we have six columns for the six different
things that are sold and how many were sold
on each day. Now, in certain situations, you
would want the data laid out exactly like
this if you are doing, for instance, a time
series, you will do something vaguely similar
to this. But, for true tidy stuff, we are
going to collapse it even further. Let me
come here to the tidy data. And now what I
have done is, I have created a new column
that says what is the item being sold. And
so, by the way, what this means is that we
have got a really long data set now, it has
got over a thousand rows. Come back up to
the top here. But, what that shows you is
that now it's in a format that's really easy
to import from one program to another, that
makes it tidy and you can re-manipulate it
however you want once you get to each of those.
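The same wide-to-long collapse we just did in the worksheet can be sketched in Python with pandas; the dates, sites, and item columns below are invented stand-ins for the actual course file:

```python
# "Almost tidy" wide layout: one column per item sold. Melting it
# produces the long, tidy layout with explicit item and sales columns.
import pandas as pd

wide = pd.DataFrame({
    "date": ["2014-01-01", "2014-01-01"],
    "site": ["A", "B"],       # the two sales locations
    "hot_dogs": [10, 7],      # item columns here are hypothetical
    "sodas": [25, 30],
})

tidy = wide.melt(id_vars=["date", "site"],
                 var_name="item", value_name="sales")
```

Each row of tidy is now one case, a date-site-item combination, which is exactly the long format that imports cleanly into R, Python, or Tableau.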
So, let's sum up our little presentation here,
in a few lines. Number one, no matter who
you are, no matter what you are doing in data
science you need spreadsheets. And the reason
for that is that spreadsheets are often the
right tool for data science. Keep one thing
in mind though, that is as you are moving
back and forth from one language to another,
tidy data or well-formatted data is going
to be important for exporting data into your
analytical program or language of choice.
As we move through "Coding in Data Science,"
and specifically the applications that can
be used, there's one that stands out for me
more than almost anything else, and that's
Tableau and Tableau Public. Now, if you are
not familiar with these, these are visualization
programs. The idea here is that when you have
data, the most important thing you can do
is to first look and see what you have and
work with it from there. And in fact, I'm
convinced that for many organizations Tableau
might be all that they really need. It will
give them the level of insight that they need
to work constructively with data. So, let's
take a quick look by going to tableau.com.
Now, there are a few different versions of
Tableau. Right here we have Tableau Desktop
and Tableau Server, and these are the paid
versions of Tableau. They actually cost a
lot of money, unless you work for a nonprofit
organization, in which case you can get them
for free. Which is a beautiful thing. What
we're usually looking for, however, is not
the paid version, but we are looking for something
called Tableau Public. And if you come in
here and go to Products, past these
three paid ones, you get over to Tableau Public.
We click on that, and it brings us to this page.
It is public.tableau.com. And this is the
one that has what we want, it's the free version
of Tableau with one major caveat: you can't
save files locally to your computer, which
is why I didn't give you a file to open. Instead,
it saves them to the web in a public form.
So, if you are willing to trade privacy, you
can get an immensely powerful application
for data visualization. That's a catch for
a lot of people, which is why people are willing
to pay a lot of money for the desktop version.
And again, if you work for a nonprofit you
can get the desktop version for free. But,
I am going to show you how things work in
Tableau Public. So, that's something that
you can work with personally. The first thing
you want to do is, you want to download it.
And so, you put in your email address and you
download; it is going to know what platform you are
on. It is a pretty big download. And once
it is downloaded, you can install and open
up the application. And here I am in Tableau
Public, right here, this is the blank version.
By the way, you also need to create an account
with Tableau in order to save your stuff online
to see it. I will show you what that looks
like. But, you are presented with a blank
thing right here and the first thing you need
to do is, you need to bring in some data.
I'm going to bring in an Excel file. Now,
if you downloaded the files for the course,
you will see that there is this one right
here, DS03_2_2_TableauPublic.excel.xlsx. In
fact, it is the one that I used in talking
about spreadsheets in the first video in this
course. I'm going to select that one and I'm
going to open it. And a lot of programs don't
like bringing in Excel because it's got all
the worksheets and all the weirdness in it.
This one works better with it, but what I'm
going to do is, I am going to take the tidy
data. By the way, you see that it put them
in alphabetical order here. I'm going to take
tidy data and I'm going to drag it over to
let it know that it's the one that I want.
And now what it does is it shows me a version
of the data set along with things that you
can do here. You can rename it, I like that
you can create bin groups, there's a lot of
things that you can do here. I'm going to
do something very, very quick with this particular
one. Now, I've got the data set right here,
what I'm going to do now is I'm going to go
to a worksheet. That's where you actually
create stuff. Cancel that and go to worksheet
one. Okay. This is a drag and drop interface.
And so what we are going to do is, we are
going to pull the bits and pieces of information
we want to make graphics. There's immense
flexibility here. I'm going to show you two
very basic ones. I'm going to look at the
sales of my fictional ballpark items. So,
I'm going to grab sales right here and I'm
going to put that as the field that we are
going to measure. Okay. And you see, I put it down right here and this is our total sales.
We're going to break it down by item and by
time. So, let me take item right here, and
you can drag it over here, or I can put it
right up here into rows. Those will be my
rows and that will be how many we have sold
total of each of the items. Fine, that's really
easy. And then, let's take date and we will
put that here in columns to spread it across.
Now, by default it is doing it by year, I
don't want to do that, I want to have three
months of data. So, what I can do is, I can
click right here and I can choose a different
time frame. I can go to quarter, but that's
not going to help because I only have one
quarter's worth of data, that's three months.
I'm going to come down to week. Actually,
let me go to day. If I do day, you see it
gets enormously complicated, so that's no
good. So, I'm going to back up to week. And
I've got a lot of numbers there, but what
I want is a graph. And so, to get that, I'm
going to come over here and click on this
and tell it that I want a graph. And so, we're
seeing the information, except it lost items.
So, I'm going to bring item and put it back
up into this graph to say this is a row for
the data. And now I've got rows for sales
by week for each of my items. That's great.
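By the way, what Tableau is computing when you drag fields around like this is essentially a pivot: summing a measure (sales) over every combination of the dimensions (item and week). Here is a minimal sketch of that idea in Python; the item names and numbers are invented for illustration, not taken from the course file:

```python
from collections import defaultdict

# Hypothetical ballpark rows: (item, week, sales).
# These values are invented for illustration.
rows = [
    ("Hot Dogs", 1, 120), ("Hot Dogs", 1, 80), ("Hot Dogs", 2, 150),
    ("Popcorn", 1, 60), ("Popcorn", 2, 90), ("Popcorn", 2, 30),
]

# Sum the measure (sales) for every combination of the
# dimensions (item, week) -- the pivot behind the drag and drop.
totals = defaultdict(int)
for item, week, sales in rows:
    totals[(item, week)] += sales

for (item, week), total in sorted(totals.items()):
    print(f"{item}, week {week}: {total}")
```

Each total corresponds to one mark in the Tableau view.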
I want to break it down one more by putting
in the site, the place that it sold. So, I'm
going to grab that and I'm going to put it
right over here. And now you see I've got
it broken down by the item that is sold and
the different sites. I'm going to color the
sites, and all I've got to do to do that is,
I'm going to grab site and drag it onto color.
Now, I've got two different colors for my
sites. And this makes it a lot easier to tell
what is going on. And in fact, there is some
other cool stuff you can do. One of the things
I'm going to do is come over here to analytics
and I can tell it to put an average line through
everything, so I'll just drag this over here.
Now we have the average for each line. That's
good. And I can even do forecasting. Let me
get a little bit of a forecast right here.
I will drag this on, and you can see over here; I will get this out of the way for a second. Now, I have a forecast for the next
few weeks, and that's a really convenient,
quick, and easy thing. And again, for some
organizations that might be all that they
really need. And so, what I'm showing you
here is the absolute basic operation of Tableau,
which allows you to do an incredible range
of visualizations and manipulate the data
and create interactive dashboards. There's
so much to it and we'll show that in another
course, but for right now I want to show you
one last thing about Tableau Public, and that
is saving the files. So now, when I come here
and save it, it's going to ask me to sign
into Tableau Public. Now, I sign in and it
asks me how I want to save this, same name
as the video. There we go, and I'm going to
hit save. And then that opens up a web browser,
and since I'm already logged into my account,
see here's my account and my profile. Here's
the page that I created. And it's got everything
that I need there; I'm going to edit just
a few details. I'm going to say, for instance,
I'm going to leave its name just like that.
I can put more of a description in there if
I wanted. I can allow people to download the
workbook and its data; I'm going to leave
that there so you can download it if you need
to. If I had more than one tab, I would do
this thing that says show the different sheets
as tabs. Hit save. And there's my data set
and also it's published online and people
can now find it. And so what you have here
is an incredible tool for creating interactive
visualizations; you can create them with drop-down
menus, and you can rearrange things, and you
can make an entire dashboard. It's a fabulous
way of presenting information, and as I said
before, I think that for some organizations
this may be as much as they need to get really
good, useful information out of their data.
And so I strongly recommend that you take
some time to explore with Tableau, either
the paid desktop version or the public version
and see what you can do to get some really
compelling and insightful visualizations out
of your work in data science. For many people,
their first experience of "Coding and Data
Science" is with the application SPSS. Now,
I think of SPSS and the first thing that comes
to my mind is sort of life in the ivory tower, though this looks more like Harry Potter.
But, if you think about it the package name
SPSS comes from Statistical Package for the
Social Sciences. Although, if you ask IBM
about it now, they act like it doesn't stand
for anything. But, it has its background in
social science research which is generally
academic. And truthfully, I'm a social psychologist
and that's where I first learned how to use
SPSS. But, let's take a quick look at their
webpage ibm.com/spss. If you type that in,
that will just be an alias that will take
you to IBM's main webpage. Now, IBM didn't
create SPSS, but they bought it around version
16, and it was very briefly known as PASW, Predictive Analytics SoftWare; that only lasted briefly and now it's back to SPSS, which is
where it's been for a long time. SPSS is a
desktop program; it's pretty big, it does
a lot of things, it's very powerful, and is
used in a lot of academic research. It's also
used in a lot of business consulting, management,
even some medical research. And the thing
about SPSS, is it looks like a spreadsheet
but has drop-down menus to make your life
a little bit easier compared to some of the
programming languages that you can use. Now,
you can get a free temporary version, if you're
a student you can get a cheap version, otherwise
SPSS costs a lot of money. But, if you have
it one way or another, when you open it up
this is what it is going to look like. I'm
showing SPSS version 22, now it's currently
on 24. And the thing about SPSS versioning
is, in anything other than software packaging,
these would be point updates, so I sort of
feel like we should be on 17.3, as opposed
to 23 or 24. Because the variations are so
small that anything you learn from the early
ones, is going to work on the later ones and
there is a lot of backwards and forwards compatibility,
so I'd almost say that this one, the version
I have practically doesn't matter. You get
this little welcome splash screen, and if
you don't want to see it anymore you can get
rid of it. I'm just going to hit cancel here.
And this is our main interface. It looks a
lot like a spreadsheet, the difference is,
you have a separate pane for looking at variable
information and then you have separate windows
for output and then an optional one for something
called Syntax. But, let me show you how this
works by first opening up a data set. SPSS
has a lot of sample data sets in it, but
they are not easy to get to and they are really
well hidden. On my Mac, for instance, let
me go to where they are. On my Mac I go to the Finder; I have to go to Mac, to Applications, to the folder IBM, to SPSS, to Statistics, to 22 (the version number), to Samples, then
I have to say I want the ones that are in
English, and then it brings them up. The .sav files are the actual data files; there are other kinds of files in here as well, such as ones for planning analyses. I'm going to open up a file
here called "market values.sav," a small
data set in SPSS format. And if you don't
have that, you can open up something else;
it really doesn't matter for now. By the way,
in case you haven't noticed, SPSS tends to
be really, really slow when it opens. Also, despite being on version 24, it tends to be kind of buggy and crashes. So, when you work with
SPSS, you want to get in the habit of saving
your work constantly. And also, being patient
when it is time to open the program. So, here
is a data set that just shows addresses and
house values, and square feet for information.
I don't even know if this is real information; it looks artificial to me. But, SPSS lets
you do point and click analyses, which is
unusual for a lot of things. So, I am going
to come up here and I am going to say, for
instance, make a graph. I'm going to use what is called a legacy dialog to get a histogram of house prices. So, I simply click values, put that right there, and I will put a normal curve on top of it and click ok. This is going to open
up a new window, and it opened up a microscopic
version of it, so I'm going to make that bigger.
This is the output window, this is a separate
window and it has a navigation pane here on
the side. It tells me where the data came
from, and it saves the command here, and then,
you know, there's my default histogram. So,
we see most of the houses were right around
$125,000, and then they went up to at least
$400,000. I have a mean of $256,000, a standard
deviation of about $80,000, and then there are 94 houses in the data set. Fine, that's
great. The other thing I can do is, if I want
to do some analyses, let me go back to the
data just for a moment. For instance, I can
come here to analyze and I can do descriptive
and I'm actually going to do one here called
Explore. And I'll take the purchase price
and I'll put it right here and I'm going to
get a whole bunch just by default. I'm going
to hit ok. And it goes back to the output
window. Once again made it tiny. And so, now
you see beneath my chart I now have a table
and I've got a bunch of information. A stem
and leaf plot, and a box plot too, a great
way of checking for outliers. And so this
is a really convenient way to save things.
You can export this information as images,
you can export the entire file as an HTML,
you can do it as a pdf or a PowerPoint. There's
a lot of options here and you can customize
everything that's on here. Now, I just want
to show you one more thing that makes your
life so much easier in SPSS. You see right
here that it's putting down these commands,
it's actually saying graph, and then histogram,
and normal equals value. And then down here,
we've got this little command right here.
Most people don't know how to save their work in SPSS, so they just have to do it over again every time, but there's a very simple way to do this.
What I'm going to do is, I'm going to open
up something called a Syntax file. I'm going
to go to new, Syntax. And this is just a blank
window that's a programming window, it's for
saving code. And let me go back to my analysis
I did a moment ago. I'll go back to analyze
and I can still get at it right here. Descriptives
and explore, my information is still there.
And what happens here is, even though I set
it up with drop-down menus and point and click,
if I do this thing, paste, then what it does
is, it takes the code that creates that command
and it saves it to this syntax window. And
this is just a text file. It saves it as .spss,
but it is a text file that can be opened in
anything. And what's beautiful about this
is, it is really easy to copy and paste, and
you can even take this into Word and do a
find and replace on it, and it's really easy
to replicate the analyses. And so for me,
SPSS is a good program. But, until you use
Syntax you don't know the true power of it
and it makes your life so much easier as a
way of operating it. Anyhow, this is my extremely
brief introduction to SPSS. All I want to
say is that it is a very common program, kind
of looks like a spreadsheet, but it gives
you a lot more power and options and you can
use both drop-down menus and text-based Syntax
commands as well to automate your work and
make it easier to replicate it in the future.
I want to take a look at one more application
for "Coding and Data Science", that's called
JASP. This is a new application, not very
familiar to a lot of people and still in beta, but with amazing promise. You can basically
think of it as a free version of SPSS and
you know what, we love free. But, JASP is
not just free, it's also open source, and
it's intuitive, and it makes analyses replicable,
and it even includes Bayesian approaches.
So, take that all together, you know, we're
pretty happy and we're jumping for joy. So,
before we move on, you just may be asking
yourself, JASP, what is that? Well, the creator
has emphatically denied that it stands for
Just Another Statistics Program, but be that
as it may, we will just go ahead and call
it JASP and use it very happily. You can get
to it by going to jasp-stats.org. And let's
take a look at that right now. JASP is a new
program; they say it's a low-fat alternative to SPSS, and it is a really wonderful way of doing statistics. You're going to want to download it by choosing your platform;
it even comes in Linux format, which is beautiful.
And again, it's beta so stay posted, things
are updating regularly. If you're on Mac,
you're going to need to use XQuartz; that's an easy thing to install and it makes a lot of things work better. And it's a wonderful way to do analyses. When you open up JASP,
it's going to look like this. It's a pretty
blank interface, but it's really easy to get
going with it. So for instance, you can come
over here to file and you can even choose
some example data sets. So for instance, here's
one called Big 5 that's personality factors.
And you've got data here that's really easy
to work with. Let me scroll this over here
for a moment. So, there's our five variables
and let's do some quick analyses with these.
Say for instance, we want to get descriptives;
we can pick a few variables. Now, if you're
familiar with SPSS, the layout feels very
much the same and the output looks a lot the
same. You know, all I have to do is select
what I want and it immediately pops up over
here. Then I can choose additional statistics,
I can get quartiles, I can get the median.
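As an aside, the kinds of descriptives JASP is showing here (mean, median, standard deviation, quartiles) can be computed with Python's standard library. The scores below are made-up values, not the Big 5 data:

```python
import statistics

# Made-up scores standing in for one of the Big 5 variables.
scores = [3.2, 3.8, 4.1, 2.9, 3.5, 4.4, 3.0, 3.7]

mean = statistics.mean(scores)
median = statistics.median(scores)
stdev = statistics.stdev(scores)                 # sample standard deviation
q1, q2, q3 = statistics.quantiles(scores, n=4)   # quartile cut points

print(f"mean={mean:.3f} median={median:.3f} sd={stdev:.3f}")
print(f"quartiles: {q1:.3f}, {q2:.3f}, {q3:.3f}")
```

Point-and-click tools like JASP are running these same kinds of calculations behind the scenes.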
And you can choose plots; let's get some plots,
all you have to do is click on it and they
show up. And that's a really beautiful thing
and you can modify these things a little bit,
so for instance, I can take the plot points.
Let's see if I can drag that down and if I
make it small enough I can see the five plots,
I went a little too far on that one. Anyhow,
you can do a lot of things here. And I can
hide this, I can collapse that and I can go
on and do other analyses. Now, what's really
neat though is when I navigate away, so I
just clicked in a blank area of the results
page, we are back to the data here. But if
I click on one of these tables, like this
one right here, it immediately brings up the
commands that produced it and I can just modify
it some more if I want. Say I want skewness
and kurtosis, boom they are in there. It is
an amazing thing and then I can come back
out here, I can click away from that and I
can come down to the plots expand those and
if I click on that it brings up the commands
that made them. It's an amazingly easy and
intuitive way to do things. Now, there's another
really nice thing about JASP and that is that
you can share the information online really
well through a service called OSF. That stands for the Open Science Framework, and its web address is osf.io. So, let's take a quick
look at what that's like. Here's the open
science framework website and it's a wonderful
service, it's free and it's designed to support
open, transparent, accessible, accountable,
collaborative research and I really can't
say enough nice things about it. What's neat
about this is once you sign up for OSF you
can create your own area and I've got one
of my own, I will go to that now. So, for
instance, here's the datalab page in the Open Science Framework. And what I've done is I created a version of this JASP analysis and
I've saved it here, in fact, let's open up
my JASP analysis in JASP and I'll show you
what it looks like in osf. So, let's first
go back to JASP. When we're here we can come
over to file and click computer and I just
saved this file to the desktop. Click on desktop,
and you should have been able to download
this with all the other files, DS03_2_4_JASP,
double click on that to open it and now it's
going to open up a new window and you see
I was working with the same data set, but
I did a lot more analyses. I've got these
graphs; I have correlations and scatter plots.
Come down here, I did a linear regression.
And we just click on that and you can see
the commands that produce it as well as the
options. I didn't do anything special for
that, but I did do some confidence intervals
and specified that and it's really a great
way to work with all this. I'll click back
in an empty area and you see the commands
go away and so I've got my output here in
JASP, but when I saved it though, I had the
option of saving it to OSF, in fact if you
go to this webpage osf.io/3t2jg you'll actually
be able to go to the page where you can see
and download the analyses that I conducted,
let's take a look. This is that page, there's the address I just gave you, and what
you see here is the same analysis that I conducted,
it's all right here, so if you're collaborating
with people or if you want to show things
to people, this is a wonderful way to do it.
Everything is right there, this is a static
image, but up at the top people have the option
of downloading the original file and working
with it on their own. In case you can't tell,
I'm really enthusiastic about JASP and about
its potential, still in beta, still growing
rapidly. I see it really as an open source
free and collaborative replacement to SPSS
and I think it is going to make data science
work so much easier for so many people. I
strongly recommend you give JASP a close look.
Let's finish up our discussion of "Coding
and Data Science" the applications part of
it by just briefly looking at some other software
choices. And I'll have to admit it gets kind
of overwhelming because there are just so
many choices. Now, in addition to the spreadsheets,
and Tableau, and SPSS, and JASP, that we have
already talked about, there's so much more
than that. I'm going to give you a range of
things that I'm aware of and I'm sure I've
left out some important ones or things that other people really like, but these are
some common choices and some less common,
but interesting ones. Number one, in terms
of ones that I haven't mentioned is SAS. SAS
is an extremely common analytical program,
very powerful, used for a lot of things. It's
actually the first program that I learned. On the other hand, it can be kind of hard to use and it can be expensive, but there's
a couple of interesting alternatives. SAS
also has something called the SAS University
Edition; if you're a student this is free, and it's slightly reduced in what it does, but the fact that it's free makes up for a lot. It also runs
in a virtual machine which makes it an enormous
download, but it's a good way to learn SAS
if it's something that you want to do. SAS
also makes a program that I would really love were it not so extraordinarily expensive, and that is called JMP; it's visualization software. Think a little bit of Tableau, how we saw it: you work with it visually, and in this one you can drag things around; it's a really wonderful program. I personally find it prohibitively
expensive. Another very common choice among
working analysts is Stata and some people
use Minitab. Now, for mathematical people,
there's MATLAB and then of course there's
Mathematica itself, but it is really more
of a language than a program. On the other
hand, Wolfram, who makes Mathematica, is also the company that gives us Wolfram Alpha. Most people don't think of this as a stats application because you can run it on your iPhone. But, Wolfram Alpha is incredibly capable, and especially if you pay for the pro account,
you can do amazing things in this, including
analyses, regression models, visualizations
and so it's worth taking a little closer look
at that. Also, it provides a lot of the data that you need, so Wolfram Alpha is an interesting one. Now, there are several applications that are more specifically geared towards data mining, so you don't want to do your regular, you know, little t-tests and stuff on these. There's RapidMiner and there's
KNIME and Orange and those are all really
nice to use because they have visual, flow-based interfaces where you drag nodes onto a screen and you connect them with lines and you can see how things run through. All three of them are
free or have free versions and all three of
them work in pretty similar manners. There's
also BigML, which is for machine learning
and this is unusual because it's browser based,
it runs on their servers. There's a free version,
though you can't download a whole lot, it
doesn't cost a lot to use BigML and it's a
very friendly, very accessible program. Then
in terms of programs you can actually install
for free on your own computer, there's one
called SOFA Statistics; it means Statistics Open For All, which is kind of a cheesy title,
but it's a good program. And then one with
a web page straight out of 1990 is PAST 3; this is paleontological software, but on the other hand it does do very general stuff, it runs on
many platforms and it's a really powerful
thing and it's free, but it is relatively
unknown. And then speaking of relatively unknown,
one that's near and dear to my heart is a
web application called StatCrunch. It costs, but only something like $6 or $12 a year; it's really cheap and it's very good, especially for basic statistics and for learning. I used it in some of the classes that I was teaching.
And then if you're deeply wedded to Excel
and you can't stand to leave that environment,
you can purchase add-ons like XLSTAT, which
give you a lot of statistical functions within
the Excel environment itself. That's a lot
of choices and the most important thing here
is don't get overwhelmed. There's a lot of
choices, but you don't even have to try all
of them. Really the important question is
what works best for you and the project that
you're working on? Here are a few things you
want to consider in that regard. First off
is functionality, does it actually do what
you want or does it even run on your machine?
You don't need everything that a program can
do. When you think about the stuff Excel can
do, people probably use five percent of what's
available. Second is ease of use. Some of
these programs are a lot easier to use than
the others, and I personally prefer the ones that are easier to use. You might say, "No, I need to program because I need custom stuff". But I'm willing to bet
that 95% of what people do does not require
anything custom. Also, the existence of a
community. When you're working, you constantly come across problems you don't know how to solve, and it's wonderful to be able to get online and search for an answer, with enough of a community that there are people there who have put answers up and discussed these things. Some of these programs have very substantial communities; for some, the community is practically nonexistent, and it is up to you to decide how important it
is to you. And then finally of course there
is the issue of cost. Many of these programs
I mentioned are free, some of them are very
cheap, some of them run some sort of premium
model and some of them are extremely expensive.
So, you don't buy them unless somebody else
is paying for it. So, these are some of the
things that you want to keep in mind when
you're trying to look at various programs.
Also, let's mention this; don't forget the
80/20 rule. You're going to be able to do
most of the stuff that you need to do with
only a small number of tools, one or two,
maybe three, will probably be all that you
ever need. So, you don't need to explore the
range of every possible tool. Find something
that you need, find something you're comfortable
with and really try to extract as much value
as you can out of that. So, in sum, for our discussion of available applications for coding and data science: first, remember that applications are tools; they don't drive you, you use them. Your goals are what drive the choice of your applications and the way that you do it. And the single most important thing
is to remember that what works well for somebody else may not work for you; if you're not comfortable with it, if it doesn't fit the questions you address, then it's not the right choice. Think about what works for you and the projects that you're working on as you make your own choices of tools for working in data science. When you're
"Coding in Data Science," one of the most
important things you can do is be able to
work with web data. And if you work with web
data you're going to be working with HTML.
And in case you're not familiar with it, HTML
is what makes the World Wide Web go ‘round.
What it stands for is HyperText Markup Language
- and if you've never dealt with web pages
before, here's a little secret: web pages
are just text. It is just a text document,
but it uses tags to define the structure of
the document and a web browser knows what
those tags are and it displays them the right
way. So, for instance, some of the tags, they
look like this. They are in angle brackets,
and you have an angle bracket and then the
beginning tag, so body, and then you have
the body, the main part of your text, and
then you have, in angle brackets with a forward slash, body, to let the computer know that you are done with that part. You also have p and slash p for paragraphs. H1 is for header one and
you put it in between that text. TD is for
table data or the cell in a table and you
mark it off that way. If you want to see what
it looks like just go to this document: DS03_3_1_HTML.txt.
I'm going to go to that one right now. Now,
depending on what text editor you open this up in, it may actually give you a web preview. I've opened it up in TextMate and so it actually is showing the text the way I typed it. I typed this manually; I just typed it all in there. And I have an html tag to say what the document is, I have an empty head, but that sort of needs to be there. Here I say where the body is, and then I have some text. li is
for list items, I have headers, this is for
a link to a webpage, then I have a small table.
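To make the idea of navigating tags concrete, here is a minimal sketch in Python using the standard library's html.parser to pull the cell values out of a small table. The HTML snippet below is a stand-in with the same kinds of tags, not the actual contents of the course file:

```python
from html.parser import HTMLParser

# A stand-in snippet using the kinds of tags discussed above.
page = """
<html><head></head><body>
<h1>Ballpark Sales</h1>
<p>A small table of data.</p>
<table>
  <tr><td>Hot Dogs</td><td>200</td></tr>
  <tr><td>Popcorn</td><td>120</td></tr>
</table>
</body></html>
"""

class CellExtractor(HTMLParser):
    """Collect the text inside every <td> (table data) tag."""
    def __init__(self):
        super().__init__()
        self.in_td = False
        self.cells = []

    def handle_starttag(self, tag, attrs):
        if tag == "td":
            self.in_td = True

    def handle_endtag(self, tag):
        if tag == "td":
            self.in_td = False

    def handle_data(self, data):
        if self.in_td:
            self.cells.append(data.strip())

parser = CellExtractor()
parser.feed(page)
print(parser.cells)
```

This is the same tag-navigation idea that web-scraping tools apply at a much larger scale.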
And if you want to see what this looks like
when displayed as a web page, just go up here
to window and show web preview. This is the
same document, but now it is in a browser
and that's how you make a web page. Now, I
know this is very fundamental stuff, but the
reason this is important is because if you're
going to be extracting data from the web,
you have to understand how that information
is encoded in the web, and it is going to
be in HTML most of the time for a regular
web page. Now, I will mention one other thing: there's another thing called CSS. Web pages
use CSS to define the appearance of a document.
HTML is theoretically there to give the content
and CSS gives the appearance. And that stands
for Cascading Style Sheets. I'm not going
to worry about that right now because we're
really interested in the content. And now
you have the key to being able to read web
pages and pull data from web pages for your
data science project. So, in sum; first, the
web runs on HTML and that's what makes the
web pages that are there. HTML defines the
page structure and the content that is on
the page. And you need to learn how to navigate
the tags and the structure in order to get
data from the web pages for your data science
projects. The next step in "Coding and Data
Science" when you're working with web data
is to understand a little bit about XML. I
like to think of this as the part of web data
that follows the imperative, "Data, define
thyself". XML stands for eXtensible Markup Language, and essentially, XML is semi-structured data. What that means is that tags define
data so a computer knows what a particular
piece of information is. But, unlike HTML,
the tags are free to be defined any way you
want. And so you have this enormous flexibility
in there, but you're still able to specify
it so the computer can read it. Now, there's
a couple of places where you're going to see
XML files. Number one is in web data. HTML
defines the structure of a web page, but if
they're feeding data into it, then that will
often come in the form of an XML file. Interestingly,
Microsoft Office files, if you have .docx
or .xlsx, the X-part at the end stands for
a version of XML that's used to create these
documents. If you use iTunes, the library
information that has all of your artists,
and your genres, and your ratings and stuff,
that's all stored in an XML file. And then
finally, data files that often go with particular
programs can be saved as XML as a way of representing
the structure of the data to the program.
And for XML, tags use opening and closing
angle brackets just like HTML did. Again,
the major difference is that you're free to
define the tags however you want. So for instance,
thinking about iTunes, you can define a tag
that's genre, and you have the angle brackets with genre to begin that information, and then you have the angle brackets with the forward slash to let it know you're done with that piece
of information. Or, you can do it for composer,
or you can do it for rating, or you can do
it for comments, and you can create any tags
you want and you put the information in between
those two things. Now, let's take an example
of how this works. I'm going to show you a
quick dataset that comes from the web. It's
at ergast.com's API, and this is a website
that stores information about automobile Formula
One racing. Let's go to this webpage and take
a quick look at what it's like. So, here we
are at Ergast.com, and it's the API for Formula
One. And what I'm bringing up is the results
of the 1957 season in Formula One racing.
And here you can see who the competitors were
in each race, and how they finished and so
on. So, this is a dataset that is being displayed
in a web page. If you want to see what it
looks like in XML, all you have to do is type
XML onto the end of this: .XML. I've done
that already, so I'm just going to go to that
one. And as you see, it's only this bit that
I've added: .XML. Now, it looks exactly the
same because the browser is rendering the XML
data by default. But if you want to see what
it looks like in its raw format, just right-click
on the web page and go to View Page Source.
At least that's how it works
in Chrome, and this is the structured XML
page. And you can see we have tags here. It
says Race Name, Circuit Name, Location, and
obviously, these are not standard HTML tags.
They are defined for the purposes of this
particular dataset. Take one as an example:
we have CircuitName right there, and then
we close it using the slash right there.
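The way a parser walks those opening and closing tags can be sketched in Python with the standard library; the tag names here are modeled on this dataset but are illustrative, not the exact Ergast schema:

```python
import xml.etree.ElementTree as ET

# A tiny XML fragment modeled on the race data above
# (tag names are illustrative, not the exact Ergast schema).
xml_text = """
<Race>
    <RaceName>Argentine Grand Prix</RaceName>
    <Circuit>
        <CircuitName>Autodromo Juan y Oscar Galvez</CircuitName>
        <Location>Buenos Aires</Location>
    </Circuit>
</Race>
"""

race = ET.fromstring(xml_text)

# Because every piece of data sits between an opening and a
# closing tag, the parser can find it by name.
print(race.find("RaceName").text)             # Argentine Grand Prix
print(race.find("Circuit/Location").text)     # Buenos Aires
```

The nested tags become a tree the program can navigate, which is exactly what "the computer knows how to read it" means in practice.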
And so this is structured data; the computer
knows how to read it, which is exactly why
it can display it by default. So, it's
a really good way of displaying data and it's
a good way to know how to pull data from the
web. You can actually use what is called an
API, an Application Programming Interface,
to access this XML data and it pulls it in
along with its structure which makes working
with it really easy. What's even more interesting
is how easy it is to take XML data and convert
it between different formats, because it's
structured and the computer knows what you're
dealing with. So, for example, one: it's really
easy to convert XML to CSV, or comma separated
value files (that's the spreadsheet format)
because it knows exactly what the headings
are; what piece of information goes in each
column. Example two: it's really easy to convert
HTML documents to XML because you can think
of HTML with its restricted set of tags as
sort of a subset of the much freer XML. And
three, you can convert CSV, or your spreadsheet
comma separated value, to XML and vice versa.
You can bounce them all back and forth because
the structure is made clear to the programs
you're working with. So in sum, here's what
we can say. Number one, XML is semi-structured
data. What that means is that it has tags
to tell the computer what the piece of information
is, but you can make the tags whatever you
want them to be. And, XML is very common for
web data. And it's really easy to translate
between formats like XML, HTML, and CSV, back
and forth, which gives you a lot of flexibility
in manipulating data so you can get it into the
format you need for your own analysis. The last thing
I want to mention about "Coding and Data Science"
and web data is something called JSON. And
I like to think of it as a version of smaller
is better. Now, what JSON stands for is JavaScript
Object Notation, although JavaScript is supposed
to be one word. And what it is, is that like
XML, JSON is semi-structured data. That is,
you have tags that define the data, so the
computer knows what each piece of information
is, but like XML the tags can vary freely.
And so there's a lot in common between XML
and JSON. So XML is a Markup Language (that's
what the ML stands for), and that gives meaning
to the text; it lets the computer know what
each piece of information is. Also, XML allows
you to make comments in the document, and
it allows you to put metadata in the tags
so you can actually put information there
in the angle brackets to provide additional
context. JSON, on the other hand, is specifically
designed for data interchange and so it's
got that special focus. And in terms of structure,
JSON corresponds with data structures: it
directly represents objects and arrays
and numbers and strings and booleans, and
that works really well with the programs that
are used to analyze data. Also, JSON is typically
shorter than XML because it does not require
the closing tags. Now, there are ways to do
that with XML, but that's not typically how
it's done. As a result of these differences,
JSON is basically taking XML's place in web
data. XML still exists, it's still used for
a lot of things, but JSON is slowly replacing
it. And we'll take a look at the comparison
between the three by going back to the example
we used in XML. This is data about Formula
One car races in 1957 from ergast.com. You
can just go to the first web page here, then
we will navigate to the others from that.
So this is the general page. This is if you
just type in without the .XML or .JSON or
anything. So it's a table of information about
races in 1957. And we saw earlier that if
you just add .XML to the end of this,
it looks exactly the same. That's because
this browser is displaying XML properly by
default. But, if you were to right click on
it, and go to view page source, you would
get this instead, and you can see the structure.
This is still XML, and so everything has an
opening tag and a closing tag and some extra
information in there. But, if you type in
.JSON what you really get is this jumbled
mess. Now that's unfortunate because there
is a lot of structure to this. So, what I
am going to do is, I am actually going to
copy all of this data, then I'm going to go
to a little web page; there's a lot of things
you can do here, and it has a cute name. It's
called JSON Pretty Print. And what it does is make
it look structured so it's easier to read.
I just paste that in there and hit Pretty
Print JSON, and now you can see the hierarchical
structure of the data. The interesting thing
is that JSON only has tags at the
beginning of each entry. It says series in quotes, then
a colon, then it gives the piece of information
in quotes, and a comma and it moves on to
the next one. And this is a lot more similar
to the way data would be represented in something
like R or Python. It is also more compact.
Again, there are things you can do with XML
but this is one of the reasons that JSON is
becoming preferred as a data carrier for websites.
And as you may have guessed, it's really easy
to convert between the formats. It's easy
to convert between XML, JSON, CSV, etc. You
can get a web page where you can paste a version
in and you get the other version out. There
are some differences, but for the vast majority
of situations, they are just interchangeable.
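That interchangeability is easy to see in Python, whose standard library reads JSON straight into the language's own data structures and writes CSV back out; the field names here are illustrative:

```python
import csv
import io
import json

# A small JSON record, with illustrative field names.
json_text = '{"series": "f1", "season": "1957", "round": "1"}'

# json.loads turns the text into a Python dictionary --
# the structure maps directly onto the language's own types.
record = json.loads(json_text)
print(record["season"])             # 1957

# Pretty-printing (like the JSON Pretty Print page) is built in:
print(json.dumps(record, indent=2))

# And because the keys double as column headings, writing CSV
# (the spreadsheet format) is straightforward.
out = io.StringIO()
writer = csv.DictWriter(out, fieldnames=record.keys())
writer.writeheader()
writer.writerow(record)
print(out.getvalue())
```

The same few lines, run the other direction with `csv.DictReader` and `json.dumps`, convert CSV back to JSON, which is why the formats bounce back and forth so easily.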
In Sum: what did we get from this? Like XML,
JSON is semi-structured data, where there
are tags that say what the information is,
but you define the tags however you want.
JSON is specifically designed for data interchange
and because it reflects the structure of the
data in your programs, it's really easy to work
with. Also, because it's relatively compact
JSON is gradually replacing XML on the web,
as the container for data on web pages. If
we are going to talk about "Coding and Data
Science" and the languages that are used,
then first and foremost is R. The reason for
that is, according to many standards, R is
the language of data and data science. For
example, take a look at this chart. This is
a ranking based on a survey of data mining
experts of the software they use in doing
their work, and R is right there at the top.
R is first, and in fact that's important because
there's Python which is usually taken hand
in hand with R for Data Science. But R sees
50% more use than Python does, at least in
this particular list. Now there's a few reasons
for that popularity. Number one, R is free
and it's open source, both of which make things
very easy. Second, R is specially developed
for vector operations. That means it's able
to go through an entire list of data without
having to write ‘for' loops to go through.
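R is not unique in this; the same vectorized style is available in Python through the NumPy package (assumed installed here), and a minimal sketch of it looks like this:

```python
import numpy as np

# A vector of data -- think of a whole column of numbers at once.
sales = np.array([10, 20, 30])

# One expression applies to every element; no 'for' loop needed.
doubled = sales * 2
print(doubled)        # [20 40 60]
print(sales.mean())   # 20.0
```

One line of code operates on the entire list of data, which is exactly the convenience that vector operations buy you.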
If you've ever had to write ‘for' loops,
you know that would be kind of disastrous
having to do that with data analysis. Next,
R has a fabulous community behind it. It's
very easy to get help on things with R, you
Google it, you're going to end up in a place
where you're going to be able to find good
examples of what you need. And probably most
importantly, R is very capable. R has 7,000
packages that add capabilities to R. Essentially,
it can do anything. Now, when you are working
with R, you actually have a choice of interfaces.
That is, how you actually do the coding and
how you get your results. R comes with its
own IDE, or Integrated Development Environment.
You can do that, or if you are on a Mac or
a Linux you can actually do R through the
Terminal through the command line. If you've
installed R, you just type R and it starts
up. There is also a very popular development
environment called RStudio, at rstudio.com, and that's
actually the one I use and the one I will
be using for all my examples. But another
new competitor is Jupyter, which is very commonly
used for Python; that's what I use for examples
there. It works in a browser window, even
though it's locally installed. Between RStudio
and Jupyter there are pluses and minuses to each,
and I'll mention them as we get
to each one. But no matter which interface
you use, R is command-line driven: you're typing lines
of code in order to give the commands. Some
people get really scared about that but really
there are some advantages to that in terms
of the replicability and really the accessibility,
the transparency of your commands. So for
instance, here's a short example of some of
the commands in R. You can enter them into
what is called a console, and that's just
one line at a time and that's called an interactive
way. Or you can save scripts and run bits
and pieces selectively and that makes your
life a lot easier. No matter how you do it,
if you are familiar with programming other
languages then you're going to find that R's
a little weird. It has an idiosyncratic model.
It makes sense once you get used to it, but
it is a different approach, and so it takes
some adaptation if you are accustomed to programming
in different languages. Now, once you do your
programming to get your output, what you're
going to get is graphs in a separate window.
You're going to get text and numbers, numerical
output in the console, and no matter what
you get, you can save the output to files.
So that makes it portable, you can do it in
other environments. But most importantly,
I like to think of this: here's our box of
chocolates where you never know what you're
going to get. The beauty of R is in the packages
that are available to expand its capabilities.
Now there are two sources of packages for
R. One goes by the name of CRAN, and that
stands for the Comprehensive R Archive Network,
and that's at cran.rstudio.com. And what that
does is takes the 7,000 different packages
that are available and organizes them into
topics that they call task views. And for
each package, if the authors have done their homework,
they have datasets that come along with the
package. You have a manual in .pdf format,
and you can even have vignettes where they
run through examples of how to do it. Another
interface is called Crantastic! And the exclamation
point is part of the title. And that is at
crantastic.org. And what this is, is an alternative
interface that links to CRAN. So if you find
something you like in Crantastic! and you
click on the link, it's going to open in CRAN.
But the nice thing about Crantastic! is it
shows the popularity of packages, and it also
shows how recently they were updated, and
that can be a nice way of knowing you're getting
sort of the latest and greatest. Now from
this very abstract presentation, we can say
a few things about R: Number one, according
to many, R is the language of data science
and it's a command line interface. You're
typing lines of code, so that gives it both
a strength and a challenge for some people.
But the beautiful thing is the thousands
and thousands of packages of additional code
and capability that are available for R, which
make it possible to do nearly anything in
this statistical programming language. When
talking about "Coding and Data Science" and
the languages, along with R, we need to talk
about Python. Now, Python, like the snake, is a
general-purpose programming language that can do it all,
and that's its beauty. If we go back to the
survey of the software used by data mining
experts, you see that Python's there and it's
number three on the list. What's significant
about that, is that on this list, Python is
the only general purpose programming language.
It's the only one that can be theoretically
used to develop any kind of application that
you want. That gives it some special powers
compared to all the others, most of which
are very specific to data science work. The
nice things about Python are: number one,
it's general purpose. It's also really easy
to use, and if you have a Macintosh or Linux
computer, Python is built into it. Also, Python
has a fabulous community around it with hundreds
of thousands of people involved, and also
python has thousands of packages. Now, it
actually has 70 or 80,000 packages, but in
terms of ones that are for data, there are
still thousands available that give it some
incredible capabilities. A couple of things
to know about Python. First, is about versions.
There are two versions of Python that are
in wide circulation: there's 2.x; so that
means like 2.5 or 2.6, and 3.x; so 3.1, 3.2.
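To make the incompatibility concrete, here are two classic differences between the lines, shown as they behave in Python 3, with the Python 2 behavior noted in comments:

```python
# In Python 2, print was a statement:   print "hello"
# In Python 3, print is a function:
print("hello")

# Division also changed. In Python 2, 7 / 2 gave 3 (integer
# division). In Python 3, / always gives a float, and // is
# the integer-division operator.
print(7 / 2)    # 3.5
print(7 // 2)   # 3
```

Code written around one set of these rules can silently misbehave, or refuse to run, under the other, which is why most people end up picking one version and sticking with it.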
Version 2 and version 3 are similar, but they
are not identical. In fact, the problem is
this: there are some compatibility issues
where code that runs in one does not run in
the other. And consequently, most people have
to choose between one and the other. And what
this leads to is that many people still use
2.x. I have to admit, in the examples that
I use, I'm using 2.x because so many of the
data science packages were developed with
that in mind. Now let me say a few things
about the interfaces for Python. First, Python
does come with its own Integrated Development
and Learning Environment, and they call it IDLE.
You can also run it from the Terminal, or
command line interface, or any IDE that you
have. A very common and a very good choice
is Jupyter. Jupyter is a browser-based framework
for programming and it was originally called
IPython. That served as its initial name, so a
lot of the time when people are talking about
IPython, what they are really talking about
is this Python in Jupyter and the two are
sometimes used interchangeably. One of the
neat things you can do: there are two companies,
Continuum and Enthought, both of which have
made special distributions of Python with
hundreds and hundreds of packages preconfigured
to make it very easy to work with data. I
personally prefer Continuum's Anaconda; it's
the one that I use, a lot of other people
use it, but either one is going to work and
it's going to get you up and running. And
like I said with R, no matter what interface
you use, all of them are command line. You're
typing lines of code. Again, there is tremendous
strength to that but, it can be intimidating
to some people at first. In terms of the actual
commands of Python, we have some examples
here on the side, and the important thing
to remember is that it's a text interface.
On the other hand, Python is familiar to millions
of people because it is very often a first
programming language people learn to do general
purpose programming. And there are a lot of
very simple adaptations for data that make
it very powerful for data science work. So,
let me say something else again: data science
loves Jupyter, and Jupyter is the browser-based
framework. It's a local installation, but
you access it through a web browser that makes
it possible to really do some excellent work
in data science. There's a few reasons for
this. When you're working in Jupyter you get
text output and you can use what's called
Markdown as a way of formatting documents.
You can get inline graphics, where the graphics
show up directly beneath the code that
produced them. It's also really easy to organize,
present, and to share analyses that are done
in Jupyter. Which makes it a strong contender
for your choices in how you do data science
programming. Another one of the beautiful
things about Python, like R, is there are
thousands of packages available. In Python,
there is one main repository; it goes by the
name PyPI, which stands for the Python Package
Index. Right here it says there are over 80,000
packages and 7 or 8,000 of those are for data-specific
purposes. Some of the packages that you will
get to be very familiar with are NumPy and
SciPy, which are for scientific computing
in general; Matplotlib and a development of
it called Seaborn are for data visualization
and graphics. Pandas is the main package for
doing statistical analysis. And for machine
learning, almost nothing beats scikit-learn.
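As a small taste of those packages, here is a sketch using NumPy and pandas together (both assumed installed; the names and numbers in the dataset are made up for illustration):

```python
import numpy as np
import pandas as pd

# A tiny, made-up dataset -- illustrative, not real results.
df = pd.DataFrame({
    "driver": ["Fangio", "Moss", "Musso"],
    "points": [40, 25, 16],
})

# pandas gives spreadsheet-style summaries in one call...
print(df["points"].mean())     # 27.0

# ...and NumPy's functions work on the same columns.
print(np.max(df["points"]))    # 40
```

A DataFrame behaves like a labeled spreadsheet in code, and the rest of the stack (Matplotlib, Seaborn, scikit-learn) is built to accept these same structures.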
And when I go through hands-on examples in
Python, I will be using all of these as a
way of demonstrating the power of the program
for working with data. In sum we can say a
few things: Python is a very popular program
very familiar to millions of people and that
makes it a good choice. Second, of all the
languages we use for data science on a frequent
basis, this is the only one that's general
purpose, which means it can be used for a
lot of things other than processing data.
And it gets its power, like R does, from having
thousands of contributed packages which greatly
expand its capabilities especially in terms
of doing data science work. A choice for "Coding
in Data Science," one of the languages that
may not come immediately to mind when people
think data science, is Sequel, or SQL. SQL
is the language of databases and we think,
"why do we want to work in SQL?" Well, to
paraphrase the famous bank robber Willie Sutton,
who apparently explained why he robbed banks
by saying: "Because that's where the money
is." The reason we work with SQL in data
science is because that's where the data is.
Let's take another look at our ranking of
software among data mining professionals,
and there's SQL. Third on the list, and also
the first database tool on the list.
Other tools, for instance, get much
fancier and much newer and shinier, but SQL
has been around for a while and is very, very capable.
There's a few things to know about SQL. You
will notice that I am saying Sequel even though
it stands for Structured Query Language. SQL
is a language, not an application. There's
not a program called SQL; it's a language that can
be used in different applications. Primarily,
SQL is designed for what are called relational
databases. And those are special ways of storing
structured data that you can pull in. You
can put things together, you can join them
in special ways, you can get summary statistics,
and then what you usually do is then export
that data into your analytical application
of choice. The big word here is RDBMS - Relational
Database Management System; that is where
you will usually see SQL as a query language
being used. In terms of Relational Database
Management System, there are a few very common
choices. In the industrial world where people
have some money to spend, there's Oracle database
is a very common one and Microsoft SQL Server.
In the open source world, two very common
choices are MySQL (even though we generally
say Sequel, here you generally say
My-S-Q-L) and PostgreSQL. These are
both open source, free versions of the language;
sort of dialects of SQL, that make it possible
for you to work with your databases and
to get your information out. The neat
thing about them, no matter what you do, databases
minimize data redundancy by using connected
tables. Each table has rows and columns and
they store different levels of
abstraction or measurement, which means you
only have to put the information one place
and then it can refer to lots of other tables.
That makes it very easy to keep things organized
and up to date. When you are looking into
a way of working with a Relational Database
Management System, you get to choose in part
between using a graphical user interface or
GUI. Some of those include SQL Developer and
SQL Server Management Studio, two very common
choices. And there are a lot of others,
such as Toad, that are
graphical interfaces for working with these
databases. There are also text-based interfaces.
So really, any command line interface, and
any interactive development environment or
programming tool is going to be able to do
that. Now, you can think of yourself on the
command deck of your ship and think of a few
basic commands that are very important for
working with SQL. There are just a handful
of commands that can get you where you need
to go. There is the SELECT command, where
you're choosing the fields that you want to
include. FROM says which tables you're going
to be extracting them from. WHERE is a way
of specifying conditions on the cases, and then ORDER BY
is just a way of sorting the results.
This works because usually when
you are in a SQL database you're just pulling
out the information. You want to select it,
you want to organize it, and then what you
are going to do is you are going to send the
data to your program of choice for further
analysis, like R or Python or whatever. In
sum here's what we can say about SQL: Number
one, as a language it's generally associated
with relational databases, which are very
efficient and well-structured ways of storing
data. Just a handful of basic commands can
be very useful when working with databases.
You don't have to be a super ninja expert;
really, a handful of five or ten commands will probably
get you everything you need out of a SQL database.
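You can try that handful of commands without installing anything extra, because Python ships with SQLite, a small relational database; the table and column names in this sketch are made up for illustration:

```python
import sqlite3

# An in-memory database with an illustrative, made-up table.
con = sqlite3.connect(":memory:")
cur = con.cursor()
cur.execute("CREATE TABLE drivers (name TEXT, wins INTEGER)")
cur.executemany(
    "INSERT INTO drivers VALUES (?, ?)",
    [("Fangio", 4), ("Moss", 2), ("Musso", 0)],
)

# SELECT, FROM, WHERE, and ORDER BY in a single query:
cur.execute(
    "SELECT name, wins FROM drivers WHERE wins > 0 ORDER BY wins DESC"
)
rows = cur.fetchall()
print(rows)   # [('Fangio', 4), ('Moss', 2)]
con.close()
```

The query pulls and sorts just the rows you asked for, and `rows` is now ordinary Python data, ready to hand off for further analysis.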
Then once the data is organized, the data
is typically exported to some other program
for analysis. When you talk about coding in
any field, one of the languages or one of
the groups of languages that come up most
often are C, C++, and Java. These are extremely
powerful languages, very frequently
used for professional, production-level coding.
In data science, the place where you will
see these languages most often is in the bedrock.
The absolute fundamental layer that makes
the rest of data science possible. For instance,
C and C++. C is from the early ‘70s, C++ is from
the ‘80s, and they have extraordinary wide
usage, and their major advantage is that they're
really, really fast. In fact, C is usually
used as the benchmark for how fast a language is.
They are also very, very stable, which makes
them really well suited to production-level
code and, for instance, server use. What's
really neat is that in certain situations,
if time is really important, if speed is important,
then you can actually use C code in R or other
statistical languages. Next is Java. Java
is based on C++; its major contribution was
WORA, or Write Once, Run Anywhere. The
idea that you were going to be able to develop
code that is portable to different machines
and different environments. Because of that,
Java is the most popular computer programming
language overall, across all tech situations.
The place you would use these in data science,
like I said, is when time is of the essence:
when something has to be fast, has to get
the job accomplished quickly, and has to
not break. Then these are the ones you're
probably going to use. The people who are
going to use it are primarily going to be
engineers. The engineers and the software
developers who deal with the inner workings
of the algorithms in data science or the back
end of data science. The servers and the mainframes
and the entire structure that makes analysis
possible. Analysts, the people who
are actually analyzing the data, typically
don't do hands-on work with the foundational
elements. They don't usually touch C or C++,
more of the work is on the front end or closer
to the high-level languages like R or Python.
In sum: C, C++ and Java form a foundational
bedrock in the back end of data and data science.
They do this because they are very fast and
they are very reliable. On the other hand,
given their nature that work is typically
reserved for the engineers who are working
with the equipment that runs in the back that
makes the rest of the analysis possible. I
want to finish our extremely brief discussion
of "Coding in Data Sciences" and the languages
that can be used, by mentioning one other
that's called Bash. Bash really is a great
example of old tools that have survived and
are still being used actively and productively
with new data. You can think of it this way,
it's almost like typing on your typewriter.
You're working at the command line, you're
typing out code through a command line interface
or a CLI. This method of interacting with
computers practically goes back to the typewriter
phase, because it predates monitors. So, before
you even had a monitor, you would type out
the code and it would print it out on a piece
of paper. The important thing to know about
the command line is it's simply a method of
interacting. It's not a language, because
lots of languages can run at the command line.
For instance, it is important to talk about
the concept of a shell. In computer science,
a shell is a program that wraps
around the operating system. It's a shell around the
machinery, the interaction layer that lets
the user get things done at the lower levels
that aren't really human-friendly. On Mac
computers and Linux, the most common is Bash,
which is short for Bourne Again Shell. On
Windows computers, the most common is PowerShell.
But whatever you do there actually are a lot
of choices, there's the Bourne Shell, the
C shell; which is why I have a seashell right
here, the Z shell, there's fish for Friendly
Interactive Shell, and a whole bunch of other
choices. Bash is the most common on Mac and
Linux and PowerShell is the most common on
Windows as a method of interacting with the
computer at the command line level. There's
a few things you need to know about this.
You have a prompt of some kind, in Bash, it's
a dollar sign, and that just means type your
command here. Then, the other thing is you
type one line at a time. It's actually amazing
how much you can get done with a one-liner
program, by sort of piping things together,
so one feeds into the other. You can run more
complex commands if you use a script. So,
you call a text document that has a bunch
of things in it and you can get much more
elaborate analyses done. Now, we have our
tools here. In Bash we talk about utilities
and what these are, are specific programs
that accomplish specific tasks. Bash really
thrives on "Do one thing, and do it very well."
There are two general categories of utilities
for Bash. Number one, is the Built-ins. These
are the ones that come installed with it,
and so you're able to use them anytime by simply
calling their name. Some more common ones
are: cat, which is for concatenate; that's to
put information together. There's awk, which
is its own interpreted language, but it's
often used for text processing from the command
line. By the way, the name 'Awk' comes from
the initials of the people who created it.
Then there's grep, which is for Global search
with a Regular Expression and Print. It's
a way of searching for information. And then
there's sed, which stands for Stream Editor
and its main use is to transform text. You
can do an enormous amount with just these
4 utilities. A few more are head & tail, which display
the first or last 10 lines of a document;
sort & uniq, which sort a document and count the number
of unique lines in it; wc, which
is for word count, and printf which formats
the output that you get in your console. And
while you can get a huge amount of work done
with just this small number of built-in utilities,
there is also a wide range of installables,
or other command line utilities, that you
can add to Bash or whatever shell
you're using. So, some really good ones
that have been recently developed are jq,
which is for pulling in JSON, or JavaScript
Object Notation, data from the web. And then
there's json2csv, which is a way of converting
JSON to csv format, which is what a lot of
statistical programs are going to be happy
with. There's Rio which allows you to run
a wide range of commands from the statistical
programming language R in the command line
as part of Bash. And then there's BigMLer.
This is a command line tool that allows you
to access BigML's machine learning servers
through the command line. Normally, you do
it through a web browser and it accesses their
servers remotely. It's an amazingly useful program
but to be able to just pull it up when you're
in the command line is an enormous benefit.
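The one-liner, pipe-things-together style described earlier is easy to demonstrate; this sketch runs such a pipeline from Python and assumes a POSIX shell with the sort and uniq utilities available:

```python
import subprocess

# Three lines of input, piped through sort and uniq -c, then
# sorted again by count -- each utility does one thing well.
pipeline = "printf 'pear\\napple\\npear\\n' | sort | uniq -c | sort -rn"
result = subprocess.run(
    pipeline, shell=True, capture_output=True, text=True, check=True
)
print(result.stdout)
# The most frequent word comes out on top:
#       2 pear
#       1 apple
```

Each utility's output feeds into the next one's input, which is the whole idea of the pipe: small tools chained together into an analysis.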
What's interesting is that even with
all these opportunities, all these different
utilities, you can do amazing things,
and there's still active development of utilities
for the command line. So, in sum: despite
being in one sense as old as the dinosaurs,
the command line survives because it is extremely
well evolved and well suited to its purpose
of working with data. The utilities; both
the built-in and the installable are fast
and they are easy. In general, they do one
thing and they do it very, very well. And
then surprisingly, there is an enormous amount
of very active development of command line
utilities for these purposes, especially with
data science. One critical task when you are
Coding in Data Science is to be able to find
the things that you are looking for, and Regex
(which is short for Regular Expressions) is
a wonderful way to do that. You can think
of it as the supercharged method for finding
needles in haystacks. Now, Regex tends to
look a little cryptic, so, for instance, here's
an example: something that's designed to
determine whether a string is a valid email address,
and it specifies what can go in the beginning,
you have the at sign in the middle, then you've
got a certain number of letters and numbers,
then you have to have a dot something at the
end. And so, this is a special kind of code
for indicating what can go where. Now regular
expressions, or regex, are really a form of
pattern matching in text. And it's a way of
specifying what needs to be where, what can
vary, and how much it can vary. And you can
write both specific patterns; say I only want
a one letter variation here, or a very general
like the email validator that I showed you.
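In Python, this kind of pattern matching lives in the re module; the pattern below is a deliberately simplified email check for illustration, not a production validator:

```python
import re

# ^   start of string; \w+ one or more word characters;
# \.  an escaped, literal dot; $ end of string.
# A simplified illustration, not a full email validator.
pattern = re.compile(r"^\w+@\w+\.[a-z]{2,}$")

print(bool(pattern.match("ada@example.com")))   # True
print(bool(pattern.match("not-an-email")))      # False
```

Once a pattern like this has identified the matching cases, those cases can be exported to another program for analysis.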
And the idea here is that you can write this
search pattern, your little wild card thing,
you can find the data and then once you identify
those cases, then you export them into another
program for analysis. So here's a short example
of how it can work. What I've done is taken
some text documents, they're actually the
texts to Emma and to Pygmalion, two books
I got off of Project Gutenberg, and this is
the command: grep ^l.ve *.txt. So what I'm
looking for in either of these books are lines
that start with ‘l', then they can have
one character; can be whatever, then that's
followed by ‘ve', and then the .txt means
search for all the text files in the particular
folder. And what it found were lines that
began with love, and lived, and lovely, and
so on. Now in terms of the actual nuts and
bolts of regular expressions, there are
certain elements. There are literals, and
those are things that are exactly what they
mean. You type the letter ‘l', you're looking
for the letter ‘l'. There are also metacharacters,
which specify, for instance, that something needs to
go here; they're characters, but they really serve as
code that stands for something else. Now, there
are also escape sequences, which say: normally
this character is used as a wildcard, but here
I really want to look for a literal period as opposed
to a placeholder. Then you have the entire
search expression that you create and you
have the target string, the thing that it
is searching through. So let me give you a
few very short examples. ^ this is the caret.
It is sometimes called a hat or, in French,
a circonflexe. What it means is you're looking
for something at the beginning of the string
you are searching. For example, you can have
^ and capital M, that means you need something
that begins with capital M. For instance the
word "Mac," true, it will find that. But if
you have iMac, it's a capital M, but it's
not the first letter and so that would be
false, it won't find that. The $ means you
are looking for something at the end of the
string. So for example: ing$ that will find
the word ‘fling' because it ends in ‘ing',
but it won't find the word ‘flings' because
it actually ends with an ‘s'. And then the
dot, the period, simply means that we are
looking for one letter and it can be anything.
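If you want to try these metacharacters yourself, here is a small sketch using Python's re module; the patterns are the ones from the examples above and from the grep command earlier:

```python
import re

# ^M: the match must be at the beginning of the string
print(bool(re.search(r"^M", "Mac")))       # True: starts with capital M
print(bool(re.search(r"^M", "iMac")))      # False: M is not the first letter

# ing$: the match must be at the end of the string
print(bool(re.search(r"ing$", "fling")))   # True: ends in "ing"
print(bool(re.search(r"ing$", "flings")))  # False: it ends with an "s"

# ^l.ve: the pattern from the grep example; . matches any one character
for word in ["love", "lived", "lovely"]:
    print(word, bool(re.search(r"^l.ve", word)))  # all True
```

The same patterns work in grep, in R, and in most other tools, which is part of what makes regular expressions so widely applicable.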
So, for example, you can write ‘at.'. And
that will find ‘data' because it has an
‘a', a ‘t', and then one letter after
it. But it won't find ‘flat', because ‘flat'
doesn't have anything after the ‘at'. And
so these are extremely simple examples of
how it can work. Obviously, it gets more complicated
and the real power comes when you start combining
these bits and elements. Now, one interesting
thing about this is you can actually treat
this as a game. I love this website, it's
called Regex golf and it's at regex.alf.nu.
And what it does is bring up two columns of words, and your job is to write a regular expression at the top that matches all the words in the left column and none of the words on the right, using the fewest characters possible, and you get a score! And it's a
great way of learning how to do regular expressions
and learning how to search in a way that is
going to get you the data you need for your
projects. So, in sum: Regex, or regular expressions,
help you find the right data for your project,
they're very powerful and they're very flexible.
Now, on the other hand, they are cryptic,
at least when you first look at them but at
the same time, it's like a puzzle and it can
be a lot of fun if you practice it and you
see how you can find what you need. I want
to thank you for joining me in "Coding in
Data Science" and we'll wrap up this course
by talking about some of the specific next
steps you can take for working in data science.
The idea here, is that you want to get some
tools and you want to start working with those
tools. Now, please keep in mind something that I've said before: data tools and data science are related, and they're important, but don't make the mistake of thinking that knowing the tools is the same thing as actually conducting data science.
That's not true, people sometimes get a little
enthusiastic and they get a little carried
away. What you need to remember is the relationship
really is this: Data Tools are an important
part of data science, but data science itself
is much bigger than just the tools. Now, speaking
of tools remember there's a few kinds that
you can use, and that you might want to get
some experience with these. #1, in terms of apps, specifically built applications, Excel and Tableau are really fundamental: Excel for getting data from clients and doing some basic data browsing, and Tableau is really wonderful for interactive data visualization.
I strongly recommend you get very comfortable
with both of those. In terms of code, it's
a good idea to learn either ‘R' or ‘Python'
or ideally to learn both. Ideally because
you can use them hand in hand. In terms of
utilities, it's a great idea to work with Bash, the command line utility, and to use regular expressions, or regex. You can actually use regular expressions in lots and lots of programs, so they have very wide application. And then finally, data science
requires some sort of domain expertise. You're
going to need some sort of field experience
or intimate understanding of a particular
domain and the challenges that come up and
what constitutes workable answers and the
kind of data that's available. Now, as you
go through all of this, you don't need to
build this monstrous list of things. Remember,
you don't need everything. You don't need
every tool, you don't need every function,
you don't need every approach. Instead remember,
get what's best for your needs, and for your
style. But no matter what you do, remember
that tools are tools, they are a means to
an end. Instead, you want to focus on the
goal of your data science project whatever
it is. And I can tell you really, the goal
is in the meaning, extracting meaning out
of your data to make informed choices. In
fact, I'll say a little more. The goal is
always meaning. And so with that, I strongly
encourage you to get some tools, get started
in data science and start finding meaning
in the data that's around you. Welcome to
"Mathematics in Data Science". I'm Barton
Poulson and we're going to talk about how
Mathematics matters for data science. Now,
you may be saying to yourself, "Why math? Computers can do it; I don't need to do it. I don't need math, I am just here to do my work." Well, I am here to tell you: no, you need math.
That is if you want to be a data scientist,
and I assume that you do. So we are going
to talk about some of the basic elements of
Mathematics, really at a conceptual level
and how they apply to data science. There
are few ways that math really matters to data
science. #1, it allows you to know which procedures
to use and why. So you can answer your questions
in a way that is the most informative and
the most useful. #2, if you have a good understanding
of math, then you know what to do when things
don't work right, when you get impossible
values or things won't compute, and that makes
a huge difference. And then #3, an interesting
thing is that some mathematical procedures
are easier and quicker to do by hand than
by actually firing up the computer. And so
for all 3 of these reasons, it's really helpful
to have at least a grounding in Mathematics
if you're going to do work in data science.
Now probably the most important thing to start
with is Algebra. And there are 3 kinds of
algebra I want to mention. The first is elementary
algebra, that's the regular x+y. Then there
is Linear or matrix algebra, which looks more complex but is conceptually similar, and it's what computers actually use to do the calculations.
And then finally I am going to mention Systems
of Linear Equations where you have multiple
equations simultaneously that you're trying
to solve. Now there's more math than just
algebra. A few other things I'm going to cover
in this course. Calculus, a little bit of
Big O or order which has to do with the speed
and complexity of operations. A little bit
of probability theory and a little bit of
Bayes or Bayes theorem which is used for getting
posterior probabilities and changes the way
you interpret the results of an analysis.
And for the purposes of this course, I'm going
to demonstrate the procedures by hand, of
course you would use software to do this in
the real world, but we are dealing with simple
problems at conceptual levels. And really,
the most important thing to remember is that
even though a lot of people get put off by
math, really, you can do it! And so, in sum:
let's say these three things about math. First
off, you do need some math to do good data
science. It helps you diagnose problems, it
helps you choose the right procedures, and
interestingly you can do a lot of it by hand,
or you can use software to do the
calculations as well. As we begin our discussion
of the role of "Mathematics and Data Science",
we'll of course begin with the foundational
elements. And in data science nothing is more
foundational than Elementary Algebra. Now,
I'd like to begin this with really just a
bit of history. In case you're not aware,
the first book on algebra was written in 820
by Muhammad ibn Musa al-Khwarizmi. And it
was called "The Compendious Book on Calculation
by Completion and Balancing". Actually, the original title was in Arabic, and if you transliterate it, one word stands out: al-jabr. That's the "algebra," and it means restoration. In any case, that's where it
comes from and for our concerns, there are
several kinds of algebra that we're going
to talk about. There's Elementary Algebra,
there's Linear Algebra and there are systems
of linear equations. We'll talk about each
of those in different videos. But to put it
into context, let's take an example here of
salaries. Now, this is based on real data
from a survey of the salary of people employed
in data science and to give a simple version
of it. The salary was equal to a constant,
that's sort of an average value that everybody
started with and to that you added years,
then some measure of bargaining skills and
how many hours they worked per week. And that
gave you your prediction, but that wasn't
exact; there's also some error to throw into
it to get to the precise value that each person
has. Now, if you want to abbreviate this,
you can write it kind of like this: S = C + Y + B + H + E, although it's more common
to write it symbolically like this, and let's
go through this equation very quickly. The first thing we have is the outcome; we call that the variable y for person i, where "i" stands for each case in our observations. So, here's outcome y for person i. This letter here is a Greek beta, and it represents the intercept, or the average; it has a subscript zero because we don't multiply it by anything. But right next to it we have a coefficient for variable 1: beta, which means a coefficient, with a sub 1 for the first variable. Then we have x1i, where x1 means variable 1 and i means it's the score on that variable for person i, whoever we are talking about. Then
we do the same thing for variables 2 and 3,
and at the end, we have a little epsilon here
with an i for the error term for person i,
which says how far off from the prediction
was their actual score. Now, I'm going to
run through some of these procedures and we'll
see how they can be applied to data science.
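To see how the pieces of the equation add up for a single case, here is a small sketch in Python; every number in it is hypothetical, chosen just to illustrate the notation:

```python
# y_i = b0 + b1*x1_i + b2*x2_i + b3*x3_i + e_i   (one case at a time)
# All of these values are made up, purely for illustration.
b0 = 30000                     # beta 0: the intercept, not multiplied by anything
b1, b2, b3 = 1500, 6000, 400   # coefficients for variables 1, 2, and 3

x1, x2, x3 = 10, 4, 50         # person i's scores on the three variables
e = -2000                      # epsilon: the error term for person i

y = b0 + b1 * x1 + b2 * x2 + b3 * x3 + e
print(y)  # the outcome y for person i: 87000
```

That's the whole equation for one case; doing it by hand really is just multiplication and addition.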
But for right now let's just say this in sum.
First off, Algebra is vital to data science.
It allows you to combine multiple scores,
get a single outcome, do a lot of other manipulations.
And really, the calculations, they're easy for one case at a time, especially when you're
doing it by hand. The next step for "Mathematics
for Data Science" foundations is to look at
Linear algebra, an extension of elementary algebra. And depending on your background, you may know this by another name, and I like to say welcome to the Matrix, because it's also known as matrix algebra; we are dealing with matrices. Now, let's go back
to an example I gave in the last video about
salary. Where salary is equal to a constant
plus years, plus bargaining, plus hours plus
error, okay that's a way to write it out in
words and if you want to put it in symbolic
form, it's going to look like this. Now before
we get started with matrix algebra, we need
to talk about a few new words, maybe you're
familiar with them already. The first is Scalar,
and this means a single number. And then a
vector is a single row or a single column
of numbers that can be treated as a collection.
That usually means a variable. And then finally,
a matrix consists of many rows and columns.
Sort of a big rectangle of numbers, the plural
of that by the way is matrices and the thing
to remember is that Machines love Matrices.
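Here's a minimal sketch of those three terms using NumPy, just one assumed choice of library; any matrix package would show the same idea:

```python
import numpy as np

scalar = 42                          # a scalar: a single number
vector = np.array([118, 84])         # a vector: a single row or column of numbers
matrix = np.array([[1, 10, 4, 50],   # a matrix: many rows and columns;
                   [1, 16, 3, 35]])  # here, 2 rows (cases) by 4 columns

print(vector.shape)  # (2,)
print(matrix.shape)  # (2, 4)
```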
Now let's take a look at a very simple example
of this. Here is a very basic representation
of matrix algebra or Linear Algebra. Where
we are showing data on two people, on four
variables. So over here on the left, we have
the outcomes for cases 1 and 2, our people
1 and 2. And we put it into the square brackets
to indicate that it's a vector or a matrix.
Here on the far left, it's a vector because
it's a single column of values. Next to that
is a matrix, that has here on the top, the
scores for case 1, which I've written as x's.
X1 is for variable 1, X2 is for variable 2
and the second subscript indicates that
it's for person 1. Below that, are the scores
for case 2, the second person. And then over
here, in another vertical column are the regression
coefficients, that's a beta there that we
are using. And then finally, we've got a tiny
little vector here which contains the error
terms for cases 1 and 2. Now, even though
you would not do this by hand, it's helpful
to run through the procedure, so I'm going
to show it to you by hand. And we are going
to take two fictional people. This will be
fictional person #1, we'll call her Sophie.
We'll say that she's 28 years old and we'll
say that she has good bargaining skills,
a 4 on a scale of 5, and that she works 50
hours a week and that her salary is $118,000.00.
Our second fictional person, we'll call him
Lars and we'll say that he's 34 years old
and he has moderate bargaining skills 3 out
of 5, works 35 hours per week and has a salary
of $84,000.00. And so if we are trying to
look at salaries, we can look at our matrix
representation that we had here, with our
variables indicated with their Latin and sometimes
Greek symbols. And we will replace those variables
with actual numbers. We have the salary for
Sophie, our first person. So why don't we
plug in the numbers here and let's start with
the result here. Sophie's salary is $118,000.00
and here's how all these numbers all add up
to get that. The first thing here is the intercept.
And we just multiply that times 1, so that's
sort of the starting point, and then we get
this number 10, which actually has to do with
years over 18. She's 28 so that's 10 years
over 18, we multiply each year by 1395. Next
is bargaining skills. She's got a 4 out of
5 and for each step up you get $5,900.00.
By the way, these are real coefficients from a survey study of data scientists' salaries.
And then finally hours per week. For each
hour, you get $382.00. Now you can add these
up, and get a predicted value for her but
it's a little low. It's $30,000.00 low. You may be saying that's pretty far off; well, that's because the full equation has something like 40 variables in it, including whether she is the owner, and if she's the owner then yes, she's going to make a lot more. And then we do a
similar thing for the second case, but what's
neat about matrix algebra or Linear Algebra
is that this means the same stuff; what we have here are bolded variables that stand in for entire vectors or matrices. So for
instance; this Y, a bold Y stands for the
vector of outcome scores. This bolded X is
the entire matrix of values that each person
has on each variable. This bolded beta is
all of the regression coefficients and then
this bolded epsilon is the entire vector of
error terms. And so it's a really super compact
way of representing the entire collection
of data and coefficients that you use in predicting
values. So in sum, let's say this. First off,
computers use matrices. They like to do linear
algebra to solve problems, and it's conceptually simpler because you can put it all into this kind of compact formation. In fact, it's a very
compact notation and it allows you to manipulate
entire collections of numbers pretty easily.
And that's the major benefit of learning
a little bit about linear or matrix algebra.
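The compact form can be sketched with NumPy. Note that the intercept below is an assumption on my part (it isn't given in the example), picked so that Sophie's prediction comes out about $30,000.00 below her actual salary of $118,000.00, and the error vector is left out:

```python
import numpy as np

# Rows: Sophie and Lars. Columns: 1 for the intercept, years over 18,
# bargaining skill (1 to 5), and hours per week.
X = np.array([[1, 10, 4, 50],    # Sophie: 28 years old, skill 4, 50 hours
              [1, 16, 3, 35]])   # Lars:   34 years old, skill 3, 35 hours

# Coefficients: the last three are the ones quoted in the example; the
# intercept (31,350) is an assumed value, since it's never stated.
beta = np.array([31350, 1395, 5900, 382])

y_pred = X @ beta   # matrix multiplication: all cases predicted in one line
print(y_pred)       # [88000 84740]
```

One multiplication handles every case at once, which is exactly why machines love matrices.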
Our next step in "Mathematics for Data Science
Foundations" is systems of linear equations.
And maybe you are familiar with this, but
maybe you're not. And the idea here is that
there are times, when you actually have many
unknowns and you're trying to solve for them
all simultaneously. And what makes this really
tricky is that a lot of these are interlocked.
Specifically that means X depends on Y, but
at the same time Y depends on X. What's funny
about this, is it's actually pretty easy to
solve these by hand, and you can also use linear or matrix algebra to do it. So let's take a little
example here of Sales. Let's imagine that
you have a company and that you've sold 1,000
iPhone cases, so that they are not running
around naked like they are in this picture
here. Some of them sold for $20 and others
sold for $5. You made a total of $5,900.00
and so the question is "How many were sold
at each price?" Now, you'd know this if you were keeping good records, but you can also calculate it from this little bit of information. And to show
you I'm going to do it by hand. Now, we're
going to start with this. We know that sales at the two price points, x + y, add up to 1,000 total cases sold. And for revenue, we know
that if you multiply a certain number times
$20 and another number times $5, that it all
adds up to $5,900.00. Between the two of those
we can figure out the rest. Let's start with
sales. Now, what I'm going to do is try to
isolate the values. I am going to do that
by putting in this minus y on both sides and
then I can take that and I can subtract it,
so I'm left with x is equal to 1,000 - y.
Normally you might solve for y, but here I've solved for x; you'll see why in just a second. Then we go
to revenue. We know from earlier that our
sales at these two prices points, add up to
$5,900.00 total. Now what we are going to
do is take the x that's right here and we
are going to replace it with the equation
we just got, which is 1,000 - y. Then we multiply
that through and we get $20,000.00 minus $20y
plus $5y equals $5,900.00. Well, we can combine those two terms because they are both in y: minus $20y plus $5y gives us minus $15y. And then we subtract
$20,000.00 from both sides. So there it is,
right there on the left, and that disappears,
then I get it over on the right side. And
then I do the math there, and I get minus
$14,100.00. Well, then I divide both sides
by negative $15.00 and when we do that we
get y equals 940. Okay, so that's one of our
values for sales. Let's go back to sales.
We have x plus y equals 1,000. We take the
value we just got, 940, we stick that into
the equation, then we can solve for x. Just
subtract 940 from each side, there we go.
We get x is equal to 60. So, let's put it
all together, just to recap what happened.
What this tells us is that 60 cases were sold
at $20.00 each. And that 940 cases were sold
at $5 each. Now, what's interesting about
this is you can also do this graphically.
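Before we draw it, the same system can be handed to software as a small linear algebra problem; here's a sketch with NumPy:

```python
import numpy as np

# The two equations from the example:
#   x + y = 1000      (cases sold at the two prices)
#   20x + 5y = 5900   (total revenue in dollars)
A = np.array([[1.0, 1.0],
              [20.0, 5.0]])
b = np.array([1000.0, 5900.0])

x, y = np.linalg.solve(A, b)   # solve both equations simultaneously
print(x, y)                    # 60 cases at $20.00, 940 cases at $5.00
```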
We're going to draw it. So, I'm going to graph
the two equations. Here are the original ones
we had. This one predicts sales, this one
gives price. The problem is, these aren't in a convenient form for creating graphs. That needs to be y equals something else,
so we're going to solve both of these for
y. We subtract x from both sides, there it
is on the left, we subtract that. Then we
have y equals minus x plus 1,000. That's
something we can graph. Then we do the same
thing for price. Let's divide by 5 all the
way through, that gets rid of that and then
we've got this 4x, then let's subtract 4x
from each side. And what we are left with is y equals minus 4x plus 1,180, which is also something
we can graph. So this first line, this indicates
cases sold. It originally said x plus y equals
1000, but we rearranged it to y is equal to
minus x plus 1000. And so that's the line
we have here. And then we have another line,
which indicates earnings. And this one was
originally written as $20.00 times x plus
$5.00 times y equals $5,900.00 total. We rearranged
that to y equals minus 4x plus 1,180. That's
the equation for the line and then the solution
is right here at the intersection. There's
our intersection and it's at 60 on the number
of cases sold at $20.00 and 940 as the number
of cases sold at $5.00 and that also represents
the solution of the joint equations. It's
a graphical way of solving a system of linear
equations. So in sum, systems of linear equations
allow us to balance several unknowns and find
unique solutions. And in many cases, it's
easy to solve by hand, and it's really easy with linear algebra when you use software to do it. As we continue
our discussion of "Mathematics for Data Science"
and the foundational principles the next thing
we want to talk about is Calculus. And I'm
going to give a little more history right
here. The reason I'm showing you pictures
of stones, is because the word Calculus is
Latin for stone, as in a stone used for tallying.
Where when people would actually have a bag
of stones and they would use it to count sheep
or whatever. And the system of Calculus was
formalized in the 1600s simultaneously and independently
by Isaac Newton and Gottfried Wilhelm Leibniz.
And there are 3 reasons why Calculus is important
for data science. #1, it's the basis for most
of the procedures we do. Things like least
squares regression and probability distributions,
they use Calculus in getting those answers.
#2, if you are studying anything that changes over time, if you are measuring quantities or rates that change over time, then you have to use Calculus. And #3, Calculus is used in finding the maxima and minima of functions, especially when you're optimizing. Which is
something I'm going to show you separately.
Also, it is important to keep in mind, there
are two kinds of Calculus. The first is differential
Calculus, which talks about rates of change
at a specific time. It's also known as the
Calculus of change. The second kind of Calculus
is Integral Calculus and this is where you
are trying to calculate the quantity of something
at a specific time, given the rate of change.
It's also known as the Calculus of Accumulation.
So, let's take a look at how this works and
we're going to focus on differential Calculus.
So I'm going to graph an equation here, I'm
going to do y equals x², a very simple one
but it's a curve which makes it harder to
calculate things like the slope. Let's take
a point here that's at minus 2, that's the
middle of the red dot. X is equal to minus
2. And because y is equal to x², if we want
to get the y value, all we got to do is take
that negative 2 and square it and that gives
us 4. So that's pretty easy. So the coordinates
for that red point are minus 2 on x, and plus
4 on the y. Here's a harder question. "What
is the slope of the curve at that exact point?"
Well, it's actually a little tricky because
the curve is always curving there's no flat
part on it. But we can get the answer by getting
the derivative of the function. Now, there
are several different ways of writing this,
I am using the one that's easiest to type.
And let's start with the general rule: the derivative of x to the power n is n times x to the power n minus 1. Here, the n is the squared part, so we have x². We take that same n, which is 2, and bring it down in front as a multiplier, and then we subtract 1 from the exponent. 2 minus 1 is 1, and truthfully you can just ignore an exponent of 1, so you get 2x. That is the derivative; so what we have here is that the derivative of x² is 2x. That means, the slope
at any given point in the curve is 2x. So,
let's go back to the curve we had a moment
ago. Here's our curve, here's our point at
x equals minus 2, and so the slope is equal to 2x,
well we put in the minus 2, and we multiply
it and we get minus 4. So that is the slope
at this exact point in the curve. Okay, what
if we choose a different point? Let's say
we came over here to x is equal to 3? Well,
the slope is equal to 2x so that's 2 times
3, is equal to 6. Great! And on the other
hand, you might be saying to yourself "And
why do I care about this?" There's a reason
that this is important and what it is, is
that you can use these procedures to optimize
the decisions. And if that seems a little
too abstract to you, that means you can use
them to make more money. And I'm going to
demonstrate that in the next video. But for
right now in sum, let's say this. Calculus
is vital to practical data science, it's the
foundation of statistics and it forms the
core that's needed for doing optimization.
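As a quick check on the power rule, here's a sketch that approximates the slope of y = x² numerically and compares it with 2x:

```python
def f(x):
    return x ** 2

def slope_at(x, h=1e-6):
    # central difference: rise over run across a tiny interval around x
    return (f(x + h) - f(x - h)) / (2 * h)

print(slope_at(-2))  # approximately -4, which is 2 * (-2)
print(slope_at(3))   # approximately  6, which is 2 * 3
```

The numerical estimates land right on the values the derivative gave us by hand.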
In our discussion about Mathematics and data
science foundations, the last thing I want
to talk about right here is calculus and how
it relates to optimization. I like to think
of this, in other words, as the place where
math meets reality, or it meets Manhattan
or something. Now if you remember this graph
I made in the last video, y is equal to x²,
that shows this curve here and we have the
derivative, which says the slope is given by
2x. And so when x is equal to 3, the slope
is equal to 6, fine. And this is where this
comes into play. Calculus makes it possible
to find values that maximize or minimize outcomes.
And if you want to think of something a little
more concrete here, let's think of an example,
by the way that's Cupid and Psyche. Let's
talk about pricing for online dating. Let's
assume you've created a dating service and
you want to figure out how much can you charge
for it that will maximize your revenue. So,
let's get a few hypothetical parameters involved.
First off, let's say that subscriptions, annual
subscriptions cost $500.00 each year and you
can charge that for a dating service. And
let's say you sell 180 new subscriptions every
week. On the other hand, based on your previous
experience manipulating prices around, you
have some data that suggests that for each
$5 you discount from the price of $500.00
you will get 3 more sales. Also, because it's an online service, let's make our lives a little easier right now and assume there is no increase in overhead. It's not really how
it works, but we'll do it for now. And I'm
actually going to show you how to do all this
by hand. Now, let's go back to price first.
We have this. $500.00 is the current annual
subscription price and you're going to subtract
$5.00 for each unit of discount; that's what I'm calling d. So, one discount is $5.00, two
discounts is $10.00 and so on. And then we
have a little bit of data about sales, that
you're currently selling 180 subscriptions
per week and that you will add 3 more for
each unit of discount that you give. So, what
we're going to do here is we are going to
find sales as a function of price. Now, to
do that the first thing we have to do is get
the y intercept. So we have price here, is
$500.00, is the current annual subscription
price minus $5 times d. And what we are going
to do is get that y intercept by solving for when this equals zero. Okay,
well we take the $500 we subtract that from
both sides and then we end up with minus $5d
is equal to minus $500.00. Divide both sides
by minus $5 and we are left with d is equal
to 100. That is, when d is equal to 100, the price is 0. And that tells us how we can get the
y intercept, but to get that we have to substitute
this value into sales. So we take d is equal to 100, and the intercept is equal to 180 plus 3 times 100; 180 is the number of new subscriptions per week, and 3 times our 100 is equal to 300. Add those together and you get 480. And that is the
y intercept in our equation: when we've discounted the price all the way down to zero, the expected sales are 480. Of course that's not
going to happen in reality, but it's necessary
for finding the slope of the line. So now
let's get the slope. The slope is equal to
the change in y on the y axis divided by the
change in x. One way we can get this is by
looking at sales; we get our 180 new subscriptions
per week plus 3 for each unit of discount
and we take our information on price. $500.00
a year minus $5.00 for each unit of discount
and then we take the plus 3d and the minus $5d, and those
will give us the slope. So it's plus 3 divided
by minus 5, and that's just minus 0.6. So
that is the slope of the line. Slope is equal
to minus 0.6. And so what we have from this
is sales as a function of price, where sales is equal to 480, because that is the y intercept when price is equal to zero, minus 0.6 times price. So, this isn't the final thing; now what we have to do is turn this into revenue, there's another stage to this. Revenue is
equal to sales times price, how many things
did you sell and how much did it cost. Well,
we can substitute some information in here.
If we take sales and we put it in as a function
of price, because we just calculated that
a moment ago, then we do a little bit of multiplication
and then we get that revenue is equal to 480 times the price minus 0.6 times the price squared.
Okay, that's a lot of stuff going on there.
What we're going to do now is we're going
to get the derivative, that's the calculus
that we talked about. Well, the derivative of 480 times the price, where price plays the role of our x, is simply 480. And the minus 0.6 times price squared? That's similar to what we did with the curve: 0.6 times 2 is equal to 1.2, so we end up with minus 1.2 times the price. This is the derivative of the original equation. We can solve that for
zero now, and just in case you are wondering.
Why do we solve it for zero? Because a slope of zero is going to give us the place where y is at a maximum. We have a negative coefficient on the squared term, so the shape of the curve is inverted, and we are looking for the value right at the very tippy top of the curve, because that will indicate maximum revenue. Okay,
so what we're going to do is solve for zero.
Let's go back to our equation here. We want
to find out when is that equal to zero? Well,
we subtract 480 from each side, there we go, and we divide by minus 1.2 on each side, which gives us $400.00. And this is our price for maximum revenue. So we've been charging $500.00 a year, but this says we'll have more total income if we charge $400.00 instead. And if you want to find out
how many sales we can get at that price, you take the 480, which is the hypothetical y intercept when the price is zero, then put in our actual price of $400.00: 0.6 times $400.00 is 240, and 480 minus 240 gives us 240 total. So, that would be 240 new subscriptions
per week. So let's compare this. Current revenue
is 180 new subscriptions per week at $500.00
per year. And that means our current revenue is $90,000.00 per week. I know it sounds really good, but we can do better than that. Because
the formula for maximum value is 240 times
$400.00, when you multiply those you get $96,000.00.
And so the improvement is just a ratio of
those two. $96,000.00 divided by $90,000.00
is equal to 1.07. And what that means is a
7% increase and anybody would be thrilled
to get a 7% increase in their business simply
by changing the price and increasing the overall
revenue. So, let's summarize what we found
here. If you lower the cost by 20%, go from
$500.00 per year to $400.00 per year, assuming
all of our other information is correct, then
you can increase sales by 33%; that's more than the 20% you cut, and that increases
total revenue by 7%. And so we can optimize
the price to get the maximum total revenue
and it has to do with that little bit of calculus
and the derivative of the function. So in
sum, calculus can be used to find the minima
and maxima of functions including prices.
It allows for optimization and that in turn
allows you to make better business decisions.
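The whole pricing example can be replayed in a few lines; this sketch just recomputes the numbers from the walkthrough:

```python
# sales as a function of price: sales = 480 - 0.6 * price
# revenue = sales * price = 480 * price - 0.6 * price**2
# derivative of revenue: 480 - 1.2 * price; set it to zero and solve
best_price = 480 / 1.2             # 400.0
sales = 480 - 0.6 * best_price     # 240 new subscriptions per week
revenue = sales * best_price       # 96,000

current_revenue = 180 * 500        # 90,000 at the original price
print(best_price, sales, revenue)
print(revenue / current_revenue)   # about 1.07, a 7% increase
```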
Our next topic in "Mathematics and Data Science Principles",
is something called Big O. And if you are
wondering what Big O is all about, it is about
time. Or, you can think of it as how long
does it take to do a particular operation.
It's the speed of the operation. If you want
to be really precise, the growth rate of a
function; how much more it requires as you
add elements is called its Order. That's why
it's called Big O, that's for Order. And Big
O gives the rate of how things grow as the
number of elements grows, and what's funny
is there can be really surprising differences.
Let me show you how it works with a few different
kinds of growth rates or Big O. First off,
there are the ones that I say are sort of on the spot; you can get stuff done right away.
The simplest one is O(1), and that is a constant
order. That's something that takes the same
amount of time, no matter what. You can send
an email out to 10,000 people just hit one
button; it's done. The number of elements,
the number of people, the number of operations,
it just takes the same amount of time. Up
from that is Logarithmic, where you take the
number of operations, you get the logarithm
of that and you can see it's increased, but
really it's only a small increase, it tapers
off really quickly. So an example is finding an item in a sorted array. Not a big deal.
Next, one up from that, now this looks like
a big change, but in the grand scheme, it's
not a big change. This is a linear function,
where each operation takes the same unit of
time. So if you have 50 operations, you have
50 units of time. If you're storing 50 objects
it takes 50 units of space. So, finding an item
in an unsorted list is usually going to
be linear time. Then we have the functions
where I say you know, you'd better just pack
a lunch because it's going to take a while.
The best example of this is called Log Linear.
You take the number of items and you multiply
that number times the log of the items. An
example of this is the fast Fourier transform,
which is used, for instance, for dealing with
sound or anything else that unfolds over time.
You can see it takes a lot longer, if you
have 30 elements your way up there at the
top of this particular chart at 100 units
of time, or 100 units of space or whatever
you want to put it. And it looks like a lot.
But really, that's nothing compared to the
next set, where I say, you know, you're just
going to be camping out; you may as well go
home. That includes something like the Quadratic.
You square the number of elements, you see
how that kind of just shoots straight up.
That's Quadratic growth. An example is multiplying
two n-digit numbers by hand: if you're multiplying
two numbers that each have 10 digits, it's
going to take on the order of 10 squared steps, so it's going to
take a long time. Even more extreme is this
one, this is the exponential: two raised to
the power of the number of items you have.
You'll see, by the way, the red line does
not even go all the way to the top. That's
because the graphing software that I'm using,
doesn't draw it when it goes above my upper
limit there, so it kind of cuts it off. But
this is a really demanding kind of thing,
it's for instance finding an exact solution
for what's called the Travelling Salesman
Problem, using dynamic programming. That's
an example of exponential rate of growth.
And then one more I want to mention which
is sort of catastrophic is Factorial. You
take the factorial of the number of elements,
the one written with an exclamation point, and
you see that one cuts off very soon because
it basically goes straight up. You have any
number of elements of any size, it's going
to be hugely demanding. And for instance if
you're familiar with the Travelling Salesman
Problem, that's trying to find the solution
through a brute-force search, and it takes a
huge amount of time. And you know, before something
like that is done, you're probably going to
turn to stone and wish you'd never even started.
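To get a feel for how different these growth rates really are, here's a small Python sketch that simply evaluates each Order formula at n = 20. These are illustrative unit counts, not timings of real algorithms:

```python
# Evaluate each growth-rate formula at n = 20 to compare orders of growth.
# Illustrative unit counts only, not timings of real algorithms.
import math

def growth(n):
    return {
        "O(1)":       1,
        "O(log n)":   math.log2(n),
        "O(n)":       n,
        "O(n log n)": n * math.log2(n),
        "O(n^2)":     n ** 2,
        "O(2^n)":     2 ** n,
        "O(n!)":      math.factorial(n),
    }

for name, units in growth(20).items():
    print(f"{name:11} {units:,.0f}")   # the factorial dwarfs everything else
```

Even at just 20 elements, the quadratic costs 400 units, the exponential over a million, and the factorial is already an 18-digit number.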
The other thing to know about this, is that
not only do some things take longer than others,
some of these methods and some functions are
more variable than others. So for instance,
if you're working with data that you want
to sort, there are different kinds of sort
or sorting methods. So for instance, there
is something called an insertion sort. And
on its best day, it's linear.
It's O of n, that's not bad. On the other
hand the average is Quadratic and that's a
huge difference between the two. Selection
sorts on the other hand, the best is quadratic
and the average is quadratic. It's always
consistent, so it's kind of funny, it takes
a long time, but at least you know how long
it's going to take versus the variability
of something like an insertion sort. So in
sum, let me say a few things about Big O.
#1, You need to know that certain functions
or procedures vary in speed, and the same
thing applies to making demands on a computer's
memory or storage space or whatever. They
vary in their demands. Also, some are inconsistent.
Some are really efficient sometimes and really
slow or difficult at others. Probably the
most important thing here is to be aware of
the demands of what you are doing. That you
can't, for instance, run through every single
possible solution or you know, your company
will be dead before you get an answer. So
be mindful of that so you can use your time
well and get the insight you need, in the
time that you need it. A really important
element of "Mathematics and Data Science",
and one of its foundational principles, is
Probability. Now, one of the places where Probability
comes in intuitively for a lot of people is
something like rolling dice or looking at
sports outcomes. And really, the fundamental
question of what are the odds of something.
That gets at the heart of Probability. Now
let's take a look at some of the basic principles.
We've got our friend, Albert Einstein here
to explain things. The Principles of Probability
work this way. Probabilities range from zero
to 1, that's like zero percent to one hundred
percent chance. When you put P, and then in parentheses
here A, that means the Probability of whatever
is in parentheses. So P(A) means the Probability
of A. And then P(B) is the Probability of
B. When you take all of the probabilities
together, you get what is called the probability
Space. And that's why we have S and that all
adds up to 1, because you've now covered 100
% of the possibilities. Also, you can talk
about the complement. The tilde here is used
to say the probability of not A is equal to
1 minus the probability of A, because those
have to add up. So, let's also take a look at something
called conditional probabilities, which
are really important in statistics. A conditional
probability is the probability of one thing
if something else is true. You write it this
way: the probability of, and that vertical
line is called a Pipe and it's read as assuming
that or given that. So you can read this as
the probability of A given B, is the probability
of A occurring if B is true. So you can ask,
for instance, if something's
orange, what's the probability that it's a
carrot, given this picture. Now, the place where
this becomes really important for a lot of
people is the probability of type one and
type two errors in hypothesis testing, which
we'll mention at some other point. But I do
want to say something about arithmetic with
probabilities because it does not always work
out the way people think it will. Let's start
by talking about adding probabilities. Let's
say you have two events A and B, and let's
say you want to find the probabilities of
either one of those events. So that's like
adding the probabilities of the two events.
Well, it's kind of easy. You take the probability
of event A and you add the probability of
event B, however you may have to subtract
something; you may have to subtract this little
piece because maybe there is some overlap
between the two of them. On the other hand,
if A and B are disjoint, meaning they never
occur together, then that overlap is equal to zero.
And then you just subtract zero, which means
you get back to the sum of the original probabilities.
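The addition rule can be sketched with Python sets standing in for events; the numbers here mirror the ten-shape example that follows (five squares, two red shapes, one overlapping red square):

```python
# The addition rule with Python sets as events. Ten equally likely
# outcomes: 0-4 are the squares, outcomes 4 and 9 are the red shapes,
# so outcome 4 is the one red square (the overlap).
space = set(range(10))
squares = {0, 1, 2, 3, 4}
red = {4, 9}

def p(event):
    return len(event) / len(space)

# P(S or R) = P(S) + P(R) - P(S and R): subtract the overlap so the
# red square isn't counted twice.
p_either = p(squares) + p(red) - p(squares & red)
print(round(p_either, 10))   # 0.6
```

Computing `p(squares | red)` directly, the probability of the union, gives the same 0.6, which is the point of the rule.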
Let's take a really easy example of this.
I've created my super simple sample space
I have 10 shapes. I have 5 squares on top,
5 circles on the bottom and I've got a couple
of red shapes on the right side. Let's say
we want to find the probability of a square
or a red shape. So we are adding the probabilities
but we have to adjust for the overlap between
the two. Well here's our squares on top. 5
out of the 10 are squares and over here on
the right we have two red shapes, two out
of 10. Let's go back to our formula here and
let's change it a little bit. Change the A and
the B to S and R for square and red. Now we
can start this way, let's get the probability
that something is a square. Well, we go back
to our probability space and you see we have
5 squares out of 10 shapes total. So we do
5 over 10, that reduces to .5. Okay, next
up the probability of something red in our
sample space. Well, we have 10 shapes total,
two of them on the far right are red. That's
two over 10, and you do the division to get .2.
Now, the trick is the overlap between these
two categories, do we have anything that is
both square and red, because we don't want
to count that twice we have to subtract it.
Let's go back to our sample space and we are
looking for something that is square, there's
the squares on top and there's the things
that are red on the side. And you see they
overlap and this is our little overlapping
square. So there's one shape that meets both
of those, one out of 10. So we come back here,
one out of 10, that reduces to .1 and then
we just do the addition and subtraction here.
.5 plus .2 minus .1, gets us .6. And so what
that means is, there is a 60% chance of an
object being square or red. And you can look
at it right here. We have 6 shapes outlined
now and so that's the visual interpretation
that lines up with the mathematical one we
just did. Now let's talk about multiplication
for Probabilities. Now the idea here is you
want to get joint probabilities, so the probability
of two things occurring together, simultaneously.
And what you need to do here, is you need
to multiply the probabilities. And we can
say the probability of A and B, because we
are asking about A and B occurring together,
a joint occurrence. And that's equal to the
probability of A times the probability of
B, that's easy. But you do have to expand
it just a little bit because you can have
the problem of things overlapping a little
bit, and so you actually need to expand it
to a conditional probability, the probability
of B given A. Again, that's that vertical
pipe there. On the other hand, if A and B
are independent, meaning
B is no more likely to occur if A happens,
then the conditional just reduces to the probability of
B, and you get your slightly simpler equation.
But let's go and take a look at our sample
space here. So we've got our 10 shapes, 5
of each kind, and then two that are red. And
originally we looked at the probability
of something being square or red; now we are
going to look at the probability of it being
square and red. Now, I know we can eyeball
this one real easy, but let's run through
the math. The first thing we need to do, is
get the ones that are square. There's those
5 on the top and the ones that are red, and
there's those two on the right. In terms of
the ones that are both square and red, yes
obviously there's just this one red square
at the top right. But let's do the numbers
here. We change our formula to be S and R
for square and red, we get the probability
of square. Again that's those 5 out of 10,
so we do 5/10, reduce this to .5. And then
we need the probability of red given that
it's a square. So, we only need to look at
the squares here. There's the squares, 5 of
them, and one of them is red. So that's 1
over 5. That reduces to .2. You multiply
those two numbers; .5 times .2, and what you
get is .10 or 10% chance or 10 percent of
our total sample space is red squares. And
you come back and you look at it and you say
yeah there's one out of 10. So, that just
confirms what we are able to do intuitively.
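The same ten-shape example can be checked in code; here's a minimal sketch of the multiplication rule with a small conditional-probability helper:

```python
# The multiplication rule on the same ten-shape sample space:
# five squares (outcomes 0-4), two red shapes (4 and 9).
space = set(range(10))
squares = {0, 1, 2, 3, 4}
red = {4, 9}

def p(event, given=None):
    if given is None:
        return len(event) / len(space)
    return len(event & given) / len(given)   # conditional: P(event | given)

# P(S and R) = P(S) * P(R | S): half the shapes are squares, and one
# of the five squares is red.
joint = p(squares) * p(red, given=squares)
print(round(joint, 10))   # 0.1
```

And `p(squares & red)`, counting the one red square out of ten directly, gives the same 0.1.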
So, that's our short presentation on probabilities
and in sum what did we get out of that? #1,
Probability is not always intuitive. And also
the idea that conditional values can help
in a lot of situations, but they may not work
the way you expect them to. And really the
arithmetic of Probability can surprise people
so pay attention when you are working with
it so you can get a more accurate conclusion
in your own calculations. Let's finish our
discussion of "Mathematics and Data Science"
and the basic principles by looking at something
called Bayes' theorem. And if you're familiar
with regular probability and inferential testing,
you can think of Bayes' theorem as the flip
side of the coin. You can also think of it
in terms of intersections. So for instance,
standard inferential tests and calculations
give you the probability of the data; that's
our d, given the hypothesis. So, if you assume
a known hypothesis is true, this will give
you the probability of the data arising by
chance. The trick is, most people actually
want the opposite of that. They want the probability
of the hypothesis given the data. And unfortunately,
those two things can be very different in
many circumstances. On the other hand, there's
a way of dealing with it, Bayes does it and
this is our guy right here. Reverend Thomas
Bayes, 18th Century English minister and statistician.
He developed a method for getting what are
called posterior probabilities, which use
prior probabilities and test information,
such as base rates, how common something is
overall, to get the posterior, or after-the-fact,
probability. Here's the general recipe
for how this works: you start with the probability
of the data given the hypothesis which is
what you get from the likelihood of the data.
You also get that from a standard inferential
test. To that, you need to add the probability
of the hypothesis, or the cause, being true.
That's called the prior probability.
To that you add P(D), the probability of
the data; that's called the marginal probability.
And then you combine those in a special
way to get the probability of the hypothesis
given the data or the posterior probability.
Now, if you want to write it as an equation,
you can write it in words like this; posterior
is equal to likelihood times prior divided
by marginal. You can also write it in symbols
like this; the probability of H given D, the
probability of the hypothesis given the data,
that's the posterior probability. Is equal
to the probability of the data given the hypothesis,
that's the likelihood, multiplied by the probability
of the hypothesis, and divided by the probability
of the data overall. But this is a lot easier
if we look at a visual version of it. So,
let's go to this example here. Let's say we have
a square here that represents 100% of all
people and we are looking at a medical condition.
And what we are going to say here is that
we got this group up here that represents
people who have a disease, so that's a portion
of all people. And what we say is, we
have a test, and of people with the disease, 90%
of them will test positive, so they're marked
in red. Now it does mean, over here on the
far left, there are people with the disease who test
negative; that's 10%. Those are our false negatives.
And so if the test catches 90% of the people
who have the disease, that's good right? Well,
let's look at it this way. Let me ask you
a basic question. "If a person tests positive
for a disease, then what is the probability
they really have the disease?" And if you
want a hint, I'm going to give you one. It's
not 90%. Here's how it goes. So this is the
information I gave you before and we've got
90% of the people who have the disease; that's
a conditional probability, they test positive.
But what about the other people, the people
in the big white area below, ‘of all people'.
We need to look at them and ask if any of them
ever test positive; do we ever get false positives?
And with any test, you are going to get false
positives. And so let's say of our people without
the disease, 90% of them test negative, the
way they should. But of the people who don't
have the disease, 10% of them test positive,
those are false positives. And so if you really
want to answer the question, "If you test
positive do you have the disease?", here's
what you need. What you need is the number
of people with the disease who test positive
divided by all people who test positive. Let's
look at it this way. So here's our information.
We've got 29.7% of all people are in this
darker red box, those are the people who have
the disease and test positive, alright that's
good. Then we have 6.7% of the entire group,
that's the people without the disease who
test positive. So what we want to do is take the
percentage who have the disease and test positive, and then
divide that by all the people who test positive.
And that bottom part is made up of two things.
That's made up of the people who have the
disease and test positive, and the people
who don't have the disease and test positive.
Now we can take our numbers and start plugging
them in. Those who have the disease and test
positive that's 29.7% of the total population
of everybody. We can also put that number
right here. That's fine, but we also need
to look at the percentage that do not have
the disease and test positive; of the total
population, that's 6.7%. So, we just need
to rearrange, we add those two numbers on
the bottom, we get 36.4% and we do a little
bit of division. And the number we get is
81.6%, here's what that means. A positive
test result still only means a probability
of 81.6% of having the disease. So, the test
is advertised at having 90% accuracy, well
if you test positive there's really only an
82% chance you have the disease. Now that's
not really a big difference. But consider
this: what if the numbers change? For instance,
what if the probability of the disease changes?
Here's what we originally had. Let's move
it around a little bit. Let's make the disease
much less common. And so now,
4.5% of all people are people
who have the disease and test positive. And
then because there is a larger number of people
who don't have the disease, we are going to
have a relatively larger proportion of false
positives. Again, compared to the entire population
it's going to be 9.5% of everybody. So we
are going to go back to our formula here in
words and start plugging in the numbers. We
get 4.5% right there, and right there. And
then we add in our other number, the false
positives that's 9.5%. Well, we rearrange
and we start adding things up, that's 14%
and when we divide that, we get 32.1%. Here's
what that number means. That means a positive
test result; you get a positive test result,
now means you only have a probability of 32.1%
of having the disease. That's far less than
the advertised accuracy of 90%, and in case you can't
tell, that's a really big difference. And
that's why Bayes theorem matters, because
it answers the questions that people want
and the answer can be dramatically different
depending on the base rate of the thing you
are talking about. And so in sum, we can say
this. Bayes' theorem allows you to answer the
right question. People really want to know,
"what's the probability that I have the disease?",
not "what's the probability of getting a positive
test if I have the disease?". They want to know whether
they have the disease. And to do this, you
need to have prior probabilities, you need
to know how common the disease is, you need
to know how many people get positive test
results overall. But, if you can get that
information and run it through, it can change
your answers and really the emotional significance
of what you're dealing with dramatically.
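Here's the disease-testing calculation as a small Python function; the two inputs are the true-positive and false-positive shares of the whole population, using the numbers from the example above:

```python
# Bayes' theorem with the numbers from the example: the inputs are the
# share of the whole population that truly has the disease and tests
# positive, and the share that doesn't have it but tests positive anyway.
def posterior(true_positives, false_positives):
    # P(disease | positive) = true positives / all positives
    return true_positives / (true_positives + false_positives)

print(round(posterior(29.7, 6.7), 3))   # 0.816 -> 81.6% for the more common disease
print(round(posterior(4.5, 9.5), 3))    # 0.321 -> 32.1% once the disease is rarer
```

Same 90%-accurate test both times; only the base rate of the disease changed, and the answer to "do I have it?" moved from 81.6% down to 32.1%.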
Let's wrap up some of our discussion of "Mathematics
and Data Science" and the data principles
and talk about some of the next steps. Things
you can do afterwards. Probably the most important
thing is, you may have learned about math
a long time ago, but now is a good time to
dig out some of those books and go over some
of the principles you've used before. The
idea here is that a little math can go a long
way in data science. So, things like Algebra
and things like Calculus and things like Big
O and Probability. All of those are important
in data science, and it's helpful to have at
least a working understanding of each. You
don't have to know everything, but you do
need to understand the principles of the
procedures that you select when you do your
projects. There are two reasons for that very
generally speaking. First, you need to know
if a procedure will actually answer your question.
Does it give you the outcome that you need?
Will it give you the insight that you need?
Second; really critical, you need to know
what to do when things go wrong. Things don't
always work out, numbers don't always add
up, you get impossible results or things just
aren't responding. You need to know enough
about the procedure and enough about the mathematics
behind it, so you can diagnose the problem,
and respond appropriately. And to repeat myself
once again, no matter what you're working
on in data science, no matter what tool you're
using, what procedure you're doing, focus
on your goal. And in case you can't remember
that, your goal is meaning. Your goal is always
meaning. Welcome to "Statistics in Data Science".
I'm Barton Poulson and what we are going to
be doing in this course is talking about some
of the ways you can use statistics to see
the unseen. To infer what's there, even when
most of it's hidden. Now this shouldn't be
surprising. If you remember the data science
Venn Diagram we talked about a while ago,
we have math up here at the top right corner,
but if you were to go to the original description
of this Venn Diagram, its full name was math
and stats. And let me just mention something
in case it's not completely obvious about
why statistics matters to data science. And
the idea is this; counting is easy. It's easy
to say how many times a word appears in a
document, it's easy to say how many people
voted for a particular candidate in one part
of the country. Counting is easy, but summarizing
and generalizing those things is hard. And part
of the problem is there's no such thing as
a definitive analysis. All analyses really,
depend on the purposes that you're dealing
with. So as an example, let me give you a
couple of pairs of words and try to summarize
the difference between them in just two or
three words. In a word or two, how is a souffle
different from a quiche, or how is an Aspen
different from a Pine tree? Or how is Baseball
different from Cricket? And how are musicals
different from opera? It really depends on
who you are talking to, it depends on your
goals and it depends on the shared knowledge.
And so, there's not a single definitive answer,
and then there's the matter of generalization.
Think about it again, take music. Listen to
three concerti by Antonio Vivaldi, and do
you think you can safely and accurately describe
all of his music? Now, I actually chose Vivaldi
on purpose, because Igor Stravinsky said
you could; he said Vivaldi didn't write 500 concertos,
he wrote the same concerto 500 times. But,
take something more real world like politics.
If you talk to 400 registered voters in the
US, can you then accurately predict the behavior
of all of the voters? There's about 100 million
voters in the US, and that's a matter of generalization.
That's the sort of thing we try to take care
of with inferential statistics. Now there
are different methods that you can use in
statistics, and all of them are designed to
give you a map, a description of the data
you're working with. There are descriptive statistics,
there are inferential statistics, there's
the inferential procedure Hypothesis testing
and there's also estimation and I'll talk
about each of those in more depth. There are
a lot of choices that have to be made and
some of the things I'm going to discuss in
detail are for instance the choice of Estimators,
that's different from estimation. Different
measures of fit. Feature selection, for knowing
which variables are the most important in
predicting your outcome. Also common problems
that arise when trying to model data and the
principles of model validation. But through
this all, the most important thing to remember
is that analysis is functional. It's designed
to serve a particular purpose. And there's
a very wonderful quote within the statistics
world that says "all models are wrong", meaning all
statistical descriptions of reality are wrong
because they are not exact depictions, they
are summaries, "but some are useful", and that's
from George Box. And so the point is, you're
not trying to be totally, completely accurate,
because in that case you just wouldn't do
an analysis. The real question is, are you
better off doing your analysis than not
doing it? And truthfully, I bet you are. So
in sum, we can say three things: #1, you want
to use statistics to both summarize your data
and to generalize from one group to another
if you can. On the other hand, there is no
"one true answer" with data; you've got to be
flexible in terms of what your goals are and
the shared knowledge. And no matter what you're
doing, the utility of your analysis should
guide you in your decisions. The first thing
we want to cover in "Statistics in Data Science"
is the principles of exploring data and this
video is just designed to give an exploration
overview. So we like to think of it like this,
the intrepid explorers, they're out there
exploring and seeing what's in the world.
Likewise, you can see what's in your data; more specifically,
you want to see what your dataset is like.
You want to see if your assumptions are right
so you can do a valid analysis with your procedure.
Something that may sound very weird: you
want to listen to your data. If something's not
working out, if it's not going the way you want,
then you're going to have to pay attention,
and exploratory data analysis is going to
help you do that. Now, there are two general
approaches to this. First off, there's graphical
exploration, where you use graphs and pictures
and visualizations to explore your data. The
reason you want to do this is that graphics
are very dense in information. They're also
really good, in fact the best way, to get the overall
impression of your data. Second to that, there
is numerical exploration. Let me make it very clear:
this is the second step. Do the visualization
first, then do the numerical part. Now you
want to do this, because this can give greater
precision, this is also an opportunity to
try variations on the data. You can actually
do some transformations, move things around
a little bit and try different methods and
see how that affects the results, see how
it looks. So, let's go first to the graphical
part. There are very quick and simple plots
that you can do. Those include things like
bar charts, histograms and scatterplots; they're very
easy to make and a very quick way of getting
to understand the variables in your dataset.
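As a quick sketch of those three plot types, assuming matplotlib is installed, and using made-up example data (none of these numbers come from the course datasets):

```python
# Quick versions of the three basic exploratory plots, assuming
# matplotlib is installed. All data here is made up for illustration.
import matplotlib
matplotlib.use("Agg")              # render off-screen; drop this line for interactive use
import matplotlib.pyplot as plt

categories = ["friendly", "temperamental", "relaxed"]          # hypothetical labels
counts = [26, 14, 10]
scores = [4.1, 3.8, 5.0, 4.4, 3.9, 4.7, 4.2, 6.9, 4.0, 4.3]   # one outlier at 6.9
x = list(range(10))
y = [2.1, 3.9, 6.2, 7.8, 10.1, 12.2, 13.8, 16.1, 18.0, 19.9]  # roughly linear in x

fig, axes = plt.subplots(1, 3, figsize=(12, 3))
axes[0].bar(categories, counts)    # bar chart: counts per category
axes[1].hist(scores, bins=5)       # histogram: shape of a measured variable
axes[2].scatter(x, y)              # scatterplot: association between two variables
fig.tight_layout()
fig.savefig("explore.png")         # or plt.show() in an interactive session
```

A few lines each, and together they answer the first questions you have about a dataset: what are the counts, what's the shape, and is there an association.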
In terms of numerical analysis; again after
the graphical method, you can do things like
transform the data, that is take like the
logarithm of your numbers. You can do Empirical
estimates of population numbers, and you can
use robust methods. And I'll talk about all
of those at length in later videos. But for
right now, I can sum it up this way. The purpose
of exploration is to help you get to know
your data. And also you want to explore your
data thoroughly before you start modelling,
before you build statistical models. And all
the way through you want to make sure you
listen carefully so that you can find hidden
or unassumed details and leads in your data.
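The numerical side mentioned above, transforms and summary numbers, can be sketched with the Python standard library alone; the data here is made up, with one deliberate outlier:

```python
# Numerical exploration with the standard library only: summary stats,
# a log transform, and a crude text histogram. Made-up data, one outlier.
import math
import statistics
from collections import Counter

data = [12, 15, 14, 13, 16, 18, 95, 14, 15, 13]

print(statistics.mean(data))      # 22.5 -- the outlier at 95 pulls the mean up
print(statistics.median(data))    # 14.5 -- the median barely notices it

logged = [math.log10(x) for x in data]     # transform to tame the skew
hist = Counter(round(v) for v in logged)   # bucket the logged values

for bucket, n in sorted(hist.items()):
    print(f"10^{bucket} {'#' * n}")        # crude text histogram
```

The gap between mean and median is itself a lead worth listening to: it tells you something in the data doesn't fit the standard assumptions.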
As we move on in our discussion of "Statistics
and Exploring Data", the single most important
thing we can do is Exploratory Graphics. In
the words of the late great Yankees catcher
Yogi Berra, "You can see a lot by just looking".
And that applies to data as much as it applies
to baseball. Now, there's a few reasons you
want to start with graphics. #1, is to actually
get a feel for the data. I mean, what's it
distributed like, what's the shape, are there
strange things going on. Also it allows you
to check the assumptions and see how well
your data match the requirements of the analytical
procedures you hope to use. You can check
for anomalies like outliers and unusual distributions
and errors and also you can get suggestions.
If something unusual is happening in the data,
that might be a clue that you need to pursue
a different angle or do a deeper analysis.
Now we want to do graphics first for a couple
of reasons. #1, is they are very information
dense, and fundamentally humans are visual.
It's our single, highest bandwidth way of
getting information. It's also the best way
to check for shape and gaps and outliers.
There's a few ways that you can do this if
you want to and the first is with programs
that rely on code. So you can use the statistical
programming language R, the general purpose
language Python. You can actually do a huge
amount in JavaScript, especially D3.js. Or
you can use Apps, that are specifically designed
for exploratory analysis, that includes Tableau
both the desktop and public versions, Qlik
and even Excel is a good way to do this. And
finally you can do this by hand. John Tukey
who's the father of Exploratory Data Analysis,
wrote his seminal book with all hand-drawn
graphics, and actually it's a
wonderful way to do it. But let's start the
process for doing these graphics. We start
with one variable. That is univariate distributions.
And so you'll get something like this, the
fundamental chart is the bar chart. This is
when you are dealing with categories and you
are simply counting however many cases there
are in each category. The nice thing about
bar charts is they are really easy to read.
Put them in descending order, and maybe have
them vertical, maybe have them horizontal.
Horizontal could be nice to make the labels
a little easier to read. This is about psychological
profiles of the United States, this is real
data. We have most states in the friendly
and conventional, a smaller number in the
temperamental and uninhibited, and the least
common in the United States is relaxed and
creative. Next you can do a Box plot, or sometimes
called a box and whiskers plot. This is when
you have a quantitative variable, something
that's measured and you can say how far apart
scores are. A box plot shows quartile values,
it also shows outliers. So for instance this
is google searches for modern dance. That's
Utah at 5 standard deviations above the national
average. That's where I'm from and I'm glad
to see that there. Also, it's a nice way to
show many variables side by side, if they
are on approximately similar scales. Next, if
you have quantitative variables, you are going
to want to do a histogram. Again, quantitative
so interval or ratio level, or measured variables.
And these let you see the shape of a distribution
and potentially compare many. So, here are
three histograms of google searches on Data
Science, and Entrepreneur and Modern Dance.
And you can see, they're for the most part normally
distributed with a couple of outliers. Once
you've done one variable, or the univariate
analyses, you're going to want to do two variables
at a time. That is bivariate distributions
or joint distributions. Now, one easy way
to do this is with grouped plots. You can
do grouped bar charts and box plots. What
I have here is grouped box plots. I have my
three regions, Psychological Regions of the
United States and I'm showing how they rank
on openness that's a psychological characteristic.
As you can see, the relaxed and creative are
high, and the friendly and conventional tend to
be the lowest, and that's kind of how that
works. It's also a good way of seeing the
association between a categorical variable
like region of the United States psychologically,
and a quantitative outcome, which is what
we have here with openness. Next, you can
also do a Scatterplot. That's where you have
quantitative variables and what you're looking
for here is, is it a straight line? Is it
linear? Do we have outliers? And also the
strength of association. How closely do the
dots all come to the regression line that
we have here in the middle. And this is an
interesting one for me because we have openness
across the bottom, so more open as you go
to the right, and agreeableness on the vertical axis. And what you
can see is there is a strong downhill association.
The states that are the most
open are also the least agreeable, so we're
going to have to do something about that.
And then finally, you're going to want to
go to many variables, that is multivariate
distributions. Now, one big question here
is 3D or not 3D? Let me make an argument for
not 3D. So, what I have here is a 3D Scatterplot
about 3 variables from Google searches. Up
the left, I have FIFA which is for professional
soccer. Down there on the bottom left, I have
searches for the NFL and on the right I have
searches for NBA. Now, I did this in R and
what's neat about this is you can click and
drag and move it around. And you know that's
kind of fun, you kind of spin around and it
gets kind of nauseating as you look at it.
And this particular version, I'm using plotly
in R, allows you to actually click on a point
and see, let me see if I can get the floor
in the right place. You can click on a point
and see where it ranks on each of these characteristics.
You can see however, this thing is hard to
control and once it stops moving, it's not
much fun and truthfully most 3D plots I've
worked with are just kind of nightmares. They
seem like they're a good idea, but not really.
So, here's the deal. 3D graphics, like the
one I just showed you, because they are actually
being shown in 2D, they have to be in motion
for you to tell what is going on at all. And
fundamentally they are hard to read and confusing.
Now it's true, they might be useful for finding
clusters in 3 dimensions, we didn't see that
in the data we had, but generally I just avoid
them like the plague. If what you do want, however, is to see the connections between the variables, you might want to use a matrix of plots. This is where you have, for instance,
many quantitative variables, you can use markers
for group membership if you want, and I find
it to be much clearer than 3D. So here, I
have the relationship between 4 search terms:
NBA, NFL, MLB for Major League Baseball and
FIFA. You can see the individual distributions,
you can see the scatterplots, you can get
the correlation. Truthfully for me this is
a much easier chart to read and you can get
the richness that we need, from a multidimensional
display. So the questions you're trying to
answer overall are: Number 1, Do you have
what you need? Do you have the variables that
you need, do you have the ability that you
need? Are there clumps or gaps in the distributions?
Are there exceptional cases/anomalies that
are really far out from everybody else, spikes
in the scores? And of course are there errors
in the data? Are there mistakes in coding,
did people forget to answer questions? Are
there impossible combinations? And these kinds
of things are easiest to see with a visualization
that really kind of puts it there in front
of you. And so in sum, I can say this about
graphical exploration of data. It's a critical
first step, it's basically where you always
want to start. And you want to use the quick and easy methods: again, bar charts and scatterplots are really easy to make and they're
very easy to understand. And once you're done
with the graphical exploration, then you can
go to the second step, which is exploring
the data through numbers. The next step in
"Statistics and Exploring Data" is exploratory
statistics or numerical exploration of data.
I like to think of this as: go in order. First,
you do visualization, then you do the numerical
part. And a couple of things to remember here.
#1, you are still exploring the data. You're
not modeling yet, but you are doing a quantitative
exploration. This might be an opportunity
to get empirical estimates, that is of population
parameters as opposed to theoretically based
ones. It's a good time to manipulate the data
and explore the effect of manipulating the
data, looking at subgroups, looking at transforming
variables. Also, it's an opportunity to check
the sensitivity of your results. Do you get
the same general results if you test under different circumstances? So we are going to
talk about things like Robust Statistics,
resampling data and transforming data. So,
we'll start with Robust Statistics. This by
the way is Hercules, a Robust mythical character.
And the idea with robust statistics is that
they are stable: even when the data varies in unpredictable ways, you still get the same general impression. This is a class
of statistics, an entire category, that's less affected by outliers, skewness, kurtosis,
and other abnormalities in the data. So let's
take a quick look. This is a very skewed distribution
that I created. The median, which is the dark
line in the box, is right around one. And
I am going to look at two different kinds
of robust statistics, The Trimmed Mean and
the Winsorized Mean. With the Trimmed mean,
you take a certain percentage of data from
the top and the bottom and you just throw
it away and compute for the rest. With the
Winsorized, you take those and you move those
scores into the highest non-outlier score.
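A rough sketch of those two robust means in Python (hypothetical data with one big outlier; the proportion says how much to trim, or pull in, from each end):

```python
# Trimmed vs. winsorized means, from scratch. `prop` is the
# proportion of scores removed (or clamped) at EACH end.
def trimmed_mean(data, prop):
    xs = sorted(data)
    k = int(len(xs) * prop)              # scores to drop per end
    kept = xs[k:len(xs) - k] if k else xs
    return sum(kept) / len(kept)

def winsorized_mean(data, prop):
    xs = sorted(data)
    k = int(len(xs) * prop)
    if k:
        lo, hi = xs[k], xs[-k - 1]       # highest/lowest non-trimmed scores
        xs = [min(max(x, lo), hi) for x in xs]   # move extremes inward
    return sum(xs) / len(xs)

data = [1, 2, 2, 3, 3, 3, 9, 100]        # hypothetical, strongly skewed
print(trimmed_mean(data, 0.0))           # 15.375, the ordinary mean
print(trimmed_mean(data, 0.125))         # drops 1 score from each end
print(winsorized_mean(data, 0.125))      # clamps 1 score at each end
```

Notice how both robust versions land near the bulk of the scores instead of being dragged toward 100 the way the ordinary mean is.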
Now the 0% is exactly the same as the regular
mean and here it's 1.24, but as we trim off
or move in 5%, the mean shifts a little bit.
Then at 10% it comes in a little bit more. At 25%, we are now throwing away 50% of our data: 25% on the top and 25% on the bottom. And
we get a trimmed mean of 1.03 and a winsorized
of 1.07. When we throw away or trim 50%, that actually means we are leaving just the median, only the middle scores. Then we get 1.01. What's interesting is how close we stay to that, even when we have trimmed away 50% of the data, and so that's an interesting example
of how you can use robust statistics to explore
data, even when you have things like strong
skewness. Next is the principle of resampling.
And that's like pulling marbles repeatedly
from the jar, counting the colors, putting
them back in and trying again. That's an empirical
estimate of sampling variability. So, sometimes
you get 20% red marbles, sometimes you get
30, sometimes you get 22 and so on. There
are several versions of this; they go by names like the jackknife, the bootstrap, and the permutation test.
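A rough sketch of the bootstrap in Python (hypothetical data): resample with replacement many times, recompute the mean each time, and look at how much those resampled means vary:

```python
# A minimal bootstrap of the sample mean: resample WITH replacement,
# recompute the statistic each time, and examine its variability.
import random
import statistics

random.seed(1)                            # seeded so the sketch is reproducible
data = [2, 4, 4, 5, 7, 9, 12, 15]         # hypothetical sample

boot_means = []
for _ in range(1000):
    resample = random.choices(data, k=len(data))   # draw with replacement
    boot_means.append(statistics.mean(resample))

boot_means.sort()
lo, hi = boot_means[25], boot_means[975]  # middle 95% of the resampled means
print(lo, hi)
```

That middle 95% of resampled means is an empirical picture of sampling variability, the same idea as the marbles coming out of the jar in different proportions each time.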
And the basic principle of resampling is also
key to the process of cross-validation, I'll
have more to say about validation later. And
then finally there's transforming variables.
Here's our caterpillars in the process of
transforming into butterflies. But the idea here is that you take a difficult dataset and apply what's called a smooth function: one with no jumps in it, something that allows you to preserve the order and work on the full dataset. So you can fix skewed
data, and in a scatter plot you might have
a curved line, you can fix that. And probably the best way to look at this is with something called Tukey's ladder of powers.
I mentioned before John Tukey, the father
of exploratory data analysis. He talked a
lot about data transformations. This is his
ladder, starting at the bottom with -1/x², up to the top with x³. Here's how
it works, this distribution over here is a
symmetrical normally distributed variable,
and as you start to move in one direction
and you apply the transformation, take the
square root you see how it moves the distribution
over to one end. Then the logarithm, and then at the end of the ladder you get to minus 1 over the square of the score. And that pushes
it way way, way over. If you go the other
direction, for instance you square the score,
it pushes it down in the one direction and
then you cube it, and you see how you can move the distribution around in ways that allow you to undo the skewness and get back to a more symmetric distribution.
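A rough sketch of the ladder in Python, using the gap between the mean and the median (relative to the standard deviation) as a crude gauge of skewness on some hypothetical right-skewed data:

```python
# Moving DOWN Tukey's ladder (square root, then log) pulls a long
# right tail in, shrinking the mean-median gap.
import math
import statistics

def rel_skew(xs):
    # crude skewness gauge: (mean - median) / standard deviation
    return (statistics.mean(xs) - statistics.median(xs)) / statistics.pstdev(xs)

skewed = [1, 1, 2, 3, 5, 8, 13, 21, 34]      # hypothetical long right tail
rooted = [math.sqrt(x) for x in skewed]      # one rung down the ladder
logged = [math.log(x) for x in skewed]       # another rung down

print(round(rel_skew(skewed), 2))   # strongly positive
print(round(rel_skew(rooted), 2))   # smaller
print(round(rel_skew(logged), 2))   # near zero
```

Because each transformation is smooth and order-preserving, the scores keep their ranks; only the spacing between them changes.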
And so these are some of the approaches that
you can use in the numerical exploration of data. In sum, let's say this: statistical
or numerical exploration allows you to get
multiple perspectives on your data. It also
allows you to check the stability, see how
it works with outliers, and skewness and mixed
distributions and so on. And perhaps most
important it sets the stage for the statistical
modelling of your data. As a final step of
"Statistics and Exploring Data", I'm going
to talk about something that's not usually considered exploring data but is basic descriptive
statistics. I like to think of it this way.
You've got some data, and you are trying to
tell a story. More specifically, you're trying
to tell your data's story. And with descriptive
statistics, you can think of it as trying
to use a little data to stand in for a lot
of data. Using a few numbers to stand in for
a large collection of numbers. And this is
consistent with the advice we get from good
ole Henry David Thoreau, who told us Simplify,
Simplify. If you can tell your story with fewer, more carefully chosen and more informative numbers, go for it. So there are a few different
procedures for doing this. #1, you'll want
to describe the center of your distribution
of data, that is if you're going to choose
a single number, use that. #2, if you can give a second number, give something about the spread or the dispersion or the variability.
And #3, give something about the shape of
the distribution. Let me say more about each
of these in turn. First, let's talk about
center. We have the center of our rings here.
Now there are a few very common measures of
center or location or central tendency of
a distribution. There's the mode, the median
and there's the mean. Now, there are many,
many others but those are the ones that are
going to get you most of the way. Let's talk
about the mode first. Now, I'm going to create
a little dataset here on a scale from 1 to
11, and I'm going to put individual scores.
There's a one, and another one, and another
one and another one. Then we have a two, two,
then we have a score way over at 9 and another
score over at 11. So we have 8 scores, and
this is the distribution. This is actually
a histogram of the dataset. The mode is the
most commonly occurring score or the most
frequent score. Well, if you look at how tall
each of these go, we have more ones than anything
else, and so one is the mode. Because it occurs
4 times and nothing else comes close to that.
The median is a little different. The median
is looking for the score that is at the center
if you split it into two equal groups. We
have 8 scores, so we have to get one group
of 4, that's down here, and the other group
of four, this really big one because it's
way out and the median is going to be the
place on the number line that splits those
into two groups. That's going to be right
here at one and a half. Now the mean is going
to be a little more complicated, even though
people understand means in general. It's the
first one here that actually has a formula,
where M for the mean is equal to the sum of
X (that's our scores on the variable), divided
by N (the number of scores). You can also
write it out with Greek notation if you want,
like this where that's sigma - a capital sigma
is the summation sign, sum of X divided by
N. And with our little dataset, that works
out to this: one plus one plus one plus one
plus two plus two plus nine plus eleven. Add
those all up and divide by 8, because that's
how many scores there are. Well that reduces
to 28 divided by 8, which is equal to 3.5.
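Python's standard library computes all three measures of center directly; here's the little dataset from this example:

```python
# Mode, median, and mean for the dataset 1, 1, 1, 1, 2, 2, 9, 11.
import statistics

scores = [1, 1, 1, 1, 2, 2, 9, 11]

print(statistics.mode(scores))    # 1    (the most frequent score)
print(statistics.median(scores))  # 1.5  (splits the 8 scores into 4 and 4)
print(statistics.mean(scores))    # 3.5  (28 divided by 8)
```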
If you go back to our little chart here, 3.5
is right over here. You'll notice there aren't
any scores really exactly right there. That's
because the mean tends to get very distorted
by outliers; it follows the extreme scores.
But a really nice analogy, and I'd say it's more than just a visual analogy, is that if this number line were a seesaw, then the mean is exactly where the balance point or the fulcrum would be for the two sides to balance. People understand that.
If somebody weighs more, they have to sit in closer to balance someone who weighs less, who has to sit further out, and that's how the mean
works. Now, let me give a bit of the pros
and cons of each of these. Mode is easy to
do, you just count how common it is. On the
other hand, it may not be close to what appears
to be the center of the data. The median splits the data into two same-size groups,
the same number of scores in each and that's
pretty easy to deal with but unfortunately,
it's pretty hard to use that information in
any statistics after that. And finally the
mean. Of these three it's the least intuitive and the most affected by outliers and skewness, and that really may strike against it, but
it is the most useful statistically and so
it's the one that gets used most often. Next,
there's the issue of spread, spread your tail
feathers. And we have a few measures here
that are pretty common also. There's the range,
there are percentiles and interquartile range
and there's variance and standard deviation.
I'll talk about each of those. First the Range.
The Range is simply the maximum score minus
the minimum score, and in our case that's
11 minus 1, which is equal to 10, so we have
a range of 10. I can show you that on our
chart. It's just that line on the bottom from
the 11 down to the one. That's a range of
10. The interquartile range, which is usually referred to simply as the IQR, is the distance between Q3, the third quartile score, and Q1, the first quartile score. If you're not familiar with quartiles, that's the same as the 75th percentile score and the 25th percentile score. Really
what it is, is you're going to throw away some of the data. So let's go
to our distribution here. First thing we are
going to do, we are going to throw away the
two highest scores, there they are, they're
greyed out now, and then we are going to throw
away two of the lowest scores, they're out
there. Then we are going to get the range
for the remaining ones. Now, this is complicated
by the fact that I have this big gap between
2 and 9, and different methods of calculating
quartiles do something with that gap. So if
you use a spreadsheet it's actually going
to do an interpolation process and it will
give you a value of 3.75, I believe. And then
down to one for the first quartile. So it's not so intuitive with this graph, but that is how it usually works. If you want to write
it out, you can do it like this. The interquartile
range is equal to Q3 minus Q1, and in our
particular case that's 3.75 minus 1. And that
of course is equal to just 2.75 and there
you have it. Now our final measure of spread
or variability or dispersion, is two related
measures, the variance and the standard deviation.
These are a little harder to explain and a little
harder to show. But the variance, which is
at least the easiest formula, is this: the
variance is equal to the sum (that's the capital sigma, the summation sign) of X minus M; that's how far each score is from the mean. You take that deviation, square it, add up all the squared deviations, and then divide by the number of scores. So the variance is the average squared deviation from the mean.
I'll try to show you that graphically. So
here's our dataset and there's our mean right
there at 3 and a half. Let's go to one of
these twos. We have a deviation there of 1.5
and if we make a square, that's 1.5 points
on each side, well there it is. We can do
a similar square for the other score too.
If we are going down to one, then it's going
to be 2.5 squared and it's going to be that
much bigger, and we can draw one of these
squares for each one of our 8 points. The
squares for the scores at 9 and 11 are going
to be huge and go off the page, so I'm not
going to show them. But once you have all
those squares you add up the area and you
get the variance. So, this is the formula
for the variance, but now let me show the
standard deviation which is also a very common
measure. It's closely related to this, specifically
it's just the square root of the variance.
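These spread measures are also in Python's standard library. A sketch with the same little dataset (note that `pvariance` and `pstdev` use the population formula, dividing by N as in the formula above, and that the "inclusive" quartile method matches the spreadsheet-style interpolation mentioned earlier, giving Q1 = 1 and Q3 = 3.75):

```python
# Range, IQR, variance, and standard deviation for 1, 1, 1, 1, 2, 2, 9, 11.
import statistics

scores = [1, 1, 1, 1, 2, 2, 9, 11]

rng = max(scores) - min(scores)        # range: 11 - 1 = 10
q1, _q2, q3 = statistics.quantiles(scores, n=4, method="inclusive")
iqr = q3 - q1                          # 3.75 - 1 = 2.75
var = statistics.pvariance(scores)     # average squared deviation from the mean
sd = statistics.pstdev(scores)         # square root of the variance

print(rng, iqr, var, round(sd, 2))     # 10 2.75 14.5 3.81
```

Swapping in `statistics.variance` and `statistics.stdev` gives the sample versions, which divide by N - 1 instead.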
Now, there's a catch here. The formulas for
the variance and the standard deviation are
slightly different for populations and samples
in that they use different denominators. But
they give similar answers, not identical but
similar if the sample is reasonably large,
say over 30 or 50, then it's really going
to be just a negligible difference. So let's
do a little pro and con of these three things.
First, the Range. It's very easy to do, it
only uses two numbers the high and the low,
but it's determined entirely by those two
numbers. And if they're outliers, then you've
got really a bad situation. The Interquartile
Range the IQR, is really good for skewed data
and that's because it ignores extremes on
either end, so that's nice. And the variance
and the standard deviation while they are
the least intuitive and they are the most
affected by outliers, they are also generally
the most useful because they feed into so
many other procedures that are used in data
science. Finally, let's talk a little bit
about the shape of the distribution. You can
have symmetrical or skewed distributions, unimodal,
uniform or u-shaped. You can have outliers,
there's a lot of variations. Let me show you
a few of them. First off is a symmetrical
distribution, pretty easy. They're the same
on the left and on the right. And this little
pyramid shape is an example of a symmetrical
distribution. There are also skewed distributions,
where most of the scores are on one end and
they taper off. This here is a positively
skewed distribution where most of the scores
are at the low end and the outliers are on
the high end. This is unimodal, our same pyramid
shape. Unimodal means it has one mode, really
kind of one hump in the data. That's contrasted
for instance to bimodal where you have two
modes, and that usually happens when you have
two distributions that got mixed together.
There is also uniform distribution where every
response is equally common, there's u-shaped
distributions where people tend to pile up
at one end or the other and a big dip in the
middle. And so there's a lot of different
variations, and you want to get the shape of the distribution to help you understand it, and to put the numerical summaries like the mean and the standard deviation into context. In sum, we can say this: when
you use descriptive statistics, it allows you to be concise with your data, to tell the story and tell it succinctly. You want to
focus on things like the center of the data,
the spread of the data, the shape of the data.
And above all, watch out for anomalies, because
they can exercise really undue influence on
your interpretations but this will help you
better understand your data and prepare you
for the steps to follow. As we discuss "Statistics
in Data Science", one of the really big topics
is going to be Inference. And I'll begin that
with just a general discussion of inferential
statistics. But, I'd like to begin unusually
with a joke, you may have seen this before
it says "There are two kinds of people in
the world. 1) Those who can extrapolate from
incomplete data and, the end". Of course,
because the other group is the people who
can't. But let's talk about extrapolating
from incomplete data or inferring from incomplete
data. First thing you need to know is the
difference between populations and samples.
A population represents all of the data, or
every possible case in your group of interest.
It might be everybody who's a commercial pilot,
it might be whatever. But it represents everybody, or every case, in the group that you're interested in. And the thing with the population
is, it just is what it is. It has its values,
it has its mean and standard deviation and
you are trying to figure out what those are,
because you generally use those in doing your
analyses. On the other hand, samples instead
of being all of the data are just some of
the data. And the trick is they are sampled
with error. You sample one group and you calculate
the mean. It's not going to be the same if
you do it the second time, and it's that variability
that's in sampling that makes Inference a
little tricky. Now, also in inference there
are two very general approaches. There's testing
which is short for hypothesis testing and
maybe you've had some experience with this.
This is where you assume a null hypothesis
of no effect is true. You get your data and
you calculate the probability of getting the
sample data that you have if the null hypothesis
is true. And if that value is small, usually
less than 5%, then you reject the null hypothesis
which says really nothing's happening, and you
infer that there is a difference in the population.
The other most common version is Estimation, which, for instance, means calculating confidence intervals. That's not the only version of
Estimation but it's the most common. And this
is where you sample data to estimate a population
parameter value directly, so you use the sample
mean to try to infer what the population mean
is. You have to choose a confidence level,
you have to calculate your values and you
get high and low bounds for your estimate that
work with a certain level of confidence. Now,
what makes both of these tricky is the basic
concept of sampling error. I have a colleague
who demonstrates this with colored M&M's,
what percentage are red, and you get them
out of the bags and you count. Now, let's
talk about this, a population of numbers.
I'm going to give you just a hypothetical
population of the numbers 1 through 10. And
what I am going to do, is I am going to sample
from those numbers randomly, with replacement.
That means I pull a number out, it might be
a one and I put it back, I might get the one
again. So I'm going to sample with replacement,
which actually may sound a little bit weird,
but it's really helpful for the mathematics
behind inference. And here are the samples
that I got, I actually did this with software.
I got a 3, 1, 5, and 7. Interestingly, that
is almost all odd numbers, almost. My second
sample is 4, 4, 3, 6 and 10. So you can see
I got the 4 twice. And I didn't get the 1,
the 2, the 5, 7, or 8 or 9. The third sample
I got three 1's! And a 10 and a 9, so we are
way at the ends there. And then my fourth
sample, I got a 3, 9, 2, 6, 5. All of these
were drawn at random from the exact same population,
but you see that the samples are very different.
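A rough sketch of that demonstration in Python: sampling with replacement from the population 1 through 10, the same population every time, yet every sample comes out different:

```python
# Four random samples, drawn with replacement, from the numbers 1-10.
import random
import statistics

random.seed(42)                        # seeded so the sketch is reproducible
population = range(1, 11)

for _ in range(4):
    sample = random.choices(population, k=5)   # with replacement
    print(sample, "mean =", statistics.mean(sample))
```

Run it without the seed and you'll get a fresh set of samples every time, which is exactly the sampling variability being described.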
That's the sampling variability or the sampling
error. And that's what makes inference a little
trickier. And let's just say again why the sampling variability matters. It's
because inferential methods like testing and
like estimation try to see past the random
sampling variation to get a clear picture
on the underlying population. So in sum, let's
say this about Inferential Statistics. You
sample your data from the larger populations,
and as you try to interpret it, you have to
adjust for error and there's a few different
ways of doing that. And the most common approaches
are testing or hypothesis testing and estimation
of parameter values. The next step in our
discussion of "Statistics and Inference" is
Hypothesis Testing. A very common procedure
in some fields of research. I like to think
of it as put your money where your mouth is
and test your theory. Here's the Wright brothers
out testing their plane. Now the basic idea
behind hypothesis testing is this: you start out with a question, with
something like this: What is the probability
of X occurring by chance, if randomness or
meaningless sampling variation is the only
explanation? Well, the response is this, if
the probability of that data arising by chance
when nothing's happening is low, then you
reject randomness as a likely explanation.
Okay, there's a few things I can say about
this. #1, it's really common in scientific
research, say for instance in the social sciences,
it's used all the time. #2, this kind of approach
can be really helpful in medical diagnostics,
where you're trying to make a yes/no decision: does a person have a particular disease? And
3, really anytime you're trying to make a
go/no go decision, which might be made for
instance with a purchasing decision for a
school district or implementing a particular
law. You base it on the data and you have to make a yes/no decision. Hypothesis testing might
be helpful in those situations. Now, you have
to have hypotheses to do hypothesis testing.
You start with H0, which is shorthand for
the null hypothesis. And what that is, in lengthier terms, is that there
is no systematic effect between groups, there's
no effect between variables and random sampling
error is the only explanation for any observed
differences you see. And then contrast that
with HA, which is the alternative hypothesis.
And this really just says there is a systematic
effect, that there is in fact a correlation
between variables, that there is in fact a
difference between two groups, that this variable
does in fact predict the other one. Let's
take a look at the simplest version of this
statistically speaking. Now, what I have here
is a null distribution. This is a bell curve,
it's actually the standard normal distribution.
It shows z-scores and their relative frequency,
and what you do with this is you mark off
regions of rejection. And so I've actually
shaded off the highest 2.5% of the distribution
and the lowest 2.5%. What's funny about this is that even though I draw it out to +/- 3, where it looks like it hits 0, it's actually infinite and asymptotic.
But shading off the highest and lowest 2.5% collectively leaves 95% in the middle. Now, the idea is
then that you gather your data, you calculate
a score for your data and you see where it
falls in this distribution. And I like to
think of that as you have to go down one path or the other; you have to make a decision. And you have to decide whether to retain your null hypothesis, maybe it is random, or reject it and decide no, I don't think it's random. The trick is, things can go wrong.
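That decision rule can be sketched as a simple two-sided z-test in Python (hypothetical numbers: a sample mean of 52 from n = 100 scores, a null-hypothesis mean of 50, and a known population standard deviation of 10):

```python
# A two-sided z-test: standardize the sample mean, get a two-sided
# p-value from the standard normal curve, and compare it to alpha.
from math import sqrt
from statistics import NormalDist

def z_test(sample_mean, mu0, sigma, n, alpha=0.05):
    z = (sample_mean - mu0) / (sigma / sqrt(n))   # standardized score
    p = 2 * (1 - NormalDist().cdf(abs(z)))        # two-sided p-value
    return z, p, p < alpha                        # True means reject H0

z, p, reject = z_test(52, 50, 10, 100)
print(z, round(p, 4), reject)   # 2.0 0.0455 True
```

Here z = 2.0 falls in the upper shaded rejection region, and the p-value of about .0455 is just under the usual .05 cutoff, so this (hypothetical) result would lead you to reject the null.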
You can get a false positive, and this is
when the sample shows some kind of statistical
effect, but it's really randomness. And so
for instance, this scatterplot I have here,
you can see a little down hill association
here but this is in fact drawn from data that
has a true correlation of zero. And I just
kind of randomly sampled from it, it took
about 20 rounds, but it looks negative but
really there's nothing happening. The trick
about false positives is; that's conditional
on rejecting the null. The only way to get
a false positive is if you actually conclude
that there's a positive result. It goes by
the highly descriptive name of a Type I error,
but you get to pick a value for it, and .05, a 5% risk of a false positive when you reject the null hypothesis, is the most common value. Then there's
a false negative. This is when the data looks
random, but in fact, it's systematic or there's
a relationship. So for instance, this scatterplot
it looks like there's pretty much a zero relationship,
but in fact this came from two variables that
were correlated at .25, that's a pretty strong
association. Again, I randomly sampled from
the data until I got a set that happened to
look pretty flat. And a false negative is
conditional on not rejecting the null. You
can only get a false negative if you get a
negative, you say there's nothing there. It's
also called a Type II error and this is a
value that you have to calculate based on
several elements of your testing framework,
so it's something to be thoughtful of. Now,
I do have to mention one thing, a big cautionary notice: the problems with Hypothesis Testing; there's a few. #1, it's really easy
to misinterpret it. A lot of people say, well
if you get a statistically significant result,
it means that it's something big and meaningful.
And that's not true because it's confounded
with sample size and a lot of other things
that don't really matter. Also, a lot of other
people take exception with the assumption
of a null effect or even a nil effect, that
there's zero difference at all. And that can
be, in certain situations can be an absurd
claim, so you've got to watch out for that.
There's also bias from the use of cutoffs. Anytime you have a cutoff, you're going to
have problems with cases that, had they been slightly higher or slightly lower, would have switched the dichotomous outcome, so that is a problem. And then a
lot of people say, it just answers the wrong
question, because "What it's telling you is
what's the probability of getting this data
at random?" That's not what most people care
about. They want it the other way, which is
why I mentioned previously Bayes theorem and
I'll say more about that later. That being
said, Hypothesis Testing is still very deeply
ingrained, very useful in a lot of questions
and has gotten us really far in a lot of domains.
So in sum, let me say this. Hypothesis Testing
is very common for yes/no outcomes and is
the default in many fields. And I argue it is still useful and informative despite many of the well-substantiated critiques. We'll
continue in "Statistics and Inference" by
discussing Estimation. Now as opposed to Hypothesis
Testing, Estimation is designed to actually
give you a number, give you a value. Not just
a yes/no, go/no go, but give you an estimate
for a parameter that you're trying to get.
I like to think of it sort of as a new angle,
looking at something from a different way.
And the most common approach to this is Confidence
Intervals. Now, the important thing to remember
is that this is still an Inferential procedure.
You're still using sample data and trying
to make conclusions about a larger group or
population. The difference here, is instead
of coming up with a yes/no, you'd instead
focus on likely values for the population
value. Most versions of Estimation are closely
related to Hypothesis Testing, sometimes seen
as the flip side of the coin. And we'll see
how that works in later videos. Now, I like
to think of this as an ability to estimate
any sample statistic and there's a few different
versions. We have Parametric versions of Estimation
and Bootstrap versions, that's why I got the
boots here. And that's where you just kind
of randomly sample from the data, in an effort
to get an idea of the variability. You can
also have central versus noncentral Confidence
Intervals in the Estimation, but we are not
going to deal with those. Now, there are three
general steps to this. First, you need to
choose a confidence level. Well, you can't have zero, it has to be more than zero, and it can't be 100%; choose something in between, and 95% is the most common. And what
it does, is it gives you a range, a high and a low. And the higher your level of confidence,
the more confident you want to be, the wider
the range is going to be between your high
and your low estimates. Now, there's a fundamental trade-off in what's happening here, and the
trade off between accuracy; which means you're
on target or more specifically that your interval
contains the true population value. And the
idea is that leads you to the correct Inference.
There's a tradeoff between accuracy and what's
called Precision in this context. And precision means a narrow interval, a small range
of likely values. And what's important to
emphasize is this is independent of accuracy,
you can have one without the other! Or neither
or both. In fact, let me show you how this
works. What I have here is a little hypothetical
situation, I've got a variable that goes from
10 to 90, and I've drawn a thick black line
at 50. If you think of this in terms of percentages
and political polls, it makes a very big difference
if you're on the left or the right of 50%.
And then I've drawn a dotted vertical line
at 55 to say that that's our theoretical true
population value. And what I have here is
a distribution that shows possible values
based on our sample data. And what you get
here is it's not accurate, because it's centered
on the wrong thing. It's actually centered
on 45 as opposed to 55. And it's not precise,
because it's spread way out from may be 10
to almost 80. So in this situation the data is really no help at all. Now, here's another
one. This is accurate because it's centered
on the true value. That's nice, but it's still
really spread out and you see that about 40%
of the values are going to be on the other side of 50%, which might lead you to reach the wrong conclusion. That's a problem! Now, here's
the nightmare situation. This is when you
have a very very precise estimate, but it's
not accurate; it's wrong. And this leads you
to a very false sense of security and understanding
of what's going on and you're going to totally
blow it all the time. The ideal situation
is this: you have an accurate estimate where
the distribution of sample values is really
close to the true population value and it's
precise, it's really tightly knit and you
can see that about 95% of it is on the correct
side of 50 and that's good. If you want to
see all four of them here at once, we have
the precise two on the bottom, the imprecise
ones on the top, the accurate ones on the
right, the inaccurate ones on the left. And
so that's a way of comparing it. But, no matter
what you do, you have to interpret confidence
interval. Now, the statistically accurate
way that has very little interpretation is
this: you would say the 95% confidence interval
for the mean is 5.8 to 7.2. Okay, so that's
just kind of taking the output from your computer
and sticking it into sentence form. The Colloquial
Interpretation of this goes like this: there
is a 95% chance that the population mean is
between 5.8 and 7.2. Well, in most statistical
procedures, specifically frequentist as opposed
to Bayesian, you can't say that. It implies
the population mean shifts around, which is not
how frequentist statistics treats it. Instead, a better interpretation
is this: 95% of confidence intervals for randomly
selected samples will contain the population
mean. Now, I can show you this really easily,
with a little demonstration. This is where
I randomly generated data from a population
with a mean of 55 and I got 20 different samples.
And I got the Confidence Interval from each
sample and I charted the high and the low.
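A simulation along these lines can be sketched in a few lines of code; the population mean of 55 matches the demonstration, while the SD of 15 and the per-sample n of 30 are assumed for illustration:

```python
import random
import statistics

random.seed(1)

POP_MEAN, POP_SD = 55, 15   # true population values (assumed for the demo)
N, Z = 30, 1.96             # sample size per draw and z for a 95% interval

covered = 0
for _ in range(20):
    # Draw a random sample and build its confidence interval
    sample = [random.gauss(POP_MEAN, POP_SD) for _ in range(N)]
    m = statistics.mean(sample)
    se = statistics.stdev(sample) / N ** 0.5
    low, high = m - Z * se, m + Z * se
    if low <= POP_MEAN <= high:
        covered += 1

print(f"{covered} of 20 intervals contained the true mean")
```

On average, about 19 of the 20 intervals should contain 55, though any single run can give a different count, just as described here.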
And the question is, did it include the true
population value. And you can see of these
20, 19 included it, some of them barely made
it. If you look at sample #1 on the far left,
it barely made it. Sample #8 barely looks
like it made it, and sample #20 on the far right
barely made it on the other end. Only one
missed it completely, and that's sample #2, which
is shown in red on the left. Now, it's not
always just one out of twenty, I actually
had to run this simulation about 8 times,
because it gave me either zero or 3 or 2 misses,
and I had to run it until I got exactly
what I was looking for here. But this is
what you would expect on average. So, let's
say a few things about this. There are some things
that affect the width of a Confidence Interval.
The first is the confidence level, or CL.
Higher confidence levels create wider intervals.
The more certain you have to be, you're going
to give a bigger range to cover your bases.
Second, the Standard Deviation, or SD: larger standard
deviations create wider intervals. If the
thing that you are studying is inherently
really variable, then of course your estimate
of the range is going to be more variable
as well. And then finally there is the n or
the sample size. This one goes the other way.
Larger sample sizes create narrower intervals.
The more observations you have, the more precise
and the more reliable things tend to be. I
can show you each of these things graphically.
Here we have a bunch of Confidence Intervals,
where I am simply changing the confidence
level from .50 on the left side to .999
and as you can see, it gets much bigger as
we increase. Next one is Standard Deviation.
As the sample standard deviation increases
from 1 to 16, you can see that the interval
gets a lot bigger. And then we have sample
size going from just 2 up to 512; I'm doubling
it at each point. And you can see how the
interval gets more and more and more precise
as we go through. And so, let's say this to
sum up our discussion of estimation. Confidence
Intervals which are the most common version
of Estimation focus on the population parameter.
And the variation in the data is explicitly
included in that Estimation. Also, you can
argue that they are more informative, because
not only do they tell you whether the population
value is likely, but they give you a sense
of the variability of the data itself, and
that's one reason why people will argue that
confidence intervals should always be included
in any statistical analysis. As we continue
our discussion on "Statistics and Data Science",
we need to talk about some of the choices
you have to make, some of the tradeoffs and
some of the effects that these things have.
We'll begin by talking about Estimators, that
is different methods for estimating parameters.
I like to think of it as this, "What kind
of measuring stick or standard are you going
to be using?" Now, we'll begin with the most
common. This is called OLS, which is actually
short for Ordinary Least Squares. This is
a very common approach, it's used in a lot
of statistics and is based on what is called
the sum of squared errors, and it's characterized
by an acronym called BLUE, which stands for
Best Linear Unbiased Estimator. Let me show
you how that works. Let's take a scatterplot
here of an association between two variables.
This is actually the speed of a car and the
distance to stop, from the 1920s I
think. We have a scatterplot and we can draw
a straight regression line right through it.
Now, the line I've used is in fact the Best
Linear Unbiased Estimate, but the way that
you can tell that is by getting what are called
the Residuals. You take each data point
and draw a perfectly vertical line up or down
to the regression line, because the regression
line predicts what the value would be for
that value on the X axis. Those are the residuals;
each of those individual vertical lines is
a residual. You square those and you add them
up and this regression line, the gray angled
line here will have the smallest sum of the
squared residuals of any possible straight
line you can run through it. Now, another
approach is ML, which stands for Maximum Likelihood.
And this is when you choose parameters that
make the observed data most likely. It sounds
kind of weird, but I can demonstrate it, and
it's based on a kind of local search. It doesn't
always find the best, I like to think of it
here like the person here with a pair of binoculars,
looking around them, trying hard to find something,
but you could theoretically miss something.
Let me give a very simple example of how this
works. Let's assume that we're trying to find
parameters that maximize the likelihood of
this dotted vertical line here at 55, and
I've got three possibilities. I've got my
red distribution which is off to the left,
blue which is a little more centered and green
which is far to the right. And these are all
identical, except they have different means,
and by changing the means, you see there the
one that is highest where the dotted line
is the blue one. And so, if the only thing
we are doing is changing the mean, and we
are looking at these three distributions,
then the blue one is the one that has the
maximum likelihood for this particular parameter.
On the other hand, we could give them all
the same mean, right around 50, and vary
their standard deviations instead and so they
spread out different amounts. In this case,
the red distribution is highest at the dotted
vertical line and so it has the maximum value.
Or if you want to, you can vary both the mean
and the standard deviations simultaneously.
And here green gets the slight advantage.
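A stripped-down sketch of that idea in code: among several candidate distributions, pick the one that assigns the highest likelihood to the observed value. The means and common SD here are assumed, roughly matching the red/blue/green example:

```python
import math

def normal_pdf(x, mean, sd):
    """Density of a normal distribution at point x."""
    return math.exp(-((x - mean) ** 2) / (2 * sd ** 2)) / (sd * math.sqrt(2 * math.pi))

observed = 55   # the dotted vertical line in the example

# Candidate distributions: assumed means, identical SD of 10
candidates = {"red": 30, "blue": 50, "green": 80}

# Evaluate each candidate's likelihood at the observed value
likelihoods = {name: normal_pdf(observed, mu, 10) for name, mu in candidates.items()}
best = max(likelihoods, key=likelihoods.get)
print(best)   # → blue
```

The distribution whose mean sits closest to the observed value wins, which is exactly the "choose parameters that make the observed data most likely" idea.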
Now this is really a caricature of the process
because obviously you would just want to center
it on the 55 and be done with it. The real challenge
comes when you have many variables in your dataset.
Then it's a very complex process of choosing
values that can maximize the association between
all of them. But you get a feel for how it
works with this. The third approach which
is pretty common is MAP, for Maximum
A Posteriori. This is a Bayesian approach
to parameter estimation, and what it does
is add a prior distribution and then go
through sort of an anchoring-and-adjusting
process. What happens, by the way, is that stronger
priors exert more influence on the final
estimate; a stronger prior might come from, for example, a larger
sample or more extreme values. And those have
a greater influence on the posterior estimate
of the parameters. Now, what's interesting
is that all three of these methods all connect
with each other. Let me show you exactly how
they connect. The ordinary least squares,
OLS, this is equivalent to maximum likelihood,
when it has normally distributed error terms.
And maximum likelihood, ML is equivalent to
Maximum A Posteriori or MAP, with a uniform
prior distribution. To put it another
way, ordinary least squares or OLS is a special
case of Maximum Likelihood. And then maximum
likelihood or ML, is a special case of Maximum
A Posteriori, and just in case you like it,
we can put it into set notation. OLS is a
subset of ML is a subset of MAP, and so there
are connections between these three methods
of estimating population parameters. Let me
just sum it up briefly this way. The standards
that you use OLS, ML, MAP they affect your
choices and they determine which parameters
best estimate what's happening in your data.
Several methods exist and there's obviously
more than what I showed you right here, but
many are closely related and under certain
circumstances they're all identical. And so
it comes down to exactly what are your purposes
and what do you think is going to work best
with the data that you have to give you the
insight that you need in your own project.
The next choice we want to consider in our "Statistics
and Data Science" discussion
has to do with measures of fit, or the correspondence
between the data that you have and the model
that you create. Now, it turns
out there are a lot of different ways to measure
this and one big question is how close is
close enough or how can you see the difference
between the model and reality. Well, there's
a few really common approaches to this. The
first one is what's called R2. Its longer
name is the coefficient
of determination. There's a variation; adjusted
R2, which takes into consideration the number
of variables. Then there's minus 2LL, which
is based on the likelihood ratio and a couple
of variations. The Akaike Information Criterion
or AIC and the Bayesian Information Criterion
or BIC. Then there's also chi-squared; it's
written with the Greek letter chi, which looks like an x
but is transliterated "ch", hence chi-squared. And
so let's talk about each of these in turn.
First off is R2, this is the squared multiple
correlation or the coefficient of determination.
And what it does is it compares the variance
of Y, so if you have an outcome variable,
it looks at the total variance of that and
compares it to the residuals on Y after you've
made your prediction. Scores on R2
range from 0 to 1 and higher is better. The
next is -2 Log-likelihood, that's the likelihood
ratio or, like I just said, the -2 log-likelihood.
What this does is compare the fit of
nested models: a restricted model that is a subset
of a fuller model. This approach
is used a lot in logistic regression when
you have a binary outcome. And in general,
smaller values are considered better fit.
Now, as I mentioned there are some variations
of this. I like to think of them like variations of
chocolate. For the -2 log-likelihood there's the
Akaike Information Criterion (AIC) and the
Bayesian Information Criterion (BIC), and what
both of these do is adjust for the number
of predictors. Because obviously, if you have
a huge number of predictors, you're
going to get a really good fit. But you're
probably going to have what is called overfitting,
where your model is tailored too specifically
to the data you currently have and that doesn't
generalize well. These both attempt to reduce
the effect of overfitting. Then there's chi-squared
again. It's actually a lowercase Greek chi, which
looks like an x. Chi-squared is used for
examining the deviations between two datasets.
Specifically between the observed dataset
and the expected values or the model you create,
we expect this many frequencies in each category.
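Two of these measures are simple enough to compute by hand; here is a sketch with made-up outcome values, predictions, and category counts:

```python
# R squared: compare the total variance of the outcome to the
# residual variance left after the model's predictions
# (toy outcome values and predictions, purely illustrative)
ys        = [5, 7, 9, 11, 13, 15]
predicted = [5.5, 6.5, 9.0, 11.5, 12.5, 15.0]

mean_y = sum(ys) / len(ys)
ss_total    = sum((y - mean_y) ** 2 for y in ys)
ss_residual = sum((y - p) ** 2 for y, p in zip(ys, predicted))
r_squared = 1 - ss_residual / ss_total
print(round(r_squared, 3))   # close to 1 means a good fit

# Chi-squared: deviations between observed category counts
# and the counts the model expects
observed = [18, 22, 40]
expected = [20, 20, 40]
chi_squared = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
print(round(chi_squared, 2))   # → 0.4
```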
Now, I'll just mention that, like when I go into the
store, there are a lot of other choices, but
these are some of the most common standards,
particularly the R2. And I just want to say,
in sum, there are many different ways to assess
the fit that corresponds between a model and
your data. And the choices affect the model;
you know especially are you getting penalized
for throwing in too many variables relative
to your number of cases? Are you dealing with
a quantitative or binary outcome? Those things
all matter, and so the most important thing
as always, my standing advice is keep your
goals in mind and choose a method that seems
to fit best with your analytical strategy
and the insight you're trying to get from
your data. The "Statistics and Data Science"
offers a lot of different choices. One of
the most important is going to be feature
selection, or the choice of variables to include
in your model. It's sort of like confronting
this enormous range of information and trying
to choose what matters most. Trying to get
the needle out of the haystack. The goal of
feature selection is to select the best features
or variables and get rid of uninformative/noisy
variables and simplify the statistical model
that you are creating because that helps avoid
overfitting or getting a model that works
too well with the current data and works less
well with other data. The major problem here
is Multicollinearity, a very long word. That
has to do with the relationship between the
predictors and the model. I'm going to show
it to you graphically here. Imagine here for
instance, we've got a big circle here to represent
the variability in our outcome variable; we're
trying to predict it. And we've got a few
predictors. So we've got Predictor # 1 over
here and you see it's got a lot of overlap,
that's nice. Then we've got predictor #2 here,
it also has some overlap with the outcome,
but it also overlaps with Predictor 1. And
then finally down here, we've got Predictor
3, which overlaps with both of them. And the
problem arises from the overlap among the predictors
themselves. Now, there's a few
ways of dealing with this, some of these are
pretty common. So for instance, there's the
practice of looking at probability values
and regression equations, there's standardized
coefficients and there's variations on sequential
regression. There are also newer
procedures for dealing with the disentanglement
of the association between the predictors.
There's something called Commonality analysis,
there's Dominance Analysis, and there are
Relative Importance Weights. Of course there
are many other choices in both the common
and the newer, but these are just a few that
are worth taking a special look at. First,
is P values or probability values. This is
the simplest method, because most statistical
packages will calculate probability values
for each predictor and they will put little
asterisks next to it. And so what you're doing
is you're looking at the p-values; the probabilities
for each predictor or more often the asterisks
next to it, which sometimes gives it the name
of Star Search. You're just kind of cruising
through a large output of data, just looking
for the stars or asterisks. This is fundamentally
a problematic approach for a lot of reasons.
The problem here is you're looking at each predictor individually,
and that inflates false positives. Say you have
20 variables, each entered and tested with
an alpha, or false positive rate, of 5%. You end
up with nearly a 65% chance of at least one
false positive in there. That's distorted
by sample size, because with a large enough
sample anything can become statistically significant.
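That 65% figure is easy to verify: with 20 independent tests at alpha = .05, the chance of at least one false positive is 1 minus the chance of none at all:

```python
# Probability of at least one false positive across 20 independent
# tests, each run at a 5% false positive rate
alpha, k = 0.05, 20
p_at_least_one = 1 - (1 - alpha) ** k
print(round(p_at_least_one, 2))   # → 0.64
```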
And so, relying on p-values can be a seriously
problematic approach. A slightly better approach
is to use Betas, or standardized regression
coefficients; this is where you put all
the variables on the same scale. Usually that means
standardizing to a mean of zero and either a range of
minus 1 to plus 1 or a standard deviation
of 1. The trick is, though, they're still in
the context of each other and you can't really
separate them because those coefficients are
only valid when you take that group of predictors
as a whole. So, one way to try and get around
that is to do what are called stepwise procedures,
where you look at the variables in sequence;
there are several versions of sequential regression
that'll allow you to do that. You can put
the variables into groups or blocks and enter
them in blocks and look at how the equation
changes overall. You can examine the change
in fit in each step. The problem with a stepwise
procedure like this, is it dramatically increases
the risk of overfitting which again is a bad
thing if you want to generalize your data.
And so, to deal with this, there is a whole
collection of newer methods, a few of them
include commonality analysis, which provides
separate estimates for the unique and shared
contributions of each variable. Well, that's
a neat statistical trick but the problem is,
it just moves the problem of disentanglement
to the analyst, so you're really no better
off than you were, as far as I can tell. There's
dominance analysis, which compares every possible
subset of Predictors. Again, sounds really
good, but you have the problem known as the
combinatorial explosion. If you have 50 variables
that you could use, and there are some datasets that
have millions of variables, then with just 50 variables,
you have over 1 quadrillion possible combinations,
you're not going to finish that in your lifetime.
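The arithmetic behind that claim: every predictor can be either in or out of a subset, so the count of subsets doubles with each added variable:

```python
# Each of the 50 predictors is either in or out of a subset,
# so the number of possible subsets is 2 to the 50th power
predictors = 50
subsets = 2 ** predictors
print(f"{subsets:,}")   # → 1,125,899,906,842,624 (over a quadrillion)
```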
And it's also really hard to get things like
standard errors and perform inferential statistics
with this kind of model. Then there's also
something that's even more recent than these
others and that's called relative importance
weights. What that does is create a set
of orthogonal predictors, uncorrelated with
each other, based on the originals;
then it predicts the outcome
from those new predictors, without the multicollinearity,
because they are uncorrelated.
It then rescales the coefficients back to
the original variables, that's the back-transform.
Then from that it assigns relative importance
or a percentage of explanatory power to each
predictor variable. Now, despite this very
different approach, it tends to have results
that resemble dominance analysis. It's actually
really easy to do with a website, you just
plug in your information and it does it for
you. And so that is yet another way of dealing
with the problem of multicollinearity and trying
to disentangle the contribution of different
variables. In sum, let's say this. What you're
trying to do here, is trying to choose the
most useful variables to include into your
model. Make it simpler, be parsimonious. Also,
reduce the noise and distractions in your
data. And in doing so, you're always going
to have to confront the ever present problem
of multicollinearity, or the association between
the predictors in your model with several
different ways of dealing with that. The next
step in our discussion of "Statistics and
the Choices you have to Make", concerns common
problems in modeling. And I like to think
of this as the situation where you're caught between
a rock and a hard place, where the going
gets very hard. Common problems
include things like Non-Normality, Non-Linearity,
Multicollinearity and Missing Data. And I'll
talk about each of these. Let's begin with
Non-Normality. Most statistical procedures
like to deal with nice symmetrical, unimodal
bell curves, they make life really easy. But
sometimes you get really skewed distribution
or you get outliers. Skews and outliers, while
they happen pretty often, are a problem
because they distort measures; the mean, for instance,
gets thrown off tremendously when there are
outliers. And they throw off models, because
models often assume the symmetry and unimodal
nature of a normal distribution. Now, one
way of dealing with this as I've mentioned
before is to try transforming the data, taking
the logarithm, try something else. But another
problem may be that you have mixed distributions,
if you have a bimodal distribution, maybe
what you really have here is two distributions
that got mixed together and you may need to
disentangle them through exploring your data
a little bit more. Next is Non-Linearity.
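A quick sketch of the transformation idea from the preceding paragraph, using an assumed right-skewed variable: one extreme value drags the mean far above the median, and taking the logarithm compresses that long tail:

```python
import math
import statistics

# A right-skewed variable with one extreme outlier (assumed values)
skewed = [1, 2, 2, 3, 3, 4, 5, 8, 20, 150]

# The outlier drags the mean far above the median
print(statistics.mean(skewed))     # → 19.8
print(statistics.median(skewed))   # → 3.5

# A log transform compresses the long right tail
logged = [math.log10(x) for x in skewed]
print(round(statistics.mean(logged), 2))
```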
The gray line here is the regression line,
we like to put straight lines through things
because it makes the description a lot easier.
But sometimes the data is curved; here you
have a perfect curved relationship,
but a straight line doesn't work with
that. Linearity is a very common assumption
of many procedures especially regression.
To deal with this, you can try transforming
one or both of the variables in the equation
and sometimes that manages to straighten out
the relationship between the two of them.
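Here is a sketch of that transformation remedy, with toy data where y is exactly the square of x: regressing y on x leaves large residuals, but regressing y on the squared predictor fits perfectly. The helper is just the usual least-squares formula:

```python
def ols_sse(xs, ys):
    """Fit y = a + b*x by least squares; return the sum of squared residuals."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / \
        sum((x - mx) ** 2 for x in xs)
    a = my - b * mx
    return sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))

# Curved toy data: y is exactly the square of x
x = [1, 2, 3, 4, 5, 6]
y = [v ** 2 for v in x]

print(round(ols_sse(x, y), 3))                    # straight line: big residuals
print(round(ols_sse([v ** 2 for v in x], y), 3))  # transformed predictor: → 0.0
```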
Also, using Polynomials. Things that specifically
include curvature like squares and cubed values,
that can help as well. Then there's the issues
of multicollinearity, which I've mentioned
previously. This is when you have correlated
predictors, or rather the predictors themselves
are associated with each other. The problem
is, this can distort the coefficients you
get in the overall model. Some procedures,
it turns out are less affected by this than
others, but one overall way of dealing with this
might be to simply try and use fewer variables.
If they're really correlated maybe you don't
need all of them. And there are empirical
ways to deal with this, but truthfully, it's
perfectly legitimate to use your own domain
expertise and your own insight into the problem.
To use your theory to choose among the variables
that would be the most informative. Part of
the problem we have here, is something called
the Combinatorial Explosion. This is where
combinations of variables or categories grow
too fast for analysis. Now, I've mentioned
something about this before. If you have 4
variables and each variable has two categories,
then you have 16 combinations, fine you can
try things 16 different ways. That's perfectly
doable. If you have 20 variables with five
categories, again, that's not too unlikely;
you have 95 trillion combinations, that's
a whole other ball game, even with your fast
computer. A couple of ways of dealing with
this, #1 is with theory. Use your theory and
your own understanding of the domain to choose
the variables or categories with the greatest
potential to inform. You know what you're
dealing with, rely on that information. Second
is, there are data driven approaches. You
can use something called a Markov chain Monte
Carlo model to explore the range of possibilities
without having to examine
each and every single one of your 95 trillion
combinations. Closely related to the combinatorial
explosion is the curse of dimensionality.
This is when you have phenomena, things
that may only occur in higher dimensions
or variable sets. Things that don't show up
until you have these unusual combinations.
That may be true of a lot of how reality works,
but the project of analysis is simplification.
And so you've got to try to do one or two
different things. You can try to reduce. Mostly
that means reducing the dimensionality of
your data. Reduce the number of dimensions
or variables before you analyze. You're actually
trying to project the data onto a lower dimensional
space, the same way you try to get a shadow
of a 3D object. There's a lot of different
ways to do that. There's also data driven
methods. And the same method here, a Markov
chain Monte Carlo model, can be used to explore
a wide range of possibilities. Finally, there
is the problem of Missing Data and this is
a big problem. Missing data tends to distort
analysis and creates bias if it's a particular
group that's missing. And so when you're dealing
with this, what you have to do is actually
check for patterns in missingness: you create
a new variable that indicates whether or not
a value is missing and then you see if
that is associated with any of your other
variables. If there are no strong patterns,
then you can impute missing values. You can
put in the mean or the median, you can do
Regression Imputation, something called Multiple
Imputation, a lot of different choices. And
those are all technical topics, which we will
have to talk about in a more technically oriented
series. But for right now, in terms of the
problems that can come up during modeling,
I can summarize it this way. #1, check your
assumptions at every step. Make sure that
the data have the distribution that you need,
check for the effects of outliers, check for
ambiguity and bias. See if you can interpret
what you have and use your analysis, use data
driven methods but also your knowledge of
the theory and the meaning of things in your
domain to inform your analysis and find ways
of dealing with these problems. As we continue
our discussion of "Statistics and the Choices
that are Made", one important consideration
is Model Validation. And the idea here is
that as you are doing your analysis, are you
on target? More specifically, the model that
you create through regression or whatever
you do may fit the sample beautifully;
you've optimized it there. But, will it work
well with other data? Fundamentally, this
is the question of Generalizability, also
sometimes called Scalability. Because you
are trying to apply it in other situations, and
you don't want to get too specific or it won't
work in other situations. Now, there are a
few general ways of dealing with this and
trying to get some sort of generalizability.
#1 is Bayes; a Bayesian approach. Then there's
Replication. Then there's something called
Holdout Validation, then there is Cross-Validation.
I'll discuss each one of these very briefly
in conceptual terms. The first one is Bayes
and the idea here is you want to get what
are called Posterior Probabilities. Most analyses
give you the probability of the data,
given the hypothesis, so you have to start
with an assumption about the hypothesis. But
instead, it's possible to flip that around,
by combining it with prior information,
to get the probability of the hypothesis given
the data. And that is the purpose of Bayes
theorem; which I've talked about elsewhere.
Another way of finding out how well things
are going to work is through Replication.
That is, do the study again. It's considered
the gold standard in many different fields.
The question is whether you need an exact
replication or whether a conceptual one, similar
in certain respects, will do. You can argue
both ways, but one thing you do want to
do is when you do a replication then you actually
want to combine the results. And what's interesting
is the first study can serve as the Bayesian
prior probability for the second study. So
you can actually use meta-analysis or Bayesian
methods for combining the data from the two
of them. Then there's hold out validation.
This is where you build your statistical model
on one part of the data and you test it on
the other. I like to think of it as the eggs
in separate baskets. The trick is that you
need a large sample in order to have enough
to do these two steps separately. On the other
hand, it's also used very often in data science
competitions, as a way of having a sort of
gold standard for assessing the validity of
a model. Finally, I'll mention just one more
and that's Cross-Validation, where you use
the same data for training and for testing,
or validating. There's several different versions
of it, and the idea is that you're not using
all the data at once, but you're kind of cycling
through and weaving the results together.
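A minimal sketch of that cycling idea, with toy data and a deliberately simple "model" that just predicts the training mean (everything here is assumed for illustration):

```python
import statistics

# Toy outcome values and a 5-way split
data = [4, 8, 6, 5, 9, 7, 3, 8, 6, 5]
k = 5
fold_size = len(data) // k

errors = []
for i in range(k):
    # Hold out one fold for testing, train on the rest
    test = data[i * fold_size:(i + 1) * fold_size]
    train = data[:i * fold_size] + data[(i + 1) * fold_size:]
    prediction = statistics.mean(train)              # "train" the model
    errors += [(y - prediction) ** 2 for y in test]  # "test" it

print(round(statistics.mean(errors), 2))   # average held-out error
```

Each observation gets used for testing exactly once, and the held-out errors are woven together into one overall estimate of how the model generalizes.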
There's Leave-one-out, where you leave out
one case at a time, also called LOO. There's
Leave-p-out, where you leave out a certain
number at each point. There's k-fold where
you split the data into say for instance 10
groups and you leave out one and you develop
it on the other nine, then you cycle through.
And there's repeated random subsampling, where
you use a random process at each point. Any
of those can be used to develop the model
on one part of the data and tested on another
and then cycle through to see how well it
holds up in different circumstances. And so
in sum, I can say this about validation. You
want to make your analysis count by testing
how well your model holds up from the data
you developed it on, to other situations.
Because that is what you are really trying
to accomplish. This allows you to check the
validity of your analysis and your reasoning
and it allows you to build confidence in the
utility of your results. To finish up our
discussion of "Statistics and Data Science"
and the choices that are involved, I want
to mention something that really isn't a choice,
but more an attitude. And that's DIY, that's
Do it yourself. The idea here is, you know
really you just need to get started. Remember
data is democratic. It's there for everyone,
everybody has data. Everybody works with data
either explicitly or implicitly. Data is democratic,
so is Data Science. And really, my overall
message is: You can do it! You know, a lot
of people think you have to be this cutting
edge, virtual reality sort of thing. And it's
true, there's a lot of active development
going on in data science, there's always new
stuff. The trick however is, the software
you can use to implement those things often
lags. It'll show up first in programs like
R and Python, but as for it showing up
in a point-and-click program, that could be years.
What's funny though, is often these cutting
edge developments don't really make much of
a difference in the results or the interpretation.
They may in certain edge cases, but usually
not a huge difference. So I'm just going to
say: analyst beware. You don't necessarily have to
use them, it's pretty easy to do them wrong, and so
you don't have to wait for the cutting
edge. Now, that being said, I do want you
to pay attention to what you are doing. A
couple of things I have said repeatedly is
"Know your goal". Why are you doing this study?
Why are you analyzing the data, what are you
hoping to get out of it? Try to match your
methods to your goal, be goal directed. Focus
on the usability; will you get something out
of this that people can actually do something
with. Then, as I've mentioned with that Bayesian
thing, don't get confused with probabilities.
Remember that priors and posteriors are different
things just so you can interpret things accurately.
Now, I want to mention something that's really
important to me personally. And that is, beware
the trolls. You will encounter critics, people
who are very vocal and who can be harsh and
grumpy and really just intimidating. And they
can really make you feel like you shouldn't
do stuff because you're going to do it wrong.
But the important thing to remember is that
the critics can be wrong. Yes, you'll make
mistakes, everybody does. You know, I can't
tell you how many times I have to write my
code more than once to get it to do what I
want it to do. But in analysis, nothing is
completely wasted if you pay close attention.
I've mentioned this before, everything signifies.
Or in other words, everything has meaning.
The trick is that meaning might not be what
you expected it to be. So you're going to
have to listen carefully and I just want to
reemphasize, all data has value. So make sure
you're listening carefully. In sum, let's say
this: no analysis is perfect. The real question
is not whether your analysis is perfect, but whether you can
add value. And I'm sure that you can. And
fundamentally, data is democratic. So, I'm
going to finish with one more picture here
and that is: just jump right in and get started.
You'll be glad you did. To wrap up our course
"Statistics and Data Science", I want to give
you a short conclusion and some next steps.
Mostly I want to give a little piece of advice
I learned from a professional saxophonist,
Kirk Whalum. And he says, "There's
Always Something To Work On"; there's always
something you can do to try things differently
to get better. It works when practicing music,
it also works when you're dealing with data.
Now, there are additional courses, here at
datalab.cc that you might want to look at.
They are conceptual courses, additional high-level
overviews on things like machine learning,
data visualization and other topics. And I
encourage you to take a look at those as well,
to round out your general understanding of
the field. There are also, however, many practical
courses. These are hands-on tutorials on the
statistical procedures I've covered, where you
learn how to do them in R, Python, SPSS,
and other programs. But whatever you're doing,
keep this other little piece of advice from
writers in mind, and that is "Write what you
know". And I'm going to say it this way. Explore
and analyze and delve into what you know.
Remember when we talked about data science
and the Venn Diagram, we've talked about the
coding and the stats. But don't forget this
part on the bottom. Domain expertise is just
as important to good data science as the ability
to work with computer coding and the ability
to work with the numbers and quantitative
skills. But also, remember this. You don't
have to know everything, your work doesn't
have to be perfect. The most important thing
is just get started, you'll be glad you did.
Thanks for joining me and good luck!
