Hi. This is James Cook of the University
of Maine at Augusta. Today, we're going to
talk about the practice of data mining
as a research technique and the
interdisciplinary approach called data
science within which data mining occurs.
Data science is really not its own
discipline but rather an approach to
thinking about data that is adopted by
biologists, by computer scientists, by
physicists, by sociologists, by psychologists, by others who study large amounts of
information about individuals and think
about how they interrelate. Data science as an idea and data mining as a practice have been enabled by the emergence of computers with large processing power and large areas of the internet on which information is
recorded. Now if we're lucky, often that
information is presented to us already
managed as a dataset: the classic organization of data (which you have become familiar with if you are taking one of my research classes) in which you have individual cases and then variables about those cases, whether those cases are individuals, groups, or relationships. However, the
Internet is usually not that friendly to
a researcher, so it's not as simple as
downloading someone else's data set and
working with it. Nevertheless, it's
important to recognize that the Internet
is built on data. More than 90% of the web pages out there on the internet -- pages that present information about social relations between people, social assemblages into categories, groups, and institutions, and social behaviors -- are actually created by computers.
Computers operate by rules.  So the
first thing to understand is that when you are data mining, you are looking for something a computer knows how to communicate, which is a regular pattern in the presentation of data. This is to your advantage as long as you can figure out what that regular pattern is. If you go and look at the guts of a web page, you will often find regular patterns in presentation, whether they are in the words themselves that appear on the web page or in the code that surrounds the text of the webpage and tells the computer how to present the content. If you are able to characterize those regular patterns, you can, on your end, write a computer program, or use an already-made computer program, that can pull out those patterns and put them into a dataset.
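Just to make that concrete, here's a minimal sketch in Python of what pulling a regular pattern out of page text can look like. The HTML fragment and the pattern here are invented for illustration; a real page would have its own pattern that you'd have to discover for yourself.

```python
import re

# A made-up fragment of page source: every case follows the same template.
html = """
<li class="member"><span class="name">Ada Lovelace</span> <span class="year">1842</span></li>
<li class="member"><span class="name">Charles Babbage</span> <span class="year">1837</span></li>
"""

# The regular pattern: a "name" span followed by a "year" span.
pattern = r'<span class="name">(.*?)</span> <span class="year">(\d{4})</span>'

# Each match becomes one case with two variables: a name and a year.
for name, year in re.findall(pattern, html):
    print(name, year)
```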
The second thing to understand when you're thinking about getting started with data mining (which is really its own class; we could spend a whole semester on it) is that you have to understand what it is you want to study. Just as with all other aspects of research, you have to understand not only what variables you want to observe but also what your unit of analysis is. Are you
trying to study a group, an organization?
Are you trying to study a domain, a space
with either real physical or virtual
boundaries? Are you trying to understand
actions within that space; are those
actions your unit of analysis? Or are you
trying to study individuals, whether those are individual people or other kinds of individuals that might not even be human? When you create a mined dataset, you're going to need to
decide -- when you're taking this regular
text and you're processing it and putting it
into its place -- what those places are. In
other words, what are your cases? This is
something you have to think about. What
are the variables? That is something you
have to think about. Also, you have to understand that sometimes you're actually looking at more than one unit of analysis, so it is possible simultaneously to look at individuals, their place in organizations, and what they do -- that's three different units of analysis -- and it's possible to put those together.
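Purely as an illustration (the field names here are invented, not drawn from any real dataset), here is one way a single mined record could carry more than one unit of analysis at once -- the individual, the organization, and the action:

```python
from dataclasses import dataclass

@dataclass
class MinedRecord:
    person: str        # individual-level variable (hypothetical)
    organization: str  # organization-level variable
    action: str        # action-level variable
    timestamp: str

# One row of the mined dataset; which field counts as "the case"
# depends on which unit of analysis you have chosen.
record = MinedRecord(person="user_123", organization="Example Club",
                     action="posted a comment", timestamp="2015-03-01T12:00:00")
print(record)
```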
The third organizing principle in data mining is deciding what level of control and responsibility
you actually want to take. The truth is that, in one very limited sense, we are already pretty much all applied data scientists: we're
consumers of that information. Now, as a
consumer you might be reading online and
you might be actually just going through
and reading page after page after page
of information that is presented to you.
As a reader, you're engaged in a qualitative kind of research. It's
also possible that you might be a
consumer of analytics. Have you ever used
Facebook or Twitter, for instance? If you
have and you have created a page, you know that there are analytics packages available in which you can click on a page and it will tell you what the patterns are in the access to, the reading of, and the following of the links on your pages, and where people come from. You can be a consumer of those statistics and a user of them, knowing that someone else has done the work of wrangling that data. A second
step up here is to become a data
manager, and to do this means to use an
already existing package provided by a company that will manage data for you and allow
you to look at the data directly. There
are paid analytics packages like
HootSuite that allow you to collect
information on social media streams that
might not be the streams you personally use but might be streams others use, like a Twitter
list or a Facebook group.  Paid analytics packages can gather that information for you
over time. More adaptable analytics include Google Analytics, which involves a snippet of code that you would place in your own web pages -- code that starts a process for Google to go out and track those who come to your webpage, find out particular pieces of information about them, and place them into a dataset for you. The advantage of these approaches over simply looking at Facebook's or Twitter's own analytics is that they often allow you to download the data and work with it. The
disadvantage is that you have to pay for
these and often have to pay quite
handsomely for them.
Google Analytics is free at its most basic level, but if you want to become a power user, you have to pay money; and even to download the lists of data that you have in Hootsuite, you have to pay a fee that can be prohibitive.
That approach is also somewhat limiting: Google is going to decide for you the sorts of data that
you might want to collect. You're not
going to be able to decide for yourself.
A further step up in data analytics is to
work with a software program that sits
on your computer and that gathers data
up for you.
One prime example of this is NodeXL.  NodeXL (at nodexl.codeplex.com) is a plug-in for
Microsoft Excel, a program that will allow you to run a search you describe. It will grab information, and you get to decide what kind of information you want. You get to decide where it comes from. Finally, you don't have to pay a fee at all for each piece of data. What it means, on the other hand, is that you are not going to get a quick analysis handed to you; you're going to have to run it on
your own. You're going to have to decide
what you want to measure. You're going to
have to decide where you want to find
meaning, and that involves a higher level
of responsibility on your part. If you
want to, you can move beyond packages like NodeXL, Gephi, Netvizz, or Netlytic and, on your own, use something called an API. That's an application programming interface: if you type in a command, you can send it to a server somewhere else and obtain lists of data directly. NodeXL works through an API. Netlytic, Netvizz, Gephi: they all work through APIs. These are ways of asking questions of remote servers and obtaining answers. You can do that through a URL, entering the URL with an API command in a web browser.
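As a rough sketch of what a single API request looks like in Python -- the endpoint and its parameters below are placeholders I've made up, since each real service documents its own URL, fields, and authentication:

```python
import json
import urllib.request

# Hypothetical API endpoint -- substitute the documented URL of the service you actually use.
url = "https://api.example.com/v1/posts?query=data+mining&count=50"

# Send the request and read back the response, which many APIs return as JSON.
with urllib.request.urlopen(url) as response:
    data = json.loads(response.read().decode("utf-8"))

# Each item in the returned list becomes a case in your dataset.
for item in data.get("posts", []):
    print(item.get("author"), item.get("text"))
```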
If you want to collect these repeatedly at regular intervals, you're going to need to take on a little bit more responsibility: you'll need to learn a programming language like Python or PHP, or R, a statistical programming language. You have control over what information you grab, and there's a large amount of information you can grab. You can grab it from all kinds of websites that offer APIs, but again, now you have a little bit more responsibility. With greater control comes greater responsibility.
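For instance, a minimal sketch of repeated collection in Python might look like the following (the endpoint is the same invented placeholder as above; a real collector would also add error handling and respect the service's rate limits):

```python
import csv
import json
import time
import urllib.request

URL = "https://api.example.com/v1/posts?query=data+mining&count=50"  # placeholder endpoint

# Poll the API once an hour and append each batch, with a timestamp, to a CSV file.
with open("collected_posts.csv", "a", newline="") as f:
    writer = csv.writer(f)
    for _ in range(24):  # for example, one day of hourly collection
        with urllib.request.urlopen(URL) as response:
            data = json.loads(response.read().decode("utf-8"))
        collected_at = time.strftime("%Y-%m-%d %H:%M:%S")
        for item in data.get("posts", []):
            writer.writerow([collected_at, item.get("author"), item.get("text")])
        f.flush()
        time.sleep(3600)  # wait an hour before the next request
```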
Perhaps the highest level of control is one in which you write a computer program of your own, called a scraper, in a programming language like PHP or Python, that looks directly at the web page itself, looks for patterns in the content, and figures out (tells your computer) how to extract variables. The computer becomes the coder, and you have to tell it exactly how and when to find information.
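As an illustrative sketch only -- the URL and the pattern are invented, and writing a real scraper means studying the structure of the actual pages you care about -- a bare-bones scraper in Python could look like this:

```python
import csv
import re
import urllib.request

# Hypothetical page to scrape -- replace with the page whose structure you've studied.
URL = "https://www.example.com/members"

# Fetch the raw HTML of the page.
with urllib.request.urlopen(URL) as response:
    html = response.read().decode("utf-8")

# The regular pattern you discovered in the page source: here, a name and a join year.
pattern = r'<span class="name">(.*?)</span>\s*<span class="year">(\d{4})</span>'

# Extract every match and write each case, with its variables, into a dataset.
with open("members.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["name", "year_joined"])
    for name, year in re.findall(pattern, html):
        writer.writerow([name, year])
```

Notice that everything here -- which page, which pattern, which variables -- is now your decision, which is exactly the extra responsibility I'm describing.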
You can step around APIs that way, which is a very useful thing to be able to do, because APIs are often about limitation; but you begin to encounter ethical questions. Simply because you can grab all sorts of information from the Internet, is it morally reasonable for you to do so? What if, for instance, you
could write a web scraper that would
collect information on the public
behavior of children or those who are
not yet 18?
Would it be right to do so? All of a
sudden those ethical questions begin
to land on your shoulders. You need to
decide what is public.
You need to decide what is appropriate, and this is the point at which, if you are a university researcher, you need to go to your institutional review board (IRB) and talk about how you're going to institute data protections. Now once you've
gathered this data, the next step is
simply to use an analysis package which
might be as simple as Microsoft Excel
but might be a more complicated program
like UCInet for social network analysis, or SAS or SPSS for the analysis of large
datasets or a free package that is
open source like R. R is a
statistical analysis tool that can work
with large datasets that you gather up
off the internet. It's been written by statisticians and research methodologists who were tired of paying hefty fees -- many thousands of dollars -- for a single version of a data analysis program from companies like SAS, SPSS, Stata, or other professional vendors. They thought they could program such a language on their own, and they have! So once you
take that level of responsibility and control to write programs in Python or PHP that go out and actually collect data, you now have the opportunity, as a researcher in the 21st century, to run a free, open-source statistics program, R, to manage your data and turn that raw set of cases and variables into useful insights.
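Just to illustrate that last step (the file and column names carry over from the earlier hypothetical sketches, and in practice you might do this in R, Excel, or another package rather than Python), a few lines can already turn a mined CSV into a first summary:

```python
import csv
from statistics import mean, median

# Load the mined dataset -- each row is a case, each column a variable.
with open("members.csv", newline="") as f:
    rows = list(csv.DictReader(f))

years = [int(row["year_joined"]) for row in rows]

# A first, simple pass at insight: how many cases, and how are they distributed?
print("Number of cases:", len(rows))
print("Earliest year:", min(years), "Latest year:", max(years))
print("Mean year:", mean(years), "Median year:", median(years))
```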
At the end of the day, in order to take that last step, you need something that we began our course with, and that's theory. You need a framework of ideas involving concepts, variables, and hypotheses that should tell you what to expect -- what the model for the world should look like. Those in turn are placed inside
paradigms which consist of the
accumulated consensus of prior
researchers who have built a general
framework or a model for how societies
work. You can't get too far in analyzing gigantic datasets with many thousands of cases and many hundreds of variables unless you have an organizing framework for them. That's where theory comes in to social research. No matter how technical research becomes, theory never goes entirely away.
