If you argue that big data requires all three
of the V's, volume, velocity, and variety,
at once to count as big data proper, then
is it possible to be a data scientist, that
is, a person with domain expertise, statistical
knowledge, and coding skills, without touching
big data?
We’ll look at a few of these possibilities.
First, let’s review our Venn diagram of
data science.
This again shows that on the top right we have
statistics, on the bottom we have domain knowledge,
and on the top left we have coding.
Taken together, those add up to data science.
We also have a Venn diagram for big data,
where we have volume and velocity and variety,
and again, depending on who you ask, you need
to have all three of them at once to have
big data.
So let’s take a look at data science for
just one V at a time.
So we're talking about statistics and domain
knowledge and coding, but with just velocity
or variety or volume.
The first example we want to look at is volume
of data without any remarkable velocity or
variety.
So, this would mean a very large and static
data set with a consistent format.
The data would generally be structured as
well so we’re not gonna have free text.
A good example of this is genetics data as
I’m showing right here from Nature Reviews.
Genetics data is huge, but it follows a
well-understood structure.
There's an enormous amount of work involved
in processing it, but it's consistent in that way.
Next is data science for velocity without
volume or variety.
So, this is referring primarily to streaming
data with a consistent structure.
By streaming data, we mean that data is coming
in consistently and very often you’re not
holding on to the data.
You’re just keeping a small window of it
open.
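As a minimal sketch of that streaming-window idea (in Python, with made-up window size and threshold values, not anything from a real system), you might hold only the most recent readings and trigger on a rolling statistic:

```python
from collections import deque

def monitor_stream(readings, window_size=100, threshold=5.0):
    """Scan a stream while holding only a small window of it open.

    Older readings fall out of the deque automatically once it is
    full, so the full stream is never stored."""
    window = deque(maxlen=window_size)
    alerts = []
    for reading in readings:
        window.append(reading)
        rolling_mean = sum(window) / len(window)
        if rolling_mean > threshold:
            alerts.append(rolling_mean)  # trigger a response here
    return alerts
```

The `deque` with a `maxlen` is what makes this streaming rather than storage: memory use stays constant no matter how long the stream runs.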
One interesting example is the earthquake
detection systems of the United States Geological
Survey.
This is the Advanced National Seismic System,
which is simply looking to see whether earthquakes
are happening or about to happen.
And you don’t necessarily need to hold on
to all the data if what you’re trying to
do is trigger a response so that if an earthquake
is imminent or if it’s just starting, people
may have enough time to respond to it.
This kind of data, which comes in very fast
but isn't necessarily kept (so it's relatively
low in volume) and has a very consistent
structure (so it's low on variety), is also
called data stream mining.
One possible example is what's called real-time
classification of streaming sensor data.
Finally, for one V at a time, let's talk
about data science for data that has variety.
So a lot of different formats without velocity
or volume.
This is where you have a complex but small
or static or relatively static data set.
One example is facial recognition in a personal
photo collection: you don't have an enormous
number of photos, but you do have a lot of
variety, because visual data is almost always
very high in variety, and the collection may
be static because you don't add to it constantly.
Or you can talk about the data visualization
of complex data sets.
Now, these are examples of data science, which
again means statistics, and it can mean domain
expertise and coding, where you're dealing
with just one of the V's of data at a time.
You can also do two V’s at a time.
So for instance you may talk about data science
for data where you have volume and velocity
but not a lot of variety.
So for instance, a lot of data is coming in
very fast but it’s in the same format.
This can include, for instance, stock market
data.
Here is another example.
It has to do with jet engines.
And a surprising statistic here comes from
this chart.
It is estimated that a jet engine has sensors
on it that generate 20 terabytes of information
each hour.
That’s an enormous amount of information.
For instance, 20 terabytes per engine per
hour, times two engines, times a six-hour
cross-country flight, times 28,000 flights
per day, times 365 days a year.
The total data generated is roughly 2.5 billion
terabytes per year, just from jet engines
alone.
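That back-of-the-envelope calculation can be checked in a few lines of Python, using only the figures quoted above:

```python
# Back-of-the-envelope check of the jet engine data estimate.
tb_per_engine_per_hour = 20   # sensor output per engine
engines_per_plane = 2
hours_per_flight = 6          # cross-country flight
flights_per_day = 28_000
days_per_year = 365

tb_per_year = (tb_per_engine_per_hour * engines_per_plane *
               hours_per_flight * flights_per_day * days_per_year)
print(f"{tb_per_year:,} TB per year")  # 2,452,800,000 TB per year
```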
The point here is that it's a lot of data,
and you would want to hold on to it, because
the failure of a jet engine is an extraordinarily
important thing, so you want to be able to
find the patterns in it fully.
Another possibility is data science applied
to data that has velocity and variety: it's
coming in fast and there's a lot of variety,
but not a lot of volume.
So again, this is streaming data where you’re
not necessarily holding on to everything.
One interesting example of this is surveillance
video.
If you’re saving all the data then that’s
an enormous volume.
On the other hand, if you're not saving it
but just streaming it in, it's high in velocity
because the information comes in very quickly,
maybe 20 or 30 frames per second, and it has
a lot of variety because it's visual information.
But if you simply want to see, for instance,
whether a person carrying a weapon comes through,
or whether a particular event occurs, you use
a stream and you're just trying to trigger
a response when something happens.
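A sketch of that trigger pattern in Python: the detector function here is a hypothetical stand-in for a real classifier, and each frame is discarded as soon as it's checked:

```python
def watch_stream(frames, detect_event, on_event):
    """Scan a video stream frame by frame without storing it.

    `detect_event` is a stand-in classifier that returns True when
    the event of interest appears in a frame; `on_event` is the
    triggered response. No frame is kept after it's checked."""
    for index, frame in enumerate(frames):
        if detect_event(frame):
            on_event(index, frame)
```

In practice `detect_event` would be an object detector keeping up with 20 to 30 frames per second, but the control flow, scan and trigger rather than store, is the same.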
Finally, let’s talk about data science for
volume and variety without velocity, and this
can be any large historical data set that
uses multiple formats or includes visual data.
A really good example of this is Google Books,
which we've looked at before.
They have scanned and digitized 30 million
books, and you're dealing with really complex
information there.
Similar examples include the Twitter archives,
where every single tweet that's ever been
written has been saved.
That's an enormous amount of information,
and it's complex because it's text, but because
the archive isn't being updated constantly,
it doesn't have the velocity.
Now, what these examples show is that despite
the strong association between big data and
data science, the skills of data science,
that is, statistical knowledge, domain expertise,
and coding skills, apply even when the three
major aspects of big data aren't all present
at the same time.
In the next presentation, we'll look at the
flip side of all this: how to work with big
data without requiring the full data science
skill set.
