Good day, everyone.
This is Dr. Soper here.
And today we will be exploring
the eighth and final topic
in our series of
lectures on databases,
with today's topic focusing
on big data, data warehouses,
and business
intelligence systems.
Although the objectives
for today's lecture
are varied and manifold,
broadly speaking,
our goals are to extend our
knowledge of database systems
beyond relational
databases by exploring
some of the additional ways in
which database technologies can
be used to achieve
organizational objectives.
Along the way, we will explore
the basic concepts of data
warehouses and data marts.
And we'll learn about the
basic concepts and architecture
associated with dimensional or
multi-dimensional databases.
We will also learn about
business intelligence systems,
online analytical processing,
and data mining, all of which
are very useful for helping
managers to make decisions.
And then toward the
end of today's lecture,
we will explore some
of the fastest growing
topics in the database
world, including
big data, the NoSQL
movement, structured storage,
and the MapReduce process.
So without further
ado, let's begin.
I thought a good
way of introducing
some of the topics
in this lecture
would be through an exploration
of human mathematical
intuition, that is,
do we as human beings
have a natural intuition for
finding truth in numbers?
To examine this question,
consider the following example.
Here we have two people, whom
I've named Bubba and Daisy.
And both Bubba and Daisy
are interested in purchasing
new vehicles.
Currently, Bubba drives a
very inefficient vehicle.
Let's say that he
drives an old truck.
And he gets 10 miles per gallon.
Whereas Daisy drives a
reasonably efficient car.
And in her case,
her current vehicle
gets 30 miles per gallon.
Now let's say that
both Bubba and Daisy
are interested in purchasing new
vehicles that will have better
fuel economy than their
current vehicles with the goal
of needing to purchase
less gasoline every year,
thereby saving money.
Bubba, however, does
not want to give up
all of the benefits
of his big truck,
so the vehicle he's looking
at provides just 12 miles
per gallon, as opposed
to his current vehicle
which provides 10.
Daisy, on the
other hand, is very
interested in fuel economy.
And so she is planning to
purchase a vehicle which
delivers 50 miles per
gallon, as opposed
to her current
vehicle, which has 30.
Another way of
considering these numbers
is by looking at the percentage
increase in miles per gallon.
In Bubba's case, he is
moving from a vehicle which
has 10 miles a gallon to one
which has 12 miles per gallon,
and that is an increase of
20% in the miles per gallon.
Daisy, by contrast, is
moving from a vehicle
with 30 miles per
gallon to one which
provides 50 miles per gallon.
And that is an
increase of 66.7%.
Next, for the sake of offering
a fair comparison between Bubba
and Daisy, let's say
that both drivers drive
10,000 miles per year.
Now at this point, I
want to appeal to your
mathematical intuition.
Based upon the situation
that I've described,
I would ask you a
simple question,
which of these drivers,
Bubba or Daisy,
is going to save more money
on gasoline every year?
Well, if you are
like most people,
you will look at these
numbers and you'll
say that clearly Daisy's vehicle
has a much greater increase
in miles per gallon,
therefore, she
will be saving the most
money every year on gasoline.
Unfortunately, this
is the wrong answer.
In fact, Bubba will
be saving more money
every year on gasoline.
And let's see why.
Here we have a
table which shows us
how much gasoline
both Bubba and Daisy
are currently consuming
based upon their driving
habits versus how much they
will consume if they purchase
their new vehicles.
In Bubba's case, he drives
10,000 miles per year.
And in his current vehicle,
he gets 10 miles per gallon.
Therefore he will consume 1,000
gallons of gasoline per year.
In his new vehicle,
Bubba will still
drive 10,000 miles
per year, but now will
get 12 miles per
gallon, and therefore
will consume 833.33 gallons
of gasoline per year.
By contrast, in Daisy's
current vehicle,
she gets 30 miles per gallon.
And driving 10,000
miles per year,
she consumes 333.33
gallons of gas.
Whereas with her new vehicle,
which gets 50 miles per gallon,
she will consume 200 gallons
of gasoline every year.
Simple subtraction,
then, shows that Bubba
will save 166.67
gallons of fuel per year
if he purchases his new vehicle.
Whereas Daisy will only
save 133.33 gallons
of fuel per year.
That is, Bubba will enjoy
more fuel savings per year
by buying a new truck, which
has 12 miles per gallon,
versus his old truck, which
had 10 miles per gallon,
than will Daisy by
buying a new car which
has 50 miles per gallon,
versus her old car which
provided 30 miles per gallon.
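The arithmetic behind this result is easy to check for yourself. Here is a minimal Python sketch of the calculation (the function and variable names are just for illustration):

```python
# Annual fuel savings for two drivers, each driving 10,000 miles per year.
MILES_PER_YEAR = 10_000

def gallons_used(mpg: float) -> float:
    """Gallons of gasoline consumed per year at a given fuel economy."""
    return MILES_PER_YEAR / mpg

# Bubba: 10 mpg -> 12 mpg
bubba_savings = gallons_used(10) - gallons_used(12)   # 1000.00 - 833.33

# Daisy: 30 mpg -> 50 mpg
daisy_savings = gallons_used(30) - gallons_used(50)   # 333.33 - 200.00

print(f"Bubba saves {bubba_savings:.2f} gallons per year")  # 166.67
print(f"Daisy saves {daisy_savings:.2f} gallons per year")  # 133.33
```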
So if you're like most
people, in this problem
your mathematical
intuition has failed.
And it is for this
reason that I would
argue that managers should
not rely on their intuition
when making important
managerial decisions,
but rather should rely on data.
Many managers have
begun to recognize
the deficiencies
in human intuition,
and are hence
adopting technologies
like business analytics,
business intelligence, data
warehousing, and data mining,
so as to better support
their decision-making
activities with the data
that the organization
has at its disposal.
By the way, if
you're interested,
this problem is known as the
miles per gallon illusion,
and stems from the fact
that we here in the United
States commonly measure fuel
economy in miles per gallon,
as opposed to the much more
appropriate measure which
would be gallons per mile.
Now that we've
seen for ourselves
how human intuition
can fail, in even very
simple mathematical
problems, we can
begin discussing some of the
systems and technologies that
have been developed to help
managers make better decisions
and avoid the failures
that are commonly
associated with relying
upon intuition alone.
First among these technologies
are business intelligence,
or BI, systems.
BI systems are information
systems that are intentionally
designed to allow
managers to capitalize
upon organizational data
for the purpose of improving
their decision making.
It's important to note that
these business intelligence
systems do not support
the real-time operational
activities of the organization.
Those activities, by contrast,
are supported by transaction
processing systems.
Instead, these BI
systems are designed
to support managerial assessment
activities, analysis, planning,
controlling, et cetera.
Broadly speaking, we can
classify these BI systems
into two categories.
First, are simple reporting
systems, the purpose of which
is to provide managers
with the ability
to look at data in a
flexible, real-time way, that
is, these reporting
systems support simple data
organization capabilities,
such as sorting, filtering,
and grouping data.
And they often
provide the ability
to make simple calculations
on data in real time.
These simple calculations
might include
operations such as a sum,
an average, or a count.
By contrast, the second
category of BI systems
is data mining applications.
Data mining applications
are typically designed
not to rely on real-time data,
but rather on
archived historical data.
And the reason for this is
that data mining applications
typically allow for
sophisticated analyses
on an organization's data.
Because these analyses
involve complex statistical
and mathematical processing,
they typically cannot be
conducted in real time.
The advantage, of course,
of such statistical and
mathematical
techniques, is that they
can deliver insights and create
predictive models that simply
are not possible with the
simple types of calculations
that are available
within reporting systems.
This figure depicts
the relationships
that exist between operational
and BI applications.
Operational applications
are used by functional users and
rely upon the operational DBMS.
Business intelligence
applications
are used by managerial
users, and can
rely upon the operational
DBMS, or a specialized
DBMS that is made just for
the business intelligence
applications.
And by extension,
these BI applications
can hence rely directly upon
the operational database
by way of the operational
DBMS, for example,
in the case of
reporting applications.
While they also might rely
upon archived historical data,
or other data sources,
which are typically
made available in the form of a
data warehouse or a data mart.
As a quick review, just remember
that simple BI reporting
applications typically
rely on an organization's
operational data.
And they provide the
ability to look at data
in a simple way in real time.
By contrast, data mining
applications typically
rely on archived historical
data, and as such,
do not provide a real time
view of the organization.
The trade-off for
this time lag is
that data mining
applications can
use sophisticated statistical
and mathematical techniques
to create models which allow
managers to perform what-if
analyses, make predictions
about the future,
and generally speaking,
improve their decision making.
As I noted earlier, business
intelligence applications
that provide capabilities which
go beyond basic reporting,
typically rely upon extracts
of the organization's
operational database,
along with data acquired
from other sources, all of which
can be aggregated and stored
in a data warehouse.
Thus a data warehouse
commonly contains
data from many
different sources.
Not only does it contain
data from the organization's
operational
databases, but it can
contain other internal
and external data as well.
In the case of external
data, an organization
may be able to obtain
such data from publicly
available sources.
Or they may purchase data.
Examples of these
external data sets
might include information about
what competitors are doing,
what the market is
doing, or expectations
about future global trends.
Together, data from all
of these various sources
are run through an
ETL system, where
ETL stands for extract,
transform, and load,
so as to clean and prepare
the data for inclusion
in the data warehouse.
After this process,
the data can actually
be added to the data
warehouse itself.
And then our more complex
business intelligence
applications will
have a source of data
upon which they can rely
when performing their tasks.
Despite our best efforts at
designing relational databases
that will ensure the quality
and integrity of the data
that they contain,
it is, unfortunately,
still possible for problematic
data to appear in the database.
Further, because our business
intelligence applications
may rely upon these data in
support of managerial decision
making, it is
critically important
that the data be of the
highest quality possible, such
that managers will
have the greatest
chance possible of
making good decisions.
Here, we're simply referring to
the classic and cliched concept
of garbage in,
garbage out, that is,
if we are providing our managers
with low quality or problematic
data with which to support
their decision making,
then we must expect
the resulting decisions
to be similarly problematic.
So let's examine some
of the problems that can
arise in operational databases.
A very common problem is
what we call dirty data.
And dirty data refers
to a data value
which is obviously incorrect.
As an example, we might have
the letter v stored as a gender
code, instead of the more common
m or f, for male and female.
Or we might have a value
of age stored as 213,
which would be a remarkably old
human being if that number
were correct.
Other problems with
operational data
include missing values
and inconsistent data,
where inconsistent data refer
to data values that have changed
and are not the same across
all of our data sources.
So perhaps a
customer's phone number
was updated in the
operational database.
And the previous value
in the data warehouse
is, therefore, incorrect.
Additional problems
with operational data
include non-integrated
data, that is,
when we have data from
two or more sources
that we need to merge
together in some way
so that they can be added
into the data warehouse.
We may also have data
in an incorrect format,
in which case, it will need to
be transformed into the format
that is required by
the data warehouse.
And of course, we may
simply have too much data.
There is a general concept in
science known as parsimony.
And this concept tells us that
simplicity is often preferred.
For example, suppose I construct
a predictive model
that is 95% accurate, and it
relies upon three predictor
variables in order to
achieve that 95% accuracy.
I might improve
the accuracy to 96%
by adding 20
additional variables,
but in most cases, the additional
complexity involved in order
to achieve such a marginal
gain in predictive power
would not be worth it.
So in that case, we would
prefer the three predictor
model over the model which
contains 23 predictors.
As I mentioned earlier,
because a data warehouse often
contains data from
multiple data sources,
the input data commonly need to
be run through an ETL process
before they can be stored
in the data warehouse.
Again, ETL here stands for
extract, transform, and load.
Where the extract step
is simply pulling data
from these various data sources.
The transform step is
cleaning, or modifying,
or processing the
data in such a way
that they are made
appropriate for inclusion
in the data warehouse.
And then the load step
refers to the process
of taking the transformed,
cleaned, processed data
and actually storing it
in the data warehouse
so that they can be used by
whichever business intelligence
applications might rely
upon that data warehouse.
Simple examples of
such transformations
might be transforming a country
code into a country name.
So we may have the country
code US, for example.
And in the data
warehouse, we need
to transform that into the
name of the country, which
might be United States.
Or we may have a
customer's email address,
such as dan@dan.com.
And we actually just want
to store the email domain
in the data warehouse for
purposes of aggregation.
In that case, we would want to
transform the customer's email
address simply into the
domain, which in this case
would be dan.com, and store the
result in the data warehouse.
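As a rough illustration of what such a transform step might look like, here is a toy Python sketch. The mapping table, field names, and the transform function itself are hypothetical, not taken from any real ETL tool:

```python
# A toy version of the "transform" step in ETL: map country codes to names
# and reduce email addresses to their domain. The mapping table and field
# names are illustrative.
COUNTRY_NAMES = {"US": "United States", "CA": "Canada"}

def transform(record: dict) -> dict:
    """Clean one source record into the shape the warehouse expects."""
    return {
        "country": COUNTRY_NAMES.get(record["country_code"], "Unknown"),
        "email_domain": record["email"].split("@", 1)[1],
    }

row = {"country_code": "US", "email": "dan@dan.com"}
print(transform(row))  # {'country': 'United States', 'email_domain': 'dan.com'}
```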
Next, I would like to talk about
the concept of a data mart.
And I think the best
way to understand
a data mart is simply that it is
a subset of the organization's
data warehouse.
Data marts are constructed
to support a specific need
within the organization.
So this might be a subset
of the data warehouse
that is needed to support
a particular project,
or a particular functional
area within the business,
like advertising or
marketing, for example.
Or perhaps, we need
to create a data mart
to support a specific
group of employees
within our organization,
like a sales team.
Regardless of the
specific reason
why we create a data mart,
the general principle
underlying their
creation is simply
that not all personnel,
or not all managers,
within the organization
will need access
to all of the organization's
archived historical data.
Personnel within
the organization
who perform new product
development, for example,
probably will not need
access to data associated
with human resources, such as
employee salaries or employee
benefits.
Instead we might
create a data mart just
for the new product
development team,
which contains only
those data that
directly support their needs.
At some point in your adventures
in the database world,
you may have heard the
term dimensional database,
or multi-dimensional database.
And I'd like to take a few
moments to talk about some
of the concepts associated
with these types of databases.
To begin, it's important to note
that dimensional databases are
designed and implemented
using exactly the same sort
of database technology
that we use to create
our operational databases.
That is, dimensional
databases contain tables.
They are related
to each other using
primary key-foreign key links.
We have the same concepts
of cardinality,
such as one-to-one
relationships, one-to-many
relationships,
et cetera.
So hopefully, operating
within the familiar framework
of the relational
database world,
understanding these
dimensional databases
will be reasonably easy.
Broadly speaking, the idea with
a multi-dimensional database
is that we want to implement
a non-normalized database
structure for the purpose of
vastly improving query speed.
That is, in an
operational database,
we typically implement
a database design
that is largely
normalized, that is,
it might be in
third normal form,
or Boyce-Codd normal form,
with perhaps a few tables being
denormalized for the purpose
of improving efficiency.
And what a normalized
database allows us to do
is to store large
amounts of data
very quickly, while
still preserving
the quality and the integrity
of the data in the database.
The problem with this sort
of rigid, normalized design,
however, is that if we want to
extract data from the database,
we commonly need to perform
computationally expensive join
operations in order to get
the information that we want.
So a normalized
relational database
is very good for quickly
storing information,
but is very bad for quickly
extracting information
that we want.
By contrast, with
the sort of design
that is implemented in a
multi-dimensional database,
storing the data
in the database can
be a very slow and
laborious process.
However, extracting data from
the database is very fast.
And the reasons for this
are that we implement
a non-normalized
database structure,
while simultaneously
storing data
in pre-aggregated
levels of granularity
within the dimensional database.
An important point
to note here is
that these dimensional
databases are
used to track an
organization's historical data,
and therefore they almost
always contain a date or time
dimension.
And it is this date
or time dimension
that provides us
with the ability
to store the same
information aggregated
at different levels
of granularity
within the
multi-dimensional database.
For example, imagine that we
have operational data which
represents all of the sales
transactions for a store.
So every time a
sale is made, we may
generate a sale record,
which records information
about that transaction.
Now, this information
is useful, however,
it may not be as
useful to a manager
as it would be if it were
aggregated up to a coarser
level of granularity.
Consider that if I were to
take all of the sales data
for one day and add
them all together, then
I have a daily total.
Similarly, if I take
the daily totals
for seven consecutive days,
then I have a weekly total.
I can, in such a way,
continue to create
monthly totals, quarterly
totals, yearly totals, et
cetera.
It is the same
information that is
available in the individual
sales transactions,
except it has been stored
in the database in a
pre-aggregated form,
such that we can vastly
improve query speed.
That is, when a
query is run where
we want to look at the data
in the form of weekly totals,
or monthly totals, or quarterly
totals, the database at
that time does not
need to aggregate all
of the individual
sales transactions
in order to produce the result.
The result already exists
in the dimensional database
because it has
been pre-processed
prior to being added
into the database.
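To make the idea of pre-aggregation concrete, here is a small Python sketch that rolls hypothetical sales transactions up into daily and then weekly totals at load time, so that a later query is a simple lookup:

```python
from collections import defaultdict
from datetime import date

# Hypothetical sales transactions: (date, amount).
sales = [
    (date(2023, 1, 2), 40.0),
    (date(2023, 1, 2), 60.0),
    (date(2023, 1, 3), 25.0),
    (date(2023, 1, 9), 10.0),
]

# Pre-aggregate to daily totals once, at load time ...
daily = defaultdict(float)
for day, amount in sales:
    daily[day] += amount

# ... then roll the daily totals up to weekly totals (ISO week).
weekly = defaultdict(float)
for day, total in daily.items():
    year, week, _ = day.isocalendar()
    weekly[(year, week)] += total

# A query for "week 1 of 2023" now reads a stored value instead of
# re-summing every individual transaction.
print(weekly[(2023, 1)])  # 125.0
```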
So again, the purpose of
these dimensional databases,
then, is to
intentionally implement
redundant data in a
non-normalized design,
such that we can vastly
improve query speed.
We want to be able
to extract data
very quickly from the database.
The most common data model
for a dimensional database
is known as a star schema.
And the general characteristics
of a star schema
are that we have several
dimension tables.
In this case, we have
a time dimension table,
a customer dimension table,
and a product dimension table.
And at the intersection of
all of those dimension tables,
we have something
called a fact table.
Philosophically
speaking, the reason
that the intersection table
at the center of the dimension
tables is called
a fact table, is
because a fact in
very real terms
is the intersection
of information.
For example, imagine that
I'm interested in knowing
how many of a particular product
a specific customer purchased
during the month of December.
Well, the answer
to that question
is the intersection
of the customer,
the product in which
I'm interested,
and the specific time frame
that I specify, in this case,
the month of December.
And the answer
might be 14 units.
So a specific customer
purchased 14 units
of a particular product
during the month of December.
The intersection of those
three values is a fact.
And it is for this
reason that we
label the table at the center
of a star schema, a fact table.
To help you better
visualize this concept,
let's consider the
intersection of two dimensions.
And in this example,
we're looking
at the customer
dimension contrasted
with the product dimension
in the form of a two
dimensional matrix.
The value contained in each
cell within this matrix,
then, is a fact.
And it expresses
to us the quantity
of a particular product
that was purchased
by a particular customer.
And you will notice,
of course, if you
recall from our last lecture,
that this structure is
very similar to a bitmap index.
Extending this concept out
into a third dimension,
we can see here that
we're representing a fact
as the intersection of three
different dimensions in a three
dimensional matrix.
Along the horizontal axis
I, again, have customers.
Along the vertical axis
I, again, have products.
But now, along the z-axis,
I have a time dimension.
Therefore, the value
contained in any
of the cells in this
three dimensional matrix
will tell me the quantity
of a given product that
was purchased by a
particular customer
during a particular
date or time frame.
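One way to picture this is to model the multi-dimensional matrix as a lookup table keyed by dimension values. In this Python sketch, the customer, product, and month names are made up for illustration:

```python
# The three-dimensional matrix as a dictionary keyed by
# (customer, product, month); each stored value is a fact.
facts = {
    ("Alice", "Widget", "December"): 14,
    ("Alice", "Widget", "November"): 9,
    ("Bob",   "Gadget", "December"): 3,
}

# "How many Widgets did Alice buy in December?" is a single lookup
# at the intersection of the three dimension values.
print(facts[("Alice", "Widget", "December")])  # 14
```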
Now unfortunately, human
beings have a great deal
of difficulty
envisioning higher order
spaces beyond three dimensions.
However, this concept
scales very easily up
to higher dimensional spaces.
So we might consider,
for example, a fact
to be the intersection of four
dimensions, or five dimensions.
And although it is not easy
to depict such a situation,
conceptually, it is
just a natural extension
of the two-dimensional and
three-dimensional examples
we saw here.
In either case, I
hope you can now
understand why databases
designed in this way
are called dimensional or
multi-dimensional databases.
Next, I'd like to talk briefly
about OLAP and data mining
technologies.
If you recall from
earlier in the lecture,
we said that,
generally, there are
two broad categories of business
intelligence applications.
And they were, reporting
applications, and data mining
applications.
Online analytical
processing, or OLAP, then,
is a technique that supports
these reporting applications.
That is, OLAP allows us to
dynamically examine database
data in real time, and
apply simple transformations
like sorting, filtering,
grouping, et cetera.
And it allows us to perform
simple arithmetic functions,
such as summing values together,
finding an average, a count,
the standard
deviation, et cetera.
And again, this is intended
to be used in real time.
By contrast, data
mining techniques
support the data mining category
of business intelligence
applications.
And data mining, broadly,
refers to a collection
of mathematical and
statistical methods
that can be used to gain deep
insights into an organization's
data.
Again, remember that the
level of sophistication
of these techniques generally
requires that they not
be executed in real
time, so as to avoid
interfering with the real time
operations of the organization.
OLAP systems, then, when used in
support of simple BI reporting
needs, produce
something called an OLAP
report, which some people
will refer to as an OLAP cube.
And the general idea
here is that our inputs
are a set of dimensions,
while our outputs are
a set of measures.
So recalling the two dimensional
and three dimensional matrices
that we saw just
a few moments ago,
a manager might select
a series of dimensions.
And the OLAP system
might allow him or her
to perform simple
transformations
or drill down operations
on the data which
lie at the intersection
of those dimensions
so as to gain real time
insights into the organization.
And here we see that
these OLAP cubes
can be constructed using our
standard SQL SELECT queries.
In this case, we're selecting a
number of different dimensions.
We are then performing a
join on four separate tables,
and are imposing some group
by and order by requirements.
The result of this query
in OLAP terminology,
then, would be a
result set which
represents a
collection of measures
that a manager could use
to gain some insights
into his or her organization.
And of course, rather than
constructing these SQL commands
repeatedly, we
might take advantage
of the capability of relational
databases to create views,
so as to save the
SQL statements, which
are used to produce common OLAP
cubes in the database itself.
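To illustrate the shape of such a query, here is a self-contained Python example using SQLite. The star schema here is a deliberately tiny, hypothetical one; the point is the SELECT ... JOIN ... GROUP BY pattern that produces the measures:

```python
import sqlite3

# A miniature, made-up star schema: one fact table joined to two
# dimension tables, then grouped to produce measures.
db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE customer_dim (customer_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE product_dim  (product_id  INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE sales_fact (
        customer_id INTEGER REFERENCES customer_dim,
        product_id  INTEGER REFERENCES product_dim,
        quantity    INTEGER
    );
    INSERT INTO customer_dim VALUES (1, 'Alice'), (2, 'Bob');
    INSERT INTO product_dim  VALUES (1, 'Widget'), (2, 'Gadget');
    INSERT INTO sales_fact   VALUES (1, 1, 5), (1, 1, 9), (2, 2, 3);
""")

# Each output row is a measure at the intersection of the dimensions.
cube = db.execute("""
    SELECT c.name, p.name, SUM(f.quantity) AS total
    FROM sales_fact f
    JOIN customer_dim c ON c.customer_id = f.customer_id
    JOIN product_dim  p ON p.product_id  = f.product_id
    GROUP BY c.name, p.name
    ORDER BY c.name, p.name
""").fetchall()

for row in cube:
    print(row)  # ('Alice', 'Widget', 14), then ('Bob', 'Gadget', 3)
```

In a production database, this statement would typically be saved as a view so the cube can be re-queried without rewriting the SQL.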
Data mining, then, can be
viewed as the convergence
of many different disciplines.
A skilled data miner needs
not only working knowledge
of databases, but also needs
statistical and mathematical
knowledge, perhaps knowledge
of artificial intelligence,
or machine learning algorithms,
knowledge of data management
technologies, and so forth.
In the modern world, many
people become highly specialized
in one particular area.
But the people who are most
valuable to an organization
often have expertise
in two or more areas.
And this is certainly
the case with people
who are experts at data mining.
To conclude our
overview of data mining,
I just wanted to
briefly describe
some of the most
common techniques that
are used to perform data mining
against an organization's data.
Among these techniques
are cluster analysis,
in which case, the objective
is to group sets of entities
together according to
their level of similarity
along one or more dimensions.
We also have decision
tree analysis,
in which we can process a large
quantity of historical data
and generate a
decision tree, which
tells us what to do under
different circumstances,
in order to achieve some
kind of desired result.
We also have regression
available to us
as a very powerful
data mining tool.
The goal of which is to
produce mathematical equations,
or mathematical models, that not
only describe the relationships
between variables,
but also provide us
a basis for predicting
future events based
upon past observations.
Data mining
applications might also
rely on sophisticated artificial
intelligence, or machine
learning algorithms, such as
neural networks or support
vector machines.
And recently, we've seen a
rise in a technique known
as market basket analysis,
or affinity analysis, which
allows us to look for
patterns of co-occurrence,
for example, determining
which products are commonly
purchased together.
And the results of
these affinity analyses
can then be used as the
foundation of a recommendation
engine, which can suggest
movies that you might like,
books that you might
like, et cetera.
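As a flavor of how a market basket analysis begins, here is a bare-bones Python sketch that counts how often each pair of products appears together in the same transaction. The baskets are made-up example data, and a real affinity analysis would go on to compute measures such as support, confidence, and lift:

```python
from collections import Counter
from itertools import combinations

# Hypothetical transactions: each basket is the set of products
# purchased together in one sale.
baskets = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"milk", "eggs"},
    {"bread", "butter", "eggs"},
]

# Count co-occurrences of every product pair across all baskets.
pair_counts = Counter()
for basket in baskets:
    for pair in combinations(sorted(basket), 2):
        pair_counts[pair] += 1

# The most frequently co-purchased pair is a candidate for recommendation.
print(pair_counts.most_common(1))  # [(('bread', 'butter'), 3)]
```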
Now I'd like to move into
the final topic in our course
on databases, and that is the
rise of the big data paradigm.
Scientists and
researchers have recently
noted an exponential increase
in the quantity of data being
produced by the human species.
If my memory serves
correctly, I believe
the current rate of growth
is that the amount of data
doubles every 40 months.
At present, the world is
generating many exabytes
of new data every single day.
And if you're unfamiliar
with the term exabyte,
consider that one exabyte
is slightly more than one
million terabytes.
So you may have a
computing device at home
that saves several
terabytes of data.
But consider that several
million terabytes of new data
are being generated by the
human species every single day.
And this situation
creates a vast array
of new and interesting
problems for organizations.
The term big data, then, refers
to the rapidly expanding amount
of data that is being stored
and used by organizations.
These data sets can be very
large and very complex.
And because of their
size and complexity,
they can be extremely
difficult to process
using traditional
database technologies.
And an important
point to note is
that much of what is
considered big data
is being generated by web 2.0
applications, and the emerging
collection of web
3.0 applications.
Traditional examples
of web 2.0 applications
might include social networking
sites, video sharing sites,
blogs, discussion
forums, et cetera.
So this rapidly accumulating
quantity of data
presents many challenges
for organizations.
Among these are simply
capturing all of the data
and storing it, maintaining
the data once we have it.
This is also commonly
referred to as curation,
in the same way that
the curator of a museum
must maintain all of the
ancient artifacts, so, too,
must the curator
of a big data set
be able to maintain the quality
and the integrity of the data
in light of things
like failing hardware,
and the need for the data to
be used by many people from all
over the world simultaneously.
Additional challenges include
things such as search.
How does one search
efficiently through such
an enormous quantity of data?
Data transfer is another
challenge: consider, for example,
that if you have a 100
megabit network connection
you can transfer
approximately one terabyte
of uncompressed data per day.
At this speed, it would take
you more than a million days
to transfer one exabyte of data.
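That transfer-time claim is easy to verify with back-of-the-envelope arithmetic. The sketch below uses decimal units (10^18 bytes per exabyte), which gives a figure just under a million days; with binary units (an exbibyte) the result comes out slightly over a million:

```python
# Back-of-the-envelope check: a 100 megabit/s link moves roughly one
# terabyte of uncompressed data per day, so an exabyte takes on the
# order of a million days.
LINK_BITS_PER_SEC = 100e6   # 100 megabits per second
SECONDS_PER_DAY = 86_400

bytes_per_day = LINK_BITS_PER_SEC / 8 * SECONDS_PER_DAY
terabytes_per_day = bytes_per_day / 1e12
print(f"{terabytes_per_day:.2f} TB/day")            # ~1.08 TB/day

EXABYTE = 1e18              # bytes, decimal definition
days_for_exabyte = EXABYTE / bytes_per_day
print(f"{days_for_exabyte:,.0f} days per exabyte")  # ~926,000 days
```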
Further challenges
include analyzing
these massive data
sets, visualizing
the massive quantities
of data, and so forth.
In the past few
years, a term has
arisen in the area
of big data that
is used to describe the
movement toward using
non-relational
databases in order
to support these huge
and highly distributed,
highly replicated
collections of data.
And this term is called NoSQL.
Although the name,
to many people,
implies that SQL is not involved
in these databases, that
is, they do not support
SQL-like queries,
this assumption
is actually wrong.
As it is used in contemporary
database circles,
NoSQL means "not only SQL"; that
is, although these very large
databases may not be based
on relational algebra
in the same way that a
relational database is,
they nevertheless
support querying
through a SQL-like
query language.
Unlike the relational
database world,
where the relational model
is fixed and predominates
across all relational database
vendors, in the NoSQL world,
there are many
different architectures
for non-relational databases
that are currently being used.
These include
architectures which
rely upon a key-value
store, a wide columnar
store, or a document store.
There are databases that
rely upon graph theory,
and so forth.
Collectively, all of
these different types
of very large data
stores are commonly
referred to as
structured storage.
And they have a few
attributes in common.
First, they arguably employ
simpler designs than
relational databases.
And second, they almost always
have a looser consistency model
than one will find in
a relational database.
Another way of saying that
is these structured storage
databases do not
provide ACID guarantees.
If you remember,
ACID is an acronym
which stands for Atomicity,
Consistency, Isolation,
and Durability.
And ACID guarantees
are the hallmark
of a normalized
relational database.
We cannot expect to have that
level of consistency in these
massive, highly distributed,
highly replicated,
structured storage databases.
When discussing the
sort of data models
that are actually used in these
structured storage databases,
I like to use the
data model that
is employed by the
Apache Cassandra database
as an example.
And the reason for
this is that it
is one of the most
popular structured storage
database management systems.
And it is currently the most
popular wide columnar store
that is available.
Broadly speaking, the
Apache Cassandra database
can be classified as a hybrid
key-value / wide-columnar
database.
So its architecture
contains elements
of both a key value store
and a wide columnar store.
The Apache Cassandra database
itself was originally
created at Facebook by two
of their software architects,
after which it was transferred
to the Apache Foundation, where
it now resides as an entirely
open-source and free database.
Apache Cassandra has
cross-platform support.
The reason for this is
that it was written in Java.
So it can run on
Linux-based machines,
Windows-based machines,
Unix, et cetera.
And Cassandra supports a
massively distributed database
environment.
That is, it allows us to
subdivide our database
among dozens, or
hundreds, or even
thousands of separate
database servers,
potentially spread out
all over the world.
The database is highly
scalable and decentralized.
By scalable here, I
mean it's extremely easy
to add an extra node, that is,
an extra server to the cluster,
thereby expanding the
size of the database.
And by decentralized,
what I mean here
is that all of the nodes,
that is, all of the database
servers, that are involved in
an Apache Cassandra database,
have the same role.
And this provides the very
desirable characteristic
of there being no
single point of failure.
Another very valuable
characteristic
of the Apache Cassandra
model is that it provides
for automatic data replication.
That is, the database
itself can automatically
make copies of data
and store those copies
in different locations
throughout the cluster.
This makes the database
highly fault tolerant, such
that if an individual node,
that is an individual database
server, were to fail, the
redundant data stores take over
instantaneously.
There's no down time
with the database at all.
Further, Apache Cassandra
supports the MapReduce process,
which is a computational model
for solving data processing
problems in a highly
distributed environment.
And I'll talk more about
the MapReduce process here
in a few minutes.
And to illustrate the legitimacy
of the Apache Cassandra model,
consider that it is currently
used by CERN and by organizations
such as Constant Contact, Digg,
Instagram, Netflix, Reddit,
Walmart, Twitter, et cetera.
Now let's talk about the
Cassandra data model itself.
As you know, in a relational
database management system,
related data for an
application are stored together
in a container which is
referred to as a database,
or sometimes as a schema.
And within that
database or schema,
we have one or more tables.
The analogous
concept in Cassandra
is something called a key space.
That is, data for an
application are stored together
in a container known
as a key space.
And inside of that key
space, instead of tables,
we have something known
as column families.
So just as in the
relational database world,
a single DBMS might contain
many databases, each of which
contains many tables.
In the world of
Apache Cassandra,
the Cassandra
database might contain
many key spaces, each of which
contains many column families.
The column families,
then, contain columns.
But this is not the
same as a column
in a relational database.
In Cassandra, a column
consists of a name,
that is, the name of the
column, a value, that
is, the data value
for that column,
and a time stamp, where
the time stamp indicates
the point in time at which
the data value was last changed.
Related columns, then, are all
stored within the same row.
And each row is identified
using a unique row key.
This notion of a
unique row key is
directly analogous to
the idea of a primary key
in the relational
database world.
Rows in the Cassandra
model, however,
are not required to contain the
same set or number of columns.
That is, different rows
within the same column family
might have a different
number of columns.
And the number of columns
in a particular row
is allowed to expand or
contract on an as-needed basis.
A few additional
important differences
to note between
the Cassandra data
model and the
relational data model
are that in the
Cassandra data model
there are no formal foreign key
relationships between column
families.
That is, we cannot establish
formal relationships between
column families within
the same key space.
And what's more, it is
not possible to join
column families
together using a query.
So whereas in the
relational database model
we can write queries that
will join tables together,
it is not possible in
the Cassandra model
to join column
families together.
Now, I know that this sort
of a verbal description
of the Cassandra data model can
be a bit challenging to follow.
So let's look at a
picture which I hope
will help to make some of
these concepts clearer.
Here we see a graphical
representation
of the Cassandra data model.
The outermost
container represents
all of the key spaces for
the Cassandra database.
And in this case, we have just
two key spaces, one of which
is labeled as the
blog key space,
and the other which is labeled
as the store key space, where
the details of the
store key space
are not elaborated
in this diagram.
Again, the idea here
is that a key space
is roughly analogous
to a database
within the relational
database world.
This means that a key
space, then, is typically
oriented toward
a particular need
or a particular application.
Within each key space, we
can have one or more column
families.
In this case, we have a
column family for users,
and we have a column
family for blog entries.
Next, let's look at an
example of a column family.
And to begin, I will refer
to the user column family.
So here, we can see that we have
three rows within the column
family.
And each row represents a unique
user within the blog key space.
A user, then, is
represented as a collection
of one or more columns.
And remember that, in
the Cassandra data model,
the number of columns per
row can vary from row to row.
So in this first row,
that is the Dan 42 row,
we see that we
have three columns.
The first column
is the name column.
Its value is Dan.
And we have a timestamp.
The second column
is the email column.
Its value is dan@dan.com.
Again, it has a timestamp.
And the third column is
the phone column, which
has a value and a timestamp.
For the next user, we
have only the name column.
And for the third user, we
have only the name and email
columns.
So there is no requirement
that each row contain
the same number, or even
the same type of columns.
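To make this concrete, the user column family from the diagram can be approximated with ordinary Python data structures. This is only a conceptual sketch, not the actual Cassandra API, and aside from Dan's name and email, the row keys, values, and timestamps below are invented placeholders:

```python
# Sketch of a Cassandra-style column family: each row key maps to a
# set of columns, and each column is a (value, timestamp) pair.
# Note that rows need not contain the same columns.

users = {
    "dan42": {
        "name":  ("Dan", 1325376000),
        "email": ("dan@dan.com", 1325376000),
        "phone": ("555-0142", 1325376000),
    },
    "user2": {
        "name": ("Pat", 1325380000),          # only one column
    },
    "user3": {
        "name":  ("Sam", 1325390000),         # name and email only
        "email": ("sam@sam.com", 1325390000),
    },
}

# Look up a column value for a given row key.
print(users["dan42"]["email"][0])  # -> dan@dan.com
```

Because each row is just a mapping from column names to values, adding or dropping a column on a single row expands or contracts that row alone, which is exactly the flexibility described above.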
Next, let's look at the
blog entry column family
within this blog key space.
Again, we see that each row
within the column family
contains multiple columns.
In this case, both rows
contain the same columns.
But again, that is
not a requirement.
Here, the columns are the text
column, the category column,
and the user column.
Note, particularly, that the
values stored in the user
column can be used to determine
which blog entries were written
by which users.
However, remember that
formal relationships
between column families
do not exist in Cassandra.
That is, we do not
formally establish
primary key, foreign
key relationships.
So I hope that looking at this
diagram of the Apache Cassandra
data model demystifies
things a little bit for you.
I know that learning about
these structured storage data
models for the first
time can be intimidating,
but I hope that
through this diagram,
you can see that it's
really not that complicated.
And I hope that is
encouraging for you.
As I mentioned earlier, these
structured storage databases
are often highly distributed
and highly replicated.
That is, they may be spread
across many, many different
nodes or database servers.
Now this structure has
substantial advantages.
Not only does it
provide fault tolerance,
but it allows for data
requests to be handled
by the nearest
available node that
is able to service the request.
So for example, if
you are in Tokyo,
and it happens that a copy
of the data in which you
are interested is stored on
one of my database nodes,
which is located
near Tokyo, it's
much more efficient for that
node to handle your request
than it would be to
route the request
to a distant geographical
node, say, one
which might be
located in Berlin.
The problem with
this model, however,
is that it can cause
problems with consistency.
Consider what happens when
a data item is updated.
So if I update a data
item on one node,
it will take time
for that update
to cascade to the other
nodes within the cluster that
contain a copy of the data.
So imagine that my distributed
structured storage database
contains 1,000 nodes,
spread all over the world,
and the data item I'm
interested in updating
is replicated across
100 of those nodes.
So I may then perform the
update on one of the nodes.
And until that update
cascades throughout all
of the other nodes
in the cluster,
any requests for
that same data item
that are made of
those other nodes
will return values
that are out of date.
And again, this
is due to the fact
that the database is
widely distributed
and widely replicated,
and to the fact
that we typically do not enforce
an ACID level of consistency.
Thus, in these replicated
data environments,
we commonly use a
consistency model
that is referred to as
eventual consistency.
And what eventual
consistency means
is that if no new updates
are made to a specific data
item for a period
of time, eventually
all of the requests
for that data item
will return the most
up to date value,
regardless of which node
is servicing the request.
And the time stamps that are
recorded during each item
update are the key
which allows us
to reconcile any inconsistencies
in replicated data
values between nodes.
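The reconciliation idea can be sketched in a few lines of Python. This is a toy illustration of using per-column timestamps to pick the winning value ("last write wins"), not Cassandra's actual conflict-resolution code:

```python
# Reconcile diverging copies of one data item across replicas:
# the copy carrying the latest timestamp wins.

def reconcile(replicas):
    """replicas: list of (value, timestamp) pairs for the same data
    item as seen by different nodes. Returns the newest value."""
    return max(replicas, key=lambda vt: vt[1])[0]

# Three nodes hold diverging copies of a user's email column.
copies = [
    ("dan@old.com", 1000),   # stale copy
    ("dan@dan.com", 2000),   # most recent update
    ("dan@old.com", 1000),   # stale copy
]
print(reconcile(copies))  # -> dan@dan.com
```

Once the update has cascaded and every replica carries the newest timestamp, every node returns the same value, which is precisely what eventual consistency promises.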
Finally, I would just like to
take a few moments to discuss
the MapReduce process.
Broadly speaking, MapReduce
is a programming model
that relies on
parallelization in order
to perform data processing tasks
on these huge data sets that
may be distributed across
many different servers,
or many different nodes.
So conceptually speaking,
then, the MapReduce process
involves two different
types of nodes.
There will be a master
node and a worker node.
Put simply, the
master node is usually
the node which receives
the data processing
request from the user.
While the worker
nodes are nodes which
are assigned to complete
part of the processing task
by the master node.
So this MapReduce process,
then, unfolds in two steps.
The first step is
called the map step.
And in the map step,
the master node
will take the data
processing problem
and subdivide it into a
series of sub problems.
And each of these sub
problems is then assigned to,
and carried out
by, a worker node.
The second step in
the MapReduce process,
then, is the reduce step.
So after the worker nodes have
completed their assigned tasks,
they pass the results of their
work back to the master node.
The master node will then
do the final processing,
or final combining,
of those results
in order to produce the overall
answer to the problem, which is
then returned back to the user.
Again, I know that
concepts such as this
can be difficult to
understand in verbal terms,
so let's see if we can get a
better idea using an image.
So toward the top of this
figure we have the master node.
And toward the bottom, we
have various worker nodes,
which here are labeled
one, two, three and n,
up to however many worker nodes
we need to solve the problem.
So the MapReduce process
unfolds as follows.
As input, the data
processing problem
is passed into the master node.
The master node will then divide
that data processing problem
into sub problems, which
are then assigned to
and carried out by the
various worker nodes.
After completing their
tasks, the worker nodes
will return their results
back to the master node, which
performs the final combining and
processing of the worker nodes'
results, in order to produce
the final answer, which is then
the output of the
MapReduce process.
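The two-step flow in the figure can also be sketched in Python. In a real cluster the sub-problems would run in parallel on separate worker nodes; this toy version simulates the workers with a simple loop, and the word-count task is just an illustrative stand-in:

```python
# Minimal sketch of the MapReduce flow: the master subdivides the
# problem (map step), the workers each solve a sub-problem, and the
# master combines their results (reduce step).

def map_reduce(problem_chunks, worker_fn, combine_fn):
    # Map step: each worker node handles one sub-problem.
    partial_results = [worker_fn(chunk) for chunk in problem_chunks]
    # Reduce step: the master combines the workers' results.
    return combine_fn(partial_results)

# Example: total word count across chunks of text.
chunks = ["big data systems", "data warehouses", "business intelligence"]
total = map_reduce(chunks,
                   worker_fn=lambda c: len(c.split()),
                   combine_fn=sum)
print(total)  # -> 7
```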
Let's consider an
example. Imagine
that we are a wireless
service provider,
and that we use a highly
distributed, structured storage
database, which has
1,000 different servers
all over the world.
Let's further assume
that our 100 million
customers are equally subdivided
among our 1,000 servers.
So that means we would have data
for 100,000 customers per node
within our database environment.
Now let's imagine that our
data processing problem
is that we want to figure out
the average number of text
messages sent during
the month of November.
And we want those
results organized by age.
So we would like to know what
is the average number of text
messages sent by 18-year-olds,
and 19-year-olds,
and 20-year-olds, and
21-year-olds, and so
forth, all the way up
until our oldest customers.
Now let's see how
the MapReduce process
can be used to solve
this data problem.
First, the problem is
passed to the master node.
And the master node might
subdivide the problem such
that it instructs each
of the 1,000 worker
nodes within our
database environment
to count the total number
of text messages sent
by each customer during
the month of November,
and aggregate those
results by age.
The results of the
worker nodes' tasks,
then, would be a
table of data, which
might contain three columns.
First would be all of the
distinct ages of the customers
whose data resides on that
particular worker node.
The second column
might be the number
of customers who are that age.
So we might have 1,000
18-year-olds, 714 19-year-olds,
235 20-year-olds, et cetera.
And then, the total
number of text messages
sent by customers of
each particular age.
So perhaps, 18-year-olds sent
10 million text messages.
19-year-olds sent 9,800,000
text messages, and so forth.
So each worker node
performs this task
for all of the customers whose
data are stored on that node.
And those results,
then, are returned back
to the master node.
The master node will
then combine the results.
So it will calculate, for
example, the total number
of 18-year-olds and the total
number of text messages sent
by 18-year-olds.
After which it can divide
those two numbers in order
to produce the
average number of text
messages sent for 18-year-olds.
That process is simply repeated
for customers of every age.
And we then have the results,
which we can send back
to the user who requested them.
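The whole text-message example can be simulated in a few lines of Python. The customer ages and message counts below are invented for illustration, and two tiny "nodes" stand in for the 1,000 servers:

```python
# Worker step: each node counts, per age, how many of its customers
# are that age and how many messages they sent in total.
# Master step: sum the per-node counts and divide to get averages.

from collections import defaultdict

def worker(node_rows):
    """node_rows: list of (age, messages_sent) for one node's customers."""
    summary = defaultdict(lambda: [0, 0])  # age -> [customers, messages]
    for age, messages in node_rows:
        summary[age][0] += 1
        summary[age][1] += messages
    return summary

def master(worker_results):
    totals = defaultdict(lambda: [0, 0])
    for result in worker_results:
        for age, (count, messages) in result.items():
            totals[age][0] += count
            totals[age][1] += messages
    # Average messages per customer, organized by age.
    return {age: msgs / count for age, (count, msgs) in totals.items()}

# Two tiny "nodes" instead of 1,000 servers.
node1 = [(18, 120), (18, 80), (19, 50)]
node2 = [(18, 100), (19, 150)]
averages = master([worker(node1), worker(node2)])
print(averages)  # -> {18: 100.0, 19: 100.0}
```

Notice that each worker only ever touches the customers stored on its own node, so all 1,000 workers could run at the same time; only the small per-age summaries travel back to the master.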
So I hope that you can
appreciate that this MapReduce
process is a very clever way
of efficiently handling data
processing problems on
distributed database
environments by taking
advantage of parallelization
in order to solve the problem.
Well, my friends,
thus ends our overview
of big data, data warehousing,
and business intelligence
applications.
And more broadly speaking, thus
ends our series of lectures
on database technologies.
It is my sincere
hope that you have
found these lectures useful,
and most importantly,
that you now have
the self-confidence
to go out and start creating and
working with these databases.
It has been a great pleasure
to lead you on this journey,
and I hope that you
have a great day.
