To make better decisions faster, managers
require useful information that is readily
available and flexible enough for analysis.
This requirement isn't trivial, mostly because
of the gap that exists between the mountain
of raw data that companies collect, and the
information that business people need.
To bridge this gap, companies have to realize
how information can be used for business analysis,
and understand how computer systems convert
raw data into useful information.
To understand what makes information useful
for business analysis, let's walk through
a simple example of business intelligence
in action.
A major US beverage company is trying to break
into the German market.
They're selling four different flavors of
soda: cola, cherry, grape, and lemon-lime;
and they're interested in what's going on
in four major markets: Munich, Frankfurt,
Cologne, and Berlin.
They want to begin by analyzing sales - a
pretty reasonable place to start.
This table shows their sales information for
the last two quarters of 2013.
It looks like the company has identical sales
performance in both.
The next step in their sales analysis may
be to examine the same sales data by a few
other perspectives, including the flavor of
soda being sold and the market where it was
sold.
Hopefully you noticed that the total sales
are the same - $32,000 - for all three views.
This is a reassuring sign.
It means you can be confident that you're
looking at the same information - the soda
sales of the beverage company - but with each
view broken out into different categories.
Let's take a minute to consider what we're
looking at.
These are the overall soda sales broken out
by three distinct categories: time, geography,
and product.
These categories are called dimensions.
Based on these data tables, it's not real
obvious what question we should ask next.
All we know at this point is that sales are
identical for each of the two quarters, identical
for each of the four products, and identical
for each of the four geographies.
The data is boring.
I'm bored by it.
You're probably bored by it.
There aren't any interesting patterns, there
isn't much to analyze, nothing's happening.
Instead of looking for a fourth dimension
to categorize the sales data, let's see what
happens when we mash the existing three dimensions
together to create a single, multidimensional
view of the data.
Nice!
All kinds of patterns and anomalies start
jumping out.
Significant management information - which
was obscured by separate analysis of the dimensions
- suddenly becomes obvious.
For example, lemon-lime and grape flavors
don't sell in Cologne and Berlin in the third
quarter, but cola and cherry do.
It's vice-versa for the fourth quarter.
This process of interacting with multidimensional
views of the data - slicing and dicing - almost
always reveals new and interesting information
compared to isolated, single-dimension data
lists.
This is called multidimensional analysis.
At this point, you're probably thinking that
multidimensional analysis is an obvious way
to analyze data - and you're right.
Multidimensional analysis is very intuitive
for businesspeople, because it represents
a natural, easy, and effective way to analyze
information.
The more difficult task is getting the data
into a format that supports multidimensional
analysis.
Let's take a minute and think about where
a company's data originates.
The sales information for the beverage company
was generated by lots of individual shoppers
plunking a soda down in a checkout line at
a retailer.
Each checkout transaction was then stored
in a database, which was specially designed
to store information about every checkout
transaction in every store.
This type of database is called an operational
database, because its job is to support the
day-to-day operations of a company.
Even though data is captured and stored in
an operational database, the data isn't necessarily
available for business analysis.
In fact, the opposite is usually true.
These databases are structured for the purpose
of running the day-to-day business by processing
transactions.
They aren't structured for effective business
analysis.
While the underlying technologies have changed
pretty dramatically over time, operational
databases still have the same basic functionality:
they gather, update, store, retrieve, and
archive data.
In simplest terms, a rotary card file like
the kind my grandfather used in the 1950s
to store addresses is a database.
At the high end, a large enterprise like Wal-Mart
stores and manages trillions of records, taking
up terabytes, and even petabytes, of storage
in a database.
Collectively, operational systems are usually
referred to as online transaction processing
systems - OLTP for short.
To understand how an OLTP system works, let's
walk through a common OLTP example: purchasing
a tube of toothpaste at Target with a debit
card.
You walk up to the cash register, hopefully
not having stood in line for too long, and
place the toothpaste on the rubber conveyor
belt.
The cashier swipes the barcode on the toothpaste
packaging across the point-of-scale scanner;
the POS system looks up the price of the toothpaste,
totals the cost of the transaction including
sales tax, and prompts you to swipe your debit
card.
You swipe your debit card, input your PIN,
and the POS system talks to your bank and
transfers the cost of the toothpaste from
your account to Target's.
After the money is transferred, the POS generates
a receipt, the cashier bags your purchase
,and you're out the door.
This entire process is you interacting with
an OLTP system, and this activity has three
characteristics that fundamentally define
the system.
First, it processes a transaction - you're
exchanging money for toothpaste according
to the rules of the game, which are valid
product, valid debit card, correct PIN code,
money in the account, and so on.
Next, it performs all of the elements of the
transaction in real time: retrieving the price
of the merchandise, verifying who you are,
transferring the money, immediately updating
your account, and so on.
Everything about the transaction occurs more
or less simultaneously.
And finally, it processes lots of transactions
on a continuous basis.
Not just your purchase, but the purchases
of every other Target shopper are processed
continuously all day, every day.
There are no rigid cutoffs, stops, or starts
- at least none that are apparent to you or
any other shopper checking out.
OLTP systems are everywhere: order tracking,
invoicing, credit card processing, retail
point-of-sale, inventory control, personnel
management, banking, airline reservations,
telephone call processing, and more.
OLTP is designed for managing the raw data
of business, which requires efficiency and
up-to-the-minute processing of transactions
at the lowest level of detail.
So why not use these same OLTP systems for
business analysis?
The answer lies in the questions you want
to have answered efficiently.
For example, "Did somebody buy toothpaste
today at the Framingham Target?" is a transaction
question that can be easily answered by an
OLTP system without much fuss.
While this sort of data can be served up quickly,
it's not really useful for an analysis of
the overall business.
"As a trend over the past 6 months, what has
been the average dollar sales of toothpaste
per Target store per week in the greater Boston
area?
Which Targets sell toothpaste as the greatest
percentage of their oral care category sales?
Which Targets are seeing the fastest growth
in toothpaste sales?"
Those are interesting questions, with potentially
interesting answers, that might affect how
both Target and toothpaste manufacturers do
business.
But they aren't questions you can ask an OLTP
system.
There are way too many records to search,
sort, and summarize.
And don't forget the mathematical calculations
that are also required to obtain the answers.
Imposing these types of queries on the point-of-sale
OLTP system on a regular basis would almost
definitely interfere with the main business
of ringing up shoppers' purchases.
While OLTP systems are lousy for analysis,
they do gather the data that is the foundation
for multidimensional analysis.
And that's the challenge that businesses face.
It's those millions and billions of transactions
in operational systems that are the rich basis
of data that needs to be converted into useful
information for business analysis.
Getting it there isn't simple, especially
when the data resides within multiple, disparately
organized, and frankly, often crufty old legacy
systems.
Historically, IT departments spent the majority
of their resources developing and maintaining
operational systems.
But in today's environment, such systems are
almost always purchased as packaged software
applications.
These apps typically include meaningful reporting
capabilities, which do have value for performing
business analysis and are rightly part of
an overall BI strategy.
Unfortunately, virtually all of them suffer
from two basic limitations.
The first limitation is that they report only
on their own internally gathered data, without
the ability to combine data or absorb structures
from other systems.
This situation, where reporting from one system
is essentially blind to the data from other
systems, is commonly known as "stovepipe reporting"
or "silos of data".
When you report from these silos, it's difficult
to create a broad set of metrics and key performance
indicators that combine data from multiple
systems.
For example, a typical productivity measure
in many industries is revenue per employee.
That KPI is computed by combining revenue
data from the order management system, and
employee data from the human resources system.
The second basic limitation of operational
reporting is that it typically doesn't do
a good job supporting high-speed multidimensional
analysis.
It's significantly slower, less intuitive,
and less flexible than required.
For these reasons, some type of BI system
- where data from multiple operational systems,
and possibly outside data sources, is pulled
together for analysis - is part of every large
company's infrastructure.
Fortunately, all operational systems have
export capabilities.
Even better, the vendors who design and sell
operational systems understand the limitations
of data silos, and are increasingly designing
their systems to be integrated with downstream
BI systems.
Most modern BI systems are designed for a
specific analysis model: online analytical
processing, or OLAP.
The name was chosen because it contrasted
well with online transaction processing, a
term that was already in wide use.
E.F. Codd, an IBM researcher who coined the
term OLAP - he also invented SQL and relational
databases - wanted to highlight the fundamental
differences between transaction processing,
and analytical processing.
In 1993 he laid out a broad set of criteria
for OLAP databases.
Without getting into the minutiae, there are
three critically important capabilities that
all OLAP systems have to have.
First, OLAP provides an intuitive data model
that users who aren't necessarily trained
as analysts can quickly understand.
This model is multidimensional analysis - being
able to see data through multiple filters,
or dimensions, at the same time.
The best managers and analysts were multidimensional
thinkers, even before OLAP, asking questions
like: What are our actual sales compared to
the forecasted sales by region, by salesperson,
and by product?
What is our profitability by product and by
customer?
What is our backlog by product, by customer,
and by time?
Region, salesperson, customer, product, and
time - the by, by, by things - are typical
dimensions for business activities.
Unlike standard relational databases, where
dimensionality is present in the data but
is hard to see, OLAP systems organize the
data directly as multidimensional structures.
OLAP systems also provide easy-to-use tools
for users to get at the information in multiple
and simultaneous dimensional views.
Secondly, OLAP is fast for the user.
Fast retrieval times let managers and analysts
ask - and answer - more questions in a concentrated
period of time than ever before.
This addresses the "infinite query syndrome",
where every interesting pattern and useful
piece of information in a database can be
theoretically explored.
Finally, OLAP systems have robust calculation
engines for handling the specialized calculation
requirements that a multidimensional structure
imposes.
Think about being forced to write spreadsheet
formulas that interlink more than two dimensions.
That's not something easy for human beings
to conceptualize.
OLAP calculation engines structure the data
in a way that allows an analyst to write simple,
straightforward formulas that automatically
perform across multiple dimensions.
The power of OLAP comes from structuring the
data so it naturally supports the way people
do analysis work.
The highest-level OLAP structure is a dimension.
Simply defined, a dimension is a categorically
consistent view of data - in other words,
all of the members within a dimension, such
as products, belong together as a group.
There are two simple tests for this.
First, can data about the members be compared?
For example, the sales numbers of one product
can be compared to other sales numbers in
a product dimension.
This is the famous "apples to apples" test.
Secondly, can the data from the members be
aggregated into summary members?
For example, the members January, February,
and March of a time dimension can be aggregated
together to a value for Q1.
One of the interesting things about dimensions
is the ability to slice and dice multidimensional
data, just like you might slice and then dice
a tomato.
Think of the slice as being a specific member
of a dimension, for example, Product X in
the product dimension.
When we dice with OLAP, we're creating a series
of intersections for that Product X slice
with data from other dimensions.
We can view sales of Product X by month, by
region, and so on.
The word "by" indicates how we're dicing the
data.
It's this multidimensional intersection of
data, and how it's organized in an OLAP database,
that makes multidimensional analysis especially
interesting - and powerful.
The ability to intuitively and instantly switch
views is a basic feature of any respectable
OLAP user tool.
Another capability inherent in OLAP's multidimensional
design is the pivoting and nesting of dimensions,
which are point-and-click operations in good
OLAP systems.
Take a look at this table - it's the same
soda data we looked at earlier, but the products
have been pivoted from a row orientation to
a column orientation, and the markets were
pivoted from columns to rows.
In addition, the quarters have been flipped
on the row axis to an inside nested row position.
Now you can see the column totals for products
that weren't available in the previous version
of the table.
Multidimensional data in an OLAP system is
typically visualized as a cube storage structure
with lots of mini-cubes, or cells, making
up the cube as a whole.
The data is a cube because it has three dimensions
of data: geographies, time periods, and products.
This is a data cube that contains the information
for our simple soda example.
Each cell contains a value, which is the value
of sales for the intersection of each unique
combination of time period, geography, and
product.
For example, the value of the cell for cola-flavored
soda in Cologne in the 3rd quarter is $2000.
The total value of all the cells, as we saw
in our data tables, is $32,000.
Let's look at the cube again, this time highlighting
a slice as it might be depicted on an OLAP-generated
report.
The cherry soda member of the product dimension
crosses over all intersections of the time
and geography dimensions.
The sum of these highlighted cells is the
total sales for cherry soda, which is $8,000.
Slicing and dicing, in the OLAP world, is
calling up subsets of cells that the user
has selected to view from the master OLAP
cube stored on a data server.
In the real world, an OLAP cube usually has
dimensions with hundreds, thousands, or even
millions of members.
When we call an OLAP database a cube it's
a metaphor, not an actual storage structure.
The technical definition of an OLAP cube is
a multidimensional structure that stores and
maintains discrete intersection values.
Some OLAP database architectures may also
include multiple cubes with intersections
between them.
Let's assume that you're a business analyst
for a typical consumer goods company.
Your job is to perform a thorough analysis
of units sold, average prices, and dollar
sales for 100 specific products that your
company sells.
Your analysis includes looking at patterns
across 24 months for 200 major cities in the
US.
So how many data points are you going to have
to look at?
The answer is 1,440,000 - that's the three
measures by 100 products by 24 months by 200
geographies.
That's a lot of data points.
Of course, not all of the products are sold
in all 200 cities, and not all of the products
were carried for all 24 months, but can you
really know what products were carried where
and when?
This analysis problem is addressed in OLAP
systems by letting you organize the data into
hierarchies, which aggregate the data to higher
and higher levels.
For example, the monthly data can be rolled
up - summarized - to quarter and year totals,
the cities rolled up to regions and states,
the products rolled up to product lines and
product groups, and so on.
Average prices can also be back-calculated
at each level in the hierarchy, from the aggregations
of the unit sales and dollar sales.
The multidimensional analysis process becomes
manageable, because a hierarchical organization
of each dimension lets you start at the top
of the hierarchy, and drill-down through the
data as you see interesting patterns or anomalies.
This top-down approach to analysis is similar
to a game of 20 questions.
You start by asking broad questions.
Then, depending on the answers, you ask increasingly
focused questions.
Finally, you figure out the correct answer.
Thinking top-down is a natural way for us
humans to organize complex information, but
it is a learned behavior that takes some practice.
Multidimensional data, organized into hierarchies,
is an intuitively comfortable way to analyze
the mountains of data that come out of transactional
systems.
The mechanics of OLAP interfaces - especially
pointing and clicking to drill-down into interesting
layers - is made possible by rapid responses
to queries.
This functionality gives managers and analysts
a completely new process for dealing with
large amounts of data, a process called ad-hoc
analysis.
Paper-based reporting with fixed formats presumes
a set list of questions that require answers.
Ad hoc analysis doesn't have those constraints.
Virtually any question - assuming a well-constructed
OLAP system - can be answered in real-time
with a few minutes of drill-down.
Virtually any report can be formatted or graphed
with pivoting and nesting of dimensions.
Virtually anyone in the company, even executives
who like to ask lots of questions, can be
taught how to do it with minimal training.
For an example of a hierarchy, let's use a
sample time dimension from the year 2013.
A hierarchy is an organization of levels within
a dimension that reflects two things: how
additive data is aggregated level by level,
and the top-down drill-down path for users
within the dimension.
In our example time dimension, the hierarchy
is a general calendar organization of the
levels: year, quarter, and month.
It's possible to have more than one hierarchy
for the same dimension, depending on how you
want to look at the data.
Just remember that every hierarchy in the
dimension uses the same bottom-level data.
More than likely, you're going to supplement
your analysis work with descriptive information
that isn't part of the hierarchy but is still
important to your analysis.
This non-hierarchical information is known
as attributes.
Examples of attributes are model, size, list
price, color, flavor, street address, FIPS
time codes, and so on.
So now that we've covered how you can structure
and view data in an OLAP database using dimensions
and hierarchies, it's time to talk about the
actual data you're aggregating, storing, calculating,
and otherwise analyzing.
The data in most BI applications, and all
OLAP systems, are called measures.
In simplest terms, a measure is any quantitative
expression.
For example, dollar sales is an important
measure for analyzing sales activity, as we
saw earlier in the soda company example.
In OLAP systems, a measure is what's being
analyzed across multiple dimensions.
For example, dollar sales of soda by month,
by product, and by city.
The measure is dollar sales, and each intersection
of the three dimensions has a different dollar
sales value.
The total of dollar sales is the sum of the
values of all combinations of members at the
lowest level.
There are four important parameters for what
measures are, and how they work in multidimensional
analysis.
First, a measure is always a quantity, or
a mathematical formula that yields a quantity.
Secondly, a measure can take any quantitative
format, for example, an absolute value like
unit sales, a currency value like dollar sales,
a percentage like percent sales, or a ratio
like sales per employee.
Next, a measure can be derived from any original
data source or calculation.
For example, a measure can be a direct input
like unit sales, an aggregation like total
unit sales, an average like average selling
price, a formula like dollar sales divided
by employees, a count like number of products,
and so on.
And finally, there has to be at least one
measure to do any type of OLAP analysis.
A typical OLAP cube will usually have lots
of different measures, and in some extreme
cases hundreds and even thousands of measures.
The measures being analyzed depend on the
purpose of OLAP system.
For example, unit sales and sales per salesperson
are important measures for a sales department.
Inventory turns and debt-to-equity ratios
are standard calculations in balance sheet
analysis.
Same store sales are a fundamental metric
for large retail operations.
Burn rate is an important cash metric for
venture-financed startups.
Measures in business intelligence are generally
known by different names, depending on the
application.
The terms "metric" and "key performance indicator",
or KPI, are used interchangeably to represent
important measures you should pay attention
to.
The term "benchmark" refers to a measure used
for making comparisons, for example, the average
cost of goods sold.
A ratio is a measure where the result is calculated
specifically from dividing one measure by
another, such as sales per salesperson.
We've seen that the gap between raw data and
business information can be bridged through
a series of processes that start with OLTP
systems, where the data is collected, and
end with BI systems, where the data is organized
into dimensions, hierarchies, and measures
for analysis.
OLAP systems enable ad hoc analysis, and on-the-fly
construction of specialized reports, by allowing
users to slice and dice their data by different
dimensions, pivoting rows and columns, and
drilling down through hierarchies.
These standard OLAP system capabilities allow
users to quickly ask and answer questions
and identify patterns and anomalies otherwise
obscured by single-dimensional analysis.
