Usually we mine data that sits
somewhere in a database, or
a distributed file system.
And we can access the same data
repeatedly, and it is all available to us,
whenever we need it.
But there are some applications where data
doesn't really live in a database, or
if it does, the database is so
large that we can't query it fast
enough to answer questions about it.
Examples include click streams
at a major internet site, or observational
data coming down from satellites.
Answering queries about this sort of data
requires clever approximation techniques and
methods for
compressing data in a way that allows us
to answer the queries we need to answer.
We'll begin with a brief summary
of a stream management system,
the analog of a database management
system. The idea of sliding windows is
an essential idea that lets us
focus on recent data in the stream.
We'll then discuss a particular problem,
that of counting ones in
the window of a bit stream as it
slides by. The fundamental difference
between a data stream and a database is
who controls how data enters the system.
In a database system, the staff
associated with the management of
the database generally insert data
into the system using a bulk loader or
even explicit SQL INSERT commands.
The staff can decide how much data to
load into the system, when, and how fast.
In a streaming environment, the management
cannot control the rate of input.
For example, the search queries that
arrive at Google are generated by
random people around the world,
at their pace.
Google staff have no control
over the arrival of queries.
They have to architect their system
to deal with whatever data rate there is.
You might think of a transaction processing
system, like Walmart recording all
the purchases at all its cash
registers everywhere, as a stream.
And in a sense it is.
But Walmart has a large but
fixed number of registers.
And checkout clerks can
press the keys only so fast.
So there's actually a pretty well defined
limit on how fast data arrives in
such a system.
So let's see the elements of
the data stream model of computation.
First we assume inputs are tuples,
just as in a database system,
although in many algorithms we
shall assume input elements
are tuples of a very simple
kind, like bits or integers.
We assume there are one or
more input ports at which data arrives.
Generally we assume the arrival rate is
high although we will be a little vague on
how high is high.
The important property of the arrival
rate is that it is fast enough
that it is not feasible for the system
to store all the arriving data and
at the same time make it
instantaneously available for
any query we might want
to perform on the data.
As a result, the interesting algorithms for
data streams are generally methods
that use a limited amount of storage,
perhaps only main memory,
and still enable us to answer important
queries about the contents of the stream.
Streams can be queried in two modes.
The first is similar to the way
we query a database system.
You ask a query once and
expect an answer about the state of
the system at the time you ask the query.
For example,
what is the maximum value seen in
the stream from its beginning to
the exact time the query is asked?
This question can be answered
by keeping a single value,
the maximum, and updating it if necessary
each time a new stream element arrives.
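A minimal sketch of that idea, in Python. The class name `RunningMax` and its methods `arrive` and `query` are made up for illustration; the only state kept is the one value the lecture describes.

```python
class RunningMax:
    """Keeps only the largest value seen so far on a stream."""

    def __init__(self):
        self.maximum = None  # no elements have arrived yet

    def arrive(self, element):
        # Update the single stored value only when the new element exceeds it.
        if self.maximum is None or element > self.maximum:
            self.maximum = element

    def query(self):
        # Ad-hoc query: the maximum from the start of the stream until now.
        return self.maximum


tracker = RunningMax()
for x in [3, 7, 2, 9, 4]:
    tracker.arrive(x)
print(tracker.query())  # 9
```

The storage used does not grow with the length of the stream, which is the property we want from a stream algorithm.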
The other kind of query is
called a standing query.
You write the query once.
And you expect the system to make
the answer available at all times,
perhaps outputting a new value
each time the answer changes.
For instance, a standing query might ask
for a report of each stream element that
is larger than any element seen so
far in the stream.
We can answer this one by
keeping one value, the maximum.
And each new element is compared with the
max and if it is larger we do two things.
We output the value and
we update the max to be that value.
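The two steps just described can be sketched as a Python generator. The name `new_maxima` is hypothetical, and treating the very first element as a new maximum is an assumption, since nothing has been seen before it.

```python
def new_maxima(stream):
    """Yield each element that is larger than every element before it."""
    maximum = None
    for element in stream:
        # Assumption: the first element counts as larger than "nothing so far".
        if maximum is None or element > maximum:
            yield element      # step one: output the value
            maximum = element  # step two: update the max


records = list(new_maxima([3, 7, 2, 9, 4, 9, 11]))
print(records)  # [3, 7, 9, 11]
```

Because it is a generator, the answer becomes available as elements arrive rather than once at the end, which is what a standing query asks for.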
So here is a very simple outline of what
a stream management system looks like.
First there's a processor, which is
the software that executes the queries.
The processor could, of course, be a large
number of processors working in concert.
The processor may store
some standing queries.
And also allow ad-hoc queries
to be issued by the user.
Here we see several streams
entering the system.
Conventionally, we'll assume that
the element at the right end of
the stream has arrived most recently,
and time goes backward to the left.
That is, the further left, the earlier
the element entered the system.
The system makes
outputs in response
to the standing queries and
the ad-hoc queries.
Now usually there is
some archival storage.
This storage is so
massive, that it is not possible to
do more than store the input streams.
We cannot assume the archival storage
is architected like a database system,
where by using appropriate indices or
other tools, one can answer queries
efficiently from that data.
We only know that if we had to reconstruct
the history of the streams we could,
perhaps taking a long time to do so.
Now, there is also a limited working storage,
which might be main memory, flash storage,
or even disk.
But we assume it holds essential
parts of the input streams in
a way that supports fast querying.
We're going to list some examples
of the sorts of streams that it
could be useful to mine.
One example is the query stream
at a search engine like Google.
For example, Google Trends
wants to find out which search queries are
much more frequent today than yesterday.
These queries represent issues
of rising public interest.
Answering such a standing query requires
looking back at most two days in
the query stream.
That's quite a lot,
perhaps billions of queries.
But it is tiny compared with the stream
of all Google queries ever issued.
Click streams are another
source of very rapid input.
A site like Yahoo has many
millions of users each day,
and the average user probably
clicks a dozen times or more.
A question worth answering is
which URLs are getting clicked on
a lot more this past hour than normal.
Interestingly, while some
of these events
reflect breaking news stories,
many also represent a broken link.
When people can't get the page they want,
they'll often click on it
several times before giving up.
So sites mine their click streams
to detect broken links.
We can view a switch in
the middle of the Internet
as processing streams,
one stream for each port.
The elements of the stream
are typically IP packets.
And the switch can store a lot
of information about packets,
including the response speed
of different network links and
the points of origin and
destination of the packets.
This information could be used to advise
the switch on the best routing for
a packet, or
to detect a denial of service attack.
Now the concept of the sliding window
is essential for
many of the algorithms
we're going to discuss.
The simplest form of window
is defined by a fixed length N and
consists of the most recent N
elements received on a stream.
Notice that each time
an element is received,
the oldest element falls
out of the window.
A variation is to define the window as
all the elements that have arrived within
some time interval T
extending into the past,
say the last hour.
This sort of window has a storage
requirement that is not fixed, since
the number of arrivals
within time T can vary.
In comparison defining the window to
have a fixed number of elements lets us
rely on needing storage space,
only up to a certain limit.
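To make the storage contrast concrete, here is a rough Python sketch that feeds the same arrivals to both kinds of window. The class `TimeWindow`, its parameter `span`, and the sample timestamps are all invented for illustration, and the sketch assumes timestamps arrive in non-decreasing order.

```python
from collections import deque


class TimeWindow:
    """Holds every element that arrived within the last `span` time units."""

    def __init__(self, span):
        self.span = span
        self.items = deque()  # (timestamp, element) pairs, oldest on the left

    def arrive(self, timestamp, element):
        self.items.append((timestamp, element))
        # Evict elements that are now older than the interval.
        while self.items and self.items[0][0] <= timestamp - self.span:
            self.items.popleft()

    def __len__(self):
        return len(self.items)


count_window = deque(maxlen=3)     # fixed length: never more than 3 elements
time_window = TimeWindow(span=10)  # fixed duration: size follows arrival rate

sizes = []
for t, x in [(0, "a"), (1, "b"), (2, "c"), (3, "d"), (20, "e")]:
    count_window.append(x)
    time_window.arrive(t, x)
    sizes.append((len(count_window), len(time_window)))

print(sizes)  # [(1, 1), (2, 2), (3, 3), (3, 4), (3, 1)]
```

The fixed-length window never exceeds three elements, while the time-based one grows to four during the burst and shrinks to one after the quiet period.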
The interesting case is when
we're using a window consisting
of the last N stream elements,
but N is so large that we cannot
store all N elements in main memory.
And while we have options to
increase the size of main memory,
use many compute nodes to handle one
window, or use disk in some cases,
we also need to consider the case
where there are many streams,
perhaps millions of streams,
arriving at the same stream processor.
In that case, N does not have
to be very large before we
cannot store all the windows in a way that
allows us to get exact answers to queries
about the contents of the windows.
So here's a little picture of a stream and
a window of length six.
Initially,
the stream has arrived up to this point, j.
The elements k, l, and so
on will arrive in the future.
Now k arrives, and the oldest element, s,
is no longer part of
the window, which continues to hold
exactly six elements, as it always will.
Now l arrives and d falls out of
the window, and z arrives,
causing f to be dropped from the window.
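The same walk-through can be replayed in a few lines of Python, using a `deque` with `maxlen=6` as the window. The text names s, d, f and j as window elements; the two between f and j are not named, so g and h here are placeholders.

```python
from collections import deque

# The six most recent elements, oldest on the left (g and h are assumed).
window = deque("sdfghj", maxlen=6)

for arrival in "klz":
    oldest = window[0]
    window.append(arrival)  # a full deque with maxlen drops its leftmost item
    print(f"{arrival} arrives, {oldest} falls out: {''.join(window)}")

# k arrives, s falls out: dfghjk
# l arrives, d falls out: fghjkl
# z arrives, f falls out: ghjklz
```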
Let's take
a really simple example.
We have a stream of integers.
The window is of size N.
That is, the window will hold the N
most recent integers in the stream.
And we want the system to be able
to answer one standing
query: what is the average of
the elements in the window?
Often we imagine the stream extends
infinitely into the past,
so we don't worry about what happened
before there had been enough arrivals to
fill the window.
However, realistically we have
to get started somehow.
So let's store the first N
inputs as they arrive and
maintain the sum and count of the elements seen
so far, until the count reaches N.
The average is the sum divided
by the count at any point.
Now suppose we have our window full, and it
consists of the most recent N elements.
We'll also store the average of
these elements; that average is in
local storage, but
it's not part of the window.
Suppose a new element i arrives.
The oldest element j in the window
will fall out of the window.
Thus, the change in the average
is i minus j, all divided by N.
i over N accounts for the contribution
i makes to the average, and minus j
over N accounts for the fact that j no
longer contributes to the average.
The important point is that, in this
manner, we can answer the query, what is
the average of the elements in the window,
doing only a small fixed number of
arithmetic steps
with each arrival on the stream.
That is far, far better than
having to compute the sum and
average of all N elements in the window
each time a new element arrives.
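Here is one way the whole scheme might look in Python, covering both the start-up phase and the constant-time update, new average = old average + (i - j) / N. The class name `WindowAverage` and the method `arrive` are hypothetical. Note that the window itself is still stored, since we need to know which element j falls out.

```python
from collections import deque


class WindowAverage:
    """Average of the N most recent integers, updated in O(1) per arrival."""

    def __init__(self, n):
        self.n = n
        self.window = deque()  # the stored window, oldest element on the left
        self.total = 0         # running sum, used only while filling up
        self.average = None    # kept in local storage, outside the window

    def arrive(self, i):
        if len(self.window) < self.n:
            # Start-up: keep the sum and count until the count reaches N.
            self.window.append(i)
            self.total += i
            self.average = self.total / len(self.window)
        else:
            # Full window: j falls out, i comes in, average shifts by (i - j) / N.
            j = self.window.popleft()
            self.window.append(i)
            self.average += (i - j) / self.n
        return self.average


avg = WindowAverage(3)
results = [avg.arrive(x) for x in [1, 2, 3, 4, 5]]
print(results)  # [1.0, 1.5, 2.0, 3.0, 4.0]
```

One caveat of this sketch: because the average is adjusted by floating-point increments, rounding error can build up over a very long stream. Keeping an integer running sum and dividing on demand avoids that at the same constant cost per arrival.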
But not every query about
the current value of
the window can be answered in
an equally convenient way.
