
BDA Module-4

Mining Data Streams


Data arrives in a stream or streams, and if it is not processed immediately or stored, then it
is lost forever. The data arrives so rapidly that it is not feasible to store it all in active
storage (i.e., in a conventional database), and then interact with it at the time of our
choosing.

The algorithms for processing streams involve summarization of the stream in some way.

One approach is to take a useful sample of a stream and to filter a stream to eliminate most of the "undesirable" elements. We can then estimate the number of different elements in a stream using much less storage than would be required if we listed all the elements we have seen.

Another approach to summarizing a stream is to look at only a fixed-length "window" consisting of the last n elements for some (typically large) n. We can then query the window as if it were a relation in a database. If there are many streams and/or n is large, we may not be able to store the entire window for every stream, so we need to summarize even the windows. A fundamental problem is maintaining an approximate count of the number of 1's in the window of a bit stream, while using much less space than would be needed to store the entire window itself. This technique generalizes to approximating various kinds of sums.

The Stream Data Model


A Data-Stream-Management System

We can view a stream processor as a kind of data-management system, the high-level organization of which is suggested in Fig. 4.1. Any number of streams can enter the system.


Shobha Chandra K, Asst Professor, CSE, MCE Page 1

Each stream can provide elements at its own schedule; they need not have the same data
rates or data types, and the time between elements of one stream need not be uniform.

The fact that the rate of arrival of stream elements is not under the control of the system
distinguishes stream processing from the processing of data that goes on within a
database-management system.

The latter system controls the rate at which data is read from the disk, and therefore never
has to worry about data getting lost as it attempts to execute queries.

Streams may be archived in a large archival store, but we assume it is not possible to
answer queries from the archival store. It could be examined only under special
circumstances using time-consuming retrieval processes.

There is also a working store, into which summaries or parts of streams may be placed,
and which can be used for answering queries. The working store might be disk, or it might
be main memory, depending on how fast we need to process queries. But either way, it is
of sufficiently limited capacity that it cannot store all the data from all the streams.

Examples of Stream Sources

Sensor Data

Image Data

Internet and Web Traffic

Sensor Data

Sensor data is used, for example, in navigation systems. Imagine a temperature sensor bobbing about in the ocean, sending back to the base station a reading of the surface temperature each hour. The data produced by this sensor is a stream of real numbers. A single such sensor produces so little data that storing it poses no problem; the challenge appears only at scale, as the following example shows.

To learn something about ocean behavior, we might want to deploy a million sensors, each
sending back a stream, at the rate of ten per second. A million sensors isn’t very many;
there would be one for every 150 square miles of ocean. Now we have 3.5 terabytes
arriving every day, and we definitely need to think about what can be kept in working
storage and what can only be archived.
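The 3.5-terabyte figure can be checked with simple arithmetic; the sketch below assumes each reading is a 4-byte real number, which is not stated explicitly above.

```python
# Back-of-the-envelope check of the 3.5 TB/day figure.
sensors = 1_000_000
readings_per_second = 10
bytes_per_reading = 4          # assumption: one 32-bit real number per reading
seconds_per_day = 86_400

bytes_per_day = sensors * readings_per_second * bytes_per_reading * seconds_per_day
terabytes_per_day = bytes_per_day / 10**12
print(round(terabytes_per_day, 2))  # about 3.46, i.e. roughly 3.5 TB per day
```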

Image Data


Satellites often send down to Earth streams comprising many terabytes of images per day. Surveillance cameras produce images with lower resolution than satellites, but there can be very many of them, each producing a stream of images at intervals like one second.

Internet and Web Traffic –

A switching node in the center of the internet receives streams of IP packets from many
inputs and routes them to its outputs. Websites receive streams of heterogeneous types.
For example, Google receives a hundred million search queries per day. Yahoo! accepts
billions of “clicks” per day on its various sites.

Stream Queries

Queries concerning streams can be asked in two different ways.

 Standing query

 Ad-hoc query

Standing query

The stream produced by the ocean-surface-temperature sensor might have a standing query to output an alert whenever the temperature exceeds 25 degrees centigrade. This query is easily answered, since it depends only on the most recent stream element.

Alternatively, we might have a standing query that, each time a new reading arrives,
produces the average of the 24 most recent readings. That query also can be answered
easily, if we store the 24 most recent stream elements. When a new stream element
arrives, we can drop from the working store the 25th most recent element, since it will
never again be needed.

Another query we might ask is the maximum temperature ever recorded by that sensor. We
can answer this query by retaining a simple summary: the maximum of all stream elements
ever seen. It is not necessary to record the entire stream. When a new stream element
arrives, we compare it with the stored maximum, and set the maximum to whichever is
larger. We can then answer the query by producing the current value of the maximum.

Similarly, if we want the average temperature over all time, we have only to record two
values: the number of readings ever sent in the stream and the sum of those readings. We
can adjust these values easily each time a new reading arrives, and we can produce their
quotient as the answer to the query.
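The standing queries above can be sketched in a few lines. The class below is an illustrative implementation (the class and method names are invented for this sketch): it keeps only the 24 most recent readings, the running maximum, and the count and sum needed for the all-time average.

```python
from collections import deque

class SensorStandingQueries:
    """Keeps just enough state to answer the standing queries described above."""
    def __init__(self):
        self.recent = deque(maxlen=24)  # the 24 most recent readings
        self.maximum = float("-inf")    # maximum reading ever seen
        self.count = 0                  # number of readings ever seen
        self.total = 0.0                # sum of all readings ever seen

    def new_reading(self, temp):
        self.recent.append(temp)        # deque silently drops the 25th most recent
        self.maximum = max(self.maximum, temp)
        self.count += 1
        self.total += temp
        if temp > 25:                   # standing alert query
            print(f"ALERT: temperature {temp} exceeds 25 C")

    def average_of_recent(self):
        return sum(self.recent) / len(self.recent)

    def all_time_average(self):
        return self.total / self.count

q = SensorStandingQueries()
for reading in [20.0, 26.0, 23.0]:
    q.new_reading(reading)              # 26.0 triggers the alert
```

Note that none of these queries requires storing the whole stream: a bounded window, one number, or two numbers suffice, exactly as the text argues.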

Ad-hoc query

The other form of query is ad-hoc: a question asked once about the current state of a stream or streams. If we do not store all streams in their entirety, as normally we cannot, then we cannot expect to answer arbitrary queries about streams. If we have some idea what kinds of queries will be asked through the ad-hoc query interface, then we can prepare for them by storing appropriate parts or summaries of streams.

If we want the facility to ask a wide variety of ad-hoc queries, a common approach is to store a sliding window of each stream in the working store. A sliding window can be the most recent n elements of a stream, for some n, or it can be all the elements that arrived within the last t time units, e.g., one day. If we regard each stream element as a tuple, we can treat the window as a relation and query it with any SQL query. Of course the stream-management system must keep the window fresh, deleting the oldest elements as new ones come in.
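A time-based sliding window of this kind might be maintained as sketched below (the class and its names are hypothetical); the oldest elements are deleted as they fall out of the window.

```python
from collections import deque

class SlidingWindow:
    """Keeps only the elements that arrived within the last t time units."""
    def __init__(self, t):
        self.t = t
        self.window = deque()            # (arrival_time, element) pairs, oldest first

    def insert(self, now, element):
        self.window.append((now, element))
        self._expire(now)

    def _expire(self, now):
        # Delete the oldest elements once they are older than t time units.
        while self.window and self.window[0][0] <= now - self.t:
            self.window.popleft()

    def elements(self, now):
        self._expire(now)
        return [e for _, e in self.window]

w = SlidingWindow(t=10)
w.insert(0, "a")
w.insert(5, "b")
w.insert(12, "c")                        # "a" arrived more than 10 units ago
```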

Example 4.2: Web sites often like to report the number of unique users over the past
month. If we think of each login as a stream element, we can maintain a window that is all
logins in the most recent month. We must associate the arrival time with each login, so we
know when it no longer belongs to the window. If we think of the window as a relation
Logins(name, time), then it is simple to get the number of unique users over the past
month. The SQL query is:

SELECT COUNT(DISTINCT(name))

FROM Logins

WHERE time >= t;

Here, t is a constant that represents the time one month before the current time.

Issues in Stream Processing

Streams often deliver elements very rapidly. We must process elements in real time, or we lose the opportunity to process them at all, without recourse to the archival storage. Thus, it often is important that the stream-processing algorithm be executed in main memory, without access to secondary storage or with only rare accesses to secondary storage. Moreover, even when streams are "slow," there may be many such streams. Even if each


stream by itself can be processed using a small amount of main memory, the requirements
of all the streams together can easily exceed the amount of available main memory.

Sampling Data in a Stream

Example: extracting reliable samples from a stream

A search engine receives a stream of queries, and it would like to study the behavior of
typical users. We assume the stream consists of tuples (user, query, time). Suppose that we
want to answer queries such as “What fraction of the typical user’s queries were repeated
over the past month?” Assume also that we wish to store only 1/10th of the stream
elements.

Approach

 Generate a random number, say an integer from 0 to 9, in response to each search query.

 Store the tuple if and only if the random number is 0. If we do so, each user has, on average, 1/10th of their queries stored.

 Statistical fluctuations will introduce some noise into the data, but if users issue many queries, the law of large numbers will assure us that most users will have a fraction quite close to 1/10th of their queries stored.
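The random-number scheme above might look like the following sketch (the function name is invented for illustration):

```python
import random

def sample_query(stream_tuple, sample):
    """Keep a stream tuple with probability 1/10, as in the naive scheme."""
    if random.randrange(10) == 0:    # integer from 0 to 9; keep on 0
        sample.append(stream_tuple)

random.seed(0)
sample = []
for i in range(100_000):
    sample_query(("user", f"query {i}", i), sample)
# len(sample) will be close to 10,000, i.e. 1/10th of the stream
```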

However, this scheme gives us the wrong answer to the query asking for the average
number of duplicate queries for a user.

Suppose a user has issued

 s search queries one time in the past month,

 d search queries twice, and

 no search queries more than twice.

If we have a 1/10th sample of queries,

 we shall see in the sample for that user an expected s/10 of the search queries
issued once.

 Of the d search queries issued twice, only d/100 will appear twice in the
sample; that fraction is d times the probability that both occurrences of the


query will be in the 1/10th sample. Of the queries that appear twice in the full
stream, 18d/100 will appear exactly once.

 To see why, note that 18/100 is the probability that one of the two occurrences
will be in the 1/10th of the stream that is selected, while the other is in the
9/10th that is not selected.

The correct answer to the query about the fraction of repeated searches is d/(s+d).
However, the answer we shall obtain from the sample is d/(10s+19d). To derive the latter
formula, note that d/100 appear twice, while s/10+18d/100 appear once. Thus, the fraction
appearing twice in the sample is d/100 divided by d/100+s/10+18d/100. This ratio is
d/(10s+19d). For no positive values of s and d is d/(s+d) = d/(10s+19d).
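The formula d/(10s+19d) can be checked numerically. The simulation below (an illustrative sketch) samples each query occurrence independently with probability 1/10, as the naive scheme does, and measures the fraction of distinct sampled queries that appear twice.

```python
import random

random.seed(42)
s, d = 100_000, 100_000          # s singleton queries, d queries issued twice

# Each occurrence is kept independently with probability 1/10.
occurrences = [f"s{i}" for i in range(s)] + [f"d{i}" for i in range(d)] * 2
counts = {}
for q in occurrences:
    if random.randrange(10) == 0:
        counts[q] = counts.get(q, 0) + 1

appearing_twice = sum(1 for c in counts.values() if c == 2)
fraction = appearing_twice / len(counts)
print(round(fraction, 4))        # near d/(10s+19d) = 1/29, i.e. about 0.0345,
                                 # far from the true answer d/(s+d) = 0.5
```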

Obtaining a Representative Sample

Many queries about the statistics of typical users cannot be answered by taking a sample of each user's search queries. Thus, we must strive to pick 1/10th of the users, and take all their searches for the sample, while taking none of the searches from other users.

If we can store a list of all users, and whether or not they are in the sample, then we could
do the following.

 Each time a search query arrives in the stream, we look up the user to see
whether or not they are in the sample.

 If so, we add this search query to the sample, and if not, then not.

 However, if we have no record of ever having seen this user before, then we
generate a random integer between 0 and 9.

 If the number is 0, we add this user to our list with value “in,” and if the
number is other than 0, we add the user with the value “out.”

The method works as long as we can afford to keep the list of all users and their in/out
decision in main memory, because there isn’t time to go to disk for every search that
arrives. By using a hash function, one can avoid keeping the list of users. That is, we hash
each user name to one of ten buckets, 0 through 9. If the user hashes to bucket 0, then
accept this search query for the sample, and if not, then not.

We do not actually store the user in the bucket; in fact, there is no data in the buckets at all. Effectively, we use the hash function as a random-number generator, with the important

property that, when applied to the same user several times, we always get the same
“random” number. That is, without storing the in/out decision for any user, we can
reconstruct that decision any time a search query by that user arrives.

More generally, we can obtain a sample consisting of any rational fraction a/b of the users
by hashing user names to b buckets, 0 through b−1. Add the search query to the sample if
the hash value is less than a.

The General Sampling Problem

Our stream consists of tuples with n components. A subset of the components are the key components, on which the selection of the sample will be based. In our running example there are three components (user, query, and time), of which only user is in the key. However, we could also take a sample of queries by making query the key, or even take a sample of user-query pairs by making both those components form the key.

To take a sample of size a/b,

 we hash the key value for each tuple to b buckets, and accept the tuple for the
sample if the hash value is less than a.

 If the key consists of more than one component, the hash function needs to combine
the values for those components to make a single hash-value.

 The result will be a sample consisting of all tuples with certain key values.

 The selected key values will be approximately a/b of all the key values appearing in
the stream.
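The bucket-hashing scheme might be sketched as follows, using SHA-256 purely as a convenient hash function (the function name and the tuple layout are assumptions of this sketch). Because the hash is deterministic, the same key always gets the same in/out decision, with no per-key state stored.

```python
import hashlib

def in_sample(key, a, b):
    """Accept a tuple whose key components hash to one of buckets 0..a-1 of b."""
    # Combine the key components into a single byte string, then hash it.
    data = "\x00".join(str(k) for k in key).encode()
    bucket = int.from_bytes(hashlib.sha256(data).digest()[:8], "big") % b
    return bucket < a

stream_tuple = ("alice", "best camera", "2024-05-01T10:00")
key = stream_tuple[:1]           # key = (user,): sample whole users, a/b = 1/10
keep = in_sample(key, a=1, b=10)
```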

Varying the Sample Size

Often, the sample will grow as more of the stream enters the system. In our running
example, we retain all the search queries of the selected 1/10th of the users, forever. As
time goes on, more searches for the same users will be accumulated, and new users that
are selected for the sample will appear in the stream.

If we have a budget for how many tuples from the stream can be stored as the sample,
then the fraction of key values must vary, lowering as time goes on. In order to assure that
at all times, the sample consists of all tuples from a subset of the key values, we choose a


hash function h from key values to a very large number of values 0,1,...,B−1. We maintain
a threshold t, which initially can be the largest bucket number, B − 1. At all times, the
sample consists of those tuples whose key K satisfies h(K) ≤ t. New tuples from the
stream are added to the sample if and only if they satisfy the same condition.

If the number of stored tuples of the sample exceeds the allotted space, we lower t to
t−1 and remove from the sample all those tuples whose key K hashes to t. For
efficiency, we can lower t by more than 1, and remove the tuples with several of the
highest hash values, whenever we need to throw some key values out of the sample.
Further efficiency is obtained by maintaining an index on the hash value, so we can find all
those tuples whose keys hash to a particular value quickly.
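The threshold scheme might be sketched as below (the class and its names are invented for illustration). As the text suggests, t can be lowered by more than 1 at a time: here we jump directly to the largest occupied hash value, and the index on hash values makes eviction of all tuples with that hash fast.

```python
import hashlib

B = 1_000_000                    # number of hash values, 0..B-1

def h(key):
    data = str(key).encode()
    return int.from_bytes(hashlib.sha256(data).digest()[:8], "big") % B

class FixedBudgetSample:
    """Sample of all tuples whose key K satisfies h(K) <= t; shrinks t when full."""
    def __init__(self, budget):
        self.budget = budget
        self.t = B - 1               # threshold starts at the largest hash value
        self.by_hash = {}            # index: hash value -> list of stored tuples

    def add(self, key, stream_tuple):
        hv = h(key)
        if hv > self.t:
            return                   # this key is not in the sampled subset
        self.by_hash.setdefault(hv, []).append(stream_tuple)
        while self._size() > self.budget and self.by_hash:
            # Lower t past the largest occupied hash value, evicting its tuples.
            top = max(self.by_hash)
            self.by_hash.pop(top)
            self.t = top - 1

    def _size(self):
        return sum(len(v) for v in self.by_hash.values())

sample = FixedBudgetSample(budget=5)
for i in range(20):
    sample.add(f"user{i}", (f"user{i}", "some query", i))
# At most 5 tuples survive, and every surviving key K still satisfies h(K) <= t.
```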

Filtering Streams

Another common process on streams is selection, or filtering. We want to accept those tuples in the stream that meet a criterion. Accepted tuples are passed to another process as a stream, while other tuples are dropped.

