
BDA Module-4

Mining Data Streams


Data arrives in a stream or streams, and if it is not processed immediately or stored, then it
is lost forever. The data arrives so rapidly that it is not feasible to store it all in active
storage (i.e., in a conventional database), and then interact with it at the time of our
choosing.

The algorithms for processing streams involve summarization of the stream in some way.

One approach is to take a useful sample of a stream and to filter a stream to eliminate most of the "undesirable" elements. We can then estimate the number of different elements in a stream using much less storage than would be required if we listed all the elements we have seen.

Another approach to summarizing a stream is to look at only a fixed-length "window" consisting of the last n elements for some (typically large) n. We can then query the window as if it were a relation in a database. If there are many streams and/or n is large, we may not be able to store the entire window for every stream, so we need to summarize even the windows. A fundamental problem is maintaining an approximate count of the number of 1's in the window of a bit stream, while using much less space than would be needed to store the entire window itself. This technique generalizes to approximating various kinds of sums.

The Stream Data Model


A Data-Stream-Management System

We can view a stream processor as a kind of data-management system, the high-level organization of which is suggested in Fig. 4.1. Any number of streams can enter the system.


Shobha Chandra K, Asst Professor, CSE, MCE Page 1

Each stream can provide elements at its own schedule; they need not have the same data
rates or data types, and the time between elements of one stream need not be uniform.

The fact that the rate of arrival of stream elements is not under the control of the system
distinguishes stream processing from the processing of data that goes on within a
database-management system.

The latter system controls the rate at which data is read from the disk, and therefore never
has to worry about data getting lost as it attempts to execute queries.

Streams may be archived in a large archival store, but we assume it is not possible to
answer queries from the archival store. It could be examined only under special
circumstances using time-consuming retrieval processes.

There is also a working store, into which summaries or parts of streams may be placed,
and which can be used for answering queries. The working store might be disk, or it might
be main memory, depending on how fast we need to process queries. But either way, it is
of sufficiently limited capacity that it cannot store all the data from all the streams.

Examples of Stream Sources

Sensor Data

Image Data

Internet and Web Traffic

Sensor Data

Sensor data is used, for example, in navigation systems. Imagine a temperature sensor bobbing about in the ocean, sending back to the base station a reading of the surface temperature each hour. The data produced by this sensor is a stream of real numbers. A single such sensor produces so little data that storing it poses no problem; the challenge appears only at scale, as the following example shows.

To learn something about ocean behavior, we might want to deploy a million sensors, each
sending back a stream, at the rate of ten per second. A million sensors isn’t very many;
there would be one for every 150 square miles of ocean. Now we have 3.5 terabytes
arriving every day, and we definitely need to think about what can be kept in working
storage and what can only be archived.
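The 3.5-terabyte figure can be checked with simple arithmetic; the sketch below assumes each reading is a 4-byte real number, which is not stated explicitly above.

```python
# Back-of-the-envelope check of the 3.5 TB/day figure.
sensors = 1_000_000
readings_per_second = 10
bytes_per_reading = 4          # assumption: one 32-bit real number per reading
seconds_per_day = 86_400

bytes_per_day = sensors * readings_per_second * bytes_per_reading * seconds_per_day
terabytes_per_day = bytes_per_day / 10**12
print(round(terabytes_per_day, 2))  # about 3.46, i.e. roughly 3.5 TB per day
```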

Image Data


Satellites often send down to Earth streams comprising many terabytes of images per day. Surveillance cameras produce images with lower resolution than satellites, but there can be very many of them, each producing a stream of images at intervals like one second.

Internet and Web Traffic –

A switching node in the center of the internet receives streams of IP packets from many
inputs and routes them to its outputs. Websites receive streams of heterogeneous types.
For example, Google receives a hundred million search queries per day. Yahoo! accepts
billions of “clicks” per day on its various sites.

Stream Queries

Queries concerning streams can be asked in two different ways.

 Standing query

 Ad-hoc query

Standing query

The stream produced by the ocean-surface-temperature sensor might have a standing query to output an alert whenever the temperature exceeds 25 degrees centigrade. This query is easily answered, since it depends only on the most recent stream element.

Alternatively, we might have a standing query that, each time a new reading arrives,
produces the average of the 24 most recent readings. That query also can be answered
easily, if we store the 24 most recent stream elements. When a new stream element
arrives, we can drop from the working store the 25th most recent element, since it will
never again be needed.

Another query we might ask is the maximum temperature ever recorded by that sensor. We
can answer this query by retaining a simple summary: the maximum of all stream elements
ever seen. It is not necessary to record the entire stream. When a new stream element
arrives, we compare it with the stored maximum, and set the maximum to whichever is
larger. We can then answer the query by producing the current value of the maximum.

Similarly, if we want the average temperature over all time, we have only to record two
values: the number of readings ever sent in the stream and the sum of those readings. We
can adjust these values easily each time a new reading arrives, and we can produce their
quotient as the answer to the query.
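The standing queries above can be sketched in a few lines. The class below is an illustrative implementation (the class and method names are invented for this sketch): it keeps only the 24 most recent readings, the running maximum, and the count and sum needed for the all-time average.

```python
from collections import deque

class SensorStandingQueries:
    """Keeps just enough state to answer the standing queries described above."""
    def __init__(self):
        self.recent = deque(maxlen=24)  # the 24 most recent readings
        self.maximum = float("-inf")    # maximum reading ever seen
        self.count = 0                  # number of readings ever seen
        self.total = 0.0                # sum of all readings ever seen

    def new_reading(self, temp):
        self.recent.append(temp)        # deque silently drops the 25th most recent
        self.maximum = max(self.maximum, temp)
        self.count += 1
        self.total += temp
        if temp > 25:                   # standing alert query
            print(f"ALERT: temperature {temp} exceeds 25 C")

    def average_of_recent(self):
        return sum(self.recent) / len(self.recent)

    def all_time_average(self):
        return self.total / self.count

q = SensorStandingQueries()
for reading in [20.0, 26.0, 23.0]:
    q.new_reading(reading)              # 26.0 triggers the alert
```

Note that none of these queries requires storing the whole stream: a bounded window, one number, or two numbers suffice, exactly as the text argues.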

Ad-hoc query

The other form of query is ad-hoc: a question asked once about the current state of a stream or streams. If we do not store all streams in their entirety, as normally we cannot, then we cannot expect to answer arbitrary queries about streams. If we have some idea what kinds of queries will be asked through the ad-hoc query interface, then we can prepare for them by storing appropriate parts or summaries of streams.

If we want the facility to ask a wide variety of ad-hoc queries, a common approach is to store a sliding window of each stream in the working store. A sliding window can be the most recent n elements of a stream, for some n, or it can be all the elements that arrived within the last t time units, e.g., one day. If we regard each stream element as a tuple, we can treat the window as a relation and query it with any SQL query. Of course the stream-management system must keep the window fresh, deleting the oldest elements as new ones come in.
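A time-based sliding window of this kind might be maintained as sketched below (the class and its names are hypothetical); the oldest elements are deleted as they fall out of the window.

```python
from collections import deque

class SlidingWindow:
    """Keeps only the elements that arrived within the last t time units."""
    def __init__(self, t):
        self.t = t
        self.window = deque()            # (arrival_time, element) pairs, oldest first

    def insert(self, now, element):
        self.window.append((now, element))
        self._expire(now)

    def _expire(self, now):
        # Delete the oldest elements once they are older than t time units.
        while self.window and self.window[0][0] <= now - self.t:
            self.window.popleft()

    def elements(self, now):
        self._expire(now)
        return [e for _, e in self.window]

w = SlidingWindow(t=10)
w.insert(0, "a")
w.insert(5, "b")
w.insert(12, "c")                        # "a" arrived more than 10 units ago
```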

Example 4.2: Web sites often like to report the number of unique users over the past
month. If we think of each login as a stream element, we can maintain a window that is all
logins in the most recent month. We must associate the arrival time with each login, so we
know when it no longer belongs to the window. If we think of the window as a relation
Logins(name, time), then it is simple to get the number of unique users over the past
month. The SQL query is:

SELECT COUNT(DISTINCT(name))

FROM Logins

WHERE time >= t;

Here, t is a constant that represents the time one month before the current time.

Issues in Stream Processing

Streams often deliver elements very rapidly. We must process elements in real time, or we lose the opportunity to process them at all, without recourse to the archival storage. Thus, it often is important that the stream-processing algorithm be executed in main memory, without access to secondary storage or with only rare accesses to secondary storage. Moreover, even when streams are "slow," there may be many such streams. Even if each


stream by itself can be processed using a small amount of main memory, the requirements
of all the streams together can easily exceed the amount of available main memory.

Sampling Data in a Stream

Example: extracting reliable samples from a stream

A search engine receives a stream of queries, and it would like to study the behavior of
typical users. We assume the stream consists of tuples (user, query, time). Suppose that we
want to answer queries such as “What fraction of the typical user’s queries were repeated
over the past month?” Assume also that we wish to store only 1/10th of the stream
elements.

Approach

 Generate a random number, say an integer from 0 to 9, in response to each search query.

 Store the tuple if and only if the random number is 0. If we do so, each user has, on average, 1/10th of their queries stored.

 Statistical fluctuations will introduce some noise into the data, but if users issue many queries, the law of large numbers will assure us that most users will have a fraction quite close to 1/10th of their queries stored.
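The random-number scheme above might look like the following sketch (the function name is invented for illustration):

```python
import random

def sample_query(stream_tuple, sample):
    """Keep a stream tuple with probability 1/10, as in the naive scheme."""
    if random.randrange(10) == 0:    # integer from 0 to 9; keep on 0
        sample.append(stream_tuple)

random.seed(0)
sample = []
for i in range(100_000):
    sample_query(("user", f"query {i}", i), sample)
# len(sample) will be close to 10,000, i.e. 1/10th of the stream
```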

However, this scheme gives us the wrong answer to the query asking for the average
number of duplicate queries for a user.

Suppose a user has issued

 s search queries one time in the past month,

 d search queries twice, and

 no search queries more than twice.

If we have a 1/10th sample of queries,

 we shall see in the sample for that user an expected s/10 of the search queries
issued once.

 Of the d search queries issued twice, only d/100 will appear twice in the
sample; that fraction is d times the probability that both occurrences of the


query will be in the 1/10th sample. Of the queries that appear twice in the full
stream, 18d/100 will appear exactly once.

 To see why, note that 18/100 is the probability that one of the two occurrences
will be in the 1/10th of the stream that is selected, while the other is in the
9/10th that is not selected.

The correct answer to the query about the fraction of repeated searches is d/(s+d).
However, the answer we shall obtain from the sample is d/(10s+19d). To derive the latter
formula, note that d/100 appear twice, while s/10+18d/100 appear once. Thus, the fraction
appearing twice in the sample is d/100 divided by d/100+s/10+18d/100. This ratio is
d/(10s+19d). For no positive values of s and d is d/(s+d) = d/(10s+19d).
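The formula d/(10s+19d) can be checked numerically. The simulation below (an illustrative sketch) samples each query occurrence independently with probability 1/10, as the naive scheme does, and measures the fraction of distinct sampled queries that appear twice.

```python
import random

random.seed(42)
s, d = 100_000, 100_000          # s singleton queries, d queries issued twice

# Each occurrence is kept independently with probability 1/10.
occurrences = [f"s{i}" for i in range(s)] + [f"d{i}" for i in range(d)] * 2
counts = {}
for q in occurrences:
    if random.randrange(10) == 0:
        counts[q] = counts.get(q, 0) + 1

appearing_twice = sum(1 for c in counts.values() if c == 2)
fraction = appearing_twice / len(counts)
print(round(fraction, 4))        # near d/(10s+19d) = 1/29, i.e. about 0.0345,
                                 # far from the true answer d/(s+d) = 0.5
```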

Obtaining a Representative Sample

Many queries about the statistics of typical users cannot be answered by taking a sample of each user's search queries. Thus, we must strive to pick 1/10th of the users, and take all their searches for the sample, while taking none of the searches from other users.

If we can store a list of all users, and whether or not they are in the sample, then we could
do the following.

 Each time a search query arrives in the stream, we look up the user to see
whether or not they are in the sample.

 If so, we add this search query to the sample, and if not, then not.

 However, if we have no record of ever having seen this user before, then we
generate a random integer between 0 and 9.

 If the number is 0, we add this user to our list with value “in,” and if the
number is other than 0, we add the user with the value “out.”

The method works as long as we can afford to keep the list of all users and their in/out
decision in main memory, because there isn’t time to go to disk for every search that
arrives. By using a hash function, one can avoid keeping the list of users. That is, we hash
each user name to one of ten buckets, 0 through 9. If the user hashes to bucket 0, then
accept this search query for the sample, and if not, then not.

We do not actually store the user in the bucket; in fact, there is no data in the buckets at all. Effectively, we use the hash function as a random-number generator, with the important

property that, when applied to the same user several times, we always get the same
“random” number. That is, without storing the in/out decision for any user, we can
reconstruct that decision any time a search query by that user arrives.

More generally, we can obtain a sample consisting of any rational fraction a/b of the users
by hashing user names to b buckets, 0 through b−1. Add the search query to the sample if
the hash value is less than a.

The General Sampling Problem

Our stream consists of tuples with n components. A subset of the components are the key components, on which the selection of the sample will be based. In our running example there are three components (user, query, and time), of which only user is in the key. However, we could also take a sample of queries by making query the key, or even take a sample of user-query pairs by making both those components form the key.

To take a sample of size a/b,

 we hash the key value for each tuple to b buckets, and accept the tuple for the
sample if the hash value is less than a.

 If the key consists of more than one component, the hash function needs to combine
the values for those components to make a single hash-value.

 The result will be a sample consisting of all tuples with certain key values.

 The selected key values will be approximately a/b of all the key values appearing in
the stream.
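The bucket-hashing scheme might be sketched as follows, using SHA-256 purely as a convenient hash function (the function name and the tuple layout are assumptions of this sketch). Because the hash is deterministic, the same key always gets the same in/out decision, with no per-key state stored.

```python
import hashlib

def in_sample(key, a, b):
    """Accept a tuple whose key components hash to one of buckets 0..a-1 of b."""
    # Combine the key components into a single byte string, then hash it.
    data = "\x00".join(str(k) for k in key).encode()
    bucket = int.from_bytes(hashlib.sha256(data).digest()[:8], "big") % b
    return bucket < a

stream_tuple = ("alice", "best camera", "2024-05-01T10:00")
key = stream_tuple[:1]           # key = (user,): sample whole users, a/b = 1/10
keep = in_sample(key, a=1, b=10)
```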

Varying the Sample Size

Often, the sample will grow as more of the stream enters the system. In our running
example, we retain all the search queries of the selected 1/10th of the users, forever. As
time goes on, more searches for the same users will be accumulated, and new users that
are selected for the sample will appear in the stream.

If we have a budget for how many tuples from the stream can be stored as the sample,
then the fraction of key values must vary, lowering as time goes on. In order to assure that
at all times, the sample consists of all tuples from a subset of the key values, we choose a


hash function h from key values to a very large number of values 0,1,...,B−1. We maintain
a threshold t, which initially can be the largest bucket number, B − 1. At all times, the
sample consists of those tuples whose key K satisfies h(K) ≤ t. New tuples from the
stream are added to the sample if and only if they satisfy the same condition.

If the number of stored tuples of the sample exceeds the allotted space, we lower t to
t−1 and remove from the sample all those tuples whose key K hashes to t. For
efficiency, we can lower t by more than 1, and remove the tuples with several of the
highest hash values, whenever we need to throw some key values out of the sample.
Further efficiency is obtained by maintaining an index on the hash value, so we can find all
those tuples whose keys hash to a particular value quickly.
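The threshold scheme might be sketched as below (the class and its names are invented for illustration). As the text suggests, t can be lowered by more than 1 at a time: here we jump directly to the largest occupied hash value, and the index on hash values makes eviction of all tuples with that hash fast.

```python
import hashlib

B = 1_000_000                    # number of hash values, 0..B-1

def h(key):
    data = str(key).encode()
    return int.from_bytes(hashlib.sha256(data).digest()[:8], "big") % B

class FixedBudgetSample:
    """Sample of all tuples whose key K satisfies h(K) <= t; shrinks t when full."""
    def __init__(self, budget):
        self.budget = budget
        self.t = B - 1               # threshold starts at the largest hash value
        self.by_hash = {}            # index: hash value -> list of stored tuples

    def add(self, key, stream_tuple):
        hv = h(key)
        if hv > self.t:
            return                   # this key is not in the sampled subset
        self.by_hash.setdefault(hv, []).append(stream_tuple)
        while self._size() > self.budget and self.by_hash:
            # Lower t past the largest occupied hash value, evicting its tuples.
            top = max(self.by_hash)
            self.by_hash.pop(top)
            self.t = top - 1

    def _size(self):
        return sum(len(v) for v in self.by_hash.values())

sample = FixedBudgetSample(budget=5)
for i in range(20):
    sample.add(f"user{i}", (f"user{i}", "some query", i))
# At most 5 tuples survive, and every surviving key K still satisfies h(K) <= t.
```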

Filtering Streams

Another common process on streams is selection, or filtering. We want to accept those tuples in the stream that meet a criterion. Accepted tuples are passed to another process as a stream, while other tuples are dropped.

