Unit 4 Notes
INDEX:
Streams Concepts, Stream Data Model and Architecture, Stream Computing
o Definition
o Stream Data vs. Batch Processing
o Stream Data Model
o Data Stream Management System
o Sources of Stream Data
o Stream Queries
o Issues in Stream Processing
Sampling Data in a Stream
o Example
o Obtaining a representative sample
o General sampling problem
o Varying the sample size
Filtering Streams
o Example
o Bloom filter
o Analysis of Bloom filter
Counting Distinct Elements in a Stream
o Count-Distinct problem
o FM (Flajolet-Martin) algorithm
o Combining estimates
o Space requirements
Estimating moments
o Definition of moments
o AMS (Alon-Matias-Szegedy) Algorithm
o Higher order moments
o Dealing with infinite streams
Counting Ones in a Window
o Cost of exact counts
o DGIM (Datar-Gionis-Indyk-Motwani) Algorithm
o Storage requirements
o Query answering in DGIM algorithm
o Maintaining DGIM conditions
o Reducing the error
o Extensions to the counting of ones
Decaying Window
o Problem of most common elements
o Definition of the decaying window
o Finding most popular elements
Real-Time Analytics Platform (RTAP) Applications
Case Studies
Real Time Sentiment Analysis
o Definition
o Examples
o Working model
o Methods
Lexicon based
Machine learning based
Combined
Stock Market Predictions
o Definition
o Examples
o Working model
Using Graph Analytics for Big Data: Graph Analytics
o The Simplicity of the Graph Model
o Representation as Triples
o Graphs and Network Organization
o Choosing Graph Analytics
o Graph Analytics Use Cases
o Graph Analytics Algorithms and Solution Approaches
o Technical Complexity of Analyzing Graphs
o Features of a Graph Analytics Platform
o Dedicated Appliances for Graph Analytics
Streaming Data – Definition:
Streaming data is data that is continuously generated by different sources. Such data should
be processed incrementally using Stream Processing techniques.
THE STREAM DATA MODEL
A Data-Stream-Management System
Any number of streams can enter the system. Each stream can provide elements at its
own schedule; they need not have the same data rates or data types, and the time
between elements of one stream need not be uniform.
The fact that the rate of arrival of stream elements is not under the control of the system
distinguishes stream processing from the processing of data that goes on within a database-
management system.
The latter system controls the rate at which data is read from the disk, and therefore never
has to worry about data getting lost as it attempts to execute queries.
Streams may be archived in a large archival store, but we assume it is not possible to
answer queries from the archival store. It could be examined only under special
circumstances using time-consuming retrieval processes.
There is also a working store, into which summaries or parts of streams may be placed,
and which can be used for answering queries. The working store might be disk, or it might
be main memory, depending on how fast we need to process queries. But either way, it is
of sufficiently limited capacity that it cannot store all the data from all the streams.
Stream Queries
There are two ways that queries get asked about streams. We show in Fig. 4.1 a place
within the processor where standing queries are stored. These queries are, in a sense,
permanently executing, and produce outputs at appropriate times.
The other form of query is ad-hoc, a question asked once about the current state of a stream
or streams. If we do not store all streams in their entirety, as normally we cannot, then we
cannot expect to answer arbitrary queries about streams. If we have some idea what kind
of queries will be asked through the ad-hoc query interface, then we can prepare for them
by storing appropriate parts or summaries of streams.
Example: Web sites often like to report the number of unique users over the past month. If we
think of each login as a stream element, we can maintain a window that is all logins in the most
recent month. We must associate the arrival time with each login, so we know when it no longer
belongs to the window. If we think of the window as a relation Logins(name, time), then it is
simple to get the number of unique users over the past month. The SQL query is:
SELECT COUNT(DISTINCT(name))
FROM Logins
WHERE time >= t;
Here, t is a constant that represents the time one month before the current time.
Note that we must be able to maintain the entire stream of logins for the past month in
working storage. However, for even the largest sites, that data is not more than a few
terabytes, and so surely can be stored on disk.
Issues in Stream Processing
• It is important that the stream-processing algorithm is executed in main memory, without access to secondary storage, or with only rare accesses to secondary storage.
• Stream-processing problems often require the invention of new techniques in order to execute them at a realistic rate on a machine of realistic size.
• It is often much more efficient to get an approximate answer to the problem than an exact solution.
Applications
• Mining query streams - Google wants to know which queries are more frequent today than yesterday.
• Mining click streams - Yahoo wants to know which of its pages received an unusual number of hits in the past hour.
• Mining social network news feeds - e.g., looking for trending topics on Twitter or Facebook.
• Telephone call records - data feeds into customer bills as well as settlements between telephone companies.
Example
Our running example is the following. A search engine receives a stream of queries, and it
would like to study the behavior of typical users.
We assume the stream consists of tuples (user, query, time). Suppose that we want to
answer queries such as “What fraction of the typical user’s queries were repeated over the
past month?”.
Assume also that we wish to store only 1/10th of the stream elements. The obvious
approach would be to generate a random number, say an integer from 0 to 9, in response
to each search query. Store the tuple if and only if the random number is 0. If we do so,
each user has, on average, 1/10th of their queries stored.
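A minimal sketch of this sampling step, together with the refinement the index lists as "obtaining a representative sample": instead of generating a random number per query, hash the key attribute (the user) to one of 10 buckets and keep a tuple iff its user lands in an accepted bucket. Every retained user then has all of their queries in the sample, which questions about a typical user's behavior require. The function and variable names below are our own:

import hashlib

def keep_tuple(user, buckets_kept=1, total_buckets=10):
    # Hash the key attribute (user) to a bucket in [0, total_buckets).
    # Keeping whole buckets keeps ALL of each sampled user's queries.
    h = int(hashlib.md5(user.encode()).hexdigest(), 16)
    return h % total_buckets < buckets_kept

# usage on a stream of (user, query, time) tuples
stream = [("alice", "cats", 1), ("bob", "dogs", 2), ("alice", "cats", 3)]
sample = [t for t in stream if keep_tuple(t[0])]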
Fixed Size Sampling – Reservoir Sampling
Example:
Suppose we have a stream of tuples with the schema Grades(university, courseID, studentID, grade). Assume universities are unique, but a courseID is unique only within a university (i.e., different universities may have different courses with the same ID, e.g., "CS101"), and likewise studentIDs are unique only within a university (different universities may assign the same ID to different students).
Suppose we want to answer certain queries approximately from a 1/20th sample of the data. For
each of the queries below, indicate how you would construct the sample. That is, tell what the key
attributes should be.
(a) For each university, estimate the average number of students in a course.
(b) Estimate the fraction of students who have a GPA of 3.5 or more.
(c) Estimate the fraction of courses where at least half the students got “A.”
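The heading above names reservoir sampling; as a companion to the exercise, here is a minimal sketch of the classic algorithm for keeping a fixed-size uniform sample of a stream whose length is unknown in advance (a generic illustration, not tied to the Grades schema):

import random

def reservoir_sample(stream, s):
    # Maintain a uniform random sample of s elements from a stream.
    sample = []
    for n, element in enumerate(stream, start=1):
        if n <= s:
            sample.append(element)                 # the first s elements fill the reservoir
        elif random.random() < s / n:              # keep the n-th element with probability s/n
            sample[random.randrange(s)] = element  # evict a uniformly chosen resident
    return sample

print(reservoir_sample(range(10000), 5))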
FILTERING STREAMS
Another common process on streams is selection, or filtering. We want to accept those
tuples in the stream that meet a criterion. Accepted tuples are passed to another process
as a stream, while other tuples are dropped.
Example - First Cut Solution
Suppose we want to let through only those stream elements whose key belongs to a large set S, say one billion allowed email addresses. Use a main-memory array of eight billion bits, initially all 0's, and a hash function h from addresses to bit positions. Hash each member of S and set the corresponding bit to 1. When a stream element arrives, hash its address and let the element through if its bit is 1. Every member of S gets through, but roughly 1/8 of the addresses not in S are false positives that get through as well.
Bloom Filter
A Bloom filter consists of: an array of n bits, initially all 0's; a collection of hash functions h1, h2, . . . , hk, each mapping key values to one of the n bit positions; and a set S of m key values. To initialize, set to 1 every bit hi(K), for each hash function hi and each key K in S. To test a stream element with key K, let it through if all of h1(K), h2(K), . . . , hk(K) are 1; if one or more of these bits are 0, K cannot be in S. The filter never rejects a member of S, but it may let through some non-members (false positives).
Bloom Filter – Example
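A compact sketch of a Bloom filter; the bit-array size, the number of hash functions, and the seeded hashing below are illustrative choices, not prescribed by the notes:

class BloomFilter:
    def __init__(self, n_bits=1000, k=3):
        self.n = n_bits
        self.k = k
        self.bits = [0] * n_bits            # the array of n bits, initially all 0's

    def _positions(self, key):
        # k seeded hash functions mapping the key to bit positions
        return [hash((seed, key)) % self.n for seed in range(self.k)]

    def add(self, key):
        for p in self._positions(key):      # set bit h_i(key) to 1 for every i
            self.bits[p] = 1

    def might_contain(self, key):
        # False is always correct; True may be a false positive
        return all(self.bits[p] for p in self._positions(key))

bf = BloomFilter()
for address in ["alice@example.com", "bob@example.com"]:
    bf.add(address)
print(bf.might_contain("alice@example.com"))    # True
print(bf.might_contain("mallory@example.com"))  # almost certainly False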
COUNTING DISTINCT ELEMENTS IN A STREAM
Suppose stream elements are chosen from some universal set. We would like to know how
many different elements have appeared in the stream, counting either from the
beginning of the stream or from some known time in the past.
The Flajolet-Martin Algorithm
The idea behind the Flajolet-Martin Algorithm is that the more different elements we see in the stream, the more different hash-values we shall see. As we see more different hash-values, it becomes more likely that one of these values will be "unusual." The particular unusual property we exploit here is a value that ends in many 0's. Whenever we apply a hash function h to a stream element a, the bit string h(a) will end in some number of 0's, possibly none; call this number the tail length for a and h. Let R be the maximum tail length of any element seen so far in the stream. Then we use 2^R as our estimate of the number of distinct elements.
FM Algorithm – Example
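The original example figure is not reproduced here; instead, the following sketch illustrates the algorithm end to end, including the grouping trick described under "Combining Estimates" below. The seeds and group sizes are arbitrary illustrative choices:

import statistics

def tail_length(x):
    # the number of 0's at the end of the binary representation of x
    if x == 0:
        return 0
    return (x & -x).bit_length() - 1

def fm_estimate(stream, num_hashes=20, group_size=5):
    maxima = [0] * num_hashes                       # max tail length R per hash function
    for element in stream:
        for seed in range(num_hashes):
            h = hash((seed, element)) & 0xFFFFFFFF  # the seed simulates distinct hash functions
            maxima[seed] = max(maxima[seed], tail_length(h))
    estimates = [2 ** r for r in maxima]            # one estimate 2^R per hash function
    groups = [estimates[i:i + group_size] for i in range(0, num_hashes, group_size)]
    return statistics.median(sum(g) / len(g) for g in groups)  # median of the group averages

print(fm_estimate(["a", "b", "c", "a", "b", "d"]))  # should come out near 4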
Combining Estimates
First, group the hash functions into small groups, and take their average. Then, take the
median of the averages.
Space Requirements
Observe that as we read the stream it is not necessary to store the elements seen. The only
thing we need to keep in main memory is one integer per hash function.
ESTIMATING MOMENTS
The problem, called computing "moments," involves the distribution of frequencies of different elements in the stream.
Definition of Moments
Suppose a stream consists of elements chosen from a universal set, and let m_i be the number of times the ith element of that set appears in the stream. Then the kth-order moment of the stream is the sum over all elements i of (m_i)^k.
The 0th moment is a count of the number of distinct elements in the stream. The 1st moment is the sum of the m_i's, which is simply the length of the stream.
The second moment is the sum of the squares of the m_i's. It is sometimes called the surprise number, since it measures how uneven the distribution of elements in the stream is.
The Alon-Matias-Szegedy (AMS) Algorithm for Second Moments
We can estimate the second moment using a limited number of variables. Each variable X has two components: X.element, an element of the universal set, and X.value, an integer. To determine a variable X, choose a position in the stream between 1 and n uniformly at random; set X.element to the element found there, and set X.value to the number of occurrences of that element from the chosen position to the end of the stream. From a single variable X we derive the estimate n(2 · X.value − 1) of the second moment; averaging this estimate over many variables gives a good approximation.
AMS Algorithm – Example
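A minimal sketch of the estimator for a finite stream held in memory; the number of variables and the simple averaging are illustrative choices:

import random

def ams_second_moment(stream, num_vars=25):
    stream = list(stream)
    n = len(stream)
    estimates = []
    for pos in random.sample(range(n), min(num_vars, n)):  # random positions for the variables
        element = stream[pos]                    # X.element
        value = stream[pos:].count(element)      # X.value: occurrences from pos to the end
        estimates.append(n * (2 * value - 1))    # single-variable estimate n(2·X.value − 1)
    return sum(estimates) / len(estimates)       # average over the variables

print(ams_second_moment("abcbdacdabdcaab"))      # true second moment of this stream is 59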
Higher-Order Moments
We estimate kth moments, for k > 2, in essentially the same way as we estimate second
moments.
More generally, we can estimate kth moments for any k ≥ 2 by turning the value v = X.value into n(v^k − (v − 1)^k).
Dealing With Infinite Streams
The proper technique is to maintain as many variables as we can store at all times, and to throw
some out as the stream grows. The discarded variables are replaced by new ones, in such a
way that at all times, the probability of picking any one position for a variable is the same
as that of picking any other position. Suppose we have space to store s variables. Then the
first s positions of the stream are each picked as the position of one of the s variables.
The probability of selecting each of the first n positions is s/n. When the (n+1)st position arrives, pick it with probability s/(n + 1); if it is picked, throw out one of the current s variables, chosen uniformly at random, and replace it with a variable for the new position. By induction, every position of the stream is then the position of a variable with the same probability, s/(n + 1).
The Datar-Gionis-Indyk-Motwani (DGIM) Algorithm
This version of the algorithm uses O(log^2 N) bits to represent a window of N bits, and allows us to estimate the number of 1's in the window with an error of no more than 50%.
Each bit of the stream has a timestamp, the position in which it arrives. The first bit has timestamp 1, the second has timestamp 2, and so on.
We divide the window into buckets. Each bucket is represented by the timestamp of its right (most recent) end and its size, the number of 1's the bucket contains; sizes are restricted to powers of 2.
There are six rules that must be followed when representing a stream by buckets.
• The right end of a bucket is always a position with a 1.
• Every position with a 1 is in some bucket.
• No position is in more than one bucket.
• There are one or two buckets of any given size, up to some maximum size.
• All sizes must be a power of 2.
• Buckets cannot decrease in size as we move to the left (back in time).
Query Answering in the DGIM Algorithm
Suppose we are asked how many 1's there are in the last k bits of the window, for some 1 ≤ k ≤ N. Find the bucket b with the earliest timestamp that includes at least some of the k most recent bits. Estimate the number of 1's to be the sum of the sizes of all the buckets to the right of (more recent than) bucket b, plus half the size of b itself.
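A sketch that ties the six bucket rules and this query rule together. The representation (a Python list of (timestamp, size) pairs, newest first) and the integer halving of the oldest bucket are our own choices:

class DgimCounter:
    def __init__(self, window_size):
        self.N = window_size
        self.t = 0                   # timestamp of the most recent bit
        self.buckets = []            # (timestamp of right end, size), newest first

    def add(self, bit):
        self.t += 1
        # drop the oldest bucket once its right end leaves the window
        if self.buckets and self.buckets[-1][0] <= self.t - self.N:
            self.buckets.pop()
        if bit != 1:
            return
        self.buckets.insert(0, (self.t, 1))
        # restore "one or two buckets of each size": merge the two oldest of any three
        i = 0
        while i + 2 < len(self.buckets):
            if self.buckets[i][1] == self.buckets[i + 2][1]:
                newer_ts = self.buckets[i + 1][0]     # merged bucket keeps the newer timestamp
                self.buckets[i + 1:i + 3] = [(newer_ts, self.buckets[i + 1][1] * 2)]
                # do not advance i: the merge may create a new triple one size up
            else:
                i += 1

    def count_ones(self, k=None):
        # estimate the 1's among the last k bits: full sizes of the buckets that are
        # wholly recent enough, plus half (integer half) of the earliest overlapping bucket
        k = self.N if k is None else k
        total, last_size = 0, 0
        for ts, size in self.buckets:            # newest to oldest
            if ts <= self.t - k:
                break
            total, last_size = total + size, size
        return total - last_size // 2

counter = DgimCounter(window_size=100)
for bit in [1, 0, 1, 1, 0, 1, 1, 1] * 20:
    counter.add(bit)
print(counter.count_ones(50))   # estimate of the 1's among the last 50 bits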
Reducing the Error
Instead of allowing one or two buckets of each size, suppose we allow either r − 1 or r buckets of each size, for some integer r > 2. Then, no matter what the size 2^j of the largest bucket partially within the range of the query is, the fractional error is upper bounded by 1/(r − 1). Thus, by picking r sufficiently large, we can limit the error to any desired ε > 0.
Extensions to the Counting of Ones
Suppose the stream consists of integers in the range 1 to 2^m for some small m, and we want the sum of the last k integers. Treat each of the m bits of each integer as a separate bit stream, and use the DGIM method to count the 1's in each. If c_i is the estimated count of 1's for the stream of the ith bits, then the sum of the integers is Σ_{i=0}^{m−1} c_i 2^i.
DECAYING WINDOWS
Sometimes we do not want to make a sharp distinction between recent elements and those
in the distant past, but want to weight the recent elements more heavily.
We consider “exponentially decaying windows,” and an application where they are quite
useful: finding the most common “recent” elements.
Formally, if the stream so far is a_1, a_2, . . . , a_t, the exponentially decaying window for a small constant c is the sum Σ_{i=0}^{t−1} a_{t−i} (1 − c)^i. When a new element a_{t+1} arrives at the stream input, all we need to do is (as the sketch below illustrates):
1. Multiply the current sum by 1 − c.
2. Add a_{t+1}.
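For finding the most popular elements, imagine one decaying window per element: the element's "stream" is 1 on the steps where it arrives and 0 otherwise, and any score that drops below 1/2 is discarded, which keeps the number of tracked elements small. A small sketch of one stream step (the constant c and the 1/2 threshold follow the usual presentation; the names are our own):

def update_scores(scores, arriving, c=1e-6):
    for key in scores:                 # 1. multiply every current score by 1 − c
        scores[key] *= (1 - c)
    # 2. add 1 to the score of the arriving element (creating it if needed)
    scores[arriving] = scores.get(arriving, 0.0) + 1.0
    # 3. drop scores that fell below the threshold of 1/2
    for key in [k for k, v in scores.items() if v < 0.5]:
        del scores[key]

scores = {}
for movie in ["A", "B", "A", "C", "A", "B"]:
    update_scores(scores, movie)
print(max(scores, key=scores.get))     # the currently most popular element: "A"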
Applications
Sentiment Analysis for Customer Experience - In today’s social media world, most of your
customers are using social networks, particularly Twitter, to talk and express their opinions about
your brand, product, and services. Analyzing tweets, online reviews and news articles is quite
useful for social media managers or business analysts to get insights into how customers feel and
act upon it afterward. For this task, automated sentiment analysis is an essential component as a
language processing tool.
Sentiment Analysis for Employee Experience - A significant percentage of workers leave their jobs each year, while another portion is fired or let go. HR teams are therefore starting to take action, with the help of data analytics, to understand such tendencies and reduce turnover while improving performance. Understanding what employees are talking about and how they feel about the company is possible thanks to sentiment analysis systems, and this helps workforce analysts reduce employee churn.
Working:
There are three main approaches to sentiment analysis:
1. Lexicon based - uses a lexicon (a dictionary of words annotated with sentiment polarity) to identify the polarity of the text.
2. Machine learning based - trains a classification model on a pre-labelled data set and uses it to predict the polarity of new text.
3. Combined approach - uses a lexicon dictionary along with a pre-labelled data set for developing the classification model.
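A toy sketch of the lexicon-based approach; the word scores below are invented for illustration and are not taken from any standard sentiment lexicon:

def lexicon_sentiment(text, lexicon):
    # sum the polarity scores of the words found in the lexicon; the sign gives the label
    score = sum(lexicon.get(word, 0) for word in text.lower().split())
    return "positive" if score > 0 else "negative" if score < 0 else "neutral"

toy_lexicon = {"good": 1, "great": 2, "love": 2, "bad": -1, "terrible": -2, "hate": -2}
print(lexicon_sentiment("I love this product and the service was great", toy_lexicon))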
STOCK MARKET PREDICTIONS
Stock market prediction is the act of trying to determine the future value of a company
stock or other financial instrument traded on an exchange. The successful prediction of a
stock's future price could yield significant profit.
Decision Tree algorithm
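The notes name only the algorithm; as one possible illustration, here is a hedged sketch that trains scikit-learn's DecisionTreeClassifier to predict next-day price direction from the last five daily returns. The data is synthetic and every parameter choice is arbitrary; this is a sketch of the idea, not a trading model:

import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
prices = 100 + np.cumsum(rng.normal(0, 1, 300))    # synthetic daily closing prices

returns = np.diff(prices) / prices[:-1]            # daily returns
window = 5
X = np.array([returns[i:i + window] for i in range(len(returns) - window)])
y = (returns[window:] > 0).astype(int)             # 1 if the next day's price rose

split = 250                                        # train on early days, test on the rest
model = DecisionTreeClassifier(max_depth=3).fit(X[:split], y[:split])
print("directional accuracy on held-out days:", model.score(X[split:], y[split:]))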