Unit 4 Notes PDF

Download as pdf or txt
Download as pdf or txt
You are on page 1of 27

CS8091 - BIG DATA ANALYTICS – Unit 4 – Lecture Notes

UNIT IV - STREAM MEMORY


Introduction to Streams Concepts – Stream Data Model and Architecture - Stream Computing,
Sampling Data in a Stream – Filtering Streams – Counting Distinct Elements in a Stream –
Estimating moments – Counting oneness in a Window – Decaying Window – Real time Analytics
Platform (RTAP) applications - Case Studies - Real Time Sentiment Analysis, Stock Market
Predictions. Using Graph Analytics for Big Data: Graph Analytics.

INDEX:
 Streams Concepts, Stream Data Model and Architecture, Stream Computing
o Definition
o Stream Data vs. Batch Processing
o Stream Data Model
o Data Stream Management System
o Sources of Stream Data
o Stream Queries
o Issues in Stream Processing
 Sampling Data in a Stream
o Example
o Obtaining a representative sample
o General sampling problem
o Varying the sample size
 Filtering Streams
o Example
o Bloom filter
o Analysis of bloom filter
 Counting Distinct Elements in a Stream
o Count-Distinct problem
o FM (Flajolet-Martin) algorithm
o Combining estimates
o Space requirements

Page 1 of 27
 Estimating moments
o Definition of moments
o ASM (Alon-Matias-Szegedy) Algorithm
o Higher order moments
o Dealing with infinite streams
 Counting oneness in a Window
o Cost of exact counts
o DGIM (Datar-Gionis-Indyk-Motwani) Algorithm
o Storage requirements
o Query answering in DGIM algorithm
o Maintaining DGIM conditions
o Reducing the error
o Extensions to the counting of ones
 Decaying Window
o Problem of most common elements
o Definition of the decaying window
o Finding most popular elements
 Real time Analytics Platform (RTAP) applications
 Case Studies
 Real Time Sentiment Analysis
o Definition
o Examples
o Working model
o Methods
 Lexicon based
 Machine learning based
 Combined
 Stock Market Predictions
o Definition
o Examples
o Working model

Page 2 of 27
 Using Graph Analytics for Big Data: Graph Analytics
o The Simplicity of the Graph Model
o Representation as Triples
o Graphs and Network Organization
o Choosing Graph Analytics
o Graph Analytics Use Cases
o Graph Analytics Algorithms and Solution Approaches
o Technical Complexity of Analyzing Graphs
o Features of a Graph Analytics Platform
o Dedicated Appliances for Graph Analytics

Page 3 of 27
Streaming Data – Definition:
 Streaming data is data that is continuously generated by different sources. Such data should
be processed incrementally using Stream Processing techniques.

Streaming Data vs. Batch Processing:

Page 4 of 27
THE STREAM DATA MODEL

A Data-Stream-Management System
 Any number of streams can enter the system. Each stream can provide elements at its
own schedule; they need not have the same data rates or data types, and the time
between elements of one stream need not be uniform.
 The fact that the rate of arrival of stream elements is not under the control of the system
distinguishes stream processing from the processing of data that goes on within a database-
management system.
 The latter system controls the rate at which data is read from the disk, and therefore never
has to worry about data getting lost as it attempts to execute queries.

Page 5 of 27
 Streams may be archived in a large archival store, but we assume it is not possible to
answer queries from the archival store. It could be examined only under special
circumstances using time-consuming retrieval processes.
 There is also a working store, into which summaries or parts of streams may be placed,
and which can be used for answering queries. The working store might be disk, or it might
be main memory, depending on how fast we need to process queries. But either way, it is
of sufficiently limited capacity that it cannot store all the data from all the streams.

Examples of Stream Sources


 Sensor Data - Imagine a temperature sensor bobbing about in the ocean, sending back to
a base station a reading of the surface temperature each hour. The data produced by this
sensor is a stream of real numbers.
 Image Data - Satellites often send down to earth streams consisting of many terabytes of
images per day. Surveillance cameras produce images with lower resolution than satellites,
but there can be many of them, each producing a stream of images at intervals like one
second. London is said to have six million such cameras, each producing a stream.
 Internet and Web Traffic - Web sites receive streams of various types. For example,
Google receives several hundred million search queries per day. Yahoo! accepts billions
of “clicks” per day on its various sites.
 Social Media - One of the best sources for user generated content and processing text.
Twitter — You can use their official API or a third-party library to access a part of their
stream. For Python, go with Tweepy.
 Finance - Money never sleeps. There are people buying and selling financial instruments
every second of the day.
o Stocks — IEX has an amazing developer platform. Not only can you get stock
information, but you can get information on the order book, cryptocurrencies,
sector performance, and much more.
o Blockchain — The official Blockchain organization provides an API to get
notifications on blocks and transactions for Bitcoin.

Data Streaming Challenges


 Plan for scalability
 Plan for data durability
 Incorporate fault tolerance in both the storage and processing layers

Page 6 of 27
Stream Queries
 There are two ways that queries get asked about streams. We show in Fig. 4.1 a place
within the processor where standing queries are stored. These queries are, in a sense,
permanently executing, and produce outputs at appropriate times.
 The other form of query is ad-hoc, a question asked once about the current state of a stream
or streams. If we do not store all streams in their entirety, as normally we cannot, then we
cannot expect to answer arbitrary queries about streams. If we have some idea what kind
of queries will be asked through the ad-hoc query interface, then we can prepare for them
by storing appropriate parts or summaries of streams.

Example: Web sites often like to report the number of unique users over the past month. If we
think of each login as a stream element, we can maintain a window that is all logins in the most
recent month. We must associate the arrival time with each login, so we know when it no longer
belongs to the window. If we think of the window as a relation Logins(name, time), then it is
simple to get the number of unique users over the past month. The SQL query is:
SELECT COUNT(DISTINCT(name))
FROM Logins
WHERE time >= t;

 Here, t is a constant that represents the time one month before the current time.
 Note that we must be able to maintain the entire stream of logins for the past month in
working storage. However, for even the largest sites, that data is not more than a few
terabytes, and so surely can be stored on disk.

Page 7 of 27
Issues in Stream Processing
 It is important that the stream-processing algorithm is executed in main memory,
without access to secondary storage or with only rare accesses to secondary storage.
 Require the invention of new techniques in order to execute them at a realistic rate
on a machine of realistic size.
 It is much more efficient to get an approximate answer to our problem than an exact
solution.

Applications
 Mining query streams - Google wants to know what queries are more frequent today than
yesterday.
 Mining click streams - Yahoo wants to know which of its pages are getting an unusual
number of hits in the past hour.
 Mining social network news feeds - E.g., look for trending topics on Twitter, Facebook
 Telephone call records - Data feeds into customer bills as well as settlements between
telephone companies

SAMPLING DATA IN A STREAM


 It is used to select sample data from the stream data.

Example
 Our running example is the following. A search engine receives a stream of queries, and it
would like to study the behavior of typical users.

Page 8 of 27
 We assume the stream consists of tuples (user, query, time). Suppose that we want to
answer queries such as “What fraction of the typical user’s queries were repeated over the
past month?”.
 Assume also that we wish to store only 1/10th of the stream elements. The obvious
approach would be to generate a random number, say an integer from 0 to 9, in response
to each search query. Store the tuple if and only if the random number is 0. If we do so,
each user has, on average, 1/10th of their queries stored.

Obtaining a Representative Sample


 We must strive to pick 1/10th of the users, and take all their searches for the sample, while
taking none of the searches from other users.
 Generate a random integer between 0 and 9. If the number is 0, we add this user to our
list with value “in,” and if the number is other than 0, we add the user with the value “out.”
 By using a hash function, one can avoid keeping the list of users. That is, we hash each
user name to one of ten buckets, 0 through 9. If the user hashes to bucket 0, then accept
this search query for the sample, and if not, then not.

The General Sampling Problem


 There are three components – user, query, and time – of which only user is in the key.
 To take a sample of size a/b, we hash the key value for each tuple to b buckets, and accept
the tuple for the sample if the hash value is less than a.

Varying the Sample Size


 In order to assure that at all times, the sample consists of all tuples from a subset of the key
values, we choose a hash function h from key values to a very large number of values
0, 1, . . . ,B−1.
 We maintain a threshold t, which initially can be the largest bucket number, B − 1. At all
times, the sample consists of those tuples whose key K satisfies h(K) ≤ t. New tuples from
the stream are added to the sample if and only if they satisfy the same condition.

Page 9 of 27
Fixed Size Sampling – Reservoir Sampling

Example:
Suppose we have a stream of tuples with the schema Grades(university, courseID, studentID,
grade) Assume universities are unique, but a courseID is unique only within a university (i.e.,
different universities may have different courses with the same ID, e.g., “CS101”) and likewise,
studentID’s are unique only within a university (different universities may assign the same ID to
different students).

Suppose we want to answer certain queries approximately from a 1/20th sample of the data. For
each of the queries below, indicate how you would construct the sample. That is, tell what the key
attributes should be.
(a) For each university, estimate the average number of students in a course.
(b) Estimate the fraction of students who have a GPA of 3.5 or more.
(c) Estimate the fraction of courses where at least half the students got “A.”

Page 10 of 27
FILTERING STREAMS
 Another common process on streams is selection, or filtering. We want to accept those
tuples in the stream that meet a criterion. Accepted tuples are passed to another process
as a stream, while other tuples are dropped.

Page 11 of 27
Example - First Cut Solution

Page 12 of 27
 Blooms Filter – Example

Blooms Filter - Analysis

Page 13 of 27
COUNTING DISTINCT ELEMENTS IN A STREAM
 Suppose stream elements are chosen from some universal set. We would like to know how
many different elements have appeared in the stream, counting either from the
beginning of the stream or from some known time in the past.

The Count-Distinct Problem

Page 14 of 27
The Flajolet-Martin Algorithm
 The idea behind the Flajolet-Martin Algorithm is that the more different elements we see
in the stream, the more different hash-values we shall see. As we see more different hash-
values, it becomes more likely that one of these values will be “unusual.”
 FM Algorithm – Example

Combining Estimates
 First, group the hash functions into small groups, and take their average. Then, take the
median of the averages.
Space Requirements
 Observe that as we read the stream it is not necessary to store the elements seen. The only
thing we need to keep in main memory is one integer per hash function.

ESTIMATING MOMENTS
 The problem, called computing “moments,” involves the distribution of frequencies of
different elements in the stream.

Definition of Moments
 The 0th moment is a count of the number of distinct elements in the stream.
 The second moment is the sum of the squares of the mi’s. It is sometimes called the surprise
number, since it measures how uneven the distribution of elements in the stream is.

Page 15 of 27
Page 16 of 27
 AMS algorithm- Example

Higher-Order Moments
 We estimate kth moments, for k > 2, in essentially the same way as we estimate second
moments.
 More generally, we can estimate kth moments for any k ≥ 2 by turning value v = X.value
into n(vk − (v − 1)k.
Page 17 of 27
Dealing With Infinite Streams
 Proper technique is to maintain as many variables as we can store at all times, and to throw
some out as the stream grows. The discarded variables are replaced by new ones, in such a
way that at all times, the probability of picking any one position for a variable is the same
as that of picking any other position. Suppose we have space to store s variables. Then the
first s positions of the stream are each picked as the position of one of the s variables.
 The probability of selecting each of the first n positions is,

COUNTING ONES IN A WINDOW


 We now turn our attention to counting problems for streams. Suppose we have a window
of length N on a binary stream. We want at all times to be able to answer queries of the
form “how many 1’s are there in the last k bits?” for any k ≤ N.
The Cost of Exact Counts
 To begin, suppose we want to be able to count exactly the number of 1’s in the last k bits
for any k ≤ N. Then we claim it is necessary to store all N bits of the window.
 If w = 0101 and x = 1010, then k = 1, since scanning from the right, they first disagree at
position 1. If w = 1001 and x = 0101, then k = 3, because they first disagree at the third
position from the right.

Page 18 of 27
The Datar-Gionis-Indyk-Motwani Algorithm(DGIM)
 This version of the algorithm uses O(log2 N) bits to represent a window of N bits, and
allows us to estimate the number of 1’s in the window with an error of no more than 50%.
 Each bit of the stream has a timestamp, the position in which it arrives. The first bit has
timestamp 1, the second has timestamp 2, and so on.

 There are six rules that must be followed when representing a stream by buckets.
• The right end of a bucket is always a position with a 1.
• Every position with a 1 is in some bucket.
• No position is in more than one bucket.
• There are one or two buckets of any given size, up to some maximum size.
• All sizes must be a power of 2.
• Buckets cannot decrease in size as we move to the left (back in time).

Storage Requirements for the DGIM Algorithm


 The total space required for all the buckets representing a window of size N is O(log2 N).

Page 19 of 27
Query Answering in the DGIM Algorithm
 Suppose we are asked how many 1’s there are in the last k bits of the window, for some 1
≤ k ≤ N. Find the bucket b with the earliest timestamp that includes at least some of the k
most recent bits. Estimate the number of 1’s to be the sum of the sizes of all the buckets to
the right (more recent) than bucket b, plus half the size of b itself.

Maintaining the DGIM Conditions


 First, whenever a new bit enters:
• Check the leftmost (earliest) bucket. If its timestamp has now reached the current
timestamp minus N, then this bucket no longer has any of its 1’s in the window.
Therefore, drop it from the list of buckets.
 Now, we must consider whether the new bit is 0 or 1. If it is 0, then no further change to
the buckets is needed. If the new bit is a 1, however, we may need to make several changes.
 First: If there was only one bucket of size 1, then nothing more needs to be done. However,
if there are now three buckets of size 1, We fix this problem by combining the leftmost
(earliest) two buckets of size 1.
 Create a new bucket with the current timestamp and size 1.

Reducing the Error


 The fractional error is

Page 20 of 27
j is, this fraction is upper bounded by 1/(r − 1). Thus, by picking r sufficiently large, we can limit
the error to any desired E > 0.
Extensions to the Counting of Ones
 The sum of the integers is

DECAYING WINDOWS
 Sometimes we do not want to make a sharp distinction between recent elements and those
in the distant past, but want to weight the recent elements more heavily.
 We consider “exponentially decaying windows,” and an application where they are quite
useful: finding the most common “recent” elements.

The Problem of Most-Common Elements


 Suppose we have a stream whose elements are the movie tickets purchased all over the
world, with the name of the movie as part of the element. We want to keep a summary of
the stream that is the most popular movies “currently.”
 We want to discount the popularity of a movie like Star Wars–Episode 4, which sold many
tickets, but most of these were sold decades ago. On the other hand, a movie that sold n
tickets in each of the last 10 weeks is probably more popular than a movie that sold 2n
tickets last week but nothing in previous weeks.
 This technique might work for movies, because there are only thousands of movies, but it
would fail if we were instead recording the popularity of items sold at Amazon, or the rate
at which different Twitter-users tweet, because there are too many Amazon products and
too many tweeters. Further, it only offers approximate answers.

Definition of the Decaying Window


 Let a stream currently consist of the elements a1, a2, . . . , at, where a1 is the first element
to arrive and at is the current element. Let c be a small constant, such as 10−6 or 10−9.
Define the exponentially decaying window for this stream to be the sum,

Page 21 of 27
 However, when a new element at+1 arrives at the stream input, all we need to do is:
1. Multiply the current sum by 1 − c.
2. Add at+1.

Finding most popular elements


 When a new ticket arrives on the streams,

REAL TIME SENTIMENT ANALYSIS


 Sentiment Analysis is the process of determining whether a piece of writing is positive,
negative or neutral.

Applications
Sentiment Analysis for Customer Experience - In today’s social media world, most of your
customers are using social networks, particularly Twitter, to talk and express their opinions about
your brand, product, and services. Analyzing tweets, online reviews and news articles is quite

Page 22 of 27
useful for social media managers or business analysts to get insights into how customers feel and
act upon it afterward. For this task, automated sentiment analysis is an essential component as a
language processing tool.

Sentiment Analysis for Employee Experience - A significant percentage of workers leave their
jobs each year, while another portion is fired or let go. In this sense, HR teams are starting to take
actions, with the help of data analytics, to understand such tendencies and reduce turnover while
yielding a performance improvement. Getting what employees are talking about and how they feel
about your company is possible thanks to sentiment analysis systems, and this helps workforce
analysts reduce employee churn.

Working:

Page 23 of 27
There are mainly three approaches in sentiment analysis

1. Lexicon based - considers lexicon dictionary for identifying polarity of the text.

2. Machine learning based approach - Needs to develop classification model, which is


trained using prelabeled dataset of positive, negative, neutral content.

Page 24 of 27
3. Combined approach - Which uses lexicon dictionary along with pre-labelled data set
for developing classification model.

Page 25 of 27
STOCK MARKET PREDICTIONS
 Stock market prediction is the act of trying to determine the future value of a company
stock or other financial instrument traded on an exchange. The successful prediction of a
stock's future price could yield significant profit.

Page 26 of 27
 Decision Tree algorithm

Page 27 of 27

You might also like