Unit-II (Big Data)
Stream processing is a data management technique that involves ingesting a continuous data
stream to quickly analyze, filter, transform or enhance the data in real time.
Big data streaming is a speed-focused approach in which large streams of real-time data are processed continuously with the aim of extracting insights and useful trends. A continuous stream of unstructured data is sent into memory for analysis before being stored to disk, typically across a cluster of servers. Speed matters most in big data streaming: the value of data, if not processed quickly, decreases with time.
Key layers and requirements of a data stream system:
o Storage layer
o Processing layer
o Scalability
o Data durability
o Fault tolerance in both the storage and processing layers
In batch processing, data is collected in batches before being processed, stored, and analyzed. Stream processing, on the other hand, ingests data continuously, allowing it to be processed simultaneously and in real time.
The main benefit of stream processing is real-time insight. We live in an information age where
new data is constantly being created. Organizations that leverage streaming data analytics can
take advantage of real-time information from internal and external assets to inform their
decisions, drive innovation and improve their overall strategy.
2.2 STREAM DATA MODEL AND ARCHITECTURE
Any number of streams can enter the system. Each stream can provide elements at its own
schedule; they need not have the same data rates or data types, and the time between elements
of one stream need not be uniform. The fact that the rate of arrival of stream elements is not
under the control of the system distinguishes stream processing from the processing of data
that goes on within a database-management system. The latter system controls the rate at which
data is read from the disk, and therefore never has to worry about data getting lost as it attempts
to execute queries.
Archival Store
Streams may be archived in a large archival store, but we assume it is not possible to answer
queries from the archival store. It could be examined only under special circumstances using
time-consuming retrieval processes.
Working Store
There is also a working store, into which summaries or parts of streams may be placed, and
which can be used for answering queries. The working store might be disk, or it might be main
memory, depending on how fast we need to process queries.
But either way, it is of sufficiently limited capacity that it cannot store all the data from
all the streams.
Stream computing is a way to analyze and process Big Data in real time to gain current insights, take appropriate decisions, or predict new trends in the immediate future. Streams arrive at a high rate, and stream computing is implemented in a distributed, clustered environment. Typical application areas include:
o Financial sectors
o Business intelligence
o Risk management
o Marketing management
o Search engines
o Social network analysis
o Mining query streams
Ex: Google wants to know what queries are more frequent today than yesterday
o Mining click streams
Ex: Yahoo wants to know which of its pages have received an unusual number of hits in the past hour
o Mining social network news feeds
E.g., Look for trending topics on Twitter, Facebook
The general problem we shall address is selecting a subset of a stream so that we can ask queries
about the selected subset and have the answers be statistically representative of the stream as a
whole.
Since we can’t store the entire stream, one obvious approach is to store a sample.
Problem-1:
Example: A search engine receives a stream of queries, and it would like to study the behaviour of typical users. We assume the stream consists of tuples (user, query, time). Suppose that we want to answer queries such as “What fraction of the typical user’s queries were repeated over the past month?” Assume also that we wish to store only 1/10th of the stream elements.
Naïve solution: sample 1/10th of the queries at random. This biases statistics such as the fraction of repeated queries, because only a random subset of each user’s queries enters the sample.
Better solution: sample users
o Pick 1/10th of the users and take all the searches of those users in the sample
o Use a hash function that hashes the user name or user ID uniformly into 10 buckets, keeping a user’s tuples if the user hashes to a chosen bucket (a sketch follows this list)
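A minimal sketch of this user-hashing approach in Python (the function, parameter names, and example stream are illustrative, not from the source; md5 is used only because Python's built-in hash() is salted per process and therefore not reproducible):

```python
import hashlib

def keep_tuple(user, num_buckets=10, sample_buckets=1):
    """Decide whether a (user, query, time) tuple belongs to the sample.

    The user ID is hashed uniformly into num_buckets buckets; the tuple
    is kept only if the user falls into one of the first sample_buckets
    buckets, so ALL of a sampled user's queries are retained.
    """
    digest = hashlib.md5(user.encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % num_buckets
    return bucket < sample_buckets

# Hypothetical stream of (user, query, time) tuples.
stream = [("alice", "weather", 1), ("bob", "news", 2), ("alice", "weather", 3)]
sample = [t for t in stream if keep_tuple(t[0])]
```

Because sampling is decided per user rather than per query, statistics such as the fraction of repeated queries remain representative for the sampled users.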
Other useful techniques for handling stream queries:
o Random sampling
o Sliding windows, a useful model that asks queries only about the most recent elements of the stream
A Bloom filter allows through all stream elements whose keys are in S, while rejecting most of the stream elements whose keys are not in S.
Fig: Bloom filter hashing process for a stream element q with k = 3 hash functions h1(q), h2(q), and h3(q)
No false negatives
- If the key was inserted before, the Bloom filter always returns true.
Chance of false positives
- The filter can return true for an element that was never inserted, because its bit positions may all have been set by other elements.
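A minimal Bloom filter sketch in Python (class and parameter names are illustrative; salted md5 digests stand in for the k independent hash functions):

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter: an n-bit array and k hash functions."""

    def __init__(self, n_bits=1024, k=3):
        self.n = n_bits
        self.k = k
        self.bits = [0] * n_bits

    def _positions(self, key):
        # Derive k bit positions from salted digests of the key.
        for i in range(self.k):
            digest = hashlib.md5(f"{i}:{key}".encode()).hexdigest()
            yield int(digest, 16) % self.n

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos] = 1

    def might_contain(self, key):
        # True may be a false positive; False is always correct.
        return all(self.bits[pos] for pos in self._positions(key))

bf = BloomFilter()
bf.add("query1")
print(bf.might_contain("query1"))       # True: no false negatives
print(bf.might_contain("never-added"))  # usually False, occasionally True
```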
Weaknesses of Bloom filters
- False positives are possible, and their rate grows as more elements are inserted and the bit array fills up.
- A standard Bloom filter does not support deleting elements.
Finding the number of distinct (Unique) elements in a data stream with repeated elements.
Example:
If the stream contains n elements with m of them unique, this algorithm runs in O(n) time and needs O(log m) memory. It gives an approximation of the number of unique elements, along with a standard deviation σ and a maximum error ϵ.
Whenever we apply a hash function h to a stream element a, the bit string h(a) will end in some number of 0’s, possibly none. Call this number the tail length for h(a). Let R be the maximum tail length of any a seen so far in the stream. Then we shall use the estimate 2^R for the number of distinct elements seen in the stream.
FM Algorithm
o Pick a hash function h that maps each of the n elements to at least log2(n) bits
o For each stream element a, let r(a) be the number of trailing 0’s in h(a)
o Record R = the maximum r(a) seen so far
o Estimate the number of distinct elements as 2^R
The FM algorithm can estimate the result accurately only if an appropriate hash function is used.
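A minimal sketch of the FM estimate in Python (the helper names are illustrative; md5 stands in for the hash function h, and the estimate is a power of 2, so it is only approximate):

```python
import hashlib

def tail_length(x):
    """Number of trailing 0 bits in x (the tail length of a hash value)."""
    if x == 0:
        return 0  # convention chosen here for the all-zero hash value
    count = 0
    while x & 1 == 0:
        x >>= 1
        count += 1
    return count

def fm_estimate(stream):
    """Flajolet-Martin: estimate the distinct count as 2**R, where R is
    the maximum tail length of h(a) over all stream elements a."""
    R = 0
    for a in stream:
        h = int(hashlib.md5(str(a).encode()).hexdigest(), 16)
        R = max(R, tail_length(h))
    return 2 ** R

print(fm_estimate([4, 5, 4, 5, 3, 2, 4, 5, 4, 2, 4, 3]))  # true count is 4
```

In practice, several hash functions are used and their estimates are combined, since a single estimate has high variance.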
Statistical parameters that summarize the distribution of element frequencies in a stream are called moments. Let mi denote the number of occurrences of element i in the stream.
Ex: A = {4,5,4,5,3,2,4,5,4,2,4,3}
m4 = 5, m5 = 3, m2 = 2, m3 = 2
The kth moment of the stream is defined as Σ_{i∈A} (mi)^k
Special cases:
0th Moment
The 0th moment is the sum of 1 for each mi that is greater than 0; that is, the 0th moment is a count of the number of distinct elements in the stream.
1st Moment
The 1st moment is the sum of the mi’s, which must be the length of the stream. Thus, the 1st moment is especially easy to compute: just count the length of the stream seen so far.
2nd Moment
The second moment is the sum of the squares of the mi’s. It is sometimes called the surprise
number, since it measures how uneven the distribution of elements in the stream is.
To see the distinction, suppose we have a stream of length 100, in which eleven different elements appear. The most even distribution of these eleven elements would have one appearing 10 times and the other ten appearing 9 times each. In this case, the surprise number is 10^2 + 10 × 9^2 = 910.
At the other extreme, one of the eleven elements could appear 90 times and the other ten appear 1 time each. Then, the surprise number would be 1 × 90^2 + 10 × 1^2 = 8110.
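These special cases can be checked directly with a short Python snippet (the function name is illustrative):

```python
from collections import Counter

def kth_moment(stream, k):
    """Sum of (mi)**k over the distinct elements i of the stream."""
    counts = Counter(stream)  # mi for each distinct element i
    return sum(m ** k for m in counts.values())

A = [4, 5, 4, 5, 3, 2, 4, 5, 4, 2, 4, 3]
print(kth_moment(A, 0))  # 4  -> number of distinct elements
print(kth_moment(A, 1))  # 12 -> length of the stream
print(kth_moment(A, 2))  # 42 -> surprise number: 5^2 + 3^2 + 2^2 + 2^2
```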
Example (estimating the 2nd moment with the AMS algorithm): consider a stream of length n = 15 with the following element frequencies:
Element  Frequency
a        5
b        4
c        3
d        3
The true 2nd moment is 5^2 + 4^2 + 3^2 + 3^2 = 59.
Let X1, X2, and X3 be variables formed by picking positions at random (at different time stamps); here they are the 3rd, 8th, and 13th positions of the stream. For each variable, X.element is the element at the chosen position and X.value is the number of times X.element occurs from that position to the end of the stream.
X1.element = c, X1.value = 3
X2.element = d, X2.value = 2
X3.element = a, X3.value = 2
Each variable yields the estimate n × (2 × X.value − 1), giving 15 × (2 × 3 − 1) = 75, 15 × (2 × 2 − 1) = 45, and 15 × (2 × 2 − 1) = 45.
Average estimate = (75 + 45 + 45) / 3 = 55, which is close to the true 2nd moment of 59.
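A minimal sketch of the AMS estimator in Python. The stream used below, a b c b d a c d a b d c a a b, is an assumption: it is not shown in the text, but it is consistent with the frequencies and the sampled positions and values above.

```python
import random

def ams_second_moment(stream, num_vars=3, positions=None):
    """Alon-Matias-Szegedy estimate of the 2nd moment.

    Each variable picks a position p; its value is the number of times
    stream[p] appears from p onward. Each variable contributes the
    estimate n * (2 * value - 1); the estimates are averaged.
    """
    n = len(stream)
    if positions is None:
        positions = [random.randrange(n) for _ in range(num_vars)]
    estimates = []
    for p in positions:
        value = stream[p:].count(stream[p])
        estimates.append(n * (2 * value - 1))
    return sum(estimates) / len(estimates)

# Positions 3, 8, 13 in the text are 1-based, hence indices 2, 7, 12.
stream = list("abcbdacdabdcaab")
print(ams_second_moment(stream, positions=[2, 7, 12]))  # (75+45+45)/3 = 55.0
```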
There are five rules that must be followed when representing a stream by buckets:
o The right end of a bucket is always a position with a 1.
o No position is in more than one bucket.
o There are one or two buckets of any given size, up to some maximum size.
o All sizes must be a power of 2.
o Buckets cannot decrease in size as we move to the left (back in time).
Fig: Dividing a bit-stream into buckets following the DGIM rules
At the right (most recent) end we see two buckets of size 1. To its left we see one bucket of
size 2. Note that this bucket covers four positions, but only two of them are 1. Proceeding left,
we see two buckets of size 4, and we suggest that a bucket of size 8 exists further left. Notice
that it is OK for some 0’s to lie between buckets. Also, observe from above Fig. that the buckets
do not overlap; there are one or two of each size up to the largest size, and sizes only increase
moving left.
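A minimal sketch of DGIM bucket maintenance in Python (function and variable names are illustrative; each bucket is a (timestamp of right end, size) pair, most recent first):

```python
def dgim_update(buckets, timestamp, bit):
    """Add one bit to the stream, merging buckets so that at most two of
    any size remain; merging two buckets of size s yields one of size 2s."""
    if bit == 1:
        buckets.insert(0, (timestamp, 1))
        size = 1
        while [s for _, s in buckets].count(size) > 2:
            # Merge the two OLDEST buckets of this size.
            idxs = [i for i, (_, s) in enumerate(buckets) if s == size]
            i, j = idxs[-2], idxs[-1]
            merged = (buckets[i][0], size * 2)  # keep the newer right end
            del buckets[j]
            del buckets[i]
            buckets.insert(i, merged)
            size *= 2
    return buckets

buckets = []
for t, b in enumerate([1, 0, 1, 1, 0, 1, 1, 1, 0, 1]):
    dgim_update(buckets, t, b)
print(buckets)  # [(9, 1), (7, 2), (5, 4)] for this input
```

Note how the result satisfies the rules: one or two buckets of each size, all sizes powers of 2, and sizes only increase moving left (back in time).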
2.9 DECAYING WINDOW
The decaying window algorithm allows you to identify the most popular (in other words, trending) elements in an incoming data stream.
This algorithm not only tracks the most recurring elements in an incoming data stream, but also discards any random spikes or spam requests that might have boosted an element’s frequency.
For a stream a1, a2, ..., at, the exponentially decaying window assigns the score
S = Σ_{i=0}^{t−1} a_{t−i} (1 − c)^i
Here, t = current time stamp
c = a small constant
ai = the i-th stream element
Finally, the element with highest total score is listed as trending or most popular
Whenever a new element, say a_{t+1}, arrives in the data stream, you perform the following steps to obtain an updated sum:
- In a data stream consisting of various elements, you maintain a separate sum for each distinct element.
- When a new element arrives, you multiply every existing sum by (1 − c).
- You then add the weight of the incoming element (1, for simple counting) to its corresponding sum.
Finally, the element with the highest aggregate score is listed as the most popular element.
Example
Let c be 0.1. The aggregate sum of each tag is updated step by step; for instance:
ipl: 0.9 × (1 − 0.1) + 0 = 0.81 (0 is added because the arriving tag is not ipl)
fifa: 0.81 × (1 − 0.1) + 1 = 1.729 (1 is added because the arriving tag is fifa)
fifa: 0 × (1 − 0.1) = 0
At the end of the sequence, the score of fifa is 2.135 while that of ipl is 3.7264, so ipl is reported as the trending tag.
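A minimal sketch of the decaying-window update in Python (the tag stream shown is hypothetical, since the sequence from the example above is not reproduced in the text):

```python
def update_scores(scores, tag, c=0.1):
    """On each arrival, decay every tag's sum by (1 - c), then add 1
    to the sum of the tag that just arrived."""
    for key in scores:
        scores[key] *= (1 - c)
    scores[tag] = scores.get(tag, 0.0) + 1.0
    return scores

scores = {}
for tag in ["ipl", "fifa", "ipl", "ipl", "fifa"]:  # hypothetical stream
    update_scores(scores, tag)
print(max(scores, key=scores.get))  # "ipl" -> the trending tag
```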
Real-Time Analytics Platform (RTAP)
An ideal real-time analytics platform helps in analyzing data, correlating it, and predicting outcomes on a real-time basis.
• The real-time analytics platform helps organizations in tracking things in real time, thus
helping them in the decision-making process.
• The platforms connect the data sources for better analytics and visualization
• A real-time analytics platform enables organizations to make the most out of real-time data
by helping them to extract the valuable information and trends from it.
• Such platforms help in measuring data from the business point of view in real time, further
making the best use of data.
• Real time credit scoring, helping financial institutions to decide immediately whether to
extend credit.
Why RTAP?
• Decision systems that go beyond visual analytics have an intrinsic need to analyze data and
respond to situations.
• Depending on the sophistication, such systems may have to act rapidly on incoming
information, grapple with heterogeneous knowledge bases, work across multiple domains, and
often in a distributed manner.
• Big Data platforms offer programming and software infrastructure to help perform analytics that support the performance and scalability needs of such decision support systems for IoT domains.
• There has been significant focus on Big Data analytics platforms on the volume dimension
of Big Data.
• In such platforms, such as MapReduce, data is staged and aggregated over time, and analytics are performed in batch mode on these large data corpora.
• These platforms weakly scale with the size of the input data, as more distributed compute
resources are made available.
However, as we have motivated before, IoT applications place an emphasis on online
analytics, where data that arrives rapidly needs to be processed and analyzed with low latency
to drive autonomic decision making.
Real-time analytics takes two forms:
On-demand analytics
Continuous (streaming) analytics
On-demand real-time analytics waits for users or systems to request a query and then delivers
the analytic results.
Continuous real-time analytics is more proactive and alerts users or triggers responses as
events happen.
The top platforms used worldwide for streaming analytics solutions include:
Apache Flink
• Flink is an open-source platform that handles distributed stream and batch data processing.
• At its core is a streaming data engine that provides for data distribution, fault tolerance, and
communication, for undertaking distributed computations over the data streams.
• In the last year, the Apache Flink community saw three major version releases for the platform
and the community event Flink Forward in San Francisco.
• Apache Flink contains several APIs to enable creating applications that use the Flink engine.
Some of the most popular APIs on the platform are-
• DataStream API for unbounded streams, DataSet API for static data embedded in Python,
Java, and Scala, and the Table API with a SQL-like language.
Spark Streaming
• Spark Streaming is an extension of the core Apache Spark API that processes live data streams as a sequence of small micro-batches (discretized streams, or DStreams), providing scalable, high-throughput, and fault-tolerant stream processing.
IBM Streams
• This streaming analytics platform from IBM enables the applications developed by users to
gather, analyze, and correlate information that comes to them from a variety of sources.
• The solution is known to handle high throughput rates and up to millions of events and
messages per second, making it a leading proprietary streaming analytics solution for real-time
applications.
• IBM Stream computing helps analyze large streams of data in the form of unstructured text, audio, video, and geospatial data, and allows organizations to spot risks and opportunities and make efficient decisions.
Thus, a real-time analytics platform is useful for developing real-time applications and implementing them successfully.
Real-Time Sentiment Analysis
1. Fine-grained sentiment analysis: This type is based on polarity. The categories can be very positive, positive, neutral, negative, and very negative, rated on a scale of 1 to 5: a rating of 5 is very positive, 2 is negative, and 3 is neutral.
2. Emotion detection: Sentiments such as happy, sad, angry, upset, jolly, and pleasant come under emotion detection. It is also known as the lexicon method of sentiment analysis.
3. Aspect based sentiment analysis: It focuses on particular aspects; for instance, if a person wants to evaluate the features of a cell phone, aspects such as the battery, screen, and camera quality are each analyzed separately.
4. Multilingual sentiment analysis: Multilingual consists of different languages where the
classification needs to be done as positive, negative, and neutral. This is highly challenging
and comparatively difficult.
Sentiment analysis is also known by other names:
1. Opinion extraction
2. Opinion mining
3. Sentiment mining
4. Subjectivity analysis
Corpus creation and Pre-processing: It is a very important step to clean and pre-process tweets, as this reduces the noise in the data. To do this, the tweets are converted into a corpus of words, and pre-processing and cleaning of the data are performed.
The extracted tweets should be freed from:
o Punctuation
o White spaces
o Special characters such as ‘#’, ‘@’, and numbers
o Stop words such as ‘is’, ‘at’, ‘the’, etc.
The data is converted to lower case for uniformity and better visibility, and web-related words such as ‘http’ and ‘https’ are also removed.
Stemming and lemmatization are performed finally to remove suffixes from words in order to reduce them to a common root.
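A minimal tweet-cleaning sketch in Python using only the standard library (the stop-word list is an illustrative subset; in practice a full list and a stemmer such as NLTK's PorterStemmer would be used for the final stemming step):

```python
import re

STOP_WORDS = {"is", "at", "the", "a", "an", "and", "of", "to", "in"}

def clean_tweet(text):
    """Lower-case, then strip URLs, '@' mentions, '#' hashtags,
    punctuation, special characters, numbers, and stop words."""
    text = text.lower()
    text = re.sub(r"https?://\S+|www\.\S+", " ", text)  # 'http'/'https' URLs
    text = re.sub(r"[@#]\w+", " ", text)                # mentions and hashtags
    text = re.sub(r"[^a-z\s]", " ", text)               # punctuation and numbers
    tokens = [w for w in text.split() if w not in STOP_WORDS]
    return " ".join(tokens)

print(clean_tweet("The match at #fifa is LIVE!! watch @espn https://t.co/xyz"))
# -> "match live watch"
```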