Chapter 5 Big Data
Introduction
• Big Data:
• is high-volume and high-velocity and/or high-variety information
assets that demand cost-effective, innovative forms of information
processing that enable enhanced insight, decision-making, and
process automation.
• is used to describe immense volumes of data, both unstructured and
structured
• refers to massive volumes of data that cannot be effectively
processed with traditional applications.
• The processing of Big Data begins with raw data that is not yet
aggregated or organized, and that is often impossible to store in the
memory of a single computer.
Introduction (cont.)
• Big Data is a field dedicated to the analysis, processing, and storage of
large collections of data that frequently originate from disparate
sources.
• Big Data solutions and practices are typically required when
traditional data analysis, processing and storage technologies and
techniques are insufficient
• Big Data addresses distinct requirements, such as:
• the combining of multiple unrelated datasets,
• processing of large amounts of unstructured data and
• harvesting of hidden information in a time-sensitive manner
Terms, concepts
• In addition to traditional analytic approaches based on statistics, Big
Data adds newer techniques that leverage computational resources
and approaches to execute analytic algorithms.
• This shift is important as datasets continue to become larger, more
diverse, more complex and streaming-centric.
• Datasets:
• Collections or groups of related data
• Each dataset member (datum) shares the same set of attributes
• Examples:
• Tweets stored in a flat file
• a collection of image files in a directory
• historical weather observations that are stored as XML files
Datasets, Data Analysis
• Datasets can be found in many different formats
• Data Analysis:
• the process of examining data to find facts, relationships, patterns, insights
and/or trends
• Its goal is to support better decision making
• Example: how the number of ice cream cones sold relates to the daily
temperature
• supports decisions about how much ice cream a store should order, given
weather forecast information
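The ice cream example above can be sketched with a Pearson correlation between daily temperature and cones sold. The figures are hypothetical, made up purely for illustration.

```python
# Toy example (hypothetical figures): relate daily temperature to the
# number of ice cream cones sold, using the Pearson correlation coefficient.
from math import sqrt

temps = [18, 21, 25, 28, 30, 33]        # daily high temperature (Celsius)
cones = [110, 135, 180, 220, 250, 300]  # cones sold that day

n = len(temps)
mean_t = sum(temps) / n
mean_c = sum(cones) / n

cov = sum((t - mean_t) * (c - mean_c) for t, c in zip(temps, cones))
std_t = sqrt(sum((t - mean_t) ** 2 for t in temps))
std_c = sqrt(sum((c - mean_c) ** 2 for c in cones))

r = cov / (std_t * std_c)
print(f"Pearson r = {r:.3f}")  # close to 1.0: strong positive relationship
```

A correlation close to 1.0 would support ordering more stock ahead of a warm forecast.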
Data Analytics
• (Figure: the symbol used to represent data analysis)
• Data Analytics:
• Data analytics is a discipline that includes the management of the
complete data lifecycle, which encompasses collecting, cleansing,
organizing, storing, analyzing and governing data
• The term includes the development of analysis methods, scientific
techniques and automated tools
• In Big Data environments, data analytics has developed methods that
allow data analysis to occur through the use of highly scalable
distributed technologies
• for large volumes of data from different sources
Descriptive Analytics
• Descriptive analytics are carried out to answer questions about events
that have already occurred.
• This form of analytics contextualizes data to generate information
• Sample questions can include:
• What was the sales volume over the past 12 months?
• What is the number of support calls received as categorized by severity and
geographic location?
• What is the monthly commission earned by each sales agent?
• It is estimated that 80% of generated analytics results are descriptive in
nature.
• Value-wise, descriptive analytics provide the least value and require a
relatively basic skillset.
Descriptive Analytics (cont.)
• Descriptive analytics are often carried out via ad-hoc reporting or dashboards
• The reports are generally static in nature and display historical data that is
presented in the form of data grids or charts
• Queries are executed on operational data stores from within an enterprise,
for example a Customer Relationship Management (CRM) or
Enterprise Resource Planning (ERP) system
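A descriptive-analytics query of this kind can be sketched as a simple aggregation over transaction records, here a hypothetical "sales volume per month" report built in plain Python.

```python
# Minimal sketch of a descriptive-analytics report: total sales volume
# per month, aggregated from (hypothetical) transaction records like
# those held in a CRM or ERP operational data store.
from collections import defaultdict

transactions = [
    {"month": "2024-01", "amount": 1200.0},
    {"month": "2024-01", "amount": 850.0},
    {"month": "2024-02", "amount": 430.0},
    {"month": "2024-02", "amount": 990.0},
    {"month": "2024-03", "amount": 1500.0},
]

volume_by_month = defaultdict(float)
for tx in transactions:
    volume_by_month[tx["month"]] += tx["amount"]

for month in sorted(volume_by_month):
    print(month, volume_by_month[month])
```

In practice the same aggregation would run as a SQL `GROUP BY` query against the operational data store and feed a static report or dashboard.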
Diagnostic Analytics
• Diagnostic analytics are carried out to determine the cause of an event
that has already occurred, answering questions such as "Why did sales
drop in a given quarter?"
Predictive Analytics
• Predictive analytics are carried out in an attempt to determine the
outcome of an event that might occur in the future
• Questions are usually formulated using a what-if rationale, such as the
following:
• What are the chances that a customer will default on a loan if they have missed a
monthly payment?
• What will be the patient survival rate if Drug B is administered instead of Drug A?
• If a customer has purchased Products A and B, what are the chances that they
will also purchase Product C?
• Predictive analytics try to predict the outcomes of events
• Predictions are made based on patterns, trends and exceptions found in
historical and current data
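The loan-default question above can be sketched in its simplest possible form: estimating the conditional probability of default directly from historical records. The records below are synthetic, invented for illustration.

```python
# Hedged sketch: estimate P(default | missed a payment) directly from
# (synthetic) historical records; the simplest possible predictive model.
history = [
    {"missed_payment": True,  "defaulted": True},
    {"missed_payment": True,  "defaulted": False},
    {"missed_payment": True,  "defaulted": True},
    {"missed_payment": False, "defaulted": False},
    {"missed_payment": False, "defaulted": False},
    {"missed_payment": False, "defaulted": True},
]

# Restrict to the customers who missed a payment, then count defaults
missed = [r for r in history if r["missed_payment"]]
p_default_given_missed = sum(r["defaulted"] for r in missed) / len(missed)
print(f"P(default | missed payment) = {p_default_given_missed:.2f}")
```

Real predictive analytics would replace this frequency count with a trained statistical or machine-learning model, but the principle is the same: patterns in historical data drive the prediction.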
Prescriptive Analytics
• Prescriptive analytics build upon the results of predictive analytics by
prescribing actions that should be taken
• The focus is not only on which prescribed option is best to follow, but also
why
• Sample questions may include:
• Among three drugs, which one provides the best results?
• When is the best time to trade a particular stock?
• Prescriptive analytics provide more value than any other type of analytics
and correspondingly require the most advanced skillset, as well as
specialized software and tools
• Various outcomes are calculated, and the best course of action for each
outcome is suggested.
• The approach shifts from explanatory to advisory and can include the
simulation of various scenarios.
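The "simulate scenarios, then advise" step can be sketched as follows: evaluate each candidate action under several weighted scenarios and recommend the one with the best expected outcome. All actions, payoffs and probabilities below are hypothetical.

```python
# Sketch of the prescriptive step: score each candidate action across
# several scenarios and recommend the one with the best expected payoff.
# Scenario probabilities and payoff figures are hypothetical.
scenarios = [  # (probability, market condition)
    (0.5, "stable"),
    (0.3, "growth"),
    (0.2, "downturn"),
]

# Payoff of each action under each market condition
payoffs = {
    "hold": {"stable": 0.0,  "growth": 5.0,  "downturn": -2.0},
    "buy":  {"stable": 1.0,  "growth": 12.0, "downturn": -10.0},
    "sell": {"stable": -1.0, "growth": -4.0, "downturn": 6.0},
}

expected = {
    action: sum(p * table[cond] for p, cond in scenarios)
    for action, table in payoffs.items()
}
best = max(expected, key=expected.get)
print(best, expected[best])
```

Besides naming the best action, a prescriptive system would also expose the per-scenario payoffs, which is the "why" behind the recommendation.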
Prescriptive Analytics (cont.)
• Prescriptive analytics involve the use of business rules and internal
and/or external data to perform an in-depth analysis
Big Data Characteristics
• For a dataset to be considered Big Data, it must possess one or more
characteristics that require accommodation in the solution design and
architecture of the analytic environment
• Big Data characteristics are used to help differentiate data categorized as
“Big” from other forms of data.
• The five Big Data traits are commonly referred to as the Five Vs
Volume
• Volume refers to the large amounts of data that Big Data solutions must
be able to ingest, process and store
Big Data Characteristics (cont.)
• Typical data sources that are responsible for generating high data
volumes can include:
• online transactions, such as point-of-sale and banking
• scientific and research experiments
• sensors, such as GPS sensors, RFIDs, smart meters and telematics
• social media, such as Facebook and Twitter
• Velocity:
• In Big Data environments, data can arrive at fast speeds, and enormous
datasets can accumulate within very short periods of time
• Coping with the fast inflow of data requires the enterprise to design
highly elastic and available data processing solutions and corresponding
data storage capabilities
Velocity
• Examples of high-velocity Big Data datasets produced every minute
include tweets, video, emails and GBs of sensor data generated from
a jet engine
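The bookkeeping a high-velocity pipeline needs can be sketched as a sliding window that tracks how many events arrived in the last minute. Timestamps here are simulated rather than taken from the wall clock.

```python
# Sketch: a sliding one-minute window counting recent event arrivals,
# the kind of bookkeeping a high-velocity ingest pipeline performs.
from collections import deque

WINDOW_SECONDS = 60

class RateWindow:
    def __init__(self):
        self._events = deque()  # arrival timestamps, oldest first

    def record(self, ts: float) -> None:
        self._events.append(ts)
        # Drop events that have fallen out of the window
        while self._events and self._events[0] <= ts - WINDOW_SECONDS:
            self._events.popleft()

    def rate(self) -> int:
        """Number of events seen in the last WINDOW_SECONDS."""
        return len(self._events)

w = RateWindow()
for t in range(0, 120, 2):  # one simulated event every 2 seconds
    w.record(float(t))
print(w.rate())
```

Production systems delegate this to stream-processing frameworks, but the windowing idea is the same.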
Big Data Characteristics (cont.)
• Variety: refers to the multiple formats and types of data that need to be
supported by Big Data solutions
• Data variety brings challenges for enterprises in terms of data
integration, transformation, processing, and storage
• It includes structured data in the form of financial transactions, semi-
structured data in the form of emails and unstructured data in the form
of images
Big Data Characteristics (cont.)
• Veracity: refers to the quality or fidelity of data.
• Data that enters Big Data environments needs to be assessed for quality
• The data processing activities need to resolve invalid data and remove noise
• Noise is data that cannot be converted into information and thus has no value
• Signals have value and lead to meaningful information.
• Data with a high signal-to-noise ratio has more veracity than data with a lower
ratio.
• Data that is acquired in a controlled manner, for example via online customer
registrations, usually contains less noise than data acquired via uncontrolled
sources, such as blog postings.
• Thus the signal-to-noise ratio of data is dependent upon the source of
the data and its type
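A crude version of this veracity check can be sketched as a validation pass: records that fail basic rules are treated as noise, and the survivors are the signal. The field names and validation rules below are hypothetical.

```python
# Sketch: estimate a signal-to-noise ratio via simple validation rules.
# Records failing validation are noise; valid records are signal.
readings = [
    {"meter_id": "M1", "kwh": 12.4},
    {"meter_id": "M2", "kwh": -3.0},  # negative reading: invalid
    {"meter_id": None, "kwh": 7.1},   # missing identifier: invalid
    {"meter_id": "M3", "kwh": 9.8},
    {"meter_id": "M4", "kwh": 11.2},
]

def is_signal(r):
    return r["meter_id"] is not None and r["kwh"] >= 0

signal = [r for r in readings if is_signal(r)]
noise = len(readings) - len(signal)
ratio = len(signal) / noise if noise else float("inf")
print(f"signal-to-noise ratio: {ratio:.1f}")
```

Controlled sources such as online registration forms would produce fewer invalid records, and hence a higher ratio, than uncontrolled sources like blog postings.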
Big Data Characteristics (cont.)
• Value: is defined as the usefulness of data for an enterprise
• It is related to the veracity characteristic in that the higher the data fidelity,
the more value it holds for the business
• Value is also dependent on how long data processing takes because
analytics results have a shelf-life
• The longer it takes for data to be turned into meaningful information, the
less value it has for a business
• Data that has high veracity and can be analyzed quickly has more value to a
business
Big Data Characteristics (cont.)
• Apart from veracity and time, value is also impacted by the following
lifecycle-related concerns:
• How well has the data been stored?
• Were valuable attributes of the data removed during data cleansing?
• Are the right types of questions being asked during data analysis?
• Are the results of the analysis being accurately communicated to the
appropriate decision-makers?
Different Types of Data
• The data processed by Big Data solutions can be human-generated or
machine-generated
• The analytic results are generated by machines
• Human-generated data is the result of human interaction with
systems, such as online services and digital devices
• Examples of human-generated data include social media, blog posts,
emails, photo sharing and messaging
Data types processed by Big Data solutions
• The primary types of data are:
• Structured
• Unstructured
• Semi-structured
• Structured Data:
• Structured data conforms to a data model or schema and is often stored in
tabular form
• It is used to capture relationships between different entities and is
therefore most often stored in a relational database
• Structured data is frequently generated by enterprise applications and
information systems like ERP and CRM systems
Data types (cont.)
• Due to the abundance of tools and databases that natively support
structured data, it rarely requires special consideration with regard to
processing or storage
• Examples of structured data include banking transactions, invoices, and
customer records
• (Figure: the symbol used to represent structured data stored in tabular
form)
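Structured, relational storage of records like these can be sketched with SQLite standing in for the enterprise relational database; the schema below is hypothetical.

```python
# Sketch: structured data in tabular, relational form. SQLite stands in
# for the enterprise database; the transactions schema is hypothetical.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE transactions (
    id INTEGER PRIMARY KEY,
    customer TEXT NOT NULL,
    amount REAL NOT NULL
)""")
conn.executemany(
    "INSERT INTO transactions (customer, amount) VALUES (?, ?)",
    [("alice", 120.0), ("bob", 75.5), ("alice", 42.0)],
)

# Relational queries capture relationships between entities directly
total = conn.execute(
    "SELECT SUM(amount) FROM transactions WHERE customer = ?", ("alice",)
).fetchone()[0]
print(total)
conn.close()
```

Because the schema is fixed up front, mature SQL tooling can query and aggregate this data without any special Big Data machinery.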
• Unstructured Data:
• Data that does not conform to a data model or data schema is known as
unstructured data
• It is estimated that unstructured data makes up 80% of the data within any
given enterprise.
• Unstructured data has a faster growth rate than structured data
Data types (cont.)
• Unstructured data is either textual or binary and often conveyed via
files that are self-contained and non-relational
• A text file may contain the contents of various tweets or blog
postings.
• Binary files are often media files that contain image, audio or video
data
Data types (cont.)
• Semi-structured Data: has a defined level of structure and consistency, but
is not relational in nature.
• Instead, semi-structured data is hierarchical or graph-based.
• Commonly stored in files that contain text
• XML and JSON files are common forms of semi-structured data
• Due to the textual nature of this data and its conformance to some level of
structure, it is more easily processed than unstructured data
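The same record expressed in the two common semi-structured formats mentioned above can be sketched as follows; both parse into hierarchical structures using only the standard library.

```python
# Sketch: one hypothetical order record as JSON and as XML, the two most
# common semi-structured formats. Both parse into hierarchies.
import json
import xml.etree.ElementTree as ET

json_doc = '{"order": {"id": 42, "items": ["book", "pen"]}}'
order = json.loads(json_doc)["order"]

xml_doc = "<order id='42'><item>book</item><item>pen</item></order>"
root = ET.fromstring(xml_doc)
xml_items = [item.text for item in root.findall("item")]

print(order["items"], xml_items)
```

The textual encoding plus the predictable hierarchy is what makes this data easier to process than free-form text or binary media.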
Big Data lifecycle: Privacy & Security issues in Big Data
• Data Procurement:
• The acquisition of Big Data solutions themselves can be economical,
due to the availability of open-source platforms and tools and
opportunities to leverage commodity hardware
• A substantial budget may still be required to obtain external data
• External data sources include government data sources and
commercial data markets.
• Government-provided data, such as geo-spatial data, may be free.
• However, most commercially relevant data will need to be purchased
and may involve the continuation of subscription costs
Privacy & Security issues in Big Data
• Privacy:
• Performing analytics on datasets can reveal confidential information about
organizations or individuals
• Even analyzing separate datasets that contain seemingly benign data can
reveal private information when the datasets are analyzed jointly.
• This can lead to intentional or unintentional breaches of privacy
• Addressing these privacy concerns requires an understanding of the nature of
data being accumulated and relevant data privacy regulations, as well as
special techniques for data tagging and anonymization
• Example: telemetry data, such as a car’s GPS log or smart meter data readings,
collected over an extended period of time can reveal an individual’s location and
behavior
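One of the anonymization techniques mentioned above can be sketched as pseudonymization: replacing direct identifiers with salted hashes before the data enters the analytics environment. The salt, field names and record below are hypothetical, and a real deployment would additionally need key management and regulatory review.

```python
# Sketch: pseudonymize a direct identifier with a salted hash before the
# telemetry record enters the analytics environment. Salt and field
# names are hypothetical.
import hashlib

SALT = b"per-deployment-secret"  # hypothetical; store securely in practice

def pseudonymize(identifier: str) -> str:
    """Deterministically map an identifier to an opaque token."""
    return hashlib.sha256(SALT + identifier.encode()).hexdigest()[:16]

record = {"driver_id": "DL-998877", "lat": 52.52, "lon": 13.40}
record["driver_id"] = pseudonymize(record["driver_id"])
print(record)
```

The mapping is deterministic, so records from the same individual can still be joined for analysis, but the raw identifier never reaches the analytics environment. Note that location traces themselves can still re-identify individuals, which is exactly the risk the telemetry example describes.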
Security issues in Big Data
• Some of the components of Big Data solutions lack the robustness of
traditional enterprise solution environments when it comes to access
control and data security
• Data security involves ensuring that data networks and repositories are
sufficiently secured via authentication and authorization mechanisms
• Big Data security further involves establishing data access levels for
different categories of users.
• Example: unlike traditional relational database management systems, NoSQL
databases generally do not provide robust built-in security mechanisms.
• They instead rely on simple HTTP-based APIs where data is exchanged
in plaintext, making the data prone to network-based attacks
Security (cont.)
• Current technologies for securing data are slow when applied to huge
amounts of data
• The most efficient algorithms give an encryption rate of 64.3 MB/sec.
• However, in the light of Big Data, where the amounts of data extend to
gigabytes or even petabytes, exabytes and zettabytes, such rates become a
serious bottleneck.