
Chapter 5 Big Data


Big Data

Introduction

• Big Data:
• is high-volume, high-velocity and/or high-variety information
assets that demand cost-effective, innovative forms of information
processing that enable enhanced insight, decision-making, and
process automation.
• is used to describe immense volumes of data, both unstructured and
structured
• refers to massive volumes of data that cannot be effectively
processed with traditional applications.
• The processing of Big Data begins with raw data that is not
aggregated or organized and is most often impossible to store in the
memory of a single computer.
Introd…
• Big Data is a field dedicated to the analysis, processing, and storage of
large collections of data that frequently originate from disparate
sources.
• Big Data solutions and practices are typically required when
traditional data analysis, processing and storage technologies and
techniques are insufficient
• Big Data addresses distinct requirements, such as:
• the combining of multiple unrelated datasets,
• processing of large amounts of unstructured data and
• harvesting of hidden information in a time-sensitive manner

Terms, concepts
• In addition to traditional analytic approaches based on statistics, Big
Data adds newer techniques that leverage computational resources
and approaches to execute analytic algorithms.
• This shift is important as datasets continue to become larger, more
diverse, more complex and streaming-centric.
• Datasets:
• Collections or groups of related data
• Each dataset member (datum) shares the same set of attributes as the other members of the dataset
• Examples:
• Tweets stored in a flat file
• a collection of image files in a directory
• historical weather observations that are stored as XML files
Datasets, Data Analysis
• Datasets can be found in many different
formats

• Data Analysis:
• the process of examining data to find facts, relationships, patterns, insights
and/or trends
• Its goal is to support better decision making
• Example: analyzing how the number of ice cream cones sold is related to the daily
temperature
• The results support decisions about how much ice cream a store should order in relation to
weather forecast information
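
The ice cream example can be sketched as a correlation check. This is a minimal illustration only: the temperatures and sales figures are made-up values, and `pearson` is a helper written for this sketch rather than anything taken from the slides.

```python
from math import sqrt

# Hypothetical daily observations (made-up numbers for illustration only).
temperatures_c = [18, 21, 24, 27, 30, 33]
cones_sold = [120, 150, 185, 210, 260, 300]


def pearson(xs, ys):
    """Return the Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


r = pearson(temperatures_c, cones_sold)
print(f"correlation between temperature and cones sold: {r:.3f}")
```

A coefficient close to 1 would suggest that sales rise with temperature, which is the kind of relationship a store could use when ordering against a weather forecast.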
Data Analytics
• Symbol used for data analysis

• Data Analytics:
• Data analytics is a discipline that includes the management of the
complete data lifecycle, which encompasses collecting, cleansing,
organizing, storing, analyzing and governing data
• The term includes the development of analysis methods, scientific
techniques and automated tools

• In Big Data environments, data analytics has developed methods that
allow data analysis to occur through the use of highly scalable
distributed technologies
• these methods handle large volumes of data from different sources

• The symbol used to represent data analytics

• Different kinds of organizations use data analytics tools and
techniques in different ways
• In business-oriented environments, data analytics results can lower
operational costs and facilitate strategic decision-making
Data analytics
• In the scientific domain, data analytics can help identify the cause of
a phenomenon to improve the accuracy of predictions.
• In service-based environments like public sector organizations, data
analytics can help strengthen the focus on delivering high-quality
services by driving down costs.
• Data analytics enables data-driven decision-making with scientific
backing
• There are four general categories of analytics that are distinguished by
the results they produce:
• descriptive analytics
• diagnostic analytics
• predictive analytics
• prescriptive analytics
… Data analytics
• The different analytics types leverage different techniques and analysis
algorithms
• The generation of high value analytic results increases the complexity
and cost of the analytic environment

• Value and complexity increase from descriptive to
prescriptive analytics

Descriptive Analytics
• Descriptive analytics are carried out to answer questions about events
that have already occurred.
• This form of analytics contextualizes data to generate information
• Sample questions can include:
• What was the sales volume over the past 12 months?
• What is the number of support calls received as categorized by severity and
geographic location?
• What is the monthly commission earned by each sales agent?
• It is estimated that 80% of generated analytics results are descriptive in
nature.
• Value-wise, descriptive analytics provide the least worth and require a
relatively basic skillset.
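
A descriptive question such as the support-call count by severity and geographic location amounts to a simple aggregation over historical records. The sketch below assumes a hypothetical list of `(severity, region)` records; the data and names are illustrative only.

```python
from collections import Counter

# Hypothetical support-call records: (severity, region).
support_calls = [
    ("high", "East"), ("low", "East"), ("high", "West"),
    ("low", "East"), ("medium", "West"), ("high", "East"),
]

# Count calls for every (severity, region) pair.
calls_by_category = Counter(support_calls)

for (severity, region), count in sorted(calls_by_category.items()):
    print(f"{severity:<6} {region:<5} {count}")
```

The output only summarizes what already happened, which is why this category sits at the low end of both value and complexity.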
Descriptive analytics …

• Descriptive analytics are often carried out via ad-hoc reporting or dashboards
• The reports are generally static in nature and display historical data that is
presented in the form of data grids or charts
• Queries are executed on operational data stores from within an enterprise,
for example a Customer Relationship Management system (CRM) or
Enterprise Resource Planning (ERP) system

• The operational systems, pictured left, are queried
via descriptive analytics tools to generate reports or
dashboards

Diagnostic Analytics

• Diagnostic analytics aim to determine the cause of a phenomenon that
occurred in the past using questions that focus on the reason behind the
event
• Its goal is to determine what information is related to the phenomenon
• Such questions include:
• Why were Q2 sales less than Q1 sales?
• Why have there been more support calls originating from the Eastern region than
from the Western region?
• Why was there an increase in patient re-admission rates over the past three
months?
• Diagnostic analytics provide more value than descriptive analytics but
require a more advanced skillset

Predictive Analytics
• Predictive analytics are carried out in an attempt to determine the
outcome of an event that might occur in the future
• Questions are usually formulated using a what-if rationale, such as the
following:
• What are the chances that a customer will default on a loan if they have missed a
monthly payment?
• What will be the patient survival rate if Drug B is administered instead of Drug A?
• If a customer has purchased Products A and B, what are the chances that they
will also purchase Product C?
• Predictive analytics try to predict the outcomes of events
• Predictions are made based on patterns, trends and exceptions found in
historical and current data
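
The Product A, B and C question can be approximated from historical data as a conditional frequency. This is only a sketch under stated assumptions: the purchase histories are invented, and `conditional_purchase_rate` is a hypothetical helper, not a production predictive model.

```python
# Hypothetical purchase histories, one set of products per customer.
baskets = [
    {"A", "B", "C"},
    {"A", "B"},
    {"A", "B", "C"},
    {"A", "C"},
    {"B"},
    {"A", "B", "C"},
]


def conditional_purchase_rate(baskets, given, target):
    """Fraction of customers who bought `target` among those who bought all of `given`."""
    matching = [b for b in baskets if given <= b]
    if not matching:
        return None  # no historical evidence to base an estimate on
    return sum(target in b for b in matching) / len(matching)


p = conditional_purchase_rate(baskets, {"A", "B"}, "C")
print(f"estimated chance of buying C given A and B: {p:.2f}")
```

Real predictive models generalize beyond raw frequencies, but the principle is the same: the estimate comes from patterns in historical data.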

Prescriptive Analytics
• Prescriptive analytics build upon the results of predictive analytics by
prescribing actions that should be taken
• The focus is not only on which prescribed option is best to follow, but also
why
• Sample questions may include:
• Among three drugs, which one provides the best results?
• When is the best time to trade a particular stock?
• Prescriptive analytics provide more value than any other type of analytics and
correspondingly require the most advanced skillset, as well as specialized software and
tools
• Various outcomes are calculated, and the best course of action for each
outcome is suggested.
• The approach shifts from explanatory to advisory and can include the
simulation of various scenarios.
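
One way to picture the advisory step is to score each candidate action across several simulated scenarios and recommend the best one. The sketch below is an assumption-laden toy: the demand scenarios, probabilities, cost and price are all invented, and `expected_profit` is a hypothetical helper.

```python
# Hypothetical demand scenarios: name -> (units demanded, assumed probability).
scenarios = {"cool day": (100, 0.3), "warm day": (200, 0.5), "hot day": (300, 0.2)}
UNIT_COST, UNIT_PRICE = 1.0, 3.0  # assumed figures


def expected_profit(order_qty):
    """Average profit of one order quantity across all scenarios."""
    total = 0.0
    for demand, probability in scenarios.values():
        sold = min(order_qty, demand)  # cannot sell more than was ordered
        total += probability * (sold * UNIT_PRICE - order_qty * UNIT_COST)
    return total


candidates = [100, 200, 300]
best = max(candidates, key=expected_profit)
print(f"recommended order quantity: {best}")
```

Comparing the expected profit of each candidate also shows why the recommended option wins, not just which option it is.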
Prescriptive analytics
• Prescriptive analytics involves the use of business rules and
internal and/or external data to perform an in-depth analysis

Big Data Characteristics
• For a dataset to be considered Big Data, it must possess one or more
characteristics that require accommodation in the solution design and
architecture of the analytic environment
• Big Data characteristics are used to help differentiate data categorized as
“Big” from other forms of data.
• The five Big Data traits are commonly referred to as the Five Vs: volume, velocity, variety, veracity and value

• The five Vs of Big Data

Volume

• The volume of data that is processed by Big Data solutions is substantial
and ever-growing.
• High data volumes impose distinct data storage and processing demands,
as well as additional data preparation, curation and management
processes

• Organizations and users world-wide create over 2.5 EBs
of data a day. As a point of comparison,
the Library of Congress currently
holds more than 300 TBs of data
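
The two figures can be put on the same scale with a short calculation. Decimal (SI) units are assumed here, since the slide does not say whether it means decimal or binary units.

```python
# Decimal (SI) units are assumed: 1 TB = 10**12 bytes, 1 EB = 10**18 bytes.
TB = 10**12
EB = 10**18

daily_data_bytes = 2.5 * EB             # daily figure quoted above
library_of_congress_bytes = 300 * TB    # Library of Congress figure quoted above

ratio = daily_data_bytes / library_of_congress_bytes
print(f"daily data is about {ratio:,.0f} times the Library of Congress figure")
```

Under that assumption, one day of newly created data is several thousand times the quoted Library of Congress holdings.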

Big Data Chara ….
• Typical data sources that are responsible for generating high data
volumes can include:
• online transactions, such as point-of-sale and banking
• scientific and research experiments
• sensors, such as GPS sensors, RFIDs, smart meters and telematics
• social media, such as Facebook and Twitter
• Velocity:
• In Big Data environments, data can arrive at fast speeds, and enormous
datasets can accumulate within very short periods of time
• Coping with the fast inflow of data requires the enterprise to design
highly elastic and available data processing solutions and corresponding
data storage capabilities
Velocity
• Examples of high-velocity Big Data datasets produced every minute
include tweets, video, emails and GBs of sensor data generated from
a jet engine

Big Data chara…
• Variety: refers to the multiple formats and types of data that need to be
supported by Big Data solutions
• Data variety brings challenges for enterprises in terms of data
integration, transformation, processing, and storage
• It includes structured data in the form of financial transactions,
semi-structured data in the form of emails and unstructured data in the form
of images

Big Data chara…
• Veracity: refers to the quality or fidelity of data.
• Data that enters Big Data environments needs to be assessed for quality
• The data processing activities need to resolve invalid data and remove noise
• Noise is data that cannot be converted into information and thus has no value
• Signals have value and lead to meaningful information.
• Data with a high signal-to-noise ratio has more veracity than data with a lower
ratio.
• Data that is acquired in a controlled manner, for example via online customer
registrations, usually contains less noise than data acquired via uncontrolled
sources, such as blog postings.
• Thus the signal to noise ratio of data is dependent upon the source of the data
and its type
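
A rough way to quantify veracity is to separate records that can be turned into information from those that cannot, then compare the counts. The sketch below assumes hypothetical sensor readings and an assumed plausible value range; `is_signal` is an illustrative rule, not a standard definition.

```python
# Hypothetical sensor readings; a record counts as signal only if it parses
# as a number inside an assumed plausible range.
raw_readings = ["21.5", "22.1", "", "n/a", "-999", "20.8", "23.0", "error"]


def is_signal(value, low=-50.0, high=60.0):
    """Return True if the value can be turned into usable information."""
    try:
        return low <= float(value) <= high
    except ValueError:
        return False


signal = [v for v in raw_readings if is_signal(v)]
noise = [v for v in raw_readings if not is_signal(v)]

# Ratio of usable to unusable records; higher means more veracity.
signal_to_noise = len(signal) / len(noise) if noise else float("inf")
print(f"signal={len(signal)} noise={len(noise)} ratio={signal_to_noise:.2f}")
```

Running the same check on data from a controlled source and an uncontrolled one would make the difference in veracity directly comparable.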
Big Data chara…
• Value: is defined as the usefulness of data for an enterprise
• It is related to the veracity characteristic in that the higher the data fidelity,
the more value it holds for the business
• Value is also dependent on how long data processing takes because
analytics results have a shelf-life
• The longer it takes for data to be turned into meaningful information, the
less value it has for a business
• Data that has high veracity and can be analyzed quickly has more value to a
business

Big Data chara…
• Apart from veracity and time, value is also impacted by the following
lifecycle-related concerns:
• How well has the data been stored?
• Were valuable attributes of the data removed during data cleansing?
• Are the right types of questions being asked during data analysis?
• Are the results of the analysis being accurately communicated to the
appropriate decision-makers?

Different Types of Data
• The data processed by Big Data solutions can be human-generated or
machine-generated
• The analytic results are generated by machines
• Human-generated data is the result of human interaction with
systems, such as online services and digital devices
• Examples of human-generated data include social media,
blog posts, emails, photo sharing and messaging

Different Types of Data

• Machine-generated data is generated by software programs and
hardware devices in response to real-world events
• Example: a log file captures an authorization decision made by a security service,
and a point-of-sale system generates a transaction against inventory
• Hardware-generated data: information conveyed from the numerous sensors in
a cellphone that may be reporting information, including position
• Another example: a gas and electricity meter that can
digitally send meter readings to the energy supplier for more
accurate energy bills

Data types processed by Big Data solutions
• The primary types of data are:
• Structured
• Unstructured
• Semi-structured
• Structured Data:
• Structured data conforms to a data model or schema and is often stored in
tabular form
• It is used to capture relationships between different entities and is
therefore most often stored in a relational database
• Structured data is frequently generated by enterprise applications and
information systems like ERP and CRM systems

Data types…
• Due to the abundance of tools and databases that natively support
structured data, it rarely requires special consideration with regard to
processing or storage
• Examples of structured data include banking transactions, invoices, and
customer records
• The symbol used to represent structured
data stored in a tabular form
• Unstructured Data:
• Data that does not conform to a data model or data schema is known as
unstructured data
• It is estimated that unstructured data makes up 80% of the data within any
given enterprise.
• Unstructured data has a faster growth rate than structured data
Data types …
• Unstructured data is either textual or binary and often conveyed via
files that are self-contained and non-relational
• A text file may contain the contents of various tweets or blog
postings.
• Binary files are often media files that contain image, audio or video
data

• Examples of unstructured data

Data types …
• Semi-structured Data: has a defined level of structure and consistency, but
is not relational in nature.
• Instead, semi-structured data is hierarchical or graph-based.
• Commonly stored in files that contain text
• XML and JSON files are common forms of semi-structured data
• Due to the textual nature of this data and its conformance to some level of
structure, it is more easily processed than unstructured data
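
The hierarchical nature of semi-structured data, and why it is easier to process than unstructured data, can be seen by parsing a small JSON document. The document below is hypothetical, including the optional `gift_note` field that only some entries carry.

```python
import json

# Hypothetical semi-structured document: hierarchical, with an optional field.
document = """
{
  "customer": {"id": 7, "name": "Ada"},
  "orders": [
    {"sku": "X1", "qty": 2},
    {"sku": "Y9", "qty": 1, "gift_note": "Happy birthday"}
  ]
}
"""

record = json.loads(document)

# Flatten the hierarchy into table-like rows, as a relational store would need.
rows = [
    (record["customer"]["id"], order["sku"], order["qty"], order.get("gift_note"))
    for order in record["orders"]
]
for row in rows:
    print(row)
```

The parser recovers the structure directly from the text, while the missing `gift_note` in the first order shows the looser consistency compared with a relational schema.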

• Examples of semi-structured data

Big Data lifecycle: Privacy & Security issues in Big Data

• Data Procurement:
• The acquisition of Big Data solutions themselves can be economical,
due to the availability of open-source platforms and tools and
opportunities to leverage commodity hardware
• A substantial budget may still be required to obtain external data
• External data sources include government data sources and
commercial data markets.
• Government-provided data, such as geo-spatial data, may be free.
• However, most commercially relevant data will need to be purchased
and may involve ongoing subscription costs

Privacy & Security issues in Big Data
• Privacy:
• Performing analytics on datasets can reveal confidential information about
organizations or individuals
• Even analyzing separate datasets that contain seemingly benign data can
reveal private information when the datasets are analyzed jointly.
• This can lead to intentional or unintentional breaches of privacy
• Addressing these privacy concerns requires an understanding of the nature of
data being accumulated and relevant data privacy regulations, as well as
special techniques for data tagging and anonymization
• Example: telemetry data, such as a car’s GPS log or smart meter data readings,
collected over an extended period of time can reveal an individual’s location and
behavior
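
As a minimal sketch of the anonymization techniques mentioned above, identifiers can be replaced with keyed hashes and location data can be coarsened. Everything named here is an assumption for illustration: `SECRET_KEY`, `pseudonymize`, `coarsen` and the GPS log entries are hypothetical, and this approach is pseudonymization, which reduces but does not eliminate re-identification risk.

```python
import hashlib
import hmac

# Assumed secret key; in practice it would be stored and rotated securely.
SECRET_KEY = b"replace-with-a-real-secret"


def pseudonymize(identifier: str) -> str:
    """Replace an identifier with a keyed hash so records stay linkable but unnamed."""
    return hmac.new(SECRET_KEY, identifier.encode(), hashlib.sha256).hexdigest()[:16]


def coarsen(lat: float, lon: float, places: int = 2) -> tuple:
    """Round coordinates so that exact positions are no longer stored."""
    return (round(lat, places), round(lon, places))


# Hypothetical GPS log entries: (driver id, latitude, longitude).
gps_log = [("driver-42", 51.507351, -0.127758), ("driver-42", 51.503364, -0.119543)]

anonymized = [(pseudonymize(d), *coarsen(lat, lon)) for d, lat, lon in gps_log]
for entry in anonymized:
    print(entry)
```

The same driver still maps to the same token, so behavior over time remains analyzable; whether that residual linkability is acceptable depends on the applicable privacy regulations.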

Security issues in Big Data
• Some of the components of Big Data solutions lack the robustness of
traditional enterprise solution environments when it comes to access
control and data security
• Securing Big Data involves ensuring that the data networks and repositories are
sufficiently secured via authentication and authorization mechanisms
• Big Data security further involves establishing data access levels for
different categories of users.
• Ex.: unlike traditional relational database management systems, NoSQL
databases generally do not provide robust built-in security mechanisms.
• They instead rely on simple HTTP-based APIs where data is exchanged
in plaintext, making the data prone to network-based attacks

Security
• NoSQL databases can be susceptible to network-based attacks

… Security
• Current technologies for securing data are slow when applied to huge
amounts of data

• The most efficient algorithms are reported to give an encryption rate of 64.3 MB/sec.
• That rate is slow in the context of Big Data, where the amounts of data extend to
gigabytes or even petabytes, exabytes and zettabytes
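
To see why the quoted rate is a problem at scale, the time to encrypt one petabyte can be estimated. The sketch assumes decimal units and a single sequential stream with no parallelism, which is a simplification.

```python
# Decimal units assumed: 1 PB = 10**9 MB.
ENCRYPTION_RATE_MB_PER_S = 64.3   # rate quoted above
PETABYTE_IN_MB = 10**9

seconds = PETABYTE_IN_MB / ENCRYPTION_RATE_MB_PER_S
days = seconds / 86_400  # seconds per day
print(f"encrypting 1 PB on a single stream takes about {days:.0f} days")
```

Under these assumptions a single petabyte takes on the order of half a year to encrypt, so exabyte-scale data would be out of reach without parallel or faster techniques.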
