Evolution of Big Data - Best Practices for Big Data Analytics - Big Data Characteristics -
Validating - The Promotion of the Value of Big Data - Big Data Use Cases - Characteristics of
Big Data Applications - Perception and Quantification of Value - Understanding Big Data
Storage - A General Overview of Architecture - HDFS - MapReduce and YARN - MapReduce
Programming Model
Evolution of Big Data
The term “big data” refers to data that is so large, fast or complex that it’s difficult or impossible
to process using traditional methods. The act of accessing and storing large amounts of information
for analytics has been around a long time. But the concept of big data gained momentum in the
early 2000s, when industry analyst Doug Laney articulated the now-mainstream definition of big
data as the three V's:
The 3 Vs of Big Data
1. Volume
2. Variety
3. Velocity
Definition of Big Data
According to the McKinsey Global Institute report,
Big Data is data whose scale, distribution, diversity, and/or timeliness require the use of new
technical architectures and analytics to enable insights that unlock new sources of business value.
Characteristics of Big Data
It is defined by three attributes, namely:
1. Huge volume of data
2. Complexity of data types and structures
3. Speed of new data creation and growth
Sources driving the data deluge
• Mobile sensors
• Social media
• Video surveillance
• Video rendering
• Smart grids
• Medical imaging
• Gene sequencing
• Geophysical exploration
Types of data structures
• Structured
Data containing a defined data type, format, and structure
Ex. transaction data, online analytical processing [OLAP] data cubes, traditional RDBMS,
CSV files, and even simple spreadsheets
• Semi-structured
Textual data files with a discernible (noticeable) pattern that enables parsing
Ex. Extensible Markup Language [XML] data files that are self-describing and defined by
an XML schema
• Quasi-structured
Textual data with erratic data formats that can be formatted with effort, tools, and time
Ex. web clickstream data that may contain inconsistencies in data values and formats
• Unstructured
Data that has no inherent (natural) structure
Ex. text documents, PDFs, images, and video
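To make the distinction concrete, the following minimal Python sketch parses one structured CSV
record and one semi-structured XML fragment using only the standard library; the field names and
sample values are illustrative assumptions, not drawn from a real dataset.

import csv
import io
import xml.etree.ElementTree as ET

# Structured: fixed fields with known types, addressable by name.
csv_text = "order_id,amount,currency\n1001,250.00,USD\n"
for row in csv.DictReader(io.StringIO(csv_text)):
    print(row["order_id"], float(row["amount"]), row["currency"])

# Semi-structured: self-describing tags with a discernible pattern,
# but elements may be optional or nested differently across records.
xml_text = "<order id='1001'><amount currency='USD'>250.00</amount></order>"
root = ET.fromstring(xml_text)
amount = root.find("amount")
print(root.get("id"), float(amount.text), amount.get("currency"))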
BI tends to provide reports, dashboards, and queries on business questions for the current period
or in the past. BI systems make it easy to answer questions related to quarter-to-date revenue,
progress toward quarterly targets, and understand how much of a given product was sold in a prior
quarter or year. These questions tend to be closed-ended and explain current or past behavior,
typically by aggregating historical data and grouping it in some way. BI provides hindsight and
some insight and generally answers questions related to "when" and "where" events occurred.
Data Science tends to use disaggregated data in a more forward-looking, exploratory way, focusing
on analyzing the present and enabling informed decisions about the future. Rather than aggregating
historical data to look at how many units of a given product were sold in the previous quarter, a
team may employ Data Science techniques such as time series analysis to forecast future product
sales and revenue more accurately than extending a simple trend line. In addition, Data Science tends to be
more exploratory in nature and may use scenario optimization to deal with more open-ended
questions. This approach provides insight into current activity and foresight into future events,
while generally focusing on questions related to "how" and "why" events occur.
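As a minimal illustration of this contrast, the Python sketch below extends a simple trend line and
also produces a one-step forecast with exponential smoothing; the quarterly sales figures are
invented for the example.

# Toy quarterly sales figures (illustrative only).
sales = [120, 135, 128, 150, 162, 158, 171, 180]
n = len(sales)

# BI-style extrapolation: fit a straight trend line and extend it one step.
xbar = (n - 1) / 2
ybar = sum(sales) / n
slope = (sum((i - xbar) * (y - ybar) for i, y in enumerate(sales))
         / sum((i - xbar) ** 2 for i in range(n)))
trend_forecast = ybar + slope * (n - xbar)

# Data-Science-style forecast: simple exponential smoothing, which
# adapts to recent observations instead of assuming a fixed line.
alpha = 0.5  # smoothing factor, chosen arbitrarily here
level = sales[0]
for y in sales[1:]:
    level = alpha * y + (1 - alpha) * level

print(f"trend-line forecast: {trend_forecast:.1f}")
print(f"smoothing forecast:  {level:.1f}")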
For data sources to be loaded into the data warehouse, data needs to be well understood, structured,
and normalized with the appropriate data type definitions. Although this kind of centralization
enables security, backup, and fail over of highly critical data, it also means that data typically must
go through significant preprocessing and checkpoints before it can enter this sort of controlled
environment, which does not lend itself to data exploration and iterative analytics.
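A minimal sketch of this kind of preprocessing gate is shown below, assuming a hypothetical
two-field schema; real warehouse loads apply far more extensive checks.

# Hypothetical schema: every incoming row must supply these fields,
# and each value must be castable to the declared type.
SCHEMA = {"customer_id": int, "signup_date": str}

def validate(row):
    """Return a typed copy of the row, or raise ValueError if it fails."""
    typed = {}
    for field, cast in SCHEMA.items():
        if field not in row:
            raise ValueError(f"missing field: {field}")
        typed[field] = cast(row[field])
    return typed

incoming = [{"customer_id": "42", "signup_date": "2020-01-15"},
            {"signup_date": "2020-02-01"}]  # second row fails the gate

for row in incoming:
    try:
        print("loaded:", validate(row))
    except ValueError as err:
        print("rejected:", err)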
As a result of this level of control on the EDW, additional local systems may emerge in the form
of departmental warehouses and local data marts that business users create to accommodate their
need for flexible analysis. These local data marts may not have the same constraints for security
and structure as the main EDW and allow users to do some level of more in-depth analysis.
However, these one-off systems reside in isolation, often are not synchronized or integrated with
other data stores, and may not be backed up.
Once in the data warehouse, data is read by additional applications across the enterprise for BI and
reporting purposes. These are high-priority operational processes getting critical data feeds from
the data warehouses and repositories.
At the end of this workflow, analysts get data provisioned for their downstream analytics. Because
users generally are not allowed to run custom or intensive analytics on production databases,
analysts create data extracts from the EDW to analyze data offline in R or other local analytical
tools. Many times these tools are limited to in-memory analytics on desktops, analyzing samples
of data rather than the entire population of a dataset. Because these analyses are based on data
extracts, they reside in a separate location, and the results of the analysis, along with any insights
on the quality of the data or anomalies, rarely are fed back into the main data repository.
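Because such desktop tools hold only a sample in memory, a common technique for building those
extracts is reservoir sampling, which draws a uniform sample from a stream of unknown length;
the sketch below is generic, and the extract file name is hypothetical.

import random

def reservoir_sample(rows, k):
    """Keep a uniform random sample of k rows from a stream."""
    sample = []
    for i, row in enumerate(rows):
        if i < k:
            sample.append(row)
        else:
            j = random.randrange(i + 1)
            if j < k:
                sample[j] = row
    return sample

# Example: sample five values from a million-row stream without holding
# the whole population in memory; a file would work the same way, e.g.
#   with open("extract.csv") as f: sample = reservoir_sample(f, 1000)
print(reservoir_sample(range(1_000_000), 5))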
Key roles for the new big data ecosystem
• Data Savvy Professionals
This group represents people who are savvy enough to understand data and work with
analytics. These people tend to have a base knowledge of working with data, or an
appreciation for some of the work being performed by data scientists and others with deep
analytical talent. Ex. financial analysts, market research analysts
• Technology and Data Enablers
This group represents people providing technical expertise to support analytical projects,
such as provisioning and administrating analytical sandboxes, and managing large-scale
data architectures that enable widespread analytics within companies and other
organizations. This role requires skills related to computer engineering, programming, and
database administration.
Key skills and behavioral characteristics of a data scientist
• Quantitative skill
• Technical aptitude
• Skeptical mind-set and critical thinking
• Curious and creative
• Communicative and collaborative
Examples of Big Data Analytics
• Retail
• IT infrastructure
• Social media
Score by Dimension (0-4)

Feasibility
0: Evaluation of new technology is not officially sanctioned
1: Organization tests new technologies in reaction to market pressure
2: Organization evaluates and tests new technologies after market evidence of successful use
3: Organization is open to evaluation of new technology; adoption of technology on an ad hoc
basis based on convincing business justifications
4: Organization encourages evaluation and testing of new technology; clear decision process for
adoption or rejection; organization supports allocation of time to innovation

Reasonability
0: Organization's resource requirements for the near-, mid-, and long-terms are satisfactorily met
1: Organization's resource requirements for the near- and mid-terms are satisfactorily met;
unclear as to whether long-term needs are met
2: Organization's resource requirements for the near-term are satisfactorily met; unclear as to
whether mid- and long-term needs are met
3: Business challenges are expected to have resource requirements in the mid- and long-terms
that will exceed the capability of the existing and planned environment
4: Business challenges have resource requirements that clearly exceed the capability of the
existing and planned environment; organization's go-forward business model is highly
information-centric

Value
0: Investment in hardware resources, software tools, skills training, and ongoing management
and maintenance exceeds the expected quantifiable value
1: The expected quantifiable value is evenly balanced by the investment in hardware resources,
software tools, skills training, and ongoing management and maintenance
2: Selected instances of perceived value may suggest a positive return on investment
3: Expectations for some quantifiable value for investing in limited aspects of the technology
4: The expected quantifiable value widely exceeds the investment in hardware resources,
software tools, skills training, and ongoing management and maintenance

Integrability
0: Significant impediments to incorporating any nontraditional technology into the environment
1: Willingness to invest effort in determining ways to integrate technology, with some successes
2: New technologies can be integrated into the new environment within limitations and with
some level of effort
3: Clear processes exist for migrating or integrating technologies, but require dedicated
resources and level of effort
4: No constraints or impediments to fully integrate technology into the operational environment
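One simple way to apply such a rubric, not prescribed by the source, is to score each dimension
from 0 to 4 and compare totals across candidate technologies; the candidate names and scores in
this sketch are made up.

# Hypothetical scores for two candidate technologies across the dimensions.
DIMENSIONS = ["feasibility", "reasonability", "value", "integrability"]
candidates = {
    "distributed_file_store": {"feasibility": 3, "reasonability": 4,
                               "value": 3, "integrability": 2},
    "in_memory_database":     {"feasibility": 2, "reasonability": 2,
                               "value": 3, "integrability": 3},
}

for name, scores in candidates.items():
    total = sum(scores[d] for d in DIMENSIONS)
    print(f"{name}: {total} / {4 * len(DIMENSIONS)}")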
Application categories
• Counting functions applied to large bodies of data that can be segmented and distributed
among a pool of computing and storage resources, such as document indexing, concept
filtering, and aggregation (counts and sums); a word-count sketch follows this list.
• Scanning functions that can be broken up into parallel threads, such as sorting, data
transformations, semantic text analysis, pattern recognition, and searching.
• Modeling capabilities for analysis and prediction.
• Storing large datasets while providing relatively rapid access.
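As a concrete instance of the counting category, and of the MapReduce programming model
named in the syllabus, here is a minimal word-count sketch in pure Python that mimics the map,
shuffle, and reduce phases on a toy corpus; a production job would run on a framework such as
Hadoop rather than in a single process.

from collections import defaultdict

def map_phase(document):
    """Emit a (word, 1) pair for every word occurrence."""
    for word in document.lower().split():
        yield word, 1

def shuffle(pairs):
    """Group intermediate values by key, as the framework would."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(word, counts):
    """Sum the occurrence counts for one word."""
    return word, sum(counts)

# Toy corpus; each string stands in for one input split (e.g. an HDFS block).
documents = ["big data needs big storage", "data drives value"]

pairs = (pair for doc in documents for pair in map_phase(doc))
for word, counts in sorted(shuffle(pairs).items()):
    print(reduce_phase(word, counts))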
Characteristics of Big Data Applications
• Data throttling
• Computation-restricted throttling
• Large data volumes
• Significant data variety
• Benefits from data parallelization