
Unit - 4

BIG DATA ANALYSIS: Traditional Data Analysis, Big Data Analytic Methods,
Architecture for Big Data Analysis - Real-Time vs. Offline Analysis,
Analysis at Different Levels, Analysis with Different Complexity,
Tools for Big Data Mining and Analysis
Big Data Analysis
• Traditional Data Analysis: Traditional data analysis uses appropriate
statistical methods to analyze massive first-hand and second-hand data,
to concentrate, extract, and refine the useful information hidden in a
batch of data, and to identify the inherent laws of the subject matter,
so as to exploit the data as fully as possible and maximize its value.
• Data analysis plays a major guiding role when governments draw up
development plans, and when enterprises seek to understand customer
demand and predict market trends.
• Big data analysis can be deemed the analysis of a special kind of data.
• Therefore, many traditional data analysis methods may still be utilized
for big data analysis.
• Cluster Analysis: Cluster analysis is a statistical method for grouping
objects, specifically, for classifying objects according to certain features.
Cluster analysis is used to differentiate objects by their features and
divide them into categories (clusters) according to these features,
such that objects within the same category have high homogeneity while
different categories have high heterogeneity.
• Cluster analysis is an unsupervised learning method that does not
require training data.
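A minimal k-means sketch in Python (scikit-learn and NumPy are illustrative
choices, not tools named in this unit) shows how objects are grouped purely
from their features, with no training labels:

# Minimal k-means sketch (assumes scikit-learn and NumPy are installed).
import numpy as np
from sklearn.cluster import KMeans

# Toy data: two loose groups of 2-D points.
points = np.array([[1.0, 1.1], [0.9, 1.0], [1.2, 0.8],
                   [8.0, 8.2], [7.9, 8.1], [8.3, 7.8]])

# Ask for two clusters; no labels are supplied (unsupervised).
model = KMeans(n_clusters=2, n_init=10, random_state=0).fit(points)
print(model.labels_)           # cluster assignment of each point
print(model.cluster_centers_)  # homogeneity inside, heterogeneity between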
• Factor Analysis: Factor analysis aims to describe the relations among
many indicators or elements with only a few factors, i.e., several closely
related variables are grouped together and each group becomes a factor
(called a factor because it is unobservable, i.e., not a specific variable);
these few factors are then used to reveal the most valuable information
in the original data.
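A small Python sketch (scikit-learn's FactorAnalysis is an illustrative
choice, and the data is synthetic) shows three closely related indicators
being summarized by one unobservable factor:

# Minimal factor analysis sketch (assumes scikit-learn and NumPy).
import numpy as np
from sklearn.decomposition import FactorAnalysis

rng = np.random.default_rng(0)
# One hidden factor drives three observed, closely related indicators.
factor = rng.normal(size=(200, 1))
observed = np.hstack([factor + 0.1 * rng.normal(size=(200, 1))
                      for _ in range(3)])

fa = FactorAnalysis(n_components=1).fit(observed)
print(fa.components_)  # loadings: how strongly each indicator reflects the factor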
• Correlation Analysis: Correlation analysis is an analytical method for
determining the laws of correlation among observed phenomena and, on
that basis, conducting forecasting and control.
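A minimal Python sketch (NumPy assumed; the two series are made up for
illustration) computes a Pearson correlation coefficient between two
observed variables:

# Minimal correlation sketch (assumes NumPy).
import numpy as np

ad_spend = np.array([10, 20, 30, 40, 50], dtype=float)
sales    = np.array([12, 24, 33, 41, 52], dtype=float)

# Pearson correlation coefficient between the two observed series.
r = np.corrcoef(ad_spend, sales)[0, 1]
print(round(r, 3))  # close to 1.0 -> strong positive correlation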

• Regression Analysis: Regression analysis is a mathematical tool for
revealing correlations between one variable and several other variables.
Based on a group of experimental or observed data, regression analysis
identifies dependence relationships among variables that are hidden by
randomness. Regression analysis may turn complex and uncertain
correlations among variables into simple and regular ones.
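A minimal Python sketch (NumPy only; the data points are invented for
illustration) fits a simple regression line and uses it to forecast an
unseen value:

# Minimal linear regression sketch (assumes NumPy).
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])   # roughly y = 2x with noise

# Fit a straight line y = a*x + b through the observed data.
a, b = np.polyfit(x, y, deg=1)
print(a, b)           # estimated slope and intercept
print(a * 6.0 + b)    # forecast for an unseen x value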
• Statistical Analysis: Statistical analysis is based on statistical theory,
a branch of applied mathematics. In statistical theory, randomness and
uncertainty are modeled with probability theory. Statistical analysis can
provide descriptions of, and inferences about, large-scale datasets.
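A small Python sketch (NumPy and SciPy assumed; the sample values are
hypothetical) illustrates the two sides of statistical analysis,
description and inference, on a toy sample:

# Minimal descriptive/inferential sketch (assumes NumPy and SciPy).
import numpy as np
from scipy import stats

sample = np.array([4.8, 5.1, 5.0, 4.9, 5.3, 5.2, 4.7, 5.0])

# Description: summarize the sample.
print(sample.mean(), sample.std(ddof=1))

# Inference: test whether the population mean could plausibly be 5.0.
t_stat, p_value = stats.ttest_1samp(sample, popmean=5.0)
print(t_stat, p_value)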

• Data Mining: Data mining is the process of extracting hidden, previously
unknown, but potentially useful information and knowledge from massive,
incomplete, noisy, fuzzy, and random data.
Big Data Analytic Methods

• At the dawn of the big data era, people are concerned with how to
rapidly extract key information from massive data so as to create value
for enterprises and individuals.
The main processing methods of big data are:
• Bloom Filter
• Hashing
• Index
• Trie
• Parallel Computing
• Bloom Filter: Bloom Filter is actually a bit array and a series of Hash
functions.
• The principle of Bloom Filter is to store the hash values of data rather
than the data itself in a bit array; in essence, it is a bitmap index that
uses hash functions to perform lossy, compressed storage of data.
• It has such advantages as high space efficiency and high query speed,
but also some disadvantages, such as a certain false-positive
(misrecognition) rate and difficulty in deleting elements.
• Bloom Filter applies to big data applications that can tolerate a certain
misrecognition rate.
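A pure-Python sketch of the idea (the array size and hash construction are
illustrative, not a production design) stores only bit positions derived
from hash values, so lookups may return false positives but never false
negatives:

# Minimal Bloom filter sketch (standard library only).
import hashlib

class BloomFilter:
    def __init__(self, size=1024, num_hashes=3):
        self.size = size
        self.num_hashes = num_hashes
        self.bits = [0] * size          # the bit array

    def _positions(self, item):
        # Derive several hash values from one item by salting with an index.
        for i in range(self.num_hashes):
            digest = hashlib.md5(f"{i}:{item}".encode()).hexdigest()
            yield int(digest, 16) % self.size

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = 1          # only hash values are stored, not the data

    def might_contain(self, item):
        # False means definitely absent; True may be a false positive.
        return all(self.bits[pos] for pos in self._positions(item))

bf = BloomFilter()
bf.add("user42")
print(bf.might_contain("user42"))   # True
print(bf.might_contain("user99"))   # almost certainly False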
• Hashing: Hashing is a method that essentially transforms data into
shorter, fixed-length numerical values or index values.
• Hashing has such advantages as rapid reading and writing and high query
speed, but a sound hash function is hard to find.
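A minimal Python sketch (standard library only; the record string is
hypothetical) shows a long record being reduced to a short fixed-length
digest, the same idea behind hash-table lookups:

# Minimal hashing sketch (standard library only).
import hashlib

record = "customer:1001:2012-06-30:order#88"

# A hash function maps arbitrary-length data to a short fixed-length value.
digest = hashlib.sha256(record.encode()).hexdigest()
print(digest[:16])                      # short, fixed-length index value

# The same idea powers constant-time lookups in a hash table (Python dict).
table = {record: "full row stored elsewhere"}
print(table[record])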

• Index: An index is always an effective method to reduce the expense of
disk reading and writing, and to improve insertion, deletion, modification,
and query speeds, both in traditional relational databases that manage
structured data and in technologies that manage semi-structured and
unstructured data.
• The disadvantage of an index is the additional cost of storing index
files, and the index files must be maintained dynamically as the data is
updated.
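A small sketch using SQLite from the Python standard library (table and
column names are hypothetical) shows an index being created to speed up
queries, at the cost of extra storage and maintenance on every later update:

# Minimal index sketch using SQLite from the standard library.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, customer TEXT, amount REAL)")
conn.executemany("INSERT INTO orders VALUES (?, ?, ?)",
                 [(i, f"cust{i % 100}", i * 1.5) for i in range(10_000)])

# The index speeds up lookups on 'customer', at the cost of extra storage
# and maintenance work on every later insert, update, or delete.
conn.execute("CREATE INDEX idx_orders_customer ON orders (customer)")
print(conn.execute("SELECT COUNT(*) FROM orders "
                   "WHERE customer = 'cust7'").fetchone())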
• Trie: also called a trie tree, a variant of the hash tree. It is mainly
applied to rapid retrieval and word frequency statistics.
• The main idea of the trie is to utilize common prefixes of character
strings to reduce comparisons between character strings to the greatest
extent, so as to improve query efficiency.
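A minimal pure-Python trie sketch shows common prefixes being stored only
once while word frequencies are counted:

# Minimal trie sketch for word-frequency counting (pure Python).
class TrieNode:
    def __init__(self):
        self.children = {}   # next character -> child node (shared prefixes)
        self.count = 0       # how many words end at this node

def insert(root, word):
    node = root
    for ch in word:
        node = node.children.setdefault(ch, TrieNode())
    node.count += 1

def frequency(root, word):
    node = root
    for ch in word:
        node = node.children.get(ch)
        if node is None:
            return 0         # prefix diverges -> word never inserted
    return node.count

root = TrieNode()
for w in ["data", "data", "database", "date"]:
    insert(root, w)          # the shared prefix "dat" is stored only once
print(frequency(root, "data"))      # 2
print(frequency(root, "database"))  # 1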
• Parallel Computing: Compared to traditional serial computing, parallel
computing refers to utilizing several computing resources to complete a
computation task.
• Its basic idea is to decompose a problem into parts and assign them to
several independent processes to be completed independently, so as to
achieve co-processing.
• Presently, some classic parallel computing models include MPI
(Message Passing Interface), MapReduce, and Dryad.
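A minimal Python sketch (using the standard multiprocessing module as a
stand-in for MPI, MapReduce, or Dryad) decomposes a word count into
independent map tasks whose partial results are then merged:

# Minimal parallel word-count sketch in the MapReduce style
# (standard-library multiprocessing; chunks are hypothetical log fragments).
from collections import Counter
from multiprocessing import Pool

def count_words(chunk):          # "map": each worker handles one chunk
    return Counter(chunk.split())

if __name__ == "__main__":
    chunks = ["big data analysis", "big data mining", "data tools"]
    with Pool(processes=3) as pool:
        partial_counts = pool.map(count_words, chunks)
    total = sum(partial_counts, Counter())   # "reduce": merge partial results
    print(total)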
Architecture for Big Data Analysis

• Because big data comes from a wide range of sources, varies greatly in
structure, and serves broad application fields, different analytical
architectures should be considered for different application requirements.
Real-Time vs. Offline Analysis

• Big data analysis can be classified into real-time analysis and offline
analysis.
• Real-time analysis is mainly used in E-commerce and finance.
• Since the data constantly changes, rapid data analysis is needed and
analytical results must be returned with a very short delay.

• The main existing architectures for real-time analysis include
(a) parallel processing clusters using traditional relational databases, and
(b) memory-based computing platforms.
• For example, Greenplum from EMC and HANA from SAP are both real-time
analysis architectures.
• Offline analysis is usually used for applications without high
requirements on response time, e.g., machine learning, statistical
analysis, and recommendation algorithms.
• Offline analysis generally conducts analysis by importing big data, such
as logs, into a special platform through data acquisition tools.
• Under the big data setting, many Internet enterprises utilize the offline
analysis architecture based on Hadoop in order to reduce the cost of
data format conversion and improve the efficiency of data acquisition.
• Examples include Facebook’s open source tool Scribe, LinkedIn’s open
source tool Kafka, Taobao’s open source tool Timetunnel, and Chukwa
of Hadoop, etc.
• These tools can meet the demands of data acquisition and transmission
at rates of hundreds of MB per second.
Analysis at Different Levels
• Big data analysis can also be classified into memory level analysis, Business
Intelligence (BI) level analysis, and massive level analysis

• Memory-Level: Memory-level analysis is for the case when the total data
volume is within the maximum memory capacity of a cluster.
• The memory of current server clusters surpasses hundreds of GB, and even
the TB level is common.
• Therefore, in-memory database technology may be used, and hot data
should reside in memory so as to improve analytical efficiency.
• Memory-level analysis is extremely suitable for real-time analysis. MongoDB
is a representative memory-level analytical architecture.
• With the development of SSDs (Solid-State Drives), the capacity and
performance of memory-level data analysis have been further improved,
and it has been widely applied.
• BI: BI analysis is for the case when the data scale surpasses the memory
level but may be imported into the BI analysis environment.
• Currently, mainstream BI products provide data analysis plans that
support data above the TB level.
• Massive: Massive analysis is for the case when the data scale has
completely surpassed the capacities of BI products and traditional
relational databases.
• At present, most massive analysis utilizes Hadoop's HDFS to store data
and uses MapReduce for data analysis.
• Most massive analysis belongs to the offline analysis category.
Analysis with Different Complexity
• The time and space complexities of data analysis algorithms differ
greatly according to the kind of data and the application demands.
• For example, for applications that are amenable to parallel
processing, a distributed algorithm may be designed and a parallel
processing model may be used for data analysis.
Tools for Big Data Mining and Analysis
• Many tools for big data mining and analysis are available, including
professional and amateur software, expensive commercial software, and
free open source software.
• According to a 2012 KDnuggets survey of 798 professionals ("What
Analytics, Data Mining, Big Data software have you used in the past 12
months for a real project?"), the five most widely used tools are:
• R (30.7 %)
• Excel (29.8 %)
• Rapid-I Rapidminer (26.7 %)
• KNIME (21.8 %)
• Weka/Pentaho (14.8 %)
• R (30.7 %): R, an open source programming language and software
environment, is designed for data mining/analysis and visualization.
• R is a realization of the S language. S is an interpreted language
developed by AT&T Bell Labs and used for data exploration, statistical
analysis, and drawing plots.
• Due to the popularity of R, database manufacturers such as Teradata and
Oracle have released products supporting R.
• Excel (29.8 %): Excel, a core component of Microsoft Office, provides
powerful data processing and statistical analysis capability, and aids
decision making.
• When Excel is installed, some advanced plug-ins with powerful data
analysis functions, such as Analysis ToolPak and Solver Add-in, are also
installed, but they can be used only after users enable them.
• Excel is also the only commercial software among the top five.
