BDA 4
• In the dawn of the big data era, people are concerned with how to
rapidly extract key information from massive data so as to bring
value to enterprises and individuals.
The main processing methods of big data are:
• Bloom Filter
• Hashing
• Index
• Trie
• Parallel Computing
• Bloom Filter: a Bloom Filter consists of a bit array and a series of
hash functions.
• The principle of a Bloom Filter is to use a bit array to store hash
values of the data rather than the data itself; it is in essence a
bitmap index that uses hash functions to perform lossy, compressed
storage of data.
• Its advantages are high space efficiency and high query speed; its
disadvantages are a certain misrecognition (false-positive) rate and
the difficulty of deletion.
• Bloom Filters therefore suit big data applications that can tolerate
a certain misrecognition rate (a minimal sketch follows below).
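• To make the mechanism concrete, below is a minimal Python sketch of a
Bloom filter; the bit-array size, the number of hash functions, and the
salted SHA-256 hashing are illustrative assumptions, not a prescribed design.

import hashlib

class BloomFilter:
    """A bit array plus a series of hash functions."""

    def __init__(self, size=1024, num_hashes=3):
        self.size = size
        self.num_hashes = num_hashes
        self.bits = [0] * size  # the bit array

    def _positions(self, item):
        # Derive k bit positions by salting a SHA-256 digest with the index.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode()).hexdigest()
            yield int(digest, 16) % self.size

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = 1

    def __contains__(self, item):
        # "No" is always correct; "yes" may be a false positive,
        # which is the misrecognition rate noted above.
        # Deleting an item is not supported, since clearing its bits
        # could also erase other items.
        return all(self.bits[pos] for pos in self._positions(item))

bf = BloomFilter()
bf.add("user:42")
print("user:42" in bf)  # True
print("user:99" in bf)  # False (with high probability)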
• Hashing: a method that essentially transforms data into shorter,
fixed-length numerical or index values.
• Hashing has such advantages as rapid reading and writing and high
query speed, but a sound hash function is hard to find (see the small
sketch after this bullet).
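• As a small illustration, the sketch below hashes arbitrary keys to short,
fixed-length bucket indices for fast reads and writes; the table size and
the collision-chaining scheme are illustrative choices, assuming Python's
built-in hash().

TABLE_SIZE = 8
table = [[] for _ in range(TABLE_SIZE)]  # buckets, chained for collisions

def put(key, value):
    # Hash the key down to a short fixed-length index value.
    bucket = table[hash(key) % TABLE_SIZE]
    for entry in bucket:
        if entry[0] == key:
            entry[1] = value  # overwrite an existing key
            return
    bucket.append([key, value])

def get(key):
    # The same hash locates the bucket again, so lookups are fast.
    for k, v in table[hash(key) % TABLE_SIZE]:
        if k == key:
            return v
    return None

put("alice", 1)
put("bob", 2)
print(get("alice"))  # 1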
• Because big data comes from a wide range of sources, varies in
structure, and serves broad application fields, different analytical
architectures should be considered for different application
requirements.
Real-Time vs. Offline Analysis
• Big data analysis can be classified into real-time analysis and offline analysis.
• Real-time analysis is mainly used in e-commerce and finance.
• Since the data constantly changes, rapid analysis is needed and
analytical results must be returned with very short delay.
• For example, Greenplum from EMC and HANA from SAP are both real-time
analysis architectures.
• Offline analysis is usually used for applications without high
requirements on response time, e.g., machine learning, statistical
analysis, and recommendation algorithms.
• Offline analysis generally imports big data, such as logs, into a
dedicated analysis platform through data acquisition tools.
• In the big data setting, many Internet enterprises utilize
Hadoop-based offline analysis architectures in order to reduce the cost
of data format conversion and improve the efficiency of data acquisition.
• Examples of such tools include Facebook’s open source Scribe,
LinkedIn’s open source Kafka, Taobao’s open source Timetunnel, and
Hadoop’s Chukwa.
• These tools can meet the demands of data acquisition and transmission
at hundreds of MB per second.
Analysis at Different Levels
• Big data analysis can also be classified into memory-level analysis,
Business Intelligence (BI) level analysis, and massive-level analysis.
• Memory-Level: memory-level analysis is for the case when the total
data volume fits within the maximum memory of a cluster.
• The memory of a current server cluster can surpass hundreds of GB,
and even the TB level is common.
• Therefore, in-memory database technology may be used, with hot data
residing in memory, so as to improve analytical efficiency.
• Memory-level analysis is extremely suitable for real-time analysis. MongoDB
is a representative memory-level analytical architecture.
• With the development of SSDs (Solid-State Drives), the capacity and
performance of memory-level data analysis have been further improved,
and it is now widely applied.
• BI: BI analysis is for the case when the data scale surpasses the
memory level but may be imported into the BI analysis environment.
• Currently, mainstream BI products provide data analysis solutions
that support data at the TB level and beyond.
• Massive: massive-level analysis is for the case when the data scale
has completely surpassed the capacities of BI products and traditional
relational databases.
• At present, most massive-level analysis utilizes Hadoop’s HDFS to
store the data and MapReduce for the analysis (see the sketch after
this list).
• Most massive-level analysis belongs to the offline analysis category.
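• As a rough illustration of the MapReduce model (a local single-process
simulation, not Hadoop’s actual API), the word-count sketch below shows the
map, shuffle, and reduce steps that a cluster would run distributed over HDFS.

from collections import defaultdict

def map_phase(line):
    # Map: emit a (word, 1) pair for each word in one line of input.
    return [(word, 1) for word in line.split()]

def reduce_phase(word, counts):
    # Reduce: sum all counts collected for one word.
    return word, sum(counts)

lines = ["big data analysis", "offline big data"]
shuffled = defaultdict(list)
for line in lines:
    for word, count in map_phase(line):  # map
        shuffled[word].append(count)     # shuffle: group values by key
results = [reduce_phase(w, c) for w, c in shuffled.items()]  # reduce
print(results)  # [('big', 2), ('data', 2), ('analysis', 1), ('offline', 1)]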
Analysis with Different Complexity
• The time and space complexity of data analysis algorithms differs
greatly according to the kind of data and the application’s demands.
• For example, for applications that are amenable to parallel
processing, a distributed algorithm may be designed and a parallel
processing model may be used for the analysis, as in the sketch below.
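• As a minimal sketch of that idea, the example below partitions a dataset
and analyzes the partitions in parallel with Python’s standard
multiprocessing pool; the workload (a sum of squares) and the partition
count are illustrative assumptions.

from multiprocessing import Pool

def partial_sum(chunk):
    # Each worker analyzes one partition of the data independently.
    return sum(x * x for x in chunk)

if __name__ == "__main__":
    data = list(range(1_000_000))
    chunks = [data[i::4] for i in range(4)]  # split into 4 partitions
    with Pool(processes=4) as pool:
        # Run the same analysis over all partitions in parallel,
        # then combine the partial results.
        total = sum(pool.map(partial_sum, chunks))
    print(total)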
Tools for Big Data Mining and Analysis
• Many tools for big data mining and analysis are available, including
professional and amateur software, expensive commercial software, and
free open source software.
• According to KDnuggets’ 2012 survey of 798 professionals, “What
Analytics, Data mining, Big Data software you used in the past 12
months for a real project”, the five most widely used tools are:
• R (30.7 %)
• Excel (29.8 %)
• Rapid-I RapidMiner (26.7 %)
• KNIME (21.8 %)
• Weka/Pentaho (14.8 %)
• R (30.7 %): R, an open source programming language and software
environment, is designed for data mining/analysis and visualization.
• R is a realization of the S language. S is an interpreted language
developed by AT&T Bell Labs and used for data exploration, statistical
analysis, and drawing plots.
• Due to the popularity of R, database manufacturers such as Teradata
and Oracle have released products supporting R.
• Excel (29.8 %): Excel, a core component of Microsoft Office, provides
powerful data processing and statistical analysis capability, and aids
decision making.
• When Excel is installed, advanced plug-ins with powerful data
analysis functions, such as the Analysis ToolPak and the Solver Add-in,
are also integrated, but they can be used only after users enable them.
• Excel is also the only commercial software among the top five.