BigData Hadoop Lesson1
Data Explosion
• A typical, large stock exchange captures more than 1 TB of data every day.
• There are around 5 billion cell phones (including 1.75 billion smartphones) in
the world.
• YouTube users upload more than 48 hours of video every minute.
• Large social networks such as Twitter and Facebook capture more than 10 TB
of data daily.
• There are more than 30 million networked sensors in the world.
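To put these rates on a common scale, the following minimal sketch converts the quoted per-minute and per-day figures into daily and yearly totals. The function and variable names are illustrative, and the yearly totals assume that the quoted rate stays constant for 365 days, which is a simplification.

```python
# Illustrative arithmetic only; the input rates are the figures quoted above.
MINUTES_PER_DAY = 24 * 60


def video_hours_per_day(hours_per_minute: float) -> float:
    """Hours of video uploaded in one day at a constant per-minute rate."""
    return hours_per_minute * MINUTES_PER_DAY


def terabytes_per_year(tb_per_day: float, days: int = 365) -> float:
    """Terabytes captured in a year, assuming a constant daily rate."""
    return tb_per_day * days


youtube_hours = video_hours_per_day(48)   # 48 hours uploaded every minute
exchange_tb = terabytes_per_year(1)       # stock exchange: 1 TB per day
social_tb = terabytes_per_year(10)        # social network: 10 TB per day

print(f"Video uploaded per day: {youtube_hours:,.0f} hours")
print(f"Stock exchange per year: {exchange_tb:,.0f} TB")
print(f"Social network per year: {social_tb:,.0f} TB")
```

Because the source figures are lower bounds ("more than"), the computed totals are lower bounds as well.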
Types of Data
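Structured data:
Data which has a defined data model and a fixed format
E.g.: Database tables
Semi-structured data:
Data which does not have a formal data model but carries tags or markers that organize it
E.g.: XML files
Unstructured data:
Data which does not have a pre-defined data model
E.g.: Text files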
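The practical difference between semi-structured and unstructured data shows up when a program reads it. The sketch below is a minimal illustration using Python's standard library; the two sample records and the function names are invented for this example and do not come from the lesson.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample records, invented for illustration.
xml_record = "<order><id>42</id><item>book</item></order>"
text_record = "Order 42 was placed for a book."


def read_semi_structured(xml_text: str) -> dict:
    """XML tags name each field, so values can be looked up by tag."""
    root = ET.fromstring(xml_text)
    return {child.tag: child.text for child in root}


def read_unstructured(text: str) -> list:
    """Plain text has no data model; a generic reader can only split it
    into words and leave the interpretation to later analysis."""
    return text.split()


fields = read_semi_structured(xml_record)
words = read_unstructured(text_record)

print(fields["item"])   # the field is addressable by name
print(words)            # only a sequence of words is available
```

Both records describe the same order, but only the XML version lets a program ask for the item directly.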
Need for Big Data
• 90% of the data in the world today has been created in the last two years alone.
• 80% of the data is unstructured or exists in widely varying structures, which are difficult to
analyze.
• Structured formats have some limitations with respect to handling large quantities of data.
• It is difficult to integrate information distributed across multiple systems.
• Most business users do not know what should be analyzed.
• Potentially valuable data is dormant or discarded.
• It is too expensive to justify the integration of large volumes of unstructured data.
• A lot of information has a short, useful lifespan.
• Context adds meaning to the existing information.
Data—The Most Valuable Resource
“In its raw form, oil has little value. Once processed and refined, it helps power the world.”
—Ann Winblad
Big Data and Its Sources
Big data is an all-encompassing term for any collection of data sets so large and
complex that it becomes difficult to process them using on-hand data management
tools or traditional data processing applications.
The sources of Big Data are:
• web logs;
• sensor networks;
• social media;
• internet text and documents;
• internet pages;
• search index data;
• atmospheric science, astronomy, biochemical and medical records;
• scientific research;
• military surveillance; and
• photography archives.
Three Characteristics of Big Data
Characteristics of Big Data Technology
• Turned 12 terabytes of Tweets created each day into improved product sentiment analysis
• Converted 350 billion annual meter readings to better predict power consumption
Characteristics of Big Data Technology
Big data is high-volume, high-velocity and high-variety information assets that demand
cost-effective, innovative forms of information processing for enhanced insight and
decision making.
Source: Gartner
Appeal of Big Data Technology
Leveraging Multiple Sources of Data
Big Data technology enables IT to leverage multiple sources of data. Following are
some of the sources:
Traditional IT Analytics Approach
The following are the requirements of the traditional IT analytics approach and
the factors that challenge them:
Traditional IT Analytics Approach
Define requirements
Execute queries
Big Data Technology—Platform for Discovery and Exploration
The following are the requirements for using Big Data technology as a platform for
discovery and exploration, and the challenges it overcomes:
• The business team needs to define data sources.
• They need to establish the hypothesis.
• The technology should enable explorative analysis.
• Data systems and sources need to be integrated as required.
Big Data Technology—Platform for Discovery and Exploration
The image illustrates how IT systems are built with the help of Big Data technology.
Determine questions to ask and test hypothesis
Big Data Technology—Capabilities
Big Data—Use Cases
Handling Limitations of Big Data
Following are the challenges that need to be addressed by Big Data technology:
• How to handle the system uptime and downtime
• How to combine data accumulated from all systems
Introduction to Hadoop
History and Milestones of Hadoop
Hadoop Milestones
Organizations Using Hadoop
Quiz
1. Which type of data is handled by Hadoop?
a. Structured data
b. Flexible-structure data
c. Semi-structured data
d. Unstructured data
Answer: d.
Explanation: Hadoop is designed to store and process unstructured data, although it can
also handle structured and semi-structured data.
Quiz
2. Which of the following is an example of unstructured data?
Quiz
3. Which of the following is structured data?
Quiz
4. Which of the following is semi-structured data?
Quiz
5. Which of the following aspects of Big Data refers to data size?
a. Volume
b. Value
c. Velocity
d. Variety
Answer: a.
Explanation: Volume in Big Data refers to the size of the data set to be
processed.
Quiz
6. Which of the following aspects of Big Data refers to the speed of response to
data requests generated by the user?
a. Variety
b. Volume
c. Value
d. Velocity
Answer: d.
Explanation: Velocity in Big Data refers to the speed of response to
data requests generated by the user.
Quiz
7. Which of the following aspects of Big Data refers to multiple data sources?
a. Variety
b. Velocity
c. Value
d. Volume
Answer: a.
Explanation: Variety in Big Data refers to multiple data sources.
Summary
Let us summarize the topics covered in this lesson:
• Big Data is a term applied to data sets that cannot be captured, managed, and processed
within a tolerable elapsed and specified time frame by commonly used software tools.
• Big Data relies on volume, velocity, and variety with respect to processing.
• Data can be divided into three types: unstructured data, semi-structured data, and
structured data.
• Big Data technology understands and navigates big data sources, analyzes unstructured
data, and ingests data at a high speed.
• Hadoop is a free, Java-based programming framework that supports the processing of
large data sets in a distributed computing environment.
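As a rough illustration of processing a data set in pieces, the sketch below counts words using map, group, and reduce steps, in the style of the MapReduce model commonly associated with Hadoop. This is an assumption-laden toy: it runs in a single Python process, uses no Hadoop API, and the function names and sample chunks are invented for this example. In a real cluster, each chunk would be handled on a different machine.

```python
from collections import defaultdict


def map_phase(chunk: str) -> list:
    """Emit a (word, 1) pair for every word in one chunk of input."""
    return [(word.lower(), 1) for word in chunk.split()]


def shuffle(pairs: list) -> dict:
    """Group the emitted values by key, as happens between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped: dict) -> dict:
    """Sum the grouped values to get one total per word."""
    return {key: sum(values) for key, values in grouped.items()}


def word_count(chunks: list) -> dict:
    """Run map over every chunk, then group and reduce the results."""
    pairs = []
    for chunk in chunks:          # each chunk stands in for one data block
        pairs.extend(map_phase(chunk))
    return reduce_phase(shuffle(pairs))


# Hypothetical input, split into two chunks as a distributed store might do.
chunks = ["big data big", "data hadoop"]
counts = word_count(chunks)
print(counts)
```

The map step needs no knowledge of other chunks, which is what allows the work to be spread across many machines.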