BigData Hadoop Lesson1


Big Data and Hadoop

Lesson 1—Introduction to Big Data and Hadoop


Objectives

By the end of this lesson, you will be able to:

• Identify the need for Big Data

• Explain the concept of Big Data

• Describe the basics of Hadoop

• Explain the benefits of Hadoop

Data Explosion

Over 2.5 exabytes (2.5 billion gigabytes) of data is generated every day.

The following are some of the sources of this huge volume of data:

• A typical, large stock exchange captures more than 1 TB of data every day.
• There are around 5 billion cell phones (including 1.75 billion smartphones) in
the world.
• YouTube users upload more than 48 hours of video every minute.
• Large social networks such as Twitter and Facebook capture more than 10 TB
of data daily.
• There are more than 30 million networked sensors in the world.
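
The daily volume quoted at the top of this slide can be checked with a few lines of Python. This is a minimal sketch that assumes decimal (SI) units, where 1 exabyte is 10^18 bytes and 1 gigabyte is 10^9 bytes; the constant and function names are illustrative only.

```python
# Decimal (SI) storage units are assumed; binary units (GiB, EiB) would differ.
BYTES_PER_GB = 10**9
BYTES_PER_EB = 10**18


def exabytes_to_gigabytes(exabytes: float) -> float:
    """Convert exabytes to gigabytes using decimal units."""
    return exabytes * BYTES_PER_EB / BYTES_PER_GB


daily_gb = exabytes_to_gigabytes(2.5)
print(f"{daily_gb:,.0f} GB per day")  # prints "2,500,000,000 GB per day"
```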

Types of Data

The following three types of data can be identified:

Structured data: Data that is represented in a tabular format
E.g.: Databases

Semi-structured data: Data that does not follow a formal data model but carries
tags or markers that describe its structure
E.g.: XML files

Unstructured data: Data that does not have a pre-defined data model
E.g.: Text files
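
To make the three categories concrete, the sketch below handles one tiny sample of each using only Python's standard library. The sample records, field names, and variable names are invented for illustration, and CSV text stands in for a database table.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Structured: rows and columns with a fixed schema (CSV standing in for a table).
structured = "id,name,city\n1,Asha,Pune\n2,Ben,Leeds\n"
rows = list(csv.DictReader(io.StringIO(structured)))

# Semi-structured: no fixed table, but tags describe each value.
semi_structured = "<users><user id='1'><name>Asha</name></user><user id='2'/></users>"
root = ET.fromstring(semi_structured)
names = [user.findtext("name") for user in root.findall("user")]

# Unstructured: free text with no data model; meaning has to be extracted.
unstructured = "Asha moved to Pune last year and loves the monsoon."
words = unstructured.lower().rstrip(".").split()

print(rows[0]["city"])  # column access by name
print(names)            # the second user has no <name> element, so None appears
print(len(words))       # a simple word count
```

Note how the structured sample can be queried by column name, the semi-structured sample tolerates a missing element, and the unstructured sample needs extra processing before it yields anything countable.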

Need for Big Data

Following are the reasons why Big Data is needed:

• 90% of the data in the world today has been created in the last two years alone.
• 80% of the data is unstructured or exists in widely varying structures, which are difficult to
analyze.
• Structured formats have some limitations with respect to handling large quantities of data.
• It is difficult to integrate information distributed across multiple systems.
• Most business users do not know what should be analyzed.
• Potentially valuable data is dormant or discarded.
• It is too expensive to justify the integration of large volumes of unstructured data.
• A lot of information has a short, useful lifespan.
• Context adds meaning to the existing information.
Data—The Most Valuable Resource

“In its raw form, oil has little value. Once processed and refined, it helps power the world.”
—Ann Winblad

“Data is the new oil.”


—Clive Humby, CNBC

Big Data and Its Sources

Big data is an all-encompassing term for any collection of data sets so large and
complex that it becomes difficult to process them using on-hand data management
tools or traditional data processing applications.
The sources of Big Data are:
• web logs;
• sensor networks;
• social media;
• internet text and documents;
• internet pages;
• search index data;
• atmospheric science, astronomy, biochemical and medical records;
• scientific research;
• military surveillance; and
• photography archives.

Three Characteristics of Big Data

Big Data has three characteristics: variety, velocity, and volume.

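Variety encompasses managing the complexity of data in many different
structures, ranging from relational data to logs and raw text.

Volume refers to the size of the data sets to be processed.

Velocity refers to the speed at which data arrives and must be processed.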

Characteristics of Big Data Technology

Following are the characteristics of Big Data technology:

• Cost-efficiently processes the growing volume
• Responds to the increasing velocity
• Collectively analyzes the widening variety

Examples:
• Turned 12 terabytes of Tweets created each day into improved product sentiment analysis
• Converted 350 billion annual meter readings to better predict power consumption

Characteristics of Big Data Technology

Big data is high-volume, high-velocity and high-variety information assets that demand
cost-effective, innovative forms of information processing for enhanced insight and
decision making.
Source: Gartner

Appeal of Big Data Technology

Big Data technology is appealing for the following reasons:

• It helps to manage and process a huge
amount of data cost efficiently.
• It analyzes data in its native form, which
may be unstructured, structured, or streaming.
• It captures data from fast-happening
events in real time.
• It can handle failure of isolated nodes
and tasks assigned to such nodes.
• It can turn data into actionable insights.
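
The point about handling the failure of isolated nodes can be illustrated with a toy scheduler that reassigns a task when its node is down. This is only a sketch of the general idea, not how any particular Big Data product schedules work; the node names, the `run_on_node` helper, and the set of failed nodes are all hypothetical.

```python
def run_on_node(node: str, task: str, failed_nodes: set) -> str:
    """Pretend to run a task; raise if the chosen node is down."""
    if node in failed_nodes:
        raise ConnectionError(f"{node} is unavailable")
    return f"{task} done on {node}"


def run_with_failover(task: str, nodes: list, failed_nodes: set) -> str:
    """Try each node in turn until one completes the task."""
    for node in nodes:
        try:
            return run_on_node(node, task, failed_nodes)
        except ConnectionError:
            continue  # isolated failure: move on to the next node
    raise RuntimeError(f"no healthy node available for {task}")


result = run_with_failover("count-words", ["node-1", "node-2", "node-3"], {"node-1"})
print(result)  # prints "count-words done on node-2"
```

The task still completes even though the first node is unavailable; the job fails only when every candidate node is down.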

Leveraging Multiple Sources of Data

Big Data technology enables IT to leverage multiple sources of data. Following are
some of the sources:

App Data
• High volume
• Structured
• High throughput

Machine Data
• High velocity
• Semi-structured
• Ingestion at a high speed

Social Data
• Variety
• Highly unstructured
• Veracity

Enterprise Data
• Variety
• Highly unstructured
• High volume

Traditional IT Analytics Approach

The following are the requirements of the traditional IT analytics approach and
the factors that challenge them:

Requirements
• The business team needs to define questions before IT development.
• They need to define data sources and structures.

Challenging factors
• The requirements are iterative and volatile.
• The data sources keep changing.

Traditional IT Analytics Approach

In a typical scenario of traditional IT systems development, the requirements are defined,
followed by solution design and build. Once the solution is implemented, queries are
executed. If there are new requirements or queries, the system is redesigned and rebuilt.

Cycle: Define requirements → Design solution → Execute queries → Redesign and rebuild for new requirements → Define requirements
Big Data Technology—Platform for Discovery and Exploration

The following are the requirements for using Big Data technology as a platform for
discovery and exploration, and the challenges that it addresses:

Requirements
• The business team needs to define data sources.
• They need to establish the hypothesis.

Challenging factors
• The technology should enable explorative analysis.
• Data systems and sources need to be integrated as required.

Big Data Technology—Platform for Discovery and Exploration

IT systems built with the help of Big Data technology follow an iterative cycle:

Identify data sources → Create a platform for creative exploration of available data and content → Determine questions to ask and test hypothesis → New questions lead to addition of data sources and integration → Identify data sources
Big Data Technology—Capabilities

Following are the capabilities of Big Data technology:

Big Data—Use Cases

The use cases of Big Data Hadoop are given below.

Handling Limitations of Big Data

The following are the challenges that need to be addressed by Big Data technology:

How to handle the system uptime and downtime
• Using commodity hardware for data storage and analysis
• Maintaining a copy of the same data across clusters

How to combine data accumulated from all systems
• Analyzing data across different machines
• Merging of data
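
The challenge of analyzing data on different machines and then merging the results can be sketched in plain Python. Each inner list below stands in for the log lines held by one machine; the sample data and function names are made up for illustration, and a real cluster would run the per-machine step in parallel rather than in a loop.

```python
from collections import Counter


def analyze_locally(lines):
    """Count HTTP status codes in the lines held by one machine."""
    return Counter(line.split()[-1] for line in lines)


def merge_results(partials):
    """Combine the per-machine counts into one overall result."""
    total = Counter()
    for partial in partials:
        total.update(partial)
    return total


machine_data = [
    ["GET /home 200", "GET /cart 500"],
    ["GET /home 200"],
    ["POST /pay 200", "GET /old 404"],
]
partials = [analyze_locally(lines) for lines in machine_data]
totals = merge_results(partials)
print(totals["200"], totals["500"], totals["404"])  # prints "3 1 1"
```

Only the small per-machine summaries are combined in the merge step, so the raw data never has to be moved to a single place.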

Introduction to Hadoop

The following are the facts related to Hadoop and why it is required:

What is Hadoop?
• A free, Java-based programming framework that supports the processing of large data sets
in a distributed computing environment.
• Based on Google File System (GFS)

Why Hadoop?
• Runs a number of applications on distributed systems with thousands of nodes involving
petabytes of data
• Has a distributed file system, called Hadoop Distributed File System or HDFS, which
enables fast data transfer among the nodes
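
To give a feel for how a distributed file system can spread one large file over many nodes, here is a toy placement routine. It is not HDFS code and does not use any Hadoop API: the block size, replica count, node names, and round-robin placement are simplifying assumptions chosen for illustration.

```python
def split_into_blocks(file_size: int, block_size: int) -> list:
    """Return the size of each block needed to hold the file."""
    full, remainder = divmod(file_size, block_size)
    return [block_size] * full + ([remainder] if remainder else [])


def place_blocks(num_blocks: int, nodes: list, replicas: int) -> dict:
    """Assign each block to `replicas` distinct nodes, round-robin."""
    if replicas > len(nodes):
        raise ValueError("need at least as many nodes as replicas")
    return {
        block: [nodes[(block + r) % len(nodes)] for r in range(replicas)]
        for block in range(num_blocks)
    }


# Sizes are in MB; both values are assumptions for this example.
blocks = split_into_blocks(file_size=300, block_size=128)
placement = place_blocks(len(blocks), ["node-1", "node-2", "node-3", "node-4"], replicas=3)
print(blocks)        # prints "[128, 128, 44]"
print(placement[0])  # prints "['node-1', 'node-2', 'node-3']"
```

Because every block is stored on several nodes in this sketch, losing a single node does not lose any part of the file.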

History and Milestones of Hadoop

Hadoop originated from the Nutch open-source project on search engines
and works over distributed network nodes.

Hadoop Milestones

Organizations Using Hadoop

The following table shows how various organizations use Hadoop:

Quiz

1. Which type of data is handled by Hadoop?

a. Structured data
b. Flexible-structure data
c. Semi-structured data
d. Unstructured data
Answer: d.
Explanation: Hadoop is built to process unstructured data, which traditional tabular
systems handle poorly; it can also store and process structured and semi-structured data.

2. Which of the following is unstructured data?

a. Collection of text files
b. Collection of tickets
c. Collection of XML files
d. Collection of tables in databases
Answer: a.
Explanation: Text files are included in the category of unstructured data.

3. Which of the following is structured data?

a. Collection of text files
b. Collection of XML files
c. Collection of tickets
d. Collection of tables in databases
Answer: d.
Explanation: Databases are sources of highly structured data.

4. Which of the following is semi-structured data?

a. Collection of tables in databases
b. Collection of XML files
c. Collection of text files
d. Collection of tickets
Answer: b.
Explanation: XML files are included in the category of semi-structured data.

5. Which of the following aspects of Big Data refers to data size?

a. Volume
b. Value
c. Velocity
d. Variety
Answer: a.
Explanation: Volume in Big Data refers to the size of the data set to be processed.

6. Which of the following aspects of Big Data refers to the speed of response to
data requests generated by the user?

a. Variety
b. Volume
c. Value
d. Velocity
Answer: d.
Explanation: Velocity in Big Data refers to the speed of response to data requests
generated by the user.

7. Which of the following aspects of Big Data refers to multiple data sources?

a. Variety
b. Velocity
c. Value
d. Volume
Answer: a.
Explanation: Variety in Big Data refers to multiple data sources.

Summary

Let us summarize the topics covered in this lesson:

• Big Data is a term applied to data sets that cannot be captured, managed, and
processed within a tolerable, specified time frame by commonly used software tools.
• Big Data is characterized by volume, velocity, and variety.
• Data can be divided into three types: unstructured data, semi-structured data, and
structured data.
• Big Data technology understands and navigates big data sources, analyzes
unstructured data, and ingests data at a high speed.
• Hadoop is a free, Java-based programming framework that supports the processing
of large data sets in a distributed computing environment.