Big Data Overview


Saad Khan

MSCS
2nd semester
Contents

 Introduction
 What is Big Data?
 Characteristics of Big Data
 Storing, selecting, and processing Big Data
 Why Big Data?
 How is it different?
 Hive
 Pig
 Flume
Introduction

 Big Data may well be the Next Big Thing in the IT world

 The first organizations to embrace it were online and startup firms. Firms like
Google, eBay, LinkedIn, and Facebook were built around big data from the
beginning.

 Big data burst upon the scene in the first decade of the 21st century
What is BIG DATA?
 ‘Big Data’ is similar to ‘small data’, but much bigger in scale.

 The aim is to solve new problems, or old problems in a better way.
What is BIG DATA (Cont..)

 Walmart handles more than 1 million customer transactions every hour.

 Facebook stores more than 40 billion photos from its user base.


Characteristics of Big Data
Volume

 A typical PC might have had 10 gigabytes of storage in 2000.

 Today, Facebook ingests about 500 terabytes of new data every day.

 A single Boeing 737 generates 240 terabytes of flight data during one
flight across the US.
Velocity

 Clickstreams and ad impressions capture user behavior at millions of events
per second.

 High-frequency stock trading algorithms reflect market changes within
microseconds.

 Machine-to-machine processes exchange data between billions of devices.


Variety

 Big Data isn't just numbers, dates, and strings. Big Data is also geospatial
data, 3D data, audio and video, and unstructured text, including log files
and social media.

 Traditional database systems were designed for smaller volumes of
structured data, fewer updates, and a predictable, consistent data structure.
Storing Big Data

 Selecting data sources for analysis

 Eliminating redundant data

 Establishing the role of NoSQL


Selecting Big Data Stores

 Choosing the correct data stores based on your data characteristics.

 Moving code to data.

 Implementing polyglot data store solutions


Processing Big Data

 Mapping data to the programming framework

 Connecting and extracting data from storage.

 Transforming data for processing.


Why Big Data

 Increases in storage capacity

 Increases in processing power

 Availability of data (different data types)


How is big data different?

 Automatically generated by a machine

(e.g. a sensor embedded in an engine)

 Typically an entirely new source of data

(e.g. use of the internet)

 Often not designed to be analysis-friendly

(e.g. text streams)
Hive
What is Hive?

 Hive is a data warehouse infrastructure tool for processing structured data in
Hadoop. It resides on top of Hadoop to summarize Big Data, and makes
querying and analysis easy.

 Hive was initially developed by Facebook; the Apache Software
Foundation later took it up and developed it further as open source under
the name Apache Hive.
Features of Hive

 It stores schema in a database and processed data in HDFS (Hadoop
Distributed File System).

 It is designed for OLAP (online analytical processing).

 It provides an SQL-like query language called HiveQL (HQL).
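
As a sketch of what HiveQL looks like, the statements below create a table over text files in HDFS and run a SQL-style aggregation; the table and column names are hypothetical, not taken from these slides:

```sql
-- Hypothetical table for illustration; stored as plain text files in HDFS.
CREATE TABLE page_views (
    user_id   STRING,
    url       STRING,
    view_time STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE;

-- A familiar SQL-style aggregation; Hive compiles this into MapReduce jobs.
SELECT url, COUNT(*) AS views
FROM page_views
GROUP BY url
ORDER BY views DESC
LIMIT 10;
```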


Architecture Of Hive

 User Interface - Hive is data warehouse infrastructure software that
creates interaction between the user and HDFS. The user interfaces that Hive
supports are the Hive Web UI, the Hive command line, and Hive HD Insight.

 Metastore - Hive chooses respective database servers to store the schema
or metadata of tables, databases, columns in a table, their data types, and
HDFS mappings.
Architecture Of Hive(Cont..)

 HiveQL Process Engine - HiveQL is similar to SQL and queries the schema
information in the Metastore. It is one of the replacements for the traditional
approach of writing MapReduce programs directly.

 HDFS or HBase - The Hadoop Distributed File System or HBase is the data
storage technique used to store data in the file system.
Working of Hive

 Get Plan - The driver takes the help of the query compiler, which parses
the query to check the syntax and build the query plan.

 Get Metadata - The compiler sends a metadata request to the Metastore.

 Send Metadata - The Metastore sends the metadata as a response to the
compiler.
Working of Hive(Cont..)

 Send Plan - The compiler checks the requirements and resends the plan to
the driver. Up to here, the parsing and compiling of the query is complete.

 Execute Plan - The driver sends the execute plan to the
execution engine.
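
The parse/compile/execute flow above can be observed from the Hive shell with the EXPLAIN statement, which prints the plan the compiler produced without running it (the table name here is a hypothetical example):

```sql
-- Show the compiled stage plan for a query instead of executing it.
EXPLAIN
SELECT url, COUNT(*) AS views
FROM page_views
GROUP BY url;
```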
Pig
What is Pig?

 A platform for analyzing large data sets that consists of a high-level


language for expressing data analysis programs

 Compiles down to MapReduce jobs

 Developed by Yahoo!
Pig Components

 Two Main Components.


 High Level Language (Pig Latin)
 Set of Commands
 Two Execution Modes
 Local: reads/writes to the local file system
 MapReduce: connects to a Hadoop cluster and reads/writes to HDFS
Why Pig?

 Common design patterns as keywords (joins, distinct, counts)

 Data flow analysis

 Avoids Java-level errors


Pig Language Features

 Keywords
 LOAD, FILTER, FOREACH … GENERATE, GROUP BY, STORE, JOIN, DISTINCT, ORDER BY, …

 Aggregations
 COUNT, AVG, SUM, MAX, MIN

 Schema
 Defined at query time, not when files are loaded
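
A minimal Pig Latin script using the keywords above; the file path and field names are illustrative assumptions, not from these slides:

```pig
-- LOAD declares the schema at query time (default delimiter is tab).
views = LOAD '/data/page_views.tsv'
        AS (user_id:chararray, url:chararray, view_time:chararray);

-- GROUP plus an aggregation, compiled down to MapReduce jobs.
grouped = GROUP views BY url;
counts  = FOREACH grouped GENERATE group AS url, COUNT(views) AS n;
ordered = ORDER counts BY n DESC;

STORE ordered INTO '/data/url_counts';
```

Such a script can be run with `pig -x local script.pig` against the local file system, or `pig -x mapreduce script.pig` against a Hadoop cluster (the two execution modes above).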
Flume
What is flume?

 Apache Flume is a tool/service/data ingestion mechanism for collecting,
aggregating, and transporting large amounts of streaming data, such as log
files and events, from various sources to a centralized data store.

 Flume is a highly reliable, distributed, and configurable tool. It is principally
designed to copy streaming data (log data) from various web servers to
HDFS.
Flume Architecture

 Flume Event
 An event is the basic unit of the data transported inside Flume.

 Flume Agent
 An agent is a JVM process whose internal components (source, channel, and
sink) collaborate with each other to move events from their origin into a
centralized store.
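
As a sketch, a single-agent Flume configuration wiring a source, channel, and sink together might look like the properties file below; the agent name, paths, and the tailed log file are assumptions for illustration:

```properties
# One agent (a1) with one source, one memory channel, and one HDFS sink.
a1.sources  = r1
a1.channels = c1
a1.sinks    = k1

# Source: tail a web server log as it grows.
a1.sources.r1.type    = exec
a1.sources.r1.command = tail -F /var/log/webapp/access.log

# Channel: buffer events in memory between source and sink.
a1.channels.c1.type     = memory
a1.channels.c1.capacity = 10000

# Sink: write events into HDFS.
a1.sinks.k1.type      = hdfs
a1.sinks.k1.hdfs.path = hdfs://namenode:8020/flume/events

# Wire the components together.
a1.sources.r1.channels = c1
a1.sinks.k1.channel    = c1
```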
Application of Flume

 Assume an e-commerce web application wants to analyze the customer


behavior from a particular region.

 To do so, they would need to move the available log data into Hadoop for
analysis. Here, Apache Flume comes to the rescue.

 Flume is used to move the log data generated by the application servers into
HDFS at high speed.
Features of Flume

 Flume ingests log data from multiple web servers into a centralized store.

 Using Flume, we can get data from multiple servers into Hadoop
immediately.

 Flume supports a large set of source and destination types.

 Flume can be scaled horizontally.


Advantages of flume

 Using Apache Flume, we can store data in any of the centralized
stores (HBase, HDFS).

 Flume provides the feature of contextual routing.

 Flume is reliable, fault-tolerant, scalable, manageable, and customizable.


Any Questions?
