Big Data Overview
MSCS
2nd semester
Content
Introduction
What is Big Data
Characteristics of Big Data
Storing, Selecting and Processing of Big Data
Why Big Data
How it is Different
Hive
Pig
Flume
Introduction
Big Data may well be the Next Big Thing in the IT world.
The first organizations to embrace it were online and startup firms. Firms like
Google, eBay, LinkedIn, and Facebook were built around big data from the
beginning.
Big data burst upon the scene in the first decade of the 21st century.
What is BIG DATA?
‘Big Data’ is similar to ‘small data’, but bigger in size.
A Boeing 737 will generate 240 terabytes of flight data during a single
flight across the US.
Velocity
Big Data isn't just numbers, dates, and strings. Big Data is also geospatial
data, 3D data, audio and video, and unstructured text, including log files
and social media.
Increase of processing.
Metastore - Hive uses a database server of its choice to store the schema,
or metadata, of tables and databases: the columns in a table, their data
types, and the HDFS mapping.
Architecture Of Hive(Cont..)
HiveQL Process Engine - HiveQL is similar to SQL and queries the schema
information in the Metastore. It is one of the replacements for the
traditional approach of writing a MapReduce program.
HDFS or HBase - The Hadoop Distributed File System or HBase is the data
storage layer in which the data is kept.
Working of Hive
Get Plan - The driver takes the help of the query compiler, which parses
the query to check the syntax and the query plan, or the requirements of
the query.
Send Plan - The compiler checks the requirements and sends the plan back to
the driver. At this point, the parsing and compiling of the query is complete.
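The Get Plan and Send Plan steps can be pictured with a small Python toy model. This is only a sketch, not Hive's actual code: the table `page_views`, its columns, the HDFS path, and the functions `compile_query` and `driver` are all hypothetical, and the "compiler" understands only one trivial query shape.

```python
# Toy model of the Hive query flow; every name here is hypothetical.
METASTORE = {
    "page_views": {
        "columns": {"user_id": "string", "url": "string", "ts": "bigint"},
        "location": "/warehouse/page_views",  # made-up HDFS path
    }
}

def compile_query(query, metastore):
    """Stand-in for the compiler: check syntax, consult the metastore, build a plan."""
    tokens = query.strip().rstrip(";").split()
    # Syntax check: only 'SELECT <column> FROM <table>' is accepted in this sketch.
    if len(tokens) != 4 or tokens[0].upper() != "SELECT" or tokens[2].upper() != "FROM":
        raise ValueError("only 'SELECT <column> FROM <table>' is supported here")
    column, table = tokens[1], tokens[3]
    # Requirement check against the schema held in the metastore.
    if table not in metastore:
        raise LookupError(f"unknown table: {table}")
    if column != "*" and column not in metastore[table]["columns"]:
        raise LookupError(f"unknown column: {column}")
    return {"scan": metastore[table]["location"], "project": column}

def driver(query):
    """Stand-in for the driver: 'Get Plan' is the call, 'Send Plan' is the return value."""
    return compile_query(query, METASTORE)

plan = driver("SELECT url FROM page_views;")
print(plan)
```

A query against a table or column that the metastore does not know fails in the compile step, before any plan reaches the driver.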
Developed by Yahoo!
Pig Component
Keywords
Load, Filter, Foreach Generate, Group By, Store, Join, Distinct, Order By,…
Aggregations
Count, Avg, Sum, Max, Min
Schema
Defined at query time, not when files are loaded
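To make the keywords concrete, here is a rough plain-Python analogue of a Pig data flow (Load, Filter, Group By, then a Count aggregation). It is an illustration only, not Pig Latin: the log lines and the field names `user`, `url` and `status` are invented. Note that the schema is applied inside `load`, when the data is read for the query, which mirrors the point about schemas above.

```python
from collections import defaultdict

# Made-up tab-separated log lines standing in for a file Pig would load.
raw_lines = [
    "alice\t/home\t200",
    "bob\t/home\t404",
    "alice\t/cart\t200",
    "carol\t/home\t200",
]

def load(lines):
    """Load: the schema (user, url, status:int) is applied while reading."""
    for line in lines:
        user, url, status = line.split("\t")
        yield {"user": user, "url": url, "status": int(status)}

# Filter: keep only successful requests.
ok = [row for row in load(raw_lines) if row["status"] == 200]

# Group By: collect rows under their url.
groups = defaultdict(list)
for row in ok:
    groups[row["url"]].append(row)

# Foreach Generate with Count: one (url, count) pair per group.
counts = {url: len(rows) for url, rows in groups.items()}
print(counts)
```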
Flume
What is Flume?
Flume Event
An event is the basic unit of the data transported inside Flume.
Flume Agent.
Take a look at the following illustration. It shows the internal components of an
agent and how they collaborate with each other.
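In Flume, an agent is usually described as three cooperating parts: a source that receives events, a channel that buffers them, and a sink that delivers them to the store. An event carries a byte payload plus optional headers. The Python sketch below imitates that hand-off in memory. It is not the Flume API: the class `Event`, the functions `source` and `sink`, the host name `web01`, and the sample log lines are all hypothetical.

```python
from dataclasses import dataclass, field
from queue import Queue

@dataclass
class Event:
    """Basic unit of transported data: a byte payload plus optional headers."""
    body: bytes
    headers: dict = field(default_factory=dict)

def source(lines, channel):
    """Source: wrap each incoming log line in an event and put it on the channel."""
    for line in lines:
        channel.put(Event(body=line.encode("utf-8"), headers={"host": "web01"}))

def sink(channel, store):
    """Sink: drain the channel and write each event body to the store."""
    while not channel.empty():
        event = channel.get()
        store.append(event.body.decode("utf-8"))

channel = Queue()   # the channel buffers events between source and sink
store = []          # stands in for a centralized store such as HDFS
source(["GET /home 200", "GET /cart 404"], channel)
sink(channel, store)
print(store)
```

Because the channel sits between the two ends, the source and the sink do not have to run at the same pace.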
Application of Flume
To do so, they would need to move the available log data into Hadoop for
analysis. Here, Apache Flume comes to our rescue.
Flume is used to move the log data generated by application servers into
HDFS at a higher speed.
Features of Flume
Flume ingests log data from multiple web servers into a centralized store.
Using Flume, we can get the data from multiple servers immediately into
Hadoop.
Using Apache Flume, we can store the data in any of the centralized
stores (HBase, HDFS).