Assignment questions BDA Lec 6
Ans:
Volume: Huge volumes of data are generated daily from many sources, such as business processes, machines, social media platforms, networks, human interactions, etc.
Eg: Each day Facebook generates approximately a billion messages, the "Like" button is recorded about 4.5 billion times, and more than 350 million new posts are uploaded.
Variety: Big Data can be structured, unstructured, or semi-structured, and it is collected from different sources. In the past, data was collected only from databases and spreadsheets; these days it comes in an array of forms, such as PDFs, emails, audio, social media posts, photos, videos, etc.
Oozie
Apache Oozie is an open-source Java web application for workflow scheduling in a distributed cluster. It combines multiple jobs into a single unit of work and supports various job types such as Hive, MapReduce, Pig, etc. (A short submission sketch in Java follows the list below.)
There are three types of Oozie jobs.
• Oozie Workflow jobs: These are Directed
Acyclic Graphs (DAGs) which specify a
sequence of actions to be executed.
• Oozie Coordinator jobs: These are recurrent
Oozie Workflow jobs triggered by time and data
availability.
• Oozie Bundle: It provides a way to package
multiple coordinators and workflow jobs and
manage the lifecycle of those jobs.
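The sketch below shows, in broad strokes, how a workflow job might be submitted from Java using the Oozie client API. The Oozie URL, HDFS paths and property values are illustrative assumptions, not part of the original notes.

import java.util.Properties;
import org.apache.oozie.client.OozieClient;

public class OozieSubmit {
    public static void main(String[] args) throws Exception {
        // Point the client at the Oozie server (URL is illustrative).
        OozieClient oozie = new OozieClient("http://localhost:11000/oozie");

        // A workflow job is described by a set of properties, the most important
        // being the HDFS path of the workflow application (workflow.xml and its libs).
        Properties conf = oozie.createConfiguration();
        conf.setProperty(OozieClient.APP_PATH, "hdfs://localhost:8020/user/demo/my-wf");
        conf.setProperty("nameNode", "hdfs://localhost:8020");
        conf.setProperty("jobTracker", "localhost:8032");

        // Submit and start the workflow; Oozie returns a job id that can be polled.
        String jobId = oozie.run(conf);
        System.out.println("Workflow job submitted: " + jobId);
        System.out.println("Status: " + oozie.getJobInfo(jobId).getStatus());
    }
}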
Data Processing and Analysis
This could be thought of as the nervous system of Big
Data architecture. Map Reduce, another core
component of Hadoop, is primarily responsible for
data processing. We will also discuss other software
libraries that take part in data processing and analysis
tasks.
Map Reduce
MapReduce is responsible for processing huge amounts of data in a parallel, distributed manner. It has two different jobs: Map and Reduce, and as the names suggest, Map always precedes Reduce. In the Map stage, the data is processed and converted into key-value pairs (tuples). The output of the map job is fed to the reducer as input. Before being sent to the reducer, the intermediate data is sorted and organised; the reducer then aggregates the key-value pairs to produce a smaller set of outputs. The final data is then stored in HDFS.
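As an illustration of the Map and Reduce stages described above, here is a minimal word-count sketch in Java; the class names and input/output paths are illustrative assumptions.

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Map stage: convert each line into (word, 1) key-value pairs.
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reduce stage: sorted (word, [1, 1, ...]) groups are aggregated into counts.
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path("/user/demo/input"));    // illustrative path
        FileOutputFormat.setOutputPath(job, new Path("/user/demo/output")); // illustrative path
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}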
Just like HDFS, MapReduce follows a master-slave design to accomplish tasks. The name node (master) hosts a Job Tracker, which divides and tracks the jobs submitted by clients. Each job is then distributed among the data nodes. These data nodes house Task Trackers, which periodically send a heartbeat indicating that the node is alive; this is how the Job Tracker tracks the entire process. In case of a data node failure, the Job Tracker assigns the job to another node, thus making the system fault-tolerant.
Pig
Yahoo developed Apache Pig to analyse large amounts of data. This is what MapReduce does too, but one fundamental problem with MapReduce is that it takes a lot of code to perform the intended jobs, and this is the primary reason Pig was developed. It has two significant components: Pig Latin and the Pig engine.
Pig Latin is a high-level language used to perform analysis tasks; 10 lines of Pig Latin code can achieve the same task as 200 lines of MapReduce code. Pig code is internally converted into MapReduce jobs by the Pig engine, making the entire process easier. The Pig Latin language is similar to SQL.
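As a rough illustration of how the Pig engine turns Pig Latin into jobs, the sketch below embeds two Pig Latin statements in a Java program through the PigServer API. The file name, field names and local execution mode are assumptions made only for illustration.

import org.apache.pig.ExecType;
import org.apache.pig.PigServer;

public class PigExample {
    public static void main(String[] args) throws Exception {
        // ExecType.MAPREDUCE would run against a Hadoop cluster; LOCAL is used here for illustration.
        PigServer pig = new PigServer(ExecType.LOCAL);

        // Each registerQuery call adds one Pig Latin statement to the logical plan.
        pig.registerQuery("records = LOAD 'input.txt' AS (name:chararray, score:int);");
        pig.registerQuery("high = FILTER records BY score > 50;");

        // store() triggers execution; the Pig engine compiles the plan into jobs.
        pig.store("high", "high_scores_out");

        pig.shutdown();
    }
}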
Spark
One of the critical concerns with MapReduce was that running a job takes a sequential, multi-step process: the data has to be read from the cluster, the operation performed, and the results written back to the nodes. Thus, MapReduce jobs have high latency, making them inefficient for real-time analytics.
To overcome these shortcomings, Spark was developed. The key features that set Apache Spark apart from MapReduce are its in-memory computation capability and the reusability of data across parallel operations. This makes it almost 100 times faster than Hadoop MapReduce for large-scale data processing.
The Spark framework includes Spark Core, Spark SQL, MLlib, Streaming and GraphX.
• Spark Core: This is responsible for memory management, scheduling, distributing and monitoring jobs, and fault recovery, and it interacts with storage systems. It can be accessed from different programming languages such as Java, Scala, Python and R via APIs.
• MLlib: A library of machine learning algorithms for regression, classification, clustering, etc.
• Streaming: It helps ingest real-time data from sources such as Kafka, Twitter, and Flume in mini-batches and perform real-time analytics on it using the same code written for batch analytics.
• Spark SQL: Distributed querying engine that
provides highly optimised queries up to 100x
faster than map-reduce. It supports various data
sources out-of-the-box including Hive,
Cassandra, HDFS etc.
• GraphX: A distributed graph processing framework that provides ETL, graph computation and exploratory analysis at scale.
Spark is an ecosystem in itself. It has its own cluster manager (the standalone manager), Spark SQL for accessing data, Streaming for batch and real-time data processing, etc. Honestly, it deserves an article in itself.
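A minimal Java sketch of Spark's in-memory, parallel processing is given below; the local master setting and the HDFS paths are illustrative assumptions.

import java.util.Arrays;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.SparkSession;
import scala.Tuple2;

public class SparkWordCount {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("WordCount")
                .master("local[*]")          // run locally for illustration
                .getOrCreate();
        JavaSparkContext sc = new JavaSparkContext(spark.sparkContext());

        JavaRDD<String> lines = sc.textFile("hdfs:///data/input.txt"); // illustrative path
        JavaRDD<String> words = lines.flatMap(l -> Arrays.asList(l.split("\\s+")).iterator());

        // The intermediate data stays in memory across these parallel operations,
        // which is what gives Spark its speed advantage over MapReduce.
        words.mapToPair(w -> new Tuple2<>(w, 1))
             .reduceByKey(Integer::sum)
             .saveAsTextFile("hdfs:///data/output"); // illustrative path

        spark.stop();
    }
}

Because the RDDs are reused in memory, the flatMap, mapToPair and reduceByKey steps avoid the repeated disk reads and writes that slow down MapReduce.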
Data Access
Once the data is ingested from different sources and stored in cluster nodes, the next step is to retrieve the right data for our needs. There are several software tools that help us access the data efficiently as and when needed.
Hive
Hive is a data warehousing tool designed to work with
voluminous data, and it works on top of HDFS and
Map Reduce. The Hive query language is similar to
SQL, making it user-friendly. Hive queries internally get converted into MapReduce or Spark jobs, which run on Hadoop's distributed cluster of nodes.
Impala
Apache Impala is an open-source data warehouse tool
for querying high-volume data. Syntactically it is similar to HQL but provides highly optimised, faster queries than Hive. Unlike Hive, it is not dependent on MapReduce; instead, it has its own engine, which stores intermediate results in memory, thus providing faster query execution. It can easily be integrated with HDFS, HBase and Amazon S3. As Impala is similar to SQL, the learning curve is not very steep.
Hue
Hue is an open-source web interface for Hadoop components, developed by Cloudera. It provides an easy interface to interact with Hive data stores, manage HDFS files and directories, and track map-reduce jobs and Oozie workflows. If you are not a fan of the Command Line Interface, this is the right tool for interacting with the various Hadoop components.
Zookeeper
Apache ZooKeeper is another essential member of the Hadoop family, responsible for cross-node synchronisation and coordination. Hadoop applications often need cross-cluster services, and deploying ZooKeeper takes care of this need. Applications create a znode within ZooKeeper and can synchronise their tasks across the distributed cluster by updating their status in that znode. ZooKeeper can then relay information about a specific node's status change to the other nodes.
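A minimal sketch of the znode idea in Java is given below, assuming a ZooKeeper ensemble at localhost:2181 and an already existing parent path /app/workers (both illustrative).

import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class WorkerStatus {
    public static void main(String[] args) throws Exception {
        // Connect to the ZooKeeper ensemble; the watcher simply logs connection events.
        ZooKeeper zk = new ZooKeeper("localhost:2181", 3000,
                event -> System.out.println("ZooKeeper event: " + event));

        // An ephemeral znode disappears automatically if this worker dies,
        // which is how other nodes learn about a status change.
        String path = zk.create("/app/workers/worker-",
                "alive".getBytes(),
                ZooDefs.Ids.OPEN_ACL_UNSAFE,
                CreateMode.EPHEMERAL_SEQUENTIAL);
        System.out.println("Registered at " + path);

        // Update the status later; version -1 means "any version".
        zk.setData(path, "busy".getBytes(), -1);

        zk.close();
    }
}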
Assignment II
Q1. Explain the concept of HDFS and its architectural design. What is HDFS and what are its main components? (Name node and Data node)
1. Ans: Blocks: A block is the minimum amount of data that HDFS can read or write. HDFS blocks are 128 MB by default, and this is configurable. Files in HDFS are broken into block-sized chunks, which are stored as independent units.
2. Name Node: The name node acts as the master. It is the controller and manager of HDFS, as it knows the status and the metadata of all the files in HDFS, the metadata being file permissions, names and the location of each block. The metadata is small, so it is stored in the memory of the name node, allowing faster access to the data.
3. Data Node: Data nodes store and retrieve blocks as requested by the client or the name node. They report back to the name node periodically with the list of blocks they are storing. Being commodity hardware, the data node also carries out block creation, deletion and replication as instructed by the name node. (A short Java sketch of interacting with HDFS follows this list.)
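Below is a minimal sketch of a client talking to HDFS through the Java FileSystem API; the paths are illustrative, and the cluster settings are assumed to come from the usual configuration files (core-site.xml, hdfs-site.xml) on the classpath.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration(); // picks up fs.defaultFS, dfs.blocksize, etc.
        FileSystem fs = FileSystem.get(conf);

        // Write a small file; the name node records its metadata,
        // while the data nodes store the actual blocks.
        Path file = new Path("/user/demo/hello.txt"); // illustrative path
        try (FSDataOutputStream out = fs.create(file, true)) {
            out.writeUTF("Hello HDFS");
        }

        // Ask the name node for the file's metadata.
        FileStatus status = fs.getFileStatus(file);
        System.out.println("Block size:  " + status.getBlockSize());
        System.out.println("Replication: " + status.getReplication());

        fs.close();
    }
}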
Q2. Enlist the commands used in HDFS?
Ans: Some commonly used HDFS shell commands are listed below.
Command: Description
hdfs dfs -ls <path>: Lists the files and directories at the given path
hdfs dfs -mkdir <path>: Creates a directory in HDFS
hdfs dfs -put <local file> <path>: Copies a file from the local file system into HDFS
hdfs dfs -get <path> <local file>: Copies a file from HDFS to the local file system
hdfs dfs -cat <path>: Displays the contents of a file in HDFS
hdfs dfs -cp <src> <dest>: Copies a file within HDFS
hdfs dfs -rm <path>: Deletes a file from HDFS
hdfs dfsadmin -report: Reports basic file system information and statistics
2. Serialization in Hadoop:
Serialization is the process of converting an object into
a byte stream to store it or transmit it to another
system, where it can later be deserialized into its
original form. Hadoop relies on serialization to
efficiently process and move data across its distributed
environment.
• Data Exchange: In a distributed system, data
must be exchanged between nodes (across
mappers, reducers, and data nodes). Serialization
allows this data to be efficiently encoded and
transferred.
• Storage: Data that is stored in HDFS or processed
by MapReduce jobs needs to be serialized to
ensure that it can be efficiently written to and read
from disk.
Serialization Frameworks in Hadoop:
Writable: Hadoop's native serialization format,
optimized for speed and compatibility with the
Hadoop ecosystem.
Advantages: Lightweight, fast, and well-
suited for Hadoop’s distributed nature.
Disadvantage: Limited to Hadoop and Java
environments; less portable.
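A minimal sketch of a custom Writable type is shown below; the class name and fields are illustrative.

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.Writable;

public class TemperatureRecord implements Writable {
    private int year;
    private double temperature;

    // A no-argument constructor is required so Hadoop can instantiate the type.
    public TemperatureRecord() {}

    public TemperatureRecord(int year, double temperature) {
        this.year = year;
        this.temperature = temperature;
    }

    @Override
    public void write(DataOutput out) throws IOException {
        // Serialize the fields in a fixed order.
        out.writeInt(year);
        out.writeDouble(temperature);
    }

    @Override
    public void readFields(DataInput in) throws IOException {
        // Deserialize the fields in the same order they were written.
        year = in.readInt();
        temperature = in.readDouble();
    }
}

The fields must be read back in exactly the order they were written, which is part of what keeps Writable serialization compact and fast.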
❖ Difference between compression and serialization: Serialization converts an in-memory object into a byte stream so that it can be stored or transmitted and later reconstructed, whereas compression reduces the number of bytes needed to represent data so that it occupies less disk space and network bandwidth. In Hadoop the two are often combined: records are first serialized (for example as Writables) and the resulting byte streams are then compressed with a codec such as gzip or Snappy.
3. KeyValueTextInputFormat:
It is comparable to TextInputFormat: each line of input is treated as a separate record by this InputFormat. While TextInputFormat treats the entire line as the value, KeyValueTextInputFormat splits the line into key and value at a tab character ('\t'). Hence:
• Key: Everything up to the tab character.
• Value: The remainder of the line after the tab character.
4. SequenceFileInputFormat:
It is an input format for reading sequence files. Sequence files are binary files that store sequences of binary key-value pairs. They are block-compressed and support direct serialization and deserialization of a variety of data types. Hence both key and value are user-defined.
5. SequenceFileAsTextInputFormat:
It is a subtype of SequenceFileInputFormat. This format converts the sequence file's keys and values to Text objects by calling toString() on them, thereby turning sequence files into text-based input suitable for streaming.
6. NLineInputFormat
It is a variant of TextInputFormat in which, likewise, the keys are the lines' byte offsets and the values are the lines' contents. With TextInputFormat and KeyValueTextInputFormat, the number of lines each mapper receives depends on the size of the split and the length of the lines. If we want each mapper to receive a fixed number of lines of input, we use NLineInputFormat (a short driver sketch after the DBInputFormat item below shows how it is configured).
N is the number of lines of input received by each mapper. By default, each mapper receives exactly one line of input (N=1).
Assuming N=2, each split has two lines. As a
result, the first two Key-Value pairs are distributed
to one mapper. The second two key-value pairs are
given to another mapper.
7. DBInputFormat
Using JDBC, this InputFormat reads data from a relational database. It is typically used to load small datasets, which can then be joined with huge datasets from HDFS using multiple inputs. Hence:
• Key: LongWritable
• Value: DBWritable
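Tying the input formats above together, the sketch below shows how a driver might select KeyValueTextInputFormat or NLineInputFormat on a job; the job name, paths and N value are illustrative assumptions.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.KeyValueTextInputFormat;
import org.apache.hadoop.mapreduce.lib.input.NLineInputFormat;

public class InputFormatDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "input-format-demo");
        job.setJarByClass(InputFormatDriver.class);

        // Option A: split each line into key/value at the tab character.
        job.setInputFormatClass(KeyValueTextInputFormat.class);

        // Option B: give each mapper exactly N lines (here N = 2).
        // job.setInputFormatClass(NLineInputFormat.class);
        // NLineInputFormat.setNumLinesPerSplit(job, 2);

        FileInputFormat.addInputPath(job, new Path("/user/demo/input")); // illustrative path
        // ... mapper, reducer and output settings would follow ...
    }
}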
Output Format in MapReduce
The output format classes work in the opposite direction to their corresponding input format classes. TextOutputFormat, for example, is the default output format; it writes records as plain text files. Keys and values can be of any type, since they are converted to strings by calling the toString() method. A tab character separates the key and the value, but this can be changed by modifying the separator attribute of the text output format.
SequenceFileOutputFormat is used to write sequences of binary output to a file. Binary outputs are especially valuable if they are used as input to another MapReduce job.
DBOutputFormat handles the output formats for relational databases and HBase. It sends the reduce output to a SQL table.
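A corresponding sketch for the output side is given below, showing how the text output separator can be changed and how SequenceFileOutputFormat can be selected; the property value, class names and paths are illustrative assumptions.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;

public class OutputFormatDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Change TextOutputFormat's default tab separator (the "separator attribute"
        // mentioned above) to a comma.
        conf.set("mapreduce.output.textoutputformat.separator", ",");

        Job job = Job.getInstance(conf, "output-format-demo");
        job.setJarByClass(OutputFormatDriver.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        // The default is TextOutputFormat; switch to a binary sequence file instead,
        // which is useful when the output feeds another MapReduce job.
        job.setOutputFormatClass(SequenceFileOutputFormat.class);

        FileOutputFormat.setOutputPath(job, new Path("/user/demo/output")); // illustrative path
        // ... input and mapper/reducer settings would follow ...
    }
}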
Assignment IV
Q1. Explain the new feature of name node with
high availability in Hadoop 2.0.
Ans: Hadoop 2.0's Name Node High Availability
feature addresses the single point of failure problem in
Hadoop clusters by introducing a second Name Node:
1. Active Name Node: Handles all client operations
in the cluster
2. Passive (Standby) Name Node: A standby Name Node that maintains an up-to-date copy of the active Name Node's metadata
3. Automatic failover: If the active Name Node fails, the cluster can automatically switch to the standby Name Node (a Hadoop administrator can also trigger a manual failover)
4. Hot standby: The passive Name Node maintains
enough state to provide a fast failover
5. Replication factor: Data is replicated across
multiple nodes, and the replication factor can be
configured by the administrator
Other features of Hadoop 2.0 include:
1. YARN, which can process terabytes and petabytes
of data
2. Splitting the JobTracker's roles into the per-application ApplicationMaster and the global ResourceManager
3. HDFS Federation, which increases the horizontal
scalability of the Name Node
4. HDFS Snapshot
5. Support for Windows
6. NFS access is also provided.
Hive Client
Hive allows writing applications in various
languages, including Java, Python, and C++.
It supports different types of clients, such as:
o Thrift Server - It is a cross-language service provider platform that serves requests from all those programming languages that support Thrift.
o JDBC Driver - It is used to establish a connection between Hive and Java applications. The JDBC driver is present in the class org.apache.hadoop.hive.jdbc.HiveDriver (a short JDBC example follows this list).
o ODBC Driver - It allows the applications that
support the ODBC protocol to connect to Hive.
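A minimal JDBC sketch is shown below. It assumes a HiveServer2 instance on localhost:10000 and the newer driver class org.apache.hive.jdbc.HiveDriver (rather than the older class named above); the table name and credentials are illustrative.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveJdbcClient {
    public static void main(String[] args) throws Exception {
        // Load the HiveServer2 JDBC driver (assumed to be on the classpath).
        Class.forName("org.apache.hive.jdbc.HiveDriver");

        // HiveServer2 usually listens on port 10000.
        String url = "jdbc:hive2://localhost:10000/default";
        try (Connection conn = DriverManager.getConnection(url, "hive", "");
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery("SELECT COUNT(*) FROM demo_table")) { // illustrative table
            while (rs.next()) {
                System.out.println("Row count: " + rs.getLong(1));
            }
        }
    }
}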
Hive Services
The following are the services provided by
Hive:-
o Hive CLI - The Hive CLI (Command Line
Interface) is a shell where we can execute Hive
queries and commands.
o Hive Web User Interface - The Hive Web UI is
just an alternative of Hive CLI. It provides a web-
based GUI for executing Hive queries and
commands.
o Hive Metastore - It is a central repository that stores all the structural information of the various tables and partitions in the warehouse. It also includes metadata about the columns and their types, the serializers and deserializers used to read and write data, and the corresponding HDFS files where the data is stored.
o Hive Server - It is also referred to as the Apache Thrift Server. It accepts requests from different clients and forwards them to the Hive Driver.
o Hive Driver - It receives queries from different
sources like web UI, CLI, Thrift, and
JDBC/ODBC driver. It transfers the queries to the
compiler.
o Hive Compiler - The purpose of the compiler is to
parse the query and perform semantic analysis on
the different query blocks and expressions. It
converts HiveQL statements into MapReduce
jobs.
o Hive Execution Engine - The optimizer generates the logical plan in the form of a DAG of map-reduce tasks and HDFS tasks. In the end, the execution engine executes these tasks in the order of their dependencies.
Q2. Write the necessary steps for the installation of
Hive.
Ans: To install Hive, you can follow these steps:
System requirements
➢ At least 8 GB of RAM
➢ A quad-core CPU with at least 1.80 GHz
➢ JRE 1.8
➢ Java Development Kit 1.8
➢ Software for unzipping, such as 7-Zip or WinRAR
1. Install Java: Hive is a Java-based tool, so Java must be installed on the server. Check the Hive documentation for the most compatible Java version.
2. Download and untar Hive: Hive can be downloaded from GitHub at https://github.com/apache/hive.
3. Configure Hive environment variables: Set the Hive environment variables (for example HIVE_HOME and its bin directory on the PATH) in .bashrc.
4. Edit the core-site.xml file.
5. Create Hive directories in HDFS: Create the /tmp directory and the /user/hive/warehouse directory.
6. Configure the hive-site.xml file (this step is optional).
7. Initialise the Derby database.
8. Launch the Hive client shell.