Big Data Analytics - Notes
UNIT-5
In-memory computing technology is essential when high-speed data access, low latency, constant data availability,
and a reduced load on traditional databases are required.
CAP theorem: a distributed data store can guarantee at most two of Consistency, Availability, and Partition tolerance at the same time.
SCV for real-time analytics systems: the SCV principle (Speed, Consistency, Volume) adapts the CAP theorem to real-time analytics: such a system can fully deliver at most two of the three.
E.g.: Let's walk through the typical activities of a data scientist in this context, using the challenges of a CSP
(communications service provider) as an example.
0. CSPs notice an increase in complaints about network quality. The business problem here is to
understand why customers are leaving and how to improve service quality to retain
them.
1. A data scientist might hypothesize that customers are leaving due to frequent call
drops or slow internet speeds.
2. To test these hypotheses, data scientists need to gather data. They might collect
customer service logs.
Map: As the name suggests, its main job is to map the input data into key-value pairs. After the map step,
a shuffle and sort phase groups the intermediate data by key, which organizes it for
the next step.
Reduce: In this step, a "reducer" function aggregates or processes the grouped data.
Why Use MapReduce?
MapReduce is designed to work with big data that’s too large for a single computer to handle. By splitting
tasks across many machines, data analysis becomes faster and more efficient.
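As a minimal sketch of the two steps, here is a word-count mapper and reducer written against the classic org.apache.hadoop.mapred API (the same classic API described in the job-run notes below); the class names WordCountMapper and WordCountReducer are illustrative, not from the original notes:

import java.io.IOException;
import java.util.Iterator;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

// Map: emit the key-value pair (word, 1) for every word in the input line.
public class WordCountMapper extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    public void map(LongWritable key, Text value,
                    OutputCollector<Text, IntWritable> output, Reporter reporter)
            throws IOException {
        for (String token : value.toString().split("\\s+")) {
            if (!token.isEmpty()) {
                word.set(token);
                output.collect(word, ONE);
            }
        }
    }
}

// Reduce: sum the counts grouped by word after the shuffle and sort.
class WordCountReducer extends MapReduceBase
        implements Reducer<Text, IntWritable, Text, IntWritable> {
    public void reduce(Text key, Iterator<IntWritable> values,
                       OutputCollector<Text, IntWritable> output, Reporter reporter)
            throws IOException {
        int sum = 0;
        while (values.hasNext()) {
            sum += values.next().get();
        }
        output.collect(key, new IntWritable(sum));  // (word, total count)
    }
}

The shuffle and sort between the two classes is performed by the framework, so the reducer simply receives each word together with all of its counts.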
Anatomy of a MapReduce Job Run:
You can run a MapReduce job with a single line of code: JobClient.runJob(conf).
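For context, a rough sketch of the driver that builds conf and makes that call, reusing the mapper and reducer sketched above (class name and paths are illustrative):

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(WordCountDriver.class);
        conf.setJobName("wordcount");

        // Map and reduce classes from the sketch above.
        conf.setMapperClass(WordCountMapper.class);
        conf.setReducerClass(WordCountReducer.class);
        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(IntWritable.class);

        // Input must exist; the output directory must not already exist.
        FileInputFormat.setInputPaths(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));

        // Submits the job and waits for it to finish.
        JobClient.runJob(conf);
    }
}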
The whole process is illustrated in Figure 6-1. At the highest level, there are four independent entities:
» The client, which submits the MapReduce job.
» The jobtracker, which coordinates the job run. The jobtracker is a Java application whose main class is JobTracker.
» The tasktrackers, which run the tasks that the job has been split into. Tasktrackers are Java applications whose main class is TaskTracker.
» The distributed filesystem, which is used for sharing job files between the other entities.
Job Submission
JobClient: The process starts when a MapReduce job is submitted via the runJob() method of
JobClient. This method initiates the entire job flow.
JobTracker: The JobClient communicates with the JobTracker to get a new job ID.
Resource Distribution: The job's resources are copied to the distributed filesystem.
» The JobClient asks the JobTracker for a new job ID.
» It verifies the input and output paths, raising an error if an input directory is missing or the output directory already exists.
» Job resources (such as the job JAR file) are copied to the distributed filesystem.
» Once all resources are in place, the job is marked as ready for execution, and the JobTracker is notified.
Job Initialization
Once the job reaches the JobTracker:
Queuing & scheduling: The job is placed in an internal queue, from which the JobTracker's job scheduler picks it up and initializes it.
Task Creation: The job is broken down into map and reduce tasks.
o Map Tasks: For each input split, a corresponding map task is created.
o Reduce Tasks: The number of reduce tasks is determined based on the
mapred.reduce.tasks configuration property.
» The input splits are retrieved from HDFS by the JobTracker.
» Tasks are assigned IDs, and the job is ready for execution.
Task Assignment
TaskTrackers: TaskTrackers are responsible for executing the tasks assigned to them by the
JobTracker.
Task Allocation: Based on the available resources, the JobTracker assigns tasks to TaskTrackers.
o Map Tasks: The JobTracker tries to assign map tasks close to the data's location (data
locality).
o Reduce Tasks: The JobTracker assigns reduce tasks, which are less sensitive to data
locality.
» The JobTracker assigns either a map or a reduce task to a TaskTracker based on
its available slots and data locality.
TaskTrackers send heartbeats to indicate they are alive and available for new
tasks.
Task Execution
The necessary job resources (the job JAR and configuration) are retrieved from the distributed filesystem and copied to the TaskTracker's local disk.
A new JVM is launched for each task to ensure task isolation.
Task Isolation: Each map/reduce task runs in its own JVM so that a crash in one task cannot affect other
tasks or the TaskTracker itself.
Task Progress and Status Updates
Tasks periodically report the proportion of their work that has been completed (their progress).
TaskTrackers send heartbeats to the JobTracker at regular intervals, carrying status updates such as
success or failure.
Job Completion
Completion Notification: Once the last task of the job completes, the JobTracker updates the
job's status to “successful.”
Cleanup: The JobTracker cleans up its internal job state and instructs TaskTrackers to delete
intermediate data files.
» On successful completion, the JobClient prints a success message to the console.
» The JobTracker can be configured to send HTTP callbacks upon job completion, informing clients
about the job status.
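As a small sketch of how such a callback can be configured on a classic-API job (conf is assumed to be the job's JobConf, and the URL below is a placeholder):

// Hadoop substitutes $jobId and $jobStatus into the URL before issuing the HTTP call.
conf.set("job.end.notification.url",
         "http://example.com/jobdone?id=$jobId&status=$jobStatus");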
Special Task Handling (Streaming and Pipes)
Streaming: A Streaming task communicates with its external, user-supplied process via standard input and output streams.
Pipes: A Pipes task communicates with its external C++ process using socket connections.
In a classic MapReduce environment, managing failures is critical for ensuring job completion despite
issues with tasks or nodes. Hadoop MapReduce offers a robust failure-handling mechanism, addressing
issues at different levels: Task Failures, TaskTracker Failures, and JobTracker Failures. Here’s a simplified
breakdown of each type.
1. Task Failure
Task failures occur when individual Map or Reduce tasks encounter issues, often due to bugs, runtime
errors, or unexpected exits of the JVM. There are several key points to consider:
Runtime Exceptions: If user code within a map or reduce task throws an exception, the task fails,
and the TaskTracker, which monitors these tasks, marks it as failed. The error logs are available
for review in the user’s logs, helping diagnose the issue.
Non-zero Exit Codes (Streaming Tasks): For tasks using Hadoop Streaming, the task fails if the
external process returns a non-zero exit code. Whether a non-zero exit code is treated as a failure is
controlled by the stream.non.zero.exit.is.failure property.
Sudden JVM Exit: Occasionally, JVMs may crash unexpectedly, causing a task to fail without error
logging. In such cases, the TaskTracker detects the exit and logs it as a failure.
Hanging Tasks: A task is considered “hanging” if it stops making progress. Hadoop detects this if
the task does not report progress within a set time (default is 10 minutes), after which it’s
marked as failed. This timeout can be customized using the mapred.task.timeout property.
When a task fails, it may be retried up to four times (default), and then, if it still fails, it is logged as a
permanent failure, potentially causing the job to fail. The retry limits are adjustable through properties like
mapred.map.max.attempts and mapred.reduce.max.attempts.
2. TaskTracker Failure
A TaskTracker manages multiple task executions and sends periodic heartbeats to the JobTracker to
indicate its status. If a TaskTracker fails due to a crash, network failure, or excessive slowness, the
JobTracker stops receiving its heartbeats and marks it as unavailable.
Heartbeat Monitoring: The JobTracker expects a heartbeat every few seconds. If no heartbeat is
received for a set duration (default is 10 minutes, configured with
mapred.task.tracker.expiry.interval), the TaskTracker is considered failed.
Task Reassignment: When a TaskTracker fails, any in-progress tasks are reassigned to other
TaskTrackers. Completed map tasks may also be rerun if their intermediate data is on the failed
node, ensuring data remains accessible for the reduce phase.
Blacklisting: TaskTrackers with consistently high failure rates can be “blacklisted,” meaning the
JobTracker will avoid assigning tasks to them. Restarting the blacklisted TaskTracker can reset its
status.
3. JobTracker Failure
The JobTracker is a single point of control in classic MapReduce, coordinating all tasks, task assignments,
and monitoring. As the most crucial component, JobTracker failure is the most severe.
Single Point of Failure: The JobTracker has no built-in redundancy. If it fails, the entire job fails,
and all tasks associated with the job must be restarted.
Future Plans for Resilience: In some Hadoop versions, projects like ZooKeeper are mentioned for
potential future management of JobTracker failover by coordinating multiple JobTrackers, though
this is beyond classic MapReduce.
Hadoop provides properties and configurations to manage how failures are detected, reported, and
retried. Here are a few essential configurations:
Timeout Properties: The mapred.task.timeout controls the wait time before a hanging task is
marked failed.
Retry Limits: Properties like mapred.map.max.attempts and mapred.reduce.max.attempts
determine how many times a task can be retried after failure.
Blacklisting and Heartbeat Interval: The JobTracker’s heartbeat frequency
(mapred.task.tracker.expiry.interval) and blacklist feature help avoid using failing TaskTrackers.
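As a sketch of how these properties can be set programmatically on a classic-API job (conf is assumed to be the job's JobConf; the values shown are the usual defaults):

// 10 minutes (in milliseconds) before a hanging task is marked as failed.
conf.setLong("mapred.task.timeout", 10 * 60 * 1000L);
// Maximum attempts per map and reduce task before the task is a permanent failure.
conf.setInt("mapred.map.max.attempts", 4);
conf.setInt("mapred.reduce.max.attempts", 4);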
Benefits of Failure Management in MapReduce
These mechanisms ensure that even with hardware or software issues, most MapReduce jobs can
complete. Task retry mechanisms and data replication across nodes improve reliability, making Hadoop’s
MapReduce framework resilient to common types of failures.
YARN
YARN, introduced in Hadoop 2.0, is a framework that manages resources in a Hadoop cluster and allows
various data processing engines to work on a shared dataset within the Hadoop Distributed File System
(HDFS). It was developed to address the limitations of Hadoop 1.0, particularly the bottleneck created by
JobTracker, by separating resource management from application processing.
Scalability: The Resource Manager’s scheduler allows Hadoop to manage thousands of nodes,
increasing cluster size and resource allocation efficiency.
Compatibility: YARN is backward-compatible with Hadoop 1.0’s MapReduce, allowing older
applications to run without modification.
Cluster Utilization: Dynamic resource allocation optimizes utilization, ensuring balanced resource
usage.
Multi-tenancy: YARN’s multi-engine capability allows diverse processing engines to work
simultaneously, providing organizations flexibility in data processing.
YARN Architecture
YARN’s architecture divides its components into resource management and application processing layers.
Key components include:
1. Client
The Client submits MapReduce or other application jobs to the YARN cluster.
2. Resource Manager
The Resource Manager is the main control authority in YARN, managing and assigning resources.
It has two essential sub-components:
o Scheduler: The Scheduler assigns resources based on applications’ needs and available
cluster resources. It does not handle monitoring or fault tolerance. It supports plugins
like the Capacity Scheduler and Fair Scheduler for flexible resource partitioning.
o Application Manager: This component receives job requests, negotiates initial
containers, and restarts the Application Master if needed.
3. Node Manager
Each node in the Hadoop cluster has a Node Manager, which monitors node health, manages
application resources, and reports the status to the Resource Manager. The Node Manager also
launches containers and executes tasks as directed by the Application Master.
4. Application Master
For each application submitted to YARN, the Application Master is responsible for requesting
resources from the Resource Manager, monitoring the application’s status, and coordinating
execution. It communicates with Node Managers to start containers for task execution.
5. Container
A container is a collection of resources (CPU, memory, and storage) on a node. Containers are
managed by the Node Manager and are essential units of resource allocation in YARN. Each
container is launched based on the requirements specified in the Container Launch Context (CLC),
which provides necessary configurations.
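To make the Container Launch Context concrete, here is a rough sketch built with the YARN records API; the class name ClcExample and the task command are illustrative placeholders, not part of the original notes:

import java.util.Collections;
import java.util.List;
import org.apache.hadoop.yarn.api.records.ApplicationAccessType;
import org.apache.hadoop.yarn.api.records.ContainerLaunchContext;
import org.apache.hadoop.yarn.api.records.LocalResource;

public class ClcExample {
    // Builds a minimal CLC: the local resources to download, the environment
    // variables, and the command the Node Manager should run in the container.
    public static ContainerLaunchContext buildClc() {
        List<String> commands = Collections.singletonList(
                "java -Xmx256m com.example.MyTask 1>stdout 2>stderr"); // hypothetical command
        return ContainerLaunchContext.newInstance(
                Collections.<String, LocalResource>emptyMap(),         // files to localize (e.g. app JAR)
                Collections.<String, String>emptyMap(),                // environment variables
                commands,                                              // commands to execute
                null,                                                  // service data
                null,                                                  // security tokens
                Collections.<ApplicationAccessType, String>emptyMap()  // application ACLs
        );
    }
}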
Workflow in YARN
The application workflow in YARN can be broken down into the following steps (step 7 is sketched in code below):
1. The client submits an application to the Resource Manager.
2. The Resource Manager allocates a container to start the Application Master for that application.
3. The Application Master registers itself with the Resource Manager.
4. The Application Master negotiates further containers from the Resource Manager.
5. The Application Master asks the Node Managers to launch the allocated containers.
6. The application code executes inside the containers.
7. The client contacts the Resource Manager (or the Application Master) to monitor the application's status.
8. When processing is complete, the Application Master unregisters with the Resource Manager.
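As a rough client-side sketch of step 7 (monitoring application status) using the YarnClient API; the class name YarnStatusExample is illustrative:

import org.apache.hadoop.yarn.api.records.ApplicationReport;
import org.apache.hadoop.yarn.client.api.YarnClient;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class YarnStatusExample {
    public static void main(String[] args) throws Exception {
        // Connect to the Resource Manager using the cluster configuration.
        YarnClient yarnClient = YarnClient.createYarnClient();
        yarnClient.init(new YarnConfiguration());
        yarnClient.start();

        // Ask the Resource Manager for a report on every known application.
        for (ApplicationReport report : yarnClient.getApplications()) {
            System.out.printf("%s  %s  %s%n",
                    report.getApplicationId(),
                    report.getName(),
                    report.getYarnApplicationState());
        }

        yarnClient.stop();
    }
}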
Advantages of YARN
YARN’s architecture and workflow offer numerous benefits:
Flexibility: YARN enables multiple processing engines, like Apache Spark and Flink, to work within
the Hadoop ecosystem. This allows simultaneous processing across different frameworks.
Resource Management: Centralized resource management allows administrators to allocate and
monitor resources efficiently, improving resource usage.
Scalability: YARN supports large clusters and can dynamically manage thousands of nodes.
Improved Performance: Centralized scheduling and monitoring of applications improve the
performance and efficiency of task management.
Security: YARN provides security features, including Kerberos authentication and Secure Shell
(SSH) access, ensuring data and application security.
Disadvantages of YARN
Despite its advantages, YARN has some limitations:
Complexity: YARN introduces additional configuration requirements, making it challenging for
users unfamiliar with Hadoop or YARN.
Overhead: Resource management and scheduling introduce overhead, which can reduce
performance.
Conclusion
YARN has transformed Hadoop into a more flexible and efficient resource management system,
supporting a diverse range of processing engines and scalable large clusters. Its robust architecture and
improved resource utilization make it an essential framework for modern big data processing applications,
despite some inherent complexities and overhead.