Big Data Analytics - Notes


Big Data Analytics

UNIT-5

In-Memory Computing Technology:


 In-memory computing (IMC) is a method that stores data directly in the system's RAM rather than on disk.
 This approach is beneficial for analytical problems and architecture scenarios where high-speed data access
and manipulation are required.
 A few factors indicate when to choose in-memory technologies:
» Repetitive data access needs: When applications or users need to frequently access and process large
datasets, IMC lets them retrieve data from RAM rather than from disk. Accessing data from RAM is much
faster than from disk, making the process quicker.
» Algorithms that rely on random data access: Some algorithms require quick, random access to various parts
of the data. In these cases, keeping the entire dataset in memory (RAM) speeds up processing and improves
performance. E.g., algorithms like depth-first search (DFS) and breadth-first search (BFS) move back and
forth through the data, which is much faster when the data is in memory rather than on slower disk storage.
» Packaged applications (ERP, SCM, CRM): Enterprise Resource Planning, Supply Chain Management, and
Customer Relationship Management. Some ERP, SCM, and CRM applications include in-memory storage as
part of their design, providing better options to store, manage, and process data faster.
 Let’s start by looking at some scenarios where in-memory is not only preferred but also necessary:
» Your database is too slow for interactive analytics. Problem: Traditional databases, especially those used for
transaction processing (OLTP), may not be fast enough for real-time analytics. Solution: In-memory computing
is useful here because it allows fast "speed-of-thought" data analysis without waiting for slower disk-based
databases.
» You need to take load off a transactional database. Problem: Running analytical queries on a transactional
database can overload it, slowing down crucial operational tasks. Solution: By processing analytics in-memory,
organizations can offload these tasks from the transactional database, maintaining its performance and
reliability for business operations.
» You require always-on analytics. Problem: In areas like logistics, supply chain, fraud detection, and financial
services applications, analytics must be constantly available. Solution: Distributed in-memory computing
environments provide failover support, ensuring that if one node fails, others immediately take over
without any interruption in service.
» You need analysis of big data. Problem: Big data platforms like Hadoop are powerful but are not ideal for real-
time analytics due to high query latency. Solution: Loading a subset of big data into memory allows for rapid
analysis and visualization.

In-memory computing technology is essential when high-speed access, low latency, constant data availability,
and reduced load on traditional databases are required.
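
A minimal sketch of the "repetitive data access" idea above, assuming a hypothetical customers.csv file with a
simple key,value layout (file name and layout are invented for illustration): the file is read from disk once, and
every later lookup is answered from RAM.

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.HashMap;
import java.util.Map;

public class InMemoryLookup {
  // All records live in RAM after the one-time load from disk.
  private final Map<String, String> cache = new HashMap<>();

  public void load(String path) throws IOException {
    for (String line : Files.readAllLines(Paths.get(path))) {
      String[] parts = line.split(",", 2);   // assumed key,value layout
      cache.put(parts[0], parts[1]);
    }
  }

  // Repeated, random-access reads are served entirely from memory.
  public String lookup(String key) {
    return cache.get(key);
  }

  public static void main(String[] args) throws IOException {
    InMemoryLookup store = new InMemoryLookup();
    store.load("customers.csv");               // one slow disk read
    System.out.println(store.lookup("C1001")); // many fast RAM reads afterwards
  }
}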
CAP theorem:

Introduced by Eric Brewer in 2000, the CAP theorem is a fundamental concept in distributed computing.


It explains that a distributed system can provide only two out of three desired characteristics: consistency,
availability, and partition tolerance.
Let's dive into what each property means and how they apply to real-time analytics and big data.
Consistency (C):
 Def: Consistency ensures that all nodes in a distributed system have the same data at the same time.
 Every read should receive the most recent writes.
 Ensuring consistency across distributed systems can be challenging.
Availability (A):
 Def: Every request receives a response, even if some of the nodes are not functional. The system
should always provide a response to requests, regardless of the state of individual nodes.
 To guarantee availability, the system must tolerate failures and continue providing responses.
Partition Tolerance (P):
 Def: Partition tolerance means the system continues to operate even if there is a communication
breakdown between nodes in the network. It is often considered a necessary property.
CAP Theorem in Real-Time Analytics:
 Real-time analytics systems deal with big data. Here's how CAP considerations impact
these systems:
 Consistency trade-offs: Real-time systems may allow slight inconsistencies in data to
achieve faster response times. Example: a fraud detection system may act on a best guess rather than
waiting for 100% accuracy.
 Availability: Real-time analytics systems are often designed to prioritize availability.
 Partition tolerance: Due to the distributed nature of big data systems, partition
tolerance is critical for real-time analytics.
 Similarly, for real-time analytics solutions, there is a variant of the CAP theorem, called SCV.

SCV for real-time analytics systems: a variant that adapts the CAP theorem to real-time analytics:

 Speed: Refers to the system’s ability to deliver timely analytics results.


 Consistency: Addresses the degree of accuracy of the results.
 Volume: Encompasses the system's capacity to handle large and fast-growing datasets.
Example Applications
 Cassandra (AP System): In a real-time analytics scenario, Cassandra offers high availability and partition
tolerance, suitable for applications needing fast, “eventually consistent” responses.
 HBase (CP System): HBase provides strong consistency and partition tolerance, making it suitable for use
cases where data accuracy is critical but can tolerate temporary unavailability.
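
To make the trade-off concrete, here is a toy simulation (not real Cassandra or HBase code; all names and the
partition flag are invented): two in-memory "replicas" are cut off from each other, and an AP-style read still
answers (possibly with stale data) while a CP-style read refuses to answer during the partition.

import java.util.HashMap;
import java.util.Map;

public class CapDemo {
  static Map<String, String> replicaA = new HashMap<>();
  static Map<String, String> replicaB = new HashMap<>();
  static boolean partitioned = true;   // the network between A and B is down

  static void write(String key, String value) {
    replicaA.put(key, value);
    if (!partitioned) {
      replicaB.put(key, value);        // replication only succeeds without a partition
    }
  }

  // AP style: always respond, even if the value might be stale or missing.
  static String readAvailable(String key) {
    String v = replicaB.get(key);      // suppose the client can only reach B
    return v != null ? v : "(no value yet - eventually consistent)";
  }

  // CP style: respond only when the latest value can be guaranteed, otherwise refuse.
  static String readConsistent(String key) {
    if (partitioned) {
      return "(unavailable: cannot guarantee the latest value)";
    }
    return replicaA.get(key);
  }

  public static void main(String[] args) {
    write("balance", "100");                       // reaches only replica A
    System.out.println(readAvailable("balance"));  // AP: answers, but stale
    System.out.println(readConsistent("balance")); // CP: refuses during the partition
  }
}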

Data Scientist activities:

0. Business Hypotheses: Understanding the business problem by defining
hypotheses or questions that need to be answered with data.
1. Prepare Analytics Sandbox: Setting up a safe and controlled environment for
experimenting with data, ensuring data privacy and security.
2. Choose Appropriate Data Sources: Identifying and selecting relevant data
sources, both internal and external, that will help in solving the business
problem.
3. Acquire Data Sources (Internal/External): Gathering the chosen data from
different systems and sources, such as databases, logs, or external APIs.
4. Enrich & Store Data Sets: Cleaning, organizing, and storing data to make it usable
for analysis. Enrichment might involve combining datasets or adding new
relevant attributes.
5. Search for Information (Who and What?): Exploring data to find specific
information relevant to the business hypothesis.
6. Apply Analytics Techniques: Using statistical or machine learning techniques to
analyze data and generate insights.
7. Choose Appropriate Analytics Techniques: Selecting the most effective analytical
methods based on the type of data and business problem.
8. Establish Relevance (Relevance to Business Hypotheses): Ensuring that the
findings and analyses are aligned with the initial business questions and goals.
9. Build Models (Data Models, Analytical Models): Constructing data models or
predictive models to derive insights and make predictions based on the data.
10. Search for Evidence (How sure are we?): Validating and testing the model to
ensure its accuracy and reliability.
11. Evaluate for Coverage (Is it applicable for the universe or only for a specific
instance?): Assessing if the model or solution is generalizable across the business
or only applies to specific cases.
12. Evaluate Results: Reviewing the model outcomes to determine how well they
address the business problem.
13. Tell Story: Creating a narrative around the insights to communicate findings
effectively to stakeholders.
14. Presentation: Presenting the final results, often with visualizations, to
stakeholders to enable data-driven decision-making.

Example: Let's walk through the typical activities of a data scientist, using the challenges faced by a
communications service provider (CSP) as an example.

0. CSPs notice an increase in complaints about network quality. The business problem here is to
understand why customers are leaving and how to improve service quality to retain
them.

1. A data scientist might hypothesize that customers are leaving due to frequent call
drops or slow internet speeds.

2. To test these hypotheses, data scientists need to gather data. They might collect
customer service logs.

3. The data scientist collects various types of data, including internal data (call detail
records, customer data) and external data (social media mentions, competitor pricing).
4. The collected data is often messy and inconsistent. The data scientist cleans it by
handling missing values, removing duplicates, and formatting the data uniformly.
5. Data is stored securely in the analytics sandbox, where the data scientist can safely analyze it without
affecting live databases.
6. Now, the data scientist starts analyzing the data using techniques like descriptive analytics,
predictive analytics, and segmentation.
7. The data scientist selects specific techniques based on the problem.
8. The data scientist sets benchmarks to compare their findings against; these references help
determine whether CSP performance is below or above standard.
9. The data scientist builds predictive models to answer questions like:
 "Which customers are likely to churn in the next quarter?" (see the sketch after this walkthrough)
 "Which areas are most at risk of experiencing network issues?"
10. The models are tested to see how accurate they are.
11. The data scientist uses the analysis results to determine if their hypotheses were correct.
12. The data scientist writes a report explaining their findings in simple terms.
13. The data scientist presents the findings to CSP leadership.
14. Finally, the data scientist presents a comprehensive plan.
15. While presenting the results, the data scientist must highlight key data points while avoiding unnecessary
distractions.
» This data-driven approach allows CSPs to retain more customers, reduce costs
associated with churn, and stay competitive in the market.
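
A toy illustration of the churn model in step 9 (the features, weights, and logistic form are invented for
illustration; a real model would be trained on the CSP's historical data):

public class ChurnScore {
  // Returns a value between 0 and 1; higher means more likely to churn.
  static double score(int droppedCallsLastMonth, double avgDownloadMbps,
                      int complaintsLastQuarter) {
    double z = -2.0
        + 0.30 * droppedCallsLastMonth   // frequent call drops push the score up
        - 0.10 * avgDownloadMbps         // faster internet pulls it down
        + 0.50 * complaintsLastQuarter;  // complaints push it up
    return 1.0 / (1.0 + Math.exp(-z));   // logistic function
  }

  public static void main(String[] args) {
    System.out.printf("churn probability: %.2f%n", score(8, 3.5, 2));
  }
}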
Anatomy of a MapReduce:
 MapReduce (MR) is used to handle large datasets by breaking them down into smaller chunks. It is
a way to process massive amounts of data across many machines and is commonly used in
Hadoop.
 The MapReduce task is mainly divided into two phases, i.e., the Map phase and the
Reduce phase.

 Map: As the name suggests, its main use is to map the input data into key-
value pairs. After the map step, a shuffle and sort stage organizes the data for
the next step (see the word-count sketch below).

 Reduce: In this step, a "reducer" function aggregates or processes the grouped data.
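
The classic example of the two phases is word count. The sketch below uses the older
org.apache.hadoop.mapred API (the same API as the JobClient.runJob example later in these notes); the class
names are our own.

import java.io.IOException;
import java.util.Iterator;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

public class WordCount {

  // Map phase: emit a (word, 1) pair for every word in the input line.
  public static class WordMapper extends MapReduceBase
      implements Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    public void map(LongWritable key, Text value,
                    OutputCollector<Text, IntWritable> output, Reporter reporter)
        throws IOException {
      for (String token : value.toString().split("\\s+")) {
        word.set(token);
        output.collect(word, ONE);
      }
    }
  }

  // Reduce phase: the framework has already shuffled and sorted by word,
  // so each call receives one word and all of its 1s, which are summed.
  public static class SumReducer extends MapReduceBase
      implements Reducer<Text, IntWritable, Text, IntWritable> {
    public void reduce(Text key, Iterator<IntWritable> values,
                       OutputCollector<Text, IntWritable> output, Reporter reporter)
        throws IOException {
      int sum = 0;
      while (values.hasNext()) {
        sum += values.next().get();
      }
      output.collect(key, new IntWritable(sum));
    }
  }
}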
Why Use MapReduce?
MapReduce is designed to work with big data that’s too large for a single computer to handle. By splitting
tasks we make the data analysis faster and more efficient.
Anatomy:
 You can run a MapReduce job with a
single line of code:
JobClient.runJob(conf).
 The whole process is illustrated in Figure
6-1. At the highest level, there are four
independent entities:
» The client, which submits the
MapReduce job.
» The jobtracker, which
coordinates the job run. The
jobtracker is a Java application
whose main class is JobTracker.
» The tasktrackers, which run the
tasks that the job has been split
into. Tasktrackers are Java
applications whose main class is
TaskTracker.
» The distributed filesystem,
which is used for sharing job
files between the other entities.

Job Submission
 JobClient: The process starts when a MapReduce job is submitted via the runJob() method of
JobClient. This method initiates the entire job flow.
 JobTracker: The JobClient communicates with the JobTracker to get a new job ID.
 Resource Distribution: The job's resources are copied to the distributed filesystem.
» The JobClient asks the JobTracker for a new job ID.
» Verifies the input and output paths; throws an error if the input directory is missing or the output
directory already exists.
» Job resources (like JAR files) are copied to the distributed filesystem.
» Once all resources are in place, the job is marked as ready for execution, and the JobTracker is notified
(a minimal driver sketch follows).
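
A minimal driver illustrating the submission step, assuming the WordCount mapper and reducer classes from
the earlier sketch and input/output paths passed on the command line:

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;

public class WordCountDriver {
  public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(WordCountDriver.class);
    conf.setJobName("word-count");

    conf.setMapperClass(WordCount.WordMapper.class);
    conf.setReducerClass(WordCount.SumReducer.class);
    conf.setOutputKeyClass(Text.class);
    conf.setOutputValueClass(IntWritable.class);

    // Number of reduce tasks (the mapred.reduce.tasks property mentioned below).
    conf.setNumReduceTasks(2);

    FileInputFormat.setInputPaths(conf, new Path(args[0]));
    FileOutputFormat.setOutputPath(conf, new Path(args[1]));

    // Submits the job to the JobTracker and waits for completion.
    JobClient.runJob(conf);
  }
}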
Job Initialization
Once the job reaches the JobTracker:
 Queuing & scheduling: The job is placed in the JobTracker's internal queue, from where the job scheduler
picks it up.
 Task Creation: The job is broken down into map and reduce tasks.
o Map Tasks: For each input split, a corresponding map task is created.
o Reduce Tasks: The number of reduce tasks is determined based on the
mapred.reduce.tasks configuration property.
» The input splits are retrieved from HDFS by the JobTracker.
» Tasks are assigned IDs, and the job is ready for execution.
Task Assignment
 TaskTrackers: TaskTrackers are responsible for executing the tasks assigned to them by the
JobTracker.
 Task Allocation: Based on the available resources, the JobTracker assigns tasks to TaskTrackers.
o Map Tasks: The JobTracker tries to assign map tasks close to the data's location (data
locality).
o Reduce Tasks: The JobTracker assigns reduce tasks, which are less sensitive to data
locality.
» The JobTracker assigns either a map or a reduce task to a TaskTracker based on
available slots and data locality.
 TaskTrackers send heartbeats to indicate they are alive and available for new
tasks.
Task Execution
 The necessary job resources are retrieved from the distributed filesystem.
 A new JVM is launched for each task to ensure task isolation
 Task Isolation: Each map/reduce task runs in its own JVM to prevent crashes from affecting other
tasks.
Task Progress and Status Updates
 Tasks periodically report the proportion of their work that has been completed (their
progress).
 TaskTrackers send heartbeats to the JobTracker at regular intervals, carrying status updates such as
success or failure (a progress-reporting sketch follows this list).
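
A hedged sketch of progress reporting from inside a map task, using the old API's Reporter (the
semicolon-separated record layout is an assumption): calling progress() and setStatus() keeps a slow but
healthy task from being treated as hanging.

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class ProgressReportingMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, IntWritable> {

  private static final IntWritable ONE = new IntWritable(1);

  public void map(LongWritable key, Text value,
                  OutputCollector<Text, IntWritable> output, Reporter reporter)
      throws IOException {
    String[] records = value.toString().split(";");   // assumed record layout
    for (int i = 0; i < records.length; i++) {
      output.collect(new Text(records[i]), ONE);      // the actual work
      reporter.progress();                            // tell the TaskTracker we are alive
      reporter.setStatus("processed " + (i + 1) + " of " + records.length);
    }
  }
}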
Job Completion
 Completion Notification: Once the last task of the job completes, the JobTracker updates the
job's status to “successful.”
 Cleanup: The JobTracker cleans up its internal job state and instructs TaskTrackers to delete
intermediate data files.
» On successful completion, the client prints a success message.
» The JobTracker can be configured to send HTTP callbacks upon job completion, informing clients
about the job status.
Special Task Handling (Streaming and Pipes)
 Streaming: A Streaming task communicates with its external process via standard input/output.
 Pipes: A Pipes task communicates with its external process using socket connections.

Failures in classic Map Reduce

In a classic MapReduce environment, managing failures is critical for ensuring job completion despite
issues with tasks or nodes. Hadoop MapReduce offers a robust failure-handling mechanism, addressing
issues at different levels: Task Failures, TaskTracker Failures, and JobTracker Failures. Here’s a simplified
breakdown of each type.

1. Task Failure

Task failures occur when individual Map or Reduce tasks encounter issues, often due to bugs, runtime
errors, or unexpected exits of the JVM. There are several key points to consider:

 Runtime Exceptions: If user code within a map or reduce task throws an exception, the task fails,
and the TaskTracker, which monitors these tasks, marks it as failed. The error logs are available
for review in the user’s logs, helping diagnose the issue.
 Non-zero Exit Codes (Streaming Tasks): For tasks using Hadoop Streaming, the task fails if the
process returns a non-zero exit code. This behavior is controlled by the stream.non.zero.exit.is.failure
property.
 Sudden JVM Exit: Occasionally, JVMs may crash unexpectedly, causing a task to fail without error
logging. In such cases, the TaskTracker detects the exit and logs it as a failure.
 Hanging Tasks: A task is considered “hanging” if it stops making progress. Hadoop detects this if
the task does not report progress within a set time (default is 10 minutes), after which it’s
marked as failed. This timeout can be customized using the mapred.task.timeout property.

When a task fails, it may be retried up to four times (default), and then, if it still fails, it is logged as a
permanent failure, potentially causing the job to fail. The retry limits are adjustable through properties like
mapred.map.max.attempts and mapred.reduce.max.attempts.

2. TaskTracker Failure

A TaskTracker manages multiple task executions and sends periodic heartbeats to the JobTracker to
indicate its status. If a TaskTracker fails due to a crash, network failure, or excessive slowness, the
JobTracker stops receiving its heartbeats and marks it as unavailable.

 Heartbeat Monitoring: The JobTracker expects a heartbeat every few seconds. If no heartbeat is
received for a set duration (default is 10 minutes, configured with
mapred.task.tracker.expiry.interval), the TaskTracker is considered failed.
 Task Reassignment: When a TaskTracker fails, any in-progress tasks are reassigned to other
TaskTrackers. Completed map tasks may also be rerun if their intermediate data is on the failed
node, ensuring data remains accessible for the reduce phase.
 Blacklisting: TaskTrackers with consistently high failure rates can be “blacklisted,” meaning the
JobTracker will avoid assigning tasks to them. Restarting the blacklisted TaskTracker can reset its
status.

3. JobTracker Failure

The JobTracker is a single point of control in classic MapReduce, coordinating all tasks, task assignments,
and monitoring. As the most crucial component, JobTracker failure is the most severe.

 Single Point of Failure: The JobTracker has no built-in redundancy. If it fails, the entire job fails,
and all tasks associated with the job must be restarted.
 Future Plans for Resilience: In some Hadoop versions, projects like ZooKeeper are mentioned for
potential future management of JobTracker failover by coordinating multiple JobTrackers, though
this is beyond classic MapReduce.

Handling of Failures: Key Properties and Mechanisms

Hadoop provides properties and configurations to manage how failures are detected, reported, and
retried. Here are a few essential configurations:

 Timeout Properties: The mapred.task.timeout controls the wait time before a hanging task is
marked failed.
 Retry Limits: Properties like mapred.map.max.attempts and mapred.reduce.max.attempts
determine how many times a task can be retried after failure.
 Blacklisting and Heartbeat Interval: The JobTracker’s heartbeat frequency
(mapred.task.tracker.expiry.interval) and blacklist feature help avoid using failing TaskTrackers.
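
One way these properties could be set programmatically on a JobConf (the values here are purely
illustrative, not recommendations):

import org.apache.hadoop.mapred.JobConf;

public class FailureTuning {
  public static JobConf configure(JobConf conf) {
    // Mark a task as hanging after 20 minutes without progress (default is 10 minutes).
    conf.setLong("mapred.task.timeout", 20 * 60 * 1000L);

    // Allow up to 6 attempts per map task and 4 per reduce task before the job fails.
    conf.setInt("mapred.map.max.attempts", 6);
    conf.setInt("mapred.reduce.max.attempts", 4);

    // Treat a non-zero exit code from a Streaming process as a task failure.
    conf.setBoolean("stream.non.zero.exit.is.failure", true);
    return conf;
  }
}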
Benefits of Failure Management in MapReduce

These mechanisms ensure that even with hardware or software issues, most MapReduce jobs can
complete. Task retry mechanisms and data replication across nodes improve reliability, making Hadoop’s
MapReduce framework resilient to common types of failures.

YARN

YARN, introduced in Hadoop 2.0, is a framework that manages resources in a Hadoop cluster and allows
various data processing engines to work on a shared dataset within the Hadoop Distributed File System
(HDFS). It was developed to address the limitations of Hadoop 1.0, particularly the bottleneck created by
JobTracker, by separating resource management from application processing.

Key Features of YARN


YARN has become popular for its features that enhance flexibility, scalability, and resource management:

 Scalability: The Resource Manager’s scheduler allows Hadoop to manage thousands of nodes,
increasing cluster size and resource allocation efficiency.
 Compatibility: YARN is backward-compatible with Hadoop 1.0’s MapReduce, allowing older
applications to run without modification.
 Cluster Utilization: Dynamic resource allocation optimizes utilization, ensuring balanced resource
usage.
 Multi-tenancy: YARN’s multi-engine capability allows diverse processing engines to work
simultaneously, providing organizations flexibility in data processing.
YARN Architecture
YARN’s architecture divides its components into resource management and application processing layers.
Key components include:

1. Client

 The Client submits MapReduce or other application jobs to the YARN cluster.

2. Resource Manager

 The Resource Manager is the main control authority in YARN, managing and assigning resources.
It has two essential sub-components:
o Scheduler: The Scheduler assigns resources based on applications’ needs and available
cluster resources. It does not handle monitoring or fault tolerance. It supports plugins
like the Capacity Scheduler and Fair Scheduler for flexible resource partitioning.
o Application Manager: This component receives job requests, negotiates initial
containers, and restarts the Application Master if needed.
3. Node Manager
 Each node in the Hadoop cluster has a Node Manager, which monitors node health, manages
application resources, and reports the status to the Resource Manager. The Node Manager also
launches containers and executes tasks as directed by the Application Master.
4. Application Master
 For each application submitted to YARN, the Application Master is responsible for requesting
resources from the Resource Manager, monitoring the application’s status, and coordinating
execution. It communicates with Node Managers to start containers for task execution.
5. Container

 A container is a collection of resources (CPU, memory, and storage) on a node. Containers are
managed by the Node Manager and are essential units of resource allocation in YARN. Each
container is launched based on the requirements specified in the Container Launch Context (CLC),
which provides necessary configurations.
Workflow in YARN
The application workflow in YARN can be broken down into the following steps:

1. Application Submission: The client submits an application to the Resource Manager.


2. Container Allocation for Application Master: The Resource Manager allocates an initial container
for the Application Master.
3. Application Master Registration: The Application Master registers itself with the Resource
Manager to begin resource negotiations.
4. Resource Negotiation: The Application Master requests containers for task execution based on
the job requirements.
5. Container Launch: The Application Master instructs Node Managers to launch containers for
processing.
6. Application Execution: Task execution happens within containers. The Application Master
monitors progress and communicates updates to the Resource Manager.
7. Completion: Upon completion, the Application Master deregisters with the Resource Manager.
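
The first steps of this workflow can be sketched with the YarnClient API roughly as follows (a simplified
outline; exact classes and methods vary between Hadoop 2.x releases, and my-app-master.sh is a hypothetical
Application Master launch script):

import java.util.Collections;
import org.apache.hadoop.yarn.api.records.ApplicationId;
import org.apache.hadoop.yarn.api.records.ApplicationSubmissionContext;
import org.apache.hadoop.yarn.api.records.ContainerLaunchContext;
import org.apache.hadoop.yarn.api.records.Resource;
import org.apache.hadoop.yarn.client.api.YarnClient;
import org.apache.hadoop.yarn.client.api.YarnClientApplication;
import org.apache.hadoop.yarn.conf.YarnConfiguration;
import org.apache.hadoop.yarn.util.Records;

public class YarnSubmitSketch {
  public static void main(String[] args) throws Exception {
    YarnClient yarnClient = YarnClient.createYarnClient();
    yarnClient.init(new YarnConfiguration());
    yarnClient.start();

    // Step 1: application submission - ask the Resource Manager for a new application.
    YarnClientApplication app = yarnClient.createApplication();
    ApplicationSubmissionContext appContext = app.getApplicationSubmissionContext();
    appContext.setApplicationName("demo-app");

    // Container Launch Context (CLC): the command that starts the Application Master.
    ContainerLaunchContext amSpec = Records.newRecord(ContainerLaunchContext.class);
    amSpec.setCommands(Collections.singletonList("./my-app-master.sh"));
    appContext.setAMContainerSpec(amSpec);

    // Resources requested for the Application Master's container: 1 GB, 1 vCore.
    appContext.setResource(Resource.newInstance(1024, 1));

    // Step 2: the Resource Manager allocates a container for the Application Master,
    // which then registers itself and negotiates further containers (steps 3-5).
    ApplicationId appId = yarnClient.submitApplication(appContext);
    System.out.println("Submitted application " + appId);
  }
}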

Advantages of YARN
YARN’s architecture and workflow offer numerous benefits:
 Flexibility: YARN enables multiple processing engines, like Apache Spark and Flink, to work within
the Hadoop ecosystem. This allows simultaneous processing across different frameworks.
 Resource Management: Centralized resource management allows administrators to allocate and
monitor resources efficiently, improving resource usage.
 Scalability: YARN supports large clusters and can dynamically manage thousands of nodes.
 Improved Performance: Centralized scheduling and monitoring of applications improve the
performance and efficiency of task management.
 Security: YARN provides security features, including Kerberos authentication and Secure Shell
(SSH) access, ensuring data and application security.
Disadvantages of YARN
Despite its advantages, YARN has some limitations:
 Complexity: YARN introduces additional configuration requirements, making it challenging for
users unfamiliar with Hadoop or YARN.
 Overhead: Resource management and scheduling introduce overhead, which can reduce
performance.
Conclusion
YARN has transformed Hadoop into a more flexible and efficient resource management system,
supporting a diverse range of processing engines and scalable large clusters. Its robust architecture and
improved resource utilization make it an essential framework for modern big data processing applications,
despite some inherent complexities and overhead.
