Algorithm Documentation
Overview
This algorithm is an enhanced version of the Peterson-Kearns rollback recovery system, designed to
provide fault tolerance in a multi-node distributed system. Each node performs tasks and
communicates with its neighbors while tracking its own state and handling failures. The system
runs indefinitely, with nodes performing random tasks until the program is manually
stopped.
Components and Structure
1. Node Class
The Node class represents an individual node in the distributed system. Each node can:
Connect to other nodes to form a network.
Perform random tasks and communicate with its neighbors.
Save and load its state to recover from failures.
Log events for monitoring purposes.
Key Attributes:
id: Unique identifier for the node.
to_parent: A queue for sending messages to the coordinator.
num_nodes: The total number of nodes in the system.
neighbors: List of neighboring nodes this node is connected to.
proc: The process associated with this node.
last_checkpoint_time: The last time the node took a checkpoint of its state.
last_handled_message: The last message that was handled by the node.
log_file: File where the node logs its events.
failed: Boolean flag to indicate if the node has failed.
time_vector: Vector clock used to order events logically across nodes.
fail_vector: Vector to track which nodes have failed.
state_file: File where the node's state is saved.
Key Methods:
connect(other): Connects this node to another node.
run(): Starts the node's process.
restart(): Restarts the node after a failure.
dummy_spin(): The main loop where the node performs tasks and communicates with
neighbors.
save_state(): Saves the current state of the node to a file.
load_state(): Loads the node's state from a file.
log_event(event): Logs an event to the log file.
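The attributes and methods above suggest a class shaped roughly like the following. This is a sketch, not the original implementation: the constructor signature, the pickle-based checkpoint format, and the file naming are all assumptions.

```python
import pickle
from multiprocessing import Queue


class Node:
    """Sketch of a node in the rollback-recovery system (assumed layout)."""

    def __init__(self, node_id, num_nodes, to_parent):
        self.id = node_id
        self.num_nodes = num_nodes
        self.to_parent = to_parent             # queue to the coordinator
        self.neighbors = []                    # neighboring Node objects
        self.proc = None                       # set when run() starts the process
        self.failed = False
        self.time_vector = [0] * num_nodes     # vector clock
        self.fail_vector = [False] * num_nodes
        self.state_file = f"node_{node_id}.state"
        self.log_file = f"node_{node_id}.log"

    def connect(self, other):
        """Record the link in both directions (per-link message
        queues are omitted from this sketch)."""
        self.neighbors.append(other)
        other.neighbors.append(self)

    def save_state(self):
        """Checkpoint the recoverable state to disk."""
        with open(self.state_file, "wb") as f:
            pickle.dump({"time_vector": self.time_vector,
                         "fail_vector": self.fail_vector}, f)
        self.log_event("saved state")

    def load_state(self):
        """Restore the last checkpoint after a restart."""
        with open(self.state_file, "rb") as f:
            state = pickle.load(f)
        self.time_vector = state["time_vector"]
        self.fail_vector = state["fail_vector"]
        self.log_event("loaded state")

    def log_event(self, event):
        """Append one event line to this node's log file."""
        with open(self.log_file, "a") as f:
            f.write(f"node {self.id}: {event}\n")
```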
2. Coordinator Process
The coordinator_process manages the overall system by:
Monitoring messages from all nodes.
Initiating rollback procedures when a node fails.
Handling checkpoints and maintaining a global time vector and fail vector.
Key Attributes:
global_time_vector: Tracks the global time across all nodes.
global_fail_vector: Tracks which nodes have failed.
Key Methods:
initiate_rollback(num_nodes, node_queues, global_fail_vector): Initiates a rollback procedure
for failed nodes.
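One plausible shape for initiate_rollback, consistent with the description above. The ("ROLLBACK", ...) message format and the id-to-queue mapping are assumptions.

```python
from multiprocessing import Queue


def initiate_rollback(num_nodes, node_queues, global_fail_vector):
    """Tell every failed node to roll back to its last checkpoint.

    node_queues maps a node id to the queue the coordinator uses to
    reach that node; global_fail_vector[i] is True if node i failed.
    Returns the ids of the nodes that were told to roll back.
    """
    rolled_back = []
    for node_id in range(num_nodes):
        if global_fail_vector[node_id]:
            # Message format ("ROLLBACK", payload) is an assumption.
            node_queues[node_id].put(("ROLLBACK", None))
            global_fail_vector[node_id] = False  # node will restart
            rolled_back.append(node_id)
    return rolled_back
```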
3. Node Communication and Failure Handling
Nodes communicate with each other using queues.
If a node fails, it marks itself as failed in the fail_vector and notifies the coordinator.
The coordinator initiates a rollback, and the node restarts, loading its last saved state.
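The vector clocks mentioned above are typically maintained with the standard receive rule: take the component-wise maximum of the local and received clocks, then increment the receiver's own entry. A minimal sketch:

```python
def merge_clocks(own, received, own_id):
    """Standard vector-clock receive rule: component-wise max of the
    two clocks, then tick the receiving node's own component."""
    merged = [max(a, b) for a, b in zip(own, received)]
    merged[own_id] += 1
    return merged
```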
Algorithm Workflow
1. Node Initialization:
o Nodes are created and connected to form a network.
o Each node starts running its dummy_spin method in a separate process.
2. Infinite Task Loop (dummy_spin):
o Nodes run indefinitely, performing random tasks and exchanging messages with their neighbors.
o Each node periodically saves its state and checks for failures.
3. Failure Detection and Rollback:
o If a node detects a failure, it updates its fail vector and exits.
o The coordinator detects the failure and initiates a rollback, restarting the failed node
with its last saved state.
4. Manual Termination:
o The system runs indefinitely until manually stopped using Ctrl+C or other termination
methods.
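Steps 2 and 3 can be sketched as a single loop. The fail_probability parameter and the bounded max_iters (added only so the sketch terminates) are assumptions; the documented behavior is an unbounded loop with a 10% failure chance per pass.

```python
import random


def dummy_spin(fail_probability=0.1, max_iters=None, rng=None):
    """Sketch of the node's main loop: do a random task, maybe fail.

    In the real system this loop runs forever and exchanges messages
    with neighbors; max_iters bounds it here so the sketch is testable.
    Returns the iteration at which a simulated failure occurred, or
    None if no failure happened within max_iters iterations.
    """
    rng = rng or random.Random()
    i = 0
    while max_iters is None or i < max_iters:
        # ... perform a random task, send/receive with neighbors,
        # and periodically call save_state() ...
        if rng.random() < fail_probability:
            # Simulated failure: the real node would set its fail
            # vector, notify the coordinator, and exit its process.
            return i
        i += 1
    return None
```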
Changes and Enhancements
1. Infinite Execution of Nodes (dummy_spin):
o The main enhancement was modifying the dummy_spin method to run indefinitely. This
allows nodes to continue performing random tasks and communicating with neighbors
until the program is manually terminated.
2. Automatic Failure Handling:
o Nodes now have a 10% chance of simulating a failure. When a node fails, it marks itself
as failed, exits, and then restarts automatically after the coordinator initiates a rollback.
3. Improved Log and State Management:
o Enhanced logging for events like saving state, loading state, and failures.
o Nodes periodically save their state, allowing for a more robust recovery mechanism.
4. Coordinator Process Enhancements:
o The coordinator now handles infinite execution by continuously monitoring nodes.
o Improved handling of rollback messages and failures, ensuring the system remains
stable even with continuous operation.
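The coordinator's continuous monitoring described in item 4 might look like the following sketch. The ("TICK", ...) and ("FAIL", ...) message formats are assumptions.

```python
import queue


def monitor_once(parent_queue, global_time_vector, global_fail_vector):
    """Drain pending node messages and record any reported failures.

    Assumed message formats: ("TICK", node_id, vector) for normal
    progress reports, ("FAIL", node_id) for failure notifications.
    Returns the ids of nodes that newly reported a failure.
    """
    newly_failed = []
    while True:
        try:
            msg = parent_queue.get_nowait()
        except queue.Empty:
            break
        if msg[0] == "TICK":
            _, node_id, vector = msg
            # Keep the global clock at the component-wise max seen so far.
            for i, t in enumerate(vector):
                global_time_vector[i] = max(global_time_vector[i], t)
        elif msg[0] == "FAIL":
            _, node_id = msg
            global_fail_vector[node_id] = True
            newly_failed.append(node_id)
    return newly_failed
```

In the full system the coordinator would call this in a loop and pass any newly failed ids on to initiate_rollback.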
Manual Termination
The algorithm is designed to run indefinitely. To stop its execution:
Use Ctrl + C in the terminal to send an interrupt signal.
Alternatively, terminate the Python process using system tools like Task Manager or the kill
command on Unix-based systems.
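Ctrl+C raises KeyboardInterrupt in the main process; a common pattern for turning that into a clean shutdown of the worker processes looks like this (a sketch, not necessarily how the original code handles it):

```python
def run_until_interrupted(processes):
    """Wait on worker processes until the user presses Ctrl+C,
    then terminate them so no orphan processes are left behind.

    Each element of `processes` is expected to behave like a
    multiprocessing.Process (join / terminate / is_alive)."""
    try:
        for p in processes:
            p.join()          # blocks until the process exits
    except KeyboardInterrupt:
        for p in processes:
            p.terminate()     # ask each worker to stop
        for p in processes:
            p.join()          # reap the terminated workers
```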
Conclusion
This enhanced Peterson-Kearns rollback recovery system ensures continuous operation of nodes in a
distributed system, with automatic failure detection and recovery. The algorithm is robust and can
handle failures while maintaining consistent state across nodes, making it suitable for systems requiring
high reliability and fault tolerance.