Algorithm Documentation
Overview
This algorithm is an enhanced version of the Peterson-Kearns rollback recovery system, designed to
provide fault tolerance in a multi-node distributed system. Each node performs tasks and
communicates with its neighbors while tracking its own state and handling failures. The system
runs indefinitely, with nodes performing random tasks until the program is manually
stopped.
Components and Structure
1. Node Class
The Node class represents an individual node in the distributed system. Each node can:
Connect to other nodes to form a network.
Perform random tasks and communicate with its neighbors.
Save and load its state to recover from failures.
Log events for monitoring purposes.
Key Attributes:
id: Unique identifier for the node.
to_parent: A queue for sending messages to the coordinator.
num_nodes: The total number of nodes in the system.
neighbors: List of neighboring nodes this node is connected to.
proc: The process associated with this node.
last_checkpoint_time: The last time the node took a checkpoint of its state.
last_handled_message: The last message that was handled by the node.
log_file: File where the node logs its events.
failed: Boolean flag to indicate if the node has failed.
time_vector: Vector clock used to order events logically across nodes.
fail_vector: Vector to track which nodes have failed.
state_file: File where the node's state is saved.
Key Methods:
connect(other): Connects this node to another node.
run(): Starts the node's process.
restart(): Restarts the node after a failure.
dummy_spin(): The main loop where the node performs tasks and communicates with
neighbors.
save_state(): Saves the current state of the node to a file.
load_state(): Loads the node's state from a file.
log_event(event): Logs an event to the log file.
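The attributes and methods above suggest a class shaped roughly like the following. This is a sketch, not the original implementation: the constructor signature, the pickle-based checkpoint format, and the file naming are all assumptions.

```python
import pickle
from multiprocessing import Queue


class Node:
    """Sketch of a node in the rollback-recovery system (assumed layout)."""

    def __init__(self, node_id, num_nodes, to_parent):
        self.id = node_id
        self.num_nodes = num_nodes
        self.to_parent = to_parent             # queue to the coordinator
        self.neighbors = []                    # neighboring Node objects
        self.proc = None                       # set when run() starts the process
        self.failed = False
        self.time_vector = [0] * num_nodes     # vector clock
        self.fail_vector = [False] * num_nodes
        self.state_file = f"node_{node_id}.state"
        self.log_file = f"node_{node_id}.log"

    def connect(self, other):
        """Record the link in both directions (per-link message
        queues are omitted from this sketch)."""
        self.neighbors.append(other)
        other.neighbors.append(self)

    def save_state(self):
        """Checkpoint the recoverable state to disk."""
        with open(self.state_file, "wb") as f:
            pickle.dump({"time_vector": self.time_vector,
                         "fail_vector": self.fail_vector}, f)
        self.log_event("saved state")

    def load_state(self):
        """Restore the last checkpoint after a restart."""
        with open(self.state_file, "rb") as f:
            state = pickle.load(f)
        self.time_vector = state["time_vector"]
        self.fail_vector = state["fail_vector"]
        self.log_event("loaded state")

    def log_event(self, event):
        """Append one event line to this node's log file."""
        with open(self.log_file, "a") as f:
            f.write(f"node {self.id}: {event}\n")
```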
2. Coordinator Process
The coordinator_process manages the overall system by:
Monitoring messages from all nodes.
Initiating rollback procedures when a node fails.
Handling checkpoints and maintaining a global time vector and fail vector.
Key Attributes:
global_time_vector: Tracks the global time across all nodes.
global_fail_vector: Tracks which nodes have failed.
Key Methods:
initiate_rollback(num_nodes, node_queues, global_fail_vector): Initiates a rollback procedure
for failed nodes.
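One plausible shape for initiate_rollback, consistent with the description above. The ("ROLLBACK", ...) message format and the id-to-queue mapping are assumptions.

```python
from multiprocessing import Queue


def initiate_rollback(num_nodes, node_queues, global_fail_vector):
    """Tell every failed node to roll back to its last checkpoint.

    node_queues maps a node id to the queue the coordinator uses to
    reach that node; global_fail_vector[i] is True if node i failed.
    Returns the ids of the nodes that were told to roll back.
    """
    rolled_back = []
    for node_id in range(num_nodes):
        if global_fail_vector[node_id]:
            # Message format ("ROLLBACK", payload) is an assumption.
            node_queues[node_id].put(("ROLLBACK", None))
            global_fail_vector[node_id] = False  # node will restart
            rolled_back.append(node_id)
    return rolled_back
```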
3. Node Communication and Failure Handling
Nodes communicate with each other using queues.
If a node fails, it marks itself as failed in the fail_vector and notifies the coordinator.
The coordinator initiates a rollback, and the node restarts, loading its last saved state.
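The vector clocks mentioned above are typically maintained with the standard receive rule: take the component-wise maximum of the local and received clocks, then increment the receiver's own entry. A minimal sketch:

```python
def merge_clocks(own, received, own_id):
    """Standard vector-clock receive rule: component-wise max of the
    two clocks, then tick the receiving node's own component."""
    merged = [max(a, b) for a, b in zip(own, received)]
    merged[own_id] += 1
    return merged
```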
Algorithm Workflow
1. Node Initialization:
o Nodes are created and connected to form a network.
o Each node starts running its dummy_spin method in a separate process.
2. Infinite Task Loop (dummy_spin):
o Nodes run indefinitely, performing random tasks and exchanging messages with their neighbors.
o Each node periodically saves its state and checks for failures.
3. Failure Detection and Rollback:
o If a node detects a failure, it updates its fail vector and exits.
o The coordinator detects the failure and initiates a rollback, restarting the failed node
with its last saved state.
4. Manual Termination:
o The system runs indefinitely until manually stopped using Ctrl+C or other termination
methods.
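Steps 2 and 3 can be sketched as a single loop. The fail_probability parameter and the bounded max_iters (added only so the sketch terminates) are assumptions; the documented behavior is an unbounded loop with a 10% failure chance per pass.

```python
import random


def dummy_spin(fail_probability=0.1, max_iters=None, rng=None):
    """Sketch of the node's main loop: do a random task, maybe fail.

    In the real system this loop runs forever and exchanges messages
    with neighbors; max_iters bounds it here so the sketch is testable.
    Returns the iteration at which a simulated failure occurred, or
    None if no failure happened within max_iters iterations.
    """
    rng = rng or random.Random()
    i = 0
    while max_iters is None or i < max_iters:
        # ... perform a random task, send/receive with neighbors,
        # and periodically call save_state() ...
        if rng.random() < fail_probability:
            # Simulated failure: the real node would set its fail
            # vector, notify the coordinator, and exit its process.
            return i
        i += 1
    return None
```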
Changes and Enhancements
1. Infinite Execution of Nodes (dummy_spin):
o The main enhancement was modifying the dummy_spin method to run indefinitely. This
allows nodes to continue performing random tasks and communicating with neighbors
until the program is manually terminated.
2. Automatic Failure Handling:
o Nodes now have a 10% chance of simulating a failure. When a node fails, it marks itself
as failed, exits, and then restarts automatically after the coordinator initiates a rollback.
3. Improved Log and State Management:
o Enhanced logging for events like saving state, loading state, and failures.
o Nodes periodically save their state, allowing for a more robust recovery mechanism.
4. Coordinator Process Enhancements:
o The coordinator now handles infinite execution by continuously monitoring nodes.
o Improved handling of rollback messages and failures, ensuring the system remains
stable even with continuous operation.
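The coordinator's continuous monitoring described in item 4 might look like the following sketch. The ("TICK", ...) and ("FAIL", ...) message formats are assumptions.

```python
import queue


def monitor_once(parent_queue, global_time_vector, global_fail_vector):
    """Drain pending node messages and record any reported failures.

    Assumed message formats: ("TICK", node_id, vector) for normal
    progress reports, ("FAIL", node_id) for failure notifications.
    Returns the ids of nodes that newly reported a failure.
    """
    newly_failed = []
    while True:
        try:
            msg = parent_queue.get_nowait()
        except queue.Empty:
            break
        if msg[0] == "TICK":
            _, node_id, vector = msg
            # Keep the global clock at the component-wise max seen so far.
            for i, t in enumerate(vector):
                global_time_vector[i] = max(global_time_vector[i], t)
        elif msg[0] == "FAIL":
            _, node_id = msg
            global_fail_vector[node_id] = True
            newly_failed.append(node_id)
    return newly_failed
```

In the full system the coordinator would call this in a loop and pass any newly failed ids on to initiate_rollback.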
Manual Termination
The algorithm is designed to run indefinitely. To stop its execution:
Use Ctrl + C in the terminal to send an interrupt signal.
Alternatively, terminate the Python process using system tools like Task Manager or the kill
command on Unix-based systems.
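Ctrl+C raises KeyboardInterrupt in the main process; a common pattern for turning that into a clean shutdown of the worker processes looks like this (a sketch, not necessarily how the original code handles it):

```python
def run_until_interrupted(processes):
    """Wait on worker processes until the user presses Ctrl+C,
    then terminate them so no orphan processes are left behind.

    Each element of `processes` is expected to behave like a
    multiprocessing.Process (join / terminate / is_alive)."""
    try:
        for p in processes:
            p.join()          # blocks until the process exits
    except KeyboardInterrupt:
        for p in processes:
            p.terminate()     # ask each worker to stop
        for p in processes:
            p.join()          # reap the terminated workers
```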
Conclusion
This enhanced Peterson-Kearns rollback recovery system ensures continuous operation of nodes in a
distributed system, with automatic failure detection and recovery. The algorithm is robust and can
handle failures while maintaining consistent state across nodes, making it suitable for systems requiring
high reliability and fault tolerance.