Graph-based big data applications must scale efficiently as graphs grow to immense sizes. Scalable graph processing systems typically handle high workloads by increasing the number of computing nodes. However, this raises the likelihood of single-node or multi-node failures. Failures may occur during normal job execution as well as during recovery. Most systems that handle such failures either follow checkpoint-based recovery, which has a high computation cost, or rely on replication, which has a high memory overhead. In this work, we propose an unsupervised learning-based failure-recovery scheme for graph processing systems that detects different kinds of failures and allows nodes to recover in less time. Compared with traditional failure-recovery models, it provides improved performance with respect to simultaneous recovery from single and multi-node failures, memory overhead and computational latency. Evaluation on four benchmark datasets supports these findings and suggests that the proposed model fits well with current graph processing systems.
Data availability
Datasets 1 and 2 can be freely downloaded from https://snap.stanford.edu/data/email-Eu-core-temporal.html. Dataset 3 is available at https://snap.stanford.edu/data/ego-Facebook.html. Dataset 4 can be downloaded from https://snap.stanford.edu/data/com-LiveJournal.html.
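As a reader's aid, the following minimal sketch shows one way to load a downloaded SNAP edge-list file into an in-memory adjacency structure. It is not taken from the authors' repository. The function name `load_edge_list` is hypothetical, and the sketch assumes the file is a plain or gzip-compressed text file with one whitespace-separated edge per line, optional `#` comment lines, and integer vertex identifiers in the first two columns. Any further column, such as a timestamp, is ignored, and edges are stored in the direction given in the file.

```python
import gzip
from collections import defaultdict
from pathlib import Path


def load_edge_list(path):
    """Read a whitespace-separated edge list into a dict of successor sets.

    Assumptions (not from the paper): one edge per line, '#' starts a
    comment line, and the first two columns are integer vertex IDs.
    """
    path = Path(path)
    # Use gzip for .gz downloads, the built-in open otherwise.
    opener = gzip.open if path.suffix == ".gz" else open
    adjacency = defaultdict(set)
    with opener(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if not line or line.startswith("#"):
                continue  # skip blank lines and comment headers
            fields = line.split()
            src, dst = int(fields[0]), int(fields[1])
            adjacency[src].add(dst)
            adjacency[dst]  # touch the key so sink vertices are also listed
    return adjacency
```

Calling `load_edge_list` on a locally saved copy of one of the files returns a mapping from each vertex to the set of its successors; `len(result)` then gives the number of vertices seen in the file.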
Code availability
The proposed failure recovery mechanism has been implemented in Python3 and is freely available at https://github.com/aradhita1988/Failure-recovery.
AM is a Senior Research Fellow supported by the Visvesvaraya Ph.D. Scheme for Electronics and IT, under Ministry of Electronics and Information Technology, Government of India. NC acknowledges the DST, ICPS project grant T-884 on “Connected Smart Health Services for Rural India”.
This work has not received any funding.
Conceptualization of Methodology: AM, RC, NC. Data Curation, Data Analysis, Formal analysis, Visualization, Investigation, Implementation, Validation, Original draft preparation: AM. Methodology Validation, Reviewing and Editing: RC, NC. Overall Supervision: RC, NC.
Mukherjee, A., Chaki, R. & Chaki, N. An unsupervised learning-guided multi-node failure-recovery model for distributed graph processing systems. J Supercomput 79, 9383–9408 (2023). https://doi.org/10.1007/s11227-022-05028-8
DOI: https://doi.org/10.1007/s11227-022-05028-8