Abstract
We study the problem of achieving reliable communication with quiescent algorithms (i.e., algorithms that eventually stop sending messages) in asynchronous systems with process crashes and lossy links. We first show that it is impossible to solve this problem without failure detectors. We then show how to solve it using a new failure detector, called heartbeat. In contrast to previous failure detectors that have been used to circumvent impossibility results, the heartbeat failure detector is implementable, and its implementation does not use timeouts. These results have wide applicability: they can be used to transform many existing algorithms that tolerate only process crashes into quiescent algorithms that tolerate both process crashes and message losses. This can be applied to consensus, atomic broadcast, k-set agreement, atomic commitment, etc. The heartbeat failure detector is novel: besides being implementable without timeouts, it does not output lists of suspects as typical failure detectors do. If we restrict failure detectors to output only lists of suspects, quiescent reliable communication requires ◊P [2], which is not implementable. Combined with the results of this paper, this shows that traditional failure detectors that output only lists of suspects have fundamental limitations.
Research partially supported by NSF grant CCR-9402896, by ARPA/ONR grant N00014-961-1014, and by an Olin Fellowship.
Preview
Unable to display preview. Download preview PDF.
References
M. K. Aguilera, W. Chen, and S. Toueg. Heartbeat: a timeout-free failure detector for quiescent reliable communication. Technical Report 97–1631, Department of Computer Science, Cornell University, May 1997.
M. K. Aguilera, W Chen, and S. Toueg. On the weakest failure detector for quiescent reliable communication. Technical report, Department of Computer Science, Cornell University, July 1997.
M. K. Aguilera, W. Chen, and S. Toueg. Quiescent reliable communication and quiescent consensus in partitionable networks. Technical Report 97–1632, Department of Computer Science, Cornell University, June 1997.
M. K. Aguilera and S. Toueg. Randomization and failure detection: a hybrid approach to solve consensus. In Proceedings of the 10th International Workshop on Distributed Algorithms, Lecture Notes on Computer Science, pages 29–39. Springer-Verlag, Oct. 1996.
Ö. Babaoĝlu, R. Davoli, and A. Montresor. Partitionable group membership: specification and algorithms. Technical Report UBLCS-97-1, Dept. of Computer Science, University of Bologna, Bologna, Italy, January 1997.
A. Basu, B. Charron-Bost, and S. Toueg. Simulating reliable links with unreliable links in the presence of process crashes. In Proceedings of the 10th International Workshop on Distributed Algorithms, Lecture Notes on Computer Science, pages 105–122. Springer-Verlag, Oct. 1996.
R. Bazzi and G. Neiger. Simulating crash failures with many faulty processors. In A. Segal and S. Zaks, editors, Proceedings of the 6th International Workshop on Distributed Algorithms, volume 647 of Lecture Notes on Computer Science, pages 166-184. Springer-Verlag, 1992.
M. Ben-Or. Another advantage of free choice: Completely asynchronous agreement protocols. In Proceedings of the 2nd ACM Symposium on Principles of Distributed Computing, pages 27–30, Aug. 1983.
G. Bracha and S. Toueg. Asynchronous consensus and broadcast protocols. J. ACM, 32(4):824–840, Oct. 1985.
T D. Chandra, April 1997. Private Communication.
T D. Chandra, V. Hadzilacos, and S. Toueg. The weakest failure detector for solving consensus. Journal of the ACM, 43(4):685–722, July 1996.
T D. Chandra and S. Toueg. Unreliable failure detectors for reliable distributed systems. Journal of the ACM, 43(2):225–267, March 1996.
S. Chaudhuri. More choices allow more faults: Set consensus problems in totally asynchronous systems. Information and Computation, 105(1):132–158, July 1993.
B. Chor, M. Merritt, and D. B. Shmoys. Simple constant-time consensus protocols in realistic failure models. Journal of the ACM, 36(3):591–614,1989.
D. Dolev, R. Friedman, I. Keidar, and D. Malkhi. Failure detectors in omission failure environments. Technical Report 96-1608, Department of Computer Science, Cornell University, Ithaca, New York, 1996.
D. Dolev, N. A. Lynch, S. S. Pinter, E. W. Stark, and W. E. Weihl. Reaching approximate agreement in the presence of faults. J. ACM, 33(3):499–516, July 1986.
P Feldman and S. Micali. An optimal algorithm for synchronous Byzantine agreement. Technical Report MIT/LCS/TM-425, Laboratory for Computer Science, Massachusetts Institute of Technology, June 1990.
M. J. Fischer, N. A. Lynch, and M. S. Paterson. Impossibility of distributed consensus with one faulty process. J. ACM, 32(2):374–382, Apr. 1985.
A. Gopal. Fault-Tolerant Broadcasts and Multicasts: The Problem of Inconsistency and Contamination. PhD thesis, Corneal University, Jan. 1992.
R. Guerraoui. Revisiting the relationship between non-blocking atomic commitment and consensus. In Proceedings of the 9th International Workshop on Distributed Algorithms, pages 87–100, Le Mont-St-Michel, France, 1995. Springer Verlag, LNCS 972.
R. Guerraoui, M. Larrea, and A. Schiper. Non blocking atomic commitment with an unreliable failure detector. In Proceedings of the 14th IEEE Symposium on Reliable Distributed Systems, pages 13–15, 1995.
V. Hadzilacos and S. Toueg. A modular approach to fault-tolerant broadcasts and related problems. Technical Report 94-1425, Department of Computer Science, Cornell University, Ithaca, New York, May 1994.
R. Koo and S. Toueg. Effects of message loss on the termination of distributed protocols. Inf. Process. Lett., 27(4):181–188, Apr. 1988.
W.-K. Lo and V Hadzilacos. Using failure detectors to solve consensus in asynchronous shared-memory systems. In Proceedings of the 8th International Workshop on Distributed Algorithms, pages 280–295, Terschelling, The Netherlands, 1994.
N. A. Lynch. Distributed Algorithms. Morgan Kaufmann Publishers, Inc., 1996.
M. Rabin. Randomized Byzantine generals. In Proceedings of the 24th Symposium on Foundations of Computer Science, pages 403–409. IEEE Computer Society Press, Nov. 1983.
L. S. Sabel and K. Marzullo. Election vs. consensus in asynchronous systems. Technical Report 95-1488, Department of Computer Science, Cornell University, Ithaca, New York, Febrary 1995.
R. van Renesse, April 1997. Private Communication.
Author information
Authors and Affiliations
Editor information
Rights and permissions
Copyright information
© 1997 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Kawazoe Aguilera, M., Chen, W., Toueg, S. (1997). Heartbeat: A timeout-free failure detector for quiescent reliable communication. In: Mavronicolas, M., Tsigas, P. (eds) Distributed Algorithms. WDAG 1997. Lecture Notes in Computer Science, vol 1320. Springer, Berlin, Heidelberg. https://doi.org/10.1007/BFb0030680
Download citation
DOI: https://doi.org/10.1007/BFb0030680
Published:
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-63575-8
Online ISBN: 978-3-540-69600-1
eBook Packages: Springer Book Archive