                                                       K. P. Birman (Cornell)
Network Working Group                                  T. A. Joseph (Cornell)
Request for Comments: 992                              November 1986

       On Communication Support for Fault Tolerant Process Groups

                     K. P. Birman and T. A. Joseph
             Dept. of Computer Science, Cornell University
                           Ithaca, N.Y. 14853
                              607-255-9199

1. Status of this Memo.

   This memo describes a collection of multicast communication primi-
   tives integrated with a mechanism for handling process failure and
   recovery.  These primitives facilitate the implementation of fault-
   tolerant process groups, which can be used to provide distributed
   services in an environment subject to non-malicious crash failures.
   Unlike other process group approaches, such as Cheriton's "host
   groups" (RFCs 966, 988, [Cheriton]), our approach provides powerful
   guarantees about the behavior of the communication subsystem when
   process group membership is changing dynamically, for example due to
   process or site failures, recoveries, or migration of a process from
   one site to another.  Our approach also addresses delivery ordering
   issues that arise when multiple clients communicate with a process
   group concurrently, or a single client transmits multiple multicast
   messages to a group without pausing to wait until each is received.
   Moreover, the cost of the approach is low.  An implementation is
   being undertaken at Cornell as part of the ISIS project.

   Here, we argue that the form of "best effort" reliability provided by
   host groups may not address the requirements of those researchers who
   are building fault tolerant software.  Our basic premise is that
   reliable handling of failures, recoveries, and dynamic process
   migration is an important aspect of programming in distributed
   environments, and that communication support that behaves
   unpredictably in the presence of such events places an unacceptable
   burden of complexity on higher level application software.  This
   complexity does not arise when using the fault-tolerant process group
   alternative.

   This memo summarizes our approach and briefly contrasts it with other
   process group approaches.  For a detailed discussion, together with
   figures that clarify the details of the approach, readers are re-
   ferred to the papers cited below.

   Distribution of this memo is unlimited.

Birman & Joseph                                                 [Page 1]
RFC 992                                                    November 1986

2. Acknowledgments

   This memo was adapted from a paper presented at the Asilomar workshop
   on fault-tolerant distributed computing, March 1986, and summarizes
   material from a technical report that was issued by Cornell
   University, Dept. of Computer Science, in August 1985, and that will
   appear in ACM Transactions on Computer Systems in February 1987
   [Birman-b].  Copies of these papers, and other relevant papers, are
   available on request from the author: Dept. of Computer Science,
   Cornell University, Ithaca, New York 14853
   (birman@gvax.cs.cornell.edu).  The ISIS project also maintains a
   mailing list.  To be added to this list, contact M. Schmizzi
   (schiz@gvax.cs.cornell.edu).

   This work was supported by the Defense Advanced Research Projects
   Agency (DoD) under ARPA order 5378, Contract MDA903-85-C-0124, and by
   the National Science Foundation under grant DCR-8412582.  The views,
   opinions and findings contained in this report are those of the au-
   thors and should not be construed as an official Department of De-
   fense position, policy, or decision.

3. Introduction

   At Cornell, we recently completed a prototype of the ISIS system,
   which transforms abstract type specifications into fault-tolerant
   distributed implementations, while insulating users from the mechan-
   isms by which fault-tolerance is achieved.  This version of ISIS, re-
   ported in [Birman-a], supports transactional resilient objects as a
   basic programming abstraction.  Our current work undertakes to pro-
   vide a much broader range of fault-tolerant programming mechanisms,
   including fault-tolerant distributed bulletin boards [Birman-c] and
   fault-tolerant remote procedure calls on process groups [Birman-b].
   The approach to communication that we report here arose as part of
   this new version of the ISIS system.

   Unreliable communication primitives, such as the multicast group
   communication primitives proposed in RFCs 966 and 988 and in
   [Cheriton], leave some uncertainty about the delivery status of a
   message when failures and other exceptional events occur during
   communication.  Rather than a delivery guarantee, a form of "best
   effort" delivery is provided, but with the possibility that some
   member of a group of processes did not receive