Software checksumming in the IMP and network reliability
RFC 528

Document Type RFC - Unknown (June 1973; No errata)
Last updated 2013-03-02
Network Working Group                                      J.  McQuillan
Request for Comments: 528                                        BBN-NET
NIC: 17164                                                  20 June 1973

        SOFTWARE CHECKSUMMING IN THE IMP AND NETWORK RELIABILITY

   As the ARPA Network has developed over the last few years, and our
   experience with operating the IMP subnetwork has grown, the issue of
   reliability has assumed greater importance and greater complexity.
   This note describes some modifications that have recently been made
   to the IMP and TIP programs in this regard.  These changes are
   mechanically minor and do not affect Host operation at all, but they
   are logically noteworthy, and for this reason we have explained the
   workings of the new IMP and TIP programs in some detail.  Host
   personnel are advised to note particularly the modifications
   described in sections 4 and 5, as they may wish to change their own
   programs or operating procedures.

1. A Changing View of Network Reliability

   Our idea of the Network has evolved as the Network itself has grown.
   Initially, it was thought that the only components in the network
   design that were prone to errors were the communications circuits,
   and the modem interfaces in the IMPs are equipped with a CRC checksum
   to detect "almost all" such errors.  The rest of the system,
   including Host interfaces, IMP processors, memories, and interfaces,
   was considered to be error-free.  We have had to re-evaluate
   this position in the light of our experience.  In operating the
   network we are faced with the problem of having to perform remote
   diagnosis on failures which cannot easily be classified or
   understood.  Some examples of such problems include reports from Host
   personnel of lost RFNMs and lost Host-Host protocol allocate
   messages, inexplicable behavior in the IMP of a transient nature,
   and, finally, the problem of crashes -- the total failure of an IMP,
   perhaps affecting adjacent IMPs.  These circumstances are infrequent
   and are therefore difficult to correlate with other failures or with
   particular attempted remedies.  Indeed, it is often impossible to
   distinguish a software failure from a hardware failure.

   In post-mortem analysis of crashes, we have sometimes found that the
   IMP program contained incorrect instructions -- sometimes just one or
   two bits picked or dropped.  Clearly, memory errors can account for
   almost any failure, not only program crashes but also data errors
   which can lead to many other syndromes.  For instance, if the address
   of a message is changed in transit, then one Host thinks the message
   was lost, and another Host may receive an extra message.  Errors of
   this kind fall into two general classes: errors in Host messages,
   whether in the control information or the data, and errors in inter-
   IMP messages, primarily routing update messages.  In the course of
   the last few years, it has become increasingly clear that such errors
   were occurring, though it was difficult to speculate as to where,
   why, and how often.

   One of the earliest problems of this kind was discovered in 1971.
   The Harvard IMP was sometimes crashing in an unknown manner so that
   all the other IMPs were affected.  It was finally determined that its
   memory was faulty and sometimes the routing messages read out from
   memory by the modem output interfaces were all zeroes.  The adjacent
   IMPs interpreted such an erroneous message as stating that the
   Harvard IMP had zero delay to all destinations -- that it was the
   best route to everywhere!  Once this information propagated to the
   other IMPs, the whole network was in a shambles.  The solution to
   this problem was to generate a software checksum for each routing
   message before it was sent from one IMP, and to check it after it was
   received at the other IMP.  This software checksum, in addition to
   the hardware checksum of the circuit, checks the modem interfaces and
   memories at each IMP, and protects the IMPs from erroneous routing
   information.  The overhead in computing these checksums is not great
   since the messages are only exchanged every 2/3 of a second.
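
   The generate-then-check scheme described above can be sketched as
   follows.  This is an illustration only, not the IMP code: the text
   here does not give the checksum algorithm, so the 16-bit additive
   sum, the nonzero SEED value, the message layout (a list of per-
   destination delays followed by one checksum word), and all function
   names are assumptions.  The nonzero seed is chosen so that a message
   read out of memory as all zeroes fails the check, which a plain sum
   starting from zero would not guarantee.

```python
# Illustrative sketch only. Word size, seed, message layout, and all
# names below are assumptions, not taken from the IMP program.

WORD_MASK = 0xFFFF  # assume 16-bit words
SEED = 0x5555       # hypothetical nonzero seed: all-zero messages must fail


def checksum(words):
    """Return a 16-bit additive checksum of the given words."""
    total = SEED
    for w in words:
        total = (total + w) & WORD_MASK
    return total


def seal(delays):
    """Sending side: append the checksum to a routing message."""
    return list(delays) + [checksum(delays)]


def verify(message):
    """Receiving side: recompute the checksum and compare."""
    *delays, received = message
    return checksum(delays) == received


def update_routes(table, neighbor, message, link_delay=1):
    """Use a neighbor's delay vector only if its checksum verifies.

    table[dest] is a (delay, next_hop) pair; a lower delay wins.
    """
    if not verify(message):
        return False  # discard erroneous routing information
    for dest, delay in enumerate(message[:-1]):
        candidate = delay + link_delay
        if candidate < table[dest][0]:
            table[dest] = (candidate, neighbor)
    return True


if __name__ == "__main__":
    good = seal([3, 0, 5, 2])
    zeroed = [0] * len(good)  # what a faulty memory might hand the modem
    table = [(10, "A")] * 4
    print(update_routes(table, "H", zeroed), table)
    print(update_routes(table, "H", good), table)
```

   Without the check in update_routes, the all-zero message would be
   read as zero delay to every destination and the neighbor would be
   chosen as the next hop for all of them, which is the failure
   described above.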

   In the first few months of 1973, we began to have a great deal of
   trouble with the reliability of some IMPs, especially those in the
   Washington area.  The normal procedures of calling in and working
   with Honeywell field engineers had not cleared up several of these
   persistent failures, and it was felt that an escalation of BBN
   involvement was needed to identify the exact causes of the problems.
   Therefore, during much of February and March there were one or more
   members of the staff at various sites in the network where hardware