Internet Engineering Task Force                            Greg Minshall
INTERNET-DRAFT                                             Siara Systems
draft-minshall-nagle-00                                December 18, 1998


             A Suggested Modification to Nagle's Algorithm


Status of This Memo

   This document is an Internet-Draft.  Internet-Drafts are working
   documents of the Internet Engineering Task Force (IETF), its areas,
   and its working groups.  Note that other groups may also distribute
   working documents as Internet-Drafts.

   Internet-Drafts are draft documents valid for a maximum of six
   months and may be updated, replaced, or obsoleted by other
   documents at any time.  It is inappropriate to use Internet- Drafts
   as reference material or to cite them other than as ``work in
   progress.''

   To view the entire list of current Internet-Drafts, please check
   the ``1id-abstracts.txt'' listing contained in the Internet-Drafts
   Shadow Directories on ftp.is.co.za (Africa), ftp.nordu.net
   (Northern Europe), ftp.nis.garr.it (Southern Europe), munnari.oz.au
   (Pacific Rim), ftp.ietf.org (US East Coast), or ftp.isi.edu (US
   West Coast).

   This draft proposes a modification to Nagle's algorithm (as
   specified in RFC896) to allow TCP, under certain conditions, to
   send a small sized packet immediately after one or more maximum
   segment sized packet.


Abstract

   The Nagle algorithm is one of the primary mechanisms which protects
   the internet from poorly designed and/or implemented applications.
   However, for a certain class of applications (notably,
   request-response protocols) the Nagle algorithm interacts poorly
   with delayed acknowledgements to give these applications poorer
   performance.

   This draft is NOT suggesting that these applications should disable
   the Nagle algorithm.

   This draft suggests a fairly small and simple modification to the
   Nagle algorithm to preserve Nagle as a means of protecting the
   internet while at the same time giving better performance to a
   wider class of applications.


Introduction to the Nagle algorithm

   The Nagle algorithm [RFC896] protects the internet from
   applications (most notably Telnet, at the time the algorithm was
   developed) which tend to dribble small amounts of data to TCP.
   Without the Nagle algorithm, TCP would transmit a packet, with a
   small amount of data, in response to each of the application's
   writes to TCP.  With the Nagle algorithm, a first small packet will
   be transmitted, then subsequent writes from the application will be
   buffered at the sending TCP until either i) enough application data
   has accumulated to enable TCP to transmit a maximum sized packet,
   or ii) the initial small packet is acknowledged by the receiving
   TCP.  This limits the number of small packets to one per round trip
   time.

   While the current Nagle algorithm does a very good job of
   protecting the internet from such applications, there are other
   applications, such as request-response protocols (with HTTP 1.1
   being a topical example) in which the current Nagle algorithm
   produces non-optimal results.  In this context, the Nagle algorithm
   is interacting with TCP's ``delayed ACK'' policy [RFC1122].


Delayed ACKs

   A receiving TCP tries to avoid acknowledging every received data
   packet.  This process, known as ``delayed ACKing'' [RFC1122],
   typically causes an ACK to be generated for every other received
   (full-sized) data packet.  In the case of an ``isolated'' TCP
   packet (i.e., where a second TCP packet is not going to arrive
   anytime soon), the delayed ACK policy causes an acknowledgement for
   the data in the isolated packet to be sent within 200 milliseconds
   of the receipt of the isolated packet.  (The way delayed ACKs are
   implemented in some systems causes the delayed ACK to be generated
   anytime between 0 and 200ms; in this case, the average amount of
   time before the delayed ACK is generated is 100ms.)


The interaction of delayed ACKs and Nagle

   If a TCP has more application data to transmit than will fit in one
   packet, but less than two full-sized packets' worth of data, it
   will transmit the first packet.  As a result of Nagle, it will not
   transmit the second packet until the first packet has been
   acknowledged.  On the other hand, the receiving TCP will delay
   acknowledging the first packet until either i) a second packet
   arrives (which, in this case, won't arrive), or ii) approximately
   100ms (and a maximum of 200ms) has elapsed.

   When the sending TCP receives the delayed ACK, it can then transmit
   its second packet.

   In a request-response protocol, this second packet will complete
   either a request or a response, which then enables a succeeding
   response or request.

   Note two (related) bad results of the interaction of delayed ACKs
   and the Nagle algorithm in this case: the request-response time may
   be increased by up to 400ms (if both the request and the response
   are delayed); and, the number of transactions per second is
   substantially reduced.


A proposed modification to the Nagle algorithm

   The current Nagle algorithm can be described as follows:

        If a TCP has less than a full-sized packet to transmit,
        and if any previous packet has not yet been acknowledged,
        do not transmit a packet.

   The proposed Nagle algorithm modifies this as follows:

        If a TCP has less than a full-sized packet to transmit,
        and if any previous less than full-sized packet has not
        yet been acknowledged, do not transmit a packet.

   In other words, when running Nagle, only look at the recent
   transmission (and acknowledgement) of small packets (rather than
   all packets, as in the current Nagle).

   (In writing the above, I am aware that TCP acknowledges BYTES, not
   packets.  However, expressing the algorithm in terms of packets
   seems to make the explanation a bit clearer.)


Implementation of the modified Nagle algorithm in a system

   The current Nagle algorithm does not require any more state to be
   kept by TCP on a system.  SND_NXT is a TCP variable which names the
   next byte of data to be transmitted.  SND_UNA is a TCP variable
   which names the next byte of data to be acknowledged.  If SND_NXT
   equals SND_UNA, then all previous packets have been acknowledged.

   The proposed modification to the Nagle algorithm does,
   unfortunately, require one new state variable to be kept by TCP.
   SND_SML is a TCP variable which names the last byte of data in the
   most recently transmitted small packet.

   An implementation could be as follows:

        1.  When transmitting a small packet, record the sequence
        number of the last byte of the small packet in SND_SML.

        2.  When deciding whether or not to transmit a small packet,
        check to ensure that SND_SML is less than, or equal to,
        SND_UNA.


A Failure Mode

   If an application sends a large amount of data, followed by a small
   amount of data, followed by a large amount of data, the current
   Nagle algorithm would perform better than the proposed
   modification.  The current Nagle algorithm would send at most one
   small packet (possibly the last packet), delaying the middle
   (small) amount of data which would allow the application to send
   the following large amount of data; the proposed Nagle algorithm
   would send two small packets (the middle packet, plus possibly a
   last packet).


A separate, but desirable, system facility

   In addition to the Nagle algorithm (or the modification proposed by
   this draft), it would be desirable for a system providing TCP
   service to applications to allow the application to set TCP into a
   mode in which the TCP would only transmit small packets at the
   explicit direction of the application.  For example, a system based
   on BSD might implement a socket option (using setsockopt(2))
   SO_EXPLICITPUSH, as well as a flag to sendto(2) (possibly
   overloading the semantics of an existing flag, such as MSG_EOF).

   In this scenario, an application would set a socket into
   SO_EXPLICITPUSH mode, then enter a mode of writing data to the
   socket and, at the last write, using send(2) with the MSG_EOF flag.
   The underlying TCP would recognize the MSG_EOF flag as an indicator
   to transmit the (possibly) small packet.

   Like the proposed modification to the Nagle algorithm, this is
   fairly simple to implement.

   If a system were to implement this interface, it would be important
   to NOT disable Nagle when using this interface.  In other words,
   when using this interface, the default mode for TCP would be to NOT
   transmit a small packet (even in the presence of MSG_EOF) if a
   previously transmitted small packet was as yet unacknowledged.

   Note, also, that implementing this interface does not eliminate the
   desirability of using the modification of the Nagle as the default
   for applications.  More sophisticated networking applications might
   well use the new interface, but naive applications will often be
   adequately served by the modified Nagle algorithm.


Acknowledgements

   Jim Gettys, Henrik Frystyk Nielsen, Jeff Mogul, and Yasushi Saito,
   as well as a message forwarded to the end2end-interest list by Sean
   Doran, have motivated my current interest in the Nagle algorithm.
   John Heidemann's work related to the Nagle algorithm has informed
   some of the thinking in this draft; discussions with John have also
   been helpful.  Members of the End-to-End Research Group (under
   the direction of Bob Braden) patiently listened to my discussion of
   the current state of the Nagle algorithm and to the modifications
   proposed in this document.


Security Considerations

   The Nagle algorithm does not have major security consequences.

   Implementation of this algorithm should not negatively impact
   the performance of the internet.  The negative impact of
   implementation of this algorithm should be significantly less
   than disabling the Nagle algorithm.


References

[RFC896]        Nagle, J., "Congestion control in IP/TCP internetworks",
                        Jan-06-1984.
[RFC1122]       Braden, R. T., "Requirements for Internet hosts -
                        communication layers", Oct-01-1989.

Author's Addresses

   Greg Minshall
   Siara Systems
   1399 Charleston Road
   Mountain View, CA  94043
   USA

   <minshall@siara.com>