TCP Implementation Working Group                              W. Stevens
INTERNET DRAFT                                                Consultant
File: draft-ietf-tcpimpl-cong-control-00.txt                   M. Allman
                                            NASA Lewis/Sterling Software
                                                               V. Paxson
                                                                    LBNL
                                                            August, 1998


                        TCP Congestion Control

Status of this Memo

    This document is an Internet-Draft.  Internet-Drafts are working
    documents of the Internet Engineering Task Force (IETF), its areas,
    and its working groups.  Note that other groups may also distribute
    working documents as Internet-Drafts.

    Internet-Drafts are draft documents valid for a maximum of six
    months and may be updated, replaced, or obsoleted by other documents
    at any time.  It is inappropriate to use Internet-Drafts as
    reference material or to cite them other than as ``work in
    progress.''

    To view the entire list of current Internet-Drafts, please check the
    "1id-abstracts.txt" listing contained in the Internet-Drafts Shadow
    Directories on ftp.is.co.za (Africa), ftp.nordu.net (Northern
    Europe), ftp.nis.garr.it (Southern Europe), munnari.oz.au (Pacific
    Rim), ftp.ietf.org (US East Coast), or ftp.isi.edu (US West Coast).

Abstract

    This document defines TCP's four intertwined congestion control
    algorithms: slow start, congestion avoidance, fast retransmit, and
    fast recovery.  In addition, the document specifies how TCP should
    begin transmission after a relatively long idle period, and
    discusses various acknowledgment generation methods.

1   Introduction

    This document specifies four TCP [Pos81] congestion control
    algorithms: slow start, congestion avoidance, fast retransmit and
    fast recovery.  These algorithms were devised in [Jac88] and
    [Jac90].  Their use with TCP is required by [Bra89].

    This document is an update of [Ste97].  In addition to specifying
    the congestion control algorithms, this document specifies what TCP
    connections should do after a relatively long idle period, as well
    as specifying and clarifying some of the issues pertaining to TCP
    ACK generation.

    Note that [Ste94] provides examples of these algorithms in action
    and [WS95] provides an explanation of the source code for the BSD
    implementation of these algorithms.


Expires: February, 1999                                           [Page 1]


draft-ietf-tcpimpl-cong-control-00.txt                       August 1998

    This document is organized as follows.  Section 2 provides various
    definitions which will be used throughout the document.  Section 3
    provides a specification of the congestion control algorithms.
    Section 4 outlines concerns related to the congestion control
    algorithms, and Section 5 outlines security considerations.

    The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
    "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
    document are to be interpreted as described in [Bra97].

2   Definitions

    This section provides the definition of several terms that will be
    used throughout the remainder of this document.

    SEGMENT:
        A segment is ANY TCP/IP data or acknowledgment packet (or both).

    MAXIMUM SEGMENT SIZE (MSS):
        The MSS is the largest segment size that can be used.  The size
        does not include the TCP/IP headers and options.

    FULL-SIZED SEGMENT:
        A segment that contains the maximum number of data bytes
        permitted (i.e., a segment containing MSS bytes of data).

    RECEIVER WINDOW (rwnd):
        The most recently advertised receiver window.

    CONGESTION WINDOW (cwnd):
        A TCP state variable that limits the amount of data a TCP can
        send.  At any given time, a TCP MUST NOT send data with a
        sequence number higher than the sum of the highest acknowledged
        sequence number and the minimum of cwnd and rwnd.

    INITIAL WINDOW (IW):
        The initial window is the size of the sender's congestion window
        when a connection is established.

    LOSS WINDOW (LW):
        The loss window is the size of the congestion window after a TCP
        sender detects loss using its retransmission timer.

    RESTART WINDOW (RW):
        The restart window is the size of the congestion window after a
        TCP restarts transmission after an idle period.

3   Congestion Control Algorithms

    This section defines the four congestion control algorithms: slow
    start, congestion avoidance, fast retransmit and fast recovery,
    developed in [Jac88] and [Jac90].  In some situations it may be
    beneficial for a TCP sender to be more conservative than the
    algorithms allow; however, a TCP MUST NOT be more aggressive than the
    following algorithms allow (that is, MUST NOT send data when the
    value of cwnd computed by the following algorithms would not allow
    the data to be sent).

3.1 Slow Start and Congestion Avoidance

    The slow start and congestion avoidance algorithms MUST be used by a
    TCP sender to control the amount of outstanding data being injected
    into the network.  To implement these algorithms, two variables are
    added to the TCP per-connection state.  The congestion window (cwnd)
    is a sender-side limit on the amount of data the sender can transmit
    into the network before receiving an acknowledgment (ACK), while the
    receiver's advertised window (rwnd) is a receiver-side limit on the
    amount of outstanding data.  The minimum of cwnd and rwnd governs
    data transmission.
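
    As a rough illustration of the limit described above, the following
    Python sketch computes how many bytes a sender may still transmit.
    The names usable_window, highest_acked and next_seq are ours and do
    not come from any particular TCP implementation, and sequence
    number wraparound is ignored for clarity.

```python
def usable_window(highest_acked, next_seq, cwnd, rwnd):
    """Bytes of data the sender may transmit right now.

    highest_acked: highest acknowledged sequence number (hypothetical name)
    next_seq:      next sequence number to be sent (hypothetical name)
    cwnd, rwnd:    congestion window and receiver window, in bytes
    """
    # No data may be sent beyond highest_acked + min(cwnd, rwnd).
    limit = highest_acked + min(cwnd, rwnd)
    return max(0, limit - next_seq)


# cwnd is the tighter limit here: 4380 bytes allowed, 2920 outstanding.
print(usable_window(highest_acked=1000, next_seq=3920, cwnd=4380, rwnd=65535))
```

    When rwnd is smaller than cwnd the same function limits the sender
    to the receiver's advertised window instead.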

    Another state variable, the slow start threshold (ssthresh), is used
    to determine whether the slow start or congestion avoidance
    algorithm is used to control data transmission, as discussed below.

    Beginning transmission into a network with unknown conditions
    requires TCP to slowly probe the network to determine the available
    capacity, in order to avoid congesting the network with an
    inappropriately large burst of data.  The slow start algorithm is
    used for this purpose at the beginning of a transfer, or after
    repairing loss detected by the retransmission timer.

    IW, the initial value of cwnd, MUST be less than or equal to MSS
    bytes.

    We note that a non-standard, experimental TCP extension allows a
    TCP to use a larger initial window (IW), as defined in equation 1
    [AFP98]:

               IW = min (4*MSS, max (2*MSS, 4380 bytes))             (1)

    With this extension, a TCP sender MAY use a 2 segment initial
    window, regardless of the segment size, and 3 and 4 segment initial
    windows MAY be used, provided the combined size of the segments does
    not exceed 4380 bytes.  We do NOT allow this change as part of the
    standard defined by this document.  However, we include discussion
    of (1) in the remainder of this document as a guideline for those
    experimenting with the change, rather than conforming to the present
    standards for TCP congestion control.
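
    To make equation (1) concrete, the following sketch evaluates the
    experimental initial window for a few segment sizes.  The MSS
    values are chosen only for illustration, and the function name
    initial_window is ours.

```python
def initial_window(mss):
    """Experimental initial window from equation (1), in bytes."""
    return min(4 * mss, max(2 * mss, 4380))


for mss in (536, 1460, 2190, 4380):
    iw = initial_window(mss)
    # Print the MSS, the resulting IW in bytes, and IW in full-sized segments.
    print(mss, iw, iw // mss)
```

    For these inputs the window works out to 4, 3, 2 and 2 full-sized
    segments respectively, which matches the description above.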

    The initial value of ssthresh MAY be arbitrarily high (for example,
    some implementations use the size of the advertised window), but it
    may be reduced in response to congestion.  The slow start algorithm
    is used when cwnd < ssthresh, while the congestion avoidance
    algorithm is used when cwnd > ssthresh.  When cwnd and ssthresh are
    equal the sender may use either slow start or congestion avoidance.

    During slow start, a TCP increments cwnd by at most MSS bytes for
    each ACK received that acknowledges new data.  Slow start ends when
    cwnd exceeds ssthresh (or, optionally, when it reaches it, as noted
    above); or when cwnd reaches rwnd; or when congestion is observed.
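
    The growth of cwnd during slow start can be sketched as follows.
    This is a simplified model, not an implementation: it assumes no
    loss, one ACK per full-sized segment, and that slow start ends when
    cwnd reaches (rather than exceeds) ssthresh, which is one of the
    two choices permitted above.  The function names are ours.

```python
def slow_start_on_ack(cwnd, mss):
    """Grow cwnd by one MSS (the maximum allowed) for an ACK of new data."""
    return cwnd + mss


def simulate_slow_start(mss, ssthresh, rwnd, rtts):
    """Return cwnd (bytes) at the start of each RTT under the assumptions above."""
    cwnd = mss                      # IW = 1 full-sized segment
    history = [cwnd]
    for _ in range(rtts):
        acks = cwnd // mss          # one ACK per segment sent in this RTT
        for _ in range(acks):
            if cwnd >= ssthresh or cwnd >= rwnd:
                break               # slow start has ended
            cwnd = slow_start_on_ack(cwnd, mss)
        history.append(cwnd)
    return history


print(simulate_slow_start(mss=1000, ssthresh=8000, rwnd=65535, rtts=5))
```

    Under these assumptions cwnd doubles each RTT until it reaches
    ssthresh and then stops growing in this model.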

    During congestion avoidance, cwnd is incremented by 1 full-sized
    segment per round-trip time (RTT).  Congestion avoidance continues
    until cwnd reaches the receiver's advertised window or congestion is
    detected.  One formula commonly used to update cwnd during
    congestion avoidance is given in equation 2:

                          cwnd += MSS*MSS/cwnd                       (2)

    This provides an acceptable approximation to the underlying
    principle of increasing cwnd by 1 full-sized segment per RTT.  (Note
    that for a connection in which the receiver acknowledges every data
    segment, (2) proves slightly more aggressive than 1 segment per RTT,
    and for a receiver acknowledging every other segment, (2) is less
    aggressive.)

    Implementation Note: Since integer arithmetic is usually used in TCP
    implementations, the formula given in equation 2 can fail to
    increase cwnd when the congestion window is very large (larger than
    MSS*MSS).  If the above formula yields 0, the result SHOULD be
    rounded up to 1 byte.
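
    A minimal sketch of equation (2) with integer arithmetic,
    including the round-up described in the note above, might look
    like this.  The function name is ours, and cwnd and MSS are
    assumed to be kept in bytes.

```python
def congestion_avoidance_on_ack(cwnd, mss):
    """Apply equation (2) once, using integer arithmetic (units: bytes)."""
    increment = (mss * mss) // cwnd
    if increment == 0:
        # cwnd > MSS*MSS: integer division yielded 0, so round up to 1 byte.
        increment = 1
    return cwnd + increment


print(congestion_avoidance_on_ack(cwnd=14600, mss=1460))      # typical window
print(congestion_avoidance_on_ack(cwnd=3000000, mss=1460))    # very large window
```

    In the second call MSS*MSS is 2131600, which is smaller than cwnd,
    so without the round-up the window would never grow.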

    Implementation Note: older implementations have an additional
    additive constant on the right-hand side of (2).  This is incorrect
    and can actually lead to diminished performance [PAD+98].

    Another acceptable way to increase cwnd during congestion avoidance
    is to count the number of bytes that have been acknowledged by ACKs
    for new data.  (A drawback of this implementation is that it
    requires maintaining an additional state variable.)  When the number
    of bytes acknowledged reaches cwnd, then cwnd can be incremented by
    up to MSS bytes.  Note that during congestion avoidance, cwnd MUST
    NOT be increased by more than the larger of either 1 full-sized
    segment per RTT, or the value computed using equation 2.

    Implementation Note: some implementations maintain cwnd in units of
    bytes, while others in units of full-sized segments.  The latter
    will find equation (2) difficult to use, and may prefer to use the
    counting approach discussed in the previous paragraph.
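
    The counting approach can be sketched as follows.  The class and
    attribute names are ours.  The text does not say what to do with
    bytes acknowledged beyond cwnd; this sketch assumes the excess is
    carried over to the next round rather than discarded.

```python
class ByteCountingAvoidance:
    """Congestion avoidance by counting acknowledged bytes (illustrative)."""

    def __init__(self, cwnd, mss):
        self.cwnd = cwnd          # bytes
        self.mss = mss            # bytes
        self.bytes_acked = 0      # the additional state variable

    def on_new_ack(self, acked_bytes):
        """Process an ACK that acknowledges acked_bytes of new data."""
        self.bytes_acked += acked_bytes
        if self.bytes_acked >= self.cwnd:
            # A full window has been acknowledged: grow by (at most) one MSS.
            self.bytes_acked -= self.cwnd   # assumption: carry over the excess
            self.cwnd += self.mss
        return self.cwnd


ca = ByteCountingAvoidance(cwnd=4000, mss=1000)
for _ in range(4):
    ca.on_new_ack(1000)
print(ca.cwnd)
```

    An implementation that keeps cwnd in units of segments could apply
    the same idea by counting acknowledged segments instead of bytes.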

    When a TCP sender detects segment loss using the retransmission
    timer, the value of ssthresh MUST be set to no more than the value
    given in equation 3:

              ssthresh = max (min (cwnd, rwnd) / 2, 2*MSS)           (3)

    Implementation Note: an easy mistake to make is to forget the inner
    min() operation and simply use cwnd, which in some implementations
    may incidentally increase well beyond rwnd.
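
    The effect of omitting the inner min() can be seen in a small
    sketch.  Both function names are ours, and the second function is
    the mistaken variant, shown only for contrast.

```python
def ssthresh_after_loss(cwnd, rwnd, mss):
    """Upper bound on ssthresh from equation (3), in bytes."""
    return max(min(cwnd, rwnd) // 2, 2 * mss)


def ssthresh_without_min(cwnd, rwnd, mss):
    """Mistaken variant that ignores rwnd (for contrast only)."""
    return max(cwnd // 2, 2 * mss)


# cwnd has grown well beyond rwnd, as some implementations allow.
print(ssthresh_after_loss(cwnd=40000, rwnd=8000, mss=1000))
print(ssthresh_without_min(cwnd=40000, rwnd=8000, mss=1000))
```

    With these example values the mistaken variant yields an ssthresh
    larger than rwnd itself, so the loss would not reduce the amount
    of data the sender actually keeps outstanding.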

    Furthermore, upon a timeout cwnd MUST be set to no more than the
    loss window, LW, which equals 1 full-sized segment (regardless of
    the value of IW).  Therefore, after retransmitting the dropped
    segment the TCP sender uses the slow start algorithm to increase the
    window from 1 full-sized segment to the new value of ssthresh, at
    which point congestion avoidance again takes over in a fashion
    identical to that for a connection's initial slow start.
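
    Putting the two timeout rules together, a sender's reaction to a
    retransmission timeout might be sketched as below.  SenderState
    and its field names are hypothetical; real implementations keep
    considerably more state.

```python
from dataclasses import dataclass


@dataclass
class SenderState:
    cwnd: int        # congestion window, bytes
    ssthresh: int    # slow start threshold, bytes
    rwnd: int        # receiver's advertised window, bytes
    mss: int         # maximum segment size, bytes


def on_retransmission_timeout(s):
    """Update ssthresh and cwnd after loss detected by the retransmission timer."""
    s.ssthresh = max(min(s.cwnd, s.rwnd) // 2, 2 * s.mss)   # equation (3)
    s.cwnd = s.mss                                          # LW = 1 full-sized segment
    return s


state = on_retransmission_timeout(
    SenderState(cwnd=16000, ssthresh=64000, rwnd=32000, mss=1000))
print(state.cwnd, state.ssthresh)
print(state.cwnd < state.ssthresh)   # True means slow start applies again
```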

3.3 Fast Retransmit/Fast Recovery

    A TCP receiver SHOULD send an immediate duplicate ACK when an
    out-of-order segment arrives.  The purpose of this ACK is to inform
    the sender that a segment was received out-of-order and which
    sequence number is expected.  From the sender's perspective,
    duplicate ACKs can be caused by a number of network problems.
    First, they can be caused by dropped segments.  In this case, all
    segments after the dropped segment will trigger duplicate ACKs.
    Second, duplicate ACKs can be caused by the re-ordering of data
    segments by the network (not a rare event along some network paths).
    Finally, duplicate ACKs can be caused by replication of ACK or data
    segments by the network.

    The TCP sender SHOULD use the "fast retransmit" algorithm to detect
    and repair loss, based on incoming duplicate ACKs.  The fast
    retransmit algorithm uses the arrival of 3 duplicate ACKs (i.e., 4
    identical ACKs) as an indication that a segment has been lost.
    After receiving 3 duplicate ACKs, TCP performs a retransmission of
    what appears to be the missing segment, without waiting for the
    retransmission timer to expire.
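
    The duplicate ACK threshold can be illustrated with a small
    counter.  This is a simplified sketch: it treats any ACK that
    repeats the previous acknowledgment number as a duplicate and
    ignores sequence number wraparound.  The class and method names
    are ours.

```python
class DupAckDetector:
    """Signal fast retransmit on the third duplicate ACK (illustrative)."""

    THRESHOLD = 3

    def __init__(self, last_ack):
        self.last_ack = last_ack   # highest acknowledgment number seen so far
        self.dup_count = 0

    def on_ack(self, ack_no):
        """Return True exactly when the third duplicate ACK arrives."""
        if ack_no == self.last_ack:
            self.dup_count += 1
            return self.dup_count == self.THRESHOLD
        if ack_no > self.last_ack:
            # New data acknowledged: start counting afresh.
            self.last_ack = ack_no
            self.dup_count = 0
        return False


d = DupAckDetector(last_ack=1000)
# One ACK for new data followed by three duplicates: 4 identical ACKs.
print([d.on_ack(a) for a in (2000, 2000, 2000, 2000)])
```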

    After the fast retransmit sends what appears to be the missing
    segment, the "fast recovery" algorithm governs the transmission of
    new data until a non-duplicate ACK arrives.  The reason for not
    performing slow start is that the receipt of the duplicate ACKs not
    only tells the TCP that a segment has been lost, but also that
    segments are leaving the network.  In other words, since the
    receiver can only generate a duplicate ACK when a segment has
    arrived, that segment has left the network and is in the receiver's
    buffer, so we know it is no longer consuming network resources.
    Furthermore, since the ACK "clock" [Jac88] is preserved, the TCP
    sender can continue to transmit new segments (although transmission
    must continue using a reduced cwnd).

    The fast retransmit and fast recovery algorithms are usually
    implemented together as follows.

    1.  When the third duplicate ACK is received, set ssthresh to no
        more than the value given in equation 3.

    2.  Retransmit the lost segment and set cwnd to ssthresh plus 3*MSS.
        This artificially "inflates" the congestion window by the number
        of segments (three) that have left the network and which the
        receiver has buffered.

    3.  For each additional duplicate ACK received, increment cwnd by
        MSS.  This artificially inflates the congestion window in order
        to reflect the additional segment that has left the network.

    4.  Transmit a segment, if allowed by the new value of cwnd and the
        receiver's advertised window.

    5.  When the next ACK arrives that acknowledges new data, set cwnd
        to ssthresh (the value set in step 1).  This is termed
        "deflating" the window.

        This ACK should be the acknowledgment elicited by the
        retransmission from step 2, one RTT after the retransmission
        (though it may arrive sooner in the presence of significant
        out-of-order delivery of data segments at the receiver).
        Additionally, this ACK should acknowledge all the intermediate
        segments sent between the lost segment and the receipt of the
        first duplicate ACK, if none of these were lost.
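
    The five steps above can be sketched as a small state machine.
    This is an illustration under simplifying assumptions, not an
    implementation: actual transmission (step 4) is left to the
    caller, normal window growth on ACKs for new data is omitted, and
    sequence number wraparound is ignored.  All names are ours.

```python
class FastRecoverySender:
    """Fast retransmit/fast recovery window bookkeeping (illustrative)."""

    def __init__(self, cwnd, ssthresh, rwnd, mss, last_ack):
        self.cwnd, self.ssthresh, self.rwnd, self.mss = cwnd, ssthresh, rwnd, mss
        self.last_ack = last_ack
        self.dup_acks = 0
        self.in_recovery = False
        self.retransmitted = []    # sequence numbers we would retransmit

    def on_ack(self, ack_no):
        if ack_no == self.last_ack:                    # duplicate ACK
            self.dup_acks += 1
            if self.dup_acks == 3:
                # Step 1: reduce ssthresh per equation (3).
                self.ssthresh = max(min(self.cwnd, self.rwnd) // 2, 2 * self.mss)
                # Step 2: retransmit and inflate cwnd by three segments.
                self.retransmitted.append(ack_no)
                self.cwnd = self.ssthresh + 3 * self.mss
                self.in_recovery = True
            elif self.dup_acks > 3 and self.in_recovery:
                # Step 3: one more segment has left the network.
                self.cwnd += self.mss
            # Step 4: the caller may now send if min(cwnd, rwnd) allows.
        elif ack_no > self.last_ack:                   # ACK for new data
            if self.in_recovery:
                # Step 5: deflate the window.
                self.cwnd = self.ssthresh
                self.in_recovery = False
            self.last_ack = ack_no
            self.dup_acks = 0


s = FastRecoverySender(cwnd=10000, ssthresh=64000, rwnd=65535, mss=1000, last_ack=5000)
for ack in (5000, 5000, 5000, 5000, 12000):
    s.on_ack(ack)
    print(ack, s.cwnd, s.ssthresh)
```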

    Implementing fast retransmit/fast recovery in this manner can lead
    to a phenomenon which allows the fast retransmit algorithm to repair
    multiple dropped segments from a single window of data [Flo94].
    However, in doing so, the size of cwnd is also reduced multiple
    times, which reduces performance.  The following steps MAY be taken
    to reduce the impact of successive fast retransmits on performance.

    A.  After the third duplicate ACK is received, follow step 1 above,
        but also record the highest sequence number transmitted
        (send_high).

    B.  Instead of reducing cwnd to ssthresh upon receipt of the first
        non-duplicate ACK received (step 5), the sender should wait
        until an ACK covering send_high is received.  In addition, each
        duplicate ACK received should continue to artificially inflate
        cwnd by 1 MSS.

    C.  A non-duplicate ACK that does not cover send_high, followed by 3
        duplicate ACKs should not reduce ssthresh or cwnd but SHOULD
        trigger a retransmission.  A non-duplicate ACK that does not
        cover send_high SHOULD NOT cause any adjustment in cwnd.
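
    Steps A through C might be sketched as follows.  This reflects one
    reading of the steps and rests on several assumptions of ours:
    send_high is stored as the sequence number just past the last byte
    sent, an ACK "covers" send_high when its acknowledgment number is
    at least send_high, and every duplicate ACK received during
    recovery inflates cwnd.  All names other than send_high are
    hypothetical.

```python
class SendHighSender:
    """Fast recovery with the send_high modification (illustrative)."""

    def __init__(self, cwnd, ssthresh, rwnd, mss, last_ack, next_seq):
        self.cwnd, self.ssthresh, self.rwnd, self.mss = cwnd, ssthresh, rwnd, mss
        self.last_ack = last_ack
        self.next_seq = next_seq   # assumption: one past the highest byte sent
        self.dup_acks = 0
        self.send_high = None      # None means "not in recovery"
        self.retransmitted = []    # sequence numbers we would retransmit

    def on_ack(self, ack_no):
        recovering = self.send_high is not None
        if ack_no == self.last_ack:                    # duplicate ACK
            self.dup_acks += 1
            if recovering:
                self.cwnd += self.mss                  # step B: keep inflating
                if self.dup_acks == 3:
                    # Step C: retransmit, but leave ssthresh and cwnd alone.
                    self.retransmitted.append(ack_no)
            elif self.dup_acks == 3:
                # Step A: steps 1 and 2, and remember send_high.
                self.ssthresh = max(min(self.cwnd, self.rwnd) // 2, 2 * self.mss)
                self.retransmitted.append(ack_no)
                self.cwnd = self.ssthresh + 3 * self.mss
                self.send_high = self.next_seq
        elif ack_no > self.last_ack:                   # ACK for new data
            self.last_ack = ack_no
            self.dup_acks = 0
            if recovering and ack_no >= self.send_high:
                # Step B: deflate only once send_high is covered.
                self.cwnd = self.ssthresh
                self.send_high = None
            # Otherwise (step C): no adjustment to cwnd.


s = SendHighSender(cwnd=10000, ssthresh=64000, rwnd=65535, mss=1000,
                   last_ack=5000, next_seq=15000)
for ack in (5000, 5000, 5000, 5000, 7000, 7000, 7000, 7000, 15000):
    s.on_ack(ack)
print(s.cwnd, s.ssthresh, s.retransmitted)
```

    In this run two segments from the same window are retransmitted,
    but ssthresh is reduced only once.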

4   Additional Considerations

4.1 Re-starting Idle Connections

    A known problem with the TCP congestion control algorithms described
    above is that they allow a potentially inappropriate burst of
    traffic to be transmitted after TCP has been idle for a relatively
    long period of time.  After an idle period, TCP cannot use the ACK
    clock to strobe new segments into the network, as all the ACKs have
    drained from the network.  Therefore, as specified above, TCP can
    potentially send a cwnd-size line-rate burst into the network after
    an idle period.

    [Jac88] recommends that a TCP use slow start to restart transmission
    after a relatively long idle period.  Slow start serves to restart
    the ACK clock, just as it does at the beginning of a transfer.  This
    mechanism has been widely deployed in the following manner.  When
    TCP has not received a segment for more than one retransmission
    timeout, cwnd is reduced to the value of the restart window (RW)
    before transmission begins.

    For the purposes of this standard, we define RW = IW = 1 full-sized
    segment.

    We note that the non-standard experimental extension to TCP defined
    in [AFP98] defines RW = min(IW, cwnd), with the definition of IW
    adjusted per equation (1) above.

    Using the last time a segment was received to determine whether or
    not to decrease cwnd fails to deflate cwnd in the common case of
    persistent HTTP connections [HTH98].  In this case, a WWW server
    receives a request before transmitting data to the WWW browser.  The
    reception of the request makes the test for an idle connection fail,
    and allows the TCP to begin transmission with a possibly
    inappropriately large cwnd.

    Therefore, a TCP SHOULD reduce cwnd to no more than RW before
    beginning transmission if the TCP has not sent data in an interval
    exceeding the retransmission timeout.
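
    The send-side idle test recommended above can be sketched as a
    single check made before transmission begins.  The function and
    parameter names are ours, and times are assumed to be in seconds
    from an arbitrary monotonic clock.

```python
def cwnd_before_sending(cwnd, restart_window, now, last_send_time, rto):
    """Return the cwnd to use when the sender is about to transmit.

    The test is based on when data was last SENT, not on when a
    segment was last received.
    """
    if now - last_send_time > rto:
        # Idle for longer than the retransmission timeout: no more than RW.
        return min(cwnd, restart_window)
    return cwnd


# Idle for 5 seconds with a 1.5 second RTO: cwnd collapses to RW.
print(cwnd_before_sending(20000, 1460, now=10.0, last_send_time=5.0, rto=1.5))
# Sent data half a second ago: cwnd is left alone.
print(cwnd_before_sending(20000, 1460, now=10.0, last_send_time=9.5, rto=1.5))
```

    Because the check uses the time of the last transmission, an
    incoming request on a persistent connection does not reset the
    idle test.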

4.2 Acknowledgment Mechanisms

    The delayed ACK algorithm specified in [Bra89] SHOULD be used by a
    TCP receiver.  When used, a TCP receiver MUST NOT excessively delay
    acknowledgments.  Specifically, an ACK MUST be generated for every
    second full-sized segment.  (This "MUST" is listed in [Bra89] in one
    place as a SHOULD and another as a MUST; here we unambiguously state
    it is a MUST.)  Furthermore, an ACK SHOULD be generated for every
    second segment regardless of size.  Finally, an ACK MUST NOT be
    delayed for more than 500 ms waiting on a second full-sized segment
    to arrive.  Out-of-order data segments SHOULD be acknowledged
    immediately, in order to trigger the fast retransmit algorithm.

    A TCP receiver MUST NOT generate more than one ACK for every
    incoming segment.
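
    A receiver following these rules might be sketched as below.  The
    sketch acknowledges every second in-order segment regardless of
    size (which also satisfies the requirement for full-sized
    segments), acknowledges out-of-order segments immediately, and
    never holds an ACK longer than the 500 ms bound.  The class,
    method and attribute names are ours, and timer handling is
    reduced to a deadline that the caller is assumed to poll.

```python
class DelayedAckReceiver:
    """Decide when to generate ACKs (illustrative)."""

    MAX_DELAY = 0.5   # seconds; upper bound on how long an ACK may be delayed

    def __init__(self):
        self.unacked_segments = 0
        self.deadline = None       # time by which a pending ACK must be sent

    def on_segment(self, now, in_order=True):
        """Return True if an ACK should be sent now (at most one per segment)."""
        if not in_order:
            return self._ack_now()         # immediate (duplicate) ACK
        self.unacked_segments += 1
        if self.unacked_segments >= 2:
            return self._ack_now()         # ACK every second segment
        self.deadline = now + self.MAX_DELAY
        return False

    def on_timer(self, now):
        """Return True if a delayed ACK is due because the deadline passed."""
        if self.deadline is not None and now >= self.deadline:
            return self._ack_now()
        return False

    def _ack_now(self):
        self.unacked_segments = 0
        self.deadline = None
        return True


r = DelayedAckReceiver()
print(r.on_segment(now=0.0), r.on_segment(now=0.1))   # second segment is ACKed
print(r.on_segment(now=1.0), r.on_timer(now=1.5))     # lone segment: timer fires
```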

    TCP implementations that implement the selective acknowledgment
    (SACK) option [MMFR96] are able to determine which segments have not
    arrived at the receiver.  Consequently, they can retransmit the lost
    segments more quickly than TCPs without SACKs.  This allows a TCP
    sender to repair multiple losses in roughly one RTT after detecting
    loss [FF96,MM96a,MM96b].  While no specific SACK-based recovery
    algorithm has yet been standardized, any SACK-based algorithm should
    follow the general principles embodied by the above algorithms.
    First, as soon as loss is detected, ssthresh should be decreased per
    equation (3).  Second, in the RTT following loss detection, the
    number of segments sent should be no more than half the number
    transmitted in the previous RTT (i.e., before loss occurred).
    Third, after the recovery period is finished, cwnd should be set to
    the reduced value of ssthresh.  The SACK-based algorithms outlined
    in [FF96,MM96a,MM96b] adhere to these guidelines.

5   Security Considerations

    This document requires a TCP to diminish its sending rate in the
    presence of retransmission timeouts and the arrival of duplicate
    acknowledgments.  An attacker can therefore impair the performance
    of a TCP connection by either causing data packets or their
    acknowledgments to be lost, or by forging excessive duplicate
    acknowledgments.  Causing two congestion control events back-to-back
    will often cut ssthresh to its minimum value of 2*MSS, causing the
    connection to immediately enter the slower-performing congestion
    avoidance phase.

    The Internet relies to a considerable degree on the correct
    implementation of these algorithms in order to preserve network
    stability and avoid congestion collapse.  An attacker could cause
    TCP endpoints to respond more aggressively in the face of congestion
    by forging excessive duplicate acknowledgments or excessive
    acknowledgments for new data.  Conceivably, such an attack could
    drive a portion of the network into congestion collapse.

Acknowledgments

    The four algorithms that are described were developed by Van
    Jacobson.

    Some of the text from this document is taken from "TCP/IP
    Illustrated, Volume 1: The Protocols" by W. Richard Stevens
    (Addison-Wesley, 1994) and "TCP/IP Illustrated, Volume 2: The
    Implementation" by Gary R. Wright and W. Richard Stevens
    (Addison-Wesley, 1995).  This material is used with the permission
    of Addison-Wesley.

    Sally Floyd devised the algorithm presented in section 3.3 for
    avoiding multiple cwnd cutbacks in the presence of multiple packets
    lost from the same flight.  Craig Partridge and Joe Touch
    contributed a number of helpful suggestions.

References

    [AFP98] M. Allman, S. Floyd, C. Partridge, "Increasing TCP's Initial
        Window Size", Internet-Draft draft-floyd-incr-init-win-03.txt,
        May 1998.  (Work in progress).

    [Bra89] B. Braden, ed., "Requirements for Internet Hosts --
        Communication Layers," RFC 1122, Oct. 1989.

    [Bra97] S. Bradner, "Key words for use in RFCs to Indicate
        Requirement Levels", BCP 14, RFC 2119, March 1997.

    [FF96] K. Fall, S. Floyd, "Simulation-based Comparisons of Tahoe,
        Reno and SACK TCP", Computer Communication Review, July 1996.
        ftp://ftp.ee.lbl.gov/papers/sacks.ps.Z.

    [Flo94] S. Floyd, "TCP and Successive Fast Retransmits", Technical
        report, October 1994.
        ftp://ftp.ee.lbl.gov/papers/fastretrans.ps.

    [HTH98] A. Hughes, J. Touch, J. Heidemann, "Issues in TCP Slow-Start
        Restart After Idle", Internet-Draft
        draft-ietf-tcpimpl-restart-00.txt, March 1998.  (Work in
        progress).

    [Jac88] V. Jacobson, "Congestion Avoidance and Control," Computer
        Communication Review, vol. 18, no. 4, pp. 314-329, Aug. 1988.
        ftp://ftp.ee.lbl.gov/papers/congavoid.ps.Z.

    [Jac90] V. Jacobson, "Modified TCP Congestion Avoidance Algorithm,"
        end2end-interest mailing list, April 30, 1990.
        ftp://ftp.isi.edu/end2end/end2end-interest-1990.mail.

    [MM96a] M. Mathis, J. Mahdavi, "Forward Acknowledgment: Refining TCP
        Congestion Control," Proceedings of SIGCOMM'96, August, 1996,
        Stanford, CA.  Available from
        http://www.psc.edu/networking/papers/papers.html

    [MM96b] M. Mathis, J. Mahdavi, "TCP Rate-Halving with Bounding
        Parameters" Available from
        http://www.psc.edu/networking/papers/FACKnotes/current.

    [MMFR96] M. Mathis, J. Mahdavi, S. Floyd, A. Romanow, "TCP Selective
        Acknowledgement Options", RFC 2018, October 1996.

    [PAD+98] V. Paxson, M. Allman, S. Dawson, J. Griner, I. Heavens,
        K. Lahey, J. Semke, B. Volz, "Known TCP Implementation
        Problems", Internet-Draft draft-ietf-tcpimpl-prob-04.txt,
        August 1998.  (Work in progress).

    [Pos81] J. Postel, "Transmission Control Protocol", RFC 793,
        September 1981.

    [Ste94] W. R. Stevens, "TCP/IP Illustrated, Volume 1: The
        Protocols", Addison-Wesley, 1994.

    [Ste97] W. R. Stevens, "TCP Slow Start, Congestion Avoidance, Fast
        Retransmit, and Fast Recovery Algorithms", RFC 2001, January
        1997.

    [WS95] G. R. Wright, W. R. Stevens, "TCP/IP Illustrated, Volume 2:
        The Implementation", Addison-Wesley, 1995.

Authors' Addresses:

    W. Richard Stevens
    1202 E. Paseo del Zorro
    Tucson, AZ  85718
    520-297-9416
    rstevens@kohala.com
    http://www.kohala.com/~rstevens

    Mark Allman
    NASA Lewis Research Center/Sterling Software
    21000 Brookpark Rd.  MS 54-2
    Cleveland, OH  44135
    216-433-6586
    mallman@lerc.nasa.gov
    http://gigahertz.lerc.nasa.gov/~mallman

    Vern Paxson
    Network Research Group
    Lawrence Berkeley National Laboratory
    Berkeley, CA 94720
    USA
    510-486-7504
    vern@ee.lbl.gov