INTERNET-DRAFT                  Amy S. Hughes, Joe Touch, John Heidemann
draft-hughes-restart-00.txt                                          ISI
                                                            Dec. 1, 2001
                                                   Expires: Jun. 1, 2002

              Issues in TCP Slow-Start Restart After Idle

Status of this Memo

   This document is an Internet-Draft and is NOT offered in accordance
   with Section 10 of RFC2026, and the author does not provide the IETF
   with any rights other than to publish as an Internet-Draft.
   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF), its areas, and its working groups.  Note that
   other groups may also distribute working documents as Internet-
   Drafts.

   Internet-Drafts are draft documents valid for a maximum of six months
   and may be updated, replaced, or obsoleted by other documents at any
   time.  It is inappropriate to use Internet-Drafts as reference
   material or to cite them other than as "work in progress."

   The list of current Internet-Drafts can be accessed at
   http://www.ietf.org/1id-abstracts.html

   The list of Internet-Draft Shadow Directories can be accessed at
   http://www.ietf.org/shadow.html

   The distribution of this document is unlimited.

Abstract

   This draft discusses variations in the TCP 'slow-start restart' (SSR)
   algorithm, and the unintended failure of some variations to properly
   restart in some environments. SSR is intended to avoid line-rate
   bursts after idle periods, where TCP accumulates permission to send
   in the form of ACKs, but does not consume that permission
   immediately. SSR's original "restart after send is idle" is commonly
   implemented as "restart after receive is idle". The latter
   unintentionally fails to restart for bidirectional connections where
   the sender's burst is triggered by a reverse-path data packet, such
   as in persistent HTTP. Both the former and latter are shown to permit
   bursts in other circumstances. Several solutions are discussed, and
   their implementations evaluated.

   This document updates draft-ietf-tcpimpl-restart-01.txt.  It is a
   product of the LSAM, X-Bone, and DynaBone projects at ISI.  Comments
   are solicited and should be addressed to the authors.



Expires Jun. 1, 2002                                            [Page 1]


Hughes et al.       Issues in TCP Slow-Start Restart        Dec. 1, 2001


Introduction

   Slow-Start Restart (SSR) describes one TCP behavior to respond to
   long sending pauses in an open connection.  When a sender becomes
   idle, the normal ACK-clocking mechanism which regulates traffic is no
   longer present and the sender may introduce a burst of packets into
   the network as large as the current congestion window (CWND).  Such a
   burst may be too large for the intermediate routers to handle and may
   be too large for the receiver to handle at one time as well.

   A send timer was first proposed [JK92] to detect idle sending
   periods; the recommended response is to close the congestion window
   and perform a new slow-start.  However, a footnote to this first
   proposed solution noted that send/receive symmetry on the channel
   meant that a receive timer could be used instead to achieve the same
   results.  As this second solution takes advantage of a timer that is
   already required (to detect packet loss) it was implemented by
   Jacobson and Karels.  This solution has been repeated in
   implementations which derive from their work.

   Bursty connections, such as the persistent connections required in
   HTTP/1.1 [FGMFB97] have been found to interact in meaningful ways
   with SSR [Hei97].  In fact, it was discovered that SSR never occurs
   with HTTP/1.1 [Poo97].  This is because a new request will reset the
   receive timer (as suggested in the footnote in [JK92]) and the
   sending pause will not be detected [Tou97].

   Further, both timer solutions depend on the retransmit timeout (RTO)
   and cannot detect send pauses that are shorter than this duration.
   In such cases, the sender may transmit a burst as large as the full
   congestion window.

Burst Detection.

   There are several ways of determining whether a connection is at risk
   of sending a burst of packets into the channel.  We will discuss each
   method below, from the least radical to the most radical.

 Receive Timer:

   The use of a receive timer is the most common burst detection method.
   It is attractive because it is simple and makes use of an existing
   timer.  However, a receive timer does not properly detect bursts in
   HTTP/1.1 because the timer is cancelled when the request packet is
   received.  Further, when the connection is idle for less than a full
   RTO, a burst cannot be detected.  Such a burst can happen when the
   connection is "nearly idle" or when ACKs are lost or reordered.




Expires Jun. 1, 2002                                            [Page 2]


Hughes et al.       Issues in TCP Slow-Start Restart        Dec. 1, 2001


 Send Timer:

   A send timer is the reciprocal solution to using a receive timer.
   While it requires a new timestamp field to be maintained, it clearly
   detects send pauses and corrects the problem presented by HTTP/1.1.
   However, as with the receive timer, it cannot detect bursts that
   could happen before a full RTO.

 Packet Counting:

   An alternative method examines the unused portion of the congestion
   window to determine if the capacity to burst exists.  This method is
   simple, it uses existing information to make its decision, and it
   solves both the HTTP/1.1. problem as well as the RTO problem.  In
   addition, it addresses the problem that needs to be solved (bursts)
   instead of a specific circumstance where the problem could happen
   (send pauses).  However, where timer detection avoids defining a
   burst (it defines idle periods instead), here a burst must be defined
   before it can be detected.  One possible definition is the situation
   where the available portion of the sending window is some proportion
   of the entire congestion window, say 50%.  Another definition places
   a numerical limit on the available portion of the congestion window,
   say 4 or CWND-1 packets.

Burst Response

   Once a burst is detected, there are several different ways to take
   action.  The different possibilities are listed below, again from
   least to most radical.

 Full Restart:

   Reducing the congestion window to one packet and re-entering slow-
   start, the original slow-start restart, is one response.  This was
   the solution proposed by Jacobson and Karels.  This is a very
   conservative response and it defeats most of the speedup that
   HTTP/1.1 provides [HOT97].  Current proposals [FAP97] have suggested
   increasing the initial window from 1 packet to 4 packets.  Further,
   depending on the method of burst detection, Full Restart can be far
   more punitive than it should be.  Coupled with a timer, full restart
   is most likely to respond to a completely empty congestion window.
   Coupled with Packet Counting, the response could close the window too
   far, even smaller than the amount of outstanding data.

 Window Limiting:

   This is a modified version of Full Restart which solves the problem
   created by using Packet Counting to detect bursts.  With this type of



Expires Jun. 1, 2002                                            [Page 3]


Hughes et al.       Issues in TCP Slow-Start Restart        Dec. 1, 2001


   response, the congestion window is reduced to the amount of
   outstanding data plus the slow-start initial window (1, 2, or 4).  It
   works exactly like Full Restart in the idle case, but is successful
   at controlling bursts in an active connection.  Further, in an active
   connection, it effectively implements a leaky bucket of the initial
   window size for the accumulation of send opportunity based on the
   receipt of ACKs.  This solution is fairly conservative, especially as
   it defaults to Full Restart, but more importantly, sending
   opportunity is simply lost if not used, and is not available for
   paced output.  Also, it forces negative congestion feedback on the
   congestion window.

 Burst Size Limitation:

   When a burst is detected, its effects are limited, the sender may not
   send any more than a preset number of packets into the network.  This
   is less conservative than the first two responses in that it does not
   affect the size of the congestion window, and it is simple to
   implement, simply count up the number of packets you can send and
   stop when you reach the limit. The burst count is reset according to
   some policy, such as when ACKs are received, each time send is
   called, or when some timer triggers. The behavior is determined by
   how these parameters are combined to reset the count.

 Pacing:

   When a burst is detected, packets are periodically sent into the
   network until the sender starts receiving acks and normal maintenance
   can be resumed [VH97,PK98].  This solution is very easy on the
   network and scales well in cases of high bw/delay.  However, it
   requires a new timer and additional research for parameter tuning.

Implemented Solutions

   Now we will examine combinations of the different detection and
   response methods presented above.  Each of the solutions below have
   been implemented in some form.

 BSD Implementation (Jacobson and Karels)

   The most common implementation uses a receive timer coupled with Full
   Restart.  This is the implementation that causes the interaction
   problems with HTTP/1.1.  The obvious alternative is to implement a
   send timer as originally intended and use Full Restart.  There are
   several drawbacks to this solution.  First, a send timer adds
   additional state and serves no purpose other than to correct the
   bursting behavior after send pauses.  Second, forcing a slow-start in
   this situation is problematic for HTTP/1.1.  A slow-start for each



Expires Jun. 1, 2002                                            [Page 4]


Hughes et al.       Issues in TCP Slow-Start Restart        Dec. 1, 2001


   new user request adds a delay burden to characteristically small HTTP
   responses.  Further, the HTTP user request pattern is unpredictable.
   It is possible for the user to make a new request before the send
   timer expires, triggering a burst that would defeat such a timer.

 Maximum Burst Limitation (Floyd)

   Floyd has proposed a coupling of Packet Counting with Burst Size
   Limitation.  It was developed to avoid bursts caused by gaps in the
   ACK stream. Such gaps cause the send window to slide forward in
   jumps, rather than smoothly by 1-2 MTUs; subsequent send calls can
   thus generate bursts. This solution has been implemented in "ns" and
   it prevents the sender from transmitting a series of back-to-back
   packets larger than the user configured burst limit (suggested to be
   5 packets) [NS97].

   Maxburst defines a new function, send_much, with a parameter,
   maxburst, which is called every time an external event occurs, such
   as ACK is received or a timeout occurs. Each time send_much is
   called, at most maxburst packets are emitted. This forces interleaved
   processing of ACKs and sending of data. Sends initiated by the
   arrival of data in the buffer are not affected; as a result, if there
   is less than RTO of delay, and the send buffer is filled by the
   application, Maxburst can still send a burst the size of an entire
   window.

   Maxburst does not explicitly address the sending of bursts after an
   idle period; it relies on the existing mechanism to collapse CWND
   after an RTO, because send_much does not replace send as the
   application interface to transmitting data.

   The limit of 5 packets is designed to allow non-sequental ACK losses,
   and allow the CWND to grow normally under slow-start. Here is a list
   of the possible burst limits, and the need for each:


       1 packet per ACK or nothing gets done

       2 packets per ACK if ACKs are compressed

       3 packets during window growth; 2 for the compressed ACK, 1 for
       the growth

       4 packets per ACK received if ACKs are lost (not back-to-back),
       and ACKs are compressed

       5 packets per ACK received if ACKs are lost (not back-to-back),
       ACKs are compressed, and the window is growing



Expires Jun. 1, 2002                                            [Page 5]


Hughes et al.       Issues in TCP Slow-Start Restart        Dec. 1, 2001


 Use it or lose it (UI/LI) (Hughes, Touch, and Heidemann)

   One proposed solution combines Packet Counting with Window Limiting,
   and the IETF community refers to it as "Use-It-or-Lose-It" (it was
   originally called Congestion Window Monitoring, CWM, but we will use
   the acronym UI/LI, pronounced 'wee-lee', here).  Whenever (CWND -
   outstanding data > 4), we reduce CWND to (outstanding data + 4).  The
   choice of 4 packets is discussed in with the implementation details
   below.  UI/LI allows the congestion window to grow normally but
   shrinks the congestion window as the sender becomes idle.  It also
   prevents the sender from transmitting any bursts larger than 4
   packets in response to a new request.  Because UI/LI is not dependent
   on any timers, the loss of an ACK or a nearly idle connection cannot
   cause any bursts.  UI/LI is similar to Maxburst, but avoids the burst
   by reducing CWND, rather than by inhibiting the sends directly.  As a
   result, we avoid the potential problem of sequential calls to
   "tcp_output", which would cause bursts in Maxburst.  UI/LI also
   causes TCP to use the feedback of 'not using the CWND fast enough',
   which results in a decrease in the CWND.

   UI/LI effectively imposes a leaky bucket type limitation on the
   congestion window.  The window is allowed to grow and be managed
   normally but the sender is not allowed to save up any sending
   opportunities.  Any opportunity that is not used is lost.  This
   property of UI/LI forces interleaved reception of ACKs and processing
   of sends.

 Burst-or-lose (Touch, Floyd)

   We have considered a modified version of Maxburst, in which all
   sends, including those initiated by data arrival from the
   application, limit their burst size each time they are called, and
   reset their flag at that time. We call this burst-or-lose (BOL). The
   result is very similar to UI/LI, but the bucket reflects only the
   permission to send created as the result of an ACK receipt, rather
   than the overall limit of the congestion window.

   In this case, it is unncessary to collapse CWND after an RTO. The
   primary reason for the collapse is to avoid bursts; BOL avoids bursts
   completely, but recovers as quickly as the ACK clocking can be
   restored.

   In the case when there is no data to send, multiple ACK arrivals do
   not increase the bucket available for sending. This avoids the bursts
   after idle periods. In the case when ACKs are lost, the remaining ACK
   generates permission to send only a fixed amount of data. The
   application cannot defeat this bucket by calling send multiple times.




Expires Jun. 1, 2002                                            [Page 6]


Hughes et al.       Issues in TCP Slow-Start Restart        Dec. 1, 2001


 Rate Based Pacing (Visweswaraiah and Heidemann)

   Rate Based Pacing (RBP) combines the Pacing response with either a
   Send Timer or Packet Counting.  It avoids slow-start when resuming
   after sending pauses and allows the normal clocking of packets to be
   gracefully restarted.  When a burst potential is detected, the
   algorithm meters a small burst of packets into the channel [VH97].
   RBP is the least conservative solution to the bursting problem
   because it continues to make use of the pre-pause congestion window.
   If network conditions have changed significantly, maintaining the
   previous window could cause the paced connection to be overly
   aggressive as compared to other connections.  (Although some work
   suggests congestion windows are stable over multi-minute timeframes
   [BSSK97].)  More recently pacing has been suggested for use in
   wireless networking scenarios [BPK97], and for satellite connections.

Solution Comparison

   Below we present a comparison of the solutions presented above, plus
   the original Send Timer solution.  The Send Timer solution is not in
   use, but we implemented it in FreeBSD for comparison purposes.

   We contrast each solution on four criteria.  A "yes" in the "HTTP
   Burst" column indicates that this solution will solve the bursting
   problem created by HTTP/1.1.  A "yes" in the "Other Burst" column
   indicates that the solution will prevent bursts in other situations
   as well.  A "yes" in the "Adj CWND" column indicates that the
   solution involves adjusting the value of CWND.  The final column,
   "Code Size", gives the number of lines of code needed to correctly
   implement the solution.


                 |
                 | HTTP Burst   Other Burst   Adj CWND   Code Size
      -----------+-------------------------------------------------
      Send Timer |    yes           no          yes      ~5 lines
      -----------+-------------------------------------------------
       Rcv Timer |     no           no          yes       3 lines
      -----------+-------------------------------------------------
        Maxburst |    yes           yes          no      ~5 lines
      -----------+-------------------------------------------------
           UI/LI |    yes           yes         yes       3 lines
      -----------+-------------------------------------------------
             BOL |    yes           yes          no      ~5 lines
      -----------+-------------------------------------------------
             RBP |    yes           yes          no      ~30 lines





Expires Jun. 1, 2002                                            [Page 7]


Hughes et al.       Issues in TCP Slow-Start Restart        Dec. 1, 2001


There are other conditions under which to compare these algorithms,
including behavior after a single ACK loss, behavior after multiple ACK
losses (sequential and non-sequential), and the window regrowth proper-
ties. These are outside the scope of this document.

                 |
                 |            Recovery rate
      -----------+-------------------------------------------------
      Send Timer |             SS then CA
      -----------+-------------------------------------------------
       Rcv Timer |             SS then CA
      -----------+-------------------------------------------------
        Maxburst |          (not applicable)
      -----------+-------------------------------------------------
           UI/LI | SS and or CA, depending on SSTHRESH
      -----------+-------------------------------------------------
             BOL |          (not applicable)
      -----------+-------------------------------------------------
             RBP |                1 RTT


The only significant difference between UI/LI and BOL is whether CWND is
affected, and thus the recovery period after the loss of an ACK. UI/LI
interprets ACK loss as an event warranting traditional TCP window recov-
ery, whereas BOL and Maxburst do not. Both UI/LI and BOL avoid the need
for idle detection for the purpose of CWND adjustment, and implement a
type of leaky bucket. Although UI/LI was intended as a type of leaky-
bucket system, BOL is much more directly analogous.

Experimental Comparisons

Packet traces of the current FreeBSD implementation of SSR (using the
receive timer), of a modified version of FreeBSD using a send timer, and
of UI/LI with HTTP/1.1 support the above observations.  The graphs below
show traces of two HTTP/1.1 requests.  The first request opens the CWND.
How the server responds to the packet containing the second request
reveals the differences between implementations.

The first graph shows the trace with FreeBSD v2.2.6, but the relevant
portions of the code have not been altered in more current releases.  In
response to the first request, the server performs slow-start normally.
After 10 seconds, and after the retransmission timeout (RTO), the second
request arrives.  At this point, receipt of the second request resets
the receive timer.  The server, under the belief that communication has
not gone idle, sends a CWND-sized burst of packets into the network.






Expires Jun. 1, 2002                                            [Page 8]


Hughes et al.       Issues in TCP Slow-Start Restart        Dec. 1, 2001


                       Idle trace with Receive Timer
sequence number (in bytes)
          +------+-------+------+-------+------+-------+------+D-A---+
          +      +       +      +       +      +       +    DDAA     |
   100000 ++   Client_ACK  A                               DDAA     ++
          |    Client_SEG  C                            D  A A       |
          |    Server_SEG  D                            D A          |
    80000 ++                                            D A         ++
          |                                        D  A              |
          |                                        D                 |
    60000 ++                         D AA          D                ++
          |                       D A                                |
          |                       D A                                |
          |                   D ADDAA                                |
    40000 ++                  D A                                   ++
          |                   D A                                    |
          |               D AAD A                                    |
    20000 ++              D A                                       ++
          |           D A A                                          |
          +      +D AAD A+      +       +      +       +      +      |
        0 ++A-D-A+-------+------+-------+------+-C-----+------+-----++
          0      2       4      6       8     10      12     14
                       time since first SYN (in seconds)


The second graph shows the trace with FreeBSD v2.2.6 altered to use a
send timer to detect idle communication.  As expected, the server
responds to the first request with normal slow-start.  Now, the receipt
of the second request comes in after the RTO and the idle period is cor-
rectly detected.  In response, the server re-enters slow-start for the
second response.




















Expires Jun. 1, 2002                                            [Page 9]


Hughes et al.       Issues in TCP Slow-Start Restart        Dec. 1, 2001


                         Idle trace with Send Timer
sequence number (in bytes)
          +-------------+-------------+------------+-------------+DAAA
          +             +             +            +           D AAAA|
   100000 ++   Client_ACK  A                                   DADA ++
          |    Client_SEG  C                                   DA A  |
          |    Server_SEG  D                                D AAD    |
    80000 ++                                                DAAA    ++
          |                                              D AAD       |
          |                                           D AAD          |
    60000 ++                      D AA          D AA AA             ++
          |                     D AA                                 |
          |                     DAAA                                 |
          |                  D AAAA                                  |
    40000 ++                 D AA                                   ++
          |                  DADA                                    |
          |               D AAAA                                     |
    20000 ++              DADA                                      ++
          |            DDAAD                                         |
          +         DAAAA A           +            +             +   |
        0 ++----AADAA---+-------------+--------CC--+-------------+--++
          0             5            10           15            20
                       time since first SYN (in seconds)



The third graph shows the trace with FreeBSD altered to use UI/LI to
limit bursts.  Again, the first request and response are normal.  This
shows that UI/LI allows the congestion window to grow normally.  With
the second request we see the main difference of UI/LI.  Again, the sec-
ond request comes after the RTO, and here we see the main difference
with UI/LI.  In response to the second request, the server does not send
a burst the size of CWND or enter slow-start.  Instead, it sends several
small 4-packet bursts, with the CWND growing normally from this point.

















Expires Jun. 1, 2002                                           [Page 10]


Hughes et al.       Issues in TCP Slow-Start Restart        Dec. 1, 2001


                           Idle trace with UI/LI
sequence number (in bytes)
          +-------+------+-------+------+-------+------+------DD-A---+
          +       +      +       +      +       +      +      D+A    |
   100000 ++   Client_ACK  A                               D  A A   ++
          |    Client_SEG  C                              DDAA       |
          |    Server_SEG  D                              D A        |
    80000 ++                                          DD AD         ++
          |                                       D  AD A            |
          |                                   D  AD AA               |
    60000 ++                            D A   D                     ++
          |                           DDA A                          |
          |                           D A                            |
          |                        D AD A                            |
    40000 ++                      DDAA                              ++
          |                       D A                                |
          |                   DDAAA                                  |
    20000 ++              D  AD A                                   ++
          |               D A                                        |
          +       D A D AAA      +      +       +      +       +     |
        0 ++A-D-A-D------+-------+------+---C---+------+-------+----++
          0       2      4       6      8      10     12      14
                       time since first SYN (in seconds)



Note, RTO is the usual timer limit, but any value would have the same
results, depending on when the second request was presented in relation
to the timer.  If the second request arrives after the timer expires,
the reponse will behave as presented in these graphs.  If the second
request arrives before the timer expires and after all outstanding pack-
ets have been acknowledged, systems with a send or receive timer will
send a CWND-sized burst. If the second request arrives before the last
ACK but after the last packet sent, the server has the potential to send
a burst no larger than CWND.  UI/LI will limit any burst to 4 packets
regardless of the timing of the second request.

Implementation of UI/LI

UI/LI requires a simple modification to existing TCP output routines.
The changes required replace the current idle detection code.  Below is
the suggested pseudocode from Appendix C of [JK92]:

          int idle = (snd_max == snd_una);
          if (idle && now - lastrcv >= rto)
               cwnd = 1;

   UI/LI would replace that code as below:



Expires Jun. 1, 2002                                           [Page 11]


Hughes et al.       Issues in TCP Slow-Start Restart        Dec. 1, 2001


          int idle = (snd_max == snd_una);
          int maxwin = 4 + snd_nxt - snd_una;
          if (cwnd > maxwin)
               cwnd = maxwin;

   Packet counting is implemented by line 2.  Lines 3 and 4 implement
   Window Limitation.  Note that the comparison in line 1 (defining
   "idle" [JK92]) is required for later Silly Window Syndrome avoidance
   and could now be moved there.

   We have implemented UI/LI in FreeBSD.  The affected code is in the
   file /sys/netinet/tcp_output.c.  The lines:

          idle = (tp->snd_max == tp->snd_una);
          if (idle && tp->t_idle >= tp->t_rxtcur)
               tp->snd_cwnd = tp->t_maxseg;

   Are replaced with:

          idle = (tp->snd_max == tp->snd_una);
          maxwin = (4 * tp->t_maxseg) + tp->snd_nxt - tp->snd_una;
          if (tp->snd_cwnd > maxwin)
               tp->snd_cwnd = maxwin;

   The choice of limiting the available congestion window to 4 packets
   is based on the normal operation of TCP.  An ACK received by the
   sender may be in response to the receipt of 2 packets, allowing
   another 2 to be sent.  Further, normal window growth may require the
   sending of a third packet.  Lastly, in slow-start with delayed ACKs,
   the receipt of an ACK can trigger the sending of 4 packets.  Thus, 4
   packets is a reasonable burst to send into the network.

   Increasing the initial window in slow-start to 4 packets has already
   been proposed [FAP97].  The effects of this change have been explored
   in simulation in [PN98] and in practice in [AHO98].  Such a modifica-
   tion to TCP would cause the same behavior as our solution in the
   cases where the pause timer has expired.  It does not address the
   pre-timeout bursting situation we are concerned with.

Implementation of BOL

   BOL can be implemented with a small amount of additional state, and a
   small number of lines in TCP's input, output, and timer routines. The
   state indicates the number of packets that can be emitted before the
   next timer or ACK receipt. The other lines increment the state when a
   notable event occurs, and decrement it when packets are sent.

   Our implementation uses two state variables: bucket - the number of



Expires Jun. 1, 2002                                           [Page 12]


Hughes et al.       Issues in TCP Slow-Start Restart        Dec. 1, 2001


   packets that can be immediately sent, and ack_ratio - the number of
   packets per ACK (upper bound). ack_ratio is implied by RFC-2581, as
   at least one ACK "SHOULD" be sent for every 2 MSSs received, i.e.,
   ACK compression [APS99]. This is our default; we parameterize it so
   that when others modify TCP to use less frequent ACKs, our portion of
   the code will operate correctly. It should be noted that less fre-
   quent ACKs will, by definition, increase the burstiness of the source
   as a result. We also assume 'initial_win' is defined, currently 2
   MSS, but may be configurable as per Experimental RFC-2414 [FAP97].

   There are two events that set the value of bucket. When an ACK is
   received, we set bucket to ack_ratio * 2 + 1. This is for the same
   reasons as Maxburst uses '5' - to allow for TCP to run at full rate
   in the presence of isolated ACK losses. On timeout events, the bucket
   is set to init_win.

   When packets are emitted in the existing tcp_output routine, the
   bucket is decremented. The bucket does not become negative; if it
   would, that data is not sent, and the routine returns. This 'refusal
   to send' behavior is identical to not having sufficient send window;
   the application will retry, typically due to some timer event.

   The code is as follows:

          /* define the state for BOL */
               int  ack_ratio,     /* max number of packets per ack */
               int  bucket,        /* max burst size */

          #DEFINE init_win 2

          /* set the bucket on ACK receipt */
               bucket = ack_ratio * 2 + 1;

          /* set bucket on timeout */
               bucket = init_win;

          /* decrement the bucket accordingly */
               can_send = min(bucket - want_send, 0);
               bucket -= can_send;
               do_send(can_send);
               return can_send;

Local connections

   We recently discovered cases where SSR adversely affected local con-
   nections, i.e., connections within the same subnet. There is already
   code in the current BSD-derived TCP implementation that disables ini-
   tializing the congestion window (CWND) to a small value, when both



Expires Jun. 1, 2002                                           [Page 13]


Hughes et al.       Issues in TCP Slow-Start Restart        Dec. 1, 2001


   ends of a connection share a subnet. This code, from FreeBSD
    tcp_input.c, is outlined below, where inp = tp->t_inpcb.

          /*
           * Don't force slow-start on local network.
           */
          if (!in_localaddr(inp->inp_faddr))
               tp->snd_cwnd = mss;

   RFC-2414 observes that there are 3 types of window sizes used in BSD
   TCP:

               1. intital window
               2. restart after timeout
               3. restart after loss

   Recent work indicates that these windows should be treated differ-
   ently; e.g., the restart-after-loss window should always be set to 1,
   and the initial window can be set to 4 or not used for local connec-
   tions [FAP97]. We proposed that it is consistent to extend this local
   exception to the restart-after-timeout window, as shown below as a
   FreeBSD code segment:

          idle = (tp->snd_max == tp->snd_una);
          maxwin = (4 * tp->t_maxseg) + tp->snd_nxt - tp->snd_una;
          if ((tp->snd_cwnd > maxwin) &&
             !in_localaddr(tcp->t_inpcb->inp_faddr))
               tp->snd_cwnd = maxwin;

   This modification is considered reasonable because it corresponds to
   the initial window determination. There are other cases where TCP
   uses "local" rules, such as MSS determination, and piggybacking.

Performance Issues

   The purpose of these modifications is to avoid line-rate bursts, pri-
   marily as the result of the source being idle, but also due to ACK
   losses. A result of many of the proposed solutions is to limit the
   source, either by reducing the congestion window, or limiting the
   transmission of bursts. The performance effects of congestion window
   reduction are well-studied, and are not addressed here. BOL can limit
   the source without modifying the congestion window, and there are
   concerns that this may affect the performance of TCP.

   In particular, some implementations of TCP send ACKs in a bursty
   fashion, either by reducing the send/receive context switching, or by
   reducing the overall number of ACKs. These modifications are used to
   increase the performance of TCP, but can also increase the burstiness



Expires Jun. 1, 2002                                           [Page 14]


Hughes et al.       Issues in TCP Slow-Start Restart        Dec. 1, 2001


   of the source. BOL, by reducing the burstiness of the source, is con-
   sidered to potentially limit the effectiveness of these enhancements.

   TCP currently recommends that an ACK "SHOULD" be sent for at least
   every two MSS's received. We assume that this recommendation implies
   an interleaving factor for the source as well. If this factor is
   encoded in a TCB variable (ack_ratio), then BOL can allow for corre-
   sponding bursts consistent with reasonable (non-adjacent) ACK losses.

   We assume that TCPs should not delay ACK processing to aggregate the
   receive-side timeslices, in order to reduce send/receive interleav-
   ing, unless this is also encoded in their ack_ratio variable. Our
   initial measurements indicate that even the recommended interleaving
   does not substantially affect the performance of TCP.

   In addition, we recommend that any burst-avoidance modification be
   disabled for direct LAN connections. For such connections, the con-
   gestion window mechanism is already disabled in several TCP implemen-
   tations. The resulting burstiness is handled by the link layer and
   receiver buffers.

Conclusions

   At this time, we propose BOL as a simple, minimal and effective fix
   to the 'bug' in current TCP implementations that is exploited by
   HTTP/1.1.  Modifications can be made to TCP to solve the slow-start
   restart problem that are consistent with the original congestion
   avoidance specifications (i.e. a send timer).  However, we feel that
   the original intended behavior is not appropriate to some current
   applications, specifically HTTP. Thus, we recommend BOL to prevent
   bursts into the network.  Not only does this solution solve the cur-
   rent problem in a simple way, it will prevent bursting in any other
   situation that might arise. The packet bursts which we allow are con-
   sistent with congestion window growth algorithms and with Floyd's
   conclusion about increasing the initial window size.

   BOL, as well as the other solutions listed, need to be re-evaluated
   within emerging TCP implementations, e.g., SACK [JB88].  In general,
   TCP has no rate pacing and uses congestion control to avoid bursts in
   current implementations.  A more explicit mechanism, such as RBP or
   similar proposals may be desirable in the future.

Security implications

   There are security risks associated with these algorithms that can
   result in denial of service attacks.  Intermediate routers can delay
   data or ACKs in ways that can cause the restarting party to restart
   more conservatively. They also result in less probability of the



Expires Jun. 1, 2002                                           [Page 15]


Hughes et al.       Issues in TCP Slow-Start Restart        Dec. 1, 2001


   restarting party generating a line-rate burst.

   In Send-Timer and Maxburst algorithms, as well as in a no-restart
   system, an attacker (either an intermediate router or data receiver)
   can clump the delivery of ACKs, which can result in line-rate bursts.

   The Receive-Timer, Rate-Based Pacing, UI/LI, and BOL algorithms use
   data or ACK arrival to moderate outgoing data and avoid bursts. If
   the attacker attempts to deliver data or ACKs so as to cause a burst,
   these algorithms successfully avoid creating a burst. In the process,
   their throughput is reduced.

   We believe the latter algorithms should be considered more robust
   than the former, because slow-start restart focuses on avoiding
   bursts, rather than maximizing throughput. In the recommended algo-
   rithms (BOL now, RBP for the future), the system succesfully avoids
   generating a burst as the result of an attacker.

Acknowledgements

   We would like to acknowledge Kacheong Poon, Sally Floyd, members of
   the End-to-End-Interest Mailing List, and members of the IETF TCP
   Implementation Working Group for their feedback and discussions on
   this document. We would also like to thank Ruth Aydt, Steve Tuecke,
   and Carl Kesselman for their observation of the local connection
   issue.

References


   [AHO98] Mark Allman, Chris Hayes, and Shawn Ostermann.  An Evaluation
       of TCP with Larger Initial Windows.  ACM Computer Communications
       Review, 28(3), 41-52 , July 1998.

   [APS99] M. Allman, V. Paxson, W. Stevens.  TCP Congestion Control,
       April 1999.  RFC-2581.

   [BPK97] Hari Balakrishnan, Venkata N. Padmanabhan, and Randy H. Katz.
       The Effects of Asymmetry on TCP Performance.  In Proceedings of
       the ACM/IEEE Mobicom, Budapest, Hungary, ACM.  September, 1997.

   [BSSK97] Hari Balakrishnan, Srinivasan Seshan, Mark Stemm, and Randy
       H. Katz.  Analyzing Stability in Wide-Area Network Performance.
       In Proceedings of the ACM SIGMETRICS, Seattle WA, USA, ACM.
       June, 1997.

   [FGMFB97] R. Fielding, Jim Gettys, Jeffrey C. Mogul, H. Frystyk, and
       Tim Berners-Lee.  Hypertext Transfer Protocol -- HTTP/1.1,



Expires Jun. 1, 2002                                           [Page 16]


Hughes et al.       Issues in TCP Slow-Start Restart        Dec. 1, 2001


       January 1997. RFC-2068.

   [FAP97] Mark Allman, Sally Floyd, and Craig Partridge.  Increasing
       TCP's Initial Window, September 1998. RFC-2414.

   [Hei97] John Heidemann. Performance Interactions Between P-HTTP and
       TCP Implementations.  ACM Computer Communications Review, 27(2),
       65-73, April 1997.

   [HOT97] John Heidemann, Katia Obraczka, and Joe Touch.  Modeling the
       Performance of HTTP Over Several Transport Protocols.  ACM/IEEE
       Transactions on Networking 5(5), 616-630, October, 1997.

   [JB88] Van Jacobson and R.T. Braden. TCP extensions for long-delay
       paths, October 1988. RFC 1072.

   [Jac88] Van Jacobson.  Congestion Avoidance and Control.  Proc. SIG-
       COMM '88, in ACM Computer Communication Review, 18(4):314-329,
       August 1988. NOTE: This paper lacks the appendix that describes
       slow-start restart, which appears only in the unpublished [JK92].

   [JK92] Van Jacobson and Mike Karels.  Congestion Avoidance and Con-
       trol.  This is a revised version of [Jac88], which adds Karels as
       co-author and includes an additional appendix.  The revised ver-
       sion has not been published, but is available at
       ftp://ftp.ee.lbl.gov/papers/congavoid.ps.Z

   [NS97] ns Network Simulator.  http://www-mash.cs.berkeley.edu/ns/,
       1997.

   [PK98] Venkata N. Padmanabhan and Randy H. Katz.  TCP Fast Start: A
       Technique for Speeding Up Web Transfers.  In Proceedings of the
       IEEE GLOBECOM Internet Mini-Conference, Sydney, Australia, Novem-
       ber, 1998.

   [PN98] K. Poduri and K. Nichols. Simulation Studies of Increased Ini-
       tial TCP Window Size, September 1998. RFC-2415.

   [Poo97] Kacheong Poon, Sun Microsystems, tcp-implementors mailing
       list, August, 1997.

   [Tou97] Joe Touch, ISI, tcp-implementors mailing list, August 12,
       1997.

   [VH97] Vikram Visweswaraiah and John Heidemann.  Improving Restart of
       Idle TCP Connections.  Technical Report 97-661, University of
       Southern California, November 1997.




Expires Jun. 1, 2002                                           [Page 17]


Hughes et al.       Issues in TCP Slow-Start Restart        Dec. 1, 2001


Authors/ Address

   Amy S. Hughes, Joe Touch, John Hiedemann
   University of Southern California/Information Sciences Institute
   4676 Admiralty Way
   Marina del Rey, CA 90292-6695
   USA
   Phone: +1 310-822-1511
   Fax:   +1 310-823-6714
   URLs:   http://www.isi.edu/~ahughes
        http://www.isi.edu/touch
        http://www.isi.edu/~johnh
   Email: ahughes@isi.edu
       touch@isi.edu
       johnh@isi.edu




































Expires Jun. 1, 2002                                           [Page 18]