Internet Engineering Task Force Yogesh Swami
INTERNET DRAFT Khiem Le
File: draft-swami-tsvwg-tcp-dclor-00.txt Nokia Research Center
Dallas
November 2002
Expires: Apr 2003
DCLOR: De-correlated Loss Recovery using SACK option
for spurious timeouts.
Status of this Memo
This document is an Internet-Draft and is in full conformance with
all provisions of Section 10 of [RFC2026].
Internet-Drafts are working documents of the Internet Engineering
Task Force (IETF), its areas, and its working groups. Note that
other groups may also distribute working documents as Internet-
Drafts.
Internet-Drafts are draft documents valid for a maximum of six months
and may be updated, replaced, or obsoleted by other documents at any
time. It is inappropriate to use Internet-Drafts as reference
material or to cite them other than as "work in progress.
The list of current Internet-Drafts can be accessed at
http://www.ietf.org/ietf/1id-abstracts.txt
The list of Internet-Draft Shadow Directories can be accessed at
http://www.ietf.org/shadow.html
Abstract
A spurious timeout in TCP forces the sender to unnecessarily
retransmit one complete congestion window of data into the network.
In addition, TCP uses the rate of arrival of ACKs as the basic
criterion for congestion control. TCP makes the assumption that the
rate at which ACKs are received reflects the end-to-end state of the
network in terms of congestion. But after a spurious-timeout, the
ACKs don't reflect the end-to-end congestion state of the network,
but only a part of it. In these cases, the slow-start behavior after
a timeout can further add to network congestion instead of relieving
it.
In this draft we propose changes to the TCP sender (no change is
needed for TCP receiver) that can be used to solve the problems of
Expires: Apr 2003 [Page 1]
draft-swami-tsvwg-tcp-dclor-00.txt November 2002
both redundant-retransmission and network congestion after a spurious
timeout. These changes preserve the sanctity of congestion control
principles and are conservative in nature. The proposed
algorithm--called DCLOR--separates congestion control from loss
recovery, and uses the TCP SACK option to achieve this. DCLOR can be
used as a congestion control algorithm (after a spurious timeout) only,
and it can work with other spurious timeout detection mechanisms such
as the Eifel detection scheme.
Table of Contents
Abstract ..................................................... 1
1. Introduction .............................................. 3
2. Problem Description ....................................... 4
2.1 Redundant Data Retransmission ......................... 5
2.2 Can Slow Start add to Congestion ...................... 5
3. Sources of spurious timeout ............................... 3
3.1 Spurious timeout due to Excessive delay ............... 7
3.2 Spurious timeout due to change of subnets ............. 8
3.3 Spurious Timeout due to Protocol Inefficiencies ....... 9
4. De-correlated Loss Recovery (DCLOR) ....................... 9
4.1 Probe phase after a timeout ....................... 10
4.2 Congestion Control After the probe phase ........... 12
4.3 Recovering lost packets after timeout .............. 13
5. Data Delivery To Upper Layers ............................. 14
6. Sack Reneging ............................................. 15
7. DCLOR Examples ............................................ 15
7.1 Timeout due to congestion ............................. 15
7.2 Timeout due to pure packet stalling ................... 16
7.3 Timeout due to stalling and loss ...................... 17
8. Security Considerations .................................. 18
9. References ............................................... 19
10. IPR Statement ............................................ 19
Author's Address ............................................ 19
Appendix - 1 ................................................. 20
Expires: Apr 2003 [Page 2]
draft-swami-tsvwg-tcp-dclor-00.txt November 2002
1. Introduction
The response of a TCP sender after a retransmission timeout is
governed by the underlying assumption that a mid-stream timeout can
occur only if there is heavy congestion--manifested as packet
loss--in the network. Even though loss is often caused by congestion,
the loss recovery algorithm itself should only answer the question of
"what" data (i.e., what sequence number of data ) to send. While on
the other hand, the congestion control algorithm should answer the
question of "how much" data to send. But after a timeout TCP
addresses the issues of loss recovery and congestion control using a
single mechanism--send one segment per round trip timeout (RTO)
(answers the "how much" question) until an acknowledgment is
received. The single segment sent is always the first unacknowledged
outstanding packet in the retransmission queue (answers the "what"
question). Since the present TCP's loss recovery and congestion
control algorithms are coupled together, we call this "Correlated
Loss Recovery (CLOR)."
Although the assumption that a timeout can occur only if there is
severe congestion is valid for traditional wire-line networks, it
does not hold good for some other types of networks--networks where
packets can be stalled "in the network" for a significant duration
without being discarded. Typical examples of such networks are
cellular networks. In cellular networks, the link layer can
experience a relatively long disruption due to errors, and the link
layer protocol can keep these packets-in-error buffered as long as
the link layer disruption lasts.
In this document we present an alternative approach to loss recovery
and congestion control that "De-Correlates" Loss Recovery from
congestion congestion and allows independent choice on using a
particular TCP sequence number without compromising on the congestion
control principles of [1][2][3][4]. In doing so, the algorithm
mitigates the ill effects of spurious retransmission timeouts.
Spurious timeouts are not only detrimental to the throughput of the
defaulting connection, but they also add to the overall network
congestion--paradoxically enough, due to the subsequent slow-start.
Although several drafts [7][8] have been presented on this topic in
the IETF, we believe that none of them fully considers all the
problems associated with spurious timeouts. In the following section
we first describe these problems in more detail and then describe the
DCLOR mechanism in section-4.
Expires: Apr 2003 [Page 3]
draft-swami-tsvwg-tcp-dclor-00.txt November 2002
2. Problem Description.
Let us assume that a TCP sender has sent N packets, p(1) ... p(N),
into the network and it's waiting for the ACK of p(1) (Figure-1). Due
to bad network conditions or some other problem, these packets are
excessively delayed at some some intermediary node NDN. Unlike
standard IP routers, the NDN keeps these packets buffered for a
relatively long period of time until these packets are forwarded to
their intended recipient. This excessive delay forces the TCP sender
to timeout and enter slow start.
Figure-1
TCP-Sender NDN TCP-Receiver
..... |----p(1)------>| |
^ |----p(2)------>| |
: | . | |
RTT=D | . | |
: | . | |
..... |----p(N)------>| |
| ^ | |
| : | |
| RTO | |
| : | |
| V |----p(1)-->|
... |----p1(1)----->|<---a(1)---|...
L | | |
... |<----a(1)------|----p(2)-->| Inter ACK arrival time (IAT)
|->p1(2),p1(3)->|<---a(2)---|...
| . | . |
| . | . |
| . | . |
| |<---a(N)---|
| |---p1(1)-->|
| |<---a(N)---|
| | |
As far as the TCP sender is concerned, a timeout is always
interpreted as heavy congestion. The TCP sender therefore makes the
assumption that all packets between p(1) and p(N) were lost in the
network. To recover from this misconstrued loss, the TCP sender
retransmits P1(1) ( Px(k) represents the xth retransmission of packet
with sequence number k), and waits for the ACK a(1).
After some period of time the network conditions at NDN become
amicable again, and the queued in packets at NDN are finally
Expires: Apr 2003 [Page 4]
draft-swami-tsvwg-tcp-dclor-00.txt November 2002
dispatched to their intended recipient, to which the the TCP receiver
generates the ACK a(1). When the TCP sender receives a(1), it's
fooled into believing that this ACK was generated in response to the
retransmitted packet p1(1), while in reality a(1) was generated in
response to the originally transmitted packet p(1). When the sender
receives a(1), it increases its congestion window to two, and
retransmits p1(2) and p1(3). As the sender receives more
acknowledgments, it continues with retransmissions and finally starts
sending new data.
The following two subsections examine the problems associated with
the above-mentioned TCP behavior.
2.1 Redundant Data Retransmission
The obvious and relatively easy-to-solve inefficiency of the above
algorithm is that the entire congestion window worth of data is
unnecessarily retransmitted. Although such retransmissions are
harmless to high-bandwidth, well-provisioned, backbone links (so long
they are infrequent), it could severely degrade the performance of
slow links.
In cases where bandwidth is a commodity at a premium, (e.g., cellular
networks), unnecessary retransmission can also be costly.
2.2 Can Slow Start add to Congestion after Spurious Timeout
Paradoxically, slow start--the epitome of congestion control--can
itself add to network congestion after a spurious timeout. Going back
to the previous example, when the TCP sender receives a(1), it
falsely assumes that a(1) was triggered by p1(1). But in reality,
a(1) was triggered by p(1), not p1(1). This ambiguity exits because
of the cumulative acknowledgment property of TCP. In this case, if
the TCP sender were to take RTT samples (which it does not take
because of ACK ambiguity), it would register a value of L (Figure-1)
instead of D--the RTT. In addition to this, when the TCP sender sends
a1(2) and a1(3), it will receive the ACK for a1(2) and a1(3) right
after the inter ACK arrival time IAT. Then-after, the TCP sender will
keep increasing the data rate every IAT as new ACKs arrive.
Mathematically, one can show that after a spurious timeout, the rate
at which data is injected in to the network is given by 1. But if we
want to preserve the network probing capability [2] of a TCP sender,
then the rate should be given by (2).
r(t) = 2*ceil(t-t0/IAT) ... (1)
r(t) = 2^ceil(t-t0/D) ... (2)
Expires: Apr 2003 [Page 5]
draft-swami-tsvwg-tcp-dclor-00.txt November 2002
Where t>t0, and t0 is the time when first ACK after spurious timeout
is received, and IAT and D are related by (3)
IAT <= D/N ... (3)
(N is the flight size at the time of spurious timeout).
Figure-2
C(1) C(2)... C(M)
| | ... |
| | ... |
| | ... |
V V ... V
\ \ /
\ \ /
\ \ /
+------X--X--X---+ +------------------+
Defaulting | | | |
C(0) ----------->| Bottleneck |------>|Buffered packets |--->
connection | router | | |
+-----X--X----X--+ +------------------+
| | |
| | |
c(1)c(2) C(M)
We now compute the worst-case packet loss due to this data burst.
After the timeout, the TCP sender will set its SS_THRESH to N/2
(packets-in-flight/2). Therefore, for the first N/2 ACKs received
(i.e., ACK a(1) to a(N/2) ), the TCP sender will grow its congestion
window by one and reach the SS_THRESH value of N/2. For each ACK
received, the TCP sender also sends 2 packets. Therefore by the end
of the slow start, the TCP sender would have sent 2*(N/2) packets
into the network. For the remaining N/2 ACKs (i.e., ACKs between
a(N/2+1) to a(N)) the TCP sender will remain in the congestion
avoidance phase and send one packet for each ACK received--sending
N/2 more data segments. The net amount of data sent is therefore N/2
+ N = 3N/2 . Please note that the entire 3N/2 packets are injected
into the network within a time period less than or equal to RTT. The
number of data segments that left the network during this time is
only N. Therefore, there is a high probability that N/2 packets out
of 3N/2 packets will be lost. These N/2 lost packets, however, need
not come from the same connection, and such a data-burst will
unnecessarily penalize all the competing TCP connections that share
the same bottleneck router.
Expires: Apr 2003 [Page 6]
draft-swami-tsvwg-tcp-dclor-00.txt November 2002
Going further ahead, let us assume there are M competing TCP
connections that share the same bottleneck router as the defaulting
connection C(0) does (Figure-2). During the period of time while C(0)
is stalled, the TCP sender of C(0) does not use its network
resources--the buffer space--on the bottleneck router(s). The
competing connections, C(1) ... C(M), however see this lack of
activity as resource availability and start growing their window by
at least one segment per RTT during this time period (by virtue of
linear window increase during congestion avoidance phase). If, for
simplicity reasons, we assume that each of these TCP connections has
the same round trip time of RTT, and if the idle time for C(0) is
k*RTT, (where k > RTO/RTT) then each of these competing connections
will increase their congestion window by k segments. Therefore the
amount of packets lost in the network due to slow start can be as
high as:
N/2 + M*k ... (4)
where the first term is the packet loss due to slow start, while the
second term is the loss due to window growth of completing
connections.
Please note that even if a TCP sender restores its congestion window
[7](or halves [8] it) to avoid slow start and redundant
retransmission, it still cannot avoid the loss of M*k segments (M*k-
N/2 segments in case the window was halved) that were added in the
network because of sender's inactivity. Since a TCP sender does not
know the number of connections it's competing with, or the time
duration for which packets could be stalled in the network, the
number of segments lost due to spurious timeout could be large.
In the following sections we describe an algorithm that solves the
problem of both redundant retransmission and packet loss after a
spurious timeout.
3. Sources of spurious timeout
Since the response algorithm after a timeout depends on the event
that triggered it, it is important to know the factors that can cause
such timeouts (in addition to the timeouts caused by congestion.) The
following list enumerates few broad categories of events that can
cause spurious timeouts in TCP.
3.1 Spurious timeout due to Excessive delay at a router
Many link layer technologies have their own loss recovery schemes
that use link layer feedback mechanisms to recover bit losses. These
loss recovery schemes are often unaware of the transport layer loss
Expires: Apr 2003 [Page 7]
draft-swami-tsvwg-tcp-dclor-00.txt November 2002
recovery schemes. Therefore, if there is burst of error at the link
layer, the TCP sender may timeout because of
the loss recovery scheme used by the link layers.
3.2 Spurious timeout due to change of subnets
A TCP receiver or sender may change subnets due to sender or receiver
mobility (Figure-3). While the change of subnets is taking place, a
router may keep the data packets buffered until a new route is
established (i.e., until Path-3 and Path-2 are established in
Figure-3). Since establishing a new route can take a long time, a
TCP sender may timeout.
Figure-3
<.... Path of new data and ACK. ...>
Path-2
+-----------> After
+-------------+ | ^ Subnet Change
| +---------+ .
| TCP Sender | .Path-3
| +---------+ .
+-------------+ | .
+-----------> Before
Path-1 Subnet Change
Once the new path is established, the router may forward all data
packets to the TCP receiver using Path-3. However, the ACKs triggered
by these packets will reach the TCP sender using Path-2. In addition,
the TCP sender will send new data in response to these ACKs on
Path-2.
When there is a change of subnets, the entire end-to-end path between
the sender and receiver might change completely (for example, Path-1
and Path-2 may not have any common routers between then). A complete
change in the end-to-end path may take place, for example, if a TCP
receiver uses Mobile-IPv6 with route optimization to move from one
subnet to another. Since the TCP connection does not have any
information about the BDP on new path (Path-2), the TCP connection
should start with a window size same as the window size of a new
connection, i.e., with a size of two.
Please note that of all possible reasons for spurious timeout, a
change in subnets has maximum impact on network congestion. Since the
BDP on a new path could be tens of orders of magnitude higher or
lower than the BDP on an old path, an inappropriate congestion
Expires: Apr 2003 [Page 8]
draft-swami-tsvwg-tcp-dclor-00.txt November 2002
control scheme could severely add to network congestion on the new
path (Path-2). For example, let us assume there are 100 TCP
connections sharing the same bottleneck router. Let's also assume
that the average number of subnet change per second per TCP receiver
is 1/50, and the average congestion window for each TCP connection at
the time of spurious timeout is 5 segments. Then in the worst case,
the network may drop as many as (1/50)*(5/2)*100 = 20 segments every
second if the congestion window is restored as done in [7].
3.3 Spurious Timeout due to Protocol Inefficiencies
A TCP sender may timeout because of inherent protocol inefficiencies
in TCP. For example, a spurious timeout may occur if
multiple packets are lost and fast-retransmit/fast-recovery
algorithms could not recover the lost packets within an RTO. A
spurious timeout may also occur if there are so few packets injected
into the network that in case of a packet loss the receiver cannot
generate 3 duplicate ACKs.
4. De-correlated Loss Recovery (DCLOR)
De-correlated loss recovery tries to remedy the problems described in
Section-2 by separating congestion control from loss recovery
mechanism. In doing so, we make the decision of "which packets to
send" independent from "how much" data to send. In addition to this,
the rate at which data is injected into the network preserves the
conservation control principles as described in [1][2][3].
The basic idea behind DCLOR is to send a new data segment from
outside the sender's retransmission queue--the queue of
unacknowledged data segments--and wait for the ACK or SACK of the new
data before initiating the response algorithm. Unlike slow-start
where the response algorithm starts immediately after receiving the
first ACK, DCLOR waits for the ACK/SACK of the new data sent after
timeout before initiating loss recovery. The SACK block for new data
contains sufficient information to determine all the packets that
were lost into the network. Once the sequence number of lost packets
is determined, the TCP sender grows its congestion window as in case
of slow-start, but only tries to recover those segments that were
lost in the network. This not only allows the TCP sender to avoid the
go-back-N problem, but also prevents superfluous congestion window
growth as seen with slow-start (section-2.2).
Before describing the algorithm itself, we fist enumerate the
underlying design philosophy behind DCLOR as it can give more insight
into the workings of the algorithm:
1. A spurious timeout should not degrade the performance of
Expires: Apr 2003 [Page 9]
draft-swami-tsvwg-tcp-dclor-00.txt November 2002
other TCP connections sharing the same network path. As pointed
out in Section-2, the slow-start subsequent to spurious timeout
not only affects the performance of the defaulting connections,
but it can also confer data-loss on other competing connections.
2. Congestion control decisions should be made independent of the
loss recovery decision. In other words, the congestion control
algorithm should only determines the amount of data to be
injected into the network, while the loss recovery algorithm
should determine the sequence number of data to be recovered.
3. Since the state of the network can change dramatically
after a long interruption--long with respect to
RTT--(Section-3), the response algorithm after a timeout should
be conservative in the amount of data injected into the network
after a timeout. This avoids packet loss (and congestion) due
to inappropriately inflated congestion window after spurious
timeouts (Section-2.2)
4. The TCP sender's estimate of network capacity--the
SS_THRESH--should be updated if and only if packets are lost in
the network. A spurious timeout without packet loss should not
affect the SS_THRESH since SS_THRESH is a measure of the
network's buffering capacity and there is no need to updated it
unless packet loss is detected.
5. If there is packet loss in addition to packet stalling, all
the lost packets within that window should be recovered without
any need for future timeouts. In other words, the algorithm
should not require the TCP sender to timeout again due to
protocol limitations.
We now describe the algorithm in more detail, keeping the above
mentioned points in mind. Please see Figure-4 for certain parameters.
4.1 Probe phase after a timeout
The following steps describe the response of a TCP sender on a
timeout:
1. If the timeout occurs before the 3 way handshake is complete,
the TCP sender's behavior is unchanged,
2. After a timeout, the TCP sender MUST set its congestion window
to ZERO (CWND = 0), and keep the number of outstanding packets
(N)into a state variable for future use (step-9). The value of
SS_THRESH MUST be left UNCHANGED at this point.
Expires: Apr 2003 [Page 10]
draft-swami-tsvwg-tcp-dclor-00.txt November 2002
3. The TCP sender SHOULD also reset all the SACK tag bits in its
retransmission queue (essentially renege all the SACK
information in accordance with the recommendation in [5]. Please
also see Section-6 on SACK reneging for more information.)
4. Instead of sending the first unacknowledged packet P1
after a timeout, the TCP sender disregards its congestion window
and sends ONE NEW MSS size data packet Pn+1.
Figure-4 .
. DCLOR:
. a) At timeout send Pn+1 in stead of P1;
. store SS_PTR=seq-num(Pn+1);set CWND=0
<.....CWND.....> . b) Don't update CWND for ACK/SACK <= SS_PTR
+--+--+-...-+--+----+ . c) Once ACK/SACK > SS_PTR is received
|P1|P2| |Pn|Pn+1| . Update SS_THRESH if SACK received
+--+--+-...-+--+----+ . d) Update lost-packet information
^ . e) set CWND=2 and start loss-recovery/
| . start sending new data in case pure ACK
SS_PTR . greater than SS_PTR is received.
TCP Sender's Retransmission
Queue
The TCP sender also stores the sequence number of this new data
packet in a new state variable called SS_PTR (for slow start
pointer).
If the sender does not have any new data outside its
retransmission queue, or if the receiver's flow control window
cannot sustain any new data, the TCP sender should send the
highest sequence numbered MSS sized data chunk from its
retransmission queue (i.e., it should send the last packet from
its retransmission queue).
The idea behind sending one packet on a timeout is to probe the
network and determine if the network is congested. However, from the
network point of view it does not matter what the sequence number of
data sent was. What matters is the "amount" of data sent into the
network. Therefore, sending a new packet from outside the
retransmission queue has the same affect as sending an old packet.
Therefore this behavior is in accordance with the TCP congestion
Expires: Apr 2003 [Page 11]
draft-swami-tsvwg-tcp-dclor-00.txt November 2002
control and avoidance mechanisms described in [1].
5. A TCP sender MUST repeat step-2 until it enters the
Timeout-Recovery state as described in step 6.
4.2 Congestion Control After the probe phase
6. After the timeout, for each ACK received with the ACK-sequence
number less than SS_PTR, the packet ACKed is removed from the
retransmission queue. If the ACK contains a new SACK block, then
the SACK tag is set in the corresponding data packet. However,
no data is sent for these ACKs.
7. The TCP sender MUST NOT update its congestion window from a
value of ZERO until an ACK-sequence number greater than SS_PTR,
or a SACK block containing SS_PTR is received. In other words,
until the ACK-sequence for Pn+1 is received or a SACK block
containing information about Pn+1 is received, NO new data
segments (i.e., segments Pn+2 ...,) are added into the network.
This allows the TCP sender to discard all the "stale ACKs"--the
ACKs that are generated for stalled packets--and prevents
superfluous growth of congestion window.
The TCP sender MUST NOT take RTT samples for stale-ACKs. The
SACK information present in stale-ACKs SHOULD be however used to
put SACK tags on the retransmission queue. A TCP sender MAY
reset its retransmission timer with every stale ACK received.
If the sender receives a SACK block containing SS_PTR, i.e., if
there is a packet loss in the stalled window, it SHOULD go to
step-8.
If the sender receives an ACK that acknowledges SS_PTR, i.e., if
no packets were lost from the stalled window, it SHOULD go to
step-10.
The rationale for not sending any data segment for stale ACKs is to
minimize congestion after a timeout. Since the duration of packet
stalling could be large, the congestion state of the network could
have change dramatically. Therefore, not using the stale ACKs to send
data allows the TCP sender to recompute its congestion state from
scratch and also minimizes packet loss.
NOTE-1: In our experiments, we have tried sending new data for
stale ACKs in the spirit of reverting the congestion window as
described in [7]. We have also tried setting the congestion window to
Expires: Apr 2003 [Page 12]
draft-swami-tsvwg-tcp-dclor-00.txt November 2002
half the packets-in-flight (pipe) as described in [8]. But we have
found that not sending any new data, as described above, is the best
scheme whenever there are multiple competing connections in the
system. The results of these experiments, along with the experimental
setup, are listed in Appendix-1.
If a TCP sender has not received any ACK within an RTT, sending more
than a few data segments (>4) will add to network congestion in one
form or the other (in varying degrees) and the overall performance of
the system as a whole degrades--on an average. In addition, the DCLOR
scheme works well in case of subnet change where there could be
orders of magnitude difference in BDP. DCLOR also has minimal effect
on other competing connections.
Based on our experiments, we do not recommend sending new data for
stale ACKs, but it's possible to deploy a variety of congestion
control schemes with the DCLOR framework. For example, a TCP sender
on the reception of an old ACK MAY revert (or halve) its congestion
window to the old value of packets-in-flight (pipe) and MAY send new
data depending upon the congestion window. But the TCP sender SHOULD
initiate its loss recovery ONLY after SS_PTR has been ACKed or
SACKed. In other words, if the TCP sender receives duplicate ACKs
with ACK-sequence number less than SS_PTR, it SHOULD NOT use these
duplicate ACKs to initiate Fast-Retransmit and Fast-Recovery. The
loss recovery SHOULD be initiated only after ACK/SACK for SS_PTR is
received. After that the sender should follow the same steps as
described in step-8, step-9, and step-10 (with a different congestion
window of course).
NOTE-2: For the congestion response of DCLOR, we have also considered
adaptive update of congestion window after each timeout. In this
scheme a TCP sender reduces its congestion window by a factor of 2
for every RTO period of inactivity. The SS_THRESH is not updated
until a loss is detected. At the time of this writing we are still
experimenting with the usefulness of such a scheme. However, such a
scheme is not well suited for spurious timeouts caused by change
of subnets as described in Section-3.2.
4.3 Timeout-Recovery: recovering lost packets after timeout
8. The TCP sender traverses the retransmission queue and marks
all the packets without any SACK tag as lost. The TCP sender
also updates its packets-in-flight (pipe) based on the SACK tags
and the lost segment information (the packets-in-flight (pipe)
should be ZERO after the update).
Please note that unlike Fast-Retransmit and Fast-recovery, DCLOR uses
Expires: Apr 2003 [Page 13]
draft-swami-tsvwg-tcp-dclor-00.txt November 2002
only one SACK block containing SS_PTR to mark packets as lost. This
is because we do not expect packet reordering to exist over the
period of RTO.
9. The TCP sender updates its SS_THRESH, as following:
SS_THRESH=packets-in-flight (pipe) at the time of timeout / 2
Please note that the TCP sender stored the value of packets-in-
flight at the time of timeout in step-2.
10. The TCP sender sets its congestion window to 2. If packets
were lost into the network (i.e., if a SACK for SS_PTR was
received), the TCP sender should start by sending the least
sequence number packets, else it should continue by sending new
data.
Please note that with a pure ACK acknowledging SS_PTR, the TCP sender
does not update the SS_THRESH value (it directly enters step-10 from
step-7). This prevents a TCP sender from setting its SS_THRESH to a
very small value if the spurious timeout occurs at the start of the
connection.
The rationale behind not updating the SS_THRESH if no loss is
detected is that SS_THRESH represents the sender's estimate of
network capacity--the BDP. Unless a loss is detected, there is no
reason to update the value of BDP.
5. Data Delivery To Upper Layers
If a TCP sender loses its entire congestion window worth of data due
to congestion, sending new data after timeout prevents a TCP receiver
from forwarding the new data to the upper layers immediately.
However, once the SACK for this new data is received, the TCP sender
will send the first lost segment. This essentially means that data
delivery to the upper layers could be delayed by ONE RTT when all the
packets are lost in the network.
This, however, does not affect the throughput of the connection in
any way. If a timeout has occurred, then the data delivery to the
upper layers has already been excessively delayed. Delaying it by
another round trip is not a serious problem. Please note that
reliability and timeliness are two conflicting issues and one cannot
gain on one without sacrificing something else on the other.
Expires: Apr 2003 [Page 14]
draft-swami-tsvwg-tcp-dclor-00.txt November 2002
6. Sack Reneging
The TCP SACK information is meant to be advisory, and a TCP receiver
is allowed--though strongly discouraged--to discard data blocks the
receiver has already SACKed[5]. Please note however that even if the
TCP sender discards the data block it received, it MUST still send
the SACK block for at least the recent most data received. Therefore
in spite of SACK reneging, DCLOR will work without any deadlocks.
A SACK implementation is also allowed not to send a SACK block even
though the TCP sender and receiver might have agreed to SACK-
Permitted option at the start of the connection. In these cases,
however, if the receiver sends one SACK block, it must send SACK
blocks for the rest of the connection. Because of the above mentioned
leniency in implementation, its possible that a TCP receiver may
agree on SACK-Permitted option, and yet not send any SACK blocks. To
make DCLOR robust under these circumstances, DCLOR SHOULD NOT be
invoked unless the sender has seen at least one SACK block before
timeout. We, however, believe that once the SACK-Permitted option is
accepted, the TCP sender MUST send a SACK block--even though that
block might finally be discarded. Otherwise, the SACK-Permitted
option is completely redundant and serves little purpose. To the best
of our knowledge, almost all SACK implementations send a SACK block
if they have accepted the SACK-Permitted option.
7. DCLOR Examples
We now demonstrate the working of DCLOR with a few concrete examples.
We first take the case in which the timeout is caused due to
congestion. We then consider the case, in which all the packets are
stalled in the network but there is no packet loss. Finally, we take
the example in which packets are both lost and stalled at the same
time.
In all these examples, the TCP sender has 20 MSS packets in flight
(pipe) at the time when timeout occurs and each packet is denoted by
P(i) (please note that we are using packet numbers instead of TCP
sequence numbers to make the presentation simple. The example can be
modified to take TCP sequence numbers into account too.) An ACK is
denoted by A(i)[x,y][z,w], where i is the highest packet number
acknowledged, and the ordered pair [x,y] and [z,w] represent the sack
blocks present in that ACK (please note that i < x,y,z,w and its
possible that x=y if just one packet is being sacked.)
7.1 Timeout due to congestion.
In this case, we assume that all packets P(1) to P(20) are lost into
Expires: Apr 2003 [Page 15]
draft-swami-tsvwg-tcp-dclor-00.txt November 2002
the network. The following time-line shows the sequence of events:
TCP Sender Network TCP Receiver
---------- ------- -----------
P(1)-----------X (Highest In-order seq = 0)
.
. LOST IN NETWORK
. DUE TO CONGESTION
P(20)-----------X
^
:
RTO
:
V
preserve old_pkt_in_flight (pipe) = 20
send P(21)-------------------------> rec P(21)
set SS_PTR = 21
rcv A(0)[21,21] <-------------------- send A(0)[21, 21]
Since SACK contains SS_PTR ==>
Tag P(1) to P(20) as Lost
Tag P(21) as SACKED
Update pkt_in_flight (pipe) = 21 - 20 -1 = 0
Since SACK=21 was received therefore
set SS_THRESH to old_pkt_in_flight/2 = 10
set CWND = 2
Restart RTO-timer
Send P(1)--------------------------->
Send P(2)--------------------------->
7.2 Timeout due to pure packet stalling.
In this case, we assume that all the packets P(1) to P(20) are
stalled in the network. The following time-line shows the sequence of
events:
TCP Sender Network TCP Receiver
---------- ------- -----------
P(1)-----------+ (Highest In-order seq = 0)
. |
. |
P(20)-----------+
^ |
RTO |
V | Packets delayed
Expires: Apr 2003 [Page 16]
draft-swami-tsvwg-tcp-dclor-00.txt November 2002
old_pkt_in_flight (pipe) = 20 |
send P(21)----------+
set SS_PTR = 21 |
CWND=0 +--------------> rec P(1)
rcv A(1) <---------------------------- send A(1)
SS_PTR > A(1)=> |
no change in CWND = 0 .
Remove P(1) from .
retransmission queue .
No timer sample taken .
+--------------> rec P(i) {1<i<21}
rcv A(i) <---------------------------- send A(i)
SS_PTR > A(i) ==> |
no change in CWND = 0 |
Remove P(i) from |
retransmission queue .
No timer sample taken .
+---------------> rec P(21)
rcv A(21) <---------------------------- send A(21)
SS_PTR = A(21) ==>
pure ACK received
Remove P(21) from retransmission queue
Update pkt_in_flight (pipe) = 0
Since Pure ACK was received therefore
SS_THRESH remains unchanged.
set CWND = 2
Restart RTO-timer
Send P(22)--------------------------->
Send P(23)--------------------------->
7.3 Comprehensive Scenario: Timeout due to stalling and loss.
Building upon the previous two example, we now assume that in
addition to data stalling, P(10) was lost into the network.
The following time-line shows the sequence of events:
TCP Sender Network TCP Receiver
---------- ------- -----------
P(1)-------------+ (Highest In-order seq = 0)
. |
P(10)----X Lost |
. |
P(20)-------------+
^ |
Expires: Apr 2003 [Page 17]
draft-swami-tsvwg-tcp-dclor-00.txt November 2002
RTO |
V | Packets delayed
old_pkt_in_flight (pipe) = 20 |
send P(21)------------+
set SS_PTR = 21 |
CWND=0 +--------------> rec P(1)
rcv A(1) <----------------------------- send A(1)
SS_PTR > A(1)=> |
no change in CWND = 0 .
Remove P(1) from .
retransmission queue .
No RTT sample taken . {1<i<10}
+--------------> rec P(i)
rcv A(i) <---------------------------- send A(i)
SS_PTR > A(i) ==> |
no change in CWND = 0 |
Remove P(i) from |
retransmission queue .
No RTT sample taken . {10<j<21}
+---------------> rec P(j)
rcv A(9)[11,j] <------------------------ send A(9)[11,j]
SS_PTR > A(9) and |
SS_PTR outside A[11,j]==> |
Tag P(j) as SACKED |
NO change in CWND=0 +---------------> rec P(21)
rcv A(21) <---------------------------- send A(9)[11,21]
Since SS_PTR = 21 ==>
Tag P(10) as Lost
Update pkt_in_flight (pipe) = 11-10-1
(Note, P(1) ... P(9) are already removed
from the retransmission queue)
Since a SACK was received therefore
set SS_THRESH=20/2 = 10
set CWND = 2
Restart RTO-timer
recover the lost packet first
Send P(10)--------------------------->
Send P(22)--------------------------->
8. Security Considerations
No new security threats are introduced by the use of DCLOR.
Expires: Apr 2003 [Page 18]
draft-swami-tsvwg-tcp-dclor-00.txt November 2002
9. References
[1] M. Allman, V. Paxson, W. Stevens. "TCP Congestion Control.
RFC-2581."
[2] S. Floyd. "Congestion Control Principles." RFC-2914.
[3] M. Handley, J. Padhye, S. Floyd. "TCP Congestion Window
Validation." RFC-2861.
[4] Ethan Blanton, Mark Allman, Kevin Fall. "Conservative
SACK-based Loss Recovery Algorithm for TCP." draft-allman-tcp-
sack-10.txt. Work in Progress.
[5] M. Mathis, J. Mahdavi, S. Floyd, A. Romanow. "TCP Selective
Acknowledgment Options." RFC-2018.
[6] S. Floyd, J. Mahdavi, M. Mathis, M. Podolsky. "An Extension to
the Selective Acknowledgment (SACK) Option for TCP" RCF-2883.
[7] Reiner Ludwig, Michael Meyer. "The Eiffel Detection Algorithm
for TCP." Internet Draft-work in progress. draft-ietf-tsvwg-
tcp-eifel-alg-05.txt.
[8] P. Sarolahti, M. Kojo. "F-RTO: A TCP RTO Recovery Algorithm
for Avoiding Unnecessary Retransmissions." Internet Draft--work
in progress. draft-sarolahti-tsvwg-tcp-frto-01.txt
[9] V. Paxon, M. Allman. "Computing TCP's Retransmission
Timer." RFC-2998
10. IPR Statement
The IETF has been notified of intellectual property rights claimed in
regard to some or all of the specification contained in this
document. For more information consult the on-line list of claimed
rights at http://www.ietf.org/ipr.
Author's Address:
Yogesh Prem Swami
Nokia Research Center
6000 Connection Drive
Irving TX-75063
Expires: Apr 2003 [Page 19]
draft-swami-tsvwg-tcp-dclor-00.txt November 2002
USA
Phone: +1 972-374-0669
Email: yogesh.swami@nokia.com
Khiem Le
Nokia Research Center
6000 Connection Drive
Irving TX-75063
USA
Phone: +1 972-894-4882
Email: khiem.le@nokia.com
Appendix - 1
In this section we provide information on our test setup and provide
test results for comparison of DCLOR with Standard TCP, Eifel, and
FRTO schemes.
App-1: Testing Methodology
Please see Figure-5 for the test setup. The TCP sender and TCP
receiver are connected through an intermediary "spurious timeout and
network impairment simulator (STIS)," that simulates packet delay and
certain network impairments. The STIS is capable of reading all TCP
packets received on any of its interfaces. It is capable of
processing packets coming from multiple TCP connections. Each TCP
packet is then subjected to a series of network impairments that are
expected to be present in a typical IP network. The specifics of
these impairments along with the model used to simulate them is
enumerated below:
Figure-5
+---------------+ +--------------------+ +---------------+
| TCP Sender | | Spurious timeout & | | |
| +----+ network impairment +---+ TCP Receiver |
| Linux-2.5.40 | | simulator | | |
+---------------+ +--------------------+ +---------------+
Expires: Apr 2003 [Page 20]
draft-swami-tsvwg-tcp-dclor-00.txt November 2002
1. Bottleneck Router
The STIS is capable of simulating a bottleneck router on the
end-to-end path of a TCP connection. For TCP, a bottleneck
router is one that has the minimum ratio for the buffer-space to
difference in link speed on the end-to-end path (i.e., if b(i)
represents the buffer space of router i, and a(i) and d(i) be
the incoming and the outgoing link speed of the router for a
particular TCP connection, then the bottle neck router for such
a connection is one that has min{ b(i)/[d(i)-a(i)]} for all
routers i on the end-to-end path of the TCP connection).
The bottleneck router is simulated by specifying a buffer space
and the outgoing link speed (the arrival rate in our simulation
was fixed by the link speed of the interface on which the packet
arrived. This speed was a fixed value of 10Mbps). In order to
simulate outgoing link, the STIS enqueues each TCP packet and
schedules them based on the queuing delay and link-delay
(service time) of packets in the queue (please see item-2 for
more details). A packet received on any of the interfaces is
processed only if the total bytes of enqueued data in the router
is less than the specified buffer space. A packet is dropped if
the buffer space is full. Please note that the buffer space is
allocated for and shared by all possible TCP connections going
through STIS.
2. Link Speed Simulation
Although each TCP connection shares the same buffer space, each
TCP connection has its own logical queue for each direction.
The departure time of a new packet is computed by addition the
link delay (packet-size / outgoing-link-speed) of the new packet
to the queuing delay of all packets ahead of it. A TCP sender is
scheduled on the appropriate interface once its departure time
expires.
3. Fixed Network Delay
The wired-network delay seen by individual packets is the sum of
the queuing delay and link delay of each packet on individual
routers. If we assume that the number of routers are infinitely
many on the end-to-end path, and if the variance of the delay is
bounded (which is the case with finite queue sizes), then by the
law of large numbers, the delay seen by each packet will be a
constant. Although there are not infinitely many routers in the
real networks, to make our model simple we assume that the delay
added to each packet on the end-to-end path is a constant. This
delay is added in addition to the queuing delay incurred due to
Expires: Apr 2003 [Page 21]
draft-swami-tsvwg-tcp-dclor-00.txt November 2002
link speed simulation. (Please note that fixed network delay
tries to capture the queuing delay on the high speed link, while
the delay incurred due to link speed simulation is the delay due
to a last hop slow link.)
The fixed network delay is simulated by adding a fixed delay to
the departure time of each packet.
4. Prolonged packet stalling
In addition to the delay models described above, the STIS also
simulates prolonged packet stalling in the network. Although
the end result of packet stalling is a delay spike, we do not
simply add a long delay burst to achieve packet stalling rather
use the Markov model shown in Figure-6. (This model is based on
the two different kinds of delay spikes seen in many cellular-
networks.)
Every second a random number r (<1) is generated(lets say the
random number is generated at time T) for each TCP connection
and compared with p1 and p2. If r is less than p1 or P2 a state
transition is made into state-2 or state-3. When a TCP
connection is in state-2 or state-3, each packet arriving in the
time interval T and T+D (where D = d1 or d2 depending upon the
state), is treated as if it arrived at time T+D. By altering the
arrival time of each packet, this model simulates packet
stalling in the network.
Figure-6
p=1-p1-p2
+------+
\ /
+----+-----V-----+
p=1 | No Delay Spike |
+--------------->| (state-1) |<--------------+
| +--+--------+----+ |
| | | |p=1
| +-----------+ +--------------+ |
| | p=p1 p=p2| |
+------+-------V-------+ +----V-----+--------+
| Moderate delay spike | | Large delay spike |
| delay=d1(state-2) | | delay=d2(state-2) |
+----------------------+ +-------------------+
Expires: Apr 2003 [Page 22]
draft-swami-tsvwg-tcp-dclor-00.txt November 2002
Please note that in a real network, when packet stalling
occurs--for whatever reason--all packets remain queued in the
network until the packets can be forwarded again. Therefore the
effect of packet stalling is different from adding a long burst
of delay to a few of the packets. Please note that with such a
model, the delay incurred by individual packets will not be the
same.
Please also note that the state transitions are based on time.
Therefore the STIS will keep entering state-2 and state-3
whether or not there are any TCP packets to be forwarded.
5. Packet Reordering
Since the Internet is known to reorder 12% of the packets (this
study was done by Vern Paxson of UC Berkeley) on the end-to-end
path, we try to simulate this behavior in our simulations. To
simulate packet reordering we try to emulate the network
behavior that creates packet reordering--that is we try to
emulate router failure or load balancing to achieve packet
reordering.
A router failure is modeled using a two state Markov model, in
which transition from state to another means a route failure. In
each of the different states, the TCP connection incurs a
different fixed network delay as described above in item-2.
This model is based on the assumption that when a route failure
takes place, the delay on new path is different from the delay
on the old path.
App-2: Test setup, parameters, and Results
Based on the model described above, the STIS uses following
parameters for testing.
Table-1: STIS parameters
+---------------------------------------+------------------+
| Parameter Name | Parameter Value |
+---------------------------------------+------------------+
| Path MTU | 1500 Bytes |
| Bottleneck Router's Capacity | 74K Bytes |
| Outgoing link data rate | 50 K Bits/sec |
| Fixed Network Delay (one way) | 200 ms |
| Pkt Stalling delay in state-2 (d1) | 5 sec |
| Pkt Stalling delay in state-3 (d2) | 8 sec |
| Pkt Stalling prob to state-2 (p1) | 0.05 per-sec |
Expires: Apr 2003 [Page 23]
draft-swami-tsvwg-tcp-dclor-00.txt November 2002
| Pkt Stalling prob to state-3 (p2) | 0.005 per-sec |
| Pkt reordering probability | 0.12 |
| Pkt reordering delay | 20 ms |
+---------------------------------------+------------------+
On the TCP receiver, multiple TCP receiver processes simultaneously
run and try to download files of different sizes--each connection
competing to download a file of given size. In addition, each process
waits for a random time interval and iterates over and over again to
download the same file. The following table shows the traffic mix
used for our experiments.
Table-2: Traffic Mix
+---------------+------------------+--------------+
| File Size | # simultaneous | # iterations |
| (KB) | connections | |
+---------------+------------------+--------------+
| 5 | 6 | 2000 |
| 10 | 5 | 1000 |
| 100 | 5 | 100 |
| 1000 | 3 | 10 |
| 10000 | 1 | 1 |
+---------------+------------------+--------------+
Based on the STIS parameters and traffic mix, the cumulative
probability distribution of download time, i.e., the plot of number
of points with a download time of less than a give time duration in
the above experiment, for each file size was generated (because of
proper format for representing these plots, we are unable to include
these plots in this document. We can provide this plot on request.)
Tables 3,4,and 5 show the mean and variance (in second) for
downloading different file sizes using DCLOR, EIFEL, FRTO, and
Standard TCP. Please note that not every connection was subjected to
a spurious timeout, and therefore the mean value alone is not a good
measure of performance. The variance captures the impact of such
spurious timeouts (the probability distribution is the best
representation of how individual schemes perform for different
download time range). As can be seen from the following table, Eifel
has a large variance because of restoring the window. Please note
that the variance for standard TCP is less than Eifel, or Frto,
because the first N packets in case of standard TCP are redundant,
and therefore even if they are lost due to error burst, the
defaulting connection does not incur a very severe penalty and
therefore the variance is small. But for Eifel and FRTO, the N
packets after spurious timeout are new and therefore loss of these
Expires: Apr 2003 [Page 24]
draft-swami-tsvwg-tcp-dclor-00.txt November 2002
packets has a huge impact on the variance.
Table-3: Download Time Statistics for 5K files
+---------+--------+-------+--------+-------+
| (sec) | DCLOR | EIFEL |STD_TCP | FRTO |
+---------+--------+-------+--------+-------+
|Mean | 2.3869 | 2.4162| 2.3962 | 2.4049|
|Variance | 3.2473 |10.2579| 3.2164 | 4.0466|
+---------+--------+-------+--------+-------+
Table-4: Download Time Statistics for 10K files
+---------+-------+--------+--------+------+
| (sec) | DCLOR | EIFEL |STD_TCP | FRTO |
+---------+-------+--------+--------+------+
|Mean |3.4547 | 4.0560 | 3.7314 |3.7317|
|Variance |4.7452 |19.7828 | 7.6378 |8.4576|
+---------+-------+--------+--------+------+
Table-5: Download Time Statistics for 100K files
+---------+---------+---------+---------+---------+
| (sec) | DCLOR | EIFEL | STD_TCP | FRTO |
+---------+---------+---------+---------+---------+
|Mean | 24.6297 | 26.1017 | 26.7425 | 25.5325 |
|Variance | 66.0804 | 98.4933 | 98.9363 | 78.1667 |
+---------+---------+---------+---------+---------+
In addition to the mean and variance of download times, we also
calculated the "spectral efficiency" of the system. The spectral
efficiency was computed as:
(redundant_bytes_received)
SE= --------------------------------
mean(cwnd)*MSS
The spectral efficiency is a measure of usefulness of a particular
Expires: Apr 2003 [Page 25]
draft-swami-tsvwg-tcp-dclor-00.txt November 2002
scheme in terms of fraction of extra bytes sent due to spurious
timeout for every byte sent into the network. The smaller this number
is, the better the scheme is. The numerator is the amount of
unnecessary data, while the denominator is the average network
capacity experienced by the connection. Since some schemes (Eifel for
example) requires more TCP options, the MSS was used instead of MTU
in the denominator. Table-5 shows these results for all the four
different TCP implementations. Please note that FRTO behaves like
standard TCP if there is no new data to send or if the congestion
window is as big as flow control window. This is one of the reasons
why FRTO performance is worse than Eifel or DCLOR.
Table-6: Spectral Efficiency
+-----------+----------+----------+----------+----------+
| File Size | DCLOR | EIFEL | STD_TCP | FRTO |
| |(MSS=1460)| (1448) | (1460) | (1460) |
+-----------+----------+----------+----------+----------+
| 5K | 0.004042 | 0.004454 | 0.092714 | 0.071336 |
| 10k | 0.005249 | 0.012293 | 0.078977 | 0.052581 |
| 100k | 0.017124 | 0.036748 | 0.624361 | 0.079285 |
+-----------+----------+----------+----------+----------+
The DCLOR algorithm for all these tests cases were implemented
in the Linux Kernel-2.5.40 (the source code for DCLOR
could be provided on request). The same kernel version was
used for all the other TCP enhancements too.
Expires: Apr 2003 [Page 26]