Network working group I. van Beijnum
Internet-Draft IMDEA Networks
Expires: November 7, 2009 May 6, 2009
One-ended multipath TCP
draft-van-beijnum-1e-mp-tcp-00
Status of this Memo
This Internet-Draft is submitted to IETF in full conformance with the
provisions of BCP 78 and BCP 79.
Internet-Drafts are working documents of the Internet Engineering
Task Force (IETF), its areas, and its working groups. Note that
other groups may also distribute working documents as Internet-
Drafts.
Internet-Drafts are draft documents valid for a maximum of six months
and may be updated, replaced, or obsoleted by other documents at any
time. It is inappropriate to use Internet-Drafts as reference
material or to cite them other than as "work in progress."
The list of current Internet-Drafts can be accessed at
http://www.ietf.org/ietf/1id-abstracts.txt.
The list of Internet-Draft Shadow Directories can be accessed at
http://www.ietf.org/shadow.html.
This Internet-Draft will expire on November 7, 2009.
Copyright Notice
Copyright (c) 2009 IETF Trust and the persons identified as the
document authors. All rights reserved.
This document is subject to BCP 78 and the IETF Trust's Legal
Provisions Relating to IETF Documents in effect on the date of
publication of this document (http://trustee.ietf.org/license-info).
Please review these documents carefully, as they describe your rights
and restrictions with respect to this document.
Abstract
Normal TCP/IP operation is for the routing system to select a best
path that remains stable for some time, and for TCP to adjust to the
properties of this path to optimize throughput. A multipath TCP
would be able to either use capacity on multiple paths, or
van Beijnum Expires November 7, 2009 [Page 1]
Internet-Draft One-ended multipath TCP May 2009
dynamically find the best performing path, and therefore reach higher
throughput. By adapting to the properties of several paths through
the usual congestion control algorithms, a multipath TCP shifts its
traffic to less congested paths, leaving more capacity available for
traffic that can't move to another path on more congested paths. And
when a path fails, this can be detected and worked around by TCP much
more quickly than by waiting for the routing system to repair the
failure.
This memo specifies a multipath TCP that is implemented on the
sending host only, without requiring modifications on the receiving
host.
Table of Contents
1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . . 3
2. Notational Conventions . . . . . . . . . . . . . . . . . . . . 5
3. Congestion control . . . . . . . . . . . . . . . . . . . . . . 5
3.1. RTT measurements . . . . . . . . . . . . . . . . . . . . . 5
3.2. Fast retransmit . . . . . . . . . . . . . . . . . . . . . 6
3.3. Slow retransmit . . . . . . . . . . . . . . . . . . . . . 6
3.4. SACK . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
3.5. Fairness and TCP friendliness . . . . . . . . . . . . . . 8
4. Path selection . . . . . . . . . . . . . . . . . . . . . . . . 8
4.1. The multipath IP layer . . . . . . . . . . . . . . . . . . 9
4.2. The path indication option . . . . . . . . . . . . . . . . 10
4.3. Timestamp integration option . . . . . . . . . . . . . . . 12
4.4. Path for retransmissions . . . . . . . . . . . . . . . . . 12
4.5. ECN . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
4.6. Path MTU discovery . . . . . . . . . . . . . . . . . . . . 13
5. Flow control and buffer sizes . . . . . . . . . . . . . . . . 14
6. Handling of RSTs . . . . . . . . . . . . . . . . . . . . . . . 14
7. Middlebox considerations . . . . . . . . . . . . . . . . . . . 14
8. Security considerations . . . . . . . . . . . . . . . . . . . 15
9. IANA considerations . . . . . . . . . . . . . . . . . . . . . 15
10. Acknowledgements . . . . . . . . . . . . . . . . . . . . . . . 15
11. References . . . . . . . . . . . . . . . . . . . . . . . . . . 16
11.1. Normative References . . . . . . . . . . . . . . . . . . . 16
11.2. Informational References . . . . . . . . . . . . . . . . . 16
Appendix A. Document and discussion information . . . . . . . . . 17
Appendix B. An implementation strategy . . . . . . . . . . . . . 17
Author's Address . . . . . . . . . . . . . . . . . . . . . . . . . 21
van Beijnum Expires November 7, 2009 [Page 2]
Internet-Draft One-ended multipath TCP May 2009
1. Introduction
In order to achieve redundancy to protect against failures, network
operators generally install more links than the minimum necessary to
achieve reachability. So there are often multiple paths between any
two given hosts, even when paths not allowed by policy are removed.
However, routing protocols usually select a single "best" path. When
multiple paths are used at the same time by the routing system, those
tend to be parallel links between two routers or paths that are
otherwise very similar. As such, a lot of potentially usable network
capacity is left unused. A multipath transport protocol would be
able to use more of that capacity by sending its data along multiple
paths at the same time, or by switching to a path with more available
capacity.
As TCP [RFC0793] is used by the vast majority of all networked
applications, and TCP is responsible for the vast majority of all
data transmitted over the internet, the logical choice would be to
make TCP capable of using multiple paths. SCTP already has the
ability to use multiple paths through the use of multiple addresses.
However, using SCTP in this way requires significant application
changes and deployment would be challenging because there is no
obvious way for an application to know whether a service is available
over SCTP rather than, or in addition to, TCP. In addition, SCTP as
defined today [RFC2960] does not accommodate the concurrent use of
multiple paths. Additional paths are purely used for backup
purposes.
This memo describes a one-ended multipath TCP, which only changes the
behavior of the TCP sender, achieving multipath advantages when
communicating with unmodified TCP receivers. This means it is not
possible to perform path selection by using different destination
addresses. However, other mechanisms that are transparent to the
receiver are possible. A simple one would be for the sender to send
some packets to one router, and other packets to another router. If
these routers then make different routing decisions for the
destination address in the TCP packets, the packets flow over
different paths part of the way. Other mechanisms to achieve the
same goal are also possible. However, with a single destination
address, paths can't be completely disjoint.
Using multiple paths at the same time brings up a number of
challenges and questions:
o Naive scheduling (such as round robin) of transmissions over the
different paths reduces performance of each path to that of the
slowest path.
van Beijnum Expires November 7, 2009 [Page 3]
Internet-Draft One-ended multipath TCP May 2009
o Using multiple paths causes reordering, which triggers the fast
retransmit algorithm, causing unnecessary retransmissions and
reduced performance.
o TCP requires in-order delivery of data to the application, so when
losses occur on one path, buffer capacity may run out and data
can't be transmitted on unaffected paths until the lost data has
been retransmitted.
o Using multiple paths with an instance of regular congestion
control on each path for a single TCP session makes that session
use network capacity more aggressively than single path sessions,
which can be considered "unfair" and increases packet loss.
This memo seeks to address the first two issues by running separate
instances of TCP's congestion control algorithms for the subflows
that flow over different paths. Buffer issues are addressed by
retransmitting packets before buffer space runs out, even if normal
retransmission timers haven't fired yet. The fairness issue is a
topic of ongoing research; this specification simply limits the
number of subflows to limit unfairness and increased loss.
The one-ended multipath TCP takes advantage of the fact that TCP
[RFC0793] congestion control [RFC2581] and flow control are performed
by the sender. With regard to flow control and congestion control,
the role of the receiver is limited to sending back acknowledgments
and advertise how much data it is prepared to receive. Hence, it is
possible for the sender to utilize different paths and modify the
fast retransmit logic as long as the receiver recognizes the packets
as belonging to the same session. So a multipath TCP sender can
distribute packets over multiple paths as long as this doesn't
require incompatible modifications to the IP or TCP header contents,
most notably the addresses. A single-ended multipath TCP session
must still be between a single source address and a single
destination address, regardless of the path taken by packets.
The subset of the packets belonging to a TCP session flowing over a
given path is designated a subflow.
In order to benefit from using multiple paths, it's necessary for the
multipath TCP sender to execute separate TCP congestion control
instances for the packets belonging to different subflows. In the
case where all packets are subject to the same congestion window,
performance over a fast and a slow path will often be poorer than
over just the fast path, defeating the purpose of using multiple
paths. For instance, in the case of a 10 Mbps and a 100 Mbps path
with otherwise identical properties, a simple round robin
distribution of the packets and the use of a single congestion window
van Beijnum Expires November 7, 2009 [Page 4]
Internet-Draft One-ended multipath TCP May 2009
will limit performance to that of the slowest path multiplied by the
number of paths, 20 Mbps in this case.
2. Notational Conventions
The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
"SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
document are to be interpreted as described in [RFC2119].
3. Congestion control
A multipath TCP maintains instances of all congestion control related
variables for each subflow. This includes, but is not limited to,
the congestion window, the ssthresh, the retransmission timeout
(RTO), the user timeout and RTT measurements. However, because TCP
requires in-order delivery of data, there must be a single send
buffer and a single receive buffer, thus flow control must happen
session-wide.
Per-subflow congestion control is performed by recording the path
used to transmit each packet. Acknowledgments are then attributed to
the subflow the acknowledged packets were sent over and the
congestion window and other congestion control variables for the
relevant subflow are updated accordingly.
3.1. RTT measurements
Because a multipath TCP sender knows which packet it sent over which
path, it can perform per-path round trip time measurements. This
only works if return packets are consistently sent over the same path
(or a set of paths with the same latency). If the receiver is not
multipath-aware, this condition will generally hold: acknowledgments
will flow from the receiver to the sender over a single path unless
there is a topology change in the routing system or packets that
belong to a single session are distributed over different paths by
routers, which is rare. To multipath-capable routers on the return
path (if any), the non-multipath-aware host appears to select the
default path for all of its packets.
However, if, like the sender, the receiver is multipath-aware, then
the return path that the receiver chooses to send ACKs over will
influence the RTTs seen by the original sender. The situation where
the sender is unaware of fact that the receiver selects different
return paths with different latencies is suboptimal, even compared to
consistently measuring the RTT over the slowest path, as this leads
to higher variability in the RTT measurements and therefore a higher
van Beijnum Expires November 7, 2009 [Page 5]
Internet-Draft One-ended multipath TCP May 2009
RTO.
Having the receiver send ACKs over the same path mitigates the
problem somewhat; but presumably, if the receiver is also multipath
capable and has data to send, it will want to send this data over
more than one path. So RTT measurements may inadvertently end up
measuring different return paths in that case. A better solution is
for the sender to include an indication in packets that allows the
receiver to determine through which path the sender sent the packet.
This information, along with the path initially chosen for the
outgoing packet that is acknowledged, allows TCP to attribute each
RTT measurement to a specific path.
Because congestion control happens per path, there must also be a
separate retransmission timeout (RTO) value for each path.
3.2. Fast retransmit
Different paths will almost certainly have different RTTs, and even
if the average RTT is the same, normal burstiness and differences in
packet sizes will make packets routinely arrive through the different
paths in a different order than the order in which they were
transmitted. Without modifications to the algorithm, this would
trigger the fast retransmit algorithm unnecessarily. To avoid this,
fast retransmit is executed whenever, for packets belonging to the
same subflow, after an unACKed packet or sequence of packets, more
than two segments of new data is ACKed with SACK. This means fast
retransmit happens per subflow, and reordering between subflows no
longer triggers fast retransmit.
3.3. Slow retransmit
In multipath TCP, a per-path RTO is employed to recover from
congestion events that fast retransmit can't handle. Because the
missing packets create holes in the data stream, subsequent packets
received over other paths must be buffered in the receive buffer.
Unless the receive buffer is extremely large, this means the entire
session stalls when the receive buffer fills up. This situation
persists until the RTO expires for the congested or broken path so
the missing packets can be retransmitted. Should the path in
question be completely broken, this will then lead to an almost
immediate new stall, and the stall/RTO cycles will then continue
until the user timeout / R2 timer [RFC1122] for the subflow expires.
This is solved by taking unacknowledged packets transmitted over
subflows that are stalled because they have exhausted their
congestion window and are now waiting for the RTO to expire, and
scheduling retransmissions of those packets over other paths before
van Beijnum Expires November 7, 2009 [Page 6]
Internet-Draft One-ended multipath TCP May 2009
the RTO of the stalled subflow expires. This should be done such
that the missing packet arrives before it becomes necessary to stop
sending data altogether because the receiver advertises a zero
receive buffer. Such retransmissions therefore happen as the receive
buffer space advertised by the receiver reaches RTT * MSS for the
path that will be used for the retransmission; presumably the path
with the lowest RTT. In essence, this creates a second level of fast
retransmit that acts across subflows in addition to the normal fast
retransmit that happens per subflow. This mechanism is named "slow
retransmit".
In the case of single path TCP, scheduling retransmissions before the
RTO expires could be problematic because this would be more
aggressive than standard (New)Reno congestion control. But in the
case of multipath TCP, the retransmission can happen over one of the
other paths, which is still progressing.
By scheduling a retransmission faster than an RTO, there is an
increased risk that a packet that was still working its way through
the network is retransmitted unnecessarily. However, the alternative
is allowing the progress of the session to stall (on all paths),
reducing throughput significantly.
3.4. SACK
When packets (belonging to different subflows) arrive out of order,
the the receiver can't acknowledge the receipt of the out of order
packets using TCP's normal cumulative acknowledgment. However, the
[RFC2018] (also see [RFC1072]) Selective Acknowledgment (SACK)
mechanism is widely implemented. SACK makes it possible for a
receiver to indicate that three or four additional ranges of data
were received in addition to what is acknowledged using a normal
cumulative ACK. When packets are sent over multiple paths and arrive
out of order, the information in the SACK returned by the receiver
can tell the sender how each subflow is progressing, so per-subflow
congestion control can progress smoothly and unnecessary
retransmissions are largely avoided.
One-ended multipath TCP requires the use of SACK to be able to
determine which subflows are progressing even if other subflows are
stalled, and thus the normal TCP ACK isn't progressing. If the
remote host doesn't indicate the SACK capability during the three-way
handshake, a multipath TCP implementation SHOULD limit itself to
using only a single subflow and thus disabling multipath processing
for the session in question.
van Beijnum Expires November 7, 2009 [Page 7]
Internet-Draft One-ended multipath TCP May 2009
3.5. Fairness and TCP friendliness
One of the goals of multipath TCP is increased performance over
regular TCP. However, it would be harmful to realize this benefit by
taking more than a "fair" share of the available bandwidth. One
choice would be to make each subflow execute normal NewReno
congestion control on each subflow, so that each individual subflow
competes with other TCPs on the same footing as a regular TCP
session. If all subflows use non-overlapping physical paths, other
TCPs are no worse off than in the situation where the multipath TCP
were a regular TCP sharing their path, so this could be considered
fair even though the multipath TCP increases its bandwidth in direct
relationship to the number of subflows used. Note that in this case,
although multipath TCP sends at the same rate as regular TCP on a
given path, resource pooling [wischik08pooling] benefits are still
realized because a given transmission completes faster so it uses up
resources for a shorter amount of time.
But if several logical paths share a physical path, multipath TCP
takes a larger share of the bandwidth on that path. This would only
be acceptable as fair for a very small number of subflows. The other
end of the spectrum would be for multipath TCP to conform to exactly
the same congestion window increase and decrease envelope that a
regular TCP exhibits, being no more aggressive than a regular single
path TCP session. At this point in time we will assume that fairness
is a tunable factor of the regular NewReno AIMD envelope. A simple
way to limit the amount of additional aggressiveness exhibited by
multipath TCP is a limit on the number of subflows. Until more
analysis has been performed and/or there is more experience with
multipath TCP, a multipath TCP implementation SHOULD limit itself to
using no more than 3 subflows concurrently.
4. Path selection
Note that in order to gain multipath benefits, the multipath TCP
layer must be able to determine the logical path followed by each
packet so it can measure path properties and perform per-path
congestion control. In order to limit the number of packets flowing
over each path to the amount allowed by the per path congestion
window, the multipath TCP layer must be able to specify over which
path a given packet is transmitted.
The situation where routers distribute packets over different paths
based on their own criteria makes it impossible for hosts to send
less traffic over congested paths and more traffic over uncongested
paths and is therefore incompatible with multipath TCP. When routers
distribute traffic belonging to the same flow (or, in the case of
van Beijnum Expires November 7, 2009 [Page 8]
Internet-Draft One-ended multipath TCP May 2009
multipath TCP: subflow) over different paths this will also cause
reordering and the associated performance impact on TCP.
4.1. The multipath IP layer
The one-ended multipath TCP is logically layered on a multipath IP
layer, which is able to to deliver packets to the same destination
address through one or more logical paths, where the set of n logical
paths share between one and m physical paths. In some cases, the
multipath IP layer will be able to determine that a logical path
isn't working, or maps to the same physical path as a previous
logical path. For example, if the multipath TCP indicates that a
packet should be sent over the third path, and the multipath IP is
set up to use different next hop addresses for path selection, but
only two next hop addresses are available, the multipath IP layer can
provide feedback to the multipath TCP layer. In other cases, packets
simply won't be delivered, or will be delivered through the same
physical path used by other logical paths. This may for instance
happen when multipath TCP selects path 1 and multipath IP puts a path
selector with value "1" in the packet, but there are no multipath
capable routers between the source and destination, so all packets,
regardless of the presence and/or value of a path selector, are
routed over the same physical path.
It is up to the multipath TCP layer to handle each of these
situations.
For the purposes of this multipath TCP specification, the simplest
possible interface to the multipath IP layer is assumed. When TCP
segments traveling down the stack from the TCP layer to the IP layer
aren't accompanied by a path selector value, or the path selector
value is zero, the IP layer delivers packets in the same way as for
unmodified TCP and other existing transport protocols, i.e., over the
default path. Segments may also be accompanied by a path selector
value higher than zero, which indicates the desired path. If the
desired logical path is available, or may be available, the multipath
IP layer attempts to deliver the packet using that logical path. If
the desired logical path is known to be unavailable, the multipath IP
layer drops the segment.
It is assumed that paths as seen by the multipath IP layer are mapped
to logical paths with increasing numbers roughly ordered in order of
decreasing assumed performance or availability. I.e., if path x
doesn't work or has low performance, that doesn't necessarily mean
that path x+1 doesn't work or has low performance, but if if paths x,
x+1 and x+2 don't work or have low performance, then it's highly
likely that paths x+3 and beyond also don't work or have even lower
performance. Routers may have good next hop or even intra-domain
van Beijnum Expires November 7, 2009 [Page 9]
Internet-Draft One-ended multipath TCP May 2009
link weight information and link congestion information, but they
generally don't have information about the end-to-end path
properties, so the ordering of paths from high to low availability/
performance must be considered little more than a hint.
The multipath IP layer may be implemented through a variety of
mechanisms, including but not limited to:
o Using different outgoing interfaces on the host
o Directing packets towards different next hop routers
o Integration with shim6 [I-D.ietf-shim6-proto] so that packets can
use different address pairs
o Manipulation of fields used in ECMP [RFC2992] (i.e., a different
flow label)
o Type of service routing (such as [RFC4915])
o Different lower layer encapsulation, such as MPLS
o Tunneling through overlays
o Source routing
o An explicit path selector field in packets, acted upon by routers
At this time, no choice is made between these different mechanisms.
4.2. The path indication option
Note that several of the fields discussed below are defined with
future developments in mind, they are not necessarily immediately
useful.
In order to allow for accurate RTT measurements and to inform the IP
layer of the selected path, a TCP option indicating the desired path
is included in all segments that don't use the default path. The
format of this option is as follows:
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| KIND=TBA | LENGTH = 3 |D| MP |R| SP |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
The length is 3.
D is the "discard eligibility" flag (1 bit). It is similar, but not
van Beijnum Expires November 7, 2009 [Page 10]
Internet-Draft One-ended multipath TCP May 2009
identical, to the frame relay discard eligibility bit or the ATM cell
loss priority bit. Set to zero, no special behavior is requested.
Set to one, this indicates that loss of the packet will be
inconsequential. This allows routers to drop packets with D=1 more
readily than other packets under congested conditions, and also to
completely block packets with D=1 on links that are considered long-
term congested or expensive, even if there is no momentary
congestion.
Setting the D bit to 1 for some subflows (presumably, ones with a
performance lower than the best performing subflow) allows multipath
TCP to give way to regular TCP and other single path traffic on
congested or expensive paths. As long as the multipath TCP sets D to
0 on the subflow with the best performance, multipath TCP should
still perform better than regular TCP, but the reduction in bandwidth
use on the other paths helps achieve resource pooling benefits.
MP is a is a path selector that may be interpreted by multiple
routers along the way (3 bits). A value of 0 is the default path
that is also taken by packets that don't contain a multipath option.
Multipath TCP aware routers should take this value into account when
performing ECMP [RFC2992]. Packets with any value for MP MUST be
forwarded, even if the number of available paths is smaller than the
value in MP.
R (1 bit) is reserved for future use. MUST be set to zero on
transmission and ignored on reception.
SP is a path selector that is interpreted only once by the local TCP
stack or a router close to the sender (3 bits). A value of 0 is the
default path that is also taken by packets that don't contain a
multipath option. If the value in SP points to a path that isn't
available, the packet SHOULD be silently dropped. This behavior, as
opposed to selecting an alternate path out of the available ones,
helps avoid the use of duplicate paths. As such, a router may only
interpret SP rather than MP when it is known that the router is the
only one acting on SP. All other routers may only act on MP.
It is not expected that routers will make routing decisions directly
based on the path indication option, as this option occurs deep
inside the packet and not in a fixed place. However, a multipath IP
layer or a middlebox may write a path selection value into a field in
packets that is easily accessible to routers. But conceptually, the
routers act upon the values in SP and MP.
The initial packets for each TCP session MUST use D, MP and SP values
of zero. If D, MP and SP are all zero, then the path selector option
isn't included in the packet. This makes sure that single path
van Beijnum Expires November 7, 2009 [Page 11]
Internet-Draft One-ended multipath TCP May 2009
operation remains possible even if packets with the path selector
option are filtered in the network or rejected by the receiver. The
packets that are part of the TCP three-way handshake SHOULD be sent
over the default path, in which case they don't contain the path
selector option; hence the ability to do multipath TCP isn't
indicated to the correspondent at the beginning of the session as is
usual for most other TCP extensions.
4.3. Timestamp integration option
As an optimization, hosts MAY borrow the four bits used by the path
selector option from the timestamp option, and thus save one byte of
option space, which means the path selector option can replace the
padding necessary when the timestamp option is used and not increase
header overhead. In that case, the combined path selector and
timestamp options MUST appear as follows:
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| KIND=TBA | LENGTH = 2 | KIND=8 | LENGTH = 10 |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|D| MP | TS Value (TSval) |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| TS Echo Reply (TSecr) |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
D and MP are the same as in the three-byte form of the path selector
option. R and SP do not occur in this form of the path selector
option and are assumed to be zero.
TSval is the locally generated timestamp. Because the timestamp is
reduced to 28 bits, the minimum clock frequency is increased from the
59 nanoseconds mandated by [RFC1323] to 1 microsecond so the
timestamp wraps in no less than 255 seconds.
TSecr is the timestamp echoed back to the other side (32 bits).
All hosts conforming to this specification MUST be able to recognize
the integrated path selector and timestamp options, but they are not
required to generate them.
4.4. Path for retransmissions
A multipath TCP implementation MUST be capable of scheduling
retransmissions over a path different from the path used to transmit
the packet originally. This includes packets subject to fast
retransmit.
van Beijnum Expires November 7, 2009 [Page 12]
Internet-Draft One-ended multipath TCP May 2009
4.5. ECN
Explicit Congestion Notification works by routers setting a
congestion indication in the IP header of packets rather than
dropping those packets when they experience congestion. The receiver
echos this information back to the sender which then performs
congestion control in exactly the same way as if a packet was lost.
The ECN specification ([RFC3168]) is such that the receiver sets the
ECN-Echo (ECE) flag in the TCP header for all subsequent packets that
it sends back until the sender sets the Congestion Window Reduced
(CWR) flag. As the ECE flag is set in multiple ACKs, there is no
obvious way to correlate the ECN indication in an ACK with a specific
packet that experienced congestion, and subsequently, the path that
is congested.
At this time, a multipath TCP conforming to this specification SHOULD
NOT use ECN. ECN MAY be negotiated, but when more than a single path
is used at a given time, packets SHOULD be sent with the ECN field
set to Not-ECN (00), and incoming non-zero ECE flags SHOULD NOT be
acted upon with regard to congestion control.
4.6. Path MTU discovery
Path MTU discovery [RFC1191] is performed for TCP by having TCP
reduce its packet sizes whenever "packet too big but DF set" ICMP
messages are received. As the name suggests, the path MTU is
dependent on the path used, so multipath TCP must maintain MTU
information for each path, and adjust this information for each path
individually based on the too big messages that it receives.
The time between probing with a larger than previously discovered MTU
must either be randomized or explicitly coordinated to avoid probing
larger MTUs for multiple subflows at the same time, as probing larger
MTUs is likely to lead to a lost packet, and having losses on
multiple paths at the same time would be suboptimal. For instance,
rather than probe every t, in the case of 2 paths, after t*0.5 the
first path is probed, after t the second and after t*1.5 the first is
probed again.
Both the IPv4 and IPv6 versions of ICMP return enough of the original
packet in a "packet too big" message to be able to recover the
sequence number from the original packet, which makes it possible to
correlate the too big message with the packet that caused it, and
thus the path used to transmit the packet.
van Beijnum Expires November 7, 2009 [Page 13]
Internet-Draft One-ended multipath TCP May 2009
5. Flow control and buffer sizes
In order to accommodate the increased number of packets in flight,
the send buffer must be increased in direct relationship with the
number of paths being used. Alternatively, the number of paths used
concurrently should be limited to send buffer / avgRTT.
Although under normal operation, the receive buffer doesn't fill up,
there are two reasons the receive buffer must be the same size as the
send buffer: it must be able to accommodate a round trip time plus
two segments worth of data during fast retransmit, and the advertised
receive window limits the amount of data the sender will transmit
before waiting for acknowledgments. So in practice, the receive
buffer limits the maximum size of the send buffer, and therefore, the
number of paths that can be supported concurrently.
There is no simple rule of thumb to determine the number of paths
that should be used, as the maximum number of paths that the receive
window can accommodate depends both on the maximum receive window
advertised by the receiver and by the RTTs on the paths.
6. Handling of RSTs
If an RST is received after enabling a new path, this could be a
reaction to the presence of an unknown option. So the optimal
situation would be for an RST to reset just the path used to send the
packet that generated the RST, not the entire session. Only when the
last path or the default path (on which packets don't include special
options) receives an RST, the entire session should be reset.
7. Middlebox considerations
NATs are designed to be transparent to TCP. Because one-ended
multipath TCP conforms to normal TCP semantics on the wire, multipath
TCP should in principle also be compatible with NAT. However, if
different paths are served by different NATs that apply different
translations, the receiver won't be able to determine that the
different subflows through the different paths belong to the same TCP
session. So for NAT to work, the translation must either happen in a
location that all paths flow through, or the different NATs on the
different paths must act as a single, distributed NAT and apply the
same translation to the different subflows.
Middleboxes that only see traffic flowing over a subset of the paths
used will see large numbers of gaps in the sequence number space.
They may also not observe only a partial three-way handshake, or not
van Beijnum Expires November 7, 2009 [Page 14]
Internet-Draft One-ended multipath TCP May 2009
observe any ACKs. As such, like with NATs, middleboxes that enforce
conformance to known TCP behavior, must be placed such that they
observe all subflows. For middleboxes that just check whether
packets fall inside the TCP window, it may be sufficient for
multipath TCP senders to make sure that all paths see at least one
packet per window. Middleboxes that enforce sequence number
integrity will almost certainly also block TCP packets for which they
didn't observe the three way handshake. A possible way to
accommodate that behavior would be to send copies of all session
establishment and tear down packets over all paths that the sender
may use. However, this strategy is still likely to fail unless the
receiver does the same so the middleboxes may observe the signaling
packets flowing in both directions.
It's also possible that middleboxes (or perhaps hosts themselves)
reject packets with the path indicator TCP option. Since packets
flowing over the default path don't carry the path indicato option,
these packets should always be allowed through, so single path
operation is always possible. When a multipath TCP sender starts to
send packets over alternative paths, those packets won't make it to
the receiver because they contain the path indicator option. The
result is that a new subflow, which would use a congestion window of
two maximum segment sizes, would send two packets and then
experiences a retransmission timeout. Slow retransmit makes sure the
packets are transmitted before the session stalls, so the impact of
the lost packets is negligible.
8. Security considerations
None at this time.
9. IANA considerations
IANA is requested to provide a TCP option kind number for the path
indication option.
10. Acknowledgements
The single ended multipath TCP was developed together with Marcelo
Bagnulo and Arturo Azcorra.
Members of the Trilogy project, especially Costin Raiciu, have
contributed valuable insights.
Iljitsch van Beijnum is supported by Trilogy
van Beijnum Expires November 7, 2009 [Page 15]
Internet-Draft One-ended multipath TCP May 2009
(http://www.trilogy-project.org), a research project (ICT-216372)
partially funded by the European Community under its Seventh
Framework Program. The views expressed here are those of the
author(s) only. The European Commission is not liable for any use
that may be made of the information in this document.
11. References
11.1. Normative References
[RFC0793] Postel, J., "Transmission Control Protocol", STD 7,
RFC 793, September 1981.
[RFC1191] Mogul, J. and S. Deering, "Path MTU discovery", RFC 1191,
November 1990.
[RFC1323] Jacobson, V., Braden, B., and D. Borman, "TCP Extensions
for High Performance", RFC 1323, May 1992.
[RFC2018] Mathis, M., Mahdavi, J., Floyd, S., and A. Romanow, "TCP
Selective Acknowledgment Options", RFC 2018, October 1996.
[RFC2119] Bradner, S., "Key words for use in RFCs to Indicate
Requirement Levels", BCP 14, RFC 2119, March 1997.
[RFC2581] Allman, M., Paxson, V., and W. Stevens, "TCP Congestion
Control", RFC 2581, April 1999.
[RFC2992] Hopps, C., "Analysis of an Equal-Cost Multi-Path
Algorithm", RFC 2992, November 2000.
[RFC3168] Ramakrishnan, K., Floyd, S., and D. Black, "The Addition
of Explicit Congestion Notification (ECN) to IP",
RFC 3168, September 2001.
11.2. Informational References
[RFC1072] Jacobson, V. and R. Braden, "TCP extensions for long-delay
paths", RFC 1072, October 1988.
[RFC1122] Braden, R., "Requirements for Internet Hosts -
Communication Layers", STD 3, RFC 1122, October 1989.
[RFC2960] Stewart, R., Xie, Q., Morneault, K., Sharp, C.,
Schwarzbauer, H., Taylor, T., Rytina, I., Kalla, M.,
Zhang, L., and V. Paxson, "Stream Control Transmission
Protocol", RFC 2960, October 2000.
van Beijnum Expires November 7, 2009 [Page 16]
Internet-Draft One-ended multipath TCP May 2009
[RFC4915] Psenak, P., Mirtorabi, S., Roy, A., Nguyen, L., and P.
Pillay-Esnault, "Multi-Topology (MT) Routing in OSPF",
RFC 4915, June 2007.
[wischik08pooling]
Wischik, D., Handley, M., and M. Bagnulo Braun, "The
resource pooling principle", Computer Communication
Review 38, September 2008.
[I-D.ietf-shim6-proto]
Nordmark, E. and M. Bagnulo, "Shim6: Level 3 Multihoming
Shim Protocol for IPv6", draft-ietf-shim6-proto-11 (work
in progress), December 2008.
Appendix A. Document and discussion information
The latest version of this document will always be available at
http://www.muada.com/drafts/. Please direct questions and comments
to the multipathtcp@ietf.org mailinglist or directly to the author.
Appendix B. An implementation strategy
In order to perform per-path congestion control, all of the ACK-based
events that trigger congestion control responses as well as all the
variables used by the congestion control algorightms must be
recreated in the multipath situation. These are the triggers and
variables for the four mechanisms in RFC 2581.
1. the path MTU (page 4)
2. the arrival of an ACK that acknowledges new data (page 4)
3. the arrival of a non-duplicate ACK (page 4) or the sum of new
data acknowledged (page 5)
4. triggering of the retransmission timer (page 5)
5. the flightsize or number of bytes sent but not acknowledged (page
5)
6. the retransmission of a segment (page 5)
7. the arrival of a third or subsequent duplicate ACK (page 6, page
7)
van Beijnum Expires November 7, 2009 [Page 17]
Internet-Draft One-ended multipath TCP May 2009
8. whether a retransmission timeout period has elapsed since the
last reception of an ACK (page 7)
1, 4, 6 and 8 are maintained session-wide.
We recreate these events and variables based on SACK information in
the one-sequence number multipath TCP case as follows.
We keep track of every packet sent. (Alternatively: multi-packet
contiguous blocks of data transmitted over the same path.) When an
ACK comes in, we first remove the stored information about packets/
data blocks that are cumulatively ACKed, noting how much data was
ACKed for each path that the packets were sent over. Then we do the
same for all the SACK blocks in the ACK. Because we remove the
information about (S)ACKed data and you can remove something just
once, we don't have to keep track of previous SACKs like the current
BSD implementation does.
The only slightly tricky part is emulating duplicate ACKs. This may
not even be really necessary, as the SACKs give us better information
to base fast retransmit on, but that's something for another day.
What happens in the pseudo code is that when traversing the list of
sent packets (this happens in order of seqnum), we note the path that
packets that aren't SACKed are sent over. When we're done processing
SACK data and it turns out that for a path there are one or more
packets that we skipped over when processing SACK data and there was
also data SACKed after a skipped packet, there was a lost (or
reordered) packet on this path. When the amount of "duplicate ACKed"
data grows beyond two segment sizes, we've reached the equivalent of
three duplicate ACKs so we trigger fast retransmit (7).
We update the congestion window (2 and 3) when there was data
(S)ACKed for a path. ACKs that don't acknowledge any data for a path
aren't relevant because we don't need them to trigger fast retransmit
and we assume that they're sent to (S)ACK data for other paths,
anyway. (Or they could be window updates.)
We maintain the flightsize (5) by simply adding data bytes as packets
are transmitted and subtracting when they're (S)ACKed. Because we
have explicit SACKs, we don't need to guess based on duplicate ACKs.
The flightsize is also adjusted when we perform fast retransmit or a
regular retransmission over a path other than which was used for the
original packet. In addition, we explicitly mark some packets to
trigger once-per-RTT actions when they're ACKed.
Pseudo code for the above:
van Beijnum Expires November 7, 2009 [Page 18]
Internet-Draft One-ended multipath TCP May 2009
// initializing data structures is left as an exercise for the
// reader
// transmitting packets
// assume we've selected a path to transmit over
path.flightsize = path.flightsize + packet.datasize
packet.path = path
packet.status.acked = false
// set up state to remember to do per RTT stuff when packet is
// ACKed
if path.do_per_rtt_next_packet == true
path.per_rtt_seqnum = packet.seqnum.first
packet.per_rtt = true
path.do_per_rtt_next_packet = false
else
packet.status.per_rtt = false
// don't set ECN on outgoing packets for now, can add logic
// for deciding which packets to ECN enable later
packet.ecn.sent = 0
// add to linked list of sent packets (to handle retrans-
// missions, linked list must maintain seqnum order, not FIFO
// or LIFO)
llpush(packet)
// receiving (S)ACKs
// normal flow-wide flow control actions based on cumACK
// also happen (elsewhere)
// handle ECN, must detect transitions rather than
// depend on actual value
if packet.ecnecho == true
if ecn.previous == true
ecn.current = false
else
ecn.current = true
ecn.previous = true
else
ecn.previous = false
// initialize some stuff before we handle the ACK
for each path
path.do_per_rtt = false
path.ackedbytes = 0
path.unacked.sure = 0
path.unacked.maybe = 0
path.ecn.received = false
van Beijnum Expires November 7, 2009 [Page 19]
Internet-Draft One-ended multipath TCP May 2009
// remove cumulatively ACKed packets
llwalk_init
packet = llwalk_next
while packet.seqnum.first < ack.cumulative
// ECN, we only act if we enabled ECN when we sent the packet
if ecn.current & packet.ecn.sent <> 0
path.ecn.received = true
// if part of a packet is ACKed, we need some trickery
if packet.seqnum.last_plus_one > ack.cumulative
path.ackedbytes += ack.cumulative - packet.seqnum.first
packet.seqnum.first = ack.cumulative
else
path.ackedbytes = path.ackedbytes + packet.datasize
if packet.per_rtt & packet.seqnum.first == path.per_rtt_seqnum
path.do_per_rtt = true
llremove(packet)
packet = llwalk_next
// now we handle the SACKs (assume exactly one SACKblock for
// simplicity) we continue walking the linked list, no need to
// restart
while packet.seqnum.first < ack.sack.last_plus_one
if packet.seqnum.last_plus_one < ack.sack.first
// these packets overlap with the SACK block
// for simplicity, assume packets are always completely
// SACKed in reality we need to split a packet if only the
// middle is SACKed ECN, we only act if we enabled ECN when
// we sent the packet
if ecn.current & packet.ecn.sent <> 0
path.ecn.received = true
path.ackedbytes = path.ackedbytes + packet.datasize
if packet.per_rtt & packet.seqnum.first == path.per_rtt_seqnum
path.do_per_rtt = true
// add potentially unacked bytes to for sure unacked bytes
// because we now know we had a SACK hole if any
// unacked maybe bytes
path.unacked.sure = path.unacked.sure + path.unacked.maybe
path.unacked.maybe = 0
// remove packet from the list
llremove(packet)
else
// note how many bytes we skipped unSACKed
// if later data is SACKed, that's our version of a dup ACK
path.unacked.maybe = path.unacked.maybe + packet.datasize
packet = llwalk_next
// done processing, now tally up the the results
foreach path
van Beijnum Expires November 7, 2009 [Page 20]
Internet-Draft One-ended multipath TCP May 2009
// update flightsize (item 5 in CC events/variables list)
path.flightsize = path.flightsize - path.ackedbytes
// if any data was ACKed
if path.ackedbytes <> 0
// some stuff was ACKed for this path
if path.unacked.sure > 2 * path.mss
// more than 2 * MSS worth of data in SACK hole = fast
// retransmit execute fast retransmit (item 7 in CC
// events/variables list) need to handle flightsize in
// some way here ignore ECN because we already have a loss
// send back ECN window update indication, though
else
// SACKs were cumulative for this path
// execute cwnd update (items 2 and 3 in CC events/
// variables list)
// ECN must be taken into account here
// and send back ECN window update indication
if path.do_per_rtt
// execute per RTT actions
// indicate that this should be set for next packet sent
path.do_per_rtt_next_packet == true
Note that the pseudo-code doesn't cover all the mechanisms explained
earlier. Also, ECN is handled here because it's not too difficult to
do. The hard part is deciding which packets to enable ECN for.
Author's Address
Iljitsch van Beijnum
IMDEA Networks
Avda. del Mar Mediterraneo, 22
Leganes, Madrid 28918
Spain
Email: iljitsch@muada.com
van Beijnum Expires November 7, 2009 [Page 21]