LEDBAT WG                                                    S. Shalunov
Internet-Draft                                                  G. Hazel
Intended status: Experimental                             BitTorrent Inc
Expires: April 28, 2011                                       J. Iyengar
                                           Franklin and Marshall College
                                                        October 25, 2010

             Low Extra Delay Background Transport (LEDBAT)


   LEDBAT is an experimental delay-based congestion control algorithm
   that attempts to utilize the available bandwidth on an end-to-end
   path while limiting the consequent increase in queueing delay on the
   path.  LEDBAT uses changes in one-way delay measurements to limit
   congestion induced in the network by the LEDBAT flow.  LEDBAT is
   designed largely for use by background bulk-transfer applications; it
   is designed to be no more aggressive than TCP congestion control and
   yields in the presence of competing TCP flows, thus reducing
   interference with the network performance of the competing flows.

Status of this Memo

   This Internet-Draft is submitted in full conformance with the
   provisions of BCP 78 and BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF).  Note that other groups may also distribute
   working documents as Internet-Drafts.  The list of current Internet-
   Drafts is at http://datatracker.ietf.org/drafts/current/.

   Internet-Drafts are draft documents valid for a maximum of six months
   and may be updated, replaced, or obsoleted by other documents at any
   time.  It is inappropriate to use Internet-Drafts as reference
   material or to cite them other than as "work in progress."

   This Internet-Draft will expire on April 28, 2011.

Copyright Notice

   Copyright (c) 2010 IETF Trust and the persons identified as the
   document authors.  All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents
   (http://trustee.ietf.org/license-info) in effect on the date of

Shalunov, et al.         Expires April 28, 2011                 [Page 1]

Internet-Draft                   LEDBAT                     October 2010

   publication of this document.  Please review these documents
   carefully, as they describe your rights and restrictions with respect
   to this document.  Code Components extracted from this document must
   include Simplified BSD License text as described in Section 4.e of
   the Trust Legal Provisions and are provided without warranty as
   described in the Simplified BSD License.

Table of Contents

   1.  Requirements notation  . . . . . . . . . . . . . . . . . . . .  3
   2.  Introduction . . . . . . . . . . . . . . . . . . . . . . . . .  3
     2.1.  Design Goals . . . . . . . . . . . . . . . . . . . . . . .  3
     2.2.  Applicability  . . . . . . . . . . . . . . . . . . . . . .  3
   3.  LEDBAT Congestion Control  . . . . . . . . . . . . . . . . . .  4
     3.1.  Overview . . . . . . . . . . . . . . . . . . . . . . . . .  4
     3.2.  Preliminaries  . . . . . . . . . . . . . . . . . . . . . .  4
     3.3.  Receiver-Side Operation  . . . . . . . . . . . . . . . . .  5
     3.4.  Sender-Side Operation  . . . . . . . . . . . . . . . . . .  5
       3.4.1.  An Overview  . . . . . . . . . . . . . . . . . . . . .  5
       3.4.2.  The Complete Sender Algorithm  . . . . . . . . . . . .  5
     3.5.  Parameter Values . . . . . . . . . . . . . . . . . . . . .  6
   4.  Understanding LEDBAT Mechanisms  . . . . . . . . . . . . . . .  7
     4.1.  Delay Estimation . . . . . . . . . . . . . . . . . . . . .  7
       4.1.1.  Estimating Base Delay  . . . . . . . . . . . . . . . .  7
       4.1.2.  Estimating Queueing Delay  . . . . . . . . . . . . . .  8
     4.2.  Managing the Congestion Window . . . . . . . . . . . . . .  8
       4.2.1.  Window Increase: Probing For More Bandwidth  . . . . .  8
       4.2.2.  Window Decrease: Responding To Congestion  . . . . . .  8
   5.  Choosing Parameter Values  . . . . . . . . . . . . . . . . . .  9
     5.1.  Queuing Delay Target . . . . . . . . . . . . . . . . . . .  9
   6.  Discussion . . . . . . . . . . . . . . . . . . . . . . . . . .  9
     6.1.  Framing Considerations . . . . . . . . . . . . . . . . . .  9
     6.2.  Competing With TCP Flows . . . . . . . . . . . . . . . . . 10
     6.3.  Fairness Among LEDBAT Flows  . . . . . . . . . . . . . . . 10
   7.  IANA Considerations  . . . . . . . . . . . . . . . . . . . . . 12
   8.  Security Considerations  . . . . . . . . . . . . . . . . . . . 12
   9.  Normative References . . . . . . . . . . . . . . . . . . . . . 12
   Appendix A.  Timestamp errors  . . . . . . . . . . . . . . . . . . 12
     A.1.  Clock offset . . . . . . . . . . . . . . . . . . . . . . . 12
     A.2.  Clock skew . . . . . . . . . . . . . . . . . . . . . . . . 13
       A.2.1.  Deployed clock skew correction mechanism . . . . . . . 13
       A.2.2.  Skew correction with faster virtual clock  . . . . . . 14
       A.2.3.  Skew correction with estimating drift  . . . . . . . . 14
   Authors' Addresses . . . . . . . . . . . . . . . . . . . . . . . . 15

Shalunov, et al.         Expires April 28, 2011                 [Page 2]

Internet-Draft                   LEDBAT                     October 2010

1.  Requirements notation

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
   document are to be interpreted as described in [RFC2119].

2.  Introduction

   TCP congestion control [RFC2581], the predominant congestion control
   mechanism used on the Internet, aims to share bandwidth at a
   bottleneck link equitably among flows competing at the bottleneck.
   While TCP works well for many applications, applications such as
   software updates or file-sharing applications prefer to use bandwidth
   available in the network without interfering with the network
   performance of other interactive applications.  Such "background"
   traffic can yield bandwidth to TCP-based "foreground" traffic by
   reacting earlier than TCP to congestion signals.

   LEDBAT is an experimental delay-based congestion control mechanism
   that allows background applications, such as peer-to-peer
   applications, that send large amounts of data particularly over links
   with deep buffers, such as residential uplinks, to operate in the
   background, without interfering with performance of interactive
   applications.  LEDBAT uses one-way delay measurements to determine
   congestion on the data path, and keeps latency across the tight link
   in the end-to-end path low while attempting to utilize the available
   bandwidth on the end-to-end path.

2.1.  Design Goals

   As a "scavenger" mechanism for the Internet, LEDBAT's design goals
   are to:
   1.  Keep delay low when no other traffic is present
   2.  Add little to the queuing delays induced by TCP traffic
   3.  Quickly yield to traffic sharing the same bottleneck queue that
       uses standard TCP congestion control
   4.  Utilize end-to-end available bandwidth
   5.  Operate well in networks with FIFO queuing with drop-tail

2.2.  Applicability

   LEDBAT is a "scavenger" congestion control mechanism---a LEDBAT flow
   attempts to utilize all available bandwidth and yields quickly to a
   competing TCP flow---and is primarily motivated by background bulk-
   transfer applications, such as peer-to-peer file transfers and
   software updates.  It can be used for any application that needs to

Shalunov, et al.         Expires April 28, 2011                 [Page 3]

Internet-Draft                   LEDBAT                     October 2010

   run in the "background", to reduce the application's impact on the
   network and on other interactive network applications.

   LEDBAT can be used with any transport protocol capable of carrying
   timestamps and acknowledging data frequently---LEDBAT can be easily
   used with TCP, SCTP, and DCCP.

3.  LEDBAT Congestion Control

3.1.  Overview

   A TCP sender increases its congestion window until a loss occurs,
   which, in the absence of any Active Queue Management (AQM) in the
   network, occurs only when the queue at the bottleneck link on the
   end-to-end path overflows.  Since packet loss at the bottleneck link
   is often preceded by an increase in the queueing delay at the
   bottleneck link, LEDBAT congestion control uses this increase in
   queueing delay as an early signal of congestion, enabling it to
   respond to congestion earlier than TCP, and enabling it to yield
   bandwidth to a competing TCP flow.

   LEDBAT employs one-way delay measurements to estimate queueing delay.
   When the estimated queueing delay is lesser than a pre-determined
   target, LEDBAT infers that the network is not yet congested, and
   increases its sending rate to utilize any spare capacity in the
   network.  When the estimated queueing delay becomes greater than a
   pre-determined target, LEDBAT decreases its sending rate quickly as a
   response to potential congestion in the network.

3.2.  Preliminaries

   For the purposes of explaining LEDBAT, we assume a transport sender
   that uses fixed-size segments and a receiver that acknowledges each
   segment separately.  It is straightforward to apply the mechanisms
   described here with variable-sized segments and with delayed
   acknowledgments.  A LEDBAT sender uses a congestion window (cwnd)
   that gates the amount of data that the sender can send into the
   network in one RTT.

   LEDBAT requires that each data segment carries a "timestamp" from the
   sender, based on which the receiver computes the one-way delay from
   the sender, and sends this computed value back to the sender.

   In addition to the LEDBAT mechanism described below, we note that a
   slow start mechanism can be used as specified in [RFC2581].  Since
   slow start leads to faster increase in the window than that specified
   in LEDBAT, conservative congestion control implementations employing

Shalunov, et al.         Expires April 28, 2011                 [Page 4]

Internet-Draft                   LEDBAT                     October 2010

   LEDBAT may skip slow start altogether and start with an initial
   window of XXX MSS.

3.3.  Receiver-Side Operation

   A LEDBAT receiver operates as follows:

   on data_packet:
     remote_timestamp = data_packet.timestamp
     acknowledgement.delay = local_timestamp() - remote_timestamp
     # fill in other fields of acknowledgement

3.4.  Sender-Side Operation

3.4.1.  An Overview

   As a first approximation, a LEDAT sender operates as show below.
   TARGET is the maximum queueing delay that LEDBAT itself can introduce
   in the network, and GAIN determines the rate at which the congestion
   window changes; both constants are specified later.

   on initialization:
     base_delay = +infinity

   on acknowledgement:
     current_delay = acknowledgement.delay
     base_delay = min(base_delay, current_delay)
     queuing_delay = current_delay - base_delay
     off_target = TARGET - queuing_delay
     cwnd += GAIN * off_target / cwnd

3.4.2.  The Complete Sender Algorithm

   The simplified mechanism above ignores noise filtering and base
   expiration.  The full sender-side algorithm is specified below:

Shalunov, et al.         Expires April 28, 2011                 [Page 5]

Internet-Draft                   LEDBAT                     October 2010

   on initialization:
     set all NOISE_FILTER delays used by current_delay() to +infinity
     set all BASE_HISTORY delays used by base_delay() to +infinity
     last_rollover = -infinity # More than a minute in the past.

   on acknowledgement:
     delay = acknowledgement.delay
     queuing_delay = current_delay() - base_delay()
     off_target = TARGET - queuing_delay + random_input()
     cwnd += GAIN * off_target / cwnd
     # flight_size() is the amount of currently not acked data.
     max_allowed_cwnd = ALLOWED_INCREASE + TETHER*flight_size()
     cwnd = min(cwnd, max_allowed_cwnd)

     # random() is a PRNG between 0.0 and 1.0
     # NB: RANDOMNESS_AMOUNT is normally 0
     RANDOMNESS_AMOUNT * TARGET * ((random() - 0.5)*2)

     # Maintain a list of NOISE_FILTER last delays observed.
     forget the earliest of NOISE_FILTER current_delays
     add delay to the end of current_delays

     min(the NOISE_FILTER delays stored by update_current_delay)

     # Maintain BASE_HISTORY min delays. Each represents a minute.
     if round_to_minute(now) != round_to_minute(last_rollover)
       last_rollover = now
       forget the earliest of base delays
       add delay to the end of base_delays
       last of base_delays = min(last of base_delays, delay)

     min(the BASE_HISTORY min delays stored by update_base_delay)

3.5.  Parameter Values

   TARGET parameter MUST be set to 100 milliseconds.  GAIN SHOULD be set
   to 1 so that max rampup rate is the same as for TCP.  BASE_HISTORY
   SHOULD be 10; it MUST be no less than 2 and SHOULD NOT be more than
   20.  NOISE_FILTER SHOULD be 1; it MAY be tuned so that it is at least
   1 and no more than cwnd/2.  ALLOWED_INCREASE SHOULD be 1 packet; it

Shalunov, et al.         Expires April 28, 2011                 [Page 6]

Internet-Draft                   LEDBAT                     October 2010

   MUST be at least 1 packet and SHOULD NOT be more than 3 packets.
   TETHER SHOULD be 1.5; it MUST be greater than 1.  RANDONMESS_AMOUNT
   SHOULD be 0; it MUST be between 0 and 0.1 inclusively.

   Note that using the same TARGET value across LEDBAT flows is
   important, since flows using different TARGET values will not share a
   bottleneck equitably---flows with higher values will get a larger
   share of the bottleneck bandwidth.

4.  Understanding LEDBAT Mechanisms

   This section describes and provides insight into the delay estimation
   and window management mechanisms used in LEDBAT congestion control.

4.1.  Delay Estimation

   LEDBAT estimates congestion in the network based on observed increase
   in queueing delay in the network.  To observe an increase in the
   queueing delay in the network, LEDBAT must separate the queueing
   delay component from the rest of the end-to-end delay.  This section
   explains how LEDBAT decomposes the observed changes in end-to-end
   delay into these two components.

   LEDBAT estimates congestion in the direction of data flow.  To avoid
   measuring queue build-up on the reverse path (or ack path), LEDBAT
   uses changes in one-way delay estimates.  The extant One-Way Active
   Measurement Protocol (OWAMP) [XXXcite], can be used for measuring
   one-way delay, but since LEDBAT is used for sending data, and since
   LEDBAT requires only changes in one-way delay to infer congestion,
   simply adding a timestamp to the data segments and a measurement
   result field in the ack packets seems sufficient.  Doing so also
   avoids the pitfall of measurement packets being treated differently
   from the data packets in the network.

4.1.1.  Estimating Base Delay

   End-to-end delay can be decomposed into transmission (or
   serialization) delay, propagation (or speed-of-light) delay, queueing
   delay, and processing delay.  On any given path, barring some noise,
   all delay components except for queueing delay are constant; over
   time, we expect only the queueing delay on the path to change as the
   queue sizes at the links change.  Since queuing delay is always
   additive to the end-to-end delay, we estimate the sum of the constant
   delay components, which we call "base delay", to be the minimum delay
   observed on the end-to-end path.  Using the minimum observed delay
   also allows LEDBAT to eliminate noise in the delay estimation, such
   as due to spikes in processing delay at a node on the path.

Shalunov, et al.         Expires April 28, 2011                 [Page 7]

Internet-Draft                   LEDBAT                     October 2010

   To respond to true changes in the base delay due to route changes,
   LEDBAT uses only "recent" measurements---measurements over the last N
   minutes---in estimating the base delay.  To implement an approximate
   minimum over the last N minutes, a LEDBAT sender stores N+1 separate
   minima---N for the last N minutes, and one for the running current
   minute.  At the end of the current minute, the window moves---the
   earliest minimum is dropped and the latest minimum is added.  When
   the connection is idle for a given minute, no data is available for
   the one-way delay and, therefore, no minimum is stored.  When the
   connection has been idle for N minutes, the measurement begins anew.

   The duration of the observation window itself is a tradeoff between
   robustness of measurement and responsiveness to change: a larger
   observation window yields a more accurate base delay if the true base
   delay is unchanged, whereas a smaller observation window results in
   faster response to true changes in the base delay.

4.1.2.  Estimating Queueing Delay

   Given that the base delay is constant, the queueing delay is
   represented by the variable component of the measured end-to-end
   delay.  We measure queueing delay as simply the difference between an
   end-to-end delay measurement and the current estimate of base delay.

4.2.  Managing the Congestion Window

4.2.1.  Window Increase: Probing For More Bandwidth

   A LEDBAT sender increases its congestion window most when the queuing
   delay estimate is zero.  To be friendly to competing TCP flows, we
   set this highest rate of window growth to be the same as TCP's.  In
   other words, the LEDBAT window grows by at most twice per round-trip
   time.  Since queuing delay estimate is always non-negative, this
   growth rate ensures that a LEDBAT flow never ramps up faster than a
   competing TCP flow over the same path.

4.2.2.  Window Decrease: Responding To Congestion

   When the sender's queuing delay estimate is lower than the target,
   the sending rate should be increased.  When the sender's queueing
   delay estimate is higher than the target, the sending rate should be
   reduced.  LEDBAT uses a simple linear controller to detemine sending
   rate as a function of the delay estimate, where the response is
   proportional to the difference between the current queueing delay
   estimate and the target.  In limited experiments with Bittorrent
   nodes, this controller seems to work well.

   To deal with severe congestion when several packets are dropped in

Shalunov, et al.         Expires April 28, 2011                 [Page 8]

Internet-Draft                   LEDBAT                     October 2010

   the network, and to provide a fallback against incorrect queuing
   delay estimates, a LEDBAT sender halves its cwnd when a loss event is
   detected.  As with NewReno, LEDBAT reduces its cwnd by half at most
   once per RTT.  Note that, unlike TCP-like loss-based congestion
   control, LEDBAT does not induce losses and so it normally does not
   rely on losses to determine the sending rate.  LEDBAT's reaction to
   loss is thus less important than it is in the case of loss-based
   congestion control.  For LEDBAT, reducing the congestion window on
   loss is a fallback mechanism in case of severe congestion and in the
   case of incorrect delay estimates.

5.  Choosing Parameter Values

   Through this discussion, we hope to encourage informed
   experimentation with LEDBAT.

5.1.  Queuing Delay Target

   Consider the queuing delay on the bottleneck.  This delay is the
   extra delay induced by congestion control.  One of our design goals
   is to keep this delay low.  However, when this delay is zero, the
   queue is empty and so no data is being transmitted and the link is
   thus not saturated.  Hence, our design goal is to keep the queuing
   delay low, but non-zero.

   How low do we want the queuing delay to be?  Because another design
   goal is to be deployable on networks with only simple FIFO queuing
   and drop-tail discipline, we can't rely on explicit signaling for the
   queuing delay.  So we're going to estimate it using external
   measurements.  The external measurements will have an error at least
   on the order of best-case scheduling delays in the OSes.  There's
   thus a good reason to try to make the queuing delay larger than this
   error.  There's no reason that would want us to push the delay much
   further up.  Thus, we will have a delay target that we would want to

6.  Discussion

6.1.  Framing Considerations

   The actual framing and wire format of the protocol(s) using the
   LEDBAT congestion control mechanism is outside of scope of this
   document, which only describes the congestion control part.

   There is an implication of the need to use one-way delay from the
   sender to the receiver in the sender.  An obvious way to support this

Shalunov, et al.         Expires April 28, 2011                 [Page 9]

Internet-Draft                   LEDBAT                     October 2010

   is to use a framing that timestamps packets at the sender and conveys
   the measured one-way delay back to the sender in ack packets.  This
   is the method we'll keep in mind for the purposes of exposition.
   Other methods are possible and valid.

   Another implication is the receipt of frequent ACK packets.  The
   exposition below assumes one ACK per data packet, but any reasonably
   small number of data packets per ACK will work as long as there is at
   least one ACK every round-trip time.

   The protocols to which this congestion control mechanism is
   applicable, with possible appropriate extensions, are TCP, SCTP,
   DCCP, etc.  It is not a goal of this document to cover such
   applications.  The mechanism can also be used with proprietary
   transport protocols, e.g., those built over UDP for P2P applications.

6.2.  Competing With TCP Flows

   Consider competition between a LEDBAT connection and a connection
   governed by loss-based congestion control (on a FIFO bottleneck with
   drop-tail discipline).  Loss-based connection will need to experience
   loss to back off.  Loss will only occur after the connection
   experiences maximum possible delays.  LEDBAT will thus receive
   congestion indication sooner than the loss-based connection.  If
   LEDBAT can ramp down faster than the loss-based connection ramps up,
   LEDBAT will yield.  LEDBAT ramps down when queuing delay estimate
   exceeds the target: the more the excess, the faster the ramp-down.
   When the loss-based connection is standard TCP, LEDBAT will yield at
   precisely the same rate as TCP is ramping up when the queuing delay
   is double the target.

   LEDBAT is most aggressive when its queuing delay estimate is most
   wrong and is as low as it can be.  Queuing delay estimate is
   nonnegative, therefore the worst possible case is when somehow the
   estimate is always returned as zero.  In this case, LEDBAT will ramp
   up as fast as TCP and halve the rate on loss.  Thus, in case of worst
   possible failure of estimates, LEDBAT will behave identically to TCP.
   This provides an extra safety net.

6.3.  Fairness Among LEDBAT Flows

   The design goals of LEDBAT center around the aggregate behavior of
   LEDBAT flows when they compete with standard TCP.  It is also
   interesting how LEDBAT flows share bottleneck bandwidth when they
   only compete between themselves.

   LEDBAT as described so far lacks a mechanism specifically designed to
   equalize utilization between these flows.  The observed behavior of

Shalunov, et al.         Expires April 28, 2011                [Page 10]

Internet-Draft                   LEDBAT                     October 2010

   existing implementations indicates that a rough equalization, in
   fact, does occur.

   The delay measurements used as control inputs by LEDBAT contain some
   amount of noise and errors.  The linear controller converts this
   input noise into the same amount of output noise.  The effect that
   this has is that the uncorrelated component of the noise between
   flows serves to randomly shuffle some amount of bandwidth between
   flows.  The amount shuffled during each RTT is proportional to the
   noise divided by the target delay.  The random-walk trajectory of
   bandwidth utilized by each of the flows over time tends to the fair
   share.  The timescales on which the rates become comparable are
   proportional to the target delay multiplied by the RTT and divided by
   the noise.

   In complex real-life systems, the main concern is usually the
   reduction of the amount of noise, which is copious if not eliminated.
   In some circumstances, however, the measurements might be "too good"
   -- since the equalization timescale is inversely proportional to
   noise, perfect measurements would result in lack of convergence.

   Under these circumstances, it may be beneficial to introduce some
   artificial randomness into the inputs (or, equivalently, outputs) of
   the controller.  Note that most systems should not require this and
   should be primarily concerned with reducing, not adding, noise.

   With delay-based congestion control systems, there's a concern about
   the ability of late comers to measure the base delay correctly.
   Suppose a LEDBAT flow saturates a bottleneck; another LEDBAT flow
   starts and proceeds to measure the base delay and the current delay
   and to estimate the queuing delay.  If the bottleneck always contains
   target delay worth of packets, the second flow would see the
   bottleneck as empty start building a second target delay worth of
   queue on top of the existing queue.  The concern ("late comers'
   advantage") is that the initial flow would now back off because it
   sees the real delay and the late comer would use the whole capacity.

   However, once the initial flow yields, the late comer immediately
   measures the true base delay and the two flows operate from the same
   (correct) inputs.

   Additionally, in practice this concern is further alleviated by the
   burstiness of network traffic: all that's needed to measure the base
   delay is one small gap.  These gaps can occur for a variety of
   reasons: the OS may delay the scheduling of the sending process until
   a time slice ends, the sending computer might be unusually busy for
   some number of milliseconds or tens of milliseconds, etc.  If such a
   gap occurs while the late comer is starting, base delay is

Shalunov, et al.         Expires April 28, 2011                [Page 11]

Internet-Draft                   LEDBAT                     October 2010

   immediately correctly measured.  With small number of flows, this
   appears to be the main mechanism of regulating the late comers'

7.  IANA Considerations

   There are no IANA considerations for this document.

8.  Security Considerations

   A network on the path might choose to cause higher delay measurements
   than the real queuing delay so that LEDBAT backs off even when
   there's no congestion present.  Shaping of traffic into an
   artificially narrow bottleneck can't be counteracted, but faking
   timestamp field can and SHOULD.  A protocol using the LEDBAT
   congestion control SHOULD authenticate the timestamp and delay
   fields, preferably as part of authenticating most of the rest of the
   packet, with the exception of volatile header fields.  The choice of
   the authentication mechanism that resists man-in-the-middle attacks
   is outside of scope of this document.

9.  Normative References

   [RFC2119]  Bradner, S., "Key words for use in RFCs to Indicate
              Requirement Levels", BCP 14, RFC 2119, March 1997.

   [RFC2581]  Allman, M., Paxson, V., and W. Stevens, "TCP Congestion
              Control", RFC 2581, April 1999.

Appendix A.  Timestamp errors

   One-way delay measurement needs to deal with timestamp errors.  We'll
   use the same locally linear clock model and the same terminology as
   Network Time Protocol (NTP).  This model is valid for any
   differentiable clocks.  NTP uses the term "offset" to refer to
   difference from true time and "skew" to refer to difference of clock
   rate from the true rate.  The clock will thus have a fixed offset
   from the true time and a skew.  We'll consider what we need to do
   about the offset and the skew separately.

A.1.  Clock offset

   First, consider the case of zero skew.  The offset of each of the two
   clocks shows up as a fixed error in one-way delay measurement.  The

Shalunov, et al.         Expires April 28, 2011                [Page 12]

Internet-Draft                   LEDBAT                     October 2010

   difference of the offsets is the absolute error of the one-way delay
   estimate.  We won't use this estimate directly, however.  We'll use
   the difference between that and a base delay.  Because the error
   (difference of clock offsets) is the same for the current and base
   delay, it cancels from the queuing delay estimate, which is what
   we'll use.  Clock offset is thus irrelevant to the design.

A.2.  Clock skew

   Now consider the skew.  For a given clock, skew manifests in a
   linearly changing error in the time estimate.  For a given pair of
   clocks, the difference in skews is the skew of the one-way delay
   estimate.  Unlike the offset, this no longer cancels in the
   computation of the queuing delay estimate.  On the other hand, while
   the offset could be huge, with some clocks off by minutes or even
   hours or more, the skew is typically not too bad.  For example, NTP
   is designed to work with most clocks, yet it gives up when the skew
   is more than 500 parts per million (PPM).  Typical skews of clocks
   that have never been trained seem to often be around 100-200 PPM.
   Previously trained clocks could have 10-20 PPM skew due to
   temperature changes.  A 100-PPM skew means accumulating 6
   milliseconds of error per minute.  The expiration of base delay
   related to route changes mostly takes care of clock skew.  A
   technique to specifically compute and cancel it is trivially possible
   and involves tracking base delay skew over a number of minutes and
   then correcting for it, but usually isn't necessary, unless the
   target is unusually low, the skew is unusually high, or the base
   interval is unusually long.  It is not further described in this

   For cases when the base interval is long or the skew is high or the
   target is low, a technique to correct for skew can be beneficial.
   The technique described here or a different mechanism MAY be used by
   implementations.  The technique described is still experimental, but
   it is actually currently used.  The pseudocode in the specification
   below does not include any of the skew correction algorithms.

A.2.1.  Deployed clock skew correction mechanism

   Clock skew can be in two directions: either the sender's clock is
   faster than the receiver's, or vice versa.  We refer to the former
   situation as clock drift "in sender's favor" and to the latter as
   clock drift "in receiver's favor".

   When the clock drift is "in sender's favor", nothing special needs to
   be done, because the timestamp differences (i.e., the raw delay
   estimates) will grow smaller with time, and thus the base delay will
   be continuously updated with the drift.

Shalunov, et al.         Expires April 28, 2011                [Page 13]

Internet-Draft                   LEDBAT                     October 2010

   When the clock drift is "in receiver's favor", the raw delay
   estimates will drift up with time, suppressing the throughput
   needlessly.  This is the case that can benefit from a special
   mechanism.  Assume symmetrical framing (i.e., same information about
   delays transmitted in both way).  If the sender can detect the
   receiver reducing its base delay, it can infer that this is due to
   clock drift "in receiver's favor".  This can be compensated for by
   increasing the sender's base delay by the same amount.  Since, in our
   implementation, the receiver sends back the raw timestamp estimate,
   the sender can run the same base delay calculation algorithm it runs
   for itself for the receiver as well; when it reduces the inferred
   receiver's delay, it increases its own by the same amount.

A.2.2.  Skew correction with faster virtual clock

   This is an alternative skew correction algorithm, currently under
   consideration and not deployed in the wild.

   Since having a faster clock on the sender is, relatively speaking, a
   non-problem, one can use two different virtual clocks on each LEDBAT
   implementation: use, for example, the default machine clock for
   situations where the instance is acting as a receiver, and use a
   virtual clock, easily computed from the default machine clock through
   a linear transformation, for situations where the instance is acting
   as a sender.  Make the virtual clock, e.g., 500 PPM faster than the
   machine clock.  Since 500 PPM is more than the variability of most
   clocks (plus or minus 100 PPM), any sender's clock is very likely to
   be faster than any receiver's clock, thus benefitting from the
   implicit correction of taking the minimum as the base delay.

   Note that this approach is not compatible with the one described in
   the preceding section.

A.2.3.  Skew correction with estimating drift

   This is an alternative skew correction algorithm, currently under
   consideration and not deployed in the wild.

   The history of base delay minima we already keep for each minute
   provides us with direct means of computing the clock skew difference
   between the two hosts.  Namely, we can fit a linear function to the
   set of base delay estimates for each minute.  The slope of the
   function is an estimate of the clock skew difference for the given
   pair of sender and receiver.  Once the clock skew difference is
   estimated, it can be used to correct the clocks so that they advance
   at nearly the same rate.  Namely, the clock needs to be corrected by
   half of the estimated skew amount, since the other half will be
   corrected by the other endpoint.  Note that the skew differences are

Shalunov, et al.         Expires April 28, 2011                [Page 14]

Internet-Draft                   LEDBAT                     October 2010

   then maintained for each connection and the virtual clocks used with
   each connection can differ, since they do not attempt to estimate the
   skew with respect to the true time, but instead with respect to the
   other endpoint.

A.2.3.1.  Byzantine skew correction

   This is an alternative skew correction algorithm, currently under
   consideration and not deployed in the wild.

   When it is known that each host maintains long-lived connections to a
   number of different other hosts, a byzantine scheme can be used to
   estimate the skew with respect to the true time.  Namely, calculate
   the skew difference for each of the peer hosts as described in the
   preceding section, then take the median of the skew differences.

   This inherent clock drift can then be corrected with a linear
   transformation before the clock data is used in the algorithm from
   the preceding section, the currently deployed algorithm, or nearly
   any other skew correction algorithm.

   While this scheme is not universally applicable, it combines well
   with other schemes, since it is essentially a clock training
   mechanism.  The scheme also acts the fastest, since the state is
   preserved between connections.

Authors' Addresses

   Stanislav Shalunov
   BitTorrent Inc
   612 Howard St, Suite 400
   San Francisco, CA  94105

   Email: shalunov@bittorrent.com
   URI:   http://shlang.com

   Greg Hazel
   BitTorrent Inc
   612 Howard St, Suite 400
   San Francisco, CA  94105

   Email: greg@bittorrent.com

Shalunov, et al.         Expires April 28, 2011                [Page 15]

Internet-Draft                   LEDBAT                     October 2010

   Janardhan Iyengar
   Franklin and Marshall College
   415 Harrisburg Ave.
   Lancaster, PA  17603

   Email: jiyengar@fandm.edu

Shalunov, et al.         Expires April 28, 2011                [Page 16]