
Minutes IETF101: iccrg
minutes-101-iccrg-00

Meeting Minutes: Internet Congestion Control (iccrg) RG
Date and time: 2018-03-23 09:30
Title: Minutes IETF101: iccrg
State: Active
Last updated: 2018-03-27

ICCRG meeting, IETF 101, London
Friday, Mar 23, 2018, 9:30 AM

Agenda:
(10 min) Paul Congdon - Proposed IEEE 802.1Qcz work (Congestion Isolation)
(15 min) Ingemar Johansson - BBR congestion control with L4S support
(25 min) Neal Cardwell - An Update on BBR Work at Google
(20 min) Praveen Balasubramanian - LEDBAT++
(30 min) Michael Schapira - PCC: Performance-Oriented Congestion Control

Minutes:
a.       Congestion Isolation. It was asked whether a 'congested' queue is
defined for each 'non-congested' queue being monitored.  Effectively yes, but
it is possible to monitor multiple 'non-congested' queues and isolate to a
single 'congested' queue.  The diagrams showing what events generate signaling
depict only a single upstream and downstream switch, but they are simplified
for illustration; in reality the network is an L3 Clos network and incast
traffic arrives from multiple ingress ports.  A question was asked about how
much of the solution is already known and will be standardized.  Multiple
solutions may be considered for various aspects of the standard; the project
defines the external behavior and scope, which can be achieved with different
implementations.  A perennial challenge in 802.1 hardware standardization is
providing enough specification for interoperability and correctness while
still allowing implementation flexibility.
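
A minimal sketch in Python of the isolation step described above; the queue
names and the threshold are illustrative assumptions, not values from the
proposed 802.1Qcz work:

    CONGESTED_QUEUE = "congested"
    ISOLATION_THRESHOLD = 100_000  # bytes queued before a flow is isolated

    congesting_flows = set()

    def select_queue(flow_id, normal_queue, queue_depth_bytes):
        """Pick the egress queue: divert congesting flows to the shared
        'congested' queue; several monitored queues can share that one."""
        if queue_depth_bytes[normal_queue] > ISOLATION_THRESHOLD:
            congesting_flows.add(flow_id)  # flow contributes to congestion
        if flow_id in congesting_flows:
            return CONGESTED_QUEUE  # isolate it (signaling upstream omitted)
        return normal_queue

    depths = {"q1": 150_000, "q2": 10_000}
    print(select_queue("flowA", "q1", depths))  # -> 'congested'
    print(select_queue("flowB", "q2", depths))  # -> 'q2'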

b.       BBR with L4S support. The changes required to the Linux implementation
were small and easy to add.  There was some concern about exiting slow start
on the first round after experiencing ECN marks: there is some burstiness at
start-up with incast and 'drive-by' traffic, and the feeling was that BBR
might exit too early.  There was further support for the idea and
acknowledgement that it works with and uses ECN.  How much of the benefit
comes from BBR and how much from ECN?  There are slides that compare BBR and
BBR evo (with ECN and phantom queues), and there was interest in understanding
what causes the improvement: the BBR changes or, again, ECN.  There was a
suggestion to track the percentile usage to soften BBR.  It was noted that the
BBR max filter is too aggressive and that adding ECN softens this.  There was
a question about how changing some of the parameters would adjust the results;
this will be looked at offline.  It was requested that the time scales of the
gain cycle and the filter receive further analysis.  There was a suggestion to
look at how this performs with a mix of non-ECN traffic.  It was pointed out
that marking should be based on queue delays, not queue depths, as this allows
consistent end-to-end configuration.  In general there was wide acceptance and
approval of this work.
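
A minimal sketch in Python of the delay-based marking point above: instead of
marking at a fixed byte depth, the backlog is converted to queuing delay using
the link rate, so the same delay threshold works across link speeds.  The
threshold value is an assumption for illustration, not one taken from the
presentation:

    MARK_DELAY_S = 0.001  # 1 ms marking threshold (assumed for illustration)

    def should_mark_ecn(queue_bytes, link_rate_bps):
        queuing_delay_s = queue_bytes * 8 / link_rate_bps
        return queuing_delay_s > MARK_DELAY_S

    # 200 kB at 1 Gbps is 1.6 ms of queuing delay -> mark; 50 kB is 0.4 ms.
    print(should_mark_ecn(200_000, 1e9), should_mark_ecn(50_000, 1e9))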

c.       BBR at Google. Review of BBR.  It is used for TCP and QUIC on
google.com and YouTube.  Aggregation and BBR were discussed.  Batching or
consolidation of ACKs sometimes occurs for optimization and efficiency (e.g.
WiFi, cable modems, offload mechanisms).  The delivery rate estimation draft
discusses how to estimate rate in the presence of this aggregation.  A 20 MB
transfer over WiFi was analyzed in detail, showing the impact of large gaps in
the ACK stream.  There is now an aggregation estimator included in BBR that
allows a larger cwnd when the ACK stream is halted; it calculates the expected
amount of ACKed data while those ACKs are absent.  There was a clarification
question about whether the receive window is considered, and the answer was
yes.  In addition, an adaptive draining algorithm was implemented; it was
discovered that it is important to limit the time spent draining the queue and
to randomize these phases to avoid synchronization between mice and elephants.
There are BBR implementations available in ns-3 and at Stanford.  One question
was about how best to use packet loss as a signal and how to consider the time
scale over which the loss occurs: on what time scale should a CC algorithm
re-probe for bandwidth after experiencing loss?  This is a fundamental
question in balancing link utilization and application performance.  Key
issues with BBR at Google are dealing with aggregation, packet loss signals,
and interworking with other CC approaches.  It was asked whether tests have
been done where the source/destination are on WiFi directly, since this
affects the amount of aggregation that occurs.  It was noted that ACK
aggregation is a critical problem because it is widespread, and while it is
occurring it is very difficult to measure delay, thus dividing the problem
space.  It was observed that the analysis has been limited to individual flows
rather than transactional flow interactions.  The coding group is looking at
how to deal with loss through coding, and it was pointed out that
collaboration is desired and possible.  It was asked whether BBR is being used
on 5G networks today; the answer is yes.  It was asked whether the type of
loss is distinguished and considered (L2 versus other); wireless networks tend
to retransmit and create delay variance.  It was pointed out that some of the
batching in the test results could be caused by power-saving mode in WiFi.
Another comment regarding packet loss is that it may be Poisson-related, but
there are also other, correlated packet losses to consider; it was agreed that
this should be considered in long-term research.  It was pointed out that
latency is the important metric, rather than focusing entirely on packet loss
(though the two are clearly related), and perhaps retransmission rates should
be increased to reduce latency.
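
A rough sketch in Python of the aggregation-estimator idea described above
(not the actual Google/Linux code): compare the data actually ACKed in an
interval against what the estimated bandwidth predicts, and let the positive
excess ("extra ACKed") add cwnd headroom so sending can continue through ACK
silences caused by aggregation:

    def extra_acked(acked_bytes, interval_s, est_bw_bytes_per_s, prev_extra):
        expected = est_bw_bytes_per_s * interval_s  # bytes bw predicts ACKed
        excess = max(0.0, acked_bytes - expected)   # burst beyond prediction
        # The real design uses a windowed max filter; a running max is a
        # simplification that keeps headroom for the largest recent burst.
        return max(prev_extra, excess)

    # After a 30 ms ACK silence, 500 kB is ACKed at once on a 10 MB/s path:
    headroom = extra_acked(500_000, 0.03, 10_000_000, 0)
    print(headroom)  # 200000.0 bytes of cwnd headroom beyond the BDP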

d.       Performance-Oriented Congestion Control. This is based on two
publications referenced in the slides.  The sender tries different rates,
analyzes the response to generate a utility value for its performance, and
performs these micro-experiments over a time interval to help decide what rate
to use in the next interval.  This means the congestion control response is
not hard-wired as it is in TCP.  The mechanisms in TCP were designed to
balance fairness against maximizing throughput; PCC needs to consider similar
trade-offs but does so using game theory.  The argument is that the TCP sender
is a really bad 'learner' of the congestion state because its response is
hardwired.  It is acknowledged that PCC still has issues; suboptimal
convergence and poor performance on mobile networks are examples.  A version 2
is being worked on with two changes: changes to the utility framework and
changes to the online learning algorithm.  It is possible for different
senders to use different utility functions without impacting convergence.
Also, the learning algorithm's utility function can determine the degree of
rate adjustment.  It was noted that BBR is different but shares many
similarities.  A clarification on the dynamic environment: it is an emulated
environment where one of five channel variables (delay, speed, etc.) is
randomly varied every 5 seconds.  The same dataset was used independently for
each of the compared congestion control approaches.  It was noted that BBR
assumes a min-RTT, so comparing it in the dynamic environment is likely not
appropriate.  There was a question about supporting different utility
functions: if applications have completely different objectives (latency vs.
throughput), can these different functions really be allowed to operate in the
same network?  In the dynamic environment, how did they know the optimal rate?
The answer: it is strictly a raw calculation of the maximum possible rate.  It
is unclear how multiple flows with different utility functions would affect
this.  It was requested that a utility function address a
less-than-best-effort type of service.  There was a question about what
happens if the 5-second interval between dynamic network changes is made much
shorter (e.g. on the order of milliseconds or a second).  This is ongoing
research.  The assertion that PCC doesn't work as well in mobile networks is
evidence that the analysis of the dynamic environment at a 5-second interval
might not be realistic.  A separate question concerned the implementation
cost, and whether it has been compared with BBR's.  It was asked whether any
application-level metrics (QoE) were considered in the evaluation; the answer
is that they are starting to do this and agree it is important.  It was also
asked whether they have evaluated other parameters such as tail latency in
data centers; they have not, but the high-level concept should be applicable.
Spencer asked a question in the form of a suggestion: address how this works
with QUIC, and identify whether there are specific use cases (corners of the
Internet) where this might be able to progress more quickly.  There was a
question about whether the 17x improvement on satellite networks is accurate;
that discussion will be taken offline.  The loss rate is determined by
measuring SACKs.  It was pointed out that L2 retransmissions can have an
impact on these models.  The researchers are very interested in both mobile
and satellite networks.
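
A minimal sketch in Python of the micro-experiment loop described above.  The
utility function here (throughput with a loss penalty) is an illustrative
simplification rather than the exact functions from the PCC papers, and the
measure function is a hypothetical stand-in for a real sender's measurements:

    EPS = 0.05  # probe 5% above and below the current rate (assumed value)

    def utility(throughput, loss_rate):
        # Illustrative utility: reward throughput, penalize loss heavily.
        return throughput * (1.0 - loss_rate) - 10.0 * throughput * loss_rate

    def next_rate(rate, measure):
        # Two micro-experiments at perturbed rates over short intervals.
        up, down = rate * (1 + EPS), rate * (1 - EPS)
        u_up, u_down = utility(*measure(up)), utility(*measure(down))
        # Step toward the higher-utility direction; PCC's online learning
        # algorithm controls the step size more carefully than this.
        return up if u_up >= u_down else down

    # Toy network: losses begin once the rate exceeds 100 (arbitrary units).
    def measure(rate):
        loss = max(0.0, (rate - 100.0) / rate)
        return min(rate, 100.0), loss

    rate = 50.0
    for _ in range(20):
        rate = next_rate(rate, measure)
    print(round(rate, 1))  # settles near the toy capacity of 100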

e.       A new CC for bandwidth-guaranteed networks.  There was no time for
this presentation.