
Minutes interim-2022-coinrg-03: Wed 10:00
minutes-interim-2022-coinrg-03-202209281000-02

Meeting Minutes Computing in the Network Research Group (coinrg) RG
Date and time 2022-09-28 14:00
Title Minutes interim-2022-coinrg-03: Wed 10:00
State Active
Last updated 2022-10-04


IRTF COIN RG interim - 2022-09-28

Date: Wednesday 28.09.2022
Time: 7:00-9:00 PDT - 10:00-12:00 EDT - 14:00-16:00 UTC -- 120 mins
Chairs: J/E/M
Jianfei (Jeffrey) He <jefhe@foxmail.com>
Eve Schooler <eve.m.schooler@intel.com>
Marie-Jose Montpetit <marie@mjmontpetit.com>

Join via Meetecho:
https://meetings.conf.meetecho.com/interim/?short=5203a6a0-2a4b-4983-b351-791cb6c4d8f0

Materials:
https://datatracker.ietf.org/meeting/interim-2022-coinrg-03/session/coinrg

Shared Note taking:
https://notes.ietf.org/notes-ietf-interim-2022-coinrg-03-coinrg?both

1. Chair Update (J/E/M) - 5 mins

The scope of the RG has continued to evolve. The milestones need to be
updated to reflect the dynamic nature of the field.

Papers (30 mins each)

2. Using Trio – Juniper Networks’ Programmable Chipset – for Emerging In-Network Applications (Mingran Yang, MIT)

Paper: https://dl.acm.org/doi/pdf/10.1145/3544216.3544262

Juniper's Trio-ML was used for "efficient in-network straggler
mitigation" in AI training. By leveraging Juniper Networks' programmable
chipset, it outperforms Tofino's solution by 1.8x.
The fundamental difference is that Trio is a thread-based architecture,
which handles non-homogeneous packet processing rates (vs. strictly
in-line processing).
Two main challenges must be addressed to deliver efficient in-network
straggler mitigation: in-network straggler detection and recovery
(sketched below).
Trio-ML also improves the time-to-accuracy by 1.6x.
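
The detection/recovery split can be illustrated with a minimal sketch
(plain Python rather than Trio microcode; the timeout rule and all names
are assumptions for illustration, not taken from the paper):

    import time

    # Hypothetical in-network gradient aggregation with timeout-based
    # straggler detection. Contributions for one chunk are summed as
    # they arrive from N workers; if some workers are still missing
    # after a deadline measured from the first arrival, the chunk is
    # released with the partial aggregate (recovery), so fast workers
    # are not held back by stragglers.
    class ChunkAggregator:
        def __init__(self, num_workers, timeout_s):
            self.num_workers = num_workers
            self.timeout_s = timeout_s    # detection threshold
            self.total = None             # running elementwise sum
            self.seen = set()             # workers that contributed
            self.first_arrival = None

        def on_packet(self, worker_id, chunk):
            if self.first_arrival is None:
                self.first_arrival = time.monotonic()
            if worker_id not in self.seen:
                self.seen.add(worker_id)
                if self.total is None:
                    self.total = list(chunk)
                else:
                    self.total = [a + b for a, b in zip(self.total, chunk)]
            return self._maybe_release()

        def _maybe_release(self):
            complete = len(self.seen) == self.num_workers
            waited = time.monotonic() - self.first_arrival
            if complete or waited > self.timeout_s:
                stragglers = set(range(self.num_workers)) - self.seen
                # Recovery: emit the partial sum and report stragglers so
                # the job can rescale or recompute their contribution.
                return self.total, stragglers
            return None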

Q: (Jeffrey He)
The typical network delay in a DC is at the microsecond level, while a
straggler is assumed to take several hundred milliseconds. Is this a
realistic assumption for the DC?
A: This is realistic because the time difference between stragglers is
not only the network delay but also the delay in the computing. The
injected delays are in the range of 0.5-2x the iteration time.
Q: (Jeffrey He)
Is there background traffic in your experiments?
A: No.
Q: (from chat) Are these improvements specific to the chip or generic?
A: The straggler mitigation results are quite specific to the chip
architecture, mainly due to how mitigation is performed in the cluster.

3. DICer: distributed coordination for in-network computations (Uthra Ambalavanan, Bosch GmbH, Germany)

Paper: https://dl.acm.org/doi/abs/10.1145/3517212.3558084

This paper addresses the limitations and challenges of NFN (Named
Function Networking) for in-network computations: name-based routing of
workflows, local decision making, and function decomposition.
DICer: NFN + distributed synchronization/coordination; i.e., it retains
the NFN forwarder at all compute nodes and utilizes the NDN app data
sync protocol to gain neighborhood information.
Four phases: neighbor discovery, sync group formation, synchronization,
and coordination (sketched below).
Evaluations show improved function placement and reduced completion time
(coordinated solution vs. plain NFN).
Future work includes joint optimization of compute and network resource
utilization.
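
As a rough sketch of how the four phases could fit together (hypothetical
Python names for illustration; the actual system runs on NFN forwarders
using the NDN sync protocol):

    # Hypothetical sketch of DICer's four phases on one compute node.
    class DicerNode:
        def __init__(self, name):
            self.name = name
            self.neighbors = set()
            self.sync_group = set()
            self.placements = {}   # function name -> hosting node

        # Phase 1: neighbor discovery from periodic announcements.
        def discover(self, announcements):
            self.neighbors.update(announcements)

        # Phase 2: sync group formation among discovered compute nodes.
        def form_group(self):
            self.sync_group = {self.name} | self.neighbors

        # Phase 3: synchronization -- merge state exchanged with the
        # group (the role the NDN app data sync protocol plays).
        def synchronize(self, remote_placements):
            for remote in remote_placements:
                self.placements.update(remote)

        # Phase 4: coordination -- place a function only if no group
        # member already hosts it, instead of deciding purely locally
        # as plain NFN would.
        def coordinate(self, function_name):
            if function_name in self.placements:
                return self.placements[function_name]  # reuse placement
            self.placements[function_name] = self.name
            return self.name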

Draft Update (15 mins)

4. Use Cases for In-Network Computing (Dirk Trossen, Huawei, and Ike Kunze, RWTH Aachen University)

Expired Draft:
https://datatracker.ietf.org/doc/draft-irtf-coinrg-use-cases

Updates to the current draft structure were presented. The use cases
were regrouped: providing new COIN experiences, supporting new COIN
systems, improving existing COIN capabilities, and enabling new COIN
capabilities.
The taxonomy was sharpened and tightened.
The analysis of research questions and requirements is being prepared
and started.
Several questions were raised:
Where to collect terminology?
Should the analysis be part of this document or a separate one?
Is this draft ready for last call?
Comment: (Marie-Jose) Collecting the terminology is a good idea, as is
the idea of moving the terminology to a small but standalone draft. The
analysis should be kept out of this document. Yes, this work should
continue. It is interesting to capture how the field continues to
evolve.

Discussion in chat:
Comment: (David Oran) A terminology document is useful if (and only if?)
there are conflicts in terminology that are hindering progress. This was
the case in ICNRG, and the process of normalizing it resulted in
improved clarity on the differences among the various architectures. Is
that the case here? Is trying to normalize terminology necessary?
Premature? Too late?
A purely descriptive (rather than prescriptive) terminology effort would
be of limited utility in my view.
My intuition is to hold off on a terminology document until we have some
terminology-driven conflict that needs to be resolved. But I don't feel
strongly on this point.
Comment: (Colin Perkins)
Terminology docs can also help bring clarity, even if there are no
conflicts in the way terms are used, by bringing precision. That relies
on the resulting terminology being widely referenced and adopted, of
course.
A: (Dirk Trossen)
I think the terminology should be prescriptive in thinking but not
religious in measuring 'COIN approaches', so the concern you express,
@Colin, must be key. I think a stake in the ground, a view of the COIN
RG at this point in time, is good but it must not be THE measure for in
or out of COINish research!

New Draft (10 mins)

5. Distributed Learning Architecture based on Edge-cloud Collaboration (Chao Li, Beijing University of Posts and Telecommunications)

Draft:
https://datatracker.ietf.org/doc/draft-li-coinrg-compute-resource-scheduling

Edge-cloud collaboration is proposed for AI training in IoT use cases in
the context of 5G.
Simulation shows that it improves the training accuracy and relieves the
computational pressure in the model training process.
Q (Marie-Jose): What are your goals with the draft?
A: To prove the computing balance for model training in the networks.
Since this is a new draft, the goal is to start a discussion on the list
about how this could evolve within the group.

Ideas (10 mins each)

6. Data Operation In-Network (DOIN) Use Cases (Yujing Zhou, Huawei)

Three use cases for in-network data operation were presented: NetReduce,
NetLocker, and NetSequence. The proposal was to provide an explicit and
general way in the data/control plane to signal the computation to be
performed (see the sketch below).
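
As a rough illustration of what such explicit signalling could look like,
a hypothetical wire encoding (the opcodes and field layout are invented
for this sketch, not taken from the presentation):

    import struct

    # Hypothetical DOIN header: a 1-byte opcode names the operation, so
    # switches can dispatch on it without per-application logic.
    OPCODES = {"net_reduce": 1, "net_lock": 2, "net_sequence": 3}

    def encode_doin(op, key, operand):
        # !BIQ = 1-byte opcode, 4-byte key (e.g., a lock or flow id),
        # 8-byte operand (e.g., a value to add or compare against).
        return struct.pack("!BIQ", OPCODES[op], key, operand)

    def decode_doin(payload):
        return struct.unpack("!BIQ", payload[:13])

    # A switch dispatching on the opcode could then apply an atomic
    # operation (e.g., fetch-and-add or compare-and-swap, as noted in
    # the discussion below) to its local state.
    pkt = encode_doin("net_sequence", key=42, operand=1)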
Q (Marie-Jose): What's the next step in your plan?
A: Propose it in the IETF and find a solution to make in-network
computing work in the DC. We hope members of the COIN RG will join us.
Comment (Jeffrey He): These three operations are more typically
considered "computing" than data operations; for example, fetch-and-add
and compare-and-swap are atomic computations.
A: Applications change very fast; we can't develop communication case by
case.
Comment (Colin Perkins): With no hats, I was a little surprised that the
operations were such low level given the more high-level use cases. It
wasn't clear whether this was a computation-in-the-network signalling
scheme or an active networking scheme that was being proposed. It'll be
good to make clear the distinction on where this is going.
Comment (Jinze Yang): Data operation generally means that the data is
carried in the payload. We want the switch to operate on these data,
which is why this work is called data operation. The reason to keep the
operations at a low level (for these scenarios) is to keep the task at
line rate.
Next step: consider bringing the topic to the list.

7. EIP update on "Machine Learning for Networking" use case (Stefano Salsano, U. Roma)

With colleagues affiliated with Stanford, Purdue, etc.
Using extensible in-band processing (EIP) to support distributed feature
extraction and ML inference for per-packet ML.
A previous COIN presentation was on the Taurus solution, a switch
pipeline that includes an ML inference engine; the previous paper
focused on a single-node model.
It is proposed here to separate "feature extraction" (FE) and
"machine-learning inference" (MLI) into different network nodes in a
distributed way.
How to encode and transmit features?
EIP can be used to convey the encoded features.
EIP is realized as hop-by-hop options in the IPv6 header; it supports
new use cases like advanced monitoring, semantic routing, deterministic
networking, and slicing.
The aim is to define this approach as a framework, or lightweight
standardization, and not to set the content of the records in stone,
because the innovation around features should be left open. However, a
common framework for their exchange is proposed (see the sketch below).
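
A minimal sketch of carrying features as a hop-by-hop option (the option
type and record layout here are assumptions for illustration; EIP
deliberately leaves the record content open):

    import struct

    # Example experimental IPv6 option type (RFC 4727 reserves types
    # ending in binary 11110 for experiments); chosen only for this sketch.
    EXP_OPT_TYPE = 0x1E

    def pack_feature_option(features):
        # Assumed feature record: each feature as a 4-byte unsigned int
        # (e.g., a fixed-point encoding chosen by the FE node).
        record = struct.pack("!%dI" % len(features), *features)
        # Option TLV: type, length, then the encoded feature record.
        return struct.pack("!BB", EXP_OPT_TYPE, len(record)) + record

    # An MLI node on the path would parse the hop-by-hop header, unpack
    # the record, and feed the features to its inference engine, so FE
    # and MLI can sit on different nodes.
    opt = pack_feature_option([12, 3400, 7])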
Q (Marie-Jose): What next?
A: Create a draft for EIP and the use case, and submit it to the COINRG
mailing list for comment.
Q (Jeffrey He): Quite interesting. In typical deep learning today,
"feature extraction" is part of the neural network. Do you have a
specific analysis of some use cases on how to separate feature
extraction and ML inference?
A: Machine learning for networking is different from traditional machine
learning. The raw data are just the packets flowing through a node. ML
for networking may use flow-level features, not packets. In this sense,
we need to extract the features before ML inference (sketched below).
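
A small sketch of flow-level feature extraction (the feature set is
assumed for illustration, not the presenters' exact set):

    from statistics import mean

    # Compute per-flow features (packet count, mean size, mean
    # inter-arrival gap) from raw packets before handing them to an
    # inference step. packets: list of (timestamp, flow_id, size).
    def extract_features(packets):
        flows = {}
        for ts, flow_id, size in packets:
            flows.setdefault(flow_id, []).append((ts, size))
        features = {}
        for flow_id, pkts in flows.items():
            times = [ts for ts, _ in pkts]
            sizes = [sz for _, sz in pkts]
            gaps = [b - a for a, b in zip(times, times[1:])] or [0.0]
            features[flow_id] = (len(pkts), mean(sizes), mean(gaps))
        return features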

RG topics (10 mins)

Next meeting at the IETF in London: a 2-hour session has been reserved
for the COIN RG.