Use Cases and Requirements of Communication Protocol for Troubleshooting Agents on Network Devices
draft-zhang-rtgwg-ai-agents-troubleshooting-00

Document Type Active Internet-Draft (individual)
Authors Ruyi Zhang , Jianwei Mao , Bing Liu , Nan Geng , Xiaotong Shang , Qiangzhou Gao , Zhenbin Li
Last updated 2025-11-02
rtgwg                                                           R. Zhang
Internet-Draft                                                    J. Mao
Intended status: Informational                                    B. Liu
Expires: 7 May 2026                                              N. Geng
                                                                X. Shang
                                                                  Q. Gao
                                                                   Z. Li
                                                                  Huawei
                                                         3 November 2025

Use Cases and Requirements of Communication Protocol for Troubleshooting
                       Agents on Network Devices
             draft-zhang-rtgwg-ai-agents-troubleshooting-00

Abstract

   This document focuses on the use cases and requirements of
   communication protocols for troubleshooting agents on network
   devices.

About This Document

   This note is to be removed before publishing as an RFC.

   The latest revision of this draft can be found at
   https://example.com/LATEST.  Status information for this document may
   be found at https://datatracker.ietf.org/doc/draft-zhang-rtgwg-ai-
   agents-troubleshooting/.

   Discussion of this document takes place on the rtgwg Working Group
   mailing list (mailto:WG@example.com), which is archived at
   https://example.com/WG.

   Source for this draft and an issue tracker can be found at
   https://github.com/USER/REPO.

Status of This Memo

   This Internet-Draft is submitted in full conformance with the
   provisions of BCP 78 and BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF).  Note that other groups may also distribute
   working documents as Internet-Drafts.  The list of current Internet-
   Drafts is at https://datatracker.ietf.org/drafts/current/.

Zhang, et al.              Expires 7 May 2026                   [Page 1]
Internet-Draft  Use Cases and Requirements of Communicat   November 2025

   Internet-Drafts are draft documents valid for a maximum of six months
   and may be updated, replaced, or obsoleted by other documents at any
   time.  It is inappropriate to use Internet-Drafts as reference
   material or to cite them other than as "work in progress."

   This Internet-Draft will expire on 7 May 2026.

Copyright Notice

   Copyright (c) 2025 IETF Trust and the persons identified as the
   document authors.  All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents (https://trustee.ietf.org/
   license-info) in effect on the date of publication of this document.
   Please review these documents carefully, as they describe your rights
   and restrictions with respect to this document.  Code Components
   extracted from this document must include Revised BSD License text as
   described in Section 4.e of the Trust Legal Provisions and are
   provided without warranty as described in the Revised BSD License.

Table of Contents

   1.  Introduction  . . . . . . . . . . . . . . . . . . . . . . . .   3
   2.  Conventions and Definitions . . . . . . . . . . . . . . . . .   3
   3.  Use Cases . . . . . . . . . . . . . . . . . . . . . . . . . .   3
     3.1.  Use Case 1: Data Center Network . . . . . . . . . . . . .   4
     3.2.  Use Case 2: Campus Network  . . . . . . . . . . . . . . .   7
     3.3.  Use Case 3: IoT Edge Network  . . . . . . . . . . . . . .   9
   4.  Requirements  . . . . . . . . . . . . . . . . . . . . . . . .  10
     4.1.  Data Transport Requirement  . . . . . . . . . . . . . . .  10
       4.1.1.  Data Format . . . . . . . . . . . . . . . . . . . . .  10
       4.1.2.  Streaming Capabilities  . . . . . . . . . . . . . . .  11
       4.1.3.  Transaction Integrity . . . . . . . . . . . . . . . .  11
     4.2.  Protocol Implementation Requirements  . . . . . . . . . .  11
       4.2.1.  Mandatory Transport Security  . . . . . . . . . . . .  11
       4.2.2.  Standardized Error Handling . . . . . . . . . . . . .  12
       4.2.3.  Message Prioritization and Preemption . . . . . . . .  12
     4.3.  Operational Requirements  . . . . . . . . . . . . . . . .  12
       4.3.1.  Interoperability and Versioning . . . . . . . . . . .  12
       4.3.2.  Resource Management . . . . . . . . . . . . . . . . .  12
       4.3.3.  Observability and Audit . . . . . . . . . . . . . . .  12
   5.  Security Considerations . . . . . . . . . . . . . . . . . . .  12
   6.  IANA Considerations . . . . . . . . . . . . . . . . . . . . .  13
   7.  Conclusion  . . . . . . . . . . . . . . . . . . . . . . . . .  13
   8.  Normative References  . . . . . . . . . . . . . . . . . . . .  13
   Acknowledgments . . . . . . . . . . . . . . . . . . . . . . . . .  13
   Authors' Addresses  . . . . . . . . . . . . . . . . . . . . . . .  13

1.  Introduction

   This document focuses on communication protocols and associated
   requirements for network troubleshooting interactions among agents on
   network devices.  As modern networks evolve toward greater complexity
   and dynamism, traditional centralized management systems face
   significant challenges in real-time fault detection and resolution.
   Intelligent agents embedded within network devices represent a
   paradigm shift toward distributed, autonomous network operations.
   This draft addresses the need for standardized communication
   methodologies that enable these agents to collaboratively identify,
   diagnose, and recover from network issues across diverse
   environments.

   The contents of this document are as follows:

   First, this document introduces three use cases to illustrate
   communication workflows between network device agents during
   troubleshooting.  Second, this document analyzes existing transport
   protocols for these interactions, highlighting the strengths and
   limitations for agent scenarios.  Finally, this document establishes
   fundamental requirements for implementing effective agent-to-agent
   troubleshooting systems.  By analyzing these interoperable
   communication modes, this draft aims to facilitate the development of
   self-healing networks capable of maintaining service levels despite
   increasing operational complexity.

   The use cases and requirements outlined herein are designed to be
   applicable across various network domains, including data center
   networks, campus networks, and IoT edge networks.

2.  Conventions and Definitions

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
   "SHOULD", "SHOULD NOT", "RECOMMENDED", "NOT RECOMMENDED", "MAY", and
   "OPTIONAL" in this document are to be interpreted as described in
   BCP 14 [RFC2119] [RFC8174] when, and only when, they appear in all
   capitals, as shown here.

   Troubleshooting Agent: An agent that runs on network devices to
   identify, diagnose, and recover from network failures.

3.  Use Cases

3.1.  Use Case 1: Data Center Network

   In a large-scale data center network, multiple troubleshooting agents
   on network devices, such as switches and routers, need to
   collaboratively identify and diagnose a transient latency issue
   affecting application performance.  The troubleshooting workflow
   begins when an application performance monitoring agent detects
   elevated response times and assigns a diagnosis task to
   troubleshooting agents on network devices.  The application
   performance monitoring agent can also report the issue to the agent
   on the network controller, which then notifies troubleshooting agents
   on network devices to find the root cause of the performance issue.
   How troubleshooting agents on network devices receive tasks is beyond
   the scope of this draft.

   For this use case, a high-performance, low-cost communication
   protocol is required.  Among existing protocols, gRPC offers
   significant advantages for this scenario.  This section uses gRPC as
   an example.

   When a troubleshooting agent on a network device receives a task to
   identify, diagnose, and recover from network failures, the
   communication flow may be as shown in the following figure.

  +--------------------+  1. establishes connection  +-----------------+
  | +----------------+ +----------------------------->  +-----------+  |
  | |                | | 2. request for related data |  |           |  |
  | |Initiating Agent| +----------------------------->  |   Agent   |  |
  | |                | |        3. response          |  |           |  |
  | +----------------+ <-----------------------------+  +-----------+  |
  |                    | 4. share analysis results   |    Relevant     |
  |   Network Device   +-----------------------------> Network Devices |
  +--------------------+                             +-----------------+

   Figure: Data Center Networks

   The communication flow consists of the following steps.  This
   document provides message examples for the first three steps.

   Step 1, the initiating agent establishes connections with relevant
   network device agents.  This step may include security-related
   procedures and a description of the failure.

   {
           "Sender": "Agent-I",
           "Failure":
           {
                   "Location": "Host 1",
                   "Type": "Packet loss rate greater than threshold",
                   "Description": "..."
           },
           "Solution":
           {
           },
           "Analysis":
           {
           },
           ...
   }
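
   As a minimal sketch of how an agent might assemble and serialize the
   Step 1 message, the following Python fragment builds the structure
   shown above.  The function name build_failure_message and the use of
   JSON as the wire encoding are assumptions made for illustration only;
   this document does not mandate an encoding.

```python
import json


def build_failure_message(sender, location, failure_type, description):
    """Return a Step 1 message with the fields used in the example above."""
    return {
        "Sender": sender,
        "Failure": {
            "Location": location,
            "Type": failure_type,
            "Description": description,
        },
        "Solution": {},  # left empty here; populated in later steps
        "Analysis": {},  # left empty here; populated in later steps
    }


msg = build_failure_message(
    "Agent-I", "Host 1", "Packet loss rate greater than threshold", "..."
)
encoded = json.dumps(msg)      # text that could be carried by the transport
decoded = json.loads(encoded)  # what a receiving agent would parse
```

   The round trip through json.dumps and json.loads only shows that the
   structure is serializable; a gRPC deployment would more likely carry
   an equivalent protobuf message.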

   Step 2, over a bidirectional connection, the initiator requests
   real-time telemetry data including interface statistics, queue
   depths, and latency measurements.

   The initiator's request message could be as follows.

   {
           "Sender": "Agent-I",
           "Failure":
           {
                   "Location": "Host 1",
                   "Type": "Packet loss rate greater than threshold",
                   "Description": "..."
           },
           "Solution":
           {
                   "RelatedNetDevice1":
                   {
                           "Type": "Request",
                           "Resource": "Data",
                           "Description": "Traffic patterns of the device's ingress and egress over a certain period of time.",
                           "TransMethods": ["gRPC", "QUIC"]
                   },
                   "RelatedHost2":
                   {
                           "Type": "Request",
                           "Resource": "Method",
                           "Description": "The device needs to send colored packets to collect path data."
                   },
                   ....
           },
           "Analysis":
           {
           },
           ...
   }

   Step 3, network agents respond with telemetry data in real time.  A
   relevant network device agent could send the following message as a
   response.

   {
           "Sender": "Agent-D",
           "Response":
           {
                   "Resource": "Data",
                   "Data":
                   {
                           "Description": "Traffic patterns of this device's ingress and egress over a certain period of time.",
                           "TransMethods": "gRPC",
                           ...
                   }
                   ....
           },
           ...
   }
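
   In the examples above, the request offers a list of transport methods
   and the response names a single one.  One possible way for the
   responding agent to make that choice is sketched below.  The helper
   names choose_transport and build_response, the rule "first offered
   method that is locally supported", and the "Error" field are
   assumptions for illustration; they are not defined by this document.

```python
def choose_transport(offered, supported):
    """Return the first offered method the responder supports, else None."""
    for method in offered:
        if method in supported:
            return method
    return None


def build_response(sender, request_entry, supported):
    """Build a Step 3 style response for one entry of the Step 2 request."""
    chosen = choose_transport(request_entry.get("TransMethods", []), supported)
    response = {"Resource": request_entry["Resource"]}
    if chosen is None:
        # Hypothetical field: the draft does not define an error layout here.
        response["Error"] = "no common transport method"
    else:
        response["Data"] = {
            "Description": request_entry["Description"],
            "TransMethods": chosen,
        }
    return {"Sender": sender, "Response": response}


request_entry = {
    "Type": "Request",
    "Resource": "Data",
    "Description": "Traffic patterns of the device's ingress and egress "
                   "over a certain period of time.",
    "TransMethods": ["gRPC", "QUIC"],
}
reply = build_response("Agent-D", request_entry, supported={"gRPC"})
```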

   Step 4, optionally, the initiating agent shares analysis results
   through the same channels, enabling collaborative root cause
   identification.

   Step 5, once the root cause of the failure is identified, agents
   negotiate and implement traffic engineering adjustments.

3.2.  Use Case 2: Campus Network

   A network segmentation issue in an enterprise campus requires
   verification of consistent policy application across multiple
   security domains.  Agents residing in firewalls, switches, and
   wireless controllers must collaboratively audit their configurations
   against intended policies to identify discrepancies causing the
   segmentation failure.

   For this use case, a configuration-oriented troubleshooting scenario,
   HTTP-based RESTCONF offers several benefits.  This section uses
   RESTCONF as an example.  The RESTful architecture provides familiar,
   standardized operations (GET, PATCH, DELETE) for configuration
   manipulation.  YANG data modeling ensures semantic consistency across
   multi-vendor environments, crucial for accurate policy verification.
   HTTP/2's header compression and request multiplexing improve
   efficiency when interacting with numerous agents simultaneously.  The
   protocol's stateless nature simplifies error recovery, while
   standardized status codes and error responses enable predictable
   failure handling.  Rich authentication mechanisms integrate
   seamlessly with existing enterprise security infrastructures.
   However, RESTCONF lacks bidirectional streaming capabilities for
   real-time telemetry exchange.

   In this use case, the agent that is assigned this task is called the
   coordinator agent.

                        +-----------------------+
                        | +-------------------+ |
                        | |                   | |
                        | | Coordinator Agent | |4. process data
                        | |                   | |
                        | +-------------------+ |
                        |                       |
                        |     Network Device    |
                        +---^------+------------+
                            |      |1. establishes connections
                            |      |2. queries configuration data
            +---------------+------++------------------------+
            |               |       |5. pushes configuration adjustments
            |  3. response  +-------+---------------+        |
            |        +------+       |               |        |
   +--------v--------+     ++-------v--------+     ++--------v-------+
   |   +---------+   |     |   +---------+   |     |   +---------+   |
   |   |  Agent  |   |     |   |  Agent  |   |     |   |  Agent  |   |
   |   +---------+   |     |   +---------+   |     |   +---------+   |
   |                 |     |                 |     |                 |
   | Network Device  |     | Network Device  |     | Network Device  |
   +-----------------+     +-----------------+     +-----------------+

   Figure: Campus Networks

   The communication flow includes these steps:

   1.  A coordinator agent establishes connections with relevant network
       agents.

   2.  The coordinator queries configuration data from multiple device
       agents using standardized YANG data models.

   3.  Each agent responds with structured configuration data
       representing its current operational state.

   4.  The coordinator analyzes the collective configuration data,
       identifies inconsistencies in access control lists and routing
       policies, and generates remediation instructions.

   5.  Using RESTCONF PATCH operations or other network management
       operations, the coordinator pushes configuration adjustments to
       specific agents.

   6.  Agents respond with structured error messages if operations fail,
       enabling precise fault localization.
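
   Steps 4 and 5 above can be sketched as a comparison of each agent's
   reported configuration against the intended policy, followed by the
   generation of per-agent adjustments.  The policy keys, agent names,
   and function names below are hypothetical and chosen only for
   illustration; a real deployment would operate on YANG-modeled data
   retrieved through RESTCONF.

```python
def find_discrepancies(intended, reported):
    """Return, per agent, the settings that differ from the intended policy."""
    findings = {}
    for agent, config in reported.items():
        diffs = {
            key: {"expected": want, "actual": config.get(key)}
            for key, want in intended.items()
            if config.get(key) != want
        }
        if diffs:
            findings[agent] = diffs
    return findings


def remediation_for(findings):
    """Turn findings into the values each agent would need to be patched to."""
    return {
        agent: {key: diff["expected"] for key, diff in diffs.items()}
        for agent, diffs in findings.items()
    }


# Hypothetical intended policy and reported configurations.
intended = {"guest-acl": "deny-internal", "voice-vlan": 20}
reported = {
    "firewall-1": {"guest-acl": "deny-internal", "voice-vlan": 20},
    "switch-7": {"guest-acl": "permit-any", "voice-vlan": 20},
    "wlc-2": {"guest-acl": "deny-internal"},  # voice-vlan is missing
}
findings = find_discrepancies(intended, reported)
patches = remediation_for(findings)
```

   Each entry of patches corresponds to the body of one configuration
   adjustment that the coordinator would push in step 5.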

3.3.  Use Case 3: IoT Edge Network

   In an IoT edge network, multiple constrained devices experience
   intermittent connectivity issues.  Lightweight agents on these
   devices must efficiently share fault information and coordinate
   recovery actions while conserving bandwidth and battery resources.

   MQTT is widely used in IoT networks.  This section uses MQTT as an
   example.  MQTT's publish-subscribe model offers distinct advantages
   for distributed troubleshooting scenarios.  The decoupled
   communication pattern allows agents to exchange information without
   direct connections, reducing coordination overhead.  Configurable QoS
   levels allow reliability to be matched to different message types:
   for example, QoS 0 for non-critical telemetry, QoS 1 for important
   fault notifications, and QoS 2 for critical configuration changes.
   The minimal protocol overhead conserves bandwidth and battery life on
   constrained devices.  Last Will and Testament features ensure other
   agents are notified when a device becomes unreachable, enabling rapid
   detection of network partitions.  The topic-based routing simplifies
   message filtering and delivery to interested parties only.

                                                       +----------------+
                                                       |    +-------+   |
                                                       |    | Agent |   |
                                                      ++    +-------+   |
                                                      || Network Device |
2. Identify failures  3. Publish a failure report     |+----------------+
     +----------------+    +---------------+    1. Subscribe------------+
     |    +-------+   <----++-------------+<----------+|    +-------+   |
     |    | Agent |   |    || MQTT Broker ||          ++    | Agent |   |
     |    +-------+   |    |+-------------+|          ||    +-------+   |
     | Network Device |    | Network Edge  +---------->| Network Device |
     +-------------^--+    +---------------+4. Notification-------------+
                   |                                  |+----------------+
                   +----------------------------------++    +-------+   |
                 5. Offer related data and resources  ||    | Agent |   |
                                                      ++    +-------+   |
                                                       | Network Device |
                                                       +----------------+

   Figure: IoT Edge Networks

   The communication flow includes these steps:

   1.  Agents subscribe to relevant fault notification topics on an MQTT
       broker deployed at the network edge.

   2.  The agent on the network device where a failure occurred
       identifies the failure.

   3.  That agent publishes a structured failure report to the
       appropriate topics.

   4.  Subscribed agents receive the notification and contribute
       additional context from their perspectives.

   5.  The subscribed agents may offer data and resources to help
       diagnose or recover from this failure.

   6.  The agent on the network device where the failure occurred
       recovers from the failure or reports it.
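
   The publish-subscribe flow in steps 1 to 4 can be sketched with a toy
   in-memory broker.  This is not an MQTT implementation: the class
   ToyBroker, the topic name, and the message-kind labels are
   assumptions for illustration, topics are matched exactly (no
   wildcards), and the QoS value is only carried along rather than
   enforced.

```python
# QoS choice per message kind, following the mapping described in this
# section.  The kind labels themselves are hypothetical.
QOS_BY_KIND = {"telemetry": 0, "fault": 1, "config-change": 2}


class ToyBroker:
    """Minimal stand-in for a broker: exact-match topics, synchronous delivery."""

    def __init__(self):
        self._subscribers = {}  # topic -> list of callbacks

    def subscribe(self, topic, callback):
        self._subscribers.setdefault(topic, []).append(callback)

    def publish(self, topic, payload, qos):
        for callback in self._subscribers.get(topic, []):
            callback(topic, payload, qos)


broker = ToyBroker()
inbox = []  # what the subscribed agent has received

# Step 1: an agent subscribes to a fault notification topic.
broker.subscribe("faults/link", lambda t, p, q: inbox.append((t, p, q)))

# Steps 2 and 3: the affected agent identifies a failure and publishes a report.
report = {"Sender": "Agent-A", "Failure": {"Type": "link flapping"}}
broker.publish("faults/link", report, QOS_BY_KIND["fault"])

# Step 4: the subscriber now holds the notification in its inbox.
```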

4.  Requirements

   Based on these use cases, this draft derives the following
   requirements for communication protocols used in network
   troubleshooting interactions among agents on network devices.

4.1.  Data Transport Requirement

4.1.1.  Data Format

   The interaction between agents should use human-readable language,
   e.g., natural language.  However, for the sake of communication
   performance, messages delivered by agents should be encapsulated in a
   structured format.  A message sent by an agent could be as follows.

   {
           "Sender": "Agent",
           "Failure":
           {
                   "Location": "...",
                   "Type": "...",
                   "Description": "..."
           },
           "Solution":
           {
           },
           "Analysis":
           {
           },
           ...
   }
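
   A receiving agent would typically check that an incoming message has
   the expected structure before acting on it.  The following sketch
   assumes that the four top-level fields and the three "Failure" fields
   shown above are all required; this document does not state which
   fields are mandatory, so the lists and the function name
   validate_message are illustrative only.

```python
# Assumed-required fields, taken from the example message above.
REQUIRED_TOP_LEVEL = ("Sender", "Failure", "Solution", "Analysis")
REQUIRED_FAILURE = ("Location", "Type", "Description")


def validate_message(message):
    """Return a list of problems; an empty list means no problem was found."""
    problems = [
        f"missing field: {name}"
        for name in REQUIRED_TOP_LEVEL
        if name not in message
    ]
    failure = message.get("Failure", {})
    if not isinstance(failure, dict):
        problems.append("Failure must be an object")
        return problems
    problems += [
        f"missing field: Failure.{name}"
        for name in REQUIRED_FAILURE
        if name not in failure
    ]
    return problems


good = {
    "Sender": "Agent",
    "Failure": {"Location": "...", "Type": "...", "Description": "..."},
    "Solution": {},
    "Analysis": {},
}
bad = {"Sender": "Agent", "Failure": {"Type": "..."}}
```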

4.1.2.  Streaming Capabilities

   Troubleshooting agents MUST support bidirectional streaming for real-
   time telemetry exchange and collaborative analysis.  Streaming
   implementations SHOULD include flow control mechanisms to prevent
   resource exhaustion and MUST maintain message ordering within
   streams.  Agents SHOULD implement priority handling for critical
   troubleshooting messages within streams to ensure timely delivery of
   urgent notifications.
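
   As an illustration of in-stream ordering combined with a simple form
   of flow control, the sketch below delivers messages strictly in
   sequence order and refuses to buffer more than a fixed number of
   early arrivals.  Transports such as gRPC already order messages
   within a stream; the class OrderedStream, the per-message sequence
   number, and the window size are assumptions used only to make the
   requirement concrete.

```python
class OrderedStream:
    """Deliver messages in sequence order with a bounded reorder buffer."""

    def __init__(self, window=4):
        self.window = window   # maximum number of buffered early arrivals
        self.next_seq = 0      # sequence number expected next
        self.pending = {}      # seq -> message, waiting for a gap to close

    def receive(self, seq, message):
        """Return the messages that become deliverable, in order."""
        if seq < self.next_seq or seq in self.pending:
            return []  # duplicate: already delivered or already buffered
        if seq != self.next_seq and len(self.pending) >= self.window:
            # Flow control: signal that the sender has to slow down.
            raise BufferError("receive window full")
        self.pending[seq] = message
        ready = []
        while self.next_seq in self.pending:
            ready.append(self.pending.pop(self.next_seq))
            self.next_seq += 1
        return ready


stream = OrderedStream(window=4)
first = stream.receive(1, "queue depth sample")   # arrives early, is buffered
second = stream.receive(0, "interface counters")  # closes the gap
```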

4.1.3.  Transaction Integrity

   For configuration modifications during troubleshooting, agents MUST
   implement transactional semantics to maintain network consistency.
   Multi-agent transactions SHOULD support two-phase commit protocols or
   equivalent distributed consensus mechanisms.  All configuration
   changes MUST be idempotent to allow safe retransmission in case of
   delivery uncertainties.
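
   The combination of two-phase commit and idempotent application can be
   sketched as follows.  The classes and function below are a simplified
   illustration under stated assumptions: there is a single coordinator,
   no timeouts or crash recovery are modeled, and a transaction
   identifier (txn_id, a hypothetical name) is used to make a repeated
   commit a no-op.

```python
class Participant:
    """An agent that stages changes in phase 1 and applies them in phase 2."""

    def __init__(self, name, config, accept=True):
        self.name = name
        self.config = dict(config)
        self.accept = accept   # whether prepare() votes yes
        self.staged = {}       # txn_id -> proposed changes
        self.applied = set()   # txn_ids that were already committed

    def prepare(self, txn_id, changes):
        if not self.accept:
            return False
        self.staged[txn_id] = dict(changes)
        return True

    def commit(self, txn_id):
        if txn_id in self.applied:
            return  # idempotent: a retransmitted commit changes nothing
        self.config.update(self.staged.pop(txn_id))
        self.applied.add(txn_id)

    def abort(self, txn_id):
        self.staged.pop(txn_id, None)


def two_phase_commit(txn_id, changes, participants):
    """Commit on all participants only if every one of them votes yes."""
    if all(p.prepare(txn_id, changes) for p in participants):
        for p in participants:
            p.commit(txn_id)
        return True
    for p in participants:
        p.abort(txn_id)
    return False


a = Participant("switch-1", {"mtu": 1500})
b = Participant("switch-2", {"mtu": 1500})
ok = two_phase_commit("txn-1", {"mtu": 9000}, [a, b])
a.commit("txn-1")  # simulated retransmission; has no further effect

c = Participant("switch-3", {"mtu": 1500}, accept=False)
rejected = two_phase_commit("txn-2", {"mtu": 1400}, [a, c])
```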

4.2.  Protocol Implementation Requirements

4.2.1.  Mandatory Transport Security

   All inter-agent communications MUST employ transport-layer security
   (TLS 1.2 or higher) with mutual authentication.  Certificate-based
   authentication is RECOMMENDED over pre-shared keys for scalable
   deployment.  Agents MUST implement certificate revocation checking
   and SHOULD support forward secrecy cipher suites.
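
   One way to express these settings on the listening side is shown
   below, using Python's standard ssl module.  The function name and its
   parameters are assumptions for illustration, and the file paths are
   left to the caller.  The sketch only configures the context; it does
   not open a socket, and CRL checking will only succeed at handshake
   time if revocation lists have actually been loaded.

```python
import ssl


def make_server_context(cert_file=None, key_file=None, ca_file=None,
                        crl_check=False):
    """Build a TLS context that requires TLS 1.2+ and a client certificate."""
    ctx = ssl.SSLContext(ssl.PROTOCOL_TLS_SERVER)
    ctx.minimum_version = ssl.TLSVersion.TLSv1_2  # TLS 1.2 or higher
    ctx.verify_mode = ssl.CERT_REQUIRED           # mutual authentication
    if cert_file:
        ctx.load_cert_chain(cert_file, key_file)  # this agent's identity
    if ca_file:
        # Trusted CA certificates; the same call can also load CRLs.
        ctx.load_verify_locations(ca_file)
    if crl_check:
        # Check the peer's leaf certificate against the loaded CRLs.
        ctx.verify_flags |= ssl.VERIFY_CRL_CHECK_LEAF
    return ctx


context = make_server_context(crl_check=True)
```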

4.2.2.  Standardized Error Handling

   Agents MUST implement consistent error reporting mechanisms across
   all communication protocols.  Error responses MUST include machine-
   readable error codes, human-readable descriptions, and suggested
   remediation actions.  Protocol-specific error mappings SHOULD be
   defined to translate underlying transport errors to application-level
   troubleshooting semantics.
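
   A possible shape for such an error response, together with a
   protocol-specific mapping, is sketched below.  UNAVAILABLE and
   DEADLINE_EXCEEDED are existing gRPC status codes and 409 is the HTTP
   Conflict status; the application-level codes, descriptions, and
   remediation texts are hypothetical and are not defined by this
   document.

```python
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class AgentError:
    code: str         # machine-readable error code
    description: str  # human-readable description
    remediation: str  # suggested remediation action


# Hypothetical mapping from (protocol, transport error) to agent errors.
ERROR_MAP = {
    ("gRPC", "UNAVAILABLE"): AgentError(
        "PEER_UNREACHABLE",
        "The peer agent could not be reached.",
        "Retry after a back-off or select another peer.",
    ),
    ("gRPC", "DEADLINE_EXCEEDED"): AgentError(
        "REQUEST_TIMEOUT",
        "The peer agent did not answer in time.",
        "Reduce the requested data range or extend the deadline.",
    ),
    ("HTTP", "409"): AgentError(
        "CONFIG_CONFLICT",
        "The configuration change conflicts with the current state.",
        "Re-read the configuration and resubmit the change.",
    ),
}
FALLBACK = AgentError(
    "UNKNOWN_TRANSPORT_ERROR",
    "The transport reported an error that has no mapping.",
    "Inspect the transport logs.",
)


def to_agent_error(protocol, transport_code):
    """Translate a transport-level error into an application-level response."""
    error = ERROR_MAP.get((protocol, str(transport_code)), FALLBACK)
    return asdict(error)


conflict = to_agent_error("HTTP", 409)
```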

4.2.3.  Message Prioritization and Preemption

   Troubleshooting systems MUST implement message prioritization to
   ensure critical fault notifications receive appropriate network
   resources.  Agents SHOULD support preemption of lower-priority
   communications when high-priority troubleshooting sessions require
   immediate attention.  Quality of Service differentiation SHOULD be
   implemented at both transport and application layers.
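
   At the application layer, prioritization can be as simple as a
   priority queue in front of the sender.  The sketch below assumes that
   a lower number means a higher priority and keeps first-in, first-out
   order among messages of equal priority; the class name and priority
   values are illustrative, and preemption of messages already in
   transit is not modeled.

```python
import heapq
import itertools


class PriorityOutbox:
    """Send queue that releases the most urgent message first."""

    def __init__(self):
        self._heap = []
        # The counter breaks ties so equal priorities stay in FIFO order.
        self._counter = itertools.count()

    def put(self, priority, message):
        heapq.heappush(self._heap, (priority, next(self._counter), message))

    def get(self):
        return heapq.heappop(self._heap)[2]

    def __len__(self):
        return len(self._heap)


outbox = PriorityOutbox()
outbox.put(2, "periodic telemetry")
outbox.put(0, "critical fault notification")
outbox.put(2, "more telemetry")
outbox.put(1, "analysis result")
order = [outbox.get() for _ in range(len(outbox))]
```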

4.3.  Operational Requirements

4.3.1.  Interoperability and Versioning

   Agents MUST implement protocol version negotiation to maintain
   backward compatibility during upgrades.  Data schema evolution SHOULD
   follow compatibility rules that prevent communication breakdowns.
   Agents SHOULD support graceful degradation of functionality when
   communicating with older implementations.
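
   A minimal sketch of version negotiation, assuming each agent
   advertises the list of protocol versions it supports as (major,
   minor) pairs and both sides pick the highest version they have in
   common.  The representation and the function name are assumptions;
   this document does not define a negotiation procedure.

```python
def negotiate_version(local_versions, peer_versions):
    """Return the highest version supported by both sides, or None."""
    common = set(local_versions) & set(peer_versions)
    return max(common) if common else None


newer_agent = [(1, 0), (1, 1), (2, 0)]
older_agent = [(1, 0), (1, 1)]
agreed = negotiate_version(newer_agent, older_agent)
```

   When the result is lower than an agent's preferred version, the agent
   would restrict itself to the features of the agreed version, which is
   one form of graceful degradation.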

4.3.2.  Resource Management

   Agent implementations MUST include configurable resource limits to
   prevent exhaustion during mass troubleshooting events.  Memory,
   bandwidth, and processing quotas SHOULD be enforced per communication
   session.  Agents MUST implement circuit breaker patterns to isolate
   misbehaving peers and maintain overall system stability.
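
   The circuit breaker pattern mentioned above can be sketched as
   follows.  The thresholds, the cool-down period, and the class layout
   are assumptions for illustration: after a configured number of
   consecutive failures the breaker opens and rejects requests to that
   peer, and once the cool-down has elapsed it admits requests again as
   a trial.

```python
import time


class CircuitBreaker:
    """Per-peer breaker: opens after repeated failures, retries after a pause."""

    def __init__(self, max_failures=3, reset_after=30.0, clock=time.monotonic):
        self.max_failures = max_failures
        self.reset_after = reset_after  # cool-down in seconds
        self.clock = clock              # injectable for testing
        self.failures = 0
        self.opened_at = None           # None means the breaker is closed

    def allow(self):
        """Return True if a request to the peer may be attempted now."""
        if self.opened_at is None:
            return True
        # After the cool-down, requests are admitted again as a trial.
        return self.clock() - self.opened_at >= self.reset_after

    def record_success(self):
        self.failures = 0
        self.opened_at = None

    def record_failure(self):
        self.failures += 1
        if self.failures >= self.max_failures:
            self.opened_at = self.clock()  # open, or re-open after a failed trial


now = [0.0]  # fake clock so the example does not need to sleep
breaker = CircuitBreaker(max_failures=2, reset_after=10.0, clock=lambda: now[0])
breaker.record_failure()
breaker.record_failure()
blocked = not breaker.allow()  # the breaker is open at this point
now[0] = 10.0
trial_allowed = breaker.allow()  # cool-down elapsed, a trial is admitted
```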

4.3.3.  Observability and Audit

   All troubleshooting communications MUST be logged with sufficient
   detail to reconstruct decision processes.  Log entries SHOULD include
   message timestamps, participant identities, and semantic content
   summaries.  Audit trails MUST be protected against tampering and
   available for post-incident analysis.
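
   One common technique for making a log tamper-evident is to chain the
   entries with a cryptographic hash, so that changing any earlier entry
   invalidates every later one.  The sketch below is an illustration
   under that assumption; the field names are hypothetical, and a hash
   chain only detects modification, so it would need to be combined with
   access control or external anchoring to actually protect the log.

```python
import hashlib
import json

GENESIS = "0" * 64  # placeholder "previous hash" for the first entry


def _digest(body):
    """Hash the entry body in a canonical (sorted-key) JSON form."""
    return hashlib.sha256(json.dumps(body, sort_keys=True).encode()).hexdigest()


def append_entry(log, timestamp, participants, summary):
    """Append an entry that commits to the hash of the previous entry."""
    body = {
        "timestamp": timestamp,
        "participants": participants,
        "summary": summary,
        "prev": log[-1]["hash"] if log else GENESIS,
    }
    log.append({**body, "hash": _digest(body)})


def verify_log(log):
    """Return True if no entry was altered, removed, or reordered."""
    prev_hash = GENESIS
    for entry in log:
        body = {k: v for k, v in entry.items() if k != "hash"}
        if entry["prev"] != prev_hash or _digest(body) != entry["hash"]:
            return False
        prev_hash = entry["hash"]
    return True


audit_log = []
append_entry(audit_log, "2025-11-03T10:00:00Z", ["Agent-I", "Agent-D"],
             "telemetry requested for Host 1")
append_entry(audit_log, "2025-11-03T10:00:05Z", ["Agent-D", "Agent-I"],
             "telemetry returned")
intact = verify_log(audit_log)
```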

5.  Security Considerations

   TBD

6.  IANA Considerations

   This document has no IANA actions.

7.  Conclusion

   TBD

8.  Normative References

   [RFC2119]  Bradner, S., "Key words for use in RFCs to Indicate
              Requirement Levels", BCP 14, RFC 2119,
              DOI 10.17487/RFC2119, March 1997,
              <https://www.rfc-editor.org/rfc/rfc2119>.

   [RFC8174]  Leiba, B., "Ambiguity of Uppercase vs Lowercase in RFC
              2119 Key Words", BCP 14, RFC 8174, DOI 10.17487/RFC8174,
              May 2017, <https://www.rfc-editor.org/rfc/rfc8174>.

Acknowledgments

   TODO acknowledge.

Authors' Addresses

   Ruyi Zhang
   Huawei
   Email: zhangruyi8@huawei.com

   Jianwei Mao
   Huawei
   Email: maojianwei@huawei.com

   Bing Liu
   Huawei
   Email: leo.liubing@huawei.com

   Nan Geng
   Huawei
   Email: gengnan@huawei.com

   Xiaotong Shang
   Huawei
   Email: shangxiaotong@huawei.com

   Qiangzhou Gao
   Huawei
   Email: gaoqiangzhou@huawei.com

   Zhenbin Li
   Huawei
   Email: robinli314@163.com