Network Management Research Group                                 Y. Cui
Internet-Draft                                                    C. Liu
Intended status: Informational                                    X. Xie
Expires: 3 July 2026                                 Tsinghua University
                                                                   C. Du
                                                 Zhongguancun Laboratory
                                                        30 December 2025

      A Framework to Evaluate LLM Agents for Network Configuration
                    draft-cui-nmrg-llm-benchmark-01

Abstract

   This document specifies an evaluation framework and related
   definitions for intent-driven network configuration using Large
   Language Model (LLM)-based agents.  The framework combines an
   emulator-based interactive environment, a suite of representative
   tasks, and multi-dimensional metrics to assess reasoning quality,
   command accuracy, and functional correctness.  The framework aims to
   enable reproducible, comprehensive, and fair comparisons among LLM-
   driven network configuration approaches.

About This Document

   This note is to be removed before publishing as an RFC.

   The latest revision of this draft can be found at
   https://datatracker.ietf.org/doc/draft-cui-nmrg-llm-benchmark/.
   Status information for this document may be found at
   https://datatracker.ietf.org/doc/draft-cui-nmrg-llm-benchmark/.

   Discussion of this document takes place on the Network Management
   Research Group mailing list (mailto:nmrg@irtf.org), which is archived
   at https://mailarchive.ietf.org/arch/browse/nmrg.  Subscribe at
   https://www.ietf.org/mailman/listinfo/nmrg/.

   Source for this draft and an issue tracker can be found at
   https://github.com/nobrowning/draft_llm_conf_benchmark.

Status of This Memo

   This Internet-Draft is submitted in full conformance with the
   provisions of BCP 78 and BCP 79.

Cui, et al.                Expires 3 July 2026                  [Page 1]
Internet-Draft                NetConfBench                 December 2025

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF).  Note that other groups may also distribute
   working documents as Internet-Drafts.  The list of current Internet-
   Drafts is at https://datatracker.ietf.org/drafts/current/.

   Internet-Drafts are draft documents valid for a maximum of six months
   and may be updated, replaced, or obsoleted by other documents at any
   time.  It is inappropriate to use Internet-Drafts as reference
   material or to cite them other than as "work in progress."

   This Internet-Draft will expire on 3 July 2026.

Copyright Notice

   Copyright (c) 2025 IETF Trust and the persons identified as the
   document authors.  All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents (https://trustee.ietf.org/
   license-info) in effect on the date of publication of this document.
   Please review these documents carefully, as they describe your rights
   and restrictions with respect to this document.

Table of Contents

   1.  Introduction  . . . . . . . . . . . . . . . . . . . . . . . .   3
   2.  Terminology . . . . . . . . . . . . . . . . . . . . . . . . .   4
   3.  Framework Overview  . . . . . . . . . . . . . . . . . . . . .   4
     3.1.  Components  . . . . . . . . . . . . . . . . . . . . . . .   6
     3.2.  Workflow  . . . . . . . . . . . . . . . . . . . . . . . .   9
   4.  Data Model  . . . . . . . . . . . . . . . . . . . . . . . . .  11
     4.1.  Task Definition Schema  . . . . . . . . . . . . . . . . .  11
     4.2.  Agent-Network Interface (ANI) . . . . . . . . . . . . . .  13
     4.3.  Task Evaluation Interface . . . . . . . . . . . . . . . .  15
   5.  MCP-Based Implementation  . . . . . . . . . . . . . . . . . .  16
     5.1.  Benefits of MCP Integration . . . . . . . . . . . . . . .  16
     5.2.  MCP Tool Definitions for ANI Operations . . . . . . . . .  17
        5.2.1.  get_topology  . . . . . . . . . . . . . . . . . . . .  17
        5.2.2.  get_running_config  . . . . . . . . . . . . . . . . .  18
        5.2.3.  update_config . . . . . . . . . . . . . . . . . . . .  19
     5.3.  Additional MCP Tools for Advanced Scenarios . . . . . . .  22
       5.3.1.  batch_configure_devices . . . . . . . . . . . . . . .  22
       5.3.2.  check_device_status . . . . . . . . . . . . . . . . .  22
   6.  Security Considerations . . . . . . . . . . . . . . . . . . .  23
   7.  IANA Considerations . . . . . . . . . . . . . . . . . . . . .  23
   8.  References  . . . . . . . . . . . . . . . . . . . . . . . . .  23
     8.1.  Normative References  . . . . . . . . . . . . . . . . . .  24
     8.2.  Informative References  . . . . . . . . . . . . . . . . .  24


   Acknowledgments . . . . . . . . . . . . . . . . . . . . . . . . .  24
   Authors' Addresses  . . . . . . . . . . . . . . . . . . . . . . .  24

1.  Introduction

   Network configuration is fundamental to ensuring network stability,
   scalability, and conformance with intended design behavior.
   Effective configuration requires not only a comprehensive
   understanding of network technologies but also advanced capabilities
   for interpreting complex topologies, analyzing dependencies, and
   specifying parameters accurately.  Traditional automation approaches
   such as Ansible playbooks [A2023], NETCONF [RFC6241]/YANG models
   [RFC7950], and program-synthesis methods either demand extensive
   manual scripting or are limited to narrow problem domains
   [Kreutz2014].  In parallel, Large Language Models (LLMs) have
   demonstrated the ability to interpret natural-language instructions
   and generate device-specific commands, showing promise for intent-
   driven automation in networking.  However, existing work remains
   fragmented and lacks a standardized way to measure whether an LLM can
   truly operate as an autonomous agent in realistic, multi-step
   configuration scenarios.

   Despite encouraging results in individual subtasks, most
   evaluations [Wang2024NetConfEval] rely on static datasets and ad hoc
   metrics that do not reflect real-world complexity.  As a result:

   *  There is no common benchmark suite covering diverse configuration
      domains (routing, QoS, security) with clearly defined intents,
      topologies, and ground truth.

   *  Existing tests seldom involve interactive environments that
      emulate vendor-specific device behavior or provide runtime
      feedback on command execution.

   *  Evaluation metrics are often limited to simple syntactic checks
      or isolated command validation, failing to capture whether the
      intended network behavior is actually achieved.

   Consequently, it is difficult to compare different LLM approaches or
   to identify gaps in reasoning, context-sensitivity, and error-
   correction capabilities [Long2025][Liu2024][Fuad2024][Lira2024].  To
   address these shortcomings, this document introduces *NetConfBench*,
   a holistic framework that provides:

   1.  An emulator-based environment (built on GNS3) to simulate
       realistic device interactions.

   2.  A benchmark suite of forty tasks spanning routing, QoS, and
       security, each defined by intent, topology, initial state,
       ground-truth configuration, annotated reasoning trace, and
       expert-crafted testcases.

   3.  Multidimensional metrics (_reasoning score_, _command score_,
       and _testcase score_) that evaluate an agent's internal
       reasoning coherence, the semantic correctness of generated
       commands, and functional outcomes in the emulated network.


   NetConfBench aims to enable reproducible, comprehensive comparisons
   among single-turn LLMs, ReAct-style multi-turn agents, and knowledge-
   augmented variants, guiding future research toward truly autonomous,
   intent-driven network configuration.

2.  Terminology

   For clarity within this document, the following terms and
   abbreviations are defined:

   *  Agent: A software component powered by an LLM that consumes a task
      intent, interacts with a network environment, and issues
      configuration commands autonomously.

   *  Configuration Command: A device-specific instruction (e.g., a
      Cisco IOS CLI line or a Juniper Junos set statement) sent by the
      agent to a network device.

   *  Environment: An emulated or real network instance that exposes
      device status, topology information, and feedback on applied
      commands.

   *  Intent: A high-level specification of desired network behavior or
      objective, expressed in natural language or a structured format
      defined in this document.

   *  Task: A single evaluation unit defined by (1) a scenario category,
      (2) an environment topology, (3) initial device configurations,
      and (4) an intent.  The agent is evaluated on its ability to
      fulfill the intent in the given environment.

   *  Testcase: A concrete, executable set of verification steps (e.g.,
      ping tests, traffic-flow validation, policy checks) used to assert
      whether the agent's final configuration satisfies the intent.

   *  MCP (Model Context Protocol): An open standard protocol designed
      to facilitate communication between LLMs and external data sources
      or tools, enabling standardized tool discovery, invocation, and
      result handling.

3.  Framework Overview


   +------------------+
   |    Task Dataset  |                     +-------------------------+
   |+----------------+|    +-----------+    |        Evaluator        |
   ||Network Intents ||(1) |           |(4) |+----------+ +----------+|
   ||+--------+      |---->| LLM Agent |<--->|Reasoning | |Grnd Truth||
   |||Routing |      ||    |           |    ||Trajectory| |Reasoning ||
   |||Policy  | +---+||    +-----------+    |+----------+ +----------+|
   ||+--------+ |QoS|||          |          |     \             /     |
   ||+--------+ +---+||          |          |      Rouge/Cos. Sim.    |
   |||Security|      ||         (3)         |                         |
   ||+--------+      ||          |          |+----------+ +----------+|
   |+----------------+|          |       (5)|| Final    | |Grnd Truth||
   |+----------------+|          |        +->| Configs  | |Configs   ||
   ||Network Topology||    +-----------+  | |+----------+ +----------+|
   ||+-----+ +-----+ ||(2) |Environment|  | |     \             /     |
   |||Nodes| |Links| |---->|           |-+  |    Precision/Recall     |
   ||+-----+ +-----+ ||    | R2 --- R1 |    |                         |
   |+----------------+|    | |(GNS3)|  |(6) | +---------------------+ |
   |                  |    | R3 --- R4 |<-->| |     Testcases       | |
   |+----------------+|(2) |           |    | +---------------------+ |
   ||Initial Configs |---->| Emulator- |    |            |            |
   |+----------------+|    |  based    |    |        Pass Rate        |
   +------------------+    +-----------+    +-------------------------+

   Legend:
   (1)Task Assignment             (2)Environment Setup
   (3)Interactive Task Execution  (4)Reasoning Trajectory Export
   (5)Final Configuration Export  (6)Testcase Execution

   Figure 1: The NetConfBench Framework

   The proposed framework is shown in Figure 1.  The flow begins with a
   *Task Dataset* defining network intents and topologies.  The *LLM
   Agent* perceives the environment, reasons about required actions, and
   applies configuration commands.  The *Environment* simulates or
   controls real devices, providing feedback for each action.  Finally,
   the *Evaluator* compares the agent's outputs against ground-truth
   configurations and reasoning, computing scores for accuracy and
   completion.

   The framework supports multiple communication protocols for agent-
   environment interaction, including direct API calls and standardized
   protocols such as MCP.  When using MCP, network operations are
   encapsulated as tools that can be discovered and invoked by the LLM
   agent through the MCP client-server architecture.


3.1.  Components

   NetConfBench consists of four key components:

   1.  *Task Dataset*
       A repository of forty configuration tasks, each defined as a JSON
       object with:

       *  *Intent*: One or more natural language instructions.

       *  *Topology*: A list of node names and link definitions.

       *  *Initial Configuration*: The initial configuration state of
          all nodes.

       *  *Ground Truth Configuration*: Expert-validated CLI commands
          that achieve the intent.

       *  *Ground Truth Reasoning*: A textual record of the agent's
          step-by-step reasoning that maps high-level intent to low-
          level configuration actions.

       *  *Testcases*: A set of verification procedures (e.g., _show_,
          _ping_, _ACL_ checks) that confirm functional intent
          satisfaction.

   2.  *Emulator Environment*
       Built on GNS3, this component launches official vendor images for
       routers and switches, replicating realistic CLI behavior.  Key
       interfaces include:

       *  *Agent-Network Interface (ANI)*: Based on the key stages
          commonly involved in intent-driven network configuration, the
          framework provides an Agent-Network Interface to facilitate
          structured interactions between the LLM agent and the emulated
          network environment.  This interface supports four core
          actions: get-topology, get-running-cfg, update-cfg, and
          execute_validation.

           -  get-topology: provides the network topology (nodes and
              links) in a format interpretable by the LLM.

          -  get-running-cfg: enables the agent to obtain the active
             configurations of specified devices, providing essential
             context for planning subsequent updates.


          -  update-cfg: allows the agent to apply new configuration
             commands and provides detailed feedback on their execution,
             including whether each command was accepted or resulted in
             any errors.

          -  execute_validation: accepts a device name and a command
             string as parameters and returns the resulting output.

       *  *Task Evaluation Interface*: To enable reliable and objective
          assessment of the LLM agent's configuration behavior, the
          environment provides a Task Evaluation Interface that allows
          the evaluation module to access relevant execution results.
          Specifically, this interface supports:

          -  *Exporting the final configurations of all devices*: This
             allows for direct comparison with ground truth
             configurations to evaluate the correctness and completeness
             of the agent's output.

          -  *Executing a set of predefined testcases*: These testcases
             are designed to verify whether the resulting network
             behavior accurately reflects the intended configuration
             objectives, as defined by the network intent.

   3.  *LLM Agent*
       A modular component that can be implemented with any LLM (open-
       source or closed-source).  It interacts with the emulator via the
       *Agent-Network Interface* (ANI), issuing queries such as get-
       topology, get-running-cfg, update-cfg, and execute_validation.
       Agents may use:

       *  *Single-Turn Generation*: The entire reasoning and command
          generation in one pass.

       *  *ReAct-Style Multi-Turn Interaction*: Interleaved reasoning
          and actions, with runtime feedback guiding subsequent steps.

       *  *External Knowledge Retrieval*: (Optional) Queries to a
          command manual to resolve vendor-specific syntax.

   4.  *Evaluator*
       Computes three core metrics for each task:

       *  *Reasoning Score (S_reasoning)*

          The reasoning score evaluates whether the agent can coherently
          map network intents to concrete configuration actions through
          semantically aligned reasoning.  This score compares the


          agent's reasoning process with a predefined ground truth
          reasoning process, focusing on logical consistency and
          semantic similarity.

          For one-shot prediction, prompts are designed to elicit the
          reasoning process prior to command generation, enabling direct
          comparison.  For multi-turn interaction, an auxiliary LLM
          summarizes the interleaved steps into a unified reasoning
          process, which is then compared against the ground truth.  The
          reasoning score is computed using cosine similarity:

          S_reasoning = (r_agent * r_gt) / (||r_agent|| * ||r_gt||)

          where r_agent is the embedding of the agent's reasoning
          process, and r_gt is the embedding of the ground truth
          reasoning process.

       *  *Command Score (S_command)*

          This evaluation comprehensively assesses the effectiveness of
          configuration commands generated by the agent.  While
          syntactic correctness is a prerequisite, it does not ensure
          that configuration commands are correctly applied to the
          device, particularly when commands must be issued within
          specific configuration contexts.

          After the agent completes its configuration task, the final
          configurations of all devices are exported and compared to
          their initial configurations to extract the set of commands
          that were actually applied.  Hierarchical parsing using the
          Python library ciscoconfparse ensures structural completeness
          during comparison.  Since certain configuration parameters
          (e.g., ACL numbers, route policy names) are manually defined
          and do not have fixed values, wildcard-based fuzzy matching is
          introduced to ignore non-essential differences and focus on
          semantic equivalence.

           Based on the extracted command sets, standard precision and
           recall are computed:

           -  Precision measures the proportion of correctly generated
              commands among all generated commands.

           -  Recall measures the proportion of correctly generated
              commands relative to the ground truth command set.

          The command score is reported as the harmonic mean of
          precision and recall:

          S_command = (2 * Precision * Recall) / (Precision + Recall)


       *  *Testcase Score (S_testcase)*

          While command-level evaluation based on configuration
          differences can effectively measure the semantic correctness
          of generated commands, it does not fully reflect whether the
          configuration actually achieves the intended network
          behaviors.  To address this limitation, a testcase-driven
          evaluation strategy is introduced that directly verifies the
          functional correctness of the agent's configuration in the
          target environment.

          A set of validation testcases is defined for each task, where
          each testcase encodes a network intent in the form of
          executable verification commands.  To support complex tasks
          involving multiple sub-goals, the overall intent is decomposed
          into sub-intents based on node-specific configuration
          objectives.  Each sub-intent is then formulated as an
          individual testcase to enable fine-grained evaluation and
          enhance interpretability.

           Examples of testcases include:

           -  *Routing intent*: Verifying the next hop selection on
              intermediate routers to confirm end-to-end path
              correctness.

           -  *ACL intent*: Simulating traffic flows and validating
              whether they are allowed or denied as expected.

           -  *QoS intent*: Inspecting interface statistics to check
              whether QoS policies are properly enforced.

          The testcase score is defined as the proportion of passed
          testcases among all defined testcases:

          S_testcase = |Passed Testcases| / |Total Testcases|

          This score reflects the agent's ability to produce
          configurations that meet functional requirements and
          demonstrates practical applicability in real-world deployment
          scenarios.

3.2.  Workflow

   The evaluation workflow for each task proceeds through six stages:

   1.  *Task Assignment*
       NetConfBench selects a task from the JSON dataset and provides
       only the high-level intent(s) to the LLM agent.


   2.  *Environment Setup*
       The framework instantiates a GNS3 topology based on the task's
       topology and applies the startup-config to each device.  Once the
       emulated network reaches a stable state, control transfers to the
       agent.

   3.  *Interactive Execution*
       The LLM agent receives the partial prompt containing:

       *  The API specification for get-topology, get-running-cfg,
          update-cfg, and execute_validation.

       *  The natural language intent.

       *  (Optionally) Device model/version hints.

        The agent then issues a sequence of API calls.  A single-turn
        agent outputs its reasoning followed by a batch of CLI
        commands, while a multi-turn agent alternates reasoning traces
        and API calls.  When using MCP, network operations are
        encapsulated as tools that can be discovered and invoked by
        the LLM agent through the MCP client-server architecture.

   4.  *Reasoning Trajectory Export*
       After execution completes (agent signals "task done" or after a
       predefined command budget), NetConfBench captures the entire
       reasoning log:

       *  For single-turn: the reasoning paragraph embedded in the LLM's
          output.

       *  For ReAct: an auxiliary summarization LLM condenses the
          interleaved reasoning and actions into a single coherent
          trace.

   5.  *Final Configuration Export*
       The framework uses the Task Evaluation Interface to extract the
       final running configs from each device.

   6.  *Testcase Execution and Scoring*

       *  *Command Score:* Hierarchical diff against ground truth
          commands.

       *  *Testcase Score:* Execute each testcase in sequence; record
          pass/fail.


       *  *Reasoning Score:* Compute embedding similarity between the
          agent's reasoning trace and ground truth reasoning.

   The final per-task score is typically reported as a tuple
   (S_reasoning, S_command, S_testcase).  Aggregate results across the
   forty tasks enable comparisons among LLMs and interaction strategies.
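
   How the per-task tuples are aggregated is not mandated by this
   document; a simple unweighted mean per metric, as sketched below, is
   one possible convention.

```python
def aggregate(task_scores):
    """Unweighted mean of each metric across the per-task tuples
    (S_reasoning, S_command, S_testcase)."""
    n = len(task_scores)
    return tuple(sum(t[i] for t in task_scores) / n for i in range(3))
```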

4.  Data Model

   This section specifies the JSON schemas and interface conventions
   used to represent tasks and to enable structured interaction between
   the LLM agent and the emulated environment.

4.1.  Task Definition Schema

   Each configuration task is defined as a JSON object with the
   following structure:


   {
     "task_name": "Static Routing",
     "intents": [
       "NewYork: create a static route pointing to the Loopback0 on
       Washington, traffic should pass the 192.168.1.0 network.",
       "NewYork: create a backup static route pointing to the Loopback0
       on Washington, administrative distance should be 100."
       ...
     ],
     "topology": {
       "nodes": ["NewYork", "Washington"],
       "links": [
         "NewYork S0/0 <-> Washington S0/0 ",
         "NewYork S0/1 <-> Washington S0/1"
       ]
     },
     "startup_configs": {
       "NewYork": "!\r\nversion 12.4\r\nservice timestamps
       debug datetime msec\r\n...",
       "Washington": "!\r\nversion 12.4\r\nservice timestamps
       debug datetime msec\r\n..."
     },
     "ground_truth_configs": {
       "NewYork": [
         "ip route 2.2.2.0 255.255.255.252 192.168.1.2",
         "ip route 2.2.2.0 255.255.255.252 192.168.2.2 100"
       ],
       ...
     },
     "ground_truth_reasoning": "NewYork to Washington Loopback
     (primary path): add a static route for Washington's
     Loopback0 network (2.2.2.0/30) pointing to the
     next-hop 192.168.1.2...",
     "testcases": [
       {
         "name": "Static Route from NewYork to Washington",
         "expected_result": {
           "protocol": "static",
           "next_hop": "192.168.1.2"
         }
       },
       ...
     ]
   }
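
   A benchmark harness might validate a task object against this schema
   before instantiating the environment.  The following sketch checks
   only the required top-level fields and link/node consistency; the
   helper name and error handling are illustrative, not specified by
   this document.

```python
import json

REQUIRED_FIELDS = (
    "task_name", "intents", "topology", "startup_configs",
    "ground_truth_configs", "ground_truth_reasoning", "testcases",
)

def load_task(text):
    """Parse a task definition, checking the required top-level fields
    and that every link endpoint names a declared node."""
    task = json.loads(text)
    missing = [f for f in REQUIRED_FIELDS if f not in task]
    if missing:
        raise ValueError("task missing fields: %s" % missing)
    nodes = set(task["topology"]["nodes"])
    for link in task["topology"]["links"]:
        # Links use the "NodeA Intf <-> NodeB Intf" convention.
        ends = {side.split()[0] for side in link.split("<->")}
        if not ends <= nodes:
            raise ValueError("link references unknown node: %s" % link)
    return task
```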


4.2.  Agent-Network Interface (ANI)

   The Agent-Network Interface defines the minimal API primitives
   necessary for intent-driven configuration.  Each primitive uses JSON-
   RPC style request/response with the following methods:

   1.  *get-topology*

       *  *Request*:

          {
            "method": "get-topology",
            "params": {
              "devices": ["R1", "R2", ...]
            }
          }

       *  *Response*:

          {
            "topology": {
              "nodes": [...],
              "links": [...]
            }
          }

       *  *Description*: Returns the full topology for the specified
          subset of devices.  If "devices" is empty or omitted, returns
          the entire topology.

   2.  *get-running-cfg*

       *  *Request*:

          {
            "method": "get-running-cfg",
            "params": {
              "device": "R1"
            }
          }

       *  *Response*:


          {
            "running_config": "
             interface Gig0/0
              ip address 192.168.1.1 255.255.255.0
             ...
            "
          }

       *  *Description*: Retrieves the active (running) configuration of
          the specified device.

   3.  *update-cfg*

       *  *Request*:

          {
            "method": "update-cfg",
            "params": {
              "device": "R1",
              "commands": [
                "configure terminal",
                "ip route 2.2.2.0 255.255.255.252 192.168.1.2"
              ]
            }
          }

       *  *Response*:

          {
            "results": [
              { "command": "configure terminal", "status": "success" },
              {
              "command": "ip route 2.2.2.0 255.255.255.252 192.168.1.2",
              "status": "success" }
            ]
          }

       *  *Description*: Applies a sequence of CLI commands to the
          specified device.  Returns per-command status and any error
          messages.

   4.  *execute_validation*

       *  *Request*:


          {
            "method": "execute_validation",
            "params": {
              "device": "R1",
              "command": "show ip route 2.2.2.0 255.255.255.252"
            }
          }

       *  *Response*:

          {
            "output": "S 2.2.2.0/30 [1/0] via 192.168.1.2"
          }

       *  *Description*: Executes a read-only command on the specified
          device and returns its output.  Must not alter device state.

4.3.  Task Evaluation Interface

   After the agent signals completion, the framework uses the Task
   Evaluation Interface to retrieve results:

   *  *export-final-cfg*

      -  *Request*:

         {
           "method": "export-final-cfg"
         }

      -  *Response*:

         {
           "configs": {
             "R1": "!\nversion 15.2\n...",
             "R2": "!\nversion 15.2\n..."
           }
         }

      -  *Description*: Returns the final running-configuration of each
         device.

   *  *run-testcases*

      -  *Request*:


         {
           "method": "run-testcases",
           "params": {
             "testcases": [
               {
                 "device": "R1",
                 "commands": ["show ip route 2.2.2.0 255.255.255.252"],
                 "expected_output": "S 2.2.2.0/30 [1/0] via 192.168.1.2"
               },
               ...
             ]
           }
         }

      -  *Response*:

         {
           "results": [
             {
               "name": "Verify primary static route on R1",
               "status": "pass"
             },
             {
               "name": "Verify backup static route on R1",
               "status": "fail"
             }
           ]
         }

      -  *Description*: Executes each verification command sequence on
         the appropriate device and compares actual output against
         expected_output (regular expression).  Returns pass/fail for
         each testcase.
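
   Testcase evaluation as described above (matching actual output
   against expected_output interpreted as a regular expression) can be
   sketched as follows; the execute callable standing in for the
   environment is an assumption of this example.

```python
import re

def run_testcases(testcases, execute):
    """Run each testcase's commands on its device via the assumed
    execute(device, command) callable and match the combined output
    against the expected_output regular expression."""
    results = []
    for tc in testcases:
        output = "\n".join(execute(tc["device"], cmd)
                           for cmd in tc["commands"])
        status = "pass" if re.search(tc["expected_output"], output) else "fail"
        results.append({"name": tc.get("name", ""), "status": status})
    return results
```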

5.  MCP-Based Implementation

   The Model Context Protocol (MCP) provides a standardized approach for
   implementing the Agent-Network Interface (ANI).  This section
   describes how MCP can be applied to NetConfBench for LLM-driven
   network configuration evaluation.

5.1.  Benefits of MCP Integration

   Integrating MCP into NetConfBench provides several advantages:

   1.  *Standardization*: MCP provides a uniform interface for tool
       invocation across different LLM implementations and network
       environments.


   2.  *Vendor Abstraction*: The MCP server can handle vendor-specific
       command translation, allowing the LLM to work with high-level
       operations without needing detailed knowledge of each vendor's
       CLI syntax.

   3.  *Tool Extensibility*: New network operations can be easily added
       as MCP tools without modifying the LLM agent implementation.

   4.  *Traceability*: The structured MCP communication protocol enables
       detailed logging of all tool invocations and results,
       facilitating debugging and analysis.

   5.  *Ecosystem Integration*: MCP-enabled network tools can
       potentially be reused across different AI applications beyond
       network configuration evaluation.
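
   The vendor-abstraction benefit above can be illustrated with a
   minimal translation table inside the MCP server: the agent issues
   one high-level operation and the server emits the matching CLI for
   each platform.  The operation name and platform keys below are
   illustrative assumptions; the CLI strings are the standard commands
   for each vendor.

```python
# Sketch of vendor-specific command translation in an MCP server.
# The agent only ever sees the high-level operation name; the server
# maps it to the concrete CLI syntax of each platform.
VENDOR_COMMANDS = {
    "show_running_config": {
        "cisco_ios": "show running-config",
        "huawei_vrp": "display current-configuration",
        "juniper_junos": "show configuration",
    },
}

def translate(operation, vendor):
    try:
        return VENDOR_COMMANDS[operation][vendor]
    except KeyError:
        raise ValueError(f"unsupported operation/vendor: {operation}/{vendor}")

print(translate("show_running_config", "huawei_vrp"))
# prints: display current-configuration
```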

5.2.  MCP Tool Definitions for ANI Operations

   This subsection provides the complete MCP tool definitions for the
   four core Agent-Network Interface operations: get_topology,
   get_running_config, update_config, and execute_validation.  These
   definitions use JSON Schema to specify tool parameters and enable
   LLMs to understand and invoke network operations through the MCP
   protocol.

5.2.1.  1. get_topology

   This tool provides network topology information in a format
   interpretable by the LLM, returning topology for specified devices or
   the entire network if no devices are specified.

   {
     "name": "get_topology",
     "description": "Retrieve network topology information including
      nodes and their interconnections. Returns topology for
      specified devices or entire network if no devices specified.",
     "inputSchema": {
       "type": "object",
       "properties": {
         "devices": {
           "type": "array",
           "items": {
             "type": "string"
           },
           "description": "List of device names. Leave empty for
           entire network topology."
         }
       }
     }
   }

   *Usage Example*:

   {
     "method": "tools/call",
     "params": {
       "name": "get_topology",
       "arguments": {
         "devices": ["R1", "R2", "R3"]
       }
     }
   }

5.2.2.  2. get_running_config

   This tool enables the agent to obtain the active configurations of
   specified devices, providing essential context for planning
   subsequent updates.

   {
     "name": "get_running_config",
     "description": "Retrieve the active running configuration
     from a network device. Returns the complete configuration
     as a text string.",
     "inputSchema": {
       "type": "object",
       "properties": {
         "device": {
           "type": "string",
           "description": "Device name or identifier to retrieve
            configuration from"
         }
       },
       "required": ["device"]
     }
   }

   *Usage Example*:

   {
     "method": "tools/call",
     "params": {
       "name": "get_running_config",
       "arguments": {
         "device": "R1"
       }
     }
   }

5.2.3.  3. update_config

   This tool allows the agent to apply new configuration commands and
   provides detailed feedback on their execution, including whether each
   command was accepted or resulted in any errors.

   {
     "name": "update_config",
     "description": "Apply configuration commands to a network
      device. Executes a sequence of CLI commands and returns
       detailed status for each command.",
     "inputSchema": {
       "type": "object",
       "properties": {
         "device": {
           "type": "string",
           "description": "Device name or identifier to apply
            configuration to"
         },
         "commands": {
           "type": "array",
           "items": {
             "type": "string"
           },
           "description": "Ordered list of CLI commands to
            execute on the device"
         }
       },
       "required": ["device", "commands"]
      }
    }

   *Usage Example*:

   {
     "method": "tools/call",
     "params": {
       "name": "update_config",
       "arguments": {
         "device": "R1",
         "commands": [
           "configure terminal",
           "interface GigabitEthernet0/0",
           "ip address 192.168.1.1 255.255.255.0",
           "no shutdown"
         ]
       }
     }
   }

5.2.4.  4. execute_validation

   This tool accepts a device name and a read-only command string as
   parameters and returns the resulting output.  It must not alter the
   device state and is intended for validation and status inspection.

   {
     "name": "execute_validation",
     "description": "Execute a read-only validation command
      on a network device to verify configuration or check
       device status. This command must not alter the
       device state.",
     "inputSchema": {
       "type": "object",
       "properties": {
         "device": {
           "type": "string",
           "description": "Device name or identifier to
           execute command on"
         },
         "command": {
           "type": "string",
           "description": "Read-only command to execute
            (e.g., show commands)"
         }
       },
       "required": ["device", "command"]
     }
   }

   *Usage Example*:

   {
     "method": "tools/call",
     "params": {
       "name": "execute_validation",
       "arguments": {
         "device": "R1",
         "command": "show ip route 2.2.2.0 255.255.255.252"
       }
     }
   }

   These four tools form the core MCP interface for NetConfBench.  The
   MCP server must register these tools and handle the translation
   between MCP tool invocations and actual device communication
   protocols (CLI, NETCONF, RESTCONF, etc.).  The JSON Schema
   definitions in inputSchema enable LLMs to automatically understand
   parameter requirements and generate valid tool calls.
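
   The registration and dispatch responsibility described above can be
   sketched with a minimal tool registry.  This is a simplified stand-
   in for a real MCP server, not an implementation of the MCP wire
   protocol: the tool names match Section 5.2, while the handler
   bodies are stubs.

```python
# Minimal sketch of MCP-style tool registration and dispatch. A real
# MCP server would also speak the MCP transport; here we only show the
# four ANI tools being registered and a tools/call request routed to
# the matching handler. All handler bodies are stubs.
TOOLS = {}

def tool(name):
    def register(fn):
        TOOLS[name] = fn
        return fn
    return register

@tool("get_topology")
def get_topology(devices=None):
    return {"nodes": devices or [], "links": []}

@tool("get_running_config")
def get_running_config(device):
    return f"! running-config of {device} (stub)"

@tool("update_config")
def update_config(device, commands):
    return [{"command": c, "status": "accepted"} for c in commands]

@tool("execute_validation")
def execute_validation(device, command):
    return f"(stub output of '{command}' on {device})"

def handle_tools_call(request):
    """Route a tools/call request to the registered handler."""
    params = request["params"]
    handler = TOOLS[params["name"]]
    return handler(**params.get("arguments", {}))

print(handle_tools_call({
    "method": "tools/call",
    "params": {"name": "get_running_config", "arguments": {"device": "R1"}},
}))
```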

5.3.  Additional MCP Tools for Advanced Scenarios

   Beyond the four core ANI operations, additional MCP tools can be
   defined for more complex scenarios.  The following examples
   demonstrate extended tool definitions:

5.3.1.  batch_configure_devices

   For batch operations across multiple devices:

   {
     "name": "batch_configure_devices",
     "description": "Apply configuration commands to
      multiple network devices simultaneously",
     "inputSchema": {
       "type": "object",
       "properties": {
         "device_ips": {
           "type": "array",
           "items": {"type": "string"},
           "description": "List of device IP addresses"
         },
         "commands": {
           "type": "array",
           "items": {"type": "string"},
           "description": "CLI command sequence to execute"
         },
         "credential_id": {
           "type": "string",
           "description": "Authentication credential
            identifier"
         }
       },
       "required": ["device_ips", "commands"]
     }
   }
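
   *Usage Example* (hypothetical invocation, following the same
   "tools/call" pattern as the core tools; the addresses use the RFC
   5737 documentation range and the credential identifier is
   illustrative):

   {
     "method": "tools/call",
     "params": {
       "name": "batch_configure_devices",
       "arguments": {
         "device_ips": ["192.0.2.1", "192.0.2.2"],
         "commands": [
           "configure terminal",
           "ntp server 192.0.2.10"
         ],
         "credential_id": "lab-credential-01"
       }
     }
   }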

5.3.2.  check_device_status

   For comprehensive device health monitoring:

   {
     "name": "check_device_status",
     "description": "Check operational status of network
      devices including CPU, memory, and interface metrics",
     "inputSchema": {
       "type": "object",
       "properties": {
         "device_ip": {
           "type": "string",
           "description": "Device IP address to check"
         },
         "metrics": {
           "type": "array",
           "items": {
             "enum": ["cpu", "memory", "interface"]
           },
           "description": "List of metrics to retrieve"
         }
       },
       "required": ["device_ip", "metrics"]
     }
   }
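
   *Usage Example* (hypothetical invocation; the address uses the RFC
   5737 documentation range):

   {
     "method": "tools/call",
     "params": {
       "name": "check_device_status",
       "arguments": {
         "device_ip": "192.0.2.1",
         "metrics": ["cpu", "interface"]
       }
     }
   }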

   These additional tools demonstrate the extensibility of the MCP
   approach, allowing the framework to support advanced scenarios such
   as batch operations and comprehensive device monitoring.

6.  Security Considerations

   LLM-driven network configuration introduces risks such as unintended
   or malicious commands, emulator vulnerabilities, and data exposure.
   To mitigate these risks, NetConfBench should enforce strict input
   validation (e.g., YANG/XML schema checks), run emulated devices in
   isolated sandboxes with limited privileges, encrypt and restrict
   access to task definitions and logs, and use curated prompt
   templates and fine-tuning to reduce LLM hallucinations.  Validation
   endpoints (execute_validation) must enforce read-only execution to
   prevent unintended state changes.  Where appropriate, human-in-the-
   loop approval should gate privileged write operations
   (update_config) identified as high-impact.
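
   The read-only requirement on execute_validation can be enforced
   with a simple command allowlist.  The prefix list below is an
   illustrative assumption; a production guard would need vendor-aware
   parsing rather than first-token matching.

```python
# Sketch of a read-only guard for execute_validation: a command is
# accepted only if its first token is a known non-mutating verb.
# The allowlist is illustrative; real deployments would also need to
# handle vendor-specific syntax, pipes, and abbreviations.
READ_ONLY_PREFIXES = ("show", "display", "ping", "traceroute")

def is_read_only(command):
    tokens = command.strip().lower().split()
    return bool(tokens) and tokens[0] in READ_ONLY_PREFIXES

def execute_validation(device, command, run):
    """Refuse any command that is not on the read-only allowlist."""
    if not is_read_only(command):
        raise PermissionError(f"refusing non-read-only command: {command!r}")
    return run(device, command)

print(is_read_only("show ip route 2.2.2.0 255.255.255.252"))  # True
print(is_read_only("configure terminal"))                      # False
```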

7.  IANA Considerations

   This document has no IANA actions.

8.  References

8.1.  Normative References

   [RFC6241]  Enns, R., Ed., Bjorklund, M., Ed., Schoenwaelder, J., Ed.,
              and A. Bierman, Ed., "Network Configuration Protocol
              (NETCONF)", RFC 6241, DOI 10.17487/RFC6241, June 2011,
              <https://www.rfc-editor.org/rfc/rfc6241>.

   [RFC7950]  Bjorklund, M., Ed., "The YANG 1.1 Data Modeling Language",
              RFC 7950, DOI 10.17487/RFC7950, August 2016,
              <https://www.rfc-editor.org/rfc/rfc7950>.

8.2.  Informative References

   [A2023]    Red Hat, "Ansible", 2023.

   [Fuad2024] Fuad, A., Ahmed, A. H., Riegler, M. A., and T. Cicic, "An
              intent-based networks framework based on large language
              models", 2024.

   [Kreutz2014]
              Kreutz, D., Ramos, F. M. V., Verissimo, P. E., Rothenberg,
              C. E., Azodolmolky, S., and S. Uhlig, "Software-defined
              networking: A comprehensive survey", 2014.

   [Lira2024] Lira, O. G., Caicedo, O. M., and N. L. S. da Fonseca,
              "Large language models for zero touch network
              configuration management", 2024.

   [Liu2024]  Liu, C., Xie, X., Zhang, X., and Y. Cui, "Large language
              models for networking: Workflow, advances and challenges",
              2024.

   [Long2025] Long, S., Tan, J., Mao, B., Tang, F., Li, Y., Zhao, M.,
              and N. Kato, "A Survey on Intelligent Network Operations
              and Performance Optimization Based on Large Language
              Models", 2025.

   [Wang2024NetConfEval]
              Wang, C., Scazzariello, M., Farshin, A., Ferlin, S.,
              Kostic, D., and M. Chiesa, "NetConfEval: Can LLMs
              facilitate network configuration?", 2024.

Acknowledgments

   TODO acknowledge.

Authors' Addresses

   Yong Cui
   Tsinghua University
   Beijing, 100084
   China
   Email: cuiyong@tsinghua.edu.cn
   URI:   http://www.cuiyong.net/

   Chang Liu
   Tsinghua University
   Beijing, 100084
   China
   Email: liuchang23@mails.tsinghua.edu.cn

   Xiaohui Xie
   Tsinghua University
   Beijing, 100084
   China
   Email: xiexiaohui@tsinghua.edu.cn

   Chenguang Du
   Zhongguancun Laboratory
   Beijing, 100094
   China
   Email: ducg@zgclab.edu.cn
