NMOP                                                             T. Graf
Internet-Draft                                                     W. Du
Intended status: Informational                                  Swisscom
Expires: 5 January 2026                                      P. Francois
                                                           A. Huang Feng
                                                               INSA-Lyon
                                                             4 July 2025

        A Framework for a Network Anomaly Detection Architecture
            draft-ietf-nmop-network-anomaly-architecture-04

Abstract

   This document describes the motivation and architecture of a Network
   Anomaly Detection Framework and the relationship to other documents
   describing network Symptom semantics and network incident lifecycle.

   The described architecture for detecting IP network service
   interruptions is designed to be generically applicable and
   extensible.  Different applications are described, and examples with
   open-source running code are referenced.

Discussion Venues

   This note is to be removed before publishing as an RFC.

   Discussion of this document takes place on the Network Management
   Operations Working Group mailing list (nmop@ietf.org), which is
   archived at https://mailarchive.ietf.org/arch/browse/nmop/.

   Source for this draft and an issue tracker can be found at
   https://github.com/ietf-wg-nmop/draft-ietf-nmop-network-anomaly-
   architecture/ .

Status of This Memo

   This Internet-Draft is submitted in full conformance with the
   provisions of BCP 78 and BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF).  Note that other groups may also distribute
   working documents as Internet-Drafts.  The list of current Internet-
   Drafts is at https://datatracker.ietf.org/drafts/current/.

Graf, et al.             Expires 5 January 2026                 [Page 1]
Internet-Draft     Network Anomaly Detection Framework         July 2025

   Internet-Drafts are draft documents valid for a maximum of six months
   and may be updated, replaced, or obsoleted by other documents at any
   time.  It is inappropriate to use Internet-Drafts as reference
   material or to cite them other than as "work in progress."

   This Internet-Draft will expire on 5 January 2026.

Copyright Notice

   Copyright (c) 2025 IETF Trust and the persons identified as the
   document authors.  All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents (https://trustee.ietf.org/
   license-info) in effect on the date of publication of this document.
   Please review these documents carefully, as they describe your rights
   and restrictions with respect to this document.  Code Components
   extracted from this document must include Revised BSD License text as
   described in Section 4.e of the Trust Legal Provisions and are
   provided without warranty as described in the Revised BSD License.

Table of Contents

   1.  Introduction  . . . . . . . . . . . . . . . . . . . . . . . .   3
     1.1.  Motivation  . . . . . . . . . . . . . . . . . . . . . . .   3
     1.2.  Scope . . . . . . . . . . . . . . . . . . . . . . . . . .   4
   2.  Conventions and Definitions . . . . . . . . . . . . . . . . .   4
     2.1.  Terminology . . . . . . . . . . . . . . . . . . . . . . .   4
     2.2.  Outlier Detection . . . . . . . . . . . . . . . . . . . .   6
     2.3.  Knowledge Based Detection . . . . . . . . . . . . . . . .   6
     2.4.  Data Mesh . . . . . . . . . . . . . . . . . . . . . . . .   7
   3.  Elements of the Architecture  . . . . . . . . . . . . . . . .   8
     3.1.  Service Inventory . . . . . . . . . . . . . . . . . . . .  10
     3.2.  SDD Configuration . . . . . . . . . . . . . . . . . . . .  10
     3.3.  Operational Data Collection . . . . . . . . . . . . . . .  10
     3.4.  Operational Data Aggregation  . . . . . . . . . . . . . .  10
     3.5.  Service Disruption Detection  . . . . . . . . . . . . . .  11
     3.6.  Alarm . . . . . . . . . . . . . . . . . . . . . . . . . .  13
     3.7.  Postmortem  . . . . . . . . . . . . . . . . . . . . . . .  13
     3.8.  Replaying . . . . . . . . . . . . . . . . . . . . . . . .  14
   4.  Implementation Status . . . . . . . . . . . . . . . . . . . .  14
     4.1.  Cosmos Bright Lights  . . . . . . . . . . . . . . . . . .  15
   5.  Security Considerations . . . . . . . . . . . . . . . . . . .  15
   6.  Contributors  . . . . . . . . . . . . . . . . . . . . . . . .  15
   7.  Acknowledgements  . . . . . . . . . . . . . . . . . . . . . .  15
   8.  References  . . . . . . . . . . . . . . . . . . . . . . . . .  15
     8.1.  Normative References  . . . . . . . . . . . . . . . . . .  15
     8.2.  Informative References  . . . . . . . . . . . . . . . . .  17

   Authors' Addresses  . . . . . . . . . . . . . . . . . . . . . . .  18

1.  Introduction

   Today's highly virtualized, large-scale IP networks are a challenge
   for network operators to monitor due to their vast number of
   dependencies.  Humans are no longer capable of manually verifying
   all the dependencies end to end in a timely manner.

   IP networks are the backbone of today's society.  We individually
   depend on networks fulfilling the purpose of forwarding IP packets
   from a point A to a point B at any time of the day.  Even a short
   loss of such connectivity today has manifold implications, ranging
   from minor to severe.  An interruption can lead to being unable to
   browse the web, watch a soccer game, or access the company intranet
   or, even in life-threatening situations, no longer being able to
   reach emergency services.  Further, congestion in the network
   leading to delayed packet forwarding can have severe repercussions
   on real-time applications.

   Networks are generally deterministic.  The usage of networks,
   however, is only somewhat deterministic.  Humans, as a large group
   of people, are somewhat predictable: there are time-of-day patterns
   in terms of when we eat, sleep, work, or enjoy leisure, and these
   patterns potentially change depending on age, profession, and
   cultural background.

1.1.  Motivation

   When operational or configurational changes in connectivity services
   are happening, it is crucial for network operators to detect
   interruptions within the network faster than the users utilizing the
   connectivity services.

   In order to achieve this objective, automation in network monitoring
   is required.  The people operating the network are today simply
   outnumbered by the people utilizing its connectivity services.

   This automation needs to monitor network changes holistically by
   supervising all three network planes simultaneously for a given
   connectivity service.  The monitoring system needs to detect whether
   configurational or operational State changes (for example, an
   interface being shut down by an operator versus an interface State
   going down due to loss of signal on the optical layer) are service
   disruptive, e.g., packets received from customers are no longer
   forwarded to the desired destination.  A State change in the control
   plane and management plane together indicates a network topology
   State change, while a State change in the forwarding plane describes
   how the packets are being forwarded.  In other words, control and
   management plane State changes can be attributed to network topology
   State changes, whereas forwarding plane State changes are related to
   the outcome of these network topology State changes.

   Since changes in networks are happening all the time due to the vast
   number of dependencies, a scoring system is needed to indicate
   whether a change is considered disruptive.  The scoring system needs
   to take into account the number of transport sessions, the number of
   affected flows, and whether the detected interruptions are usual or
   exceptional.
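   As an illustrative sketch only (the weights, names, and inputs are
   hypothetical and not defined by this document), such a score could
   blend the share of affected transport sessions, the share of
   affected flows, and the rarity of the interruption:

```python
def disruption_score(affected_sessions: int, total_sessions: int,
                     affected_flows: int, total_flows: int,
                     rarity: float) -> float:
    """Hypothetical scoring: blend the share of affected transport
    sessions, the share of affected flows, and how unusual the
    detected interruption is (rarity in [0, 1])."""
    session_share = affected_sessions / total_sessions if total_sessions else 0.0
    flow_share = affected_flows / total_flows if total_flows else 0.0
    # Equal weighting is an arbitrary choice for illustration only.
    return round((session_share + flow_share + rarity) / 3, 3)

print(disruption_score(80, 100, 400, 1000, 0.9))  # 0.7
```

   An implementation would tune the weighting and the rarity estimate
   against operational experience.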

1.2.  Scope

   Such objectives can be achieved by applying checks on network-
   modeled time series data that contains semantics describing its
   dependencies across network planes.  These checks can be based on
   domain knowledge or on outlier detection techniques.  Domain-
   knowledge-based techniques apply the expertise of network engineers
   operating a network to understand whether or not there is an issue
   impacting the customer.  Outlier detection techniques, on the other
   hand, identify measurements that deviate significantly from the norm
   and are therefore considered anomalous.

   The described scope does not take the connectivity service intent
   into account, nor does it verify whether the intent is being
   achieved at all times.  Changes to the service intent that cause
   service disruptions are therefore reported as service disruptions,
   whereas a monitoring system taking the intent into account would
   consider them intended.

2.  Conventions and Definitions

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
   "SHOULD", "SHOULD NOT", "RECOMMENDED", "NOT RECOMMENDED", "MAY", and
   "OPTIONAL" in this document are to be interpreted as described in BCP
   14 [RFC2119] [RFC8174] when, and only when, they appear in all
   capitals, as shown here.

2.1.  Terminology

   This document defines the following terms:

   Outlier Detection: A systematic approach to identify rare data
   points deviating significantly from the majority.

   Service Disruption Detection (SDD): The process of detecting a
   service degradation by discovering anomalies in network monitoring
   data.

   Service Disruption Detection System (SDDS): A system for performing
   SDD.

   Additionally, this document makes use of the terms defined in
   [I-D.ietf-nmop-terminology],
   [I-D.ietf-nmop-network-anomaly-lifecycle], and [RFC8969].

   The following terms are used as defined in
   [I-D.ietf-nmop-terminology] :

   *  Resource

   *  Event

   *  State

   *  Relevance

   *  Problem

   *  Symptom

   *  Alarm

   Figure 2 in Section 3 of [I-D.ietf-nmop-terminology] shows
   characteristics of observed operational network telemetry metrics.

   Figure 4 in Section 3 of [I-D.ietf-nmop-terminology] shows the
   relationships between state, relevant state, problem, symptom,
   cause, and alarm.

   Figure 5 in Section 3 of [I-D.ietf-nmop-terminology] shows
   relationships between problem, symptom and cause.

   The following terms are used as defined in
   [I-D.ietf-nmop-network-anomaly-lifecycle] :

   *  False Positive

   *  False Negative

   The following terms are used as defined in [RFC8969] :

   *  Service Model

2.2.  Outlier Detection

   Outlier Detection, also known as anomaly detection, describes a
   systematic approach to identify rare data points deviating
   significantly from the majority.  Outliers can manifest as a single
   data point or as a sequence of data points.  While there are
   multiple ways to classify anomalies, in the context of this document
   the following three classes are taken into account:

   Global outliers:  An outlier is considered "global" if its behavior
      is outside the entirety of the considered data set.  For example,
      if the average dropped packet count is between 0 and 10 per minute
      and, in a small time-window, the value gets to 1000, this data
      point is considered a global anomaly.

   Contextual outliers:  An outlier is considered "contextual" if its
      behavior is within a normal (expected) range, but it would not be
      expected based on some context.  Context can be defined as a
      function of multiple parameters, such as time, location, etc.  An
      example of a contextual outlier is when the forwarded packet
      volume overnight reaches levels which might be totally normal for
      the daytime, but anomalous and unexpected for the nighttime.

   Collective outliers:  An outlier is considered "collective" if each
      single data point that is part of the anomaly is within expected
      ranges (so it is not anomalous in either a contextual or a global
      sense), but the group of all data points taken together is.  Note
      that the group can be made within a single time series (a
      sequence of data points is anomalous) or across multiple metrics
      (e.g., if looking at two metrics together, the combined behavior
      turns out to be anomalous).  In Network Telemetry time series,
      one way this can manifest is that the number of network path and
      interface State changes matches the time range in which the
      forwarded packet volume decreases as a group.

   For each outlier, a score between 0 and 1 is calculated.  The higher
   the value, the higher the probability that the observed data point
   is an outlier.  "Anomaly Detection: A Survey" [VAP09] gives
   additional details on anomaly detection and its types.
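   As a sketch of how such a score can be produced for a global outlier
   (the logistic squashing function and the three-sigma center are
   illustrative assumptions, not part of this framework):

```python
import math
import statistics

def global_outlier_score(history: list[float], value: float) -> float:
    """Map the z-score of a value against the historical distribution
    into [0, 1] with a logistic curve centered at three sigma."""
    mean = statistics.fmean(history)
    stdev = statistics.pstdev(history) or 1.0  # guard against zero stdev
    z = abs(value - mean) / stdev
    return 1.0 / (1.0 + math.exp(-(z - 3.0)))

# Dropped packets per minute: 0-10 is the usual range, 1000 is global.
drops = [2.0, 4.0, 3.0, 5.0, 1.0, 3.0]
print(global_outlier_score(drops, 1000.0) > 0.99)  # True
print(global_outlier_score(drops, 3.0) < 0.1)      # True
```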

2.3.  Knowledge Based Detection

   Knowledge-based anomaly detection, a superset of rule-based anomaly
   detection, is a technique used to identify anomalies or outliers by
   comparing observations against predefined rules or patterns.  This
   approach relies on domain-specific knowledge to set standards,
   thresholds, or rules for what is considered "normal" behavior.
   Traditionally, these rules are established manually by a
   knowledgeable network engineer.  Looking forward, these rules can be
   expressed using human- and machine-readable, network protocol
   derived Symptoms and patterns defined in ontologies.
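   A minimal sketch of one such manually established rule (the field
   names and thresholds are hypothetical):

```python
# Hypothetical domain-knowledge rule: an interface that is
# administratively up but operationally down while its BGP session
# count is below the expected value is flagged as a Symptom candidate.
def interface_rule(sample: dict) -> bool:
    return (sample["admin_status"] == "up"
            and sample["oper_status"] == "down"
            and sample["bgp_sessions"] < sample["expected_bgp_sessions"])

sample = {"admin_status": "up", "oper_status": "down",
          "bgp_sessions": 0, "expected_bgp_sessions": 2}
print(interface_rule(sample))  # True
```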

   Additionally, in the context of network anomaly detection, the
   knowledge-based approach works hand in hand with the deterministic
   understanding of the network, which is reflected in network modeling.
   Components are organized into three network planes: the Management
   Plane, the Control Plane, and the Forwarding Plane [RFC9232].  A
   component can relate to a physical, virtual, or configurational
   entity, or to a sum of packets belonging to a flow being forwarded in
   a network.

   Such relationships can be modelled in a SIMAP to automate that
   process.  [I-D.ietf-nmop-simap-concept] defines the concepts for the
   SIMAP and [I-D.havel-nmop-digital-map] defines an application of the
   SIMAP to network topologies.

   These relationships can also be modeled in a Knowledge Graph
   Section 5 of [I-D.mackey-nmop-kg-for-netops] where ontologies can be
   used to augment the relationships among different network elements in
   the network model.

2.4.  Data Mesh

   The Data Mesh [Deh22] architecture distinguishes between operational
   and analytical data.  Operational data refers to data collected from
   operational systems, while analytical data refers to insights gained
   from operational data.

2.4.1.  Operational Network Data

   In terms of network observability, semantics of operational network
   metrics are defined by IETF and are categorized as described in the
   Network Telemetry Framework [RFC9232] in the following three
   different network planes:

   Management Plane:  Time series data describing the State changes and
      statistics of a network node and its Resources.  For example,
      Interface State and statistics modeled in ietf-interfaces.yang
      [RFC8343].

   Control Plane:  Time series data describing the State and State
      changes of network reachability.  For example, BGP VPNv6 unicast
      updates and withdrawals exported in BGP Monitoring Protocol (BMP)
      [RFC7854] and modeled in BGP [RFC4364].

   Forwarding Plane:  Time series data describing the forwarding
      behavior of packets and their data-plane context.  For example,
      dropped packet count modeled in the IPFIX entities
      forwardingStatus(IE89) [RFC7270] and packetDeltaCount(IE2)
      [RFC5102] and exported with IPFIX [RFC7011].

2.4.2.  Analytical Observed Symptoms

   The Service Disruption Detection process takes operational network
   data as input and generates analytical metrics describing Symptoms
   and outlier patterns of the connectivity service disruption.

   The observed Symptoms are categorized into a semantic triple
   [W3C-RDF-concept-triples]: action, reason, trigger.  The object is
   the action, describing the change in the network.  The predicate is
   the reason, defining why this change occurred, and the subject is
   the trigger, which defines what triggered that change.
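   For illustration, such a triple could be represented as follows (the
   literals are hypothetical):

```python
# Hypothetical Symptom as a semantic triple: the subject is the
# trigger, the predicate is the reason, and the object is the action.
symptom = {
    "trigger": "interface-down",       # subject: what triggered the change
    "reason": "loss-of-signal",        # predicate: why the change occurred
    "action": "bgp-session-teardown",  # object: the change in the network
}
print(f'{symptom["trigger"]} --[{symptom["reason"]}]--> {symptom["action"]}')
```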

   Symptom definitions are described in Section 3 of
   [I-D.ietf-nmop-network-anomaly-semantics] and outlier pattern
   semantics in Section 4 of [I-D.ietf-nmop-network-anomaly-lifecycle].
   Both are expressed in YANG Service Models.

   However, the semantic triples could also be expressed with the
   Semantic Web technology stack in RDF, RDFS, and OWL definitions as
   described in Section 6 of [I-D.mackey-nmop-kg-for-netops].  Together
   with the ontology definitions described in Section 2.3, a Knowledge
   Graph can be created describing the relationship between the network
   state and the observed Symptom.

3.  Elements of the Architecture

   A system architecture aimed at detecting service disruptions is
   typically built upon multiple components, for which design choices
   need to be made.  In this section, we describe the main components
   of the architecture and delve into considerations to be made when
   designing such components in an implementation.

   The system architecture is illustrated in Figure 1 and its main
   components are described in the following subsections.

         +---------+                     +-------------------+
         |Service  |                     |     Alarm and     |
    |--- |Inventory|                     | Problem Management|
    |    |         |                     |      System       |
    |    +---------+                     +-------------------+
    |      |                                      ^     Stream
    |      |                                      |
    |      |       +---------+           +-------------------+

    |      |       | Post-   | Stream    |   Message Broker  |
    |      |       | mortem  | <-------- |  with Analytical  |
    |      |       | System  |           |    Network Data   |
    |      |       +---------+           +-------------------+
    |      |            |                         ^     Stream
    |      |            |                         |
    |      |            |                +-------------------+
    |      | Profile    | Fine           | Alarm Aggregation | Store Label
    |      | and        | Tune           | for Anomaly       | ------------|
    |      | Generate   | SDD            | Detection         |             |
    |      | SDD Config | Config         +-------------------+             |
    |      |            |                       ^  ^  ^ Stream             |
    |      v            v                       |  |  |                    v
    |   +-------------------+            +-------------------+        +---------+
    |   | Service Disruption| Schedule   | Service Disruption| Replay |  Data   |
    |   |     Detection     | ---------> |     Detection     |<------ | Storage |
    |   |   Configuration   | Detection  |                   |        |         |
    |   +-------------------+            +-------------------+        +---------+
    |                                       ^ ^ Stream ^ ^ ^               ^
    |                                       | |        | | |               |
    |                                    +---------+---------+             |
    |                                    | Network |  Data   | Store       |
    |----------------------------------> |  Model  |  Aggr.  | ------------|
                                         |         | Process | Operational Data
                                         +---------+---------+
                                                ^  ^  ^ Stream
                                                |  |  |
                                         +-------------------+
                                         |   Message Broker  |
                                         |  with Operational |
                                         |    Network Data   |
                                         +-------------------+
                                                ^  ^  ^ Stream
 Subscribe                       Publish        |  |  |
           +-------------------+         +-------------------+
           | Network Node with | ------> | Network Telemetry |
 --------> | Network Telemetry | ------> |  Data Collection  |
           |   Subscription    | ------> |                   |
           +-------------------+         +-------------------+

      Figure 1: Service Disruption Detection System Architecture

3.1.  Service Inventory

   A service inventory is used to obtain a list of the connectivity
   services for which Anomaly Detection is to be performed.  A service
   profiling process may be executed on the service in order to define a
   configuration of the service disruption detection approach and
   parameters to be used.

3.2.  SDD Configuration

   Based on this service list and potential preliminary service
   profiling, a configuration of the Service Disruption Detection is
   produced.  It defines the set of approaches that need to be applied
   to perform SDD, as well as parameters, grouped in templates, that are
   to be set when executing the algorithms performing SDD per se.

   As the service lives on, the configuration may be adapted as a
   result of an evolution of the profiling being performed.  Postmortem
   analyses are produced as a result of Events impacting the service or
   of false positives raised by the Alarm system.  These postmortem
   analyses can lead to improvements of the deployed profile parameters
   and to the creation of new customer profiles.

3.3.  Operational Data Collection

   Collection of network monitoring data involves the management of the
   subscriptions to network telemetry on nodes of the network, and the
   configuration of the collection infrastructure to receive the
   monitoring data produced by the network.

   The monitoring data produced by the collection infrastructure is then
   streamed through a message broker system, for further processing.

   Networks tend to produce extremely large amounts of monitoring data.
   To preserve scaling and reduce costs, decisions need to be made on
   the duration of retention of such data in storage, and at which
   level of storage they need to be kept.  A retention time needs to be
   set on the raw data produced by the collection system, in accordance
   with its utility for further use.  This aspect is elaborated in
   further sections.

3.4.  Operational Data Aggregation

   Aggregation is the process of producing data upon which detection of
   a service disruption can be performed, based on collected network
   monitoring data.

   Pre-processing of collected network monitoring data is usually
   performed so as to produce input for the Service Disruption Detection
   component.  This can be achieved in multiple ways, depending on the
   architecture of the SDD component.  As an example, the granularity or
   cardinality at which forwarding plane data is produced by the network
   may be too high for the SDD algorithms, and instead be aggregated
   into a coarser dimension for SDD execution.

   A retention time also needs to be decided upon for Aggregated data.
   Note that the retention time must be set carefully, in accordance
   with the replay ability requirement discussed in Section 3.8.
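   The aggregation step can be sketched as follows (the record layout
   and one-minute bucket size are illustrative assumptions):

```python
from collections import defaultdict

# Hypothetical flow records: (timestamp in seconds, interface, packets).
flows = [
    (60, "eth0", 100), (75, "eth0", 50),
    (61, "eth1", 10), (130, "eth0", 40),
]

def aggregate_per_minute(records):
    """Collapse high-cardinality flow records into coarser
    (minute, interface) -> packet-count buckets as SDD input."""
    buckets = defaultdict(int)
    for ts, ifname, pkts in records:
        buckets[(ts // 60, ifname)] += pkts
    return dict(buckets)

print(aggregate_per_minute(flows))
# {(1, 'eth0'): 150, (1, 'eth1'): 10, (2, 'eth0'): 40}
```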

3.5.  Service Disruption Detection

   Service Disruption Detection processes the aggregated network data
   in order to decide whether a service is degraded to the point where
   network operators need to be alerted of an ongoing Problem within
   the network.

   Two key aspects need to be considered when designing the SDD
   component.  First, the way the data is being processed needs to be
   carefully designed, as networks typically produce extremely large
   amounts of data which may hinder the scalability of the architecture.
   Second, the algorithms used to make a decision to alert the operator
   need to be designed in such a way that the operator can trust that a
   targeted Service Disruption will be detected (no false negatives),
   while not spamming the operator with Alarms that do not reflect an
   actual issue within the network (false positives) leading to Alarm
   fatigue.

   Two approaches are typically followed to present the data to the SDD
   system.  Classically, the aggregated data can be stored in a database
   that is polled at regular intervals by the SDD component for decision
   making.  Alternatively, a streaming approach can be followed so as to
   process the data while they are being consumed from the collection
   component.

   For SDD per se, two families of algorithms can be considered.
   First, knowledge-based detection approaches can be used, mimicking
   the process that human operators follow when looking at the data.
   Second, Machine Learning based outlier detection approaches can be
   used to detect deviations from the norm.

3.5.1.  Network Modeling

   Some input to SDD is made of established knowledge of the network
   that is unrelated to the dimensions according to which outlier
   detection is performed.  For example, the knowledge of the network
   infrastructure may be required to perform some service disruption
   detection.  Such data need to be rendered accessible and updatable
   for use by SDD.  They may come from inventories, or automated
   gathering of data from the network itself.

3.5.2.  Data Profiling

   As rules cannot be crafted specifically for each customer, they need
   to be defined according to pre-established service profiles.
   Processing of monitoring data can be performed in order to associate
   each service with a profile.  External knowledge on the customer can
   also help in associating a service with a profile.

3.5.3.  Detection Strategies

   For a profile, a set of strategies is defined.  Each strategy
   captures one approach to look at the data (as a human operator does)
   to observe if an abnormal situation is arising.  Strategies are
   defined as a function of observed outliers as defined in Section 2.2.

   When one of the strategies applied for a profile detects a concerning
   global outlier or collective outlier, an Alarm needs to be raised.

   Depending on the implementation of the architecture, a scheduler may
   be needed in order to orchestrate the evaluation of the Alarm levels
   for each strategy applied for a profile, for all service instances
   associated with such profile.
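   A sketch of such per-profile strategies and their evaluation (the
   profile name, strategy, and thresholds are hypothetical):

```python
def packet_drop_strategy(series):
    """One strategy: flag the series when its maximum exceeds ten
    times the median, mimicking a human glance at the data."""
    baseline = sorted(series)[len(series) // 2]
    return 1.0 if max(series) > 10 * baseline else 0.0

# Each profile maps to (strategy, alarm threshold) pairs.
PROFILE_STRATEGIES = {
    "business-vpn": [(packet_drop_strategy, 0.5)],
}

def evaluate(profile, series):
    """Raise an Alarm when any strategy score crosses its threshold."""
    return any(score_fn(series) >= threshold
               for score_fn, threshold in PROFILE_STRATEGIES[profile])

print(evaluate("business-vpn", [2, 3, 2, 500]))  # True -> raise an Alarm
print(evaluate("business-vpn", [2, 3, 2, 4]))    # False
```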

3.5.4.  Machine Learning

   Machine learning-based anomaly detection can also be seamlessly
   integrated into such SDDS.  Machine learning is commonly used for
   detecting outliers or anomalies.  Typically, unsupervised learning is
   widely recognized for its applicability, given the inherent
   characteristics of network data.  Although machine learning requires
   a sizeable amount of high-quality data and considerable advanced
   training, the advantages it offers make these requirements
   worthwhile.  The power of this approach lies in its generalizability,
   robustness, ability to simplify the fine-tuning process, and most
   importantly, its capability to identify anomaly patterns that might
   go unnoticed by the human observer.

3.5.5.  Storage

   Storage may be required to execute SDD, as some algorithms may be
   relying on historical (aggregated) monitoring data in order to detect
   anomalies.  Careful considerations need to be made on the level at
   which such data is stored, as slow access to such data may be
   detrimental to the reactivity of the system.

3.6.  Alarm

   When the SDD component decides that a service is undergoing a
   disruption, an aggregated relevant-state change notification, taking
   the output of multiple Service Disruption Detection processes into
   account, needs to be sent to the Alarm and Problem management system
   as shown in Figure 4 in Section 3 of [I-D.ietf-nmop-terminology].
   Multiple practical aspects need to be taken into account in this
   component.

   When the issue lasts longer than the interval at which the SDD
   component runs, the relevant-state change mechanism should not create
   multiple notifications to the operator, so as to not overwhelm the
   management of the issue.  However, the information provided along
   with the Alarm should be kept up to date during the full duration of
   the issue.
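   The suppression of duplicate notifications while keeping the Alarm
   information current can be sketched as follows (the data model is
   hypothetical):

```python
# One entry per ongoing issue: repeated SDD runs update the existing
# Alarm instead of notifying the operator again.
active_alarms = {}
notifications = []

def report(service: str, details: str):
    if service not in active_alarms:
        notifications.append(service)  # notify the operator once
    active_alarms[service] = details   # keep information up to date

report("vpn-42", "packet loss 30%")
report("vpn-42", "packet loss 45%")  # same issue, no second notification
print(len(notifications), active_alarms["vpn-42"])  # 1 packet loss 45%
```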

3.7.  Postmortem

    Network Anomaly
      Detection             Symptoms
 +-------------------+         &
 |   +-----------+   | Network Anomalies
 |   | Detection |---|---------+
 |   |   Stage   |   |         |
 |   +-----------+   |         v
 +---------^---------+    +-------------------+   Labels  +------------+
           |              | Anomaly Detection |---------->| Validation |
           |              |   Label Store     |<----------|   Stage    |
           |              +-------------------+  Revised  +------------+
    +------------+             |                 Labels
    | Refinement |             |
    |   Stage    |<------------+
    +------------+    Historical Symptoms
                               &
                       Network Anomalies

            Figure 2: Anomaly Detection Refinement Lifecycle

   Validation and refinement are performed during Postmortem analysis.

   From an Anomaly Detection Lifecycle point of view, as described in
   [I-D.ietf-nmop-network-anomaly-lifecycle], the Service Disruption
   Detection Configuration evolves over time, iteratively, looping over
   three main phases: detection, validation and refinement.

   The Detection phase produces the Alarms that are sent to the Alarm
   and Problem Management System and, at the same time, stores the
   network anomaly and Symptom labels in the Label Store.  This enables
   network engineers to review the labels, validating and editing them
   as needed.

   The Validation stage is typically performed by network engineers
   reviewing the results of the detection and indicating which Symptoms
   and network anomalies have been useful for the identification of
   Problems in the network.  The original labels from the Service
   Disruption Detection are analyzed and an updated set of more accurate
   labels is provided back to the label store.

   The resulting labels are then fed back into the Network Anomaly
   Detection via its refinement capabilities: refinement consists of
   updating the Service Disruption Detection configuration in order to
   improve the results of the detection (e.g., reducing false positives
   and false negatives, and improving the accuracy of the anomaly
   boundaries).
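   The feedback loop can be sketched as follows, under the simplifying
   assumption that the SDD configuration is a single numeric detection
   threshold; the pairing of the detector's verdict with the engineer's
   revised label is illustrative, not mandated by this framework.

```python
def refine_threshold(labels, threshold, step=0.1):
    """Toy refinement step: each element of `labels` pairs the
    detector's verdict with the validated verdict produced by the
    Validation stage.  False positives push the threshold up (less
    sensitive); false negatives pull it down (more sensitive)."""
    for detected, validated in labels:
        if detected and not validated:      # false positive
            threshold += step
        elif not detected and validated:    # false negative
            threshold -= step
    return round(threshold, 6)
```

   Each iteration of the lifecycle would re-run detection with the
   refined threshold and collect a fresh set of labels for validation.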

3.8.  Replaying

   When a service disruption has been detected, it is essential for the
   human operator to be able to analyze the data which led to the
   raising of an Alarm.  It is thus important that an SDDS preserves
   both the data which led to the creation of the Alarm and human-
   understandable information on why that data led to the raising of
   the Alarm.

   In the early stages of operations, or when experimenting with an
   SDDS, the parameters used for SDD commonly need to be fine-tuned.
   This process is facilitated by designing the SDDS architecture in a
   way that allows the SDD algorithms to be rerun on the same input.

   Data retention policies, including the level of detail retained,
   need to be defined so that the ability to replay SDD executions for
   the sake of improving accuracy is not sacrificed.
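   The replay capability can be sketched as follows; the detector
   function and the shape of the recorded input are illustrative
   assumptions, not defined by this framework.

```python
def replay(sdd_fn, recorded_inputs, **params):
    """Re-run a Service Disruption Detection function over the
    preserved input data, e.g. with re-tuned parameters."""
    return [sdd_fn(sample, **params) for sample in recorded_inputs]

# Two tunings of the same toy detector over identical recorded input:
recorded = [2, 3, 50, 4]                       # preserved SDD input
detector = lambda value, limit: value > limit  # stand-in SDD algorithm
loose = replay(detector, recorded, limit=40)   # flags only the spike
tight = replay(detector, recorded, limit=2)    # flags most samples
```

   Comparing the two result lists on identical input is what allows an
   operator to judge a parameter change without waiting for the next
   live disruption.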

4.  Implementation Status

   Note to the RFC-Editor: Please remove this section before publishing.

   This section records the status of known implementations.

4.1.  Cosmos Bright Lights

   This architecture has been developed as part of a proof of concept,
   started in September 2022 first in a dedicated network lab
   environment and deployed in December 2022 in Swisscom production to
   monitor a limited set of 16 L3 VPN connectivity services.

   The architecture was first published in an academic paper [Ahf23] at
   the Applied Networking Research Workshop at IETF 117.

   Since December 2022, 20 connectivity service disruptions have been
   detected, along with 52 false positives.  The false positives were
   caused by the time series database temporarily not being real-time
   and by missing traffic profiling, which made the comparison to the
   previous week inapplicable.  For each of the 20 connectivity service
   disruptions, 6 parameters were monitored; 1 parameter recognized the
   disruption in 3 cases, 2 parameters in 8 cases, 3 parameters in 6
   cases, and 4 parameters in 2 cases.

   A real-time streaming based version has been deployed in Swisscom
   production as a proof of concept in June 2024, monitoring more than
   13,000 L3 VPNs concurrently.  Improved profiling capabilities are
   currently under development.

5.  Security Considerations

   TBD

6.  Contributors

   The authors would like to thank Alex Huang Feng, Ahmed Elhassany and
   Vincenzo Riccobene for their valuable contribution.

7.  Acknowledgements

   The authors would like to thank Qin Wu, Ignacio Dominguez Martinez-
   Casanueva, Adrian Farrel, Reshad Rahman and Ruediger Geib for their
   review and valuable comments.

8.  References

8.1.  Normative References

   [I-D.havel-nmop-digital-map]
              Havel, O., Claise, B., de Dios, O. G., Elhassany, A., and
              T. Graf, "Modeling the Digital Map based on RFC 8345:
              Sharing Experience and Perspectives", Work in Progress,
              Internet-Draft, draft-havel-nmop-digital-map-02, 21
              October 2024, <https://datatracker.ietf.org/doc/html/
              draft-havel-nmop-digital-map-02>.

   [I-D.ietf-nmop-network-anomaly-lifecycle]
              Riccobene, V., Graf, T., Du, W., and A. H. Feng, "An
              Experiment: Network Anomaly Lifecycle", Work in Progress,
              Internet-Draft, draft-ietf-nmop-network-anomaly-lifecycle-
              03, 8 May 2025, <https://datatracker.ietf.org/doc/html/
              draft-ietf-nmop-network-anomaly-lifecycle-03>.

   [I-D.ietf-nmop-network-anomaly-semantics]
              Graf, T., Du, W., Feng, A. H., and V. Riccobene, "Semantic
              Metadata Annotation for Network Anomaly Detection", Work
              in Progress, Internet-Draft, draft-ietf-nmop-network-
              anomaly-semantics-03, 8 May 2025,
              <https://datatracker.ietf.org/doc/html/draft-ietf-nmop-
              network-anomaly-semantics-03>.

   [I-D.ietf-nmop-simap-concept]
              Havel, O., Claise, B., de Dios, O. G., and T. Graf,
              "SIMAP: Concept, Requirements, and Use Cases", Work in
              Progress, Internet-Draft, draft-ietf-nmop-simap-concept-
              04, 28 June 2025, <https://datatracker.ietf.org/doc/html/
              draft-ietf-nmop-simap-concept-04>.

   [I-D.ietf-nmop-terminology]
              Davis, N., Farrel, A., Graf, T., Wu, Q., and C. Yu, "Some
              Key Terms for Network Fault and Problem Management", Work
              in Progress, Internet-Draft, draft-ietf-nmop-terminology-
              19, 18 June 2025, <https://datatracker.ietf.org/doc/html/
              draft-ietf-nmop-terminology-19>.

   [I-D.mackey-nmop-kg-for-netops]
              Mackey, M., Claise, B., Graf, T., Keller, H., Voyer, D.,
              Lucente, P., and I. D. Martinez-Casanueva, "Knowledge
              Graph Framework for Network Operations", Work in Progress,
              Internet-Draft, draft-mackey-nmop-kg-for-netops-02, 4
              March 2025, <https://datatracker.ietf.org/doc/html/draft-
              mackey-nmop-kg-for-netops-02>.

   [RFC2119]  Bradner, S., "Key words for use in RFCs to Indicate
              Requirement Levels", BCP 14, RFC 2119,
              DOI 10.17487/RFC2119, March 1997,
              <https://www.rfc-editor.org/info/rfc2119>.

   [RFC8174]  Leiba, B., "Ambiguity of Uppercase vs Lowercase in RFC
              2119 Key Words", BCP 14, RFC 8174, DOI 10.17487/RFC8174,
              May 2017, <https://www.rfc-editor.org/info/rfc8174>.

   [RFC8969]  Wu, Q., Ed., Boucadair, M., Ed., Lopez, D., Xie, C., and
              L. Geng, "A Framework for Automating Service and Network
              Management with YANG", RFC 8969, DOI 10.17487/RFC8969,
              January 2021, <https://www.rfc-editor.org/info/rfc8969>.

   [RFC9232]  Song, H., Qin, F., Martinez-Julia, P., Ciavaglia, L., and
              A. Wang, "Network Telemetry Framework", RFC 9232,
              DOI 10.17487/RFC9232, May 2022,
              <https://www.rfc-editor.org/info/rfc9232>.

8.2.  Informative References

   [Ahf23]    Huang Feng, A., "Daisy: Practical Anomaly Detection in
              large BGP/MPLS and BGP/SRv6 VPN Networks", IETF 117,
              Applied Networking Research Workshop,
              DOI 10.1145/3606464.3606470, July 2023,
              <https://hal.science/hal-04307611>.

   [Deh22]    Dehghani, Z., "Data Mesh", O'Reilly Media,
              ISBN 9781492092391, March 2022,
              <https://www.oreilly.com/library/view/data-
              mesh/9781492092384/>.

   [RFC4364]  Rosen, E. and Y. Rekhter, "BGP/MPLS IP Virtual Private
              Networks (VPNs)", RFC 4364, DOI 10.17487/RFC4364, February
              2006, <https://www.rfc-editor.org/info/rfc4364>.

   [RFC5102]  Quittek, J., Bryant, S., Claise, B., Aitken, P., and J.
              Meyer, "Information Model for IP Flow Information Export",
              RFC 5102, DOI 10.17487/RFC5102, January 2008,
              <https://www.rfc-editor.org/info/rfc5102>.

   [RFC7011]  Claise, B., Ed., Trammell, B., Ed., and P. Aitken,
              "Specification of the IP Flow Information Export (IPFIX)
              Protocol for the Exchange of Flow Information", STD 77,
              RFC 7011, DOI 10.17487/RFC7011, September 2013,
              <https://www.rfc-editor.org/info/rfc7011>.

   [RFC7270]  Yourtchenko, A., Aitken, P., and B. Claise, "Cisco-
              Specific Information Elements Reused in IP Flow
              Information Export (IPFIX)", RFC 7270,
              DOI 10.17487/RFC7270, June 2014,
              <https://www.rfc-editor.org/info/rfc7270>.

   [RFC7854]  Scudder, J., Ed., Fernando, R., and S. Stuart, "BGP
              Monitoring Protocol (BMP)", RFC 7854,
              DOI 10.17487/RFC7854, June 2016,
              <https://www.rfc-editor.org/info/rfc7854>.

   [RFC8343]  Bjorklund, M., "A YANG Data Model for Interface
              Management", RFC 8343, DOI 10.17487/RFC8343, March 2018,
              <https://www.rfc-editor.org/info/rfc8343>.

   [VAP09]    Chandola, V., Banerjee, A., and V. Kumar, "Anomaly
              detection: A survey", ACM Computing Surveys 41,
              DOI 10.1145/1541880.1541882, July 2009,
              <https://www.researchgate.net/
              publication/220565847_Anomaly_Detection_A_Survey>.

   [W3C-RDF-concept-triples]
              Cyganiak, R., Wood, D., and M. Lanthaler, "W3C RDF concept
              semantic triples", W3 Consortium, February 2014,
              <https://www.w3.org/TR/rdf-concepts/#section-triples>.

Authors' Addresses

   Thomas Graf
   Swisscom
   Binzring 17
   CH-8045 Zurich
   Switzerland
   Email: thomas.graf@swisscom.com

   Wanting Du
   Swisscom
   Binzring 17
   CH-8045 Zurich
   Switzerland
   Email: wanting.du@swisscom.com

   Pierre Francois
   INSA-Lyon
   Lyon
   France
   Email: pierre.francois@insa-lyon.fr

   Alex Huang Feng
   INSA-Lyon
   Lyon
   France
   Email: alex.huang-feng@insa-lyon.fr
