
Minutes interim-2024-nmop-03: Wed 14:00
minutes-interim-2024-nmop-03-202409111400-00

Meeting Minutes Network Management Operations (nmop) WG
Date and time 2024-09-11 14:00
Title Minutes interim-2024-nmop-03: Wed 14:00
State Active
Last updated 2024-09-16


Network Management Operations (nmop) WG Agenda - Interim on Anomaly

  • When: Wed, Sept 11, 2024
  • Co-Chairs: Benoît Claise & Mohamed Boucadair

Compact Agenda

| Slot | Topic | Presenters |
| :-: | :-: | :-: |
| 14:00 - 14:10 | Agenda Bashing & Introduction | Chairs |
| 14:10 - 14:30 | Swisscom: Network Incident Network Analytics Postmortem | Thomas Graf |
| 14:30 - 14:40 | Bell Canada: Network Incident Network Analytics Postmortem | Dan Voyer |
| 14:40 - 14:55 | Orange: Knowledge Graphs for Enhanced Cross-Operator Incident Management & Network Design | Lionel Tailhardat |
| 14:55 - 15:10 | INSA: Practical Anomaly Detection in Internet Services: An ISP centric approach | Alex Huang Feng |
| 15:10 - 15:25 | An Architecture for a Network Anomaly Detection Framework | Wanting Du |
| 15:25 - 15:45 | Experiment: Network Anomaly Lifecycle | Vincenzo Riccobene |
| 15:45 - 16:00 | Discussion, Open Issues, Next Steps | Chairs |

Detailed Agenda

1. Agenda Bashing & Introduction (Chairs) (10 min)

Chairs slides are available here.

No agenda bashing

2. Part 1: Operators Inputs (60 min)

Operators presenting their problems and views in the context of network
anomalies.

2.1. Swisscom: Network Incident Network Analytics Postmortem (20 min)

  • Presenter: Thomas Graf
  • Reading Material: Describes incidents in terms of what happened,
    which operational metrics were available, which analytical metrics
    described the symptoms, and what improvements in the network anomaly
    detection system and network telemetry protocols are proposed.

Presented slides.

  • Data: IPFIX from the core network (BGP, IP-MPLS)
  • Dataset: SRv6 IS-IS ABR route aggregation, August 14,
    post-maintenance-window analysis.

    • Forwarding plane drops, no alerts
    • Two applications showed loss of connectivity, alerts
    • Post analysis showed that loss of connectivity comes from the
      reconfiguration work, with effects such as congestion and state
      changes.
  • Automated anomaly detection (as opposed to dashboard-based
    analysis): outlier detection to compute a composite time-stamped
    score (concern, traffic loss, etc.), a kind of fuzzy reasoning.
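The composite-score idea mentioned above can be sketched as follows. This is a minimal illustration, not Swisscom's actual implementation; the metric names, weights, and threshold are all invented for the example:

```python
# Minimal sketch of composite anomaly scoring: per-metric outlier
# detection (plain z-scores) combined into one time-stamped score.
# Metric names, weights, and the threshold are illustrative assumptions.
from statistics import mean, stdev

def z_scores(series):
    """Absolute z-score outlier measure for one metric's time series."""
    mu, sigma = mean(series), stdev(series)
    return [abs(x - mu) / sigma if sigma else 0.0 for x in series]

def composite_scores(metrics, weights):
    """Combine per-metric outlier scores into one score per timestamp."""
    per_metric = {name: z_scores(series) for name, series in metrics.items()}
    length = len(next(iter(per_metric.values())))
    return [
        sum(weights[name] * per_metric[name][t] for name in per_metric)
        for t in range(length)
    ]

metrics = {
    "traffic_loss": [0.1, 0.1, 0.2, 5.0, 0.1],  # spike at t=3
    "flow_count":   [100, 102, 99, 40, 101],    # drop at t=3
}
weights = {"traffic_loss": 0.7, "flow_count": 0.3}
scores = composite_scores(metrics, weights)
anomalous = [t for t, s in enumerate(scores) if s > 1.0]  # -> [3]
```

Weighting the per-metric scores is one simple way to express the "concern" dimension: metrics closer to service impact can be given more influence on the final score.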

Rob Wilton: In the case of packets being dropped, if you want further
on-path tracing, how do you know which packets to do the tracing on?
Thomas Graf: Effectively you do statistical tracing/sampling, with
higher priority given to more important traffic classes.

2.2. Bell Canada: Network Incident Network Analytics Postmortem (10 min)

  • Presenter: Dan Voyer
  • Reading Material: Describes incidents in terms of what happened and
    gives insights on the current network telemetry rollout in the
    network and their preparation for anomaly detection deployment.

Rob: You mentioned that it is hard to correlate the data. Is this
because there is too much data from devices, or is it hard to find the
right fields, or is the wrong data being exported?
Dan: It is really a combination of all three. Going to closed loop
automation means that you have to rely on the quality and consistency of
the data to make robust decisions on config changes.

Comment from Thomas in chat (15:50):
I think the hard, challenging thing is to obtain the right data,
correctly structured, in near real-time. Choosing the right data is not
so much a problem, since understanding which connectivity service is
provided and which metrics are needed to validate that the service is
working properly is mostly a given.

Presented slides.

  • Data wrangling is complicated because of many legacy systems and
    heterogeneous data.
  • Work agreement signed between Bell Canada and Swisscom to make
    things simpler, notably in knowledge sharing in outage causes and
    remediation actions (post maintenance analysis).

2.3. Orange: Knowledge Graphs for Enhanced Cross-Operator Incident Management and Network Design (15 min)

Presented slides.

Rob: What languages are you using to model the graph? Is this semantic web?

Lionel: Yes, it is semantic web, I'm using RDFS/OWL and RML.
Rob: Do you see any issues with the mapping? Any corner cases not
covered?
Lionel: No issues for now, i.e., using data from several internal
datasources, but not YANG configuration yet (work in progress); see the
two following works on data integration:

  • Lionel Tailhardat, Raphaël Troncy, and Yoan Chabot. 2023. “Designing
    NORIA: a Knowledge Graph-based Platform for Anomaly Detection and
    Incident Management in ICT Systems”. In KGCW’23: 4th International
    Workshop on Knowledge Graph Construction, May 28, 2023, Crete.
    CEUR-WS.org, online CEUR-WS.org/Vol-3471/paper3.pdf
    [pres-pdf]
  • Lionel Tailhardat, Raphaël Troncy, and Yoan Chabot. 2024. “NORIA-O:
    An Ontology for Anomaly Detection and Incident Management in ICT
    Systems”. In 21st European Semantic Web Conference (ESWC), Resources
    track, May 26-30, 2024, Hersonissos, Greece. Best paper award
    nominee. https://doi.org/10.1007/978-3-031-60635-9_2
    [pres-pdf, Videolectures.net pres-video]

Thomas: So the control plane and forwarding plane are represented in
YANG data models, the network relationships come from
"ietf-network.yang" in the digital map, and then you use the knowledge
graph to combine the different network planes?
Lionel: Yes, that is a good summary.
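The cross-plane combination summarized above can be illustrated with a toy graph of (subject, predicate, object) triples. The node and predicate names here are invented for illustration; a real deployment would use RDF/OWL terms from the relevant YANG-derived ontologies rather than this ad-hoc scheme:

```python
# Toy knowledge-graph illustration: topology, control-plane, and
# forwarding-plane facts in one triple store, queryable across planes.
# All identifiers are invented for this example.
triples = set()

# Topology relationships (as an ietf-network-style model would provide)
triples.add(("node:A", "topo:linkedTo", "node:B"))
triples.add(("node:B", "topo:linkedTo", "node:C"))

# Control-plane view
triples.add(("node:A", "ctrl:bgpSessionWith", "node:C"))

# Forwarding-plane view
triples.add(("node:A", "fwd:dropsPacketsToward", "node:C"))

def match(s=None, p=None, o=None):
    """Return triples matching a pattern; None acts as a wildcard."""
    return {
        t for t in triples
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    }

# Cross-plane query: everything the graph knows about node:A
facts_about_a = match(s="node:A")
```

The point of the graph representation is exactly this kind of query: one lookup surfaces facts that originated in different planes and data sources.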

2.4. INSA: Practical Anomaly Detection in Internet Services: An ISP centric approach (15 min)

Presented slides.

No questions.

3. Part 2: NMOP Plan of Work (50 min)

Presenters Guidelines and Presentation Objectives:

  • Don't repeat what was presented at the last IETF meeting
  • Please focus on:

    • Open issues (favor the discussion, plan the Q&A in your timeslot)

    • Items that are meant to ease structuring the effort

    • Future plans
    • Experiments explanation, including how to interconnect with other
      experiments

3.1. An Architecture for a Network Anomaly Detection Framework (15 min)

Presented slides.

Wanting Du presenting.

Lionel: It could be interesting to map parts of the proposal to tasks in
some IT Service Management or Incident Management business process (e.g.
ITIL).
Wanting: Thanks a lot. We will take that into consideration.

3.2. Experiment: Network Anomaly Lifecycle (20 min)

Presented slides.

Lionel: Regarding the symptoms list in the data model, are you saying
that a network anomaly can only occur when there is a co-occurrence of
alarms/symptoms?
Vincenzo: There can be network anomalies with only one symptom, based on
the data model, but the ability to group network anomalies is very
important as it allows grouping together symptoms that are related to
each other (for instance, if multiple symptoms are related to the same
issue, e.g. data plane and control plane).
Lionel: How do you incorporate symptoms into the data model to make it
shareable? Or do you only consider referring to an external detection
model, so that analysts go to that model to understand the anomaly
mechanism?
Vincenzo: The symptom is meant to be the atomic entity of the behaviour,
as the symptom is defined on time-series data and the semantics for the
symptom are defined in the network-anomaly-semantic draft.
Lionel: If I were to collaborate with you, it would be to make the data
model more self-sufficient.
Vincenzo: Thank you - I would be interested in working with you too.
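The symptom/anomaly relationship discussed in this exchange can be sketched as a small data structure. This is a hypothetical illustration only; the field names are invented and do not reproduce the draft's actual YANG model:

```python
# Hypothetical sketch: a symptom as the atomic observation over a
# time-series window, and an anomaly grouping one or more symptoms.
# Field names are invented for illustration, not taken from the draft.
from dataclasses import dataclass, field

@dataclass
class Symptom:
    source: str        # e.g. which plane/metric the symptom was seen on
    description: str
    start: int         # observation window on the time-series data
    end: int

@dataclass
class NetworkAnomaly:
    anomaly_id: str
    symptoms: list = field(default_factory=list)  # one or more symptoms

    def add_symptom(self, symptom: Symptom) -> None:
        self.symptoms.append(symptom)

# A single-symptom anomaly is valid; grouping lets related symptoms
# (e.g. data-plane and control-plane effects of one issue) stay together.
anomaly = NetworkAnomaly("a1")
anomaly.add_symptom(Symptom("data-plane", "packet drops", 100, 160))
anomaly.add_symptom(Symptom("control-plane", "BGP session flap", 110, 150))
```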

3.3. Discussion, Open Issues, & Next Steps (15 min)

  • WG Chairs lead discussion

Adrian: Wanted to point out that "anomaly" is defined in the terminology
draft. I want to check that the folks who are working in this area are
happy with this definition. Questions: Is this definition okay, and is
this the right reference?
Benoit: How would you phrase your question?
Adrian: If there is an immediate question, that is fine; otherwise, we
can bring it back to the WG mailing list. We also have a definition of
"incident" that points to the Incident YANG draft, and we also define
"symptom", which is used in this conversation.
Dan: Well-written definitions are really important. I want to motivate
you to continue this work and not let it go.
Adrian: I think it comes to this: that this definition should remain in
the terminology document and folks should complain if the term is not
right.
Alex: I think that having a draft with all the terminology is useful, so
we should have a definition of "anomaly".
Benoit: Is there an issue regarding the terminology draft (and the
framework) being Informational?
Adrian: This will depend on the IESG at the time. I would say that it
should be informational. Need to reference the terminology to understand
the terms used in a specification, but not to implement the
specification.
Benoit: Will check offline.

Benoit: reading the questions on the slide that were derived from
the various presentations (created on the fly by the WG chairs, during
the call, to understand the WG interest):

1) Is it worth documenting lessons learned/shared out of this valuable
« bring your own outage » (credits to Rob :-)) effort? Share quick wins?
2) Is the WG interested in investigating means that would help with
auto-generation of correlation and then help with efficient mitigation?
Do we have a candidate technical approach? Can data annotations be
useful here?
3) Call for contributions to the KG experiment with a focus on anomaly
   • Share more lessons on automatic mapping of YANG to ontologies?
4) We have so far two drafts to "plug" into the network anomaly
detection framework
   • Do they fulfill the needs expressed so far?
   • Do we need others?

Michael: Have posted a knowledge graph framework for network operations.
I have some questions on the "Orange: Knowledge Graphs for Enhanced
Cross-Operator Incident Management and Network Design" presentation.
Benoit: You can post them to the mailing list.
Michael: We have our own draft, but it is probably more aligned to
requirements.
Mahesh: Wanted to thank all the presenters. Particularly liked Thomas's
presentation. You mentioned that you wanted to get to the point where
you can determine what config change caused the outage. I have a
student, mentored by Colin Perkins, and perhaps they could present in
NMOP.
Lionel: On question 2), yes, we do have technical approaches to learn
behavioral models and issue annotations, with either a human in the loop
or an automated approach (or both) ... this is implicitly related to a
"yes" on question 3).
Thomas: Interested in both points (2) and (3).
Vincenzo: The experiment on labelling can be used for the annotation
mentioned in point (2).