Network Management Operations (nmop) WG Agenda - Interim on Anomaly

Compact Agenda

| Slot | Topic | Presenters |
| :-: | :-: | :- |
| 14:00 - 14:10 | Agenda Bashing & Introduction | Chairs |
| 14:10 - 14:30 | Swisscom: Network Incident Network Analytics Postmortem | Thomas Graf |
| 14:30 - 14:40 | Bell Canada: Network Incident Network Analytics Postmortem | Dan Voyer |
| 14:40 - 14:55 | Orange: Knowledge Graphs for Enhanced Cross-Operator Incident Management & Network Design | Lionel Tailhardat |
| 14:55 - 15:10 | INSA: Practical Anomaly Detection in Internet Services: An ISP centric approach | Alex Huang Feng |
| 15:10 - 15:25 | An Architecture for a Network Anomaly Detection Framework | Wanting Du |
| 15:25 - 15:45 | Experiment: Network Anomaly Lifecycle | Vincenzo Riccobene |
| 15:45 - 16:00 | Discussion, Open Issues, Next Steps | Chairs |

Detailed Agenda

1. Agenda Bashing & Introduction (Chairs) (10 min)

Chairs' slides are available here.

No agenda bashing

2. Part 1: Operators Inputs (60 min)

Operators presenting their problems and views, in the context of network
anomaly.

2.1. Swisscom: Network Incident Network Analytics Postmortem (20 min)

Presented slides.

Rob Wilton: In the case of packets being dropped and wanting further
on-path tracing, how do you know which packets to do the tracing on?
Thomas Graf: Effectively you do statistical tracing/sampling, with
higher priority given to more important traffic classes.
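
As a rough illustration of the sampling idea in Thomas's answer, the
sketch below selects packets for on-path tracing with a per-class
probability, so that more important traffic classes are traced more
often. The class names, rates, and function names are hypothetical and
were not given in the meeting.

```python
import random

# Hypothetical per-class sampling probabilities (not from the meeting):
# more important traffic classes are traced more often.
TRACE_RATE = {"voice": 0.10, "business": 0.05, "best-effort": 0.01}


def should_trace(traffic_class, rng):
    """Return True if a packet of this class is selected for tracing."""
    rate = TRACE_RATE.get(traffic_class, 0.0)  # unknown classes are never traced
    return rng.random() < rate


def count_traced(traffic_class, packets, rng):
    """Count how many of `packets` packets of one class get traced."""
    return sum(should_trace(traffic_class, rng) for _ in range(packets))
```

With a seeded `random.Random` instance, calling `count_traced` for
"voice" and "best-effort" over the same number of packets should show
noticeably more traced packets for the higher-priority class.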

2.2. Bell Canada: Network Incident Network Analytics Postmortem (10 min)

Rob: You mentioned that it is hard to correlate that data. Is this
because there is too much data from devices, because it is hard to find
the right fields, or because the wrong data is being exported?
Dan: It is really a combination of all three. Going to closed loop
automation means that you have to rely on the quality and consistency of
the data to make robust decisions on config changes.

Comment from Thomas on chat:
15:50
I think the hard, challenging thing is to obtain the right data,
correctly structured, in near real-time. Choosing the right data is not
so much of a problem, since understanding what connectivity service is
provided and what metrics are needed to validate that the service is
working properly is mostly a given.

Presented slides.

2.3. Orange: Knowledge Graphs for Enhanced Cross-Operator Incident Management and Network Design (15 min)

Presented slides.

Rob: What languages are you using to model the graph? Is this semantic web?

Lionel: Yes, it is semantic web, I'm using RDFS/OWL and RML.
Rob: Do you see any issues with the mapping? Any corner cases not
covered?
Lionel: Not as of now, i.e. using data from several internal data
sources, but not YANG configuration yet (work in progress); see the
following two works on data integration:

Thomas: So the control plane and forwarding plane are represented in
YANG data models, the network relationships come from
"ietf-network.yang" in the digital map, and then you use the knowledge
graph to combine the different network planes?
Lionel: Yes, that is a good summary.

2.4. INSA: Practical Anomaly Detection in Internet Services: An ISP centric approach (15 min)

Presented slides.

No questions.

3. Part 2: NMOP Plan of Work (50 min)

Presenters Guidelines and Presentation Objectives:

3.1. An Architecture for a Network Anomaly Detection Framework (15 min)

Presented slides.

Wanting Du presenting.

Lionel: It could be interesting to map parts of the proposal to tasks in
some IT Service Management or Incident Management business process (e.g.
ITIL).
Wanting: Thanks a lot. We will take that into consideration.

3.2. Experiment: Network Anomaly Lifecycle (20 min)

Presented slides.

Lionel: Regarding the symptoms list in the data model, are you saying
that a network anomaly can only occur when there is co-occurrence of
alarms/symptoms?
Vincenzo: There can be network anomalies with only one symptom, based on
the data model, but the ability to group network anomalies is very
important, as it allows grouping together symptoms that are related to
each other (for instance if multiple symptoms are related to the same
issue, e.g. data plane and control plane).
Lionel: How do you incorporate symptoms into the data model to make it
shareable? Or do you only consider referring to an external detection
model, and then for analysts to go to the model to understand the
anomaly mechanism?
Vincenzo: The symptom is meant to be the atomic entity of the behaviour,
as the symptom is defined on time series data and the semantic for the
symptom is defined in the network-anomaly-semantic draft.
Lionel: If I were to collaborate with you, it would be to make the data
model more self-sufficient.
Vincenzo: Thank you - I would be interested in working with you too.
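
The relationship Vincenzo describes (a symptom as the atomic entity,
with a network anomaly grouping one or more related symptoms) can be
pictured with a small sketch. The class names, field names, and the
`issue_id` grouping key below are hypothetical and are not taken from
the drafts' data model.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Symptom:
    """Atomic observation on time series data (hypothetical fields)."""
    plane: str        # e.g. "data" or "control"
    description: str
    issue_id: str     # hypothetical key tying a symptom to an underlying issue


@dataclass
class Anomaly:
    """A network anomaly groups one or more related symptoms."""
    issue_id: str
    symptoms: list = field(default_factory=list)


def group_symptoms(symptoms):
    """Group symptoms sharing an issue_id into one anomaly each."""
    by_issue = defaultdict(list)
    for s in symptoms:
        by_issue[s.issue_id].append(s)
    return [Anomaly(issue_id, items) for issue_id, items in by_issue.items()]
```

In this sketch, a data plane symptom and a control plane symptom that
share the same `issue_id` end up in one anomaly, while a lone symptom
still forms an anomaly on its own.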

3.3. Discussion, Open Issues, & Next Steps (15 min)

Adrian: Wanted to point out that "anomaly" is defined in the terminology
draft. I want to check that the folks who are working in this area are
happy with this definition. Questions: Is this definition okay, and is
this the right reference?
Benoit: How would you phrase your question?
Adrian: If there is an immediate question, that is fine, otherwise, can
bring it back to the WG mailing list. Also have a definition of
"incident" that is pointing to the Incident YANG draft. Also define
"symptom" which is used in this conversation.
Dan: Well-written definitions are really important. I want to motivate
you to continue this work and not let it go.
Adrian: I think it comes to this: that this definition should remain in
the terminology document and folks should complain if the term is not
right.
Alex: I think that having a draft with all the terminology is useful, so
we should have a definition of anomaly.
Benoit: Is there an issue regarding the terminology draft (and the
framework) being Informational?
Adrian: This will depend on the IESG at the time. I would say that it
should be informational. Need to reference the terminology to understand
the terms used in a specification, but not to implement the
specification.
Benoit: Will check offline.

Benoit: reading the questions on the slide that were derived from
the various presentations (created on the fly by the WG chairs, during
the call, to understand the WG interest):

1) Is it worth documenting lessons learned/shared out of this valuable
« bring your own outage » (credits to Rob :-)) effort? Share Quick Wins?
2) Is the WG interested in investigating means that would help with
auto-generation of correlation and then help with efficient mitigation?
Do we have a candidate technical approach? Can data annotations be
useful here?
3) Call for contributions to the KG experiment with a focus on anomaly
• Share more lessons on automatic mapping of YANG to ontologies?
4) We have so far two drafts to "plug" in the network anomaly detection
framework
• Do they fulfill the needs expressed so far?
• Do we need others?

Michael: Have posted a knowledge graph framework for network operations.
I have some questions on the "Orange: Knowledge Graphs for Enhanced
Cross-Operator Incident Management and Network Design" presentation.
Benoit: Can post to the mailing list.
Michael: We have our own draft, but probably more aligned to
requirements.
Mahesh: Wanted to thank all the presenters. Particularly liked Thomas's
presentation. You mentioned that you wanted to get to the point where
you can determine what config change caused the outage. I do have a
student mentioned by Colin Perkins, and perhaps they could present in
NMOP.
Lionel: On question 2), yes we do have technical approaches to learn
behavioral models and issue annotations, with either a human in the loop
or an automated approach (or both) ... this is implicitly related to a
"yes" on question 3).
Thomas: Interested in both points (2) and (3).
Vincenzo: The experiment on labelling can be used for the annotation
mentioned in point (2).