Network Management Operations (nmop) WG Minutes - IETF 122

Co-Chairs: Benoît Claise, Reshad Rahman, & Mohamed Boucadair

Session 1

1. Agenda Bashing & Introduction

Thomas Graf: Updates from the BBF Spring Meeting on the WT-508/YANG-Push
Message Broker liaison (https://datatracker.ietf.org/liaison/1975/).
There is traction on timestamping, which relates to our work on
observation timestamping for YANG-Push in
draft-ietf-netconf-notif-envelope and how this ties into
draft-ietf-nmop-yang-message-broker-integration. We formed a team to
discuss this and also raised the point of reviewing the requirements of
data collection protocols.

2. Updates from the Terminology Fairy

3. SIMAP

Issue 13

Med Boucadair as a chair: Asked the audience whether there are any
objections. There appear to be no objections from the audience. We can
continue as suggested by the authors.
Thomas Graf: I support adding viewpoint as a requirement. I suggest
confining partitioning to network instances.

Issue 21

Nigel Davis: I will give feedback on the mailing list in the next few
weeks.
Thomas Graf: Perfectly fine to keep it optional.

Issue 61

Thomas Graf: Confirms the use case in issue 61. Option 2 is perfectly fine.
Oscar Gonzalez De Dios: I think all the use cases are valid. Please go
ahead.
Benoit Claise as a contributor: What I care about is whether in option
2 there are any new requirements.

Issue 59

Thomas Graf: Correct, it lays the foundation to link BMP BGP control
plane and IPFIX flow data.

Issue 62

Thomas Graf: I wanted to clarify whether we need to establish the map
between the passive elements or only be able to link passive to active
elements. I am concerned about the complexity of mapping among passive
elements and whether there are use cases for it.
Olga Havel: We could make it optional.
Thomas Graf: That's perfectly fine.
Brad Peters: It is useful to have relationships among passive elements.
It serves use cases such as multiple fiber segments.
Lionel Tailhardat: We already have data models in the wild supporting
passive elements (in between and above). I suggest changing the wording
from intended to observed or not observed.
Olga Havel: Are you proposing to keep it as a requirement?
Lionel Tailhardat: Keep it and change the wording. Active vs. passive.
This means the word intended. Kind of a production or decommissioning
aspect.
Olga Havel: This is the intent of the network even if the passive
elements can't be discovered.
Aihua Guo: Agree to keep the requirement as optional. Depending on the
use cases, you might not want to go very deep. We are still debating
whether passive is more inventory or should be integrated into the
topology.
Olga Havel: We definitely have to link to the work in IVY WG about
passive elements.
Olga Havel to the chairs: Do we need more feedback on the mailing list
or do we already have consensus?
Med Boucadair as a chair: What I am hearing is that it is optional and
people want to associate passive to the active elements. Let's start
from there and double check on the mailing list.

Issue 22

Thomas Graf: I am for option 3. Let's keep it for the hackathon.
Oscar Gonzalez De Dios: Not only hackathon. Let's also discuss on the
mailing list.

4. Message Broker: Extensible YANG Model for Network Telemetry Notifications

Benoit Claise: Why do we need metadata for the data collection?
Ahmed Elhassany: In production use cases, you have more than one
collector. When you upgrade the data collection software, this is useful
to differentiate the collectors.
Thomas Graf: Just as a comment. There is interest at BBF to have session
metadata for IPFIX.
Benoit Claise: Is the intent to standardize the message format?
Thomas Graf: Yes, that's the aim. In Data Mesh, there are several
organizational units and companies exchanging data.
Holger Keller: Agree. This is very useful and should be standardized.
Med Boucadair as a chair: I remember that Andy also raised interest
previously. Have the authors reached out to him?
Thomas Graf/Ahmed Elhassany: Not yet. We will do so.
Alex Huang: I like the idea of having a standard on this, and support
the idea of having YANG-Push relation in a different YANG module.

5. Anomaly Detection & Incident Management

5.1. Incident Management YANG Module

Brad Peters: The term "root cause" is problematic. We previously had a
terminology discussion on "root cause". Rather use "probable cause"
instead. As an example, take an SFP which lost optical signal. We never
know the root cause.
Nigel Davis: In line with Brad. There are causes. The most distant cause
is the closest to the root.
Lionel Tailhardat: In line with Brad and Nigel. Regarding service impact
analysis. Why do you constrain on machine learning when there are also
AI techniques?
Qin Wu: We are open to adding other mechanisms.
Lionel Tailhardat: How about using the term algorithmic solutions
instead of machine learning?
Qin Wu: Sure, that makes sense.
Thomas Graf: Use "probable cause" or "causality" instead.
Tom Hill: I think this is valuable. It's just the term. Suggest to stay
away from "root cause".
Benoit Claise as a chair: I suggest we agree whether to define a new
term in the document or align with the existing terms in the terminology
document.
Adrian Farrel: It was in the terminology document. We took it out for
all the reasons mentioned. In this case it appears to be more like a
customer or service root cause. I haven't been able to come up with a
definition.
Benoit Claise as a chair: Did I understand correctly that you are
perfectly fine to leave the definition in the incident management
document?

Adrian Farrel: Yes.
Rob Wilton, aka difficult terminology reviewer: I can understand why
other colleagues get hung up on the root cause definition since it is
hard to come by. I don't think it matters too much whether something is
the actual root cause. It could be just the trigger, or the root cause
from a particular perspective. Maybe it is ok to keep root cause a bit
vague or come up with a new term?
Benoit Claise as a chair: So you are in favour of "probable cause, also
known to some people as root cause".
Rob Wilton: Yes. Or we just say that underneath the causality chain may
continue.
Thomas Graf: I am in favor of adding "cause" or "causality" to the
terminology document and establishing the relationship to "problem".
But I suggest not going into the details of causality.
Tom Hill: At the end, the network operation colleagues are the ones
reading this. They need to be able to follow.
Rob Wilton: So instead of root cause analysis you would use probable
cause analysis in a network operation manual?
Tom Hill: I am in favor of being clear about what it means, to avoid
ambiguity.

Rob Wilton: Do you have issues today with the term root cause?
Benoit Claise: I suggest clarifying on the mailing list what you call
this in a postmortem.
Brad Peters (chat): The old X.733 defined a Probable Cause field.

5.2. Bring Your Own Outage

5.2.1 DT Network Incident

Rob Wilton: Did you have any drop counters to show the blackholing?
Holger Keller: No. We did not have any counters. Only show commands.
Alex Huang: You mentioned that from a maintenance window verification
everything looked fine but customers complained. Can you describe what
was actually verified?
Holger Keller: An IS-IS adjacency check and a ping is usually performed.
However we missed the ping in this workflow.
Benoit Claise: With TWAMP and ECMP multipathing, you never know what
hash you need to test all the paths. Measuring passively with IPFIX
and ForwardingStatus might have helped to see the drops.
Holger Keller: We already discussed. We don't know unfortunately.
Thomas Graf: On Rob's comment: We need not only the drop counter but
also the link and ECMP path context. On Benoit's comment: It depends on
the capture point. Sometimes at ingress the egress drops are not visible.
Ruediger Volk: The Kirchhoff rule: The sum of all currents entering a
junction must equal the sum of all currents leaving the junction, might
be a solution as well.
Lionel Tailhardat: +1 on the Kirchhoff rule and measuring with IPFIX.
Retrospectively, do you know what the KPIs and knowledge
representation were that you wish to capture as a signature?
Holger Keller: We just started the discussion with the network operation
colleagues.
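
As an illustration of the Kirchhoff-style check raised above, the
following is a minimal sketch and was not presented in the session. It
assumes per-node packet counters collected over the same measurement
interval; the NodeCounters fields, the function names, and the 0.1%
tolerance are all hypothetical and do not correspond to any specific
YANG module or IPFIX Information Element.

```python
from dataclasses import dataclass


@dataclass
class NodeCounters:
    """Hypothetical per-node packet counters for one measurement interval."""
    received: int                 # packets entering the node on all interfaces
    forwarded: int                # packets leaving the node on all interfaces
    locally_terminated: int = 0   # packets addressed to the node itself
    locally_originated: int = 0   # packets generated by the node itself
    reported_drops: int = 0       # drops the node accounts for in its counters


def unaccounted_packets(c: NodeCounters) -> int:
    """Inflow minus every known outflow; non-zero means silent loss."""
    inflow = c.received + c.locally_originated
    outflow = c.forwarded + c.locally_terminated + c.reported_drops
    return inflow - outflow


def is_suspected_blackhole(c: NodeCounters, tolerance: float = 0.001) -> bool:
    """Flag the node when the unaccounted share of inflow exceeds the tolerance.

    The tolerance (an assumed value) absorbs counter skew between
    non-atomic reads of the individual counters.
    """
    inflow = c.received + c.locally_originated
    if inflow == 0:
        return False
    return unaccounted_packets(c) / inflow > tolerance


healthy = NodeCounters(received=1_000_000, forwarded=999_900,
                       locally_terminated=50, reported_drops=50)
silent_loss = NodeCounters(received=1_000_000, forwarded=900_000)

print(is_suspected_blackhole(healthy))      # inflow fully accounted for
print(is_suspected_blackhole(silent_loss))  # 10% of inflow unaccounted for
```

Such a check only works when ingress and egress counters are both
available for the node, which matches the point made in the discussion
about the capture point.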

5.2.2 Swisscom Network Incident

Benoit Claise: Two things. Some of the use cases are transient. The last
one was longer. Do you increase the concern score depending on the
duration?
Thomas Graf: Well spotted. Today it is not supported. However we have it
on the roadmap to implement.
Benoit Claise as a chair: You are the author of multiple anomaly
detection documents. How do those documents fit in this architecture?
Thomas Graf: Tomorrow, Vincenzo will show in his slides how the schemas
are applied to the metrics I have shown here today.
Narasimha Prasad: What is your tolerance of the delay for all the
collected Network Telemetry metrics?
Thomas Graf: Our golden rule at Swisscom is that after 7 min the
customer calls and after 15 min the incident is published in the media.
Therefore we want to make sure that the analytical conclusions are in
the range of 3-5 min. Therefore the operational data needs to be
available within 2 minutes.
Narasimha Prasad: Do you already handle multiple failures in the same
incident window?
Thomas Graf: Yes, we already handle that. For a given connectivity
service, we raise a unique relevant state id and associate the events
with it.
Lionel Tailhardat: Same question as with Holger. Retrospectively, do you
know what the KPIs and knowledge representation are that you wish to
capture as a signature?
Thomas Graf: Absolutely. In the anomaly semantics we have a section
where the service is being defined. We want to leverage knowledge graphs
for the inventory and network topology relationships.

6. Wrap-up

Med Boucadair as a chair: "Is the WG interested in organizing a
dedicated interim on knowledge graphs?"

Nigel Davis (chat): Sharing incidents in some abstracted and neutralized
form such that the private data is removed but the incident pattern is
preserved.
Lionel Tailhardat: I have a dream that we can share network incidents to
improve the design of networks. From my 4 years of research I know that
knowledge graphs are useful in this context.
Benoit Claise: Did I understand correctly that you as an operator will
share your incident next under the bring your own outage umbrella?
Lionel Tailhardat: Did not get it. However I am sure it was funny. It is
more about creating the capability to enable network operators to share
the data.
Brad Peters: There is some value in having discussions on knowledge
graphs and incident sharing. At the end we want something actionable for
a network operator. Thomas's work is heading towards this goal.
Thomas Graf: Fully agree with what Brad just said. We saw it in Holger's
presentation. In an incident, many things are happening in parallel and
knowledge graphs are able to visualize the relationships. This makes it
easier for humans to understand the relationships. Looking forward
to an interim. What are the possibilities and what would be the possible
next steps for the working group?
Benoit Claise as a knowledge graph author: This is an interesting
topic. Still in early stage. Completely new for the IETF. Explaining
what a knowledge graph is. How it is useful. Trying to link all the
incidents.
Nacho (Ignacio Dominguez Martinez-Casanueva): Fully agree with what
Benoit just said. The interim will be beneficial to get things going.
There are many applications. We can share existing implementations such
as open-source ones.
Diego Lopez: Knowledge graphs are able to enhance what we have just seen
from Holger and Thomas. What also comes to my mind is our root cause
discussion: how knowledge graphs apply to causality.

Session 2 (Hackathon-focused)

1. Agenda Bashing & Introduction

Adrian Farrel: In the first session I mentioned Rob's review and how we
would progress. We did that. We understand where each of us is coming
from. We will make some minor edits to make the definitions more
precise. And I will make a more concrete survey of how existing
documents make use of the defined terms. This will help us in the end.

2. Validate Configured Subscription YANG-Push Publisher Implementations

Benoit Claise: As an observation, you went from core to virtual router
and now to OLT (Optical Line Termination). One gentleman asked me: what
about the datacenter?
Thomas Graf: Good point. So far we don't have that, but it is one of the
next steps.
Nacho (Ignacio Dominguez Martinez-Casanueva): Where can I find the
open-source implementations?
Thomas Graf: We have YANG-Push receivers supporting udp-notif with JSON
encoding: https://www.network-analytics.org/yp/implementations.html.
Nacho (Ignacio Dominguez Martinez-Casanueva): Are you planning to have a
solution covering the entire architecture?
Thomas Graf: Yes, that's the end goal. We are working on extending
Netgauze, the YANG-Push receiver, to support
https://datatracker.ietf.org/doc/html/draft-netana-nmop-message-broker-telemetry-message
for the YANG Message Broker Producer, and later also to support the
notification as a YANG document.

3. SIMAP for SRv6 and Linking Topology to External Data

Benoit Claise: Good job. I have some questions but I will bring them to
the list.

4. Anomaly Detection Integration Update

Benoit Claise: You mentioned testing with other operators. But I think
there is also validating the internal process of a network operator
doing the postmortem.
Vincenzo Riccobene: Well pointed out.
Benoit Claise: You mentioned the global concern score; you might also
want a global confidence score in the schema.
Vincenzo Riccobene: Yes. That makes sense. We will take that into the
next iteration.

5. YANG Configuration Instance Data to Knowledge Graph

Nacho (Ignacio Dominguez Martinez-Casanueva): It's good to see that you
are trying to use RDF Mapping Language. The community group developing
RDF Mapping Language within W3C intends to become a working group. We
could potentially liaise with them in the future. Which RDF Mapping
Language did you use?
Michael Mackey: We did one in Python and one in Java. The one in Java
was faster. At the end it is just XML. There are so many ways to do
it. At the end what counts is the data volume.
Lionel Tailhardat: I agree with you about the challenges of linking such
a large amount of data. Learning from experiments, I am confident we can
achieve this. I think there would be plenty of things to discuss. I am
not sure how to organize.
Michael Mackey: I would love to hear your feedback. The declarative
approach is good for schema mapping, whereas the streaming approach is
for the incident data.
Lionel Tailhardat: To lower the bar, I suggest focusing on the
declarative approach and avoiding post processing, since you lose
provenance data.
Benoit Claise as a chair: On the point of how to organize this: in the
last NMOP session we determined that an interim meeting would be best.
Rob Wilton: It is an interesting problem you are trying to solve here.
Aligning different data sets to get a semantic meaning. I do wonder
whether IETF XML is the right approach or not. You were mentioning the
amount of data being generated as a possible challenge. I think you were
mentioning that SPARQL would not work in this scenario.
Michael Mackey: Not at all. It was more at the generation of the
semantic triples. The problem is more at the UML language W3C defines.
How the mapping is being established.

6. HTTPS-Notif-Draft Kafka Integration and Bandwidth Analysis for Different Encodings (if time permits)

Rob Wilton: You were presenting this in NETCONF as well. Thanks for
bringing it here too. I think the libcbor and yangsen libraries are open
source. You might want to consider contributing there.
Meher (Bharadwaja M Chittapragada): We got similar suggestions from
others as well.