IETF 119 NMOP (Network Management Operations) Minutes

Chairs: Benoît Claise & Mohamed Boucadair

When: 2024-03-18


Agenda Bashing & Introduction (Chairs)

No changes

A Word from the AD (Rob)

Rob presented the NMOP charter and referred to the slides linked below
during his speech, highlighting the aspects that are important to the AD
and the process with the IESG that led to this outcome.

https://datatracker.ietf.org/meeting/119/materials/slides-119-nmop-nmop-charter-slides-00.pdf

Below is Rob Wilton's transcript:

The NMOP Working Group is sort of interesting in terms of trying to get
this working group chartered as a different style of working group than
lots of the regular IETF ones. And one thing that is sort of a slight
twist to what we regularly see in IETF working groups is that this one
actually has a deliberate aim to try and pull more operator input into
the IETF process. So that's one of the things that this working group is
intended to try to address in its charter, for the area of network
management operations. We were seeing operators self-organizing at the
previous IETF meetings and doing lots of side meetings to present topics
and things, and part of this is to try and pull that in a better way
into the IETF process so that they can actually get agenda time, get
people into the room, and have all the minutes and things like an
associated AD, the stuff that goes along with these meetings. That was
one of the parts of things.

The other aspect in terms of the charter, and I want to go through slide
3: one of the things that's important in this charter is that we have a
focused list of topics. We have some things that we listed in rough
priority order. In terms of the sort of topics it's focusing on, it's
really about how to help better deploy existing network management
technologies. So when operators find issues as they try to deploy IETF
protocols, how do they actually address these issues? In some cases,
those can go back to the working groups developing those protocols. But
in other cases, we have issues that are more wide-spanning, and we want
to have an area to discuss those issues and find potential solutions.

And part of that, when we have these discussions, may be running
short-term experiments, driven by the operators, along the lines of
"this is how we think of deploying this", and being able to do those
experiments and then report back to this working group and get feedback
on those experiments: what things are going well, what things need
changes, and then evolve those experiments over time. The focus here is
that these experiments are short-term things. It's not "I would like to
do something pie in the sky 5 or 10 years down the line"; that's for the
IRTF.

And it's also not really meant to be for developing a brand new
protocol. If you want to do that, then you use the standard BoF process
or working group chartering. This is aimed at how we make the existing
protocols work a bit better, tweak them, and drive that change in.
Another thing we have in here is discussing use cases and requirements.
That again is useful, and again, that is what will feed into other
working groups. In terms of protocol work, this working group isn't
chartered to do any protocol work at this stage. I tried to get it into
the charter to allow it, but because there's nothing in the charter
naming actual work, I got pushback. It's been taken out, except for
doing YANG models. But that is still an open door for this working group
in the future: if there's some small amount of protocol work needed,
that could happen here, but anything large would go to a separate
working group, either an existing one or a new working group.

What other comments have I got? I think it's probably worth saying what
is out of scope for this working group. Long-term research ideas go to
NMRG; they don't come here. And it's also not planned that work from
working groups that exist elsewhere, or that are closed and have a tail
end of work, comes here, even if it is management or operations focused,
unless it is getting strong deployment and it's deployment or operator
issues that are coming up. I think that sort of covers everything I
have.

Oh, and the last thing I want to say is that the other thing different
with this working group is that it has a set of topics that it's going
to aim to focus on. At the moment, we have 4 particular topics. Those
are chosen by the working group, with particularly strong operator input
into what the interesting topics should be, and with AD oversight in
terms of helping to choose those topics. I won't be doing that in 3
days' time, so it will fall onto Mahesh to try and sort of curate what
the list of topics is.

At the moment, we've got NETCONF/YANG-Push integration. Thomas has been
driving quite a lot of that work so far, and that is sort of being
brought here. Again, some of the actual work is happening in other
working groups, but the coordination activities and progress will be
reported here.

Anomaly detection and incident management are 2 things that have been
coming up at the last 1 or 2 meetings. So the coordination in that area
will happen here, and I think it would be an interesting topic.

We've got the digital map, which is about how to deploy the existing
YANG topology modules and, in the process of trying to deploy those
modules, finding what issues arise and ways to fix them. Some of those
base models were, for example, standardized in I2RS, which is no longer
there. So again, this is the home to try and find a place to fix that.

And then the last topic we're currently focused on is an update to RFC
3535 (rfc3535bis), which came out of an IAB workshop looking at the
requirements for network management. The interesting thing with that one
is that it was an IAB document. So we can't directly publish a bis to
that because we're not the IAB, but I think the interesting piece in
this working group is to figure out what to do next. Do we ask for
another IAB program to do the same thing? Do we want to reach out to
operators directly from this working group? What is it we want to say
there and what do we want to do? From the IAB side, there is Wes
Hardaker. We discussed this in the IESG meetings. He's got an interest
in this. He was involved in that document 20 years ago. So, from the
IAB, he'd be a good contact person to help in that area.

And I think the last thing is that these topics are not fixed over time.
They can change, and again, Mahesh, the chairs, and the working group
would decide what those topics are. But at the moment, we just try to
focus on a smaller number of topics, making progress with those and
giving a decent amount of time to each topic. So I think I've probably
used up enough of your agenda time. Thank you. Okay? Any questions
before we move on? Thank you. Good luck.

NETCONF/YANG Push Integration

Goals:

An Architecture for YANG Push to Apache Kafka Integration

I-D: draft-netana-nmop-yang-kafka-integration
Presenter: Thomas Graf (onsite)

Discussion

Nigel Davis
3 points. This is good work. I completely agree on the timestamping.
There has been poor timestamping in the past. You mentioned semantics. I
don't think there are sufficient semantics in the YANG modules in
general. We need to go deeper. We'll find that as we move along.
Regarding unchanging data from the network: have you considered
engineering how to deal with sudden spikes and bursts of huge amounts of
data, leading to potential overload, and potential push back on data
export to moderate/average it a bit? Solution engineering is very
important.

Thomas Graf
On the third point: at NETCONF we have considered this by having a
choice between udp-notif and https-notif. udp-notif is for accounting
metrics where loss is perfectly fine, whereas state changes shall be
transported with https-notif, where retransmission is desired.

Nigel Davis
How about state changes which occur in short succession (rattling)?
Rather than transmitting all the state changes, only transmit
intermittent states in between.

Thomas Graf
Makes perfect sense.
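
The suppression of rapid state changes raised above could be sketched as
follows. This is only a minimal illustration of one possible reading,
not something presented at the meeting: the StateChange type, the
coalesce function and the hold_time parameter are all hypothetical
names, and the rule used (per resource, forward only the state a burst
settles on) is an assumption.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List


@dataclass(frozen=True)
class StateChange:
    """Hypothetical record of one state-change notification."""
    timestamp: float  # seconds
    resource: str     # e.g. an interface name
    state: str        # e.g. "up" or "down"


def coalesce(changes: Iterable[StateChange], hold_time: float) -> List[StateChange]:
    """Per resource, keep only the last change of each burst.

    A burst is a run of changes on the same resource in which consecutive
    changes are less than hold_time seconds apart (an assumed rule).
    """
    last: Dict[str, StateChange] = {}  # latest change of the current burst
    kept: List[StateChange] = []
    for change in sorted(changes, key=lambda c: c.timestamp):
        previous = last.get(change.resource)
        if previous is not None and change.timestamp - previous.timestamp >= hold_time:
            kept.append(previous)  # the burst ended; keep the state it settled on
        last[change.resource] = change
    kept.extend(last.values())  # final state of every resource
    return sorted(kept, key=lambda c: c.timestamp)
```

With a hold_time of one second, three flaps within 0.2 seconds followed
by a change several seconds later would be reduced to two notifications.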

Rob Wilton (as an individual)
This is a good start. Great document to have. I believe I was the one
pushing to create it. Therefore not surprising. I am going to review and
provide feedback.

Qin Wu
You mentioned operational vs. analytical metrics. Can you describe the
meaning? Are you referring to KPIs or metadata?

Thomas Graf
I am using the terminology from the data mesh architecture. Operational
metrics are what you collect, whereas analytical metrics are what the
analytical stack generates. Regarding KPIs: for me these are SLIs
(Service Level Indicators) and SLOs (Service Level Objectives), which
are both analytical metrics. SLIs are derived from operational metrics,
whereas with an SLO you define the intent/objective and measure whether
you reach that intent/objective or not.
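
The relationship between operational metrics, SLIs and SLOs described
here can be illustrated with a small sketch. It assumes, purely for
illustration, that the operational metric is a list of probe results
and that the SLI is the fraction of successful probes; the function
names and the objective values are hypothetical and not taken from the
presentation.

```python
from typing import Sequence


def availability_sli(probe_results: Sequence[bool]) -> float:
    """Derive an SLI (an analytical metric) from collected probe results.

    The SLI here is simply the fraction of successful probes.
    """
    if not probe_results:
        raise ValueError("at least one probe result is required")
    return sum(probe_results) / len(probe_results)


def slo_met(sli: float, objective: float) -> bool:
    """Measure whether the defined objective (the SLO) is reached."""
    return sli >= objective


# Hypothetical operational data: 999 successful probes and 1 failure.
samples = [True] * 999 + [False]
sli = availability_sli(samples)
print(f"SLI={sli:.4f}, 99.5% objective met: {slo_met(sli, 0.995)}")
```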

Qin Wu
You gave a good overall picture of how that all fits together. For these
YANG enhancements, are these protocol extensions or YANG module
extensions?

Thomas Graf
So far these extensions apply to the NETCONF notification and YANG-Push
header, and the YANG library. Besides those extensions, there is Apache
Kafka integration work progressing, which then leads to end-to-end
testing activities in which the YANG-Push extensions are being
validated.

Benoit
This is important work since it links all the other work. We believe we
need more reviews before the WG adoption. Rob has already indicated that
he will review; others, please follow up on the mailing list. Ken,
unfortunately the queue is already closed, please follow up on the
mailing list.

Chat messages:
Andy Bierman: Does this work need to be implementation-specific to
Apache2?
Is it possible to extend to other message brokers like ZeroMQ?

Robert Wilton: @Andy, my understanding is that it is regarding Apache
Kafka message broker, but I believe that the changes being proposed to
the protocols are effectively generic and would work with any message
broker.

Alex Huang Feng: Agrees that this work can be generalized to any message
broker. The document is particularly useful to identify specification
gaps, and it can then also be validated against other message brokers.

Kent Watsen: +1 to Andy’s comment. The doc should refer to brokers in
general, using Kafka as an example.

Anomaly Detection

Goal:

Experiment Semantic Metadata Annotation for Network Anomaly Detection + Network Anomaly Postmortem Lifecycle

I-Ds: draft-netana-nmop-network-anomaly-semantics
      & draft-netana-nmop-network-anomaly-lifecycle
Presenters: Thomas Graf (onsite) and/or Vincenzo Riccobene (remote)

Discussion

Rob Wilton
You were referring to replaying the data. Does that mean that you export
from the time series database and feed it into the detection system?

Thomas Graf
Yes. Correct. Once the data is labeled and identified as a potential
anomaly, not only the operational data but also the metadata is stored
separately, and you can do a refinement and replay that recorded data.

Rob Wilton
So where is the anomaly classification being done? On the network node
or on the time series database?

Thomas Graf
The network telemetry data is received from the network and transformed
to be fed into the Network Anomaly Detection system, where the
anomalies and symptoms are then identified and labeled and alerts are
sent.

Rob Wilton
My instinct is that it will take several iterations to identify the
anomaly and symptoms correctly. So therefore replaying and labeling the
data is crucial for the refinement process.

Thomas Graf
We need to make sure when refining the system that previously recognized
anomalies are still reported with the same accuracy, to avoid making the
system worse.

Rob Wilton
So unit testing of anomaly detection.

Thomas Graf
Exactly!
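
The "unit testing" idea from this exchange could look roughly like the
sketch below: recorded, labeled data is replayed through a refined
detector, and any recording whose verdict no longer matches its label is
reported as a regression. The threshold detector, the recording names
and the values are hypothetical placeholders and not part of the drafts.

```python
from typing import Callable, Dict, List, Sequence, Tuple

# A recording is (recorded metric values, labeled as anomaly?).
LabeledSeries = Tuple[Sequence[float], bool]
Detector = Callable[[Sequence[float]], bool]


def threshold_detector(limit: float) -> Detector:
    """Hypothetical detector: flags a series whose peak exceeds limit."""
    def detect(series: Sequence[float]) -> bool:
        return max(series) > limit
    return detect


def replay(detector: Detector, recordings: Dict[str, LabeledSeries]) -> List[str]:
    """Replay labeled recordings and return the names that regressed.

    A regression is any recording where the detector's verdict differs
    from the stored label.
    """
    return [
        name
        for name, (series, is_anomaly) in recordings.items()
        if detector(series) != is_anomaly
    ]


# Hypothetical labeled recordings kept from earlier incidents.
recordings: Dict[str, LabeledSeries] = {
    "traffic-drop": ([1.0, 9.0, 1.0], True),
    "normal-day": ([1.0, 2.0, 1.0], False),
}
print(replay(threshold_detector(5.0), recordings))   # no regressions
print(replay(threshold_detector(10.0), recordings))  # misses "traffic-drop"
```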

Benoit Claise
Question on the maturity of the presented anomaly detection documents.
The first draft was refined over two IETF hackathons, the second over
one IETF hackathon. Would you agree that these anomaly detection
documents are less mature than the previously presented YANG-Push to
Apache Kafka integration document?

Thomas Graf
The Network Anomaly Detection architecture itself is quite mature since
we now have 18 months of experience in a production network, while the
work on labeling and the life cycle is something we have been working on
recently. The main point is that we want to gather information from
other operators to define a general solution.

Vincenzo Riccobene
In addition to what Thomas said in the presentation, I would like to add
that the life cycle part is very useful in the active machine learning
area. The ability to create labels from both humans and AI is something
we are currently exploring as part of this experiment. It is going to be
crucial to work with real data from operators.

Luis Contreras
Interesting, especially the categorization in labels. Wondering how far
we can go with this work by integrating other data from other sources
such as active probing or Syslog. I have been talking with my colleagues
in operations who wanted to perform correlation between those data
sources, to detect impact and do forecasting. How far can we go in the
direction of integrating other sources?

Thomas Graf
The architecture is open to adding other use cases. Actually, right now
Alex Huang Feng from INSA Lyon university is doing an internship at NII
to explore how this architecture can be applied not only to L3 VPN
services but also to Internet use cases with BGP AS numbers. The
architecture appears to be very flexible.

Incident Management

Goals:

Incident Management for Network Services

I-D: draft-feng-nmop-network-incident-yang
Presenter: Qin Wu

Mahesh Jethanandani
Confused between Anomaly Detection and Network Incident Management.
What's the difference between the two? Who is feeding data to whom? Or
even exchanging data? How do they fit in the stack? Are the two drafts
even talking to each other on how the data is supposed to be exchanged?

Qin Wu
Symptom data can generate incident data. There is a section in the draft
describing this, symptom and related work.

Thomas Graf
At the bottom of the diagram you see alerts, referring to RFC 8632.
Alerts can be generated from network nodes or from a network anomaly
detection system observing network metrics. Qin is describing the
handling of network incidents further up on the diagram.

Common vocabulary for incident management

I-D: draft-davis-nmop-incident-terminology
Presenter: Nigel Davis (onsite)/Adrian Farrel (remote)

Rob Wilton
Have not had time to review yet. Intend to do so later. My only concern
is which other groups are also defining terminology. Just TMF? Or
others?

Nigel Davis
There are other groups than TMF as well. TMF is just a starting point.
This is important if we are trying to bridge with other groups.

Rob Wilton
Bridging with other groups goes to the chairs. May need a liaison
statement to be sent to different groups to make them aware of this
work.

Benoit Claise
I need a list of organisations we need to liaise with. Nigel/Qin, please
help with that.

Rob Wilton
In terms of dependencies, I don't want this terminology work to delay
other work. It can be adopted and worked on for 1 or 2 years to get
reasonably good terminology, and be kept open until just before last
call so that other work can refine the terminology at a later stage.

Nigel Davis
Agrees. Run in parallel, expect to be iterative with other documents.

Thomas Graf
Good work. It will help us a lot when writing other documents. Please
add the term "symptoms". Maybe do not limit it only to "resources" which
can generate events.

Nigel Davis
Happy to expand it. Thank you.

Mahesh Jethanandani
I just did a quick search to see if "Anomaly Detection" is defined.
Since we are talking about Anomaly Detection in other documents, I
suggest adding it to the terminology here.

Nigel Davis
Makes sense.

Qin Wu
At OPSAWG, draft-ietf-opsawg-discardmodel is being discussed. Unsure
whether its terminology is also covered by this document.

Discussion

Benoit Claise
We have to acknowledge that there is strong support in the OPSAWG
adoption call for draft-feng-opsawg-incident-management. It's a new work
item, where the three drafts are correlated. We see that in all
discussions people are already using different terms, such as the
symptom (false positive or not). Mahesh already asked what the
relationship between the documents is and what the differences between
anomalies and incidents are. Therefore the chairs believe that these 3
documents are like a package of 3 different drafts. They should be
linked together, and the authors should be working together. Qin
requested a design team. Since this is a brand new working group, we
believe it is better that the authors work actively together and
organize themselves instead. On the liaison input from Rob: we want to
send a liaison request on incident management to other SDOs. Well
summarized?

Rob Wilton
Yes. We talked about detecting and categorising incidents. Is resolution
also in scope now, or at a next stage?

Benoit Claise
This is foundation work for the next step.

Rob Wilton
Pleased to have it small now. Stay restricted for now and do resolution
later. I am asking because of the terminology since this will be
important there.

Benoit Claise
Good point. At the end this leads to closed loop. We need to keep that
in mind for terminology? It's open for discussion.

Nigel Davis
Fully agree. Closed loop is critical at the end of the day. There are
many levels of loops here. We also need to consider the definition of a
service itself. There are things which can be fixed quickly, whereas
others might need several iterations.

Olga Havel
Did you intentionally focus only on "resource", or did you leave out
"service" on purpose due to the discussions this will lead to?

Nigel Davis
Service was originally covered. In order to cut complexity in the
document we removed it for the moment, since problems mostly occur at
the resource level.

Chat messages:

Alex Huang Feng: Supports that the terminology remains in a separate
document so that it can be referenced.

Digital Map

Goals:

Problem Space and Modeling Issue of the Digital Space

I-Ds: draft-ogondio-nmop-isis-topology & draft-havel-nmop-digital-map
Presenters: Oscar Gonzalez de Dios (remote) & Olga Havel (remote)

Discussion

Italo Busi
Good to hear that you are starting to think about this. I would like to
mention that it does not preclude RFC 8795. I suggest including RFC 8795
in your investigation because it does the augmentations on every layer.
Reviewing RFC 8795 to see whether it can be reused here would avoid
inconsistency on traffic engineered networks.

Olga Havel
Thanks. We want to look at TE and non-TE enabled networks. When we
started we opted for RFC 8345 (YANG Data Model for Network Topologies)
because it is simpler than RFC 8795 (YANG Data Model for TE Topologies),
which is a rather complicated model with a lot of traffic engineering
details.

Robert Wilton
A comment on the process. RFC 8345 was standardized in I2RS in the
routing area. This document is in the ops area which is correct I
believe, looking at Mahesh who will take over as an AD. Obviously we
need to flag it so that RTG WG, TEAS and LSR can have a look at it.
Having a good coordination early on. I leave it to Mahesh to do that on
AD level.

Italo Busi
I agree RFC 8795 is rather complex. But I believe it is a good starting
point for putting all the requirements together. The challenge is to
collect all the requirements in which RFC 8795 can help. Of course by
reducing the amount of requirements you are also simplifying the
solution. There is always a trade off between many augmentations or a
big model.

Olga Havel
We need to consider backward compatibility also. If you used RFC 8795 as
a core topology model, then all models which augment RFC 8345 wouldn't
be backward compatible. Maybe Oscar can comment on that.

Oscar Gonzalez de Dios
The core concepts of layering in RFC 8795 might be applicable. I don't
know what the perfect solution is.

Robert Wilton
Just wanted to comment on whether it should be one large model or many
augmentations. Augmentations are more flexible but can also introduce
complexity in terms of consistency. We should try to fix inconsistencies
and make augmentations work.

Olga Havel
That's why we are doing all the classification, to understand what the
augmentations are for: linking the inventory or performance measurement,
for instance. Others are more generic, like VPN, which is service
related.

Robert Wilton
This is exactly what this working group is about: when operators are
deploying these models, identifying issues and smoothing out corners. It
is very welcome that this working group is working on these kinds of
problems.

Daniele Ceccarelli
I understood that you are keeping RFC 8795 out of scope for two reasons:
because it is complex and because it is TE. The complexity aspect might
be arguable. But please also consider that RFC 8795 also covers non-TE
aspects. It would be wise to consider it for this aspect. I liked that
you stressed simplicity and the need to be backward compatible with RFC
8345, which is important since RFC 8345 is deployed widely. There might
be some additions in RFC 8795 which might not be backward compatible.
That is a separate discussion where Nigel could contribute.

Olga Havel
I believe what Nigel defined in his document is fully backward
compatible with RFC 8345.

Italo Busi
I would like to comment on whether RFC 8795 is backward compatible with
RFC 8345 or not. To my understanding RFC 8795 adds an attribute and
therefore doesn't prevent being used together with RFC 8345 at the same
time, by doing what we call multi-inheritance.

Olga Havel
I don't think the issue here is the attributes. I believe the issue is
the topological entities: for example, if you have some layers modeled
in the core and some layers modeled in the augmentation, or cross-domain
modeled in the augmentation and single-domain in the core, with existing
drafts and RFCs augmenting the core and not TE.

Italo Busi
Those drafts are not updating new topological elements. If you need to
upgrade those topological elements you need to update those drafts.

Olga Havel
Agree that the issue is not backward compatibility. It is more in terms
of adding features.

Benoit Claise
I think the key message is the one Rob mentioned. We have been known at
the IETF for independent YANG modules. With IETF network topology we
have a chance to make all these YANG modules work together.

Med
For the next steps for this work, let's exercise the proposed plan shown
by Olga.

Chat messages:
Vishnu Beeram: RFC8795 though it says TE topology model, it is in
essence a TE aware topology.

Collecting Updated Operator Requirements for IETF Network Management Solutions

Goals:

An Update of Operators Requirements on Network Management Protocols and Modelling

Joe Clark
Fantastic. Long overdue. I might be pessimistic. You did summarize the
status but missed the SNMP MIB development aspect. It took too long to
develop SNMP MIBs, which led to proprietary MIBs. Exactly the same is
happening now with YANG modules. At the IETF NOC, Bill Fenner joked the
other day: "Oh well, let's use the old SNMP instead." Based on my
experience working at Cisco TAC, I believe this is driven by application
needs. We need an ecosystem where these YANG modules are being used for
very practical purposes. What I would like to see here is suggestions
describing to the open-source communities and vendors how these
capabilities could be used, making it as easy as possible. We need to do
a better job of being agile and making the YANG modules easier to
augment. I love the work. Happy to help.

Qin Wu
To augment Joe's comments: good work. It's time to revisit the
requirements. What should be the audience for this draft? Network
operators, network vendors, test equipment vendors?

Luis Miguel Contreras Murillo
Clearly practitioners of the YANG modules. Probably mainly operators.

Rob Wilton
Do you want to reach out to a wider operator community? Who is reaching
out? What is the scope of people who are feeding input? Different
audiences have different views. Is the IAB involved? Based on which
criteria should we select? Question to Mahesh and the chairs.

Luis Miguel Contreras Murillo
We want to reach out to other venues and SDOs to collect input to
understand their vision.

Rob Wilton
I visited RIPE NCC and was surprised that there was nothing related to
network management scheduled on the agenda.

Rüdiger Volk
Would like to contribute the view of someone who participated in the IAB
workshop (RFC 3535). The direct outcome was NETCONF. YANG came later.
There was a disconnect between network management people and network
operators. Therefore the audience which contributes is key. Good luck.
Things look much better today than back then.

Benoit Claise
What is the scope of the document? Is the focus solely on NETCONF,
RESTCONF and YANG, or are you, the network operators, also considering
AI, as in LLMs (Large Language Models)?

Luis Miguel Contreras Murillo
I think the focus is automation. YANG is the tool that we use for
automation today. Could lead into something else.

Benoit Claise
Do we have the right audience? To reverse Qin's previous question: who
do we need to ask for input? And to come back to Rob, I don't think we
have enough operators here today to define the requirements. Therefore
we need to reach out to other SDOs like RIPE NCC or NANOG.

Rüdiger Volk
Nice to have YANG and mapping of the network. In the end it is the
tooling that matters.

Chat messages:

Med: Answer to Joe: Some of your points are addressed here:
https://github.com/boucadair/rfc3535-20years-later/issues

Dhruv Dhody: One more option is to propose a new IAB workshop.

Adjourn