Last Call Review of draft-ietf-opsawg-service-assurance-architecture-11
review-ietf-opsawg-service-assurance-architecture-11-secdir-lc-huitema-2022-11-20-00

Request Review of draft-ietf-opsawg-service-assurance-architecture
Requested revision No specific revision (document currently at 13)
Type Last Call Review
Team Security Area Directorate (secdir)
Deadline 2022-11-20
Requested 2022-11-06
Authors Benoît Claise, Jean Quilbeuf, Diego Lopez, Daniel Voyer, Thangam Arumugam
I-D last updated 2022-11-20
Completed reviews Artart Last Call review of -11 by Bron Gondwana (diff)
Tsvart Last Call review of -11 by Mirja Kühlewind (diff)
Genart Last Call review of -11 by Paul Kyzivat (diff)
Secdir Last Call review of -11 by Christian Huitema (diff)
Secdir Telechat review of -12 by Christian Huitema (diff)
Assignment Reviewer Christian Huitema
State Completed
Request Last Call review on draft-ietf-opsawg-service-assurance-architecture by Security Area Directorate Assigned
Posted at https://mailarchive.ietf.org/arch/msg/secdir/qb_wM38vil7U9RI_IakoafOI-q4
Reviewed revision 11 (document currently at 13)
Result Has nits
Completed 2022-11-20
I have reviewed this document as part of the security directorate's ongoing
effort to review all IETF documents being processed by the IESG. These comments
were written primarily for the benefit of the security area directors. Document
editors and WG chairs should treat these comments just like any other last call
comments.

This document proposes an architecture implementing Service Assurance for
Intent-Based Networking (SAIN). The architecture defines a "service assurance
graph", which is decomposed into components. The graph is a directed graph, in
which the root is the service to assure, and edges lead to the components or
subservices on which a service or a component depends. The stated goal is to
efficiently verify whether a service is working as intended by following the
graph and examining the state of each dependency. The graph is not guaranteed
to be free of cycles or "circular dependencies", which the document proposes to
manage by promoting each cycle to a virtual component, and replacing edges
between cycle components with edges starting at the virtual component. The
document defines operations on the graph, maintenance of component states, and
how to mark components as unavailable during maintenance. The operations assume
that components have synchronized clocks.
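
As an illustration of the cycle handling summarized above, here is a minimal
Python sketch. It is one possible reading, not the draft's algorithm: it finds
cycles as strongly connected components (Tarjan's algorithm), adds a virtual
node with an edge to each member, and drops the edges between members.
Redirecting incoming edges from outside the cycle to the virtual node is my
assumption, and the names promote_cycles and "cycle(...)" are hypothetical.

```python
def strongly_connected_components(graph):
    """Tarjan's algorithm. graph maps a node to the nodes it depends on."""
    index, low, stack, on_stack, result = {}, {}, [], set(), []

    def visit(node):
        index[node] = low[node] = len(index)
        stack.append(node)
        on_stack.add(node)
        for dep in graph.get(node, ()):
            if dep not in index:
                visit(dep)
                low[node] = min(low[node], low[dep])
            elif dep in on_stack:
                low[node] = min(low[node], index[dep])
        if low[node] == index[node]:
            component = set()
            while True:
                member = stack.pop()
                on_stack.discard(member)
                component.add(member)
                if member == node:
                    break
            result.append(component)

    for node in list(graph):
        if node not in index:
            visit(node)
    return result


def promote_cycles(graph):
    """Return a new graph in which each multi-node cycle has a virtual parent."""
    owner = {}  # cycle member -> name of its (hypothetical) virtual component
    for component in strongly_connected_components(graph):
        if len(component) > 1:
            virtual = "cycle(" + "+".join(sorted(component)) + ")"
            for member in component:
                owner[member] = virtual

    result = {node: set() for node in graph}
    for node, deps in graph.items():
        for dep in deps:
            result.setdefault(dep, set())
            if dep in owner and owner.get(node) == owner[dep]:
                continue  # edge between two members of the same cycle: drop it
            # Assumption: edges entering a cycle are redirected to the virtual node.
            result[node].add(owner.get(dep, dep))
    for member, virtual in owner.items():
        result.setdefault(virtual, set()).add(member)
    return result


example = {"service": ["a"], "a": ["b"], "b": ["a", "c"], "c": []}
promoted = promote_cycles(example)
```

In this example, "a" and "b" form a cycle, so "service" ends up depending on
the virtual node, which in turn depends on "a" and "b". Self-dependencies
(a node listing itself) are left untouched by this sketch.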

Writing security considerations for an architecture like this is challenging,
because the architecture itself is rather abstract. Figure 1 describes
multiple SAIN agents each managing components and collecting metrics, obtaining
configuration data from a SAIN orchestrator, feeding health status to a SAIN
collector, with the collector providing data to the Service orchestrator, and
the service orchestrator interacting with the SAIN orchestrator and with the
network itself. In theory, each of the edges of the graph in figure 1 could be
subject to attacks, such as denial of service, spoofing, etc. For example,
network components could deliver incorrect metrics to the SAIN agents, the SAIN
agents could report incorrect statuses, the configurations managed by the
orchestrator could be wrong, the communication lines between components may be
severed, etc. All these potential threats have different possible consequences.
At this level of abstraction, the recommendations will have to be high level,
but they should provide enough guidance for the developers of the various
modules.

The Security Considerations section of this document makes a series of
recommendations:

* securing the various SAIN agents, because a compromised agent could inject
  false information into the system.
* using SSH or TLS when updating the configuration of devices.
* balancing the risk of exposing too much configuration information and
  enabling third parties to understand and "efficiently attack" the system,
  versus not exposing enough and being unable to address some issues.
* acknowledging that "a lying device or compromised agent could trigger
  partial reconfiguration of the service or network".

On the first point, the document says that "the SAIN agents must be secured",
but does not say how. It would be nice if this was developed.

On the second point, mentioning SSH or TLS is nice but very generic. What kind
of credentials should SAIN agents provide or check? What kind of permissions
should they be granted?

The third point is a recurring issue with automation of management, diagnostic,
etc. Management is easier if there is enough data available to describe and
understand a whole system, but the same data could be used by attackers to
understand how to efficiently sabotage that system. There are various kinds of
plausible mitigations. For example, it could be argued that some data is
already public, available for example in user manuals of network components,
and that codifying it will improve management without increasing the attack
surface. But that's not always the case, and there are other cases in which
fully exposing configuration details will definitely facilitate attacks. There
may be other mitigations, such as access control on configuration data. It
would be very nice if the architecture document provided clear guidance for
future deployments.

The fourth point boils down to throwing in the towel, as in "[if devices lie] The
SAIN architecture neither augments nor reduces this risk." The service
assurance, at a minimum, could detect anomalies, as in "service X depends on
devices Y and Z; the service X is not functional, yet Y and Z both report
correct behavior; hence, one or several of those devices may be in a bad
state." This may well be some form of future work, but flagging the issue would
be useful.
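
To make the suggested anomaly check concrete, here is a minimal Python sketch.
It is only an illustration of the "X is broken but Y and Z report healthy"
reasoning above, not something defined in the draft. The function name
find_inconsistencies and the boolean health map are assumptions.

```python
def find_inconsistencies(dependencies, healthy):
    """Flag nodes that are unhealthy although all their dependencies report healthy.

    dependencies: maps a node to the list of nodes it depends on.
    healthy: maps a node to True (reports correct behavior) or False.
    Returns a map from each inconsistent node to its suspect dependencies.
    """
    suspects = {}
    for node, deps in dependencies.items():
        if deps and not healthy[node] and all(healthy[dep] for dep in deps):
            # Either a dependency is misreporting, or the graph is incomplete.
            suspects[node] = sorted(deps)
    return suspects


dependencies = {"X": ["Y", "Z"], "Y": [], "Z": []}
reports = {"X": False, "Y": True, "Z": True}
flagged = find_inconsistencies(dependencies, reports)  # {"X": ["Y", "Z"]}
```

A flagged entry does not prove that a device is lying. It may equally mean
that a dependency is missing from the graph, which is why the result lists
suspects rather than culprits.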

Reading the document, I found other issues that might affect security of
operation. The operation requires receiving streams of metric values, or
repeated polling for these values. What happens if DoS attacks slow down or
prevent the arrival of metric data? Section 3 mentions that "The SAIN
architecture requires time synchronization, with Network Time Protocol (NTP)
[RFC5905] as a candidate, between all elements". What happens if the network
time service is compromised?
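
One way an implementation might guard against both failure modes is to treat
late or implausibly timestamped samples as a distinct condition instead of
silently reusing old values. The Python sketch below is only an illustration.
The function name classify_sample and the thresholds max_age and max_skew are
hypothetical and do not come from the draft.

```python
def classify_sample(sample_time, now, max_age=30.0, max_skew=5.0):
    """Classify a metric sample by its timestamp, in seconds.

    sample_time: timestamp carried by the sample.
    now: local time at the receiver.
    max_age, max_skew: hypothetical thresholds, to be tuned per deployment.
    """
    if sample_time - now > max_skew:
        # A timestamp from the future suggests a clock problem on one side.
        return "clock-suspect"
    if now - sample_time > max_age:
        # No recent data: delayed or blocked delivery, possibly a DoS.
        return "stale"
    return "fresh"
```

A collector could then report a "stale" or "clock-suspect" subservice as
having unknown health, rather than as healthy or degraded.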

Finally, a consideration based on experience with the Windows Diagnostic
system, which similarly used graphs of dependencies to answer questions
like "why is my Wi-Fi not connecting" or "why can I not read this web site?"
The system would conduct a series of tests based on dependency analysis, very
much like what is envisaged here. It was an improvement over the previous state
of error diagnostics, but it was not perfect. Such systems can fail in
frustrating ways if part of the automation is missing, when some tests are not
available, when some metric data cannot be collected, or when the description of
dependencies is incomplete. They can also become very slow if the description
of dependencies is too extensive, leading to too many tests lasting too long.
The dependency graph needs to be curated over time, and that curation probably
should be described in the architecture.