Last Call Review of draft-ietf-opsawg-service-assurance-architecture-11
Request | Review of | draft-ietf-opsawg-service-assurance-architecture |
Requested revision | No specific revision (document currently at 13) | |
Type | Last Call Review | |
Team | Security Area Directorate (secdir) | |
Deadline | 2022-11-20 | |
Requested | 2022-11-06 | |
Authors | Benoît Claise , Jean Quilbeuf , Diego Lopez , Daniel Voyer , Thangam Arumugam | |
I-D last updated | 2022-11-20 | |
Completed reviews |
Artart Last Call review of -11 by Bron Gondwana
Tsvart Last Call review of -11 by Mirja Kühlewind
Genart Last Call review of -11 by Paul Kyzivat
Secdir Last Call review of -11 by Christian Huitema
Secdir Telechat review of -12 by Christian Huitema |
Assignment | Reviewer | Christian Huitema |
State | Completed | |
Request | Last Call review on draft-ietf-opsawg-service-assurance-architecture by Security Area Directorate Assigned | |
Posted at | | |
Reviewed revision | 11 (document currently at 13) | |
Result | Has nits | |
Completed | 2022-11-20 |
I have reviewed this document as part of the security directorate's ongoing effort to review all IETF documents being processed by the IESG. These comments were written primarily for the benefit of the security area directors. Document editors and WG chairs should treat these comments just like any other last call comments.

This document proposes an architecture implementing Service Assurance for Intent-Based Networking (SAIN). The architecture defines a "service assurance graph", which is decomposed into components. The graph is a directed graph, in which the root is the service to assure, and edges lead to the components or subservices on which a service or a component depends. The stated goal is to efficiently verify whether a service is working as intended by following the graph and examining the state of each dependency.

The graph is not guaranteed to be free of cycles or "circular dependencies", which the document proposes to manage by promoting each cycle to a virtual component, and replacing edges between cycle components with edges starting at the virtual component. The document defines operations on the graph, maintenance of component states, and how to mark components as unavailable during maintenance. The operations assume that components have synchronized clocks.

Writing security considerations for an architecture like this is challenging, because the architecture itself is rather abstract. Figure 1 describes multiple SAIN agents each managing components and collecting metrics, obtaining configuration data from a SAIN orchestrator, feeding health status to a SAIN collector, with the collector providing data to the service orchestrator, and the service orchestrator interacting with the SAIN orchestrator and with the network itself. In theory, each of the edges of the graph in Figure 1 could be subject to attacks, such as denial of service, spoofing, etc.
For example, network components could deliver incorrect metrics to the SAIN agents, the SAIN agents could report incorrect statuses, the configurations managed by the orchestrator could be wrong, the communication lines between components may be severed, etc. All these potential threats have different possible consequences. At this level of abstraction, the recommendations will have to be high level, but they should provide enough guidance for the developers of the various modules.

The security considerations section of this document makes a series of recommendations:

* securing the various SAIN agents, because a compromised agent could inject false information in the system.
* using SSH or TLS when updating the configuration of devices.
* balancing the risk of exposing too much configuration information, which would enable third parties to understand and "efficiently attack" the system, against the risk of not exposing enough and being unable to address some issues.
* acknowledging that "a lying device or compromised agent could trigger partial reconfiguration of the service or network".

On the first point, the document says that "the SAIN agents must be secured", but does not say how. It would be nice if this were developed.

On the second point, mentioning SSH or TLS is nice but very generic. What kind of credentials should SAIN agents provide or check? What kind of permissions should they be granted?

The third point is a recurring issue with automation of management, diagnostics, etc. Management is easier if there is enough data available to describe and understand a whole system, but the same data could be used by attackers to understand how to efficiently sabotage that system. There are various kinds of plausible mitigations. For example, it could be argued that some data is already public, available for example in user manuals of network components, and that codifying it will improve management without increasing the attack surface.
But that's not always the case, and there are other cases in which fully exposing configuration details will definitely facilitate attacks. There may be other mitigations, such as access control on configuration data. It would be very nice if the architecture document provided clear guidance for future deployments.

The fourth point boils down to throwing in the towel, as in "[if devices lie] The SAIN architecture neither augments nor reduces this risk." The service assurance, at a minimum, could detect anomalies, as in "service X depends on devices Y and Z; the service X is not functional, yet Y and Z both report correct behavior; hence, one or several of those devices may be in a bad state." This may well be some form of future work, but flagging the issue would be useful.

Reading the document, I found other issues that might affect security of operation. The operation requires receiving streams of metric values, or repeated polling for these values. What happens if DoS attacks slow down or prevent the arrival of metric data? Section 3 mentions that "The SAIN architecture requires time synchronization, with Network Time Protocol (NTP) [RFC5905] as a candidate, between all elements". What happens if the network time service is compromised?

Finally, a consideration based on experience with the Windows Diagnostic system, which similarly used graphs of dependencies to answer questions like "why is my Wi-Fi not connecting" or "why can I not read this web site?" The system would conduct a series of tests based on dependency analysis, very much like what is envisaged here. It was an improvement over the previous state of error diagnostics, but it was not perfect. Such systems can fail in frustrating ways if part of the automation is missing, when some tests are not available, when some metric data cannot be collected, or when the description of dependencies is incomplete.
They can also become very slow if the description of dependencies is too extensive, leading to too many tests lasting too long. The dependency graph needs to be curated over time, and that curation probably should be described in the architecture.
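
To make the anomaly check suggested under the fourth point concrete, here is a minimal Python sketch. It is only an illustration under stated assumptions: the names `find_suspects`, `deps`, and `healthy` are hypothetical and do not come from the SAIN draft, and health is reduced to a single boolean per node. The function flags every node that is reported as not functional while all of its direct dependencies report correct behavior, which is the "service X depends on devices Y and Z" situation described above.

```python
from typing import Dict, List


def find_suspects(deps: Dict[str, List[str]],
                  healthy: Dict[str, bool]) -> Dict[str, List[str]]:
    """Return {node: dependencies} for every unhealthy node whose direct
    dependencies all claim to be healthy (hypothetical helper, not SAIN API)."""
    suspects = {}
    for node, children in deps.items():
        if not children:
            continue  # leaf node: there is nothing to cross-check against
        if not healthy[node] and all(healthy[c] for c in children):
            # The reported states are inconsistent: at least one of the
            # dependencies may be lying or may be in an undetected bad state.
            suspects[node] = list(children)
    return suspects


# Service X depends on devices Y and Z; X is down, yet Y and Z report OK.
deps = {"X": ["Y", "Z"], "Y": [], "Z": []}
healthy = {"X": False, "Y": True, "Z": True}
print(find_suspects(deps, healthy))  # {'X': ['Y', 'Z']}
```

A check like this cannot say which dependency is at fault; it only flags that the reported states are mutually inconsistent and deserve further investigation.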