Last Call Review of draft-ietf-suit-architecture-11
review-ietf-suit-architecture-11-tsvart-lc-briscoe-2020-08-09-00

Request Review of draft-ietf-suit-architecture
Requested rev. no specific revision (document currently at 16)
Type Last Call Review
Team Transport Area Review Team (tsvart)
Deadline 2020-08-14
Requested 2020-07-24
Authors Brendan Moran, Hannes Tschofenig, David Brown, Milosch Meriac
Draft last updated 2020-08-09
Completed reviews Tsvart Last Call review of -11 by Bob Briscoe
Genart Last Call review of -11 by Theresa Enghardt
Secdir Last Call review of -11 by Rich Salz
Iotdir Telechat review of -13 by Mohit Sethi
Assignment Reviewer Bob Briscoe 
State Completed
Review review-ietf-suit-architecture-11-tsvart-lc-briscoe-2020-08-09
Posted at https://mailarchive.ietf.org/arch/msg/tsv-art/hpal4eaJGNGPhmZ-JTU7GPawZu0
Reviewed rev. 11 (document currently at 16)
Review result Ready with Issues
Review completed: 2020-08-09

Review

This document has been reviewed as part of the transport area review team's
ongoing effort to review key IETF documents. These comments were written
primarily for the transport area directors, but are copied to the document's
authors and WG to allow them to address any issues raised and also to the IETF
discussion list for information.

When done at the time of IETF Last Call, the authors should consider this
review as part of the last-call comments they receive. Please always CC
tsv-art@ietf.org if you reply to or forward this review.

This review is long. For the benefit of busy readers, it is structured with 7 important issues listed first (and tagged either as technical or editorial), followed by minor editorial comments for the authors.

Although it is ostensibly from the Transport Area Review Team, this review identifies only one transport-related issue (see item #6a). Most of the major discussion points are offered with a security hat on.

First I want to say that there's a lot of useful stuff in the draft. So I'd like to apologize that the review comments raise issues, and do not dwell on praising all the good stuff.


== Important Issues ==

1. Motivation for publication by the IETF [Editorial]

Until I reached the summary of the recent IoT IAB workshop in the first para of the Security Considerations section, I was wondering why the IETF needed to publish this. It seemed to be a description of what is already done in the industry, but framed as an architecture. Most of this first para of the Security Considerations section motivates this work, and ought to be moved to the Introduction.

Even then, a document that describes what the industry already does isn't a sufficient response to a security problem. Given (I believe) the intention is to encourage the industry to systematically cater for firmware updates, perhaps the draft needs to be a little more hard-hitting (without being patronizing of course), rather than giving the impression (except in the abstract) that it is just describing current industry practice. For instance, see item #2 below about saying what not to do. I would also suggest that it should highlight the simplest architecture, only giving optional more complex extras later (see item #4 below).

2. Is Anything Not Allowed by this Architecture? [Technical+Editorial]

a) A good architecture precludes as well as includes. Would it be useful to list some common practices that are insecure, and perhaps some common misconceptions about secure firmware update?

b) I could hardly find anything in this draft that did not equally apply to firmware update of "Non-Things". It would indeed be useful to define a 'Thing' (at least what this document means by it). I suggest:
* unattended operation
* not within the operator's physical security control

c) On the subject of ruling things out, I felt the list of items ruled out of scope in the Security Considerations includes some items that are so central to IoT that they should not have been ruled out of scope, and in the first two cases quoted below, they didn't need to be ruled out of scope, because the document addresses them:
"
  - installing firmware updates in a robust fashion so that the update 
    does not break the device functionality of the environment this
    device operates in.
  - the distribution of the actual firmware update, potentially in an 
    efficient manner to a large number of devices without human 
    involvement
  - energy efficiency and battery lifetime considerations.
"
And, wouldn't it be better to move scoping statements to just after the Intro, rather than in Security Considerations?
(And, yes, I know that not all Things are energy-challenged, but the size of the subset that are is significant.)


3. Relying on Software with Security Vulnerabilities to Patch Security Vulnerabilities [Technical]

The Intro only mentions 'software updates' generally, and doesn't explicitly mention patching security vulnerabilities (although the abstract does). Only on reading the Security Considerations section did I discover that the draft is primarily meant to be about patching firmware vulnerabilities.

That raises the question of how secure it is to download new firmware from a device booted from firmware that is potentially already compromised. As a minimum, surely the draft needs to mention this point. And preferably:
* whether anything can be trusted once firmware is compromised, and if so what.
* whether it is still worth updating firmware, even once a vulnerability in the firmware update process has been identified, given:
  o identification of a vulnerability does not necessarily imply it has been exploited, or not prevalently exploited
  o a vulnerability might not make the firmware update process itself vulnerable (with an explanation of how to tell)
* which aspects of the firmware update process need to be run within a TEE (and which, if any, need not)
* whether the TEE should lock the device against booting if a firmware authentication or integrity check fails
  o how to prevent tampering with firmware integrity from itself being used as an attack, e.g. 
    - by ensuring that, once a device is locked against booting, firmware re-update is never completely disabled
    - by ensuring firmware updates are not immediately retried without an exponentially increasing timer back-off, otherwise retries could lead to the devices flooding their own network with fruitless update traffic.
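
To illustrate the last sub-bullet, here is a minimal sketch of a capped exponential back-off. The function name, the 60 s base delay and the one-day cap are my own illustrative assumptions, not values from the draft. A real device would probably also add random jitter so that a population of devices does not retry in lock-step.

```python
def retry_delay_seconds(failed_attempts: int,
                        base: float = 60.0,
                        cap: float = 86400.0) -> float:
    """Delay before the next update attempt after consecutive failures.

    `base` and `cap` are illustrative values only (assumptions).
    """
    if failed_attempts < 1:
        return 0.0
    # Clamp the exponent so that very large failure counts cannot overflow.
    exponent = min(failed_attempts - 1, 32)
    # Double the delay on each further failure, but never exceed the cap:
    # retries slow down, yet re-update is never completely disabled.
    return min(cap, base * 2 ** exponent)
```

The cap is what keeps the first sub-bullet satisfied: however many attempts fail, the device still retries at least once per `cap` seconds.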


4. Please Focus More on the Simplest Architecture [Technical]

All the following increase system complexity, but are not /essential/ for strong security:
a) Status Tracking Per Device 
b) Confidentiality of the firmware binary
c) Robustness against rendering the device unbootable
d) Supporting both Message Authentication and Object Authentication (see item #5)
e) Broadcast Friendly (see item #6)

This draft is meant to be persuading the 'industry of Things' to provide built-in secure firmware update. It tends to fall into the common trap of setting the security bar so high that practitioners might give up in despair.

a) Per-device status tracking certainly might be preferred by many operators, but the alternative of the operator not knowing the status of each individual device might be acceptable (as in the example in Figure 5). Per-device status tracking introduces the following complexity:
* a need to separately identify each device, both on each device, and in the status tracker.
* a need to securely identify each separate device (to prevent compromised devices masquerading as all the other devices to give a false sense of security), requiring management of separate public or shared keys

b) Confidentiality certainly might provide defence in depth against reverse engineering the binaries, but it is ultimately security by obscurity, and so ultimately optional. By definition (see item #2b) 'Things' are not in a physically secure environment. So, unless all devices decrypt all downloaded binaries within a TEE and store them in tamper-proof memory, once the binaries are stored on each device, they will be accessible to external inspection anyway. So the document should be less dogmatic about confidentiality protection (3rd para of Intro), and at least explain that, with IoT, confidentiality on the wire is moot unless there is also confidential device storage.

c) Robustness against rendering the device unbootable
Often, when I initiate an (attended) firmware update, the OS warns me that this is a sensitive process that could render the device useless if the power fails part-way through. So clearly, this is a cost-tradeoff that device designers are willing to compromise on. Therefore, I don't think the IETF is entitled to pronounce a requirement against this practice. I would rather see this text moved from Requirements to somewhere else in the doc, as a commentary on the implementation issues, rather than stating it as a requirement. Climbing down a bit at the end by saying it is only an implementation requirement doesn't help.



5. Both Message Authentication and Whole Object Authentication? [Technical]

Message authentication codes aren't specifically mentioned, until sections 7 & 8, where they are mentioned as if they might be used, without saying why or how. The document needs to discuss the merits of MACs vs. authentication of the whole manifest and/or the whole firmware binary.

Ultimately, if an object's authenticity and integrity will be verified once it is fully delivered, there is no need for MACs as well. However, using message authentication reduces the risk that the device is talking with an imposter at an early stage in the transmission, rather than having to wait until it is complete. And it is easy to arrange message authentication to cumulatively authenticate the whole object, without additional infrastructure for whole-object verification. Therefore using MACs could avoid the need to provide enough storage for a complete update of the firmware as well as the current version - after verifying the manifest and the first message, the device could even start to overwrite the firmware it is currently booted from. 
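
As a rough illustration of what I mean by arranging message authentication to cumulatively authenticate the whole object, here is a minimal sketch of a chained MAC. It assumes a symmetric key already shared between sender and device, and a 32-byte manifest digest that the device has already verified. The names `chain_tags` and `ChainVerifier` are hypothetical, and nothing here is taken from the draft or the SUIT manifest format.

```python
import hashlib
import hmac


def chain_tags(key: bytes, manifest_digest: bytes, chunks: list[bytes]) -> list[bytes]:
    """Sender side: tag i covers chunk i plus the previous tag.

    Each tag therefore authenticates everything sent so far, and the
    last tag authenticates the whole object.
    """
    tags, prev = [], manifest_digest
    for chunk in chunks:
        prev = hmac.new(key, prev + chunk, hashlib.sha256).digest()
        tags.append(prev)
    return tags


class ChainVerifier:
    """Receiver side: checks each chunk on arrival, before storing it."""

    def __init__(self, key: bytes, manifest_digest: bytes):
        self._key = key
        self._prev = manifest_digest  # fixed length (32 bytes) assumed

    def accept(self, chunk: bytes, tag: bytes) -> bool:
        expected = hmac.new(self._key, self._prev + chunk, hashlib.sha256).digest()
        if not hmac.compare_digest(expected, tag):
            return False  # reject; chain state is left unchanged
        self._prev = expected
        return True
```

With this arrangement an imposter is detected at the first bad chunk rather than after the whole image has been downloaded, and no separate whole-object check is needed once the final chunk is accepted.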

The above strategy would not be without risk, but my point is not just to suggest this particular strategy. The document ought to at least discuss the trade-offs between MACs and whole-object authentication, and whether both are really necessary.


6. Friendly to Broadcast Delivery? [Technical]

Section 3 states this as one of the "Requirements", although the text softens it to "may be desirable for some networks". However, broadcast delivery introduces the three significant problems below, wrt a) reliable transport; b) device energy efficiency; and c) broadcast message authentication.

a) Reliable Broadcast Transport
Delivery of binary objects needs to recover lost or corrupt packets. Reliable broadcast delivery at scale is extremely challenging. It needs either fountain coding [1] or reliable multicast.
* Fountain coding delivers an object in a continually repeating stream and ensures that the data in any missing packet can be reconstructed from data in a subsequent different packet. But this would increase device complexity. 
* For broadcast delivery, per-packet acknowledgements (ACKs) from each device do not scale. Negative ACKs (NACKs) can be used, but they do not scale either: if a loss is experienced close to the root of the broadcast/multicast, it still causes an implosion of NACKs on the sender. Reliable multicast (e.g. PGM [RFC3208]) arranges a spreading tree of delivery nodes, each of which handles NACKs solely from its next-degree downstream neighbours. Clearly this increases network or CDN complexity.
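
To give a feel for the reconstruction idea in the first bullet, here is a minimal sketch using a single XOR parity packet. This is a plain single-erasure code, far simpler than a real fountain code such as the one in [1], and is offered only to show the principle of repairing a loss without feedback to the sender. The function names are hypothetical and equal-length packets are assumed.

```python
from functools import reduce


def xor_bytes(a: bytes, b: bytes) -> bytes:
    """Byte-wise XOR of two equal-length packets."""
    return bytes(x ^ y for x, y in zip(a, b))


def make_repair_packet(packets: list[bytes]) -> bytes:
    """Sender side: one extra packet holding the XOR of all source packets."""
    return reduce(xor_bytes, packets)


def recover_missing(received: dict[int, bytes], repair: bytes, total: int) -> bytes:
    """Receiver side: rebuild the single missing packet, without any NACK."""
    missing = [i for i in range(total) if i not in received]
    if len(missing) != 1:
        raise ValueError("single parity can repair exactly one loss")
    # XORing the repair packet with every received packet leaves the lost one.
    return reduce(xor_bytes, received.values(), repair)
```

Even this toy version shows where the device-side cost comes from: the receiver has to buffer packets and do extra computation per block.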

b) Broadcast Energy Efficiency
If the IoT device is wireless and needs to take care with its energy consumption, it will need to initiate all communications, rather than have to sit with its radio powered up listening for an incoming message. However, of course, it is not possible for each device to independently initiate an incoming broadcast. It would be possible for a broadcast to be scheduled, and for each device to poll for the schedule. But this would add complexity, particularly because all the device clocks would have to be fairly closely synchronized.

c) Broadcast Message Authentication
Message authentication has potential advantages over whole-object authentication (see #5). When MACs are used over unicast, typically the cost of asymmetric crypto for each message is avoided by using asymmetric crypto just once to transmit a shared key, which is then used to verify each MAC. However, that process is only secure for unicast. For broadcast or multicast delivery, the sender only sends each message once, using one key for the MAC that would therefore have to be shared with every receiver. Then any receiver could masquerade as the genuine sender. TESLA is a solution to this [RFC4082], but it would again increase the complexity of each device and the servers, not least because it requires loose clock synch (nonetheless, uTESLA has been implemented for challenged devices [2]).
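
The masquerading problem can be shown in a few lines. This is only a sketch with made-up key and message values: any device that holds the shared group key can produce a tag that every other device will accept as coming from the genuine sender.

```python
import hashlib
import hmac


def mac(key: bytes, message: bytes) -> bytes:
    """Compute an HMAC-SHA256 tag over a message."""
    return hmac.new(key, message, hashlib.sha256).digest()


def verify(key: bytes, message: bytes, tag: bytes) -> bool:
    """Check a tag in constant time."""
    return hmac.compare_digest(mac(key, message), tag)


# One key shared by the sender and every receiver (illustrative value).
group_key = b"shared-by-sender-and-all-devices"

# The genuine sender authenticates a block of firmware.
genuine_block = b"firmware block 1"
genuine_tag = mac(group_key, genuine_block)

# A compromised receiver holds the same key, so it can do exactly the same.
forged_block = b"malicious block"
forged_tag = mac(group_key, forged_block)

# Other receivers cannot tell the two apart: both tags verify.
both_accepted = (verify(group_key, genuine_block, genuine_tag)
                 and verify(group_key, forged_block, forged_tag))
```

TESLA avoids this by having the sender disclose each MAC key only after the messages it protects have been delivered, which is why it needs loosely synchronized clocks.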

Aside regarding broadcast encryption:
In section 3.3. "Use state-of-the-art security mechanisms", it says:
  "The information that is encrypted individually for each device must
  maintain friendliness to Content Distribution Networks, bulk storage,
  and broadcast protocols."
That implies a magic encryption scheme that is beyond any state-of-the-art that I am aware of! If information is encrypted individually for each device, surely by definition it will not be friendly to broadcast protocols. Actually, I suspect the authors did not mean to say "encrypted individually for each device", because a shared group key is adequate for confidentiality - a shared group key is only problematic for message or source authentication (see above).


7. Missing Security Concerns [Technical]

a) Avoiding Reliance on the Device's System Clock

I suggest that the document makes the point that it is preferable for the firmware update process not to rely on the device's system clock.

Reasoning: Even if the TEE maintains the system clock, protection against attacks on this clock relies on voting between multiple time sources. No amount of authentication provides any proof of message timing. So, it is hard for a TEE to protect against tampering with the timing of its messages, given they pass via the untrusted execution environment of the rest of the device, similar to the problem of a secure time source for virtualized functions [3].

I think IoT developers can be reassured that none of the requirements for firmware update need to rely on the system clock. For instance roll-back attack prevention (section 3.4) only requires comparison between version numbers, not comparison between a release time and the clock. 
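
For example, a clock-free roll-back check could be as simple as the sketch below. It assumes the manifest carries a monotonically increasing sequence number and that the device keeps the last installed value in protected storage. The function and parameter names are hypothetical, not taken from the draft.

```python
def should_install(installed_sequence: int, offered_sequence: int) -> bool:
    """Accept an update only if it is strictly newer than what is installed.

    No timestamp and no system clock are involved: the decision depends
    only on two integers, one stored on the device and one taken from the
    (already authenticated) manifest.
    """
    return offered_sequence > installed_sequence
```

Note that this strict comparison also refuses to re-install the current version; whether that is desirable is a policy decision outside the scope of this sketch.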

However, I think not relying on the clock is worth mentioning, because key expiry and key revocation have to be designed carefully to avoid relying on secure time, and this is a subtle point that might not be appreciated by IoT device designers.

b) Key revocation

When keys are in tamper-resistant storage but otherwise not within a physically secure site, the question of revocation surely has to be addressed. In particular, there should be a discussion about the advisability or otherwise of pre-loading the same keys into multiple devices.


== Minor Editorial Issues ==

1. Intro 
  "Updates to the firmware of an IoT device are done to fix bugs in software..."
This would be a good place to highlight the focus on patching security vulnerabilities.

"This version of the document assumes... Future versions may also describe..."
I assume this aspiration needs to be deleted now?

2. Terminology

There are ~22 occurrences of lower case 'must' in this document, and one 'should' (excluding multiple uses in rhetorical questions). I'm not sure whether it is intentional to make it seem like this is an RFC that is mandating behaviour, perhaps for readers who don't understand the subtleties of the IETF informational track. I would prefer it to be clear that this document is not mandating anything, by using alternatives to 'must' like 'ought to' or 'has to'. Otherwise it could be considered disingenuous.


  "The term ’system on chip (SoC)’ is often used for these types of devices."
Perhaps more useful: 
  "The term ’system on chip (SoC)’ is often used interchangeably with MCU, but MCU tends to imply more limited peripheral functions."
  
  "The following entities are used:"
The list is a mix of stakeholders and functions, which tends to show that the authors themselves might not be clear about the distinction. It would be useful to split into two lists.

  "The terms device and
  firmware consumer are used interchangeably since the firmware
  consumer is one software component running on an MCU on the
  device."
I didn't notice them being used interchangeably. If they are anywhere, why not just edit to use whichever term is more appropriate and delete this sentence?

Status Tracker
  "While the IoT device itself runs the client-
  side of the status tracker it will most likely not run a status
  tracker itself unless it acts as a proxy for other IoT devices in
  a protocol translation or edge computing device node."
The client-side of a status tracker surely does run a status tracker itself (the clue is in the name). I know what is intended, but the writer was clearly in two minds as to whether a status tracker is the combination of client and server or just the server. 

3. Requirements

3.5 "High reliability" -> 'Robust against becoming unbootable'.
The title for this requirement otherwise implies a much more general requirement than the description under it.

3.6 Small bootloader
"...again using firmware updates over serial,
USB or even wireless connectivity like a limited version of
Bluetooth Smart."
Don't see why it has to be "...a limited version of...". Suggest these words are deleted.

s/poses a risk in reliability/
 /poses a reliability risk/

s/must fit in the available RAM/
 /must fit in the available memory/
(not necessarily RAM)

s|there are not other task/processing running|
 |there are not other tasks/processes running|

s/unlike it may be the case/
 /unlike that which may be the case/

s/Note: This is an implementation requirement./
 /Note: This last paragraph is an implementation requirement./
(Otherwise, 'this' could ambiguously refer to the whole requirement)

3.7 Small Parsers
"Since parsers are known sources of bugs they must be minimal." 
To be honest, I suspect the target audience will find this sentence and others like it rather pious. Given the purpose of this document is meant to be to encourage implementers to provide secure firmware update, I think these peripheral "requirements" will just serve to make any implementers reading this feel they are being patronized.

As with the earlier requirement about 'robustness against becoming unbootable', I think many of these 'requirements' would be easier to stomach within a discussion of tradeoffs, rather than as a list of pronouncements that demand perfection.

3.8
s/Minimal impact on existing firmware formats/
 /No impact on existing firmware formats/
Reason: This is what the text underneath says.

3.9 Robust permissions

  "...the authorization policy is separated from the
  underlying communication architecture. This is accomplished by
  separating the entities from their permissions."
I'm not sure whether either of these sentences makes much sense (at least not to me). Perhaps the first sentence means to say that 
  "...the authorization policy is separated from the
  firmware it applies to"
And then the second sentence could be deleted. I'm not sure the second sentence would ever be necessary, because entities are always separate from their permissions (otherwise you would have to access an entity to find out you weren't allowed to access it). To be honest, I don't really see the point of the whole requirement. So if it is important, maybe its meaning needs to be clarified for people like me. Otherwise, if it's just stating the obvious, maybe it's not necessary at all.

3.10. Operating modes
Later, in S.5. the term 'delivery modes' is used. If these are meant to mean the same thing, then the same term should be used consistently. In my experience, the term 'interaction model' is used to describe things like polled request-reply, push, publish-subscribe, etc.

"The pre-authorisation step involves verifying..."
When describing a distributed system, pls avoid passive sentences like this, which don't specify which entity is performing the action. It is followed up later by "...the firmware consumer must also...", which implies the subject is the firmware consumer, but it's best not to rely on implication, especially not if it requires two passes to understand.

  "Pushing a manifest and firmware image to the transfer to
  the Package resource of the LwM2M Firmware Update object"
Garbled?

  "...it may need to wait for a trigger from the
  status tracker to initiate the installation, may trigger the update
  automatically, or may go through a more complex decision making
  process to determine the appropriate timing for an update"
I had to read this a few times before realizing it was a list. 
How about:
  "... to initiate the installation, it may either need to wait for a trigger from the
  status tracker; or trigger the update
  automatically; or go through a more complex decision making
  process to determine the appropriate timing for an update"

3.11. 
s/Suitability to software and personalization data/
 /Suitability for software and personalization data/

The document suddenly jumps into a different style at the start of 3.11, more like a log of WG activity than a requirement. Pls consider making the style consistent, especially given it switches back after the first sentence of the 2nd para.


4. Claims
s/Only install firmware with a matching vendor/
 /Only install firmware with a matching author/ ?

5. Communication Architecture

The document often repeats that it's agnostic to the communication architecture, then this section starts with the phrase:
  "Figure 1 shows the communication architecture..."
Perhaps it means 'firmware update architecture'? 
Or, possibly this implies that the authors are in two minds as to what 'communications architecture' means.
Or the heading was intended to be 'Communications Architectures' (plural) and the first phrase was meant to say 
  "Figure 1 shows an example communication architecture..."

The text needs to make it clear that a status tracker is optional in the client pull case but not in the server push case (see item #4a earlier).


It would be useful for the doc to say what it means for an operator circle to enclose a function. For instance the 'Device Operator' in Fig 1 encloses the status tracker, which to me implies it controls the status tracker. However, the network operator encloses the device, which probably doesn't imply it operates the device. Perhaps an enclosing circle means 'within the physical security control of'? The network operator isn't mentioned in the text - why is it in the diagram, given it has no role in the firmware update, other than as a common carrier of opaque bits?

  "The following assumptions are made to allow the firmware consumer to
  verify the received firmware image and manifest before updating
  software:"
The following three bullets aren't really assumptions. Perhaps 'statements about the verification process' would be a better phrase. Would another reference to suit-information-model here be useful, to explain why the details are not given here? 

See item #4b) above about highlighting that confidentiality is optional, not just 'deployment specific'.

  "There are different types of delivery modes, which are illustrated
  based on examples below."
Shouldn't this sentence start section 5? (Also see my earlier point about 'operating modes' / 'interaction modes' terminology).

Fig 3 is inconsistent with Fig 1, in that it omits the firmware consumer function.

Fig 4 is inconsistent with Figs 1 & 3, in that there is also an arrow from the status tracker to the author. What does this imply?

  "This architecture does not mandate a specific delivery mode but a
  solution must support both types."
Whatever for? This requirement surely over-plays the IETF's hand, which is not in a position to make such a demand? Is the intention really that being agnostic to the delivery mode means every solution must support all delivery modes?

6. Manifest

Given each of the items in the second bullet list addresses one of the questions in the first bullet list, it would be useful to tabulate them side-by-side and to put them in a more meaningful order, e.g. in the order they occur during firmware update. Also, the first question bullet (author trust) is not specifically addressed in the second list - implied within the last bullet, but not explicitly stated.

7.1
s/Combined with the non-relocatable nature of the code/
 /Due to the non-relocatable nature of the code/

7.3
  "This configuration has two or more CPUs in a single SoC that share
  memory (flash and RAM). Generally, they will be a protection
  mechanism to prevent one CPU from accessing the other’s memory."
I know what is intended, but it reads as if line 1 contradicts line 3. Perhaps:
 "...
  mechanism to prevent one CPU from unintentionally accessing memory currently allocated to the other."


9. Example

In at least one example figure, it would be useful to show the initial pre-loading of keys, policy logic and trust anchor into the firmware consumer / bootloader.

s/starting with an author uploading the new firmware to firmware server/
 /starting with an author uploading the new firmware to the firmware server/
 
  "This setup does
  not use a status tracker and the firmware consumer component is
  therefore responsible for periodically checking whether a new
  firmware image is available for download."
It needs to be much clearer that the status tracker has both a monitoring function and an update triggering function. So, although it is essential in the server push model - to trigger updates, its monitoring function means it is not ruled out for the client pull model.

Fig 5 & 6 are inconsistent, in that the former omits the IoT device box around the Firmware consumer and bootloader.

s/Figure 6 shows an example follow with the device using a status tracker./
 /Figure 6 shows an example with the device using a status tracker./
 
  "For editorial reasons the author publishing the manifest at
  the status tracker and the firmware image at the firmware server is
  not shown."
How about:
  "Depiction of the author publishing the manifest at
  the status tracker and the firmware image at the firmware server would
  be the same as in Figure 5. So for brevity they are not shown."

11. Security Considerations

Between 
  "A report about this workshop can be found at [RFC8240]." 
and
  "A standardized firmware manifest format..."
there either needs to be some glue text to explain that the initial manifest format was an output of the workshop (if it was), or a new para if the second sentence really doesn't follow from the first.

Note also that I suggest (item #1) that the motivating text about the workshop should be moved to the introduction. I also say (in item 2c) that the scoping bullets would be better at the end of the Intro too. However, I can also see a case for them remaining under Security Considerations; to admit that the document does not fully address all possible security concerns.

Given this could leave nothing in the Security Considerations section, it would be appropriate to merely point to all the sections of the document that already cover security matters.



== References ==
[1] Byers, J.; Luby, M.; Mitzenmacher, M. & Rege, A., "A Digital Fountain Approach to Reliable Distribution of Bulk Data", Proc. ACM SIGCOMM'98, Computer Communication Review, 1998, 28

[2] Perrig, A.; Szewczyk, R.; Wen, V.; Culler, D. E. & Tygar, J. D., "SPINS: Security Protocols for Sensor Networks", Proc. ACM International Conference on Mobile Computing and Networks (Mobicom'01), 2001, 189-199

[3] Briscoe (Ed.), B. & others, "Network Functions Virtualisation; Security; Problem Statement", ETSI NFV Industry Specification Group (ISG), 2014