Early Review of draft-dnoveck-nfsv4-internationalization-01
review-dnoveck-nfsv4-internationalization-01-i18ndir-early-williams-2020-06-01-00

Request Review of draft-dnoveck-nfsv4-internationalization
Requested rev. no specific revision (document currently at 03)
Type Early Review
Team Internationalization Directorate (i18ndir)
Deadline 2020-06-03
Requested 2020-05-25
Requested by Pete Resnick
Authors David Noveck
Draft last updated 2020-06-01
Completed reviews I18ndir Early review of -01 by Nicolás Williams
Comments
Reviewer: Please consult with Pete Resnick on how to do the review.

Background info from the author:

This document is part of an effort to produce an rfc5661bis.  When looking at the work to be done we found that certain areas (e.g. internationalization, security) would be better dealt with on an NFSv4-wide basis. In the case of internationalization, this was because the treatment in rfc5661 was based on that in rfc3530, which was based on stringprep.  That treatment was apparently dictated by the IESG as part of approving rfc3530 and then ignored by implementers.  As a result, we had to completely rework internationalization for NFSv4.0 in RFC7530, while NFSv4.1 remained as it had been, with an internationalization section that implementers rarely looked at and never implemented.

Although this is explained in the draft, I think I need to mention here that the constraints on NFSv4 implementations (in terms of external APIs and compatibility with/access to existing file systems) mean that the IETF and the working group cannot direct the handling of UTF-8, particularly with regard to normalization, in the way normally expected.

To get back to rfc5661bis, we decided to write an NFSv4-wide draft adopting the treatment of RFC7530 for all NFSv4 minor versions.  Then we ran into the problem that RFC7530 had been written assuming use of IDNA2003, so the text of RFC7530 ran into idnits problems since it referred to obsoleted IDNA RFCs.  As a result, the transition to IDNA2008 is one of the major issues that need to be reviewed in draft-dnoveck-nfsv4-internationalization-01.
Assignment Reviewer Nicolás Williams 
State Completed
Review review-dnoveck-nfsv4-internationalization-01-i18ndir-early-williams-2020-06-01
Posted at https://mailarchive.ietf.org/arch/msg/i18ndir/mSSJavkEfxzAEo8kYtxG3F555mk
Reviewed rev. 01 (document currently at 03)
Review result On the Right Track
Review completed: 2020-06-01

Review

I have reviewed draft-dnoveck-nfsv4-internationalization.

In my opinion, this draft is extremely important to the Internet
community and beyond, and should progress.  This being an early review,
perhaps I should stop there.

However, there is an important, long-running, low-volume debate to
finally settle here, and it has to be settled in the I18N community.

The architectures and realities of the relevant operating systems make
it impossible for us to practicably put the onus for I18N on the
filesystem _protocols_.  No, that onus can _only_ live in the
_filesystems_.  I cannot stress this enough.

If you stop reading here, you can take just the above paragraph with you
and consider it carefully.  If you continue reading, please forgive me
for the length of this post.

The document at hand is almost entirely dedicated to convincing the
present audience of the above premise and fact.  Most of the first ten
pages are non-normative text, and when it gets to what happens in
reality... it's essentially still informative rather than normative
text.  The I-D even modifies the meaning of RFC2119 so it can pretend to
be normative while not really being normative, all so it can continue
the fiction that I18N belongs in NFSv4 (and what about WebDAV? and SFTP?
and ...?) and not in the filesystem.

These assertions may cause friction.  Therefore I seek to convince you,
as the author tries as well, but I want to go further: I want to stop
pretending that the filesystem _protocol_ can be responsible for I18N.
Even if this viewpoint ends up on the rough side of consensus, the
running code can. not. change.  Anyone who wishes to argue that we can
only target the protocols and not the filesystems needs to consider this
fact.

The architecture of that running code has been as it is for many decades
-- almost as many decades as there has been an Internet community!

The author gets to the nub of it in section 3, which on pages 5 and 6
says (with marked elisions):

   During the period from the publication of RFC3010 [14] until now, two
   different perspectives with regard to internationalization have been
   held and represented, to varying degrees, in specifications for NFSv4
   minor versions.

   o  The perspective held by NFSv4 implementers treated most aspects of
      internationalization as basically outside the scope of what NFSv4
      client and server implementers could deal with.  This was because
      the POSIX interface treated filenames as uninterpreted strings of
      bytes, ...

   o  Within the IETF in general and in the IESG, there was a feeling
      that new protocols, such as NFSv4, could not avoid dealing with
      internationalization issues, ...

The time has now come to finally settle this debate, these 'different
perspectives'.

The essential detail that we cannot alter is the architecture of almost
every general-purpose operating system, such as Unix, Unix-like
derivatives (e.g., BSD and derivatives), Unix-like non-derivatives
(e.g., Linux), and even Windows, as well as others.  Specifically:

 - there is a pluggable filesystem API -- the virtual filesystem
   switch (VFS);

 - filesystem protocol clients are plugins for the VFS;

 - filesystem protocol servers operate above the VFS;

 - the VFS API, and the SPI that plugins implement, are in the main
   I18N-unaware -- they are just-use-8 (BSD, Linux, Unix) or
   just-use-UTF-16 (I believe Win32 also leaves I18N to the filesystems,
   though I may be wrong about this);

 - the VFS and below are utterly unaware of the locale or even codeset
   used by application clients of that API.

Indeed, on Unix and Unix-like systems, the C library system call stubs,
the system calls themselves, and the entirety of the VFS, treat
filenames and paths as mostly-binary blobs with just two special byte
values: NUL (because these are C strings) and 0x2F (ASCII '/', because
it's the filesystem component separator as there is no array-of-
components representation of paths in the various system calls), and a
few special names in ASCII (e.g., ".", "..").

The kernel side of all of this is even less aware of user-level locale
selection (not. at. all.) than it is of user-level codeset selection
(NUL and / being special and ASCII, so only ASCII and superset codesets
need apply).
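
To make the point concrete, here is a minimal sketch in Python of the
only filename rules described above.  The function name
vfs_accepts_component is hypothetical, and rejecting the empty name is
an added assumption; this illustrates the argument and is not code from
any kernel.

```python
def vfs_accepts_component(name: bytes) -> bool:
    """Hypothetical check mirroring the byte-level rules described above.

    A path component is an opaque byte string.  Only NUL (the C string
    terminator) and 0x2F (the '/' separator) are forbidden inside it,
    and "." and ".." are reserved names.  Rejecting the empty name is
    an assumption made for this sketch.
    """
    if name in (b"", b".", b".."):
        return False
    return b"\x00" not in name and b"/" not in name


# No codeset or locale is consulted: the same name encoded as UTF-8 or
# as ISO 8859-1 is accepted either way, as are arbitrary bytes.
print(vfs_accepts_component("caf\u00e9".encode("utf-8")))
print(vfs_accepts_component("caf\u00e9".encode("latin-1")))
print(vfs_accepts_component(b"a/b"))
```

Nothing in this check can tell valid UTF-8 from any other byte
sequence, which is why anything smarter has to live below this layer.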

That this set of facts is common to such diverse operating systems
should be indicative of how natural this architecture is.  It's really
quite standard to have pluggable interfaces for this sort of
functionality, and it's not at all surprising that software
architectures that evolved in the 1980s didn't account for I18N.

To be sure, there are special-purpose fileservers, and those might not
have a VFS -- who knows what they do.  But that hardly matters
because it suffices that we have a decades-long history of VFS
architectures in widespread present use.  That is running code, much,
much running code.

The fact that filesystem protocol servers operate _above_ the VFS
essentially rules out implementation in, e.g., NFSv4 servers, of I18N
behaviors such as:

 - normalize on CREATE

   Sure, NFSv4 servers could, but what about POSIX and WIN32
   applications running on the same server?  What about other
   filesystem protocol servers on the same system?  They sure don't and
   won't, and we can't make them do it.

 - preserve form on CREATE and do form-insensitive matching on LOOKUP

   This could be implemented, but conflicts can't be avoided because...
   well, what about POSIX and WIN32 applications running on the same
   server?  ... (Ditto.)

 - reject non-Unicode (non-UTF-8 in the case of NFSv4)

   Sure, NFSv4 servers could, but what about POSIX and WIN32
   applications running on the same server?  ... (Ditto.)

   Should NFSv4 servers filter out non-UTF-8 filenames in READDIR??

 - apply specific mappings in case-insensitive filesystems

   (Ditto.)

There's almost no major I18N best practice that an NFSv4 fileserver can
reliably implement on a general-purpose operating system!

Just about the only I18N best practice an NFSv4 fileserver can apply is
to refuse to CREATE new non-UTF-8 filenames.
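
As a minimal sketch of that one practice, assuming names reach the
server as raw bytes: the helper names is_utf8, create_allowed and
legacy_names are hypothetical and do not come from any NFSv4
implementation.

```python
def is_utf8(name: bytes) -> bool:
    """Return True if the raw name bytes decode as strict UTF-8."""
    try:
        name.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


def create_allowed(name: bytes) -> bool:
    """Hypothetical server-side policy: refuse new non-UTF-8 names."""
    return is_utf8(name)


def legacy_names(listing):
    """Names in a directory listing that the CREATE policy would refuse."""
    return [name for name in listing if not is_utf8(name)]


# A local application can still create a Latin-1 name behind the
# server's back, so such names can show up in listings regardless.
listing = [b"ok.txt", "caf\u00e9".encode("latin-1")]
print(create_allowed(b"ok.txt"))
print(legacy_names(listing))
```

The CREATE check is easy; the listing side is the hard part, because
the server did not create those names and cannot make them go away.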

So why should we have an I18N burden on NFSv4 at all?

If the above is not enough to convince the reader, then what about the
other Internet filesystem protocols, WebDAV and SFTP?

If multiple Internet filesystem protocols can (and they do) co-exist on
the same servers as NFSv4, sharing the same content, how can they have
different I18N requirements and recommendations?  The answer is obvious:
they can't.

And what about non-Internet filesystem protocols, such as:

 - Lustre
 - OpenAFS
 - Auristor
 - CIFS/SMB
 - ...

that also co-exist with Internet filesystem protocols?

We can't advise their designers and implementors, and we can't look to
them to learn from their I18N choices?  Well, we can't impose I18N
requirements on them, no, except by proxy via the Internet filesystem
protocols they also implement (or allow), but again, that just doesn't
work.

And that brings up third-party implementations of Internet filesystem
protocols on general-purpose operating systems.  Those can't possibly
force _our_ I18N values on the platform's native non-Internet filesystem
protocols.  E.g., an SFTP server on Windows co-existing with SMB.

What a mess, no?

But there is a saving grace.

There is one unifying thread: the VFS architecture.  That I18N-unaware
layer above the actual filesystems.  It turns out that this is the key
to the puzzle.

This blissful lack of awareness of I18N at the VFS layer means we can
push I18N all the way down to the filesystem and get good results.  Some
of us reached this conclusion almost twenty years ago, when ZFS
implemented I18N in the filesystem.  Even before that, engineers at
Apple seem to have reached similar conclusions.

In fact, all the problems of filesystem I18N are relatively easy to
address if we push them into the filesystem.  Yes, different filesystem
specifications and implementations may well make different I18N choices
-- they already do anyways, and we can't exactly force them to change.

There are only a few I18N problems to address in the filesystem.  I'll
focus here only on filenames (and pathnames).  We can describe them and
specify solutions as a BCP or even Standard and hopefully those
filesystems that don't yet implement any of these I18N behaviors can get
the hint and start doing so.  These problems are:

 - Unicode equivalence

   There are two approaches in the wild:

    - normalize on CREATE (and typically also LOOKUP)

      HFS+, for example, does this.

      HFS+ normalizes to something close to NFD, while input methods
      generally produce sequences closer to NFC, at least for Latin
      scripts anyways.  Other filesystems could well go for NFC, which
      serves to illustrate that there is a variety of I18N behavior in
      the wild.

    - form-preserving on CREATE, form-insensitive on LOOKUP

      ZFS, for example, does this.  Again, diverse I18N behaviors in the
      wild.

   A third and unsatisfying approach is to do nothing.  Naturally we
   would not endorse that approach -- we might not even mention it.

 - Case mappings

   These are only relevant to case-insensitive filesystems.  It is not
   uncommon to have a single server sharing multiple different
   filesystems some of which are case-sensitive, and some of which are
   case-insensitive.

   Here the main problem is that there can be only a single set of
   mappings per-filesystem, and this set of mappings may vary by locale.
   Ergo, each case-insensitive filesystem needs to specify a locale or
   default to a sensible one.

   Note that knowing the locale of user application processes does not
   help here because it is just not possible to have different case
   mappings in the same case-insensitive filesystems for different
   users.

 - What to do about non-Unicode file names

   This is a matter of legacy.  We, the IETF, can say that Internet
   filesystem protocol servers MUST NOT allow the creation of new such
   names, but forbidding such names in the results of listing
   directories is harder.  We can even pretend legacy filesystem content
   does not exist.

   Still, there are only two sensible policies a filesystem might
   implement:

    - forbid non-Unicode;
    - allow non-Unicode, making no attempt to deal with equivalence.
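
The two Unicode-equivalence approaches listed above can be sketched
with toy in-memory directories.  This is an illustration under stated
assumptions, not a description of how HFS+ or ZFS are implemented: the
class names are hypothetical, and NFC is an arbitrary choice of form
(as noted above, HFS+ uses something close to NFD).

```python
import unicodedata


def nfc(name: str) -> str:
    """Normalize a name to NFC (form chosen only for this sketch)."""
    return unicodedata.normalize("NFC", name)


class NormalizingDir:
    """Toy directory: normalize on CREATE and on LOOKUP."""

    def __init__(self):
        self.entries = {}  # normalized name -> value

    def create(self, name, value):
        key = nfc(name)
        if key in self.entries:
            raise FileExistsError(name)
        self.entries[key] = value

    def lookup(self, name):
        return self.entries[nfc(name)]

    def readdir(self):
        return sorted(self.entries)


class FormPreservingDir:
    """Toy directory: form-preserving CREATE, form-insensitive LOOKUP."""

    def __init__(self):
        self.entries = {}  # normalized key -> (name as created, value)

    def create(self, name, value):
        key = nfc(name)
        if key in self.entries:
            raise FileExistsError(name)
        self.entries[key] = (name, value)

    def lookup(self, name):
        return self.entries[nfc(name)][1]

    def readdir(self):
        return sorted(stored for stored, _ in self.entries.values())


composed = "caf\u00e9"      # e with acute as one code point
decomposed = "cafe\u0301"   # e followed by a combining acute accent

for directory in (NormalizingDir(), FormPreservingDir()):
    directory.create(decomposed, "data")
    # Both find the file under either spelling; they differ only in
    # which spelling a directory listing reports back.
    print(type(directory).__name__,
          directory.lookup(composed),
          directory.readdir() == [decomposed])
```

Either way the equivalence decision is made once, inside the
directory object, so every caller sees the same behavior no matter
which interface it arrived through.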

A document that explains all of the above and correctly addresses I18N
requirements mainly at filesystems can be shorter than the document I
just reviewed, and can avoid the uncomfortable attempt at providing
alternate definitions of RFC2119 terms.  Let us do that.  I volunteer to
author or edit such a document if need be.

All that said, there is one way in which I18N does apply specifically to
NFSv4: in non-filename Unicode strings, such as the name@domain
representation of users and groups in access control lists (ACLs).
Fortunately there is no controversy about that, or the choices made in
NFSv4 regarding those, and nothing more need be said about that.

Nico
--