This is my take on issues with this document mostly from my personal
review but also after some discussion we've had on the i18ndir list.
Some parts of this draft are quite hard to follow, so I'm giving my
understanding of the parts I'm commenting on in case I got them wrong.
I realize that a lot of this is unchanged from 4329, which we should
have reviewed more carefully 15 years ago.
Section 4 on Encoding: I believe it says that the preferred encoding
is UTF-8, but that existing content uses a variety of encodings and the
charset labels sometimes mislabel them. So for anything that you don't
know is a module, you have to sniff the contents to see if it starts
with a BOM,
and if so, use the BOM's encoding and delete the BOM. If the BOM uses
an encoding the consumer doesn't support, fail. If there's no BOM,
use the declared character set, or if it's one the consumer doesn't
understand, treat it as UTF-8 anyway.
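As I read it, the decision procedure amounts to something like the following. This is a Python sketch of my reading, not the draft's normative text; the function name and structure are mine:

```python
# Sketch of the decoding procedure as I understand it (names are mine).
BOMS = [
    (b"\xef\xbb\xbf", "utf-8"),      # longest BOM first
    (b"\xff\xfe", "utf-16-le"),
    (b"\xfe\xff", "utf-16-be"),
]

def decode_script(data, declared=None):
    # Step 1: sniff for a BOM; if found, use its encoding and delete the BOM.
    for bom, enc in BOMS:
        if data.startswith(bom):
            return data[len(bom):].decode(enc)
    # Step 2: no BOM, so use the declared charset if the consumer supports it...
    if declared:
        try:
            return data.decode(declared)
        except LookupError:
            pass  # ...step 3: consumer doesn't know the charset, fall through.
    # Treat it as UTF-8 anyway.
    return data.decode("utf-8")
```

If that sketch is roughly right, it shows why I find step 3 confusing: the UTF-8 fallback is only reachable when there is no BOM and the declared charset is missing or unrecognized.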
Step 1 says "The longest matching octet sequence determines the encoding."
which I don't understand, since none of the encodings overlap. Does that
mean it should interpret a partial BOM, e.g., EF BB 20 for UTF-8? Also,
why is the BOM deleted? ECMAScript says a BOM is treated as whitespace,
so it should be harmless to leave it in place.
While I understand that there is a lot of history here, I'm wondering if
the range mislabeling is really as extreme as this implies. Is there,
say, text labelled Shift-JIS which is really UTF-8 or UTF-16? If the
mislabelled stuff is consistently mislabelled as one of UTF-8/16/16BE/16LE
perhaps it could say to try the BOM trick on those encodings and fail otherwise.
I don't understand step 3, "The character encoding scheme is
determined to be UTF-8." How can it be determined to be UTF-8 other
than by steps 1 and 2? Or is it saying that if the declared charset
is one the consumer doesn't understand, such as KOI8-U, it should
assume it's UTF-8?
I'd suggest rewriting the section to make it clearer that if it's not
a module, you look for a BOM, use its encoding if you find one, and (I
think) otherwise use the declared encoding.
Section 4.3 on error handling: I think it says that if there's a byte
sequence that isn't a valid code point in the current encoding, it can
fail or it can turn the bytes into Unicode replacement characters, but
MUST NOT try anything else. I agree with this advice but again it
could be clearer.
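The two permitted behaviours correspond to the standard "fail" and "replace" decoding modes; in Python terms (my illustration, not text from the draft):

```python
bad = b"ok \xff\xfe!"  # 0xFF and 0xFE are never valid in UTF-8

# Permitted behaviour 1: fail outright on the invalid sequence.
try:
    bad.decode("utf-8")
    raised = False
except UnicodeDecodeError:
    raised = True
assert raised

# Permitted behaviour 2: substitute U+FFFD for the offending bytes.
assert bad.decode("utf-8", errors="replace") == "ok \ufffd\ufffd!"
```

Anything else, such as silently dropping the bytes or guessing another encoding, is what the MUST NOT rules out.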
Section 3 on Modules: I believe it says that JS scripts and modules have
different syntax but you can't easily tell them apart by inspection.
(The term "goal" is familiar since I used to write books about compiler
tools, and I realize it's what the ECMAScript spec uses, but it's
confusing if you're not a programming language expert. How about just
saying that scripts and modules have different syntax?)
Hence some software uses a .mjs filename as a hint that something is a
module. Again I realize that there is a bunch of existing code but
this is not great MIME practice. If the difference matters, it's
worth providing a new MIME type such as text/jsmodule, which could
have consistently accurate content encodings. It would coexist with
all of the other old MIME types and the .mjs hints. Since this draft
deprecates a bunch of existing types and de-deprecates another, this
seems as good a time as any to do it.
I also wonder whether it's worth making a distinction in MIME
processing between modules and scripts. Would there be any harm in
saying to sniff everything for a BOM? If a .mjs file turns out to
have a UTF-16 BOM, it's wrong, but is it likely to be anything other
than UTF-16 text?