Language Tagging in Unicode Plain Text
RFC 2482

| Section | Field | Value |
|---|---|---|
| Document | Type | RFC - Historic (January 1999; No errata); obsoleted by RFC 6082; was draft-whistler-plane14 (individual) |
| | Authors | Glenn Adams, Ken Whistler |
| | Last updated | 2013-03-02 |
| | Stream | Legacy stream |
| Stream | Legacy state | (None) |
| | Consensus Boilerplate | Unknown |
| | RFC Editor Note | (None) |
| IESG | IESG state | RFC 2482 (Historic) |
| | Telechat date | |
| | Responsible AD | (None) |
| | Send notices to | (None) |

Network Working Group                                        K. Whistler
Request for Comments: 2482                                        Sybase
Category: Informational                                         G. Adams
                                                                Spyglass
                                                            January 1999

                 Language Tagging in Unicode Plain Text

Status of this Memo

This memo provides information for the Internet community. It does not specify an Internet standard of any kind. Distribution of this memo is unlimited.

Copyright Notice

Copyright (C) The Internet Society (1999). All Rights Reserved.

IESG Note:

This document has been accepted by ISO/IEC JTC1/SC2/WG2 in meeting #34 to be submitted as a recommendation from WG2 for inclusion in Plane 14 in part 2 of ISO/IEC 10646.

1. Abstract

This document proposes a mechanism for language tagging in [UNICODE] plain text. A set of special-use tag characters on Plane 14 of [ISO10646] (accessible through UTF-8, UTF-16, and UCS-4 encoding forms) is proposed for encoding to enable the spelling out of ASCII-based string tags using characters which can be strictly separated from ordinary text content characters in ISO10646 (or UNICODE).

One tag identification character and one cancel tag character are also proposed. In particular, a language tag identification character is proposed to identify a language tag string specifically; the language tag itself makes use of [RFC1766] language tag strings spelled out using the Plane 14 tag characters. Provision of a specific, low-overhead mechanism for embedding language tags in plain text is aimed at meeting the need of Internet Protocols such as ACAP, which require a standard mechanism for marking language in UTF-8 strings.

The tagging mechanism as well as the characters proposed in this document have been approved by the Unicode Consortium for inclusion in The Unicode Standard. However, implementation of this decision awaits formal acceptance by ISO JTC1/SC2/WG2, the working group responsible for ISO10646.
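
As a rough illustration of the mechanism the abstract describes, the following Python sketch spells out an [RFC1766] tag in Plane 14 tag characters and separates tag characters from ordinary content again. The code point values (U+E0001 for the language tag identification character, U+E0020 through U+E007E for the tag characters mirroring printable ASCII, U+E007F for the cancel tag) are not stated in the text above and are assumed here from the Plane 14 tag block assignments. The function names are hypothetical and chosen only for this example.

```python
# Assumed code points from the Plane 14 tag block (not given in the abstract).
LANGUAGE_TAG = 0xE0001  # language tag identification character
CANCEL_TAG = 0xE007F    # cancel tag character (defined for completeness)
TAG_OFFSET = 0xE0000    # tag character = ASCII code point + this offset


def encode_language_tag(tag: str) -> str:
    """Spell out an ASCII language tag such as 'fr' in tag characters.

    Hypothetical helper: the identification character comes first,
    followed by one tag character per ASCII character of the tag.
    """
    if not tag or any(not (0x20 <= ord(ch) <= 0x7E) for ch in tag):
        raise ValueError("tag must be non-empty printable ASCII")
    return chr(LANGUAGE_TAG) + "".join(chr(TAG_OFFSET + ord(ch)) for ch in tag)


def strip_tags(text: str) -> str:
    """Drop every character in U+E0000..U+E007F, keeping ordinary content.

    Hypothetical helper: this works because tag characters occupy a range
    that is disjoint from all ordinary text content characters.
    """
    return "".join(ch for ch in text if not (0xE0000 <= ord(ch) <= 0xE007F))


# Tag a short string as French, then recover the untagged content.
tagged = encode_language_tag("fr") + "bonjour"
plain = strip_tags(tagged)
utf8_tag_bytes = encode_language_tag("fr").encode("utf-8")
```

Under these assumptions, the tag "fr" becomes the three code points U+E0001 U+E0066 U+E0072. Each Plane 14 character takes four bytes in UTF-8, so the tag costs twelve bytes, and a process that does not care about language can discard it with a simple range check.
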
Potential implementers should be aware that until this formal acceptance occurs, any usage of the characters proposed herein is strictly experimental and not sanctioned for standardized character data interchange.

2. Definitions and Notation

No attempt is made to define all terms used in this document. In particular, the terminology pertaining to the subject of coded character systems is not explicitly specified. See [UNICODE], [ISO10646], and [RFC2130] for additional definitions in this area.

2.1 Requirements Notation

This document occasionally uses terms that appear in capital letters. When the terms "MUST", "SHOULD", "MUST NOT", "SHOULD NOT", and "MAY" appear capitalized, they are being used to indicate particular requirements of this specification. A discussion of the meanings of these terms appears in [RFC2119].

2.2 Definitions

The terms defined below are used in special senses and thus warrant some clarification.

2.2.1 Tagging

The association of attributes of text with a point or range of the primary text. (The value of a particular tag is not generally considered to be a part of the "content" of the text. Typical examples of tagging are marking the language or font of a portion of text.)

2.2.2 Annotation

The association of secondary textual content with a point or range of the primary text. (The value of a particular annotation *is* considered to be a part of the "content" of the text. Typical examples include glossing, citations, exemplification, Japanese yomi, etc.)

2.2.3 Out-of-band

An out-of-band channel conveys a tag in such a way that the textual content, as encoded, is completely untouched and unmodified. This is typically done by metadata or hyperstructure of some sort.

2.2.4 In-band

An in-band channel conveys a tag along with the textual content, using the same basic encoding mechanism as the text itself.
This is done by various means, but an obvious example is SGML markup, where the tags are encoded in the same character set as the text and are interspersed with and carried along with the text data.

3.0 Background

There has been much discussion over the last 8 years of language tagging and of other kinds of tagging of Unicode plain text. It is fair to say that there is more-or-less universal agreement that language tagging of Unicode plain text is required for certain textual processes. For example, language "hinting" of multilingual