Language Tagging in Unicode Plain Text
RFC 2482

Document Type RFC - Historic (January 1999; No errata)
Obsoleted by RFC 6082
Was draft-whistler-plane14 (individual)
Last updated 2013-03-02
Stream Legacy
Network Working Group                                       K. Whistler
Request for Comments: 2482                                       Sybase
Category: Informational                                        G. Adams
                                                               Spyglass
                                                           January 1999

                 Language Tagging in Unicode Plain Text

Status of this Memo

   This memo provides information for the Internet community.  It does
   not specify an Internet standard of any kind.  Distribution of this
   memo is unlimited.

Copyright Notice

   Copyright (C) The Internet Society (1999).  All Rights Reserved.

IESG Note:

   This document has been accepted by ISO/IEC JTC1/SC2/WG2 in meeting
   #34 to be submitted as a recommendation from WG2 for inclusion in
   Plane 14 in part 2 of ISO/IEC 10646.

1.  Abstract

   This document proposes a mechanism for language tagging in [UNICODE]
   plain text. A set of special-use tag characters on Plane 14 of
   [ISO10646] (accessible through UTF-8, UTF-16, and UCS-4 encoding
   forms) are proposed for encoding to enable the spelling out of
   ASCII-based string tags using characters which can be strictly
   separated from ordinary text content characters in ISO10646 (or
   UNICODE).
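
   As a rough illustration of the encoding forms mentioned above, the
   following Python sketch encodes a single Plane 14 code point three
   ways. The choice of U+E0065 (TAG LATIN SMALL LETTER E in the Unicode
   Tags block) is an assumption made for the example; this section does
   not itself list code points.

```python
# Sketch only: one Plane 14 code point in three encoding forms.
# Assumption: U+E0065 is TAG LATIN SMALL LETTER E (Unicode Tags block).
TAG_SMALL_E = "\U000E0065"

utf8 = TAG_SMALL_E.encode("utf-8")      # four-byte UTF-8 sequence
utf16 = TAG_SMALL_E.encode("utf-16-be")  # surrogate pair, big-endian
ucs4 = TAG_SMALL_E.encode("utf-32-be")   # UCS-4 / UTF-32, big-endian

print(utf8.hex(" "))   # f3 a0 81 a5
print(utf16.hex(" "))  # db 40 dc 65
print(ucs4.hex(" "))   # 00 0e 00 65
```

   Because Plane 14 lies outside the Basic Multilingual Plane, the
   UTF-16 form needs a surrogate pair and the UTF-8 form needs four
   bytes per tag character.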

   One tag identification character and one cancel tag character are
   also proposed. In particular, a language tag identification character
   is proposed to identify a language tag string specifically; the
   language tag itself makes use of [RFC1766] language tag strings
   spelled out using the Plane 14 tag characters. Provision of a
   specific, low-overhead mechanism for embedding language tags in plain
   text is aimed at meeting the need of Internet Protocols such as ACAP,
   which require a standard mechanism for marking language in UTF-8
   strings.
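
   The idea of spelling out an [RFC1766] language tag with tag
   characters can be sketched as follows. This is a minimal sketch, not
   a definitive implementation. It assumes the code point assignments
   of the Unicode Tags block (U+E0001 LANGUAGE TAG, U+E0020..U+E007E as
   tag counterparts of printable ASCII, U+E007F CANCEL TAG), which are
   not stated in this section. The function names are hypothetical.

```python
# Assumed code points from the Unicode Tags block (not given in this section).
LANGUAGE_TAG = 0xE0001   # tag identification character for language tags
CANCEL_TAG = 0xE007F     # cancel tag character
TAG_OFFSET = 0xE0000     # tag character = TAG_OFFSET + ASCII code


def encode_language_tag(tag: str) -> str:
    """Spell out an ASCII language tag such as "fr" using tag characters."""
    if not tag or any(not (0x20 <= ord(c) <= 0x7E) for c in tag):
        raise ValueError("language tag must be non-empty printable ASCII")
    return chr(LANGUAGE_TAG) + "".join(chr(TAG_OFFSET + ord(c)) for c in tag)


def strip_tags(text: str) -> str:
    """Drop every character in U+E0000..U+E007F, leaving ordinary text."""
    return "".join(c for c in text if not (0xE0000 <= ord(c) <= 0xE007F))


tagged = encode_language_tag("fr") + "bonjour"
print([f"U+{ord(c):04X}" for c in tagged[:3]])  # ['U+E0001', 'U+E0066', 'U+E0072']
print(strip_tags(tagged))                        # bonjour
```

   Since the tag characters occupy their own range, a process that does
   not interpret them can remove them with a simple range check, as
   strip_tags does, without touching the text content.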

   The tagging mechanism as well as the characters proposed in this
   document have been approved by the Unicode Consortium for inclusion
   in The Unicode Standard.  However, implementation of this decision

Whistler & Adams             Informational                      [Page 1]
RFC 2482         Language Tagging in Unicode Plain Text     January 1999

   awaits formal acceptance by ISO JTC1/SC2/WG2, the working group
   responsible for ISO10646. Potential implementers should be aware that
   until this formal acceptance occurs, any usage of the characters
   proposed herein is strictly experimental and not sanctioned for
   standardized character data interchange.

2.  Definitions and Notation

   No attempt is made to define all terms used in this document. In
   particular, the terminology pertaining to the subject of coded
   character systems is not explicitly specified. See [UNICODE],
   [ISO10646], and [RFC2130] for additional definitions in this area.

2.1 Requirements Notation

   This document occasionally uses terms that appear in capital letters.
   When the terms "MUST", "SHOULD", "MUST NOT", "SHOULD NOT", and "MAY"
   appear capitalized, they are being used to indicate particular
   requirements of this specification. A discussion of the meanings of
   these terms appears in [RFC2119].

2.2 Definitions

   The terms defined below are used in special senses and thus warrant
   some clarification.

2.2.1 Tagging

   The association of attributes of text with a point or range of the
   primary text. (The value of a particular tag is not generally
   considered to be a part of the "content" of the text. Typical
   examples of tagging are marking the language or font of a portion of
   text.)

2.2.2 Annotation

   The association of secondary textual content with a point or range of
   the primary text. (The value of a particular annotation *is*
   considered to be a part of the "content" of the text. Typical
   examples include glossing, citations, exemplification, Japanese yomi,
   etc.)

2.2.3 Out-of-band

   An out-of-band channel conveys a tag in such a way that the textual
   content, as encoded, is completely untouched and unmodified. This is
   typically done by metadata or hyperstructure of some sort.

2.2.4 In-band

   An in-band channel conveys a tag along with the textual content,
   using the same basic encoding mechanism as the text itself. This is
   done by various means, but an obvious example is SGML markup, where
   the tags are encoded in the same character set as the text and are
   interspersed with and carried along with the text data.
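
   The distinction between the two channels defined above can be
   illustrated with a small Python sketch. The class name, function
   name, and markup element are hypothetical and chosen only for the
   example; they are not taken from any specification.

```python
from dataclasses import dataclass


@dataclass
class OutOfBandTagged:
    """Out-of-band: the tag travels beside the text, which stays untouched."""
    text: str
    language: str  # e.g. carried in a protocol header or other metadata


def in_band_tagged(text: str, language: str) -> str:
    """In-band: the tag is encoded in the same character stream as the text.

    The SGML-style element used here is illustrative only.
    """
    return f'<span lang="{language}">{text}</span>'


out_of_band = OutOfBandTagged(text="bonjour", language="fr")
in_band = in_band_tagged("bonjour", "fr")

print(out_of_band.text)  # bonjour
print(in_band)           # <span lang="fr">bonjour</span>
```

   In the out-of-band case the encoded text is byte-for-byte the
   original string. In the in-band case a consumer has to parse the tag
   out of the stream before it can recover the content.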

3.  Background

   There has been much discussion over the last 8 years of language
   tagging and of other kinds of tagging of Unicode plain text. It is
   fair to say that there is more-or-less universal agreement that
   language tagging of Unicode plain text is required for certain
   textual processes. For example, language "hinting" of multilingual