Internet Draft M. Duerst
<draft-ietf-acap-langtag-00.txt> University of Zurich
Expires in six months June 1997
Two Alternative Proposals for Language Taging in ACAP
Status of this Memo
This document is an Internet-Draft. Internet-Drafts are working doc-
uments of the Internet Engineering Task Force (IETF), its areas, and
its working groups. Note that other groups may also distribute work-
ing documents as Internet-Drafts.
Internet-Drafts are draft documents valid for a maximum of six
months. Internet-Drafts may be updated, replaced, or obsoleted by
other documents at any time. It is not appropriate to use Internet-
Drafts as reference material or to cite them other than as a "working
draft" or "work in progress".
To learn the current status of any Internet-Draft, please check the
1id-abstracts.txt listing contained in the Internet-Drafts Shadow
Directories on ds.internic.net (US East Coast), nic.nordu.net
(Europe), ftp.isi.edu (US West Coast), or munnari.oz.au (Pacific
Rim).
Abstract
For various computing applications, it is helpful to know the lan-
guage of the text being processed. This can be the case even if oth-
erwise only pure character sequences (so-called plain text) are han-
dled. From several sides, the need for such a scheme for ACAP has
been claimed. One specific scheme, called MLSF, has also been pro-
posed, see draft-ietf-acap-mlsf-01.txt for details. This document
proposes two alternatives to MLSF. One alternative is using
text/enriched-like markup. The second alternative is using a special
tag-introduction character. Advantages and disadvantages of the var-
ious proposals are discussed. Some general comments about the topic
of language tagging are given in the introduction.
1. Introduction
This introduction contains some considerations about language infor-
mation that should help to better understand why and where language
Expires in six months [Page 1]
Internet Draft Alternative Language Tagings for ACAP May 1997
information can be beneficial. They are intended for general informa-
tion, and are not directly related to the specifics of the proposals
made in this document.
1.1 Multilingual Text
It is sometimes claimed that text, in order to be multilingual, has
to contain language information of some kind or another. This is
definitely not the case. Multilingual text has existed for centuries
on paper and other writing materials, and for decades in computers,
without language information. A huan reader with the necessary lan-
guage background is always able to understand a multilingual text
without explicitly being told which language each word or character
belongs to. In some cases, there may be ambiguities, but this is
either intended, such as in a pun or joke, it is because the reader
is not fully familliar with the involved languages, or it is because
the writer was no precise enough.
The overwhelming majority of characters always has been used in vari-
ous languages, and a character per se therefore cannot be associated
to a single language. This likewise applies to words (and sometimes
even phrases) out of context.
1.2 Language Taging
While the human reader does not need special language information,
such information can be useful for the purpose of automatic process-
ing of various kinds. These in particular include indexing and
searching, text-to-speech conversion, other conversion operations
such as case conversion and transliteration, spelling and grammar
checks, and high-quality typography.
Two other operations are frequently mentionned as benefiting from
language information: Sorting and machine translation. However, in
the case of sorting, this has to occur according to the expectations
of the viewer, frequently encapsulated as a so-called locale. The
language of each of the items being sorted is not relevant. In the
case of machine translation, the knowledge and effort to translate a
language is by magnitudes higher than the knowledge needed to decide
whether a certain word or sentence belong to a given language.
Explicit language information is therefore of marginal importance.
This also applies to other operations, such as text-to-speech conver-
sion, in particular for high quality and for languages with a compli-
cated relationship between spelling and pronounciation (such as
English).
Expires in six months [Page 2]
Internet Draft Alternative Language Tagings for ACAP May 1997
1.2 CJK(V) Glyph Disambiguation
The reason one hears most for the necessity of language information
is the need to disambiguate CJK(V) ideographic glyphs, i.e. to select
specific typographic variants of certain ideographic characters,
variants which can differ somewhat between Chinese (simplified or
traditional), Japanese, Korean, or classical Vietnamese. Some such
distinctions can indeed be made by using language information to
indicate typographic tradition. However, the usefullness of this
approach is limited by a series of facts that are not all very widely
known:
- Even if a glyph is by mistake taken from another typographic tra-
dition, readability, in particular in context, is never affected.
- In running text, the differences resulting from the use of differ-
ent fonts (e.g. Song-style font vs. Mincho-style) as well as from
different weights of the same font are much more visible than the
differences resulting from glyph variant details.
- In many fonts, glyphs vary, for sund aestetic and historic rea-
sons, to a similar or higher degree than that exhibited in the
"reference" glyphs for each typographic tradition as given in [ISO
10646]. This applies to print, but even more to handwriting.
- National standards explicitly or implicitly allow for a certain
variance of glyph shapes. In particular, the newest edition of the
basic Japanese character standard, JIS X 0208-1997 [JIS1997],
explicitly mentionnes a large number of permitted variants (pp.
12-22) for Mincho fonts only. It also explicitly allows a list of
29 much wider-reaching variants as a consequence of some unfortu-
nate changes to the standard in 1983 (p. 22).
- Long-standing typographic practice does not use special glyph
variants for representing short inclusions of foreign origin (such
as names of persons, places, or institutions) in native text.
- Some glyph variants are seen by some persons as explicit proper-
ties of their names. Identifying a name by a particular language
and assuming that this implies a particular typographic tradition
can in some cases lead to the desired result. However, the results
cannot be guaranteed due to design differences between different
fonts used in the same typographic tradition, and due to the fact
that even national standards glyph standards considerably unify
glyph variants.
All the above facts clearly limit the usefullness of language tags
for CJK(V) glyph variant selection. Language taging should therefore
Expires in six months [Page 3]
Internet Draft Alternative Language Tagings for ACAP May 1997
not be advertized as a comprehensive solution to the various problems
of CJK(V) glyph variant selection.
2. A Text/Enriched-like Notation for Language Tags (TELT)
This section specifies a text/enriched-like notation for language
tags, leading to a format simmilar to text/enriched. It can be used
with any character encoding that contains the necessary subset of the
US-ASCII character repertoire.
Language tags are of the form "<LANG=xxxxx>" where xxxxx is a lan-
guage tag as defined in [RFC1766], with all letters written in upper
case. No whitespace of any kind is allowed between "<" and ">".
Language alternatives are started by "<ALTLANG>". Again, no whites-
pace is allowed between "<" and ">".
The use of the character sequences "<LANG=" and "<ALTLANG>" is not
allowed in the text itself. Code to convert from this notation to
MLSF and back and to test for false positives in plain text search is
given in an appendix.
3. Language Tags using a Start Tag Character (STLT)
This method of language taging is only useable with character encod-
ings that can represent the BMP of the Universal Character Set
[ISO10646]. For the purpose of illustration, the character PILCROW
SIGN (paragraph sign, U+00B6) is used as the tag start character. It
would be preferable to officially define a currently unused code
point exclusively for this purpose, but such a definition is outside
of the scope of this document and outside of the scope of IETF work.
If this solution is seriously considered for adoption by the IETF for
use in some of it's protocols, a request for such a codepoint should
be made through the appropriate channels.
For possible future expansions, tag syntax after the start tag char-
acter is kept very simple and general. Tags are defined to start with
a tag start character, contain only characters from the US-ASCII
repertoire (U+0021 through U+007E, inclusive), excluding the tag end
character, and end with a tag end character. The character "#" is
choosen as a tag end character.
Language tags proper are formed by a start tag character, a language
tag according to [RFC1766], with all letters in upper case, and a "#"
as an end tag character, without any intervening white space.
Expires in six months [Page 4]
Internet Draft Alternative Language Tagings for ACAP May 1997
Language alternatives are marked by a sequence of a start tag charac-
ter, a "%", and a "#" as an end tag character, again without any
intevening white space. Code to convert from this notation to MLSF
and back and to test for false positives in plain text search is
given in an appendix.
4. Conformance
Conforming protocols using either of the solutions proposed above
MUST clearly define in which places they do so, and in which places
they don't. If there are other mechanisms in the procotol that can be
used for language taging, these mechanisms should be considered and
used. In particular, storing language information separate from the
actual text is beneficial in many cases because it allows the proto-
col to treat language information and language alternatives in a way
appropriate to the protocol, i.e. only selecting and transmitting
language alternatives desired by the client, and so on.
Conforming protocols and their implementations MUST at all costs
avoid that language tags leak into parts of the protocol where they
are not allowed or into other channels where they are not allowed. In
the absence of specific information to the countrary, a protocol or
implementation MUST assume that another protocol or implementation
does not allow language tags.
In interfaces to protocols and formats that use other ways of lan-
guage taging (for an example HTML, see [RFC2070]), conforming proto-
cols SHOULD convert language tags appropriately or MAY eliminate
them.
If text including language tags as defined in this document leaks
outside the protocol positions where it is explicitly allowed, it
should be treated in the same way other text is treated, with no spe-
cial processing.
5. Discussion
Two alternative forms for language taging have been proposed in this
document. Because they are very simillar, only one of them should
finally be choosen. Compared to [MLSF], their main advantages are
that they can be used with character encodings other than UTF-8, that
they are easily distinguished from UTF-8 by implementors and users,
and that they are advantageous in case of debuging and initial string
composition.
The MLSF proposal has a number of interesting properties that makes
Expires in six months [Page 5]
Internet Draft Alternative Language Tagings for ACAP May 1997
it very suitable for efficient internal processing in certain scenar-
ios. We therefore in particular give conversion functions between
MLSF and our proposals in the appendices.
MLSF continues a long tradition of utilizing unused bit combinations
for internal processing speedups. Exposing such methods to the out-
side of an implementation, however, can lead to serious restrictions
and undesired biases towards certain implementations.
The main difference between the two proposals given here is that TELT
has to exclude certain character sequences from the untagged text,
whereas STLT has a potential to use a special, newly defined, code-
point, that is guaranteed not to appear in text per se.
Acknowledgements
The motivation to write this document came from Harald Alvestran.
Further acknowledgements go to Lisa Moore, Mark Davis, Ken Whistler,
Glenn Adams, and others from the UTC (Unicode Technical Committee),
to Rob Pike, and to Chris Newman, Ned Freed, and Mark Crispin from
the IETF ACAP working group.
Bibliography
[Unicode2] The Unicode Standard, Version 2, Addison-Wesley, Read-
ing, MA, 1996.
[ISO-10646] ISO/IEC 10646-1:1993. International Standard -- Infor-
mation technology -- Universal Multiple-Octet Coded
Character Set (UCS) -- Part 1: Architecture and Basic
Multilingual Plane.
[JIS1997] Japanese Industrial Standard, "7-bit and 8-bit Double
Byte Coded Kanji Sets for Information Interchange",
JIS X 0208:1997.
[MLSF] C. Newman, "Multi-Lingual String Format (MLSF)",
draft-ietf-acap-mlsf-01.txt, work in progress, June
1997.
[RFC1766] Alvestran, H., "Tags for the Identification of Lan-
guages", RFC 1766.
Expires in six months [Page 6]
Internet Draft Alternative Language Tagings for ACAP May 1997
[RFC2070] F. Yergeau, G. Nicol, G. Adams, and M. Duerst, "Inter-
nationalization of the Hypertext Markup Language", RFC
2070, January 1997.
Author's Address
Martin J. Duerst
Multimedia-Laboratory
Department of Computer Science
University of Zurich
Winterthurerstrasse 190
CH-8057 Zurich
Switzerland
Tel: +41 1 257 43 16
Fax: +41 1 363 00 35
E-mail: mduerst@ifi.unizh.ch
NOTE -- Please write the author's name with u-Umlaut wherever
possible, e.g. in HTML as Dürst.
Appendix A. Conversion from TELT to MLSF
This is sample code to convert from text/enriched-style language tag-
ing to MLSF. It is assumed that the source is in UTF-8, and that the
output buffer (outp) is long enough to hold the result. The code uses
the functions defined in [MLSF] for convenience.
#include <string.h>
void TELTtoMLSF (unsigned char *outp, unsigned char *inp)
{
unsigned char tagbuff[256];
unsigned char *temp;
while (*inp) {
Expires in six months [Page 7]
Internet Draft Alternative Language Tagings for ACAP May 1997
if (!strncmp(inp, "<ALTLANG>", 9)) {
inp += 9;
*outp++ = 0xFE;
}
else if (!strncmp(inp, "<LANG=", 6)) {
inp += 6;
temp= tagbuff;
while (*inp != '>')
*temp++ = *imp++;
*temp= 0;
outp += MLSFlangencode(outp, tagbuff);
}
else
*inp++= *outp++;
}
*outp= 0;
}
Appendix B. Conversion from STLT to MLSF
This is sample code to convert from Start Tag Character style lan-
guage taging to MLSF. It is assumed that the source is in UTF-8, and
that the output buffer (outp) is long enough to hold the result. The
code uses the functions defined in [MLSF] for convenience.
void STLTtoMLSF (unsigned char *outp, unsigned char *inp)
{
unsigned char tagbuff[256];
unsigned char *temp;
while (*inp) {
if (!strncmp(inp, "\xC2\xA7%#", 4)) {
inp += 4;
*outp++ = 0xFE;
}
else if (!strncmp(inp, "\xC2\xA7", 2)) {
inp += 2;
temp= tagbuff;
while (*inp != '&')
*temp++ = *imp++;
*temp= 0;
outp += MLSFlangencode(outp, tagbuff);
}
Expires in six months [Page 8]
Internet Draft Alternative Language Tagings for ACAP May 1997
else
*inp++= *outp++;
}
*outp= 0;
}
Appendix C. Conversion from MLSF to TELT
This is sample code to convert from MLSF to text/enriched-style lan-
guage taging. It is assumed that the output buffer (outp) is long
enough to hold the result. The code uses the functions defined in
[MLSF] for convenience.
void MLSFtoTELT (unsigned char *outp, unsigned char *inp)
{
unsigned char tagbuff[256];
unsigned char *temp;
int len;
while (*inp) {
/* for speed, first insert a test (*inp != "<") */
if (*inp == 0xFE) {
inp++;
strcpy (outp, "<ALTLANG>");
outp+= 9;
}
else if (*inp >= 0xC0 && inp[1] > 0xC0) {
inp+= MLSFlangdecode(tagbuff, inp);
strcpy (outp, "<LANG=");
outp+= 6;
temp= tagbuff;
while (*temp)
*outp++ = *temp++;
*outp++ = ">";
}
else { /* maybe just *outp++ = *inp++ is enough here? */
len = utlen[*inp];
if (len > 6) break;
while (len-- && *src)
*outp++ = *inp++;
}
}
*outp= 0;
}
Expires in six months [Page 9]
Internet Draft Alternative Language Tagings for ACAP May 1997
Appendix D. Conversion from MLSF to Start Tag Character
This is sample code to convert from MLSF to Start Tag Character style
language taging. It is assumed that the output buffer (outp) is long
enough to hold the result. The code uses the functions defined in
[MLSF] for convenience.
void MLSFtoSTLT (unsigned char *outp, unsigned char *inp)
{
unsigned char tagbuff[256];
unsigned char *temp;
int len;
while (*inp) {
if (*inp == 0xFE) {
inp++;
strcpy (outp, "\xC2\xA7%#");
outp+= 9;
}
else if (*inp >= 0xC0 && inp[1] > 0xC0) {
inp+= MLSFlangdecode(tagbuff, inp);
strcpy (outp, "\xC2\xA7");
outp+= 6;
temp= tagbuff;
while (*temp)
*outp++ = *temp++;
*outp++ = "&";
}
else { /* maybe just *outp++ = *inp++ is enough here? */
len = utlen[*inp];
if (len > 6) break;
while (len-- && *src)
*outp++ = *inp++;
}
}
*outp= 0;
}
Appendix E. Elimination of False Positives in TELT
This is sample code to eliminate false positives in TELT (i.e. check-
ing whether a match found by a search routine starts inside a tag).
The elimination of false positives in STLT is structurally equivalent
Expires in six months [Page 10]
Internet Draft Alternative Language Tagings for ACAP May 1997
and therefore not given explicitly here.
int TELTfalse (unsigned char *inp, unsigned char *pos)
{
while (inp <= pos) {
if (!strncmp(inp, "<ALTLANG>", 9)) {
inp += 9;
if (inp > pos)
return 1;
}
else (!strncmp(inp, "<LANG=", 6)) {
inp += 6;
while (*inp++ != '>') ;
if (inp > pos)
return 1;
}
}
return 0;
}
Appendix F. Elimination of Tags from TELT
This is sample code to eliminate tags from TELT (in UTF-8), thereby
leaving only plain text. The elimination of tags from STLT is struc-
turally equivalent and therefore not given explicitly here.
void TELTclean (unsigned char *inp, unsigned char *outp)
{
while (*inp) {
if (!strncmp(inp, "<ALTLANG>", 9))
inp += 9;
else if (!strncmp(inp, "<LANG=", 6)) {
while (*inp++ != '>') ;
}
else
*inp++= *outp++;
}
*outp= 0;
}
Expires in six months [Page 11]
Internet Draft Alternative Language Tagings for ACAP May 1997
Appendix G. Selection of the "best" alternative from TELT
This is sample code selects the "best" language match from TELT.
Assume input language tag has been converted to upper case. Assume
language tags won't exceed 256 characters. Returns a pointer to the
start of the "best" match. Code for STLT is structurally equivalent
and therefore not given explicitly here.
unsigned char TELTselect (unsigned char *inp, unsigned char *tag)
{
unsigned char tagbuff[256];
unsigned char *match1, match2;
unsigned char best= str;
int bestlen= 0;
int start= 1;
int mlen;
if (tag == NULL || !*tag)
return;
while (*inp) {
if (!strncmp(inp, "<ALTLANG>", 9)) {
inp += 9;
*outp++ = 0xFE;
start= 1;
}
if (start) {
mlen= 0;
/* get tag into tagbuff */
if (!strncmp(inp, "<LANG=", 6)) {
inp += 6;
match1= tagbuff;
while (*inp != '>')
*match1++ = *imp++;
*match1= 0;
inp++;
}
else *tagbuff= 0;
/* check match */
match1= tagbuff;
match2= tag;
while (*match1 && *match1++ == *match2++) {
if (*match2=="-" && (*match2=="-" || !*match2))
mlen = match1 - tagbuff;
}
Expires in six months [Page 12]
Internet Draft Alternative Language Tagings for ACAP May 1997
if (!*match2 && (*match1=='-' || !*match1)) {
best = str;
break;
}
if (mlen > bestlen) {
best = str;
bestlen = mlen;
}
/* search next alternative */
start = 0;
}
else
inp++;
}
return best;
}
Expires in six months [Page 13]