Network Working Group                                           C. Falk
Internet Draft                                        Infinite Automata
Intended status: Informational                            June 13, 2011
Expires: December 2011



            Tags for the Identification of Transliterated Text
                  draft-falk-transliteration-tags-01.txt


Status of this Memo

   This Internet-Draft is submitted in full conformance with the
   provisions of BCP 78 and BCP 79. This document may not be modified,
   and derivative works of it may not be created, except to publish it
   as an RFC and to translate it into languages other than English.

   This document may contain material from IETF Documents or IETF
   Contributions published or made publicly available before November
   10, 2008. The person(s) controlling the copyright in some of this
   material may not have granted the IETF Trust the right to allow
   modifications of such material outside the IETF Standards Process.
   Without obtaining an adequate license from the person(s) controlling
   the copyright in such materials, this document may not be modified
   outside the IETF Standards Process, and derivative works of it may
   not be created outside the IETF Standards Process, except to format
   it for publication as an RFC or to translate it into languages other
   than English.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF), its areas, and its working groups.  Note that
   other groups may also distribute working documents as Internet-
   Drafts.

   Internet-Drafts are draft documents valid for a maximum of six
   months and may be updated, replaced, or obsoleted by other documents
   at any time.  It is inappropriate to use Internet-Drafts as
   reference material or to cite them other than as "work in progress."

   The list of current Internet-Drafts can be accessed at
   http://www.ietf.org/ietf/1id-abstracts.txt

   The list of Internet-Draft Shadow Directories can be accessed at
   http://www.ietf.org/shadow.html

   This Internet-Draft will expire on December 13, 2011.




Falk                  Expires December 13, 2011                [Page 1]


Internet-Draft           Transliteration Tags                 June 2011


Copyright Notice

   Copyright (c) 2011 IETF Trust and the persons identified as the
   document authors. All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents
   (http://trustee.ietf.org/license-info) in effect on the date of
   publication of this document. Please review these documents
   carefully, as they describe your rights and restrictions with
   respect to this document.

Abstract

   This document describes the structure, content, creation, and
   semantics of language tags for use in describing text that was
   transliterated from one orthographic system to another.

Table of Contents


   1. Introduction...................................................2
      1.1. Problems Concerning Language Tags.........................2
      1.2. Tags for Identifying Languages............................3
   2. Transliteration Tags...........................................4
   3. Security Considerations........................................4
   4. IANA Considerations............................................4
   5. Conclusions....................................................4
   6. References.....................................................4
      6.1. Normative References......................................4
      6.2. Informative References....................................5
   7. Acknowledgments................................................5
   Appendix A. Examples of Transliteration Tags (Informative)........6

1. Introduction

1.1. Problems Concerning Language Tags

   Language tags are a common tool used in the Internet.  Such tags are
   useful in content localization and machine translation.  Many
   different standards exist for how to represent language information
   in machine-readable formats.

   Existing language tags all suffer from the same problem in that they
   represent only the language and not the orthography used in writing
   said language.  Many languages such as Russian, Chinese, and Arabic
   have multiple orthographies for written content.  A few languages,


Falk                  Expires December 13, 2011                [Page 2]


Internet-Draft           Transliteration Tags                 June 2011


   including Serbian, are digraphic, which means they are natively
   written in two or more different scripts.

   A further complication arises when including the practice of
   transliteration, or changing orthographies.  Most often this is seen
   when languages written in non-Latin orthographies are rewritten
   using Latin characters.  These orthographies are not mutually
   intelligible.  So to say that two different pieces of text are,
   "Chinese written in Latin script," is not useful if one is
   transliterated using the Wade-Giles system while the other is using
   the Pinyin system.

   The problems a complete language tag must address are:

     1. Identify the content's language.
     2. Identify the language's current orthography.
     3. Identify the original orthography used if the content was
        subject to transliteration.
     4. Identify the system used in the transliteration, if the current
        content differs from the original.

   To date no single language tag standard can address all these
   problems.

1.2. Tags for Identifying Languages

   While there are several existing language tag standards only a
   handful of these standards advance us toward the goal of a complete
   language tag system.  Chief among these is the RFC 5646 document as
   edited by Phillips and Davis.  RFC 5646 satisfies the first two
   criteria of the proposed complete language tag.

   First, RFC 5646 it represents the content's language.  This is the
   very first portion of a BCP 47 language tag.  If an alpha-2 code
   belonging to the ISO 639-1 standard is available then that code is
   used.  If no alpha-2 code is available then the longer alpha-3 code
   belonging to the ISO 639-3 standard is used.

   Second, RFC 5646 represents the languages current orthography.  This
   is an optional portion of the BCP 47 tag.  Language orthography
   representation is handled by the alpha-4 tags defined in the ISO
   15924 standard.

   What RFC 5646 doesn't address is the last two transliteration-
   related criteria for a complete language tag.




Falk                  Expires December 13, 2011                [Page 3]


Internet-Draft           Transliteration Tags                 June 2011


2. Transliteration Tags

   While RFC 5646 does have its shortcomings, it provides for future
   growth and expansion through extension sub-tags.  By using these
   extension sub-tags we can add a second layer of analysis upon the
   existing RFC 5646 tags to satisfy our transliteration tag criteria.

   As discussed in section 1.1. , the transliteration tag needs to
   define two additional pieces of data:

     1. Original orthography.
     2. The transliteration system used.

   There will be a new extension tag for each of these pieces of data:

     1. The original source orthography will be denoted by the
        singleton "s" followed by the ISO 15924 for the source script.
     2. The transliteration system will be denoted by the singleton "t"
        followed by a 2-8 character alphanumeric string abbreviation of
        the transliteration system.

3. Security Considerations

   The transliteration tag described in this document includes
   information about the transliteration system used.  Some
   transliteration standards are proprietary, and the information of
   their use in a public exchange might constitute a breach of privacy.

4. IANA Considerations

   There are no IANA considerations for this document.

5. Conclusions

   This document shows how, using the extension mechanisms built into
   the language tag standard of RFC 5646, a more complete way of
   representing written languages is achieved to include any
   transliteration performed upon the text.

6. References

6.1. Normative References

   [1]   Phillips, A. and Davis M. (Editors), "Tags for Identifying
         Languages", BCP 47, RFC 5646, September 2009.




Falk                  Expires December 13, 2011                [Page 4]


Internet-Draft           Transliteration Tags                 June 2011


   [2]   International Organization for Standardization, "ISO 639-
         1:2002.  Codes for the representation of names of languages -
         Part 1: Alpha-2 code", July 2002.

   [3]   International Organization for Standardization, "ISO 639-
         3:2007.  Codes for the representation of names of languages -
         Part 3: Alpha-3 code for comprehensive coverage of languages",
         February 2007.

   [4]   International Organization for Standardization, "ISO
         15924:2004.  Information and documentation -- Codes for the
         representation of names of scripts", January 2004.

                  6.2. Informative References

   [5]   Dale, I.R.H., "Digraphia", International Journal of the
         Sociology of Language 26 (1980) pp. 5-13.

   [6]   Buckwalter, T., "Buckwalter Arabic Transliteration", Qamus,
         2002.

   [7]   International Organization of Standardization, "ISO 9:1995.
         Transliteration of Cyrillic characters into Latin characters -
         Slavic and non-Slavic languages", 1995.

7. Acknowledgments

   Thanks to Tim Buckwalter of the University of Maryland for patiently
   answering questions about his Arabic transliteration system.

   This document was prepared using 2-Word-v2.0.template.dot.


















Falk                  Expires December 13, 2011                [Page 5]


Internet-Draft           Transliteration Tags                 June 2011


Appendix A.                 Examples of Transliteration Tags (Informative)

   ar-Latn-s-Arab-t-buckwalt (Arabic-language text transliterated from
   the Arabic script into the Latin script via the Buckwalter
   transliteration system)

   ru-Latn-s-Cyrl-t-iso9 (Russian-language text transliterated from the
   Cyrillic script into the Latin script via the ISO 9 transliteration
   system)

   zh-Latn-s-Hans-t-pinyin (Mandarin Chinese-language text
   transliterated from the simplified Han script into the Latin script
   via the Pinyin transliteration system)




































Falk                  Expires December 13, 2011                [Page 6]


Internet-Draft           Transliteration Tags                 June 2011


Authors' Addresses

   Courtney Falk
   Infinite Automata

   Email: court@infiauto.com











































Falk                  Expires December 13, 2011                [Page 7]