INTERNET-DRAFT                                     M.T. Carrasco Benitez
<draft-benitez-winter-cultures-00.txt>
Expires November 16th 1996                                May 16th, 1996

                                 WInter
               (Web Internationalization & Multilinguism)


Status of this Memo

This document is an Internet-Draft. Internet-Drafts are working
documents of the Internet Engineering Task Force (IETF), its areas,
and its working groups. Note that other groups may also distribute
working documents as Internet-Drafts.

Internet-Drafts are draft documents valid for a maximum of six
months and may be updated, replaced, or obsoleted by other documents
at any time. It is inappropriate to use Internet-Drafts as reference
material or to cite them other than as "work in progress".

To learn the current status of any Internet-Draft, please check
the "1id-abstracts.txt" listing contained in the Internet-Drafts
Shadow Directories on ftp.is.co.za (Africa), nic.nordu.net (Europe),
munnari.oz.au (Pacific Rim), ds.internic.net (US East Coast), or
ftp.isi.edu (US West Coast).

Distribution of this document is unlimited. Please send comments
to the WInter mailing list at <winter@dorado.crpht.lu>. Information
about the WInter mailing list, including subscription details, is
on the WInter Page at:  http://www.crpht.lu/~carrasco/winter



Abstract

This document discusses the Internationalization & Multilinguism
of the Web: a Web capable of supporting different cultures, natural
languages and Language Engineering facilities such as Parallel
Texts. Internationalization permeates most subsystems: client,
transmission, server, data and authoring; the primitive mechanisms
for WIntering should be part of the Web foundations.


Table of Contents

1. Introduction
  1.1 Mandate
  1.2 Writing style
WInter Page 2


  1.3 Terminology

2. Character Set
  2.1 Back office
  2.2 Front office
  2.3 Multilingual typography
  2.4 The characters in the URL

3. Internationalization & localization
  3.1 Elements of localization
  3.2 Messages as HTML pages

4. Multilinguism

5. Parallel Hypertext
  5.1 Definition
  5.2 Language tags
  5.3 Document request
  5.4 Parallel Hypertext Data Structure (PHDS)
  5.5 Linking strategy
  5.6 Generation of parallel texts
    5.6.1 Language dependent strings
    5.6.2 Language-void document

6. Bidirectionality (BIDI)
7. The LANG attribute
8. LINKs
9. Multilingual thesaurus
10. Electronic Data Interchange (EDI)
11. Passing selected text to a CGI
12. Reference model for Internationalization & Multilinguism
13. VRML
14. Java

15. Dragoman
  15.1 Interactive Search
  15.2 The Translation Folder (full preprocessing)
  15.3 Preprocessing for Machine Translation
  15.4 Machine Translation
  15.5 Pseudo-Automatic Translation (PAT)
  15.6 Document Generation
  15.7 Document Comparison
  15.8 Author's Workbench
  15.9 Terminology Verification
  15.10 Multilingual Aligned Text Editor
  15.11 Printing

16. Acknowledgments
17. Bibliography
18. Author Address


1. Introduction

The intention of this document is to consider all aspects of
WIntering. It aims to fulfill two functions:

-  A catalogue of issues

-  A primer

To a very large extent, it brings together the efforts of other
groups. It goes into more detail where material is not covered
elsewhere.

An Internationalized & Multilingual Web should have the traditional
facilities of Internationalization and more advanced facilities
needed for Language Engineering. For example, clients should have
a language menu (similar to edit or file menus) that shows in which
other linguistic versions the currently displayed document is
available; or clients should be capable of displaying and moving
in sync side by side, two linguistic versions of the same document.

"Another noteworthy characteristic of this manual is that it doesn't
always tell the truth. When certain concepts of TEX are introduced
informally, general rules will be stated; afterwards you will find
that the rules aren't strictly true."

The TEXbook Donald E. Knuth


The above quote applies particularly to the documents summarized in
this document. Though the intention is to make this document
self-contained by summarizing or quoting other documents, it is strongly
recommended to consult the source documents.

1.1 Mandate
One of the recommendations of the Internationalization Workshop
during the Fifth International WWW Conference in Paris on May 6th
1996 was that a document should be maintained to fulfill the
purpose described in the above introduction. The author accepted
the task and the present document is the result.

1.2 Writing style
A special effort should be made to make this document as accessible
as possible to non-computer specialists (e.g., linguists) and
non-native speakers of English. Due to the characteristics of WInter,
there should be a significant number of both. This does not imply
that there should be one type of document for each type of participant;
it means that this document should be accessible to all participants,
perhaps by adopting a journalistic style and re-stating the evident.
The overhead should be small, and it is good to avoid misunderstandings,
even between people of the same field.

Comments regarding the writing style from journalists or readers
with similar profiles are very welcome; i.e., non-computer specialists
who have to explain computer material to other non-computer
specialists. Suggestions could cover what additional material should
be included to make this document more self-contained, and what terms
should be replaced to make it more accessible. However, the gory
normative details must be present.

1.3 Terminology
Alignedness
It is a quality of Parallel Texts; for example, the Treaty of Rome
in English and Spanish are Parallel Texts and they should be aligned.
The interesting part is aligning Parallel Texts automatically.

Author-Translator-Publisher Chain (ATP-chain)
It refers to the integration of all the phases in the production
of documents, usually in large distributed systems.

Globalization
In the context of electronic commerce, the mechanisms to facilitate
global trade. Internationalization & Multilinguism are some of
these mechanisms. A legal framework is an example of a non-computer
mechanism.

I18N
Abbreviation for Internationalization. The 18 refers to the eighteen
characters between the initial I and the final n: nternationalizatio.

Language Engineering
Language Engineering is the application of computer science to
natural languages. For example:

-  Terminology

-  Translator's Memory

-  Multilingual documentary databases

-  Aligned Text

-  Translator's Workbench

-  Author's Workbench

-  Machine Translation

-  Publishing (in particular, multilingual synchronized publishing)

Level of Alignedness
This is a metric of alignedness. According to the depth at which it
is possible to identify the Linguistic Objects, the texts are aligned
at:

-  Document level: the trivial case; i.e., Parallel Texts.

-  Paragraph level: not too hard to achieve.

-  Sentence level: desirable and possible to achieve.

-  Term level: it needs tagging for automatic alignedness.

-  Word level: it needs tagging for automatic alignedness.

In this context, a sentence is a part of a text delimited by a dot,
semicolon or similar; i.e., it has little grammatical meaning and
the main interest is to identify Linguistic Objects.
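
As an illustration of the sentence level described above, the following Python sketch splits two Parallel Texts on dots, semicolons and similar delimiters and pairs the resulting pieces by position. The function name split_sentences and the sample texts are assumptions made for this example; real alignment would need more than positional pairing.

```python
import re

def split_sentences(text):
    """Split text into 'sentences' delimited by a dot, semicolon or similar."""
    # Break after ., ;, ! or ? when followed by whitespace; keep the delimiter.
    parts = re.split(r"(?<=[.;!?])\s+", text.strip())
    return [part for part in parts if part]

english = "The white table. The black table; the green table."
spanish = "La mesa blanca. La mesa negra; la mesa verde."

# Naive positional pairing: only valid when both texts split into
# the same number of pieces.
pairs = list(zip(split_sentences(english), split_sentences(spanish)))
```

If the two texts split into different numbers of pieces, they are not aligned at this level and the pairing above should not be trusted.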

Linguistic Object
A Linguistic Object is a unit of language representation. It can be
a fixed language representation (term, abbreviation, title, segment,
phrase, paragraph, etc) or a meta-language representation (a grammatical
construction, etc). More generally, a Linguistic Object is a discrete
linguistic unit (usually a string) whose meaning is created by the
program treating it.

Multilingual Aligned Text (MAT)
A MAT is a record in a table with one Linguistic Object per language
field (English, Spanish, German, etc); these objects are equivalents
(usually translations) of each other. There are other fields
for classification and other purposes. MATs constitute independent
elements of a table; i.e., there is no ordering in the table. The
end result is a data structure similar to a multilingual dictionary.
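
A MAT table could be represented along the following lines. This is a minimal Python sketch, not a proposed format: the class name MAT, the field names objects and classification, and the helper equivalents are assumptions for illustration only.

```python
from dataclasses import dataclass, field

@dataclass
class MAT:
    """One Multilingual Aligned Text: one Linguistic Object per language."""
    objects: dict                                        # language tag -> Linguistic Object
    classification: dict = field(default_factory=dict)   # other fields (status, etc)

# The table is an unordered collection of independent records.
mat_table = [
    MAT({"en": "table", "es": "mesa"}, {"status": "verified"}),
    MAT({"en": "The white table.", "es": "La mesa blanca."}),
]

def equivalents(table, text, source, target):
    """Return the target-language equivalents of `text`, if any."""
    return [record.objects[target] for record in table
            if record.objects.get(source) == text and target in record.objects]
```

Used as a dictionary, equivalents(mat_table, "table", "en", "es") looks up the Spanish equivalent of the English term.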

Parallel Texts
Texts that are translations of each other. For example, the Treaty
of Rome in English and Spanish are Parallel Texts. Parallel Texts
could be aligned to several levels.

WInter
It stands for Web Internationalization & Multilinguism.


2. Character Set

A large character set is a basic prerequisite for having
Internationalization & Multilinguism. The bottom line is that the
Web must be capable of handling Unicode [UNICODE].

The character set should be considered a low-level layer; i.e.,
like the pieces of wire in the seven-layer ISO Reference Model
(physical, datalink, network, etc). Other functionalities should
be in other layers. There is a tendency to overload this layer,
as opposed to defining new layers.

There are two aspects to the character set:

The Back office
It deals with storage on disk, transmission, representation in the
document, etc.

The Front office
It is concerned with rendering on the screen or printer.

2.1 Back office
Latin-1 [ISO-8859-1] is the default character set for the Web.
Latin-1 is only sufficient for Western European languages. Latin-1
is an 8-bit encoding. This permits a maximum of 256 characters.

Unicode (ISO 10646 BMP) is a large character set that includes most
of the world's languages. Unicode is a 16-bit encoding. This permits
over 65,000 characters. At present, over 25,000 positions are still
free. This form is also called UCS-2; i.e., Universal Character
Set 2-bytes. Unicode is the first plane of ISO 10646 (see below);
this plane is also called BMP (Basic Multilingual Plane) or Plane
Zero. The Internationalization of the Hypertext Markup Language
[I-HTML] proposes Unicode as the document character set.

ISO 10646 is a 32-bit encoding, of which 31 bits are used. It is
divided into 32,768 planes (128 groups of 256 planes), each with a
capacity of 65,536 characters. This permits about 2,147 million
characters. This form is also called UCS-4, Universal Character
Set 4-bytes. Only the first plane (Unicode) is in use.

UTF-8 (Universal Character Set Transformation Format) is an addendum
to ISO 10646. It provides compatibility with ASCII: the ASCII
characters are represented by 1 byte (8 bits) and not 4 bytes (32
bits). In general, it is economical with the bytes used in the
encoding.
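
The difference between the encoding forms can be observed with a few lines of Python. This is only a sketch: Python's "utf-16-be" and "utf-32-be" codecs are used here as stand-ins for UCS-2 and UCS-4, which is an assumption that holds for characters of the first plane; the function name encoded_sizes is invented for this example.

```python
def encoded_sizes(text):
    """Return the number of bytes `text` occupies in each encoding form."""
    return {
        "UCS-2": len(text.encode("utf-16-be")),  # 2 bytes per first-plane character
        "UCS-4": len(text.encode("utf-32-be")),  # 4 bytes per character
        "UTF-8": len(text.encode("utf-8")),      # 1 byte for ASCII, more otherwise
    }

ascii_sizes = encoded_sizes("table")          # pure ASCII text
accented_sizes = encoded_sizes("\u00f1")      # LATIN SMALL LETTER N WITH TILDE
```

For the ASCII word the UTF-8 form takes 5 bytes against 10 and 20 for the fixed-width forms; the accented letter takes 2 bytes in UTF-8.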

[HTTP-1.1] allows for the character set to be negotiated. For
example, the client and server can agree on using Unicode.

2.2 Front office
Rendering is drawing the glyphs (graphic representation of the
characters) on the screen or printer. This is the job of the browser
and the browser depends on the graphical facilities of the computer.

Undisplayable characters are the characters that cannot be displayed
due to the lack of facilities. The I-HTML "does not prescribe any
specific behavior", but notes some "considerations". WInter recommends
the following:

-  The behavior of undisplayable characters must be controlled by
the options setting of the browser

-  Some options can be combined.

-  There must be a small Undisplayable Characters Flag in the
browser part of the screen, not in the document part. Something
similar to the red button indicating that the browser is loading
a document, but smaller. The flag must be ON if the current document
contains one or more undisplayable characters. The presence or
absence of the flag must be user definable.

-  Undisplayable Character Tolerance is a user definable value in
the range from 0 to 10, that signals the behavior of the browser.

-  0 Undisplayable Character Tolerance means ignore all undisplayable
characters.

-  5 Undisplayable Character Tolerance means a reasonable default
warning for undisplayable characters. This behavior must be defined.
For example, show only up to 10 continuous undisplayable characters
and try remaps, such as "e'" (e with an acute accent) to "e".

-  10 Undisplayable Character Tolerance means show one Replacement
Glyph for each undisplayable character.

-  The other intermediary values must change gradually.

-  Undefined Undisplayable Character Tolerance must gravitate
towards the default value (5).

-  The undisplayable characters must be remappable to a user definable
Replacement Glyph (for example, "_") or to one of several numeric
representations (for example, hexadecimal or decimal).

-  The default Replacement Glyph must occupy approximately the same
space as the average glyph in the document. It must be a box
containing the Unicode value in hex.
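
The recommendations above could be sketched as follows in Python. Several points are assumptions of this example and not part of the recommendations: the box is rendered as the hex value between square brackets, the remap uses Unicode canonical decomposition, and the intermediary tolerance values (1 to 9) are all treated like the default, since their gradual behavior is not yet defined.

```python
import unicodedata

def replacement_glyph(ch):
    """Stand-in for the default Replacement Glyph: the Unicode value in hex."""
    return "[%04X]" % ord(ch)

def remap(ch, displayable):
    """Try to remap an accented letter to its base letter (e-acute to e)."""
    base = unicodedata.normalize("NFD", ch)[0]
    if base != ch and base in displayable:
        return base
    return None

def render(text, displayable, tolerance=5, max_run=10):
    """Return (rendered text, Undisplayable Characters Flag)."""
    out, run, flag = [], 0, False
    for ch in text:
        if ch in displayable:
            out.append(ch)
            run = 0
            continue
        flag = True
        if tolerance <= 0:
            continue                            # 0: ignore the character
        if tolerance >= 10:
            out.append(replacement_glyph(ch))   # 10: one glyph per character
            continue
        mapped = remap(ch, displayable)         # 1 to 9: treated like the default
        if mapped is not None:
            out.append(mapped)
            run = 0
        else:
            run += 1
            if run <= max_run:                  # show at most max_run in a row
                out.append(replacement_glyph(ch))
    return "".join(out), flag

# Hypothetical browser that can only display printable ASCII.
ASCII = {chr(code) for code in range(32, 127)}
sample = "caf\u00e9 \u2603"                     # e-acute and a snowman
```

With the default tolerance, render(sample, ASCII) remaps the accented letter and boxes the symbol; the returned flag tells the browser whether to switch the Undisplayable Characters Flag ON.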

Font Servers could supply the browser with missing glyphs.

2.3 Multilingual typography
{The proposal of Martin Dürst will be summarized here.}

2.4 The characters in the URL
The characters allowed in the URL are a subset of ASCII. URLs were
supposed to be hidden, but they are very visible and important
commercially: firms want to spell their names with accents. The
most urgent need is a large character set for the query part.
There have been proposals to use UTF-8. URLs need a lot of
work.
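
One possible reading of the UTF-8 proposals is to encode the query text as UTF-8 and then escape each byte that falls outside the allowed ASCII subset. The Python sketch below shows this with the standard urllib.parse module; the escaping scheme and the function name encode_query_value are assumptions of this example, since no scheme has been agreed.

```python
from urllib.parse import quote, unquote

def encode_query_value(value):
    """Encode text as UTF-8, then %-escape every byte that is not unreserved."""
    return quote(value, safe="", encoding="utf-8")

# A firm name with accents, written with escapes to keep this text in ASCII.
name = "Caf\u00e9 M\u00fcller"
escaped = encode_query_value(name)     # only ASCII characters remain
restored = unquote(escaped)            # decodes the UTF-8 bytes again
```

The escaped form stays inside the current URL character subset, at the cost of being unreadable for the user.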


3. Internationalization & localization

Internationalized software is developed without embedded cultural
characteristics. It can be localized parametrically for
different cultures; for example, the same software can run for
Germany with the German conventions, or for Italy with the Italian
conventions.

Internationalization is a well-known field; for example, a significant
amount of effort was invested during the POSIX (Unix) standardization.
The mechanisms must be sufficient for implementing the localizations.
Localization itself is usually discussed in other fora; for example,
how to represent the date in Germany. Most conventions have
already been agreed.

Any number of cultures (real or imaginary) is possible; for example,
France, Germany or the European Commission. The European
Commission has to work in the eleven official languages (including
Greek), and with cross-cultural conventions or with the national
conventions.

3.1 Elements of localization
Languages
Two aspects:

-  Language strings in the software.

-  Data in the document.

Example, the software could be in German and the document shown in
French.

Sorting order
Number representation
Example, the internal number could be 12345.67 and the external
representation could be 12,345.67 or 12.345,67.

Date & Time
Example, the internal representation could be 19951231 and the
external representation could be December 31st 1995, or 31-12-1995.

Short quotations
Example,

-  "I am a Berliner" (English)

-  <<Je suis un Berlinois>> (French)

-  ,,Ich bin ein Berliner'' (German)

The new element <Q> in I-HTML is for this purpose.

New internationalization elements should be added to this list,
for example, color.

The software should be localized from a list of preferred localizations,
and switchable from one localization to another without re-starting
the application.
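
The number, date and quotation elements of section 3.1 can be sketched as a table of conventions that is consulted at run time, which also makes the localization switchable without re-starting. The Python below is a minimal sketch: the CONVENTIONS table, its keys and the function names are assumptions of this example, and only two cultures are shown.

```python
from datetime import datetime

# Hypothetical localization table; real conventions are agreed in other fora.
CONVENTIONS = {
    "en-US": {"thousands": ",", "decimal": ".", "date": "%m/%d/%Y",
              "quotes": ('"', '"')},
    "de-DE": {"thousands": ".", "decimal": ",", "date": "%d.%m.%Y",
              "quotes": (",,", "''")},
}

def format_number(value, culture):
    """Turn the internal number into its external representation."""
    conv = CONVENTIONS[culture]
    text = f"{value:,.2f}"             # e.g. "12,345.67"
    # Swap separators through a placeholder so they do not overwrite each other.
    return (text.replace(",", "\0")
                .replace(".", conv["decimal"])
                .replace("\0", conv["thousands"]))

def format_date(internal, culture):
    """Turn an internal date such as 19951231 into the external form."""
    parsed = datetime.strptime(internal, "%Y%m%d")
    return parsed.strftime(CONVENTIONS[culture]["date"])

def quote_text(text, culture):
    """Wrap a short quotation in the quotation marks of the culture."""
    opening, closing = CONVENTIONS[culture]["quotes"]
    return opening + text + closing
```

Switching from one localization to another is then only a matter of passing a different culture key to the same functions.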

3.2 Messages as HTML pages
The Status-Code and the Reason-Phrase (see 6.1.1, HTTP-1.1) are
presented as HTML pages. These are Language strings in the software
but are usually presented as data documents. For example, 404: Not
Found.

The localization of the Reason-Phrase can be done by the client or
the server. If the client can do a better job, it has to drop the
page sent by the server and generate the localized page from the
Status-Code and the LANG tag.
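
A client-side localization of the Reason-Phrase could look like the Python sketch below. The REASON_PHRASES table, its translations and the function name localized_page are assumptions of this example; a real client would carry a complete table per language.

```python
# Hypothetical table of localized Reason-Phrases kept by the client.
REASON_PHRASES = {
    404: {"en": "Not Found", "es": "No encontrado", "de": "Nicht gefunden"},
}

def localized_page(status_code, lang, server_page):
    """Return a localized HTML page, or the server page if none is known."""
    phrase = REASON_PHRASES.get(status_code, {}).get(lang)
    if phrase is None:
        return server_page             # the client cannot do a better job
    title = f"{status_code}: {phrase}"
    return (f'<html lang="{lang}"><head><title>{title}</title></head>'
            f"<body><h1>{title}</h1></body></html>")

server_page = "<html><body><h1>404: Not Found</h1></body></html>"
spanish_page = localized_page(404, "es", server_page)
unknown_page = localized_page(404, "da", server_page)
```

When the client has no phrase for the requested language, it keeps the page sent by the server.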


4. Multilinguism

Multilinguism deals with advanced language facilities, often several
languages simultaneously. It is also referred to as Language Engineering.
This comes from the tradition of specialized software for Language
Engineering, such as Translator's Workbench. One of the main
applications is the processing of Parallel Texts.

Most Language Engineering software is incompatible and
there are practically no standards in this field. Usually, researchers
or vendors start from scratch and develop all the modules, even
horizontal modules such as user interfaces and data structures,
rather than concentrating on the engines for language processing (for
aiding the translator, machine translation, etc).

One of the main immediate objectives in Language Engineering must
be the creation of standards that clearly separate data and software;
i.e., it should be possible to acquire a translation aid program
from one vendor and the dictionaries from another vendor.

The purpose is not to make every browser a Translator's Workbench,
though browsers could do with more advanced language facilities
than are usually found in internationalized products. But the
standards must allow the construction of Translator's Workbenches
based on Web technology.

After security and the application for secure payment over the
Internet, Language Engineering is one of the most economically
relevant applications; in intranets, with lower
security requirements, it is probably the most important. It is as
horizontal as publishing and, indeed, it is the second phase in
the ATP-chain (Author-Translator-Publisher). Translating is expensive
and very labor intensive. For most texts, machine translation is
not acceptable. On the other hand, translation aids are
very cost effective, particularly if integrated in an ATP-chain.
Savings in translation tend to be large.


5. Parallel Hypertext

5.1 Definition
Parallel Hypertext is an extension of the hypertext paradigm to
natural languages. For example, a user looking at a document in
English should be able to obtain the Spanish version in a transparent
way; i.e., just by selecting the Spanish option in a language menu
and not by selecting a link embedded in the English version. For
this, the Web must know about languages; i.e., it must understand
the notion of the same document in another language. The property of
alignedness in Parallel Texts also applies to Parallel Hypertext.

5.2 Language tags
The language tags (see 3.10, HTTP-1.1) are composed of a primary
language tag and a possibly empty series of subtags.

Examples:

en
en-US
en-cockney

There must be a way to indicate

-  Human translation

-  Machine translation

-  Transliteration

This could be part of a subtag or inside the document.
{Examples will be added.}

5.3 Document request
Clients should be able to request documents at least in the following
ways:

-  A document is requested according to a language preference list
that could be the same list used for choosing the display labels
in the user interface. The server must respond with the best linguistic
version and the list of available linguistic versions. The best
linguistic version is the one nearest to the top of the list and, if
none is available, the one nearest to the top of the defaults in the
server. In this case, the browser probably does not know which
linguistic versions are available.
{This will be developed.}

-  A document is requested in one specific language. The server
must respond only with that linguistic version (no other is
acceptable) and the list of available linguistic versions. In this
case, the client probably knows that the requested version is
available; it could be the result of a previous conversation with
the server.

Example:

-  Conversation 1
Client : Give me MyDoc with this order of preference: Danish,
English or German
Server : Take MyDoc in German; it is available in German, Italian
and Spanish

-  Conversation 2
Client : Give me MyDoc only in Spanish
Server : Take MyDoc in Spanish; it is available in German, Italian
and Spanish
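
The two kinds of request can be sketched as a single selection function on the server. This Python sketch is illustrative only: the function name select_version, the only flag and the use of None for "no acceptable version" are assumptions of this example.

```python
def select_version(preferences, available, server_defaults=(), only=False):
    """Return (chosen language or None, list of available versions)."""
    versions = list(available)
    if only:
        # One specific language is requested; no other is acceptable.
        wanted = preferences[0]
        return (wanted if wanted in versions else None), versions
    # Nearest to the top of the client list, then of the server defaults.
    for lang in list(preferences) + list(server_defaults):
        if lang in versions:
            return lang, versions
    return None, versions

available = ["de", "it", "es"]
conversation_1 = select_version(["da", "en", "de"], available)
conversation_2 = select_version(["es"], available, only=True)
```

Both calls return the chosen version together with the list of available versions, as in the two conversations above.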


The linguistic versions of the document could be in different servers.

This could be done with the Accept-Language and Content-Language
facilities (see 10.4 and 10.11, HTTP-1.1).

The parameters in Accept-Language:

Quality factor "q" is described as "... estimate of the user's
comprehension of that language ...". But the user indicates his
language preference list and there is no need to use the parameter
with this meaning. It would be more useful to indicate the "minimum
acceptable quality of the translation". Some of the translations
could be done by more or less experienced translators, or by machine
translation.

A different usage could be to indicate the level of alignedness.

Maximum acceptable size "mxb" is not used. It could indicate the
number of linguistic versions desired.

An Accept-Language with a single language parameter must mean that
the browser only wants that linguistic version and not another.

The Content-Language "... describes the natural language(s) of the
intended audience ...". The meaning of this field should be "the
list of linguistic versions available"; it should be used by the
browser to update the language menu, so the user could know which
other linguistic versions are available.
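
For reference, an Accept-Language value with the "q" parameter can be turned into an ordered preference list as in the Python sketch below. The function name parse_accept_language is an assumption of this example; the last line applies the rule proposed above that a single language means "only that linguistic version", which is a proposal of this document and not part of HTTP-1.1.

```python
def parse_accept_language(header):
    """Return the language tags ordered by descending q (default q=1)."""
    entries = []
    for position, item in enumerate(header.split(",")):
        parts = [part.strip() for part in item.split(";")]
        if not parts[0]:
            continue                   # skip empty list elements
        q = 1.0
        for param in parts[1:]:
            name, _, value = param.partition("=")
            if name.strip() == "q":
                q = float(value)
        # Sort by q (highest first), then by position in the header.
        entries.append((-q, position, parts[0]))
    return [tag for _, _, tag in sorted(entries)]

tags = parse_accept_language("de;q=0.5, en;q=0.8, da")
only_one_version_wanted = len(parse_accept_language("es")) == 1
```

The resulting list can be used directly as the language preference list of the request.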

5.4 Parallel Hypertext Data Structure (PHDS)
One Parallel Hypertext Data Structure contains all the information
for one Parallel Hypertext Document. The Parallel Hypertext Data
Structure must allow the following:

-  Several data schemes. For example, directory, SGML, tar, etc

-  Keeping the linguistic versions in different servers

-  Conversation with monolingual clients. In this case, the user
must know the structure

The Parallel Hypertext Data Structure has two parts:

The PHDS-Header
Contains administrative data; for example, where the German
linguistic version is located. The data is divided into structured
fields.

The PHDS-Body
Contains the linguistic data. It has one section per language.

The PHDS-Header is always an HTML file. This file must fulfill two
functions:

-  Allowing a user to select one linguistic version

-  Be used by WIntered Web programs (clients/servers) as a
datastructure to locate the pertinent linguistic version

The PHDS-Header must contain at least the following information:

-  Name

-  DataScheme

-  DataLocation (for all the parts)

The DataScheme applies only to the PHDS-Body. The PHDS-Header is
always an HTML file.

{An example of a file in HTML will be added.}

The default for a single set of files is:

DocName.html                              (PHDS-Header)

DocNameDir                                (PHDS-Body, a directory)
           /en.html             English   (PHDS-Body language section)
           /es.html             Spanish   (PHDS-Body language section)
           /de.html             German    (PHDS-Body language section)



The default for several sets of files is:

DocName.html                              (PHDS-Header)

DocNameDir                                (PHDS-Body, a directory)
           /en/DocName1.html    English   (PHDS-Body language section)
           /en/DocName2.html    English   (PHDS-Body language section)

           /es/DocName1.html    Spanish   (PHDS-Body language section)
           /es/DocName2.html    Spanish   (PHDS-Body language section)

           /de/DocName1.html    German    (PHDS-Body language section)
           /de/DocName2.html    German    (PHDS-Body language section)

The DocName.html should be usable directly by the present clients
(browsers) and/or indirectly to generate HTML files on the fly.
Multilingual clients should use the information to access the
documents in a transparent way.

Requesting a URL of a PHDS-Header must get the linguistic version
according to the rules of the language preferences. Requesting a
URL of a PHDS-Body language section must get that linguistic version.
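
The default layouts shown above can be expressed as two small path functions. This Python sketch assumes that "DocNameDir" means the document name followed by "Dir" and that the parts of a multi-file document are named as in the listing; both are readings of the example, not normative rules.

```python
def header_path(doc_name):
    """Path of the PHDS-Header, which is always an HTML file."""
    return f"{doc_name}.html"

def body_section_path(doc_name, lang, part=None):
    """Path of one PHDS-Body language section in the default layouts."""
    directory = f"{doc_name}Dir"       # assumption: DocName + "Dir"
    if part is None:
        return f"{directory}/{lang}.html"        # single set of files
    return f"{directory}/{lang}/{part}.html"     # several sets of files

single = body_section_path("MyDoc", "es")
several = body_section_path("MyDoc", "de", part="MyDoc2")
```

A server could resolve a request for header_path("MyDoc") to one of these sections according to the language preferences, and serve a direct request for a section as it is.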

The server must know at least the following defaults:

-  language with the explicit links

-  preferred language list

-  MAT table

{This will be extended.}

A standard data structure for Parallel Hypertext would be of use
to anybody working with Parallel Texts, whether or not the Web
is used. For example, CD-ROMs could be published with Parallel
Texts for language processing programs, such as Machine Translation,
that would know what to expect. At present, there is no standard
for Parallel Texts or MAT.

The relation with Text Encoding Initiative (TEI) will be explored.

5.5 Linking strategy
The linking strategy must minimize the maintenance. This is essential
for large multilingual documentary databases. For example, the
millions of pages of the European Institutions in eleven languages.
Only one linguistic version should have explicit links; i.e., the
links as used today that are physically present in the documents.
The other linguistic versions would have implicit links; i.e. links
that would not be physically present in the texts, but they could
be calculated by the alignedness of the different linguistic
versions.

The generation of implicit links could be a client, server and/or
authoring affair:

-  Client.- A client could receive a linguistic version with explicit
links and a linguistic version with implicit links. The client
would display the linguistic version with the explicit links or it
would calculate the implicit links on the fly and display the
result.

-  Server.- A multilingual server could process documents with
implicit links and generate on the fly documents with explicit links.

-  Authoring.- An interactive or batch authoring system could
process documents with implicit links and it could create new
documents with explicit links; the server would not know how the
new documents were created.

These options should be considered as a continuum and (some) are
not mutually exclusive: most degrees between the extremes are
possible. For example, servers could be able to create documents
on the fly and they could be using documents with the links generated
by authoring systems. Indeed, a mixture could be the most probable
case.

The level of alignedness should be calculated in advance and kept
in the Parallel Hypertext Data Structure. Some documents were widely
regarded as aligned because they had been revised over half a dozen
times and had been heavily used for decades (best-case
documents); once they were submitted to a computer program, it came
to light that they were not aligned even to paragraph level.

The linked text (i.e., what goes between <a ...> and </a>) cannot
be finer than the level to which the texts are aligned.
For example, for texts aligned only at paragraph level, it is not
possible to calculate implicit links at sentence level. A corollary
is that texts aligned at document level can have implicit links
only at the beginning or at the end.

The links would have to be at least at sentence level. It would be
hard to place implicit links in part of a sentence without tagging:
the second text should have null links; named null links if there
are several in one sentence.

Examples:

-  No need for null links in the second text. A whole sentence is
linked in the first text and finding the place for the implicit
links in the second text is easy.

The white table. <a href="MyURL"> The black table </a> The green table.
La mesa blanca.                   La mesa negra.       La mesa verde.
                 (implicit link)

-  It needs a null link in the second text. Only part of a sentence
is linked in the first text and finding the place for the implicit
link in the second text is hard; i.e., it cannot be done with simple
string processing and it needs computational linguistics.

The white table. The black <a href="MyURL"> table </a> The green table.
La mesa blanca.  La <a name="Null"> mesa </a> negra.   La mesa verde.
                     (null link)
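
The easy case (whole sentences linked, texts aligned at sentence level) can be sketched in Python as below. The representation of the first text as (sentence, href) pairs and the function name apply_implicit_links are assumptions of this example; the hard case with null links is not covered.

```python
def apply_implicit_links(first, second):
    """Copy sentence-level links from the first text onto the second.

    first:  list of (sentence, href or None) with the explicit links
    second: list of sentences aligned with `first`, one to one
    """
    if len(first) != len(second):
        raise ValueError("the texts are not aligned at sentence level")
    result = []
    for (_, href), sentence in zip(first, second):
        if href is None:
            result.append(sentence)
        else:
            result.append(f'<a href="{href}">{sentence}</a>')
    return " ".join(result)

english = [("The white table.", None),
           ("The black table.", "MyURL"),
           ("The green table.", None)]
spanish = ["La mesa blanca.", "La mesa negra.", "La mesa verde."]
linked_spanish = apply_implicit_links(english, spanish)
```

The same function could run in a client, a server or an authoring system; only the moment at which the explicit links are produced differs.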

5.6 Generation of parallel texts
The linguistic versions could be generated through machine translation
or other techniques. For example, a system could have documents in
Spanish and a program for translation to English. The user should
be informed by the language menu into which languages and with
which techniques (MT, human translator, etc) the documents are
available.

{This will be extended.}

5.6.1 Language dependent strings
These are tags to be replaced by a language string (Linguistic Object)
according to the language requested. For example, the following
shows the content of an HTML document and the resulting replacement,
assuming that the language requested is German and that the Linguistic
Object corresponding to the identifier String_1 is the German phrase
below:

 <SomeTag SomeLabel=String_1>

 Ich bin ein Berliner


5.6.2 Language-void document
A document without any language string; i.e., it contains only
language dependent strings. In this case, only one HTML document
is needed and not one per language; this HTML document could be
considered a mask. A database with Linguistic Objects is needed.
The same Linguistic Object can be used in several documents.

This technique could be used for the localization of the messages
sent by the server as HTML documents.
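
Filling a mask from a database of Linguistic Objects could be sketched as follows in Python. The tag syntax is taken from the example in 5.6.1; the LINGUISTIC_OBJECTS table, the English entry and the function name fill_mask are assumptions of this example.

```python
import re

# Hypothetical database of Linguistic Objects, keyed by identifier.
LINGUISTIC_OBJECTS = {
    "String_1": {"de": "Ich bin ein Berliner", "en": "I am a Berliner"},
}

# Matches the placeholder tag used in the example of section 5.6.1.
TAG = re.compile(r"<SomeTag\s+SomeLabel=(\w+)>")

def fill_mask(mask, lang, objects=LINGUISTIC_OBJECTS):
    """Replace every language dependent string by its Linguistic Object."""
    def replace(match):
        return objects[match.group(1)][lang]
    return TAG.sub(replace, mask)

mask = "<P><SomeTag SomeLabel=String_1></P>"
german = fill_mask(mask, "de")
english = fill_mask(mask, "en")
```

A single mask serves every language, and the same Linguistic Object can be reused by any number of masks.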


6. Bidirectionality (BIDI)

(see 4.2, I-HTML)
{A summary of I-HTML will be inserted.}


7. The LANG attribute

(see 3, I-HTML)
{A summary of I-HTML will be inserted.}


8. LINKs

<LINK REL=Glossary>
<LINK REL=Dictionary>
<LINK REL=Translation>
{This will be extended.}


9. Multilingual thesaurus

This is a tool for finding references to the search term in any
language. For example, if the string in the search is "table" it
should also find the Spanish document with the word "mesa" (table in
Spanish).
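
The behavior described above amounts to expanding the search term with its equivalents before searching. The Python sketch below shows the idea; the THESAURUS and DOCUMENTS contents and the function names are assumptions of this example.

```python
import re

# Hypothetical thesaurus: each entry holds equivalent terms per language.
THESAURUS = [{"en": "table", "es": "mesa"}]

# Hypothetical document collection.
DOCUMENTS = {
    "doc-en": "The white table.",
    "doc-es": "La mesa blanca.",
    "doc-other": "The green chair.",
}

def expand(term, thesaurus):
    """Return the term together with its equivalents in other languages."""
    terms = {term}
    for entry in thesaurus:
        if term in entry.values():
            terms.update(entry.values())
    return terms

def search(term, documents, thesaurus):
    """Return the names of the documents containing the term or an equivalent."""
    terms = expand(term.lower(), thesaurus)
    return sorted(name for name, text in documents.items()
                  if terms & set(re.findall(r"\w+", text.lower())))

found = search("table", DOCUMENTS, THESAURUS)
```

Searching for "table" finds both the English and the Spanish document, but not the unrelated one.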


10. Electronic Data Interchange (EDI)

Many EDI messages are printed. As the EDI messages are very
structured, a translation of the message could be shown using
Pseudo-Automatic Translation (PAT).


11. Passing selected text to a CGI

To consult terminological databases easily, it should be possible
to pass a selected string (selected with the mouse or otherwise) to
CGI programs or similar. This is a generic mechanism.


12. Reference model for Internationalization & Multilinguism

This is a very first attempt and further work is needed. The model
is layered, similar to the seven-layer ISO Reference Model (physical,
datalink, network, etc). A different approach could be needed; for
example, a vector approach.

LayerNumber   LayerName         Example

1             compression       gzip
2             transformation    UTF-8
3             character set     Unicode (65, "LATIN CAPITAL LETTER A")
4             glyph             "A"
5             font              Times
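
The three lowest layers of the table can be walked through in Python for the example character. This is only a sketch of the layering idea using standard modules; the glyph and font layers depend on the rendering system and are not reproduced here.

```python
import gzip
import unicodedata

character = "A"

# Layer 3, character set: code position and character name.
code = ord(character)
name = unicodedata.name(character)

# Layer 2, transformation: the code position serialized as UTF-8.
transformed = character.encode("utf-8")

# Layer 1, compression: the transformed bytes compressed with gzip.
compressed = gzip.compress(transformed)

# The receiving side undoes the layers in the opposite order.
restored = gzip.decompress(compressed).decode("utf-8")
```

Each layer only needs to know about its neighbors: gzip sees bytes, UTF-8 sees code positions, and neither knows anything about glyphs or fonts.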

Other items to put into the model:

-  sorting order

-  language (e.g., Korean)

There is a general tendency to overload the character set layer.
For example, wishing to allocate two code positions to the same
ideogram because it means different things in different languages.


13. VRML

How do objects negotiate when they speak different languages?
{This will be developed.}


14. Java

{This will be developed.}


15. Dragoman

This section is included mostly to illustrate the kind of applications
for multilinguism.

Dragoman is a reference model for Language Engineering. It uses
the Multilingual Aligned Hypertext technique. In essence, Dragoman
describes a Database (part structured and part documentary) and
Services that can be implemented over the (multilingual) Database.
Often, different data structures are used for the Services described
below.

The Web paradigm is particularly well adapted to Dragoman. The term
Dragoman has nothing to do with dragons; it means language interpreter.

What follows is a very brief description of some of the Services
that could be implemented over the Database. There could be several
programs offering the same Service. Services processing whole
documents could be implemented in batch; particularly if they are
using a very large Database (several gigabytes).

15.1 Interactive Search
Selects the Multilingual Aligned Texts (MAT) that match a search
WInter Page 19


criteria. The search is fuzzy (e.g., 87% match). Unfound requests
are valuable information that must be processed further. The system
must keep track of the unfound requests to put people with similar
needs in contact (matchmaker); the user must decide what is a
typing error and what is a genuine unfound request. The user can
also send messages to terminologists (demand-driven terminology).
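
A minimal sketch of such a fuzzy search follows. It assumes that a
Multilingual Aligned Text is held as a mapping from language code to
text and that similarity is measured with Python's difflib; both are
assumptions of this example, as are the names search and
report_unfound. The 0.87 threshold mirrors the 87% example above.

```python
import difflib


def search(query, lang, database, threshold=0.87):
    """Return (score, MAT) pairs whose text in `lang` matches the query."""
    hits = []
    for mat in database:
        text = mat.get(lang, "")
        matcher = difflib.SequenceMatcher(None, query.lower(), text.lower())
        score = matcher.ratio()
        if score >= threshold:
            hits.append((score, mat))
    # Best matches first.
    hits.sort(key=lambda hit: hit[0], reverse=True)
    return hits


def report_unfound(query, unfound_log):
    """Record a request the user has confirmed as genuinely unfound."""
    unfound_log.append(query)
```

The log is written only on an explicit call, since it is the user
who decides whether an empty result is a typing error or a genuine
unfound request.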

15.2 The Translation Folder (full preprocessing)
The objective is to obtain a complete Translation Folder for a
given document. Hence, the translator should not need to consult
dictionaries, databases, glossaries, nomenclature lists, etc. It is
like having a hundred assistants preparing the text for the
translator. In a typical Translation Folder, some paragraphs should
be fully translated and some paragraphs should be a mixture of full
sentences, segments, titles, terms, nomenclatures, etc. (all these
items are packaged as Linguistic Objects); background documents
could also be taken into account. The Linguistic Objects are marked
with the Status; for example, unverified, verified, compulsory,
etc. The search follows a fuzzy biggest-chunk heuristic. Traditionally
there are two texts, source and target, but there could be any
number of language fields. This could be the most useful Service
for the translator and it should be implemented early. The translator
could use the result on paper or on the screen.
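
The biggest-chunk idea can be sketched as follows. This is a
simplified variant that uses exact matches only (no fuzziness) and
a single target language; the LinguisticObject class, the
build_folder function and the layout of the database (source chunk
mapped to a target and a Status) are assumptions of this example.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass
class LinguisticObject:
    source: str
    target: Optional[str]  # None when nothing was found
    status: str            # e.g. "verified", "compulsory", "unfound"


def build_folder(sentence: str,
                 database: Dict[str, Tuple[str, str]]) -> List[LinguisticObject]:
    """Cover the sentence with the longest chunks known to the database."""
    words = sentence.split()
    folder = []
    i = 0
    while i < len(words):
        # Try the longest chunk starting at word i first.
        for j in range(len(words), i, -1):
            chunk = " ".join(words[i:j])
            if chunk in database:
                target, status = database[chunk]
                folder.append(LinguisticObject(chunk, target, status))
                i = j
                break
        else:
            # No chunk starts here: leave the word for the translator.
            folder.append(LinguisticObject(words[i], None, "unfound"))
            i += 1
    return folder
```

Preferring the longest chunk means that a full sentence wins over
its segments, and a multi-word term wins over its single words.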

15.3 Preprocessing for Machine Translation
Similar to the Translation Folder. It should be adapted to an
(existing) machine translation program that follows up the processing.
For example, select only exact matches (no fuzzy) and terms in the
unfound phrases; the machine translation program would translate
only the unfound phrases.

15.4 Machine Translation
A Machine Translation program that uses the Database directly. For
example, a program could combine perfect matches, the processing of
easy fuzzy matches such as dates, pure Machine Translation, etc.

15.5 Pseudo-Automatic Translation (PAT)
Similar to the Translation Folder, but where all the texts are
found with a 100% match (no fuzzy search). The program should be
restricted to a collection of records; i.e., it should not be
allowed to roam the Database, as there could be unpleasant surprises.
In particular, one must avoid word-by-word translation; hence one
must be very careful with small Multilingual Aligned Texts (for
example, a one-word Multilingual Aligned Text).
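
As an illustration, the restrictions above could be enforced as in
the following sketch. The function name, the min_words parameter
and the choice to refuse small texts outright (rather than merely
flag them) are assumptions of this example.

```python
from typing import Dict, List, Optional


def pseudo_automatic_translation(
    segments: List[str],
    collection: Dict[str, str],
    min_words: int = 2,
) -> Optional[List[str]]:
    """Translate only if every segment has a 100% match in the collection.

    `collection` is the restricted set of records the program may use;
    it is never the whole Database.
    """
    translated = []
    for segment in segments:
        target = collection.get(segment)   # exact match only, no fuzzy search
        too_small = len(segment.split()) < min_words
        if target is None or too_small:
            return None                    # not a candidate for PAT
        translated.append(target)
    return translated
```

Returning nothing at all when a single segment fails keeps the
Service from producing a partly automatic, partly missing text.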

15.6 Document Generation
All the linguistic versions of a document are generated camera-ready.
There is no source and translation as such; the index is
created, the typesetting (nearly) done. This is the most useful
Service for the Organization. It is a very efficient way to produce
documents. The three phases Author-Translator-Publisher (ATP-chain)
are highly integrated. It is particularly adapted to periodic
publications. The production of standardized documents is trivial.

Documents in several linguistic versions are often required to be
synchronized; i.e., each page in each linguistic version must
contain the same content and the same lay-out (text, number of
paragraphs, etc.). The typesetting, including the synchronization,
must be automated; pages should not be processed one by one by a
human, and a human operator should intervene only to fine-tune the
publication. TeX should be considered.
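
An automated synchronization check could look like the sketch below.
It assumes each linguistic version is available as a list of pages,
each page being a list of paragraphs, and it compares only the page
and paragraph counts; the data layout and the function name are
assumptions of this example.

```python
from typing import Dict, List

Page = List[str]  # one page = its paragraphs, in order


def synchronization_errors(versions: Dict[str, List[Page]]) -> List[str]:
    """Compare every linguistic version against the first one (by code)."""
    if not versions:
        return []
    errors = []
    langs = sorted(versions)
    reference = langs[0]
    ref_pages = versions[reference]
    for lang in langs[1:]:
        pages = versions[lang]
        if len(pages) != len(ref_pages):
            errors.append(
                f"{lang}: {len(pages)} pages, {reference} has {len(ref_pages)}")
            continue
        for number, (ref_page, page) in enumerate(zip(ref_pages, pages), start=1):
            if len(page) != len(ref_page):
                errors.append(
                    f"{lang}: page {number} has {len(page)} paragraphs, "
                    f"{reference} has {len(ref_page)}")
    return errors
```

An empty result means the versions are synchronized at this level;
the human operator would only look at the pages reported.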

A document might need several representations; for example, typeset
for the Official Journal, formatted for a CD-ROM or marked in HTML
(for CD-ROM or server). First, a document in SGML should be generated;
indeed, the SGML document is the document. All the following
representations should be created from the SGML document. This
method should guarantee that all the representations have the same
content.

With such a system in place, the creation of secondary products is
easy. For example, a Parliamentary Commission could work with a
draft of the Budget typeset like the Official Journal, in all
the linguistic versions, enriched with hidden comments.

15.7 Document Comparison
The user directs the program to a document similar to the one that
has to be translated. The new pieces could be fetched from the
Database. This program could work without the Database, though the
new pieces would not be fetched. Similar translations arise both
for a new version of a previous document and for a new document
that resembles an existing one.
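
Finding the new pieces can be sketched with a paragraph-level
comparison. The example assumes both documents are already split
into paragraphs and uses Python's difflib; the function name is
hypothetical, and fetching the pieces from the Database is left out.

```python
import difflib
from typing import List


def new_pieces(previous: List[str], current: List[str]) -> List[str]:
    """Return the paragraphs of `current` that are not carried over
    unchanged from `previous`."""
    matcher = difflib.SequenceMatcher(None, previous, current, autojunk=False)
    pieces = []
    for tag, _i1, _i2, j1, j2 in matcher.get_opcodes():
        # "replace" and "insert" mark text that exists only in `current`.
        if tag in ("replace", "insert"):
            pieces.extend(current[j1:j2])
    return pieces
```

Only the returned paragraphs need translating; the rest can reuse
the translation of the previous document.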

15.8 Author's Workbench
Authors could use a technique similar to the Translation Folder and
Document Comparison. The unknown parts of the text would be marked
and in certain cases alternatives would be proposed. Texts created
with the translation phase in mind are easier to translate. Ideally,
the author should aim to produce a text for translation with
Pseudo-Automatic Translation.

15.9 Terminology Verification
The objective is to verify the Consistency and Harmonization of
the terminology. The concepts are closely related and they can be
combined, but they are not the same.

-  Consistency is naming the same object with the same term. It is
an internal characteristic of a set of documents (a set of one
document is allowed) and it does not need a Database. The more
linguistic versions of the set of documents, the better.

-  Harmonization is imposing a term by the Terminological Authority.
It is an external characteristic of the document and it needs a
Database with the harmonized terms.
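
The difference between the two checks can be sketched as follows.
The example assumes the terms have already been extracted from
aligned documents as (source term, target term) pairs, so that the
source term stands for the object being named; the function names
and the data layout are assumptions of this example.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Set, Tuple

TermPair = Tuple[str, str]  # (source term, target term)


def check_consistency(pairs: Iterable[TermPair]) -> Dict[str, Set[str]]:
    """Internal check, no Database: the same object named in several ways."""
    seen = defaultdict(set)
    for source, target in pairs:
        seen[source].add(target)
    return {source: targets for source, targets in seen.items()
            if len(targets) > 1}


def check_harmonization(pairs: Iterable[TermPair],
                        harmonized: Dict[str, str]) -> List[TermPair]:
    """External check: pairs that differ from the imposed term."""
    return [(source, target) for source, target in pairs
            if source in harmonized and harmonized[source] != target]
```

A set of documents can be perfectly consistent and still fail
harmonization, if it consistently uses a term other than the one
imposed by the Terminological Authority.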

15.10 Multilingual Aligned Text Editor
An editor that shows at least two (aligned) texts, moves the texts
in sync, highlights the differences, etc.

15.11 Printing
A program that prints one or several Multilingual Aligned Texts side
by side. It could be the step that follows the Translation
Folder. Multilingual Aligned Texts (source and target) on paper
allow the translator to use traditional tools such as dictation.
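
A side-by-side layout for two aligned texts can be sketched in a few
lines. The column width, the separator and the function name are
assumptions of this example; a real program would also handle more
than two languages and proper typesetting.

```python
import textwrap
from itertools import zip_longest
from typing import List


def side_by_side(source: str, target: str, width: int = 30) -> List[str]:
    """Wrap both texts to `width` columns and pair the lines."""
    left = textwrap.wrap(source, width=width) or [""]
    right = textwrap.wrap(target, width=width) or [""]
    rows = []
    for left_line, right_line in zip_longest(left, right, fillvalue=""):
        # Pad the left column so the separator stays aligned.
        rows.append(f"{left_line.ljust(width)} | {right_line}".rstrip())
    return rows
```

Printing the rows of each aligned pair in turn, with a blank line
between pairs, keeps source and target visually in step.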


16. Acknowledgments

This document makes heavy use of the documents cited in the text,
particularly the relevant RFCs and Internet-Drafts.

It also draws on the following:

-  Web Multilinguism. BOF meeting, Third International WWW Conference

-  Web Internationalization. BOF meeting, Fourth International WWW
Conference

-  Web Internationalization & Multilinguism. BOF meeting, Fifth
International WWW Conference

-  Internationalization Workshop. Fifth International WWW Conference

-  WInter mailing list

-  Informal talks/communications (probably the most fruitful)

The BOF meetings were organized by the author.

Martin Duerst made many suggestions to the position paper of the
author for the Internationalization Workshop during the Fifth
International WWW Conference. The present document is over 80%
based on the position paper. He commented on the Reference model and
I expect him to come back with further suggestions.

In such fluid circumstances, it is nearly impossible to attribute
credits, though the following particularly come to mind:

Bert Bos
Martin Bryan
Martin Duerst
Albert Lunde
Larry Masinter
Gavin Nicol
Steven Pemberton
Christine Stark
François Yergeau
Faith Zack

The author tries to look for consensus and has borrowed heavily from
many sources. On the other hand, he alone is responsible for
any shortcomings and the opinions expressed.


17. Bibliography

[BRIAN] Martin Bryan, "Using HyTime to Link Translations", contribution
to the WInter mailing list,
http://www.crpht.lu/~carrasco/winter/hytime.html

[CARRASCO-1] M.T. Carrasco Benitez, "On the multilingual normalization
of the Web", Poster for the Third International WWW Conference,
http://www.crpht.lu/~carrasco/winter/poster.html

[CARRASCO-2] M.T. Carrasco Benitez, "Web Internationalization",
Poster for the Fourth International WWW Conference,
http://www.crpht.lu/~carrasco/winter/inter.html

[CARRASCO-3] M.T. Carrasco Benitez, "WInter (Web Internationalization
& Multilinguism)", Position paper for the Internationalization
Workshop during the Fifth International WWW Conference,
http://www.crpht.lu/~carrasco/winter/popa.html

[CONNOLLY] "Character Set Considered Harmful",
http://www.w3.org/hypertext/WWW/MarkUp/html-spec/charset-harmful.html

[HTML 2.0] T. Berners-Lee, D. Connolly, "HTML 2.0", RFC 1866,
http://www.ics.uci.edu/pub/ietf/html/rfc1866.txt

[HTML 3.0] "HTML 3.0", expired Internet-Draft,
http://www.hpl.hp.co.uk/people/dsr/html3/CoverPage.html

[HTTP-1.1] R.T. Fielding, H. Frystyk Nielsen, and T. Berners-Lee,
"Hypertext Transfer Protocol -- HTTP/1.1", Work in progress
(draft-ietf-http-v11-spec-01.txt) MIT/LCS, January 1996.
http://www.ics.uci.edu/pub/ietf/http/draft-ietf-http-v11-spec-01.html

[I-HTML] F. Yergeau, G. Nicol, G. Adams, M. Duerst, "Internationalization
of the Hypertext Markup Language", Work in progress,
(draft-ietf-html-i18n-03.txt)
http://www.alis.com:8085/ietf/html/draft-ietf-html-i18n.txt

[ISO-8859-1] ISO 8859-1:1987. International Standard -- Information
Processing -- 8-bit Single-Byte Coded Graphic Character Sets --
Part 1: Latin Alphabet No. 1.

[NICOL] G. T. Nicol, "The Multilingual WWW",
http://www.ebt.com:8080/docs/multilingual-www.html

[UNICODE] The Unicode Consortium, "The Unicode Standard -- Worldwide
Character Encoding -- Version 1.0", Addison-Wesley, Volume 1, 1991,
Volume 2, 1992. http://www.unicode.org

[ZACK] F. Zack, "Serving Multilingual Online Documentation", Poster
for the Fifth International WWW Conference

{This list will be completed.}


18. Author Address

Manuel Tomas CARRASCO BENITEZ
carrasco@innet.lu
http://www.crpht.lu/~carrasco/winter