Internet-Draft                                   M.T. Carrasco Benitez
<draft-carrasco-xdossier-04.txt>                              Dragoman
Expires 14 December 2004                                  15 June 2004



                              Xdossier



Status of this memo


This document is an Internet-Draft and is in full conformance with all
provisions of Section 10 of RFC2026.


Internet-Drafts are working documents of the Internet Engineering Task
Force (IETF), its areas, and its working groups. Note that other
groups may also distribute working documents as Internet-Drafts.


Internet-Drafts are draft documents valid for a maximum of six months
and may be updated, replaced, or obsoleted by other documents at any
time.  It is inappropriate to use Internet-Drafts as reference
material or to cite them other than as "work in progress."


The list of current Internet-Drafts can be accessed at
http://www.ietf.org/ietf/1id-abstracts.txt


The list of Internet-Draft Shadow Directories can be accessed at
http://www.ietf.org/shadow.html.



Abstract


This is an informational memo for Xdossier. A Xdossier is a data
object designed for browsing with web browsers and mappable to XML. It
is based on a directory structure containing files in several formats.



Table of Contents


1.    Introduction
2.    Rationale
3.    Terminology
4.    Xdossier
4.1.  Mapping between Xdossier and XML
5.    Xdossier types
5.1.  Well-formed Xdossier
5.2.  Templated Xdossier
5.3.  Valid Xdossier
6.    Web Formats
7.    Xdossier Node
8.    Node Index
8.1.  Browsing function
8.2.  Metadata function
9.    Node Store
10.   Root directory
11.   Self-containness
12.   Compound Xdossier
13.   Backbone
13.1. File System
13.2. Pack
14.   File formats
15.   Representation
16.   File extension
17.   Character encoding
18.   XHTML for Index
19.   Name
19.1. Strict Name Conformance
20.   References
21.   Acknowledgement
22.   Author
23.   Disclaimer



1. Introduction


It is recommended to play with a Xdossier example, as this memo should
be easier to understand. For examples look in
http://dragoman.org/xdossier.


This recommendation is about organising files. They are organised into
a data object called Xdossier.


Informally, a Xdossier is a directory structure with files in several
formats created for web browsing; direct browsing ("file:") or served
browsing ("http:").


Classifying files within directories is easy and very instinctive. A
few HTML files with some descriptions and links can greatly help the
browsing and give a feel of "oneness". One can easily start organising
using the directory structure point of view. By following a few rules,
one can end up with a data object easy to browse and with a
significant structure.


A directory structure is a tree similar to an XML document. There is a
strong parallelism. With a formal mapping to XML, the directory
structure could be transformed into an XML document.


One could start with the structure of the directories and files (the
"Backbone") and progress with the structuring towards the content of
the individual files (the "Leaves"): a few files could be XML files,
eventually the whole Xdossier should be transformable into a XML
document.


This approach is particularly useful to organise large amount of
legacy data in several formats for which there is no clear formal
definition.



2. Rationale


- Usable with web browsers. At most, only unpacking (e.g., unzipping)
should be necessary.


- Easy to "produce" and easy to "consume".


- Usable "as is" and adapted to further processing. For example, a CD-
ROM must be usable directly ("raw" consumption) and programs should be
capable of mechanical processing to load into a DBMS, web server, etc.


- Easy to prepare with resources (computer equipment, programs, staff,
etc) in most firms or acquirable at low cost. In particular, it should
be easy to prepare by hand without the need of special programs.


- Mappable to XML [XML].


- Vendor independent.


- Usable as an interface to exchange data.



3. Terminology


The specific terms to this memo have usually the first character of
each token in capital. Many term and concepts are the same or parallel
to SGML/XML and file systems.


- Index: Abbreviation of "Node Index".


- Instance: Abbreviation of "Xdossier Instance".


- Minimal Root: Abbreviation of "Minimal Root Xdossier".


- Minimal Root Xdossier: Xdossier with a minimal number of elements in
the Root Node.


- Node: Abbreviation of "Xdossier Node".


- Node Index: File, usually named "index.html" that contains
links to and information on files in a particular Node.


- Node Store: An optional directory named "xdossier" that could be
present in each Node.


- Root Index: The Index in the Root Node.


- Table of Contents: Abbreviation of "Xdossier Table of Contents".


- Templated: Abbreviation of "Templated Xdossier".


- Templated Xdossier: A Xdossier constructed following the indications
of a Xdossier Template.


- Xdossier: (1) The concept as described in this memo. (2)
Abbreviation of "Xdossier Instance".


- Xdossier Instance: Parallel meaning with XML document instance.


- Xdossier Node: A directory and his components.


- Xdossier Table of Contents: The Root Index. The Xdossier Table of
Contents must allow the navigation of the whole Xdossier. Typically,
there would be links to other Directory Indexes.


- Xdossier Template: A Xdossier that indicates how to construct
similar Templated Xdossiers. I can be viewed as a "light" DTD.



4. Xdossier


A Xdossier is a data object composed of a directories/files structure
that follows this specification. In particular, Xdossiers must follow
the rules regarding names, representation, file extension, file
format, character encoding and web format. The start is the Index in
root directory.



4.1. Mapping between Xdossier and XML


The mapping between Xdossier and XML concepts is as follow:


 Xdossier           XML


 directory      <-> element
 directory name <-> element name
 root directory <-> document element
 Index          <-> attributes (for his directory and files)
 file           <-> entity
 file name      <-> entity name
 file content   <-> entity reference
 XML file       <-> Parsed entity
 Non-XML file   <-> Unparsed entity


This could be used for transforming between Xdossier and XML. Xdossier
should be transformable into XML.



5. Xdossier types


There are three types of increasingly strict Xdossiers:

- Well-formed: All Xdossier must be well-formed as defined in this
specification.


- Templated: A Xdossier that in addition follows the indications of a
Xdossier Template.


- Valid: A Xdossier that in addition follows the rules of a precise
syntax; e.g., DTD, schema [SCHEMA], value-pair, etc.


The concepts of "well-formed" and "valid" are parallel to XML.
"Templated" does not exist in XML.



5.1. Well-formed Xdossier


A well-formed Xdossier must follow the rules of construction in this
specification. All Xdossiers must be well-formed. This is the minimum
requirement to be a Xdossier. A well-formed Xdossier does not have to
follow the additional indications/rules of a Template/syntax.


Rules of construction refer to the part of this specification covering
aspects not related to Templated and valid Xdossier.



5.2. Templated Xdossier


A Templated Xdossier follows the indications of a Xdossier Template,
abbreviated to Template. A Template is a Xdossier declared as a
Template. Usually, Xdossier Templates would be purpose built Xdossiers
to fulfil the role of Template.


The presence of directories/files in a Template would mean that they
must be present in Xdossiers; usually with the same name and format.
There can be additional indications, in particular the Indexes; e.g.
"such a file is optional". Probably, some aspects could be fuzzy.


People with limited knowledge in computers could create Templates as
it is instinctive. Probably, the path would be to create a well-formed
Xdossier and then to proceed with the creation of a Template.


As the approach does not have a fixed syntax, it is not intended for
full mechanical validation by computer. Some parts would have to be
validated by humans, though parts that follow a syntax could be
validated mechanically. For example, the content model of the
files/directories could be defined as:


- DTD: an XML DTD.


- Pair of values: for example a list of pair of values like
  "/food/choco/index.html=Documents about chocolate"



5.3. Valid Xdossier


A valid Xdossier follows the rules of a syntactic system such as DTD
or schema [SCHEMA]. This is needed to implement computer programs that
could do full mechanical validation of Xdossiers.


Another memo should address the syntax. Schema and XML DTD will be
considered.



6. Web Formats


These are file formats well adapted to the web and supported by widely
available browsers. A very good format for the web, but not supported
by widely available browsers is not a Web Format.


Web Format is a fuzzy moving definition. It is also "community
dependent"; e.g., a community could consider XML a Web Format and
another community could consider that it is not a Web Format.


By default, the only Web Format is XHTML.


[Relation to Templated and valid Xdossier]: It could redefine the list
of Web Formats.



7. Xdossier Node


A Xdossier Node, abbreviated to Node, is a directory and the
following:


- Xdossier Node Name, abbreviated to Node Name; the name of the
directory.


- Node Index.


- Node Store.


- File(s) in this particular directory.


- Name(s) (not the content) of the directory/ies in this particular
directory.



8. Node Index


Node Index, abbreviated to Index, is a document in Web Format included
in each Node. Indexes should/could have two functions:
- Browsing (informal view).
- Metadata (formal view).


The browsing and metadata are functions. Syntactically, they could be
interwoven.


Syntactically, there are two types of Indexes:
- Informal Index: it does not follow any particular syntax.
- Formal Index: It follows a syntax.


If Index is not present, the filenames in the directory should be
meaningful.


The default Document Name for Index is "index" and the default format
is the default Web Format. Hence, at present the default File Name for
Index is "index.html".


[Relation to Templated and valid Xdossier]: It could redefine the
default Index name.



8.1. Browsing function


Indexes should have a human readable description of his Node and
meaningful labels with links mostly to:
- His file(s).
- Indexes in child directories.
- Navigational aids (e.g., a link to the Root Index).


One should be able to view all the directories/files in the Xdossier
starting from the Root Index and following links, except if the
intention is to hide them. Hence, every directory/file should have a
link pointing to it. Usually from his Index, but it could also be from
other Indexes or files.


Nodes should be as self-contained as possible. Hence, it is
recommended for Indexes to have links only to his files and child
Indexes; i.e., the Indexes of his directories. Though, Indexes could
also contain links to other files/resources.


Links to files within a Xdossier must be relative.



8.2. Metadata function


Indexes could contain the metadata of his Node. The metadata should be
machine processable. The metadata could also be in the Node Store.


Another memo should address the metadata. Resource Description
Framework [RDF] will be considered.



9. Node Store


Node Store, abbreviated to Store, is an optional directory that could
be present in each Node. If it is present, it must be named
"xdossier"; this name is reserved for this purpose.


The Store could contain additional data related to the Node where it
is situated. For example, metadata for his directory/file(s), previous
versions of the directory files, etc.


The Node Store in the root directory is called the Root Node. The Root
Node could contain the information to make the Xdossier Templated or
valid. This is similar to a DOCTYPE in an XML document and the DTD.


Another memo should address specifications of Store.



10. Root directory


The Node Index in the root directory is called the Root Index. The
root directory must contain only one file, the Root Index; and zero or
more directories. Corollary: The trivial Xdossier is composed only of
the Root Index.


The intention for allowing only one file (the other elements must be
directories) in the root directory is to make it obvious that the file
present (the Root Index) is the Table of Contents.


It is recommended to minimise the number of elements in the root
directory, or at least to keep it to a reasonable number.


Minimal Root Xdossier, abbreviated to Minimal Root, is when the Root
Node contains only the Index, one directory and optionally the Store.
The intention is to make it even more obvious for the user. Minimal
Root is appropriate for Xdossiers not intended for loading into web
servers, as the URLs are longer.



11. Self-containness


There are three levels:


- Absolute Xdossier: When all the resources are in the Xdossier.


- Self-Contained Xdossier: When all "Essential Resources" are in the
Xdossier. For example, the CSS is in the Xdossier, though there could
be secondary references to other resources such as a reference to the
W3C site at http://w3.org. At least this level should be attained.


- Fragment Xdossier: When at least one "Essential Resource" is not in
the Xdossier. For example, the CSS is not in the Xdossier and it
relies in an external CSS such as the one in the W3C site at
http://www.w3.org/StyleSheets/Core/. It is only recommended as a
directory of Xdossier. Otherwise, there should be an agreement between
producers and consumers of the Xdossier.


Essential Resources are the ones needed for navigation and display.


[Relation to Templated and valid Xdossier]: It could include the
minimal level of Self-containness requested and a re-definition of the
Essential Resources.



12. Compound Xdossier


It is a Xdossier where all the directories in the root directories are
Xdossier themselves. These directories could also be Compound
Xdossiers and so on.


[Relation to Templated and valid Xdossier]: It could include required
Compound Xdossier.



13. Backbone


The Backbone is the directories and files names; i.e., the main
branches of the tree. The Backbone is not concerned with the structure
of the files.


There are two types of Backbone Formats:
- File System Backbone Format: a directories/files structure.
- Pack Backbone Format: a packed File System; e.g., zip.


File System Backbone Format is abbreviated to File System Format or
simply File System. Pack Backbone Format is abbreviated to Pack Format
or simply Pack.


The main difference is that today Pack must be unpacked before viewing
with browsers. This could change if browsers could support Packs such
as zip. For example, one could have:


 file:///mydirectory/myfile.zip/index.html


or


 zip:///mydirectory/myfile.zip/index.html


This should extract the file "index.html" from the zip file
"myfile.zip" and display the content of "index.html" as if it is
reading from a file system. Pressing the links pointing to other files
in "myfile.zip" should behave in a similar fashion.



13.1. File System


These are directories and files as in a file system; e.g. Windows or
Unix. Xdossier uses mainly the tree properties of file systems.
Xdossier does not consider other properties of file systems such as
access control list (i.e., the bits protection, ownership, etc) or
links within the file systems itself (e.g., symbolic links). It is up
to the user to set the correct access control list; e.g., to reset the
executable bit in the appropriate files. Future versions of this memo
should address this issue.


File System is more adapted to media such as CD-ROMs where one wants
the Xdossier ready for use without any intermediary processing.


The File Systems in order of preference are: Joliet, others.


[Relation to Templated and valid Xdossier]: It could redefine the File
Systems.



13.2. Pack


A directory structure could be packed into one or several file(s);
e.g., zip. Packing must respect the directories/files structure. If a
packing technique compresses, it is just considered a bonus.


Packed is better adapted for:


- Attaching Xdossier(s) to emails.


- In file systems that do not support the naming of the
directories/files (it could easily happen with DOS).


- With large collections of Xdossiers that could cause problems in the
files system.


Care must be taken to unpack in a computer system that supports the
naming in the directory structure. For example, name lengths of the
directories/files and file extensions.


Another approach would be not to unpack the directory structure and
view it with browsers that directly support the unpacking technique,
as described above.


In the future, others aspects would be addressed: Xdossier that
expands several Packs (e.g., pack1.zip, pack2.zip); mixed Pack
Xdossiers (e.g., pack.zip and pack.tar); mixed File System and Pack
(e.g., the Root Node as File System and the rest as Packs).


The Packs in order of preference are: zip, tar, cpio and others.


[Relation to Templated and valid Xdossier]: It could include a list of
accepted packing techniques in order of preference.



14. File formats


Priority should be given to file formats with a good chance of being
readable "forever"; e.g., in 50 years. This points to "neutral"
formats: formal standard, industrial standard, vendor independent,
"text-like", etc.


One should not disregard proprietary formats, as they could be the
"source" format; i.e., the format in which the data was originally
produced. Often, information is lost in format transformation. The
recommendation is to include:


- A file in the source format.


- A file in at least one neutral format.


- Indicate the method used in the format transformations; e.g. source
format saved HTML using the "Save as" facility in such application.


The file formats in order of preference are:


- Text: XML*, XHTML, HTML, XML, text, RTF, PDF and others.
- Graphic: SVG*, PNG or JPEG, GIF, TIFF and others.


*Future directions: XML will be the preferred format (text and
graphic) when it is well supported by widely available browsers. At
present, it is recommended to use as much as it is reasonably
possible. It is recommended to use the appropriate XML applications
such as Chemical Markup Language [CML].


The choice of formats is also dependent on the intention of the user;
e.g., when giving preference to PNG or JPEG.


When other formats are used, they should be widely used formats; e.g.
Word. Some could be widely used in a specialised field; e.g., SAS. In
addition to the proprietary formats, it is recommended to include
transformations to text-like formats with as much information as
possible. For example, word-processing documents could be transformed
into RTF and database tables into "comma separated" files.


[Relation to Templated and valid Xdossier]: It could include a list of
accepted formats in order of preference and different mapping between
the Internet Media Type and file extensions.



15. Representation


The same information could be represented in different fashions. The
dimensions considered are:


- Language; e.g., English, Spanish.
- Media type; e.g., HTML, PDF.
- Encoding; e.g., zip, gzip, compress.



16. File extension


File extensions are used to indicate representations. For example:
         hello                no extension
         hello.html           format HTML
         hello.en             language English
         hello.gz             compressed using "gzip"
         hello.en.html        English in HTML
         hello.html.gz        HTML, gziped
         hello.en.gz          English, gziped
         hello.en.html.gz     English, HTML, gziped


File extensions, particularly the last one, are operating systems
dependants:
- Syntax: e.g., DOS allows up to three characters file extensions.
- Association: which program is associated with the extension.


The extension should correspond to widely used mapping between
Internet Media Types [IMT] and file extensions. The examples above
work for Transparent Content Negotiation [TCN] in Apache [APACHE].


Note the difference between "file" and "document". File refers to
physical storage; e.g., "mydoc.txt" is a file. Document refers to
content; e.g., "mydoc" is a document represented in the files
"mydoc.txt" and "mydoc.html", they contain the same document in
different formats.


Another approach would be to use variants files (see TCN). Another
memo should address the syntax for file extensions.



17. Character encoding


The character encoding ("charset") in order of preference are:


- Unicode UTF-8, Unicode 16 bits [ISO10646].


- ISO-8859-1 (Latin-1) or appropriate ISO-8859-x; e.g., ISO-8859-7 for
Greek.


- Other character encodings. They should be appropriate to the
language and widely available.


[Relation to Templated and valid Xdossier]: It could include a list of
accepted character encoding in order of preference.




18. XHTML for Index


The XHTML used in the indexes should follow the rationale of XHTML
Basic [XHTML-B]. Some indications:


- Simple mainstream XHTML; i.e., facilities easy to write and that
work in most browsers.


- A link to the Root Index. One could also use the "Start" mechanism;
e.g., "<link rel="Start" href="../index.html" />"


- It is recommended to use one CSS for all the Indexes.


- A reasonable presentation with the most popular browsers (e.g.
Internet Explorer, Navigator, etc) and text only browsers (e.g. Lynx).


- Links that work when read directly (e.g., a CD-ROM inserted into a
PC) or served by an HTTP server; i.e., "file:" or "http:".


- Links that point directly to files, except when the intention is to
show the content of the directory. One should not assume that Xdossier
would be served by server; i.e., it should work directly ("file:") or
served ("http:").


- No frames, or at least a no frame option.


- No scripts (e.g. JavaScript) and Java Applets.


- Images (IMG) with alternative texts.


- Relative links within the Xdossier; e.g. href="../doc.html".


- Use language attributes (lang, xml:lang, etc), to indicate the
language of the text.


[Relation to Templated and valid Xdossier]: It could change the XHTML
indications.



19. Name


Xdossier must conform to the XML naming and respect that the name
"xdossier", in all possible combinations of upper or lower case, is
reserved; e.g., xdossier, XDOSSIER, Xdossier, xDossier, etc.


Xdossier are "Strict Name Conformance" when they also conform to
section "19.1 Strict Name Conformance".



19.1 Strict Name Conformance


"Name" is a token composed of the following characters:
- Letters "a" to "z"; i.e., lower case only; [U+0061 to U+007A].
- Digits "0" to "9" [U+0030 to U+0039].
- "-" [HYPHEN-MINUS, U+002D].
- "_" [LOW LINE, U+005F].


The notation "U+" refers to the Unicode [UNICODE] notation.


       Correct Names
         part_a
         part-b
         myfile
         hello
         xdossier-hello


       Incorrect Names
         part a       (' ' ; SPACE is not allowed)
         Myfile       (capitals are not allowed)
         myfile.xml   ('.' ; FULL STOP is not allowed)
         hello:html   (':' ; COLON is not allowed)
         xdossieR     ('xdossieR' ; reserved)


"Directory Name" is a Name.


"File Name" is one Name followed by one or more Name(s) separated by a
'.' (FULL STOP, U+002E).


       Correct File Names
         a_part
         myfile.html
         hello.en.xml
         hello.en.xml.gz


       Incorrect Names
         a part       (' ' ; SPACE is not allowed)
         Myfile.html  (capitals are not allowed)
         hello:xml    (':' ; COLON is not allowed)


"Document Name" is the first Name in the File Name. Example, "docname"
in the File Name "docname.ext"


"File extension(s)" is/are the second and following Name(s). For
example, "ext1", "ext2" and "ext3" in the File Name
"docname.ext1.ext2.ext3"



20. References


[ALLEN] Package or Perish. Terry Allen
Pages 385-390 in SGML/XML '97 Conference Proceedings. SGML/XML '97.


[APACHE] The Apache Foundation
http://apache.org


[CML] Chemical Markup Language
http://xml-cml.org


[CSS2] Cascading Style Sheets, level 2
http://www.w3.org/TR/REC-CSS2


[DC] Dublin Core
http://purl.org/dc


[ESUB] Electronic Submission
http://esubmission.eudra.org


[ISO10646] Information Technology -- Universal Multiple-Octet Coded
Character Set (UCS) -- Part 1: Architecture and Basic Multilingual
Plane, ISO/IEC 10646-1:1993


[HTML] HTML 4.01 Specification
http://www.w3.org/TR/html4


[IMT] Internet Media Types
http://www.isi.edu/in-notes/iana/assignments/media-types/media-types


[MHTML] The MIME Multipart/Related Content-type. E. Levinson
ftp://ftp.ietf.org/rfc/rfc2387.txt


[RDF] Resource Description Framework Model and Syntax Specification
http://www.w3.org/TR/REC-rdf-syntax


[SCHEMA1] XML Schema Part 1: Structures ("work in progress")
http://www.w3.org/TR/xmlschema-1/


[SVG] Scalable Vector Graphics (SVG) 1.0 Specification (work in
progress)
http://www.w3.org/TR/1999/WD-SVG-19991203


[TCN] Transparent Content Negotiation in HTTP
http://ietf.org/rfc/rfc2295.txt


[Unicode] Unicode Consortium
http://www.unicode.org


[XHTML] XHTML 1.0: The Extensible HyperText Markup Language
http://www.w3.org/TR/WD-html-in-xml


[XHTML-B] XHTML Basic ("work in progress")
http://www.w3.org/TR/xhtml-basic/


[XML] Extensible Markup Language (XML) 1.0
http://www.w3.org/TR/rec-xml


[XSL] Extensible Stylesheet Language Specification ("work in
progress")
http://www.w3.org/TR/WD-xsl



21. Acknowledgement


The comments of Martin Bryan to an early draft were very useful. Also,
he suggested the Template. As usual, the author is the sole
responsible for the document.



22. Author


Manuel Tomas CARRASCO BENITEZ
Dragoman
Luxembourg


Telephone +352 26 200 747


xdossier@dragoman.org
http://dragoman.org/carrasco



23. Disclaimer


This document represents the view of the author. It does not
necessarily represent the views of any other parties.