Internet-Draft                                            T. Shaneyfelt
Department of Computer Science             University of Hawaii at Hilo
Expires: February 14, 2004


        HTMLX: Simple Well-Formed Format For Legacy HTML Documents
                     draft-shaneyfelt-htmlx-01.txt


Abstract

     Converting some legacy Hypertext Markup Language (HTML) documents
     to XHTML format would require certain tags to be dropped.  Leaving
     them unconverted to any form of Extensible Markup Language (XML)
     would prevent their use by newer XML-based tools.

     This memo documents a method for expressing the format and content
     of legacy HTML documents as XML in order that their structure and
     content may be accessible by XML parsers. Obsolete tags are
     explicitly allowed, and well-formedness is required.

1. Introduction

     With minimal modification, legacy documents can become accessible
     to XML parsers.  The minimal requirement is well-formedness.  HTML
     based on Standard Generalized Markup Language (SGML) is not
     required to be well-formed, unlike newer XHTML documents, but at
     the same time, XHTML obsoletes certain tags that were used in
     prior HTML standards.  Rather than converting legacy documents
     directly into XHTML, the documents could be tidied into a
     well-formed representation that could then either be accessed with
     XML parsers without losing original tags, or be transformed later
     into XHTML via XML tools (such as XSLT processors), if desired.
     Tools that edit legacy documents may implement an option for the
     user to save the document in this intermediate format as the
     document is being transformed into XHTML.  This format is
     backwards compatible with most popular user agents.



2.  Security Considerations

     None known.

3. Format

3.1  Well-formedness

     HTMLX documents shall be well-formed XML.  Legacy documents will
     need to be minimally modified to meet the well-formedness
     requirement.
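
     As an illustration only, the well-formedness requirement can be
     checked with any conforming XML parser.  The sketch below uses
     Python's standard xml.etree.ElementTree module; the function name
     is_well_formed is hypothetical and is not defined by this memo.

```python
import xml.etree.ElementTree as ET


def is_well_formed(text):
    """Return True if text parses as well-formed XML, False otherwise."""
    try:
        ET.fromstring(text)
    except ET.ParseError:
        return False
    return True


# Legacy HTML with an unclosed <br> and an unquoted attribute value.
legacy = "<p><font color=red>one<br>two</font></p>"
# The same content, minimally modified; the obsolete font tag is kept.
htmlx = '<p><font color="red">one<br />two</font></p>'

print(is_well_formed(legacy))  # False
print(is_well_formed(htmlx))   # True
```

     Note that the check says nothing about which tags are used, so
     obsolete tags such as font pass as long as they are properly
     nested and closed.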

3.2 Empty elements

     For user agent compatibility, it is suggested that the following
     elements, even when they have no content, be written with separate
     begin and end tags rather than collapsed into a single
     empty-element tag:

     applet,  iframe,  object,  script,  textarea,  title
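
     One possible way for saving software to follow this suggestion is
     sketched below.  It is a minimal serializer written for this
     example, not a complete one: it assumes element names without
     namespaces and ignores comments and processing instructions.  The
     names KEEP_PAIRED and serialize are hypothetical.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape, quoteattr

# Element names taken from the list in the text above.
KEEP_PAIRED = {"applet", "iframe", "object", "script", "textarea", "title"}


def serialize(elem):
    """Serialize an element tree, never collapsing the listed elements."""
    attrs = "".join(f" {name}={quoteattr(value)}"
                    for name, value in elem.attrib.items())
    inner = escape(elem.text or "")
    for child in elem:
        inner += serialize(child) + escape(child.tail or "")
    # Content-free elements collapse unless they are in the list.
    if inner or elem.tag in KEEP_PAIRED:
        return f"<{elem.tag}{attrs}>{inner}</{elem.tag}>"
    return f"<{elem.tag}{attrs} />"


doc = ET.fromstring(
    '<body><script src="a.js"/><br/><textarea></textarea></body>')
print(serialize(doc))
# <body><script src="a.js"></script><br /><textarea></textarea></body>
```

     Elements outside the list, such as br, are still collapsed, so
     only the listed elements are affected.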

3.3 Entities

     Entities other than gt, lt, amp, quot, and apos shall be
     converted to numeric character references unless they are defined
     by an entity declaration.  For example, the nbsp entity may be
     converted to a numeric character reference wherever it appears, or
     it may be defined in an entity declaration or DTD at the top of
     the page.
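
     The conversion to numeric form can be sketched as follows.  This
     is an illustration under stated assumptions: it uses the HTML 4
     entity table shipped with Python (html.entities.name2codepoint),
     applies a plain text substitution that does not distinguish
     script or CDATA content, and leaves unknown names untouched.  The
     function name to_numeric_entities and its declared parameter are
     hypothetical.

```python
import re
from html.entities import name2codepoint

# The five entities predefined by XML are left untouched.
XML_PREDEFINED = {"gt", "lt", "amp", "quot", "apos"}
ENTITY_RE = re.compile(r"&([A-Za-z][A-Za-z0-9]*);")


def to_numeric_entities(text, declared=()):
    """Replace known HTML named entities with numeric references.

    Names in `declared` are assumed to have an entity declaration in
    the document and are therefore kept as they are.
    """
    def replace(match):
        name = match.group(1)
        if name in XML_PREDEFINED or name in declared:
            return match.group(0)
        if name in name2codepoint:
            return f"&#{name2codepoint[name]};"
        return match.group(0)  # unknown name: leave for the author to fix

    return ENTITY_RE.sub(replace, text)


print(to_numeric_entities("Caf&eacute;&nbsp;&amp;&nbsp;bar"))
# Caf&#233;&#160;&amp;&#160;bar
```

     Passing declared={"nbsp"} corresponds to the second option in the
     text, where the document itself declares the entity.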

3.4  Declarations

     Namespace and DTD declarations are not required.  Documents
     without any DTD are considered to be HTMLX1.

3.4.1 Namespace

     Editing software is not to add a namespace to a document without
     being directed to do so by the user.  Indiscriminately inserting
     a namespace would imply conformance to related standards, and
     should not be done until the author is ready to take that step.

3.4.2 Document Type Definition

     Editing software is not to add a DTD to a document without being
     directed to do so by the user.  Indiscriminately inserting a DTD
     would imply conformance to related standards, and should not be
     done until the author is ready to take that step.

3.5  Attributes

     All attribute values must be quoted to comply with XML.
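
     A rough sketch of how a tool might quote attribute values while
     reading legacy HTML is given below, using Python's standard
     html.parser module.  The class and function names are
     hypothetical.  Writing a valueless attribute such as nowrap as
     nowrap="nowrap" is an assumption borrowed from XHTML practice, not
     a rule of this memo.  The sketch does not close unclosed elements
     and escapes script content like ordinary text, so it addresses
     attribute quoting only.

```python
from html.parser import HTMLParser
from xml.sax.saxutils import escape, quoteattr


class AttributeQuoter(HTMLParser):
    """Re-emit markup with every attribute value quoted."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []

    def _emit_tag(self, tag, attrs, closed):
        parts = [tag]
        for name, value in attrs:
            if value is None:      # valueless attribute, e.g. nowrap
                value = name       # assumption: XHTML-style expansion
            parts.append(f"{name}={quoteattr(value)}")
        self.out.append("<" + " ".join(parts) + (" />" if closed else ">"))

    def handle_starttag(self, tag, attrs):
        self._emit_tag(tag, attrs, closed=False)

    def handle_startendtag(self, tag, attrs):
        self._emit_tag(tag, attrs, closed=True)

    def handle_endtag(self, tag):
        self.out.append(f"</{tag}>")

    def handle_data(self, data):
        self.out.append(escape(data))

    def handle_comment(self, data):
        self.out.append(f"<!--{data}-->")

    def handle_decl(self, decl):
        self.out.append(f"<!{decl}>")


def quote_attributes(html_text):
    parser = AttributeQuoter()
    parser.feed(html_text)
    parser.close()  # flush any buffered text
    return "".join(parser.out)


print(quote_attributes("<td align=center nowrap>x</td>"))
# <td align="center" nowrap="nowrap">x</td>
```

     The parser lowercases tag and attribute names as a side effect,
     which a real tool may or may not want.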

4  MIME Type

     HTMLX documents should be sent as "text/html" and treated as
     HTML, according to the intent of the World Wide Web Consortium's
     (W3C) HTML Working Group (WG) as expressed in
     http://lists.w3.org/Archives/Public/www-html/2000Sep/0024.html



5  File Name Extensions

     The file name may end with any of the following extensions:

     .html
     .htm
     .xml

     A browser will only attempt to format the first two types as HTML,
     whereas the third will typically be processed as an XML data file,
     as current practice and standards dictate.
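
     The mapping described above can be written down as a small
     lookup.  This is only a restatement of the text in code; the
     labels "html" and "xml" and the function name expected_treatment
     are hypothetical, and actual browser behavior may differ.

```python
import os

# Mapping taken from the text above; the labels are illustrative
# names, not values defined by this memo.
TREATMENT = {".html": "html", ".htm": "html", ".xml": "xml"}


def expected_treatment(filename):
    """Return how a browser is expected to treat the file, or None."""
    extension = os.path.splitext(filename)[1].lower()
    return TREATMENT.get(extension)


print(expected_treatment("index.html"))  # html
print(expected_treatment("data.xml"))    # xml
```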

6  Software

     It is expected that some software will require well-formedness
     and other software will not.  Software reading the document
     is not required to verify well-formedness, but software saving
     the document should attempt to produce well-formedness.  A
     mechanism for alerting the user of ill-formedness upon saving
     a document is suggested for documents in the process of being
     converted where the software does not completely automate the
     process.
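
     The suggested alert on saving might look like the following
     sketch.  It assumes the alert is delivered through Python's
     warnings module, which stands in for whatever user interface a
     real editor provides; the function name save_document is
     hypothetical.  The document is saved either way, since the text
     above only suggests alerting the user.

```python
import warnings
import xml.etree.ElementTree as ET


def save_document(text, path):
    """Write text to path, warning the user if it is not well-formed.

    Returns True when the saved document is well-formed XML.
    """
    try:
        ET.fromstring(text)
        well_formed = True
    except ET.ParseError as err:
        well_formed = False
        warnings.warn(f"{path} is not well-formed: {err}")
    with open(path, "w", encoding="utf-8") as handle:
        handle.write(text)
    return well_formed
```

     A caller converting a document in several passes can use the
     return value to decide whether further tidying is still needed.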

Revisions

     00 - Initial version

     01 - Added Security Considerations and Revisions sections,
          and updated the list of empty elements that should be
          kept as separate begin/end tags

Author's Address

     Ted Shaneyfelt
     University of Hawaii at Hilo
     200 W. Lanikaula Street
     Hilo, Hawaii  96720-4091

     For additional contact information, see
     http://cs.uhh.hawaii.edu/cs/people/staff/#ted

Copyright (C) The Internet Society 2004.  This document is subject
to the rights, licenses and restrictions contained in BCP 78, and
except as set forth therein, the authors retain all their rights.

This document and the information contained herein are provided on an
"AS IS" basis and THE CONTRIBUTOR, THE ORGANIZATION HE/SHE REPRESENTS
OR IS SPONSORED BY (IF ANY), THE INTERNET SOCIETY AND THE INTERNET
ENGINEERING TASK FORCE DISCLAIM ALL WARRANTIES, EXPRESS OR IMPLIED,
INCLUDING BUT NOT LIMITED TO ANY WARRANTY THAT THE USE OF THE
INFORMATION HEREIN WILL NOT INFRINGE ANY RIGHTS OR ANY IMPLIED
WARRANTIES OF MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE.

Internet-Drafts are working documents of the Internet Engineering
Task Force (IETF), its areas, and its working groups.  Note that other
groups may also distribute working documents as Internet-Drafts.

Internet-Drafts are draft documents valid for a maximum of six months
and may be updated, replaced, or obsoleted by other documents at any
time.  It is inappropriate to use Internet-Drafts as reference
material or to cite them other than as "work in progress."
