Network Working Group E. Wilde
Internet-Draft Swiss Federal Institute of
Expires: January 9, 2003 Technology
July 11, 2002
URI Fragment Identifiers for the text/plain Media Type
draft-wilde-text-fragment-00
Status of this Memo
This document is an Internet-Draft and is in full conformance with
all provisions of Section 10 of RFC2026.
Internet-Drafts are working documents of the Internet Engineering
Task Force (IETF), its areas, and its working groups. Note that
other groups may also distribute working documents as Internet-
Drafts.
Internet-Drafts are draft documents valid for a maximum of six months
and may be updated, replaced, or obsoleted by other documents at any
time. It is inappropriate to use Internet-Drafts as reference
material or to cite them other than as "work in progress."
The list of current Internet-Drafts can be accessed at http://
www.ietf.org/ietf/1id-abstracts.txt.
The list of Internet-Draft Shadow Directories can be accessed at
http://www.ietf.org/shadow.html.
This Internet-Draft will expire on January 9, 2003.
Copyright Notice
Copyright (C) The Internet Society (2002). All Rights Reserved.
Abstract
This memo defines URI fragment identifiers for text/plain resources.
These fragment identifiers make it possible to refer to parts of a
text resource, identified by character count or range, line count or
range, or a regular expression. These identification methods can be
combined to identify more than one sub-resource of a text/plain
resource.
Wilde Expires January 9, 2003 [Page 1]
Internet-Draft text/plain Fragment Identifiers July 2002
Table of Contents
1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . 3
1.1 What is text/plain? . . . . . . . . . . . . . . . . . . . . 3
1.2 What is a URI Fragment Identifier? . . . . . . . . . . . . . 3
1.3 Why text/plain Fragment Identifiers? . . . . . . . . . . . . 4
2. Fragment Identification Methods . . . . . . . . . . . . . . 4
2.1 Fragment Identification Schemes . . . . . . . . . . . . . . 5
2.1.1 Character Count . . . . . . . . . . . . . . . . . . . . . . 5
2.1.2 Character Range . . . . . . . . . . . . . . . . . . . . . . 5
2.1.3 Line Count . . . . . . . . . . . . . . . . . . . . . . . . . 5
2.1.4 Line Range . . . . . . . . . . . . . . . . . . . . . . . . . 5
2.1.5 Regular Expressions . . . . . . . . . . . . . . . . . . . . 5
2.1.6 Combining Fragment Identification Schemes . . . . . . . . . 6
3. Fragment Identification Syntax . . . . . . . . . . . . . . . 6
4. Examples . . . . . . . . . . . . . . . . . . . . . . . . . . 7
5. Security Considerations . . . . . . . . . . . . . . . . . . 8
Normative References . . . . . . . . . . . . . . . . . . . . 8
Non-Normative References . . . . . . . . . . . . . . . . . . 8
Author's Address . . . . . . . . . . . . . . . . . . . . . . 9
A. POSIX BRE Syntax . . . . . . . . . . . . . . . . . . . . . . 9
B. Where to send Comments . . . . . . . . . . . . . . . . . . . 9
C. Acknowledgements . . . . . . . . . . . . . . . . . . . . . . 9
Full Copyright Statement . . . . . . . . . . . . . . . . . . 10
Wilde Expires January 9, 2003 [Page 2]
Internet-Draft text/plain Fragment Identifiers July 2002
1. Introduction
Compliant software MUST follow this specification. The capitalized
key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
"SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
document are to be interpreted as described in RFC 2119 [RFC2119].
1.1 What is text/plain?
Internet Media Types as defined in RFC 2045 [RFC2045] and RFC 2046
[RFC2046] are used to identify different types and sub-types of
media. RFC 2046 [RFC2046] and RFC 2646 [RFC2646] specify the text/
plain media type, which is used for simple, unformatted text.
Quoting from RFC 2046 [RFC2046]: "Plain text does not provide for or
allow formatting commands, font attribute specifications, processing
instructions, interpretation directives, or content markup. Plain
text is seen simply as a linear sequence of characters, possibly
interrupted by line breaks or page breaks."
The text/plain media type does not restrict the character encoding,
any character encoding may be used. In the absence of an explicit
character encoding declaration, US-ASCII is assumed as the default
character encoding. This variability of the character encoding makes
it impossible to count characters in a text/plain resource without
taking the character encoding into account, because there are many
character encodings using more than one octet per character.
The biggest advantage of text/plain resources is their portability
among different platforms. As long as they use popular character
encodings (such as US-ASCII), they can be displayed and processed on
virtually every computer system.
1.2 What is a URI Fragment Identifier?
URIs are the identification mechanism for resources on the Web. The
URI syntax specified in RFC 2396 [RFC2396] includes as part of a URI
reference a fragment identifier, which (quoting from RFC 2396
[RFC2396]) "consists of additional reference information to be
interpreted by the user agent after the retrieval action has been
successfully completed. As such, it is not part of a URI, but is
often used in conjunction with a URI".
The most popular fragment identifier is defined for text/html
(defined in RFC 2854 [RFC2854]), and makes it possible to refer to a
specific element of an HTML document.
Wilde Expires January 9, 2003 [Page 3]
Internet-Draft text/plain Fragment Identifiers July 2002
1.3 Why text/plain Fragment Identifiers?
Referring to specific parts of a resource can be very useful, because
it enables users to create more specific references. Rather than
pointing to a whole resource, users can create references to the part
they really are interested in or want to talk about. Even though it
is suggested that fragment identification methods are specified in a
media type's MIME registration, many media types do not have fragment
identification methods associated with them.
Fragment identifiers are only useful if supported by the client,
because they are only interpreted by the client. Therefore, a new
fragment identification method will require some time to be adopted
by clients, and older clients will not support it. However, because
the URI reference still works even if the fragment identifier is not
supported (the resource is retrieved, but the fragment identifier is
not interpreted), rapid adoption is not highly critical to ensure the
success of a new fragment identification method.
Fragment identifiers for text/plain make it possible to refer to
specific parts of a text resource, either by line count, by character
count, or by using a regular expression for searching for a specific
character sequence. Thus, text/plain fragment identifiers enable
users to exchange information more specifically, thereby reducing
time and effort that is necessary to manually search for the relevant
part of a text/plain resource.
2. Fragment Identification Methods
The identification of resource fragments of text/plain resources can
be based on different foundations. Since it is not necessary to
insert explicit identifiers into a text/plain resource (as is
possible with HTML documents by using special attributes), fragment
identification has to rely on certain inherent criteria of the
resource. This memo specifies fragment identification using five
different methods, character counts and ranges, line counts and
ranges, and regular expression matching.
When interpreting character or line numbers, implementations MUST
take the character encoding of the resource into account, because
character count and octet count may differ for the character encoding
being used. For example, a resource using UTF-16 encoding uses two
octets per character, and it may have a leading BOM (Byte-Order Mark)
which does not count as a character and thus also affects the mapping
from a simple octet count to a character count.
Wilde Expires January 9, 2003 [Page 4]
Internet-Draft text/plain Fragment Identifiers July 2002
2.1 Fragment Identification Schemes
2.1.1 Character Count
The simplest way to identify a fragment is to point to a certain
character of the resource. Rather than identifying a fragment
consisting of a number of characters, this method only identifies a
single character, but this often is sufficient by referring to the
start of a region of interest. Character counting starts with 1, so
the first character of a text/plain resource has the count 1.
2.1.2 Character Range
If it is necessary to identify a fragment of multiple characters
using character counting, this can be done by using a character
range. A character range is a consecutive region of the resource
that extends from the starting character of the range to the ending
character of the range. The ending character of the range must have
a greater number than the starting character.
2.1.3 Line Count
Lines in text/plain resources are separated by CRLF sequences, and
consequently it is easy to identify lines. Because lines are the
only structural property of text/plain resources, it is possible to
identify a fragment of a resource by referring to a particular line.
Line counting starts with 1, so the first line of a text/plain
resource has the count 1. If a resource does not contain any CRLF
sequences, then it consists of a single (the first) line.
2.1.4 Line Range
If it is necessary to identify a fragment of multiple lines using
line counting, this can be done by using a line range. A line range
is a consecutive region of the resource that extends from the
starting line of the range to the ending line of the range. The
ending line of the range must have a greater number than the starting
line.
2.1.5 Regular Expressions
A common problem with fragment identifiers is their robustness (to
changes in the resource), and character and line counts can be broken
very easily. A more robust way of identifying a fragment is by
searching for a specific pattern. Thus, it is possible to use a
Basic Regular Expression (BRE) as defined by ISO 9945-2 [ISO9945-2]
(the POSIX standard) as a fragment identifier (Appendix A contains a
short summary of the POSIX BRE syntax).
Wilde Expires January 9, 2003 [Page 5]
Internet-Draft text/plain Fragment Identifiers July 2002
2.1.6 Combining Fragment Identification Schemes
While in most cases only one fragment identification scheme will be
used, it is possible to combine them. By simply concatenating
different fragment identification schemes, the whole fragment
identifier refers to the union of all parts of the text resource
identified by the individual fragment identification schemes. This
way, it is possible to identify disjoint ranges, such as multiple
line ranges.
It should be noticed that regular expressions by themselves may
identify disjoint fragments, which is true in any case where the
regular expression matches more than one occurrence in the resource.
Since disjoint fragments can be identified, implementations SHOULD
make sure that these fragments are appropriately marked, for example
by highlighting the fragment (rather than only scrolling to some
line, which only identifies a single location in the resource).
However, the exact method of how implementations deal with disjoint
fragments depends on the application and interface, and is beyond the
scope of this memo.
3. Fragment Identification Syntax
The syntax for the fragment identifiers is very straightforward. The
syntax defines three schemes, 'char', 'line', and 'match'. The
'char' and 'line' can be used in two different variants, either the
count variant (with a single number), or the range variant (with two
comma-separated numbers). The 'match' scheme has a regular
expression as parameter, which must be specified as a string with
balanced parentheses.
The following syntax definition uses ABNF as defined in RFC 2234
[RFC2234].
text-fragment = 1*text-scheme
text-scheme = ( char-scheme / line-scheme / regex-scheme )
char-scheme = "char(" ( count / range ) ")"
line-scheme = "line(" ( count / range ) ")"
match-scheme = "match(" regex ")"
count = 1*DIGIT
range = count "," count
regex = StringWithBalancedParens
The StringWithBalancedParens may only contain balanced parentheses,
if unbalanced parentheses need to be used, they must be escaped with
a '^' character. A literal '^' must be escaped as '^^'. Thus,
Wilde Expires January 9, 2003 [Page 6]
Internet-Draft text/plain Fragment Identifiers July 2002
before interpreting the StringWithBalancedParens as a BRE, it must be
searched for '^(', '^)', and '^^', and these strings must be
substituted with their unescaped variants.
If any count value is greater than the value for the actual resource,
then it identifies the last character or line of the resource. If a
range scheme's counts are not properly ordered (ie, the first number
is less than the second), then this scheme part has to be ignored.
4. Examples
The following examples show some usages for the fragment identifiers
defined in this memo.
http://example.com/text.txt#char(100)
This URI reference identifies the 100th character of the text.txt
resource. It should be noted that it is not clear which octet(s) of
the resource this will be without retrieving the resource and thus
knowing which character encoding is used for it (in case of HTTP,
this information will be given in the response's Content-type
header).
http://example.com/text.txt#line(10,20)
This URI reference identifies lines 10 to 20 of the text.txt
resource. If the resource has fewer than 10 lines, it identifies the
last line. If the resource has less than 20 but at least 10 lines,
it identifies the lines 10 to the last line of the resource.
http://example.com/text.txt#match(searchterm)
This URI reference identifies all occurrences of the regular
expression 'searchterm' in the resource, ie all occurrences of the
string 'searchterm'. If there is more than one occurrence, then this
URI references a disjoint fragment, consisting of all of these
occurrences.
http://example.com/text.txt#line(1)match(searchterm)
This URI reference identifies the first line and all occurrences of
the regular expression 'searchterm' in the resource. If there is an
occurrence of 'searchterm' outside of the first line, then this URI
references a disjoint fragment.
Wilde Expires January 9, 2003 [Page 7]
Internet-Draft text/plain Fragment Identifiers July 2002
http://example.com/text.txt#match(hello%5E()
This URI reference identifies all occurrences of the regular
expression 'hello(' in the resource. It must first be URL decoded,
which leads to the scheme part 'hello^('. This is then interpreted
according to the definition of a string with balanced parentheses,
treating the '^(' as an escaped '(', so that the actual regular
expression is 'hello('. If there is more than one occurrence of this
regular expression, then this URI references a disjoint fragment,
consisting of all of these occurrences.
5. Security Considerations
There are no relevant security considerations for this memo.
Normative References
[ISO9945-2] International Organization for Standardization,
"Information technology - Portable Operating System
Interface (POSIX) - Part 2: Shell and Utilities", ISO
9945-2, xxxxx 1993.
[RFC2045] Freed, N. and N. Borenstein, "Multipurpose Internet Mail
Extensions (MIME) Part One: Format of Internet Message
Bodies", RFC 2045, November 1996.
[RFC2046] Freed, N. and N. Borenstein, "Multipurpose Internet Mail
Extensions (MIME) Part Two: Media Types", RFC 2046,
November 1996.
[RFC2119] Bradner, S., "Key words for use in RFCs to Indicate
Requirement Levels", RFC 2119, March 1997.
[RFC2234] Crocker, D. and P. Overell, "Augmented BNF for Syntax
Specifications: ABNF", RFC 2234, November 1997.
[RFC2396] Berners-Lee, T., Fielding, R. and L. Masinter, "Uniform
Resource Identifiers (URI): Generic Syntax", RFC 2396,
August 1998.
[RFC2646] Gellens, R., "The Text/Plain Format Parameter", RFC
2646, August 1999.
Non-Normative References
[RFC2629] Rose, M., "Writing I-Ds and RFCs using XML", RFC 2629,
June 1999.
Wilde Expires January 9, 2003 [Page 8]
Internet-Draft text/plain Fragment Identifiers July 2002
[RFC2854] Connolly, D. and L. Masinter, "The 'text/html' Media
Type", RFC 2854, June 2000.
Author's Address
Erik Wilde
Swiss Federal Institute of Technology
ETH-Zentrum
8092 Zurich
Switzerland
Phone: +41-1-6325132
EMail: ietf@dret.net
URI: http://dret.net/netdret/
Appendix A. POSIX BRE Syntax
This section contains a short (and non-normative) summary of the
POSIX BRE synatx defined in ISO 9945-2 [ISO9945-2].
(tbd - is there some rfc that could be referenced instead?)
Appendix B. Where to send Comments
Please send all comments about this document to Erik Wilde.
Appendix C. Acknowledgements
This document has been written using the IETF document DTD described
in RFC 2629 [RFC2629].
Wilde Expires January 9, 2003 [Page 9]
Internet-Draft text/plain Fragment Identifiers July 2002
Full Copyright Statement
Copyright (C) The Internet Society (2002). All Rights Reserved.
This document and translations of it may be copied and furnished to
others, and derivative works that comment on or otherwise explain it
or assist in its implementation may be prepared, copied, published
and distributed, in whole or in part, without restriction of any
kind, provided that the above copyright notice and this paragraph are
included on all such copies and derivative works. However, this
document itself may not be modified in any way, such as by removing
the copyright notice or references to the Internet Society or other
Internet organizations, except as needed for the purpose of
developing Internet standards in which case the procedures for
copyrights defined in the Internet Standards process must be
followed, or as required to translate it into languages other than
English.
The limited permissions granted above are perpetual and will not be
revoked by the Internet Society or its successors or assigns.
This document and the information contained herein is provided on an
"AS IS" basis and THE INTERNET SOCIETY AND THE INTERNET ENGINEERING
TASK FORCE DISCLAIMS ALL WARRANTIES, EXPRESS OR IMPLIED, INCLUDING
BUT NOT LIMITED TO ANY WARRANTY THAT THE USE OF THE INFORMATION
HEREIN WILL NOT INFRINGE ANY RIGHTS OR ANY IMPLIED WARRANTIES OF
MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE.
Acknowledgement
Funding for the RFC Editor function is currently provided by the
Internet Society.
Wilde Expires January 9, 2003 [Page 10]