INTERNET-DRAFT                                              J. Salsman
Filename: <draft-salsman-www-device-upload-04.txt>       Cisco Systems
submitted to the W3C HTML activity                     2 February 1999

               Form-based Device Input and Upload in HTML

Status of this Memo

   This draft extends an experimental protocol for the Internet
   community.  This draft does not specify an Internet standard.
   Discussion and suggestions for improvement are requested.
   Distribution of this memo is unlimited.

     This document is an Internet-Draft and is in full conformance
     with all provisions of Section 10 of RFC 2026.

     Internet-Drafts are working documents of the Internet Engineering
     Task Force (IETF), its areas, and its working groups.  Note that
     other groups may also distribute working documents as
     Internet-Drafts.

     Internet-Drafts are draft documents valid for a maximum of six
     months and may be updated, replaced, or obsoleted by other
     documents at any time.  It is inappropriate to use Internet-
     Drafts as reference material or to cite them other than as
     "work in progress."

     The list of current Internet-Drafts can be accessed at
     http://www.ietf.org/ietf/1id-abstracts.txt

     The list of Internet-Draft Shadow Directories can be accessed at
     http://www.ietf.org/shadow.html.

1.  Abstract

   Currently, HTML forms allow the producer of the form to request
   information -- including files of data -- from the operator reading
   the form.  However, this capability is limited because HTML forms
   don't provide a way to ask the operator to submit input from
   arbitrary sources such as audio devices like microphones.  Since
   input and upload from various devices is a feature that will
   benefit many applications, this draft proposes an extension to the
   HTML Input type=file form element specified in RFC 1867 to allow
   information providers to express requests for uploads from audio
   and other devices uniformly.  A discussion of MIME audio data
   types to facilitate useful audio upload responses follows.  Also
   included are discussions of audio usability and quality, security
   considerations, and a backward compatibility strategy that allows
   new user agents to use HTML written with earlier proposals for
   audio input in mind.  A statement of motivations concludes the
   draft.

2.  HTML forms with device input file upload submission

   Section 3.1 of RFC 1867 provides for the presentation of an
   arbitrary "widget" to specify input for file uploads.  When an
   INPUT tag of type FILE is encountered with a DEVICE attribute, the
   associated value (such as MICROPHONE, or MIC) might select the use
   of a widget capable of buffering and editing real-time input (such
   as speech) instead of entering a file selection mode.

   If an ACCEPT attribute is present in a device file input element,
   the browser might constrain the MIME type of the uploaded data to
   one of the types in the specified list.  If the value of the
   DEVICE attribute is FILESYSTEM or FILES then the INPUT
   element might be treated as usual according to RFC 1867 except that
   the subset of files presented to the operator to choose from may be
   constrained by the specified list of MIME types instead of a
   pattern of file names or extensions.
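
   As a rough illustration of the DEVICE=FILES case, the following
   Python sketch narrows a list of candidate files to those matching
   an ACCEPT list.  It is not part of this proposal.  The function
   names are hypothetical, the ACCEPT value is assumed to be
   comma-separated, and a lookup by file extension stands in for
   whatever means a real browser would use to determine MIME types.

```python
import fnmatch
import mimetypes


def parse_accept(accept):
    """Split an ACCEPT value into lowercase MIME patterns.

    Assumption: entries are comma-separated; any ';' parameters
    (such as rate or channels) are dropped for matching purposes.
    """
    patterns = []
    for item in accept.split(","):
        media_type = item.split(";")[0].strip().lower()
        if media_type:
            patterns.append(media_type)
    return patterns


def filter_by_accept(filenames, accept):
    """Keep only the files whose guessed MIME type matches ACCEPT."""
    patterns = parse_accept(accept)
    chosen = []
    for name in filenames:
        # Stand-in for the platform's own MIME type detection.
        guessed, _encoding = mimetypes.guess_type(name)
        if guessed and any(fnmatch.fnmatchcase(guessed, p) for p in patterns):
            chosen.append(name)
    return chosen


if __name__ == "__main__":
    print(filter_by_accept(["notes.txt", "photo.png", "index.html"], "text/*"))
```

   A wildcard such as "text/*" is matched with shell-style patterns
   here only for brevity; an implementation could equally compare
   the type and subtype separately.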

   Device input has no original filename of the kind that section 3.3
   of RFC 1867 specifies as a parameter of the 'content-disposition:
   form-data' and 'content-disposition: file' headers.  Those headers
   might instead be provided with a 'type' parameter representing the
   MIME type of the encoded data, if known, and a 'device' parameter
   with the same value as the DEVICE attribute of the associated form
   input element.  If the device or MIME type(s) specified are
   unsupported, the value of the 'device' parameter might be
   'unsupported'; if the device is unavailable, the value might be
   'unavailable'.  If the MIME types requested are unsupported, an
   additional parameter 'alternates' might be included with a
   space-separated list of MIME types of the same content-type which
   may be supported as alternatives for the specified device.  The
   content-disposition header parameter syntax is described in
   RFC 1806.
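
   The parameter rules above can be sketched in Python as follows.
   This is a minimal illustration under stated assumptions, not a
   normative encoding: the function name device_disposition and its
   arguments are hypothetical, and quoting every parameter value is
   a choice made here so that MIME types carrying their own ';'
   parameters stay intact.

```python
def device_disposition(name, device, mime_type=None, status=None, alternates=()):
    """Build a content-disposition value for a device upload part.

    status is None for a normal upload, or 'unsupported' or
    'unavailable' when the device input could not be provided.
    """
    params = ['name="%s"' % name]
    # The 'type' parameter is included only when data was encoded.
    if mime_type and status is None:
        params.append('type="%s"' % mime_type)
    # 'device' echoes the DEVICE attribute unless a status replaces it.
    params.append('device="%s"' % (status or device))
    if alternates:
        # Space-separated list of alternative MIME types.
        params.append('alternates="%s"' % " ".join(alternates))
    return "form-data; " + "; ".join(params)


if __name__ == "__main__":
    print(device_disposition("SPEECH1", "MIC", mime_type="audio/l16;rate=11025"))
    print(device_disposition("SPEECH2", "MICROPHONE", status="unsupported",
                             alternates=["audio/basic", "audio/l16"]))
```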

   There may be significant limitations on the client browser's
   ability to buffer input for upload.  Browsers might provide an
   estimate of the default MAXLENGTH available for device input and
   upload through the HTTP header 'Pragma: device-maxlength='
   followed immediately by the decimal representation of the number
   of bytes, representing the content-length available to the browser
   for buffering (see section 14.32 of RFC 2068).  A server may also
   provide information about the largest file size it can accept for
   upload, with a similar 'Pragma: server-maxlength=' header to
   inform the browser of such limits.
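
   A receiver of either Pragma directive might extract the byte count
   along the lines of the Python sketch below.  The function names
   are hypothetical, and the handling of several comma-separated
   directives in one header value is an assumption based on the
   general Pragma syntax rather than anything this draft specifies.

```python
import re


def parse_maxlength(pragma_value, directive="device-maxlength"):
    """Return the byte count for a maxlength directive, or None.

    None is returned when the directive is absent or its value is
    not a plain decimal number.
    """
    for part in pragma_value.split(","):
        key, sep, value = part.strip().partition("=")
        if sep and key.lower() == directive and re.fullmatch(r"[0-9]+", value):
            return int(value)
    return None


def effective_limit(device_limit, server_limit):
    """Combine browser and server limits; None means 'not stated'."""
    known = [limit for limit in (device_limit, server_limit) if limit is not None]
    return min(known) if known else None


if __name__ == "__main__":
    browser = parse_maxlength("no-cache, device-maxlength=1048576")
    server = parse_maxlength("server-maxlength=2000000", "server-maxlength")
    print(effective_limit(browser, server))
```

   Taking the smaller of the two limits is one plausible policy; the
   draft itself does not say how the two values should be combined.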

   Furthermore, the VALUE attribute may be used to disambiguate
   between multiple similar devices when present.  Under most
   conditions the operator should be allowed to select the device
   when the source of input is ambiguous, or to re-select it if it
   was specified with a VALUE attribute.

   If real time events, such as those described and proposed by
   Gregory S. Aist in "A General Architecture for a Real-Time
   Discourse Agent and a Case Study in Computerized Oral Reading
   Tutoring" (Carnegie Mellon University Computational Linguistics
   Program, 6 December 1996), are required, then the Real-time
   Transport Protocol (RTP, currently RFC 1889) should be used
   instead.  Because of security concerns discussed in section 4
   below, HTML scripts might not be able to invoke a form submission
   when the form involves any kind of file upload without explicit
   instructions from the session operator to the contrary.

2.1.  Examples

     <FORM ENCTYPE="multipart/form-data" METHOD=POST ACTION="_URL_">
       Say something:  <INPUT NAME=SPEECH1 TYPE=FILE DEVICE=MIC>
       <INPUT TYPE=SUBMIT VALUE="Send Speech">
     </FORM>

   In this simple form, the HTML author has requested the upload of
   sampled microphone input from the operator upon form submission.

     <INPUT NAME=SPEECH2 TYPE=FILE DEVICE=MICROPHONE
       ACCEPT="audio/l16 ;rate=11025 ;channels=1 audio/x-cepstral-voc">

   Here MIC is not used as an abbreviation.  The author of the HTML has
   requested that the data input from the microphone be encoded as
   either the MIME type Audio/L16 -- sixteen bit signed linear audio
   samples (most-significant byte first) -- as specified in RFC 1890
   section 4.4.8, with a single (monaural) channel and a sample rate of
   11,025 samples per second, or an unspecified extended MIME Audio
   type named 'x-cepstral-voc'.

     <INPUT NAME=FILE1 TYPE=FILE DEVICE=FILES ACCEPT="text/*">

   Here the form element may be used to upload a file as usual, except
   that the files to select from might be constrained to text files,
   without regard to their filenames or extensions.

     <INPUT NAME=PICTURE1 TYPE=FILE DEVICE=CAMERA VALUE=2>

   The final example shows how these extensions may be used to request
   input from other kinds of devices, such as the second of two or
   more cameras connected to the system running the browser.

3.  User interface usability and quality concerns for audio

   An audio sample is customarily recorded on computer equipment with
   a dialog routine capable of allowing the user to record, pause,
   play back, erase, or otherwise edit the recording.  Browsers might
   provide the operator with the same kind of dialog routine for audio
   device input.  If a MAXLENGTH has been specified or is in force
   because of limited buffer size, the buffer space used and
   remaining might be displayed as a dynamic bar graph (or as a
   percentage if graphics are unavailable).  A display of time in
   seconds used and remaining in the buffer may also be provided.
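
   The buffer display described above might be computed as in the
   following Python sketch.  It assumes uncompressed linear audio so
   that bytes convert directly to seconds.  The function name, the
   default rate of 11,025 sixteen-bit monaural samples per second,
   and the text layout of the bar are all illustrative choices.

```python
def buffer_status(bytes_used, maxlength, rate=11025, channels=1,
                  bytes_per_sample=2, width=20):
    """Format a text bar, percentage, and seconds used and remaining."""
    if maxlength <= 0:
        raise ValueError("maxlength must be positive")
    # Bytes consumed per second of uncompressed linear audio.
    bytes_per_second = rate * channels * bytes_per_sample
    fraction = min(bytes_used / maxlength, 1.0)
    filled = int(round(fraction * width))
    bar = "#" * filled + "-" * (width - filled)
    seconds_used = bytes_used / bytes_per_second
    seconds_left = max(maxlength - bytes_used, 0) / bytes_per_second
    return "[%s] %d%% %.1fs used, %.1fs left" % (
        bar, round(fraction * 100), seconds_used, seconds_left)


if __name__ == "__main__":
    # 220,500 bytes recorded into an 882,000 byte buffer.
    print(buffer_status(220500, 882000))
```

   With compressed or feature-extracted encodings the time estimate
   would need the encoder's actual output rate instead.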

   Most MIME types defined for audio do not provide high-quality audio
   encodings.  The 'audio/basic' and other types which use a sample
   rate of 8,000 samples per second truncate the audio spectrum at
   4,000 Hz according to the Nyquist theorem, discarding information
   important for discerning consonants.  Also, audio/basic and other
   MIME Audio types use a sample size of eight bits, which does not
   usually provide enough dynamic range for accurate automatic speech
   recognition unless published automatic gain control algorithms are
   reliably used.  If sixteen-bit signed audio encodings are used
   according to section 4.4.8 of RFC 1890, the sample rate --
   specified as the 'rate' parameter of the MIME type 'audio/l16' --
   might be at least 11,025 or 16,000 samples per second to provide
   enough information for automatic speech recognition.  Otherwise,
   the audio feature extraction encoding of the speech recognition
   algorithm might be used to provide a more compact representation
   to shorten the upload.
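
   The trade-off between bandwidth and upload size can be made
   concrete with a little arithmetic, sketched here in Python.  The
   function names are hypothetical, and the ten-second duration is
   only an example.

```python
def nyquist_hz(sample_rate):
    """Highest frequency representable at the given sample rate."""
    return sample_rate / 2


def upload_bytes(seconds, rate, channels=1, bits=16):
    """Size in bytes of uncompressed audio of the given duration."""
    return seconds * rate * channels * bits // 8


if __name__ == "__main__":
    # audio/basic is eight-bit, 8,000 samples per second, monaural.
    print(nyquist_hz(8000), upload_bytes(10, 8000, bits=8))
    # audio/l16 at the two rates suggested in the text.
    print(nyquist_hz(11025), upload_bytes(10, 11025))
    print(nyquist_hz(16000), upload_bytes(10, 16000))
```

   Ten seconds of monaural speech therefore grows from 80,000 bytes
   at 8,000 eight-bit samples per second to 220,500 bytes as
   audio/l16 at 11,025 samples per second, while the representable
   bandwidth rises from 4,000 Hz to 5,512.5 Hz.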

4.  Security considerations

   Browser operators may not want to send their files, recordings,
   pictures, video, or other device inputs to arbitrary sites without
   their explicit permission and direction.  Therefore, browser
   authors are encouraged to disallow the submission of forms which
   include any kind of file upload by any means other than the
   standard HTML operator-controlled buttons for form submission
   without explicit instruction from the session operator to the
   contrary.  Accordingly, the SIZE parameter, document style sheets,
   and document layers may be prevented from obscuring any kind of
   file upload widget, especially those capable of accepting a default
   filename.  Furthermore, just as the operator may take direct action
   to initiate, terminate, review and edit recording as described in
   the previous section, browser authors are encouraged to prevent
   HTML scripts from taking those and similar actions, unless for
   example the operator has specifically enabled such script actions
   with a security option.  Even then, such preferences might be
   specified by the operator to reset after an interval or at the end
   of the session.  Finally, explicit information might be provided to
   ensure that the operator is informed when files are being uploaded.

5.  Compatibility with earlier forms of audio input

   Audio device input has been proposed before and implemented from a
   microphone at least as early as 1994 in experimental versions of
   common Web browsers.  To accommodate the syntax of these earlier
   extensions, a browser might interpret an HTML element such as

     <INPUT TYPE=AUDIO ...>

   as the device input form

     <INPUT TYPE=FILE DEVICE=MICROPHONE ...>

   with all other attribute/value pairs of the original INPUT element
   kept the same as specified.  This would retain compatibility for
   all implementations of which the author of this draft is aware.
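
   A parser could apply this mapping to an element's attributes
   before any further processing, as in the Python sketch below.
   The function name and the dictionary representation of the
   attributes are assumptions made for illustration.  Leaving an
   existing DEVICE attribute untouched is likewise a choice made
   here, since the draft does not address that case.

```python
def normalize_input_attrs(attrs):
    """Map a legacy TYPE=AUDIO input onto TYPE=FILE DEVICE=MICROPHONE.

    attrs is a dict of attribute names to values.  Names are
    upper-cased; all other attribute/value pairs are kept as given.
    """
    normalized = {key.upper(): value for key, value in attrs.items()}
    input_type = str(normalized.get("TYPE") or "")
    if input_type.upper() == "AUDIO":
        normalized["TYPE"] = "FILE"
        # Keep an explicit DEVICE if the author already supplied one.
        normalized.setdefault("DEVICE", "MICROPHONE")
    return normalized


if __name__ == "__main__":
    print(normalize_input_attrs({"type": "audio", "name": "SPEECH1"}))
```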

6.  HTML Document Type Description changes

   Along with the extension to the HTML InputType entity described in
   the previous section, this proposal makes an addition to the HTML
   DTD for the INPUT element ATTLIST of an #IMPLIED attribute DEVICE
   of type CDATA.

7.  Motivations and conclusion

   The primary motivation for these extensions is to add the
   capability of speech input to Web-based educational systems. [1,2]
   Other motivations include the development of "dictation servers"
   [3] capable of transforming spoken audio uploaded through an HTTP
   session to the corresponding text suitable for sending in email or
   including in another document, for example.  Natural language
   continuous speech recognition software conforming to standard APIs
   for automatic dictation is as of this writing available from retail
   outlets for free in small quantity so there is ample reason to
   believe that transcription servers might soon become commonplace on
   the Web with these extensions.  Furthermore, they could also be a
   great help to hearing impaired people who want to use a "phonology
   server" to practice improving their pronunciation without a human
   speech coach.

   Finally, Larry Masinter, author of RFC 1867, has indicated that
   graphical paper scanners might be used for applications such as
   OCR and bar-code upload.  "DEVICE=SCANNER" is suggested.

   The change to the HTML DTD is very simple, but very powerful.  It
   enables a much greater variety of services to be implemented via
   the World-Wide Web than is currently possible due to the lack of a
   peripheral input upload submission facility.  This would be a very
   valuable addition to the capabilities of the World-Wide Web.

8.  Author's address and acknowledgments

   James Salsman
   Cisco Systems, San Jose, California
   Bovik Research Inst., a non-profit organization

   1285 Montecito Ave Apt 57
   Mountain View, CA  94043

   Email:  jps@bovik.org, jsalsman@cisco.com
   Phone:  (650) 967-2737

   Larry Masinter and Harald Alvestrand contributed excellent advice.
   Ed Tecot contributed the means of device and media independence.
   David McMillian contributed to the description of capabilities of
   the audio widget.  Syracuse Language Systems, The Learning Co.,
   and EduSoft, Ltd., contributed much of the inspiration; Jack Mostow
   et al. did much more work for younger grades.  Dave Raggett of
   the W3C HTML Working Group helped with a number of suggestions.

References

[RFC 1867] Form-based File Upload in HTML.  E. Nebel & L. Masinter,
           November 1995.

[RFC 1806] Communicating Presentation Information in Internet
           Messages:  The Content-Disposition Header.  R. Troost,
           S. Dorner, June 1995.

[RFC 2068] Hypertext Transfer Protocol -- HTTP/1.1.  R. Fielding,
           J. Gettys, J. Mogul, H. Frystyk, & T. Berners-Lee,
           January 1997.

[RFC 1889] RTP: A Transport Protocol for Real-Time Applications.
           H. Schulzrinne, S. Casner, R. Frederick, & V. Jacobson,
           January 1996.

[RFC 1890] RTP Profile for Audio and Video Conferences with Minimal
           Control.  H. Schulzrinne, January 1996.

[1]    http://www.cs.cmu.edu/~listen/
           Literacy instruction from a reading tutor that listens
           from Carnegie Mellon's Project LISTEN.

[2]    http://www.ordinate.com/
           Over-the-phone automated testing of English fluency,
           listening, and vocabulary from Ordinate Corporation.

[3]    http://www.cybertranscriber.com/
           Automatic transcription from spoken dictation from
           Speech Machines Corporation.

END OF DRAFT