INTERNET DRAFT                     Jim Gettys, Compaq Computer Corporation
draft-gettys-webmux-00.txt              Henrik Frystyk Nielsen, W3C, M.I.T
Expires January 1, 1999                                     August 1, 1998

The WebMUX Protocol

Status of This Document

This document is an Internet-Draft. Internet-Drafts are working documents of the
Internet Engineering Task Force (IETF), its areas, and its working groups. Note
that other groups may also distribute working documents as Internet-Drafts.

Internet-Drafts are draft documents valid for a maximum of six months and may be
updated, replaced, or obsoleted by other documents at any time. It is
inappropriate to use Internet-Drafts as reference material or to cite them other
than as "work in progress."

To view the entire list of current Internet-Drafts, please check the
"1id-abstracts.txt" listing contained in the Internet-Drafts Shadow Directories
on ftp.is.co.za (Africa), ftp.nordu.net (Northern Europe), ftp.nis.garr.it
(Southern Europe), munnari.oz.au (Pacific Rim), ftp.ietf.org (US East Coast), or
ftp.isi.edu (US West Coast).

This document describes an experimental design for a multiplexing transport,
intended for, but not restricted to use with the Web. WebMUX has been
implemented as part of the HTTP/NG project. Use of this protocol is EXPERIMENTAL
at this time and the protocol may change. In particular, transition strategies
to use of WebMUX have not been definitively worked out. You have been warned!

Distribution of this document is unlimited. Please send comments to the HTTP-NG
mailing list at <www-http-ng-comments@w3.org>. Discussions are archived at
"http://lists.w3.org/Archives/Public/www-http-ng-comments/".

Please read the "HTTP-NG Short- and Longterm Goals Document" [1] for a
discussion of goals and requirements of a potential new generation of the HTTP
protocol and how we intend to evaluate these goals.

General information about the Project as well as new draft revisions, related
discussions, and background information is linked from
"http://www.w3.org/Protocols/HTTP-NG/".

Note: Since internet drafts are subject to frequent change, you are advised to
reference the Internet Draft directory. This work is part of the W3C HTTP/NG
Activity (for current status, see http://www.w3.org/Protocols/HTTP-NG/Activity).

Abstract

This document defines the experimental multiplexing protocol referred to as
"WebMUX". WebMUX is a session management protocol separating the underlying
transport from the upper level application protocols. It provides a lightweight
communication channel to the application layer by multiplexing data streams on
top of a reliable stream oriented transport. By supporting coexistence of
multiple application level protocols (e.g. HTTP and HTTP/NG), WebMUX should ease
transitions to future Web protocols and allow client applets to communicate with
servers using private protocols over the same TCP connection as the HTTP
conversation.

WebMUX is intended for, but by no means restricted to, transport of Web related
protocols; the name has been chosen to reduce confusion with other existing
multiplexing protocols.

This document is part of a suite of documents describing the HTTP-NG design and
prototype implementation:
    * HTTP-NG Short- and Longterm Goals, ID
    * HTTP-NG Architectural Model, ID
    * HTTP-NG Wire Protocol, ID
    * The Classic Web Interfaces in HTTP-NG, ID
    * Description of the HTTP-NG Testbed, ID

Changes from Previous Version
    * Changed name from SMUX to WebMUX to reduce confusion with SNMP related
      protocol.
    * Split protocol ID address space to allow an address space for servers to
      use to identify protocols outside of the control of this document.
    * Elaborated endpoint usage.
    * Prepared to meet IETF ID standards.
    * Added acknowledgements section.
    * Some reorganization of the document


        ------------------------------------------------------

Contents
    1. The WebMUX Protocol
    2. Status of This Document
    3. Abstract
          1. Changes from Previous Version
    4. Contents
    5. Introduction
          1. Goals
    6. WebMUX Protocol Operation
          1. Key Words
          2. Deadlock Scenario
          3. Deadlock Avoidance
          4. Operation and Implementation Considerations
          5. WebMUX Header
          6. Alignment
          7. Long Fragments
          8. Atoms
          9. Protocol ID's
          10. Session ID Allocation
          11. Session Establishment
          12. Graceful Release
          13. Disgraceful Release
          14. Message Boundaries
          15. Flow Control
          16. End Points
          17. Control Messages
    7. Security Considerations
    8. Remaining Issues for Discussion
    9. Comparison with SCP (TMP)
    10. Closed Issues from Discussion and Email
    11. Acknowledgements
    12. References
    13. Authors' Addresses


        ------------------------------------------------------

Introduction

The Internet is suffering from the effects of the HTTP/1.0 protocol, which was
designed without understanding of the underlying TCP [1] transport protocol.
HTTP/1.0 opens a TCP connection for each URI [28] retrieved (at a cost of both
packets and round trip times (RTTs)), and then closes the TCP connection. For
small HTTP requests, these TCP connections have poor performance due to TCP slow
start [9] [10] as well as the round trips required to open and close each TCP
connection.

There are (at least) three reasons why multiple simultaneous TCP connections
have come into widespread use on the Internet despite the apparent
inefficiencies:
    1. A client using multiple TCP connections gains a significant advantage in
      perceived performance by the end-user, as it allows for early retrieval of
      metadata (e.g. size) of embedded objects in a page. This allows a client
      to format a page sooner without suffering annoying reformatting of the
      page. Clients that open multiple TCP connections in parallel to the same
      server, however, can cause self-congestion on heavily congested links,
      since the packets generated by TCP opens and closes are not themselves
      congestion controlled.
    2. In addition to the performance problems the extra TCP opens cause in the
      network, a client that opens multiple TCP connections simultaneously to
      the same server may also receive an "unfair" bandwidth advantage in the
      network relative to clients that use a single TCP connection. This problem
      is not
      solvable at the application level; only the network itself can enforce
      such "fairness".
    3. To keep low bandwidth/high latency links busy (e.g. dialup lines), more
      than one TCP connection has been necessary since slow start may cause the
      line to be partially idle.

The "Keep-Alive" extension to HTTP/1.0 is a form of persistent TCP connections
but does not work through HTTP/1.0 proxies and does not take pipelining of
requests into account. Instead a revised version of persistent TCP connections
was introduced in HTTP/1.1 as the default mode of operation.

HTTP/1.1 [6] persistent connections and pipelining [11] will reduce network
traffic and the amount of TCP overhead caused by opening and closing TCP
connections. However, the serialized behavior of HTTP/1.1 pipelining does not
adequately support simultaneous rendering of inlined objects - part of most Web
pages today; nor does it provide suitable fairness between protocol flows, or
allow for graceful abortion of HTTP transactions without closing the TCP
connection (quite common in HTTP operation).

Persistent connections and pipelining, however, do not fully address the
rendering or the fairness problems described above. A "hack" solution is
possible using HTTP range requests; however, this approach does not, for
example, allow a server to send just the metadata contained in an embedded
object before sending the object itself, nor does it solve the TCP connection
abort problem.

Current TCP implementations do not share congestion information across multiple
simultaneous TCP connections between two peers, which increases the overhead of
opening new TCP connections. We expect that Transactional TCP [5] and sharing of
congestion information in TCP control blocks [8] will improve TCP performance by
using fewer RTTs and better congestion behavior, making it more suitable for HTTP
transactions.

The solution to these problems requires two actions; neither by itself will
entirely discourage a client from opening multiple TCP connections to the same
server.
    * Internet service providers should enable the Random Early Detection (RED)
      [12] or other active congestion control algorithms in their routers to
      ensure bandwidth fairness to clients when the network is congested. RED
      also addresses queue length problems observed in routers today.
    * Development and deployment of a multiplexing protocol for use with HTTP
      (and eventually other protocols), so that multiple objects from a web
      server can be fetched approximately simultaneously over a single TCP
      connection, and so that the metadata of an object can be sent to a client
      without waiting for the rest of the first object requested.

This document describes such an experimental multiplexing protocol. It is
designed to multiplex a TCP connection underneath HTTP so that HTTP itself does
not have to change, and to allow coexistence of multiple protocols (e.g. HTTP
and HTTP/NG), which will ease transitions to future Web protocols and allow
client applets to communicate with servers using private protocols over the
same TCP connection as the HTTP conversation.

Ideas in this design come from Simon Spero's SCP [15] [16] description and from
experience with the X Window System's protocol design [13].

Goals

We believe WebMUX meets the following goals, which we consider necessary for a
multiplexing protocol for the Web:
    * Unconfirmed service without negotiation or round trips to the server
    * simple design
    * high performance
    * deadlock-free, by a credit based flow control scheme.
    * allow multiple protocols to be multiplexed over same TCP connection
    * allow connections to be established in either direction (enabling
      callbacks to the session initiator).
    * ability to build a full function socket interface above this protocol.
    * low overhead
    * preserves alignment in the data stream, so that it is easy to use with
      protocols that marshal their data in a binary form.


        ------------------------------------------------------

WebMUX Protocol Operation

Key Words

The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD",
"SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be
interpreted as described in RFC 2119 [7].

Deadlock Scenario

Multiplexing multiple sessions over a single transport TCP connection introduces
a potential deadlock that WebMUX is designed to avoid.

Here is an example of potential deadlock:
    * Presume that each session is being handled by an independent thread and
      that memory available to the WebMUX implementation is limited (for
      example, on a thin client on a meter reader).
    * For the purposes of this example, presume the thin client has 50K bytes of
      buffer available to its WebMUX implementation, and cannot get more.
    * The sender of data decides to send, as part of a session request (SYN
      message), 100K bytes of initial data. There are no other senders, so all
      of the data gets transmitted. But the thread to deal with the message is
      blocked, and cannot make progress.
    * Unless WebMUX can buffer all 100K (or 1 meg, or pick your favorite
      numbers), any other session's data would be blocked behind this initial
      transmission until and unless WebMUX can read and buffer the data
      someplace (and since it has no buffer available, the deadlock occurs).
      Many similar (but possibly harder to explain) deadlocks are possible.

This example points out that deadlock is possible: WebMUX must be able to buffer
data independently of the consumers of the data. It must also have some way to
throttle sessions where the consumer of the data is not responsive in the
multiplexing layer (in this example, prevent the transmission of more than 50
Kbytes of data). Note that this deadlock is independent of the size of any
multiplexing fragment, but strictly dependent on availability of buffer space in
WebMUX for a particular session.

Deadlock Avoidance

In WebMUX, the receiver makes a promise (sends a credit) to the transmitter that
a certain amount of buffer space is available (or at least that it will consume
the bytes, if not buffer them, e.g. a real time audio protocol where the data is
disposed of), and the transmitter promises not to send more data than the
receiver has promised (no more than the credit). If these promises are met, then
WebMUX will not deadlock. The AddCredit control message is used to add a credit
to a session.

A WebMUX implementation MUST maintain and adhere to the credit system or it can
deadlock. Implementations on systems with large amounts of memory (e.g. VM
systems) may be quite different from ones on thin clients with limited,
non-virtual memory. It is reasonable on a VM system to hand out credits freely
(analogous to the virtual socket buffering found in TCP implementations), but an
implementation must be careful to test its credit mechanisms so that they will
interoperate with limited-memory systems. Credit control messages MAY be sent on
sessions that are not active.

Each session has an initial credit (initial_default_credit) of 16 KB; the
SetDefaultCredit control message can set this initial credit to something larger
than the default.
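
The following non-normative sketch (in C) illustrates the credit bookkeeping an
implementation might keep per session; the structure, the function names, and
the send_add_credit() helper are placeholders invented for this illustration,
not part of the protocol.

 #include <stddef.h>

 /* One bookkeeping record per session (illustrative only). */
 struct session_credit {
     size_t send_credit;    /* bytes the peer has promised to accept */
     size_t recv_window;    /* bytes we have promised to the peer    */
 };

 /* Hypothetical helper that emits an AddCredit control message. */
 extern void send_add_credit(unsigned session_id, size_t bytes);

 /* Transmitter side: a fragment MUST NOT exceed the outstanding credit.
    The caller subtracts what it actually sends from send_credit. */
 static size_t clamp_to_credit(const struct session_credit *s, size_t want)
 {
     return want < s->send_credit ? want : s->send_credit;
 }

 /* Receiver side: when the consumer frees buffer space, promise it back to
    the transmitter by granting an equal amount of new credit. */
 static void buffer_released(struct session_credit *s, unsigned id, size_t n)
 {
     s->recv_window += n;
     send_add_credit(id, n);
 }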

Operation and Implementation Considerations

A transmitter MUST NOT transmit more data in a fragment than the available
credit on the session (or it could deadlock).

A WebMUX implementation MUST break streams into fragments when transmitting
them. The fragment size can be controlled using the SetMSS control message. The
max_fragment_size, a variable maintained (currently) on a per transport TCP
connection basis, determines the largest fragment a sender should ever send to a
receiver. This determines the maximum latency introduced by a WebMUX layer above
and beyond the inherent TCP latencies (socket buffering on both sender and
receiver, and the delay-bandwidth product amount of data that could be in flight
at any given instant). A client on a low bandwidth link, or with limited memory
for buffering, might set max_fragment_size down to control latency and the
buffer space required.

If max_fragment_size is set to zero, the transmitter is left to determine the
fragment size and MAY take into account application protocol knowledge that only
it has (e.g. a WebMUX implementation for HTTP might send the metadata of
embedded objects, or the next phase of a progressive image format, as separate
fragments).
An implementation SHOULD honor the max_fragment_size as it transmits data, if it
has been set by the receiver.

A WebMUX implementation that does not have explicit knowledge or experience of
good fragment sizes might use these guidelines as a starting point (a sketch
follows below):
    * The path_MTU of the TCP connection, minus the size of the TCP and IP
      headers (remember that IPv6 may have longer headers!) and 8 bytes for a
      WebMUX header, if this information is available [3].
    * The MSS of the TCP connection, if the path_MTU is not available.
    * In either case, you probably want to subtract 8 bytes to make sure a
      WebMUX header can be added without forcing another TCP segment.

This would result in fragmentation roughly similar to TCP segmentation over
multiple TCP connections.
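
As a rough, non-normative illustration of these guidelines, the helper below
picks a starting fragment size from whatever the TCP stack reports. The 40 byte
IPv4+TCP header estimate, the 1460 byte Ethernet MSS fallback, and the function
name are assumptions of this sketch, not requirements of WebMUX.

 #include <stddef.h>

 #define WEBMUX_HDR_RESERVE 8            /* room for a WebMUX header */

 /* path_mtu and mss are whatever the TCP stack reports; pass 0 if unknown. */
 static size_t starting_fragment_size(size_t path_mtu, size_t mss)
 {
     if (path_mtu > 40 + WEBMUX_HDR_RESERVE)     /* 40 = IPv4 + TCP headers */
         return path_mtu - 40 - WEBMUX_HDR_RESERVE;
     if (mss > WEBMUX_HDR_RESERVE)
         return mss - WEBMUX_HDR_RESERVE;
     return 1460 - WEBMUX_HDR_RESERVE;           /* common Ethernet MSS */
 }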

An implementation should round robin in some fashion between sessions with data
to send, to avoid starving sessions or allowing a single thread to monopolize
the TCP connection. Exact details of such behavior are left to the
implementation. To achieve the highest bandwidth and lowest overhead WebMUX
behavior, credits should be handed out in reasonably large chunks. TCP
implementations typically send an ACK for every other segment, and it is very
hard to arrange to piggyback ACKs on data segments in implementations.
Therefore, for WebMUX to have reasonably low overhead, credits should be handed
out in some significant multiple (4 or more times) of the ~3000 bytes
represented by two packets on an Ethernet. The outstanding credit balance across
active sessions will also have to be larger than the bandwidth-delay product of
the TCP connection if WebMUX is not to become a limit on TCP transport
performance.

Both of these arguments indicate that outstanding credits in many
implementations should be 10K bytes or more. Implementations SHOULD piggyback
credit messages on data packets where possible, to avoid unneeded packets on the
wire. A careful implementation in which both ends of the TCP connection are
regularly sending some payload should be able to avoid sending extra packets on
the network.

If necessary, we could add in a future version fragmentation control messages to
do some bandwidth allocation, but for now, we are not bothering.

WebMUX Header

WebMUX headers are always in big endian byte order.
If people want, we could expand out the union below on a control message type
basis (e.g. the way the C bindings to X events were written out...). For this
draft, I'm not doing so.
 #define MUX_CONTROL       0x00800000
 #define MUX_SYN           0x00400000
 #define MUX_FIN           0x00200000
 #define MUX_RST           0x00100000
 #define MUX_PUSH          0x00080000
 #define MUX_SESSION       0xFF000000
 #define MUX_LONG_LENGTH   0x00040000
 #define MUX_LENGTH        0x0003FFFF

 typedef unsigned int flagbit;
 struct w3mux_hdr {
     union {
        struct {
            unsigned int session_id : 8;
            flagbit control : 1;
            flagbit syn : 1;
            flagbit fin : 1;
            flagbit rst : 1;
            flagbit push : 1;
            flagbit long_length : 1;
            unsigned int fragment_size : 18;
            int long_fragment_size : 32;
                 /* only present if long_length is set */
        } data_hdr;
        struct {
            unsigned int session_id : 8;
            flagbit control : 1;
            unsigned int control_code : 4;
            flagbit long_length : 1;
            unsigned int fragment_size : 18;
            int long_fragment_size : 32;
                 /* only present if long_length is set */
        } control_message;
     } contents;
 };

The fragment_size is always the size in bytes of the fragment, excluding the
WebMUX header and any padding.
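
Because C bitfield layout is compiler dependent, a portable implementation
would serialize the header explicitly. The sketch below does so using the mask
values defined above; it assumes the session_id occupies the top 8 bits (as the
MUX_SESSION mask suggests) and that the 18 bit length field is left zero when
long_length is set. Both are assumptions of this illustration, not normative
statements.

 #include <stddef.h>
 #include <stdint.h>

 /* Serialize a data header into big endian wire form; 'out' must hold at
    least 8 bytes.  Returns the number of header bytes written (4 or 8). */
 static size_t pack_data_hdr(uint8_t *out, unsigned session_id,
                             uint32_t flags, uint32_t fragment_size)
 {
     uint32_t word = ((uint32_t)session_id << 24) | flags;
     size_t len = 4;

     if (fragment_size > MUX_LENGTH) {
         word |= MUX_LONG_LENGTH;        /* real length follows in next word */
         out[4] = (uint8_t)(fragment_size >> 24);
         out[5] = (uint8_t)(fragment_size >> 16);
         out[6] = (uint8_t)(fragment_size >> 8);
         out[7] = (uint8_t)fragment_size;
         len = 8;
     } else {
         word |= fragment_size;          /* fits in the 18 bit length field */
     }
     out[0] = (uint8_t)(word >> 24);
     out[1] = (uint8_t)(word >> 16);
     out[2] = (uint8_t)(word >> 8);
     out[3] = (uint8_t)word;
     return len;
 }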

Alignment

WebMUX headers are always (at least) 32 bit aligned. To find the next WebMUX
header, take the fragment_size, and round up to the next 32 bit boundary.
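
A one-line illustration of that rule, assuming hdr_len is 4 (or 8 when
long_length is set); the function name is illustrative only.

 #include <stddef.h>

 /* Byte offset from the start of a header to the start of the next one. */
 static size_t next_header_offset(size_t hdr_len, size_t fragment_size)
 {
     return hdr_len + ((fragment_size + 3) & ~(size_t)3);
 }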

Transmitters MAY insert NoOp control messages to force 64 bit alignment of the
protocol stream.

Long Fragments

A WebMUX header with the long_length bit set MUST use the 32 bits following the
WebMUX header (the long_fragment_size field) as the value of the fragment_size
field, for whatever purpose the fragment_size field is being used.

Atoms

Atoms are integers that are used as short-hand names for strings, which are
defined using the InternAtom control message. Atoms are only used as protocol
ID's in this version of WebMUX, though they might be used for other purposes in
future versions. Since the atom might be redefined at any time, it is not safe
to use an atom unless you have defined it (i.e. you cannot use atoms defined by
the other end of a mux connection). Atoms are therefore not unique values, and
only make sense in the context of a particular direction of a particular mux
connection. This restriction is to avoid having to define some protocol for
deallocating atoms, with any round trip overhead that would likely imply.

Strings are defined to be UTF-8 encoded Unicode strings. (Note that an ASCII
string is valid UTF-8.) The definition of the structure of these strings is outside
of the scope of this document, though we expect they will often be URI's, naming
a protocol or stack of protocols. Atoms always have values between 0x20000 and
0x200ff (a maximum of 256 atoms can be defined).

Strings used for protocol id's MUST be URIs [28].

Protocol ID's

The protocol used by a session is identified by a Protocol ID, which can either
be an IANA port number or an atom. Protocol IDs serve two purposes:
    1. To allow higher layers to stack protocols (e.g. HTTP on top of deflate
      compression, on top of TCP).
    2. To identify the protocol or protocol stack in use so that application
      firewall relays can perform sanity checking and policy enforcement on the
      multiplexed protocols.

Firewall proxies can presume that the bytes on a session conform to the protocol
identified by its Protocol ID. The Protocol ID space is allocated as follows (a
classification sketch appears after this list):
    * 0-0xFFFF: IANA-registered TCP protocols [17]
    * 0x10000-0x1FFFF: IANA-registered UDP protocols [17]
    * 0x20000-0x2FFFF: per-underlying-connection-defined MUX atoms.
      The scheme name of the URI indicates the protocol family being used (e.g.
      http, ftp, etc.).
    * 0x30000-0x3FFFF: server-assigned protocol IDs.
      The assignment of these ID's is outside the scope of this protocol, and
      may pose additional security hazards.
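
The classification sketch mentioned above; the enum and function names are
invented for this illustration and carry no protocol meaning.

 /* Classify a Protocol ID by the address-space ranges listed above. */
 enum pid_kind { PID_IANA_TCP, PID_IANA_UDP, PID_ATOM, PID_SERVER, PID_BAD };

 static enum pid_kind classify_protocol_id(unsigned long id)
 {
     if (id <= 0xFFFFul)  return PID_IANA_TCP;   /* IANA TCP port numbers */
     if (id <= 0x1FFFFul) return PID_IANA_UDP;   /* IANA UDP port numbers */
     if (id <= 0x2FFFFul) return PID_ATOM;       /* per-connection atoms  */
     if (id <= 0x3FFFFul) return PID_SERVER;     /* server-assigned IDs   */
     return PID_BAD;
 }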

Session ID Allocation

Each session is allocated a session identifier. Session identifiers 0 and 1 are
reserved for future use. Session IDs allocated by the initiator of the transport
TCP connection are even; those allocated by the receiver of the transport
connection are odd. Proxies that do not understand messages on reserved session
ID's should forward them unchanged. A session identifier MUST only be
deallocated, and potentially reused by a new session, when the session is fully
closed in both directions.
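
A minimal sketch of the even/odd allocation rule; the counter-based scheme and
the function name are illustrative, and a real implementation would also track
which IDs have been fully closed in both directions before reusing them.

 /* Allocate the next session ID for this end of the connection.  IDs 0 and
    1 are reserved; the initiator uses even IDs, the acceptor odd ones. */
 static int next_session_id(unsigned *next, int is_initiator)
 {
     if (*next < 2)
         *next = is_initiator ? 2u : 3u;   /* skip the reserved IDs 0 and 1 */
     if (*next > 255)
         return -1;                        /* 8 bit ID space exhausted */
     return (int)((*next += 2) - 2);
 }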

Session Establishment

To establish a new session, the initiating end sends a SYN message, allocating a
free session number out of its address space. A session is established by
setting the SYN bit in the first message sent on that session. The session is
specified by the session_id field. The fragment_size field is interpreted as the
protocol ID of the session, as discussed above.

The receiver MUST either open the reverse path of that session (send a SYN
message), send a FIN message to indicate that the reverse path is not going to
be used further, or send an RST message to indicate an error. This
enables the initiator of a session to know when it is safe to reuse that session
ID.
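
Read literally, opening a session for an IANA-registered protocol can be as
small as a single header word. The sketch below shows one plausible encoding,
reusing the illustrative pack_data_hdr() helper from the header section; the
session number, the helper, and the choice of HTTP (port 80) are all assumptions
of this example.

 /* Open session 2 (even: allocated by the initiator) for HTTP.  The SYN bit
    is set and the 18 bit length field carries the protocol ID, which always
    fits since protocol IDs run from 0 to 0x3FFFF. */
 static size_t open_http_session(uint8_t out[8])
 {
     return pack_data_hdr(out, 2, MUX_SYN, 80 /* IANA port for HTTP */);
 }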

Graceful Release

A session is ended by sending a fragment with the FIN bit set. Each end of a
WebMUX connection may be closed independently.

WebMUX uses a half-close mechanism like TCP [1] to close data flowing in each
direction in a session. After sending a FIN fragment, the sender MUST NOT send
any more payload in that direction.

Disgraceful Release

A session may be terminated by sending a message with the RST bit set. All
pending data for that session should be discarded. "No such protocol" errors
detected by the receiver of a new session are signaled to the originator on
session creation by sending a message with the RST bit set. (Same as in TCP).

The payload of the fragment containing the RST bit contains a null terminated
string containing the URI of an error message (note that content negotiation
makes this message potentially multi-lingual), followed by a null terminated
UTF-8 string containing the reason for the reset (in case the URI is not
accessible).

Message Boundaries

A message boundary is marked by sending a message with the PUSH bit set. The
boundary is set between the last octet in this message, including that octet,
and the first byte of a subsequent message. This differs slightly from TCP, as
PUSH can be reliably used as a record mark.

Flow Control

Flow control uses the simple credit scheme described above, by means of the
AddCredit control message defined below. Fragments transmitted MUST NOT exceed
the outstanding credit for that session. The initial outstanding credit for a
session is 16 Kbytes.

End Points

One of the major design goals of WebMUX is to allow callbacks to objects in the
process that initiated the transport TCP connection without requiring additional
TCP connections (with the overhead in both machine resources and time that this
would cause, or the problems with TCP connection establishment through
firewalls).

The DefineEndpoint control message allows one to advertise that a particular URI
(or set of URI's) is reachable over the transport TCP connection.

A MUX protocol ID only identifies a MUX channel relative to a particular
"endpoint". The pair of <endpoint><protocol ID> completely identify a MUX
channel, without regard to IP address, TCP port, or other information. Endpoint
IDs are URI names for endpoints. Any endpoint may have multiple endpoint IDs. We
do not place any further restrictions on the types of URIs that are used as
endpoint IDs.

A client connecting from a MUX endpoint A to a MUX channel on a different
endpoint B may send an ID for A to B via the DefineEndpoint control message. If
a client in endpoint B then needs to connect to a MUX channel in endpoint A, it
may do so by using the existing lower-level byte stream originated from endpoint
A. A connection initiator may send multiple DefineEndpoint control messages with
different endpoint IDs for the same endpoint.

Connection initiators may wish to control the disclosure of endpoint
information, both for security purposes and for optimal application timing, and
should be given reasonable control over such disclosure.

Whether this relative URI naming can be used depends upon the scheme of the URI
[20], which defines its structure. For example, a firewall proxy might advertise
just "http:" for the proxy, claiming it can be used to contact any HTTP protocol
object anywhere, or "http://foo.com/bar/" to indicate that any object below that
point in the URI space on the server foo.com may be reached by this TCP
connection. A client might advertise that "http://myhost.com/" is available via
this transport TCP connection.

Control Messages

The control bit of the WebMUX header is always set in a control message. Control
messages can be sent on any session, even sessions that are not (yet) open. The
control_code reuses the SYN, FIN, RST, and PUSH bits of the WebMUX header. The
control_code of the control message determines the control message type. Any
unused data in a control message must be ignored.

In the revised version of WebMUX, session creation costs 4 bytes (a message with
the SYN bit set and with the protocol ID in the message). Therefore the first
fragment of payload has a total overhead of 8 bytes (presuming an IANA based
protocol ID rather than a named protocol). This is the same as the previous
version, though it means two messages rather than one.

The individual control message types are listed below (code Name direction;
description):
   0 InternAtom Both
      The session_id is used as the atom to be defined (offset by 0x20000, so a
      value of 0 defines atom 0x20000). The fragment_size field is the length
      of the UTF-8 encoded string. The fragment itself contains the string to be
      interned. This allows the interning of 256 strings. (Is this enough?)
   1 DefineEndpoint Both
      The session_id is ignored. The fragment_size is interpreted as the
      protocol ID, naming an endpoint actually available on this transport TCP
      connection. This enables a single transport TCP connection to be used for
      callbacks, or to advertise that a protocol endpoint can be reached by the
      process on the other end of the transport TCP connection.
   2 SetMSS Both
      This sets a limit on fragment sizes below the outstanding credit limit.
      The session_id must be zero. The fragment_size field is used as
      max_fragment_size (the largest fragment that may be sent on any session on
      this transport TCP connection). A max_fragment_size of zero means there
      is no limit on the fragment size allowed for this session.
   3 AddCredit R->T
      The session_id specifies the session. The fragment_size specifies the flow
      control credit granted (to be added to the current outstanding credit
      balance). A value of zero indicates no limit on how much data may be sent
      on this session.
   4 SetDefaultCredit R->T
      The session_id must be zero. The fragment_size field is used to set the
      initial default credit limit for any incoming WebMUX sessions over this
      transport TCP connection (i.e. it is shorthand for sending a series of
      AddCredit messages for each session ID).
   5 NoOp Both
      This control message is defined to perform no function. Any data in the
      payload should be ignored.
   6-15 - Undefined.
      Reserved for future use. Must be ignored if not understood, and forwarded
      by any proxies. The fragment_size is always used for the length of the
      control message, and any data for the control message will be in the
      payload of the control message (to allow proxies to be able to forward
      future control messages).
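
As a non-normative illustration of a control message encoding, the sketch below
packs an AddCredit message. It assumes the 4 bit control_code occupies the
SYN..PUSH bit positions with its least significant bit at the PUSH position (bit
19); the draft does not pin down that sub-field ordering, so the shift value is
an assumption of this sketch. Credits larger than 0x3FFFF would need the
long_length form.

 #include <stdint.h>

 #define MUX_CODE_SHIFT      19
 #define MUX_CODE_ADDCREDIT  3u

 /* Encode an AddCredit control message as a single big endian header word. */
 static void pack_add_credit(uint8_t out[4], unsigned session_id,
                             uint32_t credit)
 {
     uint32_t word = ((uint32_t)session_id << 24)
                   | MUX_CONTROL
                   | (MUX_CODE_ADDCREDIT << MUX_CODE_SHIFT)
                   | (credit & MUX_LENGTH);

     out[0] = (uint8_t)(word >> 24);
     out[1] = (uint8_t)(word >> 16);
     out[2] = (uint8_t)(word >> 8);
     out[3] = (uint8_t)word;
 }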


        ------------------------------------------------------

Security Considerations

Advertising endpoints inappropriately might allow a client to connect to
services that should be protected.

Using the protocol ID range 0x30000-0x3FFFF for server-assigned protocol IDs may
prevent a firewall proxy from having enough information to safely proxy
protocols of those types. Firewall proxy implementers should not blindly forward
protocols of this range.

Firewall proxies implementing WebMUX should enforce appropriate policies for
protocols being multiplexed over WebMUX, in a fashion similar to the policies
imposed for native protocols.

Clearly, any security consideration for a protocol is likely to still apply to
its use when being multiplexed via WebMUX.


        ------------------------------------------------------

Remaining Issues for Discussion

When can WebMUX be used???
    * What are the appropriate strategies for determining if the WebMUX protocol
      can be used?
    * Name server hack?
    * UPGRADE in HTTP?
    * Remember that previous UPGRADE to use WebMUX worked?
    * Should there be a more compact open message?


        ------------------------------------------------------

Comparison with SCP (TMP)

Note that TIP (Transaction Internet Protocol) [21] defines a version of SCP
called TMP.

Goals:
    * Unconfirmed service without negotiation.
    * SCP allows data to be sent with the session establishment; the recipient
      does not confirm successful mux connection establishment, but may reject
      unsuccessful attempts. This simplifies the design of the protocol, and
      removes the latency required for a confirmed operation.
    * simple design
    * performance where critical

There are several issues that make SCP (TMP) inadequate for our use:
    * SCP can deadlock, unless unlimited amounts of memory are available.
    * It has no provision for multiplexing multiple protocols over the same
      transport TCP connection, essential for graceful transition without
      dependency on the currently incomplete NG design, and to allow other uses
      which could use the same multiplexed connection (e.g. applet communication
      with serverlets).
    * SCP's 8 byte overhead is not reasonable most of the time. WebMUX uses four
      bytes in the default case. The design above permits an 8 byte header if
      you care to preserve 64 bit alignment at the cost of bytes. In practice,
      there seem to be few data formats or architectures that actually require
      more than 32 bit alignment.
    * Without some form of flow control, infinite buffering in clients
      (receivers) would be required.
    * Alignment is preserved in the data stream. This allows compact, high speed
      (un)marshalling code in implementations of binary protocols, without extra
      data copies, which in such protocols can be significant overhead.
    * SCP SYN in Version 2 requires a second message, which costs a round trip.

So far, WebMUX is similar to SCP. There are some important differences:
    * deadlock-free (we believe), by a credit based flow control scheme.
    * allow multiple protocols to be multiplexed over same TCP connection (not
      available in SCP).
    * lower overhead than SCP, while preserving data alignment (very important
      for binary protocol marshaling code)
    * ability to build a full function socket interface above this protocol.
    * WebMUX avoids the SYN round trip of SCP V2 by session ID's being allocated
      in independent address spaces. This also avoids many of the state
      transitions of SCP, simplifying the protocol greatly.
    * SCP has 2^24 sessions, which seems highly excessive, and reserves 1024 of
      them for future use.


        ------------------------------------------------------

Closed Issues from Discussion and Mail

Some of the comments below allude to previous versions of the specification and
may not make sense in the context of the current version. This section will
likely be eliminated in future versions, but may answer some questions that
arise when reading this document.

Flow control: priority vs. credit schemes

Henrik and I have convinced ourselves there are fundamental differences between
a priority scheme and the credit scheme in this draft. They interact quite
differently with TCP, and priority schemes have no way to limit the total amount
of data being transmitted, though priority schemes are better matched to what
the Web wants. We've decided, at least for now, to defer any priority schemes to
higher level protocols.

Stacking Protocols and Transports (Stacks)

ILU [22] style protocol stacks are a GOOD THING. There have been too many
worries about the birthday problem for people to be comfortable with Bill
Janssen's hashing schemes (see Henrik Frystyk Nielsen and Robert Thau's mail on
this topic). We tried putting this directly in WebMUX in a previous version, and
experience shows that it didn't really help an implementer (in particular, Bill
Janssen while implementing ILU). This version has just the name of the protocol,
and it is left to others to implement any stacking (e.g. ILU).

We believe the name of the protocol is necessary, if WebMUX is ever to be used
with firewalls. Application level firewall relays need the protocol information
to sanity check the protocol being relayed. Application level relays are
considered much more secure than just punching holes in the firewall for
particular protocol families, which small organizations often find sufficient,
as the relay can sanity check the protocol stream and enable better policy
decisions (for example, to forbid certain datatypes in HTTP to transit a
firewall). Large organizations and large targets typically only run application
level proxies.

Byte Usage

Wasting bytes in general, and in particular at TCP connection establishment, for
a multiplexing transport must be avoided. There are several reasons for this:
    * if the initial segment is too long, a network round trip will be lost to
      TCP slow start, so bytes near the beginning of a conversation MAY BE much
      more precious than bytes later in the conversation, once slow start
      overhead has been paid. If the first segment is too long, you fall off a
      cliff.
    * Directly affects user perceived response; no cleverness of later packing
      and batching of requests can get the time back; each goes directly to
      perceived latency when a user talks to the server for the first time.

So there is more than the usual tension between generality vs. performance.

Performance analysis

The threshold of human perception is about 30 milliseconds; if a delay is much
longer than this, the user perceives it. At 14.4 Kbaud, one byte uncompressed
costs .55 milliseconds (ignoring modem latencies). On an airplane via telephone
today, you get a munificent 4800 baud, which is 3X slower. Cellular modems
transmitting data (CDPD), as I understand it, will give us around 20 Kbaud, when
deployed.

So basic multiplexing @ 4 byte overhead costs ~ 2 milliseconds on common modems.
This means basic overhead is small vs. human perception, for most low speed
situations, a good position to be in.

On WebMUX connection open, with the above protocol we send 4 bytes in the setup
message, and then must open a session, requiring at least 8 bytes more. 12 bytes
== 7 milliseconds at 14.4K. That is not 64 bit aligned, and 4 more bytes cost on
the order of 2 milliseconds. Ugh... Maybe a setup message isn't a good idea;
other uses (e.g. security) can be dealt with by a control message.

Multiple protocols over one WebMUX

We want to WebMUX multiple protocols simultaneously over the same transport TCP
connection (e.g. SUNRPC and DCE RPC simultaneously), so we need to know what
protocol is in use with each session, so the demultiplexor can hand the data to
the right consumer.

There are two obvious ways I can see to do this:
   a) Send a control message when a session is first used, indicating the
   protocol.
      Disadvantage: costs probably 8 bytes to do so (4 WebMUX overhead, and 4
      byte message), and destroys potential 64 bit alignment.
   b) If syn is set indicating new session, then steal mux_length field to
   indicate protocol in use on that session.
      (overhead; 4 bytes for the WebMUX header used just to establish the
      session.)

Opinions? Mine is that b) is better than a). Answer: b) is the adopted strategy.

Priority...

For a given stream, priority will affect which session is handled when
multiplexing data; sending the priority on every block is unneeded, and would
waste bytes. There is one case in which priority might be useful: at an
intermediate proxy relaying sessions (and maybe remultiplexing them).

If so, it should be sent only when sessions are established or changed. Changes
can be handled by a control message. Opinions?

A priority field can be hacked into the length field with the protocol field
using b) above.

So the question is: is it important to send priority at all in this WebMUX
protocol? Or should priority control, if needed, be a control message?

Answer: Not in this protocol; if needed, priority belongs in a control message.
Priority here opens Pandora's box with remultiplexors, which could enable denial
of service attacks.

Setup message

Is any setup message needed? I don't think it is, and initial bytes are
precious (see the performance discussion above), and it complicates trivial use. If
we move the byte order flag to the WebMUX header, and use control messages if
other information needs to be sent, we can dispense with it, and the layer is
simpler. This is my current position, and unless someone objects with reasons,
I'll nuke it in the next version of this document.

Answer: Not needed. Nuked.

Byte order flags

While higher layer protocols using host dependent byte order can be a
performance win (when sending larger objects such as arrays of data), the
overhead at this layer isn't much, and may not be worth bothering with. Worst
case (naive code) would be four memory reads and 3 shifts per overhead/payload
word. Smart code is one load and appropriate shifts, etc.

Opinions? I'm still leaning toward swapping bytes here, but there are other
examples of byte load and shift (particularly slow on Alpha, but not much of an
issue on other systems).

Answer: Not sufficient performance gain at WebMUX level to be worth doing.
Defined as LE byte order for WebMUX headers.

Error handling

There are several error conditions, probably best reported via control messages
from the server:
    * No such protocol. Some sort of serial number should be reported, I
      suppose; this serial number can be implicit as in X
    * bad message.
    * Some combinations of flag bits are not legal.
    * Priority if it exists?

Any others? Any twists to worry about?

Answer: The only error that can occur is "no such protocol", given no priority
in the base protocol. There may still be some unresolved issues here around
"Christmas Tree" messages (all bits turned on).

Length Field

Any reason to believe that the 32 bit length field for a single payload is
inadequate? I don't think so, and I live on an Alpha.

Answer: 32 bit extended length field for a single fragment is sufficient.

Compression

Does there need to be a bit saying the payload is compressed to avoid explosion
of protocol types?

Answer: Yes; introduction of control message to allow specification of transport
stacks achieves this.

Stacks

I think that we should be able to multiplex any TCP, UDP, or IP protocol.
Internet protocol numbers are 8 bit fields.

So we need 16 bits for TCP, one bit to distinguish TCP and UDP, and one bit more
we can use for IP protocol numbers and address space we can allocate privately.
This argues for an 18 bit length field to allow for this reuse. The resulting
header layout is:
    * 18 bit length field
    * 8 bit session field
    * 4 control bits
    * 1 long length bit

The last bit is used to define control messages, which reuse the syn, fin, rst,
and push bits as a control_code to define the control message. There are
escapes, both by undefined control codes, and by the reservation of two sessions
for further use if there needs to be further extensions. The spec above reflects
this.

Alignment

Back to alignment. If we demand 4 byte alignment, for all requests that do not
end up naturally aligned, we waste bytes. Two bytes are wasted on average. At
14.4 Kbaud the overhead for protocols that do not pad up would on average be 6
bytes or ~3 ms, rather than 4 bytes or ~2 ms (presuming even distributions of
length). Note that this DOES NOT affect initial request latency (time to get the
first URL), and is therefore less critical than elsewhere.

I have one related worry; it can sometimes be painful to get padding bytes at
the end of a buffer; I've heard of people losing by having data right up to the
end of a page, so implementations are living slightly dangerously if they
presume they can send the padding bytes by sending the 1, 2 or 3 bytes after the
buffer (rather than an independent write to the OS for padding bytes).

Alternatively, the buffer alignment requirement can be satisfied by
implementations remembering how many pad bytes have to be sent, and adjusting
the beginning address of the subsequent write by that many bytes before the
buffer where the WebMUX header has been put. Am I being unnecessarily paranoid?

Opinion: I believe alignment of fragments in general is a GOOD THING, and will
simplify both the WebMUX transport and protocols at higher levels if they can
make this presumption in their implementations. So I believe this overhead is
worth the cost; if you want to do better and save these bytes, then start
building an application specific compression scheme. If not, please make your
case.

Control bits

Are the four bits defined in Simon's flags field what we need? Are there any
others?

Answer: no. More bits than we need. Current protocol doesn't use as many. I've
ended back at the original bits specified, rather than the smaller set suggested
by Bill Janssen. This enables full emulation of all the details of a socket
interface, which would not otherwise be possible. See details around TCP and
socket handling, discussed in books like "TCP/IP Illustrated," by W. Richard
Stevens.

Am I all wet?

Opinion: I believe that we should do this.

Control Messages

Question: do we want/need a short control message? Right now, the "out" for
extensibility is control messages sent in the reserved (and as yet unspecified)
control session. This requires a minimum of 8 bytes on the wire. We could steal
the last available bit, and allow for a 4 byte short control message that would
have 18 bits of payload.

Opinion: Flow control needs it; protocol/transport stacks need it. Document
above now defines some control messages.

Simplicity of default Behavior

The above specification allows for someone who just wants to WebMUX a single
protocol to entirely ignore protocol ID's.


        ------------------------------------------------------

Acknowledgements

Contributors include (at least): Bill Janssen, Mike Spreitzer, Robert Thau,
Larry Masinter, Paul Leach, Paul Bennett, Rich Salz, Simon Spero, Mark Handley,
Anselm Baird-Smith, and Wan-Teh Chang. Our apologies to anyone we've missed.


        ------------------------------------------------------

References
    1. J. Postel, "Transmission Control Protocol", RFC 793, Network Information
      Center, SRI International, September 1981
    2. J. Postel, "TCP and IP bake off", RFC 1025, September 1987
    3. J. Mogul, S. Deering, "Path MTU Discovery", RFC 1191, DECWRL, Stanford
      University, November 1990
    4. T. Berners-Lee, "Universal Resource Identifiers in WWW. A Unifying Syntax
      for the Expression of Names and Addresses of Objects on the Network as
      used in the World-Wide Web", RFC 1630, CERN, June 1994.
    5. R. Braden, "T/TCP -- TCP Extensions for Transactions: Functional
      Specification", RFC 1644, USC/ISI, July 1994
    4. R. Fielding, "Relative Uniform Resource Locators", RFC 1808, UC Irvine,
      June 1995.
    5. T. Berners-Lee, R. Fielding, H. Frystyk, "Hypertext Transfer Protocol --
      HTTP/1.0", RFC 1945, W3C/MIT, UC Irvine, W3C/MIT, May 1996
    6. R. Fielding, J. Gettys, J. C. Mogul, H. Frystyk, T. Berners-Lee,
      "Hypertext Transfer Protocol -- HTTP/1.1", RFC 2068, U.C. Irvine, DEC
      W3C/MIT, DEC, W3C/MIT, W3C/MIT, January 1997
    7. S. Bradner, "Key words for use in RFCs to Indicate Requirement Levels",
      RFC 2119, Harvard University, March 1997
    8. J. Touch, "TCP Control Block Interdependence", RFC 2140, April 1997
    9. W. Stevens, "TCP Slow Start, Congestion Avoidance, Fast Retransmit, and
      Fast Recovery Algorithms", RFC 2001, January 1997
    10. V. Jacobson, "Congestion Avoidance and Control", Proceedings of SIGCOMM
      '88
    11. H. Frystyk Nielsen, J. Gettys, A. Baird-Smith, E. Prud'hommeaux, H. W.
      Lie, and C. Lilley, "Network Performance Effects of HTTP/1.1, CSS1, and
      PNG", Proceedings of SIGCOMM '97
    12. S. Floyd and V. Jacobson, "Random Early Detection Gateways for
      Congestion Avoidance", IEEE/ACM Trans. on Networking, vol. 1, no. 4, Aug.
      1993.
    13. R.W.Scheifler, J. Gettys, "The X Window System" ACM Transactions on
      Graphics # 63, Special Issue on User Interface Software, 5(2):79-109
      (1986).
    14. V. Paxson, "Growth Trends in Wide-Area TCP Connections" IEEE Network,
      Vol. 8 No. 4, pp. 8-17, July 1994
    15. S. Spero, "Session Control Protocol, Version 1.0"
    16. S. Spero, "Session Control Protocol, Version 2.0"
    17. Keywords and Port numbers are maintained by IANA in the port-numbers
      registry.
    18. Keywords and Protocol numbers are maintained by IANA in the
      protocol-numbers registry.
    19. W. Richard Stevens, "TCP/IP Illustrated, Volume 1", Addison-Wesley, 1994
    20. Berners-Lee, T., Fielding, R., Masinter, L., "Uniform Resource
      Identifiers (URI): Generic Syntax and Semantics," Work in Progress of the
      IETF, November, 1997.
    21. J. Lyon, K. Evans, J. Klein, "Transaction Internet Protocol Version
      2.0," Work in Progress of the Transaction Internet Protocol Working Group,
      November, 1997.
    22. B. Janssen, M. Spreitzer, "Inter-Language Unification"; in particular
      see the manual section on Protocols and Transports.


        ------------------------------------------------------

Authors' Addresses
    * James Gettys
      MIT Laboratory for Computer Science
      545 Technology Square
      Cambridge, MA 02139, USA
      Fax: 1 (617) 258 8682
      Email: jg@pa.dec.com
    * Henrik Frystyk Nielsen
      W3C/MIT Laboratory for Computer Science
      545 Technology Square
      Cambridge, MA 02139, USA
      Fax: +1 (617) 258-8682
      Email: frystyk@w3.org

