Network Working Group                  Richard Price, Siemens/Roke Manor
INTERNET-DRAFT                           Jonathan Rosenberg, dynamicsoft
Expires: May 2002                    Abigail Surtees, Siemens/Roke Manor
                                         Mark A West, Siemens/Roke Manor
                                     Lawrence Conroy, Siemens/Roke Manor

                                                       14 November, 2001

                     Universal Decompression Algorithm
                <draft-ietf-rohc-sigcomp-algorithm-00.txt>

Status of this Memo

   This document is an Internet-Draft and is in full conformance with
   all provisions of Section 10 of [RFC-2026].

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF), its areas, and its working groups. Note that other
   groups may also distribute working documents as Internet-Drafts.

   Internet-Drafts are draft documents valid for a maximum of six months
   and may be updated, replaced, or obsoleted by other documents at any
   time. It is inappropriate to use Internet-Drafts as reference
   material or to cite them other than as "work in progress."

   The list of current Internet-Drafts can be accessed at
   http://www.ietf.org/ietf/1id-abstracts.txt

   The list of Internet-Draft Shadow Directories can be accessed at
   http://www.ietf.org/shadow.html.

   This document is a submission to the IETF ROHC WG. Comments should be
   directed to the mailing list of ROHC, rohc@cdt.luth.se.

Abstract

   This specification defines a "universal decompressor" capable of
   interoperating with a wide range of compression algorithms. Using the
   basic techniques of Huffman compression and LZ77-style string
   substitution, the decompressor can be configured to understand the
   output of many well-known compressors including [DEFLATE] and [LZW].














Price et al.                                                  [PAGE 1]


INTERNET-DRAFT          Universal Decompressor       14 November, 2001

Revision history

   Changes from <draft-ietf-rohc-sigcomp-01.txt>

     New COPY-LITERAL and COPY-OFFSET tokens added to reduce complexity
     for LZ77-based algorithms.

     COMPARE token modified to allow it to be used without a SWITCH
     token.

     HUFFMAN token modified to require fewer parameters.

     Support added for literal parameters (see Section 4.3).

     Example token sets added for decompression of LZ77, [DEFLATE] and
     [LZW] compressed messages.


Table of contents

   1.  Introduction.................................................3
   2.  Terminology..................................................3
   3.  Requirements on underlying transport protocol................3
   4.  Description of the universal decompressor....................4
   4.1.  Structure of universal decompressor dictionary.............5
   4.2.  Important entries in the byte buffer.......................5
   4.3.  Token parameters...........................................5
   4.4.  Decompressor actions upon receiving a compressed message...6
   4.5.  Decompression failure......................................7
   5.  Library of tokens............................................8
   5.1.  COPY.......................................................9
   5.2.  ADD / SUBTRACT.............................................10
   5.3.  LSHIFT / RSHIFT............................................10
   5.4.  COMPARE....................................................11
   5.5.  SWITCH.....................................................11
   5.6.  CALL / RETURN..............................................11
   5.7.  HUFFMAN....................................................12
   6.  Security considerations......................................14
   7.  References...................................................15
   8.  Authors' addresses...........................................15
   A.  Example sets of tokens.......................................16
   A.1.  Mnemonic language..........................................16
   A.2.  Example token set for simple LZ77 decompression............17
   A.3.  Example token set for DEFLATE decompression................17
   A.4.  Example token set for LZW decompression....................20










Price et al.                                                  [PAGE 2]


INTERNET-DRAFT          Universal Decompressor       14 November, 2001

1.  Introduction

   This draft introduces the concept of a "universal decompressor".

   The goal of the document is to standardize a decompressor capable of
   interoperating with a wide range of compression algorithms.
   Consequently this draft describes the decompressor operation only,
   i.e. the actions which the decompressor takes upon receiving a
   certain instruction from the compressor.

2.  Terminology

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
   "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
   document are to be interpreted as described in [RFC-2119].

   Byte buffer

     The universal decompressor maintains a byte buffer containing any
     previously received text strings that might be useful for future
     compression.

   Token

     A token is an instruction transmitted from the compressor to the
     decompressor.


3.  Requirements on underlying transport protocol

   The universal decompressor takes as its input a sequence of
   compressed messages, which are processed and then outputted as a
   sequence of uncompressed messages. This chapter lists the
   requirements on the transport protocol used to carry compressed
   messages to the universal decompressor.

   Note that the universal decompressor outputs an uncompressed message
   when it encounters an explicit end-of-message character.
   Consequently there is no need for one uncompressed message to
   correspond to exactly one compressed message. Two or more compressed
   messages can be sent to reconstruct a single uncompressed message,
   which is very useful for segmenting compressed messages that are
   larger than the MTU of the transport protocol.

   Conversely, one compressed message can be sent to reconstruct
   several uncompressed messages. In particular, messages can be
   successfully decompressed even when the transport protocol provides
   data as a byte stream with no framing.

   Note however that the universal decompressor can make use of
   previously received messages to improve the overall compression
   ratio (see [V-J] for an example of a compression algorithm which
   uses previously received messages in this manner).


Price et al.                                                  [PAGE 3]


INTERNET-DRAFT          Universal Decompressor       14 November, 2001

   Consequently, the main requirement on the underlying protocol is
   that it MUST ensure that the contents of the decompressor byte
   buffer are as expected by the next message to be decompressed.

   This requirement can be supported in a number of ways. If the
   underlying transport protocol is reliable (for example TCP or
   [SCTP]) then the compressed messages are provided to the universal
   decompressor in the correct order and free from bit errors. In this
   case the byte buffer is automatically updated correctly between
   messages.

   If the underlying transport protocol is unreliable (for example UDP)
   then the byte buffer MUST be reset to a known state between
   compressed messages. This ensures that messages are not compressed
   relative to text that has not yet been received by the universal
   decompressor.

   The "known state" of the byte buffer can be negotiated a-priori: for
   example a set of tokens can be downloaded from the compressor to the
   decompressor when the universal decompressor is being set up. All
   messages are then compressed relative to these tokens. Alternatively
   the transport protocol can indicate to the decompressor how the byte
   buffer should be set up before each message is decompressed.

4.  Description of the universal decompressor

   An important feature of the universal decompressor is that it can
   interoperate with a wide range of compression algorithms. The
   precise method for compression is left as an implementation
   decision, and in fact the standard decompressor can interoperate
   with any of the following classes of algorithm:

   *    Generic text compressor (for example [DEFLATE] or a similar
        algorithm).

   *    Protocol-aware compressor offering excellent performance for
        one type of text message (for example the text messages
        generated by [SIP]).

   *    Hybrid compressor with similar performance to [DEFLATE] for
        generic text strings and superior performance for certain types
        of text message.

   The choice of which tokens to send to the decompressor is left as a
   local implementation decision at the compressor. The only
   requirement is that of transparency, i.e. the compressor MUST NOT
   send tokens which cause the decompressor to incorrectly decompress a
   given message.

   Note however that it is perfectly acceptable for the compressor to
   send tokens which update the byte buffer at the decompressor, but
   which cause no decompressed message to be outputted. Indeed, this is
   a useful technique for pre-populating the dictionary with well-known
   text strings.

Price et al.                                                  [PAGE 4]


INTERNET-DRAFT          Universal Decompressor       14 November, 2001

4.1.  Structure of universal decompressor dictionary

   The universal decompressor dictionary consists of a simple byte
   buffer designed to hold the current uncompressed message, the
   current compressed message, and any other previously received text
   strings that might be useful for future compression.

   The size buffer_size of the byte buffer can be negotiated by an
   externally defined mechanism (e.g. by the underlying protocol used
   to transport the compressed messages). Entries in the byte buffer
   are referred to as buffer[n] where 0 =< n < buffer_size.

   As all of the tokens currently use 2-byte indices into the byte
   buffer, the maximum size of the buffer is 64K.

4.2.  Important entries in the byte buffer

   The first few bytes in the byte buffer are used to store some
   important 2-byte integers. These integers are given the following
   names:

   Position in buffer:          Name:

        0 - 1                   first_token
        2 - 3                   uncompressed_start
        4 - 5                   uncompressed_end
        6 - 7                   circular_buffer
        8 - 9                   compressed_start
        10 - 11                 compressed_end
        12 - 13                 compressed_pointer
        14 - 15                 stack_free
        16 - 17                 stack[0]
        18 - 19                 stack[1]
        20 - 21                 stack[2]
           :                        :

   The MSBs of the integer are always stored before the LSBs. So, for
   example, the MSBs of first_token are stored in buffer[0] whilst the
   LSBs are stored in buffer[1].

   The use of each integer is described in the following sections of
   the draft.

4.3.  Token parameters

   Each of the tokens is followed by 0 or more bytes containing the
   parameters required by the token. At present all parameters are
   stored as 2-byte integers with MSBs stored in the byte preceding the
   LSBs in the byte buffer.

   The most significant bit of the 2-byte integer has a special
   meaning: it is used to determine whether the parameter is a literal
   value or an index pointing to a literal value.


Price et al.                                                  [PAGE 5]


INTERNET-DRAFT          Universal Decompressor       14 November, 2001

   If the most significant bit is 0 then the 2-byte integer is
   interpreted as a literal value from 0 to 32767.

   If the most significant bit is 1 then the 2-byte integer is
   interpreted as an index from 0 to 32767 (the most significant bit
   itself is ignored). This index points to the location in the buffer
   containing the literal value of the parameter (MSBs stored in
   buffer[index] and LSBs stored in buffer[index + 1]). If the index
   references a location beyond the size of the byte buffer then a bad
   compressed message has been received and decompression failure
   occurs (see Section 4.5.).

4.4.  Decompressor actions upon receiving a compressed message

   When the universal decompressor is initialized all entries in the
   byte buffer are set to 0. Upon receiving a compressed message, the
   decompressor strips off the underlying protocol header and then
   performs the following actions:

   1.)   The message is copied directly into the byte buffer beginning
   at the byte specified in compressed_start.

   The underlying protocol MUST NOT pass a compressed message of more
   than 1460 bytes to the universal decompressor. If a larger
   compressed message is received, the underlying protocol passes only
   the first 1460 bytes to the decompressor, and provides additional
   working memory to store the bytes that are not currently being
   decompressed. The remainder of the compressed message is passed to
   the decompressor in blocks of 1460 bytes (or less if it is the last
   block in the compressed message).

   The decompressor MUST NOT concatenate two messages to form a single
   compressed message. This is because compressed messages are
   typically padded with trailing zero bits so that they are a whole
   number of bytes long. Concatenating two messages would cause these
   padding bits to be incorrectly interpreted as compressed data.

   Note that the buffer is circular, so once a byte is copied into
   buffer[buffer_size - 1], the next byte is copied into
   buffer[circular_buffer]. The parameter circular_buffer (see Section
   4.2) can be set to prevent the first part of the buffer from being
   overwritten by new messages. Typically this area of the buffer is
   used to hold important tokens and text strings that should be kept
   from one compressed message to the next. If circular_buffer lies
   beyond the size of the byte buffer then decompression failure occurs
   (see Section 4.5).

   After the message has been copied into the byte buffer, the position
   of the last byte in the compressed message is copied into
   compressed_end. Also, the value in compressed_start is copied into
   compressed_pointer.

   2.)   Next, the tokens contained within the byte buffer are executed
   beginning at the byte specified in first_token. The tokens are

Price et al.                                                  [PAGE 6]


INTERNET-DRAFT          Universal Decompressor       14 November, 2001

   executed consecutively unless indicated explicitly (for example when
   the decompressor encounters a SWITCH token). If the next token to be
   executed lies outside the byte buffer then decompression failure
   occurs (see Section 4.5).

   3.)   The decompressor stops token execution when it reaches
   buffer[0] or buffer[1]. Depending on which buffer entry is reached,
   the following actions are then taken:

   If the decompressor reaches buffer[0], instead of executing the
   token contained within buffer[0] it stops token execution and
   outputs the uncompressed message. The location of the uncompressed
   message is from uncompressed_start up to, but not including
   uncompressed_end. After the uncompressed message has been outputted,
   the value in uncompressed_end is then copied into
   uncompressed_start. If uncompressed_start = uncompressed_end then no
   uncompressed message is outputted.

   As stated before the buffer is circular, so once a byte is copied
   from buffer[buffer_size - 1], the next byte is copied from
   buffer[circular_buffer]. If either uncompressed_start or
   uncompressed_end lie outside the circular buffer then decompression
   failure occurs.

   If buffer[1] is reached then token execution stops but no
   uncompressed message is outputted.

   When the next compressed message becomes available, the universal
   decompressor continues at Step 1.) above. Note that if the
   underlying transport protocol does not provide reliable, in-order
   message delivery then the contents of the byte buffer MUST be reset
   to the state expected by the next compressed message. This state can
   be negotiated a-priori, or the transport protocol can indicate to
   the decompressor how the byte buffer should be set up before the
   next message is decompressed.

4.5.  Decompression failure

   If the compressed messages received by the decompressor are
   corrupted (either accidentally or maliciously) then one of three
   possibilities might occur:

   *    A decompressed message is outputted that is incorrect.

   *    A token is encountered that cannot be processed successfully by
        the decompressor (for example a RETURN token when no CALL token
        has previously been encountered).

   *    The decompressor never finishes decompressing a message.

   To counter the first possibility the underlying protocol SHOULD
   include a checksum to verify either the compressed message or the
   uncompressed message. If a message fails the checksum then
   "decompression failure" has occurred. The decompressor does not

Price et al.                                                  [PAGE 7]


INTERNET-DRAFT          Universal Decompressor       14 November, 2001

   output an uncompressed message, and ignores any future compressed
   message until the byte buffer is reset.

   If a token is encountered that cannot be successfully processed then
   decompression failure occurs automatically.

   To counter the third possibility, decompression failure SHOULD also
   occur after a certain number of tokens have been processed for a
   given compressed message. The maximum number of tokens to process is
   currently left as an implementation decision (but might in future be
   negotiated).

5.  Library of tokens

   The universal decompressor currently understands twelve types of
   token, chosen to support the widest possible range of compression
   algorithms with the minimum possible overhead.

   All tokens are stored as a single byte to indicate the token type,
   followed by 0 or more bytes containing the parameters required by
   the token. At present all parameters are 2-byte integers with MSBs
   stored before LSBs. For example, the COPY token is followed by three
   parameters as shown below:

   COPY (position, length, destination)

   In the byte buffer a COPY token is stored as the following 7 bytes:

   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |0 0 0 0 0 0 0 0| Position MSB  | Position LSB  |  Length MSB   |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |  Length LSB   |Destination MSB|Destination LSB|
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

   Twelve token types are currently available to the universal
   decompressor. The following table lists the different token types
   and the byte values used to transmit the tokens to the decompressor:

   Token type:                  Corresponding byte value:

   COPY                                     0
   COPY-LITERAL                             1
   COPY-OFFSET                              2
   ADD                                      3
   SUBTRACT                                 4
   LSHIFT                                   5
   RSHIFT                                   6
   COMPARE                                  7
   SWITCH                                   8
   CALL                                     9
   RETURN                                   10
   HUFFMAN                                  11

Price et al.                                                  [PAGE 8]


INTERNET-DRAFT          Universal Decompressor       14 November, 2001

   Each token is explained in more detail below:

5.1.  COPY

   The COPY token instructs the decompressor to copy a string of bytes
   from one part of the byte buffer to another.

   A COPY token is stored in the byte buffer as 7 consecutive bytes as
   follows:

   COPY (position, length, destination)

   As with all tokens currently defined, the COPY token translates into
   a single byte for the token type followed by 2 bytes for each of the
   parameters.

   The meaning of the three parameters is explained below:

   Position:    2-byte integer indicating the location of the first
                byte in the string to be copied.

   Length:      2-byte integer indicating the number of bytes to be
                copied.

   Destination: 2-byte integer indicating the location to which the
                first byte in the string will be copied.

   Note that the copying function is performed on a byte-by-byte basis,
   with the position parameter indicating the first byte to be copied.
   In particular, some of the later bytes to be copied may themselves
   have been written into the byte buffer by the COPY token currently
   being executed.

   Equally, it is possible for a COPY token to overwrite itself or its
   parameters. If this occurs then the COPY token MUST continue to
   execute as if the parameters were still in place in the byte buffer.

   If the source or destination of the next byte to be copied is larger
   than (buffer_size - 1) then sufficient multiples of (buffer_size -
   circular_buffer) are subtracted until it is not. So for example,
   once a byte is copied into buffer[buffer_size - 1] the next byte is
   copied into buffer[circular_buffer]. If circular_buffer equals or
   exceeds buffer_size then a bad compressed message has been received
   and decompression failure occurs (see Section 4.5).

   A modified version of the COPY token is given below:

   COPY-LITERAL (position, length, destination)

   The COPY-LITERAL token is identical to COPY except that after
   copying, the destination parameter is replaced with the value
   (destination + length). If the destination parameter is a literal
   value then it is updated directly. If the destination parameter is
   an index then the literal value it references is updated instead.

Price et al.                                                  [PAGE 9]


INTERNET-DRAFT          Universal Decompressor       14 November, 2001

   As above, if (destination + length) is larger than (buffer_size - 1)
   then sufficient multiples of (buffer_size - circular_buffer) are
   subtracted until it is not.

   A further version of the COPY-LITERAL token is given below:

   COPY-OFFSET (offset, length, destination)

   The COPY-OFFSET token is identical to COPY-LITERAL except that an
   offset parameter is given instead of a position parameter. The two
   parameters are related by position = (destination - offset).

   If (destination - offset) does not lie between circular_buffer and
   (buffer_size - 1) inclusive then sufficient multiples of
   (buffer_size - circular_buffer) are added or subtracted until it
   does.

5.2.  ADD / SUBTRACT

   The ADD token instructs the decompressor to add two 2-byte integers
   (addition performed modulo 2^16) and to store the result in the
   location of the first parameter.

   The format of the ADD token is given below:

   ADD (parameter_1, parameter_2)

   Note that as per Section 4.3, depending on how the MSB is set the
   parameters can be interpreted as literal values or indices to
   literal values. If parameter_1 is a literal value then it is
   overwritten with the result of the addition. If parameter_1 is an
   index to a literal value then the literal value is overwritten (not
   parameter_1 itself).

   The SUBTRACT token is the same as the ADD token except that the
   second integer is subtracted from the first (subtraction performed
   modulo 2^16):

   SUBTRACT (parameter_1, parameter_2)

5.3.  LSHIFT / RSHIFT

   The LSHIFT token instructs the decompressor to left shift a 2-byte
   value by the specified number of bits:

   LSHIFT (parameter, no_of_bits)

   The value to be shifted is stored in parameter, and the number of
   bits to shift is stored in no_of_bits. As usual both can be stored
   either as a literal value or as an index to a literal value.

   If the value of no_of_bits does not lie between 0 and 15 inclusive
   then a bad compressed message has been received and decompression
   failure occurs.

Price et al.                                                 [PAGE 10]


INTERNET-DRAFT          Universal Decompressor       14 November, 2001

   The RSHIFT token is the same as the LSHIFT token except that the
   bits are right shifted:

   RSHIFT (parameter, no_of_bits)

5.4.  COMPARE

   The COMPARE token instructs the decompressor to compare two 2-byte
   values and then to jump to one of three specified indices depending
   on the result.

   COMPARE (parameter_1, parameter_2, index_1, index_2, index_3)

   If the parameter_1 < parameter_2 then the decompressor continues
   token execution at the byte position specified by index 1. If
   parameter_1 = parameter_2 then it jumps to index_2. If parameter_1 >
   parameter_2 then it jumps to index_3.

   If an index is specified which is beyond the size of the byte
   buffer, a bad compressed message has been received and decompression
   failure occurs.

5.5.  SWITCH

   The SWITCH token performs a conditional jump based on the value of
   its first parameter.

   SWITCH (j, index_0, index_1, ... , index_n-1)

   When a SWITCH token is encountered the decompressor reads the value
   of j. It then continues token execution at the byte position
   specified by index j.

   If j specifies an index which is beyond the size of the byte buffer,
   a bad compressed message has been received and decompression failure
   occurs.

5.6.  CALL / RETURN

   The CALL and RETURN tokens provide support for compression
   algorithms with a nested structure.

   CALL (index)

   RETURN

   When the decompressor reaches a CALL token, it finds the byte
   position of the token immediately following the CALL token and
   copies this 2-byte integer into stack[stack_free] ready for later
   retrieval. It then increases stack_free by 1 and continues token
   execution at the byte position specified by index.




Price et al.                                                 [PAGE 11]


INTERNET-DRAFT          Universal Decompressor       14 November, 2001

   When the decompressor reaches a RETURN token it decreases stack_free
   by 1, and then continues token execution at the byte position stored
   in stack[stack_free].

   If stack_free ever becomes more than buffer_size - 1 or less than 0
   then a bad compressed message has been received and decompression
   failure occurs (see Section 4.5).

5.7.  HUFFMAN

   The HUFFMAN token maps a shorthand Huffman code onto its
   uncompressed equivalent.

   The format of a HUFFMAN token is as follows:

   HUFFMAN (position, bit_offset, destination, n, bits_1, uncomp_1,
   code_1, bits_2, uncomp_2, code_2, ... , bits_n-1, uncomp_n-1,
   code_n-1, bits_n, uncomp_n)

   The HUFFMAN token is followed by four mandatory parameters plus n
   additional sets of parameters. Every set contains three parameters
   except the nth set, which contains two. The nth set of parameters
   omits the parameter code_n, as the decompressor can always work out
   what the correct value of code_n should be.

   The meaning of the four mandatory parameters is explained below:

   Position:    Indicates the byte location of the Huffman code to be
                decompressed.

   Bit Offset:  Indicates the bit offset at which the Huffman code
                begins within the byte specified above.

   Destination: Indicates the location to which the uncompressed value
                will be copied.

   n:           Indicates the number of additional sets of parameters
                that follow. If n = 0 then the HUFFMAN token is ignored
                by the decompressor.

   The remaining parameters specify the actual Huffman codes and their
   uncompressed equivalents. Each set of 3 parameters specifies a block
   of Huffman codes with the same length and (when treated as integers)
   with values that increase by 1 for each additional Huffman code. The
   precise meaning of the parameters is given below:

   bits_j:      Indicates the additional length in bits of the Huffman
                codes in block j, compared to the Huffman codes in
                block j-1. Note that the total length of the Huffman
                codes in set j is (bits_1 + bits_2 + ... + bits_j).

   uncomp_j:    Indicates the uncompressed value of the first Huffman
                code in set j.


Price et al.                                                 [PAGE 12]


INTERNET-DRAFT          Universal Decompressor       14 November, 2001

   code_j:      Indicates the compressed value (when read as an
                integer) of the last Huffman code in set j.

   Note that the compressed value of the first Huffman code in set j is
   not specified explicitly, but instead is taken to be 0 when j = 1 or
   (2^bits_j) x (1 + code_j-1) when j > 1. This simply means that the
   Huffman code is 1 greater than the largest Huffman code in set j-1,
   but padded with enough zeroes to give it the correct length in bits.

   For example, suppose that bits_1 = 2, uncomp_1 = 15 and code_1 = 1.
   This defines a set of 2 Huffman codes (00 and 01) with corresponding
   values 15 and 16.

   Suppose also that bits_2 = 1, uncomp_2 = 4 and code_2 = 7. This
   defines an additional set of 4 Huffman codes (100, 101, 110 and 111)
   with corresponding values 4, 5, 6 and 7.

   Huffman code:                Uncompressed value:

   00                                   15
   01                                   16
   100                                   4
   101                                   5
   110                                   6
   111                                   7

   The motivation for downloading Huffman codes to the decompressor in
   this form is that it is very easy to convert a compressed Huffman
   code into its uncompressed equivalent. This can be achieved by
   taking the following steps:

   1.)   Set j = 1, set code_0 = 65535 and set code_n = 65535.

   2.)   Read a total of (bits_1 + bits_2 + ... + bits_j) bits starting
   from the specified position and bit_offset. Interpret these bits as
   an integer H, with the first bit to be read as the MSB and the last
   bit to be read as the LSB.

   3.)   If (H > code_j) then set j = j + 1 and goto 2.

   4.)   Output (H + uncomp_j - (2^bits_j) x (1 + code_j-1)), with the
   arithmetic operations calculated modulo 2^16.

   Note that as the HUFFMAN token reads individual bits from within a
   byte, to avoid ambiguity it is necessary to define the order in
   which these bits are read. This draft specifies that a bit offset of
   0 indicates the least significant bit of a byte, whilst a bit offset
   of 7 indicates the most significant bit.

   For example, suppose that an 8-bit Huffman code begins at byte
   position 0 and bit offset 2. In this case the 8 bits of the Huffman
   code can be found in the following locations (Bit 0 is the first bit
   in the Huffman code and Bit 7 is the last bit):


Price et al.                                                 [PAGE 13]


INTERNET-DRAFT          Universal Decompressor       14 November, 2001

    MSB         LSB MSB         LSB

   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |5 4 3 2 1 0    |            7 6|
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

        Byte 0          Byte 1

   If the parameter bit offset does not take a value between 0 and 7
   inclusive then a bad compressed message has been received and
   decompression failure occurs (see Section 4.5.). Decompression
   failure also occurs if a total of more than 16 bits are read from
   the specified position and bit offset.


6.  Security considerations

   Note that this initial version raises the issues, without detailing
   a specific solution to resolve them.

   Introduction:

   In effect, a compressed message is a program; the tokens are
   instructions that are executed by the decompressor. This presents
   particular risks to the operation of the node on which the
   decompressor is running. This is the case for any compression
   algorithm, and so affects the operation of a node using this one.

   Transport Mechanism Requirements:

   The algorithm itself has no security issues, but does place
   requirements on any transport mechanism used to deliver the
   messages.

   If such requirements are not met, then the operation of the system
   can be exposed to attack ranging from indirect reduction in service,
   direct denial of service, and modification of subsequent messages.

   An initial list of requirements on the transport mechanism is:

   - messages should be delivered reliably to avoid corruption - for
   some uses of this algorithm, this corruption may have longer lasting
   effects.

   - messages should be not be duplicated - to ensure that operations
   implied by those messages are executed once only.

   - potentially, the message source should be identified and/or
   validated - to restrict possible insinuation of "attack" messages by
   third parties and to allow "blacklisting" of individual sources that
   "behave badly".




Price et al.                                                 [PAGE 14]


INTERNET-DRAFT          Universal Decompressor       14 November, 2001

7.  References

   [DEFLATE]    "DEFLATE Compressed Data Format Specification version
                1.3", RFC 1951, P. Deutsch, May 1996

   [V-J]        "Compressing TCP/IP Headers for Low-Speed Serial
                Links", V. Jacobson, Internet Engineering Task Force,
                February 1990

   [LZW]        "LZW Data Compression", Mark Nelson, Dr. Dobb's
                Journal, October 1989

   [SCTP]       "Stream Control Transmission Protocol", Stewart et al,
                RFC 2960, Internet Engineering Task Force, October 2000

   [SIP]        "SIP: Session Initiation Protocol", Handley et al,
                RFC 2543, Internet Engineering Task Force, March 1999


8.  Authors' addresses

   Richard Price        Tel: +44 1794 833681
   Email:               richard.price@roke.co.uk

   Abigail Surtees      Tel: +44 1794 833131
   Email:               abigail.surtees@roke.co.uk

   Mark A West          Tel: +44 1794 833311
   Email:               mark.a.west@roke.co.uk

   Lawrence Conroy      Tel: +44 1794 833666
   Email:               lwc@roke.co.uk

   Roke Manor Research Ltd
   Romsey, Hants, SO51 0ZN
   United Kingdom


   Jonathan Rosenberg
   dynamicsoft
   72 Eagle Rock Avenue
   First Floor
   East Hanover, NJ 07936
   Email: jdrosen@dynamicsoft.com











Price et al.                                                 [PAGE 15]


INTERNET-DRAFT          Universal Decompressor       14 November, 2001

Appendix A.  Example sets of tokens

   This appendix gives three example sets of tokens which can be
   downloaded to the universal decompressor. The first set of tokens
   can be used to decompress a very simple variant of LZ77, which
   illustrates how the different token types are intended to be used.

   The next two examples show how the universal decompressor can be
   configured to understand the output of existing compression
   algorthms. One set of tokens allows the decompressor to understand
   [DEFLATE] compressed messages, whilst the other set allows the
   decompressor to understand [LZW] compressed messages.

A.1.  Mnemonic language

   Presenting the example set of tokens as a set of bytes would not be
   very informative, so they are written instead using a simple
   mnemonic language. Most importantly, the language allows the
   parameters of a token to be specified as text references rather than
   as 2-byte integers.

   The mnemonic language is translated into bytes as follows:

   Tokens:     Token names are given in capitals. Replace each
               name with the corresponding 1-byte value as per
               Chapter 5.

   Labels:     Label names are given as a colon followed by lowercase
               text. They are deleted when converting the mnemonics to
               bytes.

   References: These are indicated by a label name without the
               colon. Replace each reference with a 2-byte value
               specifying the location of the byte immediately
               following the label. Moreover if the reference is
               preceded by a "$" symbol then it is interpreted as an
               pointer rather than a literal value, and hence the MSB
               of the 2-byte value is set to 1.

   .short:     Each decimal number following ".short" is interpreted as
               a 2-byte integer with MSBs preceding LSBs. Unless
               otherwise specified, this is the default.

   .byte:      Each decimal number following ".byte" is interpreted as
               a single byte.

   Whitespace: All whitespace, brackets and commas just delimit the
               tokens. Delete.

   Comments:   These are indicated by a semicolon and continue
               to the end of the line. Delete.




Price et al.                                                 [PAGE 16]


INTERNET-DRAFT          Universal Decompressor       14 November, 2001

A.2.  Example token set for simple LZ77 decompression

   The first example gives the tokens required to support a very simple
   LZ77-based algorithm. The decompressor is instructed to interpret a
   compressed message as a set of 4-byte characters, where each
   character contains a 2-byte position integer followed by a 2-byte
   length integer. Taken together these integers point to a previously
   received text string in the byte buffer, which is then copied to the
   end of the uncompressed message.

   Since the compressor can only send references to strings already
   present in the byte buffer, before the first message is decompressed
   the buffer must be initialized with a static dictionary containing
   the 256 ASCII characters.

   :first_token            unpack_static_dictionary
   :uncompressed_start     1792
   :uncompressed_end       2048
   :circular_buffer        2048
   :compressed_start       compressed_message_start
   :compressed_end         0
   :compressed_pointer     0

   :bit_offset             0
   :position               0
   :length                 0
   :token_start            next_character

   :unpack_static_dictionary

   COPY-LITERAL (17, 1, $uncompressed_start)
   ADD ($position, 1)
   COMPARE ($position, 256, unpack_static_dictionary, continue, 0)

   ; The above tokens are used to initialize the static dictionary.

   :continue

   COPY (token_start, 2, first_token)

   :next_character

   HUFFMAN ($compressed_pointer, $bit_offset, position, 1, 16, 0)
   HUFFMAN ($compressed_pointer, $bit_offset, length, 1, 16, 0)
   COPY-LITERAL ($position, $length, $uncompressed_end)
   COMPARE ($compressed_pointer, $compressed_end, next_character, 0, 0)

   :compressed_message_start

A.3.  Example token set for DEFLATE decompression

   This example gives the tokens required to understand [DEFLATE]
   compressed messages as defined in RFC 1951. Note that the example


Price et al.                                                 [PAGE 17]


INTERNET-DRAFT          Universal Decompressor       14 November, 2001

   only covers static Huffman codes (Block Type 01); dynamic Huffman
   codes will be added in a future version.

   Note also that the order in which the HUFFMAN token decompresses
   bits within a byte differs from that specified in RFC 1951. An
   alternative version of the HUFFMAN token may be introduced to
   support the exact bit order of [DEFLATE] in future (although this is
   not an urgent issue, as the order in which bits are decompressed
   does not affect the overall compression ratio).

   :first_token            next_character
   :uncompressed_start     2048
   :uncompressed_end       2048
   :circular_buffer        2048
   :compressed_start       compressed_message_start
   :compressed_end         0
   :compressed_pointer     0

   :bit_offset             0
   :index                  0
   :extra_length_bits      0
   :length_value           0
   :extra_distance_bits    0
   :distance_value         0

   :next_character

   HUFFMAN ($compressed_pointer, $bit_offset, index, 4, 7, 16428, 23,
            1, 0, 191, 0, 16452, 199, 1, 144)

   ; The above HUFFMAN token decompresses a Huffman code from the
   ; length/literal alphabet.

   COMPARE ($index, 16428, literal, 0, length)

   ; The COMPARE token is used to determine whether the decoded
   ; character should be interpreted as a length value, a literal value
   ; or an end-of-block character. In the latter case the decompressor
   ; stops and outputs the uncompressed message.

   :literal

   COPY-LITERAL (17, 1, $uncompressed_end)
   COMPARE ($compressed_pointer, $compressed_end, next_character, 1, 1)

   ; If the decoded character is to be interpreted as a literal value
   ; then it is copied directly to the uncompressed message. The
   ; decompressor then checks whether more compressed characters are
   ; available. If it has run out then it pauses and waits for some
   ; more.

   :length

   LSHIFT ($index, 2)

Price et al.                                                 [PAGE 18]


INTERNET-DRAFT          Universal Decompressor       14 November, 2001

   COPY ($index, 4, extra_length_bits)

   ; If the decoded character is to be interpreted as a length value
   ; then as specified in [DEFLATE], the decompressor must first add a
   ; certain number of additional bits from the compressed message to
   ; this length value.

   HUFFMAN ($compressed_pointer, $bit_offset, extra_length_bits, 1,
            $extra_length_bits, 0)

   ; The above HUFFMAN token obtains the correct number of additional
   ; length bits from the compressed message.

   ADD ($length_value, $extra_length_bits)

   :distance

   HUFFMAN ($compressed_pointer, $bit_offset, index, 1, 5, 74)

   ; In [DEFLATE] a length code is always followed by a distance code.
   ; The above HUFFMAN token decompresses this character.

   LSHIFT ($index, 2)
   COPY ($index, 4, extra_distance_bits)
   HUFFMAN ($compressed_pointer, $bit_offset, extra_distance_bits, 1,
            $extra_distance_bits, 0)

   ; The above HUFFMAN token obtains the correct number of additional
   ; distance bits from the compressed message.

   ADD ($distance_value, $extra_distance_bits)
   COPY-OFFSET ($distance_value, $length_value, $uncompressed_end)

   ; The text string specified by the length and distance codes is then
   ; copied to the end of the uncompressed message.

   COMPARE ($compressed_pointer, $compressed_end, next_character, 1, 1)

   .byte 0 0 0 .short    ; 3 bytes worth of padding

   :length_table

   0       3       0       4       0       5
   0       6       0       7       0       8
   0       9       0       10      1       11
   1       13      1       15      1       17
   2       19      2       23      2       27
   2       31      3       35      3       43
   3       51      3       59      4       67
   4       83      4       99      4       115
   5       131     5       163     5       195
   5       227     0       258

   ; This is the length table from Page 11 of [DEFLATE].

Price et al.                                                 [PAGE 19]


INTERNET-DRAFT          Universal Decompressor       14 November, 2001

   :distance_table

   0       1       0       2       0       3
   0       4       1       5       1       7
   2       9       2       13      3       17
   3       25      4       33      4       49
   5       65      5       97      6       129
   6       193     7       257     7       385
   8       513     8       767     9       1025
   9       1537    10      2049    10      3073
   11      4097    11      6145    12      8193
   12      12289   13      16385   13      24577

   ; This is the distance table from Page 11 of [DEFLATE].

   :compressed_message_start

   ; The compressed messages are stored in the byte buffer following
   ; the above set of tokens.

A.4.  Example token set for LZW decompression

   This example gives the tokens required to understand [LZW]
   compressed messages. This particular variant of LZW uses a codebook
   containing up to 1024 entries, and so each compressed character is a
   10-bit value referencing one of the entries in the codebook.

   :first_token            unpack_static_dictionary
   :uncompressed_start     5888
   :uncompressed_end       1
   :circular_buffer        6144
   :compressed_start       compressed_message_start
   :compressed_end         0
   :compressed_pointer     0

   :bit_offset             0
   :index                  0
   :position_value         0
   :length_value           0
   :codebook_next          2048
   :token_start            next_character

   :unpack_static_dictionary

   COPY-LITERAL (uncompressed_start, 4, $codebook_next)
   COPY-LITERAL (17, 1, $uncompressed_start)
   ADD ($index, 1)
   COMPARE ($index, 256, unpack_static_dictionary, continue, 0)

   ; The above tokens are used to initialize the first 256 entries in
   ; the LZW codebook with single ASCII characters.

   :continue


Price et al.                                                 [PAGE 20]


INTERNET-DRAFT          Universal Decompressor       14 November, 2001

   COPY (circular_buffer, 2, uncompressed_end)
   COPY (token_start, 2, first_token)

   :next_character

   HUFFMAN ($compressed_pointer, $bit_offset, index, 1, 10, 512)

   ; The above HUFFMAN token extracts 10 bits from the compressed
   ; message.

   LSHIFT ($index, 2)
   COPY ($index, 4, position_value)

   ; The contents of the corresponding codebook entry is retrieved by
   ; the above COPY token.

   COPY-LITERAL (uncompressed_end, 2, $codebook_next)
   COPY-LITERAL ($position_value, $length_value, $uncompressed_end)

   ; The above COPY-LITERAL token appends the selected text string to
   ; the end of the uncompressed message.

   ADD ($length_value, 1)
   COPY-LITERAL (length_value, 2, $codebook_next)

   ; As per the LZW algorithm, a new entry is added to the codebook
   ; containing the text string copied to the end of the uncompressed
   ; message, plus the next character in the uncompressed message.

   COMPARE ($compressed_pointer, $compressed_end, next_character, 0, 0)

   :compressed_message_start























Price et al.                                                 [PAGE 21]