INTERNET-DRAFT                                                 A. Costanzo
draft-costanzo-lzju90-mime-01.txt              AKC Computer Services Corp.
Expires: March 8th, 1999                                 September 3, 1998



                           Definition of the LZJU90
                     MIME Content Transfer Encoding Type


1. Status of this Memo

This document is an Internet-Draft.  Internet-Drafts are  working  docu-
ments  of the Internet Engineering Task Force (IETF), its areas, and its
working groups.  Note that other  groups  may  also  distribute  working
documents as Internet-Drafts.

Internet-Drafts are draft documents valid for a maximum  of  six  months
and  may  be  updated,  replaced, or obsoleted by other documents at any
time.  It is inappropriate to use  Internet-Drafts as reference material
or to cite them other than as "work in progress."

To learn the current status of  any  Internet-Draft,  please  check  the
"1id-abstracts.txt"  listing  contained  in  the Internet- Drafts Shadow
Directories   on   ftp.is.co.za   (Africa),   nic.nordu.net    (Europe),
munnari.oz.au (Pacific Rim), ftp.ietf.org (US East Coast, or ftp.isi.edu
(US West Coast).

Distribution of this memo is unlimited.

2.  Abstract

This memo defines a new transfer encoding type for MIME, namely LZJU90.
LZJU90 specifies a section consisting of an encoded binary or text
object.  The encoding provides both compression and representation in a
text format.

The keywords "MUST", "MUST  NOT", "REQUIRED", "SHALL", "SHALL  NOT",
"SHOULD", "SHOULD  NOT", "RECOMMENDED", "MAY" and "OPTIONAL" in this
document are to be interpreted as described in RFC 2119.


3. Introduction

The Multipurpose Internet Message Extensions (MIME) define a facility
whereby messages may use a specific transfer encoding scheme to survive
the process of transport through various mail delivery systems and
transport agents.

This Draft specifies and defines the LZJU90 transfer encoding for MIME
compliant mail systems. This transfer encoding was first defined in
RFC 1505 [1].



Costanzo                                                       [Page  1]

EXPIRES IN SIX MONTHS                                 September 3, 1998



LZJU90 specifies a section consisting of an encoded binary or text
object. The encoding (defined below) provides both compression and
representation in a text format. This encoding is advantageous and
superior over other encoding schemes in that the resulting object
is compressed, usually much smaller than an object using a transfer
encoding of some other type.  The resulting compressed object, is in
the character set ISO-10646-UTF-8. [2] [3]


3. Definition of the  LZJU90 Compressed Encoding

LZJU90 is an encoding for a binary or text object to be
sent in an Internet mail message.  The encoding provides both
compression and representation in a text format that will successfully
survive transmission through the many different mailers and gateways
that comprise the Internet and connected mail networks.


3.1  Overview

The encoding first compresses the binary object, using a modified
LZ77 algorithm, called LZJU90.  It then encodes each 6 bits of the
output of the compression as a text character, using a character set
chosen to survive any translations between codes, such as ASCII to
EBCDIC.  The 64 six-bit strings 000000 through 111111 are represented
by the characters "+", "-", "0" to "9", "A" to "Z", and "a" to "z".
The output text begins with a line identifying the encoding.  This is
for visual reference only, the content transfer encoding header
identifies the section to the user program.  It also names the object
that was encoded, usually by a file name.


  The format of this line is:

               * LZJU90 <name>

  where <name> is optional.  For example:

               * LZJU90 foobar

This is followed by the compressed and encoded data, broken into
lines where convenient.  It is recommended that lines be broken about
every 76 characters length.  The decoder must accept lines with 1 to
1000 characters on each line.  After this, there is one final line that
gives the number of bytes in the original data and a CRC of the
original data.  This should match the byte count and CRC found during
decompression.

  This line has the format:

               * <count> <CRC>

Costanzo                                                       [Page  2]

EXPIRES IN SIX MONTHS                                 September 3, 1998




  where <count> is a decimal number, and CRC is 8 hexadecimal digits.


  For example:
               * 4128076 5AC2D50E

The count used in encoding the object. This numeral is the total number
of lines, including the start and end lines that begin with *.

3.2  Specification of the LZJU90 compression

The Lempel-Ziv-Storer-Szymanski model of mixing pointers and literal
characters is used in the compression algorithm.  Repeat occurrences
of strings of octets are replaced by pointers to the earlier
occurrence.

The data compression is defined by the decoding algorithm.  Any
encoder that emits symbols which cause the decoder to produce the
original input is defined to be valid.

There are many possible strategies for the maximal-string matching
that the encoder does, section 3.2.2 gives an example of
one such algorithm.

Regardless of which algorithm is used, and what tradeoffs
are made between compression ratio and execution speed or space,
the result can always be decoded by the simple decoder.

The compressed data consists of a mixture of unencoded literal
characters and copy pointers which point to an earlier occurrence of
the string to be encoded.

Compressed data contains two types of codewords:

  LITERAL pass the literal directly to the uncompressed output.

  COPY    length, offset
          go back offset characters in the output and copy length
          characters forward to the current position.

To distinguish between codewords, the copy length is used.  A copy
length of zero indicates that the following codeword is a literal
codeword.  A copy length greater than zero indicates that the
following codeword is a copy codeword.

To improve copy length encoding, a threshold value of 2 has been
subtracted from the original copy length for copy codewords, because
the minimum copy length is 3 in this compression scheme.


Costanzo                                                       [Page  3]

EXPIRES IN SIX MONTHS                                 September 3, 1998



The maximum offset value is set at 32255.  Larger offsets offer
extremely low improvements in compression (less than 1 percent,
typically).

No special encoding is done on the LITERAL characters.  However,
unary encoding is used for the copy length and copy offset values to
improve compression.  A start-step-stop unary code is used.


A (start, step, stop) unary code of the integers is defined as
follows:  The Nth codeword has N ones followed by a zero followed by
a field of size START + (N * STEP).  If the field width is equal to
STOP then the preceding zero can be omitted.  The integers are laid
out sequentially through these codewords.  For example, (0, 1, 4)
would look like:

                 Codeword      Range
                 0             0
                 10x           1-2
                 110xx         3-6
                 1110xxx       7-14
                 1111xxxx      15-30

Following are the actual values used for copy length and copy offset:

  The copy length is encoded with a (0, 1, 7) code leading to a maximum
  copy length of 256 by including the THRESHOLD value of 2.

                 Codeword       Range
                 0              0
                 10x            3-4
                 110xx          5-8
                 1110xxx        9-16
                 11110xxxx      17-32
                 111110xxxxx    33-64
                 1111110xxxxxx  65-128
                 1111111xxxxxxx 129-256

  The copy offset is encoded with a (9, 1, 14) code leading to a
  maximum copy offset of 32255.  Offset 0 is reserved as an end of
  compressed data flag.

                 Codeword              Range
                 0xxxxxxxxx                0-511
                 10xxxxxxxxxx            512-1535
                 110xxxxxxxxxxx         1536-3583
                 1110xxxxxxxxxxxx       3485-7679
                 11110xxxxxxxxxxxxx     7680-15871
                 11111xxxxxxxxxxxxxx   15872-32255




Costanzo                                                       [Page  4]

EXPIRES IN SIX MONTHS                                 September 3, 1998


The 0 has been chosen to signal the start of the field for ease of
encoding.  (The bit generator can simply encode one more bit than is
significant in the binary representation of the excess.)

The stop values are useful in the encoding to prevent out of range
values for the lengths and offsets, as well as shortening some codes
by one bit.

The worst case compression using this scheme is a 1/8 increase in
size of the encoded data.  (One zero bit followed by 8 character
bits).  After the character encoding, the worst case ratio is 3/2 to
the original data.


The minimum copy length of 3 has been chosen because the worst case
copy length and offset is 3 bits (3) and 19 bits (32255) for a total
of 22 bits to encode a 3 character string (24 bits).


3.2.1  Sample Decoder

As mentioned previously, the compression is defined by the decoder.
Any encoder that produced output that is correctly decoded is by
definition correct.

The following is an implementation of the decoder, written more for
clarity and as much portability as possible, rather than for maximum
speed.

When optimized for a specific environment, it will run significantly
faster.

      /* LZJU 90 Decoding program */

      /* Written By Robert Jung and Robert Ullmann, 1990 and 1991. */

      /* This code is NOT COPYRIGHT, not protected. It is in the true
         Public Domain. */

      #include <stdio.h>
      #include <string.h>

      typedef unsigned char uchar;
      typedef unsigned int  uint;

      #define N          32255
      #define THRESHOLD      3

      #define STRTP          9
      #define STEPP          1
      #define STOPP         14
      #define STRTL          0
      #define STEPL          1
      #define STOPL          7
Costanzo                                                       [Page  5]

EXPIRES IN SIX MONTHS                                 September 3, 1998


      static FILE *in;
      static FILE *out;

      static int   getbuf;
      static int   getlen;
      static long  in_count;
      static long  out_count;
      static long  crc;
      static long  crctable[256];
      static uchar xxcodes[] =
      "+-0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ\
      abcdefghijklmnopqrstuvwxyz";
      static uchar ddcodes[256];
      static uchar text[N];

      #define CRCPOLY         0xEDB88320
      #define CRC_MASK        0xFFFFFFFF
      #define UPDATE_CRC(crc, c)  \
              crc = crctable[((uchar)(crc) ^ (uchar)(c)) & 0xFF] \
                    ^ (crc >> 8)
      #define START_RECD      "* LZJU90"



      void MakeCrctable()     /* Initialize CRC-32 table */
      {
      uint i, j;
      long r;
          for (i = 0; i <= 255; i++) {
              r = i;
              for (j = 8; j > 0; j--) {
                  if (r & 1)
                      r = (r >> 1) ^ CRCPOLY;
                  else
                      r >>= 1;
                  }
              crctable[i] = r;
              }
      }


      int GetXX()             /* Get xxcode and translate */
      {
      int c;
          do {
              if ((c = fgetc(in)) == EOF)
                  c = 0;
              } while (c == '\n');
          in_count++;
          return ddcodes[c];
      }


Costanzo                                                       [Page  6]

EXPIRES IN SIX MONTHS                                 September 3, 1998

      int GetBit()            /* Get one bit from input buffer */
      {
      int c;
          while (getlen <= 0) {
              c = GetXX();
              getbuf |= c << (10-getlen);
              getlen += 6;
              }
          c = (getbuf & 0x8000) != 0;
          getbuf <<= 1;
          getbuf &= 0xFFFF;
          getlen--;
          return(c);
      }
      int GetBits(int len)        /* Get len bits */
      {
      int c;
          while (getlen <= 10) {
              c = GetXX();
              getbuf |= c << (10-getlen);
              getlen += 6;
              }
          if (getlen < len) {
              c = (uint)getbuf >> (16-len);
              getbuf = GetXX();
              c |= getbuf >> (6+getlen-len);
              getbuf <<= (10+len-getlen);
              getbuf &= 0xFFFF;
              getlen -= len - 6;
              }
          else {
              c = (uint)getbuf >> (16-len);
              getbuf <<= len;
              getbuf &= 0xFFFF;
              getlen -= len;
              }
          return(c);
      }



      int DecodePosition()    /* Decode offset position pointer */
      {
      int c;
      int width;
      int plus;
      int pwr;
          plus = 0;
          pwr = 1 << STRTP;
          for (width = STRTP; width < STOPP; width += STEPP) {
              c = GetBit();
              if (c == 0)


Costanzo                                                       [Page  7]

EXPIRES IN SIX MONTHS                                 September 3, 1998


                  break;
              plus += pwr;
              pwr <<= 1;
              }

          if (width != 0)
              c = GetBits(width);
          c += plus;
          return(c);
      }



      int DecodeLength()      /* Decode code length */
      {
      int c;
      int width;
      int plus;
      int pwr;
          plus = 0;
          pwr = 1 << STRTL;
          for (width = STRTL; width < STOPL; width += STEPL) {
              c = GetBit();
              if (c == 0)
                  break;
              plus += pwr;
              pwr <<= 1;
              }
          if (width != 0)
              c = GetBits(width);
          c += plus;
      return(c);
      }


      void InitCodes()        /* Initialize decode table */
      {
      int i;
          for (i = 0; i < 256; i++) ddcodes[i] = 0;
          for (i = 0; i < 64; i++) ddcodes[xxcodes[i]] = i;
      return;
      }

      main(int ac, char **av)            /* main program */
      {
      int r;
      int j, k;
      int c;
      int pos;
      char buf[80];
      char name[3];
      long num, bytes;

Costanzo                                                       [Page  8]

EXPIRES IN SIX MONTHS                                 September 3, 1998




          if (ac < 3) {
              fprintf(stderr, "usage: judecode in out\n");
              return(1);
              }

          in = fopen(av[1], "r");
          if (!in){
              fprintf(stderr, "Can't open %s\n", av[1]);
              return(1);
              }


          out = fopen(av[2], "wb");
          if (!out) {
              fprintf(stderr, "Can't open %s\n", av[2]);
              fclose(in);

          return(1);
              }

          while (1) {
              if (fgets(buf, sizeof(buf), in) == NULL) {
                  fprintf(stderr, "Unexpected EOF\n");
              return(1);
                  }
              if (strncmp(buf, START_RECD, strlen(START_RECD)) == 0)
                  break;
              }

          in_count = 0;
          out_count = 0;
          getbuf = 0;
          getlen = 0;

          InitCodes();
          MakeCrctable();

          crc = CRC_MASK;
          r = 0;

          while (feof(in) == 0) {
              c = DecodeLength();
              if (c == 0) {
                  c = GetBits(8);
                  UPDATE_CRC(crc, c);
                  out_count++;
                  text[r] = c;
                  fputc(c, out);
                  if (++r >= N)
                      r = 0;
                  }

Costanzo                                                       [Page  9]

EXPIRES IN SIX MONTHS                                 September 3, 1998


             else {
                  pos = DecodePosition();
                  if (pos == 0)
                      break;
                  pos--;
                  j = c + THRESHOLD - 1;
                  pos = r - pos - 1;
                  if (pos < 0)
                      pos += N;
                  for (k = 0; k < j; k++) {
                      c = text[pos];
                      text[r] = c;
                      UPDATE_CRC(crc, c);
                      out_count++;
                      fputc(c, out);
                      if (++r >= N)
                          r = 0;
                      if (++pos >= N)

                          pos = 0;
                      }
                  }
              }

          fgetc(in); /* skip newline */

          if (fscanf(in, "* %ld %lX", &bytes, &num) != 2) {
              fprintf(stderr, "CRC record not found\n");
              return(1);
              }

          else if (crc != num) {
              fprintf(stderr,
                   "CRC error, expected %lX, found %lX\n",
                   crc, num);
              return(1);
              }

          else if (bytes != out_count) {
              fprintf(stderr,
                   "File size error, expected %lu, found %lu\n",
                   bytes, out_count);
          return(1);
              }

          else
              fprintf(stderr,
                   "File decoded to %lu bytes correctly\n",
                   out_count);





Costanzo                                                       [Page 10]


EXPIRES IN SIX MONTHS                                 September 3, 1998

          fclose(in);
          fclose(out);
      return(0);
      }


3.2.2  An Example of an Encoder

Many algorithms are possible for the encoder, with different
tradeoffs between speed, size, and complexity.  The following
is an example program which is fairly efficient; more sophisticated
implementations will run much faster, and produce somewhat
better compression.

This example also shows that the encoder need not use the entire
window available.  Not using the full window costs a small amount of
compression, but can greatly increase the speed of some algorithms.

    /* LZJU 90 Encoding program */

    /* Written By Robert Jung and Robert Ullmann, 1990 and 1991. */

    /* This code is NOT COPYRIGHT, not protected. It is in the true
       Public Domain. */

    #include <stdio.h>

    typedef unsigned char uchar;
    typedef unsigned int  uint;

    #define N          24000   /* Size of window buffer */
    #define F            256   /* Size of look-ahead buffer */
    #define THRESHOLD      3
    #define K          16384   /* Size of hash table */

    #define STRTP          9
    #define STEPP          1
    #define STOPP         14

    #define STRTL          0
    #define STEPL          1
    #define STOPL          7

    #define CHARSLINE     78

    static FILE *in;
    static FILE *out;

    static int   putlen;
    static int   putbuf;
    static int   char_ct;
    static long  in_count;
    static long  out_count;

Costanzo                                                       [Page 11]


EXPIRES IN SIX MONTHS                                 September 3, 1998

    static long  crc;
    static long  crctable[256];
    static uchar xxcodes[] =
    "+-0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ\
    abcdefghijklmnopqrstuvwxyz";
    uchar window_text[N + F + 1];


    /* text contains window, plus 1st F of window again
       (for comparisons) */

    uint hash_table[K];
    /* table of pointers into the text */

    #define CRCPOLY         0xEDB88320
    #define CRC_MASK        0xFFFFFFFF
    #define UPDATE_CRC(crc, c)  \
      crc = crctable[((uchar)(crc) ^ (uchar)(c)) & 0xFF] \
      ^ (crc >> 8)


    void MakeCrctable()     /* Initialize CRC-32 table */
    {
    uint i, j;
    long r;
        for (i = 0; i <= 255; i++) {
            r = i;
            for (j = 8; j > 0; j--) {
                if (r & 1)
                    r = (r >> 1) ^ CRCPOLY;
                else
                    r >>= 1;
            }
            crctable[i] = r;
        }
    }



    void PutXX(int c)           /* Translate and put xxcode */
    {
        c = xxcodes[c & 0x3F];
        if (++char_ct > CHARSLINE) {
            char_ct = 1;
            fputc('\n', out);
        }
        fputc(c, out);
        out_count++;
    }





Costanzo                                                       [Page 12]


EXPIRES IN SIX MONTHS                                 September 3, 1998


    void PutBits(int c, int len)  /* Put rightmost "len" bits of "c" */
    {
        c <<= 16 - len;
        c &= 0xFFFF;
        putbuf |= (uint) c >> putlen;
        c <<= 16 - putlen;
        c &= 0xFFFF;
        putlen += len;
        while (putlen >= 6) {
            PutXX(putbuf >> 10);
            putlen -= 6;
            putbuf <<= 6;
            putbuf &= 0xFFFF;
            putbuf |= (uint) c >> 10;
            c = 0;
            }
    }


    void EncodePosition(int ch) /* Encode offset position pointer */
    {
    int width;
    int prefix;
    int pwr;
        pwr = 1 << STRTP;
        for (width = STRTP; ch >= pwr; width += STEPP, pwr <<= 1)
            ch -= pwr;
        if ((prefix = width - STRTP) != 0)
            PutBits(0xffff, prefix);
        if (width < STOPP)
            width++;
        /* else if (width > STOPP)
        abort(); do nothing */
        PutBits(ch, width);
    }


    void EncodeLength(int ch)   /* Encode code length */
    {
    int width;
    int prefix;
    int pwr;
        pwr = 1 << STRTL;
        for (width = STRTL; ch >= pwr; width += STEPL, pwr <<= 1)
            ch -= pwr;
        if ((prefix = width - STRTL) != 0)
            PutBits(0xffff, prefix);
        if (width < STOPL)
            width++;
        /* else if (width > STOPL)
        abort(); do nothing */
        PutBits(ch, width);
    }
Costanzo                                                       [Page 13]


EXPIRES IN SIX MONTHS                                 September 3, 1998


    main(int ac, char **av)            /* main program */
    {
    uint r, s, i, c;
    uchar *p, *rp;
    int match_position;
    int match_length;
    int len;
    uint hash, h;

        if (ac < 3) {
            fprintf(stderr, "usage: juencode in out\n");
        return(1);
            }

        in = fopen(av[1], "rb");
        if (!in) {
            fprintf(stderr, "Can't open %s\n", av[1]);
        return(1);
            }

        out = fopen(av[2], "w");
        if (!out) {
            fprintf(stderr, "Can't open %s\n", av[2]);
            fclose(in);
        return(1);
            }

        char_ct = 0;
        in_count = 0;
        out_count = 0;
        putbuf = 0;
        putlen = 0;
        hash = 0;

        MakeCrctable();
        crc = CRC_MASK;

        fprintf(out, "* LZJU90 %s\n", av[1]);

        /* The hash table inititialization is somewhat arbitrary */
        for (i = 0; i < K; i++) hash_table[i] = i % N;

        r = 0;
        s = 0;

        /* Fill lookahead buffer */

        for (len = 0; len < F && (c = fgetc(in)) != EOF; len++) {

        UPDATE_CRC(crc, c);
        in_count++;
        window_text[s++] = c;
        }
Costanzo                                                       [Page 14]


EXPIRES IN SIX MONTHS                                 September 3, 1998


        while (len > 0) {
        /* look for match in window at hash position */
        h = ((((window_text[r] << 5) ^ window_text[r+1])
                << 5) ^ window_text[r+2]);
        p = window_text + hash_table[h % K];
        rp = window_text + r;
        for (i = 0, match_length = 0; i < F; i++) {
                if (*p++ != *rp++) break;
                match_length++;
                }
        match_position = r - hash_table[h % K];
        if (match_position <= 0) match_position += N;

        if (match_position > N - F - 2) match_length = 0;
        if (match_position > in_count - len - 2)
            match_length = 0; /* ! :-) */

        if (match_length > len)
            match_length = len;
        if (match_length < THRESHOLD) {
            EncodeLength(0);
            PutBits(window_text[r], 8);
            match_length = 1;
            }
        else {
            EncodeLength(match_length - THRESHOLD + 1);
            EncodePosition(match_position);
            }

        for (i = 0; i < match_length &&
                        (c = fgetc(in)) != EOF; i++) {
                UPDATE_CRC(crc, c);
                in_count++;
            window_text[s] = c;
                if (s < F - 1)
                window_text
                [s + N] = c;
            if (++s > N - 1) s = 0;
            hash = ((hash << 5) ^ window_text[r]);
            if (r > 1) hash_table[hash % K] = r - 2;
            if (++r > N - 1) r = 0;
            }


        while (i++ < match_length) {
            if (++s > N - 1) s = 0;
            hash = ((hash << 5) ^ window_text[r]);
            if (r > 1) hash_table[hash % K] = r - 2;
            if (++r > N - 1 ) r = 0;
            len--;
                }
        }


Costanzo                                                       [Page 15]


EXPIRES IN SIX MONTHS                                 September 3, 1998

        /* end compression indicator */
        EncodeLength(1);
        EncodePosition(0);
        PutBits(0, 7);

        fprintf(out, "\n* %lu %08lX\n", in_count, crc);
        fprintf(stderr, "Encoded %lu bytes to %lu symbols\n",
                in_count, out_count);

        fclose(in);
        fclose(out);

    return(0);
    }


3.3.1  Example of a MIME LZJU90 Compressed Object

  The following is an example of a MIME LZJU90 compressed object.

        Content-Type: text/plain;
                charset="utf-8"
        Content-Transfer-Encoding: LZJU90

        * LZJU90 example
        8-mBtWA7WBVZ3dEBtnCNdU2WkE4owW+l4kkaApW+o4Ir0k33Ao4IE4kk
        bYtk1XY618NnCQl+OHQ61d+J8FZBVVCVdClZ2-LUI0v+I4EraItasHbG
        VVg7c8tdk2lCBtr3U86FZANVCdnAcUCNcAcbCMUCdicx0+u4wEETHcRM
        7tZ2-6Btr268-Eh3cUAlmBth2-IUo3As42laIE2Ao4Yq4G-cHHT-wCEU
        6tjBtnAci-I++
        * 190 081E2601


4. Security Consideration

 The security (or lack) is responsibility of the application domain
 controlling the decoder of the LZJU90 object.


5.  References

   [1] Costanzo, A. Robinson, D. and R. Ullmann, "Encoding Header
       Field for Internet Messages", RFC 1505, AKC Consulting,
       Prime Computer, Inc., August 1993.

   [2] International Organization for Standardization, Information
       Technology -- Universal Coded Character Set (UCS).  ISO/IEC
       10646-1:1993, June 1993.

   [3] International Organization for Standardization, Information
       Technology -- Universal Coded Character Set (UCS).  ISO/IEC
       10646-1: 1993/AMD.2: 1996 (E)


Costanzo                                                       [Page 16]


EXPIRES IN SIX MONTHS                                 September 3, 1998

6. Acknowledgments

The author would like to thank Robert Jung, David Robinson and
Robert Ullmann for their past contributions to this work.


7. Author's Address

Al Costanzo
AKC Computer Services Corp.
P.O. Box 4031
Roselle Park, NJ  07204-0531

Phone: +1 908 298 9000
Email: AL@AKC.COM









































Costanzo                                                       [Page 17]