Network Services Research Lab                  David Korn  and  Kiem-Phong Vo
                                                                    AT&T Labs
                                               Submission:      June 29, 2001
                                               Expiration:  December 29, 2001


        The VCDIFF Generic Differencing and Compression Data Format
                       <draft-korn-vcdiff-04.txt>


Status of this Memo

    This document is an Internet-Draft and is in full conformance
    with all provisions of Section 10 of RFC2026.

    Internet-Drafts are working documents of the Internet Engineering
    Task Force (IETF), its areas, and its working groups.  Note that
    other groups may also distribute working documents as
    Internet-Drafts.

    Internet-Drafts are draft documents valid for a maximum of six
    months and may be updated, replaced, or obsoleted by other
    documents at any time.  It is inappropriate to use Internet-
    Drafts as reference material or to cite them other than as
    "work in progress."

    The list of current Internet-Drafts can be accessed at
    http://www.ietf.org/ietf/1id-abstracts.txt

    The list of Internet-Draft Shadow Directories can be accessed at
    http://www.ietf.org/shadow.html.

Abstract

    This memo describes a general and efficient data format suitable
    for encoding compressed and/or differencing data so that they can
    be easily transported among computers.

Table of Contents:

    1.  EXECUTIVE SUMMARY ............................................  1
    2.  ALGORITHM CONVENTIONS ........................................  3
    3.  DELTA INSTRUCTIONS ...........................................  4
    4.  VCDIFF ENCODING FORMAT .......................................  5
    5.  INSTRUCTION CODE TABLES ...................................... 14
    6.  FURTHER ISSUES ............................................... 16
    7.  SUMMARY ...................................................... 17
    8.  ACKNOWLEDGEMENTS ............................................. 17
    9.  SECURITY CONSIDERATIONS ...................................... 17
   10.  SOURCE CODE AVAILABILITY ..................................... 18
   11.  REFERENCES ................................................... 18
   12.  AUTHOR'S ADDRESS ............................................. 18
   13.  APPENDIX: SECTION LAY-OUT IN A DELTA FILE..................... 19
   14.  EXPIRATION DATE .............................................. 19


1.  EXECUTIVE SUMMARY

    Compression and differencing techniques can greatly improve storage
    and transmission of files and file versions.  Since files are often
    transported across machines with distinct architectures and performance
    characteristics, such data should be encoded in a form that is portable
    and can be decoded with little or no knowledge of the encoders.
    This document describes Vcdiff, a compact portable encoding format
    designed for these purposes.

RFC 4   VCDIFF Generic Differencing and Compression Data Format



    Data differencing is the process of computing a compact and invertible
    encoding of a "target file" given a "source file".  Data compression
    is similar but without the use of source data.  The UNIX utilities diff,
    compress, and gzip are well-known examples of data differencing and
    compression tools.  For data differencing, the computed encoding is
    called a "delta file", and, for data compression, it is called
    a "compressed file".  Delta and compressed files are good for storage
    and transmission because they are often smaller than the originals.

    Data differencing and data compression are traditionally treated
    as distinct types of data processing.  However, as shown in the Vdelta
    technique by Korn and Vo [1], compression can be thought of as a special
    case of differencing in which the source data is empty. The basic idea
    is to unify the string parsing scheme used in the Lempel-Ziv'77 style
    compressors [2], and the block-move technique of Tichy [3].  Loosely
    speaking, this works as follows:


        a. Concatenate source and target data.
        b. Parse the data from left to right as in LZ'77 but
           make sure that a parsed segment starts target data.
        c. Start to output when reaching target data.


    Parsing is based on string matching algorithms such as suffix trees [4]
    or hashing with different time and space performance characteristics.
    Vdelta uses a fast string matching algorithm that requires less memory
    than other techniques [5].  However, even with this algorithm, the
    memory requirement can still be prohibitive for large files.  A common
    way to deal with memory limitation is to partition an input file into
    chunks called "windows" and process them separately. Here, except for
    unpublished work by Vo, little has been done on designing effective
    windowing schemes. Current techniques, including Vdelta, simply use
    windows of the same size with corresponding addresses across source and
    target files.


    String matching and windowing algorithms have large influence on the
    compression rate of delta and compressed files. However, it is desirable
    to have a portable encoding format that is independent of such algorithms.
    This enables construction of client-server applications in which a server
    may serve clients with unknown computing characteristics.  Unfortunately,
    all current differencing and compressing tools, including Vdelta, fall
    short in this resspect. Their storage formats are closely intertwined
    with the implemented string matching and windowing algorithms.

    The encoding format Vcdiff proposed here addresses the above issues.
    Vcdiff achieves the below characteristics:

        Output compactness:
            The basic encoding format compactly represents compressed or delta
            files. Applications can further extend the basic encoding format with
            "secondary encoders" (e.g., a Huffman or arithmetic encoder) to
            achieve more compression.
        Data portability:
            The basic encoding format is free from machine byte order and
            word size issues. This allows data to be encoded on one machine
            and decoded on a different machine with different architecture.
        Algorithm genericity:

                                   -2-

RFC 4   VCDIFF Generic Differencing and Compression Data Format   June 2001


            The decoding algorithm is independent from string matching and
            windowing algorithms. This allows competition among implementations
            of the encoder while keeping the same decoder.
        Decoding efficiency:
            Except for secondary encoder issues, the decoding algorithm runs
            in time proportional to the size of the target file and uses space
            proportional to the maximal window size.

    The Vcdiff data format and the algorithms for decoding data shall be
    described next.  Since Vcdiff treats compression as a special case of
    differencing, we shall use the term "delta file" to indicate the
    compressed output for both cases.


2. ALGORITHM CONVENTIONS

    Algorithms to encode and decode the Vcdiff data format will be presented
    in the ANSI-C language. To simplify the presentation, we shall generally
    omit error checking and resource allocation and deallocation. In addition,
    the below conventions will be observed:

        a. Tokens with all upper cases letters will be C macros.
        b. Variables with capitalized names are global.
        c. Code fragments will be presented with line numbers to be used
           in the subsequent COMMENTS sections.

    Data will be encoded in byte units.  For portability, control data
    generated by Vcdiff shall be limited to the lower eight bits of a byte
    even on machines with larger bytes. The bits in a byte are named from
    right to left so that the first bit has the lowest value, 1<<0 or 1,
    and the eighth bit has value 1<<7 or 128.

    To facilitate the algorithms, we shall assume a few types and functions
    for I/O on streams.  To be definite, we shall use interfaces similar to
    that provided in the Sfio library [6].  Below are descriptions of some of
    these types and functions. Others can be found in reference [6].

        uchar_t:
            This is the type "unsigned char". When coded in a delta file,
            this takes a single byte.

        uint_t:
            This is some unsigned integer type that is at least as large
            as an "unsigned int" and should be large enough to hold file
            offsets.  In general, unsigned integers are coded in a portable
            variable-sized format to be described in Section 4.2.

        Sfio_t:
            This is the type of a stream.

        Sfio_t* sfstropen(uchar_t* buf, int n):
            This is not an Sfio function but it can be easily implemented
            on top of the Sfio primitive sfnew(). sfstropen() creates a stream
            from a given buffer with a given size. We shall assume that such
            a stream is both readable and writable. As with Sfio, a stream
            opened for writing will extend its buffer as necessary to
            accommodate output data.




                                   -3-

RFC 4   VCDIFF Generic Differencing and Compression Data Format   June 2001




3.  DELTA INSTRUCTIONS

    A target file is partitioned into non-overlapping sections or windows
    to be processed separately. A target window T of length t may be
    compared against some source data segment S of length s.  Such a source
    data segment may come from some earlier part of the target file or
    it may come from the source file, if there is one. It is assumed that
    there is sufficient memory so that both T and S can be processed
    in main memory.

    For string processing, we treat S and T as substrings of a superstring U
    formed by concatenating T and S like this:

        S[0]S[1]...S[s-1]T[0]T[1]...T[t-1]

    The index of a byte in S or T is referred to by its location in U.
    For example, the index of T[i] is s+i.

    The instructions to encode and direct the reconstruction of a target
    window are called delta instructions.  There are three types:

        ADD: This instruction has two arguments, a size s and a sequence of
            s bytes to be copied.

        COPY: This instruction has two arguments, a size s and an address p
            in the string U. The arguments specify the substring of U that
            must be copied. We shall assert that such a substring must be
            entirely contained in either S or T.

        RUN: This instruction has two arguments, a size s and a byte b that
            will be repeated s times.

    Below are example source and target strings and the delta instructions
    that encode the target string in terms of the source string.


        a b c d e f g h i j k l m n o p
        a b c d w x y z e f g h e f g h e f g h e f g h z z z z


        COPY  4, 0
        ADD   4, w x y z
        COPY  4, 4
        COPY 12, 24
        RUN   4, z


    Thus, the first letter 'a' in the target string will be at location 16
    in the superstring. Note that the fourth instruction, "COPY 12, 24",
    copies data from T itself since address 24 is position 8 in T.
    In addition, some part of the data to be copied is reconstructed along
    with the copying.  This allows efficient encoding of periodic sequences,
    i.e., sequences with regularly repeated subsequences. The RUN instruction
    is a compact way to encode a sequence repeating the same byte even though
    such a sequence can be thought of as a periodic sequence with period 1.




                                   -4-

RFC 4   VCDIFF Generic Differencing and Compression Data Format   June 2001



4.  VCDIFF ENCODING FORMAT

    A large target file is partitioned into non-overlapping sections called
    "target windows". Window sizes should be chosen so that each window can
    be completely processed in memory during both encoding and decoding.
    A target window may be compared against some source data segment or none.
    For compression, such a source segment would be from some earlier part of
    the target file while, for differencing, it is often from the source file.


4.1 Delta File Layout

    A Vcdiff delta file is organized into sections as follows:

        Header
        Window 1
        Window 2
        ...

    The header of a delta file includes magic bytes to identify file
    type and information concerning data processing beyond the basic
    encoding format. The window sections encode the target windows.


4.1.1 The Header Section

    Each delta file starts with a header section organized as below.
    Note the convention that square-brackets enclose optional items.

        Header1 - uchar_t
        Header2 - uchar_t
        Header3 - uchar_t
        Header4 - uchar_t
        Indicator - uchar_t
        [Secondary compressor - uchar_t]
        [Length of instruction code table data - uint_t]
        [Instruction code table data]


   The first four Header bytes are defined below. The first three bytes
   are the ASCII characters 'V', 'C' and 'D' or-ed with the eighth bit.
   The fourth byte is currently set to zero. In the future, it may be
   used to indicate the version of Vcdiff.

        #define VCD_HEADER1    (0x56 | (1<<7))  /* 'V' | (1<<7) */
        #define VCD_HEADER2    (0x43 | (1<<7))  /* 'C' | (1<<7) */
        #define VCD_HEADER3    (0x44 | (1<<7))  /* 'D' | (1<<7) */
        #define VCD_HEADER4    (0)              /* version      */


    The Indicator byte shows if there are any initialization data
    required to aid in the reconstruction of data in the Window sections.
    This byte is composed from some subset of the following bits:

        #define VCD_DECOMPRESS (1<<0)
        #define VCD_CODETABLE  (1<<1)


    The bit VCD_DECOMPRESS indicates that a secondary compressor may have

                                   -5-

RFC 4   VCDIFF Generic Differencing and Compression Data Format   June 2001


    been used to further compress certain parts of the delta data as described
    in Section 4.1.3. In that case, the index of the decompressor is given
    in a subsequent byte.

    The Length of instruction code table and the Instruction code table
    data sections are in the delta file only if the bit VCD_CODETABLE is
    on. The processing of this table will be described in Section 5.


4.1.2 The Format of a Window

    Each window is organized as follows:

        Indicator - uchar_t
        [Source segment size - uint_t]
        [Source segment position - uint_t]
        Length of the delta encoding - uint_t
        The delta encoding

    Below are the detail of the various items:

        Indicator:
            This byte should be zero or one of the below values:

                #define VCD_SRCWINDOW   (1<<0)
                #define VCD_TARWINDOW   (1<<1)

            The value VCD_SRCWINDOW indicates that there is a segment
            of data from the source file used for differencing against
            the current window.  Likewise, VCD_TARWINDOW indicates a
            similar segment of data from the target file. In these cases,
            encoded next are two integers to indicate respectively
            the size and position of the data segment in the relevant file.
            If this byte is zero, the window was compressed without using
            a source segment.

        Length of the delta encoding:
            This is a variable-sized integer that tells the length of
            the delta encoding data for this window.

        The delta encoding:
            This contains the data representing the delta encoding.

4.1.3 The Delta Encoding

    The delta encoding of a window is organized as follows:

        Length of target window - uint_t
        Indicator - uchar_t
        Length of ADD and RUN data section - uint_t
        Length of instruction section - uint_t
        Length of COPY address section - uint_t
        Data section for ADD and RUN instructions
        Instructions section
        Addresses section for COPY instructions

        Length of target window:
            This is a variable-sized integer indicating the size of the
            target window. The decoder can use this value to allocate
            the necessary memory to decode the window.

                                   -6-

RFC 4   VCDIFF Generic Differencing and Compression Data Format   June 2001



        Indicator:
            This byte is composed from some subset of the below bits:

                #define VCD_DATACOMP    (1<<0)
                #define VCD_INSTCOMP    (1<<1)
                #define VCD_ADDRCOMP    (1<<2)

            The delta encoding is separated into three parts: the unmatched
            data accompanying the ADD and RUN instructions, the coding of
            the instructions, and the encoded addresses of the COPY
            instructions. If the bit VCD_DECOMPRESS (Section 4.1.1) was on,
            each of these sections may have been compressed using the
            given secondary compressor. The above bits indicate these cases.

        Length of ADD and RUN data section:
            This is the length of the section of data storing the unmatched
            data accompanying the ADD and RUN instructions (which may have
            been compressed using a secondary compressor as discussed).

        Length of instruction section:
            This is similar to the above but for the delta instructions.

        Length of COPY address section:
            This is similar to the above but for the addresses of the COPY
            instructions.

        Data section for ADD and RUN instructions:
            This section contains the unmatched data for the ADD and RUN
            instructions (which may have been further compressed).

        Instructions section:
            This section contains the instructions.

        Addresses section for COPY instructions:
            This section contains the addresses of the COPY instructions.


4.1.4 Processing a Delta File

    Below is the basic algorithm to decode a delta file:

     1. sfgetc(Delta);
     2. sfgetc(Delta);
     3. sfgetc(Delta);
     4. sfgetc(Delta);

     5. Compressor = -1;
     6. Codetable = Dflttable;
     7. if((indi = sfgetc(Delta)) != 0)
     8. {   if(indi & VCD_DECOMPRESS)
     9.          Compressor = sfgetc(Delta);
    10.     if(indi & VCD_CODETABLE)
    11.     {    n_tbl = sfgetu(Delta);
    12.          tbl = getdata(Delta, n_tbl, -1);
    13.          Codetable = gettable(tbl, n_tbl);
    14.     }
    15. }

    16. for(;;)

                                   -7-

RFC 4   VCDIFF Generic Differencing and Compression Data Format   June 2001


    17. {   if((indi = sfgetc(Delta)) < 0)
    18.          break;

    19.     if(indi != 0)
    20.     {    n_src = sfgetu(Delta);
    21.          p_src = sfgetu(Delta);
    22.          if(indi == VCD_TARWINDOW)
    23.               src = getdata(Target, n_src, p_src);
    24.          else src = getdata(Source, n_src, p_src);
    25.     }
    26.     else n_src = 0;

    27.     n_del = sfgetu(Delta);
    28.     del = getdata(Delta, n_del, -1);
    29.     n_tar = win_inflate(&tar, src, n_src, del, n_del);
    30.     sfwrite(Target, tar, n_tar);
    31. }

COMMENTS.

   1-4: These lines read the four magic header bytes that indicate
        the type of this file.
  5-15: These lines first initialize the indices of secondary encoders
        and code table to default values. Then, the initialization
        data, if any, is processed to reset these variables.
        The function getdata() reads from the given stream the segment of
        data at the given position or the current position (if this
        argument is -1). The function gettable() will be described later.
 16-31: These lines read window data and decode them.
 19-26: These lines determine if there is a source segment of data and read
        it from the indicated file.
 27-30: These lines read the delta encoding data, get processing space,
        call win_inflate() to decode target data, then write the results
        to the target file.


    Next is the function to recompute a target window:

     1. int win_inflate(uchar_t** tarp,
     2.             uchar_t* src, int n_src,
     3.             uchar_t* del, int n_del)
     4. {   int      n_tar, n_data, n_inst, n_addr;
     5.     uchar_t  ctrl, *tar, *data, *inst, *addr;
     6.     Sfio_t   *delf, *dataf, *instf, *addrf;

     7.     delf = sfstropen(del, n_del);

     8.     n_tar = sfgetu(delf);
     9.     ctrl  = sfgetc(delf);
    10.     n_data = sfgetu(del);
    11.     n_inst = sfgetu(del);
    12.     n_addr = sfgetu(del);

    13.     tar = malloc(n_tar);
    14.     data = getdata(delf, n_data, -1);
    15.     inst = getdata(delf, n_inst, -1);
    16.     addr = getdata(delf, n_addr, -1);

    17.     if(ctrl&VCD_DATACOMP)
    18.          n_data = decompress(Compressor, data, n_data, &data);

                                   -8-

RFC 4   VCDIFF Generic Differencing and Compression Data Format   June 2001


    19.     if(ctrl&VCD_INSTCOMP)
    20.          n_inst = decompress(Compressor, inst, n_inst, &inst);
    21.     if(ctrl&VCD_ADDRCOMP)
    22.          n_addr = decompress(Compressor, addr, n_addr, &addr);

    23.     dataf = sfstropen(data, n_data);
    24.     instf = sfstropen(inst, n_inst);
    25.     addrf = sfstropen(addr, n_addr);
    26.     win_decode(tar, n_tar, src, n_src, dataf, instf, addrf);

    27.     *tarp = tar;
    28.     return n_tar;
    29. }


COMMENTS.

     7: This line creates a stream to read the delta encoding data.
  8-16: These lines read the sizes of the various datasets, then allocate
        memory and read in data as necessary.
 17-22: These lines decompress the above data if necessary.  The function
        decompress() invokes the respective decompressor. It returns the
        size of the decompressed data while the decompressed data itself
        is returned via the last argument. See Section 6 for proposed
        values for Compressor.
 23-26: These lines call win_decode() (Section 4.4) to actually
        reconstruct target data.
 27-28: These lines return the reconstructed target data.




4.2 Encoding Integers Using a Variable-Sized Format

    Vcdiff encodes integer values using the variable-size format introduced
    in the Sfio library [6] for encoding unsigned values. The code presented
    below is not quite correct with respect to type handling as in Sfio but
    they show the ideas. The encoding treats an integer as a number in base
    128 so that each digit can be coded in one byte. The eighth bit of such
    a byte is the continuation bit. It is 1 if there are more digits to come
    or 0 if it is the last digit.



                        +---------------------+
                        | 11100000 | 00111001 |
                        +---------------------+
                          byte 0       byte 1

                                Table 1



    Table 1 shows the variable sized encoding of 12345.  The bytes in
    the encoding are presented in binary to make clear the use of
    the continuation bit. Omitting the continuation bit, the encoding
    of 12345 uses two bytes with values 96 and 57.




                                   -9-

RFC 4   VCDIFF Generic Differencing and Compression Data Format   June 2001



    Below are the algorithms:

     1. int sfputu(Sfio_t* f, uint_t v)
     2. {   uchar_t c[2*sizeof(uint_t)], *s;

     3.     s = &c[sizeof(c)-1];
     4.     *s = v & 127;
     5.     while((v >>= 7) != 0)
     6.          *--s = (v & 127) | (1<<7);

     7.     sfwrite(f, s, (&c[sizeof(c)]) - s);
     8.     return  &c[sizeof(c)]-s;
     9. }

    10. uint_t sfgetu(Sfio_t* f)
    11. {   uint_t b, v;

    12.     for(v = 0;; )
    13.     {    b = sfgetc(f);
    14.          v = (v << 7) | (b & 127);
    15.          if(!(b & 128) )
    16.               return v;
    17.     }
    18. }

COMMENTS.

     1: This line declares the formal arguments to sfputu(), a stream f
        to store the encoding and the value v to be encoded.
     2: This line declares an array large enough to store the encoding.
   4-6: These lines extract digits in base 128 and store them in the array.
        Note that the right-most digit is extracted first and does not
        carry a continuation bit.
   7-8: These lines write the encoding out to the given stream f and
        return the length of that encoding.
 10-18: These lines define the decoding algorithm.


4.3 Delta Instruction Coding Format

    Delta instructions are encoded as control bytes with associated data.
    Each control byte is an index into an instruction code table of 256
    entries. Below are the relevant data structures:

     1. typedef struct _vcdinst_s
     2. {   uchar_t    type;   /* instruction type           */
     3.     uchar_t    size;   /* >0 if size is coded here   */
     4.     uchar_t    mode;   /* address mode for COPYs     */
     5. } Vcdinst_t;

     6. typedef struct _vcdcode_s
     7. {   Vcdinst_t  inst1;  /* first instruction          */
     8.     Vcdinst_t  inst2;  /* second instruction         */
     9. } Vcdcode_t;

    10. typedef struct _vcdtable_s
    11. {   uchar_t    s_same; /* s_same*256 addresses       */
    12.     uchar_t    s_near; /* s_near addresses           */
    13.     Vcdcode_t  code[256]; /* the instruction codes   */

                                   -10-

RFC 4   VCDIFF Generic Differencing and Compression Data Format   June 2001


    14. } Vcdtable_t;

COMMENTS.

   1-5: An instruction is defined by its type, the size of the
        associated data and, in the case of a COPY, the mode of
        how the address is encoded.
   6-9: As shown, a code in the range [0-255] can encode up to
        two instructions. The default Vcdiff code table uses this
        feature to merge adjacent ADD and COPY instructions with
        small sizes.
 10-14: These lines define the encoding table. Addresses may be
        encoded against two optional caches, same and near,
        to be described later.

    Below are the values identifying the instructions:

        #define VCD_NOOP  0      /* not an instruction       */
        #define VCD_ADD   1      /* an ADD instruction       */
        #define VCD_RUN   2      /* a  RUN instruction       */
        #define VCD_COPY  3      /* a COPY instruction       */

    Each instruction has a size s of the data involved.  If the field
    Vcdinst_t.size is non-zero, it is the value for s.  Otherwise,
    s is encoded next in the instruction dataset as a variable-sized
    integer.  If the instruction is a COPY, the copy address will follow
    next in the instruction dataset. Its encoding depends on some
    addressing scheme to be discussed next.

    A COPY address is encoded using different modes as indicated by
    the values of the field Vcdinst_t.mode. Vcdiff allows this field
    to have values in the range [0-15]. The first two values are:

        #define VCD_SELF  0
        #define VCD_HERE  1

    If Vcdtable_t.s_near is positive, the next s_near values
    indicate addresses coded using the "near" cache. Then,
    if Vcdtable_t.s_same is positive, the next s_same values indicate
    addresses coded using the "same" cache.

    Let "addr" be the address of a COPY instruction and "here" the
    current location in the target data. During decoding a delta
    encoding data stream, the address modes have the below meanings:

        VCD_SELF: addr is encoded separately in the next variable-sized
            integer.

        VCD_HERE: Let i be the next variable-sized integer. Then, addr
            is here-i.

        Near cache: The "near" cache keeps s_near addresses. Let m be
            the mode of the instruction and i the next variable-sized
            integer. In addition, let n be m - (VCD_HERE+1).
            Then addr is near[m] + i.

        Same cache: The "same" cache keeps s_same*256 addresses.
            Let m be the mode (Vcdinst_t.mode) of the instruction and
            b the next byte. In addition, let n be m - (VCD_HERE+1+s_near).
            Then, addr is same[m*256 + b].

                                   -11-

RFC 4   VCDIFF Generic Differencing and Compression Data Format   June 2001




    Below are the algorithms to maintain address caches.

     1. typedef struct _cache_s
     2. {   int*  near;      /* array of size s_near        */
     3.     int   s_near;
     4.     int   n;         /* the circular index for near */
     5.     int*  same;      /* array of size s_same*256    */
     6.     int   s_same;
     7. } Cache_t;

     8. Cache_t* cache_open(int s_near, int s_same)
     9. {   int   i;
    10.     Cache_t* ka = malloc(sizeof(Cache_t));
    11.     ka->near = malloc(s_near*256*sizeof(int));
    12.     ka->same = malloc(s_same*sizeof(int));
    13.     ka->s_near = s_near;
    14.     ka->s_same = s_same;

    15.     for(i = 0; i < ka->s_near; ++i)
    16.          ka->near[i] = 0;
    17.     for(i = 0; i < ka->s_same*256; ++i)
    18.          ka->same[i] = 0;
    19.     ka->n = 0;
    20. }

    21. cache_update(Cache_t* ka, uint_t addr)
    22. {   if(ka->s_same > 0)
    23.         ka->same[addr % (ka->s_same*256)] = addr;
    24.     if(ka->s_near > 0)
    25.     {   ka->near[ka->n] = addr;
    26.         if((ka->n += 1) >= ka->s_near)
    27.             ka->n = 0;
    28.     }
    29. }

COMMENTS.

   1-7: These lines define the caches.
  8-20: These lines create a cache with addresses initialized to zero.
 21-29: These lines update the caches given an address.  Note that
        the "near" cache is updated in a round-robin manner.


    Below is the function to encode a COPY address:

     1. int addr_encode(Cache_t* ka, int addr, int here, int* mode)
     2. {   int  i, d, bestd, bestm;

     3.     bestd = addr; bestm = VCD_SELF;

     4.     if(sfulen(d = here-addr) < sfulen(bestd))
     5.         { bestd = d; bestm = VCD_HERE; }

     6.     for(i = 0; i < ka->s_near; ++i)
     7.     {   if((d = addr - ka->near[i] < 0)
     8.             continue;
     9.         if(sfulen(d) < sfulen(bestd))
    10.             { bestd = d; bestm = VCD_HERE+1 + i; }

                                   -12-

RFC 4   VCDIFF Generic Differencing and Compression Data Format   June 2001


    11.     }

    12.     if(sfulen(bestd) > 1 && ka->s_same > 0 &&
    13.        ks->same[d = addr%(ka->s_same*256)] == addr)
    14.         { bestd = d; bestm = (VCD_HERE+1) + ka->s_near + d/256; }

    15.     cache_update(ka,addr);
    16.     *mode = bestm;
    17.     return bestd;
    18. }

COMMENTS.

     1: This lines declare the formal arguments. "addr" is the address
        to be encoded. "here" is the current location in the target data.
        "mode" is used to return the address mode. The encoded address
        itself is returned by the function.
  3-11: "addr" is encoded as one of VCD_SELF, VCD_HERE or some "near" mode
        depending on which addressing scheme uses the least space.
        The sfulen() function computes the number of bytes required to encode
        a given integer.
 12-14: If the coded address based on the above schemes still use more than
        one byte, the "same" cache is used to see if "addr" can be encoded
        in a single byte.
    15: This line updates the address caches.


    Below is the function to decode a COPY address:

     1. int addr_decode(Cache_t* ka, int here, int type, Sfio_t* addrf)
     2. {   int  addr, m;

     3.     if(type == VCD_SELF)
     4.          addr = sfgetu(addrf);
     5.     else if(type == VCD_HERE)
     6.          addr = here - sfgetu(addrf);
     7.     else if((m = type - (VCD_HERE+1)) >= 0 && m < ka->s_near)
     8.          addr = ka->near[m] + sfgetu(addrf);
     9.     else
    10.     {    m = type - (VCD_HERE+1 + ka->s_near);
    11.          addr = ka->same[m*256 + sfgetc(addrf)];
    12.     }

    13.     cache_update(ka, addr);
    14.     return addr;
    15. }


4.4 Decoding A Target Window

    The algorithm to decode a target window is as follows:

     1. win_decode(uchar_t* tar, int n_tar, uchar_t* src, int n_src,
     2.            Sfio_t* dataf, Sfio_t* instf, Sfio_t* addrf)
     3. {    uint_t      here, size, addr, i;
     4.      Vcdinst_t  *inst;
     5.      Vcdcode_t  *code, *table = Codetable->code;
     6.      Vcdcache_t *ka = cache_open(Codetable->s_near,Codetable->s_same);

     7.      for(here = 0; here < n_tar; )

                                   -13-

RFC 4   VCDIFF Generic Differencing and Compression Data Format   June 2001


     8.      {    code = &table[sfgetc(instf)];
     9.           for(i = 1; i <= 2; ++i)
    10.           {    inst = i == 1 ? &code->inst1 : &code->inst2;
    11.                if(inst->type == VCD_NOOP)
    12.                     continue;
    13.                if((size = inst->size) == 0)
    14.                     size = sfgetu(instf);
    15.                if(inst->type == VCD_ADD)
    16.                {    for(; size > 0; --size)
    17.                          tar[here++] = sfgetc(dataf);
    18.                }
    19.                else if(inst->type == VCD_RUN)
    20.                {    int  c = sfgetc(dataf);
    21.                     for(; size > 0; --size)
    22.                          tar[here++] = c;
    23.                }
    24.                else if(inst->type == VCD_COPY)
    25.                {    uchar_t* from;
    26.                     addr = addr_decode(&ka,here,inst->mode,addrf);
    27.                     from = addr < nsrc ? src+addr : tar+addr-nsrc;
    28.                     for(; size > 0; --size)
    29.                          tar[here++] = *from++;
    30.                }
    31.           }
    32.      }
    33. }

COMMENTS.

     5: "table" is initialized to be the instruction code table to be used.
     6: This line initializes the address caches.
     8: This line reads a control byte from the instruction dataset and gets
        the corresponding code table to decode instructions.
  9-32: These lines process the given pair of delta instructions. Note that
        the data for VCD_ADD and VCD_RUN are read from the raw dataset.


5.  INSTRUCTION CODE TABLES

    Delta instructions are encoded based on some instruction code table.
    Vcdiff provides a default instruction code table. As noted in Section
    4.1.1, an application can change this table in the Header section of
    a delta file.

5.1 The Default Instruction Code Table

    A COPY instruction with small data size may use more encoding space than
    the data itself. Thus, it is good to require COPY instructions to have
    some minimum sizes. In turn, a code table can take advantage of such
    a requirement to enhance compactness. Vcdiff provides a default instruction
    code table with the below characteristics:

        s_near: the near cache has 4 entries.
        s_same: the same cache has 3*256 entries.
        COPY: the minimum size for a COPY is 4.

    Thus, there are 9 address modes for a COPY instruction. The first two
    are VCD_SELF and VCD_HERE. Following these are the 4 modes for addresses
    coded against the near cache. The last three modes are for addresses
    coded against the same cache.

                                   -14-

RFC 4   VCDIFF Generic Differencing and Compression Data Format   June 2001



    Table 2 below summarizes the code in the default table. The first 6
    columns describe the pairs of instructions per code and the last
    column shows the index range in the code table for these pairs.
    The single RUN instruction shown on the first row always encodes
    the size separately. The second row shows the single ADD instructions.
    The ADD instruction with size 0 (i.e., its size is coded separately
    as a variable-sized integer) has the code index 1. ADD instructions
    with sizes from 1 to 17 use code indices 2 to 18 and their sizes
    will not be separately encoded. The last row shows instruction pairs
    where the first instruction is a COPY while the second is an ADD.
    In this case, only a COPY instruction of size 4 immediately followed
    by an ADD instruction of size 1 would be coded as a pair. The code
    indices for such pairs range from 247 to 255.


           TYPE    SIZE     MODE    TYPE     SIZE     MODE     INDEX
        ---------------------------------------------------------------
          RUN       0         -     NOOP       -        -        0
          ADD   0, [1,17]     -     NOOP       -        -      [1,18]
         COPY   0, [4,18]     0     NOOP       -        -     [19,34]
         COPY   0, [4,18]     1     NOOP       -        -     [35,50]
         COPY   0, [4,18]     2     NOOP       -        -     [51,66]
         COPY   0, [4,18]     3     NOOP       -        -     [67,82]
         COPY   0, [4,18]     4     NOOP       -        -     [83,98]
         COPY   0, [4,18]     5     NOOP       -        -     [99,114]
         COPY   0, [4,18]     6     NOOP       -        -    [115,130]
         COPY   0, [4,18]     7     NOOP       -        -    [131,146]
         COPY   0, [4,18]     8     NOOP       -        -    [147,162]
          ADD     [1-4]       -     COPY     [4-6]      0    [163,174]
          ADD     [1-4]       -     COPY     [4-6]      1    [175,186]
          ADD     [1-4]       -     COPY     [4-6]      2    [187,198]
          ADD     [1-4]       -     COPY     [4-6]      3    [199,210]
          ADD     [1-4]       -     COPY     [4-6]      4    [211,222]
          ADD     [1-4]       -     COPY     [4-6]      5    [223,234]
          ADD     [1-4]       -     COPY       4        6    [235,238]
          ADD     [1-4]       -     COPY       4        7    [239,242]
          ADD     [1-4]       -     COPY       4        8    [243,246]
         COPY       4       [0-8]    ADD       1        -    [247,255]
        ---------------------------------------------------------------

                                Table 2

5.2 Instruction Code Table Encoding

    The format of an instruction code table is as follows:

        Size of near cache - uchar_t
        Size of same cache - uchar_t
        Compressed table data

    Since each instruction is 3 bytes, an instruction code table can be
    represented by a string of length 2*3*256 or 1536 bytes. For compact
    storage, this string is compared against the string of the default
    table to generate an encoding in the same Vcdiff format.

    Two functions tbl2str() and str2tbl() are used for converting between
    a code table and its code string.  Below is the description of tbl2str().
    str2tbl() is just as straightforward so its description will be omitted.
    Note that bytes of the same type are grouped together to induce more

                                   -15-

RFC 4   VCDIFF Generic Differencing and Compression Data Format   June 2001


    matching and increase compression.

     1. tbl2str(Vcdcode_t* tab, uchar_t data[6*256])
     2. {    int  i, n;
     3.      n = 0;
     4.      for(i = 0; i < 256; ++i)
     5.           data[n++] = tab[i].inst1.type;
     6.      for(i = 0; i < 256; ++i)
     7.           data[n++] = tab[i].inst2.type;
     8.      for(i = 0; i < 256; ++i)
     9.           data[n++] = tab[i].inst1.size;
    10.      for(i = 0; i < 256; ++i)
    11.           data[n++] = tab[i].inst2.size;
    12.      for(i = 0; i < 256; ++i)
    13.           data[n++] = tab[i].inst1.mode;
    14.      for(i = 0; i < 256; ++i)
    15.           data[n++] = tab[i].inst2.mode;
    16. }


    Below is the algorithm to reconstruct an instruction code table:

     1. Vcdtable_t* gettable(uchar_t* data, int n_data)
     2. {   uchar_t    src[6*256], *tar;
     3.     int        n_tar;
     3.     Vcdtable_t *table;

     4.     table = malloc(sizeof(Vcdtable_t));
     5.     table->s_near = *data++;
     6.     table->s_same = *data++;

     7.     tbl2str(Dflttable, src);
     8.     n_tar = win_inflate(&tar, src, 6*256, data, n_data-2);
     9.     table->code = str2tbl(tar);
    10. }


COMMENTS.

   4-6: These lines allocate memory for the new code table and initialize
        the sizes of the near and same caches using the given data.
   7:   This line constructs the string corresponding to the default
        instruction code table.
   8-9: These lines construct the new table.




6. FURTHER ISSUES

    There are two issues not yet addressed:

    Secondary compressors:
        As discussed in Section 4.1.1, certain sections in the delta
        encoding of a window may be further compressed by a secondary
        compressor. In our experience, the basic Vcdiff format is adequate
        for most purposes. So ordinarily, secondary compressors are
        not needed. However, such compressors can be occasionally
        useful for further compressing the unmatched data in RUN and ADD
        instructions when Vcdiff is used only for compression.

                                   -16-

RFC 4   VCDIFF Generic Differencing and Compression Data Format   June 2001



        We reserve the following values for the indices of a few common
        compressors:

            #define VC_HUFFMAN   1    /* Huffman encoding       */
            #define VC_ARITH     2    /* arithmetic encoding    */
            #define VC_SPLAY     3    /* splay tree encoding    */

        The formats of the compressed data via such compressors or any
        compressors that may be defined in the future are left open to
        their implementations.

    Large file system vs. small file system:
        As discussed in Section 4.1.2, a target window in a large file
        may be compared against some source window in another file or
        in the same file (from some earlier part). In that case, the file
        offset of the source window is specified as a variable-sized
        integer in the delta encoding. There is a possibility that the
        encoding was computed on a system supporting much larger files
        than that in a system where the data may be decoded (e.g., 64-bit
        file systems vs. 32-bit file systems). In that case, some target
        data may not be recoverable. This state is detectable when such a
        large integer is encountered during decoding.




7.  SUMMARY

    We have described a general and portable encoding format for compression
    and differencing. The format is compact. For compression only, using
    the same LZ-77 string parsing strategy and without any secondary compressors,
    the typical compression rate is better than Unix compress and close to gzip.
    For differencing, the same data format is better than all known methods.
    Ignoring secondary compressor issues, the decoding algorithms run in linear
    time and require working space proportional to window sizes.



8.  ACKNOWLEDGEMENTS

    Thanks are due to Balachander Krishnamurthy, Jeff Mogul and Arthur Van Hoff
    who provided much encouragement to publicize Vcdiff.




9.  SECURITY CONSIDERATIONS

    Vcdiff only provides a format to encode compressed and differenced data.
    It does not address any issues concerning how such data are, in fact,
    stored in a given file system or the run-time memory of a computer system.
    Therefore, we do not anticipate any security issues with respect to Vcdiff.








                                   -17-

RFC 4   VCDIFF Generic Differencing and Compression Data Format   June 2001


10. SOURCE CODE AVAILABILITY

    Vcdiff is implemented as a data transforming method in Phong Vo's
    Vcodex library. AT&T Corp. has made the source code for Vcodex available
    for anyone to use to transmit data via HTTP1.1 Delta Encoding.
    The source code and according license is accessible at the below URL:

          http://www.research.att.com/sw/tools




11. REFERENCES

    [1] D.G. Korn and K.P. Vo, Vdelta: Differencing and Compression,
        Practical Reusable Unix Software, Editor B. Krishnamurthy,
        John Wiley & Sons, Inc., 1995.

    [2] J. Ziv and A. Lempel, A Universal Algorithm for Sequential Data
        Compression, IEEE Transactions on Information Theory,
        23(3):337-343, May 1977.

    [3] W. Tichy, The String-to-String Correction Problem with Block Moves,
        ACM Transactions on Computer Systems, 2(4):309-321, November 1984.

    [4] E.M. McCreight, A Space-Economical Suffix Tree Construction
        Algorithm, Journal of the ACM, 23:262-272, 1976.

    [5] J.J. Hunt, K.P. Vo, W. Tichy, An Empirical Study of Delta
        Algorithms, IEEE Software Configuration and Maintenance Workshop,
        1996.

    [6] G.S. Fowler, D.G. Korn, K.P. Vo, Sfio: A buffered I/O Library,
        Accepted for publication in Software Practice & Experience, 1999.








12. AUTHOR'S ADDRESS

    Kiem-Phong Vo (main contact)
    AT&T Labs, Room D223
    180 Park Avenue
    Florham Park, NJ 07932

    Phone: 973-360-8630
    Email: kpv@research.att.com

    David G. Korn
    AT&T Labs, Room D237
    180 Park Avenue
    Florham Park, NJ 07932

    Phone: 973-360-8602
    Email: dgk@research.att.com


                                   -18-

RFC 4   VCDIFF Generic Differencing and Compression Data Format   June 2001





13. APPENDIX: SECTION LAY-OUT IN A DELTA FILE

   For reference convenience, the lay-out of the various sections
   in a delta file are summarized below. An indented section means further
   refinement of the item above it.

        Header
                (0x56 | (1<<7)) - uchar_t
                (0x43 | (1<<7)) - uchar_t
                (0x44 | (1<<7)) - uchar_t
                (0)             - uchar_t
                Indicator - uchar_t
                [Secondary compressor - uchar_t]
                [Length of instruction code table data - uint_t]
                [Instruction code table data]
                        Size of near cache - uchar_t
                        Size of same cache - uchar_t
                        Compressed table data
        Window1
                Indicator - uchar_t
                [Source segment size - uint_t]
                [Source segment position - uint_t]
                Length of the delta encoding - uint_t
                The delta encoding
                        Length of target window - uint_t
                        Indicator - uchar_t
                        Length of ADD and RUN data section - uint_t
                        Length of instructions section - uint_t
                        Length of COPY addresses section - uint_t
                        Data section for ADD and RUN instructions
                        Instructions section
                        Addresses section for COPY instructions
        Window2
        ...








14.  EXPIRATION DATE

    December 29, 2001













                                   -19-