Network Services Research Lab                  David Korn and  Kiem-Phong Vo
                                                                   AT&T Labs
                                               Submission: February 23, 1999
                                               Expiration:   August 23, 1999


        The VCDIFF Generic Differencing and Compression Data Format
                       <draft-korn-vcdiff-00.txt>


Status of this Memo

    This document is an Internet-Draft and is in full conformance
    with all provisions of Section 10 of RFC2026.

    Internet-Drafts are working documents of the Internet Engineering
    Task Force (IETF), its areas, and its working groups.  Note that
    other groups may also distribute working documents as
    Internet-Drafts.

    Internet-Drafts are draft documents valid for a maximum of six
    months and may be updated, replaced, or obsoleted by other
    documents at any time.  It is inappropriate to use Internet-
    Drafts as reference material or to cite them other than as
    "work in progress."

    The list of current Internet-Drafts can be accessed at
    http://www.ietf.org/ietf/1id-abstracts.txt

    The list of Internet-Draft Shadow Directories can be accessed at
    http://www.ietf.org/shadow.html.


Abstract

    This memo describes a general and efficient data format suitable
    for encoding compressed and/or differencing data so that they can
    be easily transported among computers.

Table of Contents:

    1.  Executive Summary ............................................  1
    2.  Delta Instructions ...........................................  3
    3.  Vcdiff Encoding Format .......................................  4
    4.  Instruction Code Tables ...................................... 10
    5.  Compression and Differencing Algorithms ...................... 15
    6.  Summary ...................................................... 19
        ACKNOWLEDGEMENTS ............................................. 19
        REFERENCES ................................................... 19
        AUTHOR'S ADDRESS ............................................. 20

1.  Executive Summary

    Storage and transmission of files and file versions can be greatly
    improved via the use of compression and differencing techniques.
    Since files are often transported across machines with distinct
    architectures and performance characteristics, encoded data should be
    in a form that is portable and easy to decode. This document describes
    Vcdiff, a new encoding format that satisfies these requirements.
    The format enables the construction of a single efficient decoder
    that is independent of encoding algorithms.

    Data differencing is the process of computing a compact and invertible

RFC 0   VCDIFF Generic Differencing and Compression Data Fornmat   Feb 1999


    encoding of a target file given a source file.  Data compression is
    similar but without the use of source data.  The UNIX utilities diff,
    compress, and gzip are well-known examples of data differencing and
    compression tools.  For data differencing, the computed encoding is
    called a  delta file, and, for data compression, it is called a
    compressed file.  Delta and compressed files are good for storage
    and transmission because they are often much smaller than the original
    data files.

    Data differencing and data compression are traditionally treated as
    distinct types of data processing.  However, compression can be thought
    of as a special case of differencing in which the source data is empty.
    This blending of differencing and compression was first introduced in
    the Vdelta tool [1] by Korn and Vo.  The basic idea was to unify the
    string parsing scheme used in the Lempel-Ziv'77 style compressors [2],
    and the block-move technique of Tichy [3].  Loosely speaking, this works
    as follows:

        1.  Concatenate source and target data.
        2.  Process the resulting data from left to right using a parsing
                scheme similar to LZ'77.
        3.  Start to output when the target data is reached.

    Parsing can be implemented on top of various string matching algorithms,
    for example using suffix trees [4], each with different space and time
    performance characteristics.  Vdelta uses an efficient string matching
    algorithm previously shown to compare well against other well-known
    techniques as shown in [5].  However, even with the relatively efficient
    memory usage for Vdelta, memory requirement for string matching can be
    prohibitive when processing large files.  To handle this, data
    compressors and differencers typically partition input files into
    chunks called windows and process each window separately. Except for
    unpublished work by Vo, little has been done on designing effective
    windowing schemes. Current techniques, including Vdelta, simply use
    windows with corresponding addresses across the source and target files.
    Such a pair of corresponding windows also often have the same size.

    From the point of view of data transporting and data storage, it is
    desirable to remove all issues concerning string matching and windowing
    algorithms from the decoding phase. This would enable client-server
    applications in which a server may serve multiple clients whose computing
    characteristics are unknown to it.  However, all current differencing and
    compressing tools, including Vdelta, fell short in this aspect since their
    data encoding formats depend on both the string matching and windowing
    algorithms that they use. To solve this problem, we propose here a new
    data encoding format, Vcdiff, for encoding delta and compressed files.
    Vcdiff achieves the below characteristics:

    1.  Algorithm genericity:
        Both the data format and the decoding algorithm are
        independent of string matching algorithms or windowing schemes.
    2.  Decoding efficiency:
        A target file can be reconstructed in time proportional to
        its size and the space used to do so is proportional to
        the maximal window size.
    3.  Output compactness:
        A delta or compressed file is compactly represented.
    4.  Data portability:
        The encoding format is free from byte order and word size issues.

                                   -2-

RFC 0   VCDIFF Generic Differencing and Compression Data Fornmat   Feb 1999



    The Vcdiff data format and the algorithms for decoding data shall be
    described next.  Since Vcdiff treats compression as a special case of
    differencing, we shall use the terms "delta file" to indicate the
    compressed output for both cases.


2.  Delta Instructions

    A large target file is partitioned into sections called windows, each
    of which is processed separately. Each target window T with length n
    may be compared against some source data segment S with length m bytes.
    A source data segment may come from the already processed part of the
    target file or it may come the source file, if one is given. It is
    assumed that there is sufficient memory so that both T and S can be
    treated as strings stored in main memory. For the purpose of addressing
    bytes in a source segment and target window, it is convenient to think
    of S and T as substrings of a superstring U formed by concatenating
    T to S like this:

        s[0]s[1]...s[m-1]t[0]t[1]...t[n-1]

    An index in S or T is referred to by its location in U.  For example,
    index i in T will be referred to as m+i.  Note also that indices start
    from zero.

    The instructions to encode and direct the reconstruction of a target
    window are called delta instructions.  There are three types:

    1.  ADD:
        This instruction type requires two arguments, a size s and a
        sequence of s bytes to be copied to reconstruct the target window.
    2.  COPY:
        This instruction type requires two arguments, a size s and an
        address p in the string U.  The arguments specify the substring
        of U that must be copied. Note that such a substring must be
        entirely contained in one of S or T.
    3.  RUN:
        This instruction type requires two arguments, a size s and a byte c
        that will be repeated s times.

    For example, given the source and target strings

        a b c d e f g h i j k l m n o p
        a b c d w x y z e f g h e f g h e f g h e f g h e f g h

    the target string can be constructed from the source string with the
    following instructions:

        COPY  4, 0
        ADD   4, w x y z
        COPY  4, 4
        COPY 16, 24

    Note that the third COPY instruction copies data from T itself since
    address 24 is position 8 in T.  Note also that parts of the data
    to be copied by this instruction will be reconstructed during the copy
    operation.  Thus, periodic sequences, i.e., sequences with regularly
    repeated subsequences, can be efficiently encoded. Although a run of

                                   -3-

RFC 0   VCDIFF Generic Differencing and Compression Data Fornmat   Feb 1999


    the same byte can be thought of as a periodic sequence with period 1,
    we provide a separate RUN instruction because for certain types of
    files such as executable code, runs of the same bytes are frequent and
    can benefit from having a compact representation.


3.  Vcdiff Encoding Format

    Algorithms will be presented in ANSI-C.  To ease the presentation,
    error checking will be generally omitted.  Certain elementary functions
    shall be assumed without formal description.  For example, get_byte()
    reads single byte from the delta file. Other like functions will be
    mentioned during the course of the presentation.

    Data will be encoded in byte units.  For portability, control data
    generated by Vcdiff shall be limited to the lower eight bits of a byte
    even on machines with larger bytes.  Bits shall be named from right to
    left so that the first bit has lowest value, 1, and the eight bit has
    value (1<<7) or 128.


3.1 Delta File Layout

    A large target file is typically processed in non-overlapping sections
    called target windows. We shall assert that each window is at most
    2^31 in length.  A target window may be processed against some source
    data segment. For compression, a source segment would be from some
    earlier part of the target file while, for differencing, it may also
    be from the source file.

    The following is the layout of a delta file.

                Header
                First window
                Second window
                ....

    The delta file starts with a 4-byte header identifying the file type,
    0xd6, 0xc3, 0xc4, 0. The first three bytes are the ASCII letters
    'V', 'C' and 'D' or-ed with the eighth bit. The fourth byte is a
    version number, currently set to zero.

    Following the 4-byte header are a number of sections, each of which
    encodes a target window.  During target file reconstruction, windows
    are processed in the order that they appear in the delta file.
    Each window section starts with the below header data:

    1.  The number of bytes used to encode this window. This includes
        all header data. If this value is zero, the reconstruction ends.
        The encoding of the value uses four bytes based on the portable
        format to encode integers in the range [0,2^31] (Section 3.2).

    2.  The size of the target window is encoded as a variable-sized
        unsigned integer (Section 3.2).

    3.  An indicator byte whose low four bits signal the type of source
        segment and the high four bits signal any change in the instruction
        code table. The bits are interpreted as follows:


                                   -4-

RFC 0   VCDIFF Generic Differencing and Compression Data Fornmat   Feb 1999


        a.  The low four bits of the indicator byte are 0 for no source
            segment, 1 for the source segment from the source file, and 2
            for from the target file. When there is a source segment,
            next in the delta file are two unsigned integers for the size
            of this segment and its position in the respective file.

        b.  The high four bits of the indicator byte after right-shifting
            4 bits are 0 for no instruction code table change, 1 for
            switching to a code table already defined, and 2 for defining
            a new code table. When the indicator bits are non-zero, next
            in the delta stream will be a byte indicating the index of
            the new table. If the indicator bits have value 2, next will
            be the data defining the new table. Note that the initial code
            table always has index 0. Section 4 discusses instruction
            tables in detail.

    Below is the algorithm to decode a target file:

     1. Vcdcode_t  *Tab;

     2. get_byte(); get_byte(); get_byte(); get_byte();
     3. for(;;)
     4. {   if((n_head = get_rint(1<<31)) == 0 || iseof())
     5.          break;

     6.     n_tar = get_uint();

     7.     if(((indicator = get_byte())&0xf) > 0)
     8.     {    n_src = get_uint();
     9.          p_src = get_uint();
    10.          src = get_source(indicator&0xf, n_src, p_src);
    11.     }
    12.     else n_src = p_src = 0;

    13.     if((indicator >>= 4) > 0)
    14.          Tab = get_codetab(indicator);

    15.     win_decode(Tab, src,n_src, tar,n_tar);
    16.     put_target(tar,n_tar);
    17. }

    Line 1 declares Tab which points to the current instruction code table.
    Line 2 reads the four-byte header.
    Line 4 reads the number of bytes for the encoding of this window.
    Line 6 read the size of a target window to be reconstructed.
    Lines 7-12 obtain the source data segment. The function get_source()
        constructs the source data segment which could be from the
        source file or from the target file depending on the value of i.
    Lines 13-14 process data to define a new instruction code table, if any.
        The function get_codetab() will be discussed in Section 4.2.
    Lines 15-16 call win_decode() to decode the target window, then
        call put_target() to write it to the target file.

    The elementary functions get_source() and put_target() shall be assumed
    without description.


3.2 Integer Encoding Formats


                                   -5-

RFC 0   VCDIFF Generic Differencing and Compression Data Fornmat   Feb 1999


    It is important that sizes of delta instructions be encoded compactly.
    Vcdiff encodes these values as unsigned integers using the variable
    sized portable format that was introduced in the Sfio library [6].
    Each unsigned integer is encoded as a sequence of bytes.  The eighth
    bit of a byte is the continuation bit, and, when on, indicates that
    the next byte is part the same integer.

    Below are the algorithms to encode and decode unsigned integers. The
    type uchar_t is the C type unsigned char and int_t is the largest
    supported integer type.  We assume that all values fit into int_t. For
    absolute data portability, any encoded value should be small enough to
    fit the int_t of a decoding machine. For example, this means that a file
    with size larger than 2^32 will not be decodable on a machine whose
    integers are restricted to 32 bits. This is reasonable since such a
    machine does not support large files anyway.

     1. #define MORE_SHIFT  7
     2. #define MORE_BIT    (1 << MORE_SHIFT)

     3. int put_uint(int_t v)
     4. {   uchar_t c[2*sizeof(int_t)], *s;

     5.     s = &c[sizeof(c)-1];
     6.     *s = v & (MORE_BIT-1);
     7.     while((v >>= MORE_SHIFT) != 0)
     8.         *--s = (v & (MORE_BIT-1)) | MORE_BIT;

     9.     put_data(s, (&c[sizeof(c)]) - s);

    10.     return  &c[sizeof(c)]-s;
    11. }

    12. int_t get_uint()
    13. {   int_t b, v;

    14.     for(v = 0;; )
    15.     {   b = get_byte();
    16.         v = (v << MORE_SHIFT) | (b & (MORE_BIT-1));
    17.         if(!(b & MORE_BIT) )
    18.             return v;
    18.     }
    19. }

    Line 9 calls put_data() to write the encoded integer to the
        delta file. This elementary function shall be assumed
        without description.


                        +---------------------+
                        | 11100000 | 00111001 |
                        +---------------------+
                          byte 0       byte 1

                                Table 1.

    Table 1 shows the variable sized encoding of 12345.  The bytes in
    the encoding are presented in binary to make clear the use of the
    continuation bit.  Omitting the continuation bit, the encoding of
    12345 as an unsigned integer uses two bytes with values 96 and 57.

                                   -6-

RFC 0   VCDIFF Generic Differencing and Compression Data Fornmat   Feb 1999



    The address of a COPY instruction is an unsigned integers with a
    known range, i.e., it cannot be larger than the current position.
    Such a value can be encoded more compactly than as done in the Sfio
    scheme. For example, if a value is known to be in the range 0-255,
    it can be encoded with a single byte. On the other hand, if the same
    value happens to be larger than 127, it would require two bytes to
    encode in the Sfio scheme. Below are the algorithms to encode
    integers with known ranges.


     1. #define BYTE_SHIFT  8
     2. #define BYTE_MASK   ((1 << BYTE_SHIFT) - 1)

     3. int put_rint(int_t v, int_t range)
     4. {   uchar_t c[sizeof(int_t)], *s;
     5.     s = &c[sizeof(c)-1];
     6.     *s = v & BYTE_MASK;
     7.     while((range >>= BYTE_SHIFT) != 0)
     8.     {   *--s = v & BYTE_MASK;
     9.         v >>= BYTE_SHIFT;
    10.     }
    11.     put_data(s, (&c[sizeof(c)]) - s);
    12.     return (&c[sizeof(c)]) - s;
    13. }

    14. int_t get_rint(int_t range)
    15. {   int_t v;
    16.     for(v = 0; range > 0; range >>= BYTE_SHIFT)
    17.         v = (v << BYTE_SHIFT) | get_byte();
    18.     return v;
    19. }


3.3 Delta Instruction Coding Format

    The delta instructions are encoded in a stream of control bytes and
    associated data. Each control byte is an index into an instruction
    code table of 256 entries whose type, Vcdcode_t, is defined below:

     1. typedef struct _vcdinst_s
     2. {   unsigned char type;  /* instruction type           */
     3.     unsigned char size;  /* >0 if size is coded here   */
     4.     unsigned char mode;  /* address mode for COPYs     */
     5. } Vcdinst_t;

     6. typedef struct _vcdcode_s
     7. {   Vcdinst_t     inst1; /* first instruction          */
     8.     Vcdinst_t     inst2; /* second instruction         */
     9. } Vcdcode_t;

    Thus, each control byte may identify up to two instructions.
    Below are the different instruction types:

     1. #define VCD_NOOP  0      /* not an instruction         */
     2. #define VCD_ADD   1      /* an ADD instruction         */
     3. #define VCD_RUN   2      /* a  RUN instruction         */
     4. #define VCD_COPY  3      /* a COPY instruction         */


                                   -7-

RFC 0   VCDIFF Generic Differencing and Compression Data Fornmat   Feb 1999


    Each instruction has a size n of the data involved. If the field
    Vcdinst_t.size is non-zero, it is the value for n.  Otherwise,
    n is encoded next in the delta file as an unsigned integer.
    Following this is respectively the n bytes of data for an ADD
    instruction, or the byte to be repeated for a RUN instruction,
    or the copying address for a COPY instruction.

    The COPY address can be encoded in different ways. The field
    Vcdinst_t.mode defines the address encoding mode:

     1. #define VCD_SELF  0      /* coded as itself            */
     2. #define VCD_HERE  1      /* coded off current position */
     3. #define VCD_SAME  2      /* index into "same" cache    */
     4. #define VCD_NEAR  3      /* index into "near" cache    */

    The values VCD_SAME and VCD_NEAR indicate two address caching
    methods designed to take advantage of the fact that different
    versions of a data source often arise from small modifications.
    As such, successive copying addresses tend to have addresses
    that are either identical or close to one another. An address A
    of a COPY instruction is coded as follows:

    1.  Let H be the current location of target data to be processed.
        If H-A is in the range [0,255], the address mode is VCD_HERE
        and A is coded in a single byte with value H-A.
    2.  256 addresses are kept in a "same" cache. The low eight bits
        of A are used to index the cache. If A matches this address,
        then the mode is VCD_SAME and A is coded in a single byte
        representing the matching index.
    3.  4 addresses are kept in a "near" cache. This cache is updated
        in a round robin manner. A is checked against each element E of
        the cache. If A-E is in the range [-127,128], the address mode
        is VCD_NEAR plus the index of E while A is coded in a single
        byte with value A-E+127.
    4.  If none of the above cases apply, the address mode is VCD_SELF
        and A is coded as an integer in the range [0-H] where H is
        the current location.

    Below are the algorithms to initialize and update address caches.

     1. typedef struct _cache_s
     2. {   int   n;
     3.     int   near[4];
     4.     int   same[256];
     5. } Cache_t;
     6. cache_init(Cache_t* ka)
     7. {    int i;
     8.      for(i = 0; i < 4; ++i)
     9.          ka->near[i] = 0;
    10.      ka->n = 0;
    11.      for(i = 0; i < 256; ++i)
    12.          ka->same[i] = 0;
    13. }
    14. cache_update(Cache_t* ka, int_t addr)
    15. {    ka->near[ka->n] = addr;
    16.      if((ka->n += 1) >= 4)
    17.          ka->n = 0;
    18.      ka->same[addr&255] = addr;
    19. }

                                   -8-

RFC 0   VCDIFF Generic Differencing and Compression Data Fornmat   Feb 1999




    Below is the function to encode a COPY address:

     1. int addr_encode(Cache_t* ka, int addr, int here, int* best)
     2. {   int  i, d, mode = -1;

     3.     if((d = (here-addr)) < 256)
     4.     {    *best = d;
     5.          mode = VCD_HERE;
     6.     }
     7.     else if(ka->same[d = addr&255] == addr)
     8.     {    *best = d;
     9.          mode = VCD_SAME;
    10.     }
    11.     else
    12.     {    for(i = 0; i < 4; ++i)
    13.          {    if((d = addr - ka->k_near[i]) >= -127 && d <= 128)
    14.               {    *best = d + 127;
    15.                    mode = VCD_NEAR+i;
    16.               }
    17.          }
    18.     }
    19.     if(mode < 0)
    20.     {    *best = addr;
    21.          mode = VCD_SELF;
    22.     }

    23.     cache_update(ka,addr);
    24.     return mode;
    25. }

    Lines 3-6 check for coding against the current location.
    Lines 7-10 check for coding against the same[] cache.
    Lines 12-17 check for coding against the near[] cache.
    Lines 19-22 encode the address as itself if no other mode was chosen.
    Lines 23-24 update the cache, then return the selected mode.


    Below is the function to decode a COPY address:

     1. int addr_decode(Cache_t* ka, int here, int type)
     2. {   int  addr;

     3.     if(type == VCD_SELF)
     4.          addr = get_rint(here);
     5.     else if(type == VCD_HERE)
     6.          addr = here - get_byte();
     7.     else if(type == VCD_SAME)
     8.          addr = ka->same[get_byte()];
     9.     else addr = ka->near[type] + get_byte() - 127;

    10.     cache_update(ka, addr);
    11.     return addr;
    12. }





                                   -9-

RFC 0   VCDIFF Generic Differencing and Compression Data Fornmat   Feb 1999


3.4 Decoding A Target Window

    The algorithm to decode a target window is as follows:

     1. win_decode(Vcdcode_t* tab,uchar_t* src,int nsrc,uchar_t* tar,int ntar)
     2. {    int_t     here, size, addr, i;
     3.      Cache_t   ka;
     4.      Vcdcode_t *code;
     5.      Vcdinst_t *inst;

     6.      cache_init(&ka);
     7.      for(here = 0; here < n_tar; )
     8.      {    code = &tab[get_byte()];
     9.           for(i = 1; i <= 2; ++i)
    10.           {    inst = i == 1 ? &code->inst1 : &code->inst2;
    11.                if(inst->type == VCD_NOOP)
    12.                     continue;
    13.                if((size = inst->size) == 0)
    14.                     size = get_uint();
    15.                if(inst->type == VCD_ADD)
    16.                {    for(; size > 0; --size)
    17.                          tar[here++] = get_byte();
    18.                }
    19.                else if(inst->type == VCD_RUN)
    20.                {    int  c = get_byte();
    21.                     for(; size > 0; --size)
    22.                          tar[here++] = c;
    23.                }
    24.                else if(inst->type == VCD_COPY)
    25.                {    uchar_t* from;
    26.                     addr = addr_decode(&ka, here, inst->mode);
    27.                     from = addr < nsrc ? src+addr : tar+addr-nsrc;
    28.                     for(; size > 0; --size)
    29.                          tar[here++] = *from++;
    30.                }
    31.           }
    32.      }
    33. }

    Line 1 declares the formal arguments to win_decode(). The argument
        "tab" defines the instruction code table.
    Line 6 initializes the address caches.
    Line 8 reads a control byte and gets the corresponding code.
    Lines 9-32 process the pair of delta instructions defined by the code.


4.  Instruction Code Tables

    Each target window is based on some instruction code table. Delta
    file compactness can be enhanced by tailoring such code tables
    to particular data characteristics.

    Vcdiff provides support for up to 256 instruction tables with
    indices in the range [0,255]. However, the first 8 indices [0,7]
    are reserved by Vcdiff for certain predefined code tables and
    should not be reset by applications.




                                   -10-

RFC 0   VCDIFF Generic Differencing and Compression Data Fornmat   Feb 1999


4.1 Decoding a Code Table in the Delta File

    To output a table to a delta file, it is necessary to encode it in
    a portable form. Further, it is desirable to do that compactly.
    To accomplish this, we first represent a table as a string. As each
    instruction consists of 3 bytes and each table entry codes 2
    instructions, the total size needed is 6*256 or 1536 bytes. Two
    functions tab2str() and str2tab() are used for converting between
    a table to a string and vice versa.  Below is the description of
    tab2str(). str2tab() is just as straightforward so its description
    will be omitted. Note that bytes of the same type are grouped together.
    This helps with the step to compress the data.


     1. tab2str(Vcdcode_t* tab, uchar_t data[6*256])
     2. {    int  i, n;
     3.      n = 0;
     4.      for(i = 0; i < 256; ++i)
     5.           data[n++] = tab[i].inst1.type;
     6.      for(i = 0; i < 256; ++i)
     7.           data[n++] = tab[i].inst2.type;
     8.      for(i = 0; i < 256; ++i)
     9.           data[n++] = tab[i].inst1.size;
    10.      for(i = 0; i < 256; ++i)
    11.           data[n++] = tab[i].inst2.size;
    12.      for(i = 0; i < 256; ++i)
    13.           data[n++] = tab[i].inst1.mode;
    14.      for(i = 0; i < 256; ++i)
    15.           data[n++] = tab[i].inst2.mode;
    16. }


    Below is the function get_codetab() to get a new instruction code
    table to decode a target window:

     1. Vcdcode_t  *Code[256];

     2. Vcdcode_t* get_codetab(int indicator)
     3. {   int      index;
     4.     uchar_t  *diff, data[1536];

     5.     index = get_byte();
     6.     if(indicator == 2)
     7.     {   diff = tab2str(Code[get_byte()]);
     8.         win_decode(Code[0], diff, 1536, data, 1536);
     9.         Code[index] = str2tab(data);
    10.     }
    11.     return Code[index];
    12. }

    Line 1 indicates that Vcdiff allows up to 256 code tables.
    Line 5 reads the index of the new code table.
    Lines 6-10 construct a new table from data in the delta stream.
        Such a table is always encoded using the reserved code table
        Code[0]. In addition, it is compared against a previously defined
        table to minimize the amount of data needed to transmit in the
        delta file.
    Line 7 reads the index of the table used to compare with the
        new table to construct the delta instructions.

                                   -11-

RFC 0   VCDIFF Generic Differencing and Compression Data Fornmat   Feb 1999


    Line 8 calls win_decode() to recompute the string representation
        of the table.
    Line 9 sets the new table in the given index.
    Line 11 returns the code table.


4.2 Reserved Instruction Code Tables

    Using COPY instructions with small sizes sometimes adversely affect
    compactness. Thus, it is advantageous to restrict COPY instructions
    to some minimum sizes. Then, code tables can be designed to take
    advantage of this restriction to optimize delta file compactness.
    Vcdiff provides two reserved instruction code tables with indices
    0 and 1 for minimium COPY sizes of 4 and 3 respectively. As noted
    earlier, the starting code table has index 0 so it is optimized
    for minimum COPY size of 4. Note also that even when a table
    optimized for a particular minimum COPY size, this does not mean that
    a COPY instruction with a smaller size cannot be encoded. It simply
    means that the size of such an instruction must be coded separately
    as an unsigned integer.


         TYPE    SIZE      MODE    TYPE    SIZE     MODE    INDEX
      -------------------------------------------------------------
      1.  RUN     0                NOOP                       0
      2.  ADD     0                NOOP                       1
      3.  ADD   [1,14]             NOOP                     [2,15]
      4. COPY     0       [0,6]    NOOP                    [16,22]
      5. COPY   [M,M+10]  [0,6]    NOOP                    [23,99]
      6. COPY     0       [0,6]     ADD   [1,4]           [100,127]
      7. COPY   [M,M+3]   [0,6]     ADD   [1,4]           [128,239]
      8. COPY   [M,M+3]    SELF    COPY   [M,M+3]   SELF  [240,255]
      -------------------------------------------------------------

                            Table 2.


    For a given minimum COPY size M, Table 2 shows how the 256 code
    values are partitioned into different categories of indices:

    1.  Index 0 defines a RUN instruction whose size is always
        coded separately as an unsigned integer. This is signified
        by having the "size" field being 0.
    2.  Index 1 defines an ADD instruction with size coded separately.
    3.  Indices 2-15 define 14 ADD instructions with sizes in the
        range [1,14].
    4.  Indices 16-22 define 7 COPY instructions for the 7 different
        address encoding modes discusses in Section 3.3. The sizes
        of these instructions are coded separately.
    5.  Indices 23-99 define 77 COPY instructions using all 7
        addressing modes and having sizes in the range [M,M+10].
    6.  Indices 100-127 defines 28 pairs of COPY and ADD instructions.
        The COPY sizes are coded separately while the ADD sizes are
        in the range [1,4].
    7.  Indices 128-239 define 112 pairs of COPY and ADD instructions
        whose sizes are in the ranges [M,M+3] and [1,4] respectively.
    8.  Finally, indices 240-255 defines 16 pairs of COPY and COPY
        instructions with sizes in the range [M,M+3] and both using
        the same VCD_SELF addressing mode.

                                   -12-

RFC 0   VCDIFF Generic Differencing and Compression Data Fornmat   Feb 1999



    Next we present the algorithms to construct an instruction code
    table as described above. Where appropriate, the algorithms also
    construct the inverted tables usable for data encoding. Any part
    of a table not explicitly set will be assumed to be initialized
    to be either zero or the null pointer as the case warranted.

     1. mk_codetab(Vcdcode_t tab[256], int add[16], int copy[15][7],
     2.              int copyadd[8][7][5], int copycopy[8][8], int M)
     3. {
     4.     tab[0].inst1.type = VCD_RUN;
     5.     tab[0].inst1.size = 0;
     6.     tab[0].inst2.type = VCD_NOOP;

     7.     mk_add(tab, add);
     8.     mk_copy(tab, copy, M);
     9.     mk_copyadd(tab, copyadd, M);
    10.     mk_copycopy(tab, copycopy, M);
    11. }

    Lines 1-2 declare the arguments for mk_codetab(). The inverted
        tables add[], copy[][], copyadd[][][], and copycopy[][] are
        designed to be indexed by sizes and address modes. The argument
        M is the minimum COPY size used for the table.
    Lines 4-6 construct the RUN instruction.
    Lines 7-10 calls various subroutines to construct different parts
        of the instruction table.


    Below is the algorithm to construct the single ADD instructions:

     1. mk_add(Vcdcode_t tab[256], int add[16])
     2. {   int s, n;

     3.     tab[1].inst1.type = VCD_ADD;
     4.     tab[1].inst1.size = 0;
     5.     tab[1].inst2.type = VCD_NOOP;
     6.     add[0] = 1;

     7.     for(n = 2, s = 1; s <= 14; ++s, ++n)
     8.     {   tab[n].inst1.type = VCD_ADD;
     9.         tab[n].inst1.size = s;
    10.         tab[n].inst2.type = VCD_NOOP;
    11.         add[s] = n;
    12.     }
    13. }

    Lines 3-6 construct the ADD instruction with size coded separately.
        Location 0 in the inverted table add[] points to this code.
    Lines 7-12 construct the ADD instructions with size in the range
        [1-14] as discussed earlier.









                                   -13-

RFC 0   VCDIFF Generic Differencing and Compression Data Fornmat   Feb 1999



    Below is the function to construct the single COPY instructions:

     1. mk_copy(Vcdcode_t tab[256], int copy[15][7], int M)
     2. {   int s, m, n;

     3.     for(n = 16, m = 0; m < 7; ++m, ++n)
     4.     {   tab[n].inst1.type = VCD_COPY;
     5.         tab[n].inst1.size = 0;
     6.         tab[n].inst1.mode = m;
     7.         tab[n].inst2.type = VCD_NOOP;
     8.         copy[0][m] = n;
     9.     }

    10.     for(s = M; s <= M+10; ++s)
    11.     {   for(m = 0; m < 7; ++m, ++n)
    12.         {   tab[n].inst1.type = VCD_COPY;
    13.             tab[n].inst1.size = s;
    14.             tab[n].inst1.mode = m;
    15.             tab[n].inst2.type = VCD_NOOP;
    16.             copy[s][m] = n;
    17.         }
    18.     }
    20. }

    Lines 3-9 construct the 7 COPY instructions with sizes coded
        separately.
    Lines 10-18 construct the 77 COPY instructions with sizes in
        the range [M,M+10].


    The below function constructs the pairs of COPY/ADD instructions:

     1. mk_copyadd(Vcdcode_t tab[256], int copyadd[8][7][5], int M)
     2. {   int s, m, a, n;
     3.     for(n = 100, m = 0; m < 7; ++m)
     4.     {   for(a = 1; a <= 4; ++a, ++n)
     5.         {   tab[n].inst1.type = VCD_COPY;
     6.             tab[n].inst1.size = 0;
     7.             tab[n].inst1.mode = m;
     8.             tab[n].inst2.type = VCD_ADD;
     9.             tab[n].inst2.size = a;
    10.             copyadd[0][m][a] = n;
    11.         }
    12.     }
    13.     for(s = M; s <= M+3; ++s)
    14.     {   for(m = 0; m < 7; ++m)
    15.         {   for(a = 1; a <= 4; ++a, ++n)
    16.             {   tab[n].inst1.type = VCD_COPY;
    17.                 tab[n].inst1.size = s;
    18.                 tab[n].inst1.mode = m;
    19.                 tab[n].inst2.type = VCD_ADD;
    20.                 tab[n].inst2.size = a;
    21.                 copyadd[s][m][a] = n;
    22.             }
    23.         }
    24.     }
    25. }


                                   -14-

RFC 0   VCDIFF Generic Differencing and Compression Data Fornmat   Feb 1999




    Finally, the algorithm to construct the 16 COPY/COPY pairs:

     1. mk_copycopy(Vcdcode_t tab[256], int copycopy[8][8], int M)
     2. {   int s, c, n;
     3.     for(n = 240, s = M; s <= M+3; ++s)
     4.     {   for(c = M; c <= M+3; ++c, ++n)
     5.         {   tab[n].inst1.type = VCD_COPY;
     6.             tab[n].inst1.size = s;
     7.             tab[n].inst1.mode = VCD_SELF;
     8.             tab[n].inst2.type = VCD_COPY;
     9.             tab[n].inst2.size = c;
    10.             tab[n].inst2.mode = VCD_SELF;
    11.             copycopy[s][c] = n;
    12.         }
    13.     }
    14. }


5.  Compression and Differencing Algorithms

    Sections 3 and 4 describe in full how to reconstruct a target file
    from data in a delta file. In this section, we discuss briefly
    general algorithms for compressors and differencers. Two abstract
    functions shall be assumed:

    win_match(): This function computes a source segment, if any, to be
        compared with a target window. It returns 0, 1 or 2 for no source
        segment, source segment from the source file, or source segment
        from the target file.
    str_match(): This function computes a string in either the source
        segment or the target window that matches with a prefix of the
        target data being processed. Such a matched string can be encoded
        as a COPY instruction.

    The performance of an encoder in terms of time as well as output
    compactness depends on what algorithms are used to implement the
    two window and string matching functions. However, as shown earlier,
    the reconstruction of a target file is completely independent of
    such choices.

    Below is the algorithm for processing a target file:

     1. put_byte(0xd6); put_byte(0xc3); put_byte(0xc4); put_byte(0);
     2. for(p_tar = 0; ; p_tar += n_tar)
     3. {   if((n_tar = get_target(&tar)) <= 0)
     4.         break;
     5.     n_delta = put_rint(0, (1<<31));
     6.     n_delta += put_uint(n_tar);
     7.     type = win_match(tar, n_tar, p_tar, &src, &n_src, &p_src);
     8.     n_delta += put_byte(type);
     9.     if(type > 0)
    10.     {   n_delta += put_uint(n_src);
    11.         n_delta += put_uint(p_src);
    12.     }
    13.     n_delta += win_encode(src,n_src, tar,n_tar);
    14.     put_delsize(n_delta);
    15. }

                                   -15-

RFC 0   VCDIFF Generic Differencing and Compression Data Fornmat   Feb 1999



    Line 1 outputs the four header bytes.
    Lines 3-4 get a target window. If there is none, the loop is terminated.
    Line 5 outputs a place holder for the total size of the delta code used
         for this window.
    Line 6 outputs the size of the target window.
    Line 7 calls win_match() to compute a source segment if any. Note
        that the current position in the target file, p_tar, is passed
        to win_match() so that it can ensure that a source segment from
        the target file will come entirely before p_tar.
    Line 8 outputs the indicator byte for the window. Here, we assume
        that the default instruction code table with index 0 is used.
    Lines 9-12 output the size and position of the source segment as needed.
    Line 13 calls win_encode() to encode the target window.
    Line 14 overwrites the value written out on Line 5 with the actual size
        of the delta. The function put_delsize() is straightforward so its
        description will be omitted.


    Below is the function to encode a target window:

     1. int win_encode(uchar_t* src, int n_src, uchar_t* tar, int n_tar)
     2. {   int      add, here, addr, s, size = 0;
     3.     Cache_t  ka;

     4.     cache_init(&ka);

     5.     for(here = 0; add = -1; here < n_tar; )
     6.     {   if( (s = chk_run(tar, n_tar, here)) > 0)
     7.              addr = -1;
     8.         else s = str_match(src,n_src,tar,n_tar,here,&addr);
     9.         if(s > 0)
    10.         {    if(add >= 0)
    11.                   size += put_add(here-add, tar+here-add);
    12.              if(addr < 0)
    13.                   size += put_run(s, tar[here]);
    14.              else size += put_copy(&ka, s, addr, here);

    15.              add = -1;
    16.              here += s;
    17.         }
    18.         else
    19.         {    if(add < 0)
    20.                   add = here;
    21.              here += 1;
    22.         }
    23.     }
    24.     if(add >= 0)
    25.          size += put_add(here-add, tar+here-add);
    26.     else size += put_copy(0,0,0,0);

    27.     return size;
    28. }

    Line 4 initializes the address caches.
    Lines 6-8 check to see if there is a RUN or a COPY. In the latter case,
        str_match() returns the address of the COPY in addr. The function
        chk_run() is straightforward to implement so its description will
        be omitted.

                                   -16-

RFC 0   VCDIFF Generic Differencing and Compression Data Fornmat   Feb 1999


    Lines 9-17 output the appropriate delta instruction.
    Lines 19-22 get executed if there is no run or match. In this case, data
        is collected for an ADD instruction.
    Lines 24-25 output the last unmatched part of the target window in an
        ADD instruction.
    Line 26 forces the last saved COPY instruction (if any) to be output.
        See the function put_copy() below.
    Line 27 returns the number of bytes used for encoding the data.

    To finish up the encoding algorithm, we need to describe the functions
    put_add(), put_run() and put_copy(). We shall assume that the default
    code instruction table 0 is used. The below code declares the tables
    and define a few convenience macro functions.

     1. Vcdcode_t Tab[256];
     2. Vcdcode_t *Add[16];
     3. Vcdcode_t *Copy[15][7];
     4. Vcdcode_t *Copyadd[8][7][5];
     5. Vcdcode_t *Copycopy[8][8];

     6. #define INDEXADD(a)     ((a >= 1 && a <= 14) ? a : 0)
     7. #define INDEXCOPY(c)    ((c >= 4 && c <= 14) ? c : 0)
     8. #define INDEXCOPYADD(c) ((c >= 4 && c <=  7) ? c : 0)
     9. #define ISTINYADD(a)    ((a >= 1 && a <= 4) ? 1 : 0)
    10. #define ISTINYCOPY(c)   ((c >= 4 && c <= 7) ? 1 : 0)

    Lines 6-8 define functions to compute the index into the tables
        Add, Copy, and Copyadd for some given ADD or COPY size.
    Lines 9-10 define functions to test if a given size is small. These
        are used to determine if certain pair of instructions could be
        coded using a single byte code.


    Below is the function to output a COPY instruction.

     1. int Size, Addr, Here, Mode;

     2. int put_copy(Cache_t* ka, int size, int addr, int here)
     3. {   int code, mode = 0, best = 0, ndel = 0;

     4.     if(size > 0)
     5.         mode = addr_encode(ka, addr, here, &best);

     6.     if(Size > 0)
     7.     {   if(Mode == mode && mode == VCD_SELF &&
     8.            ISTINYCOPY(Size) && ISTINYCOPY(size) )
     9.         {    code = Copycopy[Size][size];
    10.              ndel += put_byte(code);
    11.              ndel += put_rint(Addr, Here);
    12.              ndel += put_rint(addr, here);
    13.              Size = -1;
    14.              return ndel;
    15.         }
    16.         code = Copy[INDEXCOPY(Size)][Mode];
    17.         ndel += put_byte(code);
    18.         if(Tab[code].inst1.size == 0)
    19.              ndel += put_uint(Size);
    20.         if(Mode == VDC_SELF)
    21.              ndel += put_rint(Addr,Here);

                                   -17-

RFC 0   VCDIFF Generic Differencing and Compression Data Fornmat   Feb 1999


    22.         else ndel += put_byte(Addr);
    23.     }

    24.     Size = size;
    25.     Addr = best;
    26.     Mode = mode;
    27.     Here = here;

    28.     return ndel;
    29. }

    Line 1 defines variables to save a COPY instruction so that it
        may be coded jointly with another instruction later.
    Lines 4-5 compute the mode and value to encode the given COPY
        address if there is one (i.e., "size" is positive).
    Lines 6-23 process a previously saved COPY instruction.
    Lines 7-15 output a code that encode both COPY instructions.
    Lines 16-22 output the previously saved COPY instruction singly.
    Lines 24-27 save the current COPY instruction if it has not
        already output as discussed.


    Below is the function to output an ADD instruction:

     1. int put_add(int size, uchar_t* data)
     2. {   int code, ndel = 0;

     3.     if(Size > 0)
     4.     {   if(ISTINYADD(size))
     5.         {    code = Copyadd[INDEXCOPYADD(Size)][Mode][size];
     6.              ndel += put_byte(code);
     7.              if(Tab[code].inst1.size == 0)
     8.                   ndel += put_uint(Size);
     9.              if(Mode == VCD_SELF)
    10.                   ndel += put_rint(Addr,Here);
    11.              else ndel += put_byte(Addr);
    12.              for(; size > 0; --size, ++data)
    13.                   ndel += put_byte(*data);
    14.              return ndel;
    15.         }
    16.         else ndel += put_copy(0,0,0,0);
    17.     }

    18.     code = Add[INDEXADD(size)];
    19.     ndel += put_byte(code);
    20.     if(Tab[code].inst1.size == 0)
    21.         ndel += put_uint(size);
    22.     ndel += size;
    23.     for(; size > 0; --size, ++data)
    24.         put_byte(*data);

    25.     return ndel;
    26. }

    Lines 3-17 consider the case when there is a previously saved
        COPY instruction. In that case, if the ADD size is small,
        a single code is output for both COPY and ADD instructions.
        Otherwise, the COPY instruction is output by itself.
    Lines 18-24 output an ADD instruction by itself.

                                   -18-

RFC 0   VCDIFF Generic Differencing and Compression Data Fornmat   Feb 1999



    Below is the function to output a RUN instruction:

     1. int put_run(int size, int byte)
     2. {   int ndel = 0;
     3.     if(Size > 0)
     4.         ndel += put_copy(0,0,0,0);
     5.     ndel += put_byte(0);
     6.     ndel += put_uint(size);
     7.     ndel += put_byte(byte);
     8.     return ndel;
     9. }

    Lines 3-4 output a previously saved COPY instruction, if any.
    Lines 5-7 output the RUN instruction.


6.  Summary

    We have described a general and portable data format for compression
    differencing. Reference [7] shows that the format is compact. For
    compression, it is better than Unix compress and close to gzip. For
    differencing, it is better than all known methods. The decoding
    algorithms run in linear time and require working space proportional
    to window sizes. This makes it suitable for data communication
    among computers.


ACKNOWLEDGEMENTS
    Thanks are due to Balachander Krishnamurthy, Jeff Mogul and Arthur
    Van Hoff who provided much encouragement to publicize Vcdiff.


REFERENCES

    [1] D.G. Korn and K.P. Vo, Vdelta: Differencing and Compression,
        Practical Reusable Unix Software, Editor B. Krishnamurthy,
        John Wiley & Sons, Inc., 1995.

    [2] J. Ziv and A. Lempel, A Universal Algorithm for Sequential Data
        Compression, IEEE Transactions on Information Theory,
        23(3):337-343, May 1977.

    [3] W. Tichy, The String-to-String Correction Problem with Block Moves,
        ACM Transactions on Computer Systems, 2(4):309-321, November 1984.

    [4] E.M. McCreight, A Space-Economical Suffix Tree Construction
        Algorithm, Journal of the ACM, 23:262-272, 1976.

    [5] J.J. Hunt, K.P. Vo, W. Tichy, An Empirical Study of Delta
        Algorithms, IEEE Software Configuration and Maintenance Workshop,
        1996.

    [6] G.S. Fowler, D.G. Korn, K.P. Vo, Sfio: A buffered I/O Library,
        Accepted for publication in Software Practice & Experience, 1999.

    [7] D.G. Korn and K.P. Vo, A Generic Differencing and Compression
        Data Format, Technical Report AT&T Labs HA1630000-021899-02TM,
        February 1999.

                                   -19-

RFC 0   VCDIFF Generic Differencing and Compression Data Fornmat   Feb 1999





AUTHOR'S ADDRESS

    Kiem-Phong Vo (main contact)
    AT&T Labs, Room D223
    180 Park Avenue
    Florham Park, NJ 07932

    Phone: 973-360-8630
    Email: kpv@research.att.com


    David G. Korn
    AT&T Labs, Room D237
    180 Park Avenue
    Florham Park, NJ 07932

    Phone: 973-360-8602
    Email: dgk@research.att.com


EXPIRATION DATE

    August 23, 1999


































                                   -20-