Simple Diff Algorithm
draft-cheney-simplediff-00

Document Type Active Internet-Draft (individual)
Last updated 2017-10-03
Stream (None)
Intended RFC status (None)
Formats plain text pdf html bibtex
Stream Stream state (No stream defined)
Consensus Boilerplate Unknown
RFC Editor Note (None)
IESG IESG state I-D Exists
Telechat date
Responsible AD (None)
Send notices to (None)
INTERNET-DRAFT
EXPIRATION 6 April 2017

Independent Submission                                         A. Cheney
Category: Informational                                   prettydiff.com
                                                            October 2017

                         Simple Diff Algorithm
                     draft-cheney-simplediff-00.txt

Abstract

   This document describes a simplified algorithm for performing
   computational comparisons.  The design goal is to achieve an approach
   that uses the smallest amount of logic.  Necessary secondary goals
   include an approach that is language agnostic, fast, extensible, and
   more predictable than other similar algorithms.

Copyright Notice

   Copyright (c) 2017 IETF Trust and the persons identified as the
   document authors.  All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents
   (http://trustee.ietf.org/license-info) in effect on the date of
   publication of this document.  Please review these documents
   carefully, as they describe your rights and restrictions with respect
   to this document.

   This Internet-Draft is submitted in full conformance with the
   provisions of BCP 78 and BCP 79.  Internet-Drafts are working
   documents of the Internet Engineering Task Force (IETF).  Note that
   other groups may also distribute working documents as
   Internet-Drafts.  The list of current Internet-Drafts is at
   http://datatracker.ietf.org/drafts/current.

   Internet-Drafts are draft documents valid for a maximum of six months
   and may be updated, replaced, or obsoleted by other documents at any
   time.  It is inappropriate to use Internet-Drafts as reference
   material or to cite them other than as "work in progress."

Comments

   Comments are solicited and should be addressed to
   info@prettydiff.com.

Cheney                      Informational                       [Page 1]

RFC XXXX                Simple Diff Algorithm               October 2017

Table of Contents

   1. Introduction ....................................................3
   2. Understanding how to compare ....................................3
      2.1. It is more about the equality than the differences .........3
      2.2. Algorithm quality ..........................................4
      2.3. Speed ......................................................4
      2.4. Output format ..............................................6
   3. How the algorithm works at a high level .........................7
      3.1. Getting by with a hash map and some counts .................7
      3.2. The third and final pass through the data ..................8
      3.3. The "fix" function ........................................11
      3.4. Done ......................................................16
   4. Extensibility ..................................................23
   5. Security considerations ........................................23
   6. IANA considerations ............................................23
   7. References .....................................................23
   8. Author's address ...............................................23

Cheney                      Informational                       [Page 2]

RFC XXXX                Simple Diff Algorithm               October 2017

1. Introduction

   There are many comparison algorithms available, which are commonly
   referred to as "diff" algorithms after the similarly named Unix
   application.  While there are many algorithms already available none
   achieve speed, precision, and simplicity in an easily predictable
   manner.

   The most commonly fault of many diff algorithms is that they solve
   for the incorrect problem.  Instead of accomplishing the goal, to
   find how two samples differ, they instead will try to solve for a
   common computer science problem called "Longest Common Subsequence
   (LCS)".  The LCS problem seeks to find the longest blocks of equality
   in samples of comparison, which sounds ideal.  Unfortunately, the
   execution of this approach is not predictable as it is bound to the
   commonality of the compared samples.  Furthermore, it isn't simple or
   fast.  Worse still is that this approach is a precision failure in
   situations where differences are interspersed across frequent
   repetition.

2. Understanding how to compare

2.1. It is more about the equality than the differences

   A good diff algorithm will attempt to identify as much equality as
   possible.  Everything else qualifies as differences.  The metric for
   quality and precision is a smaller count of captured differences.
   The smaller this number the better, presuming differences aren't
   escaping undetected.

   False negatives, which is allowing differences through without
Show full document text