Network Working Group                                        M. Schwartz
Internet-Draft                                     Code On The Road, LLC
Expires: April 7, 2002                                   October 7, 2001


      The ANTACID Replication Service: Rationale and Architecture
                   draft-schwartz-antacid-service-00

Status of this Memo

   This document is an Internet-Draft and is in full conformance with
   all provisions of Section 10 of RFC 2026 except that the right to
   produce derivative works is not granted.  (If this document becomes
   part of an IETF working group activity, then it will be brought into
   full compliance with Section 10 of RFC 2026.)

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF), its areas, and its working groups.  Note that
   other groups may also distribute working documents as Internet-
   Drafts.

   Internet-Drafts are draft documents valid for a maximum of six months
   and may be updated, replaced, or obsoleted by other documents at any
   time.  It is inappropriate to use Internet-Drafts as reference
   material or to cite them other than as "work in progress."

   The list of current Internet-Drafts can be accessed at
   http://www.ietf.org/ietf/1id-abstracts.txt.

   The list of Internet-Draft Shadow Directories can be accessed at
   http://www.ietf.org/shadow.html.

   This Internet-Draft will expire on April 7, 2002.

Copyright Notice

   Copyright (C) The Internet Society (2001).  All Rights Reserved.

Abstract

   This memo presents the ANTACID Replication Service, which replicates
   hierarchically named repositories of XML documents for business-
   critical, internetworked applications.

   ASCII and HTML versions of this document are available at
   http://www.codeontheroad.com/papers/draft-schwartz-antacid-
   service.txt and http://www.codeontheroad.com/papers/draft-schwartz-
   antacid-service.html, respectively.

Table of Contents

   1.     Introduction
   1.1    Example Applications
   1.2    Caching vs. Replication and Lazy vs. Eager Update Propagation
   1.3    Currently Deployed Replication Technology
   1.4    Requirements
   2.     Terminology
   3.     Overview of the Approach
   4.     Name Space
   5.     Replication Topology
   6.     Update Ordering
   7.     Atomicity, Consistency, and Reliability
   7.1    Eventual Consistency
   7.2    ANTACID Semantics
   7.3    Locking and Deadlock Prevention
   7.4    Collapsing Update Semantics
   7.5    Submitted Update Visibility
   7.6    Caching Partial Repository Contents
   7.7    Asynchronous Submission and the Client-Server Promise
   8.     Conflict Detection
   8.1    Write-Write Conflicts
   8.2    Read-Write Conflicts
   9.     Transmission Optimization: Encodings
   10.    Security
   10.1   Authentication
   10.2   Confidentiality
   10.3   Integrity
   10.4   Access Control
   11.    Datastore Interoperation
   12.    Summary: ARS's Realm of Applicability
   12.1   Strengths
   12.2   Limitations
   12.2.1 Architectural Limitations
   12.2.2 Completeness Limitations
   13.    Acknowledgements
          References
          Author's Address
   A.     Extended List of Potential ARS Applications
          Full Copyright Statement


1. Introduction

   In this document we present the motivation, design rationale, and
   architecture for the ANTACID Replication Service (ARS).  ARS
   replicates repositories of hierarchically-named, well-formed XML
   documents [1] in a manner that supports business-critical,
   internetworked applications (BCIAs).  The ARS protocol and
   algorithmic details are defined in a companion document [2].

   By "business-critical" we mean applications requiring stronger data
   coherence guarantees than file-by-file replication, but weaker than
   global ACID semantics (i.e., Atomic, Consistent, Isolated, Durable
   [3] semantics spanning all replicas).  Our motivation is that many
   commercial services require coherence guarantees for replicated data,
   but that global ACID semantics are not appropriate across the
   Internet because:

   o  global ACID semantics don't scale (a point we will discuss in more
      depth later); and

   o  applications requiring global ACID semantics (e.g., banking) often
      require privately owned, centrally controlled infrastructure
      rather than the open Internet.

   The "ANTACID" part of ARS refers to data coherence semantics we
   define later in this document, which we believe are well suited to
   BCIAs.

   By "internetworked" we mean applications characterized by:

   o  up to millions of servers;

   o  administrative distribution;

   o  both planned and unplanned server and network outages;

   o  a wide range of network capacity; and

   o  a wide range of provisioning requirements concerning timeliness of
      updates, security, and other operational issues.

   To date, much of the replication technology that provides business-
   critical support has developed in the relational database (RDBMS)
   world.  This world has produced technology that provides significant
   structure (well defined data models, integrity constraints, atomic
   update, and serial audit trails), but that does not work well in
   internetworked environments.  For example, RDBMSs often use
   protocols that are too data-intensive for low-bandwidth networks,
   lack support
   for distributed administration, and require all replicas to
   communicate with a single master server.  On the other hand, much of
   the current internetwork-oriented replication technology
   (specifically, protocols standardized by the Internet Engineering
   Task Force (IETF)) does not provide business-critical support.  For
   example, IETF replication protocols typically provide no means of
   atomically updating a set of data objects, and in some cases do not
   guarantee that all updates are delivered to all replicas.

   With continuing growth in the Internet's size and breadth of uses,
   applications increasingly require distributed data management
   technology that provides data coherence guarantees and also works
   well across the Internet.  Recent examples include content delivery
   networks, corporate data sharing partnerships, content syndication,
   and network service providers sharing parts of their network
   management data.  The growing popularity of XML promises to
   accelerate the trend of deploying business critical applications
   across the Internet.

   Because of the split RDBMS/IETF evolution of replication technology,
   and because of the large number of incompatible replication
   technologies that have been developed to date, building a BCIA
   typically requires custom implementation and integration.  The author
   has personally worked on a number of commercial replication efforts
   that took significant time and effort to deploy, because of the need
   to integrate technologies from multiple vendors and fill out missing
   pieces with custom development.  Market pressures forced each of
   these projects to launch with incomplete integration (e.g., lacking
   some needed network management functionality).  Moreover, there were
   recurring integration costs as the vendors introduced new versions of
   their technologies.  This kind of experience is common to many
   Management Information Systems (MIS) shops.

   The current effort seeks to define a standard capable of replicating
   data for use by BCIAs, allowing a robust service to be deployed by
   configuration rather than custom development/integration.  The
   current effort also seeks to incorporate lessons learned over the
   past 20 years from both the RDBMS and the IETF worlds.  Both the
   protocol and the data units replicated by ARS are XML-based because
   we're betting on XML to become a dominant means of structuring data
   on the Internet.


   We begin this section by presenting three example applications, to
   motivate ARS's goals of standardized replication support for BCIAs.
   Next, we compare caching and replication, to clarify our meanings for
   these terms and motivate our focus on replication.  We then discuss
   why we believe existing replication technology is not well suited to
   business-critical internetworked environments.  Finally, we enumerate
   the ARS replication requirements.

   At the end of this document we summarize (Section 12) the strengths
   and limitations of ARS, to help data service architects determine
   whether ARS is appropriate for their needs.

1.1 Example Applications

   We now present and discuss some example replication applications for
   each of the characteristics introduced earlier.  Briefly, those
   characteristics were:

   1.  business-critical support: stronger data coherence guarantees
       than file-by-file replication, but weaker than global ACID
       semantics;

   2.  support for internetworked environments: the ability to operate
       in an administratively distributed environment with up to
       millions of servers in the face of partial outages, a wide range
       of network capacity, and a wide range of provisioning
       requirements; and

   3.  standard interfaces: the ability to deploy applications through
       configuration rather than custom integration/development.

   As an example application requiring characteristic 1 (business-
   critical support), consider the problem of distributing web site
   updates to a number of replica servers.  This problem is faced by
   content distribution networks, as well as by MIS shops that manage
   geographically replicated intranets.  After being editorially
   reviewed or otherwise approved for distribution, content is placed on
   a central staging server, from where it is replicated to a number of
   other servers.  When replication occurs, the set of files holding
   images referenced from the HTML "assembly" pages must be moved into
   place atomically with the assembly pages themselves, to avoid "broken
   link" errors when users browse the content.

   As an example application requiring characteristic 2 (support for
   internetworked environments), consider the problem of a real-time
   news provider, which must relay news headlines and stories to a large
   base of mobile clients.  As mobile data services grow over the next
   few years, the client base could grow to several hundred million in
   size.  To handle the huge load bursts and provide the 99.9999%
   uptime such scale would require, a service provider might replicate
   content through a tiered arrangement, with a single staging server
   feeding a set of continental replica servers across wide area
   links.  Each continental
   server, in turn, feeds a number of regional replica servers, which
   eventually populate several thousand local point-of-presence (POP)
   servers globally.  Each POP then passes updates to the set of
   geographically local clients as they become intermittently reachable
   over the network.  The entire system would need to be provisioned in
   a fashion that ensures updates reach the POP servers within (say) 10
   minutes of when the updates are injected into the staging server, and
   might support multiple classes of service -- for example with
   professional brokers paying more for up-to-the-minute stock quote
   data.

   As a second example requiring support for internetworked
   environments, consider the problem of provisioning service for a new
   user by
   populating a set of downstream services with slices of data sourced
   at a back office client database.  For example, when a new user signs
   up for Internet service at a national service provider, information
   such as their login name, password, given name, and subscribed
   service list must be propagated to a set of RADIUS [4], DHCP [5], POP
   [6], SMTP [7], LDAP [8] and database-driven web servers.  A similar
   operation (though perhaps applied to a smaller scale user base) is
   needed for supporting MIS operations managing employees and
   contractors in a medium-to-large corporate intranet.  One
   particularly complex requirement (which may or may not be selected,
   depending on local policy) is allowing clients to update their local
   server's information during temporary network partitions.  For
   example, a service provider may wish to allow clients to change their
   passwords and continue logging in to the local server during a
   network partition, and then propagate the password updates to other
   replica sites when the partition is repaired.  This functionality
   would require support within the replication system for so-called
   "multi-master" updates.  This functionality could result in
   conflicting updates being injected at multiple points in the network,
   requiring conflict detection and resolution procedures.  Multi-master
   update and conflict detection/resolution add significant complexity
   to a replication system, so it is desirable to structure the
   replication protocol in such a fashion that services that do not need
   this functionality can avoid the associated implementation costs and
   runtime overhead.

   As an example application requiring characteristic 3 (standard
   interfaces), consider again the service provisioning example
   mentioned above.  At present, RADIUS, SMTP, and similar services are
   implemented by a number of different vendors, whose technologies use
   incompatible systems for
   populating and replicating their databases.  Moreover, large MIS
   operations often manage employee or client data using technologies
   developed for the corporate personnel marketplace or the telco
   customer data marketplace, and these technologies also use their own
   proprietary data population/replication technology.  Developing a
   service provisioning system therefore requires non-trivial custom
   integration and implementation, particularly as provisioning requires
   integration with operations support systems (e.g., the ability to
   disable all online and physical building access with a single action
   when an employee is terminated).

1.2 Caching vs. Replication and Lazy vs. Eager Update Propagation

   Caching and replication are both used for distributing copies of
   data, to lower latency and reduce bottlenecks.  It is worth briefly
   reviewing the distinction we see between these approaches, as the
   industry has not converged on widely shared definitions for these
   terms [9][10].

   Caching systems typically make copies of individual objects in
   response to requests for those objects, and refresh copies during a
   subset of future access requests.  Replication systems typically make
   copies of defined collections of objects in advance of requests to
   those collections, and update collections when the origin objects
   (those being replicated) change.

   Caching can typically be implemented more simply than replication,
   but lacks several useful properties of replication systems:

   o  by copying defined collections of objects, replication systems
      ensure that a particular replica contains a "complete" (within the
      constraints of update propagation delays) collection that can be
      accessed during network partitions.  With caching systems,
      partitions usually mean that individual objects are unavailable;

   o  by updating collections when the origin objects change,
      replication systems can potentially keep objects more up-to-date
      than caching systems, without requiring access-time requests to
      check for staleness or origin server requests to "flush" stale
      copies.  This distinction is often referred to as "lazy" vs.
      "eager" update propagation.  While it is possible to implement
      eager cache propagation, the most common situation is lazy cache
      propagation.  Thus, throughout the remainder of this document when
      we discuss caching we assume lazy propagation, while we assume
      replication can be configured to be either lazy or eager.  Stated
      another way, caching operates in "pull" mode, while replication
      can operate in either "pull" or "push" mode;

   o  by updating collections when the origin objects change,
      replication can trigger client notification.  With caching systems
      clients must either check for updates periodically or rely on some
      external system to provide notification; and,

   o  by replicating objects at varying levels of granularity,
      replication systems can provide an efficiency boost to caching
      systems.  For example, replication could be used to propagate
      real-time updates for stock prices into a "my portfolio" page
      whose content (other than the stock prices) is cached.

   Caching systems are therefore most clearly appropriate for cases
   where simple configuration is important, full collection availability
   during partition is not required, object content changes slowly, and
   client notification is not required.

   Replication is most clearly appropriate for an environment where
   content changes rapidly (and therefore, selectively pushing updates
   of changed content works better than caching/pulling content); timely
   information is important; and provisioned service is important (so
   that service providers can engineer the flow of updates to remove
   bottlenecks that interfere with meeting their service level
   agreements).

   Note that replication can be implemented using software originally
   designed to support caching.  For example, one could implement a
   content delivery network that replicates content by pre-loading a set
   of Squid [11] caches when new objects are added to a collection,
   flushing the caches when objects are deleted, and re-loading the
   caches whenever objects change.
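
   To make the lazy-vs-eager distinction concrete, the following Python
   sketch (illustrative only, and not part of any specification; the
   class and method names are our own) contrasts a TTL-based pull cache
   with push-based replication:

      import time

      class LazyTTLCache:
          """Pull model: fetch on demand, refresh when the TTL expires."""

          def __init__(self, origin_fetch, ttl_seconds):
              self.origin_fetch = origin_fetch  # callable: name -> content
              self.ttl = ttl_seconds
              self.copies = {}                  # name -> (content, fetch time)

          def get(self, name):
              entry = self.copies.get(name)
              if entry is None or time.time() - entry[1] > self.ttl:
                  # Miss or stale copy: pull from the origin now, paying
                  # access-time latency and failing during a partition.
                  self.copies[name] = (self.origin_fetch(name), time.time())
              return self.copies[name][0]

      class EagerReplica:
          """Push model: the origin propagates each change as it occurs,
          so reads are purely local and the collection stays complete."""

          def __init__(self):
              self.copies = {}

          def apply_update(self, name, content):
              # Invoked by the replication service when origin data change.
              self.copies[name] = content

          def get(self, name):
              # Served locally; keeps working during network partitions.
              return self.copies[name]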

1.3 Currently Deployed Replication Technology

   A great deal of work has been done on replication over the past 20
   years.  In the current section we focus on deployed technology.
   Throughout the remainder of this document we discuss research
   approaches as they relate to the ARS architecture.

   Many vendors have developed proprietary replication technology, for
   file systems, relational and object databases, version control,
   messaging and directory services, business-to-business integration
   and electronic data interchange, systems and network management,
   workflow, and content delivery networks.  While many of these systems
   provide sophisticated replication functionality, an open standards-
   based approach is valuable because of its potential to enhance
   interoperability among applications that share access to replicated
   datastores.  For example, it is unfortunate that such inherently
   related processes as version control and workflow are typically
   deployed in current intranets using different vendors' incompatible
   replication technologies.  To achieve deep integration, companies
   must spend time developing "plumbing" code that moves data between
   these incompatible systems, encodes semantics from one system into
   opaque fields in the other system(s), and integrates with other
   operations support software (such as network and systems management
   systems).  Often sites deploy only partially integrated solutions
   because of time-to-market pressures.

   While there are a number of replication standards designed for
   internetworked environments, none adequately supports business-
   critical applications.  For example, the Network News Transfer
   Protocol (NNTP) [12] replicates postings from thousands of newsgroups
   across millions of machines, but enforces little structure.  It is
   common to see news postings sequenced in different orders on
   different machines, with confusing references to "earlier" articles
   that appear later in a particular repository.  There also are no
   guarantees that all articles posted to a newsgroup will reach all
   replicas of that newsgroup.

   The Internet Cache Protocol (ICP) [13] provides a way for cache
   servers to share data among a hierarchy of caching peers.  This
   approach comes at a cost in runtime latency [14][15].  It also
   provides no means of synchronizing cache contents or ensuring content
   integrity.  Sites wishing such stronger guarantees must develop
   additional mechanisms for those purposes.

   The HTTP Distribution and Replication Protocol (DRP) [16] defines
   efficient means for requesting missing or updated files, primarily in
   support of software distribution.  It lacks support for a variety of
   requirements we consider important for business-critical
   internetworked environments.  For example, DRP provides no support for
   atomically updating a group of files, routing updates according to
   network structure or policy, or allowing multiple sites to update
   shared data.

   The Web Distributed Authoring and Versioning protocol (WebDAV) [17]
   defines means of remotely editing documents, including support for
   collection creation and versioning.  This functionality is orthogonal
   to replication, and in fact the WebDAV requirements state that "the
   WebDAV extensions must be able to operate in a distributed,
   replicated environment".

   The Network Data Management Protocol (NDMP) [18] defines protocol
   support for clients to control how backup is done among various
   configurations of servers and networks.  This problem is tangentially
   related to replication.

   A common way to achieve some of the advantages of open standards with
   deeper replication capabilities is to represent business logic and
   content as XML stored inside a database or file system that supports
   replication.  While such an approach can be made to work, a storage
   management-independent replication protocol is needed because of the
   fragmented state of the file system and database marketplaces:

   o  The file system marketplace is fragmented by operating system,
      with Windows NT dominating corporate intranets and UNIX dominating
      high-end data centers.  There are cross-platform implementations
      of CIFS [19] and NFS [20], but in practice CIFS is primarily used
      with Windows and NFS with UNIX.

   o  The database marketplace is fragmented by database vendor.
      Specifically, ODBC [21] and SQL [22] define connectivity and query
      standards, but in practice database application programmers always
      make significant use of vendors' proprietary features for
      operationally critical functionality like hot backup, failover,
      and disk layout optimization.  Once developed, it is difficult to
      make such applications interoperate with other database systems.


   Beyond the marketplace fragmentation issue, file systems and
   databases do not provide appropriate functionality for a business-
   critical internetworked environment:

   o  Replicated file systems typically do not provide sufficiently
      strong atomicity and collection management functionality for
      business-critical applications.  For example, Coda [23] allows
      clients to preload their file system cache for use while
      disconnected, and provides mechanisms for updating replicas and
      detecting conflicts while connected.  This approach works well for
      end user file sharing, but would not be appropriate for business
      applications that, for example, require atomic update and serial
      audit trails.

   o  Database replication systems typically implement data coherence
      mechanisms that do not work well across the Internet.  Global ACID
      semantics, as provided by synchronous replication schemes, result
   in deadlocks or reconciliations that grow as the cube of the
      number of nodes and the fourth power of transaction size [24].
      Asynchronous replication schemes (such as Oracle updateable
      snapshots [25]) typically commit updates at a local replica and
      later propagate those updates to a master copy where automated
      conflict detection and resolution procedures are invoked.  A
      conflict can cause compensating transactions to be propagated to
      sites where the "losing" transactions were performed, which in
      turn can generate other compensating transactions to undo
      operations that ran based on values read from the losing
      transactions.  This arrangement can result in an explosion of
      global load if widely replicated data experience update conflicts.

   Finally, most database and file replication systems use
   communication-intensive network protocols that exhibit performance
   problems when used across the Internet.  While a number of research
   systems address weakly connected [23] or topologically complex [26]
   environments, there is a paucity of practically deployable
   replication solutions that work well across the range of connectivity
   and topology found on the Internet.

   The service that comes closest to ANTACID's goals and approaches is
   the UDDI replication service [27].  Like ANTACID, UDDI replication
   supports pull-based replication of XML content with notification when
   changes are available, and transmits updates over a Directed Acyclic
   Graph (DAG).  Also, since UDDI is used for advertising and
   discovering services offered by network-accessible businesses, it is
   clearly intended for a business-critical internetworked environment.
   Unlike ANTACID, UDDI replication defines a service-specific interface
   (i.e., intended to support only UDDI data).  As increasing numbers of
   XML-based services come online, we believe a general XML replication
   service would be valuable, for the reasons discussed earlier (Section
   1).  Because UDDI replication is focused on the specific needs of
   UDDI servers, the replication design fixes a number of choices that
   ANTACID leaves configurable, including authentication and
   confidentiality services (vs.  ANTACID's reliance on BEEP [28], which
   allows negotiation of security services); update content encodings
   (vs.  ANTACID's support for negotiated encodings); non-support of
   update-anywhere functionality (vs.  ANTACID's optional support for
   update anywhere); and the requirement that updates be propagated
   within a set amount of time (vs.  ANTACID's treatment of convergence
   time guarantees as a separate service provisioning issue).

1.4 Requirements

   We begin with a high-level, idealized list of requirements, which we
   then selectively relax when we consider the example replication
   applications:

   1.  Computational Power and Connectivity of Participating Devices:
       support everything from small/weakly connected devices to
       large/well connected data centers.

   2.  Scale: replicate arbitrarily large volumes of data, to
       arbitrarily many sites.

   3.  Distribution of Control: allow arbitrarily fine-grained control
       over what content is replicated, to which sites, and under which
       administrative policies.

   4.  Availability: allow reading and writing any available copy of
       data in the presence of server and network outages, subject to
       security policy restrictions.

   5.  Consistency: ensure globally consistent transactional (ACID) data
       updates.

   6.  Efficiency / Performance: avoid duplicate transmissions of
       updates across network links; compress redundancy out of data
       transmissions; route updates intelligently with respect to
       network topology; disseminate updates at full network speed;
       guarantee that updates commit globally within a specified,
       bounded period of time.

   7.  Security: support authenticated, encrypted, authorized access to
       digitally signed data.

   8.  Implementation Complexity vs.  Depth of Functionality: define
       easily implemented primitives that can support deep, complex
       application scenarios.

   9.  Configuration and Operational Management:  support adding and
       removing replicas at any time.  Support automatic configuration
       (e.g., of replication topology and update scheduling) that adapts
        intelligently over time.  Support a serial audit capability:
        the ability to enumerate an authoritative list of all update
        operations applied, their single-clock timestamps, and the
        principals that performed those updates.

   Looking at the breadth of example target applications, we see that
   they vary in a number of dimensions, resulting, for example, in
   requirements to support wide ranges of computational power,
   connectivity, and number of replicas.  Perhaps more important are the
   characteristics these applications have in common:

   o  all require stronger update atomicity and serialization semantics
      than file-by-file replication, but weaker than global ACID
      semantics;

   o  all require efficient use of network and computational resources;
      and,

   o  all require the ability to continue operations in spite of partial
      node and network outages.

   As an example of the update atomicity and serialization semantics, in
   a content distribution network it is important to update a group of
   HTML and image files atomically so that a network outage cannot leave
   a server in a state where the top-level ("assembly") page has been
   updated but the new image files it references have not yet arrived.
   As a second example, if multiple back office sites submit updates
   that result in subscriber account provisioning for an Internet
   Service Provider, it is important that an audit trail be generated
   showing the serialized sequence of updates that were applied.  Such a
   trail can help track down data integrity problems introduced at
   various back office sites that cause subscriber-visible service
   problems.

   Below we revisit the idealized requirement list, highlight those that
   are particularly difficult to satisfy in an internetworked
   environment, and, with an eye towards the example replication
   applications, present our chosen compromises:

   1.  Computational Power and Connectivity of Participating Devices: It
       is unlikely that a single engineering design could work well for
       Internet ranges of connectivity and capacity, and also compare
       favorably to technology tuned for high capacity local
       interconnects (e.g., a fiber channel-based storage area network
       used to interconnect backend databases and application servers).
       We therefore focus ARS on two classes of Internet computing
       environments:

       *  Well connected, high-capacity Internet server complexes: for
          example, a set of failover servers with diverse routed DS3
          connectivity; and,

       *  Mobile nodes: modest computational and storage capacity,
          frequently disconnected network.

   2.  Scale: while we see no algorithmic reasons why ARS could not
       scale arbitrarily, we establish the following numeric goals:

       *  Ability to replicate data sets up to 30 terabytes in size.

       *  Ability to replicate individual pieces of data to hundreds of
          well-connected data centers, or to millions of Personal
           Digital Assistants (PDAs).

   3.  Distribution of Control: replication systems designed to support
       arbitrarily distributed control typically route updates among
       replicas at random and use anti-entropy operations [29] to ensure
       that all updates circulate globally so that the replication
       service provides eventual consistency [30].  Anti-entropy can be
       a costly approach, because it requires pair-wise state
        comparisons to determine missed updates.  Moreover, when
        arbitrary flows of data are allowed, it can be difficult for
        replica
       administrators to engineer the global service to meet needed
       convergence times.  For these reasons, we impose the following
       constraints on distribution of control:

       *  Updates flow according to a structured communication graph,
          rather than to arbitrary peers.

       *  Ability to distribute control over replication across
          administrative boundaries defined by name tree partitions.

       *  Ability to set differing numbers of replicas for different
          parts of the name space.

       *  Support for both central source-of-record and update-anywhere
          models.

        *  No support for master-less replication (a la LDUP [8]).  While
          master-less replication is useful, it makes serial audit
          impossible and automated 100% coherence difficult.

   4.  Availability: one of the classic design decisions to be made for
       a replication system is how to trade availability against
       consistency.  In the Internet environment, we consider
       availability generally to be more important, and impose most of
       the constraint relaxation on consistency:

       *  Ability to submit updates to any available copy.

       *  Ability to read committed updates from any available copy.
          Optional support for reading the current set of submitted
          updates from any available copy.

       *  Ability to propagate updates in the presence of partial node
           and network outages.

   5.  Consistency:

       *  Support eventual consistency among asynchronously circulated
          updates.

       *  Ensure mutual consistency among documents at a datastore after
          each update to that datastore.  We exclude from scope broader
          notions of referential integrity, for example guaranteeing
          that a bookmarked web page can later be successfully visited.
          This latter problem might be addressed by a combination of
          archival repositories and a versioning name space, which,
          again, are out of scope.

       *  Support ability to determine when an update has committed at
          the submission server.  Provide no support for determining
          when an update has circulated to all replicas.

       *  Support ability to detect write-write conflicts.  Ability to
          notify applications when updates arrive, as a partial solution
          to the read-write conflict problem.

       *  Not all applications need to replicate all data served by an
          ARS server, so it must be possible for applications to cache
          documents using a simple Time-To-Live (TTL) mechanism to
          detect when it is necessary to request an updated copy.  ARS
          servers are not required to participate in cache coherence
           management.

   6.  Efficiency / Performance: consistency requirements generally
       impose performance bottlenecks that prevent updates from flowing
       at full network speed.  Our goal here is that ARS have
       performance reasonably close to that achieved by the best
       currently deployed replication technology, while making
       intelligent use of network resources.  Specifically:

       *  Performance of individual point-to-point replication should
          not be significantly worse than proprietary schemes, excluding
          convergence time (which depends on connectivity factors
          outside the control of the replication system).

       *  Efficient use of network resources: don't send the same update
          many times over a link.

       *  Ability to route update transmissions in ways that make sense
          with respect to network topology and/or policy.

       *  Support for time-bounded convergence is outside the scope of
          ARS, but it should be possible to configure server operations
          to bound convergence times, when done in conjunction with
          appropriate system provisioning/capacity/uptime engineering.
          As a possible future extension, a protocol could be defined to
          support some form of conditional service guarantees for
          deployments where resource reservations are possible.

    7.  Security: security generally involves a tradeoff against ease of
       configuration and use, and sometimes with performance impact.
       ARS should provide ranges of choices for authentication,
       confidentiality, access control, and data integrity, which allow
       individual sites to choose how they make the tradeoffs.  The
       collective ARS documents should provide sufficient background to
       allow sites to make these choices in an educated fashion.

   8.  Implementation Complexity vs.  Depth of Functionality: define a
       minimal set of broadly useful required protocol elements, and an
       additional set of optional protocol elements that increase the
       functionality for more complex applications.  Also, focus on
       replication, reusing existing solutions to related problems such
        as content filtering, load sharing, and configuration management.

   9.  Configuration and Operational Management:

       *  Automatic configuration is difficult to achieve for anything
          but the simplest of systems.  Even something as outwardly
          simple as auto-selecting full- vs.  half- duplex network
          switching has historically been the source of surprising
          numbers of operational problems.  Data replication is a far
          more complex problem.  ARS therefore defines no configuration
           automation.  ARS may impose some restrictions on the service
           states in which replicas may be added or removed.

       *  One of ARS's key design tradeoffs is the decision that serial
           audit must be supported -- as will be seen, this single
          decision forces other design choices that impact availability
          and consistency.  Nonetheless we feel serial audit is critical
          because it can help when tracking down service failures and
          security breaches, and it can provide a legal record of update
          activity.
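
   To illustrate what the serial audit requirement implies, the
   following Python sketch shows one plausible shape for an audit trail
   kept at the server that serializes updates.  It is purely
   illustrative; the class and field names are our assumptions, not
   part of ARS:

      import time
      from dataclasses import dataclass

      @dataclass(frozen=True)
      class AuditRecord:
          seqno: int         # position in the serial order of updates
          timestamp: float   # single-clock timestamp (serializer's clock)
          principal: str     # authenticated entity that made the update
          operations: tuple  # e.g. (("write", "doc.rfc.2629"), ...)

      class AuditTrail:
          """Append-only, serially ordered log of applied updates."""

          def __init__(self):
              self._records = []

          def append(self, principal, operations):
              record = AuditRecord(seqno=len(self._records) + 1,
                                   timestamp=time.time(),
                                   principal=principal,
                                   operations=tuple(operations))
              self._records.append(record)
              return record

          def enumerate(self):
              # The authoritative list of all update operations applied,
              # their timestamps, and the principals that performed them.
              return list(self._records)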

2. Terminology

   We use the term "document" to refer to a well-formed XML document.
   We use the term "datastore" to refer to any hierarchically-named
   document repository.  The datastore could be implemented as a file
   system, a database table, or some persistent storage system.  ARS
   makes no assumptions with respect to granularity of access (e.g.,
   attribute-level vs.  file-level) of the underlying datastore.
   However, the ARS service itself replicates at the granularity of an
   individual document.  In other words, modifications to individual
   elements or attributes will result in updates to the complete
   document.

   We use the term "name space" to refer to the universe of all names
   within a particular datastore naming system.  For example, the Blocks
   name space [31] consists of all names for Blocks (such as
   "doc.rfc.2629").  We use the term "name tree" to refer to the
   hierarchical collection of names within a particular name space
   (e.g., the names within the Blocks name space).
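
   Under a dotted naming convention like the Blocks example above,
   hierarchical relationships can be computed directly from the names
   themselves.  The following sketch is illustrative only (the helper
   functions are ours, not part of ARS):

      def ancestors(name):
          """Chain of ancestor names for a dotted name, nearest first:
          ancestors("doc.rfc.2629") == ["doc.rfc", "doc"]."""
          labels = name.split(".")
          return [".".join(labels[:i])
                  for i in range(len(labels) - 1, 0, -1)]

      def subtree(names, root):
          """The names falling within the name tree rooted at 'root'."""
          return [n for n in names
                  if n == root or n.startswith(root + ".")]

      assert ancestors("doc.rfc.2629") == ["doc.rfc", "doc"]
      assert subtree(["doc.rfc.2629", "doc.rfc.2629.v2", "app.cfg"],
                     "doc.rfc") == ["doc.rfc.2629", "doc.rfc.2629.v2"]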

   We use the term "server" to mean a process that responds to requests
   issued by another process.  "Client" means a request issuer.  These
   terms apply to roles that may be transient.  In particular, "ARS
   servers" act as servers when a client submits a request, and act as
   clients when forwarding requests to other ARS servers.  Moreover, ARS
   servers can initiate communications when performing callback
   notification to a client that made an earlier request to the server.
   We use the term "ARS peers" when referring to either a client or
   server that communicates using the ARS protocol.

   While the ARS client and server roles can be transient, we make a
   fixed distinction between processes that receive (and possibly
   forward) replication requests vs. those that only submit requests.  We
   refer to the former as "replication servers" and the latter as
   "clients of the replication service".  We refer in aggregate to
   replication servers as "the replication service".

   This role distinction is motivated by the Internet gateway model
   [32], which distinguishes between the information that must be known
   within the routing core of the Internet and the information needed
   by hosts requiring packet
   routing service.  In the case of Internet gateways, the distinction
   reduces the number of machines that are affected by routing changes.
   In the case of ARS the distinction allows us to:

   o  offload state management from clients of the replication service
      for tracking in-progress replication requests, in support of
      computationally modest clients (such as mobile PDAs); and

   o  define a boundary across which clients of the replication service
      may treat replicated data in ways opaque to the replication
      service -- for example when caching or digitally signing
      replicated data.

   We discuss both of these issues in more depth later.

   We use the term "primary" to refer to a designated server for a
   particular region (the structure of which will be explained later) of
   the name space that serializes updates for that region, and "non-
   primary" for all other servers for that name space region.

   ARS servers for individual regions of the name space are arranged in
   a Directed Acyclic Graph (DAG [33]) rooted at the primary.  We refer
   to ARS servers closer to the primary along the DAG than a given
   server as "upstream servers", and those further as "downstream
   servers".

   We do not use the term "secondary" to denote servers that relay
   updates to primary servers, because there can be multiple hops
   between a primary and the set of all downstream servers in the DAG,
   and the term does not adequately capture the subtleties of
   interactions along the propagation DAG.  Instead we speak of primary
   vs.  non-primary servers, and of upstream vs.  downstream servers.
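
   Stated operationally, "upstream" is reachability toward the primary
   along the DAG.  The sketch below assumes a simple representation in
   which each server maps to the set of servers it receives committed
   updates from; the names and representation are ours, for
   illustration only:

      def upstream_servers(feeds_from, server):
          """Every server between 'server' and the primary, inclusive
          of the primary, which is the unique server with no entry."""
          seen = set()
          frontier = set(feeds_from.get(server, ()))
          while frontier:
              node = frontier.pop()
              if node not in seen:
                  seen.add(node)
                  frontier |= set(feeds_from.get(node, ()))
          return seen

      # Primary P feeds A and B, which both feed C.  A and B are two
      # hops' worth of "upstream" for C, which is why "secondary" is
      # too coarse a term:
      dag = {"A": {"P"}, "B": {"P"}, "C": {"A", "B"}}
      assert upstream_servers(dag, "C") == {"A", "B", "P"}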

   We use the term "ARS service" to refer to the overall replicated data
   service supported by ARS.

   We use the term "stable storage" to denote non-volatile memory (e.g.,
   disk) implemented in a fashion where write operations survive system
   crashes/reboots [34].

   We use the term "commit" to indicate a computation performed on a
   single datastore that applies a set of updates in a fashion that
   respects ACID semantics at that single datastore.  There is no notion
   of multi-server commit in ARS.

   We use the term "submission server" to indicate the server to which
   an update was originally submitted by a client of the ARS service.
   We use the term "submission path" to indicate the sequence of
   servers, starting at the submission server and terminating at the
   primary server, via which an update is submitted and propagated.

   We use the term "principal" to refer to any named entity in the
   system, including users, groups of users, and services running on
   behalf of particular administrative regions of the replicated name
   space.

   ARS specifies a required subset of protocol elements and two optional
   subsets (the details of which will be presented shortly).  We use the
   term "the ARS protocol" to refer to any legitimate subset of all
   three sets of protocol elements.

   Note that some of the terms used by ARS are different from their use
   in the Domain Name System [35], even though ARS also forms a
   hierarchical distributed datastore.  Specifically, both systems have
   a "primary" server notion, but where the DNS has "secondary" servers,
   ARS has non-primary servers.  The DNS replicates data from primary to
   secondary servers, but ARS replicates data along a DAG from primary
   to non-primary servers.  Also, the DNS and ARS both define a notion
   of "zone" to support distributed administration, but the DNS defines
   more ways to partition zones than ARS does (specifically, the DNS
   defines a notion of "class" that is not present in the ARS).

3. Overview of the Approach

   Many aspects of the ARS architecture were influenced by the desire to
   balance support for business-critical functionality against the need
   to operate in an internetworked environment.  In this section we
   present an overview of the ARS approach, at the end of which we
   summarize the design decisions made to enhance scalability and
   operation across a variety of network environments.

   ARS partitions each name space into disjoint regions called zones, to
   support distributed administration.  Zones can also enhance ARS
   scalability by allowing a primary that receives too much update
   traffic to delegate part of its work to another server.

   Each ARS server supports one or more zones.  Individual zones may or
   may not be replicated.

   For each zone, updates are propagated along a DAG rooted at a
   distinguished primary server that serializes all updates to that
   zone.  This DAG provides redundancy and load distribution when
   propagating updates, and can be configured to take network
   characteristics (topology, link bandwidth, policy, etc.) into
   account.

   There are two distinct phases of the update process:

   o  update submission, which may be as simple as submitting an update
      to the primary, or may involve submitting to a non-primary that in
      turn propagates the submission through a chain of servers up the
      DAG to the primary; and,

   o  committing of submitted updates, which causes messages to
      propagate down the DAG from the primary to all non-primary servers
      of that zone.

   To distinguish these phases, we talk about update submission as
   distinct from serialization/commit.
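
   The two phases can be pictured as traversals of the same DAG in
   opposite directions.  The following sketch uses a hypothetical
   representation (a single chosen submission path per server) and
   shows only the direction of each phase:

      def submission_path(submits_via, server):
          """Phase 1: an update submitted at 'server' travels up the
          DAG to the primary, the only server with no entry."""
          path = [server]
          while path[-1] in submits_via:
              path.append(submits_via[path[-1]])
          return path  # submission server, ..., primary

      def propagation_order(feeds, primary):
          """Phase 2: after the primary serializes and commits, the
          update propagates down the DAG to every non-primary server,
          visiting each server once."""
          order, frontier, seen = [], [primary], {primary}
          while frontier:
              node = frontier.pop(0)
              order.append(node)
              for child in feeds.get(node, ()):
                  if child not in seen:
                      seen.add(child)
                      frontier.append(child)
          return order

      submits_via = {"C": "A", "A": "P", "B": "P"}
      feeds = {"P": ["A", "B"], "A": ["C"], "B": ["C"]}
      assert submission_path(submits_via, "C") == ["C", "A", "P"]
      assert propagation_order(feeds, "P") == ["P", "A", "B", "C"]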

   ARS servers can push updates down the DAG as they arrive, to reduce
   replication latency if needed.  Servers can also pull updates, for
   example to "catch up" when first joining a replication DAG, or after
   downtime, partition repair, or mobile reconnection.  For
   synchronization and implementation simplification reasons, a push is
   actually just a suggestion from the upstream server that the
   downstream server should perform a pull request.  The choice of
   whether to push or pull updates is a local configuration decision
   made at each server, though server administrators should coordinate
   these choices to support specific provisioning requirements.
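
   In other words, the "push" protocol element carries no update
   content; it merely advises the downstream server to pull.  A minimal
   sketch of that interaction follows (the method names, and the
   'updates_since' interface on the upstream server, are assumptions):

      class DownstreamServer:
          def __init__(self, upstream):
              self.upstream = upstream
              self.last_csn = 0  # high-water mark of applied commits

          def on_push_hint(self):
              # A "push" from upstream is only a suggestion to pull,
              # keeping all transfer logic in a single (pull) code path.
              self.pull()

          def pull(self):
              # Also invoked to catch up when first joining the DAG, or
              # after downtime, partition repair, or mobile reconnection.
              for csn, update in self.upstream.updates_since(self.last_csn):
                  self.apply(update)
                  self.last_csn = csn

          def apply(self, update):
              pass  # commit locally, with ACID semantics at this server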

   When a replication server receives an update submission, it enters
   into what we call a client-server promise: after accepting the
   update, the server must follow that update through either to success
   or failure.  This promise extends from the submission server through
   each ARS server up the DAG to the primary, so that in effect the
   client enters into a promise with the replication service (not just
   an individual server).  Because of the client-server promise, an
   infrequently connected client of the replication service can thus
   disconnect from the network after an ARS server has accepted the
   update submission, and rely on the replication service to carry the
   update through to completion or failure without the client's needing
   to handle details of timing out and resubmitting the update.  Clients
   can learn of commit results through an asynchronous notification
   mechanism.
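
   A sketch of the promise from the accepting server's side follows.
   It is a behavioral illustration only, under assumed names; in
   particular the stable log interface is hypothetical.  The key point
   is that acceptance is durable, so the client may disconnect
   immediately:

      class AcceptingServer:
          def __init__(self, stable_log, upstream, notifier):
              self.log = stable_log     # survives crashes and reboots
              self.upstream = upstream  # next server up the DAG
              self.notifier = notifier  # asynchronous callback channel

          def submit(self, update):
              # Accepting the update creates the client-server promise.
              self.log.append(update)   # written to stable storage
              return "accepted"         # the client may now disconnect

          def drive_to_completion(self):
              # Background task: carry each accepted update through to
              # success or failure, then notify the client of the result.
              for update in self.log.pending():
                  try:
                      result = self.upstream.submit(update)
                  except ConnectionError:
                      continue          # upstream unreachable; retry later
                  self.log.record_outcome(update, result)
                  self.notifier.notify(update, result)  # commit-or-fail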

   ARS consists of three sub-protocols, only the first of which must be
   implemented by all ARS servers:

   1.  the Commit-and-Propagate Protocol (ars-c), which allows a client
       to submit an update to an ARS server, performs zone-wide
       serialization, commits updates, and propagates committed updates
       down the DAG to other ARS servers that each individually commit
       and propagate the updates to their downstream servers.  ars-c
       also provides asynchronous notification of commit-or-fail results
       to the submitting clients.

   2.  the Submission-Propagation Protocol (ars-s), which allows an
       update that was submitted to a non-primary server to be
       propagated up the DAG to the primary server, where it is
       serialized with respect to all submission activity at the
       submission server (in addition to the serialization performed by
       ars-c).  ars-s re-uses the protocol element defined by ars-c for
       asynchronous client notification to notify servers down the
       submission path of update commit-or-fail results.  ars-s provides
       a store-and-forward update submission mechanism, so that updates
       can succeed even if all servers along a submission path are never
       simultaneously available.  Implementations that do not support
        ars-s allow updates to be submitted only to the primary.

   3.  the Encoding Negotiation Protocol (ars-e), which defines a
       protocol for negotiating a set of MIME-based [36] data
       representations that can be used when passing submitted and
       committed updates among ARS peers.  These negotiated
       representations support a variety of transmission optimizations,
       including compression, delta encoding, and bulk transfer.
       Implementations that do not support ars-e use a single required
       encoding for submitted and committed updates.
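
   As an indication of what ars-e-style negotiation amounts to, the
   sketch below picks a mutually supported encoding, falling back to
   the single required encoding.  The encoding tokens shown are
   invented for illustration; the actual protocol elements are defined
   in the companion document:

      REQUIRED_ENCODING = "application/xml"  # assumed mandatory baseline

      def negotiate_encoding(our_preferences, peer_supported):
          """First of our preferred encodings the peer also supports;
          the required encoding is always a safe fallback."""
          for encoding in our_preferences:
              if encoding in peer_supported:
                  return encoding
          return REQUIRED_ENCODING

      ours = ["application/xml+gzip", "application/xml-delta",
              "application/xml"]
      theirs = {"application/xml-delta", "application/xml"}
      assert negotiate_encoding(ours, theirs) == "application/xml-delta"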


   This protocol decomposition allows relatively simple implementations
   for broadly useful functionality, and additional functionality for
   more demanding services.  In particular, "update-anywhere"
   functionality (supported by ars-s) is responsible for a significant
   amount of the service complexity.  Many services do not require this
   functionality and need not incur its implementation complexity and
   runtime overhead.

   Similar to Bayou [37], ARS transmits operations rather than data when
   transmitting both submitted and committed updates, to (a) avoid
   ambiguities with respect to deleted documents and (b) make the amount
   of information transmitted proportional to update activity instead of
   overall database size.  ARS also defines a collapsing update
   mechanism that eliminates committed update transmissions that are
   overwritten by subsequent updates.  This approach modifies coherence
   semantics in a way that should not impact most applications, and
   enhances efficiency for infrequently synchronized datastores.  It can
   also reduce space requirements for log-based server implementations.
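
   The collapsing mechanism can be pictured as follows: when a replica
   synchronizes after a gap, committed updates that were later
   overwritten need not be transmitted at all, only the most recent
   operation per document.  The sketch below illustrates the idea and
   is not ARS's actual algorithm:

      def collapse(updates):
          """Given committed updates in serial order as (csn, doc, op)
          tuples, drop each update that a later update to the same
          document supersedes; serial order among survivors is kept."""
          latest = {}  # document name -> csn of its last update
          for csn, doc, _op in updates:
              latest[doc] = csn
          return [u for u in updates if latest[u[1]] == u[0]]

      log = [(1, "a.b", "write"), (2, "c.d", "write"),
             (3, "a.b", "write"), (4, "c.d", "delete")]
      # Only the state-determining operations remain to be sent:
      assert collapse(log) == [(3, "a.b", "write"), (4, "c.d", "delete")]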

   ARS supports a server-by-server model of update atomicity and
   consistency.  An update submitted by a client contains a set of
   operations that are to be performed together, all-or-nothing.  That
   group of updates commits first at the primary and then travels to
   all downstream servers.  The updates are applied asynchronously
   at each server with ACID semantics local to that server, and updates
   are guaranteed eventually to circulate to all replicas.  In this way,
   a client interacting with a single replica sees updates that are
   locally consistent (e.g., retaining referential integrity between
   cross-linked documents at that server), but the replicated ARS
   service avoids scaling problems that would result from global ACID
   semantics.
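
   The model can be summarized as all-or-nothing within one datastore,
   asynchronous between datastores.  A minimal sketch of the local,
   all-or-nothing application of one update group follows; the
   datastore transaction interface is hypothetical:

      def apply_update_group(datastore, operations):
          """Apply a group of operations with ACID semantics local to
          this single datastore: every operation takes effect, or none
          does.  No other replica participates in the transaction."""
          txn = datastore.begin()  # strictly local transaction
          try:
              for op, doc_name, content in operations:
                  if op == "write":
                      txn.write(doc_name, content)
                  elif op == "delete":
                      txn.delete(doc_name)
                  else:
                      raise ValueError("unknown operation: " + op)
              txn.commit()   # all operations become visible together
          except Exception:
              txn.abort()    # ... or none of them do
              raise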

   The ARS service tracks commit sequence numbers (CSNs) for each
   document, which may be used to detect write-write conflicts.  The
   local server implementation or an application may also handle a
   simple case of read-write conflicts by notifying clients when an
   update to their input documents arrives.  Both of these forms of
   conflict detection are optional, and may be implemented outside of
   the ARS service (e.g., with some form of trigger or "persistent
   fetch").

   ARS provides no protocol support for determining when an update has
   circulated globally.  An application could be written to traverse the
   ARS DAG for a given zone and check for update completion, but
   distributed administration of servers could prevent discovery or
   querying of some servers.  Moreover, the cost of this computation
   could be prohibitive.  Finally, the list of replicas may itself be in
   flux, making it difficult to define a meaningful notion of "all
   replicas".

   Several design decisions in the approach outlined above contribute
   to ARS's applicability to internetworked environments.  These
   decisions concern scalability and operation over a variety of
   network environments.

   In terms of scalability, probably the most important design decision
   is ARS's server-by-server model of update atomicity.  This approach
   avoids the deadlock and reconciliation problems of global ACID
   semantics, and induces an application architecture where clients
   interact with local replicas rather than "popping around the net".  A
   second design decision in support of scalability is routing updates
   according to a configured graph structure.  This approach avoids the
   overhead of pair-wise state comparisons performed by anti-entropy
   operations, which are needed when updates are allowed to flow between
   arbitrary peers.  A third scalability-oriented design is the
   separation of conflict detection from the replication service.  This
   approach offloads conflict detection overhead from replication
   servers.  A fourth scalability feature is the use of transmission
   encodings, which can support a number of optimizations, most
   importantly bulk transfer (which can make it feasible to replicate
   very large datastores).  Finally, scalability is supported by the
   ability to partition the name space for distributed administration.

   In terms of operation across a variety of network environments,
   probably the most important design decision is routing updates
   according to a configured graph structure.  This approach allows data
   to be routed according to bandwidth, policy, or other network
   characteristics.  For example, the graph could be configured so that
   updates are sent once across an expensive trans-oceanic link, and fan
   out from there to other replicas.  Graph-based update routing also
   supports a provisioned service model [38], where administrators plan
   flows based on measured update rates and server and network capacity.
   In this way, commercial providers can support service level
   agreements across a variety of network environments, which would be
   difficult to do if updates were routed in an unconstrained fashion.
   A second design decision in support of varying network
   characteristics is store-and-forward update submissions.  This
   approach allows updates to succeed when all servers along a
   submission path are not simultaneously available.  On a related note,
   by allowing both pull and push models of committed update
   propagation, peers that are not continuously connected can request
   updates when they connect, yet continuously connected servers can
   learn of updates without polling.  Finally, the store-and-forward
   update mechanism also supports deployment on bastion hosts that
   forward traffic across security boundaries, similar to how proxy
   servers are often used for allowing Internet HTTP access from within
   an intranet.

   Mobile PDA's represent a broad class of network environments deserving
   particular mention.  ARS's client-server promise and asynchronous
   notification allow a PDA to connect, submit an update, and
   disconnect, managing only the state needed to tie notification back
   to the original submission.  Also, ARS's collapsing update mechanism
   can reduce committed update traffic for infrequently synchronized
   PDA's.  ARS's update-anywhere capability supports geographically
   constrained network architectures, allowing a PDA to synchronize data
   with a nearby cell server or kiosk.  Finally, ARS's distinction of
   replication-service vs client-of-replication-service minimizes
   implementation complexity for computationally modest PDA devices.

4. Name Space

   ARS operations specify documents using Uniform Resource Identifiers
   (URI's [39]).  For this purpose, the "scheme" part specifies the name
   space being used.  For example, a URI for a document in the Blocks
   name space might be "blocks:doc.rfc.2629".

   The following figure illustrates a number of aspects of ARS naming,
   drawn from the Blocks name space:


                      zzzzzzzzzzzzzzzzzzz
                      z     -------     z
                      z    /   \   \    z
                      z   /    ... ...  z
                 zzzzzzzzzzzz           z
                 z      /   z           z
                 z     doc  zzzzzzzzzz  z
                 z    / | \---\      z  z
                 z  rfc     z edgar  z  z
                 z / | \    z  / | \ z  z
                 z          z        z  z
                 zzzzzzzzzzzzzzzzzzzzzzzz
                      ||\\ \\
                      || \\ \\_________
                      ||  \\ \--------\\
                      ||    \\         \\
                      ||      \\        \----------\\
                      ||        \\                  \\
                 +----------+    +----------+    +-------------+
                 |          |    |          |    |             |
                 |     doc  |    |     doc  |    |     doc     |
                 |    / | \ |    |    / | \ |    |    / | \    |
                 |  rfc     |    |  rfc     |    |  rfc        |
                 | / | \    |    | / | \    |    | / | \       |
                 |          |    |          |    |             |
                 +----------+    +----------+    +-------------+
                      ||    \\  //    ||               ||
                      ||      \\      ||               ||
                      ||    //  \\    ||               ||
                 +----------+    +----------+    +-------------+
                 |          |    |          |    |             |
                 |     doc  |    |     doc  |    |      doc    |
                 |    / | \ |    |    / | \ |    |     / | \   |
                 |  rfc     |    |  rfc     |    |   rfc       |
                 | / | \    |    | / | \    |    |  / | \      |
                 |          |    |          |    |             |
                 +----------+    +----------+    +-------------+

   In this figure, boxes surrounded by z's indicate zone boundaries.
   Zones are named by the top-level node in each zone, with the root
   given the distinguished name ".".  Zones are partitioned by placing
   cuts between nodes in the name tree, and designating a primary server
   for each zone.  In the figure there is a cut at "blocks:doc" and
   another at "blocks:doc.edgar", making 3 separate zones ("blocks:.",
   "blocks:doc", and "blocks:doc.edgar").

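   As an informal illustration (not part of the protocol), the following
   Python sketch resolves the zone enclosing a given document URI,
   assuming zone cuts are configured as a set of dot-separated name
   prefixes; the function and representation are hypothetical:

      # Hypothetical sketch: resolve the enclosing zone for a document
      # URI, given configured zone cuts within a dot-separated name
      # space such as Blocks.

      def zone_of(uri, cuts):
          scheme, name = uri.split(":", 1)
          labels = name.split(".")
          # Walk from the most specific prefix to the least specific;
          # the deepest cut at or above the document names its zone.
          for i in range(len(labels), 0, -1):
              prefix = ".".join(labels[:i])
              if prefix in cuts:
                  return scheme + ":" + prefix
          return scheme + ":."            # the root zone, named "."

      cuts = {"doc", "doc.edgar"}
      assert zone_of("blocks:doc.rfc.2629", cuts) == "blocks:doc"
      assert zone_of("blocks:doc.edgar.10k", cuts) == "blocks:doc.edgar"
      assert zone_of("blocks:misc.report", cuts) == "blocks:."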

5. Replication Topology

   For each zone that an ARS server replicates, the server is configured
   to execute the ARS protocol with zero or more upstream servers and
   with zero or more downstream servers.  These servers are arranged
   into a DAG rather than permitting updates to flow via unconstrained
   peer-to-peer interconnections (as done for example by Bayou [37])
   both to eliminate the need for anti-entropy operations and to provide
   more structure for replica administrators who wish to engineer flows
   to meet needed convergence times.

   Each upstream server may be either the zone primary or another ARS
   server that replicates the needed zone.  In this way, committed
   document updates for a zone are replicated along a DAG rooted at that
   zone's primary.  An example is illustrated in the figure in
   the Name Space section (Section 4), with two of the "doc" zone
   replicas configured to replicate from either of two different
   upstream replicas.

   In the simplest ARS service configuration (one supporting only ars-c
   at all servers) updates are submitted at the primary, and after
   committing at the primary the updates propagate down the DAG to the
   non-primary servers.  In a service that also supports ars-s, updates
   can be submitted to non-primary servers and routed up to the primary
   along the DAG.  Supporting non-primary update submissions can provide
   several advantages:

   1.  it allows clients (such as wireless PDA's) to submit updates to
       topologically nearby servers, which may provide transmission cost
       advantages;

   2.  it allows load bursts to be distributed away from the primary,
       with non-primaries time-shifting or otherwise shaping update
       traffic up to the primary;

   3.  it supports a distributed access control model (Section 10.4),
       where individual server administrators manage update privileges
       for their local users; and,

   4.  it allows updates to be submitted when the primary is temporarily
       unreachable.

   Scalability in the number of replicas is achieved by building a DAG
   of ARS servers with fan-out at any particular replica engineered
   based on measured update rates and server and network capacity.

   Each ARS server knows all of its immediate upstream servers and all
   of its immediate downstream servers.  Thus, becoming a downstream
   server involves configuring both the new downstream server and each
   upstream server it will use, through administrative arrangement
   outside the ARS service.  The means by
   which this configuration is done (configuration file, database, etc.)
   are a matter of local implementation.  ARS does not define a protocol
   for joining/leaving the replication topology.
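
   For illustration only, such a configuration might take a shape like
   the following Python structure; the names are hypothetical, and any
   equivalent representation (file, database, etc.) would serve:

      # Hypothetical per-zone peer configuration for one ARS server.
      REPLICATION_CONFIG = {
          "blocks:doc": {
              "upstream":   ["ars1.example.com", "ars2.example.com"],
              "downstream": ["ars4.example.net"],
          },
          "blocks:doc.edgar": {
              "upstream":   ["edgar-primary.example.org"],
              "downstream": [],          # a leaf replica for this zone
          },
      }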

   When a zone divides or delegates, the ARS servers downstream of it
   continue to receive content only from the originally named zone, not
   from the newly delegated zones.  An ARS server administrator who
   wishes to replicate newly delegated content must explicitly add the
   new zones to the server's configuration.

   The choice of whether replication topology construction is done via
   manual configuration (e.g., using a graph editor) or by some
   automatic process (e.g., one that looks at routing tables or that
   performs bandwidth computations) is outside the scope of the ARS
   architecture.  Over time we believe that automating the construction
   of replication DAGs will be needed to scale the service to thousands
   of replicas [26].  Solutions to this problem are an area of future
   study (for example, using some form of Protocol Independent Multicast
   [40] to help discover topologically nearby nodes).

6. Update Ordering

   ARS serializes submitted updates to a zone so that all servers commit
   updates in the same order.  Additionally, for pairs of servers that
   support ars-s, ARS serializes submitted updates according to the
   order that they arrived at the submission server.  Each submission
   server's ordering imposes a partial zone-wide ordering: submissions
   are ordered per server, but do not imply any ordering between
   servers.  The partial orderings of all submission servers are
   respected by the zone-wide total ordering chosen by the primary.
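
   As a non-normative sketch, one way a primary could construct such a
   total ordering is to interleave per-submission-server queues, which
   preserves each server's arrival order by construction (the data
   shapes below are hypothetical):

      from collections import deque

      def serialize(per_server_queues):
          # per_server_queues: submission server -> updates in arrival
          # order.  Any fair interleaving policy preserves each
          # server's partial order.
          queues = {s: deque(u) for s, u in per_server_queues.items()}
          total_order = []
          while any(queues.values()):
              for server, q in queues.items():
                  if q:
                      total_order.append(q.popleft())
          return total_order

      order = serialize({"s1": ["u1", "u2"], "s2": ["v1"]})
      # "u1" always precedes "u2"; no ordering is implied between the
      # updates submitted at "s1" and the update submitted at "s2".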

   This ordering model is motivated by the desire to provide behavior
   that meets users' intuitive expectations, for the case of
   applications that interact with a single replica.  Clearly it is
   important to sequence all updates into a globally consistent order,
   but it is also important to maintain submission site ordering.  If
   submission site ordering were not maintained, problems like the
   following example scenario could arise.  Consider an online news site
   that publishes updates daily at midnight.  Each day, the site
   performs an internal editorial review and then submits the update to
   the replication system.  The staff member in charge of content
   replication submits the approved update, and while waiting for
   notification that the update has committed begins reading a local
   copy of the updated content.  In browsing through the content the
   staff member notices an error that slipped past the editorial review,
   and submits an update to correct the error.  At this point, there are
   two update submissions in the system, neither of which has committed
   at the primary yet.  If the ordering model did not respect submission
   site ordering, it is possible that the correction could reach the
   primary before the original update (for example, because of a
   transient network outage that caused the first update to stall while
   the second update routed through a different path to the primary) and
   be serialized before the original update.  As a result, the original
   (uncorrected) update content becomes the final content on the web
   site for the current 24 hour period.

   ARS servers request updates from an upstream server by specifying the
   zone and the CSN of the last operation seen for that zone.  The
   upstream server responds by sending all updates that have committed
   since the specified sequence number, possibly collapsing some updates
   per the collapsing update notion mentioned earlier (Section 3).

   Note that ANTACID addresses a relatively constrained update ordering
   problem because it assumes updates are always applied to entire
   documents.  Supporting finer-grained updates (e.g., attribute-level
   updates commonly supported by relational databases) would require
   dealing with update dependencies, merges, etc.

7. Atomicity, Consistency, and Reliability

7.1 Eventual Consistency

   ARS supports eventual consistency semantics [30].  Specifically,
   there are no guarantees that all replicas are ever simultaneously in
   a consistent state, but if update submissions cease for long enough,
   all replicas will eventually converge to a consistent state.  There
   are no guarantees about how long it can take for updates to
   propagate, but the local implementation may schedule updates and
   engineer sufficient capacity to meet a given propagation schedule.

   Note that some form of conditional service guarantees [41] might be
   provided by proprietary vendor extensions, to bound update
   propagation times for deployments where resource reservations are
   possible.  Moreover, in the future a new ARS sub-protocol could be
   defined to support such conditional service guarantees.

7.2 ANTACID Semantics

   As noted earlier (Section 1.4), globally consistent ACID semantics do
   not scale and are not needed for a usefully large class of
   applications.  Instead, ARS supports a server-by-server model of
   update atomicity and consistency, which we call Asynchronous Network
   Traversed ACID (ANTACID) semantics.  Specifically, updates submitted
   together are guaranteed to be applied in the same order, with ACID
   semantics, at each replica as they traverse the DAG.  There are no
   guarantees about when updates reach
   individual replicas or about the update state of one replica with
   respect to another replica.  This approach is motivated by the
   scaling problems of providing global ACID semantics and by the fact
   that many applications work by interacting with a single replica, for
   which all that is required is that updates to that replica appear to
   be ACID to applications interacting with that replica.

   The term "ANTACID" is also intended to connote a reduction in
   unpleasant complications that would arise from the over-use of ACID
   semantics in the global environment.

   Fox et al.  [42] define BASE (Basically Available, Soft state,
   Eventual consistency) semantics to be, in essence, everything that
   does not support ACID semantics.  We believe it is useful to
   characterize ANTACID semantics at two different levels of system
   granularity.  At the granularity of the global replicated service,
   ANTACID semantics are BASE.  At the granularity of individual
   datastores, ANTACID semantics are ACID.  We believe this split is key
   to supporting business critical internetworked applications.

   Mnesia [43] is an RDBMS that supports the usual set of transactional
   operations as well as a set of so-called "dirty operations" that
   trade away transactional integrity for increased performance.  These
   operations are atomic for individual reads/writes, but do not provide
   any way to group logically related operations in a way that supports
   referential integrity.  Although Mnesia provides a way to selectively
   relax traditional ACID semantics (as ARS does), the mechanism is
   targeted towards realtime applications, rather than Internet data
   replication.

   To illustrate why we feel ANTACID semantics are appropriate for
   BCIA's, consider the three example applications introduced earlier
   (Section 1.1).

   For the web content distribution application, a limited form of
   referential integrity is required: all of the content updates for a
   site must be made "live" simultaneously to avoid broken link errors
   when users view the content with a web browser.  It is not required
   that all of the updates be made "live" simultaneously at all replica
   servers, and in fact imposing that sort of global ACID semantics
   would heavily impact availability: a partition affecting a single
   replica would delay all other replicas from turning the needed
   updates "live" on schedule, for applications requiring scheduled
   delivery (e.g., an online publication that updates its content at
   regular intervals).  By supporting ANTACID semantics, the impact of
   regional network outages can be limited to the affected region.  Note
   that there may be additional requirements for how closely together
   updates should be made "live" across all of the regions, and that
   such requirements fall under provisioning issues outside the scope of
   the replication system (e.g., requiring a certain level of routing
   redundancy and network capacity to ensure the delivery schedule can
   be met).

   There are similar referential integrity requirements for the real-
   time news provider application.  The provisioning requirements are
   likely to be significantly more demanding than they are for the
   content distribution application, given the time-sensitive nature of
   stock quote data.  Again, these provisioning requirements require
   systems and network engineering beyond the scope of the replication
   system.

   The service provisioning application also requires referential
   integrity.  For example, it would be confusing to users if their DHCP
   accounts were activated before their email accounts were activated.
   By replicating customer/client data in a single standard format to a
   regional server, all customer-facing services (DHCP, POP, etc.) could
   access the same server and achieve the desired property that "all
   services in the region are available as soon as one is".  Attempting
   to populate all regional servers simultaneously in a large
   geographically dispersed enterprise or ISP would unnecessarily reduce
   availability and increase lock contention/deadlocks/rollbacks needed
   to achieve such global ACID semantics.

   To our knowledge ANTACID semantics have not previously been defined
   per se.  However, these semantics are motivated by trying to
   formalize an approach that has been applied informally/one-off in
   various applications over the years.

7.3 Locking and Deadlock Prevention

   To support ANTACID semantics we define an update group as a grouping
   of update operations to be applied to documents within a single zone,
   all-or-nothing at each ARS replica as the update propagates from
   primary to all non-primary servers.

   To implement ANTACID semantics, locks are acquired separately at each
   server at the time a set of updates are to be committed, starting at
   the primary.  Each server acquires a zone-wide lock (to ensure that
   update groups are applied in a zone-wide serialized order), performs
   the updates, and then releases the zone-wide lock.  Because all locks
   for an update group are acquired at one time at each server after the
   entire update content has arrived at that server, ARS prevents
   deadlock [44].  Note that deadlock is possible if systems other than
   ARS acquire locks in a hold-and-wait fashion, so ARS implementations
   should carefully consider the possible deadlock problems if they
   allow services other than ARS to update the datastore.
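
   The following Python sketch shows the shape of this commit procedure
   at a single server, assuming a local datastore with conventional
   begin/commit/rollback transaction primitives (the interfaces are
   hypothetical):

      import threading

      zone_locks = {}          # zone name -> threading.Lock

      def apply_update_group(datastore, zone, operations):
          lock = zone_locks.setdefault(zone, threading.Lock())
          with lock:                    # zone-wide serialization
              txn = datastore.begin()   # hypothetical local ACID txn
              try:
                  for op in operations:
                      op.apply(txn)
                  txn.commit()          # all of the operations commit,
              except Exception:
                  txn.rollback()        # or none of them do
                  raise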

   The ARS grouped update mechanism is the only way to relate updates to
   each other, with all-or-nothing failure semantics.  In particular,
   unless updates to multiple documents are grouped together, failure of
   one update has no effect on subsequent updates.

   ARS does not provide transactional consistency among read/written
   values.  In fact, no read locking is supported by ARS.

7.4 Collapsing Update Semantics

   For efficiency of operation among infrequently synchronized
   datastores and to reduce server space requirements, ARS defines a
   notion of collapsing update semantics.  Rather than insisting that
   each replica apply all update operations, the semantics are relaxed
   as follows.  One or more operations may be elided from an update
   group (we discuss how elision is expressed below) if the datastore
   contents after successfully committing this subset of updates would
   be identical to the datastore contents that would result if the
   complete set of updates were performed.  Thus, for example, a
   document creation followed by three updates to that document may be
   "replayed" at a downstream server as a single write with that
   document's final value.

   The term "elided" above means that the given update operations and
   content are replaced by null operations accompanied by the CSN for
   each operation.  This approach is used rather than transmitting
   nothing for elided operations so that the downstream server receives
   a (sometimes null) operation for each CSN.  This provides a degree of
   fault tolerance, since it is a simple matter to detect missing
   operations from an up-counting set of sequence numbers.
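
   A minimal sketch of this elision follows, assuming each operation is
   represented as a (CSN, document, type, content) tuple, a
   representation chosen purely for illustration; a complete
   implementation would also need to fold creations and deletions into
   the surviving operation:

      def collapse(operations):
          # Remember the CSN of the final operation on each document.
          last_for_doc = {}
          for csn, doc, op_type, content in operations:
              last_for_doc[doc] = csn
          collapsed = []
          for csn, doc, op_type, content in operations:
              if csn == last_for_doc[doc]:
                  collapsed.append((csn, doc, op_type, content))
              else:
                  # Superseded: send a null operation that still
                  # carries its CSN, keeping the sequence up-counting.
                  collapsed.append((csn, doc, "null", None))
          return collapsed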

   Update collapsing may be performed over operations that originally
   spanned multiple update groups, as long as the collapsed updates are
   transmitted within a single update group.  As an example, if ten
   updates were applied in separate update groups to a particular
   document over the course of a day and a PDA requests all committed
   updates at the end of that day, the responding server could send a
   single update group that collapses out all but the final update to
   that document.

   Upstream ARS servers may collapse the set of update operations sent
   in response to a request for committed update content.  Downstream
   ARS servers must be prepared to handle collapsed updates.

7.5 Submitted Update Visibility

   While updates do not become part of the "official" state of the
   replicated ARS service until they have been serialized and committed
   at the primary, there are times when it can be useful for
   applications to be able to see updates immediately after they have
   been submitted.  For example, in a system that manages passwords for
   users it would be useful for users to be able to login to the system
   after changing their password at a local server rather than waiting
   for the update to commit at the primary.  This issue gets to the
   heart of some key tradeoffs in the design of a replication system, so
   before presenting the ARS approach we briefly review two other
   approaches we considered.

   Bayou provides a mechanism that allows applications to see a view (in
   RDBMS parlance; essentially, a stored query) showing updates that
   have been submitted but not yet committed, allowing tentative updates
   to circulate and possibly later be undone if a conflict is detected
   [45].  We chose not to use this approach because it can lead to a
   large volume of global undo/re-ordering activity in the case of
   widely replicated data.

   LDUP defines schema and protocol extensions to LDAPv3 [8] for
   replicating directory content [46].  LDUP allows updates to be
   visible immediately after they have been accepted by a server by
   having each server act as an independent master, with peer-to-peer
   conflict resolution procedures performed between servers from time to
   time.  This approach derives from a fundamental requirement of LDUP,
   namely, not depending on any single primary server for any given
   update.  There are good reasons for this requirement, including the
   fact that updates are never stalled by an unreachable primary server.
   However, we chose to accommodate the limitation of requiring a
   primary for reasons that may not be appropriate in the LDAP arena.
   Specifically, without a designated primary server we would lose the
   serial audit capability, which is required (Section 1.4) by ARS.

   Because of the above design considerations, ARS uses the following
   approach.  As a local implementation matter the submission server may
   allow applications to view the state of locally submitted updates.
   Submitted updates must not be propagated to other ARS servers in
   response to pull requests for committed updates.  Applications that
   view submitted but not-yet-committed state must be prepared to deal
   with the case where the update is rolled back after failing to commit
   at the primary.  For example, a password management application might
   allow local users to login with the new password after a local
   password change, but revert to the old password if the primary later
   rejects the update.

   If an ARS implementation allows applications to view the state of
   locally submitted updates, it should provide two views from which
   applications can choose:  one with only the committed updates, and
   one with the committed as well as locally submitted updates.
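
   A sketch of these two views, assuming committed documents are held
   in a dictionary and locally submitted operations in an ordered
   journal (both representations hypothetical), might look like:

      def committed_view(committed_docs):
          return dict(committed_docs)    # committed state only

      def submitted_view(committed_docs, submitted_ops):
          view = dict(committed_docs)
          # Overlay locally submitted operations; a content of None
          # stands for a locally submitted delete in this sketch.
          for doc, content in submitted_ops:
              if content is None:
                  view.pop(doc, None)
              else:
                  view[doc] = content
          return view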

7.6 Caching Partial Repository Contents

   Developers of some applications (such as browsers) may wish to cache
   copies of individual documents rather than undertaking the more
   involved implementation and administration requirements of
   replicating an entire zone.  ARS servers themselves do not
   participate in any caching protocol; they manage copies of document
   data using only the ARS protocol.  Caching, if it is performed at
   all, is performed by clients of the ARS service.  In effect, the ARS
   protocol is used for maintaining a given level of data coherence
   corresponding to the local provisioning requirements of the
   replication server administrator(s), and caching may be used by
   clients to hold on to replicated data once it leaves the replication
   service "cloud" according to coherence requirements defined outside
   of the replication service.

   While ARS servers do not participate in a caching protocol, ARS
   specifies one aspect of how caching must be performed by clients that
   wish to cache documents.  Specifically, each document may contain a
   32 bit Time-To-Live (TTL) field that specifies for how many seconds a
   cached document may be held by a client before it expires.  To avoid
   clock synchronization requirements, this TTL must be represented as a
   "delta" from current time rather than as an absolute clock time at
   which expiry is to occur.  Processes must decrement the TTL field by
   an amount equal to how long they have cached the document before
   passing the document to other processes.
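
   For example, a client process passing a cached document along might
   adjust the TTL as in the following sketch (the document
   representation is hypothetical):

      import time

      def forward_document(doc, received_at):
          held = int(time.time() - received_at)   # seconds held here
          doc["ttl"] = max(doc["ttl"] - held, 0)  # delta, never negative
          if doc["ttl"] == 0:
              raise ValueError("document expired; re-fetch it instead")
          return doc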

7.7 Asynchronous Submission and the Client-Server Promise

   When an ARS server receives an update submission from a client, it
   enters into a promise to follow that update through either to success
   or failure.  An infrequently connected client can thus disconnect
   from the network after an ARS server has successfully received the
   update submission, and rely on the server to perform all needed
   store-and-forward delivery attempts until the update succeeds or
   fails.
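
   The shape of this promise at a single server is sketched below; the
   upstream.send() and notify() primitives are hypothetical
   placeholders for the real ars-s exchanges:

      import time

      def follow_submission(submission, upstream, notify, retry=60):
          # Retry store-and-forward delivery until the update is known
          # to have committed or failed, then notify the submitter.
          while True:
              try:
                  result = upstream.send(submission)   # commit-or-fail
                  notify(submission["id"], result)
                  return result
              except ConnectionError:
                  time.sleep(retry)    # path unavailable; retry later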

   For pairs of ARS servers that support ars-s, each server that accepts
   an update submission along the submission path implicitly enters into
   the promise accepted at the submission server.  As a consequence,
   servers must not be shut down permanently while they are processing
   update submissions.  Before permanently decommissioning a server it
   must be operated in a mode that refuses new requests while completing
   old ones.

   To handle the case of a server not shut down cleanly (which can
   happen for a variety of reasons in operational settings, even though
   it violates the above requirement), the administrator of the
   immediately downstream server along the submission path can manually
   run a tool to fail submissions that have been routed to the shutdown
   server.

8. Conflict Detection

   Conflict detection is outside the scope of ARS, because some
   applications do not need it and would prefer to avoid the additional
   runtime overhead and implementation complexity.  However, ARS
   provides "hooks" to allow the local server implementation or an
   application to perform conflict detection as follows.

8.1 Write-Write Conflicts

   A server or application running at the ARS primary may detect write-
   write conflicts by checking whether the CSN value associated with the
   document submitted for update matches the CSN value associated with
   the primary's copy of that document.  If it does not match, a write-
   write conflict has occurred, in which case the primary may abort the
   update and respond to the submitter with an appropriate failure
   response message.  It is up to the submitting client to resolve the
   conflict, for example by re-reading the updated document and re-
   running the computation that updates it.

   Non-primary servers must not check for write-write conflicts in this
   fashion, else downstream servers will erroneously conclude that a
   conflict has occurred when they attempt to apply committed updates
   propagated down from upstream servers.
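
   A sketch of this check, assuming each stored document records the
   CSN of its last committed update and each submission carries the CSN
   the client last read (names hypothetical):

      def write_write_conflict(is_primary, stored_csn, submitted_csn):
          # Only the primary performs this check; non-primaries apply
          # committed updates from upstream without comparing CSNs.
          if not is_primary:
              return False
          # A mismatch means another update committed after the client
          # read the document: abort and report failure.
          return stored_csn != submitted_csn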

8.2 Read-Write Conflicts

   A local implementation may handle a basic case of read-write
   conflicts by allowing downstream clients to be notified when an
   update to their input documents arrives.  This approach allows
   updates to trickle through downstream servers (by re-running
   computations when a read-write conflict is detected so that
   eventually correct versions of the computed documents will be
   replicated), but does not support global transactional consistency
   among read/written values.  For example, consider this scenario: A
   server reads from replicated document X and uses it to compute a
   value for document Y.  Concurrently, someone updates X at another
   replica server after the above server has read X but before Y has
   been generated.  Thus, the generated value of Y does not reflect the
   most recent update to X.  With the notification mechanism, the server
   will find out when the new value of X arrives, run again, and
   regenerate Y based on the latest value of X.
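
   The corresponding hook might look like the following sketch, where
   the notification mechanism (e.g., a database trigger) and the
   recompute() and ars_submit() callables are hypothetical:

      def on_input_update(input_uri, recompute, ars_submit):
          # Re-run the computation against the newly arrived input and
          # resubmit the derived document.  The result is eventually
          # correct, not transactionally consistent.
          derived_uri, derived_content = recompute(input_uri)
          ars_submit(derived_uri, derived_content)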

   Note that ARS does not provide an update arrival notification
   primitive itself.  Rather, the use of the above-described arrival
   notification is outside the scope of ARS.  A local implementation may
   use database triggers or other mechanisms to provide notification.

   We chose this relatively simple approach to read-write conflict
   detection because we feel it is a reasonable compromise of
   scalability and implementation simplicity.  For example, we have
   scaling concerns about precedence graph-based approaches [47], and we
   believe that requiring application-specific conflict detection [48]
   would add too much complexity to the protocol.

9. Transmission Optimization: Encodings

   ARS defines a single data update representation that can be used for
   transmitting all submitted and committed updates between ARS peers.
   However, there are a variety of cases where a different
   representation can be significantly more efficient.  The ARS Encoding
   Negotiation Protocol (ars-e) provides a mechanism that can be used
   between pairs of communicating ARS peers for selecting a common set
   of MIME-based [36] data representations that support various types of
   transmission optimizations.  Its existence was inspired in part by
   features in HTTP/1.1 [49] as well as by the author's experience with
   the proprietary replication implementations in a variety of
   commercial technologies and services.

   A basic type of optimization we wish to support with ars-e is
   transmission of compressed data, for example using MIME type "gzip-
   compressed".  ars-e can also be used to support delta encoded data
   [50], which can be useful for textual repositories where small
   changes are made to individual documents.
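
   One plausible outcome of such a negotiation is sketched below; the
   encoding names, including the required default, are hypothetical
   placeholders rather than registered MIME types:

      REQUIRED_DEFAULT = "application/ars-update"   # hypothetical name

      def negotiate(local_encodings, peer_encodings):
          # Use the encodings both peers support; the single required
          # encoding is always available as a fallback.
          common = set(local_encodings) & set(peer_encodings)
          common.add(REQUIRED_DEFAULT)
          return common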

   Given our data scale and performance requirements (Section 1.4),
   probably the most important optimization is bulk transmission of raw
   data files between a pair of ARS servers that are both implemented
   atop the same proprietary data storage system (e.g., a particular
   version of a particular vendor's relational database).  For example,
   the Oracle 8i exportable tablespace feature [25] provides a means of
   exposing the underlying file system representation of database
   tables, which can be replicated across a network orders of magnitude
   faster than would be possible by transmitting the corresponding
   individual record-level SQL data manipulation operations.  A bulk
   transfer encoding might also be used to take advantage of proprietary
   backend hardware, for example using EMC's TimeFinder [51] product to
   transfer huge datastores across a storage area network.

   This ability to perform bulk synchronization of entire zones is how
   we address the requirement (Section 1.4), "Performance of individual
   point-to-point replication should not be significantly worse than
   proprietary schemes, excluding convergence time." Specifically, in
   the author's experience the performance difference between different
   incremental update protocols is minor compared to the difference
   between a system that supports bulk transmission and one that does
   not.

   There is one case where the use of an encoding can provide strictly
   greater functionality (as opposed to better performance) to the peers
   using that encoding.  As mentioned earlier (Section 6), each ARS
   server maintains a CSN-indexed list of operations applied to each
   zone.  When a downstream server requests all updates since a given
   CSN, the upstream server consults the operation list to generate the
   needed response.  As a local implementation matter, servers may
   choose to represent this list in a log, and they may choose to
   truncate a prefix of this log periodically to conserve space.  If
   they do, it is possible that a downstream server could request
   updates that precede the upstream server's log truncation point.  In
   this case, the upstream server sends an error to the downstream
   server, from which the downstream server may recover by requesting
   that the complete zone content be transmitted (similar to the
   approach taken by Bayou [37]).  This content must be transmitted with
   an encoding designed for full zone transfers.  For example, an
   encoding could be defined that transmits the complete zone as a
   sequence of one create operation for each existing document.  Or, as
   noted above, the entire zone could be transferred as a compressed
   archive file if both servers are built on the same type of data
   storage system.  In either case, the downstream server would need to
   delete its existing zone content before inserting the transmitted
   content.
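
   The downstream recovery path might look like the following sketch,
   where the upstream handle, its error type, and the datastore
   primitives are all hypothetical:

      class LogTruncatedError(Exception):
          """Requested CSN precedes the upstream's truncation point."""

      def pull_updates(upstream, zone, last_csn, datastore):
          try:
              return upstream.updates_since(zone, last_csn)
          except LogTruncatedError:
              # Fall back to a full zone transfer, using an encoding
              # designed for that purpose.
              snapshot = upstream.full_zone(zone)
              datastore.delete_zone(zone)     # discard local content
              datastore.load_zone(zone, snapshot)
              return []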

   We place one key restriction on the use of ars-e: an implementation
   must not use any proprietary formats in its encodings except for the
   special case of transmitting an entire zone as a bulk transmission.
   In all other cases the encodings must be published as open
   specifications.  Implementations that do not meet this requirement
   are deemed not to be in compliance with the ARS specification.  This
   restriction is intended to allow an important efficiency enhancement
   for bulk data transfer, yet to ensure that there is always an
   interoperable way to transfer data between ARS peers.  Moreover, we
   seek to prevent vendors from claiming "trivial" ARS compliance by
   misusing the bulk encoding "escape" to drop into their proprietary
   system for all replication, and in so doing claim a more efficient
   (but non-interoperable) "ARS implementation" than their competitors.

10. Security

   ARS is defined in terms of a BEEP [28] profile.  A number of ARS's
   security characteristics derive from BEEP's built-in security models.

10.1 Authentication

   As a BEEP profile, ARS uses the Simple Authentication and Security
   Layer (SASL [52]) to negotiate the authentication mechanism used
   between ARS peers.

10.2 Confidentiality

   Encryption may be used during transmission, per usual practice with
   BEEP profiles.

10.3 Integrity

   Documents may contain primary-generated digital signatures to enforce
   integrity of contents and prevent man-in-the-middle or sequence
   number collision attacks.  As is the case with caching, digital
   signing (if implemented) is implemented outside of the ARS service,
   by clients of the ARS service.  ARS does not specify how documents
   are to be signed.  The XML DSIG representation [53] is a possible
   approach.

10.4 Access Control

   Access control is not yet worked out for ARS.  We intend to define a
   two-level access control architecture for ARS, wherein a Document
   Type Definition (DTD) specifies an Access Control List structure
   used between pairs of ARS servers, and user-level access control is
   left as a local implementation matter.  (Note that users interact
   with ARS
   through clients which authenticate for the given users' identity.)
   The ACL DTD, requirements governing the distribution and uniform
   application of ACLs, and other protocol and architecture issues, are
   all areas for future work.

11. Datastore Interoperation

   Other services may need to share access to a datastore being
   replicated by ARS.  For example, ARS provides no search support,
   instead assuming that such functionality could be built into a system
   that shares access to the replicated datastore.

   There are three requirements for sharing access to a datastore
   replicated by ARS:

   1.  The underlying datastore must consist entirely of hierarchically-
       named, well-formed XML documents.

   2.  All datastore read requests must perform zone-wide locking, as
       described earlier (Section 7.3).

   3.  All datastore updates must "funnel" through ARS.  For example, a
       non-ARS service may accept update requests for replicated
       documents, forward these requests to ARS, and wait for ARS to
       indicate the update has completed before responding to the
       initiating client.
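
   For illustration, a non-ARS front-end service might funnel updates
   as sketched below, with hypothetical submit() and wait_for_commit()
   primitives standing in for the real ARS interactions:

      def handle_client_update(ars, zone, operations):
          submission_id = ars.submit(zone, operations)  # funnel via ARS
          result = ars.wait_for_commit(submission_id)   # commit-or-fail
          return result   # only now respond to the initiating client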

12. Summary: ARS's Realm of Applicability

   In this section we briefly summarize the strengths and limitations of
   ARS, to help data service architects determine whether ARS is
   appropriate for their needs.

12.1 Strengths

   ARS replicates XML content in a manner that we believe is appropriate
   for a broad class of business-critical internetworked applications.
   Specifically, ANTACID semantics guarantee that updates submitted
   together are applied in the same order and with ACID semantics at
   each replica, but that individual replicas apply updates
   asynchronously from one another.  This approach is motivated by the
   scaling problems of providing global ACID semantics and by the fact
   that many applications work by interacting with a single replica, for
   which all that is required is that updates to that replica appear to
   be ACID to applications interacting with that replica.

   ARS addresses scalability in data volume and replica count through a
   combination of ANTACID semantics, update routing according to a
   configured graph structure, the separation of conflict detection from
   the replication service, transmission encodings, and the ability to
   partition the name space for distributed administration.

   ARS addresses operation across a variety of network environments
   through a combination of update routing according to a configured
   graph structure, store-and-forward update submissions, and support
   for both pull and push models for committed update propagation.

   ARS addresses mobile/computationally modest client operation through
   a combination of the client-server promise, asynchronous
   notification, collapsing updates, update-anywhere capability, and the
   distinction of replication-service vs client-of-replication-service.

12.2 Limitations

   Replication is a broad problem, for which no single standard could
   work well for "all" applications.  ARS has a number of limitations,
   which we discuss in two categories below: architectural limitations,
   and limitations concerned with the current state of completeness of
   the specification.  The former are probably fundamental to the design
   tradeoffs we have chosen (unless the specification evolves in ways we
   have not predicted), while the latter are likely to change over time.

   In some cases the limitations discussed below can be handled by
   suitable modifications to how data is organized in a given
   application.  In other cases ARS may simply be inappropriate for the
   application.

12.2.1 Architectural Limitations

   Probably ARS's biggest limitation is that it does not support global
   ACID updates.  Applications requiring global ACID semantics should
   use a different replication solution, such as one of the commercially
   available RDBMS's.

   On a related note, ARS's ANTACID semantics provide no support for a
   notion of referential integrity that spans beyond a single update.
   For example, ARS provides no means of ensuring that a web page that
   was once visible at a website can later be visited via a bookmark (a
   property that breaks if sites regularly take old content off their
   servers).

   Also related to ANTACID semantics, ARS's read-write conflict
   detection mechanism is limited to detecting when an update arrives
   for a document or document subtree.  This approach is oriented
   towards applications where updates trigger computations that can re-
   generate derived data, in support of computed content pipelines.  ARS
   specifically does not support any notion of transactional integrity
   in the face of read-write conflicts.  In other words, an ARS server
   cannot guarantee that it will detect that a read-write conflict has
   occurred at the time an update is being committed.

   Another limitation is ARS's reliance on a primary server, the
   unavailability of which will stall update submissions from
   committing.  Because of this, an ARS primary will typically need to
   reside on a highly available, well connected machine.  Mobile devices
   using ARS can only commit updates when they connect to a network
   capable of routing to the primary.  For example, a pair of mobile
   users on an airplane could not use ARS to commit updates and
   synchronize between each other while disconnected from the rest of
   the Internet unless one of the mobile users' machines is itself the
   primary server.  Updates can be submitted while disconnected, and ARS
   provides a way for clients to view the local submitted update state,
   but the updates cannot propagate to other ARS peers until they have
   committed at the primary server.

   More generally, processes are not free to propagate updates to
   arbitrary other ARS peers.  All updates must flow according to a
   configured graph topology.  This graph topology provides for
   redundant paths through the network, but for example an ARS server
   cannot simply locate a nearby server (e.g., by LAN broadcast) and
   exchange update state.

   On a related note, the ARS replication topology must conform to a
   DAG structure.  This restriction can in some cases mean that
   achieving redundant paths requires more servers than would be needed
   if the graph were not required to be acyclic.  For example, consider
   the following cyclic replication topology:


                    s1
                    | \
                    |  \
                   s2---s3


   In this figure updates can propagate to s2 if s3 is down and vice
   versa.  With ARS's DAG-based topology an additional server would be
   required to achieve the same level of redundancy:


                  s1------>s4
                  | \      /
                  |  \    /
                  |    \ /
                  |    / \
                  |   /   \
                 \|/\|/   \|/
                   s2----->s3


   Related to the DAG limitations, if a server accepts a submission and
   all of its upstream servers are unreachable there is currently no way
   to "back the submission out" from the accepting server and send it to
   another server whose upstream servers are currently reachable.
   Because of the client-server promise, once a server accepts the
   submission, that part of the submission path is fixed.  As a
   consequence, it is important that each server be provisioned with
   upstream servers that together provide acceptable availability.

   Related to ARS's primary server design is the fact that replicated
   collections are defined by the accumulation of changes visible at the
   single primary server.  Thus, for example, ARS would not be a very
   natural fit for applications that need to allow a user to select
   pieces of data to download as a one-time operation from multiple
   sources -- e.g., as done by peer-to-peer file sharing applications
   like Napster and Gnutella.

   Another limitation of ARS is that it only replicates XML content.
   While arbitrary data could be encoded into XML, doing so would not be
   an efficient means of transmitting large volumes of binary data, such
   as audio or video content.

   Another limitation is that ARS does not support attribute-level
   replication.  Updates are performed at the granularity of an entire
   XML document (although ARS's encoding mechanism may cause
   significantly less data than the full document to be transmitted
   during replication).  Thus, for example, an application that modifies
   a few cells in a large table would not do well to represent the table
   as one large document with elements or attributes representing the
   individual cells.  Instead, each individual document should be
   "reasonably small" (the exact definition of which depends on the
   speed of network links and servers, required update commit rates,
   etc.).

   Another granularity limitation of ARS concerns locking: because ARS
   serializes updates at the granularity of a zone, zone-wide locking is
   needed while applying committed updates.  In turn this rough-grained
   locking can present performance bottlenecks for some application
   workloads compared with, for example, row-level locking in RDBMS's.

   Another limitation is that ARS provides no protocol support for
   determining when an update has circulated globally.  An application
   could be written to traverse the ARS DAG for a given zone and check
   for update completion, but distributed administration of servers
   could prevent discovery or querying of some servers.  Moreover, the
   cost of this computation could be prohibitive.  Finally, the list of
   replicas may itself be in flux, making it difficult to define a
   meaningful notion of "all replicas".

12.2.2 Completeness Limitations

   The digital signing and access control models are not worked out.
   Among other things, an Access Control List format DTD needs to be
   defined, along with protocol/architectural requirements on the
   distribution and uniform application of access control in ARS.

   A MIME subtype for registering new ARS encodings needs to be formally
   defined and registered with the Internet Assigned Numbers Authority
   (IANA).

   Procedures need to be worked out for how to split and delegate zones.

   SNMP MIBs have not been defined for ARS service monitoring and
   management.

   There is no support for automated layout of the replication topology,
   which may in turn limit the number of servers to which the
   replication service can be manageably deployed.

   ARS does not provide guarantees about how long it can take for
   updates to propagate, and instead considers this to be a local
   engineering/capacity planning matter.  However, in the future a new
   ARS sub-protocol might also be defined to support some form of
   conditional service guarantees for deployments where resource
   reservations are possible.

13. Acknowledgements

   The author would like to thank the following people for their many
   helpful suggestions about ARS: David Clark, Dave Crocker, Marco
   Gazzetta, Carl Malamud, Paul Mockapetris, Darren New, Calton Pu, and
   Marshall Rose.

References

   [1]   World Wide Web Consortium, "Extensible Markup Language (XML)
         1.0", W3C XML, February 1998, <http://www.w3.org/TR/1998/REC-
         xml-19980210>.

   [2]   Schwartz, M., "The ANTACID Replication Service: Protocol and
         Algorithms", draft-schwartz-antacid-protocol-00 (work in
         progress), October 2001.

   [3]   Gray, J. and A. Reuter, "Transaction Processing: Concepts and
         Techniques", Morgan-Kaufmann Publishers, Inc., 1993.

   [4]   Rigney, C., Rubens, A., Simpson, W. and S. Willens, "Remote
         Authentication Dial In User Service (RADIUS)", RFC 2058,
         January 1997.

   [5]   Droms, R., "Dynamic Host Configuration Protocol", RFC 1531,
         October 1993.

   [6]   Reynolds, J., "Post Office Protocol", RFC 918, October 1984.

   [7]   Postel, J., "Simple Mail Transfer Protocol", RFC 788,
         November 1981.

   [8]   Yeong, W., Howes, T. and S. Kille, "X.500 Lightweight Directory
         Access Protocol", RFC 1487, July 1993.

   [9]   Rabinovich, M., "Issues in Web Content Replication", Data
         Engineering Bulletin Vol. 21 No. 4, December 1998,
         <http://www.research.att.com/~misha/otherPubs/web_repl.ps.gz>.

   [10]  Cooper, I., Melve, I. and G. Tomlinson, "Internet Web
         Replication and Caching Taxonomy", RFC 3040, January 2001.

   [11]  Wessels, D., "Squid Web Proxy Cache", August 2000,
         <http://www.squid-cache.org/>.

   [12]  Kantor, B. and P. Lapsley, "Network News Transfer Protocol",
         RFC 977, February 1986.

   [13]  Wessels, D. and K. Claffy, "Internet Cache Protocol (ICP),
         version 2", RFC 2186, September 1997.

   [14]  Schwartz, M., "Formal Service Agreements for Scaling Internet
         Caching", NLANR Web Cache Workshop, June 1997,
         <http://www.ircache.net/Cache/Workshop97/Papers/Schwartz/
         schwartz.html>.

   [15]  Microsoft Corporation, "Cache Array Routing Protocol and
         Microsoft Proxy Server 2.0", August 1998,
         <http://www.microsoft.com/proxy/documents/CarpWP.exe>.

   [16]  Hoff, A., Giannandrea, J., Hapner, M., Carter, S. and M. Medin,
         "The HTTP Distribution and Replication Protocol", August 1997,
         <http://www.w3.org/TR/NOTE-drp>.

   [17]  Goland, Y., Whitehead, E., Faizi, A., Carter, S. and D. Jensen,
         "HTTP Extensions for Distributed Authoring -- WEBDAV", RFC
         2518, February 1999.

   [18]  Skardal, H., Bunnell, J., Chellam, S., Gardner, T., Hendrie,
         C., Komatsu, K., Linn, G., Manley, D., Waidhofer, G. and J.
         Ward, "NDMP Version 4 Protocol", draft-skardal-ndmpv4-02 (work
         in progress), April 2001.

   [19]  Leach, P. and D. Naik, "A Common Internet File System
         (CIFS/1.0) Protocol", December 1997,
         <ftp://ftp.microsoft.com/developr/drg/cifs/cifs9f.doc>.

   [20]  Sun Microsystems, Inc., "NFS: Network File System Protocol
         specification", RFC 1094, March 1989.

   [21]  Microsoft Corporation, "ODBC Section of the Microsoft Universal
         Data Access Web Site", March 1999,
         <http://www.microsoft.com/data/odbc/default.htm>.

   [22]  International Organization for Standardization, "Information
         Technology - Database Languages - SQL", ISO/IEC 9075:1992,
         1992.

   [23]  Satyanarayanan, M., Kistler, J., Kumar, P., Okasaki, M.,
         Siegel, E. and D. Steere, "Coda: A Highly Available File System
         for a Distributed Workstation Environment", IEEE Transactions
         on Computers Vol. 39, No. 4, April 1990,
         <http://www.cs.cmu.edu/afs/cs/project/coda/Web/docdir/tcc90.pdf>
         .

   [24]  Gray, J., Helland, P., O'Neil, P. and D. Shasha, "The Dangers
         of Replication and a Solution", Proceedings of SIGMOD
         International Conference on Management of Data, 1996.

   [25]  Oracle Corporation, "Oracle8i Replication Release 8.1.5 A67791-
         01", February 1999,
         <http://technet.oracle.com/doc/server.815/a67791/toc.htm>.

   [26]  Danzig, P., DeLucia, D. and K. Obraczka, "Massively Replicating



Schwartz                  Expires April 7, 2002                [Page 53]


Internet-Draft               ANTACID Service                October 2001


         Services in Wide-Area Internetworks", University of Southern
         California Technical Report 94-595, 1994,
         <ftp://catarina.usc.edu/pub/kobraczk/ToN.ps.Z>.

   [27]  Atkinson, R. and J. Munter, "UDDI Version 2.0 Replication
         Specification", June 2001,
         <http://www.uddi.org/pubs/Replication-V2.00-Open-20010608.pdf>.

   [28]  Rose, M., "The Blocks Extensible Exchange Protocol Core", RFC
         3080, March 2001.

   [29]  Demers, A., Greene, D., Hauser, C., Irish, W., Larson, J.,
         Shenker, S., Sturgis, H., Swinehart, D. and D. Terry, "Epidemic
         Algorithms for Replicated Database Maintenance", Proceedings of
         the Sixth Conference on Principles of Distributed Computing,
         August 1987.

   [30]  Schroeder, M., Birrell, A. and R. Needham, "Experience with
         Grapevine: The Growth of a Distributed System", ACM
         Transactions on Computer Systems Vol. 2 No. 1, February 1984.

   [31]  Rose, M., Gazzetta, M. and M. Schwartz, "The Blocks Datastore
         Model", Draft Technical Memo, January 2001.

   [32]  Braden, R. and J. Postel, "Requirements for Internet gateways",
         RFC 1009, June 1987.

   [33]  Even, S., "Graph Algorithms", Computer Science Press, 1979.

   [34]  Lampson, B. and H. Sturgis, "Crash Recovery in a Distributed
         Data Storage System", Xerox Palo Alto Research Center Internal
         Report, April 1979.

   [35]  Mockapetris, P., "Domain names - concepts and facilities",
         STD 13, RFC 1034, November 1987.

   [36]  Borenstein, N. and N. Freed, "MIME (Multipurpose Internet Mail
         Extensions) Part One: Mechanisms for Specifying and Describing
         the Format of Internet Message Bodies", RFC 1521, September
         1993.

   [37]  Petersen, K., Spreitzer, M., Terry, D., Theimer, M. and A.
         Demers, "Flexible Update Propagation for Weakly Consistent
         Replication", Proceedings of the 16th ACM Symposium on
         Operating System Principles, October 1997,
         <http://www.parc.xerox.com/csl/projects/bayou/pubs/sosp-
         97/index.html>.




Schwartz                  Expires April 7, 2002                [Page 54]


Internet-Draft               ANTACID Service                October 2001


   [38]  Claffy, K., Braun, H. and M. Schwartz, "A Distributed Testbed
         for National Information Provisioning", A proposal submitted to
         the National Science Foundation , April 1995,
         <http://www.ircache.net/Cache/C-Proposal/kcs.html>.

   [39]  Berners-Lee, T., Fielding, R. and L. Masinter, "Uniform
         Resource Identifiers (URI): Generic Syntax", RFC 2396, August
         1998.

   [40]  Fenner, B., Handley, M., Holbrook, H. and I. Kouvelas,
         "Protocol Independent Multicast - Sparse Mode (PIM-SM):
         Protocol Specification (Revised)", draft-ietf-pim-sm-v2-new-03
         (work in progress), July 2001.

   [41]  Steere, D., Baptista, A., McNamee, D., Pu, C. and J. Walpole,
         "Research Challenges in Environmental Observation and
         Forecasting Systems", Proceedings of The Sixth Annual
         International Conference on Mobile Computing and Networking,
         August 2000.

   [42]  Fox, A., Gribble, S., Chawathe, Y., Brewer, E. and P. Gauthier,
         "Cluster-Based Scalable Network Services", Proceedings of the
         16th Symposium on Operating Systems Principles, October 1997,
         <http://www.cs.berkeley.edu/~gribble/papers/sosp16.ps.gz>.

   [43]  Ericsson Utvecklings AB, "Mnesia 3.9.2", 2000,
         <http://www.erlang.se/doc/doc-5.0.1/lib/mnesia-
         3.9.2/doc/index.html>.

   [44]  Brinch Hansen, P., "Operating System Principles", Prentice
         Hall, 1973.

   [45]  Terry, D., Petersen, K., Spreitzer, M. and M. Theimer, "The
         Case for Non-transparent Replication: Examples from Bayou",
         IEEE Data Engineering Bulletin, December 1998,
         <http://www.parc.xerox.com/csl/projects/bayou/pubs/dataeng-
         98/DataEngineeringDec98.frame.html>.

   [46]  Merrells, J., Reed, E. and U. Srinivasan, "LDAP Replication
         Architecture", draft-ietf-ldup-model-06 (work in progress),
         March 2000.

   [47]  Davidson, S., "Optimism and Consistency in Partitioned
         Distributed Database Systems", ACM Transactions on Database
         Systems Vol. 9 No. 3, September 1984.

   [48]  Terry, D., Theimer, M., Petersen, K., Demers, A., Spreitzer, M.
         and C. Hauser, "Managing Update Conflicts in Bayou, a Weakly



Schwartz                  Expires April 7, 2002                [Page 55]


Internet-Draft               ANTACID Service                October 2001


         Connected Replicated Storage System", Proceedings of the 15th
         ACM Symposium on Operating System Principles, December 1995,
         <http://www.parc.xerox.com/csl/projects/bayou/pubs/sosp-
         95/www/BayouConflictsSOSP-web_1.html>.

   [49]  Fielding, R., Gettys, J., Mogul, J., Nielsen, H., Masinter, L.,
         Leach, P. and T. Berners-Lee, "Hypertext Transfer Protocol --
         HTTP/1.1", RFC 2616, June 1999.

   [50]  Mogul, J., Krishnamurthy, B., Douglis, F., Feldmann, A.,
         Goland, Y., van Hoff, A. and D. Hellerstein, "Delta Encoding in
         HTTP", draft-mogul-http-delta-10 (work in progress), October
         2001.

   [51]  Kodumudi, S. and P. Manning, "Split Mirror Backup and Database
         Replication Using EMC TimeFinder", February 1999,
         <http://www.emc.com/products/product_pdfs/wp/timefinder/smbudr_tf.pdf>
         .

   [52]  Myers, J., "Simple Authentication and Security Layer (SASL)",
         RFC 2222, October 1997.

   [53]  Eastlake, D., Reagle, J. and D. Solo, "XML-Signature Syntax and
         Processing", RFC 3075, March 2001.


Author's Address

   Michael F. Schwartz
   Code On The Road, LLC

   EMail: schwartz@CodeOnTheRoad.com
   URI:   http://www.CodeOnTheRoad.com


















Schwartz                  Expires April 7, 2002                [Page 56]


Internet-Draft               ANTACID Service                October 2001


Appendix A. Extended List of Potential ARS Applications

   To help illustrate ARS's potential applicability, the following
   table expands on the three example replication applications
   presented in Section 1.1 (Example Applications).  We list these
   examples without further discussion, and refer readers wanting
   more depth to that section.



   |--------------------------------------------------------------------|
   |      Application     |                 Description                 |
   |--------------------------------------------------------------------|
   | content distribution | central content provider publishes content  |
   | network              | updates that propagate to servers at many   |
   |                      | network points of presence                  |
   |--------------------------------------------------------------------|
   | real-time news       | many small docs flowing from central site   |
   | provider             | out to sites with which PDAs synchronize    |
   |--------------------------------------------------------------------|
   | service              | integr. of back office subscriber DBs with  |
   | provisioning         | RADIUS, DHCP, POP, SMTP, LDAP, WWW, other   |
   |                      | subscriber- or client-facing service DBs    |
   |--------------------------------------------------------------------|
   | computed content     | data owner injects data to be read by       |
   | pipeline             | competing downstream services that compute  |
   |                      | derived content                             |
   |--------------------------------------------------------------------|
   | nearby cell data     | PDAs submit updates and receive committed   |
   | synchronization      | updates from nearest cell server or kiosk   |
   |--------------------------------------------------------------------|
   | network management   | multiple agents observing each net element  |
   | data "plumbing"      | make conflicting updates.  Regional cor-    |
   |                      | relation/aggregation reduces NOC traffic    |
   |                      | (cf. NetExpert [55])                        |
   |--------------------------------------------------------------------|
   | private annotations  | publicly updateable replication space for   |
   | to centrally owned   | one data set, more restrictive space for    |
   | part of name space   | others. Multiple annotation authors/owners  |
   |--------------------------------------------------------------------|
   | integrated workflow  | open standard replication for workflow,     |
   |                      | version control, web content publishing     |
   |--------------------------------------------------------------------|
   | XML-driven dynamic   | distributed dynamic content servers con-    |
   | content              | trolled by XML-encoded business logic+data  |
   |--------------------------------------------------------------------|
   | multi-PDA            | users perform updates via more than two     |
   | synchronization      | devices (PDA, cell phone, desktop, etc.)    |



Schwartz                  Expires April 7, 2002                [Page 57]


Internet-Draft               ANTACID Service                October 2001


   |--------------------------------------------------------------------|
   | intranet replication | mobile/field office/partner data exchange   |
   |--------------------------------------------------------------------|
   | structured netnews   | advantages over NNTP: (a) eventually        |
   |                      | consistent, ordered updates, (b) fielded    |
   |                      | search, (c) authenticated                   |
   |--------------------------------------------------------------------|












































Schwartz                  Expires April 7, 2002                [Page 58]


Internet-Draft               ANTACID Service                October 2001


Full Copyright Statement

   Copyright (C) The Internet Society (2001).  All Rights Reserved.

   This document and translations of it may be copied and furnished to
   others, and derivative works that comment on or otherwise explain it
   or assist in its implementation may be prepared, copied, published
   and distributed, in whole or in part, without restriction of any
   kind, provided that the above copyright notice and this paragraph are
   included on all such copies and derivative works.  However, this
   document itself may not be modified in any way, such as by removing
   the copyright notice or references to the Internet Society or other
   Internet organizations, except as needed for the purpose of
   developing Internet standards in which case the procedures for
   copyrights defined in the Internet Standards process must be
   followed, or as required to translate it into languages other than
   English.

   The limited permissions granted above are perpetual and will not be
   revoked by the Internet Society or its successors or assigns.

   This document and the information contained herein is provided on an
   "AS IS" basis and THE INTERNET SOCIETY AND THE INTERNET ENGINEERING
   TASK FORCE DISCLAIMS ALL WARRANTIES, EXPRESS OR IMPLIED, INCLUDING
   BUT NOT LIMITED TO ANY WARRANTY THAT THE USE OF THE INFORMATION
   HEREIN WILL NOT INFRINGE ANY RIGHTS OR ANY IMPLIED WARRANTIES OF
   MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE.

   Code On The Road, LLC expressly disclaims any and all warranties
   regarding this contribution including any warranty that (a) this
   contribution does not violate the rights of others, (b) the owners,
   if any, of other rights in this contribution have been informed of
   the rights and permissions granted to IETF herein, and (c) any
   required authorizations from such owners have been obtained.  This
   document and the information contained herein is provided on an "AS
   IS" basis and CODE ON THE ROAD, LLC DISCLAIMS ALL WARRANTIES, EXPRESS
   OR IMPLIED, INCLUDING BUT NOT LIMITED TO ANY WARRANTY THAT THE USE OF
   THE INFORMATION HEREIN WILL NOT INFRINGE ANY RIGHTS OR ANY IMPLIED
   WARRANTIES OF MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE.

   IN NO EVENT WILL CODE ON THE ROAD, LLC BE LIABLE TO ANY OTHER PARTY
   INCLUDING THE IETF AND ITS MEMBERS FOR THE COST OF PROCURING
   SUBSTITUTE GOODS OR SERVICES, LOST PROFITS, LOSS OF USE, LOSS OF
   DATA, OR ANY INCIDENTAL, CONSEQUENTIAL, INDIRECT, OR SPECIAL DAMAGES
   WHETHER UNDER CONTRACT, TORT, WARRANTY, OR OTHERWISE, ARISING IN ANY
   WAY OUT OF THIS OR ANY OTHER AGREEMENT RELATING TO THIS DOCUMENT,
   WHETHER OR NOT SUCH PARTY HAD ADVANCE NOTICE OF THE POSSIBILITY OF
   SUCH DAMAGES.



Schwartz                  Expires April 7, 2002                [Page 59]


Internet-Draft               ANTACID Service                October 2001


Acknowledgement

   Funding for the RFC Editor function is currently provided by the
   Internet Society.















































Schwartz                  Expires April 7, 2002                [Page 60]