Network Working Group M. Schwartz
Internet-Draft Code On The Road, LLC
Expires: April 7, 2002 October 7, 2001
The ANTACID Replication Service: Rationale and Architecture
draft-schwartz-antacid-service-00
Status of this Memo
This document is an Internet-Draft and is in full conformance with
all provisions of Section 10 of RFC 2026 except that the right to
produce derivative works is not granted. (If this document becomes
part of an IETF working group activity, then it will be brought into
full compliance with Section 10 of RFC 2026.)
Internet-Drafts are working documents of the Internet Engineering
Task Force (IETF), its areas, and its working groups. Note that
other groups may also distribute working documents as Internet-
Drafts.
Internet-Drafts are draft documents valid for a maximum of six months
and may be updated, replaced, or obsoleted by other documents at any
time. It is inappropriate to use Internet-Drafts as reference
material or to cite them other than as "work in progress."
The list of current Internet-Drafts can be accessed at
http://www.ietf.org/ietf/1id-abstracts.txt.
The list of Internet-Draft Shadow Directories can be accessed at
http://www.ietf.org/shadow.html.
This Internet-Draft will expire on April 7, 2002.
Copyright Notice
Copyright (C) The Internet Society (2001). All Rights Reserved.
Abstract
This memo presents the ANTACID Replication Service, which replicates
hierarchically named repositories of XML documents for business-
critical, internetworked applications.
ASCII and HTML versions of this document are available at
http://www.codeontheroad.com/papers/draft-schwartz-antacid-
service.txt and http://www.codeontheroad.com/papers/draft-schwartz-
antacid-service.html, respectively.
Schwartz Expires April 7, 2002 [Page 1]
Internet-Draft ANTACID Service October 2001
Table of Contents
1. Introduction . . . . . . . . . . . . . . . . . . . . . . . 3
1.1 Example Applications . . . . . . . . . . . . . . . . . . . 5
1.2 Caching vs. Replication and Lazy-vs-Eager Update
Propagation . . . . . . . . . . . . . . . . . . . . . . . 7
1.3 Currently Deployed Replication Technology . . . . . . . . 9
1.4 Requirements . . . . . . . . . . . . . . . . . . . . . . . 13
2. Terminology . . . . . . . . . . . . . . . . . . . . . . . 20
3. Overview of the Approach . . . . . . . . . . . . . . . . . 23
4. Name Space . . . . . . . . . . . . . . . . . . . . . . . . 28
5. Replication Topology . . . . . . . . . . . . . . . . . . . 30
6. Update Ordering . . . . . . . . . . . . . . . . . . . . . 32
7. Atomicity, Consistency, and Reliability . . . . . . . . . 33
7.1 Eventual Consistency . . . . . . . . . . . . . . . . . . . 33
7.2 ANTACID Semantics . . . . . . . . . . . . . . . . . . . . 33
7.3 Locking and Deadlock Prevention . . . . . . . . . . . . . 35
7.4 Collapsing Update Semantics . . . . . . . . . . . . . . . 36
7.5 Submitted Update Visibility . . . . . . . . . . . . . . . 37
7.6 Caching Partial Repository Contents . . . . . . . . . . . 38
7.7 Asynchronous Submission and the Client-Server Promise . . 38
8. Conflict Detection . . . . . . . . . . . . . . . . . . . . 40
8.1 Write-Write Conflicts . . . . . . . . . . . . . . . . . . 40
8.2 Read-Write Conflicts . . . . . . . . . . . . . . . . . . . 40
9. Transmission Optimization: Encodings . . . . . . . . . . . 42
10. Security . . . . . . . . . . . . . . . . . . . . . . . . . 44
10.1 Authentication . . . . . . . . . . . . . . . . . . . . . . 44
10.2 Confidentiality . . . . . . . . . . . . . . . . . . . . . 44
10.3 Integrity . . . . . . . . . . . . . . . . . . . . . . . . 44
10.4 Access Control . . . . . . . . . . . . . . . . . . . . . . 44
11. Datastore Interoperation . . . . . . . . . . . . . . . . . 45
12. Summary: ARS's Realm of Applicability . . . . . . . . . . 46
12.1 Strengths . . . . . . . . . . . . . . . . . . . . . . . . 46
12.2 Limitations . . . . . . . . . . . . . . . . . . . . . . . 46
12.2.1 Architectural Limitations . . . . . . . . . . . . . . . . 47
12.2.2 Completeness Limitations . . . . . . . . . . . . . . . . . 49
13. Acknowledgements . . . . . . . . . . . . . . . . . . . . . 51
References . . . . . . . . . . . . . . . . . . . . . . . . 52
Author's Address . . . . . . . . . . . . . . . . . . . . . 56
A. Extended List of Potential ARS Applications . . . . . . . 57
Full Copyright Statement . . . . . . . . . . . . . . . . . 59
1. Introduction
In this document we present the motivation, design rationale, and
architecture for the ANTACID Replication Service (ARS). ARS
replicates repositories of hierarchically-named, well-formed XML
documents [1] in a manner that supports business-critical,
internetworked applications (BCIAs). The ARS protocol and
algorithmic details are defined in a companion document [2].
By "business-critical" we mean applications requiring stronger data
coherence guarantees than file-by-file replication, but weaker than
global ACID semantics (i.e., Atomic, Consistent, Isolated, Durable
[3] semantics spanning all replicas). Our motivation is that many
commercial services require coherence guarantees for replicated data,
but that global ACID semantics are not appropriate across the
Internet because:
o global ACID semantics don't scale (a point we will discuss in more
depth later); and
o applications requiring global ACID semantics (e.g., banking) often
require privately owned, centrally controlled infrastructure
rather than the open Internet.
The "ANTACID" part of ARS refers to data coherence semantics we
define later in this document, which we believe are well suited to
BCIAs.
By "internetworked" we mean applications characterized by:
o up to millions of servers;
o administrative distribution;
o both planned and unplanned server and network outages;
o a wide range of network capacity; and
o a wide range of provisioning requirements concerning timeliness of
updates, security, and other operational issues.
To date, much of the replication technology that provides business-
critical support has developed in the relational database (RDBMS)
world. This world has produced technology that provides significant
structure (well defined data models, integrity constraints, atomic
update, and serial audit trails), but that does not work well in
internetworked environments. For example, RDBMSs often use protocols
that are too data-intensive for low-bandwidth networks, lack support
for distributed administration, and require all replicas to
communicate with a single master server. On the other hand, much of
the current internetwork-oriented replication technology
(specifically, protocols standardized by the Internet Engineering
Task Force (IETF)) does not provide business-critical support. For
example, IETF replication protocols typically provide no means of
atomically updating a set of data objects, and in some cases do not
guarantee that all updates are delivered to all replicas.
With continuing growth in the Internet's size and breadth of uses,
applications increasingly require distributed data management
technology that provides data coherence guarantees and also works
well across the Internet. Recent examples include content delivery
networks, corporate data sharing partnerships, content syndication,
and network service providers sharing parts of their network
management data. The growing popularity of XML promises to
accelerate the trend of deploying business critical applications
across the Internet.
Because of the split RDBMS/IETF evolution of replication technology,
and because of the large number of incompatible replication
technologies that have been developed to date, building a BCIA
typically requires custom implementation and integration. The author
has personally worked on a number of commercial replication efforts
that took significant time and effort to deploy, because of the need
to integrate technologies from multiple vendors and fill out missing
pieces with custom development. Market pressures forced each of
these projects to launch with incomplete integration (e.g., lacking
some needed network management functionality). Moreover, there were
recurring integration costs as the vendors introduced new versions of
their technologies. This kind of experience is common to many
Management Information Systems (MIS) shops.
The current effort seeks to define a standard capable of replicating
data for use by BCIAs, allowing a robust service to be deployed by
configuration rather than custom development/integration. The
current effort also seeks to incorporate lessons learned over the
past 20 years from both the RDBMS and the IETF worlds. Both the
protocol and the data units replicated by ARS are XML-based because
we're betting on XML to become a dominant means of structuring data
on the Internet.
We begin this section by presenting three example applications, to
motivate ARS's goals of standardized replication support for BCIAs.
Next, we compare caching and replication, to clarify our meanings for
these terms and motivate our focus on replication. Next, we discuss
why we believe existing replication technology is not well suited to
business-critical internetworked environments. We then enumerate the
ARS replication requirements.
At the end of this document we summarize (Section 12) the strengths
and limitations of ARS, to help data service architects determine
whether ARS is appropriate for their needs.
1.1 Example Applications
We now present and discuss some example replication applications for
each of the characteristics introduced earlier. Briefly, those
characteristics were:
1. business-critical support: stronger data coherence guarantees
than file-by-file replication, but weaker than global ACID
semantics;
2. support for internetworked environments: the ability to operate
in an administratively distributed environment with up to
millions of servers in the face of partial outages, a wide range
of network capacity, and a wide range of provisioning
requirements; and
3. standard interfaces: the ability to deploy applications through
configuration rather than custom integration/development.
As an example application requiring characteristic 1 (business-
critical support), consider the problem of distributing web site
updates to a number of replica servers. This problem is faced by
content distribution networks, as well as by MIS shops that manage
geographically replicated intranets. After being editorially
reviewed or otherwise approved for distribution, content is placed on
a central staging server, from where it is replicated to a number of
other servers. When replication occurs, the set of files holding
images referenced from the HTML "assembly" pages must be moved into
place atomically with the assembly pages themselves, to avoid "broken
link" errors when users browse the content.
As an example application requiring characteristic 2 (support for
internetworked environments), consider the problem of a real-time
news provider, which must relay news headlines and stories to a large
base of mobile clients. As mobile data services grow over the next
few years, the client base could grow to several hundred million in
size. To handle the huge load bursts and 99.9999% uptime such scale
would require, a service provider might replicate content through a
tiered arrangement, with a single staging server feeding a set of
continental replica servers across wide area links. Each continental
server, in turn, feeds a number of regional replica servers, which
eventually populate several thousand local point-of-presence (POP)
servers globally. Each POP then passes updates to the set of
geographically local clients as they become intermittently reachable
over the network. The entire system would need to be provisioned in
a fashion that ensures updates reach the POP servers within (say) 10
minutes of when the updates are injected into the staging server, and
might support multiple classes of service -- for example with
professional brokers paying more for up-to-the-minute stock quote
data.
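The tiered staging/continental/regional/POP arrangement just
described is, in effect, update propagation over a directed acyclic
graph.  A minimal Python sketch of such propagation, with duplicate
suppression for nodes that are fed by more than one parent, follows;
it is illustrative only and is not the ARS propagation algorithm:

```python
class Node:
    """A replica server in a structured (DAG) update topology."""
    def __init__(self, name):
        self.name = name
        self.children = []   # downstream replicas this node feeds
        self.seen = set()    # update ids already processed
        self.applied = []    # payloads applied, in arrival order

    def feed(self, child):
        self.children.append(child)

    def receive(self, update_id, payload):
        # Duplicate suppression: a DAG node may be fed by more than
        # one parent, but applies each update exactly once.
        if update_id in self.seen:
            return
        self.seen.add(update_id)
        self.applied.append(payload)
        # Forward down the structured graph rather than to random
        # peers, so operators can engineer convergence times.
        for child in self.children:
            child.receive(update_id, payload)
```

Injecting an update at the staging node floods it down every tier
exactly once, regardless of how many paths reach a given POP.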
As a second example requiring support for internetworked
environments, consider the problem of provisioning service for a new
user by
populating a set of downstream services with slices of data sourced
at a back office client database. For example, when a new user signs
up for Internet service at a national service provider, information
such as their login name, password, given name, and subscribed
service list must be propagated to a set of RADIUS [4], DHCP [5], POP
[6], SMTP [7], LDAP [8] and database-driven web servers. A similar
operation (though perhaps applied to a smaller scale user base) is
needed for supporting MIS operations managing employees and
contractors in a medium-to-large corporate intranet. One
particularly complex requirement (which may or may not be selected,
depending on local policy) is allowing clients to update their local
server's information during temporary network partitions. For
example, a service provider may wish to allow clients to change their
passwords and continue logging in to the local server during a
network partition, and then propagate the password updates to other
replica sites when the partition is repaired. This functionality
would require support within the replication system for so-called
"multi-master" updates. This functionality could result in
conflicting updates being injected at multiple points in the network,
requiring conflict detection and resolution procedures. Multi-master
update and conflict detection/resolution add significant complexity
to a replication system, so it is desirable to structure the
replication protocol in such a fashion that services that do not need
this functionality can avoid the associated implementation costs and
runtime overhead.
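One common way to detect the write-write conflicts that multi-master
update can produce is to attach a version vector to each document
copy; two mutually non-dominating vectors indicate concurrent updates
from different masters.  The Python sketch below illustrates the
technique only; it is not necessarily the mechanism ARS specifies:

```python
def dominates(a, b):
    """True if version vector a accounts for every update counted in b."""
    return all(a.get(replica, 0) >= n for replica, n in b.items())

def classify(local_vv, incoming_vv):
    """Compare the version vectors attached to two copies of a document.

    Concurrent (mutually non-dominating) vectors mean two masters
    updated the document independently: a write-write conflict that
    must be resolved by policy."""
    if dominates(incoming_vv, local_vv):
        return "apply"    # incoming copy is at least as new as ours
    if dominates(local_vv, incoming_vv):
        return "ignore"   # we already incorporate these updates
    return "conflict"     # concurrent updates from different masters
```

For example, a password changed at replica "a" during a partition and
again at replica "b" yields vectors like {"a": 2} and {"b": 1}, which
classify as a conflict requiring a resolution procedure.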
As an example application requiring characteristic 3 (standard
interfaces), consider again the service provisioning example
mentioned above. At present, RADIUS, SMTP, and similar services are
implemented by a number of different vendors, whose technologies use
incompatible systems for
populating and replicating their databases. Moreover, large MIS
operations often manage employee or client data using technologies
developed for the corporate personnel marketplace or the telco
customer data marketplace, and these technologies also use their own
proprietary data population/replication technology. Developing a
service provisioning system therefore requires non-trivial custom
integration and implementation, particularly as provisioning requires
integration with operations support systems (e.g., the ability to
disable all online and physical building access with a single action
when an employee is terminated).
1.2 Caching vs. Replication and Lazy-vs-Eager Update Propagation
Caching and replication are both used for distributing copies of
data, to lower latency and reduce bottlenecks. It is worth briefly
reviewing the distinction we see between these approaches, as the
industry has not converged on widely shared definitions for these
terms [9][10].
Caching systems typically make copies of individual objects in
response to requests for those objects, and refresh copies during a
subset of future access requests. Replication systems typically make
copies of defined collections of objects in advance of requests for
those collections, and update those copies when the origin objects
(those being replicated) change.
Caching can typically be implemented more simply than replication,
but lacks several useful properties of replication systems:
o by copying defined collections of objects, replication systems
ensure that a particular replica contains a "complete" (within the
constraints of update propagation delays) collection that can be
accessed during network partitions. With caching systems,
partitions usually mean that individual objects are unavailable;
o by updating collections when the origin objects change,
replication systems can potentially keep objects more up-to-date
than caching systems, without requiring access-time requests to
check for staleness or origin server requests to "flush" stale
copies. This distinction is often referred to as "lazy" vs.
"eager" update propagation. While it is possible to implement
eager cache propagation, the most common situation is lazy cache
propagation. Thus, throughout the remainder of this document when
we discuss caching we assume lazy propagation, while we assume
replication can be configured to be either lazy or eager. Stated
another way, caching operates in "pull" mode, while replication
can operate in either "pull" or "push" mode;
o by updating collections when the origin objects change,
replication can trigger client notification. With caching systems
clients must either check for updates periodically or rely on some
external system to provide notification; and,
o by replicating objects at varying levels of granularity,
replication systems can provide an efficiency boost to caching
systems. For example, replication could be used to propagate
real-time updates for stock prices into a "my portfolio" page
whose content (other than the stock prices) is cached.
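The lazy/pull versus eager/push distinction drawn above can be made
concrete with a toy model: a cache revalidates its copy only when it
is accessed, while a replica has updates pushed to it as they occur.
The following Python sketch is illustrative only; all names are
invented:

```python
class Origin:
    """An origin server holding the authoritative copies."""
    def __init__(self):
        self.data, self.version, self.subscribers = {}, {}, []

    def write(self, key, value):
        self.data[key] = value
        self.version[key] = self.version.get(key, 0) + 1
        # Eager ("push") propagation: replicas hear about the change
        # immediately, without polling.
        for replica in self.subscribers:
            replica.store(key, value, self.version[key])

class Cache:
    """Lazy ("pull") propagation: a copy is fetched on first access
    and revalidated only when a later access finds it stale."""
    def __init__(self, origin):
        self.origin, self.copy = origin, {}

    def read(self, key):
        cached = self.copy.get(key)
        current = self.origin.version.get(key)
        if cached is None or cached[1] != current:
            cached = (self.origin.data[key], current)
            self.copy[key] = cached
        return cached[0]

class Replica:
    """Eagerly updated copy; readable even if the origin is down."""
    def __init__(self, origin):
        self.copy = {}
        origin.subscribers.append(self)

    def store(self, key, value, version):
        self.copy[key] = value

    def read(self, key):
        return self.copy[key]
```

Note that the replica holds the new value as soon as the origin
writes it, while the cache picks it up only on the next access, and
that the cache's every read involves the origin while the replica's
reads are purely local.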
Caching systems are therefore most clearly appropriate for cases
where simple configuration is important, full collection availability
during partition is not required, object content changes slowly, and
client notification is not required.
Replication is most clearly appropriate for an environment where
content changes rapidly (and therefore, selectively pushing updates
of changed content works better than caching/pulling content); timely
information is important; and provisioned service is important (so
that service providers can engineer the flow of updates to remove
bottlenecks that interfere with meeting their service level
agreements).
Note that replication can be implemented using software originally
designed to support caching. For example, one could implement a
content delivery network that replicates content by pre-loading a set
of Squid [11] caches when new objects are added to a collection,
flushing the caches when objects are deleted, and re-loading the
caches whenever objects change.
1.3 Currently Deployed Replication Technology
A great deal of work has been done on replication over the past 20
years. In the current section we focus on deployed technology.
Throughout the remainder of this document we discuss research
approaches as they relate to the ARS architecture.
Many vendors have developed proprietary replication technology, for
file systems, relational and object databases, version control,
messaging and directory services, business-to-business integration
and electronic data interchange, systems and network management,
workflow, and content delivery networks. While many of these systems
provide sophisticated replication functionality, an open standards-
based approach is valuable because of its potential to enhance
interoperability among applications that share access to replicated
datastores. For example, it is unfortunate that such inherently
related processes as version control and workflow are typically
deployed in current intranets using different vendors' incompatible
replication technologies. To achieve deep integration, companies
must spend time developing "plumbing" code that moves data between
these incompatible systems, encodes semantics from one system into
opaque fields in the other system(s), and integrates with other
operations support software (such as network and systems management
systems). Often sites deploy only partially integrated solutions
because of time-to-market pressures.
While there are a number of replication standards designed for
internetworked environments, none adequately supports business-
critical applications. For example, the Network News Transfer
Protocol (NNTP) [12] replicates postings from thousands of newsgroups
across millions of machines, but enforces little structure. It is
common to see news postings sequenced in different orders on
different machines, with confusing references to "earlier" articles
that appear later in a particular repository. There also are no
guarantees that all articles posted to a newsgroup will reach all
replicas of that newsgroup.
The Internet Cache Protocol (ICP) [13] provides a way for cache
servers to share data among a hierarchy of caching peers. This
approach comes at a cost in runtime latency [14][15]. It also
provides no means of synchronizing cache contents or ensuring content
integrity. Sites wishing such stronger guarantees must develop
additional mechanisms for those purposes.
The HTTP Distribution and Replication Protocol (DRP) [16] defines
efficient means for requesting missing or updated files, primarily in
support of software distribution. It lacks support for a variety of
requirements we consider important for business-critical
internetworked environments. For example, DRP provides no support for
atomically updating a group of files, routing updates according to
network structure or policy, or allowing multiple sites to update
shared data.
The Distributed Authoring and Versioning protocol (WebDAV) [17]
defines means of remotely editing documents, including support for
collection creation and versioning. This functionality is orthogonal
to replication, and in fact the WebDAV requirements state that "the
WebDAV extensions must be able to operate in a distributed,
replicated environment".
The Network Data Management Protocol (NDMP) [18] defines protocol
support for clients to control how backup is done among various
configurations of servers and networks. This problem is tangentially
related to replication.
A common way to achieve some of the advantages of open standards with
deeper replication capabilities is to represent business logic and
content as XML stored inside a database or file system that supports
replication. While such an approach can be made to work, a storage
management-independent replication protocol is needed because of the
fragmented state of the file system and database marketplaces:
o The file system marketplace is fragmented by operating system,
with Windows NT dominating corporate intranets and UNIX dominating
high-end data centers. There are cross-platform implementations
of CIFS [19] and NFS [20], but in practice CIFS is primarily used
with Windows and NFS with UNIX.
o The database marketplace is fragmented by database vendor.
Specifically, ODBC [21] and SQL [22] define connectivity and query
standards, but in practice database application programmers always
make significant use of vendors' proprietary features for
operationally critical functionality like hot backup, failover,
and disk layout optimization. Once developed, it is difficult to
make such applications interoperate with other database systems.
Beyond the marketplace fragmentation issue, file systems and
databases do not provide appropriate functionality for a business-
critical internetworked environment:
o Replicated file systems typically do not provide sufficiently
strong atomicity and collection management functionality for
business-critical applications. For example, Coda [23] allows
clients to preload their file system cache for use while
disconnected, and provides mechanisms for updating replicas and
detecting conflicts while connected. This approach works well for
end user file sharing, but would not be appropriate for business
applications that, for example, require atomic update and serial
audit trails.
o Database replication systems typically implement data coherence
mechanisms that do not work well across the Internet. Global ACID
semantics, as provided by synchronous replication schemes, result
in deadlocks or reconciliations that grow as the cube of the
number of nodes and the fourth power of transaction size [24].
Asynchronous replication schemes (such as Oracle updateable
snapshots [25]) typically commit updates at a local replica and
later propagate those updates to a master copy where automated
conflict detection and resolution procedures are invoked. A
conflict can cause compensating transactions to be propagated to
sites where the "losing" transactions were performed, which in
turn can generate other compensating transactions to undo
operations that ran based on values read from the losing
transactions. This arrangement can result in an explosion of
global load if widely replicated data experience update conflicts.
Finally, most database and file replication systems use
communication-intensive network protocols that exhibit performance
problems when used across the Internet. While a number of research
systems address weakly connected [23] or topologically complex [26]
environments, there is a paucity of practically deployable
replication solutions that work well across the range of connectivity
and topology found on the Internet.
The service that comes closest to ANTACID's goals and approaches is
the UDDI replication service [27]. Like ANTACID, UDDI replication
supports pull-based replication of XML content with notification when
changes are available, and transmits updates over a Directed Acyclic
Graph (DAG). Also, since UDDI is used for advertising and
discovering services offered by network-accessible businesses, it is
clearly intended for a business-critical internetworked environment.
Unlike ANTACID, UDDI replication defines a service-specific interface
(i.e., intended to support only UDDI data). As increasing numbers of
XML-based services come online we believe a general XML replication
service would be valuable, for the reasons discussed earlier (Section
1). Because UDDI replication is focused on the specific needs of
UDDI servers, the replication design fixes a number of choices that
ANTACID leaves configurable, including authentication and
confidentiality services (vs. ANTACID's reliance on BEEP [28], which
allows negotiation of security services); update content encodings
(vs. ANTACID's support for negotiated encodings); non-support of
update-anywhere functionality (vs. ANTACID's optional support for
update anywhere); and the requirement that updates be propagated
within a set amount of time (vs. ANTACID's treatment of convergence
time guarantees as a separate service provisioning issue).
1.4 Requirements
We begin with a high-level, idealized list of requirements, which we
then selectively relax when we consider the example replication
applications:
1. Computational Power and Connectivity of Participating Devices:
support everything from small/weakly connected devices to
large/well connected data centers.
2. Scale: replicate arbitrarily large volumes of data, to
arbitrarily many sites.
3. Distribution of Control: allow arbitrarily fine-grained control
over what content is replicated, to which sites, and under which
administrative policies.
4. Availability: allow reading and writing any available copy of
data in the presence of server and network outages, subject to
security policy restrictions.
5. Consistency: ensure globally consistent transactional (ACID) data
updates.
6. Efficiency / Performance: avoid duplicate transmissions of
updates across network links; compress redundancy out of data
transmissions; route updates intelligently with respect to
network topology; disseminate updates at full network speed;
guarantee that updates commit globally within a specified,
bounded period of time.
7. Security: support authenticated, encrypted, authorized access to
digitally signed data.
8. Implementation Complexity vs. Depth of Functionality: define
easily implemented primitives that can support deep, complex
application scenarios.
9. Configuration and Operational Management: support adding and
removing replicas at any time. Support automatic configuration
(e.g., of replication topology and update scheduling) that adapts
intelligently over time. Support a serial audit capability: the
ability to enumerate an authoritative list of all update
operations applied, with single-clock timestamps and the
principals that performed those updates.
Looking at the breadth of example target applications, we see that
they vary in a number of dimensions, resulting, for example, in
requirements to support wide ranges of computational power,
connectivity, and number of replicas. Perhaps more important are the
characteristics these applications have in common:
o all require stronger update atomicity and serialization semantics
than file-by-file replication, but weaker than global ACID
semantics;
o all require efficient use of network and computational resources;
and,
o all require the ability to continue operations in spite of partial
node and network outages.
As an example of the update atomicity and serialization semantics, in
a content distribution network it is important to update a group of
HTML and image files atomically so that a network outage cannot leave
a server in a state where the top-level ("assembly") page has been
updated but the new image files it references have not yet arrived.
As a second example, if multiple back office sites submit updates
that result in subscriber account provisioning for an Internet
Service Provider, it is important that an audit trail be generated
showing the serialized sequence of updates that were applied. Such a
trail can help track down data integrity problems introduced at
various back office sites that cause subscriber-visible service
problems.
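Such a serialized audit trail can be as simple as an append-only log
whose sequence numbers and timestamps are assigned by the single
clock of the serializing server.  The following Python sketch is
illustrative only (field names are invented, and this is not an ARS
mechanism):

```python
import itertools
import time

class AuditLog:
    """Append-only record of applied updates.

    Because one server assigns the sequence numbers and timestamps,
    the log is a serial (single-clock) history of who changed what,
    in what order."""
    def __init__(self, clock=time.time):
        self._seq = itertools.count(1)
        self._clock = clock
        self.entries = []

    def record(self, principal, operation, target):
        entry = {
            "seq": next(self._seq),
            "timestamp": self._clock(),
            "principal": principal,
            "operation": operation,
            "target": target,
        }
        self.entries.append(entry)
        return entry["seq"]

    def history(self, target):
        """Serialized sequence of updates applied to one target."""
        return [e for e in self.entries if e["target"] == target]
```

Replaying the history for a subscriber record shows exactly which
back office site injected each update and in what order, which is the
information needed to track down the data integrity problems
described above.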
Below we revisit the idealized requirement list, highlight those that
are particularly difficult to satisfy in an internetworked
environment, and, with an eye towards the example replication
applications, present our chosen compromises:
1. Computational Power and Connectivity of Participating Devices: It
is unlikely that a single engineering design could work well across
the Internet's full range of connectivity and capacity, and also
compare
favorably to technology tuned for high capacity local
interconnects (e.g., a fiber channel-based storage area network
used to interconnect backend databases and application servers).
We therefore focus ARS on two classes of Internet computing
environments:
* Well connected, high-capacity Internet server complexes: for
example, a set of failover servers with diverse routed DS3
connectivity; and,
* Mobile nodes: modest computational and storage capacity,
frequently disconnected network.
2. Scale: while we see no algorithmic reasons why ARS could not
scale arbitrarily, we establish the following numeric goals:
* Ability to replicate data sets up to 30 terabytes in size.
* Ability to replicate individual pieces of data to hundreds of
well-connected data centers, or to millions of Personal
Digital Assistants (PDAs).
3. Distribution of Control: replication systems designed to support
arbitrarily distributed control typically route updates among
replicas at random and use anti-entropy operations [29] to ensure
that all updates circulate globally so that the replication
service provides eventual consistency [30]. Anti-entropy can be
a costly approach, because it requires pair-wise state
comparisons to determine missed updates. Moreover, by allowing
arbitrary flows of data it can be difficult for replica
administrators to engineer the global service to meet needed
convergence times. For these reasons, we impose the following
constraints on distribution of control:
* Updates flow according to a structured communication graph,
rather than to arbitrary peers.
* Ability to distribute control over replication across
administrative boundaries defined by name tree partitions.
* Ability to set differing numbers of replicas for different
parts of the name space.
* Support for both central source-of-record and update-anywhere
models.
* No support for master-less replication (a la LDUP [8]). While
master-less replication is useful, it makes serial audit
impossible and automated 100% coherence difficult.
4. Availability: one of the classic design decisions to be made for
a replication system is how to trade availability against
consistency. In the Internet environment, we consider
availability generally to be more important, and impose most of
the constraint relaxation on consistency:
* Ability to submit updates to any available copy.
* Ability to read committed updates from any available copy.
Optional support for reading the current set of submitted
updates from any available copy.
* Ability to propagate updates in the presence of partial node
and network outages.
5. Consistency:
* Support eventual consistency among asynchronously circulated
updates.
* Ensure mutual consistency among documents at a datastore after
each update to that datastore. We exclude from scope broader
notions of referential integrity, for example guaranteeing
that a bookmarked web page can later be successfully visited.
This latter problem might be addressed by a combination of
archival repositories and a versioning name space, which,
again, are out of scope.
* Support ability to determine when an update has committed at
the submission server. Provide no support for determining
when an update has circulated to all replicas.
* Support ability to detect write-write conflicts. Ability to
notify applications when updates arrive, as a partial solution
to the read-write conflict problem.
* Not all applications need to replicate all data served by an
ARS server, so it must be possible for applications to cache
documents using a simple Time-To-Live (TTL) mechanism to
detect when it is necessary to request an updated copy. ARS
servers are not required to participate in cache coherence
management.
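The TTL mechanism described above can be sketched as follows. This is
a minimal illustration, not part of the ARS protocol; class and method
names are hypothetical, and a real client would fetch a fresh copy
from an ARS server whenever the cache reports a stale entry.

```python
import time

class TTLCache:
    """Minimal TTL document cache of the kind described above: a
    cached copy is considered fresh until its TTL expires, after
    which the client must request an updated copy."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock
        self._entries = {}  # uri -> (document, fetch_time)

    def put(self, uri, document):
        self._entries[uri] = (document, self.clock())

    def get(self, uri):
        """Return the cached document, or None if absent or stale."""
        entry = self._entries.get(uri)
        if entry is None:
            return None
        document, fetched = entry
        if self.clock() - fetched > self.ttl:
            del self._entries[uri]  # stale: caller should re-fetch
            return None
        return document
```

Note that this places all coherence responsibility on the client, as
the text requires: ARS servers do not participate in cache coherence
management.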
6. Efficiency / Performance: consistency requirements generally
impose performance bottlenecks that prevent updates from flowing
at full network speed. Our goal here is that ARS have
performance reasonably close to that achieved by the best
currently deployed replication technology, while making
intelligent use of network resources. Specifically:
* Performance of individual point-to-point replication should
not be significantly worse than proprietary schemes, excluding
convergence time (which depends on connectivity factors
outside the control of the replication system).
* Efficient use of network resources: don't send the same update
many times over a link.
* Ability to route update transmissions in ways that make sense
with respect to network topology and/or policy.
* Support for time-bounded convergence is outside the scope of
ARS, but it should be possible to configure server operations
to bound convergence times, when done in conjunction with
appropriate system provisioning/capacity/uptime engineering.
As a possible future extension, a protocol could be defined to
support some form of conditional service guarantees for
deployments where resource reservations are possible.
7. Security: security generally comes at a tradeoff against ease of
configuration and use, and sometimes with performance impact.
ARS should provide ranges of choices for authentication,
confidentiality, access control, and data integrity, which allow
individual sites to choose how they make the tradeoffs. The
collective ARS documents should provide sufficient background to
allow sites to make these choices in an educated fashion.
8. Implementation Complexity vs. Depth of Functionality: define a
minimal set of broadly useful required protocol elements, and an
additional set of optional protocol elements that increase the
functionality for more complex applications. Also, focus on
replication, reusing existing solutions to related problems such
as content filtering, load sharing, and configuration management.
9. Configuration and Operational Management:
* Automatic configuration is difficult to achieve for anything
but the simplest of systems. Even something as outwardly
simple as auto-selecting full- vs. half- duplex network
switching has historically been the source of surprising
numbers of operational problems. Data replication is a far
more complex problem. ARS therefore defines no configuration
automation. ARS may, however, restrict the service states in which
replicas may be added or removed.
* One of ARS's key design tradeoffs is the decision that serial
audit must be supported - as will be seen, this single
decision forces other design choices that impact availability
and consistency. Nonetheless we feel serial audit is critical
because it can help when tracking down service failures and
security breaches, and it can provide a legal record of update
activity.
2. Terminology
We use the term "document" to refer to a well-formed XML document.
We use the term "datastore" to refer to any hierarchically-named
document repository. The datastore could be implemented as a file
system, a database table, or some other persistent storage system. ARS
makes no assumptions with respect to granularity of access (e.g.,
attribute-level vs. file level) of the underlying datastore.
However, the ARS service itself replicates at the granularity of an
individual document. In other words, modifications to individual
elements or attributes will result in updates to the complete
document.
We use the term "name space" to refer to the universe of all names
within a particular datastore naming system. For example, the Blocks
name space [31] consists of all names for Blocks (such as
"doc.rfc.2629"). We use the term "name tree" to refer to the
hierarchical collection of names within a particular name space
(e.g., the names within the Blocks name space).
We use the term "server" to mean a process that responds to requests
issued by another process. "Client" means a request issuer. These
terms apply to roles that may be transient. In particular, "ARS
servers" act as servers when a client submits a request, and act as
clients when forwarding requests to other ARS servers. Moreover, ARS
servers can initiate communications when performing callback
notification to a client that made an earlier request to the server.
We use the term "ARS peers" when referring to either a client or
server that communicates using the ARS protocol.
While the ARS client and server roles can be transient, we make a
fixed distinction between processes that receive (and possibly
forward) replication requests vs those that only submit requests. We
refer to the former as "replication servers" and the latter as
"clients of the replication service". We refer in aggregate to
replication servers as "the replication service".
This role distinction is motivated by the Internet gateway model
[32], which distinguishes between the information that must be known
within the routing core of the Internet vs by hosts needing packet
routing service. In the case of Internet gateways, the distinction
reduces the number of machines that are affected by routing changes.
In the case of ARS the distinction allows us to:
o offload state management from clients of the replication service
for tracking in-progress replication requests, in support of
computationally modest clients (such as mobile PDAs); and
o define a boundary across which clients of the replication service
may treat replicated data in ways opaque to the replication
service -- for example when caching or digitally signing
replicated data.
We discuss both of these issues in more depth later.
We use the term "primary" to refer to a designated server for a
particular region (the structure of which will be explained later) of
the name space that serializes updates for that region, and "non-
primary" for all other servers for that name space region.
ARS servers for individual regions of the name space are arranged in
a Directed Acyclic Graph (DAG [33]) rooted at the primary. We refer
to ARS servers closer to the primary along the DAG than a given
server as "upstream servers", and those further as "downstream
servers".
We do not use the term "secondary" to denote servers that relay
updates to primary servers, both because there can be multiple hops
between a primary and the set of all downstream servers in the DAG,
and because the term does not adequately capture the subtleties of
interactions along the propagation DAG. Instead we speak of primary
vs. non-primary servers, and of upstream vs. downstream servers.
We use the term "ARS service" to refer to the overall replicated data
service supported by ARS.
We use the term "stable storage" to denote non-volatile memory (e.g.,
disk) implemented in a fashion where write operations survive system
crashes/reboots [34].
We use the term "commit" to indicate a computation performed on a
single datastore that applies a set of updates in a fashion that
respects ACID semantics at that single datastore. There is no notion
of multi-server commit in ARS.
We use the term "submission server" to indicate the server to which
an update was originally submitted by a client of the ARS service.
We use the term "submission path" to indicate the sequence of
servers, starting at the submission server and terminating at the
primary server, via which an update is submitted and propagated.
We use the term "principal" to refer to any named entity in the
system, including users, groups of users, and services running on
behalf of particular administrative regions of the replicated name
space.
ARS specifies a required subset of protocol elements and two optional
subsets (the details of which will be presented shortly). We use the
term "the ARS protocol" to refer to any legitimate subset of all
three sets of protocol elements.
Note that some of the terms used by ARS are different from their use
in the Domain Name System [35], even though ARS also forms a
hierarchical distributed datastore. Specifically, both systems have
a "primary" server notion, but where the DNS has "secondary" servers,
ARS has non-primary servers. The DNS replicates data from primary to
secondary servers, but ARS replicates data along a DAG from primary
to non-primary servers. Also, the DNS and ARS both define a notion
of "zone" to support distributed administration, but the DNS defines
more ways to partition zones than ARS does (specifically, the DNS
defines a notion of "class" that is not present in the ARS).
3. Overview of the Approach
Many aspects of the ARS architecture were influenced by the desire to
balance support for business-critical functionality against the need
to operate in an internetworked environment. In this section we
present an overview of the ARS approach, at the end of which we
summarize the design decisions made to enhance scalability and
operation across a variety of network environments.
ARS partitions each name space into disjoint regions called zones, to
support distributed administration. Zones can also enhance ARS
scalability by allowing a primary that receives too much update
traffic to delegate part of its work to another server.
Each ARS server supports one or more zones. Individual zones may or
may not be replicated.
For each zone updates are propagated along a DAG rooted at a
distinguished primary server that serializes all updates to that
zone. This DAG provides for redundancy and load distribution while
distributing updates, and can be configured to take network
characteristics (topology, link bandwidth, policy, etc.) into
account.
There are two distinct phases of the update process:
o update submission, which may be as simple as submitting an update
to the primary, or may involve submitting to a non-primary that in
turn propagates the submission through a chain of servers up the
DAG to the primary; and,
o committing of submitted updates, which causes messages to
propagate down the DAG from the primary to all non-primary servers
of that zone.
To distinguish these phases, we talk about update submission as
distinct from serialization/commit.
ARS servers can push updates down the DAG as they arrive, to reduce
replication latency if needed. Servers can also pull updates, for
example to "catch up" when first joining a replication DAG, or after
downtime, partition repair, or mobile reconnection. For
synchronization and implementation simplification reasons a push is
actually just a suggestion from the upstream server that the
downstream server should perform a pull request. The choice of
whether to push or pull updates is a local configuration decision
made at each server, though server administrators should coordinate
these choices to support specific provisioning requirements.
When a replication server receives an update submission, it enters
into what we call a client-server promise: after accepting the
update, the server must follow that update through either to success
or failure. This promise extends from the submission server through
each ARS server up the DAG to the primary, so that in effect the
client enters into a promise with the replication service (not just
an individual server). Because of the client-server promise, an
infrequently connected client of the replication service can thus
disconnect from the network after an ARS server has accepted the
update submission, and rely on the replication service to carry the
update through to completion or failure without the client's needing
to handle details of timing out and resubmitting the update. Clients
can learn of commit results through an asynchronous notification
mechanism.
ARS consists of three sub-protocols, only the first of which must be
implemented by all ARS servers:
1. the Commit-and-Propagate Protocol (ars-c), which allows a client
to submit an update to an ARS server, performs zone-wide
serialization, commits updates, and propagates committed updates
down the DAG to other ARS servers that each individually commit
and propagate the updates to their downstream servers. ars-c
also provides asynchronous notification of commit-or-fail results
to the submitting clients.
2. the Submission-Propagation Protocol (ars-s), which allows an
update that was submitted to a non-primary server to be
propagated up the DAG to the primary server, where it is
serialized with respect to all submission activity at the
submission server (in addition to the serialization performed by
ars-c). ars-s re-uses the protocol element defined by ars-c for
asynchronous client notification to notify servers down the
submission path of update commit-or-fail results. ars-s provides
a store-and-forward update submission mechanism, so that updates
can succeed even if all servers along a submission path are never
simultaneously available. Implementations that do not support
ars-s allow updates to be submitted only to the primary.
3. the Encoding Negotiation Protocol (ars-e), which defines a
protocol for negotiating a set of MIME-based [36] data
representations that can be used when passing submitted and
committed updates among ARS peers. These negotiated
representations support a variety of transmission optimizations,
including compression, delta encoding, and bulk transfer.
Implementations that do not support ars-e use a single required
encoding for submitted and committed updates.
This protocol decomposition allows relatively simple implementations
for broadly useful functionality, and additional functionality for
more demanding services. In particular, "update-anywhere"
functionality (supported by ars-s) is responsible for a significant
amount of the service complexity. Many services do not require this
functionality and need not incur its implementation complexity and
runtime overhead.
Similar to Bayou [37], ARS transmits operations rather than data when
transmitting both submitted and committed updates, to (a) avoid
ambiguities with respect to deleted documents and (b) make the amount
of information transmitted proportional to update activity instead of
overall database size. ARS also defines a collapsing update
mechanism that eliminates committed update transmissions that are
overwritten by subsequent updates. This approach modifies coherence
semantics in a way that should not impact most applications, and
enhances efficiency for infrequently synchronized datastores. It can
also reduce space requirements for log-based server implementations.
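The collapsing update mechanism can be sketched as follows, assuming
committed updates are represented as (CSN, URI, operation) tuples (an
illustrative representation, not a wire format defined by ARS): only
the most recent operation per document survives, and the overall
commit order is preserved.

```python
def collapse_updates(updates):
    """Collapse a committed-update sequence so that operations
    overwritten by a later update to the same document are dropped.
    'updates' is a list of (csn, uri, operation) tuples in commit
    order; the result preserves that order."""
    latest = {}  # uri -> CSN of the last operation touching it
    for csn, uri, _ in updates:
        latest[uri] = csn
    return [u for u in updates if latest[u[1]] == u[0]]
```

For an infrequently synchronized replica, the dropped transmissions
are exactly those an application could never have observed anyway,
which is why the modified coherence semantics should not impact most
applications.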
ARS supports a server-by-server model of update atomicity and
consistency. An update submitted by a client contains a set of
operations that are to be performed together, all-or-nothing. That
group of operations commits first at the primary and then travels to
all downstream servers. The group is applied asynchronously at each
server with ACID semantics local to that server, and updates
are guaranteed eventually to circulate to all replicas. In this way,
a client interacting with a single replica sees updates that are
locally consistent (e.g., retaining referential integrity between
cross-linked documents at that server), but the replicated ARS
service avoids scaling problems that would result from global ACID
semantics.
The ARS service tracks commit sequence numbers (CSNs) for each
document, which may be used to detect write-write conflicts. The
local server implementation or an application may also handle a
simple case of read-write conflicts by notifying clients when an
update to their input documents arrives. Both of these forms of
conflict detection are optional, and may be implemented outside of
the ARS service (e.g., with some form of trigger or "persistent
fetch").
ARS provides no protocol support for determining when an update has
circulated globally. An application could be written to traverse the
ARS DAG for a given zone and check for update completion, but
distributed administration of servers could prevent discovery or
querying of some servers. Moreover, the cost of this computation
could be prohibitive. Finally, the list of replicas may itself be in
flux, making it difficult to define a meaningful notion of "all
replicas".
Looking at the overall approach outlined above, a variety of design
decisions contribute to ARS's applicability to internetworked
environments. These decisions concern scalability and operation over
a variety of network environments.
In terms of scalability, probably the most important design decision
is ARS's server-by-server model of update atomicity. This approach
avoids the deadlock and reconciliation problems of global ACID
semantics, and induces an application architecture where clients
interact with local replicas rather than "popping around the net". A
second design decision in support of scalability is routing updates
according to a configured graph structure. This approach avoids the
overhead of pair-wise state comparisons performed by anti-entropy
operations, which are needed when updates are allowed to flow between
arbitrary peers. A third scalability-oriented design is the
separation of conflict detection from the replication service. This
approach offloads conflict detection overhead from replication
servers. A fourth scalability feature is the use of transmission
encodings, which can support a number of optimizations, most
importantly bulk transfer (which can make it feasible to replicate
very large datastores). Finally, scalability is supported by the
ability to partition the name space for distributed administration.
In terms of operation across a variety of network environments,
probably the most important design decision is routing updates
according to a configured graph structure. This approach allows data
to be routed according to bandwidth, policy, or other network
characteristics. For example, the graph could be configured so that
updates are sent once across an expensive trans-oceanic link, and fan
out from there to other replicas. Graph-based update routing also
supports a provisioned service model [38], where administrators plan
flows based on measured update rates and server and network capacity.
In this way, commercial providers can support service level
agreements across a variety of network environments, which is
difficult to do if updates were routed in an unconstrained fashion.
A second design decision in support of varying network
characteristics is store-and-forward update submissions. This
approach allows updates to succeed when all servers along a
submission path are not simultaneously available. On a related note,
by allowing both pull and push models of committed update
propagation, peers that are not continuously connected can request
updates when they connect, yet continuously connected servers can
learn of updates without polling. Finally, the store-and-forward
update mechanism also supports deployment on bastion hosts that
forward traffic across security boundaries, similar to how proxy
servers are often used for allowing Internet HTTP access from within
an intranet.
Mobile PDA's represent a broad class of network environments deserving
particular mention. ARS's client-server promise and asynchronous
notification allow a PDA to connect, submit an update, and
disconnect, managing only the state needed to tie notification back
to the original submission. Also, ARS's collapsing update mechanism
can reduce committed update traffic for infrequently synchronized
PDA's. ARS's update-anywhere capability supports geographically
constrained network architectures, allowing a PDA to synchronize data
with a nearby cell server or kiosk. Finally, ARS's distinction of
replication-service vs client-of-replication-service minimizes
implementation complexity for computationally modest PDA devices.
4. Name Space
ARS operations specify documents using Uniform Resource Identifiers
(URI's [39]). For this purpose, the "scheme" part specifies the name
space being used. For example, a URI for a document in the Blocks
name space might be "blocks:doc.rfc.2629".
The following figure illustrates a number of aspects of ARS naming,
drawn from the Blocks name space:
zzzzzzzzzzzzzzzzzzz
z ------- z
z / \ \ z
z / ... ... z
zzzzzzzzzzzz z
z / z z
z doc zzzzzzzzzz z
z / | \---\ z z
z rfc z edgar z z
z / | \ z / | \ z z
z z z z
zzzzzzzzzzzzzzzzzzzzzzzz
||\\ \\
|| \\ \\_________
|| \\ \--------\\
|| \\ \\
|| \\ \----------\\
|| \\ \\
+----------+ +----------+ +-------------+
| | | | | |
| doc | | doc | | doc |
| / | \ | | / | \ | | / | \ |
| rfc | | rfc | | rfc |
| / | \ | | / | \ | | / | \ |
| | | | | |
+----------+ +----------+ +-------------+
|| \\ // || ||
|| \\ || ||
|| // \\ || ||
+----------+ +----------+ +-------------+
| | | | | |
| doc | | doc | | doc |
| / | \ | | / | \ | | / | \ |
| rfc | | rfc | | rfc |
| / | \ | | / | \ | | / | \ |
| | | | | |
+----------+ +----------+ +-------------+
In this figure boxes surrounded by z's indicate zone boundaries.
Zones are named by the top-level node in each zone, with the root
given the distinguished name ".". Zones are partitioned by placing
cuts between nodes in the name tree, and designating a primary server
for each zone. In the figure there is a cut at "blocks:doc" and
another at "blocks:doc.edgar", making 3 separate zones ("blocks:.",
"blocks:doc", and "blocks:doc.edgar").
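Determining which zone holds a given name can be sketched as a
longest-match over the configured cuts. This is an illustration of the
partitioning described above under assumed dotted-label naming (as in
the Blocks name space), not an algorithm specified by ARS.

```python
def zone_for(name, cuts):
    """Return the zone holding 'name': the longest cut whose dotted
    labels are a prefix of the name's labels, with the root zone "."
    matching any name. 'cuts' is the set of configured cut points."""
    labels = name.split(".")
    best, best_len = ".", 0
    for cut in cuts:
        if cut == ".":
            continue  # root matches everything; keep as fallback
        cut_labels = cut.split(".")
        if (labels[:len(cut_labels)] == cut_labels
                and len(cut_labels) > best_len):
            best, best_len = cut, len(cut_labels)
    return best
```

With the cuts from the figure, "doc.rfc.2629" falls in the "doc" zone
and "doc.edgar.10k" in the "doc.edgar" zone, while any name outside
those subtrees falls in the root zone ".".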
5. Replication Topology
For each zone that an ARS server replicates, the server is configured
to execute the ARS protocol with zero or more upstream servers and
with zero or more downstream servers. These servers are arranged
into a DAG rather than permitting updates to flow via unconstrained
peer-to-peer interconnections (as done for example by Bayou [37])
both to eliminate the need for anti-entropy operations and to provide
more structure for replica administrators who wish to engineer flows
to meet needed convergence times.
Each upstream server may be either the zone primary or another ARS
server that replicates the needed zone. In this way committed
document updates are replicated along a DAG rooted at the primary for
zone data they replicate. An example is illustrated in the figure in
the Name Space section (Section 4), with two of the "doc" zone
replicas configured to replicate from either of two different
upstream replicas.
In the simplest ARS service configuration (one supporting only ars-c
at all servers) updates are submitted at the primary, and after
committing at the primary the updates propagate down the DAG to the
non-primary servers. In a service that also supports ars-s, updates
can be submitted to non-primary servers and routed up to the primary
along the DAG. Supporting non-primary update submissions can provide
several advantages:
1. it allows clients (such as wireless PDA's) to submit updates to
topologically nearby servers, which may provide transmission cost
advantages;
2. it allows load bursts to be distributed away from the primary,
with non-primaries time-shifting or otherwise shaping update
traffic up to the primary;
3. it supports a distributed access control model (Section 10.4),
where individual server administrators manage update privileges
for their local users; and,
4. it allows updates to be submitted when the primary is temporarily
unreachable.
Scalability in the number of replicas is achieved by building a DAG
of ARS servers with fan-out at any particular replica engineered
based on measured update rates and server and network capacity.
Each ARS server knows all of its immediate upstream servers and all
of its immediate downstream servers. Thus, becoming a downstream
server involves configuring the downstream server and all upstream
servers being used by the new downstream server, through
administrative arrangement outside the ARS service. The means by
which this configuration is done (configuration file, database, etc.)
are a matter of local implementation. ARS does not define a protocol
for joining/leaving the replication topology.
When a zone is divided or delegated, the ARS servers replicating it
continue to receive content only from the originally configured
zones, not from the newly delegated zones. The ARS server
administrator must explicitly add the new zones to the configuration
in order to replicate the newly delegated content.
The choice of whether replication topology construction is done via
manual configuration (e.g., using a graph editor) vs. by some
automatic process (e.g., one that looks at routing tables or that
performs bandwidth computations) is outside the scope of the ARS
architecture. Over time we believe that automating the construction
of replication DAGs will be needed to scale the service to thousands
of replicas [26]. Solutions to this problem are an area of future
study (for example, using some form of Protocol Independent Multicast
[40] to help discover topologically nearby nodes).
6. Update Ordering
ARS serializes submitted updates to a zone so that all servers commit
updates in the same order. Additionally, for pairs of servers that
support ars-s, ARS serializes submitted updates according to the
order that they arrived at the submission server. Each submission
server's ordering imposes a partial zone-wide ordering: submissions
are ordered per server, but do not imply any ordering between
servers. The partial orderings of all submission servers are
respected by the zone-wide total ordering chosen by the primary.
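The primary's job of choosing a zone-wide total order that respects
each submission server's partial order can be sketched as follows.
The round-robin policy here is purely illustrative (ARS does not
mandate one); any policy that only dequeues from the front of each
submission server's FIFO queue preserves the per-server orderings.

```python
from collections import deque

def serialize(per_server_queues):
    """Merge per-submission-server FIFO queues into one total order.
    'per_server_queues' maps a server name to its updates in
    submission order; the result never reorders updates from the
    same server, satisfying the partial-ordering requirement."""
    queues = {s: deque(q) for s, q in per_server_queues.items()}
    total = []
    while any(queues.values()):
        for server in sorted(queues):  # deterministic sketch policy
            if queues[server]:
                total.append(queues[server].popleft())
    return total
```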
This ordering model is motivated by the desire to provide behavior
that meets users' intuitive expectations, for the case of
applications that interact with a single replica. Clearly it is
important to sequence all updates into a globally consistent order,
but it is also important to maintain submission site ordering. If
submission site ordering were not maintained, problems like the
following example scenario could arise. Consider an online news site
that publishes updates daily at midnight. Each day, the site
performs an internal editorial review and then submits the update to
the replication system. The staff member in charge of content
replication submits the approved update, and while waiting for
notification that the update has committed begins reading a local
copy of the updated content. In browsing through the content the
staff member notices an error that slipped past the editorial review,
and submits an update to correct the error. At this point, there are
two update submissions in the system, neither of which has committed
at the primary yet. If the ordering model did not respect submission
site ordering, it is possible that the correction could reach the
primary before the original update (for example, because of a
transient network outage that caused the first update to stall while
the second update routed through a different path to the primary) and
be serialized before the original update. As a result, the original
(uncorrected) update content becomes the final content on the web
site for the current 24 hour period.
ARS servers request updates from an upstream server by specifying the
zone and the CSN of the last operation seen for that zone. The
upstream server responds by sending all updates that have committed
since the specified sequence number, possibly collapsing some updates
per the collapsing update notion mentioned earlier (Section 3).
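The pull exchange just described can be sketched as follows. The
per-zone log representation is an assumption of this sketch (ARS does
not mandate a log-based implementation), and collapsing of overwritten
updates is omitted for brevity.

```python
def updates_since(log, zone, last_csn):
    """Upstream-server response to a pull request: return the
    committed updates for 'zone' with CSN greater than the
    requester's last-seen CSN. 'log' maps a zone name to its list
    of (csn, uri, operation) tuples in commit order."""
    return [u for u in log.get(zone, []) if u[0] > last_csn]
```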
Note that ANTACID addresses a relatively constrained update ordering
problem because it assumes updates are always applied to entire
documents. Supporting finer-grained updates (e.g., attribute-level
updates commonly supported by relational databases) would require
dealing with update dependencies, merges, etc.
7. Atomicity, Consistency, and Reliability
7.1 Eventual Consistency
ARS supports eventual consistency semantics [30]. Specifically,
there are no guarantees that all replicas are ever in a consistent
state, but if new submissions cease for long enough for in-flight
updates to spread, all replicas eventually converge to a consistent
state. There
are no guarantees about how long it can take for updates to
propagate, but the local implementation may schedule updates and
engineer sufficient capacity to meet a given propagation schedule.
Note that some form of conditional service guarantees [41] might be
provided by proprietary vendor extensions, to bound update
propagation times for deployments where resource reservations are
possible. Moreover, in the future a new ARS sub-protocol could be
defined to support such conditional service guarantees.
7.2 ANTACID Semantics
As noted earlier (Section 1.4), globally consistent ACID semantics do
not scale and are not needed for a usefully large class of
applications. Instead, ARS supports a server-by-server model of
update atomicity and consistency, which we call Asynchronous Network
Traversed ACID (ANTACID) semantics. Specifically, updates submitted
together are guaranteed to be applied in the same order with ACID
semantics at each replica as they traverse the DAG and are applied at
each server. There are no guarantees about when updates reach
individual replicas or about the update state of one replica with
respect to another replica. This approach is motivated by the
scaling problems of providing global ACID semantics and by the fact
that many applications work by interacting with a single replica, for
which all that is required is that updates to that replica appear to
be ACID to applications interacting with that replica.
The term "ANTACID" is also intended to connote a reduction in
unpleasant complications that would arise from the over-use of ACID
semantics in the global environment.
Fox et al. [42] define BASE (Basically Available, Soft state,
Eventual consistency) semantics to be, in essence, everything that
does not support ACID semantics. We believe it is useful to
characterize ANTACID semantics at two different levels of system
granularity. At the granularity of the global replicated service,
ANTACID semantics are BASE. At the granularity of individual
datastores, ANTACID semantics are ACID. We believe this split is key
to supporting business critical internetworked applications.
Mnesia [43] is an RDBMS that supports the usual set of transactional
operations as well as a set of so-called "dirty operations" that
trade away transactional integrity for increased performance. These
operations are atomic for individual reads/writes, but do not provide
any way to group logically related operations in a way that supports
referential integrity. Although Mnesia provides a way to selectively
relax traditional ACID semantics (as ARS does), the mechanism is
targeted towards realtime applications, rather than Internet data
replication.
To illustrate why we feel ANTACID semantics are appropriate for
BCIA's, consider the three example applications introduced earlier
(Section 1.1).
For the web content distribution application, a limited form of
referential integrity is required: all of the content updates for a
site must be made "live" simultaneously to avoid broken link errors
when users view the content with a web browser. It is not required
that all of the updates are made "live" simultaneously at all replica
servers, and in fact imposing that sort of global ACID semantics
would heavily impact availability: a partition affecting a single
replica would delay all other replicas from turning the needed
updates "live" on schedule, for applications requiring scheduled
delivery (e.g., an online publication that updates its content at
regular intervals). By supporting ANTACID semantics, the impact of
regional network outages can be limited to the affected region. Note
that there may be additional requirements for how closely together
updates should be made "live" across all of the regions, and that
such requirements fall under provisioning issues outside the scope of
the replication system (e.g., requiring a certain level of routing
redundancy and network capacity to ensure the delivery schedule can
be met).
There are similar referential integrity requirements for the real-
time news provider application. The provisioning requirements are
likely to be significantly more demanding than they are for the
content distribution application, given the time-sensitive nature of
stock quote data. Again, these provisioning requirements require
systems and network engineering beyond the scope of the replication
system.
The service provisioning application also requires referential
integrity. For example, it would be confusing to users if their DHCP
accounts were activated before their email accounts were activated.
By replicating customer/client data in a single standard format to a
regional server, all customer-facing services (DHCP, POP, etc.) could
access the same server and achieve the desired property that "all
services in the region are available as soon as one is". Attempting
to populate all regional servers simultaneously in a large
geographically dispersed enterprise or ISP would unnecessarily reduce
availability and increase the lock contention, deadlocks, and
rollbacks needed to achieve such global ACID semantics.
To our knowledge ANTACID semantics have not previously been defined
per se. However, these semantics are motivated by the desire to
formalize an approach that has been applied informally, on a one-off
basis, in various applications over the years.
7.3 Locking and Deadlock Prevention
To support ANTACID semantics we define an update group as a grouping
of update operations to be applied to documents within a single zone,
all-or-nothing at each ARS replica as the update propagates from
primary to all non-primary servers.
To implement ANTACID semantics, locks are acquired separately at each
server at the time a set of updates are to be committed, starting at
the primary. Each server acquires a zone-wide lock (to ensure that
update groups are applied in a zone-wide serialized order), performs
the updates, and then releases the zone-wide lock. Because all locks
for an update group are acquired at one time at each server after the
entire update content has arrived at that server, ARS prevents
deadlock [44]. Note that deadlock is possible if systems other than
ARS acquire locks in a hold-and-wait fashion, so ARS implementations
should carefully consider the possible deadlock problems if they
allow services other than ARS to update the datastore.
The ARS grouped update mechanism is the only way to relate updates to
each other, with all-or-nothing failure semantics. In particular,
unless updates to multiple documents are grouped together, failure of
one update has no effect on subsequent updates.
ARS does not provide transactional consistency among read/written
values. In fact, no read locking is supported by ARS.
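The locking discipline above can be sketched as follows. This is a
minimal illustration of zone-wide, all-or-nothing application of an
update group, assuming an in-memory datastore; none of the names
below are drawn from the ARS specification:

```python
import threading

class Zone:
    def __init__(self):
        self.docs = {}
        self._lock = threading.Lock()   # one lock for the entire zone

    def apply_update_group(self, updates):
        """Apply an update group all-or-nothing under the zone-wide
        lock. The lock is taken only once the whole group is present,
        so there is no hold-and-wait and therefore no deadlock."""
        with self._lock:
            saved = dict(self.docs)     # snapshot for rollback
            try:
                for name, content in updates:
                    self.docs[name] = content
            except Exception:
                self.docs = saved       # all-or-nothing failure
                raise

zone = Zone()
zone.apply_update_group([("/site/index.xml", "<page/>"),
                         ("/site/about.xml", "<page/>")])
```

Taking the single zone lock only after the entire group has arrived
is what rules out the circular-wait condition needed for deadlock.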
7.4 Collapsing Update Semantics
For efficiency of operation among infrequently synchronized
datastores and to reduce server space requirements, ARS defines a
notion of collapsing update semantics. Rather than insisting that
each replica apply all update operations, the semantics are relaxed
as follows. One or more operations may be elided (the expression of
which we will discuss shortly) from an update group if the datastore
contents after successfully committing this subset of updates would
be identical to the datastore contents that would result if the
complete set of updates were performed. Thus, for example, a
document creation followed by three updates to that document may be
"replayed" at a downstream server as a single write with that
document's final value.
The term "elided" above means that the given update operations and
content are replaced by null operations accompanied by the CSN for
each operation. This approach is used rather than transmitting
nothing for elided operations so that the downstream server receives
a (sometimes null) operation for each CSN. This provides a degree of
fault tolerance, since it is a simple matter to detect missing
operations from an up-counting set of sequence numbers.
Update collapsing may be performed over operations that originally
spanned multiple update groups, as long as the collapsed updates are
transmitted within a single update group. As an example, if ten
updates were applied in separate update groups to a particular
document over the course of a day and a PDA requests all committed
updates at the end of that day, the responding server could send a
single update group that collapses out all but the final update to
that document.
Upstream ARS servers may collapse the set of update operations sent
in response to a request for committed update content. Downstream
ARS servers must be prepared to handle collapsed updates.
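A minimal sketch of collapsing, under the assumption that each
operation is a whole-document write tagged with its CSN (the function
and variable names here are hypothetical, not protocol elements):

```python
def collapse(ops):
    """ops: list of (csn, doc, content). Returns one entry per CSN,
    where all but the final write to each document become null
    operations that carry only the CSN."""
    last_for_doc = {}
    for csn, doc, _ in ops:
        last_for_doc[doc] = csn
    out = []
    for csn, doc, content in ops:
        if last_for_doc[doc] == csn:
            out.append((csn, doc, content))   # final write: kept
        else:
            out.append((csn, None, None))     # elided: null op + CSN
    return out

day = [(1, "/a.xml", "v1"), (2, "/a.xml", "v2"),
       (3, "/b.xml", "v1"), (4, "/a.xml", "v3")]
collapsed = collapse(day)
```

Because the downstream server still sees one (possibly null)
operation per CSN, a gap in the sequence numbers remains detectable.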
7.5 Submitted Update Visibility
While updates do not become part of the "official" state of the
replicated ARS service until they have been serialized and committed
at the primary, there are times when it can be useful for
applications to be able to see updates immediately after they have
been submitted. For example, in a system that manages passwords for
users it would be useful for users to be able to login to the system
after changing their password at a local server rather than waiting
for the update to commit at the primary. This issue gets to the
heart of some key tradeoffs in the design of a replication system, so
before presenting the ARS approach we briefly review two other
approaches we considered.
Bayou provides a mechanism that allows applications to see a view (in
RDBMS parlance; essentially, a stored query) showing updates that
have been submitted but not yet committed, allowing tentative updates
to circulate and possibly later be undone if a conflict is detected
[45]. We chose not to use this approach because it can lead to a
large volume of global undo/re-ordering activity in the case of
widely replicated data.
LDUP defines schema and protocol extensions to LDAPv3 [8] for
replicating directory content [46]. LDUP allows updates to be
visible immediately after they have been accepted by a server by
having each server act as an independent master, with peer-to-peer
conflict resolution procedures performed between servers from time to
time. This approach derives from a fundamental requirement of LDUP,
namely, not depending on any single primary server for any given
update. There are good reasons for this requirement, including the
fact that updates are never stalled by an unreachable primary server.
However, we chose to accept the limitation of requiring a primary,
for reasons that may not be appropriate in the LDAP arena.
Specifically, without a designated primary server we would lose the
serial audit capability, which ARS requires (Section 1.4).
Because of the above design considerations, ARS uses the following
approach. As a local implementation matter the submission server may
allow applications to view the state of locally submitted updates.
Submitted updates must not be propagated to other ARS servers in
response to pull requests for committed updates. Applications that
view submitted but not-yet-committed state must be prepared to deal
with the case where the update is rolled back after failing to commit
at the primary. For example, a password management application might
allow local users to login with the new password after a local
password change, but revert to the old password if the primary later
rejects the update.
If an ARS implementation allows applications to view the state of
locally submitted updates, it should provide two views from which
applications can choose: one with only the committed updates, and
one with the committed as well as locally submitted updates.
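Such a two-view arrangement might be sketched as follows. This is one
possible local implementation with invented names, not a mandated
design:

```python
class LocalServer:
    def __init__(self):
        self.committed = {}
        self.submitted = {}   # doc -> tentative, uncommitted content

    def submit(self, doc, content):
        self.submitted[doc] = content        # visible only locally

    def commit(self, doc):
        self.committed[doc] = self.submitted.pop(doc)

    def rollback(self, doc):
        self.submitted.pop(doc, None)        # primary rejected it

    def view(self, include_submitted=False):
        if not include_submitted:
            return dict(self.committed)
        return {**self.committed, **self.submitted}

s = LocalServer()
s.committed["/passwd/alice"] = "old-hash"
s.submit("/passwd/alice", "new-hash")
# The committed-only view still shows "old-hash"; the overlay view
# shows "new-hash" until the primary commits or rejects the update.
```

A password management application, for example, would consult the
overlay view for local logins and fall back to the committed view if
the primary rejects the change.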
7.6 Caching Partial Repository Contents
Developers of some applications (such as browsers) may wish to cache
copies of individual documents rather than undertaking the more
involved implementation and administration requirements of
replicating an entire zone. ARS servers themselves do not
participate in any caching protocol; they manage copies of document
data using only the ARS protocol. Caching, if it is performed at
all, is performed by clients of the ARS service. In effect, the ARS
protocol is used for maintaining a given level of data coherence
corresponding to the local provisioning requirements of the
replication server administrator(s), and caching may be used by
clients to hold on to replicated data once it leaves the replication
service "cloud" according to coherence requirements defined outside
of the replication service.
While ARS servers do not participate in a caching protocol, ARS
specifies one aspect of how caching must be performed by clients that
wish to cache documents. Specifically, each document may contain a
32-bit Time-To-Live (TTL) field that specifies how many seconds a
cached document may be held by a client before it expires. To avoid
clock synchronization requirements, this TTL must be represented as a
"delta" from current time rather than as an absolute clock time at
which expiry is to occur. Processes must decrement the TTL field by
an amount equal to how long they have cached the document before
passing the document to other processes.
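The TTL decrement rule might look like the following sketch, assuming
a client-side cache (the class and method names are illustrative
only; ARS specifies just the decrement-before-forwarding rule):

```python
import time

class CachedDocument:
    def __init__(self, content, ttl_seconds):
        self.content = content
        self.ttl = ttl_seconds               # a delta from "now", not
        self._held_since = time.monotonic()  # an absolute expiry time

    def forward(self):
        """Decrement the TTL by how long this process has held the
        document before passing it to another process."""
        held = int(time.monotonic() - self._held_since)
        remaining = self.ttl - held
        if remaining <= 0:
            raise ValueError("document expired; refetch from a replica")
        return CachedDocument(self.content, remaining)

doc = CachedDocument("<page/>", ttl_seconds=3600)
copy = doc.forward()   # copy's TTL accounts for time already held
```

Because each hop subtracts only its own holding time, no two
processes ever need synchronized clocks.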
7.7 Asynchronous Submission and the Client-Server Promise
When an ARS server receives an update submission from a client, it
enters into a promise to follow that update through either to success
or failure. An infrequently connected client can thus disconnect
from the network after an ARS server has successfully received the
update submission, and rely on the server to perform all needed
store-and-forward delivery reattempts until the update succeeds or
fails.
For pairs of ARS servers that support ars-s, each server that accepts
an update submission along the submission path implicitly enters into
the promise accepted at the submission server. As a consequence,
servers must not be shut down permanently while they are processing
update submissions. Before permanently decommissioning a server it
must be operated in a mode that refuses new requests while completing
old ones.
To handle the case of a server not shut down cleanly (which can
happen for a variety of reasons in operational settings, even though
it violates the above requirement), the administrator of the
immediately downstream server along the submission path can manually
run a tool to fail submissions that have been routed to the shutdown
server.
8. Conflict Detection
Conflict detection is outside the scope of ARS, because some
applications do not need it and would prefer to avoid the additional
runtime overhead and implementation complexity. However, ARS
provides "hooks" to allow the local server implementation or an
application to perform conflict detection as follows.
8.1 Write-Write Conflicts
A server or application running at the ARS primary may detect write-
write conflicts by checking whether the CSN value associated with the
document submitted for update matches the CSN value associated with
the primary's copy of that document. If it does not match a write-
write conflict has occurred, in which case the primary may abort the
update and respond to the submitter with an appropriate failure
response message. It is up to the submitting client to resolve the
conflict, for example by re-reading the updated document and re-
running the computation that updates it.
Non-primary servers must not check for write-write conflicts in this
fashion, else downstream servers will erroneously conclude that a
conflict has occurred when they attempt to apply committed updates
propagated down from upstream servers.
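A sketch of this CSN comparison at the primary follows; the helper
names are invented, since ARS itself does not specify this interface:

```python
class ConflictError(Exception):
    pass

def check_and_apply(primary_docs, doc, new_content, submitted_csn,
                    new_csn):
    """Abort if the CSN the submitter last read does not match the
    primary's current CSN for the document (i.e., someone else
    committed an update in between)."""
    current_csn, _ = primary_docs.get(doc, (0, None))
    if submitted_csn != current_csn:
        raise ConflictError("write-write conflict on %s; re-read and "
                            "re-run the updating computation" % doc)
    primary_docs[doc] = (new_csn, new_content)

docs = {"/a.xml": (7, "v7")}
check_and_apply(docs, "/a.xml", "v8", submitted_csn=7, new_csn=8)
# A second submitter still holding CSN 7 would now get ConflictError.
```

As the text notes, only the primary may apply this check; downstream
servers simply apply committed updates in CSN order.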
8.2 Read-Write Conflicts
A local implementation may handle a basic case of read-write
conflicts by allowing downstream clients to be notified when an
update to their input documents arrives. This approach allows
updates to trickle through downstream servers (by re-running
computations when a read-write conflict is detected so that
eventually correct versions of the computed documents will be
replicated), but does not support global transactional consistency
among read/written values. For example, consider this scenario: A
server reads from replicated document X and uses it to compute a
value for document Y. Concurrently, someone updates X at another
replica server after the above server has read X but before Y has
been generated. Thus, the generated value of Y does not reflect the
most recent update to X. With the notification mechanism, the server
will find out when the new value of X arrives, run again, and
regenerate Y based on the latest value of X.
Note that ARS does not provide an update arrival notification
primitive itself. Rather, the use of the above-described arrival
notification is outside the scope of ARS. A local implementation may
use database triggers or other mechanisms to provide notification.
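The regeneration loop described above might be sketched as follows,
assuming a local trigger invokes a hypothetical on_update_arrival
hook whenever a new version of an input document arrives:

```python
class DerivedDoc:
    """A computed document regenerated when its inputs change."""

    def __init__(self, compute):
        self.compute = compute    # function of the input documents
        self.value = None

    def on_update_arrival(self, inputs):
        """Hypothetical hook, called by a local trigger when a new
        version of an input document arrives."""
        self.value = self.compute(inputs)

inputs = {"X": "x1"}
y = DerivedDoc(lambda d: "derived-from-" + d["X"])
y.on_update_arrival(inputs)   # initial computation of Y from X
inputs["X"] = "x2"            # a newer version of X arrives later
y.on_update_arrival(inputs)   # Y is regenerated from the latest X
```

The derived value may be stale between arrivals, but each new arrival
re-runs the computation, so Y eventually reflects the latest X.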
We chose this relatively simple approach to read-write conflict
detection because we feel it is a reasonable compromise of
scalability and implementation simplicity. For example, we have
scaling concerns about precedence graph-based approaches [47], and we
believe that requiring application-specific conflict detection [48]
would add too much complexity to the protocol.
9. Transmission Optimization: Encodings
ARS defines a single data update representation that can be used for
transmitting all submitted and committed updates between ARS peers.
However, there are a variety of cases where a different
representation can be significantly more efficient. The ARS Encoding
Negotiation Protocol (ars-e) provides a mechanism that can be used
between pairs of communicating ARS peers for selecting a common set
of MIME-based [36] data representations that support various types of
transmission optimizations. Its existence was inspired in part by
features in HTTP/1.1 [49] as well as by the author's experience with
the proprietary replication implementations in a variety of
commercial technologies and services.
A basic type of optimization we wish to support with ars-e is
transmission of compressed data, for example using MIME type "gzip-
compressed". ars-e can also be used to support delta encoded data
[50], which can be useful for textual repositories where small
changes are made to individual documents.
Given our data scale and performance requirements (Section 1.4),
probably the most important optimization is bulk transmission of raw
data files between a pair of ARS servers that are both implemented
atop the same proprietary data storage system (e.g., a particular
version of a particular vendor's relational database). For example,
the Oracle 8i exportable tablespace feature [25] provides a means of
exposing the underlying file system representation of database
tables, which can be replicated across a network orders of magnitude
faster than would be possible by transmitting the corresponding
individual record-level SQL data manipulation operations. A bulk
transfer encoding might also be used to take advantage of proprietary
backend hardware, for example using EMC's TimeFinder [51] product to
transfer huge datastores across a storage area network.
This ability to perform bulk synchronization of entire zones is how
we address the requirement (Section 1.4), "Performance of individual
point-to-point replication should not be significantly worse than
proprietary schemes, excluding convergence time." Specifically, in
the author's experience the performance difference between different
incremental update protocols is minor compared to the difference
between a system that supports bulk transmission and one that does
not.
There is one case where the use of an encoding can allow strictly
greater functionality (as opposed to better performance) to the peers
using that encoding. As mentioned earlier (Section 6), each ARS
server maintains a CSN-indexed list of operations applied to each
zone. When a downstream server requests all updates since a given
CSN, the upstream server consults the operation list to generate the
needed response. As a local implementation matter, servers may
choose to represent this list in a log, and they may choose to
truncate a prefix of this log periodically to conserve space. If
they do, it is possible that a downstream server could request
updates that precede the upstream server's log truncation point. In
this case, the upstream server sends an error to the downstream
server, from which the downstream server may recover by requesting
that the complete zone content be transmitted (similar to the
approach taken by Bayou [37]). This content must be transmitted with
an encoding designed for full zone transfers. For example, an
encoding could be defined that transmits the complete zone as a
sequence of one create operation for each existing document. Or, as
noted above, the entire zone could be transferred as a compressed
archive file if both servers are built on the same type of data
storage system. In either case, the downstream server would need to
delete its existing zone content before inserting the transmitted
content.
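The truncation/recovery exchange might be sketched as follows;
LogTruncated and the method names here are illustrative, not protocol
elements:

```python
class LogTruncated(Exception):
    pass

class Upstream:
    def __init__(self, ops, truncated_before):
        self.ops = ops                       # CSN -> committed op
        self.truncated_before = truncated_before

    def pull_since(self, csn):
        """Answer a pull request, or signal that the log no longer
        reaches back far enough."""
        if csn + 1 < self.truncated_before:
            raise LogTruncated()             # error to downstream
        return [self.ops[n] for n in sorted(self.ops) if n > csn]

    def full_zone(self):
        """Fallback: ship the complete zone content, e.g. as one
        create operation per document or a bulk archive encoding."""
        return {"csn": max(self.ops), "docs": {"/a.xml": "v9"}}

up = Upstream({8: "op8", 9: "op9"}, truncated_before=8)
try:
    updates = up.pull_since(3)               # precedes truncation point
except LogTruncated:
    # Downstream deletes its existing zone content, then installs
    # this snapshot and resumes incremental pulls from its CSN.
    snapshot = up.full_zone()
```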
We place one key restriction on the use of ars-e: an implementation
must not use any proprietary formats in its encodings except for the
special case of transmitting an entire zone as a bulk transmission.
In all other cases the encodings must be published as open
specifications. Implementations that do not meet this requirement
are deemed not to be in compliance with the ARS specification. This
restriction is intended to allow an important efficiency enhancement
for bulk data transfer, yet to ensure that there is always an
interoperable way to transfer data between ARS peers. Moreover, we
seek to prevent vendors from claiming "trivial" ARS compliance by
misusing the bulk encoding "escape" to drop into their proprietary
system for all replication, and in so doing claim a more efficient
(but non-interoperable) "ARS implementation" than their competitors.
10. Security
ARS is defined in terms of a BEEP [28] profile. A number of ARS's
security characteristics derive from BEEP's built-in security models.
10.1 Authentication
As a BEEP profile, ARS uses the Simple Authentication and Security
Layer (SASL [52]) to negotiate the authentication mechanism used
between ARS peers.
10.2 Confidentiality
Encryption may be used during transmission, per usual practice with
BEEP profiles.
10.3 Integrity
Documents may contain primary-generated digital signatures to enforce
integrity of contents and prevent man-in-the-middle or sequence
number collision attacks. As is the case with caching, digital
signing (if implemented) is implemented outside of the ARS service,
by clients of the ARS service. ARS does not specify how documents
are to be signed. The XML DSIG representation [53] is a possible
approach.
10.4 Access Control
Access control is not yet worked out for ARS. We intend to define a
two-level access control architecture for ARS, wherein a Document Type
Definition (DTD) specifies an Access Control List structure used
between pairs of ARS servers, and user-level access control is left
as a local implementation matter. (Note that users interact with ARS
through clients which authenticate for the given users' identity.)
The ACL DTD, requirements governing the distribution and uniform
application of ACLs, and other protocol and architecture issues, are
all areas for future work.
11. Datastore Interoperation
Other services may need to share access to a datastore being
replicated by ARS. For example, ARS provides no search support,
instead assuming that such functionality could be built into a system
that shares access to the replicated datastore.
There are three requirements for sharing access to a datastore
replicated by ARS:
1. The underlying datastore must consist entirely of hierarchically-
named, well-formed XML documents.
2. All datastore read requests must perform zone-wide locking, as
described earlier (Section 7.3).
3. All datastore updates must "funnel" through ARS. For example, a
non-ARS service may accept update requests for replicated
documents, forward these requests to ARS, and wait for ARS to
indicate the update has completed before responding to the
initiating client.
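Requirement 3 might be realized as in the following sketch, where a
hypothetical non-ARS service forwards an update to a stand-in ARS
submission interface and waits for the outcome before answering its
client (all names here are illustrative):

```python
class ArsStub:
    """Stand-in for an ARS submission interface (hypothetical)."""

    def __init__(self):
        self.datastore = {}

    def submit_and_wait(self, doc, content):
        # Stand-in for: submit the update to ARS and block until it
        # has committed (or failed) at the primary.
        self.datastore[doc] = content
        return "committed"

class FrontEndService:
    """Shares the replicated datastore; writes funnel through ARS."""

    def __init__(self, ars):
        self.ars = ars

    def handle_update(self, doc, content):
        status = self.ars.submit_and_wait(doc, content)
        return status      # respond to the client only after ARS

svc = FrontEndService(ArsStub())
result = svc.handle_update("/docs/x.xml", "<x/>")
```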
12. Summary: ARS's Realm of Applicability
In this section we briefly summarize the strengths and limitations of
ARS, to help data service architects determine whether ARS is
appropriate for their needs.
12.1 Strengths
ARS replicates XML content in a manner that we believe is appropriate
for a broad class of business-critical internetworked applications.
Specifically, ANTACID semantics guarantee that updates submitted
together are applied in the same order and with ACID semantics at
each replica, but that individual replicas apply updates
asynchronously from one another. This approach is motivated by the
scaling problems of providing global ACID semantics and by the fact
that many applications work by interacting with a single replica, for
which all that is required is that updates to that replica appear to
be ACID to applications interacting with that replica.
ARS addresses scalability in data volume and replica count through a
combination of ANTACID semantics, update routing according to a
configured graph structure, the separation of conflict detection from
the replication service, transmission encodings, and the ability to
partition the name space for distributed administration.
ARS addresses operation across a variety of network environments
through a combination of update routing according to a configured
graph structure, store-and-forward update submissions, and support
for both pull and push models for committed update propagation.
ARS addresses mobile/computationally modest client operation through
a combination of the client-server promise, asynchronous
notification, collapsing updates, update-anywhere capability, and the
distinction of replication-service vs client-of-replication-service.
12.2 Limitations
Replication is a broad problem, for which no single standard could
work well for "all" applications. ARS has a number of limitations,
which we discuss in two categories below: architectural limitations,
and limitations concerned with the current state of completeness of
the specification. The former are probably fundamental to the design
tradeoffs we have chosen (unless the specification evolves in ways we
have not predicted), while the latter are likely to change over time.
In some cases the limitations discussed below can be handled by
suitable modifications to how data is organized in a given
application. In other cases ARS may simply be inappropriate for the
application.
12.2.1 Architectural Limitations
Probably ARS's biggest limitation is that it does not support global
ACID updates. Applications requiring global ACID semantics should
use a different replication solution, such as one of the commercially
available RDBMS's.
On a related note, ARS's ANTACID semantics provide no support for a
notion of referential integrity that spans beyond a single update.
For example, ARS does not provide any means of ensuring that a web
page that was once visible at a website can still be reached later
via a saved bookmark (a reference that breaks if sites regularly take
old content off their servers).
Also related to ANTACID semantics, ARS's read-write conflict
detection mechanism is limited to detecting when an update arrives
for a document or document subtree. This approach is oriented
towards applications where updates trigger computations that can re-
generate derived data, in support of computed content pipelines. ARS
specifically does not support any notion of transactional integrity
in the face of read-write conflicts. In other words, an ARS server
cannot guarantee that it will detect that a read-write conflict has
occurred at the time an update is being committed.
Another limitation is ARS's reliance on a primary server, the
unavailability of which will stall update submissions from
committing. Because of this, an ARS primary will typically need to
reside on a highly available, well connected machine. Mobile devices
using ARS can only commit updates when they connect to a network
capable of routing to the primary. For example, a pair of mobile
users on an airplane could not use ARS to commit updates and
synchronize between each other while disconnected from the rest of
the Internet unless one of the mobile users' machines is itself the
primary server. Updates can be submitted while disconnected, and ARS
provides a way for clients to view the local submitted update state,
but they cannot commit and propagate the updates to other ARS peers
until the updates have committed at the primary server.
More generally, processes are not free to propagate updates to
arbitrary other ARS peers. All updates must flow according to a
configured graph topology. This graph topology provides for
redundant paths through the network, but for example an ARS server
cannot simply locate a nearby server (e.g., by LAN broadcast) and
exchange update state.
On a related note, the ARS replication topology must conform to a
DAG structure. This restriction can in some cases mean that
achieving redundant paths requires more servers than would be needed
if the graph weren't required to be acyclic. For example, consider
the following cyclic replication topology:
      s1
      | \
      |  \
      s2---s3
In this figure updates can propagate to s2 if s3 is down and vice
versa. With ARS's DAG-based topology an additional server would be
required to achieve the same level of redundancy:
      s1------>s4
      | \     /
      |  \   /
      |   \ /
      |   / \
      |  /   \
     \|/\|/  \|/
      s2----->s3
Related to the DAG limitations, if a server accepts a submission and
all of its upstream servers are unreachable there is currently no way
to "back the submission out" from the accepting server and send it to
another server whose upstream servers are currently reachable.
Because of the client-server promise, once a server accepts the
submission, that part of the submission path is fixed. As a
consequence it's important that each server be provisioned with
upstream servers that together provide acceptable availability.
Related to ARS's primary server design is the fact that replicated
collections are defined by the accumulation of changes visible at the
single primary server. Thus, for example, ARS would not be a very
natural fit for applications that need to allow a user to select
pieces of data to download as a one-time operation from multiple
sources -- e.g., as done by peer-to-peer file sharing applications
like Napster and Gnutella.
Another limitation of ARS is that it only replicates XML content.
While arbitrary data could be encoded into XML, doing so would not be
an efficient means of transmitting large volumes of binary data, such
as audio or video content.
Another limitation is that ARS does not support attribute-level
replication. Updates are performed at the granularity of an entire
XML document (although ARS's encoding mechanism may cause
significantly less data than the full document to be transmitted
during replication). Thus, for example, an application that modifies
a few cells in a large table would not do well to represent the table
as one large document with elements or attributes representing the
individual cells. Instead, each individual document should be
"reasonably small" (the exact definition of which depends on the
speed of network links and servers, required update commit rates,
etc.).
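As an illustration of this granularity guidance, the sketch below
(hypothetical naming scheme and schema, not part of ARS) represents
each table row as its own small XML document under a hierarchical
collection name, so that modifying a few cells causes only the
affected row documents to be retransmitted:

```python
import xml.etree.ElementTree as ET

def row_documents(table_rows, collection="/tables/inventory"):
    """Map each table row to its own small XML document, keyed by a
    hierarchical document name under `collection`. `table_rows` maps
    row id -> {column name: value}."""
    docs = {}
    for row_id, cells in table_rows.items():
        row = ET.Element("row", id=str(row_id))
        for column, value in cells.items():
            cell = ET.SubElement(row, "cell", column=column)
            cell.text = str(value)
        docs["%s/row-%s" % (collection, row_id)] = ET.tostring(
            row, encoding="unicode")
    return docs

# Updating row 7 later touches only the document named
# /tables/inventory/row-7, not one giant table document.
docs = row_documents({7: {"sku": "A-100", "qty": 3}})
```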
Another granularity limitation of ARS concerns locking: because ARS
serializes updates at the granularity of a zone, zone-wide locking is
needed while applying committed updates. In turn this coarse-grained
locking can present performance bottlenecks for some application
workloads compared with, for example, row-level locking in RDBMSs.
Another limitation is that ARS provides no protocol support for
determining when an update has circulated globally. An application
could be written to traverse the ARS DAG for a given zone and check
for update completion, but distributed administration of servers
could prevent discovery or querying of some servers. Moreover, the
cost of this computation could be prohibitive. Finally, the list of
replicas may itself be in flux, making it difficult to define a
meaningful notion of "all replicas".
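For concreteness, the traversal described above might look like the
following sketch, where `has_update` stands in for a per-server query
operation that ARS itself does not define, and `downstream` is an
application-maintained map of the zone's replica DAG:

```python
from collections import deque

def update_fully_propagated(primary, downstream, has_update, update_id):
    """Traverse the replica DAG breadth-first from `primary` and ask
    each reachable server whether it has committed `update_id`.
    `downstream` maps server -> list of downstream servers;
    `has_update(server, update_id)` is a hypothetical query hook."""
    seen, queue = {primary}, deque([primary])
    while queue:
        server = queue.popleft()
        if not has_update(server, update_id):
            return False
        for nxt in downstream.get(server, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return True

downstream = {"s1": ["s2", "s3"], "s2": ["s3"]}
committed = {"s1": {"u42"}, "s2": {"u42"}, "s3": set()}
has = lambda server, uid: uid in committed[server]
update_fully_propagated("s1", downstream, has, "u42")  # False: s3 lacks u42
```

Note that this sketch inherits all the caveats in the text: it can
only see servers the caller knows about and is allowed to query, and
its cost grows with the size of the DAG.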
12.2.2 Completeness Limitations
The digital signing and access control models are not worked out.
Among other things, an Access Control List format DTD needs to be
defined, along with protocol/architectural requirements on the
distribution and uniform application of access control in ARS.
A MIME subtype for registering new ARS encodings needs to be formally
defined and registered with the Internet Assigned Numbers Authority
(IANA).
Procedures need to be worked out for how to split and delegate zones.
SNMP MIBs have not been defined for ARS service monitoring and
management.
There is no support for automated layout of the replication
topology, which may limit the number of servers across which
replication service can be manageably deployed.
ARS makes no guarantees about how long updates can take to
propagate, and instead considers propagation delay a local
engineering/capacity planning matter. However, in the future a new
ARS sub-protocol might also be defined to support some form of
conditional service guarantees for deployments where resource
reservations are possible.
13. Acknowledgements
The author would like to thank the following people for their many
helpful suggestions about ARS: David Clark, Dave Crocker, Marco
Gazzetta, Carl Malamud, Paul Mockapetris, Darren New, Calton Pu, and
Marshall Rose.
References
[1] World Wide Web Consortium, "Extensible Markup Language (XML)
1.0", W3C XML, February 1998, <http://www.w3.org/TR/1998/REC-
xml-19980210>.
[2] Schwartz, M., "The ANTACID Replication Service: Protocol and
Algorithms", draft-schwartz-antacid-protocol-00 (work in
progress), October 2001.
[3] Gray, J. and A. Reuter, "Transaction Processing: Concepts and
Techniques", Morgan Kaufmann Publishers, Inc., 1993.
[4] Rigney, C., Rubens, A., Simpson, W. and S. Willens, "Remote
Authentication Dial In User Service (RADIUS)", RFC 2058,
January 1997.
[5] Droms, R., "Dynamic Host Configuration Protocol", RFC 1531,
October 1993.
[6] Reynolds, J., "Post Office Protocol", RFC 918, October 1984.
[7] Postel, J., "Simple Mail Transfer Protocol", RFC 788, November
1981.
[8] Yeong, W., Howes, T. and S. Kille, "X.500 Lightweight Directory
Access Protocol", RFC 1487, July 1993.
[9] Rabinovich, M., "Issues in Web Content Replication", Data
Engineering Bulletin Vol. 21 No. 4, December 1998,
<http://www.research.att.com/~misha/otherPubs/web_repl.ps.gz>.
[10] Cooper, I., Melve, I. and G. Tomlinson, "Internet Web
Replication and Caching Taxonomy", RFC 3040, January 2001.
[11] Wessels, D., "Squid Web Proxy Cache", August 2000,
<http://www.squid-cache.org/>.
[12] Kantor, B. and P. Lapsley, "Network News Transfer Protocol",
RFC 977, February 1986.
[13] Wessels, D. and K. Claffy, "Internet Cache Protocol (ICP),
version 2", RFC 2186, September 1997.
[14] Schwartz, M., "Formal Service Agreements for Scaling Internet
Caching", NLANR Web Cache Workshop , June 1997,
<http://www.ircache.net/Cache/Workshop97/Papers/Schwartz/schwartz.html>
.
[15] Microsoft Corporation, "Cache Array Routing Protocol and
Microsoft Proxy Server 2.0", August 1998,
<http://www.microsoft.com/proxy/documents/CarpWP.exe>.
[16] van Hoff, A., Giannandrea, J., Hapner, M., Carter, S. and M.
Medin,
"The HTTP Distribution and Replication Protocol", August 1997,
<http://www.w3.org/TR/NOTE-drp>.
[17] Goland, Y., Whitehead, E., Faizi, A., Carter, S. and D. Jensen,
"HTTP Extensions for Distributed Authoring -- WEBDAV", RFC
2518, February 1999.
[18] Skardal, H., Bunnell, J., Chellam, S., Gardner, T., Hendrie,
C., Komatsu, K., Linn, G., Manley, D., Waidhofer, G. and J.
Ward, "NDMP Version 4 Protocol", draft-skardal-ndmpv4-02 (work
in progress), April 2001.
[19] Leach, P. and D. Naik, "A Common Internet File System
(CIFS/1.0) Protocol", December 1997,
<ftp://ftp.microsoft.com/developr/drg/cifs/cifs9f.doc>.
[20] Sun Microsystems, Inc., "NFS: Network File System Protocol
specification", RFC 1094, March 1989.
[21] Microsoft Corporation, "ODBC Section of the Microsoft Universal
Data Access Web Site", March 1999,
<http://www.microsoft.com/data/odbc/default.htm>.
[22] International Organization for Standardization, "Information
Technology - Database Languages - SQL", ISO/IEC 9075:1992,
1992.
[23] Satyanarayanan, M., Kistler, J., Kumar, P., Okasaki, M.,
Siegel, E. and D. Steere, "Coda: A Highly Available File System
for a Distributed Workstation Environment", IEEE Transactions
on Computers Vol. 39, No. 4, April 1990,
<http://www.cs.cmu.edu/afs/cs/project/coda/Web/docdir/tcc90.pdf>.
[24] Gray, J., Helland, P., O'Neil, P. and D. Shasha, "The Dangers
of Replication and a Solution", Proceedings of SIGMOD
International Conference on Management of Data, 1996.
[25] Oracle Corporation, "Oracle8i Replication Release 8.1.5 A67791-
01", February 1999,
<http://technet.oracle.com/doc/server.815/a67791/toc.htm>.
[26] Danzig, P., DeLucia, D. and K. Obraczka, "Massively Replicating
Services in Wide-Area Internetworks", University of Southern
California Technical Report 94-595, 1994,
<ftp://catarina.usc.edu/pub/kobraczk/ToN.ps.Z>.
[27] Atkinson, R. and J. Munter, "UDDI Version 2.0 Replication
Specification", June 2001,
<http://www.uddi.org/pubs/Replication-V2.00-Open-20010608.pdf>.
[28] Rose, M., "The Blocks Extensible Exchange Protocol Core", RFC
3080, March 2001.
[29] Demers, A., Greene, D., Hauser, C., Irish, W., Larson, J.,
Shenker, S., Sturgis, H., Swinehart, D. and D. Terry, "Epidemic
Algorithms for Replicated Database Maintenance", Proceedings of
the Sixth Conference on Principles of Distributed Computing,
August 1987.
[30] Schroeder, M., Birrell, A. and R. Needham, "Experience with
Grapevine: The Growth of a Distributed System", ACM
Transactions on Computer Systems Vol. 2 No. 1, February 1984.
[31] Rose, M., Gazzetta, M. and M. Schwartz, "The Blocks Datastore
Model", Draft Technical Memo, January 2001.
[32] Braden, R. and J. Postel, "Requirements for Internet gateways",
RFC 1009, June 1987.
[33] Even, S., "Graph Algorithms", Computer Science Press, 1979.
[34] Lampson, B. and H. Sturgis, "Crash Recovery in a Distributed
Data Storage System", Xerox Palo Alto Research Center Internal
Report, April 1979.
[35] Mockapetris, P., "Domain names - concepts and facilities", STD
13, RFC 1034, November 1987.
[36] Borenstein, N. and N. Freed, "MIME (Multipurpose Internet Mail
Extensions) Part One: Mechanisms for Specifying and Describing
the Format of Internet Message Bodies", RFC 1521, September
1993.
[37] Petersen, K., Spreitzer, M., Terry, D., Theimer, M. and A.
Demers, "Flexible Update Propagation for Weakly Consistent
Replication", Proceedings of the 16th ACM Symposium on
Operating System Principles, October 1997,
<http://www.parc.xerox.com/csl/projects/bayou/pubs/sosp-
97/index.html>.
[38] Claffy, K., Braun, H. and M. Schwartz, "A Distributed Testbed
for National Information Provisioning", A proposal submitted to
the National Science Foundation, April 1995,
<http://www.ircache.net/Cache/C-Proposal/kcs.html>.
[39] Berners-Lee, T., Fielding, R. and L. Masinter, "Uniform
Resource Identifiers (URI): Generic Syntax", RFC 2396, August
1998.
[40] Fenner, B., Handley, M., Holbrook, H. and I. Kouvelas,
"Protocol Independent Multicast - Sparse Mode (PIM-SM):
Protocol Specification (Revised)", draft-ietf-pim-sm-v2-new-03
(work in progress), July 2001.
[41] Steere, D., Baptista, A., McNamee, D., Pu, C. and J. Walpole,
"Research Challenges in Environmental Observation and
Forecasting Systems", Proceedings of The Sixth Annual
International Conference on Mobile Computing and Networking,
August 2000.
[42] Fox, A., Gribble, S., Chawathe, Y., Brewer, E. and P. Gauthier,
"Cluster-Based Scalable Network Services", Proceedings of the
16th Symposium on Operating Systems Principles, October 1997,
<http://www.cs.berkeley.edu/~gribble/papers/sosp16.ps.gz>.
[43] Ericsson Utvecklings AB, "Mnesia 3.9.2", 2000,
<http://www.erlang.se/doc/doc-5.0.1/lib/mnesia-
3.9.2/doc/index.html>.
[44] Hansen, P., "Operating System Principles", Prentice Hall, 1973.
[45] Terry, D., Petersen, K., Spreitzer, M. and M. Theimer, "The
Case for Non-transparent Replication: Examples from Bayou",
IEEE Data Engineering, December 1998,
<http://www.parc.xerox.com/csl/projects/bayou/pubs/dataeng-
98/DataEngineeringDec98.frame.html>.
[46] Merrells, J., Reed, E. and U. Srinivasan, "LDAP Replication
Architecture", draft-ietf-ldup-model-06 (work in progress),
March 2000.
[47] Davidson, S., "Optimism and Consistency in Partitioned
Distributed Database Systems", ACM Transactions on Database
Systems Vol. 9 No. 3, September 1984.
[48] Terry, D., Theimer, M., Petersen, K., Demers, A., Spreitzer, M.
and C. Hauser, "Managing Update Conflicts in Bayou, a Weakly
Connected Replicated Storage System", Proceedings of the 15th
ACM Symposium on Operating System Principles, December 1995,
<http://www.parc.xerox.com/csl/projects/bayou/pubs/sosp-
95/www/BayouConflictsSOSP-web_1.html>.
[49] Fielding, R., Gettys, J., Mogul, J., Nielsen, H., Masinter, L.,
Leach, P. and T. Berners-Lee, "Hypertext Transfer Protocol --
HTTP/1.1", RFC 2616, June 1999.
[50] Mogul, J., Krishnamurthy, B., Douglis, F., Feldmann, A.,
Goland, Y., van Hoff, A. and D. Hellerstein, "Delta Encoding in
HTTP", draft-mogul-http-delta-10 (work in progress), October
2001.
[51] Kodumudi, S. and P. Manning, "Split Mirror Backup and Database
Replication Using EMC TimeFinder", February 1999,
<http://www.emc.com/products/product_pdfs/wp/timefinder/smbudr_tf.pdf>.
[52] Myers, J., "Simple Authentication and Security Layer (SASL)",
RFC 2222, October 1997.
[53] Eastlake, D., Reagle, J. and D. Solo, "XML-Signature Syntax and
Processing", RFC 3075, March 2001.
Author's Address
Michael F. Schwartz
Code On The Road, LLC
EMail: schwartz@CodeOnTheRoad.com
URI: http://www.CodeOnTheRoad.com
Appendix A. Extended List of Potential ARS Applications
To help illustrate ARS's potential applicability, the following table
expands on the three example replication applications provided in the
Example Applications section (Section 1.1). We present these
examples without further discussion, and instead refer readers
wanting more depth to that section.
|--------------------------------------------------------------------|
| Application | Description |
|--------------------------------------------------------------------|
| content distribution | central content provider publishes content |
| network | updates that propagate to servers at many |
| | network points of presence |
|--------------------------------------------------------------------|
| real-time news | many small docs flowing from central site |
| provider | out to sites with which PDA's synchronize |
|--------------------------------------------------------------------|
| service | integr. of back office subscriber db's with |
| provisioning | RADIUS, DHCP, POP, SMTP, LDAP, WWW, other |
| | subscriber- or client-facing service DBs |
|--------------------------------------------------------------------|
| computed content | data owner injects data to be read by |
| pipeline | competing downstream services that compute |
| | derived content |
|--------------------------------------------------------------------|
| nearby cell data | PDAs submit updates and receive committed |
| synchronization | updates from nearest cell server or kiosk |
|--------------------------------------------------------------------|
| network management | multiple agents observing each net element |
| data "plumbing" | make conflicting updates. Regional cor- |
| | relation/aggregation reduces NOC traffic |
| | (cf. NetExpert [55]) |
|--------------------------------------------------------------------|
| private annotations | publicly updateable replication space for |
| to centrally owned | one data set, more restrictive space for |
| part of name space | others. Multiple annotation authors/owners |
|--------------------------------------------------------------------|
| integrated workflow | open standard replication for workflow, |
| | version control, web content publishing |
|--------------------------------------------------------------------|
| XML-driven dynamic | distributed dynamic content servers cont- |
| content | rolled by XML-encoded business logic+data |
|--------------------------------------------------------------------|
| multi-PDA | users perform updates via more than two |
| synchronization | devices (PDA, cell phone, desktop, etc.) |
|--------------------------------------------------------------------|
| intranet replication | mobile/field office/partner data exchange |
|--------------------------------------------------------------------|
| structured netnews | advantages over NNTP: (a) eventually |
| | consistent, ordered updates, (b) fielded |
| | search, (c) authenticated |
|--------------------------------------------------------------------|
Full Copyright Statement
Copyright (C) The Internet Society (2001). All Rights Reserved.
This document and translations of it may be copied and furnished to
others, and derivative works that comment on or otherwise explain it
or assist in its implementation may be prepared, copied, published
and distributed, in whole or in part, without restriction of any
kind, provided that the above copyright notice and this paragraph are
included on all such copies and derivative works. However, this
document itself may not be modified in any way, such as by removing
the copyright notice or references to the Internet Society or other
Internet organizations, except as needed for the purpose of
developing Internet standards in which case the procedures for
copyrights defined in the Internet Standards process must be
followed, or as required to translate it into languages other than
English.
The limited permissions granted above are perpetual and will not be
revoked by the Internet Society or its successors or assigns.
This document and the information contained herein is provided on an
"AS IS" basis and THE INTERNET SOCIETY AND THE INTERNET ENGINEERING
TASK FORCE DISCLAIMS ALL WARRANTIES, EXPRESS OR IMPLIED, INCLUDING
BUT NOT LIMITED TO ANY WARRANTY THAT THE USE OF THE INFORMATION
HEREIN WILL NOT INFRINGE ANY RIGHTS OR ANY IMPLIED WARRANTIES OF
MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE.
Code On The Road, LLC expressly disclaims any and all warranties
regarding this contribution including any warranty that (a) this
contribution does not violate the rights of others, (b) the owners,
if any, of other rights in this contribution have been informed of
the rights and permissions granted to IETF herein, and (c) any
required authorizations from such owners have been obtained. This
document and the information contained herein is provided on an "AS
IS" basis and CODE ON THE ROAD, LLC DISCLAIMS ALL WARRANTIES, EXPRESS
OR IMPLIED, INCLUDING BUT NOT LIMITED TO ANY WARRANTY THAT THE USE OF
THE INFORMATION HEREIN WILL NOT INFRINGE ANY RIGHTS OR ANY IMPLIED
WARRANTIES OF MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE.
IN NO EVENT WILL CODE ON THE ROAD, LLC BE LIABLE TO ANY OTHER PARTY
INCLUDING THE IETF AND ITS MEMBERS FOR THE COST OF PROCURING
SUBSTITUTE GOODS OR SERVICES, LOST PROFITS, LOSS OF USE, LOSS OF
DATA, OR ANY INCIDENTAL, CONSEQUENTIAL, INDIRECT, OR SPECIAL DAMAGES
WHETHER UNDER CONTRACT, TORT, WARRANTY, OR OTHERWISE, ARISING IN ANY
WAY OUT OF THIS OR ANY OTHER AGREEMENT RELATING TO THIS DOCUMENT,
WHETHER OR NOT SUCH PARTY HAD ADVANCE NOTICE OF THE POSSIBILITY OF
SUCH DAMAGES.
Acknowledgement
Funding for the RFC Editor function is currently provided by the
Internet Society.