P2PSIP Working Group                                          J. Maenpaa
Internet-Draft                                              G. Camarillo
Intended status: Standards Track                           J. Hautakorpi
Expires: August 20, 2009                                        Ericsson
                                                       February 16, 2009


  A Self-tuning Distributed Hash Table (DHT) for REsource LOcation And
                           Discovery (RELOAD)
                draft-maenpaa-p2psip-self-tuning-00.txt

Status of this Memo

   This Internet-Draft is submitted to IETF in full conformance with the
   provisions of BCP 78 and BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF), its areas, and its working groups.  Note that
   other groups may also distribute working documents as Internet-
   Drafts.

   Internet-Drafts are draft documents valid for a maximum of six months
   and may be updated, replaced, or obsoleted by other documents at any
   time.  It is inappropriate to use Internet-Drafts as reference
   material or to cite them other than as "work in progress."

   The list of current Internet-Drafts can be accessed at
   http://www.ietf.org/ietf/1id-abstracts.txt.

   The list of Internet-Draft Shadow Directories can be accessed at
   http://www.ietf.org/shadow.html.

   This Internet-Draft will expire on August 20, 2009.

Copyright Notice

   Copyright (c) 2009 IETF Trust and the persons identified as the
   document authors.  All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents
   (http://trustee.ietf.org/license-info) in effect on the date of
   publication of this document.  Please review these documents
   carefully, as they describe your rights and restrictions with respect
   to this document.






Maenpaa, et al.          Expires August 20, 2009                [Page 1]


Internet-Draft         Self-tuning DHT for RELOAD          February 2009


Abstract

   REsource LOcation And Discovery (RELOAD) is a peer-to-peer (P2P)
   signaling protocol that provides an overlay network service.  Peers
   in a RELOAD overlay network collectively run an overlay algorithm to
   organize the overlay, and to store and retrieve data.  RELOAD
   provides an abstract interface to the overlay layer that allows
   implementing different structured and unstructured overlay algorithms
   by using different topology plugins.  This document defines a new
   topology plugin for RELOAD.  This topology plugin implements a self-
   tuning DHT (Distributed Hash Table), which adapts to changing
   operating conditions (e.g., churn and network size).


Table of Contents

   1.  Introduction . . . . . . . . . . . . . . . . . . . . . . . . .  3
   2.  Terminology  . . . . . . . . . . . . . . . . . . . . . . . . .  3
   3.  Alternative Approaches to Stabilization in DHTs  . . . . . . .  5
     3.1.  Reactive vs. Periodic Stabilization  . . . . . . . . . . .  5
     3.2.  Configuring Periodic Stabilization . . . . . . . . . . . .  6
     3.3.  Adaptive Stabilization . . . . . . . . . . . . . . . . . .  7
   4.  Self-tuning, Chord-based Topology Plugin for RELOAD  . . . . .  8
     4.1.  Update Messages  . . . . . . . . . . . . . . . . . . . . .  9
     4.2.  Finger Stabilization . . . . . . . . . . . . . . . . . . . 12
     4.3.  Successor Stabilization  . . . . . . . . . . . . . . . . . 13
     4.4.  Predecessor Stabilization  . . . . . . . . . . . . . . . . 14
     4.5.  Joining an Overlay . . . . . . . . . . . . . . . . . . . . 14
       4.5.1.  Contents of the Join Message . . . . . . . . . . . . . 15
     4.6.  Leaving an Overlay . . . . . . . . . . . . . . . . . . . . 15
       4.6.1.  Contents of the Leave Message  . . . . . . . . . . . . 16
     4.7.  Data Replication . . . . . . . . . . . . . . . . . . . . . 16
     4.8.  Strong Stabilization . . . . . . . . . . . . . . . . . . . 17
   5.  Self-tuning Chord Parameters . . . . . . . . . . . . . . . . . 18
     5.1.  Estimating Overlay Size  . . . . . . . . . . . . . . . . . 18
     5.2.  Determining Routing Table Size . . . . . . . . . . . . . . 18
     5.3.  Estimating Failure Rate  . . . . . . . . . . . . . . . . . 19
     5.4.  Estimating Join Rate . . . . . . . . . . . . . . . . . . . 19
     5.5.  Calculating the Stabilization Interval . . . . . . . . . . 20
   6.  Security Considerations  . . . . . . . . . . . . . . . . . . . 21
   7.  IANA Considerations  . . . . . . . . . . . . . . . . . . . . . 21
   8.  References . . . . . . . . . . . . . . . . . . . . . . . . . . 21
     8.1.  Normative References . . . . . . . . . . . . . . . . . . . 21
     8.2.  Informative References . . . . . . . . . . . . . . . . . . 22
   Authors' Addresses . . . . . . . . . . . . . . . . . . . . . . . . 23








1.  Introduction

   REsource LOcation And Discovery (RELOAD) [I-D.ietf-p2psip-base] is a
   peer-to-peer signaling protocol that can be used to maintain an
   overlay network, and to store data in and retrieve data from the
   overlay.  For interoperability reasons, RELOAD specifies one overlay
   algorithm that is mandatory to implement.  Additionally, RELOAD
   supports a variety of other overlay algorithms through the use of
   topology plugins.  A topology plugin implements the topology defined
   by a specific overlay algorithm.  This document defines a new
   topology plugin for RELOAD.  This topology plugin implements a self-
   tuning DHT (Distributed Hash Table) algorithm.

   DHT-based overlay networks are self-organizing, scalable and
   reliable.  However, these features come at a cost: peers in the
   overlay network need to consume network bandwidth to maintain routing
   state.  Most DHTs use a periodic stabilization routine to counter the
   undesirable effects of churn on routing.  To configure the parameters
   of a DHT, some characteristics such as churn rate and network size
   need to be known in advance.  These characteristics are then used to
   configure the DHT in a static fashion by using fixed values for
   parameters such as the size of the successor set, size of the routing
   table, and rate of maintenance messages.  The problem with this
   approach is that it is not possible to achieve a low failure rate and
   a low communication overhead by using fixed parameters.  Instead, a
   better approach is to allow the system to take into account the
   evolution of network conditions and adapt to them.  The topology
   plugin proposed in this document uses a self-tuning version of the
   Chord DHT algorithm [Chord].  Two main advantages of a self-tuning
   DHT are that users no longer need to tune every DHT parameter
   correctly for a given operating environment and that the system
   adapts to changing operating conditions.

   The remainder of this document is structured as follows: Section 2
   provides definitions of terms used in this document.  Section 3
   discusses alternative approaches to stabilization operations in DHTs,
   including reactive stabilization, periodic stabilization, and
   adaptive stabilization.  Section 4 defines the self-tuning Chord DHT,
   whereas Section 5 describes how the self-tuning DHT calculates the
   stabilization rate and routing table size in an adaptive fashion.


2.  Terminology

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
   "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
   document are to be interpreted as described in RFC 2119 [RFC2119].






   This document uses the terminology and definitions from the Concepts
   and Terminology for Peer to Peer SIP [I-D.ietf-p2psip-concepts]
   draft.


   Chord Ring:  The Chord DHT orders identifiers on an identifier circle
      of size 2^m (m is the number of bits in peer identifiers).  This
      identifier circle is called the Chord ring.

   DHT:  Distributed Hash Tables (DHTs) are a class of decentralized
      distributed systems that provide a lookup service similar to a
      hash table.  Given a key, any participating peer can retrieve the
      value associated with that key.  The responsibility for
      maintaining the mapping from keys to values is distributed among
      the peers.

   Finger Table:  A data structure with up to m entries maintained by
      each peer in a Chord-based overlay.  The ith entry in the finger
      table of peer n contains the identity of the first peer that
      succeeds n by at least 2^(m-i) on the Chord ring.  This peer is
      called the ith finger of peer n.  As an example, the first entry
      in the finger table of peer n contains a peer half-way around the
      Chord ring from peer n.  The purpose of the finger table is to
      accelerate lookups.

   log2(N):  Logarithm of N with base 2.

   n.id:  Peer-ID of peer n.

   Neighborhood Set:  Consists of successor and predecessor lists.

   O(g(n)):  Informally, f(n) = O(g(n)) means that, for sufficiently
      large n, f(n) is at most a constant multiple of g(n).

   Omega(g(n)):  Informally, f(n) = Omega(g(n)) means that, for
      sufficiently large n, f(n) is at least a constant multiple of
      g(n).

   Predecessor List:  A data structure containing the first rf+1
      predecessors of a peer on the Chord ring. rf is the replication
      factor used in the overlay.

   Replica Set:  The set of peers storing redundant copies of a given
      resource.










   Replication Factor:  Number of replicas used.  A replication factor
      equal to 0 means that there is no replication.

   Responsible Peer:  The peer storing the original copy of a resource.
      In Chord, this is the first peer whose identifier is equal to or
      follows the identifier of the resource on the Chord ring.  Also
      the term "Root Node" can be used for this concept.

   Routing Table:  The set of peers that a node can use to route overlay
      messages.  The routing table consists of the finger table,
      successor list and predecessor list.

   Successor List:  A data structure containing the first r successors
      of a peer on the Chord ring.



3.  Alternative Approaches to Stabilization in DHTs

   DHT-based peer-to-peer overlay networks are self-organizing, scalable
   and reliable.  However, these features come at a cost: peers in the
   overlay network need to consume network bandwidth to maintain routing
   state.  DHTs use stabilization routines to counter the undesirable
   effects of churn on routing.  The purpose of stabilization is to keep
   the routing information of each peer in the overlay consistent with
   the constantly changing overlay topology.  There are two alternative
   approaches to stabilization: periodic and reactive [rhea2004].
   Periodic stabilization can either use a fixed stabilization rate or
   calculate the stabilization rate in an adaptive fashion.

3.1.  Reactive vs. Periodic Stabilization

   In reactive stabilization, a peer reacts to the loss of a peer in its
   neighborhood set or to the appearance of a new peer that should be
   added to its neighborhood set by sending a copy of its neighbor table
   to all peers in the neighborhood set.  Periodic recovery, in
   contrast, takes place independently of changes in the neighborhood
   set.  In periodic recovery, a peer periodically shares its
   neighborhood set with each of the members of that set.

   The mandatory-to-implement Chord DHT algorithm in RELOAD
   [I-D.ietf-p2psip-base] uses reactive stabilization for the
   neighborhood set, unlike the original Chord algorithm, which uses
   periodic stabilization.  It has been shown in [rhea2004] that
   reactive stabilization works well for small neighborhood sets (i.e.,
   small overlays) and moderate churn.  However, in large-scale (e.g.,
   1000 peers or more [rhea2004]) or high-churn overlays, reactive





   stabilization runs the risk of creating a positive feedback cycle,
   which can eventually result in congestion collapse.  In [rhea2004],
   it is shown that a 1000-peer overlay under churn uses significantly
   less bandwidth and has lower latencies when periodic stabilization is
   used than when reactive stabilization is used.  Although in the
   experiments carried out in [rhea2004], reactive recovery performed
   well when there was no churn, its bandwidth use was observed to jump
   dramatically under churn.  At higher churn rates and larger scale
   overlays, periodic stabilization uses less bandwidth and the
   resulting lower contention for the network leads to lower latencies.
   For this reason, most DHTs, such as CAN [CAN], Chord [Chord], Pastry
   [Pastry], and Bamboo [rhea2004], use periodic stabilization
   [ghinita2006].  As an example, the first version of Bamboo used
   reactive recovery, which caused Bamboo to suffer from degradation in
   performance under churn.  To fix this problem, Bamboo was modified to
   use periodic stabilization.

   In Chord, periodic stabilization is typically done both for
   successors and fingers.  An alternative strategy is analyzed in
   [krishnamurthy2008].  In this strategy, called the correction-on-
   change maintenance strategy, a peer periodically stabilizes its
   successors but does not do so for its fingers.  Instead, finger
   pointers are stabilized in a reactive fashion.  The results obtained
   in [krishnamurthy2008] imply that although the correction-on-change
   strategy works well when churn is low, periodic stabilization
   outperforms the correction-on-change strategy when churn is high.

3.2.  Configuring Periodic Stabilization

   When periodic stabilization is used, one faces the problem of
   selecting an appropriate execution rate for the stabilization
   procedure.  If the execution rate of periodic stabilization is high,
   changes in the system can be quickly detected, but at the cost of
   increased communication overhead.  On the other hand,
   if the stabilization rate is low and the churn rate is high, routing
   tables become inaccurate and DHT performance deteriorates.  Thus, the
   problem is setting the parameters so that the overlay achieves the
   desired reliability and performance even in challenging conditions,
   such as under heavy churn.  This naturally results in high cost
   during periods when the churn level is lower than expected, or
   alternatively, poor performance or even network partitioning in worse
   than expected conditions.

   In addition to selecting an appropriate stabilization interval,
   appropriate sizes need to be selected for the neighborhood set and
   for the routing table, regardless of which stabilization strategy is
   used.






   The current approach is to configure overlays statically.  This works
   in situations where perfect information about the future is
   available.  In situations where the operating conditions of the
   network are known in advance and remain static throughout the
   lifetime of the system, it is possible to choose fixed optimal values
   for parameters such as stabilization rate, neighborhood set size and
   routing table size.  However, if the operating conditions (e.g., the
   size of the overlay and its churn rate) do not remain static but
   evolve with time, it is not possible to achieve both a low lookup
   failure rate and a low communication overhead by using fixed
   parameters [ghinita2006].

   As an example, to configure the Chord DHT algorithm, one needs to
   select values for the following parameters: size of successor list,
   stabilization interval, and size of the finger table.  To select an
   appropriate value for the stabilization interval, one needs to know
   the expected churn rate and overlay size.  According to
   [liben-nowell2002], a Chord network in a ring-like state remains in a
   ring-like state as long as peers send Omega(log2(N)^2) messages
   before N new peers join or N/2 peers fail.  Thus, in a 500-peer
   overlay churning at a rate such that one peer joins and one peer
   leaves the network every 30 seconds, an appropriate stabilization
   interval would be on the order of 93s.  According to [Chord], the
   size of the successor list and finger table should be on the order of
   log(N).  Having a successor list of size O(log(N)) makes it unlikely
   that a peer will lose all of its successors, which would cause the
   Chord ring to become disconnected.  Thus, in a 500-peer network each
   peer should maintain on the order of nine successors and fingers.
   However, if the churn rate doubles and the network size remains
   unchanged, the stabilization rate should double as well.  That is,
   the appropriate maintenance interval would now be on the order of
   46s.  On the other hand, if the churn rate becomes e.g. six-fold and
   the size of the network grows to 2000 peers, on the order of eleven
   fingers and successors should be maintained and the stabilization
   interval should be on the order of 42s.  If one continued using the
   old values, this could result in inaccurate routing tables, network
   partitioning, and deteriorating performance.
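   The arithmetic in the example above can be sketched as follows.
   This is an illustrative Python sketch, not part of the protocol; the
   helper name and the churn model (one join and one leave per fixed
   period) are assumptions taken directly from the example.

```python
import math

def stab_interval(n_peers, join_leave_period_s):
    """Rough stabilization interval per [liben-nowell2002]: peers must
    send Omega(log2(N)^2) messages before N/2 peers fail.  With one
    join and one leave every join_leave_period_s seconds, N/2 peers
    fail in (N/2) * join_leave_period_s seconds."""
    msgs_needed = math.log2(n_peers) ** 2
    time_to_half_failures = (n_peers / 2) * join_leave_period_s
    return time_to_half_failures / msgs_needed

# The three cases from the text: on the order of 93 s, 46 s, and 42 s.
print(stab_interval(500, 30))
print(stab_interval(500, 15))
print(stab_interval(2000, 5))
```

   The sketch reproduces the intervals in the example to within
   rounding.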

3.3.  Adaptive Stabilization

   A self-tuning DHT takes into consideration the continuous evolution
   of network conditions and adapts to them.  In a self-tuning DHT, each
   peer collects statistical data about the network and dynamically
   adjusts its stabilization rate, neighborhood set size, and finger
   table size based on the analysis of the data [ghinita2006].
   Reference [mahajan2003] shows that by using a self-tuning mechanism,
   it is possible to achieve high reliability and performance even in
   adverse conditions with low maintenance cost.  Adaptive stabilization





   has been shown to outperform periodic stabilization in terms of both
   lookup failure and communication overhead [ghinita2006].


4.  Self-tuning, Chord-based Topology Plugin for RELOAD

   This section proposes a new topology plugin for RELOAD.  Topology
   plugins allow RELOAD to support a variety of overlay algorithms.  The
   proposed topology plugin uses a self-tuning Chord DHT algorithm.  It
   can be used as an alternative to the default DHT specified by RELOAD.

   Chord [Chord] is a structured P2P algorithm that uses consistent
   hashing to build a DHT out of several independent peers.  Consistent
   hashing assigns each peer and resource an m-bit identifier.  Peers
   MUST use SHA-1 as the base hash function to generate the identifiers.
   The length of the identifiers MUST be m=128 bits.  The identifiers
   are ordered on an identifier circle of size 2^m.  On the identifier
   circle, key k MUST be assigned to the first peer whose identifier
   equals or follows the identifier of k in the identifier space.  The
   identifier circle is called the Chord ring.

   Different DHTs differ significantly in performance when bandwidth is
   limited.  It has been shown that when compared to other DHTs, the
   advantages of Chord include that it uses bandwidth efficiently and
   can achieve low lookup latencies at little cost [li2004].

   A simple lookup mechanism could be implemented on a Chord ring by
   requiring each peer to only know how to contact its current successor
   on the identifier circle.  Queries for a given identifier could then
   be passed around the circle via the successor pointers until they
   encounter the first peer whose identifier is equal to or larger than
   the desired identifier.  Such a lookup scheme uses a number of
   messages that grows linearly with the number of peers.  To reduce the
   cost of lookups, Chord also maintains additional routing information;
   each peer n MUST maintain a data structure with up to m entries,
   called the finger table.  The first entry in the finger table of peer
   n contains the peer half-way around the ring from peer n.  The second
   entry contains the peer that is 1/4th of the way around, the third
   entry the peer that is 1/8th of the way around, etc.  In other words,
   the ith entry in the finger table at peer n SHOULD contain the
   identity of the first peer s that succeeds n by at least 2^(m-i) on
   the Chord ring.  This peer is called the ith finger of peer n.  The
   interval between two consecutive fingers is called a finger interval.
   The ith finger interval of peer n covers the range [n.id + 2^(m-i),
   n.id + 2^(m-i+1)) on the Chord ring.  In an N-peer network, each peer
   SHOULD maintain information about O(log2(N)) other peers in its
   finger table.  As an example, if N=1000, it is sufficient to maintain
   10 fingers.
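   The finger numbering above can be illustrated with a small Python
   sketch (illustrative only; the helper name is hypothetical and not
   part of the protocol):

```python
def finger_start(n_id, i, m=128):
    """Start of peer n's ith finger interval: the ith finger is the
    first peer that succeeds n by at least 2^(m-i) on the Chord ring."""
    return (n_id + 2 ** (m - i)) % 2 ** m

# Using m=6 for readability: peer 10's first finger interval starts
# half-way around the 64-identifier ring, the second a quarter of the
# way around, and so on.
print(finger_start(10, 1, m=6))  # 42 (10 + 32)
print(finger_start(10, 2, m=6))  # 26 (10 + 16)
print(finger_start(10, 3, m=6))  # 18 (10 + 8)
```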





   Chord needs all peers' successor pointers to be up to date in order
   to ensure that lookups produce correct results as the set of
   participating peers changes.  To achieve this, peers MUST run a
   stabilization protocol periodically in the background.  The
   stabilization protocol uses three operations: neighbor stabilization,
   finger stabilization, and predecessor stabilization.  Each Chord peer
   MUST maintain a stabilization timer.  When the stabilization timer
   fires, the peer MUST restart the timer and carry out the
   stabilization operations.  The stabilization operations are discussed
   in the subsections below.  Section 5 discusses how to determine the
   appropriate rate for stabilization operations.

   To increase robustness in the event of peer failures, each Chord peer
   MUST maintain a successor list of size r, containing the peer's first
   r successors.  The benefit of successor lists is that if each peer
   fails independently with probability p, the probability that all r
   successors fail simultaneously is only p^r.  Each peer SHOULD
   maintain at least r = max(rf+1, log2(N)) successors, where rf is the
   replication factor being used in the overlay.  The routing table of a
   peer consists of the successor list, the finger table, and a
   predecessor list.
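   The successor list sizing rule and the failure probability argument
   above can be sketched as follows (illustrative Python; rounding
   log2(N) up to an integer is an assumption, since the text only asks
   for a size on the order of log2(N)):

```python
import math

def successor_list_size(n_peers, rf):
    """r = max(rf + 1, log2(N)) successors, as recommended above."""
    return max(rf + 1, math.ceil(math.log2(n_peers)))

def all_successors_fail(p, r):
    """If peers fail independently with probability p, all r
    successors fail simultaneously with probability p^r."""
    return p ** r

r = successor_list_size(500, rf=2)      # max(3, ceil(log2(500))) = 9
print(r, all_successors_fail(0.5, r))   # 9 0.001953125
```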

   The original Chord algorithm maintains only a single predecessor
   pointer.  However, multiple predecessor pointers can be maintained to
   speed up recovery from predecessor failures.  Multiple predecessor
   pointers are useful also from the viewpoint of data replication,
   which will be discussed in Section 4.7.  Therefore, each peer MUST
   maintain at least rf+1 predecessor pointers.

4.1.  Update Messages

   The predecessor stabilization and successor stabilization procedures
   are implemented using Update requests and answers.  To describe the
   contents of these messages, the syntax defined in
   [I-D.ietf-p2psip-base] is used.  A Chord Update request is defined
   as:

















               enum { reserved (0),
                      notify(1), succ_stab(2), pred_stab(3), full(4),
                      (255) }
                    ChordUpdateType;

               struct {
                 ChordUpdateType       type;
                 NodeId                sender_id;

                 select(type) {
                   case notify:
                     uint32            uptime;
                   case pred_stab:     /* Empty */
                     ;
                   case succ_stab:     /* Empty */
                     ;
                   case full:
                     uint32            uptime;
                     NodeId            predecessors<0..2^16-1>;
                     NodeId            successors<0..2^16-1>;
                     NodeId            fingers<0..2^16-1>;
                 };
               } UpdateReq;


   The "type" field MUST indicate the reason why the Update was sent:

   notify
      the sender of the Update wishes to notify the recipient of the
      sender's existence.  Upon receiving the Update, the recipient
      SHOULD insert the sender into its routing table, if appropriate.

   succ_stab
      the Update request is related to the successor stabilization
      routine.

   pred_stab
      the Update request is related to the predecessor stabilization
      routine.

   full
      the Update request contains the entire routing table of the
      sender.


   The sender_id field contains the sender's Peer-ID.

   If the type of the Update request is 'pred_stab' or 'succ_stab', the





   request MUST NOT carry any additional information.

   If the type of the Update request is 'notify', the request MUST
   contain the sender's current uptime in seconds.

   If the type of the request is 'full', the contents of the message
   MUST be:

   uptime
      The sender's current uptime in seconds.

   predecessors
      The sender's predecessor list.

   successors
      The sender's successor list.

   fingers
      The sender's finger table.


   In a self-tuning DHT, each peer decides independently the appropriate
   size for the successor list, predecessor list and finger table.
   Thus, the 'predecessors', 'successors', and 'fingers' fields are of
   variable length.  As specified in RELOAD [I-D.ietf-p2psip-base],
   variable length fields are on the wire preceded by length bytes.  In
   the case of the successor list, predecessor list, and finger table,
   there are two length bytes (allowing lengths up to 2^16-1).  The
   number of NodeId structures included in each field can be calculated
   based on the length bytes since the size of a single NodeId structure
   is 16 bytes.  If a peer receives more entries than fit into its
   successor list, predecessor list or finger table, the peer SHOULD
   ignore the extra entries.  If a peer receives fewer entries than it
   currently has in its own data structure, the peer SHOULD NOT drop the
   extra entries from its data structure.
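   As an illustration of the parsing rule above, the following Python
   sketch decodes one variable-length NodeId field: two length bytes in
   network byte order followed by 16-byte NodeId values.  The function
   name is hypothetical and not part of the RELOAD wire format
   definition.

```python
def parse_node_id_list(buf, offset=0):
    """Parse a <0..2^16-1> NodeId list: two length bytes (network
    byte order) followed by length/16 NodeIds of 16 bytes each."""
    length = int.from_bytes(buf[offset:offset + 2], "big")
    offset += 2
    if length % 16 != 0 or offset + length > len(buf):
        raise ValueError("malformed NodeId list")
    ids = [buf[offset + k:offset + k + 16] for k in range(0, length, 16)]
    return ids, offset + length

# Two NodeIds (32 bytes of payload) preceded by the length 0x0020:
wire = (32).to_bytes(2, "big") + bytes(range(16)) + bytes(range(16, 32))
ids, end = parse_node_id_list(wire)
print(len(ids), end)  # 2 34
```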

   If the Update request succeeds, the responding peer sends an
   UpdateAns message, which is defined as:















             enum { reserved (0),
                    notify(1), succ_stab(2), pred_stab(3), full(4),
                    (255) }
                  ChordUpdateType;

             struct {
               ChordUpdateType         type;

               select(type) {
                 case full:            /* Empty */
                   ;
                 case notify:
                   uint32              uptime;
                 case pred_stab:
                   NodeId              predecessors<0..2^16-1>;
                 case succ_stab:
                   NodeId              predecessors<0..2^16-1>;
                   NodeId              successors<0..2^16-1>;
               };
             } UpdateAns;


   If the type of the Update answer is 'full', the answer MUST NOT carry
   any additional information.  If the type is 'notify', the answer MUST
   contain the sender's current uptime in seconds.  If the type is
   'pred_stab', the answer SHOULD carry the predecessor list of the
   responding peer.  If the type is 'succ_stab', the answer SHOULD
   include the predecessor and successor lists of the responding peer.

4.2.  Finger Stabilization

   The purpose of the finger stabilization procedure is to incorporate
   new peers into the finger table.  In the procedure, peer n MUST
   maintain a counter 'next', which stores the index of the next finger
   that should be stabilized.  The counter MUST be initialized to value
   one and it MUST be incremented by one after each finger stabilization
   procedure.  When the stabilization timer fires, peer n MUST choose
   one finger interval i from the set of finger_table_size finger
   intervals it maintains:

              i = ((next - 1) % finger_table_size) + 1,

   and send a Probe request addressed to the first identifier belonging
   to the chosen finger interval i.  The peer f responding to the Probe
   request SHOULD become the ith finger of n.  Peer n SHOULD send an
   Attach request to peer f to initiate a new connection to it.
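   The selection of the finger interval to stabilize can be sketched as
   follows (illustrative Python; the helper name is hypothetical, and
   the modulo arithmetic is written so that i stays in the range
   1..finger_table_size):

```python
def next_finger_target(n_id, next_counter, finger_table_size, m=128):
    """Pick finger interval i for this stabilization round and return
    the first identifier of that interval, i.e. the identifier the
    Probe request is addressed to."""
    i = ((next_counter - 1) % finger_table_size) + 1
    return i, (n_id + 2 ** (m - i)) % 2 ** m

# m=6 example: successive rounds walk through the intervals and wrap
# back to the first one.
for nxt in (1, 2, 3, 4):
    print(next_finger_target(0, nxt, finger_table_size=3, m=6))
```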

   This document defines a new ProbeInformationType value 'uptime'.





   When this value is present in the requested_info field of a Probe
   request, it indicates that the receiver MUST include in the Probe
   response its current uptime in a ProbeInformation structure.  A Probe
   request that is sent as part of the finger stabilization procedure
   MUST contain the 'uptime' ProbeInformationType in its requested_info
   field.  The extended ProbeInformation structure that is returned in
   the Probe response is defined as:

             enum { responsible_set(1), num_resources(2), uptime(3),
                    (255)}
                  ProbeInformationType;

             struct {
               ProbeInformationType    type;

               select (type) {
                 case responsible_set:
                   uint32              responsible_ppb;

                 case num_resources:
                   uint32              num_resources;

                 case uptime:
                   uint32              uptime;
               };
             } ProbeInformation;

   The types "responsible_set" and "num_resources" have been specified
   in RELOAD [I-D.ietf-p2psip-base].  The "uptime" type is new and
   contains the sender's current uptime in seconds.

4.3.  Successor Stabilization

   In the successor stabilization routine, a peer n asks its successor s
   for s's first predecessor p.  If p is not n (for instance, because p
   has joined the overlay between n and s), p should become n's first
   successor instead of s.  Thus, n adds p to the front of its successor
   list and notifies p of n's existence, so that p can change its
   predecessor to n.

   Successor lists are also stabilized as part of the successor
   stabilization routine.  To do this, peer n copies the
   successor list of its successor s, removing the last entry and
   prepending s to it.  If peer n notices that its successor has failed,
   it replaces the successor with the first live entry in its successor
   list and synchronizes its successor list with the new successor.
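   The successor list maintenance described above can be sketched as
   follows (illustrative Python; the helper names are hypothetical):

```python
def refresh_successor_list(s_id, s_successor_list, r):
    """Build n's new successor list from its successor s: copy s's
    list, drop the last entry, and prepend s itself, keeping at most
    r entries."""
    return ([s_id] + s_successor_list[:-1])[:r]

def replace_failed_successor(successor_list, alive):
    """On successor failure, keep only live entries; the first
    remaining entry becomes the new first successor."""
    return [p for p in successor_list if p in alive]

print(refresh_successor_list("s", ["a", "b", "c"], r=3))  # ['s', 'a', 'b']
```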






   The successor stabilization routine is executed when the
   stabilization timer fires.  To begin the successor stabilization
   routine, peer n MUST send an Update request to its first successor s.
   The type of the Update request MUST be 'succ_stab'.  Upon receiving
   the Update request, peer s MUST send an Update answer to peer n.  The
   Update answer SHOULD include the successor and predecessor lists of
   peer s.  If n learns from the predecessor and successor lists
   included in the answer that new peers should be included in its
   neighborhood set, n MUST send Attach requests to the new peers.  Once
   a direct connection has been established with each new peer as a
   result of the Attach procedure, peer n MUST send an Update request of
   type 'notify' to each new peer.  This allows the new peers to insert
   n into their neighborhood sets.

4.4.  Predecessor Stabilization

   The predecessor stabilization routine is executed when the
   stabilization timer fires.  To begin the predecessor stabilization
   routine, a peer n MUST send an Update request to its predecessor p.
   The type of the Update request MUST be 'pred_stab'.  Upon receiving
   the Update request, peer p MUST send an Update answer to peer n.  The
   Update answer SHOULD include the predecessor list of peer p.  Peer n
   SHOULD use the predecessor list carried in the answer to update its
   own predecessor list.  If new peers are inserted into the predecessor
   list, peer n MUST send Attach requests and Update requests of type
   'notify' to the new peers in the same way as during the successor
   stabilization routine.

4.5.  Joining an Overlay

   The process of joining an overlay is as follows:

   1.  The Joining Peer (JP) SHOULD connect to a bootstrap peer.
   2.  The JP SHOULD send an Attach request to the bootstrap peer, which
       SHOULD route the request towards the Admitting Peer (AP).  Once
       the Attach procedure is finished, there is a direct connection
       between the JP and the AP.
   3.  The JP SHOULD send a Join request to the AP.  The AP returns a
       Join answer.
   4.  The AP MUST send an Update request of type 'full' to the JP.  The
       Update request SHOULD contain the contents of AP's routing table.
       The JP SHOULD use the contents of the Update request to
       initialize its neighborhood set and finger table.  The JP SHOULD
       set the size of its successor list, predecessor list and finger
       table to the same values that the AP uses.
   5.  The AP SHOULD send a series of Store requests to transfer the
       resources that the JP will be responsible for.
   6.  The JP MUST send Attach requests to initiate connections to each
       of the peers in its predecessor list, successor list, and finger
       table.  Since the JP is already connected to the AP, there is no
       need to send a new Attach request to the AP (in Chord, the AP
       will always become the JP's first successor).
   7.  The JP MUST send an Update request of type 'notify' to each of
       the peers in its predecessor and successor lists (except for the
       AP that is already aware of the JP).
   8.  The JP MUST send a Probe request carrying the 'uptime'
       ProbeInformationType value in the requested_info field to each of
       its fingers.  This way the JP will learn the uptimes of its
       fingers (the uptimes of predecessors and successors are learned
       from Update responses in the previous step).  The uptimes are
       needed when estimating the join rate of peers in the overlay.  It
       should be noted that these Probe requests are not routed via the
       overlay but are sent on a direct connection.
   9.  If necessary, the JP MAY send Store requests to insert resource
       records (e.g., the local user's contact information) into the
       overlay.

4.5.1.  Contents of the Join Message

   This topology plugin does not require any additional data in the Join
   request but uses the minimal Join request specified in
   [I-D.ietf-p2psip-base].

4.6.  Leaving an Overlay

   The process of leaving the overlay is as follows:

   1.  If necessary, the leaving peer n MAY send Remove requests to
       remove the resources that it has stored in the overlay.
   2.  If no replication is being performed in the overlay (i.e., rf
       equals zero), peer n SHOULD issue a series of Store requests to
       its first successor to transfer the ownership of the resource
       records it is storing.  Note that if replication is being used,
       the successor of peer n is already storing replicas of all of the
       resources peer n is storing.
   3.  Peer n MUST send a Leave request to its first predecessor and
       first successor.  The Leave request that is sent to the first
       successor SHOULD contain the predecessor list of peer n.  The
       Leave request that is sent to the first predecessor SHOULD
       contain the successor list of peer n.  The first successor SHOULD
       use the predecessor list carried in the Leave request to update
       its own predecessor list.  The first predecessor SHOULD use the
       successor list carried in the Leave request to update its own
       successor list.

4.6.1.  Contents of the Leave Message

   This topology plugin extends the Leave request defined in RELOAD
   [I-D.ietf-p2psip-base].  The overlay_specific_data field of the Leave
   request MUST contain a ChordLeaveData structure:

                 enum { reserved (0),
                        from_succ(1), from_pred(2), (255) }
                      ChordLeaveType;

                 struct {
                   ChordLeaveType         type;

                   select(type) {
                     case from_succ:
                       NodeId              successors<0..2^16-1>;
                     case from_pred:
                       NodeId              predecessors<0..2^16-1>;
                   };
                 } ChordLeaveData;


   The 'type' field indicates whether the Leave request was sent by a
   predecessor or a successor of the recipient:

   from_succ
      The Leave request was sent by a successor.

   from_pred
      The Leave request was sent by a predecessor.


   If the type of the request is 'from_succ', the contents will be:

   successors
      The sender's successor list.


   If the type of the request is 'from_pred', the contents will be:

   predecessors
      The sender's predecessor list.


4.7.  Data Replication

   Peers MUST use the successor replication strategy to store redundant
   copies of resources in the overlay.  In successor replication
   [ktari2007], the responsible peer places a replica at rf peers
   immediately following the responsible peer on the Chord ring.  The
   replication factor rf is a configuration parameter.  In order for
   successor replication to work, both the size of the predecessor list
   and the size of the successor list MUST be at least rf+1.  The
   benefit of the successor replication strategy is that if the responsible
   peer fails, the data is immediately available at the responsible
   peer's successor.  This is because in Chord, the successor of the
   failed peer always becomes responsible for the resource identifiers
   the failed peer was responsible for.

   Successor replication is done as follows.  When the responsible peer
   of resource-ID k receives a Store request for resource-ID k, it
   SHOULD store the data and return a success response.  In addition,
   the responsible peer SHOULD send Store requests to its rf first
   successors.  The rf first successors are called the replica set.
   Peers in the replica set are not responsible for the resource-ID k
   and thus MUST NOT store further replicas in the overlay.  Whenever a
   new peer joins the replica set, the responsible peer SHOULD send a
   Store request to the new peer to store a replica of the resources it
   is responsible for.  When new peers join the overlay between the
   responsible peer and peers in the replica set (or if the predecessor
   of the responsible peer changes), each peer in the replica set MUST
   use its predecessor list to determine whether it is still a part of
   the replica set.  If not, the peer MUST remove the replicas.  If a
   peer in the replica set leaves the overlay, the responsible peer MUST
   send a series of Store requests to insert replica copies of the
   resources it is storing at the new successor that enters the replica
   set.  If the responsible peer leaves the overlay, it does not need to
   transfer its resources to its successor, because the successor is
   already storing replicas of those resources.  When the successor of
   the leaving peer notices that it has become the responsible peer for
   the resources the leaving peer was responsible for, it MUST add a new
   peer to the replica set and send a series of Store requests to store
   replicas of these resources at the new peer.
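
   The membership check above can be sketched as follows (illustrative
   Python; the function name is hypothetical, and the predecessor list
   is assumed to be ordered from the closest predecessor outward):

```python
def still_in_replica_set(responsible, predecessor_list, rf):
    # A replica-holding peer remains in the replica set of the
    # responsible peer only while that peer is among its rf closest
    # predecessors; otherwise it must remove the replicas it holds.
    return responsible in predecessor_list[:rf]
```

   For example, with rf = 2, a peer whose predecessor list begins
   [7, 6, 5] is no longer in peer 5's replica set, since two peers now
   sit between them on the ring.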

4.8.  Strong Stabilization

   A strong stabilization routine has been defined for Chord in
   [liben-nowell2002].  In the strong stabilization routine, a peer n
   searches for itself in the overlay.  Peer n can do this by sending a
   Probe request with its Peer-ID to its first successor.  If the
   network contains loops, the Probe request may be answered by a peer
   other than n.  If a loop is detected, no new peers are allowed
   to join the overlay until the overlay has recovered from the loop.
   The use of the strong stabilization routine is for further study.

5.  Self-tuning Chord Parameters

   This section specifies how to determine the appropriate stabilization
   rate and routing table size in an adaptive fashion.  The proposed
   mechanism is based on [mahajan2003], [liben-nowell2002], and
   [ghinita2006].  To calculate an appropriate stabilization rate, the
   values of three parameters MUST be estimated: overlay size N, failure
   rate U, and join rate L. To calculate an appropriate routing table
   size, the estimated network size N can be used.  Peers in the overlay
   MUST re-calculate the values of the parameters to self-tune the
   algorithm at the end of each stabilization period before re-starting
   the stabilization timer.

5.1.  Estimating Overlay Size

   Techniques for estimating the size of an overlay network have been
   proposed for instance in [mahajan2003], [horowitz2003],
   [kostoulas2005], [binzenhofer2006], and [ghinita2006].  In Chord, the
   density of peer identifiers in the successor set can be used to
   produce an estimate of the size of the overlay, N [mahajan2003].
   Since peer identifiers are picked randomly with uniform probability
   from the m-bit identifier space, the average distance between peer
   identifiers in the successor set is (2^m)/N.

   To estimate the overlay network size, a peer MUST compute the average
   inter-peer distance d between successive peers, starting from the
   most distant predecessor and ending at the most distant successor in
   the successor list.  The estimated network size MUST be calculated
   as:

                         2^m
                    N = -----
                          d

   This estimate has been found to be accurate within 15% of the real
   network size [ghinita2006].  Of course, the size of the neighborhood
   set affects the accuracy of the estimate.

   When a peer joins the network, the admitting peer sends the joining
   peer a copy of its neighborhood set.  Thus, a joining peer
   immediately has enough information at its disposal to calculate an
   estimate of the network size.
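
   The size estimate can be sketched as follows (illustrative Python;
   the function name is hypothetical, and the neighborhood identifiers
   are assumed to be given in ring order so that wrap-around is handled
   by the modular gap):

```python
def estimate_overlay_size(m, ordered_neighbor_ids):
    # N = 2^m / d, where d is the average distance between successive
    # peer identifiers (most distant predecessor through most distant
    # successor) on an identifier ring of size 2^m.
    ring = 2 ** m
    gaps = [(b - a) % ring
            for a, b in zip(ordered_neighbor_ids, ordered_neighbor_ids[1:])]
    d = sum(gaps) / len(gaps)
    return ring / d
```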

5.2.  Determining Routing Table Size

   The size of the finger table SHOULD be set to log(N) using the
   estimated network size N. The size of the successor list SHOULD be
   set to max(rf+1, log(N)), where rf is the replication factor being
   used in the overlay.  An implementation MAY place a lower limit on
   the size of the finger table and the successor list.  As an example,
   the implementation might require the size of the finger table to be
   always at least 8.
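
   A sketch of the sizing rule (illustrative Python; rounding log(N) to
   an integer and the floor of 8 on the finger table are implementation
   choices that the text only gives as examples):

```python
import math

def routing_table_sizes(n_estimate, rf, finger_floor=8):
    # Finger table size: log2(N); successor list size:
    # max(rf+1, log2(N)), where rf is the replication factor.
    # finger_floor applies the example lower limit of 8.
    log_n = max(1, round(math.log2(n_estimate)))
    fingers = max(finger_floor, log_n)
    successors = max(rf + 1, log_n)
    return fingers, successors
```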

5.3.  Estimating Failure Rate

   A typical approach is to assume that peers join the overlay according
   to a Poisson process with rate L and leave according to a Poisson
   process with rate parameter U [mahajan2003].  The value of U can be
   estimated using peer failures in the finger table and neighborhood
   set [mahajan2003].  If peers fail with rate U, a peer with M unique
   peer identifiers in its routing table should observe K failures in
   time K/(M*U).  Every peer in the overlay MUST maintain a history of
   the last K failures.  The current time MUST be inserted into the
   history when the peer joins the overlay.  The estimate of U MUST be
   calculated as:

                             k
                     U = --------,
                          M * Tk

   where M is the number of unique peer identifiers in the routing
   table, Tk is the time between the first and the last failure in the
   history, and k is the number of failures in the history.  If k is
   smaller than K, the estimate is computed as if there were a failure
   at the current time.  It has been shown that an estimate calculated
   in a
   similar manner is accurate within 17% of the real value of U
   [ghinita2006].

   The size of the failure history K affects the accuracy of the
   estimate of U. One can increase the accuracy by increasing K.
   However, this has the side effect of decreasing responsiveness to
   changes in the failure rate.  On the other hand, a small history size
   may cause a peer to overreact each time a new failure occurs.  In
   [ghinita2006], K is set to 25% of the routing table size.
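
   The failure-rate estimate can be sketched as follows (illustrative
   Python; the function name is hypothetical, times are in seconds, and
   counting the synthetic failure at the current time in k is one
   reading of the rule above):

```python
def estimate_failure_rate(failure_times, m_unique, k_max, now):
    # U = k / (M * Tk): k failures in the history, M unique peer IDs
    # in the routing table, Tk the time between the first and last
    # failure.  With fewer than k_max recorded failures, "now" is
    # treated as an additional failure.
    times = sorted(failure_times)
    if len(times) < k_max:
        times.append(now)
    tk = times[-1] - times[0]
    return len(times) / (m_unique * tk)
```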

5.4.  Estimating Join Rate

   Reference [ghinita2006] proposes that a peer can estimate the peer
   join rate based on the uptime of the peers in its routing table.  An
   increase in peer join rate will be reflected by a decrease in the
   average age of peers in the routing table.  Thus, each peer MUST
   maintain an array of the ages of the peers in its routing table
   sorted in increasing order.  Using this information, an estimate of
   the global peer join rate L MUST be calculated as:

                         N           1
                    L = --- * ---------------,
                         4     Ages[rsize/4]

   where Ages is an array containing the ages of the peers in the
   routing table sorted in increasing order and rsize is the size of the
   routing table.  Only the ages of the youngest 25% of the peers in
   the routing table SHOULD be used, to reduce the bias that a small
   number of peers with very old ages can cause [ghinita2006].  It has
   been shown that the estimate obtained by using this method is
   accurate within 22% of the real join rate [ghinita2006].  Of course,
   the size of the routing table affects the accuracy.
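
   The join-rate estimate can be sketched as follows (illustrative
   Python; the function name is hypothetical, ages are in seconds, and
   Ages[rsize/4] is read as the zero-based index rsize//4 of the
   increasingly sorted age array):

```python
def estimate_join_rate(n_estimate, ages):
    # L = (N / 4) * 1 / Ages[rsize/4]: only the youngest quarter of
    # the routing table contributes, limiting the bias from a few
    # very old peers.
    ages_sorted = sorted(ages)          # increasing order
    idx = len(ages_sorted) // 4
    return (n_estimate / 4.0) / ages_sorted[idx]
```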

   In order for this mechanism to work, peers need to exchange
   information about the time they have been present in the overlay.
   Peers learn the uptimes of their successors and predecessors when
   adding the successors and predecessors to their routing tables since
   Update requests and answers that are of type 'notify' carry uptime
   values.  Peers learn the uptimes of their fingers because the Probe
   responses sent as part of the finger stabilization routine carry
   uptime values.  A joining peer learns the admitting peer's uptime
   since an Update request of type 'full' contains uptime information.

5.5.  Calculating the Stabilization Interval

   According to [liben-nowell2002], a Chord network in a ring-like state
   remains in a ring-like state as long as peers send Omega(log2^2(N))
   messages before N new peers join or N/2 peers fail.  We can use the
   estimate of peer failure rate, U, to calculate the time Tf in which
   N/2 peers fail:

                                  1
                           Tf = ------
                                 2*U

   Based on this estimate, a stabilization interval Tstab-1 MUST be
   calculated as:

                                         Tf
                           Tstab-1 = -----------
                                      log2^2(N)

   On the other hand, the estimated join rate L can be used to calculate
   the time in which N new peers join the overlay.  Based on the
   estimate of L, a stabilization interval Tstab-2 MUST be calculated
   as:

                                             N
                            Tstab-2 = ---------------
                                       L * log2^2(N)

   Finally, the actual stabilization interval Tstab that SHOULD be used
   can be obtained by taking the minimum of Tstab-1 and Tstab-2.

   The results obtained in [maenpaa2009] indicate that making the
   stabilization interval too small has the effect of making the overlay
   less stable (e.g., in terms of detected loops and path failures).
   Thus, a lower limit should be used for the stabilization period.
   Based on the results in [maenpaa2009], a lower limit of 15s is
   proposed, since a smaller stabilization period would with high
   probability cause excessive traffic in the overlay.
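
   The interval computation can be sketched as follows (illustrative
   Python; the function name is hypothetical, and log2^2(N) is read as
   (log2 N)^2):

```python
import math

def stabilization_interval(n, u, l, floor_s=15.0):
    # Tstab = min(Tf / log2(N)^2, N / (L * log2(N)^2)), Tf = 1/(2U),
    # clamped to the proposed 15 s lower limit.
    log2n_sq = math.log2(n) ** 2
    tf = 1.0 / (2.0 * u)
    tstab1 = tf / log2n_sq
    tstab2 = n / (l * log2n_sq)
    return max(floor_s, min(tstab1, tstab2))
```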


6.  Security Considerations

   There are no new security considerations introduced in this document.
   The security considerations of RELOAD [I-D.ietf-p2psip-base] apply.


7.  IANA Considerations

   This document defines one new Probe Information Type value:

                              +-----------------+------+---------------+
                              | Probe Option    | Code | Specification |
                              +-----------------+------+---------------+
                              | uptime          |    3 |      RFC-AAAA |
                              +-----------------+------+---------------+


8.  References

8.1.  Normative References

   [I-D.ietf-p2psip-base]
              Jennings, C., Lowekamp, B., Rescorla, E., Baset, S., and
              H. Schulzrinne, "REsource LOcation And Discovery (RELOAD)
              Base Protocol", draft-ietf-p2psip-base-01 (work in
              progress), December 2008.

   [I-D.ietf-p2psip-concepts]
              Bryan, D., Matthews, P., Shim, E., Willis, D., and S.
              Dawkins, "Concepts and Terminology for Peer to Peer SIP",
              draft-ietf-p2psip-concepts-02 (work in progress),
              July 2008.

   [RFC2119]  Bradner, S., "Key words for use in RFCs to Indicate
              Requirement Levels", BCP 14, RFC 2119, March 1997.

8.2.  Informative References

   [CAN]      Ratnasamy, S., Francis, P., Handley, M., Karp, R., and S.
              Shenker, "A scalable content-addressable network", In
              Proc. of the 2001 Conference on Applications,
              Technologies, Architectures and Protocols for Computer
              Communications 2001, pp. 161-172.

   [Chord]    Stoica, I., Morris, R., Liben-Nowell, D., Karger, D.,
              Kaashoek, M., Dabek, F., and H. Balakrishnan, "Chord: A
              Scalable Peer-to-peer Lookup Service for Internet
              Applications", IEEE/ACM Transactions on Networking Volume
              11, Issue 1, 17-32, Feb 2003.

   [Pastry]   Rowstron, A. and P. Druschel, "Pastry: Scalable,
              Decentralized Object Location and Routing for Large-Scale
              Peer-to-Peer Systems", In Proc. of the IFIP/ACM
              International Conference on Distributed Systems
              Platforms Nov. 2001, pp. 329-350.

   [binzenhofer2006]
              Binzenhofer, A., Kunzmann, G., and R. Henjes, "A scalable
              algorithm to monitor chord-based P2P systems at runtime",
              20th International Parallel and Distributed Processing
              Symposium April 2006.

   [ghinita2006]
              Ghinita, G. and Y. Teo, "An adaptive stabilization
              framework for distributed hash tables", 20th International
              Parallel and Distributed Processing Symposium April 2006.

   [horowitz2003]
              Horowitz, K. and D. Malkhi, "Estimating network size from
              local information", Information Processing Letters Dec.
              2003, Volume 88, Issue 5, pp. 237-243.

   [kostoulas2005]
              Kostoulas, D., Psaltoulis, D., Gupta, I., Birman, K., and
              A. Demers, "Decentralized schemes for size estimation in
              large and dynamic groups", Fourth IEEE International
              Symposium on Network Computing and Applications July 2005,
              pp. 41-48.

   [krishnamurthy2008]
              Krishnamurthy, S., El-Ansary, S., Aurell, E., and S.
              Haridi, "Comparing maintenance strategies for overlays",
              In Proc. of 16th Euromicro Conference on Parallel,
              Distributed and Network-Based Processing Feb. 2008, pp.
              473-482.

   [ktari2007]
              Ktari, S., Zoubert, M., Hecker, A., and H. Labiod,
              "Performance evaluation of replication strategies in DHTs
              under churn", In Proc. of the 6th International Conference
              on Mobile and Ubiquitous Multimedia Dec. 2007, pp. 90-97.

   [li2004]   Li, J., Stribling, J., Gil, T., and M. Kaashoek,
              "Comparing the performance of distributed hash tables
              under churn", In Proc. of the 3rd International Workshop
              on Peer-to-Peer Systems 2004.

   [liben-nowell2002]
              Liben-Nowell, D., Balakrishnan, H., and D. Karger,
              "Observations on the dynamic evolution of peer-to-peer
              networks", In Proc. of the First International Workshop on
              Peer-to-Peer Systems March 2002.

   [maenpaa2009]
              Maenpaa, J. and G. Camarillo, "A study on maintenance
              operations in a Chord-based Peer-to-Peer Session
              Initiation Protocol overlay network", Accepted to Sixth
              International Workshop on Hot Topics in P2P Systems
              (HotP2P 2009) January 2009.

   [mahajan2003]
              Mahajan, R., Castro, M., and A. Rowstron, "Controlling the
              cost of reliability in peer-to-peer overlays", In
              Proceedings of the 2nd International Workshop on Peer-to-
              Peer Systems Feb. 2003.

   [rhea2004]
              Rhea, S., Geels, D., Roscoe, T., and J. Kubiatowicz,
              "Handling churn in a DHT", In Proc. of the USENIX Annual
              Technical Conference June 2004.

Authors' Addresses

   Jouni Maenpaa
   Ericsson
   Hirsalantie 11
   Jorvas  02420
   Finland

   Email: Jouni.Maenpaa@ericsson.com


   Gonzalo Camarillo
   Ericsson
   Hirsalantie 11
   Jorvas  02420
   Finland

   Email: Gonzalo.Camarillo@ericsson.com


   Jani Hautakorpi
   Ericsson
   Hirsalantie 11
   Jorvas  02420
   Finland

   Email: Jani.Hautakorpi@ericsson.com