Network Working Group                                         T. Asveren
Internet-Draft                                            Sonus Networks
Expires: June 13, 2008                                          U. Bodin
                                                                  Operax
                                                       December 11, 2007


                 Diameter State Recovery Considerations
                draft-asveren-dime-state-recovery-02.txt

Status of this Memo

   By submitting this Internet-Draft, each author represents that any
   applicable patent or other IPR claims of which he or she is aware
   have been or will be disclosed, and any of which he or she becomes
   aware will be disclosed, in accordance with Section 6 of BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF), its areas, and its working groups.  Note that
   other groups may also distribute working documents as Internet-
   Drafts.

   Internet-Drafts are draft documents valid for a maximum of six months
   and may be updated, replaced, or obsoleted by other documents at any
   time.  It is inappropriate to use Internet-Drafts as reference
   material or to cite them other than as "work in progress."

   The list of current Internet-Drafts can be accessed at
   http://www.ietf.org/ietf/1id-abstracts.txt.

   The list of Internet-Draft Shadow Directories can be accessed at
   http://www.ietf.org/shadow.html.

   This Internet-Draft will expire on June 13, 2008.

Copyright Notice

   Copyright (C) The IETF Trust (2007).

Abstract

   This document discusses parameters to consider, different approaches
   and design strategies to synchronize and/or recover state in Diameter
   applications after failure of an active instance.







Asveren & Bodin           Expires June 13, 2008                 [Page 1]


Internet-Draft   Diameter State Recovery Considerations    December 2007


Table of Contents

   1.  Introduction . . . . . . . . . . . . . . . . . . . . . . . . .  3
   2.  Terminology  . . . . . . . . . . . . . . . . . . . . . . . . .  3
   3.  Session State and the Need for Recovery  . . . . . . . . . . .  4
   4.  Proprietary Mechanisms . . . . . . . . . . . . . . . . . . . .  5
   5.  Protocol Assisted State Recovery . . . . . . . . . . . . . . .  6
     5.1.  Service Models . . . . . . . . . . . . . . . . . . . . . .  6
     5.2.  Parameters to Consider . . . . . . . . . . . . . . . . . .  8
       5.2.1.  Notification of the Peer About Failure . . . . . . . .  8
       5.2.2.  Transfer of Session Data . . . . . . . . . . . . . . .  8
       5.2.3.  Backup Server Selection  . . . . . . . . . . . . . . .  9
       5.2.4.  Timing of State Reconstruction . . . . . . . . . . . . 10
     5.3.  Approaches . . . . . . . . . . . . . . . . . . . . . . . . 10
       5.3.1.  Using a New Session  . . . . . . . . . . . . . . . . . 11
       5.3.2.  Backup Instance Triggered Recovery . . . . . . . . . . 11
   6.  IANA Considerations  . . . . . . . . . . . . . . . . . . . . . 12
   7.  Security Considerations  . . . . . . . . . . . . . . . . . . . 12
   8.  Acknowledgments  . . . . . . . . . . . . . . . . . . . . . . . 12
   9.  Normative References . . . . . . . . . . . . . . . . . . . . . 12
   Authors' Addresses . . . . . . . . . . . . . . . . . . . . . . . . 12
   Intellectual Property and Copyright Statements . . . . . . . . . . 14


1.  Introduction

   A variety of Diameter applications have been defined to perform
   different tasks.  For some of these applications, e.g. the Diameter
   Credit Control Application, it is desirable to synchronize and/or
   recover the state of ongoing sessions after failure of a Diameter
   endpoint.  Recovery could be achieved by a proprietary mechanism,
   could be assisted by protocol mechanisms, or could combine both.
   This document focuses on issues associated with protocol assisted
   state recovery.


2.  Terminology

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
   "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
   document are to be interpreted as described in [2].

   The following terms are used in this document:

   Ongoing Session

      A Diameter session, for which at least the first transaction has
      been completed but not the last transaction according to the
      application message flow.

   Terminated Session

      A Diameter session that existed in the past, for which the last
      transaction according to the application message flow has been
      completed.

   Initial message

      A Diameter message used to create a new Diameter session.

   Mid-session message

      A Diameter message used to refresh or modify an existing Diameter
      session.

   Service Instance

      An instance of a service provided by a Diameter application to
      another entity, e.g. a charging or authentication service.

   Diameter Transaction

      A Diameter request/answer pair.
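
   As an illustration of the session terms defined above, the following
   Python sketch classifies a session by the transactions completed so
   far.  The Session class, its field names and the NOT_STARTED value
   are assumptions made for this example and are not part of any
   Diameter specification.

```python
from dataclasses import dataclass
from enum import Enum


class SessionStatus(Enum):
    NOT_STARTED = "not started"
    ONGOING = "ongoing"
    TERMINATED = "terminated"


@dataclass
class Session:
    """Hypothetical per-session record; names are illustrative only."""
    session_id: str
    first_transaction_done: bool = False
    last_transaction_done: bool = False

    def status(self) -> SessionStatus:
        # Terminated: the last transaction of the message flow completed.
        if self.last_transaction_done:
            return SessionStatus.TERMINATED
        # Ongoing: at least the first transaction completed, not the last.
        if self.first_transaction_done:
            return SessionStatus.ONGOING
        return SessionStatus.NOT_STARTED
```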


3.  Session State and the Need for Recovery

   Some Diameter applications make use of sessions consisting of
   multiple transactions.  The context needed to process or trigger
   further messages in an ongoing session constitutes the session
   state.

   In multi-transaction sessions, one of the endpoints may fail during
   a session.  Depending on the application, it may not be possible or
   desirable to terminate the corresponding service instance.  In such
   a case, it is necessary either to use a backup node that can process
   messages for the ongoing session, or to use a new session without
   terminating the service instance.

    Diameter     Active      Backup
     Peer        Instance    Instance
     |             |            |
     |----REQ1---->|            |
     | (session1)  |            |
     |             |            |
     |<---ANS1-----|            |
     | (session1)  |            |
     |             |            |
     |           Active         |
     |           Instance       |
     |           Fails          |
     |             |            |
     |----REQ2----------------->|
     |  (session1) |            |
     |             |            |
     |<---ANS2------------------|
     |  (session1) |            |
     |             |            |


               Figure 1: Session Failover to Backup Instance
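
   The message flow of Figure 1 could be sketched from the Diameter
   peer's point of view as follows.  This is a minimal Python sketch
   under simplifying assumptions: the Instance class, the "alive" flag
   used to simulate a failure, and the use of ConnectionError to signal
   an unreachable instance are all hypothetical.

```python
class Instance:
    """Hypothetical Diameter endpoint; 'alive' simulates failure."""

    def __init__(self, name):
        self.name = name
        self.alive = True

    def handle(self, session_id, request):
        if not self.alive:
            raise ConnectionError(f"{self.name} is unreachable")
        return f"ANS for {request} ({session_id}) from {self.name}"


def send_with_failover(session_id, request, active, backup):
    """Try the active instance first; fall back to the backup."""
    try:
        return active.handle(session_id, request)
    except ConnectionError:
        # The same session identifier is kept: the session fails over,
        # it is not terminated.  The backup still needs session state.
        return backup.handle(session_id, request)
```

   The sketch only covers message routing.  How the backup instance
   obtains the state needed to answer REQ2 is the subject of the
   following sections.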

   Another important aspect of failing instances is the possibility of
   hanging resources on the peer Diameter entity.  This could happen if
   the peer Diameter entity does not clean up session state unless the
   session is terminated according to the expected application message
   flow.  While state recovery is a desirable feature for certain
   applications, hanging resources are unacceptable for all
   applications.  Hence, although some of the mechanisms described in
   this document could be used to prevent hanging resources, it is
   recommended that application layer mechanisms, e.g. application
   layer timers, are used for this purpose.  Nonetheless, certain
   strategies mentioned in this document could be used to expedite
   session state cleanup after failovers.
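
   One possible shape of such an application layer timer is sketched
   below in Python.  The SessionTable class and the idea of passing the
   current time explicitly are assumptions of this example; a real
   implementation would follow the timer rules of the application in
   question.

```python
class SessionTable:
    """Hypothetical session table with an inactivity timeout."""

    def __init__(self, timeout_seconds):
        self.timeout = timeout_seconds
        self.last_seen = {}  # session_id -> time of last message

    def touch(self, session_id, now):
        # Called for every message received for the session.
        self.last_seen[session_id] = now

    def expire(self, now):
        """Remove and return sessions idle for longer than the timeout."""
        stale = [sid for sid, seen in self.last_seen.items()
                 if now - seen > self.timeout]
        for sid in stale:
            del self.last_seen[sid]  # release the hanging resources
        return stale
```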


4.  Proprietary Mechanisms

   Proprietary mechanisms do not assume any specific behavior from their
   peers.  They usually rely on some form of state replication between
   active and backup instances.

    +---------+               +----------+
    | Diameter|<------------->| Active   |
    | Peer    |   Session     | Instance |
    +---------+   Messaging   +----------+
                                   ^
                                   | Session
                                   | State
                                   | Replication
                                   V
                              +----------+
                              | Backup   |
                              | Instance |
                              +----------+


          Figure 2: Data Replication with a Proprietary Mechanism

   It should be noted that Figure 2 is just an abstract representation
   of proprietary data replication between active and backup instances.
   Actual implementations may vary depending on the mechanisms used.
   Proprietary state synchronization is a common technique used by
   Public Switched Telephone Network equipment vendors to provide
   "five nines" (99.999%) reliability.  There are also initiatives to
   define a standard set of APIs for platforms/middleware providing
   data synchronization services, e.g. the Application Interface
   Specification of the Service Availability Forum.

   Proprietary data replication between active and backup instances may
   be asynchronous in nature, which means that it may not provide
   lossless state replication at all times.  Hence, after a failover to
   a backup instance, some session states may have been lost and other
   states may be wrongly kept by the backup instance.  That is, sessions
   may have been terminated through session signalling to the initially
   active instance before the removal of the corresponding session
   states was reflected in the data replication process.
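
   The effect can be illustrated with a small Python model.  The
   Replicator class below is hypothetical and stands in for whatever
   proprietary mechanism is used; the only property it models is that
   updates reach the backup some time after they are made on the
   active instance.

```python
from collections import deque


class Replicator:
    """Hypothetical asynchronous replication between two instances."""

    def __init__(self):
        self.active = {}        # session state on the active instance
        self.backup = {}        # replicated copy on the backup instance
        self.pending = deque()  # updates not yet applied to the backup

    def create(self, session_id, state):
        self.active[session_id] = state
        self.pending.append(("put", session_id, state))

    def terminate(self, session_id):
        del self.active[session_id]
        self.pending.append(("del", session_id, None))

    def flush(self):
        # Runs some time after the updates: this is the asynchrony.
        while self.pending:
            op, sid, state = self.pending.popleft()
            if op == "put":
                self.backup[sid] = state
            else:
                self.backup.pop(sid, None)
```

   If the active instance fails while updates are still pending, the
   sessions created since the last flush are lost on the backup, and
   the sessions terminated since the last flush are wrongly kept.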


5.  Protocol Assisted State Recovery

   Protocol assisted state recovery relies on contents of the messages
   exchanged between Diameter entities.

5.1.  Service Models

   For each Diameter session, Diameter messaging happens between a
   client and a server.  Although it is not a sender or receiver of
   Diameter messages, the physical resource/service provided is also a
   parameter when designing state recovery mechanisms.  The physical
   resource/service is application dependent and could be, e.g.,
   bandwidth allocated on a router for a QoS application or voice
   transfer resources used for a prepaid voice call application.

   Depending on the Diameter application, the physical resource/service
   could be at the client or the server side.  For example, for the
   Diameter Credit Control Application the physical resource is
   controlled by the client, whereas for a QoS application with a push
   scenario it is controlled by the server.

   If a proprietary data replication mechanism that is not lossless is
   used between active and backup instances to support failover, it may
   be desirable to make use of the data present in the physical
   resource/service.  This case can benefit from a synchronization
   phase before session data is transferred for the purpose of
   rebuilding lost state.

   The physical resource/service could be used to extract some
   information about the session state to be reconstructed.  For
   certain scenarios this information could be enough for state
   reconstruction; in others it could be used in addition to
   information obtained via other means.  For example, with a
   proprietary data replication mechanism, failovers could be followed
   by a synchronization phase based on information obtained from the
   physical resource/service.
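
   A synchronization phase of this kind might look like the following
   Python sketch.  It assumes, for illustration only, that the backup
   instance can obtain a mapping of session identifiers to data from
   the physical resource/service; the function name and the
   "rebuilt_from_resource" marker are hypothetical.

```python
def synchronize(replicated, resource_sessions):
    """Reconcile replicated state with what the resource reports.

    replicated: dict of session_id -> state held by the backup instance.
    resource_sessions: dict of session_id -> data read from the
    physical resource/service (a hypothetical query result).
    """
    # Drop state for sessions the resource no longer knows about.
    synced = {sid: st for sid, st in replicated.items()
              if sid in resource_sessions}
    # Rebuild state that replication missed from the resource data.
    for sid, data in resource_sessions.items():
        synced.setdefault(sid, {"rebuilt_from_resource": data})
    return synced
```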

   The following conceptual diagram shows DCCA client side state
   recovery utilizing the state kept by the service control logic.

               +-----+
               |     +-------+
               |     | (2)   |
     ---(1)--->|     |Service|
       Service |     | Data-1|
       Start   |     +-------+        +---------+
       Request |     |                |         |
               |     |-----(3)------->|         |
               |     |Credit Control  |  DCCA   |
               |     | Request for    |  Client |---(4)----->
               |     | Service Data-1 |  Logic  |  CCR(Initial)
               |     |                | (Active)|
               |     |                |         |<---(5)------
               |     |<-----(6)-------|         |  CCA(Initial)
               |     | Grant Service  +---------+
               |     |
               |  S  |                  (7)
               |  e  |                 DCCA Client
               |  r  |                 Logic (Active)
               |  v  |                 fails
               |  i  |
               |  c  |                  (8)
               |  e  |                DCCA Client
               |     |                Logic (Standby)
               |  C  |                detects failure
               |  o  |
               |  n  |                +---------+
               |  t  |<-----(9)-------|         |
               |  r  |   Request for  |         |
               |  o  | State Retrieval|  DCCA   |
               |  l  |                |  Client |
               |     |-------(10)---->|  Logic  |
               |     | Credit Control |(Standby)|---(11)---->
               |     | Request for    |         |  CCR(Initial)
               |     | Service Data-1 |         |
               |     |                |         |<---(12)-----
               |     |                |         |  CCA(Initial)
               |     |                |         |
               |     |                |         |---(13)---->
               |     |                |         |  CCR(Update)
               |     |                |         |
               |     |                |         |<---(14)-----
     ---(15)-->|     |                |         |  CCA(Update)
       Service |     |                |         |
       End     |     |                |         |---(16)---->
       Request |     |                |         |  CCR(Terminate)
               |     |                |         |
               |     |                |         |<---(17)-----
               +-----+                +---------+  CCA(Terminate)


      Figure 3: Using Service Information for DCCA Client Side State
                                 Recovery

5.2.  Parameters to Consider

   There are several aspects which may be important for a protocol
   assisted session state recovery mechanism.  Depending on the
   strategy utilized, they may or may not be design choices for such a
   mechanism.

5.2.1.  Notification of the Peer About Failure

   Usually it is necessary for the remote peer to be informed about the
   failure of the active instance in the context of protocol assisted
   state recovery.  This could be achieved in different ways:

   Application Layer Timers

      Application layer timers could be utilized to send new requests
      periodically.  The lack of an expected request, the lack of an
      answer to a sent request, or the receipt of a
      DIAMETER_UNABLE_TO_DELIVER error answer could indicate that the
      peer Diameter entity has failed.

   Notification from Standby Instance

      After failure of the active instance, the standby instance can
      send a message to the remote Diameter peer to inform it about the
      failure.  This method requires the standby instance to know the
      identities of the remote Diameter peers with which the failed
      active instance had ongoing sessions.  This information could be
      exchanged by a proprietary data replication mechanism.
      Alternatively, the standby instance could have a configured list
      of remote peers and notify all of them.
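
   The timer based alternative could be sketched in Python as shown
   below.  The PeerMonitor class and its method names are hypothetical.
   The Result-Code values are those assigned in the Diameter Base
   Protocol [1]; the timeout value is left to the application.

```python
DIAMETER_SUCCESS = 2001
DIAMETER_UNABLE_TO_DELIVER = 3002


class PeerMonitor:
    """Hypothetical failure detector driven by an application timer."""

    def __init__(self, answer_timeout):
        self.answer_timeout = answer_timeout
        self.pending = {}  # request id -> time the request was sent

    def request_sent(self, request_id, now):
        self.pending[request_id] = now

    def answer_received(self, request_id, result_code):
        """Record an answer; return False if it suggests peer failure."""
        self.pending.pop(request_id, None)
        return result_code != DIAMETER_UNABLE_TO_DELIVER

    def peer_suspected(self, now):
        # A request without an answer for too long suggests failure.
        return any(now - sent > self.answer_timeout
                   for sent in self.pending.values())
```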

5.2.2.  Transfer of Session Data

   For protocol assisted recovery it is necessary to supply enough
   information to the backup instance so that session state can be
   reconstructed.  What constitutes session state data needs to be
   defined on a per application basis.  Also, in certain cases (e.g.
   when a separate mechanism for state replication is used in
   combination with protocol assisted state recovery) the transfer of
   session data may be preceded by a state synchronization phase.  For
   example, a generic message providing a list of all active sessions
   could be used for such a synchronization phase.

   Some approaches to transfer session data include:

   Using a New Session

      Upon detection of the failure of the active instance, the remote
      Diameter peer may start a new session without terminating the
      service instance.

   Using Application Messages

      Data necessary to reconstruct the session state may be transferred
      in an application defined message by AVP(s) specifically defined
      for that purpose.  Alternatively, an AVP may be used to flag that
      all data carried in the message is sent for the purposes of state
      synchronization.

   Using a Generic Message

      Data necessary to reconstruct session state may be transferred in
      a message specifically defined for that purpose.  Such a message
      may carry state information for one or multiple sessions.
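
   The difference between the last two approaches can be illustrated
   with plain Python dictionaries standing in for Diameter messages.
   The AVP names "State-Sync" and "Session-State" are invented for this
   sketch and are not defined by any Diameter application; the
   application data passed in is likewise a placeholder.

```python
def as_application_message(session_id, avps, state_sync=True):
    """Application message with a hypothetical State-Sync flag AVP."""
    message = {"Session-Id": session_id, **avps}
    if state_sync:
        # Flags that all data in the message serves state recovery.
        message["State-Sync"] = True
    return message


def as_generic_message(sessions):
    """Hypothetical generic message carrying one or more sessions."""
    return {"Session-State": [
        {"Session-Id": sid, **avps} for sid, avps in sessions.items()
    ]}
```

   The application message carries state for exactly one session and
   reuses the application's own AVPs, whereas the generic message can
   batch several sessions but needs a new message definition.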

5.2.3.  Backup Server Selection

   A Diameter peer needs to know the identity of the backup instance so
   that it can send the data necessary to reconstruct session state.
   Furthermore, load balancing of the ongoing sessions across different
   backup instances may be necessary to prevent overloading of backup
   entities.

   Active Instance Guided Selection

      The active instance could communicate the identity of the backup
      instance(s) to the peer Diameter entity with an AVP.  Information
      about how the load should be distributed among multiple backup
      instances could be communicated as well.

   Backup Instance Guided Selection

      If the peer Diameter entity is notified about the failure of the
      active instance via a message sent by the standby instance, the
      identity of the backup instance would be known to the peer
      Diameter entity.  This message could also carry information about
      other backup instances and load sharing.

   Selection Based on Configuration

      The Diameter peer may know the identities of backup servers
      through configuration and try to load share ongoing sessions
      based on a locally defined algorithm.  For requests that are
      rejected by a standby instance with a DIAMETER_TOO_BUSY error
      answer, another standby instance could be tried.
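
   Selection based on configuration could be sketched in Python as
   follows.  The hash based rotation is just one example of a locally
   defined algorithm, and the send callback is a hypothetical transport
   hook returning a Result-Code.  The Result-Code values are those
   assigned in the Diameter Base Protocol [1].

```python
import zlib

DIAMETER_SUCCESS = 2001
DIAMETER_TOO_BUSY = 3004


def candidate_order(session_id, backups):
    """Rotate the configured list so sessions spread across backups."""
    # crc32 gives a stable hash; any locally defined algorithm works.
    start = zlib.crc32(session_id.encode()) % len(backups)
    return backups[start:] + backups[:start]


def send_to_backup(session_id, backups, send):
    """Try each backup in turn, skipping those that answer 'busy'.

    send(backup, session_id) is a hypothetical transport callback
    returning a Result-Code value.
    """
    for backup in candidate_order(session_id, backups):
        if send(backup, session_id) != DIAMETER_TOO_BUSY:
            return backup
    return None  # every configured backup was busy
```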

5.2.4.  Timing of State Reconstruction

   When state reconstruction should happen may vary depending on the
   application.  The following two models are foreseen:

   State Reconstruction After Failure

      It may be necessary to reconstruct the state as soon as the backup
      instance detects failure of the active instance.  This model is
      useful when the state of ongoing sessions is needed to generate
      answers for requests belonging to new sessions.  Care should be
      taken when determining the necessary information for such cases.
      What is needed may be some cumulative data based on session states
      rather than per session information, and this could impact the
      design choices to recover/replicate the data or even the choice
      between a proprietary mechanism and protocol assisted recovery.

      Another use case is when autonomous requests need to be generated
      from the side where the active instance has failed.  In such a
      situation, the backup instance needs to know the ongoing sessions
      immediately after it detects failure of the active instance so
      that it can generate such requests.

      If state reconstruction after failure is needed, notification of
      the Diameter peer about the failure should be done by the backup
      instance.

   State Reconstruction Upon Receipt of a Request

      For certain applications, it could be enough if a backup server
      can reply to requests for ongoing sessions after the failure of
      the active instance.  In such scenarios, state information
      contained in the new requests for ongoing sessions (i.e. mid-
      session messages) could be used to reconstruct session state on
      the standby instance.
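
   The second model, reconstruction upon receipt of a request, could be
   sketched in Python as follows.  The StandbyInstance class is
   hypothetical, and the sketch assumes that every mid-session message
   carries enough data to rebuild the session state, which has to be
   verified per application.

```python
class StandbyInstance:
    """Hypothetical standby rebuilding state on demand."""

    def __init__(self):
        self.sessions = {}

    def on_mid_session_request(self, session_id, carried_state):
        if session_id not in self.sessions:
            # Unknown session: assume it belonged to the failed active
            # instance and rebuild it from the data in the request.
            self.sessions[session_id] = dict(carried_state)
            return "reconstructed"
        self.sessions[session_id].update(carried_state)
        return "updated"
```

   With this model the standby instance holds no state for a session
   until the first mid-session message arrives, so it cannot generate
   autonomous requests for that session in the meantime.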

5.3.  Approaches

   The choice between a proprietary and a protocol assisted state
   recovery mechanism is not straightforward.  Depending on the
   application and the reliability level required, a detailed analysis
   is needed to justify the use of one of the methods.

   If protocol assisted recovery is to be used, the parameters
   discussed in Section 5.2 need to be considered.  It should be noted
   that choices made for different parameters are not always independent
   of each other, e.g. if state reconstruction immediately after failure
   detection is necessary, the strategy of using a new session to
   transfer session data can't be utilized.  Below, two different
   approaches are discussed in detail.

5.3.1.  Using a New Session

   As mentioned in Section 5.2.2, a new session can be used to rebuild
   state after failure.  This approach can be sufficient if immediate
   state reconstruction after failure is not needed, that is, if
   knowledge of the history of the session is not needed to continue
   providing the service of the failed over Diameter node.  An example
   is given in Figure 3, which focuses on events happening on the
   client side for a DCCA session.  On the server side, the sessions
   which were created by the active instance are cleaned up after
   expiry of the Tcc timer.
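
   Steps (9) to (12) of Figure 3 could be sketched in Python as shown
   below.  The function names, the Session-Id format and the send_ccr
   callback are assumptions made for this illustration; they do not
   describe an existing DCCA client API.

```python
import itertools

_counter = itertools.count(1)


def new_session_id(host):
    # Hypothetical Session-Id format, for illustration only.
    return f"{host};{next(_counter)}"


def recover_with_new_sessions(service_data, standby_host, send_ccr):
    """Open a new credit-control session per active service instance.

    service_data: service_id -> data kept by the service control
    logic (steps 9 and 10 of Figure 3).
    send_ccr(session_id, request_type, data): hypothetical callback
    that sends a CCR and returns once the CCA arrives.
    """
    mapping = {}
    for service_id, data in service_data.items():
        session_id = new_session_id(standby_host)
        send_ccr(session_id, "INITIAL", data)  # steps 11 and 12
        mapping[service_id] = session_id
    return mapping
```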

   A variant of using a new session for rebuilding state is to use
   application messages.  For example, regular mid-session messages
   maintaining soft state can be used if they contain enough
   information for the desired state reconstruction.  Such messages
   could contain an AVP carrying a flag indicating that the message is
   a mid-session message and not an initial message issued to create a
   completely new session.  The ability to distinguish between
   recreated sessions and new sessions can be important to some
   applications.  For example, it may be desirable to give recreated
   sessions preference over new sessions for resources controlled by a
   Diameter server.
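
   Giving recreated sessions preference could be as simple as the
   following Python sketch.  The admit function and the representation
   of a request as a (session_id, recreated) pair are hypothetical;
   "recreated" stands for the flag carried in the AVP mentioned above.

```python
def admit(requests, capacity):
    """Admit requests up to capacity, recreated sessions first.

    Each request is (session_id, recreated), where 'recreated' stands
    for the hypothetical flag AVP described above.
    """
    # sorted() is stable, so arrival order is kept within each group;
    # False sorts before True, hence the negation.
    ordered = sorted(requests, key=lambda r: not r[1])
    return [sid for sid, _ in ordered[:capacity]]
```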

5.3.2.  Backup Instance Triggered Recovery

   In case immediate state reconstruction is desired or strictly needed
   by a backup Diameter instance, this instance may need to trigger
   transfer of session data to recover state.  This requires session
   data to be available and reachable to the backup Diameter instance.
   Possible locations of such data include the physical resource/service
   controlled by the failed over Diameter instance and the entities
   utilizing the service offered by the Diameter instance (i.e. entities
   issuing Diameter requests for the offered service).

   As mentioned in Section 5.2.2, application messages or a generic
   message can be used to transfer session data for state
   reconstruction.  Application messages or a generic message
   transferring the desired session data could be preceded by a generic
   synchronization message providing the backup Diameter instance with a
   complete list of all active sessions.  With this list, the backup
   Diameter instance can distribute the recovery of session data over
   time.  This may be useful if the instance is to start providing its
   service immediately instead of waiting until the state reconstruction
   process is completed.  Requesting session data in parallel with
   answering service requests requires, however, that a period with
   incomplete session state is acceptable after the backup Diameter
   instance starts providing the service.

   A generic synchronization message can also be useful in a combined
   solution using both a proprietary mechanism for state replication and
   protocol assisted state recovery.  The complete list of all active
   sessions provided in such a message can be compared with the list of
   sessions replicated through the proprietary mechanism.  Thereby a
   potential mismatch can be identified and missing session data can be
   explicitly requested by the backup Diameter instance.
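
   The comparison amounts to two set differences, as in the following
   Python sketch.  The function name is hypothetical, and sessions are
   identified here by plain strings.

```python
def compare_session_lists(peer_sessions, replicated_sessions):
    """Return (missing, stale) session identifiers.

    missing: active at the peer but absent from the replicated data;
             their session data has to be requested explicitly.
    stale:   replicated but no longer active at the peer; their state
             can be removed.
    """
    peer = set(peer_sessions)
    replicated = set(replicated_sessions)
    return sorted(peer - replicated), sorted(replicated - peer)
```

   The "stale" result also addresses the wrongly kept states described
   in Section 4.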


6.  IANA Considerations

   This document does not require any IANA action.


7.  Security Considerations

   Certain procedures in protocol assisted state recovery, e.g.
   notification of the Diameter peer about failure of an active instance
   by the standby instance, could introduce security risks.  It is
   expected that the use of IPsec/TLS together with a transitive trust
   model should eliminate these concerns.


8.  Acknowledgments


9.  Normative References

   [1]  Calhoun, P., Loughney, J., Guttman, E., Zorn, G., and J. Arkko,
        "Diameter Base Protocol", RFC 3588, September 2003.

   [2]  Bradner, S., "Key words for use in RFCs to Indicate Requirement
        Levels", BCP 14, RFC 2119, March 1997.


Authors' Addresses

   Tolga Asveren
   Sonus Networks
   4400 Route 9 South
   Freehold, NJ, 07728
   USA

   Email: tasveren@sonusnet.com


   Ulf Bodin
   Operax
   Aurorum Science Park 8
   SE-977 75 Lulea
   Sweden

   Email: uffe@operax.com


Full Copyright Statement

   Copyright (C) The IETF Trust (2007).

   This document is subject to the rights, licenses and restrictions
   contained in BCP 78, and except as set forth therein, the authors
   retain all their rights.

   This document and the information contained herein are provided on an
   "AS IS" basis and THE CONTRIBUTOR, THE ORGANIZATION HE/SHE REPRESENTS
   OR IS SPONSORED BY (IF ANY), THE INTERNET SOCIETY, THE IETF TRUST AND
   THE INTERNET ENGINEERING TASK FORCE DISCLAIM ALL WARRANTIES, EXPRESS
   OR IMPLIED, INCLUDING BUT NOT LIMITED TO ANY WARRANTY THAT THE USE OF
   THE INFORMATION HEREIN WILL NOT INFRINGE ANY RIGHTS OR ANY IMPLIED
   WARRANTIES OF MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE.


Intellectual Property

   The IETF takes no position regarding the validity or scope of any
   Intellectual Property Rights or other rights that might be claimed to
   pertain to the implementation or use of the technology described in
   this document or the extent to which any license under such rights
   might or might not be available; nor does it represent that it has
   made any independent effort to identify any such rights.  Information
   on the procedures with respect to rights in RFC documents can be
   found in BCP 78 and BCP 79.

   Copies of IPR disclosures made to the IETF Secretariat and any
   assurances of licenses to be made available, or the result of an
   attempt made to obtain a general license or permission for the use of
   such proprietary rights by implementers or users of this
   specification can be obtained from the IETF on-line IPR repository at
   http://www.ietf.org/ipr.

   The IETF invites any interested party to bring to its attention any
   copyrights, patents or patent applications, or other proprietary
   rights that may cover technology that may be required to implement
   this standard.  Please address the information to the IETF at
   ietf-ipr@ietf.org.


Acknowledgment

   Funding for the RFC Editor function is provided by the IETF
   Administrative Support Activity (IASA).