Network Working Group                                             Y. Nir
Internet-Draft                                               Check Point
Intended status: Standards Track                           April 2, 2008
Expires: October 4, 2008

                 A Quick Crash Detection Method for IKE

Status of this Memo

   By submitting this Internet-Draft, each author represents that any
   applicable patent or other IPR claims of which he or she is aware
   have been or will be disclosed, and any of which he or she becomes
   aware will be disclosed, in accordance with Section 6 of BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF), its areas, and its working groups.  Note that
   other groups may also distribute working documents as Internet-
   Drafts.

   Internet-Drafts are draft documents valid for a maximum of six months
   and may be updated, replaced, or obsoleted by other documents at any
   time.  It is inappropriate to use Internet-Drafts as reference
   material or to cite them other than as "work in progress."

   The list of current Internet-Drafts can be accessed at

   The list of Internet-Draft Shadow Directories can be accessed at

   This Internet-Draft will expire on October 4, 2008.

Copyright Notice

   Copyright (C) The IETF Trust (2008).


Abstract

   This document describes an extension to the IKEv2 protocol that
   allows for faster crash recovery using a saved token.

   When an IPsec tunnel between two IKEv2 implementations is
   disconnected due to a restart of one peer, it can take as much as
   several minutes for the other peer to discover that the reboot has
   occurred, thus delaying recovery.  In this text we propose an
   extension to the protocol that allows for recovery within a few
   seconds of the reboot.

Nir                      Expires October 4, 2008                [Page 1]

Internet-Draft            Quick Crash Detection               April 2008

Table of Contents

   1.  Introduction
     1.1.  Conventions Used in This Document
   2.  RFC 4306 Crash Recovery
   3.  Protocol Outline
   4.  Formats and Exchanges
     4.1.  Notification Format
     4.2.  Authentication Exchange
     4.3.  Informational Exchange
   5.  Token Generation and Verification
     5.1.  A Stateful Method of Token Generation
     5.2.  A Stateless Method of Token Generation
     5.3.  Token Lifetime
   6.  Backup Gateways
   7.  Alternative Solutions
     7.1.  Why not Save the Entire IKE SA
     7.2.  Initiating a new IKE SA
   8.  Interaction with IFARE
   9.  Operational Considerations
   10. Security Considerations
   11. IANA Considerations
   12. Acknowledgements
   13. Change Log
     13.1. Changes from draft-nir-qcr-00
   14. References
     14.1. Normative References
     14.2. Informative References
   Author's Address
   Intellectual Property and Copyright Statements

1.  Introduction

   IKEv2, as described in [RFC4306], has a method for recovering from a
   reboot of one peer.  As long as traffic flows in both directions, the
   rebooted peer should re-establish the tunnels immediately.  However,
   in many cases the rebooted peer is a VPN gateway that protects only
   servers, or else the non-rebooted peer has a dynamic IP address.  In
   such cases, the rebooted peer will not re-establish the tunnels.

   Section 2 describes the current procedure, and explains why crash
   recovery can take up to several minutes.  The method proposed here
   is to send a token in the IKE_AUTH exchange that establishes the
   tunnel.  That token can be maintained on the peer in some kind of
   persistent storage such as a disk or a database, and can be used to
   delete the IKE SA on the non-rebooted peer after a crash.  Deleting
   the IKE SA results in a quick re-establishment of the IPsec tunnel.

1.1.  Conventions Used in This Document

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
   "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
   document are to be interpreted as described in [RFC2119].

2.  RFC 4306 Crash Recovery

   When one peer reboots, the other peer does not get any notification,
   so IPsec traffic can still flow.  The rebooted peer will not be able
   to decrypt it, however, and the only remedy is to send an unprotected
   INVALID_SPI notification as described in section 3.10.1 of [RFC4306].
   That section also describes the processing of such a notification:
   "If this Informational Message is sent outside the context of an
   IKE_SA, it should be used by the recipient only as a "hint" that
   something might be wrong (because it could easily be forged)."

   Since the INVALID_SPI can only be used as a hint, the non-rebooted
   peer has to determine whether the IPsec SA, and indeed the parent IKE
   SA are still valid.  The method of doing this is described in section
   2.4 of [RFC4306].  This method, called "liveness check", involves
   sending a protected empty INFORMATIONAL message, and awaiting a
   response.  This procedure is sometimes referred to as "Dead Peer
   Detection" or DPD.

   Section 2.4 does not mandate how many times the INFORMATIONAL message
   should be retransmitted, or for how long, but does recommend the
   following: "It is suggested that messages be retransmitted at least a
   dozen times over a period of at least several minutes before giving
   up on an SA".  Clearly, implementations differ, but all will take a
   significant amount of time.

3.  Protocol Outline

   Supporting implementations will send a notification, called a "QCD
   token", as described in Section 4.1 in the last packets of the
   IKE_AUTH exchange.  These are the final request and final response
   that contain the AUTH payloads.  The generation of these tokens is a
   local matter for implementations, but considerations are described in
   Section 5.  Implementations that send such a token will be called
   "token makers".

   A supporting implementation receiving such a token SHOULD store it in
   such a way that it will survive a reboot.  If the implementation is
   part of a configuration where there is a backup gateway as described
   in Section 6 (such configurations are often referred to as high-
   availability), then the persistent storage module SHOULD be
   accessible to all implementations within the configuration.  An
   implementation supporting this part of the protocol will be called a
   "token taker".

   When a token taker receives a protected IKE request message with
   unknown IKE SPIs, it MUST scan its saved token store.  If a token
   matching the IKE SPIs is found, it SHOULD be sent to the requesting
   peer in an unprotected IKE message as described in Section 4.3.
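
   The token taker behaviour described above can be sketched as follows.
   This is only an illustration under stated assumptions: the class and
   method names are hypothetical, and an in-memory dictionary stands in
   for the persistent storage that a real token taker would need.

```python
from typing import Dict, Optional, Tuple

SpiPair = Tuple[bytes, bytes]  # (SPI-I, SPI-R) of the IKE SA


class TokenTaker:
    """Hypothetical token taker; a dict stands in for persistent storage."""

    def __init__(self) -> None:
        self._tokens: Dict[SpiPair, bytes] = {}

    def on_ike_auth_token(self, spi_i: bytes, spi_r: bytes,
                          token: bytes) -> None:
        # Save the QCD token received in the IKE_AUTH exchange.
        self._tokens[(spi_i, spi_r)] = token

    def on_sa_deleted(self, spi_i: bytes, spi_r: bytes) -> None:
        # Drop the token once the IKE SA is deleted or expires.
        self._tokens.pop((spi_i, spi_r), None)

    def on_unknown_spi_request(self, spi_i: bytes,
                               spi_r: bytes) -> Optional[bytes]:
        # Called for a protected IKE request whose SPIs match no live SA.
        # Returns the token to send back unprotected, or None if no token
        # was saved for these SPIs.
        return self._tokens.get((spi_i, spi_r))
```

   What to do when no token is found is not shown here; returning None
   simply leaves that decision to the caller.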

   When a token maker receives the QCD token in an unprotected
   notification, it MUST verify that the TOKEN_SECRET_DATA field is
   associated with the IKE SPIs in the IKE_SPI fields of the IKE packet.
   If the verification fails, it SHOULD log the event.  If it succeeds,
   it MUST delete the IKE SA associated with the IKE_SPI fields, and all
   dependent child SAs.  This event MAY also be logged.  The token maker
   MUST accept such tokens from any address, so as to allow different
   kinds of high-availability configuration of the token taker.
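
   A minimal sketch of the token maker processing described above is
   given below.  The SA table layout (a dictionary keyed by the SPI pair,
   holding a "token" and a list of "child_sas") and the function name are
   assumptions made for illustration.  The constant-time comparison is a
   design choice of this sketch, not a requirement of this document.

```python
import hmac
import logging

log = logging.getLogger("qcd")


def on_unprotected_qcd_token(sa_table: dict, spi_i: bytes, spi_r: bytes,
                             received: bytes) -> bool:
    """Process a QCD token that arrived in an unprotected notification.

    sa_table maps (spi_i, spi_r) to {"token": bytes, "child_sas": list}.
    Returns True if the IKE SA was deleted, False if verification failed.
    The source address is deliberately not checked.
    """
    sa = sa_table.get((spi_i, spi_r))
    if sa is None or not hmac.compare_digest(sa["token"], received):
        # Verification failed: log the event and keep all state.
        log.warning("QCD token verification failed for %s/%s",
                    spi_i.hex(), spi_r.hex())
        return False
    # Verification succeeded: delete the child SAs and the IKE SA.
    sa["child_sas"].clear()
    del sa_table[(spi_i, spi_r)]
    log.info("IKE SA %s/%s deleted after valid QCD token",
             spi_i.hex(), spi_r.hex())
    return True
```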

   A supporting implementation MAY immediately create new SAs using an
   Initial exchange, or it may wait for subsequent traffic to trigger
   the creation of new SAs.

   There is ongoing work on IKEv2 Session Resumption [resumption].  See
   Section 8 for a short discussion about this protocol's interaction
   with session resumption.

4.  Formats and Exchanges

4.1.  Notification Format

   The notification payload called "QCD token" is formatted as follows:

                            1                   2                   3
        0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
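       +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
       ! Next Payload  !C!  RESERVED   !         Payload Length        !
       +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
       !  Protocol ID  !   SPI Size    ! QCD Token Notify Message Type !
       +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
       !                                                               !
       ~                       TOKEN_SECRET_DATA                       ~
       !                                                               !
       +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+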

   o  Protocol ID (1 octet) MUST contain 1, as this message is related
      to an IKE SA.
   o  SPI Size (1 octet) MUST be zero, in conformance with [RFC4306].
   o  QCD Token Notify Message Type (2 octets) MUST be xxxxx, the
      value assigned for QCD token notifications.  TBA by IANA.
   o  TOKEN_SECRET_DATA (16-256 octets) contains a generated token as
      described in Section 5.
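
   The layout above can be illustrated with a short encoder and decoder.
   This is a sketch, not a reference implementation: the function names
   are hypothetical, and QCD_TOKEN_NOTIFY_TYPE is an arbitrary
   placeholder because the real value is still to be assigned by IANA.

```python
import struct

# Placeholder only: the real Notify Message Type is "TBA by IANA".
QCD_TOKEN_NOTIFY_TYPE = 0xFFFF

_HEADER = "!BBHBBH"  # next payload, C+reserved, length, proto, SPI size, type


def encode_qcd_token(token: bytes, next_payload: int = 0) -> bytes:
    """Build a QCD token Notify payload around TOKEN_SECRET_DATA."""
    if not 16 <= len(token) <= 256:
        raise ValueError("TOKEN_SECRET_DATA must be 16-256 octets")
    length = 8 + len(token)  # 8 fixed octets precede the token
    # Critical bit and RESERVED are zero, Protocol ID is 1, SPI Size is 0.
    header = struct.pack(_HEADER, next_payload, 0, length, 1, 0,
                         QCD_TOKEN_NOTIFY_TYPE)
    return header + token


def decode_qcd_token(payload: bytes) -> bytes:
    """Return TOKEN_SECRET_DATA, or raise ValueError if malformed."""
    if len(payload) < 8:
        raise ValueError("payload too short")
    _next, _flags, length, proto, spi_size, msg_type = struct.unpack(
        _HEADER, payload[:8])
    if length != len(payload):
        raise ValueError("Payload Length does not match")
    if proto != 1 or spi_size != 0 or msg_type != QCD_TOKEN_NOTIFY_TYPE:
        raise ValueError("not a QCD token notification")
    token = payload[8:]
    if not 16 <= len(token) <= 256:
        raise ValueError("TOKEN_SECRET_DATA must be 16-256 octets")
    return token
```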

4.2.  Authentication Exchange

   For clarity, only the EAP version of an AUTH exchange will be
   presented here.  The non-EAP version is very similar.  The figure
   below is based on appendix A.3 of [RFC4718].

    first request       --> IDi,
                            [[N(HTTP_CERT_LOOKUP_SUPPORTED)], CERTREQ+],
                            SA, TSi, TSr,

    first response      <-- IDr, [CERT+], AUTH,

                      / --> EAP
    repeat 1..N times |
                      \ <-- EAP

    last request        --> AUTH,
                            [N(QCD_TOKEN)]

    last response       <-- AUTH,
                            SA, TSi, TSr,
                            [N(QCD_TOKEN)]

   Note that the QCD_TOKEN notification is marked as optional because it
   is not required by this specification that every implementation be
   both token maker and token taker.  If only one peer sends the QCD
   token, then a reboot of the other peer will not be recoverable by
   this method.  This may be acceptable if traffic typically originates
   from the other peer.

   In any case, the lack of a QCD_TOKEN notification MUST NOT be taken
   as an indication that the peer does not support this standard.
   Conversely, if a peer does not understand this notification, it will
   simply ignore it.  Therefore a peer MAY send this notification
   freely, even if it does not know whether the other side supports it.

4.3.  Informational Exchange

   This informational exchange is non-protected, and is sent as a
   response to a protected IKE request, which uses an IKE SA that is
   unknown.

               request             --> N(QCD_TOKEN)

               response            <--

   The QCD_TOKEN is the only notification in the request.  Similar to
   the description in section 2.21 of [RFC4306], the IKE SPI and message
   ID fields in the packet headers are taken from the protected IKE
   request.

   If the QCD_TOKEN verifies OK, an empty response MUST be sent.  If the
   QCD_TOKEN cannot be validated, a response SHOULD NOT be sent.
   Section 5 defines token verification.

5.  Token Generation and Verification

   No token generation method is mandated by this document.  Two methods
   are documented in Section 5.1 and Section 5.2, but they only serve as
   examples.

   The following lists the requirements for a token generation
   mechanism:
   o  Tokens should be at least 16 octets long, and no more than 256
      octets long, to facilitate storage.
   o  It should not be possible for an external attacker to guess the
      QCD token generated by an implementation.  Cryptographic
      mechanisms such as PRNG and hash functions are RECOMMENDED.
   o  The peer that generated the QCD token should be able to
      immediately verify it, provided that the IKE SPIs are given, and
      that the IKE SA has not expired or been otherwise deleted.

5.1.  A Stateful Method of Token Generation

   This describes a stateful method of generating a token:
   o  Before sending the QCD token, 32 random octets are generated using
      a secure random number generator or a PRNG.
   o  Those 32 octets are used as the TOKEN_SECRET_DATA field, and stored
      as part of the IKE SA.
   o  For verification, the IKE implementation simply retrieves the IKE
      SA, and compares the TOKEN_SECRET_DATA field from the notification
      to the TOKEN_SECRET_DATA field stored with the SA.
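
   The stateful method can be sketched as follows.  The IkeSa class and
   the sa_table dictionary are hypothetical stand-ins for whatever SA
   database an implementation keeps.  The constant-time comparison is an
   extra precaution chosen for this sketch.

```python
import hmac
import secrets


class IkeSa:
    """Hypothetical minimal IKE SA record; real SAs hold far more state."""

    def __init__(self, spi_i: bytes, spi_r: bytes) -> None:
        self.spi_i = spi_i
        self.spi_r = spi_r
        # 32 random octets, generated once and stored with the SA.
        self.token_secret_data = secrets.token_bytes(32)


def verify_stateful(sa_table: dict, spi_i: bytes, spi_r: bytes,
                    received: bytes) -> bool:
    """Compare a received token with the one stored in the IKE SA."""
    sa = sa_table.get((spi_i, spi_r))
    if sa is None:
        return False  # SA expired or deleted: nothing to verify against
    return hmac.compare_digest(sa.token_secret_data, received)
```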

5.2.  A Stateless Method of Token Generation

   This describes a stateless method of generating a token.
   o  At startup, the IKE implementation generates a 32-octet random
      buffer using a cryptographically secure PRNG.  This buffer is
      called the QCD_SECRET.
   o  For each QCD token, the TOKEN_SECRET_DATA field is generated by
      calculating a SHA-256 hash over a concatenation of the QCD_SECRET
      and the IKE SPIs as follows:

      TOKEN_SECRET_DATA = SHA-256(QCD_SECRET | SPI-I | SPI-R)

   o  Verification uses the same calculation, and works even if the IKE
      SA has been deleted.  Still, if the IKE SA is no longer valid, the
      notification MUST NOT be acknowledged, as this could be used in an
      attempt to guess the QCD_SECRET.
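
   The stateless method can be sketched in a few lines.  The function
   names are hypothetical, and the concatenation order (QCD_SECRET, then
   SPI-I, then SPI-R) is an assumption of this sketch.  The sa_is_valid
   flag stands for whatever check an implementation uses to decide that
   the IKE SA still exists.

```python
import hashlib
import hmac
import secrets

# Generated once at startup; all tokens of this peer derive from it.
QCD_SECRET = secrets.token_bytes(32)


def make_token(qcd_secret: bytes, spi_i: bytes, spi_r: bytes) -> bytes:
    """SHA-256 over the secret concatenated with both IKE SPIs."""
    return hashlib.sha256(qcd_secret + spi_i + spi_r).digest()


def verify_token(qcd_secret: bytes, spi_i: bytes, spi_r: bytes,
                 received: bytes, sa_is_valid: bool) -> bool:
    """Recompute the token and compare; never acknowledge a dead SA."""
    if not sa_is_valid:
        # The calculation would still work, but acknowledging it could
        # help an attacker probe the QCD_SECRET.
        return False
    expected = make_token(qcd_secret, spi_i, spi_r)
    return hmac.compare_digest(expected, received)
```

   Because nothing is stored per SA, the same QCD_SECRET has to be
   available whenever a token is verified.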

5.3.  Token Lifetime

   The token is associated with a single IKE SA, and SHOULD be deleted
   when the SA is deleted or expires.  More formally, the token is
   associated with the pair (SPI-I, SPI-R).

6.  Backup Gateways

   Making crash recovery quick is important, but since rebooting a
   gateway takes a non-zero amount of time, many implementations choose
   to have a stand-by gateway ready to take over as soon as the primary
   gateway fails for any reason.

   If such a configuration is available, it is RECOMMENDED that the
   persistent storage be shared between the primary and backup gateway.
   This has the effect of having the crash recovery available
   immediately.  This recommendation is especially useful if the primary
   and backup gateway either share an external IP address or reside on
   the same LAN.  If they are geographically remote, this may be less

7.  Alternative Solutions

7.1.  Why not Save the Entire IKE SA

   IKEv2 does not assume the existence of a persistent storage module.
   If we are adding such a module, why not use it to save the entire IKE
   SA across reboots, nullifying the need for a crash recovery
   protocol?

   There are several reasons why we believe that this is not a good
   idea:
   1.  A token is only 16-256 octets, and is much more compact than all
       the data needed to store an IKE SA.
   2.  A token is valid for the life of an IKE SA.  An IKE SA state is
       updated whenever a message is sent, because of the requirement to
       maintain the sequence of message IDs.  It may not be acceptable
       to update the persistent storage whenever an IKE message is sent.
   3.  A reboot is usually an unpredictable event, and as such, we
       cannot know how long it will last.  By the time the machine has
       rebooted, the peer may have attempted some type of protected
       exchange (liveness check, create-child-SA or delete), timed out,
       and deleted the SA.  It is far better to reboot without SAs and
       with only a token for quick recovery.

7.2.  Initiating a new IKE SA

   Instead of sending a QCD token, we could have the rebooted
   implementation start an Initial exchange with the peer, including the
   INITIAL_CONTACT notification.  This would have the same effect,
   instructing the peer to erase the old IKE SA, as well as establishing
   a new IKE SA with fewer rounds.

   The disadvantage here is that in IKEv2 an authentication exchange
   MUST have a piggy-backed Child SA set up.  Since our use case is such
   that the rebooted implementation does not have traffic flowing to the
   peer, there are no good selectors for such a Child SA.

   Additionally, when authentication is asymmetric, such as when EAP is
   used, it is not possible for the rebooted implementation to initiate
   IKE.

8.  Interaction with IFARE

   IFARE, specified in [resumption], proposes to make setting up a new
   IKE SA consume less computing resources.  This is particularly useful
   in the case of a remote access gateway that has many tunnels.  A
   failure of such a gateway would require all these many remote access
   clients to establish an IKE SA either with the rebooted gateway or
   with a backup gateway.  This tunnel re-establishment should occur
   within a short period of time, creating a burden on the remote access
   gateway.  IFARE addresses this problem by having the clients store an
   encrypted derivative of the IKE SA for quick re-establishment.

   What IFARE does not help with is the problem of detecting that the
   peer gateway has failed.  A failed gateway may go undetected for an
   unbounded amount of time, because IPsec does not have packet
   acknowledgement.  Before establishing a new IKE SA using IFARE, a
   client MUST ascertain that the gateway has indeed failed.  This could
   be done using either a liveness check (as in RFC 4306) or using the
   QCD tokens described in this document.

   A remote access client conforming to both specifications will
   generate QCD tokens, and store the IFARE state, if provided by the
   gateway.  A remote access gateway conforming to both specifications
   will store the QCD token sent from the client.  When the gateway
   reboots, the client will discover this in either of two ways:
   1.  The client does regular liveness checks, or else the time for
       some other IKE exchange has come.  The exchange times out after
       several minutes, if the gateway does not finish rebooting in
       time.  In this case QCD does not help.
   2.  Either the primary gateway or a backup gateway (see Section 6) is
       ready and sends a QCD token to the client.  In that case the
       client will quickly re-establish the IPsec tunnel, either with
       the rebooted primary gateway, the backup gateway as described in
       this document or another gateway as described in [resumption].

   The full combined protocol looks like this:

        Initiator                Responder
        -----------              -----------
       HDR, SAi1, KEi, Ni  -->

                           <--    HDR, SAr1, KEr, Nr, [CERTREQ]

       HDR, SK {IDi, [CERT,]
       [CERTREQ,] [IDr,]
       SAi2, TSi, TSr,
       N(TICKET_REQUEST)}  -->
                           <--    HDR, SK {IDr, [CERT,] AUTH, SAr2, TSi,
                                  TSr, N(TICKET_OPAQUE)

                ---- Reboot -----

       HDR, {}             -->
                           <--  HDR, N(QCD_Token)

       [N+,], SK {IDi, [IDr,]
       SAi2, TSi, TSr,
       [CP(CFG_REQUEST)]}  -->
                           <--  HDR, SK {IDr, Nr, SAr2, [TSi, TSr],

9.  Operational Considerations

   To support the "token taker" part of this standard, an
   implementation needs to have access to a persistent storage module.
   This could be an internal hard disk, a local or remote database
   application, or any other method that persists across reboots.  This
   storage module and the data links between the storage module and the
   IKE module must meet the performance requirements of the IKE module.
   The storage module MUST support insertion and deletion rates equal to
   peak IKE SA setup rates and it SHOULD support query rates that are
   fast enough.
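
   As one possible illustration of such a storage module, the sketch
   below keeps tokens in a local SQLite file using Python's standard
   sqlite3 module.  The class name, table name and schema are
   assumptions made for this example.  Whether a file-based database
   meets the required insertion, deletion and query rates depends on the
   deployment.

```python
import sqlite3
from typing import Optional


class TokenStore:
    """Hypothetical persistent QCD token store keyed by (SPI-I, SPI-R)."""

    def __init__(self, path: str) -> None:
        self._db = sqlite3.connect(path)
        self._db.execute(
            "CREATE TABLE IF NOT EXISTS qcd_tokens ("
            "spi_i BLOB NOT NULL, spi_r BLOB NOT NULL, "
            "token BLOB NOT NULL, PRIMARY KEY (spi_i, spi_r))")
        self._db.commit()

    def insert(self, spi_i: bytes, spi_r: bytes, token: bytes) -> None:
        with self._db:  # commits on success, rolls back on error
            self._db.execute(
                "INSERT OR REPLACE INTO qcd_tokens VALUES (?, ?, ?)",
                (spi_i, spi_r, token))

    def delete(self, spi_i: bytes, spi_r: bytes) -> None:
        with self._db:
            self._db.execute(
                "DELETE FROM qcd_tokens WHERE spi_i = ? AND spi_r = ?",
                (spi_i, spi_r))

    def query(self, spi_i: bytes, spi_r: bytes) -> Optional[bytes]:
        row = self._db.execute(
            "SELECT token FROM qcd_tokens WHERE spi_i = ? AND spi_r = ?",
            (spi_i, spi_r)).fetchone()
        return row[0] if row else None

    def close(self) -> None:
        self._db.close()
```

   Closing the store and opening a new TokenStore on the same path
   models a reboot: tokens inserted earlier are still returned by
   query().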

   See Section 10 for security considerations for this storage module.

   Throughout this document, we have referred to reboot time
   alternatingly as the time that the implementation crashes and the
   time when it is ready to process IPsec packets and IKE exchanges.
   Depending on the hardware and software platforms and the cause of the
   reboot, rebooting may take anywhere from a few seconds to several
   minutes.  If the implementation is down for a long time, the
   benefits of this protocol extension are reduced.  For this reason,
   critical systems should implement backup gateways as described in
   Section 6.  Note that the lower-case "should" in the previous
   sentence is intentional, as we do not specify this in the sense of
   RFC 2119.

   Implementing the "token taker" side of QCD makes sense for IKE
   implementations where protected connections originate from the peer,
   such as inter-domain VPNs and remote access gateways.  Implementing
   the "token maker" side of QCD makes sense for IKE implementations
   where protected connections originate, such as inter-domain VPNs and
   remote access clients.

   To clarify the requirements:
   o  A remote-access client MUST be a token maker and MAY be a token
      taker.
   o  A remote-access gateway MAY be a token maker and MUST be a token
      taker.
   o  An inter-domain VPN gateway MUST be both token maker and token
      taker.

   In order to limit the effects of DoS attacks, an implementation
   SHOULD limit the rate of queries into the token storage so as not to
   overload it.  If excessive amounts of IKE requests protected with
   unknown IKE SPIs arrive, the IKE module SHOULD revert to the behavior
   described in section 2.21 of [RFC4306] and either send an
   INVALID_IKE_SPI notification, or ignore it entirely.
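
   One common way to bound the query rate is a token bucket, sketched
   below.  This document does not mandate any particular algorithm; the
   class name, parameters and the injectable clock are assumptions made
   for illustration.  The bucket units are called "credits" here to
   avoid confusion with QCD tokens.

```python
import time


class QueryRateLimiter:
    """Hypothetical token-bucket limiter for token storage queries."""

    def __init__(self, rate_per_sec: float, burst: int,
                 clock=time.monotonic) -> None:
        self.rate = rate_per_sec      # credits added per second
        self.capacity = float(burst)  # maximum stored credits
        self.credits = float(burst)
        self.clock = clock
        self.last = clock()

    def allow(self) -> bool:
        """Return True if one more storage query may be issued now."""
        now = self.clock()
        elapsed = now - self.last
        self.last = now
        # Refill according to elapsed time, never above the capacity.
        self.credits = min(self.capacity,
                           self.credits + elapsed * self.rate)
        if self.credits >= 1.0:
            self.credits -= 1.0
            return True
        return False
```

   When allow() returns False, the IKE module would skip the storage
   query and fall back to the behavior of section 2.21 of [RFC4306].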

10.  Security Considerations

   Tokens MUST be hard to guess.  This is critical, because if an
   attacker can guess the token associated with the IKE SA, she can tear
   down the IKE SA and associated tunnels at will.  When the token is
   delivered in the IKE_AUTH exchange, it is encrypted.  When it is sent
   back in an informational exchange it is not encrypted, but that is
   the last use of that token.

   An aggregation of some tokens generated by one peer together with the
   related IKE SPIs MUST NOT give an attacker the ability to guess other
   tokens.  Specifically, if one peer does not properly secure the QCD
   tokens and an attacker gains access to them, this attacker MUST NOT
   be able to guess other tokens generated by the same peer.  This is
   the reason that the QCD_SECRET in Section 5.2 needs to be long.

   The persistent storage MUST be protected from access by other
   parties.  Anyone gaining access to the contents of the storage will
   be able to delete all the IKE SAs described in it.

   The tokens associated with expired and deleted IKE SAs MUST be
   deleted from the storage, so that a future compromise of the storage
   does not reveal enough tokens to facilitate an attack against the QCD

   The QCD token is sent by the rebooted peer in an unprotected message.
   A message like that is subject to modification, deletion and replay
   by an attacker.  However, these attacks will not compromise the
   security of either side.  Modification is meaningless because a
   modified token is simply an invalid token.  Deletion will only cause
   the protocol not to work, resulting in a delay in tunnel re-
   establishment as described in Section 2.  Replay is also meaningless,
   because the IKE SA has been deleted after the first transmission.

11.  IANA Considerations

   IANA is requested to assign a notify message type from the error
   types range (43-8191) of the "IKEv2 Notify Message Types" registry

12.  Acknowledgements

   We would like to thank Hannes Tschofenig and Yaron Sheffer for their
   comments about IFARE.

13.  Change Log

   This section lists all changes in this document.

   NOTE TO RFC EDITOR : Please remove this section in the final RFC

13.1.  Changes from draft-nir-qcr-00

   o  Changed name to reflect that this relates to IKE.  Also changed
      from quick crash recovery to quick crash detection to avoid
      confusion with IFARE.
   o  Added more operational considerations.
   o  Added interaction with IFARE.
   o  Added discussion of backup gateways.

14.  References

14.1.  Normative References

   [RFC2119]  Bradner, S., "Key words for use in RFCs to Indicate
              Requirement Levels", BCP 14, RFC 2119, March 1997.

   [RFC4306]  Kaufman, C., "Internet Key Exchange (IKEv2) Protocol",
              RFC 4306, December 2005.

   [RFC4718]  Eronen, P. and P. Hoffman, "IKEv2 Clarifications and
              Implementation Guidelines", RFC 4718, October 2006.

14.2.  Informative References

   [resumption]
              Sheffer, Y., Tschofenig, H., Dondeti, L., and V.
              Narayanan, "IPsec Gateway Failover Protocol",
              draft-sheffer-ipsec-failover-03 (work in progress),
              March 2008.

Author's Address

   Yoav Nir
   Check Point Software Technologies Ltd.
   5 Hasolelim st.
   Tel Aviv  67897


Full Copyright Statement

   Copyright (C) The IETF Trust (2008).

   This document is subject to the rights, licenses and restrictions
   contained in BCP 78, and except as set forth therein, the authors
   retain all their rights.

   This document and the information contained herein are provided on an

Intellectual Property

   The IETF takes no position regarding the validity or scope of any
   Intellectual Property Rights or other rights that might be claimed to
   pertain to the implementation or use of the technology described in
   this document or the extent to which any license under such rights
   might or might not be available; nor does it represent that it has
   made any independent effort to identify any such rights.  Information
   on the procedures with respect to rights in RFC documents can be
   found in BCP 78 and BCP 79.

   Copies of IPR disclosures made to the IETF Secretariat and any
   assurances of licenses to be made available, or the result of an
   attempt made to obtain a general license or permission for the use of
   such proprietary rights by implementers or users of this
   specification can be obtained from the IETF on-line IPR repository at

   The IETF invites any interested party to bring to its attention any
   copyrights, patents or patent applications, or other proprietary
   rights that may cover technology that may be required to implement
   this standard.  Please address the information to the IETF at


   Funding for the RFC Editor function is provided by the IETF
   Administrative Support Activity (IASA).
