DECADE                                                       Xin Wang
Internet Draft                                        Fudan University
Intended status: Informational                               Jin Zhao
Expires: April 2011                                   Fudan University
                                                          Tiegang Zeng
                                                      Fudan University
                                                               Jun Li
                                                      Fudan University
                                                              Lei Liu
                                                      Fudan University
                                                          Shihui Duan
                                                           China CATR
                                                      October 25, 2010




     Router-supported Data Regeneration for In-network Storage Systems
                draft-wang-decade-data-regeneration-01.txt


Status of this Memo

   This Internet-Draft is submitted in full conformance with the
   provisions of BCP 78 and BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF), its areas, and its working groups.  Note that
   other groups may also distribute working documents as Internet-Drafts.

   Internet-Drafts are draft documents valid for a maximum of six months
   and may be updated, replaced, or obsoleted by other documents at any
   time.  It is inappropriate to use Internet-Drafts as reference
   material or to cite them other than as "work in progress."

   The list of current Internet-Drafts can be accessed at
   http://www.ietf.org/ietf/1id-abstracts.txt

   The list of Internet-Draft Shadow Directories can be accessed at
   http://www.ietf.org/shadow.html

   This Internet-Draft will expire on April 25, 2011.

Copyright Notice

   Copyright (c) 2010 IETF Trust and the persons identified as the
   document authors. All rights reserved.




Wang, et al            Expires April 25, 2011                 [Page 1]


Internet-Draft            data regeneration           October 25, 2010


   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents
   (http://trustee.ietf.org/license-info) in effect on the date of
   publication of this document. Please review these documents carefully,
   as they describe your rights and restrictions with respect to this
   document.

Abstract

   In-network storage systems store redundancy to compensate for the
   data loss incurred by hardware failures or other reasons. This
   document introduces a practical work of router-supported data
   regeneration in in-network storage systems to maintain the amount of
   redundancy. This proposed regeneration process can exploit the
   bandwidth diversity in the network, and the corresponding protocol
   enables supporting routers work transparently to support the
   regeneration process.
































Wang, et al            Expires April 25, 2011                 [Page 2]


Internet-Draft            data regeneration           October 25, 2010




Table of Contents


   1. Introduction ................................................ 4
   2. Key problems ................................................ 4
   3. Overview of the Router-Supported Regeneration Process                                                               ......... 6
      3.1. Regeneration Process                                    .................................... 6
      3.2. File Regeneration Protocol                                          .............................. 7
      3.3. System Implementation and Components .................... 9
      3.4. Key Technologies                                ....................................... 10
   4. Validation ................................................. 11
   5. DECADE Compatibility........................................ 12
   6. Security Considerations                                  ..................................... 12
   7. IANA Considerations ........................................ 12
   8. Conclusions ................................................ 13
   9. References ................................................. 13
      9.1. Normative References                                    ................................... 13
      9.2. Informative References                                      ................................. 13





























Wang, et al            Expires April 25, 2011                 [Page 3]


Internet-Draft            data regeneration           October 25, 2010


 1. Introduction

   In-network storage systems store a substantial volume of data in an
   overlay network containing a large number of storage servers, which
   can be used for online storage service, such as file sharing, CDN,
   and etc. In such systems, peer churns, hardware failures and other
   malfunctions are unavoidable so that some data may not be accessible.
   Thus, the system should maintain a certain ratio of redundancy so
   that a subset of data is enough for data recovery.

   Storing coded data of original files rather than their replicas [1]
   can maintain higher data integrity [2], so that any k of n storage
   servers can retrieve the original data. On the other hand, if a
   storage server fails, a replacement server should regenerate the data
   stored in the failed server. For coded data, this way guarantees that
   any k servers can retrieve the original file. The replacement server,
   which we call as "newcomer" in this document, should contact at least
   k storage servers, which we call as "providers" in this document.

   The efficiency of such regeneration is influenced by the topology of
   the network. First, since the newcomer should contact multiple
   providers, several flows of data from providers may converge at some
   links, and thus incur a significant bottleneck in the transmission.
   Second, the topology may have an influence on the available bandwidth
   between two servers. For example, two servers in a subnet may
   probably have higher available bandwidth between them than two
   servers in two different subnets. As shown in [3], the diversity of
   bandwidth incurred by network topology can be exploited by letting
   providers relay the data flow from other providers to form a tree-
   structured topology during the regeneration, where network coding
   naturally resides, such that some slow links can be bypassed.

   In this document, we show that the devices which relay the traffic
   during the regeneration can be not only providers, but routers as
   well. Routers can support the regeneration, as it can encode data
   when multiple data flows converge and reduce the overall traffic
   during the regeneration. Since the redundant maintenance is
   independent of data access, the scheme that we present is compatible
   with the protocol DECADE to access in-network storage [4]. We show
   that routers can support the regeneration process transparently such
   that no storage servers should be aware of such routers or the
   network topology.

 2. Key problems

   In order to support data regeneration in in-network storage systems,
   routers should be able to work transparently so that storage servers


Wang, et al            Expires April 25, 2011                 [Page 4]


Internet-Draft            data regeneration           October 25, 2010


   participating in the regeneration do not need any information about
   routers in the network. To satisfy this goal, the following key
   problems should be considered:

      a) Bandwidth: During the regeneration, since providers are allowed
      to relay the traffic from other providers, available bandwidth
      between servers should be measured to determine the optimal
      routing. However, since supporting routers are supposed to be
      transparent to storage servers, we can only measure the end-to-end
      available bandwidth between each two servers participating in the
      regeneration process.

      b) Routing: Since we can only get the table of the available
      bandwidth between each pair of participating servers, we first
      determine the routing on the overlay network covering the
      participating servers, then the transmission rate during the
      regeneration is optimized.

      c) Mapping: To make the supporting routers work, we need a mapping
      mechanism to make supporting routes aware of the regeneration
      process and know how to act during the regeneration. After the
      routing in the overlay network has been determined, participating
      servers may send data to their next-hop servers, to make
      supporting routers between them know that there will be traffic of
      a regeneration process coming soon. Supporting routers should be
      able to determine how many flows will come, whether it is
      necessary to encode such flows and where the next-hop is. After
      mapping, the data transmission can start.

      d) Reliability: Data transmission is unreliable in the network due
      to packet loss, disorder or other failures. Supporting routers
      should have a mechanism to make sure the data transmission is
      reliable. If the transmission between two servers is based on TCP,
      supporting routers should maintain the TCP state of incoming flows.
      If the transmission is based on UDP, there should be application-
      level retransmission scheme to guarantee the data reliability.

      e) Congestion control: TCP provides a mechanism to control the
      congestion in the network. However, if multiple flows are encoded
      by a supporting router, the router should control the congestion
      in place of the destination server. However, if the transmission
      is based on UDP, participating servers should perform end-to-end
      congestion control.






Wang, et al            Expires April 25, 2011                 [Page 5]


Internet-Draft            data regeneration           October 25, 2010


 3. Overview of the Router-Supported Regeneration Process

   Apart from data access, data regeneration is a part of functions
   provided by in-network storage systems. Our scheme requires that
   coded data are stored in the systems, and a file is divided into k
   blocks and n encoded blocks are produced after encoding in which any
   k blocks can recover the original file. The coding technique should
   be decentralized and the coding operations are not necessary to
   perform on a single server, such as random linear coding [5]. The n
   encoded blocks are stored in n storage servers, i.e., each storage
   server stores one encoded block. When a server fails, a replacement
   server, called newcomer, should contact at least k encoded blocks,
   and reconstruct a new encoded block by re-encoding these encoded
   blocks.

3.1. Regeneration Process

   Data regeneration process should be carried out as follows.

   1. A server failure is detected and then a regeneration process is
      triggered. One newcomer and at least k providers are selected. A
      weighted complete graph covering all these k + 1 servers is made
      in which the weight of each edge denotes the available bandwidth
      measured between each pair of the k + 1 servers. A maximum
      spanning tree, called regeneration tree, is constructed on such a
      complete graph.

   2. The newcomer obtains IP addresses of all providers and the
      regeneration tree. It then sends a NOTIFICATION messages to each
      provider.

   3. Each provider replies an ACK message when it receives a
      NOTIFICATION message to its parent in the regeneration tree.

   4. When an ACK message goes through a supporting router, the
      supporting router forwards this message and stores IP addresses of
      the source and the destination of the ACK message. An operation
      table should be constructed on the supporting router that contains
      the sources and destinations of the received ACK message and the
      number of hops to the corresponding destinations.

   5. Non-leaf providers modify the type of the received ACK message to
      a DACK message and forward it to the newcomer. If the newcomer has
      received ACK or DACK message from all providers, it sends a DETECT
      message to each provider.




Wang, et al            Expires April 25, 2011                 [Page 6]


Internet-Draft            data regeneration           October 25, 2010


   6. When a provider receives a DETECT message, it replies a RE-DETECT
      message to its parent in the regeneration tree. When a RE-DETECT
      message goes through a supporting router, the supporting router
      select the destination with the minimum number of hops in the
      operation table as the new destination of the RE-DETECT message
      and then forwards it. The supporting router selects IP addresses
      of sources of all received RE-DETECT messages and construct an
      encoding table.

   7. Non-leaf providers modify the type of the received RE-DETECT
      message to a DRE-DETECT message and forward it to the newcomer. If
      the newcomer has received RE-DETECT or DRE-DETECT messages from
      all providers, it sends a START message to each provider.

   8. All providers begin to send data in DATA messages that contain
      fixed number of bits of data to its parent node when it has
      received a START message. A provider that has incoming flow(s) has
      to wait to send its first DATA message until the first DATA
      message of each incoming flow has arrived and it has encoded the
      received data with the data it stores.

   9. When a DATA message goes through a supporting router, the
      supporting router stores it until it has received corresponding
      DATA messages from all entries in its encoding table. It then
      encodes the data in the received DATA message and sends a new DATA
      message that contains the encoded data to the destination with the
      minimum number of hops in the operation table.

   10.The newcomer stores the received data. If the newcomer has
      multiple incoming flows, it encodes the received data and stores
      them. The regeneration finishes when it has received all data.

3.2. File Regeneration Protocol

   The File Regeneration Protocol (FRP) runs on the application layer,
   as shown in Figure 1.

   +-----------------------------------------------------------+
   | MAC    |  IP       |  TCP/UDP   | FR      |   PAYLOAD     |
   | HEADER |  HEADER   |  HEADER    | HEADER  |               |
   +-----------------------------------------------------------+
      Figure 1 The overall structure of the File Regeneration Protocol

   The structure of the FR head is shown as Figure 2.





Wang, et al            Expires April 25, 2011                 [Page 7]


Internet-Draft            data regeneration           October 25, 2010


     0     1                                                    16
     +-----------------------------------------------------------+
     |CMD |                          SOURCE                      |
     |TYPE|                         IP ADDRESS                   |
     +-----------------------------------------------------------+
     |                                                           |
     |                            FILE BLOCK NAME                |
     +-----------------------------------------------------------+
     |RESERVE|PROVIDER  |PACKET |
     |       |NUMBER    |NUMBER |
     +--------------------------+
     0       2          4       6
                  Figure 2 The structure of the FR header

   "CMD TYPE" indicates the type of the message that includes:

      NOTIFICATION: The newcomer sends a NOTIFICATION messages to
      providers to start a regeneration process.

      ACK: A provider replies an ACK message to the newcomer when it
      receives a NOTIFICATION message. An ACK message makes supporting
      routers that it goes through be aware of the regeneration process.

      DACK: A DACK message is no different from a ACK message except for
      the CMD TYPE. Supporting routers will not process the DACK message.

      DETECT: The newcomer sends DETECT messages to providers when it
      has received ACK messages that contain addresses of all providers.

      RE-DETECT: A provider replies an DETECT message to the newcomer
      when it receives a DETECT message. A RE-DETECT message makes
      supporting routers be aware of the number of incoming flows during
      the upcoming data transmission.

      DRE-DETECT: A DRE-DETECT message is no different from a RE-DETECT
      message except for the CMD TYPE. Supporting routers will not
      process the DRE-DETECT message.

      START: The newcomer sends START messages to all providers,
      indicating all servers are ready.

      DATA: Providers send DATA messages that contain fixed number of
      bits of data to its parent in the regeneration tree. DATA messages
      may be encoded at supporting routers and are finally forwarded to
      the newcomer.




Wang, et al            Expires April 25, 2011                 [Page 8]


Internet-Draft            data regeneration           October 25, 2010


   "SOURCE IP ADDRESS" represents the IP address of the last encoding
   device, which may be a provider or a supporting router.

   "FILE BLOCK NAME" represents the name of the file block in the
   transmission.

   "RESERVE" represents the segment that is reserved for future
   applications.

   "PROVIDER NUMBER" represents the identifier number of the provider.

   "PACKET NUMBER" represents the sequential number of the file block in
   the transmission.

3.3. System Implementation and Components

   Router-supported data regeneration is an independent component in in-
   network storage systems. Since there are a large number of storage
   servers in the system, the server failure occurs frequently. To
   maintain the data integrity, a high-efficient mechanism is necessary
   to regenerate the lost data when a server fails. The implementation
   of our proposed in-network storage system contains two parts: storage
   servers and supporting routers. From the perspective of functions,
   storage servers are composed of three functional parts: dispatcher,
   newcomer and provider. Figure 3 illustrates the system architecture
   and components of the router-supported data regeneration.

                                        +----+   +----+
                                    |---| SR |---| PR |
                                    |   +----+   +----+
         +----+   +----+   +----+   |      |
         | DI |---| NC |---| SR |---|      |
         +----+   +----+   +----+   |      |
                                    |   +----+   +----+
                                    |---| SR |---| PR |
                                        +----+   +----+

   Figure 3 The architecture and components of the router-supported data
                               regeneration

   DISPATCHER (DI): It manages the whole system, including the selection
   of the newcomer and providers and the detection of server failures.

   NEWCOMER (NC): It starts the regeneration process with providers and
   accepts data from providers. It encodes the received data and stores
   the encoded data as a regenerated block.



Wang, et al            Expires April 25, 2011                 [Page 9]


Internet-Draft            data regeneration           October 25, 2010


   PROVIDER (PR): Providers are storage server that provider data to the
   newcomer in the regeneration process.

   SUPPORTING ROUTER (SR): Supporting routers are routers with computing
   and cache capabilities and can support regeneration. Before the data
   transmission during the regeneration, it collects information from
   ACK and RE-DETECT message to be aware of the incoming flows and the
   destination in the regeneration process.

3.4. Key Technologies

   Some key technologies are presented in this section, which can make
   data transmission rate improved, the bandwidth consumption saved and
   the spent time reduced during the regeneration.

   1. When the newcomer and providers have been selected, we measure the
      available bandwidth between each pair of the newcomer and
      providers. A complete graph covering the newcomer and providers
      can be constructed and the weight of each edge is the
      corresponding available bandwidth between two servers. A maximum
      spanning tree, i.e., a regeneration tree, then is constructed on
      this graph, in which the newcomer is the root. All non-root
      servers in the regeneration tree send their data to its parent.
      When a flow of data transferred goes through a supporting router,
      it may be encoded with other flows and forwarded to another server.
      Compared with conventional regeneration process, the method we
      propose utilizes the link with higher available bandwidth in the
      network, reduces the communication cost and thus increases the
      transmission rate during the regeneration.

   2. Another key technology in the router-supported data regeneration
      is data encoding on the supporting router. During the regeneration,
      supporting router detects File Regeneration Protocol (FRP) by
      processing all IP packets that go through it. Supporting routers
      recognizes the file block being regenerated by analyzing the FRP.
      If multiple flows of the same block come during the regeneration,
      a supporting router should encode the received data. The header of
      FRP enables the supporting router to know whether encoding
      operations should be performed. Utilizing the computing capability,
      encoding operations that should have been performed on the
      newcomer or providers are partially transferred to supporting
      routers, such that supporting routers sends out only one data flow
      even if it receives multiple data flows. Therefore, multiple data
      flows sharing the same physical link are eliminated or at least
      partially eliminated, and transmission rate during the
      regeneration can be significantly improved.



Wang, et al            Expires April 25, 2011                [Page 10]


Internet-Draft            data regeneration           October 25, 2010


 4. Validation

   We present our evaluation results of our proposed scheme here. We
   implement supporting routers on servers running Linux (kernel 2.6.30).
   In the network, there are four supporting routers and four storage
   servers, among which one is selected as both the dispatcher and the
   newcomer and others are providers. The storage servers and supporting
   routers can connect in a topology shown in Figure 4.

                  +-----+    +-----+    +-----+    +-----+
                  |DI+NC|----| SR1 |----| SR4 |----| PR3 |
                  +-----+    +-----+    +-----+    +-----+
                                |         |
                                |         |
                                |         |
                  +-----+    +-----+    +-----+    +-----+
                  | PR1 |----| SR2 |----| SR3 |----| PR2 |
                  +-----+    +-----+    +-----+    +-----+
              Figure 4 the network topology in the experiment

   Each link in the network topology refers to a fast Ethernet (100BASE-
   TX, specifically). Actually we control the available bandwidth on
   each link as follows.

          Table 1 The available bandwidth in the network topology

         +-------------------------------------------------------+
         |  Node 1   |   Node 2  |  Available Bandwidth (MBps/S) |
         +-------------------------------------------------------+
         |  DI+NC    |    SR1    |             60                |
         |  SR1      |    SR4    |             50                |
         |  SR4      |    PR3    |             80                |
         |  SR1      |    SR2    |             40                |
         |  SR1      |    SR3    |             25                |
         |  SR4      |    SR3    |             20                |
         |  PR1      |    SR2    |             80                |
         |  SR2      |    SR3    |             50                |
         |  SR3      |    PR2    |             80                |
         +-------------------------------------------------------+
   We regenerate a coded block with a size of 6,000,000 bytes. We
   compare the time spent in the regeneration process and the bandwidth
   consumed between the conventional regeneration process in which the
   newcomer receives data directly from providers and the regeneration
   process we propose above. The experiment is repeated for 100 times
   and the result is the average value.

                 Table 2 The average bandwidth consumption


Wang, et al            Expires April 25, 2011                [Page 11]


Internet-Draft            data regeneration           October 25, 2010


                  +-------------------------------------+
                  | conventional   |   router-supported |
                  | regeneration   |   regeneration     |
                  +-------------------------------------+
                  | 58241047 byte  |    45606648 bytes  |
                  +-------------------------------------+


   Table 2 shows the bandwidth consumption on average. We count the
   total number of bytes all providers and supporting routers send out.
   We can see that router-supported regeneration process is able to
   reduce the bandwidth consumption by 21.7%, since supporting routers
   can encode data from multiple devices.

                   Table 3 The average regeneration time

                  +-------------------------------------+
                  | conventional   |   router-supported |
                  | regeneration   |   regeneration     |
                  +-------------------------------------+
                  |   95.4 sec.    |      49.4 sec.     |
                  +-------------------------------------+


   Table 3 shows the average regeneration time. Router-supported
   regeneration can reduce the regeneration time by 48.3%, because it
   can not only reduce the bandwidth consumption, but also utilize the
   network topology by bypassing links with low available bandwidth.

 5. DECADE Compatibility

   Since in the in-network storage system, servers are not guaranteed to
   be stable, it is necessary to maintain the data integrity by
   regenerating the lost block after server failures. Thus, the File
   Regeneration Protocol (FRP) can work as a part of DECADE protocol.
   According to the reliability level of the applications, DECADE-
   compatible applications can implement FRP independently, which will
   not interfere with other part of the DECADE protocol.

 6. Security Considerations

   This draft does not introduce any new security issues.

 7. IANA Considerations

   This memo includes no request to IANA.



Wang, et al            Expires April 25, 2011                [Page 12]


Internet-Draft            data regeneration           October 25, 2010


 8. Conclusions

   We propose a topology-aware regeneration process for in-network
   storage system such that bandwidth diversity in the network can be
   exploited and routers may support the regeneration process by
   encoding the incoming data flows. We present the corresponding
   protocol to configure the regeneration process adaptively in which
   supporting routers and servers do not need to know the network
   topology and make decisions by their local information. System
   architecture is presented and related key technologies are discussed.

 9. References

9.1. Normative References

   [1]  S. Shepler, B. Callaghan, D. Robinson, R. Thurlow, C. Beame, M.
         Eisler, and D. Noveck, "Network File System (NFS) version 4
         Protocol", RFC 3510, 2003.

9.2. Informative References

   [2]  H. Song, N. Zong, Y. Yang, and R. Alimi, "DECoupled Application
         Data Enroute (DECADE) Problem Statement," http://
         http://tools.ietf.org/id/draft-ietf-decade-problem-statement-
         00.txt

   [3]  H. Weatherspoon and J. Kubiatowicz, "Erasure Coding vs.
         Replication: A Quantitative Comparison," Peer-to-Peer Systems,
         vol. 2429/2002, pp. 328-337, 2002.

   [4]  J. Li, S. Yang, X. Wang, and B. Li, "Tree-structured Data
         Regeneration in Distributed Storage Systems with Regenerating
         Codes," in Proc. IEEE INFOCOM, 2010.

   [5]  T. Ho, R. Koetter, M. Medard, D. Karger, and M. Effros, "The
         Benefits of Coding over Routing in a Randomized Setting," in
         Proc. International Symp. Inform. Theory, pp. 442, 2003.











Wang, et al            Expires April 25, 2011                [Page 13]


Internet-Draft            data regeneration           October 25, 2010


Authors' Addresses

  Xin Wang
   Fudan University
   Shanghai 201203, China
   Phone: 86-21-51355526
   Email: xinw@fudan.edu.cn

   Jin Zhao
   Fudan University
   Shanghai 201203, China
   Phone: 86-21-51355526
   Email: jzhao@fudan.edu.cn


   Tiegang Zeng
   Fudan University
   Shanghai 201203, China
   Phone: 86-21-51355526
   Email: 09210240087@fudan.edu.cn


   Jun Li
   Fudan University
   Shanghai 201203, China
   Phone: 86-21-51355526
   Email: 0572222@fudan.edu.cn


   Lei Liu
   Fudan University
   Shanghai 201203, China
   Phone: 86-21-51355526
   Email: 09210240117@fudan.edu.cn


   Shihui Duan
   CATR
   Beijing 100045, China
   Phone: 86-10-63200068
   Email: duanshihui@catr.cn








Wang, et al            Expires April 25, 2011                [Page 14]