Network Management Research Group                               M-S. Kim
Internet-Draft                                                 Y-G. Hong
Intended status: Informational                                      ETRI
Expires: January 3, 2019                                        Y-H. Han
                                                               KoreaTech
                                                                T-J. Ahn
                                                                      KT
                                                                K-H. Kim
                                                                    ETRI
                                                            July 2, 2018


      Intelligent Network Management using Reinforcement Learning
                          draft-kim-nmrg-rl-03

Abstract

   This document describes an intelligent network management system
   that autonomously manages and monitors a network using machine
   learning techniques.  Reinforcement learning is one of the machine
   learning techniques that can provide autonomous management with
   multi-agent path planning over a communication network.  In an
   intelligent distributed multi-agent system, a centralized node,
   called the global environment, not only manages the workflow of all
   agents in a hybrid peer-to-peer networking architecture but also
   transfers and shares information among distributed nodes.  Each
   agent in a distributed node receives a cumulative reward for the
   actions it takes, and the reward is used to optimize its knowledge
   according to a policy learned over the learning process.  The
   optimized and trained knowledge covers the large state information
   produced by control actions over the network.  A reward from the
   global environment is reflected in the next control action, so that
   network management in the distributed networking nodes proceeds
   autonomously.  The Reinforcement Learning (RL) process has been
   developed and extended into Deep Reinforcement Learning (DRL) with
   model-driven or data-driven approaches to the learning process.
   This technique has been widely applied to networking, since Deep
   Reinforcement Learning can be used in practical networking areas
   despite dynamic and heterogeneous environment disturbances, allowing
   an effective strategy to be learned intelligently.

Status of This Memo

   This Internet-Draft is submitted in full conformance with the
   provisions of BCP 78 and BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF).  Note that other groups may also distribute
   working documents as Internet-Drafts.  The list of current Internet-
   Drafts is at https://datatracker.ietf.org/drafts/current/.

   Internet-Drafts are draft documents valid for a maximum of six months
   and may be updated, replaced, or obsoleted by other documents at any
   time.  It is inappropriate to use Internet-Drafts as reference
   material or to cite them other than as "work in progress."

   This Internet-Draft will expire on January 3, 2019.

Copyright Notice

   Copyright (c) 2018 IETF Trust and the persons identified as the
   document authors.  All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents
   (https://trustee.ietf.org/license-info) in effect on the date of
   publication of this document.  Please review these documents
   carefully, as they describe your rights and restrictions with respect
   to this document.  Code Components extracted from this document must
   include Simplified BSD License text as described in Section 4.e of
   the Trust Legal Provisions and are provided without warranty as
   described in the Simplified BSD License.

Table of Contents

   1.  Introduction  . . . . . . . . . . . . . . . . . . . . . . . .   3
   2.  Conventions and Terminology . . . . . . . . . . . . . . . . .   4
   3.  Motivation  . . . . . . . . . . . . . . . . . . . . . . . . .   4
     3.1.  General Motivation for Reinforcement Learning . . . . . .   4
     3.2.  Reinforcement Learning in networks  . . . . . . . . . . .   4
     3.3.  Deep Reinforcement Learning in networks . . . . . . . . .   4
     3.4.  Motivation in our work  . . . . . . . . . . . . . . . . .   5
   4.  Related Works . . . . . . . . . . . . . . . . . . . . . . . .   5
     4.1.  Autonomous Driving System . . . . . . . . . . . . . . . .   5
     4.2.  Network Defect Prediction . . . . . . . . . . . . . . . .   5
     4.3.  Wireless Sensor Network (WSN) . . . . . . . . . . . . . .   6
     4.4.  Routing Enhancement . . . . . . . . . . . . . . . . . . .   6
     4.5.  Routing Optimization  . . . . . . . . . . . . . . . . . .   6
     4.6.  Game Theory . . . . . . . . . . . . . . . . . . . . . . .   6
   5.  Intelligent Machine Learning Technologies . . . . . . . . . .   7
     5.1.  Reinforcement Learning (RL) . . . . . . . . . . . . . . .   7
     5.2.  Deep Learning (DL)  . . . . . . . . . . . . . . . . . . .   7
     5.3.  Deep Reinforcement Learning (DRL) . . . . . . . . . . . .   7
     5.4.  Advantage Actor Critic (A2C)  . . . . . . . . . . . . . .   8
     5.5.  Asynchronous Advantage Actor Critic (A3C) . . . . . . . .   8
     5.6.  Policy using Distance and Frequency . . . . . . . . . . .   9
     5.7.  Distributed Computing Node  . . . . . . . . . . . . . . .   9
     5.8.  Agent Sharing Information . . . . . . . . . . . . . . . .   9
   6.  Proposed Architecture . . . . . . . . . . . . . . . . . . . .   9
     6.1.  Architecture for Reinforcement Learning . . . . . . . . .  10
     6.2.  Architecture for Deep Reinforcement Learning  . . . . . .  11
   7.  Use case of Reinforcement Learning  . . . . . . . . . . . . .  11
     7.1.  Distributed Multi-agent Reinforcement Learning (RL):
           Sharing Information Technique . . . . . . . . . . . . . .  12
     7.2.  Intelligent Edge Computing technique for Traffic Control
           using Deep Reinforcement Learning . . . . . . . . . . . .  13
     7.3.  Edge computing system in a field of construction works
           using Reinforcement Learning  . . . . . . . . . . . . . .  14
     7.4.  Fault prediction for core-network using Deep Learning . .  14
   8.  IANA Considerations . . . . . . . . . . . . . . . . . . . . .  15
   9.  Security Considerations . . . . . . . . . . . . . . . . . . .  15
   10. References  . . . . . . . . . . . . . . . . . . . . . . . . .  15
     10.1.  Normative References . . . . . . . . . . . . . . . . . .  15
     10.2.  Informative References . . . . . . . . . . . . . . . . .  15
   Authors' Addresses  . . . . . . . . . . . . . . . . . . . . . . .  17

1.  Introduction

   In large infrastructures such as transportation, health, and energy
   systems, a collaborative monitoring system is needed, and there is a
   particular need for intelligent distributed networking systems with
   learning schemes.  Agent reinforcement learning for intelligent,
   autonomous network management is, in general, a challenging method
   in a dynamic, complex, and cluttered environment over a network.  It
   also requires the development of computational multi-agent learning
   systems across large numbers of distributed networking nodes, where
   the agents have limited and incomplete knowledge and access only
   local information.

   Reinforcement Learning can become an effective technique to transfer
   and share information among agents via the global environment
   (centralized node), as it does not require a priori knowledge of the
   agent behavior or environment to accomplish its tasks [Megherbi].
   Such knowledge is usually acquired and learned automatically and
   autonomously by trial and error.

   Reinforcement learning is one of the machine learning techniques
   that can be adapted to various networking environments for automatic
   networks [I-D.jiang-nmlrg-network-machine-learning].  Thus, this
   document provides the motivation, learning techniques, and use cases
   for network machine learning.

   Deep reinforcement learning has recently extended the reinforcement
   learning algorithm into more powerful model-driven or data-driven
   techniques that operate over a large state space and overcome the
   limitations of the classical reinforcement learning process.  The
   deep reinforcement learning technique has been shown to produce
   successful models for playing Atari games [Mnih].  Deep
   reinforcement learning also provides more effective experimental
   system performance in complex and cluttered networking environments.

   Classical reinforcement learning has limitations when adopted in
   networking areas, since networking environments consist of
   significantly large and complex components in the fields of routing
   configuration, optimization, and system management; deep
   reinforcement learning, in contrast, can handle much more state
   information in the learning process.

2.  Conventions and Terminology

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
   "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
   document are to be interpreted as described in [RFC2119].

3.  Motivation

3.1.  General Motivation for Reinforcement Learning

   Reinforcement learning is a system capable of autonomously acquiring
   and incorporating knowledge.  It continuously improves its learning
   process with experience and attempts to maximize the cumulative
   reward in order to maintain optimized learning knowledge in multi-
   agent-based monitoring systems [Teiralbar].  Maximizing the reward
   also progressively improves the learning speed of the agent's
   autonomous learning process.

3.2.  Reinforcement Learning in networks

   Reinforcement learning is an emerging technology for monitoring
   network systems to achieve fair resource allocation among nodes in
   wired or wireless mesh settings.  Monitoring network parameters and
   adjusting them based on the network dynamics has been shown to
   improve fairness in wireless infrastructures and resources [Nasim].

3.3.  Deep Reinforcement Learning in networks

   Deep reinforcement learning is a large-state, model-driven or data-
   driven approach to an intelligent learning strategy.  The technique
   trains models that learn a control policy directly from high-
   dimensional sensory input, using reinforcement learning with a
   Q-value function in a convolutional neural network [Mnih].  The
   model repeatedly estimates the reward using a defined reward
   function that depends on the current states, in order to acquire a
   more effective and optimized control action in the following steps.
   Deep reinforcement learning can be widely adopted in routing
   optimization to attempt to minimize the network delay [Stampa].

3.4.  Motivation in our work

   There are many different network management problems to solve
   intelligently, such as connectivity, traffic management, and low-
   latency Internet access.  We expect that machine-learning-based
   mechanisms such as reinforcement learning will provide network
   solutions in many cases beyond human operating capacities, even
   though it is a challenging area for a multitude of reasons: the
   large state space, the complexity of assigning rewards, the
   difficulty of choosing control actions, and the difficulty of
   sharing and merging the trained knowledge between agents located in
   distributed memory nodes, which must be transferred over a
   communication network [Minsuk].

4.  Related Works

4.1.  Autonomous Driving System

   Recently, 5G networks and AI have become new trends and future
   research areas, and many business models have been developed and
   have appeared in the networking field.  Autonomous vehicles have
   been developed alongside 5G and AI.  An autonomous vehicle is
   capable of driving without human supervision by relying on an
   optimized trust-region policy obtained by reinforcement learning,
   which enables learning in more complex and specialized network
   management environments.  Such a vehicle provides a comfortable user
   experience safely and reliably over an interactive communication
   network [April] [Markus].

4.2.  Network Defect Prediction

   Nowadays, networking equipment handles a variety of services, such
   as Internet access, IPTV, and VoIP, in a single device.  As the
   performance of the equipment improves, and even though consolidating
   services into a single device has advantages, the probability of
   service failure of network equipment may increase.  For that reason,
   the risk of equipment failure over a network poses a major problem
   for networking carriers, so there is a growing need to prevent
   disturbances by detecting network failures in advance.  Machine
   learning techniques such as deep learning and reinforcement learning
   have emerged as preferred solutions to manage and monitor networking
   equipment (LTE core, routers, and switches) against the risk of
   networking failure.

4.3.  Wireless Sensor Network (WSN)

   A wireless sensor network (WSN) consists of a large number of
   sensors and sink nodes that monitor systems via event parameters
   such as temperature, humidity, air conditioning, etc.  Reinforcement
   learning in WSNs has been applied in a wide range of schemes, such
   as cooperative communication, routing, and rate control.  The
   sensors and sink nodes are able to observe their respective
   operating environments and carry out optimal actions that enhance
   network and application performance [Kok-Lim].

4.4.  Routing Enhancement

   Reinforcement learning is used to enhance multicast routing
   protocols in wireless ad hoc networks, where each node has different
   capabilities.  Routers in the multicast routing protocol discover
   the optimal route using a predicted reward, and then create the
   optimal path for multicast transmissions so as to reduce the
   overhead of reinforcement learning [Kok-Lim].

4.5.  Routing Optimization

   Routing optimization, as a form of traffic engineering, is one of
   the important ways to control the behavior of transmitted data in
   order to maximize network performance [Stampa].  There have been
   several attempts to adopt machine learning algorithms in the context
   of routing optimization.  Deep reinforcement learning has recently
   become one solution for handling unseen network states that cannot
   be covered by a traditional table-based reinforcement learning agent
   [Stampa].  Deep reinforcement learning can further improve the
   routing configuration chosen by a given agent in a complex network.

4.6.  Game Theory

   Adaptive multi-agent systems, which combine the complexities arising
   from interacting game players, have developed within the field of
   reinforcement learning.  Early game theory was interdisciplinary
   work focused only on competitive games, but reinforcement learning
   has developed it into a general framework for analyzing strategic
   interaction and has attracted fields as diverse as psychology,
   economics, and biology [Ann].  AlphaGo, developed by Google
   DeepMind, is also an application of game theory using reinforcement
   learning.  Even though it began as a small computational learning
   program with some simple actions, it has since trained policy and
   value networks on thirty million actions, states, and rewards.

5.  Intelligent Machine Learning Technologies

5.1.  Reinforcement Learning (RL)

   Agent reinforcement learning is a machine-learning algorithm that,
   unlike supervised learning, is driven by an agent's own learning
   process rather than by labelled data.  Reinforcement learning
   normally uses a reward from a centralized node (the global
   environment) and is capable of autonomously acquiring and
   incorporating knowledge.  It continuously improves itself and
   becomes more efficient as it learns from the agent's experience,
   optimizing management performance for the autonomous learning
   process [Sutton][Madera].
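
   As a simple illustration, the following Python sketch shows a
   tabular version of such a reward-driven update (Q-learning).  The
   environment interface (reset/step) and the numeric parameters are
   assumptions made for illustration and are not defined by this
   document.

      import random

      ALPHA, GAMMA, EPSILON = 0.1, 0.9, 0.2  # assumed values

      def q_learning(env, states, actions, episodes=500):
          """Learn Q(s, a) by trial and error, epsilon-greedy."""
          Q = {(s, a): 0.0 for s in states for a in actions}
          for _ in range(episodes):
              s = env.reset()                    # initial state
              done = False
              while not done:
                  if random.random() < EPSILON:  # explore ...
                      a = random.choice(actions)
                  else:                          # ... or exploit
                      a = max(actions, key=lambda x: Q[(s, x)])
                  s2, reward, done = env.step(a)
                  best_next = max(Q[(s2, x)] for x in actions)
                  target = reward + GAMMA * best_next
                  # move Q(s, a) toward the observed target
                  Q[(s, a)] += ALPHA * (target - Q[(s, a)])
                  s = s2
          return Q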

5.2.  Deep Learning (DL)

   Rule-based judgment/prediction of network equipment failures
   requires that a correct rule be described for each piece of
   equipment or case, and that the rules be continuously updated
   whenever a new failure pattern occurs.  Deep Learning (DL)
   techniques such as the Convolutional Neural Network (CNN) and the
   Recurrent Neural Network (RNN) can instead be adapted to learn new
   patterns produced by networking faults.  With these models we are
   able to judge and predict a fault condition.  Deep learning models
   have advantages in terms of maintenance and expandability, since
   they can automatically learn the features underlying the patterns
   without needing detailed rules to be described.

5.3.  Deep Reinforcement Learning (DRL)

   Nowadays, some advanced techniques using reinforcement learning are
   combined with deep learning techniques based on Neural Networks
   (NN), which have made it possible to extract high-level features
   from raw data in computer vision [Krizhevsky].  There are many
   challenges in applying deep learning models such as the
   convolutional neural network and the recurrent neural network to the
   reinforcement learning approach.  The benefit of deep learning
   applications is that many networking models, which are problematic
   due to complex and cluttered networking structures, can be used with
   large amounts of labelled training data.

   Recently, advances in training deep neural networks have been used
   to develop a novel artificial agent, termed a deep Q-network (a deep
   reinforcement learning network), that can learn successful policies
   directly from high-dimensional sensory inputs using end-to-end
   reinforcement learning [Mnih].  Deep reinforcement learning (the
   deep Q-network) can provide more extended and powerful scenarios for
   building networking models with optimized action controls, huge
   system states, and a real-time-based reward function.  Moreover, the
   technique has a significant advantage in handling highly sequential
   data in a large model state space.  In particular, the data
   distribution in reinforcement learning can change as the agent
   learns new behaviors, which is a problem for deep learning
   approaches that assume a fixed underlying distribution [Mnih].
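
   To make the idea concrete, the Python sketch below shows how the
   training target of a Q-network is formed from a sampled transition.
   The 'q_net' and 'target_net' callables stand for any function
   approximator (e.g., a convolutional network); they are assumptions
   of this sketch rather than a prescribed implementation.

      import numpy as np

      GAMMA = 0.99  # assumed discount factor

      def td_target(transition, target_net):
          """TD target y = r + gamma * max_a Q'(s', a)."""
          state, action, reward, next_state, done = transition
          if done:
              return reward
          return reward + GAMMA * np.max(target_net(next_state))

      def q_loss(transition, q_net, target_net):
          """Squared error between Q(s, a) and its TD target."""
          state, action = transition[0], transition[1]
          y = td_target(transition, target_net)
          return (q_net(state)[action] - y) ** 2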

5.4.  Advantage Actor Critic (A2C)

   Advantage Actor Critic is one of the intelligent reinforcement
   learning models based on the policy gradient method.  The approach
   can optimize a deep neural network controller with reinforcement
   learning algorithms, and it has been shown that parallel actor-
   learners have a stabilizing effect on training and allow all of the
   methods to successfully train neural network controllers
   [Volodymyr].  Even though the earlier deep reinforcement learning
   algorithm with an experience replay memory performs very well in
   challenging control and service domains, it still needs more memory
   and computational power because of its off-policy learning method.
   To make up for this, a new algorithm has appeared.  The Advantage
   Actor Critic method (consisting of an actor and a critic) implements
   generalized policy iteration, alternating between a policy
   evaluation and a policy improvement step.  The actor is the policy-
   based component that improves the current policy toward the best
   available next action.  The critic, the value-based component,
   evaluates the current policy and reduces the variance by a
   bootstrapping method.  The result is a more stable and effective
   algorithm than the pure policy-gradient methods.
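
   The advantage-based update described above can be summarized by the
   following Python sketch of the two loss terms optimized by the actor
   and the critic.  The value estimates and the log-probability of the
   chosen action are assumed to come from the policy and value
   networks; they are placeholders used only for illustration.

      GAMMA = 0.99  # assumed discount factor

      def advantage(reward, v_s, v_next, done):
          """A(s, a) = r + gamma * V(s') - V(s)."""
          bootstrap = 0.0 if done else GAMMA * v_next
          return reward + bootstrap - v_s

      def actor_critic_losses(log_prob, reward, v_s, v_next, done):
          """Return (actor loss, critic loss) for one transition."""
          adv = advantage(reward, v_s, v_next, done)
          actor_loss = -log_prob * adv   # policy improvement step
          critic_loss = adv ** 2         # policy evaluation step
          return actor_loss, critic_loss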

5.5.  Asynchronous Advantage Actor Critic (A3C)

   Asynchronous Advantage Actor Critic is an updated algorithm based on
   Advantage Actor Critic.  The main concept of the algorithm is to run
   multiple environments in parallel and to run the agents
   asynchronously instead of using experience replay.  The parallel
   environments reduce the correlation of the agents' data and induce
   each agent to experience various states, so that the learning
   process becomes closer to a stationary process.  The algorithm is
   beneficial from a practical point of view since it achieves good
   learning performance even on a general multi-core CPU.  In addition,
   it can be applied to continuous as well as discrete action spaces,
   and it also has the advantage of being able to train both
   feedforward and recurrent agents.

   The A3C algorithm admits a number of complementary improvements to
   the neural network architecture, and it has been shown to produce
   accurate estimates of Q-values by including separate streams for the
   state value and the advantage in the network, improving both value-
   based and policy-based methods by making it easier for the network
   to represent the relevant features [Volodymyr].
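
   A minimal Python sketch of the asynchronous update pattern is shown
   below: several worker threads interact with their own copies of the
   environment and apply updates to shared parameters without an
   experience replay memory.  The gradient computation itself is
   abstracted into the 'compute_update' placeholder, so this is only
   the coordination skeleton, not a complete A3C implementation.

      import threading
      import numpy as np

      shared_params = np.zeros(16)  # assumed global parameters
      lock = threading.Lock()
      LEARNING_RATE = 0.01

      def worker(env, compute_update, steps=1000):
          """One worker: learn from its own environment copy."""
          global shared_params
          for _ in range(steps):
              local = shared_params.copy()   # parameter snapshot
              update = compute_update(env, local)
              with lock:                     # asynchronous apply
                  shared_params += LEARNING_RATE * update

      def run_workers(envs, compute_update):
          threads = [threading.Thread(target=worker,
                                      args=(e, compute_update))
                     for e in envs]
          for t in threads:
              t.start()
          for t in threads:
              t.join()
          return shared_params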

5.6.  Policy using Distance and Frequency

   The Distance-and-Frequency algorithm uses the state occurrence
   frequency in addition to the distance to the goal.  It avoids
   deadlocks, lets the agent escape dead ends, and was derived to
   improve the agent's optimal learning speed.  Distance-and-Frequency
   is based on additional levels of agent visibility and enhances the
   learning algorithm by also using the state occurrence frequency
   [Al-Dayaa].
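
   As an illustration, the Python action-selection rule below prefers
   successor states that are close to the goal and penalizes states the
   agent has visited often.  The weighting constants and the helper
   functions are assumptions of this sketch, not the exact policy of
   [Al-Dayaa].

      W_DIST, W_FREQ = 1.0, 0.5  # assumed weighting constants

      def choose_action(state, actions, successor,
                        distance_to_goal, visit_count):
          """Prefer next states near the goal but rarely visited."""
          def score(s):
              return (-W_DIST * distance_to_goal(s)
                      - W_FREQ * visit_count.get(s, 0))
          scores = {a: score(successor(state, a)) for a in actions}
          best = max(scores, key=scores.get)
          chosen = successor(state, best)
          visit_count[chosen] = visit_count.get(chosen, 0) + 1
          return best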

5.7.  Distributed Computing Node

   The autonomous multi-agent learning process for a network management
   environment involves transferring optimized knowledge between agents
   on a given local node, or between distributed memory nodes, over a
   communication network.

5.8.  Agent Sharing Information

   This is a technique describing how agents can share information for
   an optimal learning process.  The quality of agent decision making
   often depends on the willingness of agents to share the learning
   information collected by their learning processes.  Sharing
   information means that an agent shares and communicates the
   knowledge it has learned and acquired with other agents using RL.

   Agents normally have limited resources and incomplete knowledge
   during learning exploration.  For that reason, each agent takes
   actions and transfers its states to the global environment under RL,
   which then shares the information with the other agents; all agents
   explore to reach their goals via a distributed, reward-based
   reinforcement learning method on the existing local distributed
   memory nodes.

   MPI (Message Passing Interface) is used as the communication
   mechanism.  Even if the agents do not individually have the
   capabilities and resources to monitor an entire given large terrain
   environment, they are able to share the needed information and
   manage a collaborative learning process for optimized management in
   distributed networking nodes [Chowdappa][Minsuk].
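
   The fragment below sketches, using the mpi4py binding of MPI, how
   distributed agents might exchange locally learned Q-values through
   the global environment at rank 0.  The merge rule (keeping the best
   value seen for each state-action pair) is an assumption made for
   illustration, not a method mandated by this document.

      from mpi4py import MPI

      comm = MPI.COMM_WORLD
      rank = comm.Get_rank()  # rank 0 is the global environment

      def share_knowledge(local_q):
          """Gather all agents' Q-values, merge, and broadcast."""
          tables = comm.gather(local_q, root=0)
          merged = None
          if rank == 0:
              merged = {}
              for table in tables:
                  for key, value in table.items():
                      # keep the best value for each (state, action)
                      merged[key] = max(value, merged.get(key, value))
          return comm.bcast(merged, root=0)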

6.  Proposed Architecture

6.1.  Architecture for Reinforcement Learning

   The architecture using reinforcement learning describes a
   collaborative multi-agent-based system in distributed environments,
   as shown in Figure 1.  It is a hybrid architecture making use of
   both a master/slave structure and peer-to-peer communication.  The
   centralized node (global environment) assigns each slave computing
   node a portion of the distributed terrain and an initial number of
   agents; a simple sketch of this assignment is given at the end of
   this section.



         +-------------+
         |             |                       +-----------------+
         |             |<...... node 1 .......>|    terrain 1    |
         |             |                       +-----------------+
         | Global env. |                           +        |
         |  (node 0)   |                           |        |
         |             |                           |        +
         |             |                       +-----------------+
         |             |<...... node 2 .......>|    terrain 2    |
         |             |                       +-----------------+
         +-------------+

        Figure 1: Hybrid P2P and Master/Slave Architecture Overview

   Reinforcement Learning (RL) actions involve interacting with a given
   environment, so the environment provides the agent learning process
   with the following elements:

   o  Agent control actions, large states, and cumulative rewards

   o  An initial data-set in memory

   o  A random or learning-based process in a given node

   o  Next, optimization in the neural network under reinforcement
      learning

   Additionally, the agent acts on states toward its goal as follows:

   o  The agent continuously takes control actions to reach the next
      optimized state based on its policy and reward

   o  After an agent reaches its goal, it can feed the information
      collected by the random or learning process back into the next
      learning process for optimal management

   o  The agent learning process is optimized in the following phases
      and exploratory learning trials
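
   A minimal Python sketch of how the global environment (node 0) might
   split the terrain and assign the initial agents to the slave nodes
   of Figure 1 is given below; the data layout (a dictionary per node)
   is an illustrative assumption.

      def assign_work(num_nodes, terrain_cells, num_agents):
          """Split the terrain and agents among the slave nodes."""
          per_node = len(terrain_cells) // num_nodes
          agents_each = num_agents // num_nodes
          assignments = {}
          for node in range(1, num_nodes + 1):  # node 0 = master
              start = (node - 1) * per_node
              assignments[node] = {
                  "terrain": terrain_cells[start:start + per_node],
                  "agents": agents_each,
              }
          return assignments

      # Example: two slave nodes (as in Figure 1), 10 terrain cells,
      # and 4 agents in total.
      print(assign_work(2, list(range(10)), 4))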

6.2.  Architecture for Deep Reinforcement Learning

   As shown in Figure 2, we illustrate the fundamental architecture for
   the relationship among action, state, and reward; each agent
   explores to reach its goal(s) under deep reinforcement learning.
   The agent takes an action that leads to a reward for achieving an
   optimal path toward its goal.


                            DRL Network
                            +----------------------------------+
                            |Q-Value1|                         |
                            |--------+    +-------+    +------+|
          ......Action......|Q-Value2|----|Network|----|States||<...
          .                 |--------+    +-------+    +------+|   .
          .                 |Q-Value3|                         |   .
          .                 +----------------------------------+   .
          .                                                        .
        +---------+----------+                                     .
        | Global Environment |                                     .
        +---------+----------+                                     .
          .                                                        .
          .                                                        .
          .           +-------------------+                +----------+
          ...........>+ Large State Space +....States.....>+ D-Memory +
                      +-------------------+                +----------+


                     Figure 2: DRL work-flow Overview

   A deep reinforcement learning network can use a convolutional neural
   network to overcome the problematic issues of reinforcement learning
   and successfully learn a control policy from raw data in a complex
   environment.  It is also used with an experience replay memory that
   randomly samples previous transitions, thereby smoothing the
   training distribution over many past behaviors [Mnih].
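
   The experience replay memory mentioned above can be sketched in
   Python as follows; the default capacity and batch size are
   illustrative assumptions.

      import random
      from collections import deque

      class ReplayMemory:
          """Stores past transitions and samples them at random."""
          def __init__(self, capacity=10000):
              self.buffer = deque(maxlen=capacity)

          def push(self, transition):
              # transition = (state, action, reward, next_state, done)
              self.buffer.append(transition)

          def sample(self, batch_size=32):
              # random sampling smooths the training distribution
              return random.sample(self.buffer, batch_size)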

7.  Use case of Reinforcement Learning

7.1.  Distributed Multi-agent Reinforcement Learning (RL): Sharing
      Information Technique

   In this section, we deal with the case of collaborative distributed
   multi-agents, where each agent has the same or a different
   individual goal in a distributed environment.  Since the
   information-sharing scheme among the agents is a problematic one, we
   need to expand on the work described by solving the challenging
   cases.

   Basically, the main proposed algorithm for distributed multi-agent
   RL is presented below:

   +-------------------------------------------------------------------+
   | Proposed Algorithm                                                |
   +-------------------------------------------------------------------+
   | (1) Let Ni denote the number of nodes (i = 1, 2, 3 ...)           |
   |                                                                   |
   | (2) Let Aj denote the number of agents                            |
   |                                                                   |
   | (3) Let Dk denote the number of goals                             |
   |                                                                   |
   | (4) Place initial number of agents Aj in random positions (Xm,    |
   | Yn)                                                               |
   |                                                                   |
   | (5) Initialization of data-set memory for neural network          |
   |                                                                   |
   | (6) Copy neural network Q and store as the data-set memory        |
   |                                                                   |
   | (7) Every Aj in Ni                                                |
   |                                                                   |
   | -----> (a) Do initial exploration (random) to corresponding Dk    |
   |                                                                   |
   | -----> (b) Do exploration (using RL) for Tx trials, where Tx      |
   | denotes the number of trials                                      |
   +-------------------------------------------------------------------+

                        Table 1: Proposed Algorithm

   +-------------------------------------------------------------------+
   | Random Trial                                                      |
   +-------------------------------------------------------------------+
   | (1) Let Si denote the current state                               |
   |                                                                   |
   | (2) Relinquish Si so that the other agent can occupy the position |
   |                                                                   |
   | (3) Assign the agent a new position                               |
   |                                                                   |
   | (4) Update the current state Si -> Si+1                           |
   +-------------------------------------------------------------------+

                           Table 2: Random Trial

   +-------------------------------------------------------------------+
   | Optimal Trial                                                     |
   +-------------------------------------------------------------------+
   | (1) Let Si denote the current state                               |
   |                                                                   |
   | (2) Let ACj denote a control action                               |
   |                                                                   |
   | (3) Let DRm denote the discounted reward                          |
   |                                                                   |
   | (4) Choose ACj <- Policy(Si, ACj) in neural network               |
   |                                                                   |
   | (5) Update and copy the network for learning process in the       |
   | global environment                                                |
   |                                                                   |
   | (6) Update the current state Si -> Si+1                           |
   |                                                                   |
   | (7) Repeat an available network control action                    |
   +-------------------------------------------------------------------+

                          Table 3: Optimal Trial

   Multi-agent reinforcement learning in distributed nodes can improve
   the overall system performance by transferring or sharing
   information from one node to another in the following cases:
   expanded complexity in the RL technique with various experimental
   factors and conditions, and analysis of multi-agent information
   sharing for the agent learning process.
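
   A compact Python sketch of how the proposed algorithm, the random
   trial, and the optimal trial of Tables 1-3 fit together is given
   below.  The agent and policy objects and their methods are
   placeholders assumed for illustration; the sketch is not a normative
   restatement of the algorithm.

      def run_proposed_algorithm(nodes, policy, trials=10):
          """Tables 1-3: a random trial, then RL-guided trials."""
          for node in nodes:              # (7) for every Aj in Ni
              for agent in node.agents:
                  random_trial(agent)      # (a) initial exploration
                  for _ in range(trials):  # (b) exploration with RL
                      optimal_trial(agent, policy)

      def random_trial(agent):
          """Table 2: move the agent to a new random position."""
          agent.relinquish_position()     # free it for other agents
          agent.position = agent.random_free_position()
          agent.state += 1                # Si -> Si+1

      def optimal_trial(agent, policy):
          """Table 3: pick a control action from the policy."""
          action = policy.choose(agent.state)  # Policy(Si, ACj)
          reward = agent.apply(action)   # discounted reward DRm
          policy.update(agent.state, action, reward)
          agent.state += 1                # Si -> Si+1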

7.2.  Intelligent Edge Computing technique for Traffic Control using
      Deep Reinforcement Learning

   Edge computing is a concept that allows data from a variety of
   devices to be analyzed directly at the site or near the data, rather
   than being sent to a centralized data center such as the cloud.  As
   such, edge computing supports data-flow acceleration by processing
   data with low latency in real time.  In addition, by supporting
   efficient processing of large amounts of data near the source,
   Internet bandwidth usage is also reduced.  Deep reinforcement
   learning would be a useful technique to improve system performance
   in an intelligent edge-controlled service system requiring fast
   response time, reliability, and security.  Deep reinforcement
   learning is a model-free approach, so many algorithms such as DQN,
   A2C, and A3C can be adopted to resolve network problems in
   time-sensitive systems.

7.3.  Edge computing system in a field of construction works using
      Reinforcement Learning

   On a construction site there are many dangerous elements, such as
   noise, gas leaks, and vibration, that require alerts; a real-time
   monitoring system that detects these alerts using machine learning
   techniques (DL, RL) can provide a more effective solution and
   approach to recognizing dangerous construction elements.

   Typically, to monitor these elements, CCTV (closed-circuit
   television) is broadcast locally and continuously at the
   construction site.  It is ineffective and wasteful for the CCTV to
   constantly broadcast unchanging scenes in high definition.  However,
   when an alert is raised because of a dangerous element, the stream
   should be converted to high-quality streaming data to rapidly show
   and detect the dangerous situation.  Technically, DL is one of the
   solutions for automatically detecting these kinds of dangerous
   situations by prediction in advance.  It can provide the transformed
   data, including the high-rate streaming video, and quickly prevent
   further risks.  RL additionally plays an important role in
   efficiently managing and monitoring the given dataset in real time.

   [TBD]

7.4.  Fault prediction for core-network using Deep Learning

   EPC equipment such as the PGW, SGW, MME, HSS, and PCRF in the LTE
   core network sends and receives messages over interfaces based on
   the 3GPP standard specifications.  This EPC equipment could be used
   to create training data and a model that predicts/detects the
   precursor symptoms occurring before a networking failure, once
   failures of specific equipment and of LTE network services have been
   discovered.  In addition, Deep Learning (DL) can predict various
   network faults using features such as inbound/outbound traffic,
   CPU/memory resource information, and QoS performance in the case of
   IP core network equipment.

   [TBD]

8.  IANA Considerations

   There are no IANA considerations related to this document.

9.  Security Considerations

   [TBD]

10.  References

10.1.  Normative References

   [RFC2119]  Bradner, S., "Key words for use in RFCs to Indicate
              Requirement Levels", BCP 14, RFC 2119,
              DOI 10.17487/RFC2119, March 1997,
              <https://www.rfc-editor.org/info/rfc2119>.

10.2.  Informative References

   [I-D.jiang-nmlrg-network-machine-learning]
              Jiang, S., "Network Machine Learning", Internet-Draft,
              draft-jiang-nmlrg-network-machine-learning-02, October
              2016.

   [Megherbi]
              "Megherbi, D. B., Kim, Minsuk, Madera, Manual., A Study of
              Collaborative Distributed Multi-Goal and Multi-agent based
              Systems for Large Critical Key Infrastructures and
              Resources (CKIR) Dynamic Monitoring and Surveillance, IEEE
              International Conference on Technologies for Homeland
              Security", 2013.

   [Teiralbar]
              "Megherbi, D. B., Teiralbar, A. Boulenouar, J., A Time-
              varying Environment Machine Learning Technique for
              Autonomous Agent Shortest Path Planning, Proceedings of
              SPIE International Conference on Signal and Image
              Processing, Orlando, Florida", 2001.

   [Nasim]    "Nasim Arianpoo, Victor C.M. Leung, How network
              monitoring and reinforcement learning can improve TCP
              fairness in wireless multi-hop networks, EURASIP Journal
              on Wireless Communications and Networking", 2016.

   [Minsuk]   "Dalila B. Megherbi and Minsuk Kim, A Hybrid P2P and
              Master-Slave Cooperative Distributed Multi-Agent
              Reinforcement Learning System with Asynchronously
              Triggered Exploratory Trials and Clutter-index-based
              Selected Sub goals, IEEE CIG Conference", 2016.

   [April]    "April Yu, Raphael Palefsky-Smith, Rishi Bedi, Deep
              Reinforcement Learning for Simulated Autonomous Vehicle
              Control, Stanford University", 2016.

   [Markus]   "Markus Kuderer, Shilpa Gulati, Wolfram Burgard, Learning
              Driving Styles for Autonomous Vehicles from Demonstration,
              Robotics and Automation (ICRA)", 2015.

   [Ann]      "Ann Nowe, Peter Vrancx, Yann De Hauwere, Game Theory and
              Multi-agent Reinforcement Learning, In book: Reinforcement
              Learning: State of the Art, Edition: Adaptation, Learning,
              and Optimization Volume 12", 2012.

   [Kok-Lim]  "Kok-Lim Alvin Yau, Hock Guan Goh, David Chieng, Kae
              Hsiang Kwong, Application of Reinforcement Learning to
              wireless sensor networks: models and algorithms, Published
              in Journal Computing archive Volume 97 Issue 11, Pages
              1045-1075", November 2015.

   [Sutton]   "Sutton, R. S., Barto, A. G., Reinforcement Learning: an
              Introduction, MIT Press", 1998.

   [Madera]   "Madera, M., Megherbi, D. B., An Interconnected Dynamical
              System Composed of Dynamics-based Reinforcement Learning
              Agents in a Distributed Environment: A Case Study,
              Proceedings IEEE International Conference on Computational
              Intelligence for Measurement Systems and Applications,
              Italy", 2012.

   [Al-Dayaa]
              "Al-Dayaa, H. S., Megherbi, D. B., Towards A Multiple-
              Lookahead-Levels Reinforcement-Learning Technique and Its
              Implementation in Integrated Circuits, Journal of
              Artificial Intelligence, Journal of Supercomputing. Vol.
              62, issue 1, pp. 588-61", 2012.

   [Chowdappa]
              "Chowdappa, Aswini., Skjellum, Anthony., Doss, Nathan,
              Thread-Safe Message Passing with P4 and MPI, Technical
              Report TR-CS-941025, Computer Science Department and NSF
              Engineering Research Center, Mississippi State
              University", 1994.

   [Mnih]     "V. Mnih et al., Human-level Control Through Deep
              Reinforcement Learning, Nature 518.7540", 2015.

   [Stampa]   "G. Stampa, M. Arias, et al., A Deep-reinforcement
              Learning Approach for Software-defined Networking
              Routing Optimization, cs.NI", 2017.

   [Krizhevsky]
              "A. Krizhevsky, I. Sutskever, and G. Hinton, Imagenet
              classification with deep convolutional neural networks,
              In Advances in Neural Information Processing Systems,
              1106-1114", 2012.

   [Volodymyr]
              "Volodymyr Mnih et al., Asynchronous Methods for Deep
              Reinforcement Learning, ICML, arXiv:1602.01783", 2016.

Authors' Addresses

   Min-Suk Kim
   ETRI
   161 Gajeong-Dong Yuseung-Gu
   Daejeon  305-700
   Korea

   Phone: +82 42 860 5930
   Email: mskim16@etri.re.kr


   Yong-Geun Hong
   ETRI
   161 Gajeong-Dong Yuseung-Gu
   Daejeon  305-700
   Korea

   Phone: +82 42 860 6557
   Email: yghong@etri.re.kr


   Youn-Hee Han
   KoreaTech
   Byeongcheon-myeon Gajeon-ri, Dongnam-gu
   Choenan-si, Chungcheongnam-do
   330-708
   Korea

   Phone: +82 41 560 1486
   Email: yhhan@koreatech.ac.kr


   Tae-Jin Ahn
   Korea Telecom
   70 Yuseong-daero 1689 Beon-gil Yuseung-Gu
   Daejeon  305-811
   Korea

   Phone: +82 42 870 8409
   Email: Taejin.ahn@kt.com


   Kwi-Hoon Kim
   ETRI
   161 Gajeong-Dong Yuseung-Gu
   Daejeon  305-700
   Korea

   Phone: +82 42 860 6746
   Email: kwihooi@etri.re.kr