Generative Adversarial Network for Imitation Learning from Single Demonstration

Abstract: Imitation learning is an effective method for training an autonomous agent to accomplish a task by imitating expert behaviors in their demonstrations. However, traditional imitation learning methods require a large number of expert demonstrations in order to learn a complex behavior, which has limited the potential of imitation learning in complex tasks where sufficient expert demonstrations are unavailable. To address this problem, a Generative Adversarial Network-based model is proposed that is designed to learn optimal policies using only a single demonstration. The proposed model is evaluated on two simulated tasks in comparison with other methods. The results show that our proposed model is capable of completing the considered tasks despite being limited to a single expert demonstration, which clearly indicates the potential of our model.


Introduction:
Imitation learning, also known as learning from demonstration, has recently gained significant attention since it enables training autonomous agents in complex environments where reward functions are unavailable. The main goal of imitation learning is to imitate expert behaviors in their demonstrations by learning a mapping between observation states and actions. It is widely adopted in robotics and human-computer interaction 1, for example in self-driving vehicles 2-4 and social robot interaction 5,6. However, traditional imitation learning algorithms usually require a significant number of demonstrations in order to acquire complex behaviors from the expert, since it is challenging to train an agent using only one or a few demonstrations.
Indeed, humans are capable of learning a new behavior after observing an expert perform it just once. Inspired by this human ability, the authors in 7 proposed a reinforcement learning model that solves an Atari game using only one demonstration. The model is trained by starting from a carefully selected state in the demonstration. One main drawback of the model is that it requires reward signals at every timestep. In a game environment, these reward signals can be easily collected; however, in a typical imitation learning setting, reward functions are unavailable and can be difficult to define manually. Therefore, in this paper, a model that imitates expert behaviors from a single demonstration is proposed. The model leverages a Generative Adversarial Network (GAN) in order to learn optimal policies without having access to the reward function. The main contributions of this paper are as follows:
- A GAN-based model is presented for imitating expert behaviors from a single demonstration.
- A comprehensive evaluation is conducted, which demonstrates the potential of our proposed model.
The rest of the paper is organized as follows. First, the related works of the proposed model are introduced. Second, the imitation learning problem is formulated. Third, the proposed model is presented in detail. Fourth, the proposed model is evaluated and the results are analyzed. Finally, the paper is concluded.

Related Works:
Imitation learning has been successfully applied to train autonomous agents in many fields 1,4,5,8. Behavioral Cloning (BC) and Inverse Reinforcement Learning (IRL) are the two main approaches to imitation learning. Behavioral Cloning 9 utilizes supervised learning in order to mimic expert behaviors. Although BC is a straightforward method, it is vulnerable to the distribution shift between the training and testing data. In contrast, IRL 10 has succeeded in a wide range of tasks by first recovering a reward function from expert demonstrations and then leveraging it to find an optimal policy. However, IRL incurs an extremely high computational cost since iterations of reinforcement learning are involved during the training phase 11-13. To overcome this drawback, recent studies 14,15 have applied Generative Adversarial Networks 16 to imitate expert behaviors by finding a mapping between states and actions. However, the above-mentioned methods require a significant number of demonstrations during the training phase. Ideally, the agent should have the same ability as humans, who can imitate expert behaviors from only one or a few demonstrations.
The work in 7 proposed a reinforcement learning model that learns to play an Atari game using only one demonstration. A carefully selected state from the demonstration is input to the model at each training step in order to imitate the expert behaviors and avoid learning a sub-optimal solution. However, the model requires reward signals at every timestep. While these reward signals can be easily collected in an Atari game environment, in a typical imitation learning setting, reward functions are unavailable and can be difficult to define manually.
On the other hand, our proposed model leverages a GAN to imitate expert behaviors without the need for a reward function. Moreover, the proposed model is capable of learning from only one expert demonstration and provides a competitive performance in such a challenging setting.

Problem Formulation:
In this paper, the imitation learning problem is described as a Markov Decision Process (MDP) with a finite time horizon:

$$\mathcal{M} = (\mathcal{S}, \mathcal{A}, \mathcal{P}, T) \quad (1)$$

where $\mathcal{S}$ denotes the state space, $\mathcal{A}$ is the action space, $\mathcal{P}: \mathcal{S} \times \mathcal{A} \rightarrow \mathcal{S}$ represents the transition function, and $T$ is the time horizon. It is important to note that a shaped reward function is unavailable in imitation learning. A policy $\pi: \mathcal{S} \rightarrow \mathcal{A}$ represents a mapping from observation states to actions. An expert demonstration $\tau_E = \{(s_t, a_t): t \in [0, T]\}$ is a sequence of state-action pairs. Our main objective is to learn an optimal policy $\pi^*$ given a single demonstration $\tau_E$.
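For concreteness, a single demonstration $\tau_E$ can be stored as two aligned arrays of states and actions. The sketch below is a minimal illustration of this layout, not part of the paper; the shapes assume a 3-dimensional state and a 1-dimensional action, matching the Pendulum task used later in the evaluation:

```python
import numpy as np

# Hypothetical storage for one demonstration tau_E = {(s_t, a_t) : t in [0, T]}.
# Shapes assume a 3-dim state and 1-dim action (as in Pendulum); T is the horizon.
T = 200
states = np.zeros((T, 3), dtype=np.float32)   # s_0, ..., s_{T-1}
actions = np.zeros((T, 1), dtype=np.float32)  # a_0, ..., a_{T-1}

def state_action_pair(t):
    """Return the pair (s_t, a_t) at timestep t."""
    return states[t], actions[t]
```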

The Proposed Model:
In this section, our proposed model is presented. The model leverages a Generative Adversarial Network in order to learn optimal policies from a single expert demonstration $\tau_E$. The architecture of the model is illustrated in Fig. 1. The model includes two deep feed-forward networks: a generator $G$ and a discriminator $D$.
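As an illustration, the two networks can be implemented as small multilayer perceptrons. The PyTorch sketch below uses the layer sizes reported in the evaluation section (two hidden layers of 32 nodes); the class names and Tanh activations are our own assumptions, not details specified in the paper:

```python
import torch
import torch.nn as nn

class Generator(nn.Module):
    """Policy network: maps a state s to an action a = G(s)."""
    def __init__(self, state_dim, action_dim, hidden=32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.Tanh(),
            nn.Linear(hidden, hidden), nn.Tanh(),
            nn.Linear(hidden, action_dim),
        )

    def forward(self, s):
        return self.net(s)

class Discriminator(nn.Module):
    """Classifier: maps a state-action pair (s, a) to the probability it is expert data."""
    def __init__(self, state_dim, action_dim, hidden=32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden), nn.Tanh(),
            nn.Linear(hidden, hidden), nn.Tanh(),
            nn.Linear(hidden, 1), nn.Sigmoid(),
        )

    def forward(self, s, a):
        return self.net(torch.cat([s, a], dim=-1))
```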
The discriminator $D$ is trained to distinguish between a state-action pair $(s, a)$ from the expert and a state-action pair $(s, \hat{a})$ generated by the generator. Meanwhile, the generator $G$ aims to produce an action $\hat{a} = G(s)$ so that $(s, \hat{a})$ looks as similar as possible to the expert pair $(s, a)$. The model finds optimal policies by playing a min-max game in which the discriminator is trained together with the generator using the following objective function 14,16:

$$\min_G \max_D V(G, D) = \mathbb{E}_{(s,a) \sim \tau_E}\big[\log D(s, a)\big] + \mathbb{E}_{s \sim \tau_E}\big[\log\big(1 - D(s, G(s))\big)\big] \quad (2)$$

The model acquires optimal policies by finding a saddle point of this objective, where:

$$(G^*, D^*) = \arg\min_G \max_D V(G, D) \quad (3)$$

In order to train the model with only one demonstration, the demonstration is divided into multiple sub-demonstrations $\tau_i = \{(s_t, a_t): t \in [i, i + L)\}$ of the same length $0 < L \leq T$, where $i = 0, 1, \ldots, (T - L + 1)$ is the starting timestep. Within each training iteration, the sub-demonstrations are fed into the model in random order to prevent overfitting and improve the stability of training.
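The sketch below illustrates how one training iteration under this scheme could look: the demonstration is sliced into overlapping sub-demonstrations of length $L$, visited in random order, and each slice drives one discriminator update and one generator update. The binary cross-entropy losses mirror the standard GAN objective above; the batch handling and update order are our assumptions:

```python
import numpy as np
import torch

def sub_demonstrations(states, actions, L):
    """Yield sub-demonstrations tau_i = {(s_t, a_t) : t in [i, i+L)} in random order."""
    T = len(states)
    starts = np.arange(T - L + 1)
    np.random.shuffle(starts)  # random order to prevent overfitting
    for i in starts:
        yield states[i:i + L], actions[i:i + L]

def train_iteration(G, D, opt_G, opt_D, states, actions, L=32):
    """One pass over all sub-demonstrations of a single expert demonstration."""
    bce = torch.nn.BCELoss()
    real, fake = torch.ones(L, 1), torch.zeros(L, 1)
    for s_np, a_np in sub_demonstrations(states, actions, L):
        s = torch.as_tensor(s_np)
        a_expert = torch.as_tensor(a_np)

        # Discriminator step: push D(s, a_expert) -> 1 and D(s, G(s)) -> 0.
        a_gen = G(s).detach()  # detach so G is not updated on this step
        loss_D = bce(D(s, a_expert), real) + bce(D(s, a_gen), fake)
        opt_D.zero_grad(); loss_D.backward(); opt_D.step()

        # Generator step: push D(s, G(s)) -> 1, i.e. fool the discriminator.
        loss_G = bce(D(s, G(s)), real)
        opt_G.zero_grad(); loss_G.backward(); opt_G.step()
```

Note that slicing a length-$T$ demonstration into $T - L + 1$ overlapping windows multiplies the number of training samples extracted from a single trajectory, which is what makes training with only one demonstration feasible.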

Performance Evaluation:
In this section, the performance of the proposed model is evaluated. The evaluation settings and results are presented in the following subsections.

Figure 2. Visual rendering of two simulated environments used in the evaluation
Two simulated environments are considered:
- Pendulum 17: The pendulum starts at a random position. The goal of the task is to swing the pendulum up and keep it upright.
- CartPole 17,18: A pole is attached to a cart. The goal is to keep the pole upright by applying a force of +1 or -1 to the cart.
The visualizations of the two environments are shown in Fig. 2. For each environment, one demonstration is collected by training Trust Region Policy Optimization (TRPO) 19, a reinforcement learning algorithm that optimizes the learned policies using gradient-based updates. TRPO is trained with direct access to the environment and the shaped reward. In addition, the performance of our proposed model is compared with TRPO; this baseline sets an upper bound for the performance of our proposed model. The proposed model and TRPO are run on a personal computer with an Intel i7-8750H @ 2.20GHz and 16GB of RAM.
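For reference, once the TRPO expert has been trained, a single demonstration can be recorded by rolling the policy out in the environment and logging each state-action pair. Below is a hypothetical sketch using the classic Gym step API; the callable `trpo_policy` stands in for the trained expert, which the paper obtains separately:

```python
import gym
import numpy as np

def collect_demonstration(env_name, trpo_policy):
    """Roll out a trained expert policy once and record every (s_t, a_t) pair."""
    env = gym.make(env_name)
    states, actions = [], []
    s, done = env.reset(), False
    while not done:
        a = trpo_policy(s)            # action chosen by the trained expert
        states.append(s)
        actions.append(a)
        s, _, done, _ = env.step(a)   # the reward is discarded: only (s, a) is kept
    return np.array(states), np.array(actions)
```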

Network Structure and Hyperparameters
The generator and discriminator are deep feed-forward networks with 2 hidden layers of 32 nodes each. Adam 20, a stochastic gradient descent algorithm, is used to optimize the proposed model during the training phase, with a learning rate of 0.0003.

Results
Fig. 3 and 4 visualize the behaviors of the policies learned by the evaluated models on the Pendulum and CartPole environments, respectively. As observed in Fig. 3, the policies trained with TRPO swing the pendulum up faster and keep it vertical for a longer period of time than our proposed model. In contrast, the policies learned by our proposed model initially have trouble applying a force strong enough to swing the pendulum upright. However, once the pendulum is upright, the learned policies can apply a few light forces to keep it vertical. For the CartPole environment in Fig. 4, it can be observed that the policies trained with both TRPO and our proposed model can move the cart so as to prevent the pole from falling over.
Tables 1 and 2 tabulate the comparison between the proposed model and TRPO in terms of average cumulative reward and average training time. It can be observed from Table 1 that TRPO outperforms our proposed model in terms of average cumulative reward on both environments. However, this result is expected since TRPO has direct access to the states and the reward function of the environment in order to optimize its policies. On the other hand, the proposed model is trained using only one expert demonstration and without access to any reward signals, yet it provides a competitive performance, especially in the CartPole environment. Moreover, according to Table 2, while TRPO takes more than 2 hours to finish training, the proposed model is about 5 times faster. Even though it presents lower average cumulative reward values, the proposed model achieves a competitive performance while requiring a significantly shorter training time. These results clearly indicate the potential of our proposed model.
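The average cumulative reward reported in Table 1 can be measured by rolling out each learned policy for a number of episodes and averaging the per-episode returns. A minimal sketch follows; the episode count and the classic Gym step API are our assumptions:

```python
def average_cumulative_reward(env, policy, episodes=100):
    """Average per-episode return of a policy (classic Gym step API assumed)."""
    returns = []
    for _ in range(episodes):
        s, done, total = env.reset(), False, 0.0
        while not done:
            s, r, done, _ = env.step(policy(s))
            total += r
        returns.append(total)
    return sum(returns) / episodes
```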

Conclusion:
In this paper, a model was proposed that utilizes a Generative Adversarial Network to imitate expert behaviors using only one demonstration. Despite such a challenging setting, the model successfully learns optimal policies on two simulated environments. In comparison with TRPO, a reinforcement learning model, the proposed model provides a competitive performance with a substantially shorter training time. The results demonstrate that the proposed model is a promising approach to imitation learning. In future work, our goal is to improve the performance of our proposed model on more complex imitation tasks.

Authors' declaration:
- Conflicts of Interest: None.
- We hereby confirm that all the Figures and Tables in the manuscript are ours. Furthermore, any Figures and images that are not ours have been given the permission for republication, attached with the manuscript.
- The author has signed an animal welfare statement.
- Ethical Clearance: The project was approved by the local ethical committee at Shibaura Institute of Technology.