
Efficient Exploration by Novelty-Pursuit

  • Conference paper
  • Distributed Artificial Intelligence (DAI 2020)
  • Part of the book series: Lecture Notes in Computer Science (LNAI, volume 12547)

Abstract

Efficient exploration is essential for reinforcement learning in tasks with a huge state space and a long planning horizon. Recent approaches to this problem include intrinsically motivated goal exploration processes (IMGEP) and maximum state entropy exploration (MSEE). In this paper, we propose a goal-selection criterion for IMGEP based on the principle of MSEE, which results in the new exploration method novelty-pursuit. Novelty-pursuit performs exploration in two stages: first, it selects a seldom-visited state as the target for a goal-conditioned exploration policy to reach the boundary of the explored region; then, it takes random actions to explore the unexplored region beyond that boundary. We demonstrate the effectiveness of the proposed method in environments ranging from simple mazes and MuJoCo tasks to the long-horizon video game SuperMarioBros. Experimental results show that the proposed method outperforms state-of-the-art approaches that use curiosity-driven exploration.

Z. Li and X.-H. Chen contributed equally to this work.


Notes

  1. We sample goals from a distribution (e.g., a softmax distribution) based on their prediction errors, rather than in a greedy way.
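
A minimal sketch of this non-greedy selection, assuming candidate goals and their RND prediction errors are already available; the function name and the temperature parameter are ours, for illustration only:

```python
import numpy as np

def sample_goal(candidate_goals, prediction_errors, temperature=1.0, rng=None):
    """Sample a goal from a softmax distribution over prediction errors.

    Goals with larger prediction errors (i.e., more novel states) are more
    likely to be selected, but the choice is stochastic rather than greedy.
    """
    rng = rng or np.random.default_rng()
    logits = np.asarray(prediction_errors, dtype=np.float64) / temperature
    logits -= logits.max()                       # numerical stability
    probs = np.exp(logits) / np.exp(logits).sum()
    return candidate_goals[rng.choice(len(candidate_goals), p=probs)]
```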

References

  1. Andrychowicz, M., et al.: Hindsight experience replay. In: Proceedings of the 30th Annual Conference on Neural Information Processing Systems, pp. 5048–5058 (2017)
  2. Auer, P., Cesa-Bianchi, N., Fischer, P.: Finite-time analysis of the multiarmed bandit problem. Machine Learning 47(2–3), 235–256 (2002). https://doi.org/10.1023/A:1013689704352
  3. Baranes, A., Oudeyer, P.: R-IAC: robust intrinsically motivated exploration and active learning. IEEE Trans. Auton. Ment. Dev. 1(3), 155–169 (2009)
  4. Bellemare, M.G., Srinivasan, S., Ostrovski, G., Schaul, T., Saxton, D., Munos, R.: Unifying count-based exploration and intrinsic motivation. In: Proceedings of the 29th Annual Conference on Neural Information Processing Systems, pp. 1471–1479 (2016)
  5. Brockman, G., et al.: OpenAI Gym. CoRR abs/1606.01540 (2016)
  6. Burda, Y., Edwards, H., Pathak, D., Storkey, A.J., Darrell, T., Efros, A.A.: Large-scale study of curiosity-driven learning. In: Proceedings of the 7th International Conference on Learning Representations (2019)
  7. Burda, Y., Edwards, H., Storkey, A.J., Klimov, O.: Exploration by random network distillation. In: Proceedings of the 7th International Conference on Learning Representations (2019)
  8. Chen, X., Yu, Y.: Reinforcement learning with derivative-free exploration. In: Proceedings of the 18th International Conference on Autonomous Agents and MultiAgent Systems, pp. 1880–1882 (2019)
  9. Chevalier-Boisvert, M., Willems, L., Pal, S.: Minimalistic gridworld environment for OpenAI Gym (2018). https://github.com/maximecb/gym-minigrid
  10. Choromanski, K., Pacchiano, A., Parker-Holder, J., Tang, Y., Sindhwani, V.: From complexity to simplicity: adaptive ES-active subspaces for blackbox optimization. In: Proceedings of the 32nd Annual Conference on Neural Information Processing Systems, pp. 10299–10309 (2019)
  11. Dhariwal, P., et al.: OpenAI Baselines (2017). https://github.com/openai/baselines
  12. Ecoffet, A., Huizinga, J., Lehman, J., Stanley, K.O., Clune, J.: Go-Explore: a new approach for hard-exploration problems. CoRR abs/1901.10995 (2019)
  13. Florensa, C., Held, D., Geng, X., Abbeel, P.: Automatic goal generation for reinforcement learning agents. In: Proceedings of the 35th International Conference on Machine Learning, pp. 1514–1523 (2018)
  14. Forestier, S., Mollard, Y., Oudeyer, P.: Intrinsically motivated goal exploration processes with automatic curriculum learning. CoRR abs/1708.02190 (2017)
  15. Fortunato, M., et al.: Noisy networks for exploration. In: Proceedings of the 6th International Conference on Learning Representations (2018)
  16. Fujimoto, S., Meger, D., Precup, D.: Off-policy deep reinforcement learning without exploration. In: Proceedings of the 36th International Conference on Machine Learning, pp. 2052–2062 (2019)
  17. Haarnoja, T., Zhou, A., Abbeel, P., Levine, S.: Soft actor-critic: off-policy maximum entropy deep reinforcement learning with a stochastic actor. In: Proceedings of the 35th International Conference on Machine Learning, pp. 1856–1865 (2018)
  18. Hazan, E., Kakade, S.M., Singh, K., Van Soest, A.: Provably efficient maximum entropy exploration. In: Proceedings of the 36th International Conference on Machine Learning, pp. 2681–2691 (2019)
  19. Kauten, C.: Super Mario Bros for OpenAI Gym (2018). https://github.com/Kautenja/gym-super-mario-bros
  20. Kingma, D.P., Ba, J.: Adam: a method for stochastic optimization. In: Proceedings of the 3rd International Conference on Learning Representations (2015)
  21. Kolter, J.Z., Ng, A.Y.: Near-Bayesian exploration in polynomial time. In: Proceedings of the 26th International Conference on Machine Learning, pp. 513–520 (2009)
  22. Lattimore, T., Hutter, M.: Near-optimal PAC bounds for discounted MDPs. Theor. Comput. Sci. 558, 125–143 (2014)
  23. Lillicrap, T.P., et al.: Continuous control with deep reinforcement learning. In: Proceedings of the 4th International Conference on Learning Representations (2016)
  24. Liu, F., Li, Z., Qian, C.: Self-guided evolution strategies with historical estimated gradients. In: Proceedings of the 29th International Joint Conference on Artificial Intelligence, pp. 1474–1480 (2020)
  25. Mnih, V., et al.: Asynchronous methods for deep reinforcement learning. In: Proceedings of the 33rd International Conference on Machine Learning, pp. 1928–1937 (2016)
  26. Mnih, V., et al.: Human-level control through deep reinforcement learning. Nature 518(7540), 529–533 (2015)
  27. Ng, A.Y., Harada, D., Russell, S.J.: Policy invariance under reward transformations: theory and application to reward shaping. In: Proceedings of the 16th International Conference on Machine Learning (1999)
  28. O'Donoghue, B., Munos, R., Kavukcuoglu, K., Mnih, V.: PGQ: combining policy gradient and Q-learning. CoRR abs/1611.01626 (2016)
  29. Osband, I., Blundell, C., Pritzel, A., Van Roy, B.: Deep exploration via bootstrapped DQN. In: Proceedings of the 29th Annual Conference on Neural Information Processing Systems, pp. 4026–4034 (2016)
  30. Ostrovski, G., Bellemare, M.G., van den Oord, A., Munos, R.: Count-based exploration with neural density models. In: Proceedings of the 34th International Conference on Machine Learning, pp. 2721–2730 (2017)
  31. Pathak, D., Agrawal, P., Efros, A.A., Darrell, T.: Curiosity-driven exploration by self-supervised prediction. In: Proceedings of the 34th International Conference on Machine Learning, pp. 2778–2787 (2017)
  32. Péré, A., Forestier, S., Sigaud, O., Oudeyer, P.: Unsupervised learning of goal spaces for intrinsically motivated goal exploration. In: Proceedings of the 6th International Conference on Learning Representations (2018)
  33. Plappert, M., et al.: Multi-goal reinforcement learning: challenging robotics environments and request for research. CoRR abs/1802.09464 (2018)
  34. Plappert, M., et al.: Parameter space noise for exploration. In: Proceedings of the 6th International Conference on Learning Representations (2018)
  35. Schaul, T., Horgan, D., Gregor, K., Silver, D.: Universal value function approximators. In: Proceedings of the 32nd International Conference on Machine Learning, pp. 1312–1320 (2015)
  36. Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O.: Proximal policy optimization algorithms. CoRR abs/1707.06347 (2017)
  37. Stadie, B.C., Levine, S., Abbeel, P.: Incentivizing exploration in reinforcement learning with deep predictive models. CoRR abs/1507.00814 (2015)
  38. Strehl, A.L., Littman, M.L.: An analysis of model-based interval estimation for Markov decision processes. J. Comput. Syst. Sci. 74(8), 1309–1331 (2008)
  39. Sutton, R.S., Barto, A.G.: Introduction to Reinforcement Learning. MIT Press, Cambridge (1998)
  40. Szepesvári, C.: Algorithms for Reinforcement Learning. Morgan & Claypool Publishers, New York (2010)
  41. Tieleman, T., Hinton, G.: Lecture 6.5-RMSProp: divide the gradient by a running average of its recent magnitude. COURSERA: Neural Netw. Mach. Learn. 4(2), 26–31 (2012)
  42. Todorov, E., Erez, T., Tassa, Y.: MuJoCo: a physics engine for model-based control. In: 2012 IEEE/RSJ International Conference on Intelligent Robots and Systems, pp. 5026–5033 (2012)
  43. Wang, Z., et al.: Sample efficient actor-critic with experience replay. In: Proceedings of the 5th International Conference on Learning Representations (2017)
  44. Yu, Y.: Towards sample efficient reinforcement learning. In: Proceedings of the 27th International Joint Conference on Artificial Intelligence, pp. 5739–5743 (2018)


Acknowledgements

The authors thank Fabio Pardo for sharing ideas on visualizing trajectories for SuperMarioBros. In addition, the authors appreciate the helpful instruction from Dr. Yang Yu and the insightful discussions with Tian Xu and Xianghan Kong.

Author information

Correspondence to Ziniu Li or Xiong-Hui Chen.

A Appendix

A.1 Reward Shaping for Training Goal-Conditioned Policy

Reward shaping is invariant to the optimal policy under some conditions [27]. Here we verify that the reward shaping function introduced by our method does not change the optimal behavior of the goal-conditioned policy. Let us consider the total shaping reward accumulated over an episode of length T:

$$\begin{aligned} \sum _{t=1}^{T} \big [ -d(ag_t, g) + d(ag_{t+1}, g) \big ]&= -d(ag_1, g) + d(ag_2, g) - d(ag_2, g) + d(ag_3, g) - \cdots \\&= -d(ag_1, g) + d(ag_{T+1}, g) \end{aligned}$$

For the optimal policy \(\pi _g^*\), \(d(ag_{T+1}, g) = 0\), while \(d(ag_1, g)\) is a constant. Therefore, the optimal policy \(\pi _g\) induced by the reward shaping is identical to the one induced by the sparse reward function in Eq. 2.
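
As a minimal sketch (not the authors' code), the per-step shaping term above could be computed as follows, assuming d is a Euclidean distance between the achieved goal and the desired goal:

```python
import numpy as np

def d(achieved_goal, desired_goal):
    # Assumed distance metric: Euclidean distance between the achieved goal
    # ag and the desired goal g (the paper leaves d abstract).
    return np.linalg.norm(np.asarray(achieved_goal) - np.asarray(desired_goal))

def shaping_reward(ag_t, ag_next, g):
    # Per-step shaping term -d(ag_t, g) + d(ag_{t+1}, g). Summed over an
    # episode it telescopes to -d(ag_1, g) + d(ag_{T+1}, g), so it does not
    # change the optimal goal-conditioned policy.
    return -d(ag_t, g) + d(ag_next, g)
```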

Fig. 7. Visualization of the true visitation counts and the corresponding exploration boundary.

A.2 Additional Results

In this part, we provide additional experimental results to better understand our method.

Empty Room. We visualize the true visitation counts and the corresponding exploration boundary in Fig. 7. Note that the agent starts from the top-left corner and the exit is at the bottom-right corner. The data used for visualization is collected by a random policy; hence, the visitation counts are large in the top-left part. We define the true exploration boundary as the top \(10\%\) of states with the fewest visitation counts, and the estimated exploration boundary given by our method consists of the states with the largest prediction errors in the priority queue. From this figure, we can see that our method closely approximates the true exploration boundary given by the visitation counts.
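
A minimal sketch of how such an estimated boundary could be maintained, assuming an RND-style prediction error is available for each visited state; the class and its methods are ours, for illustration:

```python
import heapq

class ExplorationBoundary:
    """Keep the K visited states with the largest RND prediction errors.

    A min-heap of (error, counter, state) tuples is used, so the state with
    the smallest error is evicted whenever a more novel state arrives.
    """

    def __init__(self, capacity=100):
        self.capacity = capacity
        self.heap = []
        self._counter = 0   # tie-breaker so states themselves are never compared

    def update(self, state, prediction_error):
        item = (prediction_error, self._counter, state)
        self._counter += 1
        if len(self.heap) < self.capacity:
            heapq.heappush(self.heap, item)
        elif prediction_error > self.heap[0][0]:
            heapq.heapreplace(self.heap, item)

    def boundary_states(self):
        # Candidate goals: the states currently estimated to lie on the boundary.
        return [state for _, _, state in self.heap]
```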

SuperMarioBros. In Fig. 8, we provide additional trajectory visualizations on SuperMarioBros-1-1 and SuperMarioBros-1-2. Trajectories are plotted with the same number of samples (18M). We observe that the vanilla method gets stuck in a local optimum on SuperMarioBros-1-1 even though it uses policy entropy regularization to encourage exploration. In addition, only our method reaches the flag on SuperMarioBros-1-2.

Fig. 8. Trajectory visualization on SuperMarioBros-1-1 and SuperMarioBros-1-2. For each figure, top row: vanilla ACER; middle row: ACER + exploration bonus; bottom row: novelty-pursuit (ours). The vanilla method gets stuck in a local optimum on SuperMarioBros-1-1. Only our method reaches the flag on SuperMarioBros-1-2.

A.3 Environment Preprocessing

In this part, we describe the environment preprocessing used in our experiments.

Maze. Different from [9], we use only the image and coordinate information as inputs. We also consider only four actions: turn left, turn right, move forward, and move backward. The maximal episode length is 190 for Empty Room and 500 for Four Rooms. At each step the agent receives a time penalty of \(1/\mathrm {max\_ episode\_length}\), and it receives a reward of +1 when it finds the exit.
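
As an illustrative sketch (not the authors' code), this reward scheme could be expressed as a Gym wrapper; it assumes the old Gym step API returning a 4-tuple and a base environment that signals success with a positive terminal reward:

```python
import gym

class MazeRewardWrapper(gym.Wrapper):
    """Per-step time penalty plus +1 for reaching the exit (illustrative)."""

    def __init__(self, env, max_episode_length=190):
        super().__init__(env)
        self.max_episode_length = max_episode_length

    def step(self, action):
        obs, reward, done, info = self.env.step(action)
        shaped = -1.0 / self.max_episode_length   # time penalty at every step
        if done and reward > 0:                   # assumed success signal
            shaped += 1.0                         # bonus for finding the exit
        return obs, shaped, done, info
```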

FetchReach. We implement this environment based on FetchReach-v0 in Gym [5]. The maximal episode length is 50. The xyz coordinates of the four spheres are (1.20, 0.90, 0.65), (1.10, 0.72, 0.45), (1.20, 0.50, 0.60), and (1.45, 0.50, 0.55). When sampling goals, we resample if the target position lies outside the table, i.e., outside the valid x range (1.0, 1.5), y range (0.45, 1.05), or z range (0.45, 0.65).
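
A sketch of this rejection step, with the valid ranges taken from the text; the sampler that proposes candidate goals is assumed to exist elsewhere:

```python
X_RANGE, Y_RANGE, Z_RANGE = (1.0, 1.5), (0.45, 1.05), (0.45, 0.65)

def on_table(goal):
    """Check that an (x, y, z) goal lies within the valid table ranges."""
    x, y, z = goal
    return (X_RANGE[0] <= x <= X_RANGE[1]
            and Y_RANGE[0] <= y <= Y_RANGE[1]
            and Z_RANGE[0] <= z <= Z_RANGE[1])

def sample_valid_goal(propose_goal, max_tries=100):
    """Resample until the proposed goal lies on the table."""
    for _ in range(max_tries):
        goal = propose_goal()
        if on_table(goal):
            return goal
    raise RuntimeError("no valid goal found after %d tries" % max_tries)
```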

SuperMarioBros. We implement this environment based on [19] with OpenAI Gym wrappers. Preprocessing includes grey-scaling, observation downsampling, external reward clipping (except for a reward of 50 for getting the flag), a frame stack of 4, and sticky actions with a probability of 0.25 [26]. The maximal episode length is 800. The environment resets to the starting position when the agent dies.
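
Most of these wrappers are standard Atari-style preprocessing from OpenAI baselines; as one illustrative example, sticky actions could be implemented as follows (the wrapper is ours, not the authors' code):

```python
import gym
import numpy as np

class StickyActionWrapper(gym.Wrapper):
    """With probability p, repeat the previous action instead of the new one."""

    def __init__(self, env, p=0.25):
        super().__init__(env)
        self.p = p
        self.last_action = 0

    def reset(self, **kwargs):
        self.last_action = 0
        return self.env.reset(**kwargs)

    def step(self, action):
        if np.random.rand() < self.p:
            action = self.last_action
        self.last_action = action
        return self.env.step(action)
```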

A.4 Network Architecture

We use a convolutional neural network (CNN) for Empty Room, Four Rooms, and the SuperMarioBros video games, and a multi-layer perceptron (MLP) for the FetchReach environment. The network architecture and parameters follow the default implementation in OpenAI baselines [11]. For each environment, RND uses a similar network architecture, except that the predictor network has additional MLP layers compared with the fixed target network to strengthen its representation power [7].
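
A minimal sketch of such an RND pair for vector observations (PyTorch; layer sizes are illustrative, and the actual implementation builds on OpenAI baselines):

```python
import torch
import torch.nn as nn

def make_rnd_networks(obs_dim, feat_dim=64):
    """Return (target, predictor): a fixed random target network and a
    trainable predictor with extra layers, in the spirit of RND [7]."""
    target = nn.Sequential(
        nn.Linear(obs_dim, 128), nn.ReLU(),
        nn.Linear(128, feat_dim),
    )
    for p in target.parameters():
        p.requires_grad = False          # the target network is never trained
    predictor = nn.Sequential(
        nn.Linear(obs_dim, 128), nn.ReLU(),
        nn.Linear(128, 128), nn.ReLU(),  # extra layers for representation power
        nn.Linear(128, feat_dim),
    )
    return target, predictor

def prediction_error(target, predictor, obs):
    """Per-state novelty: mean squared error between predictor and target."""
    with torch.no_grad():
        target_feat = target(obs)
    return ((predictor(obs) - target_feat) ** 2).mean(dim=-1)
```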

A.5 Hyperparameters

Table 3 gives the hyperparameters for ACER [43] on the maze and SuperMarioBros environments (the learning algorithm is RMSProp [41]). DDPG [23], used in the FetchReach environment, is based on the HER algorithm implemented in OpenAI baselines [11], except that the actor learning rate is 0.0005. We run 4 parallel environments for DDPG, and the size of the priority queue is also 100. As for the predictor network, its learning rate is 0.0005 and it is optimized with Adam [20] in all experiments; the batch size of the training data equals the product of the rollout length and the number of parallel environments.

The goal-conditioned exploration policy of our method is trained by combining the shaping reward defined in Eq. 3 with the environment reward, which helps reduce the discrepancy with the exploitation policy. The weight for the environment reward is 1 for all environments except SuperMarioBros, where it is 2. For the bonus method used in Sect. 5, the weight \(\beta \) that balances the exploration bonus is 0.1 for Empty Room and Four Rooms, 0.01 for FetchReach, 1.0 for SuperMarioBros-1-1 and SuperMarioBros-1-3, and 0.1 for SuperMarioBros-1-2. Following [6, 7], we also normalize the exploration bonus by dividing it by a running estimate of the standard deviation of the sum of discounted exploration bonuses. In addition, we find that applying imitation learning to the goal-conditioned policy can sometimes improve performance; we will examine this in detail in future work. Although we empirically find HER useful in simple environments like the maze and the MuJoCo robotics tasks, it is less effective than reward shaping on complicated tasks like SuperMarioBros. Hence, the reported episode returns of learned policies are based on reward shaping for SuperMarioBros and on HER for the other environments.
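
A sketch of this bonus normalization for a stream of per-step bonuses; the running-statistics helper is ours (OpenAI baselines provides a similar utility):

```python
import numpy as np

class RunningStd:
    """Running standard deviation via Welford's online algorithm."""

    def __init__(self, eps=1e-8):
        self.mean, self.m2, self.count, self.eps = 0.0, 0.0, 0, eps

    def update(self, x):
        self.count += 1
        delta = x - self.mean
        self.mean += delta / self.count
        self.m2 += delta * (x - self.mean)

    @property
    def std(self):
        return np.sqrt(self.m2 / max(self.count, 1)) + self.eps

def normalize_bonuses(bonuses, stats, gamma=0.99):
    """Divide raw bonuses by a running std of their discounted sum."""
    ret = 0.0
    for b in bonuses:
        ret = gamma * ret + b        # discounted return of the bonus stream
        stats.update(ret)
    return [b / stats.std for b in bonuses]

# Usage: keep `stats` alive across rollouts so the estimate keeps improving.
# stats = RunningStd()
# normalized = normalize_bonuses(raw_bonuses, stats)
```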

For the exploitation policy, we periodically allow it to interact with the environment to mitigate the exploration error [16]. In all experiments, we allocate half of the interactions to the exploitation policy. For example, if the maximal number of samples is 200k, the exploration policy and the exploitation policy each use 100k interactions.

Table 3. Hyperparameters of our method based on ACER on the maze and SuperMarioBros environments.


Copyright information

© 2020 Springer Nature Switzerland AG

About this paper


Cite this paper

Li, Z., Chen, X.-H. (2020). Efficient Exploration by Novelty-Pursuit. In: Taylor, M.E., Yu, Y., Elkind, E., Gao, Y. (eds) Distributed Artificial Intelligence. DAI 2020. Lecture Notes in Computer Science, vol. 12547. Springer, Cham. https://doi.org/10.1007/978-3-030-64096-5_7

  • DOI: https://doi.org/10.1007/978-3-030-64096-5_7

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-64095-8

  • Online ISBN: 978-3-030-64096-5

