
Lucid dreaming for experience replay: refreshing past states with the current policy

  • S.I.: Adaptive and Learning Agents 2020
  • Neural Computing and Applications

Abstract

Experience replay (ER) improves the data efficiency of off-policy reinforcement learning (RL) algorithms by allowing an agent to store and reuse its past experiences in a replay buffer. While many techniques have been proposed to enhance ER by biasing how experiences are sampled from the buffer, thus far they have not considered strategies for refreshing experiences inside the buffer. In this work, we introduce Lucid Dreaming for Experience Replay (LiDER), a conceptually new framework that allows replay experiences to be refreshed by leveraging the agent’s current policy. LiDER consists of three steps: First, LiDER moves an agent back to a past state. Second, from that state, LiDER then lets the agent execute a sequence of actions by following its current policy—as if the agent were “dreaming” about the past and could try out different behaviors to encounter new experiences in the dream. Third, LiDER stores and reuses the new experience if it turned out better than what the agent previously experienced, i.e., to refresh its memories. LiDER is designed to be easily incorporated into off-policy, multi-worker RL algorithms that use ER; we present in this work a case study of applying LiDER to an actor–critic-based algorithm. Results show LiDER consistently improves performance over the baseline in six Atari 2600 games. Our open-source implementation of LiDER and the data used to generate all plots in this work are available at https://github.com/duyunshu/lucid-dreaming-for-exp-replay.
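To make the three steps concrete, the following is only a minimal Python sketch of one refresh, under the assumption of a resettable simulator; the names `env.restore_state`, `buffer_d.sample_past_state`, and `buffer_r.add` are illustrative and are not the paper's actual API, and the discounted-return bookkeeping is simplified.

```python
def refresh(env, policy, buffer_d, buffer_r, gamma=0.99, max_steps=100):
    """One LiDER-style refresh (sketch): revisit a stored state, act with
    the current policy, and keep the new experience only if it beats the
    stored return. All names and signatures here are illustrative."""
    # Step 1: move the agent back to a past state from the replay buffer.
    snapshot, old_return = buffer_d.sample_past_state()
    state = env.restore_state(snapshot)      # assumes a resettable simulator

    # Step 2: "dream" forward from that state by following the current policy.
    trajectory, new_return, discount = [], 0.0, 1.0
    for _ in range(max_steps):
        action = policy.act(state)
        next_state, reward, done, _ = env.step(action)
        trajectory.append((state, action, reward))
        new_return += discount * reward
        discount *= gamma
        state = next_state
        if done:
            break

    # Step 3: refresh the memory only if the new experience turned out better.
    if new_return > old_return:
        buffer_r.add(trajectory, new_return)
```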


Notes

  1. A one-step reward r is usually stored instead of the cumulative return (e.g., Mnih et al. [27]). In this work, we follow Oh et al. [32] and store the Monte-Carlo return \({\mathrm{G}}\); we fully describe the buffer structure in Sect. 3.

  2. The implementation of A3CTBSIL is open-sourced at https://github.com/gabrieledcjr/DeepRL. In de la Cruz Jr et al. [6], we also considered using demonstrations to improve A3CTBSIL, which is not the baseline used in this work.

  3. Note that while the A3C algorithm is on-policy, integrating A3C with SIL makes it an off-policy algorithm (as in Oh et al. [32]).

  4. Note the performance in Montezuma’s Revenge differs between A3CTBSIL [6] and the original SIL algorithm [32]—see the discussion in “Appendix 4.”

  5. Note that the baseline A3CTBSIL represents the scenario of SampleD, i.e., always sample from buffer \({{\,\mathrm{\mathcal {D}}\,}}\).

  6. The data is publicly available: https://github.com/gabrieledcjr/atari_human_demo

  7. The policy-based Go-Explore algorithm extends the “Go-Explore without a policy” framework presented in an earlier pre-print [9]; that framework also leverages the simulator reset feature.

  8. Ecoffet et al. [10] made a detailed comparison between the policy-based Go-Explore and DTSIL. We refer interested readers to Ecoffet et al. [10] for further details.

References

  1. Andrychowicz M, Wolski F, Ray A, Schneider J, Fong R, Welinder P, McGrew B, Tobin J, Pieter Abbeel O, Zaremba W (2017) Hindsight experience replay. In: Guyon I, Luxburg UV, Bengio S, Wallach H, Fergus R, Vishwanathan S, Garnett R (eds) Advances in neural information processing systems. Curran Associates, Inc., 30:5048–5058. https://proceedings.neurips.cc/paper/2017/file/453fadbd8a1a3af50a9df4df899537b5-Paper.pdf

  2. Bellemare M, Srinivasan S, Ostrovski G, Schaul T, Saxton D, Munos R (2016) Unifying count-based exploration and intrinsic motivation. In: Lee D, Sugiyama M, Luxburg U, Guyon I, Garnett R (eds) Advances in neural information processing systems. Curran Associates, Inc., vol 29, pp 1471–1479. https://proceedings.neurips.cc/paper/2016/file/afda332245e2af431fb7b672a68b659d-Paper.pdf

  3. Bellemare MG, Naddaf Y, Veness J, Bowling M (2013) The arcade learning environment: an evaluation platform for general agents. J Artif Intell Res 47(1):253–279


  4. Chan H, Wu Y, Kiros J, Fidler S, Ba J (2019) ACTRCE: augmenting experience via teacher’s advice for multi-goal reinforcement learning. arXiv:190204546

  5. de la Cruz GV, Du Y, Taylor ME (2019) Pre-training with non-expert human demonstration for deep reinforcement learning. Knowl Eng Rev 34:e10. https://doi.org/10.1017/S0269888919000055


  6. de la Cruz Jr GV, Du Y, Taylor ME (2019) Jointly pre-training with supervised, autoencoder, and value losses for deep reinforcement learning. In: Adaptive and learning agents workshop, AAMAS

  7. Dao G, Lee M (2019) Relevant experiences in replay buffer. In: 2019 IEEE symposium series on computational intelligence (SSCI), pp 94–101. https://doi.org/10.1109/SSCI44817.2019.9002745

  8. De Bruin T, Kober J, Tuyls K, Babuška R (2015) The importance of experience replay database composition in deep reinforcement learning. In: Deep reinforcement learning workshop, NIPS

  9. Ecoffet A, Huizinga J, Lehman J, Stanley KO, Clune J (2019) Go-explore: a new approach for hard-exploration problems. arXiv preprint arXiv:190110995

  10. Ecoffet A, Huizinga J, Lehman J, Stanley KO, Clune J (2020) First return then explore. arXiv preprint arXiv:200412919

  11. Espeholt L, Soyer H, Munos R, Simonyan K, Mnih V, Ward T, Doron Y, Firoiu V, Harley T, Dunning I, Legg S, Kavukcuoglu K (2018) IMPALA: scalable distributed deep-RL with importance weighted actor-learner architectures. In: Proceedings of Machine learning research 80:1407–1416. http://proceedings.mlr.press/v80/espeholt18a.html

  12. Fedus W, Ramachandran P, Agarwal R, Bengio Y, Larochelle H, Rowland M, Dabney W (2020) Revisiting fundamentals of experience replay. In: Proceedings of the 37th international conference on machine learning, PMLR. https://proceedings.icml.cc/paper/2020/hash/5460b9ea1986ec386cb64df22dff37be-Abstract.html

  13. Florensa C, Held D, Wulfmeier M, Zhang M, Abbeel P (2017) Reverse curriculum generation for reinforcement learning. In: Levine S, Vanhoucke V, Goldberg K (eds) Proceedings of machine learning research, PMLR, 78:482–495. http://proceedings.mlr.press/v78/florensa17a.html

  14. Gangwani T, Liu Q, Peng J (2019) Learning self-imitating diverse policies. In: International conference on learning representations. https://openreview.net/forum?id=HyxzRsR9Y7

  15. Gruslys A, Dabney W, Azar MG, Piot B, Bellemare M, Munos R (2018) The reactor: a fast and sample-efficient actor-critic agent for reinforcement learning. In: International conference on learning representations. https://openreview.net/forum?id=rkHVZWZAZ

  16. Guo Y, Choi J, Moczulski M, Feng S, Bengio S, Norouzi M, Lee H (2020) Memory based trajectory-conditioned policies for learning from sparse rewards. In: Advances in neural information processing systems. https://papers.nips.cc/paper/2020/hash/2df45244f09369e16ea3f9117ca45157-Abstract.html

  17. He FS, Liu Y, Schwing AG, Peng J (2017) Learning to play in a day: faster deep reinforcement learning by optimality tightening. In: International conference on learning representations. https://openreview.net/forum?id=rJ8Je4clg

  18. Hester T, Vecerik M, Pietquin O, Lanctot M, Schaul T, Piot B, Horgan D, Quan J, Sendonaris A, Osband I, Dulac-Arnold G, Agapiou J, Leibo JZ, Gruslys A (2018) Deep Q-learning from demonstrations. In: Annual meeting of the association for the advancement of artificial intelligence (AAAI), New Orleans (USA)

  19. Horgan D, Quan J, Budden D, Barth-Maron G, Hessel M, van Hasselt H, Silver D (2018) Distributed prioritized experience replay. In: International conference on learning representations. https://openreview.net/forum?id=H1Dy---0Z

  20. Hosu IA, Rebedea T (2016) Playing atari games with deep reinforcement learning and human checkpoint replay. arXiv preprint arXiv:160705077

  21. Kapturowski S, Ostrovski G, Dabney W, Quan J, Munos R (2019) Recurrent experience replay in distributed reinforcement learning. In: International conference on learning representations. https://openreview.net/forum?id=r1lyTjAqYX

  22. Le L, Patterson A, White M (2018) Supervised autoencoders: improving generalization performance with unsupervised regularizers. In: Bengio S, Wallach H, Larochelle H, Grauman K, Cesa-Bianchi N, Garnett R (eds) Advances in neural information processing systems, Curran Associates, Inc., 31:107–117. https://proceedings.neurips.cc/paper/2018/file/2a38a4a9316c49e5a833517c45d31070-Paper.pdf

  23. Lillicrap TP, Hunt JJ, Pritzel A, Heess N, Erez T, Tassa Y, Silver D, Wierstra D (2016) Continuous control with deep reinforcement learning. In: International conference on learning representations. https://openreview.net/forum?id=tX_O8O-8Zl

  24. Lin LJ (1992) Self-improving reactive agents based on reinforcement learning, planning and teaching. Mach Learn 8(3–4):293–321


  25. Liu R, Zou J (2018) The effects of memory replay in reinforcement learning. In: The 56th annual allerton conference on communication, control, and computing, pp 478–485

  26. Mihalkova L, Mooney R (2006) Using active relocation to aid reinforcement learning. In: Proceedings of the 19th international FLAIRS conference (FLAIRS-2006), Melbourne Beach, FL, pp 580–585. http://www.cs.utexas.edu/users/ai-lab?mihalkova:flairs06

  27. Mnih V, Kavukcuoglu K, Silver D, Rusu AA, Veness J, Bellemare MG, Graves A, Riedmiller M, Fidjeland AK, Ostrovski G et al (2015) Human-level control through deep reinforcement learning. Nature 518(7540):529


  28. Mnih V, Badia AP, Mirza M, Graves A, Lillicrap T, Harley T, Silver D, Kavukcuoglu K (2016) Asynchronous methods for deep reinforcement learning. In: Balcan MF, Weinberger KQ (eds) Proceedings of machine learning research, PMLR, New York, New York, USA, vol 48, pp 1928–1937. http://proceedings.mlr.press/v48/mniha16.html

  29. Munos R, Stepleton T, Harutyunyan A, Bellemare M (2016) Safe and efficient off-policy reinforcement learning. In: Lee D, Sugiyama M, Luxburg U, Guyon I, Garnett R (eds) Advances in neural information processing systems, Curran Associates, Inc., vol 29, pp 1054–1062. https://proceedings.neurips.cc/paper/2016/file/c3992e9a68c5ae12bd18488bc579b30d-Paper.pdf

  30. Nair A, McGrew B, Andrychowicz M, Zaremba W, Abbeel P (2018) Overcoming exploration in reinforcement learning with demonstrations. In: 2018 IEEE international conference on robotics and automation (ICRA), pp 6292–6299. https://doi.org/10.1109/ICRA.2018.8463162

  31. Novati G, Koumoutsakos P (2019) Remember and forget for experience replay. In: Chaudhuri K, Salakhutdinov R (eds) Proceedings of machine learning research, PMLR, Long Beach, California, USA, vol 97, pp 4851–4860. http://proceedings.mlr.press/v97/novati19a.html

  32. Oh J, Guo Y, Singh S, Lee H (2018) Self-imitation learning. In: Dy J, Krause A (eds) Proceedings of machine learning research, PMLR, Stockholmsmässan, Stockholm Sweden, vol 80, pp 3878–3887. http://proceedings.mlr.press/v80/oh18b.html

  33. Pohlen T, Piot B, Hester T, Azar MG, Horgan D, Budden D, Barth-Maron G, van Hasselt H, Quan J, Večerík M, et al. (2018) Observe and look further: achieving consistent performance on Atari. arXiv preprint arXiv:180511593

  34. Resnick C, Raileanu R, Kapoor S, Peysakhovich A, Cho K, Bruna J (2018) Backplay: “Man muss immer umkehren”. In: Workshop on reinforcement learning in games, AAAI

  35. Ross S, Bagnell D (2010) Efficient reductions for imitation learning. In: Teh YW, Titterington M (eds) Proceedings of machine learning research, JMLR workshop and conference proceedings, Chia Laguna Resort, Sardinia, Italy, 9:661–668. http://proceedings.mlr.press/v9/ross10a.html

  36. Salimans T, Chen R (2018) Learning montezuma’s revenge from a single demonstration. arXiv preprint arXiv:181203381

  37. Schaul T, Quan J, Antonoglou I, Silver D (2016) Prioritized experience replay. In: International conference on learning representations. arXiv:1511.05952

  38. Schrittwieser J, Antonoglou I, Hubert T, Simonyan K, Sifre L, Schmitt S, Guez A, Lockhart E, Hassabis D, Graepel T, et al. (2019) Mastering Atari, go, chess and shogi by planning with a learned model. arXiv preprint arXiv:191108265

  39. Sinha S, Song J, Garg A, Ermon S (2020) Experience replay with likelihood-free importance weights. arXiv preprint arXiv:200613169

  40. Sovrano F (2019) Combining experience replay with exploration by random network distillation. In: 2019 IEEE conference on games (CoG), pp 1–8. https://doi.org/10.1109/CIG.2019.8848046

  41. Stumbrys T, Erlacher D, Schredl M (2016) Effectiveness of motor practice in lucid dreams: a comparison with physical and mental practice. J Sports Sci 34:27–34


  42. Sutton RS, Barto AG (2018) Reinforcement learning: an introduction. MIT Press, Cambridge


  43. Tang Y (2020) Self-imitation learning via generalized lower bound Q-learning. In: Advances in neural information processing systems, vol 33. https://papers.nips.cc/paper/2020/file/a0443c8c8c3372d662e9173c18faaa2c-Paper.pdf

  44. Tavakoli A, Levdik V, Islam R, Smith CM, Kormushev P (2018) Exploring restart distributions. arXiv:181111298

  45. Wang Z, Bapst V, Heess NMO, Mnih V, Munos R, Kavukcuoglu K, de Freitas N (2017) Sample efficient actor-critic with experience replay. In: International conference on learning representations. https://openreview.net/pdf?id=HyM25Mqel

  46. Wawrzyński P (2009) Real-time reinforcement learning by sequential actor-critics and experience replay. Neural Netw 22(10):1484–1497


  47. Zha D, Lai KH, Zhou K, Hu X (2019) Experience replay optimization. In: Proceedings of the twenty-eighth international joint conference on artificial intelligence, IJCAI-19, international joint conferences on artificial intelligence organization, pp 4243–4249. https://doi.org/10.24963/ijcai.2019/589

  48. Zhang S, Sutton RS (2017) A deeper look at experience replay. arXiv preprint arXiv:171201275

  49. Zhang X, Bharti SK, Ma Y, Singla A, Zhu X (2020) The teaching Dimension of Q-learning. arXiv preprint arXiv:200609324


Acknowledgements

We thank Gabriel V. de la Cruz Jr. for helpful discussions; his open-source code at https://github.com/gabrieledcjr/DeepRL is used for training the behavior cloning models in this work. This research used resources of Kamiak, Washington State University’s high-performance computing cluster. Assefaw Gebremedhin is supported by the NSF award IIS-1553528. Part of this work has taken place in the Intelligent Robot Learning (IRL) Lab at the University of Alberta, which is supported in part by research grants from the Alberta Machine Intelligence Institute (Amii), CIFAR, and NSERC. Part of this work has taken place in the Learning Agents Research Group (LARG) at UT Austin. LARG research is supported in part by NSF (CPS-1739964, IIS-1724157, NRI-1925082), ONR (N00014-18-2243), FLI (RFP2-000), ARL, DARPA, Lockheed Martin, GM, and Bosch. Peter Stone serves as the Executive Director of Sony AI America and receives financial compensation for this work. The terms of this arrangement have been reviewed and approved by the University of Texas at Austin in accordance with its policy on objectivity in research.

Author information

Corresponding author

Correspondence to Yunshu Du.

Ethics declarations

Conflict of interest

The authors declare that they have no conflict of interest.


Appendices

Appendices for “Lucid dreaming for experience replay: refreshing past states with the current policy”

We provide further details of our work in the following six appendices:

  • “Appendix 1” contains the implementation details of LiDER, including neural network architecture, hyperparameters, and computation resources used for all experiments.

  • “Appendix 2” presents the pseudocode for the A3C and SIL workers. They follow the original work of Mnih et al. [28] and Oh et al. [32], respectively; we include them here for completeness.

  • “Appendix 3” provides detailed statistics of the one-tailed independent-samples t tests: 1) A3CTBSIL compared to LiDER, 2) A3CTBSIL compared to the three ablation studies of LiDER, 3) A3CTBSIL compared to the two extensions of LiDER, 4) LiDER compared to the three ablation studies of LiDER, and 5) LiDER compared to the two extensions of LiDER.

  • “Appendix 4” discusses the differences between the A3CTBSIL algorithm in de la Cruz Jr et al. [6] and the original SIL algorithm in Oh et al. [32] (as mentioned in Sect. 4.1).

  • “Appendix 5” presents the performance of the trained agents (TA) used in LiDER-TA.

  • “Appendix 6” details the pre-training process for obtaining the BC models used in LiDER-BC, including the statistics of the demonstration collected by de la Cruz et al. [5], the network architecture, the hyperparameters used for pre-training, and the performance of the trained BC models.

Appendix 1: Implementation details

We use the same neural network architecture as in the original A3C algorithm [28] for all A3C, SIL, and refresher workers (the blue, orange, and green components in Fig. 1, respectively). The network consists of three convolutional layers and one fully connected layer, followed by two fully connected output branches: a policy function output layer and a value function output layer. Atari images are converted to grayscale and resized to 88\(\times \)88, with 4 images stacked as the input.
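For illustration, below is a minimal PyTorch-style sketch of the described architecture. Only what the text states is taken from the paper (three convolutional layers, one fully connected layer, two output heads, 88\(\times \)88\(\times \)4 grayscale input); the filter counts, kernel sizes, and hidden width are assumptions based on common A3C implementations, not values confirmed by the authors.

```python
import torch
import torch.nn as nn

class A3CNet(nn.Module):
    """Sketch of the shared network used by the A3C, SIL, and refresher
    workers. Filter counts/kernel sizes are assumed, not from the paper."""
    def __init__(self, num_actions: int):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(4, 32, kernel_size=8, stride=4), nn.ReLU(),  # 88x88x4 input
            nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=3, stride=1), nn.ReLU(),
            nn.Flatten(),
        )
        feat_dim = self.features(torch.zeros(1, 4, 88, 88)).shape[1]
        self.fc = nn.Sequential(nn.Linear(feat_dim, 512), nn.ReLU())
        self.policy_head = nn.Linear(512, num_actions)  # policy logits
        self.value_head = nn.Linear(512, 1)              # state value

    def forward(self, x):
        h = self.fc(self.features(x / 255.0))
        return self.policy_head(h), self.value_head(h)
```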

We run each experiment for eight trials due to computational limitations. Each experiment uses one GPU (Tesla K80 or TITAN V), five CPU cores, and 40 GB of memory (each LiDER-OneBuffer experiment uses 64 GB of memory since the buffer size is doubled). The refresher worker runs on the GPU to generate data as quickly as possible; the A3C and SIL workers run in parallel on the CPU cores. In all games, training progresses at roughly 0.8 to 1 million steps per hour of wall-clock time, so one trial of 50 million steps takes around 50 to 60 hours to complete.

The baseline A3CTBSIL is trained with 17 parallel workers: 16 A3C workers and 1 SIL worker. The RMSProp optimizer is used with a learning rate of 0.0007. We use \(t_{\mathrm{max}}=20\) for the n-step bootstrap \(Q^{(n)}\) (\(n \le t_{\mathrm{max}}\)). The SIL worker performs \(M=4\) SIL policy updates (Equation (2)) per step t with a minibatch size of 32 (i.e., 32\(\times \)4=128 total samples per step). Buffer \({{\,\mathrm{\mathcal {D}}\,}}\) is of size \(10^5\). The SIL loss weight is \(\beta ^{\mathrm{sil}}=0.5\).
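For reference, a sketch of one such SIL update in the style of Oh et al. [32], where sampled returns \({\mathrm{G}}\) contribute only through their positive advantage \((\mathrm{G} - V(s))_+\); the `buffer.sample` call and tensor shapes are illustrative assumptions, not the paper's implementation.

```python
import torch
import torch.nn.functional as F

def sil_update(net, optimizer, buffer, batch_size=32, beta_sil=0.5):
    """One self-imitation update in the style of Oh et al. [32] (sketch).
    `buffer.sample` returning (states, actions, returns) is an assumption."""
    states, actions, returns = buffer.sample(batch_size)  # returns = stored G
    logits, values = net(states)
    log_probs = F.log_softmax(logits, dim=-1)
    log_pi_a = log_probs.gather(1, actions.unsqueeze(1)).squeeze(1)

    # Only experiences that beat the current value estimate contribute.
    advantage = (returns - values.squeeze(1)).clamp(min=0).detach()
    policy_loss = -(log_pi_a * advantage).mean()
    value_loss = 0.5 * ((returns - values.squeeze(1)).clamp(min=0) ** 2).mean()

    loss = policy_loss + beta_sil * value_loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```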

LiDER is also trained with 17 parallel workers: 15 A3C workers, 1 SIL worker, and 1 refresher worker; we keep the total number of workers in A3CTBSIL and LiDER the same to ensure a fair performance comparison. The SIL worker in LiDER also uses a minibatch size of 32; samples are taken from buffers \({{\,\mathrm{\mathcal {D}}\,}}\) and \({{\,\mathrm{\mathcal {R}}\,}}\) as described in Sect. 3. All other parameters are identical to those of A3CTBSIL. We summarize the details of the network architecture and experiment parameters in Table 2.

Table 2 Hyperparameters for all experiments. We train each game for 50 million steps with a frame skip of 4, i.e., 200 million game frames were consumed for training

Appendix 2: Pseudocode for the A3C and SIL workers

[Pseudocode figures for the A3C worker and the SIL worker]

Appendix 3: One-tailed independent-samples t tests

We conducted one-tailed independent-samples t tests (equal variances not assumed) in all games to compare the differences in the mean episodic reward among all methods in this paper. For each game, we restored the best model checkpoint from each trial (eight trials per method) and executed the model in the game following a deterministic policy for 100 episodes (an episode ends when the agent loses all its lives) and recorded the reward per episode. This gives us 800 data points for each method in each game. We use a significance level \(\alpha =0.001\) for all tests.
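A comparison of this kind can be reproduced with SciPy (version 1.6 or later, which supports one-sided alternatives); the synthetic arrays below are purely illustrative and are not the paper's data, each standing in for the 800 per-episode rewards collected per method and game.

```python
import numpy as np
from scipy import stats

def compare_methods(rewards_a, rewards_b, alpha=0.001):
    """One-tailed Welch t-test: is method B's mean episodic reward
    significantly higher than method A's? Inputs are arrays of
    per-episode rewards (800 per method under the paper's protocol)."""
    t, p = stats.ttest_ind(rewards_b, rewards_a,
                           equal_var=False,        # equal variances not assumed
                           alternative='greater')  # one-tailed; SciPy >= 1.6
    return t, p, p < alpha

# Illustration with synthetic data only (not the paper's results):
rng = np.random.default_rng(0)
baseline = rng.normal(1000, 200, size=800)
lider = rng.normal(1100, 200, size=800)
print(compare_methods(baseline, lider))
```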

First, we check the statistical significance of the baseline A3CTBSIL compared to LiDER (Sect. 4.1), the main framework proposed in this paper. We report the detailed statistics in Table 3. Results show that the mean episodic reward of LiDER is significantly higher than A3CTBSIL (\(p \ll 0.001\)) in all games.

Table 3 One-tailed independent-samples t test for the differences of the mean episodic reward between A3CTBSIL and LiDER. Equal variances are not assumed

Second, we compare A3CTBSIL to the three ablation studies, LiDER-AddAll, LiDER-OneBuffer, and LiDER-SampleR (Sect. 5). Table 4 shows that all ablations were helpful in Freeway and Montezuma’s Revenge, in which the mean episodic rewards of the ablations are significantly higher than the baseline’s (\(p \ll 0.001\)). LiDER-AddAll also performed significantly better than A3CTBSIL in all games (\(p \ll 0.001\)). LiDER-OneBuffer outperformed A3CTBSIL in Freeway and Montezuma’s Revenge (\(p \ll 0.001\)), but performed worse than A3CTBSIL in the other four games (\(p \ll 0.001\)). LiDER-SampleR outperformed A3CTBSIL in Ms. Pac-Man, Freeway, and Montezuma’s Revenge (\(p \ll 0.001\)), but under-performed A3CTBSIL in Gopher, NameThisGame, and Alien (\(p \ll 0.001\)).

Table 4 One-tailed independent-samples t test for the differences of the mean episodic reward between A3CTBSIL and LiDER-AddALL, between A3CTBSIL and LiDER-OneBuffer, and between A3CTBSIL and LiDER-SampleR. Equal variances are not assumed

Third, we compare A3CTBSIL to the two extensions, LiDER-BC and LiDER-TA (Sect. 6). Table 5 shows that the two extensions outperformed the baseline significantly in all games (\(p \ll 0.001\)).

Table 5 One-tailed independent-samples t test for the differences of the mean episodic reward between A3CTBSIL and LiDER-BC, and between A3CTBSIL and LiDER-TA. Equal variances are not assumed

Fourth, we check the statistical significance of LiDER compared to the three ablation studies, LiDER-AddAll, LiDER-OneBuffer, and LiDER-SampleR (Sect. 5). Results in Table 6 show that most of the ablations significantly under-performed LiDER (\(p \ll 0.001\)) in terms of the mean episodic reward, except in Gopher and NameThisGame, where LiDER-AddAll performs at the same level as LiDER (\(p > 0.001\)).

Table 6 One-tailed independent-samples t test for the differences of the mean episodic reward between LiDER and LiDER-AddAll, between LiDER and LiDER-OneBuffer, and between LiDER and LiDER-SampleR. Equal variances are not assumed. Methods in bold are not significant at level \(\alpha =0.001\)

Lastly, we compare LiDER to the two extensions, LiDER-TA and LiDER-BC (Sect. 6). Results in Table 7 show that LiDER-TA always outperforms LiDER (\(p \ll 0.001\)). LiDER-BC outperformed LiDER in Gopher, Alien, Ms. Pac-Man, and Montezuma’s Revenge. In Freeway, LiDER-BC performed at the same level as LiDER (\(p > 0.001\)), while in NameThisGame it performed worse than LiDER (\(p \ll 0.001\)).

Table 7 One-tailed independent-samples t test for the differences of the mean episodic reward between LiDER and LiDER-TA, and between LiDER and LiDER-BC. Equal variances are not assumed. Methods in bold are not significant at level \(\alpha =0.001\)

Appendix 4: Differences between A3CTBSIL and SIL

There is a performance difference in Montezuma’s Revenge between the A3CTBSIL algorithm (our previous work in de la Cruz Jr et al. [6], which is used as the baseline method in this article) and the original SIL algorithm (by Oh et al. [32]). The A3CTBSIL agent fails to achieve any reward, while the SIL agent achieves a score of 1100 (Table 5 in [32]).

We hypothesize that the difference is due to the different numbers of SIL updates (Equation (2)) that can be performed in A3CTBSIL and SIL; a lower number of SIL updates would decrease performance. In particular, Oh et al. [32] proposed adding the “Perform self-imitation learning” step to each A3C worker (Algorithm 1 of Oh et al. [32]). That is, when running with 16 A3C workers, the SIL agent is effectively using 16 SIL workers to update the policy. However, A3CTBSIL has only one SIL worker, which means A3CTBSIL performs strictly fewer SIL updates than the original SIL algorithm, thus resulting in lower performance.

We empirically validate the above hypothesis by conducting an experiment in the game of Ms. Pac-Man by modifying the A3CTBSIL algorithm from our previous work [6]. Instead of performing a SIL update whenever the SIL worker can, we force the SIL worker to only perform an update at even global steps; this setting reduces the total number of SIL updates by half. We denote this experiment as A3CTBSIL-ReduceSIL.
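The gating described above amounts to the following sketch; `global_step` and `perform_sil_update` are illustrative names rather than identifiers from our implementation.

```python
def maybe_sil_update(global_step: int, perform_sil_update) -> bool:
    """A3CTBSIL-ReduceSIL (sketch): only perform a SIL update at even
    global steps, halving the total number of SIL updates."""
    if global_step % 2 == 0:
        perform_sil_update()
        return True
    return False
```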

Figure 8 shows that A3CTBSIL-ReduceSIL under-performed A3CTBSIL, which provides preliminary evidence that the number of SIL updates is positively correlated to performance. More experiments will be performed in future work to further validate this correlation.

Fig. 8

A3CTBSIL-ReduceSIL compared to A3CTBSIL in the game of Ms. Pac-Man. The x-axis is the total number of environmental steps. The y-axis is the average testing score over five trials. We ran A3CTBSIL-ReduceSIL for five trials due to limited computing resources; we plot the first five trials out of eight for A3CTBSIL for a fair comparison to the number of trials in A3CTBSIL-ReduceSIL. Shaded regions show the standard deviation

Appendix 5: The performance of trained agents used in LiDER-TA

Section 6 shows that LiDER can leverage knowledge from a trained agent (TA). While the TA could come from any source, we use the best checkpoint of a fully trained LiDER agent. Table 8 shows the average performance of the TA used in each game. The score is estimated by executing the TA greedily in the game for 50 episodes. An episode ends when the agent loses all its lives.
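A sketch of this greedy evaluation protocol is shown below, assuming the classic Gym step API and an environment that ends an episode when all lives are lost; `policy.greedy_action` is an illustrative name.

```python
import numpy as np

def evaluate_greedy(env, policy, num_episodes=50):
    """Estimate a trained agent's score by acting greedily (sketch).
    Assumes the classic Gym API and that `done` is only True once the
    agent has lost all of its lives."""
    scores = []
    for _ in range(num_episodes):
        obs, done, total = env.reset(), False, 0.0
        while not done:
            action = policy.greedy_action(obs)  # argmax over the policy output
            obs, reward, done, _ = env.step(action)
            total += reward
        scores.append(total)
    return float(np.mean(scores)), float(np.std(scores))
```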

Table 8 The performance of trained agents used in LiDER-TA, shown as the purple dotted line in Fig. 7. The score is estimated by executing the TA greedily in the game for 50 episodes

Appendix 6: Pre-training the behavior cloning model for LiDER-BC

In Sect. 6, we demonstrated that a BC model can be incorporated into LiDER to improve learning. The BC model is pre-trained using a publicly available human demonstration dataset. Dataset statistics are shown in Table 9.

Table 9 Demonstration size and quality, collected in de la Cruz et al. [5]. All games are limited to 20 min of demonstration time per episode

The BC model uses the same network architecture as the A3C algorithm [28], and pre-training a BC model for A3C requires a few more steps than the plain supervised learning normally used in standard imitation learning (e.g., Ross and Bagnell [35]). A3C has two output layers: a policy output layer and a value output layer. The policy output layer is what we usually train as a supervised classifier. The value output layer, however, is usually initialized randomly without being pre-trained. Our previous work [6] observed this inconsistency and leveraged demonstration data to also pre-train the value output layer. In particular, since the demonstration data contain the true return \({\mathrm{G}}\), we can obtain a value loss that is almost identical to A3C’s value loss \(L^{a3c}_{\mathrm{value}}\): instead of using the n-step bootstrap value \(Q^{(n)}\) to compute the advantage, the true return \({\mathrm{G}}\) is used.

Inspired by the supervised autoencoder (SAE) framework [22], our previous work [6] also blended in an unsupervised loss for pre-training. In SAE, an image reconstruction loss is incorporated with the supervised loss to help extract better feature representations and achieve better performance. A BC model pre-trained jointly with supervised, value, and unsupervised losses can lead to better performance after fine-tuning with RL, compared to pre-training with the supervised loss only.
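Putting the three terms together, a sketch of the joint pre-training objective follows: a supervised classification loss on the policy head, a value loss that regresses toward the true demonstration return \({\mathrm{G}}\), and an autoencoder-style reconstruction loss [22]. The `decoder`, the `return_features` flag, and the loss weights are assumptions for illustration, not values from the paper.

```python
import torch
import torch.nn.functional as F

def bc_pretrain_loss(net, decoder, states, demo_actions, demo_returns,
                     w_value=0.5, w_recon=0.1):
    """Joint BC pre-training loss (sketch): supervised policy loss,
    value loss against the true return G, and image reconstruction.
    `decoder` and the loss weights are illustrative assumptions."""
    logits, values, features = net(states, return_features=True)

    # Supervised policy loss: imitate the demonstrated actions.
    policy_loss = F.cross_entropy(logits, demo_actions)

    # Value loss: regress V(s) toward the true Monte-Carlo return G
    # (instead of the n-step bootstrap Q^(n) used during RL).
    value_loss = F.mse_loss(values.squeeze(1), demo_returns)

    # Unsupervised loss: reconstruct the input frames from the features.
    recon_loss = F.mse_loss(decoder(features), states)

    return policy_loss + w_value * value_loss + w_recon * recon_loss
```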

We follow this approach by jointly pre-training the BC model for 50,000 steps with a minibatch size of 32. The Adam optimizer is used with a learning rate of 0.0005. After training, we test for 50 episodes by executing the model greedily in the game and record the average episodic reward (an episode ends when the agent loses all its lives). For each set of demonstration data, we train five models and use the one with the highest average episodic reward as the BC model in LiDER-BC. The performance of the trained BC models is presented in Table 10. All parameters are based on those from our previous work [6]; we summarize them in Table 11.
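The model-selection step amounts to the small sketch below; both callables are illustrative placeholders (e.g., `evaluate` could wrap the greedy evaluation protocol described in Appendix 5).

```python
def select_best_bc_model(train_bc_model, evaluate, num_models=5):
    """Train several BC models on the same demonstrations and keep the one
    with the highest average episodic reward (sketch; both callables are
    illustrative assumptions)."""
    candidates = [train_bc_model(seed=i) for i in range(num_models)]
    scores = [evaluate(model) for model in candidates]
    best = max(range(num_models), key=lambda i: scores[i])
    return candidates[best], scores[best]
```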

Table 10 The performance of behavior cloning models used in LiDER-BC, shown as the black dashed line in Fig. 7. The score is estimated by executing the BC greedily in the game for 50 episodes
Table 11 Hyperparameters for pre-training the behavior cloning (BC) model used in LiDER-BC

About this article

Cite this article

Du, Y., Warnell, G., Gebremedhin, A. et al. Lucid dreaming for experience replay: refreshing past states with the current policy. Neural Comput & Applic 34, 1687–1712 (2022). https://doi.org/10.1007/s00521-021-06104-5