Improving the Safety of Deep Reinforcement Learning Algorithms by Making Them More Interpretable

Reinforcement learning (RL) algorithms, especially when combined with the power of deep learning (deep reinforcement learning, DRL), have achieved impressive results in games, for example on the Atari platform and at the game of Go. The advances in this field are promising, and as applications in real-world physical systems become possible, safety issues must be considered. If we want the use of DRL algorithms to have a positive impact on our society, we need to make sure they are safe, reliable and predictable. Despite this, safety in RL has remained largely an open problem. As we noted in [Huang et al. 2018], a survey of deep neural network safety issues, most efforts related to machine learning safety have so far been spent on feedforward deep neural networks (DNNs), with image classification as one of their main tasks. Research is needed on other types of models, such as DRL agents. Most DRL models use feedforward DNNs to store their learned policies, so for a single observation similar safety analysis techniques can be applied. However, RL optimises an objective that may depend on the rewards of many, or even infinitely many, time steps, so unlike with DNNs, the subject of study is a sequence of observations rather than a single input.
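To make this difference concrete, the objective a DRL agent optimises is typically the expected discounted return over whole trajectories rather than a loss on a single input (the formula below uses standard notation and is not taken from the cited works):

$$J(\pi) \;=\; \mathbb{E}_{\tau \sim \pi}\!\left[\sum_{t=0}^{\infty} \gamma^{t}\, r(s_t, a_t)\right], \qquad 0 \le \gamma < 1,$$

so a perturbation applied at one time step can change the action taken, the next state, and every reward that follows, which is why a safety analysis cannot stop at a single observation.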

The aim of this project is to improve the safety of DRL algorithms by making them more interpretable for humans, with a particular focus on the causes of adversarial examples. Interpretability is tightly linked with safety and trustworthiness: it is key to checking that an algorithm learns what it is supposed to learn, that it acts the way it does for the right reasons, and that its values and goals are in line with our own. Moreover, understanding why mistakes are made can help us find and fix safety problems and improve the algorithm and its robustness.


Project plan

Stage 1: Adversarial attacks (I am currently focusing on this area)
Adversarial examples are important not only because they show that an algorithm can be intentionally deceived, but also because they provide evidence for the lack of robustness of a DRL algorithm and thus raise questions about its safety and trustworthiness. There are a few existing adversarial attacks against DRL algorithms, such as [Huang et al. 2017, Lin et al. 2017]. I would like to examine and reproduce these and improve on them. I would also like to create new attacks; these could be entirely novel, or adaptations of the many existing attacks against DNNs that have not yet been made to work against DRL algorithms. A sketch of the kind of attack I have in mind follows below.
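As a concrete starting point, below is a minimal sketch of an FGSM-style perturbation applied to a single observation of a trained agent, in the spirit of [Huang et al. 2017]. The names policy_net, obs and epsilon are placeholders I am assuming (a PyTorch policy network mapping a batched observation to action logits); this is not code from the cited papers.

```python
# Minimal sketch of an FGSM-style attack on a DRL policy, in the spirit of
# [Huang et al. 2017]. `policy_net`, `obs` (a batched observation tensor) and
# `epsilon` are assumed placeholders, not taken from the cited works.
import torch
import torch.nn.functional as F

def fgsm_perturb_observation(policy_net, obs, epsilon=0.01):
    """Return an adversarially perturbed copy of a single (batched) observation."""
    obs = obs.clone().detach().requires_grad_(True)
    logits = policy_net(obs)                   # action preferences for this observation
    target = logits.argmax(dim=-1)             # the action the agent would normally pick
    # Increase the loss of the currently preferred action so the agent is
    # nudged towards choosing a different (potentially worse) action.
    loss = F.cross_entropy(logits, target)
    loss.backward()
    adv_obs = obs + epsilon * obs.grad.sign()  # FGSM step along the gradient sign
    return adv_obs.detach()
```

Such an attack would then be judged not only by how often it flips a single action, but by how much it degrades the cumulative reward over whole episodes.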

Stage 2: Interpretability
Adversarial examples prove the lack of robustness of a model, which might indicate that it has not learnt what it was supposed to learn. Therefore, I would like to apply interpretability techniques to shed light on the rationale behind the model's decisions. By doing so, I hope to understand what the model actually learns, why it makes errors, and what this tells us about its robustness. This topic has received little study in relation to DRL (for an example, see [Greydanus et al. 2018]), but there are many works on DNNs, such as [Dong et al. 2017, Tomsett et al. 2018]. Therefore, my approach would be similar to the one described in Stage 1; a sketch of one such technique follows below.
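For illustration, the sketch below computes a simple gradient-based saliency map for a policy network: it asks which parts of the observation the chosen action is most sensitive to. This is a simpler stand-in for the perturbation-based saliency of [Greydanus et al. 2018]; policy_net and obs are the same assumed placeholders as in the Stage 1 sketch.

```python
# Minimal sketch of a gradient-based saliency map for a policy network.
# A simpler stand-in for the perturbation-based saliency of
# [Greydanus et al. 2018]; `policy_net` and `obs` are assumed placeholders.
import torch

def action_saliency(policy_net, obs):
    """Return |d(logit of chosen action) / d(observation)| as a saliency map."""
    obs = obs.clone().detach().requires_grad_(True)
    logits = policy_net(obs)                              # shape (batch, num_actions)
    chosen = logits.argmax(dim=-1)
    # Differentiate the logit of the chosen action with respect to the input.
    logits.gather(1, chosen.unsqueeze(1)).sum().backward()
    return obs.grad.abs()                                 # high values = inputs the decision depends on
```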

Stage 3: Adversarial defense
After getting a better picture of the robustness of DRL algorithms by examining adversarial examples and applying interpretability techniques, I would like to investigate how their robustness to adversarial attacks could be improved through interpretability. Understanding how an algorithm works should suggest ways to make it more robust; trying to spot adversarial examples with interpretability techniques is one possible direction, sketched below. This topic has been very little studied in relation to DRL ([Pattanaik et al. 2017] is one exception), but there is related work on DNNs, such as [Tao et al. 2018], so, again, I would approach this problem in a similar manner to the previous two stages.
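Purely as an illustration of this direction, the sketch below flags an observation as suspicious when the chosen action changes after everything outside the most salient region is masked out, the intuition being that a decision on a natural input should rest on its salient features. It reuses the action_saliency helper from the Stage 2 sketch; the keep_fraction threshold and the masking scheme are my own assumptions, not a method from [Tao et al. 2018] or [Pattanaik et al. 2017].

```python
# Illustrative interpretability-based detector: if the action changes once the
# non-salient parts of the observation are zeroed out, the decision may rest on
# imperceptible features, as adversarially perturbed inputs often do.
# Reuses `action_saliency` from the Stage 2 sketch; all names and thresholds
# here are assumptions, not a method from the cited papers.
import torch

def looks_adversarial(policy_net, obs, keep_fraction=0.2):
    """Heuristically flag a (batched) observation as potentially adversarial."""
    saliency = action_saliency(policy_net, obs)
    threshold = torch.quantile(saliency.flatten(), 1.0 - keep_fraction)
    masked_obs = torch.where(saliency >= threshold, obs, torch.zeros_like(obs))
    original_action = policy_net(obs).argmax(dim=-1)
    masked_action = policy_net(masked_obs).argmax(dim=-1)
    return bool((original_action != masked_action).any())
```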


References

  • [Dong et al. 2017] Yinpeng Dong, Hang Su, Jun Zhu, and Fan Bao. Towards Interpretable Deep Neural Networks by Leveraging Adversarial Examples. arXiv preprint arXiv:1708.05493, 2017.
  • [Greydanus et al. 2018] Samuel Greydanus, Anurag Koul, Jonathan Dodge, and Alan Fern. Visualizing and Understanding Atari Agents. Proceedings of the 35th International Conference on Machine Learning, PMLR 80:1792-1801, 2018.
  • [Huang et al. 2017] Sandy Huang, Nicolas Papernot, Ian Goodfellow, Yan Duan, and Pieter Abbeel. Adversarial Attacks on Neural Network Policies. arXiv preprint arXiv:1702.02284, 2017.
  • [Huang et al. 2018] Xiaowei Huang, Daniel Kroening, Marta Kwiatkowska, Wenjie Ruan, Youcheng Sun, Emese Thamo, Min Wu, and Xinping Yi. Safety and Trustworthiness of Deep Neural Networks: A Survey. arXiv preprint arXiv:1812.08342v2, 2018.
  • [Lin et al. 2017] Yen-Chen Lin, Zhang-Wei Hong, Yuan-Hong Liao, Meng-Li Shih, Ming-Yu Liu, and Min Sun. Tactics of Adversarial Attack on Deep Reinforcement Learning Agents. IJCAI, 2017.
  • [Pattanaik et al. 2017] Anay Pattanaik, Zhenyi Tang, Shuijing Liu, Gautham Bommannan, and Girish Chowdhary. Robust Deep Reinforcement Learning with Adversarial Attacks. arXiv preprint arXiv:1712.03632, 2017.
  • [Tao et al. 2018] Guanhong Tao, Shiqing Ma, Yingqi Liu, and Xiangyu Zhang. Attacks Meet Interpretability: Attribute-steered Detection of Adversarial Samples. NIPS 2018 Spotlight, 2018.
  • [Tomsett et al. 2018] Richard Tomsett, Amy Widdicombe, Tianwei Xing, Supriyo Chakraborty, Simon Julier, Prudhvi Gurram, Raghuveer Rao, and Mani Srivastava. Why the Failure? How Adversarial Examples Can Provide Insights for Interpretable Machine Learning. 2018 International Conference on Information Fusion (FUSION), 2018.