
[Paper] Human-level control through deep reinforcement learning

by 박정률 2017. 1. 20.

In this post I will read through a paper recommended by Professor 이필규 of Inha University.

http://www.nature.com/nature/journal/v518/n7540/full/nature14236.html


The paper is titled "Human-level control through deep reinforcement learning" and was published in Nature.

The theory of reinforcement learning provides a normative account, deeply rooted in psychological and neuroscientific perspectives on animal behaviour, of how agents may optimize their control of an environment. To use reinforcement learning successfully in situations approaching real-world complexity, however, agents are confronted with a difficult task: they must derive efficient representations of the environment from high-dimensional sensory inputs, and use these to generalize past experience to new situations. Remarkably, humans and other animals seem to solve this problem through a harmonious combination of reinforcement learning and hierarchical sensory processing systems, the former evidenced by a wealth of neural data revealing notable parallels between the phasic signals emitted by dopaminergic neurons and temporal difference reinforcement learning algorithms. While reinforcement learning agents have achieved some successes in a variety of domains, their applicability has previously been limited to domains in which useful features can be handcrafted, or to domains with fully observed, low-dimensional state spaces. Here we use recent advances in training deep neural networks to develop a novel artificial agent, termed a deep Q-network, that can learn successful policies directly from high-dimensional sensory inputs using end-to-end reinforcement learning. We tested this agent on the challenging domain of classic Atari 2600 games. We demonstrate that the deep Q-network agent, receiving only the pixels and the game score as inputs, was able to surpass the performance of all previous algorithms and achieve a level comparable to that of a professional human games tester across a set of 49 games, using the same algorithm, network architecture and hyperparameters. This work bridges the divide between high-dimensional sensory inputs and actions, resulting in the first artificial agent that is capable of learning to excel at a diverse array of challenging tasks.


We set out to create a single algorithm that would be able to develop a wide range of competencies on a varied range of challenging tasks, a central goal of general artificial intelligence that has eluded previous efforts. To achieve this, we developed a novel agent, a deep Q-network (DQN), which is able to combine reinforcement learning with a class of artificial neural networks known as deep neural networks. Notably, recent advances in deep neural networks, in which several layers of nodes are used to build up progressively more abstract representations of the data, have made it possible for artificial neural networks to learn concepts such as object categories directly from raw sensory data. We use one particularly successful architecture, the deep convolutional network, which uses hierarchical layers of tiled convolutional filters to mimic the effects of receptive fields, inspired by Hubel and Wiesel's seminal work on feedforward processing in early visual cortex, thereby exploiting the spatial correlations present in images, and building in robustness to natural transformations such as changes of viewpoint or scale.
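The exact layer sizes are given in the paper's Methods section (three convolutional layers followed by a fully connected layer). As a rough sketch of what such a Q-network looks like in code, here is a minimal PyTorch version; PyTorch itself and the class/variable names are my own choices for illustration, not part of the paper:

```python
import torch
import torch.nn as nn

class DQN(nn.Module):
    """Convolutional Q-network: maps a stack of 4 grayscale 84x84 frames
    to one Q-value per action (layer sizes follow the paper's Methods)."""

    def __init__(self, n_actions: int):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(4, 32, kernel_size=8, stride=4), nn.ReLU(),   # 84x84 -> 20x20
            nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),  # 20x20 -> 9x9
            nn.Conv2d(64, 64, kernel_size=3, stride=1), nn.ReLU(),  # 9x9  -> 7x7
        )
        self.head = nn.Sequential(
            nn.Flatten(),
            nn.Linear(64 * 7 * 7, 512), nn.ReLU(),
            nn.Linear(512, n_actions),  # one Q-value per possible action
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, 4, 84, 84) float tensor, pixel values scaled to [0, 1]
        return self.head(self.features(x))
```

Because the final linear layer has one output per action, a single forward pass yields Q-values for every action at once, which makes the greedy max over actions cheap.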


We consider tasks in which the agent interacts with an environment through a sequence of observations, actions and rewards. The goal of the agent is to select actions in a fashion that maximizes cumulative future reward. More formally, we use a deep convolutional neural network to approximate the optimal action-value function

$$ Q^*(s,a) = \max_\pi \mathbb{E}\left[\, r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + \cdots \mid s_t = s,\ a_t = a,\ \pi \,\right], $$

which is the maximum sum of rewards r_t discounted by γ at each time-step t, achievable by a behaviour policy π = P(a|s), after making an observation (s) and taking an action (a) (see Methods) [19].
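In plain terms, the quantity being maximized is just the discounted sum of rewards collected along a trajectory. A tiny illustrative helper (my own, not from the paper) makes the role of the discount factor γ explicit:

```python
def discounted_return(rewards, gamma=0.99):
    """r_t + gamma * r_{t+1} + gamma**2 * r_{t+2} + ... for one trajectory.
    Q*(s, a) is the best value of this sum achievable after observing s
    and taking action a, over all behaviour policies."""
    total, discount = 0.0, 1.0
    for r in rewards:
        total += discount * r
        discount *= gamma
    return total


# Example: with gamma = 0.9, rewards [1, 0, 1] are worth 1 + 0 + 0.81 = 1.81.
print(discounted_return([1, 0, 1], gamma=0.9))
```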

Reinforcement learning is known to be unstable or even to diverge when a nonlinear function approximator such as a neural network is used to represent the action-value function. This instability has several causes: the correlations present in the sequence of observations, the fact that small updates to Q may significantly change the policy and therefore change the data distribution, and the correlations between the action-values (Q) and the target values r + γ max_{a'} Q(s', a').

We address these instabilities with a novel variant of Q-learning, which uses two key ideas. First, we used a biologically inspired mechanism termed experience replay that randomizes over the data, thereby removing correlations in the observation sequence and smoothing over changes in the data distribution (see below for details). Second, we used an iterative update that adjusts the action-values (Q) towards target values that are only periodically updated, thereby reducing correlations with the target.
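A minimal sketch of the experience-replay side of this, assuming a fixed-capacity buffer sampled uniformly at random (the class name, capacity and batch size are illustrative choices, not values from the paper):

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-size store of past transitions. Sampling minibatches uniformly
    at random breaks the temporal correlations in the observation sequence
    and smooths the training data distribution over many past policies."""

    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)  # oldest transitions are discarded first

    def push(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size=32):
        return random.sample(self.buffer, batch_size)

    def __len__(self):
        return len(self.buffer)
```

The second idea amounts to keeping a separate copy of the Q-network whose weights are frozen for a while and only overwritten every so many steps (in PyTorch terms, something like `target_net.load_state_dict(q_net.state_dict())`); the targets in the loss below are computed with that frozen copy.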


While other stable methods exist for training neural networks in the reinforcement learning setting, such as neural fitted Q-iteration [24], these methods involve the repeated training of networks de novo on hundreds of iterations. Consequently, these methods, unlike our algorithm, are too inefficient to be used successfully with large neural networks. We parameterize an approximate value function Q(s, a; θ_i) using the deep convolutional neural network shown in Fig. 1, in which θ_i are the parameters (that is, weights) of the Q-network at iteration i. To perform experience replay we store the agent's experiences e_t = (s_t, a_t, r_t, s_{t+1}) at each time-step t in a data set D_t = {e_1, ..., e_t}. During learning, we apply Q-learning updates on samples (or minibatches) of experience (s, a, r, s') ~ U(D), drawn uniformly at random from the pool of stored samples. The Q-learning update at iteration i uses the following loss function:

$$ L_i(\theta_i) = \mathbb{E}_{(s,a,r,s') \sim U(D)}\left[\left( r + \gamma \max_{a'} Q(s', a'; \theta_i^-) - Q(s, a; \theta_i) \right)^2\right] $$
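Putting the pieces together, one evaluation of that loss could look roughly like the following; the function name, the plain squared error (chosen to match the formula above), and the tensor layout are my assumptions, not code from the paper:

```python
import torch
import torch.nn.functional as F

def dqn_loss(q_net, target_net, batch, gamma=0.99):
    """Q-learning loss on one uniformly sampled minibatch.
    Targets use the periodically updated target network (theta_i^-),
    so they stay fixed between target-network syncs."""
    states, actions, rewards, next_states, dones = batch  # already torch tensors

    # Q(s, a; theta_i) for the actions that were actually taken
    q_values = q_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)

    # r + gamma * max_a' Q(s', a'; theta_i^-); no bootstrapping past a terminal state
    with torch.no_grad():
        next_q = target_net(next_states).max(dim=1).values
        targets = rewards + gamma * (1.0 - dones) * next_q

    return F.mse_loss(q_values, targets)
```

A training loop would then alternate: act ε-greedily with `q_net`, push the transition into the replay buffer, sample a minibatch, take one optimizer step on `dqn_loss`, and copy the weights of `q_net` into `target_net` every fixed number of steps.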