Let me read through a paper recommended by Professor 이필규 of Inha University.

http://www.nature.com/nature/journal/v518/n7540/full/nature14236.html


Human-level control through deep reinforcement learning 

It is a paper published in Nature under this title.

The theory of reinforcement learning provides a normative account, deeply rooted in psychological and neuroscientific perspectives on animal behaviour, of how agents may optimize their control of an environment. To use reinforcement learning successfully in situations approaching real-world complexity, however, agents are confronted with a difficult task: they must derive efficient representations of the environment from high-dimensional sensory inputs, and use these to generalize past experience to new situations. Remarkably, humans and other animals seem to solve this problem through a harmonious combination of reinforcement learning and hierarchical sensory processing systems, the former evidenced by a wealth of neural data revealing notable parallels between the phasic signals emitted by dopaminergic neurons and temporal difference reinforcement learning algorithms. While reinforcement learning agents have achieved some successes in a variety of domains, their applicability has previously been limited to domains in which useful features can be handcrafted, or to domains with fully observed, low-dimensional state spaces. Here we use recent advances in training deep neural networks to develop a novel artificial agent, termed a deep Q-network, that can learn successful policies directly from high-dimensional sensory inputs using end-to-end reinforcement learning. We tested this agent on the challenging domain of classic Atari 2600 games. We demonstrate that the deep Q-network agent, receiving only the pixels and the game score as inputs, was able to surpass the performance of all previous algorithms and achieve a level comparable to that of a professional human games tester across a set of 49 games, using the same algorithm, network architecture and hyperparameters. This work bridges the divide between high-dimensional sensory inputs and actions, resulting in the first artificial agent that is capable of learning to excel at a diverse array of challenging tasks.


We set out to create a single algorithm that would be able to develop a wide range of competencies on a varied range of challenging tasks, a central goal of general artificial intelligence that has eluded previous efforts. To achieve this, we developed a novel agent, a deep Q-network (DQN), which is able to combine reinforcement learning with a class of artificial neural network known as deep neural networks. Notably, recent advances in deep neural networks, in which several layers of nodes are used to build up progressively more abstract representations of the data, have made it possible for artificial neural networks to learn concepts such as object categories directly from raw sensory data. We use one particularly successful architecture, the deep convolutional network, which uses hierarchical layers of tiled convolutional filters to mimic the effects of receptive fields, inspired by Hubel and Wiesel's seminal work on feedforward processing in early visual cortex, thereby exploiting the local spatial correlations present in images, and building in robustness to natural transformations such as changes of viewpoint or scale.


We consider tasks in which the agent interacts with an environment through a sequence of observations, actions and rewards. The goal of the agent is to select actions in a fashion that maximizes cumulative future reward. More formally, we use a deep convolutional neural network to approximate the optimal action-value function:



Q*(s, a) = max_π E[ r_t + γ r_{t+1} + γ² r_{t+2} + ... | s_t = s, a_t = a, π ],

which is the maximum sum of rewards r_t discounted by γ at each time step t, achievable by a behaviour policy π = P(a|s), after making an observation (s) and taking an action (a) (see Methods).

Reinforcement learning is known to be unstable or even to diverge when a nonlinear function approximator such as a neural network is used to represent the action-value function. This instability has several causes: the correlations present in the sequence of observations, the fact that small updates to Q may significantly change the policy and therefore change the data distribution, and the correlations between the action-values (Q) and the target values r + γ max_{a'} Q(s', a').

We address these instabilities with a novel variant of Q-learning, which uses two key ideas. First, we used a biologically inspired mechanism termed experience replay that randomizes over the data, thereby removing correlations in the observation sequence and smoothing over changes in the data distribution (see below for details). Second, we used an iterative update that adjusts the action-values (Q) towards target values that are only periodically updated, thereby reducing correlations with the target.
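To make these two ideas concrete, here is a minimal C++ sketch, my own illustration rather than the authors' implementation, of a uniformly sampled experience replay buffer plus a periodic hard copy of the online parameters into the target parameters. The Transition fields, the QParams placeholder and the sync interval C are assumptions made only for this sketch.

#include <cstddef>
#include <deque>
#include <random>
#include <utility>
#include <vector>

struct Transition {                  // e_t = (s_t, a_t, r_t, s_{t+1})
    std::vector<float> state;
    int action;
    float reward;
    std::vector<float> nextState;
    bool terminal;                   // true if s_{t+1} ended the episode
};

class ReplayBuffer {
public:
    explicit ReplayBuffer(std::size_t capacity) : capacity_(capacity) {}

    // Store e_t in the data set D_t, discarding the oldest experience when full.
    void push(Transition t) {
        if (buffer_.size() == capacity_) buffer_.pop_front();
        buffer_.push_back(std::move(t));
    }

    // Draw a minibatch (s, a, r, s') ~ U(D), uniformly at random with replacement.
    // Precondition: the buffer is non-empty.
    std::vector<Transition> sample(std::size_t batchSize, std::mt19937& rng) const {
        std::uniform_int_distribution<std::size_t> pick(0, buffer_.size() - 1);
        std::vector<Transition> batch;
        batch.reserve(batchSize);
        for (std::size_t i = 0; i < batchSize; ++i)
            batch.push_back(buffer_[pick(rng)]);
        return batch;
    }

private:
    std::size_t capacity_;
    std::deque<Transition> buffer_;
};

// Placeholder for the Q-network parameters (the real network is convolutional).
struct QParams { std::vector<float> weights; };

// Hard-copy the online parameters into the target parameters every C steps,
// so the regression target changes only periodically.
void maybeSyncTarget(const QParams& online, QParams& target, long step, long C) {
    if (step % C == 0) target = online;
}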




While other stable methods exist for training neural networks in the reinforcement learning setting, such as neural fitted Q-iteration, these methods involve the repeated training of networks de novo on hundreds of iterations. Consequently, these methods, unlike our algorithm, are too inefficient to be used successfully with large neural networks. We parameterize an approximate value function Q(s, a; θ_i) using the deep convolutional neural network shown in Fig. 1, in which θ_i are the parameters (that is, weights) of the Q-network at iteration i. To perform experience replay we store the agent's experiences e_t = (s_t, a_t, r_t, s_{t+1}) at each time-step t in a data set D_t = {e_1, ..., e_t}. During learning, we apply Q-learning updates on samples (or minibatches) of experience (s, a, r, s') ~ U(D), drawn uniformly at random from the pool of stored samples. The Q-learning update at iteration i uses the following loss function:

L_i(θ_i) = E_{(s,a,r,s') ~ U(D)} [ ( r + γ max_{a'} Q(s', a'; θ_i^-) - Q(s, a; θ_i) )² ],

in which γ is the discount factor and θ_i^- are the target-network parameters used to compute the target at iteration i.
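And here is a minimal sketch of how that loss could be evaluated for one sampled minibatch, reusing the Transition type from the replay-buffer sketch above; onlineQ and targetQ are hypothetical stand-ins for forward passes of the Q-network with parameters θ_i and θ_i^-.

#include <algorithm>
#include <functional>
#include <vector>

// Q(s, .) for every action, given some fixed set of network parameters.
using QFn = std::function<std::vector<float>(const std::vector<float>&)>;

// Mean squared TD error over one minibatch:
//   y = r                                       if s' is terminal
//   y = r + gamma * max_a' Q(s', a'; theta^-)   otherwise
//   L = mean over the minibatch of (y - Q(s, a; theta_i))^2
float minibatchLoss(const std::vector<Transition>& batch,
                    const QFn& onlineQ,   // Q(., .; theta_i)
                    const QFn& targetQ,   // Q(., .; theta_i^-), the frozen copy
                    float gamma) {
    float loss = 0.0f;
    for (const Transition& e : batch) {
        float y = e.reward;
        if (!e.terminal) {
            std::vector<float> nextQ = targetQ(e.nextState);
            y += gamma * *std::max_element(nextQ.begin(), nextQ.end());
        }
        float q = onlineQ(e.state)[e.action];
        float diff = y - q;
        loss += diff * diff;          // squared TD error for this transition
    }
    return loss / static_cast<float>(batch.size());
}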


Given a sequence of n numbers, the problem is to compute the sum of the medians of every contiguous subsequence of length K.


Insert the first K numbers into a segment tree, then slide the window one position at a time up to N, removing the number that leaves the window and adding the one that enters, and query the median of the current window at each step.


Here the median is the (K+1)/2-th smallest number in the window.


The segment tree used is a sum segment tree in which each leaf stores a count of 1 or 0, exactly as in "find the k-th number" problems, so the median is obtained by querying for the (K+1)/2-th element.


One thing to watch out for: the tree is indexed by value (coordinate), and the same value can occur more than once, so an update must add +1 or -1 to the current count at that coordinate rather than overwrite it.
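As a quick check of the sliding-window idea (my own example, not from the problem statement): with n = 5, K = 3 and the sequence 1 5 2 6 3, the windows are (1, 5, 2), (5, 2, 6) and (2, 6, 3); their (K+1)/2 = 2nd smallest elements are 2, 5 and 3, so the answer is 2 + 5 + 3 = 10.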



#include <cstdio>

const int MAX_N = 250000;   // values are used directly as tree coordinates in [0, MAX_N-1]

int n, k;
int seg[MAX_N * 4];         // sum segment tree: counts how many window elements sit at each value
int trans[MAX_N + 1];       // the input sequence

// add val (+1 or -1) to the count stored at coordinate pos
void update(int pos, int val, int node, int x, int y) {
    if (pos < x || y < pos) return;
    if (x == y) { seg[node] += val; return; }
    int mid = (x + y) >> 1;
    update(pos, val, node * 2, x, mid);
    update(pos, val, node * 2 + 1, mid + 1, y);
    seg[node] = seg[node * 2] + seg[node * 2 + 1];
}

// return the coordinate of the val-th smallest element currently stored in the tree
int query(int val, int node, int x, int y) {
    if (x == y) return x;
    int mid = (x + y) >> 1;
    if (val <= seg[node * 2])
        return query(val, node * 2, x, mid);
    else
        return query(val - seg[node * 2], node * 2 + 1, mid + 1, y);
}

int main() {
    // freopen("input.txt", "r", stdin);   // local testing only
    scanf("%d %d", &n, &k);
    for (int i = 0; i < n; i++)
        scanf("%d", &trans[i]);

    int nk = (k + 1) >> 1;      // the median is the (k+1)/2-th smallest element
    long long ans = 0LL;

    // insert the first k values
    for (int i = 0; i < k; i++)
        update(trans[i], 1, 1, 0, MAX_N - 1);

    // slide the window: query the median, drop the front element, add the next one
    for (int i = 0; i + k <= n; i++) {
        ans += query(nk, 1, 0, MAX_N - 1);          // median of window [i, i+k-1]
        if (i + k < n) {
            update(trans[i], -1, 1, 0, MAX_N - 1);
            update(trans[i + k], 1, 1, 0, MAX_N - 1);
        }
    }
    printf("%lld\n", ans);
}



Here is how to install PyCharm on macOS.

I have tried it in a Windows environment as well, but the Mac environment seems nicer, so I am switching over.


The installation steps on Windows are not much different, so this guide should be useful there too.


The reason for installing PyCharm is that coding in Python without a dedicated IDE is cumbersome.

java -> eclipse

python -> pycharm is an easy way to think of it.


https://www.jetbrains.com/pycharm/?fromMenu 

Go to the site and download the PyCharm installer that matches your OS.

The Community edition is free, so click on the Community edition.


Follow the remaining steps of the installer to finish the installation.



Now, let's get into machine learning in earnest with PyCharm!
