Deep reinforcement learning: Difference between revisions

Revision as of 13:45, 15 December 2021 (Wednesday)

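Deep reinforcement learning (Deep RL or DRL) is a subfield of machine learning that combines reinforcement learning and deep learning. Reinforcement learning studies how an intelligent agent can learn to make better decisions through trial and error. Deep reinforcement learning incorporates deep learning methods so that the agent can make decisions directly from unstructured data, without a hand-engineered state space. Deep RL algorithms can take in very large inputs (such as every pixel of a video game screen) and determine which action best achieves an objective (such as the highest game score). Deep reinforcement learning has been applied widely, including in robotics, video games, natural language processing, computer vision, education, transportation, finance, and healthcare.^[1]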

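Overview

Deep learning

Deep learning is a form of machine learning that trains artificial neural networks to map a set of inputs to a specific set of outputs. It is often carried out as supervised learning, with training on a labeled dataset. Deep learning methods can operate directly on high-dimensional, complex raw input data and, compared with earlier methods, need less manual feature engineering to extract features from the input. As a result, deep learning has brought breakthroughs in fields such as computer vision and natural language processing.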

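Reinforcement learning

In reinforcement learning, an intelligent agent interacts with an environment and learns by trial and error to make better decisions. Such problems are commonly formalized as a Markov decision process: at each time step, the agent is in a state $s$ of the environment; after taking an action $a$, it receives a reward $r$ and moves to the next state $s'$ according to the environment's state transition function $p(s'|s,a)$. The agent's goal is to learn a policy $\pi (a|s)$ (that is, a mapping from the current state to the action to take) that maximizes the total reward obtained. Unlike in optimal control, a reinforcement learning algorithm can probe the state transition function $p(s'|s,a)$ only by sampling.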
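
The interaction loop of a Markov decision process can be sketched in a few lines of Python. This is a minimal illustration, not part of the article: the five-state chain environment, the slip probability, and the names `step`, `random_policy`, and `rollout` are all assumptions made up for the example. The agent never reads the transition function directly; it only observes sampled $(s, a, r, s')$ tuples.

```python
import random

# Hypothetical 5-state chain MDP: states 0..4, action 0 = left, action 1 = right.
N_STATES = 5
ACTIONS = (0, 1)
SLIP_PROB = 0.1  # assumed probability that the chosen move is reversed


def step(s, a, rng):
    """Sample (s', r) from the environment; stands in for p(s'|s,a) and the reward."""
    move = 1 if a == 1 else -1
    if rng.random() < SLIP_PROB:
        move = -move  # the environment is stochastic
    s_next = min(max(s + move, 0), N_STATES - 1)
    r = 1.0 if s_next == N_STATES - 1 else 0.0  # reward for landing on the last state
    return s_next, r


def random_policy(s, rng):
    """A uniform policy pi(a|s) that ignores the state."""
    return rng.choice(ACTIONS)


def rollout(policy, horizon, seed=0):
    """Run one episode and return the sampled transitions and the total reward."""
    rng = random.Random(seed)
    s, total = 0, 0.0
    trajectory = []
    for _ in range(horizon):
        a = policy(s, rng)
        s_next, r = step(s, a, rng)
        trajectory.append((s, a, r, s_next))
        total += r
        s = s_next
    return trajectory, total


if __name__ == "__main__":
    transitions, total_reward = rollout(random_policy, horizon=20)
    print(len(transitions), total_reward)
```

A learning algorithm would replace `random_policy` with a policy that is updated from the collected transitions.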

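Deep reinforcement learning

In many real-world decision problems, the state $s$ of the Markov decision process is high-dimensional (for example, a photograph taken by a camera or a stream of readings from a robot's sensors), which limits the feasibility of traditional reinforcement learning methods. Deep reinforcement learning applies deep learning techniques to the decision problems of reinforcement learning: it trains an artificial neural network to represent the policy $\pi (a|s)$ and develops algorithms specialized for this training setting.^[2]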
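
As a rough illustration of what "a neural network that represents $\pi (a|s)$" means, the sketch below maps a high-dimensional state vector to a probability distribution over a discrete set of actions. It is an assumption-laden toy, not the architecture of any cited paper: the single hidden layer, the layer sizes, and the names `init_policy`, `policy_probs`, and `sample_action` are hypothetical, and the weights are random rather than trained.

```python
import math
import random


def init_policy(state_dim, hidden_dim, n_actions, seed=0):
    """Create small random weights for a one-hidden-layer policy network."""
    rng = random.Random(seed)

    def matrix(rows, cols):
        return [[rng.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]

    return {
        "W1": matrix(hidden_dim, state_dim), "b1": [0.0] * hidden_dim,
        "W2": matrix(n_actions, hidden_dim), "b2": [0.0] * n_actions,
    }


def affine(W, b, x):
    """Compute W x + b for a matrix stored as a list of rows."""
    return [sum(w * xi for w, xi in zip(row, x)) + bi for row, bi in zip(W, b)]


def policy_probs(params, state):
    """Return pi(.|s): a softmax over action logits computed from the raw state."""
    hidden = [math.tanh(v) for v in affine(params["W1"], params["b1"], state)]
    logits = affine(params["W2"], params["b2"], hidden)
    top = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(v - top) for v in logits]
    norm = sum(exps)
    return [e / norm for e in exps]


def sample_action(params, state, rng):
    """Draw an action index a ~ pi(a|s)."""
    probs = policy_probs(params, state)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


if __name__ == "__main__":
    # A flattened 8x8 grayscale "image" stands in for a high-dimensional state.
    params = init_policy(state_dim=64, hidden_dim=16, n_actions=4)
    state = [0.5] * 64
    print(policy_probs(params, state), sample_action(params, state, random.Random(1)))
```

Training would adjust `W1`, `b1`, `W2`, and `b2` so that actions leading to higher total reward become more probable.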

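Algorithms

Many deep reinforcement learning algorithms now exist for training decision-making models, each with its own strengths and weaknesses. Roughly speaking, they fall into two classes according to whether they build a model of the environment dynamics:

Model-based deep reinforcement learning algorithms: build neural network models to predict the environment's reward function $r(s,a)$ and state transition function $p(s'|s,a)$; these networks can be trained with supervised learning. Once the environment model is trained, a policy $\pi (a|s)$ can be constructed with model predictive control. However, because the learned model may not predict the real environment perfectly, the agent often needs to replan its actions while interacting with the environment. Alternatively, actions can be planned against the trained environment model using Monte Carlo tree search or the cross-entropy method.

Model-free deep reinforcement learning algorithms: train a neural network directly to represent the policy $\pi (a|s)$. Here "model-free" means that no model of the environment is built, not that no machine learning model is built. Such a policy model can be trained directly with the policy gradient^[3], but policy gradient estimates have high variance, which makes efficient training difficult. More advanced training methods try to address this stability problem: Trust Region Policy Optimization (TRPO)^[4] and Proximal Policy Optimization (PPO)^[5]. Another family of model-free deep reinforcement learning algorithms trains a neural network to predict the sum of future rewards, $V^{\pi }(s)$ or $Q^{\pi }(s,a)$ ^[6]; these algorithms include temporal difference learning, deep Q-learning, and SARSA. If the action space is discrete, the policy $\pi (a|s)$ can be obtained by enumerating all actions to find the one that maximizes the $Q$ function. If the action space is continuous, such a $Q$ function cannot directly yield a policy $\pi (a|s)$, so a policy model must be trained at the same time^[7]^[8]^[9], which makes the method an "actor-critic" algorithm.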
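
The value-based, discrete-action case described above can be sketched as follows. To keep the example short it uses tabular Q-learning, so a lookup table stands in for the neural network that deep Q-learning would use; the chain environment, the hyperparameters, and the names `step`, `greedy_action`, and `train` are assumptions for illustration only. The update is a temporal difference step toward $r + \gamma \max_{a'} Q(s', a')$, and the policy is recovered by enumerating the actions and taking the one with the largest $Q$ value.

```python
import random

# Hypothetical deterministic chain: states 0..4, action 0 = left, action 1 = right.
N_STATES = 5
ACTIONS = (0, 1)
GAMMA = 0.9  # assumed discount factor
ALPHA = 0.5  # assumed learning rate


def step(s, a):
    """Environment transition; reaching the last state gives reward 1 and ends the episode."""
    s_next = min(max(s + (1 if a == 1 else -1), 0), N_STATES - 1)
    done = s_next == N_STATES - 1
    return s_next, (1.0 if done else 0.0), done


def greedy_action(Q, s):
    """Discrete actions: enumerate them and pick the one that maximizes Q(s, a)."""
    return max(ACTIONS, key=lambda a: Q[s][a])


def train(episodes=500, max_steps=50, seed=0):
    """Tabular Q-learning with a uniformly random behavior policy (off-policy)."""
    rng = random.Random(seed)
    Q = [[0.0 for _ in ACTIONS] for _ in range(N_STATES)]
    for _ in range(episodes):
        s = 0
        for _ in range(max_steps):
            a = rng.choice(ACTIONS)  # explore uniformly; Q-learning still learns the greedy values
            s_next, r, done = step(s, a)
            # Temporal difference target: bootstrap from the best next action.
            target = r if done else r + GAMMA * max(Q[s_next])
            Q[s][a] += ALPHA * (target - Q[s][a])
            s = s_next
            if done:
                break
    return Q


if __name__ == "__main__":
    Q = train()
    print([greedy_action(Q, s) for s in range(N_STATES - 1)])
```

With a continuous action space the `max` over `ACTIONS` is no longer a simple enumeration, which is why the text turns to actor-critic methods that train a separate policy model.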

Applications

Games

Go: AlphaGo
Chess

Robotics

Robot planning

Smart cities

Indoor positioning^[10]
Intelligent transportation

See also

Reinforcement learning
Q-learning
State-action-reward-state-action
Deep learning

References

[1] Francois-Lavet, Vincent; Henderson, Peter; Islam, Riashat; Bellemare, Marc G.; Pineau, Joelle. An Introduction to Deep Reinforcement Learning. Foundations and Trends in Machine Learning. 2018, 11 (3–4): 219–354. Bibcode:2018arXiv181112560F. ISSN 1935-8237. S2CID 54434537. arXiv:1811.12560. doi:10.1561/2200000071.
[2] Mnih, Volodymyr; et al. Human-level control through deep reinforcement learning. Nature. 2015, 518 (7540): 529–533. Bibcode:2015Natur.518..529M. PMID 25719670. S2CID 205242740. doi:10.1038/nature14236.
[3] Williams, Ronald J. Simple Statistical Gradient-Following Algorithms for Connectionist Reinforcement Learning. Machine Learning. 1992, 8 (3–4): 229–256. S2CID 2332513. doi:10.1007/BF00992696.
[4] Schulman, John; Levine, Sergey; Moritz, Philipp; Jordan, Michael; Abbeel, Pieter. Trust Region Policy Optimization. International Conference on Machine Learning (ICML). 2015. arXiv:1502.05477.
[5] Schulman, John; Wolski, Filip; Dhariwal, Prafulla; Radford, Alec; Klimov, Oleg. Proximal Policy Optimization Algorithms. 2017. arXiv:1707.06347.
[6] Mnih, Volodymyr; et al. Playing Atari with Deep Reinforcement Learning (PDF). NIPS Deep Learning Workshop 2013. December 2013.
[7] Lillicrap, Timothy; Hunt, Jonathan; Pritzel, Alexander; Heess, Nicolas; Erez, Tom; Tassa, Yuval; Silver, David; Wierstra, Daan. Continuous control with deep reinforcement learning. International Conference on Learning Representations (ICLR). 2016. arXiv:1509.02971.
[8] Mnih, Volodymyr; Puigdomenech Badia, Adria; Mirza, Mehdi; Graves, Alex; Harley, Tim; Lillicrap, Timothy; Silver, David; Kavukcuoglu, Koray. Asynchronous Methods for Deep Reinforcement Learning. International Conference on Machine Learning (ICML). 2016. arXiv:1602.01783.
[9] Haarnoja, Tuomas; Zhou, Aurick; Levine, Sergey; Abbeel, Pieter. Soft Actor-Critic: Off-Policy Maximum Entropy Deep Reinforcement Learning with a Stochastic Actor. International Conference on Machine Learning (ICML). 2018. arXiv:1801.01290.
[10] Mohammadi, Mehdi; Al-Fuqaha, Ala; Guizani, Mohsen; Oh, Jun-Seok. Semisupervised Deep Reinforcement Learning in Support of IoT and Smart City Services. IEEE Internet of Things Journal. 2018, 5 (2): 624–635. doi:10.1109/JIOT.2017.2712560.

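@@ Line 13: / Line 13: @@
 In many real-world decision problems, the state <math>s</math> of the [[馬可夫決策過程|Markov decision process]] is high-dimensional (for example, a photograph taken by a camera or a stream of readings from a robot's sensors), which limits the feasibility of traditional reinforcement learning methods. Deep reinforcement learning applies deep learning techniques to the decision problems of reinforcement learning: it trains an artificial neural network to represent the policy <math>\pi(a|s)</math> and develops algorithms specialized for this training setting.<ref name="DQN2"/>
-== Algorithms ==
+==Algorithms==
-Some algorithms include:
+Many deep reinforcement learning algorithms now exist for training decision-making models, each with its own strengths and weaknesses. Roughly speaking, they fall into two classes according to whether they build a model of the environment dynamics:
-* [[Q學習|Deep Q-learning]]
-* [[强化学习|Actor-Critic]]
+* '''Model-based''' deep reinforcement learning algorithms: build neural network models to predict the environment's reward function <math>r(s, a)</math> and state transition function <math>p(s'|s, a)</math>; these networks can be trained with [[監督式學習|supervised learning]]. Once the environment model is trained, a policy <math>\pi(a|s)</math> can be constructed with [[模型預測控制|model predictive control]]. However, because the learned model may not predict the real environment perfectly, the agent often needs to replan its actions while interacting with the environment. Alternatively, actions can be planned against the trained environment model using [[蒙地卡羅樹搜尋|Monte Carlo tree search]] or the {{link-en|交叉熵方法|Cross-entropy method}}.
-* DDPG
-* TRPO
+* '''Model-free''' deep reinforcement learning algorithms: train a neural network directly to represent the policy <math>\pi(a|s)</math>. Here "model-free" means that no model of the environment is built, not that no machine learning model is built. Such a policy model can be trained directly with the policy gradient<ref name="williams1992"/>, but policy gradient estimates have high variance, which makes efficient training difficult. More advanced training methods try to address this stability problem: Trust Region Policy Optimization (TRPO)<ref name="schulman2015trpo"/> and Proximal Policy Optimization (PPO)<ref name="schulman2017ppo"/>. Another family of model-free deep reinforcement learning algorithms trains a neural network to predict the sum of future rewards, <math>V^{\pi}(s)</math> or <math>Q^{\pi}(s, a)</math><ref name="DQN1"/>; these algorithms include {{link-en|時序差分學習|temporal difference learning}}, [[Q學習|deep Q-learning]], and {{link-en|SARSA|State–action–reward–state–action}}. If the action space is discrete, the policy <math>\pi(a|s)</math> can be obtained by enumerating all actions to find the one that maximizes the <math>Q</math> function. If the action space is continuous, such a <math>Q</math> function cannot directly yield a policy <math>\pi(a|s)</math>, so a policy model must be trained at the same time<ref name="lillicrap2015ddpg"/><ref name="mnih2016a3c"/><ref name="haarnoja2018sac"/>, which makes the method an "actor-critic" algorithm.
-* PPO
 == Applications ==
@@ Line 53: / Line 52: @@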
 {{Reflist|refs=
+<ref name="DQN1">{{cite conference |first= Volodymyr|display-authors=etal|last= Mnih |date=December 2013 |title= Playing Atari with Deep Reinforcement Learning |url= https://www.cs.toronto.edu/~vmnih/docs/dqn.pdf |conference= NIPS Deep Learning Workshop 2013}}</ref>
 <ref name="DQN2">{{cite journal |first= Volodymyr|display-authors=etal|last= Mnih |year=2015 |title= Human-level control through deep reinforcement learning |journal=Nature|volume=518 |issue=7540 |pages=529–533 |doi=10.1038/nature14236|pmid=25719670|bibcode=2015Natur.518..529M |s2cid=205242740}}</ref>
 <ref name="francoislavet2018">{{cite journal|last1=Francois-Lavet|first1=Vincent|last2=Henderson|first2=Peter|last3=Islam|first3=Riashat|last4=Bellemare|first4=Marc G.|last5=Pineau|first5=Joelle|date=2018|title=An Introduction to Deep Reinforcement Learning|journal=Foundations and Trends in Machine Learning|volume=11|issue=3–4|pages=219–354|arxiv=1811.12560|bibcode=2018arXiv181112560F|doi=10.1561/2200000071|issn=1935-8237|s2cid=54434537}}</ref>
 <ref name="mohammadi2018semi">{{cite journal|title=Semisupervised Deep Reinforcement Learning in Support of IoT and Smart City Services|url=https://ieeexplore.ieee.org/document/7945258/|first1=Mehdi|last2=Al-Fuqaha|first2=Ala|journal=IEEE Internet of Things Journal|issue=2|doi=10.1109/JIOT.2017.2712560|year=2018|volume=5|pages=624-635|last3=Guizani|first3=Mohsen|last4=Oh|first4=Jun-Seok|last1=Mohammadi}}</ref>
+<ref name="schulman2015trpo">{{Cite conference|title=Trust Region Policy Optimization |last1=Schulman|first1=John|last2=Levine|first2=Sergey|last3=Moritz|first3=Philipp|last4=Jordan|first4=Michael|last5=Abbeel|first5=Pieter|date=2015|arxiv=1502.05477|conference=International Conference on Machine Learning (ICML)|url=https://arxiv.org/abs/1502.05477}}</ref>
+<ref name="schulman2017ppo">{{Cite conference|title=Proximal Policy Optimization Algorithms |last1=Schulman|first1=John|last2=Wolski|first2=Filip|last3=Dhariwal|first3=Prafulla|last4=Radford|first4=Alec|last5=Klimov|first5=Oleg|date=2017|arxiv=1707.06347|url=https://arxiv.org/abs/1707.06347}}</ref>
+<ref name="williams1992">{{Cite journal|last1=Williams|first1=Ronald J|journal=Machine Learning|pages=229–256|title = Simple Statistical Gradient-Following Algorithms for Connectionist Reinforcement Learning|date=1992|volume=8|issue=3–4|doi=10.1007/BF00992696|s2cid=2332513|doi-access=free}}</ref>
+<ref name="lillicrap2015ddpg">{{Cite conference|title=Continuous control with deep reinforcement learning |last1=Lillicrap|first1=Timothy |last2=Hunt|first2=Jonathan |last3=Pritzel|first3=Alexander |last4=Heess|first4=Nicolas |last5=Erez|first5=Tom |last6=Tassa|first6=Yuval |last7=Silver|first7=David |last8=Wierstra|first8=Daan |conference=International Conference on Learning Representations (ICLR)|date=2016|arxiv=1509.02971|url=https://arxiv.org/abs/1509.02971}}</ref>
+<ref name="mnih2016a3c">{{Cite conference|title=Asynchronous Methods for Deep Reinforcement Learning |last1=Mnih|first1=Volodymyr |last2=Puigdomenech Badia|first2=Adria |last3=Mirzi|first3=Mehdi |last4=Graves|first4=Alex |last5=Harley|first5=Tim |last6=Lillicrap|first6=Timothy |last7=Silver|first7=David |last8=Kavukcuoglu|first8=Koray |conference=International Conference on Machine Learning (ICML)|date=2016|arxiv=1602.01783|url=https://arxiv.org/abs/1602.01783}}</ref>
+<ref name="haarnoja2018sac">{{Cite conference|title=Soft Actor-Critic: Off-Policy Maximum Entropy Deep Reinforcement Learning with a Stochastic Actor |last1=Haarnoja|first1=Tuomas |last2=Zhou|first2=Aurick |last3=Levine|first3=Sergey |last4=Abbeel|first4=Pieter |conference=International Conference on Machine Learning (ICML)|date=2018|arxiv=1801.01290|url=https://arxiv.org/abs/1801.01290}}</ref>
 }}
 [[Category:机器学习]]