Temporal difference learning

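Temporal difference (TD) learning refers to a class of model-free reinforcement learning methods that learn by bootstrapping from the current estimate of the value function. These methods sample from the environment, as Monte Carlo methods do, and update the value function based on current estimates, as dynamic programming methods do.^[1]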

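Monte Carlo methods adjust their estimates only once the final outcome is known. TD methods, by contrast, keep adjusting their parameters before the final outcome is available, so that earlier predictions match later, more accurate ones.^[2] This is a form of bootstrapping, as the following example illustrates:

Suppose you need to predict Saturday's weather and you have a model for doing so. In the standard approach, you would wait until Saturday and only then adjust the model according to the outcome. By Friday, however, you should already have a good idea of what Saturday's weather will be, so you can adjust the model's prediction for Saturday before Saturday arrives.^[2]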

Temporal difference methods are related to the temporal difference model of animal learning.^[3]^[4]^[5]^[6]^[7]

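Mathematical formulation

Tabular $TD(0)$ is one of the simplest TD methods and is a special case of stochastic approximation. It estimates the state value function of a finite-state Markov decision process (MDP) under a policy $\pi$. Let $V^{\pi }$ denote the state value function of the MDP with states $(s_{t})_{t\in \mathbb {N} }$, rewards $(r_{t})_{t\in \mathbb {N} }$ and discount rate $\gamma$ under the policy $\pi$:^[8]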

V^{\pi }(s)=E_{a\sim \pi }\left\{\sum _{t=0}^{\infty }\gamma ^{t}r_{t}(a_{t}){\Bigg |}s_{0}=s\right\}.

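For convenience, we drop the action from the notation. $V^{\pi }$ then satisfies the Bellman equation, the discrete-time counterpart of the Hamilton-Jacobi-Bellman equation: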

V^{\pi }(s)=E_{\pi }\{r_{0}+\gamma V^{\pi }(s_{1})|s_{0}=s\},

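so $r_{0}+\gamma V^{\pi }(s_{1})$ is an unbiased estimate of $V^{\pi }(s)$. This observation motivates the following algorithm for estimating $V^{\pi }$. The algorithm starts by initializing a table $V(s)$ arbitrarily, with one value for each state of the MDP, and choosing a positive learning rate $\alpha$. We then repeatedly evaluate the policy $\pi$, obtain a reward $r$, and update the value function for the old state using the rule:^[9]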

V(s)\leftarrow V(s)+\alpha (\overbrace {r+\gamma V(s')} ^{\text{The TD target}}-V(s))

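where $s$ and $s'$ are the old and new states, respectively. The value $r+\gamma V(s')$ is known as the TD target, and the difference $r+\gamma V(s')-V(s)$ is the TD error.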
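The update rule above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the five-state random walk environment, the function name `td0_random_walk`, and the values chosen for `alpha`, `gamma`, and `episodes` are all assumptions introduced here. The policy being evaluated steps left or right with equal probability, and a reward of 1 is paid only on reaching the right-hand terminal state, so the true values of the five states are 1/6, 2/6, ..., 5/6.

```python
import random

def td0_random_walk(episodes=10000, alpha=0.02, gamma=1.0, seed=0):
    """Tabular TD(0) policy evaluation on a hypothetical 5-state random walk."""
    rng = random.Random(seed)
    n_states = 5                     # non-terminal states 1..5; 0 and 6 are terminal
    V = [0.0] * (n_states + 2)       # arbitrary initialization; terminal values stay 0
    for _ in range(episodes):
        s = 3                        # every episode starts in the middle state
        while s not in (0, n_states + 1):
            # Policy pi: move left or right with equal probability.
            s_next = s + rng.choice((-1, 1))
            # Reward 1 only when the right-hand terminal state is reached.
            r = 1.0 if s_next == n_states + 1 else 0.0
            td_target = r + gamma * V[s_next]
            V[s] += alpha * (td_target - V[s])   # V(s) <- V(s) + alpha * (target - V(s))
            s = s_next
    return V[1:-1]                   # estimates for the non-terminal states only

if __name__ == "__main__":
    print(td0_random_walk())
```

Each update uses only the immediate reward and the current estimate for the next state, so the table is adjusted after every step rather than at the end of an episode. With a constant learning rate the estimates keep fluctuating around the true values instead of converging exactly.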

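TD-λ algorithm

TD-λ is a learning algorithm invented by Richard S. Sutton, building on earlier work on temporal difference learning by Arthur Lee Samuel. Its best-known application is TD-Gammon, a program developed by Gerald Tesauro that learned to play backgammon at the level of expert human players.^[10] The parameter $\lambda$ is the trace decay parameter, with $0\leq \lambda \leq 1$. Higher settings lead to longer-lasting traces: the larger $\lambda$ is, the larger the share of credit from a reward that is assigned to more distant states and actions. Setting $\lambda =1$ produces learning that parallels Monte Carlo reinforcement learning algorithms.^[11]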
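The role of the trace decay parameter can be illustrated with a small tabular sketch. The code below assumes the accumulating eligibility trace formulation of TD-λ, which is one common variant and is not spelled out in the text above. The five-state random walk, the function name `td_lambda_random_walk`, and all parameter values are hypothetical choices made for illustration.

```python
import random

def td_lambda_random_walk(lam=0.8, episodes=20000, alpha=0.005, gamma=1.0, seed=0):
    """Tabular TD(lambda) with accumulating traces on a hypothetical 5-state random walk."""
    rng = random.Random(seed)
    n = 5                            # non-terminal states 1..5; 0 and 6 are terminal
    V = [0.0] * (n + 2)
    for _ in range(episodes):
        e = [0.0] * (n + 2)          # eligibility traces, reset at the start of each episode
        s = 3
        while s not in (0, n + 1):
            s_next = s + rng.choice((-1, 1))
            r = 1.0 if s_next == n + 1 else 0.0
            delta = r + gamma * V[s_next] - V[s]   # TD error for this step
            e[s] += 1.0                            # mark the current state as eligible
            for i in range(1, n + 1):
                V[i] += alpha * delta * e[i]       # credit every recently visited state
                e[i] *= gamma * lam                # traces decay by gamma * lambda per step
            s = s_next
    return V[1:-1]

if __name__ == "__main__":
    print(td_lambda_random_walk())
```

With `lam=0` the trace of a state vanishes right after its own update, so only the current state is adjusted and the sketch reduces to the one-step TD(0) rule. As `lam` approaches 1, traces decay slowly and a reward at the end of an episode is credited to states visited much earlier.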

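In neuroscience

The TD algorithm has also received attention in neuroscience. Researchers found that the firing rate of dopamine neurons in the ventral tegmental area and the substantia nigra resembles the error function in the algorithm.^[3]^[4]^[5]^[6]^[7] The error function reports the difference between the estimated reward at any given state or time step and the reward actually received. The larger the error, the larger the difference between expected and actual reward.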

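The behavior of dopamine cells also parallels TD learning. In one experiment, researchers trained a monkey to associate a stimulus with a juice reward and measured the activity of its dopamine cells.^[12] At first, the firing rate of the dopamine cells increased when the monkey received juice, indicating a difference between expected and actual reward. As training progressed, the expected reward changed, and the firing rate no longer increased noticeably when the reward arrived. When the expected reward was withheld, the firing rate of the dopamine cells dropped. This pattern resembles the behavior of the error function in TD learning.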
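The three situations in this experiment correspond to the three possible signs of the TD error. The sketch below is a deliberately simplified illustration and not a model taken from the cited studies: the function name `td_error`, the reward value of 1.0 for juice, and the assumption that the trial ends right after the reward (so the next state has value 0) are all hypothetical.

```python
def td_error(reward, v_current, v_next, gamma=1.0):
    """TD error: delta = r + gamma * V(s') - V(s)."""
    return reward + gamma * v_next - v_current

# Hypothetical numbers: juice is worth 1.0 and the trial ends after the reward,
# so the value of the following state is 0.
unexpected = td_error(reward=1.0, v_current=0.0, v_next=0.0)  # reward not yet predicted
predicted = td_error(reward=1.0, v_current=1.0, v_next=0.0)   # reward fully predicted
omitted = td_error(reward=0.0, v_current=1.0, v_next=0.0)     # predicted reward withheld

print(unexpected, predicted, omitted)  # prints: 1.0 0.0 -1.0
```

A positive error matches the increased firing on an unexpected reward, a zero error matches the absence of a response once the reward is predicted, and a negative error matches the drop in firing when the predicted reward is omitted.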

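Much current research on neural function builds on TD learning.^[13]^[14] The approach has also been used to study conditions such as schizophrenia and the pharmacological effects of dopamine.^[15]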

References

[1] Sutton & Barto (2018), p. 133.
[2] Sutton, Richard S. Learning to predict by the methods of temporal differences. Machine Learning. 1988-08-01, 3 (1): 9–44 [2023-04-04]. ISSN 1573-0565. doi:10.1007/BF00115009. (Archived from the original on 2023-03-31.)
[3] Schultz, W.; Dayan, P.; Montague, P. R. A neural substrate of prediction and reward. Science. 1997, 275 (5306): 1593–1599. CiteSeerX 10.1.1.133.6176. PMID 9054347. S2CID 220093382. doi:10.1126/science.275.5306.1593.
[4] Montague, P. R.; Dayan, P.; Sejnowski, T. J. A framework for mesencephalic dopamine systems based on predictive Hebbian learning (PDF). The Journal of Neuroscience. 1996-03-01, 16 (5): 1936–1947 [2023-04-04]. ISSN 0270-6474. PMC 6578666. PMID 8774460. doi:10.1523/JNEUROSCI.16-05-01936.1996. (Archived (PDF) from the original on 2018-07-21.)
[5] Montague, P. R.; Dayan, P.; Nowlan, S. J.; Pouget, A.; Sejnowski, T. J. Using aperiodic reinforcement for directed self-organization (PDF). Advances in Neural Information Processing Systems. 1993, 5: 969–976 [2023-04-04]. (Archived (PDF) from the original on 2006-03-12.)
[6] Montague, P. R.; Sejnowski, T. J. The predictive brain: temporal coincidence and temporal order in synaptic learning mechanisms. Learning & Memory. 1994, 1 (1): 1–33. ISSN 1072-0502. PMID 10467583. S2CID 44560099. doi:10.1101/lm.1.1.1.
[7] Sejnowski, T. J.; Dayan, P.; Montague, P. R. Predictive hebbian learning. Proceedings of Eighth ACM Conference on Computational Learning Theory. 1995: 15–18. ISBN 0897917235. S2CID 1709691. doi:10.1145/225298.225300.
[8] Sutton & Barto (2018), p. 134.
[9] Sutton & Barto (2018), p. 135.
[10] Tesauro, Gerald. Temporal difference learning and TD-Gammon. Communications of the ACM. 1995-03-01, 38 (3): 58–68 [2023-04-06]. ISSN 0001-0782. doi:10.1145/203330.203343. (Archived from the original on 2023-04-06.)
[11] Sutton & Barto (2018), p. 175.
[12] Schultz, W. Predictive reward signal of dopamine neurons. Journal of Neurophysiology. 1998, 80 (1): 1–27. CiteSeerX 10.1.1.408.5994. PMID 9658025. S2CID 52857162. doi:10.1152/jn.1998.80.1.1.
[13] Dayan, P. Motivated reinforcement learning (PDF). Advances in Neural Information Processing Systems (MIT Press). 2001, 14: 11–18 [2023-04-11]. (Archived (PDF) from the original on 2012-05-25.)
[14] Tobia, M. J., et al. Altered behavioral and neural responsiveness to counterfactual gains in the elderly. Cognitive, Affective, & Behavioral Neuroscience. 2016, 16 (3): 457–472. PMID 26864879. S2CID 11299945. doi:10.3758/s13415-016-0406-7.
[15] Smith, A.; Li, M.; Becker, S.; Kapur, S. Dopamine, prediction error, and associative learning: a model-based account. Network: Computation in Neural Systems. 2006, 17 (1): 61–84. PMID 16613795. S2CID 991839. doi:10.1080/09548980500361624.

Works cited

Sutton, Richard S.; Barto, Andrew G. Reinforcement Learning: An Introduction (2nd ed.). Cambridge, MA: MIT Press. 2018 [2023-04-04]. (Archived from the original on 2023-04-26.)

Further reading

Meyn, S. P. Control Techniques for Complex Networks. Cambridge University Press. 2007. ISBN 978-0521884419. See final chapter and appendix.
Sutton, R. S.; Barto, A. G. Time Derivative Models of Pavlovian Reinforcement (PDF). Learning and Computational Neuroscience: Foundations of Adaptive Networks. 1990: 497–537 [2023-04-06]. (Archived (PDF) from the original on 2017-03-30.)

External links

Connect Four TDGravity Applet (archived at the Internet Archive) (+ mobile phone version): self-learned using the TD-Leaf method (a combination of TD-Lambda with shallow tree search)
Self Learning Meta-Tic-Tac-Toe (archived at the Internet Archive): example web app showing how temporal difference learning can be used to learn state evaluation constants for a minimax AI playing a simple board game
Reinforcement Learning Problem: document explaining how temporal difference learning can be used to speed up Q-learning
TD-Simulator (archived at the Internet Archive): temporal difference simulator for classical conditioning
