# Markov Decision Process

## Definition

A Markov decision process is a 4-tuple ${\displaystyle (S,A,P_{a},R_{a})}$, where:

• ${\displaystyle S}$ is the set of states, called the state space,
• ${\displaystyle A}$ is the set of actions, called the action space (${\displaystyle A_{s}}$ denotes the set of actions available from state ${\displaystyle s}$),
• ${\displaystyle P_{a}(s,s')=\Pr(s_{t+1}=s'\mid s_{t}=s,a_{t}=a)}$ is the probability that taking action ${\displaystyle a}$ in state ${\displaystyle s}$ at time ${\displaystyle t}$ leads to state ${\displaystyle s'}$ at time ${\displaystyle t+1}$,
• ${\displaystyle R_{a}(s,s')}$ is the immediate reward (or expected immediate reward) received after transitioning from state ${\displaystyle s}$ to state ${\displaystyle s'}$ under action ${\displaystyle a}$.
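As an illustration, such a 4-tuple can be written down directly as plain Python dictionaries. The two-state, two-action MDP below is a hypothetical example; all state names, action names, and numbers are illustrative:

```python
# A hypothetical two-state, two-action MDP stored as plain dicts.

S = [0, 1]                                   # state space S
A = {0: ["stay", "go"], 1: ["stay", "go"]}   # A_s: actions available in s

# P[a][(s, s_next)] = Pr(s_{t+1} = s_next | s_t = s, a_t = a)
P = {
    "stay": {(0, 0): 1.0, (1, 1): 1.0},
    "go":   {(0, 1): 0.9, (0, 0): 0.1, (1, 0): 0.9, (1, 1): 0.1},
}

# R[a][(s, s_next)] = immediate reward for that transition
R = {
    "stay": {(0, 0): 0.0, (1, 1): 2.0},
    "go":   {(0, 1): 1.0, (0, 0): 0.0, (1, 0): 0.0, (1, 1): 2.0},
}

# Sanity check: for every state-action pair, outgoing probabilities sum to 1.
for a, table in P.items():
    for s in S:
        total = sum(p for (si, _), p in table.items() if si == s)
        assert abs(total - 1.0) < 1e-12
```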

### Optimization objective

The goal is to choose a policy ${\displaystyle \pi }$ that maximizes the expected discounted return

${\displaystyle E\left[\sum _{t=0}^{\infty }{\gamma ^{t}R_{a_{t}}(s_{t},s_{t+1})}\right]}$

where ${\displaystyle a_{t}=\pi (s_{t})}$ is the action given by the policy, the expectation is taken over ${\displaystyle s_{t+1}\sim P_{a_{t}}(s_{t},s_{t+1})}$, and ${\displaystyle \gamma \in [0,1]}$ is the discount factor.
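The expectation above can be estimated by Monte Carlo rollout: sample a trajectory under a fixed policy and accumulate discounted rewards. A minimal sketch, assuming a hypothetical two-state MDP and a fixed deterministic policy (all names and numbers are illustrative):

```python
import random

GAMMA = 0.9  # discount factor gamma

# P[(s, a)] -> list of (s_next, probability, reward) outcomes
# (a hypothetical two-state MDP; numbers are illustrative)
P = {
    (0, "stay"): [(0, 1.0, 0.0)],
    (0, "go"):   [(1, 0.9, 1.0), (0, 0.1, 0.0)],
    (1, "stay"): [(1, 1.0, 2.0)],
    (1, "go"):   [(0, 0.9, 0.0), (1, 0.1, 2.0)],
}
PI = {0: "go", 1: "stay"}  # a fixed deterministic policy a_t = pi(s_t)

def rollout_return(s, horizon=200, seed=0):
    """One sampled value of sum_t gamma^t * R_{a_t}(s_t, s_{t+1})."""
    rng = random.Random(seed)
    total, discount = 0.0, 1.0
    for _ in range(horizon):
        outcomes = P[(s, PI[s])]
        # Sample s_{t+1} ~ P_{a_t}(s_t, .)
        i = rng.choices(range(len(outcomes)),
                        weights=[p for _, p, _ in outcomes])[0]
        s_next, _, r = outcomes[i]
        total += discount * r   # gamma^t * R_{a_t}(s_t, s_{t+1})
        discount *= GAMMA
        s = s_next
    return total
```

Averaging `rollout_return` over many seeds approximates the expectation; the truncation at `horizon` is harmless because ${\displaystyle \gamma ^{t}}$ decays geometrically.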

## Algorithms

The value function and the policy are computed by alternating two updates until neither changes: a value update

${\displaystyle V(s):=\sum _{s'}P_{\pi (s)}(s,s')\left(R_{\pi (s)}(s,s')+\gamma V(s')\right)}$

and a policy update

${\displaystyle \pi (s):={\underset {a}{\operatorname {arg\,max} }}\left\{\sum _{s'}P_{a}(s,s')\left(R_{a}(s,s')+\gamma V(s')\right)\right\}}$
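Alternating the two updates until the policy stops changing is known as policy iteration. A minimal sketch on a hypothetical two-state MDP (all names and numbers are illustrative):

```python
GAMMA = 0.9

# P[(s, a)] -> list of (s_next, probability, reward) outcomes
P = {
    (0, "stay"): [(0, 1.0, 0.0)],
    (0, "go"):   [(1, 0.9, 1.0), (0, 0.1, 0.0)],
    (1, "stay"): [(1, 1.0, 2.0)],
    (1, "go"):   [(0, 0.9, 0.0), (1, 0.1, 2.0)],
}
STATES = [0, 1]
ACTIONS = ["stay", "go"]

def q(V, s, a):
    """One-step lookahead: sum_{s'} P_a(s,s') (R_a(s,s') + gamma V(s'))."""
    return sum(p * (r + GAMMA * V[sp]) for sp, p, r in P[(s, a)])

def evaluate(pi, sweeps=1000):
    """Value update: iterate V(s) := q(V, s, pi(s)) toward its fixed point."""
    V = {s: 0.0 for s in STATES}
    for _ in range(sweeps):
        V = {s: q(V, s, pi[s]) for s in STATES}
    return V

def policy_iteration():
    pi = {s: "stay" for s in STATES}
    while True:
        V = evaluate(pi)
        # Policy update: pi(s) := argmax_a q(V, s, a)
        pi_new = {s: max(ACTIONS, key=lambda a: q(V, s, a)) for s in STATES}
        if pi_new == pi:
            return pi, V
        pi = pi_new
```

Because each evaluation is a contraction (for ${\displaystyle \gamma <1}$) and there are finitely many deterministic policies, the loop terminates with an optimal policy.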

### Notable variants

#### Value iteration

Value iteration folds the policy update into the value update, computing the value function directly:

${\displaystyle V_{i+1}(s):=\max _{a}\left\{\sum _{s'}P_{a}(s,s')\left(R_{a}(s,s')+\gamma V_{i}(s')\right)\right\}}$

where ${\displaystyle i}$ is the iteration index; the iteration stops when the values converge.
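A minimal value-iteration sketch on a hypothetical two-state MDP (all names and numbers are illustrative):

```python
GAMMA = 0.9

# P[(s, a)] -> list of (s_next, probability, reward) outcomes
P = {
    (0, "stay"): [(0, 1.0, 0.0)],
    (0, "go"):   [(1, 0.9, 1.0), (0, 0.1, 0.0)],
    (1, "stay"): [(1, 1.0, 2.0)],
    (1, "go"):   [(0, 0.9, 0.0), (1, 0.1, 2.0)],
}
STATES = [0, 1]
ACTIONS = ["stay", "go"]

def value_iteration(eps=1e-9):
    """Iterate V_{i+1}(s) = max_a sum_{s'} P_a(s,s')(R_a(s,s') + gamma V_i(s'))."""
    V = {s: 0.0 for s in STATES}
    while True:
        V_new = {
            s: max(sum(p * (r + GAMMA * V[sp]) for sp, p, r in P[(s, a)])
                   for a in ACTIONS)
            for s in STATES
        }
        # The update is a gamma-contraction, so successive sweeps converge.
        if max(abs(V_new[s] - V[s]) for s in STATES) < eps:
            return V_new
        V = V_new
```

An optimal policy can then be read off with one ${\displaystyle \operatorname {arg\,max} }$ sweep over the converged values.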
