Policy iteration


Policy iteration is an algorithm for infinite-horizon dynamic programs (generally stochastic) that proceeds by successively improving policies until they satisfy the fundamental equation:

F(s) = r^*(s) + \sum_{s'} P(s,s') F(s'),

where F(s) is the maximum expected value when starting in state s, r^*(s) is the immediate expected return when in state s and following an optimal policy (a decision rule), and P(s,s') is the probability of a transition from state s to state s' in one time period.
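For a fixed policy, the fundamental equation is linear in F, so F can be computed by solving a linear system rather than by iteration. A minimal sketch, assuming a hypothetical 2-state example and a discount factor gamma (an assumption added here so the infinite-horizon sum converges):

```python
import numpy as np

# Hypothetical 2-state example: a fixed policy with known returns and transitions.
r = np.array([1.0, 2.0])            # r(s): immediate expected return under the policy
P = np.array([[0.9, 0.1],           # P[s, s']: one-step transition probabilities
              [0.2, 0.8]])
gamma = 0.95                        # discount factor (assumed, for convergence)

# The fundamental equation F = r + gamma * P @ F is linear in F,
# so it can be solved directly as (I - gamma * P) F = r.
F = np.linalg.solve(np.eye(2) - gamma * P, r)

# Check: F satisfies the fundamental equation for this policy.
assert np.allclose(F, r + gamma * P @ F)
```

Solving the linear system exactly is what distinguishes the evaluation step of policy iteration from the successive-approximation updates used in value iteration.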

The algorithm begins each iteration with some policy, x^*(s). This policy determines an approximation of F, which is exact if the fundamental equation holds. If the equation is violated, the violation identifies how the policy can be improved, and the improved policy is used in the next iteration. Convergence depends upon the underlying Markov process (e.g., whether it is ergodic). An alternative approach is value iteration.
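The evaluate-then-improve loop above can be sketched as follows. This is a minimal illustration, not the glossary's own formulation: the MDP (arrays `P` and `r`), the discount factor gamma, and the action-indexed layout are all assumptions introduced for the example.

```python
import numpy as np

def policy_iteration(P, r, gamma=0.95, max_iter=100):
    """Policy iteration on a finite discounted MDP (illustrative sketch).

    P[a, s, s'] : probability of moving from state s to s' under action a
    r[a, s]     : immediate expected return for taking action a in state s
    """
    n_actions, n_states, _ = P.shape
    policy = np.zeros(n_states, dtype=int)      # start from an arbitrary policy
    for _ in range(max_iter):
        # Policy evaluation: solve F = r_pi + gamma * P_pi @ F exactly
        # for the current policy (a linear system).
        P_pi = P[policy, np.arange(n_states)]
        r_pi = r[policy, np.arange(n_states)]
        F = np.linalg.solve(np.eye(n_states) - gamma * P_pi, r_pi)
        # Policy improvement: any action whose one-step lookahead beats the
        # current policy is a violation of the fundamental equation; act
        # greedily with respect to F to repair it.
        Q = r + gamma * P @ F                   # Q[a, s]: lookahead value
        new_policy = Q.argmax(axis=0)
        if np.array_equal(new_policy, policy):  # no violation left: optimal
            return policy, F
        policy = new_policy
    return policy, F

# Hypothetical 2-state, 2-action example.
P = np.array([[[0.9, 0.1], [0.1, 0.9]],
              [[0.5, 0.5], [0.3, 0.7]]])
r = np.array([[1.0, 0.0],
              [0.5, 2.0]])
policy, F = policy_iteration(P, r)
```

Because there are finitely many deterministic policies and each iteration produces a strictly better one until no violation remains, the loop terminates at an optimal policy in finitely many iterations for this discounted setting.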
