# Policy iteration

An algorithm to compute an optimal policy of a dynamic program, based on the functional equation
$F(s) = r^*(s) + \sum_{s'} P(s,s') F(s'),$
where $F(s)$ is the maximum expected value when starting in state $s$, $r^*(s)$ is the immediate expected return when in state $s$ and following an optimal policy (a decision rule), and $P(s,s')$ is the probability of a transition from state $s$ to state $s'$ in one time period.
The algorithm begins each iteration with some policy, $x^*(s)$, which determines an approximation of $F$; the approximation is exact if the equation holds. Where the equation is violated, the violation indicates how the policy can be improved, and the improved policy is used in the next iteration. Convergence depends upon the underlying Markov process (e.g., whether it is ergodic). An alternative approach is value iteration.
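The alternation between evaluating the current policy and improving it can be sketched as follows. This is a minimal illustration for a small, made-up Markov decision process; the transition matrices, rewards, and the discount factor `gamma` are all assumptions introduced here (the equation above is undiscounted, but a discount keeps the linear system in the evaluation step well posed).

```python
import numpy as np

# A tiny illustrative MDP (all numbers are made up for this sketch).
# P[a][s, s'] = probability of moving from state s to s' under action a;
# r[a][s]     = immediate expected return for taking action a in state s.
n_states = 3
gamma = 0.9  # discount factor, assumed here so the evaluation step is well posed

P = np.array([
    [[0.8, 0.2, 0.0],
     [0.1, 0.8, 0.1],
     [0.0, 0.2, 0.8]],
    [[0.5, 0.5, 0.0],
     [0.0, 0.5, 0.5],
     [0.5, 0.0, 0.5]],
])
r = np.array([
    [1.0, 0.0, 2.0],
    [0.5, 1.5, 0.0],
])

policy = np.zeros(n_states, dtype=int)  # start from an arbitrary policy
while True:
    # Policy evaluation: solve F = r_pi + gamma * P_pi F exactly for
    # the current policy (one action fixed per state).
    P_pi = P[policy, np.arange(n_states)]
    r_pi = r[policy, np.arange(n_states)]
    F = np.linalg.solve(np.eye(n_states) - gamma * P_pi, r_pi)
    # Policy improvement: act greedily with respect to F.
    Q = r + gamma * P @ F            # Q[a, s] = value of action a in state s
    new_policy = np.argmax(Q, axis=0)
    if np.array_equal(new_policy, policy):
        break                        # equation holds everywhere: no improvement left
    policy = new_policy

print(policy, F)
```

At termination, the greedy policy no longer changes, so $F$ satisfies the functional equation above and the policy is optimal for this discounted problem; for a finite state and action space, this loop terminates after finitely many iterations because there are only finitely many policies and each iteration strictly improves the value.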