Abstract & Conclusions
- Proves convergence, under the value iteration (VI) framework, to the solution of the HJB (Hamilton–Jacobi–Bellman) equation for infinite-horizon discrete-time (DT) nonlinear optimal control
- Method for solving the optimal control: an actor-critic approach (see the sketch following this summary)
a critic NN is used to approximate the value function
**an action network** is used to approximate the optimal control policy
Advantage: full knowledge of the system dynamics is not required
Used to solve the infinite-horizon discrete-time (DT) nonlinear optimal control problem
In particular, it applies well to the DT linear quadratic regulator (LQR) problem
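As a rough illustration of this critic/actor structure, here is a minimal Python sketch of my own; the paper's actual NN parameterizations are the ones given in its Section V, and the layer sizes and state/input dimensions below are made-up assumptions.

```python
import numpy as np

# Minimal sketch of the actor-critic structure summarized above (hypothetical
# parameterization): a critic network approximates the value function V(x),
# and an action network approximates the control policy u(x).

rng = np.random.default_rng(0)

def mlp(sizes):
    """Random weights for a tiny tanh network with the given layer sizes."""
    return [(rng.standard_normal((m, n)) * 0.1, np.zeros(n))
            for m, n in zip(sizes[:-1], sizes[1:])]

def forward(params, x):
    for i, (W, b) in enumerate(params):
        x = x @ W + b
        if i < len(params) - 1:
            x = np.tanh(x)
    return x

n_x, n_u = 2, 1                      # example state and input dimensions
critic_params = mlp([n_x, 16, 1])    # critic NN: x -> V(x)
actor_params  = mlp([n_x, 16, n_u])  # action NN: x -> u(x)

x = rng.standard_normal(n_x)
V_hat = forward(critic_params, x)    # approximate value at x
u_hat = forward(actor_params, x)     # approximate control at x
```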
I. INTRODUCTION
The DT nonlinear optimal control solution relies on solving the DT Hamilton–Jacobi–Bellman (HJB) equation.
Earlier solution approaches are all offline methods for solving the HJB equation and require full knowledge of the system dynamics.
In this paper, we provide a full rigorous proof of convergence of the value-iteration-based HDP algorithm to solve the DT HJB equation of the optimal control problem for general nonlinear DT systems.
It is stressed that these results also hold for the special LQR case.
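As a quick numerical check of that LQR special case (my own example; the matrices A, B, Q, R below are arbitrary values, not taken from the paper): for a linear system $x_{k+1}=Ax_k+Bu_k$ with quadratic cost, the value iteration started from $V_0=0$ becomes the Riccati difference recursion, and its iterates converge to the solution of the discrete algebraic Riccati equation (DARE).

```python
import numpy as np
from scipy.linalg import solve_discrete_are

# Hypothetical example system: verify that value iteration starting from
# P_0 = 0 converges to the DARE solution (the LQR instance of the result).
A = np.array([[1.0, 0.1],
              [0.0, 1.0]])
B = np.array([[0.0],
              [0.1]])
Q = np.eye(2)
R = np.array([[1.0]])

P = np.zeros((2, 2))                 # V_0(x) = 0, i.e. P_0 = 0
for _ in range(2000):
    K = np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)  # greedy policy gain
    P = Q + A.T @ P @ (A - B @ K)                      # Riccati value update

P_dare = solve_discrete_are(A, B, Q, R)
print(np.max(np.abs(P - P_dare)))    # should be close to 0
```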
Section II starts by introducing the nonlinear DT optimal control problem.
Section III demonstrates how to set up the HDP algorithm to solve for the nonlinear DT optimal control problem.
In Section IV, we prove the convergence of HDP value iterations to the solution of the DT HJB equation.
In Section V, we introduce two NN parametric structures to approximate the optimal value function and policy.
II. DT HJB EQUATION
Under certain conditions, the value function can be written as a one-step (Bellman) recursion, and minimizing over the control gives the DT HJB equation.
The optimal control satisfies the first-order necessary condition: take the gradient of the right-hand side with respect to the control input, set it to zero, and solve for the optimal control, which is expressed through the gradient of the value function at the next state.
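For reference, the standard equations this refers to (DT affine nonlinear system with quadratic cost, as in the paper's Section II, up to notation; reconstructed here because the notes omit them):

$$
x_{k+1} = f(x_k) + g(x_k)\,u(x_k), \qquad
V(x_k) = \sum_{n=k}^{\infty}\bigl(x_n^{\top} Q x_n + u_n^{\top} R u_n\bigr)
       = x_k^{\top} Q x_k + u_k^{\top} R u_k + V(x_{k+1})
$$

$$
V^{*}(x_k) = \min_{u_k}\bigl(x_k^{\top} Q x_k + u_k^{\top} R u_k + V^{*}(x_{k+1})\bigr)
\quad \text{(DT HJB equation)}
$$

$$
u^{*}(x_k) = -\tfrac{1}{2}\, R^{-1} g(x_k)^{\top}\,
\frac{\partial V^{*}(x_{k+1})}{\partial x_{k+1}}
\quad \text{(from the first-order condition)}
$$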
III. HDP ALGORITHM
See "1) Offline PI Algorithm" in [论文研读 Optimal and Autonomous Control Using Reinforcement Learning: A Survey].
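To make the VI-based HDP recursion concrete, here is a minimal sketch of my own on a toy scalar system; the dynamics f, g, the state/control grids, and the cost weights are made-up assumptions, and the paper itself implements the iteration with the critic/action NNs rather than a grid.

```python
import numpy as np

# Sketch of the VI-based HDP recursion on a toy scalar affine system:
#   u_i(x)     = argmin_u [ Q*x^2 + R*u^2 + V_i(f(x) + g(x)*u) ]
#   V_{i+1}(x) =           Q*x^2 + R*u_i(x)^2 + V_i(f(x) + g(x)*u_i(x))
# starting from V_0 = 0.

f = lambda x: 0.8 * np.sin(x)        # assumed drift dynamics f(x)
g = lambda x: 1.0 + 0.0 * x          # assumed input gain g(x)
Q, R = 1.0, 1.0

xs = np.linspace(-2.0, 2.0, 201)     # state grid
us = np.linspace(-2.0, 2.0, 201)     # candidate controls
V = np.zeros_like(xs)                # V_0 = 0

for i in range(100):
    # next state for every (x, u) pair, clipped to stay on the grid
    x_next = np.clip(f(xs)[:, None] + g(xs)[:, None] * us[None, :], xs[0], xs[-1])
    cost = Q * xs[:, None] ** 2 + R * us[None, :] ** 2 + np.interp(x_next, xs, V)
    u_idx = cost.argmin(axis=1)                   # greedy policy u_i(x)
    V_new = cost[np.arange(xs.size), u_idx]       # value update V_{i+1}(x)
    if np.max(np.abs(V_new - V)) < 1e-6:          # stop once the iteration settles
        break
    V = V_new
```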