TY - JOUR
T1 - Policy Iteration Q-Learning for Data-Based Two-Player Zero-Sum Game of Linear Discrete-Time Systems
AU - Luo, Biao
AU - Yang, Yin
AU - Liu, Derong
N1 - Publisher Copyright:
© 2019 IEEE.
PY - 2021/7
Y1 - 2021/7
N2 - In this article, the data-based two-player zero-sum game problem is considered for linear discrete-time systems. In theory, this problem reduces to solving the discrete-time game algebraic Riccati equation (DTGARE), which requires complete knowledge of the system dynamics. To avoid solving the DTGARE, the Q-function is introduced and a data-based policy iteration Q-learning (PIQL) algorithm is developed to learn the optimal Q-function from data collected from the real system. By writing the Q-function in quadratic form, the PIQL algorithm is proved to be equivalent to the Newton iteration method in a Banach space by means of the Fréchet derivative. The convergence of the PIQL algorithm is then guaranteed by Kantorovich's theorem. For the implementation of the PIQL algorithm, an off-policy learning scheme is proposed that uses real data rather than a system model. Finally, the effectiveness of the developed data-based PIQL method is validated through simulation studies.
AB - In this article, the data-based two-player zero-sum game problem is considered for linear discrete-time systems. In theory, this problem reduces to solving the discrete-time game algebraic Riccati equation (DTGARE), which requires complete knowledge of the system dynamics. To avoid solving the DTGARE, the Q-function is introduced and a data-based policy iteration Q-learning (PIQL) algorithm is developed to learn the optimal Q-function from data collected from the real system. By writing the Q-function in quadratic form, the PIQL algorithm is proved to be equivalent to the Newton iteration method in a Banach space by means of the Fréchet derivative. The convergence of the PIQL algorithm is then guaranteed by Kantorovich's theorem. For the implementation of the PIQL algorithm, an off-policy learning scheme is proposed that uses real data rather than a system model. Finally, the effectiveness of the developed data-based PIQL method is validated through simulation studies.
KW - Adaptive dynamic programming (ADP)
KW - Q-learning
KW - discrete-time systems
KW - policy iteration
KW - two-player zero-sum game
UR - http://www.scopus.com/inward/record.url?scp=85109198785&partnerID=8YFLogxK
U2 - 10.1109/TCYB.2020.2970969
DO - 10.1109/TCYB.2020.2970969
M3 - Article
C2 - 32092032
AN - SCOPUS:85109198785
SN - 2168-2267
VL - 51
SP - 3630
EP - 3640
JO - IEEE Transactions on Cybernetics
JF - IEEE Transactions on Cybernetics
IS - 7
M1 - 9005399
ER -