Link to the paper: http://qitonggao.com/wp-content/uploads/2019/02/ICCPS19.pdf
In this paper, we propose a model-free reinforcement learning method to synthesize control policies for mobile robots modeled as Markov Decision Process (MDP) with unknown transition probabilities that satisfy Linear Temporal Logic (LTL) specifications. Specifically, we develop a reduced variance deep Q-Learning technique that relies on Neural Networks (NN) to approximate the state-action values of the MDP and employs a reward function that depends on the accepting condition of the Deterministic Rabin Automaton (DRA) that captures the LTL specification. The key idea is to convert the deep Q-Learning problem into a nonconvex max-min optimization problem with a finite-sum structure, and develop an Arrow-Hurwicz-Uzawa type stochastic reduced variance algorithm with constant stepsize to solve it. Unlike Stochastic Gradient Descent (SGD) methods that are often used in deep reinforcement learning, our method can estimate the true gradients of an unknown loss function more accurately, improving the stability of the training process. Moreover, our method does not require learning the transition probabilities in the MDP, constructing a product MDP, or computing Accepting Maximal End Components (AMECs). This significantly reduces the computational cost and also renders our method applicable to planning problems where AMECs do not exist. In this case, the resulting control policies minimize the frequency with which the system enters bad states in the DRA that violate the task specifications. To the best of our knowledge, this is the first model-free deep reinforcement learning algorithm that can synthesize policies that maximize the probability of satisfying an LTL specification even if AMECs do not exist.
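To give a feel for why a reduced variance gradient estimate can be more accurate than a plain SGD sample, here is a minimal sketch. It is not the paper's Arrow-Hurwicz-Uzawa type algorithm: it swaps in a standard SVRG-style estimator on a toy least-squares finite sum. The problem data, the names `grad_i`, `full_grad`, and `svrg_grad`, and all constants are assumptions made for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical finite-sum objective: f(w) = (1/n) * sum_i 0.5 * (x_i @ w - y_i)^2
n, d = 200, 5
X = rng.normal(size=(n, d))
y = X @ rng.normal(size=d) + 0.1 * rng.normal(size=n)

def grad_i(w, i):
    """Gradient of the i-th summand at w (what plain SGD would use)."""
    return (X[i] @ w - y[i]) * X[i]

def full_grad(w):
    """Exact gradient of the finite sum at w."""
    return X.T @ (X @ w - y) / n

def svrg_grad(w, w_snap, mu_snap, i):
    """SVRG-style estimate: sample gradient corrected by a snapshot term."""
    return grad_i(w, i) - grad_i(w_snap, i) + mu_snap

w_snap = rng.normal(size=d)              # snapshot point
mu_snap = full_grad(w_snap)              # full gradient stored at the snapshot
w = w_snap + 0.01 * rng.normal(size=d)   # current iterate, close to the snapshot

# Mean squared error of each estimator, averaged over all n possible samples.
true_g = full_grad(w)
sgd_err = np.mean([np.sum((grad_i(w, i) - true_g) ** 2) for i in range(n)])
svrg_err = np.mean(
    [np.sum((svrg_grad(w, w_snap, mu_snap, i) - true_g) ** 2) for i in range(n)]
)
print(f"SGD error: {sgd_err:.4f}, SVRG-style error: {svrg_err:.6f}")
```

Both estimators are unbiased under uniform sampling of `i`, but the corrected one stays close to the true gradient as long as the iterate is near the snapshot, which is the intuition behind the improved training stability claimed above.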
The simulation above shows the training result for a robot satisfying the LTL specification “<>(A && <>(B && <>T)) && []<>(A||T) && []<>B && []!C”, where the atomic propositions “A”, “B”, “C”, and “T” appear only with some probability, and the robot takes its desired action only with some probability because of a noisy controller.
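The two sources of uncertainty in this simulation, the noisy controller and the probabilistic atomic propositions, could be modeled along the following lines. This is only a sketch under assumed details: the action set `ACTIONS`, the helpers `noisy_action` and `observed_labels`, the uniform choice among the remaining actions, and the probabilities used are all hypothetical and are not taken from the paper.

```python
import random

ACTIONS = ["up", "down", "left", "right"]  # hypothetical action set

def noisy_action(desired, p_success, rng):
    """Return the desired action with probability p_success,
    otherwise one of the other actions chosen uniformly (assumed noise model)."""
    if rng.random() < p_success:
        return desired
    return rng.choice([a for a in ACTIONS if a != desired])

def observed_labels(cell_labels, p_label, rng):
    """Each atomic proposition attached to a cell is observed
    independently with probability p_label (assumed label model)."""
    return {ap for ap in cell_labels if rng.random() < p_label}

rng = random.Random(0)
executed = [noisy_action("up", 0.8, rng) for _ in range(10_000)]
success_rate = executed.count("up") / len(executed)
print(f"empirical success rate: {success_rate:.3f}")
```

A model-free learner never sees `p_success` or `p_label`; it only observes the executed transitions and labels, which is why the method does not need the MDP transition probabilities.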
This paper has been accepted to the International Conference on Cyber-Physical Systems (ICCPS’19), which had an acceptance rate of about 20%.
Qitong Gao, Davood Hajinezhad, Yan Zhang, Yiannis Kantaros, and Michael M. Zavlanos. 2019. Reduced Variance Deep Reinforcement Learning with Temporal Logic Specifications. In International Conference on Cyber-Physical Systems (ICCPS’19). ACM, New York, NY, USA, 12 pages.