14:10-14:40: Huyên Pham (Paris 7):
Actor-Critic learning for mean-field control in continuous time
14:40-15:10: Yufei Zhang (LSE):
Exploration-exploitation trade-off for continuous-time reinforcement learning
15:10-15:30: Break and Informal Discussion
15:30-16:00: Renyuan Xu (University of Southern California):
System Noise and Individual Exploration in Learning Large Population Games
16:00-16:30: Xun Yu Zhou (Columbia):
Q-Learning in Continuous Time
16:30-16:50: Panel Discussion
16:50-17:00: Wrap up and Conclusion
The meeting will be preceded by the Annual General Meeting of the Applied Probability Section at 1:30pm.
Huyên Pham (Paris 7): Actor-Critic learning for mean-field control in continuous time
We study policy gradient methods for mean-field control in continuous time in a reinforcement learning setting. By considering randomised policies with entropy regularisation,
we derive a gradient expectation representation of the value function, which is amenable to actor-critic type algorithms, where the value functions and the policies are learnt alternately based
on observation samples of the state and model-free estimation of the population state distribution. In the linear-quadratic mean-field framework,
we obtain an exact parametrisation of the actor and critic functions defined on the Wasserstein space. Finally, we illustrate the results of our algorithms with some numerical experiments on concrete examples.
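As a caricature of the entropy-regularised policy-gradient idea (not the algorithm of the talk), consider a one-step mean-field problem where a Gaussian control a ~ N(m, s2) pays a cost depending on its own law through the mean; the entropy term keeps the optimal policy genuinely randomised. All parameter values below are made up for illustration:

```python
import math

# Hypothetical toy problem: minimise J = E[a^2] + (E[a])^2 - lam * H(policy)
# over Gaussian policies a ~ N(m, s2), where H = 0.5*log(2*pi*e*s2) is the
# Gaussian entropy.  In closed form J(m, s2) = 2*m^2 + s2 - (lam/2)*log(2*pi*e*s2),
# so the entropy regularisation keeps the optimiser stochastic: m* = 0, s2* = lam/2.

def train(lam=0.5, lr=0.05, steps=2000):
    m, s2 = 1.0, 1.0
    for _ in range(steps):
        grad_m = 4.0 * m                     # d/dm of 2*m^2
        grad_s2 = 1.0 - lam / (2.0 * s2)     # d/ds2 of s2 - (lam/2)*log(s2)
        m -= lr * grad_m
        s2 = max(s2 - lr * grad_s2, 1e-6)    # keep the variance positive
    return m, s2

m, s2 = train()
print(m, s2)  # m -> 0, s2 -> lam/2 = 0.25
```

Here the gradients are exact because the toy cost is available in closed form; the talk's setting replaces them with model-free estimates from observation samples.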
Yufei Zhang (LSE): Exploration-exploitation trade-off for continuous-time reinforcement learning
Recently, reinforcement learning (RL) has attracted substantial research interest. Much of the attention and success, however, has been for the discrete-time setting. Continuous-time RL, despite its natural analytical connection to stochastic control, has been largely unexplored, with limited progress. In particular, characterising sample efficiency for continuous-time RL algorithms remains a challenging and open problem.
In this talk, we develop a framework to analyse model-based reinforcement learning in the episodic setting. We then apply it to optimise the exploration-exploitation trade-off for linear-convex RL problems, and report sublinear (or even logarithmic) regret bounds for a class of learning algorithms inspired by filtering theory. The approach is probabilistic: it analyses learning efficiency using concentration inequalities for correlated continuous-time observations, and quantifies the performance gap between greedy policies derived from estimated and true models using stochastic control theory.
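A toy illustration (hypothetical, not the talk's linear-convex model) of why greedy policies built on least-squares estimates can achieve logarithmic regret: when the squared estimation error after n episodes decays like 1/n, the cumulative performance gap over N episodes grows only like log N:

```python
import random

# Made-up scalar example: an unknown parameter theta is observed with noise once
# per episode; the greedy action is the running least-squares estimate, and the
# per-episode regret is the squared gap to the oracle action a = theta.
# E[(estimate_n - theta)^2] = noise_var / n, so total regret ~ noise_var * log(N).

random.seed(0)
theta, noise_std = 1.3, 0.5
N = 2000
obs_sum, regret = 0.0, 0.0
for n in range(1, N + 1):
    a = obs_sum / (n - 1) if n > 1 else 0.0          # greedy action from current estimate
    regret += (a - theta) ** 2                       # performance gap this episode
    obs_sum += theta + random.gauss(0.0, noise_std)  # noisy observation from the episode
print(regret)  # stays O(log N), far below linear growth in N
```

The cumulative regret here is a few units for N = 2000 episodes, consistent with the harmonic-sum bound noise_var * (log N + const) rather than growth proportional to N.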
Renyuan Xu (University of Southern California): System Noise and Individual Exploration in Learning Large Population Games
In this talk, we demonstrate the importance of system noise and individual exploration in the context of multi-agent reinforcement learning in the large-population regime. In particular, we discuss several linear-quadratic problems where agents are assumed to have limited information about the stochastic system, and we focus on the policy gradient method, a widely used class of reinforcement learning algorithms.
In the finite-agent setting, we show that (a modified) policy gradient method can guide agents to the Nash equilibrium solution provided there is a certain level of noise in the system. The noise can come either from the underlying dynamics or from carefully designed exploration by the agents. When the number of agents goes to infinity, we propose an exploration scheme with entropy regularization that helps each individual agent explore the unknown system as well as the behavior of the other agents. The proposed scheme is shown to speed up and stabilize the learning procedure.
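A minimal numerical sketch (with made-up payoffs, not the talk's linear-quadratic game) of how an entropy bonus keeps a policy-gradient learner exploratory: exact gradient ascent on the logits of a two-action softmax policy converges to the Gibbs distribution proportional to exp(r/lam) rather than collapsing onto a single deterministic action:

```python
import math

# Hypothetical two-action problem with rewards r; we maximise
# J(pi) = sum_a pi_a * r_a + lam * H(pi) by exact gradient ascent on logits.
# The entropy-regularised optimum is pi_a proportional to exp(r_a / lam),
# so both actions keep positive probability: exploration never dies out.
r = [1.0, 0.0]
lam, lr = 0.5, 0.1
logits = [0.0, 0.0]
for _ in range(5000):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]        # stable softmax
    p = [e / sum(exps) for e in exps]
    g = [r[a] - lam * (math.log(p[a]) + 1.0) for a in range(2)]  # dJ/dpi_a
    avg = sum(p[a] * g[a] for a in range(2))
    logits = [logits[a] + lr * p[a] * (g[a] - avg) for a in range(2)]  # chain rule through softmax
p0 = p[0]
print(p0)  # -> exp(1/lam) / (exp(1/lam) + 1) = e^2 / (e^2 + 1), about 0.88
```

With lam -> 0 the limit is the greedy deterministic policy; a positive lam trades some reward for sustained randomisation, which is the role entropy regularization plays in the exploration scheme above.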
This talk is based on several projects with Xin Guo (UC Berkeley), Ben Hambly (U of Oxford), Huining Yang (Princeton U), and Thaleia Zariphopoulou (UT Austin).
Xun Yu Zhou (Columbia): Q-Learning in Continuous Time
We study the continuous-time counterpart of Q-learning for reinforcement learning (RL) under the entropy-regularized, exploratory diffusion process formulation introduced by Wang et al. (2020). As the conventional (big) Q-function collapses in continuous time, we consider its first-order approximation and coin the term “(little) q-function”. This function is related to the instantaneous advantage rate function as well as the Hamiltonian. We develop a “q-learning” theory around the q-function that is independent of time discretization. Given a stochastic policy, we jointly characterize the associated q-function and value function by martingale conditions of certain stochastic processes. We then apply the theory to devise various actor–critic algorithms for solving underlying RL problems, depending on whether or not the density function of the Gibbs measure generated from the q-function can be computed explicitly. One of our algorithms interprets the well-known Q-learning algorithm SARSA, and another recovers a policy gradient (PG) based continuous-time algorithm proposed in Jia and Zhou (2021). Finally, we conduct simulation experiments to compare the performance of our algorithms with those of PG-based algorithms in Jia and Zhou (2021) and time-discretized conventional Q-learning algorithms. Joint work with Yanwei Jia.
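For finitely many actions, a Gibbs measure generated from a q-function is a softmax, and the corresponding entropy-regularised ("soft") value is a log-sum-exp. A minimal numerical check of that identity, with made-up q-values at a fixed state (illustrative only, not from the paper):

```python
import math

# Hypothetical little-q values at one state, and temperature lam.
lam = 0.3
q = [0.7, -0.2, 1.1]

# Gibbs policy generated from q: pi_a proportional to exp(q_a / lam).
Z = sum(math.exp(qa / lam) for qa in q)
pi = [math.exp(qa / lam) / Z for qa in q]

# Soft value via log-sum-exp, and the same quantity recomputed as
# expected q plus lam times the policy entropy; the two must agree.
soft_v = lam * math.log(Z)
lhs = sum(p * qa for p, qa in zip(pi, q)) - lam * sum(p * math.log(p) for p in pi)
print(soft_v, lhs)  # identical up to floating-point error
```

When the Gibbs density cannot be normalised explicitly (the case distinguished in the talk), this closed-form construction is unavailable and the actor must be parametrised and learned instead.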