Safe and robust reinforcement learning, planning under uncertainty, sequential decision making, (decentralized) partially observable Markov decision processes (POMDPs / Dec-POMDPs).

Overview

A core skill of AI systems is their ability to make decisions autonomously to achieve their objectives. However, when such autonomous systems have to interact with the real world, their performance is notoriously brittle. Their decisions are often only reliable in situations that are close to those they encountered during training. Furthermore, mistakes can have serious consequences in safety-critical domains. If we want to be able to trust AI systems, we need guarantees on their safety and robustness.

My research focuses on algorithms for reliable decision making in sequential settings, where an agent influences its environment in a closed loop. In recent years, I have contributed to reinforcement learning, focusing on safe and robust decision making when models of the environment are (partially) unknown. Before that, my most significant contributions were to the field of planning under uncertainty, where uncertainty in sensing and acting is captured in probabilistic models (e.g., POMDPs) that are known to the agent a priori.

Selected research topics in Sequential Decision Making

Below I discuss a selection of research topics that I have worked on; see my publication list for more papers.

Exploiting epistemic uncertainty for deep exploration in Reinforcement Learning

Reinforcement Learning (RL) allows an autonomous agent to optimize its decision making based on data it gathers while exploring its environment. Given limited and possibly inaccurate data, the agent's knowledge is itself uncertain, which is referred to as epistemic uncertainty. Estimates of this epistemic uncertainty can guide an agent's decision making, for example towards more efficient exploration of its environment. The principled embedding of epistemic uncertainty in present-day reinforcement learning is an important open issue. In recent work, we have focused on exploiting epistemic uncertainty to address hard exploration problems. We developed three approaches, each considering a distinct setting.

First, our approach Sequential Monte-Carlo for Deep Q-Networks (SMC-DQN) studies uncertainty quantification for the value function in a model-free RL algorithm by training an ensemble of models to resemble the Bayesian posterior (AAMAS 2024).

Second, our Projection-Ensemble DQN (PE-DQN) algorithm focuses on the distributional RL setting and proposes to use diverse projections and representations in an ensemble of distributional value learners (ICLR 2024).

Third, our Epistemic Monte Carlo Tree Search (E-MCTS) methodology incorporates epistemic uncertainty into model-based RL by estimating the epistemic uncertainty associated with predictions at every node in the MCTS planning tree (EWRL 2023).
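To make the general idea concrete, here is a minimal sketch of how disagreement within an ensemble of value estimates can steer exploration. It is a generic illustration and not the SMC-DQN, PE-DQN, or E-MCTS algorithm; the function name ucb_action, the parameter beta, and the use of the ensemble standard deviation as an uncertainty proxy are all assumptions made for this example.

```python
import statistics


def ucb_action(ensemble_q, beta=1.0):
    """Pick the action with the highest optimistic score.

    ensemble_q: list of per-member Q-value lists (one value per action).
    beta: hypothetical weight on the uncertainty bonus.
    """
    n_actions = len(ensemble_q[0])
    scores = []
    for action in range(n_actions):
        values = [member[action] for member in ensemble_q]
        mean = statistics.fmean(values)
        # Disagreement between members serves as a crude proxy
        # for epistemic uncertainty about this action's value.
        spread = statistics.pstdev(values)
        scores.append(mean + beta * spread)
    return max(range(n_actions), key=scores.__getitem__)


# Three ensemble members, two actions. Members agree on action 0
# and disagree strongly on action 1.
ensemble = [[1.0, 0.0], [1.0, 2.0], [1.0, 0.4]]
greedy = ucb_action(ensemble, beta=0.0)      # ignores uncertainty
optimistic = ucb_action(ensemble, beta=1.0)  # rewards uncertainty
```

With beta set to zero the agent picks the action with the best mean estimate, whereas a positive beta draws it towards the action the ensemble disagrees on, which is where additional data is most informative.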

Safe policy improvement

Traditional reinforcement learning requires trial-and-error exploration to optimize decisions, which is undesirable in some applications. In this line of work, we instead consider safely improving upon an existing decision-making policy. Such approaches are typically better suited to safety-critical applications but often require substantially more data to be able to improve upon the existing policy.

We showed that exploiting the structure of the environment yields an exponentially lower theoretical bound on the number of required samples (AAAI 2019), which was also confirmed by an empirical analysis.

Next, we showed how this environment structure itself can also be learned (IJCAI 2019), making our methods more widely applicable. The resulting methods need fewer samples and rely on weaker assumptions about prior knowledge.
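As a rough illustration of why safe policy improvement is data hungry, the sketch below uses a simple count-based scheme, often called baseline bootstrapping, for a single state. It is not the structure-exploiting method of the papers cited above; the function name, the threshold n_min, and the example numbers are hypothetical.

```python
def bootstrapped_policy(baseline_probs, q_estimates, counts, n_min):
    """Improve on a baseline policy only where the data is sufficient.

    baseline_probs: baseline action probabilities in one state.
    q_estimates: estimated action values in that state.
    counts: number of times each action was observed in the data.
    n_min: hypothetical count threshold for trusting an estimate.
    """
    n_actions = len(baseline_probs)
    uncertain = [a for a in range(n_actions) if counts[a] < n_min]
    trusted = [a for a in range(n_actions) if counts[a] >= n_min]

    new_probs = [0.0] * n_actions
    # Rarely observed actions keep exactly their baseline probability.
    for a in uncertain:
        new_probs[a] = baseline_probs[a]
    # The remaining probability mass moves to the best trusted action.
    if trusted:
        free_mass = sum(baseline_probs[a] for a in trusted)
        best = max(trusted, key=lambda a: q_estimates[a])
        new_probs[best] = free_mass
    return new_probs


# Action 2 looks best but has only 2 samples, so it is left untouched.
policy = bootstrapped_policy(
    baseline_probs=[0.5, 0.3, 0.2],
    q_estimates=[1.0, 2.0, 5.0],
    counts=[100, 50, 2],
    n_min=10,
)
```

Every action needs enough samples before the policy may deviate from the baseline there, so the required amount of data grows with the number of state-action pairs. Reducing that requirement is the motivation for exploiting structure in the environment.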

Safety during training

Existing algorithms for safe reinforcement learning mostly focus on achieving a safe policy after a training phase, but ignore safety during training. This is highly undesirable when interacting with an actual system (e.g., a robot manipulator) instead of a simulation (e.g., a computer game). We addressed this pressing issue as follows.

First, we often understand the safety-related aspects quite well (e.g., operational limits of the manipulator). By modeling them in a principled manner, we developed an algorithm that provably avoids safety violations during training as well (AAMAS 2021).

Second, instead of the usual focus on expected safety violations, we considered optimizing for the worst-case safety performance. We presented a safety critic that provides better risk control and can be added to state-of-the-art RL methods (AAAI 2021).
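The difference between expected and worst-case safety performance can be illustrated with conditional value at risk (CVaR), a common tail risk measure. Using CVaR here is an assumption made for illustration; the text above does not state which risk measure the safety critic uses, and the function below is a simple empirical estimator rather than a learned critic.

```python
import math


def cvar(costs, alpha):
    """Empirical CVaR: mean of the worst (1 - alpha) fraction of costs.

    costs: sampled safety costs, where higher means less safe.
    alpha: risk level in [0, 1); alpha = 0 gives the plain mean.
    """
    if not 0.0 <= alpha < 1.0:
        raise ValueError("alpha must lie in [0, 1)")
    ordered = sorted(costs, reverse=True)
    # Number of tail samples to average, at least one.
    k = max(1, math.ceil((1.0 - alpha) * len(ordered)))
    return sum(ordered[:k]) / k


# Mostly safe episodes with two rare but costly violations.
episode_costs = [0, 0, 0, 0, 0, 0, 0, 0, 5, 15]
expected_cost = cvar(episode_costs, alpha=0.0)  # 2.0
tail_cost = cvar(episode_costs, alpha=0.8)      # 10.0
```

A policy with an expected cost of 2.0 may look acceptable under a constraint on the mean, while the tail estimate of 10.0 exposes the rare severe violations that a risk-sensitive criterion is meant to control.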

Abstraction-guided recovery

Human experts can provide valuable experience to guide an intelligent agent's behavior. We proposed to exploit state abstractions to help an agent recover to known scenarios (ICAPS 2021), leading to a new way to incorporate expert knowledge in reinforcement learning.

Constrained sequential decision making

Agents often have to optimize their decision making under resource constraints, for instance when simultaneous charging of electric vehicles might exceed grid capacity constraints. We have looked at approaches such as best-response planning (AAAI 2015) and fictitious play (ECAI 2016), as well as algorithms that bound the probability of a resource violation (AAAI 2017) or that handle resource constraints that are themselves stochastic (AAAI 2018).
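To give a feeling for what a resource violation probability means in the charging example, the sketch below computes it exactly for a deliberately simplified model. The model is an assumption for illustration only (independent vehicles that each charge with the same probability) and is not taken from the cited papers; all names are hypothetical.

```python
from math import comb


def violation_probability(n_agents, p_use, capacity):
    """P(more than `capacity` of `n_agents` use the resource at once).

    Assumes each agent independently uses the resource with
    probability p_use, so the number of users is binomial.
    """
    return sum(
        comb(n_agents, k) * p_use**k * (1.0 - p_use) ** (n_agents - k)
        for k in range(capacity + 1, n_agents + 1)
    )


# Two vehicles, each charging half of the time, one charging slot:
# the limit is exceeded only when both charge at once.
p_violation = violation_probability(n_agents=2, p_use=0.5, capacity=1)
```

A planner that bounds this probability would restrict the agents' policies until the computed value drops below a chosen tolerance, instead of forbidding every possible overload outright.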

Single-agent planning under uncertainty

In a 2017 paper, we sped up the state of the art in exact POMDP solving by applying a Benders decomposition to the linear-programming-based pruning of vectors, which is the most expensive operation (AAAI 2017, Java code, C++ code).
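For readers unfamiliar with vector pruning: an exact POMDP value function can be represented by a set of vectors (often called alpha vectors), and the value of a belief is the maximum inner product over that set. The sketch below shows this representation together with the cheapest form of pruning, a pointwise dominance check. It does not implement the linear programs or the Benders decomposition from the paper; the function names and example vectors are hypothetical.

```python
def value(belief, alpha_vectors):
    """Value of a belief: best inner product over all vectors."""
    return max(
        sum(a * b for a, b in zip(alpha, belief))
        for alpha in alpha_vectors
    )


def prune_pointwise_dominated(alpha_vectors):
    """Drop vectors that another vector matches or beats in every state."""
    kept = []
    for i, alpha in enumerate(alpha_vectors):
        dominated = False
        for j, other in enumerate(alpha_vectors):
            if i == j:
                continue
            weakly_better = all(o >= a for o, a in zip(other, alpha))
            # Among identical vectors, keep only the first copy.
            if weakly_better and (list(other) != list(alpha) or j < i):
                dominated = True
                break
        if not dominated:
            kept.append(alpha)
    return kept


# Two states, five candidate vectors.
vectors = [[1.0, 0.0], [0.0, 1.0], [0.4, 0.4], [0.0, 0.5], [1.0, 0.0]]
pruned = prune_pointwise_dominated(vectors)
v_uniform = value([0.5, 0.5], pruned)
```

In this example the vector [0.4, 0.4] survives the pointwise check even though it is never the maximizer at any belief, since one of the first two vectors always achieves at least 0.5. Detecting such vectors requires solving a linear program per vector, which is why pruning dominates the running time of exact solvers.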

I wrote an overview chapter on POMDPs for the book Reinforcement Learning: State of the Art (Springer, 2012).

During my PhD I developed Perseus, a fast approximate POMDP planner which is easy to implement (JAIR 2005, Java code, C++ code, Matlab code). We also generalized approximate POMDP planning to fully continuous domains (JMLR 2006).

Optimal multiagent planning under uncertainty

I have worked on planning under uncertainty for multiagent (and multi-robot) systems. For instance, we developed one of the currently fastest optimal planners for general Dec-POMDPs (IJCAI 2011, JAIR 2013). It is based on an algorithm that speeds up a key Dec-POMDP operation (the backup), achieving speedups of up to 10 orders of magnitude on benchmarks (AAMAS 2010). The work builds on a journal paper that laid the foundations for value-based planning in Dec-POMDPs (JAIR 2008).

Applications

I have applied my decision-making algorithms in different contexts such as smart energy systems (UAI 2015, AAAI 2015), robotics (IJRR 2013, AAAI 2013) and traffic flow optimization (EAAI 2016).