Research
Keywords
Safe and robust reinforcement learning, planning under
uncertainty, sequential decision making, (decentralized)
partially observable Markov decision processes (POMDPs /
Dec-POMDPs).
Overview
A core skill of AI systems is their ability to
autonomously take decisions to achieve their
objectives. However, when such autonomous systems have
to interact with the real world, their performance is
notoriously brittle. Their decisions are often only
reliable in situations that are close to those they
encountered during training. Furthermore, mistakes can
have serious consequences in safety-critical domains. If
we want to be able to trust AI systems, we need
guarantees on their safety and robustness.
My research focuses on algorithms for reliable decision
making in sequential settings, where an agent influences
its environment in a closed loop. In recent years, I
have made contributions to the body of work on
reinforcement learning, focusing on safe and robust
decision making when models of the environments are
(partially) unknown. Earlier, my most significant
contributions were to the field of planning under
uncertainty, where uncertainty in sensing and acting is
captured in probabilistic models (e.g., POMDPs) that are
known to the agent a priori.
Selected research topics in Sequential Decision Making
Below I discuss a selection of research topics that I
have worked on, but see my publication list for more papers.
Exploiting epistemic uncertainty for deep exploration in Reinforcement Learning
Reinforcement Learning (RL) allows an autonomous agent
to optimize its decision making based on data it gathers
while exploring its environment. Given limited and
possibly inaccurate data, the agent is uncertain
regarding its state of knowledge, which is referred to
as epistemic uncertainty. In particular, estimates of
such epistemic uncertainty can guide an agent's decision
making, for example towards more efficient exploration
of its environment. The principled embedding of epistemic
uncertainty in present-day reinforcement learning is an
important open issue.
In recent work, we have focused on the exploitation of
epistemic uncertainty to address hard exploration
problems. Each approach considers a distinct setting.
First, our approach Sequential Monte-Carlo for Deep
Q-Networks (SMC-DQN) studies uncertainty quantification
for the value function in a model-free RL algorithm by
training an ensemble of models to resemble the Bayesian
posterior (AAMAS 2024).
Second, our Projection-Ensemble DQN
(PE-DQN) algorithm focuses on the distributional RL
setting and proposes to use diverse projections and
representations in an ensemble of distributional value
learners (ICLR 2024).
Third, our Epistemic Monte Carlo Tree Search
(E-MCTS) methodology incorporates epistemic uncertainty
into model-based RL by estimating the epistemic
uncertainty associated with predictions at every node in
the MCTS planning tree (EWRL 2023).
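As a rough illustration of how ensemble disagreement can steer exploration, the sketch below treats the spread of value estimates across ensemble members as a proxy for epistemic uncertainty and adds it as an optimism bonus. This is a generic toy example, not the SMC-DQN, PE-DQN, or E-MCTS algorithm; the function name `select_action` and the bonus weight `beta` are assumptions made for illustration.

```python
import statistics


def select_action(ensemble_q_values, beta=1.0):
    """Pick the action with the highest optimistic score.

    ensemble_q_values: list of per-member lists, one value estimate per action.
    beta: hypothetical weight on the disagreement bonus (assumed for illustration).
    Returns (best_action_index, list_of_scores).
    """
    n_actions = len(ensemble_q_values[0])
    scores = []
    for action in range(n_actions):
        estimates = [member[action] for member in ensemble_q_values]
        mean = statistics.fmean(estimates)
        # Disagreement between members serves as an epistemic-uncertainty proxy.
        spread = statistics.pstdev(estimates)
        scores.append(mean + beta * spread)
    best = max(range(n_actions), key=scores.__getitem__)
    return best, scores


# Two members agree on action 0 but disagree on action 1.
toy_ensemble = [[1.0, 0.0], [1.0, 2.0]]
greedy_action, _ = select_action(toy_ensemble, beta=0.0)
optimistic_action, _ = select_action(toy_ensemble, beta=1.0)
```

With `beta=0.0` both actions score the same mean and the first is chosen; with `beta=1.0` the action the members disagree on receives the bonus and is explored instead.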
Safe policy improvement
Traditional reinforcement learning requires
trial-and-error exploration to optimize decisions, which
is undesirable in some applications. In this line of
work, we instead consider safely improving upon an
existing decision-making policy. Such approaches are
typically better suited to safety-critical applications
but often require substantially more data to be able to
improve upon the existing policy.
We showed that, by exploiting the
environment structure, we can prove a theoretical bound on
the number of required samples that is exponentially lower
(AAAI
2019), which was also confirmed by an empirical
analysis.
Next, we showed how this environment structure itself
can also be learned (IJCAI
2019), making our methods more widely
applicable. They need fewer samples and require weaker
prior knowledge assumptions.
Safety during training
Existing algorithms for safe reinforcement learning mostly focus on
achieving a safe policy after a training phase, but ignore
safety during training. This is highly undesirable when
interacting with an actual system (e.g., a robot
manipulator) instead of a simulation (e.g., a computer
game). We addressed this pressing issue as follows.
First, we often understand the safety-related aspects quite well
(e.g., operational limits of the manipulator). By modeling
them in a principled manner, we developed an algorithm that
provably avoids safety violations during training as well
(AAMAS 2021).
Second, instead of the usual focus on expected safety
violations, we considered optimizing for the worst-case
safety performance. We presented a safety critic that
provides better risk control and can be added to
state-of-the-art RL methods (AAAI 2021).
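To make the difference between expected and tail-focused safety measures concrete, the sketch below compares the mean of a set of sampled safety costs with their conditional value-at-risk (CVaR), a common tail-risk measure. CVaR is assumed here purely for illustration and is not claimed to be the exact criterion used in the work above; the function names and the risk level `alpha` are likewise hypothetical.

```python
import math


def expected_cost(costs):
    """Average safety cost over all sampled episodes."""
    return sum(costs) / len(costs)


def cvar(costs, alpha):
    """Mean of the worst alpha-fraction of costs (higher cost means less safe).

    alpha: assumed risk level in (0, 1]; alpha=1.0 recovers the plain mean.
    """
    if not 0.0 < alpha <= 1.0:
        raise ValueError("alpha must lie in (0, 1]")
    ordered = sorted(costs, reverse=True)  # worst outcomes first
    k = max(1, math.ceil(alpha * len(ordered)))
    tail = ordered[:k]
    return sum(tail) / len(tail)


# Mostly safe episodes with one severe violation.
sampled_costs = [1.0, 1.0, 1.0, 9.0]
mean_cost = expected_cost(sampled_costs)
tail_cost = cvar(sampled_costs, alpha=0.25)
```

The mean hides the rare severe violation, whereas the tail measure is driven by it, which is why a critic that estimates tail behavior can offer tighter risk control than one that tracks only the expectation.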
Abstraction-guided recovery
Human experts can provide valuable experience to guide an
intelligent agent's behavior. We proposed to exploit
state abstractions to help an agent recover to known
scenarios (ICAPS 2021), leading to a new
way to incorporate expert knowledge in reinforcement
learning.
Constrained sequential decision making
Agents often have to optimize their decision making
under resource constraints, for instance when
simultaneous charging of electric vehicles might exceed
grid capacity constraints. We have looked at approaches
such as best-response planning (AAAI 2015) or fictitious
play (ECAI 2016), but also at algorithms that bound the
probability of a resource violation (AAAI 2017) or that
handle resource constraints that are themselves
stochastic (AAAI 2018).
Single-agent planning under uncertainty
In a 2017 paper we focused on speeding up the state of
the art in exact POMDP solving by applying a Benders
decomposition to the pruning of vectors using linear
programming, which is the most expensive operation (AAAI
2017, Java
code, C++
code).
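For readers unfamiliar with vector pruning, the sketch below shows the setting in miniature: a POMDP value function is represented by a set of vectors (often called alpha vectors), the value of a belief is the maximum inner product, and vectors that can never attain that maximum may be discarded. The code implements only the cheap pointwise-dominance check, not the linear-programming pruning or the Benders decomposition discussed above; all names are assumptions made for illustration.

```python
def belief_value(belief, alpha_vectors):
    """Value of a belief: the best inner product over all vectors."""
    return max(
        sum(b * a for b, a in zip(belief, alpha)) for alpha in alpha_vectors
    )


def prune_pointwise(alpha_vectors):
    """Drop vectors that another vector matches or beats in every state.

    Exact duplicates are collapsed by keeping the earliest copy.
    """
    kept = []
    for i, v in enumerate(alpha_vectors):
        dominated = False
        for j, w in enumerate(alpha_vectors):
            if i == j:
                continue
            if all(wk >= vk for wk, vk in zip(w, v)):
                # w is at least as good everywhere; break ties by position.
                if w != v or j < i:
                    dominated = True
                    break
        if not dominated:
            kept.append(v)
    return kept


# Two-state toy problem (vectors chosen by hand for illustration).
vectors = [(1.0, 0.0), (0.0, 1.0), (0.4, 0.4), (0.5, -0.1), (1.0, 0.0)]
pruned = prune_pointwise(vectors)
```

In this toy set, `(0.4, 0.4)` survives the pointwise check even though it is never the maximizer at any belief, since one of the first two vectors always yields at least 0.5. Detecting such vectors requires comparing against the whole set, which is where the linear programs come in.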
I wrote an overview
chapter on POMDPs for the book Reinforcement
Learning: State of the Art (Springer, 2012).
During my PhD I developed Perseus, a
fast approximate POMDP planner which is easy to
implement (JAIR
2005, Java
code, C++
code, Matlab code). We also generalized approximate POMDP planning to
fully continuous domains (JMLR
2006).
Optimal multi-agent planning under uncertainty
I have worked on planning under
uncertainty for multi-agent (and multi-robot) systems. For instance, we
developed one of the currently fastest optimal planners for
general Dec-POMDPs (IJCAI 2011, JAIR 2013). It is based on
an algorithm for speeding up a key Dec-POMDP operation
(the backup), achieving speedups of up to 10 orders of
magnitude on benchmarks (AAMAS 2010). The work builds on a
journal paper that laid the foundations for value-based
planning in Dec-POMDPs (JAIR 2008).
Applications
I have applied my decision-making algorithms in different contexts
such as smart energy systems (UAI 2015, AAAI 2015), robotics
(IJRR 2013, AAAI 2013) and
traffic flow optimization (EAAI 2016).
