Research
            
             Keywords
            
            
              Safe and robust reinforcement learning, planning under
              uncertainty, sequential decision making, (decentralized)
              partially observable Markov decision processes (POMDPs /
              Dec-POMDPs).
             
            
             Overview
            
            
              A core capability of AI systems is their ability to
              autonomously make decisions to achieve their
              objectives. However, when such autonomous systems have
              to interact with the real world, their performance is
              notoriously brittle: their decisions are often only
              reliable in situations close to those they encountered
              during training. Furthermore, mistakes can have serious
              consequences in safety-critical domains. If we want to
              be able to trust AI systems, we need guarantees on
              their safety and robustness.
             
            
              My research focuses on algorithms for reliable decision
              making in sequential settings, where an agent influences
              its environment in a closed loop. In recent years, I
              have contributed to the body of work on reinforcement
              learning, focusing on safe and robust decision making
              when models of the environment are (partially) unknown.
              Before that, my most significant contributions were to
              the field of planning under uncertainty, where
              uncertainty in sensing and acting is captured in
              probabilistic models (e.g., POMDPs) that are known to
              the agent a priori.
             
            
             Selected research topics in sequential decision making
            
            Below I discuss a selection of the research topics I have
              worked on; see my publication list for more papers.
             
            Exploiting epistemic uncertainty for deep exploration in Reinforcement Learning
            
              Reinforcement Learning (RL) allows an autonomous agent
              to optimize its decision making based on data it gathers
              while exploring its environment. Given limited and
              possibly inaccurate data, the agent is uncertain about
              its own state of knowledge, which is referred to as
              epistemic uncertainty. Estimates of this epistemic
              uncertainty can guide an agent's decision making, for
              example towards more efficient exploration of its
              environment. Embedding epistemic uncertainty in
              present-day reinforcement learning in a principled way
              remains an important open issue.
              In recent work, we have focused on exploiting epistemic
              uncertainty to address hard exploration problems. Each
              of the approaches below considers a distinct setting; a
              small illustrative sketch of the shared idea follows them.
             
            
              First, our approach Sequential Monte-Carlo for Deep
              Q-Networks (SMC-DQN) studies uncertainty quantification
              for the value function in a model-free RL algorithm by
              training an ensemble of models to resemble the Bayesian
              posterior (AAMAS
            2024).
               
            
              Second, our Projection-Ensemble DQN
              (PE-DQN) algorithm focuses on the distributional RL
              setting and proposes to use diverse projections and
              representations in an ensemble of distributional value
              learners
              (ICLR
            2024).
             
            Third, our Epistemic Monte Carlo Tree Search
              (E-MCTS) methodology incorporates epistemic uncertainty
              into model-based RL by estimating the uncertainty
              associated with the predictions at every node in
              the MCTS planning tree
(EWRL
            2023).
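
              A common building block in approaches like these is to
              treat the disagreement within an ensemble of value
              estimates as a proxy for epistemic uncertainty and to
              turn it into an exploration signal. The minimal sketch
              below illustrates only that generic idea in a toy
              tabular setting; it is not an implementation of SMC-DQN,
              PE-DQN, or E-MCTS, and the ensemble size, bonus weight
              beta, and the Q_ensemble array are illustrative
              assumptions.

                import numpy as np

                rng = np.random.default_rng(0)
                n_members, n_states, n_actions = 5, 10, 4

                # Toy ensemble of tabular Q-functions; in deep RL these would be
                # networks trained on different bootstrapped data or random priors.
                Q_ensemble = rng.normal(size=(n_members, n_states, n_actions))

                def act(state, beta=1.0):
                    """Greedy action w.r.t. the ensemble mean plus an epistemic bonus:
                    the disagreement (standard deviation) across ensemble members."""
                    q = Q_ensemble[:, state, :]              # (members, actions)
                    mean, std = q.mean(axis=0), q.std(axis=0)
                    return int(np.argmax(mean + beta * std))

                print(act(state=3))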
               
            Safe policy improvement
            
            Traditional reinforcement learning requires
            trial-and-error exploration to optimize decisions, which
            is undesirable in some applications.  In this line of
            work, we instead consider safely improving upon an
            existing decision-making policy. Such approaches are
            typically better suited to safety-critical applications
            but often require substantially more data to be able to
            improve upon the existing policy.
             
            We showed that, by exploiting the structure of the
            environment, we can prove a bound on the number of
            required samples that is exponentially lower than for
            approaches that ignore this structure (AAAI
            2019), a result that was also confirmed by an
            empirical analysis.
             
            
              Next, we showed how this environment structure itself
              can also be learned (IJCAI
              2019), making our methods more widely applicable:
              they need fewer samples and require weaker
              prior-knowledge assumptions.
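
              To make the idea of safe policy improvement more
              concrete, the sketch below shows a common
              baseline-bootstrapping step in its simplest tabular,
              non-factored form; it is not the factored algorithm of
              the papers above, and the function name, the count
              threshold n_wedge, and the toy inputs are illustrative
              assumptions.

                import numpy as np

                def spibb_greedy_step(Q, pi_b, counts, n_wedge):
                    """Greedy improvement that only deviates from the baseline
                    policy pi_b on state-action pairs observed at least n_wedge
                    times in the data; elsewhere the baseline is kept."""
                    n_states, n_actions = Q.shape
                    pi = np.zeros_like(pi_b)
                    for s in range(n_states):
                        uncertain = counts[s] < n_wedge
                        # Keep the baseline probabilities where data is scarce.
                        pi[s, uncertain] = pi_b[s, uncertain]
                        certain = np.where(~uncertain)[0]
                        if certain.size > 0:
                            # Give the remaining mass to the best well-estimated action.
                            best = certain[np.argmax(Q[s, certain])]
                            pi[s, best] += 1.0 - pi[s, uncertain].sum()
                        else:
                            pi[s] = pi_b[s]  # nothing is well estimated: keep the baseline
                    return pi

                # Tiny example: 2 states, 2 actions, uniform baseline policy.
                Q = np.array([[1.0, 0.0], [0.0, 1.0]])
                pi_b = np.full((2, 2), 0.5)
                counts = np.array([[10, 1], [10, 10]])
                print(spibb_greedy_step(Q, pi_b, counts, n_wedge=5))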
             
            Safety during training
            
            Existing algorithms for safe reinforcement learning mostly
            focus on achieving a safe policy after a training phase,
            but ignore safety during training. This is very
            undesirable when interacting with an actual system (e.g.,
            a robot manipulator) instead of a simulation (e.g., a
            computer game). We addressed this pressing issue as
            follows.
            
              First, we often understand the safety-related aspects
            quite well (e.g., the operational limits of the
            manipulator). By modeling these in a principled manner, we
            developed an algorithm that provably avoids safety
            violations during training as well
            (AAMAS
            2021); see the sketch after the next point for the basic idea.
               
            Second, instead of the usual focus on expected safety
            violations, we considered optimizing for the worst-case
            safety performance. We presented a safety critic that
            provides better risk control and can be added to
            state-of-the-art RL methods (AAAI
            2021).
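
            The first approach above rests on the observation that
            known safety aspects, such as operational limits, can be
            used to filter an agent's actions before they reach the
            real system. The sketch below shows only this general
            action-filtering (shielding) idea in a toy form; it is not
            the AAMAS 2021 algorithm, and the SafetyShield class, the
            is_safe predicate, and the fallback action are
            illustrative assumptions.

              class SafetyShield:
                  """Filters proposed actions against a known safety model and
                  substitutes a fallback action whenever a proposal is unsafe,
                  so that safety is maintained already during training."""
                  def __init__(self, is_safe, fallback_action):
                      self.is_safe = is_safe           # e.g., operational/joint limits
                      self.fallback = fallback_action  # action known to be safe everywhere

                  def filter(self, state, action):
                      return action if self.is_safe(state, action) else self.fallback

              # Toy example: a 1-D actuator whose position must stay within [-1, 1].
              shield = SafetyShield(is_safe=lambda s, a: abs(s + a) <= 1.0,
                                    fallback_action=0.0)
              print(shield.filter(0.8, 0.5))  # proposed step would exceed the limit -> 0.0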
             
            Abstraction-guided recovery
            
            Human experts can provide valuable experience to guide an
            intelligent agent's behavior. We proposed to exploit
            state abstractions to help an agent recover to known
            scenarios (ICAPS 2021), leading to a new
            way to incorporate expert knowledge in reinforcement
            learning.
             
            Constrained sequential decision making
          
            
              Agents often have to optimize their decision making
              under resource constraints, for instance when the
              simultaneous charging of electric vehicles might exceed
              the capacity of the grid. We have looked at approaches
              such as best-response planning (AAAI
              2015) and fictitious play (ECAI
              2016), but also at algorithms that bound the
              probability of a resource violation (AAAI
              2017) or handle resource constraints that are
              themselves stochastic (AAAI
              2018).
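
              The best-response and fictitious-play approaches share a
              simple structure: the constrained resources are priced,
              and each agent repeatedly plans a best response to the
              current prices. The toy sketch below illustrates only
              that structure; the plan menus, capacities, and price
              update rule are illustrative assumptions (the actual
              papers use full MDP planners as the best-response step).

                import numpy as np

                # Each agent picks one plan from a small menu: (value, resource usage
                # per time slot), e.g., kWh drawn by an electric vehicle.
                plans = [
                    [(4.0, np.array([2.0, 0.0])), (3.0, np.array([0.0, 2.0]))],  # agent 0
                    [(5.0, np.array([2.0, 0.0])), (4.0, np.array([1.0, 1.0]))],  # agent 1
                ]
                capacity = np.array([2.0, 2.0])

                def best_response(menu, prices):
                    """Plan that maximizes value minus the priced resource usage."""
                    return max(menu, key=lambda p: p[0] - prices @ p[1])

                prices = np.zeros(2)
                for _ in range(50):
                    chosen = [best_response(menu, prices) for menu in plans]
                    usage = sum(p[1] for p in chosen)
                    # Dual (subgradient) update: raise the price of overused resources.
                    prices = np.maximum(0.0, prices + 0.1 * (usage - capacity))

                print([p[0] for p in chosen], usage, np.round(prices, 2))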
             
            
             Single-agent planning under uncertainty
            
            
              In a 2017 paper we focused on speeding up the state of
              the art in exact POMDP solving by applying a Benders
              decomposition to the pruning of alpha-vectors using
              linear programming, which is the most expensive
              operation (AAAI
              2017, Java
              code, C++
                code).
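
              For context, pruning asks, for each candidate vector,
              whether there exists a belief at which it is strictly
              better than all other vectors; solving this linear
              program repeatedly dominates the runtime, and it is this
              step that the Benders decomposition in the paper
              accelerates. The sketch below shows only the standard
              dominance-check LP (not the decomposition itself); the
              function name and tolerance are illustrative assumptions.

                import numpy as np
                from scipy.optimize import linprog

                def wins_somewhere(alpha, others, tol=1e-9):
                    """Is there a belief b (a probability vector over states) at which
                    alpha has strictly higher value than every vector in others?
                    Variables: the belief b and a margin d to be maximized."""
                    n = len(alpha)
                    c = np.concatenate([np.zeros(n), [-1.0]])        # maximize d
                    # b . (other - alpha) + d <= 0 for every competing vector
                    A_ub = np.hstack([np.array(others) - alpha,
                                      np.ones((len(others), 1))])
                    b_ub = np.zeros(len(others))
                    A_eq = np.concatenate([np.ones(n), [0.0]]).reshape(1, -1)
                    b_eq = [1.0]                                     # belief sums to one
                    bounds = [(0, 1)] * n + [(None, None)]
                    res = linprog(c, A_ub=A_ub, b_ub=b_ub,
                                  A_eq=A_eq, b_eq=b_eq, bounds=bounds)
                    return res.success and -res.fun > tol

                alpha = np.array([1.0, 0.0])
                others = [np.array([0.0, 1.0]), np.array([0.4, 0.4])]
                print(wins_somewhere(alpha, others))  # True: alpha is best near belief (1, 0)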
             
            
              I wrote an overview
                chapter on POMDPs for the book Reinforcement
              Learning: State of the Art (Springer, 2012).
             
            
              During my PhD I developed Perseus, a
              fast approximate POMDP planner which is easy to
              implement (JAIR
              2005, Java
                code, C++
                code, Matlab code). We also generalized approximate POMDP planning to
              fully continuous domains (JMLR
              2006).
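
              To give a flavor of the algorithm, the sketch below
              shows the point-based backup and the randomized
              value-improvement stage at the heart of Perseus, written
              for flat numpy arrays. The array layout (T, Z, R) and
              the function names are illustrative assumptions; the
              linked code contains the actual implementations.

                import numpy as np

                def backup(b, V, T, Z, R, gamma):
                    """Best alpha-vector at belief b obtainable by one-step lookahead
                    from the current value function V (an array of alpha-vectors).
                    Conventions: T[s, a, s'], Z[a, s', o], R[s, a]."""
                    n_s, n_a, _ = T.shape
                    n_o = Z.shape[2]
                    best, best_val = None, -np.inf
                    for a in range(n_a):
                        g = R[:, a].astype(float)
                        for o in range(n_o):
                            M = T[:, a, :] * Z[a, :, o]   # prob. of (s', o) from s under a
                            cand = V @ M.T                # one back-projected row per alpha
                            g = g + gamma * cand[np.argmax(cand @ b)]
                        if b @ g > best_val:
                            best, best_val = g, b @ g
                    return best

                def perseus_stage(B, V, T, Z, R, gamma, seed=0):
                    """One Perseus improvement stage: back up randomly chosen beliefs
                    until the new value of every belief in B is at least its old value."""
                    old = np.array([np.max(V @ b) for b in B])
                    rng = np.random.default_rng(seed)
                    V_new, todo = [], list(range(len(B)))
                    while todo:
                        i = int(rng.choice(todo))
                        alpha = backup(B[i], V, T, Z, R, gamma)
                        if alpha @ B[i] < old[i]:
                            alpha = V[np.argmax(V @ B[i])]   # keep the previous best vector
                        V_new.append(alpha)
                        Vn = np.array(V_new)
                        todo = [j for j in todo if np.max(Vn @ B[j]) < old[j]]
                    return np.array(V_new)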
             
            
             Optimal multiagent planning under uncertainty
          
          
            I have worked on planning under
              uncertainty for multiagent (and multi-robot) systems. For instance, we
              developed one of the currently fastest optimal planners for
              general Dec-POMDPs (IJCAI
              2011, JAIR
              2013). It is based on an algorithm that speeds up a
              key Dec-POMDP operation (the backup) by up to 10
              orders of magnitude on benchmark problems (AAMAS
              2010). This work builds on a journal paper that
              laid the foundations for value-based planning in
              Dec-POMDPs (JAIR
              2008).
             
            
             Applications
          
            
I have applied my decision-making algorithms in different contexts
such as smart energy systems (UAI 2015, AAAI 2015), robotics
(IJRR 2013, AAAI 2013) and
traffic flow optimization (EAAI 2016).
             