POMDP Planning for Problems with large Discrete Action Spaces

Many good approximate POMDP (Partially Observable Markov Decision Process) solvers have been proposed since mid 2000, allowing POMDP to generate good strategies for realistic problems. Despite these advances, solving problems with large discrete action spaces remains difficult. This project aims to reduce such difficulties. To this end, we develop an adaptive method for action selection based on quantile statistics, specifically Cross-Entropy methods for optimisation.

We first propose a method, called Crossed-Entropy-based Multi Armed Bandit (CEMAB), for Multi-arm bandit problem --a framework used for action selection in many good online POMDP solvers today. Numerical experiments with up to 10,000 arm indicates that CEMAB outperforms various existing MAB methods, including ε-greedy, softmax, EXP1, UCB1.

We then propose an expansion of CEMAB to approximately solve POMDP problems on-line, and call this solver Quantile-Based Action SElecter (QBASE). Experiments on multiple benchmark, multi-robot problems and partially observed inventory control problems with up to 1M actions indicate that QBASE can generate good policies for problems with large discrete action space within reasonable time.



An example of QBASE applied to multi-agent multi-target finding with centralised control. Left: Performance on different numbers of agents and targets. The notation |S| is the size of the state space, while |A| is the size of the action space. Right: A visualisation of a policy for the case of 3 agents finding 3 targets. The agents are green squares marked with letters. The actual positions of the targets before being captured are marked with dark red. The pink color indicates the agents' belief that a target is at the particular cell, the darker the color the higher the probability.

Details of the methods and more results are in the following references:



  • Erli Wang
  • Hanna Kurniawati
  • Dirk P. Kroese
. . . Vol. . No. . pp. . ed. . [pdf]