Many good approximate POMDP (Partially Observable Markov Decision Process) solvers have been proposed since mid 2000, allowing POMDP to generate good strategies for realistic problems. Despite these advances, solving problems with large discrete action spaces remains difficult. This project aims to reduce such difficulties. To this end, we develop an adaptive method for action selection based on quantile statistics, specifically Cross-Entropy methods for optimisation.
We first propose a method, called Crossed-Entropy-based Multi Armed Bandit (CEMAB), for Multi-arm bandit problem --a framework used for action selection in many good online POMDP solvers today. Numerical experiments with up to 10,000 arm indicates that CEMAB outperforms various existing MAB methods, including ε-greedy, softmax, EXP1, UCB1.
We then propose an expansion of CEMAB to approximately solve POMDP problems on-line, and call this solver Quantile-Based Action SElecter (QBASE). Experiments on multiple benchmark, multi-robot problems and partially observed inventory control problems with up to 1M actions indicate that QBASE can generate good policies for problems with large discrete action space within reasonable time.
Details of the methods and more results are in the following references: