Reinforcement learning (RL) has garnered significant attention for developing agents that aim to maximize rewards, specified by an external supervisor, within fully observable environments. However, many real-world problems involve partial or noisy observations, where agents do not have access to complete and accurate information about the environment. These problems are commonly formulated as partially observable Markov decision processes (POMDPs). Previous studies have tackled RL in POMDPs by either incorporating the memory of past actions and observations or by inferring the true state of the environment from observed data. Nevertheless, aggregating observations and actions over time becomes impractical in high-dimensional continuous spaces. Furthermore, inference-based RL approaches often require a large number of environmental samples to perform well, as they focus solely on reward maximization and neglect uncertainty in the inferred state.
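To make the belief-based alternative concrete, the sketch below (our own illustration, not code from the paper; the toy POMDP, its array values, and the name `belief_update` are invented) shows the recursive Bayesian filter that replaces an ever-growing action-observation history with a fixed-size belief over hidden states.

```python
import numpy as np

# Toy POMDP with |S| = 3 hidden states, |A| = 2 actions, |O| = 2 observations.
# transition[a][s, s'] = p(s' | s, a); likelihood[o, s'] = p(o | s').
# All numbers and names here are illustrative, not taken from the paper.
transition = np.array([
    [[0.9, 0.1, 0.0], [0.1, 0.8, 0.1], [0.0, 0.1, 0.9]],   # action 0
    [[0.5, 0.5, 0.0], [0.0, 0.5, 0.5], [0.5, 0.0, 0.5]],   # action 1
])
likelihood = np.array([
    [0.8, 0.3, 0.1],   # p(o = 0 | s')
    [0.2, 0.7, 0.9],   # p(o = 1 | s')
])

def belief_update(belief, action, observation):
    """One step of recursive Bayesian filtering: b'(s') ∝ p(o|s') Σ_s p(s'|s,a) b(s)."""
    predicted = transition[action].T @ belief          # prediction step
    posterior = likelihood[observation] * predicted    # correction step
    return posterior / posterior.sum()                 # normalize

b = np.ones(3) / 3                  # uniform initial belief
b = belief_update(b, action=0, observation=1)
print(b)                            # belief over hidden states after one step
```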
Active inference (AIF) is a framework naturally formulated in POMDPs; it directs agents to select actions by minimizing a quantity called the expected free energy (EFE). The EFE supplements reward-maximizing (or exploitative) behavior, as in RL, with information-seeking (or exploratory) behavior. Despite this exploratory behavior, the use of AIF has been limited to small, discrete spaces because of the computational challenges associated with evaluating the EFE. In this paper, we propose a unified principle that establishes a theoretical connection between AIF and RL, enabling seamless integration of the two approaches and overcoming their aforementioned limitations in continuous-space POMDP settings.
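For reference, a commonly used decomposition of the EFE from the AIF literature (written in our notation; the paper's exact formulation may differ) makes the two kinds of behavior explicit:

```latex
% Expected free energy of policy \pi at a future time \tau (a standard decomposition
% from the AIF literature; q denotes the agent's predictive beliefs and p(o_\tau)
% encodes preferred outcomes):
\begin{aligned}
G(\pi,\tau) &= -\underbrace{\mathbb{E}_{q(o_\tau, s_\tau \mid \pi)}
      \bigl[\ln q(s_\tau \mid o_\tau, \pi) - \ln q(s_\tau \mid \pi)\bigr]}_{\text{information gain (exploration)}}
   \;-\; \underbrace{\mathbb{E}_{q(o_\tau \mid \pi)}\bigl[\ln p(o_\tau)\bigr]}_{\text{extrinsic value (exploitation)}}, \\
G(\pi) &= \textstyle\sum_{\tau} G(\pi,\tau).
\end{aligned}
```

Minimizing G(π) therefore trades off seeking observations that are informative about the hidden state against realizing preferred (rewarding) outcomes.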
Experimental results demonstrate the superior learning capabilities of our method compared with alternative RL approaches on partially observable tasks with continuous spaces. Notably, our approach harnesses information-seeking exploration, enabling it to effectively solve reward-free problems.
Table 1 compares foundational elements and decision-making strategies within RL, AIF, and our proposed unified inference approach, with a particular focus on their application in continuous decision spaces.
The results presented in Table 2 highlight the effectiveness of the proposed unified inference algorithm in various aspects: (i) It successfully generalizes MDP actor-critic methods to the POMDP setting, allowing for more effective exploration and learning under partial observability. (ii) It outperforms memory-based approaches in scenarios with noisy observations, indicating the advantage of leveraging the belief state representation in handling observation noise. (iii) The inclusion of the information gain intrinsic term into the generalized actor-critic methods improves their robustness to noisy observations.
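As a rough illustration of point (i), the following sketch (our own, with hypothetical module names and sizes; not the authors' implementation) shows how an MDP-style actor-critic can be conditioned on a filtered belief representation, here a diagonal Gaussian over the hidden state, rather than on the partial, noisy observation itself:

```python
import torch
import torch.nn as nn

# Minimal sketch (illustrative only): a belief encoder that outputs a Gaussian over
# the hidden state given the previous belief, previous action, and current observation,
# plus actor/critic heads that consume the belief parameters instead of the raw observation.
class GaussianBelief(nn.Module):
    def __init__(self, obs_dim, act_dim, state_dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(2 * state_dim + act_dim + obs_dim, 128), nn.ReLU(),
            nn.Linear(128, 2 * state_dim),
        )

    def forward(self, belief_mean, belief_std, prev_action, obs):
        x = torch.cat([belief_mean, belief_std, prev_action, obs], dim=-1)
        mean, log_std = self.net(x).chunk(2, dim=-1)
        return mean, log_std.exp()

obs_dim, act_dim, state_dim = 8, 2, 16
belief = GaussianBelief(obs_dim, act_dim, state_dim)
actor = nn.Sequential(nn.Linear(2 * state_dim, 64), nn.ReLU(), nn.Linear(64, act_dim), nn.Tanh())
critic = nn.Sequential(nn.Linear(2 * state_dim + act_dim, 64), nn.ReLU(), nn.Linear(64, 1))

# One interaction step: filter the belief, then act and evaluate on the belief parameters.
mean, std = torch.zeros(1, state_dim), torch.ones(1, state_dim)
obs, prev_action = torch.zeros(1, obs_dim), torch.zeros(1, act_dim)
mean, std = belief(mean, std, prev_action, obs)
belief_repr = torch.cat([mean, std], dim=-1)
action = actor(belief_repr)
q_value = critic(torch.cat([belief_repr, action], dim=-1))
```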
Fig. 2 compares the exploratory behavior of our G-SAC agent with that of SAC agents augmented with ICM, RND, and VIME, which incorporate information-oriented exploratory terms as intrinsic rewards. ICM and RND use the prediction error of a learned transition model as the intrinsic reward, whereas VIME employs an intrinsic reward designed to maximize the information gain about the parameters of a Bayesian neural network dynamics model. The results indicate that the methods whose intrinsic reward is an information gain, namely G-SAC and VIME, learn much faster, suggesting that their exploration is more effective than that of the baseline agent. In contrast, the performance of ICM and RND is strongly undermined by randomness: intrinsic motivation based on the prediction error of a transition model is widely recognized to be sensitive to the inherent stochasticity of the environment (Burda et al., 2018).
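To make the distinction concrete, the schematic below (our own; function names and the weighting coefficient `eta` are illustrative, not the authors' code) contrasts an information-gain bonus, computed as the KL divergence between the belief before and after an observation, with an ICM/RND-style prediction-error bonus, which remains large under irreducible observation noise:

```python
import torch

def info_gain_bonus(prior_mean, prior_std, post_mean, post_std):
    """KL(posterior || prior) between diagonal-Gaussian beliefs: large only when an
    observation actually changes the agent's belief about the hidden state."""
    var_ratio = (post_std / prior_std) ** 2
    kl = 0.5 * (var_ratio + ((post_mean - prior_mean) / prior_std) ** 2
                - 1.0 - var_ratio.log())
    return kl.sum(dim=-1)

def prediction_error_bonus(predicted_next_obs, next_obs):
    """ICM/RND-style bonus: stays large under irreducible observation noise,
    which is why such bonuses degrade in stochastic environments."""
    return ((predicted_next_obs - next_obs) ** 2).mean(dim=-1)

def combined_reward(extrinsic, intrinsic, eta=0.1):
    """Reward used for the policy update; `eta` is a weighting hyperparameter
    we introduce here purely for illustration."""
    return extrinsic + eta * intrinsic
```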
@article{malekzadeh2024active,
title={Active Inference and Reinforcement Learning: A Unified Inference on Continuous State and Action Spaces under
Partial Observability},
author={Malekzadeh, Parvin and Plataniotis, Konstantinos N},
journal={Neural Computation},
pages={1--64},
year={2024},
publisher={MIT Press}
}