UaMB-SF: Uncertainty-aware transfer across tasks using hybrid Model-Based Successor Feature reinforcement learning

1University of Toronto, 2Defence Research and Development Canada (DRDC) Toronto Research Centre
Neurocomputing 2023

Visual illustration of the proposed UaMB-SF framework.


Abstract

The ability to transfer and generalize knowledge gained from previous experiences to downstream tasks can significantly improve the sample efficiency of reinforcement learning (RL) algorithms. Recent research indicates that successor feature (SF) RL algorithms enable knowledge generalization between tasks with different rewards but identical transition dynamics. Furthermore, uncertainty-aware exploration is widely recognized as another appealing approach for improving sample efficiency.

Combining the two ideas of hybrid model-based successor features (MB-SF) and uncertainty awareness yields an approach to sample-efficient, uncertainty-aware knowledge transfer across tasks with different transition dynamics and/or reward functions. In this paper, the uncertainty of the value of each action is approximated by a Kalman filter (KF)-based multiple-model adaptive estimation, which treats the parameters of a model as random variables. To the best of our knowledge, this is the first attempt at formulating a hybrid MB-SF algorithm capable of generalizing knowledge across large or continuous state space tasks with various transition dynamics while requiring less computation at decision time than MB methods.
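To make the KF idea concrete, the following is a minimal sketch (not the paper's implementation) of a single Kalman-filter update of linear value-function weights, where the weights are treated as random variables with a covariance matrix and the predictive variance serves as the uncertainty signal; the function name and the scalar observation noise `R` are illustrative assumptions.

```python
import numpy as np

def kalman_value_update(theta, P, phi, target, R=1.0):
    """One KF update of linear value weights `theta` (mean) with covariance
    `P`, given feature vector `phi` and a scalar TD target.
    Returns updated (theta, P) and the predictive variance of the value,
    which an uncertainty-aware agent can use to direct exploration."""
    phi = phi.reshape(-1, 1)
    # Predictive variance of the value estimate (the uncertainty signal)
    S = float(phi.T @ P @ phi) + R
    K = (P @ phi) / S                      # Kalman gain
    innovation = target - float(phi.T @ theta)
    theta = theta + K * innovation         # posterior mean
    P = P - K @ phi.T @ P                  # posterior covariance
    return theta, P, S

# Toy usage: the predictive variance shrinks as evidence accumulates.
theta = np.zeros((3, 1))
P = np.eye(3) * 10.0
phi = np.array([1.0, 0.5, -0.2])
theta, P, s0 = kalman_value_update(theta, P, phi, target=1.0)
theta, P, s1 = kalman_value_update(theta, P, phi, target=1.0)
```

Because the parameters are random variables rather than point estimates, repeated observations of the same feature direction contract the covariance along it, so `s1 < s0`.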

We highlight why previous SF-based methods are constrained to knowledge generalization across tasks with the same transition dynamics, and design a set of demonstration tasks to empirically validate the effectiveness of our proposed approach. The results show that our algorithm generalizes its knowledge across different transition dynamics and outperforms existing SF and MB approaches. We believe that our proposed framework can account for the computationally efficient behavioural flexibility observed in the empirical literature and can also serve as a solid theoretical foundation for future experimental work.


A Motivating Example

Consider the grid-world task shown in Fig. (a). First, an agent learns the value function and optimal policy of the source task (Fig. (a), top panel) by learning the reward function and the SR matrix through both the TD-SR and MB-SR methods. The learned reward function and SR matrices are then transferred to the test task (Fig. (a), bottom panel). The test task is generated by placing a barrier at state s4 of the source task while keeping the reward values consistent with the source task. During the first trial of learning the test task, the agent is not aware of the barrier; hence, it mimics actions learned from the source task until it notices the barrier by repeatedly being dropped at s3 rather than s4 after taking the optimal action learned for s3 in the source task (i.e., moving right).
As illustrated by the motivating example, unlike MB-SR, TD-SR cannot instantly adapt the SR values for states that do not directly experience changes in transition dynamics. Instead, TD-SR can only learn about changes in the transition structure incrementally; hence, it requires full trajectories through the state space to adapt the entire SR to the changes.
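The contrast above can be sketched in a tabular setting (a simplification of the paper's example, with illustrative function names): MB-SR recomputes the SR in closed form from the learned transition matrix, so a local change propagates to every state at once, whereas a TD-SR update only touches the row of the state actually visited.

```python
import numpy as np

def mb_sr(T, gamma=0.95):
    """Model-based SR: M = (I - gamma * T)^-1 from the transition matrix T.
    Any change to T immediately changes the SR of all states."""
    n = T.shape[0]
    return np.linalg.inv(np.eye(n) - gamma * T)

def td_sr_step(M, s, s_next, alpha=0.1, gamma=0.95):
    """One TD-SR update: only row s of M changes, so states that never
    traverse the altered transition keep stale SR values."""
    n = M.shape[0]
    onehot = np.eye(n)[s]
    M[s] += alpha * (onehot + gamma * M[s_next] - M[s])
    return M

# 4-state chain s1 -> s2 -> s3 -> s4 (indices 0..3), s4 absorbing.
T = np.zeros((4, 4))
for s in range(3):
    T[s, s + 1] = 1.0
T[3, 3] = 1.0
M_old = mb_sr(T)

# Barrier at s4: the agent taking "right" at s3 is dropped back at s3.
T_blocked = T.copy()
T_blocked[2, 3] = 0.0
T_blocked[2, 2] = 1.0
M_new = mb_sr(T_blocked)
# MB-SR updates the SR of s1 and s2 without ever re-visiting them:
# M_new[0] already reflects that s4 is unreachable from s1.
```

A TD-SR agent starting from `M_old` would instead have to run trajectories from s1 and s2 through the blocked region before `td_sr_step` propagates the change back to those rows.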

Comparison with Other RL Methods

Table 1 summarizes the generalization properties of various RL algorithms and the computation they require on test tasks.


Validation of UaMB-SF Adaptability to Environmental Changes

The transfer setup of the proposed UaMB-SF algorithm is presented in Fig. 4. θ_a and F_a capture changes in the reward function and transition dynamics, respectively. After training UaMB-SF on a source task, only a few samples are needed to update θ_a and F_a, eliminating the need for exhaustive interactions from scratch.
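The reward side of this transfer can be sketched as follows (an illustrative simplification, not the paper's KF-based estimator): when the features learned on the source task are reused unchanged, the reward weights can be re-estimated on the test task from just a handful of (feature, reward) samples by regularized least squares.

```python
import numpy as np

def refit_reward_weights(Phi, rewards, reg=1e-6):
    """Re-estimate reward weights theta from a few (feature, reward)
    samples gathered on the test task via ridge regression. The features
    (rows of Phi) come from the source task and are reused unchanged."""
    d = Phi.shape[1]
    return np.linalg.solve(Phi.T @ Phi + reg * np.eye(d), Phi.T @ rewards)

# Toy usage: 8 samples suffice to recover 3 reward weights exactly
# in the noiseless case, rather than relearning from scratch.
rng = np.random.default_rng(0)
theta_true = np.array([1.0, -0.5, 0.2])
Phi = rng.normal(size=(8, 3))          # only 8 test-task samples
rewards = Phi @ theta_true
theta_hat = refit_reward_weights(Phi, rewards)
```

The same few-sample principle applies to the transition-side parameters F_a, though updating those requires observed next-state features rather than rewards.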


BibTeX

@article{malekzadeh2023uncertainty,
  title={Uncertainty-aware transfer across tasks using hybrid model-based successor feature reinforcement learning},
  author={Malekzadeh, Parvin and Hou, Ming and Plataniotis, Konstantinos N},
  journal={Neurocomputing},
  volume={530},
  pages={165--187},
  year={2023},
  publisher={Elsevier}
}