A Robust Quantile Huber Loss with Interpretable Parameter Adjustment in Distributional Reinforcement Learning

University of Toronto
ICASSP 2024

Abstract

Distributional Reinforcement Learning (RL) estimates the return distribution mainly by learning quantile values via minimizing the quantile Huber loss function, which entails a threshold parameter that is often selected heuristically or via hyperparameter search, may not generalize well, and can be suboptimal. This paper introduces a generalized quantile Huber loss function derived from the Wasserstein distance (WD) between Gaussian distributions, capturing noise in the predicted (current) and target (Bellman-updated) quantile values.

Compared to the classical quantile Huber loss, the proposed loss function enhances robustness against outliers. Notably, the classical quantile Huber loss can be seen as an approximation of our proposed loss, which enables the threshold parameter to be adjusted by approximating the amount of noise in the data during the learning process. This allows the threshold to be tailored to the characteristics of a specific problem.

Empirical evaluations on Atari games, a common benchmark in distributional RL, and on a recent hedging strategy that uses distributional RL validate the effectiveness of the proposed loss function and its potential for parameter adjustment in distributional RL.
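For context, the sketch below shows the classical quantile Huber loss that our generalized formulation builds on, with the threshold \( k \) exposed as an argument so that a noise-informed value can be used instead of the default \( k = 1 \). This is a minimal PyTorch-style illustration; the function name, tensor shapes, and reduction are assumptions made for exposition, not the paper's implementation of the generalized loss (GL).

import torch

def quantile_huber_loss(pred_quantiles, target_quantiles, taus, k=1.0):
    """Classical quantile Huber loss (QR-DQN style) with an adjustable threshold k.

    pred_quantiles:   (batch, N) predicted quantile values
    target_quantiles: (batch, M) Bellman-updated target quantile values
    taus:             (N,) quantile fractions associated with the predictions
    k:                Huber threshold (k > 0); the paper motivates choosing it
                      from the estimated noise level rather than fixing k = 1.
    """
    # Pairwise TD errors u[b, i, j] = target[b, j] - pred[b, i].
    u = target_quantiles.unsqueeze(1) - pred_quantiles.unsqueeze(2)

    # Huber loss with threshold k: quadratic near zero, linear in the tails.
    abs_u = u.abs()
    huber = torch.where(abs_u <= k, 0.5 * u ** 2, k * (abs_u - 0.5 * k))

    # Asymmetric quantile weighting |tau - 1{u < 0}|.
    weight = (taus.view(1, -1, 1) - (u < 0).float()).abs()

    # Average over target samples, sum over predicted quantiles, average over batch.
    return (weight * huber / k).mean(dim=2).sum(dim=1).mean()

Because \( k \) is an explicit argument, the same routine can be dropped into QR-DQN- or FQF-style agents with a noise-informed threshold in place of the default.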


Motivation

Given the effect of the threshold parameter \( k \), rather than setting \( k = 1 \) by default, can we develop an interpretation of the quantile Huber loss that supports adaptive tuning of \( k \)?

Comparison with other RL methods

QR-DQN and FQF use the quantile Huber loss with a fixed threshold \( k=1 \), without considering the impact of varying \( k \) on performance. We replaced their loss function with our generalized loss (GL). Table 1 compares human-normalized scores on Atari games.

To test our interpretation of \( k \) in a more realistic setting, we assess its performance in option hedging, a risk-aware financial application where the objective is to optimize the Conditional Value-at-Risk at the 95% confidence level (CVaR95) of total rewards. Interestingly, the optimal value of \( k \) for D4PG is not 1, and D4PG-GL's estimated b-value during training aligns with this optimal \( k \) (\( \approx 2 \)). These results emphasize the benefit of tuning \( k \) in the quantile Huber loss and demonstrate the effectiveness of our interpretation of \( k \), reducing the need for extensive parameter searches.
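To make the adjustment mechanism concrete, the hypothetical helper below estimates a noise scale from the pairwise TD residuals and uses it as the threshold. The MAD estimator and the direct mapping from the estimated scale to \( k \) are illustrative assumptions, not the paper's derivation of the b-value.

import torch

def noise_informed_threshold(pred_quantiles, target_quantiles, floor=1e-3):
    """Illustrative (hypothetical) noise-based choice of the Huber threshold k.

    Following the idea that k should reflect the noise in the current and
    target quantile estimates, this sketch uses a robust scale estimate (MAD,
    rescaled to approximate a Gaussian standard deviation) of the pairwise TD
    residuals as a stand-in for that noise level.
    """
    with torch.no_grad():
        u = target_quantiles.unsqueeze(1) - pred_quantiles.unsqueeze(2)
        scale = 1.4826 * (u - u.median()).abs().median()
    return max(scale.item(), floor)

In a training loop, the threshold would be re-estimated periodically and passed to the loss, e.g. k = noise_informed_threshold(pred, target) before calling quantile_huber_loss(pred, target, taus, k).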

Poster


BibTeX


@article{malekzadeh2024robust,
  title={A Robust Quantile Huber Loss With Interpretable Parameter Adjustment In Distributional Reinforcement Learning},
  author={Malekzadeh, Parvin and Plataniotis, Konstantinos N and Poulos, Zissis and Wang, Zeyu},
  journal={arXiv preprint arXiv:2401.02325},
  year={2024}
}