Discriminative reward co-training
Philipp Altmann, Fabian Ritz, Maximilian Zorn, Michael Kölle, Thomy Phan, Thomas Gabor and Claudia Linnhoff-Popien
Abstract: We propose discriminative reward co-training (DIRECT) as an extension to deep reinforcement learning algorithms. Building upon the concept of self-imitation learning (SIL), we introduce an imitation buffer to store beneficial trajectories generated by the policy, determined by their return. A discriminator network is trained concurrently with the policy to distinguish between trajectories generated by the current policy and beneficial trajectories generated by previous policies. The discriminator’s verdict is used to construct a reward signal for optimizing the policy. By interpolating prior experience, DIRECT is able to act as a reward surrogate, steering policy optimization toward more valuable regions of the reward landscape and thus toward learning an optimal policy. In this article, we formally introduce the additional components, their intended purpose and parameterization, and define a unified training procedure. To reveal insights into the mechanics of the proposed architecture, we provide evaluations of the introduced hyperparameters. Further benchmark evaluations in various discrete and continuous control environments provide evidence that DIRECT is especially beneficial in environments with sparse rewards, hard exploration tasks, and shifting circumstances. Our results show that DIRECT outperforms state-of-the-art algorithms in these challenging scenarios by providing a surrogate reward to the policy and directing the optimization toward valuable areas.
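The sketch below illustrates the two components described in the abstract: an imitation buffer that retains the highest-return episodes, and a discriminator whose verdict is turned into a surrogate reward. It is a minimal illustration under simplifying assumptions (per-step feature vectors, a logistic-regression discriminator, and placeholder class and method names), not the authors' implementation; see the [Code] link below for that.

# Conceptual sketch of the DIRECT components named in the abstract.
# Not the authors' implementation; all names and the simple
# logistic-regression discriminator are illustrative assumptions.
import heapq
import numpy as np

class ImitationBuffer:
    """Keeps the k highest-return episodes generated by previous policies."""

    def __init__(self, capacity=32):
        self.capacity = capacity
        self._heap = []       # min-heap of (return, tie_breaker, transitions)
        self._counter = 0     # unique tie-breaker so arrays are never compared

    def add(self, episode_return, transitions):
        item = (episode_return, self._counter, transitions)
        self._counter += 1
        if len(self._heap) < self.capacity:
            heapq.heappush(self._heap, item)
        else:
            heapq.heappushpop(self._heap, item)  # drop the lowest-return episode

    def sample(self, batch_size, rng):
        transitions = np.concatenate([t for _, _, t in self._heap])
        idx = rng.integers(len(transitions), size=batch_size)
        return transitions[idx]

class Discriminator:
    """Logistic regression separating buffered (label 1) from on-policy (label 0) samples."""

    def __init__(self, dim, lr=1e-2):
        self.w = np.zeros(dim)
        self.b = 0.0
        self.lr = lr

    def _prob(self, x):
        return 1.0 / (1.0 + np.exp(-(x @ self.w + self.b)))

    def update(self, buffered, on_policy):
        x = np.concatenate([buffered, on_policy])
        y = np.concatenate([np.ones(len(buffered)), np.zeros(len(on_policy))])
        p = self._prob(x)
        self.w -= self.lr * (x.T @ (p - y)) / len(y)  # binary cross-entropy gradient
        self.b -= self.lr * np.mean(p - y)

    def surrogate_reward(self, x):
        # Higher reward where samples resemble the beneficial buffered trajectories;
        # this signal is what steers policy optimization in place of (or alongside)
        # the environment reward.
        return np.log(self._prob(x) + 1e-8)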
Neural Computing and Applications (2024)
Citation:
Philipp Altmann, Fabian Ritz, Maximilian Zorn, Michael Kölle, Thomy Phan, Thomas Gabor, Claudia Linnhoff-Popien. “Discriminative reward co-training”. Neural Computing and Applications 2024. DOI: 10.1007/s00521-024-10512-8 [PDF] [Code]
Bibtex:
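A minimal BibTeX entry assembled from the citation above (the entry key is assumed; volume and page fields are omitted):

@article{altmann2024direct,
  author  = {Altmann, Philipp and Ritz, Fabian and Zorn, Maximilian and K{\"o}lle, Michael and Phan, Thomy and Gabor, Thomas and Linnhoff-Popien, Claudia},
  title   = {Discriminative reward co-training},
  journal = {Neural Computing and Applications},
  year    = {2024},
  doi     = {10.1007/s00521-024-10512-8}
}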