Interpretable Deep Reinforcement Learning via Concept-Based Policy Distillation
Abstract
Deep reinforcement learning policies perform exceptionally well in applications like Atari games, chess, Go, and poker. However, they are incomprehensible, making it difficult to extract new knowledge and understand policy behavior. For the same reason, deploying these policies in high-stakes applications like healthcare, finance, and criminal justice is infeasible. To address this incomprehensibility, we propose a new concept-based policy distillation method for convolutional neural network-based policies. Our method transforms raw image states into human-interpretable concepts by applying non-negative matrix factorization to the policy's activations. The concepts express features in an interpretable way and detail how the policy represents the world internally. We use the concepts to train a distilled policy represented as a set of sparse linear models, from which the distilled policy chooses one to make each action prediction. Employing a single sparse linear model reduces the complexity, making it easier for humans to understand policy behavior. Experimentally, we show the effectiveness of our distilled policy in four environments: Car Racing, Pong, Breakout, and Ms Pacman. We illustrate that inspecting these linear models gives local and global insight into how the black box policy works. Furthermore, we demonstrate that these linear models perform well, faithfully using the same features as the black box policy and capturing the black box policy's behavior in critical states. The code, data, trained models, and TensorBoard logs with hyperparameters used are provided ( https://github.com/observer4599/interpretable-concept-based-policy-distillation ).
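The pipeline the abstract describes can be sketched in a few lines: factor non-negative activations into concept scores with NMF, then fit a sparse linear model on those scores to imitate the black box policy's outputs. This is a minimal illustration using scikit-learn, not the authors' implementation; the array shapes, random data, and regularization strength are all hypothetical placeholders.

```python
import numpy as np
from sklearn.decomposition import NMF
from sklearn.linear_model import Lasso

# Hypothetical activations: 500 states x 256 channels.
# Post-ReLU activations are non-negative, which NMF requires.
rng = np.random.default_rng(0)
activations = rng.random((500, 256))

# Factor activations A ~= W @ H: W holds per-state concept scores,
# H holds the concept bases in activation space.
nmf = NMF(n_components=8, init="nndsvda", random_state=0, max_iter=500)
concept_scores = nmf.fit_transform(activations)  # shape (500, 8)

# Hypothetical teacher signal to distill, e.g. the black box
# policy's preference score for one action in each state.
teacher_outputs = rng.random(500)

# Sparse linear student: the L1 penalty drives most concept
# weights to zero, so the model stays small enough to inspect.
student = Lasso(alpha=0.01).fit(concept_scores, teacher_outputs)
print(student.coef_)  # one weight per concept; many will be ~0
```

Inspecting `student.coef_` then reveals which concepts drive the prediction, which is the kind of local and global insight the abstract refers to.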
Cite
Text

Bekkemoen and Langseth. "Interpretable Deep Reinforcement Learning via Concept-Based Policy Distillation." Machine Learning, 2025. doi:10.1007/s10994-025-06928-5

Markdown

[Bekkemoen and Langseth. "Interpretable Deep Reinforcement Learning via Concept-Based Policy Distillation." Machine Learning, 2025.](https://mlanthology.org/mlj/2025/bekkemoen2025mlj-interpretable/) doi:10.1007/s10994-025-06928-5

BibTeX
@article{bekkemoen2025mlj-interpretable,
title = {{Interpretable Deep Reinforcement Learning via Concept-Based Policy Distillation}},
author = {Bekkemoen, Yanzhe and Langseth, Helge},
journal = {Machine Learning},
year = {2025},
pages = {288},
  doi = {10.1007/s10994-025-06928-5},
volume = {114},
url = {https://mlanthology.org/mlj/2025/bekkemoen2025mlj-interpretable/}
}