Predictive Preference Learning from Human Interventions
Learning from active human involvement incorporates a human supervisor who monitors the agent and corrects its behavioral errors. Most interactive imitation learning methods focus on correcting the agent's action at the current state and do not adjust its actions at future states, which can be even more hazardous. To address this, we introduce Predictive Preference Learning from Human Interventions (PPL), which leverages the implicit preference signals contained in human interventions and applies them to predicted future rollouts. The key idea of PPL is to bootstrap each human intervention over L future time steps, called the preference horizon, under the assumption that the agent would repeat the same action and the human would issue the same intervention throughout that horizon. By applying preference optimization on these future states, expert corrections are propagated into the safety-critical regions the agent is expected to explore, significantly improving learning efficiency and reducing the number of human demonstrations needed. We evaluate our approach on both autonomous driving and robotic manipulation benchmarks and demonstrate its efficiency and generality.
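To make the bootstrapping idea concrete, the following is a minimal sketch of how a single human intervention could be propagated over the preference horizon. All names here (PreferencePair, predict_next_state, bootstrap_intervention) are illustrative assumptions, not the paper's actual implementation.

```python
# Sketch: propagate one human intervention over L predicted future steps.
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class PreferencePair:
    state: object      # a predicted future state
    preferred: object  # the human's corrective action
    rejected: object   # the agent's original action


def bootstrap_intervention(
    state,
    agent_action,
    human_action,
    predict_next_state: Callable,  # rollout-based trajectory predictor (assumed interface)
    horizon: int,                  # the preference horizon L
) -> List[PreferencePair]:
    """Turn one intervention into L preference pairs, assuming the agent would
    repeat its action and the human would repeat the same correction."""
    pairs = []
    s = state
    for _ in range(horizon):
        pairs.append(PreferencePair(state=s, preferred=human_action, rejected=agent_action))
        # Roll the predicted state forward under the agent's uncorrected action,
        # so later pairs cover the risky region the agent was about to enter.
        s = predict_next_state(s, agent_action)
    return pairs
```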
In this work, we propose a novel Interactive Imitation Learning algorithm, Predictive Preference Learning from Human Interventions (PPL), to learn from active human involvement. As shown in Fig. 1, our approach has two key designs. First, we employ an efficient rollout-based trajectory prediction model to forecast the agent's future states. These predicted rollouts are visualized in real time for the user, helping human supervisors proactively determine when an intervention is necessary. Second, our algorithm applies preference learning to the predicted future trajectories to further improve sample efficiency and reduce the number of expert demonstrations required. These designs bring three strengths: (1) They mitigate the distributional shift problem in IIL and improve training efficiency. By incorporating anticipated future states into the training process, our method constructs a richer dataset, especially in safety-critical situations. This expanded dataset provides more information than corrective demonstrations collected only at the states where the human intervenes. (2) Preference learning reduces the agent's visits to dangerous states, thereby reducing the need for human interventions in safety-critical situations. (3) By visualizing the agent's predicted future trajectories in the user interface, we significantly reduce the cognitive burden on the human supervisor, who no longer needs to constantly anticipate the agent's behavior.
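As a rough illustration of how preference learning could be applied over the bootstrapped future states, the snippet below uses a Bradley-Terry-style logistic loss; the loss form and the policy interface (policy.log_prob) are assumptions for illustration, not the paper's exact objective.

```python
# Sketch: preference optimization over the bootstrapped preference pairs.
import torch
import torch.nn.functional as F


def preference_loss(policy, pairs):
    """Logistic preference loss that pushes the policy toward the human's
    corrective action and away from the agent's original action at every
    predicted future state in the preference horizon."""
    losses = []
    for p in pairs:
        logp_pref = policy.log_prob(p.state, p.preferred)  # human's corrective action
        logp_rej = policy.log_prob(p.state, p.rejected)    # agent's original action
        # Maximize the log-probability margin between preferred and rejected actions.
        losses.append(-F.logsigmoid(logp_pref - logp_rej))
    return torch.stack(losses).mean()
```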