Ask-AC: An Initiative Advisor-in-the-Loop Actor-Critic Framework

Paper · arXiv 2207.01955 · Published July 5, 2022
LLM Architecture · Reinforcement Learning

Despite the promising results achieved, state-of-the-art interactive reinforcement learning schemes rely on passively receiving supervision signals from advisor experts, in the form of either continuous monitoring or pre-defined rules, which inevitably results in a cumbersome and expensive learning process. In this paper, we introduce a novel initiative advisor-in-the-loop actor-critic framework, termed Ask-AC, that replaces the unilateral advisor-guidance mechanism with a bidirectional learner-initiative one, and thereby enables a customized and efficacious message exchange between learner and advisor. At the heart of Ask-AC are two complementary components, namely the action requester and the adaptive state selector, which can be readily incorporated into various discrete actor-critic architectures. The former allows the agent to initiatively seek advisor intervention in the presence of uncertain states, while the latter identifies the unstable states potentially missed by the former, especially when the environment changes, and then learns to promote the asking action on such states. Experimental results on both stationary and non-stationary environments and across different actor-critic backbones demonstrate that the proposed framework significantly improves the learning efficiency of the agent, and achieves performance on par with that obtained by continuous advisor monitoring.

Ask-AC enables the agent to initiatively seek supervision signals from advisor experts only when it confronts uncertain states, and hence significantly alleviates advisor effort compared to the continuous-monitoring scheme. Ask-AC comprises two key components, namely the action requester and the adaptive state selector, which complement each other and together can be seamlessly integrated into various discrete actor-critic architectures. The action requester endows the agent with a novel asking action, allowing it to seek feedback from advisor experts in the presence of uncertain states. As training progresses, however, the demand for advisor feedback is gradually inhibited by the loss of the action requester. As a result, the agent may fail to perceive the critical states that require advisor feedback, especially when the environment changes. To this end, the adaptive state selector is introduced to promote the asking action and to adapt rapidly to environmental changes. This is achieved by analyzing the error of the state values and identifying, among the recently visited states, the unstable states potentially missed by the action requester, so that advisor guidance can be acquired on them. Our contribution is therefore a dedicated initiative advisor-in-the-loop actor-critic framework that enables two-way message passing and seeks advisor assistance only on demand. The proposed Ask-AC substantially lessens the advisor participation effort and is readily applicable to various discrete actor-critic architectures.
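To make the asking mechanism concrete, below is a minimal PyTorch sketch of an actor-critic whose discrete action space is extended with one additional asking action; when that action is sampled, the agent defers to an advisor callable instead of acting itself. The names (AskActorCritic, act, advisor) are illustrative assumptions, not the paper's actual implementation.

```python
import torch
import torch.nn as nn
from torch.distributions import Categorical


class AskActorCritic(nn.Module):
    """Actor-critic whose discrete action space is extended with one extra asking action."""

    def __init__(self, obs_dim: int, n_actions: int, hidden: int = 64):
        super().__init__()
        self.n_actions = n_actions
        self.body = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.Tanh(),
            nn.Linear(hidden, hidden), nn.Tanh(),
        )
        # n_actions environment actions + 1 asking action
        self.pi = nn.Linear(hidden, n_actions + 1)
        self.v = nn.Linear(hidden, 1)

    def forward(self, obs: torch.Tensor):
        h = self.body(obs)
        return Categorical(logits=self.pi(h)), self.v(h).squeeze(-1)


def act(model: AskActorCritic, obs: torch.Tensor, advisor):
    """Sample an action; if the asking action is chosen, query the advisor instead."""
    dist, value = model(obs)
    action = dist.sample()
    asked = action.item() == model.n_actions
    if asked:
        # the advisor maps the observation to an environment action index
        action = torch.tensor(advisor(obs), dtype=torch.long)
    return action, value, asked
```

In this sketch the asking action is simply an extra index in the policy head, so any discrete actor-critic backbone can adopt it by widening its output layer by one.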

As shown in Figure 2, our framework consists of two complementary components: the action requester and the adaptive state selector. The action requester is built directly upon a binary classification model with an asking action, which enables the agent to initiatively ask for advisor feedback when it encounters uncertain states. As a complement to the action requester, the adaptive state selector analyzes the trend of the value loss to identify the unstable states that the action requester misses among recently visited states, and learns to promote the asking action on such states. With these two components, various discrete actor-critic architectures can be transformed into our advisor-in-the-loop setting; a rough sketch of the state selector follows below.
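The sketch below illustrates one way the adaptive state selector could be realized: it records the critic's value error on recently visited states, surfaces the most unstable ones, and an auxiliary loss then promotes the asking action on them. Ranking states by raw error magnitude (rather than the paper's trend analysis of the value loss) and all names here are simplifying assumptions for illustration.

```python
from collections import deque

import torch
import torch.nn.functional as F


class AdaptiveStateSelector:
    """Tracks the critic's value error per visited state and surfaces unstable ones."""

    def __init__(self, history: int = 512, top_k: int = 32):
        self.buffer = deque(maxlen=history)  # stores (state, |value error|) pairs
        self.top_k = top_k

    def record(self, state: torch.Tensor, value_error: float):
        self.buffer.append((state.detach(), abs(value_error)))

    def unstable_states(self):
        """Return a batch of the top-k states with the largest value errors, or None."""
        if len(self.buffer) < self.top_k:
            return None
        ranked = sorted(self.buffer, key=lambda pair: pair[1], reverse=True)
        return torch.stack([state for state, _ in ranked[: self.top_k]])


def ask_promotion_loss(model, states: torch.Tensor) -> torch.Tensor:
    """Auxiliary cross-entropy term pushing the policy toward the asking action."""
    dist, _ = model(states)
    ask_index = torch.full((states.shape[0],), model.n_actions, dtype=torch.long)
    return F.cross_entropy(dist.logits, ask_index)
```

The auxiliary term would be added to the usual actor-critic objective only on the selected states, so asking is encouraged exactly where the value estimates are unreliable.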

To further test the practical effectiveness of our Ask-AC framework, we additionally introduce human advisors to assist agent training. A total of 10 people were recruited for our experiments, each paid ¥50 per hour, and the entire experimental procedure was recorded. All participants used the same desktop computers to complete the study, and the random seeds of the environment were fixed to ensure comparability. We use the AskPPO method and the DoorKey environment for these experiments. Our study includes two stages: (1) testing the performance of human advisors when playing the game themselves, and (2) testing the performance of agents trained under the feedback of human advisors.
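For reference, a human advisor can be plugged into the asking loop sketched earlier as a simple callable; the console-based helper below is a hypothetical stand-in for the interface used in the study, not the actual tooling.

```python
def human_advisor(obs) -> int:
    """Hypothetical console-based human advisor: the human inspects the state the
    agent is uncertain about and types the index of the action the agent should take."""
    print("Agent asks for help on state:")
    print(obs)
    return int(input("Action index for the agent to execute: "))
```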