Using Natural Language for Reward Shaping in Reinforcement Learning

Paper · arXiv 1903.02020 · Published March 5, 2019
Tags: Reward Models · Linguistics, NLP, NLU · Natural Language Inference

Using arbitrary natural language statements within reinforcement learning presents several challenges. First, a mapping between language and objects/actions must be learned, implicitly or explicitly, a problem known as symbol grounding [Harnad, 1990]. For example, to make use of the instruction “Jump over the snake”, the system must be able to ground “snake” to the appropriate pixels in the current state (assuming the state is represented as an image) and “jump” to the appropriate action in the action space. Second, natural language instructions are often incomplete. For instance, the agent may not be directly next to the snake and must walk towards it before jumping. Third, natural language inherently involves ambiguity and variation. This could be due to different ways of referring to the objects/actions (e.g. “jump” vs. “hop”), different amounts of information in the instructions (e.g. “Jump over the snake” vs. “Climb down the ladder after jumping over the snake”), or the level of abstraction at which the instructions are given (e.g. a high-level subgoal: “Collect the key” vs. low-level instructions: “Jump over the obstacle. Climb up the ladder and jump to collect the key.”).
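One common way to use an instruction for reward shaping is to add a small bonus proportional to how well the agent's behavior matches the instruction. The sketch below illustrates that idea only; a real system would use a learned grounding model over pixels and actions, whereas here a bag-of-words cosine similarity between the instruction and a textual action description stands in for it, and all names (`shaped_reward`, `bow_vector`, the weight value) are hypothetical.

```python
# Illustrative sketch of language-based reward shaping.
# A bag-of-words cosine similarity stands in for a learned
# language-grounding model; names and constants are hypothetical.
from collections import Counter
import math

def bow_vector(text):
    """Bag-of-words counts over lowercase whitespace tokens."""
    return Counter(text.lower().split())

def cosine_similarity(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def shaped_reward(env_reward, instruction, action_description, weight=0.1):
    """Environment reward plus a small language-similarity bonus."""
    bonus = cosine_similarity(bow_vector(instruction),
                              bow_vector(action_description))
    return env_reward + weight * bonus

# Behavior matching the instruction earns a larger shaped reward.
r_match = shaped_reward(0.0, "jump over the snake", "agent jumps over the snake")
r_other = shaped_reward(0.0, "jump over the snake", "agent climbs the ladder")
```

Because the bonus is added to, rather than replacing, the environment reward, the shaping term only biases exploration toward instruction-consistent behavior.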

Three Ways of Using Large Language Models to Evaluate Chat

We applied LLMs with targeted prompting to elicit ratings for the qualities evaluated in DSTC11 Track 4 Task 2: appropriateness, content richness, grammatical correctness, and relevance.

The goal of DSTC11 Track 4 Task 2 was to automatically predict several turn-level metrics on the test set. For each dialogue turn, given the preceding dialogue history, participants submitted a system to predict scores for the target metrics, defined by the organizers as:

• Appropriateness – The response is appropriate given the preceding dialogue.

• Content Richness – The response is informative, with long sentences including multiple entities and conceptual or emotional words.

• Grammatical Correctness – Responses are free of grammatical and semantic errors.

• Relevance – Responses are on-topic with the immediate dialogue history.

Table 1 shows chat conversations from the rehearsal dataset with the turn-level metric annotations.
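Eliciting such turn-level ratings from an LLM typically amounts to building a metric-specific prompt over the dialogue history and parsing a numeric score from the reply. The sketch below shows that pattern under stated assumptions: the prompt wording, the 1–5 scale, and the parsing logic are illustrative choices, not the authors' exact setup, and no model call is made here.

```python
# Hedged sketch of prompting an LLM for DSTC11-style turn-level ratings.
# Prompt wording, the 1-5 scale, and parsing are assumptions for
# illustration; a real system would send `prompt` to an LLM API.

METRICS = ["appropriateness", "content richness",
           "grammatical correctness", "relevance"]

def build_prompt(history, response, metric):
    """Compose a rating prompt for one metric over one dialogue turn."""
    return (
        "Given the dialogue context below, rate the final response's "
        f"{metric} on a scale from 1 (poor) to 5 (excellent). "
        "Answer with a single number.\n\n"
        f"Context:\n{history}\n\nResponse:\n{response}\n\nRating:"
    )

def parse_rating(llm_output, lo=1, hi=5):
    """Extract the first in-range integer from the model's reply."""
    for token in llm_output.split():
        digits = "".join(ch for ch in token if ch.isdigit())
        if digits and lo <= int(digits) <= hi:
            return int(digits)
    return None  # no usable rating found

prompt = build_prompt("A: How was your trip?",
                      "B: Great, I visited three museums!",
                      "content richness")
rating = parse_rating("I would rate this 4 out of 5.")  # -> 4
```

Running the four metrics separately, one prompt each, keeps the model focused on a single quality per call and makes the parsed scores directly comparable to the organizers' annotations.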