POMDP-based Statistical Spoken Dialogue Systems: a Review


The principal elements of a conventional SDS are shown in Fig. 1. At each turn t, a spoken language understanding (SLU) component converts each spoken input into an abstract semantic representation called a user dialogue act u_t. The system updates its internal state s_t and determines the next system act via a decision rule a_t = π(s_t), also known as a policy. The system act a_t is then converted back into speech via a natural language generation (NLG) component. The state s_t consists of the variables needed to track the progress of the dialogue and the attribute values (often called slots) that determine the user’s requirements. In conventional systems, the policy is usually defined by a flow chart with nodes representing states and actions, and arcs representing user inputs [5], [6].
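The turn cycle described above can be sketched in a few lines of Python. This is only an illustrative sketch, not the implementation of any system cited in the review: the restaurant-style slots, the tuple encoding of dialogue acts, and the names `DialogueState`, `update_state`, `policy`, and `run_dialogue` are all assumptions made for the example, and the SLU and NLG components are omitted so that user and system acts appear directly in symbolic form.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple

# A user dialogue act u_t as (act type, slot, value); hypothetical encoding.
UserAct = Tuple[str, Optional[str], Optional[str]]

# Hypothetical slots for an illustrative restaurant-search domain.
REQUIRED_SLOTS = ("food", "area")


@dataclass
class DialogueState:
    """s_t: the slot values gathered so far."""
    slots: Dict[str, str] = field(default_factory=dict)


def update_state(state: DialogueState, user_act: UserAct) -> DialogueState:
    """Deterministic update: the single SLU hypothesis is trusted completely."""
    act_type, slot, value = user_act
    new_slots = dict(state.slots)
    if act_type == "inform" and slot is not None and value is not None:
        new_slots[slot] = value
    return DialogueState(new_slots)


def policy(state: DialogueState) -> str:
    """a_t = pi(s_t): a hand-written rule standing in for the flow chart."""
    for slot in REQUIRED_SLOTS:
        if slot not in state.slots:
            return f"request({slot})"
    constraints = ", ".join(f"{s}={state.slots[s]}" for s in REQUIRED_SLOTS)
    return f"offer({constraints})"


def run_dialogue(user_acts: List[UserAct]) -> List[str]:
    """Process one user act per turn and collect the system acts."""
    state = DialogueState()
    system_acts = []
    for user_act in user_acts:
        state = update_state(state, user_act)
        system_acts.append(policy(state))
    return system_acts
```

For example, `run_dialogue([("hello", None, None), ("inform", "food", "thai"), ("inform", "area", "north")])` yields a request for each missing slot followed by an offer. Because `update_state` accepts whatever the SLU returns, a misrecognised slot value would pass straight through to the offer unless extra confirmation branches were written by hand.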

Despite steady progress over the last few decades in speech recognition technology, the process of converting conversational speech into words still incurs word error rates in the range of 15% to 30% in many real-world operating environments, such as public spaces and motor cars [7], [8]. Systems that interpret and respond to spoken commands must therefore implement dialogue strategies that account for the unreliability of the input and provide error checking and recovery mechanisms. As a consequence, conventional deterministic flowchart-based systems are expensive to build and often fragile in operation.