Speech and Voice
Related topics:
- Assessment of Personality Dimensions Across Situations Using Conversational Speech. "Prior research indicates that users prefer assistive technologies whose personalities align with their own. This has sparked interest in automatic personality perception (APP), which aims to …"
- DeepGesture: A conversational gesture synthesis system based on emotions and semantics. "Along with the explosion of large language models, improvements in speech synthesis, advancements in hardware, and the evolution of computer graphics, the current bottleneck in creating digital humans …"
- From Persona to Person: Enhancing the Naturalness with Multiple Discourse Relations Graph Learning in Personalized Dialogue Generation. "In dialogue generation, the naturalness of responses is crucial for effective human-machine interaction. Personalized response generation poses even greater challenges, as the responses must …"
- LLaMA-Omni: Seamless Speech Interaction with Large Language Models. "… there is still a lack of exploration on how to build speech interaction models based on open-source LLMs. To address this, we propose LLaMA-Omni, a novel model architecture designed for low-latency an…"
- POMDP-based Statistical Spoken Dialogue Systems: a Review. "The principal elements of a conventional SDS are shown in Fig 11. At each turn t, a spoken language understanding (SLU) component converts each spoken input into an abstract semantic representation ca…"
- Proactive behavior in voice assistants: A systematic review and conceptual model. "… Yet, there is a lack of review studies synthesizing the current knowledge on how proactive behavior has been implemented in VAs and under what conditions proactivity has been found more or less suitab…"
- Self-Supervised Models of Speech Infer Universal Articulatory Kinematics. "… To understand such utility, the internal representation of speech SSL models has been scrutinized by probing analyses for known speech and linguistic features, such as low-level acoustics, phonetics, …"
- Turn-taking and Backchannel Prediction with Acoustic and Large Language Model Fusion. "We propose an approach for continuous prediction of turn-taking and backchanneling locations in spoken dialogue by fusing a neural acoustic model with a large language model (LLM). Experiments on the …"
- VITA-1.5: Towards GPT-4o Level Real-Time Vision and Speech Interaction. "Recent Multimodal Large Language Models (MLLMs) have typically focused on integrating visual and textual modalities, with less emphasis placed on the role of speech in enhancing interaction. However, …"
- Voxtral. "We present Voxtral Mini and Voxtral Small, two multimodal audio chat models. Voxtral is trained to comprehend both spoken audio and text documents, achieving state-of-the-art performance across a dive…"