Speech and Voice
Related topics:
- Assessment of Personality Dimensions Across Situations Using Conversational Speech. "Prior research indicates that users prefer assistive technologies whose personalities align with their own. This has sparked interest in automatic personality perception (APP), which aims to …"
- DeepGesture: A conversational gesture synthesis system based on emotions and semantics. "Along with the explosion of large language models, improvements in speech synthesis, advancements in hardware, and the evolution of computer graphics, the current bottleneck in creating digital humans …"
- From Persona to Person: Enhancing the Naturalness with Multiple Discourse Relations Graph Learning in Personalized Dialogue Generation. "In dialogue generation, the naturalness of responses is crucial for effective human-machine interaction. Personalized response generation poses even greater challenges, as the responses must …"
- LLaMA-Omni: Seamless Speech Interaction with Large Language Models. "… there is still a lack of exploration on how to build speech interaction models based on open-source LLMs. To address this, we propose LLaMA-Omni, a novel model architecture designed for low-latency an…"
- POMDP-based Statistical Spoken Dialogue Systems: a Review. "The principal elements of a conventional SDS are shown in Fig 11. At each turn t, a spoken language understanding (SLU) component converts each spoken input into an abstract semantic representation ca…"
- Proactive behavior in voice assistants: A systematic review and conceptual model. "… Yet, there is a lack of review studies synthesizing the current knowledge on how proactive behavior has been implemented in VAs and under what conditions proactivity has been found more or less suitab…"
- Self-Supervised Models of Speech Infer Universal Articulatory Kinematics. "… To understand such utility, the internal representation of speech SSL models has been scrutinized by probing analyses for known speech and linguistic features, such as low-level acoustics, phonetics, …"
- Turn-taking and Backchannel Prediction with Acoustic and Large Language Model Fusion. "We propose an approach for continuous prediction of turn-taking and backchanneling locations in spoken dialogue by fusing a neural acoustic model with a large language model (LLM). Experiments on the …"
- VITA-1.5: Towards GPT-4o Level Real-Time Vision and Speech Interaction. "Recent Multimodal Large Language Models (MLLMs) have typically focused on integrating visual and textual modalities, with less emphasis placed on the role of speech in enhancing interaction. However, …"
- Voxtral. "We present Voxtral Mini and Voxtral Small, two multimodal audio chat models. Voxtral is trained to comprehend both spoken audio and text documents, achieving state-of-the-art performance across a dive…"