![Ziyang Ma Profile](https://pbs.twimg.com/profile_images/1694496052663943168/JqsFqulC_x96.jpg)
Ziyang Ma (@ddlbojack)
242 Followers · 68 Following · 57 Statuses
PhD Candidate SJTU X-LANCE Lab | Focus on speech, language, audio and music processing | Ex @MSFTResearch NLC Group @AlibabaGroup Tongyi SpeechAI
Joined September 2022
RT @reach_vb: Let's goo! F5-TTS 🔊
> Trained on 100K hours of data
> Zero-shot voice cloning
> Speed control (based on total duration)
> Em…
Marriage of BERT and LLaMA 😂
A little teaser for LLM2Vec @COLM_conf! Stop by the Tuesday morning poster session to learn how we officiated the marriage of BERTs and Llamas! 🦙
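For readers wondering what "officiating the marriage" of a BERT and a Llama looks like in practice, here is a minimal sketch of the core LLM2Vec idea: run a decoder-only LLM with its causal mask switched off so attention becomes bidirectional (BERT-style), then mean-pool the token states into a single text embedding. The checkpoint name and the mask-disabling loop below are illustrative assumptions, not the paper's exact code; the full recipe also includes masked next token prediction and contrastive fine-tuning, and the LLM2Vec codebase ships properly patched model classes.

```python
# Hedged sketch of the LLM2Vec idea: bidirectional attention + mean pooling.
import torch
from transformers import AutoModel, AutoTokenizer

name = "meta-llama/Llama-2-7b-hf"  # illustrative checkpoint, assumed for this sketch
tok = AutoTokenizer.from_pretrained(name)
model = AutoModel.from_pretrained(name).eval()

# Hypothetical switch: real code must patch every attention layer so that no
# causal mask is applied; LLM2Vec provides such patched model classes.
for layer in model.layers:
    layer.self_attn.is_causal = False  # assumption, not a stable public API

batch = tok(["a llama wearing a BERT costume"], return_tensors="pt")
with torch.no_grad():
    hidden = model(**batch).last_hidden_state    # (1, T, d): per-token states
mask = batch["attention_mask"].unsqueeze(-1)     # (1, T, 1): mask out padding
embedding = (hidden * mask).sum(1) / mask.sum(1) # mean pool -> (1, d) text vector
```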
RT @WilliamWangNLP: BREAKING: Taylor Swift's Eras Tour just did what AI couldn’t—pushed NeurIPS by a whole day! 🤖 🤣🤣🤣 #NeurIPS 2024 Confer…
RT @karpathy: It's a bit sad and confusing that LLMs ("Large Language Models") have little to do with language; It's just historical. They…
Glad that I will go to Kos, Greece🇬🇷 for #Interspeech2024 in person. We have 2 papers in oral sessions and 2 in poster sessions. Drop by if you are interested in SSL, LLM, emotion, and real-time interaction & generation!
RT @omarsar0: Foundation Models for Music
Provides a comprehensive overview of state-of-the-art pre-trained models and foundation models i…
RT @WenhuChen: I love simple yet effective things. However, reviewers never agree with me on that.
RT @arankomatsuzaki: Language Model Can Listen While Speaking
Explores full duplex modeling in interactive speech LMs, focusing on enhanci…
Check out our listening-while-speaking language model (LSLM), pushing interactive speech language models (iSLM) a step forward!
Language Model Can Listen While Speaking

Dialogue serves as the most natural manner of human-computer interaction (HCI). Recent advancements in speech language models (SLM) have significantly enhanced speech-based conversational AI. However, these models are limited to turn-based conversation, lacking the ability to interact with humans in real-time spoken scenarios, for example, being interrupted when the generated content is not satisfactory. To address these limitations, we explore full duplex modeling (FDM) in interactive speech language models (iSLM), focusing on enhancing real-time interaction and, more explicitly, exploring the quintessential ability of interruption. We introduce a novel model design, namely listening-while-speaking language model (LSLM), an end-to-end system equipped with both listening and speaking channels. Our LSLM employs a token-based decoder-only TTS for speech generation and a streaming self-supervised learning (SSL) encoder for real-time audio input. LSLM fuses both channels for autoregressive generation and detects turn-taking in real time. Three fusion strategies -- early fusion, middle fusion, and late fusion -- are explored, with middle fusion achieving an optimal balance between speech generation and real-time interaction. Two experimental settings, command-based FDM and voice-based FDM, demonstrate LSLM's robustness to noise and sensitivity to diverse instructions. Our results highlight LSLM's capability to achieve duplex communication with minimal impact on existing systems. This study aims to advance the development of interactive speech dialogue systems, enhancing their applicability in real-world contexts.
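To make "fuses both channels" concrete, here is a toy PyTorch sketch of the middle-fusion idea the abstract singles out: features from a streaming listener are injected into the hidden states of an autoregressive speech-token decoder at an intermediate layer. All dimensions, the fusion layer index, the additive fusion, and the turn-taking comment are illustrative assumptions, not the paper's actual architecture.

```python
# Toy sketch of middle fusion (assumed shapes and fusion rule),
# not the LSLM paper's implementation.
import torch
import torch.nn as nn

class MiddleFusionDecoder(nn.Module):
    def __init__(self, vocab=1024, d_model=256, n_layers=6, fuse_at=3):
        super().__init__()
        self.embed = nn.Embedding(vocab, d_model)
        self.layers = nn.ModuleList(
            nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
            for _ in range(n_layers))
        self.fuse_at = fuse_at                          # layer that fuses listening
        self.listen_proj = nn.Linear(d_model, d_model)  # projects streaming SSL features
        self.head = nn.Linear(d_model, vocab)           # predicts next speech token

    def forward(self, speech_tokens, listen_feats):
        # speech_tokens: (B, T) tokens generated so far (speaking channel)
        # listen_feats:  (B, T, d_model) time-aligned features (listening channel)
        T = speech_tokens.size(1)
        causal = torch.triu(torch.full((T, T), float("-inf")), diagonal=1)
        h = self.embed(speech_tokens)
        for i, layer in enumerate(self.layers):
            if i == self.fuse_at:                       # middle fusion: add listener
                h = h + self.listen_proj(listen_feats)
            h = layer(h, src_mask=causal)
        return self.head(h)  # logits; a special token could flag turn-taking

model = MiddleFusionDecoder()
logits = model(torch.randint(0, 1024, (1, 50)), torch.randn(1, 50, 256))
```

Early and late fusion would move the same addition before the embedding stack or after the final layer; placing it mid-stack is what the abstract reports as the best trade-off between generation quality and real-time interaction.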
RT @TONGYI_SpeechAI: The Tongyi Speech Team has open-sourced two foundational speech models: SenseVoice and CosyVoice. 😄SenseVoice, a mult…
RT @Thom_Wolf: The @kyutai_labs fully end-to-end audio model demo of today is a huge deal that many people missed in the room. Mostly irre…
RT @dr_cintas: Luma has released a new feature that connects start and end keyframes for more AI video control. Look at these 10 wild exam…
RT @billyuchenlin: M-A-P/Neo-7B-Instruct is the 1st 💎fully-open💎 LLM on WildBench leaderboard and its performance is awesome. "Fully open-…
So cool, man @jiatongshi
This is the most enjoyable work I've done for Interspeech! We enhanced the original data from my wife Kiki through ACE Studio, and our data appeared in this year's SVDD and VoiceMOS challenges 😁 Although KiSing's songs are niche, I hope it becomes the LJSpeech of singing in research 🤔