[IMPORTANT] arXiv sound does not post some papers submitted to arXiv or … because they do not appear in arXiv's RSS feed. We apologize for the inconvenience.
``Analyzing Musical Characteristics of National Anthems in Relation to Global Indices,'' S M Rakib Hasan, Aakar Dhakal, Ms. Ayesha Siddiqua, Mohammad Mominur Rahman, Md Maidul Islam, Mohammed Arfat Raihan Chowdhury, S M Masfequier Rahman Swapno, SM Nuruz…
``WaveGrad: Estimating Gradients for Waveform Generation. (arXiv:2009.00713v1),'' Nanxin Chen, Yu Zhang, Heiga Zen, Ron J. Weiss, Mohammad Norouzi, William Chan,
``A Review of Differentiable Digital Signal Processing for Music & Speech Synthesis. (arXiv:2308.15422v1),'' Ben Hayes, Jordie Shier, György Fazekas, Andrew McPherson, Charalampos Saitis,
``OWSM-CTC: An Open Encoder-Only Speech Foundation Model for Speech Recognition, Translation, and Language Identification,'' Yifan Peng, Yui Sudo, Muhammad Shakeel, Shinji Watanabe,
``LibriTTS-P: A Corpus with Speaking Style and Speaker Identity Prompts for Text-to-Speech and Style Captioning,'' Masaya Kawamura, Ryuichi Yamamoto, Yuma Shirahata, Takuya Hasumi, Kentaro Tachibana,
``Style Transfer of Audio Effects with Differentiable Signal Processing. (arXiv:2207.08759v1),'' Christian J. Steinmetz, Nicholas J. Bryan, Joshua D. Reiss,
``Extending Multilingual Speech Synthesis to 100+ Languages without Transcribed Data,'' Takaaki Saeki, Gary Wang, Nobuyuki Morioka, Isaac Elias, Kyle Kastner, Andrew Rosenberg, Bhuvana Ramabhadran, Heiga Zen, Françoise Beaufays, Hadar Shemtov,
``VALL-E 2: Neural Codec Language Models are Human Parity Zero-Shot Text to Speech Synthesizers,'' Sanyuan Chen, Shujie Liu, Long Zhou, Yanqing Liu, Xu Tan, Jinyu Li, Sheng Zhao, Yao Qian, Furu Wei,
``Masked Audio Generation using a Single Non-Autoregressive Transformer. (arXiv:2401.04577v1),'' Alon Ziv, Itai Gat, Gael Le Lan, Tal Remez, Felix Kreuk, Alexandre Défossez, Jade Copet, Gabriel Synnaeve, Yossi Adi,
``Music ControlNet: Multiple Time-varying Controls for Music Generation. (arXiv:2311.07069v1),'' Shih-Lun Wu, Chris Donahue, Shinji Watanabe, Nicholas J. Bryan,
``Decoder-only Architecture for Speech Recognition with CTC Prompts and Text Data Augmentation. (arXiv:2309.08876v1),'' Emiru Tsunoo, Hayato Futami, Yosuke Kashiwagi, Siddhant Arora, Shinji Watanabe,
``An Empirical Study of Speech Language Models for Prompt-Conditioned Speech Synthesis,'' Yifan Peng, Ilia Kulikov, Yilin Yang, Sravya Popuri, Hui Lu, Changhan Wang, Hongyu Gong,
``SemantiCodec: An Ultra Low Bitrate Semantic Audio Codec for General Sound,'' Haohe Liu, Xuenan Xu, Yi Yuan, Mengyue Wu, Wenwu Wang, Mark D. Plumbley,
``Benchmarking Representations for Speech, Music, and Acoustic Events,'' Moreno La Quatra, Alkis Koudounas, Lorenzo Vaiani, Elena Baralis, Luca Cagliero, Paolo Garza, Sabato Marco Siniscalchi,
``WavCraft: Audio Editing and Generation with Large Language Models,'' Jinhua Liang, Huan Zhang, Haohe Liu, Yin Cao, Qiuqiang Kong, Xubo Liu, Wenwu Wang, Mark D. Plumbley, Huy Phan, Emmanouil Benetos,
``How Should We Extract Discrete Audio Tokens from Self-Supervised Models?,'' Pooneh Mousavi, Jarod Duret, Salah Zaiem, Luca Della Libera, Artem Ploujnikov, Cem Subakan, Mirco Ravanelli,
``Multi-instrument Music Synthesis with Spectrogram Diffusion. (arXiv:2206.05408v2 UPDATED),'' Curtis Hawthorne, Ian Simon, Adam Roberts, Neil Zeghidour, Josh Gardner, Ethan Manilow, Jesse Engel,
``SpecAugment++: A Hidden Space Data Augmentation Method for Acoustic Scene Classification. (arXiv:2103.16858v1),'' Helin Wang, Yuexian Zou, Wenwu Wang,
``Speech Enhancement and Dereverberation with Diffusion-based Generative Models. (arXiv:2208.05830v1),'' Julius Richter, Simon Welker, Jean-Marie Lemercier, Bunlong Lay, Timo Gerkmann,
``Whisper-AT: Noise-Robust Automatic Speech Recognizers are Also Strong General Audio Event Taggers. (arXiv:2307.03183v1),'' Yuan Gong, Sameer Khurana, Leonid Karlinsky, James Glass,
``DDSP-based Neural Waveform Synthesis of Polyphonic Guitar Performance from String-wise MIDI Input. (arXiv:2309.07658v1),'' Nicolas Jonason, Xin Wang, Erica Cooper, Lauri Juvela, Bob L. T. Sturm, Junichi Yamagishi,
``The NeurIPS 2023 Machine Learning for Audio Workshop: Affective Audio Benchmarks and Novel Data,'' Alice Baird, Rachel Manzelli, Panagiotis Tzirakis, Chris Gagne, Haoqi Li, Sadie Allen, Sander Dieleman, Brian Kulis, Shrikanth S. Narayanan, Alan Cowen,
``Continual-wav2vec2: an Application of Continual Learning for Self-Supervised Automatic Speech Recognition. (arXiv:2107.13530v1),'' Samuel Kessler, Bethan Thomas, Salah Karout,
``Less is More: Accurate Speech Recognition & Translation without Web-Scale Data,'' Krishna C. Puvvada, Piotr Żelasko, He Huang, Oleksii Hrinchuk, Nithin Rao Koluguri, Kunal Dhawan, Somshubra Majumdar, Elena Rastorgueva, Zhehuai Chen, Vitaly Lavrukhin,…
``Instruct-MusicGen: Unlocking Text-to-Music Editing for Music Language Models via Instruction Tuning,'' Yixiao Zhang, Yukara Ikemiya, Woosung Choi, Naoki Murata, Marco A. Martínez-Ramírez, Liwei Lin, Gus Xia, Wei-Hsiang Liao, Yuki Mitsufuji, Simon D…
``Audio Self-supervised Learning: A Survey. (arXiv:2203.01205v1),'' Shuo Liu, Adria Mallol-Ragolta, Emilia Parada-Cabeleiro, Kun Qian, Xin Jing, Alexander Kathan, Bin Hu, Bjoern W. Schuller,
``Neural Vocoder is All You Need for Speech Super-resolution. (arXiv:2203.14941v1),'' Haohe Liu, Woosung Choi, Xubo Liu, Qiuqiang Kong, Qiao Tian, DeLiang Wang,
``Neural HMMs are all you need (for high-quality attention-free TTS). (arXiv:2108.13320v3 UPDATED),'' Shivam Mehta, Éva Székely, Jonas Beskow, Gustav Eje Henter,
``Tiny Transformers for Environmental Sound Classification at the Edge. (arXiv:2103.12157v1),'' David Elliott, Carlos E. Otero, Steven Wyatt, Evan Martino,
``BASE TTS: Lessons from building a billion-parameter Text-to-Speech model on 100K hours of data,'' Mateusz Łajszczak, Guillermo Cámbara, Yang Li, Fatih Beyhan, Arent van Korlaar, Fan Yang, Arnaud J…
``OWSM v3.1: Better and Faster Open Whisper-Style Speech Models based on E-Branchformer,'' Yifan Peng, Jinchuan Tian, William Chen, Siddhant Arora, Brian Yan, Yui Sudo, Muhammad Shakeel, Kwanghee …
``One model to rule them all ? Towards End-to-End Joint Speaker Diarization and Speech Recognition. (arXiv:2310.01688v1),'' Samuele Cornell, Jee-weon Jung, Shinji Watanabe, Stefano Squartini,
``LiveSpeech: Low-Latency Zero-shot Text-to-Speech via Autoregressive Modeling of Audio Discrete Codes,'' Trung Dang, David Aponte, Dung Tran, Kazuhito Koishida,
``DITTO-2: Distilled Diffusion Inference-Time T-Optimization for Music Generation,'' Zachary Novack, Julian McAuley, Taylor Berg-Kirkpatrick, Nicholas Bryan,
``Should you use a probabilistic duration model in TTS? Probably! Especially for spontaneous speech,'' Shivam Mehta, Harm Lameris, Rajiv Punmiya, Jonas Beskow, Éva Székely, Gustav Eje Henter,
``Parallel Tacotron 2: A Non-Autoregressive Neural TTS Model with Differentiable Duration Modeling. (arXiv:2103.14574v1),'' Isaac Elias, Heiga Zen, Jonathan Shen, Yu Zhang, Jia Ye, RJ Ryan, Yonghui Wu,
``MSLM-S2ST: A Multitask Speech Language Model for Textless Speech-to-Speech Translation with Speaker Style Preservation,'' Yifan Peng, Ilia Kulikov, Yilin Yang, Sravya Popuri, Hui Lu, Changhan Wang, Hongyu Gong,