A list of papers and projects on cutting-edge Speech Synthesis, Text-to-Speech (TTS), Singing Voice Synthesis (SVS), Voice Conversion (VC), Singing Voice Conversion (SVC), and related works of interest (such as Music Synthesis, Automatic Music Transcription, Automatic MOS Prediction, and SSL-based ASR).
Pull requests are welcome, or contact me via email ([email protected]) to add or update papers and works.
IEEE/ACM TASLP, IEEE JSTSP, JSLHR, IEEE TPAMI
NeurIPS, ICLR, ICML, IJCAI, AAAI, ACL, NAACL, EMNLP, ISMIR, ACM MM, ICASSP, INTERSPEECH, ICME
ASRU, SLT
[2022]
- Learn2Sing 2.0: Diffusion and Mutual Information-Based Target Speaker SVS by Learning from Singing Teacher | INTERSPEECH 2022 | ✔️Code | 🎧Demo
- A Hierarchical Speaker Representation Framework for One-shot Singing Voice Conversion | INTERSPEECH 2022 | 🎧Demo
- Improving Adversarial Waveform Generation based Singing Voice Conversion with Harmonic Signals | ICASSP 2022 | 🎧Demo
[2021]
- DiffSVC: A Diffusion Probabilistic Model for Singing Voice Conversion | ASRU 2021 | 🎧Demo
- Controllable and Interpretable Singing Voice Decomposition via Assem-VC | NeurIPS 2021 Workshop | 🎧Demo
- Towards High-fidelity Singing Voice Conversion with Acoustic Reference and Contrastive Predictive Coding | 2021/10 | 🎧Demo
- FastSVC: Fast Cross-Domain Singing Voice Conversion with Feature-wise Linear Modulation | ICME 2021 | 🎧Demo
- Unsupervised WaveNet-based Singing Voice Conversion Using Pitch Augmentation and Two-phase Approach | 2021/07 | ✔️Code | 🎧Demo
[2020]
- Zero-shot Singing Voice Conversion | ISMIR 2020 | 🎧Demo
- Phonetic Posteriorgrams based Many-to-Many Singing Voice Conversion via Adversarial Training | 2020/12 | 🎧Demo | Unofficial Code
- DurIAN-SC: Duration Informed Attention Network based Singing Voice Conversion System | INTERSPEECH 2020 | 🎧Demo
- Unsupervised Cross-Domain Singing Voice Conversion | INTERSPEECH 2020 | 🎧Demo
- PitchNet: Unsupervised Singing Voice Conversion with Pitch Adversarial Network | ICASSP 2020 | 🎧Demo
- VAW-GAN for Singing Voice Conversion with Non-parallel Training Data | APSIPA 2020 | ✔️Code | 🎧Demo
- M4Singer: a Multi-Style, Multi-Singer and Musical Score Provided Mandarin Singing Corpus | NeurIPS 2022 | 🔽Apply&Download | 🎧Demo
- NHSS: A Speech and Singing Parallel Database | 🔽Apply&Download
[2022]
- Deformable CNN and Imbalance-Aware Feature Learning for Singing Technique Classification | INTERSPEECH 2022
[2021]
- Investigating Time-Frequency Representations for Audio Feature Extraction in Singing Technique Classification | APSIPA 2021
- Zero-shot Singing Technique Conversion | CMMR 2021
- VocalSet: A Singing Voice Dataset | ISMIR 2018 | 🔽Apply&Download
[2022]
- Learning Noise-independent Speech Representation for High-quality Voice Conversion for Noisy Target Speakers | INTERSPEECH 2022 | 🎧Demo
- Glow-WaveGAN 2: High-quality Zero-shot Text-to-speech Synthesis and Any-to-any Voice Conversion | INTERSPEECH 2022 | 🎧Demo
- Diffusion-Based Voice Conversion with Fast Maximum Likelihood Sampling Scheme | ICLR 2022 | ✔️Code | 🎧Demo
- YourTTS: Towards Zero-Shot Multi-Speaker TTS and Zero-Shot Voice Conversion for everyone | ICML 2022 | ✔️Code | 🎧Demo | 🎧Demo | 📝Blog
- A Comparative Study of Self-supervised Speech Representation Based Voice Conversion | IEEE JSTSP 2022/07
- S3PRL-VC: Open-Source Voice Conversion Framework with Self-Supervised Speech Representations | ICASSP 2022 | ✔️Code
- A Comparison of Discrete and Soft Speech Units for Improved Voice Conversion | ICASSP 2022 | ✔️Code | 🎧Demo
- Assem-VC: Realistic Voice Conversion by Assembling Modern Speech Synthesis Techniques | ICASSP 2022 | ✔️Code | 🎧Demo
- NVC-Net: End-to-End Adversarial Voice Conversion | ICASSP 2022 | ✔️Code | 🎧Demo
- Robust Disentangled Variational Speech Representation Learning for Zero-Shot Voice Conversion | ICASSP 2022 | 🎧Demo
- Training Robust Zero-Shot Voice Conversion Models with Self-supervised Features | ICASSP 2022 | 🎧Demo
- Toward Degradation-Robust Voice Conversion | ICASSP 2022
- DGC-vector: A new speaker embedding for zero-shot voice conversion | ICASSP 2022 | 🎧Demo
- End-to-End Zero-Shot Voice Style Transfer with Location-Variable Convolutions | 2022/05 | 🎧Demo
[2021]
- On Prosody Modeling for ASR+TTS based Voice Conversion | ASRU 2021 | 🎧Demo
- Neural Analysis and Synthesis: Reconstructing Speech from Self-Supervised Representations | NeurIPS 2021 | 🎧Demo | Unofficial Code
- MediumVC: Any-to-any voice conversion using synthetic specific-speaker speeches as intermedium features | 2021/10 | ✔️Code | 🎧Demo
- StarGANv2-VC: A Diverse, Unsupervised, Non-parallel Framework for Natural-Sounding Voice Conversion | INTERSPEECH 2021 Best Paper Award | ✔️Code | 🎧Demo
- S2VC: A Framework for Any-to-Any Voice Conversion with Self-Supervised Pretrained Representations | INTERSPEECH 2021 | ✔️Code | 🎧Demo
- Many-to-Many Voice Conversion based Feature Disentanglement using Variational Autoencoder | INTERSPEECH 2021 | ✔️Code | 🎧Demo
- Speech Resynthesis from Discrete Disentangled Self-Supervised Representations | INTERSPEECH 2021 | 🎧Demo
- Improving Zero-shot Voice Style Transfer via Disentangled Representation Learning | ICLR 2021
- Global Rhythm Style Transfer Without Text Transcriptions | ICML 2021 | ✔️Code
- AGAIN-VC: A One-shot Voice Conversion using Activation Guidance and Adaptive Instance Normalization | ICASSP 2021 | ✔️Code | 🎧Demo
- Any-to-Many Voice Conversion with Location-Relative Sequence-to-Sequence Modeling | IEEE/ACM TASLP 2021/05 | ✔️Code | 🎧Demo
[2020]
- An Overview of Voice Conversion and its Challenges: From Statistical Modeling to Deep Learning | IEEE/ACM TASLP 2020/11
- Unsupervised Speech Decomposition via Triple Information Bottleneck | ICML 2020 | ✔️Code
[2019]
- One-shot Voice Conversion by Separating Speaker and Content Representations with Instance Normalization | INTERSPEECH 2019 | ✔️Code
- AUTOVC: Zero-Shot Voice Style Transfer with Only Autoencoder Loss | ICML 2019 | ✔️Code | 🎧Demo
- CSTR VCTK Corpus: English Multi-speaker Corpus for CSTR Voice Cloning Toolkit | 2019 | 🔽Apply&Download
- AISHELL-3: A Multi-speaker Mandarin TTS Corpus and the Baselines | 2020 | 🔽Apply&Download | 🎧Demo
- AISHELL-2: Transforming Mandarin ASR Research Into Industrial Scale | 2018 | 🔽Apply&Download
- AISHELL-1: An Open-Source Mandarin Speech Corpus and A Speech Recognition Baseline | 2017 | 🔽Apply&Download
[2022]
- Disentanglement of Emotional Style and Speaker Identity for Expressive Voice Conversion | INTERSPEECH 2022 | 🎧Demo
- Cross-speaker Emotion Transfer Based On Prosody Compensation for End-to-End Speech Synthesis | INTERSPEECH 2022 | 🎧Demo
- Emotion Intensity and its Control for Emotional Voice Conversion | IEEE Transactions on Affective Computing 2022/07 | ✔️Code | 🎧Demo
- Textless Speech Emotion Conversion using Discrete and Decomposed Representations | 2022/02 | 🎧Demo
[2021]
- Limited Data Emotional Voice Conversion Leveraging Text-to-Speech: Two-stage Sequence-to-Sequence Training | INTERSPEECH 2021 | ✔️Code | 🎧Demo
[2020]
- Converting Anyone's Emotion: Towards Speaker-Independent Emotional Voice Conversion | INTERSPEECH 2020 | ✔️Code | 🎧Demo
- Transforming Spectrum and Prosody for Emotional Voice Conversion with Non-Parallel Training Data | Odyssey 2020 | ✔️Code | 🎧Demo
- Seen and Unseen emotional style transfer for voice conversion with a new emotional speech dataset | ICASSP 2021 | 🔽Apply&Download | 🎧Demo
[2022]
- Muskits: an End-to-End Music Processing Toolkit for Singing Voice Synthesis | INTERSPEECH 2022 | ✔️Code
- SingAug: Data Augmentation for Singing Voice Synthesis with Cycle-consistent Training Strategy | INTERSPEECH 2022 | ✔️Code
- WeSinger: Data-augmented Singing Voice Synthesis with Auxiliary Losses | INTERSPEECH 2022 | 🎧Demo
- WeSinger 2: Fully Parallel Singing Voice Synthesis via Multi-Singer Conditional Adversarial Training | 2022/08 | 🎧Demo
- Deep Learning Approaches in Topics of Singing Information Processing | IEEE/ACM TASLP 2022/07
- Learning the Beauty in Songs: Neural Singing Voice Beautifier | ACL 2022 | ✔️Code | 🎧Demo
- DiffSinger: Singing Voice Synthesis via Shallow Diffusion Mechanism | AAAI 2022 | ✔️Code | 🎧Demo
[2021]
- Sinsy: A Deep Neural Network-Based Singing Voice Synthesis System | IEEE/ACM TASLP 2021/08 | ✔️Code
[2020]
- HiFiSinger: Towards High-Fidelity Neural Singing Voice Synthesis | 2020/09 | 🎧Demo | Unofficial Code
- M4Singer: a Multi-Style, Multi-Singer and Musical Score Provided Mandarin Singing Corpus | NeurIPS 2022 | 🔽Apply&Download | 🎧Demo
- PopCS | AAAI 2022 | 🔽Apply&Download
- Opencpop: A High-Quality Open Source Chinese Popular Song Corpus for Singing Voice Synthesis | INTERSPEECH 2022 | 🔽Apply&Download
[2022]
- ProDiff: Progressive Fast Diffusion Model For High-Quality Text-to-Speech | ACM MM 2022 | ✔️Code | 🎧Demo
- BDDM: Bilateral Denoising Diffusion Models for Fast and High-Quality Speech Synthesis | ICLR 2022 | ✔️Code | 🎧Demo
- FastDiff: A Fast Conditional Diffusion Model for High-Quality Speech Synthesis | IJCAI 2022 | ✔️Code | 🎧Demo
[2022]
- DDSP-based Singing Vocoders: A New Subtractive-based Synthesizer and A Comprehensive Evaluation | ISMIR 2022 | ✔️Code | 🎧Demo
- FastDiff: A Fast Conditional Diffusion Model for High-Quality Speech Synthesis | IJCAI 2022 | ✔️Code | 🎧Demo
- BinauralGrad: A Two-Stage Conditional Diffusion Probabilistic Model for Binaural Audio Synthesis | 2022/05 | 🎧Demo
[2021]
- Multi-Singer: Fast Multi-Singer Singing Voice Vocoder With A Large-Scale Corpus | ACM MM 2021 | 🔽Apply&Download | ✔️Code | 🎧Demo
- WaveGrad 2: Iterative Refinement for Text-to-Speech Synthesis | INTERSPEECH 2021 | 🎧Demo
- DiffWave: A Versatile Diffusion Model for Audio Synthesis | ICLR 2021 | ✔️Code | 🎧Demo
- WaveGrad: Estimating Gradients for Waveform Generation | ICLR 2021 | 🎧Demo
[2020]
- HiFi-GAN: Generative Adversarial Networks for Efficient and High Fidelity Speech Synthesis | NeurIPS 2020 | ✔️Code | 🎧Demo
- Multi-band MelGAN: Faster Waveform Generation for High-Quality Text-to-Speech | INTERSPEECH 2020 | 🎧Demo
- Parallel WaveGAN: A fast waveform generation model based on generative adversarial networks with multi-resolution spectrogram | ICASSP 2020 | 🎧Demo | Unofficial Code
[2019]
- MelGAN: Generative Adversarial Networks for Conditional Waveform Synthesis | NeurIPS 2019 | ✔️Code | 🎧Demo
- Towards achieving robust universal neural vocoding | INTERSPEECH 2019 | ✔️Code | 🎧Demo | Unofficial Code
[2022]
- Multi-instrument Music Synthesis with Spectrogram Diffusion | ISMIR 2022 | ✔️Code | 🎧Demo
- Musika! Fast Infinite Waveform Music Generation | ISMIR 2022 | ✔️Code | 🎧Demo
[2022]
- MT3: Multi-Task Multitrack Music Transcription | ICLR 2022 | ✔️Code
[2021]
- Omnizart: A General Toolbox for Automatic Music Transcription | The Open Journal 2021/12 | ✔️Code | 🎧Demo
[2022]
- UniSpeech-SAT: Universal Speech Representation Learning with Speaker Aware Pre-Training | ICASSP 2022 | ✔️Code | ✔️Code
- Performance-Efficiency Trade-offs in Unsupervised Pre-training for Speech Recognition | ICASSP 2022 | ✔️Code | ✔️Code
- Pseudo-Labeling for Massively Multilingual Speech Recognition | ICASSP 2022 | ✔️Code | ✔️Code
- WavLM: Large-Scale Self-Supervised Pre-Training for Full Stack Speech Processing | IEEE JSTSP 2022/06 | ✔️Code | ✔️Code
[2021]
- XLS-R: Self-supervised Cross-lingual Speech Representation Learning at Scale | 2021/12 | ✔️Code | ✔️Code
- Simple and Effective Zero-shot Cross-lingual Phoneme Recognition | 2021/09 | ✔️Code | ✔️Code
- TERA: Self-Supervised Learning of Transformer Encoder Representation for Speech | IEEE/ACM TASLP 2021/08 | ✔️Code
- UniSpeech: Unified Speech Representation Learning with Labeled and Unlabeled Data | ICML 2021 | ✔️Code | ✔️Code | ✔️Code
- HuBERT: Self-Supervised Speech Representation Learning by Masked Prediction of Hidden Units | IEEE/ACM TASLP 2021/06 | ✔️Code | ✔️Code
[2020]
- wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations | NeurIPS 2020 | ✔️Code | ✔️Code
- vq-wav2vec: Self-Supervised Learning of Discrete Speech Representations | ICLR 2020 | ✔️Code | ✔️Code
- Mockingjay: Unsupervised Speech Representation Learning with Deep Bidirectional Transformer Encoders | ICASSP 2020 | ✔️Code
- Unsupervised Cross-lingual Representation Learning for Speech Recognition | 2020/06 | ✔️Code | ✔️Code
- fairseq S2T: Fast Speech-to-Text Modeling with fairseq | AACL 2020 | ✔️Code | ✔️Code
[2019]
[2022]
- The VoiceMOS Challenge 2022 | INTERSPEECH 2022
[2021]
- Utilizing Self-supervised Representations for MOS Prediction | INTERSPEECH 2021 | ✔️Code
[2021]
- Data Augmenting Contrastive Learning of Speech Representations in the Time Domain | SLT 2021 | ✔️Code
[2022]
- RetrieverTTS: Modeling Decomposed Factors for Text-Based Speech Insertion | INTERSPEECH 2022 | 🎧Demo
[2022]
[2021]
[2022]
[2021]
- Speech BERT Embedding For Improving Prosody in Neural TTS | ICASSP 2021 | ✔️Code | 🎧Demo
[2021]
- NATSpeech: A Non-Autoregressive Text-to-Speech Framework
- Coqui.ai TTS
- ESPnet: end-to-end speech processing toolkit
- Muskit: Open-source music processing toolkits
- nnAudio: Audio processing using PyTorch 1D convolution networks
- Praat: doing phonetics by computer
- Parselmouth - Praat in Python, the Pythonic way
- Montreal Forced Aligner
- Awesome Speech Recognition Speech Synthesis Papers
- Awesome Voice Conversion Papers Projects
- TTS Papers
- 🐸 TTS papers
- Speech Synthesis Paper
- Awesome Diffusion Models
- Papers With Code: Voice Conversion
- Papers With Code: Singing Voice Conversion
- Papers With Code: Singing Voice Synthesis
- Awesome Open Source: Voice Conversion
- A list of demo websites for automatic music generation research
- ICASSP 2021 Paper List-VC