This repository implements the singing voice conversion method described in PitchNet: Unsupervised Singing Voice Conversion with Pitch Adversarial Network, along with several improvements to its conversion quality, using PyTorch. Detailed surveys and experiments have been published as a master's thesis; you can get it here.
You can find demo audio files and comparisons to the original PitchNet on our demo website.
We use the NUS-48E dataset throughout the whole project. You can download it and perform data preprocessing and augmentation as described below.
Create a conda environment using `environment.yml`:

```bash
conda env create -f environment.yml
```
Notice: make sure you are in the project root when executing these scripts!
This script will read through the given `$raw_dir` and generate folders with the same structure in `$output_dir`, containing augmented audio files next to the original ones.

```bash
python data_augmentation.py $raw_dir $output_dir --aug-type $aug_type
```
- `raw_dir`: Path to the raw data directory with the following structure:
  ```
  $raw_dir/
  ├── ADIZ
  │   ├── 01.wav
  │   ├── 09.wav
  │   ├── 13.wav
  │   └── 18.wav
  ├── JLEE
  │   ├── 05.wav
  │   ├── 08.wav
  │   ├── 11.wav
  │   └── 15.wav
  ...
  ```
- `output_dir`: Path to the directory to save the augmented and original files. The resulting structure will look like this:
  ```
  $output_dir/
  ├── ADIZ
  │   ├── 01_original.wav
  │   ├── 01_aug_back.wav
  │   ├── 01_aug_phase.wav
  │   ├── 01_aug_back_phase.wav
  │   ├── 09_original.wav
  │   ├── 09_aug_back.wav
  │   ├── 09_aug_phase.wav
  │   ├── 09_aug_back_phase.wav
  │   ...
  ...
  ```
- `aug_type`: Type of augmentation
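The accepted `--aug-type` values are defined in `data_augmentation.py` (check its `-h` output). As a hedged reading of the file-name suffixes above, "back" could be a time-reversed copy and "phase" a polarity-inverted copy; the sketch below writes out that assumption and is not the script's actual code.

```python
# Assumption: "back" = time-reversed audio, "phase" = polarity-inverted audio,
# inferred only from the output file names; see data_augmentation.py for the
# real definitions.
import librosa
import soundfile as sf

y, sr = librosa.load("ADIZ/01.wav", sr=None)
sf.write("01_aug_back.wav", y[::-1].copy(), sr)           # played backward
sf.write("01_aug_phase.wav", -y, sr)                      # polarity inverted
sf.write("01_aug_back_phase.wav", (-y[::-1]).copy(), sr)  # both combined
```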
This script will read through the given `$raw_dir` and generate folders with the same structure in `$output_dir`, with each audio file processed into a `*.h5` data file ready to be read by the dataset classes.

```bash
python data_preprocess.py $raw_dir $output_dir --model $model
```
- `raw_dir`: Path to the raw data directory
- `output_dir`: Path to the directory to save the processed files
- `model`: Target model type for which the data is preprocessed
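Each output `.h5` file can be opened with `h5py` to see what it contains; the stored dataset keys depend on the `--model` the data was preprocessed for, and the path below is just a placeholder.

```python
import h5py

# Quick look inside one preprocessed file: print every stored dataset
# and its shape (key names depend on the target model type).
with h5py.File("output_dir/ADIZ/01_original.h5", "r") as f:
    f.visititems(lambda name, obj: print(name, getattr(obj, "shape", "")))
```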
This script will train the model. If `--model-path` is given, training will continue from that checkpoint. To see other training parameters, run the script with `-h`.

```bash
python train.py $train_data_dir $model_dir --model $model --model-path $model_path
```
- `train_data_dir`: Path to the processed data directory
- `model_dir`: Directory to save checkpoint models
- `model`: Target model type
- `model_path`: Path to a pretrained model
You can get our pretrained proposed model here.
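Since this is a PyTorch project, checkpoints are presumably ordinary `torch.save` files, so you can peek inside one before resuming training from it; the file name and key layout here are assumptions.

```python
import torch

# Load on CPU so no GPU is required just to inspect a checkpoint;
# the stored keys are whatever train.py chose to save.
ckpt = torch.load("model_dir/checkpoint.pt", map_location="cpu")
print(list(ckpt.keys()) if isinstance(ckpt, dict) else type(ckpt))
```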
This script will perform singing voice conversion on the given audio file. For two-phase conversion, the intermediate files will be saved to the `.tmp/` directory.

```bash
python inference.py $src_file $target_dir $singer_id $model_path --pitch-shift $pitch_shift --two-phase --train-data-dir $train_data_dir
```
- `src_file`: Path to the source audio file
- `target_dir`: Path to save the converted audio file
- `singer_id`: Target singer ID (name)
- `model_path`: Model path
- `pitch_shift`: Factor of pitch shifting performed on conversion, or "auto" for automatic pitch range shifting (illustrated below)
- `two_phase`: Whether or not to perform two-phase conversion
- `train_data_dir`: The original training data used for two-phase conversion
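As a rough illustration of the `pitch_shift` factor (assuming conversion is conditioned on an F0 contour in Hz, as in PitchNet): multiplying the contour by a constant factor transposes the melody, e.g. a factor of 2.0 is up one octave.

```python
import numpy as np

# Illustrative only: scale an F0 contour (in Hz) by a constant factor,
# leaving unvoiced frames (encoded as 0) untouched.
# factor=2.0 is +1 octave, factor=0.5 is -1 octave.
def shift_f0(f0: np.ndarray, factor: float) -> np.ndarray:
    return np.where(f0 > 0, f0 * factor, 0.0)
```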
This script will plot the training loss curves of a given checkpoint. The output image will be stored in `plotting-scripts/plotting-results/`.

```bash
python plotting-scripts/plot_loss.py $checkpoint_path --window-size $window_size --loss-types $loss_types
```
- `checkpoint_path`: Path to the target training checkpoint
- `window_size`: Window size for the moving average
- `loss_types`: Target types of loss, separated by spaces
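Conceptually, the `window_size` smoothing is a plain moving average over the logged loss values; the script's exact smoothing may differ, but a minimal version looks like this:

```python
import numpy as np

# Standard moving average over a 1-D sequence of logged loss values
def moving_average(losses, window_size):
    kernel = np.ones(window_size) / window_size
    return np.convolve(losses, kernel, mode="valid")
```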
This script will plot the pitch extracted from the given audio file.

```bash
python plotting-scripts/plot_pitch.py $src_file
```
- `src_file`: Path to the source audio file
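If you want a comparable pitch contour outside the plotting script, a standard extractor such as `librosa.pyin` works; the script itself may use a different extractor, and `src.wav` is a placeholder.

```python
import librosa
import matplotlib.pyplot as plt

y, sr = librosa.load("src.wav", sr=None)
f0, voiced_flag, voiced_prob = librosa.pyin(
    y, fmin=librosa.note_to_hz("C2"), fmax=librosa.note_to_hz("C6"), sr=sr
)
plt.plot(f0)  # NaN where pyin judges the frame unvoiced
plt.xlabel("Frame")
plt.ylabel("F0 (Hz)")
plt.savefig("pitch.png")
```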
This script will plot the audio duration histogram of the given dataset.

```bash
python plotting-scripts/plot_hist.py $raw_dir
```
- `raw_dir`: Path to the raw data directory
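The histogram boils down to collecting per-file durations; a minimal sketch (the script's traversal and binning may differ, and `raw_dir` is a placeholder path):

```python
import pathlib

import librosa
import matplotlib.pyplot as plt

# Collect the duration of every .wav under the raw data directory
durations = []
for wav in pathlib.Path("raw_dir").rglob("*.wav"):
    y, sr = librosa.load(str(wav), sr=None)
    durations.append(len(y) / sr)

plt.hist(durations, bins=30)
plt.xlabel("Duration (s)")
plt.ylabel("Count")
plt.savefig("duration_hist.png")
```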
This script will plot the pitch histogram of the given dataset.

```bash
python plotting-scripts/plot_pitch_hist.py $raw_dir
```
- `raw_dir`: Path to the raw data directory
This script will plot the spectrogram of the given audio file.

```bash
python plotting-scripts/plot_spec.py $src_file
```
- `src_file`: Path to the source audio file
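A minimal log-frequency spectrogram in the same spirit, with assumed STFT parameters that may differ from the script's:

```python
import librosa
import librosa.display
import matplotlib.pyplot as plt
import numpy as np

y, sr = librosa.load("src.wav", sr=None)
# Magnitude STFT converted to dB for display
S_db = librosa.amplitude_to_db(np.abs(librosa.stft(y)), ref=np.max)
librosa.display.specshow(S_db, sr=sr, x_axis="time", y_axis="log")
plt.colorbar(format="%+2.0f dB")
plt.savefig("spec.png")
```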
This script will conduct simple unit tests and print out a model summary (if applicable). Run with the `-h` option to see all available networks.

```bash
python test_network.py $target_net
```
This script will select a random N-second segment from each raw audio file in the given data directory and output the result as a mini dataset.

```bash
python evaluation/select_data.py $raw_dir $output_dir --seg-len $seg_len
```
- `raw_dir`: Path to the raw data directory
- `output_dir`: Path to the directory to save the processed files
- `seg_len`: Length (in seconds) of each segment
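Conceptually, the selection is a random fixed-length crop per file; a sketch of the idea (not the script's actual code):

```python
import random

import librosa

# One random seg_len-second crop from an audio file
def random_segment(path, seg_len):
    y, sr = librosa.load(path, sr=None)
    n = int(seg_len * sr)
    start = random.randint(0, max(0, len(y) - n))
    return y[start:start + n], sr
```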
This script will perform evaluation given the evaluation data directory, the output file directory, and the target model.

```bash
python evaluation/evaluate.py $raw_dir $output_dir $model_path $sc_model_path $mapping --pitch-shift --two-phase --train-data-dir
```
- `raw_dir`: Path to the evaluation data directory
- `output_dir`: Path to the directory to save converted audio files
- `model_path`: Path to the target model to evaluate
- `sc_model_path`: Path to the singer classifier model
- `mapping`: The mapping config of the conversion pairs
- `pitch_shift`: Whether or not to perform pitch shifting
- `two_phase`: Whether or not to perform two-phase conversion
- `train_data_dir`: The original training data used for two-phase conversion
You can get the singer classifier model we used in the evaluation here.
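At a high level, such classifier-based evaluation measures how often converted audio is recognized as the target singer. Purely illustrative pseudocode with hypothetical names (`classify`, `pairs`); `evaluate.py` defines the real pipeline and classifier interface.

```python
# Hypothetical sketch: fraction of conversions classified as the intended singer
def conversion_accuracy(classify, pairs):
    """pairs: list of (converted_audio_path, target_singer_id) tuples."""
    hits = sum(1 for path, target in pairs if classify(path) == target)
    return hits / len(pairs)
```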
Below are the hardware used in these experiments and the corresponding training and inference times, for anyone interested in trying out the project. For more detailed analysis and experiment results, please refer to the thesis.
| Part | Specification |
| --- | --- |
| CPU | Intel(R) Core(TM) i9-9820X CPU @ 3.30GHz |
| RAM | 125 GB |
| GPU | TITAN RTX x2 |
| Disk | PLEXTOR PX-512M9PeGN |
A complete training run (300,000 steps) takes around 40 hours.
Converting one second of audio takes around 3 minutes.
- This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.
- We referenced facebookresearch/music-translation, which has the same license, for the WaveNet implementation and made modifications to fit our usage.
- pytorch-summary is used in this repo; it is licensed under an MIT License.
```bibtex
@article{songrong2021svc,
  title     = {Unsupervised WaveNet-based Singing Voice Conversion Using Pitch Augmentation and Two-phase Approach},
  author    = {Lee, Songrong},
  journal   = {Graduate Institute of Networking and Multimedia, National Taiwan University Master Thesis},
  pages     = {1--56},
  year      = {2021},
  publisher = {National Taiwan University}
}
```