Skip to content

Unsupervised WaveNet-based Singing Voice Conversion Using Pitch Augmentation and Two-phase Approach

License

Notifications You must be signed in to change notification settings

SongRongLee/mir-svc

Repository files navigation

Unsupervised WaveNet-based Singing Voice Conversion Using Pitch Augmentation and Two-phase Approach

License: CC BY-NC 4.0
This repository implements the singing voice conversion method described in Pitchnet: Unsupervised Singing Voice Conversion with Pitch Adversarial Network along with multiple improvements regarding its conversion quality using PyTorch. Detailed surveys and experiments have been published as a master thesis, you can get it here.

You can find demo audio files and comparisons to the original PitchNet on our demo website.

Table of contents

Dataset

We use NUS-48E dataset throughout the whole project. You can download it and perform data preprocessing and augmentation below.

Environment setup

Create a conda environment using environment.yml:
conda env create -f environment.yml

Scripts usage

Notice: Make sure you are under the project root when executing these scripts!

Data augmentation

This script will read through the given $raw_dir and generate folders with the same structure to $output_dir, containing augmented audio files next to the original ones.

python data_augmentation.py $raw_dir $output_dir --aug-type $aug_type

  • raw_dir: Path to the raw data directory with the following structure:
-> $raw_dir/
├── ADIZ
│   ├── 01.wav
│   ├── 09.wav
│   ├── 13.wav
│   └── 18.wav
├── JLEE
│   ├── 05.wav
│   ├── 08.wav
│   ├── 11.wav
│   └── 15.wav
...
  • output_dir: Path to the directory to save the augmented and original files. The resulting structure will look like this:
-> $output_dir/
├── ADIZ
│   ├── 01_original.wav
│   ├── 01_aug_back.wav
│   ├── 01_aug_phase.wav
│   ├── 01_aug_back_phase.wav
│   ├── 09_original.wav
│   ├── 09_aug_back.wav
│   ├── 09_aug_phase.wav
│   ├── 09_aug_back_phase.wav
...
...
  • aug_type: Type of augmentation

Data preprocessing

This script will read through the given $raw_dir and generate folders with the same structure to $output_dir, with each audio file processed as a *.h5 data file ready to be read by dataset classes.

python data_preprocess.py $raw_dir $output_dir --model $model

  • raw_dir: Path to the raw data directory
  • output_dir: Path to the directory to save the processed files
  • model: Target model type which we are doing data preprocessing for

Start training

This script will train the model. If --model-path is given, the training will continue with that checkpoint. To see other training parameters, run the script with -h.

python train.py $train_data_dir $model_dir --model $model --model-path $model_path

  • train_data_dir: Path to the processed data directory
  • model_dir: Directory to save checkpoint models
  • model: Target model type
  • model_path: Path to pretrained model

You can get our pretrained proposed model here.

Converting an audio file

This script will perform singing voice conversion on the given audio file. For two-phase conversion, the intermediate files will be saved to .tmp/ directory.
python inference.py $src_file $target_dir $singer_id $model_path --pitch-shift $pitch_shift --two-phase --train-data-dir $train_data_dir

  • src_file: Path to the source audio file
  • target_dir: Path to save the converted audio file
  • singer_id: Target singer ID (name)
  • model_path: Model path
  • pitch_shift: Factor of pitch shifting performed on conversion, or "auto" for automatic pitch range shifting
  • two_phase: Whether or not to perform two-phase conversion
  • train_data_dir: The original training data used for two-phase conversion

Plotting results

Loss curves

This script will plot the training loss curves of a given checkpoint. The output image will be stored in plotting-scripts/plotting-results/.
python plotting-scripts/plot_loss.py $checkpoint_path --window-size $window_size --loss-types $loss_types

  • checkpoint_path: Path to the target training checkpoint
  • window_size: Window size for moving average
  • loss_types: Target types of loss separated by spaces

Pitch curves

This script will plot the pitch extracted from the given audio file.
python plotting-scripts/plot_pitch.py $src_file

  • src_file: Path to the source audio file

Duration histogram

This script will plot the audio duration histogram of the given dataset.
python plotting-scripts/plot_hist.py $raw_dir

  • raw_dir: Path to the raw data directory

Pitch histogram

This script will plot the pitch histogram of the given dataset.
python plotting-scripts/plot_pitch_hist.py $raw_dir

  • raw_dir: Path to the raw data directory

Audio Spectrogram

This script will plot the spectrogram of the given audio file.
python plotting-scripts/plot_spec.py $src_file

  • src_file: Path to the source audio file

Network summary & testing

This script will conduct simple unit tests and print out a model summary (if applicable). Run with -h option to see all available networks.

python test_network.py $target_net

Evaluation

Data selection

This script will select random N seconds segment for each raw audio file in the given data directory and output it as a mini dataset.

python evaluation/select_data.py $raw_dir $output_dir --seg-len $seg_len

  • raw_dir: Path to the raw data directory
  • output_dir: Path to the directory to save the processed files
  • seg_len: Length (seconds) for each segment

Evaluation script

This script will perform evaluation given evaluation data directory, output file directory, and the target model.

python evaluation/evaluate.py $raw_dir $output_dir $model_path $sc_model_path $mapping --pitch-shift --two-phase --train-data-dir

  • raw_dir: Path to the evaluation data directory
  • output_dir: Path to the directory to save converted audio files
  • model_path: Path to the target model to evaluate
  • sc_model_path: Path to the singer classifier model
  • mapping: The mapping config of the conversion pairs
  • pitch_shift: Whether or not to perform pitch shifting
  • two_phase: Whether or not to perform two-phase conversion
  • train_data_dir: The original training data used for two-phase conversion

You can get the singer classifier model we used in the evaluation here.

Statistics

Below is the hardware used in these experiments and correspoding training & inference time for people who are interested in trying out the project. For more detailed analysis and experiment results, please refer to the thesis.

Hardware

Part Specification
CPU Intel(R) Core(TM) i9-9820X CPU @ 3.30GHz
RAM 125GB
GPU TITAN RTX x2
Disk PLEXTOR PX-512M9PeGN

Training time

A complete training (300000 steps) takes around 40 hours.

Inference time

Converting one second of audio file takes around 3 minutes.

License

License: CC BY-NC 4.0

Citation

@article{songrong2021svc,
  title     = {Unsupervised WaveNet-based Singing Voice Conversion Using Pitch Augmentation and Two-phase Approach},
  author    = {Lee, Songrong},
  journal   = {Graduate Institute of Networking and Multimedia, National Taiwan University Master Thesis},
  pages     = {1--56},
  year      = {2021},
  publisher = {National Taiwan University}
}

About

Unsupervised WaveNet-based Singing Voice Conversion Using Pitch Augmentation and Two-phase Approach

Topics

Resources

License

Stars

Watchers

Forks

Languages