The SLU benchmark used for LeBenchmark is MEDIA (see here for downloading input features and SLU models).
The system is based on sequence-to-sequence models and is implemented on top of the Fairseq library. The encoder (Kheops) is similar to the pyramidal LSTM-based encoder proposed in the Listen, Attend and Spell paper; the only difference is that, in order to reduce the output size between two layers, we compute the mean of two consecutive hidden states instead of concatenating them as in the original model. The decoder is similar to the one used in our previous work published at ICASSP 2020. The differences in this case are that we use two attention mechanisms: one attending the encoder hidden states, and the other attending all previous predictions, instead of only the last prediction as in the original decoder.
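For readers unfamiliar with this kind of time reduction, the following is a minimal sketch (not the repository code) of the mean-based reduction used between encoder layers, assuming PyTorch and batch-first tensors:

```python
# Minimal illustration: average each pair of consecutive hidden states instead
# of concatenating them, so the hidden size stays constant while the sequence
# length is halved between two encoder layers.
import torch

def pyramidal_mean_reduction(hidden_states: torch.Tensor) -> torch.Tensor:
    """hidden_states: (batch, time, hidden) -> (batch, time // 2, hidden)"""
    batch, time, hidden = hidden_states.size()
    if time % 2 == 1:  # drop the last frame if the length is odd
        hidden_states = hidden_states[:, :-1, :]
        time -= 1
    return hidden_states.view(batch, time // 2, 2, hidden).mean(dim=2)

# Example: a 100-frame sequence becomes 50 frames with the same hidden size.
x = torch.randn(4, 100, 256)
print(pyramidal_mean_reduction(x).shape)  # torch.Size([4, 50, 256])
```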
We use a training strategy similar to our previous work. We first train the encoder alone by putting a simple decoder (Basic) on top of it, i.e. a linear layer mapping the encoder hidden states to the output vocabulary. The pre-trained encoders are then used to pre-initialize the parameters of models using an LSTM decoder (LSTM). Models with a Basic decoder trained to decode tokens (ASR) are used to pre-initialize models with a Basic decoder trained for SLU. Results obtained with this strategy are summarized in the tables below; we give both token decoding (ASR) and concept decoding (SLU) results. For more details, please see our paper accepted at Interspeech 2021.
**Token decoding (Word Error Rate)**

| Model | Input Features | DEV ER | TEST ER |
|---|---|---|---|
| *Comparison to our previous work* | | | |
| ICASSP 2020 Seq | Spectrogram | 29.42 | 28.71 |
| *Interspeech 2021* | | | |
| Kheops+Basic | Spectrogram | 36.25 | 37.16 |
| Kheops+Basic | W2V2-En-base | 19.80 | 21.78 |
| Kheops+Basic | W2V2-En-large | 24.44 | 26.96 |
| Kheops+Basic | W2V2-Fr-S-base | 23.11 | 25.22 |
| Kheops+Basic | W2V2-Fr-S-large | 18.48 | 19.92 |
| Kheops+Basic | W2V2-Fr-M-base | 14.97 | 16.37 |
| Kheops+Basic | W2V2-Fr-M-large | 11.77 | 12.85 |
| Kheops+Basic | XLSR53-large | 14.98 | 15.74 |
**Concept decoding (Concept Error Rate)**

| Model | Input Features | DEV ER | TEST ER |
|---|---|---|---|
| *Comparison to our previous work* | | | |
| ICASSP 2020 Seq | Spectrogram | 28.11 | 27.52 |
| ICASSP 2020 XT | Spectrogram | 23.39 | 24.02 |
| *Interspeech 2021* | | | |
| Kheops+Basic | Spectrogram | 39.66 | 40.76 |
| Kheops+Basic +token | Spectrogram | 34.38 | 34.74 |
| Kheops+LSTM +SLU | Spectrogram | 33.63 | 34.76 |
| Kheops+Basic +token | W2V2-En-base | 26.79 | 26.57 |
| Kheops+LSTM +SLU | W2V2-En-base | 26.31 | 26.11 |
| Kheops+Basic +token | W2V2-En-large | 29.31 | 30.39 |
| Kheops+LSTM +SLU | W2V2-En-large | 28.38 | 28.57 |
| Kheops+Basic +token | W2V2-Fr-S-base | 27.18 | 28.27 |
| Kheops+LSTM +SLU | W2V2-Fr-S-base | 26.16 | 26.69 |
| Kheops+Basic +token | W2V2-Fr-S-large | 23.34 | 23.75 |
| Kheops+LSTM +SLU | W2V2-Fr-S-large | 22.53 | 23.03 |
| Kheops+Basic +token | W2V2-Fr-M-base | 22.11 | 21.30 |
| Kheops+LSTM +SLU | W2V2-Fr-M-base | 22.56 | 22.24 |
| Kheops+Basic +token | W2V2-Fr-M-large | 21.72 | 21.35 |
| Kheops+LSTM +SLU | W2V2-Fr-M-large | 18.54 | 18.62 |
| Kheops+Basic +token | XLSR53-large | 21.00 | 20.67 |
| Kheops+LSTM +SLU | XLSR53-large | 20.34 | 19.73 |
The system was developed under Python 3.7, PyTorch 1.4.0 and Fairseq 0.9. It can probably work with other versions of Python and PyTorch, but it will definitely not work with Fairseq 0.10 or more recent. That is why, in order to reproduce our results, we recommend installing this version of Fairseq 0.9. If you clone the whole NeurIPS repository, the SLU subfolder (this repository) already contains the version of Fairseq 0.9 used for our experiments. For installation you just need to activate the correct Python environment and then:
cd SLU/fairseq/
pip install -e .
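Optionally, you can check which Fairseq version the environment picks up with:

python -c "import fairseq; print(fairseq.__version__)"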
The other Python scripts contained in the SLU folder are:
- compute_error_rate.py is used from the command line (see Usage below) to compute the error rate on the model output.
- extract_flowbert_features.py is used from the command line (see Usage below) to extract features from wav signals with wav2vec 2.0 models. NOTE: this script requires Fairseq 0.10 or more recent (you can install different versions of Fairseq in different virtual envs), as wav2vec 2.0 was released with that Fairseq version.
Input features used for our Interspeech 2021 paper and for our NeurIPS submission are available here, so you need neither the original data nor to extract features on your own.
If you want to extract features on your own with your wav2vec 2.0 models, you can use the extract_flowbert_features.py script. Since features are extracted once and for all, the script has no command-line options; you need to modify flags and variables directly in the script. Some of the models used to extract features are reachable via the links in the table above, or from our HuggingFace repository.
Flags:
- upsample: if set to True, the script will upsample signals to twice the sample rate (this is because MEDIA is 8kHz but 16kHz signals are needed)
- cuda: if set to True, the script will use a GPU for extracting features. This is much faster, but MEDIA contains some long signals that make the script crash, so those have to be processed on CPU.
- extract_features: if set to True, the script will extract features; if set to False and upsample is True, it will only upsample the signals and save them; otherwise the script will just print information about all signals (e.g. duration)
- model_size: if set to 'large', the script will perform a layer normalization on input signals, as needed with large wav2vec 2.0 models
- add_ext: if set to True, the script will assume the input list is made of filename prefixes without extension, and will add it if needed
- file_ext: the file extension to give to the extracted feature files
NOTE: you may also need to modify the device variable in case you want to run the script on a GPU other than the one specified in the script.
Variables:
- prefix_list: the input list, one file per line with absolute path, with or without extension. If the extension is not given, the script will assume '.wav' as the signal extension
- flowbert_path: the absolute path to the wav2vec 2.0 model to use for extracting features. NOTE: if you want to extract features with a model fine-tuned with the supervised fine-tuning procedure, because of the way Fairseq instantiates and loads models, you will also need to specify the model used as the fine-tuning starting point in the variable sv_starting_point. Since in the end Fairseq initializes parameters with the model specified in flowbert_path, this second model can be identical to the first.
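As a purely illustrative sketch of such an edit (the exact layout of extract_flowbert_features.py may differ, and all paths and the feature extension below are hypothetical placeholders), a configuration for a French large model could look like:

```python
# Illustrative values only; edit the corresponding flags/variables directly in
# extract_flowbert_features.py (names follow the description above).
upsample = True            # MEDIA is 8kHz, wav2vec 2.0 expects 16kHz signals
cuda = True                # switch to False for the few very long signals
extract_features = True
model_size = 'large'       # triggers layer normalization of input signals
add_ext = True             # input list contains prefixes without extension
file_ext = '.w2v2-fr-large'            # hypothetical extension for feature files
device = 'cuda:0'
prefix_list = '/path/to/train_prefixes.lst'
flowbert_path = '/path/to/wav2vec2_fr_large.pt'
```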
Once flags and variables have been set properly, you can run the script from the command line simply as python extract_flowbert_features.py, making sure the correct Python environment with Fairseq 0.10 is active.
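For example (the virtual-env path is a placeholder for whichever environment has Fairseq 0.10 installed):

source /path/to/venv_fairseq0.10/bin/activate

python extract_flowbert_features.py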
In order to train a model with a Basic decoder (a linear layer), run the script run_end2end_slu_train_basic.sh.
You need to modify the environment variables in the script so that they match the installation, home, input, etc. paths on your machine.
In particular:
- the line source ${HOME}/work/tools/venv_python3.7.2_torch1.4_decore0/bin/activate must be changed to activate your python virtual env
- FAIRSEQ_PATH must be set to point to the folder where the Fairseq tools are located (fairseq-train, fairseq-generate, etc.)
- PYTHONPATH must be set to the parent directory of the fairseq repository (e.g. path-to-neurips-repository/SLU/fairseq/)
- WORK_PATH must be set to any working directory on your machine
- DATA_PATH must be set to the directory containing the input data. DATA_PATH can be any directory, even one not containing the data, if you have already serialized input features (see below). Your data must be contained in three sub-folders of DATA_PATH named train, dev and test. Additionally, DATA_PATH must contain three files named train_prefixes.lst, dev_prefixes.lst and test_prefixes.lst listing, with absolute paths, the file prefixes (i.e. without extension) of the files for the three splits used by the system. The system expects to find at least 3 different files for each prefix: a feature file (e.g. spectrogram, wav2vec 2.0 features, etc.) with the extension specified with the --feature-extension option, a .txt file containing the transcription of the audio file, and a .sem file containing the semantic annotation of the transcription. The semantic annotation can be in one of two formats, implicitly specified with the option --corpus-name: media, or fsc (Fluent Speech Commands). With fsc, the content of the .sem file will simply be split on white spaces, and each token will be treated as a concept to be predicted. See the layout sketch below this list.
- SERIALIZED_CORPUS is the common prefix of all serialized input feature files (train, dev, test and dict), see here.
The SERIALIZED_CORPUS prefix is actually created automatically from a few other variables specified right above in the script: FEATURES_TYPE, FEATURES_SPEC, FEATURES_LANG, NORMFLAG.
- FEATURES_TYPE can be spectro for spectrograms, FlowBERT for wav2vec 2.0 large features, FlowBBERT for wav2vec 2.0 base features, and so on. See the script itself and/or the file names here.
- FEATURES_SPEC is used to specify additional information in the file name, e.g. 3Kl for the French wav2vec 2.0 large model trained on 3K hours of speech.
- FEATURES_LANG is used for the language of the wav2vec 2.0 model (use ML for the XLSR53 model features)
- NORMFLAG is set always to Normalized.
Note that the values of these variables are completely arbitrary. However, they must match those used when creating the serialized feature files if you want to (re-)use serialized feature files like those available here.
There are 3 additional variables in the script in the same section: FEATURES_EXTN, SUBTASK, and CORPUS.
- FEATURES_EXTN is the file extension of the feature files. This is used when you don't have serialized feature files (yet) and you want to read in the input data
- SUBTASK can be token or concept and is used to specify which units the model should be trained to predict: token for ASR, concept for SLU. See also below.
- CORPUS is an ID for the corpus used in the experiments. You should leave it set to media if you are using the MEDIA corpus.
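For concreteness, here is a hedged sketch of the expected DATA_PATH layout and of the configuration section of run_end2end_slu_train_basic.sh. All file names and values below are hypothetical placeholders; only the variable names, the sub-folder names and the *_prefixes.lst / .txt / .sem conventions come from the description above, and the real script may organize these lines differently.

```
DATA_PATH/
  train_prefixes.lst   dev_prefixes.lst   test_prefixes.lst   # absolute-path prefixes, one per line
  train/
    dialog001_turn01.<feature-extension>   # acoustic features (spectrogram, wav2vec 2.0, ...)
    dialog001_turn01.txt                   # transcription
    dialog001_turn01.sem                   # semantic annotation
  dev/   ...
  test/  ...
```

```bash
# Hypothetical values; adapt every path to your machine.
source ${HOME}/venvs/pytorch1.4_fairseq0.9/bin/activate
FAIRSEQ_PATH=${HOME}/tools/fairseq0.9/bin            # folder containing fairseq-train, fairseq-generate, ...
export PYTHONPATH=/path/to/neurips-repository/SLU/fairseq/
WORK_PATH=${HOME}/work/slu_experiments
DATA_PATH=/path/to/media_prepared                    # contains train/, dev/, test/ and *_prefixes.lst

FEATURES_TYPE=FlowBERT    # e.g. wav2vec 2.0 large features
FEATURES_SPEC=3Kl         # e.g. French large model trained on 3K hours
FEATURES_LANG=Fr
NORMFLAG=Normalized
FEATURES_EXTN=.w2v2       # hypothetical extension of the feature files
SUBTASK=concept           # token for ASR, concept for SLU
CORPUS=media
# SERIALIZED_CORPUS is built automatically in the script from the variables above.
```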
Pay attention to the option --slu-subtask: with the value token you will train an ASR model (token decoding); with the value concept you will train an SLU model, whose expected output format is SOC w1 ... wN concept EOC ... SOC w1 ... wN concept EOC, where SOC and EOC are start- and end-of-concept markers, and w1 ... wN are the words instantiating the concept, followed by the concept itself.
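As a purely illustrative, made-up example of this format (the words and concept labels below are invented placeholders, not actual MEDIA annotation), an SLU output could look like:

SOC je voudrais réserver COMMAND EOC SOC deux chambres ROOM-NUMBER EOC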
In order to train a model with an LSTM decoder (the version of the LSTM decoder described above), run the script run_end2end_slu_train_icassplstm.sh. Again, you need to modify the environment variables in the script so that they match the installation, home, etc. on your machine (see above). In this script too, you need to set the option --slu-subtask properly.
In order to train a model pre-initializing parameters with a previously trained model, use the option:
--load-fairseq-encoder <model file>
This option is intended to pre-initialize the encoder as explained in the paper. However, the system automatically detects whether the decoder's type is the same in the instantiated and loaded models, and in that case it also pre-initializes the decoder. This is especially useful when pre-initializing an SLU model with a linear decoder from an ASR model with a linear decoder. Make sure to remove this option when training an ASR model from scratch.
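For instance, to pre-initialize an SLU model from a previously trained ASR model, the training command in the script could include something like the following (the checkpoint path is hypothetical):

--load-fairseq-encoder ${WORK_PATH}/kheops_basic_token/checkpoint_best.pt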
At the first run, the system will read the data and save them in a serialized format containing all the tensors needed for training (and generation). In subsequent runs you can use these data with the option --serialized-corpus <data prefix>, where <data prefix> is the prefix common to all the generated files (train, validation and test data, plus the dictionary). See also the SERIALIZED_CORPUS variable above. This makes data loading much faster, especially when using wav2vec features as input.
Input features are available here, so that you don't need the original data or to extract features on your own.
In order to generate outputs with a trained model, run:

run_end2end_slu_test.sh <checkpoint path> <serialized corpus prefix> <sub-task>
This will generate an output in the same folder as the checkpoint. Once again, you need to set some environment variables properly in the script, as for the training scripts. See above for <serialized corpus prefix>.
Pay attention to the <sub-task> argument, which initializes the --slu-subtask option in the script so that the correct reference is generated to compare the system output against.
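For example (the checkpoint and corpus paths are hypothetical):

run_end2end_slu_test.sh ${WORK_PATH}/checkpoints/checkpoint_best.pt /path/to/serialized_corpus_prefix concept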
In order to score the generated output, run:

score_model_output.sh <generated output>
Pay attention to the options given to the compute_error_rate.py script within the score_model_output.sh script (you need to edit the script if you want to change these options):
- --clean-hyp: this will clean the outputs generated by the system when it is trained with the CTC loss. If you train the system with another loss, this option is not needed. All results above were obtained by training with the CTC loss.
- --slu-out: when this option is given, the evaluation script (compute_error_rate.py) removes concept boundaries and tokens, keeping only the concepts (see the output format above), thus evaluating the model with the Concept Error Rate (CER). If this option is not given, the output is evaluated as it is, which may be useful when evaluating ASR models trained with --slu-subtask token.
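For example, assuming the output generated in the checkpoint folder is named as below (hypothetical file name):

score_model_output.sh ${WORK_PATH}/checkpoints/generated_output.txt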
Please cite us using the BibTeX provided with our arXiv pre-print.