
madmom: A New Python Audio and Music Signal Processing Library


Sebastian Böck∗†, Filip Korzeniowski†, Jan Schlüter‡, Florian Krebs†, Gerhard Widmer†‡
† Department of Computational Perception, Johannes Kepler University Linz, Austria
‡ Austrian Research Institute for Artificial Intelligence (OFAI), Vienna, Austria
∗ To whom correspondence should be addressed; Email: [email protected]

ABSTRACT

In this paper, we present madmom, an open-source audio processing and music information retrieval (MIR) library written in Python. madmom features a concise, NumPy-compatible, object-oriented design with simple calling conventions and sensible default values for all parameters, which facilitates fast prototyping of MIR applications. Prototypes can be seamlessly converted into callable processing pipelines through madmom's concept of Processors, callable objects that run transparently on multiple cores. Processors can also be serialised, saved, and re-run to allow results to be easily reproduced anywhere.

Apart from low-level audio processing, madmom puts emphasis on musically meaningful high-level features. Many of these incorporate machine learning techniques, and madmom provides a module that implements some methods commonly used in MIR, such as hidden Markov models and neural networks. Additionally, madmom comes with several state-of-the-art MIR algorithms for onset detection, beat, downbeat and meter tracking, tempo estimation, and piano transcription. These can easily be incorporated into bigger MIR systems or run as standalone programs.

1. INTRODUCTION

Music information retrieval (MIR) has emerged as an important research area over the last 15 years. Audio-based MIR in particular has become more and more important, since the amount of available audio data has grown beyond what can be managed manually.

Most state-of-the-art audio-based MIR algorithms consist of two components: first, low-level features are extracted from the audio signal (feature extraction stage), and then these features are analysed (feature analysis stage) to retrieve the requested information. Most current MIR systems incorporate machine learning algorithms in the feature analysis stage, with neural networks currently being the most popular and successful ones [17, 11, 2, 3].

Numerous software libraries have been proposed over the years to facilitate research and development of MIR applications. Some libraries concentrate on low-level feature extraction from audio signals, such as Marsyas [18], YAAFE [14] and openSMILE [9]. Others also include higher-level feature extraction such as onset and beat detection, for example the MIRtoolbox [13], Essentia [6] and librosa [15]. However, to our knowledge, there exists no library that also includes machine learning components (except Marsyas [18], which contains two classifiers), although such components are crucial in current MIR applications.

Therefore, we propose madmom, a library that incorporates low-level feature extraction and high-level feature analysis based on machine learning methods. This allows the construction of the full processing chain within a single software framework, making it possible to build standalone programs without any dependency on other machine learning frameworks. Moreover, madmom comes with several state-of-the-art systems including their trained models, for example for onset detection [17, 8], tempo estimation [3], beat estimation [2, 11], downbeat estimation [4], and piano transcription.

madmom is written in Python, which has become the language of choice for scientific computing for many people due to its free availability and its ease of use. The code is released under the BSD license and the pre-trained models are released under the CC BY-NC-SA 4.0 license.

1.1 Design and Functionality

1.1.1 Object-oriented programming
madmom follows an object-oriented programming (OOP) approach. We encapsulate everything in objects, which are often designed as subclasses of NumPy's ndarray, offering all array handling routines inherited from NumPy [19] with additional functionality. This compactly bundles data and meta-data (e.g., a Spectrogram and its frame rate) and simplifies meta-data handling for the user.
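
As an illustration of this pattern, the following condensed sketch shows how an ndarray subclass bundles data and meta-data; it is deliberately simpler than madmom's actual classes and serves only to demonstrate the idea:

    import numpy as np

    class Spectrogram(np.ndarray):
        """Toy data class: an ndarray that carries its own meta-data."""

        def __new__(cls, data, frame_rate=None):
            # view the input data as an instance of this class
            obj = np.asarray(data).view(cls)
            # attach meta-data directly to the array
            obj.frame_rate = frame_rate
            return obj

        def __array_finalize__(self, obj):
            # propagate meta-data through slicing, views and ufuncs
            if obj is None:
                return
            self.frame_rate = getattr(obj, 'frame_rate', None)

    spec = Spectrogram(np.random.rand(100, 1024), frame_rate=100)
    print(spec.max())            # inherited NumPy functionality
    print(spec[:10].frame_rate)  # meta-data survives slicing
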
1.1.2 Rapid prototyping
madmom aims at minimising the turnaround time from a research idea to a software prototype. To this end, object instantiation is made as simple as possible: e.g., a log Mel-spectrogram object can be instantiated with one line of code by providing only the path to an audio file. madmom automatically creates all the objects in between, using sensible default values.
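
For illustration, such a one-liner might look as follows (class and parameter names follow the library's online documentation; 'audio.wav' is a placeholder for an actual file):

    from madmom.audio.spectrogram import LogarithmicFilteredSpectrogram
    from madmom.audio.filters import MelFilterbank

    # all intermediate objects (Signal, FramedSignal, STFT, ...) are
    # created automatically, using sensible default values
    spec = LogarithmicFilteredSpectrogram('audio.wav', filterbank=MelFilterbank)
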
1.1.3 Simple conversion into runnable programs
Once an audio processing algorithm is prototyped, the complete workflow should be easily transformed into a runnable standalone program with a consistent calling interface. This is implemented using madmom's concept of Processors; for details, we refer to Section 2.

1.1.4 Machine learning integration
We aim at a seamless integration of machine learning methods without the need for any third-party modules. We limit ourselves to testing capabilities (applying pre-trained models), since it is impossible to keep up with newly emerging training methods in the various machine learning domains. Models that have been trained with an external library should be easily convertible to an internal madmom model format.

1.1.5 State-of-the-art features
Many existing libraries provide a huge variety of low-level features but few musically meaningful high-level features. madmom tries to close this gap by offering high-quality, state-of-the-art feature extractors for downbeats, beats, onsets, tempo, piano transcription, etc.

1.1.6 Reproducible research
In order to foster reproducible research, we want to be able to save and load the specific settings used to obtain the results of a certain experiment. In madmom, this is implemented using Python's own pickle functionality, which allows an entire processing chain (including all settings) to be saved to a file.
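
A minimal sketch of this mechanism, using one of the library's documented feature Processors (the file name is a placeholder):

    import pickle
    from madmom.features.onsets import SpectralOnsetProcessor

    proc = SpectralOnsetProcessor()       # any Processor can be pickled alike
    with open('experiment.pkl', 'wb') as f:
        pickle.dump(proc, f)              # all settings are serialised as well

    with open('experiment.pkl', 'rb') as f:
        restored = pickle.load(f)         # identical processing chain
    activations = restored('audio.wav')   # re-run with the exact same settings
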
1.1.7 Few dependencies
madmom is built on top of three excellent and widespread libraries: NumPy [19] provides all the array handling subroutines for madmom's data classes. SciPy [10] provides optimised routines for the fast Fourier transform (FFT), linear algebra operations and sparse matrix representations. Finally, Cython [1] is used to speed up time-critical parts of the library by automatically generating C code from a Python-like syntax and then compiling and linking it into extensions which can be used transparently from within Python. These libraries are the only installation and runtime dependencies of madmom besides the Python standard library itself, supported in version 2.7 as well as 3.3 and newer.

1.1.8 Multi-core capability
We designed madmom to be able to exploit the multi-core capabilities of modern computer architectures by providing functions to run several programs or Processors in parallel.

1.1.9 Extensive documentation
All source code files contain thorough documentation following the NumPy format. The complete API reference, instructions on how to build and install the library, as well as interactive IPython [16] notebooks can be found online at http://madmom.readthedocs.io. The documentation is built automatically with Sphinx (http://www.sphinx-doc.org).

1.1.10 Open development process
We follow an open development process, and the source code and documentation of our project are publicly available on GitHub: http://github.com/CPJKU/madmom. To maintain high code quality, we use continuous integration testing via TravisCI (http://www.travis-ci.org), code quality tests via QuantifiedCode (http://www.quantifiedcode.com), and test coverage via Coveralls (http://www.coveralls.io).

2. LIBRARY DESCRIPTION

In this section, we describe the overall architecture of madmom, its packages, as well as the provided standalone programs.

madmom's main API is composed of classes, but much of the functionality is implemented as functions (in turn used internally by the classes). This way, madmom offers the 'best of both worlds': concise interfaces exposed through classes, and detailed access to functionality through functions. In general, the classes can be split into two different types: the so-called data classes and processor classes.

Data classes represent data entities such as audio signals or spectrograms. They are implemented as subclasses of NumPy's ndarray, and thus offer all array handling routines inherited directly from NumPy (e.g. transposing or saving the data to a file in either binary or human-readable format). These classes are enriched by additional attributes and expose additional functionality via methods.

Processor classes exclusively store information on how to process data, i.e., how to transform one data class into another (e.g., from an (audio) Signal into a Spectrogram). In order to build chains of transformations, each data class has its corresponding processor class, which implements this transformation. This enables a simple and fast conversion of algorithm prototypes into callable processing pipelines.
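
The following sketch contrasts the two class types (names as in the library's documentation; 'audio.wav' is a placeholder):

    from madmom.audio.spectrogram import Spectrogram, SpectrogramProcessor

    # data class: instantiate the result directly, even from a file name
    spec = Spectrogram('audio.wav')

    # processor class: store only *how* to process, and apply it to data
    # later; this form can be re-used and plugged into processing chains
    proc = SpectrogramProcessor()
    spec = proc('audio.wav')
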
2.1 Packages

The library is split into several packages, grouped by functionality. For a detailed description including usage examples, please refer to the library's documentation.

2.1.1 madmom.processors
Processors are one of the fundamental building blocks of madmom. Each Processor accepts a number of processing parameters and must provide a process method, which takes the data to be processed as its only argument and defines the processing functionality of the Processor. An OutputProcessor extends this scheme by accepting a second argument which defines the output, and can thus be used to write the output of an algorithm to a file. All Processors are callable, making it easy to use them interchangeably with normal functions. Further, the Processor class provides methods for saving and loading any Processor to and from a file – including all parameters – using Python's own pickle library. This facilitates the reproducibility of experiments.

Multiple Processors can be combined into a processing chain using a SequentialProcessor or ParallelProcessor, which execute the chain sequentially or in parallel, respectively, using multiple CPU cores if available.
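
A sketch of such a chain, following the usage patterns in the library's documentation (the parameter values are arbitrary examples; a ParallelProcessor is constructed analogously):

    from madmom.processors import SequentialProcessor
    from madmom.audio.signal import SignalProcessor, FramedSignalProcessor
    from madmom.audio.spectrogram import SpectrogramProcessor

    # chain: load mono signal -> slice into frames -> magnitude spectrogram
    chain = SequentialProcessor([
        SignalProcessor(num_channels=1, sample_rate=44100),
        FramedSignalProcessor(frame_size=2048, fps=100),
        SpectrogramProcessor(),
    ])
    spec = chain('audio.wav')  # the chain is itself a callable Processor
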
2.1.2 madmom.audio
The madmom.audio package includes basic audio signal processing and "low-level" functionality.

The Signal and FramedSignal classes are used to load an audio signal and chop it into (overlapping) frames. Following madmom's automatic instantiation approach, both classes can be instantiated from any object up the instantiation hierarchy – including a simple file name. madmom supports almost any existing audio and video format – provided ffmpeg (http://www.ffmpeg.org) is installed – and transparently converts sample rates and numbers of channels if needed.

Signal is a subclass of ndarray with additional attributes such as the sample rate or the number of channels. FramedSignal supports float hop sizes, making it possible to build systems with an arbitrary frame rate – independently of the signal's sample rate – and ensures that all frames are temporally aligned, even if computed with different frame sizes.

The ShortTimeFourierTransform and Spectrogram classes represent the complex-valued STFT and its magnitudes, respectively. They are the key classes for spectral audio analysis and provide windowing, automatic circular shifting (for correct phase) and zero-padding. Both are independent of the data type (integer or float) of the underlying Signal, resulting in spectrograms of the same value range. A Spectrogram can be filtered with a Filterbank (e.g. Mel, Bark, logarithmic), which in turn can be parametrised to reduce the dimensionality or to transform the spectrogram into a logarithmically spaced pitch representation closely following the auditory model of the human ear. madmom also provides standard MFCC and Chroma features.
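
Putting these classes together, an explicit (non-Processor) pipeline might look like the following sketch, based on the documented API; the file name and parameter values are placeholders:

    from madmom.audio.signal import Signal, FramedSignal
    from madmom.audio.stft import ShortTimeFourierTransform
    from madmom.audio.spectrogram import Spectrogram

    sig = Signal('audio.wav', sample_rate=44100, num_channels=1)
    frames = FramedSignal(sig, frame_size=2048, fps=100)  # float hop sizes work
    stft = ShortTimeFourierTransform(frames)              # complex-valued STFT
    spec = Spectrogram(stft)                              # magnitude spectrogram
    print(sig.sample_rate, spec.shape)                    # meta-data is preserved
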
2.1.3 madmom.features
The madmom.features package includes "high-level" functionality related to certain MIR tasks, such as beat tracking or onset detection. madmom's focus is to provide musically meaningful and descriptive features rather than a vast number of low- to mid-level features. At the time of writing, madmom contains state-of-the-art features for onset detection, beat and downbeat tracking, rhythm pattern analysis, tempo estimation and piano transcription.

All features are implemented as Processors without a corresponding data class. Users can thus use the provided functionality and build algorithms on top of these features. For most of the features, madmom also provides standalone programs with a consistent calling interface to process audio files (see Section 2.2).
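
As an example, beat tracking typically proceeds in two stages, mirroring the usage shown in the library's documentation ('audio.wav' is a placeholder):

    from madmom.features.beats import RNNBeatProcessor, DBNBeatTrackingProcessor

    # stage 1: a recurrent neural network computes a beat activation function
    act = RNNBeatProcessor()('audio.wav')
    # stage 2: a dynamic Bayesian network decodes beat times (in seconds)
    beats = DBNBeatTrackingProcessor(fps=100)(act)
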
2.1.4 madmom.evaluation
All features come with code for evaluation. The implemented metrics are those commonly found in the literature of the respective field.
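
A small sketch, assuming the OnsetEvaluation class as described in the documentation (all times are in seconds and purely illustrative):

    from madmom.evaluation.onsets import OnsetEvaluation

    detections = [0.99, 1.51, 2.9]
    annotations = [1.0, 1.5, 2.0, 3.0]
    # count detections matching an annotation within a +/- 25 ms window
    ev = OnsetEvaluation(detections, annotations, window=0.025)
    print(ev.precision, ev.recall, ev.fmeasure)
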
2.1.5 madmom.ml
Most of today's top performing music analysis algorithms incorporate machine learning, with neural networks being the most universal and successful ones at the moment. madmom includes Python implementations of commonly used machine learning techniques, namely Gaussian Mixture Models (GMM), Hidden Markov Models (HMM), and different types of neural networks (NN), including feed-forward, recurrent, and convolutional layers, various activation functions and special-purpose units such as long short-term memory (LSTM).

madmom provides functionality to use these techniques without any dependencies on third-party modules, but does not contain training algorithms. This decision was made on purpose, since the library's main focus is on applying machine learning techniques to MIR, rather than providing an extensive set of learning techniques. However, trained models can easily be converted to be compatible with madmom, since neural network layers are usually defined simply as a set of weights, biases and an activation function applied to the input data.
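
A minimal sketch of applying one of the bundled networks (module and model names follow the current documentation):

    from madmom.ml.nn import NeuralNetwork
    from madmom.models import BEATS_LSTM

    # load the first of the bundled pre-trained beat-tracking LSTMs; the
    # network is a callable Processor that maps a (frames x features)
    # matrix to one activation value per frame
    nn = NeuralNetwork.load(BEATS_LSTM[0])
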
2.1.6 madmom.models
madmom comes with a set of pre-trained models which are distributed under a Creative Commons attribution non-commercial share-alike license, i.e. they can be freely used for research purposes as long as derivative works are distributed under the same license. madmom saves and loads these models with the exact same pickle mechanism it uses for Processors.

2.2 Standalone Programs

madmom comes with a set of standalone programs covering many areas of MIR. These programs are simple wrappers around the functionality provided by the madmom.features package and offer a simple, easy-to-use command line interface. They are implemented as Processors and can operate either in single or batch mode, processing single or multiple input files, respectively. Additionally, all programs can be pickled, serialising all parameters in a way that allows a program to be executed later with the exact same settings.
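
For example, invocations along the following lines process a single file or a whole set of files (illustrative only; each program lists its exact options via its --help flag):

    DBNBeatTracker single audio.wav
    DBNBeatTracker batch *.wav
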
Table 1 lists selected programs included in the library together with the performance achieved at the annual Music Information Retrieval Evaluation eXchange (MIREX, http://www.music-ir.org/mirex/wiki/), where MIR algorithms are compared on hidden test datasets. We aggregated the results of all years (2006-2015), i.e., a rank of 1 means that the algorithm is the best performing one of all submissions from 2006 until present. The outstanding results in Table 1 highlight the state-of-the-art features madmom provides.

Table 1: Ranks of the programs included in madmom for the MIREX evaluations, results aggregated over all years (2006-2015). Asterisks indicate pending submissions.

Program                  Task            Year  Rank
CNNOnsetDetector [17]    onset           2013  1
OnsetDetector [8]        onset           2013  2
BeatTracker [5]          beat (MCK)      2015  1
DBNBeatTracker [2]       beat (SMC)      2015  1
CRFBeatDetector [11]     beat (MAZ)      2015  1
GMMPatternTracker [12]   downbeat        2015  2
DBNDownBeatTracker [4]   downbeat        2016  *
TempoDetector [3]        tempo           2015  1
CRFChordTranscriptor     chord           2016  *
PianoTranscriptor        transcription   2016  *

3. CONCLUSION

This paper gave a short introduction to madmom, its design principles and library structure. Up-to-date information on its functionality can be found in the project's online documentation at https://madmom.readthedocs.io and in its source code repository at https://github.com/CPJKU/madmom.

Future work aims at including a streaming mode, i.e. providing online real-time processing of audio signals in a memory-efficient way instead of processing whole audio files at a time. In addition, we will gradually extend the set of features and algorithms, as well as add tools to automatically convert models that have been trained with popular machine learning libraries such as Lasagne [7].

4. ACKNOWLEDGMENTS

This work is supported by the European Union Seventh Framework Programme FP7/2007-2013 through the GiantSteps project (grant agreement no. 610591) and the Austrian Science Fund (FWF) project Z159.

5. REFERENCES

[1] S. Behnel, R. Bradshaw, C. Citro, L. Dalcin, D. Seljebotn, and K. Smith. Cython: The Best of Both Worlds. Computing in Science & Engineering, 13(2), 2011.
[2] S. Böck, F. Krebs, and G. Widmer. A Multi-model Approach to Beat Tracking considering Heterogeneous Music Styles. In Proc. of the 15th Int. Society for Music Information Retrieval Conf. (ISMIR), 2014.
[3] S. Böck, F. Krebs, and G. Widmer. Accurate Tempo Estimation based on Recurrent Neural Networks and Resonating Comb Filters. In Proc. of the 16th Int. Society for Music Information Retrieval Conf. (ISMIR), 2015.
[4] S. Böck, F. Krebs, and G. Widmer. Joint Beat and Downbeat Tracking with Recurrent Neural Networks. In Proc. of the 17th Int. Society for Music Information Retrieval Conf. (ISMIR), 2016.
[5] S. Böck and M. Schedl. Enhanced Beat Tracking with Context-Aware Neural Networks. In Proc. of the 14th Int. Conf. on Digital Audio Effects (DAFx), 2011.
[6] D. Bogdanov, N. Wack, E. Gómez, S. Gulati, P. Herrera, O. Mayor, G. Roma, J. Salamon, J. Zapata, and X. Serra. Essentia: an open source library for sound and music analysis. In Proc. of ACM Multimedia, 2013.
[7] S. Dieleman, J. Schlüter, C. Raffel, E. Olson, S. K. Sønderby, D. Nouri, E. Battenberg, A. van den Oord, et al. Lasagne: First release, 2015.
[8] F. Eyben, S. Böck, B. Schuller, and A. Graves. Universal Onset Detection with Bidirectional Long Short-Term Memory Neural Networks. In Proc. of the 11th Int. Society for Music Information Retrieval Conf. (ISMIR), 2010.
[9] F. Eyben, F. Weninger, F. Gross, and B. Schuller. Recent Developments in openSMILE, the Munich Open-Source Multimedia Feature Extractor. In Proc. of ACM Multimedia, Barcelona, Spain, 2013.
[10] E. Jones, T. Oliphant, P. Peterson, et al. SciPy: Open source scientific tools for Python, 2001–. [Online; accessed 2016-05-20].
[11] F. Korzeniowski, S. Böck, and G. Widmer. Probabilistic Extraction of Beat Positions from a Beat Activation Function. In Proc. of the 15th Int. Society for Music Information Retrieval Conf. (ISMIR), 2014.
[12] F. Krebs, S. Böck, and G. Widmer. Rhythmic Pattern Modeling for Beat and Downbeat Tracking in Musical Audio. In Proc. of the 14th Int. Society for Music Information Retrieval Conf. (ISMIR), 2013.
[13] O. Lartillot and P. Toiviainen. A Matlab toolbox for musical feature extraction from audio. In Proc. of the 10th Int. Conf. on Digital Audio Effects (DAFx), 2007.
[14] B. Mathieu, S. Essid, T. Fillon, and J. Prado. YAAFE, an easy to use and efficient audio feature extraction software. In Proc. of the 11th Int. Society for Music Information Retrieval Conf. (ISMIR), 2010.
[15] B. McFee, C. Raffel, D. Liang, D. Ellis, M. McVicar, E. Battenberg, and O. Nieto. librosa: Audio and Music Signal Analysis in Python. In Proc. of the 14th Python in Science Conf. (SciPy), 2015.
[16] F. Pérez and B. E. Granger. IPython: A System for Interactive Scientific Computing. Computing in Science & Engineering, 9(3), 2007.
[17] J. Schlüter and S. Böck. Musical Onset Detection with Convolutional Neural Networks. In Proc. of the 6th Int. Workshop on Machine Learning and Music, Prague, Czech Republic, 2013.
[18] G. Tzanetakis and P. Cook. MARSYAS: a framework for audio analysis. Organised Sound, 4, 2000.
[19] S. van der Walt, S. C. Colbert, and G. Varoquaux. The NumPy Array: A Structure for Efficient Numerical Computation. Computing in Science & Engineering, 13(2), 2011.
