While reading this (see feature extraction section, subsection 4.1), I came across the following:
The per-frame values for each coefficient are summarized across time using the following summary statistics: minimum, maximum, median, mean, variance, skewness, kurtosis and the mean and variance of the first and second derivatives, resulting in a feature vector of dimension 225 per slice.
From what I understand, the author is trying to summarize a ~4-second audio recording using MFCCs, a feature well known for its effectiveness. Since the extracted MFCCs form a matrix (coefficients × frames) and a single feature vector is needed, each coefficient is summarized across time in several ways (min, max, median, mean, variance, skewness, kurtosis, plus the mean and variance of the first and second derivatives). Concatenating those summaries gives the final feature vector, whose length is 225.
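To make my reading concrete, here is a minimal sketch of how I imagine the summarization step, using NumPy only. Everything in it is my own assumption, not the author's code: the function names `standardized_moment` and `summarize_mfcc` are hypothetical, the input is random data standing in for a real MFCC matrix, the 25 × 173 shape is an arbitrary placeholder, and I have guessed that "derivative" means the frame-to-frame difference along the time axis (which is exactly the part I am unsure about). I also use plain (non-excess) kurtosis, since the quote does not say which variant is meant.

```python
import numpy as np

def standardized_moment(x, k):
    """k-th standardized moment of each row (k=3: skewness, k=4: kurtosis)."""
    centered = x - x.mean(axis=1, keepdims=True)
    std = x.std(axis=1, keepdims=True)
    return ((centered / std) ** k).mean(axis=1)

def summarize_mfcc(mfcc):
    """Collapse an (n_coeffs, n_frames) matrix into a 1-D feature vector.

    Assumption: derivatives are approximated by finite differences
    between consecutive frames (np.diff along the time axis).
    """
    d1 = np.diff(mfcc, n=1, axis=1)  # first difference over time
    d2 = np.diff(mfcc, n=2, axis=1)  # second difference over time
    stats = [
        mfcc.min(axis=1),
        mfcc.max(axis=1),
        np.median(mfcc, axis=1),
        mfcc.mean(axis=1),
        mfcc.var(axis=1),
        standardized_moment(mfcc, 3),  # skewness
        standardized_moment(mfcc, 4),  # kurtosis (non-excess)
        d1.mean(axis=1),
        d1.var(axis=1),
        d2.mean(axis=1),
        d2.var(axis=1),
    ]
    # One block per statistic, each of length n_coeffs.
    return np.concatenate(stats)

# Placeholder input: random values instead of real MFCCs.
rng = np.random.default_rng(0)
fake_mfcc = rng.standard_normal((25, 173))
features = summarize_mfcc(fake_mfcc)
print(features.shape)  # (275,)
```

This sketch yields 11 values per coefficient, so its output length is 11 × n_coeffs; it does not attempt to reproduce the 225 figure from the quote.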
Questions:
Why and when should one use derivatives of features? Is this a common thing to do?
How does one compute the derivatives of an MFCC matrix?
I'm especially interested in the second question.