The metric system
An FFT data window, or frame, is a short (~10 ms) interval of sound, which is the unit of
multitaper spectral analysis. The spectral structure of each frame is summarized by
measurements of five features: Pitch, FM, AM, Wiener entropy, and goodness of pitch.
Each of these features has different units and different statistical distributions in the
population of songs studied. To arrive at an overall score of similarity, we transformed
the units for each feature to units of statistical distances. One can transform the units of
pitch, for example, from Hz to units of standard deviation. Instead of SD we use a similar
(and sometimes better) measure of deviation called MAD (median absolute deviation
from the mean). We can then compute Euclidean distances across all features. A similar
procedure can be used to compare larger units of time, which we shall call intervals. SA+
uses two methods to estimate Euclidean distances across intervals.
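The scaling step described above can be sketched as follows. This is a minimal illustration, not SA+'s actual code: the function names, the small "population" of pitch and entropy values, and the two example frames are all hypothetical. MAD is computed here as the median of absolute deviations from the mean, following the wording in the text.

```python
import math
import statistics


def mad_from_mean(values):
    """Median absolute deviation from the mean (as worded in the text)."""
    center = statistics.mean(values)
    return statistics.median([abs(v - center) for v in values])


def frame_distance(frame_a, frame_b, mads):
    """Euclidean distance between two frames, each feature scaled by its MAD."""
    return math.sqrt(
        sum(((a - b) / m) ** 2 for a, b, m in zip(frame_a, frame_b, mads))
    )


# Hypothetical population values, used only to derive a MAD per feature.
pitch_population = [400, 500, 600, 700, 800]   # Hz
entropy_population = [-4, -3, -2, -1, 0]       # Wiener entropy (log scale)

mads = (mad_from_mean(pitch_population), mad_from_mean(entropy_population))
print(mads)  # (100, 1)

# Two hypothetical frames: (pitch, entropy)
frame_a = (500, -3)
frame_b = (700, -1)
print(round(frame_distance(frame_a, frame_b, mads), 3))  # 2.828
```

After scaling, a 200 Hz pitch difference and a 2-unit entropy difference each contribute 2 MADs, so neither feature dominates the distance merely because of its units.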
Euclidean distances across mean values: given two intervals, A and B, we first
calculate the mean feature values for each feature, and then compute Euclidean distances
across the mean features, just as we would have done for a single frame. For example,
consider two intervals of 3 frames each and (for simplicity) only a
single feature: A = [10, 20, 30]; B = [30, 20, 10]. We first average across frames, which
gives $\bar{A} = (10 + 20 + 30)/3 = 20$ and $\bar{B} = (30 + 20 + 10)/3 = 20$,
and obviously, the Euclidean distance is 0. That is, this approach
looks at the overall interval, allowing local differences to cancel each other.
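A minimal sketch of the mean-values approach, assuming each interval is a list of frames and each frame is a tuple of feature values already expressed in MAD units. The function name and data layout are illustrative assumptions, not SA+'s implementation.

```python
import math
from statistics import mean


def mean_value_distance(interval_a, interval_b):
    """Euclidean distance between the per-feature means of two intervals."""
    # zip(*interval) regroups the frames into one column per feature.
    means_a = [mean(column) for column in zip(*interval_a)]
    means_b = [mean(column) for column in zip(*interval_b)]
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(means_a, means_b)))


# The single-feature example from the text: three frames per interval.
A = [(10,), (20,), (30,)]
B = [(30,), (20,), (10,)]
print(mean_value_distance(A, B))  # 0.0
```

Both intervals average to 20, so the distance is zero even though the first and last frames differ by 20 MADs.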
Euclidean distances across time courses: given two intervals, A and B, we compute
Euclidean distances across pairs of features, A1 against B1, A2 against B2, and so forth.
We then calculate the mean Euclidean distance across all pairs. Now consider the same
example: A = [10, 20, 30]; B = [30, 20, 10]. The Euclidean distance will be
$\sqrt{(10-30)^2 + (20-20)^2 + (30-10)^2} = \sqrt{800} \approx 28.3$ MADs.
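The time-course comparison can be sketched in the same style. The text mentions averaging across frame pairs, while the worked value of 28.3 MADs matches the square root of the summed squared frame-by-frame differences. This sketch follows the worked value, and that choice is an assumption. The function name and data layout are also illustrative.

```python
import math


def time_course_distance(interval_a, interval_b):
    """Euclidean distance over all frame-by-frame feature differences."""
    if len(interval_a) != len(interval_b):
        raise ValueError("intervals must have the same number of frames")
    total = 0.0
    for frame_a, frame_b in zip(interval_a, interval_b):
        # Frame k of A is compared only with frame k of B.
        total += sum((a - b) ** 2 for a, b in zip(frame_a, frame_b))
    return math.sqrt(total)


# Same single-feature example, values in MAD units.
A = [(10,), (20,), (30,)]
B = [(30,), (20,), (10,)]
print(round(time_course_distance(A, B), 1))  # 28.3
```

Unlike the mean-values approach, the opposite trends in A and B no longer cancel: only the middle frame matches.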
As shown, when we compare mean values (as we would single frames), it is not unlikely to
obtain short, or even zero, distances. When comparing time series, however, a distance of
zero requires that the distances of all pairs are zero. Hence, when examining the
cumulative distribution of Euclidean distances in a large sample of sounds, the two
methods give different results:
[Figure: Cumulative distribution of mean values]
[Figure: Cumulative distribution of time courses]
This difference has a very practical implication when comparing songs: the time course
approach is good for detecting similarity between two sequences of features that show
similar curves of feature values. Note that moving an interval even by a single frame
changes the entire frame of comparison. By comparing all possible pairs of intervals
between two sounds, we can detect the rare pairs of intervals where the sequential match
between all (or most) frames is high. Euclidean distance across mean values achieves
exactly the opposite: dependency between neighboring intervals is high and we are
looking for high similarity between distributions regardless of short-scale differences.
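To illustrate why shifting an interval by a single frame matters, the sketch below compares every interval of a fixed width in one sound against every interval in another, using the time-course distance. The function names, the interval width, and the two short "sounds" are hypothetical. This is an illustration of the idea under those assumptions, not the SA+ similarity algorithm.

```python
import math


def time_course_distance(interval_a, interval_b):
    """Euclidean distance over all frame-by-frame feature differences."""
    total = 0.0
    for frame_a, frame_b in zip(interval_a, interval_b):
        total += sum((a - b) ** 2 for a, b in zip(frame_a, frame_b))
    return math.sqrt(total)


def all_interval_distances(sound_a, sound_b, width):
    """Distance matrix between all intervals of `width` frames in two sounds."""
    matrix = []
    for i in range(len(sound_a) - width + 1):
        row = []
        for j in range(len(sound_b) - width + 1):
            row.append(
                time_course_distance(sound_a[i:i + width], sound_b[j:j + width])
            )
        matrix.append(row)
    return matrix


# Hypothetical single-feature sounds (MAD units); sound_b is sound_a
# delayed by one frame.
sound_a = [(10,), (20,), (30,), (40,)]
sound_b = [(0,), (10,), (20,), (30,)]

matrix = all_interval_distances(sound_a, sound_b, width=3)
best = min((d, i, j) for i, row in enumerate(matrix) for j, d in enumerate(row))
print(best)  # (0.0, 0, 1)
```

Here the interval starting at frame 0 of `sound_a` matches the interval starting at frame 1 of `sound_b` exactly, while the unshifted pairing (0, 0) is about 17.3 MADs apart.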
Note: The difference between those approaches applies also to other SA modules: for
example, the syllable table is based on mean and variance feature values calculated for
each syllable, and hence all the table-based methods (DVD maps, cluster analysis) are
based on Euclidean distances across mean values. Therefore, when we identify a nice
cluster of syllables, we should not assume that similarity measurements based on the
Euclidean distances across time series will show high similarity across members of the
cluster. In fact, current findings suggest to us that birds stabilize the overall (mean)
values of syllable features at a time when the frame-to-frame feature values are dissimilar
across syllables.