Asymmetric similarity measurements
Asymmetric similarity measurements are those where sound 1 is the model (or template)
and sound 2 is the copy, and we want to judge how good the copy is in reference to the
model. For example, if a bird has copied 4 out of 5 syllables in the song playbacks it has
heard, we will say that 80% of the model was copied. However, what should we say had
the bird produced a song of 10 syllables, including accurate copies of the 5 model
syllables and 5 improvised syllables? It makes sense to state that all (100%) of the model
song was copied, but the two songs are only 50% similar to each other. To capture both
notions we will say that asymmetrically, the similarity to the model is 100%, and that
symmetrically, the similarity between the two songs is 50%. We shall start with
asymmetric comparisons:
Start SA+, open 'Example 2' and outline the entire song. Click the 'Sound 2' tab, open
'Example 2' and outline it. Make sure that the amplitude threshold is set to 37dB in both
windows. Click the 'Similarity' tab and click 'Score'. The following image should appear
within a few seconds:
The gray level of the similarity matrix represents the Euclidean distances: the shorter the
distance, the brighter the color. Intervals with feature distances that are higher than the
threshold are painted black.
Similarity sections are neighborhoods of intervals that passed the threshold (e.g., when the
corresponding p-value of the Euclidean distance is less than 5% for all neighbors). As noted,
the gray level represents the distance calculated for each pair of intervals. However, the
only role of the distance calculation across (70 ms) intervals is to set a threshold based on
'viewing' features across a reasonably long interval. The actual similarity values are
calculated frame-to-frame within the similarity section, where p-value estimates are based on
the cumulative distribution of Euclidean distances across a large sample (250,000) of random
pairs of frames obtained from comparisons across 25 random pairs of zebra finch songs:
Local (frame level) similarity scores: Based on this distribution, we can endow each
pair of frames with a local similarity score, which is simply the complement of the
Euclidean distance p-value. That is, if a single-frame p-value is 5% we say that the
similarity between the two frames is 95%. Local similarity is encoded by colors in the
similarity matrix as follows:
Score (1-p)%  |  Color
95-100        |  red
85-94         |  yellow
75-84         |  lime
65-74         |  green
50-64         |  olive
35-49         |  blue
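The mapping from a frame-level distance to a local similarity score and a display color can be sketched as follows. This is only an illustration, not the SA+ implementation: the reference sample passed in as `sorted_reference` is a hypothetical stand-in for the 250,000 distances described above, the function names are ours, and we assume that non-integer scores fall into the band whose lower bound they reach.

```python
import bisect

# Lower bound of each score band and its color, taken from the table above.
COLOR_BANDS = [(95, "red"), (85, "yellow"), (75, "lime"),
               (65, "green"), (50, "olive"), (35, "blue")]

def local_similarity(distance, sorted_reference):
    """Return the local similarity (1 - p) in percent.

    p is the fraction of reference distances that are <= the observed
    distance, i.e. the empirical cumulative distribution at `distance`.
    `sorted_reference` must be sorted in ascending order.
    """
    n = len(sorted_reference)
    rank = bisect.bisect_right(sorted_reference, distance)
    return 100.0 * (n - rank) / n

def score_color(score):
    """Return the color name for a local score, or None below 35%."""
    for lower_bound, color in COLOR_BANDS:
        if score >= lower_bound:
            return color
    return None  # the table lists no color for scores under 35%

# Hypothetical reference sample: 100 evenly spread distances.
reference = list(range(1, 101))
print(local_similarity(5, reference))               # 95.0
print(score_color(local_similarity(5, reference)))  # red
```

With a real reference sample, only the contents of `sorted_reference` would change; the score remains the complement of the empirical p-value.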


Section-level similarity score: We now turn to the problem of estimating the overall
similarity captured by each section. First, SA+ detects the boundaries of each section.
Then, single-frame scores are calculated for each pixel. Finally, SA+ searches for the best
'oblique cut' through the section, that is, the cut that maximizes the score. In the simplest
case (e.g., of two identical sounds) similarity is maximized along a 45° angle at the center
of the section. In practice, it is not always the center of the section that gives the
highest similarity, and the angle might deviate from 45° if one of the sounds is time warped
in reference to the other. We therefore need to search across different displacements and at
different angles. The 'time warping tolerance' is set to 5% by default, allowing up to 5%
angular deviation from the diagonal. Note that computation time increases exponentially
with the tolerance. The search for best match is illustrated below:


We now consider only the frames that are on the best-matching diagonal, and calculate
the average score of the section. This score is plotted above the section. Boundaries of
similarity sections can be observed more clearly by clicking the global 'combo' button:
The light blue lines show the boundaries of each section and the rectangles enclose the
best diagonal match of each section.
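The search for the best oblique cut and the resulting section score can be sketched as a brute-force scan over displacements and slopes. This is a minimal illustration under our own assumptions, not the SA+ algorithm: we read the 5% tolerance as a slope between 0.95 and 1.05, we require a cut to cover at least half of the shorter side of the section (the hypothetical `min_frames` parameter) so that a single bright pixel cannot win, and all names are ours.

```python
def diagonal_cells(scores, offset, slope):
    """Collect the scores along the line j = offset + slope * i."""
    rows, cols = len(scores), len(scores[0])
    cells = []
    for i in range(rows):
        j = round(offset + slope * i)
        if 0 <= j < cols:  # skip the part of the line outside the section
            cells.append(scores[i][j])
    return cells

def best_oblique_cut(scores, tolerance=0.05, slope_steps=5, min_frames=None):
    """Return (mean score, offset, slope) of the best-scoring cut.

    `scores[i][j]` is the local similarity between frame i of sound 1
    and frame j of sound 2 within one section. `slope_steps` must be
    at least 2.
    """
    rows, cols = len(scores), len(scores[0])
    if min_frames is None:
        min_frames = max(1, min(rows, cols) // 2)  # assumption, see above
    # Evenly spaced slopes from 1 - tolerance to 1 + tolerance.
    slopes = [1.0 + tolerance * (2 * k / (slope_steps - 1) - 1)
              for k in range(slope_steps)]
    best = (0.0, 0, 1.0)
    for offset in range(-(rows - 1), cols):
        for slope in slopes:
            cells = diagonal_cells(scores, offset, slope)
            if len(cells) < min_frames:
                continue
            mean = sum(cells) / len(cells)
            if mean > best[0]:
                best = (mean, offset, slope)
    return best

# A 6 x 8 section whose best match is displaced by two frames.
section = [[100 if j == i + 2 else 10 for j in range(8)] for i in range(6)]
mean, offset, slope = best_oblique_cut(section)
print(mean, offset)  # 100.0 2
```

The mean returned for the winning cut plays the role of the section score described above.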
Similarity across sections: Note that there are several sections with overlapping
projections on both songs. To obtain a unique similarity estimate, SA+ must eliminate
redundancy by trimming (or omitting) sections that overlap with sections that explain
more similarity. We call the former 'inferior sections' (blue rectangles) and the latter
'superior sections' (red rectangle).
Final sections: once redundancy has been trimmed, it often makes sense to perform one
final filtering, by omitting similarity sections that explain very little similarity (which are
likely to be 'noise'). By default, SA+ omits sections that explain less than the equivalent
of 10ms x 100% similarity. Superior similarity sections that passed this final stage are
called final sections.
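The final filtering step can be sketched as below. We assume, as our own reading of 'the equivalent of 10ms x 100% similarity', that the similarity explained by a section is its duration in milliseconds multiplied by its mean score in percent; the `Section` class and its field names are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class Section:
    duration_ms: float  # length of the section along sound 1
    mean_score: float   # section-level similarity score, in percent

    @property
    def explained(self):
        """Similarity explained, in ms x percent (our assumed measure)."""
        return self.duration_ms * self.mean_score

# Default cutoff: the equivalent of 10 ms at 100% similarity.
MIN_EXPLAINED = 10 * 100

def final_sections(superior_sections, min_explained=MIN_EXPLAINED):
    """Keep only the superior sections that explain enough similarity."""
    return [s for s in superior_sections if s.explained >= min_explained]

candidates = [Section(50, 80), Section(12, 60), Section(10, 100)]
kept = final_sections(candidates)
print([s.duration_ms for s in kept])  # [50, 10]
```

Under this reading, a 12 ms section at 60% is dropped because it explains less than a 10 ms section at 100% would.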
The overall similarity score is a product of 3 components: % similarity, mean accuracy
and sequential match. You can eliminate each component from the overall assessment by
un-checking it.
% similarity is the percentage of the tutor's sounds included in final sections. Note
that the p-value used to detect sections is computed across intervals of 70 ms.
This similarity estimate is asymmetric and it bears no relation to the local
similarity score we discussed above.
Mean accuracy is the average of the local similarity scores across final sections.
To estimate a combined score, we simply multiply the accuracy by the %
similarity. For example, if we have 60% similarity and 70% accuracy (in the
similar parts), the total score will be 42%.
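The arithmetic of the combined score, with both inputs given in percent, can be written out as follows (the function name is ours):

```python
def combined_score(percent_similarity, mean_accuracy):
    """Multiply % similarity by mean accuracy; all values are in percent."""
    return percent_similarity * mean_accuracy / 100.0

print(combined_score(60, 70))  # 42.0
```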
Note: some people do not like to report combined scores because lower numbers are
often judged as 'weak'. Presenting both significant similarity and combined scores is not a
bad idea.
Sequential match is calculated by sorting the final sections according to their temporal
order in reference to sound 1, and then examining their corresponding order in sound 2.
We say that two sections S_i and S_(i+1) are sequential if the beginning of S_(i+1) in sound 2
occurred between 0-80ms after the end of S_i. This tolerance level accounts for the
duration of stops and also for possible filter effect of very short sections that are not
sequential. This procedure is repeated for all the consecutive pairs of sections on sound 1
and the overall sequential match is estimated as:
.
Note that multiplying by 2 is offsetting the effect of adding only one (the smallest) of two
sections in the numerator. This definition is used for asymmetric scoring, whereas for
symmetric scoring the sequential match is simply the ratio between the two outlined
intervals on sound 1 and sound 2, namely:

Weighting the sequential match into the overall score: In the case of symmetric scoring
only the sequentially matching parts of the two sounds can be considered, so it makes
sense to multiply the sequential match by the combined score. In the case of time-series
comparison, it does not make sense to multiply the numbers, because this will mean that
we give 100% weight to sections that are sequential, and 0% weight to those that are not.
Therefore, you have to decide what weight should be given to non-sequential sections.
The problem is that sequential mismatches might have different meanings. For example,
in extreme cases of 'low contrast' similarity matrices (with lots of gray areas) the
sequence might be the only similarity measure that captures meaningful differences, but
when several similar sounds are present in the compared intervals, it might be sheer luck
whether SA+ sorts them out sequentially or not. In short, we cannot advise you what to do
about it, and the default setting of 50% weight is arbitrary.