Asymmetric comparisons


Asymmetric similarity measurements


Asymmetric similarity measurements are those where sound 1 is the model (or template) and sound 2 is the copy, and we want to judge how good the copy is in reference to the model. For example, if a bird has copied 4 out of 5 syllables in the song playbacks it has heard, we will say that 80% of the model was copied. However, what should we say had the bird produced a song of 10 syllables, including accurate copies of the 5 model syllables and 5 improvised syllables?  It makes sense to state that all (100%) of the model song was copied, but the two songs are only 50% similar to each other. To capture both notions we will say that asymmetrically, the similarity to the model is 100%, and that symmetrically, the similarity between the two songs is 50%. We shall start with asymmetric comparisons:

Start SA+, open 'Example 2' and outline the entire song. Click the 'Sound 2' tab, open 'Example 2' and outline it. Make sure that the amplitude threshold is set to 37dB in both windows. Click the 'Similarity' tab and click 'Score'. The following image should appear within a few seconds:  
[graphic: similarity matrix]

The gray level of the similarity matrix represents the Euclidean distances: the shorter the distance, the brighter the color. Intervals with feature distances higher than the threshold are painted black.

Similarity sections are neighborhoods of intervals that passed the threshold (e.g., when the corresponding p-value of the Euclidean distance is less than 5% for all neighbors). As noted, the gray level represents the distance calculated for each pair of intervals. However, the only role of the distance calculation across (70ms) intervals is to set a threshold based on 'viewing' features across a reasonably long interval. The actual similarity values are calculated frame-to-frame within the similarity section, where p-value estimates are based on the cumulative distribution of Euclidean distances across a large sample (250,000) of random pairs of frames obtained from comparisons across 25 random pairs of zebra finch songs.
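As a rough illustration of how a p-value can be read off a cumulative distribution of distances, here is a minimal Python sketch. The function names and the tiny reference sample are hypothetical and are not taken from SA+; the real reference would be the large sample of frame-pair distances mentioned in the text.

```python
import bisect


def build_reference(distances):
    """Sort the reference distances once so that each lookup is a binary search."""
    return sorted(distances)


def p_value(distance, reference_sorted):
    """Fraction of reference distances that are less than or equal to `distance`."""
    rank = bisect.bisect_right(reference_sorted, distance)
    return rank / len(reference_sorted)


# Hypothetical stand-in for the sample of distances between random pairs of frames.
reference = build_reference([4.0, 1.0, 3.0, 2.0])

# Two of the four reference distances are <= 2.5, so the empirical p-value is 0.5.
print(p_value(2.5, reference))
```

A small distance between two frames therefore yields a small p-value, meaning that few random pairs of frames are that close to each other.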






Local (frame level) similarity scores: Based on this distribution, we can endow each pair of frames with a local similarity score, which is simply the complement of the Euclidean distance p-value. That is, if a single-frame p-value is 5% we say that the similarity between the two frames is 95%. Local similarity is encoded by colors in the similarity matrix as follows:  
Score (1-p)%    Color
95-100          red
85-94           yellow
75-84           lime
65-74           green
50-64           olive
35-49           blue

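The mapping from a frame-level p-value to a local score and a display color can be sketched as follows. This is only an illustration of the table: the function names are hypothetical, and the handling of fractional scores (such as 94.5) and of scores below 35 is an assumption, since the table lists whole-number bands only.

```python
# Lower bound of each band (in percent) and its color, highest band first.
COLOR_BANDS = [
    (95, "red"),
    (85, "yellow"),
    (75, "lime"),
    (65, "green"),
    (50, "olive"),
    (35, "blue"),
]


def local_score(p_value_percent):
    """Local similarity is the complement of the distance p-value, in percent."""
    return 100.0 - p_value_percent


def score_color(score_percent):
    """Return the color of the first band whose lower bound the score reaches."""
    for lower_bound, color in COLOR_BANDS:
        if score_percent >= lower_bound:
            return color
    return None  # Assumption: scores below 35 are not color-coded.


# A frame pair with a p-value of 5% has a local score of 95% and is drawn in red.
print(score_color(local_score(5.0)))
```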
Section-level similarity score: We now turn to the problem of estimating the overall similarity captured by each section. First, SA+ detects the boundaries of each section. Then, single-frame scores are calculated for each pixel, and finally SA+ searches for the best 'oblique cut' through the section, the one that maximizes the score. In the simplest case (e.g., of two identical sounds) similarity is maximized along a 45° angle at the center of the section. In practice, it is not always the center of the section that gives the highest similarity, and the angle might deviate from 45° if one of the sounds is time warped in reference to the other. We therefore need to search across different displacements and at different angles. The 'time warping tolerance' is set to 5% by default, allowing up to 5% angular deviation from the diagonal. Note that computation time increases exponentially with the tolerance. The search for the best match is illustrated below:

[graphic: search for the best-matching diagonal]

We now consider only the frames that are on the best-matching diagonal, and calculate the average score of the section. This score is plotted above the section. Boundaries of similarity sections can be observed more clearly by clicking the global 'combo' button:

 
The light blue lines show the boundaries of each section, and the rectangles enclose the best diagonal match of each section.
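The search for the best oblique cut through a section can be pictured with the following Python sketch. It is not the SA+ implementation: it assumes the section is given as a matrix of frame-level scores (rows for sound 1, columns for sound 2), it treats the 5% tolerance as a ±5% range of slopes around the 45° diagonal, and it ignores cuts that cross too few frames. The function names and the `slope_steps` and `min_fraction` parameters are hypothetical.

```python
def diagonal_mean(scores, offset, slope, min_frames):
    """Mean score along the line col = offset + slope * row, or None if the line is too short."""
    n_cols = len(scores[0])
    values = []
    for row, row_scores in enumerate(scores):
        col = round(offset + slope * row)
        if 0 <= col < n_cols:
            values.append(row_scores[col])
    if len(values) < min_frames:
        return None
    return sum(values) / len(values)


def best_diagonal(scores, tolerance=0.05, slope_steps=5, min_fraction=0.5):
    """Try every displacement and a few slopes near 1.0; return (mean, offset, slope)."""
    n_rows, n_cols = len(scores), len(scores[0])
    min_frames = max(1, int(min_fraction * min(n_rows, n_cols)))
    # Evenly spaced slopes from 1 - tolerance to 1 + tolerance (slope_steps must be >= 2).
    slopes = [1.0 + tolerance * (2.0 * k / (slope_steps - 1) - 1.0) for k in range(slope_steps)]
    best = None
    for offset in range(-(n_rows - 1), n_cols):
        for slope in slopes:
            mean = diagonal_mean(scores, offset, slope, min_frames)
            if mean is not None and (best is None or mean > best[0]):
                best = (mean, offset, slope)
    return best


# Toy section: high scores lie on a diagonal displaced by two frames in sound 2.
shifted = [[100 if col == row + 2 else 10 for col in range(8)] for row in range(6)]
mean, offset, slope = best_diagonal(shifted)
print(mean, offset)
```

In this toy case the search recovers the displaced diagonal, and the mean of the scores along it plays the role of the section score.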

Similarity across sections: Note that there are several sections with overlapping projections on both songs. To obtain a unique similarity estimate, SA+ must eliminate redundancy by trimming (or omitting) sections that overlap with sections that explain more similarity. We call the former 'inferior sections' (blue rectangles) and the latter (red rectangle) 'superior sections'.  







Final sections: Once redundancy has been trimmed, it often makes sense to perform one final filtering step, omitting similarity sections that explain very little similarity (and are likely to be 'noise'). By default, SA+ omits sections that explain less than the equivalent of 10ms x 100% similarity. Superior similarity sections that pass this final stage are called final sections.
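One way to read the default cutoff is sketched below in Python. It assumes that the similarity 'explained' by a section is its duration multiplied by its mean score, so that a 10ms section at 100% and a 20ms section at 50% explain the same amount. That reading, the function names, and the dictionary keys are assumptions for illustration, not the documented SA+ computation.

```python
def explained_ms(duration_ms, mean_score_percent):
    """Similarity explained by a section, expressed as milliseconds at 100% similarity."""
    return duration_ms * mean_score_percent / 100.0


def final_sections(sections, min_explained_ms=10.0):
    """Keep only the sections that explain at least the cutoff amount of similarity."""
    return [
        section
        for section in sections
        if explained_ms(section["duration_ms"], section["score"]) >= min_explained_ms
    ]


# Hypothetical superior sections left after trimming.
candidates = [
    {"duration_ms": 50, "score": 80},  # explains 40 ms, kept
    {"duration_ms": 12, "score": 60},  # explains 7.2 ms, omitted
]
kept = final_sections(candidates)
print(len(kept))
```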


The overall similarity score is a product of 3 components: % similarity, mean accuracy and sequential match. You can eliminate each component from the overall assessment by un-checking it.
% similarity is the percentage of the tutor's sounds included in final sections. Note that the p-value used to detect sections is computed across intervals of 70ms. This similarity estimate is asymmetric, and it bears no relation to the local similarity score discussed above.

Mean accuracy is the average of the local similarity scores across final sections.
To estimate a combined score, we simply multiply the accuracy by the % similarity. For example if we have 60% similarity and 70% accuracy (in the similar parts) the total score will be 42%.
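The arithmetic can be written out as a short Python sketch. The `percent_similarity` helper assumes that final sections are given as non-overlapping (start, end) intervals in milliseconds on sound 1, and that the outlined duration of the model is known. Both function names are hypothetical.

```python
def percent_similarity(final_intervals_ms, model_duration_ms):
    """Percentage of the model (sound 1) covered by final sections."""
    covered = sum(end - start for start, end in final_intervals_ms)
    return 100.0 * covered / model_duration_ms


def combined_score(percent_similarity_value, mean_accuracy):
    """Product of % similarity and mean accuracy, both given in percent."""
    return percent_similarity_value * mean_accuracy / 100.0


# Two final sections covering 600 ms of a 1000 ms model give 60% similarity.
similarity = percent_similarity([(0, 300), (400, 700)], 1000)

# With 70% accuracy in the similar parts, the combined score is 42%.
print(combined_score(similarity, 70.0))
```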

Note: some people do not like to report combined scores because lower numbers are often judged as 'weak'. Presenting both significant similarity and combined scores is not a bad idea.

Sequential match is calculated by sorting the final sections according to their temporal order in reference to sound 1, and then examining their corresponding order in sound 2. We say that two sections S_i and S_{i+1} are sequential if the beginning of S_{i+1} in sound 2 occurs between 0 and 80ms after the end of S_i. This tolerance level accounts for the duration of stops and also for the possible filtering out of very short sections that are not sequential. This procedure is repeated for all the consecutive pairs of sections on sound 1, and the overall sequential match is estimated as:

[graphic: formula for the overall sequential match]

Note that the factor of 2 offsets the effect of adding only one (the smaller) of the two sections in the numerator. This definition is used for asymmetric scoring, whereas for symmetric scoring the sequential match is simply the ratio between the two outlined intervals on sound 1 and sound 2, namely:

[graphic: formula for the symmetric sequential match]
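The pairwise test for sequential sections (a gap of 0 to 80ms in sound 2 between consecutive sections of sound 1) can be sketched in Python as follows. The sketch covers only this test and does not attempt the aggregation into an overall sequential match. The dictionary keys and the function name are hypothetical.

```python
def sequential_flags(sections, max_gap_ms=80.0):
    """For each consecutive pair of sections (ordered on sound 1), test the gap on sound 2."""
    ordered = sorted(sections, key=lambda section: section["start1"])
    flags = []
    for current, following in zip(ordered, ordered[1:]):
        gap = following["start2"] - current["end2"]
        flags.append(0.0 <= gap <= max_gap_ms)
    return flags


# Hypothetical final sections; times are in milliseconds on sound 1 and sound 2.
sections = [
    {"start1": 120, "end1": 200, "start2": 130, "end2": 210},
    {"start1": 0, "end1": 100, "start2": 0, "end2": 100},
    {"start1": 220, "end1": 300, "start2": 500, "end2": 580},
]

# First pair: gap of 30 ms (sequential). Second pair: gap of 290 ms (not sequential).
print(sequential_flags(sections))
```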

Weighting the sequential match into the overall score: In the case of symmetric scoring, only the sequentially matching parts of the two sounds can be considered, so it makes sense to multiply the sequential match by the combined score. In the case of time-series comparison, it does not make sense to multiply the numbers, because this would mean giving 100% weight to sections that are sequential and 0% weight to those that are not. Therefore, you have to decide what weight should be given to non-sequential sections. The problem is that sequential mismatches might have different meanings. For example, in extreme cases of 'low contrast' similarity matrices (with lots of gray areas) the sequence might be the only similarity measure that captures meaningful differences, but when several similar sounds are present in the compared intervals, it might be sheer luck whether SA+ sorts them out sequentially or not. In short, we cannot advise you what to do about it, and the default setting of 50% weight is arbitrary.
  
