Interpreting similarity scores
Scoring the similarity between two sounds as described above might work well in some cases and less well in others; it should be used carefully and wisely. The
The
outcome of similarity scoring depends heavily on appropriate scaling of the features to
units of median absolute deviation from the average in the ‘population’. The next chapter
explains how to scale features and when new feature scaling should be considered. A
related factor is feature weight: by default, the five features are assumed to be equally important. This assumption has not been tested formally, but empirically, giving equal weight to the five features works well for scoring song similarity across adult zebra finches.
Each feature has different strengths and weaknesses, and together they complement each other. The feature most likely to cause you trouble is pitch: pitch is sometimes difficult to calculate, and an unstable pitch estimate might bias the scoring procedure.
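To make the scaling and weighting concrete, here is a minimal sketch in Python. It is not the software's actual implementation: the feature names, the reference ("population") samples, and the Euclidean combination are illustrative assumptions. Following the description above, each feature value is expressed as its deviation from the population average, in units of the median absolute deviation from that average, and the five scaled features are then combined with equal weights:

    import numpy as np

    def mad_units(x, population):
        # Express x as its deviation from the population average, in units
        # of the median absolute deviation from that average. The population
        # should be a reference sample of this feature (e.g., measurements
        # across many adult zebra finches).
        center = np.mean(population)
        mad = np.median(np.abs(population - center))
        return (x - center) / mad

    def feature_distance(sound_a, sound_b, populations, weights=None):
        # Equal-weight Euclidean distance across five scaled features.
        # sound_a / sound_b map feature names to measured values;
        # populations maps feature names to reference samples (arrays).
        # The feature names below are placeholders for illustration.
        features = ["pitch", "FM", "AM", "entropy", "goodness_of_pitch"]
        if weights is None:
            weights = {f: 1.0 for f in features}  # default: equal weights
        d2 = 0.0
        for f in features:
            a = mad_units(sound_a[f], populations[f])
            b = mad_units(sound_b[f], populations[f])
            d2 += weights[f] * (a - b) ** 2
        return float(np.sqrt(d2))

Note what happens when the reference sample is unrepresentative (say, zebra finch statistics applied to monkey sounds): the scale denominators will be off, some features will dominate the distance, and the effective weights will no longer be equal. This is exactly the normalization bias discussed below.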
Reading about the complexities involved in calculating similarity scores, you might wonder about the consequences of improper use of these methods. Compared to the human-observer scoring method, the automated approach has pros and cons. No doubt, the judgment of any human observer is preferred over automated methods. The main advantage of automated methods, however, is that they provide well-defined metrics for distances between sounds and can quantify subtle differences. Statistically, you should handle automated similarity scores just as you would handle human scores, except that you might consider using parametric methods (if the distribution of scores appears to be normal); this is not a big issue.
If, at the end of the day, all you care about is whether two groups of animals differ in their sounds, it might not matter much how the scores were calculated or under what assumptions. For example, if you apply the zebra finch feature scale to monkey sounds and find strong differences in similarity scores between two groups of animals using some non-parametric estimate of the scores, the difference is real despite the strong biases introduced by the wrong normalization. However, a wrong normalization means that some features might get higher weight than others in the overall estimate. You do not want to use a wrong normalization, since it can reduce the sensitivity and reliability of the scoring method, making it less likely that significant differences will be found. Further, if you did find an effect, it is most likely due to a single feature that unintentionally received high weight and biased the score. Overall, in most cases you will want to use the scoring method that maximizes the difference between your groups. The actual p-value used as a threshold is just a yardstick; it has nothing to do with statistical significance per se.
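As a concrete illustration of this statistical point, here is a minimal sketch that assumes two arrays of similarity scores, one per group; the decision rule and the alpha level are illustrative assumptions, not a recommendation from this chapter. It checks both samples for approximate normality and falls back to a non-parametric test otherwise:

    from scipy import stats

    def compare_groups(scores_a, scores_b, alpha=0.05):
        # Treat automated similarity scores like human scores: if both
        # samples look roughly normal (Shapiro-Wilk), a parametric t-test
        # is an option; otherwise use the non-parametric Mann-Whitney U.
        normal_a = stats.shapiro(scores_a).pvalue > alpha
        normal_b = stats.shapiro(scores_b).pvalue > alpha
        if normal_a and normal_b:
            return stats.ttest_ind(scores_a, scores_b)
        return stats.mannwhitneyu(scores_a, scores_b)

Either branch yields a p-value, but as noted above, the threshold you pick is a yardstick for comparing groups and scoring methods, not a guarantee that the measured difference is meaningful in itself.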