Segmentation to syllable units

One of the most important decisions when analyzing vocal sounds is choosing methods and setting parameters for distinguishing vocalization from background noise. At the macro scale, this allows you to detect bouts of vocalization; at shorter time scales, it allows you to segment the sound into syllables (vocal events with short stops between them) and silences.

Even though some analysis can be done on the continuous signal in a file, much more becomes possible once vocal events are identified and segmented, e.g. identifying units of vocalization, classifying sounds, and comparing them by similarity measurements and clustering methods.

In this chapter we focus on non-real-time analysis, but similar approaches to identifying vocal sounds are also used in real-time analysis during recording. In real time, however, we usually need intermediate steps that give way to higher-priority processes (the recording itself): the Sound Analysis Recorder first makes a crude decision about which sounds should be temporarily saved, and a few seconds later the live-analysis engine performs proper segmentation and decides which sound files should be processed and permanently saved to specific folders.

In SAP2, detection of animal sounds is based primarily on the amplitude envelope. However, certain spectral filters can be set to reject noise or band-limit the amplitude detection. We offer the following approaches:
  1. Use a fixed amplitude threshold to segment sounds
  2. Use a dynamic (adaptive) amplitude threshold to segment sounds
  3. Write your own query for custom segmentation based on various features
  4. Export raw feature vectors to Matlab and design your own algorithm there
In this chapter we cover only approaches 1 and 2. Approach 3 is documented in the batch chapter, and approach 4 in the chapter on exporting data.

Using a fixed amplitude threshold to segment sounds - One of the simplest and most widely used methods for segmenting sounds is applying a fixed amplitude threshold:
Open Explore & Score and ensure that “fine segmentation” is turned off (see Fig 1 below)

img_002
Fig 1: Fine Segmentation "off"


Open your sound file or use Example1 (found in the sap directory) and then move the amplitude threshold slider (the one closest to the frequency axis) up to about 43 dB:

img_004
Fig 2: Amplitude Threshold Slider


The yellow curve shows the amplitude, and the straight yellow line is the threshold. Amplitude is shown only when above threshold. Syllable units are underlined by a light blue color below them, and bouts are underlined by a red color.
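The fixed-threshold rule can be illustrated with a short sketch. This is not SAP2's implementation, just a minimal illustration: given an amplitude envelope in dB (one value per analysis frame), a syllable is any contiguous run of frames above the threshold.

```python
def segment_fixed_threshold(amplitude_db, threshold_db=43.0):
    """Return (start, stop) frame-index pairs for contiguous runs where
    the amplitude envelope exceeds a fixed threshold (in dB)."""
    segments, start = [], None
    for i, a in enumerate(amplitude_db):
        if a > threshold_db and start is None:
            start = i                      # rising edge: syllable onset
        elif a <= threshold_db and start is not None:
            segments.append((start, i))    # falling edge: syllable offset
            start = None
    if start is not None:                  # envelope ends above threshold
        segments.append((start, len(amplitude_db)))
    return segments

# Toy envelope (dB): silence, one syllable, silence, a second syllable
env = [20, 21, 50, 55, 52, 22, 19, 48, 49]
print(segment_fixed_threshold(env, 43.0))  # -> [(2, 5), (7, 9)]
```

The silences between detected runs correspond to the gaps between the light blue syllable outlines in the display.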

Note the segmentation outlines at the bottom of the sounds:

img_006
Fig 3: Segmentation outlines

Additional constraints on segmentation can be set, so as to reject some sources of noise. Here is an example:
Set the “advance window” slider to 2 ms, and set the amplitude threshold to 30 dB. Open Example3:

img_007img_010
Fig 4: Frequency of syllables


As shown, the last 3 ‘syllables’ are actually low-frequency cage noise. Move the mouse to just above the noise level while observing the frequency value in the Features at Pointer panel (see red arrow). As shown, most of the noise is below 1500 Hz, whereas most of the power of the syllables is above that range.
We are not going to filter out those low frequencies. Instead, we will use this threshold to make a distinction between cage noise and song syllables: Click the “Options & Settings” tab. Turn the high-pass noise detector on and change the frequency to 1500 Hz:

img_012
Fig 5: High Pass - Noise Detector

Go back to sound 1, and click update display below the sonogram image:

img_014
Fig 6: Noise - No longer detected


Note that most of the noise is no longer detected as vocal sound:

img_016
Fig 7: Noise isolated from vocal sounds


This filter does not affect any analysis of the remaining vocal sounds. This is because we set the noise detector filter as an additional criterion (on top of the amplitude threshold) to eliminate ‘syllables’ where more than 90% of the energy is in the noise range.
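The noise-detector criterion can be sketched as follows. This is an illustrative reimplementation, not SAP2's code: for each candidate syllable we compute the fraction of spectral power below the high-pass cutoff, and reject the candidate only when that fraction exceeds 90%; the spectra of accepted syllables are left untouched.

```python
def is_cage_noise(power, freqs, cutoff_hz=1500.0, energy_fraction=0.9):
    """Flag a candidate syllable as noise when more than `energy_fraction`
    of its spectral power lies below `cutoff_hz`.
    `power` and `freqs` are parallel lists (one value per FFT bin)."""
    total = sum(power)
    if total == 0:
        return False
    low = sum(p for p, f in zip(power, freqs) if f < cutoff_hz)
    return low / total > energy_fraction

# Toy spectra over bins at 500, 1000, 2000, 4000 Hz
freqs = [500, 1000, 2000, 4000]
print(is_cage_noise([10, 9, 1, 0], freqs))  # mostly below 1500 Hz -> True
print(is_cage_noise([1, 1, 10, 8], freqs))  # song energy above 1500 Hz -> False
```

Because rejection is all-or-nothing per candidate, a true syllable with some low-frequency energy still keeps all of its frequency content for feature calculation.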
There are several other controls that affect segmentation indirectly. Those include the FFT window, advance window, the band-pass filters on feature calculation, etc.
Here is an example of using the band-pass filter: turn the noise detector off and update the display so that the noise is once again detected as vocal sound. Then move the right sliders as shown:
img_018
Fig 8: Setting the band-pass filter sliders

Now click update display:

img_020
Fig 9: Noise below the detection band no longer detected

The outlines under the noise that is below the detection band should disappear. Note, however, that now all features for all syllables are computed only within the band-pass filter that you set. Namely, frequencies outside the band are ignored across the board.
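The difference between the two mechanisms is worth making concrete. In this illustrative sketch (not SAP2's code), the band-pass filter zeroes out-of-band spectral bins before any feature is computed, so a power-weighted feature such as mean frequency changes for every syllable, noise or not:

```python
def band_limit(power, freqs, low_hz, high_hz):
    """Zero spectral bins outside [low_hz, high_hz], so that every
    downstream feature ignores out-of-band energy."""
    return [p if low_hz <= f <= high_hz else 0.0
            for p, f in zip(power, freqs)]

def mean_frequency(power, freqs):
    """Power-weighted mean frequency, a typical spectral feature."""
    total = sum(power)
    return sum(p * f for p, f in zip(power, freqs)) / total if total else 0.0

freqs = [500, 1000, 2000, 4000]
power = [8.0, 2.0, 5.0, 5.0]      # low-frequency noise plus song energy
print(mean_frequency(power, freqs))                                  # -> 1800.0
print(mean_frequency(band_limit(power, freqs, 1500, 8000), freqs))   # -> 3000.0
```

This is why the noise detector is usually the safer choice when you only want to reject noise ‘syllables’: it leaves accepted syllables' features intact, whereas the band-pass filter alters feature values across the board.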
Using a dynamic (adaptive) amplitude threshold to segment sounds - One limitation of a static amplitude threshold is that when an animal vocalizes, the “baseline” power often changes as vocalization becomes more intense. For example, open the file “thrush nightingale example 1” with a 3 ms advance window and 0 amplitude threshold. Let’s observe the amplitude envelope of this nightingale song sonogram:

img_022
Fig 10: Amplitude envelope of the nightingale song

And let’s also look at the spectral derivatives, and a certain threshold indicated by the black line:

img_026img_028
Fig 11: Spectral derivatives, with a fixed threshold indicated by the black line


It is easy to see that no fixed threshold can work in this case (see arrows). To address this, turn “fine segmentation” on. A new slider - called Diff - should appear between the amplitude threshold slider and the display contrast slider. Set it to zero (all the way up). In the fine segmentation box (bottom left of the SAP2 window) set the coarse filter to 500 and the fine filter to 0, then update the display and click filters:

img_030
Fig 12: White curve - coarse amplitude filter; black line - fine filter; and segmentation


The white curve shows the coarse amplitude filter, which is the dynamic (adaptive) threshold. The black line is the fine filter, which in this case is the same as the amplitude. The segmentation is set by the gap between them: diff = 0 means that we segment when the black line touches the white line, namely vocal sound is detected when the fine filter is higher than the coarse filter.
We can see that all syllables are now detected and segmented, but there are two problems:
  1. The diff detects low-amplitude sounds, but it also falsely detects small changes in background noise as sounds (look at the beginning of the file).
  2. Segmentation into syllables is often too sensitive and unreliable, because each small modulation of amplitude may cause segmentation.
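The coarse/fine comparison above can be sketched as follows. This is a minimal illustration, assuming the coarse and fine filters are moving averages of the amplitude envelope; the window widths and dB values are illustrative stand-ins for SAP2's filter settings, not its actual implementation.

```python
def smooth(x, window):
    """Simple moving average; window <= 1 leaves the signal unchanged
    (a stand-in for SAP2's coarse and fine amplitude filters)."""
    if window <= 1:
        return list(x)
    half = window // 2
    return [sum(x[max(0, i - half):i + half + 1]) /
            len(x[max(0, i - half):i + half + 1]) for i in range(len(x))]

def adaptive_detect(amplitude_db, coarse_win=51, fine_win=3,
                    diff_db=-1.5, floor_db=24.0):
    """Mark a frame as vocal sound when the fine-filtered envelope exceeds
    the coarse-filtered one by more than `diff_db`; a fixed floor rejects
    low-level fluctuations in the background noise."""
    coarse = smooth(amplitude_db, coarse_win)   # dynamic threshold
    fine = smooth(amplitude_db, fine_win)
    return [f - c > diff_db and f > floor_db
            for f, c in zip(fine, coarse)]

# Toy envelope (dB): quiet background with one loud burst
env = [20] * 10 + [60] * 5 + [20] * 10
flags = adaptive_detect(env)
print([i for i, v in enumerate(flags) if v])  # frames flagged as vocal sound
```

Because the coarse filter tracks the local baseline, the same diff setting detects both quiet and loud syllables, which a single fixed threshold cannot do in this song.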
A simple way of avoiding false detection of silences is to impose a minimal fixed amplitude threshold on top of the filters. To do this, set the dB threshold to 24:

img_030
Fig 13: Silences no longer detected as sounds

As shown, silences are no longer detected as sounds.
To decrease the sensitivity of segmentation we can use two methods. One is to make the diff more liberal - allowing the detection of sounds even when the fine filter is slightly below the coarse one. Setting the diff to -2.5 gives this result:

img_034
Fig 14: Setting the diff filter to -2.5

An often better approach is to set the fine filter a bit coarser. For example, setting the fine filter to 5, keeping the coarse filter at 500, and setting the diff slider to -1.5 gives this segmentation:
img_036
Fig 15: Sound with the fine filter set to a coarser setting

As shown, we have achieved a rather reliable segmentation despite the wide range of amplitudes in this song.
