Cochlear Implant Atlas
CI Atlas · Speech-Coding Strategies: The Complete Lineage · Module 05

5 · Tracking the Voice: F0/F2 Formant Extraction

Before spectral-maxima strategies, the Melbourne/Nucleus team tried to explicitly extract the perceptually important features of speech. The first feature-extraction processor tracked the fundamental frequency F0 and the second formant F2 using zero-crossing detectors — a fundamentally different philosophy from waveform strategies.

The feature-extraction philosophy

Feature-extraction strategies attempt to estimate and transmit specific perceptually important speech parameters (formants, voicing, pitch) rather than the raw band waveform or envelope. Formants are the resonant peaks of the vocal tract; F0 is the voice pitch and F2 is the second formant, both central to vowel and voicing perception. This approach assumes a model of speech production and perception, unlike the assumption-free CIS. The F0/F2 strategy was implemented on the early-1980s Nucleus implant from Cochlear Corporation / University of Melbourne, a 22-24 electrode device.[2000][1999]

8%: F0F2 open-set NU-6 monosyllabic word recognition (n=5) [1990]
31%: F0F2 sentences without context (vs 64% with F1 added) [1990]

The F0/F2 signal chain

The chain runs from the microphone to automatic gain control (AGC) and then splits into two paths. A 270 Hz low-pass filter feeds a zero-crossing detector that estimates F0, which sets the stimulation (pulse) rate. A 1000-4000 Hz band-pass filter feeds a second zero-crossing detector that estimates F2, which selects the electrode to stimulate. Voicing is conveyed via the zero-crossing rate: voiced segments are stimulated at F0 pulses per second, and unvoiced segments at quasi-random intervals averaging about 100 pulses per second.[1987][2000]
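The two-path chain described above can be sketched in a few lines of Python. This is a minimal illustration, not the original processor's design: the cascaded first-order filters, the 16 kHz sample rate, the synthetic two-tone input, and every function name are assumptions made for the sketch. Only the 270 Hz low-pass edge, the 1000-4000 Hz band-pass edges, and the use of zero-crossing detection come from the text. AGC is omitted.

```python
import math

def one_pole_lowpass(x, cutoff_hz, fs):
    """First-order low-pass filter (simple exponential smoothing)."""
    a = 1.0 - math.exp(-2.0 * math.pi * cutoff_hz / fs)
    y, out = 0.0, []
    for s in x:
        y += a * (s - y)
        out.append(y)
    return out

def one_pole_highpass(x, cutoff_hz, fs):
    """First-order high-pass: the input minus its low-passed copy."""
    low = one_pole_lowpass(x, cutoff_hz, fs)
    return [s - l for s, l in zip(x, low)]

def cascade(filt, x, cutoff_hz, fs, stages=4):
    """Apply the same filter several times for a steeper roll-off (assumption)."""
    for _ in range(stages):
        x = filt(x, cutoff_hz, fs)
    return x

def zero_crossing_freq(x, fs):
    """Estimate frequency from sign changes: two crossings per cycle."""
    crossings = sum(1 for a, b in zip(x, x[1:]) if (a < 0.0) != (b < 0.0))
    return crossings * fs / (2.0 * (len(x) - 1))

def estimate_f0_f2(x, fs):
    """F0 path: 270 Hz low-pass. F2 path: 1000-4000 Hz band-pass."""
    f0_path = cascade(one_pole_lowpass, x, 270.0, fs)
    f2_path = cascade(one_pole_highpass, x, 1000.0, fs)
    f2_path = cascade(one_pole_lowpass, f2_path, 4000.0, fs)
    return zero_crossing_freq(f0_path, fs), zero_crossing_freq(f2_path, fs)

# Synthetic stand-in for a voiced sound: a 120 Hz "pitch" tone plus a
# weaker 1500 Hz "second formant" tone (both values are assumptions).
fs = 16000
signal = [math.sin(2 * math.pi * 120 * n / fs)
          + 0.5 * math.sin(2 * math.pi * 1500 * n / fs)
          for n in range(fs // 2)]

f0_est, f2_est = estimate_f0_f2(signal, fs)
print(f"F0 estimate: {f0_est:.1f} Hz, F2 estimate: {f2_est:.1f} Hz")
```

On this clean two-tone input the estimates should land close to the 120 Hz and 1500 Hz components. Real speech is far less cooperative, because a zero-crossing count follows whatever dominates the filtered band, whether or not that is the intended feature.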

F2 picks the place, F0 sets the rate

[Interactive schematic: F2 trace along the electrode array, from apex to base. Example readout: vowel /u/, F2 900 Hz, electrode 2, F0 → rate 120 pps.]

Feature extraction throws away the waveform and keeps only estimated features. In F0/F2, the processor tracks the second formant F2 (the main vowel cue) and stimulates the one electrode whose place matches it, while the voice pitch F0 sets the pulse rate. It is elegant when the estimate is right — and that is exactly the assumption that breaks down in noise. Schematic.
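The place-and-rate rule in the caption above can be written as a small mapping. The sketch below is hedged throughout: the logarithmic spacing, the 22-electrode count, the zero-based index with 0 at the apex, and the names `f2_to_electrode` and `code_voiced_frame` are all hypothetical choices for illustration, and the actual frequency-to-electrode table of the device is not given in the text. The 1000-4000 Hz range is taken from the F2 band-pass filter.

```python
import math

N_ELECTRODES = 22                        # hypothetical array size for this sketch
F2_LOW_HZ, F2_HIGH_HZ = 1000.0, 4000.0   # F2 band-pass range from the text

def f2_to_electrode(f2_hz):
    """Map an F2 estimate to an electrode index (0 = most apical).

    Assumes log-spaced frequency bands across the array, which is an
    illustrative choice rather than the device's real table.
    """
    f2 = min(max(f2_hz, F2_LOW_HZ), F2_HIGH_HZ)   # clamp to the analysed band
    position = math.log(f2 / F2_LOW_HZ) / math.log(F2_HIGH_HZ / F2_LOW_HZ)
    return min(int(position * N_ELECTRODES), N_ELECTRODES - 1)

def code_voiced_frame(f0_hz, f2_hz):
    """One voiced frame: F2 picks the place, F0 sets the pulse rate."""
    return {"electrode": f2_to_electrode(f2_hz), "pulse_rate_pps": f0_hz}

for f2 in (900.0, 1500.0, 3000.0):
    print(f2, code_voiced_frame(120.0, f2))
```

Because only one electrode is chosen per frame, a wrong F2 estimate moves the whole stimulus to a different place, which is why the scheme depends so heavily on the estimate being right.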

Voicing by zero-crossing

Zero-crossing detection identifies the instants where the waveform crosses the zero-amplitude axis and is used here to estimate frequency and classify voiced versus unvoiced segments. F0 was estimated via a zero-crossing detector after a 270 Hz low-pass filter. F2 was estimated via a zero-crossing detector after a 1000-4000 Hz band-pass filter. Fundamental frequency carries important cues for intelligibility, especially in background noise.[1987][2010]

Why F0/F2 was superseded

Encoding only F0 and F2 leaves out the low-frequency vowel information carried by the first formant F1. Vowel and consonant discrimination needs more spectral detail than two tracked parameters provide. Adding F1 improved performance, directly motivating the F0/F1/F2 strategy. Feature extraction by zero-crossing is vulnerable to estimation errors when the tracked feature is corrupted.[1987][2006]

Zero-crossing rate decides voiced vs unvoiced

[Interactive schematic readout: zero crossings ≈ 8; few crossings → voiced → stimulate at F0 rate.]

Before spectral analysis was cheap, processors used a simple trick to tell a vowel from a hiss: the zero-crossing rate. A slow, quasi-periodic voiced sound crosses zero rarely; a noisy unvoiced consonant crosses it often. The F0/F2 family used this to switch between stimulating at the voice-pitch rate during voicing and at a low ~100 pps quasi-random rate during unvoiced segments. Schematic.
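The voiced/unvoiced switch described above can be sketched as follows. Several details are assumptions made for illustration: the 1500 crossings-per-second threshold, the 20 ms frame, the uniform jitter used for the quasi-random intervals, and all function names. The text supplies only the rule itself (few crossings means voiced), the F0 pulse rate during voicing, and the roughly 100 pps average during unvoiced segments.

```python
import math
import random

def zero_crossing_rate(frame, fs):
    """Zero crossings per second within one frame."""
    crossings = sum(1 for a, b in zip(frame, frame[1:]) if (a < 0.0) != (b < 0.0))
    return crossings * fs / (len(frame) - 1)

def is_voiced(frame, fs, threshold_per_s=1500.0):
    """Few crossings -> voiced. The threshold is a hypothetical value."""
    return zero_crossing_rate(frame, fs) < threshold_per_s

def pulse_times(duration_s, voiced, f0_hz=120.0, seed=0):
    """Pulse onset times (seconds) for one segment."""
    rng = random.Random(seed)
    times, t = [], 0.0
    while True:
        if voiced:
            t += 1.0 / f0_hz                 # regular pulses at the F0 rate
        else:
            t += rng.uniform(0.005, 0.015)   # jittered, mean 10 ms (about 100 pps)
        if t >= duration_s:
            return times
        times.append(t)

fs = 16000
vowel_like = [math.sin(2 * math.pi * 120 * n / fs) for n in range(320)]  # 20 ms tone
noise_rng = random.Random(1)
hiss_like = [noise_rng.gauss(0.0, 1.0) for _ in range(320)]              # 20 ms noise

for name, frame in (("vowel-like", vowel_like), ("hiss-like", hiss_like)):
    voiced = is_voiced(frame, fs)
    pulses = pulse_times(1.0, voiced)
    print(name, "voiced" if voiced else "unvoiced", len(pulses), "pulses in 1 s")
```

The periodic frame crosses zero only a handful of times and is stimulated at the regular F0 rate, while the noise frame crosses zero on roughly every other sample and falls back to the jittered low-rate pattern.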

By the numbers

Two-Formant Tracking (F0F2) vs Three-Feature (F0F1F2) Speech Perception

[Bar chart: percent / words correct (axis 0 to 80) for three speech tests: NU-6 words, sentences (no context), and spondees in noise. Example readout: spondees in noise, F0F2 37%, F0F1F2 75%.]

The earliest Cochlear Corporation strategies extracted only voice pitch (F0) and the second formant (F2). F0F2 tracking gave very limited open-set recognition (NU-6 words just 8%). Adding the first formant (F1) roughly doubled-to-tripled scores, showing how much was lost by tracking only one formant. Verified means from Tye-Murray 1990 (n=5).

Hear it

Synthesise a vowel — the two numbers a formant tracker chases

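[Interactive vowel chart: F2 (Hz) on the horizontal axis, F1 on the vertical axis, with /a/, /i/, /u/, /e/ and /o/ marked. Example readout: nearest vowel /a/, F1 730 Hz, F2 1090 Hz.]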

Two resonances do most of the work of a vowel: F1 (lower when the mouth is closed, as in /i/ and /u/) and F2 (higher when the tongue is front, as in /i/ and /e/). Play the vowel and slide F1 and F2 — the percept walks around the vowel space. This is precisely what the F0/F2 and F0/F1/F2 strategies tried to track and transmit; when their estimate of these two numbers was wrong, the wrong vowel was sent. Synthesised in your browser.
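The "nearest vowel" idea above, and the cost of a wrong estimate, can be illustrated with a small lookup. This is a sketch under stated assumptions: only the /a/ values (F1 730 Hz, F2 1090 Hz) appear in the text; the other formant pairs are illustrative textbook-style averages chosen for the example, and the log-frequency distance and the name `nearest_vowel` are likewise hypothetical.

```python
import math

# (F1, F2) targets in Hz. Only /a/ comes from the text; the remaining
# entries are illustrative assumptions, not measurements from this module.
VOWEL_FORMANTS_HZ = {
    "a": (730.0, 1090.0),
    "i": (270.0, 2290.0),
    "u": (300.0, 870.0),
    "e": (530.0, 1840.0),
    "o": (570.0, 840.0),
}

def nearest_vowel(f1_hz, f2_hz):
    """Return the vowel whose (F1, F2) target is closest on a log-frequency scale."""
    def distance(vowel):
        f1, f2 = VOWEL_FORMANTS_HZ[vowel]
        return math.hypot(math.log(f1_hz / f1), math.log(f2_hz / f2))
    return min(VOWEL_FORMANTS_HZ, key=distance)

# A correct estimate versus the same frame with F2 badly misjudged.
print("good estimate:", nearest_vowel(300.0, 870.0))
print("F2 error:     ", nearest_vowel(300.0, 2200.0))
```

With F1 held at 300 Hz, moving the F2 estimate from 870 Hz to 2200 Hz changes the nearest vowel from /u/ to /i/, which mirrors the failure described above: a wrong number in, a wrong vowel out.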

Case 14.5 · Tracking the Voice
A 1980s Nucleus recipient on an F0/F2 processor identifies many vowels by their pitch and openness but struggles badly with consonants such as /s/, /f/ and /th/.

Which design limitation of F0/F2 best accounts for the poor consonant perception?

Self-assessment: Module 5, 2 questions
Question 1

In the F0/F2 strategy, what does the estimated F2 frequency control?

Question 2

How is F0 estimated in the F0/F2 strategy?
