5 Envelope, fine structure & the vocoder
Once the filter bank has split sound into bands, the processor makes its boldest move: from each band it keeps only the slow rise and fall of amplitude (the envelope) and throws away the fast oscillation underneath it (the temporal fine structure). This is the single most consequential decision in cochlear-implant coding, and it has a name and a pedigree: the channel vocoder, a speech-transmission idea from the 1930s. It works astonishingly well for speech: a famous experiment showed that the envelopes of just a few bands are enough for listeners to understand sentences. But the discarded fine structure is precisely where pitch and music live, so the same trade-off that makes speech easy makes music hard. This module is about that bargain and why the implant strikes it.
Envelope and fine structure
Any band of sound can be split into two parts: a slowly varying envelope (how loud the band is, moment to moment) and a fast fine structure (the rapid carrier oscillation inside it). The envelope carries the rhythm and timing of speech; the fine structure carries much of the pitch and the cues for separating sounds. The implant keeps the first and discards the second.
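The split described above can be sketched in a few lines. This is an illustration only, not the processing used by any particular device: the sample rate, the 1000 Hz carrier, the 4 Hz modulation and the helper names (make_band, extract_envelope, fine_structure) are all assumptions chosen for the example. The envelope is estimated by full-wave rectification followed by averaging over one carrier period, and the fine structure is what remains after dividing the envelope back out.

```python
import math

FS = 16000                  # sample rate in Hz (assumed for this sketch)
CARRIER_HZ = 1000           # fast oscillation: the fine structure (assumed)
MOD_HZ = 4                  # slow loudness fluctuation: the envelope (assumed)
PERIOD = FS // CARRIER_HZ   # samples per carrier cycle (16 here)


def make_band(n_samples):
    """Return (signal, true_envelope) for an amplitude-modulated tone."""
    signal, true_env = [], []
    for n in range(n_samples):
        t = n / FS
        env = 1.0 + 0.5 * math.sin(2 * math.pi * MOD_HZ * t)
        true_env.append(env)
        signal.append(env * math.sin(2 * math.pi * CARRIER_HZ * t))
    return signal, true_env


def extract_envelope(signal, window=PERIOD):
    """Full-wave rectify, then take a running mean over one carrier period."""
    rectified = [abs(s) for s in signal]
    env, running = [], 0.0
    for n, r in enumerate(rectified):
        running += r
        if n >= window:
            running -= rectified[n - window]
        count = min(n + 1, window)
        # A rectified sine averages 2/pi of its peak, so rescale by pi/2.
        env.append((running / count) * math.pi / 2)
    return env


def fine_structure(signal, env, floor=1e-6):
    """Divide the envelope out, leaving a roughly unit-amplitude carrier."""
    return [s / max(e, floor) for s, e in zip(signal, env)]


signal, true_env = make_band(FS // 2)   # half a second of signal
env = extract_envelope(signal)          # the part the implant keeps
tfs = fine_structure(signal, env)       # the part the implant discards
```

After the first carrier period, env tracks the slow 4 Hz swell closely, while tfs is close to a constant-amplitude 1000 Hz tone. A Hilbert-transform decomposition is the more common textbook route to the same split; rectify-and-smooth is used here only to keep the sketch dependency-free.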
The vocoder model
The conceptual model is the channel vocoder: split sound into bands, extract each band's envelope, and use those envelopes to modulate a set of carriers — in the implant, trains of electrical pulses on each electrode. A cochlear implant is, in effect, a vocoder whose carriers are electrodes in the cochlea. This is also why acoustic vocoder simulations (noise- or tone-vocoded speech) are used to model implant hearing in normal-hearing listeners.
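A minimal tone-vocoder sketch of that pipeline follows. It is an acoustic simulation in the spirit of the tone-vocoded speech mentioned above, not an implant's actual signal path: the four band centres, the filter sharpness Q, the 50 Hz envelope cutoff and the function names are hypothetical choices, and sine carriers stand in for the electrical pulse trains. The band-pass filter uses the standard second-order (biquad) band-pass form with unit gain at the centre frequency.

```python
import math

FS = 16000                            # sample rate in Hz (assumed)
CENTERS_HZ = [300, 700, 1500, 3000]   # hypothetical band centres
Q = 2.0                               # hypothetical filter sharpness
ENV_CUTOFF_HZ = 50                    # hypothetical envelope smoothing cutoff


def bandpass(signal, center_hz):
    """Second-order band-pass with unit gain at center_hz."""
    w0 = 2 * math.pi * center_hz / FS
    alpha = math.sin(w0) / (2 * Q)
    a0 = 1 + alpha
    b0, b2 = alpha / a0, -alpha / a0          # b1 is zero for this form
    a1, a2 = -2 * math.cos(w0) / a0, (1 - alpha) / a0
    out = []
    x1 = x2 = y1 = y2 = 0.0
    for x in signal:
        y = b0 * x + b2 * x2 - a1 * y1 - a2 * y2
        out.append(y)
        x2, x1 = x1, x
        y2, y1 = y1, y
    return out


def envelope(band):
    """Full-wave rectify, then smooth with a one-pole low-pass."""
    k = 1 - math.exp(-2 * math.pi * ENV_CUTOFF_HZ / FS)
    env, state = [], 0.0
    for x in band:
        state += k * (abs(x) - state)
        env.append(state)
    return env


def tone_vocoder(signal):
    """Return (output, envelopes): band envelopes re-imposed on sine carriers."""
    envelopes = [envelope(bandpass(signal, fc)) for fc in CENTERS_HZ]
    output = [0.0] * len(signal)
    for fc, env in zip(CENTERS_HZ, envelopes):
        for n, e in enumerate(env):
            output[n] += e * math.sin(2 * math.pi * fc * n / FS)
    return output, envelopes


# Demo input: a steady 1500 Hz tone, which should drive the third band hardest.
tone = [math.sin(2 * math.pi * 1500 * n / FS) for n in range(FS // 4)]
vocoded, envelopes = tone_vocoder(tone)
final_levels = [env[-1] for env in envelopes]
```

Only the envelopes cross from the analysis side to the synthesis side; the carriers are fixed in advance and know nothing about the input. Swapping the sine carriers for band-limited noise would give a noise vocoder instead.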
Why envelope is (mostly) enough
It is not obvious that throwing away the fine structure should leave speech intelligible, but it does. In a landmark study, listeners recognised sentences from the envelopes of as few as four bands, showing that the temporal envelope across a handful of channels carries most of the information needed to understand speech. That result is the empirical licence for the whole approach: it is why a device that discards so much can still deliver open-set speech. [1995]
What it costs
The bill comes due elsewhere. Pitch, melody, talker identity and the subtle cues that let us follow one voice among many live largely in the fine structure — so discarding it is why implant users find music thin and speech in noise hard. Much of the rest of the chapter — fine-structure coding, current focusing — is an attempt to give back a little of what this step takes away, without losing the robustness that made envelope coding work.
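One way to see the loss concretely: two tones of different frequency that swell and fade in the same way produce essentially the same envelope, so an envelope-only code cannot tell them apart within a band. The sketch below assumes a 16 kHz sample rate, carriers at 1000 Hz and 1100 Hz, a shared 4 Hz modulation and a 50 Hz smoothing cutoff; these values and the function names are illustrative, not taken from any device.

```python
import math

FS = 16000            # sample rate in Hz (assumed)
MOD_HZ = 4            # shared slow modulation rate (assumed)
ENV_CUTOFF_HZ = 50    # hypothetical envelope smoothing cutoff


def modulated_tone(carrier_hz, n_samples):
    """A tone at carrier_hz whose loudness swells and fades at MOD_HZ."""
    return [
        (1.0 + 0.5 * math.sin(2 * math.pi * MOD_HZ * n / FS))
        * math.sin(2 * math.pi * carrier_hz * n / FS)
        for n in range(n_samples)
    ]


def envelope(band):
    """Full-wave rectify, then smooth with a one-pole low-pass."""
    k = 1 - math.exp(-2 * math.pi * ENV_CUTOFF_HZ / FS)
    env, state = [], 0.0
    for x in band:
        state += k * (abs(x) - state)
        env.append(state)
    return env


# Two "notes" with different pitch but the same loudness contour.
low_note = envelope(modulated_tone(1000, FS // 2))
high_note = envelope(modulated_tone(1100, FS // 2))

settled = FS // 10    # skip the smoothing filter's start-up transient
largest_gap = max(
    abs(a - b) for a, b in zip(low_note[settled:], high_note[settled:])
)
envelope_swing = max(low_note[settled:]) - min(low_note[settled:])
```

Here largest_gap is small compared with envelope_swing: the envelopes follow the 4 Hz swell faithfully, but the 100 Hz difference between the two carriers has left almost no trace in them.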
What is the correct implication?
What does envelope extraction keep and discard?
What is the cochlear implant best modelled as, in signal-processing terms?