The most difficult sounds to symbolize

Acoustic Architecture

The Acoustic Architecture of Invisible Phonology: Challenges in the Symbolization and Detection of Non-Segmental Linguistic Shifts

March 17, 2026

Max Barrett

MaximillianGroup

California, United States

The mapping of human speech onto symbolic systems has traditionally prioritized segmental phonology, where discrete consonants and vowels serve as the primary units of meaning. However, a significant subset of the world’s languages employs "invisible" sound shifts—modulations in voice quality, nasal resonance, duration, and airflow mechanisms—that function phonemically but evade traditional orthographic representation. These shifts represent a departure from the linear concatenation of phonemes, instead utilizing the texture, timing, and aerodynamic source of the signal to distinguish lexical identity. The complexity of these phenomena presents a formidable challenge for both orthography and computational detection, particularly when attempting to normalize signals across diverse speaker populations including men, women, and children.

The Phonation Continuum: Breathy and Creaky Voice

Voice quality, or phonation, refers to the physiological state of the larynx during sound production. While modal voice—characterized by efficient vocal fold vibration with normal tension—is the cross-linguistic baseline, languages such as Gujarati and Jalapa Mazatec utilize deviations from this norm to differentiate words. These shifts are "invisible" because the primary place and manner of articulation often remain identical, while the laryngeal settings vary.

Breathy Phonation in Gujarati

In Gujarati, breathy phonation, often referred to as "murmur," distinguishes words that are otherwise homophonous. A classic minimal triplet is found in /baɾ/ (twelve), /ba̤ɾ/ (outside), and /bʱaɾ/ (burden).1 The acoustic distinction between a modal vowel and a breathy vowel lies primarily in the amplitude difference between the first and second harmonics (H1-H2). Breathy voice involves a more open glottis, where the vocal folds do not close completely or remain open for a larger portion of the glottal cycle. This manifests as a higher H1-H2 value, reflecting a larger open quotient (OQ).1
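
The H1-H2 measure can be illustrated with a minimal sketch. It assumes the fundamental frequency is already known and uses a synthetic two-harmonic signal rather than real Gujarati recordings; the function names, sample rate, and amplitudes are all hypothetical choices for illustration only.

```python
import math

SAMPLE_RATE = 8000   # Hz, assumed for this toy example
F0 = 100             # Hz, assumed known fundamental frequency
N_SAMPLES = 800      # exactly 10 periods of F0 at this sample rate

def harmonic_amplitude_db(signal, sample_rate, freq):
    """Amplitude in dB of the component at `freq` (single-bin DFT projection)."""
    n = len(signal)
    re = sum(s * math.cos(2 * math.pi * freq * i / sample_rate)
             for i, s in enumerate(signal))
    im = sum(s * math.sin(2 * math.pi * freq * i / sample_rate)
             for i, s in enumerate(signal))
    amplitude = 2 * math.hypot(re, im) / n
    return 20 * math.log10(amplitude)

def h1_minus_h2(signal, sample_rate, f0):
    """Difference in dB between the first and second harmonic amplitudes."""
    h1 = harmonic_amplitude_db(signal, sample_rate, f0)
    h2 = harmonic_amplitude_db(signal, sample_rate, 2 * f0)
    return h1 - h2

def synth_vowel(h1_amp, h2_amp):
    """Toy 'vowel' made of only two harmonics with the given linear amplitudes."""
    return [h1_amp * math.sin(2 * math.pi * F0 * i / SAMPLE_RATE)
            + h2_amp * math.sin(2 * math.pi * 2 * F0 * i / SAMPLE_RATE)
            for i in range(N_SAMPLES)]

modal = synth_vowel(1.0, 1.0)     # equal harmonics: H1-H2 near 0 dB
breathy = synth_vowel(1.0, 0.25)  # weakened second harmonic: larger H1-H2

print(round(h1_minus_h2(modal, SAMPLE_RATE, F0), 2))
print(round(h1_minus_h2(breathy, SAMPLE_RATE, F0), 2))
```

In this toy setup the "breathy" signal yields an H1-H2 about 12 dB higher than the "modal" one. Real speech would also need pitch tracking and formant correction before the value is comparable across vowels.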

Visualizing this shift in a spectrogram reveals a significant increase in spectral tilt, measured by H1-A3, where A3 is the amplitude of the third formant. In breathy vowels, the higher frequencies are significantly dampened relative to the fundamental frequency.1 Furthermore, periodicity is disrupted; the Cepstral Peak Prominence (CPP), a measure of how clearly harmonics emerge from background noise, "dips" significantly at the midpoint of the breathy vowel. This indicates that while the vowel starts and ends with more modal-like characteristics, the center of the segment is characterized by increased laryngeal noise and aperiodicity.1

The symbolization of this shift in Gujarati orthography is historically rooted in the inter-syllabic /h/, but in modern spoken Gujarati, it is often realized as a single breathy vowel [V̤].2 The difficulty for the writer lies in the continuum of production; a speaker might produce a full [VɦV] sequence in formal registers but a subtle [V̤] in connected speech, making a standardized symbol difficult to implement without imposing an artificial rigidness on the phonological reality.2

Creaky Phonation in Mazatec

Jalapa Mazatec utilizes a three-way contrast between modal, breathy, and creaky voice. Creaky voice, or vocal fry, involves high adductive tension and low longitudinal tension, resulting in a constricted glottis with thick, slow-moving vocal folds.4 Unlike the "sighing" quality of Gujarati breathy voice, Mazatec creaky voice is characterized by a low fundamental frequency (F0) and extreme irregularity. Individual glottal pulses often become audible, creating a "percept of roughness".5

Acoustically, creaky voice is identified by a lower H1-H2 value compared to modal voice, signaling the increased glottal constriction and a smaller open quotient.5 Spectral noise is higher across all frequency bands, which appears most clearly as a reduced Harmonic-to-Noise Ratio (HNR) at lower frequencies. In Mazatec, this laryngealization can be accompanied by high tones, which creates a "tense voice" variant where the F0 is high but the glottis remains constricted.5 This complicates machine detection, as the system cannot rely on low pitch alone to identify creak; it must instead look for the spectral slope and harmonic irregularity markers.5
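
One common way to quantify pulse-to-pulse irregularity is local jitter. The sketch below assumes glottal period durations have already been extracted; the period values are invented for illustration and are not Mazatec measurements. It shows why an irregularity measure can flag a pulse train regardless of whether its average pitch is low or high.

```python
def local_jitter(periods_ms):
    """Mean absolute difference between consecutive glottal periods,
    divided by the mean period (a unitless irregularity score)."""
    if len(periods_ms) < 2:
        raise ValueError("need at least two glottal periods")
    diffs = [abs(b - a) for a, b in zip(periods_ms, periods_ms[1:])]
    mean_diff = sum(diffs) / len(diffs)
    mean_period = sum(periods_ms) / len(periods_ms)
    return mean_diff / mean_period

# Hypothetical period sequences in milliseconds.
regular_high_pitch = [5.0, 5.1, 5.0, 4.9, 5.0]        # about 200 Hz, steady
irregular_low_pitch = [12.0, 18.0, 11.0, 20.0, 13.0]  # slow and uneven
irregular_high_pitch = [4.0, 6.0, 3.5, 6.5, 4.0]      # fast but uneven

for name, periods in [("regular, high pitch", regular_high_pitch),
                      ("irregular, low pitch", irregular_low_pitch),
                      ("irregular, high pitch", irregular_high_pitch)]:
    print(name, round(local_jitter(periods), 3))
```

Both irregular sequences score far higher than the regular one even though their average pitch differs. A full detector would combine such a score with spectral-slope measures rather than rely on it alone.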

The Aerodynamics of Nasalization: French and Guaraní

Nasalization occurs when the velopharyngeal port is opened, coupling the nasal cavity with the oral tract. This introduces a complex set of poles (resonances) and zeros (anti-resonances) into the acoustic signal, which can vary significantly based on the degree of coupling and the specific vowel being nasalized.6

Phonemic Nasalization in French

In French, the contrast between /bo/ (beau) and /bɔ̃/ (bon) is entirely dependent on this shift. The acoustic visualization of French nasal vowels shows that they are not merely "nasalized" versions of their oral counterparts; they often involve secondary oral adjustments. For instance, the nasal vowel /ɛ̃/ is consistently lower and more back than the oral /ɛ/.8

The primary acoustic marker for French nasalization is the widening of the first formant (F1) bandwidth and a general raising of the F1 frequency.6 This occurs because the nasal tract acts as a side-branch resonator that absorbs energy and shifts the oral resonances. In many French nasal vowels, a stable nasal formant appears around 900 Hz, while a nasal antiformant can appear near the oral F1, potentially canceling it out.6

Orthographically, French uses the letters "n" and "m" to signal nasality, but this system is inconsistent. In words like chant or fin, the nasal consonant is silent, serving only as a diacritic for the vowel. This creates confusion for learners and speech recognition systems alike, as the "invisible" nasal feature must be inferred from a letter that, in other contexts, represents a full consonant.10

Nasal Harmony and Transparency in Guaraní

Guaraní presents one of the most sophisticated examples of "nasal spreading," where nasality is not confined to a single segment but acts as a prosodic feature of the entire word.12 This spreading is triggered by a stressed nasal vowel and propagates bidirectionally until it hits a blocker—usually an oral stressed vowel.14

A major point of linguistic debate in Guaraní is the behavior of voiceless stops (/p, t, k/), which have been traditionally labeled as "transparent" because they allow the nasal feature to "skip" over them without themselves becoming voiced nasal stops.12 However, modern acoustic and aerodynamic studies reveal that these stops are "partial undergoers." While the closure remains largely oral to maintain the high pressure required for a stop, there is evidence of nasal airflow energy at the onset of the closure.16 Furthermore, the Voice Onset Time (VOT) for /p/ and /t/ in Guaraní is shifted in nasal environments, suggesting that the "invisible" nasality is indeed affecting the temporal coordination of the stop, even if it is not phonetically realized as a nasal consonant.15
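
The spreading rule described above can be sketched as a toy simulation. This is a deliberately simplified model under stated assumptions: the `Segment` class, the binary `nasal` flag, and the example segment sequences are hypothetical, and voiceless stops are treated as fully transparent even though the aerodynamic evidence suggests they are partial undergoers.

```python
from dataclasses import dataclass, replace

VOICELESS_STOPS = {"p", "t", "k"}

@dataclass
class Segment:
    symbol: str
    is_vowel: bool = False
    stressed: bool = False
    nasal: bool = False

def spread_nasality(word):
    """Spread nasality both ways from each stressed nasal vowel.
    An oral stressed vowel blocks the spread; voiceless stops are skipped."""
    result = [replace(seg) for seg in word]  # copy so the input is untouched
    triggers = [i for i, seg in enumerate(word)
                if seg.is_vowel and seg.stressed and seg.nasal]
    for trigger in triggers:
        for step in (-1, 1):  # leftward, then rightward
            i = trigger + step
            while 0 <= i < len(result):
                seg = result[i]
                if seg.is_vowel and seg.stressed and not seg.nasal:
                    break  # blocker: oral stressed vowel
                if seg.symbol not in VOICELESS_STOPS:
                    seg.nasal = True
                i += step
    return result

# Toy sequences; the symbols are illustrative only.
transparent_case = [Segment("t"), Segment("u", is_vowel=True), Segment("p"),
                    Segment("ã", is_vowel=True, stressed=True, nasal=True)]
blocked_case = [Segment("a", is_vowel=True, stressed=True), Segment("r"),
                Segment("u", is_vowel=True),
                Segment("ã", is_vowel=True, stressed=True, nasal=True)]

print([s.nasal for s in spread_nasality(transparent_case)])
print([s.nasal for s in spread_nasality(blocked_case)])
```

In the first sequence the unstressed vowel becomes nasal while both stops stay oral; in the second, the spread stops at the oral stressed vowel. A more faithful model would replace the boolean flag with a gradient airflow value for the stops.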

Temporal Precision: Duration and Quantity in Estonian

Holding a sound for a fraction of a second longer is a common way to mark emphasis, but in languages like Finnish and Estonian, it is a phonemic requirement. Estonian is uniquely complex due to its three-way quantity system: short (Q1), long (Q2), and overlong (Q3).17

The Ternary System of Estonian

The distinction between Q2 and Q3 in Estonian is particularly difficult to visualize because it is not just a measure of absolute duration; it is a feature of the entire disyllabic foot.19 In a Q1 foot like sada (hundred), the first syllable is short and the second is relatively long. In a Q2 foot like saada (send!), the first syllable is long and the second is short. In a Q3 foot like saada (to get), the first syllable is even longer and is often accompanied by a distinct falling pitch contour.21
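
Because quantity is a property of the foot, a first approximation is to classify by the duration ratio of the first syllable to the second. The sketch below is only illustrative: the two boundary values and all durations are assumptions chosen for the example, not figures from the cited studies, and the falling pitch cue for Q3 is ignored entirely.

```python
def classify_quantity(syll1_ms, syll2_ms, q1_q2_boundary=1.0, q2_q3_boundary=1.75):
    """Classify a disyllabic foot by its first-to-second syllable duration ratio.
    The boundaries are hypothetical defaults, not established thresholds."""
    ratio = syll1_ms / syll2_ms
    if ratio < q1_q2_boundary:
        return "Q1"
    if ratio < q2_q3_boundary:
        return "Q2"
    return "Q3"

# Invented syllable durations in milliseconds.
examples = {
    "short-first foot": (120, 180),      # ratio about 0.67
    "long-first foot": (240, 160),       # ratio 1.5
    "overlong-first foot": (300, 130),   # ratio about 2.3
}

for label, (s1, s2) in examples.items():
    print(label, classify_quantity(s1, s2))
```

A ratio-based rule normalizes for speaking rate, since both syllables stretch or shrink together. Separating Q2 from Q3 reliably would likely also require the pitch contour as a second feature.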

The symbolization of this system in Estonian orthography is notably insufficient. While Q1 is marked with a single letter and Q2 with a double letter, Q3 is generally not distinguished from Q2 in writing (except for stop consonants).18 For example, the spelling koolis can represent both the inessive singular (Q2) and another form in Q3, requiring the reader to rely entirely on syntactic context. This "orthographic gap" exists because the difference between Q2 and Q3 involves prosodic cues—specifically the tonal drop and the extreme syllable ratio—that standard Latin-based alphabets are not equipped to capture.23

Unencoded Speech: The Mechanics of Clicks in Xhosa

Click consonants, prevalent in Xhosa and other Nguni languages, are produced using a lingual ingressive airstream mechanism. This involves two closures: an "initiatory" closure at the velar or uvular position and an "articulatory" closure at the dental, alveolar, or lateral position.25 The rarefaction of air between these two points creates a suction that, when released, produces the characteristic click burst.

Acoustic Characteristics of Xhosa Clicks

Clicks are described as "unencoded speech" because they do not coarticulate with their phonetic environment in the same way that pulmonic sounds do.26 Unlike a "t" or "p" which leaves "transitional features" on the following vowel (formant transitions), a click release is almost entirely self-contained. Spectrograms of Xhosa clicks show very distinct noise-burst properties:

Dental Clicks [ǀ] (orthographic 'c'): These have a diffuse spectrum with energy spread across a wide range (0-9000 Hz) but at a lower overall amplitude.27 They are often described as "affricative" because the release is more gradual.28
Alveolar Clicks [ǃ] (orthographic 'q'): These are "instantaneous" and "compact." The spectral energy is concentrated in lower frequencies, typically between 1000 and 1700 Hz.27
Lateral Clicks [ǁ] (orthographic 'x'): These have a diffuse spectrum but with a distinct peak between 1000-2000 Hz, reflecting the resonance of the lateral side-cavities.27
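
The burst descriptions above suggest a simple band-energy heuristic. The following is a toy sketch, assuming a burst spectrum is already available as (frequency, energy) pairs; the spectra, the 500 Hz bin spacing, and both thresholds are invented for illustration and would need tuning on real recordings.

```python
def band_fraction(spectrum, low_hz, high_hz):
    """Share of total burst energy that falls inside [low_hz, high_hz]."""
    total = sum(energy for _, energy in spectrum)
    in_band = sum(energy for freq, energy in spectrum if low_hz <= freq <= high_hz)
    return in_band / total

def describe_burst(spectrum, compact_min=0.6, peak_min=0.3):
    """Label a burst by how concentrated its energy is (thresholds are assumptions)."""
    if band_fraction(spectrum, 1000, 1700) >= compact_min:
        return "compact, 'q'-like"
    if band_fraction(spectrum, 1000, 2000) >= peak_min:
        return "diffuse with 1-2 kHz peak, 'x'-like"
    return "diffuse, 'c'-like"

def toy_spectrum(boosted_bins, boost):
    """Flat spectrum from 500 to 9000 Hz with extra energy in selected bins."""
    return [(freq, boost if freq in boosted_bins else 1.0)
            for freq in range(500, 9001, 500)]

c_click = toy_spectrum(set(), 1.0)                # energy spread evenly
q_click = toy_spectrum({1000, 1500}, 15.0)        # energy packed into 1-1.7 kHz
x_click = toy_spectrum({1000, 1500, 2000}, 3.0)   # mild peak at 1-2 kHz

for name, spec in [("c", c_click), ("q", q_click), ("x", x_click)]:
    print(name, "->", describe_burst(spec))
```

The labels use the orthographic letters only as shorthand for the three burst profiles; amplitude and release duration, which the descriptions also mention, are not modeled here.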

Because standard Latin letters were never designed for lingual ingressive sounds, the symbols "c," "x," and "q" were arbitrarily assigned to these sounds in Xhosa. This creates a disconnect for non-native speakers who associate these letters with their European values, further obscuring the "invisible" mechanics of the click.31

The Signal Shift: Turkish Whistled Language

Whistle languages represent the most radical linguistic shift, where spoken Turkish is transposed into a series of frequency modulations.33 In the village of Kuşköy, "Bird Language" is used for long-distance communication where traditional vocalizations would be lost to ambient noise and distance.35

Phonetic Mapping and Modulation

In whistled Turkish, the vocal cords remain inactive, and the whistle acts as a "pure tone" carrier wave. The frequency is modulated by changing the volume of the resonant oral cavity, primarily through the anteroposterior movement of the tongue.35 The whistled signal essentially emulates the second formant (F2) of spoken Turkish, as F2 is the primary carrier of vowel identity in non-tonal languages.37

However, this shift necessitates a significant phonetic reduction. Spoken Turkish’s 32 phonemes are condensed into approximately six whistled phonetic groups. For instance, the bilabial stops /p/ and /b/ are merged into the sound /f/ because the lips cannot close during whistling without stopping the sound.35

Symbolizing a whistle language is virtually impossible with a standard alphabet because the signal is continuous and lacks the discrete boundaries of spoken phonemes. While researchers use frequency intervals to categorize these sounds, the native whistlers rely on a "right-brain" processing mechanism that decodes the melody and rhythm rather than discrete letters.35

Machine Detection and Speaker Normalization

Detecting these subtle sound shifts in children, women, and men requires a robust normalization process. The "lack of invariance" in speech means that a man's breathy voice might have an H1-H2 similar to a woman's modal voice, making absolute frequency thresholds useless.38

The Logic Pathway for Machine Hearing

To detect "invisible" shifts across speaker types, machines must follow a specific logic pathway that isolates linguistic intent from physiological variation:

Signal Acquisition and Pre-processing: The raw audio is sampled (typically at 16-44.1 kHz) and divided into short frames (20-30 ms).7
Vocal Tract Length Normalization (VTLN): The system estimates the length of the speaker's vocal tract and applies a warping factor (α) to the frequency axis. This "scales" the formants as if they were produced by a reference tract, helping to normalize the difference between a child's small tract and a man's large one.39
Feature Extraction: Mel-frequency cepstral coefficients (MFCCs) are extracted to capture the spectral envelope. For shifts like phonation or nasality, higher-level parameters are added:

Phonation: H1-H2, H1-A3, and CPP.1
Nasalization: F1 bandwidth and the nasal formant.9
Duration: Syllable and segment ratios.19

Temporal Trajectory Analysis: Instead of a single measurement, the machine evaluates the shift across the duration of the segment (e.g., at 10%, 50%, and 90% marks). This is crucial for Guaraní nasal spreading or Gujarati midpoint breathiness.7
Classification: These normalized features are fed into a Support Vector Machine (SVM) or an XGBoost model. By training on multi-speaker datasets, the model learns the "boundary" of the shift independent of the speaker's F0.7
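
Two of the steps above, VTLN and temporal trajectory sampling, can be sketched in a few lines. This is a simplified illustration under stated assumptions: the warp is a plain linear scaling with a least-squares estimate of the factor, whereas production systems typically use piecewise warps chosen by likelihood search, and the formant values and CPP track are invented.

```python
def estimate_alpha(speaker_formants, reference_formants):
    """Least-squares scale factor that maps speaker formants onto the reference."""
    numerator = sum(s * r for s, r in zip(speaker_formants, reference_formants))
    denominator = sum(s * s for s in speaker_formants)
    return numerator / denominator

def warp_formants(formants_hz, alpha):
    """Linear VTLN: scale each formant frequency by the warping factor."""
    return [f * alpha for f in formants_hz]

def sample_trajectory(track, points=(0.1, 0.5, 0.9)):
    """Pick values at relative positions within a segment-long feature track."""
    last = len(track) - 1
    return [track[round(p * last)] for p in points]

# Hypothetical formants (Hz): a short vocal tract raises every formant.
reference = [500.0, 1500.0, 2500.0]
child = [625.0, 1875.0, 3125.0]

alpha = estimate_alpha(child, reference)
normalized = warp_formants(child, alpha)
print(round(alpha, 3), [round(f) for f in normalized])

# Hypothetical CPP track (dB) across one vowel, dipping at the midpoint.
cpp_track = [20, 19, 17, 14, 11, 9, 11, 14, 17, 19, 20]
print(sample_trajectory(cpp_track))
```

The three sampled values would then join the other features as classifier input. Sampling at fixed relative positions keeps the feature vector the same length regardless of segment duration.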

Normalizing for Age and Sex

Research in child speech normalization indicates that Z-score standardization, which scales features using age- and sex-specific means and standard deviations, significantly improves detection.44 For instance, a child's "modal" voice naturally contains more noise than an adult's. A machine must be "taught" that a higher H1-H2 in a child may be their baseline, whereas the same value in an adult male would signal breathiness.44
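
Group-wise Z-scoring can be sketched as follows. The group labels and the H1-H2 values are invented for illustration; the point is only that the same raw value lands on opposite sides of zero depending on the reference group.

```python
from statistics import mean, stdev

def zscore_by_group(samples):
    """samples: list of (group, value) pairs.
    Returns each value standardized against its own group's mean and stdev."""
    by_group = {}
    for group, value in samples:
        by_group.setdefault(group, []).append(value)
    stats = {g: (mean(vals), stdev(vals)) for g, vals in by_group.items()}
    return [(value - stats[group][0]) / stats[group][1]
            for group, value in samples]

# Hypothetical H1-H2 measurements in dB.
samples = [
    ("child", 6.0), ("child", 7.0), ("child", 8.0), ("child", 13.0),
    ("adult_male", 0.0), ("adult_male", 1.0), ("adult_male", 2.0), ("adult_male", 7.0),
]

scores = zscore_by_group(samples)
child_7db = scores[1]        # 7 dB measured in a child
adult_male_7db = scores[7]   # 7 dB measured in an adult male
print(round(child_7db, 2), round(adult_male_7db, 2))
```

Here 7 dB falls below the child group's mean but well above the adult male group's mean, so only the latter would look unusually breathy. Each group needs at least two samples with some spread for the standard deviation to be defined and nonzero.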

Conclusion: The Symbolic Gap

The challenge of "invisible" sound shifts highlights a fundamental limitation of human orthography: it is designed for ease of reading, not acoustic precision. Standard alphabets are "selective," ignoring sounds that do not change meaning in the designer's language and simplifying those that do for the sake of usability.45 This creates a "symbolic gap" where the most complex and nuanced aspects of human speech—the sigh of a Gujarati breathy vowel, the suction of a Xhosa click, or the melody of a Turkish whistle—are rendered invisible.

Computational linguistics provides a bridge across this gap. By utilizing VTLN, harmonic-to-noise ratios, and spectral tilt measurements, machines can "hear" the shifts that orthography ignores. However, the ultimate normalization remains a human cognitive process. Whether through "multiple-listing" of phonetic variants in memory or "top-down" parsing using semantic context, the human ear remains the most sophisticated detector of these invisible phonological architectures.38 For the field of linguistics, the ongoing task is to refine these digital models until they can mirror the human ability to find constancy in a signal that is constantly shifting.

Works cited