Proceedings of Meetings on Acoustics

POMA - 162nd Meeting Acoustical Society of America
Conference Location: San Diego, California
Conference Date: 31 October - 4 November 2011
FREE

Improving automatic speech recognition by learning from human errors

Bernd T. Meyer

POMA Volume 14, pp. 060001 (December 2011); (9 pages)

Online Publication Date: December 23, 2011

This work presents a series of experiments that compare the performance of human speech recognition (HSR) and automatic speech recognition (ASR). The goal of this line of research is to learn from the differences between HSR and ASR, and to use this knowledge to incorporate new signal processing strategies from the human auditory system in automatic classifiers. A database with noisy nonsense utterances is used both for HSR and ASR experiments with a focus on the influence of intrinsic variation (arising from changes in speaking rate, effort, and style). A standard ASR system is found to reach human performance level only when the signal-to-noise ratio is increased by 15 dB, which can be seen as the human-machine gap for speech recognition on a sub-lexical level. The sources of intrinsic variation are found to severely degrade phoneme recognition scores both in HSR and in ASR. A comparison of utterances produced at different speaking rates indicates that temporal cues are not optimally exploited in ASR, which results in a strong increase of vowel confusions. Alternative feature extraction methods that take into account temporal and spectro-temporal modulations of speech signals are discussed.
PACS:
43.71.Es Vowel and consonant perception; perception of words, sentences, and fluent speech
43.72.Ja Speech synthesis and synthesis techniques
43.72.Ne Automatic speech recognition systems
FREE

Examining the voice bar

Sean A. Fulop and Sandra F. Disner

POMA Volume 14, pp. 060002 (April 2012); (11 pages)

Online Publication Date: April 03, 2012

In a spectrogram of a human vowel sound, it is possible to observe the formant resonances which define the vowel auditorily. It is usually also possible to observe an emphasized frequency below F1, which has often been called the voice bar. Although recognition of the voice bar dates back to 19th century phonetics, it has never been the subject of a specific investigation. As a result, the nature and origin of the voice bar remain mysterious. Recent work on voice source synthesis [C. d'Alessandro et al., eNTERFACE 2005 Proc. pp. 52-61] has explained the appearance of an emphasized frequency in the neighborhood of 200 Hz: it simply results from the frequency peak of the radiated source spectrum. Yet many speech scientists continue to ignore the voice bar, even to the point of denying its reality. Measurements of the voice bar in a number of different speakers and languages will be clearly shown in this paper, using reassigned spectrograms and linear prediction spectral estimates. The proper recognition of the voice bar can begin with this preliminary study, whose results largely corroborate the recently developed theory.
PACS:
43.72.Ar Speech analysis and analysis techniques; parametric representation of speech
43.70.Jt Instrumentation and methodology for speech production research
FREE

Temporal regularity in speech perception - is regularity beneficial or deleterious?

Eveline Geiser and Stefanie Shattuck-Hufnagel

POMA Volume 14, pp. 060004 (April 2012); (10 pages)

Online Publication Date: April 13, 2012

Speech rhythm has been proposed to be of crucial importance for correct speech perception and language learning. This study investigated the influence of speech rhythm in second language processing. German pseudo-sentences were presented to participants in two conditions: a "naturally regular speech rhythm" and an "emphasized regular rhythm". Nine expert English speakers with 3.5±1.6 years of German training repeated each sentence after hearing it once over headphones. Responses were transcribed using the International Phonetic Alphabet and analyzed for the number of correct, false, and missing consonants as well as for consonant additions. The overall number of correct reproductions of consonants did not differ between the two experimental conditions. However, speech rhythmicization significantly affected the serial position curve of correctly reproduced syllables. The results of this pilot study are consistent with the view that speech rhythm is important for speech perception.
PACS:
43.71.An Models and theories of speech perception
43.71.Hw Cross-language perception of speech
43.71.Rt Sensory mechanisms in speech perception
43.71.Sy Spoken language processing by humans
FREE

Music masking speech in hybrid cochlear implant simulations

Shaikat Hossain and Peter Assmann

POMA Volume 14, pp. 060005 (December 2012); (28 pages)

Online Publication Date: December 05, 2012

The present study investigated the masking effects of various musical instruments on speech processed through simulations of a cochlear implant (CI) and electric-acoustic stimulation (EAS) at different signal-to-noise ratios (SNRs). Musical instruments with spectrotemporal characteristics similar to speech were generally more effective maskers. Introducing low frequency acoustic information led to improved word recognition scores for the EAS simulation compared to the normal CI simulation. Overall, EAS benefit was larger at lower SNRs. Fundamental frequency (F0) was better preserved in the EAS simulation and found to correlate with EAS benefit, consistent with theories that attribute its effectiveness to F0-based segregation.
PACS:
43.66.Dc Masking
43.66.Ts Auditory prostheses, hearing aids
43.71.Ky Speech perception by the hearing impaired
FREE

Dispersion and variability of vowels of different vowel inventory sizes

Wai-Sum Lee

POMA Volume 14, pp. 060006 (December 2012); (9 pages)

Online Publication Date: December 06, 2012

The study investigates dispersion and variability of the vowels of three Chinese dialects, Yongding, Cantonese, and Wenling, with three-, seven-, and eleven-vowel systems, respectively. Formant data on the male and female vowels of the three dialects are presented. The main findings are as follows. In all three dialects, (i) a larger vowel inventory correlates with a more expanded acoustical vowel space, which supports the vowel dispersion theory's prediction that the larger the vowel inventory, the more expanded the acoustical vowel space will be (Lindblom, 1986), although the difference in vowel space is not linearly related to the difference in vowel inventory size; (ii) variability in vowel formants is not inversely related to vowel inventory size, which disagrees with the vowel dispersion theory's prediction; (iii) there is greater between-category dispersion in the F1F2 plane in the female vowels than in the male ones, which is similar to what is reported in Fant (1966, 1975); and (iv) contrary to the quantal theory's prediction (Stevens, 1972, 1989), the point vowels do not show less within-category variability than the non-point vowels of the three Chinese dialects.
PACS:
43.70.Kv Cross-linguistic speech production and acoustics
FREE

Acoustic properties of coda liquids in Californian English

Onna A. Nelson

POMA Volume 14, pp. 060008 (February 2013); (9 pages)

Online Publication Date: February 25, 2013

Synchronic and diachronic processes which affect one liquid in a language are likely to affect all liquids in the language (Walsh 1997; Proctor 2009). While it is well-established that the English rhotic [ɹ] may serve as the syllable peak in certain words such as church, bird, and verb, little work has investigated the possibility of a lateral syllable peak in analogous words such as milk, filled, and help. Given that coda-position /ɹ/ may be the syllable peak in certain closed syllables, it is expected that coda-position /ɫ/ behaves similarly. The current study examines coda-position liquids in closed syllables uttered by native California speakers to predict sonority based on liquid type, speaker gender, lexical stress, and other phonological features. Additionally, the formant values of liquids are examined to determine the similarity of liquids to vowels, as Gick et al. (2002) suggest that the articulations of /ɹ/ and /ɫ/ are most similar to those of /ə/ and /ɔ/, respectively. It is therefore predicted that the formant structure of these sonorous liquids will mirror the formant structure of these two vowels. Results indicate that liquids in Californian English exhibit similar patterns regarding sonority under certain conditions, although rhotics may be more vowel-like than laterals.
PACS:
43.70.Fq Acoustical correlates of phonetic segments and suprasegmental properties: stress, timing, and intonation
43.71.Es Vowel and consonant perception; perception of words, sentences, and fluent speech
FREE

Perception of speaker age in children's voices

Peter Assmann, Santiago Barreda, and Terrance Nearey

POMA Volume 14, pp. 060009 (February 2013); (7 pages)

Online Publication Date: February 13, 2013

To study the perception of speaker age in children's voices, adult listeners were presented with vowels in /hVd/ syllables, either in isolation or in a carrier sentence. Listeners used a graphical slider to register their estimate of the speaker's age. The data showed a moderate correlation of perceived age and chronological age. For isolated syllables, age estimation accuracy was fairly constant across age up to about age 11, but there was a systematic tendency for listeners to underestimate the ages of older girls. This error pattern was actually exaggerated when listeners were informed of the speaker's sex. Age estimation accuracy was higher for syllables embedded in a carrier sentence, and knowledge of the speaker's sex had little effect. Linear regression analyses were conducted using acoustic measurements of the stimuli to predict perceived age. These analyses indicated significant contributions of fundamental frequency, duration, vowel category, formant frequencies as well as certain measures related to the voicing source. The persistent underestimation of age for older girls, and the effect knowledge of speaker sex has on this underestimation suggest that acoustic information is combined with expectations regarding speakers of a given sex in arriving at an estimate of speaker age.
PACS:
43.71.Gv Measures of speech perception (intelligibility and quality)
43.72.Ar Speech analysis and analysis techniques; parametric representation of speech
43.71.An Models and theories of speech perception
43.71.Bp Perception of voice and talker characteristics
43.71.Es Vowel and consonant perception; perception of words, sentences, and fluent speech
FREE

Selection of speech/voice vectors in forensic voice identification

James Harnsberger and Harry Hollien

POMA Volume 14, pp. 060010 (June 2013); (21 pages)

Online Publication Date: June 14, 2013

The problem of identifying speakers from voice analysis is a serious one. Many procedures have been proposed, some based on signal processing techniques common to automatic speech recognition. Yet it is clear that humans very often can make highly accurate identifications, even under challenging listening conditions that are common in forensic audio. A number of procedures have been developed which mimic human perception for this purpose: a semi-automatic forensic speaker recognition system using four sets of parameters, or vectors, based on a substantial number of related speech parameters. Identifications of 28 males in a field of 10 foil voices provided these data; the technique involved three (complete) replications of the approach. It was found that identification scores for three of these vectors (voice quality, vowel formants, fundamental frequency) were very high and that the score for the temporal vector was positive but modest. Moreover, it was also found that every one of the vector-summation scores identified the target speaker. These results were based on high-quality simulated field recordings, and demonstrate the efficacy of modeling biological systems (human perception) to solve challenging processing problems.
PACS:
43.71.Bp Perception of voice and talker characteristics
43.72.Uv Forensic acoustics