Journal of the Acoustical Society of America

Nov 1989

Volume 86, Issue S1, pp. S1-S125

Session NN. Speech Communication VIII: Speech Perception
Contributed Papers

Converging evidence on the nature of the segmental representation underlying spoken word recognition (A)

Deborah A. Gagnon and James R. Sawusch

J. Acoust. Soc. Am. Volume 86, Issue S1, pp. S99-S100 (1989); (2 pages)

Online Publication Date: 13 Aug 2005

Previous work has shown that the segmental representation used in recognizing spoken words does not correspond to the traditional abstract phoneme [Gagnon and Sawusch (1989)]. However, the evidence did not unequivocally distinguish between two alternative proposals—allophones and position‐specific phonemes. In the previous study, subjects heard natural CVC prime‐target pairs in various degrees of phonetic overlap and named the second item (the target) as quickly as possible. The present study adopted the same approach but utilized a phoneme monitoring task in which subjects responded whenever the target began with a designated phoneme. The pattern of RT results, while different from that obtained in the naming task, again supported a rejection of the phoneme and shed further light on the nature of the representation. Similar patterns across word and nonword blocks and across trials in which voice was the same versus different within a prime‐target pair were found with both tasks. A comparison will be made of the two task types in terms of the results generated and the implications for their use in studying on‐line spoken word recognition. [Work supported by NINCDS.]

The role of phonological permissibility in the phonetic coding of speech (A)

Carol A. Wannemacher and James R. Sawusch

J. Acoust. Soc. Am. Volume 86, Issue S1, pp. S100-S100 (1989); (1 page)

Online Publication Date: 13 Aug 2005

Previous research has shown that phonological permissibility of a consonant sequence can affect the identification of an ambiguous phoneme [Massaro and Cohen (1983)]. The present experiment used both reaction time and identification measures to further explore this finding. With natural speech, little or no effect of phonological permissibility of a fricative‐liquid sequence was found upon speed to classify the liquid. In the second experiment, a synthetic liquid series was constructed with endpoints varying in permissibility (depending on the fricative precursor). A speeded classification task was run and data were partitioned according to speed of response (cf. Fox, 1984). The results of this experiment, combined with the earlier findings, illustrate the relative roles of both acoustic‐phonetic and phonological information in phonetic coding. The data will be discussed in terms of the data‐ and knowledge‐driven influences upon phonetic coding. [Work supported by NINCDS.]

Attention to phonetic context across word boundaries (A)

Jenny DeGroot and Howard C. Nusbaum

J. Acoust. Soc. Am. Volume 86, Issue S1, pp. S100-S100 (1989); (1 page)

Online Publication Date: 13 Aug 2005

Variability in phonetic context affects phoneme perception. For example, subjects are slower to identify a target consonant when an adjacent context phoneme varies independently, compared to when context phonemes are constant across trials [C. C. Wood and R. S. Day, Percept. Psychophys. 17, 346–350 (1975)]. Further, this is true whether or not a syllable boundary intervenes between the target and the varying context, suggesting that subjects attend to phonetic context regardless of syllable structure [J. DeGroot and H. C. Nusbaum, J. Acoust. Soc. Am. Suppl. 1 85, S123 (1989)]. Do listeners similarly attend to phonetic context across a word boundary when recognizing a phoneme? In the present experiments, subjects identified a target phoneme, while an adjacent context phoneme either varied orthogonally or was held constant. In one stimulus set, the varying context occurred in the same word as the target; in the other stimulus set, the varying context occurred in an adjacent word. Response times were measured to investigate whether the varying context slows target recognition across a word boundary as it does within a word. The results provide information about how listeners distribute their attention across the speech signal, and about the perceptual function of different linguistic units. [Research supported by NIH.]

The effects of attention on the phonetic integration of acoustic information (A)

Jennifer L. Eberhardt and Peter C. Gordon

J. Acoust. Soc. Am. Volume 86, Issue S1, pp. S100-S100 (1989); (1 page)

Online Publication Date: 13 Aug 2005

The present study investigated the effects of attention on phonetic processing. Thirty‐five synthetic steady‐state vowels varying from /i/ to /I/ were used. The first three formants of each vowel varied in seven equal logarithmic steps from /i/ to /I/ as the duration varied from short to long in five steps (50, 80, 120, 190, and 300 ms). All of the vowel stimuli were presented to subjects under high‐ and low‐attention conditions. Attention was manipulated by requiring that subjects perform a nonspeech distractor task while simultaneously performing a speech identification task, or by requiring that subjects perform a speech identification task only. Phonetic identification of the vowel stimuli was found to vary with the attention condition. When subjects performed the distractor task and the speech task simultaneously, duration became a more important cue to phonetic identity, whereas the effect of formant frequency was reduced. A quantitative model was developed to characterize how the integration of information from the two kinds of cues, formant structure and duration, changed with attention level.
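
The stimulus design described in this abstract (seven equal logarithmic formant steps crossed with five durations, giving 35 vowels) can be illustrated with a minimal sketch. The endpoint formant values below are assumptions chosen only for illustration, since the abstract does not report them; the function names `log_steps` and `build_stimuli` are likewise hypothetical and do not reflect the authors' synthesis procedure.

```python
from itertools import product

# Hypothetical endpoint formants in Hz (F1, F2, F3); NOT reported in the abstract.
ENDPOINT_TENSE = (270.0, 2290.0, 3010.0)  # assumed values for /i/
ENDPOINT_LAX = (390.0, 1990.0, 2550.0)    # assumed values for /I/

DURATIONS_MS = (50, 80, 120, 190, 300)    # the five durations given in the abstract
N_STEPS = 7                               # the seven formant steps given in the abstract


def log_steps(start, end, n):
    """Return n values from start to end, equally spaced in log frequency."""
    ratio = (end / start) ** (1.0 / (n - 1))
    return [start * ratio ** k for k in range(n)]


def build_stimuli():
    """Cross the 7 formant steps with the 5 durations (7 x 5 = 35 stimuli)."""
    per_formant = [
        log_steps(a, b, N_STEPS) for a, b in zip(ENDPOINT_TENSE, ENDPOINT_LAX)
    ]
    spectra = list(zip(*per_formant))  # one (F1, F2, F3) triple per step
    return [
        {"step": s + 1, "formants_hz": spectra[s], "duration_ms": d}
        for s, d in product(range(N_STEPS), DURATIONS_MS)
    ]


stimuli = build_stimuli()
print(len(stimuli))
```

Equal logarithmic steps mean that adjacent stimuli differ by a constant frequency ratio rather than a constant number of hertz, which is why the sketch multiplies by a fixed ratio at each step.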

The intelligibility of speech directed to children and adults (A)

Judith C. Goodman, Howard C. Nusbaum, Lisa Lee, and Kevin Broihier

J. Acoust. Soc. Am. Volume 86, Issue S1, pp. S100-S100 (1989); (1 page)

Online Publication Date: 13 Aug 2005

Speech directed to very young children differs markedly from speech directed to adults in vocabulary, syntax, and intonation. Are there differences in segmental intelligibility as well? Speech was recorded from six mothers addressing their 2‐year‐olds, and, in another session, addressing an adult, under comparable conditions. Open and closed‐class words were excised from both types of speech and were presented, in isolation, to adult listeners for identification. Children may not understand closed‐class words, so mothers may not articulate them clearly. Excised open‐class words expressed either new or previously given information. Whereas repeated information is articulated less clearly in speech to adults, repeated information in speech to children may be intended to aid comprehension. Thus the given/new distinction may affect the intelligibility of speech to adults and to children differently. Preliminary results indicate that child‐directed speech may be less intelligible than adult‐directed speech, particularly for closed‐class words: Expectations about children's linguistic knowledge apparently affect articulation of different word classes. [Work supported by NIH and a Biomedical Research Support Grant.]

Cues to perceptual normalization of talker differences (A)

Todd M. Morin and Howard C. Nusbaum

J. Acoust. Soc. Am. Volume 86, Issue S1, pp. S100-S100 (1989); (1 page)

Online Publication Date: 13 Aug 2005

To recognize vowels produced by a specific talker, a listener must know something about the talker's vowel space. How do listeners learn about a talker's vowel space? How different must two talkers' vowels be to constitute different vowel spaces? To investigate these questions, listeners monitored sequences of target vowels for a specified target. In one condition, a single talker produced the vowels for each trial. In a second condition, a mix of four talkers produced the vowels for each trial. Previous research demonstrated that vowel recognition is accurate when talkers are mixed, but it requires more attention. The present study compared recognition for whispered and voiced vowels in these conditions to eliminate the use of F0 as a cue for talker normalization. Another experiment examined the effects of talker differences on normalization. Performance for a pair of talkers with similar vowel spaces was compared to performance for a pair of talkers with more disparate vowels. The results of these studies indicate the importance of fundamental frequency in vowel recognition across talkers and that the amount of effort required for normalization of talker differences depends on the similarity of the different talkers' vowel spaces. [Work supported by NIH.]

On the perceptual representation of vowel categories (A)

Keith Johnson

J. Acoust. Soc. Am. Volume 86, Issue S1, pp. S100-S101 (1989); (2 pages)

Online Publication Date: 13 Aug 2005

Subjects' behavior in a vowel perception task was compared with the predictions of two different types of vowel perception models. Both models use auditory spectra generated by a model of the peripheral auditory system [Bladon and Lindblom, J. Acoust. Soc. Am. 69, 1414–1422 (1981)]. In the first model, it is assumed that the spectral information in vowels is represented as a whole spectrum. The spectral characteristics of a vowel category are modeled as a spectral template against which incoming vowel spectra are compared. In the second model, the spectral information in vowels is characterized by the frequency locations of spectral peaks. Here, the spectral component of a vowel category is taken to be information concerning the frequencies of spectral peaks. Two sources of vowel templates for the models were also compared. In one case, templates were derived from the subject's own productions of the vowels in question. In the other, templates were derived from vowels synthesized with average formant values reported by Peterson and Barney [J. Acoust. Soc. Am. 24, 175–184 (1952)]. Subjects were instructed to find the best exemplars of each of 11 different vowel categories from an array of synthetic steady‐state vowel tokens by adjusting the F1 and F2 of the tokens until the best exemplar was located. In all cases, it was found that the whole spectrum approach provided the best fit to the subjects' perceptual judgments. It was also found that model predictions based on the Peterson and Barney vowels provided a better fit to the perceptual data than did predictions using the subject's own productions as the basis for spectral templates. This finding suggests that perceptual vowel categories are not subject to the idiosyncrasies of the individual's own speech production, but rather are a product of the listener's range of the perceptual experience.

The intelligibility of LPC‐vocoded words and sentences presented to native and non‐native speakers of English (A)

Molly Mack

J. Acoust. Soc. Am. Volume 86, Issue S1, pp. S101-S101 (1989); (1 page)

Online Publication Date: 13 Aug 2005

The present study is an extension of previous work [M. Mack, J. Neurolinguist. 3, 293–316 (1988); M. Mack, J. Acoust. Soc. Am. Suppl. 1 81, S4 (1987)] designed to assess the intelligibility of natural and computer‐generated speech presented to native and non‐native speakers of English. In this study, subjects were presented with three tests—the Diagnostic Rhyme Test (DRT), a meaningful sentences test, and a semantically anomalous sentences test. Stimuli were presented in two conditions—natural and LPC (2.4 kbps) vocoded. Subjects were 20 native speakers of English and 20 native speakers of German, fluent in English. Results revealed that the non‐natives performed significantly worse than the natives on all but the natural‐speech DRT; word frequency affected accuracy more for the non‐natives than for the natives; and the non‐natives appeared to exhibit fatigue effects, unlike the natives, who exhibited perceptual learning. Moreover, the effects of test type and condition upon non‐native response accuracy were nonadditive. Implications for models of non‐native speech processing and the use of coded speech systems by non‐native listeners are considered.

Comprehension of natural and synthetic speech (A)

James V. Ralston, John W. Mullennix, Scott E. Lively, Beth G. Greene, and David B. Pisoni

J. Acoust. Soc. Am. Volume 86, Issue S1, pp. S101-S101 (1989); (1 page)

Online Publication Date: 13 Aug 2005

Previous studies comparing the comprehension of natural and synthetic speech passages have produced conflicting results. All of these studies used successive measurement techniques in which subjects' comprehension was assessed after the presentation of a passage. However, comparative studies have found that successive methods are less sensitive than simultaneous “on‐line” measures. Successive measures are also known to be influenced by memory factors. Subjects in the present experiment monitored for word targets while they listened to short passages and then verified statements after each passage. Both monitoring and verification performance for passages of synthetic speech were depressed relative to passages of natural speech. There was a significant interaction between voice and text difficulty in the monitoring latency data, suggesting that both factors affect mechanisms drawing from the same limited processing resources. Finally, there was a significant interaction between voice and sentence type in the verification task. Subjects listening to passages of synthetic speech had relatively poor memory for propositional information compared to surface (word) information. Taken together, the results indicate that comprehension of synthetic speech is poorer and slower than natural speech. Memory for propositional information extracted from passages of synthetic speech is particularly poor. [Work supported by NSF.]

Perception of Hindi retroflex versus dental stops by monolingual speakers of American English (A)

Linda Polka

J. Acoust. Soc. Am. Volume 86, Issue S1, pp. S101-S101 (1989); (1 page)

Online Publication Date: 13 Aug 2005

Cross‐language studies have shown that foreign consonant contrasts vary in the degree of perceptual difficulty that they present adult non‐native listeners. Phonemic, phonetic, and acoustic factors have been considered important in accounting for this variability. These factors were examined by comparing English listeners' perception of the Hindi retroflex versus dental place‐of‐articulation contrast in four different voicing contexts: prevoiced, voiceless aspirated, voiceless unaspirated, and breathy voiced. Differences in the perceptual difficulty of the four Hindi contrasts were predicted based on: (1) phonemic status (the functional status of the contrast in listeners' native phonology), (2) phonetic familiarity (as allophones or free variants), (3) differences in acoustic salience related to voicing, and (4) assimilation strategies. Differences in performance in a categorial AX discrimination task were ordered from most to least errors: prevoiced, voiceless aspirated, breathy‐voiced, and voiceless unaspirated. Perceptual differences were correlated with both acoustic salience of place cues and subjects' descriptions of their assimilation strategies. [Work supported by NICHD and NINCDS.]

Effects of auditory and phonetic training on Americans' discrimination of Hindi retroflex‐dental contrasts (A)

Winifred Strange, Linda Polka, and Manuela C. Aguilar

J. Acoust. Soc. Am. Volume 86, Issue S1, pp. S101-S101 (1989); (1 page) | Cited 1 time

Online Publication Date: 13 Aug 2005

Previous research has shown that the Hindi retroflex‐dental contrast among stop consonants is not easily differentiated by English speakers even after some training. In the present study, subjects were given 4 days of training (768 trials) in a categorial AX discrimination task on full (unmodified) or truncated tokens of naturally produced breathy‐voiced [dha] vs [d̪ha]. Interstimulus intervals (ISI) were either short (550 ms) or long (1500 ms). Conditions favoring auditory‐processing strategies (short ISI and truncated syllables) yielded the most improvement during training. However, pre‐test to post‐test improvement in “phonetic‐level” discrimination of the full syllables was not significantly different across the four training conditions, and there was no transfer to discrimination of the contrast in voiceless aspirated [tha] ‐ [t̪ha] or prevoiced [da]‐[d̪a] stops. Large individual differences found in each training condition suggest that subjects' strategies may be more important than stimulus and task variables in predicting success in perceptual differentiation of non‐native contrasts. [Supported by NINCDS.]

Training methods for the facilitation of Japanese students' perception of American English /r/ and /l/ (A)

Salvatore Miranda, Melva Underbakke, Winifred Strange, and Theodore Micceri

J. Acoust. Soc. Am. Volume 86, Issue S1, pp. S102-S102 (1989); (1 page)

Online Publication Date: 13 Aug 2005

This study investigated the effects of two types of discrimination training on the perception and production of /r/ and /l/ by Japanese students of English. Training stimuli consisted of synthetic syllables drawn from three different ten‐step /r/‐/l/ series: rock‐lock, rake‐lake, and rook‐look. The prototype group received discrimination training using only the endpoint stimuli from each series (the clearest instances of /r/ and /l/). The gradient group received discrimination training that began with the endpoints and proceeded gradually to stimuli near the (English) category boundaries. Control subjects received discrimination training on naturally produced tokens of /b/ vs /v/ (another contrast that Japanese find difficult). Comparisons of pre‐test and post‐test scores showed improvement for both prototype and gradient groups in identification of synthetic speech series and in production, but not in perception of naturally produced minimal pairs. Improvement for the control group was restricted to the perception of the synthetic /r/‐/l/ series. No differences were found in the overall effectiveness of prototype versus gradient training. [Supported by NINCDS.]

Perceptual characteristics of English syllable‐initial /r,l/ for Japanese listeners (A)

Reiko Yamada, Yoh'ichi Tohkura, and Noriko Kobayashi

J. Acoust. Soc. Am. Volume 86, Issue S1, pp. S102-S102 (1989); (1 page)

Online Publication Date: 13 Aug 2005

This study investigates the perceptual characteristics of American English /r,l/ for Japanese listeners using synthesized stimuli. Five major findings are obtained. (1) The Japanese listeners identify the stimuli using a variety of acoustic cues, and their response patterns are strongly influenced by acoustic features of the stimuli. In contrast, the American listeners can identify /r/ and /l/ as long as a primary cue remains, even under the condition where some of the acoustic cues are missing. (2) As the Japanese listeners tend to perceive some stimuli as /w/ more than American listeners do, perception experiments with /w/ as well as /r/ and /l/ for a choice of identification better clarify the perception mode of the Japanese listeners. (3) A positive relationship between the identification ability of the natural /r,l,w/ spoken by native Americans and that of the synthesized /r,l,w/ is found for the Japanese listeners. (4) Contextual effects in words are very strong for the Japanese listeners when trying to identify /r/ and /l/. (5) The Japanese listeners who have lived in English‐speaking countries before a certain age are able to identify /r/ and /l/ as well as native Americans.

Perceptual categorization of synthesized /R‐W/ continua in normal preschool children (A)

Laurie Fitzgerald and Elzbieta B. Slawinski

J. Acoust. Soc. Am. Volume 86, Issue S1, pp. S102-S102 (1989); (1 page)

Online Publication Date: 13 Aug 2005

The purpose of the present study was to examine the development of categorical perception of phoneme boundaries. A seven‐step adult /R‐W/ continuum was synthesized via the Klatt cascade/parallel software program. The acoustic tokens varied according to the second (F2)‐ and third (F3)‐formant onset frequencies and second (F2)‐ and third (F3)‐formant transition rates. Percentage of correct responses on an identification task was computed to yield a measure of phonemic boundary location for adults and children of 3, 4, and 5 years of age. The phonemic boundaries fell between stimuli 3 and 4, at stimulus 4 and 5 for 3‐, 4‐, and 5‐year‐old children and adults, respectively. The finding that there was a shifting and increasing steepness in the phonemic boundaries as a function of age is supportive of previous research and, moreover, of the theory that a child's phonological system is not inherently different from that of an adult but rather is just a simpler or less precise version of the mature system.

Spectral slope as a cue for the perception of breathy and non‐breathy stops in Shanghainese (A)

Nianqi Ren and Ignatius G. Mattingly

J. Acoust. Soc. Am. Volume 86, Issue S1, pp. S102-S102 (1989); (1 page)

Online Publication Date: 13 Aug 2005

In Shanghainese, breathy excitation following release distinguishes category 2 (so‐called “voiced unaspirated”) stops from category 1 (“voiceless unaspirated”) stops. The most obvious acoustic correlate of breathiness is relative spectral slope: During the first 50–100 ms after release, H1‐H2 is greater for category 2 than for category 1 in both word‐initial and word‐medial (morpheme‐initial) stops (Ren, unpublished). To investigate the perception of breathiness, a series of synthetic disyllables was prepared in which the spectral slope during the 100 ms following the release of the word‐medial stop was increased step by step by increasing the value of the “open quotient” in the computation of the glottal waveform—a technique previously used for the synthesis of breathy vowels [C. Bickley, Work. Papers MIT Speech Commun. Lab. 1, 71–81 (1982)]. Other potential cues (closure duration, pulsing during closure, tone contour) were neutralized. Ten Shanghainese speakers labeled the stimuli with greater spectral slope as category 2 and those with smaller spectral slope as category 1. [Work supported by NICHD Grant 01994.]

The separation of two voicelike signals (A)

P. G. Vaidya

J. Acoust. Soc. Am. Volume 86, Issue S1, pp. S102-S102 (1989); (1 page)

Online Publication Date: 13 Aug 2005

If two persons are talking or singing at the same time, human beings can selectively listen to only one of them, even if they are producing notes that are in unison. This paper presents an attempt to simulate this task by a computer. An elementary model of the voices of the two speakers has been used. Using this model, two voicelike signals are generated and mixed. The resulting signal is analyzed for the coherence of the trans‐phase. Trans‐phase is the phase difference between the components at two different frequencies when one of them has been sent through a nonlinear filter. If the trans‐phase shows coherence, the signals are said to be transspectrally coherent. A justification for the assumption of such coherence in voicelike signals has been presented. The data for the transspectral coherence have been sent through a processor that yields an estimate of the individual sound separately.

The role of auditory object formation in sentence perception (A)

Thomas D. Carrell

J. Acoust. Soc. Am. Volume 86, Issue S1, pp. S102-S102 (1989); (1 page)

Online Publication Date: 13 Aug 2005

Previous work has demonstrated that amplitude‐modulated time‐varying sinusoidal (AM‐TVS) replicas of natural sentences were more intelligible than simple time‐varying sinusoidal (TVS) sentences [T. D. Carrell, J. Acoust. Soc. Am. Suppl. 1 84, S158 (1988)]. The goal of the experiments reported here was to determine the cause of this increased intelligibility. One potential explanation was based on the fact that AM‐TVS sentences were rated more natural (i.e., more humanlike) than TVS sentences. This increased naturalness might have increased the probability that speech‐specific analysis was brought to bear in decoding the incoming message. The second potential explanation was based on the idea that the comodulation of the three component tones of a TVS sentence might cause them to be grouped together as an auditory object for further processing. It has been suggested that “…amplitude modulation helps form auditory objects from complex sound fields” [W. A. Yost and S. Sheft, J. Acoust. Soc. Am. 85, 848–857 (1989)]. Evidence will be presented in favor of the second explanation. Specifically, the increase in intelligibility will be shown to be related to the comodulation of the three component tones and possibly to the mechanism underlying comodulation masking release (CMR). These findings suggest that the comodulation of the three tones comprising a TVS sentence creates an auditory object and that auditory object formation is important in fluent speech understanding.