Journal of the Acoustical Society of America

Nov 1988

Volume 84, Issue S1, pp. S2-S224

Session WW. Speech Communication X. Perception (Poster Session)
Contributed Papers

The development of cues to the perception of the [m]‐[n] distinction in CV syllables (A)

Ralph N. Ohde

J. Acoust. Soc. Am. Volume 84, Issue S1, pp. S154-S154 (1988); (1 page)

Online Publication Date: 13 Aug 2005

The contribution of the nasal murmur and vocalic formant transitions to the perception of the [m]‐[n] distinction by adult listeners was investigated for speakers of different ages. Three children, ages 3, 5, and 7, and an adult female and male produced CV syllables consisting of either [m] or [n] and followed by [i,æ,u,ɑ]. Three productions of each syllable were modified according to several waveform editing techniques. Preliminary results of listening tests indicate that the murmur and vocalic transitions provide cues to place of articulation, with the latter property more prominent in perception than the former. The simultaneous presence of murmur and vocalic transition cues improved perception of place of articulation for some syllables, particularly for children's speech. The results will be discussed relative to the role of variability in production and integrated cues to the perception of place of articulation of nasal consonants in speech development. [Work supported in part by Biomedical Research Support Grant No. RR‐05424.]

Auditory‐perceptual analysis of selected syllables (A)

James D. Miller

J. Acoust. Soc. Am. Volume 84, Issue S1, pp. S154-S154 (1988); (1 page)

Online Publication Date: 13 Aug 2005

Analyses of consonant‐vowel syllables (CVs) in terms of the auditory‐perceptual theory of phonetic recognition will be presented. Examples of CVs will include a voiceless stop, a voiceless fricative, a nasal, and an approximant paired with monophthongal vowels. Spectral analyses are used to locate the formant peaks and to track these during the course of the syllable, producing a sequence of spectra, one for each ms of waveform. Formant and F0 information from these sequences is then converted into sensory and perceptual paths in the theory's auditory‐perceptual space. This space contains subspaces, called perceptual target zones. The activation of a zone results in the output of a phonetic code. While the exact conditions for this activation are not yet known, it appears that certain aspects of the behavior of a perceptual path in relation to the perceptual target zones can determine the phonetic transcription of a syllable. [Work supported by NINCDS and AFOSR.]

Evidence for the 3‐Bark integration interval (A)

Brian A. Hanson and Hector Raul Javkin

J. Acoust. Soc. Am. Volume 84, Issue S1, pp. S154-S154 (1988); (1 page)

Online Publication Date: 13 Aug 2005

It is generally accepted that sound energy within a 1‐Bark interval is integrated by the ear. Chistovich et al. [in Frontiers of Speech Communication Research, edited by Lindblom and Ohman (Academic, New York, 1979), pp. 143–157] found evidence of a larger integration interval, of approximately 3 Bark, for vowel perception. Klatt [J. Acoust. Soc. Am. Suppl. 1 77, S7 (1985)] found that listeners could distinguish between vowels whose formant differences were compensated by bandwidth differences (so that cue trading did not occur) and concluded that the 3‐Bark interval was not supported. The present paper analyzes the effect of different intervals on the center of gravity analysis described in Javkin et al. [J. Acoust. Soc. Am. Suppl. 1 82, S81 (1988)] adapted from Chistovich and Chernova [Speech Commun. 5, 3–16 (1986)], and compares that analysis with a perceptual experiment testing the effect of harmonics on the perception of formants. The results support the hypothesis that the minimum interval is about 3 Bark. Our model may explain the lack of cue trading between formants and bandwidths.

Learning to identify phonemic orders (A)

Brad S. Brubaker and Richard M. Warren

J. Acoust. Soc. Am. Volume 84, Issue S1, pp. S154-S154 (1988); (1 page)

Online Publication Date: 13 Aug 2005

Several laboratories have reported that the order of items within repeating vowel sequences cannot be identified at item durations below 100 ms. This is puzzling, since phonemic sequences in speech consist of items that are considerably briefer. The present study demonstrates that listeners can, without feedback or knowledge of results, readily learn to name phonemic orders at item durations even briefer than those of normal speech. The listeners first identified different orders of components in repeating three‐item vowel sequences at a few hundred ms per item (an easy task), and were then required to identify orders of these sequences at regularly decreasing item durations. By a series of successive generalizations to ever shorter items, order was named down to about two glottal pulses/vowel (the shortest duration used). Within the durational range of vowels in speech, our listeners distinguished different arrangements by their resemblances to particular words. Below this range, listeners employed qualitative differences of a nonverbal nature. Implications for theories of speech perception will be discussed. [Work supported by NIH.]

Rate‐dependent perception of interpersonal characteristics (A)

Stanley Feldstein, Faith‐Anne Dohm, and Cynthia L. Crown

J. Acoust. Soc. Am. Volume 84, Issue S1, pp. S154-S155 (1988); (2 pages)

Online Publication Date: 13 Aug 2005

The hypothesis, derived from the similarity‐attraction literature, was that listeners describe speakers in more positive ways when they judge speakers' global speech rates to be similar to their own. Forty‐five male and female listeners judged the global rates of three male and three female speech samples and how those rates compared with their own rates. The speaker of each sample was evaluated in terms of ten unipolar, adjective scales, each of which ranged from 0 to 9, with the higher score having the higher valence. The scores of the ten scales were then averaged to provide a total attribution valence score, and were also divided to provide “competence” and “affability” factor scores. The scores were subjected to appropriate regression analyses that included as independent variables speaker and listener gender and the perceived and actual differences between the speaker and listener rates. In support of the hypothesis, listeners assigned more positive total attribution values to those speakers whose rates were similar to their own, although their gender and the actual differences between their rates and the speakers' rates jointly influenced their evaluations of competence.

Listener experience and perception of voice quality (A)

Jody Kreiman and Bruce R. Gerratt

J. Acoust. Soc. Am. Volume 84, Issue S1, pp. S155-S155 (1988); (1 page)

Online Publication Date: 13 Aug 2005

This study examines the role of listener experience with populations of voices in the perception of vocal quality. Prototype models suggest that the strategy that naive listeners use to evaluate abnormal vocal qualities will differ significantly from that of listeners experienced with dysphonic populations. Listener groups should not differ in perceptual strategy for normal voices, where their experience is equal. The nature of differences in strategy is also examined: as listeners gain experience with populations of voices, do they hear them in more complicated ways, resulting in higher dimensional multidimensional scaling solutions and/or lower r2 values for these solutions? Alternatively, does experience increase the efficiency with which listeners utilize a relatively constant set of perceptual parameters? Six naive and six experienced listeners judged the similarity of 17 dysphonic and 17 normal voices. Separate MDS solutions were found for each listener group for each voice set, and regressions compared the solutions. The relative complexity and efficiency of individual perceptual strategies are also discussed.

Factors affecting the integration of auditory and visual information in speech: The effect of vowel environment (A)

Kerry P. Green, Patricia K. Kuhl, and Andrew N. Meltzoff

J. Acoust. Soc. Am. Volume 84, Issue S1, pp. S155-S155 (1988); (1 page)

Online Publication Date: 13 Aug 2005

In the McGurk effect, observers typically report the illusory syllable /da/ when they hear the auditory syllable /ba/ presented in synchrony with a video display of a talker saying /ga/. While the effect itself has been well established, there is still little research on the conditions under which the effect occurs. In the experiment reported here, the number of illusory /d/ responses to the auditory /b/‐visual /g/ combination is examined in three vowel environments: /a/, /i/, and /u/. The results of this study indicate that the magnitude of the illusion is not the same across different vowel environments. It appears to be strongest for the /i/ vowel, moderate for /a/, and almost nonexistent for /u/. The results thus show that vowel environment is an important factor in determining the magnitude of the McGurk effect, which needs to be considered in accounts of auditory‐visual integration during speech perception. [Work supported by NIH.]

Effects of attention on the phonetic importance of acoustic cues (A)

Peter C. Gordon and Jennifer L. Eberhardt

J. Acoust. Soc. Am. Volume 84, Issue S1, pp. S155-S155 (1988); (1 page)

Online Publication Date: 13 Aug 2005

The effect of attentional capacity on recognizing phonetic segments was studied by having subjects identify synthetic speech sounds while simultaneously performing a visual‐manual tracking task. The difficulty of the tracking task was manipulated in order to vary the amount of attention available for speech perception. Within the speech stimuli, acoustic cues to segment identity were manipulated so that trading relations could be assessed under different levels of attention. The relative importance of the acoustic cues to perceptual identification changed, depending on the difficulty of the concurrent tracking task. It appears that acoustic cues that are easily encoded make an increased contribution to phonetic judgments when listeners are unable to pay close attention to a speech sound. In this way, attention to the speech stimulus is similar to other factors (e.g., environmental interference, stage of development, and hearing loss) that also affect the relative contribution of acoustic cues to phonetic perception. It also appears that the initial stages of speech perception can make use of general, modality‐independent attentional resources. [Work supported by AFOSR.]

Identification of stops in consonant sequences extracted from continuous speech (A)

Lori F. Lamel

J. Acoust. Soc. Am. Volume 84, Issue S1, pp. S155-S155 (1988); (1 page)

Online Publication Date: 13 Aug 2005

At the 114th meeting of the Acoustical Society of America, experiments on the listener's ability to identify singleton stop consonants in syllable‐initial and noninitial position, and stops in noninitial homorganic nasal‐stop clusters, were reported. The stimuli, consisting of portions of speech extracted from continuous sentences, were drawn from a corpus of about 3600 sentences spoken by over 450 talkers. This paper further investigates the effect of intervening consonants on the listener's decision. Ten listeners identified stops in clusters with semivowels. With the exception of /dr/ and /tr/ (which were confused primarily with the /ǰ,č/), the listener's performance (96.1% correct) was comparable to that of singleton stops (97.1%). Another group of listeners identified stops preceded by /s/ or /z/. The listener's performance (overall 88.3%) was dependent upon the identity of the fricative and whether or not the stop was in a cluster with the preceding /s/. Results will be compared with singleton stop identification tasks. [Work supported by DARPA under Contract No. N00014‐82‐K‐0727, monitored through the Office of Naval Research.]

Fundamental frequency provides voicing information even with unambiguous VOTs (A)

D. H. Whalen, Arthur S. Abramson, Leigh Lisker, and Maria Mody

J. Acoust. Soc. Am. Volume 84, Issue S1, pp. S155-S156 (1988); (2 pages)

Online Publication Date: 13 Aug 2005

Earlier work [A. S. Abramson and L. Lisker, in Phonetic Linguistics (1985)] demonstrated that falling fundamental frequency (F0) after a syllable‐initial stop was a cue to voicelessness, and that flat or rising F0 was a cue for voiced stops, but only when the voice onset time (VOT) was ambiguous. The present study replicated that finding with seven VOT values and five onset F0 values. In the first condition, subjects identified the stop as “b” or “p.” Results were nearly identical to the previous experiment. A second condition included not just the stop decision, but a reaction time as well. Here, inappropriate F0 slowed response time even for unambiguous VOTs. A final condition was, like the first, identification without time pressure. Here, it was found that subjects were distinguishing all five levels of F0 onset so that, the lower the onset was, the more “b” responses were obtained in the ambiguous region. Thus F0 contributes to the voicing distinction, even when the categorization is not changed. Also, F0 cues a “voiced” response incrementally as it starts below the F0 of the remainder of the syllable. [Work supported by NIH Grant No. HD‐01994.]

On the role of the fundamental frequency in vowel perception (A)

Tatsuya Hirahara

J. Acoust. Soc. Am. Volume 84, Issue S1, pp. S156-S156 (1988); (1 page)

Online Publication Date: 13 Aug 2005

Vowel identification tests were carried out using 200 synthesized vowel‐like stimuli to examine the role of the fundamental frequency F0 in vowel perception. These stimuli were synthetic versions of the five Japanese vowels, /i/, /e/, /a/, /o/, and /u/, of which the F0 and/or the formant frequencies Fi (i = 1,2,3,4) were modified: ten F0 values were formed by adding n/3 Bark (n = 0,1,…,9) to the original F0. Four formant frequency sets were formed by adding m Bark (m = 0,1,2,3) to the original formant frequencies for each vowel. The results are the following: (1) perceived vowel height articulation shifts upward when the F0 shifts upward, while all formant frequencies remain the same; (2) this shift in vowel height is more distinct for mid and low vowels than for high vowels; and (3) vowel height does not change when the F0 as well as all formant frequencies are shifted upward the same amount along the Bark scale. Further results, along with the hypothesis that a high F0 is regarded as the first formant in middle and low vowel perception, will be discussed.
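The stimulus design in this abstract (five vowels, ten F0 shifts of n/3 Bark, four formant shifts of m Bark) can be sketched as a small grid. This is only an illustrative sketch: the abstract does not say which Hz-to-Bark conversion was used, so the Traunmüller approximation below (without its low-frequency correction) is an assumption, and the names `hz_to_bark`, `bark_to_hz`, and `shift_bark` are hypothetical helpers, not part of the study.

```python
from itertools import product

def hz_to_bark(f_hz):
    # Assumed conversion (Traunmueller approximation, no end corrections).
    return 26.81 * f_hz / (1960.0 + f_hz) - 0.53

def bark_to_hz(z):
    # Algebraic inverse of hz_to_bark above.
    return 1960.0 * (z + 0.53) / (26.28 - z)

def shift_bark(f_hz, delta_bark):
    # Move a frequency upward by delta_bark along the Bark scale.
    return bark_to_hz(hz_to_bark(f_hz) + delta_bark)

VOWELS = ["i", "e", "a", "o", "u"]           # five Japanese vowels
F0_SHIFTS = [n / 3 for n in range(10)]       # n/3 Bark, n = 0..9
FORMANT_SHIFTS = [0, 1, 2, 3]                # m Bark, m = 0..3

# One condition per (vowel, F0 shift, formant shift) combination.
conditions = list(product(VOWELS, F0_SHIFTS, FORMANT_SHIFTS))
print(len(conditions))
```

The grid size, 5 × 10 × 4, matches the 200 stimuli reported in the abstract.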

Frication duration and amplitude rise time as cues to the voiceless fricative/affricate distinction (A)

Margaret A. Walsh, Keith R. Kluender, and Randy L. Diehl

J. Acoust. Soc. Am. Volume 84, Issue S1, pp. S156-S156 (1988); (1 page)

Online Publication Date: 13 Aug 2005

This paper describes the perceptual role of frication duration and amplitude rise time in signaling the fricative/affricate distinction in initial position. Sets of edited natural tokens of /ʃa/ and /tʃa/ were created in which fricative duration and rise time varied orthogonally. In one experiment, frication duration varied from 100 to 210 ms, and rise time was fixed at either 30 or 80 ms. Frication duration proved to be a robust cue for the fricative/affricate distinction, with longer durations yielding more /ʃ/ responses. Moreover, the longer value of rise time shifted the /ʃ/‐/tʃ/ labeling boundary toward shorter frication durations; thus rise time had an enhancing effect on the perception of the fricative‐duration cue. In a second experiment, rise time varied from 20 to 110 ms, while frication duration was fixed at either 140 or 160 ms. Although the fricative‐duration parameter had a reliable effect on the percentage of fricative responses, variation in rise time alone had very little effect. These results are analogous to those of Walsh and Diehl [J. Acoust. Soc. Am. Suppl. 1 82, S80 (1987)], who found that transition duration plays a far more significant role than rise time in signaling the stop/glide distinction. [Work supported by NINCDS.]

Delayed pitch fall in Japanese: Perceptual experiment (A)

Kazue Hata and Yoko Hasegawa

J. Acoust. Soc. Am. Volume 84, Issue S1, pp. S156-S156 (1988); (1 page)

Online Publication Date: 13 Aug 2005

Hasegawa and Hata [J. Acoust. Soc. Am. Suppl. 1 83, S29 (1988)] investigated the delayed pitch fall phenomenon in Tokyo dialect Japanese in production data and examined the relationships of perceived accent to the peak location and the steepness of the falling contour. Pitch fall, the only acoustic correlate of the accent in Japanese, sometimes occurred on the syllable following the accented syllable. A delayed pitch fall tends to be steeper the later it occurs. The present paper examines delayed pitch fall from a perceptual point of view. Three‐syllable synthetic stimuli /ma ma ma/ were prepared with the F0 peak in different locations and different falling slopes. These stimuli were presented to native Japanese subjects in order to determine whether perception and production are correlated in this phenomenon, i.e., whether a change in the location of the peak and the degree of F0 fall in the second /ma/ causes the accent to be perceived on the first syllable. Implications for speech recognition will be discussed.

Some auditory‐visual interactions are post‐categorical (A)

Richard E. Pastore, Jody K. Layer, Robert J. Logan, Stuart A. Tousman, and Hanson Hsu

J. Acoust. Soc. Am. Volume 84, Issue S1, pp. S156-S156 (1988); (1 page)

Online Publication Date: 13 Aug 2005

The McGurk effect recently has been variously conjectured to be based upon a precategorical interaction or integration of information from the auditory and visual representation of the linguistic event. The conjecture of precategorical interactions, which rejects the original notion of post‐categorical interactions, may be in the form of a specialized module for speech or language, or may contribute to a logic‐decision process, either of which leads to a relatively discrete perception of speech. While not disputing such precategorical interactions for language, the research to be described demonstrates post‐categorical interaction of auditory and visual representations of phonemes. The stimuli are nonword CVC syllables. The set of auditory stimuli is edited from natural speech, while the set of visual stimuli includes the orthographic representations of the auditory stimuli syllables. Evidence for interactions in identification of initial phonemes is based upon both error rates and reaction times. [Research supported in part by an NSF grant to the first author.]

Can speech envelopes' modulation spectra be used to support segmental decisions? (A)

M. M. McCormick, R. J. Porter, F. Seitz, and I. M. C. Watson

J. Acoust. Soc. Am. Volume 84, Issue S1, pp. S156-S156 (1988); (1 page)

Online Publication Date: 13 Aug 2005

The auditory system senses the rate, the magnitude, and the phase of temporal modulations of sounds' spectra and amplitudes. Such “modulation sensations” have been proposed as a possible basis for perception of speech segments and their features [see R. J. Porter, in Language Perception and Production, edited by Allport et al. (Academic, London, 1987), Chap. 5]. This preliminary study extends a Russian attempt to analog‐model speech segmentation decisions using speech‐envelope modulation magnitude [Malinnikova et al., Fiziol. Zh. SSSR im. I. M. Sechenova 66, 139–145 (1980)]. In the current study, comparisons are made between human segmentation decisions and a multistage, computer‐based, spectral analysis of the modulations of the amplitude envelopes within different speech spectral regions. [Research supported in part by Alvey Grant No. MMI092.]

Cross‐syllabic‐position failures of adaptation are not due to acoustic‐phonetic cancellation (A)

Arthur Samuel

J. Acoust. Soc. Am. Volume 84, Issue S1, pp. S156-S157 (1988); (2 pages)

Online Publication Date: 13 Aug 2005

Why does adaptation with a syllable‐initial consonant fail to affect perception of the same consonant in syllable‐final position, and vice‐versa? One account of this well‐replicated result invokes a cancellation explanation [Pisoni and Tash, Percept. Psychophys. 18, 401–408 (1975)]: With the place of articulation stimuli used, the pattern of formant transitions switches with syllabic position, allowing putative phonetic level effects to be opposed by putative acoustic level effects. Three experiments to be reported tested the cancellation hypothesis by preempting the possibility of acoustic countereffects. In experiment 1, the test syllables and adaptors were /r/‐/l/ CVs and VCs that do not produce canceling formant patterns across syllabic position. In experiment 2, /b/‐/d/ continua were used in a paired‐contrast procedure, believed to be sensitive to phonetic, but not acoustic, identity. In experiment 3, cross‐ear adaptation, also believed to tap phonetic rather than acoustic processes, was used. All three experiments refuted the cancellation hypothesis. Instead, it appears that the perceptual process treats syllable‐initial consonants and syllable‐final ones as inherently different. [Work supported by AFOSR.]

The perceptual distance between synthetic /s/ and /ʃ/ syllables in different vocalic contexts (A)

Margaret F. Cheesman and Dianne J. Van Tasell

J. Acoust. Soc. Am. Volume 84, Issue S1, pp. S157-S157 (1988); (1 page)

Online Publication Date: 13 Aug 2005

A nine‐step, synthetic, /s/‐/ʃ/ continuum was crossed with a five‐step continuum of /i/ to /u/‐like vowels to form 45 CV syllables. Thirteen listeners categorized the fricative portion of these syllables as /s/ or /ʃ/ in a 2AFC task. Identification judgments were strongly influenced by the vowel context; for example, more fricatives were identified as /s/ when they preceded /u/ than when they preceded /i/. The same group of listeners used a triadic comparison procedure [Levelt et al., Br. J. Math. Stat. Psychol. 19, 163–179 (1966)] to provide estimates of the perceptual distances between a subset of the fricatives in different vowel contexts. Perceptual distances were smaller for within‐category pairs than they were for across‐category pairs. Relations between the perceptual distance and the acoustic similarity of stimuli will be discussed. [Supported by NIH Grant No. NS12125, the Bryng Bryngelson Research Fund, and an SSHRC doctoral fellowship to MFC.]

Discrimination of level versus nonlevel pitch contours (A)

George D. Allen

J. Acoust. Soc. Am. Volume 84, Issue S1, pp. S157-S157 (1988); (1 page)

Online Publication Date: 13 Aug 2005

Phonetic transcription of intonation often requires a decision as to whether the pitch contour is rising, falling, or level. Although automatic procedures can measure voice fundamental frequency (F0) contours accurately, it is probably more appropriate to categorize them on the basis of what one's ears can discriminate. The fact that glottal periodicity includes jitter adds to the uncertainty of this discrimination. Pulse trains with level and rise‐fall F0 contours were presented for discrimination by listeners in a 2I‐FC paradigm. Within each pair, the F0 of the nonlevel began low, rose linearly for half the total duration, and then fell linearly to the end. Mean F0 was thus equal for the two items. Preliminary results using one duration (600 ms), three mean F0 levels (100, 125, and 150 Hz), three ΔF0 values (1%, 2%, and 4%), and four jitter amounts (0%, 0.5%, 1%, and 2%) indicate that discrimination increases with increasing ΔF0 and decreases with increasing jitter. Both trends were quite linear across the range of parameters employed in this preliminary study. The PEST procedures are now being used to map these relationships more precisely.
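The rise‐fall contour described in this abstract (low start, linear rise to the midpoint, linear fall to the end, same mean F0 as the level item) can be illustrated with a short sketch. Several details are assumptions, since the abstract does not give them: ΔF0 is treated here as the total excursion expressed as a fraction of the mean, the mean is taken over time in Hz, jitter is not modeled, and `rise_fall_f0` is a hypothetical helper name.

```python
def rise_fall_f0(mean_f0, delta_frac, n_points=600):
    """Return a symmetric rise-fall F0 contour (Hz) sampled at n_points.

    Assumption: delta_frac is the low-to-high excursion as a fraction of
    mean_f0, so the contour runs from mean*(1 - d/2) up to mean*(1 + d/2).
    """
    lo = mean_f0 * (1.0 - delta_frac / 2.0)
    hi = mean_f0 * (1.0 + delta_frac / 2.0)
    contour = []
    for k in range(n_points):
        t = (k + 0.5) / n_points          # normalized time, midpoint sampling
        tri = 1.0 - abs(2.0 * t - 1.0)    # near 0 at the ends, near 1 at the middle
        contour.append(lo + (hi - lo) * tri)
    return contour

# Parameter values listed in the abstract.
MEAN_F0S = [100.0, 125.0, 150.0]          # Hz
DELTA_F0S = [0.01, 0.02, 0.04]            # 1%, 2%, 4%
JITTERS = [0.0, 0.005, 0.01, 0.02]        # listed only; not applied here

contour = rise_fall_f0(125.0, 0.04)
mean_of_contour = sum(contour) / len(contour)
n_conditions = len(MEAN_F0S) * len(DELTA_F0S) * len(JITTERS)
```

Because the triangle is symmetric about the midpoint, its time average equals the midpoint between the low and high values, which is why the nonlevel item can share its mean F0 with the level item.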

Identification of Polish words with non‐neutralized word‐final segments (A)

Louisa M. Slowiaczek and Helena Szymanska

J. Acoust. Soc. Am. Volume 84, Issue S1, pp. S157-S157 (1988); (1 page)

Online Publication Date: 13 Aug 2005

Research examining the phonetic characteristics of a number of neutralization rules has found that underlying contrasts that should be neutralized are phonetically preserved. In particular, earlier results [L. Slowiaczek and D. Dinnsen, J. Phonet. 13, 325–341 (1985)] found evidence that the rule of word‐final devoicing in Polish is not neutralized. The present investigation extended these production results by testing whether the acoustic measures identified in productions from the original study are functional in perception. Native Polish and English listeners identified Polish monosyllabic words using a two‐alternative forced‐choice procedure. Results for the Polish subjects indicated a bias to choose the voiceless alternative and suggested that Polish listeners are unable to perceive differences in the minimal pairs examined in the production study. [Work supported by a small grant from Loyola University of Chicago.]

Auditory memory in phonetic and nonphonetic judgments of vowels (A)

Sumi Shigeno

J. Acoust. Soc. Am. Volume 84, Issue S1, pp. S157-S157 (1988); (1 page)

Online Publication Date: 13 Aug 2005

The present study compares the magnitudes of context effects in the perception of phonetic and nonphonetic features of stationary vowels. The stimuli were synthetic vowels on the [u]‐[a] continuum generated by varying simultaneously F1 (442–586 Hz) and F0 (118–154 Hz). Thus vowel [a] had a higher pitch than [u]. Category boundaries were determined for the following three conditions: (1) isolated vowels presented in random sequence, (2) vowels with a preceding context stimulus [u], and (3) with white noise inserted between the context stimulus and the target stimulus. When the ISI between the context and the target was 2.0 s, the context effect was contrasted in the case of phonetic judgment, and its magnitude was reduced by the insertion of white noise. It was assimilated in the nonphonetic judgment, and its magnitude was increased by the insertion of noise. When the ISI was 0.5 s, the magnitude of the assimilation in the nonphonetic judgment was not increased by the insertion of noise. These results suggest that the reduction of contrast was not due to a decrement in auditory memory, which can be considered as a fast‐decaying component.

Effects of syllable duration on the perception of Mandarin tones: A cross‐language study (A)

Deborah L. Blicher, Randy L. Diehl, and Leslie B. Cohen

J. Acoust. Soc. Am. Volume 84, Issue S1, pp. S157-S157 (1988); (1 page)

Online Publication Date: 13 Aug 2005

Mandarin syllables carrying a high‐rising F0 contour (tone 2) tend to be produced shorter than those carrying a low‐falling‐rising contour (tone 3). It is suggested that, even for syllables with the same physical duration, the more complicated F0 structure of tone 3 makes it appear longer than tone 2. Talkers may therefore produce an actual length difference to enhance the apparent length difference. This hypothesis was tested by comparing perceptual judgments by native Mandarin and native English speakers on various series of Mandarin tones. A short‐stimulus (350 ms) series and a long‐stimulus (450 ms) series were synthesized for each of the syllables /bi/, /ba/, and /bu/. Each series was generated by incrementally interpolating the F0 contour between a tone 2 and a tone 3 exemplar, both of which were Mandarin morphemes or words. Subjects (both Mandarin‐ and English‐speaking) were first trained with feedback to assign the short and long series‐endpoint stimuli to two categories based on F0 contour alone. Next, subjects identified the entire stimulus series (both long and short) on the basis of the two training categories. For both Mandarin‐ and English‐speaking subjects, a longer syllable duration shifted the labeling boundary reliably toward fewer tone 2 (i.e., more tone 3) responses. The parallel boundary shifts suggest that length variation enhances the perceptual distinction between tones 2 and 3, probably by reinforcing what is already a difference in apparent length. [Work supported by NINCDS.]

Phoneme processing in the perception of spoken Japanese words (A)

Shigeaki Amano

J. Acoust. Soc. Am. Volume 84, Issue S1, pp. S158-S158 (1988); (1 page)

Online Publication Date: 13 Aug 2005

Cohort theory, a speech perception model based on serial phoneme processing, is studied in respect to the lexical access of Japanese words. Six subjects made a lexical decision by pushing a key as soon as possible after hearing real Japanese words or nonwords that were 2‐5 mora long and had various nonword discrimination points. A nonword discrimination point was defined as a point where no other real word candidates could be found in a cohort by serial phoneme processing. The results show that the reaction times measured from the nonword discrimination point are not constant. The nearer the nonword discrimination point is to the end of the word, the shorter the reaction times are. This indicates the existence of parallel phoneme processing. Moreover, the reaction times for short words are longer than for long words when they are measured from the end of the word. This shows that short words are buffered somewhere in the word perception process until the lexical decision is made. However, there is also evidence of serial processing. The reaction times from word onset gradually increase along with time to the nonword discrimination point. These results suggest that phoneme processing is quasiserial. Parallel or buffered processing should be incorporated into cohort theory along with serial processing.

Naturalness and intelligibility of amplitude modulated time‐varying sinusoidal speech (A)

Thomas D. Carrell

J. Acoust. Soc. Am. Volume 84, Issue S1, pp. S158-S158 (1988); (1 page)

Online Publication Date: 13 Aug 2005

Synthetic speech has been used for decades to test theories regarding human speech perception. Typically, the speech has been constructed to sound as natural as possible given the constraints of the experiment. An alternative strategy has been to examine the perception of intentionally impoverished stimuli such as time‐varying sinusoid (TVS) replicas of speech [Remez et al., Science 212, 947–950 (1981)]. The TVS signals consist of three tones whose frequencies mimic the formant center frequencies of a natural sentence. They exhibit few of the acoustic properties of natural speech. It has been demonstrated that TVS signals sound extremely unnatural although they are surprisingly intelligible. The goal of the present experiment was to determine some of the general acoustic characteristics of signals that are important for speech perception. This was accomplished by examining the perceptual consequences of adding simple temporal and spectral information to TVS sentences. The TVS signals were amplitude modulated at 100 Hz in order to give them more speechlike acoustic characteristics without giving them fundamental frequencies or harmonic structures. The modulation greatly improved the phonetic intelligibility of the acoustically sparse TVS signal. The modulated signal was also significantly more natural sounding to listeners than the unmodulated TVS signal. Performing this operation on natural speech, however, caused a decrement in intelligibility.
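The idea of amplitude modulating a three‐tone TVS signal at 100 Hz can be illustrated with a minimal sketch. Only the 100‐Hz modulation rate and the use of three tones come from the abstract. The sample rate, duration, linear frequency glides, and raised‐cosine modulator shape are all assumptions chosen for illustration, and `TONE_TRACKS` and `synthesize` are hypothetical names.

```python
import math

SAMPLE_RATE = 16000    # Hz (assumed)
DURATION = 0.5         # seconds (assumed)
MOD_RATE = 100.0       # Hz, modulation rate given in the abstract

# Hypothetical (start, end) frequencies in Hz for three gliding tones that
# stand in for formant center-frequency tracks.
TONE_TRACKS = [(500.0, 700.0), (1500.0, 1200.0), (2500.0, 2600.0)]

def synthesize(modulated=True):
    n = int(SAMPLE_RATE * DURATION)
    phases = [0.0] * len(TONE_TRACKS)
    out = []
    for i in range(n):
        frac = i / n
        sample = 0.0
        for k, (f_start, f_end) in enumerate(TONE_TRACKS):
            # Linear glide; phase is accumulated so frequency changes smoothly.
            freq = f_start + (f_end - f_start) * frac
            sample += math.sin(phases[k])
            phases[k] += 2.0 * math.pi * freq / SAMPLE_RATE
        sample /= len(TONE_TRACKS)
        if modulated:
            # Raised-cosine amplitude modulator in [0, 1] (assumed shape).
            t = i / SAMPLE_RATE
            sample *= 0.5 * (1.0 + math.cos(2.0 * math.pi * MOD_RATE * t))
        out.append(sample)
    return out

signal = synthesize()
```

Multiplying by a 100‐Hz modulator adds sidebands 100 Hz above and below each tone rather than a harmonic series, which is in keeping with the abstract's remark that the signals gain temporal structure without a fundamental frequency or harmonic structure.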

The role of attention in speech perception (A)

Bertram Scharf, Huanping Dai, and Joanne L. Miller

J. Acoust. Soc. Am. Volume 84, Issue S1, pp. S158-S158 (1988); (1 page)

Online Publication Date: 13 Aug 2005

This study investigated the role of auditory attention during speech perception. The syllables /da/ and /ga/ were synthesized so that they differed in initial burst and third formant transition; the critical distinguishing information was in the vicinity of 2.5 kHz. Discrimination was first measured with a 1I, 2AFC procedure under six masking conditions. Performance was near chance (55% correct) when the masker was centered at 2.5 kHz, but increased as the masker moved away from this critical frequency region, reaching 100% with the masker at 1 kHz. Next examined was whether listeners' attention as they performed the task was focused specifically on the 2.5‐kHz region, or spread across all frequency regions. In one condition, subjects were asked to discriminate the syllables when a weak 90‐ms, 1‐kHz tone was added to /da/ and, in the other condition, when a weak 90‐ms, 2.5‐kHz tone was added to /da/; subjects were not informed that the tones had been added. In both conditions, the masker was centered at 2.5 kHz. Performance was at 58% when the 1‐kHz tone was added, but at 75% when the 2.5‐kHz tone was added. In two control conditions, it was found that, when subjects were informed that the tones had been added, so that they could focus their attention on the relevant frequency regions, performance increased substantially (to 90% correct) for the 1‐kHz condition, but only slightly (to 81% correct) for the 2.5‐kHz condition. These results suggest that, when attempting to discriminate syllables, listeners focus their attention on the specific frequency region critical to the distinction. [Work supported by NIH.]

The neural coding of relational invariance in speech: Human language analogs to the barn owl (A)

Harvey M. Sussman

J. Acoust. Soc. Am. Volume 84, Issue S1, pp. S158-S158 (1988); (1 page)

Online Publication Date: 13 Aug 2005

This paper presents a brain‐based perspective on the noninvariance problem in speech research, namely, that physically different auditory stimuli come to elicit an invariant perception of a phoneme category. Extrapolating from the sound localization system in the barn owl, two speculative human models are offered that account for the emergence of relational invariance for (1) bilabial and alveolar stop consonants across vowel contexts, and (2) vowel identity across speakers with different sized vocal tracts. The barn owl's extraction of interaural time differences (ITDs) to signal azimuth is based on disambiguating interaural phase information coded by frequency specific neurons. Only by spanning across a broad frequency spectrum can a neuronal functional array signal an unambiguous ITD to higher centers. A similar principle is invoked to model both stop consonant place invariance and vowel normalization in human speech processing. Formant manipulation metrics, capable of distinguishing contrastive phonemic categories, are described to illustrate the operational features of the modeling scheme.
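
The phase-ambiguity principle mentioned above can be sketched numerically: a single frequency channel reports interaural phase only modulo one cycle, so it is consistent with several ITDs, and intersecting the candidates across channels leaves one. This is a toy illustration with hypothetical frequencies, phases, and ITD range, not a model of owl physiology or of the author's scheme.

```python
import math


def candidate_itds(phase_cycles, freq_hz, max_itd_s):
    """List every ITD within +/- max_itd_s that is consistent with an
    interaural phase difference (in cycles) observed at freq_hz."""
    period = 1.0 / freq_hz
    base = phase_cycles * period
    k_min = math.ceil((-max_itd_s - base) / period)
    k_max = math.floor((max_itd_s - base) / period)
    return [base + k * period for k in range(k_min, k_max + 1)]


def resolve_itd(observations, max_itd_s, tol_s=1e-6):
    """Keep only the candidate ITDs shared by every frequency channel.

    observations: list of (phase_cycles, freq_hz) pairs.
    """
    first, *rest = observations
    common = candidate_itds(first[0], first[1], max_itd_s)
    for phase_cycles, freq_hz in rest:
        others = candidate_itds(phase_cycles, freq_hz, max_itd_s)
        common = [c for c in common
                  if any(abs(c - o) <= tol_s for o in others)]
    return common


# Hypothetical channels: phases produced by a 100-microsecond ITD.
channels = [(0.4, 4000.0), (0.5, 5000.0), (0.7, 7000.0)]
single_channel = candidate_itds(0.4, 4000.0, 500e-6)
resolved = resolve_itd(channels, 500e-6)
```

In this example the 4-kHz channel alone admits four candidate ITDs, while combining the three channels leaves only the one near 100 microseconds.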