Journal of the Acoustical Society of America

Nov 1989

Volume 86, Issue S1, pp. S1-S125

Session S. Speech Communication IV: Memorial Session for Dennis Klatt
Invited Papers

Acoustics and synthesis of fricative consonants (A)

Kenneth N. Stevens

J. Acoust. Soc. Am. Volume 86, Issue S1, pp. S47-S47 (1989); (1 page)

Online Publication Date: 13 Aug 2005

This paper attempts to integrate some recent research of Dennis Klatt on the analysis and synthesis of fricative consonants with further experimental and theoretical studies of fricative production. The speech production studies examine data on airflows and pressures during voiced and voiceless fricatives, and estimate from these data the time variation of the areas of the glottal and supraglottal constrictions and the spectra of the sound sources at these constrictions. These calculations are based on theoretical and experimental data on airflow in constricted tubes and on sound generation in turbulent flow. Acoustic spectra in a number of utterances containing fricatives in various vowel environments are measured at critical points within the utterances, and are interpreted in terms of the production studies. Particular attention is paid to events near the consonant‐vowel and vowel‐consonant boundaries, where the dominant source changes from frication noise to aspiration noise to glottal vibration. Based on this research, new synthesis rules for fricative consonants are proposed. [Research supported in part by NIH Grant No. NS‐04332.]

Dennis Klatt's contribution to automatic speech recognition (A)

Victor W. Zue

J. Acoust. Soc. Am. Volume 86, Issue S1, pp. S47-S47 (1989); (1 page)

Online Publication Date: 13 Aug 2005

Over the past 20 years, Dennis Klatt has made enormous contributions to the field of automatic speech recognition through his research, writing, and student supervision. In the early 1970s, he served as a member of the steering committee of the ARPA Speech Understanding Research (SUR) program, providing leadership and guidance to the research community. He also participated actively in speech recognition research, first performing a set of spectrogram reading experiments assessing the role of various sources of knowledge, and later investigating the use of synthesis‐by‐rule techniques for word verification. Out of this involvement with the ARPA‐SUR program came the landmark paper reviewing its technical achievement, as well as several publications describing his own proposals, LAFS and SCRIBER, for human and machine speech recognition. Over the past 10 years, Dennis directed his attention to the design of signal representation front‐ends, as well as the investigation of perceptually motivated distance metrics in order to implement his speech recognition models. This talk pays tribute to Dennis's incredibly active research life by examining the legacy he left behind in speech recognition.

Duration models and segmental quality in a text‐to‐speech system (A)

Rolf Carlson and Björn Granström

J. Acoust. Soc. Am. Volume 86, Issue S1, pp. S47-S48 (1989); (2 pages)

Online Publication Date: 13 Aug 2005

We have had the privilege of working together with Dennis Klatt for many years. In our presentation we will refer to some of Klatt's work that has had an influence on our own work. Modeling of segmental duration was a central part of Klatt's work during the 1970s. This work resulted in a duration model in 1979 that captures many of the basic effects found in speech. This model has been used as a framework in the KTH text‐to‐speech system. The use of quantity in Swedish demands expansions of the model. The Swedish study has been done in the context of a speech database from different speakers reading different text materials. In Klattalk, Klatt [J. Acoust. Soc. Am. 82, 737–793 (1987)] addressed all levels in a text‐to‐speech system, but special effort was placed on a generally improved segmental quality. The quality of the best speech synthesis is, however, still far from that of human speech. Some recent efforts to improve the segmental intelligibility in our system will be described. This includes experiments with new synthesis strategies with an emphasis on modeling contextual variability. Analysis and synthesis of positional variants of the Swedish consonants are reported, and new strategies for synthesis are discussed. Based on analysis of the speech database, consonant rules affecting both source and resonator features were formulated and tested. Special efforts were made to handle the realization of consonant clusters. In this development work, diagnostic tests were used at regular intervals. Results from the last years' evaluation will be reported and discussed.
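
As a rough illustration of the kind of duration model the abstract refers to, the sketch below uses the widely cited form of Klatt's rule, DUR = MINDUR + (INHDUR - MINDUR) × PRCNT / 100, in which each applicable rule scales the compressible part of a segment by a percentage. The function name and the millisecond values are hypothetical placeholders; they are not taken from the abstract or from the KTH system, and the Swedish quantity extensions are not modeled.

```python
def klatt_duration(inherent_ms, minimum_ms, rule_percents):
    """Return a segment duration in ms under a Klatt-style duration rule.

    inherent_ms   -- assumed inherent duration of the segment (hypothetical value)
    minimum_ms    -- assumed minimum (incompressible) duration (hypothetical value)
    rule_percents -- percentages contributed by the rules that apply,
                     e.g. [50] for a single rule that shortens to 50%
    """
    prcnt = 100.0
    for p in rule_percents:
        # Successive rules combine multiplicatively.
        prcnt = prcnt * p / 100.0
    # Only the part above the minimum duration is stretched or compressed.
    return minimum_ms + (inherent_ms - minimum_ms) * prcnt / 100.0


if __name__ == "__main__":
    # Hypothetical vowel: 200 ms inherent, 80 ms minimum.
    print(klatt_duration(200.0, 80.0, []))        # no rules apply
    print(klatt_duration(200.0, 80.0, [50.0]))    # one shortening rule
    print(klatt_duration(200.0, 80.0, [50.0, 50.0]))
```

Because only the portion above the minimum duration is scaled, repeated shortening rules approach the minimum without going below it.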

Perceptual evaluation of MITalk and DECtalk (A)

David B. Pisoni

J. Acoust. Soc. Am. Volume 86, Issue S1, pp. S48-S48 (1989); (1 page)

Online Publication Date: 13 Aug 2005

In the spring of 1979, we began the first of what would eventually become many dozens of behavioral studies on the perceptual evaluation of synthetic speech produced automatically by rule. The MITalk text‐to‐speech system was just nearing completion at MIT. At that time, I spent many hours with Dennis Klatt talking about and planning a variety of perceptual tests to evaluate the MITalk system. In this paper, I will first summarize the initial results obtained with the MITalk system in 1979. Then I will describe the evaluation of DECtalk. I believe that one of the reasons that DECtalk has consistently shown such high levels of segmental intelligibility, levels often approaching those observed with natural speech, was Dennis' intense fascination with the results of our error analyses of the MRT that he used to selectively modify, refine, and improve the quality of the synthetic speech produced by DECtalk. In working with Dennis, it became clear to me that one of his major goals was to develop a system that would produce the very best quality synthetic speech possible. The data from additional perceptual tests demonstrate clearly that Dennis was successful in achieving his goals for DECtalk. DECtalk remains the standard against which all other text‐to‐speech systems are compared. [Work supported by NSF.]

Adults and infants show a “prototype effect” for speech sounds (A)

Patricia K. Kuhl

J. Acoust. Soc. Am. Volume 86, Issue S1, pp. S48-S48 (1989); (1 page)

Online Publication Date: 13 Aug 2005

Dennis Klatt provided valuable assistance to many speech perception researchers. His advice on the synthesis of speech signals was particularly helpful, and investigators relied on his expertise when problems arose in the development of stimuli designed to approach a problem in a new way. This was the case in the preparation of stimuli to test whether there are “prototypes” for vowel sounds. The experimental question was whether or not adults and infants responded differently in a within‐category vowel discrimination task when the “standard” stimulus was an exceptionally good instance of the vowel /i/ (a prototype of the category) as opposed to a nonprototypic /i/ vowel. The results showed that when the prototype of the category served as the standard stimulus, it was more difficult to hear differences between it and novel /i/ vowels than it was to hear differences between the nonprototypic stimulus and novel /i/ vowels. In other words, the prototype was perceived to be more similar to new members of the category than was the nonprototypic stimulus. The effect was observed both in adults and in 6‐month‐old infants. Described first will be the hypothesis underlying the prototype test, the method used to construct the stimuli (the test stimuli for the prototype and the nonprototype were scaled psychophysically using the “mel” scale), and the results on English adults and infants. Described next will be the second phase of the research program, which entails cross‐language tests on Swedish adults and infants. The cross‐language tests are designed to assess two different explanations for the prototype effect: (1) that particular vowel stimuli (/a,i,u/) are inherently more resistant to the effects of articulatory/acoustic change (Stevens' quantal theory) and (2) that the effects observed in American adults and infants are attributable to experience in listening to English, even in the first 6 months of life.
Careful preparation of the English and Swedish stimuli was critical in designing these experiments. Dennis Klatt was always ready to provide advice on such matters; his assistance is gratefully acknowledged and sorely missed. [Work supported by NIH.]
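
The abstract says the test stimuli were scaled using the mel scale but does not say which formula was used. As a rough illustration only, the sketch below uses one common analytic approximation, mel = 2595 log10(1 + f/700), to place frequencies at equal mel steps between two endpoints. The function names and the example frequencies are hypothetical and are not taken from the study.

```python
import math


def hz_to_mel(f_hz):
    """Convert frequency in Hz to mels (one common approximation, assumed here)."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)


def mel_to_hz(m):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)


def equal_mel_steps(start_hz, stop_hz, n):
    """Return n frequencies in Hz, equally spaced on the mel scale."""
    if n < 2:
        raise ValueError("n must be at least 2")
    start_mel = hz_to_mel(start_hz)
    stop_mel = hz_to_mel(stop_hz)
    step = (stop_mel - start_mel) / (n - 1)
    return [mel_to_hz(start_mel + i * step) for i in range(n)]


if __name__ == "__main__":
    # Hypothetical formant range, chosen only to show the spacing.
    for f in equal_mel_steps(250.0, 400.0, 5):
        print(round(f, 1), "Hz =", round(hz_to_mel(f), 1), "mel")
```

Spacing stimuli in mels rather than hertz aims to make neighboring stimuli roughly equally distant perceptually, which is the property a within‐category discrimination test depends on.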
Contributed Papers

Perceptual and acoustic characteristics of distorted /r/ (A)

Ralph N. Ohde, Michael E. McCarver, and Donald J. Sharf

J. Acoust. Soc. Am. Volume 86, Issue S1, pp. S48-S49 (1989); (2 pages)

Online Publication Date: 13 Aug 2005

One type of articulation disorder is a sound distortion that has been defined as an allophonic variation within the perceptual boundary of a target phoneme. An established finding in speech perception is that sounds are more accurately identified across sound categories than within sound categories. In order to determine if distorted /r/ could be accurately and reliably perceived, six speech pathologists identified the productions of prevocalic /r/ and /w/ words of 12 children diagnosed as having an /r/ misarticulation. The results of the identification tests revealed a relatively high average distorted /r/ category of 70% or better for four children. Moreover, intrasubject reliability scores for these distorted /r/ children averaged 80% or better. Preliminary findings of spectrographic analyses of formant transition onsets show that F3 onsets of distorted /r/ are substantially higher than F3 onsets of /r/ for normal and synthetic versions of children's speech. [Work supported by Biomedical Research Support Grant No. RR‐05424.]

Stress shift as the placement of phrase‐level pitch markers (A)

Stefanie Shattuck‐Hufnagel

J. Acoust. Soc. Am. Volume 86, Issue S1, pp. S49-S49 (1989); (1 page)

Online Publication Date: 13 Aug 2005

The F0 contours produced by the text‐to‐speech conversion program Klattalk [D. H. Klatt, J. Acoust. Soc. Am. 82, 737–793 (1987)] are based on a translation [S. Maeda, RLE‐QPR 114, 193–211 (1974)] for American English of the “hat pattern” approach developed for Dutch [J. 't Hart and A. Cohen, J. Phon. 1, 309–327 (1973)]; this approach is similar to an earlier description by Mattingly [I. G. Mattingly, Supplement to Haskins Laboratory Status Report on Speech Research, 1–223 (1968)]. One question that arises for this view of F0 patterns is how it might deal with the phenomenon of “stress shift”: For some speakers, the prosodic prominence on the main‐stress syllable of words like “thirteen” and “Mississippi” is perceived to move to an earlier syllable when these words appear in phrases like “thirteen men” and “Mississippi mud.” This paper will report pitch and duration measurements designed to evaluate the hypothesis that at least some aspects of the stress shift phenomenon can be described as the simple placement of the onset rise of a hat pattern on an early syllable of the prosodic phrase.

Synthetic speech audiometry (A)

Corine Bickley and Gerald Kidd

J. Acoust. Soc. Am. Volume 86, Issue S1, pp. S49-S49 (1989); (1 page)

Online Publication Date: 13 Aug 2005

A new hearing test is being developed that is based on presenting to listeners sets of synthesized words with well‐defined acoustic properties. The test is based in part on work by Gòsy et al. [Proc. 11th ICPS, Tallinn (1987)], and its aim is to estimate a listener's hearing sensitivity from errors in word discrimination. Sets of words have been synthesized (using the Klatt synthesizer) that differ from each other by one or two phonemes (e.g., sit, sat, fat, fit). The synthesis was guided by two goals. (1) Each word should differ from another word in a set by only one acoustic feature; the primary difference must be limited to a specific frequency band (e.g., sit versus sat differ by the frequency of the first formant). (2) The synthesized words should be highly intelligible to normally hearing listeners in a quiet environment. Initial results were obtained by presenting the synthesized words combined with white noise of various levels in a forced‐choice paradigm to normally hearing subjects. As expected, word discriminability was correlated with the salience of the acoustic feature that distinguishes the word relative to the added noise. The feasibility of using synthesized word sets of this type to detect and estimate the severity of hearing impairment will be discussed. [Work supported by a grant from NIH.]

Acoustic properties of /h/ (A)

Sharon Y. Manuel and Kenneth N. Stevens

J. Acoust. Soc. Am. Volume 86, Issue S1, pp. S49-S49 (1989); (1 page)

Online Publication Date: 13 Aug 2005

The aim of this paper is twofold: (1) to investigate the physical mechanisms of sound generation for the consonant /h/ and (2) to examine the timing of supraglottal and glottal movements of /h/. Utterances in which /h/ was present or absent (e.g., “new heart” versus “new art”) were analyzed acoustically and contrasted. The corpus consisted of about 20 such utterances repeated several times by three speakers. The acoustic data showed evidence of breathy voicing at the /h/‐vowel boundary in all cases, and that generation of turbulence noise during the consonant occurred both in the vicinity of the glottis (aspiration noise) and the vicinity of the supraglottal constriction (frication noise). The relative contribution of the two noise sources depended on the vowel, with greater frication noise occurring for high vowels. When an /h/ was in position between two vowels or glides, it generally added little or no duration to the utterance, relative to the contrasting utterance with no /h/. Implications for the phonological status of /h/ are discussed. [Work supported by NIH grants to MIT.]

Perception of some consonant contrasts in noise (A)

Abeer Alwan

J. Acoust. Soc. Am. Volume 86, Issue S1, pp. S49-S49 (1989); (1 page)

Online Publication Date: 13 Aug 2005

The goal of the present study is to examine the acoustic properties that listeners use to distinguish between speech sounds when these sounds are presented in noise. A series of perceptual experiments was conducted using natural stimuli consisting of nonsense CV syllables, where C was either /m/, /n/, /b/, or /d/, and V was either /ɑ/, /iy/, or /ow/. The stimuli were degraded by adding various levels of white noise and were presented to subjects in identification tests. Preliminary results show that when the noise is at a level such that the transition of the second formant frequency of the vowel is masked, confusions between the place of articulation for the stimuli occur. Noise levels for which confusions in manner of articulation occur can also be predicted from masking theory. These results are compared with results reported earlier [e.g., Miller and Nicely (1954)] where the thresholds of identification were described in terms of the signal‐to‐noise ratio. These results will be discussed further in terms of the acoustic theory of speech production and the masking theory of the auditory system. [Work supported in part by an NIH grant.]
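
The degradation step described in this abstract, adding white noise at various levels, can be sketched as follows. This is a minimal illustration under stated assumptions, not the procedure used in the study: the function name, the Gaussian noise source, the fixed random seed, and the synthetic sine "stimulus" are all hypothetical, and the level is expressed as an overall signal‐to‐noise ratio in dB.

```python
import math
import random


def add_white_noise(signal, snr_db, seed=0):
    """Return signal plus white Gaussian noise scaled to the requested SNR in dB."""
    rng = random.Random(seed)
    signal_power = sum(x * x for x in signal) / len(signal)
    noise = [rng.gauss(0.0, 1.0) for _ in signal]
    noise_power = sum(n * n for n in noise) / len(noise)
    # SNR(dB) = 10 log10(P_signal / P_noise), solved for the noise power.
    target_noise_power = signal_power / (10.0 ** (snr_db / 10.0))
    scale = math.sqrt(target_noise_power / noise_power)
    return [x + scale * n for x, n in zip(signal, noise)]


if __name__ == "__main__":
    # Stand-in for a recorded CV syllable: a 1 kHz tone sampled at 16 kHz.
    tone = [math.sin(2.0 * math.pi * 1000.0 * t / 16000.0) for t in range(1600)]
    for snr in (20.0, 10.0, 0.0):
        noisy = add_white_noise(tone, snr)
        print(snr, "dB ->", len(noisy), "samples")
```

An overall SNR of this kind is the measure the abstract attributes to the earlier work; the abstract's own account is in terms of whether particular cues, such as the second‐formant transition, are masked, which would need a per‐frequency‐band comparison rather than a single number.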

The influence of selected acoustic cues on the perception of /l/ and /w/ (A)

Carol Y. Espy‐Wilson

J. Acoust. Soc. Am. Volume 86, Issue S1, pp. S49-S49 (1989); (1 page)

Online Publication Date: 13 Aug 2005

In a semivowel recognition system developed by Espy‐Wilson [Mass. Inst. Technol. Res. Lab. Electron. Tech. Rep. No. 531 (1987)], the sounds /l/ and /w/ were frequently confused, especially when they occurred intervocalically. In this study, the perceptual importance of some of the cues used in the recognition system, as well as some others which appear to be salient, was investigated. An [ala]‐[awa] continuum was synthesized. The starting point was an easily identifiable [ala] stimulus. Three factors were varied orthogonally to shift the percept towards [awa]: (1) the rate of change in the formant transitions between the semivowel and following vowel, (2) the rate of change in the amplitudes of F3, F4, and F5 between the semivowel and following vowel, and (3) the spectral shape of the semivowel (coronal or labial). Preliminary results of an identification test show that spectral shape and the rate of change of the formant transitions are important cues, whereas the rate of change in the amplitudes of F3, F4, and F5 appears to have a negligible effect on listeners' responses. For example, with formant transitions of 30 ms or less, [ala] is heard. With formant transitions greater than 40 ms, the perception moves towards [awa]. The results also show that a few listeners had difficulty hearing [awa] when the semivowel had a coronal shape, despite the rate of change in the formant transitions being biased towards /w/. These results will be discussed with respect to past and future research in speech recognition.

Combining statistical and linguistic models for synthesis of prosodic contours (A)

Mari Ostendorf, Patti Price, Stefanie Shattuck‐Hufnagel, Nanette Veilleux, Colin Wightman, and Rudy Garcia

J. Acoust. Soc. Am. Volume 86, Issue S1, pp. S50-S50 (1989); (1 page)

Online Publication Date: 13 Aug 2005

“It is very important to get the timing, intonation, and allophonic detail correct in order that a sentence sound intelligible and moderately natural.” [D. Klatt, J. Acoust. Soc. Am. 82, 737–793 (1987)]. This important review article included prosody as a research issue for improving text‐to‐speech synthesis. Klatt's suggestions for improving prosody are addressed here: Development of new systems for control of F0 and duration, and mechanisms for adding variety. The proposed synthesis system is a statistical model trained on text, parts of speech, pronunciation, lexical stress, prosodic labels (major and minor boundaries, accents, etc.), and acoustic parameters (relative F0 and duration). The synthesis problem is to predict the prosodic labels and acoustic parameters given the text and the statistical model. Several hours of speech have been collected from professional FM newscasters, a labeling scheme has been converged on, and a portion of the data has been labeled. The components of the system so far implemented will be discussed: (1) statistical modeling of sequences of parts of speech to predict major prosodic breaks, (2) the role of breath noise in naturalness, and (3) the implementation of a sinusoidal model for duration and pitch modification of waveforms. [Work supported by NSF.]