Journal of the Acoustical Society of America

Nov 1981

Volume 70, Issue S1, pp. S1-S109

Session R. Speech Communication III: Analysis and Synthesis of Speech
Contributed Papers

A reconsideration of acoustic invariance for place of articulation in stop consonants: Evidence from cross‐language studies (A)

Aditi Lahiri and Sheila E. Blumstein

J. Acoust. Soc. Am. Volume 70, Issue S1, pp. S39-S39 (1981); (1 page)

Online Publication Date: 12 Aug 2005

It has been claimed that the gross shape of the onset spectrum provides invariant properties for place of articulation in stop consonants [Blumstein and Stevens, J. Acoust. Soc. Am. 66, 1001–1017 (1979)]. We have examined the gross shape of the onset spectra for diffuse stop consonants (labial, dental, and alveolar) in French and Malayalam to test these theoretical claims across different languages and found that (1) although alveolar consonants in Malayalam had the diffuse‐rising shape, contrary to theoretical predictions, the dental consonants were diffuse‐flat; and (2) French dental consonants were also diffuse‐flat and could not be distinguished from labial consonants. We explored whether an alternative measure could capture the theoretical claims that (1) there is acoustic invariance for place of articulation, and (2) the properties for such invariance reflect a predominance of low‐frequency energy for labials and a predominance of high‐frequency energy for alveolars and dentals. We compared the change in distribution of energy from stimulus onset relative to the onset of the vowel steady state. After visual inspection, a set of measurement procedures was developed and tested on 300 utterances from French, Malayalam, and English. Over 85% of the data was correctly classified. [Supported by an NIH Grant.]

Differences in the F0 patterns of speech: Tone language versus stress language (A)

Stephen J. Eady

J. Acoust. Soc. Am. Volume 70, Issue S1, pp. S39-S39 (1981); (1 page)

Online Publication Date: 12 Aug 2005

A comparison was made between the fundamental frequency (F0) patterns of continuous speech in Mandarin Chinese and American English. Seven adult male native speakers of each language were asked to read an unemotional narrative text written in their language. The recorded speech signal was analyzed using an F0 extraction program that produced F0 values at 10‐msec intervals. The analysis showed the F0 patterns of Chinese to have a greater amount of dynamic movement than those of English. The speech of the Mandarin subjects displayed a greater average rate of F0 change than that of the American subjects. The Chinese speech was also characterized by more F0 fluctuations (peaks and valleys) as a function of time and as a function of the number of syllables. The results are consistent with the notion that the F0 patterns of Mandarin Chinese (a tone language) are determined mainly by […], while the patterns of American English (a stress language) are determined mainly by the placement of primary stress on only a few of the lexical items in a sentence. [Work supported by the Social Sciences and Humanities Research Council of Canada and by BRS Grant RR‐05596 to Haskins Laboratories.]
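The two kinds of measure named in this abstract, an average rate of F0 change and a count of F0 peaks and valleys, can be illustrated with a minimal sketch. The abstract does not define either measure precisely, so the formulas, the function names, and the sample track below are assumptions; the track is assumed to be a contiguous voiced stretch sampled every 10 ms.

```python
def mean_abs_f0_rate(f0_hz, frame_step_s=0.010):
    """Mean absolute F0 change in Hz per second (assumed definition)."""
    diffs = [abs(b - a) for a, b in zip(f0_hz, f0_hz[1:])]
    return sum(diffs) / (len(diffs) * frame_step_s)


def count_turning_points(f0_hz):
    """Count peaks and valleys, i.e. sign changes in the frame-to-frame slope."""
    count = 0
    prev_sign = 0
    for a, b in zip(f0_hz, f0_hz[1:]):
        delta = b - a
        sign = (delta > 0) - (delta < 0)
        if sign == 0:
            continue  # flat step: keep the previous direction
        if prev_sign != 0 and sign != prev_sign:
            count += 1
        prev_sign = sign
    return count


# Made-up F0 track (Hz), one value per 10-ms frame.
track = [100.0, 110.0, 120.0, 110.0, 100.0, 105.0]
rate = mean_abs_f0_rate(track)
turns = count_turning_points(track)
```

Dividing the turning-point count by the duration or by the number of syllables would give the per-time and per-syllable figures the abstract refers to.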

A real‐time voice fundamental frequency acquisition and processing system (A)

Donald J. Stilwell, Martin J. McCutcheon, Akira Hasegawa, Samuel G. Fletcher, and Stephen C. Smith

J. Acoust. Soc. Am. Volume 70, Issue S1, pp. S40-S40 (1981); (1 page)

Online Publication Date: 12 Aug 2005

The system was designed to measure voice fundamental frequency by zero‐crossing detection of the filtered output of a miniature piezoelectric accelerometer placed on the throat. Relatively inexpensive hardware provides a host computer with cycle‐by‐cycle pitch period information with a resolution of 10 μs. Tests indicate the instrument operates reliably for male and female speakers of a wide variety of ages under varying phonatory conditions including breathy voicing. Performance evaluations and applications as a research tool and a speech training aid will be presented.
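Cycle‐by‐cycle pitch measurement by zero‐crossing detection can be sketched in software as follows. This is only an illustration of the general technique, not the hardware described above: it assumes the signal has already been filtered down to one positive‐going crossing per glottal cycle, and the function names, sampling rate, and test tone are hypothetical.

```python
import math


def pitch_periods_from_zero_crossings(samples, sample_rate):
    """Return the durations (s) between successive positive-going zero crossings."""
    crossing_times = []
    for i in range(1, len(samples)):
        prev, cur = samples[i - 1], samples[i]
        if prev < 0.0 <= cur:
            # Linear interpolation gives a sub-sample estimate of the crossing time.
            frac = -prev / (cur - prev)
            crossing_times.append((i - 1 + frac) / sample_rate)
    return [t1 - t0 for t0, t1 in zip(crossing_times, crossing_times[1:])]


def demo_f0_estimates():
    """Estimate F0 cycle by cycle for a synthetic 125-Hz tone (made-up test signal)."""
    sample_rate = 20000
    f0 = 125.0
    signal = [math.sin(2.0 * math.pi * f0 * n / sample_rate - 0.3) for n in range(2000)]
    periods = pitch_periods_from_zero_crossings(signal, sample_rate)
    return [1.0 / p for p in periods]
```

Each returned period corresponds to one cycle, so the reciprocal gives a per‐cycle F0 estimate rather than a frame average.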

Acoustical characteristics of primary syllabic stress in excellent esophageal speakers (A)

Monica McHenry, Alan Reich, and Fred Minifie

J. Acoust. Soc. Am. Volume 70, Issue S1, pp. S40-S40 (1981); (1 page)

Online Publication Date: 12 Aug 2005

Primary syllabic stress appears to be controlled by a complex interaction of fundamental frequency (f0), sound pressure (SP), duration, and formant frequency position [D. Fry, Language and Speech 2, 126–151 (1959)]. Because esophageal speakers typically exhibit reduced ability to manipulate f0, SP, and duration, it seems unlikely that they would be able to approximate normal prosodic patterning. This study investigated the ability of excellent esophageal speakers to manipulate acoustical characteristics associated with primary syllabic stress. Five excellent esophageal speakers and five sex‐ and age‐matched normals produced 10 sentence pairs, each containing a bi‐syllabic stimulus item differing only in primary stress placement. The mean f0, SP, and duration of the stressed and unstressed vowel nuclei were analyzed. Although some differences in absolute levels were apparent, only sound pressure level differences reached statistical significance. For both groups, primary stress was associated with a comparable pattern of increased f0, SPL, and duration. The findings of this project were interpreted to mean that excellent esophageal speakers are capable of producing primary syllabic stress in a fashion that is remarkably similar to normals.

Post‐vocalic and syllabic /r/ and /l/ in English (A)

Joseph P. Stemberger

J. Acoust. Soc. Am. Volume 70, Issue S1, pp. S40-S40 (1981); (1 page)

Online Publication Date: 12 Aug 2005

The psychological status of postvocalic and syllabic /r/ and /l/ is unclear. Are the postvocalic /r/ and /l/ in bar /bar/ and ball /bal/ part of a complex vowel, like the glide in buy /bay/, or part of the syllable coda, like the stop in bought /bat/? Are syllabic /r/ and /l/ vowels or consonants? A corpus of 6200+ speech errors was examined. Vowel + glide acts as a unit in errors, while vowel + consonant rarely does so. Vowel + /r/ and vowel + /l/ were intermediate, often acting as units, often not. Unlike consonants, /r/ and /l/ are a part of the syllable nucleus, but, unlike glides, are not a part of the vowel. Syllabic /r/ and /l/ show the error patterns of both vowels and consonants. They are syllabic consonants rather than vowels, but show vowel error patterns by virtue of their syllabicness and sharing many features with vowels.

Coarticulation in French consonant clusters (A)

Douglas O'Shaughnessy

J. Acoust. Soc. Am. Volume 70, Issue S1, pp. S40-S40 (1981); (1 page)

Online Publication Date: 12 Aug 2005

Formant speech synthesis requires an adequate model of how formants vary with time in natural speech. While the general formant targets for phonemes in languages such as French are well known, the effects of coarticulation and undershoot of targets in various phonetic contexts are less well established. Toward the goal of French synthesis‐by‐rule with timing and formant transitions based on natural speech, 285 words were read in frame sentences by a French Canadian and analyzed via digital spectrograms for durations, formants, and bandwidths. Examples of all possible consonant clusters were examined, including those not found in English (e.g., “pluie”). While most formant transitions could be well modelled in terms of simple targets and time constants, the glides and liquids (/l, r, w, j, ɥ/) were highly variable in consonant cluster contexts. For example, /r/ was devoiced next to an unvoiced consonant, and the formants for /l/ indicated different articulatory positions depending on context. The presence of a liquid in word‐final stop + liquid clusters (e.g., “tigre”) was noted primarily by extending the aspiration period following stop release, rather than releasing the liquid into a schwa vowel.

Accentuation in reading aloud of news bulletins (A)

S. G. Nooteboom and J. G. Kruyt

J. Acoust. Soc. Am. Volume 70, Issue S1, pp. S40-S41 (1981); (2 pages)

Online Publication Date: 12 Aug 2005

Earlier research has led to a “grammar of Dutch intonation,” i.e., a set of rules for generating a suitable pitch contour for any Dutch sentence to be synthesized. However, before these rules can be applied, the location of the pitch accents and prosodic boundaries must be specified. As a first step towards accent location rules, the accentuation behavior of Dutch speakers was examined as they read aloud sentences occurring in simulated radio news bulletins. The major independent variables were syntactic form and whether or not a particular referent had been mentioned in the preceding sentence. In all, 1296 utterances were transcribed for pitch accents in terms of the grammar of Dutch intonation. The results show that pitch accents occur on a high percentage of the content words. Probability of accentuation is related to word class, lexical meaning, syntactic position, and, to a minor extent, to previous mention of the same referent. [Work supported by the Netherlands Organisation for the Advancement of Pure Research.]

Quality limitations in residual‐excited LP speech coding (A)

James G. Kubina and Douglas O'Shaughnessy

J. Acoust. Soc. Am. Volume 70, Issue S1, pp. S41-S41 (1981); (1 page)

Online Publication Date: 12 Aug 2005

In this paper, we report on a study of two factors which determine the RELP speech coder overall transmission rate and output quality, namely, the transmitted residual baseband bandwidth and the baseband encoding technique. The first subjective test revealed a linear relationship between uncoded baseband bandwidth and coder quality referenced to the log PCM standard. The optimum baseband encoding method out of a subset (PCM, APCM, ADM, PPADPCM, ATC) was determined by both an objective (SEGSNR) measure and a subjective (preference) measure derived from the results of a second subjective test. The ADPCM‐based RELP coder maximized both measures and was used thereafter in a final subjective test which sought to establish a relationship between transmission rate and quality. Test results showed a quasi‐linear dependence of quality on transmission rate and also showed that toll quality (7‐bit log PCM) cannot be presently achieved with RELP coding.
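The objective measure mentioned above, segmental signal‐to‐noise ratio (SEGSNR), is commonly computed as the mean of per‐frame SNR values in decibels. The sketch below follows that common definition; the frame length and the clamping limits are assumptions, since the abstract does not state the settings used in the study.

```python
import math


def segmental_snr_db(reference, decoded, frame_len=160, floor_db=-10.0, ceil_db=35.0):
    """Average the per-frame SNR (dB) between a reference and a decoded signal.

    frame_len, floor_db, and ceil_db are illustrative defaults, not values
    taken from the abstract.
    """
    frame_snrs = []
    for start in range(0, len(reference) - frame_len + 1, frame_len):
        ref = reference[start:start + frame_len]
        dec = decoded[start:start + frame_len]
        signal_energy = sum(x * x for x in ref)
        noise_energy = sum((x - y) ** 2 for x, y in zip(ref, dec))
        if signal_energy == 0.0:
            continue  # skip silent frames, which would give log(0)
        if noise_energy == 0.0:
            snr = ceil_db  # perfect reconstruction: use the upper limit
        else:
            snr = 10.0 * math.log10(signal_energy / noise_energy)
        frame_snrs.append(min(max(snr, floor_db), ceil_db))
    if not frame_snrs:
        raise ValueError("no non-silent frames to evaluate")
    return sum(frame_snrs) / len(frame_snrs)


# Made-up example: the "decoded" signal is the reference scaled by 0.9,
# so the error is 10% of the signal in every frame.
reference = [math.sin(2.0 * math.pi * 200.0 * n / 8000.0) for n in range(1600)]
decoded = [0.9 * x for x in reference]
score = segmental_snr_db(reference, decoded)
```

Averaging in the decibel domain weights quiet and loud frames equally, which is the usual reason for preferring this measure to a single whole‐utterance SNR.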

Acoustic basis for universal constraints on phoneme combinations (A)

Haruko Kawasaki and John J. Ohala

J. Acoust. Soc. Am. Volume 70, Issue S1, pp. S41-S41 (1981); (1 page)

Online Publication Date: 12 Aug 2005

Phonological studies reveal many cross‐language similarities in permissible/impermissible phoneme sequences. We attempt to account for such universally attested phonotactic constraints by reference to two aspects of acoustic properties of these sound sequences. First, the magnitude of acoustic modulation in these sound sequences should be directly proportional to their perceptual saliency, which in turn would affect their viability in languages. Second, the similarity between two or more sound sequences should determine the likelihood of confusion and their susceptibility to merger. The magnitude of, and distance between, formant trajectories were computed for various sequences of stops, liquids, glides, and vowels. The results correctly predicted the infrequent occurrence of [bw‐], [‐wu], [‐yi] as opposed to frequently observed [‐wa], [‐ya], [‐yu]. Such sequences as [gw‐] and [dw‐] are predicted to be confusable with [b‐], and [by‐] is likewise confusable with [d‐]. The results also indicated that the combination of a liquid and a back vowel should be rare.

Frequency of occurrence of word sequences (A)

N. Umeda and D. Kahn

J. Acoust. Soc. Am. Volume 70, Issue S1, pp. S41-S41 (1981); (1 page)

Online Publication Date: 12 Aug 2005

A problem encountered in speech synthesis using concatenation methods is the unnaturalness and lowered intelligibility which arise from reducing the duration of stored units (diphones, demisyllables, words, etc.). In word‐concatenation systems, this problem is most severe in the case of a sequence of function words, where an extreme amount of contraction is necessary in order to approach natural‐speech prosody. We suggest this problem can be circumvented by storing not only isolated words but also certain sequences of words. To this end, we studied the frequency of occurrence of two‐ and three‐word sequences in English, based on the million‐word corpus of Kucera & Francis. We find that many sequences of extremely common words occur more frequently than all but the most frequent single words. For example, “of the” is one of the ten most common “words” of English; 17 two‐word sequences have corpus frequencies greater than 1000, making them more common than “much,” “well,” “should,” “how,” etc. We suggest that the results of our study allow one to properly select a “word”‐concatenation vocabulary. Our word‐sequence frequency tables should be useful to psychologists and workers in speech recognition as well.
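Counting two‐ and three‐word sequences and comparing them with single‐word frequencies can be sketched as follows. The toy sentence is made up for illustration and is not drawn from the Kucera & Francis corpus, and the function names are hypothetical.

```python
from collections import Counter


def ngram_counts(tokens, n):
    """Count every run of n consecutive tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def sequences_more_frequent_than(tokens, n, word):
    """List the n-word sequences that occur more often than a given single word."""
    threshold = Counter(tokens)[word]
    return sorted(seq for seq, count in ngram_counts(tokens, n).items() if count > threshold)


# Made-up toy text standing in for a corpus.
tokens = "the end of the day and the start of the week and much of the rest".split()
bigrams = ngram_counts(tokens, 2)
frequent = sequences_more_frequent_than(tokens, 2, "much")
```

With a real corpus, the same comparison would pick out the word sequences worth storing as single units in a concatenation vocabulary.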

Software synthesis with an array processor (A)

Anthony Levas and Ignatius G. Mattingly

J. Acoust. Soc. Am. Volume 70, Issue S1, pp. S41-S41 (1981); (1 page)

Online Publication Date: 12 Aug 2005

The SYNTH software serial synthesizer [Mattingly, Pollock, Levas, Scully and Levitt, J. Acoust. Soc. Am. Suppl. 1 69, S83 (1981)] has been implemented using a Floating Point Systems AP 120 Array Processor interfaced with a PDP‐11/45. A batch of filter coefficients and excitation information is computed by the host from the input parameter values for an utterance and passed to the array processor, which then computes for each sample the glottal excitation and the output of filters representing the vowel and consonant branches of the synthesizer. Coefficients are stored in “table memory” and delay values in one of the two “data pads” of the array processor. Only one instruction apiece, in which a multiplication, an addition (or subtraction), a “main data memory” transfer, and a pointer incrementation are initiated simultaneously, is required for the numerator and denominator loops of the filter. The use of the array processor eliminates any appreciable delay in the execution of synthesis, which has been the one disadvantage of a software synthesizer as compared with a hardware synthesizer. [Work supported by NSF Grant PFR 8006144.]

A new compression technique of speech waveform (A)

M. Morito, K. Hosoda, and K. Yamada

J. Acoust. Soc. Am. Volume 70, Issue S1, pp. S41-S42 (1981); (2 pages)

Online Publication Date: 12 Aug 2005

A new compression technique for speech waveforms has been developed to synthesize natural speech at lower bit rates by a simple method. The basic process of this technique was constructed with waveform symmetrization, ADPCM coding and decoding, segment repetition, and amplitude interpolation. With this technique, it is possible to compress speech information at bit rates of 2.8 to 6.5 kbit/s. In the analysis, 256 points of digital speech data were converted to a symmetric waveform having the same amplitude characteristics. The segment waveform (one pitch period long if voiced, 32 sampling periods long if unvoiced) was determined by the converted symmetric waveform and was coded using ADPCM. Further, in order to attain high information compression, we defined the characteristic value giving periodicity of speech waveforms. In synthesis, one segment of the speech waveform was reproduced using ADPCM data. The same process was repeated several times according to the characteristic value determined in the analysis. This repetition makes the compression of speech waveforms very efficient. Amplitude interpolation of segments was introduced to mitigate discontinuity between neighboring segments and was achieved by shifting the initial position of the pointer in the ADPCM decoding.

Generation of F0 contours for English sentences from partially specified fundamental frequency templates (A)

Carolyn Gramlich and William Sanders

J. Acoust. Soc. Am. Volume 70, Issue S1, pp. S42-S42 (1981); (1 page)

Online Publication Date: 12 Aug 2005

English sentences are assigned to a small number of intonational classes. Each class has an associated fundamental frequency template, which characterizes the F0 contour by specifying up to two frequency points per word—typically a peak frequency and a final frequency. The F0 contour for a particular sentence is elaborated by cubic polynomial interpolation of the template's frequency points, fixed in time by the actual durations of the words of the sentence. Similar sentences with different numbers of words can be assigned to the same intonational class if short phrases are described by a single pair of frequency points, which are then algorithmically elaborated. In a practical application the name of a template is stored with the text of a sentence, and the spoken sentence is synthesized by applying the template to the sequence of LPC‐encoded words in real time. A demonstration tape will be played.
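The template idea can be illustrated with a minimal sketch: two frequency points per word are placed in time using the word durations, and a contour is generated by cubic interpolation between them. The abstract does not specify where in a word the peak falls or which cubic is used, so this sketch assumes the peak sits at the word midpoint, the final value at the word end, and a zero‐slope cubic (Hermite) segment between neighboring points. All names and numbers are hypothetical.

```python
def place_anchors(template, durations_s):
    """Turn per-word (peak_hz, final_hz) pairs into (time_s, f0_hz) anchor points.

    Assumed placement: peak at the word midpoint, final value at the word end.
    """
    anchors = []
    start = 0.0
    for (peak_hz, final_hz), dur in zip(template, durations_s):
        anchors.append((start + 0.5 * dur, peak_hz))
        anchors.append((start + dur, final_hz))
        start += dur
    return anchors


def f0_contour(anchors, frame_step_s=0.01):
    """Sample a piecewise-cubic contour through the anchors, one value per frame."""
    first_t, last_t = anchors[0][0], anchors[-1][0]
    n_frames = int(round((last_t - first_t) / frame_step_s))
    contour = []
    seg = 0
    for k in range(n_frames + 1):
        t = first_t + k * frame_step_s
        while seg < len(anchors) - 2 and t > anchors[seg + 1][0]:
            seg += 1
        (t0, f_start), (t1, f_end) = anchors[seg], anchors[seg + 1]
        u = min(max((t - t0) / (t1 - t0), 0.0), 1.0)
        # Cubic with zero slope at both ends of the segment.
        contour.append(f_start + (f_end - f_start) * (3.0 * u * u - 2.0 * u ** 3))
    return contour


# Made-up two-word template and word durations.
template = [(130.0, 110.0), (140.0, 100.0)]
durations = [0.2, 0.3]
anchors = place_anchors(template, durations)
contour = f0_contour(anchors)
```

Because only the durations change from sentence to sentence, the same stored template can be stretched over words of different lengths.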

Speech synthesis from measured vocal tract shapes (A)

M. M. Sondhi and J. R. Resnick

J. Acoust. Soc. Am. Volume 70, Issue S1, pp. S42-S42 (1981); (1 page)

Online Publication Date: 12 Aug 2005

The talk will describe our recent experiments on estimation of vocal tract area functions from acoustical measurements at the lips. Since the theoretical basis for such measurements has been discussed in several earlier publications, we will concentrate on the experimental aspects. The two main accomplishments we will report on are: (1) We are now able to make measurements, compute and display the area functions in real time (16–20 frames per second). This speed of reconstruction has never been possible before. (2) We have synthesized reasonably good quality speech from a number of sentence‐length sequences of measured area functions. Some attempts at synthesizing steady vowels from such measurements have been reported in the past. However, to the best of our knowledge, this is the first instance of continuous speech synthesized from direct acoustical measurements of area functions. We will present examples of time‐varying area functions and corresponding synthesized speech signals.

A variable frame rate LPC vocoder using normalized ladder‐forms (A)

J. S. Wang, D. T. L. Lee, and M. Morf

J. Acoust. Soc. Am. Volume 70, Issue S1, pp. S42-S42 (1981); (1 page)

Online Publication Date: 12 Aug 2005

We present a variable frame rate LPC (Linear Predictive Coding) vocoder based on a normalized exact‐least squares ladder‐form analysis algorithm by Lee and Morf. This algorithm works on a sample by sample basis, hence no a priori frame rate has to be preselected. As a byproduct of this algorithm a log‐likelihood variable is computed that can be used to detect discontinuities in the speech signal such as plosive sounds, pitch pulses, glottal stops and other underlying transitions. Based on this algorithm a complete design of a vocoder is presented, that allows variable frame rates, pitch synchronous speech analysis/synthesis. Such vocoders are of interest in digital communication systems, especially packet oriented systems, or digitally stored speech based systems. Computer simulations of this vocoder using a digitized speech data base are presented, and several alternative implementations are discussed.

Analysis‐by‐synthesis method for determining optimal excitation for natural‐sounding LPC speech synthesis (A)

B. S. Atal and J. R. Remde

J. Acoust. Soc. Am. Volume 70, Issue S1, pp. S42-S42 (1981); (1 page)

Online Publication Date: 12 Aug 2005

LPC speech synthesis uses two separate excitation signals—a delta‐function pulse once every pitch period for voiced speech and white noise for unvoiced speech. This way of representing excitation requires that speech segments be classified accurately into voiced and unvoiced categories and the pitch period of voiced segments be known. It is now well recognized that such a rigid idealization of the excitation is often responsible for the unnatural quality associated with synthesized speech. We find that a more flexible representation of the excitation is necessary for producing natural‐sounding speech. This paper presents an analysis‐by‐synthesis procedure for determining the optimal excitation for LPC synthesis (at different bit rates) without requiring prior knowledge of either the voiced‐unvoiced classification or the pitch period. The excitation is found by minimizing the perceptual difference between waveforms of the original and the synthetic speech signals using a noniterative procedure. The perceptual difference metric takes account of the finite frequency resolution and the masking properties of the human hearing mechanism.
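For orientation, the conventional two‐source excitation model that the abstract starts from can be sketched as below: a pulse train or white noise drives an all‐pole (LPC) synthesis filter. This illustrates only the baseline model, not the authors' analysis‐by‐synthesis procedure, and the coefficient values and function names are made up.

```python
import random


def pulse_train(n_samples, pitch_period):
    """Voiced excitation: one unit pulse every pitch period (in samples)."""
    return [1.0 if i % pitch_period == 0 else 0.0 for i in range(n_samples)]


def white_noise(n_samples, seed=0):
    """Unvoiced excitation: Gaussian white noise from a seeded generator."""
    rng = random.Random(seed)
    return [rng.gauss(0.0, 1.0) for _ in range(n_samples)]


def lpc_synthesize(excitation, predictor_coeffs, gain=1.0):
    """All-pole synthesis: s[n] = gain * e[n] + sum_k a[k] * s[n - k]."""
    output = []
    for n, e in enumerate(excitation):
        sample = gain * e
        for k, a in enumerate(predictor_coeffs, start=1):
            if n - k >= 0:
                sample += a * output[n - k]
        output.append(sample)
    return output


# Made-up first-order predictor, chosen so the output is easy to follow by hand.
voiced = lpc_synthesize(pulse_train(8, 4), [0.5])
unvoiced = lpc_synthesize(white_noise(8), [0.5])
```

In this model every frame must be labeled voiced or unvoiced and given a pitch period before synthesis, which is the requirement the proposed procedure removes.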

Better acoustic model of the vocal tract (A)

Shinji Maeda

J. Acoust. Soc. Am. Volume 70, Issue S1, pp. S42-S42 (1981); (1 page)

Online Publication Date: 12 Aug 2005

Speech signals were synthesized by assuming plane wave propagations and nonrigid vocal‐tract walls. The tract is excited by specifying the time function of the glottal area. A constant air pressure source is assumed. The quality of 11 different French vowels synthesized was excellent in terms of naturalness as well as of intelligibility. It seems that this high quality is primarily due to a better realization of the source‐tract interaction, especially the time‐varying characteristics of the glottal impedance within each fundamental period. Interestingly, the quality of nasal vowels or nasals was not satisfactory. We suspect that a simple acoustic tube is a rather poor representation of the nasal tract. Our theoretical analysis and also experiments by Lindqvist‐Gauffin and Sundberg [“Acoustic Properties of the Nasal Tract,” Phonetica 33, 161–168 (1976)] suggest that the sinus shunt cavities within the nasal tract may play an important role in shaping appropriate nasal spectra. Results of simulation experiments will be demonstrated.