Journal of the Acoustical Society of America

May 1981

Volume 69, Issue S1, pp. 31-S125

Session G. Speech Communication I: Acoustic Analysis
Contributed Papers

Synthesis as feedback in field linguistics (A)

Joseph E. Grimes

J. Acoust. Soc. Am. Volume 69, Issue S1, pp. S15-S15 (1981); (1 page)

Interim field reports on the phonologies of previously undescribed languages can be tested for consistency using synthesis‐by‐rule techniques. Such a report on Mura‐Piraha of Brazil provided information for the feature rule component of Hertz's Cornell Speech Research System. Values for the parameter rules came from spectrograms made from a tape recorded in the field containing the examples in the report. Both sets of rules were revised to improve the match between the synthesized speech and the recordings. Out of this revision came recommendations about changes to the phonological description that the investigator can follow up on his next opportunity to do field work. In the current absence of such opportunity, two groups of linguists attempted to transcribe the forms phonetically. One had special orientation to Mura‐Piraha phonology and the other had only general phonetic background. Both confusion matrices are given.

A new method for the quantitative evaluation of degree of hoarseness (A)

Eiji Yumoto, Wilbur J. Gould, and Thomas Baer

J. Acoust. Soc. Am. Volume 69, Issue S1, pp. S15-S15 (1981); (1 page)

A sustained phonation of the vowel /a/ was separated into the harmonic and noise components by the PDP 11/45 computer. The ratio of the acoustic energy of the harmonic components to that of the noise components (H/N ratio) was calculated. The subjects consisted of twenty‐five normals and forty‐three pathological cases with varying degrees of hoarseness pre‐ and postoperatively. Sound spectrograms were made from the original phonation and the extracted noise sound. There was a highly satisfactory degree of separation of the noise components from the original phonation. The H/N ratio was a useful tool to quantitatively compare the degree of hoarseness of a posttreatment voice with that of a pretreatment voice. The critical range of the H/N ratio of normals was larger than that of the pathological cases. The theoretical basis of this method will be discussed in detail.
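The abstract does not say how the harmonic and noise components were separated. As a minimal sketch, assume the phonation has already been cut into pitch periods of equal length (a simplification), take the sample-wise average period as the harmonic estimate, and treat each period's deviation from that average as noise. The function name `hn_ratio_db` and the equal-length input are illustrative assumptions, not the authors' procedure.

```python
import math

def hn_ratio_db(periods):
    """Harmonic-to-noise energy ratio in dB.

    `periods` is a list of pitch periods, each a list of samples of the
    same length (an assumed pre-segmentation, not part of the abstract).
    """
    n = len(periods)
    length = len(periods[0])
    # Harmonic estimate: average waveform across all periods.
    avg = [sum(p[i] for p in periods) / n for i in range(length)]
    # Energy of the harmonic estimate, counted once per period.
    harmonic = n * sum(a * a for a in avg)
    # Noise energy: deviation of every period from the average waveform.
    noise = sum((p[i] - avg[i]) ** 2 for p in periods for i in range(length))
    if noise == 0.0:
        return math.inf  # perfectly periodic input
    return 10.0 * math.log10(harmonic / noise)
```

Under these assumptions, a hoarser voice (more period-to-period deviation) gives a lower value, which is the direction of comparison the abstract describes for pre- and posttreatment voices.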

Second formants in fricatives (A)

Sigfrid D. Soli

J. Acoust. Soc. Am. Volume 69, Issue S1, pp. S15-S15 (1981); (1 page)

Acoustic analyses of the sibilant fricatives, [s, z, ʃ, ʒ], produced in initial position before [a], [i], and [u] were performed. LPC spectra revealed reliable anticipatory vowel coarticulation effects, viz., spectral peaks affiliated with the second formant of the following vowel, present 30–60 ms before vowel onset. These peaks represent oral resonances excited by either aspiration or voicing and indicate that during the latter part of the fricative the constriction begins to open in anticipation of the vowel. Acoustic characteristics of the peaks varied with vowel context due to differences in the anticipatory coarticulation of each vowel. In the context of the high vowels, [i, u], both assimilation and articulatory overlap of the fricative and vowel configurations were evident from the frequencies of the clearly defined spectral peaks. However, in the [a] context the opposing configurations for the fricative constriction and the low back vowel, which are executed sequentially, resulted in poorly differentiated peaks. The data are congruent with perceptual evidence that high vowels are more accurately identified than low vowels in fricatives excised from fricative‐vowel syllables, thus exemplifying links between articulatory, acoustic, and perceptual aspects of coarticulation.

Cross‐linguistic differences between fricatives (A)

Jonas N. A. Nartey and Hector Raul Javkin

J. Acoust. Soc. Am. Volume 69, Issue S1, pp. S15-S15 (1981); (1 page)

Various cross‐linguistic studies indicate that we can safely utilize formant frequency measures in distinguishing English [i], for instance, from French [i]. When it comes to fricatives, there is no reliable cross‐linguistic method. This paper presents what we hope to be the dimensions by which fricatives may be described in a meaningful cross‐linguistic way. Five to ten speakers each of a number of American Indian, Indo‐European, Asian, and African languages produced fricatives in a sentence frame in three vowel environments: i‐i, a‐a, and u‐u. A 50 ms section of each fricative was subjected to a critical‐band analysis. The resulting spectra were then compared for fricatives within and between the languages.

Statistical analysis of cross‐linguistic differences between fricatives (A)

Hector Raul Javkin and Jonas N. A. Nartey

J. Acoust. Soc. Am. Volume 69, Issue S1, pp. S15-S15 (1981); (1 page)

The differences between fricatives in different languages have been difficult to determine because of a lack of agreed‐upon parameters for characterizing fricatives. The spectra of fricatives in a number of unrelated languages were analyzed into 22 critical bands within the frequencies 0–10 kHz. A PARAFAC analysis was performed within each language across three modes: fricative, vowel environment, and critical band. The analyses of the different languages were compared using PARAFAC and canonical correlation. We hope to answer the question of whether the sounds that have been described as the same in different languages are in fact the same.

Spectral characteristics of palato‐alveolar affricates in three languages (A)

Ian Maddieson

J. Acoust. Soc. Am. Volume 69, Issue S1, pp. S15-S16 (1981); (2 pages)

The spectral characteristics of the fricative portion of palato‐alveolar affricates in Spanish, Italian, and (British) English have been examined in order to obtain information on the extent and nature of inter‐language differences among similar sounds, as well as on certain intra‐language variables. Eight speakers of each language were recorded reading six‐syllable sentences at slow and fast speech rates controlled by a metronome. Embedded in each sentence was a word as close to the phonetic form [katʃa] as the phonotactics of the language permit. In Italian, both single and geminate affricates were recorded. A central portion of the friction noise in each token was analyzed using a program whose output is a simplified spectrum representing relative amplitudes in 22 simulated “critical bands.” Further data reduction was carried out using Harshman's PARAFAC procedure, a 3‐mode factor analysis. The results enable comparisons of cross‐language and within‐language (e.g., speech rate, speaker, sex) differences to be made in terms of differences along a small number of empirically discovered dimensions.

Fricative‐stop coarticulation: Acoustic and perceptual evidence (A)

Bruno H. Repp and Virginia A. Mann

J. Acoust. Soc. Am. Volume 69, Issue S1, pp. S16-S16 (1981); (1 page)

Eight native speakers of American English produced 10 tokens of all possible CV, FCV, and VFCV utterances with V = [ɑ] or [u], F = [s] or [ʃ], and C = [t] or [k]. Acoustic analysis showed that the formant transition onsets following the stop consonant release were systematically influenced by the preceding fricative, suggesting a shift in the place of stop articulation towards the place of fricative articulation. The shift was equally large in FCV (e.g., /sta/) and VFCV (e.g., /asda/) utterances; that is, it was not reduced when a syllable boundary intervened between fricative and stop. In a parallel perceptual study, the CV portions of these utterances (with release bursts removed to provoke errors) were presented to listeners for identification of the stop consonant. The pattern of place‐of‐articulation confusions in this identification task will be discussed in relation to the acoustic measurements. [Work supported by NICHD and BRS.]

A statistical analysis of third‐octave speech amplitude distributions (A)

Steven De Gennaro, Louis D. Braida, and Nathaniel I. Durlach

J. Acoust. Soc. Am. Volume 69, Issue S1, pp. S16-S16 (1981); (1 page)

Amplitude distributions which probabilistically describe the level variations present in third‐octave bands during speech have been obtained through the periodic sampling of the short‐term rms speech envelope in each band. Representative density functions are presented for a variety of speech materials including isolated CVC syllables and nonsense sentences spoken by male and female talkers. In general, the amplitude distributions do not have simple analytic forms, although they can be described parametrically in terms of cumulative percent levels. Typically, the range of amplitudes between the 10% and 90% cumulative levels in each frequency band exceeds 40 to 50 dB. Detailed characteristics of the density functions, however, change significantly across frequency. In the low‐frequency bands, the distributions are bimodal, reflecting distinct amplitude ranges for voiced and unvoiced speech segments. The distributions become unimodal and more peaked in the higher frequency channels. Comparisons of these results with previous characterizations of speech amplitude distributions [i.e., H. K. Dunn and S. D. White, J. Acoust. Soc. Am. 11, 278 (1940)] are also presented. [Work supported by NIH.]
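To make the "cumulative percent levels" concrete, here is a small sketch that turns one band's samples into short-term rms levels in dB and reads off the 10% and 90% levels. The frame length, the nearest-rank percentile rule, and the reading of "cumulative level" as the level below which a given percentage of frames fall are all assumptions; the abstract does not define them.

```python
import math

def rms_levels_db(samples, frame_len):
    """Short-term rms level (dB re 1.0) for consecutive non-overlapping frames."""
    levels = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        mean_square = sum(s * s for s in frame) / frame_len
        if mean_square > 0:              # skip digitally silent frames
            levels.append(10.0 * math.log10(mean_square))
    return levels

def cumulative_level(levels, percent):
    """Level below which `percent` percent of the frames fall (nearest rank)."""
    ordered = sorted(levels)
    rank = max(1, math.ceil(percent * len(ordered) / 100))
    return ordered[rank - 1]

def level_range(levels, low=10, high=90):
    """Spread in dB between two cumulative levels, e.g. 10% and 90%."""
    return cumulative_level(levels, high) - cumulative_level(levels, low)
```

Applied to one third-octave band of running speech, `level_range` would give the kind of 10%-to-90% spread the abstract reports as exceeding 40 to 50 dB.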

A comparison between vocal tract length estimates from acoustic and x‐ray data (A)

F. Lonchamp and J. P. Zerling

J. Acoust. Soc. Am. Volume 69, Issue S1, pp. S16-S16 (1981); (1 page)

Thirty vocal tract length measurements from lip corners to glottis were obtained from cine‐x‐ray sagittal views and lateral‐frontal photographic data. Material for two French speakers includes vowels [i, ɛ, ɑ, ɔ, u] in three symmetric [ə CVC] frames where C is either [b, d] or [g]. Four formant frequencies were measured from the simultaneous audio recording using covariance LPC on closed glottis sections, and were corrected for the frequency effects of lip radiation and wall vibrations. Calculated bandwidths from Fant's data and the modified formant frequencies were used to derive area functions for several tract lengths. Use of Paige and Zue's criterion [IEEE‐AU 18, 268–270 (1970)] of minimum perturbation with respect to a uniform tube yielded close agreement between measured and estimated lengths for all vowels except [u]. Removal of the last (lip) section in criterion calculations gives better results for [u].

Estimation of the intrinsic duration of vowels (A)

Y. Nishinuma, S. Barber, and D. J. Hirst

J. Acoust. Soc. Am. Volume 69, Issue S1, pp. S16-S16 (1981); (1 page)

With the intent of formulating a simple method to estimate the intrinsic duration of vowels, we examined three groups of experimental data and twelve groups of published data from various works concerning six languages. By adopting a linear model hypothesis, we carried out a multiple linear regression analysis, using measured vowel duration as the dependent variable, and its corresponding F1, F2, F3 and a computed variable Dq representing sub‐categorical mean duration as independent variables. Results show that F1, F2, and Dq gave a quite good estimation for all data analyzed. Therefore we think that the intrinsic duration of vowels can be satisfactorily estimated by means of these three variables, at least for the languages concerned.

On the interaction between fundamental frequency and articulatory setting (A)

Timothy Habick

J. Acoust. Soc. Am. Volume 69, Issue S1, pp. S16-S16 (1981); (1 page)

Despite the apparent implications of much current research [P. Ladefoged et al., J. Acoust. Soc. Am. 64, 1027–1035 (1978)], fundamental frequency plays an important role in determining the levels of a speaker's formant frequencies, since F0 affects the articulatory setting that a speaker habitually chooses for speech. On the basis of a spectrographic analysis of the speech of 40 subjects from one speech community, the size and location of the phonemic system in two‐formant acoustic space will be shown to vary as a function of three major parameters: fundamental frequency, social forces (both determining articulatory setting), and the invariable dimensions of the vocal tract. In addition, several competing theories of this aspect of speech perception will be rejected. Specifically, the assertion that listeners calculate the size of speakers' vocal tracts in order to decode their speech will be argued against.

Rate of change of fundamental frequency: A useful parameter in acoustic analysis of the human voice? (A)

Anders Askenfelt and Johan Sundberg

J. Acoust. Soc. Am. Volume 69, Issue S1, pp. S16-S17 (1981); (2 pages)

The informative power of the rate of change of the fundamental frequency (RCFF), i.e., approximately the time derivative of the fundamental frequency, has been examined in two different projects on human voice quality. One project [Hammarberg et al., Acta Otolaryngol. 90, 441–452 (1980)] deals with acoustic correlates of voice disorders. It has been shown that a perceived quality, which is commonly named “roughness,” correlates with cycle‐to‐cycle variations (perturbations) in the speech waveform. The RCFF calculated over a very short time window has been used to detect those parts of the running speech which show waveform perturbations. The results suggest a correlation between the degree of “roughness” as rated by voice experts and the amount of waveform perturbation periods in the speech. In the other project [Askenfelt and Sjölin, STL/QPSR 2–3, 74–81 (1980)] an attempt was made to find a parameter reflecting voice changes associated with psychopathological depression. The average RCFF was determined from the patient's reading of a standard text during illness and after recovery. In this case, the RCFF was calculated over a time window of approximately 50 ms. The results suggest that this RCFF measure is influenced by the mental health of these patients.
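Since the RCFF is described as approximately the time derivative of the fundamental frequency, a finite difference over an F0 track is one simple way to illustrate it. The sketch below assumes a track sampled at a fixed frame step, with unvoiced frames marked as 0, and averages the absolute rate; the function names, the 0 marker, and the use of the absolute value are assumptions, not details from the abstract.

```python
def rcff(f0_track, frame_step_s):
    """Finite-difference rate of change of F0 in Hz/s.

    Only pairs of adjacent voiced frames contribute; a value of 0 marks
    an unvoiced frame (an assumed convention).
    """
    rates = []
    for prev, cur in zip(f0_track, f0_track[1:]):
        if prev > 0 and cur > 0:
            rates.append((cur - prev) / frame_step_s)
    return rates

def mean_abs_rcff(f0_track, frame_step_s):
    """Average magnitude of the rate of change, or 0.0 if nothing is voiced."""
    rates = rcff(f0_track, frame_step_s)
    if not rates:
        return 0.0
    return sum(abs(r) for r in rates) / len(rates)
```

With a frame step near 0.05 s this roughly matches the 50 ms window mentioned for the second project; a much shorter step would correspond to the perturbation-oriented use in the first.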

F0 detection by crosscorrelation with a comb function (A)

Ph. Martin

J. Acoust. Soc. Am. Volume 69, Issue S1, pp. S17-S17 (1981); (1 page)

Thanks to the development of fast digital hardware performing real‐time spectral analysis on speech data, dependable pitch analysis can be obtained even in the presence of a large amount of noise through the use of new F0 measurement techniques. A new method is presented here, based on the crosscorrelation between the power spectrum of suitably windowed speech data $F(\omega)$ and a comb function $C(\omega_p,\omega)$: $I(\omega_p) = \int_0^{\infty} F(\omega)\,C(\omega_p,\omega)\,d\omega$. The comb function detects the possible harmonic structure in the spectrum, so that when $C(\omega_p,\omega)$ is carefully chosen, the crosscorrelation function $I(\omega_p)$ shows maxima at values of $\omega_p$ corresponding to the fundamental frequency of voiced speech. The appropriate design of the comb (with an amplitude varying with $\omega$) leads to more appropriate and economical digital hardware compared to the general crosscorrelation approach.
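A discrete version of this crosscorrelation can be sketched as follows: compute the power spectrum of a windowed frame, then, for each candidate fundamental, sum the spectrum samples that fall under the comb teeth at multiples of the candidate. The 1/k tooth amplitudes, the Hann window, the number of teeth, and the restriction of candidates to whole DFT bins are illustrative assumptions; the abstract only says the comb amplitude varies with frequency.

```python
import cmath
import math

def power_spectrum(x):
    """Power spectrum (bins 0..N/2) of a Hann-windowed frame, naive DFT."""
    n = len(x)
    w = [xi * (0.5 - 0.5 * math.cos(2 * math.pi * i / n)) for i, xi in enumerate(x)]
    spec = []
    for k in range(n // 2 + 1):
        s = sum(w[i] * cmath.exp(-2j * math.pi * k * i / n) for i in range(n))
        spec.append(abs(s) ** 2)
    return spec

def comb_score(spec, p, teeth=8):
    """Discrete I(p): spectrum summed under comb teeth at bins p, 2p, 3p, ..."""
    total = 0.0
    for k in range(1, teeth + 1):
        b = k * p
        if b >= len(spec):
            break
        total += spec[b] / k   # assumed tooth amplitude 1/k, decreasing with frequency
    return total

def detect_f0(x, fs, f_lo=60.0, f_hi=500.0):
    """Return the candidate fundamental (Hz) that maximizes the comb score."""
    spec = power_spectrum(x)
    bin_hz = fs / len(x)
    lo = max(1, int(math.ceil(f_lo / bin_hz)))
    hi = int(f_hi / bin_hz)
    best = max(range(lo, hi + 1), key=lambda p: comb_score(spec, p))
    return best * bin_hz
```

The decreasing tooth amplitude matters: with equal-height teeth, a comb at half the true fundamental would touch every harmonic as well and score the same, so some frequency-dependent weighting is needed to prefer the true value.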

Speech as a string of pulses: Pulse‐coherence function (A)

Jean‐Sylvain Liénard and Frédéric Manceron

J. Acoust. Soc. Am. Volume 69, Issue S1, pp. S17-S17 (1981); (1 page)

The usual methods for speech signal analysis make an a priori distinction between voiced and unvoiced segments, which is a well‐known source of errors, and give a spectrum evaluation on a time interval (20 to 40 ms) larger than the largest pitch period, which eliminates most phonetically relevant information. Our approach [after J. C. Lafon, Congrès Soc. ORL, Paris (1958)] consists in decomposing the signal into a string of “pulses,” which are considered as distinct only if their interval exceeds 1 ms. Pitch and voicing are to be extracted later, on the basis of regularity and spectral similarity of the pulses. The work described in the present paper is done using a filterbank followed by short‐time integrators. A “pulse coherence function” is defined; its maxima are used to mark the beginnings of the pulses. The synthesis experiments made using this function show that short acoustical phenomena (of plosives), noises (of fricatives), voicing and pitch are correctly reproduced.

Accuracy of formant frequency estimation by spectrograms and by linear prediction analysis (A)

R. B. Monsen

J. Acoust. Soc. Am. Volume 69, Issue S1, pp. S17-S17 (1981); (1 page)

The accuracy of spectrographic techniques and of linear prediction analysis in measuring formant frequencies was compared. The first three formant frequencies of ninety synthetic speech tokens were measured by three experienced spectrographic readers and by linear prediction analysis. The synthetic speech tokens were made by parallel synthesis and were chosen to represent a wide range of formant bandwidths, frequencies, and fundamental frequencies. For fundamental frequencies between 100 and 350 Hz, both methods are accurate to within approximately ±60 Hz for both first and second formants; the third formant can be measured with the same degree of accuracy by linear prediction, but only to within ±120 Hz by spectrographic means. For linear prediction analysis, accuracy does not decrease with increasing fundamental frequency, although the accuracy of both methods decreases greatly when fundamental frequency is 400 Hz or greater. These limits of measurement appear to be within the vicinity of difference limens for formant frequencies. [Supported by NINCDS Grant NS03856.]

Voiced/unvoiced decision by a clustering procedure (A)

Robert Espesser

J. Acoust. Soc. Am. Volume 69, Issue S1, pp. S17-S17 (1981); (1 page)

Previous work based on a pattern recognition/classification approach needs either a priori information or a training set of data [L. R. Rabiner, IEEE Trans. ASSP‐24, 201–212 (1976)], [L. J. Siegel, IEEE Trans. ASSP‐27, 83–89 (1979)]. A voiced/unvoiced classification approach which avoids these problems is presented. Energy, zero‐crossing rate, and LPC normalized error are measured every 10 ms over the speech segment. Feature vectors in this three‐dimensional space are then classified into two groups by an iterative reallocation clustering procedure. Mahalanobis distance is used for allocation, and is recomputed at each iteration of the clustering algorithm. Initial centers and centroids of the final clusters must satisfy certain constraints. Several partitions are computed for a speech segment and the best one according to some defined criteria is retained. Mean error rate (without any smoothing algorithm) is about 3.5% for five female and five male speakers.
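The clustering step can be illustrated with a two-cluster reallocation loop in which the Mahalanobis metric is re-estimated at every pass. This is a sketch under stated assumptions: it uses a pooled within-cluster covariance, starts from a Euclidean assignment to two caller-supplied centers, and omits the constraints on centers and the selection among several partitions that the abstract mentions but does not specify. All function names are hypothetical.

```python
def centroid(vectors):
    n = len(vectors)
    return [sum(v[d] for v in vectors) / n for d in range(len(vectors[0]))]

def pooled_covariance(groups, centers, ridge=1e-9):
    """Within-cluster covariance pooled over both groups (assumed choice)."""
    dim = len(centers[0])
    cov = [[0.0] * dim for _ in range(dim)]
    total = 0
    for members, c in zip(groups, centers):
        for v in members:
            diff = [v[d] - c[d] for d in range(dim)]
            for i in range(dim):
                for j in range(dim):
                    cov[i][j] += diff[i] * diff[j]
        total += len(members)
    for i in range(dim):
        for j in range(dim):
            cov[i][j] /= total
        cov[i][i] += ridge           # keeps the matrix invertible
    return cov

def invert(m):
    """Gauss-Jordan inverse with partial pivoting for a small square matrix."""
    n = len(m)
    a = [row[:] + [1.0 if i == j else 0.0 for j in range(n)] for i, row in enumerate(m)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(a[r][col]))
        a[col], a[pivot] = a[pivot], a[col]
        pv = a[col][col]
        a[col] = [x / pv for x in a[col]]
        for r in range(n):
            if r != col:
                f = a[r][col]
                a[r] = [x - f * y for x, y in zip(a[r], a[col])]
    return [row[n:] for row in a]

def mahalanobis_sq(v, c, inv):
    diff = [a - b for a, b in zip(v, c)]
    dim = len(diff)
    return sum(diff[i] * inv[i][j] * diff[j] for i in range(dim) for j in range(dim))

def cluster_two(vectors, init_centers, max_iter=50):
    """Iteratively reallocate feature vectors between two clusters."""
    centers = [list(c) for c in init_centers]
    # Initial assignment: plain Euclidean distance to the starting centers.
    labels = [min((0, 1), key=lambda g: sum((a - b) ** 2 for a, b in zip(v, centers[g])))
              for v in vectors]
    for _ in range(max_iter):
        groups = [[v for v, l in zip(vectors, labels) if l == g] for g in (0, 1)]
        if not groups[0] or not groups[1]:
            break                    # degenerate partition: stop
        centers = [centroid(g) for g in groups]
        inv = invert(pooled_covariance(groups, centers))   # metric recomputed each pass
        new_labels = [min((0, 1), key=lambda g: mahalanobis_sq(v, centers[g], inv))
                      for v in vectors]
        if new_labels == labels:
            break
        labels = new_labels
    return labels, centers
```

Each input vector would hold the three features named in the abstract (energy, zero-crossing rate, LPC normalized error) for one 10 ms frame. Because the Mahalanobis metric rescales each dimension, no manual normalization of the features is needed.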

Design considerations for optimizing the intelligibility of a DFT‐based, pitch‐excited, critical‐band‐spectrum speech analysis/resynthesis system (A)

Stephanie Seneff, Dennis Klatt, and Victor Zue

J. Acoust. Soc. Am. Volume 69, Issue S1, pp. S17-S17 (1981); (1 page)

The primary objective of the research to be described was to determine whether a particular critical‐band spectral representation retains information sufficient for high‐performance speech recognition. The technique employed to answer this question was to design a speech analysis/resynthesis system (vocoder) in which the synthesizer only made use of a critical‐band spectral representation. The intelligibility of speech processed through this system was determined as a function of a number of design parameters. The magnitude spectrum was computed for overlapping windowed segments of the speech waveform every 10 ms. The critical‐band spectrum (38 spectral coefficients) was derived by forming the appropriate weighted sums of DFT magnitude coefficients. The analyzer also made a voicing decision and estimated fundamental frequency based on low‐frequency DFT peaks. During resynthesis, the full DFT magnitude spectrum was regenerated by interpolation between the 38 available coefficients. An inverse DFT, in which the phase was set to zero, yielded a finite impulse response that could be convolved with the idealized excitation source to reconstruct a speech waveform. Not surprisingly, the synthetic speech sounded muffled due to the width of the critical‐band spectral peaks. Several algorithms were then developed to sharpen these peaks prior to resynthesis. The best algorithm, to be described, was compared with a similar DFT‐magnitude vocoder without critical‐band smoothing and with a linear‐prediction vocoder, using a modified rhyme test. The results have implications for both speech recognition and vocoder design. [Work supported in part by an NIH grant.]

Real‐time LPC analysis using digital signal processor chips (A)

R. J. Hanson and J. P. Olive

J. Acoust. Soc. Am. Volume 69, Issue S1, pp. S18-S18 (1981); (1 page)

This paper describes a procedure for determining LPC area parameters in real time. The analysis involves the following steps: high‐frequency pre‐emphasis, windowing, computing the autocorrelation coefficients, and converting them directly to area parameters without first computing the predictor coefficients. Computer simulations show that 20‐bit precision is sufficient to avoid significant errors in the conversion of the autocorrelation coefficients to area parameters if frame‐by‐frame normalization is used. Our results show that good speech quality can be maintained in synthesis by using 16 area parameters updated at 10‐ms intervals. The real‐time requirement of this analysis procedure cannot be met by any single signal processor chip which is currently available. We have, therefore, used a multiprocessor configuration of signal processor chips to perform the analysis in real time. The Bell Laboratories DSP chip, capable of performing a 16 by 20 bit multiply with 40 bit accumulation in 0.8 μs, was found to provide the necessary precision and speed.
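The abstract does not name the conversion algorithm. One known way to go from autocorrelation coefficients to reflection coefficients without forming predictor coefficients is a Schur-type recursion (as in Le Roux and Gueguen); the sketch below uses it as an assumed stand-in. The mapping from reflection coefficients to area values, with the ratio (1 - k)/(1 + k) and a starting area of 1.0, is one common convention and is also an assumption; sign and direction conventions differ between texts.

```python
def reflection_from_autocorr(r):
    """Reflection coefficients k_1..k_p from autocorrelation r[0..p].

    Schur-type recursion: d and b hold correlations of the forward and
    backward prediction errors with the signal, so the predictor
    coefficients are never computed.
    """
    p = len(r) - 1
    d = list(r)
    b = list(r)
    ks = []
    for m in range(1, p + 1):
        k = -d[m] / b[m - 1]
        ks.append(k)
        new_d, new_b = d[:], b[:]
        for i in range(m, p + 1):
            new_d[i] = d[i] + k * b[i - 1]
            new_b[i] = b[i - 1] + k * d[i]
        d, b = new_d, new_b
    return ks

def area_parameters(ks, first_area=1.0):
    """Tube areas from reflection coefficients (assumed ratio convention)."""
    areas = [first_area]
    for k in ks:
        areas.append(areas[-1] * (1.0 - k) / (1.0 + k))
    return areas
```

For a first-order autocorrelation such as `[1.0, 0.5, 0.25, 0.125]`, only the first reflection coefficient is nonzero, so only the first area ratio differs from 1. On fixed-point hardware, every intermediate value in this recursion is bounded by `r[0]`, which is one reason such recursions pair naturally with the frame-by-frame normalization the abstract mentions.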