Journal of the Acoustical Society of America


Apr 1991

Volume 89, Issue 4B, pp. 1851-2015

Session 3SP: Speech Communication: Speech Processing
Contributed Papers

Profiling vectors for speaker identification (A)

Harry Hollien and Ming Jiang

J. Acoust. Soc. Am. Volume 89, Issue 4B, pp. 1891-1891 (1991); (1 page)

Online Publication Date: 14 Aug 2005

When speaker verification is the issue of interest, it is possible to focus on signal analysis irrespective of the speech‐related features it contains. Such approaches are appropriate in this case because system distortions are minimal, noise is low, talkers are cooperative, and very sophisticated equipment is available. Not so for speaker identification. Here extensive channel and speaker distortions (including noise) can be expected; speech is noncontemporary and speakers are usually uncooperative. Hence, the signal is so distorted or masked that the usual processing techniques cannot be expected to be very useful. The approach to speaker identification demonstrated in this paper is threefold. First, it is assumed that the signal contains speech features that are robust (i.e., resistant to noise and distortion) and unique to the talker. These idiosyncrasies are based on the speaker's anatomy, physiology, and habitual communicative patterns. Second, it is postulated that, while there may be no single attribute within a person's speech/voice that would permit them to be differentiated from all other speakers under any set of conditions, the simultaneous use of a large series of feature analyses may permit identification. Finally, it has become possible to reduce bias among the vectors by the normalization of data. In turn, this approach leads to a very effective two‐dimensional profile wherein the unknown speaker must first be identified and then comparisons made to known talkers. A system of this type has been structured and tested; it is based on four natural speech vectors, each containing 20–40 parameters. Data regarding this general approach and these vectors have been reported previously. This presentation will focus on the effects (on efficient speaker identification in the field) of normalizing the vector data and reducing it to a two‐dimensional profile.

Vowel formant tracking for speaker identification (A)

Ming Jiang and Min Shi

J. Acoust. Soc. Am. Volume 89, Issue 4B, pp. 1891-1891 (1991); (1 page)

Online Publication Date: 14 Aug 2005

The first two or three spectral peaks, or formants, are crucial in determining vowel quality. In turn, accurate determination of vowel formant quality is important to an effective speaker identification task. A vowel formant tracking vector (VFT) was developed for the speaker identification (SAUSI) profile. Specifically, the speech spectrum is obtained frame‐by‐frame by using an LPC algorithm, with the first three formant frequencies calculated for each frame. The underlying assumption was that the vowels will exhibit a contiguous formant frequency transition from frame to frame and, hence, can be separated from consonants for the cited formant measurements. In order to carry out this task, the frequency range 0–5000 Hz is divided into 34 semitone bins and three histograms are obtained for the first three vowel formants. In turn, these histograms provide an estimate of the general quality of the vowels spoken by each speaker being evaluated. The result is that the interspeaker differences are large enough to permit identification of the target speaker while the intraspeaker differences are fairly small even for text‐independent speech. The algorithm utilized will be presented, as will data demonstrating that this VFT vector is robust enough to effectively perform the speaker identification task.
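
The histogram step described in this abstract can be illustrated with a minimal sketch. The abstract does not give the exact bin layout, so the lower edge `f_low` of 100 Hz and the bin width of two semitones are assumptions chosen only so that 34 bins reach roughly 5000 Hz; the function names are hypothetical, and the per-frame formant values are taken as given rather than computed by LPC.

```python
import math

def semitone_bin(freq_hz, f_low=100.0, semitones_per_bin=2.0, n_bins=34):
    """Map a frequency to a log-spaced bin index, or None if out of range.

    f_low and semitones_per_bin are assumed values, not from the abstract.
    """
    if freq_hz < f_low:
        return None
    semitones_above_low = 12.0 * math.log2(freq_hz / f_low)
    index = int(math.floor(semitones_above_low / semitones_per_bin))
    return index if index < n_bins else None

def formant_histogram(formant_track_hz, f_low=100.0,
                      semitones_per_bin=2.0, n_bins=34):
    """Normalized histogram of one formant's per-frame frequencies."""
    counts = [0] * n_bins
    for freq in formant_track_hz:
        b = semitone_bin(freq, f_low, semitones_per_bin, n_bins)
        if b is not None:
            counts[b] += 1
    total = sum(counts)
    # Return relative frequencies so tracks of different lengths compare.
    return [c / total for c in counts] if total else [0.0] * n_bins

# Hypothetical F1 track (Hz) from vowel frames of one speaker.
f1_hist = formant_histogram([100.0, 200.0, 200.0, 50.0])
```

One such histogram per formant (F1, F2, F3) would give the three histograms the abstract mentions; how they are compared across speakers is not stated there.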

SAMREC0: A C30‐based reference connected‐word recognizer for the evaluation of speech databases (A)

F. Capman and G. Chollet

J. Acoust. Soc. Am. Volume 89, Issue 4B, pp. 1891-1891 (1991); (1 page)

Online Publication Date: 14 Aug 2005

One of the objectives of the ESPRIT‐SAM project is the elaboration of speech databases for the evaluation of recognizers. In this framework, a reference system [G. Chollet and C. Cagnoulet, “On the evaluation of speech databases using a reference system,” ICASSP, 1982], based on a dynamic programming algorithm, was modified to accept connected words [G. Chollet and C. Montacie, “Evaluating speech recognizers and databases,” NATO‐ASI, 1988]. This software, which is called SAMREC0 by the SAM speech input assessment group, is now implemented using a T.I. TMS320C30‐based PC‐board, so that it can be used efficiently on the SAM PC‐AT workstation. Some results will be presented on the evaluation of the first SAM database, EUROM0. This database was recorded in quiet conditions and very few classification errors are observed. Work is under development to simulate noisy conditions using the same database, so that the limits of the reference or other systems can be measured.
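
For readers unfamiliar with dynamic programming recognizers, the following is a minimal sketch of isolated-word template matching by dynamic time warping. It is not the SAMREC0 algorithm, which handles connected words and whose local constraints are not described in this abstract; the Euclidean frame distance, the three-way step pattern, and the function names are all assumptions for illustration.

```python
def dtw_distance(a, b):
    """Accumulated DTW cost between two sequences of feature vectors."""
    inf = float("inf")
    n, m = len(a), len(b)
    # cost[i][j] = best accumulated cost aligning a[:i] with b[:j].
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            # Euclidean distance between the two frames (assumed metric).
            d = sum((x - y) ** 2 for x, y in zip(a[i - 1], b[j - 1])) ** 0.5
            cost[i][j] = d + min(cost[i - 1][j],      # stretch b's frame
                                 cost[i][j - 1],      # stretch a's frame
                                 cost[i - 1][j - 1])  # advance both
    return cost[n][m]

def recognize(utterance, templates):
    """Return the label of the template with the lowest DTW cost."""
    return min(templates, key=lambda w: dtw_distance(utterance, templates[w]))

# Hypothetical one-dimensional "features" for two reference words.
templates = {"yes": [[0.0], [1.0]], "no": [[5.0], [5.0]]}
best = recognize([[0.0], [0.0], [1.0]], templates)
```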

Feature detection using a connectionist network (A)

Gary Bradshaw and Alan Bell

J. Acoust. Soc. Am. Volume 89, Issue 4B, pp. 1891-1892 (1991); (2 pages)

Online Publication Date: 14 Aug 2005

A feedforward connectionist network trained by backpropagation was used to detect 15 speech features. The network was trained over 240 sentences (40 men and 40 women), and tested over 200 sentences (10 men and 10 women), all part of the MIT Ice Cream database. Network input consisted of a smoothed spectral vector at 15‐ms intervals, plus two coefficients of amplitude and spectral change. The network achieves a signal detection discrimination level (a‐prime) of 0.87 compared to a level of 0.76 for a ten‐nearest‐neighbor system. Almost identical training and test performances indicate excellent generalization to new speakers and text. Processing costs are mainly signal processing and network training; detection itself can be done in real time. Performance is much better for broad features like sonorance, which occur frequently, than for infrequent features like sibilance, partly because of their low frequency and partly because of other characteristics. [Work supported by USWest.]
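
The a‐prime statistic quoted above is a nonparametric sensitivity index computed from a hit rate and a false-alarm rate. The abstract does not say which variant was used; the sketch below assumes the commonly cited two-branch formula, and the example rates are made up for illustration rather than taken from the paper.

```python
def a_prime(hit_rate, false_alarm_rate):
    """Nonparametric sensitivity index A' (0.5 = chance, 1.0 = perfect)."""
    h, f = hit_rate, false_alarm_rate
    if h == f:
        return 0.5
    if h > f:
        # Here h > 0 and f < 1, so the denominator is positive.
        return 0.5 + ((h - f) * (1.0 + h - f)) / (4.0 * h * (1.0 - f))
    # Below-chance detector: mirror the formula around 0.5.
    return 0.5 - ((f - h) * (1.0 + f - h)) / (4.0 * f * (1.0 - h))

# Hypothetical detector with 80% hits and 20% false alarms.
score = a_prime(0.8, 0.2)
```

Because A' depends on both rates, it is less sensitive to how often a feature occurs than raw accuracy is, which is useful when comparing frequent and infrequent features.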

Neural networks in articulatory speech analysis/synthesis (A)

M. G. Rahim, W. B. Kleijn, and J. Schroeter

J. Acoust. Soc. Am. Volume 89, Issue 4B, pp. 1892-1892 (1991); (1 page)

Online Publication Date: 14 Aug 2005

A major difficulty in articulatory analysis/synthesis is the estimation of vocal‐tract parameters from input speech. The use of neural networks to extract these parameters is more attractive than codebook look‐up due to the lower computational complexity. For example, a multilayer perceptron (MLP) with two hidden layers, trained and evaluated on a small data set, was shown to perform a reasonable mapping of acoustic‐to‐geometric parameters. Increasing the training data, however, revealed ambiguity in the mapping that could not be resolved by a single network. This paper addresses the problem using an assembly of MLPs, each designated to a specific region in the articulatory space. Training data were generated by randomly sampling the parameters of an articulatory model of the vocal system. The resultant vocal‐tract shapes were clustered into 128 regions, and an MLP with one hidden layer was assigned to each of these regions for mapping 18 cepstral coefficients into ten tract areas and a nasalization parameter. Networks were selected by dynamic programming, and were used to control a time‐domain articulatory synthesizer. After training, significant perceptual and objective improvements were achieved relative to using a single MLP. Comparable performance to codebook look‐up with dynamic programming was obtained. This model, however, requires only 4% of the storage needed for the codebook, and performs the mapping faster by a factor of 20.
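
The selection of one network per frame by dynamic programming can be pictured as a shortest-path search. The sketch below is a generic Viterbi-style search, not the authors' procedure: the abstract does not define the costs, so `local_cost` (how poorly network k explains frame t) and `transition_cost` (a penalty for switching networks between frames) are hypothetical placeholders.

```python
def select_networks(local_cost, transition_cost):
    """Return the per-frame network indices with minimum total cost.

    local_cost[t][k]: assumed cost of using network k at frame t.
    transition_cost(j, k): assumed cost of moving from network j to k.
    """
    n_nets = len(local_cost[0])
    best = list(local_cost[0])   # best cost of any path ending in each net
    back = []                    # back-pointers for frames 1..T-1
    for frame_costs in local_cost[1:]:
        new_best, pointers = [], []
        for k in range(n_nets):
            j_best = min(range(n_nets),
                         key=lambda j: best[j] + transition_cost(j, k))
            new_best.append(best[j_best] + transition_cost(j_best, k)
                            + frame_costs[k])
            pointers.append(j_best)
        best = new_best
        back.append(pointers)
    # Trace back from the cheapest final state.
    k = min(range(n_nets), key=lambda j: best[j])
    path = [k]
    for pointers in reversed(back):
        k = pointers[k]
        path.append(k)
    path.reverse()
    return path

# Hypothetical costs for three frames and two networks.
costs = [[0.0, 1.0], [1.0, 0.9], [0.0, 1.0]]
smooth = select_networks(costs, lambda j, k: 0.0 if j == k else 1.0)
free = select_networks(costs, lambda j, k: 0.0)
```

With a switching penalty the path stays on one network; without it the path follows the cheapest network at every frame, which shows how a continuity term can resolve frame-by-frame ambiguity.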

Automatic speech recognition based on property detectors (A)

T. V. Ananthapadmanabha and H. N. Jayasimha

J. Acoust. Soc. Am. Volume 89, Issue 4B, pp. 1892-1892 (1991); (1 page)

Online Publication Date: 14 Aug 2005

Speaker‐independent, large‐vocabulary, continuous speech recognition by a machine is a challenging problem on which over a decade of research has been carried out without significant progress. In the existing systems, the same acoustic feature vector (LPC, cepstrum, filter bank, etc.) is used for all speech sounds and they heavily depend on contextual information for their success. This paper presents some results based on a radically different approach called “property detectors.” The approach of property detectors is well known in visual perception where it has been demonstrated that specialized detectors exist on the retina that trigger only for vertical, horizontal, or inclined lines. It has only been speculated that such specialized detectors could exist for speech. Recently, acoustic properties have been discovered that uniquely characterize some phonemes like /a/, /i/, /u/, /e/, /o/, and /s/. A limited‐vocabulary, speaker‐independent airline schedule announcement system was developed. This system was tested in a noisy hall with a large number of speakers, including female speakers, with different linguistic backgrounds. The system, though in its early stage, gave a performance of about 85% accuracy. The approach based on property detectors appears promising.

Synthesis of manner and voicing continua based on speech production models (A)

Corine Bickley, Kenneth N. Stevens, and Rolf Carlson

J. Acoust. Soc. Am. Volume 89, Issue 4B, pp. 1892-1892 (1991); (1 page)

Online Publication Date: 14 Aug 2005

The goal of this project is to create natural‐sounding synthetic consonant‐vowel syllables for presentation to aphasic patients and normal controls in studies of perception of speech sounds and lexical access. Of particular interest are the manner distinctions that appear to form the basis for the processing of other phonetic dimensions by human listeners. Continua of syllabic‐nonsyllabic, sonorant‐obstruent, continuant‐noncontinuant, and voiced‐voiceless sounds were constructed using the KLSYN88 synthesizer. The endpoint stimuli were synthesized based on theoretical models of glottal and turbulence noise sources and vocal‐tract filtering, with some refinements to match the characteristics of a particular speaker. Intermediate stimuli were created to form continua that represent incremental changes in the synthesizer parameters. For all stimuli, the values of synthesis parameters modeled utterances that could be produced by a human talker. Identification functions for these continua for normal listeners showed relatively sharp boundaries between phonetic categories. The acoustic characteristics of the stimuli in the vicinity of the boundaries were examined to determine the pattern of acoustic attributes responsible for the abrupt change in identification, such as rise times of amplitudes, rates of change of formants, and relative amplitudes of noise and glottal excitations. [Work supported in part by NIH grants DC00776 and DC00075.]

Considerations on speaking style and speaker variability in speech synthesis (A)

Lennart Nord and Björn Granström

J. Acoust. Soc. Am. Volume 89, Issue 4B, pp. 1892-1893 (1991); (2 pages)

Online Publication Date: 14 Aug 2005

In the exploration of speaking style and speaker variability, a multispeaker database and a speech production model are used. The structure of the database, which includes professional as well as untrained speakers, makes it possible to extract relevant information by simple search procedures. In perceptual studies both F0 and duration have had an indisputable effect on prosodics but the role of intensity and of segmental variation has been less clear. This has resulted in an emphasis on the former attributes in current speech synthesis schemes. Intensity has a dynamic aspect, discriminating emphasized and reduced stretches of speech. A more global aspect of intensity must be controlled when an attempt is made to model different speaking styles. Specifically, attempts have been made to model the continuum from soft to loud speech. Systematic variation in speech synthesis has been used as a tool to explore possible speaker dimensions, among them reduced and over‐articulated speech. Listening experiments have been carried out with the aim of investigating whether it is possible to describe synthesis samples according to different attitudinal and emotional dimensions.

Improvement of synthetic speech quality through syntactic information (A)

Tohru Shimizu, Seiichi Yamamoto, Norio Higuchi, and Hisashi Kawai

J. Acoust. Soc. Am. Volume 89, Issue 4B, pp. 1893-1893 (1991); (1 page)

Online Publication Date: 14 Aug 2005

Many words in Japanese have identical written expression but different pronunciation. Natural synthetic speech therefore requires selection of the correct pronunciation for words and optimized prosodic features, including accent position and level, sentence intonation, and length of pause, through the use of syntactic features. This paper describes (1) a new method of determining phrase accent level, based on accentual phrase boundary location and compound word structure, and (2) a newly proposed syntactic class of phrase boundaries. The results of the automatic determination of pronunciation and of opinion tests of intelligibility and naturalness are also described. About 10 000 words are assigned to syntactic and semantic features to determine correct pronunciation, representing about 20% of the total vocabulary. Pronunciation of 99% of the words in a Japanese economic daily was correct, and naturalness of the synthetic speech was 1.1 grades higher under the five‐grade opinion test.

Source parameters for the fricative consonants /s,ʃ,ç, x/ (A)

Christine H. Shadle

J. Acoust. Soc. Am. Volume 89, Issue 4B, pp. 1893-1893 (1991); (1 page)

Online Publication Date: 14 Aug 2005

A series of experiments with mechanical models of fricative consonant articulatory configurations has been conducted to determine where in the tract the turbulence noise is generated and the spectral characteristics of that noise. The latest models, based on a combination of x‐ray, EPG, and photographic data, have the correct midsagittal profile and area function, and thus have the most realistic shape of model work to date. Data obtained from /s,ʃ/ substantiate earlier results based on a different subject [C. H. Shadle, J. Acoust. Soc. Am. Suppl. 1 84, S34 (1988); C. H. Shadle, in Speech Production and Speech Modelling, Proc. of NATO‐ASI, edited by W. Hardcastle and A. Marchal (Kluwer Academic, Amsterdam, 1990), pp. 127–219] and results from extremely idealized models [C. H. Shadle, Proc. 12th ICA, paper A3‐4, Toronto (1986)]. Comparisons across a range of flow rates, with and without sublingual cavity, between measured source and far‐field spectra, and between speech and model data for /s,ʃ,ç, x/ lead to source parameters, a distinction between two source types, and to the conclusion that the three‐dimensional shape of the tract is crucial in determining source parameters: these parameters can be used in a model based on one‐dimensional sound propagation. Three‐way comparisons between far‐field sound measured (1) for the models and (2) for actual utterances, and (3) far‐field sound predicted from measured source parameters used in a model based on one‐dimensional sound propagation, will be shown. [Work supported by SERC.]

Reliable glottal‐closure‐instant (GCI) estimation from short analysis frames (A)

Krishna S. Nathan and Harvey F. Silverman

J. Acoust. Soc. Am. Volume 89, Issue 4B, pp. 1893-1893 (1991); (1 page)

Online Publication Date: 14 Aug 2005

It is well known that the first formant is maximally excited at the instant of glottal closure. Therefore, it is natural to utilize the energy in a band containing the first formant as a cue to the GCI. In practice, however, the actual GCI lies a few samples prior to where this energy signal attains a local maximum. Moreover, such an estimate makes no use of any period information regarding the GCI's. Consequently, secondary excitations within a period can lead to spurious GCI's. It is therefore proposed to augment the information contained in the first formant with the linear prediction error. Although prediction error has been widely used for pitch determination, it is not sufficient to locate the GCI reliably because of ambiguities arising from multiple peaks, especially for vowels like /u/ (as in foot). Interestingly, these experiments have shown that secondary excitations tend to result in peaks in the residual error signal at locations different from those in the formant energy signal. Furthermore, in the absence of spurious excitation, the residual error can contain valuable independent period information. Therefore, the product of these two signals yields accurate GCI estimates. Such an algorithm has been tested on all vowels in a variety of environments and has been found to be very robust. Analysis frames as short as 5–10 ms have been used.
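
The idea of multiplying the two cue signals can be shown with a small sketch. It assumes the first-formant band energy and the linear prediction residual have already been computed and time-aligned; the peak-picking rule, the `threshold_ratio` parameter, and the function name are hypothetical, and no correction is applied for the few-sample offset between the energy maximum and the true GCI that the abstract mentions.

```python
def gci_candidates(formant_energy, residual, threshold_ratio=0.5):
    """Pick sample indices where the product of the two cues peaks.

    formant_energy: nonnegative energy in a band around F1, per sample.
    residual: linear prediction error, per sample (magnitude is used).
    threshold_ratio: assumed fraction of the global peak a pick must reach.
    """
    product = [e * abs(r) for e, r in zip(formant_energy, residual)]
    if not product:
        return []
    floor = threshold_ratio * max(product)
    picks = []
    for n in range(1, len(product) - 1):
        is_local_max = product[n] > product[n - 1] and product[n] >= product[n + 1]
        if is_local_max and product[n] >= floor:
            picks.append(n)
    return picks

# Toy signals: the energy cue has a spurious peak at sample 4 that the
# residual does not share, so the product suppresses it.
energy = [0.0, 1.0, 0.0, 0.0, 1.0, 0.0, 0.0, 1.0, 0.0]
resid = [0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, -1.0, 0.0]
picks = gci_candidates(energy, resid)
```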

Isolation and characterization of microevents in speech (A)

David A. Berry and William J. Strong

J. Acoust. Soc. Am. Volume 89, Issue 4B, pp. 1893-1893 (1991); (1 page)

Online Publication Date: 14 Aug 2005

An event‐synchronous technique has been designed in an attempt to optimize time and frequency resolution in speech analysis. The technique isolates “microevents” in the speech waveform and then analyzes them, thus differing from commonly used asynchronous methods that employ a fixed frame length stepped forward in constant time increments. A microevent (ME) is associated with a “packet of energy” in the waveform and is initiated by some underlying input or fluctuation of energy. There are four basic types of MEs: (1) a voiced ME is initiated by a pitch pulse; (2) a plosive ME is initiated by a plosive burst; (3) a noise ME is initiated by a positive fluctuation in energy; and (4) a mixture ME. An ME is terminated at the initiation of the next ME or when the energy of the speech signal falls below the background level. ME durations are constrained to lie within a range of 2–20 ms. The current algorithm, developed and tested with portions of the 1988 DARPA TIMIT acoustic‐phonetic continuous speech database, isolates over 95% of the MEs correctly. Once isolated, MEs are characterized by their one‐third octave spectra. Results will be illustrated with various examples.
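
The one‐third octave characterization mentioned at the end of this abstract can be sketched as follows. The abstract does not specify the band definition or frequency range, so this assumes base-2 bands with centers at 1000 · 2^(k/3) Hz and edges a sixth of an octave either side; the function names and the flat toy spectrum are hypothetical.

```python
import math

def third_octave_bands(f_min, f_max, f_ref=1000.0):
    """List (low, center, high) edges for bands centered in [f_min, f_max]."""
    k_lo = math.ceil(3 * math.log2(f_min / f_ref))
    k_hi = math.floor(3 * math.log2(f_max / f_ref))
    bands = []
    for k in range(k_lo, k_hi + 1):
        center = f_ref * 2 ** (k / 3)
        # Each band spans one third of an octave around its center.
        bands.append((center * 2 ** (-1 / 6), center, center * 2 ** (1 / 6)))
    return bands

def band_energies(power_spectrum, bin_hz, bands):
    """Sum power-spectrum bins (spaced bin_hz apart from 0 Hz) per band."""
    energies = []
    for low, _, high in bands:
        energies.append(sum(p for i, p in enumerate(power_spectrum)
                            if low <= i * bin_hz < high))
    return energies

# Toy example: flat spectrum with 250-Hz bins, bands from 500 to 2000 Hz.
bands = third_octave_bands(500.0, 2000.0)
energies = band_energies([1.0] * 10, 250.0, bands)
```

Applied to the spectrum of each isolated microevent, this would yield one short vector of band energies per event; with event durations of 2–20 ms, the low-frequency bands would in practice contain few or no spectral bins.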