Journal of the Acoustical Society of America

May 1988

Volume 83, Issue S1, pp. S1-S122

Session Y. Speech Communication V: Speech and Speaker Recognition
Contributed Papers

Comparative study of ASR front‐ends in noise (A)

Jean‐Claude Junqua and Hisashi Wakita

J. Acoust. Soc. Am. Volume 83, Issue S1, pp. S54-S54 (1988); (1 page)

Online Publication Date: 13 Aug 2005

In automatic speech recognition (ASR) of speech corrupted by noise, the performance tends to deteriorate rapidly depending on the choice of analysis method and distance measure. In order to evaluate the recognition performance for several analysis methods and distance measures, a series of isolated word recognition experiments was performed. Analysis methods selected are critical‐band filtering, perceptually based linear prediction (PLP), linear prediction (LP), and time synchronous linear prediction (SLP). The weighted Euclidean distance with different weightings [unity, root power sums (RPS), and exponential filtering] was applied in the cepstrum domain. Experiments were carried out for clean speech and for two noise conditions (white and low‐pass filtered white, added to the clean speech) at different SNRs (25 to 5 dB), using an alphanumeric vocabulary (ten speakers). It is shown that improvements in robustness of the recognizer in noise can be achieved by a proper selection of analysis method and cepstral weights used in the front‐end. Improvements are found over the RPS distance measure (previously shown to be useful in noise conditions with LP and PLP analyses) [B. Hanson and H. Wakita, Proceedings ICASSP 86 (IEEE, New York, 1986), pp. 757–760] by use of the general exponential lifter.
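The weighted Euclidean distance in the cepstrum domain can be illustrated with a minimal sketch. The abstract does not give the weight formulas, so the forms below are assumptions: RPS is taken as index weighting (w_k = k) and the general exponential lifter as w_k = k^s. The function names and the default exponent `s=0.5` are hypothetical and chosen only for illustration.

```python
import math


def lifter_weights(n_coeffs, scheme="unity", s=0.5):
    """Weights w_1..w_n for cepstral indices 1..n (hypothetical helper).

    The "rps" and "exponential" forms are assumptions; the abstract
    names the weightings but does not define them.
    """
    if scheme == "unity":
        return [1.0] * n_coeffs
    if scheme == "rps":
        # Assumed form: index weighting, w_k = k.
        return [float(k) for k in range(1, n_coeffs + 1)]
    if scheme == "exponential":
        # Assumed form: w_k = k**s; s=0.5 is an arbitrary example value.
        return [float(k) ** s for k in range(1, n_coeffs + 1)]
    raise ValueError(f"unknown weighting scheme: {scheme}")


def weighted_cepstral_distance(c_ref, c_test, weights):
    """Euclidean distance between two cepstral vectors after weighting."""
    if not (len(c_ref) == len(c_test) == len(weights)):
        raise ValueError("vectors and weights must have the same length")
    return math.sqrt(
        sum((w * (a - b)) ** 2 for w, a, b in zip(weights, c_ref, c_test))
    )
```

Under these assumed forms, unity weighting is the case s = 0 and index weighting is the case s = 1, so a single exponent spans the range between the two.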

Feature‐based automatic syllable and stress detection (A)

Briony Williams and Jonathan Dalby

J. Acoust. Soc. Am. Volume 83, Issue S1, pp. S54-S54 (1988); (1 page)

Online Publication Date: 13 Aug 2005

The importance of syllable structure and stress level as determinants of segmental temporal and spectral variability makes automatic syllable detection and stress estimation a very desirable goal for continuous speech recognition research. In this paper, a rule‐based system is described for locating syllables in continuous speech and for making a two‐level stress assignment. Location of syllable nuclei and rough estimation of syllable boundaries are performed using a smoothed midfrequency “sonorant” energy contour, a frication detector, and sonorant consonant detectors. First pass classification of detected syllables as stressed, unstressed, or uncertain is based on the relative energy levels and utterance‐position‐normalized durations of syllables in a three‐syllable window. Fundamental frequency information is then used to reclassify uncertain cases as either stressed or unstressed. Preliminary evaluation of the system's current performance on a multispeaker data base yielded 77% correct location and classification of syllables. Although improvement is necessary, this result is encouraging.

A segmentation algorithm based on spectral variance (A)

A. Kumar and H. Wakita

J. Acoust. Soc. Am. Volume 83, Issue S1, pp. S54-S54 (1988); (1 page)

Online Publication Date: 13 Aug 2005

Presented is a segmentation algorithm based on spectral variance. The speech signal is first segmented into spectrally stable segments by using a median‐smoothed spectral variance over a 70‐ms window. The segment boundaries are placed at the maxima in the spectral variance and the minima give typical frames for the segments. The spectral variance peak for the glides is generally very small because of their smooth transition. Hence, the glides are segmented by using second and third formant trajectories. Extraneous segments outside the word boundaries are eliminated by an adaptive silence detector. The segments are then assigned broad phoneme classes by using a tree classifier on the LPC cepstral coefficients for the typical frames. The fricative and nonfricative segments are distinguished by the normalized differenced speech signal. The tree classifier is separately trained on hand‐labeled databases. Preliminary experiments show that 90% of the segments are detected. Results for the 104‐word keyboard vocabulary for six males and four females and continuous speech will be presented.
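The boundary placement step can be sketched as follows, assuming the per-frame spectral variance curve has already been computed (the abstract does not define how). The function names and the smoothing width are hypothetical. Only strict local extrema are reported, so flat plateaus left by median smoothing would need extra handling in practice.

```python
from statistics import median


def median_smooth(values, width=3):
    """Median-smooth a sequence; the window is truncated at the edges."""
    half = width // 2
    smoothed = []
    for i in range(len(values)):
        lo = max(0, i - half)
        hi = min(len(values), i + half + 1)
        smoothed.append(median(values[lo:hi]))
    return smoothed


def local_extrema(curve):
    """Return (maxima, minima) as lists of interior frame indices.

    Following the abstract, maxima would mark segment boundaries and
    minima would mark the typical frame of each segment.
    """
    maxima, minima = [], []
    for i in range(1, len(curve) - 1):
        if curve[i] > curve[i - 1] and curve[i] > curve[i + 1]:
            maxima.append(i)
        elif curve[i] < curve[i - 1] and curve[i] < curve[i + 1]:
            minima.append(i)
    return maxima, minima
```

A caller would pass the smoothed curve to `local_extrema`, for example `local_extrema(median_smooth(variance, width=5))`, where `variance` is a hypothetical list of per-frame values.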

Speech recognition using a synthesized codebook (A)

L. A. Smith, B. L. Scott, R. G. Goodman, L. S. Lin, and J. M. Newell

J. Acoust. Soc. Am. Volume 83, Issue S1, pp. S54-S55 (1988); (2 pages)

Online Publication Date: 13 Aug 2005

Speech sounds generated by a simple waveform synthesizer were used to create a vector quantization codebook for use in speech recognition. Recognition was tested over the TI‐20 isolated word database using a conventional DTW matching algorithm. Input speech was filtered to limit the bandwidth to 300–3300 Hz, then was passed through the Scott Instruments Coretechs process, implemented on the SI2010 signal processing chip, to create the speech representation for matching. Synthesized sounds were processed in software by an SI2010 emulation program. SI2010 emulation and recognition were performed on a DEC VAX 11/750. The original codebook contained 109 vectors. This codebook was decimated through the course of the experiments, based on the number of times each vector was used in quantizing the training data for the previous experiment. Recognition scores are presented for progressively smaller codebook sizes, as well as for the baseline condition (no vector quantization).
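Usage-based codebook decimation of the kind described can be sketched as below. This is an illustration only: the abstract does not state the distance measure or how many vectors were dropped per experiment, so squared Euclidean distance and the `keep` parameter are assumptions, and all function names are hypothetical.

```python
def nearest_index(vector, codebook):
    """Index of the codebook entry closest to `vector`.

    Squared Euclidean distance is an assumption, not from the abstract.
    """
    return min(
        range(len(codebook)),
        key=lambda i: sum((a - b) ** 2 for a, b in zip(vector, codebook[i])),
    )


def usage_counts(training_vectors, codebook):
    """Count how often each codebook entry is chosen when quantizing."""
    counts = [0] * len(codebook)
    for vector in training_vectors:
        counts[nearest_index(vector, codebook)] += 1
    return counts


def decimate(codebook, counts, keep):
    """Keep the `keep` most frequently used entries, in original order."""
    ranked = sorted(range(len(codebook)), key=lambda i: counts[i], reverse=True)
    kept = sorted(ranked[:keep])
    return [codebook[i] for i in kept]
```

Repeating `usage_counts` and `decimate` on the shrinking codebook yields progressively smaller codebooks, each pruned according to how the previous one was used on the training data.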

An efficient, robust speaker‐independent algorithm (A)

Brian Scott, Lisan Lin, Mark Newell, and Lloyd Smith

J. Acoust. Soc. Am. Volume 83, Issue S1, pp. S55-S55 (1988); (1 page)

Online Publication Date: 13 Aug 2005

The algorithms described have yielded speaker‐independent scores of 95.1% on the 20‐word TI database obtained from the National Bureau of Standards. Results were obtained by training the system on half of the speakers in the database, testing on the other half, and then reversing the order. Training was done with the 10 training tokens per speaker per word only. Testing was on the 16 test tokens per speaker per word. The total number of test trials was 5120. The recognizer uses conventional methods for time normalization and matching. Time normalization is linear and scoring is accomplished with a simple differencing algorithm weighted by variances. Storage requirement is 3072 bits per word. Most of the speaker normalization is accomplished by the proprietary signal processing method developed by Scott Instruments. Aside from the amplitude normalization routines, no floating point arithmetic is used. All signal processing is temporally based. The front end process can be adapted for use with dynamic time warping algorithms or feature based algorithms. The system is, therefore, extensible to connected speech.
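Linear time normalization followed by variance-weighted differencing might look like the sketch below. The abstract gives no formulas, and the front-end signal processing is proprietary, so everything here is an assumption: frames are plain lists of numbers, normalization picks evenly spaced frames, and the score is the sum of absolute differences divided by the per-feature variance. The function names and the `floor` parameter are hypothetical.

```python
def linear_time_normalize(frames, target_len):
    """Map a token onto `target_len` frames by evenly spaced selection."""
    n = len(frames)
    if n == 0 or target_len <= 0:
        raise ValueError("need at least one frame and a positive target length")
    return [frames[(i * n) // target_len] for i in range(target_len)]


def variance_weighted_score(template_means, template_vars, token, floor=1e-6):
    """Sum of |x - mean| / variance over all frames and features.

    Dividing by the variance (rather than, say, the standard deviation)
    is an assumption; `floor` guards against division by zero.
    Lower scores indicate a closer match.
    """
    total = 0.0
    for mean_f, var_f, tok_f in zip(template_means, template_vars, token):
        for m, v, x in zip(mean_f, var_f, tok_f):
            total += abs(x - m) / max(v, floor)
    return total
```

A recognizer built this way would normalize each test token to the template length, score it against every word template, and pick the word with the lowest score.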

Recognition of continuously spoken letters by listeners and spectrogram readers (A)

Nancy A. Daly and Victor W. Zue

J. Acoust. Soc. Am. Volume 83, Issue S1, pp. S55-S55 (1988); (1 page)

Online Publication Date: 13 Aug 2005

Because of acoustic similarities between some letters of the alphabet, automatic recognition of continuously spoken letters is a difficult task. The goal of this study is to determine and compare how well listeners and spectrogram readers can recognize continuously spoken letter strings from multiple speakers. The interest in spectrogram reading results is motivated by the belief that this procedure may help to identify acoustic attributes and decision strategies that are useful for system implementation. Listening and spectrogram reading tests involving eight listeners and six spectrogram readers, respectively, were conducted using a corpus of 1000 wordlike strings designed to minimize the use of lexical knowledge. Results show that listeners' performance was better than readers' (98.4% vs 91.0%). In both experiments, string lengths were determined very accurately (98.1% and 96.2%), presumably due to the large number of glottal stops inserted at letter boundaries to facilitate segmentation. Most of the errors were due to substitution of one letter for another (68% and 92%), and they generally fall into two categories. Asymmetric errors can often be attributed to subjects' disregard for contextual influence, whereas symmetric errors are largely due to acoustic similarities between certain letter pairs. Subsequent acoustic study of four of the most confusable letter pairs has resulted in the identification of a number of distinguishing acoustic attributes. Using these attributes, overall recognition performance better than that of the readers was achieved. [Work supported by NSF and DARPA under contract N00014‐82‐K‐0727, monitored through the Office of Naval Research.]

Human and machine performance on speaker identity verification (A)

Timothy C. Feustel, Robert J. Logan, and George A. Velius

J. Acoust. Soc. Am. Volume 83, Issue S1, pp. S55-S55 (1988); (1 page)

Online Publication Date: 13 Aug 2005

Two experiments were conducted to identify acoustic features for speaker identity verification (SIV) that are used by humans and not by cepstral‐based algorithms. Although these algorithms generally out‐perform human listeners for randomly selected comparisons between single‐word utterances, the approach taken here was to analyze human performance on comparisons that could not be effectively discriminated by machine. Experiment 1 showed that humans could perform at high levels of accuracy on these comparisons, suggesting that either information exists that is not captured by the algorithms, or that the information is coded by the algorithms but is not used effectively. The second experiment consisted of three stimulus conditions for SIV: digitized speech signals, noise‐excited resynthesized LPC signals, and error prediction signals from the LPC. Results indicated high levels of performance in the natural and error prediction signal conditions and performance near chance in the noise‐excited condition, thus suggesting that the error signal provides valuable information that allows humans to distinguish between speakers. It may be possible to improve verification algorithms by adapting current models to more accurately utilize information used by human listeners.
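For readers unfamiliar with the LPC error signal, the sketch below shows one standard way to obtain it: autocorrelation-method linear prediction solved with the Levinson-Durbin recursion, followed by inverse filtering. This is a generic textbook procedure, not necessarily the analysis used in the study; the function names are hypothetical, and windowing and pre-emphasis are omitted.

```python
def autocorrelation(x, order):
    """Autocorrelation lags r[0..order] of a finite sequence."""
    return [
        sum(x[n] * x[n - k] for n in range(k, len(x)))
        for k in range(order + 1)
    ]


def levinson_durbin(r, order):
    """Solve for inverse-filter coefficients a[0..order], with a[0] = 1."""
    a = [1.0] + [0.0] * order
    err = r[0]
    for i in range(1, order + 1):
        acc = r[i] + sum(a[j] * r[i - j] for j in range(1, i))
        k = -acc / err  # reflection coefficient for this stage
        new_a = a[:]
        for j in range(1, i):
            new_a[j] = a[j] + k * a[i - j]
        new_a[i] = k
        a = new_a
        err *= 1.0 - k * k
    return a, err


def prediction_error(x, a):
    """Inverse-filter x: e[n] = sum_j a[j] * x[n - j]."""
    order = len(a) - 1
    return [
        sum(a[j] * x[n - j] for j in range(order + 1) if n - j >= 0)
        for n in range(len(x))
    ]
```

The error signal keeps whatever the all-pole model fails to predict, including excitation detail, which is consistent with the abstract's suggestion that it carries speaker information.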