Journal of the Acoustical Society of America

May 1981

Volume 69, Issue S1, pp. S1-S125

Session S. Speech Communication III: Speech Recognition
Contributed Papers

Organizing the lexicon for recognition (A)

Jola Jakimik and Sheri Hunnicutt

J. Acoust. Soc. Am. Volume 69, Issue S1, pp. S41-S41 (1981); (1 page)

In this paper we discuss several aspects of the organization of a lexicon for the purpose of recognizing spoken words. We are interested in what a listener (human or computer) needs to know about a word in order to distinguish it from other words. We consider how words may be organized (or “indexed”) in the lexicon so that appropriate candidates for recognition are accessed from partial information, and how each word may be specified. (These features of a lexicon define the dimensions of perceptual similarity among words, which determine misperceptions, for example.) We evaluate some likely indexing properties and lexical representations; for example, we consider indexing by stressed vowels, and lexical representation in terms of consonant and vowel classes.
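
As a rough illustration of indexing a lexicon by consonant and vowel classes, the sketch below groups words by a coarse C/V pattern and retrieves candidates from a partially identified input. The class inventory, the toy transcriptions, and the names `broad_class_key`, `build_index`, and `candidates` are assumptions made for this example; the abstract does not specify them.

```python
from collections import defaultdict

# Hypothetical vowel inventory; the abstract does not define the classes used.
VOWELS = {"a", "e", "i", "o", "u"}

def broad_class_key(phonemes):
    """Map a phoneme sequence to a coarse consonant/vowel pattern such as 'CVC'."""
    return "".join("V" if p in VOWELS else "C" for p in phonemes)

def build_index(lexicon):
    """Group words by their broad-class pattern."""
    index = defaultdict(list)
    for word, phonemes in lexicon.items():
        index[broad_class_key(phonemes)].append(word)
    return index

# Toy lexicon with made-up, simplified transcriptions (illustration only).
LEXICON = {
    "cat": ["k", "a", "t"],
    "dog": ["d", "o", "g"],
    "tree": ["t", "r", "i"],
    "at": ["a", "t"],
}

INDEX = build_index(LEXICON)

def candidates(observed_phonemes):
    """Return the words whose coarse pattern matches the observed pattern."""
    return sorted(INDEX.get(broad_class_key(observed_phonemes), []))
```

Even when individual phonemes are misidentified, an input heard as consonant, vowel, consonant still retrieves only the CVC words, which is the kind of access from partial information the abstract describes.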

Isolated word recognition using demisyllable templates (A)

A. E. Rosenberg, L. R. Rabiner, S. E. Levinson, J. G. Wilpon, and K. L. Shipley

J. Acoust. Soc. Am. Volume 69, Issue S1, pp. S41-S41 (1981); (1 page)

An automatic speech recognition system is described for recognizing isolated words from reference templates created by concatenating demisyllables from a corpus of about 1000 demisyllables. The composition (in terms of demisyllables) of each reference word is specified in a lexicon with one or more entries for each word in the vocabulary. Experiments were carried out, using a 1109‐word basic English vocabulary, to investigate the usefulness of such a representation and the effect on performance of some simple modifications in demisyllable specifications and durations of reference patterns. Performance statistics are provided.

Speaker independent isolated word recognition for a 129‐word airline vocabulary (A)

J. G. Wilpon, L. R. Rabiner, and A. Bergh

J. Acoust. Soc. Am. Volume 69, Issue S1, pp. S41-S41 (1981); (1 page)

Previous research at Bell Laboratories has shown that a reliable set of speaker‐independent word reference templates for a speech recognition system can be obtained from a population of talkers using sophisticated statistical clustering techniques. These studies have investigated a 39‐word alpha‐digit vocabulary and a 54‐word vocabulary of computer terms. In this talk, automatic clustering procedures are used to create reference tokens for a 129‐word vocabulary of airline reservation terms. To obtain the word reference templates, a two‐stage training procedure was used. First, each of 100 talkers (50 men, 50 women) used the robust training procedure of Rabiner and Wilpon to provide a single, reliable pattern for each vocabulary word. Second, the set of automatic clustering procedures placed each word token into one of several word clusters and produced one reference pattern per cluster. A set of 20 new talkers was used to test the procedure. Several length normalization procedures were incorporated into the testing procedure. Recognition accuracies on the order of 90% were obtained for the 20 talkers. These results are comparable to ones obtained from a speaker‐dependent study done previously on the same vocabulary.

Computational cost of DP algorithms in speech recognition (A)

A. Waibel, N. Krishnan, and R. Reddy

J. Acoust. Soc. Am. Volume 69, Issue S1, pp. S41-S41 (1981); (1 page)

In this study we present models of computation for several search algorithms for isolated word recognition. The search techniques considered include variants of classical dynamic programming, of the branch‐and‐bound search technique, and of the beam search technique (as implemented in the Harpy system). We show that, depending on the choice of technique and related parameters, one can achieve more than an order of magnitude improvement in speed without any loss of accuracy.
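
To give a feel for how the cost of a dynamic programming match depends on the search constraints, here is a minimal sketch of dynamic time warping over sequences of numbers, with an optional band that limits how far the alignment may stray from the diagonal. The band constraint is swapped in here as a simple stand-in for the pruning strategies the abstract mentions; it is not the authors' algorithm, and the function name `dtw` and the cell counter are assumptions for illustration.

```python
import math

def dtw(a, b, band=None):
    """Return (cost, cells) for aligning numeric sequences a and b.

    band limits |i - j| along the alignment path; None means unconstrained.
    cells counts the local distances evaluated, a rough proxy for computation.
    """
    n, m = len(a), len(b)
    # D[i][j] holds the best cumulative cost of aligning a[:i] with b[:j].
    D = [[math.inf] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0.0
    cells = 0
    for i in range(1, n + 1):
        lo, hi = 1, m
        if band is not None:
            lo, hi = max(1, i - band), min(m, i + band)
        for j in range(lo, hi + 1):
            d = abs(a[i - 1] - b[j - 1])  # local distance between two frames
            cells += 1
            D[i][j] = d + min(D[i - 1][j], D[i][j - 1], D[i - 1][j - 1])
    return D[n][m], cells

reference = [1, 2, 3, 4, 5, 6]
utterance = [1, 2, 2, 3, 4, 5, 6]
full_cost, full_cells = dtw(reference, utterance)
band_cost, band_cells = dtw(reference, utterance, band=1)
```

In this toy case the banded search finds the same alignment cost while evaluating fewer cells. Whether a given constraint preserves accuracy on real speech depends on the data and parameters.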

Effect of reference set selection on speaker dependent speech recognition (A)

Zongge Li, Fil Alleva, and Raj Reddy

J. Acoust. Soc. Am. Volume 69, Issue S1, pp. S41-S41 (1981); (1 page)

Presented here is an algorithm for a speaker‐dependent system that chooses a reference template for each word in the vocabulary from a set of N exemplars. The goal of the algorithm is to produce a reference set that minimizes the worst matching behavior and total error over the N sets of exemplars. The results of the experiments presented here show a reduction in the average error rate from 16.4% to 10.2% over a set of 4 male speakers and 4 female speakers.
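
One plausible reading of this selection criterion can be sketched as follows: for each word, pick the exemplar whose worst-case distance to the other exemplars is smallest, breaking ties by total distance. This is an assumed simplification for illustration, not the authors' published algorithm; the function name `select_reference` and the caller-supplied `distance` function are hypothetical.

```python
def select_reference(exemplars, distance):
    """Return the index of the exemplar with the smallest worst-case distance
    to the other exemplars, breaking ties by smallest total distance."""
    best_index, best_key = None, None
    for i, candidate in enumerate(exemplars):
        dists = [distance(candidate, other)
                 for j, other in enumerate(exemplars) if j != i]
        # With a single exemplar there is nothing to compare against.
        key = (max(dists, default=0.0), sum(dists))
        if best_key is None or key < best_key:
            best_index, best_key = i, key
    return best_index

# Toy one-dimensional "templates"; real exemplars would be feature sequences
# compared with a time-alignment distance.
chosen = select_reference([1.0, 4.0, 6.0, 10.0], lambda x, y: abs(x - y))
```

Here the third exemplar (index 2) is chosen because its largest distance to any other exemplar is the smallest of the four.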

Speaker adaptation for word‐based speech recognition systems (A)

Melvyn J. Hunt

J. Acoust. Soc. Am. Volume 69, Issue S1, pp. S41-S42 (1981); (2 pages)

This work is aimed at enhancing the speaker‐independent performance of word‐based speech recognition systems by rapidly and automatically deducing general characteristics of the current speaker and using them to derive speaker‐normalizing transforms. DP matching is used to align and compare corresponding frames of the incoming speech and reference vocabulary. A single transform is then computed for all voiced speech and another for all unvoiced speech. Each transform consists of a linear filtering component and, optionally, a constrained frequency shift. Experiments have been carried out with twenty male and female, native and non‐native English speakers each producing 150 digits. Adaptation on all 150 digits reduces recognition errors by a factor of three (4.5% to 1.5%). With adaptation on just three randomly selected digits, the reduction factor is two. Frequency shifting is useful only when the amount of adaptation material is large and the reference speech is not exclusively from the same sex as the current speaker. Best performance is obtained using a transform without frequency shifting and with all input and reference speech from the same sex. [Work supported by DCIEM, Department of National Defence, Canada.]
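
A linear filter multiplies the spectrum, so in a log‐spectral representation it becomes an additive per‐channel offset. The sketch below estimates such an offset as the mean difference between aligned reference and input frames. The averaging estimator, the log‐spectral frame format, and the names `estimate_offset` and `apply_offset` are assumptions for illustration; the abstract does not say how the transform is computed, and the frequency shift is not modeled here.

```python
def estimate_offset(aligned_pairs):
    """Average per-channel difference (reference - input) over aligned frames.

    aligned_pairs is a list of (input_frame, reference_frame) tuples, each
    frame a list of log-spectral channel values of equal length.
    """
    n_channels = len(aligned_pairs[0][0])
    sums = [0.0] * n_channels
    for input_frame, reference_frame in aligned_pairs:
        for k in range(n_channels):
            sums[k] += reference_frame[k] - input_frame[k]
    return [s / len(aligned_pairs) for s in sums]

def apply_offset(frame, offset):
    """Shift an input frame toward the reference speaker's average spectrum."""
    return [x + o for x, o in zip(frame, offset)]

# Toy two-channel frames, already aligned (for example by DP matching).
pairs = [([1.0, 2.0], [2.0, 2.5]), ([3.0, 4.0], [4.0, 4.5])]
offset = estimate_offset(pairs)
normalized = apply_offset([0.0, 0.0], offset)
```

Following the abstract, one such offset would be estimated from voiced frames and another from unvoiced frames.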

Directory listing retrieval using spoken connected letters and a level building DTW algorithm (A)

C. S. Myers and L. R. Rabiner

J. Acoust. Soc. Am. Volume 69, Issue S1, pp. S42-S42 (1981); (1 page)

At Bell Laboratories, a system for retrieving directory listings based on recognition of spoken spelled names (with isolated letter input) has been developed. The system can be used in either a speaker trained mode or in a speaker independent mode, and the overall name accuracy is quite high in either case. In this talk we discuss an extension of the system to the recognition of connected sequences of letters for the specification of the name. The recognition and search procedures had to be modified since no clear segmentation into individual letters was possible. The level‐building DTW algorithm of Myers and Rabiner was used to provide letter distance scores, and a modified Aldefeld search procedure was used to find name candidates. Name recognition accuracies on the order of 95% were obtained in tests of the system with four talkers saying 50 names at two talking speeds for both speaker trained and speaker‐independent modes.

An architecture of an MOS‐LSI speech recognition system using dynamic programming (A)

H. Murveit, M. Lowy, and R. W. Brodersen

J. Acoust. Soc. Am. Volume 69, Issue S1, pp. S42-S42 (1981); (1 page)

In the past several years, a number of very accurate, isolated word speech recognition systems based on dynamic programming techniques have been designed and tested. However, as these techniques are computationally intensive, commercial systems using dynamic time warping have been costly. We have designed an architecture which exploits the capabilities of custom MOS‐LSI designs to implement a complete speech recognition system. This system would operate in real time using dynamic time warping, yet it would only require 4–5 integrated circuits for a moderate (50–200 word) vocabulary. This system is designed to be expandable so that larger vocabularies can be used by including additional IC's in parallel with the others. The integrated circuits which are required are two custom‐designed chips, a memory IC, and a low‐performance microcomputer for overall control. The custom chips include a front end processor for spectral analysis (currently a switched‐capacitor filter bank) with an endpoint detector, and an IC to implement the dynamic programming algorithm. The architecture of the custom chips has been defined so that adequate performance can be obtained from a standard MOS process. A connected speech recognition algorithm that is based on the use of the above IC's has been developed and will be described. It requires additional processing in the low‐speed microcomputer to perform a second level of dynamic programming.

Recognition strategies in a continuous speech understanding system (A)

Joseph J. Mariani

J. Acoust. Soc. Am. Volume 69, Issue S1, pp. S42-S42 (1981); (1 page)

ESOPE0, the first version of our speech recognition system, uses a top‐down strategy from the pragmatic level to the phonetic one, and operates from left to right with a best‐few method and no‐backtracking. Dynamic comparison among the four best phoneme‐candidates is carried out. ESOPE1 uses the same basic strategy in a systematic way: A best‐few algorithm leads to a beam‐search procedure. ESOPE1‐1 employs a top‐down treatment down to the acoustic level with a diphone dictionary. It uses a dynamic comparison method at the acoustic level. In our automatic dictation project, using a natural language syntax and a 170 000‐form vocabulary, a bottom‐up, best‐few attitude has been taken to translate into words an error‐free continuous phoneme string. We therefore feel that severely limited language and poor phoneme recognition involve a top‐down strategy, whereas a bottom‐up strategy is preferable in the opposite situation. This, and the recent results in psycholinguistics, lead us, in our present elaboration of ESOPE2, to the use of both a top‐down, and a bottom‐up strategy (Prediction‐Verification‐Induction). Predictions are made at each level, but the recognized phonemes may introduce unpredicted words, to allow limited learning abilities.

The use of syntax, semantics, and pragmatics in the KEAL speech understanding system (A)

Dominique Gillet, Andrée Nouhen‐Bellec, Patrice Quinton, and Jacques Siroux

J. Acoust. Soc. Am. Volume 69, Issue S1, pp. S42-S42 (1981); (1 page)

KEAL is a speech understanding system currently under development at the French Telecommunication Research Center (CNET) in Lannion; it aims to study oral man‐machine communication. KEAL is typically designed for automatic inquiry. Such a task requires dialog procedures in order to provide naturalness in the inquiry process. On the other hand, the use of dialog makes it possible to achieve good comprehension despite the presently limited performance of KEAL at the phonetic level. This paper describes how syntactic, semantic and pragmatic knowledge is used in KEAL. Sentence recognition is performed by a bottom‐up left‐to‐right parser which also provides the parse‐tree of the sentence. This parse‐tree is then interpreted in order to extract the semantic information which is relevant to the dialog. The semantic information is used by the dialog controller for instantiating a model‐graph, which represents the state of the dialog at any moment. The dialog controller sends a response to a text‐to‐speech synthesizer, and indicates how to analyze the user's reply. An example taken from automatic directory assistance service is described; results concerning preliminary experiments are discussed.

On the use of a speech recognition system for the detection of mispronunciations (A)

Jane Le Bras

J. Acoust. Soc. Am. Volume 69, Issue S1, pp. S42-S42 (1981); (1 page)

The investigation described in this article formed part of a thesis whose subject relates to the use of a speech recognition system in the field of automatic detection of mispronunciations. The experiment presented here is a first attempt at automatic evaluation of errors of pronunciation made by English subjects learning French. After briefly describing previous work using computers in the teaching of pronunciation, problems connected with the choice of a speech‐recognition system will be discussed. Then, we will describe the method chosen and the constraints specific to the use of an analytic speech‐recognition system. An algorithm for detecting the presence or the absence of aspiration in stop consonants will be presented. In the last part of this work, we will give the results of experiments carried out with five English subjects and seven French subjects using a program employing this algorithm. This program is simply intended to facilitate the testing of algorithms and to allow research with several different speakers. It has not been written in real time, but in batch (IRIS 80). Its principal interest comes from the fact that it uses the output of the phonetic analyser of KEAL, the speech recognition system used in CNET.