Journal of the Acoustical Society of America

Nov 1976

Volume 60, Issue S1, pp. S1-S125

Session E. Speech Communication I: Automatic Speech Understanding
Contributed Papers

Review of the ARPA speech understanding project (A)

Dennis H. Klatt

J. Acoust. Soc. Am. Volume 60, Issue S1, pp. S10-S10 (1976); (1 page)

Online Publication Date: 11 Aug 2005

After five years of research and development, the final three speech understanding systems funded by the Advanced Research Projects Agency of the Department of Defense were demonstrated in early September of this year. As a member of the ARPA Steering Committee and as a consultant for one of the research groups, I will offer a summary of the capabilities that were demonstrated, and also speculate on the scientific knowledge gained during the course of the program. The opinions to be expressed are entirely the author's.

The Harpy Speech Recognition System: performance with large vocabularies (A)

B. Lowerre and R. Reddy

J. Acoust. Soc. Am. Volume 60, Issue S1, pp. S10-S11 (1976); (2 pages)

Online Publication Date: 11 Aug 2005

The Harpy System [B. Lowerre and R. Reddy, Harpy, a connected speech recognition system, J. Acoust. Soc. Am. 59, S97(A) (1976)] has been extended to run with large vocabularies. The system was recently tested with a 1011‐word vocabulary language permitting about 1012 possible sentences for an Information Retrieval task using a computerized data base. The system achieved 93.77% word recognition accuracy (89.8% sentence recognition accuracy) on 284 connected speech sentences (containing 1580 words), each about 3 sec in duration, for a single speaker. The system requires about 12 Mipss (million instructions per second of speech) and uses about 200 000 words of memory on a PDP‐10 system. More complete results, including several speakers and additional sentences, will be reported. [Research supported by the Defense Advanced Research Projects Agency.]

The Hearsay‐II speech understanding system (A)

L. D. Erman, F. Hayes‐Roth, V. R. Lesser, and R. Reddy

J. Acoust. Soc. Am. Volume 60, Issue S1, pp. S11-S11 (1976); (1 page)

Online Publication Date: 11 Aug 2005

The Hearsay‐II System is designed to recognize, understand, and respond to connected speech utterances, particularly in situations where sentences cannot be guaranteed to agree with some predefined, restricted language model of the kind assumed by the Harpy System. Further, it treats knowledge sources as different and independent, and as not always capable of being integrated into a single representation. It is based on the blackboard model [V. R. Lesser, R. D. Fennell, L. D. Erman, and D. R. Reddy, IEEE Trans. Acoust. Speech and Signal Process. ASSP‐23, 11–23 (1975)], with knowledge sources as a set of parallel processes which are activated asynchronously depending on data events. The system performs on the Information Retrieval task with accuracy comparable to that of the Harpy system, but runs about 2 to 20 times slower. More complete performance results will be reported. As we get closer to unrestricted vocabularies and nongrammaticality of spoken languages, it will be necessary to have systems which have the flexibility of Hearsay‐II and the performance of Harpy. [Research supported by the Defense Advanced Research Projects Agency.]
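
The blackboard organization named in this abstract can be illustrated with a minimal sketch. It is not the Hearsay‐II implementation: the class and method names (Blackboard, subscribe, post, run) and the level names are hypothetical, and a single-threaded event queue stands in for the parallel, asynchronously activated processes the abstract describes.

```python
from collections import deque


class Blackboard:
    """Shared store of hypotheses plus a queue of data events (illustrative only)."""

    def __init__(self):
        self.entries = []      # every (level, value) hypothesis posted so far
        self.events = deque()  # data events waiting to trigger knowledge sources
        self.sources = {}      # level name -> knowledge sources watching that level

    def subscribe(self, level, source):
        """Register a knowledge source to be activated by events at `level`."""
        self.sources.setdefault(level, []).append(source)

    def post(self, level, value):
        """Record a hypothesis and raise the data event that announces it."""
        self.entries.append((level, value))
        self.events.append((level, value))

    def run(self):
        """Dispatch events until none remain (sequential stand-in for parallelism)."""
        while self.events:
            level, value = self.events.popleft()
            for source in self.sources.get(level, []):
                source(self, value)


def toy_word_source(board, segment):
    # Hypothetical knowledge source: turns a segment-level entry into a word-level one.
    board.post("word", segment.upper())


board = Blackboard()
board.subscribe("segment", toy_word_source)
board.post("segment", "list")
board.run()
```

Each knowledge source sees only the blackboard, not the other sources, which is what keeps the sources independent of one another in this style of design.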

Feature extraction, segmentation, and labeling in the Harpy and Hearsay‐II systems (A)

H. G. Goldberg and R. Reddy

J. Acoust. Soc. Am. Volume 60, Issue S1, pp. S11-S11 (1976); (1 page)

Online Publication Date: 11 Aug 2005

Goldberg [J. Acoust. Soc. Am. 59, S97(A) (1976)] has shown that uniform techniques for segmentation and labeling can provide the initial signal‐to‐symbol transformation for speech recognition systems with reasonable accuracy and efficiency. Furthermore, the choice of parametric representation was not found to be critical for most commonly accepted representations. However, for efficiency, the computationally simplest techniques should be used to segment the utterance before more accurate (and expensive) spectral representations are used for labeling [R. Reddy, J. Acoust. Soc. Am. 42, 329–47 (1967)]. To provide an initial symbolic input for both the Harpy and Hearsay‐II systems, a hierarchical, feature‐extraction based segmenter, using the ZAPDASH parameters, has been developed. After segmentation, labeling is done by a modified LPC minimum distance [F. Itakura, IEEE Trans. ASSP‐23, 67–72 (1975)]. Labeling proceeds by comparing the midpoint of each segment with stored templates (acquired by an iterative learning process from a speaker‐specific training corpus), with the comparison adjusted by weights according to features obtained from the segmenter. The use of the highly efficient segmentation procedures and parameters provides approximately a factor of 5 speedup over the uniform techniques previously used with both Harpy and Hearsay‐II. [Research supported by the Defense Advanced Research Projects Agency.]

Connected Digit Recognition using symbolic representation of pronunciation variability (A)

G. Goodman, B. Lowerre, R. Reddy, and D. Scelza

J. Acoust. Soc. Am. Volume 60, Issue S1, pp. S11-S11 (1976); (1 page)

Online Publication Date: 11 Aug 2005

Most connected speech recognition systems such as Harpy and Hearsay‐II use some form of symbolic representation of alternative pronunciations of the vocabulary, whereas most isolated word recognition systems use word templates. In an attempt to compare relative performance of systems that use symbolic representations of words, the Harpy system was run on a connected digit task requiring the recognition of random three‐digit sequences. Each of ten speakers (seven male and three female) spoke 30 training sentences and 100 test sentences over a period of two weeks in a computer terminal room environment (approximately 65 dBA). Using speaker‐dependent phoneme templates, the word error rate over all ten speakers was about 2%. Using speaker‐independent phoneme templates computed from the training data for all the speakers (male and female), the word error rate was about 8% for a test data set of 1200 random connected three‐digit sequences from 20 speakers (including ten new speakers). The recognition time is about 4.5 Mipss (million instructions per second of speech). [Research supported by the Defense Advanced Research Projects Agency.]

Parametric representation of speech (A)

G. Gill and R. Reddy

J. Acoust. Soc. Am. Volume 60, Issue S1, pp. S11-S11 (1976); (1 page)

Online Publication Date: 11 Aug 2005

As digital processing of speech becomes commonplace, it becomes desirable to have a parametric representation of speech which is simple, fast, accurate, and directly obtainable from the PCM representation of speech. The ZAPDASH representation of speech (Zerocrossings And Peaks of Differenced And Smooth waveforms) is one such. The PCM data is used to generate a differenced waveform and a down‐sampled, smoothed waveform (for 10‐kHz sampling rate, the smoothing FIR filter coefficients were −1 0 1 2 4 4 4 2 1 0 −1, used every fourth point). Peak‐to‐peak distances and number of zerocrossings are calculated each 10 msec, resulting in 400 8‐bit parameters per second of speech. ZAPDASH can be done in 15–20 computer instructions per sample and can be extracted in less than 1/3 real time on minicomputers with 2 μsec instruction time. Although this representation is not noticeably different from other similar proposals, it seems to be fairly robust and accurate, and is used in the feature extraction, segmentation, and labeling parts of the Harpy and Hearsay‐II systems. Fortran and PDP‐11 machine language versions are available from the authors. [Research supported by the Defense Advanced Research Projects Agency.]
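
The parameter extraction described in this abstract can be sketched roughly as follows. This is a reading of the abstract, not the authors' Fortran or PDP‐11 code: the function names are hypothetical, "peak‐to‐peak distance" is assumed to mean the maximum minus the minimum within a frame, the filter is assumed to be applied at every fourth input sample, and the 8‐bit quantization step is omitted.

```python
# FIR smoothing coefficients quoted in the abstract for a 10-kHz sampling rate.
SMOOTH_COEFFS = [-1, 0, 1, 2, 4, 4, 4, 2, 1, 0, -1]
DECIMATION = 4        # the smoothed waveform keeps every fourth point
FRAME_SAMPLES = 100   # 10 msec of speech at 10 kHz


def difference(pcm):
    """First difference of the PCM samples (the 'differenced' waveform)."""
    return [pcm[i + 1] - pcm[i] for i in range(len(pcm) - 1)]


def smooth_and_downsample(pcm):
    """Apply the FIR filter at every fourth sample position."""
    taps = len(SMOOTH_COEFFS)
    return [
        sum(c * pcm[start + k] for k, c in enumerate(SMOOTH_COEFFS))
        for start in range(0, len(pcm) - taps + 1, DECIMATION)
    ]


def zero_crossings(frame):
    """Count sign changes between adjacent samples (zero counts as non-negative)."""
    return sum(1 for a, b in zip(frame, frame[1:]) if (a < 0) != (b < 0))


def peak_to_peak(frame):
    """Assumed meaning of 'peak-to-peak distance': max minus min in the frame."""
    return max(frame) - min(frame) if frame else 0


def zapdash(pcm):
    """Return one 4-tuple per 10-msec frame: (zc_diff, pp_diff, zc_smooth, pp_smooth)."""
    diff = difference(pcm)
    smooth = smooth_and_downsample(pcm)
    step = FRAME_SAMPLES // DECIMATION
    frames = []
    for f in range(len(pcm) // FRAME_SAMPLES):
        d = diff[f * FRAME_SAMPLES:(f + 1) * FRAME_SAMPLES]
        s = smooth[f * step:(f + 1) * step]
        frames.append((zero_crossings(d), peak_to_peak(d),
                       zero_crossings(s), peak_to_peak(s)))
    return frames
```

With four values per 10‐msec frame, one second of speech yields 100 frames and 400 parameters, which matches the rate quoted in the abstract.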

The HWIM speech understanding system—Overview and performance (A)

Jared J. Wolf and William A. Woods

J. Acoust. Soc. Am. Volume 60, Issue S1, pp. S11-S11 (1976); (1 page)

Online Publication Date: 11 Aug 2005

HWIM (for Hear What I Mean), the speech understanding system developed at BBN as part of the recent five‐year ARPA Speech Understanding Research Project, is designed to “understand” naturally spoken utterances relevant to a task domain of travel budget management. Its vocabulary is over 1000 words, and its grammar permits a habitable subset of natural English. HWIM contains sources of knowledge at the levels of acoustic‐phonetics, phonology, vocabulary, syntax, semantics, factual knowledge, and discourse. This paper describes the system as a whole and presents its performance results at the end of the ARPA project. [This research was supported by the Advanced Research Projects Agency of the Department of Defense and was monitored by ONR under Contract No. N00014‐75‐C‐0053.]

Phonetic and lexical processing in the HWIM speech understanding system (A)

Richard M. Schwartz, John W. Klovstad, Victor W. Zue, John I. Makhoul, and Jared J. Wolf

J. Acoust. Soc. Am. Volume 60, Issue S1, pp. S11-S12 (1976); (2 pages)

Online Publication Date: 11 Aug 2005

The “front end” of HWIM, the BBN speech understanding system, is that part of the system that governs the formation and evaluation of hypotheses between the levels of the speech signal and the word. It comprises processes for signal processing, acoustic‐phonetic recognition, lexical‐segmental matching, and lexical‐parametric matching. Implicit in the lexical matching processes is the application of phonological rules, both within word pronunciations and across word boundaries. A consistent scoring policy governs the evaluation of hypotheses at the segmental and word levels, and this policy is carried into the control component of the system, where it is applied to multiword hypotheses about the interpretation of the utterance. [This research was supported by the Advanced Research Projects Agency of the Department of Defense and was monitored by ONR under Contract No. N00014‐75‐C‐0053.]

Linguistic processing and control strategy in the HWIM speech understanding system (A)

William A. Woods, Madeleine Bates, Geoffrey Brown, and Jared J. Wolf

J. Acoust. Soc. Am. Volume 60, Issue S1, pp. S12-S12 (1976); (1 page)

Online Publication Date: 11 Aug 2005

The principal source of higher‐level linguistic knowledge in HWIM, the BBN speech understanding system, is an augmented transition network parser, which embodies the syntactic, semantic, and part of the factual sources of knowledge of the system. It parses sentences or sentence fragments in either direction, and it can, for a sentence fragment, enumerate the words and syntactic/semantic classes permissible at the ends of the fragment. The control component of the system is a program that calls on the other sources of knowledge of the system in order to formulate, evaluate, and extend hypotheses about the interpretation of the utterance. It is responsible for guiding the system to the most likely interpretation as efficiently as possible. [This research was supported by the Advanced Research Projects Agency of the Department of Defense and was monitored by ONR under Contract No. N00014‐75‐C‐0053.]

Word verification in a speech understanding system (A)

Craig C. Cook and Dennis H. Klatt

J. Acoust. Soc. Am. Volume 60, Issue S1, pp. S12-S12 (1976); (1 page)

Online Publication Date: 11 Aug 2005

Given a word whose presence has been hypothesized in an unknown utterance, one way of enhancing the confidence in that hypothesis is to generate a synthetic parameterization of the word and then match it against the equivalent parametric representation of the unknown utterance. We have implemented such an approach in the speech understanding system under development at Bolt Beranek and Newman, Inc. Given a word, a synthesis‐by‐rule program generates a representation in terms of linear prediction spectra, which are matched against similar spectra of the raw signal using a 13‐pole linear prediction error metric in conjunction with a dynamic programming time‐normalization algorithm. Some automatic talker normalization procedures have been implemented in the synthesis strategy. The performance of the verification component has been measured by obtaining the distribution of verification scores for all word hypotheses generated by the speech understanding system, and determining the scores for words that should be verified correctly versus those scores for false word hypotheses. [Supported by ARPA under Contract No. N00014‐75‐C‐0053.]
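
The dynamic programming time normalization mentioned in this abstract can be illustrated with a generic dynamic time warping sketch. It is not the BBN verifier: the function names are hypothetical, a Euclidean frame distance stands in for the 13‐pole linear prediction error metric, and the length normalization of the final score is an assumption.

```python
import math


def frame_distance(a, b):
    """Stand-in metric (Euclidean); the abstract uses a linear prediction error metric."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def dtw_score(template, observed):
    """Align two frame sequences by dynamic programming; lower scores are better matches."""
    n, m = len(template), len(observed)
    if n == 0 or m == 0:
        raise ValueError("both sequences must contain at least one frame")
    inf = float("inf")
    # cost[i][j] = best accumulated distance aligning the first i template
    # frames with the first j observed frames.
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = frame_distance(template[i - 1], observed[j - 1])
            cost[i][j] = d + min(cost[i - 1][j],      # template frame repeated
                                 cost[i][j - 1],      # observed frame repeated
                                 cost[i - 1][j - 1])  # both advance
    # Assumed normalization so that long and short words are comparable.
    return cost[n][m] / (n + m)


synthetic = [[0.0], [1.0], [2.0]]                  # hypothetical synthesized word
spoken = [[0.0], [0.0], [1.0], [1.0], [2.0]]       # same pattern, spoken more slowly
other = [[2.0], [2.0], [0.0], [0.0], [1.0]]        # a different pattern
```

Here `dtw_score(synthetic, spoken)` is zero because the slower utterance can be stretched onto the template exactly, while `dtw_score(synthetic, other)` is positive.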

Use of intonational phrase boundaries to select syntactic hypotheses in a speech understanding system (A)

Wayne A. Lea

J. Acoust. Soc. Am. Volume 60, Issue S1, pp. S12-S12 (1976); (1 page)

Online Publication Date: 11 Aug 2005

A procedure has been developed for using prosodically detected phrase boundaries to weight word and phrase hypotheses in the Bolt Beranek and Newman (BBN) SPEECHLIS speech understanding system, so that correct words and structural hypotheses will be proposed at earlier stages in parsing, and erroneous theories can be avoided. The state‐transition arcs of the augmented transition network grammar were specially marked if they were expected to be immediately preceded by intonationally detected phrase boundaries. The scores on words associated with the arcs were increased if expected boundaries were detected, or decreased if expected boundaries were missing in the acoustic‐prosodic data. Fifteen BBN sentences were processed through a computer program that detected phrase boundaries at fall‐rise valleys in fundamental frequency contours. Analysis of simple traces of the hypothesizing, testing, and constructing of syntactic structures by the SPEECHLIS system showed that prosodic adjustment of scores would increase the likelihood of correct words and phrases being selected before incorrect ones. These ideas are being refined and tested further, for implementation in the SPEECHLIS system.
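
The score adjustment described in this abstract can be sketched as below. The abstract gives no numbers, so the data structure, the size of the adjustment (`bonus`), and the time tolerance used to decide that a detected boundary lines up with a word are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class WordHypothesis:
    word: str
    score: float
    start_time: float        # seconds from the start of the utterance
    expects_boundary: bool   # True if the grammar arc was marked as boundary-preceded


def adjust_score(hyp, boundary_times, bonus=0.1, tolerance=0.05):
    """Raise the score if an expected phrase boundary was detected, lower it if not.

    `bonus` and `tolerance` are illustrative values, not taken from the abstract.
    """
    if not hyp.expects_boundary:
        return hyp.score  # unmarked arcs are left alone
    detected = any(abs(t - hyp.start_time) <= tolerance for t in boundary_times)
    return hyp.score + bonus if detected else hyp.score - bonus


hyp = WordHypothesis(word="what", score=0.5, start_time=1.0, expects_boundary=True)
supported = adjust_score(hyp, boundary_times=[1.02])
unsupported = adjust_score(hyp, boundary_times=[2.50])
```

A hypothesis on a marked arc moves ahead of its competitors when the prosodic evidence agrees with it and falls behind when it does not, which is the reordering effect the abstract reports.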

Evaluation of an automatic word recognition system over dialed‐up telephone lines (A)

A. E. Rosenberg and F. Itakura

J. Acoust. Soc. Am. Volume 60, Issue S1, pp. S12-S12 (1976); (1 page)

Online Publication Date: 11 Aug 2005

An evaluation of an automatic word recognition system [F. Itakura, IEEE Trans. Acoust. Speech Signal Process. ASSP‐23, 67–72 (1975)] has been carried out over dialed‐up telephone lines using a laboratory computer on line. Thirteen speakers participated in the evaluation, calling up the system once a day over a five‐month period. In each experimental session speakers were instructed by voice prompt to provide utterances of 12 words spoken in isolation. These words were randomly selected in each session from an 84‐word vocabulary, 50 of whose entries are North American cities, designed to give the speaker access to airline flight time‐table information. At each trial the speaker was informed whether his utterance was correctly recognized or not. There are two categories of error: an incorrect match or a rejection (no match). Speakers were requested to repeat words not recognized on the first attempt a second and, if required, a third time. The average number of trials per speaker over the entire experimental period was 840. The median percentage of incorrect matches on the first attempt over the 13 speakers was 2.7%, while the median percentage of rejections was 5.7%. The median percentage of words still not recognized after 3 attempts was 1.5%.

Statistical decision approach to the recognition of connected digits (A)

M. R. Sambur and L. R. Rabiner

J. Acoust. Soc. Am. Volume 60, Issue S1, pp. S12-S12 (1976); (1 page)

Online Publication Date: 11 Aug 2005

A connected digit recognition system that uses a statistical decision approach based on an expanded form of the principle of minimum residual error has been developed. The expanded distance measure includes the effects of analysis estimation error, the effects of coarticulation, and the effects of speaker variability. The recognition system has been tested on six speakers in a speaker dependent mode with recognition accuracies near 100%. It has also been tested with ten new speakers in a speaker independent mode, with a digit recognition accuracy exceeding 95%.

Labeling speech events for acoustic and linguistic processing (A)

R. J. Hanson and L. L. Pfeifer

J. Acoust. Soc. Am. Volume 60, Issue S1, pp. S12-S13 (1976); (2 pages)

Online Publication Date: 11 Aug 2005

In speech studies, data base access has been significantly improved through the use of label files which relate portions of the speech waveform with a phonetic (or phonemic) transcription as well as other linguistic and nonlinguistic information. In this way the acoustic and linguistic “realities” of speech can be associated and checked against each other since they were derived independently. Because of the lack of a simple correspondence between the acoustic structure of a portion of the speech waveform and the linguistic label assigned, additional principles must be invoked to make labels useful. As an example of the labeling process and the accompanying problems, a sound identification study is described in which 675 vowel nuclei were labeled using the transcription derived from a sophisticated multitranscriber system. Despite precautions taken to ensure correct labeling, many of the vowels are judged in a pattern recognition experiment to be more similar to neighboring vowels in the vowel space or to schwa than to the vowel they were labeled as. These supposedly misidentified vowels were found to have acoustic structures very different from the ideal for their label, at least partially due to the loss of coarticulatory and vowel reduction information when assigning transcription symbols. Directions for resolving these problems and for increasing the utility of labeled speech events in acoustic and linguistic processing are discussed. [Work supported by AFOSR Contract No. 44620‐74‐C‐0034.]

An algorithm for speaker verification (A)

M. Shridhar and M. Vidalon

J. Acoust. Soc. Am. Volume 60, Issue S1, pp. S13-S13 (1976); (1 page)

Online Publication Date: 11 Aug 2005

An analysis of the spectral properties of male speech reveals that most of the spectral energy is distributed in the frequency range of 50–2000 Hz. The authors attempted to develop a computer algorithm for speaker verification that utilized parameters extracted directly from the digitized speech. The original speech signal was sampled at 4 kHz after filtering it with a 2‐kHz low‐pass filter. The investigations reveal that by the use of a low‐order linear predictor model a feasible set of parameters could be realized for application to speaker verification. A simple warping procedure modifies the parameter contour of the unknown speaker so that the correlation with the reference contour is maximized. The verification decision was based on the distance of the test sample contour from the reference contour for the claimed speaker. If the distance was less than a fixed threshold, the speaker was accepted. Among all the parameters investigated, the reflection coefficients and the autocorrelation coefficients were found to be the most effective, providing a verification accuracy of 98% for speech 2 sec in duration, which increased to more than 98% for a duration of 3 sec. In conclusion, a procedure for speaker verification has been developed and it is fairly simple, reasonably fast, and reliably accurate.
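
The threshold decision on linear prediction parameters described in this abstract can be sketched as follows. It is an illustration, not the authors' algorithm: the function names are hypothetical, reflection coefficients are computed with the standard Levinson‐Durbin recursion, a Euclidean distance on a single parameter vector replaces the warped contour comparison, and the threshold value is made up.

```python
import math


def autocorrelation(x, order):
    """Autocorrelation lags 0..order of one frame of samples."""
    return [sum(x[n] * x[n - k] for n in range(k, len(x))) for k in range(order + 1)]


def reflection_coefficients(r):
    """Levinson-Durbin recursion on autocorrelation lags r[0..p]; returns k_1..k_p."""
    order = len(r) - 1
    a = [0.0] * (order + 1)   # predictor coefficients, a[1..i] valid at step i
    err = r[0]                # prediction error energy
    ks = []
    for i in range(1, order + 1):
        if err <= 0.0:
            raise ValueError("frame has no energy left to predict")
        acc = r[i] - sum(a[j] * r[i - j] for j in range(1, i))
        k = acc / err
        ks.append(k)
        new_a = a[:]
        new_a[i] = k
        for j in range(1, i):
            new_a[j] = a[j] - k * a[i - j]
        a = new_a
        err *= 1.0 - k * k
    return ks


def verify(test_params, reference_params, threshold):
    """Accept the claimed identity if the parameter distance is below the threshold."""
    distance = math.sqrt(sum((t - r) ** 2
                             for t, r in zip(test_params, reference_params)))
    return distance < threshold


# Autocorrelation lags of an idealized first-order process (illustrative values).
reference = reflection_coefficients([1.0, 0.5, 0.25])
```

For these lags the recursion gives a first reflection coefficient of 0.5 and a second of 0.0, as expected for a first‐order process, so a test vector equal to `reference` is accepted at any positive threshold.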

Long‐term feature averaging in voice authentication (A)

J. D. Markel, B. T. Oshika, and A. H. Gray, Jr.

J. Acoust. Soc. Am. Volume 60, Issue S1, pp. S13-S13 (1976); (1 page)

Online Publication Date: 11 Aug 2005

The purpose of this paper is to investigate the applicability of long‐term feature averaging as an eventual means for performing text‐independent voice authentication (speaker verification). Based upon a set of long‐term feature vectors, a principal component analysis is performed to obtain a normalized reference coordinate system for each speaker. Features extracted from the test speaker are transformed to this coordinate system and then the Euclidean distance is measured. It is shown that under a weak assumption of Gaussian statistics, the threshold necessary to attain a given probability of correct acceptance as a function of the number of dimensions or features can be theoretically calculated. Results of several preliminary experiments are presented to illustrate the technique.