Journal of the Acoustical Society of America

Nov 1990

Volume 88, Issue S1, pp. S1-S200

Session 9SP: Speech Communication: Considerations in Bringing Speech Algorithms to Market
Invited Papers

Small Business Innovation Research (SBIR) funding: A case study in bringing a computer‐based speech training aid into the marketplace (A)

D. Kewley‐Port, C. S. Watson, and D. Maki

J. Acoust. Soc. Am. Volume 88, Issue S1, pp. S196-S196 (1990); (1 page)

Online Publication Date: 14 Aug 2005

University research projects occasionally involve applications of speech technology that are intended to produce a commercial product, such as the development of a microcomputer‐based speech training aid (ISTRA) at Indiana University. Even though initially funded as basic research, such projects, when successful, eventually require an alternative source of funds. SBIR programs in most federal agencies have been mandated by Congress to support research and development in small businesses. This presentation will describe SBIR programs and the funding process, from the perspective of investigators who attempt to form small businesses outside of the university womb. [Research supported by NSF and NIH, SBIR Phase I grants.]

From theory to practice: A 10‐yr path (A)

M. H. O'Malley

J. Acoust. Soc. Am. Volume 88, Issue S1, pp. S196-S196 (1990); (1 page)

Online Publication Date: 14 Aug 2005

Berkeley Speech Technologies, Inc. has been developing commercial text to speech synthesis technology for over 10 yr. What started out as a quick “technology transfer” has grown to become a complex body of “intellectual property” that has been realized in such products as a 100 000‐word talking dictionary, a telephone response system with 16 T‐T‐S lines on one board, a satellite communication system for trucks, and a portable talking computer for blind users. Practical considerations caused modification of the initial theoretical assumptions. From the beginning, it was assumed that high intelligibility and high phoneme accuracy were essential, but it was soon learned that 700 words per minute with a 25‐ms start and stop are equally important for blind users. Similarly, academic research had assumed wide bandwidth and low noise, but telephone systems require that all of the speech information be packed into a 3.5‐kHz telephone bandwidth. Initially, the choice was made to use demi‐syllable synthesis because it seemed to be an “engineering shortcut” that might cover gaps in standard scientific descriptions. As the technology developed, however, the decision was made to convert to a more scientifically based synthesis model because it offered higher quality, greater flexibility, and faster development, especially of new languages. Our 10‐yr development could not have been justified on the basis of expected financial return. However, it was and is fun.

Issues for bringing speech technology to the marketplace (A)

Steven F. Boll

J. Acoust. Soc. Am. Volume 88, Issue S1, pp. S197-S197 (1990); (1 page)

Online Publication Date: 14 Aug 2005

ITT A/CD is one of a few companies that has taken on the challenge of transferring speech technology from an abstract algorithm to a working system. In the mid‐1980s, ITT A/CD developed moderate vocabulary, speaker‐dependent, connected speech recognition, and produced a series of single‐board recognizer products compatible with various personal computers and workstations. To date, over 400 single‐board recognizers have been ordered by over 30 customers. From these efforts a number of issues and experiences have surfaced that are critical to their successful operation. Some of these include: (1) robust performance across channels and microphones (every change to the microphone, analog recording setup, or room environment will affect, and most likely lower, performance); (2) training sensitivity (every shortcut taken in training, e.g., using fewer tokens, cross‐channel recordings, out‐of‐vocabulary training phrases, etc., will lower performance); (3) man‐machine interface (use psychologists to design user interfaces rather than engineers; otherwise they end up being just too complicated); (4) application specification (select only those applications that receive an overwhelming and compelling advantage from the technology).

From algorithm to market: Real‐world issues in speech development projects (A)

Tony Vitale

J. Acoust. Soc. Am. Volume 88, Issue S1, pp. S197-S197 (1990); (1 page)

Online Publication Date: 14 Aug 2005

Papers on speech technology have typically addressed issues in one of two disparate areas: (a) hypotheses and observations about articulatory, auditory, or acoustic aspects of speech production and perception, and the development of the algorithms that underlie those observations; and (b) marketing and sales issues somewhat or totally unrelated to (a) such as competitive analyses, industry‐specific applications, and applications software or the user interface. There is, however, a complex world of logistics that affects a speech technology project in the most fundamental way. Nothing describes this more accurately and effectively than a brief history of the DECtalk development project at Digital Equipment Corp. It is a realistic illustration of the evolution from laboratory system to product. In 1981, Dennis Klatt, a senior research scientist at MIT, was searching for an industrial partner to help develop and produce a text‐to‐speech system derived from Klattalk, a set of algorithms for a speech synthesizer that he had been working on since 1970. After almost 9 months of searching, he found individuals within Digital Equipment Corp. who were willing to risk the costs of an extensive development project for a technology which, at the time, had no market. This relationship between academia and industry turned out to be so mutually beneficial and enlightening that Klatt was prompted to discuss it publicly from the academician's perspective [D. H. Klatt, Proc. Speech Tech. '87, 293–294 (1987)]. This discussion outlines some of the more important logistical issues from the corporate point of view, and a description of these constitutes a typical project history: problems in hardware design and development; dependence upon external engineering and support groups (e.g., manufacturing, technical documentation, etc.) and external vendors; and testing for design verification/maturity and physical robustness. In addition, speech development was proceeding simultaneously in two separate places (Maynard and MIT); the need to run in real time entailed the perennial decisions on cost versus performance; and debugging cycles required constant compromise.

Speech recognition: Technology and applications (A)

Lawrence R. Rabiner

J. Acoust. Soc. Am. Volume 88, Issue S1, pp. S197-S197 (1990); (1 page)

Online Publication Date: 14 Aug 2005

The technology of speech recognition has evolved for almost 2 decades since the introduction of sophisticated pattern recognition techniques such as dynamic time warping and clustering. Applications of the technology have been slower to evolve for several reasons, including system performance, system cost, and the general acceptance of voice technology by the public. Because of dramatic improvements in each of the areas that impeded use of speech recognition in real‐world applications, the technology is beginning to be accepted as a reasonable alternative to keyboards or mouse‐like devices for entering data, and to operators or attendants for tasks like call routing, enhanced services, and information retrieval. This talk reviews speech recognition technology, including a discussion of isolated word, connected word, and continuous speech recognition systems. Then there is a brief outline of the reasons why speech recognition is a relatively difficult problem to solve in its most general form, namely, unlimited vocabulary, unlimited syntax, unlimited talker population, unlimited environmental constraints, and unlimited tasks. It is shown that by appropriately restricting one or more of these factors governing performance, a highly reliable, robust speech recognizer for some limited, but interesting, tasks can be implemented. A number of these tasks are discussed in this talk and current performance via videotape demonstrations is illustrated.
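
For readers unfamiliar with dynamic time warping, the technique named in the abstract above, the following is a minimal textbook sketch and not the speaker's implementation. It assumes scalar features and an absolute-difference local cost; the function name `dtw_distance` is hypothetical. Recognizers of the period compared sequences of spectral feature vectors and added path constraints that are omitted here.

```python
def dtw_distance(a, b):
    """Classic dynamic time warping cost between two numeric sequences.

    Assumption for illustration: features are scalars and the local
    cost is the absolute difference between aligned samples.
    """
    n, m = len(a), len(b)
    inf = float("inf")
    # cost[i][j] = best cumulative cost aligning a[:i] with b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(a[i - 1] - b[j - 1])
            # Extend the cheapest of the three allowed predecessor cells.
            cost[i][j] = local + min(
                cost[i - 1][j],      # a advances, b sample is repeated
                cost[i][j - 1],      # b advances, a sample is repeated
                cost[i - 1][j - 1],  # both advance
            )
    return cost[n][m]
```

A template that differs from the input only in timing, such as `[1, 2, 3]` against `[1, 2, 2, 3]`, aligns at zero cost, which is the property that made the method attractive for matching spoken words of varying duration.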
Contributed Papers

Error correcting alternatives in the machine recognition of speech (A)

Jerome R. Bellegarda and Dimitri Kanevsky

J. Acoust. Soc. Am. Volume 88, Issue S1, pp. S198-S198 (1990); (1 page)

Online Publication Date: 14 Aug 2005

Because for many users the machine recognition of realistic, large vocabulary speech tasks typically produces recognition rates only in the neighborhood of 80%, the use of automatic speech recognizers (ASR) requires spotting and correcting the errors introduced during decoding. As a result, error correction is an important consideration in bringing current speech recognition technology to market. Standard methods for error correction include overwriting using a keyboard and selecting from a list of related candidate words with a mouse. In addition, a new alternative is considered: overwriting using a pressure‐sensitive pen on a transparent paper‐like interface (PLI) capable of recognizing handwriting online. These three alternatives are compared in terms of relative speed and ease of use, by measuring the average time that it takes to correct speech recognizer errors using each of the alternatives, and by recording the overall impression of each of the participants in the experiments. Keyboard entry was viewed as slow and distracting, because the attention must shift from screen to keyboard. Mouse selecting was faster but inappropriate for out‐of‐vocabulary words, such as proper nouns. Overall, PLI came across as a very viable means of correcting ASR errors, even in writer‐independent mode. This is because the correct word often differs from the wrongly decoded one by only a few characters, thus enabling the use of (robust) discrete character recognition. In a typical experiment, users were able to correct all ASR errors after a maximum of two passes at each wrongly decoded word.

Automatic recognition of connected digit strings in a credit card authorization task (A)

J. G. Wilpon, P. Ramesh, M. A. McGee, D. B. Roe, L. R. Rabiner, and C. H. Lee

J. Acoust. Soc. Am. Volume 88, Issue S1, pp. S198-S198 (1990); (1 page)

Online Publication Date: 14 Aug 2005

An important area of speech recognition is automatic recognition of connected digit strings (i.e., strings composed of the digits 0 through 9—including “oh”). Applications of this technology include credit card authorization, catalog ordering, dialing of telephone numbers, and data entry, to name just a few. For the past 2 yr AT&T, in cooperation with American Express Travel Related Services, has experimented with a system for automatic recognition of 10‐digit merchant identification codes, and 15‐digit customer credit card numbers, for the purpose of authorizing purchases charged to an AMEX card. (The problem of recognizing the transaction amount has also been studied, but this is a much more difficult problem and will not be discussed.) Field trial experience with the recognizer using about 1000 customers who provided 2000 connected digit strings over 800‐based dialed up telephone connections was correct recognition of 97%–98% of the strings with no rejections using constraints on the validity of both merchant identifications and credit card numbers. Several schemes for applying the constraints in a practical implementation were studied and will be discussed in the talk.
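
The abstract above does not say which validity constraints were applied or how. As one hedged illustration of the general idea, the sketch below assumes the recognizer returns a scored list of candidate digit strings and that validity is checked with the Luhn checksum used on payment card numbers. The function names `luhn_valid` and `pick_valid`, the candidate list format, and the scores are all hypothetical.

```python
def luhn_valid(digits: str) -> bool:
    """Return True if the digit string passes the Luhn checksum."""
    total = 0
    # Walk from the rightmost digit, doubling every second digit.
    for i, ch in enumerate(reversed(digits)):
        d = int(ch)
        if i % 2 == 1:
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0


def pick_valid(nbest):
    """Return the highest-scoring candidate that passes the check, or None.

    nbest: list of (digit_string, score) pairs, higher score = better.
    This list format is an assumption made for illustration only.
    """
    for candidate, _score in sorted(nbest, key=lambda p: p[1], reverse=True):
        if luhn_valid(candidate):
            return candidate
    return None


# Hypothetical recognizer output: the top hypothesis has one wrong digit.
hypotheses = [("378282246310009", -10.0), ("378282246310005", -12.5)]
best = pick_valid(hypotheses)
```

Here the top-scoring hypothesis fails the checksum, so the second candidate is selected. A deployed system could equally apply the constraint during decoding or look candidates up in a database of issued numbers; this sketch shows only the post-hoc filtering variant.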

Hidden Markov model based automatic keyword spotting for automating operator services (A)

P. Modi and J. G. Wilpon

J. Acoust. Soc. Am. Volume 88, Issue S1, pp. S198-S198 (1990); (1 page)

Online Publication Date: 14 Aug 2005

A basic assumption for most current speech recognition systems is that the speech to be recognized consists solely of words from a predefined vocabulary. For speech recognition applications in the telephone network, where any person can pick up any telephone at any time from anywhere, it is naive to assume that users will adhere strictly to this protocol. For example, when a customer is asked to say only collect, calling‐card, third‐number, person, or operator, they may instead say, “I want to make a collect call please.” In Wilpon et al., a hidden Markov model based keyword spotting algorithm was presented, which recognizes keywords from a pre‐defined vocabulary list spoken in an unconstrained fashion. In this talk, there will be discussion of several advances to the previous algorithm including: (1) improved rejection criteria and (2) improved modeling techniques. Results will be presented from evaluations on a five‐word vocabulary used to automate operator‐assisted calls. Using wordspotting techniques to recognize speech, a recognition accuracy of 99.8% is currently being achieved, while rejection is only 5.0%, regardless of whether the user speaks only vocabulary words or not.

Considerations in utilizing synthetic speech as an aid for proofreading (A)

H. Kasuya

J. Acoust. Soc. Am. Volume 88, Issue S1, pp. S198-S198 (1990); (1 page)

Online Publication Date: 14 Aug 2005

One typical application of Japanese synthetic speech produced by rule is in a newspaper company where a text‐to‐speech system has extensively been used as an aid for proofreading a manuscript. Operators in the proofreading department work at computer voice terminals about 4 h a day with adequate intermissions. Several important human factors relevant to the voice quality and prosodic properties of synthetic speech were raised in our interview with the operators. To make this point clearer, perceptual experiments were performed in the laboratory on the subjects' preference for the voice quality and prosodic characteristics of synthetic speech, focusing on the average pitch frequency, dynamic range of pitch contours, average frequency spectral characteristics, speaker's sex, and speaking rate, in the task where subjects were required to proofread a printed text by listening to the speech. The experiments indicated that a male voice of a relatively low average pitch frequency with rather less pitch changes was accepted well by the subjects. The experiments further suggested that the ability of a variable speaking rate is indispensable to a speech synthesis system. These findings were thought to be strongly related to the fact that the users work with synthetic speech for a long period of time.