Journal of the Acoustical Society of America

Apr 1985

Volume 77, Issue S1, pp. S1-S108

Session E. Speech Communication II: Analysis, Synthesis, and Recognition
Contributed Papers

A feature‐based time domain pitch tracker (A)

Michael S. Phillips

J. Acoust. Soc. Am. Volume 77, Issue S1, pp. S9-S10 (1985); (2 pages)

Online Publication Date: 12 Aug 2005

A pitch tracking algorithm was designed that uses perceptually motivated features to identify the first peak of each pitch period in the speech waveform during voiced portions of speech. The feature measurement algorithms were designed to capture all of the information that a person uses to identify pitch marks when looking at a waveform display. A multi‐variate classifier makes decisions about the location of pitch marks based on the values of the feature measurements. This classifier was designed using Classgraph, a program that allows the user to examine hand‐labeled data and make decision boundaries in the multidimensional feature space. Performance was evaluated by comparing pitch marks generated by the algorithm with hand‐labeled pitch marks on a database of speakers each saying a different sentence. Each sentence was hand‐labeled by two people. The agreement among labelers was within 1% of the agreement between each labeler and the output of the algorithm. [Supported by NSF and DARPA.]

Low bit rate real‐time half duplex LPC vocoder utilizing minimal random access memory (A)

Morris Moore and James M. Keba

J. Acoust. Soc. Am. Volume 77, Issue S1, pp. S10-S10 (1985); (1 page)

Online Publication Date: 12 Aug 2005

A low bit rate real‐time half duplex LPC vocoder was constructed based upon the Texas Instruments TMS32010 digital signal processor IC. Approximately 2700 words of external ROM were used for program and tables. The 144 words of onboard RAM were the only words of RAM used. The analysis was performed on preemphasized windowed speech using autocorrelation accumulation and the LeRoux‐Gueguen method to generate reflection coefficients and gain. A modified Rabiner‐Gold pitch algorithm was used for pitch and voicing. The software was split into foreground computations done on each speech sample and background computations done on each frame. There were 54 LPC information bits generated per frame.
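
The step from autocorrelation values to reflection coefficients can be sketched in Python as below. This is a minimal illustration of the Schur-type recursion usually associated with Le Roux and Gueguen, whose intermediate quantities stay bounded by r[0] and therefore suit fixed-point processors. The function name, the floating-point arithmetic, and the sign convention (error filter 1 + a1 z^-1 + ...) are assumptions made for illustration; this is not the authors' TMS32010 code.

```python
def reflection_coefficients(r):
    """Return (reflection coefficients k_1..k_p, final prediction error).

    r is a list of autocorrelation values r[0..p] with r[0] > 0.
    """
    p = len(r) - 1
    f = list(r)  # forward cross-correlation terms
    b = list(r)  # backward cross-correlation terms; b[m] holds the order-m error
    ks = []
    for m in range(1, p + 1):
        k = -f[m] / b[m - 1]
        ks.append(k)
        # Walk downward so that b[i - 1] is still the value from order m - 1.
        for i in range(p, m - 1, -1):
            f_old = f[i]
            b_prev = b[i - 1]
            f[i] = f_old + k * b_prev
            b[i] = b_prev + k * f_old
    return ks, b[p]


if __name__ == "__main__":
    # Autocorrelation of a first-order process with pole at 0.5.
    ks, err = reflection_coefficients([1.0, 0.5, 0.25])
    print(ks, err)
```

For a first-order process only the first coefficient is nonzero, which gives a quick sanity check on the recursion.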

The application of a hierarchical classification technique to speech analysis (A)

Mark A. Randolph

J. Acoust. Soc. Am. Volume 77, Issue S1, pp. S10-S10 (1985); (1 page)

Online Publication Date: 12 Aug 2005

In this paper we describe a hierarchical classification technique based on CART (Classification and Regression Trees, see Breiman et al., 1984) and its application to the task of speech analysis. Our investigation is motivated by two reasons. First, we believe that this technique provides an intuitive mechanism for quantifying the acoustic cues of phonetic contrasts. Second, the technique can potentially help us develop classifiers that are useful for automatic speech recognition. Towards these goals, we have added a number of features to the basic CART algorithm, and have expanded it into an exploratory data analysis tool. For example, CART uses a predetermined criterion for partitioning the feature space. We have added the capability for users to manually perform this partitioning. In addition, we have implemented a number of statistical functions for univariate and multivariate data analysis, and added graphical facilities for viewing data in different ways. Most importantly, we have integrated CART with SPIRE, our primary acoustic‐phonetic analysis tool. Examples of how CART can be used for speech analysis and classifier design will be presented. Comparisons with other classification procedures will also be included. [Work supported by AT&T Bell Laboratories Cooperative Research Fellowship and by the Office of Naval Research under contract N00014‐82‐K‐0727.]
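
The idea of partitioning the feature space by a predetermined criterion can be sketched as a single-feature threshold search. The Python below assumes Gini impurity as the criterion (a common choice in CART, but not stated in the abstract), and the function names and the toy feature values are hypothetical.

```python
def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    counts = {}
    for y in labels:
        counts[y] = counts.get(y, 0) + 1
    return 1.0 - sum((c / n) ** 2 for c in counts.values())


def best_threshold(values, labels):
    """Return (threshold, weighted impurity) of the best binary split.

    Returns (None, impurity of the unsplit data) if no split improves on it.
    """
    pairs = sorted(zip(values, labels), key=lambda p: p[0])
    n = len(pairs)
    best = (None, gini(labels))
    for i in range(1, n):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # no threshold can separate equal feature values
        thr = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [y for _, y in pairs[:i]]
        right = [y for _, y in pairs[i:]]
        score = (len(left) * gini(left) + len(right) * gini(right)) / n
        if score < best[1]:
            best = (thr, score)
    return best


if __name__ == "__main__":
    # Hypothetical acoustic measurements for two phonetic classes.
    print(best_threshold([100, 120, 300, 320], ["v", "v", "u", "u"]))
```

A manual partitioning facility of the kind the abstract describes would let the user supply the threshold instead of taking the one this search returns.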

Medium band waveform coding of speech signals based on short‐time phase equalization (A)

Masaaki Honda and Takehiro Moriya

J. Acoust. Soc. Am. Volume 77, Issue S1, pp. S10-S10 (1985); (1 page)

Online Publication Date: 12 Aug 2005

A new speech coding method is presented utilizing perceptual redundancy for slow‐varying, short‐time phase characteristics. First, a speech signal model is introduced based on the all pole filter and all pass filter, which represent the power spectrum and the short‐time phase of speech signals respectively. The phase equalization technique for canceling the short‐time phase is then presented based on the time domain matched filtering method. Using this technique, the prediction residual of the speech signal is converted into a nearly zero‐phase signal, or ideally into a pitch pulse sequence, which results in the temporal concentration of residual energies. The phase‐equalized speech signal is efficiently encoded using variable‐rate tree coding with temporal bit allocation. Experimental results show that quality of the phase equalized speech is almost indistinguishable from the original speech, and that the coded speech quality is equivalent to that of the 6.6 bit Log‐PCM at a bit rate of 1 bit/sample. It is also confirmed that this system has the ability to make a speech quality change from natural to vocoder sounds by controlling the information bits for the main pulses and its residuals of the phase‐equalized residual signal.

The relationship between intelligibility and naturalness in high quality synthetic speech (A)

Thomas D. Carrell

J. Acoust. Soc. Am. Volume 77, Issue S1, pp. S10-S10 (1985); (1 page)

Online Publication Date: 12 Aug 2005

The relationship between intelligibility and naturalness in synthetic speech was examined. There has been evidence that in extreme cases very unnatural speech may be intelligible [R. E. Remez et al., Science 212, 947–950 (1981)] and very natural sounding speech may be unintelligible [E. Abberton and A. J. Fourcin, Lang. Speech 21, 305–318 (1978)]. However, there is little evidence regarding the independence of naturalness and intelligibility in relatively high quality synthetic speech. This relationship was investigated in the present experiment using a paired‐rating design, in which listeners were required to rate the intelligibility and naturalness of a pair of words on each trial. The first item was always a digitized natural token of a word and the second item was always a synthetic version of the same word. The synthetic stimuli were designed to mimic the specific acoustic correlates of three male and three female talkers. Overall, the listeners rated the synthetic speech as highly intelligible and natural. In addition, a principal component analysis indicated that the naturalness and intelligibility of the different synthetic talkers were independent factors. These results argue that even in the case of high quality synthetic speech, these two attributes of the signal are separate. [Work supported by NIH (NINCDS).]

Speaker characteristics influencing the intelligibility of resynthesized LPC speech (A)

Harry Buiting and Louis Boves

J. Acoust. Soc. Am. Volume 77, Issue S1, pp. S10-S10 (1985); (1 page)

Online Publication Date: 12 Aug 2005

In applying analysis resynthesis techniques, much attention must be paid to the choice of the speaker, especially when a so‐called speech‐chip is used in the resynthesis process. To find an answer to a number of questions concerning this issue a perception experiment was carried out using ten Dutch speakers, five males and five females. The speakers read 50 phonetically balanced CVC nonsense words, which were embedded in carrier sentences. The 10×50 = 500 CVC words were analyzed by means of a linear predictive procedure which resulted in energy, pitch, and formant/bandwidth data. For a small subset of the speech material the analysis data were optimized by means of both visual and auditory inspection. This optimization process resulted in a set of rules to be applied to the analysis data before resynthesis. The optimization rules were used to resynthesize the CVC words with several bit rates. The resynthesized words made up the stimulus ensemble in a recognition experiment. Questions we address are: What is the effect of the bit rate on the identifiability? Is it possible to isolate criteria that “good” speakers should meet? How successfully can rules established for a small set of words be applied to a larger set?

The intelligibility of nonvocoded and vocoded semantically anomalous sentences (A)

Molly Mack and Bernard Gold

J. Acoust. Soc. Am. Volume 77, Issue S1, pp. S10-S11 (1985); (2 pages)

Online Publication Date: 12 Aug 2005

This study consisted of an analysis of the intelligibility of semantically anomalous sentences presented to 28 subjects in four conditions (seven subjects per condition): (1) natural speech, no noise; (2) vocoded speech, no noise; (3) vocoded speech, noise added to the pitch track; (4) vocoded speech, noise added to the spectrum. Results revealed that intelligibility was quite good in conditions (1) and (2), relatively poor in (3), and quite poor in (4)—results consistent with previously obtained Diagnostic Rhyme Test (DRT) data [B. Gold and J. Tierney, Lincoln Laboratory Tech. Rep. No. 670 (1983)]. Specifically, subjects averaged 10, 25, 47, and 141 errors each in conditions (1), (2), (3), and (4), respectively. Further, about 60% of all errors were phonemic, while 40% were syntactic and semantic. We concluded that information in the spectrum is more critical than information in the pitch track, that most errors affect the phonological component when intelligibility is poor and context is uncertain, and that the DRT is an appropriate though perhaps insufficient test of speech intelligibility. [Work sponsored by the Department of the Air Force.]

Synthesizing an intelligible /h/ (A)

Jonas N. A. Nartey

J. Acoust. Soc. Am. Volume 77, Issue S1, pp. S11-S11 (1985); (1 page)

Online Publication Date: 12 Aug 2005

This paper presents follow‐up work on earlier data which suggested that /h/ is a voiceless vowel, and not a fricative or an approximant, as others have claimed [e.g., Strevens, 1960 for the former, and Ladefoged, 1975 for the latter]. Acoustic measurements of English /h/ spoken in various phonetic environments were compared to those of fricatives and approximants in similar environments. Results indicate that co‐articulation effects of /h/ and its phonetic environments are similar to those of neither fricatives nor approximants. In another experiment, /h/ was synthesized in English utterances using identical parameters with (a) fricatives, (b) approximants, and (c) vowels. The utterances were presented to native American English speakers for both “naturalness” and intelligibility judgments. Results favored the [h] with the characteristics of a “voiceless vowel.”

Mixed excitation for speech II: Naturalness of period‐by‐period reordered voiced fricatives (A)

George D. Allen and Leah H. Jamieson

J. Acoust. Soc. Am. Volume 77, Issue S1, pp. S11-S11 (1985); (1 page)

Online Publication Date: 12 Aug 2005

Previous attempts at synthesizing voiced fricatives have failed to yield acceptably natural‐sounding segments. One possible reason for this failure is that the synthesis models have been global, whereas some important characteristics of these segments may reside in their period‐by‐period structure. Our goal in this study was therefore to compare the auditory quality of digitally manipulated /v, ð, z, and Ʒ/ segments, in various V‐V contexts. High‐quality tokens were digitized at 24 kHz, and individual pitch periods were marked by hand using interactive software. Comparison stimuli were then created via the following manipulations: (1) reordering of alternate periods; (2) reordering triples of periods; (3) replacement of all odd‐ (even‐) numbered periods by their even‐ (odd‐) numbered neighbors; (4) similar replacements modulo 3; (5) random reordering of periods. Furthermore, these manipulations were sometimes restricted to the onset, steady state, or offset portions of the segments. These digitally spliced segments were then presented to listeners for discrimination, naturalness, and likeness judgments. Results of these comparisons and their implications for synthesis of natural‐sounding voiced fricatives will be discussed.
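
Three of the listed manipulations can be sketched in Python on a list of hand-marked pitch periods. The abstract does not define the manipulations precisely, so the readings used here are assumptions: "reordering of alternate periods" is taken as swapping each adjacent pair, and replacement of odd-numbered periods as copying the following even-numbered neighbor. All function names are hypothetical.

```python
import random


def swap_adjacent_pairs(periods):
    """Assumed reading of manipulation (1): swap periods 1<->2, 3<->4, ..."""
    out = list(periods)
    for i in range(0, len(out) - 1, 2):
        out[i], out[i + 1] = out[i + 1], out[i]
    return out


def replace_odd_with_even_neighbor(periods):
    """Assumed reading of manipulation (3): odd-numbered periods (1-based)
    are replaced by the even-numbered period that follows them."""
    out = list(periods)
    for i in range(0, len(out) - 1, 2):
        out[i] = out[i + 1]
    return out


def random_reorder(periods, seed=0):
    """Manipulation (5): random permutation, seeded for repeatability."""
    out = list(periods)
    random.Random(seed).shuffle(out)
    return out


def splice(periods):
    """Concatenate the periods back into one sample sequence."""
    return [sample for period in periods for sample in period]


if __name__ == "__main__":
    # Each inner list stands for the samples of one pitch period.
    periods = [[1, 2], [3], [4, 5], [6]]
    print(splice(swap_adjacent_pairs(periods)))
    print(splice(replace_odd_with_even_neighbor(periods)))
```

Restricting a manipulation to the onset, steady state, or offset would amount to applying one of these functions to a slice of the period list before splicing.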

Multipulse linear predictive synthesis in text‐to‐speech systems (A)

Andrew Varga and Frank Fallside

J. Acoust. Soc. Am. Volume 77, Issue S1, pp. S11-S11 (1985); (1 page)

Online Publication Date: 12 Aug 2005

Multipulse linear prediction is a coding technique that can provide extremely high quality speech synthesis [B. S. Atal and J. R. Remde, IEEE Proc. ICASSP82, 614–617 (1982)]. It is therefore of interest to examine whether the technique can be used to provide correspondingly good quality in a text‐to‐speech system. In such a system a prosodic contour is imposed on a set of concatenated speech units. Various speech units (e.g., words and diphones) have been tried. First results suggest that the word is the appropriate unit to be used in such a system. The techniques for pitch and timing alteration, speech unit concatenation, and the effects on the resulting synthetic speech will be discussed and demonstrated.

A versatile dictionary for speech synthesis by rule (A)

Susan R. Hertz

J. Acoust. Soc. Am. Volume 77, Issue S1, pp. S11-S11 (1985); (1 page)

Online Publication Date: 12 Aug 2005

All synthesis rule developers are faced with the problem of handling phenomena that cannot easily be captured in rules. The Delta System [J. Acoust. Soc. Am. Suppl. 1 75, S60 (1984)] provides the rule writer with an especially versatile exception dictionary. The dictionary has two parts: the active dictionary and sets. The active dictionary can store token sequences representing units of any kind (e.g., phrases, words, demisyllables) and associate arbitrary actions with them. For example, an action might specify a pronunciation, as in conventional dictionaries, or it might invoke a rule. An action can be restricted to the portion of an entry that is an exception. Sets contain token sequences but no actions. They provide an especially compact way to group together items that behave similarly. Rules can test a token sequence for membership in a set to determine whether to apply to the sequence. Delta's dictionary is fully integrated into the rule system it accompanies. It can set variables for the rule program, influence the program's flow of control, and manipulate the utterance being synthesized.

Text‐independent speaker recognition experiments using codebooks in vector quantization (A)

Kiyohiro Shikano

J. Acoust. Soc. Am. Volume 77, Issue S1, pp. S11-S11 (1985); (1 page)

Online Publication Date: 12 Aug 2005

A text‐independent speaker clustering approach to speaker‐independent speaker recognition through vector quantization (VQ) was investigated, where the distortion value was used as a clustering measure. To show the possibility of the text‐independent speaker clustering, speaker recognition experiments were carried out using the Harvard sentence database. Nine male speakers uttered ten different Harvard sentences each. Codebooks were generated from the first five sentences for each speaker using Weighted Likelihood Ratio measure (WLR) through LPC analysis. Using 128 vectors in each codebook, a speaker recognition rate of 98% was attained on the latter five Harvard sentences. Effects of codebook size and input length are also discussed. The above approach based on framewise VQ only utilizes the static distribution of LPC spectra. VQ for multiframe codebooks was used to represent the coarticulation units. The results of speaker recognition experiments based on multi‐frame codebooks will be compared with fixed length VQ approaches.
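
The framewise decision rule can be sketched in Python: each test frame is quantized against every speaker's codebook, and the speaker whose codebook gives the lowest average distortion is chosen. This sketch substitutes squared Euclidean distance for the Weighted Likelihood Ratio measure used in the abstract, and the function names, speaker labels, and two-dimensional toy vectors are hypothetical.

```python
def squared_distance(x, c):
    """Squared Euclidean distance (a stand-in for the WLR measure)."""
    return sum((a - b) ** 2 for a, b in zip(x, c))


def average_distortion(frames, codebook):
    """Mean distance from each frame to its nearest codebook vector."""
    total = 0.0
    for x in frames:
        total += min(squared_distance(x, c) for c in codebook)
    return total / len(frames)


def identify_speaker(frames, codebooks):
    """Return the speaker label whose codebook has the lowest distortion."""
    return min(codebooks, key=lambda s: average_distortion(frames, codebooks[s]))


if __name__ == "__main__":
    codebooks = {
        "A": [(0.0, 0.0), (1.0, 1.0)],
        "B": [(5.0, 5.0), (6.0, 6.0)],
    }
    test_frames = [(0.1, 0.0), (0.9, 1.2)]
    print(identify_speaker(test_frames, codebooks))
```

Because the frames are scored independently, the rule ignores frame order, which is the sense in which framewise VQ uses only the static distribution of spectra.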

Speaker‐independent recognition of isolated digits using a weighted cepstral distance (A)

Yoh'ichi Tohkura

J. Acoust. Soc. Am. Volume 77, Issue S1, pp. S11-S11 (1985); (1 page)

Online Publication Date: 12 Aug 2005

The cepstral distance has been one of the most efficient spectral distance measures in speech and speaker recognition [S. Furui, IEEE Trans. Acoust. Speech Signal Process. ASSP‐29, 254–272 (1981)]. A new weighted cepstral distance measure using LPC derived cepstrum coefficient variability was tested in a speaker‐independent English digit recognition system using standard DTW alignment techniques [L. R. Rabiner, S. E. Levinson, A. E. Rosenberg, and J. G. Wilpon, IEEE Trans. Acoust. Speech Signal Process. ASSP‐27, 134–141 (1979)]. The results show a recognition accuracy of > 99% for the digits [K. L. Shipley, A. E. Rosenberg, and D. E. Bock, J. Acoust. Soc. Am. Suppl. 1 72, S80 (1982)]. Recognition results using the same data base and the log likelihood LPC distance are about 97.4%. Hence there is a large improvement in performance with the new weighted cepstral distance.
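
One way to read "weighted by coefficient variability" is an inverse-variance weighting of the squared cepstral differences. The Python sketch below assumes exactly that form; the abstract does not give the weighting formula, so the weights, the variance floor, and the function names are illustrative assumptions.

```python
def inverse_variance_weights(training_vectors, floor=1e-9):
    """Weight for each cepstral index: 1 / variance over the training set."""
    n = len(training_vectors)
    dim = len(training_vectors[0])
    weights = []
    for i in range(dim):
        column = [v[i] for v in training_vectors]
        mean = sum(column) / n
        var = sum((x - mean) ** 2 for x in column) / n
        weights.append(1.0 / max(var, floor))  # floor guards against zero variance
    return weights


def weighted_cepstral_distance(c1, c2, weights):
    """Sum of weighted squared differences between two cepstral vectors."""
    return sum(w * (a - b) ** 2 for w, a, b in zip(weights, c1, c2))


if __name__ == "__main__":
    training = [(1.0, 2.0), (3.0, 2.5)]
    w = inverse_variance_weights(training)
    print(w, weighted_cepstral_distance((0.0, 0.0), (1.0, 0.5), w))
```

With this weighting, low-variance (typically higher-order) coefficients contribute as much to the distance as the high-variance low-order ones. In a DTW recognizer this function would serve as the local frame-to-frame distance.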

A single‐boarded isolated word recognizer using LPC cepstrum (A)

M. Hamada, Y. Bessho, T. Norimatsu, and A. Yamada

J. Acoust. Soc. Am. Volume 77, Issue S1, pp. S12-S12 (1985); (1 page)

Online Publication Date: 12 Aug 2005

Minicomputer simulation was carried out to design an optimum system based on currently available signal processing LSI's. First, finite‐word‐length effects of the Levinson‐Durbin (LD) algorithm and the Le‐Roux (LR) algorithm [J. Le‐Roux et al., IEEE Trans. Acoust. Speech Signal Process. ASSP‐25, 257–259 (1977)] for extracting PARCOR coefficients were investigated regarding (1) the PARCOR/AR/cepstrum coefficient error, (2) the difference in LPC cepstrum distance between the top two candidates, and (3) recognition rate. LR was found to be almost always better than LD by each of the above measures. Second, the effects of analysis order, number of template bits, and the template normalization method were examined to minimize memory size. It was shown that the number of template bits of each cepstrum coefficient can be reduced to four with little decrease of recognition rate as compared to the system with floating point number templates. A single‐boarded recognizer using TMS320 for LPC analysis and MN 1263 for DP matching was implemented. The overall recognition rate of on‐line test in speaker‐dependent mode was 99.4% for a ten‐word vocabulary (total of 1000 tokens of ten speakers). Multiple template speaker‐independent mode achieved a recognition rate of 97.0% for an eight‐word vocabulary.

Network representation of templates in word recognition (A)

Kai‐Fu Lee

J. Acoust. Soc. Am. Volume 77, Issue S1, pp. S12-S12 (1985); (1 page)

Online Publication Date: 12 Aug 2005

Reference set generation is a crucial step in template‐based isolated word recognition. Some common techniques include casual training, selection, clustering, and averaging. These single‐template systems do not capture the variations in speech. Alternatively, multiple‐template techniques lead to additional storage and recognition time. Network representation combines the performance of multiple‐template techniques and the efficiency of single‐template techniques. Using the network approach, words are divided into segments, and different examples of the same or different words can share segments. This not only reduces storage required, but also enables the system to focus on acoustically dissimilar segments. Network representation has not been popular because of the difficulties in (1) segmenting correctly and recovering from segmentation errors, and (2) creating and modifying the network automatically. A word recognition system has been designed and implemented to facilitate network training by providing (1) relatively reliable segmentation, (2) segment‐based warping algorithm that tolerates inexact segmentation, and (3) incremental network generation. Preliminary results show that network training is superior to all of the above‐mentioned methods for speaker‐dependent and independent recognition. [Work supported by NSF.]

The use of allophonic variations of /a/ in automatic continuous speech recognition of French (A)

Jacqueline Vaissière

J. Acoust. Soc. Am. Volume 77, Issue S1, pp. S12-S12 (1985); (1 page)

Online Publication Date: 12 Aug 2005

The acoustic characteristics of the vowels in continuous speech are affected by their duration, the stress condition, and the surrounding phonemes. The purpose of this study is to investigate the extent of the allophonic variations of the most frequent and most variable vowel in French, /a/, and the integration of such variations in an automatic speech recognition system. The preliminary corpus consists of 200 vowels /a/ extracted from 128 sentences uttered by two speakers. Using the SPIRE facilities, the sentences were digitized and segmented [Zue and Leung, J. Acoust. Soc. Am. Suppl. 1 75, S59 (1984)]. The segments labeled as SONORANT corresponding to the occurrence of /a/ were extracted. The values of F1 and F2 at the onset, offset, and middle of the segments were estimated using automatic peak tracking from the LPC spectra. Our results suggest that: (1) the MANNER of articulation of the preceding and following consonant plays a negligible role on F1 and F2 onset and offset (in contrast with other data published for English and Swedish); (2) if the PLACE of articulation of the consonants is divided into FRONT, MID, and BACK, then the difference (F2‐F1) at /a/ onset allows the unique specification of the place of articulation of the preceding consonant in 86% and 57% of the cases, for SP1 and SP2, respectively, and there is no overlap between FRONT and BACK consonants; (3) (F2‐F1) at /a/ offset indicates that F2 of the following vowel has to be taken into account before interpretation; (4) determination of the thresholds (for MID and BACK consonants) can be done on a few well selected words, where the effect of the context is known to be maximal. Detailed descriptions of our findings extended to the study of more speakers, and of their consequences for the coding of the permissible variations of the vowel /a/ for each word of the vocabulary will be presented. [Work supported in part by the Office of Naval Research under contract N00014‐82‐K‐0727.]

Phonetic string alignment (A)

Kathleen M. Goudie‐Marshall, Joseph Picone, and William M. Fisher

J. Acoust. Soc. Am. Volume 77, Issue S1, pp. S12-S12 (1985); (1 page)

Online Publication Date: 12 Aug 2005

This paper will describe a newly developed expert system which uses linguistic knowledge to align the phonetic content of two different words or phrases and score their similarity. The development of the system arose out of a need to have an automatic scoring algorithm for intelligibility testing for text‐to‐speech systems and other synthetic‐speech coding and recognition systems. The system uses an automatic text‐to‐phone algorithm to translate the input and reference ASCII text strings to phonemic units, aligns them using linguistic knowledge‐based decision criteria and a dynamic programming optimization, and outputs the aligned strings as well as tabulating the phoneme confusions.
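
The alignment and confusion-tabulation steps can be sketched in Python as a dynamic programming alignment with phone-dependent substitution costs. The cost values, the toy vowel set standing in for linguistic knowledge, and all names are hypothetical; the abstract does not give the system's actual decision criteria.

```python
from collections import Counter

VOWELS = {"a", "e", "i", "o", "u"}  # toy phone classes, for illustration only
GAP = 1.5                           # assumed cost of an insertion or deletion


def sub_cost(p, q):
    """Cheaper to confuse phones of the same broad class (assumed rule)."""
    if p == q:
        return 0.0
    if (p in VOWELS) == (q in VOWELS):
        return 1.0
    return 2.0


def align(ref, hyp):
    """Return (total cost, list of (ref_phone, hyp_phone) pairs; None = gap)."""
    n, m = len(ref), len(hyp)
    cost = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        cost[i][0] = cost[i - 1][0] + GAP
    for j in range(1, m + 1):
        cost[0][j] = cost[0][j - 1] + GAP
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost[i][j] = min(
                cost[i - 1][j - 1] + sub_cost(ref[i - 1], hyp[j - 1]),
                cost[i - 1][j] + GAP,
                cost[i][j - 1] + GAP,
            )
    # Trace back from the end to recover one optimal alignment.
    pairs = []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and cost[i][j] == cost[i - 1][j - 1] + sub_cost(ref[i - 1], hyp[j - 1]):
            pairs.append((ref[i - 1], hyp[j - 1]))
            i, j = i - 1, j - 1
        elif j == 0 or (i > 0 and cost[i][j] == cost[i - 1][j] + GAP):
            pairs.append((ref[i - 1], None))
            i -= 1
        else:
            pairs.append((None, hyp[j - 1]))
            j -= 1
    pairs.reverse()
    return cost[n][m], pairs


def confusions(pairs):
    """Tabulate mismatched pairs, including insertions and deletions."""
    return Counter(p for p in pairs if p[0] != p[1])


if __name__ == "__main__":
    total, pairs = align(["k", "a", "t"], ["k", "o", "t"])
    print(total, pairs, confusions(pairs))
```

A real system would take its phone strings from the text-to-phone stage and use a much richer cost table than the two broad classes assumed here.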