Journal of the Acoustical Society of America

Nov 1988

Volume 84, Issue S1, pp. S2-S224

Session J. Speech Communication II: Analysis and Synthesis, Part B (Poster Session)
Contributed Papers

Formant extraction by local approximation of the speech spectrum (A)

Kensaku Fujii and Juro Ohga

J. Acoust. Soc. Am. Volume 84, Issue S1, pp. S21-S21 (1988); (1 page)

Online Publication Date: 13 Aug 2005

The present study proposes a new method of formant extraction that flattens the residual signal with reduced computational load. In this method, the formants are extracted in decreasing order of amplitude. Each formant is represented by an IIR filter of [formula not available]. This formula has three unknown quantities $Q_a$, $\cos\theta_0$, and $Q_b$. They are estimated from quadratic equations on the speech spectrum. The coefficient $\cos\theta_0$ corresponds to the formant frequency, which is obtained as the peak of the second‐order curve defined by the maximum line spectrum A and the neighboring line spectra B and C. Here, $Q_b$ is related to the formant bandwidth, which is given by the geometrical series average [formula not available]. Here, $e^{aT}$ and $e^{bT}$ are the damping terms in the impulse response of the second‐order IIR filter. The relative magnitude of the frequency responses of this filter at the frequencies of the line spectra A and B is set to coincide with the relative amplitude of the line spectra A and B. The $e^{bT}$ term is determined in a similar way by using the line spectra A and C. The coefficient $Q_a$ is the damping term of the frequency response which expresses the global shape of the spectrum. It can be estimated by using the line spectrum A and the maximum frequency component.
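
The peak of a second‐order curve through the largest line spectrum and its two neighbors can be illustrated with ordinary three‐point parabolic interpolation. The sketch below is only that generic technique, not the authors' estimator: the function name, the amplitude values, and the 100 Hz harmonic spacing are assumptions made for illustration, and the abstract does not say whether the fit is done on linear or logarithmic amplitudes.

```python
def parabolic_peak(k, a_prev, a_peak, a_next):
    """Vertex of the parabola through (k-1, a_prev), (k, a_peak), (k+1, a_next).

    Returns (fractional_index, interpolated_height).
    """
    curvature = a_prev - 2.0 * a_peak + a_next
    if curvature == 0.0:
        # The three points are collinear, so there is no vertex to find.
        return float(k), a_peak
    offset = 0.5 * (a_prev - a_next) / curvature
    height = a_peak - 0.25 * (a_prev - a_next) * offset
    return k + offset, height


# Hypothetical line spectrum: harmonics spaced 100 Hz apart, with the
# largest line (A) at harmonic 5 and its neighbors (B, C) at 4 and 6.
harmonic_spacing_hz = 100.0
index, height = parabolic_peak(5, 8.31, 9.91, 9.51)
formant_hz = index * harmonic_spacing_hz
```

The fractional index places the estimated peak between harmonics, which is why three lines are enough to refine a formant frequency beyond the harmonic spacing.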

Formant‐frequency estimation by linear transformation of the LPC‐cepstrum (A)

David J. Broad and Frantz Clermont

J. Acoust. Soc. Am. Volume 84, Issue S1, pp. S21-S21 (1988); (1 page)

Online Publication Date: 13 Aug 2005

Correlations near 0.98 between vowel formant frequencies and linear combinations of 1/3‐octave‐band spectrum levels reported by Pols et al. [“Perceptual and physical space of vowel sounds,” J. Acoust. Soc. Am. 46, 458–467 (1969)] suggest that the formants might be estimated by linear transformation of either a low‐resolution log spectrum or of the first M cepstral coefficients $c_i$, which are linear functions of a log spectrum. Applying this idea with M = 14 to four speakers results in pooled intraspeaker prediction errors of 41, 126, and 134 Hz for F1, F2, and F3, respectively. These become 52, 164, and 218 Hz when regressions on data from three speakers are used to measure formants from a fourth. The method is therefore not very accurate, but it is robust in that large errors from misidentified peaks are rare. A study of single‐resonance waveforms explains why the method works: The $c_i$ are roughly cosines in the resonance frequency F, with i half‐cycles over the analysis bandwidth; these $c_i(F)$ form a basis set for approximating the function $F_{\mathrm{est}} = F_{\mathrm{real}}$.
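
The single‐resonance argument can be checked numerically. For one conjugate pole pair at radius r and angle θ, the LPC cepstrum is c_n = (2/n) rⁿ cos(nθ), so each coefficient is a cosine in the resonance frequency. The sketch below is an illustration under stated assumptions, not the regression used in the abstract: it assumes a single resonance with a known bandwidth, and it takes its weights from the Fourier cosine series of θ on [0, π] instead of fitting them to data. All function names are hypothetical.

```python
import math


def resonance_cepstrum(freq_hz, bandwidth_hz, fs_hz, order):
    """LPC cepstrum c_1..c_order of a single two-pole resonance."""
    theta = 2.0 * math.pi * freq_hz / fs_hz
    radius = math.exp(-math.pi * bandwidth_hz / fs_hz)
    return [2.0 / n * radius ** n * math.cos(n * theta)
            for n in range(1, order + 1)]


def linear_formant_estimate(cepstrum, bandwidth_hz, fs_hz):
    """Estimate the resonance frequency as a constant plus a weighted sum of c_n.

    Uses theta = pi/2 - (4/pi) * sum over odd n of cos(n*theta) / n**2,
    with cos(n*theta) recovered from c_n (bandwidth assumed known).
    """
    radius = math.exp(-math.pi * bandwidth_hz / fs_hz)
    estimate = fs_hz / 4.0
    for n, c_n in enumerate(cepstrum, start=1):
        if n % 2 == 1:
            weight = -fs_hz / (math.pi ** 2 * n * radius ** n)
            estimate += weight * c_n
    return estimate


fs_hz = 10000.0
errors_hz = []
for true_hz in range(300, 4001, 100):
    c = resonance_cepstrum(float(true_hz), 100.0, fs_hz, 14)
    errors_hz.append(abs(linear_formant_estimate(c, 100.0, fs_hz) - true_hz))
worst_error_hz = max(errors_hz)
```

With 14 coefficients the truncated series leaves a bounded residual error, which is in keeping with the abstract's remark that the approach is robust but not very accurate.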

Formant contour extraction by a temporally constrained search of the spectral resonance space (A)

Frantz Clermont

J. Acoust. Soc. Am. Volume 84, Issue S1, pp. S21-S22 (1988); (2 pages)

Online Publication Date: 13 Aug 2005

An algorithm for extracting the first three formant‐frequency contours of vowel sounds is based on properties of the linear prediction phase spectrum derived by Yegnanarayana [J. Acoust. Soc. Am. 63, 1638–1640 (1978)], and Yegnanarayana and Reddy [Int. Conf. Acoust. Speech Signal Process., Conf. Record, 744–747 (1979)]. The former study has shown that the negative derivative of the linear prediction phase spectrum (NDPS) behaves like a formant‐enhancing filter. The latter study has shown that the Euclidean distance between pairs of NDPS emphasizes differences only around formant peak regions, and that it is easily computed as an index‐weighted cepstral distance. In stage 1 of the proposed algorithm, all possible sets of four candidate peaks from the original spectrum are used to synthesize four‐formant spectra. The index‐weighted cepstral distances between these simplified spectra and the original spectrum become row entries in a matrix of intraframe distances. In stage 2, a dynamic programming (DP) procedure imposes continuity constraints across frame spectra. The DP‐cost function is the cumulative sum of the intraframe distances plus the minimum of interframe cepstral distances. Extracting temporally constrained contours then consists of backtracking the optimum path through the spectral distance matrix.
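
The second stage can be pictured as a standard dynamic programming search with backtracking. The sketch below is a generic version under stated assumptions: the candidate sets, the intraframe costs, and the interframe cost (a scaled sum of absolute frequency differences standing in for the cepstral distance) are all hypothetical, and three‐formant candidates are used for brevity.

```python
def track_candidates(frames, intra_costs, inter_cost):
    """Pick one candidate per frame minimizing intraframe plus interframe cost.

    frames[t] is a list of candidates, intra_costs[t][k] the cost of
    candidate k in frame t, and inter_cost(a, b) the cost of moving from
    candidate a to candidate b in the next frame.
    """
    total = [list(intra_costs[0])]
    back = [[None] * len(frames[0])]
    for t in range(1, len(frames)):
        row_total, row_back = [], []
        for k, cand in enumerate(frames[t]):
            steps = [total[t - 1][j] + inter_cost(prev, cand)
                     for j, prev in enumerate(frames[t - 1])]
            best_j = min(range(len(steps)), key=steps.__getitem__)
            row_total.append(intra_costs[t][k] + steps[best_j])
            row_back.append(best_j)
        total.append(row_total)
        back.append(row_back)
    # Backtrack from the cheapest final state.
    k = min(range(len(total[-1])), key=total[-1].__getitem__)
    path = []
    for t in range(len(frames) - 1, -1, -1):
        path.append(frames[t][k])
        k = back[t][k]
    path.reverse()
    return path


def jump_cost(a, b):
    """Hypothetical interframe cost: scaled sum of formant jumps in Hz."""
    return 0.001 * sum(abs(x - y) for x, y in zip(a, b))


frames = [
    [(500, 1500, 2500)],
    [(510, 1520, 2480), (510, 2480, 3400)],  # second entry skips F2
    [(520, 1540, 2460)],
]
intra = [[0.0], [0.30, 0.25], [0.0]]
contour = track_candidates(frames, intra, jump_cost)
```

In this toy case the second candidate of the middle frame has the lower intraframe cost, but the continuity term makes the smoother contour cheaper overall.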

Statistical approaches to formant tracking (A)

Robert T. Gayvert and James Hillenbrand

J. Acoust. Soc. Am. Volume 84, Issue S1, pp. S22-S22 (1988); (1 page)

Online Publication Date: 13 Aug 2005

Formant trackers that rely on peak picking tend to make occasional large errors. This paper investigates two general methods for determining formant locations without explicit use of peak information. Both approaches involve statistical models derived from hand‐traced formant tracks. In the first method, individual formant frequency values are estimated using a maximum likelihood classifier. In the second, formant probability distributions are found for each element of a vector quantization codebook, and formant values are then determined by conditional mean estimates. Hidden Markov models or simple smoothing can then be applied to provide continuity constraints. Both of these techniques have been quantitatively analyzed using a database of 78 utterances produced by four males and four females. The performance of these trackers across different training and testing sets will be discussed. [Work supported by Rome Air Development Center under contract F3060285‐C‐0008.]

A pitch extraction method using higher‐order joint moment (A)

Norio Nomura and Yuichi Yoshida

J. Acoust. Soc. Am. Volume 84, Issue S1, pp. S22-S22 (1988); (1 page)

Online Publication Date: 13 Aug 2005

A modified higher‐order joint moment method for correct pitch extraction is proposed. The mean products of three vectorized samples are used, while the usual autocorrelation method uses the mean products of two samples. Each of three samples separated by the same sample interval is modified to a vector whose modulus is equal to the absolute value of the sample, and the argument is ± 120 deg in accordance with the sample's sign. Only when three original samples have the same sign does the product of three vectors become a real positive value. By averaging the products in a certain window, a sharp peak on the sample‐interval axis is obtained at the pitch period. The effectiveness of the method is examined with real voiced signals. The peak at the pitch period is much sharper than the peak of the autocorrelation method, and the spurious peaks are very small. These properties result in correct pitch extraction. It is also shown that the influence of the pitch fluctuation has properties similar to those in the autocorrelation method.
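
The vectorized three‐sample product is easy to reproduce with complex numbers. The sketch below is one reading of the description, not the authors' implementation: it assumes the three samples are taken at n, n + lag, and n + 2·lag, that the real part of the averaged product is the quantity searched for a peak, and that the test signal is a synthetic two‐harmonic waveform with a 50‐sample period. All names are hypothetical.

```python
import cmath
import math


def to_vector(sample):
    """Map a real sample to modulus |sample| and argument +/-120 degrees."""
    angle = math.copysign(2.0 * math.pi / 3.0, sample)
    return cmath.rect(abs(sample), angle)


def joint_moment(signal, lag, window):
    """Mean real part of the product of three vectorized samples."""
    total = 0.0
    for n in range(window):
        product = (to_vector(signal[n])
                   * to_vector(signal[n + lag])
                   * to_vector(signal[n + 2 * lag]))
        total += product.real
    return total / window


def pitch_period(signal, min_lag, max_lag, window):
    """Lag in [min_lag, max_lag] with the largest joint moment."""
    return max(range(min_lag, max_lag + 1),
               key=lambda lag: joint_moment(signal, lag, window))


period = 50
signal = [math.sin(2.0 * math.pi * n / period)
          + 0.5 * math.sin(4.0 * math.pi * n / period + 0.3)
          for n in range(400)]
estimated_period = pitch_period(signal, 20, 80, 200)
```

Three samples of the same sign give arguments summing to ±360 deg, so their product is real and positive; any sign mismatch leaves a 120 deg argument and a negative real part, which is what suppresses the spurious peaks.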

Formant frequency estimation by moment calculation of the speech spectrum (A)

Kazuyuki Takagi and Shuichi Itahashi

J. Acoust. Soc. Am. Volume 84, Issue S1, pp. S22-S22 (1988); (1 page)

Online Publication Date: 13 Aug 2005

Moment calculation is applied to extract the formant frequencies of a speech spectrum. Three kinds of first‐order moments divide a spectrum into four frequency regions. The centers of gravity of the first three regions are calculated to give the 0th order estimation of the 1st, 2nd, and 3rd formant frequencies. Then the upper and the lower bounds of each region are modified so that the estimated frequency comes closer to the major peak of the spectrum, utilizing the second‐order and the third‐order moments that represent the variance and skewness of the spectral pattern. The process repeats until the kth estimation equals the (k − 1)th estimation. This modification improves the estimation precision significantly. An experiment with model spectra generated by an all‐pole model gave estimation precision of 3% using formant frequencies typical of the five Japanese vowels. Speech materials spoken by five male and five female speakers were used for this experiment. The speech waveform was sampled at a rate of 10 kHz through a 5 kHz LPF, quantized into 12 bits; then the spectrum envelope was calculated with the first 24 cepstra of a 256‐point FFT spectrum. The results give acceptable precision, compared with visually determined formant frequencies.
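
The moments involved are the usual ones for a distribution, with spectral power acting as the weight. The sketch below computes the center of gravity, variance, and skewness of one frequency region. The function name and the triangular test spectrum are assumptions, and the rule by which the abstract's method moves the region bounds is not given in the source, so it is not modeled here.

```python
def region_moments(freqs_hz, power, lo_hz, hi_hz):
    """Center of gravity, variance, and skewness of power within [lo_hz, hi_hz]."""
    points = [(f, p) for f, p in zip(freqs_hz, power) if lo_hz <= f <= hi_hz]
    mass = sum(p for _, p in points)
    centroid = sum(f * p for f, p in points) / mass
    variance = sum((f - centroid) ** 2 * p for f, p in points) / mass
    third = sum((f - centroid) ** 3 * p for f, p in points) / mass
    skewness = third / variance ** 1.5 if variance > 0.0 else 0.0
    return centroid, variance, skewness


# Hypothetical spectrum: a triangular peak centered at 500 Hz.
freqs_hz = [100.0 * k for k in range(11)]
power = [5.0 - abs(f - 500.0) / 100.0 for f in freqs_hz]

symmetric = region_moments(freqs_hz, power, 0.0, 1000.0)
truncated = region_moments(freqs_hz, power, 0.0, 700.0)
```

When the region is symmetric about the peak, the skewness vanishes and the centroid sits on the peak. Cutting the region off at 700 Hz pulls the centroid below 500 Hz and gives a negative skewness, the kind of signal that could be used to shift the bounds.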

Real‐time pitch detection with a digital signal processor (A)

Michiharu Mito, Kiyoshi Takahasi, Syuji Kurokawa, Syogo Nakamura, and Tadahiro Kubota

J. Acoust. Soc. Am. Volume 84, Issue S1, pp. S22-S22 (1988); (1 page)

Online Publication Date: 13 Aug 2005

Pitch detection is an important and essential technique in speech analysis, synthesis, and so on. There are many methods to extract voice pitch, but it is difficult to perform pitch extraction in real time. This paper describes a simple real‐time pitch detection algorithm which directly estimates the interval between peaks of the waveform. This algorithm consists of the following two parts: one is the peak emphasis of voiced signals and the other is the pitch detection. The peak emphasis is obtained by running a DFS (discrete Fourier series) and a window operation. It is important to determine the peaks for the pitch measurement because the pitch period is obtained by estimating the interval between successive peaks representing the pitch period of the voiced signal. The peaks related to pitch are determined using a few simple rules. Since the voiced signal waveform includes several extra complicated peaks, these rules are constructed taking into account the characteristics of the voiced signal. Other peaks, which do not correspond to the pitch period, are rejected by a simple logical judgment. A real‐time pitch detection algorithm has been realized using a conventional DSP.
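
Estimating pitch from the interval between successive waveform peaks can be sketched in a few lines. The rules below (a relative amplitude threshold and a minimum gap between accepted peaks) are hypothetical stand‐ins for the abstract's unspecified rules, the peak‐emphasis stage is omitted, and the test waveform is synthetic.

```python
import math
import statistics


def find_pitch_peaks(signal, rel_threshold, min_gap):
    """Indices of local maxima that pass two simple acceptance rules."""
    limit = rel_threshold * max(signal)
    peaks = []
    for n in range(1, len(signal) - 1):
        is_local_max = signal[n - 1] < signal[n] >= signal[n + 1]
        if not is_local_max or signal[n] < limit:
            continue  # rule 1: reject small peaks
        if peaks and n - peaks[-1] < min_gap:
            # rule 2: of two peaks that are too close, keep the larger
            if signal[n] > signal[peaks[-1]]:
                peaks[-1] = n
            continue
        peaks.append(n)
    return peaks


def pitch_hz(signal, fs_hz, rel_threshold=0.6, min_gap=20):
    """Pitch from the median interval between accepted peaks, or None."""
    peaks = find_pitch_peaks(signal, rel_threshold, min_gap)
    intervals = [b - a for a, b in zip(peaks, peaks[1:])]
    if not intervals:
        return None
    return fs_hz / statistics.median(intervals)


fs_hz = 8000.0
period = 80  # samples, i.e. 100 Hz at 8 kHz
signal = [math.sin(2.0 * math.pi * n / period)
          + 0.5 * math.sin(4.0 * math.pi * n / period)
          for n in range(400)]
estimate_hz = pitch_hz(signal, fs_hz)
```

Because each step is a comparison or a subtraction, a scheme of this kind is cheap enough for sample‐by‐sample operation.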

Speech analysis using a time‐varying ARX model for separating the source‐tract coupling of vowels (A)

Tetsuo Funada

J. Acoust. Soc. Am. Volume 84, Issue S1, pp. S22-S22 (1988); (1 page)

Online Publication Date: 13 Aug 2005

The purpose of this research is to extract formant frequencies precisely and to classify voiced/unvoiced intervals accurately based on a source‐tract model. A sequential estimation of the source wave (i.e., the glottal volume flow) and the vocal tract (VT) characteristics is achieved by using a time‐varying “ARX model,” where the term ARX model refers to an AR (autoregressive) model with an auxiliary nonwhite input (X input). This X input indicates the glottal volume flow in the present research. Applications to synthetic vowels generated by the two‐mass model demonstrated the following results: (1) Much information on the glottal closure and opening was obtained from the X input; and (2) compared to the conventional (autocorrelation) LP method, formant frequencies (especially the first formant) during the open period of the glottis were estimated more accurately. It has also been observed from real vowels uttered by a male speaker that the phase of the X input agrees with the phase of the glottal movement which can be confirmed by electroglottography (EGG).

Excitation problem in speech synthesis (A)

Bishnu S. Atal

J. Acoust. Soc. Am. Volume 84, Issue S1, pp. S22-S23 (1988); (2 pages)

Online Publication Date: 13 Aug 2005

Linear predictive coding methods require efficient representation of both the LPC filter and its input excitation to synthesize high‐quality speech at low bit rates. Considerable progress has been made so far in encoding the filter parameters and it is possible to quantize these parameters with only 1600 bits/s without introducing distortion in the synthetic speech signal. However, it is still not possible to encode the LPC filter excitation at low bit rates and maintain high voice quality in the synthetic speech signal. In this paper, the problems associated with low bit representation of the excitation are discussed. To achieve low bit rates, a parametric representation is needed that can provide a compact yet accurate representation of the excitation. Such a compact representation is obtained by expressing the excitation waveform as a linear combination of the eigenvectors of the autocorrelation matrix of the LPC filter's impulse response. This representation allows the study of the effect of changes in the filter excitation on the speech output in a systematic manner. The signal‐to‐noise ratios necessary to represent various eigenvector components in the excitation without producing perceptible distortion in the output speech signal have been determined. Thus the minimum number of bits necessary to reproduce a speech signal is estimated. These results will be discussed in the paper.

Effects of fundamental frequency contour on the identification of resynthesized vowels with static formant frequency patterns (A)

James Hillenbrand and Robert T. Gayvert

J. Acoust. Soc. Am. Volume 84, Issue S1, pp. S23-S23 (1988); (1 page)

Online Publication Date: 13 Aug 2005

At a previous meeting, a study that was aimed at determining the identifiability of vowels based exclusively on static spectral information was discussed [Hillenbrand and McMahon, J. Acoust. Soc. Am. Suppl. 1 82, S81 (1987)]. In that study, a formant synthesizer was used to generate steady‐state versions of 1520 vowels (76 speakers × 10 vowels × 2 repetitions) using Peterson and Barney's measured values of F0 and F1−F3 [J. Acoust. Soc. Am. 24, 175–184 (1952)]. The values of all control parameters remained constant throughout the 300‐ms duration of each stimulus. Listeners in that study showed an error rate of approximately 25%, several times greater than the 5.6% error rate reported in the original Peterson and Barney study. The present study represents a follow‐up designed to determine what role fundamental frequency movement might play in vowel identification. The new stimuli were identical to those of the previous resynthesis study except that all stimuli were generated with a falling pitch contour. Preliminary results suggest that the introduction of pitch movement decreases the error rate from approximately 25% to approximately 21%. [Work supported by Rome Air Development Center under contract F3060285‐C‐0008.]

Analysis and synthesis of CV syllables in Hindi (A)

S. S. Agrawal

J. Acoust. Soc. Am. Volume 84, Issue S1, pp. S23-S23 (1988); (1 page)

Online Publication Date: 13 Aug 2005

Twenty‐nine consonants of frequent occurrence in Hindi were combined with a cardinal vowel /ɑ/ to make CV‐type syllables. These were spoken by a standard male speaker. The spectral analysis was done using a sound spectrograph and the covariance method of LPC analysis. The formant frequencies and their bandwidths were obtained in segments of varying duration from 8 to 20 ms depending upon the nature of acoustic information in the syllable. A program called “SNDSYS” was used to estimate the fundamental frequency and overall amplitude values at every 10‐ms duration. These parameters were used to give a basic acoustic description and frame rules for consonant‐vowel combinations. A P.C. version of Klatt's synthesizer called “KLPC” was used to synthesize the CV syllables. There are over 40 parameters and constants that are used in the default synthesizer configuration. Based on the acoustic description of different sounds, various parameters were updated at an interval of 5 to 10 ms to generate different consonant‐vowel combinations. The configuration file obtained for each syllable (named as documentation file) was further used to update and change the parameters to improve the quality of synthesized speech. Special considerations related to the synthesis of voiced and unvoiced aspirated consonants of Hindi are also discussed.

Synthesis of Chinese by rules based on a multipulse excitation model (A)

Li Changli and Me Fuyuan

J. Acoust. Soc. Am. Volume 84, Issue S1, pp. S23-S23 (1988); (1 page)

Online Publication Date: 13 Aug 2005

According to the model of speech production, the characteristic parameter of speech can be divided into two parts: excitation and vocal tract parameters. Atal proposed the multipulse excitation model that can produce high‐quality synthesized speech. This research shows that the intensity, duration, and pitch mode of single syllable Chinese produced by multipulse excitation may be changed when the adaptive method is utilized to process its multipulse sequences and vocal tract parameter. There are about 10 000 Chinese words in common use, but the pronunciation of many words is the same, so that only about 1300 syllables are independent. The Chinese language is a tone language. Each Chinese word is of four pitch modes, and the vocal tract parameter for the four modes of one word is almost the same. Therefore, there are 400 independent vocal tract parameters and 1300 multipulse sequences in Chinese. Based on the above strategy, a new method of synthesizing Chinese by rules has been proposed. The intelligibility and naturalness of the synthesized speech are satisfactory.

Statistical modeling of dynamic spectral patterns for a speech synthesizer (A)

Sateshi Takahashi, Yasuaki Satoh, Takeshi Ohno, and Katsuhiko Shirai

J. Acoust. Soc. Am. Volume 84, Issue S1, pp. S23-S23 (1988); (1 page)

Online Publication Date: 13 Aug 2005

A new method called the spectral locus control method (SLCM) is proposed, which can approximate the dynamic characteristics of the speech spectrum, such as in the transition from vowels to consonants or from consonants to vowels, effectively and accurately. The main procedures of the method are as follows. Continuous speech is segmented into VCV units, and these units are grouped according to the consonants. The spectrum patterns of the V1CV2 units in each group are analyzed to construct a statistical model which, given the spectra of V1 and V2, generates the spectrum loci for V1CV2 units. To synthesize continuous speech, a spectrum appropriate for a given consonantal context is first selected for each vowel V in every CVC sequence in the text. Then, the temporal sequence of the spectrum patterns for the entire V1CV2 is calculated based on the spectrum of the stationary parts in V1 and V2. Since VCV segment spectra are adapted to their consonantal environment, the synthesized speech is highly natural, especially in transitions.

A speech synthesis system by rule in Japanese (A)

Ryunen Teranishi

J. Acoust. Soc. Am. Volume 84, Issue S1, pp. S23-S23 (1988); (1 page)

Online Publication Date: 13 Aug 2005

In Japanese, text‐to‐speech systems have to deal with problems of complicated orthography and a writing custom without a clear word separation rule. The system shown here is a tentative one avoiding such troublesome problems, constructed to study the rules in Japanese that are useful for the speech synthesis. In order to obtain natural prosody, as in human text reading, the system should have some mechanism that divides the input sentence into several proper breath groups with pauses based on the analysis of the syntactic structure as a human does. The construction of the system has been accomplished, and it can respond to such a demand. This algorithm was realized as completed software on 2HD diskette, available for the PC‐98 series of personal computer made by NEC. The features of the system are as follows. The input form is word units written in kana letter codes. These units are separated with the space code. No prosody code is necessary except for the punctuation marks. An original parser, based on traditional Japanese grammar, produces the prosody for the synthesized speech.

A system for speech synthesis from Japanese orthographic text (A)

Hisashi Kawai, Keikichi Hirose, and Hiroya Fujisaki

J. Acoust. Soc. Am. Volume 84, Issue S1, pp. S23-S24 (1988); (2 pages)

Online Publication Date: 13 Aug 2005

A system has been developed for speech synthesis from Japanese orthographic text. The system consists of four processing stages. The linguistic processing stage utilizes natural language processing techniques for extracting lexical, syntactic, semantic, and discourse information from each paragraph of the input text. The phonetic processing stage utilizes this information to derive a string of segmental and prosodic symbols for the entire paragraph. The acoustic processing stage generates time‐varying patterns of parameters from these symbols to control the final stage, which is a formant‐type synthesizer. The Fujisaki‐Ljungqvist model is adopted for the excitation of the voiced sounds [Proc. ICASSP 86, 1605–1608 (1986)], and its fundamental frequency is controlled by a model of F0 contour generation [H. Fujisaki and K. Hirose, J. Acoust. Soc. Jpn. (E) 5, 233–242 (1984)]. The segmental features, on the other hand, are synthesized by concatenating pole‐zero frequency patterns prestored for each syllable. The validity of the system, especially of the prosodic feature synthesis, was confirmed by the naturalness of the accent and intonation of the synthesized speech. [Work supported by Grant‐in‐Aid for Scientific Research on Priority Areas from Ministry of Education, Science and Culture of Japan, No. 63608002.]
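
The F0 contour model cited here is commonly written as a sum in the log‐frequency domain: a baseline, phrase components (impulse responses of a critically damped second‐order system), and accent components (step responses with a ceiling). The sketch below follows that commonly quoted form. The parameter values (α = 3/s, β = 20/s, γ = 0.9), the command timings, and the function names are illustrative assumptions, not values taken from this system.

```python
import math


def phrase_response(t, alpha):
    """Impulse response of the phrase control mechanism."""
    return alpha * alpha * t * math.exp(-alpha * t) if t >= 0.0 else 0.0


def accent_response(t, beta, gamma):
    """Step response of the accent control mechanism, capped at gamma."""
    if t < 0.0:
        return 0.0
    return min(1.0 - (1.0 + beta * t) * math.exp(-beta * t), gamma)


def f0_contour(t, base_hz, phrases, accents, alpha=3.0, beta=20.0, gamma=0.9):
    """F0 in Hz at time t (seconds).

    phrases: list of (amplitude, onset_time)
    accents: list of (amplitude, start_time, end_time)
    """
    log_f0 = math.log(base_hz)
    for amplitude, onset in phrases:
        log_f0 += amplitude * phrase_response(t - onset, alpha)
    for amplitude, start, end in accents:
        log_f0 += amplitude * (accent_response(t - start, beta, gamma)
                               - accent_response(t - end, beta, gamma))
    return math.exp(log_f0)


# Hypothetical commands: one phrase at 0 s, one accent from 0.2 s to 0.5 s.
phrases = [(0.5, 0.0)]
accents = [(0.4, 0.2, 0.5)]
contour = [f0_contour(0.01 * i, 100.0, phrases, accents) for i in range(150)]
```

Working in log frequency means the phrase and accent components scale the baseline multiplicatively, so the same commands can be reused for speakers with different baseline pitch.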

Text‐to‐speech system for English and Japanese (A)

Kenji Matsui, Noriyo Hara, Masaaki Kitano, Hector Javkin, Kazue Hate, and Hisashi Wakita

J. Acoust. Soc. Am. Volume 84, Issue S1, pp. S24-S24 (1988); (1 page)

Online Publication Date: 13 Aug 2005

A real‐time text‐to‐speech system for English and Japanese has been developed. This system consists of a language processing module, a phonetic acoustic processing module, and a synthesis module. Full general English and Japanese sentences can be converted to speech. The Japanese software and English software are independent except for the synthesis module. The features of this system are as follows. (1) The synthesis module is a phoneme‐based cascade‐parallel formant synthesizer with high observed intelligibility (73.5% for the 119 Japanese monosyllables). (2) This system has a 3000‐morpheme English dictionary and 40 000‐word Japanese dictionary with a high‐speed search algorithm. (3) A large speech database was collected for the development of Japanese prosody rules. (4) For the precise control of pitch contour, the Fujisaki model was adopted. (5) One of the two systems developed can stand alone; the other requires a personal computer with a high‐speed DSP board. (6) In the development of this system, some powerful interactive tools have also been developed for varying speech parameters in real time.

The Japanese speech synthesis system with text editing and automatic prosodic control facilities (A)

Seiichi Yamamoto, Norio Higuchi, and Tohru Shimizu

J. Acoust. Soc. Am. Volume 84, Issue S1, pp. S24-S24 (1988); (1 page)

Online Publication Date: 13 Aug 2005

A Japanese speech synthesis system using a Japanese speech synthesizer by rule and software for text editing and automatic prosodic control has been developed. In the case of Japanese, there is a possibility that difference in accent type will cause differences in meaning, and that a word included in a compound word, such as a compound noun or verb, will have an accent type different from its original one. Therefore, the correct specification of the prosodic symbols is not so easy, even for people without any difficulty in hearing. The software for the text editing and for the automatic prosodic control inserts symbols for the prosodic control into the input text during the process of the text creation and editing in which the input kana string is converted to a kanji‐kana mixed sentence as in conventional Japanese word processors. On the other hand, the speech synthesizer uses phonemes as synthesis units and generates all acoustic parameters based on the 156 feature rules and 472 parameter rules. Since it also has the facility to send the synthetic speech through the telephone line, even a speech‐impaired person can transmit messages to a distant listener.

A Japanese speech synthesizer based on production rules (A)

Norio Higuchi, Seiichi Yamamoto, and Tohru Shimizu

J. Acoust. Soc. Am. Volume 84, Issue S1, pp. S24-S24 (1988); (1 page)

Online Publication Date: 13 Aug 2005

A Japanese speech synthesizer by rule, which uses phonemes as synthesis units and generates all acoustic parameters based on production rules, has been developed. The conversion from the input romaji string in Hepburn style to the synthetic speech waveform consists of (1) the generation of the phoneme/boundary string with the distinctive feature matrix based on 156 feature rules, (2) the conversion to sequences of the acoustic parameters based on 472 parameter rules, and (3) the generation of the speech waveform using a Klatt‐type formant synthesizer. The first two processes are written in C language and implemented by a microprocessor (M 68000) and the last one is implemented by a digital signal processor (TI TMS32010). Both male and female voices can be synthesized with three different accent levels at seven different speech rates in real time. Nine kinds of subjective evaluation, which include tests for intelligibility, naturalness, and other nonlinguistic factors, were proposed and applied to the speech generated with the above‐mentioned speech synthesizer. According to the results, 87.8% of the morae of the male voice and 81.6% of the morae of the female voice were identified correctly by three male subjects and two female subjects.

Rhythm control based on CV‐syllable positioning for Japanese synthetic speech (A)

Toshimitsu Minowa

J. Acoust. Soc. Am. Volume 84, Issue S1, pp. S24-S24 (1988); (1 page)

Online Publication Date: 13 Aug 2005

A technique has been developed for Japanese speech synthesis‐by‐rule to control the rhythm of synthetic speech sounds to which little attention has been given so far. In Japanese speech sounds, syllables are generally believed to be the basic elements of the rhythm, with each syllable sound pronounced almost isochronously. It was found through listening tests that there is an important portion in a syllable for recognizing the syllable and the positioning of that portion determines the rhythm. The portion was termed auditory perceptual timing point (APTP) and was determined for each syllable in listening tests. Most APTPs were found near the voice onset, which closely agreed with the result obtained by Sato [H. Sato, Trans. Comm. Speech Res., ASJ, S77‐31, 1–8 (1977)]. The rhythm pattern was, in principle, determined by the number of morae in individual words and the syntactic structure of an input text, though further investigation is necessary to construct detailed rules. It has been confirmed that the quality of synthetic speech sound can be improved by employing this rhythm‐control technique.

Quantitative evaluation of the perceptual significance of control parameters in synthesis by rule (A)

Yoichi Yamashita, Riichiro Mizoguchi, and Osamu Kakusho

J. Acoust. Soc. Am. Volume 84, Issue S1, pp. S24-S24 (1988); (1 page)

Online Publication Date: 13 Aug 2005

In synthesis by rule, control parameters generated by rules do not always match desired ideal values (natural sounds). To effectively improve synthetic sound quality, it is important to evaluate the perceptual significance of individual parameters and to search for the parameters most important to speech quality. This paper describes a method for measuring relative weights among some control parameters to quantitatively evaluate their perceptual significance. Perceptual weights are measured by equalizing two kinds of distances between stimuli employed in listening tests. One is defined on the physical space delimited by the synthesis parameters of stimuli. Another is defined on the psychological space delimited by parameters obtained from multidimensional scaling (MDS) techniques. MDS distributes stimuli into the space of an arbitrary dimension based on preference lists for stimuli that are obtained through listening tests. To verify the validity of the proposed method, perceptual weights for the first and second formants of the isolated vowel /a/ were measured. [Work partly supported by a Grant‐in‐Aid for Scientific Research on Priority Areas from the Ministry of Education.]

On the unit selection measure of speech synthesis by rule using multiple synthesis units (A)

Katsuo Abe and Yoshinori Sagisaka

J. Acoust. Soc. Am. Volume 84, Issue S1, pp. S24-S25 (1988); (2 pages)

Online Publication Date: 13 Aug 2005

In speech synthesis by rule, a synthesis scheme using nonuniform speech synthesis units to obtain the optimum synthesis unit sequence for a desired output speech has been proposed. In this paper, vowel spectrum variations are analyzed using the LPC‐cepstrum distance to introduce a quantitative measure for unit selection. Using 5240 words, vowel spectral distortion resulting from contextual differences was studied and the following tendencies were found. (1) The neighboring phoneme, the position in the utterance, and the accentuation affect vowel spectral envelopes in this order. (2) For CVs whose following consonants have the same point or manner of articulation, the spectral distance among the vowels of the CVs is 12% smaller than the average. (3) The vowel spectrum of the word's final CV differs from that of the word's medial CV. Based on these results, a quantitative measure is introduced to represent the spectral similarities of each vowel. With this measure, the unit selection scheme was tested using open data. Through these experiments, it was not only confirmed that the previously proposed categorical measures are adequate for general unit selection, but it was also shown that some phoneme combinations should be specially scored for unit selection.
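
A cepstral distance used as a selection score can be sketched as follows. The dB‐scaled Euclidean form below is one widely used definition of cepstral distance and is assumed here; the abstract does not give the exact measure. The unit names and coefficient values are hypothetical.

```python
import math


def cepstral_distance_db(c_a, c_b):
    """Cepstral distance in dB between two equal-length cepstral vectors."""
    squared = sum((a - b) ** 2 for a, b in zip(c_a, c_b))
    return 10.0 / math.log(10.0) * math.sqrt(2.0 * squared)


def select_unit(target, candidates):
    """Name of the candidate whose vowel cepstrum is closest to the target."""
    return min(candidates,
               key=lambda name: cepstral_distance_db(target, candidates[name]))


# Hypothetical vowel cepstra for the same CV taken from two contexts.
target = [0.50, -0.20, 0.10]
candidates = {
    "ka_word_medial": [0.48, -0.18, 0.12],
    "ka_word_final": [0.20, 0.10, 0.00],
}
chosen = select_unit(target, candidates)
```

A continuous score of this kind can rank candidates that a purely categorical context match would treat as equivalent.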

Acceptability of text‐to‐speech systems: Its quantification using magnitude estimation and its relationship to intelligibility and naturalness in various degrees of distortion (A)

Chaslav Pavlovic, Mario Rossi, and Robert Espesser

J. Acoust. Soc. Am. Volume 84, Issue S1, pp. S25-S25 (1988); (1 page)

Online Publication Date: 13 Aug 2005

Because there are no physical measurements that quantify perceptual attributes of synthesized speech, subjective tests must be used. In this project the possibility of directly scaling acceptability via magnitude estimation is assessed. One hundred and twenty subjects took part in the study. Seven different synthesizers and three types of background noise were employed. The results are discussed in light of the following questions. (1) What is the relationship between acceptability on one hand and naturalness and intelligibility on the other in various degrees of distortion? (2) Are the objective intelligibility scores and subjective magnitude estimations highly correlated in all conditions? (3) Do the relative ratings produced by different groups of subjects agree? (4) Are the ratings on an absolute scale? (5) Are the ratings invariant to the stimulus set and range size? (6) Are the ratings practice invariant? (7) Do the ratings depend on the subject's familiarity with the test material?

Quality assessment of synthetic speech using word intelligibility scores (A)

Toshiro Watanabe

J. Acoust. Soc. Am. Volume 84, Issue S1, pp. S25-S25 (1988); (1 page)

Online Publication Date: 13 Aug 2005

The quality of Japanese speech synthesized by rules cannot be sufficiently evaluated using only the 100‐monosyllable list that has been used for speech articulation tests: Word intelligibility must also be assessed. Word intelligibility tests on both natural and synthetic speech were conducted to clarify the differences between synthetic and natural speech in learning effects, the effects of external noise, and word familiarity. Both trained and untrained listeners were used. Each word list of 155 words, selected by our previously developed method, reflected the most important word attributes of Japanese: word length, word familiarity, and initial phoneme occurrence. The experimental results show that the intelligibility scores for synthetic speech with untrained subjects were fairly low and independent of noise level. These were greatly improved by training, depending on the noise level. However, natural speech intelligibility depends little on training. The results also show that the familiarity of the test words and the first phoneme in the word are important factors in recognizing Japanese words in both synthetic and natural speech.

Morphophonological derivation of Japanese predicate phrases from a semantic base (A)

Shigeru Sato

J. Acoust. Soc. Am. Volume 84, Issue S1, pp. S25-S25 (1988); (1 page)

Online Publication Date: 13 Aug 2005

The definition of lexical base forms may greatly affect the structure of the phonological component of a system of speech synthesis from semantic representations. This paper presents a model of mora‐generating/preserving phonology for Japanese constructed from an observation of the syntactic/morphophonological derivations of predicate phrases. The achievements of the model are the following. (1) The output of the mora‐independent lexico‐syntax of the predicate phrase is taken care of by the three‐tier rule system of word formation, phonology, and phonetics. (2) The reduction of the number of rules and the simplification of their forms are attained by transferring to syntax processes previously regarded as phonological. (3) The mora preservation phenomena are exclusively handled by the seven cyclic phonological rules. (4) Along with the rule editor system, this phonological component is fully implemented in the computer as a subsystem attached to the speech synthesis system from the semantic base [S. Sato and H. Kasuya, European Conference of Speech Technology, Vol. 2, 414–417 (1987)].