Journal of the Acoustical Society of America

Nov 1975

Volume 58, Issue S1, pp. S2-S132

Session ZZ. Speech Communication IX: Speech and Speaker Recognition
Contributed Papers

Interactive experiments with a Digital Pattern Playback (A)

Patrick W. Nye, Franklin S. Cooper, and Paul Mermelstein

J. Acoust. Soc. Am. Volume 58, Issue S1, pp. S105-S105 (1975); (1 page)

Online Publication Date: 11 Aug 2005

Facilities which make spectrograms immediately available for visual comparison, easy modification of spectral data, and resynthesis of speech have proved to be particularly useful tools in speech research. This paper reports an experiment in which such an interactive research tool—a Digital Pattern Playback (DPP)—is being used to evaluate a spectrum‐matching and dictionary‐search technique for speech recognition. The DPP is a computer‐supported analysis‐synthesis facility which, in the present experiment, displays spectrograms of “unknown” sentences so that an analyst can list the important acoustic features of marked segments of the unknown sentence. Interrogation of a feature‐based dictionary then recovers all items with features which match the unknown segment. If necessary, additional features may be assigned to narrow the search. The reference spectrograms retrieved from the dictionary are compared, one at a time, with the spectrogram of the unknown sentence and the best match is selected for each unknown segment. Instrumentation, strategies, and factors governing the efficiency of the algorithms for feature‐selection and spectrogram‐matching will be discussed. [Work supported by NSF and ARPA, Department of Defense.]
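The dictionary interrogation step described above can be illustrated with a small sketch. The feature labels, the item names, and the `lookup` helper are all hypothetical; the abstract does not say how the DPP dictionary is stored or queried, so this only shows the general idea that listing more features narrows the candidate set.

```python
from typing import Dict, List, Set

# Hypothetical feature-based dictionary: item name -> set of acoustic feature labels.
DICTIONARY: Dict[str, Set[str]] = {
    "seg_a": {"voiced", "high_F2", "stop_burst"},
    "seg_b": {"voiced", "high_F2"},
    "seg_c": {"unvoiced", "frication"},
}

def lookup(features: Set[str], dictionary: Dict[str, Set[str]] = DICTIONARY) -> List[str]:
    """Return every item whose feature set contains all of the listed features."""
    return sorted(name for name, feats in dictionary.items() if features <= feats)

# A first query with two features, then a narrower one with a third feature added.
candidates = lookup({"voiced", "high_F2"})
narrowed = lookup({"voiced", "high_F2", "stop_burst"})
```

Each retrieved candidate would then be compared against the unknown spectrogram; that comparison is not modeled here.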

Acoustic‐phonetic experiment facility for the study of continuous speech (A)

Richard M. Schwartz

J. Acoust. Soc. Am. Volume 58, Issue S1, pp. S105-S105 (1975); (1 page)

Online Publication Date: 11 Aug 2005

While gathering acoustic data for the acoustic‐phonetic analysis of speech, it is necessary to consider many different sounds in varying phonetic environments to assure that the results are statistically significant. In order to reduce the amount of time required to test hypotheses, a facility has been developed which provides an interactive environment for performing a wide variety of acoustic‐phonetic experiments on a large data base of continuous speech. Using this facility, one can formulate an experiment, run it on selected portions (or all) of the data base, display or tabulate the results in a meaningful way, and then run another experiment (or a variant thereof) based on the results. Due to the ease of interactions, formulating or revising an experiment, running it, and displaying the results normally takes less than 5 min. This facility has been used in combination with a data base of 69 hand‐labeled sentences to develop algorithms for acoustic‐phonetic segmentation and labeling in a speech understanding system. Several examples of its use and the results obtained will be presented.

Approach to automatic segmentation in the acoustic phonetic transformation (A)

H. Kasuya and H. Wakita

J. Acoust. Soc. Am. Volume 58, Issue S1, pp. S105-S105 (1975); (1 page)

Online Publication Date: 11 Aug 2005

As a first step in the study of automatic acoustic‐phonetic transformation of speech sounds by an arbitrary speaker, the segmentation of connected speech into vowel‐like and consonantal segments has been investigated. Area functions obtained by the linear prediction method have been found to provide useful parameters for this purpose. An advantage of utilizing area functions is that the segmentation scheme can be easily combined with interspeaker normalization [H. Wakita, J. Acoust. Soc. Am. 57, S3(A) (1975)] to handle an arbitrary speaker. Two criteria were used for determining [+ consonantal] features: (1) the minimum of the Euclidean distances [MED] between the input area function and the area functions of all ten American vowels registered as references, and (2) the ratio of the back cavity volume to the total volume of the vocal tract [RBC]. The [+ consonantal] feature is assigned to those segments for which either the MED or the RBC exceeds a corresponding threshold value. Experiments with seven short sentences spoken by two adult males and an adult female resulted in quite satisfactory segmentation. Although [l] segments were sometimes labeled as [− consonantal], the nasal segments surrounded by high vowels were correctly detected. Further attempts to subdivide the [+ consonantal] segments into individual consonant segments will be discussed.
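The two-criterion decision rule can be sketched as follows. This is a minimal illustration under stated assumptions: the area function is sampled as equal-length sections (so volume is proportional to the sum of areas), the back cavity is taken to be the first `back_sections` sections, and both thresholds are hypothetical parameters. None of these details are given in the abstract.

```python
import numpy as np

def is_consonantal(area, vowel_refs, med_threshold, rbc_threshold, back_sections):
    """Assign [+ consonantal] when either the MED or the RBC exceeds its threshold."""
    area = np.asarray(area, dtype=float)
    refs = np.asarray(vowel_refs, dtype=float)
    # MED: minimum Euclidean distance from the input to any reference vowel.
    med = np.min(np.linalg.norm(refs - area, axis=1))
    # RBC: back-cavity volume over total volume (assumes equal-length sections).
    rbc = area[:back_sections].sum() / area.sum()
    return bool(med > med_threshold or rbc > rbc_threshold)
```

In practice the thresholds would be tuned on labeled data; the abstract does not report their values.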

Identification of normalized steady‐state vowels (A)

H. Wakita

J. Acoust. Soc. Am. Volume 58, Issue S1, pp. S105-S105 (1975); (1 page)

Online Publication Date: 11 Aug 2005

This study was taken up as a first step toward automatic acoustic‐phonetic transcription for arbitrary speakers. Sound identification experiments were conducted with nine steady‐state American vowels /i, ɪ, ɛ, æ, ʌ, a, u, , ɜ/. Reference information for each vowel was determined from vowel utterances in the context “hVd” produced by 16 speakers (nine males and seven females). In this case, the area function for each vowel was computed by use of the linear prediction method and was normalized to a reference length of 17 cm. Based on the resonance frequencies of the area functions thus normalized, the probability density function for the first three resonance frequencies of each vowel category was computed under the assumption of a normal distribution. These distributions were then used for the identification of steady‐state vowels produced by an additional ten speakers (five males and five females). The overall rate of correct vowel identification was 83%. A higher recognition rate is expected to result from improving the estimation accuracy of the formant frequencies and bandwidths.
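The identification step, with one normal density per vowel over the first three resonance frequencies, might look like the sketch below. It assumes a full covariance matrix per vowel and a maximum-likelihood decision; the abstract states only that normal distributions were assumed, so the covariance structure and the function names are illustrative choices.

```python
import numpy as np

def fit_gaussians(samples_by_vowel):
    """Estimate a mean vector and covariance matrix for each vowel category."""
    models = {}
    for vowel, samples in samples_by_vowel.items():
        x = np.asarray(samples, dtype=float)
        models[vowel] = (x.mean(axis=0), np.cov(x, rowvar=False))
    return models

def log_density(x, mean, cov):
    """Log of the multivariate normal density at x."""
    d = x - mean
    _, logdet = np.linalg.slogdet(cov)
    return -0.5 * (d @ np.linalg.solve(cov, d) + logdet + len(x) * np.log(2 * np.pi))

def identify(x, models):
    """Pick the vowel whose density is largest at the resonance vector x."""
    x = np.asarray(x, dtype=float)
    return max(models, key=lambda v: log_density(x, *models[v]))
```

Each category needs more training vectors than dimensions (here, more than three) for the covariance estimate to be invertible.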

Some preliminary experiments in the recognition of connected digits (A)

L. R. Rabiner and M. R. Sambur

J. Acoust. Soc. Am. Volume 58, Issue S1, pp. S105-S106 (1975); (2 pages)

Online Publication Date: 11 Aug 2005

A system is described for recognizing connected digits. The system is essentially speaker independent and has been programmed to recognize strings of three consecutive digits. In order to segment the input utterance into the single digits, a voiced‐unvoiced‐silence analysis is made using a pattern recognition algorithm. Based on both the voiced‐unvoiced‐silence contour, and variations in the energy contour of the utterance, a preliminary segmentation of the digit string is made. Boundary adjustments are then made based on a preliminary recognition of the individual digits. Finally a recognition algorithm, similar to the one used in the isolated digit recognition work described by Sambur and Rabiner, is used to classify the individual digits in the utterance. Experiments with the system using ten speakers (five male, five female) in a fairly low noise environment yielded a 91% correct digit recognition score. Similar experiments using ten new speakers (five male, five female) in a noisy computer room yielded an 87% correct digit recognition score.

Linear predictive residual analysis compared to bandpass filtering for automatic speech recognition (A)

G. M. White

J. Acoust. Soc. Am. Volume 58, Issue S1, pp. S106-S106 (1975); (1 page)

Online Publication Date: 11 Aug 2005

It has been recently proposed by Itakura [F. Itakura, “Minimum Predictive Residual Principle Applied to Speech Recognition,” IEEE Symp. Speech Recog. CMU (1974)] that the linear predictive residual can be used as a measure of speech waveform similarity. To measure the similarity between two waveforms, Itakura proposed to construct a linear predictive filter for one waveform and measure the residual (predictive error) for the other waveform. Itakura used this technique to achieve some remarkably good speech recognition scores. We constructed a speech recognition system using both bandpass filtering and linear prediction in order to compare the two techniques. The classifier used dynamic programming. A 36‐word vocabulary was used consisting of the alphabet plus digits spoken five times by the same speaker. A single word list was used for training and the other four were used for testing. Speech input was through a noise‐cancelling microphone. For the digital linear predictive (inverse filtering) analysis, speech was low‐pass filtered at about 5 kHz and digitized at 10 kHz. For the bandpass filtering experiment, 21 filter channels, each 1/3 octave wide, were used covering the audio spectrum from about 100 Hz to 10 kHz. The recognition scores in both cases were 98% correct, showing that the linear predictive residual technique is essentially equivalent to bandpass filtering as a means of measuring speech waveform similarity.
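The idea of fitting a predictor to one waveform and measuring the residual on another can be sketched as below. This is a simplified illustration using the autocorrelation method and a plain mean-squared residual; it is not Itakura's exact distance measure, and the function names and predictor order are assumptions.

```python
import numpy as np

def lpc_coefficients(signal, order):
    """Autocorrelation-method predictor: x[n] is approximated by sum_k a[k] * x[n-1-k]."""
    x = np.asarray(signal, dtype=float)
    # Autocorrelation lags 0..order.
    r = np.array([x[: len(x) - k] @ x[k:] for k in range(order + 1)])
    # Toeplitz normal equations R a = r[1:].
    R = np.array([[r[abs(i - j)] for j in range(order)] for i in range(order)])
    return np.linalg.solve(R, r[1 : order + 1])

def residual_energy(signal, coeffs):
    """Mean squared prediction error of `signal` under the given predictor."""
    x = np.asarray(signal, dtype=float)
    p = len(coeffs)
    # x[n-p:n][::-1] is [x[n-1], ..., x[n-p]], matching the coefficient order.
    pred = np.array([coeffs @ x[n - p : n][::-1] for n in range(p, len(x))])
    return float(np.mean((x[p:] - pred) ** 2))
```

A predictor fitted to waveform A should leave a small residual on A and a larger one on a spectrally different waveform B, which is what makes the residual usable as a dissimilarity score.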

Partial word boundary detection from stress contours (A)

D. C. Sargent

J. Acoust. Soc. Am. Volume 58, Issue S1, pp. S106-S106 (1975); (1 page)

Online Publication Date: 11 Aug 2005

A machine algorithm was developed for partial word boundary detection in continuous speech. Word boundary detection was achieved by comparing a computer‐extracted stress contour for the 700‐syllable test passage with the contour that would have been predicted from rigid adherence to the Alternating Stress Rule. Since this rule functioned only at the word level and below, it was more likely to be violated when crossing word boundaries than within a word. The position of any Alternating Stress Rule violation in the extracted stress contour was therefore marked as a probable word boundary location. Utilizing this concept, 44% of the word boundaries in the test passages were correctly positioned with a false alarm rate of less than 10%. Most of the false alarms were caused by the presence of adjacent reduced syllables within the same word. Research is presently being conducted to incorporate additional regularities in the stress patterns of English to further improve the algorithm's performance.
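A heavily simplified sketch of the violation test follows. It assumes a binary stress contour (1 = stressed, 0 = reduced) and treats rigid alternation as "adjacent syllables never share the same stress level"; the abstract does not spell out how the rule was encoded, so both are assumptions.

```python
from typing import List

def probable_boundaries(stress: List[int]) -> List[int]:
    """Return each index i where a boundary is hypothesized between syllables i-1 and i.

    A violation of strict alternation is any pair of adjacent syllables
    with the same stress value.
    """
    return [i for i in range(1, len(stress)) if stress[i] == stress[i - 1]]

# Two violations: stressed-stressed at index 3, reduced-reduced at index 5.
marks = probable_boundaries([1, 0, 1, 1, 0, 0, 1])
```

Under this encoding, two adjacent reduced syllables inside one word also trigger a mark, which is consistent with the main source of false alarms reported above.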

Algorithm to detect the beginning and end points of a speech utterance (A)

K. Ganesan and W. C. Lin

J. Acoust. Soc. Am. Volume 58, Issue S1, pp. S106-S106 (1975); (1 page)

Online Publication Date: 11 Aug 2005

There is a great need to detect the beginning and end points of a speech utterance in applications like speech recognition and speaker identification. In this paper, we present a method for beginning and end‐point detection which makes use of the maximum likelihood principle. The features that are used by the algorithm are (1) total per‐unit energy, (2) zero‐crossing rate, and (3) absolute amplitude of the speech samples. Conditional probability densities are estimated for these three features using a database of 60 phonetically balanced words and ten phonetically balanced sentences spoken by four male speakers with General American accents. A set of optimum thresholds is obtained for each feature such that the probability of classification error is minimized. The algorithm was tested for both isolated words and sentences over a population of six speakers and an error rate of nearly 0% was observed.
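A frame-based sketch using the three named features is given below. The thresholds are passed in as hypothetical parameters rather than derived from estimated densities, and the two-of-three vote used to combine the features is an assumption made for this illustration, not the maximum likelihood rule of the abstract.

```python
import numpy as np

def frame_features(x, frame_len):
    """Per-frame energy, zero-crossing rate, and peak absolute amplitude."""
    x = np.asarray(x, dtype=float)
    n = len(x) // frame_len
    frames = x[: n * frame_len].reshape(n, frame_len)
    energy = np.mean(frames ** 2, axis=1)
    zcr = np.mean(np.signbit(frames[:, 1:]) != np.signbit(frames[:, :-1]), axis=1)
    peak = np.max(np.abs(frames), axis=1)
    return energy, zcr, peak

def endpoints(x, frame_len, energy_thr, zcr_thr, peak_thr):
    """Return (start_sample, end_sample) of the detected utterance, or None."""
    energy, zcr, peak = frame_features(x, frame_len)
    # Assumed combination rule: a frame is speech if at least two features agree.
    votes = ((energy > energy_thr).astype(int)
             + (zcr > zcr_thr).astype(int)
             + (peak > peak_thr).astype(int))
    idx = np.flatnonzero(votes >= 2)
    if idx.size == 0:
        return None
    return int(idx[0] * frame_len), int((idx[-1] + 1) * frame_len)
```

The returned positions are quantized to frame boundaries, so a shorter frame length gives finer endpoint resolution at the cost of noisier features.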

On‐line, adaptive speaker‐independent word recognition system based on phonetic recognition techniques (A)

W. C. Lin and K. Ganesan

J. Acoust. Soc. Am. Volume 58, Issue S1, pp. S106-S106 (1975); (1 page)

Online Publication Date: 11 Aug 2005

The research reported in this paper deals with a new method of phonemic analysis of speech by statistical pattern recognition techniques and its application to the problem of Automatic Speech Recognition (ASR). An on‐line, adaptive, trainable speaker‐independent system is implemented using this approach. The details of the system follow: first, the beginning and end points of the speech utterance are detected. The utterance is then sent for automatic segmentation where it is segmented into the following classes: (1) voiced, (2) unvoiced, (3) transition, and (4) silence. An 11‐dimensional feature vector consisting of 10 linear predictor coefficients and zero‐crossing rate is extracted from these regions. For voiced and transition region, the feature extraction is done pitch synchronously and for unvoiced regions, a constant frame of 6.4 msec is used. A new phonetic unit called phoneme‐pair is defined for the transition regions, while the unvoiced and voiced regions are represented using the phonemes of the IPA. Conditional probability densities for each of the phonemes and phoneme‐pairs are estimated using non‐parametric methods as a single polynomial in the 11‐dimensional space. The classifier makes Bayes' minimum risk decision based on these probability densities. The recognition results of the ASR system are Training Set: 98.4%, Test Set: 96.0% (for speakers in the training set) and 91.0% (for speakers not in the training set). The present vocabulary of the system is 60 words and any new word can be added by entering its corresponding phonetic transcription. The adaptive and trainable characteristics of the system will also be demonstrated.

On the similarity of noisy phonetic strings produced by different words (A)

James K. Baker

J. Acoust. Soc. Am. Volume 58, Issue S1, pp. S106-S106 (1975); (1 page)

Online Publication Date: 11 Aug 2005

In a speech recognition system with an acoustic processor which attempts to automatically estimate a phonetic transcription, it is necessary to know the similarity of the probability distributions of phonetic strings when different words are spoken and input to the acoustic processor. Let a ∈ A, a = a_1 a_2 a_3 … a_n, represent an arbitrary phonetic string. Define the similarity between the words W1 and W2 by S [defining equation not reproduced in this listing]. The number of terms in the sum defining S grows exponentially with the length of the words W1 and W2. However, if the nodes of the phonological graphs for W1 and W2 are properly ordered, S can be calculated inductively by a generalization of the computations used in modeling a probabilistic function of a Markov process. The number of computations is approximately the product of the number of arcs in W1 times the number of arcs in W2.
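The inductive calculation can be sketched under two loudly stated assumptions, since the defining equation for S is not available here. First, S is taken to be the sum over all strings a of Pr(a | W1) · Pr(a | W2). Second, each word is modeled as a graph whose nodes are numbered in topological order (node 0 is the start, the last node is the end), with each arc given as a hypothetical tuple `(src, dst, prob, emission)` that emits exactly one phone. Neither the form of S nor the graph model is confirmed by the abstract.

```python
import numpy as np

def similarity(arcs1, n1, arcs2, n2):
    """Compute S by dynamic programming over pairs of nodes.

    F[u, v] accumulates the joint weight of all equal-length path pairs
    reaching node u in W1 and node v in W2 with identical phone strings.
    The loop body runs (arcs in W1) x (arcs in W2) times.
    """
    F = np.zeros((n1, n2))
    F[0, 0] = 1.0
    # Sorting W1 arcs by source node guarantees F[s1, s2] is final when read.
    for s1, d1, p1, e1 in sorted(arcs1, key=lambda arc: arc[0]):
        for s2, d2, p2, e2 in arcs2:
            # np.dot(e1, e2) sums over the single phone emitted on both arcs.
            F[d1, d2] += F[s1, s2] * p1 * p2 * float(np.dot(e1, e2))
    return float(F[n1 - 1, n2 - 1])

def string_prob(arcs, n, phones):
    """Pr(phones | word) by a forward pass; used only to check `similarity`."""
    alpha = {0: 1.0}
    for ph in phones:
        nxt = {}
        for s, d, p, e in arcs:
            if s in alpha:
                nxt[d] = nxt.get(d, 0.0) + alpha[s] * p * e[ph]
        alpha = nxt
    return alpha.get(n - 1, 0.0)
```

Summing `string_prob(W1) * string_prob(W2)` over every possible string gives the same value but enumerates exponentially many strings, which is the cost the product-graph recursion avoids.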

Speaker recognition using orthogonal linear prediction (A)

M. R. Sambur

J. Acoust. Soc. Am. Volume 58, Issue S1, pp. S107-S107 (1975); (1 page)

Online Publication Date: 11 Aug 2005

The effectiveness of a set of speaker recognition features is usually characterized in terms of the ratio of the interspeaker variability of the feature to its intraspeaker variability (F‐ratio). A recent experiment in speech synthesis [M.R. Sambur, “An Efficient LPC Vocoder,” Bell Syst. Tech. J. (to be published)] has shown that by an appropriate eigenvector analysis, a set of orthogonal parameters can be obtained that is essentially constant across an utterance for a given speaker (i.e., zero intraspeaker variability). If the same eigenvector analysis is applied to the same utterance spoken by another speaker, the resulting values of the orthogonal parameters are, however, different. These orthogonal parameters were therefore examined for their ability to differentiate different speakers. They were formally tested in a speaker recognition experiment involving 21 speakers. The speech data consisted of six repetitions of the same sentence spoken by each speaker on six separate occasions. The identification and verification accuracy of the orthogonal parameters exceeded 99%.
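For a single scalar feature, the F‐ratio can be sketched as the variance of the per-speaker means divided by the average within-speaker variance. This is one common form of the ratio, assumed here for illustration; the abstract does not give the exact estimator used.

```python
import numpy as np

def f_ratio(values_by_speaker):
    """Interspeaker variance of the means over mean intraspeaker variance."""
    groups = [np.asarray(v, dtype=float) for v in values_by_speaker.values()]
    means = np.array([g.mean() for g in groups])
    between = means.var()                        # spread of speaker means
    within = np.mean([g.var() for g in groups])  # average spread per speaker
    return float(between / within)

# Hypothetical measurements of one feature for two speakers.
ratio = f_ratio({"A": [1.0, 1.2, 0.8], "B": [3.0, 3.2, 2.8]})
```

A feature with exactly zero intraspeaker variability would make the denominator zero, so a practical implementation would need to guard against that case.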

Selection of features and speech segments for speaker verification (A)

W. C. Lin and S. K. Pillay

J. Acoust. Soc. Am. Volume 58, Issue S1, pp. S107-S107 (1975); (1 page)

Online Publication Date: 11 Aug 2005

A speaker verification system based on the results of two thorough and systematic feature ordering techniques is described. First, the Information Theoretic approach is used to reduce the redundancies among the features that are originally present in the feature pool. This selection procedure measures the reduction in uncertainty about the identity of the speaker when more features are added to the set of already picked features. Once the redundant features are removed, the Between‐to‐Within multifeature variance ratio feature ordering algorithm is applied. It is a linear transformation technique which transforms the pattern vectors from measurement space into a new vector space. In the new space, the optimum combination of interspeaker separability and intraspeaker variability is achieved with a small number of features. Based on the results of the feature ordering algorithms, the effectiveness of each segment is determined and a minimal set of features is chosen. Preliminary results show that just ten pitch periods in the sounds /el/ or /mi/ alone are sufficient to identify three speakers all the time.

Perceptual (aural) and spectrographic investigation of speaker homogeneity (A)

H. B. Rothman

J. Acoust. Soc. Am. Volume 58, Issue S1, pp. S107-S107 (1975); (1 page)

Online Publication Date: 11 Aug 2005

In order to investigate the perceptual and spectrographic homogeneity of speakers, 28 (14 pairs) talkers recorded an extended prose passage on two occasions, one week apart. Twelve talkers were chosen as six pairs on the basis of their having been confused with each other due to the similarity of their voices. A tape was prepared for the presentation of two randomized 2‐sec speech segments for each pair of talkers. Listeners made aural/perceptual judgments of same or different for the following conditions: (1) same/contemporary (i.e., one talker recorded at the same time); (2) same/noncontemporary (i.e., one talker recorded one week apart); (3) different (i.e., paired talkers). Preliminary analysis of the data indicates the following: (1) 96% correct identifications were obtained for the same talker paired with a contemporary speech segment; (2) 44% correct identifications were obtained when comparing the same talker with a noncontemporary speech sample; (3) 87% correct identifications were obtained when comparing different talkers of a pair; (4) confusions between contemporary and noncontemporary samples of talker pairs occurred at a 38% level. High identification scores were expected for categories 1 and 3. It is evident that correct identification of a talker drops sharply when the comparison speech sample is noncontemporary. Further research duplicated the above procedure utilizing the same speech samples filtered to match a telephone passband; results were similar to the above. Spectrographic matching will be done for both procedures and results from each will be correlated.

Evaluation of selected acoustic parameters for use in speaker identification (A)

E. T. Doherty

J. Acoust. Soc. Am. Volume 58, Issue S1, pp. S107-S107 (1975); (1 page)

Online Publication Date: 11 Aug 2005

The effectiveness of certain acoustic and temporal properties of the speech signal, namely long‐term power spectra (LTS), speaking fundamental frequency (SFF), and speaking time (ST), in determining a speaker's identity from his voice alone was tested for each property alone and in various combinations. Further, the effects of distortions (limited passband, stress, or disguise) were evaluated. Various analytical procedures (Euclidean distance, cross‐correlation, and discriminant analysis) were used. Two groups were selected: 50 college‐age males who read “normally,” and 25 males, aged 25–45, who read normally, while subjected to stress, and while attempting voice disguise. Acoustic/temporal analyses were performed on the speakers' utterances to extract the LTS, SFF, and ST vectors. Filtering was simulated for LTS. Results indicated that (1) the LTS vector is extremely effective for identifying speech produced normally, (2) SFF and ST were far less effective, (3) combining vectors usually improved correct identification levels, (4) under stress or attempted disguise, no single vector or combination adequately differentiated talkers, and (5) discriminant analysis is a better method of determining identity than is cross‐correlation or Euclidean distance.

Spectrographic and aural examination of professionally mimicked voices (A)

M. Hall and O. Tosi

J. Acoust. Soc. Am. Volume 58, Issue S1, pp. S107-S107 (1975); (1 page)

Online Publication Date: 11 Aug 2005

Five members of the IAVI examined, spectrographically and aurally, pairs of samples corresponding to (1) mimicked voice‐real voice, (2) mimicked voice‐mimicked voice, and (3) real voice‐real voice, to decide whether the two components of each pair belonged to the same or different persons. There were two types of recordings: one obtained in a quiet environment, the other in an ambient noise environment. Results suggest that there is no significant spectrographic intraspeaker variability within the mimicked voices produced by the particular professional mimic employed in this experiment. However, the examiners found significant interspeaker variation between the real voice of each mimicked person and the mimic's imitation of it, variation that allowed them to make correct discriminations. In all cases, average fundamental frequencies differed. The recordings including noise yielded less significant results than the ones obtained in quiet. In addition, these pairs of voices were presented aurally only, in a free field, to untrained listeners, who were asked to decide whether the two voices of each pair were the same or different. They answered correctly in approximately 75% of the tests.

Speaker sex identification utilizing a constant laryngeal source (A)

W. S. Brown, Jr. and S. H. Feinstein

J. Acoust. Soc. Am. Volume 58, Issue S1, pp. S107-S108 (1975); (2 pages)

Online Publication Date: 11 Aug 2005

It has been demonstrated that the sex of speakers can be reliably and accurately identified in the absence of idiosyncratic glottal wave forms. This finding led to the hypothesis that there are other sex‐related differences in the supraglottal vocal tract which produce discriminable acoustic differences in speech. As a test of this hypothesis, ten males and ten females counted to ten and read the second sentence of the “Rainbow Passage” using an electronic artificial larynx (F0=120 Hz) with a closed glottis. The vocalizations were tape recorded, randomized, and played to 30 listeners who then determined the speakers' sex. Later, the vocalizations were subjected to a sound spectrum analysis. Results indicated that male voices with energy concentrations shifted toward the lower frequencies were identified above chance, as were females with energy concentrations shifted toward the higher frequencies. Speakers whose energy concentrations were centralized were most often confused. These results confirm the hypothesis that sex‐related supraglottal vocal tract characteristics play a major role in the identification of speaker sex.