Open Access. Published by De Gruyter Open Access, November 2, 2016. License: CC BY-NC-ND 3.0.

Automatic speech segmentation using throat-acoustic correlation coefficients

Rustam Rafikovich Mussabayev, Maksat N. Kalimoldayev, Yedilkhan N. Amirgaliyev and Timur R. Mussabayev
From the journal Open Engineering

Abstract

This work considers one approach to the automatic segmentation of discrete speech signals. The aim is to construct an algorithm that meets the following requirements: segmentation of a signal into acoustically homogeneous segments; high accuracy and segmentation speed; unambiguous and reproducible segmentation results; and no need for preliminary training on a special set of manually segmented signals. The development of such an algorithm was motivated by the need to build large automatically segmented speech databases. This article presents one new approach to the problem. It uses a new type of informative feature, TAC-coefficients (Throat-Acoustic Correlation coefficients), which provide sufficient segmentation accuracy and efficiency.

Received: 2016-5-5
Accepted: 2016-7-20
Published Online: 2016-11-2

©2016 R.R. Mussabayev et al.

This work is licensed under the Creative Commons Attribution-NonCommercial-NoDerivatives 3.0 License.
