CC BY 4.0 license · Open Access · Published by De Gruyter, April 20, 2019

Bidirectional deep architecture for Arabic speech recognition

Naima Zerari, Samir Abdelhamid, Hassen Bouzgou and Christian Raymond
From the journal Open Computer Science

Abstract

Nowadays, real-life constraints necessitate controlling modern machines through human intervention by means of the sensory organs. The voice is one of the human senses that can control/monitor modern interfaces. In this context, Automatic Speech Recognition (ASR) is principally used to convert natural speech into computer text, as well as to perform actions based on the instructions given by the human. In this paper, we propose a general framework for Arabic speech recognition that uses Long Short-Term Memory (LSTM) and a neural network (Multi-Layer Perceptron, MLP) classifier to cope with the non-uniform sequence lengths of the speech utterances issued from both feature extraction techniques: (1) Mel-Frequency Cepstral Coefficients (MFCC, static and dynamic features) and (2) Filter Bank (FB) coefficients. The neural architecture recognizes isolated Arabic speech via a classification technique. The proposed system first extracts pertinent features from the natural speech signal using MFCC (static and dynamic features) and FB. Next, the extracted features are padded in order to deal with the non-uniformity of the sequence lengths. Then, a deep recurrent architecture, either an LSTM or a GRU (Gated Recurrent Unit), encodes the sequence of MFCC/FB features as a fixed-size vector that is fed to a Multi-Layer Perceptron (MLP) to perform the classification (recognition). The proposed system is assessed on two different databases: the first concerns spoken digit recognition, where a comparison with other related works in the literature is performed, whereas the second contains spoken TV commands. The obtained results show the superiority of the proposed approach.
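The pipeline described above (padding variable-length feature sequences, encoding them with a recurrent network, and classifying the fixed-size code with an MLP) can be sketched in plain NumPy. This is a minimal illustration only: the layer sizes, random weights, helper names (`pad_sequences`, `GRUEncoder`), and the 10-class digit setup are assumptions for the example, not the authors' implementation, and the GRU omits bias terms for brevity.

```python
import numpy as np

rng = np.random.default_rng(0)

def pad_sequences(seqs, value=0.0):
    """Zero-pad a list of (T_i, D) feature matrices to the longest length."""
    max_len = max(s.shape[0] for s in seqs)
    d = seqs[0].shape[1]
    out = np.full((len(seqs), max_len, d), value)
    for i, s in enumerate(seqs):
        out[i, : s.shape[0]] = s
    return out

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class GRUEncoder:
    """Minimal GRU: encodes a (T, D) sequence into its final hidden state."""
    def __init__(self, d_in, d_hid, scale=0.1):
        self.d_hid = d_hid
        # update gate, reset gate, and candidate-state weights (biases omitted)
        self.Wz = scale * rng.standard_normal((d_in, d_hid))
        self.Uz = scale * rng.standard_normal((d_hid, d_hid))
        self.Wr = scale * rng.standard_normal((d_in, d_hid))
        self.Ur = scale * rng.standard_normal((d_hid, d_hid))
        self.Wh = scale * rng.standard_normal((d_in, d_hid))
        self.Uh = scale * rng.standard_normal((d_hid, d_hid))

    def encode(self, x):
        h = np.zeros(self.d_hid)
        for t in range(x.shape[0]):
            z = sigmoid(x[t] @ self.Wz + h @ self.Uz)          # update gate
            r = sigmoid(x[t] @ self.Wr + h @ self.Ur)          # reset gate
            h_cand = np.tanh(x[t] @ self.Wh + (r * h) @ self.Uh)
            h = (1 - z) * h + z * h_cand
        return h                                               # fixed-size code

# hypothetical setup: 3 utterances of different lengths, 13 MFCCs per frame
utterances = [rng.standard_normal((t, 13)) for t in (5, 8, 3)]
batch = pad_sequences(utterances)                              # (3, 8, 13)

encoder = GRUEncoder(d_in=13, d_hid=16)
codes = np.stack([encoder.encode(seq) for seq in batch])       # (3, 16)

# MLP head: one ReLU hidden layer, softmax over 10 digit classes
W1 = 0.1 * rng.standard_normal((16, 32))
W2 = 0.1 * rng.standard_normal((32, 10))
logits = np.maximum(codes @ W1, 0) @ W2
logits -= logits.max(axis=1, keepdims=True)                    # stable softmax
probs = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
print(batch.shape, codes.shape, probs.shape)
```

In a trained system the weights would of course be learned by backpropagation rather than drawn at random; the sketch only shows how padding reconciles the non-uniform sequence lengths and how the recurrent encoder's final state serves as a fixed-size input to the MLP classifier.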


Received: 2018-07-20
Accepted: 2019-03-04
Published Online: 2019-04-20

© 2019 Naima Zerari et al., published by De Gruyter Open

This work is licensed under the Creative Commons Attribution 4.0 International License.
