
Paladyn, Journal of Behavioral Robotics

Editor-in-Chief: Schöner, Gregor

1 Issue per year


CiteScore 2017: 0.33

SCImago Journal Rank (SJR) 2017: 0.104

Open Access | Online ISSN: 2081-4836

Soft missing-feature mask generation for Robot Audition

Toru Takahashi
  • Department of Intelligence Science and Technology, Graduate School of Informatics, Kyoto University, Kyoto 606-8501, Japan
/ Kazuhiro Nakadai
  • Honda Research Institute Japan Co., Ltd., 8-1 Honcho, Wako, Saitama 351-0114, Japan
  • Mechanical and Environmental Informatics, Graduate School of Information Science and Engineering, Tokyo Institute of Technology, Tokyo, 152-8552, Japan
/ Kazunori Komatani
  • Department of Intelligence Science and Technology, Graduate School of Informatics, Kyoto University, Kyoto 606-8501, Japan
/ Tetsuya Ogata
  • Department of Intelligence Science and Technology, Graduate School of Informatics, Kyoto University, Kyoto 606-8501, Japan
/ Hiroshi G. Okuno
  • Department of Intelligence Science and Technology, Graduate School of Informatics, Kyoto University, Kyoto 606-8501, Japan
Published Online: 2010-03-31 | DOI: https://doi.org/10.2478/s13230-010-0005-1

Abstract

This paper describes an improvement in automatic speech recognition (ASR) for robot audition, achieved by introducing Missing Feature Theory (MFT) based on soft missing feature masks (MFMs) to realize natural human-robot interaction. In an everyday environment, a robot's microphones capture various sounds besides the user's utterances. Although sound-source separation is an effective way to enhance the user's utterances, it inevitably produces errors due to reflection and reverberation. MFT copes with these errors: first, MFMs are generated based on the reliability of each time-frequency component; then the ASR decoder weights the time-frequency components according to the MFMs. We propose a new method that automatically generates soft MFMs, whose values vary continuously from 0 to 1 according to a sigmoid function. The proposed MFM generation was implemented for the humanoid HRP-2 using HARK, our open-source robot audition software. Preliminary results show that soft MFMs outperformed hard (binary) MFMs in recognizing three simultaneous utterances. In a human-robot interaction task, the minimum interval between two adjacent loudspeakers was reduced from 60 degrees to 30 degrees by using soft MFMs.
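To make the masking scheme concrete, the following minimal Python sketch contrasts a sigmoid-based soft MFM with a hard (binary) MFM and shows how an MFT-style decoder can weight per-dimension acoustic log-likelihoods by the mask. It is an illustration under stated assumptions, not the HARK implementation: the energy-ratio interpretation of the reliability scores, the sigmoid slope alpha, the threshold beta, and all function names are hypothetical.

    import numpy as np

    def soft_mfm(reliability, alpha=10.0, beta=0.5):
        """Map per-bin reliability scores onto a soft mask in [0, 1]
        via a sigmoid, instead of a hard 0/1 threshold.

        reliability : array of shape (frames, freq_bins); assumed here
                      to estimate how much of each time-frequency
                      component survives sound-source separation.
        alpha, beta : sigmoid slope and threshold (illustrative values).
        """
        return 1.0 / (1.0 + np.exp(-alpha * (reliability - beta)))

    def hard_mfm(reliability, beta=0.5):
        """Conventional binary mask: each component is either fully
        reliable (1) or fully discarded (0)."""
        return (reliability > beta).astype(float)

    def mft_frame_log_likelihood(log_like_per_dim, mask_frame):
        """MFT-style acoustic score for one frame: each spectral
        feature's log-likelihood is scaled by its mask value, so
        unreliable components contribute proportionally less."""
        return float(np.dot(mask_frame, log_like_per_dim))

    # Toy usage: 2 frames x 4 frequency bins of reliability estimates.
    r = np.array([[0.9, 0.6, 0.4, 0.1],
                  [0.8, 0.5, 0.3, 0.2]])
    print(soft_mfm(r))   # graded weights between 0 and 1
    print(hard_mfm(r))   # all-or-nothing decision at the same threshold

The graded weights are what let a soft mask retain partial evidence from components that a binary threshold would discard outright, which is the effect the abstract credits for the improved recognition of simultaneous utterances.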

Keywords: robot audition; HARK; missing feature theory; soft mask generation; simultaneous speech recognition; automatic speech recognition; sound source separation; sound localization


About the article

Received: 2010-02-21

Accepted: 2010-03-19

Published Online: 2010-03-31

Published in Print: 2010-03-01


Citation Information: Paladyn, Journal of Behavioral Robotics, Volume 1, Issue 1, Pages 37–47, ISSN (Online) 2081-4836, DOI: https://doi.org/10.2478/s13230-010-0005-1.


© Toru Takahashi et al. This work is licensed under the Creative Commons Attribution-NonCommercial-NoDerivatives 3.0 License (CC BY-NC-ND 3.0).
