Published Open Access by De Gruyter under the CC BY-NC-ND 3.0 license, March 31, 2010

Voice-awareness control for a humanoid robot consistent with its body posture and movements

Takuma Otsuka, Kazuhiro Nakadai, Toru Takahashi, Kazunori Komatani, Tetsuya Ogata and Hiroshi G. Okuno


This paper presents voice-awareness control consistent with a robot’s head movements. For natural spoken communication between robots and humans, robots must behave and speak the way humans expect them to. Consistency between the robot’s voice quality and its body motion is one of the most striking factors in the naturalness of robot speech. Our control is based on a new model of spectral envelope modification for vertical head motion and left-right balance modulation for horizontal head motion. We assume that a pitch-axis rotation (vertical head motion) and a yaw-axis rotation (horizontal head motion) affect the voice quality independently. The spectral envelope modification model is constructed from an analysis of human vocalizations. The left-right balance model is established by measuring impulse responses with a pair of microphones. Experimental results show that the voice-awareness is perceivable in a robot-to-robot dialogue when the robots stand up to 150 cm apart. The dynamic change in voice quality is also confirmed in the experiment.
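The paper’s left-right balance model is derived from impulse responses measured with a microphone pair; the details are not reproduced in this abstract. As an illustrative sketch only, the idea of modulating left-right balance as a function of yaw angle can be approximated with constant-power panning. The function name, the 60° yaw range, and the panning law below are assumptions for illustration, not the authors’ measured model:

```python
import math

def lr_balance(signal, yaw_deg, max_yaw_deg=60.0):
    """Split a mono signal into left/right channels whose gains depend on
    the horizontal head rotation (yaw).  Constant-power panning keeps the
    overall loudness roughly constant as the head turns."""
    # Clamp yaw to the assumed range and map it to a pan position in [0, 1],
    # where 0.5 means the head faces the listener straight on.
    yaw = max(-max_yaw_deg, min(max_yaw_deg, yaw_deg))
    pan = (yaw / max_yaw_deg + 1.0) / 2.0
    # Constant-power law: gains trace a quarter circle, so
    # gain_l**2 + gain_r**2 == 1 for every pan position.
    theta = pan * math.pi / 2.0
    gain_l, gain_r = math.cos(theta), math.sin(theta)
    left = [gain_l * s for s in signal]
    right = [gain_r * s for s in signal]
    return left, right
```

With yaw 0° both channels receive equal gain (about 0.707); turning fully toward one side silences the opposite channel while keeping total power constant.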



Received: 2010-02-21
Accepted: 2010-03-19
Published Online: 2010-03-31
Published in Print: 2010-03-01

© Takuma Otsuka et al.

This work is licensed under the Creative Commons Attribution-NonCommercial-NoDerivatives 3.0 License.
