Linguistics Vanguard

A Multimodal Journal for the Language Sciences

Editor-in-Chief: Bergs, Alexander / Cohn, Abigail C. / Good, Jeff

1 Issue per year

Online ISSN: 2199-174X
Toward completely automated vowel extraction: Introducing DARLA

Sravana Reddy / James N. Stanford
  • Linguistics & Cognitive Science, Dartmouth College, 6220 Reed Hall Dartmouth College, Hanover, New Hampshire 03755, USA
Published Online: 2015-07-14 | DOI: https://doi.org/10.1515/lingvan-2015-0002

Abstract

Automatic Speech Recognition (ASR) is reaching further and further into everyday life with Apple’s Siri, Google voice search, automated telephone information systems, dictation devices, closed captioning, and other applications. Along with such advances in speech technology, sociolinguists have been considering new methods for alignment and vowel formant extraction, including techniques like the Penn Aligner (Yuan and Liberman 2008) and the FAVE automated vowel extraction program (Evanini et al. 2009; Rosenfelder et al. 2011). With humans transcribing audio recordings into sentences, these semi-automated methods can produce effective vowel formant measurements (Labov et al. 2013). But as the quality of ASR improves, sociolinguistics may be on the brink of another transformative technology: large-scale, completely automated vowel extraction without any need for human transcription. It would then be possible to quickly extract vowels from virtually limitless hours of recordings, such as YouTube, publicly available audio/video archives, and large-scale personal interviews or streaming video. How far away is this transformative moment? In this article, we introduce a fully automated program called DARLA (short for “Dartmouth Linguistic Automation,” http://darla.dartmouth.edu), which automatically generates transcriptions with ASR and extracts vowels using FAVE. Users simply upload an audio recording of speech, and DARLA produces vowel plots, a table of vowel formants, and probabilities of the phonetic environments for each token. In this paper, we describe DARLA and explore its sociolinguistic applications. We test the system on a dataset of the US Southern Shift and compare the results with semi-automated methods.

Keywords: automatic speech recognition; sociophonetics; vowels

References

  • Baranowski, M. 2013. Sociophonetics. In R. Bayley, R. Cameron & C. Lucas (eds.), The Oxford Handbook of Sociolinguistics, 403–424. Oxford: Oxford University Press.

  • Boersma, P. & D. Weenink. 2015. Praat: Doing phonetics by computer [computer program]. http://www.praat.org.

  • Cambridge University. 1989–2015. HTK Hidden Markov Model Toolkit. http://htk.eng.cam.ac.uk.

  • Carnegie Mellon University. 1993–2015. CMU Pronouncing Dictionary. http://www.speech.cs.cmu.edu/cgi-bin/cmudict.

  • Carnegie Mellon University. 2000–2015. CMU Sphinx Speech Recognition Toolkit. http://cmusphinx.sourceforge.net.

  • Davis, S. B. & P. Mermelstein. 1980. Comparison of parametric representations for monosyllabic word recognition in continuously spoken sentences. IEEE Transactions on Acoustics, Speech and Signal Processing 28. 357–366.

  • Deng, L., X. Cui, R. Pruvenok, J. Huang, S. Momen, Y. Chen & A. Alwan. 2006. A database of vocal tract resonance trajectories for research in speech processing. In Proceedings of the International Conference on Acoustics, Speech, and Signal Processing (ICASSP). http://ieeexplore.ieee.org/xpl/articleDetails.jsp?arnumber=1660034.

  • Di Paolo, M., M. Yaeger-Dror & A. B. Wassink. 2011. Analyzing vowels. In M. Di Paolo & M. Yaeger-Dror (eds.), Sociophonetics: A student’s guide. London: Routledge.

  • Evanini, K. 2009. The permeability of dialect boundaries: A case study of the region surrounding Erie, Pennsylvania. Ph.D. thesis, University of Pennsylvania.

  • Evanini, K., S. Isard & M. Liberman. 2009. Automatic formant extraction for sociolinguistic analysis of large corpora. In Proceedings of Interspeech. http://www.isca-speech.org/archive/interspeech_2009/i09_1655.html.

  • Garofolo, J., L. Lamel, W. Fisher, J. Fiscus, D. Pallett, N. Dahlgren & V. Zue. 1993. TIMIT Acoustic-Phonetic Continuous Speech Corpus LDC93S1. Philadelphia: Linguistic Data Consortium.

  • Godfrey, J. & E. Holliman. 1993. Switchboard-1 Release 2 LDC97S62. Philadelphia: Linguistic Data Consortium.

  • Goldman, J.-P. 2011. Easyalign: an automatic phonetic alignment tool under Praat. In Proceedings of Interspeech. http://www.isca-speech.org/archive/interspeech_2011/i11_3233.html.

  • Gorman, K., J. Howell & M. Wagner. 2011. Prosodylab-Aligner: a tool for forced alignment of laboratory speech. Canadian Acoustics 39. 192–193.

  • Greenberg, S., J. Hollenback & D. Ellis. 1996. Insights into spoken language gleaned from phonetic transcriptions of the Switchboard corpus. In Proceedings of the International Conference on Spoken Language Processing (ICSLP). http://www.silicon-speech.com/Media/PDF/1996_Greenberg%26CO_InsightsSpokenLanguage.pdf.

  • Hasegawa-Johnson, M., J. Baker, S. Borys, K. Chen, E. Coogan, S. Greenberg, A. Juneja, K. Kirchhoff, K. Livescu, S. Mohan, J. Muller, K. Sonmez & T. Wang. 2005. Landmark-based speech recognition: Report of the 2004 Johns Hopkins Summer Workshop. In Proceedings of the International Conference on Acoustics, Speech, and Signal Processing (ICASSP). http://ieeexplore.ieee.org/xpl/articleDetails.jsp?arnumber=1415088.

  • Hesselwood, B., L. Plug & A. Tickle. 2010. Assessing rhoticity using auditory, acoustic and psychoacoustic methods. In Proceedings of Methods XIII: Papers from the 13th International Conference on Methods in Dialectology.

  • Hillenbrand, J., L. Getty, M. Clark & K. Wheeler. 1995. Acoustic characteristics of American English vowels. Journal of the Acoustical Society of America 97. 3099–3111.

  • Hinton, G., L. Deng, D. Yu, A. Mohamed, N. Jaitly, A. Senior, V. Vanhoucke, P. Nguyen, T. Sainath, G. Dahl & B. Kingsbury. 2012. Deep neural networks for acoustic modeling in speech recognition. IEEE Signal Processing Magazine 29. 82–97.

  • Jelinek, F., L. Bahl & R. Mercer. 1975. Design of a linguistic statistical decoder for the recognition of continuous speech. IEEE Transactions on Information Theory 21. 250–256.

  • Kane, J. 2012. Tools for analysing the voice: Developments in glottal source and voice quality analysis. Ph.D. thesis, Trinity College Dublin.

  • Kendall, T. & V. Fridland. 2012. Variation in perception and production of mid front vowels in the U.S. Southern Vowel Shift. Journal of Phonetics 40. 289–306.

  • Kendall, T. & J. Fruehwald. 2014. Towards best practices in sociophonetics (with Marianna Di Paolo). In New Ways of Analyzing Variation (NWAV) 43, Chicago.

  • Kisler, T., F. Schiel & H. Sloetjes. 2012. Signal processing via web services: the use case WebMAUS. In Digital Humanities Workshop on Service-oriented Architectures (SOAs) for the Humanities: Solutions and Impacts. http://www.clarin-d.de/images/workshops/proceedingssoasforthehumanities.pdf.

  • Labov, W. 1994. Principles of linguistic change. Volume 1: Internal factors. Oxford: Blackwell.

  • Labov, W. 1996. The organization of dialect diversity in North America. In Proceedings of the International Conference on Spoken Language Processing (ICSLP). http://www.ling.upenn.edu/phono_atlas/ICSLP4.html.

  • Labov, W., S. Ash & C. Boberg. 2006. The Atlas of North American English (ANAE). Berlin: Mouton.

  • Labov, W., I. Rosenfelder & J. Fruehwald. 2013. One hundred years of sound change in Philadelphia: Linear incrementation, reversal and reanalysis. Language 89. 30–65.

  • Labov, W., M. Yaeger & R. Steiner. 1972. A quantitative study of sound change in progress. Report on NSF Contract NSF-GS-3287.

  • Lobanov, B. M. 1971. Classification of Russian vowels spoken by different speakers. Journal of the Acoustical Society of America 49. 606–608.

  • Panayotov, V., G. Chen, D. Povey & S. Khudanpur. 2015. LibriSpeech: an ASR corpus based on public domain audio books. In Proceedings of the International Conference on Acoustics, Speech, and Signal Processing (ICASSP).

  • Reddy, S. & J. N. Stanford. 2015. A web application for automated dialect analysis. In Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics (NAACL) – Demos. http://aclweb.org/anthology/N/N15/#3000.

  • Rosenfelder, I., J. Fruehwald, K. Evanini & J. Yuan. 2011. FAVE (Forced Alignment and Vowel Extraction) Program Suite. http://fave.ling.upenn.edu.

  • Sonderegger, M. & J. Keshet. 2012. Automatic measurement of voice onset time using discriminative structured prediction. Journal of the Acoustical Society of America 132. 3965–3979.

  • Stanford, J., N. Severance & K. Baclawski. 2014. Multiple vectors of unidirectional dialect change in eastern New England. Language Variation and Change 26. 103–140.

  • Thomas, E. 2011. Sociophonetics: An introduction. New York: Palgrave Macmillan.

  • Thomas, E. & T. Kendall. 2007. NORM: The vowel normalization and plotting suite [online resource]. http://lvc.uoregon.edu/norm/about_norm1.php.

  • Wolfram, W. & N. Schilling-Estes. 2006. American English (2nd edition). Malden, MA: Blackwell.

  • Yuan, J. & M. Liberman. 2008. Speaker identification on the SCOTUS corpus. Journal of the Acoustical Society of America 123. 3878.

About the article

Received: 2015-01-19

Accepted: 2015-06-25

Published Online: 2015-07-14

Published in Print: 2015-12-01


Citation Information: Linguistics Vanguard, ISSN (Online) 2199-174X, DOI: https://doi.org/10.1515/lingvan-2015-0002.

©2015 by De Gruyter Mouton.
