Published by De Gruyter Mouton July 14, 2015

Toward completely automated vowel extraction: Introducing DARLA

  • Sravana Reddy and James N. Stanford
From the journal Linguistics Vanguard


Automatic Speech Recognition (ASR) is reaching further and further into everyday life with Apple’s Siri, Google voice search, automated telephone information systems, dictation devices, closed captioning, and other applications. Along with such advances in speech technology, sociolinguists have been considering new methods for alignment and vowel formant extraction, including techniques like the Penn Aligner (Yuan and Liberman 2008) and the FAVE automated vowel extraction program (Evanini et al. 2009; Rosenfelder et al. 2011). With humans transcribing audio recordings into sentences, these semi-automated methods can produce effective vowel formant measurements (Labov et al. 2013). But as the quality of ASR improves, sociolinguistics may be on the brink of another transformative technology: large-scale, completely automated vowel extraction without any need for human transcription. It would then be possible to quickly extract vowels from virtually limitless hours of recordings, such as YouTube videos, publicly available audio/video archives, and large-scale personal interviews or streaming video. How far away is this transformative moment? In this article, we introduce a fully automated program called DARLA (short for “Dartmouth Linguistic Automation”), which automatically generates transcriptions with ASR and extracts vowels using FAVE. Users simply upload an audio recording of speech, and DARLA produces vowel plots, a table of vowel formants, and probabilities of the phonetic environments for each token. In this paper, we describe DARLA and explore its sociolinguistic applications. We test the system on a dataset of the US Southern Shift and compare the results with semi-automated methods.
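A common next step after obtaining a formant table like the one DARLA produces is speaker normalization, for example the Lobanov (1971) z-score method used in sociophonetic tools such as NORM (Thomas and Kendall 2007). The sketch below is not DARLA's own code; the row structure and column names (`vowel`, `F1`, `F2`) are assumptions made for illustration.

```python
from statistics import mean, pstdev

def lobanov_normalize(tokens):
    """Lobanov (z-score) normalization of F1/F2 for one speaker's tokens.

    `tokens` is a list of dicts with 'vowel', 'F1', and 'F2' keys, e.g. rows
    parsed from a FAVE-style formant table for a single speaker. Adds
    'F1_norm' and 'F2_norm' fields and returns the list.
    """
    for f in ("F1", "F2"):
        vals = [t[f] for t in tokens]
        mu, sigma = mean(vals), pstdev(vals)
        for t in tokens:
            # Guard against a degenerate (constant) formant column.
            t[f + "_norm"] = (t[f] - mu) / sigma if sigma else 0.0
    return tokens

# Hypothetical tokens for a single speaker.
rows = [
    {"vowel": "IY", "F1": 300.0, "F2": 2300.0},
    {"vowel": "AA", "F1": 750.0, "F2": 1100.0},
    {"vowel": "AE", "F1": 660.0, "F2": 1700.0},
]
normed = lobanov_normalize(rows)
```

Because normalization is applied per speaker, the z-scored values can be pooled across speakers when plotting vowel spaces, which is why tools in this family report both raw and normalized formants.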


We are grateful to the anonymous reviewers for their suggestions, and to the various users of DARLA for feedback. Irene Feng assisted in building the web interface. The first author was supported by a Neukom Fellowship at Dartmouth, and development of DARLA is being sponsored by a Neukom CompX grant. The computing cluster used for training the ASR models and running experiments was made available by NSF award CNS-1205521.


Baranowski, M. 2013. Sociophonetics. In R. Bayley, R. Cameron & C. Lucas (eds.), The Oxford Handbook of Sociolinguistics, 403–424. Oxford: Oxford University Press. doi:10.1093/oxfordhb/9780199744084.013.0020

Boersma, P. & D. Weenink. 2015. Praat: Doing phonetics by computer [computer program].

Cambridge University. 1989–2015. HTK Hidden Markov Model Toolkit.

Carnegie Mellon University. 1993–2015. CMU Pronouncing Dictionary.

Carnegie Mellon University. 2000–2015. CMU Sphinx Speech Recognition Toolkit.

Davis, S. B. & P. Mermelstein. 1980. Comparison of parametric representations for monosyllabic word recognition in continuously spoken sentences. IEEE Transactions on Acoustics, Speech and Signal Processing 28. 357–366.

Deng, L., X. Cui, R. Pruvenok, J. Huang, S. Momen, Y. Chen & A. Alwan. 2006. A database of vocal tract resonance trajectories for research in speech processing. In Proceedings of the International Conference on Acoustics, Speech, and Signal Processing (ICASSP).

Di Paolo, M., M. Yaeger-Dror & A. B. Wassink. 2011. Analyzing vowels. In M. Di Paolo & M. Yaeger-Dror (eds.), Sociophonetics: A student’s guide. London: Routledge.

Evanini, K. 2009. The permeability of dialect boundaries: A case study of the region surrounding Erie, Pennsylvania. Ph.D. thesis, University of Pennsylvania.

Evanini, K., S. Isard & M. Liberman. 2009. Automatic formant extraction for sociolinguistic analysis of large corpora. In Proceedings of Interspeech.

Garofolo, J., L. Lamel, W. Fisher, J. Fiscus, D. Pallett, N. Dahlgren & V. Zue. 1993. TIMIT Acoustic-Phonetic Continuous Speech Corpus LDC93S1. Philadelphia: Linguistic Data Consortium.

Godfrey, J. & E. Holliman. 1993. Switchboard-1 Release 2 LDC97S62. Philadelphia: Linguistic Data Consortium.

Goldman, J.-P. 2011. EasyAlign: An automatic phonetic alignment tool under Praat. In Proceedings of Interspeech.

Gorman, K., J. Howell & M. Wagner. 2011. Prosodylab-Aligner: A tool for forced alignment of laboratory speech. Canadian Acoustics 39. 192–193.

Greenberg, S., J. Hollenback & D. Ellis. 1996. Insights into spoken language gleaned from phonetic transcriptions of the Switchboard corpus. In Proceedings of the International Conference on Spoken Language Processing (ICSLP).

Hasegawa-Johnson, M., J. Baker, S. Borys, K. Chen, E. Coogan, S. Greenberg, A. Juneja, K. Kirchhoff, K. Livescu, S. Mohan, J. Muller, K. Sonmez & T. Wang. 2005. Landmark-based speech recognition: Report of the 2004 Johns Hopkins Summer Workshop. In Proceedings of the International Conference on Acoustics, Speech, and Signal Processing (ICASSP).

Hesselwood, B., L. Plug & A. Tickle. 2010. Assessing rhoticity using auditory, acoustic and psychoacoustic methods. In Proceedings of Methods XIII: Papers from the 13th International Conference on Methods in Dialectology.

Hillenbrand, J., L. Getty, M. Clark & K. Wheeler. 1995. Acoustic characteristics of American English vowels. Journal of the Acoustical Society of America 97. 3099–3111. doi:10.1121/1.411872

Hinton, G., L. Deng, D. Yu, A. Mohamed, N. Jaitly, A. Senior, V. Vanhoucke, P. Nguyen, T. Sainath, G. Dahl & B. Kingsbury. 2012. Deep neural networks for acoustic modeling in speech recognition. IEEE Signal Processing Magazine 29. 82–97. doi:10.1109/MSP.2012.2205597

Jelinek, F., L. Bahl & R. Mercer. 1975. Design of a linguistic statistical decoder for the recognition of continuous speech. IEEE Transactions on Information Theory 21. 250–256. doi:10.1109/TIT.1975.1055384

Kane, J. 2012. Tools for analysing the voice: Developments in glottal source and voice quality analysis. Ph.D. thesis, Trinity College Dublin.

Kendall, T. & V. Fridland. 2012. Variation in perception and production of mid front vowels in the U.S. Southern Vowel Shift. Journal of Phonetics 40. 289–306. doi:10.1016/j.wocn.2011.12.002

Kendall, T. & J. Fruehwald. 2014. Towards best practices in sociophonetics (with Marianna Di Paolo). In New Ways of Analyzing Variation (NWAV) 43, Chicago.

Kisler, T., F. Schiel & H. Sloetjes. 2012. Signal processing via web services: The use case WebMAUS. In Digital Humanities Workshop on Service-oriented Architectures (SOAs) for the Humanities: Solutions and Impacts.

Labov, W. 1994. Principles of linguistic change. Volume 1: Internal factors. Oxford: Blackwell.

Labov, W. 1996. The organization of dialect diversity in North America. In Proceedings of the International Conference on Spoken Language Processing (ICSLP).

Labov, W., S. Ash & C. Boberg. 2006. The Atlas of North American English (ANAE). Berlin: Mouton. doi:10.1515/9783110167467

Labov, W., I. Rosenfelder & J. Fruehwald. 2013. One hundred years of sound change in Philadelphia: Linear incrementation, reversal and reanalysis. Language 89. 30–65.

Labov, W., M. Yaeger & R. Steiner. 1972. A quantitative study of sound change in progress. Report on NSF Contract NSF-GS-3287.

Lobanov, B. M. 1971. Classification of Russian vowels spoken by different speakers. Journal of the Acoustical Society of America 49. 606–608. doi:10.1121/1.1912396

Panayotov, V., G. Chen, D. Povey & S. Khudanpur. 2015. LibriSpeech: An ASR corpus based on public domain audio books. In Proceedings of the International Conference on Acoustics, Speech, and Signal Processing (ICASSP). doi:10.1109/ICASSP.2015.7178964

Reddy, S. & J. N. Stanford. 2015. A web application for automated dialect analysis. In Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics (NAACL) – Demos.

Rosenfelder, I., J. Fruehwald, K. Evanini & J. Yuan. 2011. FAVE (Forced Alignment and Vowel Extraction) Program Suite.

Sonderegger, M. & J. Keshet. 2012. Automatic measurement of voice onset time using discriminative structured prediction. Journal of the Acoustical Society of America 132. 3965–3979. doi:10.1121/1.4763995

Stanford, J., N. Severance & K. Baclawski. 2014. Multiple vectors of unidirectional dialect change in eastern New England. Language Variation and Change 26. 103–140. doi:10.1017/S0954394513000227

Thomas, E. 2011. Sociophonetics: An introduction. New York: Palgrave Macmillan. doi:10.1007/978-1-137-28561-4

Thomas, E. & T. Kendall. 2007. NORM: The vowel normalization and plotting suite [online resource].

Wolfram, W. & N. Schilling-Estes. 2006. American English (2nd edition). Malden, MA: Blackwell.

Yuan, J. & M. Liberman. 2008. Speaker identification on the SCOTUS corpus. Journal of the Acoustical Society of America 123. 3878. doi:10.1121/1.2935783

Received: 2015-01-19
Accepted: 2015-06-25
Published Online: 2015-07-14
Published in Print: 2015-12-01

©2015 by De Gruyter Mouton
