Abstract
Forced aligners have revolutionized sociophonetics, but although several forced aligners are available, systematic comparisons of their performance remain rare. Here, we consider four major forced aligners used in sociophonetics today: MAUS, FAVE, LaBB-CAT and MFA. Through comparisons with human coders, we find that both the aligner and the phonological context affect the quality of automated alignments of vowels extracted from English sociolinguistic interview data. MFA and LaBB-CAT produce the highest-quality alignments, in some cases not significantly different from human alignment, followed by FAVE, and then MAUS. Aligners are less accurate at placing the boundary following a vowel than the one preceding it, and their accuracy varies across manner of articulation, particularly for following boundaries. These observations allow us to make specific recommendations for manual correction of forced alignment.
Acknowledgements
We gratefully acknowledge support from the ARC Centre of Excellence for the Dynamics of Language, and funding from a Transdisciplinary & Innovation Grant (TIG952018). We thank Robert Fromont, Debbie Loakes, and the anonymous Linguistics Vanguard reviewers for valuable feedback on the paper, as well as Miriam Meyerhoff, Jim Stanford, and Hywel Stoakes for help in formulating the ideas presented here.
References
Baayen, R. Harald, Richard Piepenbrock & Leon Gulikers. 1995. The CELEX lexical database (Release 2, CD-ROM). Philadelphia: Linguistic Data Consortium, University of Pennsylvania.
Bates, Douglas, Martin Mächler, Ben Bolker & Steve Walker. 2010. lme4: Linear mixed-effects models using S4 classes. R package version 0.999375-33.
Brognaux, Sandrine, Sophie Roekhaut, Thomas Drugman & Richard Beaufort. 2012. Automatic phone alignment: A comparison between speaker-independent models and models trained on the corpus to align. Proceedings of the 8th International Conference on NLP, JapTAL, 300–311. Kanazawa, Japan. doi:10.1007/978-3-642-33983-7_30.
Carnegie Mellon University. 1993–2016. The CMU pronouncing dictionary. http://www.speech.cs.cmu.edu/cgi-bin/cmudict.
Cassidy, Steve & Thomas Schmidt. 2017. Tools for multimodal annotation. In Nancy Ide & James Pustejovsky (eds.), Handbook of linguistic annotation, 209–227. Netherlands: Springer. doi:10.1007/978-94-024-0881-2_7.
Chodroff, Eleanor. 2018. Corpus phonetics tutorial. arXiv:1811.05553.
Cosi, Piero, Daniele Falavigna & Maurizio Omologo. 1991. A preliminary statistical evaluation of manual and automatic segmentation discrepancies. EUROSPEECH-91, 2nd European Conference on Speech Technology, 693–696. doi:10.21437/Eurospeech.1991-183.
Coto-Solano, Rolando & Sofía Flores Solórzano. 2017. Comparison of two forced alignment systems for aligning Bribri speech. CLEI Electronic Journal 20(1). 2:1–2:13. doi:10.19153/cleiej.20.1.2.
DiCanio, Christian, Hosung Nam, Douglas H. Whalen, H. Timothy Bunnell, Jonathan D. Amith & Rey Castillo Garcia. 2012. Assessing agreement level between forced alignment models with data from endangered language documentation corpora. INTERSPEECH-2012, Portland, Oregon, 130–133. doi:10.21437/Interspeech.2012-43.
Fromont, Robert & Jennifer Hay. 2012. LaBB-CAT: An annotation store. Proceedings of the Australasian Language Technology Workshop, 113–117.
Fromont, Robert & Kevin Watson. 2016. Factors influencing automatic segmental alignment of sociophonetic corpora. Corpora 11(3). 401–431. doi:10.3366/cor.2016.0101.
Gaida, Christian, Patrick Lange, Rico Petrick, Patrick Proba, Malatawy Ahmed & David Suendermann-Oeft. 2014. Comparing open-source speech recognition toolkits. DHBW Stuttgart Technical Report, Project OASIS. (http://suendermann.com/su/pdf/oasis2014.pdf).
Gonzalez, Simon, Catherine E. Travis, James Grama, Danielle Barth & Sunkulp Ananthanarayan. 2018. Recursive forced alignment: A test on a minority language. In Julien Epps, Joe Wolfe, John Smith & Caroline Jones (eds.), Proceedings of the 17th Australasian International Conference on Speech Science and Technology, 145–148.
Gordon, Elizabeth, Margaret Maclagan & Jennifer Hay. 2007. The ONZE corpus. In Joan C. Beal, Karen P. Corrigan & Hermann L. Moisl (eds.), Creating and digitizing language corpora, 82–104. London: Palgrave Macmillan. doi:10.1057/9780230223202_4.
Horvath, Barbara. 1985. Variation in Australian English: The sociolects of Sydney. Cambridge: Cambridge University Press.
Jones, Caroline, Katherine Demuth, Weicong Li & Andre Almeida. 2017. Vowels in the Barunga variety of North Australian Kriol. INTERSPEECH-2017, Stockholm, Sweden, 219–223. doi:10.21437/Interspeech.2017-1552.
Keegan, Peter J., Catherine I. Watson, Jeanette King, Margaret Maclagan & Ray Harlow. 2012. The role of technology in measuring changes in the pronunciation of Māori over generations. In Tania Ka’ai, Muiris Ó Laoire, Nicholas Ostler, Rachael Ka’ai-Mahuta, Dean Mahuta & Tania Smith (eds.), Language endangerment in the 21st century: Globalisation, technology and new media, Proceedings of Conference FEL XVI, 65–71. Auckland, New Zealand: Te Ipukarea: The National Māori Language Institute, AUT University/Foundation for Endangered Languages.
Kisler, Thomas, Uwe D. Reichel & Florian Schiel. 2017. Multilingual processing of speech via web services. Computer Speech & Language 45. 326–347. doi:10.1016/j.csl.2017.01.005.
Kuznetsova, Alexandra, Per B. Brockhoff & Rune H. B. Christensen. 2017. lmerTest package: Tests in linear mixed effects models. Journal of Statistical Software 82(13). 1–26. doi:10.18637/jss.v082.i13.
MacKenzie, Laurel & Danielle Turton. 2020. Assessing the accuracy of existing forced alignment software on varieties of British English. Linguistics Vanguard 6(s1). doi:10.1515/lingvan-2018-0061.
McAuliffe, Michael, Michaela Socolof, Sarah Mihuc, Michael Wagner & Morgan Sonderegger. 2017. Montreal Forced Aligner: Trainable text-speech alignment using Kaldi. INTERSPEECH-2017, Stockholm, Sweden, 498–502. doi:10.21437/Interspeech.2017-1386.
Panayotov, Vassil, Guoguo Chen, Daniel Povey & Sanjeev Khudanpur. 2015. Librispeech: An ASR corpus based on public domain audio books. Proceedings of the 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 5206–5210. Brisbane, Australia. doi:10.1109/ICASSP.2015.7178964.
Paulo, Sérgio & Luís C. Oliveira. 2004. Automatic phonetic alignment and its confidence measures. In José Luis Vicedo, Patricio Martínez-Barco, Rafael Munoz & Maximiliano Saiz Noeda (eds.), Advances in natural language processing: 4th International Conference, EsTAL 2004, 36–44. Berlin/Heidelberg: Springer-Verlag. doi:10.1007/978-3-540-30228-5_4.
Povey, Daniel, Arnab Ghoshal, Gilles Boulianne, Lukas Burget, Ondrej Glembek, Nagendra Goel, Mirko Hannemann, Petr Motlicek, Yanmin Qian, Petr Schwarz, Jan Silovsky, Georg Stemmer & Karel Vesely. 2011. The Kaldi speech recognition toolkit. Paper presented at the IEEE 2011 Workshop on Automatic Speech Recognition and Understanding, Hawaii.
R Core Team. 2018. R: A language and environment for statistical computing. Vienna, Austria: R Foundation for Statistical Computing. http://www.R-project.org.
Reddy, Sravana & James Stanford. 2015. Toward completely automated vowel extraction: Introducing DARLA. Linguistics Vanguard 1(1). 15–28. doi:10.1515/lingvan-2015-0002.
Rosenfelder, Ingrid, Josef Fruehwald, Keelan Evanini, Scott Seyfarth, Kyle Gorman, Hilary Prichard & Jiahong Yuan. 2014. FAVE (Forced Alignment and Vowel Extraction) Program Suite v1.2.2. doi:10.5281/zenodo.22281.
Schiel, Florian. 1999. Automatic phonetic transcription of non-prompted speech. Proceedings of the 14th International Congress of Phonetic Sciences (ICPhS), 607–610. San Francisco.
Schiel, Florian, Christoph Draxler, Angela Baumann, Tania Elbogen & Alexander Steen. 2012. The production of speech corpora. Ms. (https://www.bas.uni-muenchen.de/Forschung/BITS/TP1/Cookbook/).
Stoakes, Hywel & Florian Schiel. 2017. A Pan-Australian model for MAUS. Paper presented at the Australian Linguistic Society Annual Conference, 4–7 December, University of Sydney.
Strunk, Jan, Florian Schiel & Frank Seifart. 2014. Untrained forced alignment of transcriptions and audio for language documentation corpora using WebMAUS. In Nicoletta Calzolari, Khalid Choukri, Thierry Declerck, Hrafn Loftsson, Bente Maegaard, Joseph Mariani, Asuncion Moreno, Jan Odijk & Stelios Piperidis (eds.), Proceedings of the Ninth International Conference on Language Resources and Evaluation, LREC’14, 3940–3947. Reykjavik, Iceland.
Stuart-Smith, Jane, Brian José, Tamara Rathcke, Rachel Macdonald & Eleanor Lawson. 2017. Changing sounds in a changing city: An acoustic phonetic investigation of real-time change over a century of Glaswegian. In Chris Montgomery & Emma Moore (eds.), Language and a sense of place: Studies in language and region, 38–64. Cambridge: Cambridge University Press. doi:10.1017/9781316162477.004.
Travis, Catherine E., James Grama & Simon Gonzalez. In progress. Sydney Speaks corpora. Australian Research Council Centre of Excellence for the Dynamics of Language, Australian National University: http://www.dynamicsoflanguage.edu.au/sydney-speaks/.
Wagner, Michael, Dat Tran, Roberto Togneri, Phil Rose, David Powers, Mark Onslow, Debbie Loakes, Trent Lewis, Takaaki Kuratate, Yuko Kinoshita, Nenagh Kemp, Shunichi Ishihara, John Ingram, John Hajek, David Grayden, Roland Göcke, Janet Fletcher, Dominique Estival, Julien Epps, Robert Dale, Anne Cutler, Felicity Cox, Girija Chetty, Steve Cassidy, Andy Butcher, Denis Burnham, Steven Bird, Cathi Best, Mohammed Bennamoun, Joanne Arciuli & Eliathamby Ambikairajah. 2010. The big Australian speech corpus (The Big ASC). Paper presented at the 13th Australasian International Conference on Speech Science and Technology, Melbourne, Australia.
Walker, James & Miriam Meyerhoff. 2020. Pivots of the Caribbean? Low-back vowels in Eastern Caribbean English. Linguistics. doi:10.1515/ling-2019-0037.
Watson, Catherine I. & Zoe E. Evans. 2016. Sound change or experimental artifact?: A study on the impact of data preparation on measuring sound change. In Christopher Carignan & Michael D. Tyler (eds.), Proceedings of the 16th Australasian International Conference on Speech Science and Technology, 261–264. Sydney, Australia.
Wilbanks, Erik. 2018. faseAlign (Version 1.1.9) [Computer software]. Retrieved 11 October 2018 from https://github.com/EricWilbanks/faseAlign.
Young, Steve, Gunnar Evermann, Thomas Hain, Dan Kershaw, Xunying (Andrew) Liu, Julian Odell, Dave Ollason, Daniel Povey, Valtcho Valtchev & Phil Woodland. 2006. The HTK book (for version 3.4). Cambridge: Cambridge University Engineering Department.
Yuan, Jiahong & Mark Liberman. 2008. Speaker identification on the SCOTUS corpus. The Journal of the Acoustical Society of America 123. 5687–5690. doi:10.1121/1.2935783.
Appendix
Linear mixed-effects model fit to Overlap Rate – Predictors: aligner identity, scaled vowel duration; Random intercepts: speaker, vowel identity, preceding and following manner of articulation.
| Predictors | Estimates | CI | t | p |
|---|---|---|---|---|
| (Intercept) = H2H | **0.77** | 0.71 to 0.82 | 27.08 | <0.001 |
| Aligner = FAVE | **−0.23** | −0.25 to −0.21 | −21.02 | <0.001 |
| Aligner = LABBCAT | **−0.22** | −0.25 to −0.20 | −20.71 | <0.001 |
| Aligner = MAUS | **−0.26** | −0.29 to −0.24 | −24.45 | <0.001 |
| Aligner = MFA | **−0.15** | −0.17 to −0.13 | −13.76 | <0.001 |
| scale(dur) | **0.04** | 0.04 to 0.05 | 10.81 | <0.001 |

Bold values denote statistical significance at the p < 0.05 level.
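For readers wishing to replicate this analysis, the Overlap Rate model described above can be sketched in R with the lme4/lmerTest packages cited in the references. This is a minimal sketch under the assumption of one row per vowel token; the data frame and column names (`alignments`, `overlap_rate`, `duration`, etc.) are hypothetical stand-ins, not the authors' actual code.

```r
# Sketch of the Overlap Rate model: fixed effects for aligner and scaled
# vowel duration; random intercepts for speaker, vowel identity, and
# preceding/following manner of articulation.
library(lmerTest)  # loads lme4 and adds p-values for fixed effects

m_overlap <- lmer(
  overlap_rate ~ aligner + scale(duration) +
    (1 | speaker) + (1 | vowel) +
    (1 | preceding_manner) + (1 | following_manner),
  data = alignments  # hypothetical data frame, one row per vowel token
)
summary(m_overlap)  # estimates, CIs via confint(), t- and p-values
```

With treatment coding and H2H (human-to-human agreement) as the reference level of `aligner`, the fixed-effect estimates correspond directly to the rows of the table above.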
Linear mixed-effects model fit to Overlap Rate (with MAUS acoustic model from New Zealand English spontaneous speech) – Predictors: aligner identity, scaled vowel duration; Random intercepts: speaker, vowel identity, preceding and following manner of articulation.
| Predictors | Estimates | CI | t | p |
|---|---|---|---|---|
| (Intercept) = H2H | **0.76** | 0.70 to 0.82 | 24.76 | <0.001 |
| Aligner = FAVE | **−0.23** | −0.25 to −0.21 | −20.55 | <0.001 |
| Aligner = LABBCAT | **−0.22** | −0.25 to −0.20 | −20.25 | <0.001 |
| Aligner = MAUS | **−0.26** | −0.29 to −0.24 | −23.91 | <0.001 |
| Aligner = MAUSNZ | **−0.25** | −0.28 to −0.23 | −22.89 | <0.001 |
| Aligner = MFA | **−0.15** | −0.17 to −0.13 | −13.45 | <0.001 |
| scale(dur) | **0.04** | 0.03 to 0.05 | 10.88 | <0.001 |

Bold values denote statistical significance at the p < 0.05 level.
Linear mixed-effects model fit to Boundary Displacement – Predictors: aligner identity, scaled vowel duration, position; Random intercepts: speaker, vowel identity.
| Predictors | Estimates | CI | t | p |
|---|---|---|---|---|
| (Intercept) = H2H, preceding | **8.92** | 2.46 to 15.38 | 2.71 | 0.007 |
| Aligner = FAVE | **31.24** | 24.53 to 37.95 | 9.12 | <0.001 |
| Aligner = LABBCAT | **13.46** | 6.75 to 20.17 | 3.93 | <0.001 |
| Aligner = MAUS | **60.44** | 53.73 to 67.16 | 17.65 | <0.001 |
| Aligner = MFA | **17.36** | 10.65 to 24.07 | 5.07 | <0.001 |
| Position = following | 4.11 | −0.13 to 8.36 | 1.90 | 0.058 |
| scale(dur) | **12.51** | 10.21 to 14.81 | 10.65 | <0.001 |

Bold values denote statistical significance at the p < 0.05 level.
Linear mixed-effects model fit to preceding Boundary Displacement – Predictors: aligner identity, scaled vowel duration, preceding manner of articulation; interaction between preceding manner and aligner; Random intercepts: speaker, vowel identity.
| Predictors | Estimates | CI | t | p |
|---|---|---|---|---|
| (Intercept) = H2H, stops | 7.40 | −5.37 to 20.16 | 1.14 | 0.256 |
| scale(dur) | **9.03** | 5.72 to 12.34 | 5.34 | <0.001 |
| Manner = approximant | 5.99 | −12.41 to 24.39 | 0.64 | 0.523 |
| Manner = fricative | −0.48 | −19.24 to 18.28 | −0.05 | 0.960 |
| Manner = lateral | 4.03 | −28.45 to 36.51 | 0.24 | 0.808 |
| Manner = nasal | 1.64 | −22.08 to 25.36 | 0.14 | 0.892 |
| Aligner = FAVE | **24.78** | 7.33 to 42.23 | 2.78 | 0.005 |
| Aligner = LABBCAT | 10.77 | −6.68 to 28.22 | 1.21 | 0.226 |
| Aligner = MAUS | **69.02** | 51.57 to 86.47 | 7.75 | <0.001 |
| Aligner = MFA | 7.44 | −10.00 to 24.89 | 0.84 | 0.403 |
| Manner = approximant:Aligner = FAVE | 9.18 | −16.61 to 34.98 | 0.70 | 0.485 |
| Manner = fricative:Aligner = FAVE | 7.58 | −18.84 to 34.00 | 0.56 | 0.574 |
| Manner = lateral:Aligner = FAVE | 10.17 | −35.60 to 55.95 | 0.44 | 0.663 |
| Manner = nasal:Aligner = FAVE | 18.68 | −14.79 to 52.14 | 1.09 | 0.274 |
| Manner = approximant:Aligner = LABBCAT | −2.24 | −28.03 to 23.56 | −0.17 | 0.865 |
| Manner = fricative:Aligner = LABBCAT | 1.09 | −25.33 to 27.51 | 0.08 | 0.935 |
| Manner = lateral:Aligner = LABBCAT | 1.95 | −43.82 to 47.72 | 0.08 | 0.933 |
| Manner = nasal:Aligner = LABBCAT | 6.50 | −26.96 to 39.97 | 0.38 | 0.703 |
| Manner = approximant:Aligner = MAUS | −11.12 | −36.92 to 14.67 | −0.85 | 0.398 |
| Manner = fricative:Aligner = MAUS | **−28.83** | −55.25 to −2.41 | −2.14 | 0.032 |
| Manner = lateral:Aligner = MAUS | −27.12 | −72.90 to 18.65 | −1.16 | 0.245 |
| Manner = nasal:Aligner = MAUS | 2.52 | −30.95 to 35.98 | 0.15 | 0.883 |
| Manner = approximant:Aligner = MFA | 12.79 | −13.00 to 38.59 | 0.97 | 0.331 |
| Manner = fricative:Aligner = MFA | 10.06 | −16.36 to 36.48 | 0.75 | 0.455 |
| Manner = lateral:Aligner = MFA | 14.92 | −30.85 to 60.69 | 0.64 | 0.523 |
| Manner = nasal:Aligner = MFA | 15.05 | −18.41 to 48.52 | 0.88 | 0.378 |

Bold values denote statistical significance at the p < 0.05 level.
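The Boundary Displacement models with a manner-by-aligner interaction follow the same pattern as the Overlap Rate model; again, this is only a sketch, with hypothetical data frame and column names rather than the authors' actual code.

```r
library(lmerTest)  # lme4 plus p-values for fixed effects

# Preceding-boundary model: manner of the preceding segment interacts
# with aligner identity; random intercepts for speaker and vowel identity.
# The following-boundary model swaps in following_manner and the data
# for following boundaries.
m_preceding <- lmer(
  boundary_displacement ~ preceding_manner * aligner + scale(duration) +
    (1 | speaker) + (1 | vowel),
  data = preceding_boundaries  # hypothetical data frame
)
summary(m_preceding)
```

In R's formula syntax, `preceding_manner * aligner` expands to both main effects plus their interaction, matching the rows of the two tables above (stops and H2H as reference levels).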
Linear mixed-effects model fit to following Boundary Displacement – Predictors: aligner identity, scaled vowel duration, following manner of articulation; interaction between following manner and aligner; Random intercepts: speaker, vowel identity.
| Predictors | Estimates | CI | t | p |
|---|---|---|---|---|
| (Intercept) = H2H, stops | 9.11 | −2.77 to 21.00 | 1.50 | 0.133 |
| scale(dur) | **11.42** | 8.22 to 14.62 | 7.00 | <0.001 |
| Manner = approximant | 5.77 | −22.80 to 34.35 | 0.40 | 0.692 |
| Manner = fricative | −0.87 | −19.21 to 17.47 | −0.09 | 0.926 |
| Manner = lateral | 5.56 | −21.91 to 33.03 | 0.40 | 0.691 |
| Manner = nasal | 5.73 | −11.60 to 23.06 | 0.65 | 0.517 |
| Aligner = FAVE | **27.03** | 10.58 to 43.48 | 3.22 | 0.001 |
| Aligner = LABBCAT | 15.43 | −1.02 to 31.89 | 1.84 | 0.066 |
| Aligner = MAUS | **37.28** | 20.83 to 53.74 | 4.44 | <0.001 |
| Aligner = MFA | 15.22 | −1.24 to 31.67 | 1.81 | 0.070 |
| Manner = approximant:Aligner = FAVE | 5.93 | −34.32 to 46.18 | 0.29 | 0.773 |
| Manner = fricative:Aligner = FAVE | 7.11 | −18.75 to 32.97 | 0.54 | 0.590 |
| Manner = lateral:Aligner = FAVE | 2.39 | −36.17 to 40.96 | 0.12 | 0.903 |
| Manner = nasal:Aligner = FAVE | 4.28 | −20.10 to 28.65 | 0.34 | 0.731 |
| Manner = approximant:Aligner = LABBCAT | 0.95 | −39.30 to 41.20 | 0.05 | 0.963 |
| Manner = fricative:Aligner = LABBCAT | 0.74 | −25.12 to 26.59 | 0.06 | 0.956 |
| Manner = lateral:Aligner = LABBCAT | 11.32 | −27.24 to 49.88 | 0.58 | 0.565 |
| Manner = nasal:Aligner = LABBCAT | −1.13 | −25.51 to 23.25 | −0.09 | 0.928 |
| Manner = approximant:Aligner = MAUS | **48.88** | 8.63 to 89.13 | 2.38 | 0.017 |
| Manner = fricative:Aligner = MAUS | 21.49 | −4.37 to 47.35 | 1.63 | 0.103 |
| Manner = lateral:Aligner = MAUS | 5.58 | −32.98 to 44.14 | 0.28 | 0.777 |
| Manner = nasal:Aligner = MAUS | **49.54** | 25.16 to 73.91 | 3.98 | <0.001 |
| Manner = approximant:Aligner = MFA | −0.71 | −40.96 to 39.54 | −0.03 | 0.972 |
| Manner = fricative:Aligner = MFA | −2.27 | −28.13 to 23.59 | −0.17 | 0.864 |
| Manner = lateral:Aligner = MFA | 23.19 | −15.37 to 61.75 | 1.18 | 0.239 |
| Manner = nasal:Aligner = MFA | 1.94 | −22.44 to 26.31 | 0.16 | 0.876 |

Bold values denote statistical significance at the p < 0.05 level.
©2020 Walter de Gruyter GmbH, Berlin/Boston