Published by De Gruyter Mouton, April 18, 2020

Comparing the performance of forced aligners used in sociophonetic research

Simon Gonzalez, James Grama and Catherine E. Travis
From the journal Linguistics Vanguard

Abstract

Forced aligners have revolutionized sociophonetics, but while there are several forced aligners available, there are few systematic comparisons of their performance. Here, we consider four major forced aligners used in sociophonetics today: MAUS, FAVE, LaBB-CAT and MFA. Through comparisons with human coders, we find that both aligner and phonological context affect the quality of automated alignments of vowels extracted from English sociolinguistic interview data. MFA and LaBB-CAT produce the highest-quality alignments, in some cases not significantly different from human alignment, followed by FAVE, and then MAUS. Aligners are less accurate at placing boundaries following a vowel than preceding it, and they vary in accuracy across manner of articulation, particularly for following boundaries. These observations allow us to make specific recommendations for manual correction of forced alignment.

Acknowledgements

We gratefully acknowledge support from the ARC Centre of Excellence for the Dynamics of Language, and funding from a Transdisciplinary & Innovation Grant (TIG952018). We thank Robert Fromont, Debbie Loakes, and the anonymous Linguistics Vanguard reviewers for valuable feedback on the paper, as well as Miriam Meyerhoff, Jim Stanford, and Hywel Stoakes for help in formulating the ideas presented here.

Appendix

Table 2: Linear mixed-effects model fit to Overlap Rate – Predictors: aligner identity, scaled vowel duration; Random intercepts: speaker, vowel identity, preceding and following manner of articulation.

Predictor | Estimate | CI | t | p
(Intercept) = H2H | 0.77 | 0.71 to 0.82 | 27.08 | <0.001
Aligner = FAVE | −0.23 | −0.25 to −0.21 | −21.02 | <0.001
Aligner = LABBCAT | −0.22 | −0.25 to −0.20 | −20.71 | <0.001
Aligner = MAUS | −0.26 | −0.29 to −0.24 | −24.45 | <0.001
Aligner = MFA | −0.15 | −0.17 to −0.13 | −13.76 | <0.001
scale(dur) | 0.04 | 0.04 to 0.05 | 10.81 | <0.001

Note: Bold values denote statistical significance at the p < 0.05 level.
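
The Overlap Rate modelled in Table 2 is not defined in this section. As a minimal sketch, assuming it is the duration shared by a reference interval and an automatically aligned interval, divided by the duration of their union, it could be computed as follows. The function name and the example times are hypothetical and are not taken from the study's data.

```python
def overlap_rate(ref, auto):
    """Overlap rate of two (start, end) intervals, times in seconds.

    Assumed definition (not stated in this section): shared duration
    divided by the duration of the union of the two intervals.
    """
    ref_start, ref_end = ref
    auto_start, auto_end = auto
    # Duration covered by both intervals; zero if they do not overlap.
    common = max(0.0, min(ref_end, auto_end) - max(ref_start, auto_start))
    # Union duration = sum of both durations minus the shared part.
    union = (ref_end - ref_start) + (auto_end - auto_start) - common
    if union <= 0.0:
        raise ValueError("intervals must have positive duration")
    return common / union


# Hypothetical vowel: human boundaries vs. an aligner shifted by 50 ms.
print(round(overlap_rate((1.00, 1.20), (1.05, 1.25)), 2))  # 0.6
```

Under this assumed definition, identical intervals score 1 and non-overlapping intervals score 0, so the negative aligner estimates in Table 2 can be read as reductions in agreement relative to the human-to-human (H2H) baseline.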

Table 3: Linear mixed-effects model fit to Overlap Rate (with MAUS acoustic model from New Zealand English spontaneous speech) – Predictors: aligner identity, scaled vowel duration; Random intercepts: speaker, vowel identity, preceding and following manner of articulation.

Predictor | Estimate | CI | t | p
(Intercept) = H2H | 0.76 | 0.70 to 0.82 | 24.76 | <0.001
Aligner = FAVE | −0.23 | −0.25 to −0.21 | −20.55 | <0.001
Aligner = LABBCAT | −0.22 | −0.25 to −0.20 | −20.25 | <0.001
Aligner = MAUS | −0.26 | −0.29 to −0.24 | −23.91 | <0.001
Aligner = MAUSNZ | −0.25 | −0.28 to −0.23 | −22.89 | <0.001
Aligner = MFA | −0.15 | −0.17 to −0.13 | −13.45 | <0.001
scale(dur) | 0.04 | 0.03 to 0.05 | 10.88 | <0.001

Note: Bold values denote statistical significance at the p < 0.05 level.

Table 4: Linear mixed-effects model fit to Boundary Displacement – Predictors: aligner identity, scaled vowel duration, position; Random intercepts: speaker, vowel identity.

Predictor | Estimate | CI | t | p
(Intercept) = H2H, preceding | 8.92 | 2.46 to 15.38 | 2.71 | 0.007
Aligner = FAVE | 31.24 | 24.53 to 37.95 | 9.12 | <0.001
Aligner = LABBCAT | 13.46 | 6.75 to 20.17 | 3.93 | <0.001
Aligner = MAUS | 60.44 | 53.73 to 67.16 | 17.65 | <0.001
Aligner = MFA | 17.36 | 10.65 to 24.07 | 5.07 | <0.001
Position = following | 4.11 | −0.13 to 8.36 | 1.90 | 0.058
scale(dur) | 12.51 | 10.21 to 14.81 | 10.65 | <0.001

Note: Bold values denote statistical significance at the p < 0.05 level.
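
To illustrate how the estimates in Table 4 combine, the sketch below adds the fixed effects for a chosen aligner and boundary position. It assumes treatment coding with H2H and the preceding boundary as reference levels (as the intercept row suggests), random effects set to zero, and a scaled duration of 0 standing for a vowel of average duration. The function and variable names are illustrative, and the unit of Boundary Displacement is not stated in this section.

```python
# Fixed-effect estimates copied from Table 4 (Boundary Displacement).
INTERCEPT = 8.92  # reference levels: H2H, preceding boundary
ALIGNER = {"H2H": 0.0, "FAVE": 31.24, "LABBCAT": 13.46, "MAUS": 60.44, "MFA": 17.36}
POSITION = {"preceding": 0.0, "following": 4.11}
DUR_SLOPE = 12.51  # change per unit of scaled vowel duration


def predicted_displacement(aligner, position, scaled_dur=0.0):
    """Fixed-effects-only prediction under the assumptions stated above."""
    return (INTERCEPT + ALIGNER[aligner] + POSITION[position]
            + DUR_SLOPE * scaled_dur)


print(round(predicted_displacement("MAUS", "following"), 2))  # 73.47
print(round(predicted_displacement("MFA", "preceding"), 2))   # 26.28
```

Because this model has no interaction terms, the difference between any two aligners is the same at both boundary positions.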

Table 5: Linear mixed-effects model fit to preceding Boundary Displacement – Predictors: aligner identity, scaled vowel duration, preceding manner of articulation; interaction between preceding manner and aligner; Random intercepts: speaker, vowel identity.

Predictor | Estimate | CI | t | p
(Intercept) = H2H, stops | 7.40 | −5.37 to 20.16 | 1.14 | 0.256
scale(dur) | 9.03 | 5.72 to 12.34 | 5.34 | <0.001
Manner = approximant | 5.99 | −12.41 to 24.39 | 0.64 | 0.523
Manner = fricative | −0.48 | −19.24 to 18.28 | −0.05 | 0.960
Manner = lateral | 4.03 | −28.45 to 36.51 | 0.24 | 0.808
Manner = nasal | 1.64 | −22.08 to 25.36 | 0.14 | 0.892
Aligner = FAVE | 24.78 | 7.33 to 42.23 | 2.78 | 0.005
Aligner = LABBCAT | 10.77 | −6.68 to 28.22 | 1.21 | 0.226
Aligner = MAUS | 69.02 | 51.57 to 86.47 | 7.75 | <0.001
Aligner = MFA | 7.44 | −10.00 to 24.89 | 0.84 | 0.403
Manner = approximant:Aligner = FAVE | 9.18 | −16.61 to 34.98 | 0.70 | 0.485
Manner = fricative:Aligner = FAVE | 7.58 | −18.84 to 34.00 | 0.56 | 0.574
Manner = lateral:Aligner = FAVE | 10.17 | −35.60 to 55.95 | 0.44 | 0.663
Manner = nasal:Aligner = FAVE | 18.68 | −14.79 to 52.14 | 1.09 | 0.274
Manner = approximant:Aligner = LABBCAT | −2.24 | −28.03 to 23.56 | −0.17 | 0.865
Manner = fricative:Aligner = LABBCAT | 1.09 | −25.33 to 27.51 | 0.08 | 0.935
Manner = lateral:Aligner = LABBCAT | 1.95 | −43.82 to 47.72 | 0.08 | 0.933
Manner = nasal:Aligner = LABBCAT | 6.50 | −26.96 to 39.97 | 0.38 | 0.703
Manner = approximant:Aligner = MAUS | −11.12 | −36.92 to 14.67 | −0.85 | 0.398
Manner = fricative:Aligner = MAUS | −28.83 | −55.25 to −2.41 | −2.14 | 0.032
Manner = lateral:Aligner = MAUS | −27.12 | −72.90 to 18.65 | −1.16 | 0.245
Manner = nasal:Aligner = MAUS | 2.52 | −30.95 to 35.98 | 0.15 | 0.883
Manner = approximant:Aligner = MFA | 12.79 | −13.00 to 38.59 | 0.97 | 0.331
Manner = fricative:Aligner = MFA | 10.06 | −16.36 to 36.48 | 0.75 | 0.455
Manner = lateral:Aligner = MFA | 14.92 | −30.85 to 60.69 | 0.64 | 0.523
Manner = nasal:Aligner = MFA | 15.05 | −18.41 to 48.52 | 0.88 | 0.378

Note: Bold values denote statistical significance at the p < 0.05 level.

Table 6: Linear mixed-effects model fit to following Boundary Displacement – Predictors: aligner identity, scaled vowel duration, following manner of articulation; interaction between following manner and aligner; Random intercepts: speaker, vowel identity.

Predictor | Estimate | CI | t | p
(Intercept) = H2H, stops | 9.11 | −2.77 to 21.00 | 1.50 | 0.133
scale(dur) | 11.42 | 8.22 to 14.62 | 7.00 | <0.001
Manner = approximant | 5.77 | −22.80 to 34.35 | 0.40 | 0.692
Manner = fricative | −0.87 | −19.21 to 17.47 | −0.09 | 0.926
Manner = lateral | 5.56 | −21.91 to 33.03 | 0.40 | 0.691
Manner = nasal | 5.73 | −11.60 to 23.06 | 0.65 | 0.517
Aligner = FAVE | 27.03 | 10.58 to 43.48 | 3.22 | 0.001
Aligner = LABBCAT | 15.43 | −1.02 to 31.89 | 1.84 | 0.066
Aligner = MAUS | 37.28 | 20.83 to 53.74 | 4.44 | <0.001
Aligner = MFA | 15.22 | −1.24 to 31.67 | 1.81 | 0.070
Manner = approximant:Aligner = FAVE | 5.93 | −34.32 to 46.18 | 0.29 | 0.773
Manner = fricative:Aligner = FAVE | 7.11 | −18.75 to 32.97 | 0.54 | 0.590
Manner = lateral:Aligner = FAVE | 2.39 | −36.17 to 40.96 | 0.12 | 0.903
Manner = nasal:Aligner = FAVE | 4.28 | −20.10 to 28.65 | 0.34 | 0.731
Manner = approximant:Aligner = LABBCAT | 0.95 | −39.30 to 41.20 | 0.05 | 0.963
Manner = fricative:Aligner = LABBCAT | 0.74 | −25.12 to 26.59 | 0.06 | 0.956
Manner = lateral:Aligner = LABBCAT | 11.32 | −27.24 to 49.88 | 0.58 | 0.565
Manner = nasal:Aligner = LABBCAT | −1.13 | −25.51 to 23.25 | −0.09 | 0.928
Manner = approximant:Aligner = MAUS | 48.88 | 8.63 to 89.13 | 2.38 | 0.017
Manner = fricative:Aligner = MAUS | 21.49 | −4.37 to 47.35 | 1.63 | 0.103
Manner = lateral:Aligner = MAUS | 5.58 | −32.98 to 44.14 | 0.28 | 0.777
Manner = nasal:Aligner = MAUS | 49.54 | 25.16 to 73.91 | 3.98 | <0.001
Manner = approximant:Aligner = MFA | −0.71 | −40.96 to 39.54 | −0.03 | 0.972
Manner = fricative:Aligner = MFA | −2.27 | −28.13 to 23.59 | −0.17 | 0.864
Manner = lateral:Aligner = MFA | 23.19 | −15.37 to 61.75 | 1.18 | 0.239
Manner = nasal:Aligner = MFA | 1.94 | −22.44 to 26.31 | 0.16 | 0.876

Note: Bold values denote statistical significance at the p < 0.05 level.
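
With an interaction in the model, a prediction for a given aligner and following manner needs four terms: the intercept, the manner effect, the aligner effect, and their interaction. The sketch below shows that arithmetic for Table 6, assuming treatment coding with H2H and stops as reference levels, random effects at zero, and average vowel duration (scaled duration of 0). All names are illustrative, and the unit of displacement is not stated in this section.

```python
# Fixed-effect estimates copied from Table 6 (following boundary).
INTERCEPT = 9.11  # reference levels: H2H, following stop
MANNER = {"stop": 0.0, "approximant": 5.77, "fricative": -0.87,
          "lateral": 5.56, "nasal": 5.73}
ALIGNER = {"H2H": 0.0, "FAVE": 27.03, "LABBCAT": 15.43,
           "MAUS": 37.28, "MFA": 15.22}
# Interaction estimates, keyed by (manner, aligner).
INTERACTION = {
    ("approximant", "FAVE"): 5.93, ("fricative", "FAVE"): 7.11,
    ("lateral", "FAVE"): 2.39, ("nasal", "FAVE"): 4.28,
    ("approximant", "LABBCAT"): 0.95, ("fricative", "LABBCAT"): 0.74,
    ("lateral", "LABBCAT"): 11.32, ("nasal", "LABBCAT"): -1.13,
    ("approximant", "MAUS"): 48.88, ("fricative", "MAUS"): 21.49,
    ("lateral", "MAUS"): 5.58, ("nasal", "MAUS"): 49.54,
    ("approximant", "MFA"): -0.71, ("fricative", "MFA"): -2.27,
    ("lateral", "MFA"): 23.19, ("nasal", "MFA"): 1.94,
}


def predicted_following(aligner, manner):
    """Fixed-effects-only prediction at average vowel duration."""
    # Combinations involving a reference level have no interaction term.
    interaction = INTERACTION.get((manner, aligner), 0.0)
    return INTERCEPT + MANNER[manner] + ALIGNER[aligner] + interaction


print(round(predicted_following("MAUS", "nasal"), 2))  # 101.66
print(round(predicted_following("MFA", "nasal"), 2))   # 32.0
print(round(predicted_following("MAUS", "stop"), 2))   # 46.39
```

Read this way, the MAUS estimate before a nasal is more than double its estimate before a stop, whereas the MFA estimate changes little. This is in line with the abstract's observation that aligners vary in accuracy across manner of articulation, particularly for following boundaries.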

Received: 2019-02-20
Accepted: 2019-10-19
Published Online: 2020-04-18

©2020 Walter de Gruyter GmbH, Berlin/Boston

Downloaded on 29.3.2024 from https://www.degruyter.com/document/doi/10.1515/lingvan-2019-0058/html