
Corpus Linguistics and Linguistic Theory

Founded by Gries, Stefan Th. / Stefanowitsch, Anatol

Ed. by Wulff, Stefanie



Dependency profiles in the large-scale analysis of discourse connectives

Veronika Laippala / Aki-Juhani Kyröläinen
  • Department of linguistics and languages, McMaster University, Hamilton, ON, Canada
  • Applied Linguistics, Brock University, St. Catharines, ON, Canada
/ Jenna Kanerva / Filip Ginter
Published Online: 2018-06-06 | DOI: https://doi.org/10.1515/cllt-2017-0031


This article presents dependency profiles (DPs) as an empirical method for investigating linguistic elements and applies them to the study of 24 discourse connectives in the 3.7-billion-token Finnish Internet Parsebank (http://bionlp-www.utu.fi/dep_search/). DPs are based on co-occurrence patterns of the discourse connectives with dependency syntax relations. They follow the assumption of usage-based models, according to which the semantic and functional properties of linguistic expressions arise from their distributional characteristics. We focus on the typical usage patterns reflected by the DPs and the (dis)similarities among discourse connectives that these patterns reveal. We demonstrate that 1) DPs can be analyzed with clustering to obtain linguistically meaningful groupings among the connectives and that 2) the clustering can be combined with support vector machines to obtain generic and stable linguistic characteristics of the discourse connectives. We show that this data-driven method offers support for previous results and reveals novel tendencies outside the scope of studies on smaller corpora. As the method is based on automatic syntactic analysis following the cross-linguistic Universal Dependencies scheme, it does not require manual annotation and can be applied to a number of languages and in contrastive studies.

Keywords: Dependency syntax; discourse connectives; web-as-corpus; Universal dependencies
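
The idea of a dependency profile described in the abstract can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the observations are invented toy data rather than counts from the Finnish Internet Parsebank, the three Finnish connectives (koska 'because', jos 'if', mutta 'but') are picked only as examples, and the helper names `dependency_profiles` and `cosine_similarity` are hypothetical. The sketch stops at comparing profiles pairwise; the clustering and support vector machine steps reported in the article are not reproduced here.

```python
from collections import Counter, defaultdict
from math import sqrt

# Toy (connective, dependency relation) observations.
# Invented for illustration; NOT taken from the Finnish Internet Parsebank.
observations = [
    ("koska", "mark"), ("koska", "mark"), ("koska", "advmod"),
    ("jos", "mark"), ("jos", "mark"), ("jos", "mark"),
    ("mutta", "cc"), ("mutta", "cc"), ("mutta", "advmod"),
]


def dependency_profiles(pairs):
    """Map each connective to the relative frequencies of its relations."""
    counts = defaultdict(Counter)
    for connective, relation in pairs:
        counts[connective][relation] += 1
    profiles = {}
    for connective, rel_counts in counts.items():
        total = sum(rel_counts.values())
        profiles[connective] = {rel: n / total for rel, n in rel_counts.items()}
    return profiles


def cosine_similarity(p, q):
    """Cosine similarity between two sparse profiles stored as dicts."""
    relations = set(p) | set(q)
    dot = sum(p.get(r, 0.0) * q.get(r, 0.0) for r in relations)
    norm_p = sqrt(sum(v * v for v in p.values()))
    norm_q = sqrt(sum(v * v for v in q.values()))
    return dot / (norm_p * norm_q) if norm_p and norm_q else 0.0


profiles = dependency_profiles(observations)

# Compare every pair of connectives by the similarity of their profiles.
names = sorted(profiles)
for i, a in enumerate(names):
    for b in names[i + 1:]:
        print(a, b, round(cosine_similarity(profiles[a], profiles[b]), 3))
```

In this toy data the two subordinating connectives end up with more similar profiles than either has with the coordinating one, which is the kind of distributional (dis)similarity a clustering step could then group. Which relations count as co-occurring with a connective, and how the profiles are weighted, are design choices this sketch does not take from the article.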



About the article

Veronika Laippala

Veronika Laippala is Associate Professor of Digital Linguistics in the School of Languages and Translation Studies at the University of Turku, Finland. Her research focuses on corpus linguistics and computational linguistics. In particular, she has worked on the development of web-crawled corpora and corpora of computer-mediated communication in various languages and on enhancing computational methods for text linguistics and discourse analysis. Her most recent projects include “Finnish Internet Parsebank”, which is developing a very large web corpus for Finnish, and “Structuring language use across multilingual web corpora”, which aims to detect registers automatically in web corpora in different languages.

Aki-Juhani Kyröläinen

Aki-Juhani Kyröläinen is currently a post-doctoral fellow at McMaster University and Brock University within the Words in the World project. His main research interests center on distributional models of language with a focus on morphosyntactic structures. Additionally, his research combines multiple methodologies ranging from corpus analysis to psycholinguistic experimentation with an emphasis on eye-tracking.

Jenna Kanerva

Jenna Kanerva has an MSc in computer science and is currently a PhD student at the University of Turku. Her research focuses on machine learning methods in language technology, and her main interest is the development of a dependency parsing pipeline for Finnish.

Filip Ginter

Filip Ginter earned an MSc (2001) and a PhD (2007) in computer science and is currently an assistant professor in language technology at the University of Turku. His research interests center on machine learning applied to large textual corpora.

Citation Information: Corpus Linguistics and Linguistic Theory, ISSN (Online) 1613-7035, ISSN (Print) 1613-7027, DOI: https://doi.org/10.1515/cllt-2017-0031.

© 2018 Walter de Gruyter GmbH, Berlin/Boston.