Corpus Linguistics and Linguistic Theory

Founded by Gries, Stefan Th. / Stefanowitsch, Anatol

Ed. by Wulff, Stefanie

DISCOver: DIStributional approach based on syntactic dependencies for discovering COnstructions

Maria Antònia Martí / Mariona Taulé / Venelin Kovatchev / Maria Salamó
Published Online: 2019-01-04 | DOI: https://doi.org/10.1515/cllt-2018-0028


One of the goals of Cognitive Linguistics is the automatic identification and analysis of constructions, since they are fundamental linguistic units for understanding language. This article presents DISCOver, an unsupervised methodology for the automatic discovery of lexico-syntactic patterns that can be considered candidates for constructions. The methodology follows a distributional semantic approach. Specifically, it is based on our proposed pattern-construction hypothesis: those contexts that are relevant to the definition of a cluster of semantically related words tend to be (part of) lexico-syntactic constructions. Our proposal uses Distributional Semantic Models to model the context, taking syntactic dependencies into account. After a clustering process, we linked all clusters with strong relationships and used them as a source of information for deriving lexico-syntactic patterns, obtaining a total of 220,732 candidates from a 100-million-token corpus of Spanish. We evaluated the patterns intrinsically, by applying statistical association measures, and qualitatively, through expert assessment. Our results were superior to the baseline in both quality and quantity in all cases. Although our experiments were carried out on a Spanish corpus, the methodology is language independent: to be applied, it requires only a large corpus annotated with parts of speech and dependencies.

Keywords: constructions; semantics; distributional semantic models
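
The general idea summarised in the abstract (words represented by their syntactic-dependency contexts, and contexts shared by a group of related words treated as candidate pattern slots) can be illustrated with a minimal sketch. This is not the authors' implementation: the toy dependency triples, the relation label `dobj`, and the helper names `build_vectors`, `cosine` and `shared_contexts` are all assumptions made for illustration, and raw counts with cosine similarity stand in for whatever weighting and clustering the article actually uses.

```python
from collections import Counter, defaultdict
from math import sqrt

# Toy (head lemma, relation, dependent lemma) triples.
# Invented for illustration; not taken from the article's corpus.
TRIPLES = [
    ("beber", "dobj", "vino"),
    ("beber", "dobj", "cerveza"),
    ("beber", "dobj", "agua"),
    ("servir", "dobj", "vino"),
    ("servir", "dobj", "cerveza"),
    ("leer", "dobj", "libro"),
    ("leer", "dobj", "novela"),
    ("escribir", "dobj", "novela"),
    ("escribir", "dobj", "libro"),
]


def build_vectors(triples):
    """Map each dependent word to a count vector over (relation, head) contexts."""
    vectors = defaultdict(Counter)
    for head, rel, dep in triples:
        vectors[dep][(rel, head)] += 1
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[k] * v[k] for k in u.keys() & v.keys())
    norm = sqrt(sum(x * x for x in u.values())) * sqrt(sum(x * x for x in v.values()))
    return dot / norm if norm else 0.0


def shared_contexts(vectors, words):
    """Contexts that occur with every word in the group (hypothetical pattern slots)."""
    common = set.intersection(*(set(vectors[w]) for w in words))
    return sorted(common)


vectors = build_vectors(TRIPLES)
print(cosine(vectors["vino"], vectors["cerveza"]))   # high: identical contexts
print(cosine(vectors["vino"], vectors["libro"]))     # zero: no shared contexts
print(shared_contexts(vectors, ["vino", "cerveza"]))
```

In this toy setting, words that fill the same dependency slots end up with similar vectors, and the contexts common to such a group are the kind of material from which lexico-syntactic patterns could be derived.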


About the article

Maria Antònia Martí

Maria Antònia Martí is a professor of Computational Linguistics at the University of Barcelona. She is currently the Director of the CLiC research group (Center for Language and Computation). Her research is focussed on Corpus Linguistics and Distributional Semantics. She teaches courses in Empirical Linguistics, Corpus Linguistics and Introduction to Linguistics to both undergraduate and postgraduate students.

Mariona Taulé

Mariona Taulé is a professor in the Linguistics Department at the University of Barcelona and a member of the CLiC research group (Center of Language and Computation) and UBICS (Universitat de Barcelona Institute of Complex Systems) at the same university. She is also Secretary of the Sociedad Española de Procesamiento del Lenguaje Natural and edits the journal Procesamiento del Lenguaje Natural. Her research and publications concern computational linguistics and natural language processing, especially lexical semantics, corpus linguistics and the development of linguistic resources for natural language processing, mainly for Spanish, Catalan and English.

Venelin Kovatchev

Venelin Kovatchev is a PhD researcher in the Linguistics Department at the University of Barcelona and a member of the CLiC research group (Center of Language and Computation) and UBICS (Universitat de Barcelona Institute of Complex Systems) at the same university. His research focuses on paraphrasing, textual entailment and semantic similarity.

Maria Salamó

Maria Salamó received both her B.S. in Computer Science (1999) and her Ph.D. (2004) from the Universitat Ramon Llull (Spain). She is an associate professor at the University of Barcelona and a member of its Institute of Complex Systems (UBICS). Her research covers a broad range of topics within AI, including Natural Language Processing, Machine Learning, Recommender Systems, and User Modeling.

Funding: This work was supported by Ministerio de Economía y Competitividad, Funder Id: 10.13039/501100003329, Grant Number: TIN2015-71147 and Generalitat de Catalunya, Funder Id: 10.13039/501100002809, Grant Number: 2017 SGR 341.

Citation Information: Corpus Linguistics and Linguistic Theory, ISSN (Online) 1613-7035, ISSN (Print) 1613-7027, DOI: https://doi.org/10.1515/cllt-2018-0028.

© 2019 Walter de Gruyter GmbH, Berlin/Boston.
