Corpus Linguistics and Linguistic Theory

Founded by Gries, Stefan Th. / Stefanowitsch, Anatol

Ed. by Wulff, Stefanie

DISCOver: DIStributional approach based on syntactic dependencies for discovering COnstructions

Maria Antònia Martí / Mariona Taulé / Venelin Kovatchev / Maria Salamó
Published Online: 2019-01-04 | DOI: https://doi.org/10.1515/cllt-2018-0028


One of the goals in Cognitive Linguistics is the automatic identification and analysis of constructions, since they are fundamental linguistic units for understanding language. This article presents DISCOver, an unsupervised methodology for the automatic discovery of lexico-syntactic patterns that can be considered candidates for constructions. The methodology follows a distributional semantic approach. Concretely, it is based on our proposed pattern-construction hypothesis: those contexts that are relevant to the definition of a cluster of semantically related words tend to be (part of) lexico-syntactic constructions. Our proposal uses Distributional Semantic Models to model the context, taking syntactic dependencies into account. After a clustering process, we link all clusters that have strong relationships and use them as a source of information for deriving lexico-syntactic patterns, obtaining a total of 220,732 candidates from a 100-million-token corpus of Spanish. We evaluated the resulting patterns intrinsically, by applying statistical association measures, and qualitatively, through expert judgement. Our results were superior to the baseline in both quality and quantity in all cases. Although our experiments were carried out on a Spanish corpus, the methodology is language independent: to be applied, it requires only a large corpus annotated with parts of speech and dependencies.
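To make the idea of a dependency-based distributional model more concrete, the following is a minimal sketch, not the DISCOver implementation. It assumes a handful of hypothetical toy dependency triples, uses positive pointwise mutual information (PPMI) as one common association measure, and compares words by cosine similarity. The triples, the relation label `dobj`, and all function names are illustrative assumptions; the abstract does not specify which weighting, similarity or clustering method the authors use.

```python
import math
from collections import Counter, defaultdict

# Hypothetical toy data: (head lemma, dependency relation, dependent lemma).
# A real setting would read these from a dependency-parsed corpus.
TRIPLES = [
    ("beber", "dobj", "vino"),
    ("beber", "dobj", "cerveza"),
    ("beber", "dobj", "agua"),
    ("servir", "dobj", "vino"),
    ("servir", "dobj", "cerveza"),
    ("leer", "dobj", "libro"),
    ("leer", "dobj", "novela"),
    ("escribir", "dobj", "libro"),
    ("escribir", "dobj", "novela"),
]


def context_counts(triples):
    """Count how often each dependent word occurs in each syntactic context.

    A context is encoded here as "relation:head" (an illustrative choice).
    """
    counts = defaultdict(Counter)
    for head, rel, dep in triples:
        counts[dep][f"{rel}:{head}"] += 1
    return counts


def ppmi_vectors(counts):
    """Turn raw counts into sparse PPMI-weighted context vectors."""
    total = sum(sum(ctxs.values()) for ctxs in counts.values())
    word_total = {w: sum(ctxs.values()) for w, ctxs in counts.items()}
    ctx_total = Counter()
    for ctxs in counts.values():
        ctx_total.update(ctxs)

    vectors = {}
    for word, ctxs in counts.items():
        vec = {}
        for ctx, n in ctxs.items():
            pmi = math.log2(n * total / (word_total[word] * ctx_total[ctx]))
            if pmi > 0:  # keep only positive associations
                vec[ctx] = pmi
        vectors[word] = vec
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(u[k] * v[k] for k in u.keys() & v.keys())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


vectors = ppmi_vectors(context_counts(TRIPLES))
sim_vino_cerveza = cosine(vectors["vino"], vectors["cerveza"])
sim_vino_agua = cosine(vectors["vino"], vectors["agua"])
sim_vino_libro = cosine(vectors["vino"], vectors["libro"])
```

On this toy data, words that share the same syntactic contexts (`vino`, `cerveza`) come out as more similar than words that share only some (`vino`, `agua`) or none (`vino`, `libro`). A clustering step over such similarities would then group related words, and the contexts that characterise each group are the kind of material from which candidate patterns could be derived.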

Keywords: constructions; semantics; distributional semantic models


About the article

Maria Antònia Martí

Maria Antònia Martí is a professor of Computational Linguistics at the University of Barcelona. She is currently the Director of the CLiC research group (Center for Language and Computation). Her research is focussed on Corpus Linguistics and Distributional Semantics. She teaches courses in Empirical Linguistics, Corpus Linguistics and Introduction to Linguistics to both undergraduate and postgraduate students.

Mariona Taulé

Mariona Taulé is a professor in the Linguistics Department at the University of Barcelona and a member of the CLiC research group (Center of Language and Computation) and UBICS (Universitat de Barcelona Institute of Complex Systems) at the same university. She is also Secretary of the Sociedad Española de Procesamiento del Lenguaje Natural and edits the journal Procesamiento del Lenguaje Natural. Her research and publications are related to computational linguistics and natural language processing and, especially, to lexical semantics, corpus linguistics and the development of linguistic resources for natural language processing, mainly for Spanish, Catalan and English.

Venelin Kovatchev

Venelin Kovatchev is a PhD researcher in the Linguistics Department at the University of Barcelona and a member of the CLiC research group (Center of Language and Computation) and UBICS (Universitat de Barcelona Institute of Complex Systems) at the same university. His research is focused on paraphrasing, textual entailment and semantic similarity.

Maria Salamó

Maria Salamó received both her B.S. in Computer Science (1999) and her Ph.D. (2004) from the Universitat Ramon Llull (Spain). She is an associate professor at the University of Barcelona and a member of the Institute of Complex Systems (UBICS). Her research covers a broad range of topics within AI, including Natural Language Processing, Machine Learning, Recommender Systems, and User Modeling.

Funding: This work was supported by Ministerio de Economía y Competitividad, Funder Id: 10.13039/501100003329, Grant Number: TIN2015-71147 and Generalitat de Catalunya, Funder Id: 10.13039/501100002809, Grant Number: 2017 SGR 341.

Citation Information: Corpus Linguistics and Linguistic Theory, ISSN (Online) 1613-7035, ISSN (Print) 1613-7027, DOI: https://doi.org/10.1515/cllt-2018-0028.

