DISCOver: DIStributional approach based on syntactic dependencies for discovering COnstructions

Maria Antònia Martí 1, Mariona Taulé 2, Venelin Kovatchev 3 and Maria Salamó 4
  • 1 Filologia Catalana i Lingüística General, Universitat de Barcelona, Barcelona, Spain
  • 2 Linguistics, University of Barcelona, Barcelona, Spain
  • 3 Filologia Catalana i Lingüística General, Universitat de Barcelona, Barcelona, Spain
  • 4 Matemàtica Aplicada i Anàlisi, Universitat de Barcelona, Barcelona, Spain
Maria Antònia Martí
  • Corresponding author
  • Filologia Catalana i Lingüística General, Universitat de Barcelona, Barcelona, Spain
  • Maria Antònia Martí is a professor of Computational Linguistics at the University of Barcelona. She is currently the Director of the CLiC research group (Center for Language and Computation). Her research is focussed on Corpus Linguistics and Distributional Semantics. She teaches courses in Empirical Linguistics, Corpus Linguistics and Introduction to Linguistics to both undergraduate and postgraduate students.
Mariona Taulé
  • Linguistics, University of Barcelona, Barcelona, Spain
  • Mariona Taulé is a professor in the Linguistics Department at the University of Barcelona and a member of the CLiC research group (Center for Language and Computation) and UBICS (Universitat de Barcelona Institute of Complex Systems) at the same university. She is also Secretary of the Sociedad Española de Procesamiento del Lenguaje Natural and edits the journal Procesamiento del Lenguaje Natural. Her research and publications concern computational linguistics and natural language processing, especially lexical semantics, corpus linguistics and the development of linguistic resources for natural language processing, mainly for Spanish, Catalan and English.
Venelin Kovatchev
  • Filologia Catalana i Lingüística General, Universitat de Barcelona, Barcelona, Spain
  • Venelin Kovatchev is a PhD researcher in the Linguistics Department at the University of Barcelona and a member of the CLiC research group (Center for Language and Computation) and UBICS (Universitat de Barcelona Institute of Complex Systems) at the same university. His research focuses on paraphrasing, textual entailment and semantic similarity.
Maria Salamó
  • Matemàtica Aplicada i Anàlisi, Universitat de Barcelona, Barcelona, Spain
  • Maria Salamó received both her B.S. in Computer Science (1999) and her Ph.D. (2004) from the Universitat Ramon Llull (Spain). She is an associate professor at the University of Barcelona and a member of the Institute of Complex Systems (UBICS). Her research covers a broad range of topics within AI, including Natural Language Processing, Machine Learning, Recommender Systems, and User Modeling.

Abstract

One of the goals in Cognitive Linguistics is the automatic identification and analysis of constructions, since they are fundamental linguistic units for understanding language. This article presents DISCOver, an unsupervised methodology for the automatic discovery of lexico-syntactic patterns that can be considered candidates for constructions. The methodology follows a distributional semantic approach. Specifically, it is based on our proposed pattern-construction hypothesis: those contexts that are relevant to the definition of a cluster of semantically related words tend to be (part of) lexico-syntactic constructions. Our proposal uses Distributional Semantic Models to model context, taking syntactic dependencies into account. After a clustering process, we linked all clusters with strong relationships and used them as a source of information for deriving lexico-syntactic patterns, obtaining a total of 220,732 candidates from a 100-million-token corpus of Spanish. We evaluated the resulting patterns intrinsically, by applying statistical association measures, and qualitatively, through expert judgement. Our results were superior to the baseline in both quality and quantity in all cases. Although our experiments were carried out on a Spanish corpus, the methodology is language independent: applying it requires only a large corpus annotated with parts of speech and dependencies.
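
To make the idea of dependency-based contexts and association scoring concrete, the following is a minimal sketch, not the DISCOver implementation. It assumes a toy list of (head, relation, dependent) triples and uses pointwise mutual information (PMI) as one common association measure; the abstract does not say which measures were actually applied. The names `triples`, `dependency_contexts` and `pmi` are hypothetical.

```python
import math
from collections import Counter

# Hypothetical toy input: (head lemma, dependency relation, dependent lemma)
# triples, as they might be read from a dependency-parsed Spanish corpus.
triples = [
    ("beber", "dobj", "agua"),
    ("beber", "dobj", "vino"),
    ("beber", "dobj", "agua"),
    ("comer", "dobj", "pan"),
    ("comer", "dobj", "carne"),
    ("leer", "dobj", "libro"),
]


def dependency_contexts(triples):
    """Count how often each word occurs in each syntactic context (relation, head)."""
    counts = Counter()
    for head, rel, dep in triples:
        counts[(dep, (rel, head))] += 1
    return counts


def pmi(counts):
    """Pointwise mutual information for every observed (word, context) pair."""
    total = sum(counts.values())
    word_totals = Counter()
    context_totals = Counter()
    for (word, context), n in counts.items():
        word_totals[word] += n
        context_totals[context] += n
    return {
        (word, context): math.log2(
            n * total / (word_totals[word] * context_totals[context])
        )
        for (word, context), n in counts.items()
    }


counts = dependency_contexts(triples)
scores = pmi(counts)
```

In this sketch each row of `counts` plays the role of a word's distributional profile over syntactic contexts. A real pipeline would cluster such profiles before deriving patterns, a step that is omitted here.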


Corpus Linguistics and Linguistic Theory publishes high-quality, corpus-based research focusing on theoretically-relevant issues in all core areas of linguistic research (phonology, morphology, syntax, semantics, pragmatics) and other recognized topic areas. The journal features articles from a corpus-based approach that develop new methods, evaluate theoretical claims and offer analyses of linguistic phenomena within a theoretical framework.
