Jump to ContentJump to Main Navigation
Show Summary Details
More options …

Open Computer Science

Editor-in-Chief: van den Broek, Egon


Covered by:
SCOPUS
Web of Science - Emerging Sources Citation Index


CiteScore 2018: 0.63
Source Normalized Impact per Paper (SNIP) 2018: 0.604

ICV 2018: 97.86

Open Access
Online
ISSN
2299-1093
See all formats and pricing
More options …

Size Matters: The Impact of Training Size in Taxonomically-Enriched Word Embeddings

Alfredo Maldonado / Filip Klubička / John Kelleher
Published Online: 2019-10-11 | DOI: https://doi.org/10.1515/comp-2019-0009

Abstract

Word embeddings trained on natural corpora (e.g., newspaper collections, Wikipedia or the Web) excel in capturing thematic similarity (“topical relatedness”) on word pairs such as ‘coffee’ and ‘cup’ or ’bus’ and ‘road’. However, they are less successful on pairs showing taxonomic similarity, like ‘cup’ and ‘mug’ (near synonyms) or ‘bus’ and ‘train’ (types of public transport). Moreover, purely taxonomy-based embeddings (e.g. those trained on a random-walk of WordNet’s structure) outperform natural-corpus embeddings in taxonomic similarity but underperform them in thematic similarity. Previous work suggests that performance gains in both types of similarity can be achieved by enriching natural-corpus embeddings with taxonomic information from taxonomies like Word-Net. This taxonomic enrichment can be done by combining natural-corpus embeddings with taxonomic embeddings (e.g. those trained on a random-walk of WordNet’s structure). This paper conducts a deep analysis of this assumption and shows that both the size of the natural corpus and of the random-walk coverage of the WordNet structure play a crucial role in the performance of combined (enriched) vectors in both similarity tasks. Specifically, we show that embeddings trained on medium-sized natural corpora benefit the most from taxonomic enrichment whilst embeddings trained on large natural corpora only benefit from this enrichment when evaluated on taxonomic similarity tasks. The implication of this is that care has to be taken in controlling the size of the natural corpus and the size of the random-walk used to train vectors. In addition, we find that, whilst the WordNet structure is finite and it is possible to fully traverse it in a single pass, the repetition of well-connected WordNet concepts in extended random-walks effectively reinforces taxonomic relations in the learned embeddings.

Keywords: word embeddings; taxonomic embeddings; WordNet; semantic similarity; taxonomic enrichment; retrofitting

References

  • [1] Mikolov T., Corrado G., Chen K., Dean J., Efficient Estimation of Word Representations in Vector Space, in Proceedings of the International Conference on Learning Representations (ICLR 2013), Scottsdale, AZ, 2013, 1–12Google Scholar

  • [2] Mikolov T., Stutskever I., Chen K., Corrado G., Dean J., Distributed Representations of Words and Phrases and their Compositionality, in Proceedings of the Twenty-Seventh Annual Conference on Neural Information Processing Systems (NIPS) In Advances in Neural Information Processing Systems 26, Lake Tahoe, NV, 2013, 3111–3119Google Scholar

  • [3] Baroni M., Dinu G., Kruszewski G., Don’t count, predict! A systematic comparison of context-counting vs. context-predicting semantic vectors, in Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Baltimore, MD, 2014, 238–247, 10.3115/v1/P14-1023Google Scholar

  • [4] Hill F., Reichart R., Korhonen A., SimLex-999: Evaluating Semantic Models With (Genuine) Similarity Estimation, Computational Linguistics, 41(4), 2015, 665–695, 10.1162/COLIWeb of ScienceGoogle Scholar

  • [5] Kacmajor M., Kelleher J. D., Capturing and measuring thematic relatedness, Language Resources and Evaluation, 2019, 1–38, 10.1007/s10579-019-09452-wGoogle Scholar

  • [6] Fellbaum C., WordNet: An Electronic Lexical Database, MIT Press, Cambridge, MA, 1998Google Scholar

  • [7] Faruqui M., Dodge J., Jauhar S. K., Dyer C., Hovy E., Smith N. A., Retrofitting Word Vectors to Semantic Lexicons, in Human Language Technologies: The 2015 Annual Conference of the North American Chapter of the ACL, Denver, CO, 2015, 1606–1615,10.3115/v1/N15-1184Google Scholar

  • [8] Speer R., Lowry-Duda J., ConceptNet at SemEval-2017 Task 2: Extending Word Embeddings with Multilingual Relational Knowledge, in Proceedings of the 11th International Workshop on Semantic Evaluations (SemEval-2017), Vancouver, 2017, 85–89Google Scholar

  • [9] Faruqui M., Dyer C., Non-distributional Word Vector Representations, in Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Short Papers), Beijing, 2015, 464–469, 10.3115/v1/P15-2076Google Scholar

  • [10] Goikoetxea J., Soroa A., Agirre E., Random Walks and Neural Network Language Models on Knowledge Bases, in Human Language Technologies: The 2015 Conference of the North American Chapter of the Association for Computational Linguistics, Denver, CO, 2015, 1434–1439Google Scholar

  • [11] Goikoetxea J., Agirre E., Soroa A., Single or Multiple? Combining Word Representations Independently Learned from Text and WordNet, in AAAI, 2016Google Scholar

  • [12] Nickel M., Kiela D., Poincaré Embeddings for Learning Hierarchical Representations, in I. Guyon, U.V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, R. Garnett, eds., Advances in Neural Information Processing Systems 30, Curran Associates, Inc., Long Beach, CA, 2017, 6338–6347Google Scholar

  • [13] Cohen T., Widdows D., Embedding of semantic predications, Journal of Biomedical Informatics, 68, 2017, 150–166, 10.1016/j.jbi.2017.03.003Google Scholar

  • [14] Agirre E., Cuadros M., Rigau G., Soroa A., Exploring Knowledge Bases for Similarity., in Proceedings of the Eight International Conference on Language Resources and Evaluation (LREC’10), 2010Google Scholar

  • [15] Wieting J., Bansal M., Gimpel K., Livescu K., Roth D., From Paraphrase Database to Compositional Paraphrase Model and Back, Transactions of the Association for Computational Linguistics, 3, 2015, 345–358Google Scholar

  • [16] Mrkšić N., Séaghdha D. O., Thomson B., Gašić M., Rojas-Barahona L., Su P. H., Vandyke D., Wen T. H., Young S., Counter-fitting word vectors to linguistic constraints, arXiv preprint arXiv:1603.00892, 2016Google Scholar

  • [17] Nguyen K. A., Köper M., Schulte im Walde S., Vu N. T., Hierarchical Embeddings for Hypernymy Detection and Directionality, in Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, Copenhagen, 2017, 233–243Google Scholar

  • [18] Mrkšić N., Vulić I., Séaghdha D. Ó., Leviant I., Reichart R., Gašić M., Korhonen A., Young S., Semantic Specialisation of Distributional Word Vector Spaces using Monolingual and Cross-Lingual Constraints, Transactions of the Association for Computational Linguistics, 5, 2017, 309–324Google Scholar

  • [19] Nguyen K. A., Schulte im Walde S., Vu N. T., Integrating Distributional Lexical Contrast into Word Embeddings for Antonym-Synonym Distinction, in Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics, Berlin, 2016, 454–459Google Scholar

  • [20] Vulić I., Glavaš G., Mrkšić N., Korhonen A., Post-Specialisation: Retrofitting Vectors of Words Unseen in Lexical Resources, in Proceedings of NAACL-HLT 2018, New Orleans, LA, 2018, 516–527Google Scholar

  • [21] Ponti E.M., Vulić I., Glavaš G., Mrkšić N., Korhonen A., Adversarial Propagation and Zero-Shot Cross-Lingual Transfer of Word Vector Specialization, in Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, 2018, 282–293Google Scholar

  • [22] Yu Z., Cohen T., Bernstam E. V., Johnson T. R., Wallace B. C., Retrofitting Word Vectors of MeSH Terms to Improve Semantic Similarity Measures, in Proceedings of the Seventh International Workshop on Health Text Mining and Information Analysis (LOUHI), Austin, TX, 2016, 43–51Google Scholar

  • [23] Speer R., Havasi C., Representing General Relational Knowledge in ConceptNet 5, in Proceedings of the Eight International Conference on Language Resources and Evaluation (LREC’12), Istanbul, 2012, 3679—-3686Google Scholar

  • [24] Finkelstein L., Gabrilovich E., Matias Y., Rivlin E., Solan Z., Wolf-man G., Ruppin E., Placing search in context: the concept revisited, ACM Transactions on Information Systems, 20(1), 2002, 116–131, 10.1145/503104.503110Google Scholar

  • [25] Camacho-Collados J., Pilehvar M. T., Collier N., Navigli R., SemEval-2017 Task 2: Multilingual and Cross-lingual Semantic Word Similarity, in Proceedings of the 11th International Workshop on Semantic Evaluation (SemEval-2017), Vancouver, 2017, 15–26Google Scholar

  • [26] Ganitkevitch J., Van Durme B., Callison-Burch C., PPDB: The paraphrase database, in Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 2013, 758–764Google Scholar

  • [27] Baker C. F., Fillmore C. J., Lowe J. B., The berkeley framenet project, in Proceedings of the 17th international conference on Computational linguistics-Volume 1, Association for Computational Linguistics, 1998, 86–90Google Scholar

  • [28] Klubička F., Maldonado A., Kelleher J., Synthetic, yet natural: Properties of WordNet random walk corpora and the impact of rare words on embedding performance, in Proceedings of GWC2019: 10th Global WordNet Conference, 2019Google Scholar

  • [29] Al-Rfou R., Perozzi B., Skiena S., Polyglot: Distributed Word Representations for Multilingual NLP, in Proceedings of the Seventeenth Conference on Computational Natural Language Learning, Sofia, 2013, 183–192, 10.1007/s10479-011-0841-3Google Scholar

  • [30] Turney P. D., Pantel P., From Frequency to Meaning: Vector Space Models of Semantics, Journal of Artificial Intelligence Research, 37, 2010, 141–188Web of ScienceGoogle Scholar

About the article

Received: 2019-05-17

Accepted: 2019-07-04

Published Online: 2019-10-11


Citation Information: Open Computer Science, Volume 9, Issue 1, Pages 252–267, ISSN (Online) 2299-1093, DOI: https://doi.org/10.1515/comp-2019-0009.

Export Citation

© 2019 Alfredo Maldonado et al., published by De Gruyter Open. This work is licensed under the Creative Commons Attribution 4.0 Public License. BY 4.0

Comments (0)

Please log in or register to comment.
Log in