Size Matters: The Impact of Training Size in Taxonomically-Enriched Word Embeddings

Alfredo Maldonado 1, Filip Klubička 2, and John Kelleher 3
  • 1 ADAPT Centre at Trinity College Dublin, Dublin, Ireland
  • 2 ADAPT Centre at Technological University Dublin, Dublin, Ireland
  • 3 ADAPT Centre at Technological University Dublin, Dublin, Ireland

Abstract

Word embeddings trained on natural corpora (e.g., newspaper collections, Wikipedia or the Web) excel at capturing thematic similarity (“topical relatedness”) on word pairs such as ‘coffee’ and ‘cup’ or ‘bus’ and ‘road’. However, they are less successful on pairs showing taxonomic similarity, like ‘cup’ and ‘mug’ (near synonyms) or ‘bus’ and ‘train’ (types of public transport). Conversely, purely taxonomy-based embeddings (e.g., those trained on a random walk of WordNet’s structure) outperform natural-corpus embeddings in taxonomic similarity but underperform them in thematic similarity. Previous work suggests that performance gains in both types of similarity can be achieved by enriching natural-corpus embeddings with taxonomic information from taxonomies like WordNet. This taxonomic enrichment can be done by combining natural-corpus embeddings with such taxonomic embeddings. This paper conducts a deep analysis of this assumption and shows that both the size of the natural corpus and the random-walk coverage of the WordNet structure play a crucial role in the performance of combined (enriched) vectors in both similarity tasks. Specifically, we show that embeddings trained on medium-sized natural corpora benefit the most from taxonomic enrichment, whilst embeddings trained on large natural corpora only benefit from this enrichment when evaluated on taxonomic similarity tasks. The implication is that care has to be taken in controlling both the size of the natural corpus and the size of the random walk used to train vectors. In addition, we find that, whilst the WordNet structure is finite and can be fully traversed in a single pass, the repetition of well-connected WordNet concepts in extended random walks effectively reinforces taxonomic relations in the learned embeddings.
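As a rough illustration of the last point, the following minimal sketch generates a pseudo-corpus by random walks over a tiny hand-made taxonomy and counts how often each concept appears. Everything in it is an assumption made for illustration: the toy graph `TOY_TAXONOMY`, the function names, the uniform choice of start node and the fixed stopping probability are hypothetical and are not taken from the paper, which works with the full WordNet structure.

```python
import random
from collections import Counter

# Hypothetical toy taxonomy (not WordNet): undirected hypernym/hyponym links.
TOY_TAXONOMY = {
    "vehicle": ["bus", "train", "car"],
    "bus": ["vehicle"],
    "train": ["vehicle"],
    "car": ["vehicle"],
    "container": ["cup", "mug"],
    "cup": ["container"],
    "mug": ["container"],
}


def random_walk_corpus(graph, num_walks, stop_prob=0.15, seed=0):
    """Return a list of walks; each walk is a pseudo-sentence of concept names.

    Assumed procedure: start at a uniformly chosen node, then repeatedly move
    to a uniformly chosen neighbour, stopping with probability `stop_prob`
    before each move.
    """
    rng = random.Random(seed)
    nodes = sorted(graph)
    corpus = []
    for _ in range(num_walks):
        node = rng.choice(nodes)
        walk = [node]
        while rng.random() > stop_prob:
            node = rng.choice(graph[node])
            walk.append(node)
        corpus.append(walk)
    return corpus


def concept_counts(corpus):
    """Count how many times each concept occurs across all walks."""
    return Counter(token for walk in corpus for token in walk)


if __name__ == "__main__":
    corpus = random_walk_corpus(TOY_TAXONOMY, num_walks=2000)
    for concept, count in concept_counts(corpus).most_common():
        print(concept, count)
```

In this toy setting the hub concepts (`vehicle`, `container`) are visited more often than their leaves, so a longer walk corpus repeats them more and gives an embedding model more co-occurrence evidence for their taxonomic neighbours. The resulting walks could then be fed, as sentences, to any word-embedding trainer.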

OPEN ACCESS