Published by De Gruyter, June 13, 2018

Mining the Web for New Words: Semi-Automatic Neologism Identification with the NeoCrawler

Daphné Kerremans and Jelena Prokić
From the journal Anglia


Lexical innovation is omnipresent and constantly at work. Studies aiming to understand the process of lexical innovation and the subsequent diffusion of neologisms therefore benefit from systematic methods of neologism identification. In the past, retrieval procedures largely consisted of manual participant observation and close reading. More recently, attempts have been made to design automated identification procedures assisted by state-of-the-art natural language processing techniques and tools. Beginning with a discussion of the most commonly used neologism detection methods and applications in linguistics, the present paper describes a semi-automatic approach to identifying new words on the web, the NeoCrawler’s Discoverer, which has been developed as part of a project on the incipient diffusion of lexical innovations. Every day, the Discoverer processes large batches of online text in English and automatically identifies unknown grapheme sequences as potential neologism candidates by means of a dictionary matching procedure, in which the individual tokens are matched against a very large dictionary. These potential neologisms are subsequently presented to the user, who manually evaluates their neologism status. Finally, accepted candidates are added to the NeoCrawler’s database for continuous close monitoring of their development in the online speech community. We argue that dictionary matching offers an efficient method for semi-automatically extracting potential instances of lexical innovation, with high precision and high recall compared to previous approaches.
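The dictionary matching step described above can be illustrated with a minimal sketch. It is not the Discoverer's actual implementation: the token pattern, the function name find_candidates, the toy word list and the sample sentence are all assumptions made for illustration. The sketch lower-cases each token, looks it up in a set of known words, and counts the tokens that are not found, which would then be handed to a human for evaluation.

```python
import re
from collections import Counter

# Hypothetical token pattern (an assumption, not taken from the paper):
# runs of ASCII letters, optionally joined by hyphens or apostrophes.
TOKEN_PATTERN = re.compile(r"[A-Za-z]+(?:[-'][A-Za-z]+)*")


def find_candidates(text, dictionary):
    """Count lower-cased tokens in `text` that are absent from `dictionary`."""
    tokens = (match.group(0).lower() for match in TOKEN_PATTERN.finditer(text))
    return Counter(token for token in tokens if token not in dictionary)


# Toy stand-in for the very large dictionary mentioned in the abstract.
known_words = {"the", "new", "is", "a", "word", "people", "love", "to"}

# Invented sample text; "detweet" is the only token missing from the toy list.
sample = "People love to detweet. Detweet is a new word."
candidates = find_candidates(sample, known_words)

# Candidates would be reviewed manually before being added to a database.
for token, count in candidates.most_common():
    print(token, count)
```

A set is used for the dictionary because membership tests stay fast even for very large word lists. A real system would also need to filter typos, proper nouns and non-English strings, which this sketch does not attempt.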

Published Online: 2018-6-13
Published in Print: 2018-6-11

© 2018 Walter de Gruyter GmbH, Berlin/Boston
