Abstract
This paper replicates the sociolectometric analyses of Geeraerts, Grondelaers, and Speelman (1999) with the help of distributional semantic modelling. We selected 14 concepts from the lexical field of football in Dutch and in Chinese. Instead of manually disambiguating the corpus occurrences, we explored a semi-automatic procedure based on token-based vector space models and cluster analysis. The experiments show that our workflow is efficient for detecting regional lexical variation in large-scale corpora. More specifically, the results reveal that removing semantic clusters whose most central members are tokens referring to senses other than the intended concept’s sense does have an impact on the sociolectometric distances. Furthermore, discarding entire clusters has consequences for the total concept frequency.