Corpus Linguistics and Linguistic Theory

Founded by Gries, Stefan Th. / Stefanowitsch, Anatol

Ed. by Wulff, Stefanie



A study on Chinese register characteristics based on regression analysis and text clustering

Renkui Hou
  • Corresponding author
  • School of Chinese Language and Literature, Ludong University, Yantai, Shandong, China
  • Department of Chinese and Bilingual Studies, The Hong Kong Polytechnic University, Kowloon, Hong Kong
/ Chu-Ren Huang / Hongchao Liu
Published Online: 2017-03-30 | DOI: https://doi.org/10.1515/cllt-2016-0062


This paper reports an innovative Chinese register study based on regression analysis of sentence length distributions and on text clustering. Although the end of a sentence is not conventionally marked in Chinese, we resolve this issue by treating the segments between periods, question marks, and exclamation marks as sentences, which can be further divided into simple and compound sentences. We likewise treat the segments between punctuation marks that express pauses in utterances as clauses. Using regression analysis, we find that the frequency distribution of sentence and clause lengths in Chinese can be fitted by the formula $F = aL^{b}c^{L}$, where $F$ is the frequency and $L$ is the sentence or clause length. Texts from different registers yield different fitted values of the parameters, so these values can serve to differentiate the registers. Finally, we use the parameters to represent and cluster texts from different registers. The successful clustering results further demonstrate that the fitted parameters are reliable linguistic characteristics of different registers. In terms of linguistic theory, our study shows that modeling sentence length in Chinese using sociological words (i.e., characters) is just as effective as using linguistic words.

Keywords: sentence length distribution; regression analysis; Chinese register; text clustering
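The fit described in the abstract can be illustrated with a minimal sketch. The abstract does not say how the regression was carried out, so the approach here is an assumption: taking logarithms of F = a·L^b·c^L gives ln F = ln a + b·ln L + L·ln c, which is linear in (ln a, b, ln c) and can be solved by ordinary least squares. The function name `fit_length_distribution` and the synthetic data are hypothetical and are not taken from the paper.

```python
import math


def fit_length_distribution(lengths, freqs):
    """Estimate (a, b, c) in F = a * L**b * c**L by least squares on ln F.

    Assumes every length and every frequency is strictly positive,
    because both are passed through a logarithm.
    """
    # One design-matrix row per observed length: [1, ln L, L].
    rows = [(1.0, math.log(L), float(L)) for L in lengths]
    y = [math.log(F) for F in freqs]

    # Normal equations (X^T X) beta = X^T y as an augmented 3x4 matrix.
    A = [
        [sum(r[i] * r[j] for r in rows) for j in range(3)]
        + [sum(r[i] * t for r, t in zip(rows, y))]
        for i in range(3)
    ]

    # Gauss-Jordan elimination with partial pivoting.
    for col in range(3):
        pivot = max(range(col, 3), key=lambda k: abs(A[k][col]))
        A[col], A[pivot] = A[pivot], A[col]
        for k in range(3):
            if k != col:
                factor = A[k][col] / A[col][col]
                A[k] = [x - factor * p for x, p in zip(A[k], A[col])]

    ln_a, b, ln_c = (A[i][3] / A[i][i] for i in range(3))
    return math.exp(ln_a), b, math.exp(ln_c)


# Synthetic check with made-up parameters (not values from the paper):
# generate frequencies from known (a, b, c) and recover them.
lengths = list(range(1, 31))
freqs = [500 * L**1.5 * 0.8**L for L in lengths]
a, b, c = fit_length_distribution(lengths, freqs)
print(round(a, 3), round(b, 3), round(c, 3))
```

Fitting on the log scale weights relative rather than absolute errors, so the estimates can differ from those of a nonlinear least-squares fit on the raw frequencies. Length classes with zero observed frequency would need to be dropped or smoothed before taking logarithms.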



About the article

Citation Information: Corpus Linguistics and Linguistic Theory, ISSN (Online) 1613-7035, ISSN (Print) 1613-7027, DOI: https://doi.org/10.1515/cllt-2016-0062.

© 2017 Walter de Gruyter GmbH, Berlin/Boston.
