Journal of Official Statistics

The Journal of Statistics Sweden

Open Access
Three Methods for Occupation Coding Based on Statistical Learning

Hyukjun Gweon
  • Corresponding author
  • Department of Statistics and Actuarial Science, University of Waterloo, 200 University Avenue West, Waterloo, Ontario N2L 3G1 Canada
Matthias Schonlau
  • Department of Statistics and Actuarial Science, University of Waterloo, 200 University Avenue West, Waterloo, Ontario N2L 3G1 Canada
Lars Kaczmirek
  • GESIS – Leibniz-Institute for the Social Sciences, PO Box 12 21 55, D-68072 Mannheim, Germany
Michael Blohm
  • GESIS – Leibniz-Institute for the Social Sciences, PO Box 12 21 55, D-68072 Mannheim, Germany
Stefan Steiner
  • Department of Statistics and Actuarial Science, University of Waterloo, 200 University Avenue West, Waterloo, Ontario N2L 3G1 Canada
Published Online: 2017-02-21 | DOI: https://doi.org/10.1515/jos-2017-0006


Occupation coding, an important task in official statistics, refers to coding a respondent’s text answer into one of many hundreds of occupation codes. To date, occupation coding is still at least partially conducted manually, at great expense. We propose three methods for automatic coding: combining separate models for the detailed occupation codes and for aggregate occupation codes, a hybrid method that combines a duplicate-based approach with a statistical learning algorithm, and a modified nearest neighbor approach. Using data from the German General Social Survey (ALLBUS), we show that the proposed methods improve on both the coding accuracy of the underlying statistical learning algorithm and the coding accuracy of duplicates where duplicates exist. Further, we find that defining duplicates based on n-gram variables (a concept from text mining) is preferable to defining them based on exact string matches.

Keywords: Automated coding; Machine learning; ISCO-88; ALLBUS
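
The contrast between exact-string duplicates and n-gram-based duplicates, and the idea of falling back to a learning algorithm when no duplicate exists, can be illustrated with a minimal Python sketch. It is not the authors' implementation. The normalization (lowercasing, word unigrams), the majority vote among duplicates, the `fallback` classifier, the toy answers, and the placeholder codes are all assumptions made here for illustration.

```python
import re
from collections import Counter, defaultdict


def exact_key(text):
    """Duplicate key based on an exact string match (no normalization)."""
    return text


def ngram_key(text, n=1):
    """Duplicate key based on the set of word n-grams of the answer.

    Assumption: answers are lowercased and split on non-word characters;
    the article's abstract does not specify the exact n-gram definition.
    """
    words = re.findall(r"\w+", text.lower())
    return frozenset(" ".join(words[i:i + n]) for i in range(len(words) - n + 1))


def build_index(answers, codes, key_fn):
    """Map each duplicate key to a count of the codes seen in training."""
    index = defaultdict(Counter)
    for answer, code in zip(answers, codes):
        index[key_fn(answer)][code] += 1
    return index


def hybrid_predict(text, index, key_fn, fallback):
    """Use the most frequent code among duplicates, else a fallback model."""
    key = key_fn(text)
    if key and key in index:
        return index[key].most_common(1)[0][0]
    return fallback(text)


# Toy training data with placeholder codes (not real ISCO-88 codes).
train_answers = ["Lehrer", "Krankenschwester", "Bäcker"]
train_codes = ["CODE_A", "CODE_B", "CODE_C"]


def fallback(text):
    """Hypothetical stand-in for a trained statistical learning model."""
    return "MODEL_PREDICTION"


exact_index = build_index(train_answers, train_codes, exact_key)
ngram_index = build_index(train_answers, train_codes, ngram_key)

new_answer = "lehrer."
exact_result = hybrid_predict(new_answer, exact_index, exact_key, fallback)
ngram_result = hybrid_predict(new_answer, ngram_index, ngram_key, fallback)
print(exact_result, ngram_result)
```

In this toy case the answer "lehrer." differs from the training answer "Lehrer" only in case and punctuation, so the exact-string definition finds no duplicate and defers to the fallback model, while the n-gram definition recognizes it as a duplicate.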


About the article

Received: 2016-03-01

Revised: 2016-10-01

Accepted: 2016-10-01

Published Online: 2017-02-21

Published in Print: 2017-03-01

Citation Information: Journal of Official Statistics, ISSN (Online) 2001-7367, DOI: https://doi.org/10.1515/jos-2017-0006.

© by Hyukjun Gweon. This work is licensed under the Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 License. BY-NC-ND 4.0