Unit Under Test Identification Using Natural Language Processing Techniques

Matej Madeja 1  and Jaroslav Porubän 2
  • 1 Department of Computers and Informatics, Technical University of Košice, 042 00, Košice
  • 2 Department of Computers and Informatics, Technical University of Košice, 042 00, Košice


Identification of the unit under test (UUT) is often difficult because of test smells, such as testing multiple UUTs in one test. Because tests best reflect the current product specification, they can be used to comprehend parts of the production code and the relationships between them. Since a test and its UUT share a similar vocabulary, this paper applies five natural language processing (NLP) techniques to the source code of five popular GitHub projects. The collected results were compared with manually identified UUTs. The tf-idf model achieved the best accuracy: 22% for ranking the correct UUT first, and 57% when the manually identified UUT is tolerated anywhere up to fifth place in the ranking. These results were obtained after preprocessing the input documents by removing Java keywords and splitting words. The tf-idf model also had the shortest training time, and an index search takes within 1 s per request, so it could serve as a support tool in an Integrated Development Environment (IDE) in the future. For document preprocessing, word splitting improved accuracy the most, while removing Java keywords brought only a small improvement to the tf-idf results. Removing comments slightly worsened the accuracy of the NLP models. Word splitting was also the fastest preprocessing step, taking 0.3 s on average for all documents in a project.
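
The pipeline summarized above (Java keyword removal, word splitting, then tf-idf ranking of production classes against a test) can be illustrated with a minimal sketch. This is not the authors' implementation. The function names, the partial keyword list, the order of the two preprocessing steps, the smoothed idf variant, and the use of cosine similarity are all assumptions made for illustration, and only the Python standard library is used.

```python
import math
import re
from collections import Counter

# Illustrative subset of Java keywords (assumption: the real list is longer).
JAVA_KEYWORDS = {
    "public", "private", "protected", "class", "void", "static", "final",
    "new", "return", "import", "package", "int", "boolean", "if", "else",
    "for", "while", "this", "throws", "extends", "implements",
}


def preprocess(source):
    """Drop Java keywords, then split identifiers into lowercase words."""
    tokens = []
    for identifier in re.findall(r"[A-Za-z_][A-Za-z0-9_]*", source):
        if identifier in JAVA_KEYWORDS:
            continue
        # Split on underscores and camelCase boundaries,
        # e.g. testParseInvoice -> test, parse, invoice.
        parts = re.findall(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|[0-9]+", identifier)
        tokens.extend(part.lower() for part in parts)
    return tokens


def build_idf(docs):
    """Smoothed inverse document frequency (one common variant)."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    return {t: math.log((1 + n) / (1 + d)) + 1.0 for t, d in df.items()}


def vectorize(tokens, idf):
    """Sparse tf-idf vector; terms unseen in the corpus are ignored."""
    tf = Counter(tokens)
    return {t: count * idf[t] for t, count in tf.items() if t in idf}


def cosine(a, b):
    """Cosine similarity of two sparse vectors stored as dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def rank_candidates(test_source, production_sources):
    """Return production class names ordered by similarity to the test."""
    names = list(production_sources)
    docs = [preprocess(production_sources[name]) for name in names]
    idf = build_idf(docs)
    query = vectorize(preprocess(test_source), idf)
    scored = [(cosine(query, vectorize(doc, idf)), name)
              for name, doc in zip(names, docs)]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [name for _, name in scored]
```

Under these assumptions, the first element returned by `rank_candidates` corresponds to the "right UUT" criterion, and checking whether the manually identified class appears in the first five elements corresponds to the tolerance of up to fifth place.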



