Abstract
In this research, we analyze the effect of lightweight syntactic feature extraction techniques from the field of information retrieval on log abstraction in information security. To this end, we evaluate three feature extraction techniques and three clustering algorithms on four different security datasets for anomaly detection. The results show that these techniques are useful for log abstraction: the syntactic features they extract improve the identification of anomalous minority classes, specifically in homogeneous security datasets.
Funding statement: This research was enabled by the support of the Natural Sciences and Engineering Research Council of Canada Alliance Grant and 2Keys Corporation.
About the authors

Rafael Copstein graduated with a Bachelor of Computer Science from the Pontifical Catholic University of Rio Grande do Sul (PUCRS), Brazil, in 2019, first in his class as recognized by the Brazilian Computer Society (SBC). In that same year, he was awarded a scholarship at Dalhousie University, Canada, to pursue a Master’s Degree in Computer Science under the supervision of Dr. Nur Zincir-Heywood in the area of Computer Networks. In 2021, at the invitation of his supervisor, he upgraded his degree to a Ph.D. in Computer Science. Shortly after, he was awarded the Nova Scotia Graduate Scholarship (NSGS) for the relevance of his research to the province of Nova Scotia and for his academic achievements.

Egil Karlsen is a Master of Computer Science student at Dalhousie University. He received his Bachelor of Computer Science with First Class Honors from Dalhousie University, where he performed research in the area of vulnerability analysis, authentication, and authorization services, specifically in the OAuth protocol. His research interests include applications of machine learning in cyber security.

Jeff Schwartzentruber holds the position of Principal Data Scientist and Research Lead at 2Keys Corporation. Dr. Schwartzentruber received his PhD in Mechanical Engineering from Ryerson University with a focus on analytical process modelling and is a fellow of the Ontario Centre of Excellence. In his role at 2Keys, Dr. Schwartzentruber is responsible for the continued development, innovation and leadership of artificial intelligence (AI) and machine learning (ML) capabilities at the intersection of identity and access management, advanced threat analytics and response, and managed security services. Jeff’s research interests include machine learning (particularly deep learning and boosted trees), real-time anomaly detection, and analytical/semi-empirical model development for security and business applications.

Nur Zincir-Heywood is a University Research Professor and a Professor of Computer Science at Dalhousie University. Her research interests include machine learning for cyber security, network and service analysis. She has published over 200 fully reviewed papers and has been a recipient of multiple best paper awards. She serves as an Associate Editor of the IEEE Transactions on Network and Service Management and Wiley International Journal of Network Management. She also promotes information communication technologies to wider audiences as a tech columnist for CBC Information Morning and a Board Member on CS-Can/INFO-Can.

Malcolm Heywood is a Professor of Computer Science at Dalhousie University, Canada. He has a particular interest in the use of machine learning to discover simple solutions to complex tasks. His research with the tangled program graph algorithm resulted in a silver medal at the 2018 Humies Competition. Dr. Heywood is a member of the editorial board for Genetic Programming and Evolvable Machines (Springer). He was a track co-chair for the ACM GECCO GP track in 2014, a co-chair for the European Conference on Genetic Programming in 2015 and 2016, and a co-chair for the ACM GECCO GECH track in 2021 and 2022.
Acknowledgment
The first author gratefully acknowledges the support of the Province of Nova Scotia. The research was conducted as part of the Dalhousie NIMS Lab: https://projects.cs.dal.ca/projectx/.
© 2022 Walter de Gruyter GmbH, Berlin/Boston