Abstract
In this research, we analyze the effect of lightweight syntactic feature extraction techniques from the field of information retrieval on log abstraction in information security. To this end, we evaluate three feature extraction techniques and three clustering algorithms on four different security datasets for anomaly detection. The results demonstrate that these techniques have a role to play in log abstraction: the syntactic features they extract improve the identification of anomalous minority classes, specifically in homogeneous security datasets.
Funding statement: This research was enabled by the support of the Natural Sciences and Engineering Research Council of Canada Alliance Grant and 2Keys Corporation.
Acknowledgment
The first author gratefully acknowledges the support of the Province of Nova Scotia. The research was conducted as part of the Dalhousie NIMS Lab at https://projects.cs.dal.ca/projectx/.
© 2022 Walter de Gruyter GmbH, Berlin/Boston