Published by De Gruyter Oldenbourg, March 23, 2022

Exploring syntactical features for anomaly detection in application logs

Rafael Copstein, Egil Karlsen, Jeff Schwartzentruber, Nur Zincir-Heywood and Malcolm Heywood

Abstract

In this research, we analyze the effect of lightweight syntactical feature extraction techniques from the field of information retrieval on log abstraction in information security. To this end, we evaluate three feature extraction techniques and three clustering algorithms for anomaly detection on four different security datasets. The results demonstrate that these techniques have a role to play in log abstraction: the syntactic features they extract improve the identification of anomalous minority classes, specifically in homogeneous security datasets.
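As a rough illustration of the kind of pipeline the abstract describes, the sketch below pairs one syntactic feature extractor with one clustering algorithm and flags the smallest cluster as the anomalous minority. The abstract does not name the techniques evaluated, so TF-IDF weighting and k-means are assumptions chosen for illustration, as are the whitespace tokenizer, the sample log lines, and every function name. It is a minimal sketch, not the authors' implementation.

```python
import math
from collections import Counter


def tokenize(line):
    # Hypothetical tokenizer: lowercase and split on whitespace only.
    return line.lower().split()


def tfidf_vectors(lines):
    # Build one TF-IDF vector per log line over a shared, sorted vocabulary.
    docs = [Counter(tokenize(line)) for line in lines]
    n = len(docs)
    # Iterating a Counter yields each distinct token once per line,
    # so this counts document frequency.
    df = Counter(tok for doc in docs for tok in doc)
    vocab = sorted(df)
    vectors = []
    for doc in docs:
        total = sum(doc.values()) or 1  # guard against empty lines
        vectors.append([(doc[t] / total) * math.log(n / df[t]) for t in vocab])
    return vocab, vectors


def dist2(a, b):
    # Squared Euclidean distance.
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(vectors, k=2, iters=20):
    # Deterministic farthest-point initialisation (an illustrative choice).
    centroids = [vectors[0]]
    while len(centroids) < k:
        centroids.append(
            max(vectors, key=lambda v: min(dist2(v, c) for c in centroids))
        )
    labels = [0] * len(vectors)
    for _ in range(iters):
        # Assignment step: nearest centroid for each vector.
        labels = [
            min(range(k), key=lambda j: dist2(v, centroids[j])) for v in vectors
        ]
        # Update step: each centroid becomes the mean of its members.
        for j in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == j]
            if members:
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels


# Made-up access-log fragments: four routine requests and one odd one.
lines = [
    "GET /index.html 200",
    "GET /about.html 200",
    "GET /contact.html 200",
    "GET /index.html 200",
    "GET /search?q=1' OR '1'='1 500",
]

vocab, vectors = tfidf_vectors(lines)
labels = kmeans(vectors, k=2)

# Treat the smallest cluster as the candidate anomalous minority class.
minority = Counter(labels).most_common()[-1][0]
anomalies = [i for i, lab in enumerate(labels) if lab == minority]
print(anomalies)
```

Treating the smallest cluster as anomalous is itself a simplifying assumption; it only makes sense when normal traffic dominates the log.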

Funding statement: This research was enabled by the support of the Natural Sciences and Engineering Research Council of Canada Alliance Grant and 2Keys Corporation.

Acknowledgment

The first author gratefully acknowledges the support of the Province of Nova Scotia. The research was conducted as part of the Dalhousie NIMS Lab (https://projects.cs.dal.ca/projectx/).

Received: 2021-11-27
Revised: 2022-02-16
Accepted: 2022-02-19
Published Online: 2022-03-23
Published in Print: 2022-04-26

© 2022 Walter de Gruyter GmbH, Berlin/Boston