Towards a better similarity algorithm for host-based intrusion detection system

Abstract: An intrusion detection system plays an essential role in system security by discovering and preventing malicious activities. Over the past few years, several research projects on host-based intrusion detection systems (HIDSs) have been carried out utilizing the Australian Defense Force Academy Linux Dataset (ADFA-LD). These HIDSs have also been subjected to various algorithm analyses to enhance their detection capability, aiming for high accuracy and low false alarm rates. However, less attention has been paid to the actual implementation of real-time HIDSs. Our principal objective in this study is to create a performant real-time HIDS. We propose a new model, "Better Similarity Algorithm for Host-based Intrusion Detection System" (BSA-HIDS), using the same ADFA-LD dataset. The proposed model uses three classifications to represent the attack folder according to certain criteria; the entire system call sequence is used. Furthermore, this work uses textual distance and compares five algorithms, namely Levenshtein, Jaro-Winkler, Jaccard, Hamming, and Dice coefficient, to classify a system call trace as attack or non-attack based on the notions of inter-class decoupling and intra-class coupling. The model can detect zero-day attacks because of the threshold definition. The experimental results show good real-time detection performance for the Levenshtein/Jaro-Winkler algorithms: 99-94% detection rate, 2-5% false alarm rate, and 3,300-720 s running time, respectively.


Introduction
Concern over the frequency of cyberattacks has grown recently due to the Internet's rapid development. Around 32% of businesses and 22% of charities in the United Kingdom alone reported experiencing a cyberattack in 2019 [1]. Such attacks can be found using an intrusion detection system (IDS). Despite the tremendous success of the established intrusion detection methods, there has been an increase in interest in either enhancing the current methods or developing new ones [2].
It is much more difficult to identify zero-day attacks (attacks that have never been seen before), as no pattern or signature can be utilized to identify them. In addition, the rate of production of system call sequence data required to access, control, or manage connected devices increases rapidly. Therefore, there is a growing demand for effective intrusion detection algorithms that recognize, isolate, and handle suspicious patterns in system call sequences. Effectively, a system call is a programmatic method through which a computer application asks the kernel of the operating system it runs on for a service. It offers a point of contact between processes and the operating system so that user-level processes can ask for its services. A trace representing the monitored process's behavior is thus created by a series of system calls, which correspond to the sequential list of service requests issued by a process to the kernel. A well-known example of a dataset containing system call traces is the Australian Defense Force Academy Linux Dataset (ADFA-LD), provided by the Australian Defense Force Academy. The ADFA-LD dataset has been used in numerous papers for intrusion detection. Specifically, it was created to assess anomaly detection and system call-based host-based intrusion detection systems (HIDSs).
In contrast to many other datasets used to evaluate HIDSs, the ADFA-LD is based on Linux local servers and is composed of thousands of system call traces for the most recent attacks and vulnerabilities in various applications. It also reflects the features of current Linux-based operating systems. Given this, it is anticipated that the ADFA-LD will establish a new benchmark for evaluating HIDSs. According to Marteau [3], the following factors make it difficult to identify anomalous system call sequences in ADFA-LD:
• The anomalies are context-dependent. Whether a system call is abnormal can be determined from its context, or more specifically, the system calls that come before and after it. A sequence taken as a whole may be judged abnormal.
• Because system call sequences can vary in length and the alphabet used is relatively large (more than 300 system calls for the Linux system), system call sequence variability is very high.
To compare two sequences side by side, or to offer multiple sequence alignments for a set of sequences, many different similarity measures have been proposed in bioinformatics over the years. Indeed, similarity is a vague concept that can only be treated quantitatively using an appropriate mathematical representation of the objects to be compared and a comparison metric.
Data similarity analysis nowadays gathers numerous methods and tools to discover the "essential" information in a text, identify its elements, and find their similarities and differences. The way to check the similarity between data points or data groups is to calculate the distance between them. For textual data, we likewise check the similarity between strings by calculating the distance between one text and another. Such measures have successfully been adapted to handle sequences of system calls.
To address the limitations cited above, we describe in this study a new prototype, BSA-HIDS, which stands for Better Similarity Algorithm for Host-based Intrusion Detection System. We use the system call sequences of the ADFA-LD benchmark; each sequence, regardless of its length, is taken into account as a whole. As far as we know, this technique has not yet been proposed for system call sequence comparison, particularly in intrusion detection. It measures how close a sequence is to known sequences to conclude whether it is an attack or normal. The contributions of this study are as follows:
• The whole ADFA-LD dataset is used; that is to say, whole system call sequences are used, without depending on a window size. Indeed, if a system call sequence contains 242 system calls, this model takes the entire sequence, unlike other models, which take just the first 100 system calls or a specified window size.
• The performances of the three classifications in terms of detection rate, false positive rate, and false negative rate, compared across five similarity measures, are almost identical. This means that whatever the attack classification, the model performs very well due to how the threshold is defined (the threshold is variable; it is calculated for each sequence).
• The preprocessing time to build the three classifications is negligible, about 3 s (training time).
• There are no parameters to learn.
• The proposed model can detect zero-day attacks.
• The performance of the anomaly IDS is improved, with a lower false negative rate of 0.04, a false positive rate of 0.0, and a higher accuracy of 0.99 in a short running time of 3,300 s, obtained by the BSA algorithm running on ten processor cores using the Levenshtein algorithm. This result is better than that of recent works [3,4].
Section 2 of this article briefly reports the related key works. We detail the edit distance-based algorithms in Section 3. We describe the methodology in Section 4, discussing the experimental dataset, which is the system call sequence data released by Creech and Hu [5], the data preprocessing using three attack classifications, and, finally, the detection principle used in this study. In Section 5, we discuss the results, and in Section 6, we provide the conclusion and directions for future research.

Related work
Computer security has become essential in protecting the integrity of information technology, such as computer systems, networks, and data from attack, damage, or unauthorized access. Several types of research have been conducted in this broad and multifaceted paradigm. Pavithran et al. [6] proposed a novel encryption process to protect a system from attacks. It is based on Deoxyribonucleic acid (DNA) cryptography, a hyperchaotic system, and a Moore machine. Namasudra [7] proposed a novel cryptosystem using DNA cryptography and DNA steganography for the cloud-based IoT infrastructure. Das and Namasudra [8] proposed a novel ciphertext policy attribute-based encryption (CP-ABE)-based fine-grained access control scheme to solve the attribute revocation problem in the CP-ABE technique utilized very often in an IoT-based healthcare system for encrypting patients' healthcare data. Also, many other research works [9][10][11] address the security of systems.
In this work, we are interested in intrusions. Effectively, a device that monitors a system for potential intrusions is called an IDS; it is a crucial tool for detecting security violations in real time. If the detection occurs on a network, the IDS is referred to as a NIDS, and if it occurs on a host, it is referred to as an HIDS. Furthermore, we differentiate between two approaches. The signature-based intrusion detection approach detects attacks by matching predefined specific patterns, such as known malicious instruction sequences used by malware or byte sequences in network packets, while the anomaly-based intrusion detection approach was primarily developed to detect unknown attacks (zero-day attacks). Both approaches have weaknesses: anomaly-based IDSs are criticized for producing many false alarms, whereas signature-based IDSs are criticized for being unable to identify zero-day attacks.
Many system call-based anomaly detection models have been developed to increase detection rates and reduce false alarm rates in HIDSs [2,3,12-16]. If we restrict the focus to anomaly detection in sequential data, we find that four basic approaches, according to Marteau [3], are taken to treat symbolic sequences: window-based approaches, global kernel-based approaches, generative approaches, and language-based approaches.
The first approach uses a fixed window size defined in advance, and the window slides along the sequence progressively. This is the most widely used of the four methods mentioned above, and its popularity is due to the many machine learning and statistical knowledge-based techniques that can be applied [14,15]. In contrast, the second approach uses the whole sequence: a similarity measure is applied to each pair of sequences to give their distance. These methods have their origins in bioinformatics or text processing.
The third approach generally includes recurrent neural networks (RNN), long short-term memory (LSTM), and hidden Markov models (HMM), which have all been employed successfully on various intrusion detection tasks [17]. However, HMM algorithms have been criticized for being difficult to compute and for the poor performance resulting from their brief dependence on initial system calls. The last method was initially proposed to improve a vector space model by separating essential n-gram features. Recently, a much more ambitious model suggested creating phrases, sentences, and, ultimately, a language out of sequences of system calls [5]. The proposed model fits into the "global kernel-based approaches." This section reviews a few useful strategies researchers have suggested during the last 10 years, especially those enabling system call analysis. All these works follow one of the four approaches cited above. Moreover, Table 1 highlights the successes and shortcomings of each of those works.

Table 1: Successes and shortcomings of recent works using ADFA-LD.
• ADFA-LD. Anomaly detection algorithm using distinct short-sequence extraction from system call traces. Successes: detection of zero-day attacks; since it can learn quickly and gradually, it can adapt to environmental changes without completely rebuilding the classifier. Shortcomings: the false alarm rate needs to be decreased by improving the extraction and classification algorithms; the abnormality threshold value is determined empirically and has to be determined automatically. Results: 90.48% detection rate, 22.5% false alarm rate; learning time of about 30 s; no detection time discussed.
• [13] ADFA-LD. Constructs embedding vectors for all system calls and models the sequences with system call embedding and weighting. Successes: to improve detection performance, the sequence embedding model presented is the first to convert system call sequences into embedding vectors, and it shows good performance. Shortcomings: to make this model efficient, the running time, or at least the detection time, must be discussed. Results: false positive rate of 5.3%, true positive rate of 91.7%; time not discussed.
• [14] ADFA-LD. Convolutional neural network with LSTM, using a fixed-size window to define the system call sequence. Successes: a high detection rate. Shortcomings: the false alarm rate is not discussed. Results: accuracy of 96%; time not discussed.
A good and performant IDS, whether a HIDS or a NIDS, must be able to provide at least results for these metrics: accuracy, false alarm rate, and detection time. As seen in Table 1, most of the papers cited give results for just one or two of these metrics. In the following sections, we present the new BSA-HIDS model, which addresses all the metrics, and we show its effectiveness compared to the recent works cited in this table.

Edit distance-based algorithms
There are several algorithms for calculating the distance between texts, and the computational strategy of these algorithms differs according to their views of the string. They are thus sorted into four categories: the first includes those based on calculating the edit distance (character by character), the second those based on words (tokens), the third those based on word sequences, and the fourth those based on phonetic meaning [19]. We are interested in the first two categories.
The first category of algorithms determines how many steps must be taken to transform one string into another; as the number of operations rises, the similarity between the two strings declines. The second category, by contrast, takes a set of tokens (words) as input rather than full strings. The more tokens the two sets have in common, the more similar they are to one another. We examine the following algorithms.

Hamming distance
The Hamming distance is equivalent to the minimum number of substitutions required to move from the representation of string1 to that of string2. The substitution corresponds to replacing an element in the representation of string1 with a new element to get closer to the representation of string2 [20].
Let E be an alphabet of symbols and C a subset of E^n, the set of words of length n over E. The Hamming distance between two words A = (a_1, ..., a_n) and B = (b_1, ..., b_n) of C, denoted d(A, B), is defined as the number of places in which A and B differ, that is,

d(A, B) = |{ i ∈ {1, ..., n} : a_i ≠ b_i }|.

The Hamming distance satisfies 0 ≤ d(A, B) ≤ n, and d(A, B) = 0 if and only if A = B.
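The definition above translates directly into code. The following is a minimal sketch; normalizing by the sequence length is an assumption we make here to match the [0, 1] convention used later in the paper (0 means similar, 1 means different).

```python
def hamming_distance(a, b):
    """Number of positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance requires equal-length sequences")
    return sum(x != y for x, y in zip(a, b))

def normalized_hamming(a, b):
    """Scale into [0, 1]: 0 means identical, 1 means different everywhere."""
    return hamming_distance(a, b) / len(a) if a else 0.0
```

For example, `hamming_distance("karolin", "kathrin")` is 3, since the two words differ in exactly three positions.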

Levenshtein distance
The number of adjustments needed to convert one string into another is counted to determine this distance. This algorithm modifies the first string to match the second one using insertion, deletion, and replacement. The Levenshtein distance between two strings A and B is given by lev_{A,B}(|A|, |B|) [21], where:

lev_{A,B}(i, j) = max(i, j) if min(i, j) = 0, and otherwise
lev_{A,B}(i, j) = min( lev_{A,B}(i-1, j) + 1, lev_{A,B}(i, j-1) + 1, lev_{A,B}(i-1, j-1) + 1_{(A_i ≠ B_j)} ),

where the indicator 1_{(A_i ≠ B_j)} equals 0 when A_i = B_j and 1 otherwise.
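The recurrence above is usually evaluated with dynamic programming rather than recursion. A minimal sketch (not the rapidfuzz implementation used in the experiments) follows; normalizing by the longer string's length is an assumption matching the paper's [0, 1] convention.

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    m, n = len(a), len(b)
    # prev[j] holds the distance between a[:i-1] and b[:j] for the previous row
    prev = list(range(n + 1))
    for i in range(1, m + 1):
        curr = [i] + [0] * n
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            curr[j] = min(prev[j] + 1,          # deletion
                          curr[j - 1] + 1,      # insertion
                          prev[j - 1] + cost)   # substitution
        prev = curr
    return prev[n]

def normalized_levenshtein(a, b):
    """Normalize into [0, 1] by the longer string's length."""
    longest = max(len(a), len(b))
    return levenshtein(a, b) / longest if longest else 0.0
```

The classic example `levenshtein("kitten", "sitting")` yields 3 (two substitutions and one insertion).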

Jaro-Winkler distance
Two strings receive high scores from this method if (1) they have matching characters close to one another and (2) the matching characters are in the same order. The Winkler variant, therefore, increases the Jaro similarity measure for matching initial characters.
The Jaro similarity is defined as

sim_J = 0 if m = 0, and otherwise sim_J = (1/3) (m/|A| + m/|B| + (m - t)/m),

where m is the number of matching characters and t is half the number of transpositions. Two characters from A and B match if they are identical and their positions do not differ by more than ⌊max(|A|, |B|)/2⌋ - 1.

The Jaro-Winkler similarity is then

sim_JW = sim_J + l · p · (1 - sim_J),

where sim_J is the Jaro similarity, l is the length of the common prefix at the beginning of both strings, up to a maximum of 4, and p is a scaling factor. The scaling factor must not be greater than 0.25; otherwise, the similarity could exceed 1, because the prefix considered can only be four characters long. Winkler's original work used a value of p = 0.1.

Jaccard index
For this case of set similarity, the approach is to find the number of common tokens between two sets and divide it by the total number of unique tokens. It is described mathematically as follows [22]:

J(A, B) = |A ∩ B| / |A ∪ B|,

where A and B are two strings that have to be tokenized by the user. In our prototype, we tokenize the system call sequences contained in the attack folder using a space as the delimiter, converting system call numbers to tokens.
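A minimal sketch of this tokenization and the Jaccard index follows; the two trace strings are illustrative values, not actual ADFA-LD traces.

```python
def jaccard_similarity(seq1, seq2):
    """Jaccard index over the unique tokens of two space-delimited traces."""
    a, b = set(seq1.split()), set(seq2.split())
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

# Space-delimited system call numbers, as in ADFA-LD trace files.
trace1 = "6 6 63 6 42 120 6 195"
trace2 = "6 63 42 120 11"
```

Here `trace1` tokenizes to the set {6, 63, 42, 120, 195} and `trace2` to {6, 63, 42, 120, 11}, so the index is 4/6; note that the set conversion discards duplicate occurrences of the same system call number.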

Dice coefficient
For this case of set similarity, the approach is to combine the two sets, look for the common tokens, and divide twice their number by the total number of tokens:

Dice(A, B) = 2 |A ∩ B| / (|A| + |B|).

It is based on the idea that if a token appears in both strings, its total count must be twice the intersection, so the intersection (which removes duplicates) is doubled in the numerator. The denominator combines the tokens of both strings. Recall that the denominator of Jaccard's equation was the union of the two strings; this one is very different. Like the intersection, the union eliminates duplicates, and the Dice mechanism avoids this. Dice will therefore tend to overstate how similar two strings are [23].
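Because the Dice denominator counts every token rather than the deduplicated union, a multiset implementation is natural. The sketch below uses `collections.Counter`; treating the traces as token multisets is our reading of the description above.

```python
from collections import Counter

def dice_coefficient(seq1, seq2):
    """Dice coefficient over token multisets: duplicates count, unlike Jaccard."""
    a, b = Counter(seq1.split()), Counter(seq2.split())
    overlap = sum((a & b).values())           # multiset intersection size
    total = sum(a.values()) + sum(b.values()) # all tokens from both traces
    return 2 * overlap / total if total else 1.0
```

For example, `dice_coefficient("6 6 63", "6 63 42")` is 2·2/6 = 2/3: the multiset intersection is {6, 63}, while the denominator counts all six tokens including the duplicate "6".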

Methodology
This section outlines the steps taken to implement the proposed HIDS. The experimental dataset used in this work and its preprocessing, based on three attack folder classifications, are described in Section 4.1. The BSA used in this study is described in Section 4.2.
Based on five metrics or similarity algorithms, BSA categorizes system call traces as either normal or attack data. A simplified systematic description of the method employed in the suggested study is shown in Figure 1.

ADFA-LD dataset and its preprocessing
Linux dataset ADFA-LD was created by Creech and Hu [5] using an auditing tool named auditd.
It was compiled using the fully patched Ubuntu 11.04 operating system with kernel 2.6.38. Numerous services, including a web server, database server, SSH server, and FTP server, were run by the operating system while capturing both attack and normal sequences.
As shown in Table 2, the ADFA-LD is divided into three distinct data folders; each folder has its own system call trace files. The Training data master (TDM) folder and Validation data master (VDM) folder represent the normal data. On the other hand, the Attack data master (ADM) folder represents attack data. The ADM folder includes six attack data types: "Adduser," "Hydra-FTP," "Hydra-SSH," "JavaMeterpreter," "Meterpreter," and "Web shell." The preprocessing in this approach consists of dividing the ADM folder into a set of groups. The other two TDM/VDM folders are not divided. After thoroughly studying the ADM files, we propose the classifications shown in Figure 2 and Tables 3 and 4. It was noticed that all system call traces with the same number at the end of the file name contain the same system calls but with different occurrence counts. Classifications 2 and 3 were therefore proposed based on this observation.
In classification 2, we partitioned the files by type of attack, and each type of attack was then partitioned into a set of groups. Each group contains files with similar names, as shown in Table 3. The same principle was used to define the set of groups in classification 3; this time, however, the groups were defined with respect to the whole set of attacks. Table 4 shows this distribution.
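The grouping rule, traces sharing the trailing number in their file names go together, can be sketched as below. The file names used here are hypothetical placeholders; the actual ADFA-LD naming convention may differ.

```python
import re
from collections import defaultdict

def group_by_trailing_number(filenames):
    """Group trace files whose names end with the same number (hypothetical naming)."""
    groups = defaultdict(list)
    for name in filenames:
        match = re.search(r"(\d+)(?:\.txt)?$", name)
        key = match.group(1) if match else name
        groups[key].append(name)
    return dict(groups)

# Hypothetical attack trace file names for illustration only.
files = ["UAD-Hydra-FTP-1_961.txt", "UAD-Hydra-SSH-2_961.txt", "UAD-Adduser-3_1442.txt"]
```

With these placeholder names, the first two files land in group "961" and the third in group "1442".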

Principle of detection
The BSA normal and abnormal behavior detection algorithm is based on the following principles:
• The measures of the similarity algorithms are normalized such that a similarity between two strings that approaches 0 means that the two strings are similar, whereas, in contrast, a similarity close to 1 means that the two are different. This translates, in our case, into the coupling and decoupling factors.
• A trace attack that we want to test must be close to all the other trace attacks (the class to which it belongs depends on the classification chosen), which translates into a very low similarity. On the other hand, its similarity to the set of validation and training data must be very high.
• A valid trace we want to test must be close to all the other validation and training traces, resulting in a very low similarity, but its similarity to all attack traces must be very high.
• The threshold by which the test is carried out is variable, relative to each trace.
Note that in this algorithm, sim can be any of the similarity measures listed above (Levenshtein, Jaro-Winkler, Jaccard, Hamming, or Dice coefficient).

1: Input: S_A, S_T, S_V
2: Input: x (system call trace to test, x ∈ m_i or x ∈ S_V)
Note that if a < max(b, c) is true, that means we detect an attack, and the true positive (TP) count increments; otherwise, the false negative (FN) count increments. In addition, if a ≥ min(b, c) is true, then the true negative (TN) count increments; otherwise, the false positive (FP) count increments. Here, a, b, and c denote the average distances of the tested trace to the attack set, the training set S_T, and the validation set S_V, respectively. The two test rules in the BSA algorithm are defined based on the concepts of inter-class distance (decoupling factor) and intra-class distance (coupling factor).
The intra-class distance corresponds to the distance between attacks placed in an attack set m_i. A small intra-class distance between attacks belonging to the same set m_i can be explained in this case by the presence of the same system call numbers in the traces of this set or by the almost identical sequence of these system call numbers. The notion of intra-class distance makes it possible to highlight the heterogeneity of the sets m_i resulting from classifications 1, 2, and 3, in order to choose the best classification. The inter-class distance corresponds to the distance between the sets S_A, S_T, and S_V within the whole space of system call traces S. The further apart they are, the stronger the inter-class distance will be. Therefore, the distance between an attack and the set of all attacks must be less than the maximum of the average distance of that attack from all traces contained in the set S_T and its average distance from all traces contained in the set S_V. On the other hand, the distance between a normal trace and the set of all attacks must be greater than or equal to the minimum of the average distance of that normal trace from all traces contained in the set S_T and its average distance from all traces contained in the set S_V.
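The decision rule described above can be sketched as follows. `distance` stands for any of the five normalized measures; the flat averaging over whole sets is a simplified reading of the text, and `classify` is a hypothetical name.

```python
def mean_distance(x, traces, distance):
    """Average normalized distance between trace x and every trace in a set."""
    return sum(distance(x, t) for t in traces) / len(traces)

def classify(x, attack_set, training_set, validation_set, distance):
    """BSA decision rule: the threshold is recomputed per trace."""
    a = mean_distance(x, attack_set, distance)      # distance to attack class
    b = mean_distance(x, training_set, distance)    # distance to S_T
    c = mean_distance(x, validation_set, distance)  # distance to S_V
    # An attack must be closer to the attack class than to normal behavior.
    return "attack" if a < max(b, c) else "normal"
```

Because `max(b, c)` is recomputed for every tested trace, the threshold is variable rather than a single fixed cut-off, which is the property the model relies on.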

Experimental environment
Experiments used Python under VMware with Ubuntu 20.04.5 LTS Linux 64-bit, 24 GB of memory, and a 10-core processor. The Levenshtein and Jaro-Winkler algorithms come from the Python rapidfuzz library, and the other three algorithms come from the Python textdistance library. All these algorithms are normalized so that a value approaching 0 means that the sequences are similar, and a value approaching 1 means that the sequences are not similar.

Evaluation metrics
Using the sub-metrics TP, TN, FP, and FN, we assessed and examined each classification's performance via a confusion matrix, accuracy, FNR, FPR, precision, recall, and F1-score. For completeness, these measures are defined as follows. Confusion matrix: shows how many correct and incorrect predictions a model made. It considers all factors and can visually display results for each, making it a common evaluation tool, especially when attempting to understand and improve an algorithm's performance.
Accuracy = (TP + TN) / (TP + TN + FP + FN)
FNR = FN / (FN + TP)
FPR = FP / (FP + TN)
Precision = TP / (TP + FP)
Recall = TP / (TP + FN)
F1-score = 2 · Precision · Recall / (Precision + Recall)

The metrics provided above directly indicate each classifier's performance.
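These definitions translate directly into code; a minimal sketch follows (the counts in the usage note are illustrative, not results from the paper).

```python
def metrics(tp, tn, fp, fn):
    """Standard confusion-matrix metrics used to compare the classifications."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    fnr = fn / (fn + tp) if (fn + tp) else 0.0        # missed attacks
    fpr = fp / (fp + tn) if (fp + tn) else 0.0        # false alarms
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "FNR": fnr, "FPR": fpr,
            "precision": precision, "recall": recall, "F1": f1}
```

For instance, `metrics(tp=95, tn=900, fp=0, fn=5)` yields an accuracy of 0.995, a precision of 1.0, and a recall of 0.95.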

Results and discussion
To avoid having many graphs, we will present the results in this section concerning classifications 1, 2, and 3.
All distances used are normalized as follows. Let S_1 and S_2 be two system call sequences; Sim_Levenshtein(S_1, S_2) is the similarity using the Levenshtein algorithm, Sim_Jaro-Winkler the similarity using Jaro-Winkler, Sim_Jaccard the similarity using Jaccard, Sim_Hamming the similarity using Hamming, and Sim_Dice-coefficient the similarity using the Dice coefficient; rapidfuzz and textdistance are Python libraries. Tables 5-9 and Figures 3-7 show the performance in terms of accuracy, FNR, FPR, recall, precision, and F1-score for all three classifications under Levenshtein, Jaro-Winkler, Hamming, Jaccard, and Dice coefficient. With the first three similarity measures, we notice that classification 2 is better than the others (1 and 3). However, with the last two similarity measures, the algorithms perform better with classification 3. The bold values indicate the line of the classification that gave good results.
We notice that the folder named "Different" in each classification yields a significant number of false negatives. Take, for example, this folder in classification 2 for the JavaMeterpreter attack using the Levenshtein algorithm; the curve is shown in Figure 8.
It can be seen from Figure 8 that attack numbers 5, 10, 15, 20, 25, 26, 31, 42, 43, and 44 are poorly detected. This is because their textual distances are close to both the training and validation classes and far from the attack class.
Another observation is that if we define a single threshold to classify the system call sequences, we will have a very high false alarm rate. Indeed, consider the example of Figure 8 again. Suppose we define a threshold of 0.5, i.e., sequences with a textual distance below 0.5 are considered attack sequences, and those above 0.5 are considered normal sequences. Such a definition would classify all 56 sequences shown in Figure 8 as normal sequences, which they are not. This emphasizes the strength of the model: the variable definition of the threshold, which fits every system call sequence. Figure 9 and Table 10 show a comparison under classification 2 for the five similarity measures used. Jaro-Winkler and the Dice coefficient give the same recall of 0.95 and the same FNR of 0.04. However, the Jaro-Winkler algorithm performs better, giving higher values (accuracy of 0.94, precision of 0.72, and F1-score of 0.82) than the Dice coefficient (0.90, 0.61, and 0.74, respectively). Jaccard and Hamming give the same accuracy, 0.98, and the same F1-score, 0.95; however, the false positives of the Jaccard algorithm are greater than those of the Hamming algorithm. The last algorithm, Levenshtein, gives an outstanding performance: accuracy of 0.99, FNR of 0.04, FPR of 0, precision of 1, recall of 0.95, and F1-score of 0.97. These excellent results are obtained because this algorithm processes the whole system call sequences to measure similarity.
Before selecting the best method and classification, another parameter is taken into consideration: the elapsed time to implement the HIDS. Table 11 gives each algorithm's running time in seconds. We notice that Jaro-Winkler is the fastest with 720 s, followed by the Dice coefficient with 2,940 s, the Jaccard algorithm with 3,060 s, and the Levenshtein algorithm with 3,300 s; finally, Hamming took a significant processing time of 14,880 s. Jaro-Winkler's speed is due to the implementation in the rapidfuzz library, which executes this algorithm in 0.00094 s and Levenshtein in 0.00312 s. In contrast, the long execution time of the Hamming algorithm is due to its implementation in the textdistance library, which takes about 0.03531 s. It should be noted that the version of Levenshtein's algorithm described in the study [24] gives an accuracy of 1, FNR of 0, FPR of 0, and precision, recall, and F1-score of 1, but the implementation time is very long, about 1 month and 15 days with the capabilities of the virtual environment described above.
From the confusion matrices in Figure 10, it can be seen in more detail that the Jaccard, Hamming, and Levenshtein algorithms showed almost the same high performance level and displayed a similar trend regarding correct and incorrect classifications, whereas Jaro-Winkler and the Dice coefficient performed worse. While generally fastest in terms of elapsed time, Jaro-Winkler has the second-highest FN and FP counts, which are high relative to the other algorithms; Levenshtein, Jaccard, Hamming, and Dice coefficient show FN/FP of 30/0, 21/51, 31/43, and 35/442, respectively.
What caught our attention was the number of false negatives obtained by each algorithm, which ranged from 21 to 35. These values correspond to undetected attack sequences, meaning that these attack sequences resemble normal behavior (among the 833 TDM sequences). In this case, attack sequences may be identical to normal data sequences, and the model cannot detect them. We take the opportunity to note that if we eliminate these sequences from the ADM, the model becomes very efficient.
The most important thing to notice in this work is that all the described similarity measures give almost similar performances with different running times. This is due to how the presence of an attack is tested and, more precisely, how the threshold is defined. This is described in the BSA algorithm of this model.
To evaluate the proposed model, it is imperative to test it against other models in the same field. As Table 12 shows, all these models use ADFA-LD as a benchmark. We notice that BSA-HIDS (Jaro-Winkler) and BSA-HIDS (Levenshtein) give higher performance than the models in studies [3,4] in terms of accuracy and false alarm rate: 94% accuracy with 5% FAR, and 99% accuracy with 2% FAR, using just 720 and 3,300 s, respectively. The result achieved by Marteau [3] was 90% accuracy in 900 s, and Yaqoob and Madkour [4] achieved 90% accuracy and a 22% FAR; their running time was reported in seconds, but the exact figure is unclear.
The proposed model has been developed aiming to have a performant HIDS, which is achieved and displayed in Table 12. The obtained results of the proposed system, BSA-HIDS, are superior to all up-to-date published systems in terms of accuracy, false alarm rate, and detection time. Although this model produced encouraging results, it does have limitations. The model cannot detect attack sequences that are an exact sequence of normal sequences.

Conclusion and perspectives
In this work, to identify unusual system call sequences, we have designed and implemented BSA-HIDS, a novel algorithm based on sequence similarity measures. We used five similarity measures to test our model and selected the best-performing one. The use case determines which string similarity algorithm is chosen; to generate the similarity score, all the algorithms mentioned earlier, in one way or another, seek to identify the common and uncommon components of the strings. Compared to the most recent models, the main advantages of our model are the following:
• Its simplicity of implementation: no window size to define, no maximal length for the n-grams, and no hidden architectures (LSTM, HMM, or CNN).
• It can easily be used for online exploitation.
• It can detect zero-day attacks.
• The threshold definition takes each system call sequence into account.
• The proposed system provides the best combination of a high detection rate and a very short running time.
The observed accuracy is significantly higher compared to all recent systems. Additionally, the suggested model offers the ideal fusion of rapid response (running time) and high detection rate. Because of the definition of threshold, it has a high ability to recognize the zero-day attack and is flexible enough to react to environmental changes.
We have identified a shortcoming of the BSA-HIDS model: it cannot distinguish attack sequences that exactly match sequences in the training set TDM. However, any alternative approach would also need to handle this circumstance correctly.
Yet, certain shortcomings in the suggested HIDS still need to be considered in future work. The BSA algorithm needs to be improved to lower the false alarm rate. As part of future work, we plan to test the model's adeptness on other datasets such as UNM and NSL-KDD. We also aim to localize and delete attack sequences that are identical to normal sequences in order to test the model's efficiency.
Finally, to optimize the present work, we can first define a new similarity algorithm: we will rewrite the Levenshtein algorithm to operate on words rather than characters, thereby reducing the execution time. Indeed, since the largest system call numbers consist of three digits, the execution time can in the best case be reduced to 1,100 s (3,300/3 = 1,100 s). Second, we can minimize the number of files in each classification to minimize detection time.
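The proposed token-level rewrite can be sketched as follows; `token_levenshtein` is a hypothetical name, and tokenizing on spaces mirrors the delimiter already used for Jaccard and Dice.

```python
def token_levenshtein(trace1, trace2):
    """Levenshtein distance over whole system call numbers instead of characters."""
    a, b = trace1.split(), trace2.split()
    prev = list(range(len(b) + 1))
    for i, tok in enumerate(a, start=1):
        curr = [i]
        for j, other in enumerate(b, start=1):
            cost = 0 if tok == other else 1
            curr.append(min(prev[j] + 1,          # delete a token
                            curr[j - 1] + 1,      # insert a token
                            prev[j - 1] + cost))  # substitute a token
        prev = curr
    return prev[-1]
```

Each multi-digit system call number now costs a single edit operation rather than up to three character edits, which is the source of the hoped-for speedup.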