Similarity is a mathematical concept used to judge how much two data samples differ. It is usually expressed through a “distance”: the larger the distance, the smaller the similarity between the two data samples [12]. The samples being compared can be two numbers, two sequences or, more generally, two vectors. The localization and recognition algorithm for fuzzy anomaly data in the big data network uses the Euclidean distance as the standard for measuring similarity, and on this basis realizes the localization and recognition of the fuzzy anomaly data.

Let *D*_{e} be the Euclidean distance between the two sets of samples *X* and *Y* in *N*-dimensional space *L*. The formula for calculating *D*_{e} is as follows:

$${D}_{e}=\sqrt{\sum _{i=1}^{n}{({X}_{i}-{Y}_{i})}^{2}}$$(17)

where *X* = (*X*_{1}, *X*_{2}, *X*_{3}, · · · , *X*_{n}) and *Y* = (*Y*_{1}, *Y*_{2}, *Y*_{3}, · · · , *Y*_{n}).
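As a minimal sketch of the vector case, the Euclidean distance can be computed by summing the squared component differences and taking the square root. The function name `euclidean_distance` and the sample values are illustrative assumptions, not part of the original algorithm description.

```python
import math


def euclidean_distance(x, y):
    """Euclidean distance between two equal-length numeric sequences."""
    if len(x) != len(y):
        raise ValueError("samples must have the same dimension")
    # Sum the squared component differences, then take the square root.
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))


# Illustrative samples: the squared differences are 9, 16 and 0.
d = euclidean_distance([1.0, 2.0, 3.0], [4.0, 6.0, 3.0])
print(d)
```

Identical samples give a distance of 0, and the distance grows as the samples diverge, which is why a larger distance is read as a smaller similarity.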

Let *D*_{e}_{1} represent the Euclidean distance between two matrix samples *P* and *Q* in *N* × *M* space *S*. The formula for calculating *D*_{e}_{1} is as follows:

$${D}_{e1}=\sqrt{\sum _{i=1}^{n}\sum _{j=1}^{m}{({P}_{ij}-{Q}_{ij})}^{2}}$$(18)

Let *Sim*(*P*, *Q*) represent the similarity of *P* and *Q*, defined as the reciprocal of *D*_{e}_{1}. The formula for calculating *Sim*(*P*, *Q*) is:

$$Sim(P,Q)=\frac{1}{{D}_{e1}}=\frac{1}{\sqrt{\sum _{i=1}^{n}\sum _{j=1}^{m}{({P}_{ij}-{Q}_{ij})}^{2}}}$$(19)

where *P* and *Q* are again *N* × *M*-dimensional matrices, *i* = 1, 2, · · · , *n* and *j* = 1, 2, · · · , *m*.
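The matrix distance of Eq. (18) and the reciprocal similarity of Eq. (19) might be sketched as follows. The function names are hypothetical. The text does not say how to handle identical matrices, for which the distance is 0 and the reciprocal is undefined; returning infinity in that case is an assumption made here for illustration only.

```python
import math


def matrix_distance(p, q):
    """Euclidean distance between two matrices of equal shape (Eq. 18)."""
    if len(p) != len(q) or any(len(rp) != len(rq) for rp, rq in zip(p, q)):
        raise ValueError("matrices must have the same shape")
    return math.sqrt(
        sum((a - b) ** 2 for rp, rq in zip(p, q) for a, b in zip(rp, rq))
    )


def similarity(p, q):
    """Reciprocal of the matrix distance (Eq. 19)."""
    d = matrix_distance(p, q)
    # Assumption: identical matrices are treated as infinitely similar,
    # since 1/0 is not defined by the formula itself.
    return math.inf if d == 0 else 1.0 / d


P = [[1.0, 2.0], [3.0, 4.0]]
Q = [[1.0, 2.0], [3.0, 8.0]]
print(matrix_distance(P, Q), similarity(P, Q))
```

Because the similarity is a plain reciprocal, it is not bounded above and depends on the scale of the features, so any threshold applied to it has to be chosen on the same scale.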

To detect fuzzy anomaly data in big data networks with multi-feature similarity methods, a feature set of normal network states must first be constructed [13]. Data collected over a long period is analyzed, clustered and aggregated to form a feature set per unit time and a threshold marked by time. A single uniform threshold cannot reflect the periodicity of network traffic, so a time-stamped threshold is used instead: the real-time traffic of a specific time period is judged against the threshold for that period, which can effectively reduce the false alarm rate [14].
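The idea of a time-stamped threshold can be illustrated with a small lookup, using the rule stated later in this section that a similarity above the threshold of the period indicates normal data. Everything concrete here is an assumption for illustration: the hourly granularity, the names `threshold_for` and `is_normal`, and the threshold values, none of which come from the original text.

```python
def threshold_for(hour, thresholds_by_hour, default):
    """Return the threshold marked for this hour, or a fallback value."""
    return thresholds_by_hour.get(hour, default)


def is_normal(sim, hour, thresholds_by_hour, default):
    """A sample is normal when its similarity exceeds the period's threshold."""
    return sim > threshold_for(hour, thresholds_by_hour, default)


# Made-up thresholds for two periods of the day (placeholders only).
thresholds_by_hour = {3: 0.05, 14: 0.01}

# The same similarity value is judged differently in different periods.
print(is_normal(0.02, 14, thresholds_by_hour, default=0.02))
print(is_normal(0.02, 3, thresholds_by_hour, default=0.02))
```

With a single uniform threshold, one of these two periods would necessarily be judged against a value that does not match its usual traffic pattern.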

The specific algorithm is as follows:

2. $$n=\frac{T}{t}$$(20)

3. The counter *T* is incremented by 1 each time the standard feature set needs to be updated.

4. Collect network characteristic data once every sampling time.

5. Define the 6 × *M*-dimensional real-time feature matrix *I*, which is used to store the feature information of the fuzzy anomaly data of the big data network.

Each row corresponds to one feature set. When a category has fewer than *m* attributes, the remaining entries are set to 0. Each feature set is handled differently. When the source network segment exit traffic is to be stored, the column number of each network segment in the matrix must be set [15, 16, 17, 18, 19, 20, 21].

Let the network segments *A*, *B*, and *C* correspond to the first, second and third columns of the matrix, respectively. The data of each sampling time must be stored according to this assignment. To store the destination port traffic characteristics, the ports are divided by port number segment into {(0, 100), (101, 1000), (1001, 3000), (3001, 5000), · · · , (9000, 65535)}. The port number segments correspond, in order, to the columns of the matrix:

$$I=\left[\begin{array}{cccccc}{I}_{11}& {I}_{12}& \cdots & \cdots & \cdots & {I}_{1m}\\ {I}_{21}& {I}_{22}& \cdots & \cdots & \cdots & {I}_{2m}\\ {I}_{31}& {I}_{32}& \cdots & \cdots & \cdots & {I}_{3m}\\ {I}_{41}& {I}_{42}& \cdots & \cdots & \cdots & {I}_{4m}\\ {I}_{51}& {I}_{52}& 0& \cdots & \cdots & 0\\ {I}_{61}& {I}_{62}& {I}_{63}& 0& \cdots & 0\end{array}\right]$$(21)

Similarly, the 6 × *M* standard feature set matrix *S* is obtained from the training data:

$$S=\left[\begin{array}{cccccc}{S}_{11}& {S}_{12}& \cdots & \cdots & \cdots & {S}_{1m}\\ {S}_{21}& {S}_{22}& \cdots & \cdots & \cdots & {S}_{2m}\\ {S}_{31}& {S}_{32}& \cdots & \cdots & \cdots & {S}_{3m}\\ {S}_{41}& {S}_{42}& \cdots & \cdots & \cdots & {S}_{4m}\\ {S}_{51}& {S}_{52}& 0& \cdots & \cdots & 0\\ {S}_{61}& {S}_{62}& {S}_{63}& 0& \cdots & 0\end{array}\right]$$(22)

Calculate the Euclidean distance *D*_{e}(*I*, *S*) between the real-time feature set matrix and the standard feature set matrix:

$${D}_{e}(I,S)=\sqrt{\sum _{i=1}^{6}\sum _{j=1}^{m}{({I}_{ij}-{S}_{ij})}^{2}}$$(23)

where *i* = 1, 2, · · · , 6 and *j* = 1, 2, · · · , *m*. The similarity *Sim*(*I*, *S*) of the real-time feature set matrix and the standard feature set matrix is obtained from the Euclidean distance *D*_{e}(*I*, *S*):

$$Sim(I,S)=\frac{1}{{D}_{e}(I,S)}=\frac{1}{\sqrt{\sum _{i=1}^{6}\sum _{j=1}^{m}{({I}_{ij}-{S}_{ij})}^{2}}}$$(24)

If the similarity value is higher than the threshold *ξ*^{T} of this period, the data is normal and the feature set is updated. Each attribute of the standard feature set matrix is combined with the corresponding attribute of the real-time feature set matrix in a weighted average, giving the updated feature set attribute *S*_{ij}(*n* + 1):

$${S}_{ij}(n+1)=\frac{{S}_{ij}(n)\cdot n+{I}_{ij}}{n+1}$$(25)

Let *S*(*n* + 1) represent the updated feature set matrix. Its calculation formula is:

$$S(n+1)=\frac{S(n)\cdot n+I}{n+1}$$(26)

If the similarity value is lower than the threshold *ξ*^{T} of this period, the data is fuzzy anomaly data, and the positioning model *DW* of the fuzzy anomaly data is constructed to complete the localization and recognition of the fuzzy anomaly data of the big data network:

$$DW={({a}_{1}^{2}+{a}_{2}^{2}+\cdots +{a}_{j}^{2})}^{\tau}/Sim(I,S)$$(27)

where *a*_{i} (*i* = 1, 2, · · · , *j*) represents the distance between adjacent servers in the big data network and *τ* represents the correction factor.
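The decision step of Eqs. (23) to (27) can be sketched as a single function: compute the similarity between the real-time matrix and the standard matrix, then either fold the sample into the standard feature set or evaluate the positioning model. This is only a sketch under stated assumptions. All function and parameter names are hypothetical, `tau` stands for the exponent in Eq. (27), a similarity exactly equal to the threshold is treated as anomalous because the text does not cover that case, and identical matrices are given infinite similarity.

```python
import math


def similarity(i_mat, s_mat):
    """Reciprocal Euclidean distance between two same-shape matrices (Eq. 24)."""
    d = math.sqrt(
        sum((a - b) ** 2 for ri, rs in zip(i_mat, s_mat) for a, b in zip(ri, rs))
    )
    # Assumption: identical matrices count as infinitely similar.
    return math.inf if d == 0 else 1.0 / d


def update_standard(s_mat, i_mat, n):
    """Running average of the standard feature set after n samples (Eq. 26)."""
    return [
        [(s * n + i) / (n + 1) for s, i in zip(rs, ri)]
        for rs, ri in zip(s_mat, i_mat)
    ]


def positioning_value(server_distances, tau, sim):
    """Positioning model DW (Eq. 27) for an anomalous sample."""
    return sum(a ** 2 for a in server_distances) ** tau / sim


def process_sample(i_mat, s_mat, n, threshold, server_distances, tau):
    """Return (label, standard matrix, sample count, DW or None)."""
    sim = similarity(i_mat, s_mat)
    if sim > threshold:
        # Normal data: update the standard feature set and the counter.
        return "normal", update_standard(s_mat, i_mat, n), n + 1, None
    # Assumption: sim == threshold is also handled as anomalous.
    return "anomaly", s_mat, n, positioning_value(server_distances, tau, sim)
```

For brevity the sketch accepts matrices of any matching shape rather than enforcing the 6 × *M* layout. Note that the standard feature set is left unchanged when a sample is classified as anomalous, so anomalous traffic does not shift the baseline.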
