In the cluster analysis stage, calculating the similarity of feature vectors is important. We assume that if two entity pairs stand in the same relationship, their feature vectors are more similar. For the n-dimensional feature vectors that describe the binary entity relationship tuples, we use a distance measure to quantify similarity, and we choose the Manhattan distance. For two vectors *V*_{1}(*v*_{11}, *v*_{12},…,*v*_{1n}) and *V*_{2}(*v*_{21}, *v*_{22},…,*v*_{2n}), the Manhattan distance formula is as follows:

$$\begin{array}{}{\displaystyle D({V}_{i},{V}_{j})=\sum _{k=1}^{n}|{v}_{ik}-{v}_{jk}|}\end{array}$$(1)
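Eq. (1) can be sketched as a small helper function; the example vectors below are illustrative, not taken from the paper's data:

```python
# Manhattan (L1) distance between two n-dimensional feature vectors,
# matching Eq. (1): D(Vi, Vj) = sum over k of |v_ik - v_jk|.
def manhattan_distance(vi, vj):
    if len(vi) != len(vj):
        raise ValueError("feature vectors must have the same dimension")
    return sum(abs(a - b) for a, b in zip(vi, vj))

# Illustrative example: |1-2| + |0-0| + |2-0| + |3-3| = 3
v1 = [1, 0, 2, 3]
v2 = [2, 0, 0, 3]
print(manhattan_distance(v1, v2))  # 3
```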

We use the TongYiCi CiLin Extended Edition, a synonym dictionary developed by the Harbin Institute of Technology (HIT), to calculate the similarity of two words; it records about 70,000 entries [14]. The dictionary uses a five-layer classification system to describe the hierarchical relations of its entries. Fig. 1 shows its hierarchical structure.

Figure 1 The hierarchical structure of TongYiCi CiLin Extended Edition.

For example, the word “Hmong” has the code Di04B10#: “D” is level 1, “i” is level 2, “04” is level 3, “B” is level 4, “10” is level 5, and the final “#” has a special use described below. Table 1 gives the word encoding rules.

Table 1 The word encoding rules of TongYiCi CiLin Extended Edition

The eighth position of the code carries one of three labels: “#”, “=”, and “@”. “#” represents unequal (related but not synonymous) words, “=” represents equality (synonyms), and “@” represents an independent word.
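The five-level encoding described above can be split mechanically. This hypothetical helper is our own illustration of the scheme, not code from the paper:

```python
# Hypothetical helper: split an 8-character TongYiCi CiLin code such as
# "Di04B10#" into its five hierarchy levels plus the eighth-position label.
def parse_cilin_code(code):
    if len(code) != 8:
        raise ValueError("expected an 8-character CiLin code")
    return {
        "level1": code[0],    # e.g. "D"  (one uppercase letter)
        "level2": code[1],    # e.g. "i"  (one lowercase letter)
        "level3": code[2:4],  # e.g. "04" (two digits)
        "level4": code[4],    # e.g. "B"  (one uppercase letter)
        "level5": code[5:7],  # e.g. "10" (two digits)
        "label":  code[7],    # "=", "#" or "@"
    }

print(parse_cilin_code("Di04B10#"))
```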

For the two words *w*_{1} and *w*_{2}, the similarity evaluation formula is as follows:

$$\begin{array}{}{\displaystyle Sim({w}_{1},{w}_{2})={l}_{n}\times \mathrm{cos}(n\times \frac{\pi }{180})\times (\frac{n-k+1}{n})+C}\end{array}$$(2)
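A direct transcription of Eq. (2) follows. This excerpt does not define every symbol, so we ASSUME here that *l*_{n} is a per-layer weight for the layer where the two codes first diverge, *n* is the number of branches at that layer, *k* is the distance between the two branches, and *C* is a small constant; the numeric values in the example are made up for illustration only:

```python
import math

# Hedged sketch of Eq. (2): Sim(w1, w2) = l_n * cos(n * pi/180) * ((n-k+1)/n) + C.
# The interpretations of layer_weight (l_n), n, k and c below are assumptions,
# not definitions taken from the paper.
def cilin_similarity(layer_weight, n, k, c=0.0):
    return layer_weight * math.cos(n * math.pi / 180) * ((n - k + 1) / n) + c

# Example with illustrative (made-up) values:
print(round(cilin_similarity(0.65, n=12, k=3), 4))
```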

Because the density-based spatial clustering of applications with noise (DBSCAN) method performs well at clustering information in the presence of noise, we use it to cluster the binary entity relationship tuples. Clustering in this way filters out some of the noise and improves the accuracy of entity relationship extraction. The method depends on two parameters, a radius and a density threshold: binary entity relationship tuples that satisfy both conditions are clustered together. Sample points satisfying the two conditions are found by repeatedly extending the search, separating the information from the noise.

DBSCAN is a density-based clustering method. Density-based methods consider clusters to be dense regions of objects that are separated by regions of lower density in the data space. Such methods are well suited to arbitrarily shaped clusters, although attribute selection and cluster selection are more complex with these algorithms. They can also merge two clusters that are sufficiently close to each other.

Density-biased sampling, DBSCAN (Density-Based Spatial Clustering of Applications with Noise), OPTICS (Ordering Points To Identify the Clustering Structure), and DENCLUE (DENsity-based CLUstEring) are examples of this family of methods [15, 16, 17, 18]. DBSCAN requires two important parameters, as follows [15, 16].

– Eps is the radius that delimits the neighborhood area of a point over its spatial attributes (e.g., latitude and longitude).

– Minpts is the minimum number of points that must exist in the Eps-neighborhood.

Some concepts and definitions that are directly and indirectly related to DBSCAN are explained here [15]:

Cluster: given a database of data objects *D* = {*O*_{1}, *O*_{2}, ⋯, *O*_{n}}, the procedure of partitioning *D* into smaller parts *C* = {*C*_{1}, *C*_{2}, ⋯, *C*_{i}} that are similar by certain criteria is called clustering. The *C*_{j} are clusters, where *C*_{j} ⊆ *D* (*j* = 1, 2, 3, ⋯, *i*).

Neighborhood: the distance between any two points *p* and *q* under a distance function (e.g., Manhattan distance or Euclidean distance) is denoted *dist*(*p*, *q*).

Eps-neighborhood: the Eps-neighborhood (threshold distance) of a point *p* is defined by {*q* ∈ *D*|*dist*(*p*, *q*) ⩽ *Eps*}.

Core object: a point *p* is a core point if at least Minpts points lie within distance Eps of it; those points are said to be directly reachable from *p*. In other words, a core object is a point whose neighborhood of a given radius (Eps) contains at least a minimum number (Minpts) of points.

Directly density reachable: an object *p* is directly density reachable from an object *q* if *p* is within the Eps-neighborhood of *q* and *q* is a core object in the given data objects *D* = {*O*_{1}, *O*_{2}, ⋯, *O*_{n}}.

Density reachable: a point *q* is density reachable from a point *p* if there is a chain *p*_{1}, ⋯, *p*_{n} with *p*_{1} = *p* and *p*_{n} = *q*, where each *p*_{i+1} is directly density reachable from *p*_{i} with respect to Eps and Minpts, for 1 ⩽ *i* < *n*, *p*_{i} ∈ *D*.

Density connected: an object *p* is density connected to an object *q* with respect to Eps and Minpts if there is an object *o* ∈ *D* such that both *p* and *q* are density reachable from *o* with respect to Eps and Minpts.

Density based clusters: a cluster *C* is a nonempty subset of *D* satisfying the following “maximality” and “connectivity” requirements:

– ∀ *p*, *q*: if *q* ∈ *C* and *p* is density reachable from *q* with respect to Eps and Minpts, then *p* ∈ *C*.

– ∀ *p*, *q* ∈ *C*: *p* is density connected to *q* with respect to Eps and Minpts.

Border objects: an object *p* is a border object if it is not a core object but is density reachable from some core object.

Noise: noise points are all points not reachable from any other point, that is, points that are neither core points nor density reachable: *Noise* = {*p* ∈ *D*|∀*i* : *p* ∉ *C*_{i}}.

Some of the reasons why we selected DBSCAN are the following positive points [16]:

– It is capable of discovering clusters with arbitrary shapes.

– There is no need to specify the number of clusters in advance, which is more realistic.

– Its neighborhood queries can be supported by spatial index structures such as the *R*^{*}-tree.

– The selection and application of attributes remains open for improving time and space complexity.

– It is robust to outliers, and merging with other sufficiently similar clusters is possible.

It can be seen that density reachability is the transitive closure of direct density reachability; this relation is asymmetric, whereas density connectedness is symmetric. The purpose of DBSCAN is to find the largest sets of density connected objects.

E.g.: suppose radius Eps = 3 and MinPts = 3, and that the Eps-neighborhood of point *p* contains {*m*, *p*, *p*_{1}, *p*_{2}, *o*}, the Eps-neighborhood of point *m* contains {*m*, *p*, *q*, *m*_{1}, *m*_{2}}, the Eps-neighborhood of point *q* contains {*q*, *m*}, the Eps-neighborhood of point *o* contains {*o*, *p*, *s*}, and the Eps-neighborhood of point *s* contains {*o*, *s*, *s*_{1}}. Then *p*, *m*, *o*, and *s* are core objects (but *q* is not, because its Eps-neighborhood contains only 2 points, fewer than MinPts = 3). Point *m* is directly density reachable from point *p* because *m* is in the Eps-neighborhood of *p* and *p* is a core object. Point *q* is density reachable from point *p* because *q* is directly density reachable from *m* and *m* is directly density reachable from *p*. Point *q* is density connected to point *s* because both *q* and *s* are density reachable from point *p*.
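The Eps/MinPts mechanics above can be sketched as a minimal DBSCAN over 2-D points. This is a generic textbook-style implementation using the Manhattan distance from Eq. (1), not the authors' code, and the sample points are made up:

```python
# Minimal DBSCAN sketch: cluster 2-D points using Manhattan distance.
def region_query(points, i, eps):
    """Indices of all points within the Eps-neighborhood of points[i]."""
    return [j for j, q in enumerate(points)
            if abs(points[i][0] - q[0]) + abs(points[i][1] - q[1]) <= eps]

def dbscan(points, eps, min_pts):
    UNVISITED, NOISE = 0, -1
    labels = [UNVISITED] * len(points)
    cluster = 0
    for i in range(len(points)):
        if labels[i] != UNVISITED:
            continue
        neighbors = region_query(points, i, eps)
        if len(neighbors) < min_pts:
            labels[i] = NOISE           # may later become a border point
            continue
        cluster += 1                    # i is a core point: start a cluster
        labels[i] = cluster
        seeds = list(neighbors)
        while seeds:                    # expand by density reachability
            j = seeds.pop()
            if labels[j] == NOISE:
                labels[j] = cluster     # border point joins the cluster
            if labels[j] != UNVISITED:
                continue
            labels[j] = cluster
            if len(region_query(points, j, eps)) >= min_pts:
                seeds.extend(region_query(points, j, eps))  # j is also core
    return labels

# Two dense groups and one isolated noise point (illustrative data):
pts = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10), (50, 50)]
print(dbscan(pts, eps=2, min_pts=3))  # [1, 1, 1, 2, 2, 2, -1]
```

Points labeled −1 are noise, which is exactly how noisy binary entity relationship tuples get filtered out in our setting.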

In summary, the method of density-based multi-clustering of semantic similarity (DBMCSS) that we propose includes the following steps: (1) named entities in the text are recognized; (2) entity pairs in a sentence are selected according to the adjacency principle, in order to find the entity pairs that have a relationship; (3) rules are used to filter out noisy entity pairs; (4) the entity pair, together with the verb or noun between the entities, the verb or noun to the right of the pair, and the verb or noun to the left of the pair, is used to establish the entity relationship tuples; (5) the feature vectors of the entity pairs in the sentences are established, taking the POS of the words from step (4) as the features for binary entity relationship tuple extraction; (6) the similarity of the feature vectors is calculated by the Manhattan distance; (7) the DBSCAN clustering algorithm clusters the feature vectors, and the entity relationship tuples are extracted.
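The seven steps can be outlined as a pipeline skeleton. The NER, context-word and clustering components below are toy stand-ins (assumptions) passed in as parameters, not the authors' implementations, and the rule-based noise filter of step (3) is omitted from this sketch:

```python
def adjacent_pairs(entities):
    """Step (2): entity pairs selected by the adjacency principle."""
    return list(zip(entities, entities[1:]))

def dbmcss(sentences, ner, context_words, featurize, cluster):
    """Hedged skeleton of steps (1)-(7); step (3)'s filter rules are omitted."""
    tuples, vectors = [], []
    for sent in sentences:
        entities = ner(sent)                      # (1) recognize named entities
        for e1, e2 in adjacent_pairs(entities):   # (2) adjacent entity pairs
            words = context_words(sent, e1, e2)   # (4) verbs/nouns around the pair
            tuples.append((e1, e2, words))
            vectors.append(featurize(words))      # (5) POS-based feature vector
    labels = cluster(vectors)                     # (6)-(7) distance + DBSCAN
    return [t for t, lab in zip(tuples, labels) if lab != -1]  # drop noise

# Toy demo with made-up components (uppercase tokens play "entities"):
demo = dbmcss(
    ["A B eats C D"],
    ner=lambda s: [w for w in s.split() if w.isupper()],
    context_words=lambda s, e1, e2: [w for w in s.split() if w.islower()],
    featurize=lambda ws: [len(ws)],
    cluster=lambda vs: [0] * len(vs),   # stand-in: keep everything
)
print(demo)
```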
