A new method for graph stream summarization based on both the structure and concepts

Abstract Graph datasets are common in many application domains, and the graphs involved are usually massive. One way to process such massive graphs is summarization. Graphs are either stationary or streaming. For stationary graphs, a number of summarization algorithms are available, whereas for graph streams there is no comprehensive method that summarizes a stream based on the structure, the vertex attributes, or both with varying contributions. The difficulty stems from two challenges of graph streams: the volume of data and the change of data over time. In this paper, we propose a method based on the sliding-window model that summarizes a graph stream based on a combination of the structure and vertex attributes. We also propose a new structure for summary graphs and new methods for comparing two summary graphs. To the best of our knowledge, this is the first method that summarizes a graph stream based on both the structure and vertex attributes with varying contributions. Through extensive experiments on a real dataset of Amazon co-purchased products, we demonstrate the performance of the proposed method.

The general goal of summarization is to reduce a massive graph to a smaller one by removing unimportant details while preserving the general properties of the graph. In a number of applications, data and their relations are modeled by a structural graph, e.g. cities and the roads between them. These graphs are summarized based on nodes and their relations [2][3][4]. On the other hand, a number of applications generate attributed graphs, in which attributes are associated with vertices or even edges [5,6], e.g. social networks such as Facebook. In Facebook, each node represents a person and has attributes such as name, family name, country, and so on.
In general, summarization is performed by grouping similar nodes into one group and dissimilar nodes into different groups [1]. The similarity of two vertices can be structural, attribute-based, or both. For example, in Facebook, both edges and vertex attributes can be taken into account when constructing a summary. Therefore, depending on the importance of the structure, the vertex attributes, or both, summarization is structural [4], attribute-based [7,8] or hybrid [9]. The similarity of the nodes has an important impact on the resulting summary and can be calculated based on vertex connectivity, vertex attributes, or both. Therefore, the similarity criterion for two vertices determines the type of the resulting summary.
Nowadays, many applications generate data that are received as a graph stream [10], such as product sales in supermarkets. In this example, the relationships between sold products are received as a stream of edges, where each edge represents a pair of products sold together. Although a number of algorithms have been proposed for summarizing stationary attributed graphs based on both the structure and vertex attributes with varying contributions of each [6,8,11], to the best of our knowledge there is no method capable of summarizing a graph stream based on the structure, the vertex attributes, or both. This is the main challenge of graph stream summarization. In this paper, a new method is proposed that addresses this challenge. The proposed method summarizes a graph stream based on the sliding-window paradigm. With this method, a summary of the graph stream is always available. We provide experimental results on the Amazon product co-purchasing network dataset to evaluate the proposed method.
We propose a method for graph stream summarization based on the sliding-window model. In this method, a graph is summarized based on both the structure and vertex attributes. For comparing two summaries, we introduce a new schema for a summary graph and a new algorithm for calculating the difference between two summaries. Overall, our contributions are as follows:
• A new method for graph stream summarization
• A new schema for a summary graph
• A new algorithm for comparing two summaries and calculating their distance.
The rest of this paper is organized as follows. In Section 2, related works are reviewed. Section 3 is dedicated to the proposed method. Our experiments are explained in Section 4. Discussions are presented in Section 5 and finally we have provided conclusions and discussion of future work in Section 6.

Related Works
In this section, we review previous works on four different types of graph summarization to discuss the main challenges of graph summarization.

Structural summarization
In [12] a method has been proposed for graph structural summarization. In this method, a graph is compressed by partitioning similar nodes into one group and dissimilar nodes into different groups. For a compressed graph, a super-edge is the aggregated edges between a pair of supernodes. In this method, a graph is compressed based on the Minimum Description Length (MDL) idea. Firstly, they developed a greedy algorithm and secondly to reduce the runtime of the algorithm, they proposed a randomized version.
Riondato et al. [3] proposed another method to summarize structural graphs. In this method, the aim is to guarantee the quality of a summary and to minimize the reconstruction error. Riondato et al. were the first to present a connection between graph summarization and geometric clustering problems. Based on this connection, the authors developed a polynomial-time algorithm to generate the best possible summary of the expected size.
Tian et al. [13] proposed three distributed summarization algorithms, named DistGreedy, DistRandom and DistLSH, to summarize large-scale graphs. These algorithms differ in how they select a pair of nodes to merge: greedily, randomly, and using locality-sensitive hashing, respectively.
Chen et al. [14] proposed a method based on producing randomized summary graphs for identifying frequent patterns. Structural summaries can be beneficial for frequent pattern mining. In fact, instead of mining massive and time-consuming original graphs, summary graphs are mined.
Spectral graph clustering can also be used for structural summarization. Spectral graph clustering partitions a graph based on the eigenvalues and eigenvectors of the graph adjacency matrix [15][16][17][18][19]. This technique can be beneficial in image segmentation and social network analysis. A number of applications use spectral graph clustering to find communities in networks [20]. In these applications, a large graph is first converted into a small one by summarization, and spectral graph clustering is then applied to the resulting small summary graph [21].
Community detection algorithms have many applications, and many articles [22][23][24][25][26] have recently been published on this subject. Graph summarization can be beneficial for detecting communities in a network.
There are also some similar models [27,28], namely subgraph mining models, which are limited in comparison with summarization methods. Rather than summarizing, these models select one or more subsets of the graph.

Attribute-based summarization
In [7], a summarization method with two novel operations, SNAP (Summarization by grouping Nodes on Attributes and Pairwise relations) and k-SNAP, has been proposed. These operations are used for grouping nodes and summarizing attributed graphs. Tian et al. defined attribute-compatible and relation-compatible groupings. They also improved the SNAP operation by proposing the k-SNAP operation, in which k is the summary size and is determined by the user. The k-SNAP operation was improved by Zhang et al. [8], who proposed the CANAL algorithm in 2009. The CANAL algorithm is used to categorize attribute values automatically and also provides a criterion to measure the quality of a summary.
In 2008, the OLAP framework was proposed by Chen et al. [29]. In the OLAP framework, cubes are created on the graph based on dimensions and measures. In this framework, a graph is summarized based on both selected attributes and input information.

Hybrid summarization
In [6], a method was proposed for clustering a graph based on both the structure and vertex attributes. In this method, a new graph with real and virtual links, named the augmented graph, is constructed for a given graph. The virtual links are added to the new graph to represent the attribute-based similarity of vertices. In the augmented graph, both real and virtual links are considered to measure the similarity of two nodes. If the number of associated attributes is relatively high, the augmented graph becomes massive and consequently the runtime of the algorithm is high.
In [11], another method has been proposed for hybrid summarization of a graph. In this method, a graph is initially summarized based on vertex attributes, without taking the graph structure into consideration, and the summary is then adjusted to the graph structure by moving nodes between super-nodes. This method may be inefficient in situations where the structure has an important impact on constructing the summary.
In [30] a method has been proposed for attributed graphs which constructs a hybrid summary by considering MDL principle to model the graph summarization problem into a code cost function and utilizing greedy method to compute an optimal summary. In this method, the user's needs and also the ontology of the graph have not been considered.

Graph stream summarization
There has been some research work on the subject of graph stream summarization but the contribution of these works in the scope of graph stream summarization is not significant. Major research work done on graph stream are as follows.
In [31], a novel Graph Stream Sketch (GSS) has been proposed to summarize graph streams with linear space cost (O(|E|)) and constant update time complexity (O(1)). The aim of Gou et al. was to construct a summary for query answering with controllable errors.
In [32], the focus is on calculating the rank of a node in a graph stream with the minimum number of passes over the stream and the minimum space, up to an adaptive error. Algorithms and models have been presented in this regard.
In [33], Feigenbaum et al. were interested in the trade-offs between model parameters such as per-data-item processing time, required space, and the required number of passes over the stream. These trade-offs were considered for solving problems such as spanner construction, BFS-tree construction, and graph distance lower bounds.
In [34], Cormode et al. proposed a new variation of the streaming model in which a helper provides annotations for the data stream. They showed that with linear-sized annotations, the memory required for many problems is reduced to a constant.
In [35], Feigenbaum et al. proposed a new streaming model and formalized it. They argued that this model is necessary for designing efficient algorithms to solve problems on massive graphs. They considered an upper bound on the space required by such algorithms and applied the proposed model to specific problems.
In [36], Aggarwal et al. also proposed a method for graph stream clustering by introducing micro-clusters and compressing them with hash functions. The proposed method can be beneficial for special applications. Aggarwal also proposed a new method for classifying a massive-domain graph stream [37]. This method takes a probabilistic approach to constructing a summary that can be stored in main memory, and it was used for determining special patterns in a graph stream.
There are other works on graph streams addressing problems such as connectivity [35], counting subgraphs, e.g. triangles [38,39], calculating the degree of nodes [40], spanners [41], and sparsification [42]. Thus, to the best of our knowledge, there is no method capable of summarizing a graph stream, that is, of converting the graph to a smaller one by removing unnecessary details and preserving overall properties.

The proposed method
In the proposed method, we use the sliding-window model for summarizing a graph stream. We take the edges of the first window over the graph stream and construct the graph of this window. This graph is summarized using the hybrid summarization method proposed by the authors of this paper [9,43] for summarizing an attributed graph based on both the structure and vertex attributes. The resulting summary graph is maintained as the reference summary. We then take the edges of the second window over the graph stream, construct its graph, and summarize it. This summary is named the current summary. The current summary is compared to the reference summary, and depending on whether they match, one of them is discarded. If they match, that is, their distance does not exceed a given threshold, the current summary is discarded. Otherwise, the distance between the two summaries is higher than the threshold, and the current summary replaces the reference summary. In this case, the current summary is the best representative of the graph stream.
By continuing in this way, a summary of the graph stream is available at any moment. The paradigm of the new method for graph stream summarization is depicted in Figure 2.
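The reference-update loop described above can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the names `windows` and `summarize_stream` are hypothetical, the summarization and distance routines are passed in as placeholder callables, and the windows are assumed to be consecutive and non-overlapping, which the text does not state explicitly.

```python
from itertools import islice
from typing import Callable, Iterable, Iterator, List, Optional, Tuple, TypeVar

Edge = Tuple[int, int]
S = TypeVar("S")  # type of a summary graph; left abstract in this sketch


def windows(stream: Iterable[Edge], size: int) -> Iterator[List[Edge]]:
    """Yield consecutive, non-overlapping windows of at most `size` edges."""
    it = iter(stream)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk


def summarize_stream(
    stream: Iterable[Edge],
    size: int,
    summarize: Callable[[List[Edge]], S],   # placeholder for the hybrid summarizer
    distance: Callable[[S, S], float],      # placeholder for the summary distance
    threshold: float,
) -> Iterator[S]:
    """Yield the reference summary that is in force after each window."""
    reference: Optional[S] = None
    for window in windows(stream, size):
        current = summarize(window)
        # The first summary becomes the reference; later summaries replace it
        # only when they differ from it by more than the threshold.
        if reference is None or distance(reference, current) > threshold:
            reference = current
        yield reference
```

Because the generator yields after every window, a caller can read the latest reference at any point in the stream, which mirrors the claim that a summary is available at any moment.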

Algorithm
The proposed method is summarized in Algorithm 1. Several steps of this algorithm, namely summarizing a graph, comparing two summary graphs and calculating the distance between two summary graphs, require further explanation. In the following subsections, we describe the structure of a summary graph, the similarity of two super-nodes and the comparison of two summary graphs, and finally we illustrate the proposed method with an example.

The summary structure
In the proposed method, an attributed graph is summarized based on both the structure and vertex attributes. Every super-node in the summary graph is a vector of structural and semantic attributes. The structural attributes are the number of vertices in the super-node, the degree of the super-node, and the percentage of its vertices that are related to vertices of each other super-node. The semantic attributes are the percentages of vertices that take a given value on an attribute; this percentage is calculated for every value of every vertex attribute. In Section 3.5 we illustrate the summary structure with an example.
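To make the super-node vector concrete, the sketch below builds it from a given grouping of vertices. It is an illustration under stated assumptions, not the schema used in the paper: the names `SuperNode` and `build_super_nodes` are hypothetical, edges are treated as undirected, the degree of a super-node is taken to be the number of other super-nodes it is connected to, and every group is assumed to be non-empty.

```python
from collections import defaultdict
from dataclasses import dataclass, field
from typing import Dict, Hashable, Iterable, Set, Tuple


@dataclass
class SuperNode:
    size: int      # number of vertices in the super-node
    degree: int    # number of adjacent super-nodes (assumed meaning of "degree")
    # other super-node -> percentage of members linked to it
    links: Dict[Hashable, float] = field(default_factory=dict)
    # (attribute, value) -> percentage of members having that value
    attrs: Dict[Tuple[str, str], float] = field(default_factory=dict)


def build_super_nodes(
    groups: Dict[Hashable, Set[int]],
    edges: Iterable[Tuple[int, int]],
    vertex_attrs: Dict[int, Dict[str, str]],
) -> Dict[Hashable, SuperNode]:
    owner = {v: g for g, members in groups.items() for v in members}
    # linked[g][h] = members of g with at least one edge to a member of h
    linked = defaultdict(lambda: defaultdict(set))
    for u, v in edges:
        gu, gv = owner[u], owner[v]
        if gu != gv:
            linked[gu][gv].add(u)
            linked[gv][gu].add(v)
    result = {}
    for g, members in groups.items():
        n = len(members)
        links = {h: 100.0 * len(vs) / n for h, vs in linked[g].items()}
        counts = defaultdict(int)
        for v in members:
            for attr, value in vertex_attrs[v].items():
                counts[(attr, value)] += 1
        attrs = {key: 100.0 * c / n for key, c in counts.items()}
        result[g] = SuperNode(n, len(links), links, attrs)
    return result
```

For instance, a group of two vertices in which one vertex has an edge into another group gets a link percentage of 50 toward that group.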

Similarity of two super-nodes
Based on the proposed structure for the summary graph, a super-node is a vector of attributes and the similarity of two super-nodes is calculated based on their vectors. The similarity of two super-nodes is calculated using Equation (1), which also uses Equations (2) through (6). Initially, the distance of two super-nodes is calculated and then by subtracting this value from one, the similarity of two supernodes is obtained.
The distance between two super-nodes is equal to summation of structural and attribute-based distance of two super-nodes, which is presented by Equation (2).
The number of vertices, the degree of the super-node and the number of vertices that are related to vertices of other super-nodes (the weight of edges) are considered as structural attributes. These structural attributes determine the structural distance of two super-nodes, which is described by Equation (3). As seen in Equation (3), the value of the structural distance belongs to [0, 1]. For each of the three parts of Equation (3), if the denominator is zero, the value of that part is considered to be zero.
The attribute-based distance between two super-nodes with k attributes, each attribute having k′ values, is calculated using Equation (4). In these equations, n_p and d_p are the number of vertices in V_p and the degree of V_p, respectively.
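The two relations that the prose of this subsection states in full, Equations (1) and (2), can be written out as below. The symbols sim, dist, dist_s and dist_a are notation assumed here for illustration, and any normalization or weighting terms the original equations may contain are not shown because the text does not describe them. Equations (3) through (6) are not reproduced, since the text does not specify them term by term.

```latex
\begin{align}
\mathrm{sim}(V_p, V_q)  &= 1 - \mathrm{dist}(V_p, V_q) \\
\mathrm{dist}(V_p, V_q) &= \mathrm{dist}_{s}(V_p, V_q) + \mathrm{dist}_{a}(V_p, V_q)
\end{align}
```

Here dist_s denotes the structural distance and dist_a the attribute-based distance between the super-nodes V_p and V_q.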

Comparing two summary graphs
To compute the distance between two summary graphs, the similarity of each pair of super-nodes, one from each summary graph, is first calculated using Equation (1). The super-node pairs with the highest similarity are matched. After matching the super-nodes of the two summary graphs, the distance between the two summary graphs is calculated as the sum of the distances of the matched super-node pairs. This approach is described in Algorithm 2.

Algorithm 2: Calculating the distance of two summary graphs
Input: summary graphs GS1 and GS2, and the size of the summary graph, size
Output: dist, the distance of the two summaries
Begin
    Calculate the distance of every pair of super-nodes;
    Add every super-node pair with its calculated distance to the ascending priority queue q;
    n = size;
    While (n > 0)
        Remove a super-node pair from q;
        Match the two super-nodes of the pair;
        n--;
    EndWhile;
    Set dist to the sum of the distances of the matched super-node pairs;
End.
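The steps of Algorithm 2 can be sketched in Python as follows. This is an illustrative reading rather than the authors' code: the function name `summary_distance` is hypothetical, the super-node distance is passed in as a callable, and the sketch adds one assumption that the pseudocode leaves implicit, namely that a pair is skipped when either of its super-nodes has already been matched.

```python
import heapq
from typing import Callable, List, Sequence, Tuple, TypeVar

N = TypeVar("N")  # type of a super-node; left abstract in this sketch


def summary_distance(
    gs1: Sequence[N],
    gs2: Sequence[N],
    node_distance: Callable[[N, N], float],
) -> Tuple[float, List[Tuple[int, int]]]:
    """Greedily match super-nodes by ascending distance and sum the distances."""
    # Distance of every super-node pair, stored in an ascending priority queue.
    heap = [
        (node_distance(a, b), i, j)
        for i, a in enumerate(gs1)
        for j, b in enumerate(gs2)
    ]
    heapq.heapify(heap)

    used1, used2 = set(), set()
    matches: List[Tuple[int, int]] = []
    total = 0.0
    n = min(len(gs1), len(gs2))
    while heap and len(matches) < n:
        d, i, j = heapq.heappop(heap)
        if i in used1 or j in used2:
            continue  # assumption: each super-node is matched at most once
        used1.add(i)
        used2.add(j)
        matches.append((i, j))
        total += d
    return total, matches
```

A greedy matching of this kind is simple but does not always minimize the total distance; an optimal assignment method such as the Hungarian algorithm could be substituted if that guarantee is needed.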

Illustrating the proposed method with an example
To clarify the method, we consider two summary graphs SG 1 and SG 2 with the above-mentioned structure, each with three super-nodes and two attributes. The attributes are gender, with the values Male and Female, and education level, with the values BSc., MSc. and Ph.D. As we see in Figure 3, the summary graph shows the overall and important information of the original graph. For example, super-node V 1 shows a group of 600 people where 20% are in relationship with people of V 2 , 80% are in relationship with people of V 3 , 65% are female and 35% are male, 30% hold a Bachelor of Science, 30% hold a Master of Science and 30% hold a Doctor of Philosophy.
As already mentioned, the similarity of every pair of super-nodes of the two summary graphs is calculated first. For clarity, in the following we calculate the similarity of the two super-nodes V 1 and V ′ 1 step by step.
For brevity, we omit the computational steps for the other super-node pairs and only report their final similarity values in Table 1.
, respectively. The first component of each pair is a super-node of the first summary graph and the second component is its matched super-node of the second summary graph. According to this matching, the distance of two summary graphs SG 1 and SG 2 is calculated as follows: = 0.9899 + 0.9945 + 0.9920 = 2.9764

Experiment
In this section, we describe experiments conducted to evaluate the performance of the proposed method on real-world graphs. The proposed method was implemented in the Java programming language.

Dataset Amazon co-purchasing network
This dataset is available at the address given in the footnote¹ and includes information about different products such as books, music CDs, DVDs and VHS video tapes. There are 548,552 products, and for each product, information such as title, salesrank, list of similar products, category and reviews is available. The data describe Amazon co-purchased products of 2003 and were collected in summer 2006 by Jure Leskovec by crawling the Amazon website. The information about these products and their graph streams is presented in Table 2. The rows amazon0302, amazon0312, amazon0505 and amazon0601 are four directed graph streams, in which each edge (x, y) indicates that product y has frequently been co-purchased with product x. We chose the Id, ASIN, group and salesrank fields for our experiments.

Table 2
| Name | #Nodes | #Edges | Description |
|---|---|---|---|
| amazon0302 | 262,111 | 1,234,877 | Amazon product co-purchasing network from March 2 2003 |
| amazon0312 | 400,727 | 3,200,440 | Amazon product co-purchasing network from March 12 2003 |
| amazon0505 | 410,236 | 3,356,824 | Amazon product co-purchasing network from May 5 2003 |
| amazon0601 | 403,394 | 3,387,388 | Amazon product co-purchasing network from June 1 2003 |
| amazon-meta | 548,552 | 1,788,725 | Amazon product metadata: product info and all reviews on around 548,552 products |

Evaluation
To the best of our knowledge, our proposed method is a novel general-purpose method for graph stream summarization, and there is no competing method against which it can be evaluated directly. We believe that comparing the results of our proposed method with the changes in the actual constructed graphs is more reasonable and reliable than comparing it with other methods.
Therefore, to evaluate the proposed method for graph stream summarization, we chose the amazon0302 file and set the window size to 1000 edges. We placed the first window over the first 1000 edges of the file and constructed the first graph. For every window, the vertices are those that appear as at least one endpoint of the first 5000 edges. The graph of the first window was summarized, yielding a summary graph of size 7. This summary graph is maintained as the reference. The subsequent windows are then taken over the graph stream, and for each of them the graph is summarized, the resulting summary is compared with the reference summary, and the reference summary is changed if necessary. In this experiment the window size was fixed at 1000 edges, although the first window is usually chosen larger than the others. In the following, 5 summary graphs are presented in Figures 5 through 9. In these figures, dis and toy represent discontinued and toy products. These two categories do not belong to the main categories mentioned in the description of the dataset.
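The windowing step of this experiment can be sketched as below. The sketch assumes that the edge file is a plain-text list with one whitespace-separated pair of node ids per line and with comment lines starting with `#`; the function names `read_edges` and `window_graphs` are hypothetical, and the restriction of vertices to those seen in the first 5000 edges is not modeled.

```python
from typing import Dict, Iterable, Iterator, Set, Tuple


def read_edges(lines: Iterable[str]) -> Iterator[Tuple[int, int]]:
    """Parse directed edges (x, y), skipping blank and '#' comment lines."""
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        x, y = line.split()[:2]
        yield int(x), int(y)


def window_graphs(
    edges: Iterable[Tuple[int, int]], window_size: int = 1000
) -> Iterator[Dict[int, Set[int]]]:
    """Yield one directed adjacency map per window of `window_size` edges."""
    adj: Dict[int, Set[int]] = {}
    count = 0
    for x, y in edges:
        adj.setdefault(x, set()).add(y)
        adj.setdefault(y, set())  # make sure the target vertex is present
        count += 1
        if count == window_size:
            yield adj
            adj, count = {}, 0
    if count:  # last, possibly shorter, window
        yield adj
```

Passing an open file handle to `read_edges` and its result to `window_graphs` yields the per-window graphs one at a time, so the whole stream never has to be held in memory.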
The semantics of each summary graph have been extracted and are shown in Table 3. Semantic changes between consecutive summary graphs are presented in Table 4.
Figure 6: The second summary graph of size 7

Time complexity
In the proposed method, the running time is dominated by the summarization algorithm. According to the summarization algorithm in [9,43], the similarity of each pair of nodes is calculated first, and the graph is then summarized by merging nodes or super-nodes. In the worst case, the summarization algorithm performs at most |V| merge operations to obtain the expected summary. Hence, the time complexity of this method is O(|E|×|V|). The time complexity of the other steps, such as calculating the distance between two super-nodes, matching the super-nodes of two summary graphs and calculating the distance between two summary graphs, is lower than that of the summarization algorithm.

Discussions
The summary graphs shown in Figures 5 through 9, the semantics of each summary shown in Table 3, and the distances between consecutive summary graphs shown in Table 4 help us to justify the distance between every two consecutive summary graphs intuitively. The distances calculated with the above-mentioned formulas should be supported by intuitive structural and semantic changes.
As shown in Table 4, the first and second summary graphs have a distance of about 0.3, and intuitively these two summary graphs differ in two respects, as shown in the fourth column. Therefore, the distance between these two summary graphs is supported by the intuitive changes. The same holds for the second and third summary graphs. On the other hand, the third and fourth summary graphs have a lower distance than the previous consecutive pairs, and this is also in line with the intuitive changes: the third and fourth summary graphs differ intuitively in only one respect, the existence or absence of a 4-clique. The situation for the fourth and fifth summary graphs is similar to that of the consecutive pairs among the first three summary graphs. Therefore, the calculated distance for each pair of consecutive summary graphs agrees with the intuitive distance between them and is reliable.
The threshold on the distance between two summary graphs determines whether the reference needs to be changed. For example, if we set the threshold to 0.5, the first summary graph remains the reference. On the other hand, if we set the threshold to 0.2, the reference is initially the first summary graph, and when the second summary graph appears, it becomes the new reference. The same happens for the third summary graph, which becomes the reference in turn. When the fourth summary graph appears, the reference summary does not change. When the fifth summary graph appears, the reference changes and is replaced with the fifth summary graph. The threshold value can be determined in terms of scope and precision.
Our proposed method is a general-purpose method because it takes into account the structure, vertex attributes, user's needs and graph ontology in summarization. Hence, the summary graph can be beneficial in community detection, node degree calculations and similar tasks. By initially setting the parameters of the summarization algorithm, it is possible to change the orientation of the summarization.

Table 3
| Summary graph | Figure | Summary graph semantic |
|---|---|---|
| 1 | Figure 5 | Discontinued products are related to books. The majority of books are related to other books. The majority of the super-nodes are isolated. |
| 2 | Figure 6 | Discontinued products are related to all other products. Super-nodes are related to each other (a near clique). Only the super-node of Toy is isolated. |
| 3 | Figure 7 | There is no isolated super-node. All super-nodes are in relationship with the super-node of book. There is a 4-clique in the graph. |
| 4 | Figure 8 | There is no isolated super-node. All super-nodes are in relationship with the book super-node. The number of sold music products is maximal. |
| 5 | Figure 9 | There is no toy category in the summary graph. All super-nodes are in relationship with the book super-node. The majority of books are in relationship with each other. |

Table 4 (surviving fragment): distance 0.36400673; semantic changes: Toy category, the number of edges.

Conclusions
In this paper, we proposed a method for graph stream summarization based on the sliding-window model. In the proposed method, a graph is summarized based on both the structure and vertex attributes. The super-nodes of two summary graphs are matched to each other and the distance of every pair of matched super-nodes is calculated. The distance of two summary graphs is calculated based on the calculated distances of the matched super-nodes.
If the difference of the new summary graph and the reference is higher than the threshold, then the reference will be replaced with the new summary graph.
To the best of our knowledge, this is the first method for summarizing a graph stream based on both the structure and vertex attributes. With this method, a summary of the graph stream is always available. The summary graph is a representative of the graph stream that retains its overall properties and can be used for decision-making. In this paper, a number of algorithms have been proposed for calculating the similarity of two super-nodes, matching super-nodes and calculating the distance between two summary graphs.
In the proposed method, the window size was fixed, although considering windows of varying length is more natural. We plan to extend the proposed method by considering windows of varying length and learning the window size algorithmically.
In real-world applications, some data are missing, and this issue should be considered in the experiments. In addition, a number of applications generate more than one graph stream, and these graph streams should be summarized simultaneously. A future research avenue would be summarizing multiple graph streams.