Data and Information Management

4 Issues per year

Open Access
Online
ISSN
2543-9251
Volume 1, Issue 2

gst-store: Querying Large Spatiotemporal RDF Graphs

Dong Wang
/ Lei Zou
/ Dongyan Zhao
Published Online: 2017-12-29 | DOI: https://doi.org/10.1515/dim-2017-0008

Abstract

The SPARQL Protocol and RDF Query Language (SPARQL) allows users to issue a structural query over a resource description framework (RDF) graph. However, the lack of a spatiotemporal query language limits the usage of RDF data in spatiotemporal-oriented applications. As the spatiotemporal information in RDF data continuously increases, it is necessary to design an effective and efficient spatiotemporal RDF data management system. In this paper, we formally define the spatiotemporal information-integrated RDF data, introduce a spatiotemporal query language that extends the SPARQL language with spatiotemporal assertions to query spatiotemporal information-integrated RDF data, and design a novel index and the corresponding query algorithm. The experimental results on a large, real RDF graph integrating spatial and temporal information (> 180 million triples) confirm the superiority of our approach: gst-store outperforms its competitors by more than 20%-30% in most cases.

Keywords: spatiotemporal query; RDF graph; tree index

1 Introduction

Nowadays, we can use the Resource Description Framework (RDF) (Klyne, Carroll, & McBride, 2004), which is recommended by the World Wide Web Consortium (W3C) as the foundation of the Semantic Web, to store knowledge. An RDF statement is a triple presented as 〈subject, predicate, object〉, which describes a property value of a subject or the relation between two entities, namely the subject and the object. In practice, a huge number of entities and statements contain spatial and temporal information, e.g., a city is always located in a specific location, and a transient event happens at a specific time point. Therefore, the schema of the RDF data needs to be extended to express the spatiotemporal semantics. For example, 〈Ulm Coordinates 48.39841/9.99155〉 describes the latitude and longitude of a spatial entity “Ulm”. 〈Albert_Einstein WonPrize Nobel_Prize (59.35,18.0667) (1921-##-##,1921-##-##)〉 denotes the event that Einstein won the Nobel Prize in 1921 at a location with the coordinates 59.35°N, 18.0667°E.

Based on the spatiotemporal RDF data, users can ask more meaningful queries. For example, it is useful to count the fast food restaurants near someone’s workplace, or to find the spouses in Hollywood whose age difference is more than 10 years. More practically, for meetings, incentives, conventions, and events (MICE) tourists, it is important to find newly renovated motels that are near places of interest. In order to answer these spatiotemporal queries efficiently and effectively, it is important to build an RDF query engine for spatiotemporal RDF data.

Although spatial and temporal data can be managed using traditional spatiotemporal databases, the “pay-as-you-go” nature of RDF poses new challenges for existing solutions. Firstly, the RDF data have diverse graph structures for different entities, a property that does not fit the traditional entity–relationship (ER) model. Though the column-based relational database partially solves the problem, it also suffers from the multiple values and null values in RDF data. Secondly, the incomplete spatiotemporal information in RDF data makes it inefficient to retrieve the spatiotemporal RDF data using a “join” operator, i.e., too many intermediate results might be generated. The entities and the statements without the necessary spatiotemporal information should be filtered out early. In summary, the traditional spatiotemporal databases are not suitable for spatiotemporal RDF data management.

A spatiotemporal RDF data set can be linked to other RDF repositories to provide structural queries with both semantic and spatiotemporal features. In this case, the spatiotemporal information-integrated RDF data are more suitable for providing location-based and time-based semantic search for users. Though queries are often related to spatiotemporal information, it is hard to find a short query that includes all four kinds of spatiotemporal constraints. As a result, we artificially build an example: a user wants to find a physicist who was born in a circular area, with the center located at coordinates (49°N, 10°E) and having a radius of 300 km (this area is the southern area of Germany), and who won some academic award in some place where the distance between the place and his birth place is <1500 km. Additionally, he was born before the year 1900, and he won the prize before his 50th birthday. The query, referred to as Example 1 in the rest of this paper, can be represented as a SPARQL-like query. Section 3.2 gives the formalized definition of the query.

In this paper, we extend the semantics of the SPARQL language by integrating spatial and temporal feature assertions (the extended SPARQL query is called the S-T query, short for spatial-temporal SPARQL query). The spatial and the temporal constraints assert the location of an entity/event and the event’s valid time, for instance, distance(place(?y), place(49, 10)) < 300 and time(?t1) < date(1900.01.01) in Example 1.

In order to answer S-T queries in a uniform manner, we propose a tree-style index structure (called the ST-tree). The ST-tree index is a height-balanced n-ary tree. The semantic features and the spatiotemporal features are integrated within the ST-tree, and the ST-tree combines the advantages of the R-tree (Guttman, 1984) and the VS-tree (Zou, Mo, Chen, Özsu, & Zhao, 2011) in two steps. First, we encode the entities and RDF triples into bit strings (called “signatures”) to form a signature graph. The ST-tree is constructed over the signature graph, and a list of pruning rules that consider both spatiotemporal and semantic constraints in the query is proposed to reduce the search space during S-T query processing. Second, we introduce a cost model to guide the ST-tree construction.

To summarize, we make the following contributions in this paper.

1. We formalize the spatiotemporal queries by extending the semantics of SPARQL queries, and the spatiotemporal queries are used to retrieve information over the RDF data integrating the spatial and temporal information. Flexible spatiotemporal and semantic constraints are used in the spatiotemporal queries.

2. We build a novel tree-style index integrating the spatiotemporal features and the semantic features, and we design a cost model-based approach to build the ST-tree.

3. Based on the ST-tree, we design a novel S-T query-processing algorithm that includes both semantic and spatiotemporal pruning rules to reduce the search space.

4. We evaluate our approach on a large real-world data set, and the result shows that our approach outperforms the baselines.

The remainder of this paper is organized as follows. Section 2 reviews the existing spatiotemporal RDF data management systems and some related works. Section 3 describes the basic idea of our work and gives a list of formal definitions. Section 4 gives the whole framework of our gst-store. Sections 5 and 6 then present the technical details of our index and the query algorithm. The experimental results are shown in Section 7. Finally, we conclude this paper in Section 8.

2 Related Work

Recently, researchers have begun to pay attention to spatiotemporal RDF data. Some real-world RDF data sets that integrate spatial and temporal information are available, such as YAGO2 (Hoffart, Suchanek, Berberich, & Weikum, 2013), OpenStreetMap (Haklay & Weber, 2008), and Gov-Track. YAGO2 (Hoffart, Suchanek, Berberich, & Weikum, 2013) is an RDF data set based on Wikipedia and WordNet. Additionally, YAGO2 integrates GeoNames, which is a geographical database that contains more than 10 million geographical names, to express the spatial information of the entities. At the same time, some statements have temporal information, e.g., the objects of the predicates “bornOnDate”, “wasCreatedOnDate” and so on denote the time at which the subjects were born or created. Based on the spatiotemporal information and some simple inference rules, YAGO2 generates a list of spatial entities and a list of spatiotemporal statements (Hoffart, Suchanek, Berberich, & Weikum, 2013).

Many RDF management systems (Abadi, Marcus, Madden, & Hollenbach, 2007, 2009; Broekstra, Kampman, & Van Harmelen, 2002; Neumann & Weikum, 2009; Weiss, Karras, & Bernstein, 2008; Wilkinson, 2009; Wilkinson, Sayers, Kuno, & Reynolds, 2003) have been proposed in the past years. RDF-3x (Neumann & Weikum, 2010), Hexastore (Weiss, Karras, & Bernstein, 2008) and gStore (Zou, Mo, Chen, Özsu, & Zhao, 2011) are the state-of-the-art RDF management systems. In these management systems, the RDF data are well organized and indexes are used to efficiently and effectively answer the RDF queries. Unfortunately, since their indexes are designed specifically for structural RDF queries and none of the systems takes spatial or temporal features into consideration, none of them is suitable for spatiotemporal RDF data management without substantial modification.

To the best of our knowledge, few SPARQL query engines consider spatial and temporal queries over RDF data in a uniform manner except for YAGO2 Demo (Hoffart et al., 2011) and SPARQL-ST (Perry, Jain, & Sheth, 2011). However, YAGO2 Demo (Hoffart et al., 2011) uses hard-coded spatial/temporal predicates to define the spatiotemporal queries. Six (hard-coded) spatial predicates (“northOf”, “eastOf”, “southOf”, “westOf”, “nearby”, and “locatedIn”) and four (hard-coded) temporal predicates (“before”, “after”, “during”, and “overlaps”) over statements are used in the YAGO2 Demo. Users can construct queries as a list of triple patterns with the spatial and temporal predicates. Other spatiotemporal queries are not supported. Since all spatiotemporal predicates are fixed, YAGO2 Demo does not allow flexible spatiotemporal range queries or join queries. The spatiotemporal semantics for the statements is limited, and the spatial semantics of the entities is missing.

Perry et al. (Perry, Jain, & Sheth, 2011) propose SPARQL-ST based on the work of Gutierrez et al. (Gutierrez, Hurtado, & Vaisman, 2007) for integrating the spatial information of entities and the temporal information of statements. In their study, Perry et al. (Perry, Jain, & Sheth, 2011) formalize the storage schema for the spatial entities and the temporal statements, in addition to formalizing the spatiotemporal graph pattern to construct SPARQL-ST. Their work implements a query engine by extending a commercial relational database that supports spatial objects, i.e., by dividing the spatiotemporal RDF data into three main tables (namely, triple table, spatial table, and temporal table) to store the data and by utilizing the literal and the spatiotemporal indexes of the relational database to evaluate the SPARQL-ST queries. In contrast to our framework, the spatial semantics of statements is missing, and the storage schema is not suitable for real, big RDF data, e.g., >100 million statements of which only a part has spatiotemporal information. The time cost of self-joins is unacceptable.

Furthermore, Batsakis et al. (Batsakis & Petrakis, 2010) and Lyell et al. (Lyell, Voyadgis, Song, Ketha, & Dibner, 2011) try to build spatiotemporal ontologies to organize the spatiotemporal RDF data. The corresponding ontology-based query languages are introduced to retrieve the spatiotemporal RDF data. These works introduce well-designed ontologies, and the query capability has been widely extended. However, these two reports have little discussion on how to answer the spatiotemporal RDF queries efficiently, and the query performance is not evaluated. Additionally, since the statements cannot be seen as vertices, the ontology-based model is not suitable for organizing the spatiotemporal information of the statements.

Besides, several other proposals take either spatial features or temporal features of RDF data into consideration. Brodt et al. (Brodt, Nicklas, & Mitschang, 2010) and Erling and Mikhailov (Erling & Mikhailov, 2009) utilize RDF query engines and a spatial index to manage spatial RDF data. Brodt et al. (Brodt, Nicklas, & Mitschang, 2010) use RDF-3x as the base RDF query engine, and add a spatial index for filtering entities before or after RDF-3x join operations. These two approaches only support range queries (and spatial joins (Erling & Mikhailov, 2009)) on entities, and the spatial entities follow the GeoRSS GML (Singh, Turner, Maron, & Doyle, 2008) model. Our early work on S-store (Wang et al., 2013) integrates spatial information into the RDF data. In S-store, a tree index called the SS-tree is used. First, an R-tree based on the spatial entities and a VS-tree based on the nonspatial entities are built separately, and then the two trees are combined to form the SS-tree. The R-tree and the VS-tree pruning rules are used to generate the candidates for the queries. This brute-force combination disregards the fact that an entity carries spatial features and semantic features at the same time. In contrast, we propose a cost model-based method to take both spatiotemporal features and semantic features into consideration while constructing the tree index.

Gutierrez et al. (Gutierrez, Hurtado, & Vaisman, 2007; Gutiérrez, Hurtado, & Vaisman, 2005) give formal definitions of the temporal RDF graph, and prefer to use time interval labeling on an RDF graph to integrate temporal information into RDF data. Furthermore, their first work (Gutierrez, Hurtado, & Vaisman, 2007) introduces a simple query language for temporal RDF data. Based on the work of Gutierrez et al., several query languages have been proposed, such as T-SPARQL (Grandi, 2010), SPARQL-ST (Perry, Jain, & Sheth, 2011) and τ-SPARQL (Tappolet & Bernstein, 2009). Tappolet et al. (Tappolet & Bernstein, 2009) propose a temporal RDF data management framework. The named graph is used to manage the statements with different time intervals, and a tree-style index keyTree is introduced to efficiently retrieve the valid time interval and the involved triples at a certain time point. In contrast, Pugliese et al. (Pugliese, Udrea, & Subrahmanian, 2008) extend the work of (Gutierrez, Hurtado, & Vaisman, 2007), and introduce a novel tree-style index to efficiently and effectively answer the temporal RDF queries. Firstly, they combine the graph distance metric and the temporal distance metric to build a metric called the tGRIN distance metric. Then, based on the tGRIN distance metric, the entries are clustered. The clusters with different granularity constitute the tGRIN tree-style index. Based on the tGRIN index, two pruning rules are introduced to efficiently answer the temporal queries. However, most of the statements in real data sets (e.g., YAGO2) lack temporal information. Therefore, the tGRIN metric fails since it is hard to compute the temporal distance between temporal statements and non-temporal statements. Thus, the pruning rules are inefficient. Besides, the pruning rules are based on mapping the constants in the query to the data set. If the constants in the query are high-degree nodes (e.g., the type “city”) or if there is no constant in the query, the pruning rules are also inefficient.

3.1 SPARQL vs. Subgraph Match

An RDF data set is a list of RDF triples. Here, we have a sample RDF data set (shown in Figure 1(a)), which consists of 25 triples. We call each triple a statement. The answer of a SPARQL query is a list of statements that satisfy the SPARQL constraints. We also regard an RDF data set as a graph (called RDF graph G). Figure 1(b) shows the corresponding RDF graph of the sample data set. Furthermore, a SPARQL query can also be modeled as a graph structure Q. Therefore, answering a SPARQL query is equivalent to finding subgraph matches of query graph Q over RDF graph G. The formal definitions are given as follows.

Fig. 1

Sample RDF Data Set

Definition 1

A statement is a triple $\langle s, p, o \rangle$, where $s$, $p$, and $o$ represent subject, predicate, and object, respectively.

Definition 2

The RDF data graph is denoted as $G = \langle V, E, L_V, L_E \rangle$, where

1. $V = V_l \cup V_e \cup V_c \cup V_b$ denotes all RDF vertices, where $V_l$, $V_e$, $V_c$, and $V_b$ are the sets of literal vertices, entity vertices, class vertices, and blank nodes, respectively.

2. $E$ is a collection of the edges between vertices.

3. $L_V = \{\text{URI}\} \cup \{\text{Literal Value}\} \cup \{\text{null}\}$ is the collection of all vertex labels (i.e., $label(v)$), where $v \in V_e \cup V_c \Leftrightarrow label(v) \in \{\text{URI}\}$ and $v \in V_l \Leftrightarrow label(v) \in \{\text{Literal Value}\}$. For $v \in V_b$, $label(v)$ is null.

4. $L_E$ is the collection of edge labels, i.e., all possible predicates.

Definition 3

The SPARQL query graph is denoted as $Q = \langle V, E, L_V, L_E \rangle$, where

1. $V = V_l \cup V_e \cup V_c \cup V_b \cup V_p$, where $V_p$ denotes the parameter vertices, and $V_l$, $V_e$, $V_c$, and $V_b$ are the same as in Definition 2.

2. $E$ and $L_E$ are the same as in Definition 2.

3. $L_V$ is the same as in Definition 2, except that for $v \in V_p$, $label(v)$ is null.

Definition 4

Consider an RDF graph $G$ and a query graph $Q$ with $n$ vertices $\{v_1, \ldots, v_n\}$. A list of $n$ corresponding vertices $\{u_1, \ldots, u_n\}$ in $G$ is said to be a match of $Q$ if and only if the following conditions hold:

1. If $v_i \in V_l \cup V_c \cup V_e$, then $u_i \in V_l \cup V_c \cup V_e$ and $label(v_i) = label(u_i)$;

2. If $v_i \in V_b \cup V_p$, $label(u_i)$ is unrestricted;

3. If there is an edge $v_i v_j$ from $v_i$ to $v_j$ in $Q$, there is also an edge $u_i u_j$ from $u_i$ to $u_j$ in $G$. If $v_i v_j$ has predicate $p$, $u_i u_j$ must have the same predicate $p$.

Here, an RDF data set is seen as a list of statements. A statement is also regarded as an edge in the RDF graph connecting the subject vertex and the object vertex, with the predicate as the edge label. The subjects and the objects constitute the vertex set of the RDF graph. If a vertex is an entity or a class, the vertex label is a uniform resource identifier (URI). If the subject or the object is a string, the vertex label is the corresponding literal value. Note that the label of a vertex can be null, i.e., the vertex is a blank node.

A SPARQL query is a small graph similar to the RDF graph. In contrast to the RDF graph, the SPARQL query graph contains a special kind of vertex, i.e., the parameter vertex. The identifier of a parameter vertex starts with a “?”, and the label of the parameter vertex is regarded as null. The matches of the SPARQL query graph in the RDF graph are the result of the SPARQL query.
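As a rough illustration of Definition 4, the following Python sketch enumerates matches by brute force. It assumes constant predicates, treats every term that starts with “?” as a parameter vertex, and uses a hypothetical three-triple data set whose predicate names (bornIn, wonPrize, locatedIn) are chosen for illustration only. It is not the matching algorithm of gst-store.

```python
from itertools import product

# Hypothetical toy data set: each statement is a (subject, predicate, object) triple.
DATA = [
    ("Albert_Einstein", "bornIn", "Ulm"),
    ("Albert_Einstein", "wonPrize", "Nobel_Prize"),
    ("Ulm", "locatedIn", "Baden-Württemberg"),
]

# Query graph: "?"-prefixed terms are parameter vertices.
QUERY = [("?x", "bornIn", "?y"), ("?x", "wonPrize", "?z")]


def is_variable(term):
    """Parameter vertices start with '?' and have an unrestricted label."""
    return term.startswith("?")


def find_matches(query, data):
    """Try every assignment of data vertices to the query variables."""
    edges = set(data)
    data_vertices = sorted({s for s, _, _ in data} | {o for _, _, o in data})
    variables = sorted({t for s, _, o in query for t in (s, o) if is_variable(t)})
    matches = []
    for assignment in product(data_vertices, repeat=len(variables)):
        binding = dict(zip(variables, assignment))
        # Constants map to themselves (condition 1); variables use the binding (condition 2).
        resolved = [(binding.get(s, s), p, binding.get(o, o)) for s, p, o in query]
        # Every query edge must exist in the data with the same predicate (condition 3).
        if all(edge in edges for edge in resolved):
            matches.append(binding)
    return matches


MATCHES = find_matches(QUERY, DATA)
```

The enumeration is exponential in the number of variables, which is the cost that index-based pruning is meant to avoid.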

3.2 Spatiotemporal RDF

In this section, we formally define the spatiotemporal RDF data and the spatiotemporal SPARQL query as follows.

Definition 5

An entity e is called a spatial entity if it has an explicit location labeled with the coordinates x and y (for the two-dimensional situation). The other entities are called nonspatial entities.

Definition 6

An S-T statement is a five-tuple $\langle s, p, o, L, T \rangle$, where $s$, $p$, $o$, $L$, and $T$ represent subject, predicate, object, location, and time interval, respectively. The S-T statement is an extension of the original RDF statement, where $s$, $p$, and $o$ are the original elements. $L$ denotes the spatial feature (the coordinates) of a statement, and $T$ has the start time $T_s$ and the end time $T_e$ to denote the valid time interval of a statement, i.e., the statement is considered to be credible in this time interval. Specifically, $T_s = T_e$ if and only if the statement happens at a time point.

Definition 7

If the L in a statement is not null, the statement is called a spatial statement. Otherwise, it is called a non-spatial statement.

Definition 8

If the T in a statement is not null, the statement is called a temporal statement. Otherwise, it is called a non-temporal statement.
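Definitions 6 to 8 can be pictured with a small record type. The sketch below is only one possible in-memory representation; the class name STStatement, its field names, and the use of None for a null L or T are assumptions of this illustration. The two sample statements are the ones quoted in the text.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class STStatement:
    """An S-T statement <s, p, o, L, T>; L and T are None when they are null."""
    subject: str
    predicate: str
    obj: str
    location: Optional[Tuple[float, float]] = None  # L: (latitude, longitude)
    interval: Optional[Tuple[str, str]] = None      # T: (Ts, Te)

    def is_spatial(self) -> bool:
        """Definition 7: L is not null."""
        return self.location is not None

    def is_temporal(self) -> bool:
        """Definition 8: T is not null."""
        return self.interval is not None

    def is_time_point(self) -> bool:
        """Ts = Te means the statement happens at a time point."""
        return self.interval is not None and self.interval[0] == self.interval[1]


prize = STStatement("Albert_Einstein", "WonPrize", "Nobel_Prize",
                    (59.35, 18.0667), ("1921-##-##", "1921-##-##"))
name = STStatement("People", "hasName", "Name")
```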

Definition 9

An S-T triple pattern is a five-tuple $\langle s, p, o, L, T \rangle$, where $s$, $p$, $o$, $L$, and $T$ represent subject, predicate, object, location, and time interval, respectively. In contrast to the S-T statement, each item of an S-T triple pattern can be replaced by a variable. An S-T statement $S$ is called a match of an S-T triple pattern $P$ if the nonvariable items are the same in $S$ and $P$. The variable items in $P$ are mapped to the corresponding items in $S$.
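The matching rule of Definition 9 can be sketched in a few lines of Python. Statements and patterns are plain five-element tuples here, and the requirement that a variable used twice must bind to the same value is an assumption of this sketch rather than something the definition states.

```python
def is_variable(item):
    """Variables are strings that start with '?'."""
    return isinstance(item, str) and item.startswith("?")


def match_pattern(pattern, statement):
    """Return the variable bindings if the statement matches the pattern, else None."""
    bindings = {}
    for p_item, s_item in zip(pattern, statement):
        if is_variable(p_item):
            # Assumption: a repeated variable must always map to the same item.
            if bindings.setdefault(p_item, s_item) != s_item:
                return None
        elif p_item != s_item:
            return None  # a nonvariable item differs
    return bindings


# The S-T statement quoted in the Introduction.
STATEMENT = ("Albert_Einstein", "WonPrize", "Nobel_Prize",
             (59.35, 18.0667), ("1921-##-##", "1921-##-##"))

BINDINGS = match_pattern(("?x", "WonPrize", "?z", "?l", "?t"), STATEMENT)
```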

Definition 10

An S-T query is a list of S-T triple patterns with some spatial and temporal filtering conditions. If there is neither a spatial nor a temporal filtering condition, the S-T query degenerates to a traditional SPARQL query.

Definition 11

The spatiotemporal filtering conditions are represented as spatiotemporal assertions in this paper.

Given an S-T query Q, the spatial assertions are expressed as distance(place(a), place(b)) < d, where distance and place are reserved words in gst-store, a and b are variables in Q or specific geometry points, and d is a constant given by the user.

The temporal assertions are expressed as time(a) < time(b) ± Xyear-Ymonth-Zday or time(c) < y-m-d. Here, X, Y, Z and y, m, d are parameters given by the user to denote the values of year, month and day; a, b, c are variables in Q, and “time” and “year-month-day” are reserved words in gst-store. Note that the comparison operator “<” can be replaced by “=” or “>”.

Figure 2 shows a subset of a spatiotemporal RDF data set. Ulm, Baden-Württemberg, and Gdańsk are spatial entities. Some statements are spatial statements, such as #1, #2, and #6, and some statements are temporal statements, such as #10, #17, and #20. Besides, there are a lot of nonspatial entities, as well as nonspatial and non-temporal statements. For example, people have no spatial information since we cannot locate a person on the map. Similarly, statements such as 〈People hasName Name〉 are nonspatial and non-temporal statements. In gst-store, we use “S-T assertion” to represent the spatiotemporal constraints in S-T queries. For example, in Example 1, the filtering conditions list four kinds of spatiotemporal constraints, where distance(place(?y), place(49, 10)) < 300 is a spatial range constraint, distance(place(?l2), place(?y)) < 1500 is a spatial join constraint, time(?t1) < date(1900.01.01) is a temporal range constraint, and time(?t2) < time(?t1) ± (50year-01month-01day) is a temporal join constraint.

Fig. 2

Spatiotemporal RDF Data

In this stage, we support (i) the spatial range query and the spatial join semantics for spatial entities and statements, and (ii) the temporal range query and the temporal join semantics for temporal statements.

In practice, we use place(?x) to denote the spatial label of variable ?x. Also, distance(a, b) < r denotes that the distance between a and b should be below the threshold r, where a and b should be a specific location or a variable. If either a or b is a constant, the constraint is called a spatial range assertion. If both a and b are variables, the constraint is called a spatial join assertion. Note that a spatial query can have range assertions and spatial join assertions at the same time.
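The two kinds of spatial assertion can be checked numerically with the coordinates quoted in this paper. The sketch below assumes that distance is the great-circle (haversine) distance in kilometres on a sphere of radius 6371 km; the distance function actually used by gst-store is not specified in this section, so this choice, like the function names, is an assumption.

```python
import math

EARTH_RADIUS_KM = 6371.0  # assumption: mean Earth radius, spherical model


def distance_km(a, b):
    """Haversine distance between two (latitude, longitude) points in degrees."""
    lat1, lon1 = map(math.radians, a)
    lat2, lon2 = map(math.radians, b)
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(h))


def is_variable(term):
    return isinstance(term, str) and term.startswith("?")


def assertion_kind(a, b):
    """Join assertion if both arguments are variables, range assertion otherwise."""
    return "spatial join" if is_variable(a) and is_variable(b) else "spatial range"


ULM = (48.39841, 9.99155)      # coordinates of Ulm quoted in the Introduction
CENTER = (49.0, 10.0)          # centre of the circular area in Example 1
PRIZE_PLACE = (59.35, 18.0667)  # location of the WonPrize statement

range_ok = distance_km(ULM, CENTER) < 300        # distance(place(?y), place(49, 10)) < 300
join_ok = distance_km(PRIZE_PLACE, ULM) < 1500   # distance(place(?l2), place(?y)) < 1500
```

Under this distance model both assertions of Example 1 hold for the binding ?y = Ulm.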

Similarly, we use timestart(?x) and timeend(?x) to denote the $T_s$ and $T_e$ features of variable ?x, respectively. Note that time(?x) denotes that both $T_s$ and $T_e$ should satisfy the constraints. In the temporal assertions, we use “a < b”, “a = b” and “a > b” to denote the time order of a and b, where a and b are either a temporal feature or a time point. If either a or b is a time point, the constraint is called a temporal range assertion. If both a and b are expressions that include variables, the constraint is called a temporal join assertion.
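A temporal assertion can be evaluated along the same lines. The sketch below handles only the “<” comparison, reads the offset Xyear-Ymonth-Zday as an addition, and clamps the day of the month when the shifted month is shorter. The helper names are hypothetical, and the dates are illustrative stand-ins: Einstein's birth date is not taken from Figure 2, and the partially known value 1921-##-## is replaced by the whole year 1921.

```python
import calendar
from datetime import date, timedelta


def shift(day, years=0, months=0, days=0):
    """Add an Xyear-Ymonth-Zday offset to a date (day of month is clamped)."""
    index = day.year * 12 + (day.month - 1) + years * 12 + months
    year, month0 = divmod(index, 12)
    last_day = calendar.monthrange(year, month0 + 1)[1]
    return date(year, month0 + 1, min(day.day, last_day)) + timedelta(days=days)


def time_before(interval, bound):
    """time(?x) < bound: both Ts and Te must precede the bound."""
    ts, te = interval
    return ts < bound and te < bound


BIRTH = (date(1879, 3, 14), date(1879, 3, 14))  # a time point, Ts = Te (illustrative)
PRIZE = (date(1921, 1, 1), date(1921, 12, 31))  # stand-in for 1921-##-##

# Temporal range assertion: time(?t1) < date(1900.01.01)
range_ok = time_before(BIRTH, date(1900, 1, 1))

# Temporal join assertion: time(?t2) < time(?t1) + (50year-01month-01day)
join_bound = shift(BIRTH[0], years=50, months=1, days=1)
join_ok = time_before(PRIZE, join_bound)
```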

For instance, Example 1 is an S-T query that includes a spatial range assertion, a spatial join assertion, a temporal range assertion, and a temporal join assertion at the same time.

The S-T RDF data set and the S-T query can also be modeled as graphs (Definitions 12 and 13). Query processing amounts to finding the matches (Definition 14) of an S-T query graph Q in an S-T RDF data graph G. Figure 3 shows the graph corresponding to the S-T RDF data set in Figure 2, where the spatial entities and the spatial statements are all surrounded by red rectangles, and the temporal statements are surrounded by blue rectangles. Note that if a temporal statement is already surrounded by a red rectangle, we only surround the temporal feature of the statement with a blue rectangle.

Fig. 3

Spatiotemporal RDF Graph

Definition 12

The S-T RDF data graph is denoted as $G = \langle V, E, L_V, L_E, S_V, S_E, T_V, T_E \rangle$, where

1. $V$, $E$, $L_V$, and $L_E$ are the same as in Definition 2.

2. $S_V$ and $S_E$ represent the spatial labels of $V$ and $E$, respectively, where the spatial labels denote the position of the entity (the event), i.e., the latitude and longitude (only valid for spatial entities and spatial statements).

3. $T_V$ and $T_E$ represent the temporal labels of $V$ and $E$, respectively, where the temporal labels denote the time interval when the entity (the event) occurs, i.e., the start time and the end time.

Definition 13

The S-T SPARQL query graph is denoted as $Q = \langle V, E, L_V, L_E, SC_V, SC_E, TC_V, TC_E \rangle$, where

1. $V$, $E$, $L_V$, and $L_E$ are the same as in Definition 3.

2. $SC_V$ and $SC_E$ represent the spatial assertions of $V$ and $E$, respectively, where the spatial assertions can be an absolute area or the relative position for some parameter.

3. $TC_V$ and $TC_E$ represent the temporal assertions of $V$ and $E$, respectively, where the temporal assertions can be an absolute time interval or the relative relation for some parameter expressions, such as “>”, “=”, and “<”.

Definition 14

Consider an S-T RDF graph $G$ and an S-T query graph $Q$ with $n$ vertices $\{v_1, \ldots, v_n\}$. A list of $n$ vertices $\{u_1, \ldots, u_n\}$ in $G$ is said to be a match of $Q$ if and only if the conditions in Definition 4 and the following conditions hold:

1. If $v_i \in V_p$, the spatial label $S(u_i)$ must satisfy the spatial assertion $SC(v_i)$, and the temporal label $T(u_i)$ must satisfy the temporal assertion $TC(v_i)$;

2. If there is an edge $v_i v_j$ from $v_i$ to $v_j$ in $Q$, there is also an edge $u_i u_j$ from $u_i$ to $u_j$ in $G$. If $v_i v_j$ has spatial (temporal) assertions, $u_i u_j$ must have the corresponding spatial (temporal) label that satisfies the spatial (temporal) assertions.

We show the graph view of Q in Figure 4. We can find that there is a match of Q in the S-T RDF data graph satisfying all the constraints of Q, where the result of ?x, ?y and ?z is “Albert_Einstein”, “Ulm”, and “Nobel_Prize”, respectively.

Fig. 4

The Graph View of Q

3.3 S-T Signature Graph

In gst-store, we use a bit string, a minimum bounding rectangle (MBR) of a spatial feature (the coordinates), and a segment of a temporal feature (the time interval) to denote an entity. The bit string is called a signature. The original S-T RDF graph is converted to an S-T signature graph in gst-store.

The signature $sig$ of each subject $s$ depends on all the edges $\{e_1, e_2, \ldots, e_n\}$ adjacent to $s$. For each $e_i$, a list of hash functions is used to generate a signature $sig.e_i$, where the first $N$ bits denote the predicate, and the following $M$ bits denote the object. The valid bits (i.e., the bits with value “1”) depend on the hash codes of the corresponding textual information. For instance, suppose that we use two hash functions for the predicates and two hash functions for the URI/literals, and $N$ and $M$ are both set to 5. Here, for the edge (statement) 〈Ulm isCalled “Ulm”〉, the hash codes of the predicate isCalled are 1 and 5 and the hash codes of the literal value “Ulm” are 2 and 4 based on the hash functions. Therefore, the edge is represented as 10001 01010 in Figure 5, where the first 5 bits represent the predicate isCalled, and the last 5 bits represent the literal value “Ulm”. The signature of $s$ is $sig = sig.e_1 | sig.e_2 | \ldots | sig.e_n$, where $|$ denotes the bitwise OR and $sig.e_1, sig.e_2, \ldots, sig.e_n$ are the signatures of the out-edges of $s$.

Fig. 5

Encoding Technique

For example, in Figure 2, there are four edges starting from Ulm (#8, #9, #10, and #11). Suppose that we set the first five bits for the predicate and the following five bits for the object. Then we get four signatures 0001101000, 1000101010, 1001000010 and 0001100011 corresponding to the four edges. Thus, Ulm can be represented as 1001101011. Figure 5 shows the encoding process for “Ulm”. Note that only the entity and class vertices in the RDF graph are encoded.
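The encoding of the running example can be reproduced with a short Python sketch. Because the concrete hash functions are not given, the functions below take the hash codes as input; the function names and the convention that hash code k sets the k-th bit from the left are assumptions that are consistent with the isCalled example above.

```python
def bits_from_codes(codes, width):
    """Set the bit at each 1-based, left-to-right position listed in codes."""
    bits = ["0"] * width
    for code in codes:
        bits[code - 1] = "1"
    return "".join(bits)


def encode_edge(pred_codes, obj_codes, n=5, m=5):
    """First n bits encode the predicate, the following m bits encode the object."""
    return bits_from_codes(pred_codes, n) + bits_from_codes(obj_codes, m)


def union_signatures(signatures):
    """Bitwise OR of equally long bit strings."""
    width = len(signatures[0])
    value = 0
    for sig in signatures:
        value |= int(sig, 2)
    return format(value, f"0{width}b")


# Edge <Ulm isCalled "Ulm">: predicate hash codes 1 and 5, object hash codes 2 and 4.
is_called_edge = encode_edge([1, 5], [2, 4])

# The four out-edge signatures of Ulm quoted in the text.
ULM_EDGES = ["0001101000", "1000101010", "1001000010", "0001100011"]
ulm_signature = union_signatures(ULM_EDGES)
```

The OR of the four edge signatures yields 1001101011, the vertex signature of Ulm stated above.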

Then, for each vertex $v_i$ and each edge $e_j$, we set $MBR(v_i)$ and $MBR(e_j)$, where $MBR(x)$ denotes the MBR of the spatial feature of $x$. Next, we set the segments $seg(v_i)$ and $seg(e_j)$ on the time axis to denote the temporal features. Note that the $seg(v)$ of all the entities is null at this stage. Subsequently, for each node $v$, the segments $seg(e_i)$ of all edges $e_i$ starting from $v$ are combined into a union segment that denotes the temporal feature of $v$, i.e., $seg(v) = \bigcup_i seg(e_i)$.

Given an S-T query Q, Q can also be easily transformed into an S-T signature query Q* using the conversion method described above. We define the match of Q* in the S-T signature RDF graph as follows. It can be easily derived that each match (Definition 14) of Q in G corresponds to a match (Definition 15) of Q* in G*.

Definition 15

Given an S-T signature graph $G^*$ and an S-T signature query graph $Q^*$ with $n$ signature vertices $\{q_1, \ldots, q_n\}$, a set of distinct signature vertices $\{sig_1, \ldots, sig_n\}$ in $G^*$ is a match of $Q^*$ if and only if the following conditions hold:

1. $\forall q_i$, $sig_i.signature \,\&\, q_i.signature = q_i.signature$;

2. $\forall q_i$, the spatiotemporal labels of $sig_i$ must satisfy the spatiotemporal assertions;

3. If there is an edge $q_i q_j$ from $q_i$ to $q_j$ in $Q^*$, there is also an edge $sig_i sig_j$ from $sig_i$ to $sig_j$ in $G^*$, and $q_i q_j.signature \,\&\, sig_i sig_j.signature = q_i q_j.signature$. If $q_i q_j$ has a spatial (temporal) assertion, $sig_i sig_j$ must have the spatial (temporal) label that satisfies the assertion.
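The signature condition in Definition 15 is a bitwise containment test. The sketch below applies it to the signature of Ulm from the running example; the function name covers and the query signatures are hypothetical. The last query signature is constructed to show that a vertex signature can cover a query although no single out-edge does, which suggests why signature matches serve as candidates that still have to be verified.

```python
def covers(data_sig: str, query_sig: str) -> bool:
    """True if every bit set in the query signature is also set in the data signature."""
    d, q = int(data_sig, 2), int(query_sig, 2)
    return (d & q) == q


ULM = "1001101011"  # vertex signature of Ulm from the running example
ULM_EDGES = ["0001101000", "1000101010", "1001000010", "0001100011"]

# A query vertex with one out-edge <?x isCalled "Ulm"> has signature 1000101010.
exact_hit = covers(ULM, "1000101010")

# A query signature with a bit that Ulm does not have is rejected.
miss = covers(ULM, "0100000000")

# Hypothetical query: predicate bits of one edge combined with object bits of another.
mixed = "0001101010"
vertex_passes = covers(ULM, mixed)
some_edge_passes = any(covers(edge, mixed) for edge in ULM_EDGES)
```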

4 Overview of gst-store

gst-store uses a hybrid index that integrates both R-tree (Guttman, 1984) and VS-tree (Zou, Mo, Chen, Özsu, & Zhao, 2011). Therefore, the pruning strategies of R-tree and VS-tree are also integrated as the searching strategy of gst-store. Our framework consists of the preprocessing, the index construction and the query processing stages.

In the preprocessing stage, we first encode each vertex and edge as a bit string (we call it a signature). Subsequently, we build the S-T signature graph G*. Figure 6 shows a running example. In Figure 6, the entities or the statements surrounded by the dotted rectangles have the spatial feature or the temporal feature. The spatial features are represented as red “MBR(·,·)”, and the temporal features are represented as blue “####-##-##”. Since the nodes that have no out-edges are not encoded, they are not taken into consideration in the S-T signature graph.

Fig. 6

S-T Signature Graph

In the index construction stage, we construct a tree-style index based on the S-T signature graph to effectively reduce the search space. The index is called the ST-tree. Figure 7 shows a running example. The nodes on the same level of the ST-tree form an S-T signature graph. If there is a match of a query Q in a lower S-T signature graph, there must be a corresponding match in each higher S-T signature graph. Therefore, we need to guarantee that the ST-tree is a height-balanced tree.

Fig. 7

ST-tree

In the query processing stage, given a query graph Q, we first convert Q into the S-T signature query graph Q*. Figure 8 shows the S-T signature query graph of the example Q in Section 3.2. In Figure 8, the edges and the nodes are encoded, and the spatiotemporal constraints are added to the edges and nodes. Note that if there is a set of vertices in G that matches a query graph Q, there must be a corresponding match in G* of Q*. Subsequently, we run a top-down search algorithm over the ST-tree to find the matches of Q* in G*. Finally, we retrieve the corresponding textual result and return it to the user.
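The idea behind a top-down search can be sketched for a single query vertex. The code below is a simplified illustration rather than the gst-store algorithm: it prunes on signatures only, ignores edges and the spatial and temporal labels, and uses a hypothetical two-leaf tree built from the signatures of Ulm and Baden-Württemberg that appear in the running example.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class TreeNode:
    sig: int                      # signature stored as an integer bit mask
    name: str = ""
    children: List["TreeNode"] = field(default_factory=list)


def search(node: TreeNode, query_sig: int) -> List[str]:
    """Collect the leaves whose signature covers query_sig, pruning whole subtrees."""
    if (node.sig & query_sig) != query_sig:
        return []                 # no descendant can cover the query signature
    if not node.children:
        return [node.name]        # a leaf that survives is a candidate
    hits = []
    for child in node.children:
        hits.extend(search(child, query_sig))
    return hits


ulm = TreeNode(0b1001101011, "Ulm")
bw = TreeNode(0b1101000010, "Baden-Württemberg")
root = TreeNode(ulm.sig | bw.sig, "parent", [ulm, bw])

candidates = search(root, 0b1000101010)
```

Pruning at an inner node is safe because the signature of a parent is the union of the signatures of its children.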

Fig. 8

The S-T Signature Graph Q* of Q

5 Index Construction

In this section, we introduce our S-T RDF index ST-tree. The index is presented in a tree style. Generally speaking, we build the ST-tree based on the VS-tree and the R-tree. The ST-tree is used to generate the candidates for the variables.

5.1 The ST-tree Structure

The ST-tree is a hybrid tree style index combining the VS-tree (Zou, Mo, Chen, Özsu, & Zhao, 2011) and the R-tree (Guttman, 1984). The VS-tree is an extension of the S-tree (Deppisch, 1986). As shown in Figure 7, the ST-tree is a height balanced n-ary tree, and each level of the ST-tree comprises an S-T signature graph. The leaves of the ST-tree and the corresponding edges between the leaves comprise the S-T signature RDF data graph, and the inner nodes of the ST-tree obey the ST-tree rule.

• ST-tree Rule: Consider two S-T signature nodes v1 and v2 and their father nodes n1 and n2. The following conditions hold:

1. n1.sig & v1.sig = v1.sig, n2.sig & v2.sig = v2.sig;

2. v1.MBR ⊆ n1.MBR, v2.MBR ⊆ n2.MBR;

3. v1.seg ⊆ n1.seg, v2.seg ⊆ n2.seg;

4. If there is an edge v1v2 between v1 and v2, there must be an edge n1n2 between n1 and n2, where n1n2.sig & v1v2.sig = v1v2.sig, v1v2.MBR ⊆ n1n2.MBR and v1v2.seg ⊆ n1n2.seg, even if n1 = n2.

The ST-tree rule ensures that the upper-level S-T signature graph is a summary graph of the lower-level S-T signature graph, i.e., each node/edge in the upper level is the union of its descendants. For example, the node $d_3^3$ is the father of the nodes Ulm and Baden-Württemberg. Thus, the signature of $d_3^3$ is 1101101011, which is the union (bitwise OR) of the signatures of Ulm and Baden-Württemberg, 1001101011 and 1101000010. The spatial MBR and the temporal interval of $d_3^3$ are also the unions of the corresponding features of the nodes Ulm and Baden-Württemberg. In the ST-tree, given levels i and i + 1 (footnote 9), we call the S-T signature graph $G_i^\star$ in level i the summary graph of the S-T signature graph $G_{i+1}^\star$ in level i + 1, and $G_{i+1}^\star$ is the expanded graph of $G_i^\star$.
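The summary relation in the example above can be sketched in a few lines of Python. This is only an illustration: the class `STNode`, the function `summarize`, the tuple layouts, and the MBR and interval values are assumptions made for the sketch; only the two signatures come from the running example.

```python
from dataclasses import dataclass
from typing import Tuple

MBR = Tuple[float, float, float, float]  # assumed layout: (min_x, min_y, max_x, max_y)
Seg = Tuple[int, int]                    # assumed layout: (start, end)


@dataclass
class STNode:
    sig: int   # signature as a bit string stored in an int
    mbr: MBR   # spatial feature
    seg: Seg   # temporal feature


def summarize(a: STNode, b: STNode) -> STNode:
    """Build a father node whose features cover both children (conditions 1-3)."""
    return STNode(
        sig=a.sig | b.sig,  # union of signatures = bitwise OR
        mbr=(min(a.mbr[0], b.mbr[0]), min(a.mbr[1], b.mbr[1]),
             max(a.mbr[2], b.mbr[2]), max(a.mbr[3], b.mbr[3])),
        seg=(min(a.seg[0], b.seg[0]), max(a.seg[1], b.seg[1])),
    )


# Signatures are taken from the text; MBRs and intervals are made-up values.
ulm = STNode(0b1001101011, (9.9, 48.3, 10.1, 48.5), (854, 2017))
bw = STNode(0b1101000010, (7.5, 47.5, 10.5, 49.8), (1952, 2017))
d33 = summarize(ulm, bw)
print(format(d33.sig, "010b"))  # prints 1101101011
```

The printed signature agrees with the value given for $d_3^3$ in the text, and `d33.sig & ulm.sig == ulm.sig` holds as required by condition 1 of the ST-tree rule.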

Theorem 1

Given an S-T signature query Q* and level i, if there is a match of Q* in the S-T signature graph $G_{i+1}^\star$ in level i + 1, there should be a corresponding match in the summary graph $G_i^\star$ of $G_{i+1}^\star$.

Proof

Suppose the match of Q* = {q1,…, qn} in $G_{i+1}^\star$ is $\{v_1^{i+1}, \ldots, v_n^{i+1}\}$. Let $\{v_1^i, \ldots, v_n^i\}$ denote the corresponding father node set of $\{v_1^{i+1}, \ldots, v_n^{i+1}\}$, i.e., $v_j^i$ is the father node of $v_j^{i+1}$. Based on Definition 15 and the ST-tree Rule, $v_j^i$.sig & qj.sig = $v_j^i$.sig & $v_j^{i+1}$.sig & qj.sig = $v_j^{i+1}$.sig & qj.sig = qj.sig. Since $v_j^{i+1}$.MBR ⊆ $v_j^i$.MBR and $v_j^{i+1}$.seg ⊆ $v_j^i$.seg, if $v_j^{i+1}$ satisfies the spatiotemporal assertions, $v_j^i$ also satisfies the spatiotemporal assertions. Similarly, if there is an edge $\overline{q_x q_y} \in Q^\star$, the corresponding edge $\overline{v_x^{i+1} v_y^{i+1}} \in G_{i+1}^\star$ is a sub edge of $\overline{v_x^i v_y^i} \in G_i^\star$. All constraints are satisfied, i.e., $\{v_1^i, \ldots, v_n^i\}$ is a match of Q* in $G_i^\star$.

□

Theorem 1 gives the correctness guarantee. If the ST-tree can be separated into several layers, i.e., the ST-tree is a height-balanced tree, the tree nodes in the upper layer can be safely pruned if the signature, the MBR, or the segment is unsatisfied.

5.2 ST-tree Construction

The ST-tree is constructed over the S-T signature graph. In Section 3.3, we have described the generation of the S-T signature graph. Each node in the S-T signature graph has three features: the signature, the spatial MBR and the temporal segment. Based on these three features, we can build an S-tree, an R-tree of spatial information and an R-tree of temporal information respectively. In the ST-tree, we integrate the three trees with different features.

We use the “insert” operation to build the ST-tree. Given a list of S-T signature nodes, we insert the nodes one by one into the ST-tree. Since the ST-tree is a height-balanced n-ary tree, we implement a similar “insert and split” strategy as for other height-balanced n-ary trees, such as B+-tree, R-tree, S-tree, and so on. When a node n comes, the strategy works as follows.

1. Starting from the root, iteratively choose, in a top-down manner, the node with the lowest cost when inserting n into it. If the chosen node v is a leaf, insert n into v.

2. If v is full, split v into two separate nodes v1 and v2, where the costs of v1 and v2 are minimized. If a splitting operation makes the father of the split node become full, split the father node iteratively.

3. If the root is split into r1 and r2, create a new root r and make r the father of r1 and r2.

Since the R-tree and the VS-tree have similar cost models, we can adopt a cost model while constructing the ST-tree by integrating the independent cost models of the R-tree and the VS-tree.

The first cost in our model is the signature (bit string) cost. The signature cost represents the dissimilarity of two signatures. Equation 1 shows how to compute the signature cost when given signatures of tree nodes sig1 and sig2, where Costsig denotes the signature cost, bitcount(sig) counts the number of the valid bits, and ⊕ means the xor operation.

$Cost_{sig} = bitcount(sig_1 \oplus sig_2)$ (1)

The second cost in our model is the spatial cost. When two entries are combined, the spatial cost is the increase in the area of the MBR. In order to avoid ineffective insertion or splitting, we use the area of the rectangle’s circumcircle (footnote 10) instead of the area of the original rectangle. Equation 2 shows how to compute the spatial cost while combining the tree nodes n1 and n2, where Costspa denotes the spatial cost, Area(Ri) means the area of the circumcircle of the rectangle Ri, and d(MBRi) denotes the diameter of the tree node ni’s MBRi. Note that Ri ⊙ Rj denotes a binary operator that generates a rectangle surrounding the rectangles Ri and Rj.

$Cost_{spa} = Area(R_1 \odot R_2) - Area(R_1) - Area(R_2) = \frac{\pi}{4}\left(d(MBR_1 \odot MBR_2)^2 - d(MBR_1)^2 - d(MBR_2)^2\right)$ (2)

The third cost in our model is the temporal cost. The temporal cost of two entities e1 and e2 is the increased length while combining e1 and e2. Equation 3 shows how to compute the temporal cost while combining the tree nodes n1 and n2, where seg1 ⊗ seg2 denotes the time interval surrounding seg1 and seg2.

$Cost_{tem} = 2 \times length(seg_1 \otimes seg_2) - length(seg_1) - length(seg_2)$ (3)

Since each tree node owns both spatiotemporal and signature features, we take both the spatiotemporal cost and the signature cost into account when inserting an entity or splitting a full node. The cost of combining two nodes is shown in Equation 4, where 0 < α < 1, 0 < β < 1, and 0 < α + β < 1. Note that Zspa = ∑ Costspa, Ztem = ∑ Costtem, and Zsig = ∑ Costsig are the normalization parameters that balance the scale of the spatial cost, the temporal cost and the signature cost respectively. In the section of the experiments, we design a specific experiment to determine the values of the parameters α and β.

$Cost = \alpha \frac{Cost_{spa}}{Z_{spa}} + \beta \frac{Cost_{tem}}{Z_{tem}} + (1 - \alpha - \beta) \frac{Cost_{sig}}{Z_{sig}}$ (4)
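To make Equations 1 to 4 concrete, the following Python sketch evaluates each cost term for a pair of entries. All function names, the `(min_x, min_y, max_x, max_y)` rectangle layout, the `(start, end)` interval layout, and the sample values are assumptions made for illustration; the diameter of an MBR is taken to be its diagonal, which is the diameter of its circumcircle.

```python
import math


def cost_sig(sig1: int, sig2: int) -> int:
    """Equation 1: number of differing bits between two signatures."""
    return bin(sig1 ^ sig2).count("1")


def union_mbr(r1, r2):
    """Smallest rectangle surrounding r1 and r2."""
    return (min(r1[0], r2[0]), min(r1[1], r2[1]),
            max(r1[2], r2[2]), max(r1[3], r2[3]))


def circle_area(r) -> float:
    """Area of the circumcircle of rectangle r: pi/4 * d^2, d = diagonal."""
    d_squared = (r[2] - r[0]) ** 2 + (r[3] - r[1]) ** 2
    return math.pi / 4 * d_squared


def cost_spa(r1, r2) -> float:
    """Equation 2: growth of the circumcircle area when combining r1 and r2."""
    return circle_area(union_mbr(r1, r2)) - circle_area(r1) - circle_area(r2)


def union_seg(s1, s2):
    """Smallest interval surrounding s1 and s2."""
    return (min(s1[0], s2[0]), max(s1[1], s2[1]))


def cost_tem(s1, s2) -> float:
    """Equation 3: growth in interval length when combining s1 and s2."""
    def length(s):
        return s[1] - s[0]
    return 2 * length(union_seg(s1, s2)) - length(s1) - length(s2)


def combined_cost(c_spa, c_tem, c_sig, z_spa, z_tem, z_sig, alpha, beta):
    """Equation 4: weighted sum of the three normalized costs."""
    return (alpha * c_spa / z_spa + beta * c_tem / z_tem
            + (1 - alpha - beta) * c_sig / z_sig)


print(cost_sig(0b1001101011, 0b1101000010))  # prints 4
print(cost_tem((0, 10), (5, 20)))            # prints 15
```

The normalization constants `z_spa`, `z_tem` and `z_sig` are passed in explicitly here; how they are accumulated during construction is not specified in this sketch.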

Based on the cost model, we propose a method to construct an ST-tree. Algorithm 1 shows the procedure to build an ST-tree given a set of entities. In the very beginning, the ST-tree only has an empty root, and we set the maximum node size of the ST-tree. Given a set of entities, we iteratively insert the entities one by one into the ST-tree. While inserting an entity, we generate an entry representing the entity and insert the entry into the node with the lowest cost in a top-down manner. If the insertion produces a full node, we split the full node into two half-full nodes. Note that the splitting operation may produce a new full node. If the root needs to be split, we generate a new node, and then we set the new node as the root of the ST-tree and set the two split nodes as the new node’s children.
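The insert-and-split procedure described above can be sketched as follows. This is a simplified illustration, not Algorithm 1 itself: it uses only the signature cost of Equation 1 (no MBR or segment), picks the most dissimilar pair of children as split seeds, and does not force the two halves to be exactly half full. The names `Node`, `Tree`, `leaf_depths` and `count_entries` are hypothetical.

```python
import random


class Node:
    def __init__(self, leaf):
        self.leaf = leaf      # True if children are entity signatures (ints)
        self.children = []
        self.sig = 0          # bitwise OR of everything below this node
        self.parent = None


def sig_of(child):
    return child if isinstance(child, int) else child.sig


def cost(sig1, sig2):
    return bin(sig1 ^ sig2).count("1")  # signature cost only (assumption)


class Tree:
    def __init__(self, capacity):
        self.capacity = capacity
        self.root = Node(leaf=True)     # start from an empty root

    def insert(self, sig):
        node = self.root
        while not node.leaf:            # top-down: follow the cheapest child
            node = min(node.children, key=lambda c: cost(c.sig, sig))
        node.children.append(sig)
        up = node
        while up is not None:           # keep the summaries on the path consistent
            up.sig |= sig
            up = up.parent
        while node is not None and len(node.children) > self.capacity:
            node = self._split(node)    # a split may cascade to the father

    def _split(self, node):
        items = node.children
        # Seeds: the most dissimilar pair of children.
        _, i, j = max((cost(sig_of(a), sig_of(b)), i, j)
                      for i, a in enumerate(items)
                      for j, b in enumerate(items) if i < j)
        halves = (Node(node.leaf), Node(node.leaf))
        for idx, child in enumerate(items):
            if idx == i:
                side = 0
            elif idx == j:
                side = 1
            else:
                s = sig_of(child)
                closer_to_i = cost(sig_of(items[i]), s) <= cost(sig_of(items[j]), s)
                side = 0 if closer_to_i else 1
            halves[side].children.append(child)
            halves[side].sig |= sig_of(child)
            if not node.leaf:
                child.parent = halves[side]
        parent = node.parent
        if parent is None:              # splitting the root adds one level
            parent = Node(leaf=False)
            self.root = parent
        else:
            parent.children.remove(node)
        for half in halves:
            half.parent = parent
            parent.children.append(half)
            parent.sig |= half.sig
        return parent


def leaf_depths(node, depth=0):
    """Set of depths at which leaf nodes occur."""
    if node.leaf:
        return {depth}
    return set().union(*(leaf_depths(c, depth + 1) for c in node.children))


def count_entries(node):
    if node.leaf:
        return len(node.children)
    return sum(count_entries(c) for c in node.children)


random.seed(0)
tree = Tree(capacity=4)
for _ in range(200):
    tree.insert(random.getrandbits(10))
print(len(leaf_depths(tree.root)))  # 1: all leaves share the same depth
```

Because a split only creates a sibling at the same depth, and a root split adds one level above every leaf, all leaves stay at the same depth regardless of the insertion order.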

5.3.0.1 Guarantee of Balance

Based on Theorem 1, the ST-tree should be a height-balanced tree. Since the ST-tree is built based on the “insert” and “split” operations, it can be proven that the ST-tree is a height-balanced tree.

Lemma 1

Given a height-balanced tree T, T is also balanced after splitting a node n to n1 and n2 using the “split” operation.

Proof

(Sketch) Since T is a height-balanced tree, the subtree Tn rooted at n is also a balanced tree, and the subtrees rooted at n’s children are balanced trees too. Based on the definition of the height-balanced tree, the new trees T1 and T2 rooted at n1 and n2, respectively, are both height-balanced trees since the children of n1 (n2) are a subset of n’s children.

Clearly, the height of T is the larger of T1’s height and T2’s height, and the difference between T1’s height and T2’s height is at most 1. Therefore, the new tree $T_f'$ rooted at n’s father after splitting remains a height-balanced tree, and the height of $T_f'$ is unchanged.

Since the remaining part of T is unchanged, if n is not the root of T, T remains height balanced after the splitting.

If n is the root of T, a new root r is set to be the father of n1 and n2. Since T1 and T2 are height balanced and the height difference between T1 and T2 is no more than 1, the new tree is also height balanced. □

Lemma 2

Given a height-balanced tree T and a new node n, T is also a height-balanced tree after inserting n into T.

Proof

(Sketch) n is added to the lowest layer of T via the “insert” operation. Then,

1. The father node of n is not full, i.e., the insertion does not cause a splitting procedure. Clearly, T is still a height-balanced tree since the height of T is unchanged and the depth of each node in T is unchanged except for n.

2. The father node of n is full, i.e., a splitting procedure is triggered. Based on Lemma 1, T is still a height-balanced tree after the necessary splitting steps.

In summary, T remains height balanced after inserting n into T. □

Theorem 2

An ST-tree T is a height-balanced tree.

Proof

1. Clearly, an empty ST-tree is a height-balanced tree.

2. Based on Lemma 2, if an ST-tree is height balanced, it remains height balanced after an insertion.

In summary, T is a height-balanced tree since the construction of T is a series of insertions. □

In this stage, we ignore the “update” and the “delete” operations of the ST-tree because (1) in contrast to occasionally removing/editing statements, the real RDF data sets prefer to increase their scales, i.e., insertion is more important, and (2) we can use time stamps to manage the changed statements. In fact, it is easy to design the “update” and the “delete” operations of the ST-tree by referring to the same operations of the R-tree (or B+-tree, S-tree, and so on).

5.3.0.2 Time Complexity

The ST-tree construction is a series of “insert” and “split” operations. Therefore, the time complexity of the tree construction depends on the time complexity of the two operations and the number of times that the operations are triggered. Suppose that the node capacity is set to be k, i.e., the tree nodes in the ST-tree have no more than k children without splitting. In other words, if a tree node has k + 1 children after an “insert” or a “split” operation, the node should be split.

Based on Algorithm 2, given an ST-tree T with height h, the time cost of an “insert” operation is O(h × k): an insertion needs h comparisons to find the lowest-cost path, and each comparison requires O(k) cost computations over the children of the chosen node. If an RDF data set has n entities, the total time cost of the insertions is O(n) × O(h × k) = O(n) × O(lg n × k) = O(nk lg n). Note that each insertion introduces a new node to the ST-tree.

Based on Algorithm 3, a “split” operation takes $O\left(\frac{k \times (k-1)}{2} + (k-1)(k-2) + k\right) = O(k^2)$ time, where $O\left(\frac{k \times (k-1)}{2}\right)$ is the time cost to find the two seeds, O((k − 1)(k − 2)) is the time cost to separate the children of the split node, and O(k) is the time cost to allocate the children to the two new nodes. Obviously, a new node is added to the ST-tree if a “split” operation is executed (footnote 11). Therefore, the number of executions of the “split” operation is equal to N − n − (h − 2), where N is the node count of the ST-tree, n is the entity node count and h is the height of the ST-tree. Since n is fixed and h is far less than N (the maximum h is $\log_2 N$ when the ST-tree degenerates to a binary tree), the total time cost of the “split” operations is $O(N \times k^2)$. In the best case, the tree nodes are all full, i.e., $N = \frac{n}{k} + \frac{n}{k^2} + \ldots + 1 = O\left(\frac{n}{k-1}\right)$. In the worst case, the tree nodes are all half-full and the root has only two children, i.e., all the insertions are focused on the same path. Thus, $N = O\left(2 \times \frac{n}{k-2} + 1\right)$. In summary, $N = O\left(\frac{n}{k}\right)$, and the time complexity of the splits is $O(N \times k^2) = O(nk)$.
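The bound on the total split cost can be spelled out step by step. The derivation below reads the best-case node count as a geometric series over the tree levels, which is an interpretation of the sum in the text rather than a statement taken from it:

```latex
\begin{align*}
N_{\text{best}} &= \frac{n}{k} + \frac{n}{k^{2}} + \dots + 1
  \;\le\; n \sum_{i \ge 1} k^{-i} = \frac{n}{k-1}, \\
N \cdot O\!\left(k^{2}\right) &= O\!\left(\frac{n}{k}\right) \cdot O\!\left(k^{2}\right) = O(nk).
\end{align*}
```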

As a result, the time complexity of the tree construction is O(nk lg n) + O(nk) = O(nk lg n), where n is the entity number in the data set, and k is the node capacity of the ST-tree.

6 Query Processing

Given an S-T query Q, we first convert Q to an S-T signature graph Q*. The conversion process consists of three steps.

1. Encode the triple patterns as described in Section 3.3.

2. For each spatiotemporal range assertion, we add the corresponding absolute MBR or segment on the specific variables.

3. For each spatiotemporal join assertion, we add the relevant MBRs or segments on the variables.

The Q* corresponding to Q is shown in Figure 8. The signatures are generated in the same way as when G is converted to G*, except that the variables contribute no valid bit. The range assertions of Q are converted to the absolute MBRs bound to ?y in Q*, and the join assertions of Q are converted to the relevant MBRs in Q*. Specifically, if an out-edge from node n has temporal range assertions, we also add the assertions to n, which is called “infection”.

After the corresponding Q* is generated, we next search the matches of Q* in G* exploiting the ST-tree. Considering an S-T signature query graph Q* = {q1,…, qn}, we first generate the node candidate set NodeSeti for each variable qi, and then verify each candidate in the query candidate set QSet = {NodeSet1 × … × NodeSetn} to generate the matches of Q in G.

6.1 Pruning Rules

For efficiently generating the node candidate sets, we have the following five pruning rules. Pruning rules 1 and 2 are based on the spatial range and spatial join assertions respectively. Pruning rule 3 is based on the temporal range assertions. Pruning rule 4 is based on the signature, and Pruning rule 5 considers the edge features. Based on Theorem 1 (in Section 5), when node n is unsatisfied, the subtree rooted at n can be safely pruned.

6.1.0.1 Pruning Rule 1

Consider a variable v bound with a range assertion. If there is a tree node n where v.mbr has no intersection with n.mbr, the subtree rooted at n can be pruned safely.

For example, ?y in Q* has a range assertion. Thus, the subtrees rooted at $d_1^2$ and $d_4^3$ can be safely pruned, because the spatial features are unsatisfied.
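Pruning Rule 1 reduces to a rectangle intersection test. A minimal sketch, assuming MBRs are stored as `(min_x, min_y, max_x, max_y)` tuples and that rectangles that merely touch count as intersecting; the function names are hypothetical:

```python
def mbr_intersects(a, b) -> bool:
    """True if rectangles a and b share at least one point."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]


def prune_by_range(query_mbr, node_mbr) -> bool:
    """Pruning Rule 1: drop the subtree if the two MBRs do not intersect."""
    return not mbr_intersects(query_mbr, node_mbr)


print(prune_by_range((0, 0, 1, 1), (2, 2, 3, 3)))  # prints True (disjoint)
print(prune_by_range((0, 0, 2, 2), (1, 1, 3, 3)))  # prints False (overlap)
```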

6.1.0.2 Pruning Rule 2

Consider two variables vi and vj bound by a spatial join assertion, where NodeSeti is the candidate set of vi and NodeSetj is the candidate set of vj. Suppose the maximum distance is set to be MaxDist. Let ni ∈ NodeSeti; if the distance from the MBR of ni to every node nj ∈ NodeSetj is larger than MaxDist, ni can be safely pruned.

In practice, we combine all the MBRs of the candidates of one variable into one MBR, and the minimal distance between two combined MBRs is considered as the lower bound of each candidate pair. Thus, the time complexity is reduced from O(m × n) to O(m + n), where m and n are the sizes of two candidate sets respectively.

For example, if the distance between ?x and ?y is set to be less than 50km, when only node $d_3^3$ is considered as a candidate of ?x, $d_4^3$ can be safely pruned for ?y, since the distance lower bound from $d_4^3$ to $d_3^3$ is much more than 50km.
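The combined-MBR shortcut can be sketched as below. This is a variant chosen for illustration: it combines the candidates of one variable into a single MBR and tests each candidate of the other variable against it, which still costs O(m + n). It assumes planar coordinates and Euclidean distance rather than geographic distance, and all names are hypothetical.

```python
import math


def combine(mbrs):
    """Smallest rectangle surrounding all given rectangles."""
    return (min(m[0] for m in mbrs), min(m[1] for m in mbrs),
            max(m[2] for m in mbrs), max(m[3] for m in mbrs))


def min_dist(a, b) -> float:
    """Minimal Euclidean distance between two rectangles (0 if they overlap)."""
    dx = max(a[0] - b[2], b[0] - a[2], 0.0)
    dy = max(a[1] - b[3], b[1] - a[3], 0.0)
    return math.hypot(dx, dy)


def prune_join(cands_i, cands_j, max_dist):
    """Keep candidates of v_i that may lie within max_dist of some v_j candidate.

    The distance to the combined MBR is a lower bound on the distance to
    every individual candidate of v_j, so dropping on it is safe.
    """
    if not cands_j:
        return []
    combined_j = combine(cands_j)
    return [m for m in cands_i if min_dist(m, combined_j) <= max_dist]


near, far = (0, 0, 1, 1), (100, 100, 101, 101)
print(prune_join([near, far], [(2, 0, 3, 1), (4, 0, 5, 1)], 5))
# prints [(0, 0, 1, 1)]
```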

6.1.0.3 Pruning Rule 3

Consider a variable v. If the temporal assertion is not null and there is a tree node n where v.seg ∩ n.seg = ∅, the subtree rooted at n can be pruned safely.

For example, ?x in Q* has a temporal assertion. Thus, the subtree rooted at $d_2^2$ can be safely pruned for ?x.
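Pruning Rule 3 is the one-dimensional analogue of the MBR test. In this sketch, intervals are `(start, end)` tuples, `None` stands for a null temporal feature, and a node without any temporal feature is treated as unable to satisfy a temporal assertion; these representation choices are assumptions.

```python
def seg_intersects(a, b) -> bool:
    """True if closed intervals a and b overlap."""
    return a[0] <= b[1] and b[0] <= a[1]


def prune_by_time(query_seg, node_seg) -> bool:
    """Pruning Rule 3: drop the subtree if the intervals do not overlap."""
    if query_seg is None:      # no temporal assertion: nothing to prune on
        return False
    if node_seg is None:       # assumption: a null feature cannot satisfy it
        return True
    return not seg_intersects(query_seg, node_seg)


print(prune_by_time((1900, 1950), (1960, 2000)))  # prints True
print(prune_by_time((1900, 1970), (1960, 2000)))  # prints False
```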

6.1.0.4 Pruning Rule 4

Consider a variable v. If there is a tree node n where v.sig & n.sig ≠ v.sig, the subtree rooted at n can be pruned safely.

In Q*, $d_4^3$ for ?x, $d_1^2$ for ?y and $d_3^2$ for ?z etc. can be safely pruned.
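The signature test follows the containment condition used in the proof of Theorem 1, where a matching node satisfies n.sig & q.sig = q.sig. A minimal sketch with a hypothetical function name; the node signature is the one given for $d_3^3$ in Section 5.1, while the two query signatures are made up:

```python
def prune_by_signature(query_sig: int, node_sig: int) -> bool:
    """Pruning Rule 4: drop the subtree unless every bit set in the
    query signature is also set in the node signature."""
    return (query_sig & node_sig) != query_sig


node_sig = 0b1101101011
print(prune_by_signature(0b0000100010, node_sig))  # prints False (covered)
print(prune_by_signature(0b0010000000, node_sig))  # prints True (bit missing)
```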

6.1.0.5 Pruning Rule 5

Consider two linked variables vi and vj with an edge e = vivj from vi to vj, where NodeSeti is the candidate set of vi and NodeSetj is the candidate set of vj in the same S-T signature graph. Let ni ∈ NodeSeti; if there is no edge from ni to any node nj ∈ NodeSetj, ni can be safely pruned. Additionally, if there is an S-T assertion on e, the unsatisfied edges are considered nonexistent.

The pruning rule is based on the fact that if there is no satisfied edge from ni to any node nj ∈ NodeSetj, there is no satisfied edge from the descendants of ni to any descendants of the nodes nj ∈ NodeSetj. In practice, given a node n, all the features of the edges starting from n are integrated into one signature, one MBR, and one segment to reduce the time complexity.
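The edge-based rule can be illustrated on plain edge sets. This sketch assumes the edges of one level of the tree are available as a set of `(source, target)` pairs that already excludes edges failing an S-T assertion; it does not model the integrated per-node edge summary mentioned above. The names are hypothetical.

```python
def prune_by_edges(cands_i, cands_j, edges):
    """Pruning Rule 5: keep only candidates of v_i that have at least one
    edge into the candidate set of v_j."""
    targets = set(cands_j)
    sources = {src for (src, dst) in edges if dst in targets}
    return [n for n in cands_i if n in sources]


edges = {("a", "x"), ("b", "y"), ("c", "z")}
print(prune_by_edges(["a", "b", "c"], ["x", "y"], edges))  # prints ['a', 'b']
```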

Algorithm 4 describes the top-down process for generating the node candidate sets. The use of the pruning rules is shown in Lines 9-21.
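A level-by-level search for a single variable might look like the sketch below. It is not Algorithm 4: it applies only the per-node rules (range, temporal and signature) and leaves out the join and edge rules. The classes `TreeNode` and `QueryVar`, the tiny tree, and all feature values except the Ulm and Baden-Württemberg signatures are assumptions.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

Rect = Tuple[float, float, float, float]   # (min_x, min_y, max_x, max_y)
Interval = Tuple[int, int]                 # (start, end)


@dataclass
class TreeNode:
    name: str
    sig: int
    mbr: Optional[Rect] = None
    seg: Optional[Interval] = None
    children: List["TreeNode"] = field(default_factory=list)


@dataclass
class QueryVar:
    sig: int
    mbr: Optional[Rect] = None
    seg: Optional[Interval] = None


def satisfies(node: TreeNode, q: QueryVar) -> bool:
    if (q.sig & node.sig) != q.sig:                      # signature rule
        return False
    if q.mbr is not None:                                # spatial range rule
        if node.mbr is None:
            return False
        a, b = q.mbr, node.mbr
        if not (a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]):
            return False
    if q.seg is not None:                                # temporal rule
        if node.seg is None:
            return False
        if not (q.seg[0] <= node.seg[1] and node.seg[0] <= q.seg[1]):
            return False
    return True


def candidates(root: TreeNode, q: QueryVar) -> List[TreeNode]:
    """Descend one level at a time, expanding only satisfied nodes."""
    level = [root]
    # The tree is height balanced, so all nodes in `level` share one depth.
    while level and level[0].children:
        level = [c for n in level if satisfies(n, q) for c in n.children]
    return [n for n in level if satisfies(n, q)]


ulm = TreeNode("Ulm", 0b1001101011, (9.9, 48.3, 10.1, 48.5))
bw = TreeNode("Baden-Württemberg", 0b1101000010, (7.5, 47.5, 10.5, 49.8))
x = TreeNode("x", 0b0010000100, (100, 0, 101, 1))
y = TreeNode("y", 0b0010010000, (102, 0, 103, 1))
d1 = TreeNode("d1", 0b1101101011, (7.5, 47.5, 10.5, 49.8), children=[ulm, bw])
d2 = TreeNode("d2", 0b0010010100, (100, 0, 103, 1), children=[x, y])
root = TreeNode("root", 0b1111111111, (7.5, 0, 103, 49.8), children=[d1, d2])

q = QueryVar(sig=0b0000100000, mbr=(9, 48, 10, 49))
print([n.name for n in candidates(root, q)])  # prints ['Ulm']
```

In this example the subtree under `d2` is never expanded, because its summary signature already fails the signature test.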

6.2 Verification

For the node candidate set {NodeSet}, we generate a list of nodes 〈v1,…, vn〉 by taking one node from each item of {NodeSet}, and verify whether 〈v1,…, vn〉 forms connected regions that correspond to the connected regions in Q*. If it does, we consider it a match candidate of Q*; otherwise, we discard it. The generating process can be accomplished by using a breadth-first search (BFS) algorithm starting from the smallest node candidate set in each connected region. If there is an edge e = vkvl in Q*, there must be a corresponding edge between the nodes chosen for vk and vl.

Given a match candidate $Q_c^\star$ of Q*, we verify whether all the spatiotemporal assertions are satisfied. The satisfied match candidates are the matches of Q*. Subsequently, since the encoding technique may introduce false positives, we verify whether all S-T triple patterns in Q are satisfied given a match of Q*. The valid candidates are the matches of Q. Then, the matches of Q are returned to users.
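The structural part of verification can be sketched as a filter over the cross product of the node candidate sets. This is a deliberately naive illustration: it enumerates every combination instead of running a BFS from the smallest candidate set, and it checks only edge existence, not the spatiotemporal assertions or the final triple-pattern check. The function name and the data layout are assumptions.

```python
from itertools import product


def verify(node_sets, query_edges, graph_edges):
    """Return all combinations that respect every query edge.

    node_sets:   one candidate list per query variable
    query_edges: set of (x, y) index pairs into node_sets
    graph_edges: set of (source, target) pairs of the data graph
    """
    matches = []
    for combo in product(*node_sets):
        if all((combo[x], combo[y]) in graph_edges for x, y in query_edges):
            matches.append(combo)
    return matches


node_sets = [["a", "b"], ["x", "y"]]
graph_edges = {("a", "x"), ("b", "z")}
print(verify(node_sets, {(0, 1)}, graph_edges))  # prints [('a', 'x')]
```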

7 Experiments

To the best of our knowledge, only YAGO2 Demo (Hoffart et al., 2011) and SPARQL-ST (Perry, Jain, & Sheth, 2011) are available spatiotemporal RDF data management systems. Since the technical details of YAGO2 Demo are not reported, SPARQL-ST is chosen to make a comparison with gst-store. In addition, we also make a comparison between gst-store, a post-processing method, S-store (Wang et al., 2013) and an enterprise system Virtuoso.

Our demo is available at http://59.108.48.17:8080/GStoreWangDong/query.jsp.

7.1.0.1 Data Set

YAGO2 is a real data set based on Wikipedia, WordNet, and GeoNames. The latest version of YAGO2 has >10 million entities and 440 million statements. We obtain a spatiotemporal RDF data set from YAGO2 by removing some statements that describe the date when another statement is extracted or the uniform resource locator (URL) where another statement is extracted from. The condensed data set has >10 million entities/classes and >180 million statements. More than 7 million entities are spatial entities, >90 million statements are spatial statements, and >28 million statements are temporal statements. Based on YAGO2, we generate 10,557,223 S-T signature nodes, of which 7,394,075 have non-null spatial features and 1,266,865 have non-null temporal features.

7.1.0.2 Queries and Setup

In order to evaluate our approach, we manually generate 20 sample S-T SPARQL queries that have different features. The sample queries are divided into 10 classes, i.e., Ssimple, SRE, SJE, SS, SC, Tsimple, TR, TJ, TC and ST. We run all queries on a personal computer (PC) server with an Intel Xeon CPU E5645 running at 2.40 GHz and 16 GB main memory. The node capacity is set to be 100, i.e., a node in the ST-tree should have no more than 100 children. Our previous work (Wang et al., 2013) shows that different node capacities have little effect on the performance.

• Ssimple: Simple queries with Spatial range assertions of entities.

• SRE: Queries with Spatial Range assertions of Entities.

• SJE: Queries with Spatial Join assertions of Entities.

• SS: Queries with Spatial assertions of Statements.

• SC: Complex queries with all kinds of Spatial assertions.

• Tsimple: Simple queries with Temporal range assertions.

• TR: Queries with Temporal Range assertions.

• TJ: Queries with Temporal Join assertions.

• TC: Complex queries with all kinds of Temporal assertions.

• ST: Queries with all kinds of SpatioTemporal assertions.

Table 1 shows the result set size of each query. In order to illustrate the reason why the postprocessing method (i.e., finding SPARQL query results by ignoring the spatiotemporal assertions and then verifying the candidates by the spatiotemporal assertions) is not efficient, we report the result sizes of all queries discarding the spatiotemporal assertions and the final S-T query result sizes. From Table 1, we observe that the result sizes discarding the spatiotemporal assertions are very large even though the final S-T queries have <10 results, such as Ssimple1. It means that the postprocessing method needs a lot of effort during the verification process.

Tab. 1

The Result Set Size of Queries

7.2 Evaluating the Parameters of Cost Model

In this section, we evaluate how variations of α and β affect the query performance. Since the number of combinations of α and β can be huge, we adjust α and β separately and combine the respective optimal ratios to build the ST-tree. For convenience, we use γ to denote 1 − α − β.

First, we set β to zero, i.e., we only focus on spatial information. In order to obtain the optimal α, we vary α from 0 to 1 with step size 0.1. The query sets Ssimple, SRE, SJE, SS, and SC are used for adjusting α. We report the average time cost for different α values in Figure 9. Based on the performance curve, we set the ratio of γ to α to be 5:5. Second, we set α to zero and vary β to choose the best ratio of γ to β. The performance for different β values is shown in Figure 10. Note that in this experiment, we use query sets Tsimple, TR, TJ, TC, and ST. Based on the result, the ratio of γ to β is set to be 7:3. Therefore, we use α = 0.41, β = 0.18 and γ = 0.41 as the optimal cost ratio to build the ST-tree.
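The final weights follow from the two ratios together with the constraint that the three weights sum to 1. A short worked derivation, writing $\gamma = 1 - \alpha - \beta$ for the signature weight:

```latex
\begin{align*}
\gamma : \alpha = 5 : 5 \;&\Rightarrow\; \alpha = \gamma, \qquad
\gamma : \beta = 7 : 3 \;\Rightarrow\; \beta = \tfrac{3}{7}\gamma, \\
\alpha + \beta + \gamma &= \gamma\left(1 + \tfrac{3}{7} + 1\right) = \tfrac{17}{7}\gamma = 1
\;\Rightarrow\; \gamma = \tfrac{7}{17} \approx 0.41, \\
\alpha &= \tfrac{7}{17} \approx 0.41, \qquad \beta = \tfrac{3}{17} \approx 0.18.
\end{align*}
```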

Fig. 9

Average Time Cost of Different Values of α

Fig. 10

Average Time Cost of Different Values of β

7.3 Evaluating Entity Organization

In this section, we evaluate whether different entity organization styles affect the offline and online performances. There are four different tree construction methods, which are ST-tree, VS-tree (Zou, Mo, Chen, Özsu, & Zhao, 2011), R-tree based on the spatial MBR, and R-tree based on the temporal segment. After adding the necessary features and building the S-T signature graphs, all the four kinds of trees can answer S-T queries. In the following, we use VS-tree+ to denote the tree based on VS-tree, R-treeS+ to denote the tree based on spatial R-tree, and R-treeT+ to denote the tree based on temporal R-tree.

Table 2 shows the offline cost. The ST-tree demands lower storage space than the other three tree styles. Since all the four kinds of tree construction methods can be modeled as cost model-based methods, the result shows that the cost model of the ST-tree is more effective than the others. The last row of Table 2 shows the time cost of the tree construction. Clearly, the more complex the cost model is, the more time is needed to build the tree structure. The ST-tree requires the most time. However, less than half an hour is needed to build the ST-tree. Note that we only consider the tree construction and ignore the S-T signature graph construction (footnote 12).

Tab. 2

Offline Cost of Tree Construction

Table 3 shows the on-line time cost of the queries based on different tree style indexes. The ST-tree based on our cost model outperforms the other tree styles. In other words, the cost model of the ST-tree takes both semantic features and spatiotemporal features into consideration, which improves the performance.

Tab. 3

On-line Cost of Different Tree styles

7.4 Evaluating Performance

To evaluate the efficiency of our approach, we choose four baseline approaches, which are denoted as gStore+, Virtuoso, S-store+ and SPARQL-ST respectively.

The gStore+ method adopts the postprocessing solution, which runs the SPARQL queries on an RDF query engine by ignoring the spatiotemporal assertions and then refining the candidates by considering the spatiotemporal assertions. In practice, this approach exploits gStore (Zou, Mo, Chen, Özsu, & Zhao, 2011) as the RDF management system, and the node capacity is set to be 100. Besides, MySQL is used to retrieve the spatiotemporal information of the entities and the statements.

The Virtuoso approach uses the enterprise system Virtuoso, which states that it can organize spatial RDF data.

The S-store is our early work based on spatial RDF data. The index SS-tree of S-store can be separated into an R-tree based on the spatial entities and a VS-tree based on the nonspatial entities. After simply adding the temporal features in the nodes and edges of the SS-tree, the S-store can be extended to answer S-T queries. Here, we use S-store+ to denote this baseline method.

We implement the method of SPARQL-ST (Perry, Jain, & Sheth, 2011) as the fourth baseline. MySQL is used as the data management system. The B+ -tree index and the spatial index of MySQL are used. In this section, we make a comparison between gst-store and the four baselines. The query response times are shown in Table 4.

Tab. 4

The Performance Comparison

Since query sets Ssimple and Tsimple have many candidate results (Table 1), the time cost of gStore+ is unacceptable. gStore+ cannot get the results of Ssimple1 or Tsimple1 in reasonable time (half an hour), and the time costs for Ssimple2 and Tsimple2 are >100 seconds. However, our approach (gst-store) can answer these queries efficiently. Although the other queries have just a few candidate results without spatiotemporal assertions, gst-store still outperforms gStore+.

Actually, only some of the queries (the query sets Ssimple, SRE, and SJE) can be answered using Virtuoso. gst-store outperforms Virtuoso by several orders of magnitude. Here, the mark “-” denotes that the query cannot be answered.

Compared to the S-store+, gst-store performs better in all queries except for Ssimple1. In Table 1, we can find that Ssimple1 has low selectivity on the semantic constraint (>90% entities are selected) and high selectivity on the spatial constraint (only three entities are considered as the result). As a result, the S-store+ performs like an R-tree based on the spatial entities (because of the composing construction method), whereas gst-store is misled by the cost model. In other cases, the cost model performs well, and gst-store outperforms the S-store+.

The SPARQL-ST approach can answer most of the queries, except query sets SS, SC and ST. This is because these three query sets involve the spatial features of the statements, which are out of the SPARQL-ST data model. For the same reason as the competition with S-store+, the SPARQL-ST approach performs better than gst-store for Ssimple1. In other cases, SPARQL-ST costs several minutes to answer the queries, much slower than gst-store. In summary, gst-store outperforms its competitors in most scenarios.

8 Conclusions

In this paper, we introduce S-T queries, a variant of SPARQL language, to query RDF data with spatiotemporal features. In order to answer S-T queries efficiently, we build a hybrid index, called ST-tree, in our gst-store system, an engine for large RDF graphs integrating spatial and temporal information. Several pruning rules are introduced in the query algorithm to reduce the search space. The experiment results on a real large RDF graph show the effectiveness and the efficiency of our approach.

A The Sample Queries

In this appendix, the 20 queries used in the experiments are shown in Figure 11.

Fig. 11

The 20 Sample Queries

References

• Abadi, D. J., Marcus, A., Madden, S. R., & Hollenbach, K. (2007, September). Scalable semantic web data management using vertical partitioning. Paper presented at the Proceedings of the 33rd international conference on Very large data bases (pp. 411-422). VLDB Endowment.

• Abadi, D. J., Marcus, A., Madden, S. R., & Hollenbach, K. (2009). SW-Store: a vertically partitioned DBMS for Semantic Web data management. The VLDB Journal, 18(2), 385-406. http://dx.doi.org/10.1007/s00778-008-0125-y

• Batsakis, S., & Petrakis, E. G. (2010, September). SOWL: spatiotemporal representation, reasoning and querying over the semantic web. In Proceedings of the 6th International Conference on Semantic Systems (p. 15). ACM. http://dx.doi.org/10.1145/1839707.1839726

• Brodt, A., Nicklas, D., & Mitschang, B. (2010, November). Deep integration of spatial query processing into native RDF triple stores. In Proceedings of the 18th SIGSPATIAL International Conference on Advances in Geographic Information Systems (pp. 33-42). ACM. http://dx.doi.org/10.1145/1869790.1869799

• Broekstra, J., Kampman, A., & Van Harmelen, F. (2002, June). Sesame: A generic architecture for storing and querying rdf and rdf schema. Paper presented at the International semantic web conference (pp. 54-68). Springer, Berlin, Heidelberg.

• Deppisch, U. (1986, September). S-tree: a dynamic balanced signature index for office retrieval. In Proceedings of the 9th annual international ACM SIGIR conference on Research and development in information retrieval (pp. 77-87). ACM. http://dx.doi.org/10.1145/253168.253189

• Erling, O., & Mikhailov, I. (2009). RDF Support in the Virtuoso DBMS. In Networked Knowledge - Networked Media (pp. 7-24). Springer Berlin Heidelberg. http://dx.doi.org/10.1007/978-3-642-02184-8_2

• Grandi, F. (2010, September). T-SPARQL: A TSQL2-like Temporal Query Language for RDF. In ADBIS (Local Proceedings).

• Gutierrez, C., Hurtado, C. A., & Vaisman, A. (2007). Introducing Time into RDF. IEEE Transactions on Knowledge and Data Engineering, 19(2), 207-218. http://dx.doi.org/10.1109/TKDE.2007.34

• Gutiérrez, C., Hurtado, C. A., & Vaisman, A. A. (2005). Temporal RDF. In The Semantic Web: Research and Applications, Second European Semantic Web Conference, ESWC 2005, Heraklion, Crete, Greece, May 29 - June 1, 2005, Proceedings.

• Guttman, A. (1984). R-trees: a dynamic index structure for spatial searching. In ACM SIGMOD International Conference on Management of Data (Vol. 14, pp. 47-57). ACM. http://dx.doi.org/10.1145/602259.602266

• Haklay, M., & Weber, P. (2008). Openstreetmap: User-generated street maps. IEEE Pervasive Computing, 7(4), 12-18. http://dx.doi.org/10.1109/MPRV.2008.80

• Hoffart, J., Suchanek, F. M., Berberich, K., Lewis-Kelham, E., De Melo, G., & Weikum, G. (2011, March). YAGO2: exploring and querying world knowledge in time, space, context, and many languages. In Proceedings of the 20th international conference companion on World wide web (pp. 229-232). ACM. http://dx.doi.org/10.1145/1963192.1963296

• Hoffart, J., Suchanek, F. M., Berberich, K., & Weikum, G. (2013). YAGO2: A spatially and temporally enhanced knowledge base from Wikipedia. Artificial Intelligence, 194, 28-61. http://dx.doi.org/10.1016/j.artint.2012.06.001

• Klyne, G., Carroll, J. J., & McBride, B. (2004, February). Resource description framework (RDF): Concepts and abstract syntax. World Wide Web Consortium Recommendation.

• Lyell, M., Voyadgis, D., Song, M., Ketha, P., & Dibner, P. (2011, May). An ontology-based spatio-temporal data model and query language for use in gis-type applications. In Proceedings of the 2nd International Conference on Computing for Geospatial Research & Applications (p. 15). ACM. http://dx.doi.org/10.1145/1999320.1999335

• Neumann, T., & Weikum, G. (2009). RDF-3X: a RISC-style engine for RDF. Proceedings of the VLDB Endowment, 1(1), 647-659. http://dx.doi.org/10.14778/1453856.1453927

• Perry, M., Jain, P., & Sheth, A. P. (2011). Sparql-st: Extending sparql to support spatiotemporal queries. Semantic Web & Beyond, 12, 61-86. http://dx.doi.org/10.1007/978-1-4419-9446-2_3

• Neumann, T., & Weikum, G. (2010). x-RDF-3X: fast querying, high update rates, and consistency for RDF databases. Proceedings of the VLDB Endowment, 3(1-2), 256-263. http://dx.doi.org/10.14778/1920841.1920877

• Pugliese, A., Udrea, O., & Subrahmanian, V. (2008, April). Scaling RDF with time. In Proceedings of the 17th international conference on World Wide Web (pp. 605-614). ACM. http://dx.doi.org/10.1145/1367497.1367579

• Singh, R., Turner, A., Maron, M., & Doyle, A. (2008). GeoRSS: Geographically encoded objects for RSS feeds: http://georss.org/gml

• Tappolet, J., & Bernstein, A. (2009, May). Applied temporal RDF: Efficient temporal querying of RDF data with SPARQL. In European Semantic Web Conference (pp. 308-322). Springer, Berlin, Heidelberg. http://dx.doi.org/10.1007/978-3-642-02121-3_25

• Wang, D., Zou, L., Feng, Y., Shen, X., Tian, J., & Zhao, D. (2013, April). S-store: An engine for large rdf graph integrating spatial information. In International Conference on Database Systems for Advanced Applications (pp. 31-47). Springer, Berlin, Heidelberg. http://dx.doi.org/10.1007/978-3-642-37450-0_3

• Weiss, C., Karras, P., & Bernstein, A. (2008). Hexastore: sextuple indexing for semantic web data management. Proceedings of the VLDB Endowment, 1(1), 1008-1019. http://dx.doi.org/10.14778/1453856.1453965

• Wilkinson, K. (2009). Jena property table implementation. SSWS.

• Wilkinson, K., Sayers, C., Kuno, H., & Reynolds, D. (2003, September). Efficient RDF storage and retrieval in Jena2. In International Conference on Semantic Web and Databases (pp. 120-139). CEUR-WS.org.

• Zou, L., Mo, J., Chen, L., Özsu, M. T., & Zhao, D. (2011). gStore: answering SPARQL queries via subgraph matching. Proceedings of the VLDB Endowment, 4(8), 482-493. http://dx.doi.org/10.14778/2002974.2002976

Footnotes

• 1
• 2
• 3
• 4
• 5

In fact, nontemporal statements should be filtered out early because they do not satisfy the temporal constraint, and gst-store does filter them out early.

• 6

For ease of presentation, this paper adopts the Euclidean distance between two locations. In practice, the distance over the earth's surface, computed from latitudes and longitudes, can be used to define the distance between two locations instead.
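To illustrate the two distance definitions mentioned in this footnote, the following is a minimal Python sketch, not part of gst-store. The function names, the haversine formula as the choice of surface distance, and the mean earth radius constant are assumptions made for illustration; the two coordinate pairs are taken from the examples in the introduction.

```python
import math

# Assumed constant: mean earth radius in kilometres (not specified in the paper).
EARTH_RADIUS_KM = 6371.0088


def euclidean_deg(p, q):
    """Planar Euclidean distance between two (lat, lon) pairs, in degrees."""
    return math.hypot(p[0] - q[0], p[1] - q[1])


def haversine_km(p, q):
    """Great-circle distance between two (lat, lon) pairs, in kilometres.

    Uses the haversine formula, one common way to approximate distance
    over the earth's surface on a spherical model.
    """
    lat1, lon1 = math.radians(p[0]), math.radians(p[1])
    lat2, lon2 = math.radians(q[0]), math.radians(q[1])
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    a = (math.sin(dlat / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin(dlon / 2) ** 2)
    # Clamp to 1.0 to guard against rounding error before taking asin.
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(min(1.0, a)))


# Coordinates from the introduction's example statements (lat, lon).
ULM = (48.39841, 9.99155)
PRIZE_LOCATION = (59.35, 18.0667)

if __name__ == "__main__":
    print("Euclidean (degrees):", euclidean_deg(ULM, PRIZE_LOCATION))
    print("Haversine (km):", haversine_km(ULM, PRIZE_LOCATION))
```

The Euclidean value is expressed in degrees and ignores that a degree of longitude shrinks toward the poles, so the two measures can rank candidate locations differently at high latitudes.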

• 7

The bit string is also used in our previous work, gStore (Zou, Mo, Chen, Özsu, & Zhao, 2011).

• 8

A segment is a 1-dimensional MBR.

• 9

The root node is on level 0, and level i denotes that the depth is i.

• 10

For the two-dimensional case.

• 11

If the split node is the root of the ST-tree, the number of added nodes is 2.

• 12

The time cost of building the whole index is about 18 hours, regardless of the tree style we choose.

Accepted: 2017-11-19

Published Online: 2017-12-29

Citation Information: Data and Information Management, Volume 1, Issue 2, Pages 84–103, ISSN (Online) 2543-9251,
