BY-NC-ND 3.0 license Open Access Published by De Gruyter Open Access June 8, 2016

Enriching and improving the quality of linked data with GIS

  • Adam Iwaniak, Iwona Kaczmarek, Marek Strzelecki, Jaromar Lukowicz and Piotr Jankowski
From the journal Open Geosciences

Abstract

Standardization of methods for data exchange in GIS has a long history predating the creation of the World Wide Web. The advent of the World Wide Web brought the emergence of new solutions for data exchange and sharing including, more recently, standards proposed by the W3C for data exchange involving Semantic Web technologies and linked data. Despite the growing interest in integration, GIS and linked data are still two separate paradigms for describing and publishing spatial data on the Web. At the same time, both paradigms offer complementary ways of representing real world phenomena and means of analysis using different processing functions. The complementarity of linked data and GIS can be leveraged to synergize both paradigms resulting in richer data content and more powerful inferencing. The article presents an approach aimed at integrating linked data with GIS. The approach relies on the use of GIS tools for integration, verification and enrichment of linked data. The GIS tools are employed to enrich linked data by furnishing access to collections of data resources, defining relationships between data resources, and subsequently facilitating GIS data integration with linked data. The proposed approach is demonstrated with examples using data from DBpedia, OSM, and tools developed by the authors for standard GIS software.

1 Introduction

More than fifty years of Geographic Information Systems (GIS) research and practice have been characterized by an impressive growth of geospatial data and development of mature data processing technologies. With the emergence of the Web and its principles, the paradigm of providing spatial data for GIS systems has changed. Web services like WMS, WFS, WCS, promoted by the Open Geospatial Consortium (OGC), have become one of the main sources of spatial data, replacing the need for downloading datasets or directly accessing databases. These services have become common standards accepted and used by the GIS community – both software developers and users.

Along with the increasing volume of data available on the Web, there is a growing need to publish it in a structured form. GIS application schema and catalog objects provide semantics for data and expose its structure [1]. Along with developments in methods of representing the meaning of data, much interest has been recently focused on linked data providing a simple data structure and semantics contained in vocabularies and ontologies. The slow but steady growth of linked data, reflected in the growing number of datasets available on LOD cloud, can be traced on diagrams created by Richard Cyganiak and Anja Jentzsch[1].

In the realm of spatial data, linked data and GIS are two separate paradigms, representing different approaches for data representation and exchange. However, the volume of data with spatial reference in the linked open data (LOD) cloud has been on the rise. Time and space referencing is one of the simplest methods for structuring such data and providing the context for interpretation [2]. This is one of the reasons for perceiving linked data as one of the most important approaches for geographic information publication and consumption on the Web [69, 24]. It provides new means for sharing, accessing and integrating geoinformation and holds the promise of changing the ways in which GI developers and analysts solve their problems. With the emergence of the linked open data cloud, geographic information can become a reference for integrating data across domains. The traditional approach to spatial analysis has relied on data geometry and attributes describing quantitative properties of spatial data. The emergence of linked data has ushered in an alternative approach to spatial analysis in geosciences, one that takes advantage of semantics describing both quantitative and qualitative properties of spatial entities. Much of the data semantics stored and accessible through the Web is in linked data format.

Linked data is based heavily on Semantic Web technologies. One of its principles is the use of the RDF data model for data representation. To satisfy the need for a stronger geographic information presence in the Semantic Web, the Spatial Data on the Web Working Group was established by W3C with the participation of OGC. The longer term objective of the group has been to pave the way for the next generation of geospatial services and data representation compliant with Semantic Web technologies [10]. There are several approaches for expressing geospatial data based on spatial features in the form of vocabularies and ontologies. One of the simplest and most popular is WGS84 Basic Geo, which represents point geometries with two separate predicates, latitude and longitude. An OGC standard, GeoSPARQL, is available for accommodating more complex features and geometries. The standard delivers a specification for a query language extension as well as a vocabulary for representing spatial features in RDF, which is based on another OGC standard, Simple Features.

Although the standards provided by W3C and OGC describe geodata representation in linked data, they do not provide methods for publishing it on the Web and integrating it with other heterogeneous resources. There is a need for good practices to fill this void and various projects have tried to do so. One of them is GeoKnow, whose goal has been to provide methods for bringing Spatial Data Infrastructure resources into the linked data cloud (GeoKnow: Leveraging Geospatial Data in the Web of Data). The project has aimed to provide tools for geospatial linked data storage, query, visualization, and, crucially, publishing. Another example is LinkedGeoData, which focuses on publishing linked datasets with links to other established data sources (LinkedGeoData: A Core for a Web of Spatial Open Data) [11]. The core of the solution used in LinkedGeoData is converted OpenStreetMap data linked with the most popular services such as DBpedia and GeoNames. The data is available either through a SPARQL endpoint or in the form of ready-to-download packages. LIFE is a project conducted at the University of Münster, whose goal is to improve methods for publishing scientific information as linked data. It focuses mostly on spatio-temporal aspects of the data using W3C and OGC standards and has resulted in useful tools such as LOD4WFS [12] for combining linked data and the OGC WFS service. One of the initiatives that has originated from this project is LODUM, whose goal is to provide information (including spatially-enabled data) about the University of Münster in the form of linked data.

The spatial identity of concepts given by geographic coordinates or complex geometry in specific data resources can add new possibilities for overcoming existing challenges associated with linked data [3]. This information is the element needed to enable the use of Desktop GIS as a main tool for interlinking, fusing and verifying geospatial linked data resources, thus enabling GIS power users to participate in the linked data project by improving the trustworthiness and quality of the data.

The motivation of this paper is to present an approach for applying Desktop GIS software to selected common challenges associated with managing existing linked datasets, where user supervision is crucial to ensure the quality of the data. The authors propose and discuss methods of using common tools based on examples, which constitute use cases of Desktop GIS taking advantage of data available on the Web. In particular, we discuss the following cases:

  1. Desktop GIS as a tool for linked data integration. Uncoupled linked data resources with spatial reference are often semantically and spatially related to each other. Sometimes they represent the same real-world entities. In those cases the method of interlinking them is crucial. Desktop GIS provides the tools for performing spatial analysis and for processing other non-spatial properties of such resources. In this case a GIS user is also capable of using more complex workflows and sets of geoprocessing tools. By using those capabilities of GIS it is possible to produce new sets of linkages between existing, published resources, which in turn can create connections between two separate datasets and extend or refine existing relations.

  2. Spatial data fusion for linked data in GIS environment. Data fusion is the process of fusing multiple records representing the same real-world object into a single, consistent and clean representation [4]. The need for data fusion also arises with linked data, where conflicts between different values of the same property in input datasets must be resolved. It is essential to overcome challenges including geospatial representation and conflation, the problem of finding identities between resources from different datasets [1]. Spatial data conflation methods and strategies known to GIS users are available in GIS software and can be used as a data fusion tool for both spatial and non-spatial aspects of linked data. The fused datasets can be used in specific applications or be published back on the linked data cloud. It is also possible to fuse linked datasets with other non-linked spatial data sources (e.g. administration registries, VGI) to enrich the cloud.

  3. Using GIS to improve spatial data trustworthiness and quality. The growth of data published on the Web in RDF as well as in dictionaries and ontologies has been dynamic and uncontrolled. Some linked data comes from unstructured sources like free text or semi-structured sources like Wikipedia, which makes the data more prone to errors [5]. To ensure that linked data fits user requirements and has an acceptable level of quality and trust, there is a need for its verification. This process is strongly connected to the previously outlined use cases. Desktop GIS tools can be used for quality assessment of spatial references and other aspects of linked data, such as proper labels or consistency of the units of measurement used in properties.

The use cases are described in more detail in section three of the paper, following a review of methodological and technical developments in linked data. Workflows for using desktop GIS in the described cases are illustrated with an application example, in which linked data and geospatial data are processed using a combination of open source, proprietary GIS, and software tools developed by the authors. The workflows can be re-created and run in most GIS environments supporting processing of spatial features. The summary of the proposed workflows and future directions of research are presented in the final section of the paper.

2 Methods

Despite the significant integrative potential of linked data, the actual data available in linked open data cloud is often incomplete and burdened with errors, which diminishes data trust and raises a question of the value of its informational content. In contrast with the currently available linked data resources, GIS data is highly specialized and although its quality and accuracy vary, much of it has been created and published by specialized government or commercial agencies following data accuracy standards and quality control procedures. Its structure and informational content are targeted to support a range of problem-solving applications and its value as reference data is arguably high. Unfortunately, most of available GIS data published by agencies is not suitable for non-GIS users, which limits its potential for serving as spatial reference for other data published on the Web or applications, which heavily depend on semantically rich spatial data.

The proposed workflows for dealing with common linked data challenges such as data integration, fusion and quality assessment are directed at Desktop GIS users to enable their potential contribution to the linked data project. Procedures and methods for matching different types of data can be reproduced in a Desktop GIS environment. At the same time, it is important to realize that linked data will not replace GIS data in the foreseeable future. Likewise, GIS data in its current form lacks the flexibility of data publishing and integration that linked data offers. These differences, however, make both data paradigms attractive candidates for complementary uses.

Given the linked data potential and the value of GIS data, we posit that both data sources can complement each other, and that their simultaneous use can add value to both of them. We consider the usage of GIS and linked data, as an iterative and reflexive process. Presented methods can also be beneficial for GIS, which may potentially reveal implicit relationships between spatial objects, external events, and facts not commonly stored as part of GIS data but available in linked data form. Hence, the bidirectional data enrichment could be useful not only for a Web user, but also for a GIS user.

2.1 Accessing linked data in Desktop GIS environment

An important issue associated with information exchange is providing an established and standardized data communication protocol, which is suitable for and compliant with data at hand [22]. In the presented cases (see Fig. 1) it is essential to be able to exchange the data between linked open data cloud and Desktop GIS environments.

Figure 1 Methods of accessing RDF data from GIS environment.

Linked data depends heavily on WWW and Semantic Web technologies and uses HTTP as a transfer protocol. This means that it can be accessed in the same manner as any other WWW resource, by sending GET requests and receiving an RDF document in one of the available notations such as N-Triples or RDF/XML. Importing linked data into GIS can be done by downloading an RDF document and converting it into a selected geodata format. Some data portals (http://www.data.gov, http://www.data.gov.uk, http://www.bbc.co.uk/things/) offer RDF resources or datasets in the form of prepared documents, which are available for downloading. Since much of linked data is stored in triple repositories, which need to be queried in order to return specific fragments of an RDF graph, this method of accessing data is suitable only for small RDF documents.
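
The download-and-convert route described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the URL, the sample triples and the function names are hypothetical, the request is built but never sent, and the regular expression handles only the simple literal-valued N-Triples lines that WGS84 Basic Geo coordinates usually take.

```python
import re
from collections import defaultdict
from urllib.request import Request

WGS84 = "http://www.w3.org/2003/01/geo/wgs84_pos#"

# Matches '<subject> <predicate> "literal"...' lines; other triple shapes are skipped.
TRIPLE = re.compile(r'^<([^>]+)>\s+<([^>]+)>\s+"([^"]*)"')


def rdf_request(url, mime="application/n-triples"):
    """Build (but do not send) a GET request asking for an RDF serialization."""
    return Request(url, headers={"Accept": mime})


def points_from_ntriples(text):
    """Collect (lat, long) per subject from WGS84 Basic Geo triples."""
    coords = defaultdict(dict)
    for line in text.splitlines():
        match = TRIPLE.match(line.strip())
        if not match:
            continue
        subject, predicate, value = match.groups()
        if predicate == WGS84 + "lat":
            coords[subject]["lat"] = float(value)
        elif predicate == WGS84 + "long":
            coords[subject]["long"] = float(value)
    # Keep only subjects that carry both coordinates.
    return {s: (c["lat"], c["long"]) for s, c in coords.items() if len(c) == 2}


# Hypothetical sample document; the second resource lacks a longitude.
SAMPLE = '''
<http://example.org/church/1> <http://www.w3.org/2003/01/geo/wgs84_pos#lat> "40.7128"^^<http://www.w3.org/2001/XMLSchema#float> .
<http://example.org/church/1> <http://www.w3.org/2003/01/geo/wgs84_pos#long> "-74.0060"^^<http://www.w3.org/2001/XMLSchema#float> .
<http://example.org/church/2> <http://www.w3.org/2003/01/geo/wgs84_pos#lat> "40.75"^^<http://www.w3.org/2001/XMLSchema#float> .
'''

points = points_from_ntriples(SAMPLE)
```

A full RDF parser would be the safer choice for real documents; the regular expression is used here only to keep the sketch free of third-party dependencies.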

SPARQL is a W3C standard for both an HTTP-based protocol and an RDF query language. It describes the interface of Web services capable of receiving queries and delivering query results. By using SPARQL the user can send prepared “SELECT” queries and receive fragments of the RDF graph meeting the query criteria. Under this use scenario the results are returned as a set of variables and their bindings. With SPARQL there is no need to download the whole RDF dataset, because the client connects to the triple store through the SPARQL web interface and retrieves only the fragments it needs. One example of resources with an enabled SPARQL interface is DBpedia. In addition to the standardized means of accessing datasets, the SPARQL language offers capabilities for creating federated queries. A federated query can unlock the full potential of linked data by querying across the LOD cloud and aggregating results from many SPARQL-enabled repositories.
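
As a rough illustration of the "variables and their bindings" mentioned above, the following Python sketch builds a SPARQL protocol GET URL and flattens a response in the SPARQL JSON results format. The query text, the choice of the class `dbo:Church`, the helper names and the sample response are assumptions made for this example and are not taken from the authors' tool; no network call is made.

```python
import json
from urllib.parse import urlencode

ENDPOINT = "https://dbpedia.org/sparql"  # public DBpedia endpoint

# Illustrative query; selecting by dbo:Church is an assumption of this sketch.
QUERY = """
PREFIX geo: <http://www.w3.org/2003/01/geo/wgs84_pos#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX dbo: <http://dbpedia.org/ontology/>
SELECT ?church ?label ?lat ?long WHERE {
  ?church a dbo:Church ;
          rdfs:label ?label ;
          geo:lat ?lat ;
          geo:long ?long .
  FILTER (lang(?label) = "en")
} LIMIT 100
"""


def build_query_url(query, endpoint=ENDPOINT):
    """SPARQL protocol: the query text travels in the 'query' parameter of a GET."""
    return endpoint + "?" + urlencode({"query": query})


def rows_from_results(payload):
    """Flatten a SPARQL JSON result document into one dict per solution."""
    data = json.loads(payload)
    names = data["head"]["vars"]
    return [
        {name: binding[name]["value"] for name in names if name in binding}
        for binding in data["results"]["bindings"]
    ]


# Hypothetical response in the SPARQL JSON results format.
SAMPLE_RESPONSE = json.dumps({
    "head": {"vars": ["church", "label", "lat", "long"]},
    "results": {"bindings": [{
        "church": {"type": "uri", "value": "http://example.org/church/1"},
        "label": {"type": "literal", "xml:lang": "en", "value": "Example Church"},
        "lat": {"type": "literal", "value": "40.7128"},
        "long": {"type": "literal", "value": "-74.0060"},
    }]},
})

rows = rows_from_results(SAMPLE_RESPONSE)
```

When the request is actually sent, the JSON format is normally requested with the header `Accept: application/sparql-results+json`.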

Linked data with geographical references imported into GIS must be interpreted and “understood” at both syntactic and semantic levels. This requirement applies mostly to the spatial component of data, that is, the geometry of a spatial entity. The geometry is interpreted with the use of vocabularies and ontologies. Two of the most common are WGS84 Basic Geo and GeoSPARQL. These vocabularies define how a geometry can be converted into GIS compliant geometry. It is important to use a consistent set of vocabularies, because this improves the interoperability of data. Importing resources from the LOD cloud into GIS can be facilitated by a procedure starting with linked data exploration using SPARQL queries. As part of the procedure, SPARQL queries are formulated to obtain the schema of the linked dataset. More detailed information about the schema should be contained in vocabularies or ontologies. The ontology in this case not only offers the description of semantics of imported data, but also guides the mapping between semantic data terminology and database relations (classes) and their schema. The final step of the process is data transformation, where interpreted data is converted into a GIS compliant representation: feature class definitions, feature instances and property values. After the successful completion of data transformation, data from the LOD cloud should be available in the GIS environment.
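
The geometry interpretation step can be illustrated with a small Python sketch that turns the two vocabularies named above into GeoJSON-like dictionaries. It is a simplification under stated assumptions: only POINT and single-ring POLYGON literals are handled, the function names are hypothetical, and coordinates are passed through without reordering, so a literal that declares a latitude-first CRS would still need an axis swap.

```python
import re

# Default CRS of a geo:wktLiteral when no CRS URI is given (longitude, latitude order).
CRS84 = "http://www.opengis.net/def/crs/OGC/1.3/CRS84"

WKT_LITERAL = re.compile(r"^\s*(?:<([^>]+)>\s*)?([A-Za-z]+)\s*\((.*)\)\s*$", re.S)


def point_from_lat_long(lat, long):
    """WGS84 Basic Geo: two separate predicates become one point geometry."""
    return {"type": "Point", "coordinates": [float(long), float(lat)]}


def parse_wkt_literal(literal):
    """Split a GeoSPARQL wktLiteral into (crs_uri, geometry dictionary)."""
    match = WKT_LITERAL.match(literal)
    if not match:
        raise ValueError("not a WKT literal: %r" % literal)
    crs = match.group(1) or CRS84
    kind = match.group(2).upper()
    body = match.group(3)
    if kind == "POINT":
        x, y = (float(v) for v in body.split())
        return crs, {"type": "Point", "coordinates": [x, y]}
    if kind == "POLYGON":
        # Single exterior ring only; interior rings are outside this sketch.
        ring = body.strip().lstrip("(").rstrip(")")
        coords = [[float(v) for v in pair.split()] for pair in ring.split(",")]
        return crs, {"type": "Polygon", "coordinates": [coords]}
    raise ValueError("unsupported geometry type: " + kind)


crs, geometry = parse_wkt_literal("POINT(-74.006 40.7128)")
```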

In the simple use case scenario presented later in the paper, the authors import datasets into Desktop GIS from two data sources: DBpedia and OpenStreetMap. This process is accomplished with the use of dedicated software, based on the GeoMedia platform, which adds functionality for data exchange between the LOD cloud and the Desktop GIS environment. Data in DBpedia is available through a SPARQL interface, from where it can be acquired directly. The user creates a SPARQL query and sends it to the DBpedia endpoint. Next, the imported data is converted into a GIS specific representation (based on the relational model). After completing the import process, data from DBpedia is available as feature class instances in GIS. The use case also involves importing the semantic representation of OpenStreetMap data provided by the LinkedGeoData project.

2.2 Desktop GIS as a tool for linked data integration

Linked data integration defined as connecting separate datasets published in linked open data cloud is one of basic principles of linked data, according to which resources should include links to other resources. Applications relying on integrated data will benefit from access to distributed data sources and offer improved capabilities at providing accurate answers and improved user experience. Ensuring that resource links in integrated linked datasets are up-to-date and replacing defunct connections with new ones is a continuous process. This subsection describes the scenario for integrating resources from linked datasets with the help of GeoMedia Professional software placing the particular emphasis on matching two spatial features and creating links between them with common GIS tools.

The opportunity for enriching data content by creating links lies in reciprocal referencing representations of entities [14]. The limited number of data facts can be leveraged by finding interconnections between seemingly separate entities. In this case, a phenomenon connected with a corresponding spatial object through linguistic description becomes available as a label.

The accurate determination of whether two objects (resources) represent the same real-world entity is not a trivial task. Most methods for finding the identity of resources in different datasets employ some form of mathematical model combining semantic, lexical, and spatial (direct/geometric and indirect) relationships. An example is a method that uses network measures to match resources originating from different sources [16]. Another approach for matching datasets from different sources is the estimation of potential relationships between resources. This method uses Markov Logic Networks on the set of candidates retrieved from entities linked by properties confirming the matching hypothesis. The output can be improved iteratively by using a dedicated “bootstrapping algorithm” [5]. The matching method described in this article follows the same general approach of combining spatial analysis with semantic and lexical analysis [17]. The availability of a geographical reference (location) in any form is very helpful, because GIS tools can then be used to determine the equivalence of resources. They provide advanced methods for analyzing the geometry and topology of spatial objects and offer functionality to establish the identity of objects using spatial analysis.

The approach developed by the authors can be presented using the example of integrating objects representing churches in New York City available as separate linked data resources, where one source of the data is OpenStreetMap and the other is DBpedia. In order to carry out the analysis, we distinguished features useful for proper identification of particular objects and for recognition of their mutual identity. Given the quality of OpenStreetMap data, we could not rely on one criterion. The first criterion used in the analysis was location. Even though we cannot be sure that we have found descriptions referring to the same object, the appearance of entities in proximate locations may indicate that one deals with mutually related phenomena. As long as one is considering the same category of objects (e.g. churches), spatial proximity is a strong clue for interpreting descriptions of objects as potentially referring to the same entity.

The analysis of proximity is a basic tool provided by GIS software and there are multiple ways of conducting such analysis. The problem lies in the quality of analyzed data. Given that both datasets are incomplete and inaccurate one needs additional determinants to ascertain the mutual identity of represented objects. In order to conduct the experiment, we prepared a testing environment. Due to processing requirements of spatial data originating from GIS and linked data sources, we decided to store OSM and DBpedia datasets in the form of GIS features, represented as relations linking the geometry of objects with their attributes.

The geometry of churches described in DBpedia is given by point features. In contrast, churches from LinkedGeoData are represented by polygons. The analysis relies on the iterative elimination of the least appropriate objects. For every church a buffer was created, where the distance depended on the spatial distribution of churches in a local neighborhood, as well as on the positional accuracy of LinkedGeoData originating from OpenStreetMap, given that the reference accuracy is often limited by the accuracy of GPS signal receivers built into smartphones. A maximum buffer distance of 100 m was used. The analysis showed that the similarity of two analyzed objects could be confirmed at separation distances of less than 100 m.

The boundary of the local neighborhood was determined with the use of clustering methods. A buffer of 30 m was used for local neighborhoods where the density was high. For low density neighborhoods a buffer distance of 100 m was applied. These are experimental values, which could be easily changed by a GIS user. The spatial distribution of churches is presented in Fig. 3. On the left side, where the density of churches was high, the buffer was set to 30 m, in contrast to the right side where the buffer is 100 m.
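
The density-dependent buffer can be sketched as follows. The paper determines neighborhoods with clustering methods that are not specified in detail, so this Python example substitutes a plain neighbor count: the search radius of 200 m and the threshold of three neighbors are hypothetical parameters, and coordinates are assumed to be planar and expressed in meters. Only the 30 m and 100 m buffer distances come from the text.

```python
from math import hypot


def buffer_distance(point, others, radius=200.0, dense_count=3,
                    dense_buffer=30.0, sparse_buffer=100.0):
    """Pick 30 m in dense neighborhoods and 100 m elsewhere.

    Density is approximated by counting other points within `radius`;
    this stands in for the clustering step and is an assumption.
    """
    neighbours = sum(
        1 for other in others
        if other != point and hypot(other[0] - point[0], other[1] - point[1]) <= radius
    )
    return dense_buffer if neighbours >= dense_count else sparse_buffer


def candidates(points, targets):
    """For each named point, list the targets that fall inside its buffer."""
    all_points = list(points.values())
    result = {}
    for name, xy in points.items():
        limit = buffer_distance(xy, all_points)
        found = []
        for target, txy in targets.items():
            dist = hypot(txy[0] - xy[0], txy[1] - xy[1])
            if dist <= limit:
                found.append((target, round(dist, 2)))
        result[name] = found
    return result


# Hypothetical planar coordinates in meters: four clustered points and one isolated point.
points = {"a": (0, 0), "b": (50, 0), "c": (0, 50), "d": (50, 50), "e": (1000, 1000)}
targets = {"t1": (20, 0), "t2": (1060, 1000), "t3": (1000, 1150)}
matches = candidates(points, targets)
```

In this toy layout the clustered points receive the 30 m buffer and the isolated point the 100 m buffer, so `t2` at 60 m is still a candidate for `e` while `t3` at 150 m is not.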

Figure 2 Linked data integration workflow as block diagram.

Figure 3 Analysis with different buffer distances dependent on the spatial distribution of features.

In the next step of the analysis a Euclidean distance from every OSM originating object to every DBpedia object was calculated. It is insufficient to assume that two objects are the same based only on the separation distance. To confirm object’s identity linguistic similarity was applied as second criterion. We compared labels or names provided for DBpedia and LinkedGeoData churches. As a metric for textual analysis, we utilized Levenshtein’s distance function and implemented it in GIS software. Levenshtein distance, also called edit distance, is a measure of the similarity between two strings [17]. The function counts the minimal number of insertions, deletions, and substitutions required to make two strings equivalent. The greater the distance value returned by the function, the more dissimilar the strings are. In our implementation we used the normalized Levenshtein’s distance defined by the formula:

Lev_N(L1, L2) = 1 - Lev(L1, L2) / max(char_length(L1), char_length(L2))

where: Lev_N (L1, L2) – normalized Levenshtein distance between labels L1 and L2. The normalized distance values fit the interval [0..1] with 0 indicating no match and 1 – the perfect match.
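
A minimal Python sketch of the normalized measure follows, assuming the formula is read as one minus the edit distance divided by the length of the longer label, which is the reading consistent with 0 meaning no match and 1 a perfect match. The function names are illustrative, and returning 1.0 for two empty labels is a choice made here, not something the paper specifies.

```python
def levenshtein(a, b):
    """Minimal number of insertions, deletions and substitutions turning a into b."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def lev_n(label_1, label_2):
    """Normalized value in [0, 1]: 1 for identical labels, 0 for no shared positions."""
    longest = max(len(label_1), len(label_2))
    if longest == 0:
        return 1.0  # two empty labels are treated as identical (assumption)
    return 1 - levenshtein(label_1, label_2) / longest
```

In practice labels would probably be lower-cased and stripped of punctuation before comparison, since the measure is sensitive to such differences.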

The result of this operation are the lines connecting every LinkedGeoData (OSM) church to DBpedia church in a buffer and assigning to them the values of Euclidean and Levenshtein distances (Fig. 4).

Figure 4 Two polygons related to a single point object with the values of Euclidean and Levenshtein distances (Levenshtein and Euclidean distance values from the first polygon are 0.98 and 41.06 m, respectively; from the second polygon they are 0.35 and 86.15 m).

Consequently, every polygon has a relation to every point in a buffer. The normalized Euclidean and Levenshtein distance values were used to formulate a similarity function, which made it possible to find the most appropriate link between the objects:

Lev_N(Losm, LDBP) + (1 - DIST_N(Gosm, GDBP))2k

where:

  1. DIST_N(G1, G2) – normalized Euclidean distance between geometries G1 and G2,

  2. Gosm – LinkedGeoData (OSM) resource geometry,

  3. GDBP – DBpedia resource geometry,

  4. Losm – OSM resource label,

  5. LDBP – DBpedia resource label,

  6. k – coefficient indicating weight of Euclidean distance, for the purpose of experiment k = 2.

Next, the similarity function was used to define the most suitable link between churches represented with polygon (LinkedGeoData) and point (DBpedia) data (Fig. 5).

Figure 5 The most promising links between resources are highlighted.

As one can see in Fig. 5, there are some instances where one DBpedia point has more than one link candidate from LinkedGeoData resources. To resolve this ambiguity, the inverse analysis should be done: the task is to choose the best link connecting each DBpedia resource with an OSM object. The result is the elimination of links with the lowest value of the similarity function. After the score evaluation it is essential to assert links between matched resources. The simplest way of establishing the mutual identity of resources from different datasets and linking those resources is to express it using OWL’s built-in owl:sameAs property. Building a triple “ex1:resourceX owl:sameAs ex2:resourceY” asserts precisely that the two resources originating from different datasets denote the same real-world entity.
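
The two-pass selection and the final assertion step might look like the Python sketch below. It takes already computed similarity scores as input, keeps the best DBpedia candidate for each polygon, then keeps the best polygon for each DBpedia point, and serializes the surviving pairs as owl:sameAs statements in N-Triples. The URIs, scores and function names are hypothetical, and the tie-breaking rule (first candidate seen wins) is an assumption.

```python
OWL_SAME_AS = "http://www.w3.org/2002/07/owl#sameAs"


def resolve_links(scored):
    """scored: iterable of (polygon_uri, point_uri, similarity). Returns 1:1 links."""
    # Pass 1: for every polygon keep its highest scoring point.
    best_per_polygon = {}
    for polygon, point, score in scored:
        if polygon not in best_per_polygon or score > best_per_polygon[polygon][1]:
            best_per_polygon[polygon] = (point, score)
    # Pass 2 (inverse analysis): for every point keep its highest scoring polygon.
    best_per_point = {}
    for polygon, (point, score) in best_per_polygon.items():
        if point not in best_per_point or score > best_per_point[point][1]:
            best_per_point[point] = (polygon, score)
    return sorted((polygon, point) for point, (polygon, _) in best_per_point.items())


def same_as_triples(links):
    """Serialize links as N-Triples lines using owl:sameAs."""
    return ["<%s> <%s> <%s> ." % (a, OWL_SAME_AS, b) for a, b in links]


# Hypothetical candidate links with made-up similarity scores.
scored = [
    ("http://example.org/osm/1", "http://example.org/dbp/A", 1.5),
    ("http://example.org/osm/2", "http://example.org/dbp/A", 0.6),
    ("http://example.org/osm/2", "http://example.org/dbp/B", 0.4),
    ("http://example.org/osm/3", "http://example.org/dbp/B", 0.9),
]
links = resolve_links(scored)
triples = same_as_triples(links)
```

A minimum score threshold could be added before the second pass so that weak candidates are never asserted as identical; the paper leaves such a cut-off to the supervising GIS user.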

2.3 Spatial data fusion for linked data in Desktop GIS environment

Another common challenge associated with managing linked data is enriching resources with additional information from other sources. Unlike linked data integration, where the datasets are not modified, in this case a new dataset is created as a result of data fusion. This process is closely related to data enrichment, which is a method of utilizing the content of data objects (information units). The goal of this operation is to merge the same real-world entities contained in multiple datasets into a single, consistent representation paying attention to improving quality of the data. Geospatial resources can be seen as one of the most important reference data in linked open data cloud with strong need for high trustworthiness and quality, and data fusion can aid in data quality improvement.

Similarly to the integration process, spatial data fusion for matching two resources can be run in Desktop GIS (see Fig. 6). The results of the process can enrich linked data or “traditional” GIS data. Desktop GIS software can be used as a tool for integrating linked data with spatial data stored in GIS or spatial data registries. In the example given earlier, two datasets describing the same real-world entities (churches) are integrated. The most important attribute acquired from the LOD cloud, which can also be stored as GIS data, is the object’s URI. In the process of entity matching, the unique identifier of the DBpedia resource is assigned to every LinkedGeoData church. The URI in this case can serve as a link connecting spatial features stored in GIS registries with resources available on the Web. GIS data attributes predominantly describe properties of real world entities represented by feature geometry. Relationships to external objects from different domains generally are not stored. A need to use such relationships could arise, for example, when carrying out an interdisciplinary analysis requiring access to various distributed data resources. As an example, consider the analysis of historical changes in the density pattern of settlements as a function of political and economic processes. Settlements in most cases are represented in GIS by points or polygons. The only information that this representation provides is spatial distance. Other relationships, which would enrich such analysis and could potentially be inferred from attribute properties and narratives related to settlements, are often missing from GIS data. In this instance, spatial objects from a GIS data set are matched with objects possessing a rich semantic description representing an interdisciplinary context. Using simple analytical functions we can assign more attributes to GIS objects. In our case we can enrich GIS data with complementary information coming from DBpedia, such as building materials used for constructing churches, the name of the architect, the year of completion etc.
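
The enrichment step can be illustrated with a short Python sketch in which a matched GIS feature receives the URI of its DBpedia counterpart together with selected attributes. The attribute names, the sample values and the conflict policy (the GIS value is kept and the disagreement is reported) are assumptions made for the example, since the paper does not prescribe a fusion rule.

```python
def enrich(feature, resource, wanted):
    """Copy selected attributes from a linked data resource onto a GIS feature.

    Returns the fused record and a dictionary of conflicting values,
    keyed by attribute, as (gis_value, linked_data_value) pairs.
    """
    fused = dict(feature)
    fused["uri"] = resource["uri"]  # the URI links the feature back to the Web resource
    conflicts = {}
    for key in wanted:
        if key not in resource:
            continue
        if key in feature and feature[key] != resource[key]:
            conflicts[key] = (feature[key], resource[key])  # keep GIS value, report it
        else:
            fused[key] = resource[key]
    return fused, conflicts


# Hypothetical records for one matched church.
gis_feature = {"id": 7, "name": "St. Example", "year": 1890}
dbpedia_resource = {
    "uri": "http://example.org/resource/St_Example",
    "architect": "A. Builder",
    "year": 1892,
}
fused, conflicts = enrich(gis_feature, dbpedia_resource, ["architect", "material", "year"])
```

Reporting conflicts instead of silently overwriting values keeps the user in the loop, which matches the supervised workflow the paper argues for.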

Figure 6 Linked data and GIS data fusion workflow as block diagram.

Solutions developed outside the GIS community for linked data management, such as Silk [23], are not very effective for spatial analysis. Description of location with a geographic name is not sufficient to reveal spatial relationships. Also, the representation of spatial objects is insufficient for many applications. With the results of spatial data fusion performed using advanced spatial analysis functions, the Desktop GIS environment can serve as a tool for preparing enriched spatial data for the linked open data cloud. In most cases, semantic data with spatial references is not endowed with information about spatial (topological) relationships. The description of spatial objects provided by raw linked data is, in general, limited to their location coordinates. Spatial relationships are usually derived as a result of further data processing involving GIS-based analysis of feature location and geometry. Such analysis can be instrumental for building relationships between different objects from disparate domains and subsequently from different sources. Hence, it stands to reason that semantic data generated from GIS analysis could be a valuable resource enabling further data analysis useful for spatial decision support and problem solving. Using spatial analysis methods available in GIS, references to spatial objects in linked data can be computed and expressed in quantitative (distance, direction), boolean (topological relations such as touches, overlaps) and/or qualitative form (closer to ... than, further from ... than).
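
The quantitative and qualitative forms listed above can be sketched with the Python standard library, assuming planar coordinates with x pointing east and y pointing north. The function names and the eight-sector compass scheme are choices made for this illustration. Topological predicates such as touches or overlaps need a geometry engine and are left out.

```python
from math import atan2, degrees, hypot

DIRECTIONS = ["N", "NE", "E", "SE", "S", "SW", "W", "NW"]


def distance(a, b):
    """Quantitative relation: planar Euclidean distance between two points."""
    return hypot(b[0] - a[0], b[1] - a[1])


def direction(a, b):
    """Quantitative relation: compass sector of b as seen from a."""
    bearing = degrees(atan2(b[0] - a[0], b[1] - a[1])) % 360
    return DIRECTIONS[int((bearing + 22.5) // 45) % 8]


def closer_than(origin, first, second):
    """Qualitative relation: is `first` closer to `origin` than `second` is?"""
    return distance(origin, first) < distance(origin, second)
```

Each computed value could then be attached to the corresponding pair of resources as an additional statement before the data is published back to the linked data cloud.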

2.4 Improving geospatial linked data trustworthiness and quality with Desktop GIS

Information published as linked data on the Web in many cases comes from structured sources like relational databases, semi-structured sources like Wikipedia, or unstructured sources like free text [14]. Data originating from the latter two sources are more exposed to errors [5]. In general, semi-structured and unstructured data sets are more likely to be noisy and incomplete [14]. One characteristic of data quality is fitness for use, according to which data quality is high when the data satisfies (fits) user requirements [19]. The same characteristic applies to representation of space (geometry). Geospatial data quality in GIS depends generally on the purpose, for which the data have been created and on their intended use. Data quality metrics include completeness, precision, accuracy and consistency. However, in the context of linked data quality some of these metrics can be understood differently. One example is the completeness of data generated from unstructured and semi-structured sources. From the logic standpoint, the incompleteness of knowledge is one of the main assumptions behind Semantic Web (the open world assumption). Moreover, from the practical point of view this assumption is often essential [14]. Other metrics of linked data quality include coherence of links to other resources or consistency with regard to implicit information [21].

Recent studies have pointed out problems with linked data quality concerning non-standardized and variable representation, inconsistency, and lack of interoperability [18, 25]. One reason that linked data is not used as much as it could be is the high incidence of data errors, including syntactically erroneous and repetitious data [13]. However, problems with syntax or duplication can be easily detected and repaired automatically. Other categories of errors, including incorrect links or wrong geographic locations, are more challenging to detect.

Similar quality problems also affect geospatial linked data. As in object-oriented GIS, there is a need for common standards. GeoSPARQL, NeoGeo, the WGS84 Basic Geo Vocabulary and GeoOWL are commonly used for representing geometry and spatial relations in geospatial data. Sometimes, however, the geometry of data is represented in a non-standardized way, which is inconsistent with the linked data principle that common vocabularies should be used. This can be the result of automated transformation from semi-structured data into RDF, as can be observed in the Polish version of DBpedia (Fig. 7).
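A non-standard degrees/minutes/seconds representation can be normalized before the data are loaded into GIS. The following Python sketch converts such values to signed decimal degrees and emits them with the `lat` and `long` properties of the WGS84 Basic Geo Vocabulary as N-Triples. The input layout (separate degree, minute, second and hemisphere values), the resource URI and the sample coordinates are assumptions made for illustration; the actual property names used by the Polish DBpedia are not reproduced here.

```python
GEO = "http://www.w3.org/2003/01/geo/wgs84_pos#"  # WGS84 Basic Geo Vocabulary namespace
XSD_DOUBLE = "http://www.w3.org/2001/XMLSchema#double"


def dms_to_decimal(degrees, minutes, seconds, hemisphere):
    """Convert degrees/minutes/seconds plus a hemisphere letter to signed decimal degrees."""
    if hemisphere not in ("N", "S", "E", "W"):
        raise ValueError(f"unknown hemisphere: {hemisphere!r}")
    if not (0 <= minutes < 60 and 0 <= seconds < 60):
        raise ValueError("minutes and seconds must be in [0, 60)")
    value = abs(degrees) + minutes / 60.0 + seconds / 3600.0
    # Southern and western hemispheres are negative in decimal notation.
    return -value if hemisphere in ("S", "W") else value


def wgs84_ntriples(uri, lat, lon):
    """Return two N-Triples lines describing the point with geo:lat and geo:long."""
    return [
        f'<{uri}> <{GEO}lat> "{lat:.6f}"^^<{XSD_DOUBLE}> .',
        f'<{uri}> <{GEO}long> "{lon:.6f}"^^<{XSD_DOUBLE}> .',
    ]


# Illustrative input values and a hypothetical resource URI.
lat = dms_to_decimal(51, 6, 36, "N")
lon = dms_to_decimal(17, 1, 57, "E")
for line in wgs84_ntriples("http://example.org/resource/SamplePlace", lat, lon):
    print(line)
```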

Figure 7 Polish DBpedia geographic coordinates represented in RDF as separate literal nodes for degrees, minutes and seconds.

DBpedia content is automatically generated from Wikipedia with limited supervision and is prone to errors. According to Kontokostas et al. (2014) [20], 28K resources in the English-language version of DBpedia share the same coordinates with another resource. This type of location error often results from the relatively low accuracy with which coordinates are expressed and from the degree of location generalization. For example, the resources <http://dbpedia.org/resource/Wallis_and_Futuna> and <http://dbpedia.org/page/Hahake_District> have the same coordinates. This is because Hahake District is a part of the Territory of the Wallis and Futuna Islands and is located in the middle of the islands.
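Resources that share coordinates can be found with a simple grouping step. The sketch below is one possible way to do this in Python and is not the test used in [20]; the function name and the rounding precision are assumptions, and the coordinate values attached to the two URIs are placeholders chosen for illustration, not values read from DBpedia.

```python
from collections import defaultdict


def shared_coordinates(resources, ndigits=4):
    """Group resource URIs by rounded (lat, lon) and keep groups with more than one member.

    resources: iterable of (uri, lat, lon) tuples.
    ndigits:   decimal places used when comparing coordinates (assumed precision).
    """
    groups = defaultdict(list)
    for uri, lat, lon in resources:
        groups[(round(lat, ndigits), round(lon, ndigits))].append(uri)
    return {coords: uris for coords, uris in groups.items() if len(uris) > 1}


# Placeholder coordinates; only the URIs come from the example in the text.
sample = [
    ("http://dbpedia.org/resource/Wallis_and_Futuna", -13.3, -176.2),
    ("http://dbpedia.org/page/Hahake_District", -13.3, -176.2),
    ("http://example.org/resource/SomewhereElse", 10.0, 20.0),
]
for coords, uris in shared_coordinates(sample).items():
    print(coords, uris)
```

Groups reported in this way are only candidates for review: as the Hahake District example shows, identical coordinates may reflect generalization rather than an outright error.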

In our approach, Desktop GIS is employed as a tool for quality supervision of linked data (see Fig. 8). Visual representation, one of the basic functionalities of GIS, can be helpful in detecting basic problems with erroneous location caused, for instance, by inverted coordinates (Fig. 9). Even a simple map-based data view can be useful for detecting positional and attribute data errors. For example, visualizing the spatial resources from DBpedia categorized as “Properties of religious function on the National Register of Historic Places in New York City” reveals an outlying resource (North Hillsdale Methodist Church, with URI http://dbpedia.org/resource/North_Hillsdale_Methodist_Church), whose category can then be changed to the more suitable “Properties of religious function on the National Register of Historic Places in New York”.
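The visual check described above can be complemented by a simple automated pre-screening. The following Python sketch classifies a point against an expected bounding box and flags the case in which the point only falls inside the box after latitude and longitude are exchanged. The bounding box is a rough, assumed extent for New York City, the function name is hypothetical, and the sample points are illustrative.

```python
def classify_point(lat, lon, bbox):
    """Return 'ok', 'swapped' or 'outside' for a point relative to an expected extent.

    bbox is (min_lat, min_lon, max_lat, max_lon) in decimal degrees.
    """
    min_lat, min_lon, max_lat, max_lon = bbox

    def inside(a_lat, a_lon):
        return min_lat <= a_lat <= max_lat and min_lon <= a_lon <= max_lon

    if inside(lat, lon):
        return "ok"
    if inside(lon, lat):
        # The point fits only when the two values are exchanged: likely inverted coordinates.
        return "swapped"
    return "outside"


# Rough extent of New York City (assumption made for this example).
NYC_BBOX = (40.4, -74.3, 41.0, -73.6)

print(classify_point(40.79, -73.97, NYC_BBOX))   # a point within the city extent
print(classify_point(-73.97, 40.79, NYC_BBOX))   # the same point with inverted coordinates
print(classify_point(42.1, -73.5, NYC_BBOX))     # an outlying point north of the city
```

Points classified as "outside" correspond to the outlying resources that stand out on the map and may indicate a wrong category rather than a wrong location.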

Figure 8 Quality assessment with match quality metadata for linked data integration.

Figure 9 Basic supervision of linked data resources location quality with the use of Desktop GIS map visualization.

The problem with matching linked data with GIS resources, described in the presented case study examples, lies in the degree of certainty in identifying a correct match. To ensure the trustworthiness of the links between two separate datasets, or of new datasets created by spatial data fusion using the method presented in the earlier subsection, it is important to define a system for ranking candidate pairs according to the level of certainty of the match. Given a set of matching criteria, one could assign to each of them a weight representing the certainty of the match.
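Such a ranking can be sketched as a weighted sum of per-criterion scores. The Python example below combines a name similarity score with a distance-based score. The weights, the 200 m distance cut-off and all function names are assumptions made for illustration; the string similarity from Python's `difflib` is used as a stand-in, and any edit-distance measure could be substituted.

```python
from difflib import SequenceMatcher


def name_similarity(a, b):
    """Similarity of two labels in [0, 1], case-insensitive."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def distance_score(distance_m, max_distance_m=200.0):
    """Map a distance in metres to [0, 1]; 1 at zero distance, 0 at or beyond the cut-off."""
    return max(0.0, 1.0 - distance_m / max_distance_m)


def match_score(label_a, label_b, distance_m, w_name=0.5, w_dist=0.5):
    """Weighted certainty of a match; the weights are assumed to sum to 1."""
    return w_name * name_similarity(label_a, label_b) + w_dist * distance_score(distance_m)


def rank_candidates(candidates):
    """Sort (label_a, label_b, distance_m) tuples by descending match score."""
    scored = [(match_score(a, b, d), a, b) for a, b, d in candidates]
    return sorted(scored, key=lambda item: item[0], reverse=True)


# Labels taken from Table 1; the distances are illustrative.
candidates = [
    ("Saint Thomas Episcopal Church", "Holy Trinity Lutheran Church (Manhattan)", 150.0),
    ("Saint Thomas Episcopal Church", "Saint Thomas Church (Manhattan)", 15.0),
]
for score, a, b in rank_candidates(candidates):
    print(f"{score:.2f}  {a}  <->  {b}")
```

Keeping the per-criterion scores separate from the weights makes it straightforward to add further criteria, for example a semantic distance between resource types.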

In some cases it is impossible to match data from different sources automatically. In those cases the data or information can be verified manually. It is important to provide the data user with an explicit record of the method used for matching and verifying the given objects. This is the role of metadata, which should reveal what kind of processing the data have been subject to.

3 Results

The proposed workflows for overcoming linked data management challenges were used to conduct experiments, whose results are presented in this section. In the case of data integration, by performing the matching process on resources from DBpedia and LinkedGeoData representing churches in New York, the authors were able to determine connections between resources from the two separate datasets. The results of matching selected spatial objects with the corresponding connections are visible in Fig. 10.

Figure 10 The final result of matching two objects from DBpedia and OSM sources.

The final result of this experiment is a set of pairs representing connections between resources demonstrating that they represent the same real-world objects (Tab. 1).

Table 1 Sample links between individual resources from two datasets, which can be converted into RDF triples.

| LinkedGeoData URI | LinkedGeoData label | DBpedia URI | DBpedia label |
| --- | --- | --- | --- |
| http://linkedgeodata.org/page/triplify/way3624833 | Cathedral of Saint John the Divine | http://dbpedia.org/resource/Cathedral_of_Saint_John_the_Divine | Cathedral of Saint John the Divine |
| http://linkedgeodata.org/page/triplify/way270859755 | West Park Presbyterian Church | http://dbpedia.org/resource/West-Park_Presbyterian_Church | West-Park Presbyterian Church |
| http://linkedgeodata.org/page/triplify/way269219552 | Holy Trinity Lutheran Church | http://dbpedia.org/resource/Holy_Trinity_Lutheran_Church_(Manhattan) | Holy Trinity Lutheran Church (Manhattan) |
| http://linkedgeodata.org/page/triplify/way264643155 | Church of the Guardian Angel | http://dbpedia.org/resource/Church_of_the_Guardian_Angel_(Manhattan) | Church of the Guardian Angel (Manhattan) |
| http://linkedgeodata.org/page/triplify/way271003185 | Holy Name of Jesus Church | http://dbpedia.org/resource/Holy_Name_of_Jesus_Roman_Catholic_Church | Holy Name of Jesus Roman Catholic Church |
| http://linkedgeodata.org/page/triplify/way277844754 | Saint Barnabas Church | http://dbpedia.org/resource/St._Barnabas’_Church_(Bronx) | St. Barnabas’ Church (Bronx) |
| http://linkedgeodata.org/page/triplify/way276140800 | Saint Margaret Mary Roman Catholic Church | http://dbpedia.org/resource/St._Margaret_Mary’s_Church_(Bronx) | St. Margaret Mary’s Church (Bronx) |
| http://linkedgeodata.org/page/triplify/way278346592 | Saint Thomas Episcopal Church | http://dbpedia.org/resource/Saint_Thomas_Church_(Manhattan) | Saint Thomas Church (Manhattan) |
| http://linkedgeodata.org/page/triplify/way275351153 | Sacred Heart Roman Catholic Church | http://dbpedia.org/resource/Church_of_the_Sacred_Heart_(Bronx,_New_York) | Church of the Sacred Heart (Bronx, New York) |
| http://linkedgeodata.org/page/triplify/way276413870 | Church of the Visitation | http://dbpedia.org/resource/Visitation_of_the_Blessed_Virgin_Mary_Church_(Bronx,_New_York) | Visitation of the Blessed Virgin Mary Church (Bronx, New York) |

The pairs can be converted into RDF triples and then exposed in the LOD cloud.
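The conversion of matched pairs into triples can be sketched in a few lines of Python. The example assumes that `owl:sameAs` is chosen as the linking predicate, which the text does not specify, and serializes the links as N-Triples; the function name is hypothetical and the two sample pairs are taken from Table 1.

```python
OWL_SAME_AS = "http://www.w3.org/2002/07/owl#sameAs"  # assumed linking predicate


def pairs_to_ntriples(pairs, predicate=OWL_SAME_AS):
    """Turn (subject_uri, object_uri) pairs into N-Triples lines."""
    return [f"<{s}> <{predicate}> <{o}> ." for s, o in pairs]


pairs = [
    ("http://linkedgeodata.org/page/triplify/way3624833",
     "http://dbpedia.org/resource/Cathedral_of_Saint_John_the_Divine"),
    ("http://linkedgeodata.org/page/triplify/way270859755",
     "http://dbpedia.org/resource/West-Park_Presbyterian_Church"),
]
for line in pairs_to_ntriples(pairs):
    print(line)
```

A weaker predicate could be passed through the `predicate` argument when the certainty of the match does not justify asserting identity.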

To ensure that the quality of integrated or fused data can be properly assessed, the authors argue that the subjective and objective quality of matching between resources should be expressed in the form of metadata added to the created linked datasets. The result of the conducted experiments is an ontology that defines a metadata profile for describing the quality of linked data integration and fusion (Fig. 11).

Figure 11 Proposed ontology for matching spatial objects.

The ontology contains one main class, Correspondence, which represents the matching between two real-world spatial entities: the corresponding object and the matching object, which are connected directly and transitively through the correspondence object (Fig. 12). Basic properties are defined to indicate the synthetic measure of matching that was used, the spatial distance between the objects, whether a better match exists between the corresponding object and a different object, and whether the match has been verified manually. Representing the metadata as an ontology using Semantic Web technologies makes it possible to add further properties to the Correspondence class (e.g. for expressing lexical or semantic distance).
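To make the structure of such a metadata record concrete, the following Python sketch serializes one correspondence as N-Triples. Only the class name Correspondence comes from the text; the namespace `http://example.org/matching#`, all property names and the sample values are hypothetical placeholders and do not reproduce the vocabulary shown in Fig. 11 and Fig. 12.

```python
from dataclasses import dataclass

EX = "http://example.org/matching#"  # hypothetical namespace, not the authors' ontology URI
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
XSD = "http://www.w3.org/2001/XMLSchema#"


@dataclass
class Correspondence:
    uri: str
    corresponding_object: str   # URI of the first resource
    matching_object: str        # URI of the resource matched to it
    match_measure: float        # synthetic measure of matching
    spatial_distance_m: float   # distance between the two objects
    better_match_exists: bool   # is there a better match with a different object?
    manually_verified: bool     # has a person confirmed the match?

    def to_ntriples(self):
        """Serialize the record; every property name below is a placeholder."""
        s = f"<{self.uri}>"
        return [
            f"{s} <{RDF_TYPE}> <{EX}Correspondence> .",
            f"{s} <{EX}correspondingObject> <{self.corresponding_object}> .",
            f"{s} <{EX}matchingObject> <{self.matching_object}> .",
            f'{s} <{EX}matchMeasure> "{self.match_measure}"^^<{XSD}double> .',
            f'{s} <{EX}spatialDistance> "{self.spatial_distance_m}"^^<{XSD}double> .',
            f'{s} <{EX}betterMatchExists> "{str(self.better_match_exists).lower()}"^^<{XSD}boolean> .',
            f'{s} <{EX}manuallyVerified> "{str(self.manually_verified).lower()}"^^<{XSD}boolean> .',
        ]


record = Correspondence(
    uri="http://example.org/correspondence/1",
    corresponding_object="http://linkedgeodata.org/page/triplify/way271003185",
    matching_object="http://dbpedia.org/resource/Holy_Name_of_Jesus_Roman_Catholic_Church",
    match_measure=0.87,          # illustrative value
    spatial_distance_m=12.5,     # illustrative value
    better_match_exists=False,
    manually_verified=True,
)
print("\n".join(record.to_ntriples()))
```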

Figure 12 Example use of proposed ontology.

The workflow for linked data fusion was used to create a new linked data resource combining attributes from LinkedGeoData and DBpedia with attributes calculated in the GIS environment. Table 2 contains an example resource together with the origin of each attribute. Data fusion results can be exported into an RDF representation and published in the linked data cloud.

Table 2 Example resource showing results of linked data fusion.

| Property | Value | Source |
| --- | --- | --- |
| URI | http://www.igig.up.wroc.pl/wogis/NYchurches/Holy_Name_Of_Jesus_Church | GIS |
| Dbpedia | http://dbpedia.org/page/Holy_Name_of_Jesus_Roman_Catholic_Church | DBpedia |
| osm_linkedgeodata | http://linkedgeodata.org/data/triplify/way271003185 | LinkedGeoData |
| label | “Holy Name of Jesus Catholic Church” | DBpedia |
| comment | “The Holy Name of Jesus Roman Catholic Church stands at 96th Street and Amsterdam Avenue, New York City. It was taken over by the Franciscans in 1990...” | DBpedia |
| thumbnail | http://commons.wikimedia.org/wiki/Special:FilePath/WTM_NewYorkDolls_031.jpg?width=300 | DBpedia |
| geometry | “POLYGON((-73.9707247 40.794645700000004,-73.970712 40.7946638,...” | LinkedGeoData (converted in GIS from LineString) |
| height | 25.5 | LinkedGeoData |
| area | 1608.845 | GIS |
| perimeter | 181.935 | GIS |
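Attributes such as area and perimeter are derived from the geometry. As a rough illustration of the kind of calculation involved, the Python sketch below projects a ring of (longitude, latitude) vertices onto a local plane with an equirectangular approximation and applies the shoelace formula. This is not the procedure used by the GIS software, which would normally work in a suitable projected coordinate system; the function names and the spherical Earth radius are assumptions, and the sample ring is a synthetic square rather than the truncated geometry of Table 2.

```python
import math

EARTH_RADIUS_M = 6371008.8  # mean Earth radius; spherical approximation (assumption)


def project_local(ring):
    """Project (lon, lat) vertices to metres relative to the first vertex."""
    lon0, lat0 = ring[0]
    # Scale longitudes by the cosine of the mean latitude of the ring.
    mean_lat = math.radians(sum(lat for _, lat in ring) / len(ring))
    return [
        (EARTH_RADIUS_M * math.radians(lon - lon0) * math.cos(mean_lat),
         EARTH_RADIUS_M * math.radians(lat - lat0))
        for lon, lat in ring
    ]


def area_and_perimeter(ring):
    """Return (area in square metres, perimeter in metres) of a simple polygon ring."""
    pts = list(ring)
    if pts[0] != pts[-1]:
        pts.append(pts[0])  # close the ring if needed
    xy = project_local(pts)
    twice_area = 0.0
    perimeter = 0.0
    for (x1, y1), (x2, y2) in zip(xy, xy[1:]):
        twice_area += x1 * y2 - x2 * y1           # shoelace term
        perimeter += math.hypot(x2 - x1, y2 - y1)  # edge length
    return abs(twice_area) / 2.0, perimeter


# Synthetic square of 0.001 degrees on the equator, used only to exercise the code.
square = [(0.0, 0.0), (0.001, 0.0), (0.001, 0.001), (0.0, 0.001)]
area, perimeter = area_and_perimeter(square)
print(f"area = {area:.1f} m2, perimeter = {perimeter:.1f} m")
```

The approximation is adequate for building-sized polygons; for larger features the projection error grows and a proper projected coordinate system should be used.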

4 Discussion and conclusion

GIS, with their data and analytic functions, differ from linked data on many levels, not least the conceptual one. We have argued that both GIS and linked data are potentially valuable sources of spatial information. At the same time, the different conceptual underpinnings of the two paradigms have resulted in different data models, data storage solutions, serializations, and tools for data processing. These differences extend to the level of data provenance. Most GIS data result from organizational missions or application-driven resource acquisition processes. Linked data are heterogeneous in nature; they may be published by trusted organizations, collected by volunteers, and/or retrieved from Web resources by computer systems (agents). While the credibility of the former is often high, trust in the latter varies.

Using data resources coming from both paradigms (GIS and linked data) could improve the quality of information, provide a method of assessment of data reliability, facilitate data enrichment, and result in better description of analyzed resources. The problem hindering the parallel use of linked data and GIS data is the synchronization of processes and simultaneous exchange of results obtained from both types of resources. This is a challenge that needs to be overcome in order to improve access to and usability of both data resources.

It is difficult to decide how much influence the Semantic Web will have on the future of GIS. Will GIS take on a fully semantic dimension, and will data exchange be based on the RDF format? Time will tell. Currently, researchers are working on bridging both data paradigms, with some even pondering the need for a new paradigm. GIS has a long history, established standards, and experienced users. It will take time, proven data processing workflows, and tangible benefits to convince the community of GIS users to adopt the solutions of the Semantic Web.

The approach proposed in this paper seeks a compromise by connecting two data representation and processing paradigms with the use of “traditional” Desktop GIS. The integration can be accomplished by implementing a common part of the SPARQL protocol in GIS and treating it as one of the web services, similar to WMS and WFS, thus facilitating data exchange with GIS. In the proposed approach, GIS, spatial analysis, and topological data may be used for linked data verification and for determining the identity of objects. The novelty of the approach lies in using Desktop GIS for common challenges in linked dataset management: integration, fusion and quality assessment.
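From the client side, treating a SPARQL endpoint as a web service amounts to sending an HTTP request with the query as a parameter, much as a GIS client sends a WFS request. The Python sketch below builds such a GET request following the SPARQL 1.1 Protocol (`query` parameter, JSON results requested through the `Accept` header) without sending it. The helper name, the bounding box in the filter and the choice of the public DBpedia endpoint are assumptions made for illustration and do not describe the authors' implementation.

```python
from urllib.parse import urlencode, urlparse, parse_qs
from urllib.request import Request

ENDPOINT = "http://dbpedia.org/sparql"  # public DBpedia endpoint, used here as an example

# Resources with WGS84 coordinates inside a rough New York City extent (assumed values).
QUERY = """PREFIX geo: <http://www.w3.org/2003/01/geo/wgs84_pos#>
SELECT ?s ?lat ?long WHERE {
  ?s geo:lat ?lat ; geo:long ?long .
  FILTER(?lat >= 40.4 && ?lat <= 41.0 && ?long >= -74.3 && ?long <= -73.6)
} LIMIT 100"""


def build_sparql_get(endpoint, query):
    """Build (but do not send) a SPARQL Protocol GET request for the given query."""
    url = endpoint + "?" + urlencode({"query": query})
    return Request(url, headers={"Accept": "application/sparql-results+json"})


request = build_sparql_get(ENDPOINT, QUERY)
print(request.get_method())
print(request.full_url[:80] + "...")
```

The request object could then be passed to `urllib.request.urlopen` and the JSON bindings converted into point features; that step is omitted here because it depends on the availability of the endpoint.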

In linked data, resources are matched and matching accuracy is assumed rather than known. In contrast, a typical use of GIS for resource management and decision support requires precision and accuracy. In most applications, the likelihood of correctly identifying objects must reach 100%. Semantic Web tools integrated with GIS may facilitate and accelerate the publication of RDF data and their wider use. The use of GIS in a linked data environment may result in a lower level of automation of mapping processes than in stand-alone GIS data processing, but the quality and reliability of the resulting data are likely to increase.

The examples presented in the paper show the potential of using semantic techniques for interpreting spatial data obtained from different sources. Based on these examples, the use of ontologies describing the correspondence of objects representing identical entities seems promising. It opens up the possibility of using reasoning engines to infer not only the proximity of objects but also the quality of object matching. To further support the synergistic use of linked data and GIS, it is essential to develop robust tools for data import and export between GIS and semantic data resources. Such tools could not only facilitate simultaneous processing of linked data with GIS data but also be instrumental in supporting data analysis, with the caveat that the certainty of retrieved data should be described in dedicated semantic metadata using an ontology suited for this purpose. In some cases, processed data should be verified and improved manually, with a record of such interventions preserved in the metadata.

Acknowledgement

This research was supported by the National Science Centre, Poland - project number 2012/05/B/HS4/04197.

The examples in the article were prepared using the software “Semantic Components”, created under the project “Development and Implementation of Innovative GeoMedia Enterprise Intelligence Technology Implementing Multi-Criteria Analysis with Spatial Data in the Desktop Environment and on the Web”, financed by the National Centre for Research and Development, Poland.

References

Published papers

[2] Janowicz, K., S. Scheider, T. Pehle, and G. Hart. 2012. Geospatial semantics and linked spatiotemporal data - Past, present, and future, Semantic Web, v.3 n.4, p.321–332.

[4] Bleiholder J., and F. Naumann. 2009. Data fusion, ACM Computing Surveys (CSUR), v.41 n.1, p.1–41.

[11] Stadler, C., J. Lehmann, K. Höffner and S. Auer. 2012. LinkedGeoData: A core for a web of spatial open data, Semantic Web, Vol. 3, Issue 4, Pages 333–354. doi:10.3233/SW-2011-0052

[14] Paulheim, H. and C. Bizer. 2014. Improving the Quality of Linked Data Using Statistical Distributions, International Journal on Semantic Web & Information Systems, Vol. 10, Issue 2, Pages 63–86. doi:10.4018/ijswis.2014040104

[17] Levenshtein, V. I. 1965. Binary codes capable of correcting spurious insertions and deletions of ones. Probl. Inf. Transmission, vol. 1, 8–17.

[18] Hogan, A., R. Umbrich, A. Harth, R. Cyganiak, A. Polleres and S. Decker. 2012. An empirical survey of Linked Data conformance, Web Semant., 14: 14–44. doi:10.1016/j.websem.2012.02.001

[19] Wang, R. Y. and D. M. Strong. 1996. Beyond accuracy: what data quality means to data consumers, J. Manage. Inf. Syst., 12: 5–33. doi:10.1080/07421222.1996.11518099

[21] Zaveri, A., A. Rula, A. Maurino, R. Pietrobon, J. Lehmann and S. Auer. 2015. Quality assessment for Linked Data: A Survey. Semantic Web, Vol. 7, No. 1, Pages 63–93.

[24] Janowicz, K., S. Schade, A. Bröring, C. Keßler, P. Maué and C. Stasch. 2010. Semantic Enablement for Spatial Data Infrastructures, Transactions in GIS, 14: 111–29. doi:10.1111/j.1467-9671.2010.01186.x

Books and book chapters

[1] Kuhn, W., T. Kauppinen and K. Janowicz. 2014. Linked Data - A Paradigm Shift for Geographic Information Science. In Matt Duckham, Edzer Pebesma, Kathleen Stewart and Andrew U. Frank (eds.), Geographic Information Science (Springer International Publishing), Vol. 8728 of the series Lecture Notes in Computer Science, Pages 173–186. doi:10.1007/978-3-319-11593-1_12

[3] Bizer, C., T. Heath, and T. Berners-Lee. 2009. Linked Data – The Story So Far. In International Journal on Semantic Web & Information Systems, Vol. 5, Issue 3, Pages 1–22. doi:10.4018/jswis.2009081901

[5] Dutta, A., C. Meilicke and S. Ponzetto. 2014. A Probabilistic Approach for Integrating Heterogeneous Knowledge Sources. In Valentina Presutti, Claudia d’Amato, Fabien Gandon, Mathieu d’Aquin, Steffen Staab and Anna Tordai (eds.), The Semantic Web: Trends and Challenges (Springer International Publishing), Vol. 8465 of the series Lecture Notes in Computer Science, Pages 286–301.

[8] Abbas, S. and A. Ojo. 2013. Towards a Linked Geospatial Data Infrastructure. In Andrea Kő, Christine Leitner, Herbert Leitold and Alexander Prosser (eds.), Technology-Enabled Innovation for Democracy, Government and Governance (Springer Berlin Heidelberg), Vol. 8061 of the series Lecture Notes in Computer Science, Pages 196–210. doi:10.1007/978-3-642-40160-2_16

[12] Jones, J., W. Kuhn, C. Keßler and S. Scheider. 2014. Making the Web of Data Available Via Web Feature Services. In Joaquín Huerta, Sven Schade and Carlos Granell (eds.), Connecting a Digital Europe Through Location and Place (Springer International Publishing), Part of the series Lecture Notes in Geoinformation and Cartography, Pages 341–361. doi:10.1007/978-3-319-03611-3_20

[16] Guéret, C., P. Groth, C. Stadler and J. Lehmann. 2012. Assessing Linked Data Mappings Using Network Measures. In Elena Simperl, Philipp Cimiano, Axel Polleres, Oscar Corcho and Valentina Presutti (eds.), The Semantic Web: Research and Applications (Springer Berlin Heidelberg), Vol. 7295 of the series Lecture Notes in Computer Science, Pages 87–102. doi:10.1007/978-3-642-30284-8_13

[22] Marsden, B. W. 1986. Communication Network Protocols, Chartwell-Bratt, Pages 64–65.

[25] Fürber, C. and M. Hepp. 2010. Using SPARQL and SPIN for Data Quality Management on the Semantic Web. In Witold Abramowicz and Robert Tolksdorf (eds.), Business Information Systems (Springer Berlin Heidelberg), Vol. 47 of the series Lecture Notes in Business Information Processing, Pages 35–46. doi:10.1007/978-3-642-12814-1_4

Conference proceedings

[6] Schade, S., and M. Lutz. 2010. Opportunities and Challenges for using Linked Data in INSPIRE. LSTD-2010 Linked Spatiotemporal Data 2010. In Proceedings of the Workshop on Linked Spatiotemporal Data 2010, in conjunction with the 6th International Conference on Geographic Information Science (GIScience 2010). Zurich, Switzerland.

[7] Schade, S., C. Granell, and L. Díaz. 2010. Augmenting SDI with Linked Data. LSTD-2010 Linked Spatiotemporal Data 2010. In Proceedings of the Workshop on Linked Spatiotemporal Data 2010, in conjunction with the 6th International Conference on Geographic Information Science (GIScience 2010). Zurich, Switzerland.

[10] Harvey, F., J. Jones, S. Scheider, A. Iwaniak, I. Kaczmarek, J. Łukowicz, and M. Strzelecki. 2014. Little Steps Towards Big Goals. Using Linked Data to Develop Next Generation Spatial Data Infrastructures (aka SDI 3.0). In Proceedings of the AGILE 2014 ICGIS (Castellón), ISBN: 978-90-816960-4-3.

[13] Mika, P. et al. (Eds.) The Semantic Web - ISWC 2014, Proceedings of 13th International Semantic Web Conference, Riva del Garda, Italy, Part I, LNCS 8796, Pages 213–228, Springer International Publishing Switzerland 2014.

[20] Kontokostas, D., P. Westphal, S. Auer, S. Hellmann, J. Lehmann, R. Cornelissen and A. Zaveri. 2014. Test-driven evaluation of linked data quality. In Proceedings of the 23rd international conference on World Wide Web, 747-58. Seoul, Korea: ACM. doi:10.1145/2566486.2568002

[23] Jentzsch, A., R. Isele and C. Bizer. 2010. Silk - Generating RDF Links while publishing or consuming Linked Data. Poster at the International Semantic Web Conference (ISWC2010), Shanghai.

Websites

[9] Marcell, R. and A. Bröring. 2013. Linked Open Data in Spatial Data Infrastructures. Editors: (52°North), WWW document, https://wiki.52north.org/pub/Projects/GLUES/2012-09-10_LoD_SDI_White_Paper_MR_AB.pdf, Retrieved 19.10.2015.

Received: 2016-2-5
Accepted: 2016-4-11
Published Online: 2016-6-8
Published in Print: 2016-6-1

© 2016 A. Iwaniak et al., published by De Gruyter Open

This work is licensed under the Creative Commons Attribution-NonCommercial-NoDerivatives 3.0 License.

Downloaded on 2.3.2024 from https://www.degruyter.com/document/doi/10.1515/geo-2016-0020/html