Otakar Čerba and Karel Jedlička

Linked Forests: Semantic similarity of geographical concepts “forest”

De Gruyter | 2016


Linked Data represents a new trend in geoinformatics and geomatics. It produces a structure of objects (in the form of concepts or terms) interconnected by object relations that express semantic relationships between concepts. The research published in this article examines whether objects connected by such relations are more similar than objects that represent the same phenomenon but stand alone. The phenomenon “forest” and the relevant geographical concepts were chosen as the domain of the research. The similarity of concepts (the Tanimoto coefficient as a specification of the Tversky index) was computed on the basis of explicit information provided by the thesauri containing the particular concepts. Across seven thesauri (AGROVOC, EuroVoc, GEMET, LusTRE/EARTh, NAL, OECD and STW), it was tested whether the “forest” concepts interconnected by the relation skos:exactMatch are more similar than other, non-interlinked concepts. The results of the research are important for the sharing and combining of geographical data, information and knowledge. The proposed methodology can be reused for the comparison of other geographical concepts.

1 Introduction

Linked Data (detailed information in [1] or [2]) is a trend in the current world of information technologies, including geoinformatics and geomatics (several examples of implementations of Linked Data in the geographical and spatial data domain were published in [3–5]). The Linked Data approach enables us to publish various types of geographical and spatial data in a highly interoperable way. The best description of Linked Data is provided by the 5-star rating scheme of Linked Open Data [6]. This ranking describes Linked Data as data sets under an open license (*), available in a machine-readable (**), non-proprietary format (***), ideally in the RDF (Resource Description Framework) standard (****). The last level (*****) is essential. It can be expressed by the sentence “Link your data to other people’s data to provide context” [6]. This statement is a key prerequisite for the development of the Internet of Things, which is based on the vision that a combination of “captured data with data retrieved from other sources, e.g., with data that is contained in the Web, gives rise to new synergistic services that go beyond the services that can be provided by an isolated embedded system” [7].

The Linked Data approach has many benefits, but the key one is “the provision of integrated access to data from a wide range of distributed and heterogeneous data sources” [2], which is strongly related to links between data sets or objects. However, several authors (for example [8] or [9]) point out shortcomings of the Linked Data approach. For example, the article [9] mentions that the links between data sets are “too shallow to realize much of the benefits promised”.

The main principle of the Linked Data approach is the formation of links between particular data. These links can be based on various relations (such as topological connections or part-whole relationships), but the relation expressing equivalence is among the most frequently used links (as is evident from the Vocabulary of Interlinked Datasets of important Linked Data sources). The goal of this article is to check whether concepts that represent the same geographical phenomenon in various Linked Data resources and are connected by links are more similar than concepts that express the same phenomenon but stand alone. In other words, it asks whether the Linked Data approach represents a trustworthy and reliable system that always interconnects the relevant concepts. Beyond the Linked Data domain, the results and methodology of this study can contribute to other activities connected to data and its understanding, such as ontology alignment, data harmonization or general data interoperability. The geographical concept “forest” is used as the domain for testing the above-mentioned assumption. The semantics of this very common concept were studied and evaluated in previous works of Bennett [10], Helms [11] and Comber [12]. A definition and other information on the geographical concept “forest” is essential in dealing with tasks related to deforestation, landscape changes, protection of species, floods, production of oxygen etc.

The article is structured as follows. The section Related works and terminology focuses on geographical concepts, thesauri (as one of the semantic tools most frequently used by common users and as a tool applying the Linked Data approach), the SKOS standard (Simple Knowledge Organization System), which is used in thesauri, and the Linked Data approach. Then important publications dealing with geographical concepts, their specification, similarity and quality of links are introduced. The next section, Methods, introduces the principles of comparing “forest” concepts in the various thesauri selected for this research. The section Results shows the computed similarity of the compared concepts. The results, proposals for handling geographical concepts in thesauri and the continuation of the research are discussed and summarized in the last section, Discussion.

2 Related works and terminology

This section is an overview of key theoretical terms used in this article (Essential terms) and an introduction of important background and related studies (Related studies) concentrated on several quality aspects of Linked Data (primarily links interconnecting equivalent or very similar concepts) and investigation of similarity of concepts.

2.1 Essential terms

The following paragraphs focus on four crucial terms of this article: geographical concepts; thesauri, as one of the most frequently employed semantic tools for common users who are not usually experts in thesauri domain and therefore consider information provided by thesauri as reliable; the SKOS standard, used to keep the structure in thesauri; and the Linked Data approach.

The current world of geomatics, geoinformatics and other disciplines related to spatial data and information is connected to various tools and services dealing with the semantics of data (for example thesauri, ontologies or controlled vocabularies). Semantic information is meant to enable better understanding, sharing, integrating and combining of any geographical and spatial data. It also improves communication related to geographical phenomena and reduces misinterpretation of data and information. Tools such as thesauri or ontologies provide a set of concepts, including geographical concepts.

The term “geographical concepts” is mentioned in many publications focused on conceptual modelling of geographical information or geo-ontologies. In a majority of publications (e.g. [13–15]), the definitions of geographical concepts are quite vague. Geographical concepts are concepts with a relation to a location in a geographical space. The studies discuss above all the scope of geographical concepts. The scope can be very narrow, similar to gazetteers, including geographical objects such as cities or mountains, or it can be very broad, covering not only all locatable objects, but also concepts connected to geography and related fields (for example “volcano” or “ocean”).

The term “geographical concept” proceeds from the general word “concept” in the context of ontological and conceptual modelling. Both terms, “concept” and “geographical concept”, are described and analysed in detail in [14]. The term “concept” (or “conceptualization”) appeared in Gruber's fundamental definition of ontology in the sense of information sciences [16]. The term “concept” is described in publications [17–19]: “a concept can be anything about which something is said, and, therefore, could also be the description of a task, function, action, strategy, reasoning process, etc.”. A connection between concepts and semantics is mentioned in [20]: “a concept may be anything: an animal, a technique, and so on. Operationally, a concept is the set of all terms used in all languages to describe the same idea.” The authors are aware of many open questions and discussions on the correct definition and specification of the term “concept” (this is evident from a comparison of the above-mentioned articles and papers), but the scope of this document does not allow a broader presentation of these issues, which deserve separate research.

A thesaurus, as one of the most important semantic tools dealing with concepts, is defined as “a list of technical terms with relations among them, enabling generic retrieval of documents having different but related keywords” [21]. Holanda in [22] points out that “thesaurus is one, out of many, possible representation of term (or word) connectivity”. Another definition [23] specifies the types of relations in thesauri: “A thesaurus is mainly a controlled vocabulary - a domain-specific vocabulary, made up of terms not words that are linked to one another by cross-referencing”. Other definitions and the evolution of the thesaurus concept are described in [24]. “The structure of a thesaurus is generally defined a priori. A controlled set of words or expressions (terms) is organised in a known order and structure. The relationships between the terms (e.g. equivalence, homographic, hierarchical and associative) are displayed clearly and identified by standardised relationship indicators (e.g. BT broader term, NT narrower term and RT related term), which are employed reciprocally.” [23]. It is necessary to mention that modern thesauri, such as the examples used in this research (see the section Methods), constitute an integral component of the Linked Data cloud (see linkeddata.org), because the terms and concepts they contain are usually linked to other thesauri and semantic tools.

Simple Knowledge Organization System (SKOS) is a standard [25] provided by the World Wide Web Consortium (W3C) to support the Semantic Web and knowledge organization systems, including thesauri. SKOS is an application of the RDF standard, and SKOS data are commonly serialized in XML (Extensible Markup Language). SKOS “consists of a set of RDF properties and RDFS (RDF Schema) classes that can be used to express the content and structure of a concept scheme as an RDF graph.” [26]. According to [27], SKOS is based on “conceptual resources (concepts) which can be identified with URIs (Uniform Resource Identifier), labeled with lexical strings in one or more natural languages, documented with various types of notes, semantically related to each other in informal hierarchies and association networks and aggregated into concept schemes”. The key design principles of SKOS, including history, rationale, particular components, mapping, relations and formal semantics, are explained in [28].

The best description of the Linked Data approach, including concepts and several RDF-based standards such as SKOS, is provided by the 5-star rating scheme of Linked Open Data [6]. As mentioned in the Introduction, the necessary properties of such data can be summarized as machine-readable data under an open licence, stored in the RDF format. The most important property is a connection by links to external data. These links should interconnect concepts on the basis of relations that are defined in various standards, for example SKOS or the Web Ontology Language (OWL). Tim Berners-Lee [6] states that Linked Data rely on two main standards, URI and RDF. URI provides a mechanism of unique identifiers for each element. These identifiers are used to create links between data. The RDF standard is built on a triple data structure (subject - predicate - object), which enables us to describe all data and information in a universal way.
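To make the triple structure concrete, the following minimal Python sketch stores a few statements as (subject, predicate, object) tuples. It is an illustration only: the `example.org` URIs and the helper `objects_of` are hypothetical and do not come from any of the thesauri discussed here; only the SKOS namespace and property names are taken from the standard.

```python
# Minimal sketch: RDF-style triples held as plain Python tuples.
# All example.org URIs are hypothetical placeholders, not real thesaurus identifiers.
SKOS = "http://www.w3.org/2004/02/skos/core#"

triples = [
    ("http://example.org/thesaurusA/forest", SKOS + "prefLabel", "forest"),
    ("http://example.org/thesaurusA/forest", SKOS + "broader",
     "http://example.org/thesaurusA/vegetation"),
    ("http://example.org/thesaurusA/forest", SKOS + "exactMatch",
     "http://example.org/thesaurusB/forests"),
]


def objects_of(subject, predicate, graph):
    """Return all objects stated for the given subject and predicate."""
    return [o for s, p, o in graph if s == subject and p == predicate]


links = objects_of("http://example.org/thesaurusA/forest", SKOS + "exactMatch", triples)
print(links)
```

Because every element is identified by a URI, the object of the `exactMatch` triple can point into a different data set, which is exactly the kind of link examined in this article.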

2.2 Related Studies

Research focused on geographical concepts, their interconnection and similarity is very broad. As mentioned, this paper deals with the concept “forest”. The essential publication “What is a forest? On the vagueness of certain geographic concepts” [10] was already mentioned in the Introduction. That article focuses on semantic research and evaluation of geographical concepts representing the “forest” phenomenon. The ideas introduced in it are expanded in further articles and papers dealing with geographical concepts [12, 30], building geo-ontologies [31, 32] or implementing fuzzy logic approaches in the conceptualization process [33].

Since the emergence of the Linked Data approach and the Semantic Web, several studies have evaluated the quality of information provided by relations between concepts. The authors of the article Towards Linkset Quality for Complementing SKOS Thesauri [34] test relations in thesauri. Other studies [35–39] concentrate on the relation owl:sameAs as the key expression of equivalence between concepts in the language OWL. Ding in [36] proposes “a general strategy for integrating and fusing information from the URIs in an owl:sameAs network” based on various types of description.

In order to assess the quality of links between concepts on the basis of the explicit information provided, it is necessary to investigate the similarity of the compared concepts. There are various approaches to investigating similarity (e.g. [40–42]). For example, [41] presents three approaches: the feature-based model, semantic-network-based models (semantic distances) and information-content-based models. The article [40], dealing with the similarity of concepts in WordNet, also presents three types of methods for calculating semantic similarity (edge-based methods, information-based statistics methods, hybrid methods) as well as many references. The feature-based model is closely connected to Tversky's studies (e.g. [43–45]). The Tversky-based methods are also mentioned in other publications, for example [14, 46, 47].

Another approach to similarity measurement is based on Formal Concept Analysis (FCA) [46, 48–50]. This method is commonly used for comparing concepts within one ontological system. Therefore it is necessary to merge the concepts into one ontology first.

Figure 1 UML Activity Diagram depicting methodological steps.


3 Methods

The SKOS format uses the relation skos:exactMatch to mark equivalent or very similar concepts. It is defined as follows: “skos:exactMatch indicates a high degree of confidence that two concepts can be used interchangeably across a wide range of information retrieval applications” [25]. This description is vague, because it does not state what the “high degree of confidence” means and how it can be investigated, computed or compared. Therefore, the following methods and their implementation evaluate the statement that geographical concepts related to the “forest” phenomenon and interconnected by the skos:exactMatch relation are more similar than self-standing concepts.

This section is divided into three parts describing particular phases of the research:

  1. Selection of tested thesauri

  2. Extraction of information from thesauri

  3. Computation of similarity of the “forest” concepts in selected thesauri

The structure of this section is also depicted in detail in Figure 1.

3.1 Selection of tested thesauri

The tested thesauri were selected on the basis of several conditions, which were intended to eliminate unsuitable products. Because of the following criteria, thesauri such as TheSoz (Thesaurus Sozialwissenschaften), the Deutsche Nationalbibliothek thesaurus or RAMEAU (Répertoire d'autorité-matière encyclopédique et alphabétique unifié) of the Bibliothèque nationale de France were not included in the research. This also applies to general concept resources such as DBpedia, Wikidata or WordNet. The selection criteria include:

  • Respected and well-known tools developed for a long time.

  • Containing a large number of concepts.

  • Tools generally focused on scientific disciplines related to geography.

  • The thesauri contain the “forest” concept (and its description and relations) in English (to eliminate the risk of wrong translation).

  • They are maintained by a respected organization, company, consortium or in case of community administration they have many real users.

  • The tools providing information under an open or free license were preferred.

The research was realized with the use of the following selected thesauri (in alphabetical order):

  • AGROVOC (the acronym AV is used in the following text and tables),

  • EuroVoc (EV),

  • General Multilingual Environmental Thesaurus (GE),

  • Linked Thesaurus fRamework for Environment / Environmental Applications Reference Thesaurus (LE),

  • The National Agricultural Library's Agricultural Thesaurus (NA),

  • OECD (Organisation for Economic Cooperation and Development) Macrothesaurus (OE),

  • STW (Standard Thesaurus Wirtschaft) Thesaurus for Economics (ST).

All selected thesauri contain a concept related to the “forest” phenomenon. These concepts are labelled as 'forest' (GE and LE), 'Forest' (ST) and 'forests' (AV, EV, NA and OE). All these forms of the noun were taken as equivalent.

Extraction of Information

Figure 2 Thesauri interconnected on the basis of the “forest” concept (the connecting arrows express the skos:exactMatch relation).


The next step consists of collecting all explicit information provided by the thesauri. Thesauri usually provide four kinds of information on (not only geographical) concepts: explicit descriptions, annotations or definitions of concepts; information following from the implemented hierarchy; other relations; and links to external resources. Three main semantic relations in the SKOS standard [24, 25] and in thesauri structure the system of concepts: skos:broader (BT) and skos:narrower (NT) define the hierarchy between two concepts, while the property skos:related (RT) asserts an associative link between two SKOS concepts. The definitions, descriptions and particular subjects of the above-mentioned relations are extracted into a word list (according to [12, 51, 52]). This step limits the impact of human interpretation of the information. After the concepts were extracted from each thesaurus, several changes, for example transformation to the singular or conversion to lower case, were made to obtain a uniform set of terms. The word lists were written down as an XML file that was processed with XSLT (Extensible Stylesheet Language - Transformation) to compute the similarity of concepts (see the following section).
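The normalization step described above can be sketched in a few lines of Python. This is not the authors' XSLT implementation: the stop-word list, the function names and the deliberately naive singularization rule are assumptions made for illustration.

```python
import re

# Hypothetical stop-word list; the article does not publish the one it used.
STOP_WORDS = {"a", "an", "the", "of", "and", "or", "in", "with", "by", "to"}


def singularize(word):
    """Very naive singular form: handles '-ies' and a trailing '-s' only."""
    if word.endswith("ies") and len(word) > 4:
        return word[:-3] + "y"
    if word.endswith("s") and not word.endswith("ss") and len(word) > 3:
        return word[:-1]
    return word


def to_word_set(text):
    """Lower-case the text, tokenize it, drop stop words, reduce to singular."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return {singularize(t) for t in tokens if t not in STOP_WORDS}


print(sorted(to_word_set("Forests and the Forest trees")))
```

A production version would need a proper lemmatizer; the point here is only that 'forest', 'Forest' and 'forests' collapse to one term before any comparison takes place.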

The next piece of information extracted from the source data consists of the skos:exactMatch relations that interconnect the “forest” concepts in the various thesauri. Figure 2 shows how the selected thesauri are interconnected in the case of the studied concept.

3.2 Computation of similarity

The similarity (Tables 2–5) was computed for the four main types of relations provided by the thesauri and mentioned in the previous section. A similar approach, computing the similarity of various types of information separately, was also used in [53, 54], where four types of similarity (syntactic, property, neighbourhood and context) are recognized.

The total similarity computed in our research was obtained as the average of the particular values (Table 6). The similarity was computed according to Tversky (Equation 1; the principles are described in [44], the formula was published in [43]). A similar approach was used, for example, in [14, 46, 47].

\[
\mathrm{Sim}(X, Y) = \frac{|X \cap Y|}{|X \cap Y| + \alpha\,|X - Y| + \beta\,|Y - X|} \tag{1}
\]

The symbols X and Y in Equation 1 represent the input sets (in this case, the word lists created by decomposing the information provided by the thesauri). The parameters α and β were both set to 1 (the Tanimoto coefficient as a specification of the Tversky index). Other coefficients such as Dice's coefficient (Table 1) were tested, but the results were similar. A comparison of Table 1 and Table 4 shows the same distribution of maxima and minima (local as well as absolute), and the differences between corresponding values in both tables (the course of the function) are also very similar. The correlation between the two tables (Table 1 and Table 4) equals 0.988. Apart from the similar character of the outputs, an important reason for selecting the Tversky index is that it is in general asymmetric (whenever α ≠ β), unlike Dice's coefficient. It can therefore accommodate a possible extension of the research to relations that are not symmetric.
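The following Python sketch shows one way to compute the index of Equation 1. The function name, the two word sets and the convention that two empty sets have similarity 0 are assumptions made for illustration; they are not taken from the authors' XSLT template.

```python
def tversky(x, y, alpha=1.0, beta=1.0):
    """Tversky index of two sets; alpha = beta = 1 gives the Tanimoto coefficient."""
    x, y = set(x), set(y)
    common = len(x & y)
    denominator = common + alpha * len(x - y) + beta * len(y - x)
    # Assumed convention: two empty sets have similarity 0.
    return common / denominator if denominator else 0.0


# Hypothetical word lists, not taken from any thesaurus.
a = {"forest", "tree", "wood", "land"}
b = {"forest", "tree", "plantation"}

tanimoto = tversky(a, b)                    # alpha = beta = 1
dice = tversky(a, b, alpha=0.5, beta=0.5)   # Dice's coefficient
print(tanimoto, dice)

# With unequal parameters the index depends on the order of its arguments.
print(tversky(a, b, alpha=1.0, beta=0.0), tversky(b, a, alpha=1.0, beta=0.0))
```

For α = β the index is symmetric; asymmetry appears only when the two parameters differ. Dice's coefficient D and the Tanimoto coefficient T computed on the same sets are linked by D = 2T / (1 + T), so the two measures always rank pairs in the same order.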

The similarity was computed with the use of an XSLT template developed by the first author. The template transforms the input data file containing all explicit information separated into the word list into an HTML (HyperText Markup Language) file. This HTML file contains the tables with particular similarities.
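The authors' template is written in XSLT and is not reproduced in the article. As a rough Python analogue, the sketch below computes a pairwise similarity matrix from word lists and prints it as a plain-text table. The word lists are hypothetical and the function names are illustrative.

```python
def tanimoto(x, y):
    """Tanimoto coefficient (Tversky index with alpha = beta = 1)."""
    union = len(x | y)
    return len(x & y) / union if union else 0.0


# Hypothetical word lists keyed by thesaurus acronym (illustration only).
word_lists = {
    "AV": {"forest", "tree", "land"},
    "GE": {"forest", "tree", "vegetation"},
    "ST": {"forestry", "economy"},
}


def similarity_matrix(lists):
    """Return {row: {column: similarity}} for every pair of word lists."""
    return {r: {c: tanimoto(lists[r], lists[c]) for c in lists} for r in lists}


matrix = similarity_matrix(word_lists)
names = list(word_lists)
print("   " + " ".join(f"{n:>4}" for n in names))
for r in names:
    print(f"{r:<3}" + " ".join(f"{matrix[r][c]:4.2f}" for c in names))
```

The resulting matrix has the same shape as the tables in the Results section: ones on the diagonal and symmetric off-diagonal values.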

Table 1

Similarity based on NT relation (narrower term), computed with Dice's coefficient.

     AV    EV    GE    LE    NA    OE    ST
AV  1.00  0.17  0.30  0.21  0.32  0.13  0.14
EV  0.17  1.00  0.32  0.37  0.21  0.15  0.00
GE  0.30  0.32  1.00  0.60  0.31  0.12  0.13
LE  0.21  0.37  0.60  1.00  0.29  0.11  0.12
NA  0.32  0.21  0.31  0.29  1.00  0.10  0.11
OE  0.13  0.15  0.12  0.11  0.10  1.00  0.00
ST  0.14  0.00  0.13  0.12  0.11  0.00  1.00

4 Results

The comparison of similarity of the “forest” concepts defined in the above-mentioned thesauri is summarized in the following tables, each showing a particular aspect of similarity. Rows and columns of the tables represent the “forest” concept in the individual thesauri. The values (between 0 and 1) show the similarity between the particular concepts in the thesauri. Each concept compared with itself (on the top-left to bottom-right diagonal) naturally shows the maximum similarity (value 1). The similarity is expressed by the Tversky index (Equation 1 in the section Methods).

Table 2 shows the similarity of definitions (or descriptions) of the concepts. It reveals one of the main problems of thesauri: missing explicit descriptions in the form of definitions or other texts. Only three thesauri (AGROVOC, GEMET and LusTRE/EARTh) contain a detailed specification of the “forest” concept. It is evident that GEMET and LusTRE/EARTh use the same definition (adopted from [55]). The similarity based on definitions is computed without stop words (words carrying only syntactic information). All forms of words with the same meaning were taken as equivalent terms (this rule was kept in the other analyses as well).

Table 2

Similarity of definitions and other forms of description (Tversky index).

     AV    EV    GE    LE    NA    OE    ST
AV  1.00  0.00  0.06  0.06  0.00  0.00  0.00
EV  0.00  1.00  0.00  0.00  0.00  0.00  0.00
GE  0.06  0.00  1.00  1.00  0.00  0.00  0.00
LE  0.06  0.00  1.00  1.00  0.00  0.00  0.00
NA  0.00  0.00  0.00  0.00  1.00  0.00  0.00
OE  0.00  0.00  0.00  0.00  0.00  1.00  0.00
ST  0.00  0.00  0.00  0.00  0.00  0.00  1.00

The following tables express the similarity of object relations, which are typical for thesauri based on the SKOS standard: broader terms (Table 3), narrower terms (Table 4) and related terms (Table 5).

Table 3

Similarity based on BT relation (broader term) and computation of Tversky index.

     AV    EV    GE    LE    NA    OE    ST
AV  1.00  0.00  0.00  0.00  0.11  0.00  0.20
EV  0.00  1.00  0.00  0.00  0.00  0.00  0.00
GE  0.00  0.00  1.00  0.00  0.00  0.00  0.00
LE  0.00  0.00  0.00  1.00  0.00  0.00  0.00
NA  0.11  0.00  0.00  0.00  1.00  0.00  0.33
OE  0.00  0.00  0.00  0.00  0.00  1.00  0.00
ST  0.20  0.00  0.00  0.00  0.33  0.00  1.00

Table 4

Similarity based on NT relation (narrower term) and computation of Tversky index.

     AV    EV    GE    LE    NA    OE    ST
AV  1.00  0.09  0.17  0.12  0.19  0.07  0.08
EV  0.09  1.00  0.19  0.23  0.12  0.08  0.00
GE  0.17  0.19  1.00  0.43  0.19  0.07  0.07
LE  0.12  0.23  0.43  1.00  0.17  0.06  0.06
NA  0.19  0.12  0.19  0.17  1.00  0.05  0.06
OE  0.07  0.08  0.07  0.06  0.05  1.00  0.00
ST  0.08  0.00  0.07  0.06  0.06  0.00  1.00

Table 5

Similarity based on RT relation (related term) and computation of Tversky index.

     AV    EV    GE    LE    NA    OE    ST
AV  1.00  0.00  0.00  0.00  0.00  0.00  0.00
EV  0.00  1.00  0.00  0.00  0.00  0.00  0.00
GE  0.00  0.00  1.00  0.00  0.12  0.20  0.00
LE  0.00  0.00  0.00  1.00  0.00  0.10  0.00
NA  0.00  0.00  0.12  0.00  1.00  0.08  0.00
OE  0.00  0.00  0.20  0.10  0.08  1.00  0.00
ST  0.00  0.00  0.00  0.00  0.00  0.00  1.00

In order to summarize the similarity of the particular concepts, average values from the previous tables (Tables 2–5) were calculated (Table 6). The authors tested various weights for the aspects of similarity, but in the end all weights were set equal (to the value 1): explicit descriptions or definitions are the most important for human understanding of a concept, whereas the standardized and formalized object relations can be processed automatically.
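As a worked check of the averaging, the following sketch recomputes the total similarity of the AV–ST pair from its four aspect values in Tables 2–5 (0.00, 0.20, 0.08 and 0.00). The function name `total_similarity` and the dictionary keys are illustrative assumptions; the equal weights follow the text.

```python
def total_similarity(aspects, weights=None):
    """Weighted mean of the per-aspect similarities (equal weights by default)."""
    weights = weights or {name: 1.0 for name in aspects}
    weighted_sum = sum(aspects[name] * weights[name] for name in aspects)
    return weighted_sum / sum(weights[name] for name in aspects)


# AV-ST values read from Tables 2-5 (definition, BT, NT, RT).
av_st = {"definition": 0.00, "broader": 0.20, "narrower": 0.08, "related": 0.00}
total = total_similarity(av_st)
print(f"{total:.2f}")
```

The result, 0.07, agrees with the AV–ST entry of Table 6. Passing a different `weights` dictionary shows how the total would shift if one aspect, for example the definitions, were given more influence.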

Table 6

Total similarity of the “forest” concepts in the selected thesauri (based on Tversky index).

     AV    EV    GE    LE    NA    OE    ST
AV  1.00  0.02  0.06  0.05  0.08  0.02  0.07
EV  0.02  1.00  0.05  0.06  0.03  0.02  0.00
GE  0.06  0.05  1.00  0.38  0.08  0.07  0.02
LE  0.05  0.06  0.38  1.00  0.04  0.04  0.02
NA  0.08  0.03  0.08  0.04  1.00  0.03  0.10
OE  0.02  0.02  0.07  0.04  0.03  1.00  0.00
ST  0.07  0.00  0.02  0.02  0.10  0.00  1.00

Table 6 contains three types of extreme values:

  1. Similarity of the same concepts (the top-left to bottom-right diagonal).

  2. Similarity of the “forest” concepts in GEMET and LusTRE/EARTh. The value 0.38 is the highest of all computed similarities. The reason is that both thesauri use the same definition of the “forest” concept. The example of GEMET and LusTRE/EARTh illustrates another problem of the skos:exactMatch relation and its implementation. While the “forest” concept in LusTRE/EARTh is connected to the concept with the same name in GEMET, there is no inverse relation.

  3. Entirely different concepts with a similarity value of 0 include the following pairs: STW - EuroVoc and STW - OECD (neither of these pairs is interconnected by the skos:exactMatch relation).
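SKOS declares skos:exactMatch a symmetric property, so a link stated in one direction only, as in the second item above, can be detected mechanically. The sketch below is an illustration under stated assumptions: the function name is hypothetical, the LE to GE link is the one described in the text, and the X and Y entries are placeholders for a reciprocal pair.

```python
def missing_inverses(links):
    """Return links (source, target) whose inverse (target, source) is absent."""
    link_set = set(links)
    return sorted((s, t) for s, t in link_set if (t, s) not in link_set)


# LE -> GE is the one-directional link described in the text;
# X <-> Y is a hypothetical reciprocal pair added for contrast.
exact_match = [("LE", "GE"), ("X", "Y"), ("Y", "X")]
print(missing_inverses(exact_match))
```

Such a check could be run over the explicitly published links of a thesaurus before they are used for comparison or data integration.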

If the extreme value (0.38) is eliminated, the set of similarity values is quite homogeneous. Figure 3 compares two histograms of similarity values. White columns show the absolute number of similarity values between non-interlinked concepts falling into each interval of similarity (the range of similarity is limited by the values in Table 6). Grey columns represent the similarity values of concepts interconnected by the skos:exactMatch relation.
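The binning behind such a histogram can be sketched as follows, using the off-diagonal values of Table 6. The bin width of 0.01 is an assumption, and the split into interlinked and non-interlinked pairs is left out because it depends on the links shown in Figure 2.

```python
from collections import Counter

# Off-diagonal values of Table 6 (upper triangle), keyed by thesaurus pair.
table6 = {
    ("AV", "EV"): 0.02, ("AV", "GE"): 0.06, ("AV", "LE"): 0.05,
    ("AV", "NA"): 0.08, ("AV", "OE"): 0.02, ("AV", "ST"): 0.07,
    ("EV", "GE"): 0.05, ("EV", "LE"): 0.06, ("EV", "NA"): 0.03,
    ("EV", "OE"): 0.02, ("EV", "ST"): 0.00,
    ("GE", "LE"): 0.38, ("GE", "NA"): 0.08, ("GE", "OE"): 0.07,
    ("GE", "ST"): 0.02,
    ("LE", "NA"): 0.04, ("LE", "OE"): 0.04, ("LE", "ST"): 0.02,
    ("NA", "OE"): 0.03, ("NA", "ST"): 0.10,
    ("OE", "ST"): 0.00,
}


def histogram(values, extreme=0.38):
    """Count values per 0.01-wide bin (keyed in hundredths), skipping the extreme."""
    return Counter(round(v * 100) for v in values if v != extreme)


bins = histogram(table6.values())
for hundredths in sorted(bins):
    print(f"{hundredths / 100:.2f}: {'#' * bins[hundredths]}")
```

Adding a set of interlinked pairs as a second argument and counting the two groups separately would reproduce the white and grey columns of the figure.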

Figure 3 Histogram of similarity values.


The following scheme (Fig. 4) also presents the results of the comparison. Two thesauri are interconnected if the total similarity (Table 6) of their “forest” concepts is higher than 0. Therefore two pairs (EV-ST and OE-ST) are missing, as is the extreme value (0.38). Black lines connect concepts interlinked by the skos:exactMatch relation, while silver lines are used for concepts that are not interconnected. The width of the lines represents the similarity values according to Table 6. The values are divided into equal intervals according to Fig. 3. The line width ranges from 0 for the lowest interval to 9 pixels for the interval (0.09; 0.10).

Both outputs (Fig. 3 and Fig. 4) lead to the same result: there is no direct relation between the interconnection of concepts and the value of similarity. This result is supported by the average similarity (after eliminating the extreme values, which fall into both types of concepts) for interconnected (0.05) and non-interconnected concepts (0.04).

5 Discussion

From the results presented in the previous section, it is evident that in the case of the “forest” concepts and the selected thesauri, the concepts interconnected by the skos:exactMatch relation are not considerably more similar than other concepts. This statement is based on the following facts:

Figure 4 Similarity comparison scheme.


  • Random or non-ordered occurrence of interconnected and non-interconnected concepts in the histogram (Figure 3). It is not possible to say that the number of interconnected concepts tends to either side of the graph.

  • The average similarity of interconnected concepts is higher (0.085, in contrast to 0.049 for non-interconnected concepts). But if the extreme values are removed (to make the data set more homogeneous and not influenced by one very different value), the average similarity of non-interconnected concepts is even higher than that of the interconnected concepts (0.056 compared to 0.051).

Regarding the results of this research, the authors claim that the construction of relations expressing “a high degree of confidence” [25] does not follow the explicit semantic information provided by thesauri and other semantic tools. The highest value of similarity is 0.358. It is too low (the maximum similarity is 1) to bear out the statement mentioned in the Introduction: concepts representing the same phenomenon from various resources and connected with links are more similar than concepts expressing the same phenomenon but standing alone. This fact is emphasized by the average value of similarity, which is also very low (about 0.05).

It seems that the semantic relations between the concepts are probably created on the basis of implicit semantics: the subjective view of the authors, editors or managers of the thesauri, their experience with other semantic tools, and similarity based on the name of a concept. Implicit semantics is not shareable in a wide or global community. Moreover, implicit semantic information cannot be processed by machines. Therefore, its use cannot efficiently support interoperability and sharing of knowledge.

Other reasons for the low similarity of the “forest” concepts are partially mentioned in the article [10], which describes several premises of the geographical domain that are sources of vagueness. Like the mountain, marsh or thicket concepts (mentioned in [10]), the “forest” concept does not have “a precise, universally acknowledged definitions” [10].

The low similarity values are also caused by the use of vague terms in the definition and the very generic description of the studied relation (skos:exactMatch), which contains the very general and non-specific phrase “a high degree of confidence” [25]. According to [10], “'High' and 'dense' are adjectives, which give some indication of physical properties of a feature but do not specify any definite measurable requirement. 'Very' accentuates vague adjectives but does not make them any more definite.”

Both mentioned cases of vagueness represent a combination of conceptual and sorites vagueness (mentioned in [10]). The conceptual vagueness (closely connected to ambiguity) consists in inadequate explicit definitions and descriptions (for example very poor or missing characterization of the “forest” concept in several thesauri, see Table 2). The sorites vagueness (based on the Sorites paradox) concerns various and very subjective viewing of several properties (for example “high degree”).

The research introduced in this article has not been completed. The further steps of the research of semantic similarity of geographical concepts in semantic tools will be divided into four main parts:

  1. Improving the methods of similarity investigation and computation. For example, in the research published in this paper, missing relations receive the same value (0) as existing relations that share no similarity. The other approaches to similarity computation mentioned above will also be studied in more detail.

  2. The set of tested geographical concepts has to be extended (to general land cover and land use concepts, because the publications [56, 57] report important heterogeneities), and new resources of concepts will be added.

  3. Relations and similarity can be described by an approach based on multi-valued logic.

  4. As a final result of the long-term research, recommendations for building semantic relations with a focus on context and explicit specification will be published.

The goal of this article was to verify whether selected geographical concepts that represent the same geographical phenomenon in various resources and are interconnected by a relation expressing very high affinity are really more similar than concepts standing alone. The geographical concept “forest” was used as the domain for testing, because the “forest” phenomenon and the concepts representing it are essential in dealing with tasks related to deforestation, landscape changes, protection of species, floods, production of oxygen, tourism, forestry etc. The set of studied semantic tools was narrowed down to include only relevant thesauri (AGROVOC, EuroVoc, GEMET, LusTRE/EARTh, NAL, OECD Macrothesaurus and STW Thesaurus for Economics) containing geographical concepts. Finally, the skos:exactMatch relation, which means high affinity of the interconnected concepts, was chosen, because the SKOS format is typically used in thesauri. The methodology, demonstrated here on the “forest” concept, can easily be applied to a broader set of concepts.

This proximity of concepts was evaluated by computing the similarity of each type of explicit information provided by the thesauri. These types of information included definitions, descriptions or annotations, hierarchical relations (broader and narrower terms) and the semantic relation (related term). The content of the subjects of these relations was decomposed into individual words (usually nouns), and the similarity was computed on the basis of Tversky's approach. The total similarity was obtained as the average of four partial similarity values based on the various types of relations.
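The computation summarized above can be sketched in Python as follows. The sketch assumes that the four partial values correspond to the definition (or description), the broader terms, the narrower terms and the related terms, and it uses the Tanimoto coefficient as the Tversky index with both weights equal to 1. The record layout, the function names and the sample texts are hypothetical and are not taken from the studied thesauri; every word is used here, whereas the research worked mostly with nouns.

```python
import re

KINDS = ("definition", "broader", "narrower", "related")


def words(text):
    """Decompose a piece of thesaurus text into a set of lower-cased words."""
    return set(re.findall(r"[a-z]+", text.lower()))


def tversky(a, b, alpha, beta):
    """Tversky index of two feature sets (0.0 if both sets are empty)."""
    common = len(a & b)
    denom = common + alpha * len(a - b) + beta * len(b - a)
    return common / denom if denom else 0.0


def tanimoto(a, b):
    """Tanimoto coefficient: the Tversky index with alpha = beta = 1."""
    return tversky(a, b, 1.0, 1.0)


def total_similarity(c1, c2):
    """Average of the four partial similarities; missing text counts as 0."""
    parts = [tanimoto(words(c1.get(k, "")), words(c2.get(k, ""))) for k in KINDS]
    return sum(parts) / len(parts)


# Hypothetical concept records, for illustration only.
concept_a = {
    "definition": "land covered with trees",
    "broader": "vegetation",
    "narrower": "tropical forest",
    "related": "forestry",
}
concept_b = {
    "definition": "area covered with trees and shrubs",
    "broader": "vegetation",
    "narrower": "boreal forest",
    "related": "wood",
}

similarity = total_similarity(concept_a, concept_b)
```

With alpha = beta = 1 the denominator equals the size of the union of both word sets, so each partial value lies between 0 (no common word) and 1 (identical word sets).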

The results show that, in the case of the “forest” concepts and the selected thesauri, the concepts interconnected by the skos:exactMatch relation are not considerably more similar than other concepts. This is evident from the low correlation between the “is interconnected” property and the similarity of the concepts, from the histogram of similarity (Figure 3) and from the average similarity of the interconnected and non-interconnected concepts. On the basis of these results it is possible to claim that the construction of skos:exactMatch relations does not follow the explicit semantic information provided by the thesauri.
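The two numerical indicators named above can be sketched in Python as follows. The sketch assumes that the correlation is a Pearson coefficient between a 0/1 “is interconnected” flag and the similarity value (the point-biserial case); the exact coefficient is not specified in this section. The listed pairs are hypothetical placeholders, not the results of this research.

```python
from math import sqrt


def mean(values):
    """Arithmetic mean of a non-empty sequence."""
    return sum(values) / len(values)


def pearson(xs, ys):
    """Pearson correlation; with a 0/1 variable it is the point-biserial coefficient.

    Assumes that neither sequence is constant (non-zero variance).
    """
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Hypothetical placeholder data: (is_interconnected, similarity) per concept pair.
pairs = [(1, 0.30), (1, 0.10), (0, 0.20), (0, 0.25), (1, 0.15), (0, 0.05)]

flags = [flag for flag, _ in pairs]
sims = [sim for _, sim in pairs]

mean_linked = mean([sim for flag, sim in pairs if flag])
mean_unlinked = mean([sim for flag, sim in pairs if not flag])
correlation = pearson(flags, sims)
```

A correlation close to 0 together with similar group averages would correspond to the situation described in the results: being interconnected says little about how similar two concepts are.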

The results of this research can be used for the further development of the studied thesauri, because a thesaurus should not be a definitive solution but a living system that absorbs new data, information and knowledge. Improvements of thesauri can consist in the completion of inverse relations or in the extension, harmonization and standardization of the explicit description and specification of concepts.

Regardless of the results of our research, Linked Data are a very important component of the contemporary world of information technologies. Linked Data enable us to interconnect stand-alone and isolated data resources and objects. Since the links connect not only data objects with each other, but also data objects with relevant items in vocabularies, Linked Data could contribute to better understanding and sharing of data. But there is a crucial question: are the particular components of Linked Data (primarily the links) really reliable? In the context of this article the question can be narrowed down: are concepts connected by the skos:exactMatch relation much more similar than other concepts, and is this similarity really high? The research published in this paper shows that the answer to both questions is negative (at least in the case of the studied thesauri, concepts and relations).

These results do not criticize the Linked Data approach and its implementation in the geographical domain. They point out that Linked Data need clear and understandable descriptions with a minimum of vague terms. These descriptions should be based on respected publications, standards and norms. They should follow a general consensus and offer alternatives, but only with a detailed explanation of the meaning and usage of such alternatives. Also, hierarchical and semantic relations have to be constructed on the basis of detailed external information and expert knowledge. The cooperation of semantic engineers and geographers (and other domain experts) is crucial. It is also necessary to emphasize the key role of explicit and formal semantics and of a uniform approach to the development of interconnections of geographical concepts. These recommendations could contribute to a better use of the amazing potential of Linked Data in the geographical domain. The authors are aware that completely eliminating vagueness from geographical concepts in Linked Data is an unrealistic expectation. But any particular improvement that provides less vague information supports interoperability, higher-quality communication and information transfer. These small steps focused on semantics are a very important part of the never-ending effort towards the Semantic Web.


This publication was supported by the project LO1506 of the Czech Ministry of Education, Youth and Sports.


[1] Bizer, C., Heath, T., Idehen, K., Berners-Lee, T. Linked data on the web (LDOW2008). In Proceedings of the 17th international conference on World Wide Web, 2008, pp. 1265–1266.

[2] Bizer, C., Heath, T., Berners-Lee, T. Linked data - the story so far. International journal on semantic web and information systems, 2009, 5(3), 1–22.

[3] Goodwin, J., Dolbear, C., Hart, G. Geographical linked data: The administrative geography of Great Britain on the semantic web. Transactions in GIS, 2008, 12(s1), 19–30.

[4] Stadler, C., Lehmann, J., Hoffner, K., Auer, S. Linkedgeodata: A core for a web of spatial open data. Semantic Web, 2012, 3(4), 333–354.

[5] Kritikos, K., Rousakis, Y., Kotzinos, D. Linked open GeoData management in the cloud. In Proceedings of the 2nd International Workshop on Open Data, 2013, p. 3. ACM.

[6] Berners-Lee, T. Design issues: Linked data. World Wide Web Consortium, 2006.

[7] Kopetz, H. Internet of things. In Real-Time Systems, 2011, pp. 307–323. Springer US.

[8] Bechhofer, S., Buchan, I., De Roure, D., Missier, P., Ainsworth, J., Bhagat, J. et al. Why linked data is not enough for scientists. Future Generation Computer Systems, 2013, 29(2), 599–611.

[9] Jain, P., Hitzler, P., Yeh, P. Z., Verma, K., Sheth, A. P. Linked Data Is Merely More Data. In AAAI Spring Symposium: linked data meets artificial intelligence, 2010.

[10] Bennett, B. What is a forest? On the vagueness of certain geographic concepts. Topoi, 2001, 20(2), 189–201.

[11] Helms, J. A. Forest, forestry, forester: What do these terms mean? Journal of Forestry, 2002, 100(8), 15–19.

[12] Comber, A. J., Wadsworth, R. A., Fisher, P. F. Using semantics to clarify the conceptual confusion between land cover and land use: the example of forest. Journal of Land Use Science, 2008, 3(2–3), 185–198.

[13] Schwering, A., Raubal, M. Spatial relations for semantic similarity measurement. In Perspectives in conceptual modeling, 2005, pp. 259–269. Springer Berlin Heidelberg.

[14] Kavouras, M., Kokla, M. Theories of geographic concepts: ontological approaches to semantic integration. CRC Press, 2007.

[15] Haav, H. M., Kaljuvee, A., Luts, M., Vajakas, T. Ontology-Based Retrieval of Spatially Related Objects for Location Based Services. In On the Move to Meaningful Internet Systems: OTM, 2009, pp. 1010–1024. Springer Berlin Heidelberg.

[16] Gruber, T. R. A translation approach to portable ontology specifications. Knowledge acquisition, 1993, 5(2), 199–220.

[17] Gomez-Perez, A., Benjamins, R. Overview of knowledge sharing and reuse components: Ontologies and problem-solving methods. IJCAI and the Scandinavian AI Societies. CEUR Workshop Proceedings, 1999.

[18] Corcho, O., Gomez-Perez, A. A roadmap to ontology specification languages. In Knowledge Engineering and Knowledge Management Methods, Models, and Tools, 2000, pp. 80–96. Springer Berlin Heidelberg.

[19] Gomez-Perez, A., Corcho, O. Ontology languages for the semantic web. Intelligent Systems, IEEE, 2002, 17(1), 54–60.

[20] Caracciolo, C. AGROVOC model description and analysis. With suggestion for improvements. (FAO internal document), 2013.

[21] Miyamoto, S., Miyake, T., Nakayama, K. Generation of a pseudothesaurus for information retrieval based on cooccurrences and fuzzy set operations. Systems, Man and Cybernetics, IEEE Transactions on GIS, 1993, (1), 62–70.

[22] Holanda, A., Torres Pisa, I., Kinouchi, O., Souto Martinez, A., Seron Ruiz, E. Thesaurus as a complex network. Physica A: Statistical Mechanics and its Applications, 2004, 344(3), 530–536.

[23] Severino, F. The term development in the thesauri of international organisations. The European Journal of Development Research, 2007, 19(2), 327–351.

[24] Pastor-Sanchez, J. A., Martinez Mendez, F. J., Rodriguez-Muñoz, J. V. Advantages of Thesaurus Representation Using the Simple Knowledge Organization System (SKOS) Compared with Proposed Alternatives. Information Research: An International Electronic Journal, 2009, 14(4), n4.

[25] Miles, A., Bechhofer, S. SKOS Simple Knowledge Organization System Reference. W3C Recommendation, 2009.

[26] Miles, A., Matthews, B., Wilson, M., Brickley, D. SKOS core: simple knowledge organisation for the web. In International Conference on Dublin Core and Metadata Applications, 2005, pp. 3.

[27] Isaac, A., Summers, E. SKOS simple knowledge organization system primer. W3C Working Group Note, 2008.

[28] Baker, T., Bechhofer, S., Isaac, A., Miles, A., Schreiber, G., Summers, E. Key choices in the design of Simple Knowledge Organization System (SKOS). Web Semantics: Science, Services and Agents on the World Wide Web, 2013, 20, 35–49.

[29] Van Assem, M., Malaisé, V., Miles, A., Schreiber, G. A method to convert thesauri to SKOS. Springer Berlin Heidelberg, 2005, pp. 95–109.

[30] Bennett, B., Mallenby, D., Third, A. An Ontology for Grounding Vague Geographic Terms. In FOIS, 2008, Vol. 183, pp. 280–293.

[31] Tomai, E., Kavouras, M. From onto-geonoesis to onto-genesis: The design of geographic ontologies. Geoinformatica, 2004, 8(3), 285–302.

[32] Mark, D., Smith, B., Egenhofer, M., Hirtle, S. Ontological foundations for geographic information science. Research Challenges in Geographic Information Science, 2004, 335–350.

[33] Fisher, P., Cheng, T., Wood, J. Higher order vagueness in geographical information: empirical geographical population of type n fuzzy sets. Geoinformatica, 2007, 11(3), 311–330.

[34] Albertoni, R., De Martino, M., Podesta, P. Towards Linkset Quality for Complementing SKOS Thesauri. 2014.

[35] Hogan, A., Polleres, A., Umbrich, J., Zimmermann, A. Some entities are more equal than others: statistical methods to consolidate Linked Data. In 4th International Workshop on New Forms of Reasoning for the Semantic Web: Scalable and Dynamic (NeFoRS2010), 2010.

[36] Ding, L., Shinavier, J., Finin, T., McGuinness, D. L. owl:sameAs and Linked Data: An empirical study. WebSci10: Extending the Frontiers of Society On-Line, 2010.

[37] Ding, L., Shinavier, J., Shangguan, Z., McGuinness, D. L. SameAs networks and beyond: analyzing deployment status and implications of owl:sameAs in linked data. In The Semantic Web – ISWC, 2010, pp. 145–160. Springer Berlin Heidelberg.

[38] Halpin, H., Hayes, P. J. When owl:sameAs isn't the Same: An Analysis of Identity Links on the Semantic Web. In LDOW, 2010.

[39] Hogan, A., Umbrich, J., Harth, A., Cyganiak, R., Polleres, A., Decker, S. An empirical survey of linked data conformance. Web Semantics: Science, Services and Agents on the World Wide Web, 2012, 14, 14–44.

[40] Lin, F., Sandkuhl, K. A survey of exploiting wordnet in ontology matching. In Artificial Intelligence in Theory and Practice II, 2008, pp. 341–350. Springer US.

[41] Borgida, A., Walsh, T. J., Hirsh, H. Towards Measuring Similarity in Description Logics. Description Logics, 2005, 147.

[42] Formica, A. Concept similarity by evaluating information contents and feature vectors: a combined approach. Communications of the ACM, 2009, 52(3), 145–149.

[43] Tversky, A. Features of similarity. Psychological Review, 1977, 84(4), 327–352.

[44] Tversky, A., Gati, I. Studies of similarity. Cognition and categorization, 1978, 1, pp. 79–98.

[45] Tversky, A., Gati, I. Similarity, separability, and the triangle inequality. Psychological Review, 1982, 89(2), 123.

[46] Wang, L., Liu, X. A new model of evaluating concept similarity. Knowledge-Based Systems, 2008, 21(8), 842–846.

[47] d'Amato, C., Staab, S., Fanizzi, N. On the influence of description logics ontologies on conceptual similarity. In Knowledge Engineering: Practice and Patterns, 2008, pp. 48–63. Springer Berlin Heidelberg.

[48] Formica, A. Ontology-based concept similarity in formal concept analysis. Information Sciences, 2006, 176(18), 2624–2641.

[49] Yang, Y., Du, Y., Sun, J., Hai, Y. A topic-specific web crawler with concept similarity context graph based on FCA. In Advanced intelligent computing theories and applications. With aspects of artificial intelligence, 2008, pp. 840–847. Springer Berlin Heidelberg.

[50] Formica, A. Concept similarity in Formal Concept Analysis: An information content approach. Knowledge-Based Systems, 2008, 21(1), 80–87.

[51] Lee, M. C., Liu, Z. L., Chen, H. H., Lai, J. B., Lin, Y. T. FCA based concept constructing and similarity measurement algorithms. In Advanced Information Management and Service (IMS), 2010, pp. 384–388.

[52] Ballatore, A., Wilson, D. C., Bertolotto, M. Computing the semantic similarity of geographic terms using volunteered lexical definitions. International Journal of Geographical Information Science, 2013, 27(10), 2099–2118.

[53] Guisheng, Y., Qiuyan, S. Research on ontology-based measuring semantic similarity. In Internet Computing in Science and Engineering, 2008. ICICSE'08, 2008, pp. 250–253.

[54] Ngan, L. D., Hang, T. M., Goh, A. E. Semantic similarity between concepts from different OWL ontologies. In Industrial Informatics, 2006, pp. 618–623.

[55] Dunster, J., Dunster, K. Dictionary of natural resource management. Dictionary of natural resource management, 1996.

[56] Cerba, O. Ontologie jako nastroj pro navrhy datovych modelu vybranych temat priloh smernice INSPIRE. Dissertation, Univerzita Karlova v Praze, 2011. (in Czech)

[57] Belgiu, M., Strobl, J., Mittlboeck, M. Adding Semantics To Spatial Content. A Land Cover Scenario, 2012.