BY-NC-ND 3.0 license Open Access Published by De Gruyter October 9, 2013

A Unique Indexing Technique for Discourse Structures

Chinnaudayar Navaneethakrishnan Subalalitha and Parthasarathi Ranjani

Abstract

Sutra is a form of text representation that has been used in both Tamil and Sanskrit literature to convey information in a short and crisp manner. Nannool, an ancient Tamil grammar masterpiece, uses sutras to define grammar rules. Similarly, in Sanskrit literature, many of the Shāstrās use sutras for a concise representation of their content. Sutras are defined as short aphorisms, formula-like structures that convey the complete essence of the text. They act as indices to the elaborate content they refer to. Inspired by these characteristics, this article proposes an indexing mechanism based on sutras for discourse structures built using rhetorical structure theory (RST) and also using sangati, a concept proposed in Sanskrit literature. The indices identified by the indexer are suitable for question answering (QA), summary generation, and information retrieval (IR) systems. The indexer has been tested on an IR system using 1000 Tamil language text documents. A performance comparison has also been made with one of the existing RST-based indexing techniques.

1 Introduction

The rapid growth in data over the web calls for efficient natural language processing (NLP) applications such as question answering (QA) and information retrieval (IR) engines. A QA or IR engine becomes productive when it meets the demand of the end user in terms of accuracy and speed, which in turn depends on how the background information is processed. Semantics plays a vital role in today’s NLP arena by increasing the quality of the output. One way to acquire the semantics of a text is to organize it as a discourse structure because it helps in interpreting the natural language (NL) text fragments accurately. The interpretation is done by identifying the semantic relations between NL text fragments. Rhetorical structure theory (RST) and discourse representation theory (DRT) are the most popular theories that form a discourse structure for a given text. RST finds the coherence between the NL text fragments using RST-based discourse relations [8], whereas DRT uses a level of abstract mental representation, namely, discourse representation structure (DRS) within its formalism, which helps in handling the meaning across sentence boundaries [15]. Many researchers have focused on building NLP applications using RST [14], and the use of discourse relations has been shown to increase the efficiency of these applications. Still, many improvements are needed in processing the discourse structures, such as indexing the discourse structures, organizing the indices, handling queries, and retrieving responses. Indexing the text while retaining the semantics introduced by the text representation technique is important for a QA or an IR engine to retrieve semantically relevant answers for the user. Such an indexing technique for RST-based discourse structures is proposed in this article.

The proposed indexing technique emulates a concept called sutra, which is proposed in the ancient Tamil grammar masterpiece Nannool as well as in Sanskrit literature [5, 10]. Sutras have been used to represent information in a concise and unambiguous manner that is comparable with the quality of an ideal text index. Inspired by these qualities of sutras, a technique for indexing RST-based discourse structures has been proposed. To demonstrate the versatility of the proposed indexing technique, in addition to the discourse structure built by RST, another text representation technique based on sangati relations, which are proposed in ancient Sanskrit literature [13], has also been indexed. Sangati expresses continuity and proper positioning of pieces of text, similar to RST. The indices built by the proposed indexer are suitable for IR systems, summary generation systems, and QA systems.

This article is organized as follows. Section 2 describes the basics of RST, sangati, and sutras. Section 3 explains the related work on RST-based indexing techniques. Section 4 discusses the proposed work. Section 5 gives details on the evaluation of the proposed technique. Section 6 presents the conclusion.

2 Background

2.1 Rhetorical Structure Theory

RST is a text representation technique proposed by Bill Mann, Sandy Thompson, and Christian Matthiessen at the University of Southern California as part of their studies on computer-based text generation in 1983 [8]. RST captures the coherence between text spans using discourse relations and forms a discourse structure [8]. To relate the coherent texts using discourse relations, elementary discourse units (EDUs) are identified. EDUs are essentially at the clause level, although in certain circumstances they represent sentences. Discourse units identified at the paragraph level are called complex discourse units (CDUs) [1]. The discourse units are categorized as nucleus and satellite. The nucleus expresses the salient part of the text, whereas the satellite is the additional information supplied about the nucleus. Given a text document, RST builds a graph-like structure, where the nodes are represented by the EDUs or CDUs, and the edges are represented by the discourse relations.

Discourse relations can be categorized into three categories, namely, subject matter, presentational, and multinuclear [7]. Subject matter relations describe the parts of the subject or the main theme of the text. Presentational relations show the presentation aspects with which the author has written the text. Unlike presentational and subject matter relations, multinuclear relations have two nuclei and no satellite. “Elaboration,” “evaluation,” “interpretation,” “means,” “cause,” “result,” “otherwise,” “purpose,” “solutionhood,” “condition,” “unconditional,” and “unless” are subject matter relations. “Antithesis,” “background,” “concession,” “enablement,” “evidence,” “justify,” “motivation,” “preparation,” “restatement,” and “summary” are presentational relations. “Conjunction,” “disjunction,” “contrast,” “joint,” “list,” “multinuclear restatement,” and “sequence” are multinuclear relations.

Figure 1 shows an example of how the nucleus, satellite, and the discourse relation are identified for an English sentence. The triplet nucleus–discourse relation–satellite is denoted as NRS sequence in this article. When a discourse structure is constructed for a paragraph, it can be organized as a rhetorical structure (RS) tree to facilitate the easy access by the NLP applications [8].

Figure 1. NRS Sequence for Example 1.

Example 1 If you walk daily, you will be healthy.
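
As a minimal sketch, the NRS sequence for Example 1 can be held in a small record type. The class name `NRS` and its field names are illustrative only, and the label "condition" is an assumption, because the relation itself is shown only in Figure 1.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NRS:
    """One nucleus-relation-satellite triplet (hypothetical container)."""
    nucleus: str    # salient text span
    relation: str   # discourse relation linking the two spans
    satellite: str  # additional information about the nucleus


# Example 1; "condition" is an assumed label, not taken from the article.
example_1 = NRS(
    nucleus="you will be healthy",
    relation="condition",
    satellite="If you walk daily",
)

print(f"{example_1.nucleus} <-{example_1.relation}- {example_1.satellite}")
```

A discourse structure for a paragraph would then be a collection of such triplets that share text spans.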

2.2 Sangatis

Shāstrās, which are expositions of Vedic texts, are organized in the form of sutras (statements), adhikaraṇa (subtopic), pāda (section), adhyāya (chapter), and Shāstrā (whole content) [6]. A set of related sutras on a subtopic forms an adhikaraṇa. A set of adhikaraṇas forms a pāda, and a set of pādas forms an adhyāya. Sutras, being concise in nature, need to be explained. The explanation is normally organized using adhikaraṇa. An adhikaraṇa has five components, namely, the subject of discussion, the doubt or ambiguity in understanding the subject, the sangati for this discussion, the opponent’s view, and the proponent’s (proposed) view. Of these, sangati is used for explanations at various levels: at the sutra level, in terms of how a sutra is related to the previous sutra; at the adhikaraṇa level, as to how a sutra is relevant to the adhikaraṇa; and at the pāda level, as to how it is relevant to that pāda. Similarly, sangatis have been used between adhikaraṇas and between pādas as well. Table 1 lists the sangatis that are considered in this article and their equivalent meanings in English.

Table 1. Sangatis and Their English Meanings.

S. no. | Sangati       | English meaning
1      | Upodghāta     | Introduction
2      | Apavāda       | Exception
3      | Ākṣepa        | Objection
4      | Prāsangika    | Related
5      | Upajīvya      | Helper
6      | Uttāna        | Arises
7      | Sthirīkaraṇa  | Strengthen
8      | Ātideśika     | Transference
9      | Dṛṣṭanta      | Example
10     | Pratyudharaṇa | Counterexample
11     | Avasara       | Timely relevance
12     | Viśeṣa        | Speciality
13     | Anantara      | Follows
14     | Pratyavasthana | Reinstate

Figure 2 illustrates the usage of upodghāta sangati in describing a text on cancer, where upodghāta links the introductory part of the text to its respective explanatory part.

Figure 2. Usage of Upodghāta Sangati.

2.3 Sutras

Sutras have been used in the Tamil language grammar masterpiece Nannool and in Sanskrit literature. In Nannool, the grammar rules are defined using sutras or noorpaas. As already mentioned, in Sanskrit literature, Shāstrās contain sutra-based texts. In both literatures, sutras are equivalent to formulae that explain a wide concept in a brief manner. The definition of the sutra as per Nannool is given in Tamil in Figure 3, followed by the English transliteration and its meaning in English.

Figure 3. Definition of the Sutra as per Nannool in Tamil.

Per Sanskrit literature, the definition of sutra is transliterated in English as follows:

alpāksaram asandigdham sāravad viśvatomukham astobham anavadyam ca sūtram sūtravido viduh, which means “Of minimal syllabary, unambiguous, pithy, comprehensive, continuous, and without flaw: who knows the sutra knows it to be thus.”

Per Sanskrit literature, sutra is an aphorism (or line, rule, formula) or a collection of aphorisms in the form of a manual or, more broadly, a text in Hinduism or Buddhism. Literally, it means a thread or line that holds things together and is derived from the verbal root siv, meaning, “to sew.”

It can be seen that sutras have been used in both literatures to express the content of a coherent text in a concise manner. The first line in the definition of the sutra as per Nannool, “sil vagai ezhuththil palvagaip poruLai” (representing many things in few words), denotes the characteristic of a text index. An efficient text index should be representative of the sentence, paragraph, or document from which it is extracted. Furthermore, sutras express continuity between the texts they represent. Hence, in this article, the idea of sutras has been used to index the coherent discourse structures built by RST and sangati.

3 Related Work

Because the proposed indexing technique is suitable for IR, QA, and summary generation systems, existing works on RST-based indexing techniques focusing on these applications are discussed in this section.

Haouam and Mariri [4] have proposed an indexing technique for RST-based discourse structures. The NRS sequences are indexed as “relation name–nucleus–satellite” and stored in a database along with their document identifiers and discourse relation identifiers. The indices are given intuitive weights initially. The weight of an index increases if it is referred to frequently by the user. The indexer deals with eight discourse relations, namely, “contrast,” “elaboration,” “circumstance,” “condition,” “cause,” “concession,” “sequence,” and “purpose.” The index system has been tested using an IR system. Given an NL question, the cue phrases in the question are mapped to a discourse relation, and the documents containing the text spans that match the keywords present in the query are retrieved. For instance, if the query is, “Is there anything that contradicts Mary?,” then the term “contradict” is mapped to the discourse relation “contrast.” The documents containing the NRS sequences that contain the discourse relation “contrast” are first retrieved, and then the keywords present in the text spans are further checked and matched with the query terms for accurate results.

Sahib and Shah [11] have proposed an RST-based indexing technique that suits IR. The discourse structure is converted to an RS tree, and weights are intuitively assigned to each node initially. The root of the tree is assigned a weight of 1. The nucleus and the satellite nodes present in the tree are assigned the weights 0.9 and 0.5, respectively. The weights are changeable, and the weight of a child node varies depending upon the weight of its parent. These weights and the frequency of the terms present in each node are used together as the weight for an index node. The index nodes are stored in a database along with their weights, discourse relation identifiers, and document identifiers. On the retrieval side, the cue words in the query are mapped to the discourse relations, and the documents whose discourse relations match the cue words present in the query are retrieved. The retrieved documents are ranked according to their weights.

An RST-based indexing technique for answering “why”-type questions in a QA system has been proposed by Verberne et al. [16]. The claim is that for “why”-type questions, the answer can be identified by matching the question with the respective nuclei terms. The NRS sequence is represented as a triplet, “relation name–nucleus–satellite,” and is stored as a plain text file as an index. A set of questions is prepared, and for each index, the nucleus likelihood, “P (nucleus/question),” question likelihood, “P (Question/Nucleus),” nucleus prior [P (Relation)/Nuclei present in the document], and discourse relation prior [P(R) (instances of relation type in question set/occurrence of relation type in the corpus)] are calculated. The nuclei crossing the predefined threshold are selected, and the respective satellite is saved as the answer along with their likelihood and is used for answer retrieval.

A concept called variable length documents (VLD) using RST to create a summary of a given text has been proposed by O’Donnell. This work is discussed here because both summary generation and index generation aim at creating a crisp version of the discourse structure without disturbing its semantics. The summary of the text is created by converting the NL text to an RS tree. The NRS sequences in the RS trees are assigned weights depending on the user’s choice for the discourse relations. For instance, an NRS sequence containing the “purpose” discourse relation may get a higher score than the other relations. Satellites are considered or neglected depending on the user’s choice. These weights are used to extract the important pieces of text, thereby generating a VLD.

The RST-based indexing techniques discussed so far perceive the discourse structure as a tree or a graph. The NRS sequences are extracted from the tree or the graph and are indexed as triplets. Weights are assigned to these triplets and stored in a database. The weights are assigned based upon various criteria such as the frequency of the terms present in the triplet, frequent usage in the corpus, and frequent usage by the user. It can be observed that when the semantically coherent discourse structure is broken down into individual NRS sequences while indexing, the semantic coordination that exists across two different NRS sequences in the discourse structure is completely lost. Consequently, an NLP application using such indexing techniques gains little of the advantage of the discourse structure. On the retrieval side, the existing RST-based techniques match the cue word present in the query with a discourse relation and use it to retrieve the corresponding text spans or the documents containing the text spans. This method of retrieval will worsen the performance of the indexer as the corpus size increases, because the number of irrelevant text spans will also increase. In the indexing techniques proposed by Haouam and Mariri [4] and Sahib and Shah [11], the text spans retrieved using the discourse relations are filtered further by matching the terms present in the text spans with the query terms. This two-level processing for answer retrieval will increase the retrieval time as the corpus size increases.

In the proposed indexing technique, an index is not a single NRS sequence but a series of NRS sequences that are linked in the discourse structure. Hence, the proposed indexer inherits the coherence in the index from the discourse structure. Instead of representing the index using a set of words, the proposed indexer uses a set of noun concepts and the discourse relations together to represent an index. Also, the two levels of index and query matching performed by the existing RST-based indexing techniques are avoided in the proposed indexer by using both the discourse relations and the noun concepts together for answer retrieval.

4 Construction of the Sutra-Based Indexer (Proposed Work)

It can be observed from the previous discussions that a discourse structure contains coherent texts that are semantically woven together. If a discourse structure is indexed merely by storing each NRS sequence separately, the coherence of the discourse structure is completely lost. Furthermore, choosing the frequently occurring words as representative words for the indices worsens the efficiency of the query–index matching and retrieval process. Thus, while indexing such discourse structures for an NLP application, the coherence needs to be retained in the indices to acquire the benefits of the discourse structure completely. An index of a discourse structure needs to be a tiny graph consisting of coherent text spans or NRS sequences. Choosing the representative of the index graph should also be given much importance so that semantically relevant text spans are retrieved accurately. The proposed sutra-based indexer identifies a string of coherent NRS sequences as an index for a CDU and chooses a set of noun concepts and a set of discourse relations to represent the index. Hence, the semantics of the discourse structure is preserved in the index as well. To arrive at such an index, the proposed indexer introduces a set of new features that consider the importance of each sentence present in the CDU, the importance of the coherence that exists between the NRS sequences present within the CDU and also with the other CDUs present in the same text document, the influence of the RST-based discourse relations, and the sangatis present in the CDU.

Figure 4 shows the architecture of the sutra-based indexer. As discussed earlier, given a discourse structure built using RST, sangati, or both, comprising a set of NRS sequences or RS trees representing a text document, the sutra-based indexer identifies an index for each CDU present in the corpus. The index is built in three steps, namely, pattern identification, weighting patterns, and sutra generation. The next section discusses pattern identification in detail. Figure 5 shows the Tamil text taken from the corpus, its transliteration along with the gloss, and the discourse structure for the text.

Figure 4. Architecture of Sutra-Based Indexer.

Figure 5. Example 4.

It can be seen from Figure 6 that the nodes in the discourse structure are formed by the sentences present in the example text. There are three CDUs present in the discourse structure, comprising the sentences S1–S3 (CDU1), S4–S6 (CDU2), and S7–S9 (CDU3). A sutra is formed for each CDU.

Figure 6. Discourse Structure for Example 4.

4.1 Pattern Identification

A pattern consists of connected NRS sequences present in a CDU. All such patterns are identified, and each pattern is given a weight depending upon five factors, namely, the node’s frequency within the pattern, the node’s frequency across the patterns in the CDU, the pattern’s frequency in the CDU, the total number of discourse relations that link the pattern with the other CDUs present throughout the text, and the influence of the discourse relations present in the pattern. Using these factors, a significant pattern is chosen as the representative pattern for that CDU.

Node’s frequency within the pattern, that is, the number of nodes the pattern contains, is given importance because a higher number of nodes implies that the pattern contains more information.

Node’s frequency across the patterns in the CDU gives the cumulative frequency of each node in the pattern across all the patterns identified for a CDU. A high frequency indicates that the node has more semantic connectivity with the other nodes, and hence, it needs to be given importance.

Pattern’s frequency in the CDU gives the frequency of a pattern occurring as a subpattern in the CDU. A pattern occurring frequently as a subpattern in the other patterns indicates that it consists of nodes that are semantically linked frequently with the other nodes in the CDU, and hence, such a pattern needs to be given importance.

Total number of discourse relations that link the pattern with the other CDUs present throughout the text gives the total number of links a pattern has with the other CDUs. If a pattern has a node that has links with the other CDUs, then it indicates that the node is an important sentence of the entire text. When a pattern has many such nodes, the pattern becomes a significant pattern within the text and needs to be given more importance.

Influence of discourse relations indicates the sum of the influences of the important discourse relations present in each pattern, where importance is measured with respect to the corpus. For instance, in the tourism domain-specific corpus, discourse relations such as “elaboration,” “list,” “sequence,” and “preparation” occur predominantly, and such discourse relations need to be given importance. The influence of a discourse relation Ri is given by P(Ri), where P(Ri) is the total number of occurrences of Ri divided by the total number of discourse relations present in the corpus.
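
The influence measure can be sketched as a relative frequency over the relation labels of the corpus. The function names and the toy label counts below are illustrative assumptions and are not taken from the article's corpus.

```python
from collections import Counter


def relation_influence(corpus_relations):
    """Return P(R_i) = count(R_i) / total number of relations, per relation."""
    counts = Counter(corpus_relations)
    total = sum(counts.values())
    return {rel: n / total for rel, n in counts.items()}


def pattern_influence(pattern_relations, influence):
    """Sum the influence of every relation occurrence in one pattern."""
    return sum(influence.get(rel, 0.0) for rel in pattern_relations)


# Toy corpus of relation labels (made-up counts, for illustration only).
toy_corpus = ["elaboration"] * 6 + ["list"] * 3 + ["sequence"]
influence = relation_influence(toy_corpus)

print(influence)
print(pattern_influence(["elaboration", "list"], influence))
```

Relations that never occur in the corpus contribute nothing to a pattern's influence in this sketch.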

The pattern that gets the highest weight is chosen to form a sutra. Figure 7 shows the patterns identified for the CDU1, which is present in the discourse structure shown in Figure 6.

Figure 7. Patterns Identified for CDU1.

Table 2 shows the weights for the patterns shown in Figure 7.

The influences of the discourse relations “elaboration” and “prāsangika” are 0.6 and 0.33, respectively. It can be seen from Table 2 that the pattern P1 is the top-weighted pattern, and it is chosen to form the sutra that represents CDU1 as its semantic index.

Table 2. Weight Factors for the Patterns.

Weight factors                                     | P1    | P2    | P3    | P4
Number of nodes in the pattern                     | 3     | 2     | 2     | 2
Total node frequency across the patterns in the CDU | 9     | 6     | 6     | 6
Pattern frequency in the CDU                       | 0     | 1     | 1     | 1
Number of links across the document                | 2     | 2     | 1     | 1
Influence of discourse relations                   | 1.26  | 0.33  | 0.33  | 0.33
Total weight                                       | 15.26 | 11.33 | 10.33 | 10.33
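
The totals in Table 2 are consistent with a plain sum of the five factors. The short sketch below reproduces them; the dictionary layout is illustrative, and the unweighted sum is inferred from the table values rather than stated as a formula in the text.

```python
# Factor values copied from Table 2, in row order: number of nodes,
# node frequency, pattern frequency, cross-CDU links, relation influence.
factors = {
    "P1": [3, 9, 0, 2, 1.26],
    "P2": [2, 6, 1, 2, 0.33],
    "P3": [2, 6, 1, 1, 0.33],
    "P4": [2, 6, 1, 1, 0.33],
}

# Total weight as an unweighted sum, rounded to hide floating-point noise.
totals = {name: round(sum(values), 2) for name, values in factors.items()}

# The representative pattern is the one with the highest total weight.
top_pattern = max(totals, key=totals.get)

print(totals)
print(top_pattern)
```

With these inputs the highest total belongs to P1, in agreement with the choice made in the text.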

4.2 Sutra Generation

The sutra for a CDU comprises a set of noun concepts and a set of discourse relations chosen from the top-weighted pattern. Because nouns convey the essence of a text, the noun concepts are chosen to represent the CDU. Because a sutra needs to be concise, the number of noun concepts chosen to form the sutra is limited by choosing the frequent noun concepts and the abstract noun concepts, leaving behind their instances. The abstract noun concepts can be found using any semantic knowledge base such as WordNet or an ontology. The discourse relations chosen to form a sutra along with the noun concepts are the discourse relations that are present within the top-weighted pattern and those that link the top-weighted pattern with the rest of the patterns present in the CDU. Figure 8 shows the sutra generation for a single text document.

Figure 8. Formation of Sutra for a Single Text Document.

With respect to our example, the noun concepts of the top-weighted pattern P1 are “Tirunelveli city,” “Tirunelveli district,” “India,” “city,” “Thaamirabarani,” and “river bank.” The noun concept “Tirunelveli city” occurs thrice; thus, it is included in the sutra. The rest of the nouns occur only once but are abstract, and hence, they are also included in the sutra. The discourse relations “elaboration,” “prāsangika,” and “evidence,” which are present in the pattern P1 or link it with the other patterns and other CDUs, are also included in the sutra of CDU1. Hence, the sutra of CDU1 is “Tirunelveli city,” “Tirunelveli district,” “India,” “city,” “Thaamirabarani,” and “river bank” – “elaboration,” “prāsangika,” and “evidence.” It can be seen that the sutra exhibits the complete essence of CDU1 through the noun concepts and the discourse relations.
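
A minimal sketch of this selection step follows. The function name `build_sutra`, the `min_freq` cut-off, and the order of the noun mentions are assumptions made for illustration; the set of abstract concepts is taken as given, as if it had been looked up in a knowledge base.

```python
from collections import Counter


def build_sutra(noun_mentions, abstract_concepts, relations, min_freq=2):
    """Keep noun concepts that are frequent or abstract, plus the relations.

    noun_mentions: noun concepts of the top-weighted pattern, with repeats.
    abstract_concepts: concepts a knowledge base marks as abstract (assumed).
    relations: relations inside the pattern or linking it to other units.
    min_freq: hypothetical cut-off for "frequent"; the article gives none.
    """
    counts = Counter(noun_mentions)  # Counter keeps first-seen order
    nouns = [concept for concept in counts
             if counts[concept] >= min_freq or concept in abstract_concepts]
    unique_relations = list(dict.fromkeys(relations))  # de-duplicate in order
    return nouns, unique_relations


# Noun concepts of P1; "Tirunelveli city" occurs three times as in the text.
mentions = ["Tirunelveli city", "Tirunelveli district", "India",
            "Tirunelveli city", "city", "Thaamirabarani",
            "Tirunelveli city", "river bank"]
abstract = {"Tirunelveli district", "India", "city",
            "Thaamirabarani", "river bank"}
nouns, relations = build_sutra(
    mentions, abstract, ["elaboration", "prāsangika", "evidence"])

print(nouns)
print(relations)
```

A noun concept that is neither frequent nor abstract would be dropped, which is how the sketch keeps the sutra short.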

Each sutra is stored along with its respective CDU and the CDUs that are semantically linked to the sutra. Then, an inverted index is built for each sutra. Each sutra is tagged with its respective CDU identifiers and the document identifier. On the retrieval side, the discourse relations pertaining to the query are identified by mapping the cue words in the query. Also, the noun concepts in the query are identified, and the combinations of these noun concepts and the discourse relations are matched with the sutras to fetch the results. Figure 9 shows the inverted index data structure of sutras.

Figure 9. Representation of the Sutra as an Inverted Index.
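
One possible way to realize the inverted index and the combined noun concept and discourse relation matching is sketched below. The layout of Figure 9 is not reproduced here: keying the index on (noun concept, relation) pairs, the function names, and always including "elaboration" as the default relation are assumptions. The two cue-term entries follow the query examples discussed in Section 5.

```python
from collections import defaultdict
from itertools import product

# Cue-term mapping (illustrative; entries follow the article's query examples).
CUE_TO_RELATIONS = {
    "sirappu": {"elaboration", "visesha"},
    "matrum": {"joint"},
}


def build_inverted_index(sutras):
    """Map each (noun concept, relation) pair to the sutras containing both.

    sutras: {sutra_id: (noun_concepts, relations, cdu_id, doc_id)}
    """
    index = defaultdict(set)
    for sutra_id, (nouns, relations, _cdu_id, _doc_id) in sutras.items():
        for pair in product(nouns, relations):
            index[pair].add(sutra_id)
    return index


def search(index, query_nouns, query_cues, default_relation="elaboration"):
    """Return ids of sutras matching any noun-relation pair of the query."""
    relations = {default_relation}
    for cue in query_cues:
        relations |= CUE_TO_RELATIONS.get(cue, set())
    hits = set()
    for pair in product(query_nouns, relations):
        hits |= index.get(pair, set())
    return hits


sutras = {
    "S1": (["Tirunelveli city", "India"],
           ["elaboration", "prāsangika", "evidence"], "CDU1", "D1"),
    "S2": (["India"], ["joint"], "CDU2", "D1"),
}
index = build_inverted_index(sutras)
print(search(index, ["Tirunelveli city"], ["sirappu"]))
print(search(index, ["India"], ["matrum"]))
```

Because each key already combines a noun concept with a relation, a single lookup replaces the two-stage filtering described for the earlier techniques.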

It can be observed that the proposed indexing technique is suitable for IR, QA, and summary generation systems. For instance, when used by a QA or summary generation system, the CDUs matching the query can be retrieved as an answer or a summary. When used by an IR system, the documents matching the query can be retrieved.

The next section discusses the evaluation of the proposed work and also facilitates the comparison of the proposed work with an existing RST-based indexing technique.

5 Evaluation

The proposed indexer has been tested on an IR system using a corpus of 1000 Tamil language tourism domain-specific text documents. For discourse structure construction, an enhanced version of our existing language-independent discourse parser has been used [15]. Our discourse parser needs the input text documents to be enconverted to the universal networking language (UNL). The Tamil Enconverter proposed by Balaji et al. [2] has been used for enconverting the Tamil text documents. The discourse parser identifies discourse relations by exploring the similarities that exist between the discourse relations and UNL relations. The NRS sequences are identified between clauses and sentences. To identify CDUs from the sentence-level NRS sequences, lexical chains are identified using an adaptation of the segmentation algorithm proposed by Stokes et al. [12]. Repetition of words and synonyms are the features used for constructing the lexical chains. Because UNL portrays each term as a concept, word repetition and synonyms are identified easily. For each CDU identified, an RS tree is constructed using the RS tree algorithm proposed by Marcu [3], and the NRS sequences of the tree are formed using UNL features.

5.1 Evaluation of the Proposed Indexing Technique Using an IR system

The IR system has been tested using 15 queries. Precision for the first ten documents (P@10) and mean average precision (MAP) were used as the parameters for evaluation. MAP for a set of queries is the mean of the average precision scores for each query. A comparison has also been done with an existing RST-based indexing technique proposed by Sahib and Shah [11]. The P@10 values of both the IR systems are shown in Figure 10.

Figure 10. Performance Comparison Through the IR System.

$$\text{MAP score} = \frac{\sum_{i=1}^{N} \text{Average precision}(\text{query}_i)}{N} \qquad (1)$$

where N is the number of queries. The MAP score using the proposed approach is 0.7, and the MAP score using the approach of Sahib and Shah [11] is 0.62.
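
The two evaluation measures can be computed as in the following sketch. The function names are illustrative, the relevance judgments are made up, and average precision is taken in a common IR form (the mean of the precision values at the ranks where relevant documents appear), which the article does not spell out.

```python
def precision_at_k(relevant_flags, k=10):
    """Fraction of the first k retrieved documents that are relevant."""
    return sum(relevant_flags[:k]) / k


def average_precision(relevant_flags):
    """Mean of precision-at-rank over the ranks holding a relevant document."""
    hits, precisions = 0, []
    for rank, is_relevant in enumerate(relevant_flags, start=1):
        if is_relevant:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / len(precisions) if precisions else 0.0


def mean_average_precision(runs):
    """Mean of the per-query average precision scores, with N = len(runs)."""
    return sum(average_precision(run) for run in runs) / len(runs)


# Made-up relevance judgments (1 = relevant) for two ranked result lists.
runs = [
    [1, 0, 1, 0, 0, 0, 0, 0, 0, 0],
    [0, 1, 0, 0, 0, 0, 0, 0, 0, 0],
]
print(precision_at_k(runs[0]))
print(mean_average_precision(runs))
```

Other definitions of average precision divide by the total number of relevant documents in the collection, which would lower the score when relevant documents are missed.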

The precision values and the MAP score achieved by the proposed indexing technique are higher than those of the existing RST-based indexing approach. The three main reasons behind this performance are the context coverage of the indices, the sangati relations, and the presence of more than one discourse relation in each index. Figure 11 shows a single Tamil query that was used for evaluation, along with its English transliteration and translation.

Figure 11. Query1.

The P@10 values obtained through the approach of Sahib and Shah [11] and the proposed approach for the above query are 40% and 70%, respectively. It can be observed from the query that the word sirappu (speciality) can be mapped to the discourse relation “elaboration” and the sangati visesha. In the approach of Sahib and Shah [11], the documents indexed through the NRS sequences that contain the discourse relation “elaboration” along with the terms present in the query are retrieved. Because the corpus used is tourism domain specific, the queries also lean toward the same domain. It was observed from the corpus that the discourse relation “elaboration” is predominantly present, and hence, regardless of the presence or absence of a cue term in the query, “elaboration” is used as a default discourse relation along with the query terms for retrieval. Because the methodology for matching the query terms and the index terms is not stated by Sahib and Shah [11], various combinations of all the query terms have been used along with the discourse relation “elaboration” to retrieve the documents. By doing so, only four of the first ten retrieved documents were found to be semantically relevant to the query. This is mainly because of the limited context coverage of the indices in the approach of Sahib and Shah [11]. Because each index represents an EDU, which is essentially a clause or a sentence, the documents retrieved through such indices may not always be semantically close to the query. In the proposed approach, an index is a representative of a CDU, which is essentially a paragraph, and hence, the chance of retrieving semantically relevant documents is higher through such indices.
Also, in the proposed indexing technique, the search is done using various combinations of noun concepts and discourse relations, which helps in eliminating irrelevant documents, unlike the approach of Sahib and Shah [11], where all the query terms are used for retrieval. Furthermore, the usage of sangatis in addition to the RST-based discourse relations for discourse structure construction is also a major reason for the high precision. For instance, in the proposed approach, the word sirappu (speciality) present in the query is mapped to the discourse relation “elaboration” and also to the sangati visesha. Because sangatis are more capable of capturing the coherence between complex sentences and paragraphs than RST-based discourse relations [13], their use is one of the reasons for the high precision values. The use of sangati also shows that the proposed indexer can be extended to handle other types of discourse relations. Because an index in the proposed approach is a set of noun concepts and a set of discourse relations, even if the query contains more than one cue term that could be mapped to discourse relations or sangatis, the chance of retrieving semantically more relevant results is higher in the proposed approach than in the existing RST-based indexing techniques. For instance, two discourse relations, namely, “elaboration” and “joint,” can be mapped from Query2 shown in Figure 12. Here, “elaboration” is the default discourse relation, whereas “joint” is mapped using the cue term matrum (and).

Figure 12. Query2.

Because an index in the approach of Sahib and Shah [11] is merely an NRS sequence, only the terms and a single discourse relation can be used for searching the documents. In the proposed approach, the noun concepts along with both the discourse relations are used to search the documents, which results in the retrieval of documents that are semantically closer to the query.

6 Conclusions

A technique for indexing discourse structures based on RST and sangati has been presented in this article, inspired by the concept of the sutra, which is used in ancient Tamil and Sanskrit literature. Sutras have been used to represent a text in a concise and unambiguous manner. The indices built by the proposed technique emulate the characteristics of the sutra and help retain, within the indices, the semantics present in the discourse structure. The proposed technique has been evaluated by incorporating it into an IR system. Furthermore, its performance has been compared with that of an existing RST-based indexing technique, and an improvement has been shown. The proposed technique differs from existing techniques in the way it identifies, stores, and uses the indices for retrieval. It is versatile enough to handle both RST-based discourse relations and sangatis, and it can also be used for summary generation and QA systems.


Corresponding author: Chinnaudayar Navaneethakrishnan Subalalitha, Department of Information Science and Technology, College of Engineering, Guindy, Anna University, Chennai, 600025, India, e-mail:

Bibliography

[1] N. Asher, P. Muller and S. Afantenos, Complex discourse units and their semantics, Nation 2 (2011) 7.

[2] J. Balaji, T. V. Geetha, Ranjani Parthasarathi and Madhan Karky, Morpho-semantic features for rule-based Tamil enconversion, Int. J. Comput. Appl. 26 (2011), 0975–8887.

[3] M. Daniel, Building up rhetorical structure trees, in: Proceedings of the National Conference on Artificial Intelligence, pp. 1069–1074, 1996.

[4] K. E. Haouam and F. Mariri, A dynamic weight assignment approach for index terms, J. Comput. Sci. 2 (2006), 261–268. DOI: 10.3844/jcssp.2006.261.268

[5] A. A. Macdonell, “The sūtras”. A History of Sanskrit Literature. D. Appleton and Company, New York, 1900.

[6] Madhavacharya, Jaiminiya Nyaya Mala Vistara, Chankhamba Sanskrit Pratishthan.

[7] W. C. Mann and S. A. Thompson, Rhetorical structure theory: a theory of text organization, No. ISI/RS-87-190, University of Southern California Marina Del Rey Information Sciences Institute, 1987.

[8] W. C. Mann and S. A. Thompson, Rhetorical structure theory: toward a functional theory of text organization, Text 8 (1988), 243–281.

[9] M. O'Donnell, Variable-length on-line document generation, Proc. Flexible Hypertext Workshop of the Eighth ACM International Hypertext Conference, UK, 1997.

[10] SaaminaathaIyer, Pavanandhi munivar iyatriya Nannool moolamum Mayilainaathar uraiyum, Dr. U. Ve, SaaminaathaIyer Noolnilaiyam, Chennai, 1995.

[11] M. Sahib and A. Ali Shah, A new indexing technique for information retrieval systems using rhetorical structure theory (RST), J. Comput. Sci. 2.3 (2006), 224. DOI: 10.3844/jcssp.2006.224.228

[12] N. Stokes, J. Carthy and A. F. Smeaton, Segmenting broadcast news streams using lexical chains, in: Proceedings of Starting AI Researchers Symposium (STAIRS), pp. 145–154, 2002.

[13] C. N. Subalalitha and P. Ranjani, An approach to discourse parsing using sangati and rhetorical structure theory, in: 24th International Conference on Computational Linguistics, p. 73, 2012.

[14] M. Taboada and W. C. Mann, Applications of rhetorical structure theory, Discourse Stud. 8 (2006), 567–588. DOI: 10.1177/1461445606064836

[15] J. van Eijck, Discourse representation theory, in: Encyclopedia of language and linguistics, edited by K. Brown, Elsevier, Vol. 3, pp. 660–669, 2006. DOI: 10.1016/B0-08-044854-2/01090-7

[16] S. Verberne, L. Boves, P.-A. Coppen and N. Oostdijki, A discourse-based answering of why-questions – employing RST structure for finding answers to why-questions, TAL 47 (2007), 21–41.

Received: 2013-5-13
Published Online: 2013-10-9
Published in Print: 2014-9-1

©2014 by Walter de Gruyter Berlin/Boston

This article is distributed under the terms of the Creative Commons Attribution Non-Commercial License, which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.