Published by De Gruyter Mouton, May 8, 2013

Establishing criteria for RST-based discourse segmentation and annotation for texts in Basque

Mikel Iruskieta, Arantza Diaz de Ilarraza and Mikel Lersundi

Abstract

This article presents a discourse annotation methodology based on Rhetorical Structure Theory (RST) and an empirical study in which a corpus of specialized medical texts in Basque is annotated. The annotation process has two phases: segmentation and annotation of rhetorical relations. The first phase entails an initial study that establishes linguistic criteria for sentence-based segmentation; the second phase focuses on the annotation of rhetorical relations. Once discourse segments and rhetorical relations have been established, the annotation process is analyzed and evaluated using the method commonly used in RST (Marcu 2000). Inconsistencies detected in that evaluation method lead the authors to redefine some of its criteria. As a result of this work, a small annotated Basque-language corpus is made available to the scientific community.

About the authors

Mikel Iruskieta

Mikel Iruskieta is a lecturer in Basque language and literature at the University of the Basque Country. His methodological interests include text parsing and knowledge and discourse representation. He has worked mainly on text analysis applications such as machine translation, text summarization and knowledge extraction.

Arantza Diaz de Ilarraza

Arantza Diaz de Ilarraza is a professor of computer languages and systems at the University of the Basque Country. She received her PhD in Computer Science from the University of the Basque Country in 1990. She is a researcher in the field of Natural Language Processing. Her research interests include the development of natural language processing resources, machine translation and linguistic annotation.

Mikel Lersundi

Mikel Lersundi received his PhD from the University of the Basque Country; his dissertation presented a syntactic and semantic analysis of a Basque dictionary, carried out to extract lexical-semantic relations between words and to build a database containing these relations. He teaches Basque for scientific purposes at the University of the Basque Country and specializes in lexical-semantic relations, terminology, and machine translation.

Published Online: 2013-5-8
Published in Print: 2015-10-1

©2015 by De Gruyter Mouton

Downloaded on 29.3.2024 from https://www.degruyter.com/document/doi/10.1515/cllt-2013-0008/html