
The Prague Bulletin of Mathematical Linguistics

The Journal of Charles University

2 Issues per year

Open Access
Online ISSN: 1804-0462

CorporAl: a Method and Tool for Handling Overlapping Parallel Corpora

Mark Fishel / Heiki-Jaan Kaalep
Published Online: 2010-12-10 | DOI: https://doi.org/10.2478/v10108-010-0021-7


This work introduces a method and tool for handling overlapping parallel corpora, i.e. corpora that are based on the same source material. The method is insensitive to minor changes in the text, to differing segmentation levels between the corpora, and to material omitted from either corpus. The aim is to detect matching sentence pairs and then either combine the overlapping corpora or compare them and assess their quality relative to each other. The tool lets the user define the desired behavior when combining corpus pairs, yielding pure-comparison, maximum-size, or maximum-quality versions of the combination. We test the tool on two cases of overlapping parallel corpora and five language pairs. We also evaluate the impact of the method on two translation systems, one phrase-based and one parsing-based.
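To make the idea of detecting matching sentence pairs concrete, here is a minimal Python sketch. It is not the CorporAl algorithm, which the abstract does not spell out. It assumes each corpus is a list of (source, target) sentence pairs and treats two pairs as matching when their source sides are equal after a simple normalization. The names `normalize`, `match_pairs` and `max_size_combination` are hypothetical, and the sketch does not handle differing segmentation levels.

```python
import re
import unicodedata


def normalize(sentence: str) -> str:
    """Reduce a sentence to a key that ignores case, punctuation and spacing."""
    text = unicodedata.normalize("NFKC", sentence).lower()
    text = re.sub(r"[^\w\s]", " ", text)  # replace punctuation with spaces
    return " ".join(text.split())          # collapse runs of whitespace


def match_pairs(corpus_a, corpus_b):
    """Split two lists of (source, target) pairs into shared and unique parts.

    Assumption: two pairs match when their normalized source sides are equal.
    """
    index_b = {}
    for pair in corpus_b:
        # Keep the first pair seen for each normalized source sentence.
        index_b.setdefault(normalize(pair[0]), pair)

    shared, only_a, seen = [], [], set()
    for pair in corpus_a:
        key = normalize(pair[0])
        if key in index_b:
            shared.append((pair, index_b[key]))
            seen.add(key)
        else:
            only_a.append(pair)

    only_b = [p for p in corpus_b if normalize(p[0]) not in seen]
    return shared, only_a, only_b


def max_size_combination(corpus_a, corpus_b):
    """Union of both corpora, keeping corpus A's version of each shared pair."""
    shared, only_a, only_b = match_pairs(corpus_a, corpus_b)
    return [a for a, _ in shared] + only_a + only_b
```

A pure comparison would inspect `shared`, `only_a` and `only_b` directly, while a maximum-quality combination would need some per-pair quality score to choose between the two versions of each shared pair; that scoring is left out here because the abstract does not describe it.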


About the article


Published Online: 2010-12-10

Published in Print: 2010-09-01


Citation Information: The Prague Bulletin of Mathematical Linguistics, Volume 94, Issue , Pages 67–76, ISSN (Online) 1804-0462, ISSN (Print) 0032-6585, DOI: https://doi.org/10.2478/v10108-010-0021-7.


This content is open access.
