Search Results


  • Author: Alexander Geyken

Abstract

The German dictionary market has changed dramatically in recent years. Traditional dictionary publishers have shrunk drastically or even disappeared completely, and the largest academic dictionary, the Grimmsches Wörterbuch, will definitely stop its work in 2016. This paper discusses how large documentary reference dictionaries can be planned and produced for the German market in this new context. It focuses on the technological conditions and their impact on planning the lexicographical process, including the corpus base, automation of information extraction, aggregation of existing information, workflow management, online publication, and user contributions. Several existing German and English dictionary projects are discussed with respect to these aspects, with a particular focus on the Digitales Wörterbuch der deutschen Sprache (DWDS). It is shown how the new technological possibilities help to balance richness and up-to-dateness of information against consistency and minimal redundancy, and more generally how a flexible project workflow can be established in which time and budget constraints can be adapted to the project goals.

Abstract

Previous rule-based approaches to Named Entity Recognition (NER) in German base NER on Part-of-Speech tagged texts. We present a new approach in which NER is situated between morphological analysis and Part-of-Speech tagging, and we model the NER grammar entirely with weighted finite state transducers (WFST). We show that NER strategies such as the resolution of proper noun/common noun or company-name/family-name ambiguities can be formulated as a best path function of a WFST. The frequently used second-pass resolution of coreferential Named Entities can be formulated as a re-assignment of appropriate weights. A prototypical NE recognition system built on WFSTs and large lexical resources was tested on a manually annotated corpus of 65,000 tokens. The results show that our system is comparable in recall and precision to existing rule-based approaches.
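
The abstract gives no implementation details, so the following is only a minimal sketch of the general idea of resolving a proper noun/common noun ambiguity as a best path over weighted alternatives. It assumes costs in the tropical semiring (costs add along a path, the lowest total wins). The tag names, the weights, and the functions `best_path` and `reweight` are hypothetical illustrations and are not taken from the paper or from any WFST library.

```python
from typing import Dict, List, Tuple

# One dict per token: candidate tag -> emission cost (hypothetical values).
Lattice = List[Dict[str, float]]


def best_path(lattice: Lattice,
              trans: Dict[Tuple[str, str], float],
              default: float = 1.0) -> Tuple[float, List[str]]:
    """Return (cost, tags) of the cheapest tag sequence through the lattice."""
    # best[tag] = (cost of the cheapest path ending in tag, that path)
    best = {tag: (cost, [tag]) for tag, cost in lattice[0].items()}
    for column in lattice[1:]:
        nxt = {}
        for tag, emit in column.items():
            candidates = [
                (prev_cost + trans.get((prev_tag, tag), default) + emit,
                 path + [tag])
                for prev_tag, (prev_cost, path) in best.items()
            ]
            nxt[tag] = min(candidates, key=lambda c: c[0])
        best = nxt
    return min(best.values(), key=lambda c: c[0])


def reweight(lattice: Lattice, position: int, tag: str, cost: float) -> Lattice:
    """Return a copy of the lattice with one candidate's cost re-assigned."""
    copy = [dict(column) for column in lattice]
    copy[position][tag] = cost
    return copy


# Hypothetical transition costs: a title strongly prefers a following name.
TRANS = {("TITLE", "NE"): 0.0, ("TITLE", "NN"): 2.0}

# "Herr Müller kommt": "Müller" is ambiguous between name (NE) and noun (NN).
first = [{"TITLE": 0.0}, {"NE": 1.0, "NN": 0.5}, {"VERB": 0.0}]
cost_first, tags_first = best_path(first, TRANS)

# "Müller kommt": without the title, the common-noun reading is cheaper.
second = [{"NE": 1.0, "NN": 0.5}, {"VERB": 0.0}]
cost_second, tags_second = best_path(second, TRANS)

# Second pass: the name was already seen as NE, so its NE cost is lowered.
cost_third, tags_third = best_path(reweight(second, 0, "NE", 0.0), TRANS)
```

In this toy setup the first sentence resolves "Müller" as a name because of the preceding title, the isolated second occurrence falls back to the common-noun reading, and lowering the NE cost in a second pass flips that occurrence to the name reading. A real system would express the same preference ordering as arc weights in composed transducers rather than as Python dictionaries.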

Selected Papers from the 9th Conference on Natural Language Processing KONVENS 2008