Holger Voormann, Ulrike Gut
December 9, 2008
Over the past decades, language corpora have become indispensable tools for linguistic research and the development of linguistic theory. However, it is not yet widely acknowledged that the quality of corpus-based research and theories depends crucially on the quality of the corpora, not only in terms of their content and size but especially in terms of the accuracy and richness of their annotations. Nor has much systematic thought been given to how effectively the traditional corpus creation process addresses this problem. This paper proposes a novel approach to corpus creation – agile corpus creation – that addresses the problem of simultaneously maximizing corpus size and the quality and quantity of manual and automatic annotations while minimizing the time and cost involved in corpus creation. The central aspects of agile corpus creation are the reorganization of the traditionally linear and separate phases of corpus design, data collection, data annotation and corpus analysis, and the recognition of potential sources of error during corpus creation.