Hardware, Software und Anwendungen
Series: De Gruyter STEM


Documenting text-reuse (when one text includes a quotation or paraphrase of, or even an allusion to, another text) is one example of the problem of analysis and alignment. Even the cleverest analytical tools are of no avail unless their results can be cited, as scholarly evidence has been cited for centuries. This is where the CITE Architecture can help. CITE solves several problems at once. The first problem is the potentially endless number of analyses (by which we mean “desirable ways of splitting up a text”): do we choose to “read” a text passage-by-passage, clause-by-clause, word-by-word, or syllable-by-syllable? The second, related to the first, is that of overlapping hierarchies: the first two words of the Iliad are “μῆνιν ἄειδε,” but the first metrical foot of the poem is “μηνιν α”; the first noun-phrase is “μῆνιν οὐλομένην”, that is, the first word of the first line and the first word of the second line, with nothing in between. All of these issues are present when documenting text-reuse, and especially when documenting different (and perhaps contradictory) scholarly assertions of text-reuse. In our experience of over 25 years of computational textual analysis, no other technological standard addresses this problem as easily.


In this paper, we present a method for paraphrase extraction in Ancient Greek that can be applied to huge text corpora in interactive humanities applications. Since lexical databases and POS tagging are either unavailable or do not achieve sufficient accuracy for ancient languages, our approach is based purely on word embeddings and the word mover’s distance (WMD) [20]. We show how to adapt the WMD approach to paraphrase searching such that the expensive WMD computation has to be carried out for only a small fraction of the text segments contained in the corpus. Formally, the time complexity is reduced from O(N·K^3·log K) for the brute-force approach, which computes the WMD between each text segment of the corpus and the search query, to O(N + K^3·log K). N is the length of the corpus and K the size of its vocabulary. The method, which searches not only for paraphrases of the same length as the search query but also for paraphrases of varying lengths, was evaluated on the Thesaurus Linguae Graecae® (TLG®) [25]. The TLG consists of about 75·10^6 Greek words. We searched the whole TLG for paraphrases of given passages of Plato. The experimental results show that our method and the brute-force approach, with only very few exceptions, propose the same text passages in the TLG as possible paraphrases. The computation times of our method are in a range that allows its application in interactive systems and lets humanities scholars work productively and smoothly.
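The general idea of running the expensive WMD on only a fraction of the corpus can be illustrated with a prefilter. The sketch below is not the method of the paper, whose details the abstract does not give. It swaps in the well-known relaxed WMD, a cheap lower bound on the exact WMD, to rank text segments, so that only a shortlist would need the exact computation. The function names, the toy two-dimensional embeddings, and the English stand-in tokens are all assumptions made for illustration.

```python
import math
from collections import Counter


def nbow(tokens):
    """Normalized bag-of-words weights for a list of tokens."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}


def euclid(u, v):
    """Euclidean distance between two embedding vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def relaxed_wmd(a, b, emb):
    """Relaxed WMD between token lists a and b (a lower bound on exact WMD).

    Each word sends all of its weight to its nearest word on the other
    side; the larger of the two one-sided costs is returned.
    """
    def one_sided(src, dst):
        targets = set(dst)
        return sum(
            weight * min(euclid(emb[word], emb[t]) for t in targets)
            for word, weight in nbow(src).items()
        )

    return max(one_sided(a, b), one_sided(b, a))


def candidates(query, segments, emb, keep):
    """Return the `keep` segments with the smallest lower bound.

    Only these would be passed on to an exact (expensive) WMD solver.
    """
    ranked = sorted(segments, key=lambda seg: relaxed_wmd(query, seg, emb))
    return ranked[:keep]


# Hypothetical toy embeddings; real ones would be trained on the corpus.
emb = {
    "wrath": (0.0, 1.0), "anger": (0.1, 0.9),
    "sing": (1.0, 0.0), "chant": (0.9, 0.1),
    "ship": (5.0, 5.0), "sea": (5.0, 6.0),
}
query = ["wrath", "sing"]
segments = [["anger", "chant"], ["ship", "sea"], ["wrath", "sea"]]
shortlist = candidates(query, segments, emb, keep=2)
```

Because the relaxed distance never exceeds the exact WMD, a segment whose bound is already worse than the best exact distances found so far can be discarded without solving the transport problem for it.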



A shorter version of the paper appeared in German in the final report of the Digital Plato project, which was funded by the Volkswagen Foundation from 2016 to 2019 [35], [28].


This article presents a commented history of automatic collation, from the 1940s until the end of the twentieth century. We look at how collation was progressively mechanized and automated with algorithms, and how the issues raised throughout this period carry on into today’s scholarship. In particular, we examine the inner workings of early collation algorithms and their different steps in relation to the formalization of the Gothenburg Model. The work of the scholars involved in automatic collation also offers fascinating insights into the collaborations between humanists and computer scientists, and into the reception of computers by philologists.