Documenting text-reuse (when one text includes a quotation or paraphrase of, or even an allusion to, another text) is one example of the problem of analysis and alignment. The cleverest analytical tools will be of no avail unless their results can be cited, as scholarly evidence has been cited for centuries. This is where the CITE Architecture can help. CITE solves several problems at once. The first problem is the endless number of possible analyses (by which we mean “desirable ways of splitting up a text”): do we choose to “read” a text passage-by-passage, clause-by-clause, word-by-word, or syllable-by-syllable? The second, related to the first, is that of overlapping hierarchies: the first two words of the Iliad are “μῆνιν ἄειδε,” but the first metrical foot of the poem is “μηνιν α”; the first noun-phrase is “μῆνιν οὐλομένην”, that is, the first word of the first line and the first word of the second line, with nothing between. All of these issues are present when documenting text-reuse, and especially when documenting different (and perhaps contradictory) scholarly assertions of text-reuse. In our experience, over 25 years of computational textual analysis, no other technological standard can address this problem as easily.
In this paper, we present a method for paraphrase extraction in Ancient Greek that can be applied to huge text corpora in interactive humanities applications. Since lexical databases and POS tagging are either unavailable or do not achieve sufficient accuracy for ancient languages, our approach is based on pure word embeddings and the word mover’s distance (WMD). We show how to adapt the WMD approach to paraphrase searching such that the expensive WMD computation has to be performed for only a small fraction of the text segments contained in the corpus. Formally, the time complexity will be reduced from to , compared to the brute-force approach, which computes the WMD between each text segment of the corpus and the search query. N is the length of the corpus and K the size of its vocabulary. The method, which searches not only for paraphrases of the same length as the search query but also for paraphrases of varying lengths, was evaluated on the Thesaurus Linguae Graecae® (TLG®). The TLG consists of about Greek words. We searched the whole TLG for paraphrases of given passages of Plato. The experimental results show that our method and the brute-force approach, with only very few exceptions, propose the same text passages in the TLG as possible paraphrases. The computation times of our method are in a range that allows its application in interactive systems and lets humanities scholars work productively and smoothly.
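The idea of running the expensive WMD computation on only a fraction of the segments can be illustrated with a cheap lower bound used as a filter. The sketch below is not the method of the paper: it swaps in the relaxed WMD bound known from the WMD literature, in which every word sends all of its weight to its nearest counterpart. Because this bound never exceeds the true WMD, any segment whose bound is above the threshold can be discarded without computing the exact distance. The embedding table `EMB`, the function names, and the threshold are hypothetical and chosen only for illustration.

```python
import math
from collections import Counter

# Toy 2-d embeddings; hypothetical values, not taken from any real model.
EMB = {
    "good": (1.0, 0.0),
    "fine": (0.9, 0.1),
    "man": (0.0, 1.0),
    "person": (0.1, 0.9),
    "ship": (-1.0, -1.0),
}


def dist(u, v):
    """Euclidean distance between the embeddings of two words."""
    return math.dist(EMB[u], EMB[v])


def one_sided(src, dst):
    """Cost when each source word moves all its weight to its nearest
    destination word. Assumes both token lists are non-empty."""
    counts = Counter(src)
    total = sum(counts.values())
    targets = set(dst)
    return sum(
        c / total * min(dist(w, t) for t in targets)
        for w, c in counts.items()
    )


def relaxed_wmd(a, b):
    """Relaxed WMD: the larger of the two one-sided costs.
    This is a lower bound on the exact word mover's distance."""
    return max(one_sided(a, b), one_sided(b, a))


def candidates(query, segments, threshold):
    """Keep only segments whose lower bound does not exceed the threshold;
    only these would need the exact (expensive) WMD computation."""
    return [s for s in segments if relaxed_wmd(query, s) <= threshold]
```

A call such as `candidates(["good", "man"], [["fine", "person"], ["ship", "ship"]], 0.5)` keeps the first segment and drops the second, so an exact WMD solver would run on one segment instead of two.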
This article presents a commented history of automatic collation, from the 1940s until the end of the twentieth century. We look at how collation was progressively mechanized and automated with algorithms, and how the issues raised throughout this period carry on into today’s scholarship. In particular, we examine the inner workings of early collation algorithms and their different steps in relation to the formalization of the Gothenburg Model. The work of scholars engaged in automatic collation also offers fascinating insights into the collaborations between humanists and computer scientists, and into the reception of computers by philologists.
Proceeding from the debate on intertextuality, some considerations are presented here for Literary and Historical Studies that suggest a theory-driven approach to applying algorithm-based procedures. It will be shown that methodological tensions between qualitative and quantitative approaches can be resolved at the same time in this way. On this basis, the approach combines intertextuality theory with an algorithm-based procedure (here, a search based on Word Mover’s Distance).
Authorship verification is the task of determining whether two texts were written by the same author based on an analysis of writing style. Author obfuscation is the adversarial task of preventing a successful verification by altering a text’s style so that it no longer resembles that of its original author. This paper introduces new algorithms for both tasks and reports on a comprehensive evaluation to ascertain how well the state of the art in authorship verification withstands obfuscation.
After introducing a new generalization of the well-known unmasking algorithm for short texts, thus completing our collection of state-of-the-art algorithms for verification, we introduce an approach that (1) models writing style difference as the Jensen-Shannon distance between the character n-gram distributions of texts, and (2) manipulates an author’s writing style in a sophisticated manner using heuristic search. For obfuscation, we explore the huge space of textual variants in order to find a paraphrased version of the to-be-obfuscated text that has a sufficiently high Jensen-Shannon distance at minimal cost in terms of text quality. We analyze, quantify, and illustrate the rationale of this approach, define paraphrasing operators, derive text-length-invariant thresholds for termination, and develop an effective obfuscation framework. Our authorship obfuscation approach defeats the presented state-of-the-art verification approaches while keeping text changes to a minimum. As a final contribution, we discuss and experimentally evaluate a reverse obfuscation attack against our obfuscation approach, as well as possible remedies.
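The style-difference measure named above can be sketched in a few lines. The example below computes the Jensen-Shannon distance (the square root of the Jensen-Shannon divergence) between the character n-gram distributions of two texts. The n-gram order `n=3`, the base-2 logarithm, and the function names are assumptions made for illustration; the source does not state which settings the authors use.

```python
import math
from collections import Counter


def char_ngram_dist(text, n=3):
    """Relative frequencies of overlapping character n-grams.
    n=3 is an assumed default, not a value taken from the paper."""
    counts = Counter(text[i:i + n] for i in range(len(text) - n + 1))
    total = sum(counts.values())
    return {gram: c / total for gram, c in counts.items()}


def js_distance(p, q):
    """Jensen-Shannon distance between two discrete distributions given
    as dicts. With base-2 logarithms the result lies in [0, 1]."""
    # Mixture distribution over the union of both supports.
    m = {g: 0.5 * (p.get(g, 0.0) + q.get(g, 0.0)) for g in set(p) | set(q)}

    def kl(a):
        # Kullback-Leibler divergence of a from the mixture m.
        return sum(pa * math.log2(pa / m[g]) for g, pa in a.items() if pa > 0)

    jsd = 0.5 * kl(p) + 0.5 * kl(q)
    return math.sqrt(max(jsd, 0.0))  # guard against tiny negative rounding
```

Two identical texts yield a distance of 0, and two texts that share no n-gram at all yield 1, so an obfuscator in this setting would search for edits that push the value upward past some threshold.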
Extracting information from large biological datasets is a challenging task, due to the large data size, high dimensionality, noise, and errors in the data. Gene expression data contains information about which gene products have been formed by a cell, thus representing which genes have been read to activate a particular biological process. Understanding which of these gene products can be related to which processes can, for example, give insight into how diseases evolve and might give hints about how to fight them.
The Next Generation RNA-sequencing method emerged over a decade ago and is nowadays state-of-the-art in the field of gene expression analyses. However, analyzing these large, complex datasets is still a challenging task. Many of the existing methods do not take into account the underlying structure of the data.
In this paper, we present a new approach for RNA-sequencing data analysis based on dictionary learning. Dictionary learning is a sparsity-enforcing method that has been widely used in many fields, such as image processing, pattern classification, and signal denoising. We show how, for RNA-sequencing data, the atoms in the dictionary matrix can be interpreted as modules of genes that either capture patterns specific to different types or represent modules that are reused across different scenarios. We evaluate our approach on four large datasets with samples from multiple types. A Gene Ontology term analysis, which is a standard tool for helping to understand the functions of genes, shows that the gene sets found are in agreement with the biological context of the sample types. Further, we find that the sparse representations of samples using the dictionary can be used to identify type-specific differences.
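For orientation, one common way to write the dictionary learning problem is given below. This is the widely used ℓ1-penalized formulation and is an assumption here: the source does not state which objective or constraint the authors adopt. The symbols are also assumed for illustration: X is the expression matrix (genes by samples), the columns d_k of D are the K atoms (the gene modules), R holds the sparse per-sample coefficients, and λ controls the degree of sparsity.

```latex
\min_{D,\,R}\ \tfrac{1}{2}\,\lVert X - D R \rVert_F^2 + \lambda\,\lVert R \rVert_1
\quad \text{s.t.} \quad \lVert d_k \rVert_2 \le 1,\quad k = 1,\dots,K
```

Under this reading, each column of R says which few atoms are active in a sample, which is what makes the sparse representations comparable across sample types.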
Network science methodology is increasingly applied to a large variety of real-world phenomena, often leading to big network data sets. Thus, networks (or graphs) with millions or billions of edges are more and more common. To process and analyze these data, we need appropriate graph processing systems and fast algorithms. Yet many analysis algorithms were pioneered on small networks, when speed was not the highest concern. Developing an analysis toolkit for large-scale networks thus often requires faster variants, both from an algorithmic and an implementation perspective. In this paper we focus on computational aspects of vertex centrality measures. Such measures indicate the (relative) importance of a vertex based on the position of the vertex in the network. We describe several common (and some recent and thus less established) measures, optimization problems in their context, and algorithms for solving these problems efficiently. Our focus is on (not necessarily exact) performance-oriented algorithmic techniques that enable significantly faster processing than the previous state of the art, often making it possible to process massive data sets quickly and without resorting to distributed graph processing systems.
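As a small illustration of trading exactness for speed, the sketch below computes closeness centrality on an unweighted, undirected, connected graph in two ways: exactly, with one breadth-first search per vertex, and approximately, with breadth-first searches from only `k` sampled pivot vertices. Pivot sampling is a well-known approximation idea and is used here as an example; it is not claimed to be one of the specific techniques covered by the paper. The function names and the adjacency-dict representation are assumptions.

```python
import random
from collections import deque


def bfs_distances(adj, source):
    """Hop distances from source to every reachable vertex."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist


def closeness_exact(adj):
    """Closeness (n - 1) / sum of distances, using one BFS per vertex.
    Assumes a connected graph with at least two vertices."""
    n = len(adj)
    return {v: (n - 1) / sum(bfs_distances(adj, v).values()) for v in adj}


def closeness_sampled(adj, k, seed=0):
    """Estimate closeness from BFS runs started at k random pivots.
    The distance sum of each vertex is extrapolated by the factor n / k."""
    nodes = list(adj)
    n = len(nodes)
    pivots = random.Random(seed).sample(nodes, k)  # without replacement
    total = dict.fromkeys(nodes, 0)
    for p in pivots:
        for v, d in bfs_distances(adj, p).items():
            total[v] += d  # undirected graph: d(p, v) equals d(v, p)
    # A vertex that is the only pivot has no distance information; report 0.0.
    return {v: (n - 1) / (n / k * total[v]) if total[v] else 0.0 for v in nodes}
```

With `k` equal to the number of vertices the estimate coincides with the exact values; with a smaller `k` the number of searches drops from n to k, at the price of sampling error.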