Trismegistos: Optimizing Interoperability for Texts from the Ancient World

Although its origins lie with the Prosopographia Ptolemaica, a project studying people who lived in Ptolemaic Egypt (332–30 BCE), Trismegistos (TM) has developed into an interdisciplinary platform for the study of the ancient world in general, from 800 BCE to 800 CE: texts, places, people, collections. Setting up this very diverse set of databases has only been possible through the availability of full text corpora, new digital processing techniques, and the “exponentiality” permitted by interconnectivity. By bringing everything together in a single environment, Trismegistos has facilitated quantitative studies of several phenomena; this approach remains promising and will hopefully become more widespread. TM’s main aim, however, is interoperability through the spread of stable identifiers, as an instrument to build a Linked Open Data environment for the ancient world.

Peremans' approach to this prosopography was, certainly for that time, remarkable: all-inclusive and interdisciplinary avant la lettre. Together with his assistant Van't Dack, he decided that documentary papyri and ostraca would be the core material, but information from epigraphic texts or literary sources would not be neglected either. As a Flemish nationalist, Peremans also insisted from the beginning that texts in the local Egyptian languages, Demotic and hieroglyphic, would be included, even if he did not know them himself.
All this resulted in a series of printed volumes, each covering specific corporate categories, more or less following social hierarchy. The volumes were published between 1950 and 1968, with an index in 1975. However, the supplements to the early volumes, published in 1975 and 1981, already illustrated some infrastructural problems.2 Within each category, people were ordered alphabetically by their name and assigned a number for ease of reference. Newly added individuals thus had to receive a letter in addition to a number, e.g. PP VIII 3844a.
This was obviously going to be a problem in the end, but fortunately, technology offered a solution in the form of the computer. The PP was an early adopter of this innovation, and started with the "computerisation of the documentation in a relational database" in the mid-eighties, around the time of Peremans' death in 1986 (Mooren, 2001). As it was never really a project with separate funding, much of this conversion work was carried out by assistants, and took quite some time. Paradoxically, the advent of the system of project funding in Leuven in the nineties did not really speed up the process, as Willy Clarysse and later Katelijn Vandorpe successfully applied for other projects such as the Leuven Homepage of Papyrus Collections [LHPC], the Leuven Database of Ancient Books [LDAB], the Fayum project, or the Archives project.3 Although the PP thus lay dormant in the early 2000s, the systematic data collection for it and for the other projects would turn out to be instrumental in the creation of Trismegistos. Together with a database of Demotic papyri by the late Heinz-Josef Thissen, professor of Egyptology at Cologne University, the table of texts collected was at the core of the proposal for the project 'Multilingualism and Multiculturalism in Graeco-Roman Egypt' [MaMiGRE], during the course of which Trismegistos would be created (Depauw & Gheldof, 2014).
More than just delivering data, however, the PP also inspired the new project in its approach: all-inclusive and interdisciplinary. Even if MaMiGRE was initially intended to be an Egyptological supplement to the already existing Greek papyrological projects such as the Heidelberger Gesamtverzeichnis griechischer Urkunden aus Ägypten [HGV]4 and the LDAB, it soon broadened its horizon. Rather than being limited to papyrology, the Graeco-Roman period, and sources in Egyptian languages and scripts, TM was meant, when it was launched in 2006, to be a platform for the study of any type of text dating to the period from 800 BCE to 800 CE, in any language or script and on any writing surface.
In these initial stages, TM Texts still had an important geographical limitation, however, in that it only dealt with Egypt and the Nile Valley. This restriction only disappeared gradually, when after the end of MaMiGRE in 2008 I returned to Leuven and started contemplating the idea of widening our scope to include the entire (Western) ancient world. In 2010, through the mediation of James Cowey, the first contacts were made with the Epigraphische Datenbank Heidelberg [EDH].5 This eventually allowed us to become a part of the Europeana EAGLE project from 2013 onwards (Orlandi et al., 2017). It also led us to include all Latin inscriptions in TM Texts, a significant increase in numbers as well, from roughly 100,000 items (for Egypt) to about 600,000 records. Keeping the interdisciplinary spirit of the PP and TM in mind, however, we also sought cooperation with other projects dealing with the smaller indigenous languages. We thus included 10,000 Etruscan texts through a cooperation with Gerhard Meiser (Meiser, 2014),6 entered the Messapian (Simone & Marchesini, 2002), Gaulish (Recueil des Inscriptions Gauloises, 1985-2002) and Italic (Crawford et al., 2011) evidence on the basis of printed corpora, integrated the Raetic,7 Ogham (and other Celtic from Britain)8 and Runic9 on the basis of existing databases, and also worked together with regional databases such as Inscriptiones Siciliae to have exhaustive coverage for specific regions.10
TM Texts is still far from complete, however. Our coverage is patchy for languages such as Libyan; Iberian and some other palaeo-Iberian languages are missing completely, as is Punic; we only have the Aramaic material for Egypt, and this is true for most other Semitic languages as well. Our main limitation today, however, is that the Greek inscriptions are still not included, especially for the Greek East outside Africa. We hope to remedy this in the not too distant future, in cooperation with key research bodies such as the Packard Humanities Institute [PHI]11 and the Supplementum Epigraphicum Graecum [SEG].12

New Techniques & Other Trismegistos Databases
So far the focus has been on the TM Texts database (680,123 records), and rightly so, since the sources lie at the basis of all scholarly research on the history of the ancient world. Nonetheless, Trismegistos also offers other databases, most of which have grown organically from earlier Leuven projects. Trismegistos People is a database of currently 496,702 attestations of people (370,086 records) and personal names (33,325 records) in TM Texts. Although in its current state it cannot really be called a prosopography because people have not been identified systematically across texts (except perhaps for the Ptolemaic period), it clearly builds upon the PP and is currently limited to Egypt. As a systematization of information available in the LDAB, TM Authors deals with ancient authors (5,720 records) and their works (4,847 records, far from complete). At the core of TM Places lies the Fayum project, although it now includes many places (52,130 records) outside Egypt as well, covering both their use as provenance (705,858 records) and their mention in texts (217,106 records). The TM Collections database (3,750 records), like its predecessor the LHPC, focuses on the current whereabouts of ancient sources.13
Setting up all of these large-scale databases in the last ten years has only been possible because of the availability of full text corpora, new digital processing techniques, and the "exponentiality" permitted by interconnectivity. To start with the first of these, it was the availability of the full text of Greek papyri in the Duke Databank of Documentary Papyri [DDbDP] that allowed us to develop a Named Entity Recognition [NER] tool to filter out personal names and place names.14 The NER allowed us to work much faster than would have been possible by purely human input (Depauw & Van Beek, 2009). This is illustrated nicely by the fact that the Demotic evidence, despite the significantly smaller size of the Demotic corpus, is still only partially in the TM People database, whereas the Greek is covered completely, which is entirely due to the fact that Demotic is not available as digital full text. The NER system we set up does not only deal with the typically Greek, and relatively simple, naming system in which most individuals are identified by name and father's name. It can also cope with far more complicated onomastic identifying clusters caused by the Roman tria nomina (think of Gaius Iulius Caesar) and the increasingly common addition of mothers, grandfathers etc. to the identification string. Finally, as the DDbDP also included the TM identifiers (discussed further below), we could easily connect the information distilled from the texts to the data that was already available in the TM Texts database: publications, provenance, date, whereabouts etc.
It was this combination of the availability of full text, NER and interconnectivity which allowed (and allows) TM to set up further databases dealing with specific aspects of ancient texts, often in conjunction with other projects and scholars. TM Text Irregularities was developed through a joint effort of Joanne Stolk and myself, to study the corrections both modern editors and ancient authors made in Greek papyri (Depauw & Stolk, 2015).15 TM Editors sprang from a question to the PAPY mailing list about papyri edited after 1980, and now identifies over 20,000 modern authors and editors, with special attention to their edition of texts (Depauw & Broux, 2016).16 TM Abbreviations & Formulae is the result of NER on the full text as available in the Epigraphische Datenbank Clauss-Slaby [EDCS]17 of Latin inscriptions.18 It is still under construction, as is the website we are developing on the basis of Ana Blasco's PhD study on the Greek transliteration of Egyptian names (Blasco Torres, 2017). In fact, one could call this last example a double derivative: it builds on the TM People database of names and name variants, which in turn draws in information from TM Texts. Finally, TM Calendar (in cooperation with Sofie Remijsen) is a first attempt at systematizing our date information.19 We hope to elaborate on this further in the future, in cooperation with projects such as PeriodO and Graph of Dated Objects and Texts [GODOT].20
Apart from NER, TM has embraced some other important technical innovations from 2012 onwards. As TM Networks (founded by Yanne Broux) illustrates, we have experimented with what is traditionally called Social Network Analysis [SNA] but is now increasingly just called network analysis (Broux & Depauw, 2015a).21 This method of studying connectedness can be used not only to study relations between people, but also between places, names or even Demotic epistolary formulae (Broux, 2016; Dogaer & Depauw, 2017). We have also developed a new way of visualising chronological evolutions, one that is especially useful for including information from imprecisely dated texts (Van Beek & Depauw, 2013).
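To make the problem of imprecisely dated texts concrete, here is a minimal Python sketch of one simple strategy, which may differ in detail from the method published by Van Beek & Depauw (2013): each text contributes a total weight of 1, spread evenly over every year of its date range. The function name `weighted_counts` and the sample ranges are hypothetical, and years are treated as plain integers (negative for BCE), ignoring the absence of a year 0.

```python
from collections import defaultdict


def weighted_counts(date_ranges):
    """Spread each text's weight of 1 evenly over the years of its date range."""
    counts = defaultdict(float)
    for start, end in date_ranges:
        if end < start:
            raise ValueError(f"invalid range: {start}..{end}")
        span = end - start + 1  # number of years covered, both ends inclusive
        for year in range(start, end + 1):
            counts[year] += 1.0 / span
    return dict(counts)


# Hypothetical sample: one text dated exactly to year 100,
# one dated only to the four-year window 100-103.
sample = [(100, 100), (100, 103)]
per_year = weighted_counts(sample)
```

Under this assumption a precisely dated text counts fully in its year, while a text dated only to a century adds a small fraction to each of its hundred years, so the yearly values still sum to the number of texts.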
One very recent but exciting development has come about through a PhD student, Alek Keersmaekers, whom I co-supervise together with Toon Van Hal (Greek) and Dirk Speelman (corpus linguistics). Starting from the full text of the DDbDP available on GitHub, he has morphologically annotated all the words (part-of-speech tagging and lemmatizing) in XML through a probabilistic model with an accuracy of ca. 95% for non-proper names. Again through a co-operation with TM, he could draw in all the textual metadata, and was also aided by the TM Text Irregularities database in choosing between the regularized version and the original. We converted his XML to MySQL and made this into the Trismegistos Words database (counting 4,513,494 records), which became available in January 2018 (Keersmaekers & Depauw, 2018).22

The Raison d'Être of Trismegistos
This survey of the roots of the TM project and its development and expansion through new digital techniques may shed some light on the genesis of the project, but I have said precious little so far about the underlying philosophy of such a broad set of tables or databases.
At the heart of our approach lies the motivation to provide a tool that facilitates access to sources from the ancient world and allows us to study phenomena that transcend disciplinary boundaries. It is only when everything is available in a single system that it is easy to count and quantify. The quantitative method has hitherto been quite marginal in the study of the ancient world, but large corpora of papyri and inscriptions offer interesting new prospects. We have, for instance, revisited the old discussion of the rise of Christianity in the fourth century AD on the basis of the use of Christian names (Depauw & Clarysse, 2013; Depauw & Clarysse, 2015); the increasing use of mother's names in identification clusters (Broux & Depauw, 2015b); the practice of naming your child after a Hellenistic queen (Clarysse & Broux, 2016); or the rise in popularity of double names and hybrid names in the Roman period (Broux, 2015; Dogaer, 2015a; Dogaer, 2015b; Dogaer & Depauw, 2017). In other publications, networks, also a form of quantification, are used to study co-occurrence of place names or combinations of epistolary formulae (Broux & Depauw, 2015a). Much more is possible, and I hope that others will start using the data in TM for their own quantitative research.
This brings me to interoperability. From the outset, TM wanted to bring together projects, each collecting data within their scholarly disciplines. TM was never intended to replace projects, if only for the lack of expertise on most of the languages and datasets covered by TM Texts. This is also the reason why we, as a rule, do not include the full text itself, nor images of the objects on which the texts are written. Our focus is on (limited) metadata, i.e. information about texts, rather than the texts themselves.
Also, to stimulate cooperation, TM provides stable identifiers for all areas it covers. These identifiers consist of the name of the table or database, and a simple number without meaning that merely identifies the entity and points to information about it in the Trismegistos database. They exist in a human-readable format (e.g. TM Nam 1234) or as a "clean" URI (e.g. [http://www.trismegistos.org/name/1234]). TM meanwhile has IDs for texts, people, attestations of people, personal names, places, (ancient) authors and their works, (modern) editors, collections, and many more things.
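The relation between the two identifier formats can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the pairing of the label "Nam" with the path "name" and the existence of a "text" path come from the examples in this article, while the label "Text", the table `TABLE_PATHS`, and the function names `to_uri` and `from_uri` are hypothetical and not part of any published Trismegistos interface.

```python
import re

BASE = "http://www.trismegistos.org"

# Label in the human-readable ID -> path segment in the URI.
# "Nam" -> "name" follows the article's example; "Text" is an assumed label.
TABLE_PATHS = {"Nam": "name", "Text": "text"}
PATH_LABELS = {path: label for label, path in TABLE_PATHS.items()}


def to_uri(human_id):
    """Convert an ID such as 'TM Nam 1234' into its clean URI."""
    match = re.fullmatch(r"TM\s+([A-Za-z]+)\s+(\d+)", human_id.strip())
    if match is None:
        raise ValueError(f"not a recognised identifier: {human_id!r}")
    label, number = match.groups()
    if label not in TABLE_PATHS:
        raise ValueError(f"unknown table label: {label!r}")
    return f"{BASE}/{TABLE_PATHS[label]}/{int(number)}"


def from_uri(uri):
    """Convert a clean URI back into the human-readable form."""
    match = re.fullmatch(re.escape(BASE) + r"/([a-z]+)/(\d+)", uri.strip())
    if match is None:
        raise ValueError(f"not a recognised URI: {uri!r}")
    path, number = match.groups()
    if path not in PATH_LABELS:
        raise ValueError(f"unknown table path: {path!r}")
    return f"TM {PATH_LABELS[path]} {int(number)}"
```

Because the number carries no meaning of its own, the conversion is purely mechanical, which is what makes such identifiers easy for other projects to store and exchange.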
Perhaps the most crucial identifier is the TM Text ID, normally abbreviated as "TM ID" [http://www.trismegistos.org/text/1234]. It points to a text or document, in the sense of a set of intentionally related units of linguistically coherent language, written on a physically separate writing surface. The criterion of intentionality is to some extent arbitrary, in the sense that in some cases it is debatable whether two texts actually appear on the same writing surface because their scribes and authors wanted them to. It is, nevertheless, a necessary factor, as otherwise texts appearing on the same object as the result of unrelated reuse would get only a single ID. Certainly, in cases where there is no clearly physically separate writing surface (e.g. a desert rock), this would lead to the accumulation of unrelated texts under a single number.
We are very pleased that the Digital Archive for the Study of pre-Islamic Arabian inscriptions [DASI]23 has agreed to have its material included in Trismegistos. We hope the addition of 7,719 records will make the South Arabian inscriptions better known to scholars of the ancient world, and increase interoperability and standardization. As TM (and other) identifiers spread to as many projects as possible, projects can cooperate and exchange information more easily. In a Linked Open Data structure, this would permit specialized projects to connect to TM and pull in varied metadata about provenance, date, and publications. This can then be used as background information for the specific topic that forms the focus of attention. In fact, Linked Open Data has the potential to speed up small projects significantly, similar to the development of new tables and databases in TM (Depauw & Dzierzbicka, 2018). Together with other databases such as Pleiades and Pelagios for places or SNAP for people (Simon, Barker, Isaksen, & de Soto Cañamares, 2015; Depauw et al., 2017),24 a graph environment can be created that has great potential to bring knowledge about the ancient world closer to everyone.
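How a shared identifier lets a specialized project pull in background metadata can be illustrated with a small Python sketch. It assumes nothing about any real Trismegistos service: the function `enrich`, the field names, and all sample values are hypothetical placeholders, and in practice the shared metadata would come from a Linked Open Data source rather than a hard-coded dictionary.

```python
def enrich(local_records, tm_metadata):
    """Attach shared metadata to project records that carry a TM Text ID."""
    enriched = []
    for record in local_records:
        shared = tm_metadata.get(record["tm_id"], {})
        # Local fields take precedence over the shared background metadata.
        enriched.append({**shared, **record})
    return enriched


# Hypothetical shared metadata, keyed by TM Text ID (placeholder values only).
tm_metadata = {
    1234: {"provenance": "<provenance>", "date": "<date>"},
}

# Hypothetical records from a small specialized project.
local_records = [
    {"tm_id": 1234, "note": "project-specific observation"},
    {"tm_id": 9999, "note": "no shared metadata available"},
]

combined = enrich(local_records, tm_metadata)
```

The small project stores only its own observations plus the identifier; everything else is looked up through the ID, and records without a match are simply left as they are.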