9 A Methodological Framework for the Epigraphic South Arabian Lexicography. The Case of the Sabaic Online Dictionary

Abstract: This paper describes the concept and functionalities of an online reference dictionary for Sabaic, which aims to present all extant lexical material of this Ancient South Arabian language. After an introduction to the features of the corpus, several methodological issues and the solutions adopted within the project are illustrated, focusing in particular on the annotation of morphological analysis (treatment of ambiguous forms, homographs, heterographs with identical meaning, variant readings, incorrect forms). The concept developed to present the material online is also described.


General Remarks
Over recent decades, a large number of new Ancient South Arabian inscriptions have come to light. Some have been published in collections such as Arbach & Schiettecatte, 2006 or Prioletta, 2013, to name but a few, but most have appeared in scattered editions comprising only a few texts. This material not only contains hitherto unknown lexemes, but also calls for a reconsideration of quite a number of well-known terms in the Sabaic lexicon. The available dictionaries of Sabaic, such as Beeston et al., 1982 or Biella, 1982, thus no longer reflect the present state of research. The same holds true for dictionaries of other Ancient South Arabian idioms such as Qatabanic (Ricks, 1989) and Minaic (Arbach, 1993). Moreover, apart from a considerable quantitative increase in the material, we are also confronted with a qualitative leap, as completely new text genres have emerged, particularly among the everyday correspondence on wooden sticks (cf. Stein, 2010 & Maraqten, 2014). Though revision of at least part of the material included in the extant dictionaries was rightly demanded in both well-meaning (e.g. Lundin, 1987) and cynical reviews (cf. Jamme, 1985, pp. 202-269), a revised second edition of Beeston et al., 1982 never appeared. An up-to-date presentation of the lexical data gathered over the past 30 years is therefore clearly a desideratum of the scientific community, both within and outside Ancient South Arabian Studies proper.

Scope of the Project
The project described in the present paper¹ aims at a reconsideration of the lexical material. It will result in a reference dictionary ("Belegwörterbuch") that will include a complete lexical survey of the Sabaic material published so far. In contrast to other projects on Ancient Arabian epigraphy featured in this volume, such as DASI and OCIANA, it is not focused on the epigraphic corpus as such, but uses the latter as a basis for lexicographic work. Digitization of material is thus not considered a result intended for public use, but rather a practical means to collect and organize large amounts of data.
The application runs on a Microsoft Windows Server with Microsoft IIS installed as the web server. The data is stored on an instance of Microsoft SQL Server Express. The programming languages used are C# and JavaScript. While the internal working platform was designed in ASP.NET, the more modern ASP.NET MVC is used for the public web presence. Furthermore, the JavaScript framework jQuery is used in the web presence.
Two different concepts, adjusted to the various parts of the working process, are used for data management. First, a collection of the epigraphic material is needed as a material base to reference lexemes. For the annotation of texts an XML format was chosen. An editing view of each annotated text is generated from the XML document as an HTML view via a JavaScript routine. Annotations are directly assigned to words as XML attributes. The XML documents thus generated are stored in the database. Following the grammatical analysis, the information contained within these XML documents is distributed across the tables of the relational database to enable further processing.
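The two-step storage described above (word-level annotations as XML attributes, later distributed across relational tables) might be sketched as follows. This is a minimal Python illustration and not the project's actual C# and SQL Server implementation; the element name `w`, the attribute names `n`, `lemma` and `cat`, the text identifier `demo-1`, the lemma identifiers and the table layout are all assumptions made for the example.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Hypothetical annotated text: one element per word, annotations as attributes.
SAMPLE_XML = """<text id="demo-1">
  <w n="1" lemma="bn.1" cat="noun">bn</w>
  <w n="2" lemma="qrn.1" cat="noun">qrn</w>
  <w n="3" lemma="bn.3" cat="preposition">bn</w>
</text>"""


def xml_to_rows(xml_source):
    """Flatten the word-level attributes of one annotated text into tuples."""
    root = ET.fromstring(xml_source)
    text_id = root.get("id")
    return [
        (text_id, int(w.get("n")), w.text, w.get("lemma"), w.get("cat"))
        for w in root.iter("w")
    ]


def store_rows(connection, rows):
    """Write the flattened annotations into a (hypothetical) relational table."""
    connection.execute(
        "CREATE TABLE IF NOT EXISTS attestation "
        "(text_id TEXT, position INTEGER, form TEXT, lemma TEXT, category TEXT)"
    )
    connection.executemany(
        "INSERT INTO attestation VALUES (?, ?, ?, ?, ?)", rows
    )


# An in-memory SQLite database stands in for the real database server.
connection = sqlite3.connect(":memory:")
store_rows(connection, xml_to_rows(SAMPLE_XML))

# Once the data is relational, aggregate queries become straightforward.
counts = dict(
    connection.execute(
        "SELECT category, COUNT(*) FROM attestation GROUP BY category"
    )
)
```

Keeping the XML document as the editing format and the tables as a derived view means that the tables can be regenerated whenever an annotation is corrected.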
Second, a complete set of the interpretations given in the literature is collected. This collection is meant to facilitate lexicographic work and comprises translations from text editions, extant dictionaries and further lexicographical material published both in compendia (e.g., Sima, 2000) and specialized articles (e.g., Robin, 2013), or in studies on Ancient South Arabian culture and history (e.g., Beeston, 1976), to name but the most important genres. The collected material is further enriched with etymological parallels from other Semitic languages. This material has no intrinsic relation to a specific inscription. It is stored directly in the tables of the relational database.

Character of Material
Sabaic is part of a group of several interrelated, epigraphically documented Semitic languages, commonly referred to as Ancient South Arabian, which were spoken in the territory of modern Yemen from the early 1st millennium BCE up to the 6th century CE (cf. Stein, 2011). The material, however, is rather extensive both with respect to the absolute length of attested texts and as far as lexical variance is concerned, at least when compared to contemporaneous European texts of the same genre, e.g. Latin inscriptions. Reading is considerably facilitated by the regular use of word dividers.² As with most Semitic languages, the script is highly defective: basically only consonants are noted, with the sole exception of final long vowels.³ Abbreviations are absent. Inscriptions as material objects are written on durable material, mostly stone or metal. They thus constitute a primary source that may be damaged or destroyed, but is not prone to textual alterations. However, this only applies to the object as such. A comparatively large portion of the Ancient South Arabian material is only known from copies or transcriptions of varying quality made by modern scholars. As photographic documentation, though already requested in the review literature at a comparatively early time (cf. Schlobies, 1936, p. 58, n. 1), is often lacking or of poor quality, the actual appearance of these inscriptions can no longer be checked. This material contains a certain amount of corrupt forms, including obvious faults (cf. Stein, 2002, esp. the exhaustive examples pp. 447-452). The latter are often corrected by later re-editions based on photographs or rediscovered originals.⁴ For a large amount of poorly published material there is still no reliable documentation available. Nevertheless, rectification of obvious faults was often undertaken by later editions. As a result, these inscriptions present a certain number of variant readings and can thus rather be compared to manuscripts.

Collection of Material
Ancient South Arabian inscriptions are scattered over a wide range of different publications, which may comprise huge collections of different material (e.g., CIH and RES) and full inventories of excavations (e.g., Jamme, 1962), but may also focus on individual texts or passages. An exhaustive collection of material is being undertaken by DASI, but is not yet complete for Sabaic. The compilation of material underlying the present dictionary started back in the late 1990s, then still in a DOS format. This collection consisted mainly of analytic transcriptions and bibliographic information. The material was originally meant as a basis for grammatical studies and was thoroughly prepared for this purpose. Information on Semitic roots or fuller, non-assimilated forms was thus encoded. Since the latter are also important for lexicographic work, the compiled file, subsequently completed and augmented by newly published material, was considered an ideal base for a dictionary. The data has by now been converted into an XML format.

Organisation of Material
The dictionary follows a modular approach. The Sabaic material is split into several subcorpora (such as votive inscriptions, building inscriptions, juridical texts etc.), which are processed separately. For the time being, the dictionary includes major parts of the votive inscriptions, which form the most comprehensive genre among the Sabaic texts (an up-to-date account of the incorporated material can be found on the website).
Within these corpora, lexicographic work is organized according to inscriptions, i.e. all lexical items of a given text are considered, irrespective of alphabetical order. Work thus started with a micro-dictionary containing the vocabulary of a single inscription, this core being constantly enlarged with material from other texts. Therefore, not only does the actual number of lexemes increase over time, but also their extent.
The project aims to present all extant lexical material. This also includes probable or even actual faults. Variant readings and, to a lesser extent, unusual or even faulty orthographies are thus included in the basic text.⁵ However, uncertain and incorrect readings are marked as such. A reconstruction of "correct" texts is intended for presentation purposes, but often turns out to be impossible. This is, in most cases, due to deficient editions lacking proper documentation (cf. section 9.2.1), but may also result from the limited range of our knowledge, especially in damaged contexts.

Morphological Analysis
The morphological analysis of texts is performed by manipulating the XML document. When a word is tagged, information is stored as XML attributes of the corresponding XML tag. These attributes provide information on the actual lemma (e.g. bn "son" vs. bn "bān-tree", preposition bn "from" etc.), its lexical category and grammatical subcategories (if appropriate). Lexical categories include noun, pronoun, verb, preposition, conjunction, other particle and other.⁶ To facilitate a clear presentation of the lexical material, names⁷ and fragments⁸ are treated as separate categories. Since this classification is part of the lemma, as are roots and meanings, it is filled in automatically once the correct lemma is chosen. In ambiguous contexts, multiple tags can be assigned. However, this possibility is kept to a minimum to avoid confusing the reader.
On the other hand, grammatical subcategories such as gender, number, state, conjugation and the like are specific to the actual word. Since the defective Sabaic orthography includes many ambiguous forms,⁹ these tags are edited manually, based on the particular context of the word. If neither form nor context allows a clear identification of categories, forms are tagged as "unspecific".¹⁰ Morphological tags are used to create a morphological catalogue of attested forms, which is a vital part of the reference dictionary.
Furthermore, tagging provides information on the reliability of an attestation: certain, uncertain, supplemented¹¹ or wrong.¹² While both uncertain and wrong forms are marked as such,¹³ supplemented forms are simply excluded from presentation.
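The reliability tagging just described can be illustrated with a small Python sketch. The four reliability values are taken from the text; the class name `TaggedForm`, the display markers `(?)` and `(!)` and the sample list are assumptions for illustration only and do not reflect the dictionary's actual data model or notation.

```python
from dataclasses import dataclass

# The four reliability values named in the text.
RELIABILITY = ("certain", "uncertain", "supplemented", "wrong")

# Hypothetical display markers; the real dictionary may mark forms differently.
MARKERS = {"certain": "", "uncertain": " (?)", "wrong": " (!)"}


@dataclass(frozen=True)
class TaggedForm:
    form: str                      # form as read in the inscription
    lemma: str                     # lemma the form is assigned to
    reliability: str = "certain"

    def __post_init__(self):
        if self.reliability not in RELIABILITY:
            raise ValueError(f"unknown reliability value: {self.reliability}")


def present(forms):
    """Return display strings: supplemented forms are dropped, uncertain and wrong ones marked."""
    shown = []
    for item in forms:
        if item.reliability == "supplemented":
            continue  # excluded from presentation
        shown.append(item.form + MARKERS[item.reliability])
    return shown


# Illustrative sample; ʾrz for ʾrḍ "earth" is the scribal fault cited in this paper.
SAMPLE = [
    TaggedForm("ʾrḍ", "ʾrḍ"),
    TaggedForm("bn", "bn", "uncertain"),
    TaggedForm("qrn", "qrn", "supplemented"),
    TaggedForm("ʾrz", "ʾrḍ", "wrong"),
]
```

Here `present(SAMPLE)` keeps three of the four forms, since the supplemented one is filtered out before display.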

Definition of Lemmata
The definition of lemmata follows practical considerations. Forms are thus treated as separate items if their respective contexts show sufficient differences to consider them as such. Sabaic being an extinct language, this is more or less the approach of all extant lexicographic literature in the field.

Treatment of Homographs
Homographs belonging to different grammatical categories are commonly treated as separate items in the scientific literature (cf. the organization of material in both Beeston et al., 1982 and Biella, 1982). Obviously, a noun like qrn "garrison" should be differentiated from the homographic verb qrn "to be on garrison duty". The difference between such morphological categories is, however, sometimes difficult to observe in existing contexts. Distinguishing between infinitives of the base stem, following a pattern fʿl, and abstract nouns showing a similar pattern in the construct state, is a particularly delicate case.¹⁴ Particularly in stereotyped contexts, the grammatical form of a certain lexeme may resist disambiguation.
The situation is even more complicated for homographs belonging to the same grammatical category, i.e. homograph verbs or homograph nouns. These were probably differentiated via vocalic patterns, a means that is not featured in the defective script (for nominal forms cf. Stein, 2003, esp. pp. 56-62). Furthermore, the rather complicated system of verbal stems and their eventual graphic distinction was only fully understood in the last decade (cf. Multhoff, 2011). Consequently, Sabaic lexicography has not yet developed consistent standards to deal with this material. Delimitation of lexemes in existing dictionaries is thus often rather arbitrary, sometimes even summarizing different graphemes under one single lemma.¹⁵ Given the progress in our understanding of Sabaic verbal stems over the last fifteen years, it is now possible to distinguish different verbal stems. All stems show at least occasionally unequivocal forms, mostly in infinitives. Even though these are not always attested, the fundamental disambiguation of the system has yielded a set of semantic criteria that often enables definition in otherwise uncertain cases. Verbal homographs can thus normally be clarified, presenting different stems as different lexemes and identical stems as single lexemes, as with ʿrb 0₁ "to enter" beside 0₂ "to offer" (but see below, section 9.4.2).¹⁶
The situation is more complicated with nominal forms. If we compare related languages such as Arabic, a much larger number of homographs is to be expected, but clear morphological or semantic criteria are missing. In this particular case, contexts are carefully checked. Forms appearing in similar contexts are generally treated as a single lexeme. Other forms are considered as such if there are no convincing arguments against this assumption. On the other hand, forms are split into different lexemes if clear semantic differences appear from constructions or contexts, as in the case of ḏhb, a homograph comprising lexemes with the meanings "bronze", "oasis, irrigated area", "irrigation, flood-water" and a certain "measure of capacity".

Deliberate Splitting of Lexemes
In certain cases, single lexemes are deliberately split up. The most common cases are particles. Conjunctions are generally separated from homograph prepositions. Very frequent particles such as b-, l- and the pronoun ḏ- are further divided into different semantic or contextual categories to enable a clear presentation.¹⁷ Verbal and nominal lexemes are split if they are clearly derived from different forms.¹⁸ Lastly, a small number of nouns in ubiquitous contexts are given separate entries to allow a clearer presentation of the remaining forms in the context of a fully referenced dictionary.¹⁹

Heterographs with Identical Meaning
Sabaic shows a certain number of words that are different in form, but apparently similar in meaning. This mainly applies to a) different renderings of weak radicals (w or y) and b) otherwise identical nouns with and without final -t. While some of these forms can be explained by diachronic or regional variation,²⁰ the motivation of other variant forms is less clear.²¹ Those may represent different lexemes, but may also refer to different numbers (the distribution of which remains equally unclear). Different graphemes are considered as one single lemma if their relation is clearly grammaticalized, as is the case with diachronic or regional variation. Other cases are mostly treated as different lexemes.²²

Treatment of Incorrect Forms
The published Sabaic material, accumulated over a period of almost 150 years, comprises a surprisingly high number of incorrect forms. These include both actual faults of the Sabaean writer (e.g. a merger of similar characters, as in ʾrz instead of ʾrḍ "earth") and misreadings (sometimes even misspellings) by modern editors. In the absence of reliable photographic documentation, it is difficult, if not impossible, to verify the correctness of a given form.
However, misreadings have sometimes become real "classics" over time and have given rise to a considerable amount of secondary material, both translations and further reflections, and have even found their way into the extant dictionaries.²³ A simple exclusion of these ghost-words from the dictionary will probably not solve the problem of their constant reappearance in (especially non-specialist) literature. On the other hand, assumed scribal faults are not always as obvious as the example given above. At least part of this material may, in fact, prove correct if further documentation becomes available. And then there are textual emendations in older editions, the status of which is often rather unclear. All these forms are therefore to be included, and properly commented on, in the dictionary. Finally, lemmata that are based on actual misread forms and are thus to be deleted from the present corpus may in fact appear as clearly attested forms in the future.

Structure of Presentation
While the actual processing work is structured by inscriptions (see section 9.2.3, above), the presentation of the lexical material in the dictionary is generally structured by lexemes. Since Sabaic grammar does not really match the internal logic of alphabetical arrangement, ensuring usability proved rather tricky. While lexemes are handled in a standardized form for both internal and presentational purposes, this form is often difficult to reconstruct from an internal plural form²⁴ or an irregularly formed verbal stem,²⁵ and is thus not a reliable basis for arrangement or even search. Dictionaries for other languages with similar phenomena (such as Arabic or Gəʿəz) often opt for an arrangement by root. The latter, however, may also prove difficult to reconstruct and is thus an equally unreliable basis for search. We therefore decided to offer several different search options to ensure easy access to the material: lexemes proper, but also roots, strings of characters and translations.²⁶ All lexemes are complemented with a suggested translation, an automatically generated counter giving the number of attestations in the material processed thus far, a catalogue of attested morphological forms, a complete literary and etymological documentation and quotations of the particular lexeme in its syntactical and semantic context (Figure 9.1). For the time being, the presentation is exclusively in German. However, technical requirements for an eventual extension to other languages were taken into consideration.
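The four search options named above (lexeme, root, string of characters, translation) can be sketched in a few lines of Python. The `Entry` structure, the `search` function and its mode names are assumptions for illustration; the sample entries reuse lexemes and glosses quoted in this paper, with English glosses standing in for the German ones of the actual dictionary.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Entry:
    lexeme: str          # standardized form of the lexeme
    root: str            # Semitic root
    category: str        # lexical category
    translations: tuple  # suggested translation plus synonyms


# Illustrative sample built from examples quoted in the paper.
ENTRIES = [
    Entry("qrn", "qrn", "noun", ("garrison",)),
    Entry("qrn", "qrn", "verb", ("to be on garrison duty",)),
    Entry("ʿrb", "ʿrb", "verb", ("to enter",)),
    Entry("ʿrb", "ʿrb", "verb", ("to offer",)),
    Entry("ḏhb", "ḏhb", "noun", ("bronze",)),
]


def search(entries, query, mode):
    """Return all entries matching `query` under one of four search modes."""
    if mode == "lexeme":
        matches = lambda e: e.lexeme == query
    elif mode == "root":
        matches = lambda e: e.root == query
    elif mode == "string":
        # any sequence of characters contained in the lexeme
        matches = lambda e: query in e.lexeme
    elif mode == "translation":
        # case-insensitive search within translations and synonyms
        matches = lambda e: any(query.lower() in t.lower() for t in e.translations)
    else:
        raise ValueError(f"unknown search mode: {mode}")
    return [e for e in entries if matches(e)]
```

For instance, `search(ENTRIES, "garrison", "translation")` returns both the noun and the verb qrn, which is the kind of reverse lookup that enriching translations with synonyms is meant to support.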

Translation
An up-to-date translation is given. This is established on the basis of the complete epigraphic material. To enable retrograde search and thus usability, it is often enriched with synonyms. In ambiguous cases, all possible renderings are mentioned, including possible references to other lexemes.²⁷ Onomastic material is generally vocalized to facilitate reading. It should be kept in mind, however, that most of this vocalization is conventional.

Existing Translations
A full catalogue of existing translations is being prepared.²⁸ This is particularly important since the meaning of many lexemes is still not sufficiently established in the literature. The whole range of possible interpretations should thus be made accessible to the user to reflect scientific discussion in the field. Part of the collected material stems from existing dictionaries such as Beeston et al., 1982, and glossaries to larger corpora such as Jamme, 1962, pp. 426-451. Unfortunately, the latter are not always comprehensive.²⁹ However, these sources in no way reflect the totality of existing interpretations. Further material is retrieved from editions (usually containing translations in context) and commentaries. In particular, translations referred to in commentaries to text editions are often dispersed and thus difficult to access.³⁰ Furthermore, existing translations of texts are thoroughly checked for their lexicographic content.³¹ However, to allow a satisfactory workflow, translations have to be near to literal. Paraphrases, especially common for phrases considered as hendiadys, tend to be tricky. Nevertheless, these are included if they can be linked with certainty to particular lexemes.
The resulting catalogue is often surprisingly extensive. Since sufficient criteria for classifying variant forms as "identical" are missing, only verbatim quotes are considered as a single item. Catalogue entries may thus be rather close to each other. In exceptional cases such as sbʾ "to go out", approximately one hundred different translations have been collected, ranging from "(bellum) gerere" to "zum Kriegszug aufbrechen". While translations of common words normally started to center around a semantic nucleus at a comparatively early stage, sometimes as far back as the 19th century, translations of other lexemes can differ considerably over time. In particular, translations of forms with rather unspecific literal meanings, such as sbʾ, may also include a wide range of metaphorical renderings oriented towards context rather than literal meaning.
In the case of onomastics, all existing vocalizations are collected. As in the case of other lexemes, only verbatim quotes were subsumed under a single entry. However, this approach proved rather dysfunctional, given the large number of different transliteration systems, which entails numerous pseudovariants (e.g. Ḏât-Ḥimyam besides dhāt Ḥimyam, the name of a female deity).
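The rule that only verbatim quotes count as a single catalogue item can be made concrete with a short Python sketch. The function name `build_catalogue` and the source labels ("source A" etc.) are placeholders invented for this example, not real bibliographic references; the quoted strings are those cited in the text.

```python
from collections import defaultdict


def build_catalogue(quotes):
    """Group (rendering, source) pairs; only identical strings share an entry."""
    catalogue = defaultdict(list)
    for rendering, source in quotes:
        catalogue[rendering].append(source)
    return dict(catalogue)


# Translations of sbʾ quoted in the text, with placeholder source labels.
SBA_QUOTES = [
    ("(bellum) gerere", "source A"),
    ("zum Kriegszug aufbrechen", "source B"),
    ("zum Kriegszug aufbrechen", "source C"),
    ("to go out", "source D"),
]

# Two vocalizations of the same divine name, as quoted in the text.
NAME_QUOTES = [
    ("Ḏât-Ḥimyam", "source E"),
    ("dhāt Ḥimyam", "source F"),
]

translations = build_catalogue(SBA_QUOTES)
vocalizations = build_catalogue(NAME_QUOTES)
```

With verbatim comparison the two spellings of the divine name end up as two separate entries, which is exactly the pseudovariant effect described above. Merging them would require a normalization step across transliteration systems, which this sketch deliberately does not attempt.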

Etymological Parallels
Lemmata are enriched with etymological parallels from South Arabia and beyond. These are divided into non-Sabaic Ancient South Arabian, i.e. Qatabanic, Minaic and Ḥaḍramitic, and other Semitic languages. Ancient South Arabian parallels are catalogued in the same way as the Sabaic material, i.e. a catalogue as complete as possible is intended. This includes material from both dictionaries and translations of actual texts, irrespective of correctness.
Since scientific literature (at least up to the first half of the 20th century) normally considered the different idioms as mere variants of a common Ancient South Arabian language, translations were often meant as applying to all respective languages. They can thus be considered as part of the catalogue of existing translations for Sabaic as well.³² Etymological parallels from other Semitic languages are generally retrieved from the respective dictionaries. Cataloguing started with geographically adjacent languages (Arabic, Ethiopic and Modern South Arabian) and is as yet far from complete, but is continuously being enriched.

Morphological Catalogue
An exhaustive morphological catalogue is created from the morphological tags. The section gives a fully referenced overview of attested forms,³³ which is arranged according to morphology.³⁴ All references are clickable and linked to quotations in context.
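How such a catalogue could be derived from word-level tags is sketched below in Python. The record layout, the tag values and the reference strings (`demo-1/2` etc.) are illustrative assumptions rather than data from the dictionary; only the tag "unspecific" is taken from the text.

```python
from collections import defaultdict

# Illustrative tagged attestations; references and tag sets are made up for the example.
ATTESTATIONS = [
    {"lemma": "bn", "form": "bn", "tags": ("noun", "singular", "construct"), "ref": "demo-2/5"},
    {"lemma": "bn", "form": "bn", "tags": ("noun", "singular", "construct"), "ref": "demo-1/2"},
    {"lemma": "bn", "form": "bn", "tags": ("noun", "unspecific"), "ref": "demo-3/1"},
]


def morphological_catalogue(attestations):
    """Group references by lemma, morphological tags and attested form."""
    grouped = defaultdict(list)
    for item in attestations:
        key = (item["lemma"], item["tags"], item["form"])
        grouped[key].append(item["ref"])
    # Sort keys (arrangement by morphology) and references (stable display order).
    return {key: sorted(refs) for key, refs in sorted(grouped.items())}


def attestation_count(catalogue, lemma):
    """Automatically generated counter: number of attestations of one lemma."""
    return sum(len(refs) for (lem, _, _), refs in catalogue.items() if lem == lemma)


catalogue = morphological_catalogue(ATTESTATIONS)
```

Because every reference is kept alongside its tag set, each catalogue line can be linked back to the quotation in which the form occurs.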

Examples in Context
The actual usage of each particular lexeme is illustrated by references within their contexts.³⁵ These are given both in transliteration of the Sabaic text and in full German translation. In order to demonstrate the different semantic and/or syntactical aspects of a word in different contexts, these quotations are structured by significant headlines that provide an overview of usages. For extensively attested lexemes,³⁶ however, even this classification will fail to guarantee a fairly clear presentation.
To ensure a consistent rendering and improve workflow, quoted passages are only translated once. Inscriptions are therefore split into paragraphs covering sufficiently comprehensible semantic units³⁷ and supplemented with a German translation. Renderings are kept as literal as possible. As all elements of texts are stored separately in the database, each chosen paragraph can be created using simple routines. The paragraphs thus created are subsequently allocated to their correct place in the structure of presentation of each single lexeme.
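The principle of translating each passage only once and reusing it under every lexeme it contains can be sketched as follows. The paragraph identifiers, the placeholder strings and the function name `quotations` are assumptions for this Python illustration; no actual inscription text or translation is reproduced.

```python
# Each paragraph is stored once, with its transliteration, its translation
# and the lemmata of the words it contains (all values are placeholders).
PARAGRAPHS = {
    "demo-1/§1": {
        "lemmata": ("bn", "qrn"),
        "text": "(transliteration of paragraph 1)",
        "translation": "(German translation of paragraph 1)",
    },
    "demo-1/§2": {
        "lemmata": ("bn",),
        "text": "(transliteration of paragraph 2)",
        "translation": "(German translation of paragraph 2)",
    },
}


def quotations(lemma, paragraphs):
    """Collect every stored paragraph in which the given lemma is attested."""
    return [
        (paragraph_id, data["text"], data["translation"])
        for paragraph_id, data in paragraphs.items()
        if lemma in data["lemmata"]
    ]
```

Both `quotations("bn", PARAGRAPHS)` and `quotations("qrn", PARAGRAPHS)` return the same stored translation for the first paragraph, so a correction made there is reflected under every lexeme that quotes it.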

Results Reached Thus Far
The web presentation of the project, accessible under [http://sabaweb.uni-jena.de], was launched in 2016. At present (June 2018) it contains over 1,800 lemmata³⁸ (plus over 2,200 names) attested in around 900 inscriptions containing a total of 70,000 words. Choosing digital technology in preparing and presenting the dictionary certainly had substantial effects on the workflow. Huge amounts of diverging material such as inscriptions on the one hand and translations and etymologies on the other are easy to manage in the framework of a database. Lexicographical work can be structured according to internal criteria such as contexts and does not depend on external necessities such as alphabetical arrangement. The same applies to presentation, as lexemes can be published in any possible order without affecting usability.