November 4, 2005
Abstract
The present study introduces a research approach that combines corpus-linguistic and discourse-analytic perspectives to analyze the discourse patterns in a large corpus of biology research articles. The primary goals of the study are to identify vocabulary-based Discourse Units (DUs) using computational techniques, to describe the basic types of DUs in biology research articles as distinguished by their primary linguistic characteristics (using Multi-Dimensional analysis), to interpret those Discourse Unit Types in functional terms, and to then illustrate how the internal organization of a text can be described as a sequence of DUs, shifting among various Discourse Unit Types.
November 4, 2005
Abstract
In Dutch, high-frequency words with the suffix -lijk are often highly reduced in spontaneous unscripted speech. This study addressed socio-geographic variation in the reduction of such words against the backdrop of the variation in their use in written and spoken Dutch. Multivariate analyses of the frequencies with which the words were used in a factorially contrasted set of subcorpora revealed significant variation involving the speaker’s country, sex, and education level for spoken Dutch, and involving country and register for written Dutch. Acoustic analyses revealed that Dutch men reduced most often, while highly educated Flemish women reduced least. Two linguistic context effects emerged, one prosodic and the other pertaining to the flow of information. Words in sentence-final position showed less reduction, while words that were more predictable from the preceding word in the sentence (based on mutual information) tended to be reduced more often. The increased probability of reduction for forms that are more predictable in context, combined with the loss of the suffix in the more extremely reduced forms, suggests that high-frequency words in -lijk are undergoing a process of erosion that causes them to gravitate towards monomorphemic function words.
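The abstract above measures how predictable a word is from the preceding word using mutual information, but does not give the exact formula. The following is a minimal sketch, assuming the common pointwise formulation PMI(w1, w2) = log2(P(w1, w2) / (P(w1) P(w2))) estimated from raw counts. The function name and the toy token list are hypothetical and are not taken from the study.

```python
import math
from collections import Counter


def pointwise_mutual_information(tokens):
    """Return PMI (in bits) for every adjacent word pair in `tokens`.

    Assumption: probabilities are plain relative frequencies, with no
    smoothing. This is an illustration, not the study's own procedure.
    """
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n_unigrams = len(tokens)
    n_bigrams = len(tokens) - 1

    scores = {}
    for (first, second), count in bigrams.items():
        p_pair = count / n_bigrams
        p_first = unigrams[first] / n_unigrams
        p_second = unigrams[second] / n_unigrams
        scores[(first, second)] = math.log2(p_pair / (p_first * p_second))
    return scores


# Hypothetical toy sequence, used only to show the call.
toy_tokens = "a b a b a c".split()
for pair, score in sorted(pointwise_mutual_information(toy_tokens).items()):
    print(pair, round(score, 3))
```

A higher score for a pair means the second word is more predictable from the first than its overall frequency alone would suggest. On a real corpus, low-count pairs would need smoothing or a frequency threshold, since PMI is unstable for rare events.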
November 4, 2005
Abstract
This paper presents a technical state of the art in usage-based linguistics as defined in the context of Cognitive Linguistics. Starting from actual case studies rather than theoretical assumptions, methodological issues concerning the usage-based approach are addressed, with specific reference to the use of corpus materials. The specific methodological identity of usage-based linguistics is described in terms of data gathering strategies and the status of empirical data in linguistic research. From a delineation of corpus research in contrast with introspection, survey research, and experimentation, two criteria emerge as essential for a genuine corpus-oriented usage-based linguistics, viz. the use of quantitative techniques and the systematic operationalization of research hypotheses. It is suggested that paying closer attention to these methodological issues is a prerequisite for the further development of the usage-based approach in Cognitive Linguistics.
November 4, 2005
Abstract
Language users never choose words randomly, and language is essentially non-random. Statistical hypothesis testing uses a null hypothesis, which posits randomness. Hence, when we look at linguistic phenomena in corpora, the null hypothesis will never be true. Moreover, where there is enough data, we shall (almost) always be able to establish that it is not true. In corpus studies, we frequently do have enough data, so the fact that a relation between two phenomena is demonstrably non-random does not support the inference that it is not arbitrary. We present experimental evidence of how arbitrary associations between word frequencies and corpora are systematically non-random. We review literature in which hypothesis testing has been used, and show how it has often led to unhelpful or misleading results.
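The claim that enough data will (almost) always lead to rejecting the null hypothesis can be illustrated with a small sketch. It assumes a Pearson chi-square test on a 2x2 table (one word versus all other tokens, in two corpora of equal size). The choice of test, the word rates, and the corpus sizes are hypothetical illustrations, not values from the paper.

```python
def chi_square_2x2(a, b, c, d):
    """Pearson chi-square statistic for the table [[a, b], [c, d]].

    No continuity correction is applied.
    """
    n = a + b + c + d
    numerator = n * (a * d - b * c) ** 2
    denominator = (a + b) * (c + d) * (a + c) * (b + d)
    return numerator / denominator


# Critical value of chi-square with 1 degree of freedom at alpha = 0.05.
CRITICAL_VALUE_05 = 3.841


def word_vs_rest(corpus_size, rate_a, rate_b):
    """Chi-square for one word's frequency in two equally sized corpora."""
    count_a = round(corpus_size * rate_a)
    count_b = round(corpus_size * rate_b)
    return chi_square_2x2(
        count_a, corpus_size - count_a,
        count_b, corpus_size - count_b,
    )


# Hypothetical rates: the word is 1.00% of corpus A and 1.05% of corpus B.
# The relative difference stays fixed; only the amount of data changes.
for size in (10_000, 100_000, 1_000_000, 10_000_000):
    statistic = word_vs_rest(size, 0.0100, 0.0105)
    print(size, round(statistic, 2), statistic > CRITICAL_VALUE_05)
```

Because the proportions are held constant, the statistic grows in direct proportion to corpus size. The same small difference is therefore not significant in the smaller corpora but crosses the threshold once the corpora are large enough, which says something about the amount of data rather than about the linguistic importance of the difference.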
November 4, 2005
Abstract
1. Introduction
In this issue of Corpus Linguistics and Linguistic Theory, Adam Kilgarriff discusses several issues concerned with the role of probabilistic modelling and statistical hypothesis testing in the domain of corpus linguistics and computational linguistics. Given the overall importance of these issues to the above-mentioned fields, I felt that the topic merits even more discussion and decided to add my own two cents with the hope that this discussion note triggers further commentaries or even some lively discussion and criticism. The points raised in Kilgarriff’s paper are various and important and considerations of space do not allow me to address all of them in as great detail as they certainly deserve. I will therefore concentrate on only one particular aspect of the paper which I find ‒ given my own research history and subjective interests ‒ particularly important, namely the issue of statistical hypothesis testing. More precisely, I will address one of the central claims of Kilgarriff’s paper. Kilgarriff argues ‒ apparently taking up issues from methodological discussion in many other disciplines (cf. section 2) ‒ that the efficiency of statistical null-hypothesis testing is often doubtful because (i) “[g]iven enough data, H0 is almost always rejected however arbitrary the data” and (ii) “true randomness is not possible at all”. In information-retrieval parlance, null-hypothesis significance testing when applied to large corpora yields too many false hits. In this short discussion note I would like to do two things. First, I would like to make a few suggestions as to what I think are the most natural methodological consequences of Kilgarriff’s statement and several other points of critique concerning null-hypothesis significance testing raised in other disciplines. Second, I would like to revisit one of the examples Kilgarriff discusses in his paper to exemplify aspects of these proposals and show how the results bear on corpus-linguistic issues.
November 4, 2005
Abstract
There is a long-standing tradition in Chomskyan generative grammar of rejecting the relevance of corpus studies. A variety of arguments are put forth to justify this rejection, most importantly, that corpora are necessarily “finite and somewhat accidental” while the set of grammatical utterances is “presumably infinite” (Chomsky 1957: 15), and that, therefore, “probabilistic considerations have nothing to do with grammar” (Chomsky 1964[1962]: 215, n. 1; cf. also Chomsky 1957: 17). Chomsky is frequently reported as backing up this claim with the observation that the sentence “I live in New York” is fundamentally more likely than “I live in Dayton, Ohio” purely by virtue of the fact that there are more people likely to say the former than the latter (McEnery and Wilson 2001: 10). As always, it is difficult to decide whether Chomsky seriously offers this example in support of his position. Not that it really matters: Chomsky’s contempt for ‒ and his ignorance of ‒ quantitative issues is of no concern to modern corpus linguistics. Chomsky’s irredeemably anti-empirical views are firmly rooted in his anti-empiricist philosophy, and no amount of quantitatively sophisticated corpus-based argumentation will ever change his mind.
November 4, 2005
Abstract
Recent publications in the field of corpus linguistics (including several in this and the previous issue of CLLT) strongly indicate that the field is on its way from a view of corpora as mere repositories of authentic data from which examples can be culled ad libitum to a methodology that analyzes linguistic phenomena systematically and exhaustively as they manifest themselves in corpus data. Thereby, corpus linguistics is becoming an attractive complement to other empirical research methods in the language sciences, such as experimental designs.
November 4, 2005