Milica Vuković Stamatović

Vocabulary complexity and reading and listening comprehension of various physics genres

De Gruyter | Published online: September 27, 2019

Abstract

This study sheds light on the vocabulary complexity of various physics genres and how it affects reading and listening comprehension of the science of physics. We analysed the vocabulary frequency profile of seven physics genres: research articles, textbooks, lectures, magazines, popular books, TV documentaries and TED talks, to determine the presence of general-purpose, academic and technical vocabulary in them, as well as their vocabulary level and variation. The main research question was whether the vocabulary level of these genres could pose an impediment to typical native and non-native speakers of English in terms of their reading/listening comprehension, and, in general, how accessible these genres are vocabulary-wise. The results suggest that typical native speakers will struggle reading physics research and magazine articles, whereas typical non-native speakers will not read/listen to any of the genres at an optimal level, but will be able to read/listen to four of them at an acceptable level.

1 Introduction

Vocabulary knowledge is one of the prerequisites for understanding text and speech. Although understanding is naturally much more than just the knowledge of words, it can be safely assumed that the knowledge of the words used in a text will be a critical factor in the comprehension of that text – typically, the more words are known, the better the comprehension (Nation 2013). This study focuses on just that – the vocabulary used in various physics academic and non-academic genres, both spoken and written in the English language, i. e. its level and complexity across these genres, as well as the share of the academic and technical vocabulary in them as a measure of accessibility of those texts to both native and non-native speakers of English. The aim is to determine how much vocabulary one needs to know as a prerequisite for reading and listening comprehension of the genres investigated and see whether there are any differences in their lexical profiles.

1.1 Reading and listening comprehension

Various studies have examined reading and listening comprehension and how much vocabulary one needs in order to comprehend the texts one reads or listens to.

Most studies on comprehension agree that vocabulary knowledge is a good predictor of reading comprehension (Laufer and Ravenhorst-Kalovski 2010: 16) and that “the language threshold for reading purposes is largely lexical” (Laufer 1992: 126). Moreover, relying on Anderson and Freebody (1981), Nagy (1988: 9) argues that the frequency of “difficult words in a text is the single most powerful predictor of text difficulty”, whereas the vocabulary knowledge of a reader is the single most effective predictor of how well that reader can understand text. This is in line with the results of a study conducted by Schmitt et al. (2011), which revealed a linear relationship between knowledge of vocabulary and the level of reading comprehension.

A frequently cited vocabulary threshold is that determined by Laufer, who argues that knowing 95% of the words used in a text is required for “reasonable reading comprehension” (1989: 321) and for guessing the meaning of the remaining words from the context. In some other studies, the threshold is set higher – for instance, Nation (2006) determined that a 98%-vocabulary coverage would be “the ideal coverage”, which means that a vocabulary of 8,000–9,000 word families would generally be needed for unassisted understanding of a text. By word family, Nation means a base word and all the word forms inflected and derived from it, which can be understood without learning all the forms separately (e. g. maximise, maximises, maximising, maximised, maximisation, maximisations … ).

Likewise, there is empirical support for the assumption that vocabulary knowledge is a major contributor to listening comprehension, in both mother tongue and foreign language, as determined by van Zeeland and Schmitt (2012). Additionally, the findings of these authors for listening comprehension are very similar to those found for reading comprehension by Laufer and Ravenhorst-Kalovski (2010) – 95% for adequate and 98% for optimal comprehension, although listeners could largely comprehend informal narratives with lower lexical coverages (90%). However, as physics genres are not informal narratives, we will adhere to the said 95%- and 98%-coverages as the thresholds for adequate and optimal comprehension, for both reading and listening.
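The two thresholds adopted above can be made concrete with a little arithmetic. The following is a minimal sketch in Python; the function names are hypothetical and serve illustration only. It labels a lexical coverage value against the 95% and 98% thresholds and expresses coverage as the number of running words per unknown word.

```python
ADEQUATE = 95.0  # threshold for adequate comprehension, as adopted in the text
OPTIMAL = 98.0   # threshold for optimal comprehension, as adopted in the text


def comprehension_level(coverage_percent: float) -> str:
    """Label a lexical coverage (in %) against the two thresholds."""
    if coverage_percent >= OPTIMAL:
        return "optimal"
    if coverage_percent >= ADEQUATE:
        return "adequate"
    return "below threshold"


def running_words_per_unknown(coverage_percent: float) -> float:
    """Average number of running words that contain one unknown word."""
    unknown_share = 100.0 - coverage_percent
    if unknown_share <= 0:
        return float("inf")  # every word is known
    return 100.0 / unknown_share


if __name__ == "__main__":
    for coverage in (90.0, 95.0, 98.0):
        print(coverage, comprehension_level(coverage),
              running_words_per_unknown(coverage))
```

At 95% coverage a reader or listener meets roughly one unknown word in every 20 running words; at 98% this drops to one in every 50.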

Such high coverages might pose a problem, particularly for those to whom English is a foreign language. Nation (2013: 12) argues that there are around 70,000 word families in English. The gap between native and non-native speakers of English, in terms of how many words they know, is usually very wide. On the one hand, many adults who speak English as a foreign language will have a vocabulary of less than 5,000 word families, despite having learned it for a number of years (Nation and Waring 1997). On the other hand, a high-school graduate native speaker of English will know just under or about 20,000 word families (Laufer and Yano 2001: 549; Nation 2013: 13), which is a considerably larger vocabulary (it must be noted, however, that there is a lot of variety amongst individuals (Coxhead et al. 2015)).

Obviously, vocabulary knowledge can be a major obstacle for many non-native speakers. On the other hand, reading and listening comprehension of some specific types of texts and speeches, such as those pertaining to scientific genres, could pose a problem to native speakers too, as they might struggle with the specialised vocabulary found in such texts – academic and technical words – despite having a sizeable general vocabulary. In this vein, Snow and Uccelli (2009: 14) argue that academic language should be taught not just to those for whom English is a foreign language, but to native speakers as well, bearing in mind that they also find it challenging and “intrinsically more difficult than other language registers”. Likewise, technical vocabulary in any language is most accessible to those having some specialisation in the technical field concerned; it basically comes with learning the subject (Nation 2013). General world knowledge and familiarity with the topic will also play a significant part in comprehension – familiarity compensates for inadequate knowledge of vocabulary (Schmitt et al. 2015) and facilitates comprehension.

Taking all this into account, the premise that this paper departs from is that the larger the vocabulary and the more academic and technical vocabulary is contained in a specific physics genre, the more difficult it will be to comprehend such a text or speech.

1.2 Vocabulary level and specialised vocabulary

By vocabulary complexity, which we intend to explore here, we mean the level of vocabulary, i. e. the vocabulary load and the percentage of specialised words it contains.

Based on frequency in a text and speech, vocabulary is classified into high-, mid- and low-frequency vocabulary (Nation 2013). The first 2,000 or 3,000 most frequent words (typically, general-purpose words) belong to the first group and pervade all types of texts. Learners are likely to learn these words first, keeping in mind that they are exposed to them the most. The next group, according to Nation, consists of the next 6,000 to 7,000 words of moderate frequency, which are needed to function linguistically without outside assistance. The last class of words, low-frequency words, is by far the largest group of words (tens of thousands of words belong here) – these can be reasonably frequent in particular texts (e. g. technical words in a technical text) but are rare in general texts.

It is estimated that native speakers of English can, on average, learn 1,000 word families annually between the age of three and their early twenties (Biemiller and Slonim 2001). For this and for obvious practical reasons, the vocabulary level of both native and non-native speakers is typically measured in the number of thousands of words they command, and the word lists compiled often have 1,000-word-long sublists. The level of vocabulary in various physics genres will thus be measured in the number of thousands of words needed to reach the comprehension levels suggested earlier.

In the context of our study, specialised vocabulary will refer to academic and technical vocabulary contained in various physics genres we explore here.

Academic vocabulary includes words which are typically found in academic texts of various disciplines, i. e. words which are common in academic texts and not so common in non-academic ones (Nation 2013: 291). It thus includes words such as hypothesis, theorise, empirical, infer, etc. On the other hand, technical words are common in a particular discipline and abound in technical texts (Nation 2016); for instance, technical words in physics might include current, atom, gravity, nuclear etc. Both these specialised groups contain items from all three vocabulary frequency groups; however, they typically include very few high-frequency words (if so, they take on extended or special meanings in a particular discipline). In general texts, these two groups of words do not reach significant coverage. Because academic vocabulary includes, inter alia, formal-register vocabulary, some of it may be found in formal texts (e. g. in newspapers, where it makes up 4% (Nation 2013: 294)), but more significant coverage is only to be encountered in academic texts. On the other hand, high coverage of technical vocabulary may be expected in technical texts – its amount, however, varies by discipline (Coxhead 2018).

Essentially, the more specialised words a text contains, the less accessible that text will be to a reader/listener who is not so familiar with academic language or with the field of specialisation concerned. Academic vocabulary is a source of difficulty for non-native speakers of English (Coxhead 2000: 213), and, as Snow and Uccelli (2009) noted, it is also more challenging for native English speakers than other registers. On the other hand, the knowledge of technical vocabulary is mostly related to the knowledge of the discipline and is generally not accessible to those outside the field.

1.3 Word lists

Linguists are generally interested in vocabulary distribution, whereas applied linguists in particular are interested in prioritizing vocabulary in an effort to select words and phrases to which instruction will be devoted in foreign language classes. As of late, these interests and aims have given rise to building various word lists.

One of the first word lists developed was the General Service List (GSL), containing approximately 2,000 word families (West 1953). It contains words which West assessed as highly frequent, general words in English, and in many texts it covers about 80% of the words used. This might sound like impressive coverage, but it still means that two out of ten words will remain unknown to someone who knows only the words from the list. In fact, for a higher coverage, many more word families will be needed. In addition, the GSL’s coverage varies widely across genres and is generally lower in academic and technical texts (see Section 1.4). This list can now be criticised as outdated; in addition, the word inclusion criteria for this list were not solely based on frequency, which is why other, newer lists are replacing it nowadays.

The introduction of new technologies has allowed for building much larger text corpora, as well as the development of various software applications for analysing those corpora.

Coxhead (2000) produced a word list of academic vocabulary known as the Academic Word List (the AWL), derived from a 3.5-million-word academic corpus. This was a turning point in the history of making word lists, as it brought about the creation of many other lists. The AWL contains 570 word families and typically covers around 10% of the vocabulary used in academic texts (e. g. 10% in medicine (Chen and Ge 2007); 9.47% in pharmacology (Fraser 2007); 11.17% in applied linguistics (Vongpumivitch et al. 2009); 9.96% in chemistry (Valipouri and Nassaji 2013), etc.). Some of the general academic lists which have been made since include the Academic Vocabulary List (Gardner and Davies 2014) and the New Academic Word List (Browne et al. 2013); on the other hand, some authors have argued for discipline-specific academic word lists, criticising the AWL’s generality, so as to account for the differences amongst disciplines (e. g. the medical academic word list (Wang et al. 2008), applied linguistics academic word list (Khani and Tazik 2013), chemistry academic word list (Valipouri and Nassaji 2013), environmental academic word list (Liu and Han 2015), nursing academic word list (Yang 2015), medical academic vocabulary list (Lei and Liu 2016), etc.). Others have pointed to the AWL’s insufficient coverage – combining it with general-purpose lists, which typically cover 2,000 words, does not lead to the coverage needed for reasonable comprehension (Hancioglu and Eldridge 2007). Still, the list is very practical – it is short enough to make a feasible learning target, while simultaneously providing respectable coverage in various types of academic texts.

The AWL was derived from a written corpus, on top of a general-purpose word list (the GSL) and thus mostly consists of mid-frequency words; on the other hand, the Academic Spoken Word List (the ASWL), produced by Dang et al. (2017), is based on spoken data and does not pre-exclude any group of words. Consequently, its coverage is substantial, together with proper nouns and marginal words (it ranges from 92% to 96%).

Academic texts also abound in technical vocabulary, whose coverage varies by discipline (Nation 2013) and also depends on how technical vocabulary is defined. A number of technical word lists have been developed since 2000, including the pilot science list (Coxhead and Hirsh 2007), pharmacology word list (Fraser 2007), computer science word list (Minshall 2013), engineering technology word list (Jin et al. 2013), etc.

One of the technical lists relevant to the study presented here is the Pilot Science List, produced by Coxhead and Hirsh (2007). The list contains 318 word families and covers 3.79% of the corpus it was derived from – a 1.76-million-word corpus of first-year university science texts belonging to 14 disciplines, one of them being physics. The list was built on top of the GSL (West 1953) and the Academic Word List (Coxhead 2000), i. e. the words contained in these two were pre-excluded in order to make a new technical list.

Words not profiled into any vocabulary group (such as general, academic or technical word groups) are generally low-frequency words, which, inter alia, include proper names (generally easily recognisable); abbreviations (usually explained the first time they are used); compound words (whose meanings can usually be decoded from the meanings of their parts); and marginal words (the letters of the alphabet, swear words, exclamations).

At the end of this brief review, we will also refer to another general word list which is of relevance for this study, or rather a set of general word lists, usually referred to as BNC/COCA word lists (Nation 2012), which have been developed recently from a vast corpus. This is a set of 25 lists, each containing 1,000 word families, which were extracted based on their frequency in the BNC/COCA corpus containing 450 million words. The set comes with four supplementary lists for the four categories of low-frequency words explained above. This set of word lists can be used for determining the vocabulary level of texts by looking at how many word families are needed to reach minimum requirements for reading and listening comprehension.
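How a set of 1,000-word-family lists yields a vocabulary level can be sketched as follows. This is only an illustration: the function name is hypothetical, the per-list coverage figures are invented round numbers rather than results from any corpus, and counting the supplementary lists (e. g. proper names) as known is an assumption of the sketch.

```python
from typing import List, Optional


def lists_needed(band_coverage: List[float],
                 threshold: float,
                 supplementary: float = 0.0) -> Optional[int]:
    """Return how many 1,000-word-family lists are needed to reach `threshold`.

    band_coverage[i] is the share of running words (in %) covered by the
    (i + 1)-th 1,000-word-family list. `supplementary` is the share covered
    by the supplementary lists, assumed here to be known to the reader.
    Returns None if the threshold is never reached.
    """
    cumulative = supplementary
    for n, coverage in enumerate(band_coverage, start=1):
        cumulative += coverage
        if cumulative >= threshold:
            return n
    return None


if __name__ == "__main__":
    # Invented, illustrative coverage values for the first nine lists.
    toy_bands = [75.0, 9.0, 5.0, 3.0, 2.0, 1.5, 1.0, 0.5, 0.5]
    print(lists_needed(toy_bands, 95.0, supplementary=2.0))  # 5 lists
    print(lists_needed(toy_bands, 98.0, supplementary=2.0))  # 7 lists
```

With these toy figures, 5,000 word families would suffice for the 95% threshold and 7,000 for the 98% threshold.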

The selection of lists for this study is restricted by the lack of choice when it comes to science-technical vocabulary lists – the only list available at the moment is the Pilot Science List referred to above. Two limitations come with this list. The first is that it is not a physics-only list but a more general science list. The second is related to the method by which the list was created, in two respects. Firstly, the Pilot Science List uses the word-family approach and can only be used with other lists using such an approach (which excludes lists using the lemma principle); secondly, this list was built on top of another two lists, i. e. by excluding items from those (the GSL and the AWL), which is why it is best to use it together with them – combining it with newer general-purpose and academic lists would lead to some overlap amongst the lists, which would affect coverage percentages.

Keeping the above in mind, in this study we use: the GSL (West 1953), to represent high-frequency general-purpose words, the Academic Word List – the AWL (Coxhead 2000), to represent frequent academic words and the Pilot Science List – the PSL (Coxhead and Hirsh 2007), to represent frequent science-technical words. The lists were built on top of each other and are thus complementary. We also use the BNC/COCA general word lists (Nation 2012), as the most complete set of general word lists.

1.4 Vocabulary profile and genre

As this study aims at exploring seven genres (within one discipline – physics) – research articles, textbooks, lectures, magazines, popular books, TV documentaries and TED talks, we provide a brief review of some of the vocabulary profiling studies related to these specific genres. Most of these studies resulted in word lists of various sizes.

As suggested above, various academic word lists have been produced to date. Whereas some of the studies rely on corpora of combined various academic genres, others employ corpora composed of just one genre, most commonly, research articles (for instance, these include studies of the vocabulary used in pharmacology (Fraser 2007), medicine (Wang et al. 2008), applied linguistics (Vongpumivitch et al. 2009; Khani and Tazik 2013), agriculture (Martínez et al. 2009), chemistry (Valipouri and Nassaji 2013), environmental science (Liu and Han 2015), nursing (Yang 2015), linguistics (Moini and Islamizadeh 2016), social sciences (Kwary and Artha 2017)). In these studies we can usually find data relating to the coverages of general-purpose and academic word lists. Thus, the GSL’s coverages in research articles of various disciplines range from 64.35% in chemistry and 67.53% in agriculture to 76.4% in applied linguistics. In addition, the AWL’s coverage varies from 9.06% in agriculture and 9.60% in chemistry to 11.76% in social sciences, 11.96% in applied linguistics and 12.82% in environmental science. Although these data are not sufficient to offer a definitive conclusion, we may note a tendency for higher GSL’s and AWL’s coverages in social sciences, as opposed to natural sciences.

Various studies have used corpora solely composed of textbooks (for instance, for the study of vocabulary in the field of engineering (Mudraya 2006; Ward 2009; Jin et al. 2013; Hsu 2014; Todd 2017), business (Konstantakis 2007; Hsu 2011) and medicine (Hsu 2013)). As can be noted, more vocabulary studies have been devoted to research articles than textbooks. A study especially relevant to the present paper is the one by Hsu (2011), who measured reading comprehension thresholds for business textbooks and research articles. She found that of the two genres, business research articles are more demanding in terms of reading (if the threshold is set at 95%, 5,000 word families are needed for reading business research articles, as opposed to 3,500 word families needed for reading business textbooks). Radford (2013; cited in Coxhead 2018) also found that research articles had much higher vocabulary levels than textbooks in the field of computer science; the levels needed for reading computer science research articles exceeded as many as 30,000 word families, which is very different from the results for business research articles cited above.

Very few studies have explored spoken data, the reason mainly being that it is considerably more difficult to collect spoken data of a size that would be sufficient for vocabulary profiling and comparable to written corpora. Coxhead (2017) studies the talk of secondary school teachers, whereas Thompson (2006) studies the BASE corpus of lectures and finds the AWL’s coverage of 4.9% in it. Dang et al. (2014) analysed the lexical profile of academic spoken English using a vast corpus mostly composed of lectures, but also including seminars. Their results revealed that 3,000 to 5,000 words, depending on the discipline, together with proper nouns and marginal words, would be enough for the lower comprehension threshold (95%). The authors concluded that the smallest vocabulary was needed for social sciences, whereas the most demanding disciplines proved to be life and medical sciences (for physical sciences, 4,000 word families were needed). The AWL’s average coverage was 4.41% in their corpus and 4.28% in the physical sciences subcorpus.

TED talks have received a lot of attention as of late. Wingrove (2017) finds that TED talks contain less academic vocabulary than lectures (the word list used to represent academic vocabulary in this study was the AVL (Gardner and Davies 2014)). In his study, the AVL’s coverages for lectures ranged between 5.64% and 7.01%, as opposed to coverages for TED talks, which ranged between 3.70% and 5.70%. Coxhead and Walls (2012) find that 5,000 word families (plus proper nouns) exceed the 95%-vocabulary threshold in a corpus of TED talks, whereas 8,000–9,000 word families (plus proper nouns) are needed to reach a coverage of 98%, and that there is little coverage variability over various topics.

When it comes to science magazines, TV science documentaries and popular science books, to our knowledge, no vocabulary profiling studies describing these genres have been produced to date.

Nation (2006) points out that fewer words are needed for dealing with spoken as opposed to written texts – at a 98%-threshold, some 6,000–7,000 word families are needed for talk and some 8,000–9,000 for written texts. Studies reviewed above also suggest that less academic vocabulary is found in spoken academic genres than in written academic genres.

2 Aim of the study and research questions

The aim of this study is to determine vocabulary complexity across a number of physics academic and non-academic genres. We wish to determine vocabulary levels across these genres, including how much academic and technical vocabulary they contain. In addition, we investigate how much vocabulary is needed to successfully read and listen to those texts.

The target group that we have in mind for this paper includes both native and non-native adult speakers of English with an interest in science, and particularly in physics, or who study physics and physics-related disciplines. Special focus amongst them is on non-native learners of English for specialised purposes (academic and specific – physics).

Bearing this in mind, the research questions posed here are as follows:

  1. How much general, academic and technical vocabulary is used in the various physics genres?

  2. How many word families are needed to reach adequate and optimal reading and listening comprehension in various physics genres?

  3. What level of lexical diversity is displayed in various physics genres?

  4. How accessible are various physics genres to native and non-native speakers of English vocabulary-wise?

To answer these questions, we use the methodology and corpora described below.

3 Methodology

Broadly speaking, there are two strands of the methodology used for measuring vocabulary richness and, in fact, they measure different aspects of it. One of them is the use of formulas, most of which derive from one of the oldest, the type-token ratio (TTR), where tokens are the individual running words in a text and types are the unique word forms in that text. However, this formula is sensitive to the size of the corpus (Kubát and Milička 2013) and, therefore, not truly applicable when comparing corpora of different sizes. This is why there have been various attempts to correct it – for instance, Scott (2004) proposes a standardised TTR (sTTR), where a text is divided into chunks of equal length and the TTR is calculated for each of the chunks separately; the final sTTR result is the average of all the ratios calculated for the equally-sized chunks. Another, more refined method was proposed by Covington and McFall (2010). They argue that mechanically cutting a text into equal chunks may mean that the chunks so obtained will not be semantically and pragmatically comparable; in other words, this method “does not account for intertextual variability” (Cvrček and Chlumská 2015: 316). Covington and McFall (2010), therefore, propose the method of calculating the “moving average type-token ratio (maTTR)”. First a window length, i. e. the size of a chunk, is determined – for instance, it could be 1,000 words. The TTR is calculated for the first chunk of 1–1,000 words, then for words 2–1,001, and then for words 3–1,002, and so on, until the text is exhausted. The mean of all the TTRs so obtained is the measure of “lexical diversity” of the entire text.
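The three measures described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the implementation of any particular software: the function names are hypothetical, tokens are assumed to be already split and lower-cased, and dropping a trailing partial chunk in the sTTR is a choice made for this sketch.

```python
from typing import List


def ttr(tokens: List[str]) -> float:
    """Type-token ratio: unique word forms divided by running words."""
    if not tokens:
        raise ValueError("cannot compute TTR of an empty text")
    return len(set(tokens)) / len(tokens)


def sttr(tokens: List[str], chunk_size: int) -> float:
    """Standardised TTR: mean TTR over consecutive, equally sized chunks.

    A trailing chunk shorter than `chunk_size` is dropped (an assumption).
    """
    starts = range(0, len(tokens) - chunk_size + 1, chunk_size)
    ratios = [ttr(tokens[i:i + chunk_size]) for i in starts]
    if not ratios:
        raise ValueError("text is shorter than one chunk")
    return sum(ratios) / len(ratios)


def mattr(tokens: List[str], window: int) -> float:
    """Moving-average TTR: mean TTR over every window of `window` tokens,
    moving forward one token at a time until the text is exhausted."""
    n_windows = len(tokens) - window + 1
    if n_windows < 1:
        raise ValueError("text is shorter than the window")
    return sum(ttr(tokens[i:i + window]) for i in range(n_windows)) / n_windows


if __name__ == "__main__":
    sample = "the cat sat on the mat and the dog sat on the rug".split()
    print(ttr(sample), sttr(sample, 4), mattr(sample, 4))
```

The moving-average version recomputes the TTR for every window, which is simple to read but slow on corpora of a million words or more; an incremental update of the type counts would be preferable in practice.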

Basically, all TTR-based methods measure diversity, i. e. variability of vocabulary. Still, variability does not provide an insight into the level of vocabulary, i. e. types of words that are used and how frequent these words are in a language. For instance, a text may display substantial lexical variability, but still consist of words which are mostly general-purpose and of mid-frequency (e. g. a movie review intended for the general audience). This is why our main method will be that of Lexical Frequency Profiling (the LFP), which provides data on the level of words, i. e. their frequency in a language, and the coverage of various types of vocabulary (i. e. the presence of academic and technical vocabulary in our case). The LFP is a method used for determining the vocabulary load of a text developed by Laufer and Nation (1995). It assumes the use of software to categorise the words used in a text as belonging to certain word lists, such as the ones described earlier. Thus, the lexical profile of a text is established, which allows for various texts to be compared in terms of vocabulary level and complexity. The software we use for this purpose is AntWordProfiler 1.4.0 w (Anthony 2014), a vocabulary profiling tool which provides statistical data about the text analysed in terms of the vocabulary used and allows for its comparison against word lists.
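The core idea of lexical frequency profiling can be illustrated with a toy example. The sketch below is not how AntWordProfiler works internally. The word lists are tiny invented stand-ins for real lists, the tokeniser is deliberately crude, and all names are hypothetical. Each running word is credited to the first list that contains it, and words found in no list are counted as off-list.

```python
import re
from collections import Counter
from typing import Dict, List, Set


def tokenize(text: str) -> List[str]:
    """Crude tokeniser: lower-cased alphabetic strings (an assumption)."""
    return re.findall(r"[a-z]+", text.lower())


def profile(text: str, word_lists: Dict[str, Set[str]]) -> Dict[str, float]:
    """Return the share of running words (in %) covered by each list.

    `word_lists` maps a list name to the set of word forms it contains
    (i. e. all members of its word families). Lists are checked in order.
    """
    tokens = tokenize(text)
    if not tokens:
        raise ValueError("text contains no words")
    counts: Counter = Counter()
    for token in tokens:
        for name, forms in word_lists.items():
            if token in forms:
                counts[name] += 1
                break
        else:
            counts["off-list"] += 1
    names = list(word_lists) + ["off-list"]
    return {name: 100.0 * counts[name] / len(tokens) for name in names}


if __name__ == "__main__":
    # Invented miniature lists, for illustration only.
    toy_lists = {
        "general": {"we", "test", "the", "that", "of", "an", "path"},
        "academic": {"hypothesis", "hypotheses"},
        "technical": {"atom", "atoms", "gravity"},
    }
    sentence = "We test the hypothesis that gravity bends the path of an atom"
    for name, share in profile(sentence, toy_lists).items():
        print(f"{name}: {share:.2f}%")
```

In this toy sentence of 12 running words, eight fall in the general list, one in the academic list, two in the technical list and one (bends) is off-list.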

To complete the lexical profile of the seven physics genres, we will also supplement our research method with the measure of lexical variety, measured using both the sTTR and the maTTR. For chunking, we use the AntFileSplitter software (Anthony 2017) and for calculating the maTTR scores, we use the MaWaTaTaRaD software (Milička 2013).

The word lists used here are those referred to in the theoretical section of the paper – the BNC/COCA word lists, the GSL, the Academic Word List (the AWL) and the Pilot Science List (the PSL).

4 Corpora

We created seven sizeable corpora for the purpose of this study. A summary is presented in Table 1 and the details are given after the table.

Table 1: Overview of corpora.

Mode      Genre               No. of words
Written   Research articles   1,257,493
Written   Magazines           1,214,752
Written   Textbooks           1,744,865
Written   Popular books       1,256,484
Spoken    Lectures            73,979
Spoken    TV documentaries    251,449
Spoken    TED talks           175,764

4.1 Physics research article corpus

We created a corpus containing 200 physics research articles, collected from 102 physics journals indexed in the Science Citation Index (run by Clarivate Analytics). The articles were collected in the summer of 2017 from recent issues of the journals (typically, the last issue available). The corpus contains eight subsections corresponding to the eight subfields as classified in Science Citation Index, totalling 1,257,493 words. The details of this corpus are given below.

Physics subfield                         No. of articles   No. of journals   No. of tokens
Physics, applied                         30                15                149,362
Physics, atomic, molecular & chemical    30                15                160,456
Physics, condensed matter                25                13                154,363
Physics, fluids & plasmas                25                13                162,333
Physics, mathematical                    18                10                157,781
Physics, multidisciplinary               30                15                163,823
Physics, nuclear                         24                12                148,036
Physics, particles & fields              18                9                 161,339
Total                                    200               102               1,257,493

In this corpus, we removed the reference sections, tables and figures, as is usually done with research article corpora (e. g. Coxhead 2000; Martínez et al. 2009; Hsu 2013), so as to reduce the number of proper nouns and abbreviations, which do not influence the vocabulary complexity of texts.

4.2 Physics textbook corpus

Our physics textbook corpus consists of two books – Fundamentals of Physics (Halliday et al. 2007), and University Physics with Modern Physics (Young and Freedman 2016). Both are widely used in colleges for physics undergraduate courses as a gold standard for freshman physics and cover most physics fields. The details of this corpus are given below.

Physics textbook                          No. of tokens
Fundamentals of Physics                   627,260
University Physics with Modern Physics    1,117,605
Total                                     1,744,865

4.3 Physics lecture corpus

Our physics lecture corpus was taken from two sources – the Michigan Corpus of Academic Spoken English (MICASE) and the British Academic Spoken English (BASE) corpus. There were 3 physics lectures in the MICASE corpus and 5 in the BASE corpus; however, the two subsections of our merged corpus are of roughly the same size. Note that corpora of spoken language are generally much smaller than their written counterparts due to the costs of transcribing speech.

Physics lectures           No. of lectures   No. of tokens
MICASE physics lectures    3                 38,036
BASE physics lectures      5                 35,943
Total                      8                 73,979

4.4 Physics magazine corpus

Science magazines publish news, reports and opinions about science and generally target a non-expert audience, although the spectrum of readership for various magazines is very wide – some are more academic whereas others are much less so. We chose two reputable, widely distributed magazines concerned with physics: Physics World (published by the Institute of Physics (IOP)), a UK magazine, and Physics Today (published by the American Institute of Physics (AIP)), a US magazine. In the world of science magazines, these two would fall on the more technical end of the spectrum, as they chiefly target physics students, interdisciplinary scientists and physicists, although they generally strive “to communicate … to the widest possible audience” (the Physics World website: https://physicsworld.com). Magazines falling on the more popular end of the spectrum typically cover a wide variety of topics (such as the environment, health, technology, engineering, etc.), and not physics only or primarily, which is why they were not suitable for this particular study.

The details of this corpus are given below. The issues included in this corpus were randomly chosen from the period 2009–2016.

Physics magazine   No. of issues   No. of tokens
Physics World      16              561,006
Physics Today      12              653,746
Total              28              1,214,752

We also form and use another corpus at one point in the study, to test a hypothesis relating to the vocabulary of science magazines. This corpus contains 12 issues of New Scientist, one of the most popular science magazines, encompassing a total of 356,628 words. These issues were published in 2016.

4.5 Popular physics book corpus

The popular physics book corpus compiled for this study contains 10 books. We chose the titles which were most frequently “shelved” by readers of the GoodReads website (www.goodreads.com) in the section of popular physics books in June 2018, only making sure not to include more than one title by the same author. We could not ensure an equal distribution of physics topics, as the public seems to be mostly interested in just some of them (such as the universe). Also, by following the “popularity” of books as the main criterion, we did not account for the different sizes of the individual books – we dismissed the option of taking random samples from the books, as the part of the book from which a sample was taken (e. g. the introduction, text body, conclusion) could affect the vocabulary profile; in addition, all our other corpora consist of full texts, which is why we used the same principle here. This corpus is of approximately the same length as our other written corpora. The details of the corpus are given below.

Physics popular book No. of tokens
A Brief History of Time (by Stephen Hawking) 64,135
The Elegant Universe (by Brian Greene) 151,258
Physics of the Impossible (by Michio Kaku) 107,280
Cosmos (by Carl Sagan) 126,699
A Universe from Nothing (by Lawrence Krauss) 275,589
Astrophysics for People in a Hurry (by Neil deGrasse Tyson) 33,144
The Trouble with Physics (by Lee Smolin) 143,344
Dreams of a Final Theory (by Steven Weinberg) 107,590
Just Six Numbers: The Deep Forces That Shape the Universe (by Martin Rees) 51,990
The Emperor’s New Mind (by Roger Penrose) 195,455
Total 1,256,484

4.6 Physics TV documentaries

This corpus consists of transcripts of 44 science documentaries that primarily deal with physics. Although they cover a vast variety of physics topics, it must be noted that many of them are oriented towards topics relating to the universe, as was the case with the popular physics books. They are generally about one hour long, and the majority of them were produced in the USA. The details of the corpus are given below.

Physics documentary No. of documentaries No. of tokens
Physics documentaries, original network BBC 13 76,906
Physics documentaries, original network PBS Nova 6 55,878
Physics documentaries, original networks Fox and National Geographic 12 52,023
Physics documentaries originally aired on History 13 66,642
Total 44 251,449

4.7 Physics TED talk corpus

Our final corpus consists of transcripts of all TED talks on physics given before June 2018. TED talks are very popular online talks, distributed freely on various science, cultural and other academic topics, under the motto “ideas worth spreading”. The speakers have up to 18 minutes to deliver their talk in an innovative and engaging manner, commonly through storytelling. All the transcripts tagged as “physics” on the TED website in June 2018 (https://www.ted.com/topics/physics) entered our corpus – in total, 80 TED talks.

TED talk No. of talks No. of tokens
TED talks on physics 80 175,764

5 Results and analysis

Using the software AntWordProfiler 1.4.0 w (Anthony 2014), we examined the vocabulary profile of the seven corpora compiled for the purpose of this study against two sets of word lists: on the one hand, the GSL, Academic Word List (the AWL) and Pilot Science List (PSL); and, on the other, Nation’s set of 25 general word lists and 4 supplementary lists which contain proper names, marginal words, abbreviations and compounds (2012).

By examining the lexical profile of our corpora against the first set of word lists, we can investigate how much high-frequency general-purpose vocabulary, as well as frequent academic and technical vocabulary (as represented by the selected word lists) they contain. Table 2 presents these results.
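The idea behind such a profile can be illustrated with a minimal Python sketch. It is not the implementation used by AntWordProfiler: the function name `coverage_profile`, the simple tokenisation rule and the three toy word lists (whose membership is invented here for illustration) are all assumptions. Each token is credited to the first word list that contains it, and coverage is the share of tokens credited to each list.

```python
import re
from collections import Counter


def coverage_profile(text, word_lists):
    """Return the percentage of tokens covered by each word list.

    word_lists maps a list name to a set of lower-case word forms.
    Each token is credited to the first list that contains it; tokens
    found in no list are reported under "not in lists".
    """
    tokens = re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())
    if not tokens:
        raise ValueError("text contains no tokens")
    counts = Counter()
    for token in tokens:
        for name, words in word_lists.items():
            if token in words:
                counts[name] += 1
                break
        else:
            counts["not in lists"] += 1
    names = [*word_lists, "not in lists"]
    return {name: 100 * counts[name] / len(tokens) for name in names}


# Toy lists for illustration only (hypothetical membership). The real
# lists are far larger and are organised by word family, not single forms.
toy_lists = {
    "GSL": {"the", "of", "a", "is", "by", "its", "and"},
    "AWL": {"energy", "defined"},
    "PSL": {"particle", "mass", "velocity"},
}
sample = "The energy of a particle is defined by its mass and velocity"
profile = coverage_profile(sample, toy_lists)
for name, share in profile.items():
    print(f"{name}: {share:.2f}%")
```

A real profiler would additionally map inflected and derived forms to their word families before the lookup; the sketch matches exact forms only.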

Table 2:

Coverage of GSL, AWL and PSL in various physics genres (%).

Word lists Written genres Spoken genres
Research a. Magazines Textbooks Pop. books Lectures TV docum. TED talks
GSL 1st 1000 63.11 62.65 71.01 81.09 82.32 79.94 82.03
GSL 2nd 1000 4.94 5.23 5.58 3.99 3.76 5.15 4.75
AWL 9.60 8.82 6.73 5.35 3.40 3.50 3.79
PSL 3.45 2.29 3.88 1.38 1.66 1.83 1.38
Total 81.10 78.99 87.20 91.81 91.14 90.42 91.95

The results in Table 2 show that most of the high-frequency general-purpose vocabulary (the GSL) was found in the TED talks corpus (a cumulative coverage of 86.78%), followed closely by the lectures (86.08%), documentaries (85.09%) and popular physics books (85.08%). Spoken genres seem to have simpler vocabulary than the written formal genres, which is in fact typical of them (e. g. Akinnaso 1982; Nation 2006). The medium of communication seems to be the prevailing influence on vocabulary type: even though lectures are academic per se and not intended for the general public, they fit the high-frequency general-purpose vocabulary profile of the other spoken genres investigated here.

On the other hand, the research articles and the magazines studied here had much less of the GSL’s vocabulary (68.05% and 67.88%, respectively), which is in line with previous findings related to science research articles (for instance, in Coxhead and Hirsh’s study (2007), the GSL’s coverage in the science corpus was 71.52%). Still, we find it surprising that the magazines were very similar to the research articles in this respect, even though their target audience was not as strictly technical as that of research articles – in fact, the results from Table 2 suggest that the vocabulary of the selected physics magazines contained the least high-frequency vocabulary of the seven genres analysed.
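The cumulative GSL figures quoted in the two paragraphs above are the sums of the first and second GSL thousands in Table 2. The short Python sketch below reproduces that arithmetic; the figures are copied from Table 2, while the variable names are illustrative assumptions.

```python
# Coverage (%) of the first and second GSL thousands, copied from Table 2.
gsl_rows = {
    "Research articles": (63.11, 4.94),
    "Magazines": (62.65, 5.23),
    "Textbooks": (71.01, 5.58),
    "Popular books": (81.09, 3.99),
    "Lectures": (82.32, 3.76),
    "TV documentaries": (79.94, 5.15),
    "TED talks": (82.03, 4.75),
}

# Cumulative GSL coverage is the sum of the two thousands.
gsl_total = {
    genre: round(first + second, 2)
    for genre, (first, second) in gsl_rows.items()
}

# Rank the genres from most to least high-frequency vocabulary.
ranking = sorted(gsl_total, key=gsl_total.get, reverse=True)
for genre in ranking:
    print(f"{genre}: {gsl_total[genre]:.2f}%")
```

The resulting ranking places the TED talks first and the magazines last, which matches the order described in the text.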

When it comes to the AWL’s coverage, the results suggest that the research article corpus had the most academic vocabulary (as represented by the said word list) – it contained 9.6% of this type of vocabulary. This finding is in line with previous studies on academic vocabulary in science research articles (see Section 1.3). Perhaps, again, somewhat surprisingly, the physics magazines were also very academic in nature, containing 8.82% of academic vocabulary, which is not far from the academic vocabulary levels of some types of research articles (for instance, in Coxhead and Hirsh’s science corpus (2007), the AWL’s coverage was 8.96%). Furthermore, the textbooks contained much less academic vocabulary than the research articles (6.73% vs. 9.6%), which can be explained by the fact that their target audience is students, who may still not be considered full academicians.

Bearing in mind that their target audience is the general public, it might be surprising that 5.35% of the words used in the popular physics books studied here were academic vocabulary (as represented by the AWL). This might be explained by the fact that the AWL contains many words characteristic of formal register (Nation 2013), and by the fact that the authors were physicists whose scientific writing style carried over, to an extent, into this genre, despite their conscious efforts to simplify the language.

Finally, the three remaining genres, the lectures, documentaries and TED talks, had the lowest AWL coverage amongst the genres studied, with similar levels of about 3.5%. These three are spoken genres (the latter two scripted spoken genres), which is likely the main reason for this result. Wingrove’s results (2017) suggest more academic vocabulary in lectures than in TED talks – our study did not confirm this, but rather found similar levels for the two genres in the discipline of physics. Additionally, the AWL’s coverage in our corpus of physics lectures (3.40%) was lower than that found by Dang et al. (2014), which was 4.28% for physical sciences and 4.41% for the whole interdisciplinary corpus.

Most science-technical vocabulary, as represented by the Pilot Science List (PSL), was found in the research articles and the textbooks (3.45% and 3.88%, respectively). These results are in line with the findings of Coxhead and Hirsh (2007), where the PSL’s coverage was 3.79%. The magazines also proved to be rather technical, with a PSL coverage of 2.29%, although much less so than the research articles, the genre they resembled most, vocabulary-wise. The lectures were amongst the least technical genres, with a level of technicality similar to that of the documentaries – the spoken mode of this genre was, most likely, responsible for it.

Overall, the largest share of frequent academic and technical vocabulary was found in the research articles and the magazines. The lowest AWL and PSL coverages were found in the lectures, documentaries and TED talks, whereas the textbooks and the popular books came in between, with the textbooks tending towards the more academic-technical end of the spectrum, and the popular books tending towards the other, more general end. Generally speaking, for some genres it can be said that a lower level of vocabulary academicity corresponds to a lower level of vocabulary technicality and vice versa.

For illustration purposes, in the Appendix we present sample paragraphs from the seven corpora, in which the GSL, AWL and PSL words are marked in different colours.

In the next part of the analysis, we determine the vocabulary level of our seven corpora, using a set of 29 BNC/COCA word lists referred to earlier. What follows is a table summarising the results (Table 3).

Table 3:

Vocabulary coverage in various physics genres (%).

BNC/COCA word lists Written genres Spoken genres
Research a. Magazines Textbooks Pop. books Lectures TV docum. TED talks
2,000 64.96 70.65 64.48 72.02 85.91 86.20 85.64
4,000 + proper n., abbrev. and marginal words 87.43 88.58 91.96 95.46 96.21 95.66 96.32
7,000 + proper n., abbrev. and marginal words 90.58 91.52 94.81 97.43 97.72 97.93 98.14
8,000 + proper n., abbrev. and marginal words 91.20 92.10 95.32 97.76 97.89 98.21 98.43
10,000 + proper n., abbrev. and marginal words 91.79 92.67 95.81 98.10 98.21 98.58 98.74
25,000 + proper n., abbrev. and marginal words 93.43 93.79 96.67 98.66 98.90 99.14 99.21

When determining the vocabulary level needed for comprehension, we assume that readers can recognise and know the meaning of: proper nouns; marginal words, i. e. the letters of the alphabet in our corpora; transparent compounds, commonly formed from the high-frequency general words; and abbreviations. Under such an assumption, the results show that the minimum vocabulary coverage level which allows for reading and listening comprehension – that of 95%, could be reached at the level of 4,000 word families for the following four genres:

  1. lectures (96.21%, with a 95% confidence interval of 94.94–96.58%), [1]

  2. documentaries (95.66%, with a 95% confidence interval of 95.44–96.15%),

  3. TED talks (96.32%, with a 95% confidence interval of 96.00–96.81%) and

  4. popular books (95.46%, with a 95% confidence interval of 94.94–95.8%).

In essence, all the spoken genres studied had a lower vocabulary level, i. e. simpler vocabulary, regardless of whether they were academic or not, which is in line with Nation’s finding (2006). The finding precisely matches Dang and Webb’s finding that 4,000 word families are needed for listening to lectures and seminars in physical sciences (2014). Our results for TED talks (4,000 word families for a 95%-threshold) are also close to those found by Coxhead and Walls (2012), who determined that 5,000 word families were needed for that goal.

Not surprisingly, of the written genres, the popular books genre, intended for the widest audience, had the lowest vocabulary load, close to that of the spoken genres. The textbooks, on the other hand, required a level of 8,000 word families to be read adequately at a 95% vocabulary coverage threshold (95.32%, with a confidence interval of 95.07–95.57%), which is significantly higher – it seems that much more vocabulary is expected of university students.

The most arcane texts for reading, amongst the genres analysed, were those of the physics research articles and the magazine articles. Even employing the full set of 25,000 most frequent word families, the 95%-threshold could not be met for them (93.43% and 93.79%, respectively, as given in Table 3). These two genres were the least accessible of the seven analysed and required the most general, academic and technical vocabulary to be read successfully.

It seems that physics research articles are much more demanding than, for instance, business research articles, which reached a 95%-threshold at just 5,000 word families (Hsu 2011). This means that a lot of the vocabulary found in these texts is principally reserved for those specialising in the field of physics and that even impressive knowledge of general vocabulary will simply not be enough for reading them. Our results are in line with Radford’s result for computer science research articles (2013), which could not be read at 25,000 words either.

If we apply the 98%-threshold, which is needed for “pleasurable”, i. e. effortless reading or listening, the most accessible texts we analysed were:

  1. TED talks, requiring 7,000 word families to reach the coverage of 98.14%, with a confidence interval of 97.95–98.49% – somewhat lower than what was found by Coxhead and Walls (2012);

  2. TV documentaries, requiring 8,000 word families to reach the coverage of 98.21%, with a confidence interval of 98.15–98.52%;

  3. lectures, requiring 10,000 word families to reach 98.21%, with a 95% confidence interval of 97.94–98.48%; and

  4. popular books, also requiring about 10,000 word families to reach the coverage of 98.1%, with a 95% confidence interval of 97.87–98.33%.

These results correspond to those referring to the level of vocabulary academicity and technicality, as well as those regarding the minimum reading coverage level. The textbooks, however, needed more than 25,000 word families to reach this level, as did the magazines and the research articles.
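The threshold reading of Table 3 can be expressed as a simple lookup: for each genre, find the smallest reported vocabulary level whose cumulative coverage reaches 95% or 98%. The Python sketch below uses the figures from Table 3; the function name `level_needed` is an assumption, and because the table reports only selected levels, the result is the smallest reported level rather than an exact threshold.

```python
# Vocabulary levels (word families) reported in Table 3. From 4,000
# upwards the figures include proper nouns, abbreviations and marginal words.
LEVELS = [2000, 4000, 7000, 8000, 10000, 25000]

# Cumulative coverage (%) at each level, copied from Table 3.
TABLE3 = {
    "Research articles": [64.96, 87.43, 90.58, 91.20, 91.79, 93.43],
    "Magazines": [70.65, 88.58, 91.52, 92.10, 92.67, 93.79],
    "Textbooks": [64.48, 91.96, 94.81, 95.32, 95.81, 96.67],
    "Popular books": [72.02, 95.46, 97.43, 97.76, 98.10, 98.66],
    "Lectures": [85.91, 96.21, 97.72, 97.89, 98.21, 98.90],
    "TV documentaries": [86.20, 95.66, 97.93, 98.21, 98.58, 99.14],
    "TED talks": [85.64, 96.32, 98.14, 98.43, 98.74, 99.21],
}


def level_needed(genre, threshold):
    """Smallest reported level reaching the threshold, or None if even
    the 25,000 level falls short."""
    for level, coverage in zip(LEVELS, TABLE3[genre]):
        if coverage >= threshold:
            return level
    return None


for genre in TABLE3:
    print(genre, level_needed(genre, 95.0), level_needed(genre, 98.0))
```

A return value of `None` corresponds to the cases described in the text as needing more than 25,000 word families.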

Magazine articles are supposed to have a wider audience than research articles, trying to reach outside the scientific community. It is unlikely, however, that such articles can be successfully read even by highly educated natives outside the field, as they contain substantial technicality. We can assume that the degree of technicality of our corpus arises from the choice of the magazines – we note again, however, that we were limited in the choice of our corpus, as the most widely read magazines typically cover more than just physics.

As noted in Section 4.4, to test this assumption we formed another corpus, containing twelve 2016 issues of New Scientist, one of the most popular science magazines (356,628 words). It is a London-based weekly, sold all over the world, which covers various areas of science and technology and includes speculative articles. According to the magazine’s website, its readership is nearly 1 million every week, including scientists and non-scientists. We performed the same lexical profile analysis against the 25 + 4 BNC/COCA word-list set (Nation 2012). The results show that the cumulative coverage of all the lists together was 94.01%, which is very similar to the level achieved by our physics magazine corpus (93.79%). Admittedly, the results could vary for other magazines, but it can still be concluded that the three science magazines analysed here contain very complex vocabulary, which by far exceeds the vocabulary of a native-speaker high-school graduate. Even though these magazines claim that they target non-scientists as well, we are unsure how comfortable a reading they can provide for this audience. Their readability may also pose a problem to news reporters who often base their newspaper articles on the texts published in such magazines.

Our final table, Table 4, provides an insight into the variability of the vocabulary used in the seven physics genres:

Table 4:

Lexical variety over various physics genres.

Genre Written genres Spoken genres
Magazines Pop. books Research a. Textbooks TV docum. TED talks Lectures
sTTR (25,000 chunks) 0.20 0.14 0.12 0.08 0.17 0.13 0.08
maTTR (25,000 chunks) 0.23 0.15 0.14 0.10 0.15 0.14 0.09

Both methods of measuring lexical diversity, i. e. variability (the sTTR and maTTR) give similar results, with very little variation amongst the chunks (95% confidence intervals range between ± 0.0002 and ± 0.009 in relation to the means). Amongst the written genres, the physics magazines and the popular books displayed the most lexical variability, and amongst the spoken ones, the same can be said of the TV documentaries and the TED talks. It seems that the authors of the genres intended for the most general audience, which are supposed to attract the readers/listeners and be appealing in various ways, invest effort in using variable vocabulary. On the other hand, the academic genres, i. e. those solely directed at an academic audience (the research articles, textbooks and lectures) all employ vocabulary displaying a lower degree of variation, i. e. more repetitiveness. The authors of the academic texts and speech were generally not concerned with the variability of their vocabulary – the presence of conventional academic and technical vocabulary might have acted restrictively in that sense. Thus, the variation degrees did not necessarily correspond to the level of vocabulary, i. e. its rarity. The results relating to lexical variability might be of relevance when selecting L2 materials in terms of the diversity of vocabulary at a certain level of their vocabulary knowledge.
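For readers unfamiliar with the two measures, the Python sketch below shows one common way of computing them. It is an illustration under stated assumptions, not the implementation of the tools used in this study: the standardised TTR is taken as the mean type-token ratio over consecutive, non-overlapping chunks, and the moving-average TTR as the mean over every window of consecutive tokens. The function names are hypothetical, and the tiny window of 4 tokens merely stands in for the much larger chunk size assumed for Table 4.

```python
from collections import Counter


def ttr(tokens):
    """Type-token ratio: distinct word forms divided by running words."""
    return len(set(tokens)) / len(tokens)


def standardised_ttr(tokens, chunk_size):
    """Mean TTR over consecutive non-overlapping chunks.

    A trailing chunk shorter than chunk_size is dropped.
    """
    starts = range(0, len(tokens) - chunk_size + 1, chunk_size)
    chunks = [tokens[i:i + chunk_size] for i in starts]
    if not chunks:
        raise ValueError("text is shorter than one chunk")
    return sum(ttr(chunk) for chunk in chunks) / len(chunks)


def moving_average_ttr(tokens, window):
    """Mean TTR over every window of `window` consecutive tokens."""
    if len(tokens) < window:
        raise ValueError("text is shorter than one window")
    counts = Counter(tokens[:window])
    type_total = len(counts)  # number of types in the first window
    for i in range(window, len(tokens)):
        leaving = tokens[i - window]
        counts[leaving] -= 1
        if counts[leaving] == 0:
            del counts[leaving]
        counts[tokens[i]] += 1
        type_total += len(counts)  # number of types in the current window
    n_windows = len(tokens) - window + 1
    return type_total / (window * n_windows)


tokens = "a b a b c d c d".split()
sttr = standardised_ttr(tokens, 4)
mattr = moving_average_ttr(tokens, 4)
print(sttr, mattr)
```

In this toy sequence the two measures differ because the moving window also samples the stretch where the vocabulary changes, which the fixed chunk boundaries happen to miss.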

6 Discussion

In this study we analysed the vocabulary frequency profile of seven physics academic and non-academic genres: research articles, textbooks, lectures, magazines, popular books, TV documentaries and TED talks. The aim was to determine their vocabulary complexity and see how it relates to the vocabulary knowledge of typical native and non-native speakers of English, i. e. whether the vocabulary level of these genres could pose an impediment to their reading/listening.

With this in mind, our first research question is how much general, academic and technical vocabulary is used in the various physics genres analysed.

Resting on the assumptions stated in the introduction that academic vocabulary creates difficulty for language learners (Coxhead 2000), as well as for native speakers (Snow and Uccelli 2009), and that the same can be said of technical vocabulary for lay persons, regardless of their native/non-native status, we found that the physics research articles, magazines and textbooks cumulatively contained most of this type of vocabulary (13.05%, 11.11% and 10.61%, respectively), as represented by the AWL and the PSL combined. This means that more than 1 in every 10 words may cause difficulty for readers and listeners, which can greatly hinder comprehension. A person unaccustomed to academic language and new to this technical field might, therefore, struggle with reading and listening to these genres vocabulary-wise. On the other end of the spectrum were the physics lectures, TED talks and documentaries, which contained much less of the AWL’s and PSL’s vocabulary (5.06%, 5.17% and 5.33%, respectively), where approximately 1 in 20 words may have caused difficulty for readers and listeners.
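The cumulative percentages in the paragraph above correspond to the sum of the AWL and PSL rows of Table 2, and the "1 in N words" reading follows from dividing 100 by that sum. A brief Python sketch of this arithmetic, with illustrative variable names and the figures copied from Table 2:

```python
# AWL and PSL coverage (%) per genre, copied from Table 2.
awl_psl = {
    "Research articles": (9.60, 3.45),
    "Magazines": (8.82, 2.29),
    "Textbooks": (6.73, 3.88),
    "Popular books": (5.35, 1.38),
    "Lectures": (3.40, 1.66),
    "TV documentaries": (3.50, 1.83),
    "TED talks": (3.79, 1.38),
}

# Combined academic and technical coverage for each genre.
combined = {genre: round(awl + psl, 2) for genre, (awl, psl) in awl_psl.items()}

# A coverage of p% means roughly one such word in every 100 / p running words.
one_in = {genre: 100 / share for genre, share in combined.items()}

for genre, share in combined.items():
    print(f"{genre}: {share:.2f}% (about 1 in {one_in[genre]:.1f} words)")
```

For the research articles this gives roughly 1 word in 8, and for the lectures roughly 1 word in 20.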

Results were similar for the simplest general vocabulary, i. e. the most frequent 2,000 words of English, which suggested that physics TED talks, lectures, documentaries and popular books abound in these (the results ranged between 86.78% and 85.08% for these four genres). Generally, in our corpora, more than 17 out of 20 words belonged to this, most accessible, type of vocabulary for these four genres.

As noted in the analysis, the physics magazines seemed inaccessible to the general public and would not be easily read by them, even though their editors target this population as well.

The physics popular books did have more than 17 out of every 20 words in the category of high-frequency vocabulary, but still contained more academic and technical vocabulary, as represented by the AWL and the PSL, than did the physics lectures, TED talks and documentaries. If the authors of these books intend to include a wider audience, then a reduction in this type of vocabulary could be advised.

The physics textbooks had a much more complex vocabulary than did the physics lectures, which is one of the reasons why it is worthwhile for students to attend lectures before tackling the content of the textbooks themselves as part of self-study. This finding parallels that of Biber (2006: 36), who, having studied various university text corpora, established that textbooks use a much more varied vocabulary than is used in classroom teaching, mostly due to the use of a more specialised vocabulary.

When it comes to the second research question – how many word families are needed to reach adequate and optimal reading and listening comprehension for the various physics genres, it was shown that the medium of communication (written/spoken) greatly influences the vocabulary complexity of genres. Generally, it can be said that the spoken genres, regardless of whether they were academic or not, proved to be the most accessible, which is in line with previous findings from the literature.

The results show that 4,000 word families were needed for adequate reading of and listening to the physics lectures, TED talks, documentaries and popular books; 8,000 word families were needed for adequate reading of the physics textbooks; whereas more than 25,000 word families were needed for adequate reading of the physics research articles and magazines.

When it comes to optimal comprehension, 7,000 word families were needed for listening to the TED talks; 8,000 for the physics documentaries; and 10,000 for the lectures and the popular books. All other genres (textbooks, magazines and research articles) required more than 25,000 word families for optimal comprehension.

In response to the third research question, i. e. how much lexical variability we may encounter in various physics genres, the results showed that the genres directed at the widest audience displayed more variability than the genres with an exclusively academic audience.

The results presented above can help us respond to the fourth research question, i. e. how accessible physics genres are to native and non-native speakers of English vocabulary-wise. Coming back to the finding that a non-native speaker of English who has studied the language for a number of years will typically not know more than 5,000 word families (Nation and Waring 1997), we may conclude that such a speaker will probably be able to understand, vocabulary-wise, physics lectures, TED talks, documentaries and popular books, but none of these at a “pleasurable” level (a 98% coverage), as the simplest of these still require a 7,000-word-family vocabulary level. Proficient non-native speakers may, of course, reach that level, but typical ones will not.

On the other hand, as noted in the introduction, a high-school graduate native speaker of English will know 20,000 word families (Laufer and Yano 2001: 549), which allows for a pleasurable reading of and listening to most physics genres. However, this is not so for research articles and magazine articles, which require more than 25,000 word families even for adequate comprehension. The former are intended for researchers and scientists, and a high-school graduate would likely never need to read these anyway. Still, if editors of science magazines really want to include wider audiences, then all our results point in the direction of a necessary simplification of their vocabulary.
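The comparison between a reader's vocabulary size and the thresholds reported above can be sketched in a few lines of Python. The thresholds are those stated in this section for adequate (95%) and optimal (98%) coverage, with `None` standing for "more than 25,000 word families"; the function name `accessibility` and the three labels are illustrative assumptions.

```python
# Word families needed for (adequate, optimal) comprehension per genre,
# as reported in this study; None = not reached even at 25,000 families.
NEEDED = {
    "TED talks": (4000, 7000),
    "TV documentaries": (4000, 8000),
    "Lectures": (4000, 10000),
    "Popular books": (4000, 10000),
    "Textbooks": (8000, None),
    "Magazines": (None, None),
    "Research articles": (None, None),
}


def accessibility(vocab_size):
    """Classify each genre for a reader who knows vocab_size word families."""
    result = {}
    for genre, (adequate, optimal) in NEEDED.items():
        if optimal is not None and vocab_size >= optimal:
            result[genre] = "optimal"
        elif adequate is not None and vocab_size >= adequate:
            result[genre] = "adequate"
        else:
            result[genre] = "below threshold"
    return result


learner = accessibility(5000)   # typical non-native speaker, per the text
native = accessibility(20000)   # native high-school graduate, per the text
print(learner)
print(native)
```

With 5,000 word families, four genres come out as adequate and none as optimal; with 20,000, four come out as optimal, textbooks as adequate, and magazines and research articles remain below the threshold.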

7 Pedagogical implications

The results of this study could be of particular interest to non-native EAP and ESP learners and their teachers, as they can now make more informed decisions when choosing source materials for their teaching and learning (including the extra materials for pleasure reading/listening). It would be best to first test the learners’ vocabulary using, perhaps, Nation’s vocabulary tests (2013). Knowing how many thousands of words learners know is necessary for deciding which genres they can adequately and optimally cope with, and which genres present a realistic and which an insurmountable challenge for their current vocabulary knowledge. In addition, if the goal is to learn academic words found in physics genres, teachers and learners in this ESP field will know which genres provide the best exposure to such vocabulary. Based on the data provided here, they will also know which genres provide the best diversity of vocabulary of a certain level.

8 Limitations of the study

There are several limitations to the present study. Namely, the spoken corpora are comparatively smaller than the written ones, due to the difficulty of collecting spoken data. We have also said that there is no physics word list currently available, which is why we had to resort to the science word list instead. This entailed the use of the GSL and the AWL, as was explained in the introduction, although newer lists are available. Additionally, the AWL and the PSL include only the frequent academic and scientific-technical vocabulary, and are thus not a measure of all academic and technical vocabulary used in the corpora. The results of this study should therefore be interpreted with these limitations in mind.

9 Conclusion

The findings presented in this study provide some insights into the type, level and variation of the vocabulary used in the physics-communicating genres. We hope that this relatively systematic vocabulary profiling study of one discipline across its various genres will be of relevance not only to those interested in physics and the related disciplines, but also to the EAP and ESP professionals and researchers conducting corpus-based vocabulary investigations.

References

Akinnaso, Niyi F. 1982. On the differences between spoken and written language. Language and Speech 25(2). 97–125.

Anderson, Richard Chase & Peter Freebody. 1981. Vocabulary knowledge. In John T. Guthrie (ed.), Comprehension and teaching: Research reviews, 77–117. Newark, DE: International Reading Association.

Anthony, Laurence. 2014. AntWordProfiler (Version 1.4.1) [computer software]. Tokyo: Waseda University. http://www.laurenceanthony.net/software (accessed 8 August 2017).

Anthony, Laurence. 2017. AntFileSplitter (Version 1.0.0) [computer software]. Tokyo: Waseda University. https://www.laurenceanthony.net/software/antfilesplitter/ (accessed 15 June 2019).

Biber, Douglas. 2006. University language: A corpus-based study of spoken and written registers. Amsterdam: John Benjamins.

Biemiller, Andrew & Naomi Slonim. 2001. Estimating root word vocabulary growth in normative and advantaged populations: Evidence for a common sequence of vocabulary acquisition. Journal of Educational Psychology 93(3). 498–520.

Browne, Charles, Brent Culligan & Joseph Phillips. 2013. The new academic word list. http://www.newgeneralservicelist.org/nawl-new-academic-word-list/ (accessed 1 April 2018).

Chen, Qi & Guang-Chun Ge. 2007. A corpus-based lexical study on frequency and distribution of Coxhead’s AWL word families in medical research articles. English for Specific Purposes 26. 502–514.

Covington, Michael A. & Joe D. McFall. 2010. Cutting the Gordian knot: The moving-average type-token ratio (MATTR). Journal of Quantitative Linguistics 17(2). 94–100.

Coxhead, Averil. 2000. A new academic word list. TESOL Quarterly 34(2). 213–238.

Coxhead, Averil. 2017. The lexical demands of teacher talk: An international school study of EAL, Maths and Science. Oslo Studies in Language 9(3). 29–44.

Coxhead, Averil. 2018. Vocabulary and English for specific purposes research: Quantitative and qualitative perspectives. London: Routledge.

Coxhead, Averil & David Hirsh. 2007. A pilot science-specific word list. Revue Française de Linguistique Appliquée 12(2). 65–78.

Coxhead, Averil, Paul Nation & Dalice Sim. 2015. Vocabulary size and native speaker secondary school students. New Zealand Journal of Educational Studies 50(1). 121–135.

Coxhead, Averil & Roz Walls. 2012. TED Talks, vocabulary, and listening for EAP. TESOLANZ Journal 20. 55–67.

Cvrček, Václav & Lucie Chlumská. 2015. Simplification in translated Czech: A new approach to type-token ratio. Russian Linguistics 39(3). 309–325.

Dang, Thi Ngoc Yen, Averil Coxhead & Stuart Webb. 2017. The academic spoken word list. Language Learning 67(3). 959–997.

Dang, Thi Ngoc Yen & Stuart Webb. 2014. The lexical profile of academic spoken English. English for Specific Purposes 33. 66–76.

Fraser, Simon. 2007. Providing ESP learners with the vocabulary they need: Corpora and the creation of specialised word lists. Hiroshima Studies in Language and Language Education 10. 127–143.

Gardner, Dee & Mark Davies. 2014. A new academic vocabulary list. Applied Linguistics 35(3). 305–327.

Halliday, David, Robert Resnick & Jearl Walker. 2007. Fundamentals of physics. Hoboken, NJ: Wiley.

Hancioğlu, Nilgün & John Eldridge. 2007. Texts and frequency lists: Some implications for practising teachers. ELT Journal 61(4). 330–340.

Hsu, Wenhua. 2011. The vocabulary thresholds of business textbooks and business research articles for EFL learners. English for Specific Purposes 30(4). 247–257.

Hsu, Wenhua. 2013. Bridging the vocabulary gap for EFL medical undergraduates: The establishment of a medical word list. Language Teaching Research 17(4). 454–484.

Hsu, Wenhua. 2014. Measuring the vocabulary load of engineering textbooks for EFL undergraduates. English for Specific Purposes 33. 54–65.

Jin, Ng Yu, Lee Yi Ling, Chong Seng Tong, Nurhanis Sahiddan, Alicia Philip, Noor Hafiza Nor Azmi & Mohd Ariff Ahmad Tarmizi. 2013. Development of the engineering technology word list for vocational schools in Malaysia. International Education Research 1(1). 43–49.

Khani, Reza & Khalil Tazik. 2013. Towards the development of an academic word list for applied linguistics research articles. RELC Journal 44(2). 209–232.

Konstantakis, Nikolaos. 2007. Creating a business word list for teaching business English. Elia 7. 79–102.

Kubát, Miroslav & Jiří Milička. 2013. Vocabulary richness measure in genres. Journal of Quantitative Linguistics 20(4). 339–349.

Kwary, Deny A. & Almira F. Artha. 2017. The academic article word list for social sciences. MEXTESOL Journal 41(4). 1–11.

Laufer, Batia. 1989. What percentage of text-lexis is essential for comprehension. In Christer Lauren & Marianne Nordman (eds.), Special language: From humans thinking to thinking machines, 316–323. Clevedon: Multilingual Matters. Search in Google Scholar

Laufer, Batia. 1992. How much lexis is necessary for reading comprehension? In Pierre Arnaud & Henri Béjoint (eds.), Vocabulary and applied linguistics, 126–132. London: Palgrave Macmillan. Search in Google Scholar

Laufer, Batia & Paul Nation. 1995. Vocabulary size and use: Lexical richness in L2 written production. Applied Linguistics 16(3). 307–322. Search in Google Scholar

Laufer, Batia & Geke C. Ravenhorst-Kalovski. 2010. Lexical threshold revisited: Lexical text coverage, learners’ vocabulary size and reading comprehension. Reading in a Foreign Language 22(1). 15–30. Search in Google Scholar

Laufer, Batia & Yasukata Yano. 2001. Understanding unfamiliar words in a text: Do L2 learners understand how much they don’t understand? Reading in a Foreign Language 13(2). 549–566.

Lei, Lei & Dilin Liu. 2016. A new medical academic word list: A corpus-based study with enhanced methodology. Journal of English for Academic Purposes 22(1). 42–53.

Liu, Jia & Lina Han. 2015. A corpus-based environmental academic word list building and its validity test. English for Specific Purposes 39. 1–11.

Martínez, Iliana A., Silvia C. Beck & Carolina B. Panza. 2009. Academic vocabulary in agriculture research articles: A corpus-based study. English for Specific Purposes 28(3). 183–198.

Milička, Jiří. 2013. MaWaTaTaRaD 2 [computer software]. Prague: Charles University and Olomouc: Palacký University. http://www.milicka.cz/en/software.htm (accessed 15 June 2019).

Minshall, Daniel E. 2013. A Computer Science Word List. Wales: Swansea University MA thesis.

Moini, Raouf & Zahra Islamizadeh. 2016. Do we need discipline-specific academic word lists? Linguistics academic word list (LAWL). Journal of Teaching Language Skills 35(3). 65–90.

Mudraya, Olga. 2006. Engineering English: A lexical frequency instructional model. English for Specific Purposes 25(2). 235–256.

Nagy, William E. 1988. Teaching vocabulary to improve reading comprehension. Newark, DE: International Reading Association.

Nation, Paul. 2006. How large a vocabulary is needed for reading and listening? Canadian Modern Language Review 63. 59–82.

Nation, Paul. 2012. The BNC/COCA word family lists. http://www.victoria.ac.nz/lals/about/staff/paul-nation (accessed 1 April 2018).

Nation, Paul. 2013. Learning vocabulary in another language (2nd edition). Cambridge: Cambridge University Press.

Nation, Paul. 2016. Making and using word lists for language learning and testing. Amsterdam: John Benjamins.

Nation, Paul & Robert Waring. 1997. Vocabulary size, text coverage and word lists. Vocabulary: Description, Acquisition and Pedagogy 14. 6–19.

Radford, Paul. 2013. The tyranny of (semi-)technical vocabulary: Challenges facing the student of computer science. Wellington: Victoria University of Wellington MA thesis.

Schmitt, Norbert, Tom Cobb, Marlise Horst & Diane Schmitt. 2015. How much vocabulary is needed to use English? Replication of van Zeeland & Schmitt (2012), Nation (2006) and Cobb (2007). Language Teaching 50(2). 212–226.

Schmitt, Norbert, Xiangying Jiang & William Grabe. 2011. The percentage of words known in a text and reading comprehension. The Modern Language Journal 95(1). 26–43.

Scott, Mike. 2004. The WordSmith Tools, vol. 4.0. Oxford: Oxford University Press.

Snow, Catherine E. & Paola Uccelli. 2009. The challenge of academic language. In David R. Olson & Nancy Torrance (eds.), The Cambridge handbook of literacy, 112–133. Cambridge: Cambridge University Press.

Thompson, Paul. 2006. A corpus perspective on the lexis of lectures, with a focus on economics lectures. In Ken Hyland & Marina Bondi (eds.), Academic discourse across disciplines, 253–270. New York: Peter Lang.

Todd, Richard Watson. 2017. An opaque engineering word list: Which words should a teacher focus on? English for Specific Purposes 45. 31–39.

Valipouri, Leila & Hossein Nassaji. 2013. A corpus-based study of academic vocabulary in chemistry research articles. Journal of English for Academic Purposes 12(4). 248–263.

van Zeeland, Hilde & Norbert Schmitt. 2012. Lexical coverage in L1 and L2 listening comprehension: The same or different from reading comprehension? Applied Linguistics 34(4). 457–479.

Vongpumivitch, Viphavee, Ju-yu Huang & Yu-Chia Chang. 2009. Frequency analysis of the words in the Academic Word List (AWL) and non-AWL content words in applied linguistics research papers. English for Specific Purposes 28. 33–41.

Wang, Jing, Shao-lan Liang & Guang-chun Ge. 2008. Establishment of a medical word list. English for Specific Purposes 27. 442–458.

Ward, Jeremy. 2009. A basic engineering English word list for less proficient foundation engineering undergraduates. English for Specific Purposes 28(3). 170–182.

West, Michael. 1953. A general service list of English words. London: Longman, Green and Co.

Wingrove, Peter. 2017. How suitable are TED talks for academic listening? Journal of English for Academic Purposes 30. 79–95.

Yang, Ming-Nuan. 2015. A nursing academic word list. English for Specific Purposes 37(1). 27–38.

Young, Hugh & Roger Freedman. 2016. University physics with modern physics (14th edition). Harlow: Pearson.

Published Online: 2019-09-27

© 2020 Walter de Gruyter GmbH, Berlin/Boston