Using two case studies from biology, the article demonstrates and analyses how domain-specific self-learning items with variable content can be generated automatically for a blended learning environment. It shows that automated item generation works well even for highly specific technical properties and that a good item quality can be produced. Evaluations are based on sample exercises from two courses in botany and genetics, each with more than 100 participants.
1 Introduction
Blended learning describes the integration of e-learning into existing classroom formats. By combining the advantages of both formats, the learning outcome of the corresponding courses can be increased, whereas the development of content for the individual learning environments is largely determined by the structure of the specific disciplines. Biology as a science is characterised by a multitude of branches and disciplines. All of them have their own far-reaching research traditions, in which discipline-specific teaching methods are used. This paper presents technical aspects of automated item generation for courses in two different disciplines of biology: first, the botanical exercise on biodiversity, which belongs to general botany and is referred to here as the botanical systematics course, and second, the introduction to genetics.
In both cases, students must meet domain-specific performance requirements, many of which can only be assessed by domain-specific assessment items. In the botanical systematics course, for example, students learn to identify flowering plants based on morphological characteristics. The key idea of this course is to teach students how to work with an identification key. In genetics, students are expected, among other things, to elucidate a mode of inheritance of a genetically determined trait by means of its occurrence within a family. For this purpose, there is a domain-specific type of diagram called pedigree, which is conventionally used exclusively to answer genetic questions. In both cases, declarative knowledge (about flowering plants or inheritance mechanisms) and procedural knowledge (about the handling of an identification key or logical reasoning based on rules of inheritance) are necessary. Turning the acquired knowledge into actual procedural skills usually requires some amount of practice. Since teaching assistants are only available for a very limited amount of time per week, students require additional opportunities for practice. The aim of the developed blended learning environments is to enable extended self-assessment by the students and thus to support the development and practice of procedural skills.
This article examines the extent to which it is possible to generate assessment items for self-assessment automatically from highly specific tasks (formats), which cannot be solved by mere memorization even when used repeatedly. Central questions are: (1) Can we ensure a high item quality while considering the domain-specific peculiarities? (2) Which manual or automated parts of the generation process require specialized biological knowledge or capabilities? We answer these questions by presenting and discussing an exemplary item for the identification of a floral formula in dependence on a given plant family for the systematics course and an item for the analysis of a pedigree for the determination of the mode of inheritance at hand in the context of the introduction to genetics.
The presentation of materials in the article is organized as follows: Section 2 provides a very short overview on automated item generation and the current shortcomings with respect to the assessment of procedural knowledge. Section 3 provides information on the context of our research with details on the e-assessment system and the two courses used in the study. Section 4 presents the first case study in botany and Section 5 the second one in genetics. Section 6 summarizes the observations made in the case studies and thus answers our research questions. Section 7 concludes the paper with some remarks on future work.
2 Related Work
Automated item generation is a well-researched area with numerous existing approaches. In particular, template-based approaches are a well-known technique for generating assessment items automatically while ensuring some desired quality properties of the items. Especially in the context of multiple-choice items, these approaches can be used to increase the quality of distractors systematically. However, domain-specific approaches from biology are not in widespread use. Instead, manually created items are typically used in closed learning environments. In contrast, the use of domain-specific systems is common for various types of tasks in other domains (e. g. mathematics).
In contrast to the rare use of domain-specific systems, there is a wide variety of research on ontology-based approaches, which work well independently of domains. A detailed literature review can be found in . For example, Škopljanac-Mačina and Blašković use Formal Concept Analysis (FCA) to generate electrical engineering tasks, whereas Alsubait et al. evaluate different approaches to the ontology-based generation of multiple-choice items, which also include biology ontologies. However, these approaches are hardly suitable for the specific case studies in this article, since the combination of procedural knowledge with wrong declarative knowledge as a distractor cannot be derived directly from ontologies. More precisely, the automated generation of assessment items from domain models beyond factual knowledge can still be considered a problem that is not yet completely solved. Consequently, we employ the aforementioned template-based approach in the research presented in this article.
3 Technical and Organizational Background
3.1 E-Assessment System JACK
We use the e-assessment system JACK as the basic environment to deliver the self-assessments to the students. JACK supports parameterized item templates, so that several instances of similar items can be generated automatically by replacing some parts of the item content every time a student accesses the respective exercise. In mathematics, such methods are commonly used to present the same task with different numerical values. However, the principle is not limited to numerical values and can also apply to texts, images or other item components. Parametrization thus encourages students to fully understand theoretical concepts, because memorization is no longer possible for tasks with sufficient variability. Furthermore, even after successful completion, students can use these tasks for practice several times with different values in order to deepen their understanding of the material.
The e-assessment system JACK contains an expression evaluation module, which can assign variables within items symbolically as well as numerically with different phrases and functions, but also by other procedures such as ontology queries. If required, internal computer algebra systems (CAS) such as Sage or the statistical programming language R are used additionally. From the perspective of software architecture, JACK is designed as a general framework for e-assessment that can be extended by domain-specific modules if necessary. With respect to our second research question, it is particularly interesting to see whether additional domain-specific modules are necessary in the context of the research discussed in this paper.
Item authors create items within JACK by writing at least two XML files. The first one defines the variables to be used in templates as well as the expressions used to determine their values. The second file contains the actual item template, i. e. the fixed parts of the item stem and feedback, with placeholders for parameters in almost arbitrary places in between. Since JACK allows creating scaffolding exercises, each item may contain more than one XML file of this second kind. In that case, each one defines one stage within the scaffolding exercise, and the first XML file defines their sequence and which variables are passed to which stage. More details about JACK items with multiple stages are skipped here, because we only use items with one stage throughout the rest of the paper.
Listing 1 shows a sample item definition in XML for a multiple-choice item with four answer options. Within the <input>-tag, five parameters are defined. JACK will assign different values to these parameters each time an instance of the item needs to be created. The <task>-tag contains the item stem. In Listing 1 it includes an image, where the image reference is actually a placeholder ([var=bild]). Hence, different images can be included in different instances of the item. The <answers>-tag contains four answer options, where each answer option contains just another placeholder. The attribute randomize=true indicates that the answer options will be shuffled before they are displayed to the student. The <choice>-tag within the <correctanswer>-tag defines that the fourth answer option is the key. This is indicated by a pattern string that uses three zeros to mark the first three answer options as wrong and a one to mark the last option as correct. Notably, any pattern is possible here, so that both multiple-choice and multiple-response items can be created. It is even possible to use a placeholder within the pattern string so that a different number of answer options can be correct in different instances of the same item. The <choice>-pattern within the <feedback>-tag is a generic marker for all possible patterns not covered so far. It is used here to define one global error feedback for all wrong answers. The feedback includes a placeholder that reveals the correct answer.
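The interplay of shuffled answer options and a pattern string can be illustrated with a minimal Python sketch. It is not JACK code: the function names `shuffle_options` and `response_pattern` and the example family names are hypothetical, and the sketch only assumes that the pattern refers to the options in their authoring order, as the description above suggests.

```python
import random

def shuffle_options(options, rng):
    """Return the display order as a list of (authoring_index, text) pairs."""
    order = list(enumerate(options))
    rng.shuffle(order)
    return order

def response_pattern(displayed, ticked_positions):
    """Translate ticked display positions back into a 0/1 string in authoring order."""
    ticked_original = {displayed[pos][0] for pos in ticked_positions}
    return "".join("1" if i in ticked_original else "0" for i in range(len(displayed)))

# Hypothetical answer options; the fourth one is the key (pattern "0001").
options = ["Rosaceae", "Fabaceae", "Lamiaceae", "Poaceae"]
key_pattern = "0001"

displayed = shuffle_options(options, random.Random(1))
# The student ticks whichever displayed position shows the key.
pos = next(p for p, (_, text) in enumerate(displayed) if text == "Poaceae")
is_correct = response_pattern(displayed, {pos}) == key_pattern
```

Because the comparison is made on the whole string, a pattern with several ones would describe a multiple-response item without any change to the comparison logic.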
The set of variable definitions corresponding to the item from Listing 1 is shown in Listing 2. The definition also contains some additional variables that provide possible values and ease the construction process. First, two lists of strings (bilder and familien) are defined that contain file names of images and the corresponding terms describing the image content, respectively. Then, an auxiliary variable index is chosen randomly, bounded by the size of these lists. The index is then used twice within the function getFromList() to draw an image and the corresponding term for the variables bild and richtig. Three distractors are drawn randomly from the list of terms afterwards. A function named chooseFromComplement() is used here, which ensures that none of the terms drawn before will be chosen a second time. After all variables have been assigned their values, they are passed to the item template shown in Listing 1 via the list of inputs within the <step>-tag.
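The variable assignment described above can be sketched in Python as follows. This is only an illustration of the logic, not JACK's expression language: `get_from_list` and `choose_from_complement` are hypothetical stand-ins for getFromList() and chooseFromComplement(), whose real signatures are not given here, and the file names and family names are invented placeholders.

```python
import random

def get_from_list(index, items):
    """Return the element at position index (stand-in for getFromList())."""
    return items[index]

def choose_from_complement(pool, excluded, rng):
    """Draw one element of pool that is not in excluded (stand-in for chooseFromComplement())."""
    candidates = [item for item in pool if item not in excluded]
    return rng.choice(candidates)

def build_instance(bilder, familien, rng):
    # One shared index keeps the image and its describing term in sync.
    index = rng.randrange(len(bilder))
    bild = get_from_list(index, bilder)
    richtig = get_from_list(index, familien)
    # Draw three distractors without repeating any term already in use.
    used = [richtig]
    distractors = []
    for _ in range(3):
        term = choose_from_complement(familien, used, rng)
        distractors.append(term)
        used.append(term)
    return {"bild": bild, "richtig": richtig, "falsch": distractors}

# Placeholder data, not taken from the course material.
bilder = ["rosaceae.png", "fabaceae.png", "lamiaceae.png", "apiaceae.png", "poaceae.png"]
familien = ["Rosaceae", "Fabaceae", "Lamiaceae", "Apiaceae", "Poaceae"]
instance = build_instance(bilder, familien, random.Random(42))
```

Drawing the index once and reusing it is what guarantees that the key always matches the displayed image, regardless of which entry is selected.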
Although creating both XML files requires careful work as well as good command of the available functions (such as randomIntegerBetween(), getFromList() and chooseFromComplement()), item authors typically need very little training before they can start to be productive. In fact, our undergraduate teaching assistants were able to produce a large share of the items discussed in the current paper primarily based on sample items and JACK documentation, but without receiving any formal training. Hence, we can assume that the technical context is no major obstacle for creating domain-specific item templates.
3.2 Botanical Systematics Course
The Botanical Systematics Course takes place every summer term and is attended by approximately 250 students. During the course, plant families and their morphological characteristics are presented in each session and real plants are identified with the help of a classification key. Students are expected to acquire declarative knowledge about the morphology and taxonomy of plants, but also to apply it in practical classification. The course concludes with a written exam consisting of a theoretical part on identification features and plant families and a practical part in which students are asked to correctly identify four given plants. An online learning environment was made available for the first time in the 2018 summer term to accompany the identification exercise and to prepare for the final module examination.
3.3 Introduction to Genetics
In the winter term 2018/19, the lecture Introduction to Genetics was supported by a digital learning environment for the first time. The lecture is mandatory for students becoming biology teachers and for students of biology and medical biology (N ≈ 300) as part of their bachelor’s programmes. The environment allows students to post-process the lectures and to prepare for the exam. It comprised a total of 32 dynamically generated exercises on formal, human and population genetics, 20 of which have already been empirically tested. In case study 2, a new task for pedigree analysis is presented.
4 Case Study 1: Botany
In the first step of the blended learning environment accompanying the botanical systematics course, students were able to practice the theoretical aspects of the exercise in a total of 175 tasks covering topics such as floral formulae, floral diagrams, flower and plant morphology, fruit forms, and scientific nomenclature.
The automatic generation of learning tasks represents a fundamental challenge in biology learning, because of a typically strong focus on single facts in basic biology courses. In most learning tasks, there are no formulas that can be varied easily to generate a task pool, as in the case of mathematics.
In the following, we will present two approaches with which we improved the traditional static form of online learning environments in biology, simple task generation and complex task generation.
The main part of the learning environment consists of items based on representations. Since students are expected to learn how to identify plants and families based on a wide variety of their morphological features, they need to learn about these properties. For example, the shape of leaves can be distinguished into 23 different forms, and the edge of the leaf can come in 13 clearly distinguishable types. In many cases, these characteristics occur in combination, so a leaf can, for example, be compound and spear-shaped. In common plant systematics literature, more than 16 main features of plants are commonly used. In order for students to be able to identify plant species and families in a short time, as expected in the final examination of this course, they need to know a large number of terms and recognize them on living plants.
In the simple task generation items, we use a total of 90 self-designed illustrations of the most common plant features in the discussed plant species. Selection of these features was conducted by the botanists teaching the course for many years. Based on these illustrations, multiple-choice and fill-in items were developed with a basic form of randomisation. Both the presented picture of a plant’s morphology, as well as the presented distractors are chosen randomly from a pool when a student loads new items. Separate item pools are offered for every week of the course. Students are free to choose whether to use the most recent pool or an older one. An additional larger item pool is updated every week to contain items for the complete course. In order to give learners more variation in the tool, multiple-choice items were designed in two types: one, where an illustration is shown and the technical terms form the distractors, and a second one with a reverse arrangement.
In these simple task generation items about labelling representations, identifying structures and matching processes, definitions or similar, variation can occur through the images used and the structures or processes inquired, but these variation options are also very limited.
However, there is an opportunity for complex task generation in the special case of plant identification exercises, in which so-called floral formulae are used among other course contents.
For students to be able to assign given plants to their respective families, they need to know different characteristics of the families. One of the central characteristics is the structure of the flower. The most condensed representation of the flower structure is the floral formula. The knowledge of the floral formula is in many cases sufficient to determine the family of a plant. Flower components are identified from the outside to the inside by letters and numbers. The number, arrangement and, if applicable, connation of the components are represented by numbers, brackets and other symbols. We will explain the structure of the floral formulas with the example of the Caryophyllaceae family: * K(5) C5 A5+5 G(5) indicates that it is a radially symmetrical flower (*, i. e. min. 3 axes of symmetry) with 5 fused sepals (K(5)), 5 petals (C5), 5 outer and 5 inner stamens (A5+5) and 5 fused upper carpels (G(5)).
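The notation explained above can be made concrete with a small parser. The following Python sketch is an illustration under simplifying assumptions: it only handles the subset of the notation used in the Caryophyllaceae example (the symmetry symbol `*`, the letters K, C, A and G, brackets for fused parts and `+` for several whorls) and ignores further symbols such as over- or underscores. The function name `parse_floral_formula` is hypothetical.

```python
import re

COMPONENT_NAMES = {"K": "sepals", "C": "petals", "A": "stamens", "G": "carpels"}

# Letter, optional opening bracket, counts joined by "+", optional closing bracket.
TOKEN = re.compile(r"([KCAG])(\()?(\d+(?:\+\d+)*)(\))?")

def parse_floral_formula(formula):
    """Parse a formula such as '* K(5) C5 A5+5 G(5)' into a dictionary."""
    tokens = formula.split()
    radial = tokens[0] == "*"
    rest = tokens[1:] if radial else tokens
    components = {}
    for token in rest:
        match = TOKEN.fullmatch(token)
        if match is None:
            raise ValueError(f"unsupported token: {token!r}")
        letter, open_paren, counts, close_paren = match.groups()
        if (open_paren is None) != (close_paren is None):
            raise ValueError(f"unbalanced brackets in {token!r}")
        components[letter] = {
            "organ": COMPONENT_NAMES[letter],
            "fused": open_paren is not None,              # brackets mark fused parts
            "whorls": [int(n) for n in counts.split("+")],  # outer to inner
        }
    return {"radial": radial, "components": components}

parsed = parse_floral_formula("* K(5) C5 A5+5 G(5)")
```

For the example formula, the stamens are reported as two whorls of five, and only the sepals and carpels are marked as fused.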
As part of the learning platform, a task template was developed in which students have to specify the correct floral formula for a given plant family. Students will only be given the name of the plant family, but no corresponding illustration will be shown.
The automatic generation of the distractors employs a two-stage process that is based on randomisation. In the first step, a random wrong number and two random wrong sums are rolled, of which one is guaranteed to consist of two equal numbers. “Wrong” in this context means that numerical values are used that do not occur in the correct solution, but do not appear completely absurd from a biological point of view. Botanical expertise is therefore necessary at this point in the design process to avoid the choice of obviously wrong values. On the basis of these random values, several wrong contents are then generated for all four components of the floral formula by using the wrong numbers, wrong sums and, if necessary, additional wrong brackets or over-/underscores, as far as this makes sense in terms of content. Here, too, botanical expertise is necessary to rule out the creation of low-quality distractors. In the example in Figure 1, nine false possibilities arise for each of the components K and C, six possibilities for A and five possibilities for G.
In the second random step, the actual distractors are generated from this quantity of incorrect components. One distractor is composed of exactly one incorrect component and three correct components, two of exactly two incorrect components and one of exactly three incorrect components. The two-stage nature of the procedure ensures that the distractors cannot be identified simply by the fact that they always contain different numerical values. In the example used in Figure 1, for instance, the wrong component C(3+3) occurs more frequently than the correct component C5. Therefore, the task cannot be solved simply by choosing the option that contains the most common value for each component.
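The two-stage procedure can be sketched in Python as follows. This is a simplified illustration rather than the item definition used in the course: the set of "plausible" wrong numbers (3, 4, 6) is a placeholder for the values that botanical expertise would select, over- and underscores are omitted, and the function names are hypothetical.

```python
import random

CORRECT = {"K": "K(5)", "C": "C5", "A": "A5+5", "G": "G(5)"}

def toggle_brackets(body):
    """Add fusion brackets to a component body or remove existing ones."""
    if body.startswith("(") and body.endswith(")"):
        return body[1:-1]
    return f"({body})"

def wrong_component_pools(correct, rng, plausible=(3, 4, 6)):
    """Stage 1: roll wrong values and derive a pool of wrong contents per component."""
    n = rng.choice(plausible)          # one wrong number
    a = rng.choice(plausible)          # wrong sum with two equal summands
    b, c = rng.sample(plausible, 2)    # wrong sum with two different summands
    bodies = [str(n), f"{a}+{a}", f"{b}+{c}"]
    bodies += [toggle_brackets(body) for body in bodies]
    pools = {}
    for letter, right in correct.items():
        # Also offer the correct numbers with wrong brackets, e.g. G5 for G(5).
        candidates = bodies + [toggle_brackets(right[1:])]
        pools[letter] = sorted({letter + body for body in candidates} - {right})
    return pools

def compose_distractors(correct, pools, rng, wrong_counts=(1, 2, 2, 3)):
    """Stage 2: build one distractor per entry of wrong_counts."""
    distractors = []
    for k in wrong_counts:
        while True:
            wrong_letters = rng.sample(list(correct), k)
            parts = [rng.choice(pools[l]) if l in wrong_letters else correct[l]
                     for l in correct]
            formula = "* " + " ".join(parts)
            if formula not in distractors:   # avoid showing the same option twice
                distractors.append(formula)
                break
    return distractors

rng = random.Random(7)
pools = wrong_component_pools(CORRECT, rng)
distractors = compose_distractors(CORRECT, pools, rng)
```

Because all distractors of one instance draw from the same small set of rolled values, a wrong content can appear in several options, which is the property that prevents students from spotting the key by counting the most frequent value.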
In the above example there are 320 possible combinations for wrong numerical values and for each of these combinations 2,430 (= 9 * 9 * 6 * 5 for the number of options for each component) distractors can be generated. A total of 777,600 different combinations of distractors can be created for this task, whereby not every combination of this set can occur within a task. Although the choice of distractors is equally distributed, the occurrence of individual components is not equally distributed because the quantities of potential components vary. For example, there are only five possibilities for G but nine for K. Hence it is more likely to see a distractor containing G(4) within an item since statistically one of five distractors will contain that element, than to see one with K(6) that will only occur in one of nine distractors. It could therefore theoretically happen in unfavourable cases that inferior distractors occur more frequently. This question will be examined below by means of an exemplary analysis of usage data from the summer terms 2018 and 2019.
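The figures quoted above can be checked with a few lines of Python. The sketch only reproduces the arithmetic from the text; the counts per component and the 320 value combinations are taken from the example, and the variable names are hypothetical.

```python
from math import prod

options_per_component = {"K": 9, "C": 9, "A": 6, "G": 5}
value_combinations = 320

# Product of the options for each of the four components.
per_combination = prod(options_per_component.values())
total = value_combinations * per_combination

# Chance that one specific wrong content is drawn for a component,
# given that this component is the wrong one and the draw is uniform.
share = {letter: 1 / n for letter, n in options_per_component.items()}
```

With uniform draws, a specific wrong G content is chosen with probability 1/5, whereas a specific wrong K content is chosen with probability 1/9, which is the imbalance discussed in the text.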
The item on the floral formula of Caryophyllaceae shown above was worked on 279 times by 104 students in 2018 and 308 times by 105 students in 2019. A total of 795 different distractors were displayed to the students in 2018 and a total of 848 different distractors in 2019. The correct answer was chosen in 234 cases (83.9 %) and 257 cases (83.4 %), respectively, which corresponds to the average level of difficulty of all tasks of this type. The task was solved in 79 % of the cases in the first attempt.
The three most frequently displayed distractors in 2018 were * K(5) C5 A5+5 G5, * K(5) C5 A(5+5) G(5) and * K(5) C5 A5+5 G(4). They all belong to the group of distractors with one wrong component and were displayed 46 times in total and selected in nine cases (20 %) in 2018. In 2019, the three most frequently displayed distractors were * K(5) C5 A5+5 G(4), * K(5) C5 A5+5 G5 and * K(5) C5 A5+5 G(5+5). They also all belong to the group of distractors with one wrong component, were shown 50 times, and selected 7 times (14 %).
Although these results can be considered positive with respect to the goals of our endeavour, they are not yet robust. For example, another interesting distractor is * K(5) C5 A(5) G(5). It was displayed nine times and selected four times (44.4 %) in 2018 and can hence be considered a good distractor due to the high frequency with which it was chosen. However, the same distractor was displayed 13 times but only selected twice (15.4 %) in 2019, which is a less remarkable frequency. In turn, * K5 C5 A5+5 G(5) was displayed seven times and selected four times (57.1 %) in 2019, but displayed ten times and only selected twice (20.0 %) in 2018. Hence, we conclude that we need a larger data set before we can start a more detailed distractor analysis.
As another observation, we can see that distractors with three wrong components are more equally distributed but chosen by the student very rarely. It thus seems to be useful to skip the generation of distractors with three wrong components in favour of another distractor with one wrong component. As a result, the items may get a bit more difficult and we will be able to collect more data for the remaining distractors.
Although the design principles of all items on floral formulae are the same, results are not similar in all cases. We use another item on the floral formula of the Fabaceae family to illustrate this. That item was finished 203 times by 89 students in 2018 and 251 times by 95 students in 2019. A total of 609 and 741 distractors were displayed in the respective years. The correct answer was chosen in 174 cases (85.7 %) in 2018, but in 192 cases (76.5 %) in 2019, showing a much larger difference between the two terms than in our first example. Moreover, the three most common distractors were displayed 36 times in total, but only chosen once (3 %) in 2018, which is a very low frequency. However, the three most common distractors in 2019 were displayed 40 times and selected 13 times (32.5 %), which is an even higher frequency than in our first example. This again stresses our earlier observation that a much larger data set is required to perform a sound distractor analysis. Nevertheless, the observations on distractors with three wrong components could also be confirmed on the Fabaceae item as well as on other items.
Although we must be careful with the interpretation due to the not yet satisfying data set, several conclusions can be drawn regarding the research questions of this paper. Selection frequencies of 20 % and more for the most common distractors show that the automated generation of good distractors is possible. Moreover, our design principle is able to produce these distractors with a sufficient probability. Both examples show that a large number of distractors are generated, which are not or only rarely chosen and therefore have to be eliminated. Thus, it appears necessary to design the process of task creation and quality assurance iteratively. The configuration of the item generation process needs to be adapted gradually until frequently occurring distractors that are selected rarely are no longer generated for future use. However, unlike the initial determination of the potential false components of the distractors, iterative refinement requires only a small amount of botanical expertise. In particular, the elimination of all distractors with three wrong components is a purely technical step that just requires some minor changes to the item definitions.
The whole blended learning tool with all its different items was used by 209 students in 2018, solving a total of over 25,400 tasks. In 2019, numbers were even higher, with 212 students solving more than 34,300 tasks. In both years, students typically solved 600–900 tasks per week on the days directly before the face-to-face sessions of the course. An additional sharp increase in usage could be observed before the final exams, with more than 10,000 tasks solved in the week before the exam in both years. This indicates that the students consider the platform to be conducive to learning. More detailed analyses regarding the effects of the use on exam results are planned for the future.
5 Case Study 2: Genetics
Human pedigrees represent the occurrence of specific, usually hereditary traits within family contexts. They can be used to determine the inheritance underlying a genetic trait, to map hereditary diseases [10, p. 604], but also in the context of human genetic counselling, whereby the aim of this can be both therapeutic and preventive measures [26, p. 46].
The analysis of human pedigrees to identify depicted modes of inheritance is the subject of teaching in biology classes in secondary schools in Germany. Within the framework of the university course Introduction to Genetics, students are expected to be able to identify the mode of inheritance of a trait based on a given pedigree as well as to determine the probability that a pair of parents represented in the pedigree will have an affected child. Previous study results indicate that the difficulty of “pedigree problems” is largely dependent on the inheritance described.
The JACK-R module provided a suitable basis for presenting parameterized genetic tasks with dynamically generated pedigree problems, since the R package kinship2, available in the Comprehensive R Archive Network (CRAN), already provided a function for the graphical representation of pedigrees (see Figure 2). In order to use such packages, JACK provides various connections to so-called backends that can evaluate R code. In particular, a service based on R-Serve was already available, which returns evaluations of R programs. In order to make the execution of potentially malicious R code safe, it is executed in isolated Docker containers, which also work with blacklisting of security-relevant R commands. In order to allow for the integration of images necessary for pedigree problems, a Spring-based microservice was additionally developed, which tries to create images from R commands and provides them to JACK in case of success. In addition, a newly developed R package provides the necessary functions for the construction of the tasks, which are explained in more detail below.
Since pedigree problems represent problem-solving tasks, their correct solution usually requires the integration of logical reasoning and content knowledge. In order to provide students with sufficient opportunities for practice, pedigrees were automatically generated for all modes of inheritance relevant to the course. The task construction takes place in three steps, with the pedigree generation covering the first two.
In a first step, the family constellation and therefore, the overall pedigree structure is created according to certain rules. Starting from a parental generation, a multi-stage and in parts iterative random procedure is used to generate a realistic family structure. All pedigrees include three, four or five generations starting from one founding couple that has three to five children. From the second generation onwards, a random choice is made between two and four persons for whom an unrelated partner and two (), three () or four () children are created (see Figure 2).
In the second step, based on the simulated mode of inheritance, the transmission of the corresponding gene copies (alleles) that determine the phenotypes is carried out. To this end, the genetic make-up of the founding couple and the partners newly added in each generation must be randomly generated first, depending on the simulated inheritance. Only then can the alleles be passed on to further generations in accordance with the rules of transmission. Here, too, a random function is typically used, since at least in the case of an autosomal and therefore non-sex-linked inheritance, each person inherits exactly one of two gene copies from each parent. In the case of sex-linked inheritance, the transmission of genetic information is modified. However, this is taken into account accordingly. Then, based on the existing or inherited alleles and the mode of inheritance to be simulated, it is possible to determine which person carries the trait (and is therefore represented in the pedigree by a filled symbol), thus completing the pedigree generation.
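For the autosomal case, the transmission rule described above can be sketched in a few lines of Python. The sketch is an illustration only and does not reproduce the authors' R package: the allele labels (`"T"` for the trait allele, `"n"` for the normal allele) and the function names are hypothetical, and sex-linked inheritance is deliberately left out.

```python
import random

def inherit_autosomal(mother, father, rng):
    """Child genotype: one randomly chosen allele from each parent."""
    return (rng.choice(mother), rng.choice(father))

def is_affected(genotype, mode):
    """Derive the phenotype from the genotype for a given autosomal mode."""
    trait_copies = genotype.count("T")
    if mode == "dominant":
        return trait_copies >= 1     # one trait allele is sufficient
    if mode == "recessive":
        return trait_copies == 2     # both copies must be trait alleles
    raise ValueError(f"unsupported mode: {mode!r}")

rng = random.Random(3)
mother = ("T", "n")   # heterozygous carrier
father = ("T", "n")   # heterozygous carrier
children = [inherit_autosomal(mother, father, rng) for _ in range(4)]
phenotypes = [is_affected(child, "recessive") for child in children]
```

In the recessive case, both carrier parents are unaffected themselves, so any affected child produced by this draw yields exactly the kind of informative family constellation used later to exclude dominant inheritance.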
Although all pedigrees generated in the previous steps are designed considering a specific mode of inheritance, a large number of them cannot be clearly assigned to any specific mode of inheritance without further information, because the pedigree does not include informative family constellations or other pedigree features that permit a unique identification of the mode of inheritance. In this case, several modes appear more or less equally likely. In order to solve this problem, we have developed an analysis module, which in a third step checks whether a generated pedigree can be assigned to one mode of inheritance with reasonable certainty. To this end, the analysis module uses only the information available to students. This step would probably be avoidable if unambiguous characteristics were enforced during pedigree generation. This approach, however, was rejected, on the one hand for reasons of complexity and on the other hand in order to manipulate the realistic distribution of alleles over the generations as little as possible, since additional construction rules would systematically influence the structure of the generated pedigrees.
The analysis module recognizes a pedigree as unambiguous if all alternative modes of inheritance can be excluded with sufficient certainty. This can be explained using the example of the pedigree shown in Figure 2. The person with the number 120 has a genetic trait that his parents do not have. Accordingly, this trait is not inherited dominantly, but recessively. With person 106 there is at least one woman who carries the trait, therefore the gene coding for the trait cannot be located on the Y chromosome (since women do not have a Y chromosome). The localisation of the recessive gene on the X chromosome can also be ruled out, since otherwise man 112 would have to be affected too. Based on these considerations, the pedigree can be clearly determined; the trait is inherited autosomal recessively.
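The exclusion reasoning can be sketched as a set of hard rules applied to each person in the pedigree. The following Python sketch is not the authors' analysis module: it assumes that the five modes are autosomal dominant, autosomal recessive, X-linked dominant, X-linked recessive and Y-linked, it assumes full penetrance and no new mutations, and it uses a small hypothetical family instead of the pedigree from Figure 2.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Person:
    pid: int
    sex: str                      # "m" or "f"
    affected: bool
    mother: Optional[int] = None  # pid of the mother, None for founders
    father: Optional[int] = None  # pid of the father, None for founders

MODES = {"autosomal dominant", "autosomal recessive",
         "X-linked dominant", "X-linked recessive", "Y-linked"}

def remaining_modes(people):
    """Return the modes that cannot be excluded by the rules below."""
    by_id = {p.pid: p for p in people}
    modes = set(MODES)
    for p in people:
        # An affected woman rules out a gene on the Y chromosome.
        if p.affected and p.sex == "f":
            modes.discard("Y-linked")
        if p.mother is None or p.father is None:
            continue
        mother, father = by_id[p.mother], by_id[p.father]
        # An affected child of two unaffected parents rules out dominance.
        if p.affected and not mother.affected and not father.affected:
            modes -= {"autosomal dominant", "X-linked dominant"}
        # X-linked recessive: an affected daughter needs an affected father.
        if p.affected and p.sex == "f" and not father.affected:
            modes.discard("X-linked recessive")
        # X-linked recessive: every son of an affected mother is affected.
        if mother.affected and p.sex == "m" and not p.affected:
            modes.discard("X-linked recessive")
    return modes

# Hypothetical family: unaffected parents with an affected daughter.
family = [
    Person(1, "m", False),
    Person(2, "f", False),
    Person(3, "f", True, mother=2, father=1),
]
result = remaining_modes(family)
```

A pedigree counts as unambiguous in this sketch when exactly one mode remains; recording which person triggered which exclusion would also provide the raw material for person-specific feedback.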
Since the proportion of sufficiently clearly identifiable pedigrees is very low depending on the inheritance, we generated a pool consisting of 150 pedigrees (ten each for one of five modes of inheritance and one of three sizes) prior to the start of the course. For this purpose, 6771 pedigrees were automatically generated and analysed by the previously described procedure. For each unambiguously identifiable pedigree, only a seed key, the mode of inheritance and the desired number of generations have to be stored in a data frame, since the seed key leads to reproducible values in pseudorandomized number generation.
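Why storing a seed is sufficient can be shown with a toy example in Python. Here Python's `random.Random` stands in for R's seeded random number generator, `generate_structure` is a hypothetical and heavily simplified replacement for the actual pedigree generation, and the two rows of the pool contain invented values.

```python
import random

def generate_structure(seed, generations):
    """Toy stand-in for pedigree generation: number of children per generation."""
    rng = random.Random(seed)          # private generator, fully determined by the seed
    children = [rng.randint(3, 5)]     # founding couple: three to five children
    for _ in range(generations - 2):
        # Two to four persons of the previous generation found a family.
        couples = min(rng.randint(2, 4), children[-1])
        children.append(sum(rng.randint(2, 4) for _ in range(couples)))
    return children

# Each stored row only holds the construction variables, never the pedigree itself.
pool = [
    {"seed": 1017, "mode": "autosomal recessive", "generations": 3},
    {"seed": 2291, "mode": "X-linked dominant", "generations": 4},
]

row = random.choice(pool)
first = generate_structure(row["seed"], row["generations"])
second = generate_structure(row["seed"], row["generations"])
```

Both calls return the same structure, so a pedigree that passed the analysis once can be rebuilt on demand from its stored row.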
The actual generation of the task only takes place when a student requests a pedigree problem. A random row of construction variables is selected from the previously mentioned data frame, and the corresponding task, including its image, is generated on this basis. In addition, the analysis script is called again, since it not only determines the mode of inheritance at hand, but also generates individual feedback by identifying persons within the pedigree who can be used to exclude the inapplicable modes of inheritance. This feedback is displayed to students whenever they select an incorrect mode of inheritance. After an incorrect submission, they can choose an alternative mode of inheritance. In any case, they can check their answers several times.
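The request and feedback cycle described here might look roughly as follows. This Python sketch is hypothetical in its data layout, function names and message wording; only the person numbers in `evidence` are taken from the example discussed for Figure 2.

```python
import random


def pick_task(pool, rng):
    """Select one stored row at random when a student requests a problem."""
    return rng.choice(pool)


def feedback(selected, correct, evidence):
    """Return a message for one submission.

    `evidence` maps each excluded mode to the ids of the persons
    that rule it out (assumed output of the analysis step).
    """
    if selected == correct:
        return f"Correct: {correct}."
    persons = ", ".join(str(p) for p in evidence.get(selected, []))
    return f"Not {selected}: look at person(s) {persons}."


def practise(task, answers):
    """Process submissions in order until the correct mode is chosen."""
    messages = []
    for selected in answers:
        messages.append(feedback(selected, task["mode"], task["evidence"]))
        if selected == task["mode"]:
            break
    return messages


task = {
    "mode": "autosomal recessive",
    "evidence": {
        "autosomal dominant": [120],
        "X-linked dominant": [120],
        "Y-linked": [106],
        "X-linked recessive": [112],
    },
}

chosen = pick_task([task], random.Random(1))
for line in practise(chosen, ["Y-linked", "autosomal recessive"]):
    print(line)
```

Keeping the evidence per excluded mode means the feedback always points to concrete persons in the pedigree rather than only reporting that an answer was wrong.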
Across the selected tasks, all five relevant modes of inheritance and all three pedigree sizes were answered with comparable frequency, as intended. A total of 2266 pedigree problems were answered by 210 students, which corresponds to an average of 10.8 problems per student, although individual usage varied greatly (86 students answered at most three problems; the maximum for a single student was 155). Notably, students used the exercises mainly to prepare for the final exam: more than 100 problems were completed on each of the three days preceding each of the two exam dates, of which students could choose one, whereas the average throughout the lecture period was fewer than 25 completed problems per day. On average, the pedigree problems are of reasonable difficulty. In 65.7 % of cases, students chose the correct solution on the first attempt. The cumulative probability of choosing the correct solution within two or three attempts was 86.6 % and 94.5 %, respectively. The difficulty varies remarkably with the mode of inheritance depicted and also with the number of generations shown (see Figure 3). Since there is no clear trend regarding the influence of the number of generations, further investigations on a larger data set are necessary to draw more precise conclusions about the difficulty of the generated problems.
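The reported figures can be checked and broken down with a little arithmetic. The Python sketch below uses only the numbers stated above; the per-attempt (conditional) success rates are derived values, not results reported by the study, and they assume that all three cumulative percentages refer to the same set of problems.

```python
problems, students = 2266, 210
mean_per_student = problems / students

# Cumulative share of problems solved within 1, 2 and 3 attempts.
cumulative = [0.657, 0.866, 0.945]

# Share of still-unsolved problems that are solved on each attempt.
conditional = []
previous = 0.0
for value in cumulative:
    conditional.append((value - previous) / (1.0 - previous))
    previous = value

print(f"mean problems per student: {mean_per_student:.1f}")
for attempt, rate in enumerate(conditional, start=1):
    print(f"attempt {attempt}: {rate:.1%} of remaining problems solved")
```

Under that assumption, roughly 61 % of the problems missed on the first attempt are solved on the second, and roughly 59 % of those still open are solved on the third.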
However, based on the results so far, we assess the task generation and selection as satisfactory, since pedigrees of all sizes and for all modes of inheritance were generated and processed in sufficient numbers. The average frequency of use by the students suggests that the variability of the generated tasks is great enough for pedigree analysis to be practised multiple times with this task.
Furthermore, the quality of the tasks will prospectively be evaluated based on explanations, which the students will also have to provide. These explanations are of particular importance, since a complete and conclusive pedigree analysis should not only identify the present mode of inheritance, but also justify the result for instance by excluding all alternative modes of inheritance based on evidence. In addition, a further effectiveness analysis is planned on the basis of examination results, whereby the study design and the high variability of the tasks pose particular challenges.
6 Analysis Results
There were two central questions for the research presented in the current paper: (1) Can we ensure a high item quality while considering the domain-specific peculiarities? (2) Which manual or automated parts of the generation process require specialized biological knowledge or capabilities?
Based on the findings from the two independent case studies, we can report a positive answer to the first question. It was possible to capture the technical complexity of the domain-specific tasks in algorithmic form or in the form of item definitions. The resulting parameterized items carried enough information to enable the fully automated generation of meaningful item instances. Domain-specific peculiarities (such as the creation of plausible distractors for floral formulas or specific explanations for each distractor in the case of the pedigree problems) could be incorporated into the generation process. Moreover, judging by their probability of being solved, the items generated this way turned out to be sufficiently difficult, so that automation is not accompanied by a reduction in task quality. However, the results also show room for improvement in the quality of some distractors. In particular, a smaller set of distractors for the botany items may raise the quality of these items.
This is also an interesting observation with respect to our second research question. The identification of superfluous distractors seems to be a merely mechanical task that can be performed based on statistical investigations on large data sets. Consequently, among the manual steps only the initial item design requires substantial biological knowledge to configure a meaningful generation process. For the botany items, the functions used within the item definitions are generic, while the algorithms required to check each generated pedigree for solvability are domain-specific. Since these were implemented in R, they could be added as a module within the JACK software architecture. The e-assessment system JACK itself did not require any additional domain-specific capabilities. Consequently, the required biological capabilities could be kept small and local also for the automated parts of the generation process.
7 Conclusions and Future Work
This article shows that highly specific types of biological tasks can be successfully generated automatically. The expertise required for this can essentially be coded within the framework of the design process in order to achieve a high item quality, while a readjustment based on statistical features with less specialist knowledge appears possible.
In addition to the iterative improvement of the existing tasks and the expansion of the task pools within the respective disciplines, it appears possible to transfer the task format from botany to other areas of biology that also use a formula-based representation, for example tooth formulae in zoology. Furthermore, a derived format could be used to learn anatomical terms, which often consist of several component terms (e.g. Musculus adductor magnus and Musculus adductor longus).
Although the task on pedigrees and the underlying script cannot easily be reused to develop other biology-specific tasks, it illustrates that problem-solving tasks based on domain-specific diagrams can be generated automatically. This would be conceivable, for example, in the field of evolution with dendrograms (in the sense of phylogenetic family trees).
Funding source: Bundesministerium für Bildung und Forschung
Award Identifier / Grant number: 01PL16075
Funding statement: The research work in this paper was funded by the German Federal Ministry of Education and Research (BMBF) under funding number 01PL16075. The responsibility for the content of this publication lies with the authors.
About the authors
Justin Timm is a research assistant at the University of Duisburg-Essen, Germany. He completed his teacher training for the subjects biology and chemistry at the Justus Liebig University Gießen, Germany in 2014. After that, he completed the practical phase of teacher training at an upper secondary school in Aachen, Germany in 2016. Since then he has been a member of the biology education group of Philipp Schmiemann in Essen. He is particularly interested in students' reasoning on pedigree problems and systems thinking.
Benjamin Otto first studied technical computer science at the Beuth University of Applied Sciences in Berlin, where he discovered his interest in mathematics. After completing his intermediate diploma, he moved to Essen to study mathematics with computer science as a minor. He finished his studies in 2015 with a diploma degree in mathematics from the University of Duisburg-Essen and since then has been working on e-assessment, container technologies and Java microservices. In particular, the parameterized generation of programming tasks is the focus of his research activities.
Thilo Schramm is a research assistant at the University of Duisburg-Essen, Germany. He has completed his teacher training for the subjects mathematics and biology and is currently working as a PhD student at the department of Biology Education. His research interests encompass evolutionary trees, skills involved in reading them and the way university students are interacting with them. Furthermore, he is working in collaboration with the department of Botany at the University of Duisburg-Essen to develop and assess online learning opportunities for existing courses.
Michael Striewe is a research associate at the University of Duisburg-Essen, Germany. He received a diploma degree in computer science in 2007 from the Technical University Dortmund and a PhD degree in computer science in 2014 from the University of Duisburg-Essen. His research interests combine software engineering and technology-enhanced learning with an emphasis on the design and analysis of computer assisted assessment systems. A special emphasis of his work is on automated generation of competency-oriented feedback as well as on automated generation of domain-specific, complex assessment items. He is co-founder of a national workshop series on automated assessment of programming assignments and co-editor of a book on the same topic.
Philipp Schmiemann is associate professor for Biology Education at the University of Duisburg-Essen, Germany. His research interests are students’ learning and learning difficulties in the context of evolution, genetics, and systems thinking in particular. He is working in some interdisciplinary projects on science education and teacher professional development, too. Moreover, he is the vice dean of the faculty of Biology and chairman of the German Biology Education Association (FDdB im VBIO).
Michael Goedicke is professor for computer science at the University of Duisburg-Essen and member of paluno - The Ruhr Institute for Software Technology. He studied computer science at the University of Dortmund and completed his doctorate there in 1985 on specification languages for embedded systems. Subsequently, he did research in the areas of specification of software architectures and description of software components, also at the University of Dortmund. In 1993, he completed his habilitation on the topic of specification of software components. Since 1994 Michael Goedicke has been professor for practical computer science / specification of software systems at the University of Essen (since 2003 University Duisburg-Essen). Further research interests include viewing and describing artifacts and methods for the development of software systems, the architecture of software systems taking into account quality characteristics such as performance and scalability, and technologies and platforms for the development of software systems.
We would like to thank our colleagues in botany and genetics for their support during the development of tasks and the opportunity to use them in courses.
© 2020 Timm et al., published by De Gruyter
This work is licensed under the Creative Commons Attribution 4.0 Public License.