When is the best time to learn? -- Evidence from an introductory statistics course

We analyze learning data of an e-assessment platform for an introductory mathematical statistics course, more specifically the time of the day when students learn. We propose statistical models to predict students' success and to describe their behavior with a special focus on the following aspects. First, we find that learning during daytime and not at nighttime is a relevant variable for predicting success in final exams. Second, we observe that good and very good students tend to learn in the afternoon, while some students who failed our course were more likely to study at night but not successfully so. Third, we discuss the average time spent on exercises. Regarding this, students who participated in an exam spent more time doing exercises than students who dropped the course before.


Introduction
Modern university courses often use e-assessment systems. Especially in courses with many participants, e-learning tools are very useful for giving students individual feedback. Courses with quantitative content such as statistics and introductory mathematics are particularly suitable for e-assessment, since fill-in exercises, which require students to submit a numeric answer, make it possible to assess unambiguously whether students can solve an exercise. The e-assessment system JACK is a framework for delivering and grading complex exercises of various kinds. It was originally created to check programming exercises in Java [31], but has been extended to several other exercise types such as multiple-choice and fill-in exercises [32,30,28]. JACK offers parameterizable content, meaning that exercises can contain different values each time they are practiced. Not only do different students get different parameterizations, but the same student also sees different numbers each time s/he selects the exercise. Hence, the exercise remains challenging until s/he understands the underlying concept needed to solve it.
In addition to fill-in exercises, JACK allows instructors to design exercises with dynamic programming content. For instance, JACK offers electronic programming exercises in Java or R, the standard statistical programming language. Programming exercises not only help to prepare students for modern statistical work, but have also been shown to foster their understanding of statistics, see [24,20].
This study analyzes JACK data to understand students' learning behavior in an introductory mathematical statistics course more deeply. The high correlation between learning effort during the semester and the final grades is well documented, see [21] and Section 2 for more examples. Here, we aim to investigate additional aspects, namely the relevance of the time of day when students learn. To do so, we use several statistical learning methods to study which factors in students' learning behavior are relevant for predicting their success in the exam. It turns out that daytime activity is more effective than nighttime activity.
Additionally, we find that good and very good students prefer to learn in the afternoon, while some students who failed our course studied late at night, but not successfully so. Moreover, we discuss the average time spent on exercises: students who participated in an exam spent more time on exercises than students who dropped the course beforehand.
The remainder of this paper is organized as follows: Section 2 provides a brief overview of related work. Section 3 introduces the statistics course analyzed here and Section 3.1 presents the available data and the models used. Section 4 discusses the empirical results. Section 5 concludes.

Related work
The overall engagement of students is indisputably one of the main covariates of academic success. For the case of mathematical statistics, this has been shown on several occasions. [29] show in a meta-study that the simultaneous usage of traditional classroom lectures and e-assessment has a positive effect on students' success. [21] substantiate the previous result by analyzing the learning activity on the e-assessment platform JACK. The study reveals that learning effort and success, measured by the total number of (correct) submissions on JACK over the course, positively affect the final grade in the exam. [24] add additional R-programming exercises to the JACK framework and show that the newly introduced exercise type helps to improve the general understanding of fundamental statistical concepts and thus ultimately yields better results in the final exam.
Due to the empirically observed positive effect of a multitude of variables on academic performance, prediction of the latter has become possible. In this so-called branch of educational data mining various statistical learning methods are applied to educational data in order to predict student outcomes. Often, but not necessarily, this outcome is measured with a binary response of pass/fail in order to be able to provide an early warning to students. [17,14,22] give a comprehensive overview of popular statistical learning methods used in the literature. For an overview of how to implement an early-warning system see, e.g., [7]. The literature has identified a number of important predictors. [13] find evidence for the importance of socioeconomic and psychometric variables as well as pre-university grades, although [23] show that, especially among the socioeconomic variables, the predictive capability can vary across countries. [4] additionally identify post-admission variables like obtained credits, degree of exam participation and exam success rate to have an influence on students' success. [19,33,10] analyze the learning activity on learning management systems and are able to accurately predict students' performance with appropriate variables. In a similar but more assessment-based fashion [16,9,20] use activity in e-learning frameworks as well as the results of mid-term exams to predict students' success in the final exam. [3] identify the performance in a small number of selected courses as a predictor for the academic achievement at the end of the study program. For a broad literature review concerning the usage of educational data mining see [27].
In contrast to previous studies which mostly rely on quantitative and qualitative learning activity measured by time-invariant variables, there is also a temporal dimension of engagement which has been studied from different perspectives throughout the academic literature. [15] use time-dependent information provided by a learning management system to predict academic performance. [34] incorporate students' response time as an additional feature into a random forest to investigate the predictive capability for students' performance and find evidence that it can indeed improve prediction accuracy. [26] and [25] further elaborate on the latter by using more sophisticated techniques and are able to support the preceding result.  Only a few studies focus on the intraday engagement of students, that is, the actual daytime of learning, as a predictor for academic success. This topic is relevant as various studies show a significant influence of sleep quality and patterns on academic performance [2,5,6,11]. Based on these insights [12] incorporate sleep variables into a prediction setting. With a stepwise regression approach they identify sleep frequency, night outings and sleep quality as among the most important predictors of academic success.

Course Structure
This section outlines the initial setup of the study. In particular, we sketch the structure of the analyzed course.
The e-assessment system JACK was used for a lecture-and-exercise course in mathematical statistics at the University of Duisburg-Essen in Germany. 753 undergraduate first-year students started the course. The course is compulsory for several business and economics programs as well as in teachers' education. Out of these 753 students, only 379 took an exam at the end of the course, while the others dropped the course in this term (see Table 1). The course also introduces statistical programming skills using the statistics software R. To this end, the e-assessment system JACK offers programming exercises where the correctness of students' code is assessed, in addition to classical fill-in and multiple-choice exercises.
The course consisted of a weekly 2-hour lecture, which introduced statistical concepts, and a 2-hour exercise class, which presented explanatory exercises and problems. Both classes were held in the traditional format in a lecture hall. Due to the large number of students, these classes are limited in addressing students' different speeds of learning and individual questions. To overcome this issue and to encourage self-reliant learning, as well as to support students who had difficulties attending classes, we offered all homework on JACK. All in all, we offered 173 different exercises on JACK, of which 48 were designed as R-programming exercises and the remainder as fill-in or multiple-choice exercises. Individual learning success was supported by specific automated feedback and, furthermore, by optional hints. For additional questions covered by neither hints nor feedback, students were able to ask questions in our Moodle help forum.
In order to further encourage students to learn continuously during the semester, and not only in the weeks prior to the exams, we offered five online tests using JACK. These tests lasted 40 minutes and took place at fixed times in the evening. Four of the online tests contained fill-in or multiple-choice exercises only. The fifth online test contained R exercises exclusively. Participation only required a device with internet access, but no attendance at university. This summative assessment allowed students to assess their individual state of knowledge during the lecture period. Students did not have to participate in the online tests in order to take the final exam at the end of the course. Instead, we offered bonus points for the final exams to encourage participation in the tests (a maximum of 10 bonus points in total for the fill-in online tests). The bonus points were only added to the final exam points if at least 25 out of 60 exam points were reached, i.e., if students passed the exam without the bonus. The R online test was worth at most 2 bonus points, which were awarded even if students achieved fewer than 25 points. The reason for this was to motivate students to focus on programming skills, since [24] and [20] show that these have a substantially (three times) higher impact on exam success than classical fill-in exercises.
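The bonus rule described above can be written down as a short sketch. The function name `final_points` and its parameters are hypothetical, and the sketch assumes that the 25-point threshold is checked against the raw exam points before any bonus is added; the text does not spell out further details of the rule.

```python
def final_points(exam_points, fill_in_bonus, r_bonus, pass_mark=25):
    """Exam points plus bonus points under the rule sketched in the text."""
    # Caps taken from the text: at most 10 fill-in and 2 R bonus points.
    fill_in_bonus = min(fill_in_bonus, 10)
    r_bonus = min(r_bonus, 2)
    # The R bonus is awarded regardless of the exam result.
    total = exam_points + r_bonus
    # The fill-in bonus only counts if the exam was passed without any bonus
    # (assumption: the threshold refers to raw exam points).
    if exam_points >= pass_mark:
        total += fill_in_bonus
    return total


passed = final_points(exam_points=30, fill_in_bonus=8, r_bonus=2)
failed = final_points(exam_points=24, fill_in_bonus=8, r_bonus=2)
```

With these inputs, the student with 30 exam points receives both bonuses, while the student with 24 exam points only receives the R bonus.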
The final exams (3 in total) were also held electronically. While online tests during the semester could be solved at home with open books, the final exams were offered exclusively at university PC pools and supervised by academic staff. The exam consisted of R exercises (∼ 15%), short handwritten proofs (∼ 15%) and the remainder of fill-in exercises. Students can only retake an exam if they failed or did not take the previous ones (so that students can pass at most once), but can fail several times (see footnote 1). The last grade a student achieved in an exam will be denoted as the final grade. The corresponding exam will be denoted as the final exam.

Data and Models
In this section we present the available database and the models used. For each homework submission by a student on JACK we observe the exercise ID, the student ID, the number of points (on a scale from 0 to 100) and the time stamp, recorded to the minute.
The response variables are given by the final exam success. We consider two possible responses. The first is a binary variable indicating whether a student passed (1) or did not pass (0) the course. Second, we consider the final grade as a response. We have the following grading scheme: very good ("100"), good ("200"), satisfactory ("300"), sufficient to pass ("400") and failed ("500"). We assign "600" to students who took the course but did not participate in any of the exams. This is not actually a grade. However, it reflects the view that students who did not take any exam were even less prepared than students who failed the exams. Table 2 reports an overview of final grades. We do not report grades for one specific exam date but the grade given at the end of the course.
JACK registered 163,444 submissions of homework exercises. See Table 1 for how these submissions are distributed among students. Figure 1 plots the number of daily submissions on JACK aggregated for all students. Characteristically, the number of submissions peaks shortly before a summative assessment.
Footnote 1: Students obtain 6 malus points for each failed exam, of which they may collect at most 180 during their whole bachelor program.
We compile the following information for each student i from the raw data:
• the number of submissions (# submissions in short),
• the number of fully correct submissions (100 points),
• the number of submissions in the morning from 8am to 12pm,
• the number of submissions in the afternoon from 12pm to 4pm,
• the number of submissions in the evening from 4pm to 8pm,
• the number of submissions in the late evening from 8pm to 12am,
• the number of submissions at night from 12am to 8am,
• the median submission time (see Subsection 4.2).
• The score, which is defined as follows: let t be a day during the semester. Then the score of student i at day t is $\sum_{j=1}^{n} x_{ijt}$, where $x_{ijt}$ is the number of points of the latest submission up to time t of student i in exercise j, j = 1, . . . , n. In other words, the score is the sum of points of the last submissions to every exercise. This helps to track the learning progress of every student. In particular, we consider the final score, which is the score evaluated at the end of the term.
• The frequency of submissions, i.e., the mean time between two consecutive days with submissions, measured in days.
• The time until a student hands in the first submission from the beginning of the term, measured in days.
• The time until a student hands in the last submission before his/her last exam, measured in days.
• The number of days on which a student submitted solutions.
• The average time spent per exercise, measured in minutes (see Subsection 4.3).
Table 3 reports summary statistics. Figure 2 plots the average score of students with different grades and of students who dropped the course. Evidently, good and very good students made strong learning progress from the beginning of the semester onward. Students with the sufficient pass grade "400" and students who failed ("500") start similarly weakly but improve shortly before the exam. Students with "400", however, improve slightly more, which may be the reason that they pass the exam. On the other hand, they may just have been lucky in the exam. The students who dropped the course show very little progress on average.
Table 3: Overview of empirical quartiles, mean and standard deviation for the considered covariates.
[Figure 2: Progress of average score]
We choose the following modeling approaches for the classification problem (see footnote 2):
Footnote 2: We also tried other modeling approaches than the ones stated here (e.g. neural networks, support vector machines, etc.). However, their predictive performance proved not to be competitive.
• Logistic Regression. Logistic regression models the probability of an event given p regressor variables $X = (X_1, \dots, X_p)$ via
$$p(X) = \frac{\exp(\beta_0 + \beta_1 X_1 + \dots + \beta_p X_p)}{1 + \exp(\beta_0 + \beta_1 X_1 + \dots + \beta_p X_p)}. \tag{1}$$
The idea is to regress the log-odds, $\log \frac{p(X)}{1 - p(X)}$, on a linear combination of X. So equation (1) can be rewritten as
$$\log \frac{p(X)}{1 - p(X)} = \beta_0 + \beta_1 X_1 + \dots + \beta_p X_p.$$
The unknown coefficients $\beta_0, \beta_1, \dots, \beta_p$ are estimated based on the available data. We use maximum likelihood, see [17] for more details. We measure the variability of the estimates $\hat\beta_0, \hat\beta_1, \dots, \hat\beta_p$ via their standard errors $\mathrm{SE}(\hat\beta_l)$, $l = 0, 1, \dots, p$. From this we obtain the t-statistics $t_l = \hat\beta_l / \mathrm{SE}(\hat\beta_l)$. [18] recommends using the absolute value of the t-statistic of each non-constant regressor as the importance measure for logistic regression.
The above approach can easily be extended to ordered logistic regression, in which we want to predict a variable with k > 2 possible outcomes (multi-class classification), see [1]. We use binary logistic regression to predict the response "student passed", and ordered logistic regression to predict the grade.
• Random Forests. Tree-based methods can be used for two-class classification as well as multi-class classification. Single decision trees are very easy to interpret but have the drawback of a high variance. To avoid this problem, [8] proposed an algorithm for averaging decision trees to obtain a so-called random forest. The idea is to take B bootstrap samples from the single training data set. Then, a tree is trained on every bootstrapped training data sample. Finally, the prediction is the majority vote, i.e., the most commonly occurring class over all B predictions, see [17]. Each of the single trees has a high variance but a low bias. Averaging over all trees reduces the variance. Another problem is that, in each split of the trees, every variable in the predictor space is considered. If there is, for example, one very strong predictor, it will be used in each tree for the first split. This leads to a high correlation between the trees. To avoid this problem, only a random sample of the p predictors is considered at each split when searching for an optimal split. The number of predictors in this random sample is usually set to $\sqrt{p}$. [8] also proposed using the Mean Decrease Accuracy as the importance measure for the input variables. We build 500 trees to grow the forest and try 3 variables at each split.
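The three ingredients named above (bootstrap samples, a random subset of predictors, and a majority vote) can be illustrated with a deliberately simplified sketch. It is not a full random forest: each "tree" is a depth-one stump, and the predictor subset is drawn once per stump rather than at every split. All function names and the toy data are hypothetical.

```python
import random
from collections import Counter


def fit_stump(rows, labels, features):
    """Best single split (feature, threshold) by misclassification count."""
    best = None
    for f in features:
        for threshold in sorted({r[f] for r in rows}):
            left = [lab for r, lab in zip(rows, labels) if r[f] <= threshold]
            right = [lab for r, lab in zip(rows, labels) if r[f] > threshold]
            if not left or not right:
                continue
            left_cls = Counter(left).most_common(1)[0][0]
            right_cls = Counter(right).most_common(1)[0][0]
            errors = (sum(lab != left_cls for lab in left)
                      + sum(lab != right_cls for lab in right))
            if best is None or errors < best[0]:
                best = (errors, f, threshold, left_cls, right_cls)
    if best is None:
        # No usable split in this sample: always predict its majority class.
        majority = Counter(labels).most_common(1)[0][0]
        return (features[0], float("inf"), majority, majority)
    return best[1:]


def fit_forest(rows, labels, n_trees=25, mtry=1, seed=0):
    rng = random.Random(seed)
    n, p = len(rows), len(rows[0])
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]   # bootstrap sample
        feats = rng.sample(range(p), mtry)           # random predictor subset
        forest.append(fit_stump([rows[i] for i in idx],
                                [labels[i] for i in idx], feats))
    return forest


def predict(forest, row):
    """Majority vote over all stumps."""
    votes = [left if row[f] <= t else right for f, t, left, right in forest]
    return Counter(votes).most_common(1)[0][0]


# Toy data (hypothetical): two predictors per student, label 1 = passed.
rows = [(1, 10), (2, 11), (3, 12), (4, 13), (10, 20), (11, 21), (12, 22), (13, 23)]
labels = [0, 0, 0, 0, 1, 1, 1, 1]
forest = fit_forest(rows, labels)
low, high = predict(forest, (1, 10)), predict(forest, (12, 22))
```

The defaults `n_trees=25` and `mtry=1` are chosen for the toy data only; the forest in the text uses 500 fully grown trees and 3 candidate variables per split.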

Empirical results
This section analyzes students' learning behavior. We discuss which learning strategy turns out to predict students' exam success.

Variable Importance
Our first analysis discusses which of the explanatory variables have a high predictive relevance. To model the target variable, i.e., passing the course or achieving a certain grade, we use the following set of variables as predictors in all of our models: {# correct, # morning, # afternoon, # evening, # late evening, # night, score at day of first online test, score at day of third online test, final score, frequency, first submission, last submission, # days, total time spent for exercises}. We dropped some variables like the total number of submissions to avoid high correlation between the predictor variables.
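Several of the predictors listed above are simple per-student aggregates of the raw submission records (student ID, exercise ID, points, time stamp). The following sketch shows one way such aggregates could be computed. The record layout, the function names and the treatment of window boundaries (start inclusive, end exclusive) are assumptions, not a description of the actual data pipeline.

```python
from collections import defaultdict
from datetime import datetime

# Daytime windows from the data description, as (name, start hour, end hour).
WINDOWS = [("night", 0, 8), ("morning", 8, 12), ("afternoon", 12, 16),
           ("evening", 16, 20), ("late evening", 20, 24)]


def window_of(ts):
    for name, start, end in WINDOWS:
        if start <= ts.hour < end:
            return name


def aggregate(submissions):
    """submissions: iterable of (student_id, exercise_id, points, timestamp)."""
    features = defaultdict(lambda: defaultdict(int))
    latest = {}  # (student, exercise) -> (timestamp, points) of latest submission
    for student, exercise, points, ts in submissions:
        f = features[student]
        f["# " + window_of(ts)] += 1
        if points == 100:
            f["# correct"] += 1
        key = (student, exercise)
        if key not in latest or ts >= latest[key][0]:
            latest[key] = (ts, points)
    # Final score: sum of the points of the latest submission per exercise.
    for (student, _), (_, points) in latest.items():
        features[student]["final score"] += points
    return {s: dict(f) for s, f in features.items()}


# Hypothetical records for two students.
log = [
    ("a", "e1", 40, datetime(2017, 5, 2, 9, 30)),
    ("a", "e1", 100, datetime(2017, 5, 2, 14, 5)),
    ("a", "e2", 60, datetime(2017, 5, 3, 23, 10)),
    ("b", "e1", 0, datetime(2017, 5, 4, 1, 15)),
]
result = aggregate(log)
```

For student "a" the final score uses the second, fully correct submission to exercise "e1" rather than the first attempt, in line with the definition of the score as the sum of points of the latest submissions.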
To compare the performance of the two models we use the accuracy, i.e., the rate of correctly classified observations. To avoid overfitting we use 3-fold cross-validation. Table 4 contains the cross-validation results for the two models. In both two-class and multi-class classification, the random forest works best, with accuracies of 0.830 and 0.73, respectively, but logistic regression works well, too. In the full data set 75.3% of the students do not pass the course, so predicting that no student passes already yields an accuracy of 0.753. Hence the random forest increases the accuracy by around 8 percentage points over this baseline, which leads us to use the results of the random forest from now on.
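The accuracy measure, the majority-class baseline and the fold construction used above can be sketched as follows. The helper names are hypothetical, and the class counts (567 and 186) are illustrative values chosen only so that the non-passing share of 753 students rounds to the reported 75.3%; they are not taken from the tables of this paper.

```python
from collections import Counter


def accuracy(predicted, actual):
    """Rate of correctly classified observations."""
    return sum(p == a for p, a in zip(predicted, actual)) / len(actual)


def majority_baseline(labels):
    """Accuracy of always predicting the most frequent class."""
    return Counter(labels).most_common(1)[0][1] / len(labels)


def k_fold_indices(n, k=3):
    """Yield (train, test) index lists for k nearly equal folds."""
    folds = [list(range(i, n, k)) for i in range(k)]
    for i in range(k):
        train = [j for m, fold in enumerate(folds) if m != i for j in fold]
        yield train, folds[i]


# Illustrative labels (assumption): 567 of 753 students did not pass.
labels = ["fail"] * 567 + ["pass"] * 186
baseline = round(majority_baseline(labels), 3)
gain = round(0.830 - baseline, 3)  # random forest accuracy minus baseline
```

In practice the observations would be shuffled before they are assigned to folds; the interleaved assignment here only keeps the sketch deterministic.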
We now investigate the variables which are chosen to build the single trees for the random forest. Figure 3 shows the importance of the variables used in the analysis. We can see that the variable last submission, i.e., the time until a student hands in the last submission before his/her last exam, measured in days, is by far the most important variable. Unfortunately, Figure 3 is silent on the direction of impact on the target variable. A solution to this problem is the partial dependence plot, which can help to understand how the log-odds of realizing the respective class depend on the input variables (see footnote 3). A high positive value of the partial dependence for a given value of the predictor means that an observation is more likely to belong to the class of interest than to the other class, see [14]. Here the class of interest is not passing the course. Figure 4 shows the partial dependence plot for last submission. We see that not passing the course is more likely for high values of last submission. This means that students who learn until the day of the final exam unsurprisingly have a higher probability to pass the course than students who quit learning far before the exam. This is because 374 out of 753 students did not participate in an exam. Most of these students did not learn until the exam but only made a few submissions at the beginning of the semester. Hence the variable last submission has high values for these students. On the other hand, most of the students who participated in the exams learned until shortly before the exam. This explains the high importance of last submission. Other important variables are the final score, the frequency of submissions and the number of submissions in the morning. Figure 5 shows the partial dependence plot of the final score.
Footnote 3: The y-axis shows $f(x) = \log p_k(x) - \frac{1}{K} \sum_{j=1}^{K} \log p_j(x)$, where K is the number of classes, k is the class of interest, and $p_j$ is the proportion of votes for class j.
We see that a high final score leads to a low probability of not passing the course (see footnote 4). Furthermore, the importance of the variables in Figure 3 shows that the time of the first submission, the number of submissions at night and in the late evening, and the first score in the term do not improve the prediction of success in the final exam. For the first and the last of these, this could be due to the fact that, at the beginning of the course, almost all students start to learn at the same level of knowledge, so there is no information that helps to distinguish between students passing or failing the final exam.
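The scale of the partial dependence plots discussed above (footnote 3) can be made concrete with a small sketch. The function name `centered_log_odds` and the example vote proportions are hypothetical; the sketch only evaluates the formula for given proportions and does not compute a partial dependence curve from a fitted forest.

```python
import math


def centered_log_odds(votes, k):
    """f = log p_k - (1/K) * sum_j log p_j for vote proportions p_1, ..., p_K."""
    K = len(votes)
    return math.log(votes[k]) - sum(math.log(p) for p in votes) / K


# Two classes, with 80% of the trees voting for the class of interest.
f_two_class = centered_log_odds([0.8, 0.2], k=0)
# For K = 2 the expression reduces to half the usual log-odds.
half_log_odds = 0.5 * math.log(0.8 / 0.2)
```

A value of zero corresponds to an even split of votes, positive values favor the class of interest, and the expression is undefined when a class receives no votes at all.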
Since 374 out of 753 students did not participate in the exams, we focus only on students who participated in an exam for the remainder of this subsection. This will obviously reduce the impact of the variable last submission. We now estimate the corresponding binary classification random forest for pass vs. fail. Figure 6 shows its variable importance plot (see footnote 5). Now, final score and frequency, i.e., the mean time between two days of submissions measured in days, are the most important variables in the random forest model. For example, Figure 7 shows the partial dependence plot for the frequency. Small values of frequency make it more likely to pass the course. This means that students who learn regularly, with only short gaps between their days of submission, have a higher probability to pass. In the case of multi-class classification, again, the time until a student hands in the last submission before his/her last exam is by far the most important variable in the model, for the same reasons as above. All other variables have low importance in this model. For reasons of brevity we shall now focus on the results of the binary model.
Footnote 4: Note that in logistic regression the sign of the estimated coefficients tells the direction of the impact of a variable. These are mostly in line with the exemplary partial dependence plots.
Footnote 5: Note that a negative value for the mean decrease accuracy implies that randomly permuting the respective variable (ceteris paribus) yields a lower prediction error of the random forest.

Learning Times
We now analyze more deeply at which time of the day good and less successful students prefer to learn. To investigate this, we compute the median submission time for each student. We compare the median submission times of students who passed and of students who did not pass the final exams. Figure 8 shows kernel density plots of the median submission time for passing students in solid black and non-passing students in dashed red. There is a higher variance of median submission times for students who did not pass; students who passed prefer to learn in the afternoon. Weaker students tend to learn later. Moreover, quite a few non-passing students have a median learning time in the morning. This is usually the time of the day when students should attend lecture and exercise classes. Figure 9 further supports this claim. We compare very good and good students with students who failed all exams and students who dropped the course. Evidently, good students prefer to submit exercises during daytime. The earliest median submission time of a good student is about 11:30am and the latest is 7:20pm. Comparing these students with students who failed and, more visibly, with students who dropped the course shows that quite a few of the latter study very late or very early. For example, the earliest is about 12:20am and the latest about 11:30pm.
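The median submission time per student used above can be computed from the time stamps alone. The sketch below is a minimal version with a hypothetical function name; it treats the time of day as a number between 0 and 24 and therefore does not handle submissions that cluster around midnight in any special way, which is an assumption on our part.

```python
from datetime import datetime
from statistics import median


def median_submission_hour(timestamps):
    """Median time of day of a student's submissions, in hours since midnight."""
    return median(ts.hour + ts.minute / 60 for ts in timestamps)


# Hypothetical submissions of one student on three different days.
stamps = [
    datetime(2017, 5, 2, 10, 0),
    datetime(2017, 5, 3, 14, 30),
    datetime(2017, 5, 5, 23, 0),
]
hour = median_submission_hour(stamps)
```

A value of 14.5 corresponds to 2:30pm. The kernel density estimates in Figures 8 and 9 are then computed over these per-student medians.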
This leads us to conclude that non-passing students more often have difficulties learning in the afternoon. As discussed in Section 2, sleep quality and sleep patterns significantly influence academic performance, so a lack of sleep caused by studying at night may harm students' performance.
Needless to say, the more important reason that poor students fail is that they learn too little, not that they learn at unfavorable times, cf. Subsection 4.1. It also needs to be emphasized that unfavorable time management can be due to a large number of responsibilities unrelated to students' studies. Unfortunately, our data set does not allow us to distinguish between these aspects. A data set including both submission data for e-assessment and information on students' other daily activities is hard to collect.
Figure 7: Partial dependence plot for the frequency variable for the two-class random forest without students who dropped the course. Vertical ticks on the x-axis indicate deciles of the frequency variable.

Submission duration
We now highlight another influential factor for success: we consider how long students work on a single submission, i.e., how much time they spend solving an exercise. This analysis faces some challenges. First, we only observe the end and not the beginning of solving an exercise and hence do not have exact start and end times. We bypass this problem by measuring the time between two successive submissions. For example, if a student submits an exercise at 12pm and submits a second exercise (which might be the same as the first) at 12:15pm, we consider 15 minutes as the time spent on the second exercise. This means we do not observe the duration of the first submission but only of the following submissions. We omit duration times which are longer than two hours because students then likely took a break. This is also part of the second issue: we only monitor submission times in JACK and not whether students used this time to learn or whether they got distracted. We cannot rule out times of distractedness but still believe the following analysis offers interesting insights.
For each student we accumulate all duration times. These totals are of course higher for students who submitted more exercises than for students who submitted only a few. We thus divide the total duration by the number of exercises submitted for each student. Figure 10 shows a kernel density estimate for the average time spent per submission of students who passed versus students who did not pass. Evidently, many students who did not pass invested little time in each exercise. Again, we next distinguish between students who failed an exam and students who dropped the course. Figure 11 compares students who achieved the best or second best grade with students who failed and students who dropped the course. Interestingly, the time spent per submission is similar for good students and students who failed (the plot for mediocre students looks very similar, too). However, students who dropped the course invested perceptibly less time in each submission. Apparently, these students had too little motivation and/or time to participate in the course. They likely did not seriously attempt to solve the exercises.
Figure 9: Kernel density estimates of the median daytime of submissions on JACK comparing students with different grades and with students who dropped the course. (Plot title: Median submission time per student; x-axis: daytime hour; y-axis: density; groups: grades 100&200, 500 and 600.)
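The duration measure described in this subsection can be sketched as follows. The function name and the example time stamps are hypothetical. Following the wording of the text, the total duration is divided by the number of submissions; dividing by the number of retained gaps instead would be an equally plausible reading and gives slightly larger values.

```python
from datetime import datetime, timedelta


def average_minutes_per_submission(timestamps, max_gap=timedelta(hours=2)):
    """Average time per submission based on gaps between successive submissions."""
    ordered = sorted(timestamps)
    # Time between each submission and the one before it.
    gaps = [b - a for a, b in zip(ordered, ordered[1:])]
    # Gaps longer than two hours are treated as breaks and omitted.
    kept = [g for g in gaps if g <= max_gap]
    if not kept:
        return None  # no observable duration for this student
    total_minutes = sum(g.total_seconds() for g in kept) / 60
    return total_minutes / len(ordered)


# Hypothetical submissions of one student on a single day.
stamps = [
    datetime(2017, 6, 1, 12, 0),
    datetime(2017, 6, 1, 12, 15),
    datetime(2017, 6, 1, 12, 45),
    datetime(2017, 6, 1, 16, 0),
]
minutes = average_minutes_per_submission(stamps)
```

In this example the gaps are 15, 30 and 195 minutes. The last gap exceeds two hours and is dropped, so 45 minutes are spread over four submissions.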

Conclusions
This study analyzes when students should learn to be successful in a final exam. For this purpose, we analyzed data from the online learning platform JACK from an introductory mathematical statistics course in the summer term 2017. This data on students' submissions on JACK offered information about the daytime when a student submits a solution to an exercise.
We used logistic regression and random forests to predict the success of a student in the final exam and also to predict the final grade. An advantage of these methods is that they offer information about the importance of the variables used in the model. We analyze the variable importance obtained by the random forest.
The two most important variables in this model are the number of days between the last submission of an exercise and the exam, and the score the students achieve when they study with JACK. We further identify the frequency with which the students work on JACK and the number of submissions between 8am and 12pm as important variables. We find that good students submit exercises during the daytime, while some students who quit the course or fail the final exam learn very early in the morning or very late in the evening. Needless to say, the total amount of learning has a high impact on success. Additionally, we cannot rule out that external factors (e.g. working during daytime), rather than a deliberate choice not to study during daytime, cause this effect. Still, we may conclude that students who did not pass the course study little during the afternoon. Moreover, students who dropped the course spend very little time on a single exercise. All in all, our results stress the importance for students of deciding when and how often to learn. With good time management, students can possibly increase their probability of passing a course like the one investigated here.