Dropping out from Higher Education in Germany: an Empirical Evaluation of Determinants for Bachelor Students

Abstract
Withdrawing from university is a complex decision-making process, during which several conditions and problems from different areas of life and study accumulate and affect each other. This study is based on the National Educational Panel Study (NEPS), which includes a wide range of information on study course and students' characteristics, and aims at providing an encompassing analysis of determinants influencing students' dropout decision. Determinants can be categorized into demographic and family background, the financial situation of students, their prior education, institutional determinants, as well as motivation and satisfaction with study. Both a bivariate analysis and a logistic regression model with LASSO regularization identify many important determinants that are already known before or at the beginning of the study, such as prior education and satisfaction-related variables, allowing early identification of at-risk students and the implementation of prevention programs.


Introduction
Due to the rising number of students in higher education institutions and the social and personal costs related to dropping out of university, analyzing study success and study dropout becomes more and more important. In Germany, the number of students enrolled at institutions of tertiary education increased monotonically over the last years towards 2.9 million in winter term 2019/2020, which is associated with increased educational costs (DESTATIS, 2020). In Germany, 14.7% of Bachelor students do not finish their degree (Schnepf, 2014). Other European countries face an even higher number of students dropping out of higher education, e.g. France (17.9%), Spain (24.2%), Netherlands (28.3%) or Italy (34.1%) (Schnepf, 2014). To minimize the wasting of financial and human resources due to a high number of university dropouts, policy and educational institutions are increasingly interested in detecting determinants that influence the dropout decision.
The current state of empirical research on student dropout carried out within a wide range of disciplines has identified several possible reasons for withdrawing from tertiary education. These include, for instance, demographic and family background, the financial situation of students, prior education, institutional determinants, as well as motivation and satisfaction with study.
This study aims to provide an encompassing analysis of these potential determinants and includes predictors from all of the mentioned categories. Therefore, a bivariate analysis using different effect size measures and a multivariate logit model are used. The database is the National Educational Panel Study (NEPS), a continuously growing and comprehensive German panel study including many variables covering a wide range of possible dropout determinants from different areas (Blossfeld et al., 2011).
Since the number of observations (17,910) and variables (more than 3,000) is large, both steps of the empirical analysis, bivariate and multivariate, are important. Simple bivariate analysis provides an overview of all relevant variables in the dataset and first impressions on their potential usefulness and importance.
The main disadvantage of the bivariate analysis is that a large absolute effect size can result in a small partial effect in a multivariate setting due to intercorrelations of the predictors. Therefore, we regard the bivariate analysis as a prerequisite for the more computing-intensive multivariate models with feature selection (Hastie et al., 2009).
The results of both analyses help to identify many promising starting points for early warning systems for students being at-risk of dropping out.

Determinants influencing dropouts: a literature review
Higher education dropout is not always defined consistently in the literature. Based on theoretical considerations, many different definitions have been applied and a distinction should be made according to the level at which dropouts occur. Students may change their field of study (within the same subject area or between subject areas), the type of degree, the (type of) university, or students may leave the university system, for instance, due to academic failure, wrong expectations or favorable job offers (Tinto, 1975; Larsen et al., 2013b). Depending on the student's or faculty's perspective, these different types of dropouts could be perceived as transfers (e.g. from one field to another) or as a formal total dropout. The former is sometimes called "reselection" (Larsen et al., 2013b) or "institutional departure" (Tinto, 1993, p. 36), the latter "de-selection" (Larsen et al., 2013b) or "system departure" (Tinto, 1993, p. 36).
Withdrawing from university is seldom the result of short-term or spontaneous decisions, but rather of a long decision-making process, during which several conditions and problems accumulate and prompt students to leave university without a degree (Heublein, 2014). Previous studies investigated the dropout of tertiary education in several countries with different focuses and identified several possible reasons for dropping out. Behr et al. (2020a) provide an encompassing and up to date review. These determinants can be categorized into societal aspects which include the demographic and family background, the financial situation of students, and their prior education, into institutional determinants, as well as into motivation and satisfaction with study.

Demographic and family background
Pre-study demographic and background factors seem to have a strong influence on study performance and dropout. Aina (2013) and Ghignoni (2017), both focusing on the relationship between the family background and the dropout decision in Italy, find that the better the parental education and social class, the lower the probability of leaving university without a degree. Some studies observe that male students tend to drop out more frequently than female students. Mastekaasa and Smeby (2008) for Norway and Severiens and Ten Dam (2012) for the Netherlands analyze dropout in male- and female-dominated study fields, and reveal that men have a very high attrition rate in female-dominated fields while women drop out to a lesser extent in those courses. Furthermore, there is evidence that a higher age at enrolment increases the dropout probability (e.g. Müller and Schneider, 2013; Lassibille and Navarro Gómez, 2008), which may also explain the higher dropout rate for students with vocational training before entering higher education (Müller and Schneider, 2013). Reisel and Brekke (2010) investigate the connection between higher education performance and the migration background in Norway and the USA and state that the dropout probability is higher for students from a foreign country, which is also observed by Belloc et al. (2010) for Italy. Similarly to Sarcletti and Müller (2011), they find that students with migration background tend to have less knowledge about the education system and the prevailing culture, and are less familiar with the language, which increases the risk of dropping out. According to Aina (2013) and Di Pietro (2006), the latter of whom analyzes the relationship between regional labor market conditions and university dropout rates in Italy, the geographic area plays an important role. Students from economically stronger regions and with good labor market prospects have a higher probability of enrolling in tertiary education and a lower dropout rate.

Financial situation
Another important aspect of study success is the students' financial situation which is related to the possibility of financial support, as well as to their amount of off-study work. A study by Glocker (2011), investigating the effect of financial aid on study success in Germany, reveals that an increased amount of support students receive decreases the dropout rate significantly. According to a Norwegian study by Hovdhaugen (2015), working more than 20 hours a week increases the probability of dropping out, whereas working for a maximum of 19 hours a week seems to have no significant influence on study success. Similar results are reported by Beerkens et al. (2011) for students from Estonia. They observe that more than 25 hours of off-study work decreases the probability of timely graduation.

Prior education
The pre-study education of students seems to be very important for the study success. Müller and Schneider (2013) examine the relationship between pre-tertiary educational pathways and dropout from tertiary education in Germany. They observe that students from the upper secondary school track (e.g. Gymnasium in Germany) and with a standard educational pathway have a lower dropout rate than students from the lower or intermediate track. Especially, students with vocational training before studies tend to have a high dropout rate, which may be associated with increased age at study start (see section ). According to Sarcletti and Müller (2011), school performance is of particular importance for study success as it is an indicator of the ability to meet the level of performance required by the higher education system. Various international studies find positive correlations between school (e.g. GPA) and study performance, for instance, Stinebrickner and Stinebrickner (2014), who analyze students at the Berea College in the USA.

Institutional determinants
The type of higher education institution also influences the dropout decision of students. For instance, in Germany, Sarcletti and Müller (2011) find the dropout rates in Bachelor courses at universities of applied science to be lower than those at universities. The same observations are made by Heublein et al. (2017), who also reveal the highest dropout rates in Germany to be in Engineering, Mathematics and Natural Sciences. This result is confirmed by Lassibille and Navarro Gómez (2008) for Spain and Korhonen and Rautopuro (2019) for Finland. Moreover, there are some important determinants related to study conditions that affect students' decision to drop out. Hovdhaugen and Aamodt (2009) analyze the impact of the learning environment on leaving university for Norwegian students and find that poor teaching quality and an unfavorable learning environment increase the probability of dropping out. A similar observation is made by Georg (2009) for German students. Suhre et al. (2007) for the Netherlands and Ghignoni (2017) for Italy highlight the importance of the relationship between students and teachers. Furthermore, a good program organization (Heublein et al., 2017) and program flexibility (Di Pietro and Cutillo, 2008) seem to decrease the probability of withdrawal.

Motivation and satisfaction with study
Besides these easily measurable determinants, students' motivation and satisfaction with study also affect their risk of dropping out. The latter determinants are based on the subjective self-perception of students, who have to state a value on a pre-defined scale to measure these variables. Suhre et al. (2007) investigate the association between study satisfaction and dropout probability in the Netherlands and observe unsatisfied students to have a higher risk of withdrawal. A German study by Suhlmann et al. (2018) finds the fit between the higher education institution and personal attitudes to be strongly related to students' satisfaction and motivation, which further decreases the probability of dropping out (Schiefele et al., 2007). Nordmann et al. (2019) for the UK and Korhonen and Rautopuro (2019) for Finland find that class attendance and time spent on the study course have a positive influence on study performance. Moreover, according to Van Bragt et al. (2011a) and Van Bragt et al. (2011b), both focusing on the relevance of students' personal characteristics for study success in the Netherlands, aspects such as conscientiousness, ambivalence or attribution are very important for educational performance. Other studies confirm the importance of personal characteristics including, for instance, resilience and self-control (e.g. Brandstätter et al., 2006).

Summary and contribution
To sum up, there are many different aspects of students' life including the pre-study phase, the institutional setting, the financial situation and motivational aspects, which seem to be relevant for the dropout decision. There are also some reviews on dropout research, which group the wide range of predictors in a similar way. For instance, Vossensteyn et al. (2015) categorize them into determinants on the individual level, on the institutional level and those on the level of the higher education system. According to Rodríguez-Gómez et al. (2015), focusing on definitions and common reasons for dropout in America and Europe, dropout is a multi-factor phenomenon which is the result of a complex interaction of determinants from a wide range of reasons including external, institutional, and personal factors among others. Therefore, to obtain detailed and comprehensive insights into the dropout phenomenon, there are some implications for the data and the methodological approach. First, all of these determinants (categories) found to be important should be considered in the analysis. Previous research mainly focused only on one or a few aspects of dropout and, as also stated in Larsen et al. (2013b), mainly on pre-study or university "non-malleable" determinants, but research would benefit from dealing more with study-related and university malleable determinants, as these are mainly within the scope of policy action. Singell and Waddell (2010) and Gury (2011) emphasized the importance of both fixed and time-varying effects (e.g. study conditions) on withdrawal, which cannot be analyzed with cross-section data. Administrative data, which have been used in many studies, lack information on pre-study determinants and on determinants based on the subjective self-perception of students. Survey data often contain too few observations to yield representative and reliable results.
Therefore, as also claimed in Sarcletti and Müller (2011), large prospective and longitudinal data covering determinants before and at the beginning of the study, as well as students' subjective self-perceptions, are of considerable importance for assessing the dropout phenomenon in its entirety. Moreover, it seems to be important to sort and condense the large number of determinants, to evaluate their degree of impact and to detect the most important ones in the dropout prediction, so as to identify promising and efficient starting points for reducing dropout rates. This study uses a large German survey dataset which covers a wide range of student life and intends to include determinants from all of the mentioned categories. Besides a bivariate analysis of the relevance of these different determinants by measures of effect size, this study aims at identifying the most important ones by applying a LASSO (least absolute shrinkage and selection operator) regression with an internal feature selection. It is hypothesized that from each of the identified determinant categories important features are selected for the final dropout prediction model.

Sample description
The fifth cohort of the National Educational Panel Study (NEPS) is a comprehensive German panel study including students in tertiary education covering a wide range of different aspects of students' background and the course of study (Blossfeld et al., 2011). This study uses nine waves which have been obtained by different survey methods like computer-assisted telephone interviews (CATI), competency tests, as well as computer-assisted web interviews (CAWI). The target population consists of first-year students (German and non-German) at higher education institutions in Germany in winter term 2010/2011. Interviewed students must be enrolled for the first time at public or state-approved higher education institutions aiming at a Bachelor degree, state examination (medicine, law, pharmacy, teaching), diploma or Master (Roman Catholic or Protestant theology) or specific art and design degrees (Zinn et al., 2017). In the first wave, 17,910 students participated in the NEPS. Table A2 in the appendix provides some general information on the dataset.
One limitation of this type of data is the long time horizon that is necessary to finally evaluate first-semester students. The dataset contains the freshmen cohort of winter term 2010/11 and represents the most recent study of this sample size and quality in Germany. As we focus on the examination of dropping out at an early stage of study, we mainly use time-invariant variables and determinants from the early study phase. Most of these variables were already collected at the beginning of the survey in 2011. Furthermore, since 2010 no major changes have taken place in the German higher education system that could lead to a change in the influencing variables as large as that caused by the Bologna process in 1999. A further limitation of the study is caused by panel attrition. This problem has already been analyzed by Behr et al. (2020b) and was found to have no negative consequences for their model. The sample is drawn as a stratified cluster sample. Clusters are defined by all students enrolled in a certain subject at a particular higher education institution. To oversample teacher education students and students attending private higher education institutions (as little is known about these groups), first-level stratification according to educational institutions was applied. The second level of stratification (within the first-level strata) was conducted according to groups of related subjects. These techniques for composing the NEPS sample, which should represent the entire freshman student population in Germany as closely as possible, are based on data on first-year students from winter term 2008/2009 from the Federal Statistical Office of Germany (Fachserie 11 Reihe 4.1: Bildung und Kultur Studierende an Hochschulen) (Zinn et al., 2017). Table 1 provides an overview of some relevant characteristics of students participating in wave 1 (own calculations).
There is a substantial overrepresentation of female students (60.46%), but using sample weights (provided in the scientific use file), the proportions of female students, as well as the type of institution, are very similar to the population proportions of beginning students in winter term 2010/2011 in Germany (Statistisches Bundesamt, 2011). Table 2 provides an overview of the distribution of study fields (first field) in the first wave. Again, weighted values of field proportions are very similar to those provided by the Statistical Office. Almost one-third of beginning students start to study in the field of law, economics and social sciences (31.27%), followed by engineering (21.47%), mathematics and physical sciences (18.82%) and linguistics and cultural studies (17.09%).

Predictor variables included in the study
Since the NEPS contains more than 3,000 variables in total, a variable preselection is necessary in order not to exceed the scope of the article. To ensure sufficient data quality only variables with less than 20% missing values in the target population are used, whereby some variables do not apply to every student, e.g. a student is only asked at the beginning of their study if not born in Germany. Since dropout prevention should begin at an early stage of study, we focus on the first waves. This criterion reduces the number of variables to less than 200. The final variable pre-selection is made from a theoretical point of view. Features that have not been found to be important in any previous articles, and also have no considerable influence here in terms of effect-size, are excluded. The final sample includes 52 variables.
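The missing-value criterion described above can be sketched in a few lines. The snippet is purely illustrative: the variable names and the toy records are hypothetical, `None` stands for a missing answer, and Python is used here only for illustration (the analysis itself is carried out in R).

```python
def missing_share(values):
    """Share of missing entries (None) in one variable's column."""
    return sum(v is None for v in values) / len(values)


def preselect(columns, max_missing=0.20):
    """Keep the variables whose missing share is strictly below max_missing."""
    return [name for name, values in columns.items()
            if missing_share(values) < max_missing]


# Hypothetical toy data: three variables observed for ten students.
toy = {
    "school_grade": [2.3, 2.7, None, 1.9, 3.1, 2.0, 2.5, 2.8, 1.7, 2.2],  # 10% missing
    "hours_worked": [None, None, None, 8, 12, 0, 5, None, 20, 10],        # 40% missing
    "female": [1, 0, 1, 1, 0, 1, 0, 1, 1, 0],                             # 0% missing
}
kept = preselect(toy)
```

In this toy example only `hours_worked` exceeds the 20% threshold and is dropped; the subsequent theory-driven selection step is not covered by the sketch.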
These variables were grouped into the identified five thematic fields: Demographic and family background, financial situation, prior education, institutional determinants, and motivation and satisfaction with study.

Identifying dropouts
According to Larsen et al. (2013a), the term "university dropout" can simply be explained as leaving the higher education system without obtaining a degree. This definition is from a macro point of view and mainly important for the whole education system and society. An alternative dropout definition includes students who change their subject field or institution before graduation. This second definition relates to a micro point of view, that of a faculty or institution, for which changes before the first degree could represent a failure in their goal of avoiding dropout from their study program. Here, dropout is defined as leaving the higher education system without a first degree. Changes of the study field, degree or institution are not treated as dropouts, but are considered in the analysis as predictors for dropping out. The outcome variable for dropping out in this analysis is based on the "status" of a student showing one of the following four categories:
0. Graduate
1. Dropout
2. Still studying
3. Status is not available (NA)
Since the focus of this article lies in the identification/prediction of potential dropout students, the aim is to compare dropouts and graduates. Students who are still studying and those with an unknown status are disregarded in the empirical analysis.
The final sample contains 943 students identified as dropouts and 2,625 graduates (N = 3,568). Students' status is a binary variable, where 0 indicates a graduate and 1 indicates a dropout. The status variable is constructed using relevant variables until wave 9 (summer term 2015). The relatively small final sample, compared to the number of participants in wave 1, is a result of right-censored data (many students are still studying) and missing values, since not every student participated in all nine waves.
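As a minimal sketch of how the binary outcome is obtained from the four status categories, the following snippet keeps graduates (0) and dropouts (1) and discards the other two categories. The status codes of the eight toy students are hypothetical, with `None` standing for an unavailable status.

```python
from collections import Counter


def build_outcome(statuses):
    """Keep graduates (0) and dropouts (1); drop 'still studying' (2) and NA (None)."""
    return [s for s in statuses if s in (0, 1)]


# Hypothetical status codes for eight students.
toy_status = [0, 1, 2, None, 0, 0, 2, 1]
y = build_outcome(toy_status)
counts = Counter(y)
dropout_share = counts[1] / len(y)
```

Applied to the counts reported above, the same ratio gives 943 / 3,568, i.e. a dropout share of about 26.4% in the analysis sample.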

Bivariate analysis of dropout determinants
Let $Y$ be the status variable and $X = \{X_1, \ldots, X_k\}$ a set of $k$ determinants that are potentially related to the status $Y$. In section 5 (multivariate analysis), the focus lies on $P(Y = 1 \mid X_1 = x_1, \ldots, X_p = x_p)$, which denotes the probability that a student drops out, given a subset of $p \le k$ known determinants (e.g. gender, high school grade etc.). To determine those $p$ variables that might influence the conditional probability of $Y$, the mean ($M$) of a specific determinant $X_j$ is compared in the two groups: $M_{0,j} = M(X_j \mid Y = 0)$ (mean in the group of graduates) and $M_{1,j} = M(X_j \mid Y = 1)$ (mean in the group of dropouts), $j = 1, \ldots, k$.
The bivariate analysis aims to detect variables differing strongly between dropouts and graduates. Two effect size measures are used to identify differences in the mean of the two groups of dropouts and graduates, which can also be seen as correlation measures: 1) Cohen's d and 2) Point-biserial correlation. The higher the absolute effect size, the larger the mean difference between the two groups (Hartung et al., 2011). In general, one can expect variables with high absolute effect sizes to have more influence on the probability to drop out. In contrast to tests for statistical significance, effect size measures are not influenced by the sample size in the two populations.2
Let $M_{1,j}$ and $M_{0,j}$ be the weighted means of variable $X_j$ in the groups of dropouts and graduates. The absolute point-biserial correlation coefficient $|r_{pb}|$ is a correlation measure for a dichotomous variable (here the status $Y$) and a metric variable $X_j$, $j = 1, \ldots, k$ (Bortz and Schuster, 2010), calculated by

$$|r_{pb}| = \left| \frac{M_{1,j} - M_{0,j}}{S_{n_j}} \sqrt{\frac{n_{1,j} \, n_{0,j}}{n_j^2}} \right|$$

with sample sizes $n_{1,j}$ (dropout) and $n_{0,j}$ (graduate) in the two groups, $S_{n_j}$ the overall standard deviation and the overall sample size $n_j = n_{1,j} + n_{0,j}$. The sample sizes $n_{1,j}$ and $n_{0,j}$ vary depending on the number of missing values of the variable $X_j$ and are given in Table A3 in the appendix. Table A3 also provides information on variable description, coding and scaling.

2 Statistical tests can be highly accurate for large samples, where even small differences can be detected easily, whereas for small samples they often fail. In regression models or, as in this case, binary classification models, importance rankings also exist, e.g. for random forests (Breiman, 2001). Highly correlated features can influence the importance of a variable in those situations.
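The point-biserial correlation can be illustrated with a short computation. The sketch below is an unweighted simplification (the analysis in this study uses weighted means), it assumes the standard textbook form in which the overall standard deviation is computed with divisor n, and the grades and status codes are hypothetical. As a cross-check, the result is compared with the ordinary Pearson correlation between the variable and the 0/1 status, with which the point-biserial coefficient coincides.

```python
import math


def point_biserial(x, y):
    """Unweighted point-biserial correlation of metric x with binary y (1 = dropout)."""
    x1 = [xi for xi, yi in zip(x, y) if yi == 1]  # dropouts
    x0 = [xi for xi, yi in zip(x, y) if yi == 0]  # graduates
    n1, n0 = len(x1), len(x0)
    n = n1 + n0
    m1, m0 = sum(x1) / n1, sum(x0) / n0
    m = sum(x) / n
    s_n = math.sqrt(sum((xi - m) ** 2 for xi in x) / n)  # overall SD, divisor n
    return (m1 - m0) / s_n * math.sqrt(n1 * n0 / n ** 2)


def pearson(x, y):
    """Ordinary Pearson correlation, used here only as a cross-check."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(vx * vy)


# Hypothetical school grades (1 = best, 4 = worst) and status (1 = dropout).
grades = [1.8, 2.1, 2.3, 2.5, 2.6, 2.4, 2.9, 3.0]
status = [0, 0, 0, 0, 0, 1, 1, 1]
r_pb = point_biserial(grades, status)
```

A positive value here means that the dropouts in the toy data have, on average, numerically higher (i.e. worse) grades than the graduates.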
Cohen's d (Hartung et al., 2011) is a measure of effect size and defined as

$$d_j = \frac{M_{1,j} - M_{0,j}}{S_j},$$

where the pooled standard deviation is

$$S_j = \sqrt{\frac{(n_{1,j} - 1) S_{1,j}^2 + (n_{0,j} - 1) S_{0,j}^2}{n_{1,j} + n_{0,j}}}$$

and $S_{1,j}$ and $S_{0,j}$ are the weighted standard deviations in the two groups for variable $j$. The interest is mainly in the absolute value of Cohen's d, i.e. |Cohen's d|, which yields a ranking of the variables by the size of the difference between the two groups. Table 3 shows the results of a bivariate analysis of the status variable and the different predictors. In each thematic field, a ranking of variables beginning with the largest absolute Cohen's d is presented.
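The following sketch computes Cohen's d for two toy variables and ranks them by absolute effect size. It is an unweighted simplification, it assumes the usual pooled form with squared group standard deviations under a square root and the divisor given above (the sum of the two group sizes), and all variable names and values are hypothetical.

```python
import math


def mean(v):
    return sum(v) / len(v)


def sample_variance(v):
    """Group variance with divisor n - 1."""
    m = mean(v)
    return sum((a - m) ** 2 for a in v) / (len(v) - 1)


def cohens_d(x_dropout, x_graduate):
    """Cohen's d: mean difference divided by the pooled standard deviation."""
    n1, n0 = len(x_dropout), len(x_graduate)
    pooled = math.sqrt(
        ((n1 - 1) * sample_variance(x_dropout)
         + (n0 - 1) * sample_variance(x_graduate)) / (n1 + n0)
    )
    return (mean(x_dropout) - mean(x_graduate)) / pooled


# Hypothetical toy variables: (values among dropouts, values among graduates).
toy = {
    "hours_worked": ([10, 12, 8], [9, 11, 10, 12, 8]),
    "school_grade": ([2.4, 2.9, 3.0], [1.8, 2.1, 2.3, 2.5, 2.6]),
}
effects = {name: cohens_d(a, b) for name, (a, b) in toy.items()}
# Rank variables by absolute effect size, largest first.
ranking = sorted(effects, key=lambda name: abs(effects[name]), reverse=True)
```

Because both groups have the same mean number of working hours in the toy data, that variable receives an effect size of zero and is ranked below the school grade.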

Demographic and family background
According to Table 3, female students tend to outperform their male peers (the weighted share of male students is 54.4% in the dropout group and only 41.6% among the graduates), and the students in the dropout group are on average somewhat older (year of birth), which is in line with previous literature. Students living in the new eastern federal states of Germany (place of residence) tend to drop out more frequently. According to Aina (2013), students may profit from financial benefits in economically stronger regions (like the old western federal states of Germany). The immigration background has only a minor effect on study success here. There are no consistent results in the literature regarding the effect of immigration background because this effect depends strongly on the country and its national (education) system (Reisel and Brekke, 2010). Regarding the family background, graduates' mothers and fathers tend to have a higher occupational prestige (coded using the ISEI-08 standard, the International Socio-Economic Index of Occupational Status) than parents of dropouts. Mothers of university graduates are on average better educated than those of dropout students. The father's highest diploma seems to play only a minor role. These results are mainly in line with previous studies.

Financial situation
Strongly related to the family background of students is their financial situation and off-study work. Dropouts more often receive financial aid (BAföG: Bundesausbildungsförderungsgesetz, financial support for students with a poor socioeconomic background), which indicates that dropout students more often come from financially weak families. There is just a small difference in the financial income of the two groups, which is in line with Heublein et al. (2008). During the term break, graduates work on average over five hours per week more than dropouts. During the semester, the difference is not significant and the correlation with the status variable is small. Similarly, previous studies show that working only a few hours has no negative association with study performance. The possibility/willingness to give up other, competing goals to invest in study (study costs) is lower in the dropout group.

Prior education
Educational achievements up to secondary education generally influence the higher education performance (Sarcletti and Müller, 2011). According to Table 3, graduates generally seem to be much better prepared and informed than dropouts. The skills acquired before tertiary education (especially mathematical skills) are also of high importance. These aspects have not been in the focus of previous studies. The overall school grade seems to have a large effect size and is highly correlated with the dropout decision. University graduates obtained on a scale from 1 to 4 (1 is the highest grade and 4 the lowest) an average school grade of 2.3 compared to an average grade of 2.7 for university dropouts. Similar results were found in various international studies. Moreover, the number of repeated classes in high school is lower among graduates than among the dropouts and reveals a large effect size. According to the type of high school, students can achieve a general university entrance qualification (the highest one), or a university of applied science entrance qualification (the middle one) or other lower degrees. About 70% of the graduates attended a Gymnasium (highest school track) and only 60% of the dropouts. Related to this, graduates obtained on average a higher school leaving qualification. These results are in line with previous research.

Institutional determinants
Institutional determinants provide information about the structure, organization and study conditions of higher education institutions, which also determine study success. Large differences in dropout rates between the different types of higher education institutions are observed. The majority of individuals in the dropout group withdraw from general university (58.1%) compared to only 35.2% who graduated from a university. General universities are more theory-oriented, while universities of applied science focus on practical applications and offer more structured study programs (Mayer et al., 2007). Note that lower dropout and higher graduation rates may be due to a differing subject profile of universities of applied sciences and the usually shorter time to completion. Figure 1 shows the distribution of study fields in the dropout and graduate group. The presented percentages for one specific group over the eight fields sum up to 100%. The highest difference between the dropout and the graduation group is observed for Law, economics and social sciences, which also has the largest effect size of all subject groups. Comparing the dropout rates within each study field, the highest dropout rates are observed for Engineering and Mathematics and natural sciences. Similar observations are also made by Heublein et al. (2017).

Motivation and satisfaction with study
Determinants based on the subjective self-perception of students such as motivation and satisfaction also may influence academic success. The results indicate that graduates are more extrinsically motivated than dropouts.
Related to that, 13.1% of dropouts compared to only 4.4% of graduates would have preferred to do something other than studying (alternative to degree). These findings are mainly in line with the sparse previous research on such aspects. The proportion of individuals who feel disappointed concerning their chosen subject (subject of choice satisfied) is significantly higher in the dropout group than among graduates (28% vs. 17%). These effects have not been analyzed in previous research in detail, but Mora (2010) states that students' subject of choice is sometimes pressured by parents, teachers, and peers. Being satisfied with actual studies on the whole, enjoying the degree course, as well as being interested in the degree course are highly correlated with study success. Additionally, dropouts are more concerned about some frustrating points of the degree course such as "frustrating external circumstances" or "degree course is wearing me down". Similarly, the sparse previous research on some of these aspects finds student satisfaction to have a positive impact on students' intention to stay in college.
Figure 1: Distribution of subjects in the dropout and graduate group.

Methodological considerations
Due to possible intercorrelations among the predictor variables, bivariate effects observed in the previous section may change substantially when all the variables are considered. In this section, partial effects of the variables are analyzed in a multivariate setting using logistic regression.

Logistic regression (logit)
Logit is one of the most popular linear methods utilised for classification problems. Here, the dependent class variable is a binary variable $Y$ containing two possible values: 1 (here for dropouts) and 0 (for graduates). In a logit model, the aim is to estimate the posterior probabilities of both classes via an index function $F$ based on the predictor variables $X = (X_1, \ldots, X_d)$ (here, $d = 52$). Probabilities of both events depending on the predictor variables $X$ are defined as

$$P(Y = 1 \mid X) = F(\beta_0 + X^T \beta) = F(\beta_0 + \beta_1 \cdot X_1 + \cdots + \beta_d \cdot X_d), \qquad (1)$$

$$P(Y = 0 \mid X) = 1 - F(\beta_0 + X^T \beta). \qquad (2)$$

In the logit model, the logistic distribution function is used for the function $F$:

$$F(z) = \frac{\exp(z)}{1 + \exp(z)}. \qquad (3)$$
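To make the mapping from the linear index to a dropout probability concrete, the following minimal sketch evaluates the logistic distribution function for one student. The coefficient values and the two predictors are hypothetical and serve only to illustrate the computation; they are not estimates from this study.

```python
import math


def logistic(z):
    """Logistic distribution function F(z) = exp(z) / (1 + exp(z)), overflow-safe."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def dropout_probability(x, beta0, beta):
    """P(Y = 1 | x) for one student with predictor values x."""
    z = beta0 + sum(b * xi for b, xi in zip(beta, x))
    return logistic(z)


# Hypothetical coefficients for two predictors (school grade, male).
beta0, beta = -3.0, [0.8, 0.4]
p_dropout = dropout_probability([2.7, 1], beta0, beta)
p_graduate = 1.0 - p_dropout
```

A positive coefficient raises the index and hence the dropout probability, while an index of zero corresponds to a probability of exactly 0.5.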

Best subset model
As the number of predictor variables can rapidly increase (as in Equation (4)), leading to complex models and probably to the presence of irrelevant variables as well as high correlation among the variables, it is of interest to select the best subset of inputs to include in the logit model. The parameters to be estimated in this case are $\beta_0, \beta_1, \dots, \beta_d, \gamma_{11}, \dots, \gamma_{dd}$ (in sum $1 + 2d + \binom{d}{2}$ parameters). For feature selection we use the LASSO (Least Absolute Shrinkage and Selection Operator) regularization (Tibshirani, 1996). Here, the negative binomial log-likelihood is minimized, and a regularization parameter λ is introduced to penalize unimportant or highly correlated features and shrink their coefficients to zero. This leads to the minimization problem (only main effects are considered for simplification):
\[ \min_{\beta_0, \beta} \; -\frac{1}{N} \sum_{i=1}^{N} \left[ y_i (\beta_0 + x_i^T \beta) - \log\left(1 + \exp(\beta_0 + x_i^T \beta)\right) \right] + \lambda \left[ \frac{1 - \alpha}{2} \|\beta\|_2^2 + \alpha \|\beta\|_1 \right] \]
over a grid of values of the hyperparameter λ, which controls the overall strength of the penalty. Here, N is the number of students and $(x_i, y_i)$ are the predictor values and the observed status of student i. The hyperparameter α (α = 1 for LASSO regression and α = 0 for Ridge regression) controls the "elastic-net" penalty. LASSO regression uses the $L_1$-norm and leads to a smaller number of relevant coefficients since it tends to pick only one coefficient from two highly correlated variables and shrinks the other coefficient to zero, while Ridge shrinks these coefficients towards each other. For the analysis, models are calculated using the glmnet function from the glmnet package (Hastie and Qian, 2014) implemented in R.
The glmnet algorithm applies cyclical coordinate descent, successively optimising the cost function over each parameter until convergence. The cv.glmnet function computes several models and determines the optimal λ, i.e. the one for the model with the lowest error, via grid search and cross-validation. The higher the λ, the more coefficients are shrunk to zero.
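The shrinkage behaviour described above can be illustrated with a small Python sketch. Note that it is not the glmnet algorithm: instead of cyclical coordinate descent it uses plain proximal gradient descent (ISTA) on the L1-penalized mean negative log-likelihood (the case α = 1), which is shorter to write down. The toy data, the function names and the two λ values are assumptions made for illustration only.

```python
import math

def logistic(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def soft_threshold(v, t):
    """Proximal operator of t * |v|: shrinks v toward zero, exactly zero if |v| <= t."""
    if v > t:
        return v - t
    if v < -t:
        return v + t
    return 0.0

def fit_lasso_logit(X, y, lam, n_iter=2000):
    """Proximal gradient descent for L1-penalized logistic regression (alpha = 1)."""
    n, d = len(X), len(X[0])
    # The gradient of the mean negative log-likelihood is Lipschitz with constant
    # at most 0.25 * mean squared norm of the rows (including the intercept column).
    lipschitz = 0.25 * sum(1.0 + sum(v * v for v in row) for row in X) / n
    step = 1.0 / lipschitz
    beta0, beta = 0.0, [0.0] * d
    for _ in range(n_iter):
        resid = [logistic(beta0 + sum(b * v for b, v in zip(beta, row))) - yi
                 for row, yi in zip(X, y)]
        grad0 = sum(resid) / n
        grad = [sum(r * row[j] for r, row in zip(resid, X)) / n for j in range(d)]
        beta0 -= step * grad0  # the intercept is not penalized
        beta = [soft_threshold(b - step * g, step * lam) for b, g in zip(beta, grad)]
    return beta0, beta

# Deterministic toy data: the first predictor determines the class, the second is noise.
X = [[(i - 49.5) / 50.0, ((37 * i) % 100 - 49.5) / 50.0] for i in range(100)]
y = [1 if row[0] > 0 else 0 for row in X]

_, beta_strong = fit_lasso_logit(X, y, lam=2.0)   # strong penalty
_, beta_weak = fit_lasso_logit(X, y, lam=0.01)    # weak penalty
```

With the strong penalty every slope coefficient is set exactly to zero, while the weak penalty leaves the informative predictor in the model, which mirrors the statement that a higher λ shrinks more coefficients to zero.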

Assessment of model performance
The logit model provides the probability that a student drops out of higher education. It can also be seen as a binary classifier, whereby a student with P(Y = 1|x) ≥ a (where a is a threshold value defined by the user or automatically calculated depending on the class size) is classified as a dropout and otherwise as a graduate. The performance of the obtained model is evaluated in terms of three measures: the mean squared error (MSE), which is the mean of the squared differences between the predicted probability for class 1, i.e. P(Y = 1|X), and the observed variable Y ∈ {0, 1}; the accuracy, which gives the proportion of correctly classified students; and the area under the ROC curve (AUC). The true positive rate, also called sensitivity or "recall", is the number of dropouts correctly classified as dropouts, divided by the total number of dropouts. The false positive rate (or 1 − specificity) is calculated as the number of graduates classified as dropouts, divided by the total number of graduates. Accuracy and both rates depend on the threshold. Varying the threshold from 0 to 1 and plotting the true positive rate against the false positive rate yields the receiver operating characteristic curve (ROC curve). The area under the ROC curve, named AUC, is a further important measure for binary classification. A value near 0.5 means that the model assigns the class of a new observation at random, while a value near 1 means that almost all observations are correctly classified. Furthermore, the computations are done by applying 10-fold cross-validation repeated 20 times, as suggested by Krstajic et al. (2014), to reduce the variance of the estimates.
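The measures defined above can be computed in a few lines. The following Python sketch uses six hypothetical students with made-up predicted probabilities; the function names are illustrative, and the AUC is computed through its rank interpretation (the share of dropout/graduate pairs in which the dropout receives the higher predicted probability), which equals the area under the ROC curve.

```python
def mse(probs, y):
    """Mean squared difference between predicted P(Y = 1) and the observed class."""
    return sum((p - yi) ** 2 for p, yi in zip(probs, y)) / len(y)

def classification_rates(probs, y, threshold):
    """Accuracy, true positive rate (recall) and false positive rate at a threshold."""
    pred = [1 if p >= threshold else 0 for p in probs]
    tp = sum(1 for yh, yi in zip(pred, y) if yh == 1 and yi == 1)
    fp = sum(1 for yh, yi in zip(pred, y) if yh == 1 and yi == 0)
    fn = sum(1 for yh, yi in zip(pred, y) if yh == 0 and yi == 1)
    tn = sum(1 for yh, yi in zip(pred, y) if yh == 0 and yi == 0)
    return {
        "accuracy": (tp + tn) / len(y),
        "tpr": tp / (tp + fn),
        "fpr": fp / (fp + tn),
    }

def auc(probs, y):
    """Area under the ROC curve via pairwise comparison; ties count one half."""
    pos = [p for p, yi in zip(probs, y) if yi == 1]
    neg = [p for p, yi in zip(probs, y) if yi == 0]
    wins = sum(1.0 if p > q else 0.5 if p == q else 0.0 for p in pos for q in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical predictions for six students (1 = dropout, 0 = graduate).
probs = [0.9, 0.8, 0.4, 0.3, 0.2, 0.1]
y = [1, 1, 0, 1, 0, 0]

rates = classification_rates(probs, y, threshold=0.35)
model_mse = mse(probs, y)
model_auc = auc(probs, y)
```

In this toy example the MSE and AUC do not change with the threshold, whereas accuracy, recall and the false positive rate do.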

Dealing with missing values in the data
The constructed dataset contains 943 dropouts, 2625 graduates and 53 predictor variables (including the intercept), with a considerable number of values missing in the data (about 18%). Since the logit model requires complete cases, these missing values have to be handled. In general, three approaches are possible: (1) using prediction methods that can handle missing values (instead of logistic regression), (2) using only complete cases, which would delete most observations in our dataset (the dataset would be reduced to 36 observations), and (3) imputation techniques, which fill the missing values with plausible values. To find the imputation technique leading to the best model performance, the 10-fold cross-validated out-of-sample AUC and MSE were computed for several imputation methods including mean or median imputation, regression imputation, stochastic imputation, hot-deck imputation, and multiple imputation (Batista and Monard, 2003; Twala, 2009; Meeyai, 2016).
Median imputation produces the best results in terms of AUC and MSE. Consequently, the complete dataset obtained with this imputation method is used for further analysis in this study. Garciarena and Santana (2017) also found situations where median imputation outperforms advanced imputation techniques. This dataset has many dichotomous variables, for which median imputation yields good results in terms of model performance. Of course, in many other applications, median imputation might not be optimal. Note also that median imputation reduces the variance of the imputed variables, which also affects confidence intervals and p-values of statistical tests (Kleinke et al., 2020).
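As an illustration of approaches (2) and (3), the following Python sketch contrasts complete-case deletion with median imputation on a tiny hypothetical dataset in which missing values are coded as None. The data and function names are assumptions for illustration and do not reproduce the NEPS data or the exact procedure used in this study.

```python
from statistics import median

def median_impute(rows):
    """Replace each missing value (None) by the median of the observed values in its column."""
    n_cols = len(rows[0])
    medians = [median(r[j] for r in rows if r[j] is not None) for j in range(n_cols)]
    imputed = [[medians[j] if v is None else v for j, v in enumerate(r)] for r in rows]
    return imputed, medians

# Hypothetical data: column 0 is dichotomous, column 1 is a school grade.
rows = [
    [1, 2.0],
    [None, 3.0],
    [0, None],
    [1, 1.0],
    [None, 4.0],
]

complete_cases = [r for r in rows if all(v is not None for v in r)]
imputed, medians = median_impute(rows)
```

Here complete-case analysis keeps only two of the five rows, while median imputation keeps all of them. Every imputed entry in a column receives the same value, which is why the variance of an imputed variable decreases.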

Empirical results
Here, the results of the logit model with LASSO regularization are presented. The predictor variables are divided into 5 groups (as presented in section ) and the response variable (status of student) has value 1 for dropouts and 0 for graduates. After applying the LASSO regularization to select the most prominent variables out of all the variables, we ended up with 22 variables with nonzero coefficients (see the appendix on how the number of selected variables is defined). A logit model based on the 22 selected variables is fitted and the standardized regression coefficients are shown in Table 4 along with the pseudo $R^2$, the cross-validated MSE and AUC. Significance in this logit model cannot be interpreted in a conventional way (at face value) due to the prior selection process. Nevertheless, the z-values, in addition to the standardized coefficients, provide important information on the partial effects of the variables. A positive coefficient indicates that a higher value of the corresponding variable increases the probability of dropping out. The confusion matrix is also reported in Table 5 along with accuracy (proportion of correctly identified students), recall (proportion of dropouts correctly identified) and the average threshold (minimum probability for a student to be classified as dropout), which is automatically calculated by the model. The ROC curve is plotted in Figure 2.
As noted in Table 4, many determinants contribute to lowering the risk of dropping out. Assuming the absence of high collinearity in the data (removed by means of LASSO), the values of the standardized regression coefficients outline the relative importance of the predictors (Fox, 2015; Darlington and Hayes, 2016). The largest coefficients and z-values (in magnitude) are in bold, which shows that from each determinant area there are important predictors for student dropout. For instance, female students, students with good prior preparation and school grades, students at a university of applied sciences, students who are satisfied with their studies, students who would not have preferred to do something else instead of studying, and students who receive financial aid (BAföG) carry a lower risk of dropping out than their counterparts. Moreover, students from Mathematics and Engineering have a higher risk of dropping out, whereas students from Law/Economic Sciences have a lower risk compared to Linguistics and Cultural Sciences. The model achieves a cross-validated MSE of 0.301 and AUC of 0.796. Three-fourths of students are correctly classified and the proportion of correctly identified dropout students amounts to about 75%.⁴ The directions of relationships between the covariates and study dropout are mainly in line with the descriptive analysis, theoretical considerations as well as with findings from previous studies (if already analyzed), which are discussed in detail in earlier sections. A counter-intuitive result is the direction of the effect of the predictor "Degree course is wearing me down".
Of considerable interest is that besides well known determinants such as school performance, aspects related to students' satisfaction with study are of great importance for academic success. Satisfaction further depends on a student's information and preparation status (Weerasinghe et al., 2017) which become relevant already before or at the beginning of study and lie, up to a certain degree, in universities' (and also secondary schools') scope of action. Therefore, there are many promising starting points for early warning systems for preventing students at risk of dropping out.

Discussion and conclusion
The current state of empirical research on student dropout from several disciplines has identified numerous possible reasons why students withdraw from tertiary education. This study aims at providing an encompassing evaluation of these determinants and at identifying the most important ones by applying bivariate measures of effect size and a multivariate LASSO regression with an internal feature selection to predict the probability of a student graduating or dropping out. The analysis is based on a dataset of freshman students who started in the winter term 2010/2011 at German institutions of higher education, covering a wide range of different aspects of students' background and the course of study. In the following, the findings and their possible implications for universities to prevent students from dropping out at an early stage of study are discussed.
From each of the determinant categories there remain important variables in the final prediction model after feature selection (AUC=0.789), which confirms that dropout is a result of several conditions and underlines the complexity of the dropout phenomenon.

Demographic and family background, and prior education
The impact of students' pre-study determinants, such as their prior education and other background determinants, implies that higher education institutions should take into account the increased heterogeneity of students and their specific needs. For instance, a lower educational pathway of students (e.g. type of school leaving qualification; also found by Müller and Schneider, 2013) or a poorer school performance (preparation, school grade, repeated classes; also stated by Sarcletti and Müller, 2011) increases the risk of dropping out. A preferable strategy may be to implement background-specific remediation programs or field-specific bridging courses preparing students for university requirements.

Institutional determinants
Relevant predictors on the institutional level are the type of higher education institution and the field of study. Studying at a general university instead of a university of applied sciences and studying subjects like Mathematics/Natural Sciences and Engineering (also found e.g. by Sarcletti and Müller, 2011; Heublein et al., 2017) seem to increase the risk of dropping out. This observation provides no direct starting point for reducing dropout rates but may point to more structured or practice-oriented study courses (as at universities of applied sciences) as a relevant determinant of study success. Moreover, the results indicate the usefulness of field-specific intervention measures, especially in fields with a high dropout rate.

Financial situation
The financial situation of students, for instance in the form of financial aid (BAföG in Germany; also found by Glocker, 2011) and the ability to cover living costs (study costs, income), seems to be an important aspect of study success. Here, an improvement of the financial aid system, for example a higher amount of subsidies, probably decreases the dropout risk for students, especially for those from low-income families.

Motivation and satisfaction with study
Several determinants identified as important for study success are related to student satisfaction (e.g. satisfied with actual studies; also found by Suhre et al., 2007). Regular student surveys to get information on student satisfaction, their wishes, and needs are probably an appropriate first step towards providing a supportive and encouraging environment and thereby increasing satisfaction with studies. Satisfaction highly depends on the gap between students' expectations concerning study content, organization and required qualifications and the real study situation, a gap induced by insufficient information and preparation of students (Suhre et al., 2007; Weerasinghe et al., 2017). Therefore, possible starting points are, for instance, student information days and workshops helping students to get an overview of the different study alternatives early and to find study fields matching their skills and interests. In addition, the implementation of online self-assessment programs for a first overview and evaluation of interests and opportunities may also be useful (Heublein, 2014). Here, cooperation with secondary schools seems to be of considerable importance (Hetze, 2011). Being able to study the subject of choice also seems to have a great impact on study success, so early information on formal and content-related requirements may encourage students to obtain these qualifications already at school (e.g. to choose maths as core subject). Moreover, as the fact that students would have preferred to do something else rather than studying (alternative to a degree) is a predictor of dropping out, students should also ponder their non-academic alternatives before starting a degree course. Here, special offers helping to decide whether a degree course or, for instance, vocational training would better match their aspirations and wishes may prevent student dropout due to discontent and unfulfilled expectations.
Some determinants identified as relevant for the dropout decision are influenceable to a varying degree by institutions or by students themselves whereas others are not. There are many aspects that become relevant already before or at the beginning of the study, such as prior education and also satisfaction, so there are promising starting points for early warning systems. The findings provide valuable starting points to tackle the dropout phenomenon. However, in the discussions on dropout prevention, it should be kept in mind that a dropout from university may not necessarily be interpreted as a negative event in the educational career. A voluntary dropout may be a sensible revision of a disadvantageous decision allowing students to take a chance with new opportunities and possibilities to find a more appropriate and interesting job instead of persevering in a non-satisfying study program.
A sequence of models for 200 different values of λ (log(λ) ∈ [−8, −2]) is fitted and displayed in Figure A1.⁵ Computation stops if the fraction of (null) deviance explained does not change sufficiently from one λ to the next (end of the path).⁶ Each curve corresponds to a predictor variable and shows the path of its coefficient as λ varies. The number of nonzero coefficients at the current λ, also known as the effective degrees of freedom, is indicated on the axis above. The higher the value of λ, the more coefficients are shrunk to zero.
To select the model that best fits the data, the optimal value of λ should be chosen. This is done by evaluating and comparing the out-of-sample MSE and AUC of each model using cross-validation (the number of folds is set to 10).⁷ Figure A2 shows the results.
The graphs include the cross-validation curve (black dotted line in both panels) and the upper and lower standard deviation curves along the λ sequence (error bars). The vertical dotted lines indicate the two selected values of λ, each corresponding to a certain number of nonzero coefficients. In the left panel of Figure A2, the first vertical line marks λ_min = 0.0017, the value of λ that gives the minimum mean cross-validated error (here 0.298), with 49 predictors having nonzero coefficients. The second line marks λ_1se = 0.0152, which gives the most regularized model such that the error is within one standard error of the minimum. This error amounts to 0.307, and 22 predictors have nonzero coefficients. The right panel provides the best λ based on the AUC. There, the second vertical line marks λ_1se, also with a value of 0.0152, an AUC value of about 0.789 and 22 nonzero coefficients.
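The two selection rules can be written down compactly. The Python sketch below picks the λ with the minimum cross-validated error and the largest λ whose error lies within one standard error of that minimum. The values 0.0017, 0.0152, 0.298 and 0.307 are taken from the text; the remaining grid points and the standard error of 0.010 are hypothetical, since the full cross-validation output is not reported here.

```python
def select_lambdas(lambdas, cv_mean, cv_se):
    """Return (lambda_min, lambda_1se) for a cross-validated loss such as the MSE."""
    i_min = min(range(len(lambdas)), key=lambda i: cv_mean[i])
    cutoff = cv_mean[i_min] + cv_se[i_min]
    # Most regularized model (largest lambda) whose error is within one SE of the minimum.
    lambda_1se = max(lam for lam, m in zip(lambdas, cv_mean) if m <= cutoff)
    return lambdas[i_min], lambda_1se

# Partly hypothetical cross-validation results (see the note above).
lambdas = [0.0005, 0.0017, 0.0050, 0.0152, 0.0400]
cv_mean = [0.300, 0.298, 0.301, 0.307, 0.330]
cv_se = [0.010, 0.010, 0.010, 0.010, 0.010]

lambda_min, lambda_1se = select_lambdas(lambdas, cv_mean, cv_se)
```

For a measure that is maximized, such as the AUC, the same rule can be applied to its negative.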

Model improvement
To improve the predictive performance, interaction terms (between the predictor variables) and curvilinear (quadratic) effects are included in the model. Using the selected predictors, the model is recomputed and, in addition to main effect terms, terms of quadratic order and interactions among the predictors are considered. This leads to an overall number of 275 variables (22 first-order variables + 22 quadratic forms + $\binom{22}{2}$ = 231 interactions of the second order). As λ varies, values of MSE and AUC and the number of nonzero coefficients are recorded (Figure A1).

⁵ Models are fitted using the function glmnet.
⁶ The deviance is defined as 2·(loglike_sat − loglike), where loglike_sat is the log-likelihood for the saturated model (Friedman et al., 2010).
⁷ The function cv.glmnet is used for that.

As displayed in Figure A1, a slight improvement of the model is noted in both evaluation measures. The MSE improves from 0.309 to 0.282 and the AUC value from 0.789 to 0.821 when the best subset with 78 variables is used. The accuracy and recall values also improve, from 73.35% to 76.28% and from 74.23% to 76.35%, respectively. The threshold value, i.e. the minimal probability to be classified as dropout, is a = 0.268. These results confirm that considering interactions among the predictor variables and terms of quadratic order in addition to main terms generally improves the predictive performance of the models.
In addition to the predictive performance, regression coefficients of the predictors are shown in Table A1. For convenience, 22 variables are selected as in the previously computed model. It should be noted that some main effects may be discarded due to the presence of quadratic and interaction effects, which restricts meaningful interpretations of the model. However, this table is primarily intended to provide an indication of which quadratic and interaction effects are included in the model.
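The count of 275 candidate terms can be checked with a short Python sketch that enumerates main effects, quadratic terms and pairwise interactions for 22 predictors. The predictor names x1, ..., x22 and the term labels are placeholders, not the variable names used in this study.

```python
from itertools import combinations
from math import comb

def expand_terms(names):
    """List main effects, squared terms and all pairwise interactions of the predictors."""
    main = list(names)
    quadratic = [f"{name}^2" for name in names]
    interactions = [f"{a}:{b}" for a, b in combinations(names, 2)]
    return main + quadratic + interactions

names = [f"x{j}" for j in range(1, 23)]  # 22 placeholder predictor names
terms = expand_terms(names)

n_interactions = comb(22, 2)        # number of distinct predictor pairs
n_total = 22 + 22 + n_interactions  # main + quadratic + interaction terms
```

The number of pairwise interactions grows quadratically in the number of predictors, which is why the expansion is applied to the 22 selected variables rather than to all candidate predictors.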
As shown in Table A1, only one main effect and one quadratic effect are included in the model, namely alternative to a degree and square of the grade at secondary school. Interaction effects of the grade at secondary school and the institutional determinants are indicated as important predictors. Interaction effects between the satisfaction variables (satisfaction with the studies and satisfaction with the chosen subject) and the financial variables are also important.