Evaluation of Basic Trainings for Rescue Forces

Since members of rescue forces such as firefighters have to deal with sometimes extreme and dangerous situations, high-quality basic trainings are indispensable for their professional success. There is therefore an obvious need for standardized tools assessing the training quality. This paper aims to develop and validate such an evaluation instrument. In Study 1, a qualitative analysis (N = 21) was used to identify core characteristics of good firefighter basic trainings and served as theoretical basis for the generation of corresponding items. In Study 2 (N = 257), the item set was piloted and reduced, its structure was assessed in exploratory factor analyses, and first validations were conducted. Study 3 (N = 451) tested the proposed factor structure via confirmatory analyses and validated the questionnaire comprehensively. Factor analyses showed a six-factor structure. The scales of the newly created Feedback Instrument for Rescue forces Education – Basic education (FIRE-B) can be judged as reliable. Moreover, there are several clear indications of validity. Thus, the present research contributes to the understanding of critical factors and processes of basic trainings. Furthermore, the FIRE-B has a high practical relevance, both in the assessment of training quality and in the identification of opportunities for improvement.


Introduction
Worldwide, numerous people work in rescue services and fire brigades. Their activities are frequently characterized by danger, physical and mental exertion as well as high unpredictability (Taber, Plumb, and Jolemore 2008, 280). Therefore, firefighters, along with other rescue workers, need a variety of skills to be optimally prepared for their specific work requirements. These skills include cognitive, physical and social skills (Burke 1997; Henderson 2010). The main way firefighters develop the necessary knowledge and skills is through a good firefighter basic training. When such a training is subpar, the likelihood of both mistakes and work-related injuries increases (Moore-Merrell et al. 2008), which, in turn, is significantly associated with symptoms of burnout and PTSD (Katsavouni et al. 2016). Consequently, high-quality trainings for firefighters not only serve the interest of society, which expects competent personnel to respond to critical incidents, but they also serve firefighters' own interests, as they help to minimize their risk of personal injury while also protecting their mental health (e.g., Katsavouni et al. 2016; Moore-Merrell et al. 2008).
Thus, it is important to evaluate firefighter training programs in order to assess the quality of training and to identify potential areas for improvement. Such evaluations can be understood as the systematic judgment of a program's worth or value (Steele 1970). Yet, according to the current state of knowledge, there is no scientifically developed instrument for evaluating firefighter basic trainings that would enable trainees to validly assess the training quality. Thus, the present research answers this need for a valid and scientifically based quality assessment instrument for firefighter basic trainings.

Firefighter Basic Trainings
Internationally, the structure of fire brigades is quite diverse, and different nations have different distributions of volunteer and full-time firefighters (Brushlinsky et al. 2019). Nonetheless, Bukowski and Tanaka (1991) proposed four central points as international performance code for firefighters: "Prevent the fire or retard its growth and spread, protect building occupants from the fire effects, minimize the impact of fire [and] support fire-service operations" (175-176). Thus, firefighters' goals are to save lives, property, and the environment. For these purposes every firefighter must receive appropriate training.
In Germany, fire protection is provided by professional fire brigades as well as volunteer fire brigades (cf. Brushlinsky et al. 2019). In both cases, future firefighters must first complete appropriate basic training (cf. FwDV 2 2012), which is provided by municipalities, districts and cities. At the voluntary level, basic training includes troop training and technical training. At the professional level, basic training involves several years of full-time training. Troop training is dedicated to the smallest tactical fire brigade unit, the troop. This usually consists of two firemen: A troop leader and a troop man. The troop leader bears the responsibility for his troop man and usually takes orders from a group leader, who, in turn, is subordinate to a platoon leader. Accordingly, the troop training can be subdivided into troop man training and troop leader training. The curriculum for troop training at the voluntary fire brigade level consists of various modules on topics such as legal bases, vehicle knowledge, equipment knowledge and life-saving emergency measures. It comprises a total of 185 h. Complementing and building on this, the second part of basic training at the voluntary fire brigade level is technical training (FwDV 2 2012). According to FwDV 2 (2012), the technical training includes, for example, additional training courses for radio operators, respirator wearers and machine operators as well as special training courses on how to behave when working with hazardous substances. The combination of troop and technical training at volunteer fire brigades is roughly comparable with the training required to become a professional firefighter at the professional fire brigade level. A central difference is that the basic training at the professional fire brigade level additionally includes training as a paramedic as well as internships at fire stations.
Regulations (FwDV 2 2012; VAP1.2-Feu 2015) specify the design of basic trainings, including components such as adequate content, learning objectives, time targets and methods. However, these training descriptions only represent minimum requirements. They are regarded as recommendations that can be supplemented. Accordingly, slight differences exist in the implementation of basic trainings depending on the respective federal state, district, municipality and trainer (Buchenau 2020). For example, even though the specified learning objectives must be achieved in the specified time frame (FwDV 2 2012), trainers can use methods and didactic approaches that are not described in the regulations. Furthermore, selection of trainers for the professional fire brigade focuses more on the trainers' technical rather than pedagogical aptitude (VAP1.2-Feu 2015; Meyer and Stiegel 2012).
This training situation underlines the need for an evaluation to assess the current quality of basic training and to identify potential areas for further improvement. The desired result, namely high-quality firefighter basic trainings, will not only support good practices in emergency situations but will also support the physical and psychological health of firefighters. Specific evaluations can be helpful for recording and improving the quality of firefighters' initial trainings in order to reduce the risks associated with the profession. For this reason, the concept of program evaluations will be explained in more detail in the following section.

Program Evaluation
Widely used in the context of evaluating training programs is the four-level model by Kirkpatrick (1979) (Blau et al. 2012). According to the model, the evaluation of trainings should take place at the four consecutive levels of reaction, learning, behavior and results (see Figure 1). According to Kirkpatrick (1979), the first level, reaction, covers the subjective, emotional evaluation of a training. This can be operationalized with questions about training relevance, materials and exercises, and reactions to the trainer or premises (Blanchard and Thacker 2010; Kirkpatrick 2007). Positive reactions not only indicate that participants are highly motivated but also that they are paying close attention, both of which are processes presupposed for participants' learning success (Blanchard and Thacker 2010). Secondly, according to Kirkpatrick (1979), evaluations should investigate the level of learning. This refers to the extent to which participants expand their knowledge, develop their skills or change attitudes through a training. The trainees' learning can be assessed with items such as "After the training, I know substantially more about the training contents than before" (Grohmann and Kauffeld 2013, 142). This evaluation criterion is of central importance to determine how well trainers promote the learning of participants and to uncover potential improvements (Kirkpatrick 2007). Thirdly, at the level of behavior, an evaluation should examine whether a change in behavior, i.e., a transfer of learned training content to everyday situations, has taken place as a result of a training (Kirkpatrick 1979). Finally, at the level of results, an evaluation should record the broad organizational effects of a training. For example, a training might lead to reduced costs or improved service quality (Kirkpatrick 1979). Kirkpatrick and Kirkpatrick (2006) stress that the four levels of the model are to be understood as successive stages. Each individual level has an influence on the next level.
For this reason, Kirkpatrick (1979) suggests evaluating a program on higher levels only after it has been shown to be successful on lower levels. However, implementing subsequent evaluation levels steadily becomes more difficult and resource intensive (Kennedy et al. 2014). Therefore, we aim to develop an evaluation questionnaire that covers the levels of reaction and learning. At the reaction level, we will record firefighters' affective reactions to the trainings and their perceived usefulness (cf. Blanchard and Thacker 2010). At the learning level, self-assessments of whether the firefighters acquired specific competences should provide information on subjective learning success (cf. Kirkpatrick and Kirkpatrick 2006). The present work aims at covering both of these levels in a standardized questionnaire (cf. Blanchard and Thacker 2010; Kirkpatrick and Kirkpatrick 2006). Before describing the questionnaire, the next section examines the extent to which evaluations are common in firefighter basic trainings.

Evaluation in the Context of Firefighter Basic Trainings
Regarding firefighter trainings, evaluations are critically important to achieve optimum training outcomes. In basic training courses, it is already common to evaluate participants' performances after practical exercises. Such evaluations often take place in the form of "lessons learned" (Berlin and Carlström 2014, 199), conversations that happen after the training exercises with the aim of discussing in the training group what worked and what could use improvement. Childs (2005) also stresses the importance of such critical reflection in firefighter education programs. Similarly, Sommer and Njå (2011) propose that sharing experiences is a good learning method in firefighter basic trainings.
Less common is the evaluation of firefighter basic trainings themselves. However, such an evaluation is equally important, as various training conditions can impede learning. For instance, learning effects are absent or very low if a training is badly structured, if exercises are unrealistic, if a training mainly focuses on existing knowledge or if trainers impart new knowledge using bad didactics (Berlin and Carlström 2014). To the best of our knowledge, the Feedback Instrument for Rescue forces Education (FIRE, Schulte and Thielsch 2019) questionnaire is currently the only systematically developed and validated published general evaluation tool in the context of firefighter education programs. The instrument is directed at future group and platoon leaders who can use it to rate the quality of firefighter leadership trainings in which they participate. The FIRE questionnaire consists of the six scales trainers' behavior, structure, overextension, group, competence and transfer.
To better understand what constitutes a validated questionnaire, the concept of validity and the aim of the present research are explained in more detail in the following section.

Validation of Questionnaires and Aim of the Present Study
In general, validity is understood as the measure of accuracy with which a questionnaire measures the concept that it intends to measure (Goldstein and Simpson 2002). Goldstein and Simpson (2002) propose to assess validity by examining three different facets of validity: content validity, construct validity and criterion validity. First, content validity examines the extent to which a questionnaire represents the characteristic to be measured (Haynes, Richard, and Kubany 1995). It can be achieved by involving experts from the respective field in the development of the construct's definition and items for the questionnaire (American Educational Research Association et al. 2014). Next, construct validity can be confirmed if relationships between the behavior in the test situation and underlying psychological characteristics can be demonstrated (Goldstein and Simpson 2002). As such, construct validity can include the investigation of a questionnaire's factorial structure: Factorial validity exists if there is a good fit between the theoretical model on which a questionnaire is based and the empirical data obtained with it (Guilford 1946; Thompson and Daniel 1996). Furthermore, as described by Campbell and Fiske (1959), construct validity can be divided into two subtypes: convergent validity and divergent validity. A questionnaire is regarded as being convergently valid if high correlations with construct-related questionnaires can be proven. The idea is to test whether constructs that are expected to be related are, in fact, related. In turn, a questionnaire can be described as being divergently valid if only low correlations with other independent constructs are found. Thus, the idea is to check whether constructs measured with other questionnaires that are not expected to be related do not, in fact, have any relationship.
The third aspect of validity, criterion validity, aims at demonstrating that questionnaire scores are related to or, more precisely, predict concrete real-life outcomes. Again, two subtypes can be specified: A questionnaire is said to have concurrent validity if it can predict criteria measured at the same time (e.g., expert or global ratings), whereas it has predictive validity if it forecasts criteria measured some point after the questionnaire scores were obtained (e.g., learning or performance outcomes) (Cronbach and Meehl 1955).
The present study aims to respond to the need for a valid tool to evaluate firefighter basic trainings. Therefore, our first goal is to systematically develop a profound theoretical basis for appropriate items (Study 1). Further, we aim at piloting these items as well as comprehensively validating the developed questionnaire using different samples. To examine the questionnaire's factorial validity, we conduct exploratory and confirmatory factor analyses. Beyond that, to investigate convergent and divergent construct validity as well as concurrent criterion validity, we make use of correlative validation methods (Studies 2 and 3).

Study 1
In Study 1, we determined the factors related to the success and quality of a good firefighter basic training program at municipal and district level as theoretical basis for the item construction.

Method
In Study 1, in a qualitative research approach we asked N = 21 experts (n = 13 trainees, n = 4 trainers, n = 4 persons mainly having a managing function in a fire brigade school) what they personally consider to be important aspects of good training at the municipal and district level. All participants were German and male. Ages ranged from 18 to 31 years (M = 22.46; SD = 3.60) for the trainees, from 25 to 49 years (M = 34.00; SD = 10.52) for the trainers, and from 33 to 46 years (M = 37.50; SD = 5.80) for the persons with a managing function. The survey was answered with regard to trainings for professional fire brigades by 48% of the participants, whereas 52% answered it with regard to trainings for voluntary fire brigades. Factors related to the success and quality of basic trainings were recorded in an online survey by means of the Critical Incident Technique (CIT, Flanagan 1954). The CIT is a qualitative research method based on expert surveys. It is frequently used as an effective exploratory tool to better understand specific human activities or to get an information base for further research (Butterfield et al. 2005). The idea is to let experts clearly and comprehensibly describe critical situations (i.e., situations including particularly effective or ineffective behaviors) from which critical categories or items can be derived. Additionally, participants were able to directly name important aspects of a good training in four open questions: The first concerned aspects of good training in the fire brigade. The second asked for characteristics of a good trainer and his or her teaching style. The third asked about relevant framework conditions, and the fourth asked for ways in which trainees themselves can contribute to good training. The study was available online from the end of August to October 2017. Participation was voluntary and anonymous. As compensation, the respondents received a result report after Study 1 was completed.

Results and Discussion
Based on a qualitative content analysis (Mayring 2015), the experts' statements were clustered into categories of good firefighter basic trainings by two independent evaluators. The analyses led to eight categories of a good basic training at a fire brigade: Didactics, motivation & engagement, personality, content & methods, structure & organization, materials & facilities, group and achievement of learning objectives (see Table 1). A standard procedure in questionnaire construction is the creation of a large item pool for a first draft of a questionnaire (e.g., Kline 2000; Nunnally 1975; Rossi, Wright, and Anderson 2013). This makes it possible to remove unsuitable items after subsequent item analyses while still retaining a sufficiently large number of items. Thus, in accordance with the identified categories, a pool of 51 items, which comprehensively and fully depicted the mentioned success-critical aspects, was created for a preliminary version of the questionnaire (see Table A1 in the online supplement at https://doi.org/10.5281/zenodo.3948173).
Table 1 (table body not recoverable; surviving entries: respectful interaction, team spirit, willingness to learn; direct applicability in practice, high learning effect)
Note. Shown is the number of entries that could be assigned to the respective category for the four different questions. The percentages are given in brackets. If there was no information, this category was not mentioned in the question. CIT = Critical Incident Technique (one entry per category was counted for each situation), E = Education (question about characteristics of a good education), C = Conditions (question on general conditions of good training), P = Person (question about characteristics of a good trainer).
The results of Study 1 indicate that the categories experts described as being important for good firefighter basic trainings are similar to the success factors for firefighter leadership trainings (FIRE questionnaire). However, the results also revealed aspects that are not covered by the existing FIRE questionnaire (Schulte and Thielsch 2019), such as specific teaching methods and outcomes, motivational aspects, personality aspects of the trainers and required materials and facilities. Beyond that, the results of Study 1 revealed parallels to the characteristics of good teaching at universities, as many items were similar to items that have been [...] Based on these findings, we developed an adapted questionnaire for the evaluation of firefighter basic trainings. Thus, the initial item pool consisted of 51 questions newly created based on interview results of Study 1 as well as items originating from existing instruments that were adapted to the technical context of the fire brigade as well as to the training context at municipal and district level (see Table A1 in the online supplement at https://doi.org/10.5281/zenodo.3948173). The resulting questionnaire was named Feedback Instrument for Rescue forces Education – Basic education (FIRE-B). The aims of the following studies were thus to shorten this draft version by removing items proven to be unsuitable in item analyses, to facilitate practical application and to check the psychometric quality of the resulting final instrument.

Study 2
In Study 2, the preliminary questionnaire version developed in Study 1 was piloted with members of various fire brigades in Germany. The aims of this study were to shorten the FIRE-B draft by selecting items on the basis of the descriptive item parameters, to uncover the factor structure via an exploratory factor analysis, and to carry out initial validations.

Sample
The sample for the item and exploratory factor analysis consisted of N = 257 persons from Germany (229 men, 26 women, 2 not specified) with an age range of 16 to 51 years (M = 25.75; SD = 6.48). An overview of the initial sample and the exclusion criteria applied is given in Figure A1 in the appendix. Of the final sample, 37% was made up of (former) apprentices in training to become professional firefighters in the fire brigade, and 63% was made up of (former) apprentices in troop man or troop leader training in the voluntary fire brigade. Of the trainees in professional fire brigades (n = 94), 39% had completed their training. With regard to trainees in volunteer fire brigades (n = 163), 12% were in troop man training, 47% were between troop man training and troop leader training, 5% were in troop leader training and 36% had already completed troop leader training.

Measures and Procedure
Study 2 was conducted as an online survey using the software EFS Survey (provided by the Questback GmbH 2018). The main component of the survey was the set of 51 items developed in Study 1. Participants indicated their agreement with the items on a seven-point Likert scale (from 1 = strongly disagree to 7 = strongly agree) with a denial option to indicate that the respective item cannot be answered meaningfully. Another component of the survey included items for the initial validations. Firstly, two items of global judgment (subjective learning success (Gediga et al. 2000), global grading on a school grading scale (cf. FEVOR/FESEM, Staufenbiel 2000)) served as indicators for criterion validity. Secondly, mood served as a criterion for divergent validity. To measure the participants' mood, the five-level smiley scale by Jäger (2004) was used. A series of two studies by Jäger (2004) provided evidence for this scale's unidimensionality and equidistance and showed high correlations with the German version of the PANAS scale (0.75 ≤ r ≤ 0.89). Finally, one additional scale was measured that is not pertinent to the present study. The median response time for completing the entire survey was 12 min and 22 s.
The study was available online from January to March 2018. Participation in the survey was voluntary, anonymous and possible via an access link. It could be carried out on computers or other internet-enabled devices and consisted of three different sections (see Figure A2 in the appendix). As compensation, the respondents received a result report after Study 2 was completed.

Statistical Analyses
All data analyses of Study 2 were performed with the program IBM SPSS Statistics (Version 24). Before starting the analyses, the inverted items were reversed so that a high value for all items is equivalent to a good evaluation of the training. Missing values of the training evaluation items (those for which participants had selected the denial option) were imputed using the expectation maximization algorithm. Missing values occurred in 18% of the respondents. Overall, only 0.5% of the data were missing.
The 51 items of the preliminary questionnaire version were evaluated primarily with regard to their distribution, response rates and item intercorrelations as well as on the basis of their correlation with the mean value of eight items on the self-assessed acquisition of competence (these included all seven items of the scale competence described below as well as one item that was not included in the final instrument due to content redundancy). The latter correlation was regarded as an indication of how relevant the items were regarding content and practicability in the feedback process. On the basis of these analyses, an initial item selection was made. The reduced set of items (see Section 3.2.1) was included in an exploratory factor analysis (EFA, principal axis factoring with oblique promax rotation) in order to uncover the factor structure underlying the data and to further reduce the pool of items. Finally, bivariate correlations between the scales of the draft questionnaire and the mentioned validation measures were calculated to exploratorily assess construct and criterion validity.
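The reversal of inverted items can be illustrated with a minimal sketch. It assumes the seven-point response scale described above and represents the denial option as `None`; the function name `reverse_code` and the example answers are hypothetical and not part of the original analysis, which was run in SPSS.

```python
# Hypothetical helper: reverse-code negatively worded items on the
# seven-point scale (1 = strongly disagree ... 7 = strongly agree).
SCALE_MIN = 1
SCALE_MAX = 7


def reverse_code(score):
    """Mirror a score around the scale midpoint; keep missing values as None."""
    if score is None:  # denial option selected, treated as missing here
        return None
    if not SCALE_MIN <= score <= SCALE_MAX:
        raise ValueError(f"score {score} is outside the {SCALE_MIN}-{SCALE_MAX} scale")
    return SCALE_MIN + SCALE_MAX - score


# Made-up answers of three respondents to one inverted item.
raw_answers = [2, None, 7]
recoded = [reverse_code(answer) for answer in raw_answers]
print(recoded)  # [6, None, 1]
```

After recoding, a high value means a favorable rating for every item, so scale means can be compared directly. The imputation of the remaining missing values is a separate step that this sketch does not cover.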

Item Selection
In the first selection phase, the item set was reduced to 44 items: One item was excluded due to an unfavorable answer distribution in the histogram and a low correlation with the self-assessed competence acquisition. Another item was excluded due to a high item intercorrelation. In addition, three items concerning overextension in the training (originating from the FIRE scale for leadership training evaluation, Schulte and Thielsch 2019) seemed to be somewhat irrelevant for basic training evaluation: They had comparatively high mean values and low standard deviations, did not correlate with the self-assessed acquisition of competence and were assessed as less relevant by an expert from a fire brigade school. Thus, those three items were excluded. Lastly, two items were excluded as they had comparatively high mean values and low standard deviations. Additionally, both items could possibly be problematic for the feedback process, as they referred to stable personality traits of the trainers. For a detailed description of the reasons for exclusion, see Table A1 in the online supplement at https://doi.org/10.5281/zenodo.3948173. See Table A2 in the online supplement at https://doi.org/10.5281/zenodo.3948173 for the final FIRE-B items with an indication of the original items that served as basis.

Exploratory Factor Analyses
The factor number was determined based on the Kaiser criterion (eigenvalues > 1; Guttman 1954; Kaiser and Dickmann 1959), the scree plot (Cattell 1966) and the minimum average partial test (MAP test, Velicer 1976). The Kaiser criterion argued for a solution with seven factors, while a visual inspection of the scree plot as well as the original version of the MAP test (Velicer 1976) suggested six factors, and the revised version of the MAP test (Velicer, Eaton, and Fava 2000) indicated five factors. Because the Kaiser criterion generally tends to overestimate the number of factors (Moosbrugger and Schermelleh-Engel 2008) and because only solutions with five or six factors seemed conceptually meaningful, subsequent content-related deliberations finally led to a solution with six factors.
Based on the results of the EFA, a second item selection was carried out, which further reduced the item pool from 44 to 30 items. A total of eight items were excluded due to their loading pattern (and in some cases due to additional criteria): Five items were removed due to low loadings < 0.5 or double loadings > 0.3, two items were excluded due to comparatively low loadings ≤ 0.54 and low correlations with the mean value of self-assessed competence acquisition (r ≤ 0.27), and one item was excluded due to a double loading (0.32) and a high item intercorrelation (r = 0.68). In contrast, five items with low loadings < 0.4 and/or double loadings ≥ 0.3 were considered relevant in terms of content due to the high correlation of r ≥ 0.5 with the mean self-assessed competence acquisition. Accordingly, these items were retained. The content relevance of the items thus represented the more important decision criterion. In addition, two items were excluded due to low to moderate correlations with the criterion (r ≤ 0.34), and four items were excluded due to high item-total correlations (r_it ≥ 0.67) and high item intercorrelations (r ≥ 0.65), which could indicate the content redundancy of the items. In this case, high item-total correlations were chosen as a reason for exclusion in order to obtain factors that cover as many different facets of the construct as possible. For a detailed description of the reasons for exclusion, see Table A1 in the online supplement at https://doi.org/10.5281/zenodo.3948173.
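To make the Kaiser criterion concrete, the following sketch counts the eigenvalues of a correlation matrix that exceed 1. It is an illustration only: the 4 × 4 correlation matrix is made up, the function names are hypothetical, and the eigenvalues are obtained with a plain cyclic Jacobi rotation scheme rather than with the SPSS routines used in the study. The scree plot and the MAP test are not covered.

```python
import math


def jacobi_eigenvalues(matrix, tol=1e-12, max_sweeps=100):
    """Eigenvalues of a symmetric matrix via cyclic Jacobi rotations."""
    a = [row[:] for row in matrix]  # work on a copy
    n = len(a)
    for _ in range(max_sweeps):
        off_diagonal = sum(a[i][j] ** 2 for i in range(n) for j in range(n) if i != j)
        if off_diagonal < tol:
            break
        for p in range(n - 1):
            for q in range(p + 1, n):
                if abs(a[p][q]) < 1e-15:
                    continue
                # Small rotation angle (|theta| <= pi/4) that zeroes entry (p, q).
                diff = a[q][q] - a[p][p]
                sign = 1.0 if diff >= 0 else -1.0
                theta = 0.5 * math.atan2(sign * 2.0 * a[p][q], sign * diff)
                c, s = math.cos(theta), math.sin(theta)
                for k in range(n):  # update columns p and q
                    akp, akq = a[k][p], a[k][q]
                    a[k][p] = c * akp - s * akq
                    a[k][q] = s * akp + c * akq
                for k in range(n):  # update rows p and q
                    apk, aqk = a[p][k], a[q][k]
                    a[p][k] = c * apk - s * aqk
                    a[q][k] = s * apk + c * aqk
    return sorted((a[i][i] for i in range(n)), reverse=True)


def kaiser_factor_count(correlation_matrix):
    """Number of eigenvalues greater than 1 (Kaiser criterion)."""
    return sum(1 for value in jacobi_eigenvalues(correlation_matrix) if value > 1.0)


# Made-up correlation matrix: items 1-2 and items 3-4 form two clusters.
R = [
    [1.0, 0.6, 0.1, 0.1],
    [0.6, 1.0, 0.1, 0.1],
    [0.1, 0.1, 1.0, 0.5],
    [0.1, 0.1, 0.5, 1.0],
]
eigenvalues = jacobi_eigenvalues(R)
n_factors = kaiser_factor_count(R)
print(n_factors)  # 2
```

In this toy example the two item clusters yield two eigenvalues above 1. With real questionnaire data the criterion often retains more factors than are interpretable, which is why it was combined with other criteria and content-related considerations here.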
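The item-total correlation used as a redundancy indicator can be sketched as follows. The example assumes the corrected variant (each item is correlated with the sum of the remaining items of its scale), which the text does not specify; the response data and the function names are hypothetical, and the threshold of 0.67 is taken from the exclusion rule above purely for illustration.

```python
import math


def pearson(x, y):
    """Pearson product-moment correlation of two equally long sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def corrected_item_total(responses, item):
    """Correlate one item with the sum of the remaining items of the scale."""
    item_scores = [row[item] for row in responses]
    rest_scores = [sum(row) - row[item] for row in responses]
    return pearson(item_scores, rest_scores)


# Made-up data: rows = respondents, columns = three items of one scale.
responses = [
    [7, 6, 7],
    [5, 5, 4],
    [3, 4, 3],
    [6, 6, 5],
    [2, 3, 2],
]
r_it = [corrected_item_total(responses, i) for i in range(3)]

# Flag items whose item-total correlation reaches the illustrative threshold.
flagged = [i for i, r in enumerate(r_it) if r >= 0.67]
print([round(r, 2) for r in r_it], flagged)
```

In the study, a high item-total correlation alone did not lead to exclusion; it was considered together with high item intercorrelations as a sign of content redundancy.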

Extracted Scales and Their Interpretation
Subsequently, following the recommendations of Beavers and colleagues (2013), a new EFA (principal axis factoring with oblique promax rotation) with the final 30 items was calculated to obtain the factor structure of the optimized solution. This EFA led to a solution with six factors. In terms of content, factor 1 (competence) represents the acquisition of competences in training. Factor 2 (structure & didactics) concerns the structure of training and the didactic abilities of the trainers. Factor 3 (materials & facilities) describes the quality and availability of materials and facilities. Factor 4 (support & encouragement) represents the support and promotion of the trainees by the trainers. Factor 5 (group) refers to the group of trainees and, finally, factor 6 (practice) concerns the practical orientation of training. Thus, the instrument consists of the outcome scale for competence acquisition (factor 1), which focuses on the consequences or effects of training, and of five process scales (factors 2-6), which make it possible to assess the execution and implementation of a training (cf. Blanchard and Thacker 2010).
The scales and items of the final questionnaire as well as the corresponding item statistics are presented in Table 2.

Initial Validation
Concerning a first validation, in Study 2 participants' mood was not strongly related to the assessment of the training (0.23 ≤ r ≤ 0.34, p < 0.001). Comparable correlations were found in the validation of the related FIRE questionnaire (Schulte and Thielsch 2019). Consequently, the evaluation results can be distinguished from the participants' mood, initially indicating divergent construct validity. Regarding a first criterion validation, the evaluation results on the five process scales of the FIRE-B in Study 2 showed average to high correlations with the subjective learning success (0.33 ≤ r ≤ 0.53, p < 0.001) as well as with the global grading of the training (0.40 ≤ r ≤ 0.72, p < 0.001). See Table 4 for single values. As these first validation results seemed promising, further in-depth analyses were warranted and were performed in Study 3.

Study 3
In Study 3, confirmatory factor analyses (CFA) were carried out using a different sample to verify the questionnaire's factor structure proposed in Study 2. In addition, bivariate correlations between the scales and selected validation measures served to broadly assess construct and criterion validity. The sample in Study 3 consisted of N = 451 (414 men, 37 women) German firefighters aged between 18 and 63 years (M = 34.02; SD = 9.92). An overview of the initial sample and the exclusion criteria applied is given in Figure A3 in the appendix. The final sample size meets the minimum requirement of N = 400 based on the recommendation for CFAs with three indicator variables per factor and loadings of 0.6 (Gagne and Hancock 2006). It is also suitable for the calculation of correlation coefficients, which, according to Schönbrodt and Perugini (2013), are sufficiently robustly estimated from a sample size of about 250 persons, assuming medium effect sizes. The participants were asked to assess the training they were currently completing or had last completed at the time of the survey. Of those questioned, 24% assessed their training for becoming professional firefighters. Furthermore, 19% assessed the troop man training and 33% assessed the troop leader training within the voluntary fire brigade. Other trainings provided at municipal or district level were assessed by 24% of the sample.

Measures
In addition to the items of the FIRE-B (see Table 2), scales from other well-established evaluation instruments as well as from FIRE validation studies were used for the investigation of convergent construct validity. The participants' current mood, their level of education and their experience served as divergent criteria. Finally, participants' overall satisfaction with the training and their learning success were assessed for criterion validation. Unless specified differently, participants indicated their agreement with the items on a seven-point Likert scale (from 1 = strongly disagree to 7 = strongly agree). An additional denial option (unanswerable) could be ticked if participants perceived an item as not applicable, for instance, because they had never been involved in the activity described in an item (cf. Chyung et al. 2017). Table A3 in the online supplement at https://doi.org/10.5281/zenodo.3948173 gives an overview of all validation items of Study 3 as well as their source and response format.

Items for Construct Validation
Scales from well-established German evaluation instruments for higher education (HILVE II (Rindermann 2009); TRIL (Gläßer et al. 2002); FEPRA (Staufenbiel 2000)) were used for convergent construct validation of the four scales structure & didactics, support & encouragement, materials & facilities as well as competence. In past studies, the HILVE proved to be a very stable measure over time that correlates with performance criteria and external-rater judgments (Rindermann 1994). Likewise, the results from the TRIL in a program evaluation also correlated with the judgments of external observers (Gollwitzer and Schlotz 2003). The scale group was validated with three items that had already been used to validate the group scale of the FIRE questionnaire (Schulte and Thielsch 2019). The validation of the scale practice was carried out with four items from the validation of the transfer scale from the FIRE questionnaire (Schulte and Thielsch 2019). For divergent construct validation, the current mood was again (as in Study 2) measured with a five-point smiley scale (Jäger 2004), and the level of education was assessed by the highest level of school-leaving certificate achieved. Additionally, participants' experience was assessed by asking about previous experience, for example previous membership in a youth fire brigade, as well as about monthly mission experience.

Items for Criterion Validation
To investigate the first criterion, participants' overall satisfaction, three single-item measures were used: The item "All in all, the attendance of this training was worthwhile for me" was taken from the TRIL (Gläßer et al. 2002), the item "I would recommend the training to a good friend" was used according to the MFE-Sr (Thielsch and Hirschfeld 2012), and the last item asked the participants to rate the training on a school grade scale (1 = very good; 6 = insufficient) (FEVOR/FESEM, Staufenbiel 2000). Learning success served as second criterion and was first measured by the item "I learned a lot during my training" taken from the KIEL (Gediga et al. 2000). Second, participants were asked whether they had passed the evaluated basic training.

Procedure
For data collection, an online survey was created using the survey software EFS Survey (provided by the Questback GmbH 2018). Participation in the survey was voluntary, anonymous and possible via an access link. It could be carried out on computers or other internet-enabled devices and consisted of three different sections (see Figure A4 in the appendix). The median response time for completing the entire survey was 12 min and 59 s. The study was available online from July to September 2018. As in Study 2, the survey was aimed at German firefighters. As compensation, the respondents received a result report after Study 3 was completed. Moreover, they were able to take part in a raffle for an annual subscription to a firefighter-specific magazine.

Statistical Analysis
The statistical data analyses were carried out with RStudio (RStudio Team 2016, Version 1.1.456). In particular, the packages lavaan (Rosseel 2012, version 0.6.3), plyr (Wickham 2011, version 1.8.4), psych (Revelle 2018, version 1.8.10) and semPlot (Epskamp 2017, version 1.1) were used. A robust maximum-likelihood estimator with Huber-White standard errors and a scaled test statistic asymptotically comparable to the Yuan-Bentler test statistic (MLR) was used to calculate the confirmatory factor analysis (cf. Steinmetz 2015). In addition, bivariate correlations between the scales of the questionnaire and selected validation criteria were calculated to assess construct and criterion validity.

Confirmatory Factor Analysis
A confirmatory factor analysis (CFA) was conducted to review the factor structure proposed in Study 2. Modification indices indicated a high correlation between item 1a ("I think the training was clearly structured") and item 1b ("I could always follow the training process during the training") of the scale structure & didactics. After considerations of content, both items were judged to be redundant. Since item 1a seemed more global and understandable, item 1b was removed from the model. According to Schermelleh-Engel, Moosbrugger, and Müller (2003), the model fit for the final FIRE-B with 29 items on six scales can be classified as good (RMSEA = 0.05; SRMR = 0.04) to acceptable (CFI = 0.95; TLI = 0.95). The χ² test was significant (χ²(362) = 585.12, p < 0.001), which is common for large samples (Tanguma 2001). However, relative to the degrees of freedom, the χ² value is good (χ²/df = 1.62). Results thus provide support for a six-factorial structure. Figure 2 illustrates the specified model including all path coefficients. The intercorrelations of the scales are medium to high and can be found in Table A1 in the appendix.
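The reported degrees of freedom and the χ²/df ratio can be checked with simple arithmetic. The following minimal sketch assumes a simple-structure model (each item loads on exactly one factor, no correlated residuals, all six factors allowed to correlate, factor scale fixed for identification). The text does not spell out this specification; it is an assumption, and the function name is hypothetical.

```python
def cfa_degrees_of_freedom(n_items: int, n_factors: int) -> int:
    """Degrees of freedom of a simple-structure CFA with correlated factors.

    Assumption (not stated in the paper): one loading per item, no
    correlated residuals, factor variances fixed to 1 for identification.
    """
    # Unique variances and covariances among the observed items
    observed_moments = n_items * (n_items + 1) // 2
    loadings = n_items
    residual_variances = n_items
    factor_covariances = n_factors * (n_factors - 1) // 2
    free_parameters = loadings + residual_variances + factor_covariances
    return observed_moments - free_parameters


df = cfa_degrees_of_freedom(n_items=29, n_factors=6)
chi_square = 585.12  # value reported for the final model
ratio = chi_square / df

print(df)               # 362
print(round(ratio, 2))  # 1.62
```

Under these assumptions the count reproduces the 362 degrees of freedom reported above, which suggests that the final model contains no additional free parameters such as cross-loadings.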
In sum, the results of the CFA are in line with the findings of the EFA in Study 2, confirming that the items of the FIRE-B load on six distinct factors (structure & didactics, support & encouragement, group, practice, materials & facilities, competence). Overall, the six-dimensional questionnaire structure demonstrates that various quality factors contribute to good firefighter basic trainings and that firefighters should have a wide range of skills for successful action (cf. Kleinmann et al. 2010).

Table 3 reports an overview of the reliability estimates and the associated measurement model tests for all FIRE-B scales based on the data of Study 2 and 3. Cronbach's α should at best assume values between 0.70 and 0.90 (Tavakol and Dennick 2011). Likewise, ω_H can be classified (Schweizer 2011). Accordingly, all scales reach a good to acceptable level of reliability. To this extent, the coefficients are comparable with those of other established evaluation instruments (e.g., HILVE II, Rindermann 2009).
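To make the reliability coefficient concrete, the following sketch computes Cronbach's α with the standard formula α = k/(k − 1) · (1 − Σ item variances / variance of the sum score). The response data are invented for illustration only and are not taken from Study 2 or Study 3; the function name is hypothetical.

```python
from statistics import variance


def cronbach_alpha(item_scores: list[list[float]]) -> float:
    """Cronbach's alpha for one scale.

    item_scores[i][p] is the response of participant p to item i.
    """
    k = len(item_scores)
    if k < 2:
        raise ValueError("Cronbach's alpha needs at least two items")
    # Sum score of every participant across the k items
    totals = [sum(responses) for responses in zip(*item_scores)]
    item_variance_sum = sum(variance(item) for item in item_scores)
    return k / (k - 1) * (1 - item_variance_sum / variance(totals))


# Invented seven-point responses of five participants to a three-item scale
example_scale = [
    [5, 6, 7, 4, 6],
    [4, 6, 7, 5, 6],
    [5, 7, 6, 4, 5],
]
alpha = cronbach_alpha(example_scale)
print(round(alpha, 2))  # 0.87
```

For these invented data the coefficient falls inside the 0.70 to 0.90 band cited above. Participants with missing or "unanswerable" responses would have to be handled before calling the function; this sketch assumes complete data.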

Convergent Construct Validity
All FIRE-B scales showed consistently high positive correlations with their corresponding validation scales (see Table 4): r = 0.81, p < 0.001 between the scale structure & didactics (FIRE-B) and the scale structure & didactics (TRIL); r = 0.88, p < 0.001 between the scale support & encouragement (FIRE-B) and the scale

Divergent Construct Validity
As assumed, in Study 3 the participants' current mood correlated significantly but only to a small extent with the FIRE-B scales (0.12 ≤ r ≤ 0.23, p < 0.01). For the educational level of the participants, small and only partially significant correlations with the FIRE-B scales (−0.17 ≤ r ≤ −0.05, p between <0.001 and 0.31) were found. With regard to the length of previous experience in the work of fire brigades, the correlations were consistently small and nonsignificant (−0.10 ≤ r ≤ −0.02; 0.09 ≤ p ≤ 0.74). Similarly, the monthly experience in professional and voluntary fire brigades was only slightly related to the assessment of the FIRE-B scales (−0.05 ≤ r ≤ 0.09, 0.05 ≤ p ≤ 0.95). All results point to divergent validity of the FIRE-B. See Table 4 for detailed results.

Criterion Validity
Study 3 showed moderate to large correlations of the FIRE-B scales with the school grade awarded (0.42 ≤ r ≤ 0.75, p < 0.001). In addition, there were moderate to large, highly significant relationships with the other items measuring overall satisfaction (0.42 ≤ r ≤ 0.69, p < 0.001). With regard to learning success, moderate to large, highly significant correlations between the FIRE-B scales and the subjective learning success were found (0.41 ≤ r ≤ 0.75, p < 0.001). The correlation between the evaluation results and the passing of the examination could not be meaningfully examined due to the lack of variance in the data: 99% of the respondents passed the examination directly, and only 1% passed after a subsequent examination. All other results support the criterion validity of the FIRE-B. See Table 4 for detailed results.

General Discussion
High-quality basic training is critical to the development of firefighters' knowledge and skills. Only with such training can firefighters successfully perform their demanding tasks and, at the same time, be prepared for possible negative physical (e.g., work-related injuries, see Moore-Merrell et al. 2008) or psychological (e.g., burnout and PTSD, see Katsavouni et al. 2016) consequences. In this regard, the newly created evaluation questionnaire for firefighter basic trainings (FIRE-B) addressed the lack of a valid and scientifically based tool to assess these trainings. The questionnaire was developed and validated in a series of three studies. Results clearly show that the FIRE-B meets all central psychometric standards and, therefore, can and should be used.
Through these studies, we ensured high content validity, meaning that the constructed item set represents all relevant facets of firefighter basic trainings: First, an expert survey about the characteristics of a good firefighter education served as basis for the development of the item pool (Study 1). Second, in Study 2 only few participants made additions to the questionnaire despite explicit requests. Moreover, we found further clear indications for validity of the FIRE-B: The factor structure proposed in Study 2 was confirmed with an independent sample in Study 3, indicating factorial validity. Beyond that, bivariate correlations in Study 2 and Study 3 served to investigate convergent and divergent construct validity as well as criterion validity. The patterns of correlations between the FIRE-B and the validation scales clearly support the assumption that the FIRE-B measures the intended content. Thus, the results consistently confirmed the validity of the FIRE-B. In addition, the internal consistencies of the scales can overall be regarded as sufficient to good (see Table 3), which is especially promising because most of the scales are brief. Therefore, applying the FIRE-B will lead to reliable results. Reliability will be further ensured because training evaluations will only be performed for trainings with a fairly large group of participants in order to avoid answer bias based on individual opinions (see the scoring instructions in online supplement at https://doi.org/10.5281/zenodo.3948173).
Other benefits of the FIRE-B are its time efficiency, usefulness and relevance. Regarding efficiency, even though the six different scales allow for a comprehensive assessment of various aspects concerning basic training, the items can be processed in about 4 to 5 min. Regarding usefulness, no other scientifically developed, published evaluation tool for rescue service basic education is available in the literature, making the FIRE-B a highly useful tool. In addition, the evaluation questionnaire is of practical relevance, as its results can describe the current quality of fire brigade trainings and serve as a starting point to derive concrete measures for improving trainings.
In theoretical terms, the current study contributes to the question of which quality factors are relevant to evaluate firefighter basic trainings. Altogether, the identified six-factor questionnaire structure confirms that various quality factors contribute to a good training and that firefighters should have a wide range of skills for successful action (cf. Kleinmann et al. 2010). Similarly, Schulte and Thielsch (2019) concluded that evaluations of firefighter leadership trainings should be multidimensional. The final dimensions of the FIRE-B largely correspond to the scales of the FIRE questionnaire (Schulte and Thielsch 2019). For example, both the FIRE-B and the FIRE questionnaire include the scale group. The scale structure & didactics of the FIRE-B is comparable to the scales trainers' behavior and structure in the FIRE questionnaire. In addition, the scale competence includes aspects of the two original FIRE scales competence and transfer. However, differences also exist, showing that basic and leadership training differ from each other and should be evaluated with different instruments. Firstly, the FIRE-B contains the scale support & encouragement. In contrast, the FIRE only contains a few motivation-related items on its scale trainers' behavior. Furthermore, the scales practice and materials & facilities are part of the FIRE-B but do not exist in the FIRE questionnaire. These aspects seem to be more important for basic trainings than for experienced participants of leadership trainings. Conversely, the scale overextension is part of the FIRE but not of the FIRE-B questionnaire. The low failure rate in examinations of firefighter basic trainings confirms the low relevance of this scale for the FIRE-B.

Practical Application
On a practical level, the FIRE-B for the first time offers the possibility to assess the quality of firefighter basic trainings at municipal and district level from the trainee's point of view. The differentiation of process and outcome scales according to Kirkpatrick (1979) helps to distinguish information on learning outcomes from information on possible ways to adapt the training process. In this way, the process-related items of the FIRE-B capture judgments about the trainer, the organization of the training, the group as well as about exercises, materials and facilities. The result-related items assess the extent to which the training has contributed to the acquisition of knowledge and skills. Thus, the FIRE-B provides information on the current quality of basic firefighting trainings and helps to identify possible areas for improvement within the implementation of the training and the achievement of the learning objectives. If, for example, trainees indicate not having achieved certain learning objectives, the trainer can check whether and at what point in the process there was a problem. To facilitate the questionnaire's practical application, we provide additional information in the online supplement at https://doi.org/10.5281/zenodo.3948173, including questionnaire templates (in English and German) and scoring instructions.
Generally, an evaluation with the FIRE-B should take place directly after a firefighter training course. The 29 items can be answered quickly (in our experience in about 4-5 min), and the six different scales provide a comprehensive picture of the training quality. The items are consistently to be rated on a seven-point response scale with a denial option, allowing a simple data analysis and interpretation. Depending on the evaluation context, we recommend the additional use of four optional items (see Table A5 in the online supplement at https://doi.org/10.5281/zenodo.3948173). Also, as it is important to ensure anonymous evaluation, one should not collect variables (e.g., demographic) that allow conclusions about individual persons. If scales are not applicable or if the use of all scales is considered too time consuming, single FIRE-B scales can be omitted, as they have been validated separately. However, one should not remove individual items. As the scales are already very brief, this may impair psychometric quality. Similarly, one should not change the wording of the individual questions. Exceptions are minor adjustments to ensure the questionnaire's comprehensibility and its optimal adaptation to the evaluation context.
In the analysis, mean values are calculated for the individual scales across participants and courses (see scoring instructions in the online supplement at https://doi.org/10.5281/zenodo.3948173). A high number of missing values (see scoring instructions in the online supplement at https://doi.org/10.5281/zenodo.3948173) from many participants may indicate a lack of fit of the questionnaire in the respective evaluation context. Additionally, an evaluation analysis should only take place if a minimum number of completed evaluation questionnaires are available (see scoring instructions in the online supplement at https://doi.org/10.5281/zenodo.3948173). Further, to keep effort and calculation errors to a minimum, we recommend that evaluations are analyzed using simple data evaluation programs.
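The scoring logic described above can be sketched in a few lines. This is only an illustration under stated assumptions: the two thresholds are hypothetical placeholders (the binding values are given in the scoring instructions of the online supplement), the function names are invented, and averaging per participant before averaging across participants is one plausible reading of the procedure rather than the documented one.

```python
from statistics import mean
from typing import Optional

# Hypothetical placeholders; the actual thresholds are defined in the
# scoring instructions of the online supplement, not in this sketch.
MIN_QUESTIONNAIRES = 5
MAX_MISSING_SHARE = 0.5

# A response is 1..7 on the Likert scale; None marks "unanswerable"/missing.
Response = Optional[int]


def person_scale_score(responses: list[Response]) -> Optional[float]:
    """Mean of one participant's answered items on one scale.

    Returns None if the share of missing items exceeds MAX_MISSING_SHARE.
    """
    answered = [r for r in responses if r is not None]
    missing_share = (len(responses) - len(answered)) / len(responses)
    if not answered or missing_share > MAX_MISSING_SHARE:
        return None
    return mean(answered)


def course_scale_mean(questionnaires: list[list[Response]]) -> Optional[float]:
    """Scale mean across participants, or None if too few questionnaires."""
    if len(questionnaires) < MIN_QUESTIONNAIRES:
        return None
    scores = [s for q in questionnaires
              if (s := person_scale_score(q)) is not None]
    return mean(scores) if scores else None


# Invented responses of five participants to a three-item scale
course = [
    [6, 7, None],
    [5, 5, 5],
    [None, None, 4],  # excluded: two of three items missing
    [7, 6, 7],
    [4, 5, 6],
]
result = course_scale_mean(course)
print(round(result, 2))  # 5.79
```

Returning None instead of a number when too few questionnaires are available keeps a course with very few respondents from being reported at all, in line with the recommendation above.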
Beyond that, organizers and trainers should consider the subjective nature of this type of evaluation: feedback gathered with the FIRE-B questionnaire is an opportunity for organizers and trainers to obtain important information about their own teaching activities from their trainees' points of view. Thus, organizers and trainers should meet with the trainees to discuss the results of the evaluation. In addition, the responsible organization should support the evaluation both technically and in terms of its content. Particularly, organizers should offer trainers help if evaluations repeatedly reveal areas for improvement or offer them praise for good teaching quality.
Finally, while the FIRE-B was developed specifically for the context of fire brigades, in this context it pursues a rather global approach, meaning that the questionnaire can be used to evaluate a wide range of firefighter basic trainings. If, beyond that, people wish to investigate more specific aspects as part of an evaluation, we recommend using more specific evaluation scales, such as scales for the evaluation of firefighter examinations, mission exercises or command unit trainings (Röseler et al. 2020;Schulte and Thielsch 2019;Thielsch, Busjan, and Frerichs 2018).

Limitations and Future Research
This study has some limitations but also opens up possibilities for future research. Both will be discussed below.
Regarding the application of the FIRE-B, one must consider that German firefighters served as the basis for constructing the questionnaire, such that the proportion of voluntary versus professional firefighters as well as the gender distributions were representative of German fire brigades (cf. Deutscher Feuerwehrverband (DFV) 2015). Fire brigades in other countries may have different distributions or different work and training structures. Thus, organizers in different countries should first check whether the FIRE-B validly covers the relevant areas of the respective training courses or whether it needs to be adapted. Similarly, the use of the FIRE-B questionnaire in other training contexts is conceivable, as its items do not contain a fire brigade-specific vocabulary. The content should be generally relevant within rescue training courses, such as paramedic trainings. Before applying a translation or adapted version, the validity should first be checked. Therefore, we recommend as a minimum requirement performing an expert assessment of the content validity for the intended context, and we highly welcome specific validation studies.
Another aspect to consider when using the FIRE-B is its subjectivity: The FIRE-B asks for judgments from the trainees' points of view. As indicated above, such information is primarily helpful for trainers to reflect on their own teaching activities and, if necessary, to think about possibilities for improvement. At the same time, the evaluation's subjectivity increases the risk of misunderstandings. We, therefore, recommend that organizers meet with the trainers and trainees after the evaluation to discuss the results as a group, to collect ideas for improvement, and to uncover and clarify possible misunderstandings. In this regard, an evaluation from the trainers' viewpoints might also be of interest, especially as Schulte and Thielsch (2019) showed that the judgments of trainees are, to some extent, different from those of the trainers. For example, trainers rated trainees' overextension to be higher than the trainees actually experienced. As such, considering the views of both parties could contribute to a more global assessment of the quality of firefighter basic trainings and can also benefit trainees' learning (Berlin and Carlström 2014;Childs 2005;Sommer and Njå 2011).
Additionally, the results of our study also carry some limitations. First, there was no proof of a relationship between the FIRE-B scales and passing the final exam. In view of the marginal failure rate, the question of passing the final examination for basic trainings seems to be an inadequate validation criterion. In the case of a uniform grading system, the grade achieved in a final examination might be a more suitable criterion. Furthermore, an objective proof of examination (e.g., training diplomas) could prevent possible effects of social desirability. Second, to avoid high exclusion rates, the date of the evaluated training was used as a filter criterion only in Study 2 but not in Study 3 (see Figures A1/A3 in the appendix). In Study 3, there were a few significant correlations between the number of years since the start of training and FIRE-B scales. However, this finding does not affect the validity of this work, since partial correlations controlling for this variable were used for statistical data analysis in Study 3.
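For readers unfamiliar with partial correlations, the following sketch shows the standard first-order formula, r_xy.z = (r_xy − r_xz · r_yz) / sqrt((1 − r_xz²)(1 − r_yz²)). It is a generic illustration with invented data, not a reproduction of the analysis in Study 3, whose exact procedure is not described here; all variable and function names are hypothetical.

```python
from math import sqrt


def pearson_r(x: list[float], y: list[float]) -> float:
    """Pearson product-moment correlation of two equally long samples."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def partial_r(x: list[float], y: list[float], z: list[float]) -> float:
    """Correlation of x and y after removing the linear influence of z."""
    r_xy, r_xz, r_yz = pearson_r(x, y), pearson_r(x, z), pearson_r(y, z)
    return (r_xy - r_xz * r_yz) / sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))


# Invented, centered data: both variables depend on the control variable
# (e.g., years since training) plus independent noise.
years_since_training = [-1.0, -1.0, 1.0, 1.0]
scale_score = [-2.0, 0.0, 0.0, 2.0]
criterion = [-2.0, 0.0, 2.0, 0.0]

raw = pearson_r(scale_score, criterion)
controlled = partial_r(scale_score, criterion, years_since_training)
print(round(raw, 2))         # 0.5
print(abs(controlled) < 1e-9)  # True
```

In this constructed example the raw correlation of 0.5 is entirely due to the shared control variable and vanishes once it is partialled out, which is the kind of distortion the control is meant to rule out.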
Regarding further research, the four-level model according to Kirkpatrick (1979) calls for methodically expanded follow-up studies investigating how to evaluate firefighter basic trainings at the levels of behavior and results. For the assessment of transfer effects at the behavioral level, future studies should collect data during subsequent work as a firefighter. Conceivable sources are again subjective self-judgments (e.g., "I successfully manage to apply the training contents in my everyday work" (Grohmann and Kauffeld 2013, 142)), but also judgments from the perspectives of trainers, colleagues or superiors. In addition, objective observations of behavior based on standardized evaluation criteria can be a useful supplement (Blanchard and Thacker 2010). For the evaluation at the level of results, follow-up studies could examine whether a high quality of training leads to better organizational outcomes and whether, for example, fire brigade operations can be carried out more quickly or more successfully. Equally, future studies could check whether high-quality trainings imply a lower accident rate among firefighters themselves. A particularly challenging aspect of such an evaluation at the result level is the difficulty in attributing positive effects only to the object of evaluation and not to other unrecorded influences (Kennedy et al. 2014;Kirkpatrick 1979;Praslova 2010).

Conclusion
The present paper provides, for the first time, a systematically developed and validated evaluation questionnaire for firefighter basic trainings. The FIRE-B is a useful, efficient, reliable and valid feedback instrument. Thus, it can and should be used in rescue service education. The regular and long-term use of the evaluation questionnaire may not only contribute to the recording of current quality standards but may also reveal areas for improvement or possible changes in the basic training of firefighters. By providing a tool that may help improve the quality of firefighter basic trainings, we hope to ultimately contribute to society as a whole, as an optimized education hopefully leads to the most competent firefighters capable of responding optimally to various forms of emergencies.