Spatial thinking from a different view: Disentangling top-down and bottom-up processes using eye tracking

Abstract The goal of the present study was to investigate the potential of gaze fixation patterns to reflect cognitive processing steps during test performance. Gaze movements, however, can reflect both top-down and bottom-up processes. Top-down processes are the cognitive processing steps that are necessary to solve a certain test item. In contrast, bottom-up processes may be provoked by varying visual features that are unrelated to the item solution. To disentangle top-down and bottom-up processes in the context of spatial thinking, a new test (R-Cube-Vis Test) was developed and validated explicitly for the use of eye tracking, in three studies and in a long and a short version. The R-Cube-Vis Test measures visualization and conforms to the linear logistic test model with six difficulty levels. All items of one level demand the same transformation steps for their solution. The R-Cube-Vis Test was then utilized to investigate different gaze-fixation-based indicators for identifying top-down and bottom-up processes. Some of the indicators were also able to predict the correctness of the answer to a single item. Gaze-related measures have a high potential to reveal cognitive processing steps during the solving of an item of a given difficulty level, provided that top-down and bottom-up processes can be segregated.


Introduction
Psychological diagnostics cover the testing of someone's abilities or personality characteristics that can lead to informed decisions in the present or about future events (e.g., Swets, Dawes, & Monahan, 2000). Especially in performance testing, it might be interesting to get insights into the cognitive steps that lead to the given answer of a certain task. On the one hand, such information could be used to detect inappropriate solving strategies, which might be important if the test performance is poor. If the knowledge about appropriate solving strategies is not intended to be measured by the test, the information about the usage of inappropriate strategies could be used to, e.g., reevaluate the test or to teach appropriate solving strategies for a follow-up test. On the other hand, the knowledge about such cognitive steps might be able to differentiate between participants with high and low ability even if the items have ceiling or floor effects and/or have a high guessing probability.
However, most common performance tests are based on a single measure per item: accuracy (or a derivative) or reaction times. Information about how participants solve certain items is usually not considered. Only if the tasks are more complex, such as in the MicroDyn test (Greiff, Wüstenberg, & Funke, 2012), in which participants have to solve linear structural equation models for simple sample scenarios, can sequences of single mouse clicks be recorded to identify solving strategies (e.g., Meißner, Greiff, Frischkorn, & Steinmayr, 2016; Schweizer, Wüstenberg, & Greiff, 2013). If the tasks are short and can only be answered as, e.g., correct or incorrect, usually only accuracy and reaction times are accessible. Current research addresses this issue by taking reaction times into account as a moderating factor when accuracy is the main measure. For example, van der Linden (2007) proposed a hierarchical model that includes accuracy and reaction times per item at an individual and population level. This allows evaluating the accuracy of an answer by the time needed to produce it. However, a single reaction-time value can only be an approximate indicator of the cognitive processes performed during the solving of the item. Generally, it cannot be decided whether longer or shorter reaction times indicate reasonable solving behavior. An alternative approach, the multinomial processing tree model (Erdfelder & Buchner, 1998), attempts to decompose cognitive processes based on accuracy but still utilizes a single value per item. Beyond these traditional measures, eye tracking provides alternative ways to access cognitive processes due to the many data points that can be gathered per item.
The goal of the presented studies was to test whether such eye tracking information has the potential to provide information beyond accuracy and reaction times in order to describe someone's ability and to evaluate the solving behavior. This research question was analyzed in the well-established domain of spatial thinking. The standard spatial thinking tests, however, are limited when used in eye tracking research due to restrictions such as heterogeneous stimulus materials and overlap of relevant item features (see below). Therefore, a new test was developed and validated explicitly for the use of eye tracking.
The remainder of the paper is organized as follows: Section 2 introduces the construct of spatial thinking and provides an overview of current eye tracking research in this domain. The drawbacks of the current standard tests are also considered. In Section 3, the newly developed spatial thinking test, the R-Cube-Vis Test, is described in detail, including first evidence for validation aspects. Studies 1 to 3 (Sections 4-6) analyze further aspects of validity of the R-Cube-Vis Test as a long and a short version. The fourth study (Section 7) demonstrates the usage of the R-Cube-Vis Test in eye tracking research by disentangling bottom-up and top-down processes and by applying simple and sophisticated fixation-based measures. The paper ends with a general discussion.
The construct of spatial thinking is considered to be heterogeneous. According to the factor-analytic approach, it consists of different factors that can be measured with specific tests (see Hegarty & Waller, 2005). The number of spatial thinking factors ranges from two to five, depending on how broadly the construct or its single aspects are defined (e.g., Carroll, 1993; Lohman, 1988; McGee, 1979; Michael, Guilford, Fruchter, & Zimmerman, 1957). One of the most extensive works came from Carroll (1993), who reanalyzed more than 90 studies. His five-factor structure was also confirmed by Burton and Fogarty (2003) and forms the basis of the present studies. The first factor is visualization, referring to "the ability to manipulate or transform the image of spatial patterns into other arrangements" (Ekstrom, French, Harman, & Dermen, 1976b, p. 173). The second factor is spatial relations, meaning "the ability to perceive spatial patterns or to maintain orientation with respect to objects in space" (Ekstrom et al., 1976b, p. 149). In contrast to visualization, where the configuration of object parts is changed, spatial relations refers to the manipulation of the whole object (Ekstrom et al., 1976b). These two factors can be interpreted as the ends of a continuum (Pellegrino, Alderton, & Shute, 1984), with complex power tests on one end (visualization) and simple speeded tests on the other (spatial relations). The remaining three spatial thinking factors of Carroll (1993) cover the spatial abilities to recognize objects under different conditions (Closure Speed, Flexibility of Closure, Perceptual Speed) and to memorize objects (Visual Memory).
As was generally pointed out for performance tests above, standard tests for visualization provide only accuracy values and are therefore restricted to describing someone's ability. Potential solving strategies that may vary between participants cannot be obtained with these tests (Hegarty & Waller, 2005). Different participants might even solve task items in different manners, depending on their abilities and preferences (see Kyllonen, Lohman, & Snow, 1984; Kyllonen, Lohman, & Woltz, 1984). These drawbacks are illustrated in the following two standard tests for spatial thinking, the Paper Folding Test (PFT, Ekstrom, French, Harman, & Dermen, 1976a, Figure 1) and the Mental Rotation Test (MRT, Vandenberg & Kuse, 1978), both of which were conducted in the presented studies for validation purposes.

Figure 1. Example item of the Paper Folding Test (PFT). The task is to decide which of the paper sheets on the right side (A-E) matches the unfolded paper sheet on the left side. Here, the correct answer is C.
According to Ekstrom et al. (1976b), a typical visualization test is the Paper Folding Test (PFT, Figure 1), where participants have to mentally unfold a perforated paper sheet. Another test that was invented as a visualization test is the Mental Rotation Test (MRT, Figure 2), which, however, does not fulfill all aspects of the definition of visualization. In the MRT, participants have to mentally rotate, but neither transform nor manipulate, 3D objects that were originally invented by Shepard and Metzler (1971). Although Carroll (1993) assigned the MRT, as a Block Rotation Task, to the visualization tests, he pointed out that "block rotation tasks are likely to appear on factor SR [spatial relations] when they are simple and highly speeded" (p. 323). Therefore, Pellegrino et al. (1984) placed the MRT in the middle of the described continuum between visualization and the second spatial thinking factor, spatial relations. For the presented studies, the redrawn version of Peters et al. (1995, MRT-A) was utilized.

Figure 2. Example item of the Mental Rotation Test (MRT-A, Peters et al., 1995). Two of the four objects on the right side have to be identified that represent a rotated version of the object on the left side. Here, the first and third objects are the correct ones.
In the present context, a strategy refers to a certain sequence of different cognitive processes performed to solve a task. For example, Kyllonen, Lohman, and Snow (1984) described two solving strategies for the PFT. In the visualization strategy, participants solve tasks by mentally unfolding the left paper sheet, i.e., they cognitively process each item in the intended manner. In contrast, participants who use the analytic strategy solve these tasks by deriving logical conclusions without mentally transforming or manipulating the shown figures. The inspection of the second figure on the left side (Figure 1), for example, would lead to the conclusion that all perforations have to be on the left side of the sheet. Therefore, options B, D, and E can be excluded. Furthermore, the observation that the perforation goes through the upper and lower part of the paper leads to the conclusion that A can also be excluded. Hence, the correct answer is C. Given these different approaches to solving these items, standard tests of visualization that are purely based on accuracy measures can fail to measure the true construct of visualization if participants actually use, e.g., analytic solving strategies. Therefore, further information about how a participant has solved a certain task is needed to derive more sophisticated descriptions of someone's ability.
A promising source of information about the cognitive processes taking place while participants are solving a visual task is eye tracking. Eye-tracking methods deliver gaze-related measures such as gaze fixations, which are "pauses over informative regions of interest" (Salvucci & Goldberg, 2000, p. 71). The basic assumption of eye tracking research is the "eyes' tendency to fixate the referent of the symbol that is 'at the top of the stack' [of the cognitive processes]" (Just & Carpenter, 1976, p. 477). This so-called eye-mind hypothesis assumes that the parts of an object are cognitively processed at the same time as they are fixated by the eyes. This holds at least for cognitive tasks with rapid operations (50 to 800 ms, according to Just & Carpenter, 1976).
There are only a small number of studies that use eye tracking as an additional source of information during the performance of spatial thinking tests. An important difference between these studies is the size of the areas of interest (AOIs). AOIs are defined regions on the stimulus materials in eye tracking research. Typically, distinct fixations within an AOI are not differentiated and are interpreted as equal; usually, only the number and duration of fixations on certain AOIs are reported. Fixations on specific details within these AOIs are not considered. Therefore, the definition of AOIs (with respect to their scaling) determines the resolution of the resulting gaze pattern: fine-scaled AOIs produce more detailed information (higher resolution) than large-scaled AOIs (lower resolution).
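As a concrete illustration of how AOI-based aggregation works, the following sketch assigns raw fixations to rectangular AOIs and accumulates fixation count and dwell time per AOI. The AOI names, coordinates, and fixation data are hypothetical and are not taken from any of the studies discussed here.

```python
# Hedged sketch: mapping fixations to non-overlapping rectangular AOIs and
# aggregating per-AOI fixation counts and dwell times. All names and numbers
# below are illustrative assumptions, not data from the reviewed studies.
from dataclasses import dataclass

@dataclass
class AOI:
    name: str
    x: float; y: float; w: float; h: float  # top-left corner and size in px

    def contains(self, fx, fy):
        return self.x <= fx < self.x + self.w and self.y <= fy < self.y + self.h

def aggregate_fixations(fixations, aois):
    """fixations: list of (x, y, duration_ms) tuples; returns per-AOI stats."""
    stats = {a.name: {"count": 0, "dwell_ms": 0.0} for a in aois}
    for fx, fy, dur in fixations:
        for a in aois:  # first matching AOI wins (AOIs assumed non-overlapping)
            if a.contains(fx, fy):
                stats[a.name]["count"] += 1
                stats[a.name]["dwell_ms"] += dur
                break  # fixations outside every AOI are simply dropped
    return stats

# Large-scaled AOIs: one box per item figure (hypothetical layout)
aois = [AOI("left_cube", 100, 200, 300, 300), AOI("right_cube", 600, 200, 300, 300)]
fixations = [(150, 250, 220), (700, 300, 180), (650, 400, 260), (50, 50, 90)]
print(aggregate_fixations(fixations, aois))
```

Switching from large-scaled to fine-scaled analysis only requires replacing the AOI list with smaller boxes over single item features; the aggregation logic stays the same, which is precisely why the AOI definition determines the resolution of the resulting gaze pattern.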
One of the earliest studies that postulated a three-phase cognitive processing model for visualization showed characteristic gaze patterns for the three phases using large-scaled AOIs (Egan, 1979). The AOI definition discriminated only between the whole objects or figures that had to be compared; details of these objects were ignored. Egan (1979) was able to describe a robust viewing pattern by using this AOI definition, but the resulting model is very general and not specific. The first phase is the Search-phase, in which participants compare objects pairwise. In the second phase, the Transform-phase, participants try to transform non-matching elements of the objects until they match. In the third phase (Confirm-phase), participants compare other corresponding object elements to decide whether the objects are equal except for the transformations. The first and third phases were characterized by fixation switches between two objects, looking for matching parts; furthermore, the time needed for these phases appeared to depend on the discriminability of the two objects (Carpenter & Just, 1978). The second phase contained longer fixations that went back and forth between corresponding parts. A similar division into three phases (search, transformation/rotation, confirmation) was also found for pairs of Shepard-Metzler figures (Just & Carpenter, 1976). Although Just and Carpenter (1976) defined AOIs for parts of the presented objects by considering fixations on single arms of the Shepard-Metzler figures (see Figure 2), the resulting gaze patterns were also only generally described due to the simple stimulus materials. In addition, the resulting fixation sequences were manually segmented and annotated with respect to the three phases, which limits objectivity.
Another application example utilizing large-scaled AOIs comes from Snow (1980), who examined interindividual differences. To this end, he analyzed the PFT results of 48 high-school students representing extreme groups with regard to their abilities in crystallized and fluid intelligence (Cattell, 1971). In his eye tracking study, Snow (1980) considered gaze paths based on large-scaled AOIs, where each AOI represented one single figure (e.g., seven AOIs were defined for the item presented in Figure 1). Fixations on specific features of the figures, such as perforation holes or specific areas, were not considered. Nevertheless, Snow (1980) was able to find differences between the two ability groups: low-ability participants seemed to have problems with inhibitory or control mechanisms while they were checking the answer alternatives, whereas high-ability participants were better able to sustain their item analysis.
In another study, gaze movements on fine-scaled AOIs of complex items were analyzed. Ivie and Embretson (2010) tracked the gaze movements of ten participants while they solved the Revised Minnesota Paper Form Board Test (RMPFBT, Likert & Quasha, 1970). An RMPFBT item shows six figures in three rows and two columns. The target figure is placed in the upper left corner and shows randomly distributed, separated parts of a simple form, e.g., a disc. The other five figures show the target form, the disc, not separated but with different partitions drawn on it. Participants have to decide which of the five figures contains the same partition as the target figure. Ivie and Embretson analyzed the data qualitatively because of the small sample size (N = 10), with respect to differences between participant groups (low ability, high ability, and guessers) as well as differences between the items. For their analysis, they used fine-scaled AOIs on single features of the six figures per item. Participant groups differed in the time they spent analyzing the shown figures and in the number of fixations on features (e.g., edges) of these figures. However, specific gaze patterns from which cognitive processing steps could be derived were only found by visually analyzing the eye tracking videos. The analysis of the gaze behavior for each item over all participants revealed individual gaze patterns for each item with respect to the mean proportional time that was spent on certain item features. However, it was impossible to derive general viewing patterns across all items.
In the presented studies, an item consists of various figures (e.g., seven figures in a PFT item, Figure 1; five figures in an MRT item, Figure 2). AOIs were defined either as one per figure (large-scaled) or as fine-scaled AOIs for different features of a single figure. The results showed that fixation patterns based on large-scaled AOIs, or on fine-scaled AOIs for simple items (e.g., of the MRT), are only able to describe simple cognitive processing models. Fixation patterns based on fine-scaled AOIs of complex items (e.g., of the RMPFBT or the PFT), however, would potentially be able to deliver more sophisticated cognitive processing models. But fine-scaled AOIs resulted in noisy data due to visual features of the items that influenced the gaze movements but were not related to the demanded cognitive processing steps. Together, the studies show that there is a trade-off between the complexity of the items, the appropriate size of the AOIs that define the smallest fixation unit (item figures vs. single features of item figures), and the complexity of the derived cognitive processing model. The less complex the items, the smaller the AOIs can be defined while still arriving at meaningful results. Whereas the analysis of single features of the item figures was appropriate for the simple Shepard-Metzler items, this was not the case for the more complex items of the RMPFBT (Ivie & Embretson, 2010). Both combinations, simple items with small AOIs (e.g., Just & Carpenter, 1976; Nazareth, 2015) and more complex items with larger AOIs (e.g., Snow, 1980), seem only to be able to define simple cognitive processing models consisting of, e.g., three phases, or to distinguish between two strategies or performance groups. For more complex items that measure the spatial thinking factor visualization, more complex cognitive processing models would be necessary for an appropriate representation. However, such models can only be established if small AOIs for single features of item figures can be meaningfully interpreted.
The current tests with their complex items, especially the PFT as a visualization test, are limited at this point for different reasons. (1) The comparison of gaze patterns between items is hampered by item sets that are heterogeneous in their visual appearance. Items differing in too many features lead to gaze patterns that are hard to interpret (e.g., Ivie & Embretson, 2010). Although recorded gaze patterns are often interpreted as a result of purely task-driven visual attention (top-down processes), it can be expected that gaze movements reflect an interplay between top-down and bottom-up processes (such as pop-out effects, see Posner, 1980; Theeuwes, 2010). Therefore, the fixation pattern recorded for a single item is influenced by two sources: A) top-down processes, which cover the cognitive processing steps to solve the items that result from the necessary transformation steps. Here and in the following text, "transformation steps" refer to the intended solving steps, e.g., the mental unfolding of the left sheet of a PFT item, which have to be distinguished from the actual cognitive processing steps. The second source of influence are B) bottom-up processes, which address gaze movements driven by visual item features that are independent of these necessary transformation steps and occur only due to irrelevant characteristics of the item's visual appearance. In the following text, (parts of) fixation patterns resulting from top-down processes referring to necessary transformation steps are called type-specific. (Parts of) fixation patterns that are driven by irrelevant visual item features (bottom-up processes) are called item-specific. In heterogeneous item sets, both effects cannot be disentangled and, hence, the resulting fixation patterns are hard to interpret.
(2) In some cases, it is not possible to decide which item feature is cognitively processed (considering the fixation) because of the visual overlap of relevant item features. For example, in the third figure from the right in the MRT item (Figure 2), one edge lies over another edge. Likewise, if a participant is fixating on the upper part of the second figure from the left of the PFT item shown in Figure 1, it is not possible to decide whether the participant is thinking about the folding or the perforation. (3) The presentation of many figures within a single item leads to complex gaze patterns that are hard to interpret; e.g., the PFT shows seven to nine figures per item, and the MRT shows five figures per item. (4) Some tests are presented as paper-pencil versions, which limits the accuracy of the gaze estimations because of the eye tracking hardware that must be used. In paper-pencil versions, gaze patterns have to be recorded by mobile eye trackers, which often yield less accurate gaze estimations than remote eye trackers due to different hardware components, parallax errors, and slippage of the head-worn eye tracker (Holmqvist et al., 2011). Computer-based test versions overcome these issues by allowing the use of remote eye trackers.
Hence, new testing materials are needed that overcome these restrictions in order to gain insights into the cognitive processes involved in solving items of spatial thinking tests for the visualization factor. To this end, the R-Cube-Vis Test was developed and validated (Studies 1 to 3). However, once these materials existed, the next challenge was the identification of fixation patterns of specific item characteristics and the decision whether these patterns indicate cognitive processes that are relevant to solving the demanded task or whether they reflect idiosyncratic characteristics of the item based only on its visual appearance. The aim of Study 4 was to demonstrate which fixation-based indicators might be promising to reveal cognitive processes, how to differentiate between indicators that are related to task demands and those related to task-irrelevant item characteristics, and whether the analyzed indicators are able to indicate the correctness of a single answer.

A new test for visualization
Regarding the four restrictions of standard visualization tests presented in Section 2, four requirements were formulated to overcome these issues: (1) The test must contain many homogenous items of one task type (homogenous group) to differentiate between eye movements that differ between items of a single homogenous group (item-specific characteristics) and eye movements that are indicative of a specific homogenous group (type-specific characteristics). Type-specific eye movements refer to movements that correspond to the cognitive processes needed to solve the item, whereas item-specific eye movements refer to gaze patterns that result from the visual appearance of the item but are independent of the demanded cognitive processes. Therefore, "homogenous" items should be equal with regard to difficulty and constructed in such a way that the specific transformations required to solve items of the same homogenous group are the same for all items. On the other hand, an item should not be solvable by just remembering the answers to previous items of the same homogenous group; to this end, all items should differ in their visual appearance. Further factors that might influence the gaze movements (such as mind wandering, e.g., Uzzaman & Joordens, 2011) and that appear as seemingly random patterns were controlled by averaging over items. (2) On each item figure, all relevant areas should be visually separated. If the area of a specific fixation is detected, it should be unambiguous which specific part of the item is fixated, i.e., there should be no visually overlapping parts.
(3) The items should be built by only a few figures to avoid too complex patterns of movement that are hard to interpret.
(4) The test should be computer-based to guarantee standardized testing and the best possible measuring results of the applied eye tracking hardware.

New stimulus materials
Based on the four formulated requirements, a new test, the R-Cube-Vis Test, was developed in accordance with the definition of visualization as "the ability to manipulate or transform the image of spatial patterns into other arrangements" (Ekstrom et al., 1976b, p. 173, see also Section 2). Cubes are common stimulus materials in spatial thinking tests (e.g., the Cube Comparison test, Ekstrom et al., 1976a, Figure 8). However, the cubes in the existing tests are rotated as a whole, which is not in accordance with the definition of visualization, which demands a manipulation or transformation of the image. The items of the R-Cube-Vis Test fulfill this criterion through movable elements of the cubes, as in a Rubik's cube. The participants are asked to mentally transform single elements of one cube to obtain the other cube. Regarding the four requirements for eye tracking usage, the items were constructed in a way that many items of the same kind could be generated. The items are simply structured, they show all relevant parts without visual overlap, and they can be presented on a computer. Hence, the R-Cube-Vis Test is expected to measure visualization and to be suitable for eye tracking.
In the first step, a preliminary study was conducted to define the item characteristics belonging to a certain difficulty level and to select single items for the final test. All items of the R-Cube-Vis Test use Rubik's cubes as figures (Figure 3). Each item shows two colored cubes, one target cube on the left side and one transformed/rotated cube on the right side. The task is always to decide whether elements of the right cube can be rotated such that the left cube results (possible item, Figure 3) or not (impossible item, Figure 5).
All elements of the right cube in the R-Cube-Vis Test can potentially be rotated in two directions and around three axes, as in the original Rubik's cube. Each cube consists of the same six colors (blue, brown, green, red, white, and yellow). However, fewer colors might be visible, because only three sides are shown. The perspective of each cube shows all three sides at the same size. The final test materials consist of six difficulty levels (Figure 3). The presented cubes differ in size regarding the number of visible pieces: if the cubes show 3 x 3 pieces on each side, the size of the cube is labeled "size 3"; if 4 x 4 pieces are shown on each side, the cube has "size 4". Furthermore, the cubes differ in the number of rotated elements (1 vs. 2) and in how two elements are rotated (parallel, p, vs. crossed, c). The levels are labeled accordingly. The two most difficult levels (Level 4.2.c1 and Level 4.2.c2) differ with respect to the visible cross and the visible cross point (Figure 4).
As can be seen in Figure 5, the impossible items for each level were created to look structurally like the possible items of the same level. They always had the same size and the same number of changed elements. The impossible items were created from specific possible items of the respective level that were not used in the test. These possible items were turned into impossible items by changing the coloring and the position of the rotated elements. However, one cannot trace back the original possible item from an impossible item, because each impossible item can be reached from different possible items. The impossible items of the two most difficult levels are indistinguishable between levels, because the possible cubes of both levels are structurally equivalent, with the same size, the same number of rotated elements, and the same kind of rotation.

Figure 4. Levels 4.2.c1 and 4.2.c2: a) Cross point (marked by a black circle) and cross (marked as a black cross) of the right cube correspond to each other: in order to solve the cube, the element with the marked red square has to be rotated into the red line of the marked cross. b) Cross point (marked by a black circle) and cross (marked as a black cross) of the right cube do not belong together: in order to solve the cube, the element with the marked yellow square has to be rotated on the left side of the back, into the invisible yellow line on the back of the right cube.

This construction allows the generation of many items of the same kind with respect to structure and difficulty, with the difficulty depending on the cubes' size, the number of rotated elements, and how these elements are rotated. The visual appearance can be varied by manipulating the color arrangement, the axes used, the position of the rotated elements for a specific axis, and the direction of the rotation.
The expectation was that the item difficulty is independent of these aspects of visual appearance, while the manipulated features of the visual appearance ensure that items appear different from each other. One could therefore expect that all items that are alike with respect to structure and difficulty can be assigned to one homogenous group and that the variability of the visual appearance within these groups guarantees that no item can be solved by remembering the answer to a previously seen item of the same group. Due to this rational construction of the stimulus materials, validity evidence based on test content is given according to the standards formulated by the American Educational Research Association, American Psychological Association, and National Council on Measurement in Education (2014). Content-oriented evidence "can be obtained from an analysis of the relationship between the content of a test and the construct it is intended to measure" (American Educational Research Association et al., 2014, p. 14). This evidence can be confirmed for the R-Cube-Vis Test regarding the definition of visualization (see Section 2): the "image of spatial patterns" (here, the right cube of an item) has to be "manipulated/transformed" (by rotating single elements, in several consecutive steps) to obtain an "other arrangement" (the left cube).

Test measures
The participants had to provide two answers for each item. First, they had to decide whether both cubes are possibly the same. The resulting accuracy measure was coded with 1 (correct answer) and 0 (incorrect answer). Second, the participants had to estimate their confidence about the given answer on a scale from 1 (very sure) to 4 (very unsure).
Based on these two answers per item, four different accuracy measures were evaluated for the R-Cube-Vis Test. An optimal measure does not exist; the theoretical drawbacks of the four measures are discussed below. All measures were analyzed with respect to reliability and validity in Studies 1-3, since they differ in the information they use.
The first measure is ACC-all. It is calculated as the sum of the item accuracies over all items divided by the number of all items. The second measure is ACC-poss. It sums the item accuracies of all possible items and divides this sum by the number of possible items. Both measures can take values ranging from 0 (all considered items answered incorrectly) to 1 (all considered items answered correctly). Whereas ACC-poss focuses exclusively on items that were presumably solved by using the intended transformations, i.e., the rotation of the cube's elements (the possible items), ACC-all uses twice as many items, can control for answering bias, and is therefore more robust than ACC-poss. For example, if a participant always ticks "is possible", she would receive an ACC-poss score of 1, which would indicate a high ability, but an ACC-all score of .5, which would correctly indicate a low ability.
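Under the 0/1 coding described above, the two measures and the answering-bias example can be sketched as follows; the item records are hypothetical.

```python
# Minimal sketch of ACC-all and ACC-poss as described in the text.
# Each item record is a hypothetical pair (is_possible, answered_correctly).

def acc_all(items):
    """Mean accuracy over all items (possible and impossible)."""
    return sum(correct for _, correct in items) / len(items)

def acc_poss(items):
    """Mean accuracy over the possible items only."""
    poss = [correct for is_possible, correct in items if is_possible]
    return sum(poss) / len(poss)

# A participant who always ticks "is possible" is correct on every
# possible item and wrong on every impossible item:
items = [(True, 1)] * 5 + [(False, 0)] * 5
print(acc_poss(items))  # 1.0 -> would suggest high ability
print(acc_all(items))   # 0.5 -> chance level, revealing the answering bias
```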
The main drawback of both measures is the expected statistical error given the 50% chance level. To reduce this error, a weighted accuracy (wACC) was introduced in the present studies. This accuracy measure uses the accuracy of each possible item and the corresponding confidence rating. The wACC is the product of the item accuracy (0, incorrect answer, or 1, correct answer) and the reversed confidence rating value (1 to 4). Therefore, the wACC value of each item ranges from 0 to 4. For example, if a participant answers an item correctly and gives a confidence rating of 3, then wACC = 1 * 3 = 3. If a participant answers an item incorrectly and gives a confidence rating of 2, then wACC = 0 * 2 = 0. In general, wACC = 0 means that the item was not solved correctly, independent of the participant's confidence. Values from 1 (strongly unsure) to 4 (strongly sure) thus indicate a correctly solved item with varying confidence. Like ACC-poss, wACC is based on possible items only. The evaluation of the answer by the additional confidence rating is thought to increase the "true" variance. If a participant guesses an answer correctly with low confidence, this participant will nevertheless get the highest accuracy value (ACC-all and ACC-poss) for this item (i.e., 1) but only a medium wACC value (i.e., 1 or 2). Although there is evidence that self-ratings of cognitive abilities can be valid (see Bratko, Butkovic, Vukasovic, Chamorro-Premuzic, & Stumm, 2012; Jacobs & Roodenburg, 2014), the confidence rating of the wACC measure might be influenced by personal characteristics such as self-esteem, which can lead to systematic errors in wACC.
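The per-item wACC computation can be sketched as follows; the helper that reverses a raw rating is an assumption based on the scale described above (raw 1 = very sure, raw 4 = very unsure):

```python
def wacc_item(correct, reversed_confidence):
    """wACC for one possible item: item accuracy (0/1) multiplied by the
    reversed confidence rating, 1 (strongly unsure) .. 4 (strongly sure)."""
    return correct * reversed_confidence

def reverse_rating(raw_rating):
    # Assumption: the raw scale runs 1 (very sure) .. 4 (very unsure),
    # so reversing maps 1 -> 4 and 4 -> 1.
    return 5 - raw_rating

# The worked examples from the text:
print(wacc_item(1, 3))  # 3: correct answer, reversed confidence 3
print(wacc_item(0, 2))  # 0: incorrect answers always score 0
```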
Therefore, a fourth measure was calculated that combines the answers to all items (possible and impossible, correctly and incorrectly answered) and their confidence ratings according to signal detection theory (Macmillan & Creelman, 2005). The receiver operating characteristic (ROC) curve can be computed based on the participant's ratings, ranging from 0 (strongly sure that the item is impossible) to 3 (strongly unsure, impossible) and from 4 (strongly unsure, possible) to 7 (strongly sure, possible). The ROC relates the true positives to the false positives across these eight categories. For each participant, the area under the curve (AUC) was computed based on the ROC. In comparison to the previous measures, the AUC uses confidence ratings and controls for a personal decision threshold but might be biased by including impossible items and by weighting wrong answers.
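A minimal sketch of the rating-based AUC computation might look as follows; the function name and the trapezoidal integration are illustrative choices, not necessarily the exact procedure used in the study:

```python
def auc_from_ratings(ratings_possible, ratings_impossible):
    """AUC from 8-category ratings (0 = sure-impossible .. 7 = sure-possible).

    Sweeps a decision threshold over the categories; at each threshold the
    hit rate (possible items rated >= t) is paired with the false-alarm
    rate (impossible items rated >= t), and the area under the resulting
    ROC points is integrated with the trapezoidal rule.
    """
    n_pos, n_neg = len(ratings_possible), len(ratings_impossible)
    points = []
    for t in range(8, -1, -1):  # from strictest to most lenient threshold
        tpr = sum(r >= t for r in ratings_possible) / n_pos
        fpr = sum(r >= t for r in ratings_impossible) / n_neg
        points.append((fpr, tpr))
    auc = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        auc += (x1 - x0) * (y0 + y1) / 2
    return auc

# Perfect separation of possible and impossible items yields AUC = 1.0;
# indistinguishable rating distributions yield the chance level of 0.5.
```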
In summary, no measure fulfills all of the following aspects in an optimal way: low chance level, no answering bias, no personal bias, exclusion of impossible items, and no weighting of wrong answers. Each measure has drawbacks regarding at least one of these aspects. However, the measures ACC-all and ACC-poss are standard measures of current performance tests and, therefore, are appropriate performance measures. The innovative measures that consider confidence ratings in addition to accuracy, wACC and AUC, are alternatives that potentially compensate for drawbacks of the standard measures but have their own drawbacks (see above). Given this ambiguity, all measures will be reported in the validation studies to determine the most suitable one for the test. All four measures will be analyzed and compared with respect to reliability and validity aspects.

Study 1: Development and validation of a long version of the R-Cube-Vis Test
Study 1 had two goals. First, the results of the preliminary study should be confirmed with respect to the order of the difficulty levels and the correlations with external variables. For all six defined R-Cube-Vis levels (Figure 3), additional items were created. The expectation was that the order of the levels regarding their difficulty is similar to the results found in the preliminary study: Level 4.1, 3.1, 4.2.p, 3.2.p, 4.2.c1, 4.2.c2. Size 4 should be easier than size 3, one rotated element should be easier than two, and parallel elements should be easier than crossed ones. Items of size 4 may be easier than items of size 3 because, after rotating the elements, more of the cubes' surface stays the same for size 4 than for size 3. For example, the right cube of an item of Level 4.1 (size 4, Figure 3) shows, on the two sides with the rotated element, eight changed pieces and 24 pieces that stayed the same; that is, 8 / (24 + 8) = 1/4 of the pieces changed. In contrast, the right cube of an item of Level 3.1 (size 3, Figure 3) shows, on the two sides with the rotated element, six changed pieces and twelve pieces that stayed the same; therefore, relatively more pieces changed, 6 / (12 + 6) = 1/3. A reason why items with parallel rotated elements are easier to solve than items with crossed rotated elements might be that the order of the back rotation needed to solve an item with two parallel rotated elements (Level 4.2.p, Figure 3) is arbitrary, whereas the order of the back rotation of two crossed rotated elements is fixed (Level 4.2.c1, 4.2.c2, Figure 3). Furthermore, the test was analyzed with respect to validity evidence based on relations to other variables (American Educational Research Association et al., 2014).
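The two worked fractions generalize: rotating one element of a size-n cube changes 2n of the 2n² pieces on the two affected sides, i.e., a fraction of 1/n. A minimal sketch of this arithmetic:

```python
def changed_fraction(n):
    """Fraction of pieces that change on the two affected sides when one
    element of an n x n x n cube is rotated: 2n changed pieces out of
    2 * n**2 visible pieces, which simplifies to 1/n."""
    changed = 2 * n
    total = 2 * n ** 2
    return changed / total

# Size 4: 8 of 32 pieces change (1/4); size 3: 6 of 18 (1/3) -- so size-4
# items leave relatively more of the visible surface unchanged.
print(changed_fraction(4))  # 0.25
```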
This validity aspect refers to the relation to "external variables [that] may include measures of some criteria that the test is expected to predict, as well as relationships to other tests hypothesized to measure the same constructs, and tests measuring related or different constructs." (p. 16) In particular, evidence for convergent validity was examined regarding the standard visualization test, the Paper Folding Test (PFT), as well as evidence for discriminant validity with respect to the established spatial relations tests. The correlations with the Mental Rotation Test (MRT-A) should lie in between due to its position in the middle of the continuum between visualization and spatial relations (Section 2). The school grades in German and Mathematics were also considered for comparing their correlations with the validation tests and R-Cube tests.
Second, validity evidence based on internal structure (American Educational Research Association et al., 2014) was examined for the R-Cube-Vis Test. This covers the "analyses of the internal structure of a test [that] can indicate the degree to which the relationships among test items and test components conform to the construct on which the proposed test score interpretations are based." (p. 16). The expectation was that all items of the R-Cube-Vis Test lie on one dimension and that all items of each of the six levels have a unique level of difficulty. Hence, the R-Cube-Vis Test was tested for conformity with the linear logistic test model (LLTM, Fischer, 1973).
In Studies 1 and 2, a second newly developed spatial thinking test, the R-Cube-SR Test, was also administered for validation purposes. This test is intended to measure the second spatial thinking factor, spatial relations (hence: R-Cube-SR), and uses the same cubes in a changed presentation. It will be published in a separate article and is not analyzed in the present paper.

Method Participants
The minimal sample size needed to detect the smallest expected correlation between the two spatial relations tests, r = .34 (see Section 2), with a directed hypothesis, an α-level of .05, and a test power of .80 (recommended by Cohen, 1988) is 49 (G*Power, Version 3.1.9.2; Faul, Erdfelder, Lang, & Buchner, 2007). Study 1 was therefore conducted with 53 non-colorblind participants (40 female, 13 male) from the University of Mannheim, who received course credit for their participation. The participants' mean age was M = 21.15 years (SD = 2.15 years), ranging from 17 to 26 years. The final sample consisted of N = 49 participants (39 female, 10 male): three participants were excluded according to the exclusion criteria defined on the basis of the preliminary study's data, and one participant was excluded because he misunderstood the PFT instruction. Furthermore, speeded tests, such as a chronometric test (here termed "Chronometric Test" and abbreviated "CT"; Jansen-Osmann & Heil, 2007, see below), assume that participants can easily solve the tasks and mainly differ in their reaction times rather than in their accuracy. Therefore, participants with an error rate higher than 30% on all possible items were additionally excluded from correlations with the CT in order to guarantee that the reaction times reflect the speed of the cognitive processes needed to solve these items and not just guessing behavior. The same threshold was used by Jansen, Schmelter, Quaiser-Pohl, Neuburger, and Heil (2013) for another speeded mental rotation test. Hence, the analyzed correlations with the CT are based on only N = 45 participants. This study, as well as the following studies, was administered in compliance with the guidelines of the ethics committee.

Materials
The validation tests were the PFT and MRT as well as two spatial relations tests: the Würfelaufgaben (form A) from the Intelligenz-Struktur-Test, IST-2000R, abbreviated as IST-WA (Amthauer, Brocke, Liepmann, & Beauducel, 1999), and the CT. In the IST-WA, participants have to decide whether cubes are the same except for the rotation of one cube as a whole (no single elements as in the R-Cube-Vis Test). Each item of the CT shows a pair of 2D objects ("primary mental ability figures", Thurstone, 1958) that participants have to rate as either the same or mirrored except for rotation.
PFT, MRT, and IST-WA were presented as paper-pencil tests, and the CT was computer-based. The PFT was conducted in its long version, with 20 items presented on two sheets of paper, ten items on each sheet. For each page, participants had three minutes to solve the items. The sum score was computed based on all correctly answered items (correct answer: 1 point; incorrect answer: 0 points). The 24 MRT items were also presented on two pages, with twelve items on each page. Participants had four minutes to solve the items on each page. For each item, participants received 1 point only if both correct figures were detected; otherwise they received 0 points. In each of the 60 items of the CT, participants had to decide whether the two objects were mirrored or the same except for rotation. There were 16 items, eight same and eight mirrored, for each of the disparity angles 45°, 90°, and 180°. Twelve additional items had a disparity angle of 0° and were all mirrored. Each correctly answered item resulted in 1 point, otherwise 0 points. The CT is constructed as a speed test; therefore, in all analyses, the logarithmized reaction times of the correctly solved items were used. These reaction times were computed based on the subset of the 24 possible items (all twelve items with 0° disparity angle were impossible) that were answered correctly. The IST-WA was administered with 20 items. Participants received 1 point for each item answered correctly and 0 points for each item answered incorrectly.
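The CT speed score described above (logarithmized reaction times of correctly answered possible items) can be sketched as follows; the trial layout is a hypothetical format, not the actual E-Prime output:

```python
import math

def ct_speed_score(trials):
    """Mean log reaction time over correctly answered possible items,
    as used for the CT speed score; incorrect and impossible trials
    are excluded before averaging."""
    log_rts = [math.log(t["rt"]) for t in trials
               if t["possible"] and t["correct"]]
    return sum(log_rts) / len(log_rts)

# Incorrect and impossible trials do not enter the score:
trials = [
    {"possible": True,  "correct": 1, "rt": math.e},       # log ~ 1
    {"possible": True,  "correct": 1, "rt": math.e ** 3},  # log ~ 3
    {"possible": True,  "correct": 0, "rt": 10.0},         # excluded: wrong
    {"possible": False, "correct": 1, "rt": 10.0},         # excluded: impossible
]
score = ct_speed_score(trials)  # roughly 2.0 = mean of log(e) and log(e**3)
```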
The R-Cube-Vis Test was conducted in a long version with 24 possible and 24 impossible items for each level. The test was computer-based, whereby the presentation of the instruction and the repetition of the trial phase were self-administered (i.e., participants decided about their reading times and whether they wanted to repeat practice trials), and the presentation of the items and the recording of the answers were system-administered (i.e., controlled by the computer). Within each level, items were varied systematically regarding (1) the three axes about which the elements were rotated, (2) the rotated element (front, middle, end), and (3) the turning direction. The R-Cube-Vis Test was conducted in five blocks, one block for each level except for the two most difficult levels, which shared one block. The block order was fixed, beginning with the cubes of size 3 followed by the cubes of size 4 (3.1; 3.2.p; 4.1; 4.2.p; 4.2.c1 and 4.2.c2 together). All items within each block were presented in random order. The trial phase before each block consisted of five items, two possible and three impossible, and participants had the opportunity to repeat this phase. Please note that at this stage of the development of the R-Cube-Vis Test, the two most difficult levels (Levels 4.2.c1 and 4.2.c2) had not yet been clearly identified as separate levels, because the results of the preliminary study delivered only weak evidence. Therefore, items of both levels were handled as belonging to the same group and were presented in the same block; however, they were balanced within the block with twelve possible and twelve impossible items per level. Hence, the R-Cube-Vis Test was conducted with five blocks and 240 items overall.
Items of the R-Cube-SR Test show two Rubik's Cubes, whereby the participants have to decide whether it is possible to rotate the left cube as a whole to get the right cube. The items were provided in two blocks. The cubes of the first block have single-colored sides, whereas the items of the second block show patterns on their sides. Each block contained twelve possible and twelve impossible items. The R-Cube-SR Test was also computer-based like the R-Cube-Vis Test with self-administered (i.e., participants decided about their reading times and whether they wanted to repeat practice trials) as well as system-administered (i.e., controlled by the computer) parts.
Participants also had to fill out a questionnaire asking for age, gender, experience with Rubik's Cube and the Abitur grades in German and Mathematics (Abitur is the diploma from German secondary school qualifying for university admission. The grades range theoretically from 0.7, best grade, to 6.0, worst grade.).

Procedure
There were two sessions, A and B, with durations of 90 and 60 minutes, respectively. In session A, participants completed the PFT followed by the R-Cube-Vis Test. Session B took place one week later. The participants performed the MRT, IST-WA, R-Cube-SR, CT, and the questionnaire in the indicated order. The participants were tested in groups of up to six people. Here, and in all following studies, all computer-based tests were presented with the presentation software E-Prime 2.0 (Psychology Software Tools Inc., 2012).

Analyses
All analyses in this and the following studies were conducted in R (R Core Team, 2017), using the packages plyr (Wickham, 2011), dplyr (Wickham, François, Henry, & Müller, 2018), reshape2 (Wickham, 2007), and zoo (Zeileis & Grothendieck, 2005) to prepare the data and compute descriptive statistics. Pearson correlations were computed using the package psych (Revelle, 2017). Correlation comparisons were tested with Fisher's z using the package cocor (Diedenhofen & Musch, 2015). The significance levels of all pairwise correlations of the validation tests, as well as of all correlations between each validation test and the grades in German and Mathematics, were adjusted according to Holm (1979). The same adjustment was applied when testing evidence of convergent and discriminant validity with the validation tests and when testing evidence of validity based on the external criteria, the grades in German and Mathematics. All adjustments were made separately for each tested measure of the R-Cube-Vis Test. Reliability estimates were computed as standardized Cronbach's α (Cronbach, 1951) using the R-packages psy (Falissard, 2012) and psych (Revelle, 2017). In all analyses, reaction times were logarithmized and considered only for correctly answered possible items. In order to examine evidence of validity based on the internal structure, the R-Cube-Vis stimulus materials were tested for conformity with the linear logistic test model (LLTM) to provide evidence for one-dimensionality and homogeneity of the six levels of difficulty, Levels 3.1, 3.2.p, 4.1, 4.2.p, 4.2.c1, and 4.2.c2. The focus was on the two measures that are defined for all six levels: accuracy over all possible items (ACC-poss) and weighted accuracy (wACC). Note that the impossible items in Level 4.2.c cannot be assigned to one of the two subgroups; therefore, accuracy over all items (ACC-all) and the Area Under the Curve (AUC) were not considered here.
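The Holm (1979) step-down adjustment mentioned above can be sketched as follows; this is a generic p-value-adjustment implementation for illustration, not the internals of the cocor or psych packages:

```python
def holm_adjust(p_values):
    """Holm (1979) step-down adjustment of a family of p-values.

    Sort the p-values ascending, multiply the i-th smallest (0-based rank)
    by (m - i), enforce monotonicity of the adjusted values, cap at 1, and
    return the adjusted p-values in the original order.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        adj = min(1.0, (m - rank) * p_values[idx])
        running_max = max(running_max, adj)  # keep adjusted p-values monotone
        adjusted[idx] = running_max
    return adjusted
```

A comparison against a significance level α on the adjusted values is then equivalent to Holm's sequential rejection procedure on the raw p-values.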
The design matrix of the LLTM consisted of four parameters indicating (1) size 3 or size 4, (2) one or two rotated elements, (3) parallel or crossed rotation, and (4) whether the cross and the corresponding cross point are both visible. Hence, six item categories resulted, one for each level of difficulty. Model conformity was tested according to the procedure described in Baghaei and Kubinger (2015). As a prerequisite, the standard Rasch model (RM) was fitted (Fischer, 1973) and tested with Andersen's likelihood ratio test (Andersen, 1973) using a mean split of the raw scores. Additionally, the item characteristics were analyzed by their infit (weighted mean square). All analyses were conducted using the R-package eRm (Mair, Hatzinger, & Maier, 2015). In the last step, both models, LLTM and RM, were compared using a chi-square test, with the difference of their deviances (-2 log-likelihoods) as the empirical value and the difference between their numbers of model parameters as degrees of freedom (Fischer, 1973). Furthermore, the information criteria AIC (Akaike Information Criterion; Sakamoto, Ishiguro, & Kitagawa, 1986) and BIC (Bayesian Information Criterion; Schwarz, 1978) were compared between the RMs and the corresponding LLTMs. These information criteria represent a compromise between model fit and model parsimony. The differences between two AIC values (two BIC values), ΔAIC (ΔBIC), were interpreted as negligible for ΔAIC ≤ 2, weak for 2 < ΔAIC < 4 (ΔBIC ≤ 2), moderate for 4 ≤ ΔAIC ≤ 7 (2 < ΔBIC ≤ 6), strong for 7 < ΔAIC ≤ 10 (6 < ΔBIC ≤ 10), and very strong for ΔAIC > 10 (ΔBIC > 10), following Burnham and Anderson (2016) for AIC and Raftery (1995) for BIC. For both information criteria, lower values indicate the more appropriate model.
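The ΔAIC/ΔBIC interpretation bands above can be encoded directly; the following sketch simply transcribes the stated thresholds:

```python
def delta_aic_strength(delta):
    """Interpret an AIC difference following Burnham & Anderson (2016),
    using the thresholds given in the text."""
    if delta <= 2:
        return "negligible"
    if delta < 4:
        return "weak"
    if delta <= 7:
        return "moderate"
    if delta <= 10:
        return "strong"
    return "very strong"

def delta_bic_strength(delta):
    """Interpret a BIC difference following Raftery (1995),
    using the thresholds given in the text."""
    if delta <= 2:
        return "weak"
    if delta <= 6:
        return "moderate"
    if delta <= 10:
        return "strong"
    return "very strong"

print(delta_aic_strength(120))  # very strong
```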
For both model types, RM and LLTM, the person separation reliability (PSR) was computed, which, according to Wright and Stone (1999), can be interpreted similarly to the KR20 reliability index (Kuder & Richardson, 1937).
Both models in their standard form assume dichotomous items with expected values ranging from 0 to 1. Therefore, the RM and the corresponding LLTM could not be directly applied to ACC-poss, because of the 50% guessing probability, or to wACC, because of its five answer categories. In the case of ACC-poss, in addition to a simple RM, a 3-PL model (Birnbaum, 1968) with a guessing parameter fixed at .5 and a discrimination parameter fixed at 1 was fitted to the data. The 3-PL model was compared with the RM using the ltm-package (Rizopoulos, 2006) with respect to the information criteria AIC and BIC; the LLTM corresponding to the more appropriate model was then tested. In the case of wACC, the values were dichotomized with splits between the values 1 and 2, 2 and 3, and 3 and 4. These splits resulted in three new dichotomous variables with the value 0 for all wACC values below the splitting threshold and 1 for all wACC values above it; i.e., for a split between 2 and 3, the wACC values 0, 1, and 2 result in the new value 0, whereas the values 3 and 4 result in the new value 1. For each of these splits, RMs and LLTMs were conducted and tested as described above. Note that a split between 0 and 1 would result in a new variable identical to ACC-poss.
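The dichotomization of the wACC scores can be sketched as follows; the split parameter denotes the lower bound of the upper group, so a split "between 2 and 3" corresponds to split=3:

```python
def dichotomize_wacc(values, split):
    """Dichotomize polytomous wACC item scores (0..4) at `split`: values
    below the split become 0, values at or above it become 1. For
    example, split=3 separates {0, 1, 2} from {3, 4}."""
    return [1 if v >= split else 0 for v in values]

print(dichotomize_wacc([0, 1, 2, 3, 4], 3))  # [0, 0, 0, 1, 1]
# A split at 1 reproduces the ACC-poss coding (0 only for wACC = 0):
print(dichotomize_wacc([0, 1, 2, 3, 4], 1))  # [0, 1, 1, 1, 1]
```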
Two issues arose due to the small sample size. If all participants have the same value on some items, the estimated RMs and LLTMs, as well as the models estimated for Andersen's test, might be based on only a subsample of items; in this case, all items with only one value in the respective sample had to be excluded. Furthermore, the statistical power is low for both likelihood ratio tests. However, robust parameter estimates can still be expected (see Hohensinn, Kubinger, & Reif, 2014; MacDonald & Kromrey, 2011). In addition to the statistical tests, the results were visually inspected using the R-package ggplot2 (Wickham, 2009).

Results
The descriptive results (Table 1) show the non-logarithmized reaction times of the CT. However, because the preliminary study showed a non-normal distribution of the reaction times, the logarithmized reaction times of the CT were used for all further analyses. Although some measures deviated significantly from the normal distribution according to the Shapiro-Wilk test of normality (Royston, 1995), Pearson correlations, which are robust against such deviations (e.g., Havlicek & Peterson, 1976), were computed in the subsequent analyses.
The inter-correlations between the validation tests of both spatial thinking factors, PFT, MRT, and IST-WA, were similar, .49 ≤ r ≤ .55, p < .01 (Table 2). Only the correlations with the CT showed the expected trend, with r ≥ -.37, p < .05 for PFT and MRT, and a higher correlation with the second spatial relations test, IST-WA, r = -.56, p < .001. However, neither the correlation between PFT and MRT nor the correlation between both spatial relations tests differed significantly from the inter-correlations between both spatial thinking factors, z ≤ 1.29, p ≥ .20. The grade in Mathematics showed descriptively higher but non-significant correlations with the validation tests (.26 ≤ |r| ≤ .30, p ≥ .08) than the grade in German (.06 ≤ |r| ≤ .16, p ≥ .28). The highest correlations with experience with Rubik's Cubes were found for the PFT, r = .32, p < .05, and the IST-WA, r = .25, p = .08. MRT and CT were less correlated, |r| ≤ .11, p ≥ .43.

R-Cube-Vis Test for visualization: Descriptive results
The five R-Cube-Vis blocks showed the expected order of increasing difficulty over the defined levels for all four measures (accuracy based on all items, ACC-all; accuracy based on all possible items, ACC-poss; weighted accuracy, wACC; Area Under the Curve, AUC; Table 3), although Levels 4.1 and 3.1 had nearly the same values (Table 3). ACC-all and AUC were not computed separately for the two subgroup levels, 4.2.c1 and 4.2.c2, because the impossible items cannot be unambiguously assigned to one of these subgroups.

Reliability and evidence for convergent and discriminant validity
The reliability estimates for all measures of the R-Cube-Vis Test were very good: α = .94 (ACC-all), α = .90 (ACC-poss), and α = .97 (wACC). All R-Cube-Vis measures were highly correlated with the PFT, .51 ≤ r ≤ .59, p < .001. The correlations with the other validation tests (MRT, IST-WA, CT) decreased descriptively in that order for all measures (Table 4). However, the differences between the correlations were significant only between PFT and CT and only for ACC-poss and AUC, z ≥ 2.36, p < .05. The correlations with the grade in German were negligible, |r| ≤ .06, p ≥ .71. The strongest correlation with the grade in Mathematics was r = -.20, p ≥ .17, for ACC-all, ACC-poss, and AUC; wACC was less correlated with the grade in Mathematics, r = -.12, p = .40. Correlations with experience with Rubik's Cubes were not significant and ranged from r = .06, p = .66 (ACC-all) to r = .24, p = .09 (wACC).

Testing conformity with the linear logistic test model (LLTM)
In a first step, the items were analyzed to test the Rasch model (RM) against the 3-PL model for ACC-poss, and to test the dichotomization approach for the polytomous wACC (see Section 4.1.4).
The first analysis showed strong evidence in favor of the RM compared with the 3-PL model for ACC-poss according to AIC and BIC (RM: AIC = 2714.33, BIC = 2895.98; 3-PL: AIC = 2725.01, BIC = 2906.65). Therefore, the RM and its corresponding LLTM were used for ACC-poss and for the dichotomized wACC described above (Section 4.1.4) in the following analyses.
Results of Andersen's likelihood ratio test of the RMs showed good fits for ACC-poss and all splits of wACC, p ≥ .98 (Table 5), which is supported by the plots pitting the estimated β's of the low- and high-ability groups against each other (Figure 6). All reliability estimates were good, PSR ≥ .84, and at least 99% of all infits were in the acceptable range from 0.5 to 1.5 according to Linacre (2002; Table 5). The corresponding LLTMs were not significantly different from the RMs, p ≥ .57 (Table 6), and showed similar β estimates (Figure 7). The reliability estimates were good, PSR ≥ .84 (Table 6), and equal to the estimates of the corresponding RMs (Table 5). For all splits of wACC as well as for ACC-poss, AIC and BIC showed very strong evidence in favor of the LLTMs compared with the RMs, with ΔAIC > 100 and ΔBIC > 100.

Discussion
The first goal of Study 1 was the validation of the long version of the R-Cube-Vis Test with respect to convergent and discriminant evidence of validity. Second, the internal structure of the R-Cube-Vis Test was analyzed, and it was tested whether the six levels of difficulty define homogeneous item groups. All correlations between the validation tests were in the ranges expected from the literature (e.g., Blajenkova et al., 2006; Kozhevnikov et al., 2002; Kozhevnikov & Hegarty, 2001; Miyake et al., 2001; see also Section 2). However, the correlation between PFT and MRT was descriptively smaller than the correlations of both tests with the IST-WA. The correlations with the CT descriptively showed the expected pattern, with a stronger correlation with the other spatial relations test, the IST-WA, and weaker correlations with the PFT and MRT. The correlations with the grades were negligible for the grade in German but reached a medium effect size for the grade in Mathematics (r ≈ .30). Both results are consistent with the current literature (e.g., Geary et al., 2000; Gunderson et al., 2012; Kaufman, 2007; LeFevre et al., 2010; see also Section 2).
The descriptive results of the R-Cube-Vis Test confirmed the order of the six levels with respect to their difficulty for all four measures. The reliability of the complete test was very good concerning the accuracy over all items (ACC-all), the accuracy over all possible items (ACC-poss), and the weighted accuracy (wACC).
Correlations between the R-Cube-Vis Test and all four validation tests provided convergent and discriminant evidence of validity, with strong correlations with the PFT, weaker correlations with the MRT, and the weakest correlations with both spatial relations tests, IST-WA and CT. The small difference between the correlations with the IST-WA and the CT might be based on the different measures: the IST-WA uses an accuracy-based measure, as do all measures of the R-Cube-Vis Test, whereas the CT uses reaction times. Similar correlation patterns with the validation tests were also found for single levels, which supports the conclusion that the test materials are homogeneous. The smaller correlations of Levels 4.1 and 4.2.c2 can be explained by ceiling and floor effects, respectively. All four measures are suitable for the R-Cube-Vis Test as a visualization test. Evidence of validity based on the external criteria, the grades in German and Mathematics, was weak but comparable to the correlations between these grades and the validation tests.
Furthermore, evidence of validity based on the internal structure was shown by the conformity of the R-Cube-Vis Test with the linear logistic test model (LLTM). Although the small sample size limits the interpretation of both likelihood ratio tests due to the reduced power, one can reasonably expect the parameters to have been estimated accurately (see Hohensinn et al., 2014; MacDonald & Kromrey, 2011). Therefore, the main focus is on the further criteria for testing model conformity. The infit ranges, the person separation reliability (PSR), and the visual inspection of the beta-value-correspondence plots support the conformity with the RM. The nearly equal PSR values of the corresponding LLTMs, as well as the lower information criteria of the LLTMs for ACC-poss and all splits of wACC, show that the LLTMs are more appropriate models for the R-Cube-Vis Test than the RMs.
The results support the R-Cube-Vis Test as a one-dimensional visualization test with six discrete difficulty levels for all four tested measures.

Study 2: Development and validation of short versions of the R-Cube-Vis Test
The complete R-Cube-Vis Test from Study 1 took up to 50 minutes for all 240 items. The goal of Study 2 was the development and validation of a short version of the test. To this end, 10,000 item samples, each consisting of five possible items from each of the six difficulty levels (30 possible items in total), were randomly drawn from the complete test materials created for Study 1. For each item sample, and based on the data of Study 1, the weighted accuracy (wACC) was computed and correlated with the Paper Folding Test (PFT). The two item samples with the strongest (r = .66, p < .001) and weakest (r = .40, p < .01) correlation were chosen as the short versions of the R-Cube-Vis Test for the present study. The expectation was that both samples would show similar results regarding reliability and evidence of validity because all items from a single level should be equivalent; the strong and weak correlations with the PFT should result only from statistical error and not from systematic differences between the single items. A partial confirmation of this assumption is suggested by the fact that five of the 30 items were identical in both item samples. Both item samples were analyzed with respect to reliability and to convergent, discriminant, criterion, and construct evidence of validity.
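The sampling-and-selection procedure can be sketched as follows; the data layout (dicts keyed by level and item id) and the function names are hypothetical, and the Pearson correlation is implemented inline for self-containment:

```python
import random

def pearson_r(x, y):
    # Plain Pearson correlation coefficient for two equal-length sequences.
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5

def select_extreme_samples(items_by_level, wacc_by_item, pft_scores,
                           n_samples=10_000, k=5, seed=0):
    """Draw n_samples random item sets (k possible items per level), score
    each participant by the wACC sum over the sampled items, and keep the
    samples with the strongest and weakest correlation with the PFT.
    `wacc_by_item[item]` holds per-participant wACC scores aligned with
    `pft_scores` (hypothetical layout)."""
    rng = random.Random(seed)
    best = worst = None
    for _ in range(n_samples):
        sample = [it for lv in items_by_level.values()
                  for it in rng.sample(lv, k)]
        sums = [sum(wacc_by_item[it][p] for it in sample)
                for p in range(len(pft_scores))]
        r = pearson_r(sums, pft_scores)
        if best is None or r > best[0]:
            best = (r, sample)
        if worst is None or r < worst[0]:
            worst = (r, sample)
    return best, worst
```

With the full Study 1 data, the two returned samples would correspond to the strongly and weakly PFT-correlated short versions used in conditions A and B.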

Study design
The between-subject design of the present study consisted of two conditions, A and B. In condition A, the participants performed the short version with the item set of the R-Cube-Vis Test that was strongly correlated with the PFT, whereas in condition B, the participants performed the short version with the item set that was weakly correlated with the PFT. All other conducted tests were identical in both conditions.

Participants
The correlations found in Study 1 were slightly weaker than expected from the literature. The sample size was, therefore, increased in Studies 2 and 3. The sample consisted of N = 119 participants, 99 female and 20 male, who were 18 to 39 years old (M = 22.29 years, SD = 3.08 years). They were all students at the University of Mannheim, none were color-blind, and they could choose between course credit and monetary compensation for their participation. N = 59 participants (49 female, 10 male) took part in condition A, and N = 60 participants (50 female, 10 male) in condition B. The same exclusion criteria as in the preliminary study and Study 1 were applied in Study 2. The final subsamples used to test aspects of validity evidence of the R-Cube-Vis Test consisted of N = 55 (46 female, 9 male) participants in condition A and N = 53 (44 female, 9 male) participants in condition B.

Materials
The tests PFT, Mental Rotation Test (MRT), and CT, as well as the questionnaire, were the same as in Study 1. All tests were conducted and assessed as described before (Section 4.1.2). Instead of the IST-WA, the Cube Comparison test (CC; Ekstrom et al., 1976a, Figure 8) was administered as the second spatial relations test. The CC also uses cubes as stimulus materials, with letters and symbols on their sides; all sides of one cube are pairwise different. Each item shows two cubes, and participants have to decide whether both cubes are the same except for rotation ("same") or not ("different"). In accordance with Ekstrom et al. (1976b), participants had 3 minutes for each of the two test parts of the CC. Each part contained 21 items. One item could be interpreted ambiguously as "same" or "different", because one cube side shows an "X" and it is very hard to see whether the "X" on the right cube is lying on its side or standing upright; it was therefore excluded from the following analyses. The total number of items was thus 41. Participants received 1 point for each correctly answered item of the CC and 0 points otherwise.
There were two short versions of the R-Cube-Vis Test, presented in condition A and B, as described above. Each short version consisted of six levels, and each level contained five possible and five impossible items. Each of the R-Cube-Vis short versions consisted of six blocks, one for each of the six levels. The blocks were presented in order of increasing difficulty in accordance with the results of Study 1 (4.1; 3.1; 4.2.p; 3.2.p; 4.2.c1; 4.2.c2). At the beginning of the test, there was a trial phase containing six items, one from each level (three possible and three impossible). Participants were given the opportunity to repeat the trial phase. As in Study 1, the test was computer-based and self- as well as system-administered.

Procedure
In both conditions, participants completed the three paper-pencil tests (PFT, MRT, and CC), followed by the CT. Afterwards, they performed one of the two R-Cube-Vis short versions in either condition A or condition B. Finally, the participants performed the R-Cube-SR and answered the descriptive data questionnaire. The entire session took 90 minutes to complete.

Results
The accuracy of all tests was slightly lower than in Study 1 (Table 7). The results of all measures of the R-Cube-Vis Test were comparable in both conditions. As in Study 1, for analyses with the CT, participants were excluded if they had an error rate higher than 30% on all possible items. All validation tests were strongly and similarly correlated with each other (.50 ≤ |r| ≤ .57, p < .001), with no significant pairwise differences, z ≥ .45, p ≥ .65 (Table 8; the different sample sizes in the table result from the participants excluded for each test, and the nonlogarithmized reaction times are presented). All correlations between the validation tests and the grade in German were negligible and not significant, |r| ≤ .08, p ≥ .40. The grade in Mathematics showed stronger correlations, some of them significant: r = -.39, p < .001 (PFT), r = -.17, p = .18 (MRT), r = -.23, p < .05 (CC), r = .31, p < .01 (CT). Experience with Rubik's cubes was not significantly correlated with any validation test, |r| ≤ .09, p ≥ .36.

R-Cube-Vis Test for visualization: Descriptive results
Descriptive results for the R-Cube-Vis stimulus materials were similar for both conditions (Table 9). Compared to Study 1, the difficulty order of Levels 4.1 and 3.1 was reversed in both conditions according to all measures (accuracy based on all items, ACC-all; accuracy based on all possible items, ACC-poss; weighted accuracy, wACC; Area Under the Curve, AUC). However, the differences between the two levels were quite small, as was the case in Study 1. Furthermore, Levels 4.2.c1 and 4.2.c2 were solved correctly less often in both conditions than in the previous study.

Reliability and convergent and discriminant evidence of validity
Cronbach's α was comparable in both conditions: low for ACC-poss, α = .69 in condition A and α = .67 in condition B, acceptable for ACC-all, α = .78/.79, and good for wACC, α = .85/.90. The correlations with the validation tests differed between the two conditions (Table 10). In condition B, correlations with the PFT were significant for all measures, .43 ≤ r ≤ .49, p < .01. The correlations with the spatial relations tests and the MRT were smaller for ten out of twelve correlations (|r| ≤ .31), and only two correlations with the spatial relations tests were significant, namely between ACC-all and CC (r = .37, p < .05) and between wACC and CT (r = .43, p < .05). In condition A, the correlation pattern over all validation tests differed between the four measures. ACC-all showed medium-strength but nonsignificant correlations with all validation tests, whereas ACC-poss showed the strongest correlation with the CT, r = -.47, p < .001. wACC and AUC were most strongly correlated with the PFT (r = .49, p < .001 and r = .35, p < .05) and weakly correlated with all other tests.

Testing conformity with the linear logistic test model (LLTM)
At first, the Rasch model (RM) of ACC-poss was compared with the 3-PL model with a fixed guessing parameter at .5 and a fixed discrimination parameter at 1. The RM was subsequently applied for ACC-poss in the following analyses. The RMs were computed for both conditions separately and tested according to the procedure described in Section 4.1.4 (Table 11, Table 13). For both conditions and all variables (three splits of wACC and ACC-poss), Andersen's likelihood tests showed no significant difference between the two ability groups, p ≥ .08 (Table 11). Nearly all infits were between 0.5 and 1.5; the only exception was the split between 3 and 4 in condition B, with 3% of the infits outside this range (Table 13). All estimates of the person separation reliability (PSR) were acceptable to very good, .74 ≤ r ≤ .94 (Table 11, Table 13). β-estimates were comparable for the low and the high ability group in both conditions (Figure 9, Figure 11). Comparisons between RMs and linear logistic test models (LLTMs) showed no significant difference in the likelihood ratio tests (Table 12, Table 14). The information criteria AIC and BIC showed strong evidence in favor of the LLTMs compared to the corresponding RMs, with ΔAIC > 10 and ΔBIC > 50 for both conditions. Reliability estimates (PSR) of the LLTMs were nearly the same as for the RMs, .74 ≤ r ≤ .93 (Table 12, Table 14). β-estimates were comparable between all RMs and LLTMs (Figure 10, Figure 12).
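The model comparison described above can be sketched from fitted log-likelihoods and parameter counts. The following Python snippet is a minimal illustration, not the software actually used in the study; the log-likelihoods and parameter counts passed at the end are hypothetical numbers chosen only to reproduce the reported pattern (nonsignificant likelihood ratio test, ΔAIC > 10, ΔBIC > 50 in favor of the LLTM).

```python
import numpy as np
from scipy.stats import chi2

def compare_rm_lltm(loglik_lltm, k_lltm, loglik_rm, k_rm, n_obs):
    """Likelihood ratio test and information-criterion differences for an
    LLTM nested in a Rasch model (RM). A nonsignificant LR test together
    with positive delta-AIC/delta-BIC favors the more parsimonious LLTM."""
    lr_stat = -2.0 * (loglik_lltm - loglik_rm)  # the RM can only fit better
    df = k_rm - k_lltm
    p_value = chi2.sf(lr_stat, df)

    def aic(ll, k):
        return -2.0 * ll + 2.0 * k

    def bic(ll, k):
        return -2.0 * ll + k * np.log(n_obs)

    delta_aic = aic(loglik_rm, k_rm) - aic(loglik_lltm, k_lltm)
    delta_bic = bic(loglik_rm, k_rm) - bic(loglik_lltm, k_lltm)
    return lr_stat, p_value, delta_aic, delta_bic

# Hypothetical values, not taken from the study:
lr, p, d_aic, d_bic = compare_rm_lltm(-460.0, 6, -450.0, 59, n_obs=55)
```

A positive ΔAIC/ΔBIC here means the LLTM has the lower (better) criterion value despite its smaller number of parameters.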

Discussion
The goal of Study 2 was the validation of a short version of the R-Cube-Vis Test with two different item samples. These samples were drawn based on the data of Study 1 and selected with respect to their correlations with the Paper Folding Test (PFT); the two item samples with the lowest and the highest correlation were chosen to show the interchangeability of the items within a single difficulty level. All correlations between the validation tests were similarly strong and lay in the expected interval known from the literature (e.g., Blajenkova et al., 2006; Kozhevnikov et al., 2002; Kozhevnikov & Hegarty, 2001; Miyake et al., 2001; see also Section 2). However, the correlations between tests of the same spatial thinking factor (Visualization, Spatial Relations) were comparable to those between tests of the two different spatial thinking factors. Such similar correlation patterns were found in previous studies as well (e.g., Kozhevnikov et al., 2002; Kozhevnikov & Hegarty, 2001). Similar to Study 1, there were medium effect sizes regarding the correlation with the grade in Mathematics and low correlations between the validation tests and the grade in German.
The reliability estimates of ACC-poss and ACC-all of both R-Cube-Vis short versions were poor to acceptable. However, the true reliability of ACC-all and ACC-poss might be underestimated due to the small number of items and their binary format (Brogden, 1946a, 1946b; Sun et al., 2007). wACC showed good reliability estimates in both conditions.
The patterns of the correlations between the R-Cube-Vis short versions and the visualization tests differed depending on the condition. In condition B, all four measures were strongly correlated with the PFT and reached the magnitude of the correlation between the PFT and the MRT. This means that the expected correlation pattern could partly be confirmed, with strong correlations with the PFT and weaker correlations with the MRT and both spatial relations tests. The results support convergent evidence (for the PFT) and discriminant evidence (for both spatial relations tests) of validity. A correlation with the MRT lying in between could not be shown. In condition A, the correlation pattern differed between the four measures. The expected pattern could only be found for weighted accuracy (wACC) and Area Under the Curve (AUC). However, even in these cases, a stronger correlation with the MRT compared to the correlations with both spatial relations tests could not be found. Therefore, moderate convergent and discriminant evidence of validity existed for only two measures.
Evidence of validity regarding the external criteria, the grades in German and Mathematics, was comparable to the correlations found for the PFT, with negligible correlations with the grade in German but medium effects for the grade in Mathematics. However, these effects were only found for some measures.
The generally weaker correlations of the R-Cube-Vis Test with the PFT and the MRT compared to the reported correlations in Study 1 might result from the changed procedure in Study 2. There was only one trial phase at the beginning with six items, one from each level, either possible or impossible, but not both. Furthermore, the levels were presented in six blocks that were ordered according to increasing difficulty. Due to the smaller number of items within each block, there could be learning effects that might have transferred from one block to the following block, especially if the items were quite similar (e.g., as between Level 4.1 and 3.1). Items from blocks presented later might be easier to solve than items from earlier blocks, as could be seen in the descriptive data (Table 9), where the order between Level 4.1 and 3.1 switched. The procedure was therefore changed to be more appropriate for the short version and tested in the following study.
For both item samples, Rasch model (RM) conformity was shown for all splits of wACC and ACC-poss regarding all criteria: Andersen's likelihood test, infits, and person separation reliability (PSR), as well as visual inspection of the β-estimate plots. Results from the Andersen's likelihood test can only be interpreted with caution due to the small sample size and the resulting low power. Comparisons between the RMs and the corresponding linear logistic test models (LLTMs) confirm the LLTMs as the more suitable models for all splits of wACC and ACC-poss in both conditions. These results support the evidence of validity of the internal structure.
Remarkably, the results support the idea that items within each of the six levels are interchangeable. The ordering of the correlations with the PFT of the two short versions found in Study 1 was not replicated in Study 2; it even reversed. The item sample of the short version used in condition B was more weakly correlated with the PFT in Study 1 but more strongly correlated with the PFT in Study 2, compared to the short version used in condition A.

Study 3: Validation of an alternative short version of the R-Cube-Vis Test
The goal of Study 3 was to examine a short version of the R-Cube-Vis Test with an alternative procedure to overcome the limited convergent and discriminant evidence of validity found in Study 2. The trial phase was extended to reduce learning effects. Furthermore, items were presented randomly in three combined blocks, where each block consisted of two neighboring levels of difficulty. This revised short version was analyzed with respect to reliability and evidence of validity.

Method

Participants
All N = 57 participants, 47 female and 10 male, were students at the University of Mannheim and were not color-blind. On average, the participants were 21 years old (M = 21.49, SD = 3.06 years), with the youngest being 18 and the oldest 36. They received course credit for their participation. Participants were excluded according to the same criteria as in all previous studies. The final sample consisted of N = 52 (43 female, 9 male) participants.

Materials
The materials were similar to those of Study 2: the Paper Folding Test (PFT), the Mental Rotation Test (MRT), and two spatial relations tests, the Cube Comparison Test (CC) and the Chronometric Test (CT). All tests were conducted and evaluated as described in Study 2 (Section 5.1.3).
Participants also had to perform the R-Cube-Vis short version, using the same items as in condition B of Study 2. The test consisted of three blocks: the first block contained items from Levels 4.1 and 3.1, the second from Levels 4.2.p and 3.2.p, and the third from Levels 4.2.c1 and 4.2.c2. All items within each block were presented randomly, with five possible and five impossible items from each level. Before each block, the participants had to complete a trial phase consisting of four items, one possible and one impossible item from each of the two levels of that block. Participants were given the opportunity to repeat the trial phases. As in the previous studies, the test was computer-based with self-administered and system-administered parts.
Furthermore, a questionnaire was administered for descriptive data, such as age, sex, R-Cube experience, and Abitur grades in Mathematics and German.

Procedure
Participants performed the four validation tests in the order described in Section 6.1.2 followed by the R-Cube-Vis Test and the questionnaire.

Results
Descriptive results (Table 15) showed accuracy values similar to Study 2. The correlation between the PFT and MRT was r = .33, p < .05, and lower than the correlations between the PFT and both spatial relations tests, CC and CT (Table 16). However, neither difference reached statistical significance, z ≤ -.049, p ≥ .31. The two spatial relations tests, CC and CT, correlated at r = -.45, p < .01 (Table 16). None of the tests correlated significantly with the grade in German, |r| ≤ .14, p ≥ .43, and there was only one significant correlation between the validation tests and the grade in Mathematics, r = -.33, p < .05 (MRT); all other correlations were weaker, |r| ≤ .21, p ≥ .29. Experience with Rubik's cubes showed one marginally significant correlation with the CC (r = .27, p = .05) but was not correlated with any other validation test (|r| ≤ .16, p ≥ .26).

R-Cube-Vis Test for visualization: Descriptive results
The descriptive results of the single R-Cube-Vis levels showed the expected ordering (Table 17), with nearly the same difficulty of Levels 4.1 and 3.1. Compared to Study 2, ACC-poss and the weighted accuracy (wACC) were higher for Level 4.2.c1 and similar to the results of Study 1, whereas accuracy in Level 4.2.c2 remained below chance level.

R-Cube-Vis Test for visualization: Reliability and convergent and discriminant evidence of validity
The reliability estimates of the R-Cube-Vis measures were similar to Study 2: α = .76 (accuracy over all items, ACC-all), α = .64 (ACC-poss), and α = .90 (wACC). The correlations with the visualization test (PFT) and the MRT were significant for all measures and descriptively larger for the PFT than for the MRT, .46 ≤ r ≤ .57, p < .01 and .34 ≤ r ≤ .45, p < .05, respectively (Table 18). The CC was also significantly correlated with three of the four measures, with effect sizes comparable to the correlations with the MRT. The correlations with the CT were lower (r ≥ -.30, p ≥ .05) but not significantly different from the correlations of all measures with the PFT, z ≥ 1.30, p ≥ .10.

R-Cube-Vis Test for visualization: Testing conformity with the linear logistic test model (LLTM)
The 3-PL model with the fixed guessing parameter at .5 and the fixed discrimination parameter at 1 was more appropriate than the RM, with moderate evidence according to AIC and BIC (RM: AIC = 947.11, BIC = 986.14; 3-PL: AIC = 942.94, BIC = 981.96). However, the differences in both criteria were small compared to the differences found in the two previous studies, and therefore both models were interpreted as comparably able to describe the data structure. In the following analyses, the RM and the corresponding LLTM were considered, consistent with Studies 1 and 2. There was no significant difference between the β-estimates of the low and high ability groups according to Andersen's likelihood test for Rasch models (RMs), p ≥ .06 (Table 19), for any of the variables (ACC-poss and three splits of wACC). All infits lay between 0.5 and 1.5, and all estimates of the person separation reliability (PSR) were acceptable, r ≥ .71, except for ACC-poss, r = .53 (Table 19). Plotting the β-estimates of the two groups defined for the Andersen's tests against each other showed that all data points lie around the main diagonal (Figure 13). Comparisons between the RMs and the corresponding linear logistic test models (LLTMs) showed no significant differences based on the likelihood ratio tests for any split, p ≥ .30 (Table 20). Both information criteria, AIC and BIC, showed strong evidence in favor of the LLTMs compared to the corresponding RMs, with ΔAIC > 10 and ΔBIC > 50. The reliability estimates (PSR) were nearly the same as for the RMs (Table 20). Visual inspection of the plots of the β-estimates of the RMs against the β-estimates of the LLTMs revealed high consistency of corresponding values (Figure 14).

Figure 14:
Comparison between normalized β-parameters for RMs and corresponding LLTMs.

Discussion
The aim of this study was the evaluation of an alternative procedure for the R-Cube-Vis short version, since both short versions tested in Study 2 showed only moderate support for convergent and discriminant evidence of validity. The correlations between the Paper Folding Test (PFT) and the Mental Rotation Test (MRT), as well as the correlations between the MRT and the Chronometric Test (CT), were lower than the respective correlations of the previous studies and below the lower bound of the expected intervals based on the current literature (e.g., Blajenkova et al., 2006; Kozhevnikov et al., 2002; Kozhevnikov & Hegarty, 2001; Miyake et al., 2001; see also Section 2). However, these unexpected correlations were interpreted as an artefact given the results of all three previous studies. All other correlations lay in their expected intervals. The negligible correlations of the validation tests with the grade in German confirmed the results of Study 1 and Study 2. The correlations with the grade in Mathematics were weaker than before, with the exception of the correlation with the MRT, but they were located in the expected interval, too.
The correlation patterns found between all R-Cube-Vis measures and the validation tests were similar to Study 1 and stronger than those obtained in Study 2. They showed the expected pattern, with strong correlations with the visualization test (PFT), weaker correlations with the MRT, and the weakest correlations with the spatial relations test, CT. Only the correlations with the second spatial relations test, the Cube Comparison test (CC), were higher than expected and similar to the correlations with the MRT. However, the CC was also highly correlated with the PFT as the standard visualization test. Therefore, the correlation patterns of all measures support convergent and discriminant evidence of validity of the R-Cube-Vis Test as a visualization test.
The low- to medium-strength correlations with the external criteria, the grades in German and Mathematics, were comparable to the respective correlations found for the validation tests and to the results of the previous studies.
There is evidence for conformity with the linear logistic test models (LLTMs) as superior models compared to the corresponding Rasch models (RM). Hence, this supports the evidence of validity of the internal structure. As in the previous studies, the small sample size limits the interpretation of Andersen's likelihood test and the likelihood ratio test because of the low power, but the alternative criteria support the model conformity.
The results of this study show that the changed procedure of the R-Cube-Vis short version improved the test with respect to convergent and discriminant evidence of validity and is more suitable than the procedure applied in Study 2.

Study 4: Disentangling type- and item-specific characteristics of gaze patterns
When participants look at parts of a visual item while attempting to solve it, the gaze patterns are thought to be determined by the cognitive processes that are necessary to solve the item. These type-specific characteristics are assumed to result from "top-down" processing. However, there are visual features of the item that are independent of the demanded cognitive processes and are therefore task-irrelevant. They may nevertheless affect the gaze pattern via "bottom-up" processing, resulting in item-specific characteristics (see Section 2). In addition, there might be random eye movements, which are considered unsystematic error and will be ignored in the following analyses. In order to identify cognitive processes that are strategically applied to solve an item ("top-down"), the gaze patterns that indicate processing of the crucial visual information have to be disentangled from gaze patterns driven in a "bottom-up" fashion by visual features that "capture" the eyes but are task-irrelevant. Hence, the goal of this study is to disentangle the type-specific characteristics of the recorded gaze patterns from the item-specific characteristics.
The R-Cube-Vis Test provides an approach to resolve this issue. Its six levels of difficulty are considered to be homogeneous item sets. All items of a given level are constructed based on the same transformation principle (called the type of a certain level). For instance, the items of Level 4.1 (Figure 17, p. 70) are constructed such that there is a one-step rotation applied to one axis of the cube on the right. It is assumed that processing all items of this level will require corresponding essential mental transformation steps to solve them successfully. For the items of Level 4.1, identifying the rotated axis and mentally performing a back rotation may be necessary steps. However, the particular axis, or the side of the cube where the rotation is visible, as well as the rotation direction, vary between the items of Level 4.1. These differences in visual appearance are assumed to be independent of the required cognitive processing steps. The basic idea is that gaze patterns that are equal over all items of a specific level reflect the cognitive processes ("top-down") that are applied to solve an item and are interpreted as type-specific. In contrast, gaze patterns that vary between items of the same level reflect visual features of an item that are not relevant for solving the task ("bottom-up") and are interpreted as item-specific. Therefore, these two kinds of gaze patterns (type-specific and item-specific) can be distinguished by fixation-based measures across all items of the R-Cube-Vis Test that belong to the same difficulty level, where all items have the same type.
In accordance with existing eye tracking research, it was expected that "bottom-up" processes would be more observable in fine-scaled areas of interest (AOIs), i.e., single elements of the cubes, because visual attributes of these features might lead the gaze (see Ivie & Embretson, 2010). In contrast, large-scaled AOIs that summarize many fine-scaled AOIs, e.g., the whole cube, might yield effects that are independent of these specific attribute effects (see Snow, 1980).
The fine-scaled AOIs of the single elements of the cubes are coded with respect to the construction principle of the item type. For example, the right cube of each item of Level 4.1 shows one side with no rotated element (Figure 17, p. 70). However, for each cube of Level 4.1, this side can be the left, the top, or the right side of the right cube. For all items, this side is labeled equally, because it is the side with no rotated element. This way of coding the AOIs is termed "function coding" here. When fixations on functionally coded AOIs differ between homogeneous items of the same type, there must be item-specific visual features that lead the eyes in different ways for each item. Furthermore, particular visual features may "capture" the eyes in predictable ways across items of a type, independent of their function for solving the item. This would be the case if, for instance, a particular color pattern or a particular actual side of a cube were always fixated for some time. Understanding these purely visual item features across items of a type requires another coding of the AOIs with regard to the actual visual places and characteristics rather than with regard to their function for the transformations. This alternative coding of the AOIs considers the actual place (left, top, or right side) and is labeled "location coding" here.
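The distinction between the two coding schemes can be illustrated with a minimal Python sketch. All AOI labels and the mapping below are hypothetical, invented for illustration; they are not the labels used in the study.

```python
# Location coding records the physical side of the right cube that was
# fixated; function coding relabels each side by its role in the item's
# construction principle.

# Fixated physical sides for one hypothetical item of Level 4.1:
location_coded = ["top", "left", "top", "right"]

# Item-specific mapping (hypothetical): for this particular item, the
# rotation is visible on the top side, and the left side shows no rotated
# element. For a different item of the same level, the mapping would differ.
function_of_location = {
    "top": "rotated_side",
    "left": "side_without_rotation",
    "right": "remaining_side",
}

function_coded = [function_of_location[side] for side in location_coded]
# Function-coded sequences are comparable ACROSS items of the same level,
# even though the physical sides differ from item to item.
```

Comparing fixation measures over function-coded AOIs thus isolates what is common to the type, while location-coded AOIs expose the purely visual, item-specific features.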
The specific fixation-based indicators considered in the following analyses focus on small parts of a cube side that are functionally relevant for mental transformations (fine-scaled AOIs), but also comprise larger AOIs, such as one side of a cube or a complete cube. In addition, the indicators may encompass patterns of subsequent visits to AOIs. In the present study, four indicator classes were considered: (1) fixations on pre-defined, functionally coded parts of a single cube (fine-scaled), with the functional relevance based on the construction principle of the item type, (2) fixations on the complete cube, i.e., summing up over all fine-scaled AOIs on a single cube (large-scaled), (3) sequences of fixation switches between the left and the right cube, i.e., summing up over time (large-scaled), and (4) specific viewing patterns on the fine-scaled, functionally coded AOIs that were assumed to be necessary to solve an item according to its type (fine-scaled). Item-specific effects should be found for fixation-based measures of fine-scaled AOIs (Indicator-Classes 1 and 4), whereas type-specific effects were expected for measures of large-scaled AOIs (Indicator-Classes 2 and 3).
In addition, an attempt was made to understand visual item features that may determine item-specific differences in fine-scaled, functionally coded AOIs between items of the same type, i.e., to understand how visual features drive "bottom-up" effects in gaze patterns that are unrelated to solving the item. To this end, the location coded AOIs were interpreted with respect to the single items. A better understanding of the "bottom up" effects is important. If the strategic "top down" transformation processes have to be identified and described with gaze patterns, then the "bottom up" gaze patterns have to be controlled for.
The second part of Study 4 was the explorative analysis of all fixation-based measures with respect to their potential to indicate the correctness of the answer to single items, utilizing logistic regression models. These results would provide insights into the cognitive processes, indicated by the fixation-based measures, that might be necessary, or at least beneficial, for solving items of a certain level of difficulty.

Method

Participants
All N = 69 participants were students at the University of Mannheim, were not color-blind, and received course credit. Seventeen participants had to be excluded based on the same criteria as in the previous studies and due to erroneous eye tracking data. The final sample consisted of N = 52 participants (42 female, 10 male) with mean age M = 20.90 years (SD = 3.87 years) ranging from 18 years to 39 years.

Materials
The short version of the R-Cube-Vis Test was conducted following the same procedure as in Study 3. Before each item, a cross was presented in the middle of the screen for 1 second, on which the participants had to fixate. The cubes of each following item were placed to the right and left of this fixation cross. Furthermore, participants had to provide information about sex, age, and study major. As in Study 3, the test was computer-based with self- and system-administered parts.

Apparatus
The R-Cube-Vis Test was presented on a 23" screen (aspect ratio: 16:9, resolution: 1920 x 1080 pixels) embedded in an eye tracking unit with the Tobii TX300 remote eye tracker with a 300 Hz recording frequency. The eye tracker was connected to the presentation software E-Prime 2.0 (Psychology Software Tools Inc., 2012) with the "Extensions for Tobii" (Psychology Software Tools Inc., 2011). Fixations were computed with the adaptive event detection algorithm of Nyström and Holmqvist (2010). The algorithm was applied with an upper bound at 200ms/deg to limit the adaptive event detection thresholds if the data are too noisy (Fehringer, 2018). The algorithm was implemented in Python (van Rossum, 1995) with the packages pandas (McKinney, 2011), numpy (van der Walt, Colbert, & Varoquaux, 2011), and scipy (Jones, Oliphant, Peterson, & others, 2001). The recorded pupillometric data were not considered in the following analyses. However, they were used to optimize a pupillary-based indicator of cognitive workload, the Index of Pupillary Activity (Duchowski et al., 2018), with respect to the difficulty levels of the R-Cube-Vis Test (Fehringer, 2020).

Procedure
After the instruction, the eye tracker was calibrated for each participant. They then had to perform the R-Cube-Vis Test. At the end, participants had to answer the questionnaire. Before and after the R-Cube-Vis Test, a fixation task with nine fixation crosses was performed to provide evidence for comparable accuracies of the eye tracker at the beginning and the end of the experiment. All tasks were presented with the presentation software E-Prime 2.0 (Psychology Software Tools Inc., 2012).

Analyses
Areas of interest (AOIs) were defined to cluster fixations on the same regions. These AOIs were defined either fine-scaled for regions with the same color (Figure 15), i.e., one AOI for each feature on a cube (Indicator-Class 1), or by combining fine-scaled AOIs into more meaningful AOIs such as whole cubes (Indicator-Class 2). As described above, the AOIs were function coded and, if possible, also location coded. For each of these defined AOIs, separately for each item, the relative number of fixations was considered. The relative number of fixations of an AOI X was defined as the number of fixations on X divided by the number of all fixations on one of the two cubes.
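The relative number of fixations defined above can be sketched in a few lines of Python. The AOI labels in the example are hypothetical; only the normalization rule (divide by all fixations that landed on either cube) is taken from the text.

```python
from collections import Counter

def relative_fixations(aoi_sequence):
    """Relative number of fixations per AOI: the number of fixations on an
    AOI divided by the number of all fixations on one of the two cubes."""
    counts = Counter(aoi_sequence)
    total = sum(counts.values())
    return {aoi: n / total for aoi, n in counts.items()}

# Hypothetical fixation sequence, already mapped to AOIs:
rel = relative_fixations(["left_cube", "right_cube", "right_cube",
                          "left_cube", "right_cube"])
# rel["left_cube"] == 0.4, rel["right_cube"] == 0.6
```

In the study, this computation would be carried out separately per item, once for function-coded and once for location-coded AOIs.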
Second, entropy values (Shannon, 1948) were computed for each level as an indicator of how chaotic or systematic the fixation patterns were (Indicator-Class 3). The entropy is computed as H(X) = −Σ_{i=1}^{s} π_i Σ_{j=1}^{s} p_ij ln(p_ij), with the transition matrix X that contains in each cell (i, j) the probability that a fixation on AOI i is followed by a fixation on AOI j. The entry in each cell (i, j) of X is labeled p_ij, s is the number of AOIs, and π_i is the probability of occurrence of AOI i. The logarithm is taken to the base e (natural logarithm). Simplified, the lower the entropy value, the more systematic and less chaotic is the AOI sequence. In order to take different fixation durations into account, fixations were split up into 200ms chunks, i.e., an original fixation with a duration of 560ms was split up into three fixations (200ms + 200ms + 160ms). That means a fixation sequence ABAAB with fixation durations 560ms, 180ms, 320ms, 100ms, 210ms would result in the sequence AAABAAABB with theoretical durations (200ms + 200ms + 160ms), 180ms, (200ms + 120ms), 100ms, (200ms + 10ms). Due to the short fixation sequences, especially at the easiest levels, entropy values were analyzed for fixation sequences differentiating only between the left and the right cube, because some of the fine-scaled AOIs were not fixated at all, which would lead to many empty entries in the transition matrix X. Hence, the entropy values contain only fixations on the left and the right cube and are, therefore, based on indicators of Indicator-Class 2.
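The chunking rule and the transition entropy can be sketched as follows. This is a minimal Python illustration, not the study's code; it assumes durations in milliseconds and estimates the occurrence probabilities π_i from the relative frequencies of the transition origins in the observed sequence.

```python
import math
from collections import Counter

def chunk_fixations(aois, durations_ms, chunk_ms=200):
    """Split each fixation into 200 ms chunks, e.g., a 560 ms fixation on A
    becomes three fixations A, A, A (200 ms + 200 ms + 160 ms)."""
    chunked = []
    for aoi, dur in zip(aois, durations_ms):
        chunked.extend([aoi] * max(1, math.ceil(dur / chunk_ms)))
    return chunked

def transition_entropy(seq):
    """H(X) = -sum_i pi_i * sum_j p_ij * ln(p_ij), where p_ij is the
    probability that a fixation on AOI i is followed by one on AOI j and
    pi_i is estimated from the relative frequency of AOI i as a transition
    origin (an assumption of this sketch)."""
    origins = Counter(seq[:-1])
    pairs = Counter(zip(seq[:-1], seq[1:]))
    n = len(seq) - 1
    h = 0.0
    for i, n_i in origins.items():
        pi_i = n_i / n
        for (a, b), n_ab in pairs.items():
            if a != i:
                continue
            p_ij = n_ab / n_i
            h -= pi_i * p_ij * math.log(p_ij)
    return h

seq = chunk_fixations(list("ABAAB"), [560, 180, 320, 100, 210])
# seq == list("AAABAAABB"), matching the example in the text
```

Lower values indicate more systematic switching, e.g., a strictly alternating sequence ABAB... yields an entropy of 0.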
Third, fixation sequences per item were analyzed in order to find indicators of cognitive processes that are potentially necessary to solve the items (Indicator-Class 4), i.e., back rotation of the rotated element (Figure 16a) and comparison of corresponding parts of the right cube (Figure 16b, c). Back rotation means that participants tried to imitate the required rotation of the rotated element to solve the cube. Comparison of corresponding parts means that participants compared one side of a rotated element with corresponding non-rotated elements of another cube side. To this end, the fixation sequences per item were searched for bi-grams, a special case of n-grams with n = 2. An n-gram is a sequence of n parts that is usually searched for in a larger, ordered sequence, e.g., a sequence of n letters is searched for in a given text. In the current study, two consecutive fixations (bi-grams), which might indicate the described cognitive processes, were searched for in the fixation sequences of each item and each participant. For each rotated element, three bi-grams were defined. (1) Back rotation was defined by a first fixation on the "old" color of the rotated element (an "old" color is a color that appears on both cubes, left and right, i.e., this color would still be visible if the right cube were solved) and a second fixation on the "new" color of the rotated element (this color is not visible on the left cube and only appears on the right cube due to the rotation) (Figure 16a: the first fixation should be on the red part and the second fixation on the green part). For the comparison of corresponding parts, a differentiation between the old and the new color of the rotated element was made. One fixation should be (2) either on the old color (Comparison I) or (3) on the new color (Comparison II) of the rotated element.
The second fixation should always be on the not-rotated elements of the other cube side (Comparison I: Figure 16b, a fixation on the blue part of the rotated element is followed by a fixation on the blue part of the not-rotated element, or vice versa; Comparison II: Figure 16c, a fixation on the brown part of the rotated element is followed by a fixation on the green part of the not-rotated element, or vice versa). One should note that for the bi-grams that indicate rotation, the fixation sequence is ordered, whereas for the bi-grams indicating comparisons, the ordering of the two fixations is arbitrary. For each bi-gram, each item, and each participant, a variable indicates whether the respective bi-gram occurs (0: no bi-gram; 1: at least one bi-gram). The entropy values as well as the bi-grams were computed using Python (van Rossum, 1995) with the packages pandas (McKinney, 2011), numpy (van der Walt et al., 2011), and scipy (Jones et al., 2001). All plots were created with the R-package ggplot2 (Wickham, 2009). First, the indicators of the four indicator classes were examined with respect to type-specific and item-specific properties. To this end, the means and the 95%-confidence intervals of the relative number of fixations of each indicator of Indicator-Class 1 and 2 as well as the values of the entropy (Indicator-Class 3) and the bi-grams (Indicator-Class 4) were compared between items of the same level of difficulty. Item-specific properties exist if the five possible items of a specific level differ in the considered means; this held only for the function coded, not the location coded, fine-scaled AOIs. If all means are equal for a specific indicator and a single level of difficulty, this indicator was interpreted as type-specific.
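The item-wise comparison of indicator means with 95%-confidence intervals can be sketched as follows; the per-participant values and the normal-approximation interval are illustrative assumptions, not data from the study.

```python
import numpy as np

def mean_ci95(values):
    """Mean and normal-approximation 95% confidence interval of the
    relative number of fixations on one AOI across participants."""
    v = np.asarray(values, dtype=float)
    m = v.mean()
    half = 1.96 * v.std(ddof=1) / np.sqrt(len(v))
    return m, (m - half, m + half)

# hypothetical per-participant fixation shares on one AOI for one item
m, (lo, hi) = mean_ci95([0.21, 0.18, 0.25, 0.19, 0.22, 0.20])
print(round(m, 3), round(lo, 3), round(hi, 3))
```

If the intervals of the five items of a level overlap, the indicator would be read as type-specific under this scheme; non-overlapping intervals would suggest item-specific behavior.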
These analyses were performed only descriptively, without significance tests, because this analytic approach is new and there was no prior knowledge about the gaze patterns elicited by the utilized stimuli or about the differentiation between type- and item-specific characteristics.
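The bi-gram coding described above can be sketched as a search over consecutive fixation pairs, ordered for the rotation bi-gram and order-free for the comparison bi-grams. The AOI labels in this example are hypothetical.

```python
def has_bigram(seq, a, b, ordered=False):
    """Return 1 if two consecutive fixations (a, b) occur in seq, else 0.
    Rotation bi-grams are ordered (old color -> new color); comparison
    bi-grams count either order."""
    for x, y in zip(seq, seq[1:]):
        if (x, y) == (a, b) or (not ordered and (x, y) == (b, a)):
            return 1
    return 0

# hypothetical AOI labels for one rotated element of one item
seq = ["R.new", "R.old", "R.new", "L.old", "R.old"]
rotation    = has_bigram(seq, "R.old", "R.new", ordered=True)  # old -> new
comparison1 = has_bigram(seq, "R.old", "L.old")                # either order
print(rotation, comparison1)
```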

Item-specific characteristics of the fine-scaled AOIs, indicated by differences between the items of the same level, were inspected regarding specific features, such as the position of the rotated elements or the axis around which the elements were rotated, utilizing the location coded AOIs. Based on the identified item configuration and the corresponding means of the considered indicator, assumptions were derived to explain the different means. For example, if the rotated element was rotated around the horizontal axis, participants might have looked more often at the rotated element than if it was rotated around the vertical axis.
Third, all indicators were examined regarding their ability to differentiate between correctly and incorrectly solved items (item accuracy). To this end, logistic regression models were analyzed as prediction models with the single tested indicators as independent variable and item accuracy as dependent variable. Item accuracy can either be 1 (correctly solved item) or 0 (incorrectly solved item). All indicators were z-standardized to control for item-dependent properties. The significance level was set to α = .05 for all tested models. However, these models were only computed for the two most difficult levels (Level 4.2.c1 and 4.2.c2), because all other levels had too few incorrectly solved items. For both medium levels (Level 4.2.p and 3.2.p), differences between the means of the correctly and incorrectly solved items were analyzed descriptively. Such descriptive comparisons could not be performed for the easiest levels (Level 4.1 and 3.1), because some items were solved incorrectly by only one or even zero participants.
As before, correctly solved items should be associated with lower entropy values (indicating more systematic viewing patterns) and with a higher occurrence of bi-grams, assuming that these fixation sequences reflect cognitive processes that are necessary for solving the items.
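The per-indicator logistic regressions can be sketched as follows, coding accuracy as 1 = correct for this sketch and z-standardizing the predictor. The data are simulated, not from the study, and the simple Newton-Raphson fit stands in for whatever routine the authors used.

```python
import numpy as np

def logit_fit(x, y, iters=25):
    """One-predictor logistic regression (intercept + slope),
    fitted by Newton-Raphson; returns (intercept, slope)."""
    X = np.column_stack([np.ones_like(x), x])
    w = np.zeros(2)
    for _ in range(iters):
        p = 1.0 / (1.0 + np.exp(-X @ w))
        grad = X.T @ (y - p)                       # score vector
        hess = X.T @ (X * (p * (1 - p))[:, None])  # Fisher information
        w = w + np.linalg.solve(hess, grad)
    return w

# simulated data: lower entropy -> higher probability of a correct answer
rng = np.random.default_rng(0)
entropy = rng.normal(size=200)
accuracy = (rng.random(200) < 1 / (1 + np.exp(0.8 * entropy))).astype(float)
z = (entropy - entropy.mean()) / entropy.std()  # z-standardization
b0, b1 = logit_fit(z, accuracy)
odds_ratio = float(np.exp(b1))                  # OR = exp(b)
```

An odds ratio below 1 here would indicate that higher entropy lowers the odds of a correct answer, mirroring the direction of the entropy effect reported for Level 4.2.c2.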

Results
The expected differences between the levels could be found for ACC-poss (accuracy over all possible items) and with regard to reaction times, number of fixations, and fixation duration (Table 21). Generally, the more difficult the levels were, the longer were the reaction times and the fixation durations, and the larger was the number of fixations. However, the results were comparable between Level 4.1 and 3.1 and between Level 4.2.p and 3.2.p but differed between Level 4.2.c1 and 4.2.c2.

Single cube features and whole cubes
Only a few selected results are presented due to space restrictions. For demonstration purposes, the focus of the following analyses lies on the left cube (Indicator-Class 2) and its three sides (Indicator-Class 1). It was investigated whether the indicators of the three sides of the left cube change depending on the corresponding side of the right cube. For example, it was analyzed whether the number of fixations on the side of the left cube that corresponds to the side of the right cube showing no rotated elements differs from the number of fixations on a side of the left cube that corresponds to a side of the right cube showing the rotated element. In the following, only the results of the easiest level, Level 4.1, are presented. The considered items of Level 4.1 are displayed in Figure 17. The AOI that covers the complete left cube is labeled "L". All single sides of the left cube are labeled according to the corresponding side of the right cube (function coded; see the example coding for item 4.1_8 in Figure 18). If the side of the right cube shows no rotated element, it is labeled "L.not", e.g., the brown side of item 4.1_8 (Figure 18) and the red side of item 4.1_9 (Figure 17). If the side of the right cube shows the side of the rotated element with the color that cannot be seen on the left cube ("new color"), the corresponding side of the left cube is labeled "L.new", e.g., the red side of item 4.1_8 (Figure 18) and the blue side of item 4.1_16 (Figure 17). If the side of the right cube shows the side of the rotated element with the color that is also shown on the left cube ("old color"), the corresponding side of the left cube is labeled "L.old", e.g., the yellow side of item 4.1_8 (Figure 18) and the blue side of item 4.1_19 (Figure 17). The location coded AOIs are labeled "L.left" for the left side, "L.top" for the top side, and "L.right" for the right side.
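The two coding schemes can be connected by a per-item lookup that relabels a location-coded fixation sequence with the item's function codes. The side assignments in this mapping are hypothetical placeholders, not the actual configuration of item 4.1_8.

```python
# hypothetical per-item mapping from location-coded to function-coded labels
recode = {
    "4.1_8": {"L.left": "L.not", "L.top": "L.old", "L.right": "L.new"},
}

def to_function_coding(item, fixations):
    """Relabel a location-coded fixation sequence with the item's
    function codes (L.not / L.old / L.new); unknown AOIs pass through."""
    return [recode[item].get(aoi, aoi) for aoi in fixations]

print(to_function_coding("4.1_8", ["L.right", "L.top", "L.right"]))
```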
As expected, the descriptive analysis of the fine-scaled, function coded AOIs (Indicator-Class 1) revealed item-specific effects with a different relative number of fixations for the five items. The large-scaled AOI (Indicator-Class 2), the left cube, showed a similar relative number of fixations for all items, except for item 4.1_19 (Figure 19), and was therefore interpreted as type-specific. However, the differences between the fine-scaled AOIs disappeared nearly completely if the AOIs were location coded (Figure 19). For all items, the right side of the left cube that is closest to the middle of the item (where the fixation cross was presented directly before the item was presented) had the highest relative number of fixations of all sides of the left cube, followed by the top side, which is the second closest side to the middle. The left side of the left cube, with the largest distance to the middle, showed the smallest relative number of fixations. The reason might be that participants always had to fixate on the middle of the screen before the item appeared and that the cubes of each item were placed to the left and to the right of the screen center. Participants might have fixated more frequently on cube parts the closer these were to the center. However, item 4.1_9 was an exception also in this analysis, with a lower relative number of fixations on the top side than for the other items. One reason might be that this item is the only one for which the two top layers are not rotated. Therefore, participants might have fixated on the top layer to a lesser extent, because attention was directed only to the lower parts of the cube. The items of Level 3.2.p are presented in Figure 20. As for Level 4.1, "L.not" refers to the side of the left cube corresponding to the side of the right cube with no rotated elements. The other two sides with rotated elements cannot be semantically discriminated.
On both sides, one rotated element shows a new color and the other rotated element shows an old color. Therefore, the numbers for these sides were chosen randomly (Figure 20). In the analyses, both sides were considered together and labeled "L.rot". The location coded AOIs L.left, L.top, and L.right are defined as above. The descriptive analysis of the large-scaled AOI (Indicator-Class 2), the left cube, showed a type-specific pattern of the relative numbers of fixations, with nearly the same values for all items in Level 3.2.p (Figure 21). In contrast to Level 4.1, the fine-scaled function coded AOIs (Indicator-Class 1) were also very similar for all items, whereas the location coded AOIs differed across the items (Figure 21). However, the location coded AOIs showed only one side with a higher relative number of fixations for each item compared to all other sides (Figure 21). Each of these specific sides is one of the two sides that correspond to one of the sides of the right cube showing both rotated elements (Figure 20). This pattern seems to follow a simple rule: If the right side of the left cube (nearest to the middle of the screen) corresponds to a side of the right cube that shows the two rotated elements, then this side has the highest relative number of fixations. If this is not the case, then the other two sides of the left cube correspond to the two sides of the right cube showing the rotated elements, and then the second closest side to the middle, the top side, has the highest relative number of fixations. However, item 3.2_1 is an exception, with the highest relative number of fixations on the left side of the left cube, which may have been caused by the red color of one of the rotated elements leading the gaze. These results suggest that for these items both processes, top-down and bottom-up, shape the gaze patterns and that the resulting gaze patterns have type- and item-specific characteristics.
The considered items of Level 4.2.c1 are shown in Figure 22. Here, the labels of the left cube differ from before. "L.cross" refers to the side of the left cube corresponding to the side of the right cube on which both elements cross, e.g., the green side of item 4.3_10 (Figure 22). "L.point" is the label of the left cube's side corresponding to the side of the right cube showing one rotated element with a "dot", e.g., the white side of item 4.3_23 (Figure 22). "L.line" refers to the side of the left cube that corresponds to the side of the right cube showing one rotated element with no dot, e.g., the yellow side of item 4.3_10 (Figure 22). Also in this case, the location coded AOIs L.left, L.top, and L.right refer to the left side, the top side, and the right side, respectively.
The expected type-specific pattern of the relative number of fixations for the large-scaled AOI, the left cube, was also revealed by the descriptive analysis of Level 4.2.c1. The relative number of fixations of the fine-scaled AOIs differs between the items with function as well as with location coding (Figure 23). However, some systematic patterns might be recognized. In four of the five items, the side of the left cube that corresponds to the side of the right cube showing the cross had the highest relative number of fixations (the exception is item 4.3_17). One reason could be that this side (AOI L.cross) corresponds to the only side of the right cube that shows both rotated elements. Hence, participants might have focused more on this side to understand how both elements are rotated. Also, the relative number of fixations on the left side of the left cube, which has the largest distance to the middle, was always the lowest, except for item 4.3_17, which shows the cross on this side. As in Level 3.2.p, the gaze pattern seems to be led by top-down as well as by bottom-up processes and, therefore, contains type- and item-specific characteristics.

Entropy
In contrast to the relative number of fixations on single cube features or even on the complete cube, the descriptive analysis showed that the entropy values (Indicator-Class 3, summing over time) were more stable for items within the same difficulty level (Figure 24). The ranges of values within a single difficulty level were narrow, spanning from .03 (Level 4.2.c1) to .08 (Level 4.2.c2) across all levels, indicating type-specific behavior. This result confirms the assumption that more broadly defined, large-scaled fixation-based indicators are more stable across different items from the same difficulty level and are, therefore, type-specific.
The between-level differences showed higher values for easier levels and lower values for the most difficult levels. Level 3.2.p seemed to be an exception with higher entropy values than Level 4.2.p and with entropy values comparable to Level 4.1 and 3.1. One reason might be that for items of Level 3.2.p in particular it was not obvious which elements had to be back rotated by observing only the right cube, because all elements have different colors on a single side ( Figure 20). Therefore, it was comparatively more helpful to look at the left cube than it was for each other level. However, further studies with more detailed analyses of the fixation sequences are needed to confirm this finding.

Bi-grams
The relative frequencies of the analyzed bi-grams differed between levels as well as between single items within the same level (see the plots for Level 4.1, Figure 25, Level 3.2.p, Figure 26, and Level 4.2.c1, Figure 27). This result confirms that the bi-grams (Indicator-Class 4) can be interpreted as item-specific. However, there were some exceptions showing a more type-specific pattern, such as Comparison II of Level 4.1 (Figure 25) and all bi-grams of Element 1 of Level 4.2.c1 (Figure 27). Some of the results might be explained by visual features of the items. For example, the items of Level 3.2.p might be grouped into two groups regarding the occurrence of the Rotation bi-gram (Figure 26). Items 3.2_1, 3.2_10, and 3.2_18 had a higher occurrence than items 3.2_4 and 3.2_20. A reason might be that both rotated elements of the first three items (items 3.2_1, 3.2_10, and 3.2_18) are next to each other, whereas for the last two items (3.2_4 and 3.2_20) the non-rotated element is placed between them (Figure 20). In both cubes, the non-rotated element is in the middle. The participants, therefore, had to switch their gaze fixation from the right front to the left back.

Differentiation between correctly and incorrectly solved items
Differences between correctly and incorrectly solved items were analyzed descriptively regarding the number of fixations on all defined AOIs (Indicator-Class 1 and 2), the entropy (Indicator-Class 3), and the bi-grams (Indicator-Class 4). However, this differentiation was not considered for Level 4.1 and 3.1, because there were too few incorrectly solved items in these levels (Level 4.1: 1 ≤ N ≤ 5; Level 3.1: 2 ≤ N ≤ 7).

Figure 27: Mean occurrence rate of the three bi-grams in Level 4.2.c1, separated for each item. Comparison I, Comparison II, and Rotation are defined as above. Element 1 refers to the first element that has to be rotated to obtain the right cube from the left cube; Element 2 refers to the other element. The error bars indicate the 95%-confidence intervals.
The number of incorrectly solved items per level was still low for the two levels with medium difficulty (Level 4.2.p: 4 ≤ N ≤ 10; Level 3.2.p: 6 ≤ N ≤ 12); therefore, no statistical models were computed and the data were only analyzed descriptively. For both levels, Level 4.2.p and Level 3.2.p, there seemed to exist a type-specific pattern for the AOI L.rot, which combined the two sides of the left cube that correspond to the two sides of the right cube with rotated elements. The relative number of fixations was always lower for incorrectly than for correctly solved items, with differences .05 ≤ M_correct − M_incorrect ≤ .08 (Level 4.2.p) and .02 ≤ M_correct − M_incorrect ≤ .08 (Level 3.2.p). This could mean that the comparison of the rotated elements of the right cube with the corresponding sides of the left cube is helpful for solving these items and would be a type-specific property. The logistic regression models showed an effect for the left cube (AOI L) in Level 4.2.c2, χ²(1) = 5.45, p < .05; b = −0.31, p < .05, OR (odds ratio) = 0.73, [0.56, 0.95], which, however, seems to trace back to the AOI L.point, χ²(1) = 7.61, p < .01; b = −0.40, p < .01, OR = 0.66, [0.51, 0.85]. In contrast to Level 3.2.p, a higher relative number of fixations was indicative of incorrectly solved items. A similar effect for the left cube (AOI L) could be found in the descriptive analysis of Level 4.2.c1. For four out of five items of Level 4.2.c1, the values were higher for incorrectly solved items, .03 ≤ M_incorrect − M_correct ≤ .11, as for Level 4.2.c2.
A reason for the reversed pattern of Level 4.2.c1 and 4.2.c2 compared to Level 4.2.p and 3.2.p might be certain characteristics of the items. For all levels, the rotated elements have to be compared with the non-rotated elements in order to decide in which direction they have to be rotated back. For all levels, the information about the non-rotated elements can be gained from the left cube, but also from the non-rotated elements of the right cube. For Level 4.2.p, and particularly for Level 3.2.p, recognizing the rotated elements is not obvious from observing only the right cube, because the two parallel rotated elements are visible on two sides of the right cube (Figure 20). Hence, it might be that detecting the non-rotated parts on the right cube is more error-prone than detecting these parts on the left cube. In contrast, for Level 4.2.c1 and 4.2.c2, it is easier to recognize the non-rotated elements of the right cube, because only one side of the cube shows two rotated elements (the other two sides each show only one rotated element) and these elements are crossed. Therefore, they are obviously recognizable as rotated elements and can easily be distinguished from the non-rotated elements (Figure 22). Hence, it might be that if participants focused more on the right cube than on the left one, they could solve the items correctly more easily, because they could reduce the error-prone switching between the left and the right cube.
The descriptive analyses of the entropy values of Level 4.2.p and 3.2.p showed, in eight of ten items, lower values for the incorrectly solved items (Level 4.2.p: .05 ≤ M_correct − M_incorrect ≤ .09; Level 3.2.p: .05 ≤ M_correct − M_incorrect ≤ .15). This result suggests that participants with a more chaotic viewing pattern had a higher chance of solving these items correctly. However, due to the small samples of incorrectly solved items, this pattern has to be interpreted with caution. Only the logistic regression model of Level 4.2.c2 was significant, χ²(1) = 4.39, p < .05; b = −0.30, p < .05, OR = 0.74, [0.55, 0.98], indicating that lower entropy (more systematic viewing patterns) predicts a higher probability of solving the item correctly. This pattern was reversed for Level 4.2.p and 3.2.p. The differences in the mean entropy values of the single items were .04 ≤ M_incorrect − M_correct ≤ .09. The order of the entropy values of the incorrectly and correctly solved items for Level 4.2.c1 was ambiguous.
The different results of Level 4.2.p and 3.2.p compared to Level 4.2.c2 might be explained in the same way as before. For Level 4.2.p and 3.2.p, it might have been more promising to solve the items correctly by switching -maybe in a chaotic way -between the left and the right cube. For the most difficult level, Level 4.2.c2, a more systematic switching between the left and the right cube might have been more successful.
The descriptive analysis of the bi-grams of the medium levels (Level 4.2.p and 3.2.p) showed no systematic difference between correctly and incorrectly solved items. In contrast, for some of the bi-grams, significant results could be found for the most difficult levels: for Level 4.2.c1, Rotation of the second element, χ²(1) = 6.75, p < .01; b = 0.40, p < .05, OR = 1.49, 95%-confidence interval = [1.10, 2.06]; Comparison II of the first element, χ²(1) = 4.91, p < .05; b = 0.33, p < .05, OR = 1.40, [1.04, 1.93]; Comparison II of the second element, χ²(1) = 11.27, p < .001; b = 0.47, p < .001, OR = 1.60, [1.22, 2.12]; and for Level 4.2.c2, Comparison I of the second element, χ²(1) = 10.87, p < .001; b = 0.43, p < .01, OR = 1.53, [1.19, 1.99]. All significant models suggest that the occurrence of the respective bi-grams raises the probability of solving the item correctly. The descriptive analysis considering all bi-grams suggested a general effect, namely that the bi-grams can differentiate between correctly and incorrectly solved items for Level 4.2.c1 and 4.2.c2. These results seem to be independent of the specific value-level of the relative frequency of the single item. They suggest that if the cognitive processes indicated by the bi-grams are performed by the participant, the respective item has a higher chance of being solved correctly. Although the absolute values differed between items and can be interpreted as item-specific, the differences between correctly and incorrectly solved items suggest a type-specific pattern. However, for some items, the absolute values are generally low for incorrectly as well as for correctly solved items.

General discussion
Eye tracking based measures can deliver additional information about how participants solve a certain test item that goes beyond the information gained by accuracy and reaction times. Therefore, the goal of the present studies was to analyze the potential of different eye tracking measures to indicate specific cognitive processing steps, which might be able to predict item answers. However, in a first step, type- and item-specific characteristics had to be disentangled. The studies were conducted in the well-established domain of spatial thinking. A new test, the R-Cube-Vis Test, was developed and validated (Study 1 to 3). The test measures the main factor of spatial thinking, visualization. It overcomes the drawbacks of established visualization tests with respect to the usage of eye tracking. Study 4 demonstrated the application of these stimulus materials with various fixation-based measures.

Development and validation of the R-Cube-Vis Test
Existing tests of visualization, such as the Paper Folding Test (PFT, Ekstrom et al., 1976a) or the Mental Rotation Test (MRT, Vandenberg & Kuse, 1978), comprise a small number of heterogeneous and complex items. Relevant features overlap, and the tests are administered as paper-pencil versions. These properties hamper the ease of use of eye tracking. Therefore, a new visualization test, the R-Cube-Vis Test, was developed to overcome these drawbacks and be suitable for eye-tracking techniques. The items of the R-Cube-Vis Test were rationally constructed in accordance with the definition of visualization (Carroll, 1993). The R-Cube-Vis Test consists of six difficulty levels with 24 items (twelve possible and twelve impossible) per level in the long version and ten items (five possible and five impossible) per level in the short version. Four accuracy-based measures were considered in both versions: accuracy over all items (ACC-all), accuracy over all possible items (ACC-poss), weighted accuracy (wACC), and the Area Under the Curve (AUC). All measures delivered comparable results in the validation studies, but ACC-poss and wACC were the only measures that could differentiate between the two most difficult levels.
The reliability estimates based on Cronbach's alpha for the long version (Study 1) of all measures were very good. Lower reliability estimates were found in the short versions (Study 2 and 3) but were assumed to be an artefact caused by the binary format (Brogden, 1946a, 1946b; Sun et al., 2007) combined with the small number of items.
The validity of the R-Cube-Vis Test was considered according to the standards formulated by American Educational Research Association et al. (2014). The content-oriented evidence of validity of the R-Cube-Vis Test is given by its construction according to the definition of visualization. There is also evidence of validity regarding cognitive processes assuming that the participants mentally rotated the single elements to solve the items. This assumption is additionally supported by the results of the analyzed bi-grams (fixation sequences based on two fixations) in Study 4. Here, three kinds of bi-grams were considered that might indicate the back-rotation of the rotated elements as well as comparisons between one side of a single rotated element and the corresponding parts of the not-rotated elements. The cognitive processes presumably indicated by these bi-grams were assumed to be necessary to solve the items in the intended manner with the so-called visualization strategy. For the two most difficult levels it could be shown that such cognitive processes indicated by those bi-grams occur more frequently for the correctly solved items than for the incorrectly solved items. This supported the evidence of validity regarding cognitive processes.
Convergent and discriminant evidence of validity could be shown in Study 1 and 3 for all measures regarding the correlations with the PFT for visualization, the chronometric test (CT, Jansen-Osmann & Heil, 2007), the Würfelaufgaben (form A) from the Intelligenz-Struktur-Test (IST-WA, Amthauer et al., 1999), and the Cube Comparison test (CC, Ekstrom et al., 1976a) for spatial relations as well as the MRT in the middle of the continuum (Pellegrino et al., 1984). In both studies, all correlations with the PFT were strong and descriptively larger than the correlation between the PFT and MRT. The correlations with the CT were weak to medium. The correlations with the MRT and the second spatial relations test (IST-WA in Study 1 and CC in Study 3) lay in between and were comparable with each other. Whereas the correlations with the MRT were expected because the MRT is located in the middle of the continuum between visualization and spatial relations, the correlations with the second spatial relations test were descriptively larger than expected. The original assumption was that the correlations would be the same as with the other spatial relations test, CT. The reason for the stronger correlations with the second spatial relations test might be that these tests (IST-WA and CC) are closer to the visualization end of the continuum than the CT. This idea is supported by the descriptively larger correlations of these tests with the PFT and MRT in Study 1 and 3. However, the correlation patterns of all measures confirm the expected patterns regarding convergent and discriminant evidence of validity. The weaker convergent and discriminant evidence of validity in Study 2 can be explained by the testing procedure. Changing the testing procedure in Study 3 revealed the expected correlation pattern. The differences between the four accuracy-based measures are small and only descriptive. A potential effect must be tested in a follow-up study.
One important requirement for the R-Cube-Vis Test with respect to the ease of use of eye tracking was the creation of many homogeneous items that should be equal regarding their difficulty on the measured ability dimension. Therefore, all items from one homogeneous group should be on the same difficulty level. To examine the evidence of validity of the internal structure, the R-Cube-Vis stimulus materials were tested for conformity with the linear logistic test model (LLTM) for the dichotomous measure ACC-poss and for the three splits of wACC. The splitting was necessary because of the bi-modal distribution of the single values of each item. The LLTM parameters indicated characteristics of the cubes that were potentially relevant for solving the respective items. The results presented strong evidence for the one-dimensionality of the testing materials as well as the distinct difficulty levels and, hence, convincing evidence for the validity of the expected internal structure. Another argument for the interchangeability of the items within each of the six difficulty levels is provided by the results of the randomly drawn item samples from the long version that were used for the short versions tested in Study 2. The two selected item samples had the highest and the lowest correlation with the PFT, respectively, based on the data in Study 1. Both item samples had five of 30 items in common, and the two samples reversed the order of the magnitude of their correlations with the PFT in Study 2. However, the results were limited, because in each of the three studies (Study 1 to 3) the sample size was too small to test LLTM conformity based on the likelihood tests. The fact, however, that all likelihood tests in all studies were not significant can be seen as cumulative evidence for the model conformity with the RMs and LLTMs.
Furthermore, small sample sizes like in the present studies lead to low statistical power, but they deliver correct β-estimations (see Hohensinn et al., 2014; MacDonald & Kromrey, 2011). Further criteria such as the person separation reliability (PSR), the information criteria, and the visual inspection of β-estimation plots provided additional evidence for model conformity. Hence, all criteria together strongly support the expected model fits for the R-Cube-Vis Test.
The small correlations with the grades in German and Mathematics were comparable with the small correlations between these criteria and the PFT and MRT in all studies. All studies confirmed that the stimulus materials were unbiased with respect to previous experience with Rubik's cubes.
The various aspects of evidence of validity strongly support the R-Cube-Vis Test, in its long and short version, as a valid test for visualization that is distinguishable from spatial relations and can be placed at the corresponding end of the continuum of these two factors according to Pellegrino et al. (1984). Furthermore, the R-Cube-Vis Test fulfills the four requirements for the usage of eye tracking. The test by construction contains six homogeneous item groups, which is supported by its conformity with the LLTM. The items were created in such a way that all relevant visual features are separable and that the items are simply structured, showing only two figures. Finally, the test is computer-based. However, further research is advisable with larger samples from different populations and with alternative external criteria.

Eye Tracking measures as source of information of the R-Cube-Vis Test
In Study 4, eye tracking data were analyzed utilizing the short version of the R-Cube-Vis Test. At first, type- and item-specific characteristics of gaze patterns were disentangled to discriminate between those parts of the gaze patterns that were caused either by the demanded cognitive processing steps to solve the item or by visual item features that were independent of the necessary transformation steps. Secondly, the predictive power with respect to item accuracy was considered. The analyzed eye tracking measures were taken from four fixation-based Indicator-Classes. Two of these Indicator-Classes are based on fine-scaled Areas of Interest (AOIs), either on single cube features (Indicator-Class 1) or on fixation sequences on these features (Indicator-Class 4). The other two Indicator-Classes cluster more information together in large-scaled AOIs, either as whole cubes (Indicator-Class 2) or by summing up fixations on these whole cubes over time as entropy (Indicator-Class 3). A general result of Study 4 showed descriptively that indicators using large-scaled AOIs (Indicator-Class 2 and 3) remained stable over different items from the same homogeneous group (type-specific) of the R-Cube-Vis Test, whereas indicators using fine-scaled AOIs (Indicator-Class 1 and 4) differed between items of the same item group (item-specific). This finding confirms the results of previous studies that also found stable results for large-scaled AOIs of complex items (e.g., Egan, 1979; Snow, 1980) and for small-scaled AOIs of simple items (e.g., Just & Carpenter, 1976) but failed with small-scaled AOIs for complex items (e.g., Ivie & Embretson, 2010). However, there were differences between the difficulty levels. For example, the fixations on the three sides of the left cube seemed to be clearly item-specific in Level 4.1, but they descriptively showed type- as well as item-specific characteristics in Level 3.2.p and 4.2.c1.
This is an important aspect if cognitive processing steps are to be modeled for the R-Cube-Vis Test. If a gaze pattern model is to describe the processing steps of successful solving behavior at a specific level, it would be important to define a base model that reflects the type-specific characteristics found here. However, such a model should also include item-specific characteristics for each single item, perhaps as an additive component. For example, the model of Level 4.1 would generally expect three fixations on the side that corresponds to the new color, but if that side is in the middle, the model would expect two additional fixations. Future studies are necessary to gather more detailed information about each fine-scaled AOI and about its type-specific and item-specific behavior.
Indicator-Class 4 contains the occurrence of bi-grams. Bi-grams are sequences of two consecutive fine-scaled fixations that are intended to indicate the cognitive processing steps necessary to solve the items. Like the indicators of Indicator-Class 1 described above, the bi-grams showed item-specific gaze patterns. However, for each single item, the bi-grams were able to differentiate between correctly and incorrectly solved items. For example, the defined bi-grams were found more often for correctly than for incorrectly solved items of Levels 4.2.c1 and 4.2.c2, although the absolute occurrence of these bi-grams differed between the items. Future studies might reveal further insights into the cognitive processes that are represented by the bi-grams. Thereby, bi-grams might be extended to n-grams with n > 2 that represent pre-defined and rationally motivated processing steps.
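Counting bi-grams can be sketched as follows, assuming fixations have already been mapped to fine-scaled AOIs; the AOI labels and the example sequence are hypothetical, and the same function generalizes directly to the n-grams (n > 2) mentioned above:

```python
from collections import Counter

def ngram_counts(fixation_aois, n=2):
    """Count each ordered run of n consecutive fine-scaled AOI fixations.

    With n = 2 this yields bi-grams as in Indicator-Class 4; larger n
    gives the n-grams suggested for pre-defined processing steps.
    """
    return Counter(zip(*(fixation_aois[i:] for i in range(n))))

# Hypothetical fine-scaled AOI sequence: repeated comparisons between a
# rotated element and its counterpart on the other cube
seq = ["left_top", "right_top", "left_top", "right_top", "left_front"]
bigrams = ngram_counts(seq)
print(bigrams[("left_top", "right_top")])  # → 2
```

The occurrence of a pre-defined bi-gram such as `("left_top", "right_top")` could then be compared between correctly and incorrectly solved items.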
Such analyses using various fixation-based indicators should be extended with more items (e.g., using the long version of the R-Cube-Vis Test) to gain more information about the solving behavior applied to these items. In this way, type- and item-specific fixation patterns of each homogeneous item group can be identified. Type-specific characteristics of fixation patterns would describe the gaze patterns that are necessary to solve the item. The item-specific characteristics could be included by using, e.g., an additive component specific to certain visual item features.
The second goal of Study 4 was to show the potential of fixation-based indicators to predict item accuracy. The number of fixations on the left cube, the entropy values, and the bi-grams indicating necessary transformation steps seemed especially promising. However, the results differed between the six difficulty levels. For example, at the levels of medium difficulty, focusing on the left cube seemed more expedient for solving the item than at the two most difficult levels.
For the two most difficult levels, focusing on the right cube was related to a higher solving probability. A corresponding pattern existed for the entropy values. One reason might be that in both levels with medium difficulty it is harder to recognize the rotated elements by analyzing only the right cube because of the parallel rotation (especially in Level 3.2.p). Therefore, fixating the left cube provides more valid information to solve these items. In contrast, in the two most difficult levels, it is easier to see the rotated elements because of the crossed rotation. Hence, fixations on the right cube provide valid information to solve the item. However, follow-up studies are needed to confirm these explanations. In particular, the cognitive processing steps associated with the entropy values might be analyzed more thoroughly.
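Such a predictive analysis can be illustrated with a minimal sketch using invented data: the point-biserial correlation between a fixation-based indicator (here, hypothetical entropy values) and binary item accuracy is simply the Pearson correlation computed with a 0/1 accuracy vector.

```python
import math

def pearson(x, y):
    """Pearson correlation; with a binary y this equals the point-biserial
    correlation between a continuous indicator and item accuracy."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# Invented per-trial data for one difficulty level: entropy of the
# fixation distribution and accuracy (1 = correct, 0 = incorrect)
entropy = [0.4, 0.9, 0.5, 1.0, 0.3, 0.8]
correct = [0, 1, 0, 1, 0, 1]
r = pearson(entropy, correct)
print(round(r, 3))
```

A positive r would correspond to the pattern reported for the medium difficulty levels, a negative r to the reversed pattern found for the two most difficult levels.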

Limitations
Studies 1 to 3 are limited with respect to their small sample sizes. On the one hand, this affects the interpretation of the correlations, which might deviate from the true relationships between the respective variables (see Schönbrodt & Perugini, 2013). On the other hand, the statistical power is too low for Andersen's likelihood ratio test of the IRT models. However, even with small sample sizes the parameter estimation is robust (see Hohensinn et al., 2014; MacDonald & Kromrey, 2011). Still, the results of all three studies were comparable, which supports the validity of the estimated correlations as well as the conformity of the tested IRT models. The results of Study 4 are largely exploratory and have to be interpreted with caution; they must be confirmed in follow-up studies.

Conclusion
The presented studies demonstrate the potential of eye tracking as a promising technique for revealing information about the cognitive processes involved in solving visual-spatial test items in the context of spatial thinking. Eye-tracking measures provide information beyond that offered by accuracy and reaction times. However, eye-tracking techniques can only be applied successfully if the stimulus materials are appropriate.
All considered fixation-based indicators differ between the six levels, not only in their absolute values, but also in their indication of correctly solved items. For example, higher entropy values of the medium difficulty levels predict a correctly solved item, whereas correctly solved items in the two most difficult levels are indicated by lower entropy values. Therefore, models of gaze patterns indicating specific cognitive processing steps have to be established separately for each of the difficulty levels.
If such gaze pattern models exist, they might be able to provide more detailed information about the cognitive steps that led to a certain answer. Based on information about how a participant attempted to solve an item, the success of solving the item can be predicted. This information can be utilized to improve the measurement in addition to accuracy and reaction times. On the one hand, such information could be used to create adaptive testing scenarios that include these gaze-related measures. On the other hand, it might even be possible to diagnose the participant's potential performance based on comparisons between the applied solving strategies and the optimal strategies. In this way, such a model could provide a differential diagnosis that could lead to a more sophisticated decision about a participant's ability.

Table: Results of Study 4, location of fixations. 'Comparison I' refers to the comparison between the part of the rotated element with the new color and the corresponding non-rotated element on the other side. 'Comparison II' refers to the comparison between the part of the rotated element with the old color and the corresponding non-rotated elements on the other side.