Cognitive state detection with eye tracking in the field: an experience sampling study and its lessons learned

In the future, cognitive activity will be tracked in the same way that physical activity is tracked today. Eye-tracking technology is a promising off-body technology that provides access to relevant data for cognitive activity tracking. For building cognitive state models, continuous and longitudinal collection of eye-tracking and self-reported cognitive state label data is critical. In a field study with 11 students, we use experience sampling and our data collection system esmLoop to collect both cognitive state labels and eye-tracking data. We report descriptive results of the field study and develop supervised machine learning models for the detection of two eye-based cognitive states: cognitive load and flow. In addition, we articulate the lessons learned during data collection and cognitive state model development to address the challenges of building generalizable and robust user models in the future. With this study, we contribute knowledge to bring eye-based cognitive state detection closer to real-world applications.


Introduction
Just as physical activity, such as steps per day, can now be easily detected by wearables, it should become possible to detect cognitive user states like cognitive load, flow, or mind wandering in the future. Cognition in general refers to the mental processes related to the acquisition, organization, and use of knowledge, covering attention, memory, reasoning, decision making, and problem solving. 1,2 Recently, there has been a call for research aimed at automatically detecting cognitive user states from on-body and off-body technologies. 3,4 Following the paradigm of biosignal-adaptive systems described in Schultz and Maedche, 5 the detection of cognitive user states based on biosignal data would make it possible to design interactive systems that adapt to the user's current needs, ultimately improving the user's performance and well-being. For example, in learning, the ability to detect cognitive states such as cognitive load, mind wandering, flow, or situation awareness can be used to personalize learning content for the learner. 6,7 In the workplace, this capability can help to design work environments that enhance employee performance and well-being (e.g., through flow-adaptive notification management, load-adaptive task assignment, adaptive video meeting systems 8 or visual attention feedback 9,10). By recognizing the user's cognitive state, interactive systems can not only better adapt to their users in a specific situation, but may also help users learn from their past cognitive states in relation to their behavior. For example, users could adjust their work schedules based on identified patterns in the individual cognitive demands of specific tasks.
In particular, commercial off-the-shelf (COTS) eye trackers are a promising off-body technology that can provide access to cognitive user states. 6 Advances in eye-tracking technology combined with supervised machine learning (ML) have demonstrated that it is possible to identify cognitive user states based on collected eye data. 6,11,13,14 However, the resulting cognitive state ML models have the drawback of requiring data collected in highly controlled environments. To move these cognitive state models out of the laboratory and into real-world applications, the models must not only be accurate, but also robust and generalizable across tasks, users, and environments. To achieve high generalizability and robustness, data and labels must be collected for a variety of tasks and from users in different environments, ideally continuously over time. So far, only a few studies have investigated the development of eye-based cognitive state ML models outside the laboratory or used eye-tracking data recorded over a longer period of time. 6,15 Thus, this remains an important research gap.
The Experience Sampling Method (ESM) is an established method for building user models and collecting a variety of data at random times and locations. 16 In ESM studies, participants are interrupted randomly, event-based, or interval-based to complete a self-report survey, typically in the form of questionnaires. 17 ESM is therefore a common method for systematically capturing people's activities, emotions, and thoughts during their daily lives based on self-reported information. 18 In addition, data from sensors such as GPS, gyroscope, temperature, air pressure, or electrocardiography are continuously collected and later correlated with the self-reported survey data. 19 ESM can also be used in combination with eye tracking to collect self-reported cognitive state data, also called labels in this context, to develop more robust, accurate, and generalizable eye-based cognitive state models. Such labels, in combination with eye-tracking data, serve as input for supervised ML algorithms to predict the corresponding cognitive state of the user. Since eye trackers are sensitive to changes in the environment and eye movements are task dependent, guidelines on how to apply eye tracking in ESM studies and develop eye-based cognitive state models based on label data collected in the field are crucial. However, such guidelines are lacking, and existing ESM knowledge should be extended to an eye-tracking context.
In this paper, we investigate whether it is feasible to use an ESM approach to collect cognitive state labels and eye-tracking data in the field as a foundation for the development of accurate, generalizable, and robust cognitive state models. We conduct an exploratory longitudinal ESM study to collect eye-tracking data in the field and develop eye-based cognitive state models on this basis. First, we introduce our experience sampling-based system, esmLoop, which is designed to collect eye-tracking data continuously and cognitive state labels randomly. We then outline the various steps required in preparing and conducting our ESM field study, which involved collecting data from 11 students working on their thesis project over 5 days using esmLoop. We provide detailed insights into the participants' interaction with esmLoop during the field study and their opinions and needs regarding this system, as articulated in post-study interviews. We focus on two exemplary cognitive states for supervised ML model development, namely cognitive load and flow. We first develop supervised ML models following a classification- and regression-based approach for both cognitive states, using all collected label and eye-tracking data from different window sizes. These models do not significantly outperform baseline models, which highlights the challenge of developing eye-based ML models that are generalizable across tasks and participants, even though we collected a reasonable number of labels for different tasks. Subsequently, we focus solely on the labels obtained during writing tasks to investigate the feasibility of building task-focused cognitive load and flow models based on eye-tracking data collected in the field. These writing task-focused models ultimately outperform the baseline models, demonstrating the feasibility of developing accurate and robust cognitive state models that are generalizable across participants using eye-tracking data collected in the field. Finally, we present the lessons learned during ESM data collection and cognitive state model development to address our challenges in building generalizable and robust models in the future. By sharing our experiences and providing the aggregated data and analysis scripts according to the open science paradigm, we address the research gap of using ESM to collect eye-tracking data and cognitive state labels in the field and of developing eye-based cognitive state models on that basis, thus contributing to the field of eye-tracking research. We believe our experience will facilitate the integration of eye tracking and eye-based cognitive state detection into real-world applications. Additionally, it can aid researchers in designing better ESM studies and developing cognitive state models in the future.

Eye-based recognition of user states: cognitive load and flow
The saying "the eyes are the window to our soul" highlights that the eyes provide more information about us humans than just visual attention. 20 Eye tracking is a technology that can provide access to much more user information, in particular with regard to cognitive user states. According to the eye-mind hypothesis by Just and Carpenter, 21 fixations directly reflect what humans are cognitively processing. In recent years, eye-tracking technology has advanced considerably, especially in terms of robustness, so that it has made its way out of research laboratories into real-world applications. 22 With advances in computing power and machine learning algorithms, new approaches to eye-tracking data analysis have become possible. 24,25,26 In particular, fixation-, saccade-, and pupil-based features are used in both low-level and high-level (AOI-based) gaze features. A recent trend is the investigation and detection of user traits and characteristics as well as cognitive and affective user states using eye-tracking technology. Research has focused on the detection of user characteristics such as personality, working memory, and field dependence based on eye-tracking data. 24,27,28 Studies targeting the recognition of affective user states using eye tracking focus mainly on the arousal and valence dimensions of emotions or on Ekman's discrete emotions. 29,30 Typical cognitive user states studied using eye-tracking technology are cognitive load and mind wandering. 6,11,31
In our study, we specifically focus on two cognitive states: cognitive load and flow. While cognitive load has already been researched with eye-tracking technology, flow has, to the best of our knowledge, not yet been investigated using eye-tracking technology. Cognitive load refers to how many mental resources are currently occupied and is typically captured by the NASA TLX. 32 The NASA TLX covers cognitive load in terms of six dimensions: mental demand, physical demand, temporal demand, performance, effort, and frustration. Kahneman and Beatty 33 established the link between pupil size and cognitive load, and several studies further investigated this link. 34,35 Many studies that predicted cognitive load using machine learning rely heavily on pupil-based metrics such as pupil dilation or blinking. 11,14,25 However, pupil size depends not only on the person's current cognitive load, but is also influenced by environmental factors such as the ambient light conditions. Therefore, cognitive load recognition in the field requires more elaborate approaches than just measuring pupil size. Recent publications specifically investigated cognitive load recognition including further typical eye-tracking features such as fixations, saccades, or microsaccades. 3,36,37 In general, to increase robustness and generalizability, the feature set should be extended beyond pupil-based metrics.
The flow state refers to a state of mind that people experience when they act with total involvement. 38 Antecedents of flow are clear goals, unambiguous feedback, and a task challenge that matches the person's skills (skill-challenge balance). 39 Characteristics of being in flow are a strong focus on the task and a feeling of control, the merging of action and awareness, a loss of self-consciousness, and a transformation of time. 39 Flow theory is tightly connected to attention and information processing theory because attention plays a critical role in achieving flow, as it determines what we perceive. Furthermore, attention is a necessary condition for the subsequent mental processes and events involved in flow. 38 Therefore, we argue that eye tracking can be a suitable technology to detect flow continuously and in real time. So far, flow has been investigated using biosensors that rely on EEG and ECG technology. 40,41 However, despite the theoretical grounding, the detection of the flow state using eye-tracking technology has, to the best of our knowledge, not been evaluated so far.
In this study, we continue the line of previous research by harnessing the power of eye-tracking data collected in the field to recognize cognitive user states. We not only focus on recognizing cognitive load, but also examine the recognition of flow using eye-tracking data. In addition, our research broadens the scope of cognitive state recognition by considering a wide range of tasks collected across several users.

Data collection methods: experience sampling method & ecological momentary assessment
The Experience Sampling Method (ESM), also known as Ecological Momentary Assessment (EMA), is a diary method for systematically obtaining self-reports from people about what they do, feel, and think during activities in their daily lives. 18 While ESM primarily focuses on representativeness, EMA focuses more on the momentariness of the recorded survey data. However, there is no strict difference between the two methods. 16,42 Participants in ESM studies are typically given a pager, pen and paper, a mobile device, or a PC-based application that interrupts them at random points in their daily lives and displays a self-report questionnaire capturing their current experiences. 18,44,45 In addition, biosignal data collected along with the self-reported data can be used to detect changes in affective and cognitive user states. 18,46 A common approach is to analyze the biosignal data, such as ECG, EEG, or eye-tracking data, collected just prior to questionnaire administration in the context of the self-reported affective or cognitive state questionnaire responses. 44 Typically, the biosignal data from a given time window (e.g., 5 s, 30 s, 1 min, or 3 min) is aggregated to a specific metric, such as mean heart rate variability, mean alpha power, or mean fixation duration. 41,47 By correlating this data with the affective or cognitive state labels, or by using this data along with the labels as input to an ML classifier, insights into the user's cognitive or affective states can be gained. Another common approach is to examine and compare multiple time windows of biosignal data for their influence on the explainability of the self-reported data. 47,48
ESM has the advantage that self-report data can be collected in the natural environment, immediately during the experience, and for a range of different experiences. 19 However, ESM also places a high burden on participants, as they are interrupted from their ongoing activity several times a day. 44 Previous studies also show that participants experience fatigue during data collection, as the questionnaires in ESM studies are typically repetitive. 19 If burden and fatigue are too high, the risk of dropout increases, which may ultimately affect data quality or the comparability of data from different participants. 49 Therefore, low-effort solutions, such as collecting the self-reported data directly at the task rather than switching to a smartphone or pen and paper, are key to reducing burden. Furthermore, combining an ESM study with biosignal data collection gives researchers access to more data sources. However, there is a lack of guidelines and tools for conducting ESM studies in combination with biosignals, even though biosignal recording is more susceptible to external influences during data collection.

Field study
The goal of this field study is to explore the collection of cognitive state labels and eye-tracking data in the field and to leverage the collected data to build models for two cognitive states, cognitive load and flow. As a foundation for our study, we developed an experience sampling-based system called esmLoop. It supports collecting cognitive user state labels, eye-tracking data, and interaction data in the field.

The data collection system esmLoop
To conduct the field study, we developed esmLoop, a PC-based desktop application that supports collecting the data sets necessary for the development of supervised ML cognitive user state models. The user interface of esmLoop is depicted in Figure 1. In a first step, esmLoop guides the user through the process of setting up the eye tracker. When starting esmLoop, the start screen reminds the user to set up the display for the eye tracker (1). Subsequently, it requests that the user calibrate the eye tracker, as calibration is key to ensuring high data quality (2). Furthermore, the user has to select the location where the recorded data is stored (3). Currently, esmLoop integrates the Tobii eye tracker 4C with the required research license that provides access to the Tobii Pro SDK. For the calibration of the Tobii 4C, we rely on the standard 7-point calibration provided by the Tobii 4C driver.
Once the user starts an experience sampling session by clicking the start button, data is recorded until the session is terminated by the user. Users have full control over starting and ending the recording of the data (4). Furthermore, the user can always reach the system through the icon on the task bar, as shown in Figure 1c.
In terms of data collection, esmLoop records raw gaze data and pupil size during the experience sampling sessions. In addition, whenever the user switches to another application, the title of the active window is recorded together with a timestamp. esmLoop issues a questionnaire at a random point in time every 20-60 min. A message box pops up and asks whether the user is available to answer a short experience sampling questionnaire. If so, the user is forwarded to the questionnaire window shown in Figure 1b. The questionnaire window is divided into two areas. The top area provides transparent information about the collected labels and data: the user sees how many labels and megabytes of data were collected on the current day and overall. The bottom area shows the Likert-scale questions of the experience sampling questionnaire. In this study, we presented participants with adapted versions of the Flow Short Scale and the NASA TLX questionnaire 32,50 on a 7-point Likert scale. Furthermore, we also asked which task had just been interrupted.

Study design
The study design went through the institutional review process in terms of ethics and data security and was approved prior to the study. This study focused on university students working on their final thesis projects. We selected thesis work because students continuously work on their thesis projects for several hours a day over several weeks, most phases of a thesis have to be completed at a computer, and it involves a varied but limited set of tasks. In addition, students experience higher levels of cognitive load during thesis work because the goal of the thesis is usually to work on a complex task or problem. Thesis work also typically fulfills the requirements established by Nakamura and Csikszentmihalyi 39 of meeting the skill-challenge balance, having a clear goal, and experiencing unambiguous feedback about the progress of solving the tasks.

Participants
We recruited eight Bachelor and four Master students (12 participants in total; 4 female, 8 male) with an average age of 25.03 years (SD = 3.15 years) who were invited through a university experimental lab panel and were working on their thesis project. To be eligible for participation, students had to work for at least 4 h per day on their thesis project for the duration of the study, and they had to use their own PC or notebook for the data collection. To provide some flexibility, students could select a minimum of 5 days within a time frame of 7 days for the data collection. Furthermore, to increase external validity, we allowed participants to use their personal computers and collect data at any location (e.g., at home, in the library, or in a student room) where they would also work under normal circumstances. Participants received 100 € as compensation for study participation.
Later, we excluded one participant (female, P12) due to technical issues with the eye-tracking software driver during the experiment and continued with the remaining 11 students. Two of the participants had previous experience with eye-tracking technology. All participants had normal or corrected-to-normal vision, except one participant who had one glass eye. Monitor setups varied from single-monitor to dual-monitor setups, with screen resolutions between 1920 × 1080 pixels and 3000 × 2000 pixels. If participants had a dual-monitor setup, we required them to install the eye tracker on the monitor they themselves judged to be their main monitor.

Procedure
To execute the study, we first invited all participants to an introduction workshop to introduce them to the study design, cognitive load and flow measurement, and eye-tracking technology. In this workshop, participants installed the required esmLoop software and the eye tracker on their own private computer together with the experimenter. To validate the correctness of the setup, they ran through the daily procedure of an experience sampling session in esmLoop as an example, including the setup and calibration of the eye tracker. After the introduction workshop, participants were ready for the actual field study and could start with the first experience sampling session.
The total duration of all experience sampling sessions per day was required to exceed 4 h. This time could be split into several sessions, but each session had to last at least 60 min so that participants could reach flow during that session. We chose the 60 min minimum because many students follow a time-boxing technique such as the Pomodoro technique, 51 meaning that after 1 h of focused work they take a 5-10 min break.
Participants were asked to calibrate the eye tracker and check that the data recording worked before they started the session, as shown in Figure 2. During the session, they were interrupted every 20-60 min at a random point in time by the experience sampling questionnaire. If the questionnaire prompt popped up at an inconvenient moment, e.g., during a video meeting, they were allowed to postpone the questionnaire. Postponing the questionnaire started a new 20-60 min cycle. At the end of an experience sampling session, participants had to terminate the session in the esmLoop system. Once they had collected more than 4 h of data on that day and finished the last session, participants had to upload the recorded data to a cloud drive in order to share the data for analysis. This procedure was repeated on at least 5 days of the 7-day study time frame. At the end of the study, participants joined a final interview, which was conducted in a semi-structured format and took 30 min. In this interview, we asked about their experiences with the esmLoop system during data collection and labeling, about how they experienced cognitive load and flow during the study, and about their perceived privacy and experiences with the eye tracker.
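The prompting cycle described above can be sketched in a few lines. This is a minimal illustration and not the esmLoop implementation: the uniform draw between 20 and 60 min, the function names, and the way postponed prompts are passed in are all assumptions made for the example.

```python
import random

MIN_GAP_MIN = 20  # lower bound of the prompt interval reported in the text
MAX_GAP_MIN = 60  # upper bound of the prompt interval reported in the text


def next_prompt_delay(rng):
    """Draw the waiting time in minutes until the next prompt.

    Assumption: the delay is uniformly distributed; the paper only states
    that prompts occur at a random point in time every 20-60 min.
    """
    return rng.uniform(MIN_GAP_MIN, MAX_GAP_MIN)


def simulate_session(session_minutes, postponed, rng):
    """Return a list of (minute, outcome) prompt events for one session.

    postponed: set of 0-based prompt indices that the user postpones
    (hypothetical input used only to drive the simulation).
    """
    events = []
    clock = next_prompt_delay(rng)
    index = 0
    while clock <= session_minutes:
        outcome = "postponed" if index in postponed else "answered"
        events.append((clock, outcome))
        # Whether answered or postponed, a fresh 20-60 min cycle starts.
        clock += next_prompt_delay(rng)
        index += 1
    return events
```

Under these assumptions, a 4 h session always yields at least four prompts, because no gap exceeds 60 min.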

Data processing & modeling
Figure 3 visualizes the data processing steps that were undertaken to develop and evaluate the cognitive state models. The eye-tracking data collected by the esmLoop software was pre-processed to extract fixations and saccades using the Pygaze Analyzer by Dalmaijer et al. 52 First, we filtered the raw gaze data for gaze points that were marked as valid for both eyes and were located on the screen. Then, we calculated the average of the left and right eye, as the Pygaze Analyzer only takes one X- and Y-coordinate as input for saccade and fixation calculation. For the participant with one glass eye, we considered only valid data. We set the minimum duration threshold for a fixation to 50 ms. As we observed several outliers in fixation duration and saccade duration, we applied outlier detection (IQR > 1.5) and removed the detected fixation and saccade outliers. In addition, we normalized the pupil size per session using min-max normalization to account for varying pupil size between sessions, as pupil size depends heavily on the light conditions of the environment. After calculating the fixations and saccades, we extracted the features based on the fixations, saccades, and pupil size for different window sizes (1, 2, 3, 4, 5, 10, 20 min) before the questionnaire was issued. For example, for the one-minute window size, we only considered the eye-tracking data collected during the 1 min before the software administered the questionnaire and calculated the features from that time window. Exploring multiple window sizes is a common approach in the development of eye-based models, 27,48 as eye-tracking data has a spatial and temporal dimension. We calculated the following features: fixation count, fixations per second, total fixation duration, mean fixation duration, saccade count, saccades per second, total saccade duration, saccade amplitude, saccade velocity, saccade acceleration, saccade absolute angle, saccade relative angle, saccade-fixation ratio, fixation-saccade ratio, and pupil diameter. We also calculated statistical features such as mean, median, standard deviation, minimum, maximum, skew, and kurtosis where applicable. In total, we calculated 52 features (9 fixation-based, 36 saccade-based, 5 pupil-based, and 2 ratio-based) as input for the model development. To define the cognitive load and flow state labels of participants, we averaged their answers on the 7-point Likert scale for the 5 retained items of the NASA TLX questionnaire (excluding physical demand) and for the first 10 items of the Flow Short Scale. We excluded the physical demand dimension of the NASA TLX questionnaire because thesis writing does not vary in terms of physical demand. For the flow labels, we focused on the first 10 items of the Flow Short Scale, as these represent the flow experience, while the last three items focus on concerns.
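To make the pre-processing steps above more concrete, the following sketch shows outlier removal, per-session min-max normalization, and window-based feature extraction for a handful of fixation features. It is an illustration under stated assumptions, not the published analysis script: fixations are represented as hypothetical `(start_s, duration_s)` tuples, the outlier rule is interpreted as Tukey's 1.5 × IQR fence, and all function names are invented for this example.

```python
from statistics import mean, quantiles


def remove_iqr_outliers(fixations, k=1.5):
    """Drop fixations whose duration lies outside [Q1 - k*IQR, Q3 + k*IQR].

    Assumption: the reported outlier detection corresponds to Tukey's rule.
    """
    durations = [duration for _, duration in fixations]
    q1, _, q3 = quantiles(durations, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [(start, d) for start, d in fixations if low <= d <= high]


def min_max(values):
    """Scale values to [0, 1], e.g. the pupil sizes of one session."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def window_features(fixations, prompt_time_s, window_min):
    """Aggregate fixations that started in the window before the prompt."""
    window_s = window_min * 60
    start = prompt_time_s - window_s
    durations = [d for t, d in fixations if start <= t < prompt_time_s]
    if not durations:
        return None  # no usable gaze data in this window
    return {
        "fixation_count": len(durations),
        "fixations_per_second": len(durations) / window_s,
        "total_fixation_duration": sum(durations),
        "mean_fixation_duration": mean(durations),
    }
```

Calling `window_features` once per window size (1, 2, 3, 4, 5, 10, 20 min) for each questionnaire prompt would yield one feature row per label and window.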
The processing and modeling Python scripts, including the aggregated data frames of all above-mentioned eye-tracking features and cognitive state labels, can be downloaded here. 53 The following steps can also be found in the Python scripts for the classification and regression. We decided to follow a binary classification approach, as was done in previous eye-tracking studies (e.g., 23,27,54,55), and also a regression-based approach, as Likert-scale data is suitable for regression models. The following paragraphs describe the modeling steps conducted in the Python scripts.
As the first step, we applied the SelectFromModel function of the sklearn package 1 to reduce the set of features and thereby avoid over-fitting of the models. To increase generalizability and robustness to unseen data, we decided on a leave-one-out cross validation (LOOCV) at the participant level. This means that we used the data of 10 participants for training and fitting the model and then evaluated the model's performance on the data of the 11th participant. This procedure was repeated 11 times so that each participant's data served once as the test data set. As the next step, we split the data into a training and a test data set. Since the cognitive load and flow labels were not evenly distributed across the binary classes (high vs. low cognitive load, flow vs. no flow), we also tested oversampling the minority classes using the Synthetic Minority Oversampling Technique (SMOTE) function of the imblearn package 2 for the classification models. Then, we conducted hyper-parameter tuning using grid search with 10-fold cross validation on the data set of 10 participants. After model fitting, the performance of the model was evaluated using the data set of the 11th participant. We developed the classification models using XGBoost (XGB), Random Forest (RF), and CART Decision Tree (DT) algorithms of the sklearn package and compared them to the baseline majority class model (Base) for various window sizes (1, 2, 3, 4, 5, 10, 20 min). To calculate the overall performance of the classification models, the F1-Score and Area Under Curve (AUC) scores of all 11 folds of the LOOCV were averaged. We decided to use these two metrics for evaluation as they reflect the accuracy, generalizability, and robustness of a model. For the regression models, we used the Linear Regression (LR), Decision Tree Regressor (DTR), and Random Forest Regressor (RFR) algorithms of the sklearn package and the XGB Regressor (XGBR) algorithm of the XGBoost package 3 and compared them to a baseline model that always predicts the mean of the cognitive state labels of the test user. The overall performance of the regression models was evaluated by averaging the mean squared error (MSE) and R² across all 11 folds of the LOOCV. This procedure was repeated 10 further times until all participants' data served once as a test data set.
1 https://scikit-learn.org/
2 https://pypi.org/project/imblearn/
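The participant-level cross validation and the regression baseline can be sketched with the standard library alone. This is a simplified illustration, not the authors' script: feature selection, SMOTE, grid search, and the actual learners are omitted, the sample layout `(participant_id, features, label)` is hypothetical, and only the mean-of-test-user baseline with its MSE is computed.

```python
from statistics import mean


def leave_one_participant_out(samples):
    """Yield (held_out_id, train, test) splits, one per participant.

    samples: list of (participant_id, features, label) tuples
    (hypothetical layout chosen for this example).
    """
    participants = sorted({pid for pid, _, _ in samples})
    for held_out in participants:
        train = [s for s in samples if s[0] != held_out]
        test = [s for s in samples if s[0] == held_out]
        yield held_out, train, test


def mse(y_true, y_pred):
    """Mean squared error between two equally long sequences."""
    return mean((t - p) ** 2 for t, p in zip(y_true, y_pred))


def baseline_mse_per_fold(samples):
    """MSE of a baseline that predicts the test user's mean label."""
    scores = {}
    for pid, _train, test in leave_one_participant_out(samples):
        y_true = [label for _, _, label in test]
        baseline = mean(y_true)
        scores[pid] = mse(y_true, [baseline] * len(y_true))
    return scores
```

A trained regressor would be fitted on `train` inside the loop and scored on `test` with the same `mse` function, so that its fold-wise errors can be averaged and compared against the baseline.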

Descriptive data

Recorded eye-tracking data
In total, more than 250 h (about 15,000 min) of data were collected from the 11 participants in our field study. As shown in Table 1, we were able to collect approximately 186.84 h of valid eye-tracking data. The Tobii SDK provides a Boolean value for each eye separately, indicating whether the eye tracker was able to correctly calculate the gaze point for that eye. In total, the collected eye-tracking data was marked as valid for both eyes 59.54 % of the time (151.23 h, see Table 1), and at least one eye was marked as valid 73.56 % of the time, which is in line with a previous longitudinal eye-tracking field study. 56 It should be noted that invalid eye-tracking data can occur when the participant is not present, is not looking at the screen, is looking at another screen, or when technical problems arise. As this is a field study, it must be emphasized that participants may have left the computer for short breaks during data collection.
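The two validity shares reported above can be derived from the per-eye validity flags. The sketch below assumes that each gaze sample has been reduced to a hypothetical `(left_valid, right_valid)` pair of Booleans; it does not reproduce the Tobii SDK data structures.

```python
def validity_shares(samples):
    """Return the share of samples valid for both eyes and for at least one eye.

    samples: list of (left_valid, right_valid) Boolean pairs
    (hypothetical representation of the per-eye validity flags).
    """
    total = len(samples)
    both = sum(1 for left, right in samples if left and right)
    either = sum(1 for left, right in samples if left or right)
    return both / total, either / total
```

As a plausibility check on the reported figures, 151.23 h at 59.54 % implies roughly 254 h of total recording, and 73.56 % of that total is about 186.84 h.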
In Tables 2 and 3, we present descriptive data for selected eye-tracking features for the flow/no flow and high/low cognitive load labels. To account for the influence of window size, we report fixation duration (total fixation duration during the time window divided by window size), fixation count per second, saccade duration (total saccade duration during the time window divided by window size), saccade count per second, and mean normalized pupil size. The average of almost all fixation- and saccade-based metrics is higher for the flow/high cognitive load labels than for the no flow/low cognitive load labels. Moreover, the larger the time window, the smaller the differences in the features between the flow/high cognitive load labels and the no flow/low cognitive load labels.

Recorded label data
During this study, participants completed 293 experience sampling questionnaires about their current cognitive load, flow state, and current task.We observed a compliance rate of 98.75 %, which means that 98.75 % of all questionnaires distributed were actually answered by the participants, and only a small fraction of 1.25 % of the questionnaires were postponed.However, we did not incentivize the number of answered questionnaires.
The averaged responses of the cognitive load (NASA TLX) and flow state (Flow Short Scale) questionnaires are shown in Figure 4a and b.From these visualizations we can see that the participants experienced cognitive load and flow state differently during the study.
We normalized the averaged responses using a min-max normalization since the final interview results confirm that all participants experienced both high and low cognitive load as well as flow and no flow during the study.The normalized cognitive load and flow state distribution can be seen in Figure 5a and b.Finally, we considered a label as high cognitive load or flow if the normalized mean was greater than 0.5 (visualized by the red line) and otherwise as low cognitive load or no flow.
Table 4 shows the distribution of cognitive load and flow labels before and after normalization. Before normalization, we considered a label as high cognitive load or flow if the averaged questionnaire answers were greater than 4. Due to normalization, the label distribution moved closer to a 50:50 distribution.
In addition, we examined the tasks reported in the experience sampling questionnaires to explore what participants were working on during the experiment. Figure 6 visualizes how often each task was reported when participants were interrupted. Since we recruited participants who were working on their thesis project, the two most frequently reported tasks were writing a text (94 times) and literature research (87 times). Together, these two tasks account for more than half of all recorded task labels. Further reported categories included "other tasks" (30 times), proofreading (19 times), and creative tasks (19 times).

Recorded interaction data
To provide further context to the collected labels, we also recorded how long each application was the active window on the screen during the entire experience sampling session. Figure 7 shows all of the applications we tracked and the total amount of time each application was the active window on the screen. The analysis shows that the most used application was the internet browser (87.60 h). The second and third most used applications were Word (58.40 h) and Overleaf (38.38 h), reflecting the writing task of thesis projects. The "Other" block (29.29 h) represents all other applications, such as Spotify, which were not tracked individually to increase privacy and which may not be related to thesis project work.

Interview results
At the end of the study, each participant was interviewed for 30 min. After using esmLoop for 5 days, 91 % of participants (10/11) rated their experience with the software and data collection process as positive and experienced only minor problems while using esmLoop. Only one participant rated the experience as moderate due to connectivity issues with the eye tracker. In addition, P04, P06, and P08 particularly emphasized that the introductory workshop was helpful and supported them in familiarizing themselves with the study and the eye-tracking technology.
Regarding the labeling task, 36 % of the participants (4/11) stated that filling out the questionnaires was sometimes disruptive, while 18 % of the participants (2/11) did not feel interrupted in their work. For example, P03 said that the esmLoop questionnaire schedule fit her own schedule, as she was following the Pomodoro approach with 50 min of focused work and a 10 min break. P01 and P05 reported that their usual break schedule was partially affected by esmLoop, as they wanted to wait until the next questionnaire popped up before finally taking a break. Furthermore, four interviewees stated that the report of the number of collected labels (see Figure 1b) was useful for seeing progress in the data collection process and for staying motivated.
Regarding cognitive load, 36 % of the participants (4/11) experienced a high cognitive load several times during the study. 27 % of interviewees stated that most of the time they operated at a medium level of cognitive load. Participants defined cognitive load based on different dimensions, such as time pressure, focus, stress, or a challenging task. Five of the 11 interviewees regarded cognitive load as a spectrum, while one interviewee would define it as binary. Moreover, P09 stated that "for a good flow you need a high mental workload" and P06 reported that "when you have a task that requires a high mental workload, you definitely get into the flow state easily". All in all, these two participants related a high cognitive load to flow as either a requirement or a facilitator.
Regarding experiencing flow during the task, the interview results highlight that participants experienced flow differently during this study. The frequency and intensity of experiencing flow varied across the study's participants. Two participants (P02, P04) stated that they experienced more flow during the study than they would have anticipated. On the other hand, P05 disclosed that he was more often not in flow than in flow. Furthermore, five participants associated flow with productivity. P01 stated "I was already productive and noticed that I was productive and then when it was a very strong flow (...) then it was so that I didn't even think about it anymore". According to four participants, the timing of the questionnaire sometimes influenced their flow, as they were either interrupted in the middle of finishing a task or waited for the next questionnaire to pop up.
In terms of privacy and data protection, 72 % of interviewees (8/11) had no concerns during the study. At the beginning, three participants felt observed due to the eye tracker and the recorded data, but this feeling vanished over the course of the study. Two of them reported that the effect diminished as they got used to being recorded and that it increased their focus on the thesis. Four interviewees also reported that the eye tracker did not disturb them. Regarding the eye-tracking technology in general, many participants were surprised by the accuracy of the eye tracker. However, 82 % of participants (9/11) reported problems with the Tobii 4C eye tracker because it sometimes randomly disconnected from the computer when used for a longer time. This created some frustration on the participants' side, as P03 reported that "the recording has been running for three quarters of an hour and then an error message comes out of nowhere, that's frustrating". Furthermore, two participants feared that the red infrared light sources of the eye tracker would bother them, which eventually was not the case.

Cognitive state models
In chapter 3, we described in detail the data processing and modelling procedure that we followed to develop the cognitive state models for cognitive load and flow using the collected field data.

Models considering all labels
First, we developed the cognitive state models using all the label data collected during the study to investigate the feasibility of generalizable and accurate eye-based cognitive state models using field data. We used the SelectFromModel feature selector of the sklearn package to reduce the feature set to the most important features for each algorithm separately. After selecting the features, we trained the models with and without applying SMOTE and evaluated them using LOOCV.
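The evaluation scheme can be sketched in a few lines. The sketch assumes that each LOOCV step holds out all labels of one participant, a reading suggested by the 11 cross-validation steps for 11 participants but not stated explicitly. It covers only the splitting logic and a majority-class baseline; feature selection, SMOTE, and the actual classifiers are omitted, and all names and data are hypothetical.

```python
from collections import Counter


def leave_one_participant_out(participant_ids):
    """Yield (held_out, train_idx, test_idx) with one participant held out per fold."""
    for held_out in sorted(set(participant_ids)):
        train = [i for i, p in enumerate(participant_ids) if p != held_out]
        test = [i for i, p in enumerate(participant_ids) if p == held_out]
        yield held_out, train, test


def majority_baseline_accuracy(labels, participant_ids):
    """Per-fold accuracy of always predicting the majority label of the training folds."""
    accuracies = {}
    for held_out, train, test in leave_one_participant_out(participant_ids):
        majority = Counter(labels[i] for i in train).most_common(1)[0][0]
        accuracies[held_out] = sum(labels[i] == majority for i in test) / len(test)
    return accuracies


# Hypothetical labels (1 = flow, 0 = no flow) for three participants.
participants = ["P01"] * 4 + ["P02"] * 4 + ["P03"] * 4
labels = [0, 0, 0, 0,  0, 0, 1, 1,  0, 0, 0, 1]

fold_accuracy = majority_baseline_accuracy(labels, participants)
```

In a complete pipeline, any resampling step would typically be applied to the training indices of each fold only, so that synthetic samples never leak into the held-out participant's data.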

Classification models for all tasks
For the classification models, we evaluated the performance of the models based on the average F1-Score and AUC of the LOOCV; the results can be found in Table 5. If we compare the performance of the classifiers without SMOTE/oversampling with the baseline classifier, we can see that for flow, only the Random Forest classifier performed as well as the baseline classifier for both F1-Score (F1-Score = 72.80) and AUC (AUC = 0.5) for a window size of 1 min. However, a closer look at the confusion matrices of each LOOCV step shows that the classifier always predicted the negative class, i.e., no flow, in 11 out of 11 cross-validation steps. This shows that the classifier has little discriminative power and therefore does not show generalizability. Considering SMOTE/oversampling for the evaluation of the flow classifier, a Decision Tree classifier with a window size of 10 min performed best in terms of F1-Score (F1-Score = 62.12, AUC = 0.542), while a Random Forest classifier with a window size of 10 min performed best in terms of AUC (AUC = 0.586). However, none of the classifiers outperformed the baseline classifier for both metrics using SMOTE/oversampling.
Similar results can be observed for the cognitive load classifier. Without SMOTE/oversampling, the best performing classifier in terms of F1-Score is based on Decision Trees and a window size of 3 min (F1-Score = 71.09, AUC = 0.517). It outperformed the baseline model by a small margin, but exhibited a low discriminative power with an AUC only slightly above 0.5. In terms of AUC, an XGBoost-based classifier with a window size of 3 min performed best (AUC = 0.566) but did not outperform the baseline model in terms of F1-Score. Using SMOTE/oversampling, the best performing classifier in terms of F1-Score is based on Decision Trees and a window size of 2 min (F1-Score = 65.83), and in terms of AUC, a Random Forest-based classifier with a window size of 20 min performed best (AUC = 0.579).
Overall, none of the classifiers performed significantly better than the baseline classifier for both metrics. Furthermore, none of the window sizes significantly outperformed the other window sizes.

Regression models for all tasks
For the regression models, we evaluated the performance of the models based on the average MSE and R² of the LOOCV; the results can be found in Table 6. Note that R² can become negative if the residual sum of squares is very large, indicating a poor fit of the model.
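A short worked example shows why R² can be negative on held-out data. It uses the standard definition R² = 1 − SS_res / SS_tot; the label values and the training mean are hypothetical and only serve to illustrate the arithmetic.

```python
def mse(y_true, y_pred):
    """Mean squared error between true and predicted values."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def r_squared(y_true, y_pred):
    """R^2 = 1 - SS_res / SS_tot, where SS_tot uses the mean of y_true."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    if ss_tot == 0:
        raise ValueError("R^2 is undefined when y_true is constant")
    return 1.0 - ss_res / ss_tot


# Hypothetical held-out labels of one participant.
y_test = [2.0, 3.0, 4.0]

# A baseline that predicts the mean of the *training* labels (assumed to be 5.0).
baseline_pred = [5.0, 5.0, 5.0]
baseline_r2 = r_squared(y_test, baseline_pred)   # SS_res = 14, SS_tot = 2

# A model whose predictions track the held-out labels more closely.
model_pred = [2.5, 3.0, 3.5]
model_r2 = r_squared(y_test, model_pred)         # SS_res = 0.5, SS_tot = 2
```

Because the held-out participant's labels lie far from the training mean, the residual sum of squares of the baseline exceeds the total sum of squares and R² drops to −6, whereas the closer predictions reach 0.75.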
Comparing the performance of the regression models without label normalization to the baseline model, we see that for flow, the Decision Tree Regressor with a window size of 3 min performed best in terms of MSE (MSE = 0.969), while the Decision Tree Regressor with a window size of 2 min performed best in terms of R² (R² = 0.047). This Decision Tree Regressor also outperformed the baseline model by a small margin regarding the MSE (MSE = 1.044), making it slightly superior to the baseline model. However, an R² of 0.047 indicates that the model is performing relatively poorly on unseen data, as only 4.7 % of the variance in the dependent variable is explained by the independent variables included in the model. Applying normalization to the flow labels, the Random Forest Regressor for a window size of 3 min performed best regarding the MSE (MSE = 0.084) and also outperformed the baseline model by a small margin, while a Random Forest Regressor for a 20 min window size performed best for the R² score (R² = 0.168). This Random Forest Regressor model also performed as well as the baseline model in terms of the MSE (MSE = 0.088). Overall, none of the models for flow significantly outperformed the baseline model while also having good predictive power.
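For cognitive load, none of the regression models both outperformed the baseline model in terms of MSE and achieved an R² score > 0. The best performing model without normalization was a Random Forest Regressor for the … size when considering MSE (MSE = 0.488) and R² (R² = −0.868). This model actually outperformed the baseline model, but the R² value was negative, indicating a high residual sum of squares and low model fit. When normalizing the cognitive load labels, a Decision Tree Regressor and a Random Forest Regressor for the 10 min window size, and an XGBoost Regressor for the 20 min window size outperformed the baseline model in terms of MSE (MSE = 0.063). However, none of the regression models for normalized cognitive state labels could achieve a positive R², highlighting that none of the models had a good model fit.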
Overall, the results for both the eye-based cognitive state classification and the regression models trained on all label data show that we have not been successful in building models that are generalizable across tasks and participants. None of the developed classification and regression models significantly outperformed the baseline models. Only one of the developed classifiers slightly outperformed the baseline classifier in terms of F1-Score and AUC, and only one regression model slightly outperformed the baseline model while explaining little of the variance.

Models considering writing task labels
As shown in the task and application analysis, the tasks varied significantly during the field study and also between participants. Therefore, we filtered our dataset for the most frequently reported task "text writing" and followed the same procedure as described above to evaluate the performance of different classifiers trained with only 94 "text writing" labels. However, not all participants reported labels for the "text writing" task. Therefore, only P01, P02, P04, P05, P06, P09, and P10 could be considered in the following section.

Classification models for writing tasks
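In order to apply SMOTE with a minimum of k = 2 nearest neighbors, at least three labels of the minority class must be collected. Therefore, we could only include P01, P02, P04, P06, and P09 for the development of the flow classifier and P01, P02, P04, P09, and P10 for the development of the cognitive load classifier. The results of the classification models using only "text writing" labels can be found in Table 7. For flow, without SMOTE/oversampling, XGBoost outperformed the baseline classifier for the window sizes of 1 min (F1-Score = 67.54, AUC = 0.618) and 2 min (F1-Score = 74.38, AUC = 0.691). Using SMOTE to balance the flow labels, XGBoost outperformed the baseline classifier for a window size of 2 min and was the best overall performer for both metrics (F1-Score = 71.83, AUC = 0.689). In addition, a Decision Tree classifier for the 20 min window also outperformed the baseline classifier for both metrics (F1-Score = 70.00, AUC = 0.571), but performed slightly worse than the XGBoost classifier for the 2 min window. For cognitive load and without SMOTE/oversampling, a Decision Tree classifier produced the highest F1-Score (F1-Score = 74.64), and a Random Forest classifier for a window size of 1 min produced the highest AUC (AUC = 0.66). Applying SMOTE/oversampling, an XGBoost classifier for a window size of 4 min performed best in terms of the F1-Score and AUC (F1-Score = 70.53, AUC = 0.711) while also outperforming the baseline classifier.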
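The minimum-label requirement for SMOTE follows from how synthetic samples are generated: each new point is interpolated between a minority sample and one of its k nearest minority neighbors, so k neighbors only exist if there are at least k + 1 minority samples. The following is a simplified, SMOTE-like sketch of this idea and not the implementation used in the study; the function name, parameters, and feature values are hypothetical.

```python
import random


def smote_like(minority, k=2, n_new=4, seed=0):
    """Interpolate n_new synthetic points between minority samples and their k nearest neighbors."""
    if len(minority) < k + 1:
        raise ValueError(
            f"need at least {k + 1} minority samples for k={k}, got {len(minority)}"
        )
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base_idx = rng.randrange(len(minority))
        base = minority[base_idx]
        others = [p for j, p in enumerate(minority) if j != base_idx]
        # Sort the remaining minority samples by squared Euclidean distance to the base.
        others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)))
        neighbor = rng.choice(others[:k])
        gap = rng.random()  # position on the line segment between base and neighbor
        synthetic.append(tuple(a + gap * (b - a) for a, b in zip(base, neighbor)))
    return synthetic


# Hypothetical two-dimensional feature vectors of the minority class.
minority = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
new_points = smote_like(minority, k=2, n_new=4)
```

With only two minority samples and k = 2, the function raises an error, which mirrors why participants with fewer than three minority labels had to be excluded.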

Regression models for writing tasks
The results of the regression models using only "text writing" labels can be found in Table 8. Without normalizing the flow labels, an XGBoost Regressor for a window size of 2 min performed best in terms of the R² value (R² = 0.229), while also outperforming the baseline regression …
Overall, the results of the classification- and regression-based cognitive state models (flow and cognitive load) for writing tasks demonstrate that cognitive state models based on eye-tracking data collected in the field can exhibit robustness, accuracy, and generalizability across participants. Moreover, these results highlight that cognitive state models are generalizable across participants but not across tasks.

Discussion
Our results show that developing generalizable, accurate, and robust cognitive state models based on field eye-tracking data is a challenging but feasible task. We were unable to develop an eye-based cognitive state ML model that generalizes across tasks and participants, as the developed models using labels from all tasks did not significantly outperform the baseline classifier in terms of F1-Score and AUC, or the baseline regression in terms of MSE and R². Eye-tracking data is known to be highly task dependent, which may explain the poor performance of the eye-based cognitive state models when considering labels from all tasks. However, we were successful in developing eye-based cognitive state models when considering only labels collected during writing tasks, as they outperformed the baseline classifier in terms of F1-Score and AUC, and the baseline regression in terms of MSE and R². Thus, for a writing task, we were able to develop accurate models that achieve generalizability in terms of working across different participants. Furthermore, this suggests that it is possible to develop cognitive state models using eye-tracking data collected in the field.
To further advance the development of eye-based cognitive state models, we systematically examined the eye tracking and log data collected, the labels collected, the model performance, and the interviews with participants. On this basis, we derived six major lessons learned (LLs) from our field study that can help other researchers in conducting field data collection studies with eye tracking for the purpose of building cognitive state models. The first three LLs relate to the data collection method and system, and the subsequent LLs relate to model development using eye-tracking data.
Many existing ESM studies have focused on collecting data on mobile devices. 44 This has typically required shifting attention from the task to another platform. Therefore, we implemented esmLoop to integrate ESM data collection directly into the operating system of the PC. With our approach and the esmLoop system, we were able to successfully collect a large amount of data during our study. Using esmLoop, all participants were able to collect large amounts of valid eye tracking (>150 h), interaction, and labeling data (293 labels of cognitive load and flow) on their own, without the active supervision of an experimenter. According to our interviews, participants found esmLoop user-friendly and liked its integration into the operating system. In addition, they reported that resuming tasks after completing surveys was quick because esmLoop was well integrated into their work environment. However, even though the system was tightly integrated and worked well on different PCs during testing, participants reported some problems with the software running on their computers during the field study, which ultimately affected data collection. In the case of participant P12, the eye-tracking driver software stopped working and recording data during the study, resulting in the participant's removal from the experiment. In addition, several users reported that the software crashed due to random disconnections of the eye tracker, causing some frustration. This may be because the students' computer hardware was not up to date, or because the external eye-tracking hardware is not well integrated with the hardware and operating system, or is not designed to record for several hours. It also shows that tight integration of hardware and software with the system is important for robust and high quality data collection in our context. Hardware integration may change as computing devices with a built-in eye tracker, such as the Apple Vision Pro, come to market and continuously rely on eye tracking for interactions, improving integration within the system. These findings underscore our first lesson:
LL1: To make data collection more robust and ensure high data quality, label and eye-tracking data collection should be tightly integrated with the system the user is using to perform their task(s).
Eye tracking and self-reported cognitive state data is known to be sensitive data, as it is considered health data by law and reveals a lot of information about the user. 72 % of the users who participated in our experiment had no privacy concerns when using esmLoop during the study. They appreciated the ability to schedule the recording sessions by starting and stopping them as needed, and that the data was not automatically shared for analysis. During the data collection phase, users could see that data collection was in progress from the icon in the task bar. They found that this visual cue gave them a sense of control over their own data and transparency in the process. In addition, when using esmLoop, users were curious about their contribution and found the reports on the number of labels and the amount of data collected to be interesting features of esmLoop, which increased their motivation to continue collecting data. Finally, providing participants with information about the goal of the experiment, the privacy check, the experimental design approved by the ethics board, etc. during the initial workshop also helped to increase their confidence and motivation to actively participate in the experiment. On this basis, we articulate the following lesson learned:
LL2: For the collection of eye tracking and self-reported cognitive state data, which are sensitive data and require high effort, the system should follow a transparent data collection policy to support user privacy and increase motivation to provide data and labels.
The feeling of being tracked is known to be one of the most commonly discussed issues in eye-tracking studies because of its potential to bias user behavior. Despite initial concerns about the presence of the eye tracker and the potential distraction of being recorded, as well as the light from the infrared light sources of the eye tracker, users reported that they became accustomed to the presence of the eye tracker and that it did not affect their behavior. In fact, some participants reported the positive effect of being more focused on their task due to the eye tracker and felt that their behavior, and specifically experiencing the flow state, was important to the research goal. We summarize this finding in the following lesson learned:

LL3: Study participants will become accustomed to the presence of eye trackers and to being recorded.
Similar to affect detection based on biosignal data collected in the field, 57 developing a generalizable cognitive state ML model using eye-tracking data collected in the field is challenging. To limit the complexity, we decided to approach the model development as a binary classification and regression problem rather than a multi-class problem. The label distribution analysis shows that cognitive load and flow are experienced differently across participants, which is also supported by the interview data. Furthermore, it is still challenging to define a generally valid threshold between high and low cognitive load and between flow and no flow for training the classification models. Because the visual behavior of the participants was highly individual, the classification models had a hard time identifying patterns for different cognitive states across tasks that would lead to successful classification. Even our regression models, which do not depend on a threshold on the cognitive state label distribution, did not perform significantly better than the baseline model when considering data from all participants and tasks. Therefore, we would suggest building within-subject models, which, on the other hand, require a large number of labels to cover all manifestations of cognitive states for different tasks from one individual. In our study, we did not collect enough labels per participant to follow this suggestion. Therefore, this suggestion needs to be evaluated in future studies. Based on the above findings, we articulate the following lesson learned:

LL4: Developing an eye-based cognitive state ML model that is generalizable across users is challenging because eye movements are very individual and cognitive states are perceived very differently by individuals.
The results of our study showed that we were able to collect labels for a wide range of tasks, but for most tasks we collected only a few labels. The model development results suggest that the development of a cognitive load and flow model generalizable across tasks and participants using real-world data was not successful. None of the developed models significantly outperformed the baseline model when considering labels of all tasks. Only a Decision Tree classifier for cognitive load slightly outperformed the baseline model in terms of F1-Score and AUC but exhibited a low discriminative power. Moreover, one regression model based on a Decision Tree Regressor for flow slightly outperformed the baseline model, but the R² value was very low, indicating that little of the variation in cognitive state is predictable from the eye-based features. Only when considering labels collected during a writing task were we able to develop models generalizable across participants that outperformed the baseline models for flow and cognitive load. This is consistent with the fact that eye movements are highly dependent on task and environment, as shown in Yarbus. 58 To achieve generalizability of eye-based cognitive state models across tasks, not only is a large dataset covering a variety of tasks and environments needed, but a variety of cognitive state labels also needs to be collected for each of the tasks.
LL5: Developing an eye-based cognitive state ML model that is generalizable across tasks and environments is challenging because eye movements are highly task and environment dependent. Therefore, it is important to collect enough labels per task and cover a variety of environments.
At first glance, the performance of the developed cognitive state models looks quite good, especially when following a binary classification-based approach. As shown in Tables 5 and 7, some classifiers achieved an F1-Score above 70 %. However, without a balanced dataset in terms of labels and a comparison with a baseline classifier, an evaluation based on the F1-Score may overestimate the performance of the classifiers. For example, if a baseline classifier for flow with a window size of 1 min also achieves an F1-Score of 72.80 % due to unbalanced data (see Table 5), a Random Forest classifier that achieves an F1-Score of 72.80 % and an AUC of 0.5000 performs no better than the baseline and is not learning from the data. One approach to overcome the risk of overestimating performance is to oversample the minority class using SMOTE to create a balanced dataset. This approach proved to be helpful, as seen in Table 5, since the Random Forest classifier for flow and a window size of 1 min did not achieve a higher F1-Score than the baseline classifier after using SMOTE. Furthermore, checking the confusion matrices of the LOOCV showed that the Random Forest classifier did not provide any discriminative power to distinguish between the two classes of flow and no flow, as the classifier always predicted the no flow class for all iterations of the CV, which is also supported by the AUC value of 0.5. In summary, we draw the following lesson learned:
LL6: For the development of eye-based cognitive state models following a binary classification approach, a differentiated evaluation of classifier performance is important.
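The effect described above can be reproduced with a small numerical sketch. It assumes a support-weighted F1-Score, which is an assumption on our part because the averaging scheme is not stated here, and it uses a hypothetical label distribution of 80 % no flow and 20 % flow.

```python
def weighted_f1(y_true, y_pred):
    """Per-class F1 averaged with weights proportional to each class's support."""
    total = 0.0
    for c in sorted(set(y_true)):
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        f1 = 2 * tp / (2 * tp + fp + fn) if tp else 0.0
        total += f1 * sum(t == c for t in y_true)
    return total / len(y_true)


def auc(y_true, scores):
    """Probability that a random positive is scored above a random negative (ties count half)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical unbalanced labels: 8 x no flow (0), 2 x flow (1).
y_true = [0] * 8 + [1] * 2

# A classifier that always predicts "no flow", with a constant score for every sample.
always_no_flow = [0] * 10
constant_scores = [0.0] * 10

f1_value = weighted_f1(y_true, always_no_flow)
auc_value = auc(y_true, constant_scores)
```

The always-negative classifier reaches a weighted F1-Score of about 0.71 although its AUC is exactly 0.5, which illustrates why the F1-Score should be read together with a baseline, the AUC, and the confusion matrices.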

Limitations & future work
The findings of this study are limited to the context of thesis work by students. For the present study, we recruited 11 students as subjects who were working on their thesis projects. The students were recruited from different study programs and they worked on different tasks depending on the stage of their thesis project. Furthermore, we only collected data during a limited time frame of 5 days, during which we probably could not cover all tasks of that specific thesis work stage. The interview results show that users adjusted their break schedule based on when survey questions might appear. Therefore, there is a need to adjust the frequency and timing of survey questions more dynamically so that users do not change their behavior. In addition, we only examined the cognitive states of cognitive load and flow. There are many more cognitive states, such as situation awareness, comprehension, distraction, certainty, or fatigue, that are also worth investigating using ESM and eye-tracking data to build supervised ML models.
The approach we used to develop the eye-based cognitive state models also has several limitations. Because this was a field study, we focused on a rather broad set of tasks, neglecting the collection of a balanced data set for each task and participant. Despite collecting data over 5 days, we were not able to collect enough labels per participant (approximately 26 labels per participant) to develop individual models for each participant. In the future, we propose to collect more labels per participant and task to investigate the development of within-subject cognitive state models. In addition, we simplified the classification problem to a binary problem and a regression problem, but did not take a multi-class approach. Also, the models developed based on labels collected only during the writing tasks are less representative. They only consider a subset of the labels and participants, since not all participants reported cognitive state labels for the writing tasks.

Conclusions
In the future, advances in sensor technology and machine learning will make it possible to monitor our cognitive state like our physical activity. Eye-tracking technology is a promising off- and on-body sensor technology that can provide access to cognitive user states. We contribute to this vision by investigating the development of cognitive state models using eye-tracking data collected in the field. In this paper, we present and apply an experience sampling system called esmLoop to record eye-tracking data and collect cognitive state labels from 11 students working on their thesis project in the field. We develop cognitive load and flow models using supervised machine learning algorithms for classification and regression and the collected field data. We also evaluate these models for accuracy, robustness, and generalizability across tasks and users. Our results demonstrate that developing cognitive state models that are generalizable across participants and tasks is challenging, and we have not been successful in developing such models. However, the results of task-specific cognitive state models highlight that it is possible to develop cognitive state models that are generalizable across participants using eye-tracking data collected in the field. Finally, we articulate six lessons learned during data collection and model development to enable the development of cognitive state models that are generalizable across participants and tasks in the future.
from the real eye and skipped the averaging step. Next, we converted the normalized X- and Y-coordinates from the Tobii Pro SDK to pixel coordinates based on a 1920 × 1080 screen resolution to treat recorded data on different screens the same. Due to the frequency of the Tobii Eye Tracker 4C

Figure 6: Overview of tasks that participants were working on when answering the questionnaire about their flow and cognitive load state.
related a high cognitive load to flow as either a requirement or a facilitator.
size when considering MSE (MSE = 0.488) and R² (R² = −0.868). This model actually outperformed the baseline model, but the R² value was negative, indicating a high residual sum of squares and low model fit. When normalizing the cognitive load labels, a Decision Tree Regressor and a Random Forest Regressor for the 10 min window size, and an XGBoost Regressor for the 20 min window size outperformed the baseline model in terms of MSE (MSE = 0.063).
the baseline classifier for both metrics (F1-Score = 70.00, AUC = 0.571), but performed slightly worse than the XGBoost classifier for the 2 min window. For cognitive load and without SMOTE/oversampling, a Decision Tree classifier produced the highest F1-Score (F1-Score = 74.64) and in terms of AUC a Random Forest classifier for a window size of 1 min (AUC = 0.66). Applying SMOTE/oversampling, an XGBoost classifier for a window size of 4 min performed best in terms of the F1-Score and AUC (F1-Score = 70.53, AUC = 0.711) while also outperforming the baseline classifier.
This is typically done to investigate empirically what is the best window size to capture the responses of triggers in the biosignal data that explain best the self-reported affective or cognitive state.

Table 1: Collected valid eye tracking data.

Table 2: Descriptive eye tracking data for flow.

Table 3: Descriptive eye tracking data for cognitive load.

Table 4: Label distribution of high/low cognitive load and flow/no flow before and after normalization.

Table 5: Classifier performance for all window sizes (1, 2, 3, 4, 5, 10, 20 min) and classifiers (DT, RF, XGB, Base) including labels from all tasks. Results marked in bold indicate classifiers that outperformed the baseline classifier of the corresponding window size in terms of both metrics.

Table 6: Regression-based model performance for all window sizes (1, 2, 3, 4, 5, 10, 20 min) and regression-based models (LR, DTR, RFR, XGBR, Base) including labels from all tasks. Results marked in bold indicate models that outperformed the baseline model of the corresponding window size in terms of MSE and had an R² > 0.

Table 7: Classifier performance for all window sizes (1, 2, 3, 4, 5, 10, 20 min) and classifiers (DT, RF, XGB, Base) including labels from writing tasks only. Results marked in bold indicate classifiers that outperformed the baseline classifier of the corresponding window size in terms of both metrics.

Table 8: Regression-based model performance for all window sizes (1, 2, 3, 4, 5, 10, 20 min) and regression-based models (LR, DTR, RFR, XGBR, Base) including labels from writing tasks. Results marked in bold indicate models that outperformed the baseline model of the corresponding window size in terms of MSE and had an R² > 0.