Real-time gaze estimation via pupil center tracking

Abstract. Automatic gaze estimation that does not rely on expensive commercial eye-tracking hardware can enable several applications in the fields of human-computer interaction (HCI) and human behavior analysis. It is therefore not surprising that several related techniques and methods have been investigated in recent years. However, very few camera-based systems proposed in the literature are both real-time and robust. In this work, we propose a real-time gaze estimation system that does not need person-dependent calibration, can deal with illumination changes and head pose variations, and can work over a wide range of distances from the camera. Our solution is based on a 3-D appearance-based method that processes the images from a built-in laptop camera. Real-time performance is obtained by combining head pose information with geometrical eye features to train a machine learning algorithm. Our method has been validated on a data set of images of users in natural environments, and shows promising results. The possibility of a real-time implementation, combined with the good quality of gaze tracking, makes this system suitable for various HCI applications.


Introduction
It has long been recognized that human interaction encompasses multiple channels [1]. Eye gaze plays a special role, as it can express emotions, desires, feelings and intentions [2]. Gaze tracking is the process of determining the point-of-gaze in the physical space. Accurate eye gaze tracking normally requires expensive specialized hardware (such as the eye-tracking solutions produced by Tobii [3] or SR Research [4]) that relies on active sensing (most commonly, infrared illuminators) [5]. This reduces the appeal of these systems for consumer market applications [6]. Moreover, these solutions often require a manual calibration procedure for each new user.
More recently, inexpensive solutions that do not require active illumination have been proposed [7]. These systems rely on modern computer vision algorithms. Some can use the camera embedded in any computer screen, laptop, and even tablet computer, requiring no additional hardware.
In this work, we propose a new algorithm that can estimate eye gaze in real time without constraining the motion of the user's head. Our system does not need person-dependent calibration, can deal with illumination changes, and works with a wide range of distances from the camera. It is based on an appearance-based method that tracks the user's 3-D head pose from images taken by a standard built-in camera. From the same images, the irises are detected, and their center locations are fed (together with other geometrical measurements) to a machine learning algorithm that estimates the gaze point on the screen. Iris detection and gaze point estimation are computed in real time.
Our system has been validated on a data set of images of users in a natural environment, showing promising results. In addition, we present qualitative results with an online user interaction test using our system. This experiment shows that the system can provide useful real-time information about the user's focus of attention. Other potential applications for this technology include data analytics on visual exploration from multiple users watching a video, and the control of assistive devices.
This paper is organized as follows. Sec. 2 illustrates the eye gaze estimation problem and related work. Sec. 3 describes the proposed gaze estimation method. The experimental setup is explained in Sec. 4.1, while the data set used to train the system is presented in Sec. 4.2.
Results are shown and discussed in Sec. 4.3. Sec. 5 concludes the paper.

Related Work
Following the seminal work of Just and Carpenter [8], who studied the relation between eye fixations and cognitive tasks, the measurement of eye gaze direction has been used in a broad range of application areas over time, including human-computer interaction (HCI) [9,10], visual behavior analysis [11,12], visual search [13], soft-biometrics [14,15], market analysis [16], cognitive process analysis [17], and interaction with children affected by autism spectrum disorder [18,19]. Moreover, eye gaze tracking is fundamental for human-robot interaction (HRI) [20,21], as it provides useful information about user engagement, turn-taking schemes, and intention monitoring. Gaze has been considered even in environments featuring both a robot and a gaze-interactive display [22], and for the design of attentive robots [23]. For a review of gaze estimation applications in HRI/HCI, socially assistive robotics (SAR), and assistive technologies, the reader is referred to [24].
The availability of modern low-cost depth sensors, combined with advances in computer vision, has led to new solutions for gaze estimation that are less invasive and cheaper than prior methods based on active illuminators. Passive gaze estimation solutions can be divided into two main categories: model-based and appearance-based. Model-based methods rely on a 3-D model of the head and of the eyeball and use geometric reasoning [25][26][27][28][29]. Their main advantage is that they can naturally handle head pose movements, provided that these can be measured reliably. Unfortunately, precisely locating the eyeball in space is very challenging; indeed, the most successful algorithms used a 3-D camera for this purpose [30,31]. In addition, in the process of building the mathematical model of the eye, these methods need to know the relative pose of cameras and screen, as well as the relationship between multiple cameras and the parameters of each camera. Consequently, a small amount of noise can strongly influence the final estimation [32]. When compared with appearance-based methods, their accuracy is generally lower. In addition, it is unclear whether shape-based approaches can robustly handle low image quality [33].
In contrast, appearance-based methods detect and track eye gaze directly from images, without the need for a full 3-D model of the head and eyeball. Instead, these methods learn a mapping function from eye images to gaze directions. They can handle changes in lighting conditions and, since they normally use the entire eye image as a high-dimensional input feature and map this feature to a low-dimensional gaze position space, can potentially work with low-resolution images [34], at the cost of acquiring a large amount of user-specific training data. User-dependent calibration is often necessary. A main problem with these methods is that they are not robust to head pose movements [35,36], unless this is explicitly taken into account. For example, the method of Schneider et al. [37] achieves person-independent and calibration-free gaze estimation, at the cost of assuming a fixed head pose. The approach of Ferhat et al. [38] is based on an iris segmentation algorithm used to track anchor points; histogram features are then used in a Gaussian process estimator. This system requires person-dependent calibration and cannot deal with free head movements.
In this paper we only compare our work with other projects that integrate 3-D head pose information and that achieve real-time estimation of gaze. Tab. 1 presents a summary of the most relevant related methods. For each entry, we provide the category (model-based or appearance-based), the input type, computational details, lighting conditions (when available), details about user-dependent calibration, the error (as reported), and the application in a real HCI/HRI scenario, if any. Note that we only report lighting conditions in the case of online testing. For tests on existing data sets, we refer the reader to the description given in the data set documentation. We want to emphasize that this table is meant solely to provide some context through a snapshot of competing approaches. Quantitative comparative evaluation is complicated by the different experimental setups and data sets considered for benchmarking.

Gaze Estimation Method
Our system analyzes video frames from a regular camera to detect the pose of the user's head and the location (in the image) of facial features using open source software (IntraFace [48]). Subsequently, the center of each iris is found using a customized version of the Circular Hough Transform (CHT) [49]. A random forest regressor, trained on a labeled data set, is then used to map pose and pupil center information into a gaze point on the screen.

Table 1. Summary of relevant state of the art methods (m-b: model-based, a-b: appearance-based, n.a.: not available).

Method | Category | Input | Computation | Lighting | Calibration | Error (°) | Application
[41] | a-b | camera | potentially real-time | good illumination conditions | yes | 2.49 | -
Sugano et al. [42] | a-b | multiple cameras | n.a. | n.a. | no | 6.5 | -
Xiong et al. [43] | a-b | stereo camera | n.a. | natural light source | yes | 6.43 | -
Funes-Mora and Odobez [44] | a-b | depth sensor | 10 fps (gaze only, not the whole solution) | n.a. | no | 5.7-7 | natural dyadic interaction
Lu et al. [45] | a-b | camera | potentially real-time | n.a. | yes | 3 | -
Wood et al. [27] | m-b | camera | 12 fps | n.a. | no | 6.88 | gaze estimation on tablet
Holland et al. [46] | a-b | camera | 0.65 fps | n.a. | yes | 6.88 | gaze estimation on tablet
Cazzato et al. [29] | m-b | depth sensor | 8.66 fps | uniform, left or right | no | 2.48 | soft-biometric identification
Sun et al. [28] | m-b | depth sensor | 12 fps | n.a. | yes | 1.38-2.71 | chess game, eye keyboard
Chen and Ji [47] | m-b | camera + 2 IR LEDs | 20 fps | robust | no | 1.78 (after 80 frames) | -
Zhang et al. [33] | a-b | camera | n.a. | n.a. | no | 6.3 (cross-data set) | -

Head Pose Estimation
The presence of a head in the scene is detected by using the Viola-Jones face detector [50]. We use the IntraFace [48] software to detect facial features in the image in real time. The software also produces the head orientation (in terms of yaw, pitch and roll angles) with respect to a reference system centered in the camera. Head orientation is computed by aligning a deformable face model to the detected face. The model is characterized by the 2-D positions of a number of landmarks describing the face, the eye positions (not the pupils), and the mouth and nose outlines. Feature detection and tracking is based on the Supervised Descent Method (SDM), an algorithm that optimizes a non-linear least squares problem. For more details, see [51]. The next step in our algorithm is the estimation of the vector T from the origin of the camera reference frame to a reference frame centered at the user's face. More precisely, the head pose reference system has its origin at the nose base and x- and y-axes parallel to the mouth and nose, respectively (Fig. 3). The rotation matrix R between the head pose and the camera reference systems is produced by IntraFace. In order to compute T, we assume a fixed distance D = 90 mm between the external corners of the eye contours and a fixed distance H = 70 mm between a point at the bottom of the nose (nose base) and the segment joining the two pupils (see Fig. 2). Note that these feature points are computed by IntraFace. The chosen values of these distances are justified by several anthropometric studies. For example, the study of [52] determined that the interpupillary distance for the majority of humans lies in the range 50-75 mm. More precisely, the average interpupillary distance is 64.0 mm (standard deviation 3.6 mm) for men and 61.7 mm (standard deviation 3.4 mm) for women (2012 Anthropometric Survey of US Army Personnel [53]). Considering that the average palpebral fissure width is approximately 30 mm, the distance between the external eye contours can be expected to be approximately 94 mm for men and 91 mm for women. The chosen value of 90 mm is close to these average values. As for H, since no related data was found in the literature, this quantity has been chosen empirically.
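As a worked check of these numbers (assuming, as implied above, that each external eye corner lies roughly half a palpebral fissure width outward of the corresponding pupil):
\[
D_{\text{men}} \approx 64.0\,\text{mm} + 2 \times \tfrac{30}{2}\,\text{mm} = 94\,\text{mm},
\qquad
D_{\text{women}} \approx 61.7\,\text{mm} + 30\,\text{mm} \approx 91.7\,\text{mm}.
\]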
The 3-D coordinates of the left and right eye corners can be expressed both in the camera reference system and in the head pose reference system; in the latter, they are known from the fixed distances D and H introduced above. We then express the rotation matrix R through its row components, and consider the projections of the left and right eye corners onto the camera's focal plane, where K is the intrinsic camera matrix and (u_l, v_l), (u_r, v_r) are the image coordinates of the left (subscript l) and right (subscript r) eye corner locations. Putting these equations together, the projection equations can be rewritten in terms of the unknown translation vector T. Solving the resulting system directly provides the depth component z^C_l of the left eye corner, from which the vector T is easily computed.
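A minimal sketch of the underlying geometry, under the assumptions that the external eye corners lie approximately on the pupil segment of the head frame and that points map between frames as p^C = R p^H + T:
\[
p^H_l = \begin{pmatrix} -D/2 \\ H \\ 0 \end{pmatrix}, \quad
p^H_r = \begin{pmatrix} +D/2 \\ H \\ 0 \end{pmatrix}, \qquad
z^C_i \, K^{-1} \begin{pmatrix} u_i \\ v_i \\ 1 \end{pmatrix} = R \, p^H_i + T, \quad i \in \{l, r\}.
\]
Subtracting the two projection equations eliminates T,
\[
z^C_l \, K^{-1} \begin{pmatrix} u_l \\ v_l \\ 1 \end{pmatrix}
- z^C_r \, K^{-1} \begin{pmatrix} u_r \\ v_r \\ 1 \end{pmatrix}
= R \, (p^H_l - p^H_r) = -D \, R \, e_1,
\]
which is a linear system in the two unknown depths z^C_l and z^C_r. Once z^C_l is known,
\[
T = z^C_l \, K^{-1} \begin{pmatrix} u_l \\ v_l \\ 1 \end{pmatrix} - R \, p^H_l .
\]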

Iris Detection
IntraFace produces a number of key points in the periocular region. Within these regions, we extract the iris areas, whose centers are then used for gaze estimation. We start from the bounding boxes of the two periocular regions, as provided by IntraFace, augmented by 10 pixels in each dimension to compensate for alignment errors. The greyscale image is low-pass filtered to reduce noise; then, it is high-pass filtered and histogram equalized.
A Canny edge detector [54] is used to extract the iris edges. The parameters of the Canny edge detector need to be chosen carefully, to avoid returning a large number of false positives (the small blood vessels in the sclera) or, conversely, missing part of or the entire iris contour. In our experiments, we used a Gaussian kernel of 15 × 15 pixels with σ = 2. Note that, due to the histogram equalization, the distribution of intensity values within each eye region covers the full available range, and thus adaptive thresholds are not required. The current implementation sets the minimum and maximum thresholds of the Canny edge detector to 60 and 128, respectively, which roughly correspond to 1/4 and 1/2 of the full range [0, 255].
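A minimal sketch of this preprocessing chain using OpenCV is given below. The kernel size, σ, and Canny thresholds follow the values reported above; the high-pass step is approximated here by unsharp masking, which is an assumption, since the exact filter is not specified.

```cpp
#include <opencv2/core.hpp>
#include <opencv2/imgproc.hpp>

// Extract iris edges from a greyscale periocular region.
// Values follow the text: 15x15 Gaussian kernel, sigma = 2,
// Canny thresholds 60 and 128 (~1/4 and ~1/2 of [0, 255]).
cv::Mat irisEdges(const cv::Mat& eyeRegionGray)
{
    cv::Mat smoothed, highpass, equalized, edges;

    // Low-pass filtering to reduce sensor noise.
    cv::GaussianBlur(eyeRegionGray, smoothed, cv::Size(15, 15), 2.0);

    // High-pass step approximated by unsharp masking (assumption):
    // emphasize structure by subtracting part of the blurred image.
    cv::addWeighted(eyeRegionGray, 1.5, smoothed, -0.5, 0, highpass);

    // Histogram equalization spreads intensities over the full range,
    // so fixed Canny thresholds can be used.
    cv::equalizeHist(highpass, equalized);

    cv::Canny(equalized, edges, 60, 128);
    return edges;
}
```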
The iris regions appear circular in the image when the user is imaged fronto-parallel, and ellipsoidal otherwise. We use the Hough transform to extract circles at multiple radii from the edge image [49]. Specifically, given an M × N pixel area, for each radius R_iris of the circle under consideration, a counter is defined at each pixel. An edge pixel at p = (p_x, p_y) triggers an increment by 1 of all counters at pixels {p + r}, where r spans the circumference of radius R_iris centered at the origin. The counter with the highest value over all radii determines the estimated iris circle. In our implementation, r_A = 5 and r_B = 7 mm, while R_iris = 10. We also experimented with a version of the Hough transform that extracts ellipses; however, due to the larger number of parameters, this version was too slow for real-time implementation, without appreciable benefits.
This iris detection algorithm can be improved by observing that the iris is typically darker than the surrounding sclera. This means that, at the iris edge, the image gradient is expected to point outwards. We exploit this observation by only incrementing a counter at p + r when it is compatible with the image gradient, that is, when the vector r forms an obtuse angle with the image gradient at p. This strategy has proven effective, but computing this angle introduces a substantial computational cost. This can be alleviated by quantizing the vector r into angular multiples of 45°, and maintaining, for each quantized angular value, a look-up table that determines which gradient angles are compatible with r. We also experimented with assigning different incremental values to the counters depending on the magnitude of the gradient at the edges, but this did not result in an appreciable improvement.
We implemented one more variation to the original Hough algorithm, one that again exploits the property that the iris is typically darker than the surrounding sclera. This approach selects a Hough peak by taking into account not only the value of a counter, but also the average brightness within the candidate circle. Specifically, we keep the 10% of candidate circles with the highest value of the associated counter; among these, we select the one with the lowest average brightness value within its boundary. Figs. 4 and 5 show some examples of iris detection with our algorithm.
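The following compact sketch illustrates the accumulator logic described above: a circular Hough transform over a small set of radii, where a vote at p + r is cast only when r forms an obtuse angle with the image gradient at p, and the final peak is chosen among the top 10% of candidates by lowest mean brightness inside the circle. This is not the production implementation; the candidate radii, angular step, and function names are illustrative assumptions.

```cpp
#include <opencv2/core.hpp>
#include <opencv2/imgproc.hpp>
#include <algorithm>
#include <cmath>
#include <vector>

struct IrisCircle { cv::Point center; int radius; };

// Gradient-aware circular Hough transform over the edge map of one eye region.
// 'edges' is the Canny output, 'gray' the equalized greyscale region,
// 'radii' the candidate iris radii in pixels (illustrative values).
IrisCircle detectIris(const cv::Mat& edges, const cv::Mat& gray,
                      const std::vector<int>& radii = {8, 9, 10, 11, 12})
{
    // Image gradients: for a dark iris on a brighter sclera the gradient
    // points outwards, so votes toward the center must oppose it.
    cv::Mat gx, gy;
    cv::Sobel(gray, gx, CV_32F, 1, 0);
    cv::Sobel(gray, gy, CV_32F, 0, 1);

    struct Candidate { cv::Point c; int r; int votes; };
    std::vector<Candidate> candidates;

    for (int radius : radii) {
        cv::Mat acc = cv::Mat::zeros(edges.size(), CV_32S);
        for (int y = 0; y < edges.rows; ++y)
            for (int x = 0; x < edges.cols; ++x) {
                if (edges.at<uchar>(y, x) == 0) continue;
                float dx = gx.at<float>(y, x), dy = gy.at<float>(y, x);
                // Vote along the circle centered at the edge pixel.
                for (int a = 0; a < 360; a += 10) {
                    float rx = radius * std::cos(a * CV_PI / 180.0);
                    float ry = radius * std::sin(a * CV_PI / 180.0);
                    // Keep the vote only if r forms an obtuse angle
                    // with the image gradient at this edge pixel.
                    if (rx * dx + ry * dy >= 0) continue;
                    int cx = cvRound(x + rx), cy = cvRound(y + ry);
                    if (cx >= 0 && cx < acc.cols && cy >= 0 && cy < acc.rows)
                        acc.at<int>(cy, cx) += 1;
                }
            }
        for (int y = 0; y < acc.rows; ++y)
            for (int x = 0; x < acc.cols; ++x)
                if (acc.at<int>(y, x) > 0)
                    candidates.push_back({cv::Point(x, y), radius, acc.at<int>(y, x)});
    }

    // Keep the top 10% of candidates by vote count ...
    std::sort(candidates.begin(), candidates.end(),
              [](const Candidate& a, const Candidate& b) { return a.votes > b.votes; });
    size_t keep = std::max<size_t>(1, candidates.size() / 10);

    // ... and among them pick the circle with the lowest mean brightness inside.
    IrisCircle best{cv::Point(0, 0), 0};
    double bestBrightness = 1e9;
    for (size_t i = 0; i < keep && i < candidates.size(); ++i) {
        cv::Mat mask = cv::Mat::zeros(gray.size(), CV_8U);
        cv::circle(mask, candidates[i].c, candidates[i].r, 255, cv::FILLED);
        double brightness = cv::mean(gray, mask)[0];
        if (brightness < bestBrightness) {
            bestBrightness = brightness;
            best = {candidates[i].c, candidates[i].r};
        }
    }
    return best;
}
```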
Our iris detection algorithm is applied to both the left and right eye image regions. Let (u, v) represent the pixel coordinates of the iris center (the pupil) for one of the two eyes. Our next step is to transform this location into a 3-D point expressed in the head reference system. Denote the iris center position in camera coordinates by p^C and in head pose coordinates by p^H, and let K be the intrinsic calibration matrix; the pixel coordinates (u, v) determine p^C only up to an unknown scale factor α. Writing the change of coordinates between the camera and the head pose reference systems (where the notation R^y_x = (R^x_y)^T indicates the rotation matrix from coordinate system x to coordinate system y, and T^x is a translation vector in coordinate system x), p^H is constrained to lie on a straight line. Intersecting this line with the plane z_H = 0 and decomposing R^H_C into its three rows r_1, r_2 and r_3, one obtains the pupil center location p^H in the head pose coordinate system.
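A minimal sketch of this back-projection, under the assumption (implicit above) that the pupil center lies on the plane z_H = 0 of the head frame:
\[
\alpha \begin{pmatrix} u \\ v \\ 1 \end{pmatrix} = K \, p^C, \qquad
p^H = R^H_C \, p^C + T^H
    = \alpha \, R^H_C \, K^{-1} \begin{pmatrix} u \\ v \\ 1 \end{pmatrix} + T^H .
\]
Writing the third row of R^H_C as r_3^T and imposing z_H = 0 gives
\[
0 = \alpha \, r_3^T K^{-1} \begin{pmatrix} u \\ v \\ 1 \end{pmatrix} + T^H_z
\quad \Rightarrow \quad
\alpha = - \frac{T^H_z}{r_3^T K^{-1} (u, v, 1)^T},
\]
and substituting α back yields p^H.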

Gaze Estimation by Random Regression Forests
The algorithm described above results in the location (in the image plane or in the head coordinate system) of the two iris centers. The next step is to map this information (together with the head pose) into a gaze point on the screen. We design a separable regressor to compute the (x, y) screen coordinates of the gaze point (one independent regressor per coordinate). We use a random forest regressor [55], based on a combination of decision tree predictors [56] such that each tree depends on the values of a random vector sampled independently and with the same distribution for all trees in the forest [57]. When used as a regressor, the random forest algorithm works as follows:

1. Take n_tree bootstrap samples from the data by random sampling with replacement from the original training data, where n_tree is the number of trees in the forest.

2. Grow, for each sample, an unpruned regression tree, where at each node m variables are selected at random out of all M possible variables, and the best split is chosen over the selected m variables.

3. Predict new data by aggregating the predictions (average value) of the n_tree trees.

An estimate of the error can be obtained at each bootstrap iteration by using the tree grown with the bootstrap sample to predict the data not included in that sample (i.e., the out-of-bag predictions; see [58] for more details).
The random forest is trained on labeled data as explained in the next section.

Experiments
In this section we discuss our experiments. In particular, the experimental setup is discussed in Sec. 4.1, while Sec. 4.2 introduces the two different data sets used in our experiments. Results of the experiments are shown in Sec. 4.3.

Experimental Setup
For our experiments, we used a laptop (a MacBook 15" with Retina display, 2.6 GHz Intel Core i7 processor, and 16 GB of 1600 MHz DDR3 memory) with a built-in camera. The laptop was placed on a desk during the experiments. Images were acquired at a resolution of 1280 × 780 at 30 fps, while the screen resolution for the test was set to 2880 × 1800. The system was tested under Windows 7 with 8 GB of RAM, running in a virtual machine under Parallels Desktop v10.
The software was implemented as a single-threaded C++ application using the OpenCV [59] and Qt [60] libraries for the user interface. Facial features are detected and tracked by the IntraFace library [48], which also returns the head rotation in real time. The random forest regressor used the OpenCV implementation with the following parameters: maximum tree depth = 25; maximum number of trees in the forest = 200; size of the randomly selected subset of features used at each node to find the best split = 4; minimum number of samples required at a node for it to be split = 5.
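A minimal configuration sketch of such a regressor with the parameters listed above, using the OpenCV 3.x cv::ml::RTrees interface (the original code may have used an older OpenCV API; training-data preparation is only hinted at):

```cpp
#include <opencv2/ml.hpp>

// Configure one random forest regressor (one per screen coordinate)
// with the parameters reported in the text.
cv::Ptr<cv::ml::RTrees> makeGazeRegressor()
{
    cv::Ptr<cv::ml::RTrees> forest = cv::ml::RTrees::create();
    forest->setMaxDepth(25);         // maximum tree depth
    forest->setMinSampleCount(5);    // minimum samples for a node to be split
    forest->setActiveVarCount(4);    // random feature subset size per node
    // Stop growing the forest after at most 200 trees.
    forest->setTermCriteria(cv::TermCriteria(cv::TermCriteria::MAX_ITER, 200, 0.0));
    return forest;
}

// Usage sketch: 'features' is an N x d CV_32F matrix (d = 8..11 depending on
// the chosen representation), 'targetX' an N x 1 CV_32F vector of screen
// x-coordinates; an analogous forest is trained for the y-coordinate.
// forest->train(cv::ml::TrainData::create(features, cv::ml::ROW_SAMPLE, targetX));
```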

Data Sets
We have used two different data sets in our experiments. The first set (Data Set 1) comprises six short videos from two different users. Users were asked to look at a circle moving on the screen in order to capture and evaluate different eye positions. These videos were 20 seconds long, and all frames have been manually labeled. The users' faces were illuminated by a desktop lamp positioned behind the camera. This data set was used to assess the error in the estimation of the irises' position in the image, as described in Sec. 4.3.1.
The second set (Data Set 2), which was used to evaluate the mapping between pupil center and gaze point, is composed of 1130 images taken from 10 different participants. 70% of the images in this data set were acquired with a light source coming from a desk lamp positioned behind the screen, while the remaining images were acquired with artificial room illumination, thus reproducing common user interaction scenarios in an indoor environment. Participants were asked to look at a colored marker (a red circle with a radius of 90 pixels, corresponding to a field of view of 0.8°) appearing at random locations on a black screen. When a circle was displayed, participants were tasked with looking at it and pressing a key on the keyboard. Then the circle was reduced in diameter by two thirds, at which point participants pressed a second key. The system acquired and stored, along with the position of the circle center on the screen, the left and right irises' positions in the camera and in the head pose reference systems.
Each participant was tested under both illumination conditions. Participants took turns in the data collection, with each participant testing 10 circle locations before another participant took over. Note that the rotation matrix R^H_C and the translation vector relating the screen and the camera reference systems are also stored (the origin of the screen reference system is placed at the top left corner of the screen).

Experimental Results
With the aim of providing a complete analysis of the solution, our system evaluation was divided into two steps. In the first test, we assessed the iris detection system (Sec. 4.3.1). In the second, we evaluated the performance of the gaze estimation algorithm by means of leave-one-out cross-validation (Sec. 4.3.2). A qualitative evaluation of the system in an HCI scenario is described in Sec. 4.3.3. Several possible sets of feature vectors have been analyzed and their performance compared.

Iris Detection
Data Set 1 (see Sec. 4.2) was used for this test. At each frame, we measured the distance E (in pixels) between the location of the center of the detected circle and the manually labeled pupil center for both the left and right eye. Fig. 6 shows the histograms of E over all frames and users for the left and right eye. Note that computing the centers of both irises (as explained in Sec. 3.2) takes about 5.4 msec in our implementation.
It should be noted that the precision of iris center estimation is affected by the quality of the acquired images, the limited image size of the periocular regions, and changing illumination conditions (including the shadows on the head surface generated by head movement).
Compared with other algorithms in the literature that may achieve higher gaze localization accuracy, our system has the advantage that it can run at a high frame rate, and thus represents an attractive solution for those applications in which speed is more critical than high precision.

Gaze Point Detection
We used Data Set 2 to assess the quality of the mapping from iris center measurements to gaze point using regression random forests. We used two cross-validation modalities. In the first modality, each vector in the data set was selected in turn as a test vector, and the system was trained on all remaining vectors. Error statistics are shown by means of histograms in Fig. 7. The feature vector given as input to the regressor includes the 3-D head pose (rotation and translation) as well as the irises' locations. We considered different representations for these quantities, specifically: Euler angles vs. quaternions for the head rotation, and camera vs. head reference system for the irises' location. In addition, we experimented both with the single (left) pupil center and with both centers. Note that the feature length varied between 8 and 11, depending on the representation chosen and on whether one or two eyes were considered. The best results were found when using both pupils and the quaternion representation of the head rotation. Fig. 8 shows results in terms of gaze point errors in the x and y coordinates. Note that, in general, detection accuracy tends to be higher for the x-axis.
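As an illustration of the feature vectors compared above, the richest configuration (quaternion head rotation, head translation, and both pupil centers in the head pose reference system) could be laid out as in the sketch below; the field names and ordering are assumptions, but the dimensionality matches the reported 8-11 range (with Euler angles and a single pupil the length drops to 8).

```cpp
#include <array>

// Illustrative layout of the longest feature vector (length 11):
// head rotation as a quaternion, head translation, and both pupil
// centers in the head pose reference system (z_H = 0 plane, so only
// two coordinates per pupil). Names and ordering are assumptions.
struct GazeFeatures {
    std::array<float, 4> headRotationQuat;  // qw, qx, qy, qz
    std::array<float, 3> headTranslation;   // T, camera reference system
    std::array<float, 2> leftPupilHead;     // (x_H, y_H) of left pupil
    std::array<float, 2> rightPupilHead;    // (x_H, y_H) of right pupil
};
```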
In the second cross-validation modality, we selected the data from each participant in turn as the test set, and trained the system with the data from the other participants. For these tests, we used the configuration with the head rotation expressed using quaternions, and the pupil center positions expressed in the head pose reference system. Tab. 2 shows the results for each participant.
Tab. 3 compares the average mean square error of our system (evaluated with the first cross-validation modality) against similar results for other real-time systems reported in the literature. Results are expressed in units of angular error for consistency.
Our experiments have highlighted a noticeable difference in accuracy between the x component (mean error = 2.29°) and the y component (mean error = 5.33°). The main reason for this behavior is the higher error in the y component of iris localization.
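For reference, an on-screen error e (in mm) observed at viewing distance d converts to angular error via the standard relation below; this is given only as a sketch of how screen-plane errors map to degrees, not a restatement of the exact geometry used in the tests:
\[
\theta = \arctan\!\left(\frac{e}{d}\right)
\approx \frac{180}{\pi}\,\frac{e}{d} \ \text{degrees for small } e/d .
\]
For example, at d = 500 mm, an on-screen error of e = 20 mm corresponds to roughly 2.3°.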

Qualitative Results
We conducted a qualitative test in order to assess how our system could be used in a real-world human-computer interaction (HCI) application. In this simple test, two videos with a target (represented by a red circle) moving on the screen were shown to three participants. In the first video, the circle moved along a simple rectangular trajectory, while in the second video the circle followed a sawtooth trajectory. Participants sat in front of the screen at a distance varying between 40 and 70 cm. Uniform illumination was created by artificial light in the room. Participants were asked to follow the circle's trajectory with their gaze. They were informed prior to the test about the details of the test and about the expected trajectory of the circle. Fig. 9 shows the circle's trajectories (blue line) and the measured gaze point trajectories (red line) for user #1. Fig. 10 shows the complete hit maps for all three users interacting with our system with both videos. Each user is represented by a different color, with the brightness of the color (light to dark) indicating the progress of the trajectory. Note that our system, in the configuration considered for this test, can process images at 8.88 frames per second on average, for an input resolution of 1280 × 780 pixels. This frame rate makes it suitable for various real-time HCI applications [29,61,62]. When a faster tracking rate is required, hardware-based gaze tracking solutions should be used [63][64][65][66].

Conclusion
We proposed a novel real-time gaze estimation system suitable for HCI applications. This system uses a regular camera, of the type that is typically embedded in laptops and computer screens. The proposed system does not require a user-dependent calibration, can deal with illumination changes, and can work with a variety of head poses. The solution is based on an appearance-based method that uses video from a regular camera to detect the pose of the user's head and the location (in the image) of the eye features. This information is fed to a machine learning system, which produces the gaze point location on the screen. Our end-to-end system is able to process images at more than 8 frames per second on a regular laptop computer. Quantitative and qualitative tests in natural conditions have shown promising results in terms of robustness and accuracy.
The main shortcoming of the proposed system is its reduced accuracy in the vertical component of the estimated gaze point. Future work will explore strategies to overcome this problem, as well as methods to automatically calibrate some of the user-dependent system parameters (e.g., the interpupillary distance). Finally, we plan to benchmark our system against existing data sets such as MPIIGaze [33].

Fig. 1. A block diagram of the proposed solution.

Fig. 2. A scheme of the employed face measurements. Point O lies on the nose base.

Fig. 4. The proposed feature extraction algorithm for an incoming frame. Green points represent the IntraFace facial tracking output, while red circles on the eyes represent the iris detection output.

Fig. 5. Sample results from our pupil center detection algorithm.

Fig. 8. Error analysis using two eyes and expressing rotations by quaternions.

Fig. 9. Example of real-time interaction with user #1 and two predefined sequences.

Fig. 10. Hit maps of the three users involved in the experiment. In the color scale, the lightest colors refer to the beginning of the visual exploration session, whereas the darkest colors refer to the final part of the video.


Table 2. Mean square eye gaze direction error (in degrees) on Data Set 2, with each participant in turn excluded from the training set.

Table 3. Mean square eye gaze direction error (in degrees) compared with other published real-time systems.