


Eye tracker data quality: What it is and how to measure it
Kenneth Holmqvist, Marcus Nyström, Fiona Mulvey
Lund University, Sweden

Abstract

Data quality is essential to the validity of research results and to the quality of gaze interaction. We argue that the lack of standard measures for eye data quality makes several aspects of manufacturing and using eye trackers, as well as researching eye movements and vision, more difficult than necessary. Uncertainty regarding the comparability of research results is a considerable impediment to progress in the field. In this paper, we illustrate why data quality matters and review previous work on how eye data quality has been measured and reported. The goal is to achieve a common understanding of what data quality is and how it can be defined, measured, evaluated, and reported.[1]

[1] We thank the members of the COGAIN Technical Committee for the standardisation of eye data quality (see www.cogain.org/EyeDataQualityTC) for their ongoing participation and comments on this text.

Figure 1: Good and poor precision in two remote 50 Hz eye trackers as seen in an x-/y-visualisation (scanpath view). From [Holmqvist et al. 2011], page 149.

CR Categories: I.3.7 [Eye tracking]: Data quality—Standardization

Keywords: data quality, eye tracker, eye movements, precision, accuracy, latency

1 Does data quality matter?

The validity of research results based on eye movement analysis is clearly dependent on the quality of the eye movement data. The same is true of the performance of gaze-based communication devices. Eye data contain noise and error which must be accounted for. There are currently no norms or standards for what researchers report about data quality in publications, or for what manufacturers report about their eye tracker's typical performance. What is a serious impediment for one purpose may not be significant for another. For example, a cheap eye tracker composed of off-the-shelf components may be sufficient for clicking large buttons in gaze interaction or for looking at large AOIs with sufficient margins, and may work as an assistive device mounted on a wheelchair, whereas a more expensive, high-performance eye tracker may have better data quality and a greater number of the valid eye movement measures necessary in much psychological, neurological and reading research. It is a case of matching the system to the purpose and to the user or participant group, and this is a very difficult task without standardized measures of data quality. If data quality is measured and characterised for the eye tracker, the participant group, and the specific experimental measures of interest, there are methods of dealing with low quality to maximise the validity of results: correcting or abandoning data [Holmqvist et al. 2011, pp. 140 and 224]. However, these methods cannot be considered without first analysing the data and identifying what is and is not noise or error.

Since fixation analysis obscures the original data quality, most researchers estimate the quality of their own recordings from various plots of raw data samples. For instance, Figure 1 shows good versus poor precision, and Figure 2 a case of poor accuracy in the upper left corner. It may be obvious that eye tracker data quality affects the validity of results, but how large is the effect? Is it reasonable to assume valid results from a commercial eye tracker without measuring quality in a particular data set, or should all eye movement researchers check their data quality and report it as part of their results? To illustrate these issues, we begin with four examples.

Figure 2: Very inaccurate data in one corner. From [Holmqvist et al. 2011], page 132.

1.1 Example 1: Effect of accuracy on dwell time measures

Accuracy (sometimes called offset) is one of the most highlighted aspects of data quality. Loosely speaking, it refers to the difference between the true and the measured gaze direction.

Figure 3(a) shows high-quality data recorded from one participant looking at the stimulus image for 30 seconds, with the task of estimating the age of the people in the scene. Binocular data were recorded with a tower-mounted eye tracker sampling at 500 Hz, but only data from the left eye are shown and analysed. The eye tracker reports an average accuracy of 0.30° horizontally and 0.14° vertically after calibration and a four-point validation procedure.

Figure 3(b) displays areas of interest (AOIs) for the faces in the stimulus image. Because this is a real image, there is no whitespace (i.e. an area not covered by any AOI) between the faces that could be used as AOI margins. AOIs with small margins are common in reading research, web studies, and studies that use videos or real-world stimuli. They are also common in gaze interaction scenarios, e.g. when typing on an onscreen keyboard. When there is no room for margins, data with poor accuracy will sometimes move to another AOI than the one intended. We can simulate degrees of poor quality by adding a 0.5° offset to the recorded data, moving them a bit in space.
Figure 3: (a) Original data in high quality. (b) AOI positions. (c) Total dwell time in each AOI; original data. (d) Total dwell time after 0.5° inaccuracy (offset) has been added to the data. Figures (c) and (d) compare total dwell times with accurate vs slightly inaccurate data. The inaccuracy was added to the original data. Note that 0.5° is considered a very small error.

Even with this additional offset, the accuracy is still considered rather high in comparison to what is commonly reported in the literature; in fact, several manufacturers report 0.5° offset as their standard or even best possible accuracy. If a system's inaccuracy is not taken into account when designing test stimuli and analysing data from a study, what kind of effect may it have on results?

Dwell time ('gaze duration', 'glance time', ...) is the time gazed at an AOI, from entry to exit, whereas total dwell time is the sum of all dwell times to a specific AOI over a trial [Holmqvist et al. 2011, pp. 190 and 389]. It is a very common measure in eye-movement research. Figure 3(c) shows total dwell times for seven AOIs based on the original data, and Figure 3(d) shows dwell times for the same AOIs after a 0.5° offset has been added. Note that for some AOIs total dwell time is slightly reduced, for others it is significantly reduced or even removed entirely, and for some AOIs dwell time is hardly affected at all. The effect is not uniform across AOIs and so cannot be corrected or controlled for. The purpose of this example study was to analyse AOIs for dwell time and number of fixations, which is typical of many studies. Adding a 0.5° offset to the data simulates many common recording scenarios. The point to note is that even when precision is relatively good, the small amount of inaccuracy present can lead to significant differences in the results.

Often, noise in data can be counteracted by increasing the amount of data, as, for instance, with the effect of low sampling frequency on fixation duration [Andersson et al. 2010]. In contrast, more data does not remedy the effect of poor accuracy on AOI measures such as dwell time, because the displaced data points are often distributed in the same direction, out of the AOI.

Apart from its effect on research results, accuracy also affects gaze-based communication technologies. In gaze-based interaction, interactive on-screen targets are in fact AOIs with clear margins. Dwell-time selection is a common method of 'clicking' a button with gaze. Having several buttons side by side in an array, for example in an on-screen keyboard, will produce errors in selection if data move to the neighbouring button. When the selection method involves dwell time, this may cause an almost completed dwell-time based selection to restart, or, if the data are very inaccurate and there is no space between targets, may mean that selection is very difficult or can only be made on very large (or magnified) targets.
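To make the AOI effect concrete, the following minimal sketch (our illustration, not the authors' analysis code; the AOI rectangles, sample rate and gaze trace are hypothetical) shifts raw gaze samples by 0.5° and recomputes total dwell time per AOI:

    import numpy as np

    def total_dwell_time(gaze_deg, aois, sample_dur_ms):
        # Sum the time (ms) that raw samples spend inside each rectangular AOI;
        # gaze samples are given in degrees of visual angle.
        dwell = {name: 0.0 for name in aois}
        for x, y in gaze_deg:
            for name, (x0, y0, x1, y1) in aois.items():
                if x0 <= x <= x1 and y0 <= y <= y1:
                    dwell[name] += sample_dur_ms
                    break  # assume non-overlapping AOIs
        return dwell

    # Hypothetical AOIs (degrees) sharing a border, i.e. no margin, as in Figure 3(b).
    aois = {"face_1": (0.0, 0.0, 3.0, 4.0), "face_2": (3.0, 0.0, 6.0, 4.0)}

    rng = np.random.default_rng(1)
    # Simulated 500 Hz fixation data hovering 0.2 deg left of the shared border.
    gaze = np.column_stack([rng.normal(2.8, 0.1, 1000), rng.normal(2.0, 0.1, 1000)])

    print(total_dwell_time(gaze, aois, 2.0))               # original data
    print(total_dwell_time(gaze + [0.5, 0.0], aois, 2.0))  # same data, 0.5 deg offset

With samples clustered near a shared AOI border, the added offset moves a large share of the dwell time from one AOI to its neighbour, mirroring the non-uniform changes seen in Figure 3(d).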

1.2 Example 2: Effects of precision on the number and duration of fixations

Figure 4: How a decrease in precision affects the number and duration of detected fixations. (a) Illustration of data with high (left) and low (right) precision. (b) Influence of precision on the number of fixations and the average fixation duration. The precision was decreased by adding Gaussian noise with an increasing variance. Fixations were detected with the algorithm by [Nyström and Holmqvist 2010], using default settings.

Inaccuracy is not the only data quality issue affecting the validity of research results. While accuracy refers to the difference between the true and the recorded gaze direction, precision refers to how consistent the calculated gaze points are when the true gaze direction is constant. It is often tested with an artificial eye, which does not move at all. Precision measurements are commonly conducted to test a particular eye tracker, and when using an artificial eye, this measure gives an idea of the system noise or error, which varies with the quality of the eye tracking system. In essence, this enables us to investigate the effect of collecting data with or without a bite-bar or chin rest, or with a tower-mounted eye tracker compared to a remote one. It is also one aspect of testing eye tracker quality. By adding Gaussian noise with an increasing standard deviation to the eye movement data in Figure 3(a), we can simulate poor precision in an eye tracker. Figure 4(a) shows an example of the original data (left part of the figure) and the data after noise has been added. The range of added noise has been chosen to conform to recorded precision values for current eye trackers, which according to [Holmqvist et al. 2011] are 0.01–0.05° for tower-mounted systems and 0.03–1.03° for remote ones. The larger values in the latter range, however, are likely to reflect eye trackers with exceptionally poor precision, and are therefore not included in the data presented. Precision values are calculated as the root mean square (RMS) of intersample distances in the data.

Figure 4(b) illustrates how precision influences the number and duration of fixations, as detected by the adaptive velocity algorithm developed by [Nyström and Holmqvist 2010]. With this algorithm, fixations become fewer and longer as precision decreases. This is most likely because the saccade detection threshold increases as a direct consequence of the higher noise level, which prevents small saccades from being detected. These small saccades then become part of adjacent fixations, merged into one longer fixation. The effect in Figure 4(b) is dramatic; even though the data should represent exactly the same eye movement behaviour, the number of fixations decreases by more than 30%, whereas the average fixation duration increases by about 10%. The size of this effect will change with the method of event detection used. For example, it is likely that systems using dispersion-based fixation detection algorithms produce a different result.
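The noise simulation itself is straightforward to reproduce. A minimal sketch, assuming gaze coordinates already expressed in degrees of visual angle (the sampling rate and fixation position are made up):

    import numpy as np

    def rms_precision(x, y):
        # Precision as the root mean square of intersample distances.
        d = np.hypot(np.diff(x), np.diff(y))
        return np.sqrt(np.mean(d ** 2))

    rng = np.random.default_rng(0)
    # Idealised, perfectly still fixation: one second at 500 Hz.
    x0 = np.full(500, 10.0)
    y0 = np.full(500, 5.0)

    for sd in (0.0, 0.01, 0.05, 0.2):  # standard deviation of added noise, degrees
        x = x0 + rng.normal(0.0, sd, 500)
        y = y0 + rng.normal(0.0, sd, 500)
        print(f"noise sd {sd:.2f} deg -> RMS precision {rms_precision(x, y):.3f} deg")

Any velocity-based event detector run on the noisier traces will need a higher saccade threshold, which is the mechanism behind the merged fixations described above.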
Figure 5: How data loss affects the number and duration of detected fixations. (a) Data with missing samples (indicated with red dots); 18% of the samples were lost in this example. (b) Influence of data loss on the number of fixations and the average fixation duration. Data loss was simulated by randomly inserting burst losses with a length uniformly drawn from the interval [10, 100] samples. Fixations were detected with the algorithm by [Nyström and Holmqvist 2010], using default settings.
1.3 Example 3: Effect of data loss on the number and duration of fixations

Lost data refers to samples that are reported as invalid by the eye tracker. Typically, these correspond to (0, 0) coordinates or samples that are flagged with a certain validity code in the data file. Data losses derive from periods when critical features in the eye image (often the pupil and the corneal reflection(s)) cannot be reliably detected and tracked. This can occur when, for example, glasses, contact lenses, eyelashes, or blinks prevent the video camera from capturing a clear image of the eye.

Sometimes, it may be desirable to differentiate blinks from other sources of data loss. This may be because blinks are used as a behavioural measure (e.g. [Holland and Tarlow 1972], [Tecce 1992]) or because they are used for gaze-based interaction, for example as a 'click' select input. In such cases, simply removing raw data samples with (0, 0) coordinates is not possible, and blinks need to be modeled and differentiated from other causes of loss of signal. Many eye trackers do not output blinks as an event.

Figure 5(a) shows how losses have been introduced into the eye movement signal, where red dots represent lost or invalid samples. To simulate short, local losses of data, invalid data are inserted as burst losses, which occur with probability Pl and last for Nl samples, where Nl is drawn uniformly from A = {10, 11, ..., 100}. Figure 5(b) reveals the same trend for data loss as Figure 4(b) did for decreased precision: a reduction in the number of fixations and an increase in fixation duration.
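The burst-loss model described above can be sketched as follows (our reading of the Pl/Nl description; the loss probability and recording length are arbitrary):

    import numpy as np

    def insert_burst_losses(valid, p_loss, rng):
        # A burst starts at a sample with probability p_loss and invalidates
        # the next N samples, N drawn uniformly from {10, 11, ..., 100}.
        valid = valid.copy()
        i = 0
        while i < len(valid):
            if rng.random() < p_loss:
                n = int(rng.integers(10, 101))
                valid[i:i + n] = False
                i += n
            else:
                i += 1
        return valid

    rng = np.random.default_rng(7)
    valid = np.ones(15000, dtype=bool)  # e.g. 30 s of data at 500 Hz
    valid = insert_burst_losses(valid, 0.002, rng)
    print(f"proportion of lost samples: {1 - valid.mean():.2f}")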
1.4 Example 4: Effect of screen position on pupil size

Pupil size reacts primarily to changes in illumination, but it is often used as a measure of mental workload or emotional valence, or as an indication of drug use [Holmqvist et al. 2011, pp. 393–394]. A prerequisite for such investigations (apart from controlled light conditions) is that the recorded change in pupil size reflects the true change in pupil size, and therefore that the eye tracker does not add any systematic or variable error to the data. Pupil size measures will include systematic error if the apparent change in pupil size with viewing angle is not controlled for by the eye tracking system, or corrected in the recorded data subsequently. The effect of viewing angle is that pupil size appears larger when the eye is on-axis with the eye camera. This typically means that the measured pupil is largest when looking at the centre of the screen compared to the edges. Without knowing this relationship between pupil size and screen position for the particular system being used, the difference in pupil size may be attributed to differences in cognitive processing or emotional responses to the objects viewed. [Gagl et al. 2011] reported a similar effect and also propose a method to correct the errors. Such problems can be corrected, but only if the error is first measured for the particular set-up used.

2 Factors influencing data quality

Many factors influence data quality, including:

1. Participants have different eye physiologies, varying neurology and psychology, and differing abilities to follow instructions. Some participants may wear glasses, contact lenses, or mascara, or may have long eyelashes or droopy eyelids, all of which interfere with the eye image and may or may not be accounted for in a system's eye model [Nyström et al. submitted].

2. Operators have differing levels of skill, and more experienced operators should be able to record data with higher quality [Nyström et al. submitted]. Operator skills include adjusting eye-to-camera angles and mirrors, monitoring the data quality in order to decide whether to recalibrate, and providing clear instructions to the participants.

3. The task matters: a task that requires participants to move around a lot, for example, could affect data quality, and a task that causes participants to blink more often leads to more data loss, unless blinks are modeled as eye events.

4. The recording environment has a strong influence on data quality. Was the data collected outdoors in sunlight or indoors in a controlled laboratory environment, for instance? Were there any vibrations in the room that reduced the stability of the eye movement signal? These factors should be considered and reported.

5. The geometry, that is, the relative positions of eye camera, participant, and stimulus, affects data quality, as does the position of the head in what is known as the head box [Holmqvist et al. 2011, p. 58]. This may be of particular importance when using eye trackers as a communication aid for people with disabilities, who may be constrained in their movement or sitting/lying position.

6. The eye tracker design does, of course, have a large impact on the quality of the recorded data. Simply put, an eye tracker consists of a camera, illumination, and a collection of software that detects relevant features in the eye and maps these to positions on the screen. The resolution of the video camera and the sharpness of the eye image are important factors that are directly related to some aspects of data quality. Equally important are the image analysis algorithms, the eye model, the eye illumination and the calibration procedure. Eye tracker system specifications will also have an influence on data quality. The most quoted system specification is the sample rate, or sampling frequency, which dictates the system's ability to record brief events and to produce accurate velocity profiles. Another system specification which influences data quality is whether the system is bright- or dark-pupil based (i.e. whether the eye illumination is on- or off-axis, producing a bright or dark image of the pupil; for a review of the various set-ups currently in use, see [Hansen et al. 2011]). This may interact with eye colour or other factors to affect data quality. Finally, whether the eye tracker records monocularly or binocularly is of interest. The accuracy and precision of fixation data may improve if data from the two eyes are combined, particularly when using a dispersion-based fixation detection method; but if data from the two eyes are not separable, saccade velocity profiles, microsaccades, drift, and saccade amplitude measures will lose validity.
3 Terminology for data quality
First, let us make clear that we cannot know where a human is looking. Even when a participant says she looks at a point, the centre of the fovea can be slightly misaligned. When we talk about 'actual gaze' we refer to this subjective but reportable impression, which is what the vast majority of eye trackers are designed to measure.
Thus, in general terms, data quality can be defined as the spatial and temporal deviation between the actual and the measured gaze direction, and the nature of this deviation, on a sample-to-sample basis. In the very simplest case, we consider these deviations in the presence of only one data sample x̂_i. This sample can be reported as either valid or invalid by the eye tracker, where an invalid sample usually means that relevant eye features could not be detected from the video feed of the eye, for instance due to loss of the eye image. Clearly, with the exception of blinks, it does not make much sense to characterize the quality of missing data other than to classify it as invalid. When the eye tracker reports a valid sample, data quality can be defined as the distance θ_i (in visual degrees) between the actual x_i and the measured x̂_i gaze position, known as the spatial accuracy, or just accuracy, as well as the difference between the time of the actual movement of the eye t_i and the time reported by the eye tracker t̂_i, known as latency or temporal accuracy. If both of these differences are zero, the data quality for this single sample is optimal.

Figure 6: The set of raw data samples on the left have large sample-to-sample distances, and therefore RMS will be high. They are not very dispersed, so standard deviation will be low. The data set on the right, typical of a vibration in the eye tracker, has short sample-to-sample distances, which gives a low RMS, but it is fairly dispersed, so standard deviation will be higher.

The example with only one sample is, however, mainly of academic interest. Typically, one needs to consider several samples recorded from a whole experiment, a trial, or a single event such as a fixation. Given n recorded samples, accuracy can be calculated as

    \theta_{\mathrm{Offset}} = \frac{1}{n} \sum_{i=1}^{n} \theta_i.    (1)

The variance in accuracy is often referred to as spatial precision, and the variance in latency is typically called temporal precision. Two common ways to estimate the spatial precision in the eye movement signal are the standard deviation of the samples and the root mean square (RMS) of inter-sample angular distances, but a whole range of other dispersion measures exist that could serve as alternatives [Holmqvist et al. 2011, pp. 359–369]. The standard deviation for a set of n data samples x̂_i is calculated as

    s_x = \sqrt{\frac{1}{n} \sum_{i=1}^{n} \left( \hat{x}_i - \hat{x}_{\mathrm{avg}} \right)^2},    (2)

where x̂_avg denotes the sample average. Letting θ_i denote the angular distance between consecutive samples, precision can be expressed as

    \theta_{\mathrm{RMS}} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} \theta_i^2} = \sqrt{\frac{\theta_1^2 + \theta_2^2 + \cdots + \theta_n^2}{n}}.    (3)

These two precision calculations reflect different factors. Precision calculated as the standard deviation reacts in particular to vibrations in the environment, whereas precision calculated as the root mean square (RMS) of inter-sample distances does so much less. Figure 6 illustrates this important difference. It is likely that a full standard needs several precision calculations that each measure a different aspect of the data.
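The three quantities translate directly into code. A minimal sketch, assuming the samples and targets are already converted to degrees of visual angle (the example values are made up):

    import numpy as np

    def accuracy_offset(gaze, target):
        # Equation (1): mean angular distance between measured gaze and target.
        return np.mean(np.linalg.norm(gaze - target, axis=1))

    def precision_sd(x):
        # Equation (2): standard deviation of the samples along one axis.
        return np.sqrt(np.mean((x - x.mean()) ** 2))

    def precision_rms(gaze):
        # Equation (3): RMS of angular distances between successive samples.
        d = np.linalg.norm(np.diff(gaze, axis=0), axis=1)
        return np.sqrt(np.mean(d ** 2))

    gaze = np.array([[0.31, 0.12], [0.29, 0.15], [0.33, 0.13], [0.30, 0.16]])
    target = np.array([0.0, 0.0])
    print(accuracy_offset(gaze, target))   # theta_Offset
    print(precision_sd(gaze[:, 0]))        # s_x, horizontal
    print(precision_rms(gaze))             # theta_RMS

As the prose notes, the two precision estimates respond differently to vibration and dispersion, so reporting both is informative.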
Both accuracy and precision can be computed separately for the horizontal and vertical dimensions. This may be of particular significance for persons with physical disability; [Cotmore and Donegan 2011], for example, outlines the development of a gaze-controlled interface for a user who only has good control of movements in one dimension. Moreover, the proportion of valid data samples recorded is often a good indication of whether the system has problems tracking a particular individual or in a particular environment.

The spatial accuracy and precision of pupil size can be defined in a similar manner. The unit of measurement is either pixels in the eye camera or the perhaps more intuitive unit of millimeters. Since pupil size values are recorded at the same rate as gaze samples, temporal quality values for pupil size are shared with those calculated for gaze samples.

Closely related to spatial precision is a measure termed spatial resolution, which refers to the smallest eye movement that can be detected in the data. If such small eye movements oscillate quickly, they can only be represented in data with a high temporal resolution, or sampling frequency, according to the Nyquist–Shannon sampling theorem [Shannon 1948].

4 Measuring data quality using an artificial eye

The artificial eye is an important and versatile tool in the assessment of data quality. However, eye trackers vary in terms of their eye models; therefore, finding an artificial eye which will 'trick' all eye trackers is difficult. When deciding which eye tracker to buy or use for a particular study, artificial eyes provide a way of comparing inherent system noise and error, and can be used to check system latency. Artificial eyes are usually available from the manufacturer, at least for systems intended for research purposes. While it is relatively simple to produce artificial eyes for dark-pupil based systems (i.e. where the eye illumination is off-axis), it is trickier for bright-pupil based systems (i.e. where the eye illumination is on-axis). Battery-equipped eyes with actively luminous pupils would be one solution.

4.1 Precision measurements with an artificial eye

Optimal precision for an eye tracker should be calculated with samples originating from a period when the eye is fixating. The only way to completely eliminate biological eye movement from the eye movement signal is to use a completely stationary eye [Holmqvist et al. 2011, pp. 35–40]. Since this is not possible with actual participants, an artificial eye, which produces the corneal reflections required by the eye tracker, is usually employed. This is also how many manufacturers measure precision [SR Research 2007; Sadeghnia 2011; Johnsson and Matos 2011]. When assessing precision in real data, it is useful to know what the maximum possible precision of the system is. If system noise means that baseline precision is low, many eye movement measures may not be validly recorded. For example, the measurement of velocity profiles will be far more affected by low precision than by low accuracy, if the offset in accuracy is uniform across the screen. Likewise, low precision may affect which kind of event detection is preferable for the data.
The experimental procedure is simple. First, calibrate with a human eye in the normal way so that you can start recording coordinate data. Calibration of a human eye may introduce some small noise, so if you have a system where you can get data without first calibrating, you may do that, but be aware that the precision value will not be comparable to systems that require calibration before data recording. Then, put one or a pair of artificial eyes where the human eye(s) would have been, and make sure the artificial eyes are securely attached. Beware of vibration movements from the environment, which should not be part of your precision measurement. See to it that the gaze position of the artificial eye(s) is somewhere in the middle of the calibration area, and then start the recording. Export the raw data samples, and use trigonometry, the eye-to-monitor distance, and the physical size and resolution of the monitor to calculate sample-to-sample movement in visual degrees. Then select a few hundred samples or more where the gaze position appears to be still, and calculate the RMS or standard deviation of these samples.
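The trigonometry in the last step can be sketched as follows (hypothetical monitor geometry; the small-angle conversion around the screen centre is our simplification):

    import numpy as np

    def px_to_deg(d_px, screen_w_px, screen_w_mm, eye_dist_mm):
        # Convert a pixel distance to visual degrees, approximating around
        # the screen centre (adequate for tiny sample-to-sample movements).
        d_mm = d_px * (screen_w_mm / screen_w_px)
        return np.degrees(np.arctan2(d_mm, eye_dist_mm))

    def rms_intersample_deg(x_px, y_px, screen_w_px, screen_w_mm, eye_dist_mm):
        d_px = np.hypot(np.diff(x_px), np.diff(y_px))
        return np.sqrt(np.mean(px_to_deg(d_px, screen_w_px, screen_w_mm,
                                         eye_dist_mm) ** 2))

    # A few hundred samples of a still artificial eye near the screen centre.
    rng = np.random.default_rng(3)
    x = 512.0 + rng.normal(0.0, 0.4, 500)   # pixel coordinates
    y = 384.0 + rng.normal(0.0, 0.4, 500)
    print(f"RMS: {rms_intersample_deg(x, y, 1024, 380.0, 670.0):.3f} deg")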
Different artificial eyes tend to give slightly different RMS values for the same eye tracker. [Holmqvist et al. 2011] found RMS values of 0.021° and 0.032° on the same eye tracker when using two different artificial eyes from two manufacturers. The variance for real eyes will be even greater. Part of the standardization work might be to build the specifications for a single artificial eye, or a set of them, that can be used on all eye trackers. This may include a variation in the colour of the artificial iris, as well as the possibility of a reflective artificial retina (to test bright-pupil detection based eye trackers, i.e. those systems where the infrared light source is placed on-axis).

Testing only with an artificial eye may be misleading, however. Artificial eyes do not have the same iris, pupil, and corneal reflection features as human eyes, and may be easier or more difficult for the image analysis algorithms in the eye tracker to process. Also, in actual eye-tracking research, real eyes tend to vary greatly in terms of image features that cannot be simulated with artificial eyes. Therefore, some manufacturers complement the artificial eye test with a precision test on a human population with a large variation in eye colour, glasses, and contact lenses, as well as ethnic background, having them fixate several measurement points across the stimulus monitor. The full distribution of precision values from such a test, across many measurement points and participants, is an important indicator of what precision you can expect in actual recordings, and its average defines the typical precision. The drawback is that this data includes oculomotor noise, and therefore both human and artificial eyes are needed.

4.2 Pupil diameter quality measurements with artificial eyes

Figure 7: Pupil accuracy means that the diameter recorded by the eye tracker should be directly proportional to the diameter of the pupil in the artificial eye. This data (from a tower-mounted eye tracker) shows good accuracy with only minor inaccuracies. Pupil resolution refers to the smallest change in diameter of the artificial pupil that can be distinguished by the eye tracker. The four values around 4 mm show that this eye tracker has a pupil resolution at least on the 0.1 mm level. Data from personal communication with one manufacturer.

The quality of pupil diameter data is also typically measured using artificial eyes. There are three such data quality measures. First, pupil precision is calculated as the RMS of a sequence of pupil diameter samples recorded from an artificial eye.

Pupil accuracy can be measured by presenting the eye tracker with artificial eyes that have known pupil diameters (such as 2, 3, and 4 mm). If all eyes are presented at the same distance, the eye tracker should output a line of diameters proportional to the input. For instance, the pupil dilation value for the 2 mm artificial pupil should be 50% of the diameter recorded for the 4 mm pupil. Figure 7 shows data from such a measurement.

Pupil resolution is the smallest detectable change in pupil dilation. It can be measured by showing the eye tracker artificial eyes with small differences in diameter. In Figure 7, the four values around 4 mm differ by 0.1 mm. The clear proportional output shows that the eye tracker is capable of distinguishing between these dilations.

Figure 8: Measurement of maximum head movement speed in a remote eye tracker. Reproduced from manufacturer document.

4.3 Controlled motion of artificial eyes

If the artificial eye could be made to move as a real eye does during saccades, fixation and smooth pursuit, it would be possible to measure optimal data quality during movements. At least one such prototype has been built, which mimics human eye movements well, except at a slightly slower speed.

However, motion of an artificial eye can be used in other ways also. For instance, the maximum head movement speed in a remote eye tracker can be measured by setting artificial eyes in front of the eye tracker in an increasingly rapid sinusoidal movement across the so-called head box. At some speed, tracking will be lost, which can be seen in the gaze coordinates, or, as in Figure 8, in the x-coordinate of the corneal reflection. The maximum speed at the last oscillation before tracking is lost is the maximum head movement speed.
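As a worked detail (our reading of the procedure; the paper itself gives no formula), if the traversal of the head box is modelled as x(t) = A sin(2πft) with amplitude A and frequency f, the speed is

    v(t) = \frac{dx}{dt} = 2\pi f A \cos(2\pi f t), \qquad v_{\max} = 2\pi f A,

so the maximum head movement speed is 2πfA, evaluated at the amplitude and frequency of the last oscillation before tracking is lost. For example, a 100 mm amplitude at 1 Hz gives a peak speed of about 628 mm/s.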
4.4 Switching corneal reflection

There are a number of ways that latency can be measured. To control the exact onset of an 'actual movement', one possibility is to turn off the infrared diode and at the same time turn on another, identical infrared diode at a different position (Figure 9). Since the time it takes to turn the diodes on (and off) can be made arbitrarily small in comparison to the sampling interval of the eye tracker, the latency can be reliably measured as the time between the switch of illumination from one diode to the other and the corresponding change in the coordinate data from the eye tracker.

Figure 9: Example measurement of eye-tracker latency: an artificial eye is positioned so that gaze coordinates can be measured. A single infrared light on one side of the eye tracker is used to create a corneal reflection. This light is turned off, and another one on the other side is immediately turned on. This causes an immediate change in the position of the corneal reflection at a time that is known by software; the time until a change in gaze coordinates has been registered is the latency. Reproduced from manufacturer document.

The same setup for measuring latency can be used for simulating loss of tracking. The infrared illuminators are turned off, so that the eye tracker cannot detect any corneal reflections. An illuminator is then turned on again after a fixed period. The recovery time can then be defined as the time it takes from turning on the infrared illuminator until the change is recorded in the gaze coordinates.
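Given a recording around the diode switch, the latency computation reduces to finding the first sample whose coordinate departs from the pre-switch level. A minimal sketch, with a made-up 500 Hz trace and threshold:

    import numpy as np

    def latency_ms(t_switch, t, x, jump_threshold):
        # Latency: time from the diode switch until the reported x-coordinate
        # departs from its pre-switch level by more than jump_threshold.
        baseline = np.median(x[t < t_switch])
        moved = (t >= t_switch) & (np.abs(x - baseline) > jump_threshold)
        if not moved.any():
            raise ValueError("no coordinate change detected after the switch")
        return t[moved][0] - t_switch

    t = np.arange(0.0, 200.0, 2.0)            # sample timestamps, ms (500 Hz)
    x = np.where(t < 112.0, 300.0, 340.0)     # reported gaze jumps 12 ms after switch
    print(latency_ms(100.0, t, x, jump_threshold=5.0))  # -> 12.0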
5 Measuring data quality using real participants looking at targets

The factors in Section 2 must be an integral part of any design to measure data quality from humans. For instance, if the purpose is to compare data quality across different set-ups or different eye trackers, the characteristics of the sample group are important, and the factors listed above should be measured to compare the robustness of the system to, for example, changes in eye colour or eye shape. Within an experiment, such factors may be important in terms of the relative data quality across participants: for example, is data quality significantly lower for those wearing glasses, and if so, were these participants more prevalent in one comparison group?

5.1 Calibration validation procedures

When assessing data quality for the data collected in an experiment, the issue is not to test the system performance but to assess the quality of data for each individual, for exclusion criteria, or for a particular experimental group. If standardized test reports were available for the eye-tracking system in question, the data from an experimental group could be compared to normative data for the same system. Such comparisons require independent testing across a large sample group. This work is underway but not yet complete. In the absence of such standardized measures, data quality could be assessed across experimental and control groups, to check whether quality may be a confounding factor for results. Calibration procedures are proprietary to the system in question, hence testing the accuracy of calibration will mean running a subsequent calibration validation procedure. For this, targets (points) should be included between trials, at known positions, so that the data can later be assessed for accuracy and precision over the duration of the recording.

Having participants look at points presented at known locations on screen is by far the most common data quality evaluation method and serves to validate the system calibration. It is essentially a repeat of the system calibration to check whether the inferred gaze direction matches the actual gaze direction, using targets at known screen coordinates. The size, colour and shape of these targets affect the resulting measures; for example, very large calibration targets will be 'hit' even when the (x, y) point at the centre of the target is quite far from the recorded gaze coordinate. The colour of the background is also important; bright backgrounds will cause the pupil to close down, which may affect accuracy [Jan Drewes 2011].

These measurement points should be placed across the stimulus presentation screen or area to which the data quality values refer. This area is typically the whole monitor, and for standardisation purposes or when testing an eye tracker (as opposed to the data recorded for a particular study), it seems reasonable to assume the monitor provided with the system is the relevant presentation area. In many eye trackers, accuracy tends to be best in the middle of the monitor/recording area, and worst in the corners [Hornof and Halverson 2002]. If the purpose is to give a realistic account of data quality across varying stimulus presentations in future experiments or for future interfaces, then we should select measurement points at positions between calibration points, across the whole area of the monitor, varying gaze angle and position across the whole range possible when looking at the screen. Hence, the target points presented should cover the entire area used to display the experimental stimuli.

5.2 Selecting data samples to include in the calculation of data quality

As artificial eyes do not move, any samples from the recording can be used. With humans, accuracy and precision values are calculated from samples recorded when the participant is assumed to fixate a stationary target. The decision of when the eye is still is typically made by an algorithm, under the assumption that a fixating eye remains relatively stable over a minimum period of time. As a consequence, the data quality values calculated from the fixation samples are directly related to the performance of the fixation detection algorithm. To date, it has been well documented that, given the same set of eye movement data, fixation detection algorithms can output very different results [Karsh and Breitenbach 1983; Salvucci and Goldberg 2000; Shic et al. 2008; Nyström and Holmqvist 2010].

Even when fixations are correctly detected, one target can be associated with several fixations. This can happen due to saccadic undershoot, overshoot, or the small corrective saccades and microsaccades required to align the gaze direction with the target. The researcher must then decide which fixation(s) should be included in the calculations for a given target. Figure 10 illustrates a situation where the eye first undershoots the target (bottom left), then continues towards the target, and finally shifts its position a little to the right. Three fixations are detected in this case. Including the fixation closest to the target would give the highest accuracy, but what motivates this choice of fixation over a different one? Should perhaps all be included? Could we even omit the fixation detection stage and include all samples recorded during the period when the target was shown, even though saccade samples are present? How might we account for the detrimental effects of latency in calculating precision, if latency values are not reported by the manufacturer? There is a strong argument that if the researcher or interface designer will not have access to system latency values for their recorded data, then a standard measure should also assume no latency at all, and calculate precision values in the same way as the consumer will be forced to. The means of selecting which data points (raw samples) are included for calibration validation purposes should be stated as part of the research report or manufacturer specification sheet, and the exclusion criteria should eventually be standardized for comparability across studies.

Figure 10: Three fixations are detected (labelled 'valid samples') during the period when the participant is asked to look at the target (reproduced from [Nyström et al. submitted]). Which samples should be included in precision and accuracy calculations?

A related problem concerns how deviating fixation samples or 'outliers' due to various recording imperfections should be handled. A single outlier can significantly affect the calculated data quality values, particularly if the sampling frequency of the eye tracker is low. Perhaps the samples included for the measurement of data quality should be the same ones chosen by the system for the calculation of fixation position, since this reflects the end-user situation; but fixation accuracy and sample accuracy are two different things: fixation accuracy is affected by the raw data plus the event detection method.
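To make the choice of samples concrete, a minimal sketch (hypothetical fixation centroids and target, in degrees) that compares candidate conventions for one validation target:

    import numpy as np

    def offsets_deg(centroids, target):
        # Angular offset between each candidate fixation centroid and the target.
        return [float(np.linalg.norm(np.asarray(c) - target)) for c in centroids]

    # Three detected fixations for one target, as in the Figure 10 situation.
    centroids = [(9.1, 4.4), (9.8, 5.0), (10.3, 5.1)]
    target = np.array([10.0, 5.0])

    offs = offsets_deg(centroids, target)
    print(min(offs))        # 'closest fixation' convention: flattering accuracy
    print(np.mean(offs))    # 'all fixations' convention: more conservative

Whichever convention is chosen, reporting it explicitly is what makes the resulting accuracy values comparable across studies.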
Furthermore, removing samples raises the question of how the resulting gaps should be treated. Whatever method is chosen, it should be fully described as part of the report on data quality.

6 How is data quality reported?

To date, researchers have rarely reported measured data quality values for their own data sets. The most common way to report accuracy is to refer to the manufacturer's specification sheets, for instance "This system is accurate to within 0.5°" (citation to manufacturer). A search on Google Scholar using "with an accuracy of 0.5" AND "eye tracker" returns 135 papers in all varieties of journals that have used this particular phrase for handing over responsibility for data quality to their particular manufacturer. The vast majority of researchers appear to treat the values on the specification sheet as a correct and objective characterization of the data they have collected, as if all data, from all participants, wherever they look on the monitor, would have an accuracy better than 0.5° of visual angle. This assumption of optimal accuracy across all recording conditions and participants is unlikely to be correct, and may lead to invalid results even when data loss is accounted for and the data look reasonable.

Criteria for excluding data that are cited in the literature include, for instance, the percentage of zero values in the raw data samples, a high offset (poor accuracy) value during validation, a high number of events with a velocity above 800°/s, and an average fixation velocity above 15°/s (indicative of low precision). For an example of accuracy and data loss criteria, see [Komogortsev et al. 2010]. [Holmqvist et al. 2011] conclude that around 2–5% of the data from a population of average, non-pre-screened Europeans needs to be excluded due to participant-specific tracking difficulties. However, this number varies significantly: [Schnipke and Todd 2000], [Mullin et al. 2001] and [Pernice and Nielsen 2009] report data losses of 20–60% of participants/trials, and [Burmester and Mast 2010] excluded 12 out of 32 participants due to calibration (7) and tracking issues (5).

Manufacturer technical development groups need correct data quality values for internal benchmarking: to judge whether changes in hard- or software result in improved data quality. Therefore, several manufacturers develop data quality assessment methods for their own use. In fact, many of the methods developed by manufacturers can be expected to be essential parts of a standardization of data quality measures. This includes the artificial eye and the point-to-point measurement of latency, as well as several of the suggestions for calculation that we have described above. Although there is as yet no consensus on the exact measures for data quality, the measures suggested here are nonetheless useful and informative, and could be included in a research report alongside a description of the measures chosen.

7 Reporting data quality from experiments

A standardized set of eye data quality measures could be automated for use in experimental research, as part of the software package, and compared to an independent report for that eye tracking system or for a similar participant group tested on other systems. Automated data quality measures which are standardized across systems would mean that researchers can easily access them as part of running an ordinary study. They could also be made publicly available by an independent body, in a similar fashion to specifications for other computer-based technologies. Table 1 shows what we propose such a report could look like in a publication.

Table 1: Data quality report from a collection of data in an experiment. Precision values reflect the RMS of inter-sample distances.

    Data quality                                        Average    SD
    Calibration accuracy                                0.32°      0.11°
    Accuracy just before end of recording               0.61°      0.27°
    Calibration precision                               0.14°      0.05°
    Precision just before end of recording              0.21°      0.06°
    Accuracy after post-recording processing            -          -
    Precision after post-recording processing           -          -
    Proportion of dismissed participants                9%         -
    Proportion of lost data samples in retained data    0.3%       0.041%

In order to interpret these values in relation to the other parts of the scientific publication, it is important to specify the analysis: what event detection algorithm was used, with what settings, to detect fixations, saccades, etc.? What were the data exclusion criteria? What are the sizes of AOIs and their margins? Also, the type of eye tracker should be reported, alongside the recording, stimulus and analysis software with version numbers. If any values are unavailable or unknown, they should be stated as such. This is the basic level of information required in order to assess eye movement research for possible confounding variables in the data and to compare research results.

7.1 Who benefits from eye data quality reports?

There are two major uses for a standardized method of testing and reporting eye data quality. We propose, first, that a test house tests eye trackers on the market using a battery of tests that result from a standardized set of measurements. They use both standardized artificial eyes and a large sample of real participants, standardized and selected according to the criteria known to influence data quality, such as eye shape and colour. Because operator experience is a significant factor for data quality, experienced eye tracker operators should be used, or the level of experience measured and controlled for. This activity results in a protocol that manufacturers can base their product documentation on. To make these results useful for gaze interaction as well as for replication of research results, the minimum target size and margin between targets viably selectable by the eye can be calculated based on these values and reported alongside accuracy and precision values.

Second, authors will be able to calculate data quality values from their experimental data and compare them to the known quality values for their eye tracker. In order to assist the comparability of research results, journal reviewers could require these values in papers. Such measures would greatly assist progress in the field by removing the large uncertainty in assessing the results of eye movement research using highly variant equipment and calculations. This approach would also benefit manufacturers, who need standardized measures to assess their systems and to compare performance to competitors or for internal benchmarking. Finally, it would make the task of deciding which eye tracker to buy for particular purposes more transparent and straightforward for both researchers and users of gaze control systems.
8 Conclusion and future work

Clearly, standardization work for eye data quality would benefit eye movement technology and research in general. This work has already begun as a collaborative effort of the COGAIN Association, in the form of a technical committee for the standardisation of eye data quality. In the absence of agreed standard measures while this work is underway, there is an immediate benefit in promoting the testing and reporting of data quality as standard in eye movement research, using the measures outlined above.

Not all aspects of data quality would benefit from standardization, however; there are a number of issues which might be better allowed to evolve freely, including: (a) how accuracy and precision are actually achieved, which is proprietary information and the core business of manufacturers; (b) what the eye tracker can be used for and what conclusions can be drawn from the tests, which should be left to the informed researcher or developer; (c) how low accuracy or precision can be accommodated or overcome, since standardising this would likely hold back research in the area (magnifying windows in gaze interaction software, or extra post-processing of the data in research, must be stated but should not be standardized); and (d) event detection algorithms and the filters used in them, as research here is not yet mature enough.

Many researchers may be unaware of the magnitude of the effect of data quality on their research results or interface functionality, and there are no guidelines on how to go about assessing their data. Likewise, manufacturers may be unsure whether their in-house test methods compare to those of other manufacturers or to end users' quality tests. We hope this paper sets a clear target which will have a positive impact on all aspects of eye movement research, eye tracker development and gaze-based interaction.

References

Andersson, R., Nyström, M., and Holmqvist, K. 2010. Sampling frequency and eye-tracking measures: How speed affects durations, latencies, and more. Journal of Eye Movement Research 3, 6, 1–12.

Burmester, M., and Mast, M. 2010. Repeated web page visits and the scanpath theory: A recurrent pattern detection approach. Journal of Eye Movement Research 3, 4, 1–20.

Cotmore, S., and Donegan, M. 2011. Ch. Participatory Design: The Story of Jayne and Other Complex Cases.

Drewes, J., Montagnini, A., and G. S. M. 2011. Effects of pupil size on recorded gaze position: a live comparison of two eyetracking systems. Talk presented at the 2011 Annual Meeting of the Vision Science Society.

Gagl, B., Hawelka, S., and Hutzler, F. 2011. Systematic influence of gaze position on pupil size measurement: analysis and correction. Behavior Research Methods, 1–11.

Hansen, D. W., Villanueva, A., Mulvey, F., and Mardanbegi, D. 2011. Introduction to eye and gaze trackers. In Gaze Interaction and Applications of Eye Tracking: Advances in Assistive Technologies, P. Majaranta, H. Aoki, M. Donegan, D. W. Hansen, J. P. Hansen, A. Hyrskykari, and K.-J. Räihä, Eds. IGI Global: Medical Information Science Reference, Hershey, PA, ch. 19, 288–295.

Holland, M., and Tarlow, G. 1972. Blinking and mental load. Psychological Reports 31, 119–127.

Holmqvist, K., Nyström, M., Andersson, R., Dewhurst, R., Jarodzka, H., and van de Weijer, J. 2011. Eye Tracking: A Comprehensive Guide to Methods and Measures. Oxford: Oxford University Press.

Hornof, A., and Halverson, T. 2002. Cleaning up systematic error in eye-tracking data by using required fixation locations. Behavior Research Methods, Instruments, & Computers 34, 4, 592–604.

Johnsson, J., and Matos, R. 2011. Accuracy and Precision Test Method for Remote Eye Trackers. Tobii Technology.

Karsh, R., and Breitenbach, F. W. 1983. Looking at looking: The amorphous fixation measure. In Eye Movements and Psychological Functions: International Views, R. Groner, C. Menz, D. F. Fisher, and R. A. Monty, Eds. Mahwah, NJ: Lawrence Erlbaum Associates, 53–64.

Komogortsev, O. V., Gobert, D., Jayarathna, S., Koh, D. H., and Gowda, S. 2010. Standardization of automated analyses of oculomotor fixation and saccadic behaviors. IEEE Transactions on Biomedical Engineering 57, 11, 2635–2645.

Mullin, J., Anderson, A. H., Smallwood, L., Jackson, M., and Katsavras, E. 2001. Eye-tracking explorations in multimedia communications. In Proceedings of IHM/HCI 2001: People and Computers XV – Interaction without Frontiers, A. Blandford, J. Vanderdonckt, and P. Gray, Eds. Cambridge: Cambridge University Press, 367–382.

Nyström, M., and Holmqvist, K. 2010. An adaptive algorithm for fixation, saccade, and glissade detection in eye-tracking data. Behavior Research Methods 42, 1, 188–204.

Nyström, M., Andersson, R., Holmqvist, K., and van de Weijer, J. Submitted. Participants know best: influence of calibration method and eye physiology on eye-tracking data quality. Journal of Neuroscience Methods.

Pernice, K., and Nielsen, J. 2009. Eyetracking Methodology: How to Conduct and Evaluate Usability Studies Using Eyetracking. Berkeley, CA: New Riders Press.

Sadeghnia, G. R. 2011. SMI Technical Report on Data Quality Measurement. SensoMotoric Instruments.

Salvucci, D., and Goldberg, J. H. 2000. Identifying fixations and saccades in eye-tracking protocols. In Proceedings of the 2000 Symposium on Eye Tracking Research & Applications, New York: ACM, 71–78.

Schnipke, S. K., and Todd, M. W. 2000. Trials and tribulations of using an eye-tracking system. In CHI '00 Extended Abstracts on Human Factors in Computing Systems, ACM, 273–274.

Shannon, C. E. 1948. A mathematical theory of communication. Bell System Technical Journal 27, 379–423, 623–656.

Shic, F., Scassellati, B., and Chawarska, K. 2008. The incomplete fixation measure. In Proceedings of the 2008 Symposium on Eye Tracking Research & Applications, New York: ACM, 111–114.

SR Research. 2007. EyeLink User Manual 1.3.0. Mississauga, Ontario, Canada.

Tecce, J. 1992. Psychology, physiological and experimental. In McGraw-Hill Yearbook of Science & Technology. New York: McGraw-Hill, 375–377.
