Facial Feature Tracking For Cursor Control
Abstract
This work is motivated by the goal of providing a non-contact means of
controlling the mouse pointer on a computer system for people with motor
difficulties using low-cost, widely available hardware. The required
information is derived from video data captured using a web camera
mounted below the computer’s monitor. A colour filter is used to identify skin-coloured regions. False positives are eliminated by optionally removing
background regions and by applying statistical rules that reliably identify
the largest skin-coloured region, which is assumed to be the user’s face. The
nostrils are then found using heuristic rules. The instantaneous location of
the nostrils is compared with their at-rest location; any significant
displacement is used to control the mouse pointer’s movement. The system is
able to process 18 frames per second at a resolution of 320 by 240 pixels, or 30 frames per second at 160 by 120 pixels, using moderately powerful hardware (a 500 MHz Pentium III desktop computer).
Keywords: human-computer interaction, perceptual interfaces, face tracking, enabling
technologies.
1 Introduction
For many people with physical disabilities, computers form an essential tool for
communication, environmental control, education and entertainment. However, access to
the computer may be made more difficult by a person’s disability. A number of users
employ head-operated mice or joysticks in order to interact with a computer and to type
with the aid of an on-screen keyboard. Head-operated mice can be expensive. In the UK, devices that require the user to wear no equipment on their head other than an infrared reflective dot, for example Orin Instruments’ HeadMouse [1] and Prentke Romich’s HeadMaster Plus [2], cost in excess of £1,000 (€1,400). Other devices are cheaper, notably Granada Learning's Head Mouse [3, 4, 5] (£200, €280) and Penny and Giles’ device (£400, €560). However, these systems require the user to wear a relatively complex piece of equipment on the head: an infrared transmitter and a set of mercury tilt switches respectively.
The goal of this research is to develop and evaluate a non-contact interface that is both
low cost and also does not require the user to wear any equipment. The most obvious way
of doing this is to use a camera, interfaced to a PC. The PC/camera system will track the
movement of the user; it will translate head movements into cursor movements, and could
interpret changes in expression as button presses.
There have been many attempts at creating cursor control systems utilising head
movements; these are reviewed below. The head monitoring system must also provide a
mechanism for replacing the mouse buttons. We propose that this can be achieved by
using the blinks of the eyes (or by dwelling the mouse or by using switches). Changes in
expression could also be used as part of a separate device to provide a computer interface
for more severely disabled persons.
A system such as the one proposed will confer significant benefits on its target audience.
It will increase their independence by facilitating their communication with a computer
and hence providing an avenue for communication and environmental control. The
proposed interface may also have an application in the non-contact control of remotely
controlled units for the able bodied; providing a perceptual control mechanism for such a
unit will free the operator to concentrate on other demanding tasks.
The implementation of a system such as the proposed one presents several areas of
difficulty:
1. Identifying and tracking the head location.
2. Identifying and tracking the location of a facial feature.
3. Being able to process the information in real-time using a moderately priced
processor that will be running other applications in the foreground (for example,
Microsoft Word).
Yang et al. [6] presented a review of face detection methods, which fell into a small number of categories, the most important of which were methods depending on the colour and physiognomy of the face.
Storring et al. [7] suggested that a face’s apparent colour was due to two factors: the amount of melanin in the skin and the ambient illumination. Of the two, ambient illumination caused the greater variation in the perceived colour. They concluded that if normalised colour values were used (i.e. the effect of illumination were removed) then skin colours fell consistently within a fixed and quite narrow set of limiting values, independent of the subject’s natural skin colouration. This approach, with slight modifications, has become very popular due to its simplicity [7-16].
Template matching provides an attractive and simple approach to face detection that relies
on the constant appearance of the facial features, but often presents difficulty in dealing
with variations in scale, shape and pose. The simplistic approach would define a template
that resembled a facial feature and cross correlate it with a face image. The location of the
maximum response defines that feature’s location. However, the response decreases as the
resemblance between the template and the facial region diminishes, and in theory, a
template is required for every different instance of the feature. To overcome this, flexible templates were introduced, for example by Yuille et al. [17], in which templates of facial features are deformed to match the features of the face. An energy function links edges, peaks and valleys in the input image to corresponding values in the parameterised template. The best fit of the model is found by altering the parameter values, that is, by performing energy minimisation. The deformable template matching method goes some way towards achieving scale and shape invariance, which has also been pursued through the use of multiscale and multiresolution templates. Variations on this theme are widespread [1, 6, 8, 10, 14, 19-22].
To summarise, it is the aim of the research reported here to develop a cheap, non-contact
computer interface that will be used primarily by people with severe motor difficulties.
The system should require minimal initialisation/configuration.
The remainder of the paper is organised as follows. In the following section the options
for capturing the required data are reviewed. The characteristics of the selected hardware
are reviewed, as they relate to the problem we are addressing. Section 3 presents an
overview of the software architecture of the system we have developed. Sections 4 and 5
describe the two parts to the face detection algorithm: identifying possible face candidates
and eliminating false positives, and identifying the extent of the face in an image. The
following section describes how we identify the facial features whose instantaneous
location is used to control the cursor’s movements. The translation from feature location
to cursor movement is described in section 7. In section 8 we present an evaluation of the
system: how accurate it is and how rapidly it can process data. Finally, we draw
conclusions in section 9.
2 Data Capture
As we are developing a non-contact interface that will not require the user to wear any
kind of transducer, video imagery is the obvious data to process. There are two types of
video capture interfaces to the PC: the standard video camera plus video capture hardware
and digital cameras that interface directly to the system, that is Firewire or USB cameras.
We discarded the video camera on the grounds of its cost, even though the image quality
is likely to be better, and instead chose to investigate the use of a webcam interfaced via
the USB, reasoning that this hardware is likely to be in the possession of a large
proportion of computer users.
The USB has an upper throughput rate of 12 megabytes per second. Combinations of
image spatial resolution, colour depth and frame rate must not exceed this. It has been
shown that a minimum rate of 10 frames per second must be processed in order for a user
to perceive real-time operation [23]. Therefore, each frame of data cannot exceed 1.2 MB.
We also consider that colour information is of primary importance and therefore require 24-bit colour information, which further constrains the frame size to 0.4 million pixels or less. The largest “standard” image size satisfying this criterion is the SIF resolution (320 by 240 pixels NTSC and 352 by 288 pixels PAL). Fortunately, it has been shown in many studies that satisfactory results may be obtained by processing images at this resolution; smaller images contain insufficient detail.
Figure 1 shows a typical image captured during facial feature tracking; its spatial resolution is certainly sufficient to detect the significant facial features, even if they cannot be outlined with any great degree of precision (in fact, this is more of an advantage than a problem, as will be shown below).
Figure 1: Sample image captured using a Creative Labs Webcam Go.
This single image fails to reveal the temporal nature of the data. We would expect the
image of a uniform scene to be static over time. This was investigated by capturing
sequences of images of a uniform scene (a Kodak midgrey test card) with the automatic
gain control of the webcam turned on and off. For comparison, the equivalent data was
captured using a Sony HyperHad camera interfaced to a Silicon Graphics workstation. The
differences between the red, green and blue values of equivalent pixels in consecutive
frames were computed. Table I presents the results, showing the averages of these values
(Mean R, Mean G, Mean B), averaged over all frames and the whole sequence; the
average of these averages (Average Mean); the maximum differences (Max R, Max G,
Max B) and the average maximum difference (Average Max).
As expected, the webcam’s image quality is worse than a standard video camera’s but not
significantly worse.
The camera was further tested by capturing images of two people (Caucasian and Asian)
under three different lighting conditions: afternoon sunlight, tungsten lighting and halogen
lighting. The results are shown in figure 2.
According to Storring’s results [7] we would expect quantitatively similar changes in the
images as the lighting conditions are altered. This does not happen, and is especially
obvious in the daylight images in which the Caucasian skin takes a bluish tinge. It is most
likely that this deviation from Storring’s result is due to the quality of the cameras used in
these two studies. The practical consequence is that we would not expect to be able to
specify universal skin colour ranges for the data captured by this camera. Rather, any
system that will rely on colour to identify potential skin regions must be calibrated for
each user and probably also for each use of the system.
                Webcam (AGC on)   Webcam (AGC off)   Sony HyperHad
Mean R                 3                 3                 1
Mean G                 2                 2                 1
Mean B                 4                 3                 2
Average Mean           2                 1                 1
Max R                 37                34                22
Max G                 29                22                18
Max B                 48                41                37
Average Max           24                20                16

Table I: Image quality – inter-frame pixel differences
Figure 2: Images of an Asian and a Caucasian captured under different illumination conditions using the Webcam Go (face samples (a) to (f)).
We are developing a system that will be used as a non-contact computer interface. It is
therefore obvious that the video camera will be situated such that it captures images of the
computer user’s face as he or she views the computer’s monitor. We have chosen to place
the camera below the monitor, pointing upwards at the user.
The software architecture of the system will now be described.
3 Software Architecture
A diagram illustrating an overview of the system’s operations is presented in figure 3. Note that this diagram shows the system’s steady-state operation: results from processing the previous frame inform the processing of the current frame, and the results of processing this frame will inform the processing of the following one.
Figure 3: Data flow diagram. For frame i the system removes the background, segments skin, selects and segments the dominant blob, places the feature search window using the feature locations from frame i-1, identifies the features, calculates the pose and positions the cursor; the resulting feature locations define the search window for frame i+1.
The system is divided into four components that are responsible for identifying blobs that
might correspond to the face in an image, selecting and delineating the blob that
corresponds to the face, identifying the facial features that are to be tracked and moving
the cursor. The four components will be described in the following sections.
4 Face Candidate Identification
We have chosen to use colour information to identify the face. Our aim was to use the
simplest reliable algorithm that will satisfy our requirements. We dismissed other methods
of face detection as we deemed them to be overly complex for this application, and
although they might yield correct results at an acceptable frame rate, we do not believe
that they will leave sufficient processing resources to allow any useful software to be used
without an unacceptable degradation in performance.
Storring’s contention is that skin colours are compactly clustered; the size and shape of the cluster are unaffected by changes in illumination, and only the location of the cluster changes. We investigated methods of removing the illumination dependence by examining alternative colour spaces.
Facial images were downloaded from the University of Stirling [24]. This database
contains a demographically representative sample of full frontal face images captured
under varying lighting conditions. Regions of each image containing skin only were
manually extracted and the red, green and blue (RGB) components of each pixel recorded.
The RGB values were converted to normalised red, green and blue, rgb; log-opponent, and
Y, Cr and Cb:
r = \frac{R}{R+G+B}, \quad g = \frac{G}{R+G+B}, \quad b = \frac{B}{R+G+B}    (1)

L(x) = 105 \log_{10}(x+1), \quad I = L(G), \quad R_g = L(R) - L(G), \quad B_y = L(B) - \frac{L(G)+L(R)}{2}    (2)

Y = 0.30R + 0.59G + 0.11B, \quad C_r = 0.50R - 0.42G - 0.08B, \quad C_b = -0.17R - 0.33G + 0.50B    (3)
Illumination independence was achieved by discarding any one of the rgb components (which sum to unity), the intensity component I of the log-opponent representation, and the Y component.
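To make the conversions concrete, the sketch below implements equations (1) to (3) directly; it is an illustrative NumPy implementation assuming an 8-bit RGB image, not the code used in the reported system.

import numpy as np

def to_normalised_rgb(image):
    """Equation (1): r, g, b are R, G, B divided by (R + G + B)."""
    rgb = image.astype(np.float64)                 # shape (rows, cols, 3), RGB order
    total = rgb.sum(axis=2) + 1e-6                 # avoid division by zero
    return rgb / total[..., np.newaxis]

def to_log_opponent(image):
    """Equation (2): I, Rg, By from L(x) = 105 * log10(x + 1)."""
    R, G, B = [image[..., i].astype(np.float64) for i in range(3)]
    L = lambda x: 105.0 * np.log10(x + 1.0)
    I = L(G)
    Rg = L(R) - L(G)
    By = L(B) - (L(G) + L(R)) / 2.0
    return I, Rg, By

def to_ycrcb(image):
    """Equation (3): Y, Cr, Cb as linear combinations of R, G, B."""
    R, G, B = [image[..., i].astype(np.float64) for i in range(3)]
    Y  =  0.30 * R + 0.59 * G + 0.11 * B
    Cr =  0.50 * R - 0.42 * G - 0.08 * B
    Cb = -0.17 * R - 0.33 * G + 0.50 * B
    return Y, Cr, Cb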
Plots of the normalised skin values are shown in the scattergrams of figure 4a (plotting r
and g), figure 4b (Rg and By) and figure 4c (Cr and Cb). To be considered suitable for this
application, the points must be tightly clustered. It is also advantageous for the
normalisation to be computationally simple, to enhance the data throughput.
Inspection of the scattergrams reveals that the normalised red, green and blue representation gives the tightest clustering; this representation was therefore chosen.
We have demonstrated that it is possible to identify a range of normalised colours that includes all skin colourations. Of course, these colours will not be unique to skin samples; the colours of other objects will also fall within this range. We have found that certain colours of paint, and especially woodwork, can give false positive results using this algorithm. These pixels are readily removed using background subtraction.
A background image may be acquired during a calibration phase; this image is the view observed by the camera in its usual position, but without the user in place. During operation, each frame of data is compared to the background image. If a pixel is sufficiently different, it is considered to be part of the foreground, that is, the object being tracked, and it is passed to the colour matching algorithm. Of course, if the
background to the scene is carefully controlled, it will not contain any objects of
potentially confusing colours, and the background subtraction will not be required.
Nevertheless, we have included this operation as an optional step as it can improve face
detection in some environments.
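A minimal sketch of this filtering stage is given below, assuming NumPy and RGB frames; the skin colour limits and the background difference threshold are illustrative parameters that would be supplied by the per-user calibration described in section 8.

import numpy as np

def skin_map(frame, skin_limits, background=None, bg_threshold=30):
    """Return a boolean map of pixels that are skin-coloured and, optionally,
    sufficiently different from a previously captured background image.

    frame, background : uint8 arrays of shape (rows, cols, 3), RGB order.
    skin_limits       : ((r_min, r_max), (g_min, g_max)) in normalised rgb.
    """
    candidates = np.ones(frame.shape[:2], dtype=bool)

    # Optional background subtraction: keep only pixels that differ from the
    # stored background by more than the threshold in at least one channel.
    if background is not None:
        diff = np.abs(frame.astype(np.int16) - background.astype(np.int16))
        candidates &= diff.max(axis=2) > bg_threshold

    # Normalised rgb colour matching (equation 1); b is redundant because
    # r + g + b = 1, so only r and g are tested.
    total = frame.astype(np.float64).sum(axis=2) + 1e-6
    r = frame[..., 0] / total
    g = frame[..., 1] / total
    (r_min, r_max), (g_min, g_max) = skin_limits
    candidates &= (r >= r_min) & (r <= r_max) & (g >= g_min) & (g <= g_max)
    return candidates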
Figure 4: Clustering of skin colours in various colour spaces: (a) normalised red-green; (b) normalised log-opponent (Rg against By); (c) Cr against Cb.
The output of this stage of the processing cycle is an image map that indicates those pixels
that are of a colour consistent with the skin colour model and are not part of the static
background. Despite removing the background pixels, the map still contains a number of
false positive results caused primarily by the poor quality data delivered by the camera.
These are removed as a side effect of the following stage of the processing cycle, Face
Region Growing, whose primary goal is to identify the region of the image that
corresponds to the face.
Figure 5: (a) typical input; (b) skin map generated by colour matching.
5 Face Region Growing
Due to the location of the camera, the user’s head and shoulders will be the dominant
object in the images that are captured. Following the previous stage, the image map will
contain a large region corresponding to the face (although this is occasionally fragmented
due to shadowing effects) and a large number of much smaller regions due to image noise
and imperfect colour matching. Figure 5 shows a sample image map and the input that was used to generate it.
Connected component analysis can be used to identify the different contiguous groups of
pixels (blobs), and determine their sizes. It is therefore a simple matter to select the largest
blob and assume that it is the face. However, this algorithm takes no account of the blob’s shape and can give erroneous results (figure 6).
Figure 6: An example of connected component analysis giving incorrect results.
Radial spanning [24] has also been suggested as a means of detecting convex objects, by
growing radial spokes from a seed point and requiring that adjacent spokes are of similar
lengths.
Both of these methods require a seed point within the blob to be identified. Although this
is a simple matter as any point can be used as a seed, it is a time consuming process to
examine all blobs in this fashion.
We have used robust statistical analysis [12] to identify the dominant blob in the image. This is an iterative process that consistently identifies the centre of the largest blob in the image.
Figure 7: Dominant blob selection.
The rows and columns of the image are summed. The arrays of row sums and column sums are treated in the same manner. Firstly, the mean and standard deviation are computed. Secondly, the mean and standard deviation of the values lying within one standard deviation of the original mean are computed. This calculation is iterated until the mean converges, at which time the means derived from the row and column sums define the co-ordinates of a point within the dominant blob.
Two one-dimensional histograms, hx(i), 0 ≤ i < NumCols, and hy(j), 0 ≤ j < NumRows, were created by summing the image’s columns and rows respectively. The means and standard deviations of these were computed in the usual way. The ranges of the summations can be restricted so as to exclude blobs neighbouring the image boundaries; it is thus possible to prevent the error shown in the example image. The two means give an initial estimate of the centre of the dominant blob in the image, which is refined by repeating the calculation of the means and standard deviations using only those pixels lying within one standard deviation of the mean in each distribution. This step was repeated until the change in the values of the two means was negligible. The effect of this process is to locate the centre of the most dominant blob in the image, irrespective of any other blobs that might be present (figure 7).
The values of the standard deviations can be used to compute an approximate bounding
box for the face region. We may also perform a connected component analysis to identify
the region more accurately. Since the purpose of locating the face is to define a region to
search for the facial features we are tracking, we do not require that the face is accurately
delimited. Instead, we are satisfied with the more rapid, but less accurate process of
defining the search region using the two standard deviations. Figure 8 illustrates the
regions identified using radial spanning and robust statistical analysis.
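The following sketch illustrates one possible implementation of this robust centre-finding and of the derivation of the search box from the two standard deviations; the convergence tolerance and iteration limit are assumed values, not those used in the reported system.

import numpy as np

def trimmed_mean_std(counts, tol=0.5, max_iter=20):
    """Iteratively re-estimate the mean and std of a 1-D histogram of blob
    pixels, each pass keeping only positions within one std of the previous
    mean, until the mean stops changing."""
    positions = np.arange(len(counts), dtype=np.float64)
    keep = counts > 0
    mean = std = None
    for _ in range(max_iter):
        weights = np.where(keep, counts, 0).astype(np.float64)
        if weights.sum() == 0:
            break                                  # degenerate (empty) map
        new_mean = np.average(positions, weights=weights)
        new_std = np.sqrt(np.average((positions - new_mean) ** 2, weights=weights))
        if mean is not None and abs(new_mean - mean) < tol:
            mean, std = new_mean, new_std
            break
        mean, std = new_mean, new_std
        keep = np.abs(positions - mean) <= std
    return mean, std

def dominant_blob(skin_map):
    """Centre and approximate bounding box of the dominant skin blob."""
    col_sums = skin_map.sum(axis=0)                # histogram over x (columns)
    row_sums = skin_map.sum(axis=1)                # histogram over y (rows)
    cx, sx = trimmed_mean_std(col_sums)
    cy, sy = trimmed_mean_std(row_sums)
    box = (cx - sx, cy - sy, cx + sx, cy + sy)     # one std either side of the centre
    return (cx, cy), box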
The region defined by this process is passed to the feature location module.
Figure 8: Face regions detected by radial spokes and robust statistical analysis.
6 Feature Location
To determine face pose would normally require that three non-collinear facial features
were identified, such as the eyes and nose. Recognising that facial features are usually
darker than the surrounding skin, some authors have used local minima in the brightness
function to define feature locations, regardless of whether these locations correspond to a
physical feature. It is also difficult to identify features consistently as the head pose is altered; as the head is bowed, the shadows under the eyebrows become darker, which can confuse some feature trackers.
We have chosen to track the user’s nostrils and use their position with respect to the edges
of the search region input to this module to define the head pose. The nostrils confer two
major advantages to a simple tracking system. Firstly, they are clearly separated from any
other features that could be confused with them. Secondly, they are relatively small and situated away from the face boundary, which means that they remain visible even under extreme facial poses.
The window passed to this module defines the face region. The central third of this is used as a search region for the nostrils; this is large enough that we can be sure it contains the nostrils and small enough to allow rapid processing. The nostrils are located by a thresholding process: the data within the search region is thresholded with a gradually increasing threshold until two regions are found that match the nostril heuristics, that is, they are of suitable and similar sizes and separation. The “suitable size” is defined by the capture resolution and the identified face size. The centroid of the two nostril regions is taken as the active point, that is, the single location required by the tracking algorithm. Having located the nostrils, their location is used to update the centre of the search region for the following frame. If the nostril search fails and we are unable to update the search region using this mechanism, the search for the nostrils is reinitiated.
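The sketch below illustrates the thresholding search in simplified form, using SciPy’s connected component labelling; the size and separation heuristics are expressed as assumed fractions of the face width rather than the exact rules used by the system.

import numpy as np
from scipy import ndimage

def find_nostrils(grey_region, face_width, step=5, max_level=120):
    """Search the central face region (greyscale) for two dark blobs that
    satisfy simple nostril heuristics, raising the threshold until a pair is
    found. Returns the centroid of the pair (row, col), or None on failure."""
    min_area = max(2, int(0.0005 * face_width ** 2))   # assumed size limits
    max_area = int(0.01 * face_width ** 2)
    max_sep = 0.5 * face_width                          # assumed separation limit

    for level in range(10, max_level, step):
        dark = grey_region < level                      # admit brighter pixels each pass
        labels, n = ndimage.label(dark)
        if n < 2:
            continue
        index = list(range(1, n + 1))
        areas = ndimage.sum(dark, labels, index=index)
        centres = ndimage.center_of_mass(dark, labels, index=index)
        blobs = [(a, c) for a, c in zip(areas, centres)
                 if min_area <= a <= max_area]
        if len(blobs) != 2:
            continue
        (a1, c1), (a2, c2) = blobs
        similar = 0.5 <= a1 / a2 <= 2.0                 # similar sizes
        sep = np.hypot(c1[0] - c2[0], c1[1] - c2[1])
        if similar and sep <= max_sep:
            # Active point: midpoint of the two nostril centroids.
            return ((c1[0] + c2[0]) / 2.0, (c1[1] + c2[1]) / 2.0)
    return None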
Samples of the search region and the identified nostrils are illustrated in figure 9. The translation of this information into cursor movement is the subject of the following section.
Figure 9: Sample results of the nostril tracking stage.
Figure 10: Translation of nostril and search window locations to cursor control
signals.
7 Cursor Movement
Given the co-ordinates of the nostrils’ active point and the co-ordinates of the face search
area, we may derive signals for driving the cursor’s movement.
Jitter, in this context, is defined as randomised apparent movement of the nostrils due to
small amplitude, random head movements (tremor) and errors in the estimation of nostril
location. Ideally, the cursor would move smoothly following decisive and well-controlled
head movements. However, many of the intended users of this system have poor control
of their movement. The system must therefore be capable of recognising jitter and
eliminating it.
This is accomplished by monitoring the changes in the nostril location and the face search
area co-ordinates. If these values change by more than a predefined threshold, the new
values are used to update the cursor position, otherwise the position and these values
remain unchanged.
The translation of image co-ordinates to cursor control signals is illustrated in figure 10.
The distances from the nostril point to the boundaries of the face region in the vertical and horizontal directions are computed. These are translated directly into cursor position co-ordinates by linear scaling. The scaling coefficients are defined by a user-specific calibration stage; we require that the user’s maximum nostril movements in the horizontal and vertical directions map into movements of the cursor that completely cover the monitor’s screen. In a future version of the software we will replace the absolute cursor positioning with a joystick-like control method whereby the cursor’s velocity is controlled by the nostrils’ positions, as this offers improved cursor positioning [3, 4, 5].
The cursor is constrained to remain on the screen; cursor co-ordinates are therefore
clipped to the screen.
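A minimal sketch of this translation, combining the jitter threshold, the linear scaling and the clipping to the screen, is given below; the threshold and the scaling coefficients are illustrative parameters that would come from the user-specific calibration.

def nostrils_to_cursor(nostril_pt, face_box, prev_nostril_pt, prev_cursor,
                       scale_x, scale_y, screen_w, screen_h, jitter=2.0):
    """Translate the nostril active point (row, col) and the face bounding box
    (left, top, right, bottom) into an absolute cursor position (x, y)."""
    # Jitter rejection: ignore movements smaller than the threshold (in pixels).
    if prev_nostril_pt is not None:
        if (abs(nostril_pt[0] - prev_nostril_pt[0]) < jitter and
                abs(nostril_pt[1] - prev_nostril_pt[1]) < jitter):
            return prev_cursor

    left, top, right, bottom = face_box
    # Distances from the active point to the face region boundaries.
    dx = nostril_pt[1] - left          # horizontal offset within the face box
    dy = nostril_pt[0] - top           # vertical offset within the face box

    # Linear scaling to screen co-ordinates; the calibration chooses the
    # coefficients so the user's full head movement spans the whole screen.
    x = int(dx * scale_x)
    y = int(dy * scale_y)

    # Clip so the cursor remains on the screen.
    x = min(max(x, 0), screen_w - 1)
    y = min(max(y, 0), screen_h - 1)
    return (x, y)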
8 Implementation, Evaluation and Performance
The system was implemented and tested on various platforms running Windows. It
executes in two phases: there is an initial calibration phase that is followed by the real-time tracking phase.
The initialisation phase performs two tasks. The first is the acquisition of the background
image, if this is warranted. Recall from the discussion above that the blob finding module
gave many false positive results when there were regions in the background with similar
colours to the skin. Bare or varnished wood surfaces were especially problematical.
Secondly, the initialisation phase captures the skin colour values for this particular user.
Storring’s conclusion that the normalised skin colour values of all people fall within tightly constrained limits was found to be only partly true; our experience has been that normalised skin colours vary by a small amount, but it is not possible to set global threshold values due to the variability in the colour response of the cameras that are being used. Whilst it is
possible to automate the process of acquiring a specific skin colour model, at present we
use manual initialisation, even though this violates our requirement of minimal or zero
initialisation.
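The manual initialisation can be as simple as sampling a user-selected skin patch from a calibration frame and recording the range of its normalised r and g values; the sketch below illustrates this idea, with the selection rectangle and the margin being assumed parameters.

import numpy as np

def calibrate_skin_limits(frame, patch, margin=0.02):
    """Derive per-user normalised-rgb skin limits from a manually selected
    skin patch (left, top, right, bottom) in a calibration frame."""
    left, top, right, bottom = patch
    sample = frame[top:bottom, left:right].astype(np.float64)
    total = sample.sum(axis=2) + 1e-6
    r = sample[..., 0] / total
    g = sample[..., 1] / total
    # Widen the observed range by a small margin to tolerate image noise.
    return ((r.min() - margin, r.max() + margin),
            (g.min() - margin, g.max() + margin))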
Having performed the initialisation, the system moves into the real-time tracking phase
that executes as previously described.
The system was evaluated using several metrics:
• The maximum throughput rate at various frame sizes,
• The accuracy with which the nostrils were located,
• Its robustness with respect to extreme head positions,
• Its robustness with respect to varying lighting conditions, and
• The accuracy of the cursor positioning.
Using an entry level personal computer (at the time of development, a PIII processor with
a clock speed of 500 MHz) we have achieved throughput rates of 30 frames per second at
a resolution of 160 by 120 pixels, and 18 frames per second at 320 by 240 pixels. These
results are intended to be illustrative of the speed that this system can achieve, but they do
not necessarily reflect the speeds that would be achieved in practice, for three reasons.
Firstly, an “entry level personal computer” is more powerful now than when the
development of this system commenced. Secondly, the tests were performed with the
system providing graphical feedback to the developer, which would not be given in the
same way in a release version. Thirdly, the tests were performed without the system
running any other applications, such as a word processor. Given these factors, we expect
that the system will be viable.
The system is able to detect the required facial features reliably, provided that the scene is
adequately illuminated and the illumination remains unchanged. Adequate illumination is
a reasonable requirement as the location in which this system will be used is expected to
be well lit. As the illumination is reduced, darker skin tones result in tracking failure
sooner than lighter tones due to the lower contrast between the skin and the nostril areas.
The system can be forced into tracking errors by introducing any large skin-coloured
object into the field of view. If this object merges with the face, then the face region will
change catastrophically.
The system is able to maintain a fix on the nostrils, even under movements that are more
extreme than would usually occur when the user is controlling the cursor. This is a
desirable feature given the physical characteristics of the intended users.
Several naïve users tested the system. They all found the system simple to use and were
able to place the cursor within a 1 cm target on a 19” monitor with no difficulty. Although
the system does not remove jitter completely, users compensate for slight remaining
inaccuracies via the visual feedback that is provided by the on-screen cursor.
Although some of the users wore glasses, and this did not pose any problems, none had
any facial hair. Moderate amounts of facial hair should not affect the performance of the
system, especially if the hair is similar in colouration to the face. In fact, facial hair ought
not to influence the system until it actually obscures the nostrils themselves.
9 Conclusion
A method of controlling the position of a cursor using video data has been presented. The
system is intended to be used primarily by people with motor difficulties, although it could
be used by the able-bodied to augment the mouse. We have aimed to develop a technically and computationally simple system. Input is captured using a webcam, and simple processing methods have been employed.
The system tracks the user’s nostrils, which are located by first finding a large skin-coloured region that is assumed to be the user’s face. The position of the nostrils relative
to the face region is used to control the position of the cursor. The system has been shown
to be accurate and reliable under normal conditions of illumination and subject movement.
The system is also able to track the desired features in real-time, lending support to our
claim that this can be a valid and useful method of human computer interface for a certain
class of computer users.
Figure 11: Selected frames from a 30 second period of tracking. The images are
screenshots of the system’s output showing a video frame with tracking results
superimposed: the rectangular facial region and two blobs indicating where the
system believes the nostrils to be located. Each image also contains a small square illustrating where the system would place the cursor; note that the cursor’s movement is a mirror image of the nostrils’ movements.
References
[1] Orin Instruments, 2001. http://www.orin.com/access/
[2] Prentrom, 2001. http://store.prentrom.com/
[3] R. Drew, S. Pettitt, P. Blenkhorn, D. G. Evans, “A Head Operated 'Joystick' Using Infrared”, in Computers and Assistive Technology ICCHP '98, Proc. XV IFIP World Computer Congress, Eds. A. D. N. Edwards, A. Arato, W. L. Zagler, Österreichische Computer Gesellschaft, 1998.
[4] D. G. Evans, R. Drew, P. Blenkhorn, “Controlling Mouse Pointer Position Using an Infrared Head Operated Joystick”, IEEE Trans. Rehabilitation Engineering, Vol. 8, pp. 107-117, 2000.
[5] D. G. Evans, P. Blenkhorn, “A Head Operated Joystick - Experience with Use”, Fourteenth Annual International Conference on Technology and Persons with Disabilities, Los Angeles, March 1999.
[6] M.-H. Yang, N. Ahuja, D. Kriegman, “A survey on face detection methods”, IEEE Transactions on Pattern Analysis and Machine Intelligence, 1999.
[7] M. Störring, H. J. Andersen, E. Granum, “Skin colour detection under changing lighting conditions”, in H. Araujo and J. Dias, editors, 7th International Symposium on Intelligent Robotic Systems, pp. 187-195, Coimbra, Portugal, 20-23 July 1999.
[8] M.-H. Yang, N. Ahuja, “Detecting human faces in color images”, in Proceedings of the IEEE International Conference on Image Processing, Chicago, IL, October 4-7, 1998, pp. 127-130.
[9] K. Toyama, “Look, Ma - No Hands! Hands-Free Cursor Control with Real-time 3D Face Tracking”, in Proc. Workshop on Perceptual User Interfaces (PUI'98), 1998.
[10] K. Sobottka, I. Pitas, “Segmentation and tracking of faces in color images”, in Proceedings of the Second International Conference on Automatic Face and Gesture Recognition, pp. 236-241, 1996.
[11] J. Yang, A. Waibel, “A real-time face tracker”, in Proc. 3rd IEEE Workshop on Applications of Computer Vision, pp. 142-147, 1996.
[12] R. J. Qian, M. I. Sezan, K. E. Matthews, “Face Tracking Using Robust Statistical Estimation”, in Proc. Workshop on Perceptual User Interfaces, San Francisco, California, November 1998.
[13] K. Sobottka, I. Pitas, “Extraction of facial regions and features using color and shape information”, in Int. Conf. on Pattern Recognition (ICPR), Vienna, Austria, August 1996.
[14] B. Menser, M. Brünig, “Segmentation of human faces in color images using connected operators”, in Proc. IEEE International Conference on Image Processing (ICIP99), Vol. 3, pp. 632-636, Kobe, Japan, October 1999.
[15] J. Yang, W. Lu, A. Waibel, “Skin-Color Modeling and Adaptation”, in 3rd Asian Conference on Computer Vision, pp. 687-694, Hong Kong University of Science and Technology, Hong Kong, January 1998.
[16] B. Schiele, A. Waibel, “Gaze tracking based on face-color”, in Proc. Int. Workshop on Automatic Face and Gesture Recognition, pp. 344-349, Zurich, 1995.
[17] A. Yuille, P. Hallinan, D. Cohen, “Feature extraction from faces using deformable templates”, International Journal of Computer Vision, Vol. 8, pp. 99-111, 1992.
[19] R. Brunelli, T. Poggio, “Face recognition: features versus templates”, IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 15, No. 10, pp. 1042-1052, October 1993.
[20] V. Bakic, G. Stockman, “Menu Selection by Facial Aspect”, in Proc. Vision Interface '99, Quebec, Canada, 18-21 May 1999.
[21] A. Nikolaidis, I. Pitas, “Facial feature extraction and pose determination”, Pattern Recognition, 33(11), pp. 1783-1791, 2000.
[22] T. Morris, F. Zaidi, P. Blenkhorn, “Blink detection for real-time eye tracking”, Journal of Network and Computer Applications, 25(2), pp. 129-143, 2002.
[23] J. M. Stroud, “The Fine Structure of Psychological Time”, in Information Theory in Psychology (H. Quastler, ed.), Free Press, Glencoe, IL, 1956.
[24] University of Stirling Face Database, 2003. http://pics.psych.stir.ac.uk/
[25] C. L. Zitnick, J. Gemmell, K. Toyama, “Manipulation of Video Eye Gaze and Head Orientation for Video Teleconferencing”, Technical Report MSR-TR-99-46, Microsoft Research, Redmond, WA, June 1999.