
Facial feature tracking for cursor control

2006, Journal of Network and Computer Applications

This work is motivated by the goal of providing a non-contact means of controlling the mouse pointer on a computer system for people with motor difficulties using low-cost, widely available hardware. The required information is derived from video data captured using a web camera mounted below the computer's monitor. A colour filter is used to identify skin coloured regions. False positives are eliminated by optionally removing background regions and by applying statistical rules that reliably identify the largest skin-coloured region, which is assumed to be the user's face. The nostrils are then found using heuristic rules. The instantaneous location of the nostrils is compared with their at-rest location; any significant displacement is used to control the mouse pointer's movement. The system is able to process 18 frames per second at a resolution of 320 by 240 pixels, or 30 fps at 160 by 120 pixels using moderately powerful hardware (a 500 MHz Pentium III desktop computer).

Keywords: human-computer interaction, perceptual interfaces, face tracking, enabling technologies.

1 Introduction

For many people with physical disabilities, computers form an essential tool for communication, environmental control, education and entertainment. However, access to the computer may be made more difficult by a person's disability. A number of users employ head-operated mice or joysticks in order to interact with a computer and to type with the aid of an on-screen keyboard. Head-operated mice can be expensive. In the UK, devices that require the user to wear no equipment on the head other than an infrared reflective dot, for example Orin Instruments' HeadMouse [1] and Prentke Romich's HeadMaster Plus [2], cost in excess of £1,000 (€1,400). Other devices are cheaper, notably Granada Learning's Head Mouse [3, 4, 5] (£200, €280) and Penny and Giles' device (£400, €560). However, these systems require the user to wear a relatively complex piece of equipment on the head: an infrared transmitter and a set of mercury tilt switches respectively.

The goal of this research is to develop and evaluate a non-contact interface that is both low cost and does not require the user to wear any equipment. The most obvious way of doing this is to use a camera interfaced to a PC. The PC/camera system will track the movement of the user; it will translate head movements into cursor movements, and could interpret changes in expression as button presses. There have been many attempts at creating cursor control systems utilising head movements; these are reviewed below. The head monitoring system must also provide a mechanism for replacing the mouse buttons. We propose that this can be achieved by using blinks of the eyes (or by dwelling the mouse, or by using switches). Changes in expression could also be used as part of a separate device to provide a computer interface for more severely disabled persons.

A system such as the one proposed will confer significant benefits on its target audience. It will increase users' independence by facilitating their interaction with a computer and hence providing an avenue for communication and environmental control. The proposed interface may also have an application in the non-contact control of remotely operated units by the able-bodied; providing a perceptual control mechanism for such a unit will free the operator to concentrate on other demanding tasks.
The implementation of a system such as the proposed one presents several areas of difficulty:
1. Identifying and tracking the head location.
2. Identifying and tracking the location of a facial feature.
3. Processing the information in real time using a moderately priced processor that will also be running other applications in the foreground (for example, Microsoft Word).

Yang et al. [6] presented a review of face detection methods, which fell into a small number of categories, the most important of which were methods depending on the colour and physiognomy of the face. Storring et al. [7] suggested that a face's apparent colour was due to two factors: the amount of melanin in the skin and the ambient illumination. Of the two, ambient illumination caused the greater variation in the perceived colour. They concluded that if normalised colour values were used (i.e. the effect of illumination were removed) then skin colours fell consistently within a fixed and quite narrow set of limiting values, independent of the subject's natural skin colouration. This approach, with slight modifications, has become very popular due to its simplicity [7-16].

Template matching provides an attractive and simple approach to face detection that relies on the constant appearance of the facial features, but often presents difficulty in dealing with variations in scale, shape and pose. The simplest approach would define a template that resembles a facial feature and cross-correlate it with a face image. The location of the maximum response defines that feature's location. However, the response decreases as the resemblance between the template and the facial region diminishes, and in theory a template is required for every different instance of the feature. To overcome this, flexible templates were introduced, for example by Yuille [17], where templates of facial features are deformed to match against the features of the face. An energy function is used to link edges, peaks and valleys in the input image to corresponding values in the parameterised template. The best fit of the model is found by altering the parameter values, that is, by performing energy minimisation. Deformable template matching goes some way towards achieving scale and shape invariance, which has also been addressed by the use of multiscale and multiresolution templates. Variations on this theme are widespread [1, 6, 8, 10, 14, 19-22].

To summarise, the aim of the research reported here is to develop a cheap, non-contact computer interface that will be used primarily by people with severe motor difficulties. The system should require minimal initialisation and configuration. The remainder of the paper is organised as follows. In the following section the options for capturing the required data are reviewed, along with the characteristics of the selected hardware as they relate to the problem we are addressing. Section 3 presents an overview of the software architecture of the system we have developed. Sections 4 and 5 describe the two parts of the face detection algorithm: identifying possible face candidates and eliminating false positives, and identifying the extent of the face in an image. The following section describes how we identify the facial features whose instantaneous location is used to control the cursor's movements. The translation from feature location to cursor movement is described in section 7. In section 8 we present an evaluation of the system: how accurate it is and how rapidly it can process data.
Finally, we draw conclusions in section 9.

2 Data Capture

As we are developing a non-contact interface that will not require the user to wear any kind of transducer, video imagery is the obvious data to process. There are two types of video capture interface for the PC: a standard video camera plus video capture hardware, or a digital camera that interfaces directly to the system, that is, a Firewire or USB camera. We discarded the video camera on the grounds of its cost, even though the image quality is likely to be better, and instead chose to investigate the use of a webcam interfaced via the USB, reasoning that this hardware is likely to be in the possession of a large proportion of computer users.

The USB has an upper throughput rate of 12 megabytes per second. Combinations of image spatial resolution, colour depth and frame rate must not exceed this. It has been shown that a minimum rate of 10 frames per second must be processed in order for a user to perceive real-time operation [23]. Therefore, each frame of data cannot exceed 1.2 MB. We also consider colour information to be of primary importance and therefore require 24-bit colour, which further reduces the frame size to 0.4 million pixels or less. The largest "standard" image size satisfying this criterion is the SIF resolution (320 by 240 pixels NTSC, 352 by 288 PAL). Fortunately, it has been shown in many studies that satisfactory results may be obtained by processing images at this resolution; smaller images contain insufficient detail. Figure 1 shows a typical image captured during facial feature tracking; its spatial resolution is certainly sufficient to detect the significant facial features, even if they cannot be outlined with any great degree of precision (in fact, this is more of an advantage than a problem, as will be shown below).

Figure 1: Sample image captured using a Creative Labs Webcam Go.

This single image fails to reveal the temporal nature of the data. We would expect the image of a uniform scene to be static over time. This was investigated by capturing sequences of images of a uniform scene (a Kodak mid-grey test card) with the automatic gain control of the webcam turned on and off. For comparison, the equivalent data was captured using a Sony HyperHad camera interfaced to a Silicon Graphics workstation. The differences between the red, green and blue values of equivalent pixels in consecutive frames were computed. Table I presents the results, showing the averages of these values (Mean R, Mean G, Mean B), averaged over all frames and the whole sequence; the average of these averages (Average Mean); the maximum differences (Max R, Max G, Max B); and the average maximum difference (Average Max). As expected, the webcam's image quality is worse than a standard video camera's, but not significantly worse.

The camera was further tested by capturing images of two people (Caucasian and Asian) under three different lighting conditions: afternoon sunlight, tungsten lighting and halogen lighting. The results are shown in figure 2. According to Storring's results [7] we would expect quantitatively similar changes in the images as the lighting conditions are altered. This does not happen, and it is especially obvious in the daylight images, in which the Caucasian skin takes on a bluish tinge. It is most likely that this deviation from Storring's result is due to the quality of the cameras used in these two studies.
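For reference, the inter-frame difference statistics reported in Table I are simple to reproduce. The sketch below is illustrative only (the use of NumPy and the function name are our assumptions, not the authors' tooling); it returns the per-channel mean and maximum absolute differences between consecutive frames of a static scene.

```python
import numpy as np

def interframe_stats(frames):
    """Mean and maximum absolute inter-frame differences per colour channel.

    frames: list of uint8 arrays of shape (H, W, 3) capturing a static scene.
    Returns (mean_rgb, max_rgb), each a length-3 array for the R, G, B channels.
    """
    diffs = [np.abs(a.astype(np.int16) - b.astype(np.int16))
             for a, b in zip(frames[:-1], frames[1:])]
    stacked = np.stack(diffs)                # shape (N-1, H, W, 3)
    mean_rgb = stacked.mean(axis=(0, 1, 2))  # per-channel mean difference
    max_rgb = stacked.max(axis=(0, 1, 2))    # per-channel maximum difference
    return mean_rgb, max_rgb
```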
The practical consequence of this variability is that we would not expect to be able to specify universal skin colour ranges for the data captured by this camera. Rather, any system that relies on colour to identify potential skin regions must be calibrated for each user, and probably also for each use of the system.

Table I: Image quality – inter-frame pixel differences

              Webcam (AGC on)   Webcam (AGC off)   Sony HyperHad
Mean R              3                  3                 1
Mean G              2                  2                 1
Mean B              4                  3                 2
Average Mean        2                  1                 1
Max R              37                 34                22
Max G              29                 22                18
Max B              48                 41                37
Average Max        24                 20                16

Figure 2: Images of an Asian and a Caucasian subject captured under different illumination conditions using the Webcam Go.

We are developing a system that will be used as a non-contact computer interface. It is therefore obvious that the video camera will be situated such that it captures images of the computer user's face as he or she views the computer's monitor. We have chosen to place the camera below the monitor, pointing upwards at the user. The software architecture of the system will now be described.

3 Software Architecture

A diagram illustrating an overview of the system's operation is presented in figure 3. Note that this diagram shows the system's steady-state operation: results from processing the previous frame inform the processing of the current frame, and the results of processing this frame will inform the processing of the following one.

Figure 3: Data flow diagram.

The system is divided into four components that are responsible for identifying blobs that might correspond to the face in an image, selecting and delineating the blob that corresponds to the face, identifying the facial features that are to be tracked, and moving the cursor. The four components will be described in the following sections.

4 Face Candidate Identification

We have chosen to use colour information to identify the face. Our aim was to use the simplest reliable algorithm that satisfies our requirements. We dismissed other methods of face detection as we deemed them to be overly complex for this application; although they might yield correct results at an acceptable frame rate, we do not believe that they would leave sufficient processing resources to allow other useful software to be used without an unacceptable degradation in performance. Storring's contention is that skin colours are compactly clustered; the size and shape of the cluster is unaffected by changes in illumination, only its location changes. We investigated methods of removing the illumination dependence by examining alternative colour spaces. Facial images were downloaded from the University of Stirling face database [24]. This database contains a demographically representative sample of full frontal face images captured under varying lighting conditions. Regions of each image containing skin only were manually extracted and the red, green and blue (RGB) components of each pixel recorded.
The RGB values were converted to normalised red, green and blue (rgb), to a log-opponent representation, and to Y, Cr and Cb:

r = R / (R + G + B),  g = G / (R + G + B),  b = B / (R + G + B)      (1)

L(x) = 105 log10(x + 1)
I = L(G)
Rg = L(R) − L(G)
By = L(B) − (L(G) + L(R)) / 2      (2)

Y = 0.30R + 0.59G + 0.11B
Cr = 0.50R − 0.42G − 0.08B
Cb = −0.17R − 0.33G + 0.50B      (3)

Illumination independence was achieved by discarding one component of each representation: any one of the rgb components, the I (= L(G)) component and the Y component respectively. Plots of the normalised skin values are shown in the scattergrams of figure 4a (plotting r and g), figure 4b (Rg and By) and figure 4c (Cr and Cb). To be considered suitable for this application, the points must be tightly clustered. It is also advantageous for the normalisation to be computationally simple, to enhance the data throughput. Inspection of the scattergrams reveals that the normalised red, green and blue representation gives the tightest clustering; this representation was therefore chosen.

We have demonstrated that it is possible to identify a range of normalised colours that includes all skin colourations. Of course, these colours will not be unique to skin samples; the colours of other objects will also fall in this range. We have found that certain colours of paint, and especially woodwork, can give false positive results using this algorithm. These pixels are readily removed using background subtraction. A background image may be acquired during a calibration phase; this image is the view observed by the camera in its usual position, but without the user in place. During operation, each frame of data is compared to the background image. If a pixel is sufficiently different, it is considered to be part of the foreground, that is, the object being tracked, and it is passed to the colour matching algorithm. Of course, if the background to the scene is carefully controlled, it will not contain any objects of potentially confusing colours, and the background subtraction will not be required. Nevertheless, we have included this operation as an optional step as it can improve face detection in some environments.

Figure 4: Clustering of skin colours in various colour spaces: (a) normalised red and green, (b) log-opponent Rg and By, (c) Cr and Cb.

The output of this stage of the processing cycle is an image map that indicates those pixels that are of a colour consistent with the skin colour model and are not part of the static background. Despite removing the background pixels, the map still contains a number of false positive results, caused primarily by the poor quality data delivered by the camera. These are removed as a side effect of the following stage of the processing cycle, Face Region Growing, whose primary goal is to identify the region of the image that corresponds to the face.

Figure 5: (a) A typical input image; (b) the skin map generated by colour matching.
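To make the colour-matching step concrete, the following is a minimal sketch of how a per-user normalised-rgb skin filter with optional background subtraction might be coded. It is not the authors' implementation: the threshold values, parameter names and the use of NumPy are illustrative assumptions only.

```python
import numpy as np

def skin_map(frame_rgb, r_range, g_range, background_rgb=None, bg_diff_thresh=30):
    """Boolean map of pixels whose normalised r, g values fall inside the
    per-user calibrated ranges and which differ from the background image.

    frame_rgb, background_rgb: uint8 arrays of shape (H, W, 3).
    r_range, g_range: (low, high) tuples obtained during calibration.
    """
    rgb = frame_rgb.astype(np.float32)
    total = rgb.sum(axis=2) + 1e-6            # avoid division by zero
    r = rgb[..., 0] / total                   # normalised red (equation 1)
    g = rgb[..., 1] / total                   # normalised green (equation 1)

    mask = (r >= r_range[0]) & (r <= r_range[1]) & \
           (g >= g_range[0]) & (g <= g_range[1])

    if background_rgb is not None:
        # Optional background subtraction: keep only pixels that differ
        # sufficiently from the user-free reference image.
        diff = np.abs(rgb - background_rgb.astype(np.float32)).max(axis=2)
        mask &= diff > bg_diff_thresh

    return mask
```

The calibrated ranges would be derived from a sample of the user's skin captured during initialisation, for example by taking the minimum and maximum (or the mean plus or minus a few standard deviations) of the normalised r and g values inside a manually selected skin patch.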
5 Face Region Growing

Due to the location of the camera, the user's head and shoulders will be the dominant object in the images that are captured. Following the previous stage, the image map will contain a large region corresponding to the face (although this is occasionally fragmented due to shadowing effects) and a large number of much smaller regions due to image noise and imperfect colour matching. Figure 5 shows a sample image map and the input that was used to generate it.

Connected component analysis can be used to identify the different contiguous groups of pixels (blobs) and determine their sizes. It is then a simple matter to select the largest blob and assume that it is the face. However, this algorithm takes no account of the blob's shape and can give erroneous results (figure 6).

Figure 6: An example of connected component analysis giving incorrect results.

Radial spanning [24] has also been suggested as a means of detecting convex objects, by growing radial spokes from a seed point and requiring that adjacent spokes are of similar lengths. Both of these methods require a seed point within the blob to be identified. Although this is a simple matter, as any point can be used as a seed, it is a time-consuming process to examine all blobs in this fashion. We have instead used robust statistical analysis [12] to identify the dominant blob in the image. This is an iterative process that consistently identifies the centre of the largest blob in the image.

Figure 7: Dominant blob selection. The rows and columns of the image map are summed, and the arrays of row sums and column sums are treated in the same manner. Firstly, the mean and standard deviation are computed. Secondly, the mean and standard deviation of the values lying within one standard deviation of the original mean are computed. This calculation is iterated until the mean converges, at which time the means derived from the row and column sums define the co-ordinates of a point within the dominant blob.

Two one-dimensional histograms, hx(i), 0 ≤ i < NumRows, and hy(j), 0 ≤ j < NumCols, were created by summing the image's columns and rows. The means and standard deviations of these were computed in the usual way. The ranges of the summations can be restricted so as to exclude blobs neighbouring the image boundaries; it is thus possible to prevent the error shown in the example image. The two means give an initial estimate of the centre of the dominant blob in the image, which is refined by repeating the calculation of the means and standard deviations using only those values lying within one standard deviation of the mean in each distribution. This step is repeated until the change in the values of the two means is negligible. The effect of this process is to locate the centre of the dominant blob in the image, irrespective of any other blobs that might be present (figure 7). The values of the standard deviations can be used to compute an approximate bounding box for the face region. We could also perform a connected component analysis to identify the region more accurately. However, since the purpose of locating the face is to define a region in which to search for the facial features we are tracking, we do not require that the face is accurately delimited. Instead, we are satisfied with the more rapid, but less accurate, process of defining the search region using the two standard deviations. Figure 8 illustrates the regions identified using radial spanning and robust statistical analysis. The region defined by this process is passed to the feature location module.

Figure 8: Face regions detected by radial spokes and by robust statistical analysis.
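The iterative trimmed-mean procedure described above is straightforward to implement. The following is a minimal sketch, assuming a NumPy boolean skin map as produced by the previous stage; the convergence tolerance, iteration cap and function names are illustrative choices rather than the authors' exact parameters.

```python
import numpy as np

def dominant_blob(skin_map, tol=0.5, max_iter=20):
    """Locate the centre and approximate extent of the dominant blob.

    skin_map: boolean array (H, W); True where a pixel matched the skin model.
    Returns (cx, cy, sx, sy): centre column/row and the standard deviations
    of the trimmed column/row distributions.
    """
    col_sums = skin_map.sum(axis=0).astype(np.float64)   # histogram over columns
    row_sums = skin_map.sum(axis=1).astype(np.float64)   # histogram over rows

    def trimmed_mean_std(hist):
        idx = np.arange(hist.size)
        if not hist.sum():
            return hist.size / 2.0, hist.size / 2.0
        mean = np.average(idx, weights=hist)
        std = np.sqrt(np.average((idx - mean) ** 2, weights=hist))
        for _ in range(max_iter):
            # Restrict to values within one standard deviation of the mean.
            sel = (idx >= mean - std) & (idx <= mean + std) & (hist > 0)
            if not sel.any():
                break
            new_mean = np.average(idx[sel], weights=hist[sel])
            std = np.sqrt(np.average((idx[sel] - new_mean) ** 2, weights=hist[sel]))
            converged = abs(new_mean - mean) < tol
            mean = new_mean
            if converged:
                break
        return mean, std

    cx, sx = trimmed_mean_std(col_sums)   # horizontal centre and spread
    cy, sy = trimmed_mean_std(row_sums)   # vertical centre and spread
    return cx, cy, sx, sy
```

The returned standard deviations can be used directly to define the face search window, for example the rectangle spanning cx − sx to cx + sx horizontally and cy − sy to cy + sy vertically, which is what the feature location module receives.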
6 Feature Location

Determining the face pose would normally require that three non-collinear facial features be identified, such as the eyes and nose. Recognising that facial features are usually darker than the surrounding skin, some authors have used local minima in the brightness function to define feature locations, regardless of whether these locations correspond to a physical feature. It is also difficult to identify features consistently as the head pose is altered: as the head is bowed, the shadows under the eyebrows become darker, which can confuse some feature trackers. We have chosen to track the user's nostrils, and to use their position with respect to the edges of the search region input to this module to define the head pose. The nostrils confer two major advantages on a simple tracking system. Firstly, they are clearly separated from any other features that could be confused with them. Secondly, they are relatively small and situated away from the face boundary, which means that they remain visible even under extreme facial poses.

The window passed to this module defines the face region. The central third of this is used as a search region for the nostrils; this is large enough that we can be sure it contains the nostrils and small enough to allow rapid processing. The nostrils are located by a thresholding process: the data within the search region is thresholded with a gradually increasing threshold until two regions are found that match the nostril heuristics, that is, they are of suitable and similar sizes and separation. The "suitable" size is defined by the capture resolution and the identified face size. The centroid of the two nostril regions is taken as the active point, that is, the single location required by the tracking algorithm. Having located the nostrils, their location is used to update the centre of the search region for the following frame. If the nostril search fails and we are unable to update the search region using this mechanism, the search for the nostrils is reinitiated. Samples of the search region and the identified nostrils are illustrated in figure 9. The translation of this information into cursor movement is the subject of the following section.

Figure 9: Sample results of the nostril tracking stage.

Figure 10: Translation of nostril and search window locations to cursor control signals.

7 Cursor Movement

Given the co-ordinates of the nostrils' active point and the co-ordinates of the face search area, we may derive signals for driving the cursor's movement. Jitter, in this context, is defined as randomised apparent movement of the nostrils due to small-amplitude, random head movements (tremor) and errors in the estimation of the nostril location. Ideally, the cursor would move smoothly following decisive and well-controlled head movements. However, many of the intended users of this system have poor control of their movement. The system must therefore be capable of recognising jitter and eliminating it. This is accomplished by monitoring the changes in the nostril location and the face search area co-ordinates. If these values change by more than a predefined threshold, the new values are used to update the cursor position; otherwise the cursor position and these values remain unchanged. The translation of image co-ordinates to cursor control signals is illustrated in figure 10. The distances from the nostril point to the boundaries of the face region in the horizontal and vertical directions are computed and translated directly into cursor position co-ordinates by linear scaling.
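As an illustration of how the nostril search of section 6 and the jitter filtering and linear scaling just described might fit together, the sketch below uses NumPy and SciPy connected-component labelling. The threshold sweep range, the heuristic tests, the jitter threshold and all function and parameter names are our own illustrative assumptions, not the authors' implementation; orientation handling (the mirroring visible in figure 11) is omitted.

```python
import numpy as np
from scipy import ndimage

def find_nostrils(gray_region, min_area, max_area, max_sep):
    """Sweep an increasing intensity threshold over the nostril search region
    until two dark blobs of suitable, similar size and separation are found.
    Returns the (x, y) centroid of the pair (the active point), or None."""
    for thresh in range(10, 200, 5):
        dark = gray_region < thresh
        labels, n = ndimage.label(dark)
        if n < 2:
            continue
        sizes = ndimage.sum(dark, labels, index=range(1, n + 1))
        keep = [i + 1 for i, s in enumerate(sizes) if min_area <= s <= max_area]
        if len(keep) != 2:
            continue
        a, b = keep
        size_a, size_b = sizes[a - 1], sizes[b - 1]
        (ya, xa), (yb, xb) = ndimage.center_of_mass(dark, labels, [a, b])
        similar = abs(size_a - size_b) <= 0.5 * max(size_a, size_b)
        if similar and abs(xa - xb) <= max_sep:
            return ((xa + xb) / 2.0, (ya + yb) / 2.0)
    return None

def cursor_position(active_pt, face_box, scale_x, scale_y, screen_w, screen_h,
                    prev_pt=None, jitter_thresh=2.0):
    """Map the active point to absolute cursor co-ordinates by linearly scaling
    its distance from the face-region boundary, ignoring sub-threshold moves."""
    if prev_pt is not None and np.hypot(active_pt[0] - prev_pt[0],
                                        active_pt[1] - prev_pt[1]) < jitter_thresh:
        active_pt = prev_pt                       # treat the change as jitter
    left, top, right, bottom = face_box
    dx = active_pt[0] - left                      # distance to the left edge
    dy = active_pt[1] - top                       # distance to the top edge
    x = int(np.clip(dx * scale_x, 0, screen_w - 1))   # clip to the screen
    y = int(np.clip(dy * scale_y, 0, screen_h - 1))
    return (x, y), active_pt
```

The two scale factors would be set by the calibration stage described next, so that the user's full range of nostril movement spans the whole screen.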
The scaling coefficients are defined by a user-specific calibration stage: we require that the user's maximum nostril movements in the horizontal and vertical directions map onto movements of the cursor that completely cover the monitor's screen. In a future version of the software we will replace the absolute cursor positioning with a joystick-like control method, whereby the cursor's velocity is controlled by the nostrils' position, as this offers improved cursor positioning [3, 4, 5]. The cursor is constrained to remain on the screen; cursor co-ordinates are therefore clipped to the screen.

8 Implementation, Evaluation and Performance

The system was implemented and tested on various platforms running Windows. It executes in two phases: an initial calibration phase followed by the real-time tracking phase. The initialisation phase performs two tasks. The first is the acquisition of the background image, if this is warranted. Recall from the discussion above that the blob finding module gave many false positive results when there were regions in the background with colours similar to skin; bare or varnished wood surfaces were especially problematic. Secondly, the initialisation phase captures the skin colour values for this particular user. Storring's conclusion that the normalised skin colour values of all people fall within tightly constrained limits was found to be only partly true: our experience has been that normalised skin colours vary by a small amount, but it is not possible to set global threshold values due to the variability in the colour response of the cameras that are being used. Whilst it is possible to automate the process of acquiring a specific skin colour model, at present we use manual initialisation, even though this violates our requirement of minimal or zero initialisation. Having performed the initialisation, the system moves into the real-time tracking phase, which executes as previously described.

The system was evaluated using several metrics:
• the maximum throughput rate at various frame sizes,
• the accuracy with which the nostrils were located,
• its robustness with respect to extreme head positions,
• its robustness with respect to varying lighting conditions, and
• the accuracy of the cursor positioning.

Using an entry-level personal computer (at the time of development, a PIII processor with a clock speed of 500 MHz) we have achieved throughput rates of 30 frames per second at a resolution of 160 by 120 pixels, and 18 frames per second at 320 by 240 pixels. These results are intended to be illustrative of the speed that this system can achieve, but they do not necessarily reflect the speeds that would be achieved in practice, for three reasons. Firstly, an "entry-level personal computer" is more powerful now than when the development of this system commenced. Secondly, the tests were performed with the system providing graphical feedback to the developer, which would not be given in the same way in a release version. Thirdly, the tests were performed without the system running any other applications, such as a word processor. Given these factors, we expect that the system will be viable.

The system is able to detect the required facial features reliably, provided that the scene is adequately illuminated and the illumination remains unchanged. Adequate illumination is a reasonable requirement, as the location in which this system will be used is expected to be well lit.
As the illumination is reduced, darker skin tones result in tracking failure sooner than lighter tones, due to the lower contrast between the skin and the nostril areas. The system can be forced into tracking errors by introducing any large skin-coloured object into the field of view; if this object merges with the face, the face region will change catastrophically. The system is able to maintain a fix on the nostrils even under movements that are more extreme than would usually occur when the user is controlling the cursor. This is a desirable feature given the physical characteristics of the intended users.

Several naïve users tested the system. They all found the system simple to use and were able to place the cursor within a 1 cm target on a 19" monitor with no difficulty. Although the system does not remove jitter completely, users compensate for slight remaining inaccuracies via the visual feedback provided by the on-screen cursor. Although some of the users wore glasses, which did not pose any problems, none had any facial hair. Moderate amounts of facial hair should not affect the performance of the system, especially if the hair is similar in colouration to the face. In fact, facial hair ought not to influence the system until it actually obscures the nostrils themselves.

9 Conclusion

A method of controlling the position of a cursor using video data has been presented. The system is intended to be used primarily by people with motor difficulties, although it could be used by the able-bodied to supplement the mouse. We have aimed to develop a technically and computationally simple system: input is captured using a webcam and simple processing methods have been employed. The system tracks the user's nostrils, which are located by first finding a large skin-coloured region that is assumed to be the user's face. The position of the nostrils relative to the face region is used to control the position of the cursor. The system has been shown to be accurate and reliable under normal conditions of illumination and subject movement. The system is also able to track the desired features in real time, lending support to our claim that this can be a valid and useful method of human-computer interaction for a certain class of computer users.

Figure 11: Selected frames from a 30-second period of tracking. The images are screenshots of the system's output showing a video frame with tracking results superimposed: the rectangular facial region and two blobs indicating where the system believes the nostrils to be located. Each image also contains a small square illustrating where the system would place the cursor; note that the cursor's movement is a mirror image of the nostrils' movements.

References

[1] Orin Instruments, 2001. http://www.orin.com/access/
[2] Prentke Romich Company, 2001. http://store.prentrom.com/
[3] R. Drew, S. Pettitt, P. Blenkhorn, D. G. Evans, "A Head Operated 'Joystick' Using Infrared", in Computers and Assistive Technology ICCHP '98, Proc. XV IFIP World Computer Congress, eds. A. D. N. Edwards, A. Arato, W. L. Zagler, Österreichische Computer Gesellschaft, 1998.
[4] D. G. Evans, R. Drew, P. Blenkhorn, "Controlling Mouse Pointer Position Using an Infrared Head Operated Joystick", IEEE Trans. Rehabilitation Engineering, Vol. 8, pp. 107-117, 2000.
[5] D. G. Evans, P. Blenkhorn, "A Head Operated Joystick - Experience with Use", Fourteenth Annual International Conference on Technology and Persons with Disabilities, Los Angeles, March 1999.
[6] M.-H. Yang, N. Ahuja, D. Kriegman, "A survey on face detection methods", IEEE Transactions on Pattern Analysis and Machine Intelligence, 1999.
[7] M. Störring, H. J. Andersen, E. Granum, "Skin colour detection under changing lighting conditions", in H. Araujo, J. Dias (eds.), 7th International Symposium on Intelligent Robotic Systems, pp. 187-195, Coimbra, Portugal, 20-23 July 1999.
[8] M.-H. Yang, N. Ahuja, "Detecting human faces in color images", in Proc. IEEE International Conference on Image Processing, Chicago, IL, October 4-7, 1998, pp. 127-130.
[9] K. Toyama, "Look, Ma - No Hands! Hands-Free Cursor Control with Real-time 3D Face Tracking", in Proc. Workshop on Perceptual User Interfaces (PUI'98), 1998.
[10] K. Sobottka, I. Pitas, "Segmentation and tracking of faces in color images", in Proc. Second International Conference on Automatic Face and Gesture Recognition, pp. 236-241, 1996.
[11] J. Yang, A. Waibel, "A real-time face tracker", in Proc. 3rd IEEE Workshop on Applications of Computer Vision, pp. 142-147, 1996.
[12] R. J. Qian, M. I. Sezan, K. E. Matthews, "Face Tracking Using Robust Statistical Estimation", in Proc. Workshop on Perceptual User Interfaces, San Francisco, California, November 1998.
[13] K. Sobottka, I. Pitas, "Extraction of facial regions and features using color and shape information", in Proc. International Conference on Pattern Recognition, Vienna, Austria, August 1996.
[14] B. Menser, M. Brünig, "Segmentation of human faces in color images using connected operators", in Proc. IEEE International Conference on Image Processing (ICIP99), Vol. 3, pp. 632-636, Kobe, Japan, October 1999.
[15] Y. Jie, W. Lu, A. Waibel, "Skin-Color Modeling and Adaptation", in 3rd Asian Conference on Computer Vision, pp. 687-694, Hong Kong University of Science and Technology, Hong Kong, January 1998.
[16] B. Schiele, A. Waibel, "Gaze tracking based on face-color", in Proc. International Workshop on Automatic Face and Gesture Recognition, pp. 344-349, Zurich, 1995.
[17] A. Yuille, P. Hallinan, D. Cohen, "Feature extraction from faces using deformable templates", International Journal of Computer Vision, 8, pp. 99-111, 1992.
[19] R. Brunelli, T. Poggio, "Face recognition: features versus templates", IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 15, No. 10, pp. 1042-1052, October 1993.
[20] V. Bakic, G. Stockman, "Menu Selection by Facial Aspect", in Proc. Vision Interface '99, Quebec, Canada, 18-21 May 1999.
[21] A. Nikolaidis, I. Pitas, "Facial feature extraction and pose determination", Pattern Recognition, 33(11), pp. 1783-1791, 2000.
[22] T. Morris, F. Zaidi, P. Blenkhorn, "Blink detection for real-time eye tracking", Journal of Network and Computer Applications, 25(2), pp. 129-143, 2002.
[23] J. M. Stroud, "The Fine Structure of Psychological Time", in Information Theory in Psychology (H. Quastler, ed.), Free Press, Glencoe, IL, 1956.
[24] University of Stirling Face Database, 2003. http://pics.psych.stir.ac.uk/
[25] C. L. Zitnick, J. Gemmell, K. Toyama, "Manipulation of Video Eye Gaze and Head Orientation for Video Teleconferencing", Technical Report MSR-TR-99-46, Microsoft Research, Redmond, WA, June 1999.