Semiconductor Defect Classification
Keisuke Kameyama Interdisciplinary Graduate School of Sci. and Eng. Tokyo Institute of Technology Yokohama 226-8502, Japan Yukio Kosugi Frontier Collaborative Research Center Tokyo Institute of Technology Yokohama 226-8503, Japan
Abstract
An automatic defect classification (ADC) system for visual inspection of semiconductor wafers, using a neural network classifier, is introduced. The proposed Hyperellipsoid Clustering Network (HCN), employing a Radial Basis Function (RBF) in the hidden layer, is trained with additional penalty conditions for recognizing unfamiliar inputs as originating from an unknown defect class. Also, by using a dynamic model alteration method called Model Switching, a reduced-model classifier which enables efficient classification is obtained. In the experiments, the effectiveness of the unfamiliar input recognition was confirmed, and a classification rate sufficiently high for use in the semiconductor fab was obtained.
1. Introduction
Visual inspection plays an important role in the manufacturing processes of semiconductors. The disorders found on the wafer surface, such as the one shown in Fig. 1, are commonly referred to as defects. The motive for defect classification is to find out the process stages and the sources that are causing them. Early detection of the sources of defects is essential in order to maintain high product yield and quality. By replacing the review process typically conducted by human experts, it is also aimed to improve both the stability and speed of inspection. In the literature, it is reported that the classification accuracies of human experts are typically 60-80% [1]. If this stage of visual inspection could be automated, it would greatly contribute to enhancing the productivity of the semiconductor fab.

The task of classifying the defect image features has several specific conditions inherent to the particular problem. Most distinctive among them is the fact that the user does not have the freedom of collecting a sufficient number, or an appropriate selection, of training images. Also, the numbers of training samples per class are extremely unbalanced. When the number of samples for a defect class is small, approaches whose decisions rely on all samples, such as the radial basis function (RBF) networks [10][12] or the joint use of nonparametric estimation of the probability density function by Parzen's method [11] and Bayes classification, perform well. However, for a class with a large number of samples, these methods are computationally costly. In this case, instead of using all the training samples for classification, methods based on distances from the class-cluster prototypes, such as the k nearest neighbor algorithm [2] and learning vector quantization [9], and those based on class borders, such as multilayer perceptrons (MLP) [14] and support vector machines [15], are computationally more efficient. So-called reduced variants of the above nonparametric methods, such as the generalized RBF networks [12] and reduced Parzen classifiers [3], are also methods depending on the distances from the prototypes.
In this work, a three-layered neural network named the Hyperellipsoid Clustering Network (HCN), having hidden layer units of RBF type, will be used. In addition to the parameter adjustment by the backpropagation (BP) method [14], a model alteration method called Model Switching (MS) [7], which allows the map acquired by training to be inherited by the new model, is used during the training process for efficiently obtaining an appropriate reduced model.

The second requirement for the system is to classify the known defect classes without fail and not to make wild guesses on unfamiliar defects. Such cases should be pointed out as unclassifiable and be left for the human expert to see. Since the training set will usually provide answers at only a small portion of the feature space, inputs to the remaining open space should be treated as being unknown. For recognizing unfamiliar inputs, the HCN was trained with an additional penalty condition, so that the sizes of the hyperellipsoid kernels will be kept small, to tightly enclose the clusters formed by the training samples.

In Sec. 2, the HCN will be introduced, together with its training method and the output interpretation method for recognition of unfamiliar inputs. In Sec. 3, the idea of Model Switching for allowing dynamic model alteration during training will be reviewed. The defect classes and the outline of the automatic defect classification (ADC) system will be explained in Sec. 4. In Sec. 5, the network and the ADC system will be evaluated by applying them to the classification of the defect image sets, and the paper will be concluded in Sec. 6.
[Figure: structure of the network. Input vector x, input layer, hidden layer, output layer (linear), output vector y.]
Figure 3. An example of the kernel functions made by the joint use of (hyper)ellipsoid discriminants and sigmoid functions. (Horizontal axes: x1, x2; plotted kernel outputs: h1, h2.)
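The n-th unit in the hidden layer evaluates the hyperellipsoid discriminant of Eq. (1), which is parameterized by the radius parameter $r_n \in R$, the center vector $\mathbf{m}_n = (m_{n1} \ldots m_{nL})^T$ and the weight matrix $H_n = [H_{nst}]$.

(1)

The transfer function of the hidden layer unit is the well-known sigmoid function. Thus, the output of unit n is,

(2)

A unit in the output layer takes the fan-out of the hidden layer units and calculates the weighted sum with no bias as,

$$o_k = \mathbf{w}_k^T \mathbf{h} \qquad (3)$$

where $\mathbf{w}_k = (w_{k1} \ldots w_{kN})^T \in R^N$ is the weight vector of the k-th output unit, and $\mathbf{h} = (h_1 \ldots h_N)^T \in R^N$. The weight vector is also modified in the training process.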
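For reference, one form of Eqs. (1) and (2) that is consistent with the definitions in this section is written out below. It is a reconstruction under stated assumptions, not a verbatim copy of the original equations; the symbol $u_n$ for the unit potential is introduced here only for illustration.

```latex
\begin{align}
u_n &= r_n^2 - \left\| H_n (\mathbf{x} - \mathbf{m}_n) \right\|^2
      && \text{(assumed form of Eq. (1))} \\
h_n &= \frac{1}{1 + \exp(-u_n)}
      && \text{(assumed form of Eq. (2))}
\end{align}
```

With this form, the surface $u_n = 0$ satisfies $(\mathbf{x} - \mathbf{m}_n)^T H_n^T H_n (\mathbf{x} - \mathbf{m}_n) = r_n^2$. Along the i-th eigenvector of $H_n^T H_n$ with eigenvalue $\lambda_i$, a point at distance $d$ from the center gives $\lambda_i d^2 = r_n^2$, so $d = |r_n| / \sqrt{\lambda_i}$, which agrees with the semi-axis length used in the discussion of the penalty terms.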
By employing the discriminant in Eq. (1), the discrimination boundary in the feature space will always be a hyperellipsoid. Since the unit potential in Eq. (1) depends on the distance between the input and the center vector $\mathbf{m}_n$, the network is an RBF network. However, in contrast with the popular Gaussian RBF network [12], various profiles of the kernel function are possible by controlling the gain [4] of the sigmoid function with the radius parameter $r_n$, as shown in Fig. 3. This network model, using the hyperellipsoid discriminant and the sigmoid function in the hidden layer, will be referred to as the Hyperellipsoid Clustering Network (HCN).
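The forward computation of such a network can be sketched as follows. This is a minimal illustration, not the authors' implementation: the unit potential is assumed to be $r_n^2 - \|H_n(\mathbf{x} - \mathbf{m}_n)\|^2$ (zero on the hyperellipsoid surface), and the function names `sigmoid`, `hidden_response` and `hcn_forward` are hypothetical.

```python
import math

def sigmoid(u):
    """Numerically stable logistic function."""
    if u >= 0:
        return 1.0 / (1.0 + math.exp(-u))
    e = math.exp(u)
    return e / (1.0 + e)

def hidden_response(x, m, H, r):
    """Response of one hidden unit with center m, weight matrix H, radius r.

    ASSUMPTION: potential u = r^2 - ||H (x - m)||^2, so u = 0 on the
    hyperellipsoid surface and the response there is exactly 0.5.
    """
    d = [x_t - m_t for x_t, m_t in zip(x, m)]
    Hd = [sum(h_st * d_t for h_st, d_t in zip(row, d)) for row in H]
    u = r * r - sum(v * v for v in Hd)
    return sigmoid(u)

def hcn_forward(x, centers, matrices, radii, W):
    """Return (h, o): hidden responses and outputs o_k = w_k^T h (no bias)."""
    h = [hidden_response(x, m, H, r)
         for m, H, r in zip(centers, matrices, radii)]
    o = [sum(w_kn * h_n for w_kn, h_n in zip(w_k, h)) for w_k in W]
    return h, o
```

For example, with `H = [[2.0, 0.0], [0.0, 1.0]]` and `r = 1.0`, the eigenvalues of $H^T H$ are 4 and 1, so the kernel boundary has semi-axes 0.5 and 1 along x1 and x2, and the unit responds with 0.5 at the points (0.5, 0) and (0, 1) relative to the center.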
The training method used in the HCN is based on the batched BP law with momentum terms [14]. The error criterion E is defined as

$$E = \frac{1}{P} \sum_{p=1}^{P} E_p \qquad (4)$$

where $P$ is the cardinality of the training set and $E_p$ is the error for the p-th training pair, evaluated between the p-th training output vector and the p-th output vector, both in $R^O$.

For enabling a tight bounding by hyperellipsoids to implement the recognition of the unfamiliar inputs, the volume of the hyperellipsoids should be kept small as long as it does not harm the achievement of training. This can be done by setting a penalty term to restrict the radius of the hyperellipsoids. The distance from the center to the edge of the hyperellipsoid in the direction of the i-th principal component can be written as $|r_n| / \sqrt{\lambda_i}$, where $\lambda_i$ is the i-th eigenvalue of the matrix $H_n^T H_n$, which is always positive. Thus, a penalty to suppress the absolute value of the radius parameter $r_n$ can be considered to be effective. Also, a term to prevent the eigenvalues from becoming too small was necessary. This second restriction was implemented indirectly by preventing the Euclidean norm of the matrix $H_n$ from becoming too small. Consequently, the modification measures to the weight matrix $H_n$ and the radius parameter $r_n$ were formulated as,

(5)

and

$$\Delta r_n = \Delta r_n^{BP} - \epsilon_r \frac{\partial (r_n^2)}{\partial r_n} \qquad (6)$$

with the terms $\Delta H_n^{BP}$ and $\Delta r_n^{BP}$ denoting the modification measures by the plain BP training. Parameters $\epsilon_H$ and $\epsilon_r$ denote the penalty term gains.

The network will be trained to respond with a class-specific unit vector. Since the output is the weighted sum of the kernel functions of the hidden layer units, it can be justified to reject an output vector that does not have a significant winner. In such a case, the input pattern should be classified as originating from an unknown class. Therefore, the output interpretation of

$$\mathrm{class} = \begin{cases} \arg\max_k (o_k) & \text{if } \max_k o_k > \theta \\ \text{unknown} & \text{otherwise} \end{cases} \qquad (7)$$

is used, where $\theta$ is the membership threshold.

3. Model switching

As a method for obtaining a reduced network model in the learning process, a model alteration scheme called Model Switching (MS) [7] is employed. MS is a framework for dynamic model alteration during the BP training for improvement of the training efficiency, by avoiding the local minima and reducing the redundancy in the network model.

Definition 1 (Model Switching) On altering the neural network model, methods which determine the moment or the occasion of model alteration by taking into account both of the following factors:
1. The nature and fitness of the new model and the initial map candidate within the new model.
2. The status of the immediate model and map.
will be referred to as Model Switching (MS).

In this work, MS will be used to reduce the number of hidden layer units in the HCN, in which the training is initially started with a model having the same number of units as the training samples. Pruning algorithms [13], which are also an attempt to reduce the network size, mostly limit the occasion of model reduction to after the convergence of the training error. With MS, however, the occasion can be set at any time, as long as the fitness of the candidate of the initial map within the new model is met. When only the model reduction is used in MS, only the first factor in Def. 1 needs to be considered.

The process of training by BP with MS is shown in Fig. 4. For each training epoch of BP, the fitness of the switchable candidates will be evaluated, and switching will take place when the fitness index $I_F$ of a candidate exceeds a given threshold $I_{F0}$. The candidate set of the new model and map was made by using the unit reduction method of unit fusion [6]. Unit fusion selects a pair of units in a layer and replaces them by a single unit. On replacement, connection weights to the new unit are determined so that the map of the old network will be inherited by the new network. Let the units indexed i and j be fused to make a single unit $i'$. The weighted sum of the inputs from units i, j and the unity bias b to the subsequent layer unit k can be written as,

(8)
where w, o, e and v are the connection weight, unit response, average unit response and the varying portion of the response, respectively. Generally, we can put,

$$v_j = \mathrm{sgn}(r_{ij}) \frac{\sigma_j}{\sigma_i} v_i \qquad (9)$$

with $\sigma_i$ and $r_{ij}$ denoting the standard deviation of the unit output, and the output similarity of the unit pair, respectively, both evaluated for all the training inputs. From Eqs. 8 and 9, we have

(10)

(11)

and

(12)

where the primes denote the connection weights after the fusion. Since no bias unit is used in the hidden layer of HCN, only the compensation in Eq. 11 will be used. As unit fusion can be applied to all unit pairs in the hidden layer, $\frac{1}{2} N (N - 1)$ switching candidates exist. The one which is most fit will be selected by evaluating the fitness index $I_F(\mathbf{f}_N, \mathbf{f}_{N_i})$.

Figure 4. BP training with Model Switching. (Flowchart: start; evaluate the fitness index I_F(f_N, f_Ni) for all candidates f_Ni in C_MS; decide on model size reduction; if yes, switch f_N to f_Nk with k = argmax_i I_F(f_N, f_Ni); if no, no switching.)

The fitness of the new map will be a function of the degree of map inheritance, and the closeness of the two kernels to be fused in the feature space, to give priority to the fusion of kernels that are placed close together. For evaluating the degree of map inheritance, a measure named Map Distance will be used.

Definition 2 (Map Distance) The map distance between two mapping vector functions $\mathbf{f}_{N_1}(\cdot)$ and $\mathbf{f}_{N_2}(\cdot)$ trained with the training vector set $\{(\mathbf{z}_p, \mathbf{y}_p)\}_{p=1}^{P}$ is defined as,

$$D(\mathbf{f}_{N_1}, \mathbf{f}_{N_2}) = \frac{1}{P} \sum_{p=1}^{P} \left\| \mathbf{f}_{N_1}(\mathbf{z}_p) - \mathbf{f}_{N_2}(\mathbf{z}_p) \right\|_E^2 \qquad (13)$$

The fitness of the candidates will be evaluated by the fitness index function of,

$I_F(\mathbf{f}_N, \mathbf{f}_N^{ij})$ (14)

where $\mathbf{f}_N^{ij}$, $L$ and $D_{max}$ denote the map obtained by fusing the i-th and j-th units, the dimension of feature space, and the maximum possible map distance, respectively. It is assumed that all the feature elements are bounded to the (0, 1) domain. On actual evaluation of the map distance, the theorem approximating the map distance generated by the fusion of hidden layer units [7] was used.
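The unit fusion step and the map distance of Def. 2 can be sketched as follows. This is a minimal illustration under stated assumptions: the compensation rule is worked out here by substituting the relation of Eq. 9 into a weighted sum with bias, and is not a transcription of Eqs. 10-12; the fitness index of Eq. 14 is not implemented; the function names are hypothetical. The sketch keeps a bias term for generality, whereas in the HCN only the weight compensation applies.

```python
import math

def fuse_units(h_rows, W, b, i, j):
    """Fuse hidden unit j into unit i and return (W_new, b_new).

    h_rows[p][n]: response of hidden unit n to the p-th training input.
    W[k][n], b[k]: weight and bias of linear output unit k.
    ASSUMPTION: v_j is replaced by sgn(r_ij) * (sigma_j / sigma_i) * v_i,
    which requires unit i to have a non-constant response (sigma_i > 0).
    """
    P = len(h_rows)
    e_i = sum(row[i] for row in h_rows) / P          # average response of i
    e_j = sum(row[j] for row in h_rows) / P          # average response of j
    var_i = sum((row[i] - e_i) ** 2 for row in h_rows) / P
    var_j = sum((row[j] - e_j) ** 2 for row in h_rows) / P
    cov = sum((row[i] - e_i) * (row[j] - e_j) for row in h_rows) / P
    # The sign of the covariance equals the sign of the correlation r_ij.
    s = math.copysign(1.0, cov) * math.sqrt(var_j / var_i)
    W_new, b_new = [], []
    for w_k, b_k in zip(W, b):
        fused = w_k[i] + s * w_k[j]                  # weight compensation
        b_new.append(b_k + w_k[j] * (e_j - s * e_i)) # bias compensation
        W_new.append([fused if n == i else w
                      for n, w in enumerate(w_k) if n != j])
    return W_new, b_new

def layer_output(h, W, b):
    """Linear output layer: o_k = sum_n W[k][n] * h[n] + b[k]."""
    return [sum(w * h_n for w, h_n in zip(w_k, h)) + b_k
            for w_k, b_k in zip(W, b)]

def map_distance(outs_a, outs_b):
    """Mean squared Euclidean distance between two maps over the inputs."""
    P = len(outs_a)
    return sum(sum((a - c) ** 2 for a, c in zip(ya, yb))
               for ya, yb in zip(outs_a, outs_b)) / P
```

When the responses of the two units are exactly linearly related over the training inputs, the fused network reproduces the original outputs and the map distance is zero up to rounding; the less similar the pair, the larger the map distance of the candidate.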
[Figure: feature extraction flow. Reference image and defect image; defect mask; AND; color quantization; shape feature; color ratio.]
Figure 6. An artificially generated cluster data of four classes in the (x1, x2) domain. (a) Training set (P = 100). (b) Test set (P = 1000).
layers have been stacked over a foreign object. EO class defects appear slightly larger and more irregular in shape than those of the FO class, because the patterns of the heaped area in the covering layers are deformed by the embedded object. In addition to the characteristic dark color of the particle itself, other colors can be observed as well. Defects of FO and EO classes can appear quite similar, and are sometimes hard to distinguish even for an expert.

C. Pattern failure (PF)
This class covers all kinds of defects that have pattern deformations without any existence of external objects. Defects of PF class can also be caused by insufficient exposure or etching. Thus they can have a wide variety of sizes and shapes. Since the defect is usually an extra region or a lack in the pattern of a layer, the color of the defect region tends to be one of those observed in the normal patterns.
ages of the layer to be inspected. Also, typical defect colors are manually added as prototype colors. The ratios of the quantized colors in the defect region were used as the color feature vector of the defect. In the experiments in Sec. 5, the feature dimension was 12, including the 2 shape features and 10 color features, all normalized to unity range.
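The color feature described above can be sketched as follows. This is a minimal illustration, assuming the image has already been quantized so that each pixel holds the index of its prototype color and that the defect mask is a binary image of the same size; the function name `color_ratio_features` and the data layout are hypothetical, and the quantization step itself is not shown.

```python
def color_ratio_features(color_index, defect_mask, num_colors):
    """Ratio of each quantized color among the pixels inside the defect mask.

    color_index[r][c]: prototype color index (0 .. num_colors - 1) of a pixel.
    defect_mask[r][c]: truthy if the pixel belongs to the defect region.
    Returns a list of num_colors ratios, each in [0, 1], summing to 1
    (or all zeros when the mask is empty).
    """
    counts = [0] * num_colors
    total = 0
    for index_row, mask_row in zip(color_index, defect_mask):
        for color, inside in zip(index_row, mask_row):
            if inside:
                counts[color] += 1
                total += 1
    if total == 0:
        return [0.0] * num_colors
    return [n / total for n in counts]
```

Because the ratios are already bounded to the unit range, they can be concatenated with the normalized shape features without further scaling.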
5. Experiment
A. Membership thresholding in an artificial cluster data
The effect of membership thresholding and MS was evaluated using artificial four-class data in a 2D domain, shown in Fig. 6. Three types of networks and training strategies were tried. All networks were trained to the target error of $E_0 = 0.01$, to respond with class-specific unit vectors.
1. MLP with (input-hidden-output)=(2-4-4) units.
2. HCN with (input-hidden-output)=(2-100-4) units.
3. HCN trained by BP with MS for model reduction during training. Initial model: (2-100-4).
The change in the recognition rate for the test set, and the ratio of the area within the input domain which was pointed out as being of unknown class, were evaluated by changing the membership threshold in Eq. 7. Ideally, the recognition rate will be maintained high, even when a large portion of the input domain is judged as unknown (rejected). The result is shown in Fig. 7. It is clear that by reducing the model of HCN by MS, a larger portion of the input domain is properly rejected without losing the classification ability for the test set.
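The thresholded output interpretation used in this evaluation can be sketched as follows. This is a minimal illustration, assuming the rule "accept the winning output unit only if its value exceeds the membership threshold"; the function names are hypothetical, and the rejected ratio here is counted over a list of samples as a simplification, whereas the experiment measures the rejected area of the input domain.

```python
def interpret_output(o, theta):
    """Return the index of the winning output unit, or "unknown" when the
    winner does not exceed the membership threshold theta."""
    winner = max(range(len(o)), key=lambda k: o[k])
    return winner if o[winner] > theta else "unknown"

def evaluate(outputs, labels, theta):
    """Return (fraction classified correctly, fraction rejected as unknown)."""
    decisions = [interpret_output(o, theta) for o in outputs]
    correct = sum(d == y for d, y in zip(decisions, labels))
    rejected = sum(d == "unknown" for d in decisions)
    return correct / len(labels), rejected / len(labels)
```

Sweeping `theta` over a range of values and recording the two returned fractions gives a trade-off curve of the kind discussed above.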
Figure 7. The change in the recognition rate and the ratio of the rejected input domain, when the membership threshold is changed. (Curves: MLP, HCN, HCN with MS; vertical axis: recognition rate; annotated threshold values 0.9 and 0.2.)

B. Leave-one-out evaluation with HCN using MS
A collection of defect images obtained from the same process layer of a product was used for evaluating the ADC system. The set consisted of 33 FO class, 36 EO class and 24 PF class images. The class information for all the images was provided by an expert inspector. The classification rates were evaluated by the leave-one-out method [3]. An HCN with unit configuration of (12-93-3), initialized by placing each kernel at a training input, was trained using MS. The model typically converged to reduced models with 9 to 14 hidden layer units. The results are shown in Table 1. By employing the membership thresholding with the threshold set to 0.5, it is found that the nondiagonal elements (errors) in the confusion matrix could be reduced drastically. The obtained classification rate is considered to be comparable to those of human experts. By reducing the network model by MS, the computation required for using the network was also reduced by 85-90%, when compared with the initial network model.

Table 1. The classification rate and the confusion matrix for the HCN evaluated by the leave-one-out method. In each cell, the first number is without membership thresholding and the second number is for the case when membership thresholding was used.

Estimation      True FO        True EO        True PF        Total
FO              32 / 32        2 / 1          2 / 0
EO              1 / 0          32 / 30        0 / 0
PF              0 / 0          2 / 0          22 / 21
Unknown         0 / 1          0 / 5          0 / 3
Correct (%)     97.0 / 97.0    88.9 / 83.3    91.7 / 87.5    92.5 / 89.2
Error (%)       3.0 / 0.0      11.1 / 2.8     8.3 / 0.0      7.5 / 1.1

(FO: Foreign Object, EO: Embedded Object, PF: Pattern Failure.)

6. Conclusion

An ADC system for visual inspection of semiconductor wafers, using a neural network classifier, was introduced. The Hyperellipsoid Clustering Network was introduced, and the training rule with cost terms for recognizing unfamiliar inputs as originating from an unknown defect class was given. Further, by using BP training with Model Switching, a reduced-model classifier which enables efficient classification was obtained. The defect classes and the descriptions of the extracted image features were defined. In the experiments, the effectiveness of the unfamiliar input recognition was confirmed, and a classification rate comparable to those of human experts was obtained.

References

[1] P. B. Chou, A. R. Rao, M. C. Struzenbecker, F. Y. Wu, and V. H. Brecher. Automatic defect classification for semiconductor manufacturing. Machine Vision and Applications, 9(4):201-214, 1997.
[2] R. O. Duda and P. E. Hart. Pattern Classification and Scene Analysis. Wiley, 1973.
[3] K. Fukunaga. Introduction to Statistical Pattern Recognition. Academic Press, 1990.
[4] R. Hecht-Nielsen. Neurocomputing. Addison-Wesley, 1990.
[5] P. Heckbert. Color image quantization for frame buffer display. Computer Graphics, 16(3):297-307, 1982.
[6] K. Kameyama and Y. Kosugi. Neural network pruning by fusing hidden layer units. Transactions of IEICE, E74(12):4198-4204, 1991.
[7] K. Kameyama and Y. Kosugi. Model switching by channel fusion for network pruning and efficient feature extraction. Proceedings of International Joint Conference on Neural Networks 1998, pages 1861-1866, 1998.
[8] K. Kameyama, Y. Kosugi, T. Okahashi, and M. Izumita. Automatic defect classification in visual inspection of semiconductors using neural networks. IEICE Transactions on Information and Systems, E81-D(11):1261-1271, 1998.
[9] T. Kohonen. Self-organization and associative memory. Springer, 1988.
[10] J. E. Moody and C. J. Darken. Fast learning in networks of locally-tuned processing units. Neural Computation, 1:281-294, 1989.
[11] E. Parzen. On estimation of a probability density function and mode. Annals of Mathematical Statistics, 33:1065-1076, 1962.
[12] T. Poggio and F. Girosi. Networks for approximation and learning. Proceedings of the IEEE, 78:1481-1497, 1990.
[13] R. Reed. Pruning algorithms - a survey. IEEE Trans. Neural Networks, 4(5):740-747, 1993.
[14] D. Rumelhart, J. L. McClelland, and the PDP Research Group. Parallel distributed processing. MIT Press, 1986.
[15] V. N. Vapnik. Statistical Learning Theory. Wiley, 1999.