
IEEE TRANSACTIONS ON SEMICONDUCTOR MANUFACTURING, VOL. 33, NO. 1, FEBRUARY 2020

Deep Learning for Classification of the Chemical
Composition of Particle Defects
on Semiconductor Wafers

Jared O'Leary, Kapil Sawlani, and Ali Mesbah, Senior Member, IEEE

Abstract—Manual classification of particle defects on semiconductor wafers is labor-intensive, which leads to slow solutions and longer learning curves on product failures while being prone to human error. This work explores the promise of deep learning for the classification of the chemical composition of these defects to reduce analysis time and inconsistencies due to human error, which in turn can result in systematic root cause analysis for sources of semiconductor defects. We investigate a deep convolutional neural network (CNN) for defect classification based on a combination of scanning electron microscopy (SEM) images and energy-dispersive x-ray (EDX) spectroscopy data. SEM images of sections of semiconductor wafers that contain particle defects are fed into a CNN in which the defects' EDX spectroscopy data is merged directly with the CNN's fully connected layer. The proposed CNN classifies the chemical composition of semiconductor wafer particle defects with an industrially pragmatic accuracy. We also demonstrate that merging spectral data with the CNN's fully connected layer significantly improves classification performance over CNNs that only take either SEM image data or EDX spectral data as an input. The impact of training data collection and augmentation on CNN performance is explored and the promise of transfer learning for improving training speed and testing accuracy is investigated.

Index Terms—Convolutional neural networks, defect classification, semiconductor manufacturing, particle defects, chemical composition, transfer learning, data augmentation, outlier detection.

Manuscript received November 5, 2019; revised December 13, 2019; accepted December 30, 2019. Date of publication January 1, 2020; date of current version February 3, 2020. This work was supported by Lam Research Corporation. (Corresponding author: Ali Mesbah.) Jared O'Leary and Ali Mesbah are with the Department of Chemical and Biomolecular Engineering, University of California, Berkeley, CA 94720 USA. Kapil Sawlani is with the Deposition Product Group (Digital Initiative), Lam Research Corporation, Fremont, CA 94538 USA. Color versions of one or more of the figures in this article are available online at http://ieeexplore.ieee.org. Digital Object Identifier 10.1109/TSM.2019.2963656

I. INTRODUCTION

A. Background

SEMICONDUCTOR devices are found in almost every facet of modern life — from smartphones to ultra-high definition television sets. To sustain the ever-increasing demand for lower cost, faster computing power, and/or higher memory capacity devices, the design of semiconductor manufacturing processes is becoming more complex and requires the introduction of new hardware components, complex assemblies, new manufacturing methods, and close control of cleaning and handling techniques [1], [2], [3]. The introduction of new aspects of the manufacturing process can be a major source of unwanted particles depositing on wafers. In addition, changes in existing process conditions can result in particle generation depending on the reaction dynamics of the system. These particle defects on semiconductor wafers can be one of the many causes of product failure. In fact, over 75% of the total chip defects seen in standard semiconductor manufacturing processes are due to particle defects [4], [5], [6].

To understand the impact of manufacturing tools on their defects, the common practice in the semiconductor equipment industry is to run thousands of wafers per year per characterization tool, such as scanning electron microscopy and optical scattering tools. Each wafer can contain tens of defects. As manufacturing processes and therefore particle defect compositions become more complex, the amount of defects greatly increases. As a result, the time spent by process and productivity engineers to classify defects becomes excessively large. In addition, manual classification techniques are prone to human error. Clearly, implementing automated classification techniques is required to improve both defect classification accuracy and efficiency.

B. Motivation

In any defect analysis study, the ultimate goal is to eliminate the sources producing these defects in order to maximize the yield on the wafer. While specific steps and sequences may vary depending on the level of inspection desired for a given application, the first step of a general defect analysis workflow (see Fig. 1) is to measure a wafer before and after a certain step in the manufacturing process using optical scattering techniques. The data from the surface scattering tool provides a snapshot of the changes that occurred in that step by producing wafer maps based on the size distribution of the defects and their locations on the wafer. In several instances, looking at this global picture can provide some guidance into the source of the defects. For example, a scratch may indicate mechanical contact during operations, or, depending on the chamber geometry, agglomeration of the defects in one area may indicate a requirement for maintenance of the chamber parts. Many authors have recently provided machine learning solutions with high accuracy for these problems using convolutional neural networks [7], adaptive balancing generative adversarial
0894-6507 © 2020 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission.
See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.

Authorized licensed use limited to: University of Canberra. Downloaded on April 27,2020 at 12:29:04 UTC from IEEE Xplore. Restrictions apply.
O’LEARY et al.: DEEP LEARNING FOR CLASSIFICATION OF CHEMICAL COMPOSITION OF PARTICLE DEFECTS 73

Fig. 1. Semiconductor defect classification workflow. To fully characterize the defects on a semiconductor wafer, the wafer’s defect map must first be
determined. Next, the morphologies and chemical spectra of individual particle defects are examined. Due to morphology similarity and peak overlap among
different defect classes, it is necessary to combine EDX spectral and SEM image data from steps 2 and 3 to fully characterize a wafer’s defects. Several
automated classification techniques exist for each of the three steps in this process, such as convolutional neural networks (CNNs), support vector machines
(SVM), randomized general regression networks (RGRNs), and binning techniques. However, no automated classification technique exists that combines
multiple steps in this workflow. The main contribution of this paper is automating the classification of particle defects on semiconductor wafers based on
combined information of the defect SEM image and chemical spectra obtained using EDX spectroscopy.

networks (AdaBalGAN) [8], randomized general regression
networks (RGRNs) [9], and support vector machines
(SVMs) [10] for single classification. In many cases, wafer
map defect patterns contain multiple sources. In other cases,
wafer maps even show mixtures of several other commonly-
observed patterns. Neural networks for the classification of
these patterns have also been investigated [11], [12], [13].
Some authors have even gone so far as to provide machine
learning solutions to detect unseen wafer patterns [14]. Other
commonly-implemented strategies for wafer map classification are based on wafer map pattern failure recognition, which include wafer-based clustering techniques [15], [16], region-based modeling techniques [17], [18], and spatial signature analyses [19], [20]. However, such strategies can be inadequate for large-scale data sets without obtaining a reduced representation of the wafer maps [21].

Fig. 2. Example SEM image and labeled EDX spectrum for an SiO2 defect. The SEM image is taken from a "top-view" using a center detector. Peak-labeling on the EDX spectrum is accomplished through an automatic process that is error prone due to sensor noise and peak overlap.
The second step in the general defect analysis workflow is to review the scanning electron microscopy (SEM) images of the defects to identify their morphology (see Fig. 1). For example, a given reactor may generate particles of a specific shape, which can aid in particle classification. Recently, a convolutional neural network combined with a k-nearest neighbor algorithm was explored to classify defects based on the defects' morphology (i.e., a spot, a rock shape, a ring shape, etc.) [22].

In many cases, the wafer map classification or the morphology classification can help identify the source of defect generation. However, the third and typically final step in the general workflow involves obtaining energy-dispersive x-ray (EDX) spectra of the particle defects. The identification of the chemical composition can help narrow the defect source with most certainty, as equipment designers are well aware of equipment material compositions (if the defects are due to mechanical failures) and process engineers are well aware of their process conditions (if the defects are due to chemical interactions). It is common to use binning techniques based on chemical spectra information to program random forests and decision trees for automated particle chemical composition classification [23], [24]. However, this approach has the following drawbacks: (1) it is based on priority of the conditions, (2) it relies on chemical spectra only (no defect image is included), and (3) it is not capable of weighting the possibility of different classifications, which hinders analysis and recognition of incorrectly labeled defects.

Currently, particle chemical composition identification (i.e., the third step in the classification workflow) is accomplished by manually analyzing the EDX spectrum and SEM images of a given defect. Fig. 2 shows an SEM image and EDX spectrum of an example particle defect. It is challenging to classify defects by EDX spectra alone because: (1) certain defects are too small to have their peaks detected and (2) peak overlap confounds defect classification. Fig. 3 depicts an example defect whose classification is affected by peak overlap. As Fig. 4 shows, it is challenging to classify defects by image data alone because many different defect types are similar in size, shape, and topography. As a result,


an effective automated classification strategy for particle defect chemical composition must combine SEM image and EDX chemical spectral data.

Fig. 3. Example SEM image and labeled EDX spectra of an AlOxFyNz semiconductor defect. Note that the Nitrogen and Fluorine peak labels in the AlOxFyNz defect are not present in the EDX spectrum. This may be due to inadequate sampling of Nitrogen and Fluorine peaks during EDX collection and/or because of peak overlap (Fluorine has a Kα peak at 0.677 keV, which overlaps with the Lα peak of Iron at 0.705 keV).

Fig. 4. Example SEM defect images. The above 4 centered, "top-view" SEM images are example defects from 4 different defect classes: Al/OxFy/Nz (top-left), Fe-Ni/O (top-right), CuO/S (bottom-left), and WCNO (bottom-right). The defects have similar sizes, shapes, and topographies, which makes classification based on SEM images alone difficult.

C. Objectives

To the best of our knowledge, there has been no effort towards automating the classification of the chemical composition of particle defects on semiconductor wafers based on combined information of the defect SEM image and chemical spectra obtained using EDX spectroscopy. This paper thus aims to explore an automated classification approach for the chemical composition of particle defects to enable the identification of the source(s) of these defects in a semiconductor equipment chamber.

As semiconductor manufacturing processes (and their corresponding defects) become more complex, classification techniques must be designed to handle process upgrades as well as the introduction of multiple data types into classification decisions. There is thus a need for a classification scheme that is adaptive, can use information from defect images to improve accuracy, and can provide relative classification probabilities. To this end, deep learning techniques based on artificial neural networks for supervised learning hold promise [25], [26], [27], [28].

Due to their ability to classify large sets of images with high accuracy and low error rates [29], [30], [31], this work explores the use of convolutional neural networks (CNNs) for semiconductor defect classification. A convolution is essentially a function that has a high response when there is a high spatial correlation between two other functions. The convolution generates features, or correlation relationships, based on signals (i.e., image data) that are close to each other. As such, if features are supposed to be local indicators of global significance (e.g., looking for eyes that are a certain width apart to detect faces), the convolution is the operation that provides this locality [31], [32], [33]. The primary objective of this work is to investigate the promise of using CNNs to classify the chemical composition of particle defects on semiconductor wafers based on a combination of SEM images and EDX spectroscopy data. To be considered viable for integration into an industrial semiconductor manufacturing process, the proposed CNN-based defect classification approach must be able to classify a given defect with a Top-3 accuracy of at least 95%, where Top-3 accuracy is defined as the percentage with which the largest three relative classification probabilities output by the CNN include the true defect class. This 95% number does not necessarily represent "state-of-the-art performance", but instead represents a baseline for industrial utility.

To accomplish this objective, Lam Research Corporation provided SEM images and EDX spectra of real industrial semiconductor defects (see Fig. 2 for an example). We propose a CNN that takes one centered, top-view SEM image of a section of the semiconductor wafer that contains a particle defect as an input. The CNN merges its fully connected layer with that defect's corresponding EDX spectroscopy data to classify the defect's chemical composition. We analyze the industrial viability of this approach in terms of validation and testing accuracies to examine the overall performance of the CNN as well as confusion matrices to analyze class-specific classification performance. Additionally, we investigate the effects of the amount and distribution of training data as well as training data augmentation on CNN performance. Lastly, we explore the viability of transfer learning for improving training speed and accuracy.

This paper is organized as follows. A brief overview of the concepts required to understand the architecture of CNNs is provided. The architecture of a CNN that uses both SEM image and EDX spectral data is next outlined, and its placement within the outlined workflow in Fig. 1 is explained. The methods used to train, validate, and test the CNN are next described and a transfer learning strategy is proposed. The efficacy and industrial viability of all proposed models is then analyzed, with special attention to training data collection, distribution, and augmentation.

II. CONCEPTUAL OVERVIEW OF CONVOLUTIONAL NEURAL NETWORKS

CNNs are a class of feed-forward artificial neural networks that are widely used to classify images [29], [30], [31].


CNNs determine classification probabilities that correspond to a pre-determined list of potential classes for input images. The CNN then assigns the input image to the class with the highest corresponding classification probability. CNNs contain three primary components: the convolutional layers, the pooling layers, and the fully connected layers [29], [30], [31].

The convolutional layers use filters to slide (or convolve) over the height and width of input images of size W × H. The CNN connects certain areas of the input image defined by the filter to nodes in the convolutional layer. Each node in turn is connected to all pixels within a filter of size F × F. These connections have F² weights (w) and one bias term (b). Next, the dot product between the entries of the filter (i.e., the filter's weights and biases) and the pixel values of the input image is calculated. This product is fed through a non-linear activation function (e.g., ReLU, tanh) to produce a two-dimensional feature map. The CNN uses the feature map to learn which filters activate when the CNN detects some specific type of feature at some spatial position in the input [31], [32], [33], [34]. The stride S is the distance (in pixels) that filters move around an image to extract features (i.e., a convolutional layer with a stride of one would move the filter one pixel at a time before extracting new features). Smaller strides create larger and more informative, yet more computationally expensive feature maps. The convolution operation is mathematically summarized below [22], [35]:

$$y_{ij} = \sigma\left(\sum_{c=1}^{F}\sum_{d=1}^{F} w_{cd}\, x_{(c+i\times S)(d+j\times S)} + b\right), \qquad 0 \le i \le \frac{H-F}{S}, \quad 0 \le j \le \frac{W-F}{S}, \tag{1}$$

where y_ij is the output value of node (i, j) in the feature map, x_(c+i×S)(d+j×S) is the input data element at that coordinate (in pixel-space), and σ is the nonlinear activation used to extract features from the input data.

Fig. 5 shows an exaggerated version of the convolution of an input semiconductor defect image. Here, a filter slides over the image with a given stride. Equation (1) is then used to produce the numerical values seen in the feature map. The pooling layer then reduces the spatial resolution of the feature maps to reduce model complexity and achieve spatial invariance to input distortions and translations. Fig. 6 shows an example of "maximum" pooling, which propagates the maximum of all input values of a small neighborhood of an input image to the next layer [31], [36], [37], [38], [39], [40]. Note that convolutional layers are used to identify simpler geometric patterns (i.e., local edges, corners, and endpoints) and pooling layers identify location-invariant features of larger patterns that combine the simpler patterns found by the convolutional layers [22]. Convolutional and pooling layers are often repeated with different filters to extract more abstract feature representations while moving through the network [31], [38].

Fig. 5. Convolutional layer visualization. An input image is converted to pixels. The filter moves around the pixelated area to extract key image features and construct a feature map. Each numerical entry in the feature map is produced by Eq. (1).

Fig. 6. Pooling layer visualization. The outputs from the feature map are pooled in the next layer, condensing information in the process. Here, maximum pooling is employed, where the maximum of all input values of a small neighborhood of an image is propagated to the next layer.

Next, fully connected layers, which perform the function of high-level reasoning within the CNN [41], [42], [43], convert feature maps from the convolutional and pooling layers into a feature vector. For classification problems, it is standard to use the softmax operator (i.e., a generalized form of logistic regression) to transform this feature vector into the classification probabilities [31], [38], [42]. The softmax function is shown below for reference. Here, the probability P of the input image x belonging to a class i is expressed as a function of the feature vector z. In turn, z_i is a function of the input image x, along with a weight vector w_i and a bias term b_i that are connected to a given output node i. Note the correspondence between the expression for z_i and Eq. (1) [22], [35]:

$$z_i = w_i x + b_i, \qquad P(z_i) = \frac{\exp z_i}{\sum_j \exp z_j}, \tag{2}$$

Before implementation, the parameters (e.g., the filter's weights and biases) within CNNs must be trained with a set of images with known labels. CNN training involves two steps. First, the forward stage represents the input image with the current system parameters. A loss function is then calculated based on the computed classification probabilities and the known labels of the training data set. Second, the backward stage computes parameter gradients based on the loss cost associated with the known labels of the training data. CNN parameters are next updated and the forward/backward stages are repeated until the calculated loss functions converge. Due to the large number of parameters within deep CNNs, training data sets often include tens to hundreds of thousands of images, and techniques to overcome parameter over-fitting are often employed. For example, a dropout layer is often included between fully connected layers to randomly omit certain feature detectors and prevent complex co-adaptations of training data [41].
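For concreteness, Eqs. (1) and (2) can be implemented directly in a few lines of NumPy. This is an illustrative sketch for a single filter and a single grayscale channel; the ReLU default for σ is an assumption, not a choice made by Eq. (1) itself:

```python
import numpy as np

def feature_map(x, w, b, stride=1, sigma=lambda v: np.maximum(v, 0.0)):
    """Compute the convolutional feature map of Eq. (1).

    x : (H, W) input image, w : (F, F) filter weights, b : scalar bias.
    sigma defaults to a ReLU activation (an illustrative choice).
    """
    H, W = x.shape
    F = w.shape[0]
    out_h = (H - F) // stride + 1  # i ranges over 0..(H-F)/S
    out_w = (W - F) // stride + 1  # j ranges over 0..(W-F)/S
    y = np.empty((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            # Dot product of the filter with the image patch it covers.
            patch = x[i * stride:i * stride + F, j * stride:j * stride + F]
            y[i, j] = sigma(np.sum(w * patch) + b)
    return y

def softmax(z):
    """Eq. (2): turn a feature/score vector z into class probabilities."""
    e = np.exp(z - np.max(z))  # shift by max(z) for numerical stability
    return e / e.sum()
```

A 4 × 4 input with a 2 × 2 filter and stride 2, for instance, produces a 2 × 2 feature map, matching the index bounds in Eq. (1).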

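The two-stage training procedure described above (forward pass and loss, then backward pass and parameter update) can be sketched for a single softmax layer. The update below is an RMSProp-style step to match the optimizer named later in Section III; this is a simplified illustration, not the paper's TFLearn implementation:

```python
import numpy as np

def train_step(W, b, cache, x, y_true, lr=1e-3, decay=0.9, eps=1e-8):
    """One forward/backward stage for a softmax classifier with an
    RMSProp-style update (simplified sketch)."""
    # Forward stage: class probabilities and cross-entropy loss.
    z = W @ x + b
    e = np.exp(z - z.max())
    p = e / e.sum()
    loss = -np.log(p[y_true])
    # Backward stage: gradients of the loss w.r.t. W and b.
    dz = p.copy()
    dz[y_true] -= 1.0
    dW, db = np.outer(dz, x), dz
    # RMSProp: scale each step by a running average of squared gradients.
    cache["W"] = decay * cache["W"] + (1 - decay) * dW * dW
    cache["b"] = decay * cache["b"] + (1 - decay) * db * db
    W -= lr * dW / (np.sqrt(cache["W"]) + eps)
    b -= lr * db / (np.sqrt(cache["b"]) + eps)
    return float(loss)

rng = np.random.default_rng(0)
W, b = 0.01 * rng.standard_normal((3, 5)), np.zeros(3)
cache = {"W": np.zeros((3, 5)), "b": np.zeros(3)}
x, y_true = rng.standard_normal(5), 1
losses = [train_step(W, b, cache, x, y_true) for _ in range(50)]
```

Repeating the forward/backward stages drives the loss down, mirroring the convergence criterion described above.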

Fig. 7. Proposed CNN architecture. A centered, top-view SEM image is fed through convolutional and max pooling layers. EDX spectral data is merged directly with the fully connected layers. C represents the pixel count, and the number of filters used during each convolutional block is specified. For example, convolutional block 1 applies 64 filters to an image of size C × C. Note that the CNN that distinguishes data containing defects versus data that does not contain defects has one less fully connected layer, and does not involve the use of spectral data.

III. PROPOSED CONVOLUTIONAL NEURAL NETWORK ARCHITECTURE

We design a CNN to classify semiconductor defects using both SEM image and EDX spectroscopy data. Here, a centered, top-view SEM image is first fed into the CNN. The EDX spectroscopy data is later merged directly with the fully connected layer. The conceptual justification for merging EDX spectroscopy data directly with the fully connected layer stems from:
1) EDX spectroscopy extracts chemical information from defects based on the defect's atomic structure. The resulting spectrum thus comprises a given defect's chemical feature vector.
2) The fully connected layer within a CNN creates a visual feature vector for a given defect based on that defect's shape, size, orientation, and topography.
3) It is thus expected that merging a defect's chemical and visual feature vectors within the fully connected layer would allow the CNN to classify semiconductor defects with improved accuracy.

The spectroscopy data includes two separate components: (1) raw intensity values for each defect's EDX spectrum (counts vs. incident beam energy (keV)) and (2) automatically labeled peaks, some of which may be incorrect due to peak overlap. Both raw intensity values and labeled peaks are provided to enable accounting for incorrect automatic peak detection due to peak overlap and noise in raw intensity data. The exact structure of the proposed CNN is as follows, which is shown in Fig. 7 as well.
1) A centered, top-view, grayscale SEM image is fed into the CNN.
2) The image then enters the first convolutional block, which consists of two convolutional layers followed by one max-pooling layer. Each convolutional layer uses 64 filters with a 3 × 3 pixel size, a stride of 1 pixel, and a ReLU activation function. The max-pooling layer uses a 2 × 2 pixel size and a stride of 2.
3) The second convolutional block consists of two convolutional layers followed by one max-pooling layer. Each convolutional layer uses 128 filters with a 3 × 3 pixel size, a stride of 1 pixel, and a ReLU activation function. The max-pooling layer uses a 2 × 2 pixel size and a stride of 2.
4) The third convolutional block consists of three convolutional layers followed by one max-pooling layer. Each convolutional layer uses 256 filters with a 3 × 3 pixel size, a stride of 1 pixel, and a ReLU activation function. The max-pooling layer uses a 2 × 2 pixel size and a stride of 2.
5) The fourth and fifth convolutional blocks consist of three convolutional layers followed by one max-pooling layer. Each convolutional layer uses 512 filters with a 3 × 3 pixel size, a stride of 1 pixel, and a ReLU activation function. The max-pooling layer uses a 2 × 2 pixel size and a stride of 2.
6) The output of the fifth convolutional block enters the fully connected layer and generates a one-dimensional feature vector with 4096 nodes. A tanh activation function is used to produce this feature vector. A dropout layer follows with a neuron retention probability of 50%.
7) The raw intensity data (i.e., counts vs. incident beam energy (keV)) for the defect's EDX spectrum is merged with the fully connected layer. The raw intensity data is a 2 × 2048 array that is reshaped into a 1 × 4096 array.
8) A second fully connected layer follows and generates another one-dimensional feature vector with 4096 nodes. A tanh activation function is used to produce this feature vector. A dropout layer follows with a neuron retention probability of 50%.
9) The automatically labeled peak data for the defect's EDX spectra is reshaped into a 1 × 4096 array and is merged with the fully connected layer.
10) A third fully connected layer follows and generates a final one-dimensional feature vector with 4096 nodes. A tanh activation function is used to generate this feature vector. A dropout layer follows with a neuron retention probability of 50%.
11) Softmax activation yields output classification probabilities for a given defect. An RMSProp optimizer and a categorical cross-entropy loss function [38] are used during training to calculate the gradients that optimize the weights and biases used in the convolutional and pooling layers.

IV. METHODS

A. Data Collection and Tools

The data for this project was obtained through various Lam Research Corporation process tools. The measurements were taken using an optical scattering tool and a review SEM/EDX tool, which provides the dual capability of a scanning electron microscope (SEM) and energy-dispersive x-ray (EDX) spectroscopy. Defects on surfaces may be generated as part of wafer handling or from process tools during process testing. The surface scattering tool generates data identifying the defect, its size, and provides the physical location of the


TABLE I
Number of Defects per Class. Defects Belonging to the Same Class May Occasionally Have Slightly Different Compositions (e.g., AlOxFy and AlOxFyNz). The Class Labels Were Provided by Lam Research Corp.
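The class imbalance summarized in Table I is why Section IV-B benchmarks against a "stratified dummy" classifier, which ignores the input entirely and samples predictions from the training set's class frequencies. A minimal sketch follows; the class names are taken from the text, but the counts here are illustrative, not Table I's actual values:

```python
import random
from collections import Counter

def stratified_dummy(train_labels, n_predictions, seed=0):
    """Predict by sampling classes with the training set's frequencies,
    ignoring input features entirely (a 'stratified dummy' baseline)."""
    counts = Counter(train_labels)
    classes = list(counts)
    weights = [counts[c] for c in classes]
    rng = random.Random(seed)
    return rng.choices(classes, weights=weights, k=n_predictions)

# Illustrative imbalanced training set (counts are hypothetical).
train = ["Al/OxFy/OxFyNz"] * 254 + ["SiO2"] * 229 + ["WCNO"] * 100
preds = stratified_dummy(train, 1000)
```

Over many predictions, the dominant class is predicted at roughly its training frequency, which is exactly the accuracy floor such a baseline establishes for imbalanced data sets.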

Fig. 8. Example SEM defect images. Centered, top-view SEM images of two WCNO defects of drastically different sizes.
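The architecture of Section III fixes how a C × C input shrinks before reaching the first fully connected layer. As a sanity check, the spatial shape can be propagated through the five convolutional blocks in a few lines of Python. This sketch assumes zero-padded ("same") 3 × 3, stride-1 convolutions, so only the 2 × 2, stride-2 max-pooling layers change the spatial size; the 140 × 140 input is the crop size reported in Section IV-C:

```python
def output_shape(c, blocks=((2, 64), (2, 128), (3, 256), (3, 512), (3, 512))):
    """Propagate a c-by-c grayscale input through the five convolutional
    blocks of Section III; return (height, width, channels) pre-flattening.

    Each tuple is (number of conv layers, filters per layer); under the
    'same'-padding assumption the conv layers leave the spatial size
    unchanged and only set the channel count.
    """
    size = c
    channels = 1  # grayscale SEM input
    for n_convs, n_filters in blocks:
        channels = n_filters  # each conv layer outputs n_filters maps
        size = size // 2      # one 2x2, stride-2 max-pool per block
    return size, size, channels

h, w, ch = output_shape(140)
flat = h * w * ch  # flattened feature count entering the first FC layer
```

With a 140 × 140 crop, the spatial size shrinks 140 → 70 → 35 → 17 → 8 → 4, giving a 4 × 4 × 512 output (8192 values) that the first fully connected layer maps to its 4096-node feature vector.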

defect on a 300 mm wafer. This defect identification, size, and location data is used by the review SEM/EDX tool to generate images from the SEM and the x-ray spectra from the EDX tools. Although images from many perspectives are generated, only images from the center detector at a top-view are used to train the CNN. The x-ray spectra are contained in an EMSA file.

B. CNN Model Training, Validation, and Testing

Lam Research Corporation provided 5761 semiconductor defects from 8 different classes to train, validate, and test the proposed CNNs. They additionally provided 98 examples of SEM image and EDX spectral data that corresponded to sections of the semiconductor wafer without defects. This "no defect" data is not included in the training, validation, or testing of the CNN. The reasoning behind this and the treatment of this data is explained more in depth in Section 4D. Table I shows the distribution of the number of defects per class. To train the CNN, all of the 5761 SEM images and EDX spectra containing defects were divided among their 8 respective classes. Note that there is a clear imbalance in the number of defects in each class. This is discussed in depth in Section 5A.

Sixty percent of the provided defects were used for training the CNN, 20% were used for validating the CNN, and 20% were used to test the CNN. Five-fold cross-validation was implemented to train, tune, and validate the CNN, leading to the creation of 5 separate models. The defect training set distribution was kept consistent among all five folds used in the validation. The exact fold in which a given defect was placed was randomly selected. The mean and variance (in terms of squared percentage) of the training and validation accuracies were recorded, where the accuracy is defined as the total percentage of correct classifications across 8 classes. After training and 5-fold cross-validation, the model (out of the 5 created models) with the highest recorded validation accuracy was applied to the test data to quantify model performance. Note that all 5 models displayed very similar performance (see Section 5A). For the test data set, both Top-1 and Top-3 accuracies and a confusion matrix were recorded. Information from confusion matrices was used to analyze discrepancies between classification accuracies among different classes. As a baseline comparison, training, validation, and testing studies were performed on a version of the CNN that did not use EDX spectroscopy information and a version of the CNN that did not use SEM image data. Neither of these latter CNNs merged any separate data into their fully-connected layers (e.g., the labeled peak data mentioned in Section 3). Finally, the CNN's performance on the test data set was compared to that of a "stratified dummy" classifier, which generates predictions by respecting the training data set's class distribution [44]. For example, Al/OxFy/OxFyNz defects account for 25.4% of the training data, SiO2 defects account for 22.9% of the training data, and so on. For a given sample in the test data set, the stratified dummy classifier predicts that sample's chemical composition as Al/OxFy/OxFyNz with a 25.4% probability, as SiO2 with a 22.9% probability, etc. This classifier is often implemented as a baseline for unevenly distributed training data sets [44].

The CNN was written in TFLearn using Python, which is an abstraction layer of TensorFlow. Each CNN was trained using computing resources that utilized 4 Tesla K80 GPUs running in parallel. The CNN trained over 10 epochs with a training speed of 0.264 seconds/defect/epoch/GPU. Image pre-processing and data augmentation methods used during training are described in later sections.

C. Data Pre-Processing

Semiconductor particle defects drastically vary in size. Fig. 8 shows an example of two different-sized defects belonging to the same class. Some defects take up nearly 50% of the entire frame, while other defects of identical chemical composition take up less than 10% of the frame. The remainder of the frame is taken up by a silicon and/or thin-film coating background (which appears black). To improve computational efficiency and to reduce computational cost on pixels that do not carry any information, images are cropped before they are fed into the CNN. The optimal crop size was chosen to eliminate as much background as possible while being large enough to capture key features of larger defects. The provided SEM images were 480 × 480 pixels, and the optimal crop size was determined to be 140 × 140 pixels. In addition, gray scale values of input images were normalized from a scale of [0, 255] to [−1, 1] to reduce noise.

Most CNNs perform optimally when trained with at least tens of thousands of images. When fewer images are provided, it becomes necessary to enlarge the data

Authorized licensed use limited to: University of Canberra. Downloaded on April 27,2020 at 12:29:04 UTC from IEEE Xplore. Restrictions apply.
78 IEEE TRANSACTIONS ON SEMICONDUCTOR MANUFACTURING, VOL. 33, NO. 1, FEBRUARY 2020

set by appending to it slightly altered versions of the existing images. This
process, commonly known as data augmentation [45], [46], [47], [48], [49],
often involves re-scaling, translating, adding noise, changing lighting
conditions, rotating images, and transforming perspective. Because the
provided defects are orientation-independent, the data set was augmented by
rotating each image sixty degrees, yielding training and validation data sets
six times the size of the original data sets (note that the test data set
images were not rotated, however). The number of times each image was rotated
was determined by a study that compared image rotation number to accuracy
(see Section 5A).

Note that defect collection and labeling is a primarily manual process. This
means that some of the defects used to train the model may in fact be
mislabeled. This mislabeling may degrade the classification accuracy of
the CNN.

Fig. 9. Example of defect-free and defect-containing SEM images. The SEM
image on the right contains a particle defect. The image on the left is an
example of a collected SEM image for a section of the wafer with no defect.
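As an illustration of the six-fold rotation augmentation described above (a
sketch, not the authors' code; `rotate_nn` and `augment_by_rotation` are
hypothetical helper names), a simple nearest-neighbor rotation is enough to
show the bookkeeping:

```python
import numpy as np

def rotate_nn(img, deg):
    """Rotate a square grayscale image about its center by `deg` degrees,
    using nearest-neighbor sampling (out-of-frame pixels become background 0)."""
    h, w = img.shape
    c = (h - 1) / 2.0
    th = np.deg2rad(deg)
    ys, xs = np.indices((h, w))
    # inverse-map every output pixel back into the source image
    xr = np.cos(th) * (xs - c) + np.sin(th) * (ys - c) + c
    yr = -np.sin(th) * (xs - c) + np.cos(th) * (ys - c) + c
    xi, yi = np.rint(xr).astype(int), np.rint(yr).astype(int)
    ok = (xi >= 0) & (xi < w) & (yi >= 0) & (yi < h)
    out = np.zeros_like(img)
    out[ys[ok], xs[ok]] = img[yi[ok], xi[ok]]
    return out

def augment_by_rotation(img, step_deg=60):
    """Return rotated copies of `img` in `step_deg` increments; a 60-degree
    step yields 6 images (0, 60, ..., 300), i.e., a six-fold data set."""
    return [rotate_nn(img, k * step_deg) for k in range(360 // step_deg)]
```

In practice a library routine with interpolation (e.g., from OpenCV or
scipy.ndimage) would be used instead; the point here is only that rotation
augmentation multiplies the training and validation sets without touching the
test set.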
D. Outlier Detection

Our overarching research objective is to design an automated classification
strategy that takes SEM images and EDX spectral data of individual particle
defects as an input (see Fig. 1). The strategy is intended to classify the
chemical composition of the particle defect in question. The process of
collecting SEM image and EDX spectral data involves human-operated machinery.
This machinery automatically locates particle defects on wafers given certain
constraints chosen by the human operator (e.g., scan for particles over a
certain size, ignore certain portions of the wafer). Occasionally, operator
error such as inappropriate constraint choices can lead to the collection of
SEM images and EDX spectra for sections of the wafer in which no defect is
present. This data corresponds to the "no defect" class shown in Table I. As
a result, a robust automated defect classification process must be able to
distinguish spectral and image data that are defect-free from image and
spectral data that contain defects. Because semiconductor wafers are
constructed of silicon that oxidizes slightly over time, the EDX spectral
data for a defect-free wafer section will contain strong silicon, and
occasionally weak oxygen, peaks. These peaks will interfere with the
classification of silicon-based particle defects (e.g., SiO2, SiOxCy). As a
result, the designed CNN, or any strategy that involves EDX spectra, cannot
classify defect-free data with any reasonable accuracy. A representative
example of a defect-free and a defect-containing SEM image can be found in
Fig. 9.

Many existing CNN architectures perform object detection and classification
in one step, such as Mask R-CNN [50], Faster R-CNN [51], and YOLO [52]. In
theory, such architectures could analyze the SEM image input alone and decide
whether a defect is present before involving EDX spectral data. In this way,
the previously mentioned silicon and oxygen peaks in the defect-free data
would not interfere with defect classification. However, R-CNN methods are
exceptionally computationally expensive, as they involve the use of
time-consuming selective search algorithms [51]. YOLO architectures avoid the
need for selective-search problems by predicting bounding boxes in which
objects are most likely to exist. However, due to the spatial constraints of
the algorithm, these architectures struggle with locating small objects
within a frame and may predict bounding boxes that are not large enough to
see the edges of larger defects (see Fig. 8 for reference) [52].

Moreover, the relatively small amount of defect-free data indicates that such
data should be treated as outliers, and that the defect/no defect
classification can likely be handled by an outlier detection strategy. Due to
the EDX spectra peak interference issues mentioned in the previous paragraph,
this outlier detection strategy must use SEM images alone. There exist many
outlier detection strategies for image classification in the literature,
including principal component analysis [53], [54], Canny edge detection
algorithms [55], [56], fitting the data to a normal distribution, and kernel
Fisher discriminant analysis [57].

Fig. 9 shows that the primary difference between the two image types is the
presence (or lack thereof) of white contours with a grayish/black background.
A variety of edge detection methods have previously demonstrated the ability
to accurately depict the presence or absence of such contours [58], [59].
Canny edge detection algorithms, in particular, have been shown to be
especially adept at locating edges in structures with complex shapes
[55], [56]. As a result, the designed outlier detection strategy locates
contours within a given SEM image using a standard Canny edge detection
algorithm within Python's OpenCV library. The areas within these contours are
next calculated. If the calculated area of the largest contour exceeds a
threshold value, the SEM image is determined to contain a defect and
vice-versa. Note that the area of the "largest" contour is used because some
SEM images demonstrate flashes of gray and white, which the edge detection
algorithm confuses as very small contours (even though noise reduction
strategies are employed).

The threshold area is determined from examining the defect-containing SEM
images used for the training and validation of the CNN, in addition to 80% of
the defect-free data. The mean and variance of the largest contour area in
the defect-containing data is calculated. The lower area bound of the
defect-containing data set is three standard deviations below the mean area.
The mean and variance of the largest contour area in the defect-free data is
calculated as well. The upper bound of the defect-free data set is three
standard deviations


above the mean area. The threshold area is then set to be the average of
these upper and lower bound values. The efficacy of the outlier detection
strategy is evaluated on the "outlier detection test data set", which is
comprised of the CNN test data for the defect-containing data and the
remaining 20% of the defect-free data. The proposed automated classification
strategy is summarized in Fig. 10.

Fig. 10. Defect classification strategy. First, an SEM image is fed through
an outlier detection algorithm. This algorithm establishes whether or not the
SEM image contains a defect. If this SEM image contains a defect, the SEM
image and corresponding EDX spectrum are fed through the CNN to classify the
chemical composition of the defect in question.

E. Transfer Learning

Transfer learning is a machine learning technique where a model trained on
one task is re-purposed on a second related task [35], [60]. The process can
improve data training efficiency and model accuracy if the features from the
first task are general (meaning suitable for both tasks) instead of specific
to the base task [61]. Additionally, understanding that CNN features are more
generic in early layers and more data set specific in later layers, it is not
obvious that there will be a performance benefit to implementing transfer
learning for esoteric classification tasks (such as the classification of
semiconductor defect chemical composition based on SEM images) without
significant CNN structural tuning trial and error. For image classification
using CNNs, it is common to use a model pretrained for a large and
challenging image classification task (e.g., the ImageNet 1000-class
photograph classification competition). Because such models do not involve
merging secondary data sources (e.g., spectral data) with their fully
connected layers, it is further unclear how much these pretrained models
would increase defect classification performance.

As semiconductor manufacturing processes improve and become increasingly more
complex, new particle defects with new morphologies and chemical compositions
will undoubtedly appear. Each time a new defect reveals itself and/or a
current defect source is eliminated, the proposed CNNs will need to be
retrained. The subsequent semi-continuous process of re-training the CNNs
will clearly be time-intensive. It is reasonable to assume, however, that the
features extracted in earlier CNNs will be similar to the features extracted
in the subsequent CNNs (which will classify different numbers and types of
defects). As a result, we explore transfer learning to improve the training
speed (and potentially the accuracy) of these subsequent training processes.

We retrained the CNN with the 4 classes with the fewest number of defects
(i.e., CuO/S, Organic, SiOxCy, and Y). We then explored two different methods
of transfer learning:
1) The weights and biases from the 4-class model are loaded onto the
convolutional blocks. The CNN is then retrained for all 8 classes using these
weights as the starting points. This model will be referred to as CNN TLv1
(i.e., transfer learning, version 1) moving forward. The idea here is that
overall training time will decrease because fewer epochs will be required for
convergence.
2) The weights and biases from the 4-class model are loaded onto the
convolutional blocks. Only the fully connected layers are retrained on the
CNN for all 8 classes. This model will be referred to as CNN TLv2 (i.e.,
transfer learning, version 2) moving forward. The idea here is to decrease
training time because the amount of time to complete each epoch will
drastically decrease.

To assess the viability of the transfer learning strategy, we compared Top-1
and Top-3 testing accuracies and training speed of the original CNN, CNN
TLv1, and CNN TLv2. Industry implementation of our proposed automated
particle chemical composition classification strategy (see Fig. 10) will then
ideally involve the following steps:
1) Identify all relevant defect classes.
2) Train the CNN using SEM images and EDX spectra of these defect classes.
3) Each time a new defect class appears, retrain the CNN using a transfer
learning strategy.

V. RESULTS AND DISCUSSION

Validation and testing accuracies of the CNN are first examined to understand
overall CNN performance. Then, confusion matrices are generated to understand
the CNN's ability to classify individual defect classes and determine the
effects of the number and distribution of defect classes on CNN performance.
Training data augmentation methods are then briefly explored. The efficacy of
our chosen outlier detection strategy is briefly analyzed. Finally, the
viability of using transfer learning methods to decrease training time and
potentially increase CNN performance after the introduction of new defect
classes is investigated.
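The CNN TLv2 scheme described above (reuse the trained convolutional weights,
re-initialize and retrain only the fully connected head) can be made concrete
with a toy, framework-agnostic sketch. This is not the paper's
implementation; `TinyNet` and all shapes here are invented stand-ins, with a
matrix product playing the role of the convolutional blocks:

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

class TinyNet:
    """Toy stand-in for the CNN: `conv` plays the role of the convolutional
    blocks, `head` the fully connected classification layers."""
    def __init__(self, n_in, n_feat, n_classes):
        self.conv = rng.normal(0.0, 0.1, (n_in, n_feat))
        self.head = rng.normal(0.0, 0.1, (n_feat, n_classes))

    def train_step(self, x, y, lr=0.1, freeze_conv=False):
        feat = np.tanh(x @ self.conv)
        probs = softmax(feat @ self.head)
        grad_logits = probs.copy()
        grad_logits[np.arange(len(y)), y] -= 1.0  # d(cross-entropy)/d(logits)
        self.head -= lr * feat.T @ grad_logits / len(y)
        if not freeze_conv:
            pass  # TLv1 would also backpropagate into `conv` (omitted here)

# TLv2-style transfer: start from a model "trained" on 4 classes, reuse its
# convolutional weights, re-initialize only the head for 8 classes, and
# train with the convolutional block frozen.
base = TinyNet(n_in=16, n_feat=8, n_classes=4)
tlv2 = TinyNet(n_in=16, n_feat=8, n_classes=8)
tlv2.conv = base.conv.copy()                 # reused, frozen features
x = rng.normal(size=(4, 16))
y = np.array([0, 1, 2, 3])
tlv2.train_step(x, y, freeze_conv=True)      # updates only tlv2.head
```

Because only the head is updated, each epoch of TLv2 is cheaper than a full
retraining pass, which is exactly the rationale given for CNN TLv2 above.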


TABLE II
CNN training and validation accuracies for 5-fold cross-validation. Mean and
variance of the training and validation accuracies among the five models
created during 5-fold cross-validation are given. Accuracy data is also given
for baseline comparison CNNs that either do not include SEM-image data or do
not include EDX-spectroscopy data. Low variance in training and validation
accuracies suggests the consistency of the proposed CNN strategy. Note the
large performance difference between the CNN that includes both image and
spectral data and its image-only baseline.

TABLE III
Summary of CNN classification accuracy on the test data set. Top-1 and Top-3
classification accuracies are given for the test data set for the CNN and its
SEM image-only and EDX spectra-only baselines. The first row represents the
test accuracy of a "stratified dummy" classifier, which generates predictions
by respecting the data set's training distribution.
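The "stratified dummy" baseline named in Table III can be reproduced in a few
lines (a sketch; scikit-learn's `DummyClassifier(strategy='stratified')`
implements the same idea):

```python
import numpy as np

def stratified_dummy_accuracy(train_labels, test_labels, seed=0):
    """Accuracy of a 'stratified dummy' classifier: predictions are drawn at
    random according to the training label distribution, ignoring the input
    entirely."""
    rng = np.random.default_rng(seed)
    classes, counts = np.unique(np.asarray(train_labels), return_counts=True)
    probs = counts / counts.sum()
    preds = rng.choice(classes, size=len(test_labels), p=probs)
    return float(np.mean(preds == np.asarray(test_labels)))
```

On an imbalanced data set this baseline is dominated by the majority classes,
which is why it is a useful floor against which the CNN accuracies in
Tables II and III can be judged.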

A. CNN Performance
The mean and variance of the 5-fold cross-validation
training and validation accuracies are reported in Table II. The
small variances of training and validation model accuracies
suggest the consistency of the proposed CNN. Discrepancies
between training and validation accuracies arise because validation
accuracies are calculated without the dropout layer that is applied during
training to make the CNN more robust [41].
The Top-1 and Top-3 accuracies are next reported for the
test data set in Table III. Clearly, the CNN that uses image
and spectroscopy data outperforms the baseline CNNs that
use either only SEM image data or only EDX spectral data.
For example, Fig. 11 shows a comparison of the classifica-
tion probabilities output for an example Fe-Ni/O defect. The
image-only CNN misclassifies the example defect as an SiO2
defect, while the spectra-only CNN misclassifies the exam-
ple defect as an SiOx Cy defect. Meanwhile, the combined
SEM/EDX CNN correctly classifies the defect with very high
confidence. A closer look at Fig. 11 reveals that the image-only
CNN determined the two highest-probability classes as SiO2
and Fe-Ni/O, while the spectra-only CNN determined the three
highest probability classes as SiO2 , SiOx Cy , and Fe-Ni/O.
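For reference, the Top-1 and Top-3 metrics used throughout this section can
be computed as follows (a generic sketch, not the authors' evaluation code):

```python
import numpy as np

def top_k_accuracy(probs, labels, k=1):
    """Fraction of samples whose true class index is among the k classes
    with the highest predicted probability (Top-1, Top-3, ...)."""
    top_k = np.argsort(probs, axis=1)[:, -k:]  # k highest-probability classes
    hits = [labels[i] in top_k[i] for i in range(len(labels))]
    return float(np.mean(hits))
```

Top-3 accuracy is necessarily at least as high as Top-1 accuracy, since the
Top-1 prediction is always contained in the Top-3 set.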
Fig. 11. Classification probabilities for Fe-Ni/O defect. A top-view,
centered SEM image and an EDX spectrum for the example defect. (top)
Comparison of classification probabilities of the example defect for CNNs
trained with only SEM image data, only EDX spectral data, and both SEM image
and EDX spectral data. (bottom) The CNN that only uses SEM image data
misclassifies the example defect as SiO2 while the CNN that uses only EDX
spectral data misclassifies the defect as SiOxCy. The CNN that uses both SEM
image and EDX spectroscopy correctly classifies the defect as Fe-Ni/O. The
combined data CNN clearly identifies a correlation between the image and
spectral features to create a high probability classification.

The EDX spectrum clearly shows strong oxygen and silicon peaks, a weak carbon
peak, and weak, semi-overlapping nickel and iron peaks. The combined SEM
image and EDX spectra CNN clearly identifies some correlation between the
contours seen in the SEM image and the shape of the EDX spectra to correctly
classify the defect with high confidence.

The combined SEM image and EDX spectra CNN yields
a greater than 99% Top-3 accuracy. The high performance
of the CNN indicates that the proposed defect classification
framework meets the previously specified 95% requirement
for pragmatic viability of defect chemical composition classification in the
semiconductor industry.

A closer examination of the combined SEM image data and EDX spectral data
CNN's performance reveals that certain classes in the test data set are much
more accurately classified than others. The confusion matrix for the CNN is
shown in Table IV. The table shows that certain defects are classified with
greater than 90% accuracies (e.g., Al/OxFy/Nz, SiO2, Y). Meanwhile, SiOxCy
and CuO/S defects are often misclassified as SiO2 and Fe-Ni/O defects,
respectively. For example, the CNN only correctly classified 29.3% of SiOxCy
defects while incorrectly classifying SiOxCy as SiO2 61.2% of the


TABLE IV
Confusion matrix for the CNN on the test data set. The CNN model chosen from
5-fold cross-validation was used to classify defects in the test data set.
The confusion matrix shows the percentage of classes that were correctly
classified (the diagonal entries) and the percentage of classes that were
incorrectly classified. Note that defects belonging to the CuO/S and SiOxCy
classes are often incorrectly classified as members of the Fe-Ni/O and SiO2
classes, respectively.

time. The CNN correctly classified 61.8% of CuO/S defects and incorrectly
classified 38.2%. Out of the incorrectly classified defects, 35.3% (of the
total 38.2%) were incorrectly classified as Fe-Ni/O.

There are many potential reasons why the CNN consistently misclassifies CuO/S
and SiOxCy defects as Fe-Ni/O and SiO2 defects, respectively. The CNN may
have had difficulty differentiating visual features that distinguish these
specific defect types. In addition, the close compositional resemblance
between SiO2 and SiOxCy defects makes these two defects
more prone to incorrect labeling due to human error (SiO2 is just SiOxCy with
x = 2 and y = 0, after all). However, the key trend is that the SiOxCy and
CuO/S defects are incorrectly classified as SiO2 and Fe-Ni/O classes and not
vice-versa.

Fig. 12. Defect number trends. The CNN was retrained with different numbers
of defects (each defect image was still rotated 6 times). The newly trained
CNN model was then used to classify the defects in the test data set to yield
new accuracy values. Clearly, increasing the number of defects used to train
the CNN improves classification accuracy.
classified as SiO2 and Fe-Ni/O classes and not vice-versa.
This likely has do to with the fact that the CNN is trained
using many more defects of the latter classes than the for-
mer ones (see Table I for reference). For example, when
the CNN is retrained with identical defect numbers per class
(150 randomly selected defects per class) and then re-run on
the test data set, the occurrence of the misclassification of
SiOx Cy and CuO/S defects as SiO2 and Fe-Ni/O defects sig-
nificantly decreases. The confusion matrix for this CNN is
shown in Table V. Here, the classification accuracy of CuO/S
jumps to 97.2% (from 61.8% in the previous case) and that of
SiOx Cy jumps from 29.3% to 68.4%. A closer look at defects’
EDX spectroscopy data reveals that Cu has an L-α peak at 0.93 keV which
overlaps with the L-α peak of Ni at 0.85 keV. The low number of CuO/S images
provided for training yields low image variation. This low variation,
combined with the fact that roughly 4 times as many Fe-Ni/O defects were
provided for training, could have certainly confused the CNN.

Fig. 13. Data augmentation trends. The CNN was retrained with different
numbers of defect rotations to understand the importance of data
augmentation. The newly trained CNN model was then used to classify the
defects in the test data set to yield new accuracy values. Clearly,
classification accuracy increases with the number of rotations of the
training data set.
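The balanced retraining discussed above (150 randomly selected defects per
class, with under-populated classes contributing everything they have) can be
sketched as a subsampling step. `balanced_subsample` is an illustrative
helper, not the authors' code:

```python
import numpy as np

def balanced_subsample(labels, per_class=150, seed=0):
    """Indices of a training subset with up to `per_class` randomly chosen
    examples of each class; classes with fewer available examples (e.g., the
    136 CuO/S and 111 Y defects in the paper) contribute all of their
    members. Selection happens before any data augmentation."""
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)
    chosen = []
    for c in np.unique(labels):
        members = np.flatnonzero(labels == c)
        take = min(per_class, len(members))
        chosen.extend(rng.choice(members, size=take, replace=False))
    return np.sort(np.array(chosen))
```

Capping the majority classes in this way trades some accuracy on the heavily
populated classes for a large accuracy gain on the rare ones, which is the
trend Tables IV and V exhibit.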
Further note that the accuracy of previously more highly populated classes
(e.g., Al, SiO2) does decrease in the 150 defect per class case. Tables IV–V
thus suggest that overall CNN model accuracies would increase if the models
were trained with a larger number of more uniformly distributed defects. As a
reference, Figs. 12–13 support the claim that larger data sets increase CNN
classification accuracy. These figures show that classification accuracy
improves when the size of the training data set is increased either naturally
(by adding more defects, see Fig. 12) or artificially (via data augmentation,
see Fig. 13) until a certain saturation point. Note that data for the CNN
that was trained with only EDX spectral information is not included in
Fig. 13 because there exist no images to rotate in that data set.

Overall, the designed CNN uses both SEM image and EDX spectral data to
classify the chemical composition of semiconductor particle defects with an
industrially pragmatic accuracy. Comparisons to versions of the CNN that use
either only SEM image or only EDX spectral data demonstrate the utility of
integrating both data types into one CNN. The performance of the CNN appears
to be primarily limited by the small, non-uniformly distributed training data
set. Although more advanced CNN architectures (e.g., ResNet [62],
ResNeXt [63], Dense CNN [64]) could


TABLE V
Confusion matrix for the CNN using 150 defects per class. The CNN was
retrained with 150 defects per class and then used to classify the defects in
the test data set. Classification accuracy decreases for heavily populated
classes (e.g., Al, SiO2). Classification accuracy significantly increases for
CuO/S and SiOxCy, which were previously often incorrectly classified as
Fe-Ni/O and SiO2, respectively (see Table IV for comparison). Note that two
classes have between 100 and 150 available defects for training (Y, CuO/S),
and the maximum number of available defects were used for CNN training in
these cases (i.e., 136 for CuO/S and 111 for Y). Further note that the 150
defects per each class were randomly selected before data augmentation.

possibly lead to enhanced performance on a larger, more uniform data set,
implementing such complex nets on this data set would likely lead to
overfitting. As such, we cannot claim that the designed CNN is by any means
"state-of-the-art", and instead primarily serves as a proof of concept for
the idea of using a CNN to classify real industrial defects based on a
combination of SEM image and EDX spectral data. The earlier provided 95%
Top-3 accuracy metric was a target metric for pragmatic industrial utility.

TABLE VI
Confusion matrix for the outlier detection strategy. The outlier detection
strategy was used to determine the presence of defects in the outlier
detection test data set. The confusion matrix shows the percentage of classes
that were correctly classified (the diagonal entries) and the percentage of
classes that were incorrectly classified. Note that only SEM images with very
small defects were incorrectly classified.

B. CNN Architecture Justification

It is important to briefly discuss the dependence of the performance of the
CNN on various architecture choices. For example, "AlexNet" [38] could not
classify the chemical composition of defects based on SEM images alone. In
fact, the accuracy of such a net was nearly equal to 1 divided by the number
of classes. As a result, a more complex VGGNet-based architecture was used
for the convolutional and pooling layers in both CNNs [42]. In addition, the
identity of the activation function in the fully connected layer
significantly impacted performance once spectral data was merged with the
fully connected layer. For example, the use of "ReLU" activation functions
instead of "tanh" activation functions in the fully connected layers led to
about a 15% overall decrease in validation accuracy for the CNN. However, in
the image-only version of the CNN, models with both activation functions
performed comparably. It is possible that merging spectral data with the
fully connected layer invokes the "dying ReLU" problem [65], [66], which
means that during training, a weight update triggered by a large gradient
flowing through a ReLU neuron could make the neuron inactive for other data
points in the future. This can cause many neurons to exist in a "dying
state," where weights are not updated over future iterations. To avoid this
problem, some researchers (e.g., [65]) place sigmoid functions at both ends
of the network and ReLU functions in the other layers, while other
researchers (e.g., [67]) modify the ReLU function to a "leaky" ReLU function.

In addition, the use of three fully connected layers in the CNN (one after
the convolution and pooling layers, and one after each type of spectral data
was merged with the net) outperformed the use of two fully connected layers
by 10%, and the use of only 1 fully connected layer by close to 40%. The
image-only CNN, on the other hand, yielded roughly equal classification
accuracies with 1 and 2 fully connected layers. Interestingly, the CNN was
relatively insensitive to the dropout percentage, as long as this percentage
stayed below 50%. Finally, efforts to merge both sections of spectral data
into one layer, as opposed to treating each spectral data layer separately,
led to marginal performance decreases (≤ 5%) that may not be statistically
significant.

C. Outlier Detection Efficacy

The outlier detection strategy was applied to the outlier detection test data
set. The confusion matrix shown in Table VI indicates that the only
misclassifications were cases in which images containing defects were
predicted to not contain defects. Further image analysis shows that these
misclassified images happened to contain exceptionally small defects. The
misclassified defect-containing data was still used in the 5-fold
cross-validation for the CNN, as it is important to test the CNN's ability to
distinguish smaller defects.

This outlier detection strategy is by no means "state-of-the-art" and merely
serves as a proof of concept


TABLE VII
Transfer learning performance summary. Top-1 and Top-3 classification
accuracies are given for the test data set, as well as training times that
are normalized in comparison to the training time for the original CNN.
Here, the training time for the original CNN is set to 1. CNN TLv1 involves
retraining all of the original CNN after initializing the weights and biases
in the convolutional layer based on a model with 4 defect classes. CNN TLv2
involves the same weight and bias initialization, yet only the fully
connected layer is retrained. Both CNN TLv1 and TLv2 are faster, yet slightly
less accurate, than the original. CNN TLv2 slightly outperforms CNN TLv1 in
training speed and accuracy.

for the efficacy of easily implementable, computationally cheap detection
strategies for this particular problem. It is possible that more advanced
noise filtering methods or types of image pre-processing could further
improve performance. Meanwhile, some of the inherently more advanced outlier
detection methods mentioned earlier (e.g., principal component analysis) may
yield better performance as well. The main purpose of including outlier
detection in this paper is to demonstrate that although the defect-free and
defect-containing images should be separated, the problem is not particularly
difficult to solve.

As process changes occur, it is possible that particle defects become
progressively smaller and render the previously calculated defect threshold
area mean and variance invalid. It is further possible that defect size
variance increases to the point at which the recommended three standard
deviation back-off is no longer appropriate. As a result, statistical process
control methods [68], [69] can be implemented to monitor outlier detection
efficacy. Here, time-frames at which threshold defect areas should be
recalculated will be identified.

D. Transfer Learning Analysis

Testing accuracy and training times of the original CNN and the two CNNs
trained using transfer learning strategies (i.e., CNN TLv1, CNN TLv2) are
reported in Table VII. The weights and biases for CNN TLv1 were initialized
by training the original CNN with the 4 classes with the fewest numbers of
defects (i.e., CuO/S, Organic, SiOxCy, and Y). CNN TLv1 was then trained for
all 8 classes with 2 epochs. CNN TLv2 was initialized with the same weights
and biases, yet only its fully connected layers were retrained for all 8
classes. CNN TLv1 and TLv2 perform comparably to the original CNN. However,
CNN TLv2 slightly outperforms CNN TLv1 in terms of both training speed and
accuracy. The results of CNN TLv2 indicate that once convolutional layers
have been trained with a set of SEM images of particle defects, these weights
and biases extract features well enough to be applied to separate, much
larger groups of defects. Further note that the accuracy of CNN TLv1
eventually converges to that of the original CNN, but does so after between 6
and 8 epochs of training, which does not significantly reduce training time.
Therefore, when training future CNNs for new classes of defects, it will be
more efficient and accurate to only re-train the fully connected layers
following the strategy outlined for CNN TLv2.

VI. CONCLUSION AND FUTURE WORK

We design a deep convolutional neural network for the classification of the
chemical composition of particle defects on semiconductor wafers. The CNN
takes one centered, top-view SEM image of a given particle defect as an
input. The CNN then merges its fully connected layer with that defect's
corresponding EDX spectral data. The CNN represents the first example of a
CNN that uses multiple data types (i.e., image and spectral data) for
semiconductor defect classification. This CNN was able to classify
semiconductor defects with a high, industrially pragmatic accuracy that
seemed to be primarily limited by the imbalance and low numbers of defects
belonging to certain, similar classes. The strong accuracy of the proposed
CNN suggests that merging spectroscopy data, or potentially other object
metadata, directly with the fully connected layer of a CNN can greatly
improve that CNN's classification capabilities. It is expected that larger
training data sets with a more uniform distribution among defect classes will
improve classification accuracy. Data augmentation methods separate from
rotation can also be explored (e.g., translation, re-scaling,
light-scattering) to further increase classification accuracy. From a
practical implementation standpoint, transfer learning studies showed that
when new defects are introduced, only the fully connected layers of the CNN
need to be retrained to account for such defects.

ACKNOWLEDGMENT

The authors would like to thank several members from Lam Research Corporation
who have provided data, insight, and directions for this work. They
acknowledge the contributions of Lam's metrology team (N. Tran, B. Skyberg,
and H. Li) in collection of defect data from various review SEM/EDX tools and
also the engineers from different product groups (C. La, S. Zhang, and
A. Radocea) who helped organize labels for the defects to perform the
supervised learning study in this work. Neural network discussions with
H. Li and S. Riggs were useful. K. Sawlani would also like to acknowledge the
guidance and mentoring provided by R. Gottscho, K. Ashtiani, M. Danek,
K. Wells, K. Hansen, E. Gurer, D. Pirkle, Y. Feng, R. Roberts, and several
others from Lam Research.

REFERENCES

[1] H. C. Pfeiffer, "PREVAIL: IBM's e-beam technology for next generation
lithography," in Proc. Emerg. Lithograph. Technol. IV, vol. 3997, 2000,
pp. 206–214.

Authorized licensed use limited to: University of Canberra. Downloaded on April 27,2020 at 12:29:04 UTC from IEEE Xplore. Restrictions apply.

[2] L. Harriott, "Next generation lithography," Mater. Today, vol. 2, no. 2, pp. 9–12, 1999.
[3] Y. Gomei, "Cost analysis on the next-generation lithography technology," in Proc. Emerg. Lithograph. Technol. III, vol. 3676, 1999, pp. 1–9.
[4] D. A. Drabold and S. K. Estreicher, Theory of Defects in Semiconductors. Heidelberg, Germany: Springer, 2007.
[5] F. A. Aziz, I. H. Ahmad, N. Zulkifli, and R. M. Yusuff, "Particle reduction at metal deposition process in wafer fabrication," in Manufacturing System. Rijeka, Croatia: InTech, 2012.
[6] S. H. Park, S. Kim, and J.-G. Baek, "Kernel-density-based particle defect management for semiconductor manufacturing facilities," Appl. Sci., vol. 8, no. 2, p. 224, 2018.
[7] T. Nakazawa and D. V. Kulkarni, "Wafer map defect pattern classification and image retrieval using convolutional neural network," IEEE Trans. Semicond. Manuf., vol. 31, no. 2, pp. 309–314, May 2018.
[8] J. Wang, Z. Yang, J. Zhang, Q. Zhang, and W.-T. K. Chien, "AdaBalGAN: An improved generative adversarial network with imbalanced learning for wafer defective pattern recognition," IEEE Trans. Semicond. Manuf., vol. 32, no. 3, pp. 310–319, Aug. 2019.
[9] F. Adly, P. D. Yoo, S. Muhaidat, Y. Al-Hammadi, U. Lee, and M. Ismail, "Randomized general regression network for identification of defect patterns in semiconductor wafer maps," IEEE Trans. Semicond. Manuf., vol. 28, no. 2, pp. 145–152, May 2015.
[10] R. Baly and H. Hajj, "Wafer classification using support vector machines," IEEE Trans. Semicond. Manuf., vol. 25, no. 3, pp. 373–383, Aug. 2012.
[11] G. Tello, O. Y. Al-Jarrah, P. D. Yoo, Y. Al-Hammadi, S. Muhaidat, and U. Lee, "Deep-structured machine learning model for the recognition of mixed-defect patterns in semiconductor fabrication processes," IEEE Trans. Semicond. Manuf., vol. 31, no. 2, pp. 315–322, May 2018.
[12] K. Nakata, R. Orihara, Y. Mizuoka, and K. Takagi, "A comprehensive big-data-based monitoring system for yield enhancement in semiconductor manufacturing," IEEE Trans. Semicond. Manuf., vol. 30, no. 4, pp. 339–344, Nov. 2017.
[13] K. B. Lee, S. Cheon, and C. O. Kim, "A convolutional neural network for fault classification and diagnosis in semiconductor manufacturing processes," IEEE Trans. Semicond. Manuf., vol. 30, no. 2, pp. 135–142, May 2017.
[14] T. Nakazawa and D. V. Kulkarni, "Anomaly detection and segmentation for wafer defect patterns using deep convolutional encoder–decoder neural network architectures in semiconductor manufacturing," IEEE Trans. Semicond. Manuf., vol. 32, no. 2, pp. 250–256, May 2019.
[15] C.-F. Chien, S.-C. Hsu, and Y.-J. Chen, "A system for online detection and classification of wafer bin map defect patterns for manufacturing intelligence," Int. J. Prod. Res., vol. 51, no. 8, pp. 2324–2338, 2013.
[16] F.-L. Chen and S.-F. Liu, "A neural-network approach to recognize defect spatial pattern in semiconductor fabrication," IEEE Trans. Semicond. Manuf., vol. 13, no. 3, pp. 366–373, Aug. 2000.
[17] T. Yuan, S. J. Bae, and J. I. Park, "Bayesian spatial defect pattern recognition in semiconductor fabrication using support vector clustering," Int. J. Adv. Manuf. Technol., vol. 51, nos. 5–8, pp. 671–683, 2010.
[18] T. Yuan, W. Kuo, and S. J. Bae, "Detection of spatial defect patterns generated in semiconductor fabrication processes," IEEE Trans. Semicond. Manuf., vol. 24, no. 3, pp. 392–403, Aug. 2011.
[19] K. W. Tobin, Jr., S. S. Gleason, T. P. Karnowski, and S. L. Cohen, "Feature analysis and classification of manufacturing signatures based on semiconductor wafer maps," in Proc. Mach. Vis. Appl. Ind. Inspection V, vol. 3029, 1997, pp. 14–25.
[20] T. P. Karnowski, K. W. Tobin, Jr., S. S. Gleason, and F. Lakhani, "Application of spatial signature analysis to electrical test data: Validation study," in Proc. Metrol. Inspection Process Control Microlithography XIII, vol. 3677, 1999, pp. 530–541.
[21] M.-J. Wu, J.-S. R. Jang, and J.-L. Chen, "Wafer map failure pattern recognition and similarity ranking for large-scale data sets," IEEE Trans. Semicond. Manuf., vol. 28, no. 1, pp. 1–12, Feb. 2015.
[22] S. Cheon, H. Lee, C. O. Kim, and S. H. Lee, "Convolutional neural network for wafer surface defect classification and the detection of unknown defect class," IEEE Trans. Semicond. Manuf., vol. 32, no. 2, pp. 163–170, May 2019.
[23] A. Y. Ng, M. I. Jordan, and Y. Weiss, "On spectral clustering: Analysis and an algorithm," in Proc. Adv. Neural Inf. Process. Syst., 2002, pp. 849–856.
[24] C.-H. Wang, "Recognition of semiconductor defect patterns using spectral clustering," in Proc. IEEE Int. Conf. Ind. Eng. Eng. Manag., 2007, pp. 587–591.
[25] M. Egmont-Petersen, D. de Ridder, and H. Handels, "Image processing with neural networks—A review," Pattern Recognit., vol. 35, no. 10, pp. 2279–2301, 2002.
[26] M. R. G. Meireles, P. E. M. Almeida, and M. G. Simões, "A comprehensive review for industrial applicability of artificial neural networks," IEEE Trans. Ind. Electron., vol. 50, no. 3, pp. 585–601, Jun. 2003.
[27] J. Paola and R. Schowengerdt, "A review and analysis of backpropagation neural networks for classification of remotely-sensed multi-spectral imagery," Int. J. Remote Sens., vol. 16, no. 16, pp. 3033–3058, 1995.
[28] W. E. Reddick, J. O. Glass, E. N. Cook, T. D. Elkin, and R. J. Deaton, "Automated segmentation and classification of multispectral magnetic resonance images of brain using artificial neural networks," IEEE Trans. Med. Imag., vol. 16, no. 6, pp. 911–918, Dec. 1997.
[29] J. Gu et al., "Recent advances in convolutional neural networks," Pattern Recognit., vol. 77, pp. 354–377, May 2018.
[30] S. Srinivas et al., "A taxonomy of deep convolutional neural nets for computer vision," Front. Robot. AI, vol. 2, p. 36, Jan. 2016.
[31] W. Rawat and Z. Wang, "Deep convolutional neural networks for image classification: A comprehensive review," Neural Comput., vol. 29, no. 9, pp. 2352–2449, 2017.
[32] Y. LeCun, Y. Bengio, and G. Hinton, "Deep learning," Nature, vol. 521, no. 7553, p. 436, 2015.
[33] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, "Gradient-based learning applied to document recognition," Proc. IEEE, vol. 86, no. 11, pp. 2278–2324, 1998.
[34] D. Yu, H. Wang, P. Chen, and Z. Wei, "Mixed pooling for convolutional neural networks," in Proc. Int. Conf. Rough Sets Knowl. Technol., 2014, pp. 364–375.
[35] I. Goodfellow, Y. Bengio, and A. Courville, Deep Learning. Cambridge, MA, USA: MIT Press, 2016.
[36] Y. LeCun et al., "Handwritten digit recognition with a back-propagation network," in Proc. Adv. Neural Inf. Process. Syst., 1990, pp. 396–404.
[37] Y. LeCun et al., "Backpropagation applied to handwritten zip code recognition," Neural Comput., vol. 1, no. 4, pp. 541–551, 1989.
[38] A. Krizhevsky, I. Sutskever, and G. E. Hinton, "ImageNet classification with deep convolutional neural networks," in Proc. Adv. Neural Inf. Process. Syst., 2012, pp. 1097–1105.
[39] B. Xu, N. Wang, T. Chen, and M. Li, "Empirical evaluation of rectified activations in convolutional network," CoRR, vol. abs/1505.00853, 2015. [Online]. Available: http://arxiv.org/abs/1505.00853
[40] M. Ranzato, F. J. Huang, Y.-L. Boureau, and Y. LeCun, "Unsupervised learning of invariant feature hierarchies with applications to object recognition," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), 2007, pp. 1–8.
[41] G. E. Hinton, N. Srivastava, A. Krizhevsky, I. Sutskever, and R. R. Salakhutdinov, "Improving neural networks by preventing co-adaptation of feature detectors," CoRR, vol. abs/1207.0580, 2012. [Online]. Available: http://arxiv.org/abs/1207.0580
[42] K. Simonyan and A. Zisserman, "Very deep convolutional networks for large-scale image recognition," arXiv preprint arXiv:1409.1556, 2014. [Online]. Available: https://arxiv.org/abs/1409.1556v6
[43] M. D. Zeiler and R. Fergus, "Visualizing and understanding convolutional networks," in Proc. Eur. Conf. Comput. Vis., 2014, pp. 818–833.
[44] S. Borra and A. Di Ciaccio, "Methods to compare nonparametric classifiers and to select the predictors," in New Developments in Classification and Data Analysis. Heidelberg, Germany: Springer, 2005, pp. 11–19.
[45] S. C. Wong, A. Gatt, V. Stamatescu, and M. D. McDonnell, "Understanding data augmentation for classification: When to warp?" in Proc. Int. Conf. Digit. Image Comput. Techn. Appl. (DICTA), 2016, pp. 1–6.
[46] J. Ding, B. Chen, H. Liu, and M. Huang, "Convolutional neural network with data augmentation for SAR target recognition," IEEE Geosci. Remote Sens. Lett., vol. 13, no. 3, pp. 364–368, Mar. 2016.
[47] P. Y. Simard, D. Steinkraus, and J. C. Platt, "Best practices for convolutional neural networks applied to visual document analysis," in Proc. IEEE 7th Int. Conf. Document Anal. Recognit., 2003, p. 958.
[48] D. C. Ciresan, U. Meier, J. Masci, L. M. Gambardella, and J. Schmidhuber, "Flexible, high performance convolutional neural networks for image classification," in Proc. Int. Joint Conf. Artif. Intell. (IJCAI), vol. 22. Barcelona, Spain, 2011, p. 1237.
[49] D. C. Ciresan, U. Meier, and J. Schmidhuber, "Multi-column deep neural networks for image classification," CoRR, vol. abs/1202.2745, 2012. [Online]. Available: http://arxiv.org/abs/1202.2745


[50] J. W. Johnson, "Adapting mask-RCNN for automatic nucleus segmentation," CoRR, vol. abs/1805.00500, 2018. [Online]. Available: http://arxiv.org/abs/1805.00500
[51] S. Ren, K. He, R. B. Girshick, and J. Sun, "Faster R-CNN: Towards real-time object detection with region proposal networks," in Proc. Adv. Neural Inf. Process. Syst., 2015, pp. 91–99.
[52] J. Redmon, S. K. Divvala, R. B. Girshick, and A. Farhadi, "You only look once: Unified, real-time object detection," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2016, pp. 779–788.
[53] S. S. Raj, K. S. Kannan, and K. Manoj, "Principal component analysis for outlier detection," Res. Rev. J. Stat., vol. 7, no. 1, pp. 62–68, 2018.
[54] I. T. Jolliffe and J. Cadima, "Principal component analysis: A review and recent developments," Philos. Trans. Roy. Soc. A Math. Phys. Eng. Sci., vol. 374, no. 2065, 2016, Art. no. 20150202.
[55] R. Muthukrishnan and M. Radha, "Edge detection techniques for image segmentation," Int. J. Comput. Sci. Inf. Technol., vol. 3, no. 6, p. 259, 2011.
[56] J. Canny, "A computational approach to edge detection," IEEE Trans. Pattern Anal. Mach. Intell., vol. PAMI-8, no. 6, pp. 679–698, Nov. 1986.
[57] S. Mika, G. Ratsch, J. Weston, B. Scholkopf, and K.-R. Mullers, "Fisher discriminant analysis with kernels," in Proc. IEEE Neural Netw. Signal Process. IX IEEE Signal Process. Soc. Workshop, 1999, pp. 41–48.
[58] N. R. Pal and S. K. Pal, "A review on image segmentation techniques," Pattern Recognit., vol. 26, no. 9, pp. 1277–1294, 1993.
[59] P. Dhankhar and N. Sahu, "A review and research of edge detection techniques for image segmentation," Int. J. Comput. Sci. Mobile Comput., vol. 2, no. 7, pp. 86–92, 2013.
[60] E. S. Olivas, J. D. M. Guerrero, M. M. Sober, J. R. M. Benedito, and A. J. S. Lopez, Handbook of Research on Machine Learning Applications and Trends: Algorithms, Methods and Techniques—2 Volumes. Hershey, PA, USA: IGI Glob., 2009.
[61] J. Yosinski, J. Clune, Y. Bengio, and H. Lipson, "How transferable are features in deep neural networks?" in Proc. Adv. Neural Inf. Process. Syst., 2014, pp. 3320–3328.
[62] K. He, X. Zhang, S. Ren, and J. Sun, "Identity mappings in deep residual networks," in Proc. Eur. Conf. Comput. Vis., 2016, pp. 630–645.
[63] S. Xie, R. B. Girshick, P. Dollár, Z. Tu, and K. He, "Aggregated residual transformations for deep neural networks," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2017, pp. 1492–1500.
[64] G. Huang, Z. Liu, L. Van Der Maaten, and K. Q. Weinberger, "Densely connected convolutional networks," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2017, pp. 4700–4708.
[65] J. Chen, S. Sathe, C. Aggarwal, and D. Turaga, "Outlier detection with autoencoder ensembles," in Proc. SIAM Int. Conf. Data Min., 2017, pp. 90–98.
[66] M. M. Lau and K. H. Lim, "Investigation of activation functions in deep belief network," in Proc. IEEE 2nd Int. Conf. Control Robot. Eng. (ICCRE), 2017, pp. 201–206.
[67] A. L. Maas, A. Y. Hannun, and A. Y. Ng, "Rectifier nonlinearities improve neural network acoustic models," in Proc. ICML, vol. 30, 2013, p. 3.
[68] J. F. MacGregor and T. Kourti, "Statistical process control of multivariate processes," Control Eng. Pract., vol. 3, no. 3, pp. 403–414, 1995.
[69] J. S. Oakland, Statistical Process Control. Oxford, U.K.: Routledge, 2007.
