Engineering Applications of Artificial Intelligence 26 (2013) 1029–1043

Contents lists available at SciVerse ScienceDirect

Engineering Applications of Artificial Intelligence


journal homepage: www.elsevier.com/locate/engappai

Defect cluster recognition system for fabricated semiconductor wafers


Melanie Po-Leen Ooi a,*, Hong Kuan Sok a, Ye Chow Kuang a, Serge Demidenko b, Chris Chan c

a Monash University, Sunway Campus, Jalan Lagoon Selatan, Selangor, Malaysia
b RMIT International University, 702 Nguyen Van Linh, Ho Chi Minh City, Vietnam
c Freescale Semiconductor, Sungeiway Free Industrial Zone, Selangor, Malaysia

* Corresponding author. Tel.: +603 55146238; fax: +603 55146001.
E-mail addresses: [email protected] (M.-L. Ooi), [email protected] (H.K. Sok), [email protected] (Y.C. Kuang), [email protected] (S. Demidenko), [email protected] (C. Chan).

Article info

Article history:
Received 30 January 2012
Received in revised form 12 March 2012
Accepted 27 March 2012
Available online 27 April 2012

Keywords:
Semiconductor wafer fabrication
Defect cluster classification
Recognition
Feature extraction

Abstract

The International Technology Roadmap for Semiconductors (ITRS) identifies production test data as an essential element in improving design and technology in the manufacturing process feedback loop. One of the observations made from the high-volume production test data is that dies that fail due to a systematic failure have a tendency to form certain unique patterns that manifest as defect clusters at the wafer level. Identifying and categorising such clusters is a crucial step towards manufacturing yield improvement and implementation of real-time statistical process control. Addressing the semiconductor industry's needs, this research proposes an automatic defect cluster recognition system for semiconductor wafers that achieves up to 95% accuracy (depending on the product type).

© 2012 Elsevier Ltd. All rights reserved.
doi: 10.1016/j.engappai.2012.03.016

1. Introduction

Advances in semiconductor technology and design have been the driving forces behind the successful progress of the electronics industrial sector. Over the past few decades, the semiconductor industry has evolved into one of the critical foundations and contributors of the world economy, recording a staggering 100% growth from the year 1996 to 2011 (Semiconductor Industry Association (SIA), 2011). Today, the semiconductor industry is generating revenues of over USD $25.5 billion and is expected to continue ascending by 10.5% annually between 2011 and 2013. Such rapid growth is going hand-in-hand with a rapid increase of design and manufacturing complexity, pushing the physical limits of semiconductor production techniques (International Technology Roadmap for Semiconductors, 2009). Reaching the nanoscale half-pitch dimensions alters the type and distribution of defects. This causes defects that were previously benign for microscale technology to manifest as killer defects.

The International Technology Roadmap for Semiconductors (ITRS), which is the world's authority on the semiconductor industry, has identified the detection of systematic failures as one of the top challenges in the next generation of semiconductor products (International Technology Roadmap for Semiconductors, 2009). Production data, in particular, was singled out as one of the crucial elements in aiding the process feedback loop.

With rapid developments in computer technology and with the availability of low-cost storage devices, data logging has become a commercially feasible task and is currently a standard procedure in almost all industries. The data logging ensures that virtually all industrial process-related data are always ready for extraction and analysis. Yet there is still much knowledge buried in the huge data collection, which is waiting to be discovered. Furthermore, the knowledge banks are continually enriched with new data being included and compiled each day. Thus, it is envisaged that by applying appropriate new analysis tools, further improvements in semiconductor manufacturing could be achieved.

The production of modern microelectronic devices has an important feature that makes it significantly different from other manufacturing processes. The semiconductor wafer fabrication results in a large number of Integrated Circuits (ICs) produced simultaneously in a multitude of sequential fabrication steps on a single piece of a silicon substrate (Fig. 1). Every fabricated wafer contains 100s to 1000s of devices, which are then tested, singulated (i.e., individually removed from the matrix of the products on the processed wafer) and packaged into individual protection casings, thus making so-called IC chips. This unique characteristic, whereby each individual device is produced together with other devices on a single piece of wafer, allows the production dataset to be physically and meaningfully interpreted as a two-dimensional "image". Devices that fail due to an identifiable root-cause during the wafer fabrication process have a tendency to form unique and systematic patterns on the fabricated wafers (International Technology Roadmap for Semiconductors, 2009; Zhao and Cui, 2008; Ooi et al., 2011). These are known as defect clusters.

Fig. 2 shows two examples of raw production test data, whereby Fig. 2(a) shows random device failure (no identifiable defect clusters) and Fig. 2(b) shows a defect cluster (a group of failing devices clustered together) on the bottom-left edge of the wafer. Defect clusters are normally located around a specific location and are process-related. There are six local defect patterns identified and examined in this research, which are Bull's-Eye, Blob, Edge, Ring, Line and Hat (Fig. 3). Their formal description is provided in Ooi et al. (2011). Thus, the problem of detecting systematic failures can be simplified into detecting and identifying these defect clusters from the production test data.

2. Intelligent defect cluster recognition system for fabricated semiconductor wafers

Developing a robust and accurate defect cluster recognition system for semiconductor wafers based on production test data is a non-trivial task. Unfortunately, standard pattern recognition techniques do not work well for the semiconductor wafer based "images" due to several object-specific issues. These issues are discussed in Section 2.1 below. Section 2.2 outlines the aim of this research and the main contributions of the paper.

2.1. Challenges in developing an intelligent system for defect cluster recognition

Fig. 1. Devices on a semiconductor wafer are fabricated simultaneously. Each die is singulated and packaged as an individual Integrated Circuit (IC) product during the IC Assembly process.

The main factors that must be overcome in order to develop a robust-yet-accurate defect cluster recognition system for semiconductor wafer datasets are listed below (they will be further elaborated in individual subsections):

- Variations in defect cluster size, shape, location and orientation on the wafer for the same class of defects
- Non-symmetrical geometry
- Low signal-to-noise ratio
- Insufficient quality historical data for training
- A-priori unknown best feature sets and/or best classifiers

2.1.1. Variations of defect cluster size, shape, location and orientation for the same class of defects

It is difficult to select versatile feature vectors with strong distinguishing capabilities that describe the different classes of defects accurately. Fig. 4 shows two examples of defects, whereby Fig. 4(a)-(d) are examples of Blob defects, while Fig. 4(e)-(h) are Line defects. The feature vector must be sufficiently flexible to account for the variations in defect sizes, shapes and locations on the wafer in the Blob defect examples, as well as the orientation in the Line defect examples. Thus, direct application of principal component analysis or template matching (e.g., techniques discussed in Jain et al. (2001)) without some form of transformation is clearly insufficient for this application.

Fig. 2. Raw manufacturing wafer maps whereby red, green and blue squares represent failing, passing and invalid device locations, respectively, for: (a) Random failure; (b) Defect cluster at the bottom-left edge of the wafer.

Fig. 3. Some examples of defect cluster patterns found on semiconductor wafers: (a) Bull's-Eye, (b) Blob, (c) Line, (d) Edge, (e) Hat, (f) Ring.

Fig. 4. Examples of different defect sizes, shapes, locations and orientations for Blob (a)-(d) and Line (e)-(h) defect types.

2.1.2. Non-symmetrical geometry

Although the semiconductor wafer dataset can be viewed in a two-dimensional space, it is still important to remember that each data point is an actual device being produced. Thus traditional normalisation that graphically rotates the wafer map does not give meaningful results. Additionally, the location of each die on the wafer does not form a square dataset, but rather a circular pattern (Fig. 5(a)). The wafer perimeter (valid die locations at the wafer edges) varies for different devices and different processes; the data points usually do not form a perfectly symmetrical pattern (see Fig. 5(b)). Thus, many classical algorithms would fail if applied to this subject area.

Fig. 5. Two examples of the semiconductor wafer dataset for actual devices in production whereby the blue and green squares represent the invalid and valid die locations, respectively. (For interpretation of the references to colour in this figure legend, the reader is referred to the web version of this article.)

2.1.3. Low signal-to-noise ratio

In most of the previous examples, only "clean" defect patterns were shown. In actuality, the real semiconductor production test dataset is more similar to that shown in Fig. 6. Fabricated semiconductor wafers have a relatively small number of points in the dataset (ranging from 100 to 10,000 dies/wafer). Additionally, a high proportion of these small datasets (10%-50%) are random failures, which are interpreted as "noise". The "noise" level differs depending on the manufacturing yield, which in turn depends on the die and wafer sizes, the employed fabrication technology and other factors. Thus the failure distribution varies between different types of devices, different implementation technologies, etc.

It is important to note that clustering algorithms in pattern recognition and image processing have been developed for large numbers of pixels (or points). With a small data count and high noise level, it is difficult to fine-tune most classical algorithms such that the Type I/II errors (false alarm and false rejection rates) would both be within acceptable limits. Furthermore, selecting an optimum threshold to differentiate between cluster and non-cluster regions (normally required in most clustering algorithms) is a non-trivial task when each device type has a different number of dies per wafer, a different configuration in two-dimensional space and a completely different overall failure rate. A preliminary investigative study performed in Ooi et al. (2011) shows that many clustering algorithms such as k-means, k-medoid, mean-shift, Balanced Iterative Reducing and Clustering using Hierarchies (BIRCH) and Clustering Using Representatives (CURE) do not provide meaningful results when applied to the research problem under consideration.
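To make the yield-dependence of such thresholds concrete, the following minimal Python sketch (an illustration added for this rendering, not part of the original study; the window size, threshold and yields are arbitrary assumptions) estimates how many failing dies a small neighbourhood is expected to contain purely by chance at different yields.

from math import comb

def prob_at_least_k_failures(n_dies, yield_rate, k):
    """P(at least k of n_dies fail) when dies fail independently at rate 1 - yield."""
    p_fail = 1.0 - yield_rate
    return sum(comb(n_dies, j) * p_fail**j * (1 - p_fail)**(n_dies - j)
               for j in range(k, n_dies + 1))

# A fixed rule such as "flag any 20-die window containing >= 6 failures" behaves very
# differently at 40% and 90% yield, which is why a single threshold cannot work.
for y in (0.40, 0.70, 0.90):
    print(f"yield {y:.0%}: P(>=6 failures in 20 dies by chance) = "
          f"{prob_at_least_k_failures(20, y, 6):.3f}")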

Fig. 6. Examples of the Ring defect type for manufacturing yields of (a) 40%, (b) 70% and (c) 90%, whereby red, green and blue squares represent failing, passing and invalid device locations, respectively. (For interpretation of the references to colour in this figure legend, the reader is referred to the web version of this article.)

Fig. 6 shows an example of the Ring defect type for the same device type with three different manufacturing yields (40%, 70% and 90%). It can be seen that the noise level at 40% yield is very high, yet any applied intelligent recognition system must be capable of recognising the obvious defect at the wafer edges. Conversely, the noise level is low at 90% yield, but this brings a new problem: the data points that form the defect cluster are quite sparse. Yet, the defect recognition system must still be capable of recognising this defect pattern once it occurs.

2.1.4. Insufficient quality historical data

The accuracy of any classifier depends greatly on the completeness of the training set. For the application under discussion, this translates to getting a sufficient amount of high quality historical production data for training a classifier to account for the issues highlighted above, such as variations in defect cluster size, shape, orientation and location with different manufacturing yield and noise levels. Unfortunately, even though semiconductor devices are manufactured in high volumes, it was observed that there is normally a lack of a sufficient number or appropriate selection of good quality training samples available in the historical production test data-logs to obtain a stable statistical inference for a chosen classifier (Kameyama and Kosugi, 1999).

Semiconductor manufacturers produce thousands of different product types annually to cope with the increasing market demands. At the same time, new innovations in design and fabrication are routinely achieved and implemented. This results in process changes being made on a monthly, if not weekly, basis (Hsieh et al., 1999). There are also devices that are being produced in small volumes for niche applications. Thus, production life-cycles could be rather short for some semiconductor products, leading to new, more advanced device types being put into production. Furthermore, some defect clusters may not have yet been encountered during production (thus there will be no historical log on such cases), though this does not mean that they would not occur in future! The combined effect of these factors complicates the developmental efforts for any intelligent system (Hsieh et al., 1999).

In general, it can be observed that approaches attempting to recognise the cluster type directly from the raw semiconductor production data encounter huge problems during implementation due to the above mentioned factors (Ooi et al., 2011). To overcome this problem, a defect cluster simulator can be used to simulate the training set for the classifier. However, there are two important decisions to be considered when using such an approach:

1. Should the training data contain "noise" to emulate the actual raw data? And if so, how much noise?
2. Should the training data contain only pure clusters, with the raw data filtered to extract a "noiseless" defect cluster? And if so, what filtering technique should be applied to the raw data?

2.1.5. Best feature set or most suitable classifier is not known a-priori

A good feature set is an optimised set of attributes that best distinguishes between different classes (Kononenko and Kukar, 2007). A good classifier is one that uses the feature set to obtain the most accurate grouping of data. A good classifier is generally required to test the efficacy of the features. Yet a good feature set is required to generate a good classifier (Kononenko and Kukar, 2007). This is a "chicken-and-egg" problem for most classification algorithms, since neither the best feature set nor the best classifier is known a-priori.

When the salient features are unknown, an extremely large set of features is used to train the classifier in the hope of obtaining accurate classification results. Yet, the reverse is often desirable, whereby efficient and optimal machine learning and recognition are achieved through careful selection of an optimal subset of salient features. This is because the memory requirements and computational time are expected to be kept low without compromising the accuracy of the classifier (Kononenko and Kukar, 2007; Jain et al., 2001; Xu and Wunsch II, 2009). Thus, the chosen features should ideally be robust with respect to noise, easy to obtain and simple to interpret (Xu and Wunsch II, 2009).

There is still no known and accepted feature set or classifier that best suits this application. This leads to an open question in the developmental work on how to best begin developing the defect recognition system: feature first or classifier first?

2.2. Research aims and contributions

The ultimate aim of this research is to develop an optimal and accurate defect cluster recognition system for application on the semiconductor wafer production data. In order to achieve this aim, while basing on the earlier results in defect clustering (Ooi et al., 2011), an experimental study must be performed to address the needs for defect classification. The goal of such a study is to determine the following:

1. The optimal feature set for the chosen classifier (as discussed above in Section 2.1.1)
2. The best feature extraction method and type of classifier (as outlined in Sections 2.1.1, 2.1.2 and 2.1.5)
3. The classifier training method, while taking into account the lack of historical data and the high level of noise in the relatively small data count (as presented in Sections 2.1.3 and 2.1.4)

Fig. 7. Effects of DPFT on a rotated Line defect cluster: (a) wafer in X–Y coordinates; (b) wafer in frequency domain after applying DPFT; and (c) wafer in frequency domain
after peak-centring.

Within the reported research, the outlined experimental study was conducted on large-scale industrial data of several million units of ICs from five types of semiconductor products that are in the mainstream high-volume production of one of the world's leading semiconductor device manufacturers. The experimental results are presented and discussed in Section 4. In order to follow this study, Section 3 provides the necessary theoretical foundation on suitable feature extraction methods and classifiers employed in the research.

3. Introduction into feature extraction, selection and classification

Feature extraction, selection and classification are inter-dependent tasks. Feature selection refers to algorithms that choose a small and optimal subset from a larger input set of features. It is normally performed during training of the classifier. Feature extraction, on the other hand, refers to algorithms that generate new and hopefully more useful features from the original set by applying a transformation (Xu and Wunsch II, 2009; Jain et al., 2001). Extraction of features may provide more discriminative ability compared to the best of the original features. However, transformation normally presents a more abstract representation of the data, thus the new feature may no longer be physically interpretable. Feature extraction normally precedes the feature selection stage. Finally, classification is a process of identifying a group (sub-population) to which particular observations belong.

3.1. Feature extraction and selection

In the field of pattern recognition, dimension reduction techniques such as principal component analysis, factor analysis, linear discriminant analysis and projection pursuit are highly popular feature extraction methods (Jain et al., 2001). These techniques work well only if the defects in the same class are geometrically well aligned. Rotation of the wafer map to align defect clusters is not feasible due to the non-symmetrical geometry and relatively large die size. Besides, such a rotation would cause a significant number of dies to be misaligned.

Since semiconductor wafers are approximately circular in shape, this characteristic can be considered as invariant in its rotational properties. Thus, potentially suitable methods to extract this particular characteristic are the Polar Fourier Transform (PFT) and the Rotational Moment Invariant (RMI), which are discussed below in Sections 3.1.1 and 3.1.2, respectively. In addition, Section 3.1.3 provides the list of geometrical features to account for the variations in shape, size and defect location.

3.1.1. Polar Fourier Transform

The Polar Fourier Transform is similar to the traditional Fourier Transform, except that it considers the frequency spectrum in polar coordinates (Averbuch et al., 2006). Thus, it is invariant towards rotational properties. The polar grid of frequencies inside the concentric circle is defined as ω_{p,q} = {ω_x[p,q], ω_y[p,q]}. Eq. (1) below describes the discrete PFT operation, which includes a set of the samples F(ω_{p,q}) with sample points governed by Eq. (2) (Averbuch et al., 2006).

Fig. 8. Applying DPFT with peak-centring on different defect cluster types.

F(ω_{p,q}) = \sum_{i_1=0}^{N-1} \sum_{i_2=0}^{N-1} f[i_1, i_2] e^{-i(i_1 ω_x[p,q] + i_2 ω_y[p,q])}    (1)

ω_x[p,q] = (πp/N) cos(πq/2N),  ω_y[p,q] = (πp/N) sin(πq/2N),  for (-N ≤ p ≤ N-1), (0 ≤ q ≤ 2N-1)    (2)

Fig. 7 shows an example of Discrete Polar Fourier Transform (DPFT) application on a Line defect pattern, whereby (a) and (b) show the before and after effect, respectively. Comparing wafers A-C, it can be seen that the DPFT maps any rotational variances as a shift along the x-axis. Thus, to eliminate any rotational variances between different Line defect patterns, the peak of the frequency is centred on the x-axis. In this manner, a similar signature is obtained for the same defect pattern, as shown in Fig. 7(c). Fig. 8 shows the results of applying DPFT on the six different types of defect clusters after the peak centring.

Since each defect cluster type has a unique and identifiable frequency domain signature, DPFT can be considered as a possible feature extraction method for classification of the semiconductor defect clusters. However, it is important to note that the DPFT does not consider any translational property of the local defect cluster patterns. Therefore, additional geometric features should be considered for inclusion into the feature set to increase the classification accuracy. These are discussed below in Section 3.1.3.

The frequency signature for each defect cluster is obtained by calculating the eigenvectors using Principal Component Analysis (PCA) (Jolliffe, 2002). It is important to note that the conventional PCA calculation is a lengthy process because the DPFT application leads to a high number of feature dimensions after transformation. Thus, a two-directional approximation called (2D)2 PCA (Zhang and Zhou, 2005) is used instead, as it offers a higher computational speed without significant loss in calculation accuracy.
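For illustration, the following minimal Python/NumPy sketch (added for this rendering; the direct evaluation of Eq. (1) on the Eq. (2) grid and the simple peak-centring step are assumptions of this sketch, not the authors' implementation) evaluates the discrete PFT of a small binary wafer map and rolls the angular axis so that the magnitude peak is centred.

import numpy as np

def discrete_pft(wafer, N):
    """Directly evaluate Eq. (1) on the polar frequency grid of Eq. (2).

    wafer: 2-D array with 1 = failing die, 0 = otherwise. Returns a (2N, 2N)
    array of complex samples F(omega_{p,q}) indexed by (p, q)."""
    p = np.arange(-N, N)                       # -N <= p <= N-1
    q = np.arange(0, 2 * N)                    # 0 <= q <= 2N-1
    wx = (np.pi * p[:, None] / N) * np.cos(np.pi * q[None, :] / (2 * N))
    wy = (np.pi * p[:, None] / N) * np.sin(np.pi * q[None, :] / (2 * N))
    i1, i2 = np.nonzero(wafer)                 # only failing dies contribute for a binary map
    phase = i1[:, None, None] * wx[None, :, :] + i2[:, None, None] * wy[None, :, :]
    return np.exp(-1j * phase).sum(axis=0)

def peak_centre(F):
    """Shift the angular (q) axis so the largest-magnitude column sits in the middle,
    removing the rotation-induced shift described in the text."""
    q_peak = int(np.argmax(np.abs(F).sum(axis=0)))
    return np.roll(F, F.shape[1] // 2 - q_peak, axis=1)

wafer = np.zeros((32, 32))
wafer[10:22, 15:17] = 1                        # a crude vertical "Line" cluster
signature = np.abs(peak_centre(discrete_pft(wafer, N=16)))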

3.1.2. Rotational moment invariants

Moment functions were first introduced in 1962 by Hu, who employed the results of the theory of algebraic and geometric moments to derive the seven famous invariants to the rotation of two-dimensional objects, which he used for automatic character recognition (Flusser, 2000; Flusser and Suk, 2006). These are simple properties, which are found via image moments. They include the area or total intensity, its centroid and orientation information. Since then, moment invariants have become a classical tool in pattern recognition and object classification (Mukundan and Ramakrishnan, 1998).

A moment function F_pq of the order (p+q) for a general two-dimensional function f(x,y) can be given as Eq. (3), where z denotes the image region of the x-y plane, which is the domain of the function f(x,y) (Mukundan and Ramakrishnan, 1998); C_pq(x,y) is the moment weighting kernel or the basis set and is a continuous function of (x,y) in z; the indices p and q usually denote the degrees of the coordinates x and y, respectively. Hu's invariants are listed in Flusser (2000).

F_{pq} = \iint_z C_{pq}(x, y) f(x, y) dx dy    (3)

Moment invariants with respect to image rotation can be easily derived from the complex moments. In particular, the Rotational Moment Invariant (RMI) is a good tool for classifying defect clusters on semiconductor wafers because it extracts the same features for the same type of cluster regardless of the rotational differences. However, the availability of Hu's invariants is very restrictive for a given order. Flusser (2000) proposes to use complex moments to generate more general rotational invariants, which include Hu's invariants as special cases. The complex moment c_pq of the order (p+q) of an image function f(x,y) is shown in Eq. (4), where i is the imaginary unit. Each complex moment can be expressed in terms of geometric moments m_pq in Cartesian or polar coordinates (Flusser, 2000):

c_{pq} = \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} (x + iy)^p (x - iy)^q f(x, y) dx dy    (4)

c_{pq} = \sum_{k=0}^{p} \sum_{j=0}^{q} \binom{p}{k} \binom{q}{j} (-1)^{q-j} i^{p+q-k-j} m_{k+j, p+q-k-j}    (5)

c_{pq} = \int_{0}^{\infty} \int_{0}^{2\pi} r^{p+q+1} e^{i(p-q)\theta} f(r, \theta) dr d\theta    (6)

According to Flusser (2000), if n ≥ 1 and k_i, p_i and q_i, i = 1, ..., n, are non-negative numbers such that \sum_{i=1}^{n} k_i (p_i - q_i) = 0, then I in Eq. (7) is invariant to rotation.

I = \prod_{i=1}^{n} c_{p_i q_i}^{k_i}    (7)

Application of Eqs. (4) and (7) generates a complex number, which is the RMI feature. There is a large number of possible combinations of p1, p2, q1, q2, k1, k2 that generate the rotational invariance property. However, not all combinations are equally useful. Discretisation of the integration operation in Eq. (4) negatively impacts the theoretical invariance property and causes perturbation in the resultant RMI feature. Certain combinations may be more sensitive to the discretisation errors. In this study, p1, p2, q1, q2 ∈ {0.67, 0.79, 0.93, 1.00, 1.10, 1.30, 1.54, 1.82, 2.00, 2.15, 2.54, 3.00} were used because they are almost equally spaced numbers between 0.67 (lower limit) and 3 (upper limit). These limits have been empirically selected to limit the dynamic range of the RMI value to prevent numerical instability. If no limits are enforced, the RMI value may be too large for representation even when using double precision numbers. Indeed, there are 20,736 possible combinations using the set of p1, p2, q1, q2 given above. Further restrictions are placed on k1 and k2, such that k1 = -(p2 - q2)/(p1 - q1 - p2 + q2) and k2 = (p1 - q1)/(p1 - q1 - p2 + q2), to prevent numerical singularity during calculation. Apart from the discretisation error, the geometrical characteristics of the wafer (imperfect circle) and the target defect class also affect the robustness of RMI features. Table 1 shows several typical defect clusters and their corresponding rotated versions. It is clear that the cluster rotation is not exact due to the non-ideal wafer geometry (as mentioned earlier, the semiconductor wafer is almost never a perfect circle). Even with the restriction of the upper and lower limits, the RMI value can go up to the 10^4 range. Four arbitrarily chosen RMI are plotted in Table 2. The results show that despite various sources of error, the RMI of the rotated cluster still produces a similar complex number. In addition, cross-validating various RMI values yields reliable classification of a given cluster.

Table 1
Two variations of Blob, Edge, Hat and Line defect patterns and their corresponding symbols used in RMI space. (For each of the Blob, Edge, Hat and Line patterns, the table shows the wafer yield map and its symbol in RMI space alongside the rotated defect cluster and its symbol in RMI space; the entries are graphical and are not reproduced here.)

3.1.3. Geometrical features

It is important to note that the PFT and RMI features discussed in Sections 3.1.1 and 3.1.2 do not consider any translational property of the defect clusters. These properties are necessary to distinguish between different defect classes. Thus, Table 3 provides a set of geometric-based rules that are normally used to distinguish between the types of defect clusters, for inclusion into the feature set so as to increase the classification accuracy (Guise et al., 2002).

3.2. Classification algorithms

Classification is the most popular application of all the machine learning methods. Classifiers utilise a mapping function to link the features to the respective class space (Kononenko and Kukar, 2007). The mapping function can be either given to the classifier or, more commonly, it is learnt from a given set of high quality training data. Different classifiers have different ways to represent the learning and mapping algorithms. Table 4 summarises the common classification methods in machine learning.

The semiconductor wafer datasets used in the reported research are non-linear with multiple classes. Thus Nearest neighbours, Support vector machines and Discriminant functions are typically unsuitable for this application. Neural network classifiers were initially considered. However, the neural network classification rule sets are not transparent to the user. This poses a serious problem in the industrial application. If there are any inaccuracies in classification, there is no method to determine their exact root-cause, which impedes process improvement efforts. Therefore the use of this tool in the manufacturing process would not satisfy the Six-Sigma quality audits (Pande et al., 2000), which are performed regularly by both the company and their clients. Hybrid functions are very complex to implement, and should only be considered if all other classification methods fail to provide suitable results. It is a general rule-of-thumb to use the "Occam's Razor" principle, stating that given several classifiers with the same training error, the simpler classifier is more likely to generalise better.

Thus, for this particular application, the suitable classifiers are the Bayesian classifier and variations of Decision trees, which are: Top-Down Induction Decision Trees (TDIDT), Bagging on TDIDT, and Boosting on the Alternating Decision Tree. Section 4 provides the experimental flow to determine the most suitable classifier for this application. To aid in reader comprehension, the following subsections provide some theory on the selected classifiers.
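Before turning to the individual classifiers, a minimal NumPy sketch of the rotation-invariant feature of Eqs. (4) and (7) from Section 3.1.2 is given below for illustration; the discrete approximation of the integral, the single exponent pair and the centring of coordinates are assumptions of this sketch rather than details taken from the paper.

import numpy as np

def complex_moment(wafer, p, q):
    """Discrete approximation of Eq. (4) for a binary wafer map f(x, y),
    with coordinates centred on the wafer centre."""
    ys, xs = np.nonzero(wafer)
    cy, cx = (np.array(wafer.shape) - 1) / 2.0
    zx, zy = xs - cx, ys - cy
    return np.sum((zx + 1j * zy) ** p * (zx - 1j * zy) ** q)

def rmi_feature(wafer, p1, q1, p2, q2):
    """One RMI value I = c_{p1 q1}^{k1} * c_{p2 q2}^{k2}, with the k1, k2
    restriction quoted in Section 3.1.2, so that k1(p1-q1) + k2(p2-q2) = 0."""
    d = p1 - q1 - p2 + q2
    k1 = -(p2 - q2) / d
    k2 = (p1 - q1) / d
    return complex_moment(wafer, p1, q1) ** k1 * complex_moment(wafer, p2, q2) ** k2

wafer = np.zeros((32, 32))
wafer[10:22, 15:18] = 1                           # a simple bar-shaped cluster
rotated = np.rot90(wafer)                         # the same cluster, rotated by 90 degrees
print(rmi_feature(wafer, 2.0, 1.0, 1.3, 0.67))
print(rmi_feature(rotated, 2.0, 1.0, 1.3, 0.67))  # ideally similar, up to discretisation effects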

Table 2
Results of RMI for the wafer yield maps using their corresponding symbols shown in Table 1. (The table plots four arbitrarily chosen RMI values for the original and rotated clusters; the entries are graphical and are not reproduced here.)

3.2.1. Bayesian classifier

The Bayesian classifier is a simple statistical classifier based on Bayes' theorem (Bramer, 2007). Assuming that each class is C_k for k = 1, ..., number of classes, and n is the number of features F, the Bayes rule is written as Eqs. (8) and (9). The probability that an instance belongs to a particular class C_k given a combination of features (F_1, F_2, ..., F_n) is shown in Eq. (10).

P(C_k, F_1, F_2, ..., F_n) = P(C_k) P(F_1, F_2, ..., F_n | C_k)    (8)

P(C_k, F_1, F_2, ..., F_n) = P(F_1, F_2, ..., F_n) P(C_k | F_1, F_2, ..., F_n)    (9)

P(C_k | F_1, F_2, ..., F_n) = [P(C_k) / P(F_1, F_2, ..., F_n)] P(F_1, F_2, ..., F_n | C_k)    (10)

The prior probability of an instance belonging to a class is P(C_k). It is calculated even before any of its features are obtained (Bramer, 2007). Another prior probability is P(F_1, F_2, ..., F_n), which is the likelihood of observing the (F_1, F_2, ..., F_n) combination in the data. These two priors are normally estimated from historical datasets. The priors will normally help to adjust the prediction of the class for better accuracy. For example, if historically 90% of the observed defect clusters belong to the Edge defect pattern and 10% belong to Bull's Eye, the prior of the Edge cluster will be 0.9 and the prior of the Bull's Eye cluster 0.1. This will increase the likelihood that a cluster will be categorised as an Edge class. The Bayesian classifier can be readily used with discrete features. For continuous features, discretisation must first be performed, thus some discretisation error will inevitably be introduced into the system (Bramer, 2007).

3.2.2. Top-Down Induction Decision Trees

A decision tree is a flowchart-like tree structure that consists of internal nodes, branches and leaves (or terminal nodes) (Bramer, 2007). Each path starts at the topmost node (also known as the root node). Conditions are evaluated at each internal node, which corresponds to a test on an attribute. The outcome of the test is represented by each branch, which ends in a tree leaf. Each leaf corresponds to a discrete class label or decision rule (Kononenko and Kukar, 2007). A good choice of feature or attribute is absolutely vital for high accuracy in prediction. Some features are far more informative than others, and thus the best attribute must be determined by a measure of impurity (Kononenko and Kukar, 2007). This research uses entropy (Quinlan, 1986) to measure the impurity. It is an information-theoretic measure of the uncertainty in the training set in the presence of one or more possible classifications. Decision trees result in a symbolic and logical representation of the classification function, which normally conforms to a physical knowledge of the problem. Decision trees generally do not make any pre-determined assumptions on the statistical distribution of the dataset. This allows the features to be selected and used in an independent manner. The biggest problem for decision trees is the empty or null leaf phenomenon (Kononenko and Kukar, 2007; Bramer, 2007), where there is a valid path with no corresponding learning examples. This results in an unclassified instance.

3.2.3. Bagging on TDIDT

The natural extension of TDIDT is to apply the bagging paradigm to reduce the problem of overlearning or over-fitting of the data (Polikar, 2006). Bagging attempts to improve classifier performance by generating many different classifiers. This is achieved by training them on slightly different training sets, as illustrated in Fig. 9, whereby n decision trees are produced. The class is chosen by a combined decision (or majority voting) of the different trees. The bagged TDIDT classification accuracy is normally comparable with that of the traditional TDIDT, while it displays less sensitivity to noise from the training set and generalises better.

There are several drawbacks of using bagging on TDIDT. The rules used to classify an instance into a particular class are obscured when many trees are used in this manner. Bagging does not prevent the generation of contradictory rules from different trees. This negates the main advantage of decision trees over black-box approaches such as neural networks: the transparency of decision rules.

This paper only considers bagging on decision trees, which is equivalent to the standard ID3 algorithm used for bagging in the machine learning community (Quinlan, 1986). The later extension of this algorithm is C4.5, which is widely considered as the best decision tree and includes tree pruning, continuous attribute values, missing values and explicit rule generation (Quinlan, 1992). Tree pruning is a process to remove unreliable tree branches that lead to a poor generalisation performance. According to Bauer and Kohavi (1999), bagging on decision trees works well without pruning because bagging itself reduces variance through the majority vote of many specialised trees. Pruning these trees, on the other hand, will generate many similar trees that give a similar level of trade-off between specialisation and generalisation performance. Thus pruning bagged decision trees actually negates the benefit of statistical voting. A similar conclusion has also been drawn from another empirical study (Quinlan, 1996), whereby comparison between C4.5 and its bagged version produces similar error rates.

Table 3
Geometrical features to improve classification accuracy. (The illustrations accompanying several of the features are graphical and are not reproduced here.)

Customised features and their illustration or description:
1. Major axis length
2. Minor axis length
3. Aspect ratio: Major axis length / Minor axis length
4. Ascribe angle
5. Angle between major axis and radial direction of the centroid
6. Vectorial average of defect locations, or Mean(r_i), where i references each location of the dies within the cluster
7. Standard deviation of r_i
8. Standard deviation of angle, θ_i
9. Distance between the centroid and the wafer centre
10. Proportion of a defect in Region 1: Total dies in cluster within Green region / Number of dies in the cluster
11. Proportion of a defect in Region 2: Total dies in cluster within Yellow region / Number of dies in the cluster
12. Proportion of a defect in Region 3: Total dies in cluster within Red region / Number of dies in the cluster
13. Proportion of a defect in Region 4: Total dies in cluster within Blue region / Number of dies in the cluster
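A minimal NumPy sketch of a few of the Table 3 features is given below for illustration (added for this rendering; the covariance-eigenvalue estimate of the axis lengths and the centre-distance feature are standard image-moment constructions, not code from the original study).

import numpy as np

def geometric_features(cluster_mask):
    """Compute a few Table-3-style features from a binary cluster mask:
    major/minor axis length (covariance-based estimate from die coordinates),
    aspect ratio, and distance of the cluster centroid from the wafer centre."""
    ys, xs = np.nonzero(cluster_mask)
    coords = np.column_stack([xs, ys]).astype(float)
    centroid = coords.mean(axis=0)
    wafer_centre = (np.array(cluster_mask.shape[::-1]) - 1) / 2.0

    cov = np.cov(coords.T)
    eigvals = np.sort(np.linalg.eigvalsh(cov))[::-1]          # largest eigenvalue first
    major, minor = 2.0 * np.sqrt(np.maximum(eigvals, 1e-12))  # rough axis-length scale

    return {
        "major_axis_length": major,
        "minor_axis_length": minor,
        "aspect_ratio": major / minor,
        "centroid_to_centre": float(np.linalg.norm(centroid - wafer_centre)),
    }

mask = np.zeros((30, 30))
mask[5:9, 4:24] = 1            # an elongated cluster near the wafer edge
print(geometric_features(mask))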

Table 4
Classification methods (Kononenko and Kukar, 2007; Jain et al., 2001).

Decision trees and rules. Description: Selects features and subsets of their values according to their quality, and uses them as a conjunctive rule's antecedent. Comments: Fast classification (testing) time. At the same time, decision trees generally tend to overlearn and pruning is needed. Rule sets that are generated for big trees are very complex and difficult to interpret.

Bayesian classifiers. Description: Calculates the conditional probability of all classes and assigns a pattern to the class that has maximum likelihood. Bayesian classifiers (BC) use statistical probability calculations to achieve minimum error rate. The Naive Bayes classifier (NBC) is a variation of BC that makes a statistical assumption on the independence of features. Comments: BC implementation is generally slow and complex. In addition, BC has high computational requirements. If the number of features is too high, BC will face the curse of dimensionality (Bramer, 2007). NBC is faster and simpler to implement; however, the independence assumption is often not justifiable.

Nearest neighbours. Description: Stores all the learnt data and assigns patterns to the most similar example according to a distance metric. Comments: Easiest classifier to implement. No learning or training required, but has slow classification time for large datasets. Unable to handle spatial transformations such as rotation and translation. Requires a special data structure to handle high-dimensional datapoints. Prone to high error rates.

Discriminant functions. Description: Linear classifier that finds the optimal classification boundary using mean-squared error optimisation. Comments: Only works for linearly separable classes. Has limited accuracy for many applications.

Support vector machines. Description: Maximises the distance between different classes by selecting the minimum number of support vectors. Can be understood as a non-linear extension of discriminant functions. Comments: Iterative and thus has high computational requirements. Does not work well for high-dimensional multiclass systems.

Neural networks. Description: Abstract classification that selects weights for each connection between neurons by learning them from the training set. Assigns classes based on the calculated weights. Comments: Classification rules are not transparent. Assignment of weights during learning is not made clear. Prone to overlearning. Prone to local minima, which result in suboptimal solutions.

Hybrid functions. Description: Combines two or more of the above approaches. Comments: Introduces at least one extra layer of complexity, where an interface is required to integrate the output of various classifiers into a single decision.

Fig. 9. Illustration of bagging on TDIDT.

3.2.4. Boosting on the alternating decision tree

Boosting was introduced in 1990 by Schapire (1990). It uses a set of several weak classifiers rather than one strong high-performance classifier. In general, the boosting process involves training the next set of weak classifiers on a set of data points, which are reweighted according to the mistakes of the preceding classifiers. Freund and Schapire introduced an improved boosting algorithm known as AdaBoost in 1997 (Freund and Schapire, 1997), which is now among the top 10 data mining algorithms (Wu et al., 2007). There are other popular boosting algorithms available such as LogitBoost (Friedman et al., 2000) and BrownBoost (Freund, 2001). BrownBoost was developed to address the learning instability problem caused by outliers in the training set. It has a longer computation time since its termination condition requires solving a non-linear equation (Freund, 2001). LogitBoost is an alternative approach to AdaBoost with comparable performance in most cases. However, it can be far more complex to implement because it uses a regression model depending on the dataset (Friedman et al., 2000).

All three boosting methods have comparable performances. In some cases, AdaBoost may outperform LogitBoost and BrownBoost, while in others either LogitBoost or BrownBoost may provide the lowest error rate (Friedman et al., 2000; Freund, 2001; Schapire, 1999). During implementation, parameters within the chosen boosting algorithm should be tweaked until the best classification performance is obtained. The fabricated wafer yield data does not have a problem of outliers. Thus, in order to lower the complexity of the classifier, AdaBoost is chosen over BrownBoost and LogitBoost for implementation in the reported research.

In general, boosting on the TDIDT generates better results than those of the bagged TDIDT (Quinlan, 1996). However, it has a similar problem with bagging on TDIDT, whereby it generates complex trees with every boosting iteration. Thus, the final classification rules may be difficult (if not impossible) to interpret (Freund and Mason, 1999). Additionally, theoretical and empirical studies in Reyzin and Schapire (2006) and Schapire et al. (1998) show that if the base classifier is too complex for its intended application, the performance of the boosted learning process will be negatively affected. Therefore, the alternating decision tree (ADTree) was specifically developed for boosted learning.

ADTree controls the complexity of each base classifier by considering only two conditions at any time. Thus every node has only two branches. Each node in the ADTree has a prediction value, and the final prediction value is the sum of all predictions in the traversed path. A positive sum represents one class, while a negative sum represents the other class for a two-class problem. In this manner, ADTree is able to return a measure of confidence known as a classification margin, apart from simply classifying an instance (Comite et al., 2003). A higher classification margin indicates a higher probability of the cluster belonging to that class. Every training iteration adds a decision "stump", which is one node and two leaves, to the ADTree. A stump is the simplest form of a decision tree.
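For illustration, the scikit-learn sketch below boosts decision stumps (the default base learner of AdaBoostClassifier) and reads out the signed decision value as a margin-like confidence for a two-class problem; the synthetic data and the use of AdaBoost on stumps as a stand-in for a true ADTree learner are assumptions of this sketch.

import numpy as np
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)
# Stand-in features for two cluster classes (e.g. "Edge" vs "Ring").
X = rng.normal(size=(800, 8))
y = (X[:, 0] - X[:, 3] + 0.5 * rng.normal(size=800) > 0).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=1)

# Each boosting iteration adds one depth-1 stump, mirroring the "one node,
# two leaves per iteration" growth described above for ADTree.
booster = AdaBoostClassifier(n_estimators=100, random_state=1).fit(X_tr, y_tr)

margins = booster.decision_function(X_te)       # signed sum over the weighted stumps
print("test accuracy:", booster.score(X_te, y_te))
print("largest absolute margins:", np.sort(np.abs(margins))[-3:])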

4. Experimental study

4.1. Experimental setup

One of the world's leading semiconductor manufacturers was involved in and supported the reported study by providing the necessary resources for testing and analysis. Those included several million units of integrated circuits of five types that were in the mainstream high-volume production (see Table 5). These devices were selected for the experimental study in consultation with the manufacturing company, and their characteristics were obtained from their respective design documents. They were specifically chosen to be of different technologies (half-pitch sizes), different complexity, number of metallisation layers and dies/wafer counts. The true names and functional descriptions of the devices have intentionally been changed in this paper for confidentiality reasons.

Table 5
Devices used in the experimental trials.

Device  Half-pitch size (nm)  Dies/wafer  Metallisation layers  Total ICs
A       250                   984         3                     2,460,000
B       250                   794         3                     1,985,000
C       130                   402         6                     1,105,500
D       90                    384         7                     960,000
E       250                   235         3                     587,500

Three experiment rounds were performed to address the challenges discussed in Section 2. The first one (Section 4.2) was on classifier training, whereby the following three classifier training and implementation methods were compared:

1. Using historical production data to train a classifier, with implementation on raw production test data;
2. Using simulated data (with noise) to train a classifier, with implementation on raw production test data;
3. Using simulated data (noiseless) to train a classifier, with implementation on filtered production test data.

Section 4.3 below describes an experiment that was performed to determine the best combination of rotational feature vector (discussed in Sections 3.1.1 and 3.1.2) and classifier (discussed in Section 3.2). Finally, the defect cluster recognition system is concluded in Section 4.4 with an experiment on the optimal feature set. Section 4.5 shows the proposed system flow for implementation on the semiconductor production test floor.

4.2. Classifier training

In order to perform this experiment, the type of classifier must be held constant while changing the training method. The boosting on ADTree was selected for the experiment due to its superiority over the other classifiers from the list. This is because the boosting on ADTree was originally developed to overcome some of the weaknesses of previous versions of decision trees. It is important to note that although the ADTree was used in this experiment, the best classifier for this application was still unknown at the time of experimentation. At the same time, it could be expected that the classifier training results would scale accordingly when applied to other decision trees, as they would face the same problems during the training and implementation cycle.

4.2.1. Experimental flow

Fig. 10 shows the experimental flow for the classifier training methods. At this point, the optimal feature set was still unknown, hence only the RMI features discussed in Section 3.1 were used during the Feature Extraction stage. This is because using all the features would result in an extremely large feature set, which would in turn result in an extremely lengthy computation time.

In the first flow (Method 1 shown in Fig. 10), raw production test data was used to train and test the ADTree classifier. Hence, the dataset was divided into a training set and a test set. This is the standard train-and-test method used in supervised learning algorithms (Kononenko and Kukar, 2007). In order to reduce variability and to provide a better generalisation of the classifier performance, 10-fold cross-validation was performed, whereby the dataset was randomly divided into training and test sets in ten iterations and the validation results were averaged.

The second and third experimental flows (Methods 2 and 3 shown in Fig. 10) involved the use of a defect cluster simulator, which will be discussed in detail below. To train the classifier, Method 2 uses "noisy" simulated wafers, or wafers that are simulated to emulate the real production dataset. The amount of "noise" or random failure depends on the production yield. Thus, the average wafer yield for each device must be known prior to simulation. This is done to ensure that sufficient numbers of samples of high quality training data are available for training.

Fig. 10. Experimental flow to determine the best training and implementation method for the classifier.
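A minimal sketch of the Method 1 style evaluation (train-and-test with 10-fold cross-validation on already-extracted feature vectors) is given below; the feature matrix, labels and classifier choice are placeholders assumed for illustration only.

import numpy as np
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(2)
# Placeholder for per-wafer feature vectors (e.g. RMI values) and defect-class labels.
X = rng.normal(size=(500, 12))
y = rng.integers(0, 2, size=500)
y[X[:, 0] > 0.5] = 1                      # inject some learnable structure

# Ten random train/test splits, with the validation results averaged.
scores = cross_val_score(AdaBoostClassifier(n_estimators=50, random_state=0), X, y, cv=10)
print("10-fold accuracies:", np.round(scores, 3))
print("mean accuracy:", scores.mean())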

Method 3 trains the classifier using pure (noiseless) defect clusters. Thus, to test the classifier, raw production data must be "filtered" to remove the random failure and to extract the defect clusters. The accuracy of the classifier would thus depend not only on the training set, but also on the effectiveness of the noise filtering technique. While other filtering methods could be used, such as Wang (2007) and White et al. (2008), this research applied the Segmentation, Detection and Cluster Extraction (SDC) algorithm in Ooi et al. (2011) as the filtering method of choice.

Any wafer can be simulated if the valid device locations (x-y positions on the wafer) are known. Thus, each defect cluster shown in Fig. 3 was simulated using the defect simulator algorithm shown in Fig. 11.

Fig. 11. The defect cluster simulator.

The simulator produces "perfect" defect cluster patterns using the set of knowledge rules provided in Ooi et al. (2011) in the first stage. In the second stage, cluster variations are included to obtain more realistic defect patterns. These variations are achieved using the rules shown in Table 6. In the final stage, random failures are included into the wafer map based on a predetermined wafer yield. The outputs of the second and third stages were used in Methods 3 and 2, respectively.

Table 6
Rules for generating cluster variations for the defect cluster simulator. (The Variation 1 and Variation 2 entries are graphical wafer-map examples and are not reproduced here.)

1. Defect pattern failing probability: The number of dies that fail in a specified cluster can be reduced by setting a lower failing probability, as shown in Variation 1. Alternatively, a higher failing probability will result in more failing dies in the cluster, as shown in Variation 2.

2. Defect pattern jitter size: Noise is added into the cluster to simulate imperfect cluster shapes. This is termed "jitter". The jitter size can be large (Variation 1) or small (Variation 2).

3. Defect pattern jitter failing probability: The additional jitter can have 100% passing probability, or 100% failing probability, which changes the defect cluster size as shown in Variations 1 and 2, respectively.

4.2.2. Experimental results and discussions on classifier training methods

In the reported research, data equivalent to over 300,000 different wafers were generated using the defect cluster simulator for the classifier training threshold selection. This included 5000 wafers for each of the six defect types, with and without random failure, and five device types. The reason for simulating 5000 wafers for each point was to ensure that the classification criteria were observed to converge to an asymptotical value. Ooi et al. (2011) provides some examples on verification of the simulation stability.
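A minimal NumPy sketch of the three-stage simulation idea (ideal pattern, jitter and failing probability, yield-dependent random failures) is shown below for illustration; the ring-pattern rule, jitter model and parameter values are simplified assumptions and not the knowledge rules of Ooi et al. (2011).

import numpy as np

rng = np.random.default_rng(3)

def simulate_ring_wafer(size=40, yield_rate=0.7, fail_prob=0.9, jitter=1.5):
    """Returns an array with 1 = failing die, 0 = passing die, -1 = invalid location."""
    yy, xx = np.mgrid[:size, :size]
    r = np.hypot(xx - (size - 1) / 2, yy - (size - 1) / 2)
    valid = r <= size / 2 - 1                                   # approximately circular wafer

    # Stages 1-2: ideal ring band, perturbed by per-die radial jitter and a failing probability.
    jittered_r = r + rng.normal(0.0, jitter, r.shape)
    cluster = np.abs(jittered_r - (size / 2 - 3)) < 1.5
    fails = cluster & (rng.random(r.shape) < fail_prob)

    # Stage 3: random failures according to the predetermined wafer yield.
    fails = fails | (rng.random(r.shape) < (1.0 - yield_rate))

    return np.where(valid, fails.astype(int), -1)

noisy = simulate_ring_wafer(yield_rate=0.4)    # heavy random failure, as at low yield
clean = simulate_ring_wafer(yield_rate=1.0)    # "noiseless" cluster, as used for Method 3 training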

Table 7
Correct identification rates.

Device  Method 1 accuracy (%)  Method 2 accuracy (%)  Method 3 accuracy (%)
A       55.26                  51.47                  69.5
B       92.06                  29.06                  66.7
C       36.3                   55.38                  51.1
D       52.94                  17.19                  64.8
E       29.17                  27.25                  48.4

Table 7 shows the classification results for the three trialled methods. It must be noted that Devices A, B and D did not encounter all the defect types during the three-month experimental period. For Device A, only the Edge, Blob and Hat defect types were encountered, while for Devices B and C only the Edge and Blob defect types were found. Thus, although Method 1 appears to be fairly accurate for Devices A, B and C, it is likely to fail if the production lines encountered a different defect type in future. Overall, Method 3 performs fairly consistently across the different device types compared to Methods 1 and 2. Thus, it can be concluded that Method 3 gives a better generalisation for application across many different devices being produced by the manufacturer.

4.3. Experiment on combinations of rotational feature vectors with different classification algorithms

Two feature extraction methods capable of extracting defect cluster information while accounting for different orientations on the wafer were discussed earlier in Sections 3.1.1 and 3.1.2. These are: the Rotational Moment Invariants (RMI) and the Discrete Polar Fourier Transform (DPFT). Both feature sets are subsequently used to train the four shortlisted classifiers, which are the Bayesian classifier, TDIDT, bagging on TDIDT and boosting on the ADTree.

Fig. 12. Experimental flow to obtain the optimum combination of feature extraction and classifier.

4.3.1. Experimental flow on feature-classifier combination

Two feature vector sets were obtained as shown in Fig. 12: the RMI and the DPFT feature sets, respectively. The shortlisted classifiers were trained based on the respective feature sets (Fig. 12). Thus, after the training, eight different classifiers were generated and compared against each other. The classifiers were trained and implemented using the method determined in Section 4.2.2, whereby data from the defect cluster simulator were employed to generate the training set. Production test data from the devices shown in Table 5 were used to generate the classifier performance table (after applying the SDC algorithm).

Table 8
Correct identification rates (%) on production test data.

Classifier      Bayesian classifier   TDIDT          Bagging on decision tree   Boosting on ADTree
Feature used    PFT     RMI           PFT    RMI     PFT     RMI                PFT    RMI
Bull's Eye      50.0    80.8          57.7   7.7     50.0    7.7                50.0   84.6
Blob            75.5    37.4          53.6   24.8    31.1    32.5               50.7   19.5
Hat             52.8    26.4          30.2   24.5    0.0     18.9               49.1   39.6
Ring            71.4    57.1          71.4   42.9    0.0     0.0                100    100
Line            16.5    45.9          25.9   36.5    43.5    62.4               4.7    52.9
Edge            49.3    41.9          51.4   32.7    0.0     23.0               88.4   65.2
Average         52.6    48.3          48.4   28.2    20.8    24.1               57.2   60.3

4.3.2. Experimental results and discussions on feature-classifier combination

Table 8 shows the performance for every feature-classifier combination for each defect cluster type when applied on the production dataset. It can be observed that the TDIDT and bagging on decision trees have poor average hit rates. The poor performance of the TDIDT is mostly caused by the unresolved missing branch (or null hypothesis) problem. Although bagging was expected to increase the performance, it did not manage to achieve so in this application.

It is clear that the pairing of RMI with boosting on ADTree results in the best overall performance, while PFT with boosting on ADTree has comparable performance. However, PFT describes clusters in the frequency domain, which makes any refinement or modification of its features more difficult to perform. Additionally, RMI has a smaller feature space dimension, which makes it less affected by the curse of dimensionality when geometrical features are included into the feature set (Section 4.4). This leads to the conclusion that the feature-classifier combination that best suits the wafer-detection and classification application for this research is RMI with the ADTree classifier. Its overall accuracy is still below 70% because geometrical features have not been considered. Thus, the description of some defect clusters (in particular, the Blob defect type) is poorly defined, leading to very low identification rates. To overcome this weakness, a third and final experiment on the optimal feature set is discussed in the next section to increase its suitability for real-world implementation.

4.4. Experiment on optimal feature set for defect cluster recognition system

Table 9
Accuracy comparison between the classifiers generated by different feature sets.

Device  RMI feature set classifier accuracy (%)  Extended feature set classifier accuracy (%)  Percentage point improvement
A       69.5                                     90.3                                          20.8
B       66.7                                     94.7                                          28.0
C       60.3                                     81.5                                          21.2
D       64.8                                     95.9                                          31.1
E       48.4                                     84.1                                          35.7

Fig. 13. Proposed defect cluster recognition system for semiconductor wafers.

It is clear from the experimental results that the ADTree classifier generated from the Extended Feature Set is far superior in terms of accuracy. It offers an average improvement of approximately 30 percentage points while achieving a very good recognition accuracy of up to 96% depending on the product type. Thus it can be well recommended as the Defect Cluster Classifier for semiconductor wafers.

4.5. Proposed defect cluster recognition system

The defect cluster recognition system for fabricated semiconductor wafers shown in Fig. 13 is derived using the results of the three experiments discussed above in Sections 4.2, 4.3 and 4.4. The offline process involves the generation of the features and the training of the ADTree classifier, which can take up to a few minutes depending on the performance of the employed computer system. For example, in this research the training time was approximately 10 min on a commodity computer powered by an Intel® Pentium® T4400 (2.2 GHz) processor. By using a more powerful processor, the training time can be reduced dramatically. At the same time, it is important to note that this process is performed only once for each device in an offline mode, and thus it does not affect the manufacturing throughput. On the other hand, the classification time is practically negligible for all the data in the manufacturing dataset due to the built-in simplicity of ADTree. As a result, the proposed system is capable of running in an online mode without negatively impacting the manufacturing throughput.
It is important to note that, since simulation is used to produce the training set for the classifier, the types of defect clusters recognisable by the system are limited to the defect types that were used for training. In this research, the system was trained to recognise a limited set of defects including Bulls-Eye, Blob, Line, Edge, Ring and Hat. However, the system could be trained to recognise new defect types by specifying their geometry and simulating them.
The defect cluster simulator overcomes the problem of insufficient training samples, while the complementary application of the SDC algorithm (Ooi et al., 2011) filters the noisy raw production dataset. The combination of these algorithms overcomes the problem of the low signal-to-noise ratio discussed in Section 2.1. The application of the RMI feature extraction method with extended geometrical features provides good discrimination of the defect clusters encountered in semiconductor wafer fabrication, while its combination with the ADTree classifier results in an overall performance of up to 95% depending on the device type.
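To make the offline/online split of Fig. 13 concrete, the sketch below separates one-off per-device training from per-wafer classification. It is only an illustrative sketch under assumptions, not the authors' implementation: AdaBoost over decision stumps again stands in for ADTree, extract_features is a hypothetical placeholder for the RMI-plus-geometrical feature extraction of Section 3.1.3, and joblib is assumed for persisting the trained model.

# Sketch of the offline-training / online-classification split (cf. Fig. 13).
# extract_features() is a hypothetical placeholder; the real system computes
# RMI and geometrical features of the defect cluster on the binary wafer map.
import numpy as np
import joblib
from sklearn.ensemble import AdaBoostClassifier

def extract_features(wafer_map):
    # Placeholder feature extraction: flatten and truncate the wafer map.
    return np.asarray(wafer_map, dtype=float).ravel()[:12]

def train_offline(simulated_maps, labels, model_path="defect_classifier.joblib"):
    """One-off, per-device training on simulated defect clusters (offline step)."""
    X = np.vstack([extract_features(m) for m in simulated_maps])
    clf = AdaBoostClassifier(n_estimators=50).fit(X, labels)
    joblib.dump(clf, model_path)  # persist the classifier for the production line
    return model_path

def classify_online(wafer_map, clf):
    """Fast per-wafer classification during production (online step)."""
    return clf.predict(extract_features(wafer_map).reshape(1, -1))[0]

# Usage: train once per device offline, then load the model once and classify each
# wafer as it completes production test; labels are the defect class names.
# clf = joblib.load(train_offline(simulated_maps, labels))
# print(classify_online(new_wafer_map, clf))

Keeping the trained model small and the per-wafer step limited to feature extraction plus a single prediction is what allows the classification to run in-line with production without affecting throughput.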

5. Conclusions

This paper presents a defect cluster recognition system that automatically recognises several types of known defect clusters found on fabricated semiconductor wafers. The proposed system generates an ADTree classifier that achieves a classification accuracy of over 95% depending on the product type. Incorporation of the classifier into an online production data analysis system allows any type of known defect cluster encountered during the manufacturing process to be quickly and automatically identified. This provides very valuable information that can be used for fast fault diagnosis and rapid root cause identification, which in turn leads to better process control.

Acknowledgements

The authors would like to thank the industrial partner, Freescale Semiconductor, for the provision of data, resources and equipment for this research, and Monash University for the scholarship support. This research was supported by the Malaysian Ministry of Higher Education Fundamental Research Grant Scheme FRGS/1/2011/SG/MUSM/03/1.

References

Averbuch, A., Coifman, R., Donoho, D., Elad, M., Israeli, M., 2006. Fast and accurate Polar Fourier Transform. Appl. Comput. Harmon. Anal. 21, 145–167.
Bauer, E., Kohavi, R., 1999. An empirical comparison of voting classification algorithms: bagging, boosting, and variants. Mach. Learn. 36 (1–2), 105–139.
Bramer, M., 2007. Principles of Data Mining. Springer.
Comite, F., Gilleron, R., Tommasi, M., 2003. Learning multi-label alternating decision trees from texts and data. Mach. Learn. Data Min. Pattern Recognit., 251–274.
Flusser, J., 2000. On the independence of rotational moment invariants. Pattern Recognit. 33, 1405–1410.
Flusser, J., Suk, T., 2006. Rotation moment invariants for recognition of symmetric objects. IEEE Trans. Image Process. 15 (2), 3784–3790.
Freund, Y., 2001. An adaptive version of the boost by majority algorithm. Mach. Learn. 43 (3), 293–318.
Freund, Y., Mason, L., 1999. The alternating decision tree learning algorithm. In: Proceedings of the 16th International Conference on Machine Learning, pp. 124–133.
Freund, Y., Schapire, R.E., 1997. A decision-theoretic generalization of on-line learning and an application to boosting. J. Comput. Syst. Sci. 55 (1), 119–139.
Friedman, J., Hastie, T., Tibshirani, R., 2000. Additive logistic regression: a statistical view of boosting. Ann. Stat. 28 (2), 337–407.
Guise, M., Poe, D., Stafford, J., Wahba, A., 2002. Automated defect pattern recognition: an approach to defect classification and lot characterisation. IEEE Syst. Inf. Des. Symp., 63–68.
Hsieh, S., Lin, S.-C., Lee, M.-H., Wang, J.-R., Lin, C., Huang, C.-W., et al., 1999. Novel assessment of process control monitor in advanced semiconductor manufacturing: a complete set of addressable failure site test structures (AFS-TS). Proc. Semicond. Manuf. Conf., 241–244.
International Technology Roadmap for Semiconductors, 2009. Test and Test Equipment, 2009 Edition. Report, International Technology Roadmap for Semiconductors, http://www.itrs.net/Links/2009ITRS/Home2009.htm.
Jain, A.K., Duin, R.P., Mao, J., 2001. Statistical pattern recognition: a review. IEEE Trans. Pattern Anal. Mach. Intell. 22 (1), 4–36.
Jolliffe, I., 2002. Principal Component Analysis (Springer Series in Statistics), 2nd ed. Springer, New York.
Kameyama, K., Kosugi, Y., 1999. Semiconductor defect classification using hyperellipsoid clustering neural networks and model switching. In: International Joint Conference on Neural Networks, p. 3505.
Kononenko, I., Kukar, M., 2007. Machine Learning and Data Mining: Introduction to Principles and Algorithms. Horwood.
Mukundan, R., Ramakrishnan, K., 1998. Moment Functions in Image Analysis: Theory and Applications. World Scientific Publishing Co. Pte. Ltd., Danvers.
Ooi, M.P.-L., Sim, E.K., Kuang, Y.C., Demidenko, S., Kleeman, L., Chan, C.W., 2011. Getting more from the semiconductor test: data mining with defect-cluster extraction. IEEE Trans. Instrum. Meas. 60 (10), 3300–3317.
Polikar, R., 2006. Ensemble based systems in decision making. IEEE Circuits Syst. Mag. 6 (3), 21–45.
Quinlan, J., 1996. Bagging, boosting and C4.5. In: Proceedings of the 13th National Conference on Artificial Intelligence (AAAI-96), Portland, pp. 725–730.
Quinlan, J., 1992. C4.5: Programs for Machine Learning. Morgan Kaufmann, USA.
Quinlan, J., 1986. Induction of decision trees. Mach. Learn. 1 (1), 81–106.
Reyzin, L., Schapire, R.E., 2006. How boosting the margin can also boost classifier complexity. In: Proceedings of the 23rd International Conference on Machine Learning. ACM, Pittsburgh, pp. 753–760.
Schapire, R.E., Freund, Y., Bartlett, P., Lee, W.S., 1998. Boosting the margin: a new explanation for the effectiveness of voting methods. Ann. Stat. 26 (5), 1651–1686.
Schapire, R.E., 1999. Theoretical views of boosting and applications. In: Proceedings of the 10th International Conference on Algorithmic Learning Theory. Springer-Verlag, Tokyo, Japan, pp. 13–25.
Schapire, R., 1990. The strength of weak learnability. Mach. Learn. 5 (2), 197–227.
Semiconductor Industry Association (SIA), 2011. Semiconductor industry reports January chip sales grew 14.0% year over year. Washington, D.C., March 7.
Wang, C.-H., 2007. Recognition of semiconductor defect patterns using spectral clustering. In: Proceedings of the IEEE International Conference on Industrial Engineering and Engineering Management, Singapore, pp. 588–591.
White, K.P., Kundu, B., Mastrangelo, C., 2008. Classification of defect clusters on semiconductor wafers via the Hough transformation. IEEE Trans. Semicond. Manuf. 21 (2), 272–278.
Wu, X., Kumar, V., Quinlan, J.R., Ghosh, J., Yang, Q., Motoda, H., McLachlan, G.J., Ng, A., Liu, B., Yu, P.S., Zhou, Z.-H., Steinbach, M., Hand, D.J., Steinberg, D., 2007. Top 10 algorithms in data mining. Knowl. Inf. Syst. 14 (1), 1–37.
Xu, R., Wunsch II, D.C., 2009. Clustering. Wiley, Danvers, USA.
Zhang, D., Zhou, Z., 2005. (2D)^2 PCA: two-directional two-dimensional PCA for efficient face representation and recognition. Neurocomputing 69 (1–3), 224–231.
Zhao, X., Cui, L., 2008. Defect pattern recognition on nano/micro integrated circuits wafer. In: Proceedings of the 3rd IEEE International Conference on Nano/Micro Engineered and Molecular Systems (NEMS 2008), Sanya, China, pp. 519–523.
