Advances in Machine Learning and Deep Learning Applications Towards Wafer Map Defect Recognition and Classification-A Review
https://doi.org/10.1007/s10845-022-01994-1
Received: 24 September 2021 / Accepted: 7 July 2022 / Published online: 23 August 2022
© The Author(s), under exclusive licence to Springer Science+Business Media, LLC, part of Springer Nature 2022
Abstract
With the high demand and sub-nanometer design for integrated circuits, surface defect complexity and frequency for semiconductor wafers have increased, emphasizing the need for highly accurate fault detection and root-cause analysis systems, as manual defect diagnosis is time-intensive and expensive. As such, machine learning and deep learning methods have been integrated into automated inspection systems for wafer map defect recognition and classification to enhance performance, overall yield, and cost-efficiency. Concurrent with algorithm and hardware advances, in particular the onset of neural networks like the convolutional neural network, the literature for wafer map defect detection expanded with new developments to address the limitations of data preprocessing, feature representation and extraction, and model learning strategies. This article aims to provide a comprehensive review of the advancement of machine learning and deep learning applications for wafer map defect recognition and classification. The defect recognition and classification methods are introduced and analyzed for discussion on their respective advantages, limitations, and scalability. The future challenges and trends of wafer map detection research are also presented.
Keywords Wafer Map · Semiconductor manufacturing · Machine learning · Deep learning · Defect recognition · Defect
classification
3216 Journal of Intelligent Manufacturing (2023) 34:3215–3247
sliced into thin wafers by (diamond) wire cutting. For the purposes of wafer tracking, wafers are marked with characters to indicate manufacturing information (i.e., identification, dopants, orientation) (Airaksinen et al., 2015). Afterwards, using a profiled diamond wheel, the wafer edges are ground to a standardized or customized edge profile to adjust diameter, and minimize the risk of slipping and chipping (Airaksinen et al., 2015). Resulting from the prior cutting process, the wafer surface is susceptible to large total thickness variations (TTV), which disposes the surface to additional process variations from downstream processes. As such, lapping or single-sided grinding is conducted to achieve TTV, surface roughness, and thickness measures within acceptable standard ranges. Residual mechanical damage may develop on the surface and/or edges after the lapping and grinding operations (Airaksinen et al., 2015). To remove the damage and any remaining impurities, chemical etching (alkaline or acidic) is conducted. Subsequently, the wafers undergo polishing to achieve the desired thickness, TTV, and flatness. The polished wafers then undergo a cleaning sequence and quality inspection prior to IC fabrication. Quality inspections for wafers involve measuring the physical, material, and chemical properties of the finished product with respect to standard and design specifications (Airaksinen et al., 2015; Cuevas & Sinton, 2018). For surface inspections, wafer defect detection systems leverage WM images, or wafer images. WM images are the spatial results from electrical testing, which illustrate individual die functionality, such that defect patterns are clusters of faulty dies. Wafer bin maps (WBM) are the resulting binarized WM images. Wafer images are generated from automated visual or electron beam inspection systems (Patel et al., 2020). Automated visual inspection systems typically utilize optical imaging techniques, including scanning acoustic tomography (SAT) (Chen, 2020), scanning electron microscopy (SEM) (Kim & Oh, 2017; Cheon et al., 2019), and charge-coupled device (CCD)-based imaging (Chen et al., 2020a, 2020b; Wen et al., 2020).

IC fabrication consists of photolithography, assembly, and packaging. Photolithography is used to pattern the wafer, and involves a repetition of various steps: masking, exposure, and etching. Mask design is used to develop the desired patterns for masking; inverse-lithography technologies (ILT) determine the optimal mask to achieve the desired wafer patterns, and are emerging as a prominent research field (Shi et al., 2019, 2020). Masking involves the application of photoresist, and photomask alignment to the wafer. The wafer is then exposed to ultraviolet (UV) light through the photomask to reveal the patterns, which is followed by etching. Using chemical processes, etching develops and removes the exposed photoresist and exposed oxide layer. To create the desired IC patterns, photolithography is repeated in cycles for pattern and structure development. After the dies (also known as chips) have been developed, wafers undergo a sorting test, which involves electrical testing to determine die functionality. As part of assembly and packaging, the wafer is sliced into individual pieces, in which the faulty dies are discarded, and the remaining dies are forwarded to packaging.

The current technologies and design standards for IC fabrication are evolving, specifically for photolithography and IC design. Current designs and lithography technologies are at the sub-10 nm scale, specifically with extreme ultraviolet (EUV) lithography (Hasan & Luo, 2018; Preil, 2016). With competition and fast-evolving technologies, the future trends for IC fabrication include sub-5 and sub-3 nm scale lithography. As these future trends and technologies are realized, defect frequency and complexity increase, subsequently increasing the emergence of unknown, rare, and mixed-type defects; rendering defect detection more difficult, and emphasizing the need for more robust and reliable detection methods. The wafer production and IC fabrication processes, associated defects, and causes are summarized in Table 1.

Table 1 Summary of processes and associated defects

Defect | Associated process/stage | Cause
Random | Clean room | Environmental conditions of the clean room may induce particles and debris onto the wafer surface
Loc | Lapping, grinding; Polishing | Non-uniform surface; Uneven cleaning
Edge-Loc | Lapping, grinding; Polishing | Non-uniform surface; Uneven cleaning
Center | Polishing | Non-uniform surface during the chemical mechanical process (CMP)
Edge-Ring | Lapping, grinding; Photolithography | Non-uniform surface; Layer-to-layer misalignment; Chemical etching issues
Scratch | Assembly, packaging; Polishing; Clean room | Mishandling; Hardening of polishing pads; Agglomeration of particles
Near-full | – | Agglomeration of multiple systematic and random defects
Donut | Lapping, grinding; Polishing | Non-uniform surface; Equipment handling or hardening of polishing pads
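Such a process-to-defect mapping lends itself to a simple lookup during root-cause triage. The sketch below encodes the associations from Table 1 as a plain Python dictionary; the structure, names, and helper function are illustrative only, not part of any cited work.

```python
# Hypothetical lookup of candidate process stages per defect pattern,
# transcribed from Table 1 for quick root-cause triage.
DEFECT_STAGES = {
    "Random":    ["Clean room"],
    "Loc":       ["Lapping, grinding", "Polishing"],
    "Edge-Loc":  ["Lapping, grinding", "Polishing"],
    "Center":    ["Polishing"],
    "Edge-Ring": ["Lapping, grinding", "Photolithography"],
    "Scratch":   ["Assembly, packaging", "Polishing", "Clean room"],
    "Donut":     ["Lapping, grinding", "Polishing"],
}

def candidate_stages(defect: str) -> list[str]:
    """Return the process stages associated with a detected defect pattern."""
    return DEFECT_STAGES.get(defect, [])

print(candidate_stages("Scratch"))  # stages to inspect first for a Scratch pattern
```

A flat dictionary keeps the association auditable; in practice such a table would be maintained by process engineers rather than hard-coded.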
Fig. 3 Normal, single-type, and mixed-type defects with image dimensions of (52, 52) from Wang et al. (2020)
Fig. 4 Defect class distribution for WM-811K (top) and Mixed WM-38 (center, bottom) datasets
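Class distributions like those in Fig. 4 can be tallied directly from a labeled dataset. The sketch below is generic: the label list and counts are placeholders, not the actual WM-811K statistics, and the imbalance ratio is included because heavy class imbalance is a recurring obstacle with these datasets.

```python
from collections import Counter

def class_distribution(labels: list[str]) -> dict[str, int]:
    """Tally per-class counts for a labeled wafer map dataset."""
    return dict(Counter(labels))

def imbalance_ratio(counts: dict[str, int]) -> float:
    """Majority-to-minority class ratio; large values signal heavy imbalance."""
    return max(counts.values()) / min(counts.values())

# Illustrative labels only -- not the real WM-811K distribution.
labels = ["none"] * 90 + ["Edge-Ring"] * 6 + ["Scratch"] * 3 + ["Donut"] * 1
counts = class_distribution(labels)
print(counts, imbalance_ratio(counts))  # Donut is the rarest class here
```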
Geometrical (White et al., 2008; Chang et al., 2012; Wu et al., 2015; Fan et al., 2016; Wang & Ni, 2019; Saqlain et al., 2019; Kang & Kang, 2021):
- Area: Reflects the area of the defect pattern; typically the most salient region (single-type) is considered. Also expressed as the ratio of defect pattern area to wafer map area
- Perimeter: Defines the perimeter of the defect pattern. Also expressed as the ratio of defect pattern perimeter to the radius of the wafer map
- Convexity: Indicates the convexity of a defect pattern, expressed as the ratio of the convex hull's perimeter to the total perimeter of the defect
- Length of Major/Minor Axes: Computes the length of the major/minor axes of the approximated ellipse that surrounds the defect pattern (most salient region)
- Eccentricity: Describes the outline shape of the approximated ellipse that surrounds the defect pattern
- Solidity: Estimates the proportion of defective dies in the convex hull of the defect pattern
- Hough Transform: Identifies edges and lines in defect patterns
- Hu Invariant Moments: Set of seven values for central image moments. Image moments can recognize patterns independent of size, position, and orientation (Hu, 1962)

Projection (Wu et al., 2015; Yu & Lu, 2016; Piao et al., 2018; Saqlain et al., 2019; Kang & Kang, 2021):
- Radon Transform: Image projections at various angles are collected as a 2D representation of the wafer map. Projections obtain geometric/structural information specific to defect patterns

Density (Fan et al., 2016; Saqlain et al., 2019; Kang & Kang, 2021):
- Reflects the computed failed die density distribution. Involves dividing the wafer map into multiple segments, and computing the defective die density per segment

Texture (Yu & Lu, 2016):
- Describes and extracts surface textural features from images using the statistical method, gray level co-occurrence matrix (GLCM). Examples of textural features are correlation, entropy, energy, contrast, and uniformity (Mohanaiah et al., 2013)

Gray (Yu & Lu, 2016):
- Reflects the pixel distribution in images. Typically expressed with various statistical features, including mean, variance, skewness, and kurtosis
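As a concrete illustration of the area- and density-type features above, the following sketch computes an area ratio and per-segment defective-die densities from a binary wafer bin map using numpy. The function names and the 4×4 segmentation grid are arbitrary choices for illustration, not taken from the cited works.

```python
import numpy as np

def area_ratio(wbm: np.ndarray) -> float:
    """Geometrical feature: defective area as a fraction of the wafer map area."""
    return float(wbm.sum()) / wbm.size

def density_features(wbm: np.ndarray, grid: int = 4) -> np.ndarray:
    """Density features: defective-die density per segment of a grid x grid split."""
    h, w = wbm.shape
    rows = np.array_split(np.arange(h), grid)
    cols = np.array_split(np.arange(w), grid)
    return np.array([[wbm[np.ix_(r, c)].mean() for c in cols] for r in rows]).ravel()

# Toy 8x8 binary wafer bin map with a defect cluster in the top-left corner.
wbm = np.zeros((8, 8), dtype=int)
wbm[:2, :2] = 1
print(area_ratio(wbm))        # 4 defective dies out of 64 -> 0.0625
print(density_features(wbm))  # 16 per-segment densities; only the first is nonzero
```

Such handcrafted vectors are what the conventional classifiers discussed below consume in place of raw wafer map images.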
However, this advantage also poses a caveat to generating effective, handcrafted features because the degree of domain knowledge may not be sufficient to represent and differentiate the different defect patterns (Kang & Kang, 2021; Yu & Lu, 2016). This also imposes a limitation in detecting rare/unknown defects in regard to forming features: important characteristics of these defects may not be known or understood to generate effective features for detection and classification.

In contrast to feature generation, feature extraction can be applied to raw data, such as the wafer map images. Feature extraction includes dimensionality reduction techniques, and representation learning. Dimensionality reduction techniques, like principal component analysis (PCA) and linear discriminant analysis (LDA), are applied to extract the critical features for a lower dimensional representation (Wang & Ni, 2019; Yu & Liu, 2020). As information is lost when transforming into a lower dimensional space, PCA aims to minimize the number of features while maximizing the amount of variance captured by the set of features. A major limitation of PCA is that it does not consider spatial relations within the data, such that the underlying patterns are not effectively captured. On the other hand, LDA weakly maintains spatial relations by using class labels to instill low-level discriminatory power in separating classes in the lower dimensional subspace (Wang et al., 2014). Despite these limitations, dimensionality reduction techniques can reduce computational complexity, and improve model performance. Additionally, research into non-linear dimensionality reduction (manifold learning) techniques has demonstrated improved retention of spatial relations, including autoencoders (AE), t-distributed stochastic neighbor embedding (t-SNE), locally linear embedding (LLE), multi-dimensional scaling (MDS), and isomap (Faaeq et al., 2018).

Representation learning is automated feature extraction. Deep learning models, like the convolutional neural network (CNN), a neural network that employs nonlinear kernels to learn shared weights for input feature maps, have been widely used in various computer vision tasks due to their automated feature extraction ability (Nakazawa & Kulkarni, 2018; Park et al., 2020; Shen & Yu, 2019). The automated feature extraction learns rich and highly descriptive features at each convolution layer. Similarly, representation learning can also be conducted via inference models. Probabilistic generative models, such as variational autoencoders (VAE) and generative adversarial networks (GAN), leverage inference methods to approximate and learn latent feature representations of the data via latent variable(s) z (Kingma et al., 2014; Kong & Ni, 2020a). It is important to note that the latent space embeds the input to a compact, non-linear representation. Depending on the learning approach, automated feature learning can be executed with labeled and/or unlabeled data. With representation learning, raw data can be used, and can gain high discriminatory power as the underlying structure of the data can be learned, demonstrating capability with complex patterns and data structures (Khastavaneh & Ebrahimpour-Komleh, 2020; Zhong et al., 2016). The significance of representation learning is demonstrated with transfer learning (Section Enhanced learning strategies), wherein the feature extractor networks (backbone) of pretrained models have gained strong feature extraction capabilities to extract meaningful features (Chien et al., 2020; Ishida et al., 2019; Shen & Yu, 2019). However, the capacity of representation learning is constrained by model complexity, as performance is dependent on whether the model is suited to the respective data complexity and problem.

With the onset of neural networks, research has shifted from manual feature generation to feature representation learning, as leveraging feature learning algorithms has proven to generate more meaningful and effective features for downstream tasks, especially for problems with complex data structures.

Algorithms for wafer map defect detection

The algorithms are the learning strategies in which the model learns and trains from the input data. In this section, the three learning strategies that we will focus on are introduced: supervised, unsupervised, and semi-supervised learning. The main algorithms for wafer map defect detection are discussed in-depth in Section Methodologies and learning strategies. In Table 3, the prominent works for each main algorithm used in wafer map defect detection are listed.

Table 3 Selection of prominent machine learning and deep learning algorithms for wafer map defect detection

Algorithm | Approach | Paper
Supervised | Conventional ML classifiers (i.e., SVM, decision trees, ensembles) | Piao et al. (2018); Saqlain et al. (2019); Kang and Kang (2021)
Supervised | Neural networks | Kyeong and Kim (2018); Nakazawa and Kulkarni (2018); Kim et al. (2021); Wang et al. (2020)
Unsupervised | Mixture models | Kim et al. (2018); Ezzat et al. (2021)
Unsupervised | Density-based (DBSCAN, OPTICS) | Jin et al. (2018)
Semi-supervised | Pretraining-finetuning | Yu (2019); Shon et al. (2021)
Semi-supervised | Generative modelling | Kong and Ni (2018); Hu et al. (2021); Yu et al. (2019b); Yu and Liu (2020); Lee and Kim (2020); Kong and Ni (2020a)
Enhanced learning | Model optimization | Bello et al. (2017); Jang et al. (2020); Shon et al. (2021)
Enhanced learning | Incremental learning | Shim et al. (2020); Kong and Ni (2020a)
Enhanced learning | Data augmentation | Wang et al. (2019); Saqlain et al. (2020)
Enhanced learning | Transfer learning | Shen and Yu (2019); Ishida et al. (2019); Chien et al. (2020)

Supervised learning utilizes labels for model training, and loss functions, which measure the error between the predictions and ground truth. The labels are factored into the loss function, and act as the supervisory signal for the model to learn the mapping between an input and the respective desired model output. Loss functions are optimized by finding the
global minimum or the optimal local minimum. It should be noted that loss functions are dependent on the downstream task, and their mathematical optimization is constrained by their convexity. The problem for supervised wafer map defect detection is defined as classification, in which the algorithms aim to learn the mapping from input to output to predict specific defect patterns. Early literature has transitioned from conventional machine learning classifiers to neural networks. Conventional machine learning classifiers typically require extensive preprocessing and manual feature generation, and have mainly been applied for single-type defect detection. Common classifiers used in WM defect detection include SVM, decision trees, and ensembles (Fan et al., 2016; Piao et al., 2018; Saqlain et al., 2019; Wu et al., 2015). Neural networks are prominently used throughout the literature and have demonstrated capability for single-type and mixed-type defect classification.

It is important to note that the classification problem can be multi-class or multi-label. In multi-class classification, there are a distinct number of classes that the classifier learns and models. Each data sample belongs to a single class, and the classifier predicts the probability across all classes that the data sample belongs to a particular class. Multi-label classification is a multi-output algorithm, such that the data examples can be annotated with multiple target classes. For multi-class neural networks, the softmax function is used in the final output layer to compute the decimal probabilities, which add up to 1.0. On the other hand, multi-label neural networks utilize the sigmoid function in the final output layer to predict the probabilities (between 0 and 1) for each class. Mixed-type defect detection can be framed as a multi-class or multi-label classification problem. As a multi-class classification problem, mixed-type defects are segmented into multiple single-type defect patterns and are subsequently classified with a network of binary classifiers (Kong & Ni, 2019, 2020b; Kyeong & Kim, 2018). On the other hand, as a multi-label classification problem, mixed-type defect detection aims to recognize the different patterns and predicts the probability per class label for a single wafer map (Lee & Kim, 2020; Wang et al., 2020).

Unsupervised learning algorithms leverage unlabeled data to learn their underlying patterns, and structure. For wafer map defect detection applications, the main unsupervised learning tasks are clustering, and pretraining. Clustering focuses on self-organization to cluster data based on similarity and dissimilarity distances. Popular clustering algorithms for wafer map defect detection include density-based spatial clustering of applications with noise (DBSCAN), ordering points to identify the clustering structure (OPTICS), and mixture models, such as Gaussian mixture models (GMM) and infinite warped mixture models (iWMM) (Ezzat et al., 2021; Fan et al., 2016; Iwata et al., 2013; Kim et al., 2018). Spatial clustering applications in WMDD aim to segment the different defect patterns for both single-type and mixed-type defects. Unsupervised methods have also been leveraged for pretraining to supplement supervised methods with unsupervised feature representation learning using autoencoders (Shon et al., 2021; Yu, 2019). By taking advantage of the plethora of unlabeled data, unsupervised pretraining methods operate to learn general feature representations to better initialize the model weights for supervised training (relative to zero or random initialization) via reconstruction errors.

Semi-supervised learning leverages both labeled and unlabeled data for the model training process. During the training process, the labeled data is utilized in the same manner as supervised learning, whereas the unlabeled data is leveraged for transduction-based inference learning. This is reflected in the loss function, where a combined, weighted loss function is defined to account for both labeled and unlabeled data. With transduction-based inference learning, all available data is observed to enhance the learned data representations for inferring missing labels. Relative to the former learning strategies, development of semi-supervised algorithms is growing to overcome the limitations imposed by supervised and unsupervised learning. For WMDD, pretraining-finetuning and semi-supervised generative models have been implemented to tackle the real-world issue of limited annotated wafer maps. For pretraining-finetuning, unsupervised pretraining methods are followed by supervised finetuning. Semi-supervised generative models are probabilistic methods, which include variational autoencoders (VAE) and modified Ladder networks (Kong & Ni, 2020a; Lee & Kim, 2020). These methods have been applied towards both single-type and mixed-type defect patterns.

Beyond the model training algorithms, enhanced learning algorithms and techniques have been applied for wafer map defect detection to boost performance, and to address the issues with labeled data availability, class imbalance, rare/unknown defect detection, and model sensitivity. These algorithms and techniques have been introduced as data augmentation, incremental learning, transfer learning, and model optimization (Bello et al., 2017; Jang et al., 2020; Ji & Lee, 2020; Shim et al., 2020).

Evaluation

Evaluation methods are used to assess the performance, and can be conducted at validation, or the final testing stage. The results from the validation stage drive hyperparameter tuning and model optimization. The evaluation methods are dependent on the data and learning approach. Across the wafer map defect detection literature, the common performance evaluation indices have been identified and summarized in Table 4 below (Hwang & Kim, 2020; Kim et al., 2018; Lee & Kim, 2020; Li et al., 2021; Saqlain et al., 2019). In Table 4, the variables TP, TN, FP, and FN represent True Positives, True Negatives, False Positives, and False Negatives respectively. The micro-recall entry of Table 4 (Eq. 8) reads:

\[ M_{Re} = \frac{\sum_{i=1}^{N}\sum_{j=1}^{C} y_i^j \hat{y}_i^j}{\sum_{i=1}^{N}\sum_{j=1}^{C} y_i^j} = \frac{\sum_{j=1}^{C} TP_j}{\sum_{j=1}^{C}\left(TP_j + FN_j\right)} \tag{8} \]

The evaluation methods for supervised learning algorithms indicate how well the model has learned by the number of correct and incorrect predictions. The (top-1) accuracy, precision, recall, and confusion matrix are the metrics typically used to evaluate and compare models. Equations (1) to (5) represent the (top-1) accuracy, precision, and recall. The accuracy indicates the total number of correctly identified wafer maps; precision signifies the total correctly identified wafer maps from all identified wafer maps, and recall indicates the total number of correctly identified wafer maps within a given set. Note that Eqs. (2) and (3), as well as Eqs. (4) and (5), represent the same equation, but are qualified by the given class i, such that the precision and recall are computed for each respective class i. The F-1 metric is the weighted average of precision and recall (Eq. 6), such that its respective score indicates how close the predicted and ground truth values are. These metrics are used to evaluate the multi-class classification performance for wafer map defect detection.

In the context of multi-label classification problems, exact match ratio (EMR), micro-precision (MPre), micro-recall (MRe), and Hamming loss can be used (Lee & Kim, 2020; Santos & Canuto, 2012; Wang et al., 2020). MPre (Eq. 7) and MRe (Eq. 8) differ from their multi-class counterparts by considering partially correct predictions, as each correct target label is counted for each sample i ∈ N, where N represents the total number of samples, and each class j ∈ C, where C represents the total number of known class labels. On the other hand, EMR (Eq. (9)) is computed similarly to accuracy and reflects all fully correct predictions. Note that y_i and ŷ_i represent the true labels and predicted labels respectively, whereas y_i^j and ŷ_i^j are the per-class-label equivalents. Hamming loss reflects the proportion of incorrectly predicted
labels to the total number of labels at the individual label level. As shown in Eq. (10), the indicator function evaluates to 1 when predicted labels do not match the ground truth label. Like MPre and MRe, N and C represent the total number of samples and the total number of known classes respectively.

As semi-supervised algorithms leverage unlabeled data for label imputation, and feature representation learning during training, the performance is evaluated like supervised algorithms. Evaluation methods like accuracy, EMR, etc. are calculated on the labeled data.

For unsupervised wafer map defect detection methods, the performance indices typically evaluate the defect clustering results. The following have been identified and described by Eqs. (11) to (15) as the commonly used evaluation metrics for unsupervised defect detection algorithms: (i) Rand Index (RI), (ii) adjusted Rand Index (ARI), (iii) normalized mutual information (NMI), (iv) adjusted mutual information (AMI), and (v) Purity. These metrics focus on comparing the clusters via similarity, and shared information.

RI is the ratio of the number of correct similar pairs (a), and correct dissimilar pairs (b) to all possible combination pairs, where n represents the number of samples. ARI is the RI, but adjusted, such that independent of the number of clusters and samples, randomly clustered samples score closer to 0, and highly similar samples score closer to 1. In Eq. (12), E[RI] indicates the expected RI value. NMI (Eq. 13) is the normalization of mutual information (MI), which results in scores between 0 and 1. For Eq. (13), I(X; Y), H(X), and H(Y) represent the mutual information between X and Y, and the entropy of X and Y respectively. AMI is mutual information adjusted, such that permutations of the class and cluster labels would not affect the score. Lastly, purity (Eq. 15) measures the accuracy of cluster assignments by tallying the number of correctly assigned samples and dividing by the total number of samples (N).

Methodologies and learning strategies

In this section, the recent developments in AI applications for WM defect recognition and classification are introduced, analyzed, and discussed. This section is organized into (1) preprocessing, (2) supervised learning, (3) unsupervised learning, (4) semi-supervised learning, and (5) enhanced learning strategies.

Preprocessing

The purpose of the data preprocessing stage is to preprocess and prepare the wafer map images for feature extraction and model training. Data preprocessing typically includes a multitude of operations for image transformations, and spatial filtering. Preprocessing operations include image size standardization, binarization, and denoising.

Image size standardization reshapes the raw wafer maps to a single, uniform size, and utilizes interpolation algorithms to minimize quality loss. Interpolation algorithms are subject to the pixel neighborhood size for approximation, such that increasing sizes result in longer rendering times and higher quality. The bicubic interpolation algorithm is typically used due to its optimal quality and time trade-off. Binarization is used to convert wafer maps to wafer bin maps, in which individual die functionality is indicated by 0s and 1s.

Image denoising (outlier detection) and filtering refers to the process of removing random defects. It is typically conducted to enhance model performance and accuracy, as the removal of random defects enhances the systematic defects. Past works have utilized spatial filtering and clustering methods to remove noise and isolate the systematic defects (Chien et al., 2013; Liu & Chien, 2013; Wang, 2008, 2009; Yuan et al., 2010). Spatial filtering algorithms focus on how to effectively differentiate between the random defects and the dies that belong to systematic defects. Spatial clustering algorithms focus on forming a separate cluster for each different defect pattern; the input to these methods has already been filtered for defect patterns. Support vector clustering (SVC) has been used in Wang (2009) and Yuan et al. (2010) for defect denoising, and identification of systematic defect patterns. SVC demonstrated robustness against noisy data, but high sensitivity to defect complexity, as clustering efficiency decreases with more complex defect patterns (i.e., multiple defects). Similarly, the k-nearest neighbors (kNN) algorithm is also used to differentiate between defective dies that belong to systematic defect patterns (Huang, 2007). The spatial randomness filter is a statistical method that checks the spatial independence of adjacent dies. The spatial independence is computed by taking the logarithm (Log) of the odds ratio (θ̂), in which the resulting Log θ̂ determines whether the wafer map is spatially random, contains a defect cluster, or contains repeating patterns (Chien et al., 2013; Liu & Chien, 2013). Although the filtering results indicate which wafer maps should be used for classification, as the spatial independence test is computed for the dies and not the pattern, misclassification can occur. Median filtering is a popular denoising method that replaces each die's value with the median value of the neighboring dies, and has been used in many works for image preprocessing (Kong & Ni, 2020a; Wang et al., 2006; Yu & Lu, 2016; Yu, 2019). Median filtering can be effective in removing the random defects; however, it may also remove important pattern information, as some of the systematic pattern dies may be removed. The thin geometries of the Scratch and Edge-Ring defects are particularly sensitive to median filtering (Fig. 5).

Fig. 5 a–d Original wafer maps, e–h Wafer maps after median filtering

It is important to note that poor spatial filtering and spatial clustering can significantly affect
downstream tasks as the filtered systematic defect pattern quality is damaged.

Wang and Chen (2019) proposed using three masking filters to preprocess wafer maps and extract rotation-invariant features for defect pattern classification. To address the limits of traditional spatial filtering methods for curvilinear and edge patterns, polar, line, and arc masks were applied at various angles to real-world wafer maps to extract features of concentric, linear, and eccentric patterns. Used to train various classifiers (i.e., neural networks, random forest, SVM), the masking filters demonstrated effectiveness with high defect recognition rates, but limited recognition for defect patterns with complex geometry (i.e., Scratch, Reticle).

The king-move neighborhood (Chien et al., 2013; Hsu et al., 2020; Wang, 2008; Wang & Ni, 2019) and Moore neighborhood (Jin et al., 2019) are utilized to compute the spatial correlation weights for the adjacent dies. Although both the king-move neighborhood and Moore neighborhood filters consider the eight surrounding dies, the Moore neighborhood filter also considers the center die. Typically, a global threshold criterion is applied to the spatial correlation weights, such that dies are removed if the criterion is not met. The downfall of using a global threshold criterion is that it does not consider the geometries and typical defect die densities of each defect, in particular the Scratch and Edge-Ring defects. According to Jin et al. (2019), their proposed DBSCAN-based algorithm considers defect pattern type for outlier detection. The outliers are completely removed for most defects (i.e., Loc, Donut, Random), and carefully removed for the Scratch and Edge-Ring defects. The authors recommended not completely removing the outliers for the Scratch and Edge-Ring defects, as defect pattern quality would deteriorate.

The above filtering methods have demonstrated limitations towards the Scratch and Edge-Ring defects due to their thin and elongated shapes. As such, Kim et al. (2018) proposed the connected-path filtering (CPF) algorithm. The CPF algorithm uses depth-first search (DFS) to explore all possible paths between two defective dies, and recognizes the connected paths that are longer than a threshold criterion as the identified defective die connected paths. Note that the CPF algorithm relies on an optimal threshold criterion to effectively detect systematic defects, which can be determined by parameter tuning or domain experts. The authors utilized domain experts to set a global threshold criterion of 12 for all defects, in which distances greater than 12 are recognized as systematic defects. The advantage of the CPF algorithm is that the threshold criterion allows for
the detection of the Scratch defect. The limitation of applying a global threshold criterion for all defect-types is that the local spatial information, such as defective die density and distribution, defect-type geometry, and disjoint connection paths (due to random defects), is not considered. Specifically, the defective dies that are not associated with a connected path are completely disregarded. Additionally, defining a universal threshold for all defect-types causes scalability issues for real-world applications, as the onset of complex and mixed-type defects would require domain experts and frequent updates to threshold values.

To address the limitations of the CPF algorithm, the graph-theoretic approach for adjacency clustering (AC) was developed by Ezzat et al. (2021). Based on graph theory, this algorithm represents the dies and the neighborhood connections on the wafer map as the graph nodes and graph edges. Although the AC algorithm is executed as a spatial clustering task, it functions as a spatial filtering method by leveraging spatial correlation information between adjacent dies to cluster the defective dies into two groups: random and systematic defects. The authors compared the AC and CPF algorithms, and demonstrated the improved performance of AC in filtering high-complexity defects, as well as its overall positive impact on the defect recognition task. The authors noted that too small or too large a separation loss would result in undesired filtering results (i.e., a weak to absent filtering effect, or same-label wafer maps), and that cross-validation may be used to determine the optimal weight trade-off. In comparison to existing preprocessing methods, this algorithm fully utilizes the available spatial information (i.e., the spatial dependency of adjacent dies), demonstrating state-of-the-art performance.

Supervised learning

Supervised learning utilizes labels as a supervisory signal for training. Early literature for wafer map defect detection mostly consists of supervised machine learning algorithms, including common models such as artificial neural networks (ANN), random forests (RF), and support vector machines (SVM). Note that in wafer map defect detection applications, multi-class classification is more popular and widely developed than multi-label learning. The methodologies discussed in this section are structured into three categories: (i) conventional machine learning, (ii) deep learning, and (iii) specialized modules.

Conventional machine learning algorithms used for WMDD include SVM, decision trees, and ensembles. Although somewhat antiquated since the onset of neural networks and deep learning, conventional ML algorithms can remain competitive. In related works, SVM and decision trees were prominently used for single-type WM defect classification, as the classifiers are relatively computationally inexpensive, stable, and can work well with high-dimensional data (Chang et al., 2012; Hsu & Chien, 2007; Kim et al., 2020b; Li & Huang, 2009; Liao et al., 2014; Ooi et al., 2013). These methods reported a high overall detection accuracy (approximately >90%); however, they demonstrated low detection rates for geometrically complex defect patterns (i.e., Donut, Scratch, mixed-types), and diminished effectiveness with imbalanced datasets. To boost overall classification accuracy, Jin et al. (2020) incorporated error-correcting output codes (ECOC) and SVM for single-type WM defect classification using CNN-based feature extraction.

Yu and Lu (2016) proposed the joint local and non-local linear discriminant analysis (JLNLDA) framework, which utilizes manifold learning to extract highly discriminative features. With the aim to preserve defect geometry in lower dimensional space, four neighborhood graphs are constructed: two graphs for local and non-local spatial information, and two penalization graphs that apply penalties to promote maximizing between-class separation and minimizing within-class separation. Geometry, gray, texture, and radon-based features were generated, followed by dimensionality reduction and feature extraction. For wafer defect detection, JLNLDA was extended to construct JLNLDA-FD, a Fisher discriminant-based recognition model that computes the discriminant function value of a wafer map belonging to each defect class, such that wafer maps are classified as the defect class with the maximum probability.

Saqlain et al. (2019) proposed a soft voting ensemble (SVE) classifier for wafer defect recognition and classification. Using the WM-811K dataset, three multi-type features (geometry-based, density-based, radon-based) are extracted and used as inputs to train the base classifiers of the ensemble. The authors used four state-of-the-art ML classifiers for the ensemble: logistic regression, gradient boosting machine (GBM), ANN, and random forest. To train the proposed ensemble, the base classifiers are trained individually using the extracted features, and then, in a soft voting approach, the results of the base classifiers are combined to output the final defect prediction. Soft voting uses weighted averages to determine the final prediction; based on performance, better performing classifiers have higher weights for voting. The authors reported a defect classification accuracy of 95.87%, demonstrating that the ensemble classifier achieves improved performance relative to any single base classifier. Although both JLNLDA and SVE achieved high defect classification rates, their performance is contingent on manually generated features, which can bottleneck performance.

Extensions of supervised ANNs have featured in WMDD literature, including the multilayer perceptron (MLP) and the general regression network (GRN) (Adly et al., 2015a, 2015b; Huang, 2007; Huang et al., 2009; Tello et al., 2018). In (Huang, 2007) and (Huang et al., 2009), self-supervised MLP models were trained to recognize clusters of defective dies; however, classification was restrained to predicting
good and bad wafers, such that limited details of the defect were learned. GRNs utilize Gaussian kernels as activation functions in the hidden layer. Adly et al. (2015b) applied a randomized bootstrapping technique to train an ensemble of GRN models, such that each model would learn from random, independently sampled data to decrease variance and increase detection accuracy. Similarly, Adly et al. (2015a) extended the previous work with a data dimensionality reduction technique, which employed Voronoi diagrams for data partitioning and K-means for clustering to represent the data at a reduced size. As the Voronoi diagrams partition the data into a vector space, smaller regions reflect different defect patterns, and K-means clustering was used to find the centroid for each region in the vector space, which was subsequently used for training. Both GRN-based models demonstrated high accuracy, and by applying the data reduction technique, computational time complexity was reduced. As these previous works considered only single-type defects, Tello et al. (2018) combined the randomized GRN (RGRN) model with a CNN model. By using information gain theory to separate the data into single-type and mixed-type defects, RGRN and CNN classify single-type and mixed-type defects respectively, achieving an overall accuracy of 86.17%. Although mixed-type defect detection was investigated, a limited range of mixed-type defects was considered.

Deep learning models employ CNNs and additional layers for training. Due to their automated feature extraction capability, deep learning models have been heavily applied to image-based tasks, including wafer map defect recognition and classification. Deep learning models typically have more than three layers, and with each progressive layer, the model extracts higher-level features. Many related works utilize CNNs for single-type and mixed-type WMDD. In (Batool et al., 2020; Bella et al., 2019; Du & Shi, 2020; Kim et al., 2020a; Maksim et al., 2019; Nakazawa & Kulkarni, 2018; Yu et al., 2019a), CNNs with customized model architectures were trained for single-type WM defect classification. For example, the custom CNN architecture by Nakazawa and Kulkarni (2018) for multi-class defect pattern classification achieved an overall test accuracy of 98.2% and considered 22 defect classes (Fig. 6), in which many classes were variations of fundamental defect patterns. It should be noted that, regarding class distinctiveness, many classes were quite similar, such that misclassification rates were high as the model had difficulty differentiating between the similar-looking defect patterns. Additionally, in multi-class classification methods, mixed-type defect detection is difficult as the most salient defect pattern is typically predicted, disregarding the other present defects.

The related works for mixed-type WM defect detection framed the problem as multi-label classification (Devika & George, 2019; Hyun & Kim, 2020; Wang et al., 2020) or multi-class classification (Byun & Baek, 2020; Kim et al., 2021; Kong & Ni, 2019, 2020b; Kyeong & Kim, 2018; Zhuang et al., 2020). For multi-label classification of mixed-type defects, CNN models used sigmoid activation to compute the probability for each defect label. On the other hand, for multi-class classification of mixed-type defects, Kyeong and Kim (2018) proposed the use of CNNs for mixed-type defect pattern classification by training multiple binary CNNs (Fig. 7). Each CNN is built to detect the absence or presence of a distinct pattern (Scratch, Ring, Circle, Zone), and then the CNN outputs are combined. By leveraging multiple CNNs, this method has the advantage of adaptability, as new defect patterns can be easily trained and added to the existing framework. Compared to SVM and multilayer perceptron (MLP), the proposed CNN achieved superior classification accuracy, recall, and precision of 0.910, 0.945, and 0.949, respectively. Similarly, in (Zhuang et al., 2020), a network of deep belief networks (DBN) was used to classify six defect patterns for single-type and mixed-type defect classification. Kong and Ni proposed mixed-type defect detection by pattern segmentation, such that overlapped defect patterns are processed into multiple single patterns, which are then classified using multiple binary CNNs (Kong & Ni, 2019, 2020b). Both proposed models achieved classification performance comparable to other high-performing models and demonstrated how pattern segmentation of overlapped mixed-type defects can improve recognition and classification accuracy. Kim et al. (2021) applied an object detection algorithm, the single shot detector (SSD), to effectively recognize, segment, and classify the multiple instances of defect patterns within a mixed-type defect sample. As object detection frameworks require bounding box (BB) information for the desired object instances, an automatic BB generator was designed, utilizing digital image preprocessing techniques and libraries (i.e., PIL, spatial filters) to obtain the BBs. The SSD algorithm simultaneously solves the object classification and localization problems, which subsequently improves run-time and performance. The SSD model utilized pretraining from large-scale image datasets, and the last output layer was fine-tuned on a selection of the WM-811K data. Compared to the CNN model, the proposed SSD model achieves higher accuracy for single-type and mixed-type defects.

The methods categorized as specialized modules integrate advanced model elements different from standardized model components, which can encompass specialized loss functions, modified kernel functions, etc. Park et al. (2020) proposed a Siamese network integrated with an uncertainty-reducing technique for class label reconstruction via G-means clustering (Fig. 8). For discriminative feature learning, the Siamese network learns feature embeddings based on similarities between the input image pairs, and aims to minimize the contrastive loss, such that embeddings for similar images are closer together and embeddings for dissimilar images are farther apart. G-means clustering leverages the
learned feature embeddings from the Siamese network to enable enhanced class label reconstruction and outlier detection. The results demonstrate that the proposed model can segment mixed-type defects; however, it has difficulty with controlling the degree of pattern segmentation and with differentiating unknown cases from known cases. By leveraging class label reconstruction, uncertainty associated with the wafer map labels can be mitigated.

Modified convolutional blocks were proposed by Wang et al. (2020), Tsai and Lee (2020a), Hyun and Kim (2020), and Alawieh et al. (2020). Wang et al. (2020) used deformable convolution networks (DCN) for multi-label classification, which demonstrated enhanced performance as deformable convolutional layers can learn and recognize the geometric variations of defect patterns. Deformable convolutional units learn two-dimensional offsets to capture different deformations of the filter sizes and geometric characteristics, which are subsequently added to a standard convolution (Fig. 9) (Dai et al., 2017; Zhu et al., 2019). The authors compared the proposed DCN to state-of-the-art mixed-type defect classification models on the Mixed WM-38 dataset, in which the results demonstrated the superior performance of DCN in the detection of complex mixed-type defects. Similarly, Tsai and Lee (2020a) incorporated depth-wise separable convolutions to improve run-time and reduce overfitting, as they have fewer parameters than standard convolutions. By using depth-wise separable convolutions, the proposed model achieved a 96.63% classification accuracy on single-type defect patterns. Another development of modified convolutional blocks was introduced by Hyun and Kim (2020): a memory module that keeps track of a fixed number of rare occurrences for each class to mitigate class imbalance issues. The memory module is used to learn high-quality representative samples in latent space for each defect class.
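The parameter savings behind depth-wise separable convolutions can be made concrete with a short sketch; the layer sizes below are illustrative choices of ours, not taken from Tsai and Lee (2020a):

```python
def conv_params(k, c_in, c_out):
    """Weights in a standard k x k convolution layer (biases omitted)."""
    return k * k * c_in * c_out

def separable_params(k, c_in, c_out):
    """Weights in a depth-wise separable convolution: one k x k filter
    per input channel, followed by a 1 x 1 point-wise convolution that
    mixes the channels."""
    depthwise = k * k * c_in   # spatial filtering, channel by channel
    pointwise = c_in * c_out   # 1 x 1 conv recombines the channels
    return depthwise + pointwise

# Illustrative layer: 3 x 3 kernels mapping 64 channels to 128 channels.
standard = conv_params(3, 64, 128)        # 73,728 weights
separable = separable_params(3, 64, 128)  # 8,768 weights
```

For this layer the separable factorization uses roughly 8x fewer weights, which is the source of the run-time and overfitting benefits noted above.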
patterns, and to determine the respective spatial dependence relative to the identified centroid defective die point. Hierarchical clustering was used by Alawieh et al. (2018) to minimize clustering sensitivity to outliers, incorporating various optimization methods to determine the optimal number of clusters, the optimal number of singular values for noise removal, and the optimal number of defect patterns. As clustering algorithms are sensitive to initialization and hyperparameters (i.e., the number of clusters), many methods struggled to determine the appropriate number of clusters for defect patterns (Patel et al., 2015; Xu & Tian, 2015).

Related works (Hwang & Kim, 2020; Jin et al., 2019; Kim et al., 2018) leveraged clustering algorithms for defect detection. Kim et al. (2018) utilized connected-path filtering, and then spatial clustering via infinite warped mixture models (iWMM). iWMMs (originally introduced in (Iwata et al., 2013)) apply a warping function to the defect clusters, such that in the latent space, the clusters have Gaussian shapes. In (Ezzat et al., 2021; Iwata et al., 2013; Kim et al., 2018), the authors report iWMM as an effective clustering algorithm due to its warping function and its ability to effectively estimate the number of clusters, which circumvents the influence of setting the number of clusters. However, Kim et al. (2018) noted that iWMM had difficulty in appropriately isolating the partial-ring defect pattern due to its complex and non-Gaussian geometry. Jin et al. (2019) introduced DBSCANWBM, a novel DBSCAN-based clustering method. DBSCANWBM inherits DBSCAN characteristics, and was adapted to: (i) consider defect-type for outlier detection, (ii) bypass the requirement to specify the number of clusters, (iii) parallelize outlier detection and defect detection, and (iv) detect both single-type and mixed-type defects. By adjusting outlier removal relative to defect-type, the systematic defect geometries can be better preserved, which in turn can improve classification accuracy. Hwang and Kim (2020) developed a one-step clustering method that combines Gaussian mixture models and the Dirichlet process (DP) in a VAE framework. Within the proposed VAE framework, DP is used to automate the updating of the number of clusters, and the GMMs are employed as a prior distribution to learn the nuances of different wafer maps. Like iWMM and DBSCANWBM, this VAE framework works without specifying the number of clusters in advance. The VAE framework encodes and decodes latent feature representations that follow a Gaussian mixture distribution (Hwang & Kim, 2020). The authors reported that their proposed clustering framework estimated the number of clusters more accurately than the comparison models, and achieved better clustering performance relative to adjusted mutual information and adjusted Rand index. The clustering methods that utilized generative models have demonstrated improved performance as the models are built to learn effective feature representations.

Unsupervised pre-training is typically conducted by training an autoencoder in an unsupervised approach to minimize the reconstruction loss and learn latent feature representations of the data. For classification tasks, a classifier is added to the trained encoder and fine-tuned; the fine-tuning adjusts both the encoder and the classifier. The general process of unsupervised pre-training is shown in Fig. 10. Shon et al. (2021) applied unsupervised pre-training and data augmentation to improve CNN classifier performance given limited labeled wafer maps. Using the unlabeled data of WM-811K, a convolutional variational autoencoder (CVAE) was trained in an effort to better initialize the feature extraction layers of the CNN classifier. Subsequently, the CVAE encoder and CNN classifier are fine-tuned in an end-to-end manner by minimizing the cross-entropy loss. The results showed that the proposed method achieved high classification performance at early epochs, indicating the benefit of unsupervised pre-training. Although pre-training can improve downstream classification performance, as the WM-811K data consists of single-type defects, the proposed model is limited in complex mixed-type defect recognition, since the CVAE may have difficulty differentiating between multiple defects with a single discriminative network. Similarly, Yu (2019) proposed a two-phase methodology for wafer map recognition: an enhanced stacked denoising autoencoder (ESDAE) for feature learning via unsupervised pre-training, followed by supervised fine-tuning. ESDAE consists of two autoencoders, which incorporate manifold regularization such that intrinsic local and nonlocal geometric information is preserved. ESDAE involves a cost-sensitive layer-wise training procedure, in which each layer is trained to minimize the reconstruction error, and different misclassification costs are assigned to different defect classes to address class imbalance. The experimental results on the influence of manifold regularization demonstrate that performance improved with an increasing degree of regularization (γ). Compared to a typical stacked denoising autoencoder (SDAE), logistic regression, DBN, and a back propagation network (BPN), ESDAE achieved the best defect recognition accuracy of 97.03%. Despite the improved performance, the proposed methodology involves generating original geometrical, gray, texture, and projection features for model training; generally, as it is difficult to estimate the effectiveness of manually generated features, model performance may be hampered. Like CVAE, ESDAE trains on the single-type defects in WM-811K, and as such is inadequate against mixed-type defects.

Semi-supervised learning

The performance of supervised learning is limited by the amount of available labels; on the other hand, without the supervisory signal from labels, the performance of unsupervised learning for defect classification is unsatisfactory
in comparison. Semi-supervised learning addresses the limitations of both supervised and unsupervised learning. Regarding real-world applicability, with limited available labels, a surplus of unlabeled wafer maps, and large volumes of incoming unlabeled wafer maps, semi-supervised learning can achieve better performance as it utilizes both labeled and unlabeled data for model training. For semi-supervised learning, the labeled wafer maps are used to learn the relevant features for each defect pattern, and then the unlabeled wafer maps are used to refine the feature representations. To the best of our knowledge, semi-supervised learning algorithms for wafer map defect recognition and classification have been scarcely developed.

Kong and Ni (2018) trained a CNN-based Ladder network in a semi-supervised manner to detect and classify wafer map defects. The semi-supervised Ladder network consists of a clean encoder, a corrupted encoder, and a decoder, which were trained and tested separately on two datasets with 22 classes of single-type defect patterns. The encoders are responsible for learning the latent features of the wafer maps. The latent features from the encoder layers are shared with the decoder through skip connections to recover additional spatial information. Given the noised latent features from the corrupted encoder, the decoder reconstructs the wafer maps with the aim of minimizing the reconstruction error at each layer. By comparing against a supervised CNN with varying amounts of labeled data, the authors established how semi-supervised learning can improve wafer map defect classification accuracy. As the proposed framework trained on two small datasets containing only single-type defect patterns, the small class sample sizes most likely skewed feature learning, such that the model had difficulty differentiating between similar-looking pattern variations. This is shown by the confusion matrices reported in (Kong & Ni, 2018), which reveal the misclassification rates of select defects. Additionally, as the datasets contained only single-type defect wafer maps, defect classification is limited and requires modification and model re-training for mixed-type defects.

In (Yu & Liu, 2020), PCACAE, a novel semi-supervised two-dimensional PCA-based convolutional autoencoder with effective feature extraction capability, is introduced. To overcome class imbalance and preserve spatial information, conditional two-dimensional PCA (C2DPCA) is proposed. C2DPCA aims to find the optimal projection direction by minimizing the reconstruction error, and, as an image projection method, can effectively map the high-dimensional wafer maps into a lower-dimensional space. By transforming the principal eigenvectors from 1D to 2D, C2DPCA-based kernels are formed, such that discriminative principal components are learned and used downstream for pretraining and fine-tuning purposes. The authors compared PCACAE performance to pretrained deep learning models (i.e., AlexNet, GoogLeNet), a stacked denoising autoencoder (SDAE), and a DBN. The results and visualizations reported in (Yu & Liu, 2020) indicate the usefulness of pretraining, and that the C2DPCA-based kernels have effective and powerful feature learning capabilities. As the PCACAE framework was trained on the WM-811K dataset, defect recognition is limited to single-defect patterns. With the use of pretraining, PCACAE
has shown reduced computational run-time (per iteration) relative to the comparison models. Although C2DPCA has been demonstrated to be effective, it is limited regarding non-linear data, as it is essentially an orthogonal linear transformation of the data.

Kong and Ni (2020a) also presented a semi-supervised variational autoencoder (SVAE) with incremental learning (Section Enhanced learning strategies) for wafer map defect classification, which was trained and tested on two datasets with 22 classes of single-type defect patterns. The proposed SVAE framework (Fig. 11) comprises three networks: (i) an inference network, (ii) a discriminative network, and (iii) a generative network. The inference network is responsible for approximating and learning the latent feature representations of the wafer map defects. The discriminative network is used to predict the labels of the unlabeled WMs, including WMs with rare/unseen defect patterns. The generative network leverages the learned latent features and predicted labels for the unlabeled wafer maps to reconstruct the original wafer map. The authors compared the classification performance of a CNN, the supervised components of SVAE, and the semi-supervised Ladder network (Kong & Ni, 2018) with different percentages of supervised training data. The results demonstrated the superior performance of the semi-supervised approach, as the Ladder network and SVAE consistently achieved higher classification accuracy than the supervised CNN, particularly at lower percentages of supervised data. Despite the improved performance, the confusion matrices showed some defect classes were prone to misclassification, which may be attributed to class imbalance, as the classes of the datasets were the defect patterns and their respective variants. Yu et al. (2019b) proposed a hybrid learning model, the stacked convolutional sparse denoising autoencoder (SCSDAE). Employing data sampling methods, SCSDAE has demonstrated effective learning of discriminative features from the single-type WM data, with performance superior to deep neural networks. Similarly to (Kong & Ni, 2018), the training and test datasets contained only single-type defect patterns, which constrains defect recognition and classification to single-type defects, disregarding the onset of mixed-type defects.

A semi-supervised convolutional deep generative model (SS-CDGMM), shown in Fig. 12, was proposed by Lee and Kim (2020). In contrast to other semi-supervised models, which established multi-class classification for single-type defect patterns, a multi-label configuration for mixed-type defect classification was utilized. Kingma et al. (2014) introduced semi-supervised deep generative models (SS-DGM), wherein the data is described as being generated by a latent class variable and a continuous latent variable. As an extension of SS-DGM, SS-CDGMM consists of multiple discriminative network structures, such that each corresponding latent class variable is dedicated to one of the fundamental defect-types. Like Kong and Ni (2020a), SS-CDGMM consists of an inference network, discriminative networks, and a generative network; however, each discriminative network is used to learn the absence or presence of its respective single-defect pattern. Compared to various models (i.e., CNN, multilayer perceptron (MLP), SS-DGM, unified VAEs), including the state-of-the-art convolutional Ladder network (ConvLadder), the results showed comparable or better performance relative to the state-of-the-art. Relative to the comparison models, SS-CDGMM demonstrated how effectively it uses labeled and unlabeled data, as well as the effectiveness of using multiple discriminative networks. However, as the training and test data were generated and balanced across the classes, the impact of class imbalance has not been investigated or addressed. Additionally, only four distinct single-type defect patterns were considered, disregarding the other known distinct defect patterns (i.e., Donut, Near-full). Although more defect patterns may be considered, this would result in higher run-times, as the marginal log-likelihood component of the objective function requires computation over all defect classes.

Moving away from generative modelling, self-supervised pretraining is emerging as an effective pretraining method for semi-supervised frameworks and classification tasks (He et al., 2019; Chen et al., 2020b). Self-supervised contrastive
learning has been increasingly leveraged as a feature learning method, wherein meaningful representations can be learned from unlabeled data and data augmentations. Hu et al. (2021) proposed a contrastive learning framework for single-type defect patterns, followed by supervised fine-tuning of a classifier. Although overall performance was lower than that of other algorithms, the reported results demonstrate detection rates on par with state-of-the-art contrastive methods (i.e., SimCLR), and great potential for contrastive learning.

Enhanced learning strategies

The methods included in this section focus on enhancing model learning, and are used to elevate model performance. They are organized into the following groups: (i) data augmentation, (ii) incremental learning, (iii) transfer learning and fine-tuning, and (iv) model optimization.

Data augmentation aims to reduce overfitting by increasing the amount of data, and is typically used to mitigate class imbalance issues, to which neural networks and deep learning models are particularly sensitive (Perez & Wang, 2017). Data augmentation can be executed in many ways, such as resampling, data modification, and data generation.

Resampling methods function to balance the class distribution of the existing data. Undersampling and oversampling are subcategories of resampling methods. Undersampling reduces the number of data examples from the majority classes by removing data, whereas oversampling increases the number of data examples by sampling from the minority classes with replacement. Both subcategories of resampling methods are effective for obtaining a more balanced class distribution; however, they have their share of limitations. Undersampling ultimately reduces the overall amount of data, and may discard critical data examples from the majority classes, which may impede feature learning and model performance. Oversampling may result in overfitting and increased generalization error, as well as increased computational time, since the overall amount of training data is increased. Due to the limitations of resampling methods, data augmentation via modification and generation is typically conducted instead, as these approaches can increase and balance the amount of data as well as increase data diversity.
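The resampling trade-offs described above can be sketched with a naive random oversampler; this is our own minimal illustration, and production pipelines typically rely on a library such as imbalanced-learn:

```python
import numpy as np

def random_oversample(X, y, seed=0):
    """Randomly oversample every minority class, with replacement,
    until all classes match the majority-class count. Because minority
    examples are duplicated, the overfitting risk noted above applies."""
    rng = np.random.default_rng(seed)
    classes, counts = np.unique(y, return_counts=True)
    target = counts.max()                  # majority-class size
    xs, ys = [], []
    for cls in classes:
        idx = np.where(y == cls)[0]
        take = rng.choice(idx, size=target, replace=True)
        xs.append(X[take])
        ys.append(y[take])
    return np.concatenate(xs), np.concatenate(ys)

# Imbalanced toy set: five "no pattern" wafer maps vs. two "Scratch" maps.
X = np.arange(14).reshape(7, 2)   # stand-in features, one row per wafer map
y = np.array([0, 0, 0, 0, 0, 1, 1])
X_bal, y_bal = random_oversample(X, y)
```

Undersampling is the mirror image: each class is sampled down to the minority-class count instead, at the cost of discarding majority-class examples.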
Data modification methods apply label-preserving operations to create synthetic variations of the existing data. Considering the circular shape of the wafer maps and the diversity of defect patterns, select geometric operations can be applied to maintain the geometric characteristics and original labels. In Kang (2020) and Jang et al. (2020), rotation and horizontal flipping operations were applied to create diversified, rotation-invariant wafer maps, which subsequently improved defect classification performance. Similarly, Saqlain et al. (2020) applied random rotations of 10°, horizontal flipping, width shift, height shift, shearing, channel shifting, and zooming to augment the data. These operations are used as they diversify the data with changes in orientation, position, and/or size. The different operations used in data modification methods help improve model generalization as models are trained to be highly tolerant to the diversified variations of defect patterns.

The data generation methods utilize generative models, such as generative adversarial networks and autoencoders, to supplement the existing collection of data by generating new synthetic data. The generative models focus on learning the latent feature representations and distributions of the data. As the performance of many deep learning models is contingent on the amount and distribution of labeled data, data generation methods are used to create realistic, new instances of data. GANs consist of two convolutional neural networks: a generator and discriminator (Fig. 13). The generator learns to create authentic fake data, and the discriminator learns to distinguish between the real and fake data. Variations of GANs have been developed to improve the generative modelling capability. Wang et al. (2019) proposed the adaptive balancing generative adversarial network (AdaBalGAN), a conditional categorical GAN that incorporates imbalanced learning to generate a balanced set of synthetic data. In addition to the generator and discriminator, AdaBalGAN includes an adaptive generative controller, which recognizes the minority defect classes by considering defect class size, as well as the recognition accuracy difference between each defect class and the majority defect class. By recognizing the imbalanced class distribution, the adaptive generative controller automatically adjusts the number of synthesized wafer maps for each defect-type. Ji and Lee (2020) developed a deep convolutional GAN, which compounds the image processing capabilities of multiple convolutional layers, for data augmentation. Aside from GANs, there are many types of autoencoders, including variational, convolutional, denoising, stacked, and sparse; the fundamental components of autoencoders are the encoder and decoder. The encoder compresses the input into a latent space representation, and the decoder uses the latent representation to reconstruct the input. Shawon et al. (2019) and Tsai and Lee (2020b) utilized a convolutional autoencoder (CAE) to generate new instances of denoised training data to improve model training of deep convolutional neural networks. Similarly, in (Lee & Kim, 2020), the authors employed the trained VAE to generate labeled wafer maps by leveraging the learned class latent variables for each defect-type. Data augmentation via generation can create highly diverse and realistic data; however, it requires substantial computational time and power to effectively train the generative models.

Incremental learning (IL) aims to increase model performance by extending and adapting an existing model's knowledge base with new training data. In the context of real-world wafer fabrication, labels are expensive to obtain and have limited availability, which bottlenecks model performance. Additionally, as defect complexity evolves and new, unseen defect patterns emerge, model efficiency may decrease over time. As such, IL methods are employed to enhance model performance in the long-term against evolving wafer map data and defect patterns. Popular methods include active learning and pseudo-labeling.

Active learning utilizes a querying strategy to select informative unlabeled data for manual annotation to fine-tune and further train an existing model (Fig. 14). It is important to note that there are many querying strategies (Settles, 2009), including uncertainty sampling, information gain, query-by-committee, expected error reduction, and total expected variance minimization. Shim et al. (2020) proposed a CNN with active learning via uncertainty sampling for wafer map defect classification. Uncertainty sampling selects the most ambiguous unlabeled data examples; least confidence, margin, and entropy are common estimators for uncertainty. In addition to the common uncertainty estimators, the authors compared mean standard deviation, variation ratio, Bayesian active learning by disagreement (BALD), and predictive entropy as uncertainty estimation methods. Their results indicated that BALD and mean standard deviation provided the best performance for defect classification via CNN with active learning. On the other hand, Kong and Ni (2020a) employed active learning using information entropy for their semi-supervised models, such that the unlabeled wafer maps with the maximum information entropy were selected for labeling and model fine-tuning. When investigating the significance of active learning and pseudo-labeling, the results demonstrated improved classification accuracy. Although active learning strategies have helped in improving model performance, they are vulnerable to class imbalance and catastrophic forgetting. Class imbalance introduces sampling bias in query sampling, which skews the querying towards the newer classes (Ren et al., 2020), and brings on catastrophic forgetting. In the process of finetuning the model with the new labeled data, catastrophic forgetting can occur when the previously learned information is degraded, which significantly lowers model generalization (Luo et al., 2020). The effectiveness of active learning methods is sensitive to
the querying and model updating strategies, which warrants careful consideration for model implementation.

Pseudo-labeling supplements the incremental training data for model fine-tuning with predicted class labels for the unlabeled data. As a semi-supervised learning strategy, this method uses an existing trained model to assign the pseudo-label as the class with the maximum predicted probability. The pseudo-labels for the unlabeled data increase the overall training dataset size; however, it is important to note that pseudo-labels may disturb model performance if they are incorrectly predicted. Kong and Ni (2020a) implemented pseudo-labeling with confidence level constraints. The authors computed and compared the information entropy for each unlabeled wafer map against a criterion threshold to ensure highly confident wafer maps were used for model fine-tuning. Similarly, to account for uncertainty, a 2:1 ratio of original labeled wafer maps to pseudo-labeled wafer maps was used to diminish the potential disturbance from incorrect pseudo-labels.

Table 5 Models compared: T-DenseNet (Shen & Yu, 2019); VGG (Ishida et al., 2019); Faster R-CNN-KITTI (Chien et al., 2020); Faster R-CNN-COCO (Chien et al., 2020)

Transfer learning is the process of utilizing a pretrained model for another task. As the pretrained models were trained
on large, diverse image datasets (i.e., ImageNet, CIFAR-10), it is presumed that the model effectively learned feature representations and obtained powerful generalization capabilities. The learned feature representations of the pretrained models can be re-purposed to train a new classifier, or the pretrained models can be fine-tuned to fit a specific dataset and task. The application of transfer learning and fine-tuning can significantly reduce training time, and achieve high performance without requiring large volumes of data. Related works have utilized pretrained models for wafer map defect recognition and classification. Shen and Yu (2019) proposed the T-DenseNet framework; the pretrained DenseNet model was fine-tuned on the wafer map dataset, and then the refined feature representations were used to set up an online testing system for incoming unlabeled wafer maps. Similarly, the pretrained VGG model (Ishida et al., 2019) and faster R-CNN model (Chien et al., 2020) were utilized for wafer map defect recognition and classification. In Table 5, the performance of the models in (Chien et al., 2020; Ishida et al., 2019; Shen & Yu, 2019) for the test dataset is shown, reflecting how effective deep transfer learning is despite the shorter training times, and how pretrained model selection may affect performance on the downstream tasks.

Research has demonstrated the importance of model selection and hyperparameter tuning as these design choices (i.e., conditional variables, objective function, architecture, etc.) significantly influence performance (Banchhor & Srinivasu, 2021; Parsa et al., 2020; Ungredda & Branke, 2021). As an enhanced learning strategy, model optimization concentrates on the advanced strategies for optimizing model parameters. For image tasks like wafer map defect recognition and classification, the design choices for model architecture can affect performance and computational time. Standard optimization techniques involve extensive hyperparameter tuning of layer parameters (i.e., stride, filter size, etc.), training batches and epochs, etc., which typically requires extensive manual searching. As such, strategies for optimization policies and network architecture engineering have been developed to automate the design process.

Recently, reinforcement learning (RL) models have been leveraged to parse optimization policies. Bello et al. (2017) train a recurrent neural network (RNN) controller with RL for neural optimizer searching. Essentially, the performance of child networks trained with different sets of optimizer update rules is compared to determine the optimal set of updating rules for optimization methods (Bello et al., 2017). Similarly, in (Shon et al., 2021), RL was used to train an RNN controller to determine the optimal data augmentation policy for wafer map transformation operations (i.e., rotation, flipping, zooming). The general training process for RNN controllers and search algorithms is shown in Fig. 15. Architecture engineering is used to learn and automate the design process of deep neural network design selection. Related works, like (Baker et al., 2017; Zoph et al., 2018), have also used RL to explore and discover high performing network architectures relative to task and dataset. In both applications, RL was leveraged as a search algorithm for optimal parameters and design, which improved model training and performance, but required separate and extensive training.

In contrast to various existing frameworks for global optimization of hyperparameters and model parameters (i.e., grid search, random search, sequential search), Bayesian optimization (BO) frameworks have demonstrated state-of-the-art performance with high efficiency in computationally expensive-to-evaluate applications (Snoek et al., 2012, 2015). As a black-box method, BO algorithms aim to probabilistically model the unknown function—commonly with Gaussian processes—and establish the posterior distribution of the respective results for the explored hyperparameter settings. By maintaining the resulting posterior distribution and updating it as new settings are evaluated, BO can balance exploration and exploitation when selecting the next hyperparameter setting to evaluate.
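The BO loop described above (probabilistic surrogate, posterior update, acquisition-driven selection) can be sketched with a Gaussian-process surrogate and an expected-improvement rule. The 1-D objective, RBF kernel length-scale, candidate grid, and iteration budget below are illustrative assumptions, not taken from the surveyed works:

```python
import numpy as np

# Toy Bayesian optimization of one hyperparameter (e.g., a learning-rate
# exponent) against a made-up validation-loss curve.

def objective(x):                       # pretend validation loss (illustrative)
    return (x - 0.3) ** 2 + 0.05 * np.sin(8 * x)

def rbf(a, b, ls=0.2):
    # squared-exponential kernel between two 1-D point sets
    d = np.asarray(a).reshape(-1, 1) - np.asarray(b).reshape(1, -1)
    return np.exp(-0.5 * (d / ls) ** 2)

def gp_posterior(xs, ys, grid, noise=1e-6):
    # standard GP regression equations: posterior mean and std on the grid
    K = rbf(xs, xs) + noise * np.eye(len(xs))
    Ks = rbf(grid, xs)
    Kinv = np.linalg.inv(K)
    mu = Ks @ Kinv @ ys
    var = np.clip(np.diag(rbf(grid, grid) - Ks @ Kinv @ Ks.T), 1e-12, None)
    return mu, np.sqrt(var)

def expected_improvement(mu, sigma, best):
    # EI for minimization: (best - mu) * Phi(z) + sigma * phi(z)
    from math import erf
    z = (best - mu) / sigma
    cdf = 0.5 * (1 + np.vectorize(erf)(z / np.sqrt(2)))
    pdf = np.exp(-0.5 * z ** 2) / np.sqrt(2 * np.pi)
    return (best - mu) * cdf + sigma * pdf

grid = np.linspace(0, 1, 101)
xs = np.array([0.0, 0.5, 1.0])          # initial evaluations
ys = objective(xs)
for _ in range(5):                       # BO loop: fit, acquire, evaluate
    mu, sigma = gp_posterior(xs, ys, grid)
    x_next = grid[np.argmax(expected_improvement(mu, sigma, ys.min()))]
    xs = np.append(xs, x_next)
    ys = np.append(ys, objective(x_next))

print("best x:", xs[ys.argmin()], "best loss:", ys.min())
```

Practical frameworks additionally learn the kernel hyperparameters, model observation noise, and handle multi-dimensional search spaces, which is where the computational cost discussed later in this section comes from.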
Handcrafted feature extraction requires the defect patterns to be known or understood well enough to generate effective features. The onset of CNNs was prompted by the automated feature extraction capability in which rich, descriptive features can be learned. Similarly, with representation learning, raw data can be used with minimal preprocessing and can gain high-level discriminatory power for complex patterns, proving that feature representation learning methods can generate more meaningful and effective features for downstream defect pattern classification. However, the capacity of feature representation learning is constrained by model complexity, and whether the model is suited towards learning the complex structure of the data and problem (task) at hand.

For supervised learning methods, although the use of labels can help models achieve improved performance with low computational cost, they are limited by the following: (i) the amount of labeled data, (ii) heavy influence of class label distribution and data splits, and (iii) overfitting. As the acquisition of labels is expensive, limited amounts of high-quality wafer maps are available for training and testing; these limited amounts bottleneck classification performance and highlight limitations in terms of real-world scalability. To tackle this limitation, related works have employed data augmentation techniques to supplement the small-sized datasets, however with the risk of increasing computational costs and generalization error. Similarly, related works implemented specialized modules and deep learning networks to improve learning. With modified modules like the deformable convolutional unit and the usage of specialized loss functions (i.e., contrastive loss, triplet loss), discriminative feature representations were learned. However, these works focused on single-type defects, or considered a limited degree of diversity for the mixed-type defects. In the face of new defects and combinations, the performance of these algorithms may decrease with the growth in the number of defect classes as class distinctiveness and imbalance may take a toll. As labels are used as the supervisory signal for training, model performance is sensitive to class label distribution, data splits, and class distinctiveness. Class imbalance can induce overfitting on the majority classes, with high misclassification on the minority classes, and the model may not be able to differentiate between similar defect patterns. This is a critical issue as supervised methods are highly susceptible to overfitting because training is contingent on labels, such that careful considerations should be made for model and training process parameters to prevent overfitting. Relative to defect type, the majority of works for supervised algorithms are focused on the detection of single-type defects, and are not suitable for recognizing mixed-type defects, despite the increasing relevance of mixed-type defects. Multi-label based mixed-type WMDD has limited development and has not been extensively studied for scalability under low resource settings, whereas for multi-class based mixed-type defect detection, much more literature exists. It has been noted that additional computational power was required to train the network of binary classifiers, which only considered a limited range of defects, and required large amounts of data for sufficient training. The prominent supervised methods are summarized in Table 6.

As supervised methods face performance limitations due to the amounts of labeled wafer maps, the unsupervised methods demonstrate how the plethora of unlabeled wafer maps can be leveraged. Despite achieving comparable defect detection performance, unsupervised clustering algorithms are sensitive to kernel methods and their respective parameters, and typically have high time complexity, resulting in long run-times. These methods are sensitive to initialization and hyperparameters, indicating the criticality of hyperparameter optimization for performance (Samariya & Thakkar, 2021). Related works have recognized this difficulty of using pre-set parameters (i.e., number of clusters), and in response adapted clustering algorithms with the ability to estimate the number of clusters. However, as these methods involve inference networks, computational complexity increases, leading to long inference times which subsequently increase overall run-time. It is important to note that despite the importance of hyperparameter optimization, related works employed simpler optimization frameworks, such as grid search or low-level sensitivity evaluation. Based on reconstruction error, unsupervised pretraining is utilized to improve the initialization of the model weights relative to random initialization, such that training time is faster as the weights are closer to a local optimum. However, in (Alberti et al., 2017), the authors demonstrated how minimizing the reconstruction error for layer-wise training of the autoencoder is not optimal for downstream finetuning for classification tasks, as the learned feature representations may not necessarily be meaningful (i.e., an identity function may be learned). The literature for unsupervised pretraining methods demonstrates that representation learning with unlabeled data can be advantageous but needs an effective strategy to learn meaningful feature representations without high computational costs. In Table 7, the performance of the prominent unsupervised clustering algorithms is summarized.

Semi-supervised algorithms address the issues of data availability and ineffective feature learning from supervised and unsupervised methods, demonstrating how the use of both labeled and unlabeled data can achieve improved defect recognition and classification. In particular, the semi-supervised deep generative modelling approach has shown effective latent representation learning and generative capabilities, but at a relatively high computational cost. It is important to note that with limited amounts of labeled data, model selection is quite important for semi-supervised learning to avoid overfitting and to promote effective representation learning (Kingma et al., 2014). In comparison to supervised and unsupervised methods, the literature for semi-supervised methods is scarcely developed despite the promising
results, indicating great potential in future developments. In Table 8, the prominent semi-supervised methods are summarized, including the methods that utilized unsupervised pretraining and supervised finetuning.

Table 7 Summary of prominent unsupervised clustering algorithms for wafer map defect detection

  Kim et al. (2018)       CPF-iWMM     Real               RI: 0.96; ARI: 0.92; NMI: 0.92
  Taha et al. (2018)^a    DDPfinder    Real & Synthetic   0.9980
  Hwang and Kim (2020)    DPGMM        Real               ARI: 0.76; AMI: 0.76

  ^a Result reflects the clustering accuracy, which is the algorithm's ability to correctly cluster defect patterns

Table 8 Summary of prominent semi-supervised algorithms for wafer map defect detection^1

  Yu et al. (2019b)^2     SCSDAE                                      WM-811K (Real)     0.9883
  Yu (2019)               ESDAE                                       WM-811K (Real)     0.9703
  Hu et al. (2021)^3      Semi-supervised with contrastive learning   WM-811K (Real)     0.7790
  Lee and Kim (2020)^4    SS-CDGMM                                    Real & Synthetic   0.9488
  Kong and Ni (2020b)     Semi-supervised Ladder Network              Real               0.9020
  Kong and Ni (2020a)     SVAE                                        Real               0.9030
  Yu and Liu (2020)       PCACAE                                      WM-811K (Real)     0.9377
  Shon et al. (2021)      CVAE + CNN                                  WM-811K (Real)     0.9689

  ^1 Performance results here reflect the overall average recognition rates. ^2 Based on performance with original dataset. ^3 Performance notes the best overall accuracy. ^4 Notes the highest EMR score from the labeled-unlabeled ratios.

Enhanced learning strategies were used to boost defect recognition and classification performance. Data augmentation methods utilized image transformations and/or generative models to mitigate class imbalance issues, and subsequently increase data diversity. Although GANs have advanced data generation capabilities, they require substantial computational time to effectively train the generator and discriminator networks. As GAN training involves a trade-off between the generator and discriminator, the models are
susceptible to getting stuck in local minima. For incremental learning strategies, techniques like active learning and pseudo-labeling have demonstrated capability in boosting model performance. However, they are susceptible to catastrophic forgetting and hyperparameter sensitivity (i.e., querying strategy, ratio of original to pseudo-labeled data). With the help of transfer learning, many training processes have been expedited to achieve relatively high accuracy with shorter training times. However, as model complexity, data, and other design choices can impact performance, model selection and hyperparameter tuning need to be carefully considered. For model optimization strategies, RL and BO frameworks are used to bypass the extensive manual searching. These strategies are important in understanding the sensitivities a model may have to input/output, architecture, etc. Although RL imposes extensive training to determine optimal parameters and design, BO provides a more computationally efficient alternative to tuning model hyperparameters. However, it should be noted that for multiple objectives and increasing numbers of observations, BO frameworks become more computationally complex, which subsequently requires more processing resources.

Challenges and outlook

In this article, we survey the literature of ML and DL applications for wafer map defect recognition and classification, which demonstrated superior performance, as well as great potential and applicability for in-line integration. However, despite the reported successes, many challenges in implementing these methodologies have been identified, including difficulty learning new defects, difficulty differentiating between similar defect patterns, taxing computational loads, and lack of robust detection of complex defects. With respect to the surveyed literature, the following findings emerged as the most prominent challenges in the WMDD field: (i) data availability, (ii) mixed-type defects, and (iii) high computational complexity. The field of WMDD is continuously developing; however, there is limited access to databases that reflect the current design and complexity of wafers and ICs. This is apparent in recent works that utilize the WM-811K dataset, which is most likely outdated in terms of wafer size, IC node size, etc. Similarly, as only private data can properly reflect the present design standards, and such data has restricted access, the innovation and research for WMDD is slowed. Although simulated data is an option, there is currently a gap in producing realistic, synthetic wafer maps similar to real defect patterns. The majority of existing literature focuses on single-type defects, despite the growing criticality of mixed-type defects. Although mixed-type defect detection algorithms exist, many are limited in terms of labeled data availability, range of defect pattern types, and computational load. Many developments impose a high computational load, which in turn restricts scalability and potential deployment for real-time implementation, and increases training time and needed processing resources. As this industry will remain competitive and continuously growing, computational complexity should be reduced to be more efficient. These existing challenges bear on implementation, scalability, and adaptability to new state-of-the-art designs and feature sizes.

With the plethora of unlabeled data available, recent developments that leverage ML/DL for self-supervised and semi-supervised learning indicate potential to surpass supervised learning for efficient feature representation learning, image recognition, and classification. To promote future developments for defect detection, which would allow researchers and engineers to validate and test against new designs and feature sizes, consideration towards building a database with real-world defects is needed. Consolidating continuous innovation, growth, and development indicates great promise towards achieving efficient and robust defect detection. Based on the challenges and current landscape of this field, the future outlook of WMDD research is summarized as follows:

(1) Handling class imbalance: As many works have focused on tackling the class imbalance issue, it is evident that performance suffers with skewed data distribution. Development of more robust handling of class imbalance is needed, particularly as mixed-type defects become more critical.
(2) Effective unsupervised feature representation learning: As ML/DL and computer vision applications are increasingly developing self-supervised techniques for image classification and pattern segmentation, these methods should be investigated, especially in the face of limited labeled data and the limitations of pretraining via reconstruction loss.
(3) Real-time Monitoring: The majority of developments are offline systems; consideration of model requirements is needed to meet the conditions for real-time monitoring and operation.
(4) Computational Complexity: With respect to real-time monitoring, more efficient and less computationally complex algorithms are needed to reduce the burden from training and processing, memory requirements, and scalability limitations.
(5) Model Optimization: Due to the complex parameter-structure-performance relationship, calibrating model selection and the optimal set of parameters is needed. From the existing literature, there is limited exploration and investigation into model optimization and joint hyperparameter tuning.
Acknowledgements The work described in this paper was supported by the Natural Sciences and Engineering Research Council of Canada (NSERC) under grant RGPIN-217525. The authors are grateful for their support.

Funding This work was supported by Natural Sciences and Engineering Research Council of Canada (NSERC), Grant RGPIN-217525.

Appendix

Abbreviation  Term
AC  Adjacency Clustering
AdaBalGAN  Adaptive Balancing Generative Adversarial Network
AMI  Adjusted Mutual Information
ANN  Artificial Neural Network
ARI  Adjusted Rand Index
BALD  Bayesian Active Learning by Disagreement
BO  Bayesian Optimization
BB  Bounding Box
BPN  Back Propagation Network
C2DPCA  Conditional Two-Dimensional PCA
CCD  Charge-Coupled Devices
CMP  Chemical Mechanical Process
CNN  Convolutional Neural Network
CPF  Connected-Path Filtering
CVAE  Convolutional Variational Autoencoder
CZ  Czochralski
DBN  Deep Belief Network
DBSCAN  Density-based Spatial Clustering of Applications with Noise
DCN  Deformable Convolutional Network
DCNN  Deep Convolutional Neural Network
DDPfinder  Dominant Defective Patterns Finder
DFS  Depth-first Search
DL  Deep Learning
DP  Dirichlet Process
ECOC  Error-Correcting Output Codes
EMR  Exact Match Ratio
ESDAE  Enhanced Stacked Denoising Autoencoder
EUV  Extreme Ultraviolet
FAM  Fuzzy ARTMAP
FD  Fischer-discriminant
FZ  Float-zone
GAN  Generative Adversarial Network
GBM  Gradient Boosting Machine
GMM  Gaussian Mixture Model
GRN  Generalized Regression Network
IC  Integrated Circuit
IL  Incremental Learning
ILT  Inverse-lithography Technology
iWMM  Infinite Warped Mixture Model
JLNLDA  Joint Local and Non-local Linear Discriminant Analysis
kNN  k-Nearest Neighbors
LDA  Linear Discriminant Analysis
LLE  Locally Linear Embedding
LR  Logistic Regression
MDS  Multi-Dimensional Scaling
MI  Mutual Information
ML  Machine Learning
MLP  Multi-Layer Perceptron
MPre  Micro-Precision
MRe  Micro-Recall
NMI  Normalized Mutual Information
OPTICS  Ordering Point to Identify the Cluster Structure
PCA  Principal Component Analysis
PCACAE  PCA-based Convolutional Autoencoder
RCA  Root-cause Analysis
RF  Random Forest
RGRN  Randomized General Regression Network
RI  Rand Index
RL  Reinforcement Learning
RNN  Recurrent Neural Network
SAT  Scanning Acoustic Tomography
SCSDAE  Stacked Convolutional Sparse Denoising Autoencoder
SDAE  Stacked Denoising Autoencoder
SEM  Scanning Electron Microscopy
SS-CDGMM  Semi-supervised Convolutional Deep Generative Multiple Models
SSD  Single Shot Detector
SVAE  Semi-supervised Variational Autoencoder
SVC  Support Vector Clustering
SVE  Soft Voting Ensemble
SVM  Support Vector Machine
t-SNE  t-distributed Stochastic Neighbor Embedding
TTV  Total Thickness Variation
UV  Ultraviolet
VAE  Variational Autoencoder
WBM  Wafer Bin Map
WM  Wafer Map
WMDD  Wafer Map Defect Detection

References

Adly, F., Alhussein, O., Yoo, P. D., Al-Hammadi, Y., Taha, K., Muhaidat, S., Jeong, Y.-S., Lee, U., & Ismail, M. (2015a). Simplified subspaced regression network for identification of defect patterns in semiconductor wafer maps. IEEE Transactions on Industrial Informatics, 11(6), 1267–1276. https://doi.org/10.1109/TII.2015.2481719
Adly, F., Yoo, P., Muhaidat, S., Al-Hammadi, Y., Lee, U., & Ismail, M. (2015b). Randomized general regression network for identification of defect patterns in semiconductor wafer maps. IEEE Transactions on Semiconductor Manufacturing, 28(2), 145–152. https://doi.org/10.1109/tsm.2015.2405252
Airaksinen, V.-M. (2015). Silicon wafer and thin film measurements. In M. Tilli, T. Motooka, V.-M. Airaksinen, S. Franssila, M. Paulasto-Kröckel, & V. Lindroos (Eds.), Handbook of Silicon Based MEMS Materials and Technologies (2nd Ed., pp. 381–390). https://doi.org/10.1016/B978-0-323-29965-7.00015-4
Alawieh, M. B., Boning, D., & Pan, D. Z. (2020). Wafer map defect patterns classification using deep selective learning. In 2020 57th ACM/IEEE Design Automation Conference (DAC). https://doi.org/10.1109/dac18072.2020.9218580
Alawieh, M. B., Wang, F., & Li, X. (2018). Identifying wafer-level systematic failure patterns via unsupervised learning. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 37(4), 832–844. https://doi.org/10.1109/TCAD.2017.2729469
Alberti, M., Seuret, M., Ingold, R., & Liwicki, M. (2017, December 17). A Pitfall of Unsupervised Pre-Training. arXiv.org. Retrieved from https://arxiv.org/abs/1712.01655.
Baker, B., Raskar, R., Naik, N., & Gupta, O. (2017). Designing Neural
Bella, R. D., Carrera, D., Rossi, B., Fragneto, P., & Boracchi, G. (2019, September). Wafer defect map classification using sparse convolutional networks. In International Conference on Image Analysis and Processing (pp. 125–136). Springer, Cham.
Bello, I., Zoph, B., Vasudevan, V., & Le, Q. V. (2017). Neural Optimizer Search with Reinforcement Learning. In Proceedings of 34th International Conference on Machine Learning (pp. 459–468). Sydney. Retrieved from https://arxiv.org/abs/1709.07417.
Banchhor, C., & Srinivasu, N. (2021). Analysis of Bayesian optimization algorithms for big data classification based on Map Reduce framework. Journal of Big Data, 8(1), 81. https://doi.org/10.1186/s40537-021-00464-4
Byun, Y., & Baek, J. G. (2020). Mixed pattern recognition methodology on wafer maps with pre-trained convolutional neural networks. In A. Rocha, L. Steels, & J. van den Herik (Eds.), ICAART 2020 - Proceedings of the 12th International Conference on Agents and Artificial Intelligence (pp. 974–979). (ICAART 2020—Proceedings of the 12th International Conference on Agents and Artificial Intelligence; Vol. 2). SciTePress.
Chang, C.-W., Chao, T.-M., Horng, J.-T., Lu, C.-F., & Yeh, R.-H. (2012). Development pattern recognition model for the classification of circuit probe wafer maps on semiconductors. IEEE Transactions on Components, Packaging and Manufacturing Technology, 2(12), 2089–2097. https://doi.org/10.1109/TCPMT.2012.2215327
Chen, H.-C. (2020). Automated detection and classification of defective and abnormal dies in wafer images. Applied Sciences, 10(10), 3423. https://doi.org/10.3390/app10103423
Chen, S.-H., Kang, C.-H., & Perng, D.-B. (2020a). Detecting and measuring defects in wafer die using GAN and YOLOv3. Applied Sciences, 10(23), 8725. https://doi.org/10.3390/app10238725
Chen, T., Kornblith, S., Norouzi, M., & Hinton, G. (2020b). A Simple Framework for Contrastive Learning of Visual Representations.
Chen, F.-L., & Liu, S.-F. (2000). A neural-network approach to recognize defect spatial pattern in semiconductor fabrication. IEEE Transactions on Semiconductor Manufacturing, 13(3), 366–373. https://doi.org/10.1109/66.857947
Cheon, S., Lee, H., Kim, C. O., & Lee, S. H. (2019). Convolutional neural network for wafer surface defect classification and the detection of unknown defect class. IEEE Transactions on Semiconductor Manufacturing, 32(2), 163–170. https://doi.org/10.1109/tsm.2019.2902657
Chien, C.-F., Hsu, S.-C., & Chen, Y.-J. (2013). A system for online detection and classification of wafer bin map defect patterns for manufacturing intelligence. International Journal of Production Research, 51(8), 2324–2338. https://doi.org/10.1080/00207543.2012.737943
Chien, J.-C., Wu, M.-T., & Lee, J.-D. (2020). Inspection and classification of semiconductor wafer surface defects using CNN deep learning networks. Applied Sciences, 10(15), 5340. https://doi.org/10.3390/app10155340
Choi, G., Kim, S.-H., Ha, C., & Bae, S. J. (2012). Multi-step ART1 algorithm for recognition of defect patterns on semiconductor wafers. International Journal of Production Research, 50(12), 3274–3287. https://doi.org/10.1080/00207543.2011.574502
Cuevas, A., & Sinton, R. A. (2018). Chapter III-1-A - Characterization and Diagnosis of Silicon Wafers, Ingots, and Solar Cells. In D. Macdonald & S. A. Kalogirou (Eds.), McEvoy's Handbook of Photovoltaics (3rd Ed., pp. 1119–1154). Essay, Academic Press.
Dai, J., Qi, H., Xiong, Y., Li, Y., Zhang, G., Hu, H., & Wei, Y. (2017).
Network Architectures using Reinforcement Learning. In Proc. of Deformable convolutional networks. In 2017 IEEE International
ICLR 2017. Retrieved from https://arxiv.org/abs/1611.02167. Conference on Computer Vision (ICCV). https://doi.org/10.1109/
Batool, U., Shapiai, M. I., Fauzi, H., & Fong, J. X. (2020). Convolutional iccv.2017.89
neural network for imbalanced data classification of silicon wafer
defects. In 2020 16th IEEE International Colloquium on Signal
Processing Its Applications (CSPA), 230–235. https://doi.org/10.
1109/CSPA48992.2020.9068669
Devika, B., & George, N. (2019). Convolutional neural network for semiconductor wafer defect detection. In 2019 10th International Conference on Computing, Communication and Networking Technologies (ICCCNT) (pp. 1–6). https://doi.org/10.1109/ICCCNT45670.2019.8944584

di Palma, F., de Nicolao, G., Miraglia, G., Pasquinetti, E., & Piccinini, F. (2005). Unsupervised spatial pattern classification of electrical-wafer-sorting maps in semiconductor manufacturing. Pattern Recognition Letters, 26(12), 1857–1865. https://doi.org/10.1016/j.patrec.2005.03.007

Du, D.-Y., & Shi, Z. (2020). A wafer map defect pattern classification model based on deep convolutional neural network. In 2020 IEEE 15th International Conference on Solid-State & Integrated Circuit Technology (ICSICT) (pp. 1–3). https://doi.org/10.1109/ICSICT49897.2020.9278021

Ebayyeh, A. A., & Mousavi, A. (2020). A review and analysis of automatic optical inspection and quality monitoring methods in electronics industry. IEEE Access, 8, 183192–183271. https://doi.org/10.1109/access.2020.3029127

Ezzat, A. A., Liu, S., Hochbaum, D. S., & Ding, Y. (2021). A graph-theoretic approach for spatial filtering and its impact on mixed-type spatial pattern recognition in wafer bin maps. IEEE Transactions on Semiconductor Manufacturing, 34(2), 194–206. https://doi.org/10.1109/tsm.2021.3062943

Faaeq, A., Guruler, H., & Peker, M. (2018). Image classification using manifold learning based non-linear dimensionality reduction. In 2018 26th Signal Processing and Communications Applications Conference (SIU). https://doi.org/10.1109/siu.2018.8404441

Fan, M., Wang, Q., & van der Waal, B. (2016). Wafer defect patterns recognition based on OPTICS and multi-label classification. In 2016 IEEE Advanced Information Management, Communicates, Electronic and Automation Control Conference (IMCEC) (pp. 912–915). https://doi.org/10.1109/IMCEC.2016.7867343

Hasan, R. M., & Luo, X. (2018). Promising lithography techniques for next-generation logic devices. Nanomanufacturing and Metrology, 1(2), 67–81. https://doi.org/10.1007/s41871-018-0016-9

He, K., Fan, H., Wu, Y., Xie, S., & Girshick, R. B. (2019). Momentum contrast for unsupervised visual representation learning. CoRR, abs/1911.05722. http://arxiv.org/abs/1911.05722

Hsu, S.-C., & Chien, C.-F. (2007). Hybrid data mining approach for pattern extraction from wafer bin map to improve yield in semiconductor manufacturing. International Journal of Production Economics, 107(1), 88–103. https://doi.org/10.1016/j.ijpe.2006.05.015

Hsu, C.-Y., Chen, W.-J., & Chien, J.-C. (2020). Similarity matching of wafer bin maps for manufacturing intelligence to empower Industry 3.5 for semiconductor manufacturing. Computers & Industrial Engineering, 142, 106358. https://doi.org/10.1016/j.cie.2020.106358

Hu, M. (1962). Visual pattern recognition by moment invariants. IEEE Transactions on Information Theory, 8(2), 179–187. https://doi.org/10.1109/tit.1962.1057692

Hu, H., He, C., & Li, P. (2021). Semi-supervised wafer map pattern recognition using domain-specific data augmentation and contrastive learning. In 2021 IEEE International Test Conference (ITC) (pp. 113–122). https://doi.org/10.1109/ITC50571.2021.00019

Huang, C.-J. (2007). Clustered defect detection of high quality chips using self-supervised multilayer perceptron. Expert Systems with Applications, 33(4), 996–1003. https://doi.org/10.1016/j.eswa.2006.07.011

Huang, C.-J., Chen, Y.-J., Wu, C.-F., & Huang, Y.-A. (2009). Application of neural networks and genetic algorithms to the screening for high quality chips. Applied Soft Computing, 9(2), 824–832. https://doi.org/10.1016/j.asoc.2008.10.002

Hwang, J., & Kim, H. (2020). Variational deep clustering of wafer map patterns. IEEE Transactions on Semiconductor Manufacturing, 33(3), 466–475. https://doi.org/10.1109/tsm.2020.3004483

Hyun, Y., & Kim, H. (2020). Memory-augmented convolutional neural networks with triplet loss for imbalanced wafer defect pattern classification. IEEE Transactions on Semiconductor Manufacturing, 33(4), 622–634. https://doi.org/10.1109/tsm.2020.3010984

Ishida, T., Nitta, I., Fukuda, D., & Kanazawa, Y. (2019). Deep learning-based wafer-map failure pattern recognition framework. In 20th International Symposium on Quality Electronic Design (ISQED). https://doi.org/10.1109/isqed.2019.8697407

Iwata, T., Duvenaud, D., & Ghahramani, Z. (2013). Warped mixtures for nonparametric cluster shapes. arXiv. https://arxiv.org/abs/1206.1846

Jang, J., Seo, M., & Kim, C. O. (2020). Support weighted ensemble model for open set recognition of wafer map defects. IEEE Transactions on Semiconductor Manufacturing, 33(4), 635–643. https://doi.org/10.1109/tsm.2020.3012183

Ji, Y. S., & Lee, J.-H. (2020). Using GAN to improve CNN performance of wafer map defect type classification: Yield enhancement. In 2020 31st Annual SEMI Advanced Semiconductor Manufacturing Conference (ASMC). https://doi.org/10.1109/asmc49169.2020.9185193

Jin, C. H., Kim, H.-J., Piao, Y., Li, M., & Piao, M. (2020). Wafer map defect pattern classification based on convolutional neural network features and error-correcting output codes. Journal of Intelligent Manufacturing, 31(8), 1861–1875. https://doi.org/10.1007/s10845-020-01540-x

Jin, C. H., Na, H. J., Piao, M., Pok, G., & Ryu, K. H. (2019). A novel DBSCAN-based defect pattern detection and classification framework for wafer bin map. IEEE Transactions on Semiconductor Manufacturing, 32(3), 286–292. https://doi.org/10.1109/tsm.2019.2916835

Kang, S. (2020). Rotation-invariant wafer map pattern classification with convolutional neural networks. IEEE Access, 8, 170650–170658. https://doi.org/10.1109/access.2020.3024603

Kang, H., & Kang, S. (2021). A stacking ensemble classifier with handcrafted and convolutional features for wafer map pattern classification. Computers in Industry, 129, 103450. https://doi.org/10.1016/j.compind.2021.103450

Khastavaneh, H., & Ebrahimpour-Komleh, H. (2020). Representation learning techniques: An overview. In M. Bohlouli, B. Sadeghi Bigham, Z. Narimani, M. Vasighi, & E. Ansari (Eds.), Data Science: From Research to Application (CiDaS 2019). Lecture Notes on Data Engineering and Communications Technologies, Vol. 45. Springer, Cham. https://doi.org/10.1007/978-3-030-37309-2_8

Kim, Y., Cho, D., & Lee, J.-H. (2020a). Wafer map classifier using deep learning for detecting out-of-distribution failure patterns. In 2020 IEEE International Symposium on the Physical and Failure Analysis of Integrated Circuits (IPFA) (pp. 1–5). https://doi.org/10.1109/IPFA49335.2020.9260877

Kim, B., Jeong, Y.-S., Tong, S. H., & Jeong, M. K. (2020b). A generalised uncertain decision tree for defect classification of multiple wafer maps. International Journal of Production Research, 58(9), 2805–2821. https://doi.org/10.1080/00207543.2019.1637035

Kim, J., Lee, Y., & Kim, H. (2018). Detection and clustering of mixed-type defect patterns in wafer bin maps. IISE Transactions, 50(2), 99–111. https://doi.org/10.1080/24725854.2017.1386337

Kim, T. S., Lee, J. W., Lee, W. K., & Sohn, S. Y. (2021). Novel method for detection of mixed-type defect patterns in wafer maps based on a single shot detector algorithm. Journal of Intelligent Manufacturing. https://doi.org/10.1007/s10845-021-01755-6
Kim, S., & Oh, I. S. (2017). Automatic defect detection from SEM images of wafers using component tree. JSTS: Journal of Semiconductor Technology and Science, 17(1), 86–93. https://doi.org/10.5573/jsts.2017.17.1.086

Kingma, D. P., Rezende, D. J., Mohamed, S., & Welling, M. (2014). Semi-supervised learning with deep generative models. In Advances in Neural Information Processing Systems (Vol. 4, pp. 3581–3589).

Kong, Y., & Ni, D. (2018). Semi-supervised classification of wafer map based on ladder network. In 2018 14th IEEE International Conference on Solid-State and Integrated Circuit Technology (ICSICT). https://doi.org/10.1109/icsict.2018.8564982

Kong, Y., & Ni, D. (2019). Recognition and location of mixed-type patterns in wafer bin maps. In 2019 IEEE International Conference on Smart Manufacturing, Industrial & Logistics Engineering (SMILE). https://doi.org/10.1109/smile45626.2019.8965309

Kong, Y., & Ni, D. (2020a). A semi-supervised and incremental modeling framework for wafer map classification. IEEE Transactions on Semiconductor Manufacturing, 33(1), 62–71. https://doi.org/10.1109/tsm.2020.2964581

Kong, Y., & Ni, D. (2020b). Qualitative and quantitative analysis of multi-pattern wafer bin maps. IEEE Transactions on Semiconductor Manufacturing, 33(4), 578–586. https://doi.org/10.1109/tsm.2020.3022431

Kyeong, K., & Kim, H. (2018). Classification of mixed-type defect patterns in wafer bin maps using convolutional neural networks. IEEE Transactions on Semiconductor Manufacturing, 31(3), 395–402. https://doi.org/10.1109/tsm.2018.2841416

Lee, H., & Kim, H. (2020). Semi-supervised multi-label learning for classification of wafer bin maps with mixed-type defect patterns. IEEE Transactions on Semiconductor Manufacturing, 33(4), 653–662. https://doi.org/10.1109/tsm.2020.3027431

Li, K., Liao, P., Cheng, K., Chen, L., Wang, S., Huang, A., et al. (2021). Hidden wafer scratch defects projection for diagnosis and quality enhancement. IEEE Transactions on Semiconductor Manufacturing, 34(1), 9–16. https://doi.org/10.1109/tsm.2020.3040998

Li, T.-S., & Huang, C.-L. (2009). Defect spatial pattern recognition using a hybrid SOM–SVM approach in semiconductor manufacturing. Expert Systems with Applications, 36(1), 374–385. https://doi.org/10.1016/j.eswa.2007.09.023

Liao, C.-S., Hsieh, T.-J., Huang, Y.-S., & Chien, C.-F. (2014). Similarity searching for defective wafer bin maps in semiconductor manufacturing. IEEE Transactions on Automation Science and Engineering, 11(3), 953–960. https://doi.org/10.1109/TASE.2013.2277603

Liu, C.-W., & Chien, C.-F. (2013). An intelligent system for wafer bin map defect diagnosis: An empirical study for semiconductor manufacturing. Engineering Applications of Artificial Intelligence, 26(5–6), 1479–1486. https://doi.org/10.1016/j.engappai.2012.11.009

Luo, Y., Yin, L., Bai, W., & Mao, K. (2020). An appraisal of incremental learning methods. Entropy, 22(11), 1190. https://doi.org/10.3390/e22111190

Maksim, K., Kirill, B., Eduard, Z., Nikita, G., Aleksandr, B., Arina, L., Vladislav, S., Daniil, M., & Nikolay, K. (2019). Classification of wafer maps defect based on deep learning methods with small amount of data. In 2019 International Conference on Engineering and Telecommunication (EnT) (pp. 1–5). https://doi.org/10.1109/EnT47717.2019.9030550

Mohanaiah, P., Sathyanarayana, P., & GuruKumar, L. (2013). Image texture feature extraction using GLCM approach. International Journal of Scientific and Research Publications, 3(5), 1–5.

Nakazawa, T., & Kulkarni, D. V. (2018). Wafer map defect pattern classification and image retrieval using convolutional neural network. IEEE Transactions on Semiconductor Manufacturing, 31(2), 309–314. https://doi.org/10.1109/tsm.2018.2795466

Northcutt, C., Jiang, L., & Chuang, I. (2021). Confident learning: Estimating uncertainty in dataset labels. Journal of Artificial Intelligence Research, 70, 1373–1411. https://doi.org/10.1613/jair.1.12125

Ooi, M. P.-L., Sok, H. K., Kuang, Y. C., Demidenko, S., & Chan, C. (2013). Defect cluster recognition system for fabricated semiconductor wafers. Engineering Applications of Artificial Intelligence, 26(3), 1029–1043. https://doi.org/10.1016/j.engappai.2012.03.016

Park, S., Jang, J., & Kim, C. O. (2020). Discriminative feature learning and cluster-based defect label reconstruction for reducing uncertainty in wafer bin map label. Journal of Intelligent Manufacturing, 32(1), 251–263. https://doi.org/10.1007/s10845-020-01571-4

Parsa, M., Mitchell, J. P., Schuman, C. D., Patton, R. M., Potok, T. E., & Roy, K. (2020). Bayesian multi-objective hyperparameter optimization for accurate, fast, and efficient neural network accelerator design. Frontiers in Neuroscience. https://doi.org/10.3389/fnins.2020.00667

Patel, D. V., Bonam, R., & Oberai, A. A. (2020). Deep learning-based detection, classification, and localization of defects in semiconductor processes. Journal of Micro/Nanolithography, MEMS, and MOEMS, 19(2), 024801. https://doi.org/10.1117/1.jmm.19.2.024801

Patel, S., Sihmar, S., & Jatain, A. (2015). A study of hierarchical clustering algorithms. In 2015 2nd International Conference on Computing for Sustainable Global Development (INDIACom) (pp. 537–541).

Perez, L., & Wang, J. (2017). The effectiveness of data augmentation in image classification using deep learning. CoRR, abs/1712.04621. http://arxiv.org/abs/1712.04621

Piao, M., Jin, C. H., Lee, J. Y., & Byun, J.-Y. (2018). Decision tree ensemble-based wafer map failure pattern recognition based on radon transform-based features. IEEE Transactions on Semiconductor Manufacturing, 31(2), 250–257. https://doi.org/10.1109/TSM.2018.2806931

Pleschberger, M., Scheiber, M., & Schrunner, S. (2019). Simulated analog wafer test data for pattern recognition. Zenodo. https://doi.org/10.5281/zenodo.2542504

Preil, M. E. (2016). Patterning challenges in the sub-10 nm era. In Optical Microlithography XXIX. https://doi.org/10.1117/12.2222256

Ren, P., Xiao, Y., Chang, X., Huang, P.-Y., Li, Z., Chen, X., & Wang, X. (2020). A survey of deep active learning. arXiv. https://arxiv.org/abs/2009.00236

Ruthotto, L., & Haber, E. (2021). An introduction to deep generative modeling. GAMM-Mitteilungen. https://doi.org/10.1002/gamm.202100008

Samariya, D., & Thakkar, A. (2021). A comprehensive survey of anomaly detection algorithms. Annals of Data Science. https://doi.org/10.1007/s40745-021-00362-9

Santos, A. M., & Canuto, A. M. P. (2012). Using semi-supervised learning in multi-label classification problems. In The 2012 International Joint Conference on Neural Networks (IJCNN) (pp. 1–8). https://doi.org/10.1109/IJCNN.2012.6252800

Saqlain, M., Abbas, Q., & Lee, J. Y. (2020). A deep convolutional neural network for wafer defect identification on an imbalanced dataset in semiconductor manufacturing processes. IEEE Transactions on Semiconductor Manufacturing, 33(3), 436–444. https://doi.org/10.1109/tsm.2020.2994357

Saqlain, M., Jargalsaikhan, B., & Lee, J. Y. (2019). A voting ensemble classifier for wafer map defect patterns identification in semiconductor manufacturing. IEEE Transactions on Semiconductor Manufacturing, 32(2), 171–182. https://doi.org/10.1109/tsm.2019.2904306

Settles, B. (2009). Active learning literature survey (Tech. Rep.). Madison, WI.
Shawon, A., Faruk, M. O., Habib, M. B., & Khan, A. M. (2019). Silicon wafer map defect classification using deep convolutional neural network with data augmentation. In 2019 IEEE 5th International Conference on Computer and Communications (ICCC). https://doi.org/10.1109/iccc47050.2019.9064029

Shen, Z., & Yu, J. (2019). Wafer map defect recognition based on deep transfer learning. In 2019 IEEE International Conference on Industrial Engineering and Engineering Management (IEEM). https://doi.org/10.1109/ieem44572.2019.8978568

Shi, X., Yan, Y., Zhou, T., Yu, X., Li, C., Chen, S., & Zhao, Y. (2020). Fast and accurate machine learning inverse lithography using physics based feature maps and specially designed DCNN. In 2020 International Workshop on Advanced Patterning Solutions (IWAPS). https://doi.org/10.1109/iwaps51164.2020.9286814

Shi, X., Zhao, Y., Cheng, S., Li, M., Yuan, W., Yao, L., Zhao, W., Xiao, Y., Kang, X., & Li, A. (2019). Optimal feature vector design for computational lithography. In Optical Microlithography XXXII. https://doi.org/10.1117/12.2515446

Shim, J., Kang, S., & Cho, S. (2020). Active learning of convolutional neural network for cost-effective wafer map pattern classification. IEEE Transactions on Semiconductor Manufacturing, 33(2), 258–266. https://doi.org/10.1109/tsm.2020.2974867

Shon, H. S., Batbaatar, E., Cho, W.-S., & Choi, S. G. (2021). Unsupervised pre-training of imbalanced data for identification of wafer map defect patterns. IEEE Access, 9, 52352–52363. https://doi.org/10.1109/access.2021.3068378

Snoek, J., Larochelle, H., & Adams, R. P. (2012). Practical Bayesian optimization of machine learning algorithms. Advances in Neural Information Processing Systems, 25.

Snoek, J., Rippel, O., Swersky, K., Kiros, R., Satish, N., Sundaram, N., Patwary, Md. M. A., Prabhat, & Adams, R. P. (2015). Scalable Bayesian optimization using deep neural networks. arXiv:1502.05700.

Taha, K., Salah, K., & Yoo, P. D. (2018). Clustering the dominant defective patterns in semiconductor wafer maps. IEEE Transactions on Semiconductor Manufacturing, 31(1), 156–165. https://doi.org/10.1109/TSM.2017.2768323

Tello, G., Al-Jarrah, O., Yoo, P., Al-Hammadi, Y., Muhaidat, S., & Lee, U. (2018). Deep-structured machine learning model for the recognition of mixed-defect patterns in semiconductor fabrication processes. IEEE Transactions on Semiconductor Manufacturing, 31(2), 315–322. https://doi.org/10.1109/tsm.2018.2825482

Tsai, T.-H., & Lee, Y.-C. (2020a). Wafer map defect classification with depthwise separable convolutions. In 2020 IEEE International Conference on Consumer Electronics (ICCE). https://doi.org/10.1109/icce46568.2020.9043041

Tsai, T.-H., & Lee, Y.-C. (2020b). A light-weight neural network for wafer map classification based on data augmentation. IEEE Transactions on Semiconductor Manufacturing, 33(4), 663–672. https://doi.org/10.1109/TSM.2020.3013004

Ungredda, J., & Branke, J. (2021). Bayesian optimisation for constrained problems. CoRR, abs/2105.13245. https://arxiv.org/abs/2105.13245

Wang, C.-H. (2009). Separation of composite defect patterns on wafer bin map using support vector clustering. Expert Systems with Applications, 36(2, Part 1), 2554–2561. https://doi.org/10.1016/j.eswa.2008.01.057

Wang, C.-H. (2008). Recognition of semiconductor defect patterns using spatial filtering and spectral clustering. Expert Systems with Applications, 34(3), 1914–1923. https://doi.org/10.1016/j.eswa.2007.02.014

Wang, R., & Chen, N. (2019). Wafer map defect pattern recognition using rotation-invariant features. IEEE Transactions on Semiconductor Manufacturing, 32(4), 596–604. https://doi.org/10.1109/TSM.2019.2944181

Wang, C.-H., Kuo, W., & Bensmail, H. (2006). Detection and classification of defect patterns on semiconductor wafers. IIE Transactions, 38(12), 1059–1068. https://doi.org/10.1080/07408170600733236

Wang, J., Xu, C., Yang, Z., Zhang, J., & Li, X. (2020). Deformable convolutional networks for efficient mixed-type wafer defect pattern recognition. IEEE Transactions on Semiconductor Manufacturing, 33(4), 587–596. https://doi.org/10.1109/tsm.2020.3020985

Wang, J., Yang, Z., Zhang, J., Zhang, Q., & Chien, W.-T. K. (2019). AdaBalGAN: An improved generative adversarial network with imbalanced learning for wafer defective pattern recognition. IEEE Transactions on Semiconductor Manufacturing, 32(3), 310–319. https://doi.org/10.1109/tsm.2019.2925361

Wang, W., Huang, Y., Wang, Y., & Wang, L. (2014). Generalized autoencoder: A neural network framework for dimensionality reduction. In 2014 IEEE Conference on Computer Vision and Pattern Recognition Workshops. https://doi.org/10.1109/cvprw.2014.79

Wang, Y., & Ni, D. (2019). Multi-bin wafer maps defect patterns classification. In 2019 IEEE International Conference on Smart Manufacturing, Industrial & Logistics Engineering (SMILE). https://doi.org/10.1109/smile45626.2019.8965299

Wen, G., Gao, Z., Cai, Q., Wang, Y., & Mei, S. (2020). A novel method based on deep convolutional neural networks for wafer semiconductor surface defect inspection. IEEE Transactions on Instrumentation and Measurement, 69(12), 9668–9680. https://doi.org/10.1109/tim.2020.3007292

White, K. P., Kundu, B., & Mastrangelo, C. M. (2008). Classification of defect clusters on semiconductor wafers via the Hough transformation. IEEE Transactions on Semiconductor Manufacturing, 21(2), 272–278. https://doi.org/10.1109/tsm.2008.2000269

Wu, M.-J., Jang, J.-S. R., & Chen, J.-L. (2015). Wafer map failure pattern recognition and similarity ranking for large-scale data sets. IEEE Transactions on Semiconductor Manufacturing, 28(1), 1–12. https://doi.org/10.1109/tsm.2014.2364237

Xu, D., & Tian, Y. (2015). A comprehensive survey of clustering algorithms. Annals of Data Science, 2(2), 165–193. https://doi.org/10.1007/s40745-015-0040-1

Yu, J. (2019). Enhanced stacked denoising autoencoder-based feature learning for recognition of wafer map defects. IEEE Transactions on Semiconductor Manufacturing, 32(4), 613–624. https://doi.org/10.1109/tsm.2019.2940334

Yu, J., & Liu, J. (2020). Two-dimensional principal component analysis-based convolutional autoencoder for wafer map defect detection. IEEE Transactions on Industrial Electronics, 68(9), 8789–8797. https://doi.org/10.1109/tie.2020.3013492

Yu, J., & Lu, X. (2016). Wafer map defect detection and recognition using joint local and nonlocal linear discriminant analysis. IEEE Transactions on Semiconductor Manufacturing, 29(1), 33–43. https://doi.org/10.1109/tsm.2015.2497264

Yu, N., Xu, Q., & Wang, H. (2019a). Wafer defect pattern recognition and analysis based on convolutional neural network. IEEE Transactions on Semiconductor Manufacturing, 32(4), 566–573. https://doi.org/10.1109/TSM.2019.2937793

Yu, J., Zheng, X., & Liu, J. (2019b). Stacked convolutional sparse denoising auto-encoder for identification of defect patterns in semiconductor wafer map. Computers in Industry, 109, 121–133. https://doi.org/10.1016/j.compind.2019.04.015

Yuan, T., Bae, S. J., & Park, J. I. (2010). Bayesian spatial defect pattern recognition in semiconductor fabrication using support vector clustering. The International Journal of Advanced Manufacturing Technology, 51(5), 671–683. https://doi.org/10.1007/s00170-010-2647-x

Zhong, G., Wang, L., Ling, X., & Dong, J. (2016). An overview on data representation learning: From traditional feature learning to recent deep learning. The Journal of Finance and Data Science, 2(4), 265–278. https://doi.org/10.1016/j.jfds.2017.05.001
Zhu, X., Hu, H., Lin, S., & Dai, J. (2019). Deformable ConvNets v2: More deformable, better results. In 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). https://doi.org/10.1109/cvpr.2019.00953

Zhuang, J., Mao, G., Wang, Y., Chen, X., & Wei, Z. (2020). A neural-network approach to better diagnosis of defect pattern in wafer bin map. In 2020 China Semiconductor Technology International Conference (CSTIC) (pp. 1–3). https://doi.org/10.1109/CSTIC49141.2020.9282438

Zoph, B., Vasudevan, V., Shlens, J., & Le, Q. V. (2018). Learning transferable architectures for scalable image recognition. arXiv. https://arxiv.org/abs/1707.07012

Publisher's Note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Springer Nature or its licensor holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.