Practical Reliability of Electronic Equipment and Products
Eugene R. Hnatek
Qualcomm Incorporated
San Diego, California
Headquarters
Marcel Dekker, Inc.
270 Madison Avenue, New York, NY 10016
tel: 212-696-9000; fax: 212-685-4540
Reliability is important. Most organizations are concerned with fast time to market, competitive advantage, and reducing costs. Customers want to be sure that
the products and equipment they buy work as intended for the time specified.
That’s what reliability is: performance against requirements over time.
A number of excellent books have been written dealing with the topic of
reliability—most from a theoretical and what I call a ‘‘rel math’’ perspective.
This book is about electronic product and equipment reliability. It presents a
practical ‘‘hands-on perspective’’ based on my personal experience in fielding a
myriad of different systems, including military/aerospace systems, semiconduc-
tor devices (integrated circuits), measuring instruments, and computers.
The book is organized according to end-to-end reliability: from the customer
to the customer. At the beginning customers set the overall product parameters
and needs and in the end they determine whether the resultant product meets
those needs. They basically do this with their wallets. Thus, it is imperative that
manufacturers truly listen to what the customer is saying. In between these two
bounds the hard work of reliability takes place: design practices and testing; selec-
tion and qualification of components, technology and suppliers; printed wiring
assembly and systems manufacturing; and testing practices, including regulatory
testing and failure analysis.
To meet any reliability objective requires a comprehensive knowledge of
the interactions of the design, the components used, the manufacturing techniques
employed, and the environmental stresses under which the product will operate. A
reliable product is one that balances design-it-right and manufacture-it-correctly
techniques with just the right amount of testing. For example, design verification
testing is best accomplished using a logical method such as a Shewhart or Deming
cycle (plan–do–check–act–repeat) in conjunction with accelerated stress and
failure analysis. Only when used in this closed-feedback loop manner will testing
help make a product more robust. Testing by itself adds nothing to the reliability
of a product.
The purpose of this book is to give electronic circuit design engineers,
system design engineers, product engineers, reliability engineers, and their man-
agers this end-to-end view of reliability by sharing what is currently being done
in each of the areas presented as well as what the future holds based on lessons-
learned. It is important that lessons and methods learned be shared. This is the
major goal of this book. If we are ignorant of the lessons of the past, we usually
end up making the same mistakes as those before us did. The key is to never
stop learning. The topics contained in this book are meant to foster and stimulate
thinking and help readers extrapolate the methods and techniques to specific work
situations.
The material is presented from a large-company, large-system/product per-
spective (in this text the words product, equipment, and system are interchange-
able). My systems work experiences have been with large companies with the
infrastructure and capital equipment resources to produce high-end products that
demand the highest levels of reliability: satellites, measuring instruments (auto-
matic test equipment for semiconductors), and high-end computers/servers for
financial transaction processing. This book provides food for thought in that the
methods and techniques used to produce highly reliable and robust products for
these very complex electronic systems can be ‘‘cherry-picked’’ for use by
smaller, resource-limited companies. The methods and techniques given can be
tailored to a company’s specific needs and corporate boundary conditions for an
appropriate reliability plan.
My hope is that within this book readers will find some methods or ideas
that they can take away and use to make their products more reliable. The meth-
ods and techniques are not applicable in total for everyone. Yet there are some
ingredients for success provided here that can be applied regardless of the product
being designed and manufactured. I have tried to provide some things to think
about. There is no single step-by-step process that will ensure the production
of a high-reliability product. Rather, there are a number of sound principles that
have been found to work. What the reader ultimately decides to do depends
on the product(s) being produced, the markets served, and the fundamental pre-
cepts under which the company is run. I hope that the material presented is of
value.
ACKNOWLEDGMENTS
I want to acknowledge the professional contributions of my peers in their areas
of expertise and technical disciplines to the electronics industry and to this book.
This book would not have been possible without the technical contributions to
the state of the art by the people listed below. I thank them for allowing me to
use their material. I consider it an honor to have both known and worked with
all of them. I deeply respect them personally and their abilities.
David Christiansen, Compaq Computer Corporation, Tandem Division
Noel Donlin, U.S. Army, retired
Jon Elerath, Compaq Computer Corporation, Tandem Division; now with
Network Appliance Inc.
Charles Hawkins, University of New Mexico
Michael Hursthopf, Compaq Computer Corporation, Tandem Division
Andrew Kostic, IBM
Edmond Kyser, Compaq Computer Corporation, Tandem Division; now
with Cisco Systems
Ken Long, Celestica
Chan Moore, Compaq Computer Corporation, Tandem Division
Joel Russeau, Compaq Computer Corporation, Tandem Division
Richard Sevcik, Xilinx Inc.
Ken Stork, Ken Stork and Associates
Alan Wood, Compaq Computer Corporation, Tandem Division
David Christiansen, Michael Hursthopf, Chan Moore, Joel Russeau, and
Alan Wood are now with Hewlett Packard Company. Thanks to Rich Sevcik for
reviewing Chapter 8. His expert comments and suggestions were most helpful.
Thanks to Joel Russeau and Ed Kyser for coauthoring many articles with me.
Their contributions to these works were significant in the value that was provided
to the readers. To my dear friend G. Laross Coggan, thanks for your untiring
assistance in getting the artwork together; I couldn’t have put this all together
without you. To my production editor, Moraima Suarez, thanks for your editorial
contribution. I appreciate your diligence in doing a professional job of editing
the manuscript and your patience in dealing with the issues that came up during
the production process.
Eugene R. Hnatek
Preface
1. Introduction to Reliability
1.1 What Is Reliability?
1.2 Discipline and Tasks Involved with Product Reliability
1.3 The Bathtub Failure Rate Curve
1.4 Reliability Goals and Metrics
1.5 Reliability Prediction
1.6 Reliability Risk
1.7 Reliability Growth
1.8 Reliability Degradation
1.9 Reliability Challenges
1.10 Reliability Trends
References
5. Thermal Management
5.1 Introduction
5.2 Thermal Analysis Models and Tools
5.3 Impact of High-Performance Integrated Circuits
5.4 Material Considerations
5.5 Effect of Heat on Components, Printed Circuit Boards,
and Solder
5.6 Cooling Solutions
5.7 Other Considerations
References
Further Reading
7. Manufacturing/Production Practices
7.1 Printed Wiring Assembly Manufacturing
7.2 Printed Wiring Assembly Testing
7.3 Environmental Stress Testing
7.4 System Testing
7.5 Field Data Collection and Analysis
7.6 Failure Analysis
References
Further Reading
8. Software
8.1 Introduction
8.2 Hardware/Software Development Comparison
8.3 Software Availability
in reliability level should be just sufficient to meet that expectation and
threshold of pain. Thus, reliability and customer expectations are closely tied to
price. For example if a four- to five-function electronic calculator fails, the cus-
tomer’s level of irritation and dissatisfaction is low. This is so because both the
purchase price and the original customer expectation are low.
The customer merely disposes of it and gets another one. However, if your Lexus
engine ceases to function while you are driving on a busy freeway, your level
of anxiety, irritation, frustration, and dissatisfaction is extremely high. This is
because both the customer expectation upon purchase and the purchase price are
high. A Lexus is not a disposable item.
Also, for a given product, reliability is a moving target. It varies with the
maturity of the technology and from one product generation to the next. For
example, when the electronic calculator and digital watch first appeared in the
marketplace, they were state-of-the-art products and were extremely costly as
well. The people who bought these products were early adopters of the technol-
ogy and expected them to work. Each product cost in the neighborhood of several
hundred dollars (on the order of $800–$900 for the first electronic calculator and
$200–$400 for the first digital watches). As the technology was perfected (going
from LED to LCD displays and lower-power CMOS integrated circuits) and ma-
tured and competition entered the marketplace, the price fell over the years to
such a level that these products have both become disposable commodity items
(except for high-end products). When these products were new, unique, and high
priced, the customer’s reliability expectations were high as well. As the products
became mass-produced disposable commodity items, the reliability expectations
became less and less important; so that today reliability is almost a “don’t care”
situation for these two products. The designed-in reliability has likewise de-
creased in response to market conditions.
Thus companies design in just enough reliability to meet the customer’s
expectations, i.e., consumer acceptance of the product price and level of discom-
fort that a malfunction would bring about. You don’t want to design in more
reliability than the application warrants or that the customer is willing to pay for.
Table 1 lists the variables of price, customer discomfort, designed-in reliability,
and customer expectations relative to product/application environment, from the
simple to the complex.
Then, too, a particular product category may have a variety of reliability
requirements. Take computers as an example. Personal computers for consumer
and general business office use have one set of reliability requirements; comput-
ers destined for use in high-end server applications (CAD tool sets and the like)
have another set of requirements. Computers serving the telecommunication in-
dustry must operate for 20-plus years; applications that require nonstop availabil-
ity and 100% data integrity (for stock markets and other financial transaction
applications, for example) have an even higher set of requirements. Each of these
TABLE 1 column headings (product/application environments, from simple to complex): calculators, personal computers, pacemaker, computers for banking applications, auto, airline, satellite.
The bathtub curve is the sum of infant mortality, random failure, and wear-
out curves, as shown in Figure 3. Each of the regions is now discussed.
FIGURE 3 The bathtub curve showing how various failures combine to form the compos-
ite curve.
failure rates in this region. Field problems are due to “freak” or maverick lots.
Stress screening cannot reduce this inherent failure rate, but a reduction in op-
erating stresses and/or increase in design robustness (design margins) can reduce
the inherent failure rate.
is very different from the electronic product life curve in the following ways:
significantly shorter total life; steeper infant mortality; very small useful operating
life; fast wearout.
Figure 9 shows that the life curve for software is essentially a flat straight
line with no early life or wearout regions, because all copies of a software program
are identical.
FIGURE 7 Failure rates for NPN silicon transistors (1 W or less) versus calendar operating
time.
Let me express a note of caution. The bathtub failure rate curve is useful
to explain the basic concepts, but for complete electronic products (equipment),
the time-to-failure patterns are much more complex than the single graphical
representation shown by this curve.
Note: All rates are annualized and based on installed part population.
tolerant computer are shown in Table 2. The CM rate is what customers see.
The part (component) replacement (PR) rate is observed by the factory and logis-
tics organization. The failure rate is the engineers’ design objective. The differ-
ence between the failure rate and the PR rate is the NTF rate, based on returned
components that pass all the manufacturing tests. The difference between the CM
rate and PR rate is more complex.
If no components are replaced on a service call, the CM rate will be higher
than the PR rate. However, if multiple components are replaced on a single ser-
vice call, the CM rate will be lower than the PR rate. From the author’s experi-
ence, the CM rate is higher than the PR rate early in the life of a product when
inadequate diagnostics or training may lead to service calls for which no problem
can be diagnosed. For mature products these problems have been solved, and the
CM and PR rates are very similar.
Each of the stated reliability metrics takes one of three forms:
The relationships among the various forms of the metrics are shown in
Figure 10.
for that product and what the factors are that detract from achieving higher reli-
ability. This results in an action plan.
Initial reliability predictions are usually based on component failure rate
models using either MIL-HDBK-217 or Bellcore Procedure TR-332. Typically
one analyzes the product’s bill of materials (BOM) for the part types used and
plugs the appropriate numbers into a computer program that crunches the num-
bers. This gives a first “cut” prediction. However, the failure rates predicted are
usually much higher than those observed in the field and are considered to be
worst-case scenarios.
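As a rough sketch of how such a parts-count calculation is typically mechanized, the following sums per-part failure rates over a bill of materials. The part types, quantities, and FIT values below are hypothetical placeholders, not values taken from MIL-HDBK-217 or TR-332.

```python
# Minimal parts-count prediction sketch; the BOM entries and FIT values are
# hypothetical, not handbook data. 1 FIT = 1 failure per 1e9 device-hours.
bom = {
    "microprocessor":    (1,   150.0),   # (quantity, assumed failure rate in FITs)
    "SDRAM":             (4,    80.0),
    "ceramic capacitor": (120,   0.5),
    "chip resistor":     (200,   0.2),
    "connector":         (6,    10.0),
}

total_fits = sum(qty * fits for qty, fits in bom.values())
lambda_per_hour = total_fits * 1e-9      # convert FITs to failures per hour
mtbf_hours = 1.0 / lambda_per_hour       # constant-failure-rate assumption

print(f"Total failure rate: {total_fits:.1f} FITs")
print(f"Predicted MTBF:     {mtbf_hours:,.0f} hr")
```

In practice, temperature, electrical stress, and quality-level factors would multiply each part's base rate before summing, which is where the handbook tables come in.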
One of the criticisms of the probabilistic approach to reliability (such as
that of MIL-HDBK-217) is that it does not account for interactions among com-
ponents, materials, and processes. The failure rate for a component is considered
to be the same for a given component regardless of the process used to assem-
ble it into the final product. Even if the same process is used by two different
assemblies, their methods of implementation can cause differences.
Furthermore, since reliability goals are based on competitive analysis and
customer experience with field usage, handbook-based reliability predictions are
unlikely to meet the product goals. In addition, these predictions do not take into
account design or manufacturing process improvements possibly resulting from
the use of highly accelerated life test (HALT) or environmental stress screening
(ESS), respectively. Table 3 presents some of the limitations of reliability predic-
tion.
Thus, reliability prediction is an iterative process that is performed through-
out the design cycle. It is not a “once done, forever done” task. The initial reliabil-
ity prediction is continually refined throughout the design cycle as the bill of
materials gets solidified by factoring in test data, failure analysis results, and
An ambient air temperature of 40°C around the components (measured 0.5 in. above
the component) is assumed.
Component Quality Level I is used in the prediction procedure. This assumes standard
commercial, nonhermetic devices, without special screening or preconditioning. The
exception is optocouplers, which per the Bellcore recommendation are assumed
to be Level III.
Electrical stresses are assumed to be 50% of device ratings for all components.
Mechanical stress environment is assumed to be ground benign (GB).
Duty cycle is 100% (continuous operation).
A mature manufacturing and test process is assumed in the predicted failure rate
(i.e., all processes under control).
The predicted failure rate assumes that there are no systemic design defects in the
product.
there are no manufacturing, test, or design problems that significantly affect field
reliability. The results fall well within the normal range for similar hardware
items used in similar applications. If quality Level II components are used the
MTBF improves by a factor of about 2.5. One has to ask the following question:
is the improved failure rate worth the added component cost? Only through a
risk analysis and an understanding of customer requirements will one be able to
answer this question.
The detailed bill-of-material failure rates for Quality Levels I and II are
presented in Tables 6 and 7, respectively.
out its life. Figure 12 shows the bathtub curve with a vertical line placed at the
product’s design life requirements. If a high margin exists between the lifetime
requirement and the wearout time, a high cost is incurred for having this design
margin (overdesign for customer requirements), but there is a low reliability risk.
If the wearout portion of the curve is moved closer to the lifetime requirement
(less design margin), then a lower cost is incurred but a greater reliability risk
presents itself. Thus, moving the onset of wearout closer to the lifetime expected
by the customer increases the ability to enhance the performance of all products,
is riskier, and is strongly dependent on the accuracy of reliability wearout models.
Thus, one must trade off (balance) the high design margin versus cost. Several
prerequisite questions are (1) why do we need this design margin and (2) if I
didn’t need to design my product with a larger margin, could I get my product
to market faster?
FIGURE 12 Bathtub curve depicting impact of short versus long time duration between
product lifetime requirement specifications and wearout.
This raises the question: what level of reliability does the customer for a
given product really need? It is important to understand that customers will ask
for very high levels of reliability. They do this for two reasons: (1) they don’t
know what they need and (2) as a safety net so that if the predictions fall short
they will still be okay. This requires that the designer/manufacturer work with
the customer to find out the true need. Then the question must be asked, is the
customer willing to pay for this high level of reliability? Even though the custom-
er’s goal is overall system reliability, more value is often placed on performance,
cost, and time to market. For integrated circuits, for example, it is more important
for customers to get enhanced performance, and suppliers may not need to fix
or improve reliability. Here it’s okay to hold reliability levels constant while ag-
gressively scaling and making other changes.
growth. Burn-in removes the weak components and in this way brings the equip-
ment into its useful life period with a (supposedly) constant hazard rate λ (see
Fig. 13a). Reliability growth through design and manufacturing improvements,
on the other hand, steadily reduces the inherent hazard rate in the useful life
period of the product, i.e. it increases the MTTF. The corrective actions we speak
of when discussing burn-in are primarily directed toward reducing the number
of infant mortality failures. Some of these improvements may also enhance the
MTTF in the useful life period, providing an added bonus. The efforts expended
in improving the MTTF may very well reflect back on early failures as well.
Nonetheless, the two reliability enhancement techniques are independent.
7. Train the work and maintenance forces at all levels and provide essen-
tial job performance skills.
8. Include built-in test equipment and use of fault-tolerant circuitry.
1. Electrical overstress
2. Operation outside of a component’s design parameters
3. Environmental overstress
4. Operational voltage transients
5. Test equipment overstress (exceeding the component’s parameter rat-
ings during test)
6. Excessive shock (e.g., from dropping component on hard surface)
7. Excessive lead bending
8. Leaking hermetically sealed packages
9. High internal moisture entrapment (hermetic and plastic packages)
10. Microcracks in the substrate
11. Chemical contamination and redistribution internal to the device
12. Poor wire bonds
13. Poor substrate and chip bonding
14. Poor wafer processing
15. Lead corrosion due to improperly coated leads
16. Improper component handling in manufacturing and testing
17. Use of excessive heat during soldering operations
18. Use of poor rework or repair procedures
19. Cracked packages due to shock or vibration
20. Component inappropriate for design requirements
sign, manufacturing, and derating processes and by ensuring that the correct com-
ponent is used in the application.
It is difficult to detect component degradation in a product until the product
ceases functioning as intended. Degradation is very subtle in that it is typically
a slowly worsening condition.
ods presented make sense and should be used in the changing conditions facing
them in their own company in their chosen marketplace.
ACKNOWLEDGMENT
Portions of Section 1.4 are extracted from Ref. 4, courtesy of the Tandem Divi-
sion of Compaq Computer Corporation Reliability Engineering Department.
REFERENCES
1. Criscimagna NH. Benchmarking Commercial Reliability Practices. IITRI, 1997.
2. Reliability Physics Symposium, 1981.
3. Duane JT. Learning curve approach to reliability monitoring. IEEE Transactions on
Aerospace. Vol. 2, No. 2, pp. 563–566, 1964.
4. Elerath JG et al. Reliability management and engineering in a commercial computer
environment. 1999. International Symposium on Product Quality and Integrity. Cour-
tesy of the Tandem Division of Compaq Computer Corporation Reliability Engi-
neering Dept.
$$\text{PRR/year} \cong 8760\left(\frac{\text{Number of removals}}{\text{Run hours}}\right) = \frac{8760}{\text{MTBPR}} \qquad (2.11)$$
$$\mathrm{MTTF} = \int_0^\infty t\left(-\frac{dR(t)}{dt}\right)dt = \Big[-tR(t)\Big]_0^\infty + \int_0^\infty R(t)\,dt \qquad (2.9)$$

Therefore,

$$\mathrm{MTTF} = \int_0^\infty R(t)\,dt \qquad (2.10)$$
FIGURE 1 Relationship between MTTF, MTBF, and MTTR. Note: if MTBF is defined
as mean time before failure, then MTBF = MTTF.
b. Each repair took 1.25 hr. Compute the availability. Show this as outage min-
utes per year.
$$A = \frac{\mathrm{MTTF}}{\mathrm{MTTF} + \mathrm{MTTR}} = \frac{70{,}000}{70{,}000 + 1.25} = .999982 \text{ (unitless, measure of uptime)}$$
To scale this to something more meaningful, we convert this to minutes per year:
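One way to carry out that conversion is sketched below, using the MTTF and MTTR from this example; the 525,600 minutes/year constant assumes a 365-day year.

```python
# Convert steady-state availability into expected outage minutes per year.
mttf = 70_000.0   # hours
mttr = 1.25       # hours per repair

availability = mttf / (mttf + mttr)
minutes_per_year = 365 * 24 * 60
outage_minutes = (1.0 - availability) * minutes_per_year

print(f"Availability:    {availability:.6f}")
print(f"Expected outage: {outage_minutes:.1f} minutes per year")
```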
Failure times (months): 3.2, 4.8, 5.3, 6.6, 7.3, 9.3, 15.4, 19.7, and 22.2.
What is the MTTF, as determined from the plot of time versus probability? If
the failures seem to represent a constant failure rate, then what is that rate, λ?
Here is a schematic look at where the failures occur on the time scale:
1. First calculate the intervals (delta-t’s): 3.2, 1.6, 0.5, 1.3, 0.7, 2.0, 6.1, 4.3, 2.5.
2. Put these in ascending order: 0.5, 0.7, 1.3, 1.6, 2.0, 2.5, 3.2, 4.3, 6.1.
3. Divide the interval of 100% by N + 1 = 9 + 1 = 10: 10%, 20%, 30%, etc.
4. Plot on exponential paper the time values on the y axis, percentage on the x axis (Fig. 2).
5. Find the 63% point, drop down the 63% line, and find the y-axis intercept. It’s about 2.7 months. This is the MTTF. If this is a constant failure rate (it appears to be), then the failure rate λ is the inverse of the MTTF: λ = 1/2.7 ≈ .37 per month.
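The same estimate can be made numerically rather than graphically. The sketch below takes the sorted intervals and the i/(N + 1) plotting positions from Steps 2 and 3 and fits t = MTTF × [−ln(1 − F)], the linearized exponential CDF, by least squares through the origin; it reproduces the roughly 2.7-month MTTF read from the plot.

```python
import math

# Sorted time-between-failure intervals (months) and plotting positions i/(N + 1).
intervals = [0.5, 0.7, 1.3, 1.6, 2.0, 2.5, 3.2, 4.3, 6.1]
n = len(intervals)
plot_pos = [(i + 1) / (n + 1) for i in range(n)]      # 10%, 20%, ..., 90%

# For an exponential distribution, t = MTTF * (-ln(1 - F)); fit the slope (MTTF).
x = [-math.log(1.0 - f) for f in plot_pos]
mttf = sum(xi * ti for xi, ti in zip(x, intervals)) / sum(xi * xi for xi in x)

print(f"MTTF   = {mttf:.2f} months")            # about 2.7 months
print(f"lambda = {1.0 / mttf:.2f} per month")   # about 0.37 failures/month
```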
$$F(1) = \int_0^1 0.1e^{-0.1t}\,dt = 0.1\left[\frac{e^{-0.1t}}{-0.1}\right]_0^1 = -e^{-.1} + 1 = -.905 + 1 = .095$$

or one can use the relation $F(t) = 1 - e^{-\lambda t} = 1 - e^{-0.1(1)} = 1 - .905 = .095$.
b. Chance of failure in 5 years:

$$F(5) = \int_0^5 0.1e^{-0.1t}\,dt = 0.1\left[\frac{e^{-0.1t}}{-0.1}\right]_0^5 = -e^{-.5} + 1 = -.607 + 1 = .393$$

So success is 1 − .393 = .607.
Or, this could be calculated using R(t) to begin with: $R(5) = e^{-0.1(5)} - e^{-0.1(\infty)} = .607$.
Reliability Calculations Example: Integrated Circuits Using
the Exponential Distribution
Integrated Circuit Example 1. Assume an exponential distribution of fail-
ures in time; λ ⫽ 5 ⫻ 10⫺5 failures per hour and 100,000 parts.
What is the probability that an IC will fail within its first 400 hr?
Using the cumulative function and taking the difference at 2000 and 2500
hr, respectively:
The fallout is 11,750 − 9516 = 2234 ICs between 2000 and 2500 hr.
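Both parts of this example can be checked directly from the exponential CDF, F(t) = 1 − e^(−λt), with λ = 5 × 10⁻⁵ per hour and 100,000 parts:

```python
import math

lam = 5e-5            # failures per hour
population = 100_000

def fraction_failed(t_hours):
    """Exponential cumulative fraction failed by time t."""
    return 1.0 - math.exp(-lam * t_hours)

# Probability that an individual IC fails within its first 400 hr.
print(f"P(fail by 400 hr) = {fraction_failed(400):.4f}")       # about 0.02

# Expected cumulative fallout at 2000 and 2500 hr, and the difference.
f2000 = population * fraction_failed(2000)
f2500 = population * fraction_failed(2500)
print(f"Failed by 2000 hr:  {f2000:,.0f}")                     # about 9,516
print(f"Failed by 2500 hr:  {f2500:,.0f}")                     # about 11,750
print(f"Fallout, 2000-2500: {f2500 - f2000:,.0f}")             # about 2,234
```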
to fail in the first year. This example assumes an accurate match of data to correct
PDF.
2 years = 17,520 hr

$$F(17{,}520) = 1 - e^{-\lambda(17{,}520)} = 10^{-3}$$

$$\lambda = \ln\left(\frac{1}{1 - 10^{-3}}\right)\left(\frac{1}{17{,}520}\right) = 57\ \text{FITs}$$
PDF: $f(t) = \dfrac{\beta}{\eta}\left(\dfrac{t}{\eta}\right)^{\beta-1}\exp\left[-\left(\dfrac{t}{\eta}\right)^{\beta}\right]$  (2.20)

Reliability: $R(t) = \exp\left[-\left(\dfrac{t}{\eta}\right)^{\beta}\right]$  (2.22)

where
β = shape parameter
η = scale parameter or characteristic life (at which 63.2% of the
population will have failed)
The Weibull distribution failure rate is plotted in Figure 7.
Depending upon the value of β, the Weibull distribution function can also
take the form of the following distributions:
β      Distribution form         Failure rate
<1     Gamma                     Decreasing
1      Exponential               Constant
2      Lognormal                 Increasing/decreasing
3.5    Normal (approximately)    Increasing
The Weibull distribution is one of the most common and most powerful
in reliability because of its flexibility in taking on many different shapes for
various values of β. For example, as shown in Figure 8, it can describe each of the three
regions of the bathtub curve: β < 1 gives the decreasing failure rate of the infant mortality
region, β = 1 the constant failure rate of the useful life region, and β > 1 the increasing
failure rate of the wearout region.
The Weibull plot is a graphical data analysis technique to establish the failure
distribution for a component or product with incomplete failure data. Incomplete
means that failure data for a power supply or disk drive, for example, does not
include both running and failure times because the units are put into service at
different times. The Weibull hazard plot provides estimates of the distribution
parameters, the proportion of units failing by a given age, and the behavior of
the failure rates of the units as a function of their age. Also, Weibull hazard
plotting answers the reliability engineering question, does the data support the
engineering conjecture that the failure rate of the power supply or disk drive
increases with their age? If so, there is a potential power supply/disk drive wear-
out problem which needs to be investigated.
Both the lognormal and Weibull models are widely used to make predic-
tions for components, disk drives, power supplies, and electronic products/equip-
ment.
The reliability function R(t), or the probability of zero failures in time t, is given
by
$$R(t) = \frac{(\lambda t)^0 e^{-\lambda t}}{0!} = e^{-\lambda t} \qquad (2.25)$$
that is, simply the exponential distribution.
FIGURE 9 Linearizing data. Useful plot is natural log of time to failure (y axis) versus
cumulative percent fails (x axis).
terms of intervals, with an associated probability, or confidence that the true value
lies within such intervals. The end points of the intervals are called confidence
limits and are calculated at a given confidence level (probability) using measured
data to estimate the parameter of interest.
The upper plus lower confidence limits (UCL + LCL) for a confidence
level must always total 100%. The greater the number of failures, the closer the
agreement between the failure rate point estimate and upper confidence limit of
the failure rate. Numerical values of point estimates become close when dealing
with cases of 50 or more failures.
Here is how confidence limits work. A 90% upper confidence limit means
that there is a 90% probability that the true failure rate will be less than the rate
computed. A 60% upper confidence limit means that there is a 60% probability
that the true failure rate will be less than the rate computed. Both of these are
diagrammed in Figure 12. Conversely, the true failure rate will be greater than
the computed value in 10% and 40% of the cases, respectively. The higher the
confidence level, the higher the computed confidence limit (or failure rate) for
a given set of data.
FIGURE 11 Confidence levels showing the probability that a parameter lies between two
lines.
Manipulate the Weibull CDF into a straight-line equation so that the constants m and c
can be observed directly from the plot:

$$F(t) = 1 - e^{-(t/c)^m} \qquad (2.26)$$
$$1 - F(t) = e^{-(t/c)^m} \qquad (2.27)$$
$$-\ln[1 - F(t)] = (t/c)^m \qquad (2.28)$$
$$\ln\{-\ln[1 - F(t)]\} = m\ln(t) - m\ln(c) \quad \text{(plot)} \qquad (2.29)$$
$$y = mx + b \qquad (2.30)$$
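A small sketch of this linearization applied to complete (uncensored) failure data follows. The failure times below are invented for illustration; median-rank plotting positions stand in for F(t), and an ordinary least-squares fit of ln{−ln[1 − F]} against ln(t) returns the slope m (the Weibull shape β) and the intercept −m ln(c), from which the characteristic life c is recovered.

```python
import math

# Hypothetical failure times (hours), sorted in ascending order.
times = [310, 450, 620, 800, 1100, 1350, 1700, 2300]
n = len(times)

# Median-rank (Benard) approximation for the plotting position of the i-th failure.
F = [(i + 1 - 0.3) / (n + 0.4) for i in range(n)]

# Linearized Weibull, Eq. (2.29): y = m*x + b, with x = ln(t),
# y = ln(-ln(1 - F)), and intercept b = -m*ln(c).
x = [math.log(t) for t in times]
y = [math.log(-math.log(1.0 - f)) for f in F]

mean_x, mean_y = sum(x) / n, sum(y) / n
m = (sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
     / sum((xi - mean_x) ** 2 for xi in x))
b = mean_y - m * mean_x
c = math.exp(-b / m)

print(f"Shape m (beta): {m:.2f}")   # >1 would suggest wearout behavior
print(f"Scale c (eta):  {c:.0f} hr (age by which ~63.2% have failed)")
```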
FIGURE 13 Definition of confidence interval. The confidence intervals consist of the in-
terval and the confidence level for the interval.
FIGURE 14 Impact of data on confidence limits and intervals. The 90% lower confidence
limit is less than true MTTF with 90% confidence. The 90% upper confidence limit is
greater than true MTTF with 90% confidence (90% LCL, 90% UCL) and has an 80%
confidence interval.
FIGURE 17 LCL and UCL added to Figure 16. There is a CL% probability that LCL <
true (future) MTBF < UCL.
$$P_s = \exp\left(\frac{-T}{\mathrm{MTBF}}\right) \qquad (2.35)$$

where
$P_s$ = probability of failure-free operation for time T
T = operating period of interest
MTBF = MTBF of the product in the same units as T = 1/λ
exp = the exponential function (base e)
Let’s use this equation to calculate the probability of failure-free operation
for a number of different operating times, all with an MTBF of 4000 hr.

If T = 8000 hr, $P_s = \exp(-8000/4000) = .1353 = 13.53\%$.
If T = 4000 hr, $P_s = \exp(-4000/4000) = .3679 = 36.79\%$.
If T = 168 hr, $P_s = \exp(-168/4000) = .9588 = 95.88\%$.
If T = 8 hr, $P_s = \exp(-8/4000) = .9980 = 99.8\%$.
Failure rate = λ

$$\text{Reliability} = R(A) + R(B) - R(A)R(B) = e^{-\lambda t} + e^{-\lambda t} - e^{-\lambda t}e^{-\lambda t} = 2e^{-\lambda t} - e^{-2\lambda t} \qquad (2.39)$$
Let’s work a practical example using the series-parallel network shown in
Figure 20. Assume that power supply and CPU reliabilities have exponential
distributions with failure rates of λPS and λCPU, respectively.
1. What is the system reliability? Leave it in the form of an exponential
expression.
2. What is the system failure rate? (Trick question.)
We can start by calculating the CPU reliability:

$$R_{\mathrm{CPU0,CPU1}} = 2e^{-\lambda_{\mathrm{CPU}}t} - e^{-2\lambda_{\mathrm{CPU}}t} \qquad (2.40)$$

Note that we cannot combine the exponentials. Now multiply by the term for
the power supply:

$$R_{\mathrm{PS+CPUs}} = e^{-\lambda_{\mathrm{PS}}t}\left(2e^{-\lambda_{\mathrm{CPU}}t} - e^{-2\lambda_{\mathrm{CPU}}t}\right) \qquad (2.41)$$
This is a complex expression.
The system failure rate (Question 2) cannot be easily computed because of
the time dependence; we can’t add the exponentials.
This illustrates a basic fact about reliability calculations: unless the model
is reasonably easy to use mathematically (as the exponential certainly is), it takes
some powerful computing resources to perform the reliability calculations.
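The point is easy to see numerically. The sketch below evaluates Eq. (2.41) at a few mission times; the power supply and CPU failure rates are hypothetical values chosen only to exercise the expression.

```python
import math

# Hypothetical failure rates (failures per hour), for illustration only.
lam_ps = 2e-5     # power supply
lam_cpu = 5e-5    # each CPU

def r_system(t_hours):
    """Eq. (2.41): power supply in series with two CPUs in active parallel."""
    r_cpu_pair = 2 * math.exp(-lam_cpu * t_hours) - math.exp(-2 * lam_cpu * t_hours)
    return math.exp(-lam_ps * t_hours) * r_cpu_pair

for t in (1_000, 5_000, 10_000, 20_000):
    print(f"t = {t:6,} hr : R(t) = {r_system(t):.4f}")
```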
$$P(n) = \left(\frac{\lambda^n t^n}{n!}\right)e^{-\lambda t} \qquad (2.45)$$
where
P(n) = percent of units exhibiting n failures
t = time duration
n = number of failures in a single system (e.g., 1, 2, 3, . . . , n)
Let’s learn how many units will have 1, then 2, then 3, etc., failures per
unit in the group of 63 units that will exhibit these 100 failures.
For zero failures (note, 0! is defined as equaling 1),

$$P(0) = \left[\frac{0.001^0(1000^0)}{0!}\right]2.71^{-.001(1000)} = \left[\frac{1(1)}{1}\right]2.71^{-1} = 1(.37)$$

P(0) = .37, or 37%
So with 100 units there will be 37 units exhibiting zero failures in one MTBF
time period.
For one failure,

$$P(1) = \left[\frac{0.001^1(1000^1)}{1!}\right]2.71^{-1} = \left(\frac{1}{1}\right).37$$

P(1) = .37, or 37% will exhibit one failure.
So with 100 units there will be 37 units exhibiting one failure in one MTBF time
period.
For two failures,
$$P(2) = \left(\frac{1}{2}\right).37$$

P(2) = 18%
So with 100 units there will be 18 or 19 units exhibiting two failures in one
MTBF time period.
P(3) ⫽ 6 units exhibiting three failures in one MTBF.
P(4) ⫽ 1 or 2 units exhibiting four failures in one MTBF.
P(5) ⫽ maybe 1 unit exhibiting five failures in one MTBF.
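These counts follow directly from Eq. (2.45). The short sketch below evaluates the Poisson terms for one MTBF time period (λt = 1) and scales them to the fleet of 100 units, reproducing the 37/37/18/6/1–2 breakdown above.

```python
import math

lam_t = 1.0     # one MTBF time period, so lambda * t = 1
units = 100

for n in range(6):
    p_n = (lam_t ** n) * math.exp(-lam_t) / math.factorial(n)   # Eq. (2.45)
    print(f"P({n}) = {p_n:.3f} -> about {units * p_n:.1f} of {units} units "
          f"with exactly {n} failure(s)")
```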
A simpler way of finding the percentage of failures encountered in some
time period is
$$P(f) = \lambda t \qquad (2.46)$$

Find how many will fail in one-hundredth of an MTBF time period:

$$P(f) = 0.001\left(\frac{1000}{100}\right) = 0.001(10) = 0.01, \text{ or } 1\%$$
Using 100 units this means that one unit exhibits the very first failure in 10 hr.
So the time to first failure is 10 hr. Which one out of the 100 units will fail is
a mystery, however.
Now let’s move to the warranty issue. A system has an MTBF of 4000 hr.
An engineer makes a recommendation for a hardware change that costs $400.00
per unit to install and raises the system’s MTBF to 6500 hr. What is known:
The average cost of a field failure is $1,800.00/failure.
Failures per unit        Units at 4000-hr MTBF    Units at 6500-hr MTBF
0                        28–29                    46–47
1                        35–36                    35–36
2                        22–23                    13–14
3                        9–10                     3–4
4                        2–3                      0–1
Total failures (range)   114–124                  70–80
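The underlying trade-off arithmetic is sketched below. The operating period is not stated in this excerpt; 5000 hr is assumed here because it is consistent with the failure-count table above (about 1.25 expected failures per unit at a 4000-hr MTBF and about 0.77 at 6500 hr). The $400 change cost and $1800 average cost per field failure are from the problem statement.

```python
# Expected-cost comparison for the proposed MTBF improvement.
period_hr = 5_000.0        # assumed operating period (not given in the text)
cost_per_failure = 1_800.0
change_cost = 400.0

def expected_cost(mtbf_hr, upgrade_cost=0.0):
    expected_failures = period_hr / mtbf_hr     # constant-failure-rate assumption
    return upgrade_cost + expected_failures * cost_per_failure

baseline = expected_cost(4_000.0)
improved = expected_cost(6_500.0, upgrade_cost=change_cost)

print(f"Baseline (4000-hr MTBF): ${baseline:,.0f} per unit")
print(f"Improved (6500-hr MTBF): ${improved:,.0f} per unit")
print(f"Net saving:              ${baseline - improved:,.0f} per unit")
```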
$$\mathrm{MTBF}_{\mathrm{Total}} = \frac{1}{1/\mathrm{MTBF}_1 + 1/\mathrm{MTBF}_2 + 1/\mathrm{MTBF}_3 + 1/\mathrm{MTBF}_4 + 1/\mathrm{MTBF}_5}$$

or

$$\mathrm{MTBF}_{\mathrm{Total}} = \frac{1}{\lambda_1 + \lambda_2 + \lambda_3 + \lambda_4 + \lambda_5}$$

$$\mathrm{MTBF} = \frac{1}{.000041 + .000071 + .000030 + .000042 + .000056} = \frac{1}{0.000240} = 4167\ \mathrm{hr}$$
Thus, the system consisting of these five subassemblies has a system MTBF of
4167 hr. This is an acceptable number because it exceeds the specified 4000 hour
goal.
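The same roll-up is easy to script, which helps once the subassembly count grows; the sketch below simply repeats the calculation above.

```python
# System MTBF for subassemblies in series: failure rates add.
lambdas = [0.000041, 0.000071, 0.000030, 0.000042, 0.000056]   # failures per hour

system_lambda = sum(lambdas)
system_mtbf = 1.0 / system_lambda

print(f"System failure rate: {system_lambda:.6f} failures/hr")
print(f"System MTBF:         {system_mtbf:.0f} hr")             # about 4167 hr
```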
ACKNOWLEDGMENT
Portions of this chapter have been excerpted from the reference. Section 2.6 was
generated by Ted Kalal, and is used with permission.
REFERENCE
Wood A. Reliability Concepts Training Course, Tandem Computer Corporation,
Cupertino, California.
3.1 INTRODUCTION
Technology has created an increasingly complex problem for delivering reliabil-
ity. More systems are designed today faster than ever before under shrinking cost
margins and using more electronics with ever more complex devices that no one
has the time to test thoroughly. Concurrently, the tolerance for poor reliability
is shrinking, even while expectations are rising for rapid technology changes and
shorter engineering cycles.
Grueling project schedules, the thirst for performance, and cost competitiveness
result in corners being cut in design, with as little verification and testing being
done as possible. In all of this, what is a company to do to deliver high-reliability
products to its customers? Here are some hints. Use common platforms, product
architectures, and mainstream software and keep new engineering content down
as much as possible to have evolutionary rather than revolutionary improvements.
Use proven or preferred components. Perform margin analysis to ensure perfor-
mance and reliability beyond the stated specifications. Conduct accelerated stress
screening to expose and correct defects in the engineering phase before shipping
any products. Conduct detailed design reviews throughout the design process
composed of multidisciplinary participants.
This chapter discusses the importance of the design stage and specifically
the elements of design that are vital and necessary to producing a reliable product.
Design has the most impact on the reliability outcome of a product: 80% of the
reliability and production cost of a product is fixed during its design. Reliability
must be designed into a product. This requires a lot of forethought and a conscious
formal effort to provide just the required margin needed for customer application.
A given product’s projected life in its intended application determines the amount
of margin (robustness) that will be designed in and for which the customer is
willing to pay. Figure 1 shows the product life cycles of various categories of
computers. This figure shows that not all products require high reliability and
that not all segments of a given product category have the same reliability require-
ments; i.e., there is no one-size-fits-all mindset. For example, personal computers (PCs)
are commodity items that are becoming disposable (much as calculators and cell
phones are). Computers used in financial transaction applications (stock markets,
banks, etc.), automobiles, telecommunication equipment, and satellites have
much more stringent reliability requirements.
How product reliability has been accomplished has differed greatly be-
tween U.S. and Japanese companies. The Japanese companies generate many
change notices during product design, continually fine tuning and improving the
design in bite-sized pieces until the design is frozen upon release to production.
Companies in the United States, on the other hand, are quick to release a product
to production even though it contains known “bugs” or deficiencies. Geoffrey
Moore, in Living on the Fault Line, calls this “going ugly early” to capture market
share. After being released to production a number of changes are made to correct
these deficiencies. But in the meantime, plan on customer returns and complaints.
Reliability begins at the global system concept design phase; moves to the
detailed product design phase that encompasses circuit design, application-
FIGURE 1 Product life cycles for various categories of computers. (Courtesy of Andrew
Kostic, IBM.)
Synthesis
Piece parts and suppliers selected for the design
However, a startling change is taking place. Design (concurrent engineering)
teams are shrinking in size. This is due to the increased integration possible with
integrated circuit technology and the available design tools.
7. Regarding testing:
Do you plan to test each supplier’s part in your application?
Do you plan to measure all signals for timing, noise margin, and signal
integrity?
Do you plan to test power fail throughout the four corners of your product
specification?
Do you plan to perform HALT evaluation of the product (test to
destruction–analyze–fix–repeat)? If so, which assemblies?
Will all suppliers’ parts be used in the HALT evaluation?
Will you be testing worst/best case parts (which ones)?
Will you HALT these?
8. What kind of manufacturability verification/process verification do you
plan to do?
9. What special equipment/software will be needed to program the
programmable logic?
10. For new suppliers, what discussions and design work and supplier
qualification/component qualification has been done to date?
Who has been the principal contact or interface between the circuit
designer and the supplier?
Are these discussions/efforts documented?
11. What are the key project assumptions?
Major design assemblies
Design reviews planned (which/when/who)
List of key designers and managers (and their responsibilities today)
Schedule:
Design specifications
Prototypes
Alpha
Beta
FCS
Number of alpha tests (by whom?)
Number of prototypes (built by whom?)
Product support life
Product sales life
Total product quantity during product life (per year if estimated)
Design and phase review dates (Phases 1,2,3,4,5)
There is also the issue of old products and technologies, those products
and technologies that the designers have used before in previous designs and are
comfortable with. In some cases the designers simply copy a portion of a circuit
that contains these soon-to-be obsolete components. How long will the technolo-
gies continue to be manufactured (1.0-µm CMOS in a 0.25-µm CMOS world,
for example)? How long will the specific product be available? Where is it in
its life cycle?
This process helps to direct and focus the design to the use of approved
and acceptable components and suppliers. It also identifies those components that
are used for design leverage/market advantage that need further investigations
by the component engineer; those suppliers and components that need to be inves-
tigated for acceptability for use and availability for production; and those compo-
nents and suppliers requiring qualification.
The answers allow the component engineer and the designers to develop
a technology readiness and risk assessment and answer the question: is it possible
to design the product per the stated requirements? A flow diagram showing user
technology needs and supplier technology availability and the matching of these
is shown in Figure 2.
As the keeper of the technology, the component engineer
Is responsible for keeping abreast of the technology road maps for assigned
circuit functions (connectors, memory, microprocessor, for example)
Conducts technology competitiveness analyses
Works with the design organization to develop a project technology sizing
Understands the technology, data, and specifications
must thus be paid as to how they are electrically interconnected and physically
located on the printed circuit board (PCB). As digital IC processes migrate to
below 0.25-µm (the deep submicron realm), the resultant ICs become much nois-
ier and much more noise sensitive. This is because in deep submicron technology
interconnect wires (on-chip metallization) are jammed close together, threshold
voltages drop in the quest for higher speed and lower operating power, and more
aggressive and more noise-sensitive circuit topologies are used to achieve even
greater IC performance. Severe chip-level noise can affect timing (both delay
and skew) and can cause functional design failures.
The analog circuitry needs to be electrically isolated from the digital cir-
cuitry. This may require the addition of more logic circuitry, taking up board
space and increasing cost. Noise can be easily coupled to analog circuitry, re-
sulting in signal degradation and reduced equipment performance. Also, interfac-
ing analog and digital ICs (logic, memory, microprocessors, phase-locked loops,
digital-to-analog and analog-to-digital converters, voltage regulators, etc.) is not
well defined. Improper interfacing and termination can cause unwanted interac-
tions and crosstalk to occur. Then, too, the test philosophy (i.e., design for test,
which will be discussed in a later section) must be decided early on. For example,
bringing out analog signals to make a design testable can degrade product perfor-
mance. Some of the pertinent differences between analog and digital ICs are
listed in Table 5.
Analog                                       Digital
Transistors full on                          Transistors either on or off
Large feature size ICs                       Cutting edge feature size ICs
High quiescent current                       Low quiescent current
Sensitive to noise                           Sensitive to signal edge rates
Design tools not as refined/sophisticated    Sophisticated design tools
Tool incompatibility/variability             Standard tool sets
Simulation lags that of digital ICs          Sophisticated simulation process
labor time, increased production and rework costs, as well as yield and quality
issues.
Knowledgeable engineers with a breadth of experience in designing differ-
ent types of products/systems/platforms and who are crosstrained in other disci-
plines form the invaluable backbone of the design team. All of the various design
disciplines (each with their own experts)—circuit design (analog and digital),
printed circuit board design and layout (i.e., design for manufacturability), system
design (thermal, mechanical and enclosure), design for test, design for electro-
magnetic compatibility, design for diagnosability (to facilitate troubleshooting),
design for reliability, and now design for environment (DFE)—are intertwined
and are all required to develop a working and producible design. Each one im-
pacts all of the others and the cost of the product. Thus, a high level of interaction
is required among these disciplines. The various design tasks cannot be separated
or conducted in a vacuum.
The design team needs to look at the design from several levels: component,
module, printed wiring assembly (PWA)—a PCB populated with all the compo-
nents and soldered—and system. As a result, the design team is faced with a
multitude of conflicting issues that require numerous tradeoffs and compromises
to be made to effect a working and manufacturable design, leading to an iterative
design process. The availability and use of sophisticated computer-aided design
tools facilitate the design process.
Tools
Circuit design has changed dramatically over the past decade. For the most part,
gone are the days of paper and pencil design followed by many prototype bread-
board iterations. The sophistication of today’s computer-aided design tools for
digital integrated circuit designs allows simulations to be run on virtual bread-
board designs, compared with the desired (calculated) results, debugged, and cor-
rected—and then the process repeated, resulting in a robust design. This is all
done before committing to an actual prototype hardware build. But there are two
big disconnects here. First, digital IC design tools and methods are extremely
effective, refined, and available, whereas analog IC and mixed-signal IC tools
are not as well defined or available. The analog tools need to be improved and
refined to bring them on a par with their digital equivalents. Second, there tends
to be a preoccupation with reliance on simulation rather than actually testing a
product. There needs to be a balance between simulation before prototyping and
testing after hardware has been produced.
There are many CAD and electronic design automation (EDA) tools avail-
able for the circuit designer’s toolbox. A reasonably comprehensive list is pro-
vided in Table 6. Electronic systems have become so large and complex that
simulation alone is not always sufficient. Companies that develop and manufac-
ture large digital systems use both simulation and hardware emulation.

TABLE 6 CAD and EDA tools for the circuit designer
IC design: Acceleration/emulation, Extractors, Physical verification, CBIC layout, Floor
planning, Process migration, Custom layout, Gate array layout, Reliability analysis, Delay
calculator, Metal migration analysis, Signal integrity analysis, EMI analysis, Power
analysis, Thermal analysis, SPICE, Timing analysis
PCB design: EMI analysis, Physical verification, Thermal analysis, MCM/hybrid design,
Power analysis, Timing analysis, PCB design, Signal integrity analysis, Virtual prototype
evaluation, Autorouter

The rea-
sons are twofold and both deal with time to market: (1) Simulation cycle time
is several orders of magnitude slower than emulation. Faster simulations result
in quicker design verification. (2) Since today’s systems are software intensive
and software is often the show stopper in releasing a new product to market,
system designers cannot wait for the availability of complete hardware platforms
(i.e., ASICs) to begin software bring-up.
A methodology called the electronic test bench aids the design verification
task. The “intelligent test bench” or “intelligent verification environment” has
been developed to further ease the designer’s task. The intelligent test bench is
a seamless, coherent, and integrated EDA linkage of the myriad different kinds
facing the designer include the following: How does one use a simple buffer?
With or without pull-up resistors? With or without pull-down resistors? No con-
nection? How is the value of the resistors chosen (assuming resistors are used)?
It has been found that oftentimes the wrong resistor values are chosen for the
pull-up/pull-down resistors. Or how do you interface devices that operate at dif-
ferent voltage levels with characterization data at different voltages (these ICs
have different noise margins)? This also raises the question: what is the definition
of a logic 1 and a logic 0? This is both a mixed-voltage and mixed-technology
issue.
Additionally, the characteristics of specific components are not understood.
Many of the important use parameters are not specified on the supplier’s data
sheets. One needs to ask several questions: What parameters are important (speci-
fied and unspecified)? How is the component used in my design/application?
How does it interface with other components that are used in the design?
Examples of important parameters, specified or not specified, include the fol-
lowing:
1. Input/output characteristics for digital circuit design.
Bus hold maximum current is rarely specified. The maximum
bus hold current defines the highest pull-up/pull-down resistor values for a
design.
It is important to understand transition thresholds, especially
when interfacing with different voltage devices. Designers assume 1.5-
V transition levels, but the actual range (i.e., 1.3 V to 1.7 V) is useful
for signal quality analysis.
Simultaneous switching effect characterization data with 1, 8,
16, 32, and more outputs switching at the same time allows a designer
to manage signal quality, timing edges, edge rates, and timing delay
as well as current surges in the design.
Pin-to-pin skew defines the variance in simultaneously launched
output signals from package extremes.
Group launch delay is the additional delay associated with simul-
taneous switching of multiple outputs.
2. Functional digital design characteristic.
Determinism is the characteristic of being predictable. A com-
plex component such as a microprocessor should provide the same
output in the same cycle for the same instructions, consistently.
3. Required package considerations.
Data on the thermal characteristics (thermal resistance and con-
ductance), with and without the use of a heat sink, in still air and with
various air flow rates are needed by the design team.
Package capacitance and inductance values for each pin must be
3.8 REDUNDANCY
Redundancy is often employed when a design must be fail safe, or when the
consequences of failure are unacceptable, resulting in designs of extremely high
reliability. Redundancy provides more than one functional path or operating ele-
ment where it is critical to maintain system operability (the word element is
used interchangeably with component, subassembly, and circuit path). Redun-
dancy can be accomplished by means of hardware or software or a combination
of the two. I will focus here on the hardware aspects of redundancy. The use of
redundancy is not a panacea to solve all reliability problems, nor is it a substitute
for a good initial design. By its very nature, redundancy implies increased com-
plexity and cost, increased weight and space, increased power consumption, and
usually a more complicated system checkout and monitoring procedure. On the
other hand, redundancy may be the only solution to the constraints confronting
the designer of a complex electronic system. The designer must evaluate both
the advantages and disadvantages of redundancy prior to its incorporation in a
design.
Depending on the specific application, numerous different approaches are
available to improve reliability with a redundant design. These approaches are
normally classified on the basis of how the redundant elements are introduced
into the circuit to provide an alternative signal path. In general, there are two
major classes of redundancy:
1. Active (or fully on) redundancy, where external components are not
required to perform a detection, decision, or switching function when
an element or path in the structure fails
2. Standby redundancy, where external components are required to detect,
make a decision, and then to switch to another element or path as a
replacement for the failed element or path
Redundancy can consist of simple parallel redundancy (the most commonly
used form of redundancy), where the system will function if one or both of the
subsystems is functional, or more complex methods—such as N-out-of-K ar-
rangements, where only N of a total of K subsystems must function for system
operation—and can include multiple parallel redundancies, series parallel redun-
dancies, voting logic, and the like.
For simple parallel redundancy, the greatest gain is achieved through the
addition of the first redundant element; it is equivalent to a 50% increase in
the system life. In general, the reliability gain for additional redundant elements
decreases rapidly for additions beyond a few parallel elements. Figure 3 shows
cause loss of system or of one of its major functions, loss of control, unintentional
actuation of a function, or a safety hazard. Redundancy is commonly used in the
aerospace industry. Take two examples. The Apollo spacecraft had redundant
on-board computers (more than two, and it often landed with only one computer
operational), and launch vehicles and deep space probes have built-in redundancy
to prevent inadvertent firing of pyrotechnic devices.
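As a check on the roughly 50% life gain quoted above for the first redundant element: for two identical, constant-failure-rate elements in active parallel, integrating the parallel reliability function gives the mean life directly,

$$\mathrm{MTTF}_{\text{parallel}} = \int_0^\infty \left(2e^{-\lambda t} - e^{-2\lambda t}\right)dt = \frac{2}{\lambda} - \frac{1}{2\lambda} = \frac{1.5}{\lambda}$$

that is, 1.5 times the single-element mean life of 1/λ. Each further identical parallel element adds only 1/(3λ), then 1/(4λ), and so on, which is why the incremental gain falls off so quickly.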
Input data                                            Option 1    Option 2
Cost per unit                                         $1000       $1300
Expected service calls in 5 years                     0.6         0.1
Cost per service call (OEM cost only)                 $850        $1100
Cost per service call (OEM cost + customer cost)      $2550       $3600
Cost calculations
Development cost                                      Minimal     Minimal
Production cost per unit                              $1000       $1300
Support cost per unit (OEM cost only)                 $510        $110
Support cost per unit (OEM cost + customer cost)      $1530       $360
Other costs                                           Minimal     Minimal
Life cycle cost
Total cost (OEM cost only)                            $1510       $1410
Total cost (OEM cost + customer cost)                 $2530       $1660
1. Use the part type number to determine the important electrical and
environmental characteristics which are reliability sensitive, such as
voltage, current, power, time, temperature, frequency of operation,
duty cycle, and others.
2. Determine the worst case operating temperature.
3. Develop derating curves or plots for the part type.
4. Derate the part in accordance with the appropriate derating plots. This
becomes the operational parameter derating.
5. Use a derating guideline such as that in Appendix A of Chapter 3
or military derating guideline documents [such as those provided by
the U.S. Air Force (AFSCP Pamphlet 800-27) and the Army Missile
Command] to obtain the derating percentage. Multiply the operational
derating (the value obtained from Step 4) by this derating percentage.
This becomes the reliability derating.
6. Divide the operational stress by the reliability derating. This provides
the parametric stress ratio and establishes a theoretical value to deter-
mine if the part is overstressed. A stress ratio of 1.0 is considered
to be critical, and for a value of >1.0 the part is considered to be
overstressed.
7. If the part is theoretically overstressed, then an analysis is required and
an engineering judgment and business decision must be made whether
it is necessary to change the part, do a redesign, or continue using the
current part.
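A small numeric sketch of Steps 4–6 follows. All of the values are made up for illustration: an operational derating of 0.6 W read from a part's power-temperature curve at the worst-case temperature (Step 4), an 80% guideline derating factor (Step 5), and an actual applied stress of 0.4 W. None of these numbers come from the text or from any derating standard.

```python
# Hypothetical derating / stress-ratio check following Steps 4-6 above.
operational_derating_w = 0.6   # Step 4: from the part's derating curve at worst-case temperature
guideline_factor = 0.80        # Step 5: derating percentage from a guideline document
applied_stress_w = 0.4         # actual worst-case power dissipated in the application

reliability_derating_w = operational_derating_w * guideline_factor   # Step 5
stress_ratio = applied_stress_w / reliability_derating_w             # Step 6

print(f"Reliability derating: {reliability_derating_w:.2f} W")
print(f"Stress ratio:         {stress_ratio:.2f}")
print("Overstressed" if stress_ratio > 1.0 else "Within derating limits")
```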
FIGURE 6 Power versus temperature derating curve for CD 4025 CMOS gate.
deal with PWA thermal issues: hot spots and heat generating components and
their effect on nearest neighbor components; voltage and current transients and
their duration and period; the surrounding environment temperature; PWA work-
manship; and soldering quality. There are other factors that should be considered,
but those listed here provide some insight as to why it is necessary to derate a
part and then also apply additional safety derating to protect against worst case
and unforeseen conditions. In reliability terms this is called designing in a safety
margin.
Example 3: Power Derating a Bipolar IC
Absolute maximum ratings:
Since the maximum junction temperature allowed for the application is 110°C
and the estimated operating junction temperature is less than this (105°C), the
operating junction temperature is satisfactory.
The next step is to draw the derating curve for junction temperature versus
power dissipation. This will be a straight-line linear derating curve, similar to
that of Figure 6. The maximum power dissipation is 50 mW/gate. Plotting this
curve in Figure 7 and utilizing the standard derating method for ICs, we see that
the derating curve is flat from ⫺55 to 25°C and then rolls off linearly from 25
to 175°C. At 175°C the power dissipation is 0 W.
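Written out, the straight-line derating just described (flat at 50 mW/gate up to 25°C, falling linearly to zero at 175°C) is

$$P_{\max}(T) = \begin{cases} 50\ \text{mW}, & -55^\circ\mathrm{C} \le T \le 25^\circ\mathrm{C} \\ 50\ \text{mW}\times\dfrac{175 - T}{175 - 25}, & 25^\circ\mathrm{C} < T \le 175^\circ\mathrm{C} \end{cases}$$

so at the estimated 105°C operating junction temperature, for example, the allowable dissipation works out to about 50 × (70/150) ≈ 23 mW per gate.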
Using the derating curve of Figure 7 we proceed in the following manner:
1. Products have become increasingly complex. In the last few years the
sophistication of printed circuit packaging has increased dramatically.
Not only is surface mount now very fine pitch, but ball grid array and
chip scale packages and flip chip technologies have become commer-
cially viable and readily available. This plus the many high-density
interconnect structures (such as microvia, microwiring, buried bump
interconnection, buildup PCB, and the like) available has made the
design task extremely complex.
2. Minimizing cost is imperative. The use of DFM/A has been shown in
benchmarking and case studies to reduce assembly costs by 35% and
PWA costs by 25%.
3. High manufacturing yields are needed. Using DFM/A has resulted in
first-pass manufacturing yields increasing from 89 to 99%.
4. In the electronic product design process, 60–80% of the manufacturing
costs are determined in the first stages of design when only 35% or so
of the design cost has been expended.
5. A common (standard) language needs to be established that links man-
ufacturing to design and R&D. This common language defines produc-
ibility as an intrinsic characteristic of a design. It is not an inspection
milestone conducted by manufacturing. The quantitative measure of
producibility directly leads to a team approach to providing a high-
quality cost-competitive product.
The traditional serial design approach, where the design proceeds from the logic
or circuit designer to physical designer to manufacturing and finally to the test
engineer for review, is not appropriate because each engineer independently eval-
uates and selects alternatives. Worse is a situation where the manufacturing engi-
neer sees the design only in a physical form on a PCB. This normally is the case
when contract manufacturers only perform the component assembly (attachment
to the PCB) process.
How should a product be designed? As mentioned previously, the design
team should consist of representatives from the following functional organiza-
tions: logic design; analog design; computer-aided design (CAD) layout; manu-
facturing and process engineering; mechanical, thermal, component, reliability,
and test engineering; purchasing; and product marketing. Alternatives are dis-
cussed to meet thermal, electrical, real estate, cost, and time-to-market require-
ments. This should be done in the early design phases to evaluate various design
alternatives within the boundaries of the company’s in-house self-created DFM
document. This team should be headed by a project manager with good technical
and people skills who has full team member buy-in and management support.
Manufacturing engineering plays an important role during the design phase
and is tasked with accomplishing the following:
Vias not covered with solder mask can allow hidden shorts to via pads under
components.
Vias not covered with solder mask can allow clinch shorts on DIP and axial
and radial components.
Specify plated mounting holes with pads, unplated holes without pads.
For TO-220 package mounting, avoid using heat sink grease. Instead use sil-
pads and stainless hardware.
Ensure that polarized components face in the same direction and have one axis
for PTH automation to ensure proper component placement and PWA testing.
Align similar components in the same direction/orientation for ease of
component placement, inspection, and soldering.
PTH hole sizes need adequate clearance for automation, typically 0.0015 in.
larger than lead diameter.
Fiducial marks are required for registration and correct PCB positioning.
For multilayer PCBs a layer/rev. stack-up bar is recommended to facilitate
inspection and proper automated manufacture.
Obtain land pattern guidelines from computer-aided design libraries with CAD
programs, component manufacturers, and IPC-SM-782A. SMT pad geometry
controls the component centering during reflow.
Provide for panelization by allowing consideration for conveyor clearances
(0.125 in. minimum on primary side; 0.200 in. minimum on secondary side),
board edge clearance, and drill/route breakouts.
Maximum size of panel or PCB should be selected with the capabilities of the
production machine in mind as well as the potential warp and twist problems in
the PCB.
PCBs should fit into a standard form factor: board shape and size, tooling hole
location and size, etc.
To prevent PCB warpage and machine jams the panel width should not exceed
1.5× the panel length.
Panels should be designed for routing with little manual intervention.
complete while manufacturing reports 100 of 100 deliverables at Gate 10, then
engineering is clearly not done and manufacturing is ready and waiting.
Design for manufacture and assembly is predicated on the use of accurate
and comprehensive computer-integrated manufacturing (CIM) tools and sophisti-
cated software. These software programs integrate all relevant data required to
design, manufacture, and support a product. Data such as simulation and models;
CAD and computer-aided engineering (CAE) data files; materials, processes, and
characteristics; specifications and documents; standards and regulations; and en-
gineering change orders (ECOs), revisions, parts, etc. The software programs
efficiently
Communicate information both ways between design engineering and man-
ufacturing.
Automate CAD data exchange and revision archiving.
Provide product data tracking and packaging completeness checking and
support standard industry networking protocols.
Allow design for assembly by analyzing part placement, supporting multi-
ple machine configurations, analyzing machine capacity, and providing
production engineering documentation.
By having these design files available in an integrated form, PWA design
and manufacturing engineers have the necessary information available in one
place to develop a cost-effective design implementation, including analyzing var-
ious tradeoff scenarios such as
Product definition and system partitioning (technology tradeoff)
Layout and CAD system setup
PWA fabrication design rules, yield optimization, and cost tradeoffs
SMT assembly process, packaging, component, and test tradeoffs
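As a rough illustration of the kind of integrated product-data record such tools maintain, the sketch below groups CAD files, models, specifications, ECOs, and the BOM under a single part number. The field names and the PwaDesignPackage class are hypothetical; they are not the schema of any particular CIM package.

# A hedged sketch of an integrated product-data record kept in one place for
# tradeoff analysis. The field names are illustrative only.

from dataclasses import dataclass, field
from typing import List

@dataclass
class PwaDesignPackage:
    part_number: str
    cad_files: List[str] = field(default_factory=list)         # layout, netlist, CAE data
    simulation_models: List[str] = field(default_factory=list)  # SI/thermal models
    specifications: List[str] = field(default_factory=list)     # standards, regulations
    ecos: List[str] = field(default_factory=list)                # engineering change orders
    bom: dict = field(default_factory=dict)                      # reference designator -> qty

    def release_revision(self, eco_id: str) -> None:
        """Record an ECO against this package so design and manufacturing
        are always looking at the same revision."""
        self.ecos.append(eco_id)

pkg = PwaDesignPackage(part_number="PWA-1234",
                       cad_files=["pwa1234.brd", "pwa1234.net"],
                       bom={"U1": 1, "C12": 4})
pkg.release_revision("ECO-0042")
print(pkg.part_number, pkg.ecos)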
An example of such a tool is GenCAM (which stands for Generic Com-
puter-Aided Manufacturing). GenCAM is an industry standard written in open
ASCII format for electronic data transfer from CAD to computer-aided manufac-
turing (CAM) to assembly to test in a single file. This file may contain a single
board to be panelized for fabrication or subpanelized for assembly. The fixture
descriptions in the single GenCAM file allow for testing the assemblies in an array
or singular format, as shown in Figure 9. Some of the features and benefits of Gen-
CAM are listed in Table 10. A detailed description is documented in IPC-2511.
GenCAM contains 20 sections (Table 11) that convey design requirements
and manufacturing details. Each section has a specific function or task, is indepen-
dent of the other sections, and can be contained within a single file. The relation-
ship between sections is very important to the user. For example, classic informa-
tion to develop land patterns is important to both the assembly and in-circuit test
(ICT) functions. GenCAM files can be used to request quotations, to order details
that are specifically process-related, or to describe the entire product (PWA) to
be manufactured, inspected, tested, and delivered to the customer.
The use of primitives and patterns provides the information necessary to
convey desired final characteristics and shows how, through naming conventions,
one builds upon the next, starting with the simplest form of an idea, as shown
in Figure 10. Figure 11 shows an example of various primitives. Primitives have
no texture or substance. That information is added when the primitive is refer-
enced or instanced.
When textured primitives are reused and named they can become part of
an artwork, pattern, or padstack description. When primitives are enhanced, there
are many ways in which their combinations can be reused. Primitives can also
become symbols, which are a specific use of the pattern section. Figure 12 shows
the use of primitives in a pattern to surface mount small-outline ICs (SOICs). In
this instance, the symbol is given intelligence through pin number assignment.
Thus, the logic or schematic diagram can be compared to the net list identified
in the routes section.
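The layering idea just described, bare primitives that acquire substance when they are instanced, and patterns that gain intelligence through pin-number assignment, can be illustrated with a small sketch. The classes below are purely conceptual; they are not GenCAM (IPC-2511) syntax, and the SOIC-8 dimensions are invented.

# Conceptual sketch of the primitive -> pattern -> symbol layering described
# above. This is NOT GenCAM (IPC-2511) syntax; it only illustrates how a bare
# primitive acquires "texture" when instanced and how a pattern gains
# intelligence through pin-number assignment.

from dataclasses import dataclass
from typing import Dict, Tuple

@dataclass(frozen=True)
class Primitive:
    """A bare shape with no texture or substance of its own."""
    name: str
    shape: str            # e.g., "rectangle", "circle"
    width: float
    height: float

@dataclass
class PadInstance:
    """A primitive referenced at a location, picking up real attributes."""
    primitive: Primitive
    position: Tuple[float, float]
    layer: str = "top"

@dataclass
class Pattern:
    """A reusable land pattern; pin numbers give the symbol 'intelligence'
    so it can be checked against the net list in the routes section."""
    name: str
    pads: Dict[int, PadInstance]   # pin number -> pad instance

soic_pad = Primitive("soic_pad", "rectangle", 0.6, 2.0)
soic8 = Pattern(
    name="SOIC-8",
    pads={pin: PadInstance(soic_pad, (1.27 * ((pin - 1) % 4), 0.0 if pin <= 4 else 5.0))
          for pin in range(1, 9)},
)
print(soic8.name, "pins:", sorted(soic8.pads))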
TABLE 10 GenCAM Features and Benefits

User: Improves cycle time by reducing the need to spoon-feed the supply chain; supports supply-chain management, necessary as more services are outsourced; provides an equipment reprocurement capability. Also establishes a valuable archiving capability for fabrication and assembly tooling enhancement, and segmentation of the GenCAM file avoids the need to distribute proprietary product performance data.

Designer: Features the ability to provide complete descriptions of one or more assemblies and a direct correlation with CAD library methodology. GenCAM establishes the communication link between design and manufacturing; facilitates reuse of graphical data; permits descriptions of tolerances for accept/reject criteria; brings design into close contact with DFM issues.

Manufacturer: Provides a complete description of PCB topology; the opportunity to define the fabrication panel, assembly subpanel, coupons, and other features; a layering description for built-up and standard multilayer construction; ease of reference to industry material specifications; design rule check (DRC) or DFM review and feedback facilitation. Also, data can be extracted to supply input to various manufacturing equipment, e.g., drill, AOI, router.

Assembler: Provides a complete integrated bill of materials. Identifies component substitution allowances. Accommodates several BOM configurations in a single file. Establishes flexible reuse of component package data. Supports subpanel or assembly array descriptions. Considers all electrical and mechanical component instances, including orientation and board-mount side.

Electrical bare board and ICT: Identifies one or more fixtures needed for electrical test requirements and the specific location of pads or test points; describes test system power requirements, the complete net list to establish component connectivity and net association, and component values and tolerances. Provides reference to component behavior, timing, and test vectors.

Source: Ref. 1.

GenCAM can handle both through-hole and surface mount components. GenCAM accommodates through-hole components (dual in-line packages, pin grid array packages, and DC/DC converters, for example) by including holes used in the CAD system in the padstack section to make connections to all layers of the PCB. For surface mount components, the relationship of vias, as shown in Figure 13, becomes an important element for design for assembly.
FIGURE 10 Definitions and relationships among the three primary sections of GenCAM. (From Ref. 1.)

GenCAM handles component intermixing by combining the components, padstacks, patterns, and routes information to position parts on the individual PCB and on the subpanel assembly array. Since many assembly operations use
wave soldering, the general description of the component identified in the pack-
ages section can be transformed through (X, Y) positioning, rotation, and mirror
imaging. This permits a single description of a package to be positioned in many
forms to meet the requirements, shown in Figure 14.
IPC is an industry association that has taken the responsibility for generat-
ing, publishing, and maintaining extensive guidelines and standards for PCB de-
sign, artwork requirements, assembly and layout, qualification, and test—facili-
tating DFM. Table 12 provides a list of some of these documents.
Another change in PWA design and manufacturing that is driven by fast
time to market is that PWAs are designed from a global input/output (I/O) per-
spective. This means that a given PWA is designed and the first article manufac-
tured using embedded field programmable gate arrays (FPGAs) and programma-
ble logic devices (PLDs) without the core logic being completed. After the PWA
is manufactured, then the core logic design is begun. However, choosing to use
FPGAs in the final design gives the circuit designers flexibility and upgradeability
through the manufacturing process and to the field (customer), throughout the
product’s life. This provides a very flexible design approach that allows changes
to be made at any of these stages.

TABLE 12 Some IPC Documents Facilitating DFM

IPC-D-354, Library Format Description for Printed Boards in Digital Form: Describes the use of libraries within the processing and generation of information files. The data contained within cover both the definition and use of internal (existing within the information file) and external libraries. The libraries can be used to make generated data more compact and facilitate data exchange and archiving. The subroutines within a library can be used one or more times within any data information module and also in one or more data information modules.

IPC-D-355, Printed Board Automated Assembly Description in Digital Form: Describes an intelligent digital data transform format for describing component mounting information. Supplements IPC-D-350 and is for designers and assemblers. Data included are pin location, component orientation, etc.

IPC-D-356A, Bare Substrate Electrical Test Information in Digital Form: Describes a standard format for digitally transmitting bare board electrical test data, including computer-aided repair. It also establishes fields, features, and physical layers and includes file comment recommendations and graphical examples.

IPC-D-390A, Automated Design Guidelines: A general overview of computer-aided design and its processes, techniques, considerations, and problem areas with respect to printed circuit design. It describes the CAD process from the initial input package requirements through engineering change.

IPC-C-406, Design and Application Guidelines for Surface Mount Connectors: Provides guidelines for the design, selection, and application of soldered surface mount connectors for all types of printed boards (rigid, flexible-rigid) and backplanes.

IPC-CI-408, Design and Application Guidelines for the Use of Solderless Surface Mount Connectors: Provides information on design characteristics and the application of solderless surface mount connectors, including conductive adhesives, in order to aid IC package-to-board interconnection.

IPC-D-422, Design Guide for Press Fit Rigid Printed Board Backplanes: Contains backplane design information from the fabrication and assembly perspective. Includes sections on design and documentation, fabrication, assembly, repair, and inspection.

What these examples show is that careful attention must be paid by experienced engineering and manufacturing personnel to the components that are placed
next to or in close proximity to each other during the PWA design (to the size
of the components and the materials with which they are made). Unfortunately,
these experienced manufacturing personnel are getting rarer and lessons learned
have not been documented and passed on, and it takes far too long to gain that
experience. The difficulties don’t end there. Often the manufacturer is separated from the design team by long geographical distances, which creates a shortage of local, on-site technical competence. Suffice it to say that PWA manufacturing is in a turbulent state of flux.
heating. This produces a large thermal gradient across the surface of the IC die
(or across several areas of the die), potentially cracking the die or delaminating
some of the material layers. New assembly and packaging technology develop-
ments make the situation even more complex, requiring new approaches to
cooling.
The ability of an electronic system to dissipate heat efficiently depends on
the effectiveness of the IC package in conducting heat away from the chip (IC)
and other on-board heat-generating components (such as DC/DC converters) to
their external surfaces, and the effectiveness of the surrounding system to dissi-
pate this heat to the environment.
The thermal solution consists of two parts. The first part of the solution is
accomplished by the IC and other on-board component suppliers constructing
their packages with high thermal conductivity materials. Many innovative and
cost effective solutions exist, from the tiny small outline integrated circuit and
chip scale packages to the complex pin grid array and ball grid array packages
housing high-performance microprocessors, FPGAs, and ASICs.
Surface mount technology, CSP and BGA packages and the tight enclo-
sures demanded by shrinking notebook computers, cell phones, and personal digi-
tal assistant applications require creative approaches to thermal management. In-
creased surface mount densities and complexities can create assemblies that are
damaged by heat in manufacturing. Broken components, melted components,
warped PWAs, or even PWAs catching on fire may result if designers fail to
provide for heat buildup and create paths for heat flow and removal. Stress
buildup caused by different coefficients of thermal expansion (CTE) between the
PWA and components in close contact is another factor affecting equipment/
system assembly reliability. Not only can excessive heat affect the reliability of
surface mount devices, both active and passive, but it can also affect the operating
performance of sensitive components, such as clock oscillators and mechanical
components such as disk drives. The amount of heat generated by the IC, the
package type used, and the expected lifetime in the product combine with many
other factors to determine the optimal heat removal scheme.
In many semiconductor package styles, the only thing between the silicon
chip and the outside world is high thermal conductivity copper (heat slug or
spreader) or a thermally equivalent ceramic or metal. Having reached this point
the package is about as good as it can get without resorting to the use of exotic
materials or constructions and their associated higher costs. Further refinements
will happen, but with diminishing returns. In many applications today, the pack-
age resistance is a small part (less than 10%) of the total thermal resistance.
The second part of the solution is the responsibility of the system designer.
High-conductivity features of an IC package (i.e., low thermal resistance) are
wasted unless heat can be effectively removed from the package surfaces to the
external environment. The system thermal resistance issue can be dealt with by
breaking it down into several parts: the conduction resistance between the IC
package and the PWA; the conduction resistance between the PWAs and the
external surface of the product/equipment; the convection resistance between the
PWA, other PWAs, and the equipment enclosure; and the convection resistance
between these surfaces and the ambient. The total system thermal resistance is
the sum of each of these components. There are many ways to remove the heat
from an IC: placing the device in a cool spot on the PWA and in the enclosure;
distributing power-generating components across the PWA; and using a liquid-
cooled plate connected to a refrigerated water chiller are among them.
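Because these contributions are in series, the estimate reduces to a simple sum. The sketch below assumes a hypothetical heat path, and the resistance and power values are chosen only for illustration.

# Sketch of the series thermal-resistance sum described above: the total
# system resistance is the sum of the package, board, enclosure, and
# surface-to-ambient contributions, and junction temperature follows from the
# power dissipation. The numeric values are assumptions for illustration.

def junction_temperature(power_w, ambient_c, resistances_c_per_w):
    """Tj = Ta + P * sum(thermal resistances in the heat path)."""
    theta_total = sum(resistances_c_per_w)
    return ambient_c + power_w * theta_total, theta_total

# Hypothetical heat path for one IC: junction-to-case, case-to-board,
# board-to-enclosure conduction, and enclosure-to-ambient convection.
path = {
    "junction-to-case": 2.0,
    "case-to-board": 5.0,
    "board-to-enclosure": 8.0,
    "enclosure-to-ambient": 10.0,
}
tj, theta = junction_temperature(power_w=3.0, ambient_c=40.0,
                                 resistances_c_per_w=path.values())
print(f"total resistance = {theta:.1f} degC/W, Tj ~= {tj:.1f} degC")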
Since convection is largely a function of surface area (larger means cooler),
the opportunities for improvement are somewhat limited. Oftentimes it is not
practical to increase the size of an electronic product, such as a notebook com-
puter, to make the ICs run cooler. So various means of conduction (using external
means of cooling such as heat sinks, fans, or heat pipes) must be used.
The trend toward distributed power (DC/DC converters or power regulators
on each PWA) is presenting new challenges to the design team in terms of power
distribution, thermal management, PWA mechanical stress (due to weight of heat
sinks), and electromagnetic compatibility. Exacerbating these issues still further
is the trend toward placing the power regulator as close as possible to the micro-
processor (for functionality and performance reasons), even to the point of putting
them together in the same package. This extreme case causes severe conflicts in
managing all issues. From a thermal perspective, the voltage regulator module
and the microprocessor should be separated from each other as far as possible.
Conversely, to maximize electrical performance requires that they be placed as
close together as possible. The microprocessor is the largest source of electromag-
netic interference, and the voltage regulator module adds significant levels of
both conducted and radiated interference. Thus, from an EMI perspective the
voltage regulator and microprocessor should be integrated and encapsulated in
a Faraday cage. However, this causes some serious thermal management issues
relating to the methods of providing efficient heat removal and heat sinking. The
high clock frequencies of microprocessors require the use of small apertures to meet EMI standards, which conflicts with the thermal requirement of large openings in the chassis to create air flow and cool the devices within, challenging the design team and requiring that system design tradeoffs and compromises be made.
A detailed discussion of thermal management issues is presented in
Chapter 5.
creating a serious signal integrity issue. Signal integrity addresses the impact of
ringing, overshoot, undershoot, settling time, ground bounce, crosstalk, and
power supply noise on high-speed digital signals during the design of these sys-
tems. Some symptoms that indicate that signal integrity (SI) is an issue include
skew between clock and data, skew between receivers, fast clocks (less setup
time, more hold time) and fast data (more setup time, less hold time), signal
delay, and temperature sensitivities. Figure 15 shows a signal integrity example
as it might appear on a high-bandwidth oscilloscope. The clock driver has a nice
square wave output waveform, but the load IC sees a waveform that is distorted
by both overshoot and ringing. Some possible reasons for this condition include
the PCB trace may not have been designed as a transmission line; the PCB trace
transmission line design may be correct, but the termination may be incorrect;
or a gap in either ground or power plane may be disturbing the return current
path of the trace.
As stated previously, signal integrity is critical in fast bus interfaces, fast
microprocessors, and high throughput applications (computers, networks, tele-
communications, etc.). Figure 16 shows that as circuits get faster, timing margins
decrease, leading to signal integrity issues. In a given design, faster parts can be used in a number of ways that create SI problems.
FIGURE 15 The signal integrity issue as displayed on an oscilloscope. (From Ref. 2, used
with permission from Evaluation Engineering, November 1999.)
FIGURE 16 Impact of faster ICs on timing margins. (From Ref. 2, used with permission
from Evaluation Engineering, November 1999.)
Without due consideration of the basic signal integrity issues, high-speed prod-
ucts will fail to operate as intended.
Signal integrity wasn’t always important. In the 1970–1990 time frame,
digital logic circuitry (gates) switched so slowly that digital signals actually
looked like ones and zeroes. Analog modeling of signal propagation was not
necessary. Those days are long gone. At today’s circuit speeds even the simple
passive elements of high-speed design—the wires, PC boards, connectors, and
chip packages—can make up a significant part of the overall signal delay. Even
worse, these elements can cause glitches, resets, logic errors, and other problems.
Today’s PC board traces are transmission lines and need to be properly
managed. Signals traveling on a PCB trace experience delay. This delay can be
much longer than edge time, is significant in high-speed systems, and is in addi-
tion to logic delays. Signal delay is affected by the length of the PCB trace and
any physical factors that affect either the inductance (L) or capacitance (C), such
as the width, thickness, or spacing of the trace; the layer in the PCB stack-up;
material used in the PCB stack-up; and the distance to ground and VCC planes.
Reflections occur at the ends of a transmission line unless the end is terminated in Zo (its characteristic impedance) by a resistor or another line. Zo = √(L/C), and it determines the ratio of voltage to current in a PCB trace. Increasing
PCB trace capacitance by moving the traces closer to the power plane, making the
traces wider, or increasing the dielectric constant decreases the trace impedance.
Capacitance is more effective in influencing Zo because it changes faster than
inductance with cross-sectional changes.
Increasing PCB trace inductance increases trace impedance; this happens
if the trace is narrow. Trace inductance doesn’t change as quickly as capacitance
does when changing the cross-sectional area and is thus less effective for influ-
encing Zo. On a practical level, both lower trace impedances and strip lines (hav-
ing high C and low Zo) are harder to drive; they require more current to achieve
a given voltage.
How can reflection be eliminated?
Slow down the switching speed of driver ICs. This may be difficult since
this could upset overall timing.
Shorten traces to their critical length or shorter.
Match the end of the line to Zo using passive components.
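A short numeric sketch ties these points together: the characteristic impedance Zo = √(L/C), the propagation delay per unit length, and the reflection coefficient that goes to zero when the termination matches Zo. The per-unit-length L and C values below are assumptions for illustration, not measurements of a real stack-up.

import math

def characteristic_impedance(l_per_m, c_per_m):
    """Zo = sqrt(L/C) for a lossless transmission line."""
    return math.sqrt(l_per_m / c_per_m)

def reflection_coefficient(z_load, z0):
    """Gamma = (ZL - Zo) / (ZL + Zo); zero when the termination matches Zo."""
    return (z_load - z0) / (z_load + z0)

# Assumed per-unit-length values, chosen only so the numbers look familiar.
L_PER_M = 300e-9   # inductance, H/m
C_PER_M = 120e-12  # capacitance, F/m

z0 = characteristic_impedance(L_PER_M, C_PER_M)
print(f"Zo ~= {z0:.1f} ohms")
print(f"propagation delay ~= {math.sqrt(L_PER_M * C_PER_M) * 1e9:.1f} ns/m")

# Moving the trace closer to the power plane raises C and lowers Zo:
print(f"with 50% more C: Zo ~= {characteristic_impedance(L_PER_M, 1.5 * C_PER_M):.1f} ohms")

# Reflection at the load for matched, open, and heavily loaded terminations.
for z_load in (z0, 1e6, 10.0):
    gamma = reflection_coefficient(z_load, z0)
    print(f"ZL = {z_load:>9.1f} ohms -> reflection coefficient {gamma:+.2f}")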
Signal integrity and electromagnetic compatibility (EMC) are related and
have an impact on each other. If an unintended signal, such as internally or exter-
nally coupled noise, reaches the destination first, changes the signal rise time, or
causes it to become nonmonotonic, it’s a timing problem. If added EMC suppres-
sion components distort the waveform, change the signal rise time, or increase
delay, it’s still a timing problem. Some of the very techniques that are most
effective at promoting EMC at the PWA level are also good means of improving
SI. When implemented early in a project, this can produce more robust designs,
often eliminating one prototype iteration. At other times techniques to improve
EMC are in direct conflict with techniques for improving SI.
How a line is terminated determines circuit performance, SI, and EMC.
Matched impedance reduces SI problems and sometimes helps reduce EMC is-
sues. But some SI and EMC effects conflict with each other. Tables 13 and 14
compare termination methods for their impact from signal integrity and EMC
perspectives, respectively. Notice the conflicting points between the various ter-
mination methods as applicable to SI and EMC.
If fast-switching ICs were not used in electronic designs and we didn’t
have signal transitions, then there would be no SI problems, or products manufac-
tured for that matter. The faster the transitions, the bigger the problem. Thus, it
is important to obtain accurate models of each IC to perform proper signal integ-
rity and EMC analysis. The models of importance are buffer rather than logic
models because fast buffer slew times relative to the line lengths cause most of
the trouble.
There are two widely used industry models available, SPICE and IBIS.
SPICE is a de facto standard used for modeling both digital and mixed-signal (ICs
with both digital and analog content) ICs. IBIS is used for modeling digital sys-
tems under the auspices of EIA 656. It is the responsibility of the IC suppliers
(manufacturers) to provide these models to original equipment manufacturers
(OEMs) for use in their system SI analysis.
In summary, as operating speeds increase the primary issues that need to
be addressed to ensure signal integrity include
1. A greater percentage of PCB traces in new designs will likely require
terminators. Terminators help control ringing and overshoot in trans-
mission lines. As speeds increase, more and more PCB traces will begin to take on aspects of transmission line behavior and thus will require termination.
These higher levels of IC and PWA complexity and packing density inte-
gration result in reduced observability and controllability (decreased de-
fect coverage).
The task of generating functional test vectors and designing prototypes is
too complex to meet time-to-market requirements.
Tossing a net list over the wall to the test department to insert test structures
is a thing of the past.
Traditional functional tests provide poor diagnostics and process feedback
capability.
Design verification has become a serious issue with as much as 55% of the
total design effort being focused on developing self-checking verification
programs plus the test benches to execute them.
What’s the solution? What is needed is a predictable and consistent design
for test (DFT) methodology. Design for test is a structured design method that
includes participation from circuit design (including modeling and simulation),
test, manufacturing, and field service inputs. Design for test provides greater test-
ability; improved manufacturing yield; higher-quality product; decreased test
generation complexity and test time; and reduced cost of test, diagnosis, trouble-
shooting, and failure analysis (due to easier debugging and thus faster debug
time). Design for test helps to ensure small test pattern sets—important in reduc-
ing automated test equipment (ATE) test time and costs—by enabling single
patterns to test for multiple faults (defects). The higher the test coverage for a
given pattern set, the better the quality of the produced ICs. The fewer failing
chips that get into products and in the field, the lower the replacement and war-
ranty costs.
Today’s ICs and PWAs implement testability methods (which include inte-
gration of test structures and test pins into the circuit design as well as robust
test patterns with high test coverage) before and concurrent with system logic
design, not as an afterthought when the IC design is complete. Designers are
intimately involved with test at both the IC and PWA levels. Normally, a multi-
disciplinary design team approaches the technical, manufacturing, and logistical
aspects of the PWA design simultaneously. Reliability, manufacturability, diag-
nosability, and testability are considered throughout the design effort.
The reasons for implementing a DFT strategy are listed in Table 16. Of
these, three are preeminent:
Higher quality. This means better fault coverage in the design so that
fewer defective parts make it out of manufacturing (escapes). However,
a balance is required. Better fault coverage means longer test patterns.
From a manufacturing perspective, short test patterns and thus short test
times are required since long test times cost money. Also, if it takes too
long to generate the test program, then the product cycle is impacted
initially and every time there is a design change, new test patterns are
required. Designs implemented with DFT result in tests that are both
faster and of higher quality, reducing the time spent in manufacturing
and improving shipped product quality level.

TABLE 16 Benefits of and Concerns with a DFT Strategy

Benefits:
Improved product quality.
Faster and easier debug and diagnostics of new designs and when problems occur.
Faster time to market, time to volume, and time to profit.
Faster development cycle.
Smaller test patterns and lower test costs.
Lower test development costs.
Ability to trade off performance versus testability.
Improved field testability and maintenance.

Concerns:
Initial impact on design cycle while DFT techniques are being learned.
Added circuit time and real estate area.
Initial high cost during learning period.
Easier and faster debug diagnostics when there are problems. As designs
become larger and more complex, diagnostics become more of a chal-
lenge. In fact, design for diagnosis (with the addition of diagnostic test
access points placed in the circuit during design) needs to be included in
the design for test methodology. Just as automatic test pattern generation
(ATPG) is used as a testability analysis tool (which is expensive this
late in the design cycle), diagnostics now are often used the same way.
Diagnosis of functional failures or field returns can be very difficult. An
initial zero yield condition can cause weeks of delay without an auto-
mated diagnostic approach. However, diagnosing ATPG patterns from
a design with good DFT can be relatively quick and accurate.
Faster time to market.
Design for test (boundary scan and built-in self-test) is an integrated ap-
proach to testing that is being applied at all levels of product design and integra-
tion, shown in Figure 17: during IC design, PWA (board) design and layout, and
system design. All are interconnected and DFT eases the testing of a complete
product or system. The figure shows built-in self-test (BIST) being inserted into
large complex ICs to facilitate test generation and improve test coverage, primar-
ily at the IC level but also at subsequent levels of product integration. Let’s look
at DFT from all three perspectives.
FIGURE 17 Applying BIST and boundary scan at various levels of product integration.
require special testing, the generation of test patterns for the chip logic can be
rapid and of high quality.
The use of scan techniques facilitates PWA testing. It starts with the IC
itself. Scan insertion analyzes a design, locates on-chip flip flops and latches, and
replaces some (partial scan) or all (full scan) of these flip flops and latches with
scan-enabled versions. When a test system asserts those versions’ scan-enable
lines, scan chains carry test vectors into and out of the scan compatible flip flops,
which in turn apply signals to inputs and read outputs from the combinatorial
logic connected to those flip flops. Thus, by adding structures to the IC itself,
such as D flip flops and multiplexers, PWA testing is enhanced through better
controllability and observability. The penalty for this circuitry is 5–15% in-
creased silicon area and two external package pins.
Scan techniques include level sensitive scan design (LSSD), scan path, and
boundary scan. In the scan path, or scan chain, technique, DQ flip flops are in-
serted internally to the IC to sensitize, stimulate, and observe the behavior of
combinatorial logic in a design. Testing becomes a straightforward application
of scanning the test vectors in and observing the test results because sequential
logic is transformed to combinational logic for which ATPG programs are more
effective. Automatic place and route software has been adapted to make all clock
connections in the scan path, making optimal use of clock trees.
The boundary scan method increases the testability over that of the scan
path method, with the price of more on-chip circuitry and thus greater complexity.
With the boundary scan technique, which has been standardized by IEEE 1149.1,
a ring of boundary scan cells surrounds the periphery of the chip (IC). The bound-
ary scan standard circuit is shown in Figures 19 and 20, and the specific character-
istics and instructions applicable to IEEE 1149.1 are listed in Tables 17 and 18,
respectively.
Each boundary scan IC has a test access port (TAP) which controls the
shift–update–capture cycle, as shown in Figure 19. The TAP is connected to a
test bus through two pins, a test data signal, and a test clock. The boundary scan
architecture also includes an instruction register, which provides opportunities
for using the test bus for more than an interconnection test, i.e. component iden-
tity. The boundary scan cells are transparent in the IC’s normal operating mode.
In the test mode they are capable of driving predefined values on the output pins
and capturing response values on the input pins. The boundary scan cells are
linked as a serial register and connected to one serial input pin and one serial
output pin on the IC.
It is very easy to apply values at IC pins and observe results when this
technique is used. The tests are executed in a shift–update–capture cycle. In the
shift phase, drive values are loaded in serial into the scan chain for one test while
the values from the previous test are unloaded. In the update phase, chain values
are applied in parallel on output pins. In the capture phase, response values are
loaded in parallel into the chain.
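The shift–update–capture sequence can be made concrete with a toy simulation. The class below is deliberately simplified: it illustrates only the serial-load, parallel-apply, parallel-capture idea and is not an implementation of the IEEE 1149.1 TAP controller.

# A toy simulation of the shift-update-capture cycle described above. It is a
# simplified illustration, not an implementation of the IEEE 1149.1 standard.

class BoundaryScanChain:
    def __init__(self, length):
        self.shift_reg = [0] * length    # serial shift stage of each cell
        self.update_reg = [0] * length   # parallel hold stage driving the pins

    def shift(self, bits_in):
        """Shift new drive values in serially; previously captured values
        come out the other end (toward TDO), one bit per clock."""
        bits_out = []
        for bit in bits_in:
            bits_out.append(self.shift_reg[-1])
            self.shift_reg = [bit] + self.shift_reg[:-1]
        return bits_out

    def update(self):
        """Apply the shifted-in values to the output pins in parallel."""
        self.update_reg = list(self.shift_reg)
        return self.update_reg

    def capture(self, pin_values):
        """Capture the response seen on the input pins in parallel."""
        self.shift_reg = list(pin_values)

chain = BoundaryScanChain(length=4)
chain.shift([1, 0, 1, 1])          # load a test vector serially
print("drive pins:", chain.update())
chain.capture([0, 1, 1, 0])        # values observed on the input pins
print("unloaded  :", chain.shift([0, 0, 0, 0]))  # shift out the response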
Boundary scan, implemented in accordance with IEEE 1149.1, which is
mainly intended for static interconnection test, can be enhanced to support dy-
namic interconnection test (see Fig. 21). Minor additions to the boundary scan
cells allow the update–capture sequence to be clocked from the system clock
rather than from the test clock. Additional boundary scan instruction and some
control logic must be added to the ICs involved in the dynamic test. The drive
and response data are loaded and unloaded through the serial register in the same
way as in static interconnection test. There are commercially available tools that
support both static and dynamic test.
TABLE 18 IEEE 1149.1 Instructions

Bypass: Inserts a 1-bit bypass register between TDI and TDO.
Extest: Uses the boundary register first to capture, then to shift, and finally to update I/O pad values.
Sample/preload: Uses the boundary register first to capture and then to shift I/O pad values without affecting system operation.
Other optional and/or private instructions: Defined by the standard or left up to the designer to specify behavior.

For analog circuits, boundary scan implemented via the IEEE 1149.4 test standard simplifies analog measurements at the board level (Fig. 22). Two (alternatively four) wires for measurements are added to the boundary scan bus. The original four wires are used as in boundary scan for test control and digital data.
Special analog boundary scan cells have been developed which can be linked to
the analog board level test wires through fairly simple analog CMOS switches.
This allows easy setup of measurements of discrete components located between
IC pins. Analog and digital boundary scan cells can be mixed within the same
device (IC). Even though the main purpose of analog boundary scan is the test
of interconnections and discrete components, it can be used to test more complex
board level analog functions as well as on-chip analog functions.
After adding scan circuitry to an IC, its area and speed of operation change.
The design increases in size (5–15% larger area) because scan cells are larger
than the nonscan cells they replace and some extra circuitry is required, and
the nets used for the scan signals occupy additional area. The performance of
the design will be reduced as well (5–10% speed degradation) due to changes
in the electrical characteristics of the scan cells that replaced the nonscan cells
and the delay caused by the extra circuitry.
Built-in self-test is a design technique in which test vectors are generated
on-chip in response to an externally applied test command. The test responses are
compacted into external pass/fail signals. Built-in self-test is usually implemented
through ROM (embedded memory) code instructions or through built-in (on-
chip) random word generators (linear feedback shift registers, or LFSRs). This
allows the IC to test itself by controlling internal circuit nodes that are otherwise
unreachable, reducing tester and ATPG time and data storage needs.
In a typical BIST implementation (Fig. 23) stimulus and response circuits
are added to the device under test (DUT). The stimulus circuit generates test
patterns on the fly, and the response of the DUT is analyzed by the response
circuit. The final result of the BIST operation is compared with the expected
result externally. Large test patterns need not be stored externally in a test system
since they are generated internally by the BIST circuit. At-speed testing is possi-
FIGURE 23 Built-in self-test can be used with scan ATPG to enable effective system-
on-chip testing. (From Ref. 4.)
ble since the BIST circuit uses the same technology as the DUT and can be run
off the system clock.
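The pattern-generation side of BIST is often nothing more than an LFSR. The sketch below shows a maximal-length 4-bit LFSR; the choice of polynomial is a common textbook example rather than anything mandated by a standard, and the response compactor (e.g., a MISR) is omitted for brevity.

# Minimal sketch of a linear feedback shift register (LFSR) of the kind used
# as an on-chip BIST pattern generator. The taps below give a maximal-length
# 4-bit LFSR; real implementations are wider and add response compaction.

def lfsr_patterns(seed=0b0001, taps=(3, 2), width=4, count=15):
    """Yield successive pseudorandom test patterns from a Fibonacci LFSR."""
    state = seed
    for _ in range(count):
        yield state
        feedback = 0
        for t in taps:                     # XOR the tapped bits
            feedback ^= (state >> t) & 1
        state = ((state << 1) | feedback) & ((1 << width) - 1)

patterns = list(lfsr_patterns())
print([f"{p:04b}" for p in patterns])
# A maximal-length 4-bit LFSR cycles through all 15 nonzero states before
# repeating, giving broad stimulus coverage from very little hardware.
assert len(set(patterns)) == 15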
Built-in self-test has been primarily implemented for testing embedded
memories since highly effective memory test algorithms can be implemented in
a compact BIST circuit but at a cost of increased circuit delay. The tools for
implementing digital embedded memory BIST are mature. Because of the un-
structured nature of logic blocks, logic BIST is difficult to implement but is being
developed. The implementation of analog BIST can have an impact on the noise
performance and accuracy of the analog circuitry. The tools to implement analog
BIST are being developed as well.
Both BIST and boundary scan have an impact on product and test cost
during all phases of the product life cycle: development, manufacturing, and field
deployment. For example, boundary scan is often used as a means to rapidly
identify structural defects (e.g., solder bridges or opens) during early life debug-
ging. Built-in self-test and boundary scan may be leveraged during manufacturing
testing to improve test coverage, reduce test diagnosis time, reduce test capital,
or all of the above. In the field, embedded boundary scan and BIST facilitate
accurate system diagnostics to the field replacement unit (FRU, also called the
customer replaceable unit, or CRU). The implementation of BIST tends to
lengthen IC design time by increasing synthesis and simulation times (heavy
computational requirements), but reduces test development times.
Design for test techniques have evolved to the place where critical tester
(ATE) functions (such as pin electronics) are embedded on the chip being tested.
The basic idea is to create microtesters for every major functional or architectural
TABLE 19 Examples of Design Hints and Guidelines at the PWA Level to Facilitate Testing

Electrical design hints:
Disable the clocks to ease testing.
Provide access to enables.
Separate the resets and enables.
Unused pins should have test point access.
Unused inputs may require pull-up or pull-down resistors.
Batteries must have enabled jumpers or be installed after test.
Bed-of-nails test fixture requires a test point for every net, all on the bottom side of the board.

PWA test point placement rules:
All test points should be located on a single side of the PWA.
Distribute test points evenly.
Minimum of one test point per net.
Multiple VCC and ground test pads distributed across the PWA.
One test point on each unused IC pin.
No test points under components on the probe side of the PWA.

Typical PWA test points:
Through leads.
Uncovered and soldered via pads (bigger).
Connectors.
Card-edge connectors.
Designated test points.
methods are emerging that make use of the standardized boundary scan bus. This
activity will only serve to facilitate the widespread adoption of DFT techniques.
The myriad topics involved with IC and board (PWA) tests have been dis-
cussed via tutorials and formal papers, debated via panel sessions at the annual
International Test Conference, and published in its proceedings. It is suggested
that the reader who is interested in detailed information on these test topics con-
sult these proceedings.
where designers and test engineers work interactively and concurrently to solve
the testability issue. Design for test is also a value-added investment in improving
testability in later product phases, i.e., manufacturing and field troubleshooting.
aspects or elements of the design. These types of design reviews are much more
prevalent in smaller-sized entrepreneurial companies.
supply constraints? How will my design decisions have the most impact on supply
constraints? How will my design decisions affect NPI schedules? What design
decisions will result in optimizing my production costs and schedules?
The FMEA is a systematic and documented analysis of the ways in which a system can fail, the
causes for each failure mode, and the effects of each failure. Its primary objective
is the identification of catastrophic and critical failure possibilities so that they
can be eliminated or minimized through design change. The FMEA results may
be either qualitative or quantitative, although most practitioners attempt to quan-
tify the results.
In FMEA, each component in the system is assumed to fail catastrophically
in one of several failure modes and the impact on system performance is assessed.
That is, each potential failure studied is considered to be the only failure in the
system, i.e., a single point failure. Some components are considered critical be-
cause their failure leads to system failure or an unsafe condition. Other compo-
nents will not cause system failure because the system is designed to be tolerant
of the failures. If the failure rates are known for the specific component failure
modes, then the probability of system malfunction or failure can be estimated.
The design may then be modified to make it more tolerant of the most critical
component failure modes and thus make it more reliable. The FMEA is also
useful in providing information for diagnostic testing of the system because it
produces a list of the component failures that can cause a system malfunction.
The FMEA, as mentioned, can be a useful tool for assessing designs, devel-
oping robust products, and guiding reliability improvements. However, it is time
consuming, particularly when the system includes a large number of components.
Frequently it does not consider component degradation and its impact on system
performance. This leads to the use of a modified FMEA approach in which only
failures of the high risk or critical components are considered, resulting in a
simpler analysis involving a small number of components. It is recommended
that the FMEA include component degradation as well as catastrophic failures.
Although the FMEA is an essential reliability task for many types of system
design and development, it provides limited insight into the probability of system
failure. Another limitation is that the FMEA is performed for only one failure
at a time. This may not be adequate for systems in which multiple failure modes
can occur, with reasonable likelihood, at the same time. However, the FMEA
provides valuable information about the system design and operation.
The FMEA is usually iterative in nature. It should be conducted concur-
rently with the design effort so that the design will reflect the analysis conclusions
and recommendations. The FMEA results should be utilized as inputs to system
interfaces, design tradeoffs, reliability engineering, safety engineering, mainte-
nance engineering, maintainability, logistic support analysis, test equipment de-
sign, test planning activities, and so on. Each failure mode should be explicitly
defined and should be addressed at each interface level.
The FMEA utilizes an inductive logic or bottom-up approach. It begins at
the lowest level of the system hierarchy (normally at the component level) and
using knowledge of the failure modes of each part it traces up through the system
hierarchy to determine the effect that each potential failure mode will have on
system performance. The FMEA focus is on the parts which make up the system.
The FMEA provides
1. A method for selecting a design with a high probability of operational
success and adequate safety.
2. A documented uniform method of assessing potential failure modes
and their effects on operational success of the system.
3. Early visibility of system interface problems.
4. A list of potential failures which can be ranked according to their seri-
ousness and the probability of their occurrence.
5. Identification of single point failures critical to proper equipment func-
tion or personnel safety.
6. Criteria for early planning of necessary tests.
7. Quantitative, uniformly formatted input data for the reliability predic-
tion, assessment, and safety models.
8. The basis for troubleshooting procedures and for the design and loca-
tion of performance monitoring and false sensing devices.
9. An effective tool for the evaluation of a proposed design, together with
any subsequent operational or procedural changes and their impacts
on proper equipment functioning and personnel safety.
The FMEA effort is typically led by reliability engineering, but the actual
analysis is done by the design and component engineers and others who are inti-
mately familiar with the product and the components used in its design. If the
design is composed of several subassemblies, the FMEA may be done for each
subassembly or for the product as a whole. If the subassemblies were designed
by different designers, each designer needs to be involved, as well as the product
engineer or systems engineer who is familiar with the overall product and the
subassembly interface requirements. For purchased assemblies, like power sup-
plies and disk drives, the assembly design team needs to provide an FMEA that
meets the OEM’s needs. We have found, as an OEM, that a team of responsible
engineers working together is the best way of conducting an FMEA.
The essential steps in conducting an FMEA are listed here, a typical FMEA
worksheet is shown in Table 21, and a procedure for critical components is given
in Table 22.
1. Reliability block diagram construction. A reliability block diagram is
generated that indicates the functional dependencies among the various
elements of the system. It defines and identifies each required subsys-
tem and assembly.
2. Failure definition. Rigorous failure definitions (including failure
modes, failure mechanisms, and root causes) must be established for
the entire system, the subsystems, and all lower equipment levels.
TABLE 21 Typical FMEA Worksheet

Process description and purpose:
Briefly describe the process being analyzed.
Concisely describe the purpose of the process.
Note: If a process involves multiple operations that have different modes of failure, it may be useful to list them as separate processes.

Potential failure mode:
Assume incoming parts/materials are correct.
Outlines the reason for rejection at a specific operation.
A cause can be associated with a potential failure either upstream or downstream.
List each potential failure mode in terms of a part or process characteristic, from engineer and customer perspectives.
Note: Typical failure modes could be bent, corroded, leaking, deformed, misaligned.

Potential effects of failure:
The customer could be the next operation, a subsequent operation or location, the purchaser, or the end user.
Describe in terms of what the customer might notice or experience: in terms of system performance for the end user, or in terms of process performance for the subsequent operation.
If a failure involves potential noncompliance with government regulations, it must be indicated as such.
Examples include noise, unstable, rough, inoperative, erratic/intermittent operation, excessive effort required, operation impaired.

Severity:
Severity is an assessment of the seriousness of the effect (in the Potential effects of failure column).

Severity of effect (Rank):
Minor (1): No real effect caused. Customer probably will not notice the failure.
Low (2,3): Slight customer annoyance. Slight inconvenience with subsequent process or assembly. Minor rework action.
Moderate (4,5,6): Some customer dissatisfaction. May cause unscheduled rework/repair/damage to equipment.
High (7,8): High degree of customer dissatisfaction due to the nature of the failure. Does not involve safety or noncompliance with government regulations. May cause serious disruption to subsequent operations, require major rework, and/or endanger machine or operator.
Very high (9,10): Potential failure mode affects safe operation. Noncompliance with government regulations.
Note: Severity can only be affected by design.
Potential causes of failure:
List every conceivable failure cause assignable to each potential mode.
If correcting the cause has a direct impact on the mode, then this portion of the FMEA process is complete.
If causes are not mutually exclusive, a DOE may be considered to determine the root cause or to control the cause.
Causes should be described such that remedial efforts can be aimed at pertinent causes.
Only specific errors or malfunctions should be listed; ambiguous causes (e.g., operator error, machine malfunction) should not be included.
Examples are handling damage, incorrect temperature, inaccurate gauging, incorrect gas flow.

Occurrence:
Occurrence is how frequently the failure mode will occur as a result of a specific cause (from the Potential causes of failure column).
Estimate the likelihood of the occurrence of potential failure modes on a 1 to 10 scale. Only methods intended to prevent the cause of failure should be considered for the ranking; failure-detecting measures are not considered here.
The following occurrence ranking system should be used to ensure consistency. The possible failure rates are based on the number of failures that are anticipated during the process execution.

Probability (Rank): Possible failure rate
Remote: failure unlikely; Cpk ≥ 1.67 (1): ≤1 in 10⁶ (≈ ±5σ)
Very low: in statistical control; Cpk > 1.33 (2): >1 in 10⁶, ≤1 in 20,000 (≈ ±4σ)
Low: relatively few failures, in statistical control; Cpk > 1.00 (3): >1 in 20,000, ≤1 in 4,000 (≈ ±3.5σ)
Moderate: occasional failures, in statistical control; Cpk ≤ 1.00 (4,5,6): >1 in 4,000, ≤1 in 80 (≈ ±3σ)
High: repeated failures, not in statistical control; Cpk < 1.00 (7,8): >1 in 80, ≤1 in 40 (≈ ±1σ)
Very high: failure almost inevitable (9,10): >1 in 40, up to 1 in 8

Current process controls:
Describe the controls that either prevent failure modes from occurring or detect them should they occur.
Examples could be process control (i.e., SPC) or postprocess inspection/testing.
TABLE 21 Continued

Worksheet columns: Process description and purpose; Detection; Risk priority number (RPN); Recommended actions.
Worksheet header fields: Prepared by; FMEA date (orig.); (Rev. date); Key production date; Eng. release date; Area; Plant(s).

Action results:
Area/individual responsible and completion date: Enter the area and person responsible. Enter the target completion date.
Action taken and actual completion date: After action has been taken, briefly describe the actual action and the effective or completion date.
Severity, Occurrence, Detection, RPN: Resulting RPN after the corrective action is taken. Estimate and record the new ranking for severity, occurrence, and detection resulting from the corrective action. Calculate and record the resulting RPN. If no action is taken, leave blank. Once action has been completed, the new RPN is moved over to the first RPN column. Old FMEA revisions are evidence of system improvement. Generally the previous FMEA version(s) are kept in document control.
Note: Severity can only be affected by design.
Follow-up:
Process engineer is responsible for assuring all recommended actions have been implemented or ad-
dressed.
“Living documentation” must reflect latest process level, critical (key) characteristics, and manufac-
turing test requirements.
PCN may specify such items as process condition, mask revision level, packaging requirements,
and manufacturing concerns.
Review FMEAs on a periodic basis (minimum annually).
An example FMEA for a memory module is presented in Appendix B at the back of this book. Also
provided is a list of action items resulting from this analysis, the implementation
of which provides a more robust and thus more reliable design.
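The bookkeeping behind the worksheet's risk priority number is conventionally the product of the severity, occurrence, and detection rankings, and the highest RPNs are addressed first. The sketch below illustrates that calculation; the failure modes and rankings are invented purely for illustration.

# Small sketch of the RPN bookkeeping from the FMEA worksheet above:
# RPN = severity x occurrence x detection, each ranked 1-10, with the highest
# RPNs addressed first. The failure modes and rankings below are invented.

def rpn(severity, occurrence, detection):
    """Risk priority number for one failure mode/cause combination."""
    for rank in (severity, occurrence, detection):
        if not 1 <= rank <= 10:
            raise ValueError("rankings must be on the 1-10 scale")
    return severity * occurrence * detection

failure_modes = [
    # (description, severity, occurrence, detection)
    ("solder bridge under BGA",        7, 4, 6),
    ("cold solder joint on connector", 5, 3, 4),
    ("wrong-value resistor placed",    6, 2, 2),
]

ranked = sorted(failure_modes, key=lambda fm: rpn(fm[1], fm[2], fm[3]), reverse=True)
for desc, s, o, d in ranked:
    print(f"RPN {rpn(s, o, d):>3}  {desc}  (S={s}, O={o}, D={d})")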
The growing demand for electrical and electronic appliances will at the
same time create more products requiring disposal. Efforts to increase the reuse
and recycling of end-of-life electronic products have been growing within the
electronics industry as a result of the previously mentioned regulatory and legal
pressures. There are also efforts to reduce packaging or, in some cases, provide
reuseable packaging. Products containing restricted or banned materials are more
costly and difficult to recycle because of regional restrictive legislation. All of
this is adding complexity to the product designer’s task.
5. Special studies and application tests are used to investigate the idiosyn-
crasies and impact of unspecified parameters and timing condition in-
teractions of critical ICs with respect to each other and their impact
on the operation of the product as intended.
6. Accelerated environmental stress testing (such as HALT and STRIFE
testing) of the PWAs, power supplies, and other critical components
is used to identify weaknesses and marginalities of the completed de-
sign (with the actual production components being used) prior to re-
lease to production.
Some of these have been discussed previously; the remainder will now be ad-
dressed in greater detail.
They verify the CAD model simulation results and rationalize them with actual
hardware build and with the variability of components and manufacturing pro-
cesses. Mechanical engineers model and remodel the enclosure design. Printed
circuit board designers check the layout of the traces on the PCB, adjust pad and
package sizes, and review component layout and spacing. Manufacturing process
engineers check the PCB’s chemistry to ensure it is compatible with production
cells currently being built. If a PCB has too many ball grid array components or
too many low profile ceramic components, it may force the use of the more
expensive and time-consuming “no-clean” chemistry. Test engineers look for
testability features such as net count and test point accessibility. A board that
has no test points or exposed vias will make in-circuit testing impossible and
thus require a costlier alternative such as functional test. Cable assembly engi-
neers must look at interconnects for better termination and shielding opportuni-
ties. Today’s products are challenged by higher transmission rates, where greater
speeds can cause crosstalk, limiting or preventing specified performance. Finally,
plastic/polymer engineers review for flow and thermal characteristics that will
facilitate an efficient production cycle, and sheet metal engineers look for tooling
and die compatibility.
A well-designed DVT provides a good correlation of measured reliability
results to modeled or predicted reliability. Design verification testing delivers
best on its objectives if the product has reached a production-ready stage of design
maturity before submission to DVT. Major sources of variation (in components,
suppliers of critical components, model mix, and the like) are intentionally built
into the test population. Test data, including breakdown of critical variable mea-
surements correlated to the known sources of variation, give the product design
team a practical look at robustness of the design and thus the ability to produce
it efficiently in volume.
In these ways test is an important aspect of defining and improving product
quality and reliability, even though the act of performing testing itself does not
increase the level of quality.
3.24.3 Thermography
Design verification testing is also a good time to check the product design’s
actual thermal characteristics and compare them with the modeled results, and
to validate the effectiveness of the heat sinking and distribution system. A thermal
profile of the PWA or module is generated to look for hot spots created by
high-power-dissipating components and to assess the impact of that heat on nearby
components.
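Where a thermal image of the assembly is available as a simple grid of temperatures, the hot-spot search itself can be scripted. The following is a minimal sketch, assuming a hypothetical list-of-lists thermal map and an arbitrary 15°C-above-ambient threshold; it illustrates the idea only and is not tied to any particular thermography tool.

```python
# Minimal sketch: flag hot spots in a PWA thermal map.
# The thermal_map values and the 15 C-over-ambient threshold are
# illustrative assumptions, not data from any real measurement.

def find_hot_spots(thermal_map, ambient_c, delta_c=15.0):
    """Return (row, col, temperature) for cells more than delta_c above ambient."""
    hot = []
    for r, row in enumerate(thermal_map):
        for c, temp_c in enumerate(row):
            if temp_c - ambient_c > delta_c:
                hot.append((r, c, temp_c))
    return hot

if __name__ == "__main__":
    thermal_map = [
        [32.1, 33.0, 34.2, 33.5],
        [33.4, 51.8, 36.0, 34.1],   # the 51.8 C cell sits over a power device
        [32.9, 35.2, 33.8, 33.0],
    ]
    for r, c, t in find_hot_spots(thermal_map, ambient_c=25.0):
        print(f"Hot spot at grid ({r}, {c}): {t:.1f} C")
```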
Electronic equipment manufacturers have turned to the use of computa-
tional fluid dynamics (CFD) (discussed in Chapter 5) during the front end of the
design process and thermography after the design is complete to help solve com-
FIGURE 26 Ideal environmental stress and product strength distributions after product
improvements.
HALT
Several points regarding HALT (an acronym for highly accelerated life test,
which is a misnomer because it is an overstress test) need to be made.
1. Selection of the stresses to be used is the basis of HALT. Some stresses are
universal in their application, such as temperature, thermal cycling, and vibration.
Others are suitable to more specific types of products, such as clock margining
for logic boards and current loading for power components. Vibration and thermal
stresses are generally found to be the most effective environmental stresses in
precipitating failure. Temperature cycling detects weak solder joints, IC package
integrity problems, CTE mismatch, PWA mounting problems, and PWA processing issues:
failures that would otherwise appear over time in the field. Vibration testing is
normally used to check a product against its shipping and operational vibration levels.
Printed wiring assembly testing can show weak or brittle solder or inadequate wicking. Bad
connections may be stressed to failure at levels that do not harm good connec-
tions.
2. HALT is an iterative process, so that stresses may be added or deleted
in the sequence of fail–fix–retest.
3. In conducting HALT there is every intention of doing physical damage
to the product in an attempt to maximize and quantify the margins of product
strength (both operating and destruct) by stimulating harsher-than-expected end-
use environments.
4. The HALT process continues with a test–analyze–verify–fix ap-
proach, with root cause analysis of all failures. Test time is compressed with
accelerated stressing, leading to earlier product maturity. The results of acceler-
ated stress testing are
Fed back to design to select a different component/assembly and/or sup-
plier, improve a supplier’s process, or make a circuit design or layout
change
grouping, such as EN 60950 for ITE. Table 23 lists typical product safety test
requirements.
different signal delay (impedance) and timing characteristics, the length of that
transition period varies. The final voltage value attained also varies slightly as a
function of the device characteristics and the operating environment (temperature,
humidity). Computer hardware engineers allow a certain period of time (called
design margin) for the transition period to be completed and the voltage value
to settle. If there are timing errors or insufficient design margins that cause the
voltage to be read at the wrong time, the voltage value may be read incorrectly,
and the bit may be misinterpreted, causing data corruption. It should be noted
that this corruption can occur anywhere in the system and could cause incorrect
data to be written to a computer disk, for example, even when there are no errors
in computer memory or in the calculations.
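The design-margin check itself reduces to simple arithmetic: the worst-case settling time must be shorter than the time available before the value is latched. The sketch below illustrates this with hypothetical delay, clock, and setup numbers; real values would come from the timing analysis of the actual design.

```python
# Minimal sketch: does the worst-case settling time fit inside the design margin?
# All delay and timing numbers are hypothetical illustrations, not data
# from any real design.

def worst_case_settle_ns(driver_delay_ns, trace_delay_ns, settle_tail_ns):
    """Sum the contributors to the time a signal needs before it can be read."""
    return driver_delay_ns + trace_delay_ns + settle_tail_ns

def margin_ok(settle_ns, clock_period_ns, setup_ns):
    """The value must settle before the sampling edge, less the setup time."""
    available_ns = clock_period_ns - setup_ns
    return settle_ns <= available_ns, available_ns - settle_ns

if __name__ == "__main__":
    settle = worst_case_settle_ns(driver_delay_ns=3.2, trace_delay_ns=1.1,
                                  settle_tail_ns=2.4)
    ok, slack_ns = margin_ok(settle, clock_period_ns=10.0, setup_ns=1.5)
    print(f"settle = {settle:.1f} ns, slack = {slack_ns:.1f} ns, "
          f"{'OK' if ok else 'MARGIN VIOLATION'}")
```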
The effect of a software design error is even less predictable than the effect
of a hardware design error. An undiscovered software design error could cause
both a processor halt and data corruption. For example, if the algorithm used to
compute a value is incorrect, there is not much that can be done outside of good
software engineering practices to avoid the mistake. A processor may also attempt
to write to the wrong location in memory, which may overwrite and corrupt a
value. In this case, it is possible to avoid data corruption by not allowing the
processor to write to a location that has not been specifically allocated for the
value it is attempting to write.
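A toy model makes the write-protection idea concrete: refuse any write to an address that was never allocated to the writer. The Python sketch below is purely illustrative (real systems enforce this in the memory-management hardware and operating system), and every name in it is invented.

```python
# Toy model of write protection: reject writes to addresses that were never
# allocated to the writer. Purely illustrative; real protection is enforced
# by the memory-management hardware and operating system, not by the
# application itself.

class ProtectedMemory:
    def __init__(self, size):
        self._cells = [0] * size
        self._allocated = set()      # addresses the processor may write

    def allocate(self, addr, length):
        self._allocated.update(range(addr, addr + length))

    def write(self, addr, value):
        if addr not in self._allocated:
            raise MemoryError(f"write to unallocated address {addr} blocked")
        self._cells[addr] = value

if __name__ == "__main__":
    mem = ProtectedMemory(size=16)
    mem.allocate(addr=4, length=4)   # addresses 4-7 belong to this value
    mem.write(5, 0xAB)               # permitted
    try:
        mem.write(12, 0xCD)          # would silently corrupt another value
    except MemoryError as err:
        print(err)
```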
These considerations stress the importance of conducting both hardware
and software design reviews.
ACKNOWLEDGMENTS
Section 3.10.1 courtesy of Reliability Engineering Department, Tandem Division,
Compaq Computer Corporation. Portions of Section 3.12 used with permission
from SMT magazine. Portions of Section 3.11 courtesy of Noel Donlin, U.S.
Army (retired).
REFERENCES
1. Bergman D. CAD to CAM made easy, SMT, July 1999 and PennWell, 98 Spit Brook
Rd., Nashua, NH 03062.
2. Brewer R. EMC design practices: preserving signal integrity. Evaluation Engineering,
November 1999.
3. McLeish JG. Accelerated Reliability Testing Symposium (ARTS) USA, 1999.
4. Carlsson G. DFT Enhances PCB manufacturing. Future Circuits International.
www.mriresearch.com.
5. Bergman D. GenCam addresses high density circuit boards. Future Circuits Interna-
tional, Issue No. 4. www.mriresearch.com.
Further Reading
1. Barrow P. Design for manufacture. SMT, January 2002.
2. Cravotta R. Dress your application for success. EDN, November 8, 2001.
3. Dipert B. Banish bad memories. EDN, November 22, 2001.
4. McDermott RE et al. The Basics of FMEA. Portland, OR: Productivity, Inc.
5. Nelson R. DFT lets ATE work MAGIC. Test & Measurement World, May 2001.
6. Parker KP and Zimmerle D. Boundary scan signals future age of test. EP&P, July
2002.
7. Sexton J. Accepting the PCB test and inspection challenge. SMT, April 2001.
8. Solberg V. High-density circuits for hand-held and portable products. SMT, April
2001.
9. Troescher M and Glaser F. Electromagnetic compatibility is not signal integrity.
Item 2002.
10. Webb W. Designing dependable devices. EDN, April 18, 2002.
11. Williams P and Stemper M. Collaborative product commerce—the next frontier.
Electronic Buyers News, May 6, 2002.
4.1 INTRODUCTION
This chapter focuses on an extremely important, dynamic, and controversial part
of the design process: component and supplier selection and qualification. The
selection of the right functional components and suppliers for critical components
in a given design is the key to product manufacturability, quality, and reliability.
Different market segments and system applications have different requirements.
Table 1 lists some of these requirements.
TABLE 1 (column headings): market/system application; volume; component quality requirements; components used (std,a well proven,b leading edge); technology leader or follower; fault coverage requirement; time to market; component available at design; sensitive to material cost; component cost versus total system cost; system design style.
a Off-the-shelf.
b “Old” or mature.
c Custom leading-edge component or chip set for heart of system (chip for autofocus camera; microcontroller for automotive, etc.).
d All components available at design stage.
e System designed before critical/core component is available, then dropped in at FCS.
f Work with supplier to have custom components available when needed.
FIGURE 1 Supply chain for electronic equipment manufacturer with communication re-
quirements included.
FIGURE 2 Detailed supply chain diagram showing logistics, freight, and systems inter-
connections.
for example) have been formed to develop common areas of focus, re-
quirements, and guidelines for suppliers of that industry segment with the
benefit of improved supplier quality and business processes and reduced
overhead.
Commodity (aka supply base) teams consisting of representatives of many
functional disciplines own the responsibility for managing the suppliers
of strategic or critical components.
Many traditional job functions/organizations will have to justify their exis-
tence by competing with open market (external) providers of those same
services for the right to provide those services for their company (e.g.,
Purchasing, Human Resources, Shipping/Logistics, Design Engineering,
Component Engineering, etc.). A company is no longer bound to use
internal service providers.
Why has managing the supply base become so important? The largest cost
to an OEM is external to the factory, i.e., the suppliers or supply base. It has
been estimated that approximately 60% of the cost of a personal computer is due
to the purchased components, housings, disk drives, and power supplies. Therein
lies the importance of managing the supply base.
Companies are taking the supply base issue seriously. In today’s competi-
tive environment, original equipment manufacturers are continually evaluating
their supply base. Nearly all companies have too many suppliers, not enough
good suppliers, inadequate measurements for supplier performance, and supplier
problems. They are evaluating their supply base to determine
1. If they have the right suppliers
2. Which poorly performing current suppliers are incapable of becoming
good suppliers
3. Which poorly performing current suppliers have the potential of be-
coming good suppliers with OEM cooperation
4. How good current suppliers can be helped to become even better
5. Where can more good suppliers be found with the potential to become
best in class for the components they provide
based on price, the lowest bidder got the business, in an arm’s length and of-
ten adversarial relationship. There was no such thing as total cost of ownership.
The entire sourcing decision was based on an antiquated cost accounting sys-
tem. Delivery, quality, responsiveness, and technical expertise ran far behind
price consideration. Companies felt that the more suppliers they had, the bet-
ter. They would (and some still do) leverage (pit) suppliers against one an-
other for price concessions. There were no metrics established for either the
supplier or the purchasing function. Negotiated lead times were used to ensure
delivery.
Purchasing was staffed as a tactical organization being reactive to Engi-
neering’s and Manufacturing’s needs, rather than having a strategic forward-look-
ing thrust. The sourcing activity was treated as an unimportant subactivity of
the Purchasing Department, which was more internally focused than externally
focused. The main focus was on activities such as manufacturing resource plan-
ning and inventory strategies. Purchasing was essentially a clerical function with
no special skills or technical education required. It was viewed as providing no
competitive advantage for the company.
Internally, Engineering generated the design essentially without utilizing
the suppliers’ technical expertise, Manufacturing’s inputs, or Purchasing’s
involvement. Then Engineering would throw the specifications over the wall to
Purchasing. Purchasing would obtain competitive quotations and place the order
with the lowest bidder with guaranteed lead times. Every functional organization
at the OEM’s facility operated as an independent silo (Fig. 3) doing its own thing,
as opposed to today’s use of crossfunctional teams. The entire organizational
structure was based on doing things (activities) rather than achieving measurable
results, i.e., firefighting versus continuous improvement. All in all, it was a poor
use of everyone’s intellectual resources.
people and fixed assets difficult. This leads to the natural tendency to seek out
qualified partners that can materially contribute to strategic goals. Table 2 lists
some of the main tactical and strategic reasons for outsourcing.
I need to point out that outsourcing is not merely contracting out. What
are the differences?
1. While contracting is often of limited duration, outsourcing is a long-
term commitment to another company delivering a product and/or ser-
vice to your company.
2. While providers of contracted services sell them as products, providers
of outsourcing services tailor services to the customer’s needs.
3. While a company uses external resources when it contracts out, when
it employs outsourcing it usually transfers its internal operation (in-
cluding staff ) to the outsource supplier for a guaranteed volume sup-
port level over a specified period of time (5 years, for example).
4. While in a contracting relationship the risk is held by the customer and
managed by the supplier, in an outsourcing relationship there is a more
equal sharing of risk.
5. Greater trust is needed to engage successfully in an outsourcing ar-
rangement than in a simple contracting situation. This is because in
outsourcing the supplier also assumes the risk, but the customer is more
permanently affected by any failure by the supplier.
6. While contracting is done within the model of a formal customer–
supplier relationship, the model in an outsourcing relationship is one
of true partnership and mutually shared goals.
What are the advantages of outsourcing noncore work? First, as already stated,
are the simple cost issues. Fixed costs are avoided (they become variable costs)
in that the company does not have to maintain the infrastructure through peak
and slack periods. This is a big benefit for the OEM since the impact of market
volatility on cost of goods sold and overhead is now borne by the contract manu-
facturer (CM), making the OEM largely immune to this variability.
Second, there are cost savings due to the fact that the outsource service
provider can provide the services more efficiently and thus cheaper than the OEM
can. If the service-providing company is at the cutting edge of technology, it is
constantly reengineering its core processes to provide increased efficiency and
reduced costs to its OEM customers.
Third is the issue of staffing. OEM in-house service processes are staffed
to cope with all possible crises and peak demand. An outsourced support process
can be staffed to meet day-to-day needs, secure in the knowledge that there is
adequate staff in reserve for peak loads and adequate specialized staff to bring
on board quickly and cost efficiently.
Fourth are service issues. Having an agreement with a provider of a particu-
lar service gives a company access to a wider skill base than it would have in-
house. This access provides the OEM with more flexibility than it would have
if it had to recruit or contract specific skills for specialized work. The OEM can
change the scope of service any time with adequate notice, not having to face
the issues of “ramping up” or downsizing in response to market conditions. Work-
ing with a company whose core business process is providing a particular service
also improves the level of service to OEMs above what they are often able to
provide for themselves.
Fifth, creating a good outsourcing relationship allows the OEM to maintain
control over its needs and sets accountability firmly with the partnering service
provider. The clear definition and separation of roles between OEM and CM
ensures that service levels and associated costs can be properly identified and
controlled to a degree rarely seen in-house. All of this can be done without the
internal political issues that so often clutter relations between support and man-
agement process managers who are seen by their internal customers as merely
service providers.
Finally, and most importantly, outsourcing allows the OEM to focus its
energies on its core business processes. It focuses on improving its competitive
position and on searching the marketplace for opportunities in which to compete.
Companies thinking about outsourcing some of their non-core processes
have some fears or concerns. One set of concerns revolves around the loss of in-
house expertise and the possible coinciding loss of competitiveness, and the loss
of control over how the services will be provided. A second set of concerns
revolves around becoming locked in with one supplier and its technologies. If
that supplier does not keep pace with industry trends and requirements and de-
velop new technologies to meet (and in fact drive) these changes, an OEM can
be rendered noncompetitive.
Both sets of fears and concerns reflect the basic reticence of business lead-
ers to engage in long-term relationships based on the kind of trust and “partner-
ship” necessary to function in today’s business environment. A third set of con-
cerns revolves around the internal changes that will be necessary to effect the
kind of business change that will occur when support and management processes
are shed. The cost and logistics of planning and implementing changes to any
process are considerable, but they must always be balanced against the opportu-
nity of upgrading capability in another area.
Strategic Sourcing
The old-fashioned relationship to suppliers by companies who operated in a “get
the product out the door” mindset doesn’t work today. This method of operation
(in which technologists find what the market needs and the operationally directed
part of the company makes it) was characterized by customers who
Exploited suppliers
Found subordinate and easily swayed suppliers
Limited the information provided to suppliers
Avoided binding or long-term agreements
Purchased in single orders, each time setting up competition among sup-
pliers
The major factor driving change in the procurement process is the recognition
among strategic thinking electronics manufacturers that the largest single expense
category a company has is its purchases from suppliers (typically, corporations
spend 20 to 80% of their total revenue on goods and services from suppliers). Thus,
it is important to view suppliers strategically because they exert a great influence
on the product manufacturing costs and quality through the materials and design
methodology. An example of this is shown in Table 3 for Ford Motor Company.
TABLE 3 Impact of the Design and Material Issues on Manufacturing Cost at Ford

             Percent of       Percent of influence on    Percent of influence
             product cost     manufacturing costs        on quality
Material         50                   20                        55
Labor            15                    5                         5
Overhead         30                    5                         5
Design            5                   70                        35
Total           100%                 100%                      100%
Source: From the short course ‘‘Supplier Management’’ at the California Institute of Technology.
Courtesy of Ken Stork and Associates, Inc., Batavia, IL.
Strategic sourcing isn’t easy to implement. There are many barriers to over-
come. Especially hard pressed to change are those companies who have an
inspection/rework mindset based on mistrust of the adversarial arms-length trans-
actions with their suppliers, such as government contractors. Strategic sourcing
takes a lot of work to implement properly. It is neither downsizing (or right-
sizing as some prefer to call it) nor is it a quick fix solution to a deeper problem.
A strategic sourcing organization is not formed overnight. It takes a concerted
effort and evolves through several stages (Table 4).
The goal of strategic sourcing is to develop and maintain a loyal base of
critical (strategic) suppliers that have a shared destiny with the OEM. The sup-
plier provides the customer with preferential support (business, technology, qual-
ity, responsiveness, and flexibility) that enables and sustains the customer’s com-
petitive advantages and ensures mutual prosperity.
This sourcing strategy can occur on several levels. Take the case of a candy
bar manufacturer who purchases milk, cocoa, sweetener, and packaging. In this
FIGURE 4 Chrysler platform teams utilizing functional silo expertise. (Courtesy of Ken
Stork and Associates, Inc., Batavia, IL.)
Source: From the short course ‘‘Supplier Management’’ at the California Institute of Technology.
Courtesy of Ken Stork and Associates, Inc., Batavia, IL.
the only way of achieving velocity in design and manufacture is to break down
the old barriers and rebuild new relationships by adopting simultaneous engi-
neering. That means all functions working together in teams from the start; it is
about relationships, not technology.
Not all suppliers are created equal. Nor should all suppliers be considered
as strategic sources because
1. The components they provide are not critical/strategic nor do they pro-
vide a performance leverage to the end product.
2. The OEM has limited resources to manage a strategic sourcing rela-
tionship. It’s simply a case of the vital few suppliers versus the trivial
many.
The more tiers of suppliers involved, the more likely an OEM will get
into the arena of dealing with very small companies that do not operate to the
administrative levels that large companies do. With fewer staff available to over-
see second- and third-tier suppliers, it has become increasingly difficult to main-
tain a consistent level of quality throughout the supply chain. Companies have
to accept the commitment to work with second- and third-tier suppliers to develop
the same quality standards as they have with their prime suppliers. It’s part of
managing the supply chain. Most companies in the electronics industry prefer to
deal with tier 1 or tier 2 suppliers. However, often unique design solutions come
from tier 3 suppliers, and they are then used after due consideration and analysis
of the risks involved. It takes a significantly larger investment in scarce OEM
resources to manage a tier 3 supplier versus managing tier 1 and tier 2 suppliers.
Many tier 2 and 3 suppliers outsource their wafer fabrication to dedicated
foundries and their package assembly and test as well. Integrated device manufac-
turers, which are typically tier 1 suppliers, are beginning to outsource these opera-
tions as well. This removes the need for keeping pace with the latest technology
developments and concomitant exponentially accelerating capital equipment ex-
penditures. The major worldwide foundries are able to develop and maintain
leading-edge processes and capability by spreading mounting process develop-
ment and capital equipment costs across a large customer base. Many of these
are “fabless” semiconductor suppliers who focus on their core competencies: IC
circuit design (both hardware and software development), supply chain manage-
ment, marketing, and customer service and support, for example.
Having said this, I need to also state that IC suppliers have not been stand-
ing still in a status quo mode. Like their OEM counterparts they are going through
major changes in their structure for providing finished ICs. In the past (1960s
through the 1980s) IC suppliers were predominantly self-contained vertical entities. They
designed and laid out the circuits. They manufactured the ICs in their own wafer
fabs. They assembled the die into packages either in their own or in subcontracted
assembly facilities (typically off-shore in Pacific Rim countries). They electri-
cally tested the finished ICs in their own test labs and shipped the products from
their own warehouses. This made qualification a rather simple (since they directly
controlled all of the process and resources) yet time-consuming task. Mainline
suppliers (now called integrated device manufacturers, or IDMs) such as AMD,
IBM, Intel, National Semiconductor, Texas Instruments, and others had multiple
design, wafer fab, assembly, electrical test, and warehouse locations, complicat-
ing the OEM qualification process.
Integrated circuit suppliers were probably the first link in the electronic
product food chain to outsource some of their needs; the first being mask making
and package assembly. Next, commodity IC (those with high volume and low
average selling price) wafer fab requirements were outsourced to allow the IDM
to concentrate on high-value-added, high-ASP parts in their own wafer fabs. Elec-
trical testing of these products was also outsourced. Then came the fabless IC
suppliers such as Altera, Lattice Semiconductor, and Xilinx who outsource all
of their manufacturing needs. They use dedicated pure play foundries for their
wafer fabrication needs and outsource their package assembly, electrical test, and
logistics (warehouse and shipping) needs, allowing them to concentrate on their
core competencies of hardware and software development, marketing, and sup-
plier management. A newer concept still is that of chipless IC companies. These
companies (Rambus, ARM, and DSP Group, for example) develop and own intel-
lectual property (IP) and then license it to other suppliers for use in their products.
Cores used to support system-on-a-chip (SOC) technology are an example.
Currently, IC suppliers are complex entities with multiple outsourcing strat-
egies. They outsource various parts of the design-to-ship hierarchy based on the
functional part category, served market conditions, and a given outsource provid-
er’s core competencies, design-manufacturing strategy, and served markets. Each
of the functions necessary to deliver a completed integrated circuit can be out-
sourced: intellectual property, design and layout, mask making, wafer fabrication,
package assembly, electrical test, and warehousing and shipping (logistics). In
fact, for a specific component, there exists a matrix of all possible steps in the
IC design-to-ship hierarchy (that the supplier invokes based on such things as
cost, delivery, market needs, etc.), complicating supplier selection and qualifica-
tion, part qualification, and IC quality and reliability. From the OEM’s perspec-
tive the overarching questions are who is responsible for the quality and reliability
of the finished IC, to whom do I go to resolve any problems discovered in my
application, and with so many parties involved in producing a given IC, how can
I be sure that permanent corrective action is implemented in a timely manner.
Another issue to consider is that of mergers and acquisitions. Larger IC
suppliers are acquiring their smaller counterparts. This situation is occurring as
well across the entire supply chain (material suppliers, EMS providers, and
OEMs). In these situations much is at risk. What products are kept? Which are
discontinued? What wafer fab, assembly, test, and ship facilities will be retained?
If designs are ported to a different fab, are the process parameters the same? Is
the same equipment used? How will the parts be requalified? All of this affects
the qualification status, the ability of the part to function in the intended applica-
tion, and ensuring that a continuous source of supply is maintained for the OEM’s
manufacturing needs. Then there is the issue of large companies spinning off
their semiconductor (IC) divisions. Companies such as Siemens (to Infineon),
Lucent (to Agere Systems), and Rockwell (to Conexant), for example, have al-
ready spun off their IC divisions.
In fact, it is predicted that the IC industry will be segmented into a number
of inverted pyramids characterized by dramatic restructuring and change (Fig.
6). The inverted pyramid shows that for a given product category (digital signal
processors, or DSPs, in this example) four suppliers will eventually exist: one large
tier 1 supplier and two or three smaller suppliers will account for the entire market for
that circuit function. A reduction in the supply base will occur due to mergers
and acquisitions and the process of natural selection. This model has proven to
be valid for programmable logic devices (PLDs) and dynamic random access
memories (DRAMs), with the number of major DRAM suppliers being reduced
from 10 five years ago to five today. Other IC product types are likely to follow
this path as well. This structure will prevent second tier or start-up companies
from breaking into the top ranking. The barriers to entry will be too formidable
for these companies to overcome and the inertia too great to displace the market
leader. Such barriers include financial resources, process or design capabilities,
patent protection, sheer company size, and the like. This will serve to further
complicate the selection and qualification of IC suppliers and the parts they pro-
vide. OEMs will be limited in the number of suppliers for a given functional IC.
FIGURE 6 Inverted pyramid model for the 1998 DSP market. (Courtesy of IC Insights,
Scottsdale, AZ.)
1. They have found that this relationship results in both fewer quality
problems and fewer missed delivery schedules.
2. When there is a supply shortage problem, the firms with long-term
relationships suffer less than do opportunistic buyers.
3. Frequent changes in suppliers take a lot of resources and energy and
require renewed periods of learning to work together. Relationships
can’t be developed and nurtured if the supply base is constantly
changing.
4. Product innovations require design changes. Implementation of
changes in requirements is less costly and time consuming when a
long-term relationship has been developed.
5. If either party runs into a financial crisis, concessions are more likely
if a long-term relationship exists.
6. Cooperation helps minimize inventory carrying costs.
7. A company and its suppliers can work together to solve technical issues
to achieve the desired product performance and quality improvement.
both cost and quality improvement ideas and they are expected to understand the
goals, products, and needs of their partner OEM.
Table 7 presents a high level supplier–OEM relationship model put forth
in Beyond Business Process Reengineering. It progresses from the “conventional”
(or historical) OEM–supplier relations that were built on incoming inspection
and driven by price to what the authors call a holonic node relationship. This,
in effect, represents the ultimate relationship: a truly integrated supplier–OEM
relationship in which both are mutual stakeholders or comakers in each other’s
success.
In this ultimate relationship, there is cooperation in designing new products
and technologies. Suppliers are integrated into the OEM’s operations, and there
is a feeling of mutual destiny—the supplier lives for the OEM and the OEM
lives for the supplier’s ability to live for its business. Increasingly, the OEM’s
product components are based on the supplier’s technology.
There is also a constant exchange of information concerning both products
and processes. The client company’s marketing people feed back information
directly to the supplier company. This allows the partners to make rapid, global
decisions about any required product change.
Source: McHugh, Patrick, et al., Beyond Business Process Reengineering, John Wiley & Sons, 1995.
functional discipline brings various strengths and perspectives to the team rela-
tionship. For example,
1. Purchasing provides the expertise for an assured source of supply,
evaluates the suppliers financial viability, and helps architect a long-
term sourcing strategy.
2. System Engineering evaluates and ensures that the part works in the
product as intended.
3. Manufacturing Engineering ensures that the parts can be reliably and
repeatedly attached to the PCB and conducts appropriate evaluation
tests of new PCB materials and manufacturing techniques, packaging
technology (BGA, CSP, etc.), and attachment methods (flip chip, lead
free solder, etc.).
4. Component Engineering guides Development Engineering in the selection
and use of technologies, specific IC functions, and suppliers, beginning
at the conceptual design phase and continuing through to a fixed design;
evaluates supplier-provided test data; conducts specific analyses
of critical components; generates functional technology road maps; and
conducts technology audits as required.
5. Reliability Engineering ensures that the product long-term reliability
goals, in accordance with customer requirements, are established; that
the supplier uses the proper design and derating guidelines; and that
the design models are accurate and in place.
Each commodity team representative is responsible for building relationships
with her/his counterparts at the selected component suppliers.
would yield 0.05 defective components. That is, it would take 50,000 parts to
find a single defect versus finding 1 defect in 100 parts in the 1970s.
A dramatic improvement in the quality of integrated circuits occurred dur-
ing the past 30–40 years. Since the early 1970s, the failure rate for integrated
circuits has decreased approximately 50% every 3 years! A new phenomenon be-
gan to dominate the product quality area. A review of OEM manufacturing issues
and field returns showed that most of the problems encountered were not the
result of poor component quality. Higher clock rate and higher edge rate ICs
as well as impedance mismatches cause signal integrity problems. Component
application (fitness for use) and handling processes, software/hardware interac-
tion, and PWA manufacturing quality (workmanship) and compatibility issues are
now the drivers in product quality. Many misapplication and PWA manufacturing
process problems can be discovered and corrected using HALT and STRIFE tests
at the design phase. Furthermore, many component- and process-related defects
can be discovered and controlled using production ESS until corrective actions
are in place.
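The arithmetic behind these quality figures is easy to reproduce. The sketch below converts defect counts to parts per million and projects a failure rate that halves every 3 years; the defect counts follow the text, while the 30-year projection horizon is an illustrative assumption.

```python
# Sketch: defect rates expressed in ppm, plus a failure rate that halves
# every 3 years. The defect counts follow the text; the 30-year projection
# horizon is an illustrative assumption.

def defects_ppm(defects, parts):
    return 1e6 * defects / parts

def improvement_factor(years, halving_period_years=3.0):
    """How much lower the failure rate is after `years` of 50%-per-period improvement."""
    return 2 ** (years / halving_period_years)

if __name__ == "__main__":
    print(f"1970s: 1 defect in 100 parts    = {defects_ppm(1, 100):,.0f} ppm")
    print(f"Today: 1 defect in 50,000 parts = {defects_ppm(1, 50_000):,.0f} ppm")
    print(f"Failure-rate improvement over 30 years: about {improvement_factor(30):,.0f}x")
```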
gic planning. The focus is on matching a company’s product direction with the
direction of the supply base (see Chapter 3, Fig. 2). Throughout the design pro-
cess, the component engineer questions the need for and value of performing
various tasks and activities, eliminating all that are non–value added and retaining
the core competencies. As more IC functions are embedded into ASICs via core-
ware, the coreware will have to be specified, modeled, and characterized, adding
a new dimension to a CE’s responsibility.
The component engineer is continuously reinventing the component quali-
fication process (more about this in a subsequent section) in order to effectively
support a fast time-to-market requirement and short product development and
life cycles. The CE has traditionally been a trailer in the product development
cycle by qualifying components that are designed into new products. Now com-
ponents need to be qualified ahead of the need, both hardware and software—
and their interactions.
A listing of the component engineer’s responsibilities is as follows:
such as PWA water wash and reducing the number of device fami-
lies to support.
Work with Development, Manufacturing, and Purchasing on cost re-
duction programs for existing and new designs. Focus on Total cost
of ownership in component/supplier selection.
9. Provide CAD model support. Component engineers, MCAD designers, and
ECAD designers should combine efforts to create single, more accurate
component models. The MCAD and ECAD models will become the
specification control vehicle for components, replacing SCDs.
10. Coordinate the establishment of quality (ppm) and reliability (FIT)
goals for critical, core, or strategic commodities, including a 3-year
improvement plan. Establish a plan to monitor commodity perfor-
mance to these quality and reliability goals on a quarterly basis.
11. Institute obsolescence BOM reviews for each production PWA every
2 years. Institute obsolescence BOM reviews when a portion of an
existing board design will be used in a new design.
ture at most large users of ICs. Some of the smaller users chose to outsource
their incoming electrical inspection needs to independent third-party testing labo-
ratories, thus fueling that industry’s growth.
Up to the mid-1980s most users of integrated circuits performed some level
of screening, up to the LSI level of integration, for the following reasons:
They lacked confidence in all suppliers’ ability to ship high-quality, high-
reliability ICs.
They felt some screening was better than none (i.e., self-protection).
They embraced the economic concept that the earlier in the manufacturing
cycle you find and remove a defect, the lower the total cost.
The last item is known as the “law of 10” and is graphically depicted in Figure
7. From the figure, the lowest cost node where the user could make an impact
was at incoming test and thus the rationale for implementing 100% electrical and
environmental stress screening at this node.
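The "law of 10" is simple to tabulate: each later stage at which a defect is found multiplies the cost of finding and removing it by roughly ten. The stage names and the $0.50 starting cost in the sketch below are illustrative assumptions, not figures from the text or Figure 7.

```python
# Sketch of the "law of 10": the cost of finding and removing a defect grows
# roughly tenfold at each later stage. Stage names and the $0.50 base cost
# are illustrative assumptions, not figures from the text.

STAGES = ["incoming (component) test", "board (PWA) test", "system test", "field"]

def escalate(base_cost, stages):
    return [(stage, base_cost * 10 ** i) for i, stage in enumerate(stages)]

if __name__ == "__main__":
    for stage, cost in escalate(0.50, STAGES):
        print(f"{stage:26s} ${cost:,.2f}")
```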
The impact of failure in the field can be demonstrated by considering some
of the effects that failure of a computer in commercial business applications can
have. Such a failure can mean
The Internet going down
A bank unable to make any transactions
The telephone system out of order
A store unable to fill your order
Airlines unable to find your reservation
A slot machine not paying off
Since components (ICs) have historically been the major causes of field
failures, screening was used
To ensure that the ICs meet all the electrical performance limits in the
supplier’s data sheets (supplier and user).
To ensure that the ICs meet the unspecified parameters required for system
use of the selected ICs/suppliers (user).
To eliminate infant mortalities (supplier and user).
To monitor the manufacturing process and use the gathered data to institute
appropriate corrective action measures to minimize the causes of varia-
tion (supplier).
As a temporary solution until the appropriate design and/or process correc-
tive actions could be implemented based on a root cause analysis of the
problem (user or supplier) and until IC manufacturing yields improved
(user).
Because suppliers expect sophisticated users to find problems with their
parts. This was true in 1970 for the Intel 1103 DRAM; it is true today for
the Intel Pentium; and it will continue to be true in the future. No matter
how many thousands of hours a supplier spends developing the test vec-
tors for and testing a given IC, all the possible ways that a complex IC
will be used cannot be anticipated. So early adopters are relied on to
help identify the bugs.
As mentioned, this has changed—IC quality and reliability have improved dra-
matically as IC suppliers have made major quality improvements in design, wafer
fabrication, packaging and electrical test. Since the early 1970s IC failure rates
have fallen, typically 50% every 3 years. Quality has improved to such an extent
that
1. Integrated circuits are not the primary cause of problems in the field
and product failure. The major issues today deal with IC attachment
to the printed circuit board, handling issues (mechanical damage and
ESD), misapplication or misuse of the IC, and problems with other
system components such as connectors and power supplies. Although
when IC problems do occur, it is a big deal and requires a focused
effort on the part of all stakeholders for timely resolution and imple-
mentation of permanent corrective action.
2. Virtually no user performs component screening. There is no value to
be gained. With the complexity of today’s ICs no one but the IC sup-
plier is in a position to do an effective job of electrically testing the
ICs. The supplier has the design knowledge (architectural, topographi-
cal, and functional databases), resources (people) who understand the
device operation and idiosyncrasies, and the simulation and test
tools to develop the most effective test vector set for a given device
and thus assure high test coverage.
3. United States IC suppliers have been continually regaining lost market
share throughout the 1990s.
4. Many failures today are system failures, involving timing, worst case
combinations, or software–hardware interactions. Increased system
complexity and component quality have resulted in a shift of system
failure causes away from components to more system-level factors,
including manufacturing, design, system-level requirements, interface,
and software.
improved manufacturing processes and more accurate design processes (design for
quality) and models. Also notice the potentially shorter life due to smaller feature
size ramifications (electromigration and hot carrier injection, for example).
Nonetheless failures still do occur. Table 9 lists some of the possible failure
mechanisms that impact reliability. It is important to understand these mecha-
nisms and what means, if any, can be used to accelerate them in a short period
of time so that ICs containing these defects will be separated and not shipped to
customers. The accelerated tests continue only until corrective actions have been
implemented by means of design, process, material, or equipment change. I
would like to point out that OEMs using leading edge ICs expect problems to
occur. However, when problems occur they expect a focused effort to understand
the problem and the risk, contain the problem parts, and implement permanent
corrective action. How issues are addressed when they happen differentiates stra-
tegic suppliers from the rest of the pack.
Today the Arrhenius equation is widely used to predict how IC failure rates vary
with different temperatures and is given by the equation
R2 = R1 exp[(Ea/k)(1/T1 − 1/T2)]     (4.2)
The acceleration factor is then
AT = exp[(Ea/k)(1/T1 − 1/T2)]     (4.3)
where
R1 and R2 = failure rates at temperatures T1 and T2, respectively
T1 and T2 = temperature (K)
k = Boltzmann's constant = 86.17 µeV/K
FIGURE 9 The Arrhenius model showing relationship between chip temperature and ac-
celeration factor as a function of Ea.
dence and thus represents a high acceleration factor. Figure 9 shows the relation-
ship between temperature, activation energy, and acceleration factor.
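A quick way to evaluate Eq. (4.3) numerically is sketched below. The 55°C use and 125°C stress temperatures and the three activation energies mirror the worked example later in this chapter; the function and constant names are my own.

```python
# Sketch: thermal acceleration factor from the Arrhenius relation, Eq. (4.3):
# A_T = exp[(Ea/k)(1/T_use - 1/T_stress)], with k = 86.17e-6 eV/K.
import math

BOLTZMANN_EV_PER_K = 86.17e-6   # eV/K

def arrhenius_af(ea_ev, t_use_c, t_stress_c):
    t_use_k = t_use_c + 273.15
    t_stress_k = t_stress_c + 273.15
    return math.exp((ea_ev / BOLTZMANN_EV_PER_K) *
                    (1.0 / t_use_k - 1.0 / t_stress_k))

if __name__ == "__main__":
    for ea in (0.3, 0.6, 1.0):
        print(f"Ea = {ea:.1f} eV -> A_T = {arrhenius_af(ea, 55.0, 125.0):.1f}")
```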
Voltage Acceleration
An electric field acceleration factor (voltage or current) is used to accelerate the
time required to stress the IC at different electric field levels. A higher electric
field requires less time. Since the advent of VLSICs, voltage has been used to
accelerate oxide defects in these CMOS ICs such as pinholes and contamination.
Since the gate oxide of a CMOS transistor is extremely critical to its proper
functioning, the purity and cleanliness of the oxide is very important, thus the
need to identify potential early life failures. The IC is operated at higher than
normal operating VDD for a period of time; the result is an assigned acceleration
factor AV to find equivalent real time. Data show an exponential relation to most
defects according to the formula
AV = exp[γ(VS − VN)]     (4.4)
where
VS = stress voltage on thin oxide
VN = thin oxide voltage at normal conditions
γ = 4–6 V⁻¹
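Equation (4.4) can be evaluated the same way. In the sketch below, the 6.5-V stress, 5.0-V nominal voltage, and γ = 5 V⁻¹ are illustrative assumptions chosen from within the range quoted above.

```python
# Sketch: voltage acceleration factor, Eq. (4.4): A_V = exp[gamma * (Vs - Vn)].
import math

def voltage_af(v_stress, v_nominal, gamma_per_volt):
    return math.exp(gamma_per_volt * (v_stress - v_nominal))

if __name__ == "__main__":
    # gamma of 4-6 per volt per the text; 5 is an arbitrary midpoint,
    # and the 6.5 V / 5.0 V pair is an illustrative overstress example.
    print(f"A_V = {voltage_af(6.5, 5.0, 5.0):,.0f}")
```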
Humidity Acceleration
The commonly used humidity accelerated test consists of 85°C at 85% RH. The
humidity acceleration formula is
AH = exp[0.08(RHs − RHn)]     (4.5)
where
RHs = relative humidity of stress
RHn = normal relative humidity
For both temperature- and humidity-accelerated failure mechanisms, the acceleration factor becomes
AT&H = AH × AT     (4.6)
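Equations (4.5) and (4.6) follow the same pattern. In the sketch below, the 85°C/85% RH stress condition comes from the text; the 40% RH use condition and the reuse of the Ea = 0.6 eV thermal factor are illustrative assumptions.

```python
# Sketch: humidity acceleration, Eq. (4.5), and the combined factor, Eq. (4.6).
import math

def humidity_af(rh_stress, rh_normal):
    return math.exp(0.08 * (rh_stress - rh_normal))

if __name__ == "__main__":
    a_h = humidity_af(rh_stress=85.0, rh_normal=40.0)   # 85/85 vs. an assumed 40% RH use
    a_t = 41.7                                          # e.g., the Ea = 0.6 eV thermal factor above
    print(f"A_H = {a_h:.0f}, combined A_T&H = {a_h * a_t:,.0f}")
```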
Temperature Cycling
Temperature cycling, which simulates power on/off for an IC with associated
field and temperature stressing, is useful in identifying die bond, wire bond, and
metallization defects and accelerates delamination. The Coffin–Manson model
for thermal cycling acceleration is given by
ATC = (∆Tstress/∆Tuse)^c     (4.7)
where c = 2–7 and depends on the defect mechanism. Figure 10 shows the
number of cycles to failure as a function of temperature for various values of c.
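Equation (4.7) is sketched below; the ΔT values and the exponent c = 2.5 are illustrative choices (c must be picked for the defect mechanism of interest, within the 2–7 range given above).

```python
# Sketch: Coffin-Manson thermal-cycling acceleration, Eq. (4.7):
# A_TC = (dT_stress / dT_use) ** c

def coffin_manson_af(dt_stress_c, dt_use_c, c):
    return (dt_stress_c / dt_use_c) ** c

if __name__ == "__main__":
    # -40 to +125 C chamber cycling against an assumed 30 C field swing;
    # c = 2.5 is an illustrative value within the 2-7 range.
    print(f"A_TC = {coffin_manson_af(dt_stress_c=165.0, dt_use_c=30.0, c=2.5):.0f}")
```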
The total failure rate for an IC is simply the mathematical sum of the individual failure rates obtained during the various acceleration stress tests, or
Ftotal = FT + FVDD + FT&H + FTC     (4.8)
where
FT = failure rate due to elevated temperature
FVDD = failure rate due to accelerated voltage (or electric field)
FT&H = failure rate due to temperature–humidity acceleration
FTC = failure rate due to temperature cycling
where
ES = stress field on thin oxide (MV/cm)
EN = field on thin oxide at normal conditions (MV/cm)
EEF = experimentally determined or calculated value (Suyko, IRPS ’91) (MV/cm)
and
VAF = f(tOX, process) = 1/[ln(10)γ], where γ = 0.4 exp(0.07/kT)     (4.10)
Stress: high voltage extended life test (HVELT): 125°C and 6.5 V
Electrical measurement readpoints: 48 hr, 168 hr, 500 hr, …
The way you read the information is as follows: for lot 1, 1/999 at the 168-hr
electrical measurement point means that one device failed. An investigation of
the failure pointed to contamination as the root cause (see Step 2).
A step-by-step approach is used to calculate the FIT rate from the experi-
mental results of the life tests.
Step 2. Calculate the total device hours for Lots 1 and 2 excluding infant
mortality failures, which are defined as those failures occurring in the
first 48 hr.
Total device hours = 1200(168) + 1199(500 − 168) + 1098(1000 − 500)
+ 1093(2000 − 1000) = 2.24167 × 10⁶ hr
For Lot 3,
800(168) + 799(500 − 168) = 3.997 × 10⁵ hr
Step 3. Calculate TAF for Ea = 0.3 eV, T2 = 398.15 K, and T1 = 328.15 K:
R2 = R1 exp[(0.3/(86.17 × 10⁻⁶))(1/328.15 − 1/398.15)]
R2 = R1 e^1.8636 = 6.46 R1
TAF = R2/R1 = 6.46
The TAF values for Ea = 0.6 eV and 1.0 eV are calculated in a similar
manner. The results are as follows:

Ea (eV)     TAF
0.3         6.46
0.6         41.70
1.0         501.50
Ea (eV)    Device hours at 125°C    TAF       Equivalent device hours at 55°C
0.3        2.24167 × 10⁶            6.46      1.44811 × 10⁷
0.6        2.24167 × 10⁶            41.70     9.34778 × 10⁷
1.0        2.24167 × 10⁶            501.50    1.12420 × 10⁹
Step 4. Divide the number of failures for each Ea by the total equivalent
device hours.
Calculate the number of total oxide failures in this time: two failures give
6.5 FITs [2 ÷ (1.44811 × 10⁷ + 2.9175 × 10⁸ hours)].
Therefore, the total failure rate is 6.5 + 42 + 3.6 = 52 FITs.
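The same step-by-step arithmetic is straightforward to script. The sketch below reproduces the Step 2 device-hour total for Lots 1 and 2 and the Ea = 0.3 eV acceleration of Step 3; the single failure used in the final line is a hypothetical input to show the FIT conversion, not the oxide-failure tally from the text.

```python
# Sketch: life-test (HTOL) FIT arithmetic following Steps 2-4 above.
# Survivor counts and readpoints are the Lot 1/Lot 2 data from the text;
# the single failure in the last line is a hypothetical input.
import math

K_EV_PER_K = 86.17e-6   # Boltzmann's constant, eV/K

def device_hours(readpoints):
    """readpoints: (devices entering interval, interval start hr, interval end hr)."""
    return sum(n * (end - start) for n, start, end in readpoints)

def thermal_af(ea_ev, t_use_c=55.0, t_stress_c=125.0):
    """Temperature acceleration factor (TAF) per the Arrhenius relation."""
    return math.exp(ea_ev / K_EV_PER_K *
                    (1 / (t_use_c + 273.15) - 1 / (t_stress_c + 273.15)))

def fits(failures, equivalent_hours):
    """FIT = failures per 1e9 device hours."""
    return failures / equivalent_hours * 1e9

if __name__ == "__main__":
    lots_1_2 = [(1200, 0, 168), (1199, 168, 500), (1098, 500, 1000), (1093, 1000, 2000)]
    hrs_125c = device_hours(lots_1_2)            # ~2.24e6 device hours at 125 C
    hrs_55c = hrs_125c * thermal_af(0.3)         # ~1.45e7 equivalent hours at 55 C
    print(f"{hrs_125c:.3e} hr at 125 C -> {hrs_55c:.3e} equivalent hr at 55 C")
    print(f"1 failure (hypothetical) -> {fits(1, hrs_55c):.1f} FITs")
```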
destroyed many expensive ICs in the process (an ASIC, FPGA, or microprocessor
can cost between $500 and $3,000 each). For an ASIC, one could use (destroy)
the entire production run (1–3 wafers) for qualification testing—truly unrealistic.
Table 13 lists the issues with the traditionally used stress test–driven qualification
test practices.
When the leadership in the military/aerospace community decided that the
best way for inserting the latest technological component developments into their
equipment was to use commercial off-the-shelf (COTS) components, the com-
mercial and semiconductor industries looked to EIA/JESD 47 and JESD 34 to
replace MIL-STD-883 for IC qualification testing. The following is a brief discus-
sion of these standards.
In either case the set of reliability requirements and tests should be appropriately
modified to properly address the new situations in accordance with JESD 34.
The supplier is not in a position to determine the impact that various failure
mechanisms have on system performance.
The method does not take into account functional application testing at the
system level, which is really the last and perhaps most important step
in component qualification.
The current qualification methodology recognizes that there are two distinct part-
ners responsible for qualification testing of a given component. It is a cooperative
effort between IC suppliers and OEMs, with each party focusing their time and
energy on the attributes of quality and reliability that are under their respective
spheres of control. The industry has moved from the OEMs essentially conduct-
ing all of the environmental and mechanical qualification tests to the IC suppliers
performing them to ensure the reliability of their components.
The IC suppliers have the sole responsibility for qualifying and ensuring
the reliability, using the appropriate simulations and tests, for the components
they produce. Identification of reliability risks just prior to component intro-
duction is too late; qualification must be concurrent with design. By imple-
menting a design-for-reliability approach, IC suppliers minimize reliance on con-
ventional life tests. Instead, they focus on reliability when it counts—during the
IC circuit, package design, and process development stages using various valida-
tion vehicles and specially designed test structures to gather data in a short period
of time. This IC qualification approach has been validated through the successful
field performance of millions of parts in thousands of different fielded computer
installations.
The component suppliers are also asked to define and verify the perfor-
mance envelope that characterizes a technology or product family. They do this
by conducting a battery of accelerated testing (such as 1000-hr life tests) for the
electrical function and technology being used in the design: wafer fab process,
package and complete IC. From these tests they draw inferences (projections)
about the field survivability (reliability) of the component. Additionally, a commitment to supplying high-quality and high-reliability ICs requires a robust quality
system and a rigorous self-audit process to ensure that all design, manufacturing,
electrical test, continuous improvement, and customer requirement issues are be-
ing addressed.
Table 14 details the tasks that the IC supplier must perform to qualify the
products it produces and thus supply reliable integrated circuits. Notice the focus
(as it must be) is on qualifying new wafer fabrication technologies. Equally im-
portant is the effort required to conduct the appropriate analyses and qualification
and reliability tests for new package types and methods of interconnection, since
they interconnect the die to the PWA, determine IC performance, and have an
impact on product reliability.
The user (OEM) requires a stable of known good suppliers with known
good design and manufacturing processes. This begins with selecting the right
technology, part, package, and supplier for a given application early in the design
TABLE 14 Continued
Prerequisites for conducting complete IC qualification testing
Topological and electrical design rules established and verified. There is a good
correlation between design rules and wafer fab process models.
Verifies that combined die-package modeling (simulation) of electrical and
thermal effects matches the results obtained from experimental testing.
Comprehensive characterization testing is complete including four-corner (process)
and margin tests at both NPI and prior to each time a PCN is generated, as
appropriate.
Electrical test program complete and released.
Data sheet developed and released.
Manufacturing processes are stable and under SPC.
In-line process monitors are used for real-time assessment and corrective action of
critical process or product parameters with established Cpk metrics and to
identify escapes from standard manufacturing, test, and screening procedures
(maverick lot).
Conducts complete IC qualification tests
Conducts electrical, mechanical, and environmental stress tests that are appropriate
for the die, package, and interconnection technologies used and the potential failure
mechanisms encountered, and that assess time-dependent reliability drift and wear-
out.
Both categories of risk require three lots from three different time points (variability).
FEOL = front end of line (IC transistor structure formation); BEOL = back end of line (transistor
interconnect structures); EM = electromigration; FIT = failures in time (failures in 10⁹ hr); HCI =
hot carrier injection; HTOL = high-temperature operating life; ILD = interlevel dielectric; NPI =
new product introduction; PCN = product change notice; SPC = statistical process control; TDDB
= time-dependent dielectric breakdown; THB/HAST = temperature–humidity bias/highly acceler-
ated stress test; WLR = wafer-level reliability.
Table 16 is a detailed list of the steps that the OEM of a complex electronic
system takes in qualifying critical components for use in that system, both cur-
rently and looking to the future.
TABLE 16 Continued
share and profitability; therefore, short product design cycles are here to stay.
Newer designs face greater time-to-market pressures than previous designs. For
example, new PCs are being released to production every 4 months versus every
6–9 months previously. In order to support these short design cycles and in-
creased design requirements, component qualification processes must be relevant
and effective for this new design environment. Figure 11 shows the dramatically
shortened product design and qualification timeframes.
Traditional back-end stress-based qualification test methods will not meet
the short cycle times for today’s market. Integrated circuit suppliers need to de-
velop faster and more effective technology and process-qualification methods
(test vehicles and structures) that give an indication of reliability before the IC
design is complete.
2. Shorter component life cycles. Component production life cycles
have significantly been reduced over the last 5–10 years. A typical component
has a 2- to 4-year production life cycle (time to major change or obsolescence).
The shortened component life cycle is due to the effect of large PC, telecommuni-
cation equipment, and consumer product manufacturers pressuring component
suppliers for cost reductions until a part becomes unprofitable to make. Often, a
ogy that often requires special attention to assure a good design fit. Some package
characteristics for evaluation are
Thermal characteristics in still air and with various air flow rates.
Package parasitics (i.e., resistance, capacitance, and inductance) vary with
package type and style. Some packages have more desirable characteris-
tics for some designs.
Manufacturing factors such as solderability, handling requirements, me-
chanical fatigue, etc.
Advanced packaging innovations such as 3D packages are being used to
provide increased volumetric density solutions through vertical stacking of die.
Vertical stacking provides higher levels of silicon efficiency than those achievable
through conventional multichip or wafer-level packaging (WLP) technologies.
Through 3D packaging innovations, a product designer can realize a 30 to
50% PWA area reduction versus bare die or WLP solutions. Stacked chip-scale
packaging (CSP) enables both a reduction in wiring density required in the PWA
and a significant reduction in PWA area. A 60% reduction in area and weight
is possible by migrating from two separate thin small outline packages (TSOPs)
to a stacked CSP. Nowhere is this more important than in the mobile communica-
tions industry where aggressive innovations in packaging (smaller products) are
required. Examples of several stacked die chip scale packages are shown in Fig-
ure 12.
As the packaging industry migrates to increased miniaturization by em-
ploying higher levels of integration, such as stacked die, reliability issues must
be recognized at the product development stage. Robust design, appropriate mate-
rials, optimized assembly, and efficient accelerated test methods will ensure that
reliable products are built. The functionality and portability demands for mobile
electronics require extensive use of chip scale packaging in their design. From
a field use (reliability) perspective portable electronics are much more subject to
bend, torque, and mechanical drops than other electronic products used in busi-
ness and laboratory environments. As a result traditional reliability thinking,
which focuses on having electronic assemblies meet certain thermal cycling reli-
ability requirements, has changed. There is real concern that these products may
not meet the mechanical reliability requirements of the application. For stacked
packages the combined effects of the coefficient of thermal expansion (CTE ) and
elastic modulus determine performance. In stacked packages there is a greater
CTE mismatch between the laminate and the package. The failure mechanism may
shift to IC damage (cracked die, for example) instead of solder joint damage.
Failures occur along the intermetallic boundaries. Drop dependent failures de-
pend on the nature of the intermetallics that constitute the metallurgical bond.
During thermal cycling, alternating compressive and tensile stresses are opera-
tive. Complex structural changes in solder joints, such as intermetallic growths,
grain structure modifications (such as grain coarsening and elastic and plastic
deformations due to creep) are operative. The different CTE values of the die that
make up the stacked package could lead to the development of both delamination
and thermal issues. The surface finish of the PWA also plays a significant role
in the reliability of the PWA.
Thus, new package types, such as stacked packages, provide a greater chal-
lenge for both the IC supplier and the OEM user in qualifying them for use.
6. Functional application testing. Functional application testing (FAT)
is the most effective part of component qualification, since it is in essence proof
of the design adequacy. It validates that the component and the product design
work together by verifying the timing accuracy and margins; testing for possible
interactions between the design and components and between hardware, software,
and microcode; and testing for operation over temperature and voltage extremes.
The following are examples of functional requirements that are critical to designs
and usually are not tested by the component supplier:
Determinism is the characteristic of being predictable. Complex compo-
nents such as microprocessors, ASICs, FPGAs, and multichip modules
should provide the same output in the same cycle time for the same
instructions consistently.
ACKNOWLEDGMENTS
Portions of Sections 4.3.2 and 4.3.4 excerpted by permission from the short
course “Supplier Management,” courtesy of Ken Stork and Associates, Inc.,
California Institute of Technology. Portions of Section 4.4 were excerpted from
Ref. 3. Portions of Section 4.6.1 were excerpted from Ref. 4. Sections 4.7.3 and
4.7.4 excerpted from Ref. 5.
REFERENCES
1. Moore GA. Living on the Fault Line, Harper Business, 2000.
2. Hnatek ER. Integrated Circuit Quality and Reliability, 2nd ed., Marcel Dekker, 1995.
3. Hnatek ER, Russeau JB. Component engineering: the new paradigm. Advanced Elec-
tronic Acquisition, Qualification and Reliability Workshop, August 21–23, 1996, pp
297–308.
4. Hnatek ER, Kyser EL. Practical lessons learned from overstress testing: a historical
perspective. EEP Vol. 26-2, Advances in Electronic Packaging, ASME, 1999.
5. Russeau JB, Hnatek ER. Technology qualification versus part qualification beyond
the year 2001. Military/Aerospace COTS Conference Proceedings, Berkeley, CA,
August 25–27, 1999.
FURTHER READING
1. Carbone J. HP buyers get hands on design. Purchasing, July 19, 2001.
2. Carbone J. Strategic purchasing cuts costs 25% at Siemens. Purchasing, September
20, 2001.
3. Grieco PL, Gozzo MW. Supplier Certification II, Handbook for Achieving Excellence
Through Continuous Improvement. PT Publications Inc., 1992.
4. Kuglin FA. Customer-centered supply chain management. AMACOM, 1998.
5. Morgan JP, Monczka RM. Strategic Supply Chain Management. Cahners, 2001.
6. Poirier CC. Advanced Supply Chain Management. Berrett-Koehler, San Francisco,
CA, 1999.
7. Supply Chain Management Review magazine, Cahners business information publica-
tion.
Scoring methodology:
1. Add up points earned by supplier for each metric.
2. Apply the following formula to get a “cost” score (a short code sketch of
   this calculation follows the list):
   Cost score = (100 − performance score)/100 + 1.00
   For example, if a supplier’s score is 83 points upon measurement, then
   Cost score = (100 − 83)/100 + 1.00 = 1.17
3. The TCOO score is then compared to $1.00. In the example above,
the supplier’s TCOO rating would indicate that for every dollar spent
with the supplier, the OEM’s relative cost was $1.17.
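As a rough illustration, the scorecard arithmetic above can be captured in a few lines of Python. The function and variable names below are illustrative only; they are not part of any published scoring tool.

```python
# Minimal sketch of the TCOO "cost" score computation described above.
# Names are illustrative; the formula is the one given in the scoring methodology.

def tcoo_cost_score(performance_score: float) -> float:
    """Convert a 0-100 supplier performance score into a relative cost per dollar spent."""
    return (100.0 - performance_score) / 100.0 + 1.00

score = 83  # the example value used above
print(f"Performance score {score} -> TCOO rating of ${tcoo_cost_score(score):.2f} per $1.00 spent")
# Prints: Performance score 83 -> TCOO rating of $1.17 per $1.00 spent
```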
Who Generates the Score?
Attribute Responsibility
Quality Supplier Quality Engineering and Component Engineering
OTD Purchasing at both OEM and EMS providers
Price OEM commodity manager and Purchasing and EMS Purchasing
Support OEM commodity manager and Purchasing and EMS Purchasing
Technology Product Design Engineering and Component Engineering
General Guidelines
OEM purchasing:
Maintains all historical files and data
Issues blank scorecards to appropriate parties
Coordinates roll-up scores and emailing of scorecards
Publishes management summaries (trends/overview reports)
If a supplier provides multiple commodities, then each commodity manager
prepares a scorecard and a prorated corporate scorecard is generated.
Each commodity manager can then show a supplier two cards—a divi-
sional and an overall corporate scorecard.
The same process will be applied (prorated by dollars spent, as sketched in the
code below) for an OEM buyer and an EMS buyer.
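The dollar-prorated roll-up described above might be sketched as follows. The data layout, field names, and example numbers are assumptions made for illustration, not a prescribed format.

```python
# Illustrative sketch of a corporate scorecard roll-up prorated by dollars spent.
# The dictionary fields and example numbers are assumptions, not from the text.

def prorated_corporate_score(cards: list[dict]) -> float:
    """Weight each divisional/commodity score by the dollars spent on that commodity."""
    total_spend = sum(c["dollars_spent"] for c in cards)
    if total_spend == 0:
        raise ValueError("no spend recorded for this supplier")
    return sum(c["score"] * c["dollars_spent"] for c in cards) / total_spend

cards = [
    {"commodity": "memory",     "score": 83, "dollars_spent": 400_000},
    {"commodity": "connectors", "score": 92, "dollars_spent": 100_000},
]
print(f"Prorated corporate score: {prorated_corporate_score(cards):.1f}")  # 84.8
```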
Scorecard Process Timetable Example
Week one of quarter: gather data.
Week two of quarter: roll-up scorecard and review results.
Goal: scorecards will be emailed to suppliers by the 15th of the first month
of the quarter.
Executive meetings will be scheduled over the quarter.
                                             # Points
Price
  Meets OEM price goals                         25
  Average price performance                     15
  Pricing not competitive                        0
Support
  Superior support/service                      15
  Acceptable support/service                    10
  Needs corrective action                        0
Technology
Total cost of ownership = (100 − score)/100 + 1.00   (Goal: 1.0)
5.1 INTRODUCTION
The objective of thermal management is the removal of unwanted heat from
sources such as semiconductors without negatively affecting the performance or
reliability of adjacent components. Thermal management addresses heat removal
by considering the ambient temperature (and temperature gradients) throughout
the entire product from an overall system perspective.
Thermal removal solutions cover a wide range of options. The simplest
form of heat removal is the movement of ambient air over the device. In any
enclosure, adding strategically placed vents will enhance air movement. The cool-
ing of a critical device can be improved by placing it in the coolest location in
the enclosure. When these simple thermal solutions cannot remove enough heat
to maintain component reliability, the system designer must look to more sophis-
ticated measures, such as heat sinks, fans, heat pipes, or even liquid-cooled plates.
Thermal modeling using computational fluid dynamics (CFD) helps demonstrate
the effectiveness of a particular solution.
The thermal management process can be separated into three major phases:
1. Heat transfer within a semiconductor or module (such as a DC/DC
converter) package
2. Heat transfer from the package to a heat dissipater
3. Heat transfer from the heat dissipater to the ambient environment
The first phase is generally beyond the control of the system level thermal engi-
neer because the package type defines the internal heat transfer processes. In the
second and third phases, the system engineer’s goal is to design a reliable, effi-
cient thermal connection from the package surface to the initial heat spreader
and on to the ambient environment. Achieving this goal requires a thorough un-
derstanding of heat transfer fundamentals as well as knowledge of available inter-
face and heat sinking materials and how their key physical properties affect the
heat transfer process.
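Although no formula is given here, a common first-order approach (an assumption in the sketch below, not a statement from the text) is to treat the three phases as thermal resistances in series, so that Tj = Ta + P(θjc + θcs + θsa). The numbers used are illustrative.

```python
# First-order sketch treating the three heat transfer phases as series thermal
# resistances: junction-to-case, case-to-heat-sink interface, heat-sink-to-ambient.
# The numerical values are illustrative assumptions, not data from this chapter.

def junction_temperature_c(power_w: float, theta_jc: float, theta_cs: float,
                           theta_sa: float, t_ambient_c: float) -> float:
    """Tj = Ta + P * (theta_jc + theta_cs + theta_sa); resistances in deg C/W."""
    return t_ambient_c + power_w * (theta_jc + theta_cs + theta_sa)

# Example: a 15 W device, theta_jc = 1.5, interface = 0.5, heat sink = 3.0 C/W, 40 C ambient
print(f"Estimated junction temperature: {junction_temperature_c(15.0, 1.5, 0.5, 3.0, 40.0):.0f} C")
# 40 + 15 * 5.0 = 115 C
```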
vent, diverting the fan flow away from the processor site and toward the PS. This
effect virtually negates any benefit from the second fan as far as the CPU is
concerned. The flow at the processor site comes up mainly from the side vents.
Therefore, any increase or decrease in the flow through the side vent will have
a more significant impact on the processor temperature than a change in flow
from the fan.
The ineffectiveness of the system fan was compounded by the fan grill and
mount design. Because of the high impedance of the grill and the gap between
the fan mount and chassis wall, only 20% of the flow through the fan was fresh
outside air. The remaining 80% of the air flow was preheated chassis air recircu-
lated around the fan mount.
FIGURE 3 Particle traces show that most fan airflow bypasses the microprocessor. (From Ref. 2.)
The analysis helped explain why the second PS fan reduced the flow
through the side vent of the chassis. It also showed that the processor temperature
actually declined when the second fan was shut off and demonstrated that the
second fan could be eliminated without a thermal performance penalty, resulting
in a cost saving. The analysis pointed the way to improving the thermal perfor-
mance of the chassis. Modifying the chassis vents and eliminating the second
PS fan provided the greatest performance improvement. Particle traces of the
modified vent configuration demonstrate improved flow at the processor site (Fig.
4).
Use of Computational Fluid Dynamics to Predict
Component and Chassis Hot Spots
Computational fluid dynamics using commercially available thermal modeling
software is helping many companies shorten their design cycle times and elimi-
nate costly and time-consuming redesign steps by identifying hot spots within
an electronic product. An example of such a plot is shown in Figure 5. Figures
6 and 7 show a CFD plot of an Intel Pentium II processor with a heat sink and
a plot of temperature across a PC motherboard for typical operating conditions,
respectively (see color insert).
FIGURE 6 CFD plot of Pentium II processor and heat sink. (See color insert.)
Power dissipation climbs with clock frequency and switched capacitance, as well as with the square of the signal voltage. Capacitance, in turn, climbs with
the number of integrated transistors and interconnections. To move heat out, cost-
conscious designers are combining innovative engineering with conventional
means such as heat sinks, fans, heat pipes, and interface materials. Lower op-
erating voltages are also going a long way toward keeping heat manageable. In
addition to pushing voltages down, microprocessor designers are designing in vari-
ous power-reduction techniques that include limited usage and sleep/quiet modes.
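A common first-order expression consistent with this description (though not stated explicitly in the text) is P ≈ α·C·V²·f, where α is the switching activity. The sketch below uses that assumed relation with illustrative numbers.

```python
# Rough sketch of CMOS dynamic (switching) power, P ~ alpha * C * V^2 * f.
# This first-order relation and the example numbers are assumptions for illustration.

def dynamic_power_w(c_switched_f: float, vdd_v: float, freq_hz: float,
                    activity: float = 0.1) -> float:
    """Estimate switching power in watts from effective switched capacitance."""
    return activity * c_switched_f * vdd_v ** 2 * freq_hz

# 50 nF of effective switched capacitance, 2.0 V supply, 500 MHz clock, 10% activity
print(f"{dynamic_power_w(50e-9, 2.0, 500e6):.1f} W")  # 10.0 W
# Halving the supply voltage would cut this estimate by 4x, which is why lower
# operating voltages go such a long way toward keeping heat manageable.
```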
Let’s look at several examples of what has been happening in terms of
power dissipation as the industry has progressed from one generation of micro-
processors to the next. The 486 microprocessor–based personal computers drew
12 to 15 W, primarily concentrated in the processor itself. Typically, the power
supply contained an embedded fan that cooled the system while a passive heat
sink cooled the processor. However, as PCs moved into the first Pentium genera-
tion, which dissipated about 25 W, the traditional passive cooling methods for
the processor became insufficient. Instead of needing only a heat sink, the proces-
sor now produced enough heat to also require a stream of cool air from a fan.
The Intel Pentium II microprocessor dissipates about 40 W; AMD’s K6
microprocessor dissipates about 20 W. The high-performance Compaq Alpha
series of microprocessors are both high-speed and high-power dissipation de-
vices, as shown in Table 2. The latest Intel Itanium microprocessor (in 0.18-
µm technology) dissipates 130 W (3).
Attention to increased heat is not limited to microprocessors. Power (and
thus heat) dissipation problems for other components are looming larger than in
the past. Product designers must look beyond the processor to memory, system
chip sets, graphics controllers, and anything else that has a high clock rate, as
well as conventional power components, capacitors, and disk drives in channeling
heat away. Even small ICs in plastic packages, once adequately cooled by normal
air movement, are getting denser, drawing more power, and getting hotter.
Alpha generation            21064           21164           21264           21264a          21364
Transistors (millions)      1.68            9.3             15.2            15.2            100
Die size (cm²)              2.35            2.99            3.14            2.25            3.5
Process technology (µm)     0.75            0.50            0.35            0.25            0.18
Power supply (V)            3.3             3.3             2.3             2.1             1.5
Power dissipation (W)       30 at 200 MHz   50 at 300 MHz   72 at 667 MHz   90 at 750 MHz   100 at 1 GHz
Year introduced             1992            1994            1998            1999            2000
                                     Thermal resistance (°C/W)
                            Junction to case (θJC)      Junction to ambient (θJA)
Package type   No. of pins   A42 L.F.    Cu L.F.         A42 L.F.    Cu L.F.
DIP                 8            79         35              184         110
                   14            44         30              117          85
                   16            47         29              122          80
                   20            26         19               88          68
                   24            34         20               76          66
                   28            34         20               65          56
                   40            36         18               60          48
                   48            45         —                —           —
                   64            46         —                —           —
SOP                 8            —          45              236         159
                   14            —          29              172         118
                   16            —          27              156         110
                   16w           —          21              119          97
                   20            —          17              109          87
                   24            —          15               94          75
                   28            —          71               92          71
PLCC               20            —          56               —           20
                   28            —          52               —           15
                   44            —          45               —           16
                   52            —          44               —           16
                   68            —          43               —           13
                   84            —          42               —           12
PQFP               84            —          14               —           47
                  100            —          13               —           44
                  132            —          12               —           40
                  164            —          12               —           35
                  196            —          11               —           30
The table compares two lead frame materials: Alloy 42 and copper. Notice that
copper lead frames offer lower thermal resistance.
Many components and packaging techniques rely on the effects of conduc-
tion cooling for a major portion of their thermal management. Components will
experience the thermal resistances of the PCB in addition to those of the semicon-
ductor packages. Given a fixed ambient temperature, designers can lower junction
temperature by reducing either power consumption or the overall thermal resis-
tance. Board layout can clearly influence the temperatures of components and
thus a product’s reliability. Also, the thermal impact of all components on each
other and the PWA layout needs to be considered. How do you separate the heat-
generating components on a PWA? If heat is confined to one part of the PWA,
what is the overall impact on the PWA’s performance? Should heat generating
components be distributed across the PWA to even the temperature out?
Material                         Thermal conductivity (κ), W/cm ⋅ °K
Metals
  Silver                         4.3
  Copper                         4.0
  Gold                           2.97
  Copper–tungsten                2.48
  Aluminum                       2.3
  Molybdenum                     1.4
  Brass                          1.1
  Nickel                         0.92
  Solder (SnPb)                  0.57
  Steel                          0.5
  Lead                           0.4
  Stainless steel                0.29
  Kovar                          0.16
  Silver-filled epoxy            0.008
Semiconductors
  Silicon                        1.5
  Germanium                      0.7
  Gallium arsenide               0.5
Liquids
  Water                          0.006
  Liquid nitrogen (at 77°K)      0.001
  Liquid helium (at 2°K)         0.0001
  Freon 113                      0.0073
Gases
  Hydrogen                       0.001
  Helium                         0.001
  Oxygen                         0.0002
  Air                            0.0002
Insulators
  Diamond                        20.0
  AlN (low O2 impurity)          2.30
  Silicon carbide (SiC)          2.2
  Beryllia (BeO) (2.8 g/cc)      2.1
  Beryllia (BeO) (1.8 g/cc)      0.6
  Alumina (Al2O3) (3.8 g/cc)     0.3
  Alumina (Al2O3) (3.5 g/cc)     0.2
  Alumina (96%)                  0.2
  Alumina (92%)                  0.18
  Glass ceramic                  0.05
  Thermal greases                0.011
  Silicon dioxide (SiO2)         0.01
  High-κ molding plastic         0.02
  Low-κ molding plastic          0.005
  Polyimide-glass                0.0035
  RTV                            0.0031
  Epoxy glass (PC board)         0.003
  BCB                            0.002
  FR4                            0.002
  Polyimide (PI)                 0.002
  Asbestos                       0.001
  Teflon                         0.001
  Glass wool                     0.0001
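To see what these conductivities mean in practice, the sketch below applies the standard one-dimensional conduction relation Q = κAΔT/L to a few materials from the table; the geometry is an assumption chosen only for illustration.

```python
# Sketch of one-dimensional conduction, Q = k * A * dT / L, using thermal
# conductivity values (W/cm-K) from the table above. Geometry is illustrative.

def conduction_watts(k_w_per_cm_k: float, area_cm2: float,
                     delta_t_k: float, thickness_cm: float) -> float:
    """Heat conducted through a slab of the given material and geometry."""
    return k_w_per_cm_k * area_cm2 * delta_t_k / thickness_cm

area_cm2, delta_t_k, thickness_cm = 1.0, 20.0, 0.16   # 1 cm^2 pad, 20 K drop, 1.6 mm thickness
for name, k in [("copper", 4.0), ("alumina (96%)", 0.2), ("epoxy glass (PC board)", 0.003)]:
    print(f"{name:24s} {conduction_watts(k, area_cm2, delta_t_k, thickness_cm):8.2f} W")
# Copper conducts roughly three orders of magnitude more heat than the bare laminate,
# which is why copper planes and metal cores are used to spread heat.
```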
Materials/junctions                CMOS technology        Bipolar technology
Intrinsic carrier concentration    Threshold voltage      Leakage current
Carrier mobility                   Transconductance       Current gain
Junction breakdown                 Time delay             Saturation voltage
Diffusion length                   Leakage current        Latch-up current
For CMOS ICs, threshold voltage, transconductance, time delay, and noise immunity degrade as temperature increases. For bipolar ICs, leakage current in-
creases and saturation voltage and latch-up current decrease as temperature in-
creases. When exposed to elevated temperature ICs exhibit parameter shifts, as
listed in Table 8. One of the most fundamental limitations to using semiconduc-
tors at elevated temperatures is the increasing density of intrinsic, or thermally
generated, carriers. This effect reduces the barrier height between n and p regions,
causing an 8% per degree K increase in reverse-bias junction-leakage current.
The effects of elevated temperature on field effect devices include a 3- to
6-mV per degree K decrease in the threshold voltage (leading to decreased noise
immunity) and increased drain-to-source leakage current (leading to an increased
incidence of latch-up). Carrier mobility is also degraded at elevated temperatures
by a factor of T^(-1.5), which limits the maximum ambient-use temperature of
junction-isolated silicon devices to 200°C.
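The temperature coefficients quoted above can be applied directly. The sketch below compounds the ~8% per kelvin leakage increase and applies a mid-range 4.5 mV/K threshold coefficient; the function names and the chosen mid-range value are assumptions for illustration.

```python
# Sketch applying the coefficients quoted in the text: ~8%/K growth in
# reverse-bias junction leakage and a 3-6 mV/K threshold-voltage decrease
# (a mid-range 4.5 mV/K is assumed here for the example).

def leakage_multiplier(delta_t_k: float, rate_per_k: float = 0.08) -> float:
    """Compound the per-kelvin leakage increase over a temperature rise."""
    return (1.0 + rate_per_k) ** delta_t_k

def threshold_shift_v(delta_t_k: float, mv_per_k: float = 4.5) -> float:
    """Threshold-voltage decrease (in volts) over a temperature rise."""
    return -delta_t_k * mv_per_k / 1000.0

rise = 100.0  # e.g., 25 C to 125 C
print(f"Leakage grows ~{leakage_multiplier(rise):.0f}x over a {rise:.0f} K rise")
print(f"Threshold voltage shifts {threshold_shift_v(rise) * 1000:.0f} mV over the same rise")
```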
Devices must also be designed to address reliability concerns. Elevated
temperatures accelerate the time-dependent dielectric breakdown of the gate ox-
ide in a MOS field-effect transistor (FET), and can cause failure if the device is
operated for several hours at 200°C and 8 MV/cm field strength, for example.
However, these concerns can be eliminated by choosing an oxide thickness that
decreases the electric field sufficiently. Similar tradeoffs must be addressed for
electromigration. By designing for high temperatures (which includes increasing
the cross-section of the metal lines and using lower current densities), electromi-
gration concerns can be avoided in aluminum metallization at temperatures up
to 250°C.
Integrated Circuit Wires and Wire Bonds
The stability of packaging materials and processes at high temperatures is an
important concern as well. For example, elevated temperatures can result in ex-
cessive amounts of brittle intermetallic phases between gold wires and aluminum
bond pads. At the same time, the asymmetric interdiffusion of gold and alumi-
num at elevated temperatures can cause Kirkendall voiding (or purple plague).
These voids initiate cracks, which can quickly propagate through the brittle inter-
metallics causing the wire bond to fracture. Though not usually observed until
125°C, this phenomenon is greatly accelerated at temperatures above 175°C, par-
ticularly in the presence of breakdown products from the flame retardants found
in plastic molding compounds.
Voiding can be slowed by using other wire bond systems with slower inter-
diffusion rates. Copper–gold systems only show void-related failures at tempera-
tures greater than 250°C, while bond strength is retained in aluminum wires
bonded to nickel coatings at temperatures up to 300°C. Monometallic systems,
which are immune to intermetallic formation and galvanic corrosion concerns
(such as Al-Al and Au-Au), have the highest use temperatures limited only by
annealing of the wires.
Plastic Integrated Circuit Encapsulants
Plastic-encapsulated ICs are made exclusively with thermoset epoxies. As such,
their ultimate use temperature is governed by the temperature at which the mold-
ing compound depolymerizes (between 190 and 230°C for most epoxies). There
are concerns at temperatures below this as well. At temperatures above the glass
transition temperature (Tg) (160 to 180°C for most epoxy encapsulants), the coef-
ficient of thermal expansion (CTE) of the encapsulant increases significantly and
the elastic modulus decreases, severely compromising the reliability of plastic-
encapsulated ICs.
Capacitors
Of the discrete passive components, capacitors are the most sensitive to elevated
temperatures. The lack of compact, thermally stable, and high–energy density
capacitors has been one of the most significant barriers to the development of
high-temperature systems. For traditional ceramic dielectric materials, there is a
fundamental tradeoff between dielectric constant and temperature stability. The
capacitance of devices made with low–dielectric constant titanates, such as C0G
or NP0, remains practically constant with temperature and shows little change
with aging. The capacitance of devices made with high–dielectric constant ti-
tanates, such as X7R, is larger but exhibits wide variations with increases in
temperature. In addition, the leakage currents become unacceptably high at ele-
vated temperatures, making it difficult for the capacitor to hold a charge.
There are few alternatives. For example, standard polymer film capacitors
are made of polyester and cannot be used at temperatures above 150°C because
both the mechanical integrity and the insulation resistance begin to break down.
Polymer films, such as PTFE, are mechanically and electrically stable at higher
operating temperatures—showing minimal changes in dielectric constant and in-
sulation resistance even after 1000 hr of exposure to 250°C; however, these films
also have the lowest dielectric constant and are the most difficult to manufacture
in very thin layers, which severely reduces the energy density of the capacitors.
Alloy composition             Melting point (°C)   Comment
Sn63Pb37                      183                  Low cost
Sn42Bi58                      138                  Too low melting point, depending on usage;
                                                   unstable supply of Bi
Sn77.2In20Ag2.8               179–189              Higher cost; unstable supply
Sn85Bi10Zn5                   168–190              Poor wettability
Sn91Zn9                       198                  Poor wettability
Sn91.7Ag3.5Bi4.8              205–210              Good wettability
Sn90Bi7.5Ag2Cu0.5             213–218              Poor reliability; poor control of composition
Sn96.3Ag3.2Cu0.5              217–218              Good reliability
Sn95Ag3.5In1.5                218                  Unstable supply of In
Sn96.4Ag2.5Cu0.6Sb0.5         213–218              Poor control of composition
Sn96.5Ag3.5                   221                  Much experience
Thermal management must be considered early in the design phase. All micro-
processor and ASIC manufacturers offer thermal design assistance with their
ICs. Two examples of this from
Intel are Pentium III Processor Thermal Management and Willamette Thermal
Design Guidelines, which are available on the Internet to aid the circuit designer.
Up-front thermal management material consideration may actually enhance
end-product design and lower manufacturing costs. For example, the realization
that a desktop PC can be cooled without a system fan could result in a quieter
end product. If the design engineer strategizes with the electrical layout designers,
a more efficient and compact design will result. Up-front planning results in ther-
mal design optimization from three different design perspectives:
Thermal—concentrating on the performance of the thermal material
Dynamic—designing or blending the material to operate within the actual
conditions
Economic—using the most effective material manufacturing technology
Thermal management design has a significant impact on product package
volume and shape. Heat generated inside the package must be moved to the
surface of the package and/or evacuated from the inside of the package by air
movement. The package surface area must be sufficient to maintain a specified
temperature, and the package shape must accommodate the airflow requirements.
Early decisions concerning component placement and airflow can help pre-
vent serious heat problems that call for extreme and costly measures. Typically,
larger original equipment manufacturers (OEMs) are most likely to make that
investment in design. Those that don’t tend to rely more heavily, often after the
fact, on the thermal-component supplier’s products and expertise. A component-
level solution will be less than optimal. If one works solely at the component
level, tradeoffs between performance and cost are made on assumptions that
often are not true.
In light of the growing need to handle the heat produced by today’s high-
speed and densely packed microprocessors and other components, five ap-
proaches to thermal management have been developed: venting, heat spreaders,
heat sinks (both passive and active), package enclosure fans and blowers, and
heat pipes.
5.6.1 Venting
Natural air currents flow within any enclosure. Taking advantage of these currents
saves on long-term component cost. Using a computer modeling package, a de-
signer can experiment with component placement and the addition of enclosure
venting to determine an optimal solution. When these solutions fail to cool the
device sufficiently, the addition of a fan is often the next step.
FIGURE 9 Folded fin heat sinks offer triple the amount of finned area for cooling.
Heat spreaders frequently are designed with a specific chip in mind. For
example, the LB-B1 from International Electronic Research Corp. (Burbank, CA)
measures 1.12 in. × 1.40 in. × 0.5 in. high and provides a thermal resistance of 16.5°C/W.
Special care is required when attaching heat sinks to BGAs because of the reduced package size. As the need to dissipate more power
increases, the optimal heat sink becomes heavier. Attaching a massive heat sink
to a BGA and relying on the chip-to-board solder connection to withstand me-
chanical stresses can result in damage to both the BGA and PWA, adversely
impacting both quality and reliability.
Care is needed to make sure that the heat sink’s clamping pressure does
not distort the BGA and thereby compromise the solder connections. To prevent
premature failure caused by ball shear, well-designed, off-the-shelf heat sinks
include spring-loaded pins or clips that allow the weight of the heat sink to be
borne by the PC board instead of the BGA (Fig. 10).
Active Heat Sinks
When a passive heat sink cannot remove heat fast enough, a small fan may be
added directly to the heat sink itself, making the heat sink an active component.
These active heat sinks, often used to cool microprocessors, provide a dedicated
FIGURE 10 Cooling hierarchy conducts heat from the die through the package to the
heat sink base and cooling fins and into the ambient.
FIGURE 11 Example of active heat sink used for cooling high-performance microproces-
sors.
airstream for a critical device (Fig. 11). Active heat sinks are often a good choice
when an enclosure fan is impractical. As with enclosure fans, active heat sinks
carry the drawbacks of reduced reliability, higher system cost, and higher system
operating power.
Fans can be attached to heat sinks in several ways, including clip, thermal
adhesive, thermal tape, or gels. A clip is usually designed with a specific chip
in mind, including its physical as well as its thermal characteristics. For example,
Intel’s Celeron dissipates a reasonable amount of heat—about 12 W. But be-
cause the Celeron is not much more than a printed circuit board mounted verti-
cally in a slot connector, the weight of the horizontally mounted heat sink may
cause the board to warp. In that case, secondary support structures are needed.
It is believed that aluminum heat sinks and fans have almost reached their perfor-
mance limitations and will have no place in future electronic products.
Similarly, because extruded heat sinks or heat spreaders may have small
irregularities, thermal grease or epoxies may be added to provide a conductive
sealant. The sealant is a thermal interface between the component and the heat
sink. Because air is a poor thermal conductor, a semiviscous, preformed grease
may be used to fill air gaps less than 0.1 in. thick.
But thermal grease doesn’t fill the air gaps very well. The problem with
thermal grease is due to the application method. Too much grease may “leak”
onto other components, creating an environment conducive to dendritic growth
and contamination, resulting in a reliability problem. This has led designers to
use gels. The conformable nature of the gel enables it to fill all gaps between
components and heat sinks with minimal pressure, avoiding damage to delicate
components.
The Interpack and Semi-Therm conferences and Electronics Cooling Maga-
zine website (http://www.electronics-cooling.com) deal with thermal grease and
heat management issues of electronic systems/products in detail.
FIGURE 13 Diagram showing thermal resistance improvement with and without both
heat spreaders and forced air cooling for a 16-pin plastic DIP. The results are similar for
larger package types.
Figure 13 shows the thermal resistance improvement from forced air cooling in
conjunction with a heat spreader, and Table 13 presents various IC package styles
and the improvement in θja that occurs with the addition of a fan.
The decision to add a fan to a system depends on a number of consider-
ations. Mechanical operation makes fans inherently less reliable than a passive
system. In small enclosures, the pressure drop between the inside and the outside
of the enclosure can limit the efficiency of the fan. In battery-powered applica-
tions, such as a notebook computer, the current drawn by the fan can reduce
battery life, thus reducing the perceived quality of the product.
Despite these drawbacks, fans often are able to provide efficient, reliable
cooling for many applications. While fans move volumes of air, some PC systems
also require blowers to generate air pressure. What happens is that as the air
moves through the system its flow is hindered by the ridges of the add-on cards
and the like. Even a PC designed with intake and exhaust fans may still require
a blower to push out the warm, still air.
Fan design continues to evolve. For example, instead of simply spinning
at a fixed rate, fans may include a connector to the power supply as well as an
embedded thermal sensor to vary the speed as required by specific operating
conditions.
TABLE 13 Junction-to-Ambient
Thermal Resistance as a Function of
Air Cooling for Various IC Package
Styles
Package/lead count     θJA (°C/W) at 0 LFM     θJA (°C/W) at 500 LFM
M-DIP
16 82 52
18 80 50
20 78 48
24 77 47
40 56 44
PLCC
20 92 52
44 47 31
68 43 30
84 43 30
124 42 30
PQFP
132 43 33
TAPEPAK
40 92 52
64 60 38
132 42 21
Heat pipes are an increasingly common cooling solution for notebook PCs. A heat pipe (Fig. 14) is a tube within a tube filled with a low–
boiling point liquid. Heat at the end of the pipe on the chip boils the liquid, and
the vapor carries that heat to the other end of the tube. Releasing the heat, the
liquid cools and returns to the area of the chip via a wick. As this cycle continues,
heat is pulled continuously from the chip. A heat pipe absorbs heat generated by
components such as CPUs deep inside the enclosed chassis, then transfers the
heat to a convenient location for discharge. With no power-consuming mechani-
cal parts, a heat pipe provides a silent, light-weight, space-saving, and mainte-
nance-free thermal solution. A heat pipe usually conducts heat about three times
more efficiently than does a copper heat sink. Heat pipes are generally
less than 1 ft long, come in various widths, and can dissipate up to 50 W of power.
FIGURE 14 A heat pipe is a highly directional heat transport mechanism (not a heat
spreading system) to direct heat to a remote location.
Thermoelectric Coolers
A thermoelectric cooler (TEC) is a small heat pump that is used in various appli-
cations where space limitations and reliability are paramount. The TEC operates
on direct current and may be used for heating or cooling by reversing the direction
of current flow. This is achieved by moving heat from one side of the module
to the other with current flow and the laws of thermodynamics. A typical single-
stage cooler (Fig. 17) consists of two ceramic plates with p- and n-type semicon-
ductor material (bismuth telluride) between the plates. The elements of semicon-
ductor material are connected electrically in series and thermally in parallel.
When a positive DC voltage is applied to the n-type thermoelement, elec-
trons pass from the p- to the n-type thermoelement, and the cold side temperature
will decrease as heat is absorbed. The heat absorption (cooling) is proportional
to the current and the number of thermoelectric couples. This heat is transferred to
the hot side of the cooler, where it is dissipated into the heat sink and surrounding
environment. Design and selection of the heat sink are crucial to the overall ther-
moelectric system operation and cooler selection. For proper thermoelectric man-
agement, all TECs require a heat sink and will be destroyed if operated without
one. One typical single-stage TEC can achieve temperature differences up to
70°C, and transfer heat at a rate of 125 W.
The theories behind the operation of thermoelectric cooling can be traced
back to the early 1800s. Jean Peltier discovered the existence of a heating/cooling
effect when electric current passes through two conductors. Thomas Seebeck
found that two dissimilar conductors at different temperatures would create an
electromotive force or voltage. William Thomson (Lord Kelvin) showed that over
a temperature gradient, a single conductor with current flow will have reversible
heating and cooling. With these principles in mind and the introduction of semi-
conductor materials in the late 1950s, thermoelectric cooling has become a viable
technology for small cooling applications.
Thermoelectric coolers (TECs) are mounted using one of three methods:
adhesive bonding, compression using thermal grease, or solder. Figure 18 shows
a TEC with an attached heat sink being mounted with solder.
Metal Backplanes
Metal-core printed circuit boards, stamped plates on the underside of a laptop
keyboard, and large copper pads on the surface of a printed circuit board all
employ large metallic areas to dissipate heat. A metal-core circuit board turns
the entire substrate into a heat sink, augmenting heat transfer when space is at
a premium. While effective in cooling hot components, the heat spreading of this
technique also warms cooler devices, potentially shortening their lifespan. The
75% increase in cost over conventional substrates is another drawback of metal-
core circuit boards.
When used with a heat pipe, stamped plates are a cost-effective way to
cool laptop computers. Stamped aluminum plates also can cool power supplies
and other heat-dissipating devices. Large copper pads incorporated into the
printed circuit board design also can dissipate heat. However, copper pads must
be large to dissipate even small amounts of heat. Therefore, they are not real-
estate efficient.
Thermal Interfaces
The interface between the device and the thermal product used to cool it is an
important factor in implementing a thermal solution. For example, a heat sink
attached to a plastic package using double-sided tape cannot dissipate the same
amount of heat as the same heat sink directly in contact with a thermal transfer
plate on a similar package.
Microscopic air gaps between a semiconductor package and the heat sink
caused by surface nonuniformity can degrade thermal performance. This degrada-
tion increases at higher operating temperatures. Interface materials appropriate
to the package type reduce the variability induced by varying surface roughness.
Since the interface thermal resistance is dependent upon applied force, the contact
pressure becomes an integral design parameter of the thermal solution. If a
package/device can withstand a limited amount of contact pressure, it is impor-
tant that thermal calculations use the appropriate thermal resistance for that pres-
sure. The chemical compatibility of the interface materials with the package type
is another important factor. Plastic packages, especially those made using mold-
release agents, may compromise the adherence of tape-applied heat sinks.
In summary, the selected thermal management solution for a specific appli-
cation will be determined by the cost and performance requirements of that partic-
ular application. Manufacturing and assembly requirements also influence selec-
tion. Economic justification will always be a key consideration. Figure 19
summarizes the effectiveness of various cooling solutions. Figure 19 is a nomo-
graph showing the progression of cooling solutions from natural convection to
liquid cooling and the reduction in both thermal resistance and heat sink volume
resulting from this progression. Figure 20 compares the heat transfer coefficient
of various cooling techniques from free air convection to spray cooling.
ACKNOWLEDGMENTS
Portions of Section 5.2 were excerpted from Ref. 1. Used by permission of the
author.
I would also like to thank Future Circuits International (www.mriresearch.com)
for permission to use material in this chapter.
REFERENCES
1. Addison S. Emerging trends in thermal modeling. Electronic Packaging and Produc-
tion, April 1999.
2. Konstad R. Thermal design of personal computer chassis. Future Circuits Int 2(1).
3. Ascierto J, Clenderin M. Itanium in hot seat as power issues boil over. Electronic
Engineering Times, August 13, 2001.
FURTHER READING
1. Case in Point. Airflow modeling helps when customers turn up heat. EPP, April 2002.
2. Electronics Cooling magazine. See also www.electronics-cooling.com.
When two materials contact and separate, one becomes negatively charged due to an excess of electrons, and the other gets positively charged, due to a deficiency of electrons in
equal measure. Materials differ in their capacity to accumulate or give up elec-
trons.
The increasing miniaturization in electronics and the consequent use of
small-geometry devices with thin layers has increased susceptibility to ESD dam-
age; ESD is a silent killer of electronic devices that can destroy a device in nano-
seconds, even at low voltages. Electrostatic discharge causes damage to an elec-
tronic device by causing either an excessive voltage stress or an abnormally high
current discharge, resulting in catastrophic failure or performance degradation
(i.e., a latent defect in the device that may surface later during system operation
and cause device failure).
Taking a few simple precautions during device design, assembly, testing,
storage, and handling, and the use of good circuit design and PWA layout tech-
niques can minimize the effects of ESD and prevent damage to sensitive elec-
tronic components.
Damage is expensive—in the cost of the part; the processes; detection and
repair; and in loss of reputation, as well as lost production time. “Walking
wounded” parts (latently damaged but still functioning) can be extremely expensive, and although the exact figures are
difficult to establish, real overall costs to industry worldwide are certainly mea-
sured in terms of many millions, whatever the currency. Once damage has been
done, it cannot normally be undone. Therefore, precautions need to be taken from
cradle to grave.
FIGURE 1 Factors that impact the level of electrostatic discharge to ICs on PWAs.
Charged Personnel. Where insulating floor coverings and footwear are used, personnel can become charged due to walking or other motion. When
charged personnel contact (or nearly touch) metallic portions of a PWA, ESD
will occur. The probability of charged personnel causing ESD damage to an IC
is especially severe if the discharge from the person is via a metallic object such
as a tool, a ring, a watch band, etc. This hand/metal ESD results in very high
discharge current peaks.
Charged PWAs. Printed wire assemblies may also be a source of ESD.
For example, assemblies can become charged when transported along a conveyer
belt, during shipping, or when handled by a charged person. If a charged PWA
contacts a conductive surface, or is plugged into a conductive assembly, while
not in conductive contact with any other source charge, ESD will occur and dis-
charge the PWA.
Charged PWA and Charged Personnel. If a charged person and a PWA
are in conductive contact during the ESD event, the ESD will discharge both the
person and the PWA. If a person walks across a carpeted floor while in conductive
contact with a PWA, the person and the PWA may become charged. If the PWA
then contacts a conductive surface, or is plugged into an equipment assembly,
while still in conductive contact with the person, charged-PWA-and-person ESD
occurs.
Places of Electrostatic Discharge
The places where ESD impinges on a PWA are called discharge points. Discharge
points to PWAs can be grouped into three categories:
1. Directly to IC pins
2. Printed circuit board traces between ICs
3. Printed circuit board connector pins
With the possible exception of connector pins, discharge points could be expected
to be physically located almost anywhere on the surface of the PCB assembly
(PWA).
Integrated Circuit Pins. The pins of a PCB-mounted IC extend above the
surface of the board itself. Because of this, an ESD arc can actually terminate
on the pins of an IC. In this case, the ESD current will not travel to the device via
a PCB trace. However, any trace connected to the IC pin may alter the character of
the ESD threat.
PCB Traces. Since ICs do not cover the entire surface of a PCB assembly,
a PCB trace may be the nearest metallic point to which an ESD threat may occur.
In this case, the ESD arc will terminate not on an IC pin, but on a PCB trace
between IC pins. This is especially true if an ESD-charged electrode (such as a
probe tip) is located very close to a PCB trace, at a point equidistant between
two ICs. In this case, the ESD current flows to the IC via the PCB trace, modifying
the ESD current waveform.
PCB Connector Pins. The connector pins of a PCB assembly are ex-
tremely likely to be subjected to ESD when the assembly is being installed in
equipment or in a higher-level assembly. Thus, ESD to or from connector pins
is often associated with ESD from a charged PCB or a charged PCB-and-person.
Like ESD to traces, ESD to connector pins must flow via a PCB trace to the IC.
Local Ground Structure. The local ground structure of a PCB is the sec-
tion of the PCB’s ground reference that is part of the PCB assembly itself.
Multilayer PCBs with ground plane layers have the most extensive local ground
structures. At the other extreme are PCB assemblies where the only local ground
reference is provided by a single ground trace to the IC.
An example of E-field interference would be the noise you hear on your portable telephone
when you bring it near your computer.
Radiated magnetic field (H-field) interference is a magnetic field that is
propagated through the air. An example would be two computer monitors placed
sufficiently close so that their displays appear to bounce and distort.
Recapping the previous section, electrostatic discharge occurs in the fol-
lowing manner: An object (usually conductive) builds up an electrostatic charge
through some form of motion or exposure to high velocity air flows. This charge
can be many kilovolts. When the charged object comes in contact with another
(usually conductive) object at a different charge potential, an almost instanta-
neous electron charge transfer (discharge) occurs to normalize the potential be-
tween the two objects, often emitting a spark in the process. Depending upon
the conductivity of the two objects, and their sizes, many amperes of current can
flow during this transfer. Electrostatic discharge is most often demonstrated by
shuffling across a carpet and touching someone, causing the signature electrical
zap to occur. Confusion arises with plastic insulators, which can become charged
to many kilovolts but, because of their high resistance, discharge very slowly at
low current levels.
Seams and gaps between conductive enclosure parts have the tendency to form a dV/dt at the gap and radiate from the slot antenna
formed.
Note: it is advisable to derate vendor claims by at least 10 dB since the
tests are typically conducted in ideal situations.
Bonding. Bonding is the low-impedance interconnection to conductive
ground potential subassemblies (including printed circuit board grounds), enclo-
sure components to each other, and to chassis. Low impedance is much more
than an ohmmeter reading 0.0 Ω. What is implied here is a continuous conductive
assembly that provides high-frequency currents with uninterrupted conductive sur-
faces.
Some practical design hints to ensure proper bonding include the following:
Painted, dirty, and anodized surfaces are some of the biggest problems to
avoid up front. Any painted surfaces should be masked to prevent
overspray in the area where bonding is to occur. (The exception is where
conductive paints are used on plastic enclosures; then paint is needed
along the bond interface.)
Screw threads are inductive in nature and cannot be used alone to assure
a suitable bond. They are also likely to oxidize over time.
Care should be used to avoid the use of dissimilar metals. Galvanic corro-
sion may result in very high–impedance bonds.
Conductive wire mesh or elastomer gaskets are very well suited to assure
continuous bonds over long surface interfaces.
Choose conductive platings for all metal surfaces.
Designs that take advantage of continuous conductive connections along
surface interfaces that are to be bonded are the most effective to support
high-frequency currents. Sharp angle turns along a bond should be
avoided since they will cause high-frequency current densities to in-
crease at such locations, resulting in increased field strengths.
Loop Control. Designers need to consider that current flows in a loop and
that the return current tries to follow as close to the intentional current as possible,
but it follows the lowest impedance path to get there. Since all these return cur-
rents are trying to find their different ways back to their source, causing all kinds
of noise on the power plane, it is best to take a proactive multipronged design
approach. This includes designing an intentional return path (i.e., keeping to one
reference plane), preventing nets from crossing splits in adjacent planes, and
inserting bypass capacitors between planes to give the return current a low-
impedance path back to the source pin.
Often in CPU server designs (see Fig. 2) the overall loop and loop currents
are of the same size, but individual loop areas are dramatically different. It is
the loop area that is the cause for concern with respect to ground loops, not the
loop. It is also important to note that a ground plane has an infinite number of
ground loops.
Each small ground loop in the lower block is carrying a loop current IRF.
What is happening is that the loop currents cancel each other everywhere except
along the perimeter of the loop. The larger ground loop also carries loop current
IRF but has a greater loop area. Larger loops tend to emit and cause interference
and are also more susceptible to disruption from external electromagnetic fields.
This is why EMC design of two-sided printed circuit boards is improved by
forming grids with both power and ground. The loop area can also be reduced
by routing cables along ground potential chassis and enclosure walls.
Skin Effect. Skin effect refers to the limited depth of penetration of a wave into a conduct-
ing medium. As frequency increases the “skin depth” decreases in inverse propor-
tion to the square root of the frequency. Thus, as frequency increases, current
flows closer to the surface of a conductor. This is one reason why bonding of
conductive surfaces requires careful attention. It is desirable to provide as much
continuous surface contact as possible for high-frequency currents to flow unim-
peded.
Current Density. The current density J in a wire of cross-sectional area
A and carrying current I is
J = I/A
Thus, as the area increases, the current density will decrease, and the current in
any one section of this area will be low.
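Both relations lend themselves to a quick numerical check. The sketch below uses the standard skin-depth formula δ = 1/√(πfμσ), which embodies the inverse-square-root frequency dependence noted above, together with J = I/A; the copper constants and geometries are assumed typical values, not figures from the text.

```python
# Quick numerical check of the two relations above: skin depth (falls as
# 1/sqrt(f)) and current density J = I/A. Copper constants are standard values.
import math

MU_0 = 4e-7 * math.pi      # permeability of free space, H/m
SIGMA_COPPER = 5.8e7       # conductivity of copper, S/m

def skin_depth_m(freq_hz: float, sigma: float = SIGMA_COPPER, mu_r: float = 1.0) -> float:
    """delta = 1 / sqrt(pi * f * mu * sigma)."""
    return 1.0 / math.sqrt(math.pi * freq_hz * mu_r * MU_0 * sigma)

def current_density_a_per_m2(i_amps: float, area_m2: float) -> float:
    """J = I / A."""
    return i_amps / area_m2

for f_hz in (1e6, 100e6, 1e9):
    print(f"{f_hz / 1e6:7.0f} MHz: skin depth in copper = {skin_depth_m(f_hz) * 1e6:5.1f} um")
# Distributing 1 A over ten 1 mm^2 contacts instead of one cuts J tenfold:
print(f"J = {current_density_a_per_m2(1.0, 10 * 1e-6):,.0f} A/m^2")
```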
Bonding and shielding attempt to distribute high-frequency return currents
throughout the surfaces of the ground system. When this is completed, the high-
frequency current density (dI/dt) is reduced to very low levels and results in a
near equipotential surface, causing dV/dt to approach zero. By minimizing the
current density, the generated field strengths can be reduced to a minimum. Using
a distributed, low-impedance conductive current path to accomplish this is a very
effective, economical, and reliable design approach.
Distributed Current Path. As IRF crosses any boundary we know that it
is flowing on the surface of the conductors. If the path impedance ZP at any
contact is low, and we have many contacts, then we have a distributed current
path (Fig. 3). The best way to keep high-frequency currents from radiating is to
maintain a distributed low-impedance current path where currents can freely flow.
This is one of the design variables that are controlled and is effective, cost effi-
cient, and very reliable.
Chassis/Enclosures
The chassis or enclosure forms the outer protective and/or cosmetic packaging
of the product and should be part of the EMC design. Metal enclosures should
use plated protective coatings, but they must be conductive. Do not use anodized
metals as they are nonconductive.
Design hints for metal enclosures:
10. When an I/O port is intended for use with unshielded cable, an I/O
filter is required, and it must include a common mode stage that incor-
porates a common core shared by all I/O lines. The filter must be
physically located as close to the I/O connector as possible. Traces
to the input side of the filter cannot cross traces to the output side of
the filter on any layer. Traces from the filter output to the I/O connec-
tor should be as direct as possible. A void is then designed in all
layers (including the power and ground planes) of the printed circuit
board between the filter input connections and the I/O connector pins.
No traces may be routed in the void area except for those going from
the filter to the I/O. A minimum separation of five trace widths of
the I/O signal lines must be maintained between the void edges
and the I/O traces. The void represents a transition zone between the
filter on the printed circuit board and free space. The intent is to avoid
coupling noise onto the filtered I/O lines.
Design Rules for Single- and Double-Sided PCBs. Single- and double-
sided printed circuit boards do not enjoy the luxury of a continuous power and
ground plane, so the preferred practice is to create two grids, thus simulating
these planes using interconnected small loops. The method of doing this is to
run traces in one direction on the top side of the board and perpendicular traces
on the bottom side of the board. Then vias are inserted to conductively tie the
traces of each, forming a power grid and a ground grid at the intersection of
these respective traces if viewed perpendicular to the board plane. Power and
ground grid linewidths should be at least five times the largest signal trace widths.
The ground must still bond to the chassis at each mounting, and the perimeter
band is still preferred.
With single-sided boards, the grid is formed using one or both sides of the
board using wire jumpers soldered in place versus traces on the opposite side
and via bonds.
Clock lines are usually routed with a ground trace (guard band) on one or
both sides following the entire clock trace lengths and maintaining very close
proximity to the clock trace(s). Guard bands should be symmetrical with the clock
trace and should have no more than one trace width between the clock and guard
traces.
All other rules previously described are essential and should be observed.
faster, the values of the capacitor or series termination resistor need to be lower
so that the rise time is shorter. The circuit designer needs to remember that the noise
comes from uncontrolled current. The shorter the rise time, the higher the dI/dt
of the signal, allowing higher frequency content noise. Timing budgets might be
met, but EMI can be running rampant. Tuning the rise time so that it’s the mini-
mum required to make timing requirements minimizes the noise current created.
Take the example of a trapezoidal signal. At frequencies beginning at 1/(πτ),
where τ is the pulse width, the spectrum of the waveform falls off at −20 dB/
decade. For frequencies above 1/(πτr), where τr is the rise time, the
fall-off is −40 dB/decade, a much greater level of attenuation. In order to reduce
the spectral content quickly, the frequency of the −40-dB/decade region needs
to be as low as possible, and pulse width and rise times need to be as large as
possible, again trading this off with the timing budgets.
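The two corner frequencies can be computed directly; the clock frequency and edge rates in the sketch below are illustrative numbers, not values from the text.

```python
# Sketch of the trapezoidal spectral envelope corners discussed above: roll-off of
# -20 dB/decade beyond 1/(pi*tau) and -40 dB/decade beyond 1/(pi*tau_r).
import math

def envelope_corners_hz(pulse_width_s: float, rise_time_s: float) -> tuple[float, float]:
    """Return the -20 dB/decade and -40 dB/decade corner frequencies."""
    return 1.0 / (math.pi * pulse_width_s), 1.0 / (math.pi * rise_time_s)

# 66 MHz clock at ~50% duty cycle (7.6 ns pulse width) with 1 ns versus 3 ns edges
for rise_time in (1e-9, 3e-9):
    f1, f2 = envelope_corners_hz(7.6e-9, rise_time)
    print(f"rise time {rise_time * 1e9:.0f} ns: -20 dB/dec above {f1 / 1e6:.0f} MHz, "
          f"-40 dB/dec above {f2 / 1e6:.0f} MHz")
# Slowing the edge from 1 ns to 3 ns pulls the -40 dB/decade region down by a factor
# of 3, which is the tradeoff against the timing budget described above.
```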
Clock terminations are essential at the origin and at every location where
a clock signal is replicated. Whenever possible, a series of damping resistors
should be placed directly after a clock signal exits a device. While there are many
termination methods—pull-up/pull-down resistors, resistor–capacitor networks,
and end series resistor terminations—the use of series resistors at the source termi-
nations is preferred. The series clock termination resistor values are usually in the
range of 10 to 100 Ω. It is not advisable to drive multiple gates from a single
output since this may exceed the capability to supply the required load current.
Similarly, series terminations are effective in high-speed data or address
lines where signal integrity is an issue. These resistors flatten the square wave by
diminishing the overshoot and undershoot (ringing). Occasionally a component or
subassembly will interfere with other parts of a product or will require supple-
mental shielding to achieve design goals. Local shields are effective in situations
where there are high levels of magnetic or electric field radiation. These shields
should be bonded to ground. Local shields may also be used to supplement the
shielding effectiveness of the enclosure, thereby reducing overall radiated emis-
sions.
Conductive shells of I/O connectors intended to mate with shielded cables
must be conductively bonded to the printed circuit board ground plane, and the
conductive connector shell must bond to the enclosure 360° around as it exits
the enclosure.
Cables
Cables can affect both the emissions and desired signal integrity characteristics.
The following are some EMC design hints for cables:
1. Shielded cables rely upon 360° shield terminations at each end of the
cable and at the devices to which they attach.
2. Ribbon cables may increase or alternate the number of ground conductors.
ACKNOWLEDGMENT
The material in Section 6.2 courtesy of EMC Engineering Department, Tandem
Computer Division, Compaq Computer Corporation.
FURTHER READING
1. Archambeault B. Eliminating the myths about printed circuit board power/ground
plane decoupling. Int J EMC, ITEM Publications, 2001.
2. Brewer R. Slow down—you’re going too fast. Int J EMC, ITEM Publications, 2001.
3. Brewer RW. The need for a universal EMC test standard. Evaluation Engineering,
September 2002.
4. Dham V. ESD control in electronic equipment—a case study. Int J EMC, ITEM Publi-
cations, 2001.
5. Gerfer A. SMD ferrites and filter connectors—EMI problem solvers. Int J EMC,
ITEM Publications, 2001.
6. ITEM and the EMC Journal, ITEM Publications.
7. Mayer JH. Link EMI to ESD events. Test & Measurement World, March 2002.
8. Weston DA. Electromagnetic Compatibility Principles and Applications. 2nd Ed. New
York: Marcel Dekker, 2001.
Outsourcing originally began with the OEMs’ need to manage the manufac-
turing peaks and valleys resulting from volatile, often unpredictable, sales vol-
umes. In order to perform their own manufacturing, OEMs had to face three
difficult choices:
errors and ensure that production processes remain in control, thereby maximiz-
ing product quality and functionality and increasing chances for market success.
Outsourcing creates a new and complex relationship system in the supply
management chain. In the case of printed wiring assemblies, the quadrangle
shown in Figure 2 best describes the relationship between OEMs and CMs. The
insertion of the CM and the component supplier’s authorized distributor between
the OEM and component supplier serves to both separate the supplier and OEM
and complicate the supply chain. This figure lists some of the key functions of
each link in the supply chain and indicates the changing nature of these functions.
The Changing Role of the Contract Manufacturer
Early on (late 1980s and early 1990s) the sole service provided by the CM was
consigned assembly services; the OEM provided the components and the bare
PCB, while the EMS provider performed the assembly operation. As time passed,
more and more value-added services have been added to the EMS provider’s
portfolio. These include component engineering [component and supplier selec-
tion and qualification, managing the approved vendors list (AVL), alternative
sourcing issue resolution, etc.]; test development and fixturing; material procure-
ment; and most recently full product design, integration, and assembly services
as well as shipment and distribution of the product to the end customer. Compo-
nent costs are reduced because OEMs benefit from the purchasing power, ad-
vanced processes, and manufacturing technology of the CM.
Today, many OEMs perform little or no manufacturing of their own. Orga-
nizations are using outsourcing to fundamentally change every part of their busi-
ness because no company can hope to out-innovate all the competitors, potential
competitors, suppliers, and external knowledge sources in its marketplace world-
wide. Companies are focusing on their core competencies—what they are best
in the world at—and then acquiring everything else through a strategic relation-
ship with a CM, in which the “everything else” is the outsource service provider’s
core competencies. Some have referred to the CM as “infrastructure for rent.”
With this shift in functions, the OEM looks like a virtual corporation focusing on
overall system design, marketing and sales, and managing the outsource service
providers to ensure delivery of the product to the customer when requested. The
EMS provider becomes a virtual extension of the OEM’s business and looks
more and more like the vertically integrated OEM of old. The trend toward the
“virtual corporation”—in which different parts of the process are handled by
different legal entities—will continue to accelerate.
The role of the CM is becoming more critical in the global electronics
underscores the need for local control by the provider. Manufacturing and test
strategy engineering and modification also must occur locally because manufac-
turing departments cannot afford to wait 2 days or longer for answers to engi-
neering problems. Time zone distances aggravate product shipment delays.
OEM Proprietary Information. Out of necessity, an EMS provider will
learn intimate details of product design and construction. Revealing such proprie-
tary information makes many OEMs uncomfortable, especially when the same
provider is typically working for competitors as well (no OEM puts all of its
eggs in one CM basket). This situation has become increasingly common and
offers no easy solution. The OEMs must select EMS providers that have a reputa-
tion for honesty and integrity. This requires a leap of faith that the provider’s
reputation, potential loss of business, or (as a last resort) threat of legal action
will sufficiently protect the OEM’s interests.
Quality. Probably the largest concern with outsourcing is that of quality.
The continuing exodus to outsource manufacturing can be a source of quality
issues. When making a transition from in-house manufacturing to outsourcing,
some things may be omitted. In an environment of rapid technological change,
the complex OEM–CM relationship raises a number of valid questions that must
be addressed: Will the quality level of the OEM’s product prior to the outsource
decision be maintained or improved by the outsource provider? Are corners being
cut on quality? How much support will the OEM need to provide? What happens
and who is responsible for resolving a component issue identified by the OEM
or end customer in a timely manner? Who is responsible for conducting a risk
assessment to determine the exposure in the field and if a given product should
be shipped, reworked, retrofitted, or redesigned? When a low price is obtained
for a computer motherboard from an offshore supplier, what is the quality level?
When the customer initially turns a product on, does it fail or will it really run
for 5 years defect-free as intended? If a CM does a system build, who is responsi-
ble for resolving all customer issues/problems? Obviously, it is imperative for
OEMs to work closely with contract manufacturers to ensure that relevant items
are not omitted and that corners are not cut concerning quality.
The key to managing supplier quality when an OEM outsources manufac-
turing is to integrate the outsource provider into the OEM’s design team. Even
before a product is designed, the supplier has to be involved in the design and
specifications phase. Quality can best be enhanced by connecting the EMS pro-
vider to the OEM’s design centers and to its customers’ design centers before
the supplier starts producing those parts. One can’t decree that a product should
have a certain number of hours of reliability. Quality and reliability can’t be
screened in; they have to be designed in.
Today’s OEMs are emphasizing the need for robust quality systems at their
CMs. They must maintain a constant high level of quality focus, whether the
work is done internally or externally, and be actively involved with the quality
of all CM processes. The reason for this is that the OEM’s quality reputation is
often in the hands of the CM since the product may never come to the OEM’s
factory floor but be shipped directly from the CM to the customer. This requires
that a given CM has a process in place to manage, among other things, change
in a multitude of production locations. Outsourcing requires extra care to make
sure the quality process runs well. Technical requirements, specifications, perfor-
mance levels, and service levels must be carefully defined, and continuous com-
munication with the CM must be the order of the day. Outside suppliers must
understand what is acceptable and what is not. Quality problems are often experi-
enced because an outsource supplier works with one set of rules while the OEM
works with another. Both have to be on the same page. The big mistake some
OEMs make is to assume that by giving a contract manufacturer a specification
they can wash their hands of the responsibility for quality. To the contrary, the
OEM must continuously monitor the quality performance of the CM.
Communication. An OEM–CM outsource manufacturing and service
structure requires a delicate balancing act with focus on open, accurate, and con-
tinuous communication and clearly defined lines of responsibility between par-
ties. (See preceding section regarding quality.)
OEM Support. Effective outsourcing of any task or function typically re-
quires much more support than originally envisioned by the product manufac-
turer. In many companies the required level of support is unknown before the
fact. Many companies jumped into outsourcing PWA manufacturing with both feet by selling off their internal PWA manufacturing facilities, including employees, to CMs without fully understanding the internal level of support that is required
for a cost-effective and efficient partnership. The product manufacturers felt that
they could simply reduce their purchasing, manufacturing engineering, and manu-
facturing production staffs and turn over all those activities and employees associ-
ated with printed wire assembly to the outsource provider. Wrong!
In reality it has been found that outsourcing does not eliminate all manufac-
turing costs. In fact the OEM has to maintain a competent and extensive technical
and business staff (equal to or greater than that prior to the outsourcing decision)
to support the outsource contract manufacturer and deal with any issues that arise.
More resources with various specialized and necessary skill sets (commodity
purchasing specialists, component engineering, failure analysis, manufacturing
engineering, etc.) are required at the OEM site than were initially anticipated. Origi-
nally, one reason for outsourcing was to eliminate the internal infrastructure. Now
a more focused internal support infrastructure is required. Unfortunately, without
a significant manufacturing operation of their own, OEMs may find a dwindling
supply of people with the necessary skills to perform this function, and they often
compete for them with the EMS providers.
Most large OEMs place no more than 20% of their outsource manufacturing
needs with a single EMS provider. This means that OEMs need to have the per-
sonnel and support structure in place to deal with at least five EMS providers,
all with different infrastructures and processes.
Design-Related Issues
Three issues pertaining to design have come to the forefront. First, as networking
and telecommunications equipment manufacturers pare their workforces and sell
their manufacturing operations to EMS providers, they are becoming more depen-
dent on IC suppliers for help in designing their systems, the complexities of
which are becoming difficult for contract manufacturers and distributors to handle
effectively. Thus, IC designers will need to have a systems background.
Second, three levels of EMS design involvement that can affect design for manufacturability have been identified, ranging from almost none to complete involvement: (1) the OEM hands the EMS provider a fully developed product; (2) the OEM seeks EMS engineering support in the middle of product development, at or about the prototype phase; or (3) the EMS provider engages with the OEM at the conceptual design phase. Contract manufacturers are moving toward or already provid-
ing new product introduction (NPI) services that help get high-volume projects
underway more quickly. The sooner the customer and CM can get together on
a new project, the sooner such issues as design for manufacturability and design
for test can be resolved. The closer the CM can get to product inception, the
quicker the ramp to volume or time to market and the lower the overall product
cost.
The third issue is that significant delays are being experienced by OEMs
in their product development efforts. According to a report published by AMR
Research Inc. (Boston, MA) in May 2001, product design times for OEMs in
PC, peripherals, and telecommunications markets have increased by an average
of 20% as a result of the growing outsourcing trend. The report identifies two
troubling issues: the OEM design process is being frustrated by an increase in
the number of participants collaborating in product development, and OEMs are
losing their ability to design products that can be easily transferred to manufac-
turing.
In their quest for perfection, OEM design engineers create a design bill of
materials that has to be translated into a manufacturing bill of materials at the
EMS level because design engineers aren’t familiar with how the EMS providers
produce the product. They’ve lost the connection with manufacturing they had
when manufacturing was internal. OEMs are designing their products one way,
and EMS providers preparing the product for manufacturing have to rewrite (the
design) to be compatible with the manufacturing process. The disconnect is in
the translation time.
Because OEMs in many cases have relinquished production of printed wir-
ing assemblies and other system-level components, their design teams have begun
to move further away from the manufacturing process. When the OEMs did the
manufacturing, their design engineers would talk to the manufacturing personnel
and look at the process. Since the OEMs don’t do manufacturing anymore, they
can no longer do that. The result is increased rework between OEM design and EMS manufacturing: time is wasted because the design has to be redone and then further changes have to be made to make the product more manufacturable by the EMS provider.
The EMS Provider’s Viewpoint
An OEM looking for the right EMS partner must recognize that the provider’s
manufacturing challenges differ from those the OEM would experience. First,
the provider generally enjoys little or no control over the design itself (although,
as mentioned, this is changing). Layout changes, adding or enhancing self-tests,
etc. will not compensate for a design that is inherently difficult to manufacture
in the required quantities. An experienced EMS provider will work with a cus-
tomer during design to encourage manufacturability and testability and discour-
age the reverse, but ultimate design decisions rest with the company whose name
goes on the product.
One primary concern of the CM in accepting a contract is to keep costs
down and avoid unpleasant surprises that can make thin profit margins vanish
completely. The OEMs need to be sensitive to this situation, remembering that
the object is for everyone in the relationship to come out ahead. Keeping costs
low requires flexibility. For the EMS provider, that might mean balancing manu-
facturing lines to accommodate volume requirements that may change without
warning. A board assembled and tested on line A on one day may be assigned
to a different line the next day because line A is running another product. Simi-
larly, volume ramp-ups may demand an increase in the number of lines manufac-
turing a particular product. Achieving this level of flexibility with the least
amount of pain requires that manufacturing plans, test fixtures, and test programs be identical and perform identically from one line to the next. Also, one make
and model of a piece of manufacturing or test equipment on one manufacturing
line must behave identically to the same make and model on another manufactur-
ing line (repeatability and reproducibility), which is not always the case.
The CMs depend on OEMs for products to maintain high volumes on their
manufacturing lines to maximize capacity and lower the overhead associated with
maintaining state-of-the-art equipment that may not always be fully loaded (uti-
lized). A contract manufacturer may send production from one geographical loca-
tion to another for many reasons. Tax benefits and import restrictions such as
local-content requirements may encourage relocating part or all of a manufactur-
ing operation elsewhere in the world. The EMS provider may relocate manufac-
turing just to reduce costs. Shipping distances and other logistics may make
spreading production over several sites in remote locations more attractive as
well. Again, seamless strategy transfer from one place to another will reduce
each location’s startup time and costs. To offset the geographical time differ-
ences, crisp and open communications between the EMS provider and the OEM
are required when problems and/or questions arise.
The first 6 to 12 months in a relationship with an EMS provider are critical.
That period establishes procedures and defines work habits and communication
paths. Planners/schedulers should allow for longer turnaround times to imple-
ment product and process changes than with in-house projects.
The Bottom Line
Connecting outsourcing to an OEM’s business strategy, selecting the right oppor-
tunities and partners, and then supporting those relationships with a system de-
signed to manage the risk and opportunities are the essential success factors for
implementing an outsourcing business model. For an OEM, specific needs and
technical expertise requirements must be evaluated to select an appropriate EMS
provider. The selected CM’s manufacturing and test strategy must match the
OEM’s. Identifying the right partner requires that a satisfactory tradeoff among
quality, cost, and delivery factors be found. If the product reach is global, the
EMS provider should have a worldwide presence as well. Because the product
could ship from manufacturing lines anywhere in the world, the EMS provider
should strive for consistency and uniformity in the choice of manufacturing and
test equipment, thus minimizing the variation of produced product and the effort
and cost of that wide implementation and distribution. In the final analysis, there
are no canned solutions. One size cannot fit all.
Reference 1 discusses the process for selecting contract manufacturers, CM
selection team responsibilities, qualifying CMs, integrating CMs into the OEM
process flow, material and component supplier responsibilities and issues, and
selecting a manufacturing strategy.
minimize the conflicts and thus the potential defects. As a result, performing
effective testing to identify and correct defects is as important to the manufactur-
ing process as buying quality materials and installing the right equipment.
In fact, the manufacturing process has become part of what is considered traditional test. Test thus comprises the cumulative results of a process that includes bare board (PCB) test, automated optical inspection (AOI), x-ray, flying probe, manufacturing defect analyzer (MDA), ICT, and functional test solutions. Electrical testing is an important part of manufacturing because visual inspection is not sufficient to ensure a good PWA.
The electrical design, the physical design, and the documentation all impact testability. The types of defects identified at test vary from product to product, depending on the manufacturing line configuration and the types of component
packages used. The difficult part of testing is the accumulation and processing
of and action taken on defect data identified through the manufacturing test pro-
cesses. The manufacturing defect spectrum for a given manufacturing process is
a result of the specific limits of that manufacturing process. As such, what is an
actionable item is ultimately process determined and varies by line design. A
typical PWA manufacturing defect spectrum is shown in Table 3.
Opens and shorts are the most predominant failure mechanisms for PWAs.
There is a difference in the failure mechanisms encountered between through-
hole technology (THT) and surface-mount technology (SMT). Practically, it is
difficult to create an open connection in THT unless the process flow is not reach-
ing part of the assembly, there are contaminants in the assembly, or a lead on
the part is bent during the insertion process. Otherwise the connection to the part
is good. The bigger problem is that solder will bridge from one component lead
to the next to create a short. When soldering using SMT, the most significant
problem is likely to be an open connection. This typically results from insufficient
reflow. Since it is more difficult for manufacturers to catch (detect) opens with
ATE, some tune their assembly processes toward using additional solder to make
the manufacturing process lean more toward shorts than it would if left in the
neutral state.
Electrical testing of either SMT or THT assemblies will find all of the shorts unless the SMT layout does not allow full nodal coverage. Either technology allows electrical testing to find most opens.
For example, when testing a resistor or capacitor, if there is an open, it will be
detected when the part is measured. The issue is finding open connections in ICs.
For the typical test, a mapping is made of the diode junctions present between
IC pins and the power/ground rails of the IC under test. Normally this works
fine, but in many cases an IC is connected to both ends of the trace. This looks
like two parallel diodes to the test system. If one is missing due to an open
connection in the IC, the tester will miss it since it will measure the other diode
that is in parallel, not knowing the difference.
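To make the masking effect concrete, the following minimal Python sketch (not from the text; the saturation current, forced current, and pass limits are hypothetical) models the junction measurement with the ideal diode equation. Whether one or two junctions sit on the net, the measured clamp voltage falls inside a typical "junction present" window, so the open IC pin goes undetected.

```python
import math

VT = 0.026          # thermal voltage at room temperature (volts)
I_SAT = 1e-12       # hypothetical junction saturation current (amps)
I_FORCE = 1e-3      # test current forced into the net (amps)

def junction_voltage(n_junctions: int) -> float:
    """Voltage developed across n identical parallel diode junctions
    when I_FORCE is forced into the net (ideal diode equation)."""
    if n_junctions == 0:
        return float("inf")   # truly open net: no junction at all
    # Parallel junctions simply multiply the saturation current.
    return VT * math.log(I_FORCE / (n_junctions * I_SAT) + 1.0)

def junction_present(v: float) -> bool:
    # Hypothetical pass window for "a junction is present."
    return 0.3 < v < 0.9

v_both = junction_voltage(2)   # both ICs connected to the net
v_open = junction_voltage(1)   # one IC pin open: its junction is missing

print(f"both junctions: {v_both:.3f} V -> pass: {junction_present(v_both)}")
print(f"one pin open : {v_open:.3f} V -> pass: {junction_present(v_open)}")
# Both measurements pass, so the open connection inside the IC is not detected.
```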
Every test failure at any deterministic process step must have data analyzed
to determine the level of existing process control. Data can be compared to the
manufacturing line’s maximum capability or line calibration factor to achieve a
relative measure of line performance quality. The result is a closed loop data
system capable of reporting the manufacturing quality monitored by the various
test systems. In general, it has been observed that 80% of the defects encountered
at the various test stages are process related. Looking at the defect spectrum of
Table 3 it is seen that the nature of the defects has not changed with technology.
Rather it is the size of the defect spectrum that has changed.
fied display, automated optical inspection, and x-ray. There is an increased trend
toward the use of x-ray inspection using x-ray–based machine vision systems,
which provide pin-level diagnostics. The reasons for this are
1. The shrinking sizes of passive components and finer linewidths of
PCBs result in much denser PWAs than in the past. Thus, solder joints
are much more critical as they become smaller and closer together,
making human inspection more difficult and inaccurate.
2. The increased use of chip scale and ball grid array packages in which
soldered connections around the periphery of an IC package are re-
placed by an underside array of solder balls that are used for electrical
connections. The result is that the solder joint connections are hidden
from view underneath the package and cannot be inspected by humans
or conventional optical inspection or machine vision systems.
3. Densely populated PWAs are so complex that there is no way to access
enough test nodes. While board density makes physical access difficult,
electrical design considerations (radiofrequency shielding, for exam-
ple) may make probing extremely difficult if not impossible. X-ray
inspection provides an excellent solution for inspecting solder joints
where limited access or board topography makes it impossible to probe.
4. A functional failure (via functional test) only identifies a segment of
problem circuitry and the diagnostics are not robust.
The manufacturing defects analyzer (MDA), often called an analog ICT, assumes that the components are good and endeavors
to find how they may be improperly installed on the PCB. In-circuit testing takes
testing one step further by checking ICs in the circuit for operation. Going from
MDA to full ICT improves the test coverage by a few percent, but at multiple
times tester and test fixture development cost.
Functional test verifies that the PWA operates as designed. Does it perform
the tasks (function) that it was designed for? Figure 4 is a block diagram of the
electrical tests performed during manufacturing and the sequence in which they
are performed, from simple and least expensive (visual inspection, not shown)
to complex and most expensive (functional test).
In the 1960s and 1970s PWAs were simple in nature and edge connectors
provided easy test access using functional test vectors. In the 1980s, increased
PWA density led to the use of ICT or bed-of-nails testing to mechanically access
hard-to-get-at nodes. In the 1990s, due to continually increasing PWA density
and complexity, boundary scan, built-in self-test (BIST), vectorless tests, and
unpowered opens testing gained increased importance with the decreasing use
of ICT.
Table 4 summarizes and compares some of the key features of the various
electrical test methods being used. These are further discussed in the following
sections.
In-Circuit Testing
The most basic form of electrical test is by use of a manufacturing defects ana-
lyzer, which is a simple form of an in-circuit tester. The MDA identifies manufac-
turing defects such as solder bridges (shorts), missing components, wrong compo-
nents, and components with the wrong polarity as well as verifies the correct
value of resistors. Passive components and groups or clusters of components can
be tested by the MDA. The PWA is connected to the MDA through a bed-of-nails
fixture. The impedance between two points is measured and compared against the
expected value. Some MDAs require programming for each component, while
TABLE 4 Comparison of Test Methods
Self-test: limited defect coverage; limited diagnosis; least expensive test; no test fixture required; can help ICT (boundary scan).
Functional test: catches a wider spectrum of defects; better diagnostics than self-test; most expensive test; computer run; test fixture required; long test development time and costs; easy data logging/analysis of results.
In-circuit test: highest defect coverage; precise diagnostics; very fast; requires test fixture and programming; facilitates device programming; easy data logging/analysis of results.
Vectorless open test: no models required; overclamp and probes required; fast test development; high defect coverage; reversed capacitors detectable.
others can learn the required information by testing a known good PWA. The
software and programming to conduct the tests is not overly complicated.
Conventional in-circuit testers are the workhorses of the industry. They
have more detection capability than MDAs and extended analog measurement
capability as well as the added feature of digital in-circuit test for driving inte-
grated circuits. The programming and software required for conventional in-
circuit testing is more complicated than for an MDA, but simpler than for func-
tional testing.
An in-circuit test is performed to detect any defects related to the PWA
assembly process and to pinpoint any defective components. The ICT searches
for defects similar to those found by both visual inspection and by the MDA
but with added capability; ICT defect detection includes shorts, opens, missing
components, wrong components, wrong polarity orientation, faulty active devices
such as nonfunctioning ICs, and wrong-value resistors, capacitors, and inductors.
To be effective, in-circuit test requires a high degree of nodal access; therefore the tester employs a bed-of-nails fixture for the underside of the PWA. The
bed-of-nails fixture is an array of spring-loaded probes; one end contacts the
PWA, and the other end is wired to the test system. The PWA-to-fixture contact
pressure is vacuum activated and maintained throughout the test. The fixture con-
tacts as many PWA pads, special test pads, or nodes as possible such that each
net can be monitored. (A node is one circuit on the assembly such as GROUND
or ADDR0.) With contact to each net, the tester can access every component on
the PWA under test and find all shorts. The industry standard spring probe for
many years was the 100-mil size, typical of through-hole technology probes for
the vast majority of devices that had leads on 0.100-in. centers. 75-mil probing
was developed in response to the needs of SMT, then 50-mil, then 38-mil; and
now even smaller spacing probes are available.
A major concern in ICT is test fixture complexity. Fixtures for testing THT PWAs are generally inexpensive and reliable compared with the fixtures required for testing SMT PWAs, particularly when the latter are not designed with test in mind. This not only leads to high fixture costs but also makes it extremely difficult to build highly reliable, long-lasting fixtures. Dual-sided probing allows access to both sides of a PWA (whether for ICT or functional test). Complex PWAs may require the use of
clam-shell bed-of-nails fixturing to contact the component side as well as the back
or underside of the PWA, but it adds significantly to the cost of the test fixture.
Also, sometimes not all nodes are accessible, compromising test coverage.
In-circuit testing is constantly being refined to provide improved test coverage
while minimizing the number of required test (probe) points. This is accom-
plished by including testability hooks in the design and layout of the PWA such
as using boundary scan devices, clustering groups of components, being judicious
in test point selection, adding circuitry to provide easier electrical access, and
placing test points at locations that simplify test fixture design.
FIGURE 5 Boundary scan circuit showing latches added at input/output pins to make IC
testable.
By way of summary from Chapter 3, the use of boundary scan has been
effective in reducing the number of test points. In boundary scan the individual
ICs have extra on-chip circuitry called boundary scan cells or latches at each
input/output (see Fig. 5). These latches are activated externally to isolate the I/O
(wire bond, IC lead, PWA pad, external net) from the internal chip circuitry (Fig.
6), thereby allowing ICT to verify physical connections (i.e., solder joints).
Boundary scan reduces the need for probe access to each I/O because the latches are connected in a serial chain and driven from a single scan input. This allows an electrical check of numerous solder joints and nets extending from device to device. The
reality of implementing boundary scan is that most PWAs are a mix of ICs (both
analog and digital functions) with and without boundary scan using a mix of ad
hoc testing strategies and software tools. This creates a problem in having a
readily testable PWA.
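The following short Python sketch illustrates the idea of a boundary scan interconnect test on a handful of nets between two hypothetical devices (U1 driving, U2 capturing). The net names, the walking-zero patterns, and the assumption that an open net floats high are illustrative only, not a description of any particular tester.

```python
# Nets connecting driver cells on IC "U1" to receiver cells on IC "U2".
NETS = ["DATA0", "DATA1", "DATA2", "DATA3"]

def shift_and_capture(drive_bits, open_nets=()):
    """Pretend hardware: whatever U1 drives is captured by U2,
    except on nets with an open solder joint, which float (read as 1 here)."""
    return [1 if net in open_nets else bit
            for net, bit in zip(NETS, drive_bits)]

def interconnect_test(open_nets=()):
    failures = set()
    # Walking-zero patterns exercise each net at both logic levels.
    for i in range(len(NETS)):
        pattern = [0 if j == i else 1 for j in range(len(NETS))]
        captured = shift_and_capture(pattern, open_nets)
        failures.update(net for net, d, c in zip(NETS, pattern, captured) if d != c)
    return failures

print("good board :", interconnect_test() or "no faults")
print("open DATA2 :", interconnect_test(open_nets={"DATA2"}))
```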
Conventional in-circuit testing can be divided into analog and digital test-
ing. The analog test is similar to that performed by an MDA. Power is applied
through the appropriate fixture probes to make low-level DC measurements for
detecting shorts and the value of resistors. Any shorts detected must first be re-
paired before further testing. The flying probe tester, a recent innovation in ICT,
performs electrical process test without using a bed-of-nails fixture interface be-
tween the tester and the board under test. Originally developed for bare board
testing, flying probe testing—together with associated complex software and pro-
gramming—can effectively perform analog in-circuit tests. These systems use
multiple, motor-operated, fast-moving electrical probes that contact device leads
FIGURE 6 Integrated circuit with boundary scan in normal operating mode (left) and with
boundary scan enabled for testing (right).
and vias and make measurements on the fly. The test heads (typically four or
eight) move across the PWA under test at high speed as electrical probes located
on each head make contact and test component vias and leads on the board,
providing sequential access to the test points. Mechanical accuracy and repeat-
ability are key issues in designing reliable flying probers, especially on dense
PWAs with small lead pitches and trace widths.
Flying probers are often used during prototype and production ramp-up to
validate PWA assembly line setup without the cost and cycle time associated
with designing and building traditional bed-of-nails fixtures. In this application,
flying probers provide fast turnaround and high fault coverage associated with
ICT, but without test fixture cost. Flying probers have also been used for in-line
applications such as sample test and for production test in low-volume, high-mix
PWA assembly lines.
The second phase of analog test—an inherent capability of the conventional
in-circuit tester—consists of applying low-stimulus AC voltages to measure
phase-shifted currents. This allows the system to determine the values of reactive
components such as capacitors and inductors. In taking measurements, each com-
ponent is electrically isolated from others by a guarding process whereby selec-
tive probes are grounded.
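A minimal worked example of the reactive measurement, assuming an ideal capacitor and entirely hypothetical stimulus and current values: the quadrature (90-degree phase-shifted) current at a known frequency gives the capacitance directly from I = 2*pi*f*C*V.

```python
import math

# Hypothetical guarded in-circuit measurement of a capacitor: a low-level AC
# stimulus is applied and the quadrature current component is measured.
V_STIMULUS = 0.1        # volts RMS (kept low so semiconductor junctions stay off)
FREQ = 1_000.0          # hertz
I_QUADRATURE = 62.8e-6  # amps RMS, the measured reactive current (illustrative)

# For an ideal capacitor, I = V * 2*pi*f*C, so:
c_measured = I_QUADRATURE / (2.0 * math.pi * FREQ * V_STIMULUS)
print(f"measured capacitance: {c_measured * 1e9:.1f} nF")   # ~100 nF

# Compare against the bill-of-materials value with an assumed tolerance.
c_expected, tolerance = 100e-9, 0.10
print("PASS" if abs(c_measured - c_expected) <= tolerance * c_expected else "FAIL")
```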
In digital ICT, power is applied through selected probes to activate the ICs
and the digital switching logic. Each IC’s input and output is probed to verify
proper switching. This type of test is made possible by the use of a technique
called back driving in which an overriding voltage level is applied to the IC input
to overcome the interfering voltages produced by upstream ICs. The back driving
technique must be applied carefully and measurements taken quickly to avoid
overheating the sensitive IC junctions and wire bonds (Fig. 7). An example of
how this happens is as follows. Some companies routinely test each IC I/O pin
for electrostatic discharge (ESD) susceptibility and curve trace them as well be-
ginning at the PWA perimeter and working inward. They then move farther into the PWA, back driving the just-tested ICs. These just-tested devices are weaker (by virtue of the testing they have already received) than the devices currently under test, and they fail.
In-circuit testing is not without its problems. Some of these include test
fixture complexity, damage to PWAs due to mechanical force, inefficiency and
impracticality of testing double-sided PWAs, and the possibility of overdriving
ICs causing thermal damage. Because of these issues many companies have elim-
inated ICT. Those companies that have done so have experienced an increase in
board yield and a disappearance of the previously mentioned problems.
Vectorless Opens Testing
Vectorless opens testing uses a special top probe over the IC under test in con-
junction with the other standard probes already present. By providing a stimulus
and measuring the coupling through the IC, open connections can be accurately
detected. Vectorless test can be used effectively even when IC suppliers are
changed. It doesn’t require the expensive programming of full ICT techniques.
FIGURE 7 Back- (or over-) driving digital ICs can result in adjacent ICs being overdriven
(a and c), resulting in overheating and temperature rise (b), leading to permanent damage.
Functional Testing
Although an in-circuit test is effective in finding assembly defects and faulty
components, it cannot evaluate the PWA’s ability to perform at clock speeds. A
functional test is employed to ensure that the PWA performs according to its
intended design function (correct output responses with proper inputs applied).
For functional testing, the PWA is usually attached to the tester via an edge
connector and powered up for operation similar to its end application. The PWA
inputs are stimulated and the outputs monitored as required for amplitude, timing,
frequency, and waveform.
Functional testers are fitted with a guided probe that is manually positioned
by the operator to gain access to circuit nodes on the PWA. The probe is a trouble-
shooting tool for taking measurements at specific areas of the circuitry should a
fault occur at the edge connector outputs. The probe is supported by the appro-
priate system software to assist in defect detection.
First-pass PWA yield at functional test is considerably higher when pre-
ceded by an in-circuit test. In addition, debugging and isolating defects at func-
tional test requires more highly skilled personnel than does ICT.
Cluster Testing
Cluster testing, in which several components are tested as a group, improves
PWA testability and reduces the concerns with ICT. One begins with the compo-
nents that are creating the testing problem and works outward adding neighboring
components until a cluster block is defined. The cluster is accessed at nodes that
are immune to overdrive. Address and data buses form natural boundaries for
clusters. Cluster testing combines the features of both ICT and functional test.
Testing Microprocessor Based PWAs
The ubiquitous microprocessor-based PWAs present some unique testing chal-
lenges since they require some additional diagnostic software that actually runs
on the PWA under test, in contrast to other board types. This software is used
to exercise the PWA’s circuits, such as performing read/write tests to memory,
initializing I/O devices, verifying stimuli from external peripherals or instru-
ments, generating stimuli for external peripherals or instruments, servicing inter-
rupts, etc. There are a number of ways of loading the required test code onto the
PWA under test.
1. With a built-in self-test, the tests are built into the board’s boot code
and are run every time the board is powered up, or they can be initiated
by some simple circuit modification such as link removal.
2. With a test ROM, a special test ROM is loaded onto the PWA during
test. On power up, this provides the necessary tests to fully exercise
the board.
3. The required tests can be loaded from disk (in disk-based systems)
after power up in a hot mock-up situation.
4. The tests can be loaded via an emulator. The emulator takes control
of the board’s main processor or boot ROM and loads the necessary
test code by means of this connection.
When a microprocessor-based board is powered up, it begins running the
code contained in its boot ROM. In a functional test environment this generally
consists of various test programs to exercise all areas of the PWA under test. An
emulator provides an alternative approach by taking control of the board after
PWA power up and providing the boot code. Two types of emulation are used:
processor emulation and ROM emulation.
Processor Emulation. Many microprocessor manufacturers incorporate
special test circuitry within their microprocessor designs which is generally ac-
cessible via a simple three- to five-wire serial interface. Instructions can be sent
through this interface to control the operation of the microprocessor. The typical
functions available include the following:
Stop the microprocessor.
Read/write to memory.
Read/write to I/O.
Set breakpoints.
Single step the microprocessor.
Using these low-level features, higher level functions can be constructed
that will assist in the development of functional test programs. These include
Download test program to microprocessor under test’s memory.
Run and control operation of downloaded program.
Implement test program at a scripting language level by using the read/
write memory or I/O features.
Recover detailed test results, such as what data line caused the memory to
fail.
Control and optimize the sequence in which test programs run to improve
test times.
Emulators are commercially available that already have these functions pre-
programmed.
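As a sketch of how the read/write-memory primitives can be scripted into a higher-level functional test, the Python below runs a walking-ones data-bus test through a mock emulator object. The MockEmulator class and its write_mem/read_mem methods are hypothetical stand-ins, not a real vendor API; the point is that the diagnostic can report exactly which data line failed.

```python
class MockEmulator:
    """Stand-in for a vendor emulator's read/write-memory primitives.
    The interface (write_mem/read_mem) is hypothetical, not a product API."""
    def __init__(self, stuck_low_bits=0):
        self._mem = {}
        self._stuck = stuck_low_bits          # simulate data lines stuck low
    def write_mem(self, addr, value):
        self._mem[addr] = value & ~self._stuck
    def read_mem(self, addr):
        return self._mem.get(addr, 0)

def data_bus_test(emu, addr=0x1000, width=32):
    """Walking-ones test: returns the list of failing data lines."""
    bad_lines = []
    for bit in range(width):
        pattern = 1 << bit
        emu.write_mem(addr, pattern)
        if emu.read_mem(addr) != pattern:
            bad_lines.append(bit)
    return bad_lines

emu = MockEmulator(stuck_low_bits=1 << 7)     # pretend data line D7 is stuck low
print("failing data lines:", data_bus_test(emu) or "none")   # -> [7]
```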
ROM Emulation. If microprocessor emulation is not available, ROM em-
ulation can be a viable alternative. A ROM emulator replaces the boot ROM of
the (DUT) device under test. This means that there must be some way of disabling
or removing this. Once connected, the ROM emulator can be used in two different
ways. In the first, the DUT is run from the ROM emulator code (after the test
code has been downloaded to the ROM emulator), rather than from the PWA’s
own boot code. This removes the need to program the ROM with boot code and
test code. Alternatively, the PWA’s own boot ROM can be used to perform the
initial testing and then switch to the ROM emulator to perform additional testing.
In the second method, the emulator is preloaded with some preprogrammed generic tests (such as read/write to memory and RAM tests). These
are controlled by means of a scripting language to quickly implement comprehen-
sive test programs.
workmanship issues that could later cause failure at the customer’s site. Optimal
ESS assumes that design defects and margin issues have been identified and cor-
rected through implementation of accelerated stress testing at design.
Also called accelerated stress testing, ESS has been extensively used for
product improvement, for product qualification, and for improving manufacturing
yields for several decades. In the early 1980s, computer manufacturers performed
various stress screens on their printed wiring assemblies. For example, in the
1981–1983 timeframe Apple Computer noticed that “a number of their boards
would suddenly fail in the midst of manufacturing, slowing things down. It hap-
pened to other computer makers, too. In the industry it had been seen as an
accepted bottleneck. It was the nature of the technology—certain boards have
weak links buried in places almost impossible to find. Apple, working with out-
side equipment suppliers, designed a mass-production PWA burn-in system that
would give the boards a quick, simulated three month test—a fast burn-in—the
bad ones would surface” (3). The result was that this “burn-in system brought
Apple a leap in product quality” (4).
Now a shift has taken place. As the electronics industry has matured, com-
ponents have become more reliable. Thus, since the late 1980s and early 1990s
the quality focus for electronic equipment has moved from individual components
to the attachment of these components to the PCB. The focus of screening has
changed as well, migrating to the PWA and product levels. The cause of failure
today is now much more likely to be due to system failures, hardware–software
interactions, workmanship and handling issues (mechanical defects and ESD, for
example), and problems with other system components/modules such as connec-
tors and power supplies. Stress screening is an efficient method of finding faults at
the final assembly or product stage, using the ubiquitous stresses of temperature,
vibration, and humidity, among others.
The lowest cost of failure point has moved from the component level to
board or PWA test, as shown in Figure 7 of Chapter 4. This is currently where
screening can have the greatest benefit in driving product improvement. As was the
case for components, increasing reliability mitigates the necessity for screening.
The decision process used to initiate or terminate screening should include tech-
nical and economic variables. Mature products will be less likely candidates for
screening than new products using new technology. Given the decision to apply
ESS to a new product and/or technology, as the product matures, ESS should be
withdrawn, assuming a robust field data collection–failure analysis–corrective
action process is in place. Figure 8 puts the entire issue of ESS into perspective
by spanning the range of test from ICs to systems and showing the current use
of each. Figure 1 of Ref. 5 shows a similar trend in reliability emphasis.
Much has been written in the technical literature over the past 10 years
regarding accelerated stress testing of PWAs, modules, and power supplies. The
FIGURE 8 Environmental stress screening perspective. C/A, corrective action; OEM, original equipment manufacturer; final ET, QA electrical test; EOS, electrical overstress; ESD, electrostatic discharge.
FIGURE 9 Accelerated stress testing uses substantially higher than normal specification
limits.
random vibration environments. The most effective screening process uses a com-
bination of environmental stresses. Rapid thermal cycling and triaxial six–degree
of freedom (omniaxial) random vibration have been found to be effective screens.
Rapid thermal cycling subjects a product to fast and large temperature vari-
ations, applying an equal amount of stress to all areas of the product. Failure
mechanisms such as component parameter drift, PCB opens and shorts, defective
solder joints, defective components, hermetic seal failures, and improperly made
crimps are precipitated by thermal cycling. It has been found that the hot-to-cold
temperature excursion during temperature cycling is most effective in precipitat-
ing early life failures.
Random vibration looks at a different set of problems than temperature or
voltage stressing and is focused more on manufacturing and workmanship de-
fects. The shift to surface mount technology and the increasing use of CMs make
it important to monitor the manufacturing process. Industry experience shows
that 20% more failures are detected when random vibration is added to thermal
cycling. Random vibration should be performed before thermal cycling since this
sequence has been found to be most effective in precipitating defects. Random
vibration is also a good screen to see how the PWA withstands the normally
encountered shipping and handling vibration stresses. Table 6 summarizes some
benefits of a combined temperature cycling and six-axis (degree of freedom) ran-
dom vibration ESS profile and lists ways in which the combined profile precipi-
tates latent defects.
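As a purely illustrative sketch of how such a combined screen might be parameterized, the Python below captures a temperature-cycling plus six-axis random-vibration profile as a simple data structure. Every numeric level shown is hypothetical; in practice the levels come from HALT results and the product's operating environment, as discussed below.

```python
from dataclasses import dataclass

@dataclass
class ThermalCycle:
    t_min_c: float         # cold dwell temperature
    t_max_c: float         # hot dwell temperature
    ramp_c_per_min: float
    dwell_min: float
    cycles: int

@dataclass
class RandomVibration:
    grms: float            # overall random vibration level
    duration_min: float
    axes: int              # 6 = six degrees of freedom (omniaxial)

@dataclass
class EssProfile:
    name: str
    vibration: RandomVibration   # applied before thermal cycling (see text)
    thermal: ThermalCycle

# Hypothetical screen for a PWA; the actual levels must be derived from HALT
# and the expected usage environment, not copied from these numbers.
profile = EssProfile(
    name="PWA screen rev A",
    vibration=RandomVibration(grms=10.0, duration_min=10.0, axes=6),
    thermal=ThermalCycle(t_min_c=-20.0, t_max_c=70.0,
                         ramp_c_per_min=15.0, dwell_min=10.0, cycles=5),
)
print(profile)
```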
Once detected, the cause of defects must be eliminated through a failure
analysis to root cause–corrective action implementation–verification of improve-
ment process. More about this later.
The results of highly accelerated life testing (HALT) (Chapter 3), which
was conducted during the product design phase, are used to determine the ESS
profile for a given product, which is applied as part of the normal manufacturing
FIGURE 10 Environmental stress screening profile for high-end computer server. GPR
(general product requirements) = operating environment limits.
process. The ESS profile selected for a given product must be based on a practical,
common-sense approach to the failures encountered, usage environment ex-
pected, and costs incurred. The proper application of ESS will ensure that the prod-
uct can be purged of latent defects that testing to product specifications will miss.
Figure 10 shows an example of a HALT profile leading to an ESS profile
compared with the product design specification general product requirements
(GPR) for a high-end computer server. From this figure you can see that the
screen provides accelerated stress compared to the specified product environment.
Recently compiled data show that at PWA functional test, 75% of defects de-
tected are directly attributed to manufacturing workmanship issues. Only 14%
of failures are caused by defective components or ESD handling issues, according
to EIQC Benchmark.
Figure 11 shows how a selected ESS profile was changed for a specific
CPU after evaluating the screening results. The dark curve represents the original
ESS profile, and the portions or sections labeled A, B, and C identify various
circuit sensitivities that were discovered during the iterative ESS process. For
section A, the temperature sensitivity of the CPU chip was causing ESS failures
that were judged unlikely to occur in the field. The screen temperature was low-
ered while an engineering solution could be implemented to correct the problem.
In section B the vibration level was reduced to avoid failing tall hand-inserted
capacitors whose soldering was inconsistent. Section C represented a memory
chip that was marginal at low temperature. Each of the problems uncovered in
these revisions was worked through a root cause/physics of failure–corrective
action process, and a less stressful ESS screen was used until the identified im-
provements were implemented.
The results shown in Figure 13 were particularly disturbing since a small but
significant probability of failure was predicted for temperatures in the normal
operating region. The problem could be approached either by reworking the
Comm Logic ASIC or by attempting a software workaround.
The final solution, which was successful, was a software revision. Figure
14 also shows the results obtained by upgrading system software from the old
version to the corrected version. In Figure 14, the failures have been converted
to rates to show direct comparison between the two software versions. Note that
the Comm Logic error has been completely eliminated and that the remaining
error rates have been significantly diminished. This result, completely unexpected
(and a significant lesson learned), shows the interdependence of software and
hardware in causing and correcting CPU errors.
Case Study 5: Family of CPU PWAs. Figure 15 is a bar chart showing
ESS manufacturing yields tracked on a quarterly basis. Each bar is a composite
yield for five CPU products. Note that the ESS yield is fairly constant. As process
and component problems were solved, new problems emerged and were ad-
dressed. In this case, given the complexity of the products, 100% ESS was re-
quired for the entire life of each product.
Figure 16 shows a detailed breakout of the 3Q97 ESS results shown in the
last bar of Figure 15, and adds pre-ESS yields for the five products in production
that make up the 3Q97 bar. This chart shows the value of conducting ESS in
production and the potential impact of loss in system test or the field if ESS were
not conducted. Notice the high ESS yield of mature PWAs (numbers 1–3) but
the low ESS yield of new boards (4 and 5), showing the benefit of ESS for new
products. Also, note particularly that the post-ESS yields for both mature and
immature products are equivalent, indicating that ESS is finding the latent defects.
Figure 16 also shows that the value of ESS must be constantly evaluated. At
some point in time when yield is stable and high, it may make sense to discontinue
its use for that PWA/product. Potential candidates for terminating ESS are PWA
numbers 1–3.
Figures 17–20 show the results of ESS applied to another group of CPU
PWAs expressed in terms of manufacturing yield. Figures 17 and 18 show the
ESS yields of mature PWAs, while Figures 19 and 20 show the yields for new
CPU designs. From these figures, it can be seen that there is room for improve-
ment for all CPUs, but noticeably so for the new CPU designs of Figures 19 and
20. Figures 17 and 18 raise the question of what yield is good enough before we
cease ESS on a 100% basis and go to lot testing, skip lot testing, or cease testing
altogether. The data also show that ESS has the opportunity to provide real prod-
uct improvement.
The data presented in Figures 17 through 20 are for complex high-end
CPUs. In the past, technology development and implementation were driven pri-
marily by high-end applications. Today, another shift is taking place; technology
is being driven by the need for miniaturization, short product development times
(6 months) and short product life cycles (<18 months), fast time to market, and
consumer applications. Products are becoming more complex and use complex
ICs. We have increasing hardware complexity and software complexity and their
interactions. All products will exhibit design- or process-induced faults. The
question that all product manufacturers must answer is how many of these will
we allow to get to the field. Given all of this, an effective manufacturing defect
test strategy as well as end-of-line functional checks are virtually mandated.
life) failure rate in PWAs over the past 10–15 years. Application of ESS during
manufacturing can further reduce early life failures, as shown in the figure.
The question that needs to be answered is how do failures from an ESS
stress-to-failure distribution map onto the field failure time-to-failure distribution.
Figure 22 graphically depicts this question. Failures during the useful life (often
called the steady state) region can be reduced by proper application of the ESS–
failure analysis to root cause–implement corrective action–verify improvement
(or test–analyze–fix–test) process. The stress-to-fail graph of Figure 22 indicates
that about 10–15% of the total population of PWAs subjected to ESS fail. The
impact of this on time to failure (hazard rate) is shown in the graph on the right
of that figure.
Implementing an ESS manufacturing strategy reduces (improves) the infant
mortality rate. Field failure data corroborating this for a complex computer server
CPU are shown in Figure 23. Data were gathered for identical CPUs (same design
revision), half of which were shipped to customers without ESS and half with
ESS. The reason for this is that an ESS manufacturing process was implemented
in the middle of the manufacturing life of the product, so it was relatively easy
to obtain comparison data holding all factors except ESS constant. The top curve
shows failure data for PWAs not receiving ESS, the bottom curve for PWAs
receiving ESS.
Figure 24 shows field failure data for five additional CPU products. These
data support the first two regions of the generic failure curve in Figure 21. Figure
24 shows the improvements from one generation to succeeding generations of a
product family (going from CPU A to CPU E) in part replacement rate (or failure
rate) as a direct result of a twofold test strategy: HALT is utilized during the
product design phase and an ESS strategy is used in manufacturing with the
resultant lessons learned being applied to improve the product.
The previous section showed the effectiveness of 100% ESS. Pre-ESS man-
ufacturing yields compared with post-ESS yields show improvement of shippable
PWA quality achieved by conducting 100% ESS. This is directly translated into
lower product cost, positive customer goodwill, and customer product rebuys.
In all of the preceding discussions, it is clear that a key to success in the
field of accelerated stress testing is the ability to make decisions in the face of
large uncertainties. Recent applications of normative decision analysis in this
field show great promise. Figures 25 and 26 show the probability of achieving positive net present savings (NPS) when applying ESS to a new CPU product and a mature CPU product, respectively.
The NPS is net present savings per PWA screened in the ESS operation.
It is the present value of all costs and all benefits of ESS for the useful lifetime
of the PWA. A positive NPS is obtained when benefits exceed costs.
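A minimal sketch of the NPS calculation, with entirely hypothetical per-board costs, benefits, and discount rate: the up-front screening cost in year 0 is weighed against the discounted stream of avoided field-failure costs over the PWA's useful life.

```python
def net_present_savings(costs_by_year, benefits_by_year, discount_rate):
    """NPS per screened PWA: present value of benefits minus present value
    of costs over the PWA's useful life. Year 0 is the year of screening."""
    nps = 0.0
    for year, (cost, benefit) in enumerate(zip(costs_by_year, benefits_by_year)):
        nps += (benefit - cost) / (1.0 + discount_rate) ** year
    return nps

# Hypothetical numbers: ESS costs $40 per board up front; avoided field
# repairs, warranty costs, and goodwill are worth $15-25 per board per year
# over a 4-year useful life.
costs    = [40.0, 0.0, 0.0, 0.0, 0.0]
benefits = [0.0, 25.0, 20.0, 15.0, 15.0]
nps = net_present_savings(costs, benefits, discount_rate=0.10)
print(f"NPS per screened PWA: ${nps:.2f}")   # positive -> screening pays off
```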
This approach to decisionmaking for ESS shows that there is always a
possibility that screening will not produce the desired outcome of fielding more
reliable products, thus reducing costs and saving money for the company. Many
variables must be included in the analysis, both technical and financial. From the
two distributions shown in Figures 25 and 26, it is seen that there is a 20% chance
we will have a negative NPS for CPU product B and an 80% chance we will
have a negative NPS for CPU product E. The decision indicated is to continue ESS for CPU product B and to stop ESS for CPU product E. The obvious next step is to analyze the decision of whether or not to perform ESS on a sample basis.
FIGURE 25 Probability distribution of net present savings for ESS on a new CPU product (B in Fig. 24).
FIGURE 26 Probability distribution of net present savings for ESS on a mature CPU product (E in Fig. 24).
Here we see the historical trend repeating itself—as products become more
reliable, there is less need to perform ESS. Just as in the case for components,
the economics of screening become less favorable as the product becomes more
reliable. Emphasis then shifts to sampling or audit ESS.
wrong with them (no trouble found—NTF); they meet all published specifications. In the electronics equipment/product industry, 40–50% of the anomalies discovered are no trouble found. Much time in manufacturing is spent troubleshooting
the NTFs to arrive at a root cause for the problem. Often the components are
given to sustaining engineering for evaluation in a development system versus
the artificial environment of automated test equipment. Investigating the cause of
no trouble found/no defect found (NDF)/no problem found (NPF) components
is a painstaking, costly, and time-consuming process. How far this investigation
is taken depends on the product complexity, the cost of the product, the market
served by the product, and the amount of risk and ramifications of that risk the
OEM is prepared to accept. Listed in the following sections are lessons learned
from this closed-loop investigative anomaly verification–corrective action pro-
cess.
Lesson 1: PWAs are complex structures.
The “fallout” from electrical tests and ESS is due to the complex interactions
between the components themselves (such as parametric distribution variation
leading to margin pileup when all components are interconnected on the PWA,
for example), between the components and the PWA materials, and between the
hardware and software.
Lesson 2: Determine the root cause of the problem and act on the results.
Performing electrical testing or ESS by itself has no value. It's what is done with the outcome or results of the testing that counts. Problems, anomalies, and failures
discovered during testing must be investigated in terms of both short- and long-
term risk. For the short-term, a containment and/or screening strategy needs to
be developed to ensure that the defective products don’t get to the field/customer.
For the long term, a closed-loop corrective action process to preclude recurrence
of the failures is critical to achieving lasting results. In either case the true root
cause of the problem needs to be determined, usually through an intensive investi-
gative process that includes failure analysis. This is all about risk evaluation,
containment, and management. It is imperative that the containment strategies
and corrective actions developed from problems or failures found during electri-
cal testing and ESS are fed back to the Engineering and Manufacturing Depart-
ments and the component suppliers. To be effective in driving continuous im-
provement, the results of ESS must be
1. Fed back to Design Engineering to select a different supplier or im-
prove a supplier’s process or to make a design/layout change
2. Fed back to Design Engineering to select a different part if the problem
was misapplication of a given part type or improper interfacing with
other components on the PWA
3. Fed back to Design Engineering to modify the circuit design, i.e., use
a mezzanine card, for example
4. Fed back to Manufacturing to make appropriate process changes, typi-
cally of a workmanship nature
Lesson 3: Troubleshooting takes time and requires a commitment of re-
sources.
Resources required include skilled professionals along with the proper test and
failure analysis tools. There is a great deal of difficulty in doing this because
engineers would prefer to spend their time designing the latest and greatest prod-
uct rather than support an existing production design. But the financial payback
to a company can be huge in terms of reduced scrap and rework, increased reve-
nues, and increased goodwill, customer satisfaction, and rebuys.
Lesson 4: Components are not the major cause of problems/anomalies.
Today the causes of product/equipment errors and problems are handling problems (mechanical damage and ESD); PCB attachment (solderability and workmanship) issues; misapplication/misuse of components (i.e., design application not compatible with the component); connectors; power supplies; electrical overstress (EOS); system software “rev” versions; and system software–hardware interactions.
Lesson 5: The majority of problems are NTF/NDF/NPF.
In analyzing and separately testing individual components that were removed
from many PWAs after the troubleshooting process, it was found that the re-
moved components had no problems. In reality, though, there is no such thing as an NDF/NTF/NPF; the problem has simply not been found, due either to insufficient time or resources being expended or to incomplete and inaccurate diagnostics. Sub-
sequent evaluation in the product or system by sustaining engineering has re-
vealed that the causes for NTF/NDF/NPF result from
1. Shortcuts in the design process resulting in lower operating margins
and yields and high NTF. This is due to the pressures of fast time to
market and time to revenue.
2. Test and test correlation issues, including low test coverage and not
using current (more effective) revision of test program and noncompre-
hensive PWA functional test software.
3. Incompatible test coverage of PWA to component.
4. Component lot-to-lot variation. Components may be manufactured at
various process corners impacting parametric conditions such as bus
hold, edge rate, and the like.
5. Design margining/tolerance stacking of components used due to para-
The single largest defect detractor (>40%) is the no defect/no trouble found issue.
Of 40 problem PWA instances, two were due to manufacturing process issues and the
balance to design issues.
Lucent expended resources to eliminate NDFs as a root cause and has found the real
causes of so-called NDFs to be lack of training, poor specifications, inadequate
diagnostics, different equipment used that didn’t meet the interface standards, and
the like.
No defect found is often the case of a shared function. Two ICs together constitute an
electrical function (such as a receiver chip and a transmitter chip). If a problem
occurs, it is difficult to determine which IC is the one with the defect because the
inner functions are not readily observable.
Removing one or more ICs from a PWA obliterates the nearest neighbor, shared
function, or poor solder joint effects that really caused the problem.
Replace a component on a PWA and the board becomes operational. Replacing the
component shifted the PWA’s parameters so that it works. (But the PWA was not
defective.)
The most important and effective place to perform troubleshooting is at the customer
site since this is where the problem occurred. However, there is a conflict here
because the field service technician’s job is to get the customer’s system up and
running as fast as possible. Thus, the technician is always shotgunning to resolve a
field or system problem and removes two or three boards and/or replaces multiple
components resulting in false pulls. The field service technician, however, is the
least trained to do any troubleshooting and often ends up with a trunk full of
components. Better diagnostics are needed.
Software–hardware interactions may happen once in a blue moon.
In a manufacturing, system, or field environment we are always attempting to isolate a problem or anomaly to a thing: a PWA, a component, etc. In many instances the real culprit is the design environment and application (how a component is used, wrong pull-up or pull-down resistor values, or a PWA or component that does not work in a box that heats up, for example). Inadequate design techniques are big issues.
Testing a suspected IC on ATE often finds nothing wrong with the IC because the
ATE is an artificial environment that is dictated by the ATE architecture, strobe
placement, and timing conditions. Most often the suspected IC then needs to be
placed in a development system using comprehensive diagnostics that are run by
sustaining development engineers to determine if the IC has a problem or not. This
ties in with Lessons 3 and 7.
not fail consistently, this is a strong clue that the interconnection between the
PWA and the suspected component contributes to the PWA failure and should
be investigated further. If the failure does repeat consistently after removal and
replacement three times, the suspected failing component should be used in a
component swap, as described next.
corrective action. If, however, after the swap, the failed PWA still fails and the
passing PWA still passes, then the swapped component probably is not the cause
of the problem. If after the swap, both PWAs now pass, the root cause may have
something to do with the interconnection between the component and the PWA
(such as a fractured solder joint) and nothing to do with the component.
Lesson 6: Accurate diagnostics and improved process mapping tests are
required for problem verification and resolution.
Let’s take an example of a production PWA undergoing electrical test to illustrate
the point. Diagnostics point to two or three possible components that are causing
a PWA operating anomaly. This gives a 50% and 33% chance of finding the
problem component, respectively, and that means that the other one or two com-
ponents are NTF. Or it may not be a component issue at all, in which case the
NTF is 100%. Take another example. Five PWAs are removed from a system
to find the one problem PWA, leaving four PWAs, or 80%, as NTF in the best
case. This ties in with Lesson 5.
The point being made is that diagnostics that isolate a 50% NTF rate are
unacceptable. Diagnostics are typically not well defined and need to be improved
because PWAs are complex assemblies that are populated with complex compo-
nents.
Lesson 7: Sustaining Engineering needs to have an active role in problem
resolution.
Many of the anomalies encountered are traced to one or more potential problem
components. The problem components need to be placed in the product/system
to determine which, if any, component is truly bad. As such, sustaining engi-
neering’s active participation is required. This ties in with Lesson 3.
modules, and power supplies work together and that the entire system/product
functions in accordance with the requirements established by the global specifi-
cation and the customer. For example, a Windows NT server system could consist
of basic computer hardware, Windows NT server and clustering software, a Veri-
tas volume manager, a RAID box, a SCSI adapter card, and communications
controllers and drivers.
Many of the same issues incurred in IC and PWA test cascade to the system
(finished or complete product) level as well. Since products begin and end at the
system level, a clear understanding of the problems expected must emanate from
the customer and/or marketing-generated product description document. The sys-
tem specifications must clearly state a test and diagnostic plan, anticipating diffi-
cult diagnostic issues (such as random failures that you know will occur); specify
the expected defects and the specific test for these defects; and outline measure-
ment strategies for defect coverage and the effectiveness of these tests.
Some of the issues and concerns that cause system test problems which
must be considered in developing an effective system test strategy include
Defects arise from PWA and lower levels of assembly as well as from
system construction and propagate upward.
Interconnections are a major concern. In microprocessor-based systems,
after the data flow leaves the microprocessor it is often affected by
glitches on the buses or defects due to other components.
Timing compatibility issues between the various PWAs, assemblies, and modules can lead to violating various timing states, such as bus contention and crosstalk, to name two.
Effectiveness of software diagnostic debug and integration.
Design for test (DFT) is specified at the top of the hierarchy, but imple-
mented at the bottom: the IC designer provides cures for board and sys-
tem test, and the board designer provides cures for system test.
ICs, PWAs, assemblies, and modules have various levels of testability designed in (DFT) that may be sufficient at each of these levels. However, their interconnectivity and effectiveness when taken together can cause both timing and testing nightmares. Also, the level of DFT implementation may range anywhere from the casual and careless to the comprehensive.
System failures and intermittent issues can result in shortened component
(IC) life.
Computer systems are so complex that no one has figured out how to design
out or test for all the potential timing and race conditions, unexpected
interactions, and nonrepeatable transient states that occur in the real
world.
Interaction between hardware and software is a fundamental concern.
A list of some typical system test defects is presented in Table 8. This list
does not contain those manufacturing and workmanship defects that occur during
PWA manufacturing and as detected by PWA visual inspection, ICT, and func-
tional test: solder issues, missing and reversed components, and various PCB
issues (vias and plated through-holes, for example). Since system test is so varied
and complex, there is no way to address and do justice to the myriad unique
issues that arise and must be addressed and solved. System testing includes not
only electrical testing, but also run-in test—also called system burn-in (e.g., 72-hr
run-in at 25°C, or 48-hr at 50°C)—that checks for product stability and facilitates
exercising the product with the full diagnostic test set. The same lessons learned
from PWA testing apply to systems test as well. Once a system/product success-
fully completes system test it can be shipped to the customer.
Field data are tracked and analyzed to determine
The accuracy of the predictions made and the effectiveness of the reliability tests conducted
Whether established goals are being met
Whether product reliability is improving
In addition to measuring product reliability, field data are used to determine the
effectiveness of the design and manufacturing processes, to correct problems in
existing products, and to feed back corrective action into the design process for
new products. Regular reports are issued disseminating both good and bad news
from the field. The results are used to drive corrective actions in design, manufac-
turing, component procurement, and supplier performance.
A comprehensive data storage system is used to determine accurate opera-
tion times for each field replaceable unit (FRU) and allows identification of a
field problem that results in a unit being replaced, as well as the diagnosis and
FIGURE 28 Example of field data collection system. (From Ref. 1. Courtesy of the Tan-
dem Division of Compaq Computer Corporation Reliability Engineering Department.)
repair actions on the unit that caused the problem. Since the installation and
removal dates are recorded for each unit, the analysis can be based on actual run
times, instead of estimates based on ship and return dates.
At the Tandem Division of Compaq Computer Corp. data are extracted
from three interlinked databases (Figure 28). Each individual FRU is tracked
from cradle to grave, i.e., from its original ship date through installation in the
field and, if removed from a system in the field, through removal date and repair.
From these data, average removal rates, failure rates, and MTBF are computed.
Individual time-to-fail data are extracted and plotted as multiply processed data
on a Weibull hazard plot to expose symptoms of wearout.
Using the data collection system shown in Figure 28, a typical part replace-
ment rate (PRR) is computed by combining installation data from the Installed
Systems database and field removal data from the Field Service Actions database.
The data are plotted to show actual field reliability performance as a function of
time versus the design goal. If the product does not meet the goal, a root cause
analysis process is initiated and appropriate corrective action is implemented. An
example of the PRR for a disk controller is plotted versus time in Figure 29 using
a 3-month rolling average.
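To make the computation concrete, here is a minimal sketch in Python of deriving a part replacement rate with a 3-month rolling average from monthly install and removal counts. The PRR definition used (annualized removals per installed unit, in percent per year) and all of the numbers are illustrative assumptions, not the actual field data system's formula or schema.

# Sketch: 3-month rolling-average part replacement rate (PRR) from monthly
# install and removal counts. The PRR definition here (annualized removals
# per installed unit, in percent per year) is an illustrative assumption.
monthly_installs = [120, 150, 90, 200, 170, 160]   # units installed each month (hypothetical)
monthly_removals = [2, 3, 1, 4, 5, 3]               # units removed each month (hypothetical)

installed_base = []
total = 0
for n in monthly_installs:
    total += n
    installed_base.append(total)                    # cumulative installed base per month

# Monthly PRR: removals in the month divided by installed base, annualized to %/year.
monthly_prr = [12 * 100.0 * r / base for r, base in zip(monthly_removals, installed_base)]

# 3-month rolling average to smooth month-to-month noise, as in the plotted field data.
rolling_prr = []
for i in range(len(monthly_prr)):
    window = monthly_prr[max(0, i - 2): i + 1]
    rolling_prr.append(sum(window) / len(window))

for month, (prr, avg) in enumerate(zip(monthly_prr, rolling_prr), start=1):
    print(f"month {month}: PRR = {prr:.2f}%/yr, 3-month rolling avg = {avg:.2f}%/yr")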
Figure 30 shows a 3-month rolling average part replacement rate for a
product that exhibited several failure mechanisms that had not been anticipated
during the preproduction phase. Corrective action included respecification of a
critical component, PWA layout improvement, and firmware updates. The new
revision was tracked separately and the difference was dramatically demonstrated
by the resultant field performance data. The old version continued to exhibit an
FIGURE 29 Example of a disk controller field analysis. (From Ref. 1. Courtesy of the
Tandem Division of Compaq Computer Corporation Reliability Engineering Department.)
unsatisfactory failure rate, but the new version was immediately seen to be more
reliable and quickly settled down to its steady state value, where it remained until
the end of its life.
Power supplies are particularly troublesome modules. As such much data
are gathered regarding their field performance. Figure 31 plots the quantity of a
FIGURE 31 Power supply field data: (a) installed field base for a power supply; (b) field
removal for the power supply; (c) part replacement rate/MTBF for the power supply.
given power supply installed per month in a fielded mainframe computer (a),
units removed per month from the field (b), and run hours in the field (c). The
PRR and MTBF for the same power supply are plotted in Figure 32a and b,
respectively.
Certain products such as power supplies, disk drives, and fans exhibit
known wearout failure mechanisms. Weibull hazard analysis is performed on a
regular basis for these types of products to detect signs of premature wearout.
To perform a Weibull analysis, run times must be known for all survivors as
FIGURE 32 Example of 3-month rolling average plots of (a) PRR and (b) MTBF for
power supply of Figure 31.
FIGURE 33 Weibull plot showing disk drive wearout. Note: The software’s attempt to
fit a straight line to these bimodal data illustrates the necessity of examining the graph
and not blindly accepting calculations. (From Ref. 1. Courtesy of the Tandem Division
of Compaq Computer Corporation Reliability Engineering Department.)
well as for removals. A spreadsheet macro can be used to compute the hazard
rates and plot a log–log graph of cumulative hazard rate against run time. Figure
33 is a Weibull hazard plot of a particular disk drive, showing significant prema-
ture wearout. This disk drive began to show signs of wearout after 1 year (8760
hr) in the field, with the trend being obvious at about 20,000 hr. Field tracking
confirmed the necessity for action and then verified that the corrective action
implemented was effective.
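The hazard calculation itself is straightforward to sketch. The fragment below is a minimal illustration, not the spreadsheet macro referenced above: it computes a cumulative hazard over a mixed set of failure and survivor run times, from which the log-log Weibull hazard plot is drawn. All run times are hypothetical.

import math

# Run times in hours; True marks a failure (removal for cause), False a survivor
# (still running). Values are hypothetical, for illustration only.
units = [(8000, False), (9500, True), (12000, False), (15000, True),
         (18000, True), (21000, False), (24000, True), (30000, False)]

# Sort by run time; at each failure the hazard increment is 1 / (number of
# units still at risk, i.e., with run times >= the failure time).
units.sort(key=lambda u: u[0])
n = len(units)
cum_hazard = 0.0
points = []
for i, (hours, failed) in enumerate(units):
    if failed:
        at_risk = n - i                  # reverse rank
        cum_hazard += 1.0 / at_risk
        points.append((hours, cum_hazard))

# Log-log points for the Weibull hazard plot; the slope estimates the shape
# parameter (slope < 1 decreasing, = 1 constant, > 1 increasing failure rate).
for hours, h in points:
    print(f"t = {hours:6d} hr  H(t) = {h:.3f}  "
          f"ln t = {math.log(hours):.3f}  ln H = {math.log(h):.3f}")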
Figure 34 is the Weibull hazard plot for the example power supply of Fig-
ures 31 and 32. This plot has a slope slightly less than 1, indicating a constant
to decreasing failure rate for the power supply. Much effort was expended over
an 18-month period in understanding the root cause of myriad problems with
this supply and implementing appropriate corrective actions.
FIGURE 34 Weibull hazard plot for power supply of Figures 31 and 32. Slope > 1: increasing failure rate. Slope = 1: constant failure rate. Slope < 1: decreasing failure rate. (From Ref. 1. Courtesy of the Tandem Division of Compaq Computer Corporation Reliability Engineering Department.)
Example

Complaint/symptom: Data line error
    What happened? How does the problem manifest itself?
Failure mode: Leakage on pin 31
    Diagnostic or test result (customer site and IC supplier incoming test)? How is the problem isolated to a specific IC (customer site)? What is the measurement or characterization of the problem?
Failure mechanism: Oxide rupture
    Physical defect or nonconformity of the IC? What is the actual anomaly that correlates to the failure mode?
Failure cause: ESD
    Explanation of the direct origin or source of the defect. How was the defect created? What event promoted or enabled this defect?
Root cause: Improper wrist strap use
    Description of the initial circumstances that can be attached to this problem. Why did this problem happen?
ESD, electrostatic discharge.
ultimate cause of failure (see Fig. 36). The process of Figure 36 is shown in
serial form for simplicity. However, due to the widely varying nature of compo-
nents, failures, and defect mechanisms, a typical analysis involves many iterative
loops between the steps shown. Identifying the failure mechanism requires an
understanding of IC manufacturing and analysis techniques and a sound knowl-
edge of the technology, physics, and chemistry of the devices plus a knowledge
of the working conditions during use. Let’s look at each of the steps of Figure
36 in greater detail.
Fault Localization
The first and most critical step in the failure analysis process is fault localization.
Without knowing where to look on a complex VLSI component, the odds against
locating and identifying a defect mechanism are astronomical. The problem is
like the familiar needle in the haystack.
Because of the size and complexity of modern VLSI and ULSI components,
along with the nanometric size of defects, it is imperative to accurately localize
faults prior to any destructive analysis. Defects can be localized to the nearest
logic block or circuit net or directly to the physical location of the responsible
defect. There are two primary methods of fault localization: hardware-based diag-
nostics using physical parameters like light, heat, or electron-beam radiation, and
software-based diagnostics using simulation and electrical tester (ATE) data.
Hardware diagnostic techniques are classified in two broad categories. The
first is the direct observation of a physical phenomenon associated with the defect
and its effects on the chip’s operation. The second is the measurement of the
chip’s response to an outside physical stimulus, which correlates to the instanta-
neous location of that stimulus at the time of response. While a fault can some-
times be isolated directly to the defect site, there are two primary limitations of
hardware diagnostics.
The first is that the techniques are defect dependent. Not all defects emit
light or cause localized heating. Some are not light sensitive nor will they cause
a signal change that can be imaged with an electron beam. As such it is often
necessary to apply a series of techniques, not knowing ahead of time what the
defect mechanism is. Because of this, it can often take considerable time to local-
ize a defect.
The second and most serious limitation is the necessity for access to the
chip’s transistors and internal wiring. In every case, the appropriate detection
equipment or stimulation beam must be able to view or irradiate the site of inter-
est, respectively. With the increasing number of metal interconnect layers and
the corresponding dielectric layers and the use of flip chip packaging, the only
way to get to the individual transistor is through the backside of the die.
Software diagnostics are techniques that rely on the combination of fault
simulation results and chip design data to determine probable fault locations.
While it is possible to do this by manually analyzing failure patterns, it is imprac-
tical for ICs of even moderate complexity. Software diagnostics are generally
categorized in two groups that both involve simulation of faults and test results:
precalculated fault dictionaries and posttest fault simulation.
Deprocessing
Once the fault has been localized as accurately as possible the sample must be
prepared for further characterization and inspection. At this stage the chip usually
needs to first be removed from its package. Depending on the accuracy of fault
localization and the nature of the failure, perhaps multiple levels of the interlevel
insulating films and metal wiring may need to be sequentially inspected and re-
moved. The process continues until the defect is electrically and physically iso-
lated to where it is best identified and characterized.
To a great extent deprocessing is a reversal of the manufacturing process;
films are removed in reverse order of application. Many of the same chemicals
and processes used in manufacturing to define shapes and structures are also used
in the failure analysis laboratory, such as mechanical polishing, plasma or dry
etching, and wet chemical etching.
Defect Localization
Again, depending on the accuracy of fault localization and the nature of the fail-
ure, a second localization step or characterization of the defect may be necessary.
At this point the defect may be localized to a circuit block like a NAND gate,
latch, or memory cell. By characterizing the effects of the defect on the circuit’s
performance it may be possible to further pinpoint its location. Because the subse-
quent steps are irreversible it is important to gather as much information as possi-
ble about the defect and its location before proceeding with the failure analysis.
A number of tools and techniques exist to facilitate defect localization and
characterization. Both optical source and micrometer-driven positioners with
ultrafine probes (with tips having diameters of approximately 0.2 µm) are used
to inject and measure signals on conductors of interest. High-resolution optical
microscopes with long working-distance objectives are required to observe and
position the probes. Signals can be DC or AC. Measurement resolution of tens
of millivolts or picoamperes is often required. Because of shrinking linewidths
it has become necessary to use a focused ion beam (FIB) tool to create localized
probe pads on the nodes of interest. A scanning probe microscope (SPM) may
be used to measure the effects of the defect on electrostatic force, atomic force,
or capacitance. A number of other techniques are used for fault localization based
on the specific situation and need. These are based on the use of light, heat, or
electron-beam radiation.
ACKNOWLEDGMENTS
Portions of Section 7.1.2 excerpted from Ref. 1.
Much of the material for Section 7.2.4 comes from Refs. 2, 3 and 6.
Portions of Section 7.3 excerpted from Ref. 7, courtesy of the Tandem Divi-
sion of the Compaq Computer Corporation Reliability Engineering Department.
Portions of Section 7.3.1 excerpted from Ref. 8.
REFERENCES
1. Hnatek ER, Russeau JB. PWA contract manufacturer selection and qualification, or the care and feeding of contract manufacturers. Proceedings of the Military/Aerospace COTS Conference, Albuquerque, NM, 1998, pp 61–77.
2. Hnatek ER, Kyser EL. Straight facts about accelerated stress testing (HALT and
ESS)—lessons learned. Proceedings of The Institute of Environmental Sciences and
Technology, 1998, pp 275–282.
3. Hnatek ER, Kyser EL. Practical Lessons Learned from Overstress Testing—a Histori-
cal Perspective, EEP Vol. 26-2: Advances in Electronic Packaging—1999. ASME
1999, pp 1173–1180.
4. Magaziner I, Patinkin M. The Silent War, Random House Publishers, 1989.
5. Lalli V. Space-system reliability: a historical perspective. IEEE Trans Reliability
47(3), 1998.
6. Roettgering M, Kyser E. A Decision Process for Accelerated Stress Testing, EEP
Vol. 26-2: Advances in Electronic Packaging—1999, Vol. 2. ASME, 1999, pp 1213–
1219.
7. Elerath JG et al. Reliability management and engineering in a commercial computer
environment. Proceedings of the International Symposium on Product Quality and
Integrity, pp 323–329.
8. Vallet D. An overview of CMOS VLSI/failure analysis and the importance of test
and diagnostics. International Test Conference, ITC Lecture Series II, October 22,
1996.
FURTHER READING
1. Albee A. Backdrive current-sensing techniques provide ICT benefits. Evaluation En-
gineering Magazine, February 2002.
2. Antony J et al. 10 steps to optimal production. Quality, September 2001.
3. Carbone J. Involve buyers. Purchasing, March 21, 2002.
4. IEEE Components, Packaging and Manufacturing Technology Society Workshops
on Accelerated Stress Testing Proceedings.
5. International Symposium for Testing and Failure Analysis Proceedings.
6. Kierkus M and Suttie R. Combining x-ray and ICT strategies lowers costs. Evalua-
tion Engineering, September 2002.
7. LeBlond C. Combining AOI and AXI, the best of both worlds. SMT, March 2002.
8. Prasad R. AOI, Test and repair: waste of money? SMT, April 2002.
9. Radiation-induced soft errors in silicon components and computer systems. Interna-
tional Reliability Symposium Tutorial, 2002.
10. Ross RJ et al. Microelectronic Failure Analysis Desk Reference. ASM International,
1999 and 2001 Supplement.
11. Scheiber S. The economics of x-rays. Test & Measurement World, February 2001.
12. Serant E, Sullivan L. EMS taking up demand creation role. Electronic Buyers News,
October 1, 2001.
13. Sexton J. Accepting the PCB test and inspection challenge. SMT, April 2001.
14. Verma A, Hannon P. Changing times in test strategy development. Electronic Pack-
aging and Production, May 2002.
8
Software
8.1 INTRODUCTION
As stated at the outset, I am a hardware person. However, to paint a complete
picture of reliability, I think it is important to mention some of the issues involved
in developing reliable software. The point is that for microprocessor-based prod-
ucts, hardware and software are inextricably interrelated and codependent in
fielding a reliable product to the customer base.
In today’s systems, the majority of issues/problems that crop up are attrib-
uted to software rather than hardware or the interaction between hardware and
software. For complex systems like high-end servers, oftentimes software fixes
are made to address hardware problems because software changes can be made
more rapidly than hardware changes or redesign. There appears to be a larger
gap between customer expectations and satisfaction as relates to software than
there is for hardware. Common software shortfalls from a system perspective
include reliability, responsiveness to solving anomalies, ease of ownership, and
quality of new versions.
An effective software process must be predictable; cost estimates and
schedule commitments must be met with reasonable consistency; and the re-
sulting products (software) should meet users' functional and quality expecta-
tions. The software process is the set of tools, methods, and practices that are
used to produce a software product. The objectives of software process manage-
First, a typical hardware design team for a high-end server might have 20
designers, whereas the software project, to support the server hardware design
might involve 100 software engineers.
Second, in the hardware world there is a cost to pay for interconnects.
Therefore, the goal is to minimize the number of interconnects. With software
code, on the other hand, there is no penalty or associated cost for connections
(go to statements). The lack of such a penalty for the use of connections can
add complexity through the generation of “spaghetti code” and thus many oppor-
tunities for error. So an enforced coding policy is required that limits the use of
go to statements to minimize defects.
Also like hardware design, software development can be segmented or di-
vided into manageable parts. Each software developer writes what is called a
unit of code. All of the units of code written for a project are combined and
integrated to become the system software. It is easier to test and debug a unit of
software code than it is to test and debug the entire system.
Hardware designers will often copy a mundane workhorse portion of a
circuit and embed it in the new circuit design (Designers like to spend their time
designing with the latest microprocessor, memory, DSP, or whatever, rather than
designing circuitry they consider to be mundane and not challenging). This ap-
proach often causes interface and timing problems that are not found until printed
wiring assembly (PWA) or system test. Software designers copying a previously written unit of code and inserting it into a new software development project without any forethought could produce similarly dangerous results.
Another significant difference between hardware design and software code
development is that a unit of software may contain a bug and still function (i.e.,
it works), but it does the wrong thing. However, if a hardware circuit contains
a defect (the equivalent of a software bug), it generally will not function.
As a result of these similarities and differences, hardware developers and
their software counterparts can learn from each other as to what methods and
processes work best to build a robust and reliable product.
probabilistic measures as the mean time between failure (MTBF) and the mean
time required to repair and restore the system to full operation (MTTR). Assum-
ing the system is required to be continuously available, availability is the percent
of total time that the system is available for use:
Availability = [MTBF / (MTBF + MTTR)] × 100
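As a quick numerical check of this definition, the short sketch below evaluates availability for assumed MTBF and MTTR values; the numbers are illustrative only.

def availability_percent(mtbf_hours: float, mttr_hours: float) -> float:
    """Percent of total time the system is available: MTBF / (MTBF + MTTR) x 100."""
    return 100.0 * mtbf_hours / (mtbf_hours + mttr_hours)

# Hypothetical values: 5000-hr MTBF and a 4-hr mean time to repair/restore.
print(f"{availability_percent(5000.0, 4.0):.3f}%")   # about 99.920%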
Availability is a useful measure of the operational quality of some systems. Un-
fortunately, it is very difficult to project prior to operational testing.
What is the anticipated rate of customer installation for this type of product?
A high installation rate generally causes a sharp early peak in defect rate
with a rapid subsequent decline. Typically, programs install most rapidly
when they require minimal conversion and when they do not affect over-
all system operation. Compiler and utility programs are common exam-
ples of rapidly installed products.
What is the product release history? A subsequent release may install
quickly if it corrects serious deficiencies in the prior release. This, of
course, requires that the earliest experience with the new version is posi-
tive. If not, it may get a bad reputation and be poorly accepted.
What is the distribution plan? Will the product be shipped to all buyers
immediately; will initial availability be limited; or is there to be a prelimi-
nary trial period?
Is the service system established? Regardless of the product quality, will
the users be motivated and able to submit defect reports? If not, the
defect data will not be sufficiently reliable to validate the development
process.
Program module quality will vary, with a relatively few modules containing
the bulk of the errors.
The remaining modules will likely contain a few randomly distributed de-
fects that must be individually found and removed.
The distribution of defect types will also be highly skewed, with a relatively
few types covering a large proportion of the defects.
Since programming changes are highly error-prone, all changes should be viewed as potential sources of defect injection.
While this characterization does not qualify as a model in any formal sense, it
does provide focus and a framework for quality planning.
Defects.
Defect detection, root cause analysis, and resolution. (For an IC, one cannot find all of the defects/bugs: stuck-at fault (SAF) coverage ranges from 85 to 98%, but delay, bridging, opens, and other defects cannot be found with SAF tests, so 100% test coverage is impossible to obtain. Software bug coverage is roughly 50%.)
Defect prevention
Importance of peer design reviews. Code review should be conducted, pref-
erably without the author being present. If the author is present, he or
she cannot speak. The reviewers will try to see if they can understand
the author’s thought process in constructing the software. This can be
an eye-opening educational experience for the software developer.
Need for testing (automated testing, unit testing).
Use of statistical methodology (plan–do–check–act).
Need for software quality programs with clear metrics.
Quality systems.
to graphically identify all the potential causes of a problem and the relationship
with the effect, but it does not illustrate the magnitude of a particular cause’s
effect on the problem.
Pareto diagrams complement cause–effect diagrams by illustrating which
causes have the greatest effect on the problem. This information is then used to
determine where one’s problem-solving efforts should be directed. Table 1 is a
Pareto distribution of software module defect densities or defect types. The de-
fects are ranked from most prevalent to least. Normally, a frequency of occur-
rence, expressed either numerically or in percentage, is listed for each defect to
show which defects are responsible for most of the problems. Table 2 is a Pareto distribution of error categories by frequency of occurrence.
Error category                                Frequency of occurrence    Percent
Incomplete/erroneous specification                      349                 28
Intentional deviation from specification                145                 12
Violation of programming standards                      118                 10
Erroneous data accessing                                120                 10
Erroneous decision logic or sequencing                  139                 12
Erroneous arithmetic computations                       113                  9
Invalid timing                                           44                  4
Improper handling of interrupts                          46                  4
Wrong constants and data values                          41                  3
Inaccurate documentation                                 96                  8
Total                                                  1202                100
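As an illustration of how such a Pareto ranking is produced, the sketch below sorts the error categories of Table 2 by frequency and accumulates the percentages, which is essentially what a Pareto diagram encodes. The counts come from the table; the code itself is only a minimal example.

# Minimal Pareto analysis of the Table 2 error categories: rank by frequency
# and accumulate percentages to see which few categories dominate.
errors = {
    "Incomplete/erroneous specification": 349,
    "Intentional deviation from specification": 145,
    "Erroneous decision logic or sequencing": 139,
    "Erroneous data accessing": 120,
    "Violation of programming standards": 118,
    "Erroneous arithmetic computations": 113,
    "Inaccurate documentation": 96,
    "Improper handling of interrupts": 46,
    "Invalid timing": 44,
    "Wrong constants and data values": 41,
}

total = sum(errors.values())
cumulative = 0
for category, count in sorted(errors.items(), key=lambda kv: kv[1], reverse=True):
    cumulative += count
    print(f"{category:42s} {count:4d}  {100*count/total:5.1f}%  "
          f"cumulative {100*cumulative/total:5.1f}%")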
These tests also provide a means to validate the earlier installation and operational
plans and tests, make early availability measurements, and debug the installation,
operation, and support procedures.
Once the defect types have been established, normalization for program
size is generally required. Defects per 1000 lines of source code is generally
the simplest and most practical measure for most organizations. This measure,
however, requires that the line-of-code definition be established. Here the cumulative number of defects received each month is plotted and used as the basis for corrective action.
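A minimal sketch of this normalization follows, under an assumed line-of-code definition and hypothetical counts.

def defects_per_kloc(defect_count: int, source_lines: int) -> float:
    """Defect density normalized to 1000 lines of source code."""
    return 1000.0 * defect_count / source_lines

# Hypothetical module: 37 defects reported against 18,500 counted source lines.
print(f"{defects_per_kloc(37, 18_500):.2f} defects/KLOC")   # 2.00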
The next issue is determining what specific defects should be measured
and over what period of time they should be measured. This again depends on
the quality program objectives. Quality measures are needed during development,
test, and customer use. The development measures provide a timely indicator of
software code performance; the test measures then provide an early validation;
and the customer-use data complete the quality evaluation. With this full spec-
trum of data it is possible to calibrate the effectiveness of development and test
at finding and fixing defects. This requires long-term product tracking during
customer use and some means to identify each defect with its point of introduc-
tion. Errors can then be separated by release, and those caused by maintenance
activity can be distinguished.
When such long-term tracking is done, it is possible to evaluate many soft-
ware process activities. By tracking the inspection and test history of the complete
product, for example, it is possible to see how effective each of these actions
was at finding and removing the product defects. This evaluation can be espe-
cially relevant at the module level, where it provides an objective way to compare
task effectiveness.
The cost of finding and repairing defects increases exponentially the later
they are found in the process.
Preventing defects is generally less expensive than finding and repairing
them, even early in the process.
Finding and fixing errors accounts for much of the cost of software development
and maintenance. When one includes the costs of inspections, testing, and rework,
as much as half or more of the typical development bill is spent in detecting and
removing errors. What is more, the process of fixing defects is even more error-
prone than original software creation. Thus with a low-quality process, the error
rate spiral will continue to escalate.
Hewlett-Packard found that more than a third of their software errors were
due to poor understanding of interface requirements. By establishing an extensive
prototyping and design review program, the number of defects found after release
was sharply reduced.
A development project at another company used defect prevention methods
to achieve a 50% reduction in defects found during development and a 78%
reduction in errors shipped. This is a factor of 2 improvement in injected errors,
and a 4-to-1 improvement in shipped quality.
Finding and identifying defects is necessary but not sufficient. The most
important reason for instituting defect prevention is to provide a continuing focus
for process improvement. Unless some mechanism drives process change, it will
not happen in an orderly or consistent way. A defect prevention mindset focuses
on those process areas that are the greatest source of trouble, whether methods,
technology, procedures, or training.
The fundamental objective of software defect prevention is to make sure
that errors, once identified and addressed, do not occur again. Defect prevention
cannot be done by one or two people, and it cannot be done sporadically. Every-
one must participate. As with any other skill, it takes time to learn defect preven-
tion well, but if everyone on the project participates, it can transform an organiza-
tion.
Most software developers spend much of their working lives reacting to
defects. They know that each individual defect can be fixed, but that its near twin
will happen again and again and again. To prevent these endless repetitions, we need to understand what causes these errors and take conscious action to prevent them. We must then obtain data on what we do, analyze it, and act on what it tells us. This is called the Deming or Shewhart cycle: plan–do–check–act.
Defect identification and improvement have been discussed, but the real solution
is to learn from the past and apply it to present software development projects
to prevent defects in the first place. The principles of software defect prevention
are
1. The programmers must evaluate their own errors. Not only are they
the best people to do so, but they are most interested and will learn
the most from the process.
2. Feedback is an essential part of defect prevention. People cannot
consistently improve what they are doing if there is not timely rein-
forcement of their actions.
3. There is no single cure-all that will solve all the problems. Improve-
ment of the software process requires that error causes be removed one
at a time. Since there are at least as many error causes as there are
error types, this is clearly a long-term job. The initiation of many small
improvements, however, will generally achieve far more than any one-
shot breakthrough.
4. Process improvement must be an integral part of the process. As the
volume of process change grows, as much effort and discipline should
be invested in defect prevention as is used on defect detection and
repair. This requires that the process is architected and designed, in-
spections and tests are conducted, baselines are established, problem
reports written, and all changes tracked and controlled.
5. Process improvement takes time to learn. When dealing with human
frailties, we must proceed slowly. A focus on process improvement is
healthy, but it must also recognize the programmers’ need for a reason-
ably stable and familiar working environment. This requires a properly
paced and managed program. By maintaining a consistent, long-term
focus on process improvement, disruption can be avoided and steady
progress will likely be achieved.
It helps managers understand where help is needed and how best to provide
the people with the support they require.
It lets the software developers communicate in concise, quantitative terms.
It provides the framework for the software developers to understand their
work performance and to see how to improve it.
While there are many other elements to these maturity level transitions, the pri-
mary objective is to achieve a controlled and measured process as the foundation
for continuing improvement.
The process maturity structure is used in conjunction with an assessment
methodology and a management system to help an organization identify its spe-
cific maturity status and to establish a structure for implementing the priority
improvement actions, respectively. Once its position in this maturity structure is
defined, the organization can concentrate on those items that will help it advance
to the next level. Currently, the majority of organizations assessed with the SEI
methodology are at CMM Level 1, indicating that much work needs to be done
to improve software development.
sionals are unlikely to be fully effective. Small organizations that lack the experi-
ence base to form a process group should address these issues by using specially
formed committees of experienced professionals or by retaining consultants.
The assurance group is focused on enforcing the current process, while the
process group is directed at improving it. In a sense, they are almost opposites:
assurance covers audit and compliance, and the process group deals with support
and change.
Establish a software development process architecture. Also called a devel-
opment life cycle, this describes the technical and management activities required
for proper execution of the development process. This process must be attuned
to the specific needs of the organization, and it will vary depending on the size
and importance of the project as well as the technical nature of the work itself.
The architecture is a structural description of the development cycle specifying
tasks, each of which has a defined set of prerequisites, functional descriptions,
verification procedures, and task completion specifications. The process contin-
ues until each defined task is performed by an individual or single management
unit.
If they are not already in place, a family of software engineering methods
and technologies should be introduced. These include design and code inspec-
tions, formal design methods, library control systems, and comprehensive testing
methods. Prototyping should also be considered, together with the adoption of
modern implementation languages.
At the defined process level, the organization has achieved the foundation
for major and continuing progress. For example, the software teams when faced
with a crisis will likely continue to use the process that has been defined. The
foundation has now been established for examining the process and deciding how
to improve it.
As powerful as the Defined Process is, it is still only qualitative: there are
few data generated to indicate how much is accomplished or how effective the
process is. There is considerable debate about the value of software measurements
and the best ones to use. This uncertainty generally stems from a lack of process
definition and the consequent confusion about the specific items to be measured.
With a defined process, an organization can focus the measurements on specific
tasks. The process architecture is thus an essential prerequisite to effective mea-
surement.
quantify the relative costs and benefits of each major process activity,
such as the cost and yield of error detection and correction methods.
2. Establish a process database and the resources to manage and maintain
it. Cost and yield data should be maintained centrally to guard against
loss, to make it available for all projects, and to facilitate process qual-
ity and productivity analysis.
3. Provide sufficient process resources to gather and maintain this process
database and to advise project members on its use. Assign skilled pro-
fessionals to monitor the quality of the data before entry in the database
and to provide guidance on analysis methods and interpretation.
4. Assess the relative quality of each product and inform management
where quality targets are not being met. An independent quality assur-
ance group should assess the quality actions of each project and track
its progress against its quality plan. When this progress is compared
with the historical experience on similar projects, an informed assess-
ment can generally be made.
In advancing from the initial process through the repeatable and defined
processes to the managed process, software organizations should expect to make
substantial quality improvements. The greatest potential problem with the man-
aged process level is the cost of gathering data. There is an enormous number
of potentially valuable measures of the software process, but such data are expen-
sive to gather and to maintain.
Data gathering should be approached with care, and each piece of data
should be precisely defined in advance. Productivity data are essentially meaning-
less unless explicitly defined. Several examples serve to illustrate this point:
cost per line of code of small modifications is often two to three times that for
new programs. The degree of requirements change can make an enormous differ-
ence, as can the design status of the base program in the case of enhancements.
Process data must not be used to compare projects or individuals. Its pur-
pose is to illuminate the product being developed and to provide an informed
basis for improving the process. When such data are used by management to
evaluate individuals or teams, the reliability of the data itself will deteriorate.
Level 5: The Optimizing Process
The two fundamental requirements for advancing from the managed process to
optimizing process level are
1. Support automatic gathering of process data. All data are subject to
error and omission, some data cannot be gathered by hand, and the
accuracy of manually gathered data is often poor.
2. Use process data both to analyze and to modify the process to prevent
problems and improve efficiency.
Process optimization goes on at all levels of process maturity. However,
with the step from the managed to the optimizing process there is a major change.
Up to this point software development managers have largely focused on their
products and typically gather and analyze only data that directly relate to product
improvement. In the optimizing process, the data are available to tune the process
itself. With a little experience, management will soon see that process optimiza-
tion can produce major quality and productivity benefits.
For example, many types of errors can be identified and fixed far more
economically by design or code inspections than by testing. A typically used rule
of thumb states that it takes one to four working hours to find and fix a bug
through inspections and about 15 to 20 working hours to find and fix a bug in
function or system test. To the extent that organizations find that these numbers
apply to their situations, they should consider placing less reliance on testing as
the primary way to find and fix bugs.
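Taken at face value, the rule of thumb can be turned into a rough effort comparison. The sketch below does so for an assumed defect count and the hour ranges quoted above, purely to show the order-of-magnitude difference.

# Rough effort comparison using the quoted rule of thumb: 1-4 hours per bug
# found/fixed in inspection vs. 15-20 hours per bug in function/system test.
bugs = 100                         # hypothetical number of defects

inspection_hours = (bugs * 1, bugs * 4)
test_hours = (bugs * 15, bugs * 20)

print(f"Inspections: {inspection_hours[0]}-{inspection_hours[1]} working hours")
print(f"Function/system test: {test_hours[0]}-{test_hours[1]} working hours")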
However, some kinds of errors are either uneconomical to detect or almost
impossible to find except by machine. Examples are errors involving spelling
and syntax, interfaces, performance, human factors, and error recovery. It would
be unwise to eliminate testing completely since it provides a useful check against
human frailties.
The data that are available with the optimizing process give a new perspec-
tive on testing. For most projects, a little analysis shows that there are two distinct
activities involved: the removal of defects and the assessment of program quality.
To reduce the cost of removing defects, inspections should be emphasized, to-
gether with any other cost-effective techniques. The role of functional and system
testing should then be changed to one of gathering quality data on the programs.
available to understand the costs and benefits of such work. The optimizing pro-
cess provides the foundation for significant advances in software quality and
simultaneous improvements in productivity.
There are few data on how long it takes for software organizations to ad-
vance through the maturity levels toward the optimizing process. What can be
said is that there is an urgent need for better and more effective software organiza-
tions. To meet this need, software managers and developers must establish the
goal of moving to the optimizing process.
Example of Software Process Assessment
This section is an excerpt from an SEI software process assessment (including
actual company and assessor dialog) that was conducted for a company that de-
velops computer operating system software. The material is presented to facilitate
learning, to identify the pertinent issues in software development up close, to
provide a perspective on how an organization deals with the issues/items as-
sessed, and to see the organization’s views (interpretation) of these items.
Item: Process Focus and Definition.
Software engineering practices are those activities that are essential for
the reliable and timely development of high-quality software. Some examples of
software engineering practices are design documentation, design reviews, and
code inspections. It is possible to develop software without such practices. How-
ever, software engineering practices, when used correctly, are not only effective
in preventing defects in all phases of the software development life cycle, but
also improve an organization’s ability to develop and maintain many large and
complex software products and to deliver products which meet customer require-
ments.
The phases of our software development life cycle are requirements defini-
tion, design, code, unit testing, product quality assurance (QA) testing, and inte-
gration testing. We spend a lot of time on and devote many resources to the back-
end product QA and integration testing trying to verify that products do not have
defects. Much less effort is devoted to ensuring that defects are not introduced
into products in the first place. In other words, we try to test in quality instead
of designing in quality, even though we know that a defect discovered late in
the product development life cycle is much more expensive to correct than one
detected earlier. It would be far more cost effective to avoid defects in the first
place and to detect them as early as possible.
When a defect is found, it means that all of the effort subsequent to the
phase where the defect was introduced must be repeated. If a defect is introduced
in the design phase and is found by integration testing, then fixing the defect
requires redesign, recoding, reinspecting the changed code, rereleasing the fix,
retesting by QA, and retesting by the integration testing staff—all this effort is
collectively known as rework.
The earlier a defect is introduced and the farther the product progresses
through the life cycle before the defect is found, the more work is done by various
groups. Industry studies have shown that 83% of defects are introduced into prod-
ucts before coding begins; there is no reason to believe that we are significantly
better than the industry average in this respect. Anything we do to prevent or
detect defects before writing code will have a significant payoff.
Rework may be necessary, but it adds no value to the product. Rework is
also very expensive; it costs us about $4 per share in earnings. Ideally, if we
produced defect-free software, we would not have to pay for any rework.
It is difficult to hire and train people. People are not eager to have careers
in software maintenance, yet that is what we do most of the time. We cannot have
ments and to recycle the product through the development process. This is one
reason why our products take a long time to reach the marketplace.
An industry-conducted study shows that 56% of bugs are introduced be-
cause of bad requirements; as stated earlier, bugs introduced early in the life
cycle cause a lot of expensive rework. We have no reason to believe that we do
a better job than the rest of the industry in managing requirements.
Poorly understood and constantly changing requirements mean that our
plans are always changing. Development has to change its project plans and resources; sales volume predictions have to change; and staffing levels for all the supporting organizations, such as education, documentation, logistics, and field engineering, go through constant change. All of these organizations rely on clear requirements and a product that meets those requirements in order to be able to respond in a timely and profitable manner.
Item: Cross-Group Coordination and Teamwork.
The improvement areas that we have targeted are not all CMM Level 2
activities. Rather, these areas represent the most significant problem areas in our
company and are crucial to resolve. We will, therefore, continue to use the SEI
assessment and the CMM to guide us and not necessarily follow the model in
sequence. Resolving the problem areas will certainly help us achieve at least
Level 2 capability.
The assessment results offer no surprises nor do they offer a silver bullet.
The assessment was a first step in identifying the highest priority areas for im-
provement. Knowing these priorities will help us target our efforts correctly—
to identify and implement solutions in these high-priority areas. The assessment
also generated enthusiasm and a high level of participation all across the com-
pany, which is encouraging and makes us believe that as an organization we want
to improve our software development processes.
The SEI assessment was a collaborative effort between all organizations.
Future success of this project, and any process improvement effort, will depend
on sustaining this collaboration. Some development groups are using excellent
software development practices and we want to leverage their work. They can
help propagate the practices that work for them throughout the company by par-
ticipating in all phases of this project.
The SEI assessment evaluates processes, not products. The findings, there-
fore, are relevant to the processes used to develop products. As a company we
produce successful products with high quality. However, we have a tremendous
cost associated with producing high-quality products. The assessment and follow-
up activities will help us improve our processes. This in turn will have a signifi-
cant impact on cost and productivity, leading to higher profits for us and higher
reliability and lower cost of ownership for our customers.
REFERENCE
1. Humphrey WS. Managing the Software Process. Addison-Wesley Publishing, 1989.
Fixed, metal film, RNR styles. Power derated to 60%. Slight inductive reactance at high frequency. Hot spot temperature should not be more than 60% of maximum specification.

Fixed, film-insulated, RLR styles. Power derated to 60%. Hot spot temperature should not be more than 60% of maximum specification.

Fixed, precision wirewound, RBR styles. Power derated to 50%. Inductive effect must be considered. High resistance values use small-diameter windings (0.0007), which are failure prone in high humidity and high power environments. Hot spot temperature should be not more than 50% of maximum specification.

Fixed, power wirewound (axial lead), RWR styles. Power derated to 50%. Inductive effects must be considered. Noninductively wound resistors are available. Resistance wire diameters are limited to 1 mm, with some exceptions for high values. Do not locate near temperature-sensitive devices. Do not allow hot spot temperature to exceed 50% of maximum specification.

Fixed, power wirewound (chassis mounted), RER styles. Power derated to 50%. Inductive effects must be considered. Derate further when operated at temperatures exceeding 25°C. Use appropriate heat sinks for these resistors. Do not locate near temperature-sensitive devices. Do not allow hot spot temperature to exceed 50% of maximum specification.

Variable wirewound (trimming, RTR styles). Power derated to 50%. Inductive and capacitive effects must be considered. Current should be limited to 70% of rated value to minimize contact resistance.

Variable nonwirewound (trimming). Power derated to 50%. Voltage resolution is finer than in wirewound types. Low noise. Current should be limited to 70% of rated value to minimize contact resistance.
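One simple way to apply such a derating table in a design review is to check each applied stress against the derated limit. The following is a minimal sketch using the 60% power derating factor from the metal film entry above; the part values are hypothetical.

def power_derating_ok(applied_watts: float, rated_watts: float,
                      derating_factor: float = 0.60) -> bool:
    """True if the applied power is within the derated fraction of the rating."""
    return applied_watts <= derating_factor * rated_watts

# Hypothetical 0.25-W metal film resistor dissipating 0.18 W: exceeds the 60%
# derating limit of 0.15 W, so the check fails.
print(power_derating_ok(0.18, 0.25))   # False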
Capacitor Derating
words, the design must tolerate some changes in the capacitance and
other device parameters in order to assure long life reliability.
5. Permissible ripple voltage. Alternating current (or ripple) voltages
create heat in electrolytic capacitors in proportion to the dissipation
factor of the devices. Maximum ripple voltage is a function of the
frequency of the applied voltage, the ambient operating temperature,
and the capacitance value of the part.
6. Surge current limitation. A common failure mode of solid tantalum
electrolytic capacitors is the catastrophic short. The probability of this
failure mode occurring is increased as the “circuit” surge current capa-
bility and the circuit operating voltage are increased. When sufficient
circuit impedance is provided, the current is limited to a level that
will permit the dielectric to heal itself. The designer should provide a
minimum of 3 ohms of impedance for each volt of potential (3Ω/V)
applied to the capacitor (a short calculation sketch follows this list).
7. Peak reverse voltage. The peak reverse AC voltage applied to solid
tantalum electrolytic capacitors should not exceed 3% of the forward
voltage rating. Nonsolid tantalum electrolytic capacitors must never
be exposed to reverse voltages of any magnitude. A diode will not
protect them. Reverse voltages cause degradation of the dielectric
system.
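Referring back to the 3 ohms-per-volt surge-limiting guideline in item 6, the sketch below turns it into a minimum series impedance calculation; the 12-V operating point is a hypothetical example.

def min_series_impedance_ohms(applied_volts: float, ohms_per_volt: float = 3.0) -> float:
    """Minimum circuit impedance per the 3-ohms-per-volt surge current guideline."""
    return ohms_per_volt * applied_volts

# A solid tantalum capacitor operated at 12 V would call for at least 36 ohms
# of source impedance to limit surge current.
print(min_series_impedance_ohms(12.0))   # 36.0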
125°C 95°C
150°C 105°C
175°C 115°C
200°C 125°C
Microcircuit Derating
125°C 95°C
150°C 105°C
175°C 115°C
200°C 125°C
Tj = Ta + θja Pd
Tj = Tc + θjc Pd
where
Tj = junction temperature, °C
Ta = ambient temperature, °C
Tc = case temperature, °C
θja = thermal resistance, junction to ambient air, °C/W
θjc = thermal resistance, junction to case, °C/W
Pd = power dissipation, W
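For reference, the sketch below evaluates these relations for assumed values and compares the result against an example derated junction temperature limit; all numbers are illustrative.

def junction_temp(ambient_or_case_c: float, theta_c_per_w: float, power_w: float) -> float:
    """Tj = T(a or c) + theta * Pd, in degrees C."""
    return ambient_or_case_c + theta_c_per_w * power_w

# Hypothetical part: 50 C ambient, theta_ja = 45 C/W, dissipating 1.2 W.
tj = junction_temp(50.0, 45.0, 1.2)
derated_tj_limit = 105.0            # example derated maximum junction temperature
print(f"Tj = {tj:.1f} C, within derated limit: {tj <= derated_tj_limit}")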
Connector Derating
sion through the contacts. High temperatures will accelerate aging, re-
sulting in increased resistance of the contacts and drastically shortening
life.
2. Failure modes and mechanisms. Friction and wear are the main fail-
ure mechanisms that will cause deterioration of connector reliability.
Galvanic corrosion and fretting corrosion initiated by slip on connector
contact roughness will lead to a direct attack on the connector contact
interfaces by oxygen, water, and contaminants. The slip action, which
can be produced by vibrations or cyclic heating and cooling, causes an
acceleration in the formation of oxide films. Lubricants used to reduce
friction and wear play an active part in the failure mechanisms and
add to the complexities of potential film formations.
3. Contact materials. Gold contact material is recommended for con-
nector reliability. Gold–tin mating is not recommended because of
long-term reliability problems. The gold–tin bimetallic junction be-
comes subject to galvanic corrosion which leads to high contact resis-
tance and eventual failure. Tin-plated contacts must not be used to
make or break current; arcing quickly destroys the tin plating.
4. Parallel pins. When pins are connected in parallel to increase the
current capacity, allow for at least a 25% surplus of pins over that
required to meet the 50% derating for each pin, assuming equal current
in each. The currents will not divide equally due to differences in con-
tact resistance.
5. IC sockets. Integrated circuit sockets should not be used unless abso-
lutely necessary. In most cases, the failure rate of the socket exceeds
that of the IC plugged into it.
Switch Derating
2. Peak in-rush current. Peak in-rush current should not exceed 50%
of the maximum surge current rating.
3. Arc suppression. Provide for arc suppression of all switched inductive loads.
4. Contacts. Operate contacts in parallel for redundancy only and never
to “increase the current rating.” Do not operate contacts in series to
“increase voltage rating.”
5. Switch/relay mounting. Proper support should be used to prevent any
deflection of the switch due to shock, vibration, or acceleration. The
direction of motion of the contacts should not be coincident with the
expected direction of shock.
6. Relay coils. Coils are designed for specific voltage ratings to ensure
reliable contact operation. The applied coil voltage should always be
within ±10% of the nominal rating (see the sketch following this list).
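As a simple illustration of the in-rush limit in item 2 and the coil voltage tolerance in item 6, the Python sketch below applies both checks to hypothetical values.

    # Checks for the switch/relay rules above. All values are hypothetical.

    def inrush_ok(peak_inrush_a: float, max_surge_rating_a: float) -> bool:
        """Peak in-rush current should not exceed 50% of the surge rating."""
        return peak_inrush_a <= 0.5 * max_surge_rating_a

    def coil_voltage_ok(applied_v: float, nominal_v: float) -> bool:
        """Applied coil voltage should stay within +/-10% of the nominal rating."""
        return abs(applied_v - nominal_v) <= 0.10 * nominal_v

    if __name__ == "__main__":
        print(inrush_ok(peak_inrush_a=4.0, max_surge_rating_a=10.0))   # True
        print(coil_voltage_ok(applied_v=12.8, nominal_v=12.0))         # True (within +/-10%)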
Fuse Derating
Current Rating
Most fuses are rated so that at 110% of rated current the temperature rise (above
room temperature, 21°C or 70°F) at the hottest point of the fuse will not exceed
70°C. Under other conditions the rating must be adjusted up or down (uprating
or derating) as indicated on the applicable fuse rerating graph.
Voltage Rating
Circuit voltage should not exceed fuse rating.
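The Python sketch below illustrates these two fuse rules with hypothetical values: it checks the circuit voltage against the fuse voltage rating and applies an ambient-temperature rerating factor to the nominal current rating. The 0.8 factor used here is an assumed placeholder; the actual factor must be taken from the applicable rerating graph.

    # Illustrative fuse-selection checks. The rerating factor is an assumed
    # placeholder; the real value comes from the applicable rerating graph.

    def voltage_ok(circuit_voltage_v: float, fuse_voltage_rating_v: float) -> bool:
        """Circuit voltage should not exceed the fuse voltage rating."""
        return circuit_voltage_v <= fuse_voltage_rating_v

    def effective_current_rating_a(nominal_rating_a: float, rerating_factor: float) -> float:
        """Adjust the nominal current rating for the operating conditions."""
        return nominal_rating_a * rerating_factor

    if __name__ == "__main__":
        print(voltage_ok(circuit_voltage_v=24.0, fuse_voltage_rating_v=32.0))         # True
        print(effective_current_rating_a(nominal_rating_a=5.0, rerating_factor=0.8))  # 4.0 A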
Parts FMEA, Project: 1 Memory Module, Date: 12/9/02. For each failure cause, the worksheet's three numeric ratings are shown in parentheses, followed by the risk priority number (RPN); current controls and recommended actions follow.

1. PCB/PWA; function: electric connection. Failure modes: open/short; resistive. Effects: data corruption; function failure; none or intermittent.
   Causes (ratings; RPN): solder short (1, 1, 7; 7); crack (2, 7, 7; 98); scratch (1, 1, 7; 7); contamination (1, 10, 7; 70); void (1, 10, 7; 70); overstress, tantalum (3, 5, 7; 105); overetch (3, 10, 1; 30); bad rework (1, 3, 7; 21); overcurrent (1, 1, 7; 7).
   Current controls: net list test; supplier feedback; rating program; document rework procedures; certified operator.
   Recommended action: review tooling and thermal profile.

1. PCB/PWA; function: carrying components/mechanical substrate. Failure mode: parts loose or detached.
   Causes (ratings; RPN): vibe-flex (3, 7, 7; 147); wrong profile (1, 3, 7; 21); wrong paste (3, 3, 1; 9); plugged screen (3, 3, 7; 63).
   Current controls: preventive maintenance; scheduled design/analysis.
   Recommended action: Austin, TX, to review system packaging, specifically that the module cover bracket and chassis tolerances are effective to control vibration.

Connector system. Failure mode: intermittent open/short.
   Causes (ratings; RPN): fretting (3, 10, 1; 30); wrong thickness, slot width/location (1, 1, 7; 7); oxidation (1, 10, 1; 10); contamination (3, 10, 5; 150); poor plating (3, 7, 5; 150).
   Current controls: lubrication; board cover.
   Recommended actions: Tandem to review Micron's workmanship standards; review ACI process Cpk for thickness and purity.

Present components to air stream/cool system. Failure mode: burn-out.
   Cause (ratings; RPN): airflow shadow (1, 3, 5; 15).

Entrap EMI/RFI, carry label for module.
   Cause (ratings; RPN): airflow shadow (3, 5, 7; 7).
   Current control: DVT measurement.
   Recommended action: correct design errors if found.

8. Stencil printing; function: apply solder paste. Effect: functional failure.
   No paste: misregistration (1, 5, 7; 35); control: visual specification.
   Excessive paste: misregistration (1, 1, 1; 1); control: visual inspect, 1sm, etc.
   Wrong paste: operator error (3, 1, 3; 9); control: process documentation.
   Low-volume solder joint: insufficient paste (3, 7, 7; 147); control: 1sm, visual inspection.
   Recommended actions: continuous improvement of paste quality; Micron will supply qualification specification and report; review stencil printing process specification.

9. Placement; function: place components.
   Parts in wrong place (effect: no function): part loaded wrong (1, 1, 7; 7), feeder load list, visual; program error (1, 1, 7; 7), documentation and revision control.
   Noncoplanar leads (effect: erratic function): defective component (3, 7, 7; 147), visual, ICT. Actions: insist that parts are on tape and reel; PM of equipment; Tandem purchasing to give better forecasting.
   No part placed: pass-through mode (1, 1, 7; 7), visual.
   Part orientation: programming error (3, 7, 7; 147), visual, documentation; t/r wrong (1, 7, 7; 49), visual, feeder PM. Action: Tandem to give feedback on 10 prototypes to Micron.
   Damaged parts: feeder malfunction (1, 3, 7; 21); defective components (3, 7, 7; 147), visual, ICT. Actions: supplier component qualification; return all components to the supplier for FA and RCA; review Micron MRB F/A specification.

10. Cleaning; function: remove flux residue. Failure mode: ionic contamination (dendrite growth, corrosion); effect: long-term reliability.
   Ratings; RPN: 1, 10, 7; 70. Current controls: process controls; flux quality.

17. Test; function: verify functionality. Failure mode: bad data; effect: functional fail.
   Ineffective test codes (1, 5, 7; 35): incoming specifications; board grader (coverage report of nodes); design for test; verification checklist.
   Defective hardware (3, 5, 7; 105): verify test with known good product (if available); verify with grounded probe; PM schedule. Action: Micron to review and establish a regular PM schedule on test hardware.
   Inadequate requirements (1, 1, 7; 7): standard program in place.
   Operator error (1, 7, 7; 49): operator training.
   Function: verify timing. Failure mode: functional failure; cause: no coverage at present (1, 7, 7; 49); control: DRAM supplier burn-in.

18. Driver (16244); function: ADDR buffer.
   Wrong data written/read (effects: parity error; ADDR or data unavailable). Causes and controls: address aliasing (pattern test); ESD (design); latch-up (design); slow speed (test); ground bounce (design/test); timing drift (process/design).
   Tristate for test: part will not tristate; causes: defective driver, missing pull-down, defective pull-down; effect: interference during memory test; control: process/design.
   Part stuck in tristate: causes: open enable pin, solder joint; effect: won't write system data; control: process/design.
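The worksheet above is driven by the risk priority number, the product of the three FMEA ratings (severity, occurrence, and detection). The Python sketch below, using ratings taken from a few of the rows above, shows how line items can be ranked by RPN to decide where corrective action is most needed; the class and field names are illustrative only.

    # Rank FMEA line items by risk priority number.
    # RPN = product of the three 1-10 FMEA ratings (severity, occurrence, detection).
    # The ratings below are taken from a few rows of the worksheet above.

    from dataclasses import dataclass

    @dataclass
    class FmeaLine:
        item: str
        cause: str
        rating_a: int   # the worksheet's three ratings, in the order listed above
        rating_b: int
        rating_c: int

        @property
        def rpn(self) -> int:
            return self.rating_a * self.rating_b * self.rating_c

    lines = [
        FmeaLine("PCB/PWA", "Solder short", 1, 1, 7),
        FmeaLine("PCB/PWA", "Overstress (tantalum)", 3, 5, 7),
        FmeaLine("Stencil printing", "Insufficient paste", 3, 7, 7),
        FmeaLine("Connector system", "Fretting", 3, 10, 1),
    ]

    for line in sorted(lines, key=lambda x: x.rpn, reverse=True):
        print(f"RPN {line.rpn:4d}  {line.item}: {line.cause}")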
Memory module action items list resulting from the FMEA (action item assignees' names suppressed):
1. PCB/PWA: Review tooling and thermal profile.
2. PCB/PWA: Review system packaging. Specifically, that the module cover bracket and chassis tolerances are effective to control vibration.
3. PCB/PWA: Review supplier’s workmanship standard. Review ACI process Cpk for thickness and purity.
4. Reflow: Review selection and qualification process of paste suppliers (data and qual report).
5. Stencil printing: Continuous improvement of paste quality. Supplier will provide qualification specification and report. Review stencil printing process
specification.
6. Placement: Insist that parts are on tape and reel.
7. Visual: OEM to review CAD library for polarized parts. Provide mechanical sample, if possible.
8. Trace cuts: There will be no cuts; a CN is in process.
9. Test: Supplier to review and establish regular preventive maintenance schedule.
CA, California; CN, change notice; FA, failure analysis; FRU, field replaceable unit; PM, preventive maintenance; RCA, root cause analysis.