Dependability Ofmvandhv Protection Devices: Marielle Lemaire
Dependability Ofmvandhv Protection Devices: Marielle Lemaire
contents
1. Introduction Purpose of the document Protection devices Dependability requirements: a compromise between two undesirable events The terms used The reliability engineer's tools Dependability resources Software quality Qualification of protection devices Quality control p. 4 p. 4 p. 4
3. Dependability as a part of a global quality approach 4. Analysis of experience feedback 5. Conclusion 6. Appendix 7. Bibliography
p. 6 p. 6 p. 8 p. 11 p. 11 p. 13 p. 14 p. 14 p. 15 p. 16
1. introduction
Protection devices are produced either using electromechanical technology (the oldest) or analog or digital electronic technology (known as static). A digital protection device (microprocessor based) can perform, in addition to its main protection role, automation, measuring, self-monitoring and communication functions. This device then forms a natural part of control and monitoring systems performing automation, status logging and mimic diagram functions (see fig. 2).
Continuity of energy supply is vital both for companies and electricity distributors. Untimely tripping of the protection device can cause considerable financial losses
protection devices
The main functions of a protection device are to detect network faults by monitoring various parameters (current, voltage....) and to transmit a circuit-breaker opening order should an abnormal situation be observed. A protection device generally protects one of the various components of an electrical distribution substation, such as an incomer, a line feeder, a motor or a transformer. In Medium and High Voltage, these devices are often incorporated in the equipment containing the circuit-breaker (see figure 1). Environmental constraints are then severe (temperature, vibration, electromagnetic disturbances).
G
I on O off trip I on O off trip A V/Hz W/ Wh clear alarm reset A V/Hz W/ Wh clear alarm reset
MERLIN GERIN
MERLIN GERIN
I on
O off
trip
I on
O off
trip
I on
O off
trip
I on
O off
trip
V/Hz
W/
Wh
clear
alarm
reset
V/Hz
W/
Wh
clear
alarm
reset
V/Hz
W/
Wh
clear
alarm
reset
V/Hz
W/
Wh
clear
alarm
reset
MERLIN GERIN
MERLIN GERIN
MERLIN GERIN
MERLIN GERIN
(production shutdown, cost of non-distributed energy...). This event can be avoided by increasing safety of the protection device. Availability and safety are often oposite. The best way for a plane not to crash is for it to stay on the ground. Its safety is then absolute, but its availability zero! Conversely, a plane which is in the air too often, without maintenance, places peoples life in danger. The design of any device calls for a compromise between availability and safety. Availability and safety are increased by using the other two components of dependability: maintenability and reliability (see fig. 3). Protection devices are subjected to numerous aggressive factors which affect the undesirable events, e.g.:
s extreme temperatures, s vibrations due to circuit breaker operations, s corrosive atmospheres in industrial applications (chemistry, paper mills, cement plants...), s intense electromagnetic pulse fields (up to several dozen kV/m 1 metre from a HV or MV circuit breaker with rise times of the order of 5 ns). This extremely severe environment and the fact that MV and HV networks supply many electrical power users make controlled, optimised reliability and maintenability an absolute necessity. Protection devices using microprocessors have enabled considerable headway to be made. For example:
safety 100 %
(1)
100 % availability
fig. 3: increasing reliability and maintenability (1) increases availability and safety.
s integration reduces wiring problems, thus increasing reliability, s self-monitoring increases availability.
Availability is the ratio between the time spent in the operating status and the total reference time. Readers interested in quantification of dependability values can refer both to the appendix and Cahier Technique n 144. To return to figure 3, one of the objectives of the protection device designer is to treat preventively as many failures as possible (maintenability) to increase availability. As few events as possible should result in deterioration of protection device safety (the self-monitoring concept and resources will be described in the following sections). As the networks to be protected are MV and HV ones, their dependability must be far higher than that of most LV equipment. A Preliminary Risk Analysis is used to determine the undesirable events linked to the functions performed by the protection device (see fig. 5). A team of specialists independent from the design team, carries out estimated
dependability studies and proposes technical solutions compatible with the specified level. An iterative approach enables design to be modified until objectives are achieved.
operating protection
RELIABILITY MAINTENABILITY
(serious failure) - -
(repair) --
failure protection
SAFETY
network fault
untimely tripping
failure to trip
AVAILABILITY !
SAFETY !
fig. 4: status graph for the protection device and consequences on electrical distribution.
effects s untimely opening of circuit-breaker s power unavailability causing severe financial losses (production shutdown...)
s tripping an upstream protection level with possible local destruction of equipment s major destruction of equipment (fire...) if there is no upstream protection
causes s internal, for example: s untimely detection of a fault, s untimely activation of the control mechanism s external, for example: s electromagnetic disturbances s sensor saturation s error in protection plan design s internal, for example: s failure to detect a fault; s blocked control mechanism s external, for example: s electromagnetic disturbances s sensor saturation s loss of auxiliary supply s circuit-breaker tripping circuit open s error in protection plan design
prevention for example: s self-monitoring functions s fall-back position s electromagnetic compatibility s non-magnetic sensors
for example: s self-monitoring functions s electromagnetic compatibility s non-magnetic sensors s standby module s supervision of tripping circuit s logic discrimination
Microcircuits, gate/logic arrays and microprocessors Description 1. bipolar devices, digital and linear gate/logic arrays 2. MOS devices, digital and linear gate/logic arrays 3. microprocessors p = (C1 . pT + C2 . pE) pQ . pL failures/106 hours bipolar digital and linear gate/logic array die complexity failure rate - C1 digital no. gates 1 to 100 101 to 1,000 1,001 to 3,000 3,001 to 10,000 10,001 to 30,000 30,001 to 60,000 C1 .0025 .0050 .010 .020 .040 .080 linear no. transistors 1 to 100 101 to 300 301 to 1,000 1,001 to 10,000 C1 .010 .020 .040 .060 prog. logic array no. gates up to 200 201 to 1,000 1,001 to 5,000 C1 .010 .021 .042
MOS digital and linear gate/logic array die complexity failure rate - C1 digital no. gates 1 to 100 101 to 1,000 1,001 to 3,000 3,001 to 10,000 10,001 to 30,000 30,001 to 60,000 C1 .010 .020 .040 .080 .16 .29 linear no. transistor 1 to 100 101 to 300 301 to 1,000 1,001 to 10,000 C1 .010 .020 .040 .060 floating gate prog. logic array no. cells, C C1 up to 16 K .00085 16 K < C i 64 K .0017 64 K < C i 256 K .0034 256 K < C i 1M .0068
microprocessor die complexity failure rate - C1 no. bits bipolar MOS C1 C1 up to 8 .060 .14 up to 16 .12 .28 up to 32 .24 .56
all other model parameters parameter section pT 5.8 C2 5.9 pE, pQ, pL 5.10
s the Failure Modes, their Effects and their Criticity Analysis (FMECA) conducted both on hardware and software, evaluates the effects of each known failure mode on equipment operation. FMECA is used to correct certain malfunctioning risks and to specify self-monitoring functions. It can be performed at general function level (the protection function), at elementary function level (overcurrent protection function), at one of its subfunctions (see fig. 7) up to the lowest level of the basic components (implanted on the electronic boards). s the undesirable events concerning protection devices are modelled using a number of techniques: s failure trees describe all the possible causes of a particular event to be avoided (see fig. 8). The failure tree is a boolean representation used to determine the most critical paths to produce the event. s Markov graphs are a behavioural representation showing operating status, downgraded operation and equipment failure. Transitions between status are quantified by failure and repair () rates. These graphs are used to calculate the likelihoods of occupying failure status (see fig. 9). s Petri nets have the same purpose as Markov graphs, i.e. modelling system status. They enable processing of more complex systems whose transitions between status do not necessarily obey exponential distribution (e.g. Weibulls distribution) (see fig. 10)
These modelling processes enable quantified simulation of operational dependability, thus obtaining likelihoods for reliability, maintenability, availability and safety of protection devices. Readers can find a more detailed description of these various techniques in the references [Villemeur] or [RGE] or [Pages-Gondran].
dependability resources
Reliability, safety and maintenability of protection devices must be controlled to guarantee optimum dependability of electrical installations. As the objectives for these values are fixed, the protection device designer, assisted by the reliability engineer, uses a certain number of resources to achieve them: s thanks to the reliability engineer and his tools, he controls intrinsic reliability before and during development; s thanks to self-monitoring, failure signalling and communication resources, he can: s increase dependability by placing in the fall-back position, s increase maintenability and availability of the protection device. Let us now look at the resources implemented: s self-monitoring Efficiency and relevance of self-monitoring are vital for dependability of the protection device. Below are examples of some resources enabling availability and safety to be increased:
s a check on integrity of information contained in the program and constant data memory boxes must be performed on energising and then cyclically during operation. This check is made by calculating the Checksum with carrying over on the memory zones used. The checksum with carry over covers 99.95 % for 128 bytes (99.998 % for 128 Kbytes) for pasting of address and of memory bits. For volumes of information to be checked exceeding a hundred bytes, calculation of the Checksum with carry over is more efficient than calculation of a CRC 16 for example (see reference [INRS]). s a hardware and software watchdog must be fitted to detect blocking of the CPU (due to a component defect, interference or microprocessor overload). The validity of the watchdog output signal must also be checked. The watchdog must cover failure of the microprocessor quartz and oscillator (see fig. 11). s program cycle time must be controlled. If interruptions are used to sequence cycles, it must be checked that these mechanisms are operating correctly. s a check on supply voltage must be continuously performed to anticipate possible voltage drops and stop the microprocessor properly (saving parameters). s if EEPROM memories are used, use of this component must be monitored by counting the number of writes which must not exceed 10,000. s false digital data must not be processed further to a failure in the analog to digital conversion string.
failure mode false measured current: continuous level > tripping threshold
detection resources the algorithm used works on calculation of current module at 50 HZ detection by periodical calculation of signal dc component detection resources are the same
An efficient check consists in continuously verifying two reference signals at the input of the multiplexer at two complementary addresses (100 % of failures of the Analog to Digital Converter and 100 % of sticking at 1 or 0 of the Multiplexer selection bits are thus detected). Many other detection devices are used, which are obviously very dependent on the technology used. s the reliable fall-back position The self-monitoring functions detect as many major failures as possible. A failure is said to be major if there is a risk of it causing incorrect operation of the protection device.
unavailability of sensors
operation
failure
fig. 9: example of a Markov graph for a system consisting of two redundant, repairable components. If they are two electronic components (exponential reliability), the mean proper operating time after repair is MUT =
1 2
The Petri net represented has two places (P1, P2), two transitions (T1, T2) and four arcs. This net represents the behaviour of a repairable component, by assigning for example the following T1 meanings to the places and transitions: P1: the component is in proper operating condition. P2: the component is not working. T1: the component has failed. T2: the component has just been repaired.
P1
P1
T2
T1
T2
P2
P2
To check data integrity A number of techniques can be used s parity check This consists in systematically making the number of bits transmitted even by completing the useful message by a parity bit. The receiver can thus check the message if there is an error on a bit or 3 bits. Alteration of an even number of bits cannot be detected. s the CRC (Cyclic Redundancy Check) consists in adding to the useful information the rest of its division by a polynomial standardised by the CCIT. For example, the degree 16 dividing polynomial (X16 + X15 + X2 + 1=1100 0000 0000 0011) used for the CRC 16 can detect 16 simultaneous errors. s the Checksum consists in performing the binary sum of bytes and in adding the result (truncated on one or more bytes) to the useful message. The Checksum can be associated for example with the parity byte check.... Checking message integrity by the receiver is easier than for the CRC and can be more efficient. To check proper running of a program Often used in automation systems, the Watchdog technique consists in periodically running a test instruction. Non-running of this instruction, within a given time, reveals a failure and causes an alarm and an equipment protection device to trip.
fig. 11: self-monitoring reduces protection device unavailability time.
fig. 10: example of a Petri net for a system consisting of two redundant, repairable elements.
This type of failure must not degenerate into untimely tripping. The protection device places itself in a reliable, predetermined fall-back position to prevent passage of random orders. The operator is informed of this fallback position and can immediately perform maintenance to restore the availability of the protection device. At the same time, a minor failure, for example a peripheral failure (display or setting console) is signalled, but does not affect availability of the protection device. s failure signalling The self-monitoring functions must provide suitable diagnostic resources to enable a prompt resumption of operating status of the faulty protection device, i.e.: s provide the operator with external, clear and global information on the status of his protection device, s provide the manufacturer, during a maintenance operation, or even after return to the works of the faulty protection device, with internal, clear and precise information on the status of the protection device. For example, failure of the protection device may be signalled by: s a front panel indicator light, s a WatchDog relay output, s a message on the front panel display, s internally saved information detailing failure origin, s a message via the communication system when the protection device is part of a control and monitoring system. This is a considerable advantage over older protection devices which could remain in a state of failure for long periods of time without the operator being aware of this (see fig. 12) and which thus provided no information on the origin of the failure. s a tendency to adopt supervision and control and monitoring systems As stated earlier, digital protection can incorporate automation and communication functions. It thus becomes one of the links in the supervision and control and monitoring system of the electrical installation, thus simplifying operation by enabling supervision, operation and management of the distribution network:
s supervision of status and electrical quantities (measurements), s supervision of devices (switchgear position, temperature, pressure,....) s processing of alarms, s remote control of switching devices, s automatic reconfiguration of networks after fault, s management of consumed energy as a function of distributors tarifs, s editing operating reports, s allocating energy costs to the various site consumers. s ease of maintenance s self-monitoring, signalling, communication facilitate knowledge of failure status, thus allowing immediate maintenance action, s self-diagnostics enable the troubleshooter to know the origin of the failure, thus resulting in rapid troubleshooting, s the programmed functions, customising the protection device in terms of applications/functions performed, are stored in a detachable cartridge. This enables immediate resumption of operation after replacement of the physical part (hard) which is standardised. s special cases Reliability of the protection device may not be sufficient if it is subject to exceptionally aggressive factors or if
the availability and safety needs of electrical distribution are exceptionally high: s severe environment Protection systems are sometimes installed in exceptional environments which exceed specified constraints for equipment: - temperature, - vibration... In each case, needs must be specially identified by the engineering and design department. A customised solution is then proposed: - special varnish on electronic boards, - specific maintenance contract, s an exceptional dependability need A standby module can provide protection in the event of: - a supply fault, - a wiring fault, - a trip release fault, - main protection device not working. Another solution is to backup the protection device with an or circuit in the breaking device control circuit. Installation safety is considerably increased, and electrical power availability is not reduced when protection systems with reliable fallback position are used. As an extreme solution, 2/3 vote systems can be considered.
time
time
no periodical maintenance
software quality
A large part of digital protection device functions are performed by the software. Control of software quality is thus crucial to achieve global dependability objectives. Software quality is controlled by using a rigorous development method. This method, resulting from the recommendations laid down by French (AFCIQ) and international (IEEE) organisations, stipulates: s breakdown of development into a series of phases (see fig.13): s specification, s preliminary design, s detailed design, s coding, s unit tests, s integration and integration tests, s validation. Each phase has a set of documents used and produced during the phase. These documents formalise the studies conducted in each phase and must be validated before moving on to the next phase. s use of design and coding methods and rules aiming at obtaining a high software structuration level (e.g. SADT implemented in the ASA or MACH tool). s use of software configuration management tools enabling management of all software components and in particular control of the respective evolutions and versions of all these components (e.g. CMS tool). Moreover, code reviewing methods are used to great advantage. A reviewer critically reads the code and makes his observations. This manual analysis is still one of the most efficient methods for discovering software errors (bugs). Finally, once each software has been integrated and validated, a final qualification phase conducted by a team other than the development team ensures a last efficient check.
equipment in current pulses of a few dozen amps and a very steep front of the order of a nanosecond, s radioelectric transmitters (e.g. walkie-talkies) generate fields of several dozen V/m 1 metre away. Readers wishing to learn more about Electromagnetic Compatibility (EMC) can consult Cahier Technique n 149. Internal electrical stress withstand standards define the immunity levels required for operation of protection systems in electrical substations. These levels correspond to the withstands defined by IEC 255 standards or are even more severe. Compliance with the severity level defined is checked by tests. Four types of tests are performed: s damped oscillatory wave (IEC 255-22-1) severity: class III, 2.5 kV, s rapid transients (IEC 255-22-4) severity: class IV, 4 kV, s electrostatic discharge (IEC 255-22-2)severity: class III, 8 kV,
software specification
software validation
preliminary design
software integration
detailed design
test plan
unit tests
coding
s radiated fields (IEC -255-22-3) severity: greater than class III, 30 V/m (see fig. 14) NB: the rapid transient test is the transciption in conducted mode of radiated electromagnetic pulse fields, generated during switchgear operations. In addition to EMC tests, protection devices undergo real-life situation tests. For example, after placing the device in the Low Voltage compartment of a Medium and High Voltage cubicle, roughly one hundred circuit-breaker opening and closing operations, on a load imposing arc breaking under a small inductive current (switching operation overvoltages due to current pinching) were performed. During these tests, untimely tripping of the protection device must not occur. s the Kirchhoff laboratory: protection device testing
The functions performed by protection systems are complex. Proper operation of protection devices must be guaranteed for all the phenomena which can occur on electrical networks. An efficient laboratory for performing tests on protection devices is essential (see fig. 15). The Kirchhoff laboratory enables real life reproduction of phenomena such as they occur on electrical networks (see fig. 16). This laboratory is equipped with a digital simulator used to: s calculate currents and voltages on the network, when a short-circuit, insulation failure or device switching operation occurs,
s generate the corresponding signals and apply them to the protection device to be tested. An analysis is then made of the behaviour of the protection devices subjected to conditions identical to those that they will encounter on the real network. Digital simulation of electrical networks in the Kirchhoff laboratory uses two softwares: s EMTP (ElectroMagnetic Transient Program), a program for calculating transient phenomena. This international software enables, from an equipment library (transformers, lines, machines,...) modelling of all kinds of electrical networks, simulation of
work station
Computer
conversion of computed waveforms into analog real time signals simulation software MORGAT, EMTP
digital/analog converter
fig. 14: electromagnetic disturbance tests in anechoic chamber.
current and voltage adapted to protection device input levels
data recorder
recording of protection device outputs
faults or device switching operation and precise calculation of evolution in current and voltage, s MORGAT, an electrical network simulator, developed and distributed by EDF. This software allows both fine analysis of network behaviour and control of the real time aspect of the Kirchhoff laboratory. Currents and voltages, calculated at different points of the simulated electrical network, are converted into analog signals for application to the protection device to be tested.
After the in-situ test, the boards are burnt in under combined thermal and electrical stresses. Burn-in eliminates teething faults in electronic equipment and reduces the length of the early period so that these faults appear in manufacture rather than during operation. Likewise, the fault statistics are used by the quality department so that rapid action can be taken for any drift in manufacturing quality.
Final testing ensures that the assembled boards dialogue correctly with each other and that the configuration achieved really does correspond with the customers order. All the expected functions are thus activated by stimuli applied to the interfaces of the device produced. In addition to the systematic checks made on production, qualification tests are repeated periodically on a representative sample of the range.
quality control
Protection devices undergo numerous quality control tests during production and at the end of production. For example electronic boards undergo initial inspection on the dielectric test bay performing insulation tests. They are then directed to the in-situ tester (see fig. 17). The in-situ test checks proper operation and implantation of each electronic board component. It Photographie indicates mainly manufacturing defects and certain component defects. It provides an implicit dignostics and ensures prompt repair of the board. The results are then used by the quality department and allow rapid detection of any drift in component or board manufacturing quality.
photo
Photographie
To ensure significant experience feedback, a very large installed equipment base must be in operation when reliability is excellent. Operating failure data can then be analysed. Analysis of experience feedback is vital to: s assess operational reliability of equipment; s validate the dependability studies conducted during design; s cumulate technical experience to progress; s possess a dialogue basis between the manufacturer and the operator. Experience feedback relies on reliable and orderly gathering of information relating to incidents occurring at
customers premises. Operational reliability (calculated on experience feedback) is only relevant if failure can be detected, is detected and recorded. Failure data, resulting from an installed equipment base which has no self-monitoring functions and in frequent periodical maintenance, may not be representative of real reliability. Operational reliability data on an installed base of digital protection devices are relevant due to the self-monitoring function. It has been observed that operational reliability is at least greater by a factor 10 than estimated reliability (calculated
from the data book MIL-HDBK-217E). This difference was probably the result of deliberately pessimistic and sometimes anachronistic reliability data books (electronic component technologies and quality evolve at a great pace). Recent updates to reliability data books have considerably reduced the difference between the operational and estimated reliability results. Today, the MTBF corresponding to untimely tripping or failure to trip of the protection device is several hundred years.
5. conclusion
Medium and High Voltage network protection devices perform a vital dependability function. They have to guarantee protection of persons and equipment, while ensuring availability of electrical power. Their malfunctions can inflict severe financial losses on operators. It is thus of prime importance that they meet high reliability, safety, availability and maintenability standards. Consequently, protection devices must meet certain technical and industrial
characteristics, of which the most significant are: s proper protection of MV and HV equipment and networks, by algorithms adapted to the various protection functions; s simplicity of implemenation, operation and maintenance; s reliability in severe environments, as well as: s ability to perform self-monitoring, s possession of a reliable fall-back position.
The work of reliability and quality engineers, at the design stage, ensures that the digital protection devices out on the market today meet all these requirements. Today, taking advantage from development of digital communications (bus) and supervision, the functions of protection devices extend to the control and monitoring domain for optimised management of electrical power distribution.
6. appendix
Mean times characterising dependability (see fig. 18): The MTTF (Mean Time To first Failure) is the mean time a device operates properly before failure. The MTTR (Mean Time To Repair) is the mean repair time. The MTBF (Mean Time Between Failure) is the mean time between two failures (for a repairable system). The MDT (Mean Down Time) is the mean failure time including detection of failure, intervention time, repair time and resumption of operation time. The MUP (Mean Up Time) is the mean time a device operates properly after repair. The MTBF term is wrongly translated as the mean proper operation time. This definition actually belongs to the MTTF! The confusion stems from the fact that the MTTR (of the order of a few hours) is often tiny compared with the MTTF (of the order of several thousand hours).
The likelihoods Reliability, R(t) is the likelihood that the system will not fail over a time t. Maintenability is the likelihood that the system will be repaired in a time t. Availability is the likelihood that the system will operate at a time t. Safety is the likelihood that a disastrous event will be avoided. A quantity which is the failure rate (t) is normally used for working purposes. This is the likelihood to break down in the next instant, bearing in mind that the system has not failed. For electronic components, the failure rate follows an evolution in time known as the bathtub curve. During the useful life period, the component does not age and its failure rate is constant in time. The following fundamental relationships are then obtained: Reliability R(t) = e t and MTTF = 1 / .
Example: If a device has a MTTF of a 100 years, its failure rate = 1 / MTTF is 10-2/ year. The likelihood of failure each year is thus 1 %. This also means that out of 100 devices in operation, on average 1 device will break down each year! A MTTF (or MTBF) of a 100 years on no account means that the system will not fail for 100 years. The MTTF cannot therefore be compared to a guarantee period or to a lifetime...
MTTF MDT
MTBF MUT
fig. 18: diagram of mean times, established for a system requiring no interruption in operation for preventive maintenance.
7. bibliography
Publications s Reliability design approach for protection and control equipment for MV distribution networks. Second International Conference on the Reliability of Transmission and Distribution Equipment, M. LEMAIRE, J.C. TOBIAS, march 1995. s Electrical disturbance tests for measuring relays and protection equipment. Part 1: 1 MHz burst disturbance tests. Eyrolles EDF. A. VILLEMEUR, 1988. s Fiabilit des systmes Eyrolles EDF A. PAGES, M. GONDRAN, 1980. s Autotest d'une mmoire programme : deux solutions Electronique n 4, janvier 1991. s Military Handbook 217 -FDepartment Of Defense, USA. s Recueils de donnes de fiabilit des composants lectroniques, RDF 93 CNET.
Standards s IEC 255 Electrical relays s part 22: Electrical disturbance tests for measuring relayus and protection equipment. - section 1: 1 MHz burst disturbance tests, - section 2: Electrostatic discharge tests, - section 3: Radiated electromagnetic field disturbance tests, - section 4: Fast transient disturbance test. s VEI 191.
Merlin Gerin's Cahiers Techniques s Introduction to dependability design Cahier Technique n 144 P. BONNEFOI 1991. s EMC: electromagnetic compatibility Cahier Technique n 149 F. VAILLANT 1991. s MV public distribution networks worldwide Cahier Technique n 155 C. PURE 1991. s Design of industrial networks in HV Cahier Technique n 169 G. THOMASSET. s Protection devices of industrial and tertiary HVA networks Cahier Technique n 174 A. SASTRE.