Data Centre Design and Operational Best Practices ABB
Best Practices
Ed Ansett
Managing Partner
i3 Solutions Group
ABB
5th July 2013
Causes of Failure in Data Centres
Hot summer day, utility power outage, data centre at full load (7.2MW)
Four 2.5 MW generators installed (N+1 configured)
One generator fails to start (now N configured)
Running on 3 generators
30 minutes later another generator fails (now N-1 configured)
Now 5MW capacity supporting a 7.2MW load
Remaining generators overload within 60 seconds
Cooling plant has no power
IT equipment begins shutting down (over-temperature)
Data centre runs on UPS for another 30 mins (2N, 15 mins each side)
Total data centre failure 30 minutes later
Utility restored after 6 hours
Data centre fully restored 8 hours later
Numerous senior managers fired
Worldwide enquiry launched
Litigation and financial penalties
Highly publicized, therefore reputational damage
Why did they have to lose the whole data centre? Why couldn't the data centre load be reduced from 7.2MW to 5MW? There was enough time.
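The capacity arithmetic behind this timeline can be checked with a short sketch. The load and generator ratings are the figures quoted above; the function names are illustrative and not taken from any real tool.

```python
import math

def units_required(load_mw, unit_mw):
    """Smallest number of generators (N) able to carry the load."""
    return math.ceil(load_mw / unit_mw)

def redundancy_label(running_units, load_mw, unit_mw):
    """Express the running units relative to N, e.g. 'N+1', 'N', 'N-1'."""
    spare = running_units - units_required(load_mw, unit_mw)
    return "N" if spare == 0 else f"N{spare:+d}"

def load_to_shed(running_units, load_mw, unit_mw):
    """MW that must be shed so the running generators are not overloaded."""
    return max(0.0, load_mw - running_units * unit_mw)

LOAD_MW, UNIT_MW = 7.2, 2.5  # figures from the incident described above

for running in (4, 3, 2):
    print(
        f"{running} generators: {redundancy_label(running, LOAD_MW, UNIT_MW)}, "
        f"capacity {running * UNIT_MW:.1f}MW, "
        f"shed {load_to_shed(running, LOAD_MW, UNIT_MW):.1f}MW"
    )
```

With two generators left, about 2.2MW of load would have had to come off to stay within the 5MW of remaining capacity.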
Data Centre Outage Root Cause
Data centre operations under pressure to keep costs to a minimum
FM team were technically skilled but did not receive adequate transitional training from
previous contractor or DC owner
On the day, Gen No. 1 failed to start due to a blown seal in Pneumatic System 1
Ops didn't know how to manually re-route the pneumatic system to start Gen No. 1
Data centre failure was inevitable
What was the root cause, who was responsible and why?
The Answers are in the Universal Learning Curve
All mechanical, electrical and electronic components are characterised by a Mean Time
Between Failures; their failure rate over time is typically shaped like a bathtub
[Figure: bathtub curve, failure rate against time]
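One common way to sketch a bathtub curve is to add three Weibull hazard rates: a falling one for early-life failures, a constant one for random failures and a rising one for wear-out. The shape and scale values below are illustrative assumptions, not data from this presentation or from any standard.

```python
def weibull_hazard(t, shape, scale):
    """Weibull hazard rate h(t) = (k / lam) * (t / lam) ** (k - 1), for t > 0."""
    return (shape / scale) * (t / scale) ** (shape - 1)

def bathtub_hazard(t_years):
    """Illustrative bathtub curve built from three failure regimes."""
    early_life = weibull_hazard(t_years, shape=0.5, scale=2.0)   # falling rate
    random_fail = 0.05                                           # constant rate
    wear_out = weibull_hazard(t_years, shape=5.0, scale=15.0)    # rising rate
    return early_life + random_fail + wear_out

for t in (0.1, 1, 5, 10, 15, 20):
    print(f"t = {t:>4} years  failure rate = {bathtub_hazard(t):.3f} per year")
```

The printed rates start high, fall to a flat middle section and climb again late in life, which is the bathtub shape.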
All operations need peer review, rehearsals and full switching plans
Also need approval of the business or customers to carry out the work
Time frame
Risk review
Roll-back plans, i.e. when to roll back the programme to stay within operational
windows
Switching plans
Procedures generally require prior planning and approval before they are carried out,
and may cover multiple devices and items of equipment
Permit to work
Also included would be the electrical Single Line Diagram and switching tags
Emergency Operating Procedures (EOPs)
Absolute minimum procedure required to safely and reliably operate the equipment in a
fast response incident
Example EOP to open and close an ACB (Air Circuit Breaker) rated at 4000A
Data Centre Infrastructure Management
Asset, configuration and change management
Building Life, Safety, Security
Cooling control, BMS, alarms etc.
Power, energy measuring, modeling
All this before we've decided what to build and where to build it!
Cost Optimization Requires Reliability and Availability Analysis
[Figure: Cost ($)]
Reliability Modelling
Using Probabilistic Risk Assessment
Why do Reliability Modelling?
Identifies vulnerabilities
Alignment of business mission and performance
Performance benchmarking
Support business case expenditure
[Single line diagram: Utility, ACB, MSB, two MCBs]
What is involved?
Step 1: Develop Resiliency Metrics and Quantify Reliability Expectations
The reliability data sources used are IEEE Standard 493-1997 (Gold Book) and Reliability
Analysis Center NPRD-95
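A minimal sketch of the underlying arithmetic follows. It combines component availabilities in series for a single power path and in parallel for two redundant paths. The MTBF and MTTR figures are placeholders chosen for illustration only. They are not taken from IEEE 493 or NPRD-95, and a full probabilistic risk assessment covers far more than this.

```python
from math import prod

HOURS_PER_YEAR = 8760

def availability(mtbf_hours, mttr_hours):
    """Steady-state availability from mean time between failures and repair time."""
    return mtbf_hours / (mtbf_hours + mttr_hours)

def series(*avail):
    """All components must work: availabilities multiply."""
    return prod(avail)

def parallel(*avail):
    """Any one path is enough: unavailabilities multiply."""
    return 1.0 - prod(1.0 - a for a in avail)

# Placeholder figures (hypothetical, for illustration only).
utility = availability(mtbf_hours=4_000, mttr_hours=2)
acb = availability(mtbf_hours=500_000, mttr_hours=8)
msb = availability(mtbf_hours=1_000_000, mttr_hours=24)
mcb = availability(mtbf_hours=800_000, mttr_hours=4)

single_path = series(utility, acb, msb, mcb)
dual_path = parallel(single_path, single_path)  # assumes independent failures

for name, a in (("single path", single_path), ("dual path", dual_path)):
    print(f"{name}: availability {a:.7f}, "
          f"expected downtime {(1 - a) * HOURS_PER_YEAR:.3f} h/year")
```

The parallel result assumes the two paths fail independently; common-cause failures, such as the operator error in the outage described earlier, break that assumption.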
Design with the operator in mind; avoid multiple bypasses and other complex operations
The most unreliable components in the data centre are always the batteries
Followed by anything that has mechanical moving parts (chillers, pumps, relays etc.)
Anything more than 5 minutes of battery autonomy is a waste
Hot & Cold aisles are wasteful without containment or precision airflow
Data centres comprise business service lines of varying priority; design accordingly
Mirrored tier data centres are orders of magnitude more reliable than a single data centre
When looking at PUE, consider partial load; it's more important than full load
Tier 4 can be justified where the frequency of generator ops is high due to a weak utility
The tier system is a guide, nothing more
The key data centre design standard is now ANSI/BICSI 002-2011
Cooling economizers do not provide good ROI in the tropics (Sensible PUE target 1.4)
There are new cooling technologies that will radically improve tropical PUE to around 1.15
Dynamic workload allocation will force data centres to be more responsive to applications.
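The partial-load point can be illustrated with a simple model. PUE is total facility power divided by IT power, and the sketch assumes the overhead has a fixed part plus a part proportional to IT load. The kW figures are hypothetical, chosen so that full load gives the 1.4 target mentioned above.

```python
def pue(it_kw, fixed_overhead_kw, proportional_overhead):
    """PUE = total facility power / IT power.

    Overhead is modelled (as an assumption) as a fixed part plus a part
    that scales with IT load.
    """
    overhead_kw = fixed_overhead_kw + proportional_overhead * it_kw
    return (it_kw + overhead_kw) / it_kw

FULL_IT_KW = 1000           # hypothetical design IT load
FIXED_KW = 150              # hypothetical fixed overhead (lighting, standing losses)
PROPORTIONAL = 0.25         # hypothetical overhead per kW of IT load

for fraction in (1.0, 0.75, 0.5, 0.3):
    it_kw = FULL_IT_KW * fraction
    print(f"{fraction:.0%} IT load: PUE = {pue(it_kw, FIXED_KW, PROPORTIONAL):.2f}")
```

Under these assumptions the fixed overhead is spread over fewer IT kilowatts as load falls, so PUE rises from 1.40 at full load to 1.75 at 30% load.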
Key Design Rules
Understand the IT strategy, engage with your IT group, understand their constraints
The design is flexible; experience shows the future isn't always predictable
The business SLA requirements are met, all DC stakeholders need to be aligned
IT Stack Alignment
[Figure: A Side and B Side, priority groups P1 (2), P2 (6), P3 (4), P4 (4)]
Data Centre: Prioritized Cooling
[Figure: A Side and B Side, priority groups P1 (2), P2 (6), P3 (4), P4 (4)]
Data Centre: Prioritized Power
[Figure: A Side and B Side, priority groups P1 (2), P2 (6), P3 (4), P4 (4)]
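Prioritized power suggests one possible answer to the earlier question of why load could not be reduced from 7.2MW to 5MW. The sketch below sheds whole priority groups, lowest priority first, until the remaining load fits the available capacity. The MW split across P1 to P4 is a hypothetical assumption; only the 7.2MW total and the 5MW capacity come from the incident.

```python
def shed_by_priority(loads_mw, capacity_mw):
    """Shed whole priority groups, least critical first, until the load fits.

    loads_mw: dict of priority label -> MW, ordered most to least critical.
    Returns (kept, shed): the groups still powered and the labels shed, in order.
    """
    kept = dict(loads_mw)
    shed = []
    for name in reversed(list(loads_mw)):
        if sum(kept.values()) <= capacity_mw:
            break
        shed.append(name)
        del kept[name]
    return kept, shed

# Hypothetical split of the 7.2MW load across the four priority groups.
loads = {"P1": 1.0, "P2": 2.6, "P3": 1.8, "P4": 1.8}

kept, shed = shed_by_priority(loads, capacity_mw=5.0)
print("shed:", shed)
print("kept:", list(kept), f"= {sum(kept.values()):.1f}MW")
```

Shedding whole groups is a simplification; a real scheme could shed part of a group, and would need power and cooling distribution aligned to the same priorities.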
Thank You