reliability-understanding-the-critical-factor-behind-disk-storage

This shows the reliability statistics of hard disk drives with opinions

Uploaded by

cingo singet
Copyright
© All Rights Reserved

Reliability: Understanding the Critical Factor Behind Disk Storage

Page 1: Needing a Diversity of Drives

Key Points:

• Global demand for digital storage continues to explode, but there will not be sufficient
storage available to meet demand. This creates even greater need for deployed storage to
perform as expected.
• As applications and markets diversify, a monolithic approach to storage does not
optimally meet market needs. Tailoring drives to fit applications and environments
increases reliability and maximizes ROI.
• Deployment of drives into their appropriate environments combined with stringent
product design results in low annual failure rates (AFR) and high customer satisfaction.
Failure to use drives in their appropriate applications and environments sets them up for
performance issues and possible premature failure.
• To hit AFR objectives, Seagate’s Product Development Process involves at least a year
and a half of testing and design refinement in order to yield ultra-reliable hard drives.
• Reliability fuels drive segmentation. By optimizing drives for specific applications and
environments, Seagate helps users to pair the right drives with the right data where that
data lives, enabling users to better address their present and future storage needs.

IDC’s “The Digital Universe of Opportunities” paper couldn’t be clearer: In Moore’s Law
fashion, the world’s amount of accumulated digital data is doubling every two years. By 2020,
our virtual space will reach 44 zettabytes (44 trillion gigabytes), nearly one byte for every star
in the physical universe.

While IDC predicts that the digital universe will explode to 44ZB by 2020, there is some good news: Only 13ZB of
that amount will need to be stored. The bad news: The world will have just 6.5ZB of available capacity by then.
Image source: http://bit.ly/1BkL9WP

This incredible report offers a wealth of insight into today’s storage landscape, but three facts
pop out with particular urgency:
1. In 2013, two-thirds of the digital universe was created by end-users, but it fell to enterprises to
store and protect 85% of it.

2. Four out of every ten bits in the digital universe required protection, but fewer than half of
these were actually being protected.

3. Both exciting (for storage providers) and alarming (for everyone), “in 2013, the [world’s]
available storage capacity could hold just 33% of the digital universe. By 2020, it will be able to
store less than 15%.” The world is desperate for more digital storage, but that also opens the risk
of people making short-term storage decisions they will later regret.
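
The IDC figures quoted above can be cross-checked with a few lines of arithmetic. The following is a minimal sketch that uses only the numbers cited in this article (44 ZB created, 13 ZB needing storage, 6.5 ZB of capacity); the variable and function names are our own and purely illustrative.

```python
# IDC projections for 2020 as quoted in the text, in zettabytes (ZB).
DIGITAL_UNIVERSE_ZB = 44.0  # all data created
NEEDS_STORAGE_ZB = 13.0     # portion that actually has to be stored
AVAILABLE_ZB = 6.5          # projected available capacity


def share(part: float, whole: float) -> float:
    """Return part / whole expressed as a percentage."""
    return 100.0 * part / whole


# Capacity measured against everything that gets created.
universe_coverage = share(AVAILABLE_ZB, DIGITAL_UNIVERSE_ZB)
# Capacity measured against only the data that must be kept.
stored_coverage = share(AVAILABLE_ZB, NEEDS_STORAGE_ZB)
# Absolute gap between what must be kept and what can be kept.
shortfall_zb = NEEDS_STORAGE_ZB - AVAILABLE_ZB

print(f"Capacity vs. digital universe: {universe_coverage:.1f}%")
print(f"Capacity vs. data needing storage: {stored_coverage:.0f}%")
print(f"Shortfall: {shortfall_zb} ZB")
```

The first ratio works out to just under 15%, which matches the quoted projection, while the second shows that only half of the data that needs a home would have one.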

From streaming cell phone video to machine-to-machine Internet of Things (IoT) data streams,
the need for more storage across every market segment is overwhelming. Consumers
increasingly lean toward hybrid strategies, employing both cloud-based storage services and
more quickly restored direct- and network-attached storage solutions. Businesses have even
more diverse needs. Some data must be secured. An increasing amount must be made available
to fast, real-time analysis. And with every passing month, ever more data must find its way into
affordable long-term storage that can still make any file available to users within seconds.

Goodbye, optical; hello, hard drive. This IDC graph of the number of installed bytes by media type clearly shows
that the world’s craving for high-capacity digital storage dominates storage adoption. Image source:
http://bit.ly/1BkL9WP

Against this backdrop of broad applications and skyrocketing capacity needs, we have a
complete spectrum of storage environments. Consider all of the places in which people deploy
hard drives: laptops, surveillance appliances, kiosks, hot data centers, Arctic research outposts,
airplanes, cars, small business storage closets, and large-scale university research clusters. Data
lives everywhere. In turn, the storage market must meet data where it lives, not only where it’s
convenient. The digital universe is far, far bigger than PCs and server racks.

Additionally, within each environment, different systems will face different workloads. Will
storage be subject to a heavy, random workload or a light, sequential one? Will that use occur
only sporadically during the workweek or constantly around the clock, year in and year out?
This diversity leads to two inescapable demands. First, like a fine suit to a body, storage must be
tailored and tweaked if it’s going to deliver the desired results. One design does not fit all. In
fact, the more the storage market diversifies, the more incumbent it is upon storage providers and
users alike to adapt their solutions to these variables if they’re going to realize satisfactory
results.

Second, storage must be reliable. This is the most important reality of all storage, because if you
don’t have reliability, everything else, from uptime to data protection to total, long-term solution
cost, falls apart. This raises the obvious question: Is your storage truly reliable? Because if you’re
only trusting the numbers on the spec sheet, you don’t know the whole story.

Page 2: The Criticality of Reliability

What exactly does reliability mean? Every engineer we asked had a specific definition in line
with their particular job function, but taken together, our favorite came from Andrei Khurshudov,
Ph.D., chief technologist at Seagate’s quality organization: “Reliability is quality over time.” In
other words, reliability means that a storage drive will operate at expected performance levels,
within its appropriate environmental and workload contexts, throughout its prescribed service
life.

If that sounds like an easy task, rest assured — it’s not. Consider one oft-cited analogy
describing modern hard drive physics: Imagine that the disk platter is a smooth-as-glass lake and
that the drive’s read/write heads are mounted on a Boeing 747 flying over that lake. If the plane
doesn’t skim just two or three inches over the water’s surface, give or take perhaps an inch,
regardless of air turbulence, temperature changes, and stray surface ripples, you no longer have
reliable operation. Now realize that, in the case of business-class drives meant to operate around
the clock, our plane may have to maintain that level of precision (its “fly height”), without pause,
for years.

The “interface” between magnetic read and write elements and the underlying, fast-spinning, super-smooth media
is just a few nanometers. This is several times smaller than even the smallest viruses (18 nm), which in turn are orders
of magnitude smaller than even the height of a fingerprint.

Annualized numbers are key when discussing reliability. Because mean time between failures
(MTBF) numbers can vary depending on how manufacturers test, actual annualized failure rate
(AFR) is increasingly used as a measurement of drive reliability. In turn, AFR dovetails with
workload duty cycle and ultimately yields a drive’s workload rate limit (WRL), meaning the
number of terabytes a drive can be expected to read and write per year over its service
life. A Seagate desktop drive, for example, offers a WRL of 55 TB/year while an
enterprise-grade nearline drive boasts 550 TB/year. Given that enterprises are storing 85% of the
digital universe, it makes sense that they should reliably accommodate 10X more data traffic.
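
To make the WRL figures concrete, here is a minimal sketch of how measured traffic might be scaled to a yearly rate and compared against the limits quoted above. The helper names and the one-week sample measurement are hypothetical, and the sketch assumes the observation window is representative of the whole year.

```python
HOURS_PER_YEAR = 8760  # 24 hours x 365 days

# Workload rate limits quoted in the text, in terabytes per year.
WRL_TB_PER_YEAR = {"desktop": 55, "nearline": 550}


def annualized_tb(tb_transferred: float, hours_observed: float) -> float:
    """Scale traffic seen during an observation window to a full year."""
    if hours_observed <= 0:
        raise ValueError("hours_observed must be positive")
    return tb_transferred * HOURS_PER_YEAR / hours_observed


def within_wrl(drive_class: str, tb_transferred: float, hours_observed: float) -> bool:
    """True if the annualized workload stays at or under the class limit."""
    return annualized_tb(tb_transferred, hours_observed) <= WRL_TB_PER_YEAR[drive_class]


# Hypothetical measurement: 2 TB read and written over one week (168 hours).
yearly = annualized_tb(2.0, 168)
print(f"Annualized workload: {yearly:.1f} TB/year")
for drive_class in WRL_TB_PER_YEAR:
    verdict = "within" if within_wrl(drive_class, 2.0, 168) else "over"
    print(f"{drive_class}: {verdict} its WRL")
```

In this made-up case the workload annualizes to roughly 104 TB/year, which exceeds the desktop limit but sits comfortably inside the nearline one.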

Across its organization, Seagate devotes hundreds of millions of dollars annually to make sure
that its hard drives are as reliable as man and machine can possibly craft them. During the last 20
years, the company has designed, built, and shipped over 2.4 billion hard drives. This figure is
17% higher than Seagate’s closest competitor, Western Digital. Despite such massive output,
Seagate continues to drive down its warranty claims rate. In fact, according to SEC-reported
numbers compiled by Warranty Week, Seagate’s warranty claims now stand at a nearly decade-
long low of just 1.2% as a percent of sales. When this number is low, you know a company is
doing a lot of things right.

Seagate’s quality control methods have now pushed warranty claims to decade-long lows. In fact, you might be
surprised to find that Seagate’s return rates are well below other leading brands known for their exemplary quality.

The methods that Seagate employs to push the boundaries of storage science while continually
improving reliability are complex and fascinating. If you want to know from the bottom up why
the Seagate drives you pick for your storage are reliable, start by examining what happens at the
four research and development facilities the company runs 24 x 7.

Page 3: The Process Behind Reliability


This top-down view of Seagate’s ST-506 may lack many of the features found in modern hard drives, but the core
components are all strikingly similar, even if capacities have increased by over 1 million times.

Seagate shipped its first hard drive in 1980: the 5.25” ST-506, kicking off at a whopping 5MB
capacity. It was the first disk drive of the PC era. Over the 35 years since then, Seagate has
devoted considerable resources to evolving and refining its R&D. Today, the company relies on
a methodology it calls Product Development Process (PDP). In essence, PDP is a nine-stage
development cycle, managed over a 36-month life cycle, calculated to hone a new drive design
practically from the blueprint stage all the way through hand-off to factories for mass production.
The PDP methodology serves to help Seagate design new drives, but its true utility lies in
ensuring that drive designs are battle-hardened and reliable enough to withstand the rigors of real
world use.

Olympic-class athletes devote between 10,000 and 20,000 hours (six hours daily, six days each
week, over an average of 11 years) in order to reach the Games. That’s what it takes to be the
best in the world. (In contrast, Gerard Butler only trained for six hours each day for four months
to develop his 300 physique. There may be a statement about form versus function buried in this
comparison.) Unfortunately, given the pace of computing and digital storage, companies like
Seagate don’t have 11 years to refine new designs, and they definitely don’t have the 1.2 million
hours (almost 137 years) noted in nearline drive MTBF specifications to wait around for failures.
Instead, Seagate engineers have to leverage accelerated testing tactics throughout the PDP.

To understand PDP phases, imagine training to be an Olympic weightlifter. The London 2012
gold medalist in the 85 kg men’s category lifted 211 kg (465 pounds). That’s the target to beat if
you plan to win the event. If you start training today with your current body and strength, you
likely won’t lift anywhere near 211 kg. You will fail to lift that amount. The question is why
you’re failing. Your coach will break your training down into stages. The first stage may only
involve lifting 80 kg. If you fail here, the coach will examine your form, technique, diet, and
many other factors, until you’re able to meet all of the milestones of that first phase with only the
most minimal amount of failure (because nothing is ever perfect). With that done, you move into
the second phase at 100 kg and repeat the process for all of the months needed to reach Olympic-
class performance.

Seagate’s PDP applies a similar approach. At the first phase, engineers might only produce a few
hundred units of a new drive design and subject these to a full spectrum of stress tests. These
tests run the gamut from non-stop operation to operation under extreme temperature, humidity,
voltage fluctuation, vibration levels, and drop shock. Engineers want the drives to fail in order to
uncover any and every weakness.

Deep in the heart of Seagate’s Longmont, Colorado R&D facility, rooms such as this one host rank after rank of test
chambers designed to stress test emerging drive designs by the hundreds and thousands.

“Early on, in small quantities, we’re looking for big failure modes,” explains Josh Tinker, senior
program manager for Seagate’s enterprise product line management group. “And once we figure
out how to solve those, then there are other more obscure aspects to failure that don’t necessarily
show up when you build a small quantity of drives. So at the next stage, you build and test more.
And if we put the drives in regular tests, we may not see issues, so we have to put them in
accelerated, extreme tests and look for different ways the drive can fail.”

Page 4: Get It Hot, Roll It Out

To understand accelerated testing, imagine shoveling gravel. Let’s say that, on a cool, cloudy
day, you can shovel for six hours straight before keeling over from exhaustion. In the world of
amateur gravel shoveling, that’s an acceptable MTBF. But what if you want to know your failure
level and don’t have six hours to test? Then you run your test when it’s hot and sunny, and you
discover that failure sets in after only 90 minutes. Run this test enough times in enough ways and
you’ll eventually learn the relationships between testing under normal operating conditions and
stressed conditions. This allows you to accelerate testing and still derive meaningful, accurate
failure data. In this way, Seagate creates stressed conditions for its hard drives and can reliably
simulate full-term AFR and MTBF results in reasonable time periods.

Hard drives not only need to withstand extremes of temperature and humidity, but also air pressure. This
atmospheric test chamber helps to determine if drives are fit to survive the highs and lows of global operation.
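
The gravel-shoveling comparison can be reduced to a single ratio, often called an acceleration factor. The sketch below uses only the six-hour and 90-minute figures from the analogy; the function names are ours, and it assumes the ratio between stressed and normal conditions stays constant from run to run, which real test programs have to establish empirically.

```python
def acceleration_factor(normal_hours: float, stressed_hours: float) -> float:
    """Ratio of time-to-failure under normal versus stressed conditions."""
    if normal_hours <= 0 or stressed_hours <= 0:
        raise ValueError("durations must be positive")
    return normal_hours / stressed_hours


def projected_normal_hours(stressed_hours: float, factor: float) -> float:
    """Translate a stressed-test result back to normal conditions."""
    return stressed_hours * factor


# Six hours on a cool day versus 90 minutes (1.5 hours) in the heat.
af = acceleration_factor(6.0, 1.5)

# Hypothetical follow-up: a later hot-day run lasts two hours.
projected = projected_normal_hours(2.0, af)

print(f"Acceleration factor: {af}")
print(f"A 2-hour stressed run projects to {projected} hours under normal conditions")
```

Once the factor is known, every stressed run stands in for a run several times longer under ordinary conditions, which is what makes multi-year reliability targets testable in months.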

With each PDP phase, the drive counts climb, and the tests grow increasingly long and
strenuous. First phase drives may only have to exhibit a 100-hour MTBF, but the final product
specification — and thus drives at the production phase — might demand a 2 million-hour
MTBF. At each successive phase, the MTBF and AFR expectations become more stringent. In
one Seagate development report we obtained, the company details its AFR targets at various
PDP phases while testing with a simulated power-on hours (POH) workload of 8,760 hours per
year. From the earliest to the latest test stages, those AFR targets tightened by
nearly 400 percent.
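
MTBF and AFR are two views of the same failure rate, and the 8,760 power-on hours mentioned above link them. The following sketch converts between the two under the common textbook assumption of a constant failure rate (an exponential lifetime model). That model is our assumption for illustration, not a description of how Seagate derives its published figures.

```python
import math

POWER_ON_HOURS_PER_YEAR = 8760  # 24x7 operation, as in the text


def afr_from_mtbf(mtbf_hours: float, poh: float = POWER_ON_HOURS_PER_YEAR) -> float:
    """Probability of failing within one year, assuming a constant failure rate."""
    if mtbf_hours <= 0:
        raise ValueError("mtbf_hours must be positive")
    # 1 - exp(-poh / mtbf), written with expm1 for numerical accuracy.
    return -math.expm1(-poh / mtbf_hours)


def mtbf_from_afr(afr: float, poh: float = POWER_ON_HOURS_PER_YEAR) -> float:
    """Inverse of afr_from_mtbf under the same assumption."""
    if not 0.0 < afr < 1.0:
        raise ValueError("afr must be between 0 and 1")
    return -poh / math.log1p(-afr)


afr_nearline = afr_from_mtbf(1_200_000)    # the 1.2 million-hour spec cited earlier
afr_production = afr_from_mtbf(2_000_000)  # the 2 million-hour production target

print(f"1.2M-hour MTBF -> AFR of about {afr_nearline:.2%}")
print(f"2.0M-hour MTBF -> AFR of about {afr_production:.2%}")
```

Under this model, a 1.2 million-hour MTBF corresponds to an AFR of roughly 0.73%, and a 2 million-hour MTBF to roughly 0.44%, which shows why a large MTBF number describes a fleet of drives rather than the expected life of any single unit.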

According to Tinker, a product team typically takes a year or more to move from initial focus on
a new product to declaring the product ready to ship. Throughout this time, the product team
collaborates ever more closely with factory teams to resolve any issues that arise in transitioning to
mass production. There are groups that focus on drive integration, on ASICs, on motors, on
heads and media — on and on. All of these groups must cooperate on troubleshooting, tweaking,
redesigning, and everything else necessary to hit phase milestones and continue through the PDP
pipeline. And, of course, once the product team believes it has hit all of its metrics and has a
solid design ready to run, it then needs to go into the hands of major OEM and integrator
customers to confirm that it will perform as expected in final storage solutions.

Bare drives often arrive at OEMs such as LaCie (shown here) by the pallet load for solution integration and
validation.

“The whole idea is for us to run the most extensive reliability testing internally, before we even
give it to our customers,” says Chris Wilson, principal program manager at Seagate. “What
they’re running is more of a verification of what we’ve already done, only they’re running the
drives in their systems, which might reveal something we haven’t seen before. But we do our
absolute best to find any possible way the customer could break the drive and break it ourselves
first, so that we can resolve that before we let them go do their qualifications.”

Page 5: Optimizing For Where Data Lives

An economy car and a luxury sedan may share many of the same primary components, such as
a hybrid powertrain and chassis structure. But tack on a host of value-add amenities, from leather
interior to proximity cameras to app-toting in-dash computers, and the same underlying economy
car can become a luxury vehicle suited to a different set of user priorities and expectations. (Case
in point: Check out how the Lexus CT200h shares several underlying elements of the Toyota
Matrix, Corolla, and Prius lines.)

Similarly, when a new drive design finally rolls off the PDP pipeline, it’s not quite ready for a
model and serial number sticker. Seagate engineers now have a wealth of data revealing how that
design behaves in a variety of situations. It might, for example, demonstrate a markedly higher
tolerance to rotational vibration and longer MTBF at higher temperatures when the spindle is
loaded with four or fewer platters rather than five or six. In such a case, a four-platter version of
the design might become a 4TB drive targeted at small business NAS solutions while the 6TB
six-platter incarnation would still meet all expectations for a consumer desktop drive. Perhaps if
the motor is anchored to the top cover as well as the bottom, and rotational vibration (RV)
sensors are applied to the circuit board to help cope with conditions in larger NAS and
server/storage environments, then the same drive design might even qualify for top capacity
nearline applications. In this way, one drive design can ultimately serve a variety of markets
under different model families.

This isn’t to say that one drive design covers all storage categories any more than one car design
serves all vehicle markets. Seagate uses multiple platforms as starting points, and the more a
platform is made to fit this or that market segment, the more it will skew toward those respective
categories.

Sometimes, different drive models can be fraternal twins born from the same R&D womb. While not identical, these
drives do leverage similar platforms. When it comes to performance and reliability, though, they are vastly different.

Sidebar: Rattle and Hmmm

This chart shows the impact of rotational vibration (12.5 rad/sec²), such as might be found in an SMB NAS
enclosure, on three drive types. While an enterprise NAS drive suffers no slowing, and an SMB NAS drive only takes
a moderate performance hit, the desktop drive clearly suffers.

Motors and moving parts emit vibrations, and so hard drives, with their platters spinning at
several thousand RPM and the rapid motions of their voice coil motor, emit rotational vibration
(RV) forces. Like someone bumping a spinning record player and causing its needle to skip
across the groove, vibration can cause a hard drive’s read/write head to misalign with the
underlying track, resulting in errors or slower performance if a re-read has to be performed. One
drive alone puts out little vibration, which is why drives intended for desktop PCs don’t need
much RV tolerance. However, pack many drives into a small chassis, especially one without
vibration damping features, and those drive vibrations can coalesce and result in spikes that, in
extreme cases and over time, can literally rattle a drive to death. Drive-dense environments are
particularly common in data centers. Tack on vibration from chassis cooling fans and the heat
generated from all of the electronics in the enclosure, and the conditions in something like a
high-end NAS box or storage server can add up to a perfect storm for drives.

As the above chart shows, different drive types are designed to withstand varying levels of
ambient RV. A desktop drive’s performance begins to decline soon after being exposed to more
than light fan vibration, and its performance will plummet nearly to zero in hostile, enterprise-
grade environments. Conversely, business-class drives built and qualified through PDP for such
applications and situations continue to perform at very high levels. The question is not whether a
drive can operate in a given place but whether it should.

Seagate engineers have many ways of testing the impact of vibration on drives. The machines shown here assess the
inherent torsional vibration caused by a drive’s own spinning components.

In a way, reliability is the driving force behind all hard drive market segmentation. Market
research and demand inform Seagate that consumers want hard drives able to perform eight
hours per workday with a WRL of 55 TB/year while businesses running surveillance
applications need 24x7 storage writing 180 TB/year. The applications and environments demand
reliability at those respective levels.

One example of these demand differences can be seen in Seagate’s Desktop HDD and NAS
HDD families. They are physically and logically different drives. They share many of the same
features, including motor design and base plate design. However, the NAS HDD features
enhanced motor balancing, uses specific quality traits in heads and media, and incorporates
firmware optimized for NAS-specific workloads. All of this contributes to the NAS drive’s higher
MTBF and superior performance in that application and usage environment compared to the
Desktop HDD.

NAS applications need their own targeted storage solutions. Seagate’s NAS HDD provides higher performance
heads and disks than its desktop counterparts as well as weighted disk clamps, improved motor balancing, longer
duty cycle, and NAS-optimized firmware. The business-class drive also includes the assurance of Seagate’s Rescue
data recovery service.

Whereas hard drives in years past fell into two large buckets — personal and enterprise storage
— today, Seagate makes more granular distinctions in the mainstream market:

• Desktop
• Laptop
• Archive
• NAS HDD
• Video
• Surveillance
• Enterprise NAS/Cloud
• Enterprise Nearline
• Enterprise Mission Critical

All of these drive categories feature their own design optimizations and resulting reliability
specifications. They are specialized in order to meet the distinct needs of each market segment
and emerging application category. They are designed to meet data where it lives in the digital
universe.

Page 6: The Reliability Payoff

The tweaks and enhancements that segregate drive families carry benefits beyond reliability.
Keeping drive features and performance metrics in line with the needs of a target application and
environment helps to keep costs appropriate to that market. For instance, a desktop user has no
need for a 2 million-hour MTBF and added RV tolerance, so why pay for that extra engineering
and component expense? This remains important even when only looking at environment, as one
application, such as network storage, can encounter a complete spectrum of changing needs as it
scales from the home office to the data center.

Seagate offers several drive families customized to suit multiple application markets. This chart clearly shows many
of the core differences between drive families and the end-user benefits those differences provide.

Cost benefits also extend beyond up front drive expense. Business buyers pay special attention to
total cost of ownership (TCO), which can depend as much on reliability as it does on
management and secondary costs, such as energy consumption and other factors needed to
maintain the drive over three to five years of operation. This is where component quality and
firmware optimization truly come into play.
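
A rough sketch of how reliability and power draw feed into TCO follows. Every figure in it (prices, wattage, failure rates, the cost of a failure, and the electricity rate) is a made-up placeholder chosen only to show the shape of the calculation, and the `DriveProfile` and `tco` names are hypothetical. It also approximates expected failures as AFR multiplied by years, which is reasonable only for small failure rates.

```python
from dataclasses import dataclass


@dataclass
class DriveProfile:
    price_usd: float         # purchase price per drive (placeholder)
    avg_power_watts: float   # average draw while powered on (placeholder)
    afr: float               # annualized failure rate as a fraction (placeholder)
    failure_cost_usd: float  # replacement, labor, and downtime per failure (placeholder)


def tco(drive: DriveProfile, years: float,
        poh_per_year: float = 8760, usd_per_kwh: float = 0.12) -> float:
    """Purchase price plus energy cost plus expected failure cost."""
    energy_kwh = drive.avg_power_watts * poh_per_year * years / 1000.0
    energy_cost = energy_kwh * usd_per_kwh
    expected_failures = drive.afr * years  # linear approximation for small AFR
    return drive.price_usd + energy_cost + expected_failures * drive.failure_cost_usd


# Two hypothetical drives run 24x7 for five years.
cheaper = DriveProfile(price_usd=100, avg_power_watts=8, afr=0.03, failure_cost_usd=500)
tuned = DriveProfile(price_usd=130, avg_power_watts=6, afr=0.01, failure_cost_usd=500)

print(f"Cheaper drive, 5-year TCO: ${tco(cheaper, 5):.2f}")
print(f"Tuned drive, 5-year TCO:   ${tco(tuned, 5):.2f}")
```

With these placeholder inputs, the drive that costs more up front ends up cheaper over five years because its lower failure rate and power draw outweigh the price difference. Different inputs can easily reverse that result, which is the point of running the numbers per application.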

“When you look at an application like NAS, you have more large files and larger transfers,” says
Abhay Kataria, managing principal engineer at Seagate. “So we optimize our drive design
firmware to manage that usage more. From a power perspective, we have algorithms that will
learn command streams coming from the host and try to optimize the drive’s power usage
without impacting the host behavior, independent of what the host’s software can send. These
algorithms adapt and drive for lower power to keep the TCO for that end-user lower.”

When meeting with a round table of engineers like Kataria, we had to laugh when one of them
mentioned designing a hard drive that could run for over a century. Further discussion drove the
point home: Such a design is possible, even feasible, but no market would want to pay for it. The
object of the game is not to be perfect, only to be appropriate to the demands of the application
and environment.

Seagate is so confident that it has nailed this reliability and optimization process that it now
offers a two-year warranty on Desktop HDD models. NAS HDD and Surveillance drives enjoy
three years of coverage. Enterprise-class drives get five. Additionally, the company bundles its
in-house recovery services with NAS, Surveillance, and Enterprise NAS drive models. (Retailers
often offer Rescue for a modest optional charge when selling drives, even on desktop units.) The
Rescue plan covers the cost of data recovery services should a drive fail. Seagate will pay to rush
the drive into its forensic recovery center and perform all steps necessary to rescue as much data
as possible from the media, even returning data to the user on a replacement drive if needed.
Sometimes, recovery can require scores of man-hours and accrue many thousands of dollars in
service charges. The Rescue plan on Seagate business drives indicates that the company is so
confident of its products that it’s willing to include these rarely needed services alongside its
already competitive drive pricing.

A worker at Seagate Recovery Services (SRS) examines the gaps between platters for debris and other signs that a drive
might need to be dismantled. Despite obvious signs of a “head crash” (note the rings gouged into the top platter),
workers can still salvage much of a drive’s contents.

Reliability matters. Few appreciate the Herculean measures that a company like Seagate takes to
make its drives deliver consistent quality over time, and that’s OK. You don’t necessarily have to
understand the thousands of steps and countless hours that go into designing a drive. But you
should have confidence in the reliability of your storage and understand that reliability is not a
fixed object. It scales according to the demands of the storage market, and it depends on the user
deploying the drive in an environment-appropriate manner. When you do your part in the
reliability equation, then the resources that Seagate has poured into its offerings can deliver as
promised.

“To make a high quality product like a disk drive, it takes hard work from the entire company,”
says Seagate’s Andrei Khurshudov. “This cannot be achieved by an effort of just one
organization. You need to have a solid, robust design. You need to have great components that
go into the drive. You need to have an excellent, efficient, and consistent manufacturing process.
You need to have quality control that is top notch. You need to have a quality engineering
organization and systems that take the drive from birth to end of life. You need to have big data
analytics to deal with all the information accumulated through testing and field reliability
feedback. Seagate drive quality today is very good — the best it’s ever been — and we plan to
continue making it better and better in the future.”
